Kicking off the week of papers is The Locality-Aware Adaptive Cache Coherence Protocol by George Kurian and Srinivas Devadas from MIT and Omer Khan from the University of Connecticut.
(Apparently estimating time is hard – I expected an hour or two reading the paper and an hour writing; I ended up spending 45 minutes reading and an hour and a half writing (including rereading bits of the paper that I wanted to be sure I got right). And the result is much longer than I was expecting, though still less than 10 minutes to proofread. Anyway, enjoy!)
Caching is said to be one of two hard problems in computer science. This is particularly true at the hardware level; modern CPUs have on-chip caches to avoid expensive trips to main memory. Moreover, multicore CPUs typically have at least an L1 cache per-core, with an L2 (or L3) cache shared among all the cores. This paper explores how to scale that model to many-core systems, with their primary evaluation being 64 cores, and references to costs for a 1024 core system.
The primary issue is that of communication – if each core has a 48 KB private L1 cache and a portion of the 16 MB distributed L2 cache, any data access will pull an entire 64-byte cache line into the core’s local L1 from a (probably remote) L2 cache. Further, any writes by other cores to this cache line must cause additional traffic to tell the core to invalidate the local L1 line. Now the next access will have to pull the line from scratch, causing yet more traffic. And as the number of cores increases, the chance of collisions also does.
This traffic results in nontrivial power usage, plus can cause cache thrashing; if a portion of memory is heavily read and written by multiple cores, each core will be continuously invalidating and refetching the line, possibly uses cache space better spent on more stable data.
The key insight of this paper is to avoid pulling data from L2 into L1 until it’s been accessed “sufficiently often” and, in the mean time, access individual words out of the shared L2 cache. This cuts down on network traffic for rarely used or commonly updated cache lines, since the entire line isn’t fetched. It can also result in much better L1 utilization, since it is reserved for “commonly accessed” pages which won’t be evicted by simple accesses.
The problem is in defining “sufficiently often” and “commonly accessed”. A simple proposed solution is to track, for each L2 cache line, the last time each core referenced that line, as well as a counter of utilization. If a core goes too long without referencing a given line (defined in terms of entries stored in the core’s L1 cache), the count is reset. Otherwise, the count is incremented on each access, and, if it goes beyond the Private Caching Threshold (PCT), is promoted to the core’s private cache.
The line may later be evicted due to capacity if a new line replaces it, but we guarantee that any line replacing it is also well-utilized. It may also be evicted by an invalidation if another core writes to the cache line; in this case, it may be immediately re-promoted if its local utilization (tracked in the L1 cache) is still above the PCT; if not, it is demoted to the L2 cache.
This solution, though, results in tremendous storage overhead; each 64-byte cache line now keeps an 8-byte timestamp and some counts in each L1 and for each of the 64 cores in the L2; this is at least an 8x overhead. This is the classic caching algorithm tradeoff – you spend some amount of bits/space/time/energy, and try to save more than you spend. Luckily, we can do better.
The main source of overhead is the timestamp of last access. We could try to make it shorter, but it’s better to just get rid of it entirely. Previously, the PCT was used to track both promotions and demotions for the private cache; we now add another threshold, the Remote Access Threshold (RAT). The RAT is now used for promotions, while the PCT is used to remain in the cache on invalidations. Unlike the PCT, the RAT is tunable, based on an approximation of cache pressure in the L1. Initially, the RAT and PCT are equal; on an eviction where a line is demoted, the RAT is increased (up to a point). This indicates cache pressure, so decreases promotions to avoid thrashing. On an eviction where the line is still in the L1, the RAT is reset back to PCT; this is a weak signal that cache pressure isn’t to high, so we reset to allow new lines to come in more easily.
This kills off the timestamps, but according to the paper, still has an overhead of 60% for 64 cores and 10x for 1024.
The next step is to only store this data in the L2 cache for a few cores instead of all of them. When a new core asks for a cache line, add it to the tracked set if there’s an empty slot, and otherwise just take a majority vote of the cores that are being tracked. That is, assume that the cores are accessing data more or less uniformly, so if most of the cores you are tracking would want this cache line to be private, this one probably does too.
With this final optimization, the overhead is now claimed at 5.7%. There’s some other overhead from tricks with the distributed L2 cache; I’m not sure precisely how that factors in. But in any case, a 5.7% overhead seems reasonable, especially given a claimed a 25% reduction in power usage and 15% reduction in run time.
And speaking of results, they spend a good amount of time in a simulator running various benchmarks with various parameters – they can tune the PCT (access required to stay in the L1 cache and minimum of the RAT), the RAT maximum (the hardest it can be to get into the L1 cache), and the RAT increments (how quickly it changes).
The results are somewhat mixed (and complicated) – power usage is almost always a win, and never a loss, with most benchmarks showing solid drops down to PCT 2-4 or so, and limited gains after that, while runtime is generally better or the same up to PCT 2-3, with several benchmarks showing worse performance past PCT 4-5. The explanation is that the cache is better utilized, with misses being converted into word-misses rather than whole cacheline misses, resulting in major power savings. However, those same word-misses can hurt performance since more round trips to L2 are required.
So the conclusion is that a PCT of 4 gives nearly all the power savings and runtime savings (where applicable), with minimal runtime penalty for the benchmarks that do poorly.
Next up is the RAT – just using one level doesn’t hurt runtime too much versus timestamping, but energy usage shoots up by 9%. However, even two or four levels with a maximum of 16 is competitive with timestamping in both runtime and energy; for simplicity, they choose two levels.
And finally, how many cores should be tracked on each cache line? Here, one is generally not enough; the results can be significantly worse if that core isn’t representative. However, three suffice to keep in line with tracking every core, never doing less than 3% worse, and often doing better. This actually makes sense – if each core is tracked individually, it is initially a complete unknown, and will take some time to converge. However, if the cores are fairly homogeneous, the fourth core will immediately start behaving nearly optimally, since the cache has already “learned” from the first three.
So the final result is that, with parameters tuned for the given benchmarks, a 5.7% increase in die are can result in a 15% run-time reduction and 25% energy reduction. Given the small values of the parameters, I’m inclined to think that they are generic, and not overtuned for the benchmarks, arbitrary though they may seem.
It’s also worth pointing out the evaluation is being performed at a 11nm process node, which is still a few years out. This is somewhat justifiable as real-world implementation of this technique is also a few years out. However, as the authors call out, wires aren’t scaling as well as transistors, so the energy cost of moving bits around is a higher fraction of total cost than on current processors.
Overall, though, this is a fun paper that takes a good, foundationally motivated basic idea to improve cache performance, finds a number of approximations that work almost as well (or better!) and make it feasible to implement in hardware. I don’t generally focus on hardware, so don’t know if they’re missing other alternatives that can do as well or better, but I’m impressed with the result.