Multidimensional Approximate Agreement in Byzantine Asynchronous Systems

Available from Brown University; authors are Hammurabi Mendes and Maurice Herlihy from Brown University.

This is the only pure theory paper of the week; as a result, it has more that I can’t see a good way of explaining intuitively, so there will be some unfortunate gaps in this post.  If you want more detail, read the paper; I don’t think I can add much to those parts.  But I can still summarize the rest of it.

This paper is an extension of epsilon-approximate agreement in Byzantine systems to multiple dimensions.  A Byzantine system is a distributed system where Byzantine failure is allowed; Byzantine failure is simply arbitrary failure, where a defective (or malicious) component can send arbitrary messages.  This differs from the simpler crash-failure model, where failed processes are assumed to simply stop sending messages.  Byzantine failures are generally much harder to handle, since distinguishing a misbehaving process can be difficult (or impossible).

The general agreement problem is for a set of processes, each with some input, to agree on an output that is one of the inputs.  In the strict Byzantine failure model, this turns out to be impossible; the Byzantine processes can be arbitrarily confusing.  Even with some relaxations, it’s generally impossible to do anything if at least 1/3 of the processes fail.  To see this, consider the classic Byzantine Generals problem: three generals need to decide whether to attack tomorrow.  Each can send messages to the other two, and one of the three is a traitor.  Neither loyal general can tell which of the other two is the traitor, and the traitor can send conflicting messages to the other two.  Intuitively, this is hard to solve; it turns out to be provably impossible.

So we relax the conditions somewhat – instead of having to agree on a specific answer that’s one of the inputs, we merely all get close to the same answer (specifically, within epsilon), and that answer must be in the set “contained” by all the inputs (more precisely, the convex hull).  This problem was already solved for selecting a single output from the real numbers; this paper extends that to selecting an answer from a higher-dimensional real space.

Precisely, it presents an algorithm where n processes (at most t of which are faulty) select a value from an m-dimensional space, assuming that n > t(m+2).  (So if m=1, we get the standard 1/3 bound; for higher dimensions, the constraint gets gradually stronger).  Further, a simple example shows that this is the upper bound on failures; if any more processes can fail, then the problem is unsolvable.

Some motivation

The authors give two examples where their notion of agreement makes sense: robot convergence and distributed voting.

– Suppose a collection of robots are on unforgiving terrain, and need to physically meet up.  They’re distributed around a space which is known to be safe, but need to settle on a meeting spot without heading outside the known space.  However, some of them may be malicious, and try to lure other robots outside this known space, where they may crash or fall into pits.  This algorithm tells them how to choose a meeting spot within the area they currently surround.

– A voting system lets voters pick a spot in a multi-dimensional space indicating their preferences.  They all need to agree on a final consensus, but it should be one that is in some sense an “average” of their votes.  This system isn’t particularly optimal, but will at least ensure that the consensus isn’t completely outside the range of opinions expressed.

Giants whose shoulders are stood on

This paper builds on two key preexisting ideas: reliable broadcast and the witness technique.

Reliable broadcast ensures that all processes see the same set of messages; it prevents Byzantine processes from sending different messages to different processes and mucking up the system.  It tags messages with the sender, and sends out three rounds of messages: the initial broadcast from a process stating the value it wants to send; a round of echos from each process indicating that it received the initial broadcast; and a round of readys from each process indicating that it’s received “enough” echos.  Finally, after receiving “enough” readys, the message is accepted.

More precisely, when a process wants to send a value, it sends the initial broadcast.  Each process that receives this initial broadcast immediately sends an echo of this message to all other processes.  Any process that receives at least n-t echos (that is, as many echos as there are non-failing processes) sends out a ready for this message, indicating that it’s accepted it, and any process receiving at least t+1 readys (that is, at least one ready from a non-failing process) sends out its own ready.  Finally, a process accepts the message after receiving n-t readys.
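To make the thresholds concrete, here’s a sketch of one receiver’s state for a single message.  This is my own Python rendering of the rules above, not code from the paper; names are mine.

```python
# Sketch of the reliable-broadcast threshold rules for a single message,
# from the perspective of one receiving process. n is the total process
# count, t the maximum number of Byzantine failures.

class ReliableBroadcastReceiver:
    def __init__(self, n, t):
        assert n > 3 * t, "reliable broadcast needs n > 3t"
        self.n, self.t = n, t
        self.echoes = set()      # processes we heard an echo from
        self.readys = set()      # processes we heard a ready from
        self.sent_ready = False
        self.accepted = False

    def on_echo(self, sender):
        self.echoes.add(sender)
        self._maybe_send_ready()

    def on_ready(self, sender):
        self.readys.add(sender)
        self._maybe_send_ready()
        # accept the message after n - t readys
        if len(self.readys) >= self.n - self.t:
            self.accepted = True

    def _maybe_send_ready(self):
        # send a ready after n - t echoes, or after t + 1 readys
        # (t + 1 readys guarantee at least one came from a correct process)
        if not self.sent_ready and (
            len(self.echoes) >= self.n - self.t
            or len(self.readys) >= self.t + 1
        ):
            self.sent_ready = True
```

With n=4 and t=1, three echoes trigger a ready, and three readys trigger acceptance; alternatively, two readys alone also trigger a ready, since at least one of them must be from a correct process.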

Any process following the procedure can now send a message and eventually have all other working processes accept it: all n-t working processes will eventually echo the message to all other working processes, so every working process will receive at least n-t echos (which is at least t+1, since n > 3t), and will send out a ready, resulting in the n-t readys required to accept.

But a Byzantine process can’t get multiple values accepted.  To do so, it would have to get n-t readys for each value, which requires readys from at least n-2t working processes (since only t of them can come from other Byzantine processes).  To send a ready, each of these working processes must have received either t+1 readys, or n-t echos.  The latter must happen at least once – the first working process to send a ready can have received at most t readys from the failing processes, so it must have received n-t echos.  By the same reasoning, these n-t echos must include at least n-2t from working processes.  So n-2t working processes received and echoed the accepted value.  But there are n-t working processes, and since n > 3t, n-2t is more than half of n-t.  Thus, any value accepted must have been echoed by a majority of the working processes; thus, there can only be one such value.

So reliable broadcast ensures that only one value can be accepted from a given process.  So a Byzantine process can’t confuse things by sending different messages to different processes.  (Under the assumption that n > 3t – otherwise, the majority is no longer a majority, and two values can be accepted.)

The witness technique works around the fact that two processes won’t necessarily share all of the same values – reliable broadcast only requires hearing from n-t processes, so we may continue before all the processes have sent out their values.  Since two different processes may be missing different subsets, any two processes may only agree on n-2t of their values if they received them from different processes.

So the witness technique basically has each process broadcast its n-t received values, and merges such received messages into its received values.  It’s not clear to me how this prevents tampering by Byzantine processes; this seems like a good future paper to read.

The Safe Area

The new contribution here is the safe area.  The precise statement of the problem is as follows: given a set of processes, each of which has an input point in a multidimensional continuous space, they need to agree on a point which is inside the convex hull of the inputs, to within epsilon tolerance (so they can differ by up to epsilon.)

The safe area is a region which the process can conclude is definitely within this convex hull.  The problem is that a malicious process may suggest points outside this convex hull, and if we believe them, we may pick a point outside the set.

So the solution is to look at the region that’s inside the convex hull of every n-t point subset of received points.  We know that the n-t working processes will send the correct points, and the region is inside that particular convex hull.  (It’s also inside a bunch of other convex hulls, but that’s the one we care about.)
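In one dimension this is easy to picture: convex hulls are just intervals, so the safe area is the intersection of the intervals spanned by every (n-t)-subset.  A sketch of that 1-D special case (my own illustration; the paper’s actual contribution is making this work in m dimensions):

```python
from itertools import combinations

def safe_interval(points, t):
    """1-D illustration of the safe area: intersect the convex hulls
    (here, intervals) of every (n - t)-subset of the received points.
    Any point in the result lies inside the hull of the correct
    processes' inputs, whichever t of the senders were faulty."""
    n = len(points)
    lo, hi = float("-inf"), float("inf")
    for subset in combinations(points, n - t):
        lo = max(lo, min(subset))   # intersection: take the largest left end
        hi = min(hi, max(subset))   # ...and the smallest right end
    return (lo, hi) if lo <= hi else None   # None: no safe point exists
```

With t=1, this amounts to trimming the single smallest and largest received values, which matches the intuition that one extreme outlier might be a lie.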

Unfortunately, this section is extremely technical, and I can’t find any intuition for it.  Suffice to say, this region exists, so we can compute it and pick a point inside it based on the messages received.  Further, there’s a notion of progress ensuring that each working process picks a point inside this safe region, resulting in a smaller region enclosed in the original convex hull.  By repeating this procedure with the new, smaller region, after enough iterations all the working processes have converged to within epsilon of each other.


This solution requires n > t(m+2) – that is, fewer than a 1/(m+2) fraction of the processes can fail.  With any more failing processes, the problem is impossible.  This proof is fairly intuitive.

Assume n = t(m+2). Consider the m+1 points (0, 0, …, 0), (1, 0, …, 0), (0, 1, 0, …, 0), …, (0, 0, …, 1).  (That is, the origin plus the m unit vectors.  Or more precisely, with coordinates bigger than epsilon rather than 1, but I’m assuming epsilon is less than 1.)  The total number of good processes is n-t = t(m+2) – t = t(m+1).  Distribute these good processes with t at each of the m+1 points.

Assume that the t failed processes simply crash, and never send any messages.  Now each process receives messages from exactly t processes at each point.  Since it doesn’t know which processes are Byzantine, it’s possible that all t processes at any single other point are the faulty ones, so it can’t be sure that any of those points are in the convex hull of the good inputs.  (The network is assumed to be able to delay messages arbitrarily, so it can never be sure that there aren’t still t good messages out there waiting to be delivered.)  Thus, it can never pick any point besides its own, so the processes at different points can never agree on a point.


This is from the math side of computer science, while all the previous papers were more from the real-world side, which makes for a rather different writeup – no evaluation, and I couldn’t really explain the interesting bits of the proof in a satisfactory manner.

Also, while they motivated the idea with “real-world” examples, they felt like examples often do in math – they kind of make sense, but the real reason for the work is that it’s a natural extension of another problem.

But the result is still interesting; my favorite part is actually the reliable broadcast, but the safe region contribution is also pretty good (and much more surprising; I wouldn’t have expected that).  And the proof is interesting in its way; it just doesn’t lend itself (AFAICT) to an intuitive explanation.


Knowledge discovery by accuracy maximization

Available from researchgate, with a web page as well.  Authors are Stefano Cacciatore, Claudio Luchinat and Leonardo Tenori, from, collectively, the University of Florence, Harvard Medical School, Rovira i Virgili University, and the FiorGen Foundation.  (Some authors are apparently at multiples of these; see the paper for the precise list.)

I’m generally not a huge fan of machine learning, and this paper triggers some of that feeling.  ML often feels to me much more like a traditional science, with lots of ideas that might work, and lots of evaluation to find the actual good ones.  There’s not a lot of good theory backing a lot of it; some ideas work and some don’t, and there often doesn’t feel like any reason for it.  It’s just the way the world is.

That’s not to say that ML techniques aren’t incredibly useful; it’s just that ML papers often tend to feel less principled and more like trial and error.  The results may be useful, but that’s not what I want to read about.

Anyway, this paper feels like one cute idea that’s explained very briefly in one paragraph, and then a huge amount of supporting work showing that, with proper tuning, it can work well.  I don’t have the background to judge it, but I think I understand the cute idea, at the very least.

In short, the idea is to group elements in a clever way: basically assuming a grouping, then checking how well a standard algorithm does at finding this grouping.  If it comes close but disagrees in some cases, assume it was right and update the assumed grouping.  Iterate until either the algorithm finds the right grouping, or it’s been a while and it’s time to give up.  Do this a bunch of times, and then weight how similar elements are by how often they were grouped together.

That gives grouping weights; now, for each pair of points, we divide the Euclidean distance between them by the likelihood of their being in the same group.  This produces a “distance”.  By then using an all-pairs shortest path, we now have effectively a description of an underlying topology – e.g., if the dataset was a sheet that was curled up in three dimensions, this will do a good job of finding that sheet.

Grouping elements

The first step is to figure out groupings.  This is done by finding many groupings, then translating them into adjacency matrices and averaging them.

First, we pick some fraction of the elements (0.75 by default).  As far as I can tell, this is to create gaps in the data which will result in well-defined clusters.  However, I don’t see it explained in the paper at all.

Next, we assign initial groupings.  The simplest way to do this is to create a new group for each element; an alternative is to apply a standard clustering algorithm.  Now we do 10-fold cross-validation of a standard supervised algorithm.

Unwrapping that last sentence, a supervised algorithm is one that takes a classified learning set, builds up some model, and then applies that model to new data.  10-fold cross-validation means splitting the elements into 10 groups, and picking one group as the test set.  We then train the algorithm on the 9 remaining groups, and then see how well it does on the test set.  We do this with each of the 10 groups as the test set, and average the 10 results to determine how well the algorithm did.

An example of a standard clustering algorithm is k-Nearest-Neighbors, or kNN.  To classify a new point, it finds the k points in the training set closest to the new point, and assumes that point is in the set that most of its neighbors are.  Simple as this seems, it’s one of the standard classifiers.  (Others, such as SVM, are a bit more complicated.)
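Since kNN is the workhorse here, a plain sketch of it (my own, using squared Euclidean distance; the paper can also plug in other classifiers such as SVM):

```python
from collections import Counter

def knn_classify(train, query, k=3):
    """Plain k-nearest-neighbors: label a query point by majority vote
    among the k closest training points. `train` is a list of
    (point, label) pairs; points are tuples of numbers."""
    def sqdist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    # the k training points closest to the query
    nearest = sorted(train, key=lambda pl: sqdist(pl[0], query))[:k]
    # majority label among those neighbors
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```

For example, a point near a cluster of “a”-labeled training points gets labeled “a”, regardless of a few far-away “b” points.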

So all that’s happening at this point is we have a grouping (which we basically made up).  We pick 10% of the points as the test set, train k-Nearest-Neighbors on the remaining 90%, then run kNN on the held-out 10% and see how well it does.  Then repeat this for each of the ten 10% slices.

All told, we now have a notion of how “good” the grouping is based on how kNN does on it.  Moreover, we know, for each element, how kNN classified it.  If kNN got everything exactly right, then we’ve converged, and found the “true” classification.  (It may or may not be actually right, but it’s as good as we’ll get here.)  Otherwise, we can suppose that our original grouping was wrong, and kNN is right in some cases, so update the groups of some random subset of elements that kNN disagreed on.

Now we can rerun the cross-validation, and see if kNN does better.  If it does, we take this new set as our basis, and iterate.  If not, we try again with a different random set to update.
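Here’s a heavily simplified sketch of that loop.  For brevity I’ve swapped in leave-one-out 1-nearest-neighbor for the paper’s 10-fold cross-validation, and I relabel one disagreeing point at a time; the function and variable names are mine.

```python
import random

def kodama_labels(points, iters=500, seed=0):
    """Accuracy-maximization sketch: start with every point in its own
    group, repeatedly let the classifier relabel a point it disagrees
    on, and keep the change if cross-validated accuracy doesn't drop."""
    rng = random.Random(seed)
    n = len(points)

    def sqdist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    def predict(labels, i):
        # leave-one-out 1-NN: predicted label is that of the nearest other point
        j = min((j for j in range(n) if j != i),
                key=lambda j: sqdist(points[i], points[j]))
        return labels[j]

    def accuracy(labels):
        # how many points the classifier labels the same way we did
        return sum(predict(labels, i) == labels[i] for i in range(n))

    labels = list(range(n))              # start: every point its own group
    best = accuracy(labels)
    for _ in range(iters):
        wrong = [i for i in range(n) if predict(labels, i) != labels[i]]
        if not wrong:
            break                        # classifier reproduces the labels: done
        i = rng.choice(wrong)
        trial = labels[:]
        trial[i] = predict(labels, i)    # assume the classifier was right
        if accuracy(trial) >= best:      # keep non-worsening changes
            labels, best = trial, accuracy(trial)
    return labels
```

On two well-separated clusters this converges to one label per cluster, since the only labelings kNN can reproduce perfectly are per-cluster consensus labelings.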

So basically, this is randomly converging on a grouping which kNN is able to learn and predict reasonably well by taking multiple iterations.  This effectively converts the supervised kNN (which requires input which is classified) into an unsupervised method (which just takes unlabeled input, and derives its own labels).

Once we have a set of groupings, we generate the proximity matrix: the element at row i, column j is 1 if elements i and j are in the same group, and 0 otherwise.  We then repeat this whole process a bunch of times with different random subsets of the data (remember, we initially picked a random 75% of the data to work on).  This generates a bunch of matrices, and we average all of them to get the overall proximity matrix (where each value is between 0 and 1, indicating what fraction of the time its row and column elements were in the same group.)
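The averaging step itself is simple.  A sketch (simplified in that every run covers all elements; the paper averages only over the runs in which both elements were in the chosen subset):

```python
def proximity_matrix(groupings, n):
    """Average the 0/1 same-group indicator over many groupings. Entry
    (i, j) ends up as the fraction of runs in which elements i and j
    shared a group."""
    P = [[0.0] * n for _ in range(n)]
    for labels in groupings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    P[i][j] += 1.0
    runs = len(groupings)
    return [[v / runs for v in row] for row in P]
```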

The dissimilarity matrix

One useful thing, though, is not only to find out which points are similar, but to find the underlying topology of the set.  For example, the dataset might be the “Swiss Roll”; imagine a sheet of paper curled up into a spiral.  (Or look at an actual swiss roll.)  In this case, the dataset is three dimensional, but it can be expressed in two dimensions (how far along the spiral, and where along the depth of it.)  Finding this structure can be extremely useful.

So we construct a dissimilarity matrix, which is effectively the distance between each pair of points along this structure.  (So in the Swiss Roll example, distance is measured along the rolled structure, not directly in three dimensions.)

To do this, first take the proximity matrix computed in the first part, and truncate values below some threshold (0.05 by default) to 0.  This theoretically cuts off points that occasionally ended up in the same group by luck, but aren’t really related.  Now for each pair of points, take the standard Euclidean distance between them (i.e., the square root of the sum of the squared coordinate differences) and divide it by the proximity.  This gives an adjacency matrix where two points will be close together if they are both nearby in the standard sense and likely to be in the same group.  So it indicates the rough structure of the data.

Finally, apply an all-pairs shortest path to find distances between all pairs of points.  If we just used normal distance, this would be redundant because of the triangle inequality.  But with the proximity matrix factored in, two points may not have been clustered together often, but both often clustered with a third point; this may result in a shorter path through the third point than by going directly from one to the other.  For an extreme example, two rarely clustered points will have infinite distance between them (since their proximity is truncated to 0).  But there is likely a path connecting them through other points, so they will end up a finite distance away after the shortest path.
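The two steps together, as a sketch (my own rendering, using Floyd-Warshall for the all-pairs shortest paths):

```python
import math

def dissimilarity(points, proximity, cutoff=0.05):
    """Distance-over-proximity followed by all-pairs shortest paths.
    Pairs whose proximity falls below the cutoff start at infinite
    distance and become reachable only through intermediate points."""
    n = len(points)

    def euclid(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    # direct "distances": Euclidean distance divided by proximity,
    # or infinity where the truncated proximity is (effectively) zero
    D = [[0.0 if i == j else
          (euclid(points[i], points[j]) / proximity[i][j]
           if proximity[i][j] >= cutoff else math.inf)
          for j in range(n)] for i in range(n)]

    # Floyd-Warshall relaxation: allow paths through intermediate points
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if D[i][k] + D[k][j] < D[i][j]:
                    D[i][j] = D[i][k] + D[k][j]
    return D
```

In the test below, points 0 and 2 were never clustered together (proximity 0), so their direct distance is infinite; the shortest path routes through point 1 instead.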

And now we have effectively a graph indicating the underlying structure of the data which we can use for pretty visualizations or to find interesting facts about it.


This idea is pretty in theory, but does it work in practice?

First, the authors generate data following various shapes (including the swiss roll), add some randomness, and try to recover the shape.  KODAMA does really well here, better than most other approaches.

Next up is clustering on various datasets; again, KODAMA does well.

Then come some real-world datasets; one on gene expression and one on metabolites; these are apparently standard benchmarks, and there are pretty pictures showing that KODAMA does really well.

Then they try some new examples.  One is early-type galaxies – various parameters are collected and KODAMA successfully finds the difference between Fast Rotators and Slow Rotators, suggesting that these are actually meaningful distinctions.

The other was picked up by the media, and that’s State of the Union addresses.  They gave each State of the Union address from 1900 to 2014 to KODAMA as “term frequency-inverse document frequency” vectors; basically, weighting words by how often they appear in a given address relative to how often they appear across all the addresses.  The result was that there’s a sharp change during Ronald Reagan’s years.  Examining the indicators, “total”, “understanding”, “expenditures”, “resources”, “peace”, “production”, and “employment” were more common pre-Reagan, while “bless”, “parents”, “businesses”, “thank”, “tonight”, “child”, and “children” were more common after.


I normally don’t like machine learning that much, but this paper had a cool idea and showed that it worked.  I would have liked to see more motivation and explanation, but it apparently was sufficient for me to figure it out, so I can’t complain.

The use of a supervised algorithm combined with cross-validation to generate an unsupervised algorithm is clever and works very well; then building on it to generate the structure seems a bit ad-hoc to me (why use Euclidean distance?  Why divide by the dissimilarity, and not its square or square root or log?).  But it works, and that’s what counts in machine learning.


Don’t Sweat the Small Stuff: Formal Verification of C Code Without the Pain

Available from NICTA.  Authors are David Greenaway, Japheth Lim, June Andronick, and Gerwin Klein from NICTA and UNSW, Sydney, Australia.

This paper feels much more technical than the earlier ones, and is building on earlier work in a way that’s hard to jump into in the middle.  It assumes basic knowledge of Isabelle/HOL, an automatic proof system, and gives only a brief description of AutoCorres, an existing tool for converting C programs to a higher-level structure in a provably correct way.  That said, even with only a basic background, the new contributions are still interesting.

The ultimate goal is to be able to easily prove properties about C programs in a machine-checkable way; a trivial example is proving that if you have two variables a and b with values x and y, after calling swap(&a, &b), a will contain y and b will contain x.

Or for a more complex example, the Schorr-Waite algorithm iterates over all the nodes in a binary tree by storing only two bits in each node and manipulating pointers.  The correctness theorem here is basically that if you have a valid tree with no nodes marked as visited, after running the algorithm all nodes will be marked as visited.

There’s been a lot of work on proving these sorts of statements in toy languages, but doing it in C, with actual fixed-width arithmetic and the real C memory model, is hard.  The AutoCorres framework can prove statements about C, but requires a lot of work around these two areas of arithmetic and memory.  This paper helps with that – it provides a model for arithmetic on the finite integers used in real-world programming (both signed and unsigned), and a way of handling memory that allows addresses to be treated as different types at different points.

The basic idea for arithmetic is to convert statements about word-size arithmetic (e.g., that low < mid and mid < high in int mid = (low + high) / 2) into a guard stating that overflow doesn’t occur, along with the statement to be shown.  If overflow is intended, the user can still make this more precise, but in the common case where overflow is unintentional, no additional effort is required.

For the heap, they effectively use a byte array for the underlying instance, but then provide abstractions to treat memory as specific types by tagging locations.  This abstraction can then check for things like alignment and non-overlapping for higher-level functions, while still allowing lower-level functions to treat memory as a simple untyped byte array.

And they do both of these in Isabelle/HOL including proof generation, so there is a machine-checkable proof that the program after applying these abstractions computes the same result as the original program.  So any statement proven using these abstractions applies to the original C program, but without all the pain of proving things about C.

How to prove things

The basic idea behind a theorem-prover like Isabelle/HOL or coq is to use types to represent logical statements; the statement “if A, then B” is converted to a function that takes a value of type A as input, and produces a result of type B.  Other logical connectives work as well; AND is a pair (that is, “A and B” is represented by a tuple containing an object of type A and an object of type B), and OR is a union type; either an A or a B.

Now a proof is just an expression, and checking the proof just means checking the type of the expression.  Constructing a proof, on the other hand, is still hard – coming up with a term of a given type is, in general, no easier than constructing a proof.

So Isabelle and coq have a bunch of rules that tell them how to prove things; e.g., if you have an expression “A AND B”, first prove A, then prove B.  But this isn’t always the right thing; maybe you know that C is true and C implies A AND B; in this case, splitting it up is the wrong choice.  So they have a bunch of heuristics and “tactics” they can apply to try to prove things, but still generally need a programmer’s help to fill in the hard parts.

This is why, even though we can prove theorems about C programs, we want better approaches – by automating more and more of the system, we can prove specific statements with less work.  Or we can spend the same amount of work and prove statements about larger systems.

What AutoCorres does

Before this paper, AutoCorres could already prove things about C programs, but it took a good amount of work to do so.  As I understand it, the C program is first translated to a language called Simpl which is embedded in Isabelle.  This step is unproven; it’s merely kept as simple as possible to avoid bugs.  Simpl is a very low-level language that is a pain to use.  AutoCorres then (provably correctly) munges this into a much nicer form that often resembles the original code.

However, dealing with arithmetic is a pain.  Isabelle has lots of theorems and tactics to handle proving things about the naturals, but C integers aren’t natural numbers; unsigned numbers wrap around, and signed numbers are undefined if they wrap (meaning the compiler can do whatever it wants).  And several theorems that apply to the naturals aren’t even true on C values: some examples:

  • s = s + 1 - 1 (fails for signed MAXINT, since overflow is undefined)
  • s = -(-s) (fails for signed MININT, since -MININT is undefined)
  • u + 1 > u (fails for unsigned MAXINT, since it wraps to 0)
  • u * 2 = 4 => u = 2 (u could also be 2^31 + 2 on a 32-bit machine, since the multiplication wraps)
  • -u = u => u = 0 (u could also be 2^31, half of the wraparound modulus)

So traditionally, any theorem about C integers must be proven by hand, ignoring the years of work spent on standard Isabelle theorems.  The solution is to simply treat C integers as normal unbounded integers, and add in constraints that they’re not allowed to overflow.  Most of the time this works; if it doesn’t, the programmer can still explicitly prove the theorem the normal way.
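The unsigned failures above are easy to reproduce by doing the arithmetic modulo 2^32 (the signed ones can’t meaningfully be “checked”, since signed overflow is undefined in C rather than wrapping; that’s exactly why the guards forbid it):

```python
# Unsigned 32-bit wraparound counterexamples, checked directly.
MASK = 0xFFFFFFFF        # arithmetic modulo 2**32
UMAX = MASK              # largest unsigned 32-bit value

# u + 1 > u fails at u = UMAX: the increment wraps to 0
assert (UMAX + 1) & MASK == 0

# u * 2 == 4 has a second solution besides u = 2
u = (1 << 31) + 2
assert (u * 2) & MASK == 4

# -u == u has a nonzero solution
v = 1 << 31
assert (-v) & MASK == v
```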

And, incidentally, this can bring up real issues: in the binary search example, “int mid = (low + high) / 2”, one generated constraint is that (low + high) doesn’t exceed MAXINT.  And indeed, this breaks most binary searches, including some proven correct in insufficiently precise models, as a Google blog post pointed out in 2006.
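The bug is easy to demonstrate by simulating two’s-complement 32-bit arithmetic (in real C the overflow is undefined behavior, but wrapping is what common hardware does):

```python
# The classic binary-search midpoint bug, with simulated 32-bit ints.
INT_MAX = 2**31 - 1

def to_int32(x):
    # wrap a Python integer into the signed 32-bit range,
    # the way two's-complement hardware would
    x &= 0xFFFFFFFF
    return x - 2**32 if x >= 2**31 else x

low, high = 2, INT_MAX - 1
buggy_mid = to_int32(low + high) // 2    # low + high wrapped negative
safe_mid = low + (high - low) // 2       # the standard overflow-free fix
```

Here low + high equals 2^31, which wraps to a negative value, so the “midpoint” falls outside [low, high] entirely; the rewritten form never computes a sum that can overflow.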

There’s lots of details about exactly how this translation is proven correct and inference rules to use with it, but without a deeper understanding of AutoCorres and Isabelle, I can’t really say much about them.

Memory is handled with a bit more complexity.  The major issue is that memory is untyped, so a generic solution to handle things like alignment and overlapping is tricky.  For example, the naive statement of correctness for a swap function would be that, if the value at address a is x, and the value at address b is y, then after running swap a b, address a should contain y and b should contain x.  But this is false; a and b must be properly aligned, they can’t contain NULL (since dereferencing NULL is undefined), and they can’t partially overlap.

So to abstract this all away, they use “heap-lifting”.  They tag each address on the heap as either the start of a given type, or as a “footprint” continuing the previous type.  Then they define a check and getter for each type: is_valid_w32(ptr) takes a pointer and returns true if address ptr is 32-bit aligned, tagged as the start of a 32-bit word, and followed by three footprint bytes.  Similarly, heap_w32(ptr) gives the value stored at the pointer as a 32-bit word, assuming it’s of the right type.

Now these typed heaps can be used by the prover to automatically derive simpler true statements.  For example, swap a b now has a simple precondition of (roughly) is_valid_w32 a AND is_valid_w32 b AND s[a] = x AND s[b] = y; the postcondition is the same, but with x and y swapped.  All of the complexity of the model is rolled into the heap checks.  Apparently this is also enough for Isabelle to be able to automatically infer much more, as well.
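A toy model of the idea: the underlying heap is a byte array plus a tag map, and the typed accessors only work where the tags say a word lives.  The names is_valid_w32 and heap_w32 follow the paper’s examples, but the modeling details here are my own simplification.

```python
class LiftedHeap:
    """Byte-array heap plus per-address type tags (heap-lifting sketch)."""
    def __init__(self, size):
        self.mem = bytearray(size)
        self.tag = {}                        # addr -> "w32" | "footprint"

    def alloc_w32(self, addr, value=0):
        assert addr % 4 == 0, "32-bit words must be 4-byte aligned"
        self.tag[addr] = "w32"               # start of a 32-bit word...
        for b in range(1, 4):
            self.tag[addr + b] = "footprint" # ...followed by its footprint
        self.store_w32(addr, value)

    def is_valid_w32(self, addr):
        return (addr % 4 == 0
                and self.tag.get(addr) == "w32"
                and all(self.tag.get(addr + b) == "footprint"
                        for b in range(1, 4)))

    def heap_w32(self, addr):
        assert self.is_valid_w32(addr)
        return int.from_bytes(self.mem[addr:addr + 4], "little")

    def store_w32(self, addr, value):
        assert self.is_valid_w32(addr)
        self.mem[addr:addr + 4] = value.to_bytes(4, "little")

def swap(h, a, b):
    # precondition: is_valid_w32 a AND is_valid_w32 b
    x, y = h.heap_w32(a), h.heap_w32(b)
    h.store_w32(a, y)
    h.store_w32(b, x)
```

With the validity checks carrying all the alignment and non-overlap obligations, swap’s specification really is just “the two values trade places”.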


One point brought up is the challenge of proving that this fragment returns 4:

w->next = x; x->next = y; y->next = z; x->next = z;
w->data = 1; x->data = 2; y->data = 3; z->data = 4;
return w->next->next->data;

It seems straightforward to a person, but all of the memory dereferences make it hard for automated tactics to apply.  With the memory model used here, the standard “auto” tactic suffices to prove it (after giving it all the rules above).

They ran the system on a number of code bases, including the roughly 10,000-line seL4 kernel.  For the seL4 kernel, the total runtime was just over an hour, though apparently that is parallelizable, and results can be cached for future runs.

They also look more closely at linked list reversal (the “hello world” of pointer-aliasing programs) and the Schorr-Waite algorithm, which traverses all nodes in a binary tree using only two bits of memory per node, with lots of pointer munging.

In both cases, they took a previous high-level proof by Mehta and Nipkow, and ported it over to their system with relatively few changes.  This is impressive, given that the previous model didn’t deal with general memory or finite integers.

The total lines of proof increased by just over 50%, from 577 lines to 807, and they compare to a previous full C proof in coq, which was 3317 lines (though coq is apparently less automated, not enough so to explain the difference).


This seems like another area ripe for a deep dive; I can vaguely understand what’s cool about this paper, but don’t really understand what’s going on and can’t really evaluate it myself.  Maybe at some future point I’ll try to read a series leading up to here and re-evaluate this.

But as far as I can tell now, this is a good step towards being able to prove C programs correct.  That would be particularly useful for, say, OpenSSL…


Dataflow Execution of Sequential Imperative Programs on Multicore Architectures

Available from the University of Wisconsin-Madison; authors are Gagan Gupta and Gurindar S. Sohi from that institution.

I read The Road to Parallelism Leads Through Sequential Programming this morning as well, but that delegates back to this paper for the interesting details, so I’m writing up this instead.  (And if I’m feeling spectacularly lazy, I might write that up, too, but it feels like cheating to spend a day on it.)

This is another attempt at making parallel programming easier.  While Deterministic Parallel Ruby used simple parallelization primitives with runtime safety enforcement, this paper presents a model that uses runtime dataflow analysis of programmer-provided read and write dependencies to automatically parallelize an otherwise sequential program.

More specifically, it effectively annotates functions with the shared objects they read or write, then runs the program, looking ahead to find functions that can currently be executed because they depend only on data that is currently available.  This is done by effectively giving each object to the next function that needs it, and making further functions needing the object wait on that function (with a bit of complexity due to the reader/writer model.)

The authors compare this to a modern superscalar CPU – such a CPU will typically be looking ahead for instructions that don’t depend on currently executing ones, and run those out of order in parallel with other instructions.  This requires a tremendous amount of bookkeeping, but is a vital part of how modern CPUs are fast and power efficient.

They then compare standard “static parallel” approaches (that is, parallelism specified in the code, including MPI, pthreads, or DPR) to VLIW instruction sets – these were attempts to move the scheduling logic from the CPU to the compiler, which could theoretically make smarter decisions.  However, the dominance of x86 (and ARM) indicates that the standard “figure it out at runtime” approach may be better.  While it’s not outright stated, the authors seem to imply that the same applies to parallel software, and parallelizing code at runtime is the right approach.

The metaphor continues further: the authors refer to “Function Level Parallelism” (FLP) by comparison with “Instruction Level Parallelism” (ILP), the common term for the parallelism that CPUs perform.

What is it, really?

Traditional dataflow languages (AFAIK) are generally functional and force the programmer to write code in that style; this makes analysis easy and complete, since the runtime (or compiler) can effectively sort commands or functions.  This is great in theory, but hasn’t been adopted in practice; so this paper attempts to apply the ideas to C++.  (Or any other language, AFAICT, but C++ was used for demonstration.)

The core is three new functions (library functions – no modification is needed to the compiler):

  • df_execute(write_set, read_set, function, [args]): Add the function to the run queue with the given read and write sets, and continue running the main program.
  • df_seq(object, function): When object is available, run function, and block the remainder of execution until function finishes.
  • df_end(): wait for all queued functions to complete before continuing; ends the parallel portion of the program until the next df_execute.

The read_ and write_ sets are simply STL sets containing objects; these are the objects which the function is allowed to access (read or write, respectively), and which are used for the dataflow analysis.  The typical model would be to use df_execute for “interesting” function calls, letting the runtime schedule these in parallel as possible, and using df_end if work needs to be done that doesn’t have a constrained working set; it’s effectively a barrier that ensures all previous computation is complete before the program proceeds.

df_seq seems a bit mysterious; AFAICT, it can’t be implemented in terms of df_execute and df_end, since it can run concurrently with previous df_executes that don’t touch its object, but blocks further execution until the df_seq function (but not any previous functions) finish.  The example of its usage given is to print an object; this makes sense, but I’m not sure what other uses it has.  (My instinct is that df_execute is usually a better choice.)

The result is that a typical serial program can have its core loop parallelized simply by replacing function calls with appropriate calls to df_execute, with df_end called at the end of the parallel segment to ensure everything finishes.  The runtime then magically takes the dependencies (passed in as sets at runtime, so precisely the objects used, rather than broader classes) and makes it all run as parallel as possible.

Under the hood

The runtime isn’t magic; it’s actually fairly simple.  Each object used in a read_set or write_set needs to inherit from a token base class, which has associated read and write “tokens” (apparently the standard terminology for dataflow systems).  These tokens basically track who is currently using the object and who is waiting for it.

Then they use a thread pool to execute the program across multiple cores.  Initially, only one thread is active, running the program as normal.  When a df_execute call occurs, the function attempts to acquire all the tokens it needs to run (i.e., checks whether all of its required shared objects are available).  This uses the standard read/write structure, where writes block and are blocked by reads and writes, but reads don’t block each other.  If all the tokens are acquired, the task can be “delegated”, meaning that the thread starts running it, and pushes the rest of the program onto its run queue.  Then another thread, if idle, can “steal” the rest of the program and continue running it while the previous main thread continues on the df_executed function.

If some tokens are unavailable, the task is added to (ordered) wait lists for any tokens it’s missing.  These will later be made available by the tasks currently using them.

And when a task completes, it releases its objects – each object will be passed on to the first task in its wait list (if appropriate), which may now run if it isn’t waiting on any other objects.  If no tasks are waiting on an object, it’s marked as free, ready for the next task that tries to grab it.  Note the subtlety that a reader may not actually release an object if another task is still reading it – the number of readers is tracked, so a writer can only proceed when no readers are using an object.
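The token bookkeeping above can be sketched in a few lines.  This is my own sequential Python illustration (the paper’s implementation is C++ and threaded); the names Token, try_acquire, and release are mine, not the paper’s:

```python
from collections import deque

class Token:
    """Per-object read/write token with an ordered wait list."""
    def __init__(self):
        self.readers = 0          # number of tasks currently reading
        self.writer = None        # task currently writing, if any
        self.waiters = deque()    # ordered wait list of (task, mode)

    def try_acquire(self, task, mode):
        if self.waiters:          # preserve scheduling order: no overtaking
            self.waiters.append((task, mode))
            return False
        if mode == "read" and self.writer is None:
            self.readers += 1
            return True
        if mode == "write" and self.writer is None and self.readers == 0:
            self.writer = task
            return True
        self.waiters.append((task, mode))
        return False

    def release(self, task, mode):
        if mode == "read":
            self.readers -= 1     # a writer must wait for the last reader
        else:
            self.writer = None
        # Hand the object on: either one writer, or a run of readers.
        while self.waiters:
            t, m = self.waiters[0]
            if m == "write":
                if self.readers == 0 and self.writer is None:
                    self.waiters.popleft()
                    self.writer = t
                break             # a waiting writer blocks readers behind it
            self.waiters.popleft()
            self.readers += 1

# Reads don't block each other; a write waits for all readers to drain.
tok = Token()
assert tok.try_acquire("r1", "read")
assert tok.try_acquire("r2", "read")
assert not tok.try_acquire("w", "write")
tok.release("r1", "read")
tok.release("r2", "read")
assert tok.writer == "w"
```

Note the subtlety mentioned above falls out naturally: releasing one reader doesn’t hand the token to a waiting writer until the reader count hits zero.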

This is pretty much the implementation; there are some details in the task scheduling, which are apparently borrowed from Cilk.  Note that a thread encountering an immediately-executable df_execute switches to it and puts off the main thread; this ensures that as many df_executes as possible are running.  (The alternative, queueing the df_execute to run, would always leave the main thread running, potentially spawning more df_executes).

An example

It really is that simple; here’s the bzip2 kernel before and after parallelization (the changes are the two insert lines and the two df_execute calls):

..
while (blockBegin < fileSize - 1) {
  blockBegin += updateBlock (blockBegin);
  block = new block_t (hInfile, Length);
  compress (block);
  wr_file (hOpfile, block);
}
..

..
op_set->insert (hOpfile); // file write set
while (blockBegin < fileSize - 1) {
  blockBegin += updateBlock (blockBegin);
  block = new block_t (hInfile, Length);
  block_set->insert (block);
  df_execute (block_set, &compress);
  df_execute (op_set, block_set, &wr_file);
}
..

That’s it.  And you have a 2x speedup on a Xeon or a 9-11x speedup on a 4 or 8 socket Opteron.  Magic.


To evaluate their system, they compare runtime to pthreads across a few benchmarks on a quad-core hyperthreaded Intel Xeon and on 4- and 8-socket quad-core Opteron systems.  Results are generally competitive with pthreads – sometimes better, sometimes worse.  The cases where they do significantly worse are generally due to scheduling a large number of small jobs; by increasing the block size evaluated by each function call, they can do at least as well as pthreads on the Opterons, and not much worse on the Xeon.  (They claim the discrepancy is due to bus speeds; this seems sensible.)

They also note that their code is much simpler; the bzip2 above required almost no changes, while the pthreads version (implemented independently) uses multiple task-specific threads with complex communication patterns.  So being even remotely competitive is not bad for so little work.

There’s also a long discussion around microbenchmarks and how many clock cycles each operation takes.  This is probably useful, but seems incredibly boring to me.


Some other points come up in the paper that didn’t fit into my narrative above:

Deadlocks can be a problem for dataflow systems; they are easily avoided here by using ordered wait lists.  If A and B wait for any of the same resources, whichever one was scheduled first will get first crack at those resources.  So it’s impossible for two tasks to each be waiting on resources the other holds.

It can also be an issue if too many tasks get scheduled (e.g., if a loop turns out to be fully sequential, and every iteration is scheduled while the first one is still running).  This is easily avoided by limiting how many tasks can be scheduled before the main task is blocked.

The system, while partially specified in the code, really is dynamic – determination of dependencies happens at runtime, much like in an out-of-order CPU.  As is usually the case, this allows much more precision: two function calls may conflict in general, but at runtime we can determine exactly whether they conflict for the particular arguments they’re actually called with.


I really like this – it feels like a much cleaner solution than yesterday’s DPR, since it avoids the new parallelization primitives and still “looks like” a sequential program.  I’m not sure I completely agree with the authors that this is The Way to go, but it’s definitely promising.

I am, however, a little concerned that it looks easier, but depends on the programmer to specify what shared objects each function uses.  I can easily see this resulting in the sort of weird nondeterministic bugs that this type of system is supposed to avoid – if I mistype the read or write set, all of the nice properties go out the window and I have a giant race condition again.

It seems like it should be possible to apply something like TARDIS here – in development or testing, track which objects are actually used and complain if they’re not in the given read/write set.  (Or a simpler system might work, since it seems like we could simply add an additional field marking which tasks are allowed to read/write each object and check that on each access instead of logging.)

It also seems like it should be possible to effectively limit scope so that functions can’t access anything outside their modification set, but it’s not clear to me how easy that would be in C.  (I also may well be wrong and it’s impossible or hard.)

In any case, this looks like another great idea at making parallel programming easy with minimal programmer input.


Dynamic Enforcement of Determinism in a Parallel Scripting Language

Available from the University of Rochester; authors are Li Lu and Michael L. Scott from Rochester and Weixing Ji from the Beijing Institute of Technology.

Parallel programming is notoriously difficult; there are a number of specialized systems to deal with it.  From the process/channels model, to transactional memory, to pure functional event-based processing, there have been a number of attempts to make it easier for programmers.

This paper takes a different approach of providing some parallel primitives, and then doing “data race detection” at runtime.  In short, a data race is when one task writes a value concurrently with another task reading or writing that same value.  (Concurrent reads don’t cause any problems, but concurrent writes may result in a non-deterministic final value.  A write concurrent with a read could read either the old or new value.)

A program with no data races is deterministic – multiple runs of the program will always produce the same results, since no operation can turn out differently across runs.
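To make the connection between races and nondeterminism concrete, here’s a tiny simulation of my own (not from the paper): the same pair of unsynchronized read-modify-write increments produces different final values under different interleavings.

```python
def run(schedule):
    """Execute a schedule of (thread_id, op) steps over one shared counter.

    Each thread does a load into its own register, then a store of
    register + 1 -- an unsynchronized increment.
    """
    shared = 0
    regs = {}
    for tid, op in schedule:
        if op == "load":
            regs[tid] = shared
        else:  # "store"
            shared = regs[tid] + 1
    return shared

# Serial order: both increments take effect.
serial = [(0, "load"), (0, "store"), (1, "load"), (1, "store")]
# Interleaved order: thread 1 loads the stale value, losing an update.
interleaved = [(0, "load"), (1, "load"), (0, "store"), (1, "store")]

assert run(serial) == 2
assert run(interleaved) == 1
```

A race-free program rules out the second kind of schedule mattering, which is exactly why it’s deterministic.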

The race detection system is surprisingly simple – log every access a task makes, then when tasks join, check that there are no bad collisions.  This is simplified by the limited primitives that enforce a “split/merge” structure, where a single task spawns multiple concurrent operations which must all complete (without interfering with each other) before the first task can continue.

This is made more practical by the addition of “reductions” and “atomic commutative” operations.  Reductions may be familiar from MapReduce – they are commutative, associative operations that each task can perform independently before merging.  Atomic commutative operations are high-level operations which enforce their own atomicity (e.g., by locking) but which are programmer-specified to not interfere (e.g., adding two elements to an unordered set has the same behavior regardless of ordering).

By combining low-level read/write access checking with these higher-level specifications of acceptable violations, they create a system which is reasonably performant (2x-4x slowdown) and apparently effectively perfect at catching data races.

What DPR actually is

The paper presents “Deterministic Parallel Ruby” (DPR) and TARDIS.  DPR is Ruby extended with a few extra instructions (initially via libraries, then for better performance as extensions to JRuby).  TARDIS is a runtime for DPR that detects when the concurrency contract of DPR is broken.

DPR adds a number of instructions.  First, “co” simply runs a number of functions concurrently.  “.all” is basically a “co” iterator – it runs a function over all elements of an array or range concurrently.  It is an error (detected by TARDIS) if the functions called by co or .all are not independent (in the sense of data races). 

Another concurrency operation is “futures”; these are functions that should be computed concurrently with the main task, and whose value will presumably be requested at some point in the future of that task.  For simplicity, these must not only not interfere with any other task, but must be completely “pure” – they can’t interact with any data outside their scope (any data must be passed in as arguments, which are deep copied to avoid interference).

Interestingly, this definition of pure seems to me to vary from the standard functional one in an important way – I/O is allowed.  A future could make a network request, as long as no resources from the surrounding scope are used.

Additionally, “pipelines” are provided as a convenience for parallelizing loops.  Similar to hardware pipelines, these are a series of functions to be performed on a stream.  At each logical step, each pipeline stage performs its operation on the output of the previous stage from the previous step, and passes its output to the next stage for the next step.  The first stage reads from the input stream, and the last stage writes to the output stream.  Fundamentally, it’s just calling “co” on each stage with the inputs and outputs hooked up properly.
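Observationally, a pipeline is just stage composition over a stream; here’s a sequential Python sketch of that semantics (my own illustration, not DPR’s API – in DPR the stages actually run concurrently on successive items):

```python
def pipeline(stages, stream):
    """Apply each stage in order to every item of the stream.

    Sequential reference semantics: the parallel version overlaps stage k
    on item i with stage k-1 on item i+1, but produces the same output.
    """
    out = []
    for item in stream:
        for stage in stages:
            item = stage(item)
        out.append(item)
    return out

# Two stages: increment, then double.
assert pipeline([lambda x: x + 1, lambda x: x * 2], [1, 2, 3]) == [4, 6, 8]
```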

The remaining operations are used to avoid spurious races.

First up are Reduction classes; these implement push and get methods.  Push adds an element to the set to reduce, and get gets the result of the reduction, which must be commutative and associative.  Examples given are addition, multiplication, minimum, and maximum.  Since the reduce operation is commutative and associative, each task can compute its own local result, and then the merge at the parent task can combine the results as the children complete (this is how a traditional parallel MapReduce works as well.)
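A toy version of such a Reduction class, in Python for illustration (push and get come from the text; the merge method and constructor arguments are my guesses at the shape of the implementation):

```python
import operator

class Reduction:
    """Per-task partial reduction over a commutative, associative op."""
    def __init__(self, op, identity):
        self.op = op
        self.value = identity
    def push(self, x):
        # Called within a single task; no synchronization needed.
        self.value = self.op(self.value, x)
    def merge(self, other):
        # Parent folds in a finished child's partial result; because the
        # op is commutative and associative, completion order doesn't matter.
        self.value = self.op(self.value, other.value)
    def get(self):
        return self.value

# Two "tasks" reduce disjoint halves; the parent merges in arbitrary order.
a = Reduction(operator.add, 0)
b = Reduction(operator.add, 0)
for x in [1, 2, 3]:
    a.push(x)
for x in [4, 5, 6]:
    b.push(x)
total = Reduction(operator.add, 0)
total.merge(b)   # completion order is arbitrary...
total.merge(a)
assert total.get() == 21   # ...but the result is the same
```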

More interestingly, atomic commutative operations (AC ops) allow the programmer to specify that specific methods may be run concurrently with each other.  These may touch the same data, but are guaranteed (by the programmer) to be commutative (and atomic, if necessary, using standard locking primitives).  One example is adding elements to a set – whatever order they’re added in, the final set is the same.  

Neither reductions nor AC ops are verified; that’s mentioned to be undecidable, with various heuristics as possible future work.  But for the moment, if a programmer messes up a reduction or AC op, all guarantees are off.

How TARDIS verifies all this

That about wraps it up for DPR.  Now how does TARDIS check that a DPR program is actually deterministic?

With logs!  Hooks are added to the Ruby VM to track, for each task, what object fields have been read or written.  (In Ruby, everything is an object, so this tracks all memory operations.)  More precisely, each task tracks:

  • local_set: all accesses by the finished subtasks in the current running group (co/.all/etc.)
  • current_set: all accesses by the task or finished groups.
  • parent: the parent task that spawned the task.

These sets are updated as you might expect, if you’re slightly clever.  When a task accesses a field, that access is added to its current_set as object id, field id, and type (read/write).  When a task finishes, it merges its current_set into its parent’s local_set, checking for conflicts between the two.  And when the last task in a group finishes, the parent merges its local_set into its current_set (without checking for conflicts).

This is slightly subtle – remember that only concurrent tasks (those spawned by the same call) should be checked.  Since the parent doesn’t run concurrently with its children, it can’t conflict with them.  But a task running concurrently with the parent could conflict with one of the children, so the child set does need to be merged into the parent’s set at the end so that the grandparent has the information it needs.
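The conflict check itself is straightforward set logic; here’s a simplified Python sketch of mine (each access is an (object id, field, kind) triple; the real system tracks this inside the Ruby VM):

```python
def conflicts(set_a, set_b):
    """Return the (object, field) pairs accessed by both sets where at
    least one of the two accesses is a write."""
    index = {}
    for obj, field, kind in set_a:
        index.setdefault((obj, field), set()).add(kind)
    bad = []
    for obj, field, kind in set_b:
        kinds = index.get((obj, field), set())
        # Conflict iff the other side wrote, or we wrote and they touched it.
        if "write" in kinds or (kind == "write" and kinds):
            bad.append((obj, field))
    return bad

# Concurrent reads of x are fine; a read/write collision on y is a race.
t1 = {(1, "x", "read"), (2, "y", "write")}
t2 = {(1, "x", "read"), (2, "y", "read")}
assert conflicts(t1, t2) == [(2, "y")]
```

When a task finishes, its current_set would be run through a check like this against the parent’s local_set before being merged in.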

This is complicated by reductions and AC operations (pipelines, as noted above, are effectively just special co operations, so need no special work, and futures are completely isolated from their enclosing scope.)  However, it’s not very complicated.  Reductions are, as noted, run independently in each thread.  And avoiding data races is as simple as ensuring that creation and get don’t interact with each other or push.  This is isomorphic to writes not interacting with each other or reads.  So reduction operations can just be converted to pseudo-writes and -reads, and handled like ordinary operations.

AC ops are slightly harder; instead of tracking local and current sets, we split each into a normal_set and atomic_set, and we also track the commutative method list of each task.  So now, any access within an AC op (tracked by a global per-task boolean, under the assumption that AC ops don’t nest) is added to the current_atomic_set, while all other accesses go to the current_normal_set.

Merging of child and parent histories happens for each set independently, but conflict resolution now checks the normal_set of the child against both the normal_set and atomic_set of the parent, as well as the atomic_set of the child against the normal_set of the parent.  That is, normal operations can’t interfere with each other or with AC operations (since non-AC ops modifying AC data structures don’t have any safety guarantees), but AC ops can stomp on each other all they want, since the programmer has guaranteed that they’re safe.

But AC ops are only safe within specific sets; adding elements to a set may be reorderable, but adding elements to two different sets may not be.  This is where the commutative method list (cml) comes in.  Each task tracks what AC ops it calls; if a merge then finds that two non-commutative AC ops were called concurrently, that’s a commutativity conflict.
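Putting the last two paragraphs together, the merge rule looks roughly like this (my simplification: accesses are just (object, field) pairs without read/write kinds, and `commutes` stands in for the declared commutativity relation):

```python
def dpr_conflict(child_normal, child_atomic, parent_normal, parent_atomic,
                 child_cml, parent_cml, commutes):
    """Return True if merging the child's history into the parent's
    reveals a determinism violation."""
    # Normal accesses may not collide with anything on the other side.
    if child_normal & (parent_normal | parent_atomic):
        return True
    if child_atomic & parent_normal:
        return True
    # Atomic vs. atomic is fine only if every concurrent pair of AC
    # methods was declared commutative with each other.
    return any(not commutes(a, b) for a in child_cml for b in parent_cml)

addr = ("set1", "items")   # an (object, field) touched by both siblings

# Two concurrent Set#add calls: same data, but declared commutative -- OK.
assert not dpr_conflict(set(), {addr}, set(), {addr},
                        {"add"}, {"add"},
                        commutes=lambda a, b: a == b == "add")

# A normal access colliding with a sibling's AC access is a conflict.
assert dpr_conflict({addr}, set(), set(), {addr},
                    set(), set(), commutes=lambda a, b: False)
```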

So this lets us run specified AC ops concurrently, while still guarding against errors.

There are more details around the implementation, but they’re largely specific to JRuby, so less interesting to me.

One exception is on level of detail – by default, all that’s reported is that an error occurred on a given field of a given object id.  But if more information is desired, TARDIS can be run in detail mode, which gives the name of the object and source file and line of the conflicting accesses.  Tracking this extra data is naturally more expensive.  However, since the program is deterministic until it fails, running it again with the same inputs should catch the same failure.  So it’s reasonable to run the program normally until you find an error, then rerun it in detail mode to find out what went wrong.


No paper is complete without evaluation!  In this case, they focus on performance, since correctness is assumed.  They implement a variety of benchmarks, and compare them against other “state of the art” race detectors.  The overhead of TARDIS is, as advertised, about 2-4x, and the benchmarks scale up to 16 cores about as well as they do without TARDIS.  The other race detectors are generally somewhat slower, except for the Cilk Nondeterminator, which is faster but strictly sequential, so doesn’t scale with core counts.

They also run five benchmarks that require AC ops, which the other race detectors don’t support.  So here they show TARDIS, TARDIS in detail mode, and TARDIS without AC checking versus the plain code.  Results are similar, with detail mode generally taking up to an additional 2x over the default mode, and disabling AC checking gets back some performance, but not a huge amount.

And finally, they take one of the benchmarks and replace the kernel (core inner loop) with an optimized Java version.  This, as expected, shows a dramatic speedup.

Also, interestingly, they point out that TARDIS found five concurrency conflicts in their evaluation code, “despite [their] familiarity with both parallel programming in general and DPR in particular”.

An Alternative

One comparison they discussed fairly heavily is shadow memory – tracking, along with each piece of data, how it’s being used by the various tasks.  This seems more obvious, but has some issues.  For one, it requires relatively expensive checks during execution, while using a log defers most of the work until task completion.  This both reduces redundant work (if a field is read or written multiple times, only a simple set addition is done each time rather than a full conflict check), and allows checks to be done offline in theory, though that seems to be less useful in real life.


Overall, this seems like another excellent application of theory to real life – I can certainly imagine using this system to parallelize a real-life system.  It does suffer somewhat from the fact (as the authors acknowledge) that scripting languages are generally slow, and performance is not a major concern.  But still, a project that wants to use a scripting language for ease of development could benefit from easy (and safe) parallelization, and Lua and Javascript have shown that they don’t have to be painfully slow.  So I could imagine that this type of work combined with a fast VM (say, something Lancet-ish) could result in very well performing code that’s also very easy to write.


Project Lancet: Surgical Precision JIT Compilers


Available from EPFL; by Tiark Rompf, Arvind K. Sujeeth, Kevin J. Brown, HyoukJoong Lee, Hassan Chafi, Kunle Olukotun, and Martin Odersky, from Oracle Labs, EPFL, and Stanford (for a full mapping, see the paper.)

On first reading, I was incredibly impressed by this paper.  By writing a JIT in Scala, they can pretty much expose the JIT to the program, allowing it to do things like specializing a function at runtime at the direction of the compiler.  More impressively, they even implement a full LISP-style quote/unquote macro system.

On a second reading, knowing what to expect, I’m more impressed that they actually built the whole system; the basic concepts seem more reasonable, but the amount of work to make the whole system work is still extremely impressive.

That said, I don’t know nearly as much as I’d like about JITs, and the paper feels like part of a series, building on previous papers on LMS and Delite, which I only know about from the brief description in this paper.  (Sounds like a good series to follow up on after this week…)

What it’s about

Anyway, this paper sets out to make JITs (Just In Time compilers) more predictable and friendly.  They point out that while JITs tend to produce better performance for many dynamic languages, they are unpredictable black boxes; tuning for them can be somewhat of a black art.  So realtime or high-performance systems still prefer compiled languages.  I’m not sure that Lancet is enough to fix that, but it’s an interesting start.

The basic idea is to have a somewhat stupider than usual JIT, but expose more functionality to the programmer.  (They explicitly call out that “coming up with clever heuristics or automatisms was decidedly a non-goal.”)  And, interestingly, they do this primarily via types rather than syntax; instead of writing a macro, you just write an expression with type “Rep[T]”, and that’s now a code block to compute a T, rather than a T itself.

Though I have to say here that I’m not entirely clear where the levels are actually split – Lancet is built on Delite and LMS (Lightweight Modular Staging), which are built on Graal, which either runs on or is a set of hooks for the HotSpot JVM.  My best interpretation is that Graal is a JIT that runs on the JVM, and JITs JVM code into JVM code, and LMS is a wrapper around Graal, and Lancet is a wrapper around LMS.  And Delite is a set of high-performance collection libraries that can be hooked in.

There’s also lots of talk about “staging” – this seems to be the core idea of having multiple levels of compilation all intermixed, so you can “stage” code and then optimize or evaluate it later.  There’s also the mysterious claim that “it has long been known that specializing an interpreter to a program yields a compiled program (the first Futamura projection [14]), and that staging an interpreter, which enables specialization for arbitrary programs, effectively turns an interpreter into a compiler [1]”.  This definitely seems worth following up on, and I suspect is vital to fully understanding the various layers and what’s actually happening.
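The Futamura claim is less mysterious with a toy example (my own, in Python rather than Scala): specializing an interpreter to one fixed program yields a residual function that behaves like the “compiled” program.

```python
def interp(prog, x):
    """A tiny interpreter: prog is a list of (op, operand) instructions."""
    for op, n in prog:
        if op == "add":
            x += n
        elif op == "mul":
            x *= n
    return x

def specialize(prog):
    # Fix the program argument; the result is a function of the input only.
    # A staged system would additionally unroll the loop and fold the ops
    # into straight-line code -- that's what turns this into compilation.
    return lambda x: interp(prog, x)

double_plus_one = specialize([("mul", 2), ("add", 1)])
assert double_plus_one(10) == 21
```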

More detail

In any case, here’s my understanding of what’s going on.

Lancet started by taking the Graal project (a JVM implemented in Java) and porting it to Scala.  They then “[add] staging annotations (Rep[T] types) in key places” to produce a simple compiler.  Optimizations are added – e.g., they define infix_+ to either produce a constant expression if both arguments are constant, or the normal addition if they aren’t.  Another example given is that final static fields can be converted to constants.  Both of these seem like fairly standard constant propagation.

Next they dive into a set of “macro” (JIT-time) operations available to the programmer.  These are all just ordinary functions which are interpreted by the compiler to generate code.  It’s not entirely clear what the primitives are (they implement several of these in terms of each other), but the operations they give include:

  • Lancet code wrappers:
    • fun(f): convert a function to a Lancet function.  This can then be executed, or have other fun stuff done to it.
    • exec(code): Roughly equivalent to fun(code)(); it just takes a snippet of Lancet code and calls it.
  • Compile time constants:
    • frozen(v): assert that v is a compile-time constant, and fail compilation otherwise.
    • freeze(code): evaluate code (at compile time, including side effects) down to a constant.
  • Code generation:
    • lms(code): “the classical LMS approach”; basically use an arbitrary function at compile-time to generate the output code.
    • quote/unquote: The standard LISP macro facilities.  Creepy seeing that on the JVM.
  • Optimization guidance:
    • likely(cond): Assume that cond is probably true.  It’s not clear exactly what this actually does.
    • speculate(cond): Compile the code assuming the condition succeeds.  However, there is a guard that will check the condition and drop back to the interpreter if the condition fails.  But this would be great for unlikely critical-path error handling.
    • stable(cond): Compile the code assuming that cond rarely changes.  This means recompiling each time it changes.
  • Direct instruction on running method:
    • slowpath: interpret the current continuation.
    • fastpath: compile the current continuation.
  • Call/cc
    • reset: set the boundary for shift.
    • shift: Basically call/cc, but only pass in the continuation up to the last shift.

There are some interesting interactions here – e.g., if a variable is stable, it is frozen; a change to a stable variable triggers recompilation, so it is actually a compile-time constant.  Also, note that the optimization guidance can be implemented in terms of slowpath and fastpath; they’re just a conditional with slowpath on one side and fastpath on the other.

So how is this all implemented?  Basically, by adding “staging” to the interpreter (hopefully to be explained in a future post), and then overriding “invokeVirtual” to handle macros.  They also provide a few helpers implemented in the compiler:

  • evalA: return abstract type information about an object
  • evalM: “evaluate materialized”: basically evaluate a Rep[T] to a T.

… I’d love to give more detail, but first I probably have to spend a few hours reading up on modern compiler theory.  And unfortunately, reading a paper per day doesn’t really allow that time.  But maybe in the future!  (Though it does seem to me that there’s really not that much going on here – I suspect all the theory is, as usual, masking some very straightforward ideas.  That’s kind of a key idea underlying this blog.)


So how does this work in practice?  Evaluation was primarily reimplementing bits of OptiML (a “deeply embedded DSL for machine learning”) as a library using Lancet, then writing ordinary, Rep-free code that used the library.  And they also used Delite (a library for high-performance computing) as the backend.

The library is apparently easier to use (claimed without evidence, though it’s not surprising), and performance is similar to using Delite directly, and both are competitive with hand-optimized C++ code.

A second evaluation sample was a toy sample computing the sum of the ASCII values of each word in an array of strings.  By using macros that convert zipWithIndex, map, and reduce to Delite calls, and then running the code through Lancet, they got a ~2x speedup.


It seems a little strange that they spend most of the paper (and some of the discussion) talking about how JIT optimizations should be exposed to the programmer, and then proceed to do benchmarks emphasizing how no application code needs to be changed to take advantage of the performance benefits.  But it does make sense, in a way – most programmers don’t want to (or shouldn’t) work at this low level, but library authors can do so to provide performance benefits for their users.

After writing this up, I still want to understand at a deeper level how this actually works, but my initial amazement at the concept has largely evaporated.  This is a very cool result, and may be useful in the real world, but ultimately it’s not as spectacularly new as I initially felt; it’s just combining a bunch of existing concepts I’m unfamiliar with into a practical system.  (Which we need more of, but it’s not necessarily what I want to read.)

On the other hand, this reminds me of some other systems I’ve gotten very excited by: Synthesis OS, HP’s Dynamo (not to be confused with Amazon’s.  Also, whoa, there’s a paper now!), and Transmeta’s code morphing – various takes on runtime code generation for performance.


The Locality-Aware Adaptive Cache Coherence Protocol

Kicking off the week of papers is The Locality-Aware Adaptive Cache Coherence Protocol by George Kurian and Srinivas Devadas from MIT and Omer Khan from the University of Connecticut.

(Apparently estimating time is hard – I expected an hour or two reading the paper and an hour writing; I ended up spending 45 minutes reading and an hour and a half writing (including rereading bits of the paper that I wanted to be sure I got right).  And the result is much longer than I was expecting, though still less than 10 minutes to proofread.  Anyway, enjoy!)

Caching is said to be one of two hard problems in computer science.  This is particularly true at the hardware level; modern CPUs have on-chip caches to avoid expensive trips to main memory.  Moreover, multicore CPUs typically have at least an L1 cache per-core, with an L2 (or L3) cache shared among all the cores.  This paper explores how to scale that model to many-core systems, with their primary evaluation being 64 cores, and references to costs for a 1024 core system.

The primary issue is that of communication – if each core has a 48 KB private L1 cache and a portion of the 16 MB distributed L2 cache, any data access will pull an entire 64-byte cache line into the core’s local L1 from a (probably remote) L2 cache.  Further, any writes by other cores to this cache line must cause additional traffic to tell the core to invalidate the local L1 line.  Now the next access will have to pull the line from scratch, causing yet more traffic.  And as the number of cores increases, so does the chance of collisions.

This traffic results in nontrivial power usage and can cause cache thrashing; if a portion of memory is heavily read and written by multiple cores, each core will be continuously invalidating and refetching the line, possibly using cache space better spent on more stable data.

The key insight of this paper is to avoid pulling data from L2 into L1 until it’s been accessed “sufficiently often” and, in the meantime, to access individual words out of the shared L2 cache.  This cuts down on network traffic for rarely used or commonly updated cache lines, since the entire line isn’t fetched.  It can also result in much better L1 utilization, since the L1 is reserved for “commonly accessed” lines, which won’t be evicted by one-off accesses.

The problem is in defining “sufficiently often” and “commonly accessed”.  A simple proposed solution is to track, for each L2 cache line, the last time each core referenced that line, as well as a utilization counter.  If a core goes too long without referencing a given line (defined in terms of entries stored in the core’s L1 cache), the count is reset.  Otherwise, the count is incremented on each access and, if it goes beyond the Private Caching Threshold (PCT), the line is promoted to the core’s private cache.

The line may later be evicted due to capacity if a new line replaces it, but we guarantee that any line replacing it is also well-utilized.  It may also be evicted by an invalidation if another core writes to the cache line; in this case, it may be immediately re-promoted if its local utilization (tracked in the L1 cache) is still above the PCT; if not, it is demoted to the L2 cache.
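The promotion logic from the last few paragraphs can be sketched roughly in software; everything here (the class names, `MAX_AGE`, the dictionary structure) is my own illustration of the idea, not the paper’s actual hardware design:

```python
# Illustrative sketch of the timestamp-plus-counter classifier; all names
# and constants are assumptions, not the paper's hardware design.

PCT = 4        # Private Caching Threshold (the paper's tunable parameter)
MAX_AGE = 64   # how long "too long without a reference" is, in accesses

class LineState:
    def __init__(self):
        self.utilization = 0   # accesses since the last reset
        self.last_access = 0   # logical time of the last access

class Classifier:
    def __init__(self):
        self.clock = 0
        self.state = {}        # (core, line_addr) -> LineState, kept at the L2

    def access(self, core, line_addr):
        """Return True if this access should promote the line into core's L1."""
        self.clock += 1
        s = self.state.setdefault((core, line_addr), LineState())
        if self.clock - s.last_access > MAX_AGE:
            s.utilization = 0              # went stale: reset the counter
        s.last_access = self.clock
        s.utilization += 1
        return s.utilization >= PCT        # promote only once "hot enough"
```

Until `access` returns True, the core would service the request as a single-word read from the (possibly remote) L2 slice; only afterward is the full line pulled into its L1.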

This solution, though, results in tremendous storage overhead; each 64-byte cache line now keeps an 8-byte timestamp and some counts in each L1 and for each of the 64 cores in the L2; this is at least an 8x overhead.  This is the classic caching algorithm tradeoff – you spend some amount of bits/space/time/energy, and try to save more than you spend.  Luckily, we can do better.
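As a sanity check on that “at least 8x” figure, here’s a back-of-envelope calculation; the field widths are my assumptions (the paper’s exact counter sizes may differ):

```python
# Rough overhead estimate for the naive scheme; field widths are assumed.
LINE_BYTES = 64
CORES = 64
TIMESTAMP_BYTES = 8
COUNTER_BYTES = 1   # assumed small utilization counter per core

tracking_bytes = CORES * (TIMESTAMP_BYTES + COUNTER_BYTES)  # per L2 line
overhead = tracking_bytes / LINE_BYTES
print(overhead)  # 9.0 -> roughly "at least 8x" per 64-byte line
```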

The main source of overhead is the timestamp of the last access.  We could try to make it shorter, but it’s better to just get rid of it entirely.  Previously, the PCT governed both promotion to and demotion from the private cache; we now add another threshold, the Remote Access Threshold (RAT).  The RAT is now used for promotions, while the PCT is used to remain in the cache on invalidations.  Unlike the PCT, the RAT is tunable, based on an approximation of cache pressure in the L1.  Initially, the RAT and PCT are equal; on an eviction where a line is demoted, the RAT is increased (up to a point).  This indicates cache pressure, so promotions become harder, which avoids thrashing.  On an eviction where the line is still well-utilized enough to remain in the L1, the RAT is reset back to the PCT; this is a weak signal that cache pressure isn’t too high, so we reset to allow new lines to come in more easily.
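The RAT adjustment might look something like this; the doubling step and the cap of 16 are my guesses at one plausible scheme (the paper describes a small number of discrete RAT “levels”):

```python
# Sketch of the adaptive Remote Access Threshold; the doubling step is
# an assumption, not necessarily the paper's exact update rule.

PCT = 4        # fixed: threshold to *stay* private across invalidations
RAT_MAX = 16   # cap on how hard promotion can get

class L1PressureTracker:
    def __init__(self):
        self.rat = PCT   # promotions start as easy as staying resident

    def on_eviction(self, line_was_demoted):
        if line_was_demoted:
            # A poorly utilized line was evicted: raise the bar for
            # future promotions, signalling cache pressure.
            self.rat = min(self.rat * 2, RAT_MAX)
        else:
            # A still-well-utilized line was evicted: pressure seems
            # manageable, so let new lines in easily again.
            self.rat = PCT
```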

This kills off the timestamps, but according to the paper, still has an overhead of 60% for 64 cores and 10x for 1024.

The next step is to store this data in the L2 cache for only a few cores instead of all of them.  When a new core asks for a cache line, add it to the tracked set if there’s an empty slot, and otherwise just take a majority vote of the cores that are being tracked.  That is, assume that the cores are accessing data more or less uniformly, so if most of the cores you are tracking would want this cache line to be private, this one probably does too.
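A sketch of limited tracking with a majority vote; `MAX_TRACKED` and the vote rule here are illustrative (the paper’s results suggest three tracked cores per line is enough):

```python
# Sketch of tracking only a few cores per L2 line, with a majority vote
# for everyone else; the structure here is illustrative, not the paper's.

MAX_TRACKED = 3

class L2LineTrackers:
    def __init__(self):
        self.private = {}   # core -> bool: would this core cache it privately?

    def should_cache_privately(self, core):
        if core in self.private:
            return self.private[core]       # tracked: use its own history
        if len(self.private) < MAX_TRACKED:
            self.private[core] = False      # free slot: start tracking it
            return False
        # Untracked core: assume it behaves like the majority of tracked ones.
        votes = sum(self.private.values())
        return votes * 2 > len(self.private)
```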

With this final optimization, the overhead is now claimed at 5.7%.  There’s some other overhead from tricks with the distributed L2 cache; I’m not sure precisely how that factors in.  But in any case, a 5.7% overhead seems reasonable, especially given a claimed 25% reduction in power usage and 15% reduction in run time.

And speaking of results, they spend a good amount of time in a simulator running various benchmarks with various parameters – they can tune the PCT (the accesses required to stay in the L1 cache, and the minimum of the RAT), the RAT maximum (the hardest it can be to get into the L1 cache), and the RAT increments (how quickly it changes).

The results are somewhat mixed (and complicated).  Power usage is almost always a win, and never a loss, with most benchmarks showing solid drops down to a PCT of 2-4 or so and limited gains after that.  Runtime is generally better or the same up to a PCT of 2-3, with several benchmarks showing worse performance past 4-5.  The explanation is that the cache is better utilized, with misses being converted into word misses rather than whole-cache-line misses, resulting in major power savings.  However, those same word misses can hurt performance, since more round trips to the L2 are required.

So the conclusion is that a PCT of 4 gives nearly all the power savings and runtime savings (where applicable), with minimal runtime penalty for the benchmarks that do poorly.

Next up is the RAT – just using one level doesn’t hurt runtime too much versus timestamping, but energy usage shoots up by 9%.  However, even two or four levels with a maximum of 16 are competitive with timestamping in both runtime and energy; for simplicity, they choose two levels.

And finally, how many cores should be tracked on each cache line?  Here, one is generally not enough; the results can be significantly worse if that core isn’t representative.  However, three suffice to keep in line with tracking every core, never doing more than 3% worse, and often doing better.  This actually makes sense – if each core is tracked individually, it is initially a complete unknown, and will take some time to converge.  However, if the cores are fairly homogeneous, the fourth core will immediately start behaving nearly optimally, since the cache has already “learned” from the first three.

So the final result is that, with parameters tuned for the given benchmarks, a 5.7% increase in die area can result in a 15% run-time reduction and a 25% energy reduction.  Given the small values of the parameters, I’m inclined to think that they are generic, and not overtuned for the benchmarks, arbitrary though they may seem.

It’s also worth pointing out that the evaluation is performed at an 11 nm process node, which is still a few years out.  This is somewhat justifiable, as real-world implementation of this technique is also a few years out.  However, as the authors call out, wires aren’t scaling as well as transistors, so the energy cost of moving bits around is a higher fraction of the total cost than on current processors.

Overall, though, this is a fun paper that takes a good, foundationally motivated idea for improving cache performance and finds a number of approximations that work almost as well (or better!) while making it feasible to implement in hardware.  I don’t generally focus on hardware, so I don’t know whether they’re missing other alternatives that could do as well or better, but I’m impressed with the result.
