Available from Brown University; authors are Hammurabi Mendes and Maurice Herlihy from Brown University.

This is the only pure theory paper of the week; as a result, there's more in it that I can't see a good way of explaining intuitively, so there will be some unfortunate gaps in this post. If you want more detail, read the paper; I don't think I can add much to those parts. But I can still summarize the rest of it.

This paper is an extension of epsilon-approximate agreement in Byzantine systems to multiple dimensions. A Byzantine system is a distributed system in which Byzantine failure is allowed; Byzantine failure is simply arbitrary failure, where a defective (or malicious) component can send arbitrary messages. This differs from the simpler crash-failure model, where failed processes are assumed to simply stop sending messages. Byzantine failures are generally much harder to handle, since identifying a misbehaving process can be difficult (or impossible).

The general agreement problem is for a set of processes, each with some input, to agree on an output that is one of the inputs. In the strict Byzantine failure model, this turns out to be impossible; the Byzantine processes can be arbitrarily confusing. Even with some relaxations, it's generally impossible to do anything if at least 1/3 of the processes fail. To see this, consider the classic Byzantine Generals problem: three generals need to decide whether to attack tomorrow. Each can send messages to the other two, and one of the three is a traitor. A loyal general can't tell which of the other two is the traitor, and the traitor can send conflicting messages to the other two. Intuitively, this is hard to solve; it turns out to be provably impossible.

So we relax the conditions somewhat – instead of having to agree on a specific answer that's one of the inputs, the processes merely all have to get close to the same answer (specifically, within epsilon), and that answer must lie in the region "contained" by the inputs (more precisely, their convex hull). This problem was already solved for selecting a single output from the real numbers; this paper extends that to selecting an answer from a higher-dimensional real space.

More precisely, it presents an algorithm where n processes (at most t of which are faulty) select a value from an m-dimensional space, assuming that n > t(m+2). (So if m=1, we get the standard 1/3 bound; for higher dimensions, the constraint gets gradually stronger.) Further, a simple example shows that this bound is tight: if any more processes can fail, the problem is unsolvable.

## Some motivation

The authors give two examples where their notion of agreement makes sense: robot convergence and distributed voting.

– Suppose a collection of robots are on unforgiving terrain and need to physically meet up. They're distributed around a space which is known to be safe, but need to settle on a meeting spot without heading outside the known space. However, some of them may be malicious, and may try to lure other robots outside this known space, where they may crash or fall into pits. This algorithm tells the robots how to choose a meeting spot within the area they currently surround.

– A voting system lets voters pick a spot in a multi-dimensional space indicating their preferences. They all need to agree on a final consensus, but it should be one that is in some sense an "average" of their votes. This scheme doesn't produce any particularly optimal consensus, but it will at least ensure that the result isn't completely outside the range of opinions expressed.

## Giants whose shoulders are stood on

This paper builds on two key preexisting ideas: reliable broadcast and the witness technique.

Reliable broadcast ensures that all processes see the same set of messages; it prevents Byzantine processes from sending different messages to different processes and mucking up the system. It tags messages with the sender, and sends out three rounds of messages: the initial broadcast from a process announcing the value it wants to send; a round of echoes from each process indicating that it received the initial broadcast; and a round of readys from each process indicating that it's received "enough" echoes. Finally, after receiving "enough" readys, the message is accepted.

More precisely, when a process wants to send a value, it sends the initial broadcast. Each process that receives this initial broadcast immediately sends an echo of the message to all other processes. Any process that receives at least n-t echoes (that is, as many echoes as there are non-failing processes) sends out a ready for this message, indicating that it's accepted it; and any process receiving at least t+1 readys (that is, at least one ready from a non-failing process) sends out its own ready. Finally, a process accepts the message after receiving n-t readys.
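To make the thresholds concrete, here's a toy Python simulation of a single broadcast instance with everyone honest. The class and message names are my own invention, and the synchronous in-process "network" is a deliberate simplification – a real implementation tags each message with the originating sender and runs over an actual asynchronous network:

```python
from collections import defaultdict

class Process:
    def __init__(self, pid, n, t, network):
        self.pid, self.n, self.t = pid, n, t
        self.network = network           # shared list of all processes
        self.echoes = defaultdict(set)   # value -> pids that echoed it
        self.readies = defaultdict(set)  # value -> pids that sent ready for it
        self.echoed = set()              # values this process has echoed
        self.readied = set()             # values this process sent ready for
        self.accepted = None

    def send_all(self, kind, value):
        for p in self.network:
            p.deliver(kind, self.pid, value)

    def deliver(self, kind, sender, value):
        if kind == "init" and value not in self.echoed:
            self.echoed.add(value)
            self.send_all("echo", value)
        elif kind == "echo":
            self.echoes[value].add(sender)
            # n - t echoes: enough to endorse the value with a ready
            if len(self.echoes[value]) >= self.n - self.t and value not in self.readied:
                self.readied.add(value)
                self.send_all("ready", value)
        elif kind == "ready":
            self.readies[value].add(sender)
            # t + 1 readys: at least one came from a working process, so join in
            if len(self.readies[value]) >= self.t + 1 and value not in self.readied:
                self.readied.add(value)
                self.send_all("ready", value)
            # n - t readys: safe to accept the value
            if len(self.readies[value]) >= self.n - self.t and self.accepted is None:
                self.accepted = value

n, t = 4, 1
network = []
network.extend(Process(i, n, t, network) for i in range(n))
network[0].send_all("init", "attack")  # process 0 broadcasts a value
print([p.accepted for p in network])   # -> ['attack', 'attack', 'attack', 'attack']
```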

Any process following the procedure can now send a message and eventually have all other working processes accept it: all n-t working processes will eventually echo the message to every other working process, so all n-t working processes will receive at least n-t echoes (which is at least t+1, since n > 3t), so all will send out readys, resulting in the n-t readys required to accept.

But a Byzantine process can't get multiple values accepted. To do so, it would have to collect n-t readys for each value, which requires readys from at least n-2t working processes (since only t of them can come from other Byzantine processes). For a working process to send a ready, it must have received either t+1 readys or n-t echoes. The latter must happen at least once – the first working process to send a ready can have received at most t readys (all from the failing processes), so it must have received n-t echoes. But by the same reasoning, these n-t echoes must include at least n-2t from working processes. So n-2t working processes received and echoed the accepted value. But there are only n-t working processes, and since n > 3t, n-2t is more than half of them. Thus, any accepted value must have been echoed by a majority of the working processes; since a working process echoes only one value per sender, and two majorities must overlap, there can only be one such value.

So reliable broadcast ensures that only one value can be accepted from a given process, and a Byzantine process can't confuse things by sending different messages to different processes. (This is under the assumption that n > 3t – otherwise, those n-2t processes are no longer a majority of the working processes, and two values could be accepted.)

The witness technique works around the fact that two processes won't necessarily share all of the same values – reliable broadcast only requires hearing from n-t processes, so we may continue before all the processes have sent out their values. Since two different processes may be missing different subsets, two processes may be guaranteed to agree on only n-2t of their values if they heard from different sets of processes.

So the witness technique basically has each process broadcast the set of n-t values it has received, and merge such messages from other processes into its own set. It's not clear to me how this prevents tampering by Byzantine processes; this seems like a good future paper to read.

## The Safe Area

The new contribution here is the safe area. The precise statement of the problem is as follows: given a set of processes, each of which has an input point in a multidimensional continuous space, they need to agree on a point which is inside the convex hull of the inputs of the working processes, to within epsilon tolerance (so the chosen points can differ by up to epsilon).
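In one dimension the two conditions are easy to state as a check; a small sketch (the function name is mine, just to pin down the definition):

```python
def valid_1d(inputs, outputs, eps):
    """Check epsilon-agreement in one dimension: all outputs lie within eps
    of each other, and each lies in the convex hull (interval) of the inputs."""
    lo, hi = min(inputs), max(inputs)
    return (max(outputs) - min(outputs) <= eps
            and all(lo <= y <= hi for y in outputs))

print(valid_1d([0.0, 4.0, 10.0], [5.0, 5.05], eps=0.1))   # -> True
print(valid_1d([0.0, 4.0, 10.0], [11.0, 11.0], eps=0.1))  # -> False (outside the hull)
```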

The safe area is a region which a process can conclude is definitely within this convex hull. The problem is that a malicious process may report points outside the hull, and if we believe it, we may pick a point outside the set.

So the solution is to look at the region that's inside the convex hull of every (n-t)-point subset of the received points. One of those subsets consists of exactly the points sent by the n-t working processes, so the region is inside that particular convex hull. (It's also inside a bunch of other convex hulls, but that's the one we care about.)
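In one dimension, where each convex hull is just an interval, this intersection can be computed by brute force. A sketch (my own code, not the paper's algorithm) showing how one far-away value from a Byzantine process gets excluded:

```python
from itertools import combinations

def safe_interval(points, t):
    """Intersect the convex hulls (intervals) of every (n - t)-point subset."""
    n = len(points)
    lo, hi = float("-inf"), float("inf")
    for subset in combinations(points, n - t):
        lo = max(lo, min(subset))  # intersecting intervals: keep the tightest ends
        hi = min(hi, max(subset))
    return lo, hi

# n = 4 reported values, at most t = 1 faulty: the outlier 100.0 may be a lie,
# and indeed it can't drag the safe interval toward itself.
print(safe_interval([0.0, 1.0, 2.0, 100.0], t=1))  # -> (1.0, 2.0)
```

(In one dimension this works out to trimming the t smallest and t largest values; the multidimensional safe area in the paper is the much harder generalization of that idea.)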

Unfortunately, this section is extremely technical, and I can't find any intuition for it. Suffice it to say, this region is nonempty, so we can compute it and pick a point inside it based on the messages received. Further, there's a notion of progress: each working process picks a point inside its safe region, resulting in a smaller region enclosed in the original convex hull. By repeating this procedure for a number of iterations, we end up with all the working processes having converged to within epsilon of each other.

## Optimality

This solution requires n > t(m+2) – the failing processes can make up less than a 1/(m+2) fraction of the total. With any more failing processes, the problem is impossible. The proof is fairly intuitive.

Assume n = t(m+2). Consider the m+1 points (0, 0, …, 0), (1, 0, …, 0), (0, 1, 0, …, 0), …, (0, 0, …, 0, 1) – that is, the origin plus the m unit vectors. (More precisely, the nonzero coordinate just needs to be bigger than epsilon, but I'm assuming epsilon is less than 1.) The total number of good processes is n-t = t(m+2) – t = t(m+1). Distribute these good processes with t at each of the m+1 points.

Assume that the t failed processes simply crash, and never send any messages. Now each process receives exactly t messages from each other point. Since it doesn't know which processes are Byzantine, it's possible that all t failing processes are at the same point, so it can't be sure that any one of these points is actually in the convex hull of the good inputs. (The network is assumed to be able to delay messages arbitrarily, so a process can never be sure there aren't still t good messages out there waiting to be delivered.) Thus, it can never pick any point besides its own, so the processes at different points can never agree on a point.
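The counting in this construction is easy to sanity-check (the helper name is mine):

```python
def lower_bound_setup(m, t):
    """Process counts for the impossibility example at exactly n = t(m + 2)."""
    n = t * (m + 2)               # one faulty process too many for the algorithm
    good = n - t                  # = t * (m + 1) working processes
    corners = m + 1               # the origin plus the m unit vectors
    per_corner = good // corners  # = t working processes at each corner
    return n, good, corners, per_corner

# With exactly t working processes per corner and up to t crashes, a receiver
# can never rule out that some entire corner's messages came from faulty processes.
print(lower_bound_setup(m=2, t=3))  # -> (12, 9, 3, 3)
```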

## Thoughts

This is from the math side of computer science, while all the previous papers were more from the real-world side, which makes for a rather different writeup – no evaluation, and I couldn’t really explain the interesting bits of the proof in a satisfactory manner.

Also, while they motivated the idea with “real-world” examples, they felt like examples often do in math – they kind of make sense, but the real reason for the work is that it’s a natural extension of another problem.

But the result is still interesting; my favorite part is actually the reliable broadcast, but the safe region contribution is also pretty good (and much more surprising; I wouldn’t have expected that). And the proof is interesting in its way; it just doesn’t lend itself (AFAICT) to an intuitive explanation.