Available from NICTA. Authors are David Greenaway, Japheth Lim, June Andronick, and Gerwin Klein from NICTA and UNSW, Sydney, Australia.

This paper feels much more technical than the earlier ones, and is building on earlier work in a way that’s hard to jump into in the middle. It assumes basic knowledge of Isabelle/HOL, an automatic proof system, and gives only a brief description of AutoCorres, an existing tool for converting C programs to a higher-level structure in a provably correct way. That said, even with only a basic background, the new contributions are still interesting.

The ultimate goal is to be able to easily prove properties about C programs in a machine-checkable way; a trivial example is proving that if you have two variables a and b with values x and y, after calling swap(&a, &b), a will contain y and b will contain x.

Or for a more complex example, the Schorr-Waite algorithm iterates over all the nodes in a binary tree by storing only two bits in each node and manipulating pointers. The correctness theorem here is basically that if you have a valid tree with no nodes marked as visited, after running the algorithm all nodes will be marked as visited.

There’s been a lot of work on proving these sorts of statements in toy languages, but doing it in C with a full memory model, including actual C arithmetic and the C memory model is hard. The AutoCorres framework can prove statements about C, but require a lot of work around these two areas of arithmetic and memory. This paper helps with that – it provides a model for arithmetic on the finite integers used in real-world programming (both signed and unsigned), and a way of handling memory that allows addresses to be treated as different types at different points.

The basic idea for arithmetic is convert statements about word-size arithmetic (e.g., that low < mid and mid < high in int mid = (low + high) / 2) into a guard stating that overflow doesn’t occur, along with the statement to be shown. If overflow is intended, the user can still make this more precise, but in the common case where overflow is unintentional, no additional effort is required.

For the heap, they effectively use a byte array for the underlying instance, but then provide abstractions to treat memory as specific types by tagging locations. This abstraction can then check for things like alignment and non-overlapping for higher-level functions, while still allowing lower-level functions to treat memory as a simple untyped byte array.

And they do both of these in Isabelle/HOL including proof generation, so there is a machine-checkable proof that the program after applying these abstractions computes the same result as the original program. So any statement proven using these abstractions applies to the original C program, but without all the pain of proving things about C.

## How to prove things

The basic idea behind a theorem-prover like Isabelle/HOL or coq is to use types to represent logical statements; the statement “if A, then B” is converted to a function that takes a value of type A as input, and produces a result of type B. Other logical connectives work as well; AND is a pair (that is, “A and B” is represented by a tuple containing an object of type A and an object of type B), and OR is a union type; either an A or a B.

Now a proof is just an expression, and checking the proof just means checking the type of the expression. Constructing a proof, on the other hand, is still hard – coming up with a term of a given type is, in general, no easier than constructing a proof.

So Isabelle and coq have a bunch of rules that tell them how to prove things; e.g., if you have an expression “A AND B”, first prove A, then prove B. But this isn’t always the right thing; maybe you know that C is true and C implies A AND B; in this case, splitting it up is the wrong choice. So they have a bunch of heuristics and “tactics” they can apply to try to prove things, but still generally need a programmer’s help to fill in the hard parts.

This is why, even though we can prove theorems about C programs, we want better approaches – by automating more and more of the system, we can prove specific statements with less work. Or we can spend the same amount of work and prove statements about larger systems.

## What AutoCorres does

Before this paper, AutoCorres could already prove things about C programs, but it took a good amount of work to do so. As I understand it, the C program is first translated to a language called Simpl which is embedded in Isabelle. This step is unproven; it’s merely kept as simple as possible to avoid bugs. Simpl is a very low-level language that is a pain to use. AutoCorres then (provably correctly) munges this into a much nicer form that often resembles the original code.

However, dealing with arithmetic is a pain. Isabelle has lots of theorems and tactics to handle proving things about the naturals, but C integers aren’t natural numbers; unsigned numbers wrap around, and signed numbers are undefined if they wrap (meaning the compiler can do whatever it wants). And several theorems that apply to the naturals aren’t even true on C values: some examples:

- s = s + 1 – 1 (fails for signed MAXINT, since overflow is undefined)
- s = -(-s) (fails for signed MININT, since -MININT is undefined)
- u + 1 > u (fails for unsigned MAXINT, since it wraps to 0)
- u * 2 = 4 => u = 2 (u could also be MAXINT/2 + 2)
- -u = u => u = 0 (u could also be MAXINT/2)

So traditionally, any theorem about C integers must be proven by hand, ignoring the years of work spent on standard Isabelle theorems. The solution is to simply treat C integers as normal unbounded integers, and add in constraints that they’re not allowed to overflow. Most of the time this works; if it doesn’t, the programmer can still explicitly prove the theorem the normal way.

And, incidentally, this can bring up real issues: in the binary search example, “int mid = (low + high) / 2”, one generated constraint is that (low + high) < MAXINT. And indeed, this breaks a most binary searches, including some proven correct in insufficiently precise models, as a Google blog post pointed out in 2006.

There’s lots of details about exactly how this translation is proven correct and inference rules to use with it, but without a deeper understanding of AutoCorres and Isabelle, I can’t really say much about them.

Memory is handled with a bit more complexity. The major issue is that memory is untyped, so a generic solution to handle things like alignment and overlapping is tricky. For example, the naive statement of correctness for a swap function would be that, if the value at address a is x, and the value at address b is y, then after running swap a b, address a should contain y and b should contain x. But this is false; a and b must be properly aligned, they can’t contain NULL (since dereferencing NULL is undefined), and they can’t partially overlap.

So to abstract this all away, they use “heap-lifting”. They tag each address on the heap as either the start of a given type, or as a “footprint” continuing the previous type. Then they define a check and getter for each type: is_valid_w32(ptr) takes a pointer and returns true if address ptr is 32-bit aligned, the start of a 32-bit word and the next three bytes are all footprints. Similarly, heap_w32(ptr) gives the value stored at the pointer as a 32-bit word, assuming it’s of the right type.

Now these typed heaps can be used by the prover to automatically derive simpler true statements. For example, swap a b now has a simple precondition of (roughly) is_valid_w32 a AND is_valid_w32 b AND s[a] = x AND s[b] = y; the postcondition is the same, but with x and y swapped. All of the complexity of the model is rolled into the heap checks. Apparently this is also enough for Isabelle to be able to automatically infer much more, as well.

## Evaluation

One point brought up is the challenge of proving that this fragment returns 4:

w->next = x; x->next = y; y->next = z; x->next = z;

w->data = 1; x->data = 2; y->data = 3; z->data = 4;

return w->next->next->data;

It seems straightforward to a person, but all of memory dereferences make it hard for automated tactics to apply. With the memory model used here, the standard “auto” tactic suffices to prove it (after giving it all the rules above).

This system was used on a number of systems, including the 10k line seL4 kernel and some other systems. For the seL4 kernel, the total runtime was just over an hour, though apparently that was parallelizable and results can be cached for future runs.

They also look more closely at the “hello world” of pointer aliasing programs, linked list reversal, and the Schorr-Waite algorithm, which traverses all nodes in a binary tree using only two bits of memory per node, with lots of pointer munging.

In both cases, they took a previous high-level proof by Mehta and Nipkow, and ported it over to their system with relatively few changes. This is impressive, given that the previous model didn’t deal with general memory or finite integers.

The total lines of code in the proof increased by just over 50% from 577 lines to 807, and they compare to a previous full C proof in coq, which was 3317 lines (though apparently coq is less automated, but not sufficiently so to explain the difference.)

## Thoughts

This seems like another area ripe for a deep dive; I can vaguely understand what’s cool about this paper, but don’t really understand what’s going on and can’t really evaluate it myself. Maybe at some future point I’ll try to read a series leading up to here and re-evaluate this.

But as far as I can tell now, this is a good step towards being able to prove C programs correct. That would be particularly useful for, say, OpenSSL…