Project Lancet: Surgical Precision JIT Compilers
Available from EPFL; by Tiark Rompf, Arvind K. Sujeeth, Kevin J. Brown, HyoukJoong Lee, Hassan Chafi, Kunle Olukotun, and Martin Odersky, from Oracle Labs, EPFL, and Stanford (for a full mapping, see the paper).
On first reading, I was incredibly impressed by this paper. By writing a JIT in Scala, they can pretty much expose the JIT to the program, allowing it to do things like specializing a function at runtime at the direction of the compiler. More impressively, they even implement a full LISP-style quote/unquote macro system.
On a second reading, knowing what to expect, I’m more impressed that they actually built the whole system; the basic concepts seem more reasonable, but the amount of work to make the whole system work is still extremely impressive.
That said, I don’t know nearly as much as I’d like about JITs, and the paper feels like part of a series, building on previous papers on LMS and Delite, which I only know about from the brief descriptions in this paper. (Sounds like a good series to follow up on after this week…)
What it’s about
Anyway, this paper sets out to make JITs (Just-In-Time compilers) more predictable and friendly. The authors point out that while JITs tend to produce better performance for many dynamic languages, they are unpredictable black boxes, and tuning for them can be something of a black art. So real-time or high-performance systems still prefer ahead-of-time compiled languages. I’m not sure that Lancet is enough to fix that, but it’s an interesting start.
The basic idea is to have a somewhat stupider than usual JIT, but expose more functionality to the programmer. (They explicitly call out that “coming up with clever heuristics or automatisms was decidedly a non-goal.”) And, interestingly, they do this primarily via types rather than syntax; instead of writing a macro, you just write an expression with type “Rep[T]”, and that’s now a code block to compute a T, rather than a T itself.
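To make that type-driven idea concrete, here’s a toy model of staging via types. This is entirely my own sketch – `Rep`, `Const`, `Plus`, `lift`, and `eval` are invented names, not Lancet/LMS’s actual API:

```scala
// Toy model of type-directed staging; all names here are
// illustrative, not the real Lancet/LMS machinery.
sealed trait Rep[T]                       // a staged computation producing a T
case class Const[T](v: T) extends Rep[T]  // a value lifted into the staged world
case class Plus(a: Rep[Int], b: Rep[Int]) extends Rep[Int]

def lift(v: Int): Rep[Int] = Const(v)
def add(a: Rep[Int], b: Rep[Int]): Rep[Int] = Plus(a, b)

// eval plays the role of the code generator/JIT: it turns the
// staged expression tree back into a plain value.
def eval(e: Rep[Int]): Int = e match {
  case Const(v)   => v
  case Plus(a, b) => eval(a) + eval(b)
}

val staged: Rep[Int] = add(lift(1), lift(2)) // a code value, not yet 3
assert(eval(staged) == 3)
```

The point is that the type `Rep[Int]` marks “code that computes an Int”, so the compiler can inspect and transform it before (or instead of) running it.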
Though I have to say here that I’m not entirely clear where the levels are actually split – Lancet is built on Delite and LMS (Lightweight Modular Staging), which are built on Graal, which either runs on or is a set of hooks for the HotSpot JVM. My best interpretation is that Graal is a JIT that runs on the JVM, and JITs JVM code into JVM code, and LMS is a wrapper around Graal, and Lancet is a wrapper around LMS. And Delite is a set of high-performance collection libraries that can be hooked in.
There’s also lots of talk about “staging” – this seems to be the core idea of having multiple levels of compilation all intermixed; you can “stage” code, and then optimize or evaluate it later. There’s also the mysterious claim that “it has long been known that specializing an interpreter to a program yields a compiled program (the first Futamura projection), and that staging an interpreter, which enables specialization for arbitrary programs, effectively turns an interpreter into a compiler.” This definitely seems worth following up on, and I suspect it’s vital to fully understanding the various layers and what’s actually happening.
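A minimal illustration of the first Futamura projection, under the (large) simplification that “specializing” is just partial application – the interpreter and expression language here are my own toy, not from the paper:

```scala
// A tiny expression language with one runtime input.
sealed trait Expr
case class Lit(n: Int) extends Expr
case class Add(a: Expr, b: Expr) extends Expr
case class Input() extends Expr // stands for the program's runtime input

// An interpreter taking (program, input). Fixing the program
// argument "specializes" the interpreter to that program.
def interp(prog: Expr)(input: Int): Int = prog match {
  case Lit(n)    => n
  case Add(a, b) => interp(a)(input) + interp(b)(input)
  case Input()   => input
}

val prog = Add(Lit(40), Input())
// The specialized interpreter is a function of the input alone:
// morally, a (very naive) compiled version of prog.
val compiled: Int => Int = interp(prog)
assert(compiled(2) == 42)
```

Real partial evaluation would also fold away the `match` on `prog`; currying only captures the shape of the idea.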
In any case, here’s my understanding of what’s going on.
Lancet started by taking the Graal project (a JIT compiler for the JVM, implemented in Java) and porting it to Scala. They then “[add] staging annotations (Rep[T] types) in key places” to produce a simple compiler. Optimizations are added on top – e.g., they define infix_+ to either produce a constant expression if both arguments are constant, or the normal addition if they aren’t. Another example given is that final static fields can be converted to constants. Both of these seem like fairly standard constant propagation.
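The infix_+ trick is essentially a “smart constructor” that folds constants as it builds staged code. Here’s a self-contained sketch of the idea – the `Rep` encoding and the name `infix_plus` are invented for illustration, not Lancet’s real definitions:

```scala
// Smart-constructor constant folding, the idea behind infix_+.
// This Rep encoding is invented for illustration only.
sealed trait Rep[T]
case class Const[T](v: T) extends Rep[T]
case class Plus(a: Rep[Int], b: Rep[Int]) extends Rep[Int]

def infix_plus(a: Rep[Int], b: Rep[Int]): Rep[Int] = (a, b) match {
  case (Const(x), Const(y)) => Const(x + y) // both known: fold at compile time
  case _                    => Plus(a, b)   // otherwise emit the addition node
}

assert(infix_plus(Const(1), Const(2)) == Const(3))  // folded away
assert(infix_plus(Plus(Const(1), Const(2)), Const(3)) ==
       Plus(Plus(Const(1), Const(2)), Const(3)))    // left symbolic
```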
Next they dive into a set of “macro” (JIT-time) operations available to the programmer. These are all just ordinary functions which are interpreted by the compiler to generate code. It’s not entirely clear what the primitives are (they implement several of these in terms of each other), but the operations they give include:
- Lancet code wrappers:
- fun(f): convert a function to a Lancet function. This can then be executed, or have other fun stuff done to it.
- exec(code): Roughly equivalent to fun(code)(); it just takes a snippet of Lancet code and calls it.
- Compile time constants:
- frozen(v): assert that v is a compile-time constant, and fail compilation otherwise.
- freeze(code): evaluate code (at compile time, including side effects) down to a constant.
- Code generation:
- lms(code): “the classical LMS approach”; basically use an arbitrary function at compile-time to generate the output code.
- quote/unquote: The standard LISP macro facilities. Creepy seeing that on the JVM.
- Optimization guidance:
- likely(cond): Assume that cond is probably true. It’s not clear exactly what this actually does.
- speculate(cond): Compile the code assuming the condition succeeds. However, there is a guard that will check the condition and drop back to the interpreter if the condition fails. But this would be great for unlikely critical-path error handling.
- stable(cond): Compile the code assuming that cond rarely changes. This means recompiling each time it changes.
- Direct instruction on running method:
- slowpath: interpret the current continuation.
- fastpath: compile the current continuation.
- reset: set the boundary for shift.
- shift: Basically call/cc, but only pass in the continuation up to the enclosing reset.
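As a semantic model only – the real primitives hook into the JIT and the current continuation – speculate’s guard-plus-deoptimize behavior can be pictured like this (all names illustrative):

```scala
// Toy model of speculate(cond): run the speculatively-compiled
// version while the guard holds; otherwise fall back to the
// interpreted version. Purely illustrative, not Lancet's API.
def speculate[T](cond: => Boolean)(compiled: => T)(interpreted: => T): T =
  if (cond) compiled   // guard passed: run the fast, specialized code
  else interpreted     // guard failed: "deoptimize" to the slow path

val x = 5
assert(speculate(x >= 0)(x)(math.abs(x)) == 5)   // guard holds
assert(speculate(-3 >= 0)(-3)(math.abs(-3)) == 3) // guard fails, slow path
```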
There are some interesting interactions here – e.g., if a variable is stable, it is frozen; a change to a stable variable triggers recompilation, so it is actually a compile-time constant. Also, note that the optimization guidance can be implemented in terms of slowpath and fastpath; they’re just a conditional with slowpath on one side and fastpath on the other.
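That reduction of the guidance hints to slowpath/fastpath can be sketched directly. Here slowpath and fastpath are stubs (the real ones are JIT directives on the current continuation), so this only models the shape of the definition:

```scala
// Guidance hints as conditionals over slowpath/fastpath, per the
// paper's observation. The stubs stand in for real JIT directives.
def slowpath(): Unit = () // stub for "interpret from here on"
def fastpath(): Unit = () // stub for "compile from here on"

def likely(cond: Boolean): Boolean = {
  if (cond) fastpath() else slowpath() // compile only the expected branch
  cond
}

assert(likely(1 + 1 == 2))
```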
So how is this all implemented? Basically, by adding “staging” to the interpreter (hopefully to be explained in a future post), and then overriding “invokeVirtual” to handle macros. They also provide a few helpers implemented in the compiler:
- evalA: return abstract type information about an object
- evalM: “evaluate materialized”: basically evaluate a Rep[T] to a T.
… I’d love to give more detail, but first I probably have to spend a few hours reading up on modern compiler theory. And unfortunately, reading a paper per day doesn’t really allow that time. But maybe in the future! (Though it does seem to me that there’s really not that much going on here – I suspect all the theory is, as usual, masking some very straightforward ideas. That’s kind of a key idea underlying this blog.)
So how does this work in practice? Evaluation was primarily reimplementing bits of OptiML (a “deeply embedded DSL for machine learning”) as a library using Lancet, then writing ordinary, Rep-free code that used the library. And they also used Delite (a library for high-performance computing) as the backend.
The library is apparently easier to use (a claim made without evidence, though it’s not surprising); performance is similar to using Delite directly, and both are competitive with hand-optimized C++ code.
A second evaluation was a toy example computing the sum of the ASCII values of each word in an array of strings. By using macros that convert zipWithIndex, map, and reduce to Delite calls, and then running the code through Lancet, they got a ~2x speedup.
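My guess at the shape of that toy benchmark in plain Scala – the paper’s exact code likely differs (and also uses zipWithIndex, which I’ve omitted):

```scala
// Sum of the ASCII values of each word, then a total across words.
// A guess at the benchmark's shape; not the paper's actual code.
val words = Array("foo", "bar")
val total = words
  .map(w => w.foldLeft(0)(_ + _.toInt)) // ASCII sum of one word
  .reduce(_ + _)                        // combine across the array
assert(total == 633) // 324 ("foo") + 309 ("bar")
```

The interesting part in the paper is that this ordinary-looking code, run through Lancet, gets its collection operations rewritten into Delite calls.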
It seems a little strange that they spend most of the paper (and some of the discussion) talking about how JIT optimizations should be exposed to the programmer, and then run benchmarks emphasizing that no application code needs to change to get the performance benefits. But it does make sense, in a way – most programmers don’t want to (and probably shouldn’t) work at this low level, but library authors can do so to provide performance benefits for their users.
After writing this up, I still want to understand at a deeper level how this actually works, but my initial amazement at the concept has largely evaporated. This is a very cool result, and may be useful in the real world, but ultimately it’s not as spectacularly new as I initially felt; it’s just combining a bunch of existing concepts I’m unfamiliar with into a practical system. (Which we need more of, but it’s not necessarily what I want to read.)
On the other hand, this reminds me of some other systems I’ve gotten very excited by: Synthesis OS, HP’s Dynamo (not to be confused with Amazon’s. Also, whoa, there’s a paper now!), and Transmeta’s code morphing – various takes on runtime code generation for performance.