Available from the University of Rochester; authors are Li Lu and Michael L. Scott from Rochester and Weixing Ji from the Beijing Institute of Technology.
Parallel programming is notoriously difficult, and a number of specialized systems have sprung up to deal with it. From the process/channels model, to transactional memory, to pure functional event-based processing, there have been many attempts to make life easier for programmers.
This paper takes a different approach: provide some parallel primitives, then do “data race detection” at runtime. In short, a data race is when one task writes a value concurrently with another task reading or writing that same value. (Concurrent reads don’t cause any problems, but concurrent writes may result in a non-deterministic final value, and a read concurrent with a write could see either the old or the new value.)
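That rule is simple enough to write down directly. A minimal Ruby sketch (the predicate name is ours, not the paper's): two concurrent accesses to the same location race unless both are reads.

```ruby
# Two concurrent accesses to the same location conflict
# unless both are reads.
def race?(access_a, access_b)
  access_a == :write || access_b == :write
end

race?(:read, :read)    # => false
race?(:read, :write)   # => true
race?(:write, :write)  # => true
```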
A program with no data races is deterministic – multiple runs of the program will always produce the same results, since no operation can turn out differently across runs.
The race detection system is surprisingly simple – log every access a task makes, then when tasks join, check that there are no bad collisions. This is simplified by the limited primitives that enforce a “split/merge” structure, where a single task spawns multiple concurrent operations which must all complete (without interfering with each other) before the first task can continue.
This is made more practical by the addition of “reductions” and “atomic commutative” operations. Reductions may be familiar from MapReduce – they are commutative, associative operations that each task can perform independently before merging. Atomic commutative operations are high-level operations which enforce their own atomicity (e.g., by locking) but which are programmer-specified to not interfere (e.g., adding two elements to an unordered set has the same behavior regardless of ordering).
By combining low-level read/write access checking with these higher-level specifications of acceptable violations, they create a system which is reasonably performant (2x-4x slowdown) and apparently effectively perfect at catching data races.
What DPR actually is
The paper presents “Deterministic Parallel Ruby” (DPR) and TARDIS. DPR is Ruby extended with a few extra instructions (initially via libraries, then for better performance as extensions to JRuby). TARDIS is a runtime for DPR that detects when the concurrency contract of DPR is broken.
DPR adds a number of instructions. First, “co” simply runs a number of functions concurrently. “.all” is basically a “co” iterator – it runs a function over all elements of an array or range concurrently. It is an error (detected by TARDIS) if the functions called by co or .all are not independent (in the sense of data races).
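To make the shape of these primitives concrete, here is a thread-based Ruby approximation (this is our sketch, not DPR's implementation; the names `co` and `all_par` are ours). The point is only the structure: spawn all the blocks, wait for all of them, and rely on the blocks being independent.

```ruby
# Sketch of a co-style primitive: run several callables concurrently
# and wait for all of them before continuing.
def co(*fns)
  fns.map { |f| Thread.new { f.call } }.each(&:join)
end

# ".all" is then just co over a collection.
def all_par(enum, &blk)
  co(*enum.map { |x| -> { blk.call(x) } })
end

results = Array.new(3)
co(-> { results[0] = 1 },
   -> { results[1] = 2 },
   -> { results[2] = 3 })
# results == [1, 2, 3]; each block writes a distinct slot, so no race.

squares = Array.new(3)
all_par(0..2) { |i| squares[i] = i * i }
# squares == [0, 1, 4]
```

Unlike real DPR, nothing here stops the blocks from interfering; that's exactly the gap TARDIS fills.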
Another concurrency operation is “futures”: functions that should be computed concurrently with the main task, and whose value will presumably be requested at some later point in that task. For simplicity, these must not merely avoid interfering with other tasks – they must be completely “pure”, unable to touch any data outside their own scope (anything they need must be passed in as an argument, and arguments are deep copied to avoid interference).
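The deep-copy discipline can be sketched in plain Ruby (our hypothetical `PureFuture`, not DPR's API; we use `Marshal` for the deep copy, and unlike real DPR this sketch doesn't actually forbid the block from capturing outer variables):

```ruby
# Sketch of a pure future: arguments are deep-copied via Marshal so the
# future's block cannot share mutable state with its creator.
class PureFuture
  def initialize(*args, &blk)
    copies = args.map { |a| Marshal.load(Marshal.dump(a)) }
    @thread = Thread.new { blk.call(*copies) }
  end

  def value
    @thread.value  # blocks until the future completes
  end
end

data = [1, 2, 3]
f = PureFuture.new(data) { |xs| xs.push(4); xs.sum }
f.value  # => 10
data     # => [1, 2, 3] -- the future mutated its own copy, not ours
```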
Interestingly, this definition of pure seems to me to vary from the standard functional one in an important way – I/O is allowed. A future could make a network request, as long as no resources from the surrounding scope are used.
Additionally, “pipelines” are provided as a convenience for parallelizing loops. Similar to hardware pipelines, these are a series of functions to be performed on a stream. At each logical step, each pipeline stage performs its operation on the output of the previous stage from the previous step, and passes its output to the next stage for the next step. The first stage reads from the input stream, and the last stage writes to the output stream. Fundamentally, it’s just calling “co” on each stage with the inputs and outputs hooked up properly.
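“Hooked up properly” can be made concrete with threads and queues (again a sketch of the idea, not DPR's implementation; `:done` is our end-of-stream marker):

```ruby
# Sketch of a pipeline: each stage is a thread that pops from the
# previous stage's queue, applies its function, and pushes downstream.
def pipeline(input, *stages)
  queues = Array.new(stages.size + 1) { Queue.new }
  input.each { |x| queues.first << x }
  queues.first << :done

  threads = stages.each_with_index.map do |stage, i|
    Thread.new do
      while (x = queues[i].pop) != :done
        queues[i + 1] << stage.call(x)
      end
      queues[i + 1] << :done  # propagate end-of-stream
    end
  end
  threads.each(&:join)

  results = []
  while (x = queues.last.pop) != :done
    results << x
  end
  results
end

pipeline([1, 2, 3], ->(x) { x + 1 }, ->(x) { x * 2 })  # => [4, 6, 8]
```

Output order is preserved because each stage is a single thread draining a FIFO queue.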
The remaining operations are used to avoid spurious races.
First up are Reduction classes; these implement push and get methods. Push adds an element to the set to reduce, and get gets the result of the reduction, which must be commutative and associative. Examples given are addition, multiplication, minimum, and maximum. Since the reduce operation is commutative and associative, each task can compute its own local result, and then the merge at the parent task can combine the results as the children complete (this is how a traditional parallel MapReduce works as well.)
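A minimal sum reduction in this style might look like the following (our sketch; the paper's Reduction classes are presumably more general, and a real implementation would avoid the hash lock by giving each task its own accumulator up front):

```ruby
# Sketch of a sum Reduction: each thread pushes into its own partial
# accumulator; get merges the partials after the threads have joined.
class SumReduction
  def initialize
    @partials = {}     # thread -> boxed partial sum
    @lock = Mutex.new  # only guards creating a thread's box
  end

  def push(x)
    box = @lock.synchronize { @partials[Thread.current] ||= [0] }
    box[0] += x  # thread-local from here on, so no contention
  end

  def get
    @partials.values.sum { |box| box[0] }
  end
end

r = SumReduction.new
(1..4).map { |i| Thread.new { r.push(i) } }.each(&:join)
r.get  # => 10
```

Because addition is commutative and associative, the order in which partials are merged can't change the answer.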
More interestingly, atomic commutative operations (AC ops) allow the programmer to specify that specific methods may be run concurrently with each other. These may touch the same data, but are guaranteed (by the programmer) to be commutative (and atomic, if necessary, using standard locking primitives). One example is adding elements to a set – whatever order they’re added in, the final set is the same.
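The set example could be written like this (a sketch of the pattern, not DPR's AC-op machinery; the class and method names are ours):

```ruby
require 'set'

# Sketch of an atomic commutative operation: a lock makes each add
# atomic, and adds commute because the final set ignores insertion order.
class ConcurrentSet
  def initialize
    @set = Set.new
    @lock = Mutex.new
  end

  def add(x)  # the AC op: atomic via the lock, commutative by design
    @lock.synchronize { @set << x }
  end

  def snapshot
    @lock.synchronize { @set.dup }
  end
end

s = ConcurrentSet.new
[3, 1, 2].map { |x| Thread.new { s.add(x) } }.each(&:join)
s.snapshot == Set[1, 2, 3]  # => true, whatever order the adds ran in
```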
Neither reductions nor AC ops are verified; that’s mentioned to be undecidable, with various heuristics as possible future work. But for the moment, if a programmer messes up a reduction or AC op, all guarantees are off.
How TARDIS verifies all this
That about wraps it up for DPR. Now how does TARDIS check that a DPR program is actually deterministic?
With logs! Hooks are added to the Ruby VM to track, for each task, what object fields have been read or written. (In Ruby, everything is an object, so this tracks all memory operations.) More precisely, each task tracks:
- local_set: all accesses by the finished subtasks in the current running group (co/.all/etc.)
- current_set: all accesses by the task or finished groups.
- parent: the parent task that spawned the task.
These sets are updated as you might expect, if you’re slightly clever. When a task accesses a field, that access is added to its current_set as object id, field id, and type (read/write). When a task finishes, it merges its current_set into its parent’s local_set, checking for conflicts between the two. And when the last task in a group finishes, the parent merges the group’s local_set into its own current_set (without checking for conflicts).
This is slightly subtle – remember that only concurrent tasks (those spawned by the same call) should be checked. Since the parent doesn’t run concurrently with its children, it can’t conflict with them. But a task running concurrently with the parent could conflict with one of the children, so the child set does need to be merged into the parent’s set at the end so that the grandparent has the information it needs.
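The conflict check at merge time reduces to a set intersection under the race rule. A Ruby sketch (our representation: each access is an `[object_id, field, type]` triple; the real implementation surely uses something more efficient than nested iteration):

```ruby
# Find accesses in a finishing child's set that conflict with accesses
# already accumulated from its siblings: same object and field, and at
# least one side is a write.
def conflicts(child_set, sibling_set)
  child_set.select do |obj, field, type|
    sibling_set.any? do |obj2, field2, type2|
      obj == obj2 && field == field2 && (type == :write || type2 == :write)
    end
  end
end

child   = [[1, :x, :read], [1, :y, :write]]
sibling = [[1, :x, :read], [1, :y, :read]]
conflicts(child, sibling)  # => [[1, :y, :write]] -- the read/read on :x is fine
```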
This is complicated by reductions and AC operations (pipelines, as noted above, are effectively just special co operations, so need no special work, and futures are completely isolated from their enclosing scope.) However, it’s not very complicated. Reductions are, as noted, run independently in each thread. And avoiding data races is as simple as ensuring that creation and get don’t interact with each other or push. This is isomorphic to writes not interacting with each other or reads. So reduction operations can just be converted to pseudo-writes and -reads, and handled like ordinary operations.
AC ops are slightly harder; instead of tracking local and current sets, we split each into a normal_set and an atomic_set, and we also track each task’s commutative method list. Now any access within an AC op (tracked by a per-task flag, under the assumption that AC ops don’t nest) is added to the current atomic_set, while all other accesses go to the current normal_set.
Merging of child and parent histories happens for each set independently, but conflict checking now compares the child’s normal_set against both the parent’s normal_set and atomic_set, as well as the child’s atomic_set against the parent’s normal_set. That is, normal operations can’t interfere with each other or with AC operations (since non-AC ops modifying AC data structures have no safety guarantees), but AC ops can stomp on each other all they want, since the programmer has guaranteed that they’re safe.
But AC ops are only safe within specific sets; adding elements to a set may be reorderable, but adding elements to two different sets may not be. This is where the commutative method list (cml) comes in. Each task tracks what AC ops it calls; if a merge then finds that two non-commutative AC ops were called concurrently, that’s a commutativity conflict.
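The cml check itself is a pairwise test. A sketch in Ruby (the function and the string encoding of methods are our invention; how the paper actually identifies methods may differ):

```ruby
# Two concurrently-called AC ops conflict unless the programmer declared
# that particular pair commutative.
def cml_conflict?(cml_a, cml_b, commutative_pairs)
  cml_a.product(cml_b).any? do |op_a, op_b|
    !commutative_pairs.include?([op_a, op_b].sort)
  end
end

# Adds to the same set commute; adds to two different sets do not.
commutes = [["set_a#add", "set_a#add"]]
cml_conflict?(["set_a#add"], ["set_a#add"], commutes)  # => false
cml_conflict?(["set_a#add"], ["set_b#add"], commutes)  # => true
```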
So this lets us run specified AC ops concurrently, while still guarding against errors.
There are more details around the implementation, but they’re largely specific to JRuby, so less interesting to me.
One exception is the level of detail – by default, all that’s reported is that an error occurred on a given field of a given object id. If more information is desired, TARDIS can be run in detail mode, which gives the name of the object and the source file and line of the conflicting accesses. Tracking this extra data is naturally more expensive. However, since the program is deterministic up until it fails, running it again with the same inputs should reproduce the same failure. So it’s reasonable to run the program normally until you find an error, then rerun it in detail mode to find out what went wrong.
No paper is complete without evaluation! In this case, they focus on performance, since correctness is assumed. They implement a variety of benchmarks, and compare them against other “state of the art” race detectors. The overhead of TARDIS is, as advertised, about 2-4x, and the benchmarks scale to 16 cores about as well as they do without TARDIS. The other race detectors are generally somewhat slower, except for the Cilk Nondeterminator, which is faster but strictly sequential, so doesn’t scale with core counts.
They also run five benchmarks that require AC ops, which the other race detectors don’t support. Here they show TARDIS, TARDIS in detail mode, and TARDIS without AC checking versus the plain code. Results are similar: detail mode generally costs up to an additional 2x over the default mode, and disabling AC checking gets back some performance, but not a huge amount.
And finally, they take one of the benchmarks and replace the kernel (core inner loop) with an optimized Java version. This, as expected, shows a dramatic speedup.
Also, interestingly, they point out that TARDIS found five concurrency conflicts in their evaluation code, “despite [their] familiarity with both parallel programming in general and DPR in particular”.
One comparison they discussed fairly heavily is shadow memory – tracking, along with each piece of data, how it’s being used by the various tasks. This seems more obvious, but has some issues. For one, it requires relatively expensive checks during execution, while using a log defers most of the work until task completion. This both reduces redundant work (if a field is read or written multiple times, only a simple set addition is done each time rather than a full conflict check), and allows checks to be done offline in theory, though that seems to be less useful in real life.