Dataflow Execution of Sequential Imperative Programs on Multicore Architectures

By Gagan Gupta and Gurindar S. Sohi of the University of Wisconsin-Madison; the paper is available from that institution.

I read The Road to Parallelism Leads Through Sequential Programming this morning as well, but that delegates back to this paper for the interesting details, so I’m writing this one up instead.  (And if I’m feeling spectacularly lazy, I might write that up, too, but it feels like cheating to spend a day on it.)

This is another attempt at making parallel programming easier.  While Deterministic Parallel Ruby used simple parallelization primitives with runtime safety enforcement, this paper presents a model that uses runtime dataflow analysis of programmer-provided read and write dependencies to automatically parallelize an otherwise sequential program.

More specifically, it effectively annotates functions with the shared objects they read or write, then runs the program, looking ahead for functions that can be executed now because all the data they depend on is already available.  This is done by effectively giving each object to the next function that needs it, and making any later functions that need the object wait on that one (with a bit of complexity due to the reader/writer model).

The authors compare this to a modern superscalar CPU: such a CPU will typically look ahead for instructions that don’t depend on currently executing ones, and run those out of order in parallel with other instructions.  This requires a tremendous amount of bookkeeping, but is a vital part of how modern CPUs are fast and power efficient.

They then compare standard “static parallel” approaches (that is, parallelism specified in the code, including MPI, pthreads, or DPR) to VLIW instruction sets: those were attempts to move the scheduling logic from the CPU into the compiler, which could theoretically make smarter decisions.  However, the dominance of x86 (and ARM) suggests that the standard “figure it out at runtime” approach may be better.  While it’s not stated outright, the authors seem to imply that the same applies to parallel software, and that parallelizing code at runtime is the right approach.

The metaphor continues further: the authors refer to “Function Level Parallelism” (FLP) by comparison with “Instruction Level Parallelism” (ILP), the common term for the parallelism that CPUs perform.

What is it, really?

Traditional dataflow languages (AFAIK) are generally functional and force the programmer to write code in that style; this makes analysis easy and complete, since the runtime (or compiler) can freely reorder commands or functions.  This is great in theory, but hasn’t been adopted in practice, so this paper attempts to apply the ideas to C++.  (Or any other language, AFAICT, but C++ was used for demonstration.)

The core is three new library functions (no modification to the compiler is needed):

  • df_execute(write_set, read_set, function, [args]): Add the function to the run queue with the given read and write sets, and continue running the main program.
  • df_seq(object, function): When object is available, run function, and block the remainder of execution until function finishes.
  • df_end(): wait for all queued functions to complete before continuing; ends the parallel portion of the program until the next df_execute.

The read_ and write_ sets are simply STL sets containing objects; these are the objects the function is allowed to read or modify, and they are what the dataflow analysis works on.  The typical model is to use df_execute for the “interesting” function calls, letting the runtime schedule them in parallel where possible, and to use df_end when work needs to be done that doesn’t have a constrained working set; it’s effectively a barrier that waits for all previous computation to complete, and then the program proceeds.
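To make that concrete, here’s a hypothetical usage sketch.  df_execute and df_end are the paper’s functions, but the exact signatures, the shared_obj type, and the obj_set alias below are my guesses, not the real API:

#include <set>

struct shared_obj { /* would inherit from the runtime's token base class */ };
using obj_set = std::set<shared_obj*>;

// Stand-in declarations for the paper's runtime library.
void df_execute(obj_set* write_set, obj_set* read_set,
                void (*fn)(shared_obj*), shared_obj* arg);
void df_end();

void work(shared_obj* o);   // reads and writes only *o

void process(shared_obj* objs[], obj_set sets[], int n) {
    for (int i = 0; i < n; ++i) {
        sets[i].insert(objs[i]);                        // write set: exactly this object
        df_execute(&sets[i], nullptr, &work, objs[i]);  // queued; may run in parallel
    }
    df_end();  // barrier: everything queued above must finish before we continue
}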

df_seq seems a bit mysterious; AFAICT, it can’t be implemented in terms of df_execute and df_end, since it can run concurrently with previous df_executes that don’t touch its object, but blocks further execution until the df_seq function (though not any previous functions) finishes.  The example given of its usage is printing an object; this makes sense, but I’m not sure what other uses it has.  (My instinct is that df_execute is usually a better choice.)
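For what it’s worth, that print example might look roughly like this, reusing the hypothetical declarations from the sketch above (df_seq’s signature is again my guess):

void df_seq(shared_obj* obj, void (*fn)(shared_obj*));   // assumed signature
void print_result(shared_obj* result);                   // only reads *result

void report(shared_obj* result) {
    // Earlier df_executes that don't touch `result` keep running; everything
    // after this call waits until the print itself has finished.
    df_seq(result, &print_result);
}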

The result is that a typical serial program can have its core loop parallelized simply by replacing function calls with appropriate calls to df_execute, with df_end called at the end of the parallel segment to ensure everything finishes.  The runtime then magically takes the dependencies (passed in as sets at runtime, so precisely the objects used, rather than broader classes) and makes it all run as parallel as possible.

Under the hood

The runtime isn’t actually magic; it’s fairly simple.  Each object used in a read_set or write_set needs to inherit from a token base class, which has associated read and write “tokens” (apparently the standard terminology for dataflow systems).  These tokens basically track who is currently using the object and who is waiting for it.
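Roughly, I’d expect the per-object bookkeeping to have this shape (a simplified, single-threaded sketch of my own, not the paper’s code; the real runtime obviously needs synchronization and hooks into the scheduler):

#include <deque>
#include <utility>

struct Task;   // stand-in for a queued df_execute call

struct TokenObject {               // base class for anything in a read/write set
    int  readers = 0;              // tasks currently reading the object
    bool writer  = false;          // true while a writing task holds the object
    std::deque<std::pair<Task*, bool>> waiters;   // FIFO wait list: (task, wants_write)
};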

They then use a thread pool to execute the program across multiple cores.  Initially, only one thread is active, running the program as normal.  When a df_execute call occurs, the function attempts to acquire all the tokens it needs to run (i.e., checks whether all of its required shared objects are available).  This uses the standard read/write structure, where writes block and are blocked by both reads and writes, but reads don’t block each other.  If all the tokens are acquired, the task can be “delegated”, meaning that the thread starts running it and pushes the rest of the program onto its run queue.  Then another thread, if idle, can “steal” the rest of the program and continue running it while the previous main thread continues on the df_executed function.

If some tokens are unavailable, the task is added to (ordered) wait lists for any tokens it’s missing.  These will later be made available by the tasks currently using them.
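Continuing the sketch above, acquisition might look something like this (my guess at the logic, not the paper’s implementation):

// Try to take the object for one task; on failure, join the FIFO wait list.
bool try_acquire(TokenObject& obj, Task* task, bool wants_write) {
    bool busy = obj.writer                          // a writer holds it
             || (wants_write && obj.readers > 0)    // writers must wait for readers
             || !obj.waiters.empty();               // preserve FIFO order behind existing waiters
    if (busy) {
        obj.waiters.emplace_back(task, wants_write);
        return false;
    }
    if (wants_write) obj.writer = true;
    else             ++obj.readers;
    return true;
}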

And when a task completes, it releases its objects – each object will be passed on to the first task in its wait list (if appropriate), which may now run if it isn’t waiting on any other objects.  If no tasks are waiting on an object, it’s marked as free, ready for the next task that tries to grab it.  Note the subtlety that a reader may not actually release an object if another task is still reading it – the number of readers is tracked, so a writer can only proceed when no readers are using an object.
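And release, handing the object down the wait list in order; notify() stands in for whatever the runtime does to mark one of a task’s missing tokens as acquired (again a sketch, not the paper’s code):

void notify(Task* t);   // hypothetical: t becomes runnable once it holds all its objects

void release(TokenObject& obj, bool was_write) {
    if (was_write) obj.writer = false;
    else           --obj.readers;

    // Pass the object along in wait-list order.  Consecutive readers at the
    // head may all proceed; a waiting writer has to wait for readers to drain.
    while (!obj.waiters.empty()) {
        auto [task, wants_write] = obj.waiters.front();
        if (wants_write) {
            if (obj.readers == 0) {
                obj.writer = true;
                obj.waiters.pop_front();
                notify(task);
            }
            break;   // a writer at the head blocks everything behind it
        }
        ++obj.readers;              // another reader can share the object
        obj.waiters.pop_front();
        notify(task);
    }
}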

This is pretty much the implementation; there are some details in the task scheduling, which are apparently borrowed from Cilk.  Note that a thread encountering an immediately-executable df_execute switches to it and defers the rest of the main program; this ensures that as many df_executes as possible are running.  (The alternative, queueing the df_execute to run later, would always leave the main thread running, potentially spawning more df_executes.)

An example

It really is that simple; here’s the bzip2 kernel before and after parallelization (added or changed lines are marked with comments):

..

while (blockBegin < fileSize - 1) {
    blockBegin += updateBlock(blockBegin);
    block = new block_t(hInfile, Length);

    compress(block);
    wr_file(hOpfile, block);
}
..


..
op_set->insert(hOpfile);                      // added: file write set
while (blockBegin < fileSize - 1) {
    blockBegin += updateBlock(blockBegin);
    block = new block_t(hInfile, Length);
    block_set->insert(block);                 // added
    df_execute(block_set, &compress);         // changed from compress(block)
    df_execute(op_set, block_set, &wr_file);  // changed from wr_file(hOpfile, block)
}
..

That’s it.  And you get a 2x speedup on a Xeon, or a 9-11x speedup on a 4- or 8-socket Opteron.  Magic.

Evaluation

To evaluate their system, they compare runtime against pthreads across a few benchmarks on a quad-core hyperthreaded Intel Xeon and on 4- and 8-socket quad-core Opteron systems.  Results are generally competitive with pthreads, sometimes better, sometimes worse.  The cases where they do significantly worse generally come from scheduling a large number of small jobs; by increasing the block size handled by each function call, they can do at least as well as pthreads on the Opterons, and not much worse on the Xeon.  (They claim the discrepancy is due to bus speeds; this seems sensible.)

They also note that their code is much simpler; the bzip2 example above required almost no changes, while the pthreads version (implemented independently) uses multiple task-specific threads with complex communication patterns.  So being even remotely competitive is not bad for so little work.

There’s also a long discussion around microbenchmarks and how many clock cycles each operation takes.  This is probably useful, but seems incredibly boring to me.

Discussion

Some other points come up in the paper that didn’t fit into my narrative above:

Deadlocks can be a problem for dataflow systems; this is easily avoided here by using ordered wait lists.  If A and B wait for any of the same resources, whichever one was scheduled first gets first crack at them, so it’s impossible for two tasks to each be waiting on resources the other holds.

It can also be an issue if too many tasks get scheduled (e.g., if a loop turns out to be fully sequential, and every iteration is scheduled while the first one is still running).  This is easily avoided by limiting how many tasks can be scheduled before the main task is blocked.
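One simple-minded way to do that throttling (my sketch, not what the paper describes in detail) is a counter the main thread checks before queueing more work:

#include <atomic>
#include <thread>

std::atomic<int> in_flight{0};
constexpr int kMaxInFlight = 64;   // tuning knob

void before_queueing_task() {
    // The main program stalls here instead of flooding the scheduler.
    while (in_flight.load() >= kMaxInFlight)
        std::this_thread::yield();
    ++in_flight;
}

void on_task_complete() { --in_flight; }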

The system, while partially specified in the code, really is dynamic: the determination of dependencies happens at runtime, much like in an out-of-order CPU.  As is usually the case, this allows much more precision.  Two function calls may conflict in general, but at runtime we know whether they actually do for the particular arguments they’re called with.

Thoughts

I really like this – it feels like a much cleaner solution than yesterday’s DPR, since it avoids the new parallelization primitives and still “looks like” a sequential program.  I’m not sure I completely agree with the authors that this is The Way to go, but it’s definitely promising.

I am, however, a little concerned that it looks easier, but depends on the programmer to specify what shared objects each function uses.  I can easily see this resulting in the sort of weird nondeterministic bugs that this type of system is supposed to avoid – if I mistype the read or write set, all of the nice properties go out the window and I have a giant race condition again.

It seems like it should be possible to apply something like TARDIS here – in development or testing, track which objects are actually used and complain if they’re not in the given read/write set.  (Or a simpler system might work, since it seems like we could simply add an additional field marking which tasks are allowed to read/write each object and check that on each access instead of logging.)
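The simpler version might be as little as this (all names hypothetical; a real version would distinguish readers from writers and only compile the checks into debug builds):

#include <cassert>

struct checked_obj {
    const void* owner = nullptr;   // task currently allowed to touch this object

    void grant(const void* task)  { owner = task; }     // set when the task acquires the object
    void revoke()                 { owner = nullptr; }   // cleared when the task releases it

    // Called from every accessor/mutator of the object in a debug build.
    void check(const void* task) const {
        assert(task == owner && "object accessed outside its declared read/write set");
    }
};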

It also seems like it should be possible to effectively limit scope so that functions can’t access anything outside their modification set, but it’s not clear to me how easy that would be in C.  (I also may well be wrong and it’s impossible or hard.)

In any case, this looks like another great attempt at making parallel programming easy with minimal programmer input.
