Is Dataflow the New Black?

Russel Winder, Concertant LLP

Everyone is probably tired of reading that parallelism has now arrived after 30 years of being the coming technology. It is true though: parallelism is the “now” technology. Almost every new computer is multicore and so, in effect, has two, three, four or more processors. The problem, however, is that whilst computer hardware may now be capable of parallel computation, the software is (in general) not.

When people first started investigating concurrency on uniprocessor machines, the concept of a process took form. Computers could run multiple processes (albeit one at a time using time sharing techniques), each having its own memory, with communication between processes being effected by sending messages from one to another. This worked for a while, but there were perceived performance issues: processes were seen as too heavyweight, i.e. they consumed too many resources; something with less overhead was called for. Thus threads were born. Processes and message passing were eschewed; threads and shared memory became the tools of choice for concurrency. Of course, as anyone familiar with concurrency theory and practice knows, sharing resources is where the problems start: almost all problems in programming concurrent systems relate to (mis-)use of shared resources. Moreover, all these problems become even harder to deal with when there is real parallelism and not just the appearance of parallelism available on single processors using time sharing.

So as long as you don’t actually use shared memory or other resources, threads are useful. The problem is that people use the tools they have and generally the tools on offer are threads and shared memory. Is it any wonder that people find programming concurrent and especially parallel systems difficult?

Looking back, occam was a paradigm for a direction in parallel computing whose basic ideas have become subverted over time – lots of lightweight single-threaded processes communicating by message passing. Of course echoes of this approach can be seen today. Erlang shares the same underlying model with occam. In parallel (!), the HPC community developed MPI to exploit cluster-based parallelism – harnessing clusters effectively requires a communicating processes model for architecture and software design.

Despite the obvious direction indicated by those already harnessing parallelism (i.e. use lightweight processes and message passing), the current mindset in the C, C++, Fortran and Java worlds is one of controlling shared resources in a shared memory, multi-threaded world. The next C++ standard (C++0x as it is known) will introduce a standard threads model with emphasis on careful management of resources using locks and futures. Of course Java already has all this technology and has had for years. The problem is that the average programmer finds concurrency hard and parallelism harder. The programming tools being offered are not the right tools to implement parallel algorithms safely and correctly for practical applications.

Many people have tried to create MPI bindings for Java in an attempt to replicate the direction that C, C++ and Fortran have followed. These have not taken off. This is not just because of the ubiquity of threads and shared memory as the tools of concurrency in Java; it is also because MPI was based upon the C, C++ and Fortran programming models and doesn’t sit that well in Java. Java has serialization and RMI (Remote Method Invocation), so the very low-level approach of MPI is incompatible with Java.

What is needed is lightweight processes and a message passing framework that sits well with Java. Peter Welch and his team at the University of Kent have JCSP, which is an attempt to create a version of CSP (Communicating Sequential Processes – the underlying theory behind occam) for Java. Some regard this as just an academic exercise. Thinking in terms of multiple, single-threaded processes that exchange messages appears to be an alien approach for the majority of programmers – those not already familiar with occam, Erlang or MPI.
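
To make the "single-threaded processes that exchange messages" style concrete, here is a minimal sketch in plain Java. It does not use JCSP or any of its API; it assumes only the standard java.util.concurrent queues as stand-ins for channels, and ordinary Java threads as stand-ins for lightweight processes (which they are not). The class name, the method name and the POISON end-of-stream marker are all invented for this illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class MessagePassingSketch {
    // Hypothetical end-of-stream marker, invented for this sketch.
    private static final int POISON = -1;

    /** Runs a producer and a consumer that share nothing except channels. */
    public static long sumOfSquares(int n) {
        // A small bounded queue plays the role of a channel between the two.
        BlockingQueue<Integer> channel = new ArrayBlockingQueue<>(4);
        // A second channel carries the single result back to the caller.
        BlockingQueue<Long> result = new ArrayBlockingQueue<>(1);

        // Producer: sends 1..n, then the end-of-stream marker.
        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= n; i++) {
                    channel.put(i);
                }
                channel.put(POISON);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Consumer: keeps its running total in a local variable, so no
        // state is shared with the producer; only messages pass between them.
        Thread consumer = new Thread(() -> {
            try {
                long total = 0;
                for (int v = channel.take(); v != POISON; v = channel.take()) {
                    total += (long) v * v;
                }
                result.put(total);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
        try {
            return result.take(); // blocks until the consumer reports back
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(10)); // prints 385
    }
}
```

Neither thread takes a lock or touches a variable owned by the other; all coordination happens by putting values into, and taking values out of, the channels.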

In a sense this is quite bizarre since the object-oriented view of computing is about objects passing messages to each other. Perhaps a change is needed in the way people translate the ideas of object orientation into programs? Perhaps a return to the Actor Model is needed? In any event, the occam, Erlang, MPI model has proved its worth in the telecoms and HPC communities.

In fact there is increasing commercial interest in process-based models. One example is Pervasive Software’s DataRush. DataRush is a commercial “lightweight processes with message passing” framework based upon Java technologies that is showing that dataflow parallelism can be harnessed for business-oriented applications.

The dataflow approach to algorithms is different to the CSP one even though both are based on connected processes. In CSP, the computation is driven by moving data from one process to another. In dataflow, the data does not move, but connected processes are automatically informed whenever a value changes, so that they can then undertake some computation. The classic dataflow framework is the spreadsheet – the data stays in the same cells, but changes to the values cause other cells to undertake computation.
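
The spreadsheet picture can be sketched in a few lines of Java. This is only an illustration of the idea as described above, not DataRush or any real dataflow framework: the Cell class, the formula helper and the demo method are all hypothetical names, and the sketch is single-threaded, so it shows the change-notification structure rather than any parallelism.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntBinaryOperator;

public class SpreadsheetSketch {
    /** A hypothetical spreadsheet-style cell holding one int value. */
    static class Cell {
        private int value;
        private final List<Runnable> dependents = new ArrayList<>();

        Cell(int initial) {
            value = initial;
        }

        int get() {
            return value;
        }

        /** Stores a new value and tells dependents only if it changed. */
        void set(int newValue) {
            if (newValue == value) {
                return; // no change, so nothing downstream recomputes
            }
            value = newValue;
            for (Runnable dependent : dependents) {
                dependent.run();
            }
        }

        void onChange(Runnable dependent) {
            dependents.add(dependent);
        }
    }

    /** Creates a cell whose value is recomputed whenever a or b changes. */
    static Cell formula(Cell a, Cell b, IntBinaryOperator f) {
        Cell out = new Cell(f.applyAsInt(a.get(), b.get()));
        Runnable recompute = () -> out.set(f.applyAsInt(a.get(), b.get()));
        a.onChange(recompute);
        b.onChange(recompute);
        return out;
    }

    public static int demo() {
        Cell a = new Cell(2);
        Cell b = new Cell(3);
        Cell two = new Cell(2);
        Cell sum = formula(a, b, Integer::sum);                 // 2 + 3 = 5
        Cell doubled = formula(sum, two, (x, y) -> x * y);      // 5 * 2 = 10
        a.set(10); // sum becomes 13, which in turn makes doubled 26
        return doubled.get();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 26
    }
}
```

The values stay in their cells; what travels along the connections is only the notification that something changed, after which each dependent cell reads its inputs and recomputes.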

This inversion of what it is that is flowing between the processes is at the heart of why dataflow systems can process huge volumes of data very quickly. The data itself never moves, only information about changes of value moves. Pervasive are reporting that, using DataRush, they are getting almost linear speedups as more processors are added, and phenomenal speedups over traditional data analysis approaches. Theirs is an example of how, by taking a dataflow approach, all the issues of thread management and shared resource management can be hidden behind a more easily programmed and more easily accessible API.

Computing languages and computing styles often have an element of fashion about them, but a few really good ideas get lost simply because they have never been fashionable. Object orientation was a fashion in the 1980s and 1990s; now it is just the standard way of doing things. Dataflow may not have been fashionable in the past, but it offers much more to future systems design than just a passing fad. Good ideas do need a period of being the “in thing”, even if it is just to get them established – just as happened with object orientation. Dataflow is definitely the new black. Of course, it is not a silver bullet; dataflow is a useful tool, not a panacea.

Copyright © 2005–2020 Russel Winder - Creative Commons License BY-NC-ND 4.0