Russel Winder, Concertant
Over the last few years processors have been accreting more and more cache, structured in more and more elaborate ways. What is the problem that means cache has become so important? Processors are fast and memory access is slow. So to avoid having processors wait for memory to deliver data (twiddling their thumbs as it were) they each have caches – very fast access copies of a selection of data stored in memory, connected directly to the processor. The observation that drives the design and implementation of cache is that programs generally want to make use of data they just made use of. This is the principle of locality.
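The effect of locality on a cache can be sketched in a few lines of Python. This is only an illustration: the `LRUCache` class and `hit_rate` function are hypothetical names, and least-recently-used eviction with a capacity of eight entries is an assumption made for the sketch, not a description of any particular processor's cache.

```python
from collections import OrderedDict


class LRUCache:
    """Toy cache that keeps the most recently used addresses (assumed policy)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        if addr in self.lines:
            # Hit: the address was used recently and is still cached.
            self.lines.move_to_end(addr)
            self.hits += 1
        else:
            # Miss: fetch from (slow) memory and evict the oldest entry if full.
            self.misses += 1
            self.lines[addr] = True
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)


def hit_rate(addresses, capacity=8):
    """Fraction of accesses served from the cache."""
    cache = LRUCache(capacity)
    for addr in addresses:
        cache.access(addr)
    return cache.hits / len(addresses)


if __name__ == "__main__":
    reuse = [i % 4 for i in range(1000)]   # keeps returning to the same 4 addresses
    streaming = list(range(1000))          # never touches an address twice
    print("reuse:", hit_rate(reuse))
    print("streaming:", hit_rate(streaming))
```

A program that keeps returning to the same few addresses is served almost entirely from the cache, whereas one that never revisits an address gains nothing from it.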
In his talk today at SC08, Tim Mattson of Intel described the work he and two colleagues undertook trying to write software for the experimental 80-core chip that Intel designed and fabricated about fifteen months ago. This chip was an experiment in alternative ways of managing power rather than a chip designed to be brought to market. However, given that the basic architecture is fundamentally the same as Tilera's Tile64 architecture, it may actually indicate something more. It might presage a second hardware phase in the “Multicore Revolution”.
The current range of x86 architecture multicore chips is designed for use with a single external memory. The memory is a single global resource to which all cores have equal access, and indeed equal rights. Since each core has its own cache and each chip may have another cache, keeping all the caches coherent is an immense problem. Coherence here means that if two (or more) separate caches are caching a given value from memory, then when a write changes this value, the system must ensure that the change is communicated to all the caches concerned. This is a hugely complex problem that rapidly gets close to impossible as the core and cache count gets to many tens or hundreds, and pragmatically impossible with core counts of thousands. This then is the background to Mattson's claim that “Cache sucks.”
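To make the cost of coherence concrete, here is a minimal sketch of a toy write-invalidate scheme. The names `Memory`, `CoherentCache` and `build_system` are hypothetical, and broadcasting an invalidation to every other cache on every write is a simplifying assumption for illustration, not a description of how any real x86 chip keeps its caches coherent.

```python
class Memory:
    """The single global memory shared by every core."""

    def __init__(self):
        self.data = {}


class CoherentCache:
    """One core's private cache using a toy broadcast write-invalidate rule."""

    def __init__(self, memory, peers):
        self.memory = memory
        self.peers = peers          # shared list of every cache in the system
        self.lines = {}
        peers.append(self)

    def read(self, addr):
        if addr not in self.lines:
            # Miss: fetch the current value from global memory.
            self.lines[addr] = self.memory.data.get(addr, 0)
        return self.lines[addr]

    def write(self, addr, value):
        """Write through to memory and invalidate every other cached copy.

        Returns the number of invalidation messages sent.
        """
        messages = 0
        for cache in self.peers:
            if cache is not self:
                cache.lines.pop(addr, None)   # drop any stale copy
                messages += 1
        self.lines[addr] = value
        self.memory.data[addr] = value
        return messages


def build_system(core_count):
    """Create one global memory and one cache per core."""
    memory = Memory()
    peers = []
    caches = [CoherentCache(memory, peers) for _ in range(core_count)]
    return memory, caches


if __name__ == "__main__":
    memory, caches = build_system(4)
    memory.data[0x10] = 1
    print([c.read(0x10) for c in caches])    # every cache takes a copy
    sent = caches[0].write(0x10, 2)          # one write, many invalidations
    print(sent, [c.read(0x10) for c in caches])
```

In this toy scheme every write costs one message per other core, so the traffic for a single write grows with the core count. Real protocols are far more refined, but they face the same scaling pressure.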
What is the alternative? Well that is (at least conceptually) very easy – don’t have cache. Instead of having a single global memory, each processor has its own local, directly connected memory with no need for cache. In effect all chips are memory chips that have processors embedded in them: the processor / memory separation we currently have is removed and turned on its head. At a stroke this removes all the problems of cache. Obviously, though, there is a down side. Current operating systems and applications are not really set up for this. Except, of course, that Tilera have shown that they can run Linux (albeit modified) on their Tile64 processors very successfully.
Clearly there is the issue of how the cores communicate. Each has a communication channel to its nearest neighbours. So this is “network on chip”: not only are the chips basically huge memory chips with embedded processors, but the processors also have communication channels that replace the bus structures found in today's commodity processors.
Does this all seem very familiar? Well it should. The Intel 80-core processor and Tilera's Tile64 are echoing a restructured form of the sort of architecture that the Transputer was designed to implement. The difference here is that the Transputer was an independent single-processor chip with memory that could be assembled into any connection structure, whereas the Intel 80-core and the Tilera 64-core processors are fixed two-dimensional grids. Nonetheless, it is the same core, high-level architectural model: multiple connected processors with separate memory that use message passing.
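What nearest-neighbour links on a fixed two-dimensional grid mean for a message can be sketched as follows. The functions `xy_route` and `hops` are hypothetical names, and moving along x first and then along y is an assumed routing rule chosen for simplicity, not something taken from the Intel or Tilera designs. The grid coordinates in the example are likewise only illustrative.

```python
def xy_route(src, dst):
    """Return the (x, y) cores a message visits travelling from src to dst.

    Each step moves to a nearest neighbour: first along x, then along y
    (an assumed routing rule for this sketch).
    """
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def hops(src, dst):
    """Number of neighbour-to-neighbour links the message crosses."""
    return len(xy_route(src, dst)) - 1


if __name__ == "__main__":
    print(xy_route((0, 0), (2, 1)))
    print(hops((0, 0), (7, 9)))   # corner to corner of an illustrative grid
```

The point of the sketch is that, unlike a bus, the cost of a message depends on where the two cores sit in the grid, so data placement becomes something the software has to care about.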
Anyone familiar with the history here will remember occam and the mathematical theory behind it, Communicating Sequential Processes (CSP). Erlang is the current standard bearer for this view of architecture. In the Python world things are moving in this direction – Python 2.6 introduced the multiprocessing package, which allows Python to realise the multiple independent processes view of computing as exemplified in occam and Erlang.
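As a minimal sketch of that style, the following uses the multiprocessing package to run a worker process that shares no data with its parent and communicates only through queues. The names `squarer` and `run_pipeline` are hypothetical, the code assumes Python 3, and it explicitly asks for the "fork" start method, which assumes a Unix-like system.

```python
import multiprocessing as mp


def squarer(inbox, outbox):
    """Worker process: has its own memory and talks only through channels."""
    while True:
        item = inbox.get()
        if item is None:          # sentinel: no more work
            break
        outbox.put(item * item)
    outbox.put(None)              # tell the parent we are finished


def run_pipeline(values):
    """Send values to a separate process and collect the squared results."""
    ctx = mp.get_context("fork")  # assumption: Unix-like system
    inbox, outbox = ctx.Queue(), ctx.Queue()
    worker = ctx.Process(target=squarer, args=(inbox, outbox))
    worker.start()

    for value in values:
        inbox.put(value)
    inbox.put(None)

    results = []
    while True:
        result = outbox.get()
        if result is None:
            break
        results.append(result)

    worker.join()
    return results


if __name__ == "__main__":
    print(run_pipeline([1, 2, 3, 4]))
```

Nothing here relies on the two processes seeing the same memory, which is why this style of program maps naturally onto separate-memory hardware.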
Of course the HPC world is fundamentally a C, C++ and Fortran world using MPI and OpenMP to manage parallelism. OpenMP is about shared-memory multiprocessing and so would be somewhat irrelevant in a separate-memory world. MPI, though, is about message passing. It arose to deal with clusters and grids. Its assumption of separate memory may allow it to be a bridge for C, C++ and Fortran codes to architectures such as the Intel 80-core and Tilera's Tile64. There would still need to be a software revolution, but not as big a one as you might first think.
Where does this leave Java? Who knows? Java is currently not a player in the HPC world, so until the chips and architecture are released into the world where Java is king – Web applications – it is hard to even speculate. Python is already ready so perhaps there could be a toppling of the king?
All this shows that there are still hardware revolutions waiting in the wings as part of the “Multicore Revolution” we are at the beginning of. Currently, the focus is on preserving the processor / memory separation with the x86 architecture so as to make things appear no different to the current crop of operating systems and applications. Soon, instead of trying to hide multicore behind variations of current architecture, the shift towards embedding computing cores in memory will happen. Then there will be a huge software revolution. Kings may fall, dynasties may yet change.