Yesterday was the sprint day for PyCon UK 2016. I proposed to sprint on "Supercomputer in a Briefcase". I got four helpers: Barry Scott, Emma Gordon, Tom Viner and John Chandler. Richard Wenner offered us the use of 10 Raspberry Pis (RPis) and a very big hub, but in the end we didn't use these; we just used the two RPis, hub and DHCP server I had brought. If we do something with this project next year I do hope Richard is around with his RPis.
The first technical problem to tackle was which repository to work with. I had assumed Git and had made repositories on GitHub, GitLab, and Bitbucket in preparation. One person voted for Mercurial on Bitbucket, but in the end the path of least resistance was GitHub – even though I wanted to go with GitLab. You can get to the repository by following this link.
The first real problem to address was finding out what computers were available on the network – it is assumed that the cluster will always be on a private network. Whilst there are many RPi cluster projects around the world (do a Google search and you can't miss them), they all put RPis into a rack of some sort, i.e. they are in a fixed configuration. A goal of "Supercomputer in a Briefcase" is to have a dynamic cluster: the cluster has to be able to deal with nodes being taken away and being added. I had previously created a discovery program using Scapy and reverse ARP calls. This, however, required super-user permission and was somewhat unsatisfactory. The team came up with the idea of using Avahi instead. After various experiments, we found that by placing an XML service specification in the right place and restarting the Avahi server we got service notification. This deals with nodes being added to the cluster and with nodes removing themselves, but leaves the problem of nodes being disconnected by having the Ethernet removed or by being powered off. A watchdog will have to be created to cover this situation.
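The Avahi service specification mentioned above is a small XML file dropped into `/etc/avahi/services/`; the daemon then announces the service on the local network via mDNS/DNS-SD. A minimal sketch follows – the service type `_briefcase._tcp`, the name, and the port are illustrative choices, not necessarily what the sprint used:

```xml
<?xml version="1.0" standalone='no'?>
<!DOCTYPE service-group SYSTEM "avahi-service.dtd">
<service-group>
  <!-- %h is expanded by Avahi to the node's hostname -->
  <name replace-wildcards="yes">Briefcase node on %h</name>
  <service>
    <!-- Hypothetical service type for cluster nodes -->
    <type>_briefcase._tcp</type>
    <port>9000</port>
  </service>
</service-group>
```

With a file like this in place on each RPi, a master node can browse for `_briefcase._tcp` services (for example with `avahi-browse`) to learn which nodes are currently present.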
For now the idea is to use OpenMPI as a code and data transfer layer: python3-mpi4py is available on Raspbian and it is based on OpenMPI. The single master node of the cluster starts a job and collects together the results. For us at the sprint the master node was a laptop that was connected to both the cluster network and the rest of the Internet and Web. The master node has to know the collection of computers in the cluster, so the service discovery code was extended to write the file needed by OpenMPI. This covers the case of a fixed network; it will have to be extended to deal with a dynamic network.
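The file OpenMPI needs is a hostfile, one line per node with an optional slot count. A minimal sketch of the generation step, assuming discovery has already produced a mapping from hostname to the number of cores to use (the function name and the mapping shape are my illustration, not the sprint's actual code):

```python
def write_openmpi_hostfile(hosts, path="hostfile"):
    """Write an OpenMPI hostfile from discovered nodes.

    hosts: mapping of hostname (or IP) -> slot count, e.g. the
    number of cores to use on that node.
    Returns the lines written, for inspection.
    """
    # Sort for a stable, reproducible file ordering.
    lines = [f"{host} slots={slots}" for host, slots in sorted(hosts.items())]
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
    return lines
```

The master could then launch a job with something like `mpirun --hostfile hostfile -np 8 python3 job.py`. Regenerating the hostfile whenever Avahi reports a node arriving or leaving is one plausible route to the dynamic behaviour mentioned above.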
People had to leave a little early, so the technical effort stopped there and notes were put in place outlining what had been achieved and what remained to be done. This has created an excellent platform to progress this idea. Huge thanks to Tom, Emma, John, and Barry for taking the time to contribute. I will continue the work; if others want to join in, please head over to Supercomputer in a Briefcase.