Last week, I attended the Docker Containers for Reproducible Research Workshop hosted by the Software Sustainability Institute. Many talks addressed how containers can be used in a high performance computing (HPC) environment. Since running the Docker daemon requires root privileges, most administrators are reluctant to let users run Docker containers in an HPC environment. This issue has been addressed by Singularity, an alternative containerization technology that does not require root privileges. The nice thing is that Singularity can import existing Docker images, which lets you create a Singularity container from anything on Docker Hub. Although I have only used Docker so far, Singularity sounds like a nice technology I would like to explore in the future.
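For instance, pulling an image from Docker Hub and running a command in it looks roughly like this (a sketch from the Singularity documentation rather than something I have tried myself; the image name is just an example and the output filename depends on the Singularity version):

```
# Convert a Docker Hub image into a Singularity image
singularity pull docker://ubuntu:16.04

# Run a command inside the container -- no root privileges needed
singularity exec ubuntu-16.04.simg echo "hello from the container"
```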
Thinking about preservation
Although most talks focused on using containers in an analysis workflow, there was a notable exception. James Mooney (University of Oxford) and David Gerrard (University of Cambridge) looked at containers from a completely different perspective in their talk Software “Best Before” Dates: Thinking about Preservation. They discussed preserving research rather than reproducing research. While the latter typically considers time frames of several years, preservation targets time frames of several decades.
They raised the question: What if Docker has been superseded by some new software called HARPOON, no one uses Docker anymore, and no one quite remembers how Docker worked? It is clear that without the Docker daemon, all the images are basically useless. They continued with an anecdote on how to install Docker, which, funnily enough, is easiest by downloading the latest image from Docker Hub. Singularity containers aren't suitable either, because in twenty years the problem is essentially the same: most people won't know how to run that thing. Mooney and Gerrard's bottom line is that preserving research requires much more information than the code or a container alone. It needs to be accompanied by very detailed step-by-step instructions that list every single command, no matter how trivial it might look from today's perspective. Mooney and Gerrard closed their presentation with: think about preservation now, not later.
In the end, it comes down to preserving a researcher's legacy. Since most of academia is driven by the ‘publish or perish’ paradigm, preservation is usually disregarded. Considering that most of my research output is software or otherwise a file in a digital format, I started to wonder how future generations will be able to experience research happening today. It is easy to appreciate the experience of watching a steam engine in action to understand how it transformed society, but much harder when looking at a piece of software that once outperformed humans at analyzing images, but that no one knows how to run anymore.
A DevOps Approach To Research
In other news, I gave a short presentation at the workshop highlighting how some DevOps tools can be very helpful to researchers too. In particular, I talked about GitLab, which comes with built-in continuous integration pipelines and a Docker registry. I'm a big fan of GitLab and use it to manage all my research projects.
The code my presentation is based on is available at https://gitlab.com/sebp/devops-approach-to-research.
It is based on Theano's tutorial on Restricted Boltzmann Machines, which learns the distribution of handwritten digits (the infamous MNIST dataset). I use GitLab CI to build a Docker image with the code and all of its dependencies, push the resulting image to the Docker registry, use the image to train the model, and finally sample from the learned distribution. It is a rather simple setup compared to some of the sophisticated approaches presented at the workshop, but it provides enough value for me without being overwhelming.
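A minimal `.gitlab-ci.yml` sketch of such a pipeline could look like the following. The stage names, script commands, and file names are illustrative assumptions, not the actual configuration from the repository; the `$CI_REGISTRY_*` variables, however, are predefined by GitLab CI.

```yaml
# Hypothetical sketch of the pipeline: build an image, push it to the
# project's registry, then reuse it to train and to sample.
stages:
  - build
  - train
  - sample

build-image:
  stage: build
  image: docker:latest
  services:
    - docker:dind
  script:
    - docker login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" "$CI_REGISTRY"
    - docker build -t "$CI_REGISTRY_IMAGE:latest" .
    - docker push "$CI_REGISTRY_IMAGE:latest"

train-model:
  stage: train
  image: $CI_REGISTRY_IMAGE:latest
  script:
    - python train_rbm.py        # hypothetical training script
  artifacts:
    paths:
      - model.pkl                # hypothetical model file passed to the next stage

sample-digits:
  stage: sample
  image: $CI_REGISTRY_IMAGE:latest
  script:
    - python sample_rbm.py       # hypothetical sampling script
```

Because the training and sampling jobs run inside the image built in the first stage, the whole analysis is executed in the same environment that gets archived in the registry.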