Accelerating team research with containers

Mon 21 Jun 2021 Gregory J. Stein

I discovered Docker containers near the tail end of my PhD. As I started to write my thesis in earnest, I realized that the farther back into the past I went, the harder it was to get my code working as I expected. Moving to a Docker-based workflow held the promise of longevity and reproducibility. The Docker container acted like a miniature virtual machine that contained all my research code and the installs needed to run it, all the way down to the version of the operating system itself. So long as I could build the container, I could run the code I'd moved inside it and reproduce my old work without polluting my local machine's installs or breaking some other part of my code.

Moving to a containerized workflow helped me not only to run old code but also to modernize it, as Docker made it easy to upgrade dependencies and even the operating system without any significant changes to my local machine. Once I'd updated the dependencies and the OS, it was easy to move all my code into a mono-repo that contained code from my earlier papers. By the time I finished my PhD, code from all of the papers I had published was runnable from within a single Docker container. Now that I'm faculty, I have the opportunity to pass along my experience: all code in my research group is run exclusively inside Docker containers.

Note that modernization does not come for free; solid tests and good test coverage were the other key enablers of bringing old code up to date and helping it play nicely with others.

Perhaps the biggest upshot of our container-based workflow is that it has made development easier and faster for our team. It's incredibly easy for me to pull code from someone else and help them debug, synchronously or asynchronously, without going back and forth about dependencies, installs, and configurations. Just the other day, a student of mine was trying to figure out how to use some of my older code as part of a new idea he's exploring. I told him to push his code; I pulled it and built it on my local machine, and within a few minutes I had run it, figured out where he was getting tripped up, and was showing him around my API on a Zoom call. The ability to quickly grab someone else's code and play around with it has been invaluable so far, particularly during quarantine, when simply sitting down at someone's desk and sharing a keyboard has been impossible.

Code reviews are simple for the very same reason, which lowers the bar for students looking to open a pull request and get their code reviewed. After pulling a new branch from our GitHub, a single invocation of make test builds the code and runs all our tests, an easy first step for any review.
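To make the "build, then test" step concrete, here is a minimal Python sketch of what a one-command test target might boil down to. It is not our actual Makefile: the image tag, the use of pytest as the test runner, and the function names are all assumptions for illustration. The only Docker features it relies on are the standard docker build -t and docker run --rm invocations.

```python
import shlex
import subprocess
from typing import List


def docker_test_commands(image_tag: str, context_dir: str = ".") -> List[List[str]]:
    """Return the argv lists for building an image and running tests inside it.

    Both the image tag and the pytest invocation are hypothetical; substitute
    whatever your own project uses.
    """
    # Build an image from the Dockerfile found in context_dir.
    build = ["docker", "build", "-t", image_tag, context_dir]
    # Run the tests in a throwaway container; --rm removes it once it exits.
    run = ["docker", "run", "--rm", image_tag, "python3", "-m", "pytest"]
    return [build, run]


def run_tests(image_tag: str, context_dir: str = ".", dry_run: bool = False) -> None:
    """Print each command and, unless dry_run is set, execute it."""
    for cmd in docker_test_commands(image_tag, context_dir):
        print("$", shlex.join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError, stopping at the first failure.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run: show the commands without requiring Docker to be installed.
    run_tests("research-code:latest", dry_run=True)
```

Because the tests run inside the container, a reviewer needs only Docker and a checkout of the branch; nothing else has to be installed on their machine.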

So why didn't I adopt Docker earlier? At the time, it seemed somewhat intimidating to switch to a new workflow. Since much of my code was written in Python, pip + venv or conda seemed like they would be sufficient. Yet my code inevitably grew in complexity and started to include simulation tools and apt installs, requiring that I think more and more about the machine I was running it on. Docker makes it easy to switch machines or share code without needing to think about what operating system the machine is running or what packages it has installed. In retrospect, I only wish I'd switched sooner.