The Data Sciences department at Biogen has been using Docker and watching the (r)evolution for a couple of years. Last year, as our experience with Docker grew and the use cases expanded, we built our own early Docker Swarm cluster with homegrown orchestration capabilities.
Through cutting-edge science and medicine, Biogen discovers, develops and delivers worldwide innovative therapies for people living with serious neurological, autoimmune and rare diseases. Founded in 1978, Biogen is one of the world’s oldest independent biotechnology companies and patients worldwide benefit from its leading multiple sclerosis and innovative hemophilia therapies.
The Data Science team within Biogen is tasked with discovering new insights from rich and complicated data sets spread across all aspects of Biogen’s operations. Sometimes a data scientist unlocks a simple one-off insight, but on other occasions the insight sparks another project to visualize data, continually analyze streams or repeatedly apply algorithms. This is where the data engineers step in, taking those insights and transforming them into proofs of concept or minimum viable products.
A few years ago, to achieve this, we would provision one Virtual Machine (VM) per component and write a Chef script to install and configure the machine. By late 2014, we had application stacks using almost 10 VMs per environment, which, multiplied by several environments, led to an excessive operations burden for our proof-of-concept apps.
At that time, we started exploring replacing VMs with Docker containers. We began to build entire application stacks on single VMs running Docker using a custom deployment script. Data engineers could now script entire stacks, deploy on their laptops, push to dev/test/prod and start to share common patterns between projects. Additionally, this reduced the burden on the operations group by allowing the team to move to a more DevOps-oriented model.
We now had a better toolset but there were still some lingering issues to be addressed. We still had to request multiple (albeit fewer) VMs per project; the given DNS names for the VMs were nonsense to the end user (and requesting friendlier ones was another procedural step); we couldn’t easily scale out beyond a very large VM, and some projects were beginning to need bare metal performance.
To tackle our evolving needs, we started to build a Swarm cluster from a handful of high-end servers. Combining our deployment script base with Swarm, dnsdock, HashiCorp Vault, Docker UI, and other components, we assembled a running cluster. Because engineers no longer need to size and request multiple VMs for a proof-of-concept project, they can script an application stack and deploy it almost immediately, which reduces the turnaround time on projects. As the engineers have more control over the stack, we have seen best practices emerge for solving particular kinds of problems.
While some projects use the Swarm infrastructure, other projects use single VMs running Docker. Since we set up the Docker environment last year, many new features have shipped that we are looking to add as standard components of our environment.
The Data Science group at Biogen continues to work at the leading edge of technology to support the mission of delivering innovative therapies, and we are always looking for talented engineers to join the team.