Why Docker for Data Scientists

Published on 10 August 2018, in #data-engineering, #devops

If you are considering Docker for data-science projects, this post covers the conceptual advantages of moving data-science, big-data, and machine-learning projects onto Docker.

In the post, we start with the topic of team productivity.

Exec Summary: Less technical and communication friction

In my experience, teams that use Docker see less technical and communication friction, mainly thanks to the following advantages:

New colleagues get onboarded quickly
System documentation is explicit and always up-to-date
Docker eliminates most work-on-my-machine problems
Docker enables transparent and peer-reviewed infrastructure updates
Docker streamlines technical collaboration across teams
Docker sharpens responsibility

These advantages come at a cost: the need to learn new tooling.

Let's dig into each of those individually.

Advantage #1: New colleagues get onboarded quickly

With Docker, it is a matter of one command to get an application up and running. The ease of spinning up a complete application also holds for complex cases where an application requires custom binaries, tweaks in system configuration, a database, or a cluster-like setup.

With Docker, you no longer need to spend days getting an application running on a new laptop. I once worked on a micro-service-based application consisting of 20+ interconnected services. Docker allowed us to spin up the whole stack with just one command.

To learn more, check out docker-compose.
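
As a rough illustration of what such a one-command setup can look like, here is a minimal docker-compose sketch for an application with a database. The service names, the `./api` directory, the ports, and the credentials are all hypothetical and not taken from any real project.

```yaml
# docker-compose.yaml (illustrative sketch; all names and values are assumptions)
services:
  api:
    build: ./api                 # hypothetical directory containing a Dockerfile
    ports:
      - "8000:8000"              # host port : container port
    environment:
      # The hostname "db" resolves to the database service defined below
      DATABASE_URL: postgresql://app:example@db:5432/app
    depends_on:
      - db                       # start the database container before the API
  db:
    image: postgres:16           # official PostgreSQL image from Docker Hub
    environment:
      POSTGRES_USER: app
      POSTGRES_PASSWORD: example # placeholder only; never commit real secrets
      POSTGRES_DB: app
    volumes:
      - db-data:/var/lib/postgresql/data   # keep data across container restarts
volumes:
  db-data:
```

With a file like this in the repository, a new colleague would typically run `docker-compose up` (or `docker compose up` with newer Docker versions) to start both services.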

Advantage #2: System documentation is explicit and always up-to-date

Docker uses files to define application system requirements and links between them. These files are human-readable, declarative, and define the desired end-state of an application. They are typically committed to a version control system. Thus, they serve as excellent documentation of all system and application dependencies.

With Docker, you don't need to document system requirements manually. The documentation is, by definition, always up-to-date and in sync with the corresponding application version, because the definition of the system requirements is the documentation itself.

To learn more, check out Dockerfile and docker-compose.

Advantage #3: Docker eliminates most work-on-my-machine problems

How many times have you seen an application behave correctly on one developer's machine but misbehave on another's? Typical causes include different path separators, undocumented system configuration, missing libraries, and missing custom binaries (e.g., linear programming tools).

Since Docker packages the application together with all its system dependencies (operating system, binaries, configuration), all developers work with the same infrastructure.

Useful concepts: Dockerfile, docker-compose.yaml

Advantage #4: Docker enables transparent and peer-reviewed infrastructure updates

Docker defines application infrastructure and dependencies in files. These files are typically committed to a version control system (for example, Git or Mercurial).

Thus, best practices for team collaboration, like peer reviews and pull requests, can be applied to Docker files as well. The whole team can transparently review any proposed update to system configuration, custom binaries, or libraries.

Advantage #5: Docker streamlines technical collaboration across teams

Docker encapsulates everything needed to run an application into a single Docker image, and images are typically stored in a centrally accessible location (an image registry). This mechanism enables a team to quickly reuse an image developed by another team.

Let's take a use-case where different teams concurrently develop a new set of services that depend on each other: a new version of a frontend coupled with a new REST API backend, for example. Each team would welcome having an option to quickly spin up the whole application stack (the frontend and the backend) locally for testing purposes. Docker enables such an option, for example, through docker-compose.
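
A sketch of how the frontend team in this scenario might do that: build the frontend from local sources and pull the backend as a ready-made image published by the other team. The registry URL, image tag, directory, ports, and the `API_URL` variable are hypothetical placeholders.

```yaml
# docker-compose.yaml as the frontend team might write it (illustrative sketch)
services:
  frontend:
    build: ./frontend            # hypothetical: built from the team's own working copy
    ports:
      - "3000:3000"
    environment:
      API_URL: http://backend:8080   # assumed variable name; "backend" is the service below
    depends_on:
      - backend
  backend:
    # Hypothetical image published by the backend team to a shared registry
    image: registry.example.com/backend-team/rest-api:2.0.0-rc1
    ports:
      - "8080:8080"
```

The backend team could keep a mirrored file that builds the backend locally and pulls the frontend image instead, so neither team needs the other's build toolchain.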

Advantage #6: Docker sharpens responsibility

If you work for an older company where IT is not a core business, but rather, an optimization around a core business, chances are you will find yourself working in organizational silos. In this model, there's a lack of end-to-end responsibility for a product: one team develops a model, while another team deploys the model. This split leads, among other things, to high communication overhead, common technical misalignments (for example, different library versions between development machines and production environment), cumbersome application monitoring, and unclear support boundaries.

Ideally, these silos would be merged, and there would be a single team end-to-end responsible for a product. If that's not possible, Docker might partially help.

Docker allows shifting responsibility for application system requirements to the development team. At the same time, the responsibility of the deployment team would sharpen as well: Instead of supporting a wide range of applications half-way, the deployment team can be end-to-end responsible for deploying any application that is Dockerized.

Disadvantage #7: Necessity to learn a new tool

Docker is an additional tool that team members need to have in their toolbelt. One option to learn Docker is to experiment on your own and invest time in exploring the tool. The drawback of self-study is that team members first have to discover what they don't know before they can learn it.

Bring Docker advantages to your team

An in-house Docker training specialized for data-science teams and geared towards your specific needs is an excellent option to learn Docker fundamentals effectively. ~~My Docker training~~ (Update from 2020: I no longer do these) offers a high-level conceptual overview coupled with practical, focused, hands-on exercises. Also, your team would learn about hard-won lessons and pragmatic best practices, which are typically hard to find on the interwebs.

This blog is written by Marcel Krcah, an independent consultant for product-oriented software engineering.