Using Docker for Deep Learning projects
As a Machine Learning Engineer, I use Docker containers daily; they save me a huge amount of time and help me stay organized.
In this article, I will explain how I use Docker in my everyday projects. There is a lot of good documentation and many videos about this subject, but I wanted to share one way of doing things that I have been using for several of the industry projects I work on.
What is Docker?
Docker uses OS-level virtualization to deliver software in packages called containers. Each Docker container is created from a Docker image. An image has all of the information for constructing the environment (libraries, folders, files, OS, etc). Containers are isolated from one another.

The official Docker documentation can be found at docs.docker.com, which is also where Docker can be downloaded. I will assume for the rest of the article that Docker is already correctly installed on the machine.
Why use Docker for Deep learning projects?
When you run a Deep Learning training on a machine (on the cloud or locally), you need to be able to launch the training easily, without struggling to set up the environment each time. And if you later want to run the same training on another machine for some reason, you don’t want to go through all of the setup again.
Ideally, we want to have one command that can be run confidently across machines. That is why I use Docker containers for my trainings almost all the time.
Here are the main advantages of Docker for me:
- All the required packages are already installed with the correct versions for the training: for example, if I want to use TensorFlow 2 for one training, I have a Docker image configured for it, another one for TensorFlow 1, and another one for PyTorch.
- I can restrict a container to a single chosen GPU on a multi-GPU machine.
- Since containers are isolated, if one execution crashes, other processes are not impacted: for example, when I select a specific GPU for a training, the other GPUs are not affected.
- If the source code lives in a git repository, I usually bake it into the main Docker image, then git pull each time I create a container and switch to whichever branch/commit I want to use.
- Local folders/files and NAS can be mounted easily when starting the container, so no copying is required, which saves time, especially when I am debugging something.
Introduction to Docker
Here are some fundamental commands you need to know:
- display the containers currently running:
docker ps
- display all the containers (even those not running anymore):
docker ps -a
- display the images locally saved:
docker images
- remove a docker container:
docker stop container_name # if container is running
docker rm container_name
- remove all stopped docker containers:
docker container prune
- remove an image:
docker rmi image_name
- remove all unused docker images (be very careful with this one!):
docker image prune -a
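As a convenience, the stop-and-remove pair above can be collapsed into a single command; a small sketch (the container name is made up for illustration):

```shell
# Force-remove a container whether or not it is running
# (equivalent to docker stop followed by docker rm;
# "my-training-container" is a hypothetical name).
docker rm -f my-training-container
```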
Run a Tensorflow2 training in a Docker container
Docker Image
First, we pull the NVIDIA image (release 21.04) that contains TensorFlow 2 and Python 3 (this will take some time):
docker pull nvcr.io/nvidia/tensorflow:21.04-tf2-py3

Once this is done, we check that the image is in our local image list using the command docker images. We do see our image nvcr.io/nvidia/tensorflow with the tag 21.04-tf2-py3.
Start a container with this image and explore some of the flags
We can create a container using this image:
docker run -it --rm --name tensorflow2-container --network=host nvcr.io/nvidia/tensorflow:21.04-tf2-py3 bash

We used some specific flags in this command:
- -it opens an interactive terminal
- --rm removes the container automatically when we exit it
- --name gives the container a custom name
- --network=host puts the container on the same network as the host machine, so it has access to the Internet
- Then, we have the name of the image to use, followed by bash to create an interactive shell in the container
If we run docker ps in another terminal, we do see our new container tensorflow2-container ✅
Now, if we want to use our local workspace with the training script, we can mount the workspace folder into the container by adding -v /Users/margauxmforstyhe/workspace/:/workspace. This argument mounts the workspace folder on our computer onto the /workspace folder in the container.
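The general shape of the flag is -v host_path:container_path, and several mounts can be stacked in one command; here is a sketch (the NAS path and the :ro option are illustrative assumptions, not from the original command):

```shell
# General form: -v <absolute-host-path>:<container-path>[:options]
# The /mnt/nas/datasets/ path is hypothetical; :ro mounts it read-only
# so the training cannot accidentally modify the source data.
docker run -it --rm --name tensorflow2-container --network=host \
  -v /Users/margauxmforstyhe/workspace/:/workspace \
  -v /mnt/nas/datasets/:/data:ro \
  nvcr.io/nvidia/tensorflow:21.04-tf2-py3 bash
```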
In our current container, if we run ls, here is what we have:

Let’s exit the currently open Docker container using the command exit, and start a new one with the workspace folder mounted:
docker run -it --rm --name tensorflow2-container --network=host -v /Users/margauxmforstyhe/workspace/:/workspace nvcr.io/nvidia/tensorflow:21.04-tf2-py3 bash
and run ls:

➡️ The local workspace was mounted in the Docker container and we can now use it for our trainings/tests.
When I am doing some training testing, I use a machine with a GPU and select one GPU with --gpus=device=0 for GPU 0. Then, when I am done testing, I usually run a command like the following to start a training:
docker run -i -d --rm --gpus=device=0 --name tensorflow2-container --network=host -v /Users/margauxmforstyhe/workspace/:/workspace nvcr.io/nvidia/tensorflow:21.04-tf2-py3 bash -c "export PYTHONPATH=/workspace && python3 /workspace/training_repo/train.py .... {parameters for the training}"
So here we have a Docker container in detached mode (-d, meaning we do not see the execution of the code in the terminal) running the local training script on GPU number 0.
NB: this works exactly the same for inference; you only need to change the Python script being called.
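For instance, an inference run might look like the following; the script name inference.py and its location are assumptions for illustration:

```shell
# Same flags as the training command, only the entry script changes
# (inference.py is a hypothetical script name).
docker run -i -d --rm --gpus=device=0 --name tensorflow2-inference --network=host \
  -v /Users/margauxmforstyhe/workspace/:/workspace \
  nvcr.io/nvidia/tensorflow:21.04-tf2-py3 \
  bash -c "export PYTHONPATH=/workspace && python3 /workspace/training_repo/inference.py {parameters for the inference}"
```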
Another way to do this is to have a git repository with all the training scripts and add it as a part of the image. Let’s do this with a Dockerfile!
Build a Docker image with a Dockerfile and use a git repository as training repository
A Dockerfile is used to create an image. For example, say we want to create an image on top of the nvcr.io/nvidia/tensorflow:21.04-tf2-py3 image we used before, then clone the training_repo from GitHub and install all the requirements for running the training (for example rasterio, or a specific package version). This gives us the following Dockerfile:
FROM nvcr.io/nvidia/tensorflow:21.04-tf2-py3

RUN apt-get update
RUN git clone https://github.com/MargauxMasson/training_repo.git
RUN pip install -r /workspace/training_repo/requirements.txt
RUN ls

WORKDIR /workspace/
CMD "ls"
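For reference, a requirements.txt like the one this Dockerfile installs would typically pin exact versions so the image stays reproducible. The contents below are an assumption, purely for illustration:

```
# hypothetical requirements.txt — pin versions for reproducibility
rasterio==1.2.10
numpy==1.19.5
```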
In order to build the image, which we will name tensorflow-21.04-tf2-py3-with-requirements-and-git-repo, we use the command docker build (which needs to be run in the folder containing the Dockerfile):
docker build . --network=host -t tensorflow-21.04-tf2-py3-with-requirements-and-git-repo

We see that the build of the image works properly, and when we check with docker images, we do see our new image tensorflow-21.04-tf2-py3-with-requirements-and-git-repo.
NB: The “.” in the docker build command indicates that the Dockerfile, named Dockerfile, is located in the folder we are running the command from.
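If the Dockerfile sits elsewhere or uses a different name, the -f flag of docker build points at it; a sketch (the docker/Dockerfile.tf2 path is a hypothetical example):

```shell
# Build from a Dockerfile stored under another name/location
docker build . --network=host -f docker/Dockerfile.tf2 \
  -t tensorflow-21.04-tf2-py3-with-requirements-and-git-repo
```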
Now when we start the container using this image, we do not need to mount the local workspace because the git repo is already in the image:
docker run -it --rm --name tensorflow2-container --network=host tensorflow-21.04-tf2-py3-with-requirements-and-git-repo bash

Indeed, the training_repo is in the workspace of the container.
This image can be used without being modified even if the code in the git repository changes: when starting the container, we can git pull or git checkout whichever branch/commit is desired:
docker run -i -d --rm --gpus=device=0 --name tensorflow2-container --network=host tensorflow-21.04-tf2-py3-with-requirements-and-git-repo bash -c "cd /workspace/training_repo && git pull && git checkout my_training_dev_branch && export PYTHONPATH=/workspace && python3 /workspace/training_repo/train.py .... {parameters for the training}"
Also, as suggested by the image when starting the container, we can add these flags to increase shared memory and lift the memory-lock limits: --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864
So:
docker run -i -d --rm --gpus=device=0 --name tensorflow2-container --network=host --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 tensorflow-21.04-tf2-py3-with-requirements-and-git-repo bash -c "cd /workspace/training_repo && git pull && git checkout my_training_dev_branch && export PYTHONPATH=/workspace && python3 /workspace/training_repo/train.py .... {parameters for the training}"
There are a lot of ways one can use Docker, but this is how I like to use it for my trainings and inferences. It helps me stay organized: I have specific images (or at least Dockerfiles) that I can use confidently, knowing my training code will run without any struggle.
