TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.


Using Docker for Deep Learning projects

Margaux Masson-Forsythe

Published in TDS Archive · 7 min read · Sep 17, 2021


Photo by Paul Teysen on Unsplash

In this article, I will explain how I use Docker in my everyday projects. There is plenty of good documentation and many videos about this subject, but I wanted to share the approach I have been using on several of the industry projects I work on.

What is Docker?

Docker uses OS-level virtualization to deliver software in packages called containers. Each Docker container is created from a Docker image. An image contains all of the information needed to construct the environment (libraries, folders, files, OS, etc.). Containers are isolated from one another.

Docker flow — Image by author

The official Docker documentation can be found at this link.

To download Docker, you can go to this link, and I will assume for the rest of the article that Docker is already correctly installed on the machine.
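As a quick sanity check after installing, you can verify that the docker CLI is on the PATH. A minimal sketch (it only looks for the binary; it does not talk to the Docker daemon):

```shell
# Quick sanity check: is the Docker CLI installed and on the PATH?
if command -v docker >/dev/null 2>&1; then
    DOCKER_STATUS="found"
    echo "docker CLI found at: $(command -v docker)"
else
    DOCKER_STATUS="missing"
    echo "docker CLI not found -- install it from docs.docker.com"
fi
```

If the CLI is found, `docker run hello-world` is the usual next step to confirm the daemon itself is running.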

Why use Docker for Deep learning projects?

When you run a Deep Learning training on a machine (in the cloud or locally), you want to be able to launch it easily, without setting up the environment from scratch each time. And if you later need to run the same training on another machine for some reason, you don't want to repeat the whole setup.

Ideally, we want to have one command that can be run confidently across machines. That is why I use Docker containers for my trainings almost all the time.

Here are the main advantages of Docker for me:

  • All the required packages are already installed with the correct versions for the training: for example, if I want to use TensorFlow 2 for one training, I have a Docker image configured for it, another one for TensorFlow 1, and another one for PyTorch.
  • I can pin a training to a specific GPU when running on a multi-GPU machine.
  • Since all containers are isolated from one another, a crash in one does not impact the other processes: for example, the other GPUs keep working when I select a specific GPU for a training.
  • If the source code lives in a git repository, I usually add it to the main Docker image, then git pull each time I create a container and switch to whichever branch/commit I want to use.
  • Local folders/files and NAS drives can be mounted easily when starting the container, so no copying is required, which saves time, especially when I am debugging something.
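The mounting mentioned in the last bullet is just a matter of passing one -v host_path:container_path flag per folder. A tiny dry-run sketch (the paths here are hypothetical placeholders) that assembles those flags for several mounts:

```shell
# Assemble one -v flag per host:container mount (hypothetical example paths)
MOUNTS="/home/me/workspace:/workspace /mnt/nas/datasets:/data"
VOLUME_FLAGS=""
for m in $MOUNTS; do
    VOLUME_FLAGS="$VOLUME_FLAGS -v $m"
done
# Print the docker run prefix we would use (dry run -- nothing is started)
echo "docker run$VOLUME_FLAGS ..."
```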

Introduction to Docker

Here are some fundamental commands you need to know:

  • display the containers currently running:
docker ps
  • display all the containers (even those not running anymore):
docker ps -a 
  • display the images locally saved:
docker images
  • remove a docker container:
docker stop container_name # if container is running
docker rm container_name
  • remove all docker containers (not running anymore):
docker container prune
  • remove an image:
docker rmi image_name
  • remove all docker images (be very careful with this one!):
docker image prune -a
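To tie the stop/remove commands together, here is a hypothetical dry-run helper that prints the cleanup sequence for one container instead of executing it (swap the echo for the real commands once you are comfortable):

```shell
# Dry-run cleanup for a single container: collect and print the commands
CONTAINER="tensorflow2-container"
PLAN=""
for cmd in "docker stop $CONTAINER" "docker rm $CONTAINER"; do
    PLAN="$PLAN$cmd; "
    echo "would run: $cmd"
done
```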

Run a Tensorflow2 training in a Docker container

Docker Image

First, we pull the NVIDIA NGC image tagged 21.04-tf2-py3, which ships with TensorFlow 2 and Python 3 (this will take some time):

docker pull nvcr.io/nvidia/tensorflow:21.04-tf2-py3
Docker pull done — Image by author

Once this is done, we check that the image appears in our list of local images using the command docker images:

List Docker local images — Image by author

We do see our image nvcr.io/nvidia/tensorflow with the tag 21.04-tf2-py3.

Start a container with this image and explore some of the flags

We can create a container using this image:

docker run -it --rm --name tensorflow2-container --network=host nvcr.io/nvidia/tensorflow:21.04-tf2-py3 bash
Docker container from Tensorflow2 image opened — Image by author

We used some specific flags in this command:

  • -it is used to open an interactive terminal
  • --rm is used so that when we exit the container, it will remove it
  • --name is used to name the container with a custom name
  • --network=host is used to have access to the Internet in the container (same network as the host machine)
  • Then, we have the name of the image to use followed by bash to create an interactive shell in the container

If we run docker ps in another terminal, we will see our new container:

docker ps command to see the new started container — Image by author

We do see our container tensorflow2-container

Now, if we want to use our local workspace with the training script, we can mount the workspace folder into the container by using -v /Users/margauxmforstyhe/workspace/:/workspace. This flag mounts the workspace folder on our computer onto the /workspace folder in the container.

In our current container, if we run ls here is what we have:

Default workspace folder in the Docker container — Image by author

Let’s exit the docker container we have currently opened using the command exit and start a new one with the workspace folder:

docker run -it --rm --name tensorflow2-container --network=host -v /Users/margauxmforstyhe/workspace/:/workspace nvcr.io/nvidia/tensorflow:21.04-tf2-py3 bash

and run ls:

Local workspace folder in the Docker container — Image by author

➡️ Our local workspace was mounted in the Docker container and we can now use it for our trainings/tests.

When I am testing a training, I use a machine with GPUs and select one GPU with --gpus=device=0 (GPU 0). Then, when I am done testing, I usually run a command like the following to start a training:

docker run -i -d --rm --gpus=device=0 --name tensorflow2-container --network=host -v /Users/margauxmforstyhe/workspace/:/workspace nvcr.io/nvidia/tensorflow:21.04-tf2-py3 bash -c "export PYTHONPATH=/workspace && python3 /workspace/training_repo/train.py .... {parameters for the training}"

So here we have a Docker container in detached mode (-d, meaning we do not see the execution of the code in the terminal) running the local training script on GPU 0.

NB: this works exactly the same with the inference script, you only need to change the python script being called.
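Since the training and inference launches differ only in the script being called, I find it convenient to sketch them as one parametrized command builder. A dry run that only prints the command (the GPU id, workspace path, and script path below are placeholders):

```shell
# Build the full detached docker run command for a given script (dry run)
GPU=0
IMAGE="nvcr.io/nvidia/tensorflow:21.04-tf2-py3"
SCRIPT="/workspace/training_repo/train.py"   # swap for your inference script
CMD="docker run -i -d --rm --gpus=device=$GPU --name tensorflow2-container"
CMD="$CMD --network=host -v $HOME/workspace/:/workspace $IMAGE"
CMD="$CMD bash -c \"export PYTHONPATH=/workspace && python3 $SCRIPT\""
echo "$CMD"
```

Removing the echo (or passing the string to sh -c) would actually launch the container.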

Another way to do this is to have a git repository with all the training scripts and add it as a part of the image. Let’s do this with a Dockerfile!

Build a Docker image with a Dockerfile and use a git repository as training repository

A Dockerfile is used to create an image. For example, say we want to build an image on top of the nvcr.io/nvidia/tensorflow:21.04-tf2-py3 image we used before, clone the training_repo from GitHub, and install all the requirements for running the training (for example rasterio, or a specific package version). This gives us the following Dockerfile:

FROM nvcr.io/nvidia/tensorflow:21.04-tf2-py3
RUN apt-get update
RUN git clone https://github.com/MargauxMasson/training_repo.git
RUN pip install -r /workspace/training_repo/requirements.txt
RUN ls
WORKDIR /workspace/
CMD ["ls"]

In order to build the image, which we will name tensorflow-21.04-tf2-py3-with-requirements-and-git-repo, we use the command docker build (it needs to be run in the folder containing the Dockerfile):

docker build . --network=host -t tensorflow-21.04-tf2-py3-with-requirements-and-git-repo
Build the image with the Dockerfile — Image by author

We see that the image builds properly, and when we check with docker images, we do see our new image tensorflow-21.04-tf2-py3-with-requirements-and-git-repo.

NB: The “.” in the docker build command indicates that the Dockerfile, named Dockerfile, is located in the folder we are running the command from.

Now when we start the container using this image, we do not need to mount the local workspace because the git repo is already in the image:

docker run -it --rm --name tensorflow2-container --network=host tensorflow-21.04-tf2-py3-with-requirements-and-git-repo bash

Indeed, the training_repo is in the workspace of the container.

This image can be used without modification even if the code in the git repository changes. When starting the container, we can git pull or git checkout whichever branch/commit we want:

docker run -i -d --rm --gpus=device=0 --name tensorflow2-container --network=host tensorflow-21.04-tf2-py3-with-requirements-and-git-repo bash -c "cd /workspace/training_repo && git pull && git checkout my_training_dev_branch && export PYTHONPATH=/workspace && python3 /workspace/training_repo/train.py .... {parameters for the training}"

Also, as suggested when the container starts, we can add these flags: --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864

So:

docker run -i -d --rm --gpus=device=0 --name tensorflow2-container --network=host --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 tensorflow-21.04-tf2-py3-with-requirements-and-git-repo bash -c "cd /workspace/training_repo && git pull && git checkout my_training_dev_branch && export PYTHONPATH=/workspace && python3 /workspace/training_repo/train.py .... {parameters for the training}"

There are a lot of ways one can use Docker, but this is how I like to use it for my trainings and inferences. It helps me stay organized, because I have specific images (or at least Dockerfiles) that I can use confidently, knowing that my training code will run without any struggle.

