Introducing Containers

Overview

Teaching: 20 min
Exercises: 0 min

Questions

What are containers, and why might they be useful to me?

Objectives

Show how software depending on other software leads to configuration management problems.

Identify the problems that software installation can pose for research.

Explain the advantages of containerization.

Explain how using containers can solve software configuration problems.

Learning about Docker Containers

The Australian Research Data Commons has produced a short introductory video about Docker containers that covers many of the points below. Watch it before or after you go through this section to reinforce your understanding!

How can software containers help your research?

Australian Research Data Commons, 2021. How can software containers help your research?. [video] Available at: https://www.youtube.com/watch?v=HelrQnm3v4g DOI: http://doi.org/10.5281/zenodo.5091260

Scientific Software Challenges

What’s Your Experience?

Take a minute to think about challenges that you have experienced in using scientific software (or software in general!) for your research. Then, share with your neighbors and try to come up with a list of common gripes or challenges.

You may have come up with some of the following:

you want to use software that doesn’t exist for the operating system (Mac, Windows, Linux) you’d prefer.
you struggle with installing a software tool because you have to install a number of other dependencies first. Those dependencies, in turn, require other things, and so on (i.e. combinatorial explosion).
the software you’re setting up involves many dependencies and only a subset of all possible versions of those dependencies actually works as desired.
you’re not actually sure what version of the software you’re using because the install process was so circuitous.
you and a colleague are using the same software but get different results because you have installed different versions and/or are using different operating systems.
you installed everything correctly on your computer but now need to install it on a colleague’s computer/campus computing cluster/etc.
you’ve written a package for other people to use but a lot of your users frequently have trouble with installation.
you need to reproduce a research project from a former colleague and the software used was on a system you no longer have access to.

A lot of these challenges boil down to one fact: the main program you want to use likely depends on many other programs (including the operating system!), creating a very complex and often fragile system. One change or missing piece may stop the whole thing from working or break something that was already running. It’s no surprise that this situation is sometimes informally termed “dependency hell”.

Software and Science

Again, take a minute to think about how the software challenges we’ve discussed could impact (or have impacted!) the quality of your work. Share your thoughts with your neighbors. What can go wrong if our software doesn’t work?

Unsurprisingly, software installation and configuration challenges can have negative consequences for research:

you can’t use a specific tool at all, because it’s not available or installable.
you can’t reproduce your results because you’re not sure what tools you’re actually using.
you can’t access extra/newer resources because you’re not able to replicate your software set up.
others cannot validate and/or build upon your work because they cannot recreate your system’s unique configuration.

Thankfully there are ways to get underneath (a lot of) this mess: containers to the rescue! Containers provide a way to package up software dependencies and access to resources such as files and communications networks in a uniform manner.

What is a Container? What is Docker?

Docker is a tool that allows you to build what are called “containers.” It’s not the only tool that can create containers, but is the one we’ve chosen for this workshop. But what is a container?

To understand containers, let’s first talk briefly about your computer.

Your computer has some standard pieces that allow it to work – often what’s called the hardware. One of these pieces is the CPU or processor; another is the amount of memory or RAM that your computer can use to store information temporarily while running programs; another is the hard drive, which can store information over the long-term. All these pieces work together to do the “computing” of a computer, but we don’t see them because they’re hidden from view (usually).

Instead, what we see is our desktop, program windows, different folders, and files. These all live in what’s called the filesystem. Everything on your computer – programs, pictures, documents, the operating system itself – lives somewhere in the filesystem.

Now, imagine you want to install some new software but don’t want to take the chance of making a mess of your existing system by installing a bunch of additional stuff (libraries/dependencies/etc.). You don’t want to buy a whole new computer because it’s too expensive. What if, instead, you could have another independent filesystem and running operating system that you could access from your main computer, and that is actually stored within this existing computer?

Or, imagine you have two tools you want to use in your groundbreaking research on cat memes: PurrLOLing, a tool that does AMAZINGLY well at predicting the best text for a meme based on the cat species and WhiskerSpot, the only tool available for identifying cat species from images. You want to send cat pictures to WhiskerSpot, and then send the species output to PurrLOLing. But there’s a problem: PurrLOLing only works on Ubuntu and WhiskerSpot is only supported for OpenSUSE so you can’t have them on the same system! Again, we really want another filesystem (or two) on our computer that we could use to chain together WhiskerSpot and PurrLOLing in a “pipeline”…

Container systems, like Docker, are special programs on your computer that make this possible! The term “container” can be usefully considered with reference to shipping containers. Before shipping containers were developed, packing and unpacking cargo ships was time-consuming and error-prone, with high potential for different clients’ goods to become mixed up. Just like shipping containers keep things together that should stay together, software containers standardize the description and creation of a complete software system: you can drop a container into any computer with the container software installed (the ‘container host’), and it should “just work”.

Virtualization

Containers are an example of what’s called virtualization – having a second “virtual” computer running and accessible from a main or host computer. Another example of virtualization is the virtual machine, or VM. A virtual machine typically contains a whole copy of an operating system in addition to its own filesystem and has to be booted up in the same way a computer would. A container is considered a lightweight version of a virtual machine; underneath, the container is (usually) using the Linux kernel and simply has some flavour of Linux + the filesystem inside.

One final term: while the container is an alternative filesystem layer that you can access and run from your computer, the container image is the ‘recipe’ or template for a container. The container image has all the required information to start up a running copy of the container. A running container tends to be transient and can be started and shut down. The container image is more long-lived, as a definition for the container. You could think of the container image like a cookie cutter – it can be used to create multiple copies of the same shape (or container) and is relatively unchanging, whereas cookies come and go. If you want a different type of container (cookie) you need a different container image (cookie cutter).

Putting the Pieces Together

Think back to some of the challenges we described at the beginning. The many layers of scientific software installations make it hard to install and re-install scientific software – which ultimately, hinders reliability and reproducibility.

But now, think about what a container is – a self-contained, complete, separate computer filesystem. What advantages are there if you put your scientific software tools into containers?

This solves several of our problems:

documentation – there is a clear record of what software and software dependencies were used, from bottom to top.
portability – the container can be used on any computer that has Docker installed – it doesn’t matter whether the computer is Mac, Windows or Linux-based.
reproducibility – you can use the exact same software and environment on your computer and on other resources (like a large-scale computing cluster).
configurability – containers can be sized to take advantage of more resources (memory, CPU, etc.) on large systems (clusters) or less, depending on the circumstances.
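As a rough illustration of the configurability point, the sketch below wraps docker container run with explicit resource limits. It assumes the --memory and --cpus options of that command; the helper name run_limited and the example values are hypothetical, and alpine is simply a small container image.

```shell
# Hypothetical helper: run a command in a container with resource limits.
# Usage: run_limited MEMORY CPUS IMAGE [COMMAND...]
run_limited() {
    mem="$1"; cpus="$2"; image="$3"
    shift 3
    # --memory caps the RAM and --cpus caps the CPU time the container may use.
    docker container run --memory "$mem" --cpus "$cpus" "$image" "$@"
}

# Example (needs a running Docker installation):
#   run_limited 512m 2 alpine echo 'Hello from a limited container'
```

On a laptop you might pass small values, while on a cluster node the same container image could be given much more memory and more CPUs.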

The rest of this workshop will show you how to download and run containers from pre-existing container images on your own computer, and how to create and share your own container images.

Use cases for containers

Now that we have discussed a little bit about containers – what they do and the issues they attempt to address – you may be able to think of a few potential use cases in your area of work. Some examples of common use cases for containers in a research context include:

Using containers solely on your own computer to use a specific software tool or to test out a tool (possibly to avoid a difficult and complex installation process, to save your time or to avoid dependency hell).
Creating a Dockerfile that generates a container image with software that you specify installed, then sharing a container image generated using this Dockerfile with your collaborators for use on their computers or a remote computing resource (e.g. cloud-based or HPC system).
Archiving the container images so you can repeat analysis/modelling using the same software and configuration in the future – capturing your workflow.
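The archiving use case can be sketched with Docker’s save and load commands. This is a minimal example rather than the only way to archive container images; it assumes the --output and --input options, and the helper names and file names are hypothetical.

```shell
# Hypothetical helper: write a container image to a tar archive.
# Usage: archive_image IMAGE ARCHIVE
archive_image() {
    image="$1"; archive="$2"
    docker image save --output "$archive" "$image"
}

# Hypothetical helper: load a container image back from such an archive.
# Usage: restore_image ARCHIVE
restore_image() {
    docker image load --input "$1"
}

# Example (needs a running Docker installation):
#   archive_image hello-world hello-world.tar
#   restore_image hello-world.tar
```

The resulting tar file can be stored alongside your data and analysis scripts.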

Key Points

Almost all software depends on other software components to function, but these components have independent evolutionary paths.

Small environments that contain only the software that is needed for a given task are easier to replicate and maintain.

Critical systems that cannot be upgraded, due to cost, difficulty, etc. need to be reproduced on newer systems in a maintainable and self-documented way.

Virtualization allows multiple environments to run on a single computer.

Containerization improves upon the virtualization of whole computers by allowing efficient management of the host computer’s memory and storage resources.

Containers are built from ‘recipes’ that define the required set of software components and the instructions necessary to build/install them within a container image.

Docker is just one software platform that can create containers and the resources they use.

Introducing the Docker Command Line

Overview

Teaching: 10 min
Exercises: 0 min

Questions

How do I know Docker is installed and running?

How do I interact with Docker?

Objectives

Explain how to check that Docker is installed and is ready to use.

Demonstrate some initial Docker command line interactions.

Use the built-in help for Docker commands.

Docker command line

Start the Docker application that you installed in working through the setup instructions for this session. Note that this might not be necessary if your laptop is running Linux or if the installation added the Docker application to your startup process.

You may need to log in to Docker Hub

The Docker application will usually provide a way for you to log in to the Docker Hub using the application’s menu (macOS) or systray icon (Windows) and it is usually convenient to do this when the application starts. This will require you to use your Docker Hub username and your password. We will not actually require access to the Docker Hub until later in the course but if you can log in now, you should do so.

Determining your Docker Hub username

If you no longer recall your Docker Hub username, e.g., because you have been logging into the Docker Hub using your email address, you can find out what it is through the following steps:

Open https://hub.docker.com/ in a web browser window

Sign in using your email and password (don’t tell us what it is)

In the top-right of the screen you will see your username

Once your Docker application is running, open a shell (terminal) window, and run the following command to check that Docker is installed and the command line tools are working correctly. Below is the output for a Mac version, but the specific version is unlikely to matter much: it does not have to precisely match the one listed below.

$ docker --version

Docker version 20.10.5, build 55c4c88

The above command does not actually rely on the part of Docker that runs containers; it only confirms that Docker is installed and that you can access it correctly from the command line.
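If you ever need to make this check from a script, one possible sketch is shown below. It assumes the version line has the same shape as the output above (“Docker version X.Y.Z, build …”); the function name is hypothetical.

```shell
# Hypothetical helper: print the Docker client version number,
# or a hint if the docker command cannot be found.
docker_client_version() {
    if ! command -v docker >/dev/null 2>&1; then
        echo "docker not found on PATH"
        return 1
    fi
    # Take the third word of the version line and drop the trailing comma.
    docker --version | awk '{ gsub(",", "", $3); print $3 }'
}

# Example:
#   docker_client_version
```

As with docker --version itself, this only tells you that the command line tool is available, not that containers can be run.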

A command that checks that Docker is working correctly is the docker container ls command (we cover this command in more detail later in the course).

Without explaining the details, output on a newly installed system would likely be:

$ docker container ls

CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES

(The command docker system info could also be used to verify that Docker is correctly installed and operational but it produces a larger amount of output.)

However, if you instead get a message similar to the following

Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

then you need to check that you have started the Docker Desktop, Docker Engine, or however else you worked through the setup instructions.
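The checks described above can be combined into a small sketch that separates “not installed” from “installed but not running”. It relies only on the exit status of docker container ls; the function name and the messages are hypothetical.

```shell
# Hypothetical helper: report whether Docker is ready to run containers.
docker_status() {
    if ! command -v docker >/dev/null 2>&1; then
        echo "not installed"
    elif ! docker container ls >/dev/null 2>&1; then
        # The command exists, but it could not talk to the Docker daemon.
        echo "installed, but the daemon is not reachable"
    else
        echo "ready"
    fi
}

docker_status
```

If this prints the second message, starting Docker as described in the setup instructions and running the check again is a reasonable next step.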

Getting help

Often when working with a new command line tool, we need to get help. These tools often have some sort of subcommand or flag (usually help, -h, or --help) that displays a prompt describing how to use the tool. For Docker, it’s no different. If we run docker --help, we see the following output (running docker also produces the help message):

Usage:  docker [OPTIONS] COMMAND

A self-sufficient runtime for containers

Options:
      --config string      Location of client config files (default "/Users/vini/.docker")
  -c, --context string     Name of the context to use to connect to the daemon (overrides DOCKER_HOST env var and default context set with "docker context use")
  -D, --debug              Enable debug mode
  -H, --host list          Daemon socket(s) to connect to
  -l, --log-level string   Set the logging level ("debug"|"info"|"warn"|"error"|"fatal") (default "info")
      --tls                Use TLS; implied by --tlsverify
      --tlscacert string   Trust certs signed only by this CA (default "/Users/vini/.docker/ca.pem")
      --tlscert string     Path to TLS certificate file (default "/Users/vini/.docker/cert.pem")
      --tlskey string      Path to TLS key file (default "/Users/vini/.docker/key.pem")
      --tlsverify          Use TLS and verify the remote
  -v, --version            Print version information and quit

Management Commands:
  app*        Docker App (Docker Inc., v0.9.1-beta3)
  builder     Manage builds
  buildx*     Build with BuildKit (Docker Inc., v0.5.1-docker)
  config      Manage Docker configs
  container   Manage containers
  context     Manage contexts
  image       Manage images
  manifest    Manage Docker image manifests and manifest lists
  network     Manage networks
  node        Manage Swarm nodes
  plugin      Manage plugins
  scan*       Docker Scan (Docker Inc., v0.6.0)
  secret      Manage Docker secrets
  service     Manage services
  stack       Manage Docker stacks
  swarm       Manage Swarm
  system      Manage Docker
  trust       Manage trust on Docker images
  volume      Manage volumes

Commands:
  attach      Attach local standard input, output, and error streams to a running container
  build       Build an image from a Dockerfile
  commit      Create a new image from a container's changes
  cp          Copy files/folders between a container and the local filesystem
  create      Create a new container
  diff        Inspect changes to files or directories on a container's filesystem
  events      Get real time events from the server
  exec        Run a command in a running container
  export      Export a container's filesystem as a tar archive
  history     Show the history of an image
  images      List images
  import      Import the contents from a tarball to create a filesystem image
  info        Display system-wide information
  inspect     Return low-level information on Docker objects
  kill        Kill one or more running containers
  load        Load an image from a tar archive or STDIN
  login       Log in to a Docker registry
  logout      Log out from a Docker registry
  logs        Fetch the logs of a container
  pause       Pause all processes within one or more containers
  port        List port mappings or a specific mapping for the container
  ps          List containers
  pull        Pull an image or a repository from a registry
  push        Push an image or a repository to a registry
  rename      Rename a container
  restart     Restart one or more containers
  rm          Remove one or more containers
  rmi         Remove one or more images
  run         Run a command in a new container
  save        Save one or more images to a tar archive (streamed to STDOUT by default)
  search      Search the Docker Hub for images
  start       Start one or more stopped containers
  stats       Display a live stream of container(s) resource usage statistics
  stop        Stop one or more running containers
  tag         Create a tag TARGET_IMAGE that refers to SOURCE_IMAGE
  top         Display the running processes of a container
  unpause     Unpause all processes within one or more containers
  update      Update configuration of one or more containers
  version     Show the Docker version information
  wait        Block until one or more containers stop, then print their exit codes

Run 'docker COMMAND --help' for more information on a command.

There is a list of commands and the end of the help message says: Run 'docker COMMAND --help' for more information on a command. For example, take the docker container ls command that we ran previously. We can see from the Docker help prompt that container is a Docker command, so to get help for that command, we run:

docker container --help  # or instead 'docker container'

Usage:  docker container COMMAND

Manage containers

Commands:
  attach      Attach local standard input, output, and error streams to a running container
  commit      Create a new image from a container's changes
  cp          Copy files/folders between a container and the local filesystem
  create      Create a new container
  diff        Inspect changes to files or directories on a container's filesystem
  exec        Run a command in a running container
  export      Export a container's filesystem as a tar archive
  inspect     Display detailed information on one or more containers
  kill        Kill one or more running containers
  logs        Fetch the logs of a container
  ls          List containers
  pause       Pause all processes within one or more containers
  port        List port mappings or a specific mapping for the container
  prune       Remove all stopped containers
  rename      Rename a container
  restart     Restart one or more containers
  rm          Remove one or more containers
  run         Run a command in a new container
  start       Start one or more stopped containers
  stats       Display a live stream of container(s) resource usage statistics
  stop        Stop one or more running containers
  top         Display the running processes of a container
  unpause     Unpause all processes within one or more containers
  update      Update configuration of one or more containers
  wait        Block until one or more containers stop, then print their exit codes

Run 'docker container COMMAND --help' for more information on a command.

There’s also help for the container ls command:

docker container ls --help  # this one actually requires the '--help' flag

Usage:  docker container ls [OPTIONS]

List containers

Aliases:
  ls, ps, list

Options:
  -a, --all             Show all containers (default shows just running)
  -f, --filter filter   Filter output based on conditions provided
      --format string   Pretty-print containers using a Go template
  -n, --last int        Show n last created containers (includes all states) (default -1)
  -l, --latest          Show the latest created container (includes all states)
      --no-trunc        Don't truncate output
  -q, --quiet           Only display container IDs
  -s, --size            Display total file sizes

You may notice that there are many commands that stem from the docker command. Instead of trying to remember all possible commands and options, it’s better to learn how to effectively get help from the command line. Although we can always search the web, getting the built-in help from our tool is often much faster and may provide the answer right away. This applies not only to Docker, but also to most command line-based tools.
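When a help message is long, it can be handy to search it for the option you care about. The sketch below pipes the built-in help through grep; the helper name help_for_option is hypothetical, and it assumes only that --help works for the given command.

```shell
# Hypothetical helper: show the help lines that mention a given option.
# Usage: help_for_option OPTION COMMAND [SUBCOMMAND...]
help_for_option() {
    option="$1"
    shift
    # -e lets grep accept a pattern that starts with a dash, such as --all.
    docker "$@" --help | grep -e "$option"
}

# Example (needs Docker installed):
#   help_for_option --all container ls
```

This keeps you in the terminal and uses the help text for exactly the Docker version you have installed.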

Docker Command Line Interface (CLI) syntax

In this lesson we use the newest Docker CLI syntax, introduced with Docker Engine version 1.13. This new syntax combines commands into groups you will most often want to interact with. In the help example above you can see image and container management commands, which can be used to interact with your images and containers respectively. With this new syntax you issue commands using the following pattern: docker [command] [subcommand] [additional options]

Comparing the output of two help commands above, you can see that the same thing can be achieved in multiple ways. For example to start a Docker container using the old syntax you would use docker run. To achieve the same with the new syntax, you use docker container run instead. Even though the old approach is shorter and still officially supported, the new syntax is more descriptive, less error-prone and is therefore recommended.
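To make the correspondence concrete, the sketch below prints a few old-style commands next to their grouped equivalents. The list is a small selection, not a complete mapping, and the function name is hypothetical.

```shell
# Hypothetical helper: translate a few old-style Docker commands
# into the grouped syntax used in this lesson.
to_new_syntax() {
    case "$1" in
        run)    echo "docker container run" ;;
        ps)     echo "docker container ls" ;;
        images) echo "docker image ls" ;;
        pull)   echo "docker image pull" ;;
        build)  echo "docker image build" ;;
        rmi)    echo "docker image rm" ;;
        *)      echo "no mapping recorded for: $1"; return 1 ;;
    esac
}

# Print the old and new forms side by side.
for legacy in run ps images pull build rmi; do
    printf 'docker %-7s ->  %s\n' "$legacy" "$(to_new_syntax "$legacy")"
done
```

This block only prints text, so it can be run even without Docker installed.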

Exploring a command

Run docker --help and pick a command from the list. Explore the help prompt for that command. Try to guess how a command would work by looking at the Usage: section of the prompt.

Solution

Suppose we pick the docker image build command:

docker image build --help

Usage:  docker image build [OPTIONS] PATH | URL | -

Build an image from a Dockerfile

Options:
     --add-host list           Add a custom host-to-IP mapping (host:ip)
     --build-arg list          Set build-time variables
     --cache-from strings      Images to consider as cache sources
     --cgroup-parent string    Optional parent cgroup for the container
     --compress                Compress the build context using gzip
     --cpu-period int          Limit the CPU CFS (Completely Fair Scheduler) period
     --cpu-quota int           Limit the CPU CFS (Completely Fair Scheduler) quota
 -c, --cpu-shares int          CPU shares (relative weight)
     --cpuset-cpus string      CPUs in which to allow execution (0-3, 0,1)
     --cpuset-mems string      MEMs in which to allow execution (0-3, 0,1)
     --disable-content-trust   Skip image verification (default true)
 -f, --file string             Name of the Dockerfile (Default is 'PATH/Dockerfile')
     --force-rm                Always remove intermediate containers
     --iidfile string          Write the image ID to the file
     --isolation string        Container isolation technology
     --label list              Set metadata for an image
 -m, --memory bytes            Memory limit
     --memory-swap bytes       Swap limit equal to memory plus swap: '-1' to enable unlimited swap
     --network string          Set the networking mode for the RUN instructions during build (default "default")
     --no-cache                Do not use cache when building the image
     --pull                    Always attempt to pull a newer version of the image
 -q, --quiet                   Suppress the build output and print image ID on success
     --rm                      Remove intermediate containers after a successful build (default true)
     --security-opt strings    Security options
     --shm-size bytes          Size of /dev/shm
 -t, --tag list                Name and optionally a tag in the 'name:tag' format
     --target string           Set the target build stage to build.
     --ulimit ulimit           Ulimit options (default [])

We could try to guess that the command could be run like this:

docker image build .

docker image build https://github.com/docker/rootfs.git

Here, https://github.com/docker/rootfs.git could be replaced by any URL that Docker can build from, for example a Git repository that contains a Dockerfile.

Key Points

A toolbar icon indicates that Docker is ready to use (on Windows and macOS).

You will typically interact with Docker using the command line.

To learn how to run a certain Docker command, we can type the command followed by the --help flag.

Break

Morning break

Exploring and Running Containers

Overview

Teaching: 20 min
Exercises: 10 min

Questions

How do I interact with Docker containers and container images on my computer?

Objectives

Use the correct command to see which Docker container images are on your computer.

Be able to download new Docker container images.

Demonstrate how to start an instance of a container from a container image.

Describe at least two ways to execute commands inside a running Docker container.

Reminder of terminology: container images and containers

Recall that a container image is the template from which particular instances of containers will be created.

Let’s explore our first Docker container. The Docker team provides a simple container image online called hello-world. We’ll start with that one.

Downloading Docker images

The docker image command is used to interact with Docker container images. You can find out what container images you have on your computer by using the following command (“ls” is short for “list”):

$ docker image ls

If you’ve just installed Docker, you won’t see any container images listed.

To get a copy of the hello-world Docker container image from the internet, run this command:

$ docker image pull hello-world

You should see output like this:

Using default tag: latest
latest: Pulling from library/hello-world
1b930d010525: Pull complete
Digest: sha256:f9dfddf63636d84ef479d645ab5885156ae030f611a56f3a7ac7f2fdd86d7e4e
Status: Downloaded newer image for hello-world:latest
docker.io/library/hello-world:latest

Docker Hub

Where did the hello-world container image come from? It came from the Docker Hub website, which is a place to share Docker container images with other people. More on that in a later episode.

Exercise: Check on Your Images

What command would you use to see if the hello-world Docker container image had downloaded successfully and was on your computer? Give it a try before checking the solution.

Solution

To see if the hello-world container image is now on your computer, run:

$ docker image ls

Note that the downloaded hello-world container image is not in the folder where you are in the terminal! (Run ls by itself to check.) The container image is not a file like our normal programs and documents; Docker stores it in a specific location that isn’t commonly accessed, so it’s necessary to use the special docker image command to see what Docker container images you have on your computer.
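If you would rather have a yes/no answer than read through a listing, one possible sketch is shown below. It assumes that docker image ls accepts an image name and a -q option that prints only image IDs; the function name have_image is hypothetical.

```shell
# Hypothetical helper: succeed if the named container image is on this computer.
have_image() {
    # With -q, one ID is printed per matching image and nothing if there is none.
    [ -n "$(docker image ls -q "$1" 2>/dev/null)" ]
}

if have_image hello-world; then
    echo "hello-world is present"
else
    echo "hello-world has not been downloaded yet"
fi
```

Note that this sketch also reports the container image as missing when Docker itself is not installed or not running.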

Running the `hello-world` container

To create and run containers from named Docker container images you use the docker container run command. Try the following docker container run invocation. Note that it does not matter what your current working directory is.

$ docker container run hello-world

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (amd64)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/get-started/

What just happened? When we use the docker container run command, Docker does three things:

1. Starts a Running Container: Starts a running container, based on the container image. Think of this as the “alive” or “inflated” version of the container – it’s actually doing something.
2. Performs Default Action: If the container has a default action set, it will perform that default action. This could be as simple as printing a message (as above) or running a whole analysis pipeline!
3. Shuts Down the Container: Once the default action is complete, the container stops running (or exits). The container image is still there, but nothing is actively running.

The hello-world container is set up to run an action by default – namely to print this message.

Using docker container run to get the image

We could have skipped the docker image pull step; if you use the docker container run command and you don’t already have a copy of the Docker container image, Docker will automatically pull the container image first and then run it.
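The behaviour described above can be approximated by hand, which may help to make it less mysterious. The sketch below is only an illustration of the idea, not how Docker implements it; it assumes that docker image inspect fails when the container image is not on your computer, and the function name is hypothetical.

```shell
# Hypothetical helper: pull the container image only if it is missing,
# then run it. This mimics what docker container run does on its own.
run_with_explicit_pull() {
    image="$1"
    shift
    if ! docker image inspect "$image" >/dev/null 2>&1; then
        docker image pull "$image"
    fi
    docker container run "$image" "$@"
}

# Example (needs a running Docker installation):
#   run_with_explicit_pull hello-world
```

In everyday use there is no need for such a helper, because docker container run already takes care of the pull step.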

Running a container with a chosen command

But what if we wanted to do something different with the container? The output just gave us a suggestion of what to do – let’s use a different Docker container image to explore what else we can do with the docker container run command. The suggestion above is to use ubuntu, but we’re going to run a different type of Linux, alpine, instead because it’s quicker to download.

Run the Alpine Docker container

Try downloading the alpine container image and using it to run a container. You can do it in two steps, or one. What are they?

What happened when you ran the Alpine Docker container?

$ docker container run alpine

If you have never used the alpine Docker container image on your computer, Docker probably printed a message that it couldn’t find the container image and had to download it. If you used the alpine container image before, the command will probably show no output. That’s because this particular container is designed for you to provide commands yourself. Try running this instead:

$ docker container run alpine cat /etc/os-release

You should see the output of the cat /etc/os-release command, which prints out the version of Alpine Linux that this container is using and a few additional bits of information.

Hello World, Part 2

Can you run a copy of the alpine container and make it print a “hello world” message?

Give it a try before checking the solution.

Solution

Use the same command as above, but with the echo command to print a message.

$ docker container run alpine echo 'Hello World'

So here, we see another option – we can provide commands at the end of the docker container run command and they will execute inside the running container.
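Only a single command (with its arguments) is passed this way. To run several commands in one container, a common approach is to hand them to a shell with sh -c. The sketch below assumes the alpine container image provides sh; the function name is hypothetical.

```shell
# Hypothetical helper: run several shell commands in one alpine container.
# Usage: run_in_alpine 'COMMAND; COMMAND; ...'
run_in_alpine() {
    # The quoted text is interpreted by sh inside the container,
    # not by the shell on your own computer.
    docker container run alpine sh -c "$1"
}

# Example (needs a running Docker installation):
#   run_in_alpine 'echo "Hello World"; cat /etc/os-release'
```

Quoting matters here: without the quotes, your local shell would split the commands at the semicolon and run the second one on your own computer.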

Running containers interactively

In all the examples above, Docker has started the container, run a command, and then immediately stopped the container. But what if we wanted to keep the container running so we could log into it and test drive more commands? The way to do this is by adding the interactive flags -i and -t (usually combined as -it) to the docker container run command and providing a shell (bash, sh, etc.) as our command. The alpine Docker container image doesn’t include bash so we need to use sh.

$ docker container run -it alpine sh

Technically…

Technically, the interactive flag is just -i – the extra -t (combined as -it above) is the “pseudo-TTY” option, a fancy term that means a text interface. This allows you to connect to a shell, like sh, using a command line. Since you usually want to have a command line when running interactively, it makes sense to use the two together.

Your prompt should change significantly to look like this:

/ #

That’s because you’re now inside the running container! Try these commands:

pwd
ls
whoami
echo $PATH
cat /etc/os-release

All of these are being run from inside the running container, so you’ll get information about the container itself, instead of your computer. To finish using the container, type exit.

/ # exit

Practice Makes Perfect

Can you find out the version of Ubuntu installed on the ubuntu container image? (Hint: You can use the same command as used to find the version of alpine.)

Can you also find the apt-get program? What does it do? (Hint: passing --help to almost any command will give you more information.)
Solution 1 – Interactive

Run an interactive ubuntu container – you can use docker image pull first, or just run it with this command:
$ docker container run -it ubuntu sh
OR you can get the bash shell instead
$ docker container run -it ubuntu bash
Then try running these commands:
/# cat /etc/os-release
/# apt-get --help
Exit when you’re done.
/# exit
Solution 2 – Run commands

Run a ubuntu container, first with a command to read out the Linux version:
$ docker container run ubuntu cat /etc/os-release
Then run a container with a command to print out the apt-get help:
$ docker container run ubuntu apt-get --help

Even More Options

There are many more options besides -it that can be used with the docker container run command! A few of them will be covered in later episodes, and we’ll share two more common ones here:

--rm: this option guarantees that any running container is completely removed from your computer after the container is stopped. Without this option, Docker actually keeps the “stopped” container around, which you’ll see in a later episode. Note that this option doesn’t impact the container images that you’ve pulled, just running instances of containers.

--name=: By default, Docker assigns a random name and ID number to each container instance that you run on your computer. If you want to be able to more easily refer to a specific running container, you can assign it a name using this option.
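As a minimal sketch of how these two options combine, the helper below wraps docker container run so that each container gets a name you choose and is removed once it exits. The function name run_named and the container name my-alpine are hypothetical, chosen only for illustration.

```shell
# Hypothetical helper (not part of Docker): run a throwaway container
# under a name of your choosing.
run_named() {
    name="$1"
    shift
    # --rm   : remove the container as soon as it stops
    # --name : use the given name instead of a random one
    docker container run --rm --name "$name" "$@"
}

# Example call (requires Docker to be installed and running):
#   run_named my-alpine alpine echo 'Hello World'
```

Because of --rm, a container started this way should not appear in the list of stopped containers after it finishes.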

Conclusion

So far, we’ve seen how to download Docker container images, use them to run commands inside running containers, and even how to explore a running container from the inside. Next, we’ll take a closer look at all the different kinds of Docker container images that are out there.

Key Points

The docker image pull command downloads Docker container images from the internet.

The docker image ls command lists Docker container images that are (now) on your computer.

The docker container run command creates running containers from container images and can run commands inside them.

When using the docker container run command, a container can run a default action (if it has one), a user specified action, or a shell to be used interactively.

Cleaning Up Containers

Overview

Teaching: 10 min
Exercises: 0 min

Questions

How do I interact with a Docker container on my computer?

How do I manage my containers and container images?

Objectives

Explain how to list running and completed containers.

Know how to list and remove container images.

Removing images

The container images and their corresponding containers can start to take up a lot of disk space if you don’t clean them up occasionally, so it’s a good idea to periodically remove containers and container images that you won’t be using anymore.

In order to remove a specific container image, you need to find out details about the container image, specifically, the “Image ID”. For example, say my laptop contained the following container image:

$ docker image ls

REPOSITORY       TAG         IMAGE ID       CREATED          SIZE
hello-world      latest      fce289e99eb9   15 months ago    1.84kB

You can remove the container image with a docker image rm command that includes the Image ID, such as:

$ docker image rm fce289e99eb9

or use the container image name, like so:

$ docker image rm hello-world

However, you may see this output:

Error response from daemon: conflict: unable to remove repository reference "hello-world" (must force) - container e7d3b76b00f4 is using its referenced image fce289e99eb9

This happens when Docker hasn’t cleaned up some of the previously running containers based on this container image. So, before removing the container image, we need to be able to see what containers are currently running, or have been run recently, and how to remove these.

What containers are running?

Working with containers, we are going to shift back to the command: docker container. Similar to docker image, we can list running containers by typing:

$ docker container ls

CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES

Notice that this command didn’t return any containers because our containers all exited and thus stopped running after they completed their work.

docker ps

The command docker ps serves the same purpose as docker container ls, and comes from the Unix shell command ps which describes running processes.

What containers have run recently?

There is also a way to list running containers, and those that have completed recently, which is to add the --all/-a flag to the docker container ls command as shown below.

$ docker container ls --all

CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS                     PORTS               NAMES
9c698655416a        hello-world         "/hello"            2 minutes ago       Exited (0) 2 minutes ago                       zen_dubinsky
6dd822cf6ca9        hello-world         "/hello"            3 minutes ago       Exited (0) 3 minutes ago                       eager_engelbart

Keeping it clean

You might be surprised at the number of containers Docker is still keeping track of. One way to prevent this from happening is to add the --rm flag to docker container run. This will completely wipe out the record of the run container when it exits. If you need a reference to the running container for any reason, don’t use this flag.

How do I remove an exited container?

To delete an exited container you can run the following command, inserting the CONTAINER ID for the container you wish to remove. It will repeat the CONTAINER ID back to you, if successful.

$ docker container rm 9c698655416a

9c698655416a

An alternative option for deleting exited containers is the docker container prune command. Note that this command doesn’t accept a container ID as an option because it deletes ALL exited containers! Be careful with this command, as deleting a container is permanent: once a container is deleted, you cannot get it back. If you have containers you may want to reconnect to, you should not use this command. It will ask you to confirm that you want to remove these containers (see the output below). If successful, it will print the full CONTAINER ID back to you for each container it has removed.

$ docker container prune

WARNING! This will remove all stopped containers.
Are you sure you want to continue? [y/N] y
Deleted Containers:
9c698655416a848278d16bb1352b97e72b7ea85884bff8f106877afe0210acfc
6dd822cf6ca92f3040eaecbd26ad2af63595f30bb7e7a20eacf4554f6ccc9b2b

Removing images, for real this time

Now that we’ve removed any potentially running or stopped containers, we can try again to delete the hello-world container image.

$ docker image rm hello-world

Untagged: hello-world:latest
Untagged: hello-world@sha256:5f179596a7335398b805f036f7e8561b6f0e32cd30a32f5e19d17a3cda6cc33d
Deleted: sha256:fce289e99eb9bca977dae136fbe2a82b6b7d4c372474c9235adc1741675f587e
Deleted: sha256:af0b15c8625bb1938f1d7b17081031f649fd14e6b233688eea3c5483994a66a3

The reason that there are a few lines of output is that a given container image may have been formed by merging multiple underlying layers. Any layers that are used by multiple Docker container images will only be stored once. Now the result of docker image ls should no longer include the hello-world container image.
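The two-step clean-up described in this episode (remove the exited containers first, then the container image) can be sketched as a single shell function. This is only an illustration: the function name remove_image_and_containers is hypothetical, and it relies on the --quiet and --filter options of docker container ls to collect the IDs of exited containers created from the given image.

```shell
# Hypothetical helper: remove the exited containers created from an image,
# then remove the image itself.
remove_image_and_containers() {
    image="$1"
    # --quiet prints only container IDs; the filters keep exited containers
    # whose image (ancestor) is the one we want to delete.
    ids=$(docker container ls --all --quiet \
        --filter "ancestor=$image" --filter "status=exited")
    if [ -n "$ids" ]; then
        # $ids is left unquoted on purpose so each ID becomes its own argument
        docker container rm $ids
    fi
    docker image rm "$image"
}

# Example call (requires Docker): remove_image_and_containers hello-world
```

If a container based on the image is still running, docker image rm will still refuse to remove the image, so this sketch only covers the exited-container case shown above.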

Key Points

docker container has subcommands used to interact and manage containers.

docker image has subcommands used to interact and manage container images.

docker container ls or docker ps can provide information on currently running containers.

Finding Containers on Docker Hub

Overview

Teaching: 10 min
Exercises: 10 min

Questions

What is the Docker Hub, and why is it useful?

Objectives

Explain how the Docker Hub augments Docker use.

Explore the Docker Hub webpage for a popular Docker container image.

Find the list of tags for a particular Docker container image.

Identify the three components of a container image’s identifier.

In the previous episode, we ran a few different containers derived from different container images: hello-world, alpine, and maybe ubuntu. Where did these container images come from? The Docker Hub!

Introducing the Docker Hub

The Docker Hub is an online repository of container images, a vast number of which are publicly available. A large number of the container images are curated by the developers of the software that they package. Also, many commonly used pieces of software that have been containerized into images are officially endorsed, which means that you can trust the container images to have been checked for functionality and stability, and to be free of malware.

Docker can be used without connecting to the Docker Hub

Note that while the Docker Hub is well integrated into Docker functionality, the Docker Hub is certainly not required for all types of use of Docker containers. For example, some organizations may run container infrastructure that is entirely disconnected from the Internet.

Exploring an Example Docker Hub Page

As an example of a Docker Hub page, let’s explore the page for the official Python language container images. The most basic form of containerized Python is in the python container image (which is endorsed by the Docker team). Open your web browser to https://hub.docker.com/_/python to see what is on a typical Docker Hub software page.

The top-left provides information about the name, short description, popularity (i.e., more than a billion downloads in the case of this container image), and endorsements.

The top-right provides the command to pull this container image to your computer.

The main body of the page contains many commonly used headings, such as:

Which tags (i.e., container image versions) are supported;
Summary information about where to get help, which computer architectures are supported, etc.;
A longer description of the container image;
Examples of how to use the container image; and
The license that applies.

The “How to use the image” section of most container images’ pages will provide examples that are likely to cover your intended use of the container image.

Exploring Container Image Versions

A single Docker Hub page can have many different versions of container images, based on the version of the software inside. These versions are indicated by “tags”. When referring to the specific version of a container image by its tag, you use a colon, :, like this:

CONTAINER_IMAGE_NAME:TAG

So if I wanted to download the python container image, with Python 3.8, I would use this name:

$ docker image pull python:3.8

But if I wanted to download a Python 3.6 container image, I would use this name:

$ docker image pull python:3.6

The default tag (which is used if you don’t specify one) is called latest.

So far, we’ve only seen container images that are maintained by the Docker team. However, it’s equally common to use container images that have been produced by individual owners or organizations. Container images that you create and upload to Docker Hub would fall into this category, as would the container images maintained by organizations like ContinuumIO (the folks who develop the Anaconda Python environment) or community groups like rocker, a group that builds community R container images.

The names for these group- or individually-managed container images have this format:

OWNER/CONTAINER_IMAGE_NAME:TAG

Repositories

The technical name for the contents of a Docker Hub page is a “repository.” The tag indicates the specific version of the container image that you’d like to use from a particular repository. So a slightly more accurate version of the above example is:
OWNER/REPOSITORY:TAG
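To make the three components concrete, here is a small sketch that splits an image name into owner, repository and tag using plain shell parameter expansion. The function name parse_image_name is hypothetical. The sketch assumes names without a registry host prefix or digest, and it fills in latest when no tag is given and library (the namespace Docker Hub uses for official images) when no owner is given.

```shell
# Hypothetical helper: split [OWNER/]REPOSITORY[:TAG] into its parts.
# Assumption: no registry host (e.g. "localhost:5000/...") and no digest.
parse_image_name() {
    name="$1"
    owner="library"   # namespace of official images on Docker Hub
    tag="latest"      # tag Docker uses when none is specified
    case "$name" in
        *:*) tag="${name##*:}"; name="${name%:*}" ;;
    esac
    case "$name" in
        */*) owner="${name%%/*}"; name="${name#*/}" ;;
    esac
    echo "owner=$owner repository=$name tag=$tag"
}

parse_image_name rocker/tidyverse:3.6.1   # owner=rocker repository=tidyverse tag=3.6.1
parse_image_name python:3.8               # owner=library repository=python tag=3.8
parse_image_name alpine                   # owner=library repository=alpine tag=latest
```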

What’s in a name?

How would I download the Docker container image produced by the rocker group that has version 3.6.1 of R and the tidyverse installed?

Note: the container image described in this exercise is large and won’t be used later in this lesson, so you don’t actually need to pull the container image – constructing the correct docker image pull command is sufficient.
Solution

First, search for rocker in Docker Hub. Then look for their tidyverse container image. You can look at the list of tags, or just guess that the tag is 3.6.1. Altogether, that means that the name of the container image we want to download is:
$ docker image pull rocker/tidyverse:3.6.1

Finding Container Images on Docker Hub

There are many different container images on Docker Hub. This is where the real advantage of using containers shows up – each container image represents a complete software installation that you can use and access without any extra work!

The easiest way to find container images is to search on Docker Hub, but sometimes software pages have a link to their container images from their home page.

Note that anyone can create an account on Docker Hub and share container images there, so it’s important to exercise caution when choosing a container image on Docker Hub. These are some indicators that a container image on Docker Hub is consistently maintained, functional and secure:

The container image is updated regularly.
The container image is associated with a well established company, community, or other group that is well-known.
There is a Dockerfile or other listing of what has been installed to the container image.
The container image page has documentation on how to use the container image.

If a container image is never updated, created by a random person, and does not have a lot of metadata, it is probably worth skipping over. Even if such a container image is secure, it is not reproducible and not a dependable way to run research computations.

What container image is right for you?

Find a Docker container image that’s relevant to you. Take into account the suggestions above of what to look for as you evaluate options. If you’re unsuccessful in your search, or don’t know what to look for, you can use the R or Python container image we’ve already seen.

Once you find a container image, use the skills from the previous episode to download the container image and explore it.

Key Points

The Docker Hub is an online repository of container images.

Many Docker Hub container images are public, and may be officially endorsed.

Each Docker Hub page about a container image provides structured information and subheadings.

Most Docker Hub pages about container images contain sections that provide examples of how to use those container images.

Many Docker Hub container images have multiple versions, indicated by tags.

The naming convention for Docker container images is: OWNER/CONTAINER_IMAGE_NAME:TAG

Break

Overview

Teaching: min
Exercises: min

Questions

Objectives

Lunch break

Key Points

Creating Your Own Container Images

Overview

Teaching: 20 min
Exercises: 15 min

Questions

How can I make my own Docker container images?

How do I document the ‘recipe’ for a Docker container image?

Objectives

Explain the purpose of a Dockerfile and show some simple examples.

Demonstrate how to build a Docker container image from a Dockerfile.

Compare the steps of creating a container image interactively versus a Dockerfile.

Create an installation strategy for a container image.

Demonstrate how to upload (‘push’) your container images to the Docker Hub.

Describe the significance of the Docker Hub naming scheme.

There are lots of reasons why you might want to create your own Docker container image.

You can’t find a container image with all the tools you need on Docker Hub.
You want to have a container image to “archive” all the specific software versions you ran for a project.
You want to share your workflow with someone else.

Interactive installation

Before creating a reproducible installation, let’s experiment with installing software inside a container. Start a container from the alpine container image we used before, interactively:

$ docker container run -it alpine sh

Because this is a basic container, there are a lot of things not installed – for example, python3.

/# python3

sh: python3: not found

Inside the container, we can run commands to install Python 3. The Alpine version of Linux has an installation tool called apk that we can use to install Python 3.

/# apk add --update python3 py3-pip python3-dev

We can test our installation by running a Python command:

/# python3 --version

Once Python is installed, we can add Python packages. Here we install Cython using the same apk package installer:

/# apk add cython

Exercise: Searching for Help

Can you find instructions for installing R on Alpine Linux? Do they work?
Solution

A quick search should hopefully show that the way to install R on Alpine Linux is:
/# apk add R

Once we exit, these changes are not saved to a new container image by default. There is a command that will “snapshot” our changes, but building container images this way is not easily reproducible. Instead, we’re going to take what we’ve learned from this interactive installation and create our container image from a reproducible recipe, known as a Dockerfile.

If you haven’t already, exit out of the interactively running container.

/# exit

Put installation instructions in a `Dockerfile`

A Dockerfile is a plain text file with keywords and commands that can be used to create a new container image.

Download the docker-intro.zip file and expand it, e.g.

wget https://epcced.github.io/2024-04-16_containers_bham/files/docker-intro.zip
unzip docker-intro.zip

From your shell, go to the folder you just created by expanding the zip file and print out the Dockerfile inside:

$ cd docker-intro/basic
$ cat Dockerfile

FROM <EXISTING IMAGE>
RUN <INSTALL CMDS FROM SHELL>
RUN <INSTALL CMDS FROM SHELL>
CMD <CMD TO RUN BY DEFAULT>

Let’s break this file down:

The first line, FROM, indicates which container image we’re starting with. It is the “base” container image we are going to start from.
The next two lines RUN, will indicate installation commands we want to run. These are the same commands that we used interactively above.
The last line, CMD, indicates the default command we want a container based on this container image to run, if no other command is provided. It is recommended to provide CMD in exec-form (see the CMD section of the Dockerfile documentation for more details). It is written as a list which contains the executable to run as its first element, optionally followed by any arguments as subsequent elements. The list is enclosed in square brackets ([]) and its elements are double-quoted (") strings which are separated by commas. For example, CMD ["ls", "-lF", "--color", "/etc"] would translate to ls -lF --color /etc.

shell-form and exec-form for CMD

Another way to specify the parameter for the CMD instruction is the shell-form. Here you type the command as you would call it from the command line. Docker then silently runs this command in the image’s standard shell. CMD cat /etc/passwd is equivalent to CMD ["/bin/sh", "-c", "cat /etc/passwd"]. We recommend preferring the more explicit exec-form because it lets us create more flexible container image command options and keeps complex commands unambiguous.
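One practical consequence of the difference can be imitated in an ordinary shell, without Docker. In the sketch below, shell_form hands a string to /bin/sh -c (as the shell-form does), while exec_form starts the program directly with its arguments untouched (as the exec-form does). The variable GREETING and both function names are made up for this illustration.

```shell
GREETING="hello"
export GREETING

# Shell-form analogue: the whole string goes through /bin/sh -c,
# so the shell expands $GREETING before echo runs.
shell_form() {
    /bin/sh -c 'echo $GREETING'
}

# Exec-form analogue: the echo program is started directly and receives
# the literal text $GREETING, because no shell processes the argument.
exec_form() {
    env echo '$GREETING'
}

shell_form   # prints: hello
exec_form    # prints: $GREETING
```

This is why an exec-form CMD that needs shell features such as variable expansion has to name the shell explicitly, as in the ["/bin/sh", "-c", "..."] example above.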

Exercise: Take a Guess

Do you have any ideas about what we should use to fill in the sample Dockerfile to replicate the installation we did above?
Solution:

Based on our experience above, edit the Dockerfile (in your text editor of choice) to look like this:
FROM alpine
RUN apk add --update python3 py3-pip python3-dev
RUN apk add cython
CMD ["python3", "--version"]

The recipe provided by the Dockerfile shown in the solution to the preceding exercise will use Alpine Linux as the base container image, add Python 3 and the Cython library, and set a default command to request Python 3 to report its version information.

Create a new Docker image

So far, we only have a text file named Dockerfile – we do not yet have a container image. We want Docker to take this Dockerfile, run the installation commands contained within it, and then save the resulting container as a new container image. To do this we will use the docker image build command.

We have to provide docker image build with two pieces of information:

the location of the Dockerfile
the name of the new container image. Remember the naming scheme from before? You should name your new image with your Docker Hub username and a name for the container image, like this: USERNAME/CONTAINER_IMAGE_NAME.

All together, the build command that you should run on your computer, will have a similar structure to this:

$ docker image build -t USERNAME/CONTAINER_IMAGE_NAME .

The -t option names the container image; the final dot indicates that the Dockerfile is in our current directory.

For example, if my user name was alice and I wanted to call my container image alpine-python, I would use this command:

$ docker image build -t alice/alpine-python .

Build Context

Notice that the final input to docker image build isn’t the Dockerfile – it’s a directory! In the command above, we’ve used the current working directory (.) of the shell as the final input to the docker image build command. This option provides what is called the build context to Docker. If there are files being copied into the built container image (more details in the next episode), they’re assumed to be in this location. Docker also expects to see a Dockerfile in the build context (unless you tell it to look elsewhere).

Even if it won’t need all of the files in the build context directory, Docker does “load” them before starting to build, which means that it’s a good idea to have only what you need for the container image in a build context directory, as we’ve done in this example.
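One way to keep a build context small is to give it a directory of its own. The sketch below writes the Dockerfile from the earlier exercise into a fresh directory that contains nothing else. The function name make_build_context and the directory name alpine-python-context are hypothetical.

```shell
# Hypothetical helper: create a directory holding only a Dockerfile,
# so the build context contains nothing Docker does not need.
make_build_context() {
    dir="$1"
    mkdir -p "$dir"
    cat > "$dir/Dockerfile" <<'EOF'
FROM alpine
RUN apk add --update python3 py3-pip python3-dev
RUN apk add cython
CMD ["python3", "--version"]
EOF
}

# Example (the build step requires Docker):
#   make_build_context ./alpine-python-context
#   docker image build -t alice/alpine-python ./alpine-python-context
```

Here the final argument to docker image build is the new directory rather than the dot, which is just another way of naming the build context.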

Exercise: Review!

Think back to earlier. What command can you run to check if your container image was created successfully? (Hint: what command shows the container images on your computer?)

We didn’t specify a tag for our container image name. What tag did Docker automatically use?

What command will run a container based on the container image you’ve created? What should happen by default if you run such a container? Can you make it do something different, like print “hello world”?
Solution

To see your new image, run docker image ls. You should see the name of your new container image under the “REPOSITORY” heading.

In the output of docker image ls, you can see that Docker has automatically used the latest tag for our new container image.

We want to use docker container run to run a container based on a container image.

The following command should run a container and print out our default message, the version of Python:
$ docker container run alice/alpine-python
To run a container based on our container image and print out “Hello world” instead:
$ docker container run alice/alpine-python echo "Hello World"

While it may not look like you have achieved much, you have already effected the combination of a lightweight Linux operating system with your specification to run a given command that can operate reliably on macOS, Microsoft Windows, Linux and on the cloud!

Boring but important notes about installation

There are a lot of choices when it comes to installing software – sometimes too many! Here are some things to consider when creating your own container image:

Start smart, or, don’t install everything from scratch! If you’re using Python as your main tool, start with a Python container image. Same with R. We’ve used Alpine Linux as an example in this lesson, but it’s generally not a good container image to start with for initial development and experimentation because it is a less common distribution of Linux; Ubuntu, Debian and CentOS are all good options for scientific software installations. The program you’re using might recommend a particular distribution of Linux, and if so, it may be useful to start with a container image for that distribution.
How big? How much software do you really need to install? When you have a choice, lean towards using smaller starting container images and installing only what’s needed for your software, as a bigger container image means longer download times to use.
Know (or Google) your Linux. Different distributions of Linux often have distinct sets of tools for installing software. The apk command we used above is the software package installer for Alpine Linux. The installers for various common Linux distributions are listed below:
- Ubuntu: apt or apt-get
- Debian: apt or apt-get
- CentOS: yum
Most common software installations are available to be installed via these tools. A web search for “install X on Y Linux” is usually a good start for common software installation tasks; if something isn’t available via the Linux distribution’s installation tools, try the options below.
Use what you know. You’ve probably used commands like pip or install.packages() before on your own computer – these will also work to install things in container images (if the basic scripting language is installed).
README. Many scientific software tools have a README or installation instructions that lay out how to install software. You want to look for instructions for Linux. If the install instructions include options like those suggested above, try those first.

In general, a good strategy for installing software is:

Make a list of what you want to install.
Look for pre-existing container images.
Read through instructions for software you’ll need to install.
Try installing everything interactively in your base container – take notes!
From your interactive installation, create a Dockerfile and then try to build the container image from that.

Container images that you release publicly can be stored on the Docker Hub for free. If you name your container image as described above, with your Docker Hub username, all you need to do is run the opposite of docker image pull – docker image push.

$ docker image push alice/alpine-python

Make sure to substitute the full name of your container image!

In a web browser, open https://hub.docker.com, and on your user page you should now see your container image listed, for anyone to use or build on.

Logging In

Technically, you have to be logged into Docker on your computer for this to work. Usually it happens by default, but if docker image push doesn’t work for you, run docker login first, enter your Docker Hub username and password, and then try docker image push again.

What’s in a name? (again)

You don’t have to name your container images using the USERNAME/CONTAINER_IMAGE_NAME:TAG naming scheme. On your own computer, you can call container images whatever you want, and refer to them by the names you choose. It’s only when you want to share a container image that it needs the correct naming format.

You can rename container images using the docker image tag command. For example, imagine someone named Alice has been working on a workflow container image and called it workflow-test on her own computer. She now wants to share it in her alice Docker Hub account with the name workflow-complete and a tag of v1. Her docker image tag command would look like this:

$ docker image tag workflow-test alice/workflow-complete:v1

She could then push the re-named container image to Docker Hub, using docker image push alice/workflow-complete:v1
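The rename-then-push sequence can be sketched as one small function. The name publish_image is hypothetical; the function simply chains the two commands described above and only pushes if the tagging step succeeded.

```shell
# Hypothetical helper: give a local image its Docker Hub name, then push it.
publish_image() {
    local_name="$1"   # name used on your own computer, e.g. workflow-test
    hub_name="$2"     # USERNAME/CONTAINER_IMAGE_NAME:TAG on Docker Hub
    docker image tag "$local_name" "$hub_name" &&
        docker image push "$hub_name"
}

# Example call (requires Docker and a Docker Hub login):
#   publish_image workflow-test alice/workflow-complete:v1
```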

Key Points

Dockerfiles specify what is within Docker container images.

The docker image build command is used to build a container image from a Dockerfile.

You can share your Docker container images through the Docker Hub so that others can create Docker containers from your container images.

Creating More Complex Container Images

Overview

Teaching: 30 min
Exercises: 30 min

Questions

How can I make more complex container images?

Objectives

Explain how you can include files within Docker container images when you build them.

Explain how you can access files on the Docker host from your Docker containers.

In order to create and use your own container images, you may need more information than our previous example. You may want to use files from outside the container, that are not included within the container image, either by copying the files into the container image, or by making them visible within a running container from their existing location on your host system. You may also want to learn a little bit about how to install software within a running container or a container image. This episode will look at these advanced aspects of running a container or building a container image. Note that the examples will get gradually more and more complex – most day-to-day use of containers and container images can be accomplished using the first 1–2 sections on this page.

Using scripts and files from outside the container

In your shell, change to the sum folder in the docker-intro folder and look at the files inside.

$ cd docker-intro/sum
$ ls

This folder has both a Dockerfile and a Python script called sum.py. Let’s say we wanted to try running the script using a container based on our recently created alpine-python container image.

Running containers

What command would we use to run Python from the alpine-python container?

If we try running the container and Python script, what happens?

$ docker container run alice/alpine-python python3 sum.py

python3: can't open file 'sum.py': [Errno 2] No such file or directory

No such file or directory

What does the error message mean? Why might the Python inside the container not be able to find or open our script?

The problem here is that the container and its filesystem are separate from our host computer’s filesystem. When the container runs, it can’t see anything outside itself, including any of the files on our computer. In order to use Python (inside the container) and our script (outside the container, on our host computer), we need to create a link between the directory on our computer and the container.

This link is called a “mount” and is what happens automatically when a USB drive or other external hard drive gets connected to a computer – you can see the contents appear as if they were on your computer.

We can create a mount between our computer and the running container by using an additional option to docker container run. We’ll also use the variable ${PWD} which will substitute in our current working directory. The option will look like this

--mount type=bind,source=${PWD},target=/temp

What this means is: make my current working directory (on the host computer) – the source – visible within the container that is about to be started, and inside this container, name the directory /temp – the target.

Types of mounts

You will notice that we set the mount type=bind. There are other types of mount that can be used in Docker (e.g. volume and tmpfs). We do not cover other types of mounts or the differences between these mount types in the course, as it is more of an advanced topic. You can find more information on the different mount types in the Docker documentation.

Let’s try running the command now:

$ docker container run --mount type=bind,source=${PWD},target=/temp alice/alpine-python python3 sum.py

But we get the same error!

python3: can't open file 'sum.py': [Errno 2] No such file or directory

This final piece is a bit tricky – we really have to remember to put ourselves inside the container. Where is the sum.py file? It’s in the directory that’s been mapped to /temp – so we need to include that in the path to the script. This command should give us what we need:

$ docker container run --mount type=bind,source=${PWD},target=/temp alice/alpine-python python3 /temp/sum.py
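To make the idea of "putting ourselves inside the container" more concrete, here is a small Python sketch of how a host path translates to a container path under a bind mount. The function name container_path and the example host directory are assumptions for illustration only; Docker does not provide such a function.

```python
from pathlib import PurePosixPath


def container_path(host_path, source, target):
    """Translate a path on the host into the path the same file has
    inside the container, given a bind mount of `source` onto `target`."""
    # relative_to raises ValueError when host_path is not inside source,
    # which mirrors the fact that the container cannot see such files
    relative = PurePosixPath(host_path).relative_to(PurePosixPath(source))
    return str(PurePosixPath(target) / relative)


# Hypothetical host directory standing in for ${PWD}
source = "/Users/yourname/Desktop/docker-intro/sum"
print(container_path(source + "/sum.py", source, "/temp"))  # /temp/sum.py
```

Files outside the mounted directory have no container path at all, which is why the sketch raises an error for them.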

Note that if we create any files in the /temp directory while the container is running, these files will appear on our host filesystem in the original directory and will stay there even when the container stops.

Other Commonly Used Docker Run Flags

Docker run has many other useful flags to alter its function. A couple that are commonly used include -w and -u.

The --workdir/-w flag sets the working directory, that is, the directory inside the container in which the command is run. For example, the following command runs pwd in a container started from the latest ubuntu image, with /home/alice as the working directory, and prints /home/alice. If the directory doesn’t exist in the image, Docker will create it.
docker container run -w /home/alice/ ubuntu pwd
The --user/-u flag lets you specify the username you would like to run the container as. This is helpful if you’d like to write files to a mounted folder and not write them as root but rather your own user identity and group. A common example of the -u flag is --user $(id -u):$(id -g) which will fetch the current user’s ID and group and run the container as that user.
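If you drive Docker from a Python script rather than from the shell, the same user and group IDs can be looked up as in the sketch below. The helper name user_option is hypothetical, and the functions used are only available on Unix-like systems such as Linux and macOS.

```python
import os


def user_option():
    """Return "UID:GID" for the current user, the same value that the
    shell expression $(id -u):$(id -g) produces."""
    # os.geteuid and os.getegid do not exist on Windows
    return f"{os.geteuid()}:{os.getegid()}"


print("--user", user_option())
```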

Exercise: Explore the script

What happens if you use the docker container run command above and put numbers after the script name?

Solution

This script comes from the Python Wiki and is set to add all numbers that are passed to it as arguments.
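For readers who want to see the idea in code, here is a minimal sketch of a script that behaves this way. It is an assumption about what such a script might look like, written to match the messages shown in this lesson, and is not necessarily identical to the sum.py file you downloaded.

```python
import sys


def sum_arguments(arguments):
    """Add up command-line arguments, which always arrive as strings."""
    return sum(int(argument) for argument in arguments)


def main(argv):
    try:
        # argv[0] is the script name, so the numbers start at argv[1]
        print("sum =", sum_arguments(argv[1:]))
    except ValueError:
        # int() raises ValueError for anything that is not an integer
        print("Please supply integer arguments")


if __name__ == "__main__":
    main(sys.argv)
```

With no arguments the total is 0, because the sum of an empty sequence is 0.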

Exercise: Checking the options

Our Docker command has gotten much longer! Can you go through each piece of the Docker command above and explain what it does? How would you characterize the key components of a Docker command?

Solution

Here’s a breakdown of each piece of the command above

docker container run: use Docker to run a container

--mount type=bind,source=${PWD},target=/temp: connect my current working directory (${PWD}) as a folder inside the container called /temp

alice/alpine-python: name of the container image to use to run the container

python3 /temp/sum.py: what commands to run in the container

More generally, most Docker commands that run a container have the form: docker [action] [docker options] [docker container image] [command to run inside]

Exercise: Interactive jobs

Try using the directory mount option but run the container interactively. Can you find the folder that’s connected to your host computer? What’s inside?
Solution

The docker command to run the container interactively is:
$ docker container run --mount type=bind,source=${PWD},target=/temp -it alice/alpine-python sh
Once inside, you should be able to navigate to the /temp folder and see that its contents are the same as the files on your host computer:
/# cd /temp
/# ls

Mounting a directory can be very useful when you want to run the software inside your container on many different input files. In other situations, you may want to save or archive an authoritative version of your data by adding it to the container image permanently. That’s what we will cover next.

Including your scripts and data within a container image

Our next project will be to add our own files to a container image – something you might want to do if you’re sharing a finished analysis or just want to have an archived copy of your entire analysis including the data. Let’s assume that we’ve finished with our sum.py script and want to add it to the container image itself.

In your shell, you should still be in the sum folder in the docker-intro folder.

$ pwd

/Users/yourname/Desktop/docker-intro/sum

Let’s add a new line to the Dockerfile we’ve been using so far to create a copy of sum.py. We can do so by using the COPY keyword.

COPY sum.py /home

This line will cause Docker to copy the file from your computer into the container’s filesystem. Let’s build the container image like before, but give it a different name:

$ docker image build -t alice/alpine-sum .

The Importance of Command Order in a Dockerfile

When you run docker image build, Docker executes the instructions in the order they appear in the Dockerfile. This order matters when rebuilding, and you will typically want to put your RUN commands before your COPY commands.

Docker builds a layer for each instruction, in order. This becomes important when you need to rebuild container images. If you change an instruction later in the Dockerfile and rebuild the container image, Docker doesn’t need to rebuild the earlier layers; it will instead use a stored (“cached”) version of those layers.

For example, suppose you wanted to copy multiply.py into the container image instead of sum.py. If the COPY line came before the RUN line, Docker would need to rebuild both the COPY layer and the RUN layer. If the COPY line came after the RUN line, Docker would use the cached RUN layer from the previous build and only rebuild the COPY layer.
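The effect of instruction order on the cache can be sketched with a simplified model in Python. This is an illustration only: the function name layers_to_rebuild is hypothetical, and real Docker also takes the contents of copied files into account, not just the text of each instruction.

```python
def layers_to_rebuild(previous, current):
    """Return the instructions in `current` that must be rebuilt, given
    the instructions used for the `previous` build.

    Simplified model: cached layers are reused up to the first
    instruction that differs; that instruction and every instruction
    after it are rebuilt.
    """
    for index, instruction in enumerate(current):
        if index >= len(previous) or previous[index] != instruction:
            return current[index:]
    return []


run = "RUN apk add --update python3 py3-pip python3-dev"

# COPY placed after RUN: only the COPY layer is rebuilt
print(layers_to_rebuild(
    ["FROM alpine", run, "COPY sum.py /home"],
    ["FROM alpine", run, "COPY multiply.py /home"],
))

# COPY placed before RUN: the COPY layer and the RUN layer are rebuilt
print(layers_to_rebuild(
    ["FROM alpine", "COPY sum.py /home", run],
    ["FROM alpine", "COPY multiply.py /home", run],
))
```

Since the RUN step that installs software is usually the slow one, keeping it ahead of the frequently changing COPY step saves the most time.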

Exercise: Did it work?

Can you remember how to run a container interactively? Try that with this one. Once inside, try running the Python script.
Solution

You can start the container interactively like so:
$ docker container run -it alice/alpine-sum sh
You should be able to run the python command inside the container like this:
/# python3 /home/sum.py

This COPY keyword can be used to place your own scripts or own data into a container image that you want to publish or use as a record. Note that it’s not necessarily a good idea to put your scripts inside the container image if you’re constantly changing or editing them. In that case, it is better to reference the scripts from outside the container, as we did in the previous section. You also want to think carefully about size – if you run docker image ls you’ll see the size of each container image all the way on the right of the screen. The bigger your container image becomes, the harder it will be to download.

Security Warning

Login credentials, including passwords, tokens, secure access tokens or other secrets, must never be stored in a container image. If secrets are stored, they are at high risk of being found and exploited once the image is made public.

Copying alternatives

Another trick for getting your own files into a container image is by using the RUN keyword and downloading the files from the internet. For example, if your code is in a GitHub repository, you could include this statement in your Dockerfile to download the latest version every time you build the container image:
RUN git clone https://github.com/alice/mycode
Similarly, the wget command can be used to download any file publicly available on the internet:
RUN wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.10.0/ncbi-blast-2.10.0+-x64-linux.tar.gz
Note that the above RUN examples depend on commands (git and wget respectively) that must be available within your container: Linux distributions such as Alpine may require you to install such commands before using them within RUN statements.

More fancy `Dockerfile` options (optional, for presentation or as exercises)

We can expand on the example above to make our container image even more “automatic”. Here are some ideas:

Make the `sum.py` script run automatically

FROM alpine
RUN apk add --update python3 py3-pip python3-dev
COPY sum.py /home

# Run the sum.py script as the default command
CMD ["python3", "/home/sum.py"]

Build and test it:

$ docker image build -t alpine-sum:v1 .
$ docker container run alpine-sum:v1

You’ll notice that you can run the container without arguments just fine, resulting in sum = 0, but this is boring. Supplying arguments, however, doesn’t work:

$ docker container run alpine-sum:v1 10 11 12

results in

docker: Error response from daemon: OCI runtime create failed:
container_linux.go:349: starting container process caused "exec:
\"10\": executable file not found in $PATH": unknown.

This is because the arguments 10 11 12 are interpreted as a command that replaces the default command given by CMD ["python3", "/home/sum.py"] in the image.

To achieve the goal of having a command that always runs when a container is run from the container image and can be passed the arguments given on the command line, use the keyword ENTRYPOINT in the Dockerfile.

FROM alpine

RUN apk add --update python3 py3-pip python3-dev
COPY sum.py /home

# Run the sum.py script as the default command and
# allow people to enter arguments for it
ENTRYPOINT ["python3", "/home/sum.py"]

# Give default arguments, in case none are supplied on
# the command-line
CMD ["10", "11"]

Build and test it:

$ docker image build -t alpine-sum:v2 .
# Most of the time you are interested in the sum of 10 and 11:
$ docker container run alpine-sum:v2
# Sometimes you have more challenging calculations to do:
$ docker container run alpine-sum:v2 12 13 14
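The way ENTRYPOINT, CMD and the arguments on the command line combine can be summarised with a short Python model. This is a sketch of the rule for the list ("exec") form used in the Dockerfiles above; the function name container_argv is hypothetical and is not part of Docker.

```python
def container_argv(entrypoint, cmd, run_arguments):
    """Model the command a container ends up running.

    `entrypoint` and `cmd` are the lists from the Dockerfile, and
    `run_arguments` is whatever follows the image name on the
    docker container run command line.
    """
    # Arguments given on the command line replace CMD entirely
    arguments = run_arguments if run_arguments else cmd
    # ENTRYPOINT is always kept and the arguments are appended to it
    return entrypoint + arguments


# alpine-sum:v1 has only CMD, so "10 11 12" replaces the whole command
print(container_argv([], ["python3", "/home/sum.py"], ["10", "11", "12"]))

# alpine-sum:v2 has ENTRYPOINT plus default arguments in CMD
print(container_argv(["python3", "/home/sum.py"], ["10", "11"], []))
print(container_argv(["python3", "/home/sum.py"], ["10", "11"], ["12", "13", "14"]))
```

The first call returns just the three numbers, which is why Docker reported that it could not find an executable called 10.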

Overriding the ENTRYPOINT

Sometimes you don’t want to run the image’s ENTRYPOINT. For example, suppose you have a specialized container image that only does sums, but you need an interactive shell to examine the container. Passing the shell as the command:
$ docker container run -it alpine-sum:v2 /bin/sh
will yield
Please supply integer arguments
because /bin/sh is handed to sum.py as an argument rather than being run. You need to override the ENTRYPOINT statement in the container image like so:
$ docker container run -it --entrypoint /bin/sh alpine-sum:v2

Add the `sum.py` script to the `PATH` so you can run it directly:

FROM alpine

RUN apk add --update python3 py3-pip python3-dev

COPY sum.py /home
# set script permissions
RUN chmod +x /home/sum.py
# add /home folder to the PATH
ENV PATH=/home:$PATH

Build and test it:

$ docker image build -t alpine-sum:v3 .
$ docker container run alpine-sum:v3 sum.py 1 2 3 4
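Both steps in this Dockerfile matter: the script must be executable and its folder must be on the PATH. The Python sketch below imitates that lookup on a Unix-like system using a temporary directory in place of /home. The function name lookup_before_and_after_chmod and the stand-in script contents are assumptions for illustration.

```python
import os
import shutil
import stat
import tempfile


def lookup_before_and_after_chmod():
    """Search a PATH-style directory list for sum.py before and after
    the file is given execute permission."""
    with tempfile.TemporaryDirectory() as home:
        script = os.path.join(home, "sum.py")
        with open(script, "w") as handle:
            # Stand-in contents; the first line tells the system to use python3
            handle.write("#!/usr/bin/env python3\nprint('sum = 0')\n")

        # Without execute permission the file is not found as a command
        before = shutil.which("sum.py", path=home)

        # Equivalent of: chmod +x sum.py
        mode = os.stat(script).st_mode
        os.chmod(script, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
        after = shutil.which("sum.py", path=home)
    return before, after


before, after = lookup_before_and_after_chmod()
print("found before chmod:", before is not None)
print("found after chmod:", after is not None)
```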

Best practices for writing Dockerfiles

Take a look at Nüst et al.’s “Ten simple rules for writing Dockerfiles for reproducible data science” [1] for some great examples of best practices to use when writing Dockerfiles. The GitHub repository associated with the paper also has a set of example Dockerfiles demonstrating how the rules highlighted by the paper can be applied.

[1] Nüst D, Sochat V, Marwick B, Eglen SJ, Head T, et al. (2020) Ten simple rules for writing Dockerfiles for reproducible data science. PLOS Computational Biology 16(11): e1008316. https://doi.org/10.1371/journal.pcbi.1008316

Key Points

Docker allows containers to read and write files from the Docker host.

You can include files from your Docker host into your Docker container images by using the COPY instruction in your Dockerfile.

Break

Overview

Teaching: min
Exercises: min

Questions

Objectives

Afternoon break

Key Points

Examples of Using Container Images in Practice

Overview

Teaching: 20 min
Exercises: 0 min

Questions

How can I use Docker for my own work?

Objectives

Use existing container images and Docker in a research project.

Now that we have learned the basics of working with Docker container images and containers, let’s apply what we learned to an example workflow.

You may choose one or more of the following examples to practice using containers.

Jekyll Website Example

In this Jekyll Website example, you can practice rendering this lesson website on your computer using the Jekyll static website generator in a Docker container. Rendering the website in a container avoids a complicated software installation; instead of installing Jekyll and all the other tools needed to create the final website, all the work can be done in the container. Additionally, when you no longer need to render the website, you can easily and cleanly remove the software from your computer.

GitHub Actions Example

In this GitHub Actions example, you can learn more about continuous integration in the cloud and how you can use container images with GitHub to automate repetitive tasks like testing code or deploying websites.

Using Containers on an HPC Cluster

It is possible to run containers on shared computing systems run by a university or national computing center. As a researcher, you can build container images and test containers on your own computer and then run your full-scale computing work on a shared computing system like a high performance cluster or high throughput grid.

The catch? Most university and national computing centers do not support running containers with Docker commands, and instead use a similar tool such as Singularity or Shifter. However, both of these programs can run containers based on Docker container images, so people often create their container image as a Docker container image and then run it using either Docker or Singularity.

There isn’t yet a working example of how to use Docker container images on a shared computing system, partially because each system is slightly different, but the following resources show what it can look like:

Introduction to Singularity: See the episode titled “Running MPI parallel jobs using Singularity containers”
Container Workflows at Pawsey: See the episode titled “Run containers on HPC with Shifter (and Singularity)”

Seeking Examples

Do you have another example of using Docker in a workflow related to your field? Please open a lesson issue or submit a pull request to add it to this episode and the extras section of the lesson.

Key Points

There are many ways you might use Docker and existing container images in your research project.

Singularity: Getting started

Overview

Teaching: 15 min
Exercises: 10 min

Questions

What is Singularity and why might I want to use it?

Objectives

Understand what Singularity is and when you might want to use it.

Undertake your first run of a simple Singularity container.

The episodes in this lesson will introduce you to the Singularity container platform and demonstrate how to set up and use Singularity.

What is Singularity?

Singularity (or Apptainer, we’ll get to this in a minute…) is a container platform that supports packaging and deploying software and tools in a portable and reproducible manner.

You may be familiar with Docker, another container platform that is now used widely. If you are, you will see that in some ways, Singularity is similar to Docker. However, in other ways, particularly in terms of the system’s architecture, it is fundamentally different. These differences mean that Singularity is particularly well-suited to running on shared platforms such as distributed, High Performance Computing (HPC) infrastructure, as well as on a Linux laptop or desktop.

Singularity runs containers from container images which, as we discussed, are essentially a virtual computer disk that contains all of the necessary software, libraries and configuration to run one or more applications or undertake a particular task, e.g. to support a specific research project. This saves you the time and effort of installing and configuring software on your own system or setting up a new computer from scratch, as you can simply run a Singularity container from an image and have a virtual environment that is equivalent to the one used by the person who created the image. Singularity/Apptainer is increasingly widely used in the research community for supporting research projects due to its support for shared computing platforms.

System administrators will not, generally, install Docker on shared computing platforms such as lab desktops, research clusters or HPC platforms because the design of Docker presents potential security issues for shared platforms with multiple users. Singularity/Apptainer, on the other hand, can be run by end-users entirely within “user space”, that is, no special administrative privileges need to be assigned to a user in order for them to run and interact with containers on a platform where Singularity has been installed.

A little history…

Singularity is open source software and was initially developed within the research community. A couple of years ago, the project was “forked”, something that is not uncommon within the open source software community, with the software effectively splitting into two projects going in different directions. The fork is being developed by a commercial entity, Sylabs.io, who provide both the free, open source SingularityCE (Community Edition) and Pro/Enterprise editions of the software. The original open source Singularity project has recently been renamed to Apptainer and has moved into the Linux Foundation. The initial release of Apptainer was made about a year ago, at the time of writing. While earlier versions of this course focused on versions of Singularity released before the project fork, we now base the course material on recent Apptainer releases. Despite this, the basic features of Apptainer/Singularity remain the same and so this material is equally applicable whether you’re working with a recent Apptainer release or a slightly older Singularity version. Nonetheless, it is useful to be aware of this history and that you may see both Singularity and Apptainer being used within the research community over the coming months and years.

Another point to note is that some systems that have a recent Apptainer release installed may also provide a singularity command that is simply a link to the apptainer executable on the system. This helps to ensure that existing scripts being used on the system that were developed before the migration to Apptainer will still function correctly.

For now, the remainder of this material refers to Singularity but where you have a release of Apptainer installed on your local system, you can simply replace references to singularity with apptainer, if you wish.
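If you write scripts that need to work on systems with either name installed, one option is to check which command is available before using it. The Python sketch below shows one way this might be done; the function name find_container_command is hypothetical, and the `which` parameter exists only so the lookup can be replaced when testing.

```python
import shutil


def find_container_command(candidates=("apptainer", "singularity"), which=shutil.which):
    """Return the first command name in `candidates` that is found on
    the PATH, or None if none of them is installed."""
    for name in candidates:
        if which(name) is not None:
            return name
    return None


command = find_container_command()
if command is None:
    print("Neither apptainer nor singularity was found on the PATH")
else:
    print("Using", command)
```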

Checking Singularity works

login4.archer2.ac.uk

You must use the login4.archer2.ac.uk login address rather than the general login address as login4 has fixes for Singularity that are not present on the usual login nodes (the fixes will be rolled out to all ARCHER2 login and compute nodes soon).

ssh -i /path/to/ssh-key user@login4.archer2.ac.uk

Now check that the singularity command is available in your terminal:

remote$ singularity --version

singularity version 3.7.3-1

Loading a module

HPC systems often use modules to provide access to software on the system so you may need to use the command:
remote$ module load singularity
before you can use the singularity command on remote systems. However, this depends on how the system is configured. You do not need to load a module on ARCHER2. If in doubt, consult the documentation for the system you are using or contact the support team.

Images and containers: reminder

A quick reminder on terminology: we refer to both container images and containers. What is the difference between these two terms?

Container images (sometimes just images) are bundles of files including an operating system, software and potentially data and other application-related files. They may sometimes be referred to as a disk image or image and they may be stored in different ways, perhaps as a single file, or as a group of files. Either way, we refer to this file, or collection of files, as an image.

A container is a virtual environment that is based on a container image. That is, the files, applications, tools, etc that are available within a running container are determined by the image that the container is started from. It may be possible to start multiple container instances from an image. You could, perhaps, consider an image to be a form of template from which running container instances can be started.

Getting a container image and running a Singularity container

Singularity uses the Singularity Image Format (SIF) and container images are provided as single SIF files (usually with a .sif or .img filename extension). Singularity container images can be pulled from the Sylabs Cloud Library, a registry for Singularity container images. Singularity is also capable of running containers based on container images pulled from Docker Hub and other Docker image repositories (e.g. Quay.io). We will look at accessing container images from Docker Hub later in the course.

Sylabs Remote Builder

Note that in addition to providing a repository that you can pull container images from, Sylabs Cloud Library can also build Singularity images for you from a recipe - a configuration file defining the steps to build an image. We will look at recipes and building images later in the workshop.

Pulling a container image from Sylabs Cloud Library

Let’s begin by creating a test directory, changing into it and pulling an existing container image from Sylabs Cloud Library:

remote$ mkdir test
remote$ cd test
remote$ singularity pull lolcow.sif library://lolcow

INFO:    Downloading library image
 90.4 MiB / 90.4 MiB [===============================================================================================================] 100.00% 90.4 MiB/s 1s

What just happened? We pulled a container image from a remote repository using the singularity pull command and directed it to store the container image in a file using the name lolcow.sif in the current directory. If you run the ls command, you should see that the lolcow.sif file is now present in the current directory.

remote$ ls -lh

total 60M
-rwxr-xr-x. 1 auser group 91M Jun 13  2023 lolcow.sif

Running a Singularity container

We can now run a container based on the lolcow.sif container image:

remote$ singularity run lolcow.sif

 ______________________________
< Tue Jun 20 08:44:51 UTC 2023 >
 ------------------------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||

The above command ran a lolcow container based on the container image we downloaded from the online repository and the resulting output was shown.

What just happened? When we use the singularity run command, Singularity does three things:

1. Starts a running container: Starts a running container, based on the container image. Think of this as the “alive” or “inflated” version of the container – it’s actually doing something.
2. Performs default action: If the container has a default action set, it will perform that default action. This could be as simple as printing a message (as above) or running a whole analysis pipeline!
3. Shuts down the container: Once the default action is complete, the container stops running (or exits).

Default action

How did the container determine what to do when we ran it? What did running the container actually do to result in the displayed output?

When you run a container from a Singularity container image using the singularity run command, the container runs the default run script that is embedded within the container image. This is a shell script that can be used to run commands, tools or applications stored within the container image on container startup. We can inspect the container image’s run script using the singularity inspect command:

remote$ singularity inspect -r lolcow.sif

#!/bin/sh 

    date  |  cowsay  |  lolcat

This shows us the script within the lolcow.sif image configured to run by default when we use the singularity run command.

This seems very simple but already, we have downloaded a container image that is built with a different OS than is available on ARCHER2 that also contains software not available on ARCHER2 (cowsay and lolcat) and been able to run this on the ARCHER2 system without needing to install anything ourselves and without the container image having to know anything specific about how ARCHER2 is configured.

Key Points

Singularity is another container platform and it is often used in cluster/HPC/research environments.

Singularity has a different security model to other container platforms, one of the key reasons that it is well suited to HPC and cluster environments.

Singularity has its own container image format (SIF).

The singularity command can be used to pull images from Sylabs Cloud Library and run a container from an image file.

Using Singularity containers to run commands

Overview

Teaching: 10 min
Exercises: 5 min

Questions

How do I run different commands within a container?

How do I access an interactive shell within a container?

Objectives

Learn how to run different commands when starting a container.

Learn how to open an interactive shell within a container environment.

Running specific commands within a container

We saw earlier that we can use the singularity inspect command to see the run script that a container is configured to run by default. What if we want to run a different command within a container?

If we know the path of an executable that we want to run within a container, we can use the singularity exec command. For example, using the lolcow.sif container image that we’ve already pulled from Sylabs Cloud Library, we can run the following within the test directory where the lolcow.sif file is located:

remote$ singularity exec lolcow.sif /bin/echo "Hello, world"

Hello, world

Here we see that a container has been started from the lolcow.sif image and the /bin/echo command has been run within the container, passing the input Hello, world. The command has echoed the provided input to the console and the container has terminated.

Note that the use of singularity exec has overridden any run script set within the image metadata and the command that we specified as an argument to singularity exec has been run instead.

Basic exercise: Running a different command within the “lolcow” container

Can you run a container based on the lolcow.sif image that prints the current date and time?
Solution
remote$ singularity exec lolcow.sif /bin/date
Fri Jun 26 15:17:44 BST 2020

Difference between `singularity run` and `singularity exec`

Above, we used the singularity exec command. In earlier episodes of this course we used singularity run. To clarify, the difference between these two commands is:

singularity run: This will run the default command set for containers based on the specified image. This default command is set within the image metadata when the image is built (we’ll see more about this in later episodes). You do not specify a command to run when using singularity run; you simply specify the image file name. As we saw earlier, you can use the singularity inspect command to see what command is run by default when starting a new container based on an image.
singularity exec: This will start a container based on the specified image and run the command provided on the command line following singularity exec <image file name>. This will override any default command specified within the image metadata that would otherwise be run if you used singularity run.
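When containers are used from within an analysis script, the two forms can be assembled and run from Python. The sketch below is one possible approach under stated assumptions: the helper names singularity_command and run_and_capture are hypothetical, and actually running a command requires singularity (or apptainer) to be installed and the image file to exist.

```python
import subprocess


def singularity_command(subcommand, image, command=(), runtime="singularity"):
    """Build the argument list for `singularity run` or `singularity exec`."""
    if subcommand == "exec" and not command:
        # exec has no default action, so a command must be supplied
        raise ValueError("singularity exec needs a command to run")
    return [runtime, subcommand, image, *command]


def run_and_capture(arguments):
    """Run the command and return its standard output as text.
    Raises subprocess.CalledProcessError if the command fails."""
    result = subprocess.run(arguments, capture_output=True, text=True, check=True)
    return result.stdout


# Building the lists does not start a container
print(singularity_command("run", "lolcow.sif"))
print(singularity_command("exec", "lolcow.sif", ["/bin/echo", "Hello, world"]))
```

On a system with the image available, passing the second list to run_and_capture would return the text printed by /bin/echo inside the container.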

Opening an interactive shell within a container

If you want to open an interactive shell within a container, Singularity provides the singularity shell command. Again, using the lolcow.sif image, and within our test directory, we can run a shell within a container from the lolcow.sif image:

remote$ singularity shell lolcow.sif

Singularity> whoami
[<your username>]
Singularity> ls
lolcow.sif
Singularity> 

As shown above, we have opened a shell in a new container started from the lolcow.sif image. Note that the shell prompt has changed to show we are now within the Singularity container.

Use the exit command to exit from the container shell.

Key Points

The singularity exec command is an alternative to singularity run that allows you to start a container running a specific command.

The singularity shell command can be used to start a container and run an interactive shell within it.

Break

Overview

Teaching: min
Exercises: min

Questions

Objectives

Morning break

Key Points

Using Docker images with Singularity

Overview

Teaching: 5 min
Exercises: 10 min

Questions

How do I use Docker images with Singularity?

Objectives

Learn how to run Singularity containers based on Docker images.

Using Docker images with Singularity

Singularity can also start containers directly from Docker container images, opening up access to a huge number of existing container images available on Docker Hub and other registries.

While Singularity doesn’t actually run a container using the Docker container image (it first converts it to a format suitable for use by Singularity), the approach used provides a seamless experience for the end user. When you direct Singularity to run a container based on a Docker container image, Singularity pulls the slices or layers that make up the Docker container image and converts them into a single-file Singularity SIF container image.

For example, moving on from the simple Hello World examples that we’ve looked at so far, let’s pull one of the official Docker Python container images. We’ll use the image with the tag 3.9.6-slim-buster which has Python 3.9.6 installed on Debian’s Buster (v10) Linux distribution:

remote$ singularity pull python-3.9.6.sif docker://python:3.9.6-slim-buster

INFO:    Converting OCI blobs to SIF format
INFO:    Starting build...
Getting image source signatures
Copying blob 33847f680f63 done  
Copying blob b693dfa28d38 done  
Copying blob ef8f1a8cefd1 done  
Copying blob 248d7d56b4a7 done  
Copying blob 478d2dfa1a8d done  
Copying config c7d70af7c3 done  
Writing manifest to image destination
Storing signatures
2021/07/27 17:23:38  info unpack layer: sha256:33847f680f63fb1b343a9fc782e267b5abdbdb50d65d4b9bd2a136291d67cf75
2021/07/27 17:23:40  info unpack layer: sha256:b693dfa28d38fd92288f84a9e7ffeba93eba5caff2c1b7d9fe3385b6dd972b5d
2021/07/27 17:23:40  info unpack layer: sha256:ef8f1a8cefd144b4ee4871a7d0d9e34f67c8c266f516c221e6d20bca001ce2a5
2021/07/27 17:23:40  info unpack layer: sha256:248d7d56b4a792ca7bdfe866fde773a9cf2028f973216160323684ceabb36451
2021/07/27 17:23:40  info unpack layer: sha256:478d2dfa1a8d7fc4d9957aca29ae4f4187bc2e5365400a842aaefce8b01c2658
INFO:    Creating SIF file...

Note how we see Singularity saying that it’s “Converting OCI blobs to SIF format”. We then see the layers of the Docker container image being downloaded and unpacked and written into a single SIF file. Once the process is complete, we should see the python-3.9.6.sif container image file in the current directory.
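The image address given to singularity pull has three parts: the source type (docker), the image name (python) and the tag (3.9.6-slim-buster). The Python sketch below splits an address into these parts. It is a simplified illustration with a hypothetical function name, parse_image_uri, and does not handle every form an address can take, such as registry hosts with port numbers or digests.

```python
def parse_image_uri(uri):
    """Split an address such as docker://python:3.9.6-slim-buster into
    (source, name, tag)."""
    source, separator, reference = uri.partition("://")
    if not separator:
        raise ValueError(f"missing '://' in {uri!r}")
    name, _, tag = reference.partition(":")
    # When no tag is given, Docker Hub images default to the tag "latest"
    return source, name, tag or "latest"


print(parse_image_uri("docker://python:3.9.6-slim-buster"))
```

Pinning a specific tag, as in this episode, means that everyone who pulls the image gets the same Python version.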

We can now run a container from this container image as we would with any other Singularity container image.

Running the Python 3.9.6 image that we just pulled from Docker Hub

Try running the Python 3.9.6 container image. What happens?

Try running some simple Python statements…
Running the Python 3.9.6 image
remote$ singularity run python-3.9.6.sif
This should put you straight into a Python interactive shell within the running container:
Python 3.9.6 (default, Jul 22 2021, 15:24:21) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 
Now try running some simple Python statements:
>>> import math
>>> math.pi
3.141592653589793
>>> 

In addition to running a container and having it run the default run script, you could also start a container running a shell in case you want to undertake any configuration prior to running Python. This is covered in the following exercise:

Open a shell within a Python container

Try to run a shell within a Singularity container based on the python-3.9.6.sif container image. That is, run a container that opens a shell rather than the default Python interactive console as we saw above. See if you can find more than one way to achieve this.

Within the shell, try starting the Python interactive console and running some Python commands.
Solution

Recall from the earlier material that we can use the singularity shell command to open a shell within a container. To open a regular shell within a container based on the python-3.9.6.sif container image, we can therefore simply run:
remote$ singularity shell python-3.9.6.sif
Singularity> echo $SHELL
/bin/bash
Singularity> cat /etc/issue
Debian GNU/Linux 10 \n \l

Singularity> python
Python 3.9.6 (default, Jul 22 2021, 15:24:21) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> print('Hello World!')
Hello World!
>>> exit()

Singularity> exit
remote$ 
It is also possible to use the singularity exec command to run an executable within a container. We could, therefore, use the exec command to run /bin/bash:
remote$ singularity exec python-3.9.6.sif /bin/bash
Singularity> echo $SHELL
/bin/bash
You can run the Python console from your container shell simply by running the python command.
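As a further illustration of singularity exec, the sketch below runs a single Python statement in the container without opening an interactive console. The helper name run_py_in_container is hypothetical, and the sketch assumes that python is on the PATH inside the image (as the shell session above suggests). It runs nothing if the singularity command or the image file is missing.

```shell
# Hypothetical helper (not part of Singularity): run one Python statement
# inside a container started from the given SIF image.
run_py_in_container() {
    image="$1"
    statement="$2"
    # Refuse to run if the singularity command or the image file is missing.
    command -v singularity >/dev/null 2>&1 || return 1
    [ -f "$image" ] || return 1
    # exec runs the named program in the container instead of the run script.
    singularity exec "$image" python -c "$statement"
}

run_py_in_container python-3.9.6.sif 'import math; print(math.pi)' \
    || echo "The statement did not run (is Singularity installed and python-3.9.6.sif present?)" >&2
```

This form can be convenient in scripts, where an interactive console is not wanted.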

References

[1] Gregory M. Kurtzer, Containers for Science, Reproducibility and Mobility: Singularity P2. Intel HPC Developer Conference, 2017

Key Points

Singularity can start a container from a Docker image which can be pulled directly from Docker Hub.

The Singularity cache

Overview

Teaching: 10 min
Exercises: 0 min

Questions

Why does Singularity use a local cache?

Where does Singularity store images?

Objectives

Learn about Singularity’s image cache.

Learn how to manage Singularity images stored locally.

Singularity’s image cache

Singularity uses a local cache to save downloaded container image files in addition to storing them as the file you specify. As we saw in the previous episode, images are simply .sif files stored on your local disk.

Suppose you delete a local .sif container image that you pulled from a remote container image repository and then pull it again. If the container image is unchanged from the version you previously pulled, you will be given a copy of the container image file from your local cache rather than the container image being downloaded again from the remote source. This removes unnecessary network transfers and is particularly useful for large container images which may take some time to transfer over the network. To demonstrate this, remove the lolcow.sif file stored in your test directory and then issue the pull command again:

remote$ rm lolcow.sif
remote$ singularity pull lolcow.sif library://lolcow

INFO:    Using cached image

As we can see in the above output, the container image has been returned from the cache and we do not see the output that we saw previously showing the container image being downloaded from the Cloud Library.

How do we know what is stored in the local cache? We can find out using the singularity cache command:

remote$ singularity cache list

There are 2 container file(s) using 129.35 MiB and 7 oci blob file(s) using 41.49 MiB of space
Total space used: 170.84 MiB

This tells us how many container image files are stored in the cache and how much disk space the cache is using but it doesn’t tell us what is actually being stored. To find out more information we can add the -v verbose flag to the list command:

remote$ singularity cache list -v

NAME                     DATE CREATED           SIZE             TYPE
50b2668d8d3f74c49a7280   2023-09-12 11:41:31    0.96 KiB         blob
76f124aca9afaf3f75812d   2023-09-12 11:41:30    2.51 MiB         blob
7becefa709e2358336177a   2023-09-12 11:41:31    6.25 KiB         blob
87bc5aa6fc4253b93dee0a   2023-09-12 11:41:30    0.23 KiB         blob
dae1d9fd74c12f7e66b92c   2023-09-12 11:41:29    10.43 MiB        blob
e1acddbe380c63f0de4b77   2023-09-12 11:41:27    25.89 MiB        blob
ecc7ff4d26223f4545c4fd   2023-09-12 11:41:28    2.64 MiB         blob
sha256.cef378b9a9274c2   2023-09-12 11:39:18    90.43 MiB        library
28bed4c51c3b531159d8af   2023-09-12 11:41:36    38.92 MiB        oci-tmp

There are 2 container file(s) using 129.35 MiB and 7 oci blob file(s) using 41.49 MiB of space
Total space used: 170.84 MiB

This provides us with some more useful information about the actual container images stored in the cache. In the TYPE column we can see that our container image type is library because it’s a SIF container image that has been pulled from the Cloud Library.

Cleaning the Singularity image cache

We can remove container images from the cache using the singularity cache clean command. Running the command without any options will display a warning and ask you to confirm that you want to remove everything from your cache.

You can also remove specific container images or all container images of a particular type. Look at the output of singularity cache clean --help for more information.

Cache location

By default, Singularity uses $HOME/.singularity/cache as the location for the cache. You can change the location of the cache by setting the SINGULARITY_CACHEDIR environment variable to the cache location you want to use.
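The rule described above can be sketched in a few lines of shell. The helper name singularity_cache_dir is hypothetical, and the sketch simply follows the description in the text. Depending on the Singularity version, the cached files may sit in a subdirectory of the location named by SINGULARITY_CACHEDIR, so treat the printed path as a starting point rather than a guarantee.

```shell
# Hypothetical helper: report the cache location as described in the text,
# i.e. SINGULARITY_CACHEDIR if it is set, otherwise $HOME/.singularity/cache.
singularity_cache_dir() {
    printf '%s\n' "${SINGULARITY_CACHEDIR:-$HOME/.singularity/cache}"
}

cache_dir="$(singularity_cache_dir)"
echo "Singularity cache location: $cache_dir"

# Show how much disk space the location uses, if it exists on this system.
if [ -d "$cache_dir" ]; then
    du -sh "$cache_dir"
fi
```

Checking the size this way can help when deciding whether to point the cache at a filesystem with more space than your home directory.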

Key Points

Singularity caches downloaded images so that an unchanged image isn’t downloaded again when it is requested using the singularity pull command.

You can free up space in the cache by removing all locally cached images or by specifying individual images to remove.

Files in Singularity containers

Overview

Teaching: 10 min
Exercises: 10 min

Questions

How do I make data available in a Singularity container?

What data is made available by default in a Singularity container?

Objectives

Understand that some data from the host system is usually made available by default within a container

Learn more about how Singularity handles users and binds directories from the host filesystem.

The key concept to remember when running a Singularity container is that you only have the same permissions to access files as the user on the host system that you start the container as. (If you are familiar with Docker, you may note that this is different behaviour than you would see with that tool.)

In this episode we will look at working with files in the context of Singularity containers and how this links with Singularity’s approach to users and permissions within containers.

Users within a Singularity container

The first thing to note is that when you ran whoami within the container shell you started at the end of the previous episode, you should have seen the same username that you have on the host system when you ran the container.

For example, if my username were jc1000, I would expect to see the following:

remote$ singularity shell lolcow.sif
Singularity> whoami

jc1000

But wait! I downloaded the standard, public version of the lolcow container image from the Cloud Library. I haven’t customised it in any way. How is it configured with my own user details?!

If you have any familiarity with Linux system administration, you may be aware that in Linux, users and their Unix groups are configured in the /etc/passwd and /etc/group files respectively. In order for the running container to know of my user, the relevant user information needs to be available within these files within the container.

Assuming this feature is enabled within the installation of Singularity on your system, when the container is started, Singularity appends the relevant user and group lines from the host system to the /etc/passwd and /etc/group files within the container[1].

This means that the host system can effectively ensure that, from within the container, you cannot access, modify or delete any data on the host system that you should not be able to. Likewise, you cannot run anything that you would not have permission to run on the host system, since you are restricted to the same user permissions within the container as you are on the host system.
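One way to see this mechanism at work is to look for your own entry in /etc/passwd from inside the container shell. The sketch below defines a hypothetical helper, passwd_entry, that prints the line for a given username. It only reads the file, and it assumes that awk and id are available in the container image.

```shell
# Hypothetical helper: print the passwd-format line for a user.
# The second argument is the file to search and defaults to /etc/passwd.
passwd_entry() {
    awk -F: -v user="$1" '$1 == user { print; exit }' "${2:-/etc/passwd}"
}

# At the Singularity> prompt this is expected to print the entry for your
# host user, if the feature described above is enabled on your system.
passwd_entry "$(id -un)"
```

If nothing is printed, your user has no line in that file; on the host this can happen when accounts are managed by a directory service rather than local files.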

Files and directories within a Singularity container

Singularity also binds some directories from the host system where you are running the singularity command into the container that you are starting. Note that this bind process does not copy files into the running container; it makes an existing directory on the host system visible and accessible within the container environment. If you write files to this directory within the running container, those changes will persist in the relevant location on the host system after the container shuts down.

There is a default configuration of which files and directories are bound into the container but ultimate control of how things are set up on the system where you are running Singularity is determined by the system administrator. As a result, this section provides an overview but you may find that things are a little different on the system that you’re running on.

One directory that is likely to be accessible within a container that you start is your home directory. You may find that the directory from which you issued the singularity command (the current working directory) is also bound.

The binding of file content and directories from a host system into a Singularity container is illustrated in the example below showing a subset of the directories on the host Linux system and in a running Singularity container:

Host system:                                                      Singularity container:
-------------                                                     ----------------------
/                                                                 /
├── bin                                                           ├── bin
├── etc                                                           ├── etc
│   ├── ...                                                       │   ├── ...
│   ├── group  ─> user's group added to group file in container ─>│   ├── group
│   └── passwd ──> user info added to passwd file in container ──>│   └── passwd
├── home                                                          ├── usr
│   └── jc1000 ───> user home directory made available ──> ─┐     ├── sbin
├── usr                 in container via bind mount         │     ├── home
├── sbin                                                    └────────>└── jc1000
└── ...                                                           └── ...

Questions and exercises: Files in Singularity containers

Q1: What do you notice about the ownership of files in a container started from the lolcow.sif image? (e.g. take a look at the ownership of files in the root directory (/) and your home directory (~/)).

Exercise 1: In this container, try creating a file in the root directory / (e.g. using touch /myfile.dat). What do you notice? Try removing the /singularity file. What happens in these two cases?

Exercise 2: In your home directory within the container shell, try and create a simple text file (e.g. echo "Some text" > ~/test-file.txt). Is it possible to do this? If so, why? If not, why not?! If you can successfully create a file, what happens to it when you exit the shell and the container shuts down?

Answers

A1: Use the ls -l / command to see a detailed file listing including file ownership and permission details. You should see that most of the files in the / directory are owned by root, as you would probably expect on any Linux system. If you look at the files in your home directory, they should be owned by you.

A Ex1: We’ve already seen from the previous answer that the files in / are owned by root so we would not expect to be able to create files there if we’re not the root user. However, if you tried to remove /singularity you would have seen an error similar to the following: cannot remove '/singularity': Read-only file system. This tells us something else about the filesystem. It’s not just that we do not have permission to delete the file: the filesystem itself is read-only, so even the root user would not be able to edit or delete this file. We will look at this in more detail shortly.

A Ex2: Within your home directory, you should be able to successfully create a file. Since you’re seeing your home directory on the host system which has been bound into the container, when you exit and the container shuts down, the file that you created within the container should still be present when you look at your home directory on the host system.

Binding additional host system directories to the container

You will sometimes need to bind additional host system directories into a container you are using over and above those bound by default. For example:

There may be a shared dataset in a location that you need access to in the container
You may require executables and software libraries from the host system in the container

The -B option to the singularity command is used to specify additional binds. For example, to bind the /opt/cray directory (where the HPE Cray programming environment is stored) into a container you could use:

remote$ singularity shell -B /opt/cray lolcow.sif
Singularity> ls -la /opt/cray

Note that, by default, a bind is mounted at the same path in the container as on the host system. You can also specify where a host directory is mounted in the container by separating the host path from the container path by a colon (:) in the option:

remote$ singularity shell -B /opt/cray:/cpe lolcow.sif
Singularity> ls -la /cpe

You can specify multiple binds to -B by separating them by commas (,).

Another option is to specify the paths you want to bind in the SINGULARITY_BIND environment variable. This can be more convenient when you have a lot of paths you want to bind into the running container (we will see this later in the course when we look at using MPI with containers). For example, to bind the locations that contain both the HPE Cray programming environment and the CSE centrally installed software into a running container, we would use:

remote$ export SINGULARITY_BIND="/opt/cray,/work/y07/shared"
remote$ singularity shell lolcow.sif
Singularity> ls -la /work/y07/shared
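When the list of paths is long, the bind specification can be built up programmatically. The sketch below uses a hypothetical helper, join_binds, to join paths with commas and export the result as SINGULARITY_BIND. The two paths are the ones used in the example above and may not exist on your system.

```shell
# Hypothetical helper: join its arguments with commas, the separator that
# the text describes for -B and SINGULARITY_BIND.
join_binds() {
    result=""
    for p in "$@"; do
        # Add a comma before every path except the first.
        result="${result:+$result,}$p"
    done
    printf '%s\n' "$result"
}

SINGULARITY_BIND="$(join_binds /opt/cray /work/y07/shared)"
export SINGULARITY_BIND
echo "SINGULARITY_BIND=$SINGULARITY_BIND"
```

Each argument can also use the host:container form described earlier, for example /opt/cray:/cpe, since the helper only joins the strings it is given.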

Finally, you can also copy data into a container image at build time if there is some static data required in the image. We cover this later in the section on building container images.

References

[1] Gregory M. Kurtzer, Containers for Science, Reproducibility and Mobility: Singularity P2. Intel HPC Developer Conference, 2017. Available at: https://www.intel.com/content/dam/www/public/us/en/documents/presentation/hpc-containers-singularity-advanced.pdf

Key Points

Your current directory and home directory are usually available by default in a container.

You have the same username and permissions in a container as on the host system.

You can specify additional host system directories to be available in the container.

Break

Lunch break

Using Singularity to run BLAST+

Overview

Teaching: 30 min
Exercises: 30 min

Questions

How can I use Singularity to run bioinformatics workflows with BLAST+?

Objectives

Show example of using Singularity with a common bioinformatics tool.

We have now learned enough to be able to use Singularity to deploy software without needing to install the software itself on the host system.

In this section we will demonstrate the use of a Singularity container image that provides the BLAST+ software.

Source material

This example is based on the example from the official NCBI BLAST+ Docker container documentation. Note: the efetch parts of the step-by-step guide do not currently work using the Singularity version of the image, so we provide a dataset with the data already downloaded.

(This is because the NCBI BLAST+ Docker container image has the efetch tool installed in the /root directory and this special location gets overwritten during the conversion to a Singularity container image.)

Download the required data

Download the blast_example.tar.gz archive.

Unpack the archive, which contains the downloaded data required for the BLAST+ example:

remote$ wget https://epcced.github.io/2024-04-16_containers_bham/files/blast_example.tar.gz
remote$ tar -xvf blast_example.tar.gz

x blast/
x blast/blastdb/
x blast/queries/
x blast/fasta/
x blast/results/
x blast/blastdb_custom/
x blast/fasta/nurse-shark-proteins.fsa
x blast/queries/P01349.fsa

Finally, move into the newly created directory:

remote$ cd blast
remote$ ls  

blastdb        blastdb_custom fasta          queries        results

Create the Singularity container image

NCBI provide official Docker containers with the BLAST+ software hosted on Docker Hub. We can create a Singularity container image from the Docker container image with:

remote$ singularity pull ncbi-blast.sif docker://ncbi/blast

INFO:    Converting OCI blobs to SIF format
INFO:    Starting build...
Getting image source signatures
Copying blob f3b81f6693c5 done  
Copying blob 9e3ea8720c6d done  
Copying blob f1910abb61ed done  
Copying blob 5ac33d4de47b done  
Copying blob 8402427c8382 done  
Copying blob 06add1a477bc done  
Copying blob d9781f222125 done  
Copying blob 4aae31cc8a8b done  
Copying blob 6a61413c1ffa done  
Copying blob c657bf8fc6ca done  
Copying blob 1776e565f5f8 done  
Copying blob d90474a0d8c8 done  
Copying blob 0bc89cb1b9d7 done  
Copying blob b8a272fccf13 done  
Copying blob 891eb09f891f done  
Copying blob 4c64befa8a35 done  
Copying blob 7ab0b7afbc21 done  
Copying blob b007c620c60b done  
Copying blob f877ffc04713 done  
Copying blob 6ee97c348001 done  
Copying blob 03f0ee97190b done  
Copying config 28914b3519 done  
Writing manifest to image destination
Storing signatures
2023/06/16 08:26:53  info unpack layer: sha256:9e3ea8720c6de96cc9ad544dddc695a3ab73f5581c5d954e0504cc4f80fb5e5c
2023/06/16 08:26:53  info unpack layer: sha256:06add1a477bcffec8bac0529923aa8ae25d51f0660f0c8ef658e66aa89ac82c2
2023/06/16 08:26:53  info unpack layer: sha256:f3b81f6693c592ab94c8ebff2109dc60464d7220578331c39972407ef7b9e5ec
2023/06/16 08:26:53  info unpack layer: sha256:5ac33d4de47beb37ae35e9cad976d27afa514ab8cbc66e0e60c828a98e7531f4
2023/06/16 08:27:03  info unpack layer: sha256:8402427c8382ab723ac504155561fb6d3e5ea1e7b4f3deac8449cec9e44ae65a
2023/06/16 08:27:03  info unpack layer: sha256:f1910abb61edef8947e9b5556ec756fd989fa13f329ac503417728bf3b0bae5e
2023/06/16 08:27:03  info unpack layer: sha256:d9781f222125b5ad192d0df0b59570f75b797b2ab1dc0d82064c1b6cead04840
2023/06/16 08:27:03  info unpack layer: sha256:4aae31cc8a8b726dce085e4e2dc4671a9be28162b8d4e1b1c00b8754f14e6fe6
2023/06/16 08:27:03  info unpack layer: sha256:6a61413c1ffa309d92931265a5b0ecc9448568f13ccf3920e16aaacc8fdfc671
2023/06/16 08:27:03  info unpack layer: sha256:c657bf8fc6cae341e3835cb101dc4c6839ba4aad69578ff8538b3c1eba7abb21
2023/06/16 08:27:04  info unpack layer: sha256:1776e565f5f85562b8601edfd29c35f3fba76eb53177c8e89105f709387e3627
2023/06/16 08:27:04  info unpack layer: sha256:d90474a0d8c8e6165d909cc0ebbf97dbe70fd759a93eff11a5a3f91fa09a470e
2023/06/16 08:27:04  warn rootless{root/edirect/aux/lib/perl5/Mozilla/CA/cacert.pem} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
2023/06/16 08:27:04  warn rootless{root/edirect/aux/lib/perl5/Mozilla/CA.pm} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
2023/06/16 08:27:04  warn rootless{root/edirect/aux/lib/perl5/Mozilla/mk-ca-bundle.pl} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
2023/06/16 08:27:04  info unpack layer: sha256:0bc89cb1b9d7ca198a7a1b95258006560feffaff858509be8eb7388b315b9cf5
2023/06/16 08:27:04  info unpack layer: sha256:b8a272fccf13b721fa68826f17f0c2bb395de377e0d22c98d38748eb5957a4c6
2023/06/16 08:27:04  info unpack layer: sha256:891eb09f891ff2c26f24a5466112e134f6fb30bd3d0e78c14c0d676b0e68d60a
2023/06/16 08:27:04  info unpack layer: sha256:4c64befa8a35c9f8518324524dfc27966753462a4c07b2234811865387058bf4
2023/06/16 08:27:04  info unpack layer: sha256:7ab0b7afbc21b75697a7b8ed907ee9b81e5b17a04895dc6ff7d25ea2ba1eeba4
2023/06/16 08:27:04  info unpack layer: sha256:b007c620c60b91ce6a9e76584ecc4bc062c822822c204d8c2b1c8668193d44d1
2023/06/16 08:27:04  info unpack layer: sha256:f877ffc04713a03dffd995f540ee13b65f426b350cdc8c5f1e20c290de129571
2023/06/16 08:27:04  info unpack layer: sha256:6ee97c348001fca7c98e56f02b787ce5e91d8cc7af7c7f96810a9ecf4a833504
2023/06/16 08:27:04  info unpack layer: sha256:03f0ee97190baebded2f82136bad72239254175c567b19def105b755247b0193
INFO:    Creating SIF file...

Now that we have a container image that includes the software, we can use it.

Build and verify the BLAST database

Our example dataset already contains the downloaded query and database sequences. We first use these downloaded data to create a custom BLAST database by using a container to run the command makeblastdb with the correct options.

remote$ singularity exec ncbi-blast.sif \
    makeblastdb -in fasta/nurse-shark-proteins.fsa -dbtype prot \
    -parse_seqids -out nurse-shark-proteins -title "Nurse shark proteins" \
    -taxid 7801 -blastdb_version 5

Building a new DB, current time: 06/16/2023 14:35:07
New DB name:   /home/auser/test/blast/blast/nurse-shark-proteins
New DB title:  Nurse shark proteins
Sequence type: Protein
Keep MBits: T
Maximum file size: 3000000000B
Adding sequences from FASTA; added 7 sequences in 0.0199499 seconds.

To verify the newly created BLAST database above, you can run the blastdbcmd -entry all -db nurse-shark-proteins -outfmt "%a %l %T" command to display the accession, sequence length, and taxonomy ID of each sequence in the database.

remote$ singularity exec ncbi-blast.sif \
    blastdbcmd -entry all -db nurse-shark-proteins -outfmt "%a %l %T"

Q90523.1 106 7801
P80049.1 132 7801
P83981.1 53 7801
P83977.1 95 7801
P83984.1 190 7801
P83985.1 195 7801
P27950.1 151 7801
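As a simple sanity check, the number of entries listed by blastdbcmd should match the number of sequences in the input FASTA file (7 according to the makeblastdb output above). A minimal sketch, using a hypothetical helper count_fasta_records that counts header lines beginning with ">", and assuming it is run from the blast directory:

```shell
# Hypothetical helper: count the records in a FASTA file.
count_fasta_records() {
    # Each FASTA record starts with a header line beginning with ">".
    grep -c '^>' "$1"
}

fasta="fasta/nurse-shark-proteins.fsa"
if [ -f "$fasta" ]; then
    echo "Sequences in $fasta: $(count_fasta_records "$fasta")"
else
    echo "$fasta not found; run this from the blast directory." >&2
fi
```

This check runs on the host and does not need the container, because the FASTA file is an ordinary text file in the current directory.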

Now that we have our database, we can run queries against it.

Run a query against the BLAST database

Let’s execute a query on our database using the blastp command:

remote$ singularity exec ncbi-blast.sif \
    blastp -query queries/P01349.fsa -db nurse-shark-proteins \
    -out results/blastp.out

At this point, you should see the results of the query in the output file results/blastp.out. To view the content of this output file, use the command less results/blastp.out.

remote$ less results/blastp.out

...output trimmed...

Query= sp|P01349.2|RELX_CARTA RecName: Full=Relaxin; Contains: RecName:
Full=Relaxin B chain; Contains: RecName: Full=Relaxin A chain

Length=44
                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value

P80049.1 RecName: Full=Fatty acid-binding protein, liver; AltName...  14.2    0.96


>P80049.1 RecName: Full=Fatty acid-binding protein, liver; AltName: Full=Liver-type
fatty acid-binding protein; Short=L-FABP
Length=132

...output trimmed...

With your query, BLAST identified the protein sequence P80049.1 as a match with a score of 14.2 and an E-value of 0.96.
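The same pattern extends to several query files. The sketch below loops over every .fsa file in the queries directory and writes one output file per query. The helper result_path and its naming scheme (results/NAME.blastp.out) are hypothetical choices for illustration; the blastp options are the ones used above. Nothing is run unless the singularity command and ncbi-blast.sif are both present.

```shell
# Hypothetical helper: map a query file such as queries/P01349.fsa to an
# output file such as results/P01349.blastp.out.
result_path() {
    name="$(basename "$1" .fsa)"
    printf 'results/%s.blastp.out\n' "$name"
}

if command -v singularity >/dev/null 2>&1 && [ -f ncbi-blast.sif ]; then
    for query in queries/*.fsa; do
        # Skip the unexpanded pattern if there are no query files.
        [ -e "$query" ] || continue
        singularity exec ncbi-blast.sif \
            blastp -query "$query" -db nurse-shark-proteins \
            -out "$(result_path "$query")"
    done
else
    echo "Singularity or ncbi-blast.sif not found; nothing was run." >&2
fi
```

Each iteration starts a fresh container, which is typical for this kind of use: the container exists only for the duration of the command it runs.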

Accessing online BLAST databases

As well as building your own local database to query, you can also access databases that are available online. For example, to see which databases are available online in the Google Cloud Platform (GCP):

remote$ singularity exec ncbi-blast.sif update_blastdb.pl --showall pretty --source gcp

Connected to GCP
BLASTDB                                                      DESCRIPTION                                                                                                              SIZE (GB)      LAST_UPDATED
nr                                                           All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects        369.4824      2023-06-10
swissprot                                                    Non-redundant UniProtKB/SwissProt sequences                                                                                 0.3576      2023-06-10
refseq_protein                                               NCBI Protein Reference Sequences                                                                                          146.5088      2023-06-12
landmark                                                     Landmark database for SmartBLAST                                                                                            0.3817      2023-04-25
pdbaa                                                        PDB protein database                                                                                                        0.1967      2023-06-10
nt                                                           Nucleotide collection (nt)                                                                                                319.5044      2023-06-11
pdbnt                                                        PDB nucleotide database                                                                                                     0.0145      2023-06-09
patnt                                                        Nucleotide sequences derived from the Patent division of GenBank                                                           15.7342      2023-06-09
refseq_rna                                                   NCBI Transcript Reference Sequences                                                                                        47.8721      2023-06-12

...output trimmed...

Similarly, for databases hosted at NCBI:

remote$ singularity exec ncbi-blast.sif update_blastdb.pl --showall pretty --source ncbi

Connected to NCBI
BLASTDB                                                      DESCRIPTION                                                                                                              SIZE (GB)      LAST_UPDATED
env_nr                                                       Proteins from WGS metagenomic projects (env_nr).                                                                            3.9459      2023-06-11
SSU_eukaryote_rRNA                                           Small subunit ribosomal nucleic acid for Eukaryotes                                                                         0.0063      2022-12-05
LSU_prokaryote_rRNA                                          Large subunit ribosomal nucleic acid for Prokaryotes                                                                        0.0041      2022-12-05
16S_ribosomal_RNA                                            16S ribosomal RNA (Bacteria and Archaea type strains)                                                                       0.0178      2023-06-16
env_nt                                                       environmental samples                                                                                                      48.8599      2023-06-08
LSU_eukaryote_rRNA                                           Large subunit ribosomal nucleic acid for Eukaryotes                                                                         0.0053      2022-12-05
ITS_RefSeq_Fungi                                             Internal transcribed spacer region (ITS) from Fungi type and reference material                                             0.0067      2022-10-28
Betacoronavirus                                              Betacoronavirus                                                                                                            55.3705      2023-06-16

...output trimmed...

Notes

You have now completed a simple example of using a complex piece of bioinformatics software through Singularity containers. You may have noticed that some things just worked without you needing to set them up even though you were running using containers:

We did not need to explicitly bind any files/directories into the container. This worked because Singularity automatically binds the current directory into the running container, so any data in the current directory (or its subdirectories) will generally be available in running Singularity containers. (If you have used Docker containers, you will notice that this is different from the default behaviour there.)
Access to the internet is automatically available within the running container in the same way as it is on the host system without us needing to specify any additional options.
Files and data we create within the container have the right ownership and permissions for us to access outside the container.

In addition, we were able to use the tools in the container image provided by NCBI without having to do any work to install the software irrespective of the computing platform that we are using. (In fact, the example this is based on runs the pipeline using Docker on a cloud computing platform rather than on your local system.)

Key Points

We can use containers to run software without having to install it

The commands we use are very similar to those we would use natively

Singularity handles a lot of complexity around data and internet access for us

Containers in Research Workflows: Reproducibility and Granularity

Overview

Teaching: 20 min
Exercises: 5 min

Questions

How can I use container images to make my research more reproducible?

How do I incorporate containers into my research workflow?

Objectives

Understand how container images can help make research more reproducible.

Understand what practical steps I can take to improve the reproducibility of my research using containers.

Although this workshop is titled “Reproducible computational environments using containers”, so far we have mostly covered the mechanics of using Singularity with only passing reference to the reproducibility aspects. In this section, we discuss these aspects in more detail.

Work in progress…

Note that reproducibility aspects of software and containers are an active area of research, discussion and development so are subject to many changes. We will present some ideas and approaches here but best practices will likely evolve in the near future.

Reproducibility

By reproducibility here we mean the ability of someone else (or your future self) to reproduce what you did computationally at a particular time (be this in research, analysis or something else) as closely as possible even if they do not have access to exactly the same hardware resources that you had when you did the original work.

Some examples of why containers are an attractive technology to help with reproducibility include:

The same computational work can be run across multiple different technologies seamlessly (e.g. Windows, macOS, Linux).
You can save the exact process that you used for your computational work (rather than relying on potentially incomplete notes).
You can save the exact versions of software and their dependencies in the container image.
You can access legacy versions of software and underlying dependencies which may not be generally available any more.
Depending on their size, you can also potentially store a copy of key data within the container image.
You can archive and share the container image as well as associating a persistent identifier with a container image to allow other researchers to reproduce and build on your work.

We have made use of a few different online repositories during this course, such as Sylabs Cloud Library and Docker Hub which provide platforms for sharing container images publicly. Once you have uploaded a container image, you can point people to its public location and they can download and build upon it.

This is fine for working collaboratively with container images on a day-to-day basis, but these repositories are not a good option for long-term archiving of container images in support of research and publications because:

free accounts have a limit on how long a container image will be hosted if it is not updated
they do not support adding persistent identifiers to container images
it is easy to overwrite container images with newer versions by mistake.

Archiving and persistently identifying container images using Zenodo

When you publish your work or make it publicly available in some way it is good practice to make container images that you used for computational work available in an immutable, persistent way and to have an identifier that allows people to cite and give you credit for the work you have done. Zenodo is one service that provides this functionality.

Zenodo supports the upload of zip archives, and we can capture our Singularity container images as zip archives. For example, to convert alpine-sum.sif, the container image we created earlier in this lesson, to a zip archive (on the command line):

zip alpine-sum.zip alpine-sum.sif

Note: Zipped container images can become quite large, and Zenodo supports uploads of up to 50GB. If your container image is too large, you may need to look at other archiving options or work to reduce the size of the container image.
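
Before uploading, it can be useful to record a checksum of the archive so that anyone who downloads it later can confirm the file is intact. The following is a minimal sketch, assuming GNU coreutils (`sha256sum`) is available. The helper names `write_checksum` and `verify_checksum` are hypothetical and are not part of this lesson or of Zenodo's tooling.

```shell
# Hypothetical helper: record the SHA-256 checksum of an archive in a
# sidecar file named <archive>.sha256.
write_checksum() {
    sha256sum "$1" > "$1.sha256"
}

# Hypothetical helper: check an archive against its sidecar checksum file.
# Returns non-zero if the archive has changed since the checksum was written.
verify_checksum() {
    sha256sum --check --quiet "$1.sha256"
}
```

For example, you might run `write_checksum alpine-sum.zip` before the upload and `verify_checksum alpine-sum.zip` after a download. The checksum file stores the archive path exactly as it was given, so when a relative path is used both commands need to be run from the same directory.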

Once you have your archive, you can deposit it on Zenodo and this will:

Create a long-term archive snapshot of your Singularity container image which people (including your future self) can download and use to reproduce or build on your work.
Create a persistent DOI (Digital Object Identifier) that you can cite in any publications or outputs to enable reproducibility and recognition of your work.

In addition to the archive file itself, the deposit process will ask you to provide some basic metadata to classify the container image and the associated work.

Note that Zenodo is not the only option for archiving and generating persistent DOIs for container images. There are other services out there – for example, some organizations may provide their own, equivalent, service.

Reproducibility good practice

Make use of container images to capture the computational environment required for your work.
Decide on the appropriate granularity for the container images you will use for your computational work – this will be different for each project/area. Take note of accepted practice from contemporary work in the same area. What are the right building blocks for individual container images in your work?
Document what you have done and why – this can be put in comments in the Singularity recipe file and the use of the container image described in associated documentation and/or publications. Make sure that references are made in both directions so that the container image and the documentation are appropriately linked.
When you publish work (in whatever way), use an archiving and DOI service such as Zenodo to make sure your container image is captured as it was used for the work and that it obtains a persistent DOI to allow it to be cited and referenced properly.

Container Granularity

As mentioned above, one of the decisions you may need to make when containerising your research workflows is what level of granularity you wish to employ. The two extremes of this decision could be characterized as:

Create a single container image with all the tools you require for your research or analysis workflow
Create many container images each running a single command (or step) of the workflow and use them together

Of course, many real applications will sit somewhere between these two extremes.

Positives and negatives

What are the advantages and disadvantages of the two approaches to container granularity for research workflows described above? Think about this and write a few bullet points for advantages and disadvantages for each approach in the course Etherpad.

Solution

This is not an exhaustive list but some of the advantages and disadvantages could be:

Single large container image

Advantages:

Simpler to document

Full set of requirements packaged in one place

Potentially easier to maintain (though could be opposite if working with large, distributed group)

Disadvantages:

Could get very large in size, making it more difficult to distribute

May end up with same dependency issues within the container image from different software requirements

Potentially more complex to test

Less re-useable for different, but related, work

Multiple smaller container images

Advantages:

Individual components can be re-used for different, but related, work

Individual parts are smaller in size making them easier to distribute

Avoid dependency issues between different pieces of software

Easier to test

Disadvantages:

More difficult to document

Potentially more difficult to maintain (though could be easier if working with large, distributed group)

May end up with dependency issues between component container images if they get out of sync

Next steps with containers

Now that we’re at the end of the lesson material, take a moment to reflect on what you’ve learned, how it applies to you, and what to do next.

In your own notes, write down or diagram your understanding of Singularity containers and container images: concepts, commands, and how they work.

In your own notes, write down how you think you might use containers in your daily work. If there’s something you want to try doing with containers right away, what is a next step after this workshop to make that happen?

Key Points

Container images allow us to encapsulate the computation (and data) we have used in our research.

Using online container image repositories allows us to easily share computational work we have done.

Using container images along with a DOI service such as Zenodo allows us to capture our work and enables reproducibility.

Break

Overview

Teaching: min
Exercises: min

Questions

Objectives

Afternoon break

Key Points

(Optional) Running MPI parallel jobs using Singularity containers

Overview

Teaching: 30 min
Exercises: 40 min

Questions

How do I set up and run an MPI job from a Singularity container?

Objectives

Learn how MPI applications within Singularity containers can be run on HPC platforms

Understand the challenges and related performance implications when running MPI jobs via Singularity

What is MPI?

MPI - Message Passing Interface - is a widely used standard for parallel programming. It is used for exchanging messages/data between processes in a parallel application. If you’ve been involved in developing or working with computational science software, you may already be familiar with MPI and with running MPI applications.

Usually, when working on HPC systems, you compile your application against the MPI libraries provided on the system or you use applications that have been compiled by the HPC system support team. This approach, known as source code portability, is the traditional way of making applications portable to different HPC platforms.

However, compiling complex HPC applications that have lots of dependencies (including MPI) is not always straightforward and can be a significant challenge as most HPC systems differ in various ways in terms of OS and base software available. There are a number of different approaches that can be taken to make it easier to deploy applications on HPC systems; for example, the Spack software automates the dependency resolution and compilation of applications. Containers provide another potential way to resolve these problems but care needs to be taken when interfacing with MPI on the host system which adds more complexity to running containers in parallel on HPC systems.

MPI codes with Singularity containers

Obviously, we will not have admin/root access on the HPC platform we are using so cannot (usually) build our container images on the HPC system itself. However, we do need to ensure our container is using the MPI library on the HPC system itself so we can get the performance benefit of the HPC interconnect. How do we overcome these contradictions?

The answer is that we install a version of the MPI library in our container image that is binary compatible with the MPI library on the host system, and we build our software in the container image against that container-local MPI library. At runtime, we then ensure that the MPI library from the host is used within the running container rather than the version installed in the image.

There are two widely used open source MPI library distributions on HPC systems:

MPICH - in addition to the open source version, MPICH is binary compatible with many proprietary vendor libraries, including Intel MPI and HPE Cray MPT as well as the open source MVAPICH.
OpenMPI

This typically means that if you want to distribute HPC software that uses MPI within a container image you will need to maintain versions that are compatible with both MPICH and OpenMPI. There are efforts underway to provide tools that will provide a binary interface between different MPI implementations, e.g. HPE Cray’s MPIxlate software but these are not generally available yet.

Building a Singularity container image with MPI software

This example assumes that you’ll be building a container image on a local platform and then deploying it to an HPC system with a different but compatible MPI implementation, using a combination of the Hybrid and Bind models from the Singularity documentation. We will build our application using MPI in the container image but will bind the MPI library from the host into the container at runtime. See Singularity and MPI applications in the Singularity documentation for more technical details.

The example we will build will:

Use MPICH as the container image’s MPI library
Use the Ohio State University MPI Micro-benchmarks as the example application
Use ARCHER2 as the runtime platform - this uses Cray MPT as the host MPI library and the HPE Cray Slingshot interconnect

The Dockerfile to install MPICH and the OSU micro-benchmark we will use to build the container image is shown below. Create a new directory called osu-benchmarks to hold the build context for this new image. Create the Dockerfile in this directory.

FROM ubuntu:20.04

ENV DEBIAN_FRONTEND=noninteractive

# Install the necessary packages (from repo)
RUN apt-get update && apt-get install -y --no-install-recommends \
 apt-utils \
 build-essential \
 curl \
 libcurl4-openssl-dev \
 libzmq3-dev \
 pkg-config \
 software-properties-common
RUN apt-get clean
RUN apt-get install -y dkms
RUN apt-get install -y autoconf automake numactl libnuma-dev gcc g++ git libtool

# Download and build an ABI compatible MPICH
RUN curl -sSLO http://www.mpich.org/static/downloads/3.4.2/mpich-3.4.2.tar.gz \
   && tar -xzf mpich-3.4.2.tar.gz -C /root \
   && cd /root/mpich-3.4.2 \
   && ./configure --prefix=/usr --with-device=ch4:ofi --disable-fortran \
   && make -j8 install \
   && cd / \
   && rm -rf /root/mpich-3.4.2 \
   && rm /mpich-3.4.2.tar.gz

# OSU benchmarks
RUN curl -sSLO http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-5.4.1.tar.gz \
   && tar -xzf osu-micro-benchmarks-5.4.1.tar.gz -C /root \
   && cd /root/osu-micro-benchmarks-5.4.1 \
   && ./configure --prefix=/usr/local CC=/usr/bin/mpicc CXX=/usr/bin/mpicxx \
   && cd mpi \
   && make -j8 install \
   && cd / \
   && rm -rf /root/osu-micro-benchmarks-5.4.1 \
   && rm /osu-micro-benchmarks-5.4.1.tar.gz

# Add the OSU benchmark executables to the PATH
ENV PATH=/usr/local/libexec/osu-micro-benchmarks/mpi/startup:$PATH
ENV PATH=/usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt:$PATH
ENV PATH=/usr/local/libexec/osu-micro-benchmarks/mpi/collective:$PATH
ENV OSU_DIR=/usr/local/libexec/osu-micro-benchmarks/mpi

# path to mlx IB libraries in Ubuntu
ENV LD_LIBRARY_PATH=/usr/lib/libibverbs:$LD_LIBRARY_PATH

A quick overview of what the above Dockerfile is doing:

The image is being built based on the ubuntu:20.04 Docker image.
In the RUN sections:
- Ubuntu’s apt-get package manager is used to update the package index and then install the compilers and other libraries required for the MPICH and OSU benchmark build.
- The MPICH software is downloaded, extracted, configured, built and installed. Note the use of the --with-device option to configure MPICH to use the correct driver to support improved communication performance on a high performance cluster. After the install is complete we delete the files that are no longer needed.
- The OSU Micro-Benchmarks software is downloaded, extracted, configured, built and installed. After the install is complete we delete the files that are no longer needed.
In the ENV sections: Set environment variables that will be available within all containers run from the generated image.

Build and test the OSU Micro-Benchmarks image

Using the above Dockerfile, build a container image and push it to Docker Hub.

Pull the image on ARCHER2 using Singularity to convert it to a Singularity image, and then test it by running the osu_hello benchmark that is found in the startup benchmark folder with either singularity exec or singularity shell.

Note: the build process can take a while. If you want to test running while the build is happening, you can log into ARCHER2 and use a pre-built version of the container image to test. You can find this container image at:
${EPCC_SINGULARITY_DIR}/osu_benchmarks.sif
Solution

You should be able to build an image and push it to Docker Hub as follows:
$ docker image build --platform linux/amd64 -t alice/osu-benchmarks .
$ docker push alice/osu-benchmarks
You can then log into ARCHER2 and pull the container image from Docker Hub with:
remote$ singularity pull osu-benchmarks.sif docker://alice/osu-benchmarks
Let’s begin with a single-process run of startup/osu_hello to ensure that we can run the container as expected. We’ll use the MPI installation within the container for this test. Note that when we run a parallel job on an HPC cluster platform, we use the MPI installation on the cluster to coordinate the run so things are a little different… we will see this shortly.

Start a shell in the Singularity container based on your image and then run a single process job via mpirun:
$ singularity shell --contain osu-benchmarks.sif
Singularity> mpirun -np 1 osu_hello
You should see output similar to the following (the version number in the header will reflect the benchmark version installed in your image):
# OSU MPI Hello World Test v5.7.1
This is a test with 1 processes

Running Singularity containers with MPI on an HPC system

Assuming the above tests worked, we can now try undertaking a parallel run of one of the OSU benchmarking tools within our container image on the remote HPC platform.

This is where things get interesting and we will begin by looking at how Singularity containers are run within an MPI environment.

If you’re familiar with running MPI codes, you’ll know that you use mpirun (as we did in the previous example), mpiexec, srun or a similar MPI executable to start your application. This executable may be run directly on the local system or cluster platform that you’re using, or you may need to run it through a job script submitted to a job scheduler. Your MPI-based application code, which will be linked against the MPI libraries, will make MPI API calls into these MPI libraries which in turn talk to the MPI daemon process running on the host system. This daemon process handles the communication between MPI processes, including talking to the daemons on other nodes to exchange information between processes running on different machines, as necessary.

When running code within a Singularity container, we don’t use the MPI executables stored within the container, i.e. we DO NOT run:

singularity exec /path/to/my/image.sif mpirun -np <numprocs> /path/to/my/executable

Instead, we use the MPI installation on the host system to run Singularity and start an instance of our executable from within a container for each MPI process. Without Singularity support in an MPI implementation, this results in a separate Singularity container instance being started for each process, which can add overhead if a large number of processes are run on a host. Where Singularity support is built into an MPI implementation, this overhead can be reduced.

Ultimately, this means that our running MPI code is linking to the MPI libraries from the MPI install within our container and these are, in turn, communicating with the MPI daemon on the host system which is part of the host system’s MPI installation. In the case of MPICH, these two installations of MPI may be different but as long as there is ABI compatibility between the version of MPI installed in your container image and the version on the host system, your job should run successfully.
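
One way to check which MPI library an executable actually resolves at runtime is to inspect the output of `ldd` from inside the running container. The sketch below is an illustration rather than part of the lesson: `show_mpi_libs` is a hypothetical helper that filters `ldd` output read from standard input down to the MPI library lines. It assumes the MPI shared library name begins with `libmpi`, and it uses canned `ldd`-style text (with made-up paths and addresses) so that it runs anywhere.

```shell
# Hypothetical helper: keep only the lines of `ldd` output (read from
# standard input) that mention an MPI shared library.
show_mpi_libs() {
    grep -E 'libmpi[a-z]*\.so' || echo "no MPI library found"
}

# Example with canned `ldd`-style output; the paths and addresses here are
# invented for illustration only.
printf '\tlibmpi.so.12 => /usr/lib/libmpi.so.12 (0x00007f0000000000)\n\tlibc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f0000200000)\n' \
    | show_mpi_libs
```

Inside a container you could run something like `ldd "$(command -v osu_hello)" | show_mpi_libs`. A path under a directory bound in from the host suggests the host MPI library is being picked up, while a path under the image's own install prefix suggests the container's copy is in use.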

We can now try running a 2-process MPI run of a point to point benchmark osu_latency on ARCHER2.

Undertake a parallel run of the `osu_latency` benchmark (general example)

Create a job submission script called submit.slurm on the /work file system on ARCHER2 to run containers based on the container image across two nodes on ARCHER2. The example below uses the osu-benchmarks container image that is already available on ARCHER2 but you can easily modify it to use your version of the container image if you wish - the results should be the same in both cases.

A template based on the example in the ARCHER2 documentation is:

#!/bin/bash

#SBATCH --job-name=singularity_parallel
#SBATCH --time=0:10:0
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1

#SBATCH --partition=standard
#SBATCH --qos=short
#SBATCH --account=[budget code]

# Load the module to make the Cray MPICH ABI available
module load cray-mpich-abi

export OMP_NUM_THREADS=1
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK

# Set the LD_LIBRARY_PATH environment variable within the Singularity container
# to ensure that it uses the correct MPI libraries.
export SINGULARITYENV_LD_LIBRARY_PATH="/opt/cray/pe/mpich/8.1.23/ofi/gnu/9.1/lib-abi-mpich:/opt/cray/pe/mpich/8.1.23/gtl/lib:/opt/cray/libfabric/1.12.1.2.2.0.0/lib64:/opt/cray/pe/gcc-libs:/opt/cray/pe/gcc-libs:/opt/cray/pe/lib64:/opt/cray/pe/lib64:/opt/cray/xpmem/default/lib64:/usr/lib64/libibverbs:/usr/lib64:/usr/lib64"

# This makes sure HPE Cray Slingshot interconnect libraries are available
# from inside the container.
export SINGULARITY_BIND="/opt/cray,/var/spool,/opt/cray/pe/mpich/8.1.23/ofi/gnu/9.1/lib-abi-mpich:/opt/cray/pe/mpich/8.1.23/gtl/lib,/etc/host.conf,/etc/libibverbs.d/mlx5.driver,/etc/libnl/classid,/etc/resolv.conf,/opt/cray/libfabric/1.12.1.2.2.0.0/lib64/libfabric.so.1,/opt/cray/pe/gcc-libs/libatomic.so.1,/opt/cray/pe/gcc-libs/libgcc_s.so.1,/opt/cray/pe/gcc-libs/libgfortran.so.5,/opt/cray/pe/gcc-libs/libquadmath.so.0,/opt/cray/pe/lib64/libpals.so.0,/opt/cray/pe/lib64/libpmi2.so.0,/opt/cray/pe/lib64/libpmi.so.0,/opt/cray/xpmem/default/lib64/libxpmem.so.0,/run/munge/munge.socket.2,/usr/lib64/libibverbs/libmlx5-rdmav34.so,/usr/lib64/libibverbs.so.1,/usr/lib64/libkeyutils.so.1,/usr/lib64/liblnetconfig.so.4,/usr/lib64/liblustreapi.so,/usr/lib64/libmunge.so.2,/usr/lib64/libnl-3.so.200,/usr/lib64/libnl-genl-3.so.200,/usr/lib64/libnl-route-3.so.200,/usr/lib64/librdmacm.so.1,/usr/lib64/libyaml-0.so.2"

# Launch the parallel job.
srun --hint=nomultithread --distribution=block:block \
    singularity exec ${EPCC_SINGULARITY_DIR}/osu_benchmarks.sif \
        osu_latency

Finally, submit the job to the batch system with

remote$ sbatch submit.slurm
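
The long comma-separated SINGULARITY_BIND value in the job script is hard to read and easy to mistype. As a sketch of one possible alternative, assuming the script is run with bash, the list can be kept one path per line in an array and joined with commas. The `join_by` helper and the `bind_paths` array are hypothetical names, and only a small subset of the paths from the script is shown.

```shell
# Illustrative subset of the bind paths used in the job script.
bind_paths=(
    /opt/cray
    /var/spool
    /etc/host.conf
    /etc/resolv.conf
)

# Hypothetical helper: join the remaining arguments using the first
# argument (a single character) as the separator.
join_by() {
    local IFS="$1"
    shift
    echo "$*"
}

SINGULARITY_BIND=$(join_by , "${bind_paths[@]}")
export SINGULARITY_BIND
echo "$SINGULARITY_BIND"
```

The same helper could be used with `:` as the separator to build the SINGULARITYENV_LD_LIBRARY_PATH value. The correct set of paths is specific to each system, so this only changes how the list is written, not what goes in it.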

Solution

As you can see in the job script shown above, we call srun on the host system and pass it the singularity executable as the program to launch. The arguments to singularity exec are the image file and the name of the benchmark executable we want to run.

The following shows an example of the output you should expect to see. You should have latency values reported for message sizes up to 4MB.
# OSU MPI Latency Test v5.6.2
# Size          Latency (us)
0                       0.38
1                       0.34
...

This has demonstrated that we can successfully run a parallel MPI executable from within a Singularity container.

Investigate performance of native benchmark compared to containerised version

To get an idea of any difference in performance between the code within your Singularity image and the same code built natively on the target HPC platform, try running the osu_allreduce benchmarks natively on ARCHER2 on all cores on at least 16 nodes (if you want to use more than 32 nodes, you will need to use the standard QoS rather than the short QoS). Then try running the same benchmark that you ran via the Singularity container. Do you see any performance differences?

What do you see?

Do you see the same when you run on small node counts - particularly a single node?

Note: a native version of the OSU micro-benchmark suite is available on ARCHER2 via module load osu-benchmarks.

Discussion

Here are some selected results measured on ARCHER2:

1 node:
- 4 B: Native 6.13 us, Container 5.30 us (16% faster)
- 128 KiB: Native 173.00 us, Container 230.38 us (25% slower)
- 1 MiB: Native 1291.18 us, Container 2101.02 us (39% slower)

16 nodes:
- 4 B: Native 17.66 us, Container 18.15 us (3% slower)
- 128 KiB: Native 237.29 us, Container 303.92 us (22% slower)
- 1 MiB: Native 1501.25 us, Container 2359.11 us (36% slower)

32 nodes:
- 4 B: Native 30.72 us, Container 24.41 us (20% faster)
- 128 KiB: Native 265.36 us, Container 363.58 us (26% slower)
- 1 MiB: Native 1520.58 us, Container 2429.24 us (36% slower)

For the medium and large messages, using a container produces substantially worse MPI performance for this benchmark on ARCHER2. When the messages are very small, containers match the native performance and can actually be faster.
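
When comparing your own measurements, it helps to settle on one definition of "faster" and "slower". The sketch below assumes the relative difference is computed as native time divided by container time, minus one. This is an assumption about how to compare the numbers, not a documented method, although it does reproduce the single-node figures above (16% faster, 25% slower, 39% slower). The `rel_diff` helper name is hypothetical.

```shell
# Hypothetical helper: relative difference between a native and a
# containerised timing, as a whole-number percentage.
# Positive values mean the container was faster, negative values slower.
rel_diff() {
    awk -v native="$1" -v container="$2" \
        'BEGIN { printf "%.0f%%\n", (native / container - 1) * 100 }'
}

# Single-node timings (in microseconds) from the results above.
rel_diff 6.13 5.30        # 4 B
rel_diff 173.00 230.38    # 128 KiB
rel_diff 1291.18 2101.02  # 1 MiB
```

Other definitions, such as the change in time relative to the native time, give different percentages for the same pair of measurements, so state which one you are using when reporting results.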

Is this true for other MPI benchmarks that use all the cores on a node or is it specific to Allreduce?

Summary

Singularity can be combined with MPI to create portable containers that run software in parallel across multiple compute nodes. However, there are some limitations, specifically:

You must use an MPI library in the container that is binary compatible with the MPI library on the host system - typically, your container will be based on either MPICH or OpenMPI.
The host setup to enable MPI typically requires binding a large number of low-level libraries into the running container. You will usually require help from the HPC system support team to get the correct bind options for the platform you are using.
Performance of containers+MPI can be substantially lower than the performance of native applications using MPI on the system. The effect is dependent on the MPI routines used in your application, message sizes and the number of MPI processes used.

Key Points

Singularity images containing MPI applications can be built on one platform and then run on another (e.g. an HPC cluster) if the two platforms have compatible MPI implementations.

When running an MPI application within a Singularity container, use the MPI executable on the host system to launch a Singularity container for each process.

Think about the performance requirements of your parallel application and how the platforms where you build and run your image may affect them.

(Optional) Additional topics and next steps

Overview

Teaching: 0 min
Exercises: 0 min

Questions

How can I learn more about how containers work?

What different container technologies are there, and what are the differences and implications?

How can I orchestrate different containers?

Objectives

Understand container technologies better.

Provide useful links to continue your journey with containers.

Additional topics

How do containers work
- Containers vs Virtual Machines
- Layers
Container technologies
Container good practice
- How should containers be used and present themselves?
- Best practice for bioinformatic containers
Container orchestration - typically using Docker containers rather than Singularity

Useful links

Key Points

TBC

Reproducible Computational Environments Using Containers: Introduction to Docker and Singularity
