(Optional) Running MPI parallel jobs using Singularity containers
Overview
Teaching: 30 min
Exercises: 40 min
Questions
How do I set up and run an MPI job from a Singularity container?
Objectives
Learn how MPI applications within Singularity containers can be run on HPC platforms
Understand the challenges and related performance implications when running MPI jobs via Singularity
What is MPI?
MPI - Message Passing Interface - is a widely used standard for parallel programming. It is used for exchanging messages/data between processes in a parallel application. If you've been involved in developing or working with computational science software, you may already be familiar with it.
Usually, when working on HPC systems, you compile your application against the MPI libraries provided on the system, or you use applications that have been compiled by the HPC system support team. This source code portability is the traditional approach to making applications portable across different HPC platforms.
However, compiling complex HPC applications that have many dependencies (including MPI) is not always straightforward and can be a significant challenge, as HPC systems differ in their operating systems and the base software available. There are a number of different approaches that can be taken to make it easier to deploy applications on HPC systems; for example, the Spack software automates the dependency resolution and compilation of applications. Containers provide another potential way to resolve these problems, but care needs to be taken when interfacing with the MPI library on the host system, which adds complexity to running containers in parallel on HPC systems.
MPI codes with Singularity containers
Obviously, we will not have admin/root access on the HPC platform we are using, so we cannot (usually) build our container images on the HPC system itself. However, we do need to ensure that our container uses the MPI library on the HPC system so that we benefit from the performance of the HPC interconnect. How do we reconcile these requirements?
The answer is that we install a version of the MPI library in our container image that is binary compatible with the MPI library on the host system and install our software in the container image using the local version of the MPI library. At runtime, we then ensure that the MPI library from the host is used within the running container rather than the locally-installed version of MPI.
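As a conceptual sketch (the paths, image name and application name here are purely illustrative and not specific to any real system), the runtime pattern looks something like the following; we will see a concrete, ARCHER2-specific version later in this episode.
# Pass the host MPI library path into the container's environment
# (variables prefixed with SINGULARITYENV_ are set inside the container).
export SINGULARITYENV_LD_LIBRARY_PATH="/path/to/host/mpi/lib:$LD_LIBRARY_PATH"
# Bind the host MPI installation into the container and use the host's
# launcher to start one container instance per MPI process.
mpirun -np 4 singularity exec --bind /path/to/host/mpi \
    my_image.sif /usr/local/bin/my_mpi_app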
There are two widely used open source MPI library distributions on HPC systems:
- MPICH - in addition to the open source version, MPICH is binary compatible with many proprietary vendor libraries, including Intel MPI and HPE Cray MPT as well as the open source MVAPICH.
- OpenMPI
This typically means that if you want to distribute HPC software that uses MPI within a container image, you will need to maintain versions that are compatible with both MPICH and OpenMPI. There are efforts underway to develop tools that provide a binary interface between different MPI implementations, e.g. HPE Cray's MPIxlate software, but these are not generally available yet.
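If you are unsure which MPI implementation a given host system provides, the MPI compiler wrapper can usually tell you what it links against. As a rough check (flags and output vary between implementations and systems):
# MPICH and its derivatives (e.g. Intel MPI) accept -show to print the
# underlying compile/link command, revealing which MPI library is used.
mpicc -show
# Open MPI's wrapper uses --showme for the same purpose.
mpicc --showme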
Building a Singularity container image with MPI software
This example makes the assumption that you’ll be building a container image on a local platform and then deploying it to a HPC system with a different but compatible MPI implementation using a combination of the Hybrid and Bind models from the Singularity documentation. We will build our application using MPI in the container image but will bind the MPI library from the host into the container at runtime. See Singularity and MPI applications in the Singularity documentation for more technical details.
The example we will build will:
- Use MPICH as the container image’s MPI library
- Use the Ohio State University MPI Micro-benchmarks as the example application
- Use ARCHER2 as the runtime platform - this uses Cray MPT as the host MPI library and the HPE Cray Slingshot interconnect
The Dockerfile to install MPICH and the OSU micro-benchmarks that we will use to build the container image is shown below. Create a new directory called osu-benchmarks to hold the build context for this new image, and create the Dockerfile in this directory.
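For example, on the machine where you will run the Docker build:
$ mkdir osu-benchmarks
$ cd osu-benchmarks
$ nano Dockerfile    # or use your preferred editor to create the file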
FROM ubuntu:20.04
ENV DEBIAN_FRONTEND=noninteractive
# Install the necessary packages (from repo)
RUN apt-get update && apt-get install -y --no-install-recommends \
apt-utils \
build-essential \
curl \
libcurl4-openssl-dev \
libzmq3-dev \
pkg-config \
software-properties-common
RUN apt-get clean
RUN apt-get install -y dkms
RUN apt-get install -y autoconf automake build-essential numactl libnuma-dev gcc g++ git libtool
# Download and build an ABI compatible MPICH
RUN curl -sSLO http://www.mpich.org/static/downloads/3.4.2/mpich-3.4.2.tar.gz \
&& tar -xzf mpich-3.4.2.tar.gz -C /root \
&& cd /root/mpich-3.4.2 \
&& ./configure --prefix=/usr --with-device=ch4:ofi --disable-fortran \
&& make -j8 install \
&& cd / \
&& rm -rf /root/mpich-3.4.2 \
&& rm /mpich-3.4.2.tar.gz
# OSU benchmarks
RUN curl -sSLO http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-5.4.1.tar.gz \
&& tar -xzf osu-micro-benchmarks-5.4.1.tar.gz -C /root \
&& cd /root/osu-micro-benchmarks-5.4.1 \
&& ./configure --prefix=/usr/local CC=/usr/bin/mpicc CXX=/usr/bin/mpicxx \
&& cd mpi \
&& make -j8 install \
&& cd / \
&& rm -rf /root/osu-micro-benchmarks-5.4.1 \
&& rm /osu-micro-benchmarks-5.4.1.tar.gz
# Add the OSU benchmark executables to the PATH
ENV PATH=/usr/local/libexec/osu-micro-benchmarks/mpi/startup:$PATH
ENV PATH=/usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt:$PATH
ENV PATH=/usr/local/libexec/osu-micro-benchmarks/mpi/collective:$PATH
ENV OSU_DIR=/usr/local/libexec/osu-micro-benchmarks/mpi
# path to mlx IB libraries in Ubuntu
ENV LD_LIBRARY_PATH=/usr/lib/libibverbs:$LD_LIBRARY_PATH
A quick overview of what the above Dockerfile is doing:
- The image is being built based on the ubuntu:20.04 Docker image.
- In the RUN sections:
  - Ubuntu's apt-get package manager is used to update the package directory and then install the compilers and other libraries required for the MPICH and OSU benchmark build.
  - The MPICH software is downloaded, extracted, configured, built and installed. Note the use of the --with-device option to configure MPICH to use the correct driver to support improved communication performance on a high performance cluster. After the install is complete we delete the files that are no longer needed.
  - The OSU Micro-Benchmarks software is downloaded, extracted, configured, built and installed. After the install is complete we delete the files that are no longer needed.
- In the ENV sections: set environment variables that will be available within all containers run from the generated image.
Build and test the OSU Micro-Benchmarks image
Using the above Dockerfile, build a container image and push it to Docker Hub. Pull the image on ARCHER2 using Singularity to convert it to a Singularity image and then test it by running the osu_hello benchmark that is found in the startup benchmark folder with either singularity exec or singularity shell.
Note: the build process can take a while. If you want to test running while the build is happening, you can log into ARCHER2 and use a pre-built version of the container image to test. You can find this container image at:
${EPCC_SINGULARITY_DIR}/osu_benchmarks.sif
Solution
You should be able to build an image and push it to Docker Hub as follows:
$ docker image build --platform linux/amd64 -t alice/osu-benchmarks .
$ docker push alice/osu-benchmarks
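You can also sanity-check the image locally before (or after) pushing it; for example, the following optional commands should report the MPICH version built into the image and confirm that osu_hello is on the PATH:
$ docker container run --rm alice/osu-benchmarks mpichversion
$ docker container run --rm alice/osu-benchmarks which osu_hello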
You can then log into ARCHER2 and pull the container image from Docker Hub with:
remote$ singularity pull osu-benchmarks.sif docker://alice/osu-benchmarks
Let's begin with a single-process run of startup/osu_hello to ensure that we can run the container as expected. We'll use the MPI installation within the container for this test. Note that when we run a parallel job on an HPC cluster platform, we use the MPI installation on the cluster to coordinate the run, so things are a little different… we will see this shortly.
Start a shell in the Singularity container based on your image and then run a single-process job via mpirun:
$ singularity shell --contain osu-benchmarks.sif
Singularity> mpirun -np 1 osu_hello
You should see output similar to the following:
# OSU MPI Hello World Test v5.7.1
This is a test with 1 processes
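Equivalently, you could run the same single-process test non-interactively using singularity exec rather than singularity shell:
$ singularity exec --contain osu-benchmarks.sif mpirun -np 1 osu_hello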
Running Singularity containers with MPI on an HPC system
Assuming the above tests worked, we can now try undertaking a parallel run of one of the OSU benchmarking tools within our container image on the remote HPC platform.
This is where things get interesting and we will begin by looking at how Singularity containers are run within an MPI environment.
If you’re familiar with running MPI codes, you’ll know that you use mpirun
(as we did
in the previous example), mpiexec
, srun
or a similar MPI executable to start your
application. This executable may be run directly on the local system or cluster platform
that you’re using, or you may need to run it through a job script submitted to a job
scheduler. Your MPI-based application code, which will be linked against the MPI libraries,
will make MPI API calls into these MPI libraries which in turn talk to the MPI daemon
process running on the host system. This daemon process handles the communication between
MPI processes, including talking to the daemons on other nodes to exchange information
between processes running on different machines, as necessary.
When running code within a Singularity container, we don’t use the MPI executables stored within the container, i.e. we DO NOT run:
singularity exec mpirun -np <numprocs> /path/to/my/executable
Instead we use the MPI installation on the host system to run Singularity and start an instance of our executable from within a container for each MPI process. Without Singularity support in an MPI implementation, this results in starting a separate Singularity container instance within each process, which can present some overhead if a large number of processes are being run on a host. Where Singularity support is built into an MPI implementation, this overhead can be reduced.
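In other words, the general launch pattern inverts the incorrect command above: the host's MPI launcher starts Singularity, and Singularity starts the application inside the container. The exact launcher (mpirun, mpiexec, srun, ...) and its options depend on the host MPI installation and batch system:
mpirun -np <numprocs> singularity exec /path/to/image.sif /path/to/my/executable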
Ultimately, this means that our running MPI code is linking to the MPI libraries from the MPI install within our container and these are, in turn, communicating with the MPI daemon on the host system which is part of the host system’s MPI installation. In the case of MPICH, these two installations of MPI may be different but as long as there is ABI compatibility between the version of MPI installed in your container image and the version on the host system, your job should run successfully.
We can now try a 2-process MPI run of the point-to-point benchmark osu_latency on ARCHER2.
Undertake a parallel run of the osu_latency benchmark (general example)
Create a job submission script called submit.slurm on the /work file system on ARCHER2 to run containers based on the container image across two nodes on ARCHER2. The example below uses the osu-benchmarks container image that is already available on ARCHER2 but you can easily modify it to use your version of the container image if you wish - the results should be the same in both cases.
A template based on the example in the ARCHER2 documentation is:
#!/bin/bash
#SBATCH --job-name=singularity_parallel
#SBATCH --time=0:10:0
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --partition=standard
#SBATCH --qos=short
#SBATCH --account=[budget code]

# Load the module to make the Cray MPICH ABI available
module load cray-mpich-abi

export OMP_NUM_THREADS=1
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK

# Set the LD_LIBRARY_PATH environment variable within the Singularity container
# to ensure that it uses the correct MPI libraries.
export SINGULARITYENV_LD_LIBRARY_PATH="/opt/cray/pe/mpich/8.1.23/ofi/gnu/9.1/lib-abi-mpich:/opt/cray/pe/mpich/8.1.23/gtl/lib:/opt/cray/libfabric/1.12.1.2.2.0.0/lib64:/opt/cray/pe/gcc-libs:/opt/cray/pe/gcc-libs:/opt/cray/pe/lib64:/opt/cray/pe/lib64:/opt/cray/xpmem/default/lib64:/usr/lib64/libibverbs:/usr/lib64:/usr/lib64"

# This makes sure HPE Cray Slingshot interconnect libraries are available
# from inside the container.
export SINGULARITY_BIND="/opt/cray,/var/spool,/opt/cray/pe/mpich/8.1.23/ofi/gnu/9.1/lib-abi-mpich:/opt/cray/pe/mpich/8.1.23/gtl/lib,/etc/host.conf,/etc/libibverbs.d/mlx5.driver,/etc/libnl/classid,/etc/resolv.conf,/opt/cray/libfabric/1.12.1.2.2.0.0/lib64/libfabric.so.1,/opt/cray/pe/gcc-libs/libatomic.so.1,/opt/cray/pe/gcc-libs/libgcc_s.so.1,/opt/cray/pe/gcc-libs/libgfortran.so.5,/opt/cray/pe/gcc-libs/libquadmath.so.0,/opt/cray/pe/lib64/libpals.so.0,/opt/cray/pe/lib64/libpmi2.so.0,/opt/cray/pe/lib64/libpmi.so.0,/opt/cray/xpmem/default/lib64/libxpmem.so.0,/run/munge/munge.socket.2,/usr/lib64/libibverbs/libmlx5-rdmav34.so,/usr/lib64/libibverbs.so.1,/usr/lib64/libkeyutils.so.1,/usr/lib64/liblnetconfig.so.4,/usr/lib64/liblustreapi.so,/usr/lib64/libmunge.so.2,/usr/lib64/libnl-3.so.200,/usr/lib64/libnl-genl-3.so.200,/usr/lib64/libnl-route-3.so.200,/usr/lib64/librdmacm.so.1,/usr/lib64/libyaml-0.so.2"

# Launch the parallel job.
srun --hint=nomultithread --distribution=block:block \
    singularity exec ${EPCC_SINGULARITY_DIR}/osu_benchmarks.sif \
        osu_latency
Finally, submit the job to the batch system with
remote$ sbatch submit.slurm
Solution
As you can see in the job script shown above, we call srun on the host system and pass it the singularity executable, whose parameters are the image file and the name of the benchmark executable we want to run.
The following shows an example of the output you should expect to see. You should have latency values reported for message sizes up to 4 MB.
# OSU MPI Latency Test v5.6.2
# Size          Latency (us)
0                       0.38
1                       0.34
...
This has demonstrated that we can successfully run a parallel MPI executable from within a Singularity container.
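If you want to confirm that the container really is picking up the host's MPI library at runtime, one optional diagnostic (a sketch, using the install path that follows from the Dockerfile's --prefix=/usr/local) is to add a line like the following to the job script after the export statements; libmpi should then resolve to the host's lib-abi-mpich directory rather than the MPICH copy built into the image.
# Optional diagnostic: show which shared libraries the containerised
# benchmark resolves to at runtime (expect host MPI paths under /opt/cray).
srun --ntasks=1 singularity exec ${EPCC_SINGULARITY_DIR}/osu_benchmarks.sif \
    ldd /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency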
Investigate performance of native benchmark compared to containerised version
To get an idea of any difference in performance between the code within your Singularity image and the same code built natively on the target HPC platform, try running the osu_allreduce benchmark natively on ARCHER2 using all cores on at least 16 nodes (if you want to use more than 32 nodes, you will need to use the standard QoS rather than the short QoS). Then try running the same benchmark that you ran via the Singularity container. Do you see any performance differences?
What do you see?
Do you see the same when you run on small node counts - particularly a single node?
Note: a native version of the OSU micro-benchmark suite is available on ARCHER2 via module load osu-benchmarks; a sketch of a suitable job script is shown below.
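A minimal sketch of such a native job script (assuming 128 cores per ARCHER2 node and that the osu-benchmarks module puts osu_allreduce on your PATH; adjust the account, QoS, node count and time limit as needed) is:
#!/bin/bash
#SBATCH --job-name=osu_native
#SBATCH --time=0:10:0
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=128
#SBATCH --cpus-per-task=1
#SBATCH --partition=standard
#SBATCH --qos=short
#SBATCH --account=[budget code]

# Native (non-container) build of the OSU micro-benchmarks
module load osu-benchmarks

export OMP_NUM_THREADS=1
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK

srun --hint=nomultithread --distribution=block:block osu_allreduce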
Discussion
Here are some selected results measured on ARCHER2:
1 node:
- 4 B
  - Native: 6.13 us
  - Container: 5.30 us (16% faster)
- 128 KiB
  - Native: 173.00 us
  - Container: 230.38 us (25% slower)
- 1 MiB
  - Native: 1291.18 us
  - Container: 2101.02 us (39% slower)
16 nodes:
- 4 B
  - Native: 17.66 us
  - Container: 18.15 us (3% slower)
- 128 KiB
  - Native: 237.29 us
  - Container: 303.92 us (22% slower)
- 1 MiB
  - Native: 1501.25 us
  - Container: 2359.11 us (36% slower)
32 nodes:
- 4 B
  - Native: 30.72 us
  - Container: 24.41 us (20% faster)
- 128 KiB
  - Native: 265.36 us
  - Container: 363.58 us (26% slower)
- 1 MiB
  - Native: 1520.58 us
  - Container: 2429.24 us (36% slower)
For the medium and large messages, using a container produces substantially worse MPI performance for this benchmark on ARCHER2. When the messages are very small, containers match the native performance and can actually be faster.
Is this true for other MPI benchmarks that use all the cores on a node or is it specific to Allreduce?
Summary
Singularity can be combined with MPI to create portable containers that run software in parallel across multiple compute nodes. However, there are some limitations, specifically:
- You must use an MPI library in the container that is binary compatible with the MPI library on the host system - typically, your container will be based on either MPICH or OpenMPI.
- The host setup to enable MPI typically requires binding a large number of low-level libraries into the running container. You will usually require help from the HPC system support team to get the correct bind options for the platform you are using.
- Performance of containers+MPI can be substantially lower than the performance of native applications using MPI on the system. The effect is dependent on the MPI routines used in your application, message sizes and the number of MPI processes used.
Key Points
Singularity images containing MPI applications can be built on one platform and then run on another (e.g. an HPC cluster) if the two platforms have compatible MPI implementations.
When running an MPI application within a Singularity container, use the MPI executable on the host system to launch a Singularity container for each process.
Think about your application's parallel performance requirements and how where you build and run your image may affect them.