Running MPI parallel jobs using Singularity containers
Overview
Teaching: 30 min
Exercises: 40 minQuestions
How do I set up and run an MPI job from a Singularity container?
Objectives
Learn how MPI applications within Singularity containers can be run on HPC platforms
Understand the challenges and related performance implications when running MPI jobs via Singularity
What is MPI?
MPI - Message Passing Interface - is a widely used standard for parallel programming. It is used for exchanging messages/data between processes in a parallel application. If you’ve been involved in developing or working with computational science software.
Usually, when working on HPC systems, you compile your application against the MPI libraries provided on the system or you use applications that have been compiled by the HPC system support team. This approach to portability: source code portability is the traditional approach to making applications portable to different HPC platforms.
However, compiling complex HPC applications that have lots of dependencies (including MPI) is not always straightforward and can be a significant challenge as most HPC systems differ in various ways in terms of OS and base software available. There are a number of different approaches that can be taken to make it easier to deploy applications on HPC systems; for example, the Spack software automates the dependency resolution and compilation of applications. Containers provide another potential way to resolve these problems but care needs to be taken when interfacing with MPI on the host system which adds more complexity to running containers in parallel on HPC systems.
MPI codes with Singularity containers
Obviously, we will not have admin/root access on the HPC platform we are using so cannot (usually) build our container images on the HPC system itself. However, we do need to ensure our container is using the MPI library on the HPC system itself so we can get the performance benefit of the HPC interconnect. How do we overcome these contradictions?
The answer is that we install a version of the MPI library in our container image that is binary compatible with the MPI library on the host system and install our software in the container image using the local version of the MPI library. At runtime, we then ensure that the MPI library from the host is used within the running container rather than the locally-installed version of MPI.
There are two widely used open source MPI library distributions on HPC systems:
- MPICH - in addition to the open source version, MPICH is binary compatible with many proprietary vendor libraries, including Intel MPI and HPE Cray MPT as well as the open source MVAPICH.
- OpenMPI
This typically means that if you want to distribute HPC software that uses MPI within a container image you will need to maintain versions that are compatible with both MPICH and OpenMPI. There are efforts underway to provide tools that will provide a binary interface between different MPI implementations, e.g. HPE Cray’s MPIxlate software but these are not generally available yet.
Building a Singularity container image with MPI software
This example makes the assumption that you’ll be building a container image on a local platform and then deploying it to a HPC system with a different but compatible MPI implementation using a combination of the Hybrid and Bind models from the Singularity documentation. We will build our application using MPI in the container image but will bind the MPI library from the host into the container at runtime. See Singularity and MPI applications in the Singularity documentation for more technical details.
The example we will build will:
- Use MPICH as the container image’s MPI library
- Use the Ohio State University MPI Micro-benchmarks as the example application
- Use ARCHER2 as the runtime platform - this uses Cray MPT as the host MPI library and the HPE Cray Slingshot interconnect
The Dockerfile to install MPICH and the OSU micro-benchmark we will use
to build the container image is shown below. Create a new directory called
osu-benchmarks
to hold the build context for this new image. Create the
Dockerfile
in this directory.
FROM ubuntu:20.04
ENV DEBIAN_FRONTEND=noninteractive
# Install the necessary packages (from repo)
RUN apt-get update && apt-get install -y --no-install-recommends \
apt-utils \
build-essential \
curl \
libcurl4-openssl-dev \
libzmq3-dev \
pkg-config \
software-properties-common
RUN apt-get clean
RUN apt-get install -y dkms
RUN apt-get install -y autoconf automake build-essential numactl libnuma-dev autoconf automake gcc g++ git libtool
# Download and build an ABI compatible MPICH
RUN curl -sSLO http://www.mpich.org/static/downloads/3.4.2/mpich-3.4.2.tar.gz \
&& tar -xzf mpich-3.4.2.tar.gz -C /root \
&& cd /root/mpich-3.4.2 \
&& ./configure --prefix=/usr --with-device=ch4:ofi --disable-fortran \
&& make -j8 install \
&& cd / \
&& rm -rf /root/mpich-3.4.2 \
&& rm /mpich-3.4.2.tar.gz
# OSU benchmarks
RUN curl -sSLO http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-5.4.1.tar.gz \
&& tar -xzf osu-micro-benchmarks-5.4.1.tar.gz -C /root \
&& cd /root/osu-micro-benchmarks-5.4.1 \
&& ./configure --prefix=/usr/local CC=/usr/bin/mpicc CXX=/usr/bin/mpicxx \
&& cd mpi \
&& make -j8 install \
&& cd / \
&& rm -rf /root/osu-micro-benchmarks-5.4.1 \
&& rm /osu-micro-benchmarks-5.4.1.tar.gz
# Add the OSU benchmark executables to the PATH
ENV PATH=/usr/local/libexec/osu-micro-benchmarks/mpi/startup:$PATH
ENV PATH=/usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt:$PATH
ENV PATH=/usr/local/libexec/osu-micro-benchmarks/mpi/collective:$PATH
ENV OSU_DIR=/usr/local/libexec/osu-micro-benchmarks/mpi
# path to mlx IB libraries in Ubuntu
ENV LD_LIBRARY_PATH=/usr/lib/libibverbs:$LD_LIBRARY_PATH
A quick overview of what the above definition file is doing:
- The image is being built based on the
ubuntu:20.04
Docker image. - In the
RUN
sections:- Ubuntu’s
apt-get
package manager is used to update the package directory and then install the compilers and other libraries required for the MPICH and OSU benchmark build. - The MPICH software is downloaded, extracted, configured, built and installed. Note the use of the
--with-device
option to configure MPICH to use the correct driver to support improved communication performance on a high performance cluster. After the install is complete we delete the files that are no longer needed. - The OSU Micro-Benchmarks software is downloaded, extracted, configured, built and installed. After the install is complete we delete the files that are no longer needed.
- Ubuntu’s
- In the
ENV
sections: Set environment variables that will be available within all containers run from the generated image.
Build and test the OSU Micro-Benchmarks image
Using the above
Dockerfile
, build a container image and push it to Docker Hub.Pull the image on ARCHER2 using Singularity to convert it to a Singularity image and the test it by running the
osu_hello
benchmark that is found in thestartup
benchmark folder with eithersingularity exec
orsingularity shell
.Note: the build process can take a while. If you want to test running while the build is happening, you can log into ARCHER2 and use a pre-built version of the container image to test. You can find this container image at:
${EPCC_SINGULARITY_DIR}/osu_benchmarks.sif
Solution
You should be able to build an image and push it to Docker Hub as follows:
$ docker image build --platform linux/amd64 -t alice/osu-benchmarks $ docker push alice/osu-benchmarks
You can then log into ARCHER2 and pull the container image from Docker Hub with:
remote$ singularity pull osu-benchmarks.sif docker://alice/osu-benchmarks
Let’s begin with a single-process run of
startup/osu_hello
to ensure that we can run the container as expected. We’ll use the MPI installation within the container for this test. Note that when we run a parallel job on an HPC cluster platform, we use the MPI installation on the cluster to coordinate the run so things are a little different… we will see this shortly.Start a shell in the Singularity container based on your image and then run a single process job via
mpirun
:$ singularity shell --contain osu_benchmarks.sif Singularity> mpirun -np 1 osu_hello
You should see output similar to the following:
# OSU MPI Hello World Test v5.7.1 This is a test with 1 processes
Running Singularity containers with MPI on HPC system
Assuming the above tests worked, we can now try undertaking a parallel run of one of the OSU benchmarking tools within our container image on the remote HPC platform.
This is where things get interesting and we will begin by looking at how Singularity containers are run within an MPI environment.
If you’re familiar with running MPI codes, you’ll know that you use mpirun
(as we did
in the previous example), mpiexec
, srun
or a similar MPI executable to start your
application. This executable may be run directly on the local system or cluster platform
that you’re using, or you may need to run it through a job script submitted to a job
scheduler. Your MPI-based application code, which will be linked against the MPI libraries,
will make MPI API calls into these MPI libraries which in turn talk to the MPI daemon
process running on the host system. This daemon process handles the communication between
MPI processes, including talking to the daemons on other nodes to exchange information
between processes running on different machines, as necessary.
When running code within a Singularity container, we don’t use the MPI executables stored within the container, i.e. we DO NOT run:
singularity exec mpirun -np <numprocs> /path/to/my/executable
Instead we use the MPI installation on the host system to run Singularity and start an instance of our executable from within a container for each MPI process. Without Singularity support in an MPI implementation, this results in starting a separate Singularity container instance within each process. This can present some overhead if a large number of processes are being run on a host. Where Singularity support is built into an MPI implementation this can address this potential issue and reduce the overhead of running code from within a container as part of an MPI job.
Ultimately, this means that our running MPI code is linking to the MPI libraries from the MPI install within our container and these are, in turn, communicating with the MPI daemon on the host system which is part of the host system’s MPI installation. In the case of MPICH, these two installations of MPI may be different but as long as there is ABI compatibility between the version of MPI installed in your container image and the version on the host system, your job should run successfully.
We can now try running a 2-process MPI run of a point to point benchmark osu_latency
on
ARCHER2.
Undertake a parallel run of the
osu_latency
benchmark (general example)Create a job submission script called
submit.slurm
on the /work file system on ARCHER2 to run containers based on the container image across two nodes on ARCHER2. The example below uses the osu-benchmarks container image that is already available on ARCHER2 but you can easily modify it to use your version of the container image if you wish - the results should be the same in both cases.A template based on the example in the ARCHER2 documentation is:
#!/bin/bash #SBATCH --job-name=singularity_parallel #SBATCH --time=0:10:0 #SBATCH --nodes=2 #SBATCH --ntasks-per-node=1 #SBATCH --cpus-per-task=1 #SBATCH --partition=standard #SBATCH --qos=short #SBATCH --account=[budget code] # Load the module to make the Cray MPICH ABI available module load cray-mpich-abi export OMP_NUM_THREADS=1 export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK # Set the LD_LIBRARY_PATH environment variable within the Singularity container # to ensure that it used the correct MPI libraries. export SINGULARITYENV_LD_LIBRARY_PATH="/opt/cray/pe/mpich/8.1.23/ofi/gnu/9.1/lib-abi-mpich:/opt/cray/pe/mpich/8.1.23/gtl/lib:/opt/cray/libfabric/1.12.1.2.2.0.0/lib64:/opt/cray/pe/gcc-libs:/opt/cray/pe/gcc-libs:/opt/cray/pe/lib64:/opt/cray/pe/lib64:/opt/cray/xpmem/default/lib64:/usr/lib64/libibverbs:/usr/lib64:/usr/lib64" # This makes sure HPE Cray Slingshot interconnect libraries are available # from inside the container. export SINGULARITY_BIND="/opt/cray,/var/spool,/opt/cray/pe/mpich/8.1.23/ofi/gnu/9.1/lib-abi-mpich:/opt/cray/pe/mpich/8.1.23/gtl/lib,/etc/host.conf,/etc/libibverbs.d/mlx5.driver,/etc/libnl/classid,/etc/resolv.conf,/opt/cray/libfabric/1.12.1.2.2.0.0/lib64/libfabric.so.1,/opt/cray/pe/gcc-libs/libatomic.so.1,/opt/cray/pe/gcc-libs/libgcc_s.so.1,/opt/cray/pe/gcc-libs/libgfortran.so.5,/opt/cray/pe/gcc-libs/libquadmath.so.0,/opt/cray/pe/lib64/libpals.so.0,/opt/cray/pe/lib64/libpmi2.so.0,/opt/cray/pe/lib64/libpmi.so.0,/opt/cray/xpmem/default/lib64/libxpmem.so.0,/run/munge/munge.socket.2,/usr/lib64/libibverbs/libmlx5-rdmav34.so,/usr/lib64/libibverbs.so.1,/usr/lib64/libkeyutils.so.1,/usr/lib64/liblnetconfig.so.4,/usr/lib64/liblustreapi.so,/usr/lib64/libmunge.so.2,/usr/lib64/libnl-3.so.200,/usr/lib64/libnl-genl-3.so.200,/usr/lib64/libnl-route-3.so.200,/usr/lib64/librdmacm.so.1,/usr/lib64/libyaml-0.so.2" # Launch the parallel job. srun --hint=nomultithread --distribution=block:block \ singularity exec ${EPCC_SINGULARITY_DIR}/osu_benchmarks.sif \ osu_latency
Finally, submit the job to the batch system with
remote$ sbatch submit.slurm
Solution
As you can see in the job script shown above, we have called
srun
on the host system and are passing to MPI thesingularity
executable for which the parameters are the image file and the name of the benchmark executable we want to run.The following shows an example of the output you should expect to see. You should have latency values reported for message sizes up to 4MB.
# OSU MPI Latency Test v5.6.2 # Size Latency (us) 0 0.38 1 0.34 ...
This has demonstrated that we can successfully run a parallel MPI executable from within a Singularity container.
Investigate performance of native benchmark compared to containerised version
To get an idea of any difference in performance between the code within your Singularity image and the same code built natively on the target HPC platform, try running the
osu_allreduce
benchmarks natively on ARCHER2 on all cores on at least 16 nodes (if you want to use more than 32 nodes, you will need to use thestandard
QoS rather than theshort QoS
). Then try running the same benchmark that you ran via the Singularity container. Do you see any performance differences?What do you see?
Do you see the same when you run on small node counts - particularly a single node?
Note: a native version of the OSU micro-benchmark suite is available on ARCHER2 via
module load osu-benchmarks
.Discussion
Here are some selected results measured on ARCHER2:
1 node:
- 4 B
- Native: 6.13 us
- Container: 5.30 us (16% faster)
- 128 KiB
- Native: 173.00 us
- Container: 230.38 us (25% slower)
- 1 MiB
- Native: 1291.18 us
- Container: 2101.02 us (39% slower)
16 nodes:
- 4 B
- Native: 17.66 us
- Container: 18.15 us (3% slower)
- 128 KiB
- Native: 237.29 us
- Container: 303.92 us (22% slower)
- 1 MiB
- Native: 1501.25 us
- Container: 2359.11 us (36% slower)
32 nodes:
- 4 B
- Native: 30.72 us
- Container: 24.41 us (20% faster)
- 128 KiB
- Native: 265.36 us
- Container: 363.58 us (26% slower)
- 1 MiB
- Native: 1520.58 us
- Container: 2429.24 us (36% slower)
For the medium and large messages, using a container produces substantially worse MPI performance for this benchmark on ARCHER2. When the messages are very small, containers match the native performance and can actually be faster.
Is this true for other MPI benchmarks that use all the cores on a node or is it specific to Allreduce?
Summary
Singularity can be combined with MPI to create portable containers that run software in parallel across multiple compute nodes. However, there are some limitations, specifically:
- You must use an MPI library in the container that is binary compatible with the MPI library on the host system - typically, your container will be based on either MPICH or OpenMPI.
- The host setup to enable MPI typically requires binding a large number of low-level libraries into the running container. You will usually require help from the HPC system support team to get the correct bind options for the platform you are using.
- Performance of containers+MPI can be substantially lower than the performance of native applications using MPI on the system. The effect is dependent on the MPI routines used in your application, message sizes and the number of MPI processes used.
Key Points
Singularity images containing MPI applications can be built on one platform and then run on another (e.g. an HPC cluster) if the two platforms have compatible MPI implementations.
When running an MPI application within a Singularity container, use the MPI executable on the host system to launch a Singularity container for each process.
Think about parallel application performance requirements and how where you build/run your image may affect that.