PRACTICAL: Benchmarking Molecular Dynamics Using GROMACS 2
Overview
Teaching: 10 min
Exercises: 20 min
Questions
How do we run hybrid MPI and OpenMP jobs on ARCHER2?
Does adding OpenMP to MPI GROMACS affect performance?
Objectives
Run a hybrid MPI and OpenMP simulation on ARCHER2.
See how GROMACS performance changes when OpenMP threads are included.
Aims
GROMACS can run in parallel using MPI and, at the same time, using OpenMP threads within each MPI task (see here). In this tutorial, you will learn how to run hybrid MPI and OpenMP jobs on ARCHER2, and you will benchmark the performance of the benchMEM system to see whether performance improves when OpenMP threads are used.
Hybrid MPI and OpenMP jobs on ARCHER2
When running hybrid MPI (where the individual tasks are also known as ranks or processes) and OpenMP (multiple threads) jobs, you need to leave cores free between the parallel tasks launched using srun so that each MPI task has room for its OpenMP threads.
As we saw above, you can use the options to sbatch to control how many parallel tasks are placed on each compute node, and the --cpus-per-task option to set the stride between parallel tasks to the right value to accommodate the OpenMP threads. The value of --cpus-per-task should usually be the same as that of OMP_NUM_THREADS.
As an example, consider the job script below, which runs on a single node with 8 MPI tasks and 16 OpenMP threads per MPI task (so all 128 cores on the node are used).
Here we use the standard OpenMP control setting OMP_PLACES=cores
to specify that placement should be on the basis of cores.
#!/bin/bash

# Slurm resource request: 1 node, 8 MPI tasks per node, 16 cores per task
# (one core per OpenMP thread), with hyperthreading disabled
#SBATCH --partition=standard
#SBATCH --qos=short
#SBATCH --reservation=shortqos
#SBATCH --account=ta017
#SBATCH --time=00:10:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=16
#SBATCH --hint=nomultithread
#SBATCH --distribution=block:cyclic

# Load the GNU programming environment and GROMACS
module restore /etc/cray-pe.d/PrgEnv-gnu
module load gromacs

# Place OpenMP threads on cores and match the thread count to --cpus-per-task
export OMP_PLACES=cores
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Launch GROMACS, telling mdrun how many OpenMP threads to use per MPI task
srun gmx_mpi mdrun -ntomp $SLURM_CPUS_PER_TASK -s benchMEM.tpr
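Once the script is saved, submit it with sbatch and monitor it with squeue; the filename used here is just a placeholder:
# Submit the hybrid job script (hybrid_gromacs.slurm is an example name)
sbatch hybrid_gromacs.slurm
# Check the state of your jobs in the queue
squeue -u $USER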
Each ARCHER2 compute node is made up of 8 NUMA (Non-Uniform Memory Access) regions (4 per socket) with 16 cores in each region. Programs where the threads span multiple NUMA regions are likely to be less efficient, so we recommend using thread counts that fit well into the ARCHER2 compute node layout. Effectively, this means one of the following options for nodes where all cores are used (the snippet after the list shows how to adjust the job script for one of these layouts):
- 8 MPI tasks per node and 16 OpenMP threads per task: equivalent to 1 MPI task per NUMA region
- 16 MPI tasks per node and 8 OpenMP threads per task: equivalent to 2 MPI tasks per NUMA region
- 32 MPI tasks per node and 4 OpenMP threads per task: equivalent to 4 MPI tasks per NUMA region
- 64 MPI tasks per node and 2 OpenMP threads per task: equivalent to 8 MPI tasks per NUMA region
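For example, switching the job script above from 8 MPI tasks and 16 threads per node to 32 MPI tasks and 4 threads per node only requires changing the task and thread counts; a minimal sketch of the lines that change:
# 32 MPI tasks per node, 4 OpenMP threads per task (4 tasks per NUMA region)
#SBATCH --ntasks-per-node=32
#SBATCH --cpus-per-task=4

# OMP_NUM_THREADS is taken from SLURM_CPUS_PER_TASK, so the export and
# srun lines in the script stay exactly as before
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun gmx_mpi mdrun -ntomp $SLURM_CPUS_PER_TASK -s benchMEM.tpr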
Instructions
For this tutorial, you will start by comparing the performance of simulations that use all of the cores on a node. Using the job script above as a template, try running simulations with varying numbers of MPI ranks and OpenMP threads (making sure that the product of MPI ranks and OpenMP threads is always 128 or less). How do the simulation times change as you shift the balance between MPI ranks and OpenMP threads? How do these times change if your combination of ranks and threads does not fit the NUMA-region layout suggested above?
You may find it helpful to fill out the table below (a short log-scraping snippet after the table shows one way to extract the timings).
| MPI ranks | OpenMP threads | Walltime (s) | Performance (ns/day) |
|-----------|----------------|--------------|----------------------|
| 128       | 1              |              |                      |
| 64        | 2              |              |                      |
| 42        | 3              |              |                      |
| 32        | 4              |              |                      |
| 25        | 5              |              |                      |
| 16        | 8              |              |                      |
| 8         | 16             |              |                      |
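One way to fill in the walltime and performance columns is to read them from the end of the mdrun log file. A minimal sketch, assuming the default log file name md.log (mdrun writes to a different name if you pass -g or -deffnm), and noting that the exact layout of these lines can vary slightly between GROMACS versions:
# Wall-clock time: printed in the timing table at the end of the log
grep -A 1 "Core t (s)" md.log
# Performance in ns/day and hour/ns
grep "Performance:" md.log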
Key Points
Running GROMACS with hybrid MPI and OpenMP does affect performance.
When running hybrid jobs, placement across NUMA regions is important.