OPTIONAL PRACTICAL: Benchmarking Molecular Dynamics Performance Using GROMACS

Overview

Teaching: 20 min
Exercises: 60 min
Questions
  • How does the performance of a small, 80k-atom system scale as more cores are used?

  • What about a larger, 12M-atom system?

Objectives
  • Gain a basic understanding of key aspects of running molecular dynamics simulations in parallel.

Aims

In this exercise, you will run molecular dynamics simulations using GROMACS. You will begin by benchmarking the strong-scaling performance of an 80k-atom GROMACS simulation. You will then look at the effect that increasing the OpenMP thread count has on performance when running GROMACS on a single node. Finally, you will see how simultaneous multithreading and dynamic load balancing can affect performance.

Measuring strong scaling

You will be running a benchmark that has been used to explore the performance/price behaviour of GROMACS on various generations of CPUs and GPUs. A number of publications have used this benchmark to report on, and compare, the performance of GROMACS on different systems (e.g. see https://doi.org/10.1002/jcc.24030).

The benchmark system in question is “benchMEM”, which is available from the list of standard MD benchmarks at https://www.mpibpc.mpg.de/grubmueller/bench. This benchmark simulates a membrane channel protein embedded in a lipid bilayer and surrounded by water and ions. At ~80,000 atoms, it serves as a prototypical example of a large class of setups used to study membrane-embedded proteins. More details are available on the benchmark page linked above.

To get a copy of “benchMEM”, run the following from your work directory:

wget https://www.mpibpc.mpg.de/15101317/benchMEM.zip
unzip benchMEM.zip
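
After unzipping, you should have the run input file benchMEM.tpr (the file referenced by the submission scripts below) in your work directory; as a quick sanity check:

ls -lh benchMEM.zip benchMEM.tpr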

Once the file is unzipped, you will need to create a Slurm submission script. Below is a copy of a script that will run on a single node, using a single core. You can either copy the script from here, or from the file /work/ta017/shared/GMX_sub.slurm on ARCHER2.

#!/bin/bash

#SBATCH --job-name=GMX_test
#SBATCH --account=ta022
#SBATCH --partition=standard
#SBATCH --qos=short
#SBATCH --reservation=shortqos
#SBATCH --time=0:5:0

#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=1

#SBATCH --distribution=block:block
#SBATCH --hint=nomultithread

module restore /etc/cray-pe.d/PrgEnv-gnu
module load gromacs

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun gmx_mpi mdrun -ntomp $SLURM_CPUS_PER_TASK -s benchMEM.tpr

Run this script to make sure that it works – how quickly does the job complete? You can see the walltime and performance of your run by running tail md.log. This simulation performs 10,000 steps, each of which advances the system by 2 fs of “real-life” time. The “Walltime” tells you how quickly the job ran on ARCHER2, and the “Performance” data tell you how many nanoseconds you can run in a day, and how many hours it takes to run a nanosecond of simulation.
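
As a reminder of the basic workflow (assuming the script above is saved as GMX_sub.slurm – the file name here is just an example), you can submit the job and then inspect the timing summary at the end of the log:

sbatch GMX_sub.slurm      # submit the benchmark job
squeue -u $USER           # check whether it is queued or running
tail -n 20 md.log         # once finished, show the timing and performance summary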

How do the “Walltime” and “Performance” data change as you increase the number of cores being used? You can vary this by changing #SBATCH --tasks-per-node=1 to a higher number. Try filling out the table below:

Number of cores | Walltime (s) | Performance (ns/day) | Performance (hours/ns)
              1 |              |                      |
              2 |              |                      |
              4 |              |                      |
              8 |              |                      |
             16 |              |                      |
             32 |              |                      |
             64 |              |                      |
            128 |              |                      |
           256* |              |                      |
           512* |              |                      |

NOTE

Jobs run on more than one node (the rows marked * above) should keep #SBATCH --tasks-per-node=128 constant and instead increase #SBATCH --nodes (e.g. --nodes=2 for 256 cores and --nodes=4 for 512 cores).
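
If you would rather not edit the script by hand for every run, note that options given on the sbatch command line override the #SBATCH directives inside the script. The loop below is a minimal sketch of this approach, again assuming the script is saved as GMX_sub.slurm; because every run writes md.log to the current directory, you may want to launch each job from its own sub-directory to keep the logs apart.

# Single-node runs: vary the number of MPI tasks on one node
for n in 1 2 4 8 16 32 64 128; do
    sbatch --nodes=1 --ntasks-per-node=${n} GMX_sub.slurm
done

# Multi-node runs: keep 128 tasks per node and increase the node count
for nodes in 2 4; do
    sbatch --nodes=${nodes} --ntasks-per-node=128 GMX_sub.slurm
done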


Measuring hybrid OpenMP + MPI performance on a single node

GROMACS can run in parallel using MPI while simultaneously using OpenMP threads within each MPI rank (see the GROMACS documentation on parallelisation). In this part of the tutorial, you will learn how to run hybrid MPI and OpenMP jobs on ARCHER2, and you will benchmark the performance of the benchMEM system to see whether performance improves when using OpenMP threads.

For this tutorial, you will start by comparing the performance of simulations that use all of the cores on a node. Using the following Slurm submission script template, try running simulations that use varying numbers of MPI tasks and OpenMP threads. You can do this by changing the #SBATCH --tasks-per-node and #SBATCH --cpus-per-task lines (making sure that the number of MPI ranks multiplied by the number of OpenMP threads is always 128 or less).

#!/bin/bash

#SBATCH --job-name=GMX_test
#SBATCH --account=ta022
#SBATCH --partition=standard
#SBATCH --qos=short
#SBATCH --reservation=shortqos
#SBATCH --time=0:5:0

#SBATCH --nodes=1
#SBATCH --tasks-per-node=128
#SBATCH --cpus-per-task=1

#SBATCH --distribution=block:block
#SBATCH --hint=nomultithread

module restore /etc/cray-pe.d/PrgEnv-gnu
module load gromacs

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun gmx_mpi mdrun -ntomp $SLURM_CPUS_PER_TASK -s benchMEM.tpr

How do the simulation times change as you vary the balance between MPI tasks and OpenMP threads? How do these times change if you do not spread the threads over the NUMA regions as suggested?
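
One knob you can experiment with for the placement question above (a suggestion beyond the original instructions, not a required step) is the second component of the --distribution option, which controls how the CPUs bound to each task are laid out across the sockets/NUMA regions of the node:

# Used above: keep each task's CPUs (and hence its OpenMP threads) together in one NUMA region
#SBATCH --distribution=block:block

# Alternative: spread each task's CPUs across the NUMA regions in round-robin order
#SBATCH --distribution=block:cyclic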

You may find it helpful to fill out this table:

MPI ranks | OpenMP threads | Walltime (s) | Performance (ns/day)
      128 |              1 |              |
       64 |              2 |              |
       42 |              3 |              |
       32 |              4 |              |
       25 |              5 |              |
       16 |              8 |              |
        8 |             16 |              |
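
As before, you can drive these runs from the command line rather than editing the script for every combination. A minimal sketch, assuming the hybrid script above has been saved as GMX_hybrid.slurm (a file name chosen here for illustration):

# Each entry is "<MPI ranks> <OpenMP threads per rank>"; ranks x threads must not exceed 128
for combo in "128 1" "64 2" "42 3" "32 4" "25 5" "16 8" "8 16"; do
    set -- ${combo}
    sbatch --nodes=1 --ntasks-per-node=$1 --cpus-per-task=$2 GMX_hybrid.slurm
done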

Multithreading and performance

The --hint=nomultithread option used so far asks Slurm not to use the second hardware thread on each core. Replacing it with --hint=multithread makes 256 “cpus” available per node (2 hardware threads per core). To run 8 MPI tasks, one per NUMA region, each running 32 OpenMP threads, the script would look like:

#!/usr/bin/env bash

#SBATCH --job-name=GMX_test
#SBATCH --account=ta022
#SBATCH --partition=standard
#SBATCH --qos=short
#SBATCH --reservation=shortqos
#SBATCH --time=0:5:0

#SBATCH --nodes=1
#SBATCH --tasks-per-node=8
#SBATCH --cpus-per-task=32

#SBATCH --hint=multithread
#SBATCH --distribution=block:block

module restore /etc/cray-pe.d/PrgEnv-gnu
module load gromacs

export OMP_PLACES=cores
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun gmx_mpi mdrun -ntomp $SLURM_CPUS_PER_TASK -s benchMEM.tpr

Note: the physical cores appear as logical CPUs 0-127, while the extra hardware threads are numbered 128-255. Logical CPUs 0 and 128 share the same physical core, 1 and 129 share the next, and so on.
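
If you want to confirm where Slurm has actually placed your tasks, one simple option (an addition to the exercise, not part of the original script) is to ask srun to report its CPU binding when it launches GROMACS:

# Print the CPU mask assigned to each task before mdrun starts
srun --cpu-bind=verbose gmx_mpi mdrun -ntomp $SLURM_CPUS_PER_TASK -s benchMEM.tpr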

Multithreading and GROMACS?

Starting with the MPI-only case, how does enabling multithreading affect GROMACS performance?

What about the performance of hybrid MPI+OpenMP jobs?

Load balancing

GROMACS performs dynamic load balancing when it deems it necessary. Can you tell from your md.log files so far whether it has been doing this, and what load imbalance it measured before deciding to do so?

To demonstrate the effect of the load imbalance counteracted by GROMACS’s dynamic load balancing scheme, investigate what happens when this is turned off by including the -dlb no option to gmx_mpi mdrun.
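
Only the mdrun line needs to change; -dlb is a standard mdrun option, and setting it to no forces dynamic load balancing off:

# Run with dynamic load balancing disabled
srun gmx_mpi mdrun -ntomp $SLURM_CPUS_PER_TASK -dlb no -s benchMEM.tpr

You can then compare the performance figures and search the logs for the load-balancing report, for example with grep -i "load imbalance" md.log (the exact wording of the message may vary between GROMACS versions).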

Key Points

  • Larger systems scale better to large core-/node-counts than smaller systems.