Part 1: Strong Scaling
In this section you will run a benchmark simulation and investigate the strong scaling parallel performance.
The benchmark system
The system we will use is from the HECBioSim benchmark suite: https://www.hecbiosim.ac.uk/access-hpc/benchmarks
We will focus on the 465K atom system:
hEGFR Dimer of 1IVO and 1NQL.
Total number of atoms = 465,399.
Protein atoms = 21,749
Lipid atoms = 134,268
Water atoms = 309,087
Ions = 295
The input file can be obtained from https://www.hecbiosim.ac.uk/access-hpc/benchmarks, or from our git repo: #TODO
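On ARCHER2 the file can also be fetched directly on the command line; a minimal sketch, where the download link is a placeholder that you would copy from the HECBioSim benchmarks page:

# Replace the placeholder with the actual download link for the
# 465k-atom benchmark copied from the HECBioSim page.
wget -O bench_465kHBS.tpr "<download link for the 465k-atom benchmark>"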
Running the benchmark
An example script to run the benchmark on ARCHER2 is shown below.
#!/bin/bash
#SBATCH --job-name=gmx_bench
#SBATCH --nodes=1
#SBATCH --tasks-per-node=128
#SBATCH --cpus-per-task=1
#SBATCH --time=00:10:00
# Replace [budget code] below with your project code (e.g. t01)
#SBATCH --account=[budget code]
#SBATCH --partition=standard
#SBATCH --qos=standard
# Setup the environment
module load gromacs
export OMP_NUM_THREADS=1
srun --distribution=block:block --hint=nomultithread gmx_mpi mdrun -s bench_465kHBS.tpr -v
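Assuming the script above has been saved as, say, submit_gmx.slurm (the filename is arbitrary), it can be submitted and monitored with the standard Slurm commands:

sbatch submit_gmx.slurm    # submit the job to the queue
squeue -u $USER            # check its state (PD = pending, R = running)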
The bottom of the md.log file will contain the performance timings, e.g.:
               Core t (s)   Wall t (s)        (%)
       Time:    12283.520       95.966    12799.9
                 (ns/day)    (hour/ns)
Performance:       18.008        1.333
The most useful numbers here are the wall time, 95.966 s, and the performance, 18.008 ns/day, which tells us how much simulation time (in nanoseconds) can be simulated in one day of wall-clock time.
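Rather than opening the whole file, these two summary lines can be extracted directly from md.log, for example:

# Print the timing and performance summary lines from md.log
grep -E "^ *(Time|Performance):" md.log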
Things to investigate
You should vary the number of nodes/CPUs used and plot the performance. This investigates the strong scaling of the program: the problem size stays fixed while the core count increases.
What node count would you use for a long production simulation?
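One way to generate the data for the scaling plot is to submit the same script at a range of node counts; a minimal sketch, where the directory layout and the script name submit_gmx.slurm are assumptions:

# Run each node count in its own directory so the md.log files do not
# overwrite each other; sbatch command-line options override the
# #SBATCH directives inside the script.
for nodes in 1 2 4 8 16; do
    mkdir -p nodes_${nodes}
    cp bench_465kHBS.tpr nodes_${nodes}/
    (cd nodes_${nodes} && sbatch --nodes=${nodes} --job-name=gmx_${nodes} ../submit_gmx.slurm)
done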
Previous benchmarks for this system can be found here: https://www.hecbiosim.ac.uk/access-hpc/our-benchmark-results/archer2-benchmarks
We have plotted our results for version 2021.3 of GROMACS on ARCHER2 here:
Comparing results with Amdahl’s law
Amdahl’s law characterizes the speed-up of a parallel program. It states that the speed-up for \(N\) processors \(S(N)\) is dependent on the serial \(s\) and parallel \(p\) portions of the code.
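In its usual form, with the serial and parallel fractions normalised so that \(s + p = 1\), the speed-up is:

\[
S(N) = \frac{1}{s + \frac{p}{N}}
\]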
We have plotted Amdahl’s law for \(p=\) 99%, 99.9%, and 99.95% to compare against the measured results.
We can see that these results lie close to the \(p=\) 99.9% curve, suggesting that 99.9% of the code is parallel. However, in reality it is not this simple: Amdahl's law assumes perfect load balance, which is not usually the case. The md.log file reports the load balance for this benchmark, e.g. for our run using 4 nodes:
Dynamic load balancing report:
DLB was permanently on during the run per user request.
Average load imbalance: 16.0%.
The balanceable part of the MD step is 78%, load imbalance is computed from this.
Part of the total run time spent waiting due to load imbalance: 12.5%.
Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 % Y 0 % Z 0 %
Average PME mesh/force load: 0.723
Part of the total run time spent waiting due to PP/PME imbalance: 6.2 %
NOTE: 12.5 % of the available CPU time was lost due to load imbalance
in the domain decomposition.
You can consider manually changing the decomposition (option -dd);
e.g. by using fewer domains along the box dimension in which there is
considerable inhomogeneity in the simulated system.
NOTE: 6.2 % performance was lost because the PME ranks
had less work to do than the PP ranks.
You might want to decrease the number of PME ranks
or decrease the cut-off and the grid spacing.
We can see that significant load imbalance is reported.
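Following the notes in the log, one thing to experiment with is the split between PP and PME ranks, which mdrun exposes through the -npme option. A minimal sketch using the same submission script; the value of -npme here is purely illustrative and would need tuning for your node count:

# Dedicate 16 ranks to PME instead of letting mdrun choose automatically;
# the number is illustrative and should be varied alongside the node count.
srun --distribution=block:block --hint=nomultithread \
    gmx_mpi mdrun -s bench_465kHBS.tpr -npme 16 -v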