HPC performance and benchmarking

Overview

Teaching: 30 min
Exercises: 25 min
Questions
  • Why should I benchmark my use of HPC?

  • What are the key benchmarking concepts that I should understand?

  • What is the right performance metric for my HPC use?

  • What parameters can affect the performance of my applications?

Objectives
  • Understand how benchmarking can improve my use of HPC resources.

  • Understand key benchmarking concepts and why they are useful for me.

  • Be able to identify the correct performance metric for my HPC use.

Having looked at workflow components in general, we will now move on to the specifics of understanding the HPC software component of your workflow, so that you can plan and use your HPC resources more efficiently.

The main tool we are going to use to understand performance during this course is benchmarking.

What is benchmarking?

Benchmarking is measuring how the performance of something varies as you change parameters. In our case, we are benchmarking parallel software on HPC systems, so the parameters we measure performance variation against are usually the amount of parallel resource used (for example, the number of nodes, cores or MPI processes) and the input to the software.

We often want to explore multiple parameters in our benchmarking to get an idea of how performance varies. Needless to say, the search space can become very large!

Why use benchmarking?

Benchmarking your use of software on HPC resources is potentially useful for a range of reasons. These could include choosing the amount of resource (such as the number of nodes or cores) that uses your compute time most efficiently, and understanding how the performance of your application changes as its input changes.

Benchmarking is also commonly used in purchasing new HPC systems to make sure that the new system gives the right level of performance for users. However, you are unlikely to be purchasing your own HPC system so we will not discuss this scenario further here!

Both program and input are important

Remember, it is not just the software package you are benchmarking - it is the combination of the software package and the input data that constitutes the benchmark case. Throughout this course we will refer to this combination of the software and the input as the application.

Key benchmarking terminology and concepts

We will use a number of different terms and concepts throughout our discussion on benchmarking so we will define them first:

  • Timing: a measurement of how long the application, or a part of it, takes to run - for example, the calculation time or the overall run time.

  • Performance: the rate at which the application does work, computed as an amount of work divided by a timing - for example, Mpixels/s for the image sharpening application.

  • Baseline: the reference run, usually the smallest size that is feasible, against which other runs are compared.

Your application use and benchmarking

Think about your use of HPC. For an HPC application you use, try to identify the timings and performance metric you might use when benchmarking. Why do you think the metrics you have chosen are the correct ones for this case?

Practical considerations

Benchmarking the image sharpening program

Now we will use a simple example HPC application to run some benchmarks and extract timings and performance data. In the next part of this course we will look at how to analyse and present this data to help us interpret the performance of the application.

To do this, we will run the image sharpening program on different numbers of MPI processes for the same input to look at how well its performance scales.

Initial setup

Log into ARCHER2, if you are not already logged in, and load the training/sharpen module to gain access to the software and input data:

module load training/sharpen

Once this is done, move to your /work directory, create a sub-directory to contain our benchmarking results and move into it (remember to replace ta023 with the correct project code for your course and auser with your username on ARCHER2).

Only work file system is visible on the compute nodes

Remember, the work file system is the only one available on the ARCHER2 compute nodes. All jobs should be launched from a directory on the work file system to ensure they run correctly.

cd /work/ta023/ta023/auser
mkdir sharpen-bench
cd sharpen-bench

Copy the input data from the central location to your directory:

cp $SHARPEN_INPUT/fuzzy.pgm .
ls
fuzzy.pgm

Baseline performance

For this small example, we are going to use a run on a single core of a compute node as our baseline.

Baseline size

Remember that for real parallel applications it will often not be possible to use a single core, or even a single node, as your baseline (due to memory requirements or fitting the run within a reasonable runtime). Nevertheless, you should try to use the smallest size that you feasibly can for your baseline.

Run the single core calculation on an ARCHER2 compute node with:

srun --partition=standard --qos=standard --reservation= --account=ta023 --hint=nomultithread --distribution=block:block --nodes=1 --ntasks-per-node=1 --time=0:10:0  sharpen-mpi.x > sharpen_1core_001.out
srun: job 62318 queued and waiting for resources
srun: job 62318 has been allocated resources

Using srun in this way launches the application on a compute node with the specified resources.

This line is quite long and would be tedious to type out each time we want to run a calculation, so we will set up a command alias containing the options that do not change between runs:

alias srunopt="srun --partition=standard --qos=standard --reservation= --account=ta023 --hint=nomultithread --distribution=block:block"

Making the alias permanent

If you want this alias to persist and be available each time you log into ARCHER2 then you can add the alias command above to the end of your ~/.bashrc file on ARCHER2.
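For example, a minimal sketch that simply appends the alias definition used above to your shell startup file:

echo 'alias srunopt="srun --partition=standard --qos=standard --reservation= --account=ta023 --hint=nomultithread --distribution=block:block"' >> ~/.bashrc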

Now we can run the baseline calculation again with:

srunopt --time=0:10:0 --nodes=1 --ntasks-per-node=1 sharpen-mpi.x > sharpen_1core_002.out
srun: job 62321 queued and waiting for resources
srun: job 62321 has been allocated resources

Run the baseline calculation one more time so that we have three separate results.

Let's take a look at the output from one of our baseline runs:

cat sharpen_1core_001.out
 Image sharpening code running on  1 process(es)
 Input file is: fuzzy.pgm                       
 Image size is: 564 x 770 pixels
 
 Using a filter of size 17 x  17 pixels
 Reading image file: fuzzy.pgm                       
 ... done
 
 Starting calculation ...
Rank 0 on core 0 of node <nid001961>
 .. finished
 
 Writing output file: sharpened.pgm                   
 
 ... done
 
 Calculation time was  2.882 seconds
 Overall run time was  3.182 seconds

You can see that the output reports various parameters. In terms of timing and performance metrics, the ones of interest to us are:

  • the image size: 564 x 770 pixels, i.e. 434,280 pixels or 0.434280 Mpixels

  • the calculation time: 2.882 seconds

  • the overall run time: 3.182 seconds

Based on these parameters, we can propose two timings and corresponding performance metrics:

  • the calculation time, with the calculation performance computed as the image size divided by the calculation time (in Mpixels/s)

  • the overall run time, with the overall performance computed as the image size divided by the overall run time (in Mpixels/s)

So, for the output above:

  • calculation performance = 0.434280 Mpixels / 2.882 s ≈ 0.151 Mpixels/s

  • overall performance = 0.434280 Mpixels / 3.182 s ≈ 0.136 Mpixels/s

Combining multiple runs

Of course, we have three sets of data for our baseline rather than just the single result. What is the best way to combine these to produce our final performance metric?

The answer depends on what you are measuring and why. Some examples:

  • If you are interested in the best achievable performance (as we are in this scaling study), use the minimum timing, which corresponds to the maximum measured performance.

  • If you want to estimate how long a typical run will take - for example, to set the walltime for future job submissions - the mean timing, and its variation, is more appropriate.

In this case, we are interested in the change in performance as we change the number of MPI processes (or cores used) so we will use the minimum timing value from the multiple runs (as this corresponds to the maximum measured performance). You can look at this more conveniently using the grep command:

grep time *.out
sharpen_1core_001.out: Calculation time was  2.882 seconds
sharpen_1core_001.out: Overall run time was  3.182 seconds
sharpen_1core_002.out: Calculation time was  2.857 seconds
sharpen_1core_002.out: Overall run time was  3.118 seconds
sharpen_1core_003.out: Calculation time was  2.844 seconds
sharpen_1core_003.out: Overall run time was  3.096 seconds

In my case, the best performance (lowest timing) was from run number 3.
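With more output files, a quick way to pick out the minimum timing is to sort the grep output numerically on the timing field (a small sketch based on the output format above):

grep "Calculation time" *.out | sort -k5 -n | head -1
sharpen_1core_003.out: Calculation time was  2.844 seconds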

To make the process of extracting the timings and performance data from the sharpen output files easier for you we have written a small Python program: sharpen-data.py. This program takes the extension of the output files (“out” in our examples above), extracts the data required to compute performance (image size and timings) and saves them in a CSV (comma-separated values) file.

sharpen-data.py out
Cores        Size      Calc   Overall
    1    0.434280     2.844     3.096
    1    0.434280     2.857     3.118
    1    0.434280     2.882     3.182
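If you are curious how such an extraction might work, the following is a rough shell sketch that assumes the exact output format shown earlier - it is an illustration, not the actual implementation of sharpen-data.py:

# Pull the core count, image size (in Mpixels) and timings out of each
# output file and append them as rows of a CSV file.
echo "Cores,Size,Calc,Overall" > benchmark_runs.csv
for f in *.out
do
  awk '/running on/       {cores = $6}
       /Image size is/    {size = ($4 * $6) / 1000000}
       /Calculation time/ {calc = $4}
       /Overall run time/ {overall = $5}
       END {printf "%d,%f,%.3f,%.3f\n", cores, size, calc, overall}' "$f" >> benchmark_runs.csv
done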

Now we have our baseline data. Next, we need to collect data on how the timings and performance vary as the number of MPI processes we use for the calculation increases.

Collecting benchmarking data

Building on your experience so far, the next exercise is to collect the benchmark data we will analyse in the next section of the course.

Benchmarking the performance of Sharpen

Run a set of calculations to benchmark the performance of the sharpen-mpi.x program with the same input up to 2 full nodes (256 cores). Make sure you keep the program output in a suitable set of output files that you can use with the sharpen-data.py program. If you prefer, you can write a job submission script to run these benchmark calculations rather than using srun directly.

Solution

Doubling the number of MPI processes at each step is a useful starting point for a set of benchmark runs. Going from 1 MPI process up to 256 MPI processes, this gives runs at 2, 4, 8, 16, 32, 64, 128 and 256 MPI processes.
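At the command line, these runs could be automated with a simple loop - a sketch assuming the srunopt alias defined earlier and ARCHER2's 128-core nodes:

for n in 2 4 8 16 32 64 128 256
do
  # One node is enough for up to 128 processes; 256 processes need two
  nodes=$(( (n + 127) / 128 ))
  tpn=$(( n < 128 ? n : 128 ))
  srunopt --time=0:10:0 --nodes=${nodes} --ntasks-per-node=${tpn} sharpen-mpi.x > sharpen_${n}core_001.out
done

Repeating the loop with 001 changed to 002 and 003 gives three samples at each core count. Running sharpen-data.py out over the full set of output files then produces something like: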

Cores        Size      Calc   Overall
   16    0.434280     0.181     0.450
    2    0.434280     1.432     1.684
    4    0.434280     0.727     0.982
  256    0.434280     0.016     0.310
  128    0.434280     0.029     0.314
   64    0.434280     0.053     0.326
    1    0.434280     2.844     3.096
  256    0.434280     0.016     0.314
   32    0.434280     0.093     0.362
    8    0.434280     0.360     0.615
    8    0.434280     0.360     0.618
  128    0.434280     0.029     0.316
   64    0.434280     0.053     0.327
   32    0.434280     0.093     0.364
    2    0.434280     1.446     1.770
    2    0.434280     1.432     1.685
    4    0.434280     0.727     0.984
   16    0.434280     0.187     0.457
    1    0.434280     2.857     3.118
   32    0.434280     0.094     0.363
  256    0.434280     0.016     0.309
    8    0.434280     0.359     0.617
    4    0.434280     0.717     0.970
   16    0.434280     0.181     0.450
   64    0.434280     0.053     0.322
    1    0.434280     2.882     3.182
  128    0.434280     0.029     0.380

Aggregating data

Next we want to aggregate the data from our multiple runs at each core count - remember that, in this case, we want the minimum timing (which corresponds to the maximum performance) at each core count. To complete these steps we are going to make use of the VisiData tool, which allows us to manipulate and visualise tabular data in the terminal.

As well as printing the timing data to the screen, the sharpen-data.py program also produces a file called benchmark_runs.csv containing the same data in CSV format, which we can use with VisiData. Let's load the timing data into VisiData:

module load cray-python
module load visidata
vd benchmark_runs.csv
 Cores | Size     | Calc  | Overall ║
 16    | 0.434280 | 0.181 | 0.450   ║
 2     | 0.434280 | 1.432 | 1.684   ║
 4     | 0.434280 | 0.727 | 0.982   ║
 256   | 0.434280 | 0.016 | 0.310   ║
 128   | 0.434280 | 0.029 | 0.314   ║
 64    | 0.434280 | 0.053 | 0.326   ║
 1     | 0.434280 | 2.844 | 3.096   ║
 256   | 0.434280 | 0.016 | 0.314   ║
 32    | 0.434280 | 0.093 | 0.362   ║
 8     | 0.434280 | 0.360 | 0.615   ║
 8     | 0.434280 | 0.360 | 0.618   ║
 128   | 0.434280 | 0.029 | 0.316   ║
 64    | 0.434280 | 0.053 | 0.327   ║
 32    | 0.434280 | 0.093 | 0.364   ║
 2     | 0.434280 | 1.446 | 1.770   ║
 2     | 0.434280 | 1.432 | 1.685   ║
 4     | 0.434280 | 0.727 | 0.984   ║
 16    | 0.434280 | 0.187 | 0.457   ║
 1     | 0.434280 | 2.857 | 3.118   ║
 32    | 0.434280 | 0.094 | 0.363   ║
 256   | 0.434280 | 0.016 | 0.309   ║
1› benchmark_runs| user_macros | saul.pw/VisiData v2.1 | opening benchmark_runs.csv a           27 rows 

Your terminal will now show a spreadsheet interface with the timing data from your benchmark runs. You can navigate between different cells using the arrow keys on your keyboard or by using your mouse.

We are now going to use VisiData to aggregate the timing data from our runs. To do this, we first need to let the tool know what type of numerical data is in each of the columns.

Select the “Cores” column and hit # to set it as integer data; next select the “Size” column and hit % to set it as floating point data; do the same with % for the “Calc” and “Overall” columns. Your terminal should now look something like:

 Cores#| Size    %| Calc %| Overall%║
    16 |     0.43 |  0.18 |    0.45 ║
     2 |     0.43 |  1.43 |    1.68 ║
     4 |     0.43 |  0.73 |    0.98 ║
   256 |     0.43 |  0.02 |    0.31 ║
   128 |     0.43 |  0.03 |    0.31 ║
    64 |     0.43 |  0.05 |    0.33 ║
     1 |     0.43 |  2.84 |    3.10 ║
   256 |     0.43 |  0.02 |    0.31 ║
    32 |     0.43 |  0.09 |    0.36 ║
     8 |     0.43 |  0.36 |    0.61 ║
     8 |     0.43 |  0.36 |    0.62 ║
   128 |     0.43 |  0.03 |    0.32 ║
    64 |     0.43 |  0.05 |    0.33 ║
    32 |     0.43 |  0.09 |    0.36 ║
     2 |     0.43 |  1.45 |    1.77 ║
     2 |     0.43 |  1.43 |    1.69 ║
     4 |     0.43 |  0.73 |    0.98 ║
    16 |     0.43 |  0.19 |    0.46 ║
     1 |     0.43 |  2.86 |    3.12 ║
    32 |     0.43 |  0.09 |    0.36 ║
   256 |     0.43 |  0.02 |    0.31 ║
1› benchmark_runs|                                                         %  type-float        27 rows 

Next, we want to tell VisiData how to aggregate the data in each of the columns. Remember, we want the minimum value of the timings from the “Calc” and “Overall” columns. Highlight the “Calc” column and hit +, select min from the list of aggregators and press Return. Do the same for the “Overall” column. We also want to keep the size value in our aggregated table so select an aggregator for that column too (min or max are fine here as every row has the same value). Once you have set the aggregators, select the “Cores” column and hit “Shift+f” to perform the aggregation. You should see a new table that looks something like:

 Cores#║ count♯| Size_min%| Calc_min%| Overall_min%║
    16 ║     3 |     0.43 |     0.18 |        0.45 ║
     2 ║     3 |     0.43 |     1.43 |        1.68 ║
     4 ║     3 |     0.43 |     0.72 |        0.97 ║
   256 ║     3 |     0.43 |     0.02 |        0.31 ║
   128 ║     3 |     0.43 |     0.03 |        0.31 ║
    64 ║     3 |     0.43 |     0.05 |        0.32 ║
     1 ║     3 |     0.43 |     2.84 |        3.10 ║
    32 ║     3 |     0.43 |     0.09 |        0.36 ║
     8 ║     3 |     0.43 |     0.36 |        0.61 ║

2› benchmark_runs_Cores_freq|                                                        F           9 bins 

The final step in tidying up the aggregated data before we save it is to sort by increasing core count. Select the “Cores” column and hit [ to sort ascending.

Finally, we will save this aggregated data as another CSV file. Hit “Ctrl+s” and change the file name to benchmark_agg.csv. Now you can exit VisiData by typing gq. We will use this CSV file in the next section to compute the performance.
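If you prefer a scripted approach, a rough command-line equivalent of this aggregation (assuming the benchmark_runs.csv layout shown above) is:

( echo "Cores,count,Size_min,Calc_min,Overall_min"
  awk -F, 'NR > 1 {
             n[$1]++; size[$1] = $2
             if (!($1 in calc) || $3 < calc[$1]) calc[$1] = $3
             if (!($1 in ovr)  || $4 < ovr[$1])  ovr[$1] = $4
           }
           # Print one row per core count; sort puts them in ascending order
           END {for (c in n) printf "%s,%d,%f,%.3f,%.3f\n", c, n[c], size[c], calc[c], ovr[c]}' \
      benchmark_runs.csv | sort -t, -k1,1n ) > benchmark_agg.csv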

Computing performance

Load up VisiData again with the aggregate timing data:

vd benchmark_agg.csv
 Cores | count | Size_min | Calc_min | Overall_min ║
 1     | 3     | 0.43     | 2.84     | 3.10        ║
 2     | 3     | 0.43     | 1.43     | 1.68        ║
 4     | 3     | 0.43     | 0.72     | 0.97        ║
 8     | 3     | 0.43     | 0.36     | 0.61        ║
 16    | 3     | 0.43     | 0.18     | 0.45        ║
 32    | 3     | 0.43     | 0.09     | 0.36        ║
 64    | 3     | 0.43     | 0.05     | 0.32        ║
 128   | 3     | 0.43     | 0.03     | 0.31        ║
 256   | 3     | 0.43     | 0.02     | 0.31        ║

1› benchmark_agg| user_macros | saul.pw/VisiData v2.1 | opening benchmark_agg.csv as             9 rows 

Now we are going to create new columns containing the performance (in Mpixels/s), which we will use in the next section when we analyse the data.

The maximum calculation performance is computed as the size divided by the minimum calculation timing. We can use VisiData to compute this for us, but first we need to tell the tool what type of numerical data is in each column (as we did when aggregating the data; if we had not quit VisiData, we could have skipped this step). So, set the Cores column as integer data (using #) and the other columns as floating point data (using %).

Now, select the “Calc_min” column and hit =; VisiData then asks us for a formula to use to compute a new column. Enter Size_min / Calc_min. This should create a new column with the performance:

 Cores#| count | Size_min%| Calc_min%| Size_min/Calc_min  | Overall_min%║
     1 | 3     |     0.43 |     2.84 | 0.151408450704225…%|        3.10 ║
     2 | 3     |     0.43 |     1.43 | 0.300699300699300…%|        1.68 ║
     4 | 3     |     0.43 |     0.72 | 0.5972222222222222%|        0.97 ║
     8 | 3     |     0.43 |     0.36 | 1.1944444444444444%|        0.61 ║
    16 | 3     |     0.43 |     0.18 | 2.388888888888889 %|        0.45 ║
    32 | 3     |     0.43 |     0.09 | 4.777777777777778 %|        0.36 ║
    64 | 3     |     0.43 |     0.05 | 8.6               %|        0.32 ║
   128 | 3     |     0.43 |     0.03 | 14.333333333333334%|        0.31 ║
   256 | 3     |     0.43 |     0.02 | 21.5              %|        0.31 ║

1› benchmark_agg|                                        BUTTON1_RELEASED  release-mouse         9 rows 

Move to the new column and set it to floating point data. We can also rename the column to be more descriptive: hit ^ and give it the name Calc_perf_max.

 Cores#| count | Size_min%| Calc_min%| Calc_perf_max     %| Overall_min%║
     1 | 3     |     0.43 |     2.84 |               0.15 |        3.10 ║
     2 | 3     |     0.43 |     1.43 |               0.30 |        1.68 ║
     4 | 3     |     0.43 |     0.72 |               0.60 |        0.97 ║
     8 | 3     |     0.43 |     0.36 |               1.19 |        0.61 ║
    16 | 3     |     0.43 |     0.18 |               2.39 |        0.45 ║
    32 | 3     |     0.43 |     0.09 |               4.78 |        0.36 ║
    64 | 3     |     0.43 |     0.05 |               8.60 |        0.32 ║
   128 | 3     |     0.43 |     0.03 |              14.33 |        0.31 ║
   256 | 3     |     0.43 |     0.02 |              21.50 |        0.31 ║

1› benchmark_agg|                                        BUTTON1_RELEASED  release-mouse         9 rows 

Compute the Overall maximum performance at each core count

Add a column called Overall_perf_max that contains the Overall maximum performance at each core count and that is formatted as floating point values.

Solution

Select the “Overall_min” column and hit =; VisiData then asks us for a formula to use to compute a new column. Enter Size_min / Overall_min. Move to the new column and hit % to format it as floating point values, then hit ^ to rename it Overall_perf_max. The final table should look something like:

 Cores#| count | Size_min%| Calc_min%| Calc_perf_max     %| Overall_min%| Overall_perf_max    %║
     1 | 3     |     0.43 |     2.84 |               0.15 |        3.10 |                 0.14 ║
     2 | 3     |     0.43 |     1.43 |               0.30 |        1.68 |                 0.26 ║
     4 | 3     |     0.43 |     0.72 |               0.60 |        0.97 |                 0.44 ║
     8 | 3     |     0.43 |     0.36 |               1.19 |        0.61 |                 0.70 ║
    16 | 3     |     0.43 |     0.18 |               2.39 |        0.45 |                 0.96 ║
    32 | 3     |     0.43 |     0.09 |               4.78 |        0.36 |                 1.19 ║
    64 | 3     |     0.43 |     0.05 |               8.60 |        0.32 |                 1.34 ║
   128 | 3     |     0.43 |     0.03 |              14.33 |        0.31 |                 1.39 ║
   256 | 3     |     0.43 |     0.02 |              21.50 |        0.31 |                 1.39 ║

1› benchmark_agg| "Overall_perf_max"                                       ^  rename-col         9 rows 

Finally, save this data in a CSV file called benchmark_perf.csv and exit VisiData.
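As with the aggregation step, a scripted equivalent is also possible; for example, this sketch appends the two performance columns to the aggregated CSV (at the end of each row, rather than in the positions VisiData uses):

# Performance = Size_min divided by the Calc and Overall minimum timings
awk -F, 'NR == 1 {print $0 ",Calc_perf_max,Overall_perf_max"; next}
         {printf "%s,%.2f,%.2f\n", $0, $3/$4, $3/$5}' benchmark_agg.csv > benchmark_perf.csv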

Why do you want to use benchmarking?

Think about your use of HPC. What would you want to get out of benchmarking? Would measuring the minimum, maximum, mean or performance variation be appropriate for what you want to do and why?

What parameters would you be interested in varying for the application you want to benchmark?

Capturing run details

As with determining the correct performance measure (minimum, maximum or mean), the choice of which details of a benchmark run to capture depends on what you are measuring performance against and why. If the output does not include the values you require automatically then you should ensure they are captured in some other way.

Here are some concrete examples of scenarios and the information you should ensure is captured:

  • If you are varying the amount of resource used, capture the number of nodes, cores or MPI processes for each run - the sharpen output records the number of processes and the node used automatically.

  • If you are varying the input, capture the input parameters for each run - the sharpen output records the input file name, image size and filter size automatically.

As with all the aspects we have discussed so far, you should plan in advance which details need to be captured as part of your benchmarking activity, and how they will be captured, so that you do not need to re-run calculations because you lack a good record of the differences between runs.

Summary

In this section we have discussed:

  • what benchmarking is and why it is useful

  • key benchmarking terminology and concepts, including timings, performance metrics and baselines

  • how to choose the right timing and performance metrics and how to combine results from multiple runs

We also used a simple example application to allow us to collect some benchmark data on ARCHER2.

Now we have collected our benchmarking data we will turn to how to analyse the data, understand the performance and make decisions based on it.

Key Points

  • Different timing and performance metrics are used for different applications.

  • Use the lowest node/core count that is feasible for your baseline.

  • Plan your benchmarking before you start, make sure you understand which parameters you want to vary and why.