Introduction to High-Performance Computing

Understanding what resources to use

Overview

Teaching: 30 min
Exercises: 15 min
Questions
  • How can I work out what resources to request in my HPC jobs?

  • What should I measure to understand if I am using the HPC system effectively?

Objectives
  • Understand key performance metrics and how to measure them.

  • Be able to choose resource requests in jobs based on performance metrics.

Once you have finished testing your batch job scripts to make sure they do what you expect (data files are in the right place, etc.), you want to submit them to run on your local HPC resource using the “scheduler”. However, you are not sure what resources you should request: specifically, what resources do you need initially, and what do you need for parallel applications?

Remember that the basic resources managed by the scheduler on an HPC system are the number of nodes (also referred to as “chunks” in PBSpro) and the walltime, which is how long you want to use them.

This leads us to a number of questions.

We will go through these questions in reverse order, and hopefully that will make things clearer!

Benchmarking

In HPC we love to benchmark everything, and not without good reason! Benchmarking gives you insight into how well something performs in a controlled environment. Within the scope of HPC Carpentry, you need a strong understanding of how your application performs (its scalability) on different architectures. Being able to estimate the average runtime for a single job will allow you to choose sensible resource requests.

Parallelism

The two methods for parallelising your applications are shared memory and distributed memory.

Distributed Memory

The distributed memory method, or Multiple Instruction, Multiple Data (MIMD), is for applications whose data does not fit into the memory of a single node. We split the data up across a large number of nodes, which is known as “domain decomposition”; the goal is to keep the data in memory rather than on disc. Rather than having a single node do all the work, we divide the work into smaller chunks, which requires some data movement: each node has to communicate with its “neighbor” nodes. This communication uses the MPI library.
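In a job script, an MPI application is usually started through a launcher such as mpirun or mpiexec, which starts one copy of the program per MPI task. A minimal sketch, where the module name and executable are placeholders and the exact launcher depends on your MPI installation:

    module load mpi                 # placeholder: load your site's MPI module
    mpirun -np 16 ./my_mpi_app      # start 16 MPI tasks across the nodes assigned to the job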

Shared Memory

The shared memory method, or Single Instruction, Multiple Data (SIMD), runs on a single node using multiple cores. This is the fork-join model of parallelism, and is what OpenMP provides. OpenMP consists of compiler directives that you add to your source code, typically to split the work done in loops across all the cores of the processor(s) within a single node.
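At run time the number of OpenMP threads is controlled by the OMP_NUM_THREADS environment variable. A minimal sketch, assuming an OpenMP executable called ./my_openmp_app on an 8-core node:

    export OMP_NUM_THREADS=8        # one thread per core on the node (assumption: 8 cores)
    ./my_openmp_app                 # placeholder executable built with OpenMP support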

Benchmarking checklist

To start you will need:

Sample Timing methods

The shell time utility (see man time):

    time ./dgemm_ex.exe

returns

    real	0m0.001s
    user	0m0.000s
    sys	0m0.000s
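The shell built-in only reports elapsed (real), user and system time. If GNU time is installed (often as /usr/bin/time), its -v flag also reports peak memory use, which is helpful when deciding how much memory to request:

    /usr/bin/time -v ./dgemm_ex.exe
    # look for "Maximum resident set size (kbytes)" in the output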

or build your own timer using the date command in your job script, such as this example for running the 3D animation software Maya:

    echo "start Render"
    start=`date +%s`
    Render -r $RENDER_METHOD  $SCENE -log output.log

    stop=`date +%s`
    echo "finish Render" 

    finish=$(($stop-$start))
    echo `date  +%c` >> ~/Work/maya/timings.log
    echo $SCENE was rendered in $finish seconds using $RENDER_METHOD >> ~/Work/maya/timings.log
    echo " " >> ~/Work/maya/timings.log

NOTE: This is nice as it automatically creates a separate timing log file of the runs, which I can easily reformat to plot.
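For example, a hypothetical one-liner that pulls the runtimes in seconds out of that log (assuming the scene name contains no spaces):

    grep "was rendered in" ~/Work/maya/timings.log | awk '{print $5}'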

What you are trying to understand is how to optimize your resource usage. Run a set of jobs where you increase the number of nodes (1, 2, 4, 8, 16, 32, …), or the chunk size, and measure how the runtime changes. Once you reach the point of diminishing returns you should have a feel for what size chunk you should be requesting, and be able to extrapolate what the runtime will be for your actual models.

It is also good practice to run the same benchmark three times, as this helps to ensure you have consistent performance data.
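One way to collect these runs is to submit the same job script repeatedly, overriding the select request on the qsub command line (a sketch: bench.pbs is a placeholder for your own job script, and the select syntax is covered in the next section):

    # hypothetical benchmarking sweep: 1 to 16 chunks, three repeats of each
    for nodes in 1 2 4 8 16; do
        for repeat in 1 2 3; do
            qsub -l select=${nodes}:ncpus=8:mpiprocs=8 bench.pbs
        done
    done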

PBSpro Resource

There are a lot of different PBSpro directives that you will need to understand, but here we will focus on the very basic “#PBS -l” resource requests used for benchmarking.

    #PBS -l walltime=hh:mm:ss    
    #PBS -l place=[arrangement][:sharing][:grouping]
    #PBS -l select=[N:][chunk specification][+[N:]chunk specification]

-l walltime

Ideally you have already run your application somewhere, such as on your laptop, and have some idea of how long a job takes. That is a suitable starting point for the walltime.

   #PBS -l walltime=01:00:00

-l place

   #PBS -l place=free:excl

The arrangement can matter, especially for MPI applications, due to inter-node communication, so you might want to experiment with scatter to see how it affects runtimes. When benchmarking you want exclusive access to the resource: your jobs may sit in the queue a bit longer, but you do not want any resource contention while benchmarking.
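For example, to spread the chunks across separate nodes while keeping the job exclusive:

   #PBS -l place=scatter:excl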

-l select

This is the complicated one, as it depends on your code; the best advice is to read the documentation for the HPC system you are using.
For simple MPI codes, to get 16 cores/MPI tasks:

   #PBS -l select=2:ncpus=8:mpiprocs=8

To get 24 cores/MPI tasks across two nodes:

   #PBS -l select=2:ncpus=12:mpiprocs=12

In both examples I have selected 2 chunks; the first has 8 mpiprocs per chunk, the second has 12. The ncpus requested should be equal to mpiprocs.
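In the job script, you can let the MPI launcher work out the task count from the node file that PBSpro provides (a sketch; the executable name is a placeholder):

    NP=$(wc -l < $PBS_NODEFILE)     # one line per MPI process: 16 in the first example, 24 in the second
    mpirun -np $NP ./my_mpi_app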

For simple OpenMP codes, to use 6 ompthreads:

    #PBS -l select=1:ncpus=6:ompthreads=6

For hybrid MPI/OpenMP codes:

   #PBS -l select=2:ncpus=12:mpiprocs=1:ompthreads=12

This example requests 2 chunks (nodes), each with 12 cores; each node runs a single MPI rank, and that rank spawns 12 OpenMP threads, one per core.
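In the job script you would then launch one MPI rank per chunk and let each rank spawn its threads; a minimal sketch (the executable name is a placeholder and the launcher behaviour depends on your MPI installation):

    export OMP_NUM_THREADS=12       # matches ompthreads=12: 12 threads per MPI rank
    mpirun -np 2 ./my_hybrid_app    # one MPI rank per chunk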

Benchmarking and scalability

With your runtimes collected you should be able to determine whether your application shows strong scaling or weak scaling. Strong-scaling (or CPU-bound) codes reduce the runtime for a fixed problem size as the number of cores/nodes increases.

Weak-scaling (or memory-bound) codes should show the same runtime when the model doubles in size and the number of nodes also doubles: the amount of work per node stays the same.
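A quick way to turn measured runtimes into scaling numbers is the speedup, T(1 node)/T(N nodes), and the parallel efficiency, speedup divided by N. A small sketch with made-up runtimes:

    t1=1200                                      # measured runtime on 1 node (seconds, made up)
    t4=380                                       # measured runtime on 4 nodes (seconds, made up)
    echo "speedup:    $(echo "$t1 / $t4" | bc -l)"
    echo "efficiency: $(echo "$t1 / (4 * $t4)" | bc -l)"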

See “Measuring Parallel Scaling Performance” from SHARCNET in Canada, which has a really nice, detailed description of “strong” and “weak” scaling.

REMEMBER

References

Key Points