Running jobs on Tursa
As with most HPC services, Tursa uses a scheduler to manage access to resources and to ensure that the thousands of different users of the system are able to share the system and all get access to the resources they require. Tursa uses the Slurm software to schedule jobs.
Writing a submission script is typically the most convenient way to submit your job to the scheduler. Example submission scripts (with explanations) for the most common job types are provided below.
Interactive jobs are also available and can be particularly useful for developing and debugging applications. More details are available below.
Hint
If you have any questions on how to run jobs on Tursa do not hesitate to contact the DiRAC Service Desk.
You typically interact with Slurm by issuing Slurm commands from the login nodes (to submit, check and cancel jobs), and by specifying Slurm directives that describe the resources required for your jobs in job submission scripts.
Resources
GPUh
Time used on Tursa nodes is measured in GPUh.
1 GPUh = 1 GPU for 1 hour. So a Tursa compute node with 4 GPUs would cost
4 GPUh per hour.
Note
The minimum resource request on Tursa is one full node which is charged at a rate of 4 GPUh per hour.
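For example, a job that runs on 4 GPU nodes for 12 hours is charged 4 nodes × 4 GPUs × 12 hours = 192 GPUh, regardless of how much of each node is actually used.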
Checking available budget
You can check your remaining budget in SAFE: select Login accounts from the menu, then select the login account you want to query. Under Login account details you will see each of the budget codes you have access to listed (e.g. e123 resources) and, under Resource Pool to the right of this, a note of the remaining budgets.
When logged in to the machine you can also use the command
sacctmgr show assoc where user=$LOGNAME format=account,user,maxtresmins
This will list all the budget codes that you have access to e.g.
Account User MaxTRESMins
---------- ---------- -------------
e123 userx cpu=0
e123-test userx
This shows that userx is a member of budgets e123 and e123-test. However, the cpu=0 indicates that the e123 budget is empty or disabled. This user can submit jobs using the e123-test budget.
To see the number of coreh or GPUh remaining you must check in SAFE.
Charging
Jobs run on Tursa are charged for the time they use i.e. from the time the job begins to run until the time the job ends (not the full wall time requested).
Jobs are charged for the full number of nodes which are requested, even if they are not all used.
Charging takes place at the time the job ends, and the job is charged in full to the budget which is live at the end time.
Basic Slurm commands
There are four key commands used to interact with Slurm on the command line:
- sinfo - Get information on the partitions and resources available
- sbatch jobscript.slurm - Submit a job submission script (in this case called: jobscript.slurm) to the scheduler
- squeue - Get the current status of jobs submitted to the scheduler
- scancel 12345 - Cancel a job (in this case with the job ID 12345)
We cover each of these commands in more detail below.
sinfo: information on resources
sinfo
is used to query information about available resources and
partitions. Without any options, sinfo
lists the status of all
resources and partitions, e.g.
[dc-user1@tursa-login1 ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
cpu up infinite 6 idle tu-c0r0n[66-71]
gpu* up infinite 2 drain* tu-c0r0n[27,30]
gpu* up infinite 1 down* tu-c0r4n33
gpu* up infinite 1 drain tu-c0r3n81
gpu* up infinite 58 alloc tu-c0r0n[00,03,06,09,12,15,18,21,24,33,36,39,42,45],tu-c0r1n[24,27,30,33,60,63,66,69,72,75,78,81,84,87,90,93],tu-c0r2n[24,27,30,33,60,63,66,69],tu-c0r3n[03,72,75,78,84,87,90,93],tu-c0r4n[00,03,06,12,15,24,27,30,60,63,66,69]
gpu* up infinite 50 idle tu-c0r1n[00,03,06,09,12,15,18,21],tu-c0r2n[00,03,06,09,12,15,18,21,72,75,78,81,84,87,90,93],tu-c0r3n[00,06,09,12,15,18,21,24,27,30,33,60,63,66,69],tu-c0r4n[09,18,21,72,75,78,81,84,87,90,93]
rack0 up infinite 2 drain* tu-c0r0n[27,30]
rack0 up infinite 14 alloc tu-c0r0n[00,03,06,09,12,15,18,21,24,33,36,39,42,45]
rack1 up infinite 16 alloc tu-c0r1n[24,27,30,33,60,63,66,69,72,75,78,81,84,87,90,93]
rack1 up infinite 8 idle tu-c0r1n[00,03,06,09,12,15,18,21]
rack2 up infinite 8 alloc tu-c0r2n[24,27,30,33,60,63,66,69]
rack2 up infinite 16 idle tu-c0r2n[00,03,06,09,12,15,18,21,72,75,78,81,84,87,90,93]
rack3 up infinite 1 drain tu-c0r3n81
rack3 up infinite 8 alloc tu-c0r3n[03,72,75,78,84,87,90,93]
rack3 up infinite 15 idle tu-c0r3n[00,06,09,12,15,18,21,24,27,30,33,60,63,66,69]
rack4 up infinite 1 down* tu-c0r4n33
rack4 up infinite 12 alloc tu-c0r4n[00,03,06,12,15,24,27,30,60,63,66,69]
rack4 up infinite 11 idle tu-c0r4n[09,18,21,72,75,78,81,84,87,90,93]
- alloc nodes are those that are running jobs
- idle nodes are empty
- drain, down, maint nodes are unavailable to users
sbatch: submitting jobs
sbatch
is used to submit a job script to the job submission system.
The script will typically contain one or more srun
commands to launch
parallel tasks.
When you submit the job, the scheduler provides the job ID, which is used to identify this job in other Slurm commands and when looking at resource usage in SAFE.
sbatch test-job.slurm
Submitted batch job 12345
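If you want to capture the job ID for use in scripts, sbatch also accepts the --parsable option, which prints just the ID. A minimal sketch (the script name is illustrative):
jobid=$(sbatch --parsable test-job.slurm)
echo "Submitted job ${jobid}"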
squeue: monitoring jobs
squeue
without any options or arguments shows the current status of
all jobs known to the scheduler. For example:
squeue
will list all jobs on Tursa.
The output of this is often large. You can restrict the
output to just your jobs by adding the -u $USER
option:
squeue -u $USER
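You can also query a specific job by its ID and, for a pending job, ask for an estimated start time with the --start option; for example (the job ID is illustrative):
squeue -j 12345 --start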
scancel: deleting jobs
scancel is used to delete a job from the scheduler. If the job is waiting to run it is simply cancelled; if it is a running job then it is stopped immediately. You need to provide the job ID of the job you wish to cancel/stop. For example:
scancel 12345
will cancel (if waiting) or stop (if running) the job with ID 12345.
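scancel can also select jobs by user rather than by job ID; for example, to cancel all of your own queued and running jobs:
scancel -u $USER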
Resource Limits
The Tursa resource limits for any given job are covered by three separate attributes.
- The amount of primary resource you require, i.e., number of compute nodes.
- The partition that you want to use - this specifies the nodes that are eligible to run your job.
- The Quality of Service (QoS) that you want to use - this specifies the job limits that apply.
Primary resource
The primary resource you can request for your job is the compute node.
Information
The --exclusive
option is enforced on Tursa which means you will
always have access to all of the memory on the compute node regardless
of how many processes are actually running on the node.
Note
You will not generally have access to the full amount of memory resource on the node as some is retained for running the operating system and other system processes.
Partitions
On Tursa, compute nodes are grouped into partitions. You will have to
specify a partition using the --partition
option in your Slurm
submission script. The following table has a list of active partitions
on Tursa.
Partition | Description | Max nodes available |
---|---|---|
cpu | CPU nodes with AMD EPYC 32-core processor × 2 | 6 |
gpu | GPU nodes with AMD EPYC 32-core processor and NVIDIA A100 GPU × 4 | 114 |
You can list the active partitions by running sinfo.
Tip
You may not have access to all the available partitions.
Quality of Service (QoS)
On Tursa, job limits are defined by the requested Quality of Service
(QoS), as specified by the --qos
Slurm directive. The following table
lists the active QoS on Tursa.
QoS | Max Nodes Per Job | Max Walltime | Jobs Queued | Jobs Running | Partition(s) | Notes |
---|---|---|---|---|---|---|
standard | 64 | 48 hrs | 32 | 16 | gpu, cpu | Only jobs sizes that are powers of 2 nodes are allowed (i.e. 1, 2, 4, 8, 16, 32, 64 nodes), only available when your budget is positive. |
low | 64 | 24 hrs | 4 | 4 | gpu, cpu | Only jobs sizes that are powers of 2 nodes are allowed (i.e. 1, 2, 4, 8, 16, 32, 64 nodes), only available when your budget is zero or negative |
You can find out the QoS that you can use by running the following command:
sacctmgr show assoc user=$USER cluster=tursa format=cluster,account,user,qos%50
As long as you have a positive budget, you should use the standard QoS. Once you have exhausted your budget you can use the low QoS to continue to run jobs at a lower priority than jobs in the standard QoS.
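In a job submission script the QoS is selected with the --qos directive; a minimal sketch (switch to the low QoS once your budget is exhausted):
#SBATCH --qos=standard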
Hint
If you have needs which do not fit within the current QoS, please contact the Service Desk and we can discuss how to accommodate your requirements.
Important
Only job sizes that are powers of 2 nodes are allowed, i.e. 1, 2, 4, 8, 16, 32, 64 nodes on the gpu partition and 1, 2, 4 nodes on the cpu partition.
Priority
Job priority on Tursa depends on a number of different factors:
- The QoS your job has specified
- The amount of time you have been queuing for
- Your current fairshare factor
Each of these factors is normalised to a value between 0 and 1, is multiplied with a weight and the resulting values combined to produce a priority for the job. The current job priority formula on Tursa is:
Priority = [10000 * P(QoS)] + [500 * P(Age)] + [300 * P(Fairshare)]
The priority factors are:
- P(QoS) - The QoS priority normalised to a value between 0 and 1. The maximum raw value is 10000 and the minimum is 0. The standard QoS has a value of 5000 and the low QoS a value of 1.
- P(Age) - The priority based on the job age normalised to a value between 0 and 1. The maximum raw value is 14 days (where P(Age) = 1).
- P(Fairshare) - The fairshare priority normalised to a value between 0 and 1. Your fairshare priority is determined by a combination of your budget code fairshare value and your user fairshare value within that budget code. The more use that the budget code you are using has made of the system recently relative to other budget codes on the system, the lower the budget code fairshare value will be; and the more use you have made of the system recently relative to other users within your budget code, the lower your user fairshare value will be. The decay half life for fairshare on Tursa is set to 14 days. More information on the Slurm fairshare algorithm.
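As a rough worked example (assuming the age factor grows linearly up to the 14 day cap): a job in the standard QoS (raw value 5000 out of a maximum of 10000, so P(QoS) = 0.5) that has been queued for 7 days (P(Age) = 0.5) and has a fairshare factor of 0.1 would have priority [10000 * 0.5] + [500 * 0.5] + [300 * 0.1] = 5000 + 250 + 30 = 5280.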
You can view the priorities for current queued jobs on the system with the sprio
command:
[dc-user1@tursa-login1 ~]$ sprio
JOBID PARTITION PRIORITY SITE AGE FAIRSHARE QOS
43963 gpu 5055 0 51 5 5000
43975 gpu 5061 0 41 20 5000
43976 gpu 5061 0 41 20 5000
43982 gpu 5046 0 26 20 5000
43986 gpu 5011 0 6 5 5000
43996 gpu 5020 0 0 20 5000
43997 gpu 5020 0 0 20 5000
Troubleshooting
Slurm error messages
An incorrect submission will cause Slurm to return an error. Some common problems are listed below, with a suggestion about the likely cause:
- sbatch: unrecognized option <text>
  One of your options is invalid or has a typo. Use man sbatch to check the available options.
- error: Batch job submission failed: No partition specified or system default partition
  A --partition= option is missing. You must specify a partition from the list above, e.g. --partition=gpu.
- error: invalid partition specified: <partition>
  error: Batch job submission failed: Invalid partition name specified
  Check the partition exists and check the spelling is correct.
- error: Batch job submission failed: Invalid account or account/partition combination specified
  This probably means an invalid account has been given. Check the --account= option against your valid accounts in SAFE.
- error: Batch job submission failed: Invalid qos specification
  A QoS option is either missing or invalid. Check the script has a --qos= option and that the option is a valid one from the table above. (Check the spelling of the QoS is correct.)
- error: Your job has no time specification (--time=)...
  Add an option of the form --time=hours:minutes:seconds to the submission script. E.g., --time=01:30:00 gives a time limit of 90 minutes.
- error: QOSMaxWallDurationPerJobLimit
  error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
  The script has probably specified a time limit which is too long for the corresponding QoS, e.g. longer than the 48 hrs maximum walltime of the standard QoS.
Slurm queued reasons
The squeue
command allows users to view information for jobs managed by Slurm. Jobs
typically go through the following states: PENDING, RUNNING, COMPLETING, and COMPLETED.
The first table provides a description of some job state codes. The second table provides a description
of the reasons that cause a job to be in a state.
Status | Code | Description |
---|---|---|
PENDING | PD | Job is awaiting resource allocation. |
RUNNING | R | Job currently has an allocation. |
SUSPENDED | S | Job has an allocation, but execution has been suspended and CPUs have been released for other jobs. |
COMPLETING | CG | Job is in the process of completing. Some processes on some nodes may still be active. |
COMPLETED | CD | Job has terminated all processes on all nodes with an exit code of zero. |
TIMEOUT | TO | Job terminated upon reaching its time limit. |
STOPPED | ST | Job has an allocation, but execution has been stopped with SIGSTOP signal. CPUS have been retained by this job. |
OUT_OF_MEMORY | OOM | Job experienced out of memory error. |
FAILED | F | Job terminated with non-zero exit code or other failure condition. |
NODE_FAIL | NF | Job terminated due to failure of one or more allocated nodes. |
CANCELLED | CA | Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated. |
For a full list, see Job State Codes.
Reason | Description |
---|---|
Priority | One or more higher priority jobs exist for this partition or advanced reservation. |
Resources | The job is waiting for resources to become available. |
BadConstraints | The job's constraints can not be satisfied. |
BeginTime | The job's earliest start time has not yet been reached. |
Dependency | This job is waiting for a dependent job to complete. |
Licenses | The job is waiting for a license. |
WaitingForScheduling | No reason has been set for this job yet. Waiting for the scheduler to determine the appropriate reason. |
Prolog | Its PrologSlurmctld program is still running. |
JobHeldAdmin | The job is held by a system administrator. |
JobHeldUser | The job is held by the user. |
JobLaunchFailure | The job could not be launched. This may be due to a file system problem, invalid program name, etc. |
NonZeroExitCode | The job terminated with a non-zero exit code. |
InvalidAccount | The job's account is invalid. |
InvalidQOS | The job's QOS is invalid. |
QOSUsageThreshold | Required QOS threshold has been breached. |
QOSJobLimit | The job's QOS has reached its maximum job count. |
QOSResourceLimit | The job's QOS has reached some resource limit. |
QOSTimeLimit | The job's QOS has reached its time limit. |
NodeDown | A node required by the job is down. |
TimeLimit | The job exhausted its time limit. |
ReqNodeNotAvail | Some node specifically required by the job is not currently available. The node may currently be in use, reserved for another job, in an advanced reservation, DOWN, DRAINED, or not responding. Nodes which are DOWN, DRAINED, or not responding will be identified as part of the job's "reason" field as "UnavailableNodes". Such nodes will typically require the intervention of a system administrator to make available. |
For a full list, see Job Reasons.
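To see the state and reason for your own jobs directly, you can ask squeue for those columns using its standard format codes (%t is the state, %r the reason); a sketch:
squeue -u $USER --format="%.10i %.20j %.4t %.20r"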
Output from Slurm jobs
Slurm places standard output (STDOUT) and standard error (STDERR) for each job in the file slurm-<JobID>.out. This file appears in the job's working directory once your job starts running.
Hint
Output may be buffered - to enable live output, e.g. for monitoring job status, add --unbuffered to the srun command in your Slurm script.
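For example, if your script launches the application with srun, the hint above corresponds to something like the following sketch (the executable name is illustrative):
srun --unbuffered ./my_app.x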
Specifying resources in job scripts
You specify the resources you require for your job using directives at the top of your job submission script, using lines that start with the directive #SBATCH.
Hint
Most options provided using #SBATCH
directives can also be specified as command line options to srun
.
If you do not specify any options, then the default for each option will be applied. As a minimum, all job submissions must specify the budget that they wish to charge the job to with the option:
--account=<budgetID>
your budget ID is usually something like t01 or t01-test. You can see which budget codes you can charge to in SAFE.
Other common options that are used are:
--time=<hh:mm:ss>
the maximum walltime for your job, e.g. for a 6.5 hour walltime you would use --time=6:30:0.
--job-name=<jobname>
set a name for the job to help identify it in, for example, squeue output.
In addition, parallel jobs will also need to specify how many nodes, parallel processes and threads they require.
--nodes=<nodes>
the number of nodes to use for the job.
--tasks-per-node=<processes per node>
the number of parallel processes (e.g. MPI ranks) per node. For Grid on GPU nodes this will typically be 4 to give 1 MPI process per GPU. The CPU nodes have 128 cores per node.
--cpus-per-task=<stride between processes>
for Grid jobs on GPU nodes where you typically use 1 MPI process per GPU (4 per node), this will usually be 8 (so that the 32 cores on a node are evenly divided between the 4 MPI processes).
--gres=gpu:4
the number of GPUs to use per node. This will almost always be 4 to use all GPUs on a node. (This option should not be specified for jobs on the CPU nodes.)
If you wish to use just the A100-80 GPU nodes which have higher memory, you add the following option:
--constraint=a100-80
request the job is placed on nodes with high-memory (80 GB) GPUs - there are only 3 high memory GPU nodes on the system.
To just use the A100-40 GPU nodes:
--constraint=a100-40
request the job is placed on nodes with standard memory (40 GB) GPUs.
If you do not set a constraint, the scheduler may use any available node types for the job.
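Putting these options together, the resource-request section of a GPU job script might look like the following sketch (the node count, walltime, budget code and constraint are placeholders you would adjust):
#SBATCH --nodes=2
#SBATCH --tasks-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:4
#SBATCH --time=06:00:00
#SBATCH --partition=gpu
#SBATCH --qos=standard
#SBATCH --constraint=a100-40
#SBATCH --account=t01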
Note
For parallel jobs, Tursa operates in a node exclusive way. This means that you are assigned resources in the units of full compute nodes for your jobs (i.e. 32 cores and 4 GPU on GPU nodes, 128 cores on CPU nodes) and that no other user can share those compute nodes with you. Hence, the minimum amount of resource you can request for a parallel job is 1 node (or 32 cores and 4 GPU on GPU nodes, 128 cores on CPU nodes).
To prevent the behaviour of batch scripts being dependent on the user environment at the point of submission, the option --export=none prevents the user environment from being exported to the batch system. Using --export=none means that the behaviour of batch submissions should be repeatable. We strongly recommend its use, although see the following section to enable access to the usual modules.
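As with the other options, this can be given as a directive in the job script itself, in which case you will need to load any required modules within the script; a minimal sketch:
#SBATCH --export=none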
GPU frequency
Important
The default GPU frequency on Tursa compute nodes was changed from 1410 MHz to 1040 MHz on Thursday 15 Dec 2022 to improve the energy efficiency of the service.
Users can control the GPU frequency in their job submission scripts:
--gpu-freq=<desired GPU freq in MHz>
allows users to set the GPU frequency on a per job basis. The frequency can be set in the range 210 - 1410 MHz in steps of 15 MHz.
Bug
When setting the GPU frequency you will see an error in the output from the job
that says control disabled
. This is an incorrect message due to an issue with
how Slurm sets the GPU frequency and can be safely ignored.
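For example, to request the previous default frequency of 1410 MHz for a job (a sketch; any value in the supported range and step size can be used):
#SBATCH --gpu-freq=1410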
mpirun: Launching parallel jobs
If you are running parallel jobs, your job submission script should contain one or more mpirun commands to launch the parallel executable across the compute nodes. You will usually add the following options to mpirun:
- -np <number of MPI processes> : specify the number of MPI processes to launch
- --map-by numa : map MPI processes by NUMA domain
- -x LD_LIBRARY_PATH : ensure that the library paths are available to MPI processes
- --bind-to none : do not bind processes to cores (binding is handled by the wrapper script described below)
Example job submission scripts
Example: job submission script for Grid parallel job using CUDA
A job submission script for a Grid job that uses 4 compute nodes, 4 MPI processes per node (16 in total) and 4 GPUs per node:
#!/bin/bash
# Slurm job options (job-name, compute nodes, job time)
#SBATCH --job-name=Example_Grid_job
#SBATCH --time=12:0:0
#SBATCH --nodes=4
#SBATCH --tasks-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:4
# Replace [budget code] below with your budget code (e.g. t01)
#SBATCH --account=[budget code]
# Load the correct modules
module load gcc/9.3.0
module load cuda/11.4.1
module load openmpi/4.1.1-cuda11.4
ACC_THREADS=8
export OMP_NUM_THREADS=8
# Settings for MPI performance
export OMPI_MCA_btl=^uct,openib
export UCX_TLS=rc,rc_x,sm,cuda_copy,cuda_ipc,gdr_copy
export UCX_RNDV_THRESH=16384
export UCX_RNDV_SCHEME=put_zcopy
export UCX_IB_GPU_DIRECT_RDMA=yes
export UCX_MEMTYPE_CACHE=n
export OMPI_MCA_io=romio321
export OMPI_MCA_btl_openib_allow_ib=true
export OMPI_MCA_btl_openib_device_type=infiniband
export OMPI_MCA_btl_openib_if_exclude=mlx5_1,mlx5_2,mlx5_3
# These will need to be changed to match the actual application you are running
application="my_mpi_openmp_app.x"
options="arg 1 arg2"
mpirun -np $SLURM_NTASKS --map-by numa -x LD_LIBRARY_PATH --bind-to none ./wrapper.sh ${application} ${options}
This will run your Grid executable in parallel using 16 MPI processes on 4 nodes; 8 OpenMP threads will be used per MPI process and 4 GPUs will be used per node (32 cores per node, 4 GPUs per node). Slurm will allocate 4 nodes to your job and mpirun will place 4 MPI processes on each node.
When running on Tursa it is important that we specify how each of the GPUs interacts with the network interfaces in order to reach optimal network communication performance. To achieve this, we introduce a wrapper script (specified as wrapper.sh in the example job script above) that sets a number of environment parameters for each rank on a node (one rank per GPU) to explicitly tell each rank which network interface it should use for inter-node communication.
Example wrapper.sh script:
#!/bin/bash
# Wrapper run by mpirun for each MPI rank: binds the rank to its GPU,
# network interface and NUMA memory domains.
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
# Each local rank uses two NUMA domains and one network adapter
numa1=$(( 2 * $lrank ))
numa2=$(( 2 * $lrank + 1 ))
netdev=mlx5_${lrank}:1
# Select the GPU and network interface for this rank
export CUDA_VISIBLE_DEVICES=$OMPI_COMM_WORLD_LOCAL_RANK
export UCX_NET_DEVICES=${netdev}
# Interleave memory allocations across this rank's two NUMA domains
BINDING="--interleave=$numa1,$numa2"
echo "`hostname` - $lrank device=$CUDA_VISIBLE_DEVICES binding=$BINDING"
# Launch the application with the chosen NUMA binding, passing through all arguments
numactl ${BINDING} "$@"
See above for a more detailed discussion of the different sbatch options.