Running jobs on Tursa
As with most HPC services, Tursa uses a scheduler to manage access to resources and to ensure that the thousands of different users of the system can share it fairly and all get access to the resources they require. Tursa uses the Slurm software to schedule jobs.
Writing a submission script is typically the most convenient way to submit your job to the scheduler. Example submission scripts (with explanations) for the most common job types are provided below.
Hint
If you have any questions on how to run jobs on Tursa do not hesitate to contact the DiRAC Service Desk.
You typically interact with Slurm by issuing Slurm commands from the login nodes (to submit, check and cancel jobs), and by specifying Slurm directives that describe the resources required for your jobs in job submission scripts.
Resources
GPUh
Time used on Tursa nodes is measured in GPUh.
1 GPUh = 1 GPU for 1 hour. So a Tursa compute node with 4 GPUs would cost
4 GPUh per hour.
Note
The minimum resource request on Tursa is one full node which is charged at a rate of 4 GPUh per hour.
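For example, a job that runs on 2 GPU nodes (8 GPUs in total) for 3 hours would be charged 8 GPUs × 3 hours = 24 GPUh.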
Checking available budget
You can check your remaining budget in SAFE: select Login accounts from the menu, then select the login account you want to query. Under Login account details you will see each of the budget codes you have access to listed (e.g. dp123 resources) and, under Resource Pool to the right of this, a note of the remaining budgets.
When logged in to the machine you can also use the command
sacctmgr show assoc where user=$LOGNAME format=account,user,maxtresmins%75
This will list all the budget codes that you have access to e.g.
Account User MaxTRESMins
---------- ---------- ---------------------------------------------------------------------------
t01 dc-user1 gres/cpu-low=0,gres/cpu-standard=0,gres/gpu-low=0
z01 dc-user1
This shows that dc-user1 is a member of the t01 and z01 budgets. However, the gres/cpu-low=0,gres/cpu-standard=0,gres/gpu-low=0 entry indicates that the t01 budget can only run GPU jobs in standard (charged) partitions: all other options are disabled, indicated by =0 for CPU standard, CPU low and GPU low. This user can also submit jobs to any partition using the z01 budget.
To see the number of coreh or GPUh remaining you must check in SAFE.
Charging
Jobs run on Tursa are charged for the time they use i.e. from the time the job begins to run until the time the job ends (not the full wall time requested).
Jobs are charged for the full number of nodes which are requested, even if they are not all used.
Charging takes place at the time the job ends, and the job is charged in full to the budget which is live at the end time.
Basic Slurm commands
There are four key commands used to interact with Slurm on the command line:

- sinfo - Get information on the partitions and resources available
- sbatch jobscript.slurm - Submit a job submission script (in this case called: jobscript.slurm) to the scheduler
- squeue - Get the current status of jobs submitted to the scheduler
- scancel 12345 - Cancel a job (in this case with the job ID 12345)
We cover each of these commands in more detail below.
sinfo: information on resources
sinfo
is used to query information about available resources and
partitions. Without any options, sinfo
lists the status of all
resources and partitions, e.g.
[dc-user1@tursa-login1 ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
cpu up 2-00:00:00 4 alloc tu-c0r0n[66-69]
cpu up 2-00:00:00 2 idle tu-c0r0n[70-71]
gpu up 2-00:00:00 1 plnd tu-c0r2n93
gpu up 2-00:00:00 11 drain tu-c0r0n75,tu-c0r5n[48,51,54,57],tu-c0r6n[48,51,54,57],tu-c0r7n[00,48]
gpu up 2-00:00:00 112 mix tu-c0r0n[00,03,06,09,12,15,18,21,24,27,30,33,36,39,42,45,72,87,90],tu-c0r1n[00,03,06,09,12,15,18,21,24,27,30,33,60,63,66,69,72,75,78,81,84,87,90,93],tu-c0r2n[00,03,06,09,12,15,18,21,24,27,30,33,60,63,66,69,72,75,78,81,84,87,90],tu-c0r3n[00,03,06,09,12,15,18,21,24,27,30,33,60,63,66,69,72,75,78,81,84,90,93],tu-c0r4n[00,03,06,09,12,15,18,21,24,27,30,33,60,63,66,69,72,75,81,84,87,90,93]
gpu up 2-00:00:00 56 resv tu-c0r0n93,tu-c0r4n78,tu-c0r5n[00,03,06,09,12,15,18,21,24,27,30,33,36,39,42,45],tu-c0r6n[00,03,06,09,12,15,18,21,24,27,30,33,36,39,42,45,60,63,66,69],tu-c0r7n[03,06,09,12,15,18,21,24,27,30,33,36,39,42,45,51,54,57]
gpu up 2-00:00:00 1 idle tu-c0r3n87
- alloc nodes are those that are running jobs
- idle nodes are empty
- drain, down, maint nodes are unavailable to users
- plnd nodes are reserved for future jobs
sbatch: submitting jobs
sbatch is used to submit a job script to the job submission system.
The script will typically contain one or more srun commands to launch
parallel tasks.
When you submit the job, the scheduler provides the job ID, which is used to identify this job in other Slurm commands and when looking at resource usage in SAFE.
sbatch test-job.slurm
Submitted batch job 12345
squeue: monitoring jobs
squeue
without any options or arguments shows the current status of
all jobs known to the scheduler. For example:
squeue
will list all jobs on Tursa.
The output of this is often large. You can restrict the
output to just your jobs by adding the --me
option:
squeue --me
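This produces output like the following (illustrative only; the jobs listed will be your own):

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             45678       gpu   my_job dc-user1  R    1:23:45      4 tu-c0r1n[00,03,06,09]
             45679       gpu   my_job dc-user1 PD       0:00      4 (Priority)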
scancel: deleting jobs
scancel is used to delete a job from the scheduler. If the job is
waiting to run it is simply cancelled; if it is a running job it is
stopped immediately. You need to provide the job ID of the job you wish
to cancel/stop. For example:
scancel 12345
will cancel (if waiting) or stop (if running) the job with ID 12345.
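Putting these together, a typical interactive workflow from the login nodes looks like the following sketch (the script name and job ID are placeholders):

sbatch myjob.slurm     # submit your job script; Slurm replies with the job ID
squeue --me            # check the state of your queued and running jobs
scancel 12345          # cancel the job with ID 12345 if it is no longer needed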
Resource Limits
The Tursa resource limits for any given job are covered by three separate attributes.
- The amount of primary resource you require, i.e., number of compute nodes.
- The partition that you want to use - this specifies the nodes that are eligible to run your job.
- The Quality of Service (QoS) that you want to use - this specifies the job limits that apply.
Primary resource
The primary resource you can request for your job is the compute node.
Information
The --exclusive
option is enforced on Tursa which means you will
always have access to all of the memory on the compute node regardless
of how many processes are actually running on the node.
Note
You will not generally have access to the full amount of memory resource on the node as some is retained for running the operating system and other system processes.
Partitions
On Tursa, compute nodes are grouped into partitions. You will have to
specify a partition using the --partition
option in your Slurm
submission script. The following table has a list of active partitions
on Tursa.
Partition | Description | Max nodes available |
---|---|---|
cpu | CPU nodes with 2 AMD EPYC 64-core processors | 6 |
gpu | GPU nodes with 2 AMD EPYC processors (16-core or 24-core) and NVIDIA A100 GPU × 4 (this includes both A100-40 and A100-80 GPUs) | 181 |
gpu-a100-40 | GPU nodes with 2 AMD EPYC 16-core processors and NVIDIA A100-40 GPU × 4 | 114 |
gpu-a100-80 | GPU nodes with 2 AMD EPYC 24-core processors (3 nodes have 2 AMD EPYC 16-core processors) and NVIDIA A100-80 GPU × 4 | 67 |
You can list the active partitions by running sinfo.
Tip
You may not have access to all the available partitions.
Quality of Service (QoS)
On Tursa, job limits are defined by the requested Quality of Service
(QoS), as specified by the --qos
Slurm directive. The following table
lists the active QoS on Tursa.
QoS | Max Nodes Per Job | Max Walltime | Queued | Running | Partition(s) | Notes |
---|---|---|---|---|---|---|
standard | 128 | 48 hrs | Max. 128 jobs per user | Max. 128 nodes per user, max. 32 jobs per user | gpu, gpu-a100-40, gpu-a100-80, cpu | Only job sizes that are powers of 2 nodes are allowed (i.e. 1, 2, 4, 8, 16, 32 nodes); only available when your budget is positive. |
low | 32 | 24 hrs | 4 | 4 | gpu, gpu-a100-40, gpu-a100-80, cpu | Only job sizes that are powers of 2 nodes are allowed (i.e. 1, 2, 4, 8, 16, 32 nodes); only available when your budget is zero or negative. |
high | 128 | 48 hrs | Max. 128 jobs per user | Max. 128 nodes per user, max. 32 jobs per user | gpu, gpu-a100-40, gpu-a100-80 | Only job sizes that are powers of 2 nodes are allowed (i.e. 1, 2, 4, 8, 16, 32 nodes); only available when you have access to a "dpXYZ-high" budget and that budget is positive. Only available to RAC projects. High priority jobs are prioritised above other jobs on the system. |
dev | 2 | 4 hrs | 2 | 1 | gpu-a100-40, gpu-a100-80 | For faster turnaround for development jobs and interactive sessions; only available when your budget is positive. The dev QoS includes 2x A100-40 GPU nodes and 3x A100-80 GPU nodes. |
You can find out the QoS that you can use by running the following command:
sacctmgr show assoc user=$USER cluster=tursa format=cluster,account,user,qos%50
As long as you have a positive budget, you should use the standard QoS. Once you have exhausted your budget you can use the low QoS to continue to run jobs at a lower priority than jobs in the standard QoS.
Hint
If you have needs which do not fit within the current QoS, please contact the Service Desk and we can discuss how to accommodate your requirements.
Important
Only job sizes that are powers of 2 nodes are allowed, i.e. 1, 2, 4, 8, 16, 32 nodes on the gpu partition and 1, 2, 4 nodes on the cpu partition. There is a discussion of why this is enforced in the Hardware section of the User Guide.
Priority
Job priority on Tursa depends on a number of different factors:
- The QoS your job has specified
- The amount of time you have been queuing for
- Your current fairshare factor
Each of these factors is normalised to a value between 0 and 1, multiplied by a weight, and the resulting values are combined to produce a priority for the job. The current job priority formula on Tursa is:
Priority = [10000 * P(QoS)] + [500 * P(Age)] + [300 * P(Fairshare)]
The priority factors are:
- P(QoS) - The QoS priority normalised to a value between 0 and 1. The maximum raw value is 10000 and the minimum is 0. The standard QoS has a value of 5000 and the low QoS a value of 1.
- P(Age) - The priority based on the job age normalised to a value between 0 and 1. The maximum raw value is 14 days (where P(Age) = 1).
- P(Fairshare) - The fairshare priority normalised to a value between 0 and 1. Your fairshare priority is determined by a combination of your budget code fairshare value and your user fairshare value within that budget code. The more use that the budget code you are using has made of the system recently relative to other budget codes on the system, the lower the budget code fairshare value will be; and the more use you have made of the system recently relative to other users within your budget code, the lower your user fairshare value will be. The decay half life for fairshare on Tursa is set to 14 days. More information on the Slurm fairshare algorithm.
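For example, a hypothetical job in the standard QoS (P(QoS) = 5000/10000 = 0.5) that has been queued for 7 days (P(Age) = 0.5) and has a fairshare factor of 0.5 would have:

Priority = [10000 * 0.5] + [500 * 0.5] + [300 * 0.5] = 5000 + 250 + 150 = 5400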
You can view the priorities for current queued jobs on the system with the sprio command:
[dc-user1@tursa-login1 ~]$ sprio
JOBID PARTITION PRIORITY SITE AGE FAIRSHARE QOS
43963 gpu 5055 0 51 5 5000
43975 gpu 5061 0 41 20 5000
43976 gpu 5061 0 41 20 5000
43982 gpu 5046 0 26 20 5000
43986 gpu 5011 0 6 5 5000
43996 gpu 5020 0 0 20 5000
43997 gpu 5020 0 0 20 5000
Troubleshooting
Slurm error messages
An incorrect submission will cause Slurm to return an error. Some common problems are listed below, with a suggestion about the likely cause:
- sbatch: unrecognized option <text>

  One of your options is invalid or has a typo. Run man sbatch for help.

- error: Batch job submission failed: No partition specified or system default partition

  A --partition= option is missing. You must specify the partition (see the list above), e.g. --partition=gpu.

- error: invalid partition specified: <partition>
  error: Batch job submission failed: Invalid partition name specified

  Check that the partition exists and that the spelling is correct.

- error: Batch job submission failed: Invalid account or account/partition combination specified

  This probably means an invalid account has been given. Check the --account= option against your valid accounts in SAFE.

- error: Batch job submission failed: Invalid qos specification

  A QoS option is either missing or invalid. Check that the script has a --qos= option and that it is a valid one from the table above. (Check the spelling of the QoS is correct.)

- error: Your job has no time specification (--time=)...

  Add an option of the form --time=hours:minutes:seconds to the submission script. E.g., --time=01:30:00 gives a time limit of 90 minutes.

- error: QOSMaxWallDurationPerJobLimit
  error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)

  The script has probably specified a time limit which is too long for the corresponding QoS. Check the requested walltime against the Max Walltime for the QoS in the table above (e.g. the limit for the dev QoS is 4 hours).
Slurm queued reasons
The squeue
command allows users to view information for jobs managed by Slurm. Jobs
typically go through the following states: PENDING, RUNNING, COMPLETING, and COMPLETED.
The first table provides a description of some job state codes. The second table provides a description
of the reasons that cause a job to be in a state.
Status | Code | Description |
---|---|---|
PENDING | PD | Job is awaiting resource allocation. |
RUNNING | R | Job currently has an allocation. |
SUSPENDED | S | Job has an allocation, but execution has been suspended and CPUs have been released for other jobs. |
COMPLETING | CG | Job is in the process of completing. Some processes on some nodes may still be active. |
COMPLETED | CD | Job has terminated all processes on all nodes with an exit code of zero. |
TIMEOUT | TO | Job terminated upon reaching its time limit. |
STOPPED | ST | Job has an allocation, but execution has been stopped with SIGSTOP signal. CPUS have been retained by this job. |
OUT_OF_MEMORY | OOM | Job experienced out of memory error. |
FAILED | F | Job terminated with non-zero exit code or other failure condition. |
NODE_FAIL | NF | Job terminated due to failure of one or more allocated nodes. |
CANCELLED | CA | Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated. |
For a full list, see Job State Codes.
Reason | Description |
---|---|
Priority | One or more higher priority jobs exist for this partition or advanced reservation. |
Resources | The job is waiting for resources to become available. |
BadConstraints | The job's constraints can not be satisfied. |
BeginTime | The job's earliest start time has not yet been reached. |
Dependency | This job is waiting for a dependent job to complete. |
Licenses | The job is waiting for a license. |
WaitingForScheduling | No reason has been set for this job yet. Waiting for the scheduler to determine the appropriate reason. |
Prolog | Its PrologSlurmctld program is still running. |
JobHeldAdmin | The job is held by a system administrator. |
JobHeldUser | The job is held by the user. |
JobLaunchFailure | The job could not be launched. This may be due to a file system problem, invalid program name, etc. |
NonZeroExitCode | The job terminated with a non-zero exit code. |
InvalidAccount | The job's account is invalid. |
InvalidQOS | The job's QOS is invalid. |
QOSUsageThreshold | Required QOS threshold has been breached. |
QOSJobLimit | The job's QOS has reached its maximum job count. |
QOSResourceLimit | The job's QOS has reached some resource limit. |
QOSTimeLimit | The job's QOS has reached its time limit. |
NodeDown | A node required by the job is down. |
TimeLimit | The job exhausted its time limit. |
ReqNodeNotAvail | Some node specifically required by the job is not currently available. The node may currently be in use, reserved for another job, in an advanced reservation, DOWN, DRAINED, or not responding. Nodes which are DOWN, DRAINED, or not responding will be identified as part of the job's "reason" field as "UnavailableNodes". Such nodes will typically require the intervention of a system administrator to make available. |
For a full list, see Job Reasons.
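To see the state and reason for your own queued jobs at a glance, you can ask squeue for a custom output format, for example (a sketch; adjust the fields and widths to suit):

squeue --me --format="%.10i %.10P %.8q %.3t %.10M %.6D %R"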
Output from Slurm jobs
Slurm places standard output (STDOUT) and standard error (STDERR) for
each job in the file slurm-<JobID>.out. This file appears in the job's
working directory once your job starts running.
Hint
Output may be buffered - to enable live output, e.g. for monitoring
job status, add --unbuffered
to the srun
command in your SLURM
script.
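For example (a sketch based on the GPU job launch line shown later in this page):

srun --unbuffered --nodes=4 --ntasks-per-node=4 --cpus-per-task=8 \
    --hint=nomultithread --distribution=block:block \
    gpu_launch.sh ${application} ${options}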
Specifying resources in job scripts
You specify the resources you require for your job using directives at
the top of your job submission script, on lines that start with
#SBATCH.
Important
You should always ask for the full node resources in your sbatch options
with tasks-per-node
equal to the number of CPU cores on the node and
cpus-per-task
equal to 1 and then specify the process and thread
pinning using srun
options. You cannot specify process/thread pinning
options in sbatch
options - if you try to do so you will run into
problems when launching the parallel executable using srun
.
If you do not specify any options, then the default for each option will be applied. As a minimum, all job submissions must specify the budget that they wish to charge the job to with the option:

- --account=<budgetID> your budget ID is usually something like t01 or t01-test. You can see which budget codes you can charge to in SAFE.
Other common options that are used are:

- --time=<hh:mm:ss> the maximum walltime for your job, e.g. for a 6.5 hour walltime, you would use --time=6:30:0.
- --job-name=<jobname> set a name for the job to help identify it in Slurm command output (e.g. squeue).
To prevent the behaviour of batch scripts being dependent on the user environment at the point of submission, the option

- --export=none prevents the user environment from being exported to the batch system.

Using --export=none means that the behaviour of batch submissions should be repeatable. We strongly recommend its use, although see the following section to enable access to the usual modules.
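Putting these options together, the top of a job submission script might start with something like the following sketch (the job name, walltime and budget code are placeholders you must replace):

#!/bin/bash
#SBATCH --job-name=my_job       # placeholder name to identify the job
#SBATCH --time=6:30:0           # maximum walltime (6.5 hours in this example)
#SBATCH --account=t01           # replace with your own budget code
#SBATCH --export=none           # do not export your login environment to the job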
Resources for GPU jobs
In addition, parallel GPU jobs will also need to specify how many nodes, parallel processes and threads they require.
- --nodes=<nodes> the number of nodes to use for the job.
- --tasks-per-node=<processes per node> this should be set to either 32 (A100-40 or unspecified nodes) or 48 (A100-80 nodes)
- --cpus-per-task=1 this should always be set to 1
- --gres=gpu:4 the number of GPUs to use per node. This will almost always be 4 to use all the GPUs on a node.
If you are happy to have any GPU type for your job (A100-40 or A100-80) then you select the gpu partition:

- --partition=gpu

If you wish to use just the A100-80 GPU nodes, which have higher memory, you add the following option:

- --partition=gpu-a100-80 request that the job is placed on nodes with high-memory (80 GB) GPUs with 48 cores per node - there are 64 high memory GPU nodes on the system.

To just use the A100-40 GPU nodes:

- --partition=gpu-a100-40 request that the job is placed on nodes with standard-memory (40 GB) GPUs with 32 cores per node.

If you do not specify a partition, the scheduler may use any available node type for the job (equivalent to --partition=gpu).
Note
For parallel jobs, Tursa operates in a node exclusive way. This means that you are assigned resources in the units of full compute nodes for your jobs (i.e. 32 cores and 4 GPU on GPU A100-40 nodes, 48 cores and 4 GPU on A100-80 nodes) and that no other user can share those compute nodes with you. Hence, the minimum amount of resource you can request for a parallel job is 1 node (or 32 cores and 4 GPU on GPU A100-40 nodes, 48 cores and 4 GPU on A100-80 nodes).
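For example, a sketch of the resource request directives for a 2-node GPU job that can run on any GPU node type (the node count is illustrative):

#SBATCH --partition=gpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32    # 32 cores per node fits any GPU node type; use 48 with gpu-a100-80
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:4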
GPU frequency
Important
The default GPU frequency on Tursa compute nodes was changed from 1410 MHz to 1040 MHz on Thursday 15 Dec 2022 to improve the energy efficiency of the service.
Users can control the GPU frequency in their job submission scripts:
- --gpu-freq=<desired GPU freq in MHz> allows users to set the GPU frequency on a per-job basis. The frequency can be set in the range 210 - 1410 MHz in steps of 15 MHz.
Bug
When setting the GPU frequency you will see an error in the output from the job
that says control disabled
. This is an incorrect message due to an issue with
how Slurm sets the GPU frequency and can be safely ignored.
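For example, to request the maximum GPU frequency you could add the following directive to your submission script:

#SBATCH --gpu-freq=1410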
Resources for CPU jobs
Parallel CPU node jobs are specified in a similar way to GPU jobs:

- --nodes=<nodes> the number of nodes to use for the job.
- --tasks-per-node=128 this should always be set to 128
- --cpus-per-task=1 this should always be set to 1
- --partition=cpu this will always be set to cpu for CPU jobs
Note
For parallel jobs, Tursa operates in a node exclusive way. This means that you are assigned resources in the units of full compute nodes for your jobs (i.e. 128 cores on CPU nodes) and that no other user can share those compute nodes with you. Hence, the minimum amount of resource you can request for a parallel job is 1 node (128 cores on CPU nodes).
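For example, a sketch of the resource request directives for a 2-node CPU job (the node count is illustrative):

#SBATCH --partition=cpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=128
#SBATCH --cpus-per-task=1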
srun: Launching parallel jobs
Important
Only OpenMPI 4.1.5 and later versions support parallel job launch
using srun on Tursa. If you try to use older versions of OpenMPI via modules
on Tursa you will see errors when using srun. The old modules are only kept
for backwards compatibility; you should always compile and run software using
OpenMPI 4.1.5 or newer if possible.
If you are running parallel jobs, your job submission script should contain one or
more srun commands to launch the parallel executable across the compute nodes. In
most cases you will want to add the following options to srun
:
- --nodes=[number of nodes] - Set the number of compute nodes for this job step
- --tasks-per-node=[MPI processes per node] - This will usually be 4 for GPU jobs as you usually have 1 MPI process per GPU
- --cpus-per-task=[stride between MPI processes] - This will usually be either 8 (for A100-40 nodes) or 12 (for A100-80 nodes). If you are using the gpu partition where you can get any type of GPU node, you will usually set this to 8.
- --hint=nomultithread - do not use hyperthreads/SMT
- --distribution=block:block - the first block means use a block distribution of processes across nodes (i.e. fill nodes before moving onto the next one) and the second block means use a block distribution of processes across "sockets" within a node (i.e. fill a "socket" before moving on to the next one).
Important
The Slurm definition of a "socket" does not usually correspond to a physical CPU socket. On Tursa GPU nodes it corresponds to half the cores on a socket as the GPU nodes are configured with NPS2.
On the Tursa CPU nodes, the Slurm definition of a socket does correspond to a physical CPU socket (64 cores) as the CPU nodes are configured with NPS1.
Example job submission scripts
The typical strategy for submitting jobs on Tursa is for the batch script to
request full nodes with no process/thread pinning and then the individual
srun
commands set the correct options for dividing up processes and threads
across nodes.
Example: job submission script for a parallel GPU job
A job submission script for a parallel job that uses 4 compute nodes, 4 MPI processes per node and 4 GPUs per node. It does not restrict what type of GPU the job can run on so both A100-40 and A100-80 can be used.
#!/bin/bash
# Slurm job options
#SBATCH --job-name=Example_GPU_MPI_job
#SBATCH --time=12:0:0
#SBATCH --partition=gpu
#SBATCH --qos=standard
# Replace [budget code] below with your budget code (e.g. t01)
#SBATCH --account=[budget code]
# Request the right number of full nodes (32 cores per node fits any GPU compute node)
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=32
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:4
# Load the correct modules
module load gcc/9.3.0
module load cuda/12.3
module load openmpi/4.1.5-cuda12.3
export OMP_NUM_THREADS=8
export OMP_PLACES=cores
# These will need to be changed to match the actual application you are running
application="my_mpi_openmp_app.x"
options="arg 1 arg2"
# We have reserved the full nodes, now distribute the processes as
# required: 4 MPI processes per node, stride of 8 cores between
# MPI processes
#
# Note use of gpu_launch.sh wrapper script for GPU and NIC pinning
srun --nodes=4 --ntasks-per-node=4 --cpus-per-task=8 \
--hint=nomultithread --distribution=block:block \
gpu_launch.sh \
${application} ${options}
This will run your executable "my_mpi_openmp_app.x" in parallel using 16 MPI processes on 4 nodes. 4 GPUs will be used per node.
Important
You must use the gpu_launch.sh wrapper script to get the correct binding
of GPU to MPI processes and of network interface to GPU and MPI process.
This script is described in more detail below.
gpu_launch.sh wrapper script
The gpu_launch.sh
wrapper script is required to set the correct binding of
GPU to MPI processes and the correct binding of interconnect interfaces to
MPI process and GPU. We provide this centrally for convenience but its contents
are simple:
#!/bin/bash
# Compute the raw process ID for binding to GPU and NIC
lrank=$((SLURM_PROCID % SLURM_NTASKS_PER_NODE))
# Bind the process to the correct GPU and NIC
export CUDA_VISIBLE_DEVICES=${lrank}
export UCX_NET_DEVICES=mlx5_${lrank}:1
$@
Example: job submission script for a parallel CPU job
A job submission script for a parallel job that uses 1 compute node, 128 MPI processes per node.
#!/bin/bash
# Slurm job options
#SBATCH --job-name=Example_CPU_MPI_job
#SBATCH --time=1:0:0
#SBATCH --partition=cpu
#SBATCH --qos=standard
# Replace [budget code] below with your budget code (e.g. t01)
#SBATCH --account=[budget code]
# Request the right number of full nodes (128 cores per node on CPU nodes)
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=128
#SBATCH --cpus-per-task=1
# Load the correct modules
module load gcc/9.3.0
module load openmpi/4.1.5
export OMP_NUM_THREADS=1
export OMP_PLACES=cores
# These will need to be changed to match the actual application you are running
application="my_mpi_app.x"
options="arg 1 arg2"
# We have reserved the full nodes, now distribute the processes as
# required: 128 MPI processes per node
srun --nodes=1 --ntasks-per-node=128 --cpus-per-task=1 \
--hint=nomultithread --distribution=block:block \
${application} ${options}
This will run your executable "my_mpi_app.x" in parallel using 128 MPI processes on 1 node.
Using the dev QoS for short GPU jobs
The dev
QoS is designed for faster turnaround of short GPU jobs than is usually available through
the production QoS. It is subject to a number of restrictions:
- 4 hour maximum walltime
- Maximum job size:
    - 2 nodes for the gpu-a100-80 partition
    - 1 node for the gpu-a100-40 partition
- Maximum 1 job running per user
- Maximum 2 jobs queued per user
- Only available to projects with a positive budget
In addition, you must specify either the gpu-a100-80 or gpu-a100-40 partition when using the dev QoS.
Tip
The generic gpu partition will not work consistently when using the dev QoS.
Here is an example job submission script for a 2-node job in the dev
QoS using the gpu-a100-80
partition. Note the use of the gpu_launch.sh
wrapper script to get correct GPU and NIC
binding.
#!/bin/bash
# Slurm job options
#SBATCH --job-name=Example_MPI_job
#SBATCH --time=4:0:0
#SBATCH --partition=gpu-a100-80
#SBATCH --qos=dev
# Replace [budget code] below with your budget code (e.g. t01)
#SBATCH --account=[budget code]
# Request the right number of full nodes (48 cores per node for A100-80 GPU nodes)
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=48
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:4
export OMP_NUM_THREADS=1
export OMP_PLACES=cores
# Load the correct modules
module load gcc/9.3.0
module load cuda/12.3
module load openmpi/4.1.5-cuda12.3
# These will need to be changed to match the actual application you are running
application="my_mpi_openmp_app.x"
options="arg 1 arg2"
# We have reserved the full nodes, now distribute the processes as
# required: 4 MPI processes per node, stride of 12 cores between
# MPI processes
#
# Note use of gpu_launch.sh wrapper script for GPU and NIC pinning
srun --nodes=2 --ntasks-per-node=4 --cpus-per-task=12 \
--hint=nomultithread --distribution=block:block \
gpu_launch.sh \
${application} ${options}