Distributed training on multiple IPUs
In this tutorial, we will cover how to run larger models, including examples provided by Graphcore on https://github.com/graphcore/examples. These may require distributed training on multiple IPUs.
The number of IPUs requested must be a power of two, i.e. 1, 2, 4, 8, 16, 32, or 64.
First example
As an example, we will use 4 IPUs to perform the pre-training step of BERT, an NLP transformer model. The code is available from https://github.com/graphcore/examples/tree/master/nlp/bert/pytorch.
To get started, save the following .yaml file and use it to create an IPUJob:
apiVersion: graphcore.ai/v1alpha1
kind: IPUJob
metadata:
  generateName: bert-training-multi-ipu-
spec:
  jobInstances: 1
  ipusPerJobInstance: "4"
  workers:
    template:
      spec:
        containers:
        - name: bert-training-multi-ipu
          image: graphcore/pytorch:3.3.0
          command: [/bin/bash, -c, --]
          args:
            - |
              cd ;
              mkdir build;
              cd build ;
              git clone https://github.com/graphcore/examples.git;
              cd examples/nlp/bert/pytorch;
              apt update ;
              apt upgrade -y;
              DEBIAN_FRONTEND=noninteractive TZ='Europe/London' apt install $(< required_apt_packages.txt) -y ;
              pip3 install -r requirements.txt ;
              python3 run_pretraining.py --dataset generated --config pretrain_base_128_pod4 --training-steps 1
          resources:
            limits:
              cpu: 32
              memory: 200Gi
          securityContext:
            capabilities:
              add:
              - IPC_LOCK
          volumeMounts:
            - mountPath: /dev/shm
              name: devshm
        restartPolicy: Never
        hostIPC: true
        volumes:
          - emptyDir:
              medium: Memory
              sizeLimit: 10Gi
            name: devshm
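Because the manifest uses generateName, the job should be created with kubectl create rather than kubectl apply. A minimal launch sequence, assuming the file is saved as bert-training-multi-ipu.yaml (the filename is our choice) and that the IPU Operator exposes the resource as ipujobs, might look like:
kubectl create -f bert-training-multi-ipu.yaml
kubectl get ipujobs    # check the job status
kubectl get pods       # find the generated worker pod name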
Running the above IPUJob and querying the log via kubectl logs pod/bert-training-multi-ipu-<random string>-worker-0 should give:
...
Data loaded in 8.559805537108332 secs
-----------------------------------------------------------
-------------------- Device Allocation --------------------
Embedding --> IPU 0
Encoder 0 --> IPU 1
Encoder 1 --> IPU 1
Encoder 2 --> IPU 1
Encoder 3 --> IPU 1
Encoder 4 --> IPU 2
Encoder 5 --> IPU 2
Encoder 6 --> IPU 2
Encoder 7 --> IPU 2
Encoder 8 --> IPU 3
Encoder 9 --> IPU 3
Encoder 10 --> IPU 3
Encoder 11 --> IPU 3
Pooler --> IPU 0
Classifier --> IPU 0
-----------------------------------------------------------
---------- Compilation/Loading from Cache Started ---------
...
Graph compilation: 100%|██████████| 100/100 [08:02<00:00]
Compiled/Loaded model in 500.756152929971 secs
-----------------------------------------------------------
--------------------- Training Started --------------------
Step: 0 / 0 - LR: 0.00e+00 - total loss: 10.817 - mlm_loss: 10.386 - nsp_loss: 0.432 - mlm_acc: 0.000 % - nsp_acc: 1.000 %: 0%| | 0/1 [00:16<?, ?it/s, throughput: 4035.0 samples/sec]
-----------------------------------------------------------
-------------------- Training Metrics ---------------------
global_batch_size: 65536
device_iterations: 1
training_steps: 1
Training time: 16.245 secs
-----------------------------------------------------------
Details
In this example, we have requested 4 IPUs:
ipusPerJobInstance: "4"
The Python flag --config pretrain_base_128_pod4 uses one of the preset configurations for this model with 4 IPUs. Here we also use the --dataset generated flag to generate data rather than download the required dataset.
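For a slightly longer smoke test, the same command can be run with more training steps; the flags below are the same ones used in the manifest above, only the step count is changed as an illustration:
python3 run_pretraining.py --dataset generated --config pretrain_base_128_pod4 --training-steps 10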
To provide sufficient shared memory (shm) for the IPU pod, it may be necessary to mount /dev/shm as follows:
volumeMounts:
  - mountPath: /dev/shm
    name: devshm
volumes:
  - emptyDir:
      medium: Memory
      sizeLimit: 10Gi
    name: devshm
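To confirm the mount is in place inside a running worker pod, one quick check (the pod name here is illustrative) is:
kubectl exec -it bert-training-multi-ipu-<random string>-worker-0 -- df -h /dev/shm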
It is also required to set spec.hostIPC to true:
hostIPC: true
and to add a securityContext to the container definition that enables the IPC_LOCK capability:
securityContext:
  capabilities:
    add:
    - IPC_LOCK
Note: IPC_LOCK allows the RDMA software stack to use pinned memory, which is particularly useful for PyTorch dataloaders, as these can be very memory hungry. This is because all data going to the IPUs travels over the network interfaces (100Gbps Ethernet).
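One way to verify that the capability was granted is to inspect the effective capability set of the container's main process (a sanity check assuming the pod is running; the mask can be decoded with capsh --decode if available):
kubectl exec -it bert-training-multi-ipu-<random string>-worker-0 -- grep CapEff /proc/1/status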
Memory usage
In general, the graph compilation phase of running large models can require significant memory, while the execution phase requires far less.
In the example above, it is possible to explicitly request the memory via:
resources:
  limits:
    memory: "128Gi"
  requests:
    memory: "128Gi"
which will succeed. (The graph compilation fails if only 32Gi is requested.)
As a general guideline, 128GB of memory should be enough for the majority of tasks, and requirements rarely exceed 200GB even for jobs with a high IPU count. In the example .yaml file, we do not explicitly request the memory.
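If the metrics-server is available in the cluster, the actual memory consumption during compilation can be observed with (pod name illustrative):
kubectl top pod bert-training-multi-ipu-<random string>-worker-0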
Scaling up IPU count and using Poprun
In the example above, Python is launched directly in the pod. When scaling up the number of IPUs (e.g. above 8 IPUs), you may run into a CPU bottleneck. This may be observed when the throughput scales sub-linearly with the number of data-parallel replicas (i.e. when doubling the IPU count, the performance does not double). It can also be verified by profiling the application and observing a significant proportion of the runtime spent on the host CPU workload.
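A rough way to check for such a host-side bottleneck is to look at CPU usage inside the worker pod while the job is running; sustained usage near the container's CPU limit suggests the host side is the constraint (pod name illustrative):
kubectl exec -it bert-training-multi-ipu-<random string>-worker-0 -- top -bn1 | head -n 20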
In this case, Poprun can be used to launch multiple instances. As an example, we will save and run the following .yaml configuration:
apiVersion: graphcore.ai/v1alpha1
kind: IPUJob
metadata:
  generateName: bert-poprun-64ipus-
spec:
  jobInstances: 1
  modelReplicasPerWorker: "16"
  ipusPerJobInstance: "64"
  workers:
    template:
      spec:
        containers:
        - name: bert-poprun-64ipus
          image: graphcore/pytorch:3.3.0
          command: [/bin/bash, -c, --]
          args:
            - |
              cd ;
              mkdir build;
              cd build ;
              git clone https://github.com/graphcore/examples.git;
              cd examples/nlp/bert/pytorch;
              apt update ;
              apt upgrade -y;
              DEBIAN_FRONTEND=noninteractive TZ='Europe/London' apt install $(< required_apt_packages.txt) -y ;
              pip3 install -r requirements.txt ;
              OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1 OMPI_ALLOW_RUN_AS_ROOT=1 \
              poprun \
                --allow-run-as-root 1 \
                --vv \
                --num-instances 1 \
                --num-replicas 16 \
                --mpi-global-args="--tag-output" \
                --ipus-per-replica 4 \
                python3 run_pretraining.py \
                  --config pretrain_large_128_POD64 \
                  --dataset generated --training-steps 1
          resources:
            limits:
              cpu: 32
              memory: 200Gi
          securityContext:
            capabilities:
              add:
              - IPC_LOCK
          volumeMounts:
            - mountPath: /dev/shm
              name: devshm
        restartPolicy: Never
        hostIPC: true
        volumes:
          - emptyDir:
              medium: Memory
              sizeLimit: 10Gi
            name: devshm
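The job is created in the same way as before. Note that poprun splits the 64 IPUs requested by ipusPerJobInstance into 16 data-parallel replicas of 4 IPUs each, matching --num-replicas 16 and --ipus-per-replica 4. Assuming the manifest is saved as bert-poprun-64ipus.yaml (the filename is our choice):
kubectl create -f bert-poprun-64ipus.yaml
kubectl get pods    # find the generated worker pod name for the log command below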
Inspecting the log via kubectl logs <pod-name> should produce:
...
===========================================================================================
| poprun topology |
|===========================================================================================|
10:10:50.154 1 POPRUN [D] Done polling, final state of p-bert-poprun-64ipus-gc-dev-0: PS_ACTIVE
10:10:50.154 1 POPRUN [D] Target options from environment: {}
| hosts | localhost |
|-----------|-------------------------------------------------------------------------------|
| ILDs | 0 |
|-----------|-------------------------------------------------------------------------------|
| instances | 0 |
|-----------|-------------------------------------------------------------------------------|
| replicas | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
-------------------------------------------------------------------------------------------
10:10:50.154 1 POPRUN [D] Target options from V-IPU partition: {"ipuLinkDomainSize":"64","ipuLinkConfiguration":"slidingWindow","ipuLinkTopology":"torus","gatewayMode":"true","instanceSize":"64"}
10:10:50.154 1 POPRUN [D] Using target options: {"ipuLinkDomainSize":"64","ipuLinkConfiguration":"slidingWindow","ipuLinkTopology":"torus","gatewayMode":"true","instanceSize":"64"}
10:10:50.203 1 POPRUN [D] No hosts specified; ignoring host-subnet setting
10:10:50.203 1 POPRUN [D] Default network/RNIC for host communication: None
10:10:50.203 1 POPRUN [I] Running command: /opt/poplar/bin/mpirun '--tag-output' '--bind-to' 'none' '--tag-output'
'--allow-run-as-root' '-np' '1' '-x' 'POPDIST_NUM_TOTAL_REPLICAS=16' '-x' 'POPDIST_NUM_IPUS_PER_REPLICA=4' '-x'
'POPDIST_NUM_LOCAL_REPLICAS=16' '-x' 'POPDIST_UNIFORM_REPLICAS_PER_INSTANCE=1' '-x' 'POPDIST_REPLICA_INDEX_OFFSET=0' '-x'
'POPDIST_LOCAL_INSTANCE_INDEX=0' '-x' 'IPUOF_VIPU_API_HOST=10.21.21.129' '-x' 'IPUOF_VIPU_API_PORT=8090' '-x'
'IPUOF_VIPU_API_PARTITION_ID=p-bert-poprun-64ipus-gc-dev-0' '-x' 'IPUOF_VIPU_API_TIMEOUT=120' '-x' 'IPUOF_VIPU_API_GCD_ID=0'
'-x' 'IPUOF_LOG_LEVEL=WARN' '-x' 'PATH' '-x' 'LD_LIBRARY_PATH' '-x' 'PYTHONPATH' '-x' 'POPLAR_TARGET_OPTIONS=
{"ipuLinkDomainSize":"64","ipuLinkConfiguration":"slidingWindow","ipuLinkTopology":"torus","gatewayMode":"true",
"instanceSize":"64"}' 'python3' 'run_pretraining.py' '--config' 'pretrain_large_128_POD64' '--dataset' 'generated' '--training-steps' '1'
10:10:50.204 1 POPRUN [I] Waiting for mpirun (PID 4346)
[1,0]<stderr>: Registered metric hook: total_compiling_time with object: <function get_results_for_compile_time at 0x7fe0a6e8af70>
[1,0]<stderr>:Using config: pretrain_large_128_POD64
...
Graph compilation: 100%|██████████| 100/100 [10:11<00:00][1,0]<stderr>:
[1,0]<stderr>:Compiled/Loaded model in 683.6591004971415 secs
[1,0]<stderr>:-----------------------------------------------------------
[1,0]<stderr>:--------------------- Training Started --------------------
Step: 0 / 0 - LR: 0.00e+00 - total loss: 11.260 - mlm_loss: 10.397 - nsp_loss: 0.863 - mlm_acc: 0.000 % - nsp_acc: 0.052 %: 0%| | 0/1 [00:03<?, ?it/s, throughput: 17692.1 samples/sec][1,0]<stderr>:
[1,0]<stderr>:-----------------------------------------------------------
[1,0]<stderr>:-------------------- Training Metrics ---------------------
[1,0]<stderr>:global_batch_size: 65536
[1,0]<stderr>:device_iterations: 1
[1,0]<stderr>:training_steps: 1
[1,0]<stderr>:Training time: 3.718 secs
[1,0]<stderr>:-----------------------------------------------------------
Notes on using the examples repository
Graphcore provides examples of a variety of models on GitHub: https://github.com/graphcore/examples. When following the instructions there, note that since we are using a container within a Kubernetes pod, there is no need to enable the Poplar/PopART SDK, set up a Python virtual environment, or install the PopTorch wheel.
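As a quick sanity check (not part of the examples' own instructions), the pre-installed PopTorch can be confirmed from inside the container:
python3 -c "import poptorch; print(poptorch.__version__)"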