Running a PyTorch task
In the following lesson, we'll build an NLP neural network and train it using the EIDFGPUS.
The model was taken from the PyTorch Tutorials.
The lesson will be split into three parts:
- Requesting a persistent volume and transferring code/data to it
- Creating a pod with a PyTorch container downloaded from DockerHub
- Submitting a job to the EIDFGPUS and retrieving the results
Load training data and ML code into a persistent volume
Create a persistent volume
Request storage from the Ceph server by submitting a PVC to K8s (example PVC spec yaml below).
kubectl create -f <pvc-spec-yaml>
Example PyTorch PersistentVolumeClaim
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: pytorch-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
  storageClassName: csi-rbd-sc
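Provisioning on the Ceph backend is not always instantaneous, so it can be useful to wait until the claim reports `Bound` before moving on. The sketch below polls the PVC's status; `wait_for_pvc` is an illustrative helper name, and the PVC name matches the example spec above.

```shell
#!/bin/bash
# Poll a PersistentVolumeClaim until Kubernetes reports it as Bound.
# Usage: wait_for_pvc <pvc-name> [timeout-seconds]
wait_for_pvc() {
    local pvc="$1"
    local timeout="${2:-60}"
    local elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # jsonpath extracts only the phase field (Pending/Bound/Lost)
        phase=$(kubectl get pvc "$pvc" -o jsonpath='{.status.phase}')
        if [ "$phase" = "Bound" ]; then
            echo "PVC $pvc is Bound"
            return 0
        fi
        sleep 2
        elapsed=$((elapsed + 2))
    done
    echo "Timed out waiting for PVC $pvc (last phase: $phase)" >&2
    return 1
}

# Example: wait_for_pvc pytorch-pvc 120
```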
Transfer code/data to persistent volume
- Check the PVC has been created
kubectl get pvc <pvc-name>
- Create a lightweight pod with the PV mounted (example pod spec below)
kubectl create -f lightweight-pod.yaml
- Download the PyTorch code
wget https://github.com/EPCCed/eidf-docs/raw/main/docs/services/gpuservice/training/resources/example_pytorch_code.py
- Copy the Python script into the PV
kubectl cp example_pytorch_code.py lightweight-pod:/mnt/ceph_rbd/
- Check the files were transferred successfully
kubectl exec lightweight-pod -- ls /mnt/ceph_rbd
- Delete the lightweight pod
kubectl delete pod lightweight-pod
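The transfer steps above can be collected into a single helper. This is a sketch, assuming `lightweight-pod.yaml` is the example pod specification with the PV mounted at `/mnt/ceph_rbd`; `stage_file_to_pv` is an illustrative name.

```shell
#!/bin/bash
# Copy a local file into the PV via a temporary lightweight pod, then tidy up.
# Assumes lightweight-pod.yaml mounts the PVC at /mnt/ceph_rbd.
set -e

stage_file_to_pv() {
    local file="$1"
    kubectl create -f lightweight-pod.yaml
    # Wait until the busybox container is ready to accept cp/exec
    kubectl wait --for=condition=Ready pod/lightweight-pod --timeout=120s
    kubectl cp "$file" lightweight-pod:/mnt/ceph_rbd/
    # Confirm the file landed on the volume
    kubectl exec lightweight-pod -- ls /mnt/ceph_rbd
    kubectl delete pod lightweight-pod
}

# Example: stage_file_to_pv example_pytorch_code.py
```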
Example lightweight pod specification
apiVersion: v1
kind: Pod
metadata:
  name: lightweight-pod
spec:
  containers:
    - name: data-loader
      image: busybox
      command: ["sleep", "infinity"]
      resources:
        requests:
          cpu: 1
          memory: "1Gi"
        limits:
          cpu: 1
          memory: "1Gi"
      volumeMounts:
        - mountPath: /mnt/ceph_rbd
          name: volume
  volumes:
    - name: volume
      persistentVolumeClaim:
        claimName: pytorch-pvc
Creating a pod with a PyTorch container
We will use the pre-made PyTorch Docker image available on Docker Hub to run the PyTorch ML model.
The PyTorch container will be held within a pod that has the persistent volume mounted and has access to a MIG GPU.
Submit the specification file to K8s to create the pod.
kubectl create -f <pytorch-pod-yaml>
Example PyTorch Pod Specification File
apiVersion: v1
kind: Pod
metadata:
  name: pytorch-pod
spec:
  restartPolicy: Never
  containers:
    - name: pytorch-con
      image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel
      command: ["python3"]
      args: ["/mnt/ceph_rbd/example_pytorch_code.py"]
      volumeMounts:
        - mountPath: /mnt/ceph_rbd
          name: volume
      resources:
        requests:
          cpu: 2
          memory: "1Gi"
        limits:
          cpu: 4
          memory: "4Gi"
          nvidia.com/gpu: 1
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB-MIG-1g.5gb
  volumes:
    - name: volume
      persistentVolumeClaim:
        claimName: pytorch-pvc
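After submitting the pod, its phase can be polled until the run finishes. The following is a sketch; `watch_pod` is an illustrative helper name, and `pytorch-pod` matches the example spec above.

```shell
#!/bin/bash
# Watch a pod until it finishes, then print its logs.
# Usage: watch_pod <pod-name>
watch_pod() {
    local pod="$1"
    while true; do
        # phase is one of Pending/Running/Succeeded/Failed/Unknown
        phase=$(kubectl get pod "$pod" -o jsonpath='{.status.phase}')
        case "$phase" in
            Succeeded)
                echo "Pod $pod completed"
                kubectl logs "$pod"
                return 0 ;;
            Failed)
                echo "Pod $pod failed" >&2
                kubectl logs "$pod" >&2
                return 1 ;;
            *)
                sleep 10 ;;
        esac
    done
}

# Example: watch_pod pytorch-pod
```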
Reviewing the results of the PyTorch model
This is not intended to be an introduction to PyTorch; please see the online tutorial for details about the model.
- Check model ran to completion
kubectl logs <pytorch-pod-name>
- Spin up lightweight pod to retrieve results
kubectl create -f lightweight-pod.yaml
- Copy trained model back to the head node
kubectl cp lightweight-pod:mnt/ceph_rbd/model.pth model.pth
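The retrieval steps can also be scripted. A sketch under the same assumptions as before (`lightweight-pod.yaml` mounts the PV at `/mnt/ceph_rbd`; `fetch_model` is an illustrative name):

```shell
#!/bin/bash
# Recreate the lightweight pod, copy the trained model out, then clean up.
fetch_model() {
    kubectl create -f lightweight-pod.yaml
    kubectl wait --for=condition=Ready pod/lightweight-pod --timeout=120s
    # Copy model.pth from the mounted PV to the current directory
    kubectl cp lightweight-pod:mnt/ceph_rbd/model.pth model.pth
    kubectl delete pod lightweight-pod
}

# Example: fetch_model
```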
Using a Kubernetes job to train the PyTorch model
A common ML training workflow may consist of training multiple iterations of a model, such as models with different hyperparameters or models trained on multiple different data sets.
A Kubernetes job can create and manage multiple pods with identical or different initial parameters.
NVIDIA provide a detailed tutorial on how to conduct a ML hyperparameter search with a Kubernetes job.
Below is an example job yaml for running the PyTorch model, which will continue to create pods until three have successfully completed the task of training the model.
apiVersion: batch/v1
kind: Job
metadata:
  name: pytorch-job
spec:
  completions: 3
  parallelism: 1
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: pytorch-con
          image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel
          command: ["python3"]
          args: ["/mnt/ceph_rbd/example_pytorch_code.py"]
          volumeMounts:
            - mountPath: /mnt/ceph_rbd
              name: volume
          resources:
            requests:
              cpu: 1
              memory: "4Gi"
            limits:
              cpu: 1
              memory: "8Gi"
              nvidia.com/gpu: 1
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB-MIG-1g.5gb
      volumes:
        - name: volume
          persistentVolumeClaim:
            claimName: pytorch-pvc
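The job's progress can be read from its status fields, comparing how many completions have succeeded against the requested count. A sketch; `job_progress` is an illustrative helper name, and `pytorch-job` matches the spec above.

```shell
#!/bin/bash
# Report how many of a job's requested completions have succeeded.
# Usage: job_progress <job-name>
job_progress() {
    local job="$1"
    local wanted succeeded
    # .spec.completions is the target; .status.succeeded counts finished pods
    wanted=$(kubectl get job "$job" -o jsonpath='{.spec.completions}')
    succeeded=$(kubectl get job "$job" -o jsonpath='{.status.succeeded}')
    echo "$job: ${succeeded:-0}/$wanted completions succeeded"
}

# Example: job_progress pytorch-job
```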
Clean up
kubectl delete pod pytorch-pod
kubectl delete job pytorch-job
kubectl delete pvc pytorch-pvc