Running a PyTorch task

Requirements

It is recommended that users complete Getting started with Kubernetes and Requesting persistent volumes with Kubernetes before proceeding with this tutorial.

Overview

In the following lesson, we'll build a convolutional neural network (CNN) and train it using the EIDF GPU Service.

The model was taken from the PyTorch Tutorials.

The lesson will be split into three parts:

  • Requesting a persistent volume and transferring code/data to it
  • Creating a pod with a PyTorch container downloaded from DockerHub
  • Submitting a job to the EIDF GPU Service and retrieving the results

Load training data and ML code into a persistent volume

Create a persistent volume

Request storage from the Ceph server by submitting a PersistentVolumeClaim (PVC) to K8s (example PVC spec YAML below).

kubectl -n <project-namespace> create -f <pvc-spec-yaml>

Example PyTorch PersistentVolumeClaim

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
 name: pytorch-pvc
spec:
 accessModes:
  - ReadWriteOnce
 resources:
  requests:
   storage: 2Gi
 storageClassName: csi-rbd-sc

Transfer code/data to persistent volume

  1. Check that the PVC has been created

    kubectl -n <project-namespace> get pvc <pvc-name>
    
  2. Create a lightweight job whose pod has the PV mounted (example job spec below)

    kubectl -n <project-namespace> create -f lightweight-pod-job.yaml
    
  3. Download the PyTorch code

    wget https://github.com/EPCCed/eidf-docs/raw/main/docs/services/gpuservice/training/resources/example_pytorch_code.py
    
  4. Copy the Python script into the PV

    kubectl -n <project-namespace> cp example_pytorch_code.py lightweight-job-<identifier>:/mnt/ceph_rbd/
    
  5. Check whether the files were transferred successfully

    kubectl -n <project-namespace> exec lightweight-job-<identifier> -- ls /mnt/ceph_rbd
    
  6. Delete the lightweight job

    kubectl -n <project-namespace> delete job lightweight-job-<identifier>
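
The `<identifier>` suffix used in the steps above is generated by Kubernetes when the job creates its pod. One way to look it up is via the `job-name` label that Kubernetes attaches to a job's pods (a sketch, assuming the job name `lightweight-job` from the example spec below):

```bash
# List the pod(s) created by the lightweight job; the NAME column
# contains the full lightweight-job-<identifier> name
kubectl -n <project-namespace> get pods -l job-name=lightweight-job

# Or capture the pod name in a shell variable for reuse
POD=$(kubectl -n <project-namespace> get pods -l job-name=lightweight-job \
    -o jsonpath='{.items[0].metadata.name}')
echo "$POD"
```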
    

Example lightweight job specification

apiVersion: batch/v1
kind: Job
metadata:
    name: lightweight-job
    labels:
        kueue.x-k8s.io/queue-name: <project-namespace>-user-queue
spec:
    completions: 1
    template:
        metadata:
            name: lightweight-pod
        spec:
            containers:
            - name: data-loader
              image: busybox
              args: ["sleep", "infinity"]
              resources:
                    requests:
                        cpu: 1
                        memory: '1Gi'
                    limits:
                        cpu: 1
                        memory: '1Gi'
              volumeMounts:
                    - mountPath: /mnt/ceph_rbd
                      name: volume
            restartPolicy: Never
            volumes:
                - name: volume
                  persistentVolumeClaim:
                    claimName: pytorch-pvc

Creating a Job with a PyTorch container

We will use the pre-made PyTorch Docker image available on Docker Hub to run the PyTorch ML model.

The PyTorch container will be held within a pod that has the persistent volume mounted and has access to a MIG GPU.

Submit the specification file below to K8s to create the job, replacing the queue name with your project namespace queue name.

kubectl -n <project-namespace> create -f <pytorch-job-yaml>

Example PyTorch Job Specification File

apiVersion: batch/v1
kind: Job
metadata:
    name: pytorch-job
    labels:
        kueue.x-k8s.io/queue-name: <project-namespace>-user-queue
spec:
    completions: 1
    template:
        metadata:
            name: pytorch-pod
        spec:
            restartPolicy: Never
            containers:
            - name: pytorch-con
              image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel
              command: ["python3"]
              args: ["/mnt/ceph_rbd/example_pytorch_code.py"]
              volumeMounts:
                - mountPath: /mnt/ceph_rbd
                  name: volume
              resources:
                requests:
                  cpu: 2
                  memory: "1Gi"
                limits:
                  cpu: 4
                  memory: "4Gi"
                  nvidia.com/gpu: 1
            nodeSelector:
                nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB-MIG-1g.5gb
            volumes:
                - name: volume
                  persistentVolumeClaim:
                    claimName: pytorch-pvc
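
After submitting the job, you can watch its progress before reviewing the results. A sketch using standard kubectl commands (assuming the job name `pytorch-job` from the spec above):

```bash
# COMPLETIONS shows 1/1 once the training pod has finished
kubectl -n <project-namespace> get job pytorch-job

# Find the pod the job created, then follow its training output
kubectl -n <project-namespace> get pods -l job-name=pytorch-job
kubectl -n <project-namespace> logs -f <pytorch-pod-name>
```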

Reviewing the results of the PyTorch model

This is not intended to be an introduction to PyTorch; please see the online tutorial for details about the model.

  1. Check that the model ran to completion

    kubectl -n <project-namespace> logs <pytorch-pod-name>
    
  2. Spin up a lightweight pod to retrieve results

    kubectl -n <project-namespace> create -f lightweight-pod-job.yaml
    
  3. Copy the trained model back to your access VM

    kubectl -n <project-namespace> cp lightweight-job-<identifier>:/mnt/ceph_rbd/model.pth model.pth
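
To confirm that both the training run and the copy succeeded, one option is to query the job status and inspect the local file (a sketch, assuming the job name `pytorch-job` from the spec above):

```bash
# Prints the number of successfully completed pods (1 when training finished)
kubectl -n <project-namespace> get job pytorch-job -o jsonpath='{.status.succeeded}'

# Confirm the trained model arrived on the access VM
ls -lh model.pth
```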
    

Using a Kubernetes job to train the PyTorch model multiple times

A common ML training workflow may consist of training multiple iterations of a model, such as models with different hyperparameters or models trained on different data sets.

A Kubernetes job can create and manage multiple pods with identical or different initial parameters.

NVIDIA provide a detailed tutorial on how to conduct a ML hyperparameter search with a Kubernetes job.

Below is an example job YAML for running the PyTorch model, which will continue to create pods until three have successfully completed the task of training the model.

apiVersion: batch/v1
kind: Job
metadata:
    name: pytorch-job
    labels:
        kueue.x-k8s.io/queue-name: <project-namespace>-user-queue
spec:
    completions: 3
    template:
        metadata:
            name: pytorch-pod
        spec:
            restartPolicy: Never
            containers:
            - name: pytorch-con
              image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel
              command: ["python3"]
              args: ["/mnt/ceph_rbd/example_pytorch_code.py"]
              volumeMounts:
                - mountPath: /mnt/ceph_rbd
                  name: volume
              resources:
                requests:
                  cpu: 2
                  memory: "1Gi"
                limits:
                  cpu: 4
                  memory: "4Gi"
                  nvidia.com/gpu: 1
            nodeSelector:
                nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB-MIG-1g.5gb
            volumes:
                - name: volume
                  persistentVolumeClaim:
                    claimName: pytorch-pvc
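
To vary hyperparameters across the pods of a single job, one option is a Kubernetes Indexed Job: with `completionMode: Indexed`, each pod receives a `JOB_COMPLETION_INDEX` environment variable (0 to completions-1) that the training script can read. A minimal sketch; the `--run-index` flag is hypothetical, and the example script would need to be adapted to accept it:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
    name: pytorch-indexed-job
    labels:
        kueue.x-k8s.io/queue-name: <project-namespace>-user-queue
spec:
    completions: 3
    completionMode: Indexed
    template:
        spec:
            restartPolicy: Never
            containers:
            - name: pytorch-con
              image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel
              # JOB_COMPLETION_INDEX is injected by Kubernetes for Indexed Jobs;
              # the hypothetical --run-index flag would select the hyperparameters
              command: ["sh", "-c"]
              args: ["python3 /mnt/ceph_rbd/example_pytorch_code.py --run-index $JOB_COMPLETION_INDEX"]
              # resources, volumeMounts, nodeSelector and volumes as in the job spec above
```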

Clean up

kubectl -n <project-namespace> delete job pytorch-job

kubectl -n <project-namespace> delete pvc pytorch-pvc