
Template workflow

Requirements

It is recommended that users complete Getting started with Kubernetes and Requesting persistent volumes with Kubernetes before proceeding with this tutorial.

Overview

An example workflow for code development using K8s is outlined below.

In theory, users can create Docker images with all the code, software and data included to complete their analysis.

In practice, Docker images with the required software can be several gigabytes in size, which can lead to unacceptable download times when ~100GB of data and code is then added.

Therefore, it is recommended to separate code, software, and data preparation into distinct steps:

  1. Data Loading: Loading large data sets asynchronously.

  2. Developing a Docker environment: Manually or automatically building Docker images.

  3. Code development with K8s: Iteratively changing and testing code in a job.

The workflow describes different strategies to tackle the three common stages in code development and analysis using the EIDF GPU Service.

The three stages can be tackled in any order and may not all be relevant to every project.

Some strategies in the workflow require a GitHub account and Docker Hub account for automatic building (this can be adapted for other platforms such as GitLab).

Data loading

The EIDF GPU Service contains GPUs with 40GB/80GB of on-board memory, and it is expected that data sets of >100GB will be loaded onto the service to utilise this hardware.

Persistent volume claims need to be of sufficient size to hold the input data, any expected output data and a small amount of additional empty space to facilitate I/O.

Read the requesting persistent volumes with Kubernetes lesson to learn how to request and mount persistent volumes to pods.
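For reference, a claim sized for this workflow might look like the sketch below; the 200Gi size and the storageClassName are placeholder assumptions, so check the persistent volumes lesson for the values used on the service:

```yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
 name: template-workflow-pvc
spec:
 accessModes:
  - ReadWriteOnce
 resources:
  requests:
   storage: 200Gi
 storageClassName: csi-rbd-sc
```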

It often takes several hours or days to download data sets of 1/2 TB or more to a persistent volume.

Therefore, the data download step needs to be completed asynchronously, as maintaining a connection to the server for long periods of time can be unreliable.

Asynchronous data downloading with a lightweight job

  1. Check a PVC has been created.

    kubectl -n <project-namespace> get pvc template-workflow-pvc
    
  2. Write a job yaml with PV mounted and a command to download the data. Change the curl URL to your data set of interest.

    apiVersion: batch/v1
    kind: Job
    metadata:
     name: lightweight-job
     labels:
      kueue.x-k8s.io/queue-name: <project-namespace>-user-queue
    spec:
     completions: 1
     parallelism: 1
     template:
      metadata:
       name: lightweight-job
      spec:
       restartPolicy: Never
       containers:
       - name: data-loader
         image: alpine/curl:latest
         command: ['sh', '-c', "cd /mnt/ceph_rbd; curl https://archive.ics.uci.edu/static/public/53/iris.zip -o iris.zip"]
         resources:
          requests:
           cpu: 1
           memory: "1Gi"
          limits:
           cpu: 1
           memory: "1Gi"
         volumeMounts:
         - mountPath: /mnt/ceph_rbd
           name: volume
       volumes:
       - name: volume
         persistentVolumeClaim:
          claimName: template-workflow-pvc
    
  3. Run the data download job.

    kubectl -n <project-namespace> create -f lightweight-job.yaml
    
  4. Check if the download has completed.

    kubectl -n <project-namespace> get jobs
    
  5. Delete the lightweight job once completed.

    kubectl -n <project-namespace> delete job lightweight-job
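Steps 4 and 5 can also be combined so the shell blocks until the job finishes before cleaning up; the timeout value below is an arbitrary assumption:

```shell
# Block until the download job reports completion (up to 24h)
kubectl -n <project-namespace> wait --for=condition=complete job/lightweight-job --timeout=24h

# Confirm curl finished cleanly, then clean up
kubectl -n <project-namespace> logs job/lightweight-job
kubectl -n <project-namespace> delete job lightweight-job
```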
    

Asynchronous data downloading within a screen session

Screen is a window manager available in Linux that allows you to create multiple interactive shells and swap between them.

Screen has the added benefit that if your remote session is interrupted the screen session persists and can be reattached when you manage to reconnect.

This allows you to start a task, such as downloading a data set, and check in on it asynchronously.

Once you have started a screen session, you can create a new window with ctrl-a c, swap between windows with ctrl-a 0-9 and exit screen (but keep any task running) with ctrl-a d.

Using screen rather than a single download job can be helpful if downloading multiple data sets or if you intend to do some simple QC or tidying up before/after downloading.
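For example, giving the session a name makes it easier to find and reattach after reconnecting (the session name here is arbitrary):

```shell
# Start a named screen session for the download
screen -S data-download

# ... start the download inside the session, then detach with ctrl-a d ...

# Later, list the sessions and reattach by name
screen -ls
screen -r data-download
```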

  1. Start a screen session.

    screen
    
  2. Write a yaml file for an interactive lightweight job (the container simply sleeps, ready for kubectl exec).

    apiVersion: batch/v1
    kind: Job
    metadata:
     name: lightweight-job
     labels:
      kueue.x-k8s.io/queue-name: <project-namespace>-user-queue
    spec:
     completions: 1
     parallelism: 1
     template:
      metadata:
       name: lightweight-pod
      spec:
       restartPolicy: Never
       containers:
       - name: data-loader
         image: alpine/curl:latest
         command: ['sleep','infinity']
         resources:
          requests:
           cpu: 1
           memory: "1Gi"
          limits:
           cpu: 1
           memory: "1Gi"
         volumeMounts:
         - mountPath: /mnt/ceph_rbd
           name: volume
       volumes:
       - name: volume
         persistentVolumeClaim:
          claimName: template-workflow-pvc
    
  3. Submit the job yaml from step 2, find the created pod's name, then run the download inside it. Change the curl URL to your data set of interest.

    kubectl -n <project-namespace> create -f <job-yaml-file>

    kubectl -n <project-namespace> get pods

    kubectl -n <project-namespace> exec <lightweight-pod-name> -- curl https://archive.ics.uci.edu/static/public/53/iris.zip -o /mnt/ceph_rbd/iris.zip
    
  4. Detach from the screen window with ctrl-a d and, if desired, end the remote session.

  5. Reconnect at a later time and reattach the screen window.

    screen -list
    
    screen -r <session-name>
    
  6. Check the download was successful and delete the job.

    kubectl -n <project-namespace> exec <lightweight-pod-name> -- ls /mnt/ceph_rbd/
    
    kubectl -n <project-namespace> delete job lightweight-job
    
  7. Exit the screen session.

    exit
    

Preparing a custom Docker image

Kubernetes requires Docker images to be pre-built and available for download from a container repository such as Docker Hub.

It does not provide functionality to build images or create pods directly from Dockerfiles.

However, use cases may require some custom modifications of a base image, such as adding a Python library.

These custom images need to be built locally (using Docker) or online (using a GitHub/GitLab worker) and pushed to a repository such as Docker Hub.

This is not an introduction to building Docker images; please see the Docker tutorial for a general overview.

Manually building a Docker image locally

  1. Select a suitable base image (the NVIDIA container catalog is often a useful starting place for GPU-accelerated tasks). We'll use the base RAPIDS image.

  2. Create a Dockerfile to add any additional packages required to the base image.

    FROM nvcr.io/nvidia/rapidsai/base:23.12-cuda12.0-py3.10
    RUN pip install pandas
    RUN pip install plotly
    
  3. Build the Docker image locally (you will need to install Docker).

    cd <dockerfile-folder>
    
    docker build . -t <docker-hub-username>/template-docker-image:latest
    

Building images for different CPU architectures

Be aware that Docker images built for Apple ARM64 architectures will not function optimally on the EIDF GPU Service's AMD64-based architecture.

If building Docker images locally on an Apple device, you must tell the Docker daemon to use AMD64-based images by passing the --platform linux/amd64 flag to the build command.
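For example, the earlier build command becomes:

```shell
cd <dockerfile-folder>

# Force an AMD64 image so it runs correctly on the EIDF GPU Service nodes
docker build . --platform linux/amd64 -t <docker-hub-username>/template-docker-image:latest
```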

  1. Create a repository to hold the image on Docker Hub (you will need to create and set up an account).

  2. Push the Docker image to the repository.

    docker push <docker-hub-username>/template-docker-image:latest
    
  3. Finally, specify your Docker image in the image: tag of the job specification yaml file.

    apiVersion: batch/v1
    kind: Job
    metadata:
     name: template-workflow-job
     labels:
      kueue.x-k8s.io/queue-name: <project-namespace>-user-queue
    spec:
     completions: 1
     parallelism: 1
     template:
      spec:
       restartPolicy: Never
       containers:
       - name: template-docker-image
         image: <docker-hub-username>/template-docker-image:latest
         command: ["sleep", "infinity"]
         resources:
          requests:
           cpu: 1
           memory: "4Gi"
          limits:
           cpu: 1
           memory: "8Gi"
    

Automatically building docker images using GitHub Actions

In cases where the Docker image needs to be built and tested iteratively (i.e. to check for compatibility issues), git version control and GitHub Actions can simplify the build process.

A GitHub Action can build and push a Docker image to Docker Hub whenever it detects a git push that changes the Dockerfile in a git repo.

This process requires you to already have a GitHub and Docker Hub account.

  1. Create an access token on your Docker Hub account to allow GitHub to push changes to the Docker Hub image repo.

  2. Create two GitHub secrets to securely provide your Docker Hub username and access token.
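If you have the GitHub CLI installed, the two secrets can be set from the terminal; the secret names must match those referenced in the action yaml, and the values below are placeholders:

```shell
# Store Docker Hub credentials as repository secrets
gh secret set DOCKERHUB_USERNAME --body "<docker-hub-username>"
gh secret set DOCKERHUB_TOKEN --body "<docker-hub-access-token>"
```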

  3. Add the Dockerfile to a code/docker folder within an active GitHub repo.

  4. Add the GitHub Action yaml file below to the .github/workflows folder to automatically push a new image to Docker Hub whenever changes to files in the code/docker folder are detected.

    name: ci
    on:
      push:
        paths:
          - 'code/docker/**'
    
    jobs:
      docker:
        runs-on: ubuntu-latest
        steps:
          -
            name: Set up QEMU
            uses: docker/setup-qemu-action@v3
          -
            name: Set up Docker Buildx
            uses: docker/setup-buildx-action@v3
          -
            name: Login to Docker Hub
            uses: docker/login-action@v3
            with:
              username: ${{ secrets.DOCKERHUB_USERNAME }}
              password: ${{ secrets.DOCKERHUB_TOKEN }}
          -
            name: Build and push
            uses: docker/build-push-action@v5
            with:
              context: "{{defaultContext}}:code/docker"
              push: true
              tags: <target-dockerhub-image-name>
    
  5. Push a change to the Dockerfile and check the Docker Hub image is updated.
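One way to verify the push is to pull the image and inspect its digest, which should match the digest reported in the GitHub Actions build log:

```shell
docker pull <docker-hub-username>/template-docker-image:latest

# Print the repo digest of the local copy
docker image inspect --format '{{index .RepoDigests 0}}' <docker-hub-username>/template-docker-image:latest
```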

Code development with K8s

Production code can be included within a Docker image to aid reproducibility, as the specific software versions required to run the code are packaged together.

However, binding the code to the Docker image during development can delay the testing cycle, as re-downloading all of the software for every change in a code block takes time.

If the Docker image is consistent across tests, then it can be cached locally on the EIDF GPU Service instead of being re-downloaded (this occurs automatically, although the cache is node-specific and is not shared across nodes).

A pod yaml file can be defined to automatically pull the latest code version before running any tests.

Reducing the download time to fractions of a second allows rapid testing to be completed on the cluster with just the kubectl create command.

You must already have a GitHub account to follow this process.

This process allows code development to be conducted on any device/VM with access to the repo (GitHub/GitLab).

A template GitHub repo with sample code, k8s yaml files and a Docker build GitHub Action is available here.

Create a job that downloads and runs the latest code version at runtime

  1. Write a standard yaml file for a k8s job with the required resources and custom Docker image (example below).

    apiVersion: batch/v1
    kind: Job
    metadata:
     name: template-workflow-job
     labels:
      kueue.x-k8s.io/queue-name: <project-namespace>-user-queue
    spec:
     completions: 1
     parallelism: 1
     template:
      spec:
       restartPolicy: Never
       containers:
       - name: template-docker-image
         image: <docker-hub-username>/template-docker-image:latest
         command: ["sleep", "infinity"]
         resources:
          requests:
           cpu: 1
           memory: "4Gi"
          limits:
           cpu: 1
           memory: "8Gi"
         volumeMounts:
         - mountPath: /mnt/ceph_rbd
           name: volume
       volumes:
       - name: volume
         persistentVolumeClaim:
          claimName: template-workflow-pvc
    
  2. Add an init container that runs before the main container to download the latest version of the code.

    apiVersion: batch/v1
    kind: Job
    metadata:
     name: template-workflow-job
     labels:
      kueue.x-k8s.io/queue-name: <project-namespace>-user-queue
    spec:
     completions: 1
     parallelism: 1
     template:
      spec:
       restartPolicy: Never
       containers:
       - name: template-docker-image
         image: <docker-hub-username>/template-docker-image:latest
         command: ["sleep", "infinity"]
         resources:
          requests:
           cpu: 1
           memory: "4Gi"
          limits:
           cpu: 1
           memory: "8Gi"
         volumeMounts:
         - mountPath: /mnt/ceph_rbd
           name: volume
         - mountPath: /code
           name: github-code
       initContainers:
       - name: lightweight-git-container
         image: cicirello/alpine-plus-plus
         command: ['sh', '-c', "cd /code; git clone <target-repo>"]
         resources:
          requests:
           cpu: 1
           memory: "4Gi"
          limits:
           cpu: 1
           memory: "8Gi"
         volumeMounts:
         - mountPath: /code
           name: github-code
       volumes:
       - name: volume
         persistentVolumeClaim:
          claimName: template-workflow-pvc
       - name: github-code
         emptyDir:
          sizeLimit: 1Gi
    
  3. Change the command argument in the main container to run the code once started. Add the URL of the GitHub repo of interest to the initContainers: command: tag.

    apiVersion: batch/v1
    kind: Job
    metadata:
     name: template-workflow-job
     labels:
      kueue.x-k8s.io/queue-name: <project-namespace>-user-queue
    spec:
     completions: 1
     parallelism: 1
     template:
      spec:
       restartPolicy: Never
       containers:
       - name: template-docker-image
         image: <docker-hub-username>/template-docker-image:latest
         command: ['sh', '-c', "python3 /code/<python-script>"]
         resources:
          requests:
           cpu: 10
           memory: "40Gi"
          limits:
           cpu: 10
           memory: "80Gi"
           nvidia.com/gpu: 1
         volumeMounts:
         - mountPath: /mnt/ceph_rbd
           name: volume
         - mountPath: /code
           name: github-code
       initContainers:
       - name: lightweight-git-container
         image: cicirello/alpine-plus-plus
         command: ['sh', '-c', "cd /code; git clone <target-repo>"]
         resources:
          requests:
           cpu: 1
           memory: "4Gi"
          limits:
           cpu: 1
           memory: "8Gi"
         volumeMounts:
         - mountPath: /code
           name: github-code
       volumes:
       - name: volume
         persistentVolumeClaim:
          claimName: template-workflow-pvc
       - name: github-code
         emptyDir:
          sizeLimit: 1Gi
    
  4. Submit the yaml file to Kubernetes.

    kubectl -n <project-namespace> create -f <job-yaml-file>
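Once submitted, the init container clones the repo before the main container runs the script; progress and output can be checked with:

```shell
# Watch the init container complete and the main container start
kubectl -n <project-namespace> get pods -w

# Stream the Python script's output
kubectl -n <project-namespace> logs -f job/template-workflow-job
```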