Getting started with Kubernetes
Requirements
To follow this tutorial on the EIDF GPU Cluster you will need:

- An account on the EIDF Portal.
- An active EIDF Project on the Portal with access to the EIDF GPU Service.
- The EIDF GPU Service Kubernetes namespace associated with the project, e.g. `eidf001ns`.
- The EIDF GPU Service queue name associated with the project, e.g. `eidf001ns-user-queue`.
- The kubeconfig file downloaded to a Project VM, along with the `kubectl` command line tool to interact with the K8s API.
Downloading the kubeconfig file and kubectl
Project Leads should use the 'Download kubeconfig' button on the EIDF Portal to complete this step, which ensures the correct kubeconfig file and kubectl version are installed.
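Once the file and tool are in place, a quick sanity check confirms the cluster is reachable. A minimal sketch, assuming the kubeconfig was saved in your home directory (the path is an assumption; substitute wherever you saved the file):

```bash
# Point kubectl at the downloaded kubeconfig (path is an assumption).
export KUBECONFIG=$HOME/kubeconfig

# Print client and server versions; seeing a server version confirms
# that this kubeconfig can reach the cluster.
kubectl version
```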
Introduction
Kubernetes (K8s) is a container orchestration system, originally developed by Google, for the deployment, scaling, and management of containerised applications.
Nvidia GPUs are supported through the K8s-native Nvidia GPU Operator.
The use of K8s to manage the EIDF GPU Service provides two key advantages:
- support for containers, enabling reproducible analysis whilst minimising demand on system administrators.
- automated resource allocation management for GPUs and storage volumes that are shared across multiple users.
Interacting with a K8s cluster
An overview of the key components of a K8s cluster can be found on the Kubernetes docs website.
The primary component of a K8s cluster is a pod.
A pod is a set of one or more Docker containers (and their storage volumes) that share resources.
It is the EIDF GPU Cluster policy that all pods should be wrapped within a K8s job.
This allows GPU/CPU/memory resource requests to be managed by the cluster queue management system, Kueue.
Pods which attempt to bypass the queue mechanism will affect the experience of other project users.
Any pods not associated with a job (or other K8s object) are at risk of being deleted without notice.
K8s jobs also provide additional functionality such as parallelism (described later in this tutorial).
Users define the resource requirements of a pod (i.e. the number/type of GPUs) and the containers/code to be run in the pod by writing a template within a job manifest file in YAML.
The job YAML file is sent to the cluster using the K8s API and is assigned to an appropriate node to be run.
A node is a part of the cluster, such as a physical or virtual host, which exposes CPU, memory and GPUs.
Users interact with the K8s API using `kubectl` (short for Kubernetes control) commands.

Some of the `kubectl` commands are restricted on the EIDF cluster in order to ensure project details are not shared across namespaces.

Ensure kubectl is interacting with your project namespace

You will need to pass the name of your project namespace to `kubectl` in order for it to have permission to interact with the cluster. By default, `kubectl` will attempt to interact with the `default` namespace, which will return a permissions error if it is not told otherwise. `kubectl -n <project-namespace> <command>` will tell `kubectl` to pass the commands to the correct namespace.
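For example, a minimal sketch of the difference, using the `eidf001ns` example namespace from earlier in this tutorial:

```bash
# Without the namespace flag, kubectl targets the "default" namespace,
# which returns a permissions error on the EIDF service:
kubectl get pods

# Scoped to your project namespace, the same command succeeds:
kubectl -n eidf001ns get pods
```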
Useful commands are:

- `kubectl -n <project-namespace> create -f <job definition yaml>`: Create a new job with the requested resources. Returns an error if a job with the same name already exists.
- `kubectl -n <project-namespace> apply -f <job definition yaml>`: Create a new job with the requested resources. If a job with the same name already exists, it updates that job with the new resource/container requirements outlined in the yaml.
- `kubectl -n <project-namespace> delete pod <pod name>`: Delete a pod from the cluster.
- `kubectl -n <project-namespace> get pods`: Summarise all pods the namespace has active (or pending).
- `kubectl -n <project-namespace> describe pods`: Verbose description of all pods the namespace has active (or pending).
- `kubectl -n <project-namespace> describe pod <pod name>`: Verbose summary of the specified pod.
- `kubectl -n <project-namespace> logs <pod name>`: Retrieve the log files associated with a running pod.
- `kubectl -n <project-namespace> get jobs`: List all jobs the namespace has active (or pending).
- `kubectl -n <project-namespace> describe job <job name>`: Verbose summary of the specified job.
- `kubectl -n <project-namespace> delete job <job name>`: Delete a job from the cluster.
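To make the workflow concrete, a typical session chains these together. A sketch using the example namespace and the illustrative job/pod names that appear later in this tutorial:

```bash
kubectl -n eidf001ns create -f test_NBody.yml   # submit the job
kubectl -n eidf001ns get jobs                   # check completion status
kubectl -n eidf001ns get pods                   # find the pod name(s) for the job
kubectl -n eidf001ns logs jobtest-b92qg-lh64s   # read a pod's output
kubectl -n eidf001ns delete job jobtest-b92qg   # clean up the job and its pods
```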
Creating your first pod template within a job yaml file
To access the GPUs on the service, it is recommended to start with one of the prebuilt container images provided by Nvidia; these images are intended to perform different tasks using Nvidia GPUs.
The list of Nvidia images is available on their website.
The following example uses their CUDA sample code simulating nbody interactions.
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  generateName: jobtest-
  labels:
    kueue.x-k8s.io/queue-name: <project-namespace>-user-queue
spec:
  completions: 1
  template:
    metadata:
      name: job-test
    spec:
      containers:
      - name: cudasample
        image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.7.1
        args: ["-benchmark", "-numbodies=512000", "-fp64", "-fullscreen"]
        resources:
          requests:
            cpu: 2
            memory: '1Gi'
          limits:
            cpu: 2
            memory: '4Gi'
            nvidia.com/gpu: 1
      restartPolicy: Never
```
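Note that the manifest uses `generateName:` rather than `name:` in its metadata, so every submission creates a fresh job with a unique suffix. One consequence worth flagging: manifests without a fixed `metadata.name` must be submitted with `kubectl create`, since `kubectl apply` requires a fixed name and will reject the file.

```bash
# Works: each "create" submits a new job with a unique generated name.
kubectl -n <project-namespace> create -f <job definition yaml>

# Fails: "apply" cannot be used with generateName.
kubectl -n <project-namespace> apply -f <job definition yaml>
```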
The pod resources are defined under the `resources` tags using the `requests` and `limits` tags. Resources defined under the `requests` tags are the reserved resources required for the pod to be scheduled. If a pod is assigned to a node with unused resources, it may burst up to use resources beyond those requested. This may allow the task within the pod to run faster, but it will also be throttled back down when further pods are scheduled to the node.

The `limits` tag specifies the maximum resources that can be assigned to a pod. The EIDF GPU Service requires all pods to have `requests` and `limits` tags for CPU and memory defined in order to be accepted. GPU resource requests are optional; only an entry under the `limits` tag is needed to specify the use of a GPU, e.g. `nvidia.com/gpu: 1`. Without this, no GPU will be available to the pod.

The label `kueue.x-k8s.io/queue-name` specifies the queue you are submitting your job to. This is part of the Kueue system in operation on the service, which allows for improved resource management for users.
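Before submitting for real, you can optionally ask the API server to validate a manifest without creating anything; a sketch using server-side dry run (available in recent kubectl versions):

```bash
# Validates the manifest against the cluster without creating the job:
kubectl -n <project-namespace> create -f <job definition yaml> --dry-run=server
```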
Submitting your first job
- Open an editor of your choice and create the file `test_NBody.yml`.
- Copy the above job yaml into the file, filling in `<project-namespace>-user-queue`, e.g. `eidf001ns-user-queue`.
- Save the file and exit the editor.
- Run `kubectl -n <project-namespace> create -f test_NBody.yml`.
- This will output something like:

```
job.batch/jobtest-b92qg created
```

The five character code appended to the job name, i.e. `b92qg`, is randomly generated and will differ from your run.
- Run `kubectl -n <project-namespace> get jobs`.
- This will output something like:

```
NAME            COMPLETIONS   DURATION   AGE
jobtest-b92qg   1/1           48s        29m
```

There may be more than one entry as this displays all the jobs in the current namespace, starting with their name, number of completions against required completions, duration and age.
- Inspect your job further using the command `kubectl -n <project-namespace> describe job jobtest-b92qg`, updating the job name with your five character code.
- This will output something like:

```
Name:             jobtest-b92qg
Namespace:        t4
Selector:         controller-uid=d3233fee-794e-466f-9655-1fe32d1f06d3
Labels:           kueue.x-k8s.io/queue-name=t4-user-queue
Annotations:      batch.kubernetes.io/job-tracking:
Parallelism:      1
Completions:      3
Completion Mode:  NonIndexed
Start Time:       Wed, 14 Feb 2024 14:07:44 +0000
Completed At:     Wed, 14 Feb 2024 14:08:32 +0000
Duration:         48s
Pods Statuses:    0 Active (0 Ready) / 3 Succeeded / 0 Failed
Pod Template:
  Labels:  controller-uid=d3233fee-794e-466f-9655-1fe32d1f06d3
           job-name=jobtest-b92qg
  Containers:
   cudasample:
    Image:      nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.7.1
    Port:       <none>
    Host Port:  <none>
    Args:
      -benchmark
      -numbodies=512000
      -fp64
      -fullscreen
    Limits:
      cpu:             2
      memory:          4Gi
      nvidia.com/gpu:  1
    Requests:
      cpu:     2
      memory:  1Gi
    Environment:  <none>
    Mounts:       <none>
  Volumes:        <none>
Events:
  Type    Reason            Age    From                        Message
  ----    ------            ----   ----                        -------
  Normal  Suspended         8m1s   job-controller              Job suspended
  Normal  CreatedWorkload   8m1s   batch/job-kueue-controller  Created Workload: t4/job-jobtest-b92qg-3b890
  Normal  Started           8m1s   batch/job-kueue-controller  Admitted by clusterQueue project-cq
  Normal  SuccessfulCreate  8m     job-controller              Created pod: jobtest-b92qg-lh64s
  Normal  Resumed           8m     job-controller              Job resumed
  Normal  SuccessfulCreate  7m44s  job-controller              Created pod: jobtest-b92qg-xhvdm
  Normal  SuccessfulCreate  7m28s  job-controller              Created pod: jobtest-b92qg-lvmrf
  Normal  Completed         7m12s  job-controller              Job completed
```
- Run `kubectl -n <project-namespace> get pods`.
- This will output something like:

```
NAME                  READY   STATUS      RESTARTS   AGE
jobtest-b92qg-lh64s   0/1     Completed   0          11m
```

Again, there may be more than one entry as this displays all the pods in the current namespace. Also, each pod within a job is given another unique five character code appended to the job name.
- View the logs of a pod from the job you ran with `kubectl -n <project-namespace> logs jobtest-b92qg-lh64s`, again updating with your run's job and pod five character codes.
- This will output something like:

```
Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
        -fullscreen       (run n-body simulation in fullscreen mode)
        -fp64             (use double precision floating point values for simulation)
        -hostmem          (stores simulation data in host memory)
        -benchmark        (run benchmark to measure performance)
        -numbodies=<N>    (number of bodies (>= 1) to run in simulation)
        -device=<d>       (where d=0,1,2.... for the CUDA device to use)
        -numdevices=<i>   (where i=(number of CUDA devices > 0) to use for simulation)
        -compare          (compares simulation results running once on the default GPU and once on the CPU)
        -cpu              (run n-body simulation on the CPU)
        -tipsy=<file.bin> (load a tipsy model file for simulation)

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

> Fullscreen mode
> Simulation data stored in video memory
> Double precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "Ampere" with compute capability 8.0

> Compute 8.0 CUDA device: [NVIDIA A100-SXM4-40GB]
number of bodies = 512000
512000 bodies, total time for 10 iterations: 10570.778 ms
= 247.989 billion interactions per second
= 7439.679 double-precision GFLOP/s at 30 flops per interaction
```
- Delete your job with `kubectl -n <project-namespace> delete job jobtest-b92qg`. This will delete the associated pods as well.
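If you would rather script the wait between submission and log retrieval, `kubectl wait` can block until the job reports completion; a sketch (substitute your job's generated name):

```bash
# Block until the job reaches the Complete condition, or give up after
# ten minutes; the logs can then be fetched as above.
kubectl -n <project-namespace> wait --for=condition=complete job/jobtest-b92qg --timeout=600s
```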
Specifying GPU requirements
If you create multiple jobs with the same definition file and compare their log files, you may notice the CUDA device differs from `Compute 8.0 CUDA device: [NVIDIA A100-SXM4-40GB]`.
The GPU Operator on K8s allocates the pod to the first node with a free GPU that matches the other resource specifications, irrespective of the type of GPU present on the node.
The GPU resource requests can be made more specific by adding the type of GPU product the pod template is requesting to the node selector:

```
nvidia.com/gpu.product: 'NVIDIA-A100-SXM4-80GB'
nvidia.com/gpu.product: 'NVIDIA-A100-SXM4-40GB'
nvidia.com/gpu.product: 'NVIDIA-A100-SXM4-40GB-MIG-3g.20gb'
nvidia.com/gpu.product: 'NVIDIA-A100-SXM4-40GB-MIG-1g.5gb'
nvidia.com/gpu.product: 'NVIDIA-H100-80GB-HBM3'
```
Example yaml file with GPU type specified
The `nodeSelector:` key at the bottom of the pod template states that the pod should be run on a node with a 1g.5gb MIG GPU.

Exact GPU product names only

K8s will fail to assign the pod if you misspell the GPU type. Be especially careful when requesting a full 80GB or 40GB A100 GPU, as attempting to load a GPU with more data than its memory can handle can have unexpected consequences.
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  generateName: jobtest-
  labels:
    kueue.x-k8s.io/queue-name: <project-namespace>-user-queue
spec:
  completions: 1
  template:
    metadata:
      name: job-test
    spec:
      containers:
      - name: cudasample
        image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.7.1
        args: ["-benchmark", "-numbodies=512000", "-fp64", "-fullscreen"]
        resources:
          requests:
            cpu: 2
            memory: '1Gi'
          limits:
            cpu: 2
            memory: '4Gi'
            nvidia.com/gpu: 1
      restartPolicy: Never
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB-MIG-1g.5gb
```
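If your project permissions allow reading node objects (some `kubectl` commands are restricted on the service), you can discover the exact product strings directly from the cluster rather than copying them; a minimal sketch:

```bash
# List nodes with an extra column showing the GPU product label; copy
# the exact string from this column into nodeSelector.
kubectl get nodes -L nvidia.com/gpu.product
```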
Running multiple pods with K8s jobs
Wrapping a pod within a job provides additional functionality on top of accessing the queuing system.
Firstly, the `restartPolicy` within a job enables the self-healing mechanism within K8s, so that if a node dies while hosting the job's pod, the job will find a new node and automatically restart the pod.
Jobs also allow users to define multiple pods that can run in parallel or series and will continue to spawn pods until a specific number of pods successfully terminate.
See below for an example K8s job that requires three pods to successfully complete the example CUDA code before the job itself ends.
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  generateName: jobtest-
  labels:
    kueue.x-k8s.io/queue-name: <project-namespace>-user-queue
spec:
  completions: 3
  parallelism: 1
  template:
    metadata:
      name: job-test
    spec:
      containers:
      - name: cudasample
        image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.7.1
        args: ["-benchmark", "-numbodies=512000", "-fp64", "-fullscreen"]
        resources:
          requests:
            cpu: 2
            memory: '1Gi'
          limits:
            cpu: 2
            memory: '4Gi'
            nvidia.com/gpu: 1
      restartPolicy: Never
```
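To check progress against the required completions, the job's status can be queried directly; a sketch using a JSONPath output expression (substitute your job's generated name):

```bash
# Print how many pods have succeeded so far; the job is complete when
# this reaches the requested completions (3 in the example above).
kubectl -n <project-namespace> get job jobtest-b92qg -o jsonpath='{.status.succeeded}'
```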
Change the default kubectl namespace in the project kubeconfig file
Passing the `-n <project-namespace>` flag every time you want to interact with the cluster can be cumbersome. You can alter the kubeconfig file on your VM to send commands to your project namespace by default. Note that only users with sudo privileges can change the root kubectl config file.
- Open the command line on your EIDF VM with access to the EIDF GPU Service.
- Open the root kubeconfig file with sudo privileges.

```bash
sudo nano /kubernetes/config
```

- Add the namespace line with your project's kubernetes namespace to the "eidf-general-prod" context entry in your copy of the config file.

```
*** MORE CONFIG ***

contexts:
- name: "eidf-general-prod"
  context:
    user: "eidf-general-prod"
    namespace: "<project-namespace>" # INSERT LINE
    cluster: "eidf-general-prod"

*** MORE CONFIG ***
```
- Check `kubectl` connects to the cluster. If this does not work, delete and re-download the kubeconfig file using the button on the project page of the EIDF portal.

```bash
kubectl get pods
```
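If you maintain your own copy of the kubeconfig file, an alternative sketch that avoids editing the file by hand is `kubectl config set-context` (this assumes your kubeconfig defines a current context):

```bash
# Set the default namespace on the current context in one command.
kubectl config set-context --current --namespace=<project-namespace>
```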