Overview
The EIDF GPU Service (EIDFGPUS) uses Nvidia A100 GPUs as accelerators.
Full Nvidia A100 GPUs each have 40 GB of on-board memory.
Multi-Instance GPU (MIG) technology allows multiple tasks or users to share the same physical GPU (similar to CPU threading).
There are two types of MIG GPU inside the EIDFGPUS: the Nvidia A100 3G.20GB and the Nvidia A100 1G.5GB, which equate to roughly 1/2 and 1/7 of a full Nvidia A100 40 GB GPU respectively.
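In Kubernetes, MIG slices are typically requested through extended resource names exposed by the NVIDIA device plugin, such as `nvidia.com/mig-1g.5gb`. The exact names depend on how a cluster is configured, so treat the following pod sketch as illustrative rather than EIDFGPUS-specific:

```yaml
# Illustrative pod requesting one 1g.5gb MIG slice.
# The resource name assumes the device plugin's "mixed" MIG strategy;
# confirm the names in use on the cluster before relying on them.
apiVersion: v1
kind: Pod
metadata:
  name: mig-example
spec:
  restartPolicy: Never
  containers:
    - name: workload
      image: nvidia/cuda:11.8.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1
```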
The current specification of the EIDFGPUS is:
- 1856 CPU Cores
- 8.7 TiB Memory
- Local Disk Space (Node Image Cache and Local Workspace) - 21 TiB
- Ceph Persistent Volumes (Long Term Data) - up to 100 TiB
- 70 Nvidia A100 40 GB GPUs
- 14 MIG Nvidia A100 40 GB GPUs equating to 28 Nvidia A100 3G.20GB GPUs
- 20 MIG Nvidia A100 40 GB GPUs equating to 140 Nvidia A100 1G.5GB GPUs
The EIDFGPUS is managed using Kubernetes, with up to 8 GPUs on a single node.
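As a minimal sketch of what a submission looks like (the container image and names here are illustrative assumptions, not EIDFGPUS defaults), a Kubernetes Job requesting a single full A100 might be written as:

```yaml
# Illustrative Job requesting one full GPU via the standard
# nvidia.com/gpu extended resource.
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-test
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: gpu-test
          image: nvidia/cuda:11.8.0-base-ubuntu22.04
          command: ["nvidia-smi"]
          resources:
            limits:
              nvidia.com/gpu: 1
```

Such a manifest would be submitted with `kubectl apply -f <file> -n <project-namespace>`; the tutorial below covers this workflow in detail.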
Service Access
Users should have an EIDF account - see EIDF Accounts.
Project Leads can have access to the EIDFGPUS added to their project during the project application process or through a request to the EIDF helpdesk.
Each project will be given a namespace to operate in and a kubeconfig file on a Virtual Machine on the EIDF DSC - information on access to VMs is available here.
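In practice, this means pointing `kubectl` at the supplied kubeconfig (for example via the `KUBECONFIG` environment variable or the `--kubeconfig` flag) and scoping commands to the project namespace with `-n <project-namespace>`; the file's exact location on the VM is project-specific.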
Project Quotas
A standard project namespace has the following initial quota (subject to ongoing review):
- CPU: 100 Cores
- Memory: 1TiB
- GPU: 12
Note that these quotas are the maximum a single project can use, and that during periods of high usage Kubernetes Jobs may be queued while waiting for resources to become available on the cluster.
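Because these caps apply namespace-wide, it helps to give each workload explicit resource requests and limits so that work fits inside the quota; a hypothetical example of the relevant pod fields:

```yaml
# Hypothetical pod with explicit requests/limits; the sum across all
# running pods in the namespace must stay within the project quota.
apiVersion: v1
kind: Pod
metadata:
  name: sized-workload
spec:
  restartPolicy: Never
  containers:
    - name: main
      image: ubuntu:22.04
      command: ["sleep", "3600"]
      resources:
        requests:
          cpu: "8"
          memory: 32Gi
        limits:
          cpu: "16"
          memory: 64Gi
          nvidia.com/gpu: 1
```

If the quota is implemented as a standard Kubernetes ResourceQuota, current consumption can be inspected with `kubectl describe resourcequota -n <project-namespace>`.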
Additional Service Policy Information
Additional information on service policies can be found here.
EIDF GPU Service Tutorial
This tutorial teaches users how to submit tasks to the EIDFGPUS, but it is not a comprehensive overview of Kubernetes.
| Lesson | Objective |
|---|---|
| Getting started with Kubernetes | a. What is Kubernetes? b. How to send a task to a GPU node. c. How to define the GPU resources needed. |
| Requesting persistent volumes with Kubernetes | a. What is a persistent volume? b. How to request a PV resource. |
| Running a PyTorch task | a. Accessing a PyTorch container. b. Submitting a PyTorch task to the cluster. c. Inspecting the results. |
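As a preview of the persistent volumes lesson, a PersistentVolumeClaim against the service's Ceph-backed storage might look roughly like the following; the storage class name is a placeholder assumption, not the actual EIDFGPUS class.

```yaml
# Illustrative PVC; storageClassName is a placeholder - use the class
# name given in the tutorial or by the EIDF helpdesk.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: ceph-rbd
```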
Further Reading and Help
- The Nvidia developers blog provides several examples of how to run ML tasks on a Kubernetes GPU cluster.
- The Kubernetes documentation has a useful `kubectl` cheat sheet.
- More detailed use cases for `kubectl` can be found in the Kubernetes documentation.