CUDA Memory: Constant, Shared, Unified
Material by: Kevin Stratford, Rupert Nash
CUDA Memory so far...
- Global memory
- Allocated on host
- Available to both host and device for read/write
- Local variables in kernels
- Private on a per thread basis
- Usually expected to be held in registers
Other types of memory
- Constant cache memory
- Shared memory
- Unified memory
Constant memory
- Read only in kernel
- No cache coherency mechanism required to support writes
- Fast and effectively very high bandwidth
Schematic: C
/* Constant data declared at file scope with
* __constant__ memory space qualifier */
static __constant__ double coeffs[3];
int someHostFunction(...) {

  /* ... assign some values at run time ... */
  double values[3];

  /* ... and before the relevant kernel ... */
  cudaMemcpyToSymbol(coeffs, values, 3*sizeof(double));
  ...
}
Schematic: C kernel
/* Still in the appropriate scope ... */
static __constant__ double coeffs[3];
__global__ void someKernel(...) {
  ...
  /* Reference constant data as usual ... */
  result = coeffs[0]*x + coeffs[1]*y + coeffs[2]*z;
}
Schematic: Fortran
! Constant variable declared at e.g., module scope
! with constant attribute
real, constant :: coeffs(3)
contains

  subroutine someHostRoutine(...)
    ! ... assign some values at runtime ...
    coeffs(:) = values(1:3)
    ! ... and call relevant kernel ...
  end subroutine someHostRoutine
Schematic: Fortran kernel
! Still in the appropriate scope ...
real, constant :: coeffs(3)
contains

  attributes(global) subroutine someKernel(...)
    ! Reference constant data as usual ...
    result = coeffs(1)*x + coeffs(2)*y + coeffs(3)*z
  end subroutine someKernel
Constant memory summary
- A relatively scarce resource
- Typically 64 kB in total (can inquire at runtime; see the sketch after this list)
- No huge look-up tables!
- Also used for kernel actual arguments (by value)
- Any "overflow" will spill to normal global memory
- ... and accesses will be relatively slow
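For example, the 64 kB figure can be checked at run time via the device
properties; a minimal sketch (device 0 assumed):

/* Sketch: query the constant memory available on the current device */
#include <stdio.h>

int main(void) {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);   /* device 0 assumed */
  printf("Total constant memory: %zu bytes\n", prop.totalConstMem);
  return 0;
}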
Shared Memory
- Shared between threads in a block
- Useful for temporary values, particularly if there is significant reuse
- Marshalling data within a block
- May be used to perform reductions (sum, min, max)
- May require care with synchronisation within a block
  - Basic synchronisation is __syncthreads()
  - Many others exist
- Lifetime of the kernel's blocks
- Only addressable when a block starts executing
- Released when a block finishes
Declaring Shared Memory
- C: via the __shared__ memory space qualifier
- Fortran: via the shared attribute
- If the size of the array is known at compile time, just include
  the size in the declaration
attributes(global) subroutine reverseElements(d)
  real, shared :: s(TPB)
  ! ... implementation ...
end subroutine reverseElements

! host code ...
call reverseElements<<< BPG, TPB >>>(data)
Declaring Shared Memory
- C: via the __shared__ memory space qualifier
- Fortran: via the shared attribute
- If the size isn't known until runtime, you must supply it at
  kernel launch with the optional third execution configuration
  parameter (Fortran schematic below, followed by a C sketch)
attributes(global) subroutine reverseElements(d)
  real, shared :: s(*)
  ! ... implementation ...
end subroutine reverseElements

! host code ...
size_bytes = TPB*4   ! TPB reals at 4 bytes each
call reverseElements<<< BPG, TPB, size_bytes >>>(data)
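In C, the equivalent (a sketch; names such as BLOCKS_PER_GRID and data are
assumed) declares the array with extern __shared__ and no size, and again
passes the size in bytes as the optional third execution configuration
parameter:

__global__ void reverseElements(int * myArray) {
  extern __shared__ int tmp[];   /* size fixed at launch time */
  /* ... implementation ... */
}

/* host code ... */
size_t size_bytes = THREADS_PER_BLOCK*sizeof(int);
reverseElements<<< BLOCKS_PER_GRID, THREADS_PER_BLOCK, size_bytes >>>(data);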
Example: Reverse elements in array
/* Reverse elements so that the order 0,1,2,3,...
* becomes ...,3,2,1,0
* Assume we have one block. */
__global__ void reverseElements(int * myArray) {

  __shared__ int tmp[THREADS_PER_BLOCK];

  int idx = threadIdx.x;
  tmp[idx] = myArray[idx];

  /* Ensure all threads have read their element before any is overwritten */
  __syncthreads();

  myArray[THREADS_PER_BLOCK - (idx+1)] = tmp[idx];
}
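A corresponding host-side sketch (the names h_array and d_array are
hypothetical) copies the data to the device, launches a single block,
and copies the reversed result back:

int h_array[THREADS_PER_BLOCK];   /* ... filled with input values ... */
int * d_array;

cudaMalloc((void **) &d_array, THREADS_PER_BLOCK*sizeof(int));
cudaMemcpy(d_array, h_array, THREADS_PER_BLOCK*sizeof(int),
           cudaMemcpyHostToDevice);

reverseElements<<< 1, THREADS_PER_BLOCK >>>(d_array);

cudaMemcpy(h_array, d_array, THREADS_PER_BLOCK*sizeof(int),
           cudaMemcpyDeviceToHost);
cudaFree(d_array);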
Shared Memory Summary
- Again, a relatively scarce resource
- E.g., 48 kB per block
- Some care may be required (check at runtime; see the fragment below)
- Various performance considerations
- E.g., "bank conflicts"
- Warp divergence related to synchronisation
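As with constant memory, the per-block limit can be checked at run time;
a minimal fragment (device 0 assumed):

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);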
Atomic access to memory
- Many algorithms require updates of shared state,
e.g. counting zeros in an array (incorrectly!):
__global__ void count_zeros(int N, int const* data, int* total) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < N && data[i] == 0) {
    *total += 1;   /* unsynchronised read-modify-write: a race */
  }
}
This code needs to read, modify, and write back the data stored at
*total. Updates from different threads can interleave, so some
increments are lost. Dealing with this (using the techniques
mentioned so far) can be complex.
Atomic access to memory
- The hardware provides facilities to read, update, and write
back a memory location atomically - i.e. without any other
threads interfering
__global__ void count_zeros(int N, int const* data, int* total) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < N && data[i] == 0) {
    atomicAdd(total, 1);
  }
}
- Works for built-in types with some simple operations - see
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomic-functions
for the details.
- But this is slow if other threads are also accessing the
same address - consider using block shared memory to compute an
intermediate value that is applied to the global result via a
single atomic, as in the sketch below.
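A sketch of that pattern for the counting example (the shared counter
block_count is an addition here): each block accumulates its own count in
shared memory, and one thread per block applies a single atomicAdd to the
global total.

__global__ void count_zeros(int N, int const* data, int* total) {

  __shared__ int block_count;
  int i = blockIdx.x * blockDim.x + threadIdx.x;

  if (threadIdx.x == 0) block_count = 0;
  __syncthreads();

  if (i < N && data[i] == 0) {
    atomicAdd(&block_count, 1);    /* contention confined to one block */
  }
  __syncthreads();

  if (threadIdx.x == 0) {
    atomicAdd(total, block_count); /* one global atomic per block */
  }
}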
Unified Memory
(NB: this is NOT unified virtual addressing.)
- The GPU has a separate memory space from the host CPU
- Since CUDA 6 and Kepler (compute capability 3.0) this aspect can be largely hidden from
the programmer with automatic data movement
- However performance is usually worse than doing it yourself
- Things get better with CUDA 8 and Pascal (compute
capability 6.0) and are now a reasonable starting point
Adapt an existing CUDA code to unified memory
- Allocate host arrays with cudaMallocManaged
- Remove device pointers and calls to cudaMalloc
- Remove calls to cudaMemcpy
- In kernel launches, use the managed (host) arrays instead (sketch below)
But this will be slower
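A minimal sketch of the adapted code (the kernel myKernel, the size n and
the launch configuration are placeholders):

double * data;

cudaMallocManaged((void **) &data, n*sizeof(double));

/* ... initialise data[] directly from the host ... */

myKernel<<< blocksPerGrid, threadsPerBlock >>>(n, data);
cudaDeviceSynchronize();   /* wait for the kernel before host access */

/* ... host reads data[] directly: no cudaMemcpy() needed ... */

cudaFree(data);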
Make porting CPU code easier
- Allocate key arrays with cudaMallocManaged
- Implement kernels and launches
- Profile to see if managed memory is the bottleneck
- Add manual memory management where needed
Usually takes less programmer effort to get it working
Unified Memory limitations – Kepler and before
cudaMallocManaged reserves space on the GPU
- Kernel not running:
  - CPU can access any of the data; the touched page is moved to
    host DRAM on a page fault (cf. swap space)
  - CPU can read/write as necessary
  - Kernel launch moves any pages changed by the CPU back to the GPU
    (potentially expensive)
- Kernel running: CPU cannot access any unified memory that the
  kernel touches
Unified Memory limitations – Pascal and after
cudaMallocManaged does not allocate on the GPU
(GPU memory is used more like a cache)
- Pages allocated on first touch
- Pages migrated on demand
- Pages migrated on demand between GPUs
- No bulk transfer on kernel launch
- Can give hints to the run time to optimise:
  cudaMemAdvise / cudaMemPrefetchAsync (see the sketch below)
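A sketch of such hints for a managed array data of n*sizeof(double) bytes
(device 0, the default stream, and the kernel myKernel are assumed):

int device = 0;

/* Prefer to keep the pages resident on the GPU ... */
cudaMemAdvise(data, n*sizeof(double), cudaMemAdviseSetPreferredLocation, device);

/* ... and move them there before the kernel needs them */
cudaMemPrefetchAsync(data, n*sizeof(double), device, 0);

myKernel<<< blocksPerGrid, threadsPerBlock >>>(n, data);

/* Bring the results back ahead of access from the CPU */
cudaMemPrefetchAsync(data, n*sizeof(double), cudaCpuDeviceId, 0);
cudaDeviceSynchronize();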
Unified Memory Summary
- Can simplify first implementation
- Performance cost high on old systems
- This is much reduced with Pascal and newer