CUDA Memory: Constant, Shared, Unified


Material by: Kevin Stratford, Rupert Nash


CUDA Memory so far...

  • Global memory
    • Allocated on host
    • Available to both host and device for read/write
  • Local variables in kernels
    • Private on a per thread basis
    • Usually expected to be held in registers

Other types of memory

  • Constant cache memory
  • Shared memory
  • Unified memory

Constant memory

  • Read only in kernel
  • No cache coherency mechanism required to support writes
  • Fast and effectively very high bandwidth

Schematic: C


/* Constant data declared at file scope with
 *  __constant__ memory space qualifier  */

static __constant__ double coeffs[3];

int someHostFunction(...) {

  /* ... assign some values at run time ... */

  double values[3];

  /* ... and before the relevant kernel ... */

  cudaMemcpyToSymbol(coeffs, values, 3*sizeof(double));

  ...
}
    

Schematic: C kernel



/* Still in the appropriate scope ... */

static __constant__ double coeffs[3];

__global__ void someKernel(...) {

  ...

  /* Reference constant data as usual ... */

  result = coeffs[0]*x + coeffs[1]*y + coeffs[2]*z;
}
    

Schematic: Fortran


! Constant variable declared at e.g., module scope
! with constant attribute

real, constant :: coeffs(3)

contains

subroutine someHostRoutine(...)

  ! ...assign some values at runtime ...

  coeffs(:) = values(1:3)

  ! ...and call relevant kernel ...

end subroutine someHostRoutine
    

Schematic: Fortran kernel


! Still in the appropriate scope ...

real, constant :: coeffs(3)

contains

attributes(global) subroutine someKernel(...)

  ! Reference constant data as usual ...

  result = coeffs(1)*x + coeffs(2)*y + coeffs(3)*z

end subroutine someKernel

    

Constant memory summary

  • A relatively scarce resource
    • Typically 64 kB in total (can inquire at runtime)
    • No huge look-up tables!
    • Also used for kernel actual arguments (by value)
  • Any "overflow" will spill to normal global memory
    • ... and accesses will be relatively slow
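
A minimal sketch of the runtime inquiry mentioned above, using cudaGetDeviceProperties (device 0 assumed):


#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {

  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);   /* device 0 assumed */

  printf("Total constant memory: %zu bytes\n", prop.totalConstMem);

  return 0;
}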

Shared Memory

  • Shared between threads in a block
    • Useful for temporary values, particularly if significant reuse
    • Marshalling data within a block
    • May be used to perform reductions (sum, min, max)
  • May require care with synchronisation within a block
    • Basic synchronisation is __syncthreads()
    • Many others
  • Lifetime of the kernel's blocks
    • Only addressable when a block starts executing
    • Released when a block finishes

Declaring Shared Memory

  • C: via __shared__ memory space qualifier
  • Fortran: via shared attribute
  • If the size of the array is known at compile time, just declare the array with that size
  • 
    attributes(global) subroutine reverseElements(d)
        real, shared :: s(TPB) 
        ! ... implementation ...
    end subroutine reverseElements
    
    ! host code...
    call reverseElements<<< BPG, TPB >>>(data)
          

Declaring Shared Memory

  • C: via __shared__ memory space qualifier
  • Fortran: via shared attribute
  • If the size isn't known until runtime, you must supply it (in bytes) at kernel launch via the optional third execution configuration parameter
  • 
    attributes(global) subroutine reverseElements(d)
        real, shared :: s(*)   ! size supplied at kernel launch
        ! ... implementation ...
    end subroutine reverseElements
    
    ! host code...
    size_bytes = TPB*4   ! TPB default (4-byte) reals
    call reverseElements<<< BPG, TPB, size_bytes >>>(data)
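
For comparison, a C sketch of the same dynamic case (THREADS_PER_BLOCK, BLOCKS_PER_GRID and data are assumed host-side names): the shared array is declared extern and unsized, and the size in bytes is passed as the optional third execution configuration parameter.


__global__ void reverseElements(int * myArray) {

  extern __shared__ int tmp[];   /* size fixed at kernel launch */

  /* ... implementation ... */
}

/* host code ... */
size_t size_bytes = THREADS_PER_BLOCK * sizeof(int);
reverseElements<<< BLOCKS_PER_GRID, THREADS_PER_BLOCK, size_bytes >>>(data);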
          

Example: Reverse elements in array


/* Reverse elements so that the order 0,1,2,3,...
 * becomes ...,3,2,1,0
 * Assume we have one block of THREADS_PER_BLOCK threads. */

__global__ void reverseElements(int * myArray) {

  __shared__ int tmp[THREADS_PER_BLOCK];

  int idx = threadIdx.x;
  tmp[idx] = myArray[idx];

  /* Wait until every thread has written its element to tmp[] */
  __syncthreads();

  myArray[THREADS_PER_BLOCK - (idx+1)] = tmp[idx];
}
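
A possible host-side driver for this kernel (a sketch: error checking omitted, h_array is an assumed host array of THREADS_PER_BLOCK ints, and a single block is launched as the comment above assumes):


int * d_array;
cudaMalloc((void **) &d_array, THREADS_PER_BLOCK * sizeof(int));
cudaMemcpy(d_array, h_array, THREADS_PER_BLOCK * sizeof(int), cudaMemcpyHostToDevice);

reverseElements<<< 1, THREADS_PER_BLOCK >>>(d_array);

cudaMemcpy(h_array, d_array, THREADS_PER_BLOCK * sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(d_array);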

Shared Memory Summary

  • Again, a relatively scarce resource
    • E.g., 50 kB per block
    • Some care may be required (check at runtime)
  • Various performance considerations
    • E.g., "bank conflicts"
    • Warp divergence related to synchronisation
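
As one illustration of the bank-conflict point (a sketch, not from the course material): when threads in a warp access a 2D shared tile down a column, padding the row length by one element places consecutive rows in different banks.


#define TILE_DIM 32

/* Transpose a single TILE_DIM x TILE_DIM tile.
 * Launch with one block of dim3(TILE_DIM, TILE_DIM) threads.
 * The +1 padding keeps the column reads free of bank conflicts. */

__global__ void transposeTile(float * out, const float * in) {

  __shared__ float tile[TILE_DIM][TILE_DIM + 1];

  int x = threadIdx.x;
  int y = threadIdx.y;

  tile[y][x] = in[y*TILE_DIM + x];

  __syncthreads();

  out[y*TILE_DIM + x] = tile[x][y];
}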

Atomic access to memory

  • Many algorithms require updates of shared state, e.g. counting zeros in an array (incorrectly!):
  • 
    __global__ void count_zeros(int N, int const* data, int* total) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < N && data[i] == 0) {
        *total += 1;
      }
    }
          
  • This code needs to read, modify, and write back the data stored at *total.
  • Dealing with this (using the techniques mentioned so far) can be complex.

Atomic access to memory

  • The hardware provides facilities to read, update, and write back a memory location atomically - i.e. without interference from other threads
    
    __global__ void count_zeros(int N, int const* data, int* total) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < N && data[i] == 0) {
        atomicAdd(total, 1);
      }
    }
          
  • Works for built-in types with some simple operations - see https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomic-functions for the details.
  • But this is slow if other threads are also accessing the same address - consider using block shared memory to compute an intermediate value that is applied to the global result via a single atomic (see the sketch below).
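
A sketch of that last point: accumulate a per-block count in shared memory (shared-memory atomics within a block are much cheaper than contended global ones), then have one thread per block apply a single global atomicAdd.


__global__ void count_zeros(int N, int const* data, int* total) {

  __shared__ int block_count;

  int i = blockIdx.x * blockDim.x + threadIdx.x;

  if (threadIdx.x == 0) block_count = 0;
  __syncthreads();

  if (i < N && data[i] == 0) {
    atomicAdd(&block_count, 1);       /* shared memory atomic */
  }
  __syncthreads();

  if (threadIdx.x == 0) {
    atomicAdd(total, block_count);    /* one global atomic per block */
  }
}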

Unified Memory

(NB: this is NOT unified virtual addressing.)
  • The GPU has a separate memory space from the host CPU
  • Since CUDA 6 and Kepler (compute capability 3.0) this aspect can be largely hidden from the programmer with automatic data movement
  • However performance is usually worse than doing it yourself
  • Things improved with CUDA 8 and Pascal (compute capability 6.0); unified memory is now a reasonable starting point

Adapt an existing CUDA code to unified memory

  1. Allocate host arrays with cudaMallocManaged
  2. Remove device pointers and calls to cudaMalloc
  3. Remove calls to cudaMemcpy
  4. In kernel launches, use the managed arrays instead (see the sketch below)

But this will be slower
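
A minimal sketch of the steps above (someKernel, N and the launch configuration are placeholders): the managed pointer replaces the separate host and device arrays, and no cudaMemcpy calls are needed. Note the cudaDeviceSynchronize() before the host reads the results.


double * data;
cudaMallocManaged((void **) &data, N * sizeof(double));   /* step 1 */

for (int i = 0; i < N; i++) data[i] = 0.0;   /* host writes directly */

someKernel<<< blocksPerGrid, threadsPerBlock >>>(data);   /* step 4 */
cudaDeviceSynchronize();                     /* before host access */

/* ... use results on host ... */

cudaFree(data);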

Make porting CPU code easier

  1. Allocate key arrays with cudaMallocManaged
  2. Implement kernels and launches
  3. Profile to see if managed memory is the bottleneck
  4. Add manual memory management where needed

Usually takes less programmer effort to get it working

Unified Memory limitations – Kepler and before

  • cudaMallocManaged reserves space on GPU
  • Kernel not running:
    • CPU can access any managed data; the touched page is migrated to host DRAM on a page fault (c.f. swap space)
    • CPU can read/write as necessary
  • Kernel launch moves any pages changed by the CPU back to the GPU (potentially expensive)
  • While a kernel is running, the CPU cannot access any unified memory that the kernel touches

Unified Memory limitations – Pascal and after

  • cudaMallocManaged does not allocate on GPU (GPU memory is used more like a cache)
  • Pages allocated on first touch
  • Pages migrated on demand
  • Pages migrated on demand between GPUs
  • No bulk transfer on kernel launch
  • Can give hints to the runtime to optimise: cudaMemAdvise / cudaMemPrefetchAsync (see the sketch below)
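
A sketch of the two hints in use (someKernel, N and the launch configuration are placeholders):


double * data;
cudaMallocManaged((void **) &data, N * sizeof(double));

int device;
cudaGetDevice(&device);

/* Hint: the data will mostly be read */
cudaMemAdvise(data, N * sizeof(double), cudaMemAdviseSetReadMostly, device);

/* Move the pages to the GPU ahead of the kernel launch */
cudaMemPrefetchAsync(data, N * sizeof(double), device, 0);

someKernel<<< blocksPerGrid, threadsPerBlock >>>(data);

/* Bring the pages back before the host reads the results */
cudaMemPrefetchAsync(data, N * sizeof(double), cudaCpuDeviceId, 0);
cudaDeviceSynchronize();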

Unified Memory Summary

  • Can simplify first implementation
  • Performance cost high on old systems
  • This is much reduced with Pascal and newer