
Constant Memory and Shared Memory


Material by: Kevin Stratford


CUDA Memory so far...

  • Global memory
    • Allocated on host
    • Accessible to both host and device for read and write
  • Local variables in kernels
    • Private on a per thread basis
    • Usually expected to be held in registers

Other types of memory

  • Constant cache memory
    • Read only in kernel
    • No cache coherency mechanism required to support writes
    • Fast and effectively very high bandwidth
  • Shared memory
    • Shared between threads in the same block
    • Often declared statically in the kernel (can be dynamic)
    • Lifetime of the kernel

Schematic: C


/* Constant data declared at file scope with
 *  __constant__ memory space qualifier  */

static __constant__ double coeffs[3];

int someHostFunction(...) {

  /* ... assign some values at run time ... */

  double values[3];

  /* ... and before the relevant kernel ... */

  cudaMemcpyToSymbol(coeffs, values, 3*sizeof(double));

  ...
}
    

Schematic: C kernel



/* Still in the appropriate scope ... */

static __constant__ double coeffs[3];

__global__ void someKernel(...) {

  ...

  /* Reference constant data as usual ... */

  result = coeffs[0]*x + coeffs[1]*y + coeffs[2]*z;
}
    

Schematic: Fortran


! Constant variable declared at e.g., module scope
! with constant attribute

real, constant :: coeffs(3)

contains

subroutine someHostRoutine(...)

  ! ...assign some values at runtime ...

  coeffs(:) = values(1:3)

  ! ...and call relevant kernel ...

end subroutine someHostRoutine
    

Schematic: Fortran kernel


! Still in the appropriate scope ...

real, constant :: coeffs(3)

contains

attributes(global) subroutine someKernel(...)

  ! Reference constant data as usual ...

  result = coeffs(1)*x + coeffs(2)*y + coeffs(3)*z

end subroutine someKernel

    

Constant memory

  • A relatively scarce resource
    • Typically 64 kB in total (can inquire at runtime)
    • No huge look-up tables!
    • Also used for kernel actual arguments (passed by value)
  • Any "overflow" will spill to normal global memory
    • ... and accesses will be relatively slow
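The total amount of constant memory can be inquired at run time via cudaGetDeviceProperties(), e.g.:

```cuda
#include <stdio.h>

int main(void) {

  cudaDeviceProp prop;

  cudaGetDeviceProperties(&prop, 0);   /* properties of device 0 */
  printf("Total constant memory: %zu bytes\n", prop.totalConstMem);

  return 0;
}
```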

Shared Memory

  • Shared between threads in a block
    • Useful for temporary values, particularly if significant reuse
    • Marshalling data within a block
    • May be used to perform reductions (sum, min, max)
  • May require care with synchronisation within a block
    • Basic synchronisation is __syncthreads()
    • Many others
  • Declaration
    • C: via __shared__ memory space qualifier
    • Fortran: via shared attribute
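As a sketch of the reduction case mentioned above, a block-level sum via shared memory might look like the following (assuming the number of threads per block is a power of two; names are illustrative):

```cuda
#define THREADS_PER_BLOCK 128   /* assumed: a power of two */

/* Each thread loads one element to shared memory; partial sums
 * are then combined pairwise in a tree, halving the number of
 * active threads at each step, with synchronisation in between. */

__global__ void blockSum(double * input, double * result) {

  __shared__ double tmp[THREADS_PER_BLOCK];

  int idx = threadIdx.x;
  tmp[idx] = input[idx];
  __syncthreads();

  for (int n = THREADS_PER_BLOCK/2; n > 0; n /= 2) {
    if (idx < n) tmp[idx] += tmp[idx + n];
    __syncthreads();
  }

  if (idx == 0) *result = tmp[0];
}
```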

Example: Reverse elements in array


/* Reverse elements so that the order 0,1,2,3,...
 * becomes ...,3,2,1,0
 * Assume we have one block. */

__global__ void reverseElements(int * myArray) {

  __shared__ int tmp[THREADS_PER_BLOCK];

  int idx = threadIdx.x;
  tmp[idx] = myArray[idx];

  __syncthreads();

  myArray[THREADS_PER_BLOCK - (idx+1)] = tmp[idx];
}
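A host-side driver for this kernel might look like the following sketch (one block, THREADS_PER_BLOCK chosen for illustration, error checking omitted; the kernel is repeated here so the example is self-contained):

```cuda
#include <stdio.h>

#define THREADS_PER_BLOCK 128

__global__ void reverseElements(int * myArray) {

  __shared__ int tmp[THREADS_PER_BLOCK];

  int idx = threadIdx.x;
  tmp[idx] = myArray[idx];
  __syncthreads();

  myArray[THREADS_PER_BLOCK - (idx+1)] = tmp[idx];
}

int main(void) {

  int hostArray[THREADS_PER_BLOCK];
  int * deviceArray = NULL;

  for (int i = 0; i < THREADS_PER_BLOCK; i++) hostArray[i] = i;

  cudaMalloc((void **) &deviceArray, THREADS_PER_BLOCK*sizeof(int));
  cudaMemcpy(deviceArray, hostArray, THREADS_PER_BLOCK*sizeof(int),
             cudaMemcpyHostToDevice);

  reverseElements<<<1, THREADS_PER_BLOCK>>>(deviceArray);

  cudaMemcpy(hostArray, deviceArray, THREADS_PER_BLOCK*sizeof(int),
             cudaMemcpyDeviceToHost);

  /* hostArray[] now holds ..., 3, 2, 1, 0 */

  cudaFree(deviceArray);

  return 0;
}
```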

Shared Memory Summary

  • Again, a relatively scarce resource
    • E.g., 48 kB per block (the exact limit is device-dependent)
    • Some care may be required (check at runtime)
  • Various performance considerations
    • E.g., "bank conflicts"
    • Warp divergence related to synchronisation
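The shared memory available per block can likewise be checked at run time, e.g.:

```cuda
#include <stdio.h>

int main(void) {

  cudaDeviceProp prop;

  cudaGetDeviceProperties(&prop, 0);   /* properties of device 0 */
  printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);

  return 0;
}
```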