CUDA Memory: Constant, Shared, Unified
Material by: Kevin Stratford, Rupert Nash
CUDA Memory so far...
- Global memory
- Allocated on host
- Available to both host and device for read/write
- Local variables in kernels
- Private on a per thread basis
- Usually expected to be held in registers
Other types of memory
- Constant cache memory
- Shared memory
- Unified memory
Constant memory
- Read only in kernel
- No cache coherency mechanism required to support writes
- Fast and effectively very high bandwidth
Schematic: C
/* Constant data declared at file scope with
* __constant__ memory space qualifier */
static __constant__ double coeffs[3];
int someHostFunction(...) {

  /* ... assign some values at run time ... */
  double values[3];

  /* ... and before the relevant kernel ... */
  cudaMemcpyToSymbol(coeffs, values, 3*sizeof(double));
  ...
}
Schematic: C kernel
/* Still in the appropriate scope ... */
static __constant__ double coeffs[3];
__global__ void someKernel(...) {
  ...
  /* Reference constant data as usual ... */
  result = coeffs[0]*x + coeffs[1]*y + coeffs[2]*z;
}
Schematic: Fortran
! Constant variable declared at e.g., module scope
! with constant attribute
real, constant :: coeffs(3)
contains

  subroutine someHostRoutine(...)
    ! ... assign some values at runtime ...
    coeffs(:) = values(1:3)
    ! ... and call relevant kernel ...
  end subroutine someHostRoutine
Schematic: Fortran kernel
! Still in the appropriate scope ...
real, constant :: coeffs(3)
contains

  attributes(global) subroutine someKernel(...)
    ! Reference constant data as usual ...
    result = coeffs(1)*x + coeffs(2)*y + coeffs(3)*z
  end subroutine someKernel
Constant memory summary
- A relatively scarce resource
- Typically 64 kB in total (can inquire at runtime; see the sketch after this list)
- No huge look-up tables!
- Also used for kernel actual arguments (by value)
- Any "overflow" will spill to normal global memory
- ... and accesses will be relatively slow
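For example, the 64 kB figure can be checked at run time via the device
properties; a minimal sketch (device 0 assumed):

/* Sketch: query the constant memory available on the current device */
#include <stdio.h>

int main(void) {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);   /* device 0 assumed */
  printf("Total constant memory: %zu bytes\n", prop.totalConstMem);
  return 0;
}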
Shared Memory
- Shared between threads in a block
- Useful for temporary values, particularly if there is significant reuse
- Marshalling data within a block
- May be used to perform reductions (sum, min, max)
- May require care with synchronisation within a block
  - Basic synchronisation is __syncthreads()
  - Many others exist
- Lifetime of the kernel's blocks
- Only addressable when a block starts executing
- Released when a block finishes
Declaring Shared Memory
- C: via the __shared__ memory space qualifier
- Fortran: via the shared attribute
- If the size of the array is known at compile time, just include
  the size in the declaration
attributes(global) subroutine reverseElements(d)
  real, shared :: s(TPB)
  ! ... implementation ...
end subroutine reverseElements

! host code ...
call reverseElements<<< BPG, TPB >>>(data)
Declaring Shared Memory
- C: via the __shared__ memory space qualifier
- Fortran: via the shared attribute
- If the size isn't known until runtime, you must supply it at
  kernel launch with the optional third execution configuration
  parameter (Fortran schematic below, followed by a C sketch)
attributes(global) subroutine reverseElements(d)
  real, shared :: s(*)
  ! ... implementation ...
end subroutine reverseElements

! host code ...
size_bytes = TPB*4   ! TPB reals at 4 bytes each
call reverseElements<<< BPG, TPB, size_bytes >>>(data)
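In C, the equivalent (a sketch; names such as BLOCKS_PER_GRID and data are
assumed) declares the array with extern __shared__ and no size, and again
passes the size in bytes as the optional third execution configuration
parameter:

__global__ void reverseElements(int * myArray) {
  extern __shared__ int tmp[];   /* size fixed at launch time */
  /* ... implementation ... */
}

/* host code ... */
size_t size_bytes = THREADS_PER_BLOCK*sizeof(int);
reverseElements<<< BLOCKS_PER_GRID, THREADS_PER_BLOCK, size_bytes >>>(data);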
Example: Reverse elements in array
/* Reverse elements so that the order 0,1,2,3,...
* becomes ...,3,2,1,0
* Assume we have one block. */
__global__ void reverseElements(int * myArray) {

  __shared__ int tmp[THREADS_PER_BLOCK];

  int idx = threadIdx.x;
  tmp[idx] = myArray[idx];

  /* Ensure all threads have read their element before any is overwritten */
  __syncthreads();

  myArray[THREADS_PER_BLOCK - (idx+1)] = tmp[idx];
}
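A corresponding host-side sketch (the names h_array and d_array are
hypothetical) copies the data to the device, launches a single block,
and copies the reversed result back:

int h_array[THREADS_PER_BLOCK];   /* ... filled with input values ... */
int * d_array;

cudaMalloc((void **) &d_array, THREADS_PER_BLOCK*sizeof(int));
cudaMemcpy(d_array, h_array, THREADS_PER_BLOCK*sizeof(int),
           cudaMemcpyHostToDevice);

reverseElements<<< 1, THREADS_PER_BLOCK >>>(d_array);

cudaMemcpy(h_array, d_array, THREADS_PER_BLOCK*sizeof(int),
           cudaMemcpyDeviceToHost);
cudaFree(d_array);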
Shared Memory Summary
- Again, a relatively scarce resource
- E.g., 48 kB per block
- Some care may be required (check at runtime; see the fragment below)
- Various performance considerations
- E.g., "bank conflicts"
- Warp divergence related to synchronisation
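As with constant memory, the per-block limit can be checked at run time;
a minimal fragment (device 0 assumed):

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);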
Atomic access to memory
- Many algorithms require updates of shared state,
e.g. counting zeros in an array (incorrectly!):
__global__ void count_zeros(int N, int const* data, int* total) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < N && data[i] == 0) {
    *total += 1;   /* unsynchronised read-modify-write: a race */
  }
}
This code needs to read, modify, and write back the data stored at
*total. Updates from different threads can interleave, so some
increments are lost. Dealing with this (using the techniques
mentioned so far) can be complex.
Atomic access to memory
- The hardware provides facilities to read, update, and write
back a memory location atomically - i.e. without any other
threads interfering
__global__ void count_zeros(int N, int const* data, int* total) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < N && data[i] == 0) {
    atomicAdd(total, 1);
  }
}
- Works for built-in types with some simple operations - see
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomic-functions
for the details.
- But this is slow if other threads are also accessing the
same address - consider using block shared memory to compute an
intermediate value that is applied to the global result via a
single atomic, as in the sketch below.
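A sketch of that pattern for the counting example (the shared counter
block_count is an addition here): each block accumulates its own count in
shared memory, and one thread per block applies a single atomicAdd to the
global total.

__global__ void count_zeros(int N, int const* data, int* total) {

  __shared__ int block_count;
  int i = blockIdx.x * blockDim.x + threadIdx.x;

  if (threadIdx.x == 0) block_count = 0;
  __syncthreads();

  if (i < N && data[i] == 0) {
    atomicAdd(&block_count, 1);    /* contention confined to one block */
  }
  __syncthreads();

  if (threadIdx.x == 0) {
    atomicAdd(total, block_count); /* one global atomic per block */
  }
}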
Unified Memory
(NB: this is NOT unified virtual addressing.)
- The GPU has a separate memory space from the host CPU
- Since CUDA 6 and Kepler (compute capability 3.0) this aspect can be largely hidden from
the programmer with automatic data movement
- However performance is usually worse than doing it yourself
- Things get better with CUDA 8 and Pascal (compute
capability 6.0) and are now a reasonable starting point
Adapt an existing CUDA code to unified memory
- Allocate host arrays with cudaMallocManaged
- Remove device pointers and calls to cudaMalloc
- Remove calls to cudaMemcpy
- In kernel launches, use the managed (host) arrays instead (sketch below)
But this will be slower
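A minimal sketch of the adapted code (the kernel myKernel, the size n and
the launch configuration are placeholders):

double * data;

cudaMallocManaged((void **) &data, n*sizeof(double));

/* ... initialise data[] directly from the host ... */

myKernel<<< blocksPerGrid, threadsPerBlock >>>(n, data);
cudaDeviceSynchronize();   /* wait for the kernel before host access */

/* ... host reads data[] directly: no cudaMemcpy() needed ... */

cudaFree(data);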
Make porting CPU code easier
- Allocate key arrays with cudaMallocManaged
- Implement kernels and launches
- Profile to see if managed memory is the bottleneck
- Add manual memory management where needed
Usually takes less programmer effort to get it working
Unified Memory limitations – Kepler and before
cudaMallocManaged reserves space on the GPU
- Kernel not running:
  - CPU can access any of the data; the touched page is moved to
    host DRAM on a page fault (cf. swap space)
  - CPU can read/write as necessary
  - Kernel launch moves any pages changed by the CPU back to the GPU
    (potentially expensive)
- Kernel running: CPU cannot access any unified memory that the
  kernel touches
Unified Memory limitations – Pascal and after
cudaMallocManaged does not allocate on the GPU
(GPU memory is used more like a cache)
- Pages allocated on first touch
- Pages migrated on demand
- Pages migrated on demand between GPUs
- No bulk transfer on kernel launch
- Can give hints to the run time to optimise:
  cudaMemAdvise / cudaMemPrefetchAsync (see the sketch below)
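A sketch of such hints for a managed array data of n*sizeof(double) bytes
(device 0, the default stream, and the kernel myKernel are assumed):

int device = 0;

/* Prefer to keep the pages resident on the GPU ... */
cudaMemAdvise(data, n*sizeof(double), cudaMemAdviseSetPreferredLocation, device);

/* ... and move them there before the kernel needs them */
cudaMemPrefetchAsync(data, n*sizeof(double), device, 0);

myKernel<<< blocksPerGrid, threadsPerBlock >>>(n, data);

/* Bring the results back ahead of access from the CPU */
cudaMemPrefetchAsync(data, n*sizeof(double), cudaCpuDeviceId, 0);
cudaDeviceSynchronize();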
Unified Memory Summary
- Can simplify first implementation
- Performance cost high on old systems
- This is much reduced with Pascal and newer