CUDA Programming

Kevin Stratford

Material by: Alan Gray, Kevin Stratford



  • Compute Unified Device Architecture
  • Background
  • CUDA blocks and threads
  • Data management

Host and Device

CPU and (one) GPU are combined:
  • Usually referred to as "host" and "device"
  • Each has a separate memory and separate address space
A schematic diagram showing CPU host and GPU device
                  with separate address spaces connected via (PCIe) bus


  • Early graphics interfaces difficult to program
    • Partcularly for scientific applications
  • CUDA developed around 2007 to ease development
  • C/C++ interface
    • Host side interface to control GPU memory etc
    • Kernels executed on the device by threads
  • Extended to CUDA Fortran with appropriate compiler

Streaming Multiprocessors

A schematic showing a GPU device made up of a number
                 of streaming multiprocessors (SMs). Each SM consists
                 of a number of cores and shared memory.
A two level hierarchy:
  • Many streaming multiprocessors each with many cores
  • Exact numbers depend on particular hardware

Grids, Blocks, and Threads

Reflected in programming model
  • Problem abstracted to blocks (map to SMs)
  • Each block contains a number of threads (map to cores)
Don't care about details of mapping to hardware
  • Just describe a grid of blocks and threads
  • Hardware will schedule work as it sees fit

dim3 structure

CUDA introduces a container for x,y,z dimensions

     struct {
       unsigned int x;
       unsigned int y;
       unsigned int z;

     type :: dim3
       integer :: x
       integer :: y
       integer :: z
     end type dim3


/* Consider the one-dimensional loop: */

for (int i = 0; i < LOOP_LENGTH; i++) {
   result[i] = 2*i;

CUDA C Kernel Function

__global__ void myKernel(int * result) {

  int i;

  i = threadIdx.x;
  result[i] = 2*i;

Executing a kernel

/* Kernel is launched by the host by specifying
 * Number of blocks (sometimes "blocksPerGrid")
 * Number of threads per block */

dim3 blocks;
dim3 threadsPerBlock;

blocks.x = 1;
threadsPerBlock.x = LOOP_LENGTH;

myKernel <<< blocks, threadsPerBlock >>> (result);
Referred to as the execution configuration

CUDA Fortran

! In Fortran an analogous kernel is...
attributes(global) subroutine myKernel(result)
  integer, dimension(:) :: result
  integer               :: i

  i = threadIdx%x
  result(i) = 2*i
end subroutine myKernel

! ... with execution ...

blocks%x = 1
threadsPerBlock%x = LOOP_LENGTH
call myKernel <<< blocks, threadsPerBlock >>> (result)

More than one block

/* One block only uses one SM; use of resources is very poor.
 * Usually want large arrays using many blocks. */

__global__ void myKernel(int * result) {

  int i = blockIdx.x*blockDim.x + threadIdx.x;
  result[i] = 2*i;

/* ... with execution ... */

block.x = NBLOCKS;
threadsPerBlock.x = LOOP_LENGTH/NBLOCKS;
myKernel <<< blocks, threadsPerBlock >>> (result);

More than one block: Fortran

attributes(global) subroutine myKernel(result)
  integer, dimension(:) :: result
  integer               :: i

  i = (blockIdx%x - 1)*blockDim%x + threadIdx%x
  result(i) = 2*i
end subroutine myKernel

! ... with execution ...

blocks%x = NBLOCKS
threadsPerBlock%x = LOOP_LENGTH/NBLOCKS
call myKernel <<< blocks, threadsPerBlock >>> (result)

Internal variables: C

All provided by the implementation:
  • Fixed at kernel invocation:

dim3 gridDim;    /* Number of blocks */
dim3 blockDim;   /* Number of threads per block */
  • Unique to each block:

dim3 blockIdx;   /* 0 <= blockIdx.x < gridDim.x etc */
  • Unique to each thread:

dim3 threadIdx;  /* 0 <= threadIdx.x < blockDim.x etc */

Internal variables: Fortran

Again provided by the implementation:
  • Fixed at kernel invocation:

type (dim3) :: gridDim   ! Number of blocks
type (dim3) :: blockDim  ! Number of threads per block
  • Unique to each block:

type (dim3) :: blockIdx  ! 1 <= blockIdx%x <= gridDim%x etc
  • Unique to each thread:

type (dim3) :: threadIdx ! 1 <= threadIdx%x <= blockDim%x etc

Two-dimensional example

__global__ void matrix2d(float a[N][N], float b[N][N],
                         float c[N][N]) {

  int j = blockIdx.x*blockDim.x + threadIdx.x;
  int i = blockIdx.y*blockDim.y + threadIdx.y;

  c[i][j] = a[i][j] + b[i][j];

/* ... with execution, e.g.,  ... */

dim3 blocksPerGrid(N/16, N/16, 1);
dim3 threadsPerBlock(16, 16, 1);

matrix2d <<< blocksPerGrid, threadsPerBlock >>> (a, b, c);

Synchronisation between host and device

Kernel launches are asynchronous
  • Return immediately on the host
  • Synchronisation required to ensure completion
  • Errors can appear asynchronously!

myKernel <<<blocksPerGrid, threadsPerBlock>>> (...)

/* ... could perform independent work here ... */

err = cudaDeviceSynchronize();

/* ... now safe to obtain results of kernel ... */
Many other CUDA operations have asynchronous analogues
  • cudaMemcpyAsync(), ...

Synchronisation on the device

Synchronisation between threads in the same block is possible
  • Allows co-ordination of action in shared memory
  • Allows reductions
Historically, not possible to synchronise between blocks
  • Can only exit the kernel
  • Synchronise on host and start another kernel

Memory Management

Recall host and device have separate address spaces
  • Data accessed by kernel must be in the device memory
  • This is managed largely explicitly

Memory Allocation: C

Allocation managed via standard C pointers

/* For example, provide an allocation of "nSize" floats
 * in the device memory: */

float * data;

err = cudaMalloc(&data, nSize*sizeof(float));


err = cudaFree(data);
Such pointers cannot be dereferenced on the host

Memory Movement: cudaMemcpy()

Initiated on the host:

/* Copy host data values to device memory ... */
err = cudaMemcpy(dataDevice, dataHost, nSize*sizeof(float),

/* And back again ... */
err = cudaMemcpy(dataHost, dataDevice, nSize*sizeof(float),

cudaError_t cudaMemcpy(void * dest, const void * src,
                       size_t count,
                       cudaMemcpyKind kind);

Memory allocation: CUDA Fortran

Declare variable to be in the device memory space
  • Via the device attribute
  • Compiler then knows that the variable should be treated appropriately

! Make an allocation in device memory:
real, device, allocatable :: dataDevice(:)

allocate(dataDevice(nSize), stat = ...)


Or, can use the C-like API
  • cudaMalloc(), cudaFree()

Memory movement: CUDA Fortran

May be performed via simple assignment
  • Again, compiler knows what action to take via declarations

! Copy from host to device

dataDevice(:) = dataHost(:)

! ... and back again ...

dataHost(:) = dataDevice(:)
Can be more explicit using C-like API

err = cudaMemcpy(dataDevice, dataHost, nSize,


CUDA C source
  • File extension .cu by convention
  • Compiled via NVIDIA nvcc

$ nvcc -o example
CUDA Fortran source
  • File extension .cuf by convention
  • Compiled via Portland Group compiler pgf90
  • Use -Mcuda (if file extension not .cuf)

$ pgf90 -Mcuda -o example example.cuf


CUDA C and CUDA Fortran
  • Provide API and extensions for programming NVIDIA GPUs
  • Memory management
  • Kernel execution
CUDA emcompasses wide range of functionality
  • Can make significant progress with a small subset
Still evolving (along with hardware)
  • Currently CUDA v10
  • About one release per year
  • Features can be deprecated / become outmoded