% Optimizing a CUDA application
% Rupert Nash, Kevin Stratford, Alan Gray
Exercise content created by EPCC, The University of Edinburgh. Documentation and source code copyright The University of Edinburgh 2019. Lab style and template created by NVIDIA, see https://nvidia.qwiklab.com/.
In this self-paced, hands-on lab, we will take an existing CUDA application and go through several optimization steps, measuring the performance benefits of each. We will see the importance of minimizing data transfer, enabling coalesced memory access, and tuning the parallel decomposition.
As before, the code can be cloned from GitHub:

```
git clone https://github.com/EPCCed/APT-CUDA.git
cd APT-CUDA/exercises/cuda-reconstruct
```

Or you can use your existing repo (update with `git pull`).
This exercise involves performing a series of optimizations to an existing CUDA application.
You start with an image that looks like this:
*(Image: the edge-detected starting image.)*
Which has been generated from the original:
*(Image: the original photograph of Edinburgh Castle.)*
The first is an image of Edinburgh Castle, processed so that the edges between light and dark areas replace the original picture.
This is an artificial task, but it mimics many scientific applications (e.g. those that solve systems of PDEs) in that the algorithm is iterative, requiring many successive stencil operations.
Each pixel of the new image $M$ is generated based on its neighboring pixel values and the original edge data $E$ by repeatedly performing the following update:

\[M_{i,j} = \frac{M_{i-1,j} + M_{i+1,j} + M_{i,j-1} + M_{i,j+1} - E_{i,j}}{4}\]

The more iterations, the better the reconstruction (although for simplicity we work in greyscale rather than colour).
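As a concrete illustration, one sweep of this update in serial C might look like the following. This is a sketch only: the array names, the two-buffer (Jacobi-style) update, and the halo layout described later are assumptions about the provided source, not quotations from it.

```c
/* One sweep of the reconstruction update (sketch). The image buffers
 * are (n+2) x (n+2) with a one-pixel halo of zeroes, so the interior
 * runs over 1..n; m_new and m_old are swapped between sweeps. (In the
 * exercise the edge array has no halo, so its indexing differs.) */
for (int i = 1; i <= n; i++) {
    for (int j = 1; j <= n; j++) {
        m_new[i][j] = 0.25 * (m_old[i-1][j] + m_old[i+1][j]
                            + m_old[i][j-1] + m_old[i][j+1]
                            - edge[i][j]);
    }
}
```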
You are provided with a working but slow CUDA implementation of the reconstruction algorithm.
First of all, let’s compile and run the code. The code is set up to run the algorithm on both the GPU and the CPU. It compares the outputs from the two runs to verify correctness, and then displays timings for each run.
Choose to work with either C or Fortran.

C:

```
cd src_c
module load nvidia/nvhpc/22.11
make
```

Fortran:

```
cd src_fortran
module load nvidia/nvhpc/22.11
make
```
To run, you need to know your budget code. You can check this by logging into SAFE, navigating to the relevant Cirrus login account, and checking which budgets it can access.
Submit the job with:

```
sbatch --account <YOUR BUDGET CODE> submit.sh
```

Query SLURM for your jobs:

```
squeue -u $USER
```
During on-campus tutorials we have reserved one node (4 GPUs) for the use of the class. You can access this by editing the SLURM script or adding extra options to the `sbatch` command:

```
sbatch --account d171 --qos=reservation --reservation=<reservation ID> submit.sh
```
To view the output image, on Cirrus you can either:

- use `display`, if you have set up X11 forwarding when you connected:

  ```
  module load ImageMagick
  display output.pgm
  ```

- use `convert` to turn the output file into e.g. a JPEG or PNG file, and then copy it to your client:

  ```
  module load ImageMagick
  convert output.pgm output.jpg
  ```
Hopefully you can see that the picture is starting to become clearer. As the algorithm is iterative, there is a loop in the main routine that invokes the kernel $N=100$ times.
Increasing $N$ improves the quality of the reconstruction, but please don't do this during the lab!
If you were to run for 10 million iterations, the resulting image would look like this:
*(Image: the reconstruction after 10 million iterations.)*
Keep a note of the run times of your jobs and what you have changed each time!
## Optimising the code

Now it's time to optimise the code, and improve on the GPU timing printed when you ran it, by editing the source code.
Note that a one pixel wide halo region of zeroes is added to each edge of the various image-data arrays; this simplifies the computation as it allows the edge pixels to be treated in the same manner as other pixels.
Note: the `edge` array, which holds the original edge data, does not require a halo.
You can (and should) profile your code as you optimise it. Basic submission scripts for both Nsight Systems (`profile-nsys.sh`) and Nsight Compute (`profile-ncomp.sh`) are included. Once you have run these, you can transfer the files to your workstation and analyse them with the GUIs (available from NVIDIA: https://developer.nvidia.com/tools-overview).
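For reference, the scripts wrap command lines along the following lines (a sketch: the exact options and executable name are assumptions, so check the provided scripts):

```
nsys profile -o timeline ./reconstruct   # timeline of kernels and transfers
ncu -o kernels ./reconstruct             # per-kernel hardware metrics
```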
A challenge with GPUs and other accelerators is that transferring data between host memory and device memory is often relatively slow. An important optimisation technique is therefore to minimise the amount of data transferred between host and device.
Notice that in the main loop in `reconstruct.cu` (C) or `reconstruct.cuf` (Fortran), the data is copied from GPU memory to host memory and then back to GPU memory at each iteration. This is not in fact necessary: with the exception of the final iteration, when the data must be copied back to the host, it is going to be processed on the GPU again in the next iteration. Therefore, we can optimise by manipulating the GPU memory directly, without expensive transfers.
We can simply copy the output array directly to the input array after each iteration.
In order to do this you will need to (see the sketch after this list for the C version):

C:

- Remove the `cudaMemcpy` calls from inside the main loop.
- Replace them with a `cudaMemcpy` call to copy, directly on the device, from `d_output` to `d_input`.
- Add a new `cudaMemcpy` call after the end of the loop (in between the two calls to `get_current_time()`) to copy the final result back from the GPU to the `output` buffer in host memory.

Fortran:

- Remove the assignments to `output` (from `d_output`) and to `d_input` (from `output`) inside the main loop.
- Replace them with an assignment directly from `d_output` to `d_input`.
- Add a new assignment after the end of the loop (in between the two calls to `cpu_time()`) to copy the final result back from the GPU to the `output` buffer in host memory.
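For the C version, the resulting loop structure might look like this. It is a sketch under assumptions: `memSize`, `ITERATIONS`, and the kernel launch line stand in for whatever the provided source actually declares.

```cuda
for (int iter = 0; iter < ITERATIONS; iter++) {
    /* ... kernel launch as in the provided source ... */

    /* Device-to-device copy: the data never leaves the GPU. */
    cudaMemcpy(d_input, d_output, memSize, cudaMemcpyDeviceToDevice);
}

/* One final copy back to the host, placed between the two
 * get_current_time() calls so it is included in the GPU timing. */
cudaMemcpy(output, d_output, memSize, cudaMemcpyDeviceToHost);
```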
Once you have made these changes, compile and run the code again as above and take note of the time taken by the GPU version.
How does it compare to the previous timing?
The GPU performs best when consecutive CUDA threads access consecutive memory locations, allowing memory coalescing.
For the kernel in `reconstruct_kernels.cu` (C), it can be seen that consecutive threads correspond to consecutive rows of the image, but consecutive memory locations instead correspond to consecutive columns. The threads are not reading from consecutive locations.

For the kernel in `reconstruct_kernels.cuf` (Fortran), it can be seen that consecutive threads correspond to consecutive columns of the image, but consecutive memory locations instead correspond to consecutive rows. The threads are not reading from consecutive locations.
Update the kernel so that the roles of the image row and column are swapped in how pixels are assigned to CUDA threads.
Since the image is perfectly square, you will not need to change the way the kernel is launched.
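To make this concrete for the C version, a coalesced kernel might be structured as follows. This is a sketch under assumptions: row-major storage, a one-pixel halo giving a stride of `n + 2`, and illustrative names; for brevity it also indexes the edge data with the haloed stride, whereas the exercise's `edge` array has no halo.

```cuda
__global__ void reconstruct_coalesced(int n, const float *d_input,
                                      const float *d_edge, float *d_output)
{
    /* One thread per image column; the +1 skips the halo. */
    int col = blockIdx.x * blockDim.x + threadIdx.x + 1;
    if (col > n) return;

    /* Loop over rows: at each step, consecutive threads handle
     * consecutive columns, which are adjacent in the row-major
     * arrays, so the loads and stores coalesce. */
    for (int row = 1; row <= n; row++) {
        int idx = row * (n + 2) + col;
        d_output[idx] = 0.25f * (d_input[idx - (n + 2)] + d_input[idx + (n + 2)]
                               + d_input[idx - 1] + d_input[idx + 1]
                               - d_edge[idx]);
    }
}
```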
How does the performance compare to the previous version?
You should hopefully have seen a noticeable improvement in performance as a result of the changes you made to reduce data transfers between the host and the device and to enable coalescing. However, the current solution is still sub-optimal as it will not create sufficient threads to utilise all the SMs on the GPU - it has low occupancy.
GPU codes typically run best when there are many threads running in parallel, each doing a small part of the work. We can achieve this with our image processing code by using a thread for each pixel of the image, rather than for each row or column as before. CUDA supports 1-, 2- or 3-dimensional decompositions. A 2D decomposition maps most naturally onto the pixels of an image.
Update both your kernel and the code responsible for specifying the decomposition, such that a 2D decomposition is used over both rows and columns.
The original code uses 256 threads per block in a 1D CUDA decomposition. Replace this with 16 threads in each of the X and Y directions of the 2D CUDA decomposition, to give a total of 256 threads per block. Ensure that the number of blocks is specified appropriately in each direction.
Ensure that you retain memory coalescing!
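As a sketch of what this can look like in the C version (illustrative names; assumes a square `n` x `n` image with `n` divisible by 16):

```cuda
__global__ void reconstruct2d(int n, const float *d_input,
                              const float *d_edge, float *d_output)
{
    /* threadIdx.x maps to columns, the memory-contiguous dimension
     * in C, so coalescing is retained; the +1 skips the halo. */
    int col = blockIdx.x * blockDim.x + threadIdx.x + 1;
    int row = blockIdx.y * blockDim.y + threadIdx.y + 1;
    if (col > n || row > n) return;

    int idx = row * (n + 2) + col;
    d_output[idx] = 0.25f * (d_input[idx - (n + 2)] + d_input[idx + (n + 2)]
                           + d_input[idx - 1] + d_input[idx + 1]
                           - d_edge[idx]);
}

/* Host side: 16 x 16 = 256 threads per block, and enough blocks to
 * cover the whole image (assumes n is a multiple of 16). */
dim3 threadsPerBlock(16, 16);
dim3 blocksPerGrid(n / 16, n / 16);
reconstruct2d<<<blocksPerGrid, threadsPerBlock>>>(n, d_input, d_edge, d_output);
```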
Measure performance and compare to the previous versions.
Once you have the 2D kernel working correctly, you can try altering certain parameters and see what effect this has on its performance. In particular, you can investigate the effects of different grid and block sizes.
How does changing the grid and block sizes affect the total runtime?
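For example, you might vary the block shape while keeping the launch correct for sizes that don't divide evenly (a sketch; names are illustrative):

```cuda
/* Try e.g. 32x8, 16x16, 8x32, ... and compare timings. The ceiling
 * division keeps full coverage when n is not a multiple of the block
 * dimensions (the kernel's bounds check discards the excess threads). */
dim3 threadsPerBlock(32, 8);
dim3 blocksPerGrid((n + threadsPerBlock.x - 1) / threadsPerBlock.x,
                   (n + threadsPerBlock.y - 1) / threadsPerBlock.y);
reconstruct2d<<<blocksPerGrid, threadsPerBlock>>>(n, d_input, d_edge, d_output);
```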