Profiling
This section briefly describes how to use the NVIDIA Nsight tools to profile an application on Tursa. The process is presented as a short worked example using a simple application.
For full details, see the NVIDIA Nsight documentation:
Important
The Nsight GUI is not available on Tursa and you cannot connect a local GUI to Tursa over SSH due to limitations in the SSH module in the Nsight GUI. If you want to visualise profiles, you must download them from Tursa to your local system where you have installed the GUI.
Credit
Thanks to Paul Graham of NVIDIA for agreeing to share this example.
Example code
The example uses the CUDA code below. The rest of the exercise assumes you have saved it to a file called vector-add.cu on Tursa.
#include <stdio.h>
#include <stdlib.h>

/*
 * Host function to initialize vector elements. This function
 * simply sets every element of the vector to the value `num`.
 */
void initWith(float num, float *a, int N)
{
  for(int i = 0; i < N; ++i)
  {
    a[i] = num;
  }
}

/*
 * Device kernel stores into `result` the sum of each
 * same-indexed value of `a` and `b`. The grid-stride loop
 * lets the grid cover vectors with more elements than threads.
 */
__global__
void addVectorsInto(float *result, float *a, float *b, int N)
{
  int index = threadIdx.x + blockIdx.x * blockDim.x;
  int stride = blockDim.x * gridDim.x;

  for(int i = index; i < N; i += stride)
  {
    result[i] = a[i] + b[i];
  }
}

/*
 * Host function to confirm values in `vector`. This function
 * assumes all values are the same `target` value.
 */
void checkElementsAre(float target, float *vector, int N)
{
  for(int i = 0; i < N; i++)
  {
    if(vector[i] != target)
    {
      printf("FAIL: vector[%d] - %0.0f does not equal %0.0f\n", i, vector[i], target);
      exit(1);
    }
  }
  printf("Success! All values calculated correctly.\n");
}

int main()
{
  int deviceId;
  int numberOfSMs;

  /* Query the current device and its number of streaming multiprocessors (SMs) */
  cudaGetDevice(&deviceId);
  cudaDeviceGetAttribute(&numberOfSMs, cudaDevAttrMultiProcessorCount, deviceId);
  printf("Device ID: %d\tNumber of SMs: %d\n", deviceId, numberOfSMs);

  const int N = 2<<24;
  size_t size = N * sizeof(float);

  float *a;
  float *b;
  float *c;

  /* Managed (unified) memory is accessible from both host and device */
  cudaMallocManaged(&a, size);
  cudaMallocManaged(&b, size);
  cudaMallocManaged(&c, size);

  /* Initialize the vectors on the host */
  initWith(3, a, N);
  initWith(4, b, N);
  initWith(0, c, N);

  size_t threadsPerBlock;
  size_t numberOfBlocks;

  /*
   * nsys should register performance changes when execution configuration
   * is updated.
   */
  threadsPerBlock = 256;
  numberOfBlocks = 32 * numberOfSMs;

  cudaError_t addVectorsErr;
  cudaError_t asyncErr;

  addVectorsInto<<<numberOfBlocks, threadsPerBlock>>>(c, a, b, N);

  /* Check for kernel launch errors */
  addVectorsErr = cudaGetLastError();
  if(addVectorsErr != cudaSuccess) printf("Error: %s\n", cudaGetErrorString(addVectorsErr));

  /* Wait for the kernel to finish and check for execution errors */
  asyncErr = cudaDeviceSynchronize();
  if(asyncErr != cudaSuccess) printf("Error: %s\n", cudaGetErrorString(asyncErr));

  checkElementsAre(7, c, N);

  cudaFree(a);
  cudaFree(b);
  cudaFree(c);
}
Compile the code
Compile the example code:
module load nvhpc/23.5-nompi
module load gcc/12.2.0
nvcc -o vector-add.exe vector-add.cu
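As an optional variation on the compile step, the sketch below targets the A100 architecture explicitly and embeds source line information that the profilers can use. The `-arch=sm_80` value is an assumption based on the compute capability (CC 8.0) that Nsight Compute reports for the GPU later in this section, and the `NVCC_FLAGS` variable is purely illustrative. Neither flag is required for the rest of the exercise.

```shell
# Assumed flags (not part of the original exercise):
#   -arch=sm_80  generate code for compute capability 8.0 (A100)
#   -lineinfo    embed source line information for the profilers
NVCC_FLAGS="-arch=sm_80 -lineinfo"

if command -v nvcc >/dev/null 2>&1; then
    nvcc ${NVCC_FLAGS} -o vector-add.exe vector-add.cu
else
    echo "nvcc not found: load the nvhpc module first" >&2
fi
```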
Test the example
Create a job submission script to run the example code:
#!/bin/bash
#SBATCH --job-name=vector-add
#SBATCH --time=0:5:0
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=32
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:4
#SBATCH --partition=gpu-a100-40
#SBATCH --qos=dev
#SBATCH --account=[add your budget code]
# Load the correct modules
module load nvhpc/23.5-nompi
module load gcc/12.2.0
./vector-add.exe
When you submit this, you should see the code produce output like:
Device ID: 0 Number of SMs: 108
Success! All values calculated correctly.
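The problem size and launch configuration can be checked by hand from the source. The sketch below uses shell arithmetic to do so, taking the 108 SMs printed above as given and assuming `sizeof(float)` is 4 bytes; the variable names are illustrative only.

```shell
# Values taken from vector-add.cu and the job output above
NUM_SMS=108                               # "Number of SMs" printed by the job
THREADS_PER_BLOCK=256
N=$(( 2 << 24 ))                          # elements per vector
BYTES_PER_VECTOR=$(( N * 4 ))             # assumes sizeof(float) == 4
NUM_BLOCKS=$(( 32 * NUM_SMS ))
TOTAL_THREADS=$(( NUM_BLOCKS * THREADS_PER_BLOCK ))

echo "Elements per vector: ${N}"
echo "Bytes per vector:    ${BYTES_PER_VECTOR}"
echo "Bytes for a, b, c:   $(( 3 * BYTES_PER_VECTOR ))"
echo "Blocks in grid:      ${NUM_BLOCKS}"
echo "Threads in grid:     ${TOTAL_THREADS}"
```

Each vector holds 33,554,432 elements (about 134.2 MB), so the three managed allocations total about 402.7 MB, and the grid is 3,456 blocks of 256 threads (884,736 threads). Because there are fewer threads than elements, each thread processes several elements through the stride loop in `addVectorsInto`.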
Use Nsight Systems to generate a profile
Create a job submission script to get a profile of the example application:
#!/bin/bash
#SBATCH --job-name=vector-add
#SBATCH --time=0:5:0
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=32
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:4
#SBATCH --partition=gpu-a100-40
#SBATCH --qos=dev
#SBATCH --account=[add your budget code]
# Load the correct modules
module load nvhpc/23.5-nompi
module load gcc/12.2.0
nsys profile --stats=true ./vector-add.exe
This should produce output similar to the following:
[1/8] [========================100%] report1.nsys-rep
[2/8] [========================100%] report1.sqlite
SKIPPED: /mnt/lustre/tursafs1/home/t01/t01/dc-user1/report1.sqlite does not contain NV Tools Extension (NVTX) data.
[3/8] Executing 'nvtx_sum' stats report
[4/8] Executing 'osrt_sum' stats report
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ---------- ---------- -------- ---------- ----------- ---------------------
84.5 3740154532 120 31167954.4 10103030.0 7054 1413120085 130155802.9 poll
8.5 376952738 1065 353946.2 20882.0 1397 29245539 1359316.7 ioctl
5.2 230556453 106 2175060.9 2094310.0 1815 20983399 2645399.3 sem_timedwait
0.9 38619004 7 5517000.6 8521.0 1467 20053805 9423347.1 fread
0.3 15445783 58 266306.6 5168.0 2794 11102745 1479367.5 fopen
0.3 13440698 26 516949.9 6425.5 2305 5145173 1446432.8 mmap
0.2 6658153 10 665815.3 1431.5 1396 6644466 2100683.7 dup
0.0 1609153 42 38313.2 8520.0 6635 937695 143076.8 mmap64
0.0 621661 4 155415.3 152814.0 121175 194858 30933.7 pthread_create
0.0 536320 102 5258.0 1816.0 978 213296 25541.3 fcntl
0.0 533379 52 10257.3 2375.0 1885 260160 40144.9 fclose
0.0 485683 83 5851.6 5169.0 2794 22349 2604.6 open64
0.0 106926 64 1670.7 1467.0 978 2864 389.8 pthread_mutex_trylock
0.0 94711 29 3265.9 1397.0 908 58668 10657.4 fgets
0.0 73122 14 5223.0 4260.0 1816 16552 3478.6 write
0.0 54408 11 4946.2 4679.0 2794 7124 1502.6 munmap
0.0 49799 7 7114.1 7054.0 2445 13829 3811.4 open
0.0 47074 17 2769.1 2794.0 1815 5168 868.5 read
0.0 23257 3 7752.3 8521.0 4260 10476 3178.5 pipe2
0.0 18368 2 9184.0 9184.0 7054 11314 3012.3 socket
0.0 11873 1 11873.0 11873.0 11873 11873 0.0 connect
0.0 2864 1 2864.0 2864.0 2864 2864 0.0 bind
0.0 1886 1 1886.0 1886.0 1886 1886 0.0 listen
[5/8] Executing 'cuda_api_sum' stats report
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ----------- ---------- -------- --------- ----------- ----------------------
66.9 326556539 3 108852179.7 69982.0 61601 326424956 188423551.5 cudaMallocManaged
17.4 84859556 1 84859556.0 84859556.0 84859556 84859556 0.0 cudaDeviceSynchronize
12.9 62931338 1 62931338.0 62931338.0 62931338 62931338 0.0 cudaLaunchKernel
2.8 13646243 3 4548747.7 4418958.0 4016391 5210894 607736.3 cudaFree
0.0 7054 1 7054.0 7054.0 7054 7054 0.0 cuModuleGetLoadingMode
[6/8] Executing 'cuda_gpu_kern_sum' stats report
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ---------- ---------- -------- -------- ----------- ----------------------------------------------
100.0 84862110 1 84862110.0 84862110.0 84862110 84862110 0.0 addVectorsInto(float *, float *, float *, int)
[7/8] Executing 'cuda_gpu_mem_time_sum' stats report
Time (%) Total Time (ns) Count Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Operation
-------- --------------- ----- -------- -------- -------- -------- ----------- ---------------------------------
81.9 48978013 10109 4845.0 3455.0 2656 51328 5656.4 [CUDA Unified Memory memcpy HtoD]
18.1 10801550 768 14064.5 4095.0 2463 79840 21695.6 [CUDA Unified Memory memcpy DtoH]
[8/8] Executing 'cuda_gpu_mem_size_sum' stats report
Total (MB) Count Avg (MB) Med (MB) Min (MB) Max (MB) StdDev (MB) Operation
---------- ----- -------- -------- -------- -------- ----------- ---------------------------------
402.653 10109 0.040 0.008 0.004 1.044 0.135 [CUDA Unified Memory memcpy HtoD]
134.218 768 0.175 0.033 0.004 1.044 0.301 [CUDA Unified Memory memcpy DtoH]
Generated:
/mnt/lustre/tursafs1/home/t01/t01/dc-user1/report1.nsys-rep
/mnt/lustre/tursafs1/home/t01/t01/dc-user1/report1.sqlite
You can download the report1.nsys-rep file to your local system and load it into the Nsight Systems GUI for visualisation.
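By default the report takes a generic name such as `report1`, as in the output above. One possible variation, sketched here, names the report after the Slurm job so that it is easier to identify once downloaded. The `--output` option is a standard `nsys profile` option; the naming scheme and the `REPORT_NAME` variable are illustrative assumptions rather than a Tursa convention.

```shell
# Hypothetical naming scheme: fall back to "manual" outside a Slurm job
REPORT_NAME="vector-add-${SLURM_JOB_ID:-manual}"

if command -v nsys >/dev/null 2>&1; then
    # Writes ${REPORT_NAME}.nsys-rep instead of report1.nsys-rep
    nsys profile --stats=true --output="${REPORT_NAME}" ./vector-add.exe
else
    echo "nsys not found: load the nvhpc module first" >&2
fi
```

This would replace the `nsys profile` line in the job submission script; the `#SBATCH` options and module loads stay the same.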
Use Nsight Compute to investigate hardware counters
Create a job submission script to get a profile of the hardware counters for the example application:
#!/bin/bash
#SBATCH --job-name=vector-add
#SBATCH --time=0:5:0
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=32
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:4
#SBATCH --partition=gpu-a100-40
#SBATCH --qos=dev
#SBATCH --account=[add your budget code]
# Load the correct modules
module load nvhpc/23.5-nompi
module load gcc/12.2.0
ncu ./vector-add.exe
This should produce output similar to the following:
[165308] vector-add.exe@127.0.0.1
addVectorsInto(float *, float *, float *, int) (3456, 1, 1)x(256, 1, 1), Context 1, Stream 7, Device 0, CC 8.0
Section: GPU Speed Of Light Throughput
----------------------- ------------- ------------
Metric Name Metric Unit Metric Value
----------------------- ------------- ------------
DRAM Frequency cycle/nsecond 1.20
SM Frequency cycle/nsecond 1.08
Elapsed Cycles cycle 322427
Memory Throughput % 85.69
DRAM Throughput % 85.69
Duration usecond 299.23
L1/TEX Cache Throughput % 17.69
L2 Cache Throughput % 68.64
SM Active Cycles cycle 314986.82
Compute (SM) Throughput % 9.12
----------------------- ------------- ------------
INF The kernel is utilizing greater than 80.0% of the available compute or memory performance of the device. To
further improve performance, work will likely need to be shifted from the most utilized to another unit.
Start by analyzing DRAM in the Memory Workload Analysis section.
Section: Launch Statistics
-------------------------------- --------------- ---------------
Metric Name Metric Unit Metric Value
-------------------------------- --------------- ---------------
Block Size 256
Function Cache Configuration CachePreferNone
Grid Size 3456
Registers Per Thread register/thread 26
Shared Memory Configuration Size Kbyte 32.77
Driver Shared Memory Per Block Kbyte/block 1.02
Dynamic Shared Memory Per Block byte/block 0
Static Shared Memory Per Block byte/block 0
Threads thread 884736
Waves Per SM 4
-------------------------------- --------------- ---------------
Section: Occupancy
------------------------------- ----------- ------------
Metric Name Metric Unit Metric Value
------------------------------- ----------- ------------
Block Limit SM block 32
Block Limit Registers block 8
Block Limit Shared Mem block 32
Block Limit Warps block 8
Theoretical Active Warps per SM warp 64
Theoretical Occupancy % 100
Achieved Occupancy % 96.24
Achieved Active Warps Per SM warp 61.59
------------------------------- ----------- ------------
INF This kernel's theoretical occupancy is not impacted by any block limit.
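To inspect these metrics in the Nsight Compute GUI on a local system, the results can be written to a report file instead of only being printed. The sketch below assumes the standard `ncu` options `--export`, `--kernel-name` and `--set`; the report name and the `NCU_REPORT` variable are placeholders.

```shell
NCU_REPORT="vector-add-ncu"     # placeholder name; ncu appends .ncu-rep

if command -v ncu >/dev/null 2>&1; then
    # --kernel-name restricts profiling to the addVectorsInto kernel
    # --set full collects the full set of sections (slower than the default)
    ncu --export "${NCU_REPORT}" --kernel-name addVectorsInto --set full ./vector-add.exe
else
    echo "ncu not found: load the nvhpc module first" >&2
fi
```

This would replace the `ncu` line in the job submission script. The resulting report file can then be downloaded in the same way as the Nsight Systems report.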