Performance Optimisation


Material by: Alan Gray, Kevin Stratford


Sources of poor performance

  • Potential sources of poor performance include:
    • Lack of parallelism
    • Copying data to/from device
    • Device under-utilisation / memory latency
    • Memory bandwidth
    • Code branches
  • Possible solutions
    • Should be relevant on many GPU devices
    • "Mileage may vary"
  • Some comments on profiling
    • Do you have a performance problem?

Exposing parallelism

  • Amdahl's Law:
    • Parallel performance is limited by the fraction of code that is serial (see the formula below)
    • Applies particularly to GPUs
  • Performance relies on use of many parallel threads
    • Degree of parallelism higher than typical CPU
    • Typically want at least $O(10^5) - O(10^6)$ threads in a kernel
  • Effort must be made to expose parallelism
    • As much as absolutely possible
    • Rewriting / refactoring / different algorithm
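
In its standard form: if a fraction $p$ of the runtime can be parallelised across $N$ threads, the overall speedup is

$$ S(N) = \frac{1}{(1 - p) + p/N} $$

so the serial fraction $(1 - p)$ caps the speedup at $1/(1 - p)$, however many threads the GPU provides.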

Copying between host and device

  • Separate memory spaces mean some copying is inevitable
    • Via the PCI Express (PCIe) bus
    • Relatively slow/expensive
  • Unnecessary copies simply must be avoided
    • Keep data resident on the device
    • May involve moving all relevant code to the device
    • Recalculation / extra computation instead of data communication

Removing data transfer

Consider a common pattern:

for (it = 0; it < nTimeSteps; it++) {
  myCheapHostOperation(hostData);
  cudaMemcpy(..., cudaMemcpyHostToDevice);
  myExpensiveKernel <<<...>>> (deviceData, ...);
  cudaMemcpy(..., cudaMemcpyDeviceToHost);
}
    
This should be refactored so that the data stays resident on the device, with the cheap host operation ported to a kernel:

cudaMemcpy(..., cudaMemcpyHostToDevice);

for (it = 0; it < nTimeSteps; it++) {
  myCheapKernel <<< ... >>> (deviceData, ...);
  myExpensiveKernel <<< ... >>> (deviceData, ...);
}

cudaMemcpy(..., cudaMemcpyDeviceToHost);
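
As a fuller, self-contained sketch of this resident-data pattern (the kernel bodies, names such as n and nTimeSteps, and all sizes here are illustrative):

#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void myCheapKernel(double *data, int n) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) data[i] += 1.0;   /* work formerly done on the host */
}

__global__ void myExpensiveKernel(double *data, int n) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) data[i] *= 2.0;   /* stands in for the expensive work */
}

int main(void) {
  int n = 1 << 20;             /* illustrative problem size */
  int nTimeSteps = 100;
  size_t sz = n*sizeof(double);

  double *hostData = (double *) malloc(sz);
  double *deviceData;
  for (int i = 0; i < n; i++) hostData[i] = 1.0;

  cudaMalloc((void **) &deviceData, sz);

  /* Single copy to the device before the time-step loop */
  cudaMemcpy(deviceData, hostData, sz, cudaMemcpyHostToDevice);

  int threadsPerBlock = 256;
  int blocksPerGrid = (n + threadsPerBlock - 1)/threadsPerBlock;

  for (int it = 0; it < nTimeSteps; it++) {
    myCheapKernel<<<blocksPerGrid, threadsPerBlock>>>(deviceData, n);
    myExpensiveKernel<<<blocksPerGrid, threadsPerBlock>>>(deviceData, n);
  }

  /* Single copy back once the loop has finished */
  cudaMemcpy(hostData, deviceData, sz, cudaMemcpyDeviceToHost);

  cudaFree(deviceData);
  free(hostData);
  return 0;
}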
   

Occupancy and latency hiding

  • Work is decomposed and distributed between threads
    • Might suggest we want as many threads as there are cores
    • ...otherwise some cores will be left idle
  • Actually want $N_{threads} \gg N_{cores}$
  • Latency for access to main memory
    • Perhaps 100 clock cycles
    • If other threads are available, they can be swapped in quickly to hide the latency

Example

Suppose we have a two-dimensional loop

for (i = 0; i < 512; i++) {
  for (j = 0; j < 512; j++) {
    /* ... work ... */
  }
}
Parallelise inner loop only
  • Can use 512 threads
  • Poor occupancy!
Parallelise both loops
  • Can use $512 \times 512 = 262,\!144$ threads
  • Much better!

CPU Caching


/* C: recall right-most index runs fastest */
for (i = 0; i < NI; i++) {
  for (j = 0; j < NJ; j++) {
    output[i][j] = input[i][j];
  }
}
    

! Fortran: recall left-most index runs fastest
do j = 1, NJ
  do i = 1, NI
    output(i,j) = input(i,j)
  end do
end do
    
Each individual thread then has consecutive memory accesses (good cache behaviour)

Memory coalescing

  • GPUs have a high peak memory bandwidth
    • But only achieved when accesses are coalesced
    • That is, when:

    consecutive threads access consecutive memory locations

  • If not, access may be serialised
    • Performance disaster
    • Need to refactor to allow coalescing

So, what is the correct order?

In one dimension, the picture is relatively simple
  • threadsPerBlock = (nThreads, 1, 1)
  • Consecutive threads are those with consecutive index

/* In C: */
idx = blockIdx.x*blockDim.x + threadIdx.x;
output[idx] = input[idx];
      

! In Fortran:
idx = (blockIdx%x - 1)*blockDim%x + threadIdx%x
output(idx) = input(idx)
      
Both good; accesses are coalesced.

Two-dimensional array: C

Recall: right-most index runs fastest

/* Bad: consecutive threads have strided access */
i = blockIdx.x*blockDim.x + threadIdx.x;
for (j = 0; j < NJ; j++) {
  output[i][j] = input[i][j];
}
    

/* Good: consecutive threads have contiguous access */
j = blockIdx.x*blockDim.x + threadIdx.x;
for (i = 0; i < NI; i++) {
  output[i][j] = input[i][j];
}
    

Two-dimensional array: Fortran

Recall: left-most index runs fastest

! Bad: consecutive threads have strided access
j = (blockIdx%x - 1)*blockDim%x + threadIdx%x
do i = 1, NI
  output(i, j) = input(i, j)
end do
    

! Good: consecutive threads have contiguous access
i = (blockIdx%x - 1)*blockDim%x + threadIdx%x
do j = 1, NJ
  output(i, j) = input(i, j)
end do
    

Two-dimensional decomposition

More complex
  • blocksPerGrid = (nBlocksX, nBlocksY, 1)
  • threadsPerBlock = (nThreadsX, nThreadsY, 1)
  • x counts fastest, then y (and then z, in three dimensions)

/* C: note apparent transposition of i, j here... */
int j = blockIdx.x*blockDim.x + threadIdx.x;
int i = blockIdx.y*blockDim.y + threadIdx.y;

output[i][j] = input[i][j];
    

! Fortran: looks more natural
i = (blockIdx%x - 1)*blockDim%x + threadIdx%x
j = (blockIdx%y - 1)*blockDim%y + threadIdx%y

output(i, j) = input(i, j)
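
A minimal, self-contained sketch of a matching launch configuration in CUDA C (the kernel name, array sizes, and 16x16 block shape are illustrative; the 2D array is flattened to 1D here):

#include <cuda_runtime.h>

#define NI 512
#define NJ 512

/* Copy kernel using the C indexing above: x spans j (contiguous), y spans i */
__global__ void copyKernel(double *output, const double *input) {
  int j = blockIdx.x*blockDim.x + threadIdx.x;
  int i = blockIdx.y*blockDim.y + threadIdx.y;
  output[i*NJ + j] = input[i*NJ + j];   /* flattened [i][j] */
}

int main(void) {
  double *d_in, *d_out;
  cudaMalloc((void **) &d_in,  NI*NJ*sizeof(double));   /* data initialisation omitted */
  cudaMalloc((void **) &d_out, NI*NJ*sizeof(double));

  dim3 threadsPerBlock(16, 16, 1);        /* x varies fastest within a block       */
  dim3 blocksPerGrid(NJ/16, NI/16, 1);    /* assumes NI, NJ are multiples of 16    */

  copyKernel<<<blocksPerGrid, threadsPerBlock>>>(d_out, d_in);
  cudaDeviceSynchronize();

  cudaFree(d_in);
  cudaFree(d_out);
  return 0;
}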
    

Code branching

Threads are scheduled in groups of 32
  • Share same instruction scheduling hardware units
  • A group is referred to as a warp
  • Warp executes instructions in lock-step (SIMT)
Branches in the code can cause serialisation
  • If threads within a warp diverge, the warp executes each branch in turn
  • Results from the untaken branch are thrown away

Avoiding warp divergence

Imagine we want to split threads into two groups:

/* Bad: threads in same warp diverge... */

tid = blockIdx.x*blockDim.x + threadIdx.x;

if (tid % 2 == 0) {
  /* Threads 0, 2, 4, ... do one thing ... */
}
else {
  /* Threads 1, 3, 5, ... do something else */
}
    

Avoiding warp divergence



/* Good: threads in same warp follow same path ...
 * Note use of the built-in variable "warpSize" */

tid = blockIdx.x*blockDim.x + threadIdx.x;

if ( (tid / warpSize) % 2 == 0) {
  /* Threads 0, 1, 2, 3, 4, ... do one thing ... */
}
else {
  /* Threads 32, 33, 34, 35,  ... do something else */
}
    

Performance Problem?

The compiler can help to identify potential problems

$ nvcc -Xptxas -v ...
    
Verbose option for PTX stage
  • PTX (Parallel Thread Execution) is an intermediate form; ptxas is its assembler
  • Reports register usage, constant/shared memory usage in kernels
  • Reports spills to global memory (very harmful to performance)
  • Use iteratively during development

Profiling CUDA code

Command-line profiler nvprof
  • Provides traditional text profile report

$ nvprof ./cudaExecutable
    
The NVIDIA Visual Profiler provides a graphical timeline view

Summary

With care, a good fraction of peak performance is possible
  • This can be quite difficult to achieve on a CPU
Not all problems well suited
  • Often just not enough parallelism