Performance Optimisation


Material by: Alan Gray, Kevin Stratford


Sources of poor performance

  • Potential sources of poor performance include:
    • Lack of parallelism
    • Copying data to/from device
    • Device under-utilisation / memory latency
    • Memory bandwidth
    • Code branches
  • Possible solutions
    • Should be relevant on many GPU devices
    • "Mileage may vary"
  • Some comments on profiling
    • Do you have a performance problem?

Exposing parallelism

  • Amdahl's Law:
    • Parallel performance is limited by the fraction of code that is serial (see the formula below)
    • Applies particularly to GPUs
  • Performance relies on use of many parallel threads
    • Degree of parallelism higher than typical CPU
    • Typically want at least $O(10^5) - O(10^6)$ threads in a kernel
  • Effort must be made to expose parallelism
    • As much as absolutely possible
    • Rewriting / refactoring / different algorithm
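
In its standard form: if a fraction $p$ of the runtime can be parallelised across $N$ threads, the overall speedup is

$$ S(N) = \frac{1}{(1 - p) + p/N} $$

so the serial fraction $(1 - p)$ caps the speedup at $1/(1 - p)$, however many threads the GPU provides.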

Copying between host and device

  • Separate memory spaces mean some copying is inevitable
    • Via the PCI Express (PCIe) bus
    • Relatively slow/expensive
  • Unnecessary copies simply must be avoided
    • Keep data resident on the device
    • May involve moving all relevant code to the device
    • Recalculation / extra computation instead of data communication

Removing data transfer

Consider a common pattern:

for (it = 0; it < nTimeSteps; it++) {
  myCheapHostOperation(hostData);
  cudaMemcpy(..., cudaMemcpyHostToDevice);
  myExpensiveKernel <<<...>>> (deviceData, ...);
  cudaMemcpy(..., cudaMemcpyDeviceToHost);
}
    
This should be refactored so that the data stays resident on the device, with the cheap host operation ported to a kernel:

cudaMemcpy(..., cudaMemcpyHostToDevice);

for (it = 0; it < nTimeSteps; it++) {
  myCheapKernel <<< ... >>> (deviceData, ...);
  myExpensiveKernel <<< ... >>> (deviceData, ...);
}

cudaMemcpy(..., cudaMemcpyDeviceToHost);
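
As a fuller, self-contained sketch of this resident-data pattern (the kernel bodies, names such as n and nTimeSteps, and all sizes here are illustrative):

#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void myCheapKernel(double *data, int n) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) data[i] += 1.0;   /* work formerly done on the host */
}

__global__ void myExpensiveKernel(double *data, int n) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) data[i] *= 2.0;   /* stands in for the expensive work */
}

int main(void) {
  int n = 1 << 20;             /* illustrative problem size */
  int nTimeSteps = 100;
  size_t sz = n*sizeof(double);

  double *hostData = (double *) malloc(sz);
  double *deviceData;
  for (int i = 0; i < n; i++) hostData[i] = 1.0;

  cudaMalloc((void **) &deviceData, sz);

  /* Single copy to the device before the time-step loop */
  cudaMemcpy(deviceData, hostData, sz, cudaMemcpyHostToDevice);

  int threadsPerBlock = 256;
  int blocksPerGrid = (n + threadsPerBlock - 1)/threadsPerBlock;

  for (int it = 0; it < nTimeSteps; it++) {
    myCheapKernel<<<blocksPerGrid, threadsPerBlock>>>(deviceData, n);
    myExpensiveKernel<<<blocksPerGrid, threadsPerBlock>>>(deviceData, n);
  }

  /* Single copy back once the loop has finished */
  cudaMemcpy(hostData, deviceData, sz, cudaMemcpyDeviceToHost);

  cudaFree(deviceData);
  free(hostData);
  return 0;
}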
   

Occupancy and latency hiding

  • Work is decomposed and distributed between threads
    • Might suggest we want as many threads as there are cores
    • ...otherwise some cores will be left idle
  • Actually want $N_{threads} \gg N_{cores}$
  • Latency for access to main memory
    • Perhaps 100 clock cycles
    • If other threads are available, they can be swapped in quickly to hide the latency

Example

Suppose we have a two-dimensional loop

for (i = 0; i < 512; i++) {
  for (j = 0; j < 512; j++) {
    /* ... work ... */
  }
}
Parallelise inner loop only
  • Can use 512 threads
  • Poor occupancy!
Parallelise both loops
  • Can use $512 \times 512 = 262,\!144$ threads
  • Much better!

CPU Caching


/* C: recall right-most index runs fastest */
for (i = 0; i < NI; i++) {
  for (j = 0; j < NJ; j++) {
    output[i][j] = input[i][j];
  }
}
    

! Fortran: recall left-most index runs fastest
do j = 1, NJ
  do i = 1, NI
    output(i,j) = input(i,j)
  end do
end do
    
Each individual thread then has consecutive memory accesses (good cache behaviour)

Memory coalescing

  • GPUs have a high peak memory bandwidth
    • But only achieved when accesses are coalesced
    • That is, when:

    consecutive threads access consecutive memory locations

  • If not, access may be serialised
    • Performance disaster
    • Need to refactor to allow coalescing

So, what is the correct order?

In one dimension, the picture is relatively simple
  • threadsPerBlock = (nThreads, 1, 1)
  • Consecutive threads are those with consecutive index

/* In C: */
idx = blockIdx.x*blockDim.x + threadIdx.x;
output[idx] = input[idx];
      

! In Fortran:
idx = (blockIdx%x - 1)*blockDim%x + threadIdx%x
output(idx) = input(idx)
      
Both good; accesses are coalesced.

Two-dimensional array: C

Recall: right-most index runs fastest

/* Bad: consecutive threads have strided access */
i = blockIdx.x*blockDim.x + threadIdx.x;
for (j = 0; j < NJ; j++) {
  output[i][j] = input[i][j];
}
    

/* Good: consecutive threads have contiguous access */
j = blockIdx.x*blockDim.x + threadIdx.x;
for (i = 0; i < NI; i++) {
  output[i][j] = input[i][j];
}
    

Two-dimensional array: Fortran

Recall: left-most index runs fastest

! Bad: consecutive threads have strided access
j = (blockIdx%x - 1)*blockDim%x + threadIdx%x
do i = 1, NI
  output(i, j) = input(i, j)
end do
    

! Good: consecutive threads have contiguous access
i = (blockIdx%x - 1)*blockDim%x + threadIdx%x
do j = 1, NJ
  output(i, j) = input(i, j)
end do
    

Two-dimensional decomposition

More complex
  • blocksPerGrid = (nBlocksX, nBlocksY, 1)
  • threadsPerBlock = (nThreadsX, nThreadsY, 1)
  • x counts fastest, then y (and then z, in three dimensions)

/* C: note apparent transposition of i, j here... */
int j = blockIdx.x*blockDim.x + threadIdx.x;
int i = blockIdx.y*blockDim.y + threadIdx.y;

output[i][j] = input[i][j];
    

! Fortran: looks more natural
i = (blockIdx%x - 1)*blockDim%x + threadIdx%x
j = (blockIdx%y - 1)*blockDim%y + threadIdx%y

output(i, j) = input(i, j)
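
A minimal, self-contained sketch of a matching launch configuration in CUDA C (the kernel name, array sizes, and 16x16 block shape are illustrative; the 2D array is flattened to 1D here):

#include <cuda_runtime.h>

#define NI 512
#define NJ 512

/* Copy kernel using the C indexing above: x spans j (contiguous), y spans i */
__global__ void copyKernel(double *output, const double *input) {
  int j = blockIdx.x*blockDim.x + threadIdx.x;
  int i = blockIdx.y*blockDim.y + threadIdx.y;
  output[i*NJ + j] = input[i*NJ + j];   /* flattened [i][j] */
}

int main(void) {
  double *d_in, *d_out;
  cudaMalloc((void **) &d_in,  NI*NJ*sizeof(double));   /* data initialisation omitted */
  cudaMalloc((void **) &d_out, NI*NJ*sizeof(double));

  dim3 threadsPerBlock(16, 16, 1);        /* x varies fastest within a block       */
  dim3 blocksPerGrid(NJ/16, NI/16, 1);    /* assumes NI, NJ are multiples of 16    */

  copyKernel<<<blocksPerGrid, threadsPerBlock>>>(d_out, d_in);
  cudaDeviceSynchronize();

  cudaFree(d_in);
  cudaFree(d_out);
  return 0;
}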
    

Code branching

Threads are scheduled in groups of 32
  • Share same instruction scheduling hardware units
  • A group is referred to as a warp
  • Warp executes instructions in lock-step (SIMT)
Branches in the code can cause serialisation
  • If threads within a warp diverge, the warp executes each branch in turn
  • Results from the untaken branch are thrown away

Avoiding warp divergence

Imagine we want to split threads into two groups:

/* Bad: threads in same warp diverge... */

tid = blockIdx.x*blockDim.x + threadIdx.x;

if (tid % 2 == 0) {
  /* Threads 0, 2, 4, ... do one thing ... */
}
else {
  /* Threads 1, 3, 5, ... do something else */
}
    

Avoiding warp divergence



/* Good: threads in same warp follow same path ...
 * Note use of the built-in variable "warpSize" */

tid = blockIdx.x*blockDim.x + threadIdx.x;

if ( (tid / warpSize) % 2 == 0) {
  /* Threads 0, 1, 2, 3, 4, ... do one thing ... */
}
else {
  /* Threads 32, 33, 34, 35,  ... do something else */
}
    

Performance Problem?

The compiler can help to identify potential problems

$ nvcc -Xptxas -v ...
    
Verbose option for PTX stage
  • PTX (Parallel Thread Execution) is an intermediate form; ptxas is its assembler
  • Reports register usage, constant/shared memory usage in kernels
  • Reports spills to global memory (very harmful to performance)
  • Use iteratively during development

Profiling CUDA code

Command-line profiler nvprof
  • Provides traditional text profile report

$ nvprof ./cudaExecutable
    
The NVIDIA Visual Profiler provides a graphical timeline view

Summary

With care, a good fraction of peak performance is possible
  • This can be quite difficult to achieve on a CPU
Not all problems well suited
  • Often just not enough parallelism