CUDA Programming

Kevin Stratford

kevin@epcc.ed.ac.uk

Material by: Alan Gray, Kevin Stratford


Overview

  • Compute Unified Device Architecture
  • Background
  • CUDA blocks and threads
  • Data management

Host and Device

CPU and (one) GPU are combined:
  • Usually referred to as "host" and "device"
  • Each has a separate memory and separate address space
Schematic: CPU host and GPU device, each with its own address space, connected via the (PCIe) bus.

Development

  • Early graphics interfaces difficult to program
    • Particularly for scientific applications
  • CUDA developed around 2007 to ease development
  • C/C++ interface
    • Host-side interface to control GPU memory, etc.
    • Kernels executed on the device by threads
  • Extended to CUDA Fortran with an appropriate compiler

Streaming Multiprocessors

Schematic: a GPU device is made up of a number of streaming multiprocessors (SMs); each SM consists of a number of cores and shared memory.
A two level hierarchy:
  • Many streaming multiprocessors each with many cores
  • Exact numbers depend on particular hardware

Grids, Blocks, and Threads

This hierarchy is reflected in the programming model:
  • Problem abstracted to blocks (map to SMs)
  • Each block contains a number of threads (map to cores)
The programmer need not care about the details of the mapping to hardware:
  • Just describe a grid of blocks and threads
  • Hardware will schedule work as it sees fit

dim3 structure

CUDA introduces a container for x,y,z dimensions
C:

     struct dim3 {
       unsigned int x;
       unsigned int y;
       unsigned int z;
     };
     
Fortran:

     type :: dim3
       integer :: x
       integer :: y
       integer :: z
     end type dim3
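
In CUDA C, dim3 can also be initialised on declaration, and any unspecified components default to 1. A minimal sketch (the variable names are illustrative):

/* A 4 x 2 grid of blocks, each of 16 x 16 threads;
 * the unspecified z components default to 1. */

dim3 blocks(4, 2);
dim3 threadsPerBlock(16, 16);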
     

Example


/* Consider the one-dimensional loop: */

for (int i = 0; i < LOOP_LENGTH; i++) {
   result[i] = 2*i;
}
    

CUDA C Kernel Function


__global__ void myKernel(int * result) {

  int i;

  i = threadIdx.x;
  result[i] = 2*i;
}
    

Executing a kernel


/* Kernel is launched by the host by specifying
 * Number of blocks (sometimes "blocksPerGrid")
 * Number of threads per block */

dim3 blocks;
dim3 threadsPerBlock;

blocks.x = 1;
threadsPerBlock.x = LOOP_LENGTH;

myKernel <<< blocks, threadsPerBlock >>> (result);
    
Referred to as the execution configuration
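
Plain integers are also accepted in the execution configuration; in CUDA C they are converted to dim3 values with the remaining components set to 1. The launch above could equally be written as:

/* One block of LOOP_LENGTH threads */
myKernel <<< 1, LOOP_LENGTH >>> (result);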

CUDA Fortran


! In Fortran an analogous kernel is...
 
attributes(global) subroutine myKernel(result)
  integer, dimension(:) :: result
  integer               :: i

  i = threadIdx%x
  result(i) = 2*i
end subroutine myKernel

! ... with execution ...

blocks%x = 1
threadsPerBlock%x = LOOP_LENGTH
call myKernel <<< blocks, threadsPerBlock >>> (result)
    

More than one block


/* One block runs on only one SM, so use of resources is very poor.
 * Large problems usually need many blocks. */

__global__ void myKernel(int * result) {

  int i = blockIdx.x*blockDim.x + threadIdx.x;
  result[i] = 2*i;
}

/* ... with execution ... */

blocks.x = NBLOCKS;
threadsPerBlock.x = LOOP_LENGTH/NBLOCKS;
myKernel <<< blocks, threadsPerBlock >>> (result);
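
Note that LOOP_LENGTH is assumed here to be an exact multiple of NBLOCKS. If it is not, a common idiom is to fix the block size, round the number of blocks up, and guard the array access in the kernel. A sketch (THREADS_PER_BLOCK is an illustrative constant, not part of the original example):

__global__ void myKernel(int * result, int nElements) {

  int i = blockIdx.x*blockDim.x + threadIdx.x;

  /* Threads beyond the end of the array do nothing */
  if (i < nElements) result[i] = 2*i;
}

/* ... with execution ... */

threadsPerBlock.x = THREADS_PER_BLOCK;
blocks.x = (LOOP_LENGTH + THREADS_PER_BLOCK - 1)/THREADS_PER_BLOCK;
myKernel <<< blocks, threadsPerBlock >>> (result, LOOP_LENGTH);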
     

More than one block: Fortran


attributes(global) subroutine myKernel(result)
  integer, dimension(:) :: result
  integer               :: i

  i = (blockIdx%x - 1)*blockDim%x + threadIdx%x
  result(i) = 2*i
end subroutine myKernel

! ... with execution ...

blocks%x = NBLOCKS
threadsPerBlock%x = LOOP_LENGTH/NBLOCKS
call myKernel <<< blocks, threadsPerBlock >>> (result)
    

Internal variables: C

All provided by the implementation:
  • Fixed at kernel invocation:

dim3 gridDim;    /* Number of blocks */
dim3 blockDim;   /* Number of threads per block */
    
  • Unique to each block:

dim3 blockIdx;   /* 0 <= blockIdx.x < gridDim.x etc */
    
  • Unique to each thread:

dim3 threadIdx;  /* 0 <= threadIdx.x < blockDim.x etc */
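
These variables can be combined so that a fixed execution configuration covers an array of any length, the so-called grid-stride loop. A sketch (nElements is illustrative):

__global__ void myKernel(int * result, int nElements) {

  /* Total number of threads in the grid */
  int stride = gridDim.x*blockDim.x;

  /* Each thread starts at its global index and strides by
   * the grid size until the end of the array is reached. */
  for (int i = blockIdx.x*blockDim.x + threadIdx.x; i < nElements;
       i += stride) {
    result[i] = 2*i;
  }
}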
    

Internal variables: Fortran

Again provided by the implementation:
  • Fixed at kernel invocation:

type (dim3) :: gridDim   ! Number of blocks
type (dim3) :: blockDim  ! Number of threads per block
    
  • Unique to each block:

type (dim3) :: blockIdx  ! 1 <= blockIdx%x <= gridDim%x etc
    
  • Unique to each thread:

type (dim3) :: threadIdx ! 1 <= threadIdx%x <= blockDim%x etc
    

Two-dimensional example


__global__ void matrix2d(float a[N][N], float b[N][N],
                         float c[N][N]) {

  int j = blockIdx.x*blockDim.x + threadIdx.x;
  int i = blockIdx.y*blockDim.y + threadIdx.y;

  c[i][j] = a[i][j] + b[i][j];
}
    

/* ... with execution, e.g.,  ... */

dim3 blocksPerGrid(N/16, N/16, 1);
dim3 threadsPerBlock(16, 16, 1);

matrix2d <<< blocksPerGrid, threadsPerBlock >>> (a, b, c);
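
The static array arguments above require N to be a compile-time constant, and the launch assumes N is a multiple of 16. A sketch of the same kernel using flattened one-dimensional storage with a bounds check (not part of the original example):

__global__ void matrix2d(float * a, float * b, float * c, int n) {

  int j = blockIdx.x*blockDim.x + threadIdx.x;
  int i = blockIdx.y*blockDim.y + threadIdx.y;

  /* Guard against threads outside the n x n domain */
  if (i < n && j < n) c[i*n + j] = a[i*n + j] + b[i*n + j];
}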
    

Synchronisation between host and device

Kernel launches are asynchronous
  • Return immediately on the host
  • Synchronisation required to ensure completion
  • Errors can appear asynchronously! (see the error-checking sketch below)

myKernel <<<blocksPerGrid, threadsPerBlock>>> (...)

/* ... could perform independent work here ... */

err = cudaDeviceSynchronize();

/* ... now safe to obtain results of kernel ... */
    
Many other CUDA operations have asynchronous analogues
  • cudaMemcpyAsync(), ...
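
Because errors can appear asynchronously, a common pattern is to check both the launch itself and the subsequent synchronisation. A sketch (error handling kept deliberately simple):

myKernel <<< blocksPerGrid, threadsPerBlock >>> (result);

/* Errors in the launch configuration are reported immediately ... */
err = cudaGetLastError();
if (err != cudaSuccess) printf("Launch: %s\n", cudaGetErrorString(err));

/* ... errors during kernel execution only appear at synchronisation */
err = cudaDeviceSynchronize();
if (err != cudaSuccess) printf("Kernel: %s\n", cudaGetErrorString(err));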

Synchronisation on the device

Synchronisation between threads in the same block is possible
  • Allows co-ordination of action in shared memory
  • Allows reductions (see the sketch below)
Historically, not possible to synchronise between blocks
  • Can only exit the kernel
  • Synchronise on host and start another kernel
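
Synchronisation within a block is via __syncthreads(). A minimal sketch of a block-level sum reduction in shared memory (assumes the block size THREADS_PER_BLOCK is a power of two; names are illustrative):

#define THREADS_PER_BLOCK 128

__global__ void blockSum(int * data, int * blockResult) {

  __shared__ int tmp[THREADS_PER_BLOCK];

  int i = blockIdx.x*blockDim.x + threadIdx.x;
  tmp[threadIdx.x] = data[i];

  /* Halve the active range each step; all threads must reach
   * __syncthreads() before the next step reads the results. */
  for (int s = THREADS_PER_BLOCK/2; s > 0; s /= 2) {
    __syncthreads();
    if (threadIdx.x < s) tmp[threadIdx.x] += tmp[threadIdx.x + s];
  }

  /* Thread 0 holds the sum for this block */
  if (threadIdx.x == 0) blockResult[blockIdx.x] = tmp[0];
}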

Memory Management

Recall host and device have separate address spaces
  • Data accessed by kernel must be in the device memory
  • This is managed largely explicitly

Memory Allocation: C

Allocation managed via standard C pointers

/* For example, provide an allocation of "nSize" floats
 * in the device memory: */

float * data;

err = cudaMalloc(&data, nSize*sizeof(float));

...

err = cudaFree(data);
    
Such pointers cannot be dereferenced on the host

Memory Movement: cudaMemcpy()

Initiated on the host:


/* Copy host data values to device memory ... */
err = cudaMemcpy(dataDevice, dataHost, nSize*sizeof(float),
                 cudaMemcpyHostToDevice);

/* And back again ... */
err = cudaMemcpy(dataHost, dataDevice, nSize*sizeof(float),
                 cudaMemcpyDeviceToHost);
    
API:

cudaError_t cudaMemcpy(void * dest, const void * src,
                       size_t count,
                       cudaMemcpyKind kind);
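
Putting allocation, transfer, and kernel execution together, a typical host-side sequence might look as follows (a sketch; someKernel, blocks, and threadsPerBlock are assumed to be defined elsewhere, and error checking is omitted for brevity):

float * dataHost;
float * dataDevice;

dataHost = (float *) malloc(nSize*sizeof(float));
/* ... initialise dataHost ... */

cudaMalloc(&dataDevice, nSize*sizeof(float));
cudaMemcpy(dataDevice, dataHost, nSize*sizeof(float),
           cudaMemcpyHostToDevice);

someKernel <<< blocks, threadsPerBlock >>> (dataDevice);
cudaDeviceSynchronize();

cudaMemcpy(dataHost, dataDevice, nSize*sizeof(float),
           cudaMemcpyDeviceToHost);

cudaFree(dataDevice);
free(dataHost);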
     

Memory allocation: CUDA Fortran

Declare variable to be in the device memory space
  • Via the device attribute
  • Compiler then knows that the variable should be treated appropriately

! Make an allocation in device memory:
real, device, allocatable :: dataDevice(:)

allocate(dataDevice(nSize), stat = ...)

...

deallocate(dataDevice)
    
Or, can use the C-like API
  • cudaMalloc(), cudaFree()

Memory movement: CUDA Fortran

May be performed via simple assignment
  • Again, compiler knows what action to take via declarations

! Copy from host to device

dataDevice(:) = dataHost(:)

! ... and back again ...

dataHost(:) = dataDevice(:)
    
Can be more explicit using C-like API

err = cudaMemcpy(dataDevice, dataHost, nSize,
                 cudaMemcpyHostToDevice)
    

Compilation

CUDA C source
  • File extension .cu by convention
  • Compiled via NVIDIA nvcc

$ nvcc -o example example.cu
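
nvcc also accepts a target GPU architecture via the -arch option; the exact compute capability depends on the hardware (sm_70 below is illustrative):

$ nvcc -arch=sm_70 -o example example.cu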
    
CUDA Fortran source
  • File extension .cuf by convention
  • Compiled via Portland Group compiler pgf90
  • Use -Mcuda (if the file extension is not .cuf)

$ pgf90 -Mcuda -o example example.cuf
    

Summary

CUDA C and CUDA Fortran
  • Provide API and extensions for programming NVIDIA GPUs
  • Memory management
  • Kernel execution
CUDA encompasses a wide range of functionality
  • Can make significant progress with a small subset
Still evolving (along with hardware)
  • Currently CUDA v10
  • About one release per year
  • Features can be deprecated / become outmoded