Managed, Constant, and Shared Memory


Material by: Kevin Stratford


CUDA Memory so far...

  • Global memory
    • Allocated on host
    • Available to both host and device read/write
  • Local variables in kernels
    • Private on a per thread basis
    • Usually expected to be held in registers

Other types of memory

  • Managed memory
    • Unified address space: available on host and device
    • No explicit copies required
    • Very useful for development / more complex code
  • Constant cache memory
    • Read only in kernel
    • No cache coherency mechanism required to support writes
    • Fast and effectively very high bandwidth
  • Shared memory
    • Shared between threads in the same block
    • Often declared statically in the kernel (can be dynamic)
    • Lifetime of the kernel

Host and device memory



    /* Schematically */

    /* Allocate and initialise host memory ... */

    h_ptr = malloc(nbytes);
    ...

    /* Allocate device memory and copy */

    cudaMalloc(&d_ptr, nbytes);
    cudaMemcpy(d_ptr, h_ptr, nbytes, cudaMemcpyHostToDevice);
    ...

    /* Use device memory in kernel */
    kernel<<<...>>> (d_ptr, ...);

    

Unified memory



  /* Can we do both? */

cudaMallocManaged(&ptr, nbytes);

/* Initialise data on host */

for (i = 0; i < NDATA; i++) {
  ptr[i] = value;
}

/* Use data in kernel */

kernel<<<...>>> (ptr, ...);

/* ... and get results back */

Managed memory

__host__ cudaError_t cudaMallocManaged(void ** dptr, size_t sz);

  • Allocated on the host
    • ...but single address space for host and device
    • Management of copies performed by CUDA runtime
    • Release with cudaFree()
  • Page migration
    • Access to memory not present generates page fault
    • Pages are then copied; page size may be e.g., 64KB
    • Costs time, so can be slow relative to cudaMemcpy()
    • Can specify prefetches, or provide hints

Managed memory (cont.)



    /* Prefetch to destination device */

    cudaGetDevice(&device);
    cudaMallocManaged(&ptr, nbytes);
    ...
    cudaMemPrefetchAsync(ptr, nbytes, device, NULL);
    ...

    /* Hints */

    cudaMemAdvise(ptr, nbytes, advice, device);

    /* advice: cudaMemAdviseSetReadMostly */
    /*         cudaMemAdviseSetPreferredLocation */
    /*         cudaMemAdviseSetAccessedBy */

    

Constant Memory: C


/* Constant data declared at file scope with
 *  __constant__ memory space qualifier  */

static __constant__ double coeffs[3];

int someHostFunction(...) {

  /* ... assign some values at run time ... */

  double values[3];

  /* ... and before the relevant kernel ... */

  cudaMemcpyToSymbol(coeffs, values, 3*sizeof(double));

  ...
}
    

Schematic: C kernel



/* Still in the appropriate scope ... */

static __constant__ double coeffs[3];

__global__ void someKernel(...) {

  ...

  /* Reference constant data as usual ... */

  result = coeffs[0]*x + coeffs[1]*y + coeffs[2]*z;
}
    

Schematic: Fortran


! Constant variable declared at e.g., module scope
! with constant attribute

real, constant :: coeffs(3)

contains

subroutine someHostRoutine(...)

  ! ...assign some values at runtime ...

  coeffs(:) = values(1:3)

  ! ...and call relevant kernel ...

end subroutine someHostRoutine
    

Schematic: Fortran kernel


! Still in the appropriate scope ...

real, constant :: coeffs(3)

contains

attributes(global) subroutine someKernel(...)

  ! Reference constant data as usual ...

  result = coeffs(1)*x + coeffs(2)*y + coeffs(3)*z

end subroutine someKernel

    

Constant memory

  • A relatively scarce resource
    • Typically 64 kB in total (can inquire at runtime)
    • No huge look-up tables!
    • Also used for kernel actual arguments (by value)
  • Any "overflow" will spill to normal global memory
    • ... and accesses will be relatively slow

Shared Memory

  • Shared between threads in a block
    • Useful for temporary values, particularly if significant reuse
    • Marshalling data within a block
    • May be used to perform reductions (sum, min, max)
  • May require care in synchronisation within a block
    • Basic synchronisation is __syncthreads()
    • Many others
  • Declaration
    • C: via __shared__ memory space qualifier
    • Fortran: via shared attribute

Example: Reverse elements in array


/* Reverse elements so that the order 0,1,2,3,...
 * becomes ...,3,2,1,0
 * Assume we have one block. */

__global__ void reverseElements(int * myArray) {

  __shared__ int tmp[THREADS_PER_BLOCK];

  int idx = threadIdx.x;
  tmp[idx] = myArray[idx];

  __syncthreads();

  myArray[THREADS_PER_BLOCK - (idx+1)] = tmp[idx];
}

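As noted above, shared memory can also be declared dynamically: the
array is declared extern in the kernel and its size is fixed at launch
time via the third execution configuration argument. A sketch based on
the example above (d_ptr is assumed to be an existing device array):

/* Dynamic shared memory: size set at launch, not at compile time */

__global__ void reverseElements(int * myArray, int ndata) {

  extern __shared__ int tmp[];   /* size supplied by the launch */

  int idx = threadIdx.x;
  tmp[idx] = myArray[idx];

  __syncthreads();

  myArray[ndata - (idx+1)] = tmp[idx];
}

/* Launch: third argument is the shared memory size in bytes */

reverseElements<<<1, THREADS_PER_BLOCK, THREADS_PER_BLOCK*sizeof(int)>>>
  (d_ptr, THREADS_PER_BLOCK);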
Shared Memory Summary

  • Again, a relatively scarce resource
    • E.g., 48 kB per block
    • Some care may be required (check at runtime)
  • Various performance considerations
    • E.g., "bank conflicts"
    • Warp divergence related to synchronisation
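
As mentioned earlier, a common use of shared memory is a reduction
within a block. A sketch of a tree-based sum (assumes blockDim.x is a
power of two and equal to THREADS_PER_BLOCK; the kernel name is
illustrative):

/* Sketch: block-level sum reduction in shared memory */

__global__ void blockSum(double * data, double * result) {

  __shared__ double tmp[THREADS_PER_BLOCK];

  int idx = threadIdx.x;
  tmp[idx] = data[blockIdx.x*blockDim.x + idx];

  /* Halve the number of active threads at each step ... */
  for (int s = blockDim.x/2; s > 0; s /= 2) {
    __syncthreads();
    if (idx < s) tmp[idx] += tmp[idx + s];
  }

  /* Thread 0 holds the sum for this block */
  if (idx == 0) result[blockIdx.x] = tmp[0];
}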