Managed, Constant, and Shared Memory


Material by: Kevin Stratford


CUDA Memory so far...

  • Global memory
    • Allocated on host
    • Available to both host and device read/write
  • Local variables in kernels
    • Private on a per thread basis
    • Usually expected to be held in registers

Other types of memory

  • Managed memory
    • Unified address space: available on host and device
    • No explicit copies required
    • Very useful for development / more complex code
  • Constant cache memory
    • Read only in kernel
    • No cache coherency mechanism required to support writes
    • Fast and effectively very high bandwidth
  • Shared memory
    • Shared between threads in the same block
    • Often declared statically in the kernel (can be dynamic)
    • Lifetime of the kernel

Host and device memory



    /* Schematically */

    /* Allocate and initialise host memory ... */

    h_ptr = malloc(nbytes);
    ...

    /* Allocate device memory and copy */

    cudaMalloc(&d_ptr, nbytes);
    cudaMemcpy(d_ptr, h_ptr, nbytes, cudaMemcpyHostToDevice);
    ...

    /* Use device memory in kernel */
    kernel<<<...>>> (d_ptr, ...);

    

Unified memory



  /* Can we do both? */

cudaMallocManaged(&ptr, nbytes);

/* Initialise data on host */

for (i = 0; i < NDATA; i++) {
  ptr[i] = value;
}

/* Use data in kernel */

kernel<<<...>>> (ptr, ...);

/* ... and get results back */

Managed memory

__host__ cudaError_t cudaMallocManaged(void ** dptr, size_t sz);

  • Allocated on the host
    • ...but single address space for host and device
    • Management of copies performed by CUDA runtime
    • Release with cudaFree()
  • Page migration
    • Access to memory not present generates page fault
    • Pages are then copied; page size may be e.g., 64KB
    • Costs time, so can be slow relative to cudaMemcpy()
    • Can specify prefetches, or provide hints

Managed memory (cont.)



    /* Prefetch to destination device */

    cudaGetDevice(&device);
    cudaMallocManaged(&ptr, nbytes);
    ...
    cudaMemPrefetchAsync(ptr, nbytes, device, NULL);
    ...

    /* Hints */

    cudaMemAdvise(ptr, nbytes, advice, device);

    /* advice: cudaMemAdviseSetReadMostly */
    /*         cudaMemAdviseSetPreferredLocation */
    /*         cudaMemAdviseSetAccessedBy */

    

Constant Memory: C


/* Constant data declared at file scope with
 *  __constant__ memory space qualifier  */

static __constant__ double coeffs[3];

int someHostFunction(...) {

  /* ... assign some values at run time ... */

  double values[3];

  /* ... and before the relevant kernel ... */

  cudaMemcpyToSymbol(coeffs, values, 3*sizeof(double));

  ...
}
    

Schematic: C kernel



/* Still in the appropriate scope ... */

static __constant__ double coeffs[3];

__global__ void someKernel(...) {

  ...

  /* Reference constant data as usual ... */

  result = coeffs[0]*x + coeffs[1]*y + coeffs[2]*z;
}
    

Schematic: Fortran


! Constant variable declared at e.g., module scope
! with constant attribute

real, constant :: coeffs(3)

contains

subroutine someHostRoutine(...)

  ! ...assign some values at runtime ...

  coeffs(:) = values(1:3)

  ! ...and call relevant kernel ...

end subroutine someHostRoutine
    

Schematic: Fortran kernel


! Still in the appropriate scope ...

real, constant :: coeffs(3)

contains

attributes(global) subroutine someKernel(...)

  ! Reference constant data as usual ...

  result = coeffs(1)*x + coeffs(2)*y + coeffs(3)*z

end subroutine someKernel

    

Constant memory

  • A relatively scarce resource
    • Typically 64 kB in total (can inquire at runtime)
    • No huge look-up tables!
    • Also used for kernel actual arguments (by value)
  • Any "overflow" will spill to normal global memory
    • ... and accesses will be relatively slow

Shared Memory

  • Shared between threads in a block
    • Useful for temporary values, particularly if significant reuse
    • Marshalling data within a block
    • May be used to perform reductions (sum, min, max)
  • May require care in synchronisation within a block
    • Basic synchronisation is __syncthreads()
    • Many others
  • Declaration
    • C: via __shared__ memory space qualifier
    • Fortran: via shared attribute

Example: Reverse elements in array


/* Reverse elements so that the order 0,1,2,3,...
 * becomes ...,3,2,1,0
 * Assume we have one block. */

__global__ void reverseElements(int * myArray) {

  __shared__ int tmp[THREADS_PER_BLOCK];

  int idx = threadIdx.x;
  tmp[idx] = myArray[idx];

  __syncthreads();

  myArray[THREADS_PER_BLOCK - (idx+1)] = tmp[idx];
}

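As noted above, shared memory can also be declared dynamically: the
array is declared extern in the kernel and its size is fixed at launch
time via the third execution configuration argument. A sketch based on
the example above (d_ptr is assumed to be an existing device array):

/* Dynamic shared memory: size set at launch, not at compile time */

__global__ void reverseElements(int * myArray, int ndata) {

  extern __shared__ int tmp[];   /* size supplied by the launch */

  int idx = threadIdx.x;
  tmp[idx] = myArray[idx];

  __syncthreads();

  myArray[ndata - (idx+1)] = tmp[idx];
}

/* Launch: third argument is the shared memory size in bytes */

reverseElements<<<1, THREADS_PER_BLOCK, THREADS_PER_BLOCK*sizeof(int)>>>
  (d_ptr, THREADS_PER_BLOCK);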
Shared Memory Summary

  • Again, a relatively scarce resource
    • E.g., 48 kB per block
    • Some care may be required (check at runtime)
  • Various performance considerations
    • E.g., "bank conflicts"
    • Warp divergence related to synchronisation
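
As mentioned earlier, a common use of shared memory is a reduction
within a block. A sketch of a tree-based sum (assumes blockDim.x is a
power of two and equal to THREADS_PER_BLOCK; the kernel name is
illustrative):

/* Sketch: block-level sum reduction in shared memory */

__global__ void blockSum(double * data, double * result) {

  __shared__ double tmp[THREADS_PER_BLOCK];

  int idx = threadIdx.x;
  tmp[idx] = data[blockIdx.x*blockDim.x + idx];

  /* Halve the number of active threads at each step ... */
  for (int s = blockDim.x/2; s > 0; s /= 2) {
    __syncthreads();
    if (idx < s) tmp[idx] += tmp[idx + s];
  }

  /* Thread 0 holds the sum for this block */
  if (idx == 0) result[blockIdx.x] = tmp[0];
}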