Multi-GPU Methods


Material by: Nick Johnson

Why would you want to use >1 GPU?

  • Massively (Massively) parallel performance
    • Uncommon to find GPU-enabled HPC machines with only a single GPU per node
    • Memory space can be limiting for larger problems, though less of a problem for newer GPUs
    • Time to science!
  • More GPUs, more difficulties
    • You now need to think about how to do things across GPUs
    • What goes where: which GPU, which host?
    • What are the costs of moving data versus replicating effort?
    • Synchronisation across devices?
    • Do we require some extra, host-side, co-ordination?

A simple example

  • Our traditional dense Matrix-Vector kernel
    • Send all of the matrix and all of the vector to the single GPU, bring back all of the result vector
    • nBlocks = rows
    • nThreads/block = cols
    • It's easy enough to visualise how this works across SMs (blocks)
  • Now you need to split it up further
    • Send half of matrix A to GPU 0 (first 16 rows)
    • Send the other half (second 16 rows) to GPU 1
    • Send vector X to BOTH!
    • Compute M.V on the half-sized matrices (threads/block = 32, nBlocks = 16)
    • Synchronise on the host
    • Copy back to a host vector with an offset (see the sketch below).
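
A rough sketch of this split, with my own names (ROWS, COLS, mv_kernel), a 32x32 matrix of doubles, at least two GPUs assumed and all error checking omitted - this is not the exercise code:

	#include <cuda_runtime.h>
	#include <stdio.h>

	#define ROWS 32
	#define COLS 32   /* must be a power of two for the reduction below */

	/* One block per row, one thread per column: multiply, then reduce the
	   row's partial products in shared memory. */
	__global__ void mv_kernel(const double *A, const double *x, double *y)
	{
		__shared__ double tmp[COLS];
		int row = blockIdx.x;
		int col = threadIdx.x;

		tmp[col] = A[row * COLS + col] * x[col];
		__syncthreads();

		for (int s = COLS / 2; s > 0; s /= 2) {
			if (col < s) tmp[col] += tmp[col + s];
			__syncthreads();
		}
		if (col == 0) y[row] = tmp[0];
	}

	int main(void)
	{
		double A[ROWS * COLS], x[COLS], y[ROWS];
		double *dA[2], *dx[2], *dy[2];
		int half = ROWS / 2;                        /* 16 rows per GPU */

		for (int i = 0; i < ROWS * COLS; i++) A[i] = 1.0;
		for (int i = 0; i < COLS; i++)        x[i] = 1.0;   /* so each y[i] should be 32 */

		for (int g = 0; g < 2; g++) {
			cudaSetDevice(g);                       /* all following calls target GPU g */
			cudaMalloc((void **)&dA[g], half * COLS * sizeof(double));
			cudaMalloc((void **)&dx[g], COLS * sizeof(double));
			cudaMalloc((void **)&dy[g], half * sizeof(double));

			/* GPU 0 gets the first 16 rows, GPU 1 the second 16; x goes to BOTH */
			cudaMemcpy(dA[g], A + g * half * COLS, half * COLS * sizeof(double),
			           cudaMemcpyHostToDevice);
			cudaMemcpy(dx[g], x, COLS * sizeof(double), cudaMemcpyHostToDevice);

			/* threads/block = 32, nBlocks = 16, as above */
			mv_kernel<<<half, COLS>>>(dA[g], dx[g], dy[g]);
		}

		for (int g = 0; g < 2; g++) {
			cudaSetDevice(g);
			cudaDeviceSynchronize();                /* synchronise on the host */
			/* copy each half back into the host vector at the right offset */
			cudaMemcpy(y + g * half, dy[g], half * sizeof(double),
			           cudaMemcpyDeviceToHost);
			cudaFree(dA[g]); cudaFree(dx[g]); cudaFree(dy[g]);
		}

		printf("y[0] = %.1f, y[%d] = %.1f\n", y[0], ROWS - 1, y[ROWS - 1]);
		return 0;
	}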

OpenMP on the host

  • Host-side parallelisation framework
    • All parallelisation frameworks (MPI, OpenMP, pthreads, etc.) require some extra effort to use
    • Good OpenMP technique is a multi-day course in itself
    • We will stick with a very simple case here
    • 1 OpenMP thread --> 1 CPU (core) --> 1 GPU
    • OpenMP is generally associated with parallelising FOR loops; here we only need a plain parallel region
  • The general flow is:
    • A parallel region which surrounds our work and spawns N threads (I have forced this to be equal to the number of GPUs)
    • 1 OpenMP thread --> 1 CPU (core) --> 1 GPU
    • N threads executing code inside this region in parallel, with some short sections of serial execution (see the sketch below)
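
A minimal sketch of that mapping (my own naming; assumes the code is built with nvcc and OpenMP enabled for the host compiler, e.g. nvcc -Xcompiler -fopenmp):

	#include <cuda_runtime.h>
	#include <omp.h>
	#include <stdio.h>

	int main(void)
	{
		int nGPUs = 0;
		cudaGetDeviceCount(&nGPUs);        /* ask the runtime how many GPUs there are */

		/* spawn exactly one host thread per GPU */
		#pragma omp parallel num_threads(nGPUs)
		{
			int tid = omp_get_thread_num();
			cudaSetDevice(tid);            /* OpenMP thread tid now drives GPU tid */

			/* ... any kernel launches or copies issued here go to GPU tid ... */
			printf("OpenMP thread %d of %d is using GPU %d\n",
			       tid, omp_get_num_threads(), tid);
		}
		return 0;
	}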

OpenMP

  • Lines denoted by #pragma omp
    • We need to employ a few tricks to get this to work correctly
    • We switch between GPUs with cudaSetDevice()
    • We also need to think (very briefly) about streams
    • Each OpenMP thread runs in parallel, so we call cudaSetDevice() with our OpenMP thread number to select a GPU
    • We have our main loop inside the parallel region.

An example code

	#pragma omp parallel shared(...) private(...) default(...)
	{
		while (mainloop < ARRAY_SIZE) {
			/* ... parallel execution on both GPUs: kernel launch, copy, etc. */

			#pragma omp single
			{
				/* ... code for a single CPU THREAD (compute Rs) */
			}

			/* ... more code to be executed for BOTH GPUs, etc. */
		} /* end of main loop */
	} /* end of parallel region */
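
To see the control flow of that skeleton without any CUDA in the way, here is a small OpenMP-only sketch (the names and the trivial "work" are made up): every thread executes the body of the while loop, exactly one thread executes the single block (standing in for "compute Rs"), and the implicit barrier at the end of single keeps the threads in step before the next iteration.

	#include <omp.h>
	#include <stdio.h>

	#define N_ITER 3

	int main(void)
	{
		int mainloop = 0;                      /* shared counter, only updated inside 'single' */

		#pragma omp parallel default(shared) num_threads(2)
		{
			int tid = omp_get_thread_num();    /* declared inside the region, so private */

			while (mainloop < N_ITER) {
				/* every thread does its own share of the work here
				   (in the real code: kernel launches and copies on its own GPU) */
				printf("iter %d: thread %d working\n", mainloop, tid);

				/* make sure all threads have finished before the serial step */
				#pragma omp barrier

				#pragma omp single
				{
					/* exactly one thread runs this (in the real code: compute Rs) */
					printf("iter %d: thread %d doing the serial step\n", mainloop, tid);
					mainloop++;
				} /* implicit barrier here releases all threads together */
			}
		}
		return 0;
	}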

private(variables)

  • #pragma omp parallel default(shared) private(variables)
    • Strictly, this denotes that those variables are private to each thread in the parallel region
    • Everything which must remain private to a single GPU goes in this clause
    • Everything else is shared
    • What happens at the end?
    • "shared" variables keep their value
    • "private" variables don't! (we can make this happen, but we don't need it here) - see the small demo below
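
A tiny sketch of that last point (made-up variable names, two threads): the shared counter keeps the value written inside the region, while val, being private, is worked on through per-thread copies that are simply thrown away when the region ends.

	#include <omp.h>
	#include <stdio.h>

	int main(void)
	{
		int counter = 0;   /* shared: one copy, written inside the region, value kept */
		int val = 42;      /* private below: each thread gets its own uninitialised copy */

		#pragma omp parallel default(shared) private(val) num_threads(2)
		{
			val = omp_get_thread_num();    /* only touches this thread's private copy */

			#pragma omp atomic
			counter += 1;                  /* safe update of the shared variable */
		}

		printf("counter = %d\n", counter); /* prints 2: the shared value survives */
		printf("val     = %d\n", val);     /* prints 42: the private copies are discarded */
		return 0;
	}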

Multi-GPU exercise

  • You will find this in the usual place under exercises/exalat/multi
    • 1. Start by trying to implement the kernels as you did in the single-GPU exercise.
    • 2. Then work on the first matrix-vector operation (Ax).
    • 3. Then work on implementing the main code as far as calculating the initial residual.
    • 4. Then you can uncomment the rest of the loops and work through those.
    • I found it useful to print out the values of R, alpha, beta and mainloop on each iteration.
    • Compare with your values from the single-GPU case - they should be equal!