Multi-GPU Methods
Material by: Nick Johnson
Why would you want to use more than one GPU?
- Massively (Massively) parallel performance
- Uncommon to find GPU-enabled HPC machines with only a single GPU per node
- Memory space can be limiting for larger problems, though less of a problem for newer GPUs
- Time to science!
- More GPUs, more difficulties
- You now need to think about how to do things across GPUs
- What goes where: which GPU, which host?
- What are the costs of moving data versus replicating effort?
- Synchronisation across devices?
- Do we require some extra host-side co-ordination?
A simple example
- Our traditional dense Matrix-Vector kernel
- Send all of the matrix and all of the vector to the single GPU, and bring back all of the result vector
- nBlocks = rows
- nThreads/block = cols
- It's easy enough to visualise how this works across SMs (blocks); a sketch of such a kernel is below
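- An illustrative sketch (the kernel and pointer names matvec, d_A, d_x, d_y are assumptions, not the exercise code; the reduction assumes cols is a power of two):

__global__ void matvec(const double *A, const double *x, double *y, int cols)
{
    extern __shared__ double partial[];
    int row = blockIdx.x;   // nBlocks = rows
    int col = threadIdx.x;  // nThreads/block = cols
    partial[col] = A[row * cols + col] * x[col];
    __syncthreads();
    // Tree reduction within the block (assumes cols is a power of two)
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (col < stride) partial[col] += partial[col + stride];
        __syncthreads();
    }
    if (col == 0) y[row] = partial[0];
}
// Launch: matvec<<<rows, cols, cols * sizeof(double)>>>(d_A, d_x, d_y, cols);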
- Now you need to split it up further
- Send half of matrix A to GPU 0 (first 16 rows)
- Send the other half (second 16 rows) to GPU 1
- Send vector X to BOTH!
- Compute M.V on the half-sized matrices (threads/block = 32, nBlocks = 16)
- Synchronise on the host
- Copy back to a host vector with an offset.
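- Host-side, those steps might look like the sketch below for a 32x32 matrix (error checking omitted; A, x, y are assumed host arrays, matvec is the kernel sketched above, and all names are illustrative):

double *d_A[2], *d_x[2], *d_y[2];  // per-GPU device pointers
for (int g = 0; g < 2; g++) {
    cudaSetDevice(g);  // select GPU g
    cudaMalloc((void **)&d_A[g], 16 * 32 * sizeof(double));
    cudaMalloc((void **)&d_x[g], 32 * sizeof(double));
    cudaMalloc((void **)&d_y[g], 16 * sizeof(double));
    // 16 rows of A per GPU, but ALL of x to both
    cudaMemcpy(d_A[g], A + g * 16 * 32, 16 * 32 * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_x[g], x, 32 * sizeof(double), cudaMemcpyHostToDevice);
    matvec<<<16, 32, 32 * sizeof(double)>>>(d_A[g], d_x[g], d_y[g], 32);  // nBlocks = 16, threads/block = 32
}
for (int g = 0; g < 2; g++) {
    cudaSetDevice(g);
    cudaDeviceSynchronize();  // synchronise on the host
    // Copy each half back into the host vector at an offset
    cudaMemcpy(y + g * 16, d_y[g], 16 * sizeof(double), cudaMemcpyDeviceToHost);
}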
OpenMP on the host
- Host-side parallelisation framework
- All parallelisation frameworks (MPI, OpenMP, pthreads, etc.) require some extra effort to use
- Good OpenMP technique is a multi-day course in itself
- We will stick with a very simple case here
- 1 OpenMP thread --> 1 CPU (core) --> 1 GPU
- OpenMP is generally associated with parallelising for loops
- The general flow is:
- A parallel region which surrounds our work and spawns N threads (I have forced this to be equal to the number of GPUs)
- 1 OpenMP thread --> 1 CPU (core) --> 1 GPU
- N threads executing code inside this region, in parallel with some short sections of serial execution
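- A minimal sketch of that flow (assumes <omp.h> and the CUDA runtime header are included; nGPUs and tid are illustrative names):

int nGPUs;
cudaGetDeviceCount(&nGPUs);  // one OpenMP thread per visible GPU
#pragma omp parallel num_threads(nGPUs)
{
    int tid = omp_get_thread_num();  // 0 .. nGPUs-1
    cudaSetDevice(tid);              // 1 OpenMP thread --> 1 GPU
    // ... per-GPU allocations, copies and kernel launches go here ...
}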
OpenMP
Directives are denoted by #pragma omp
- We need to employ a few tricks to get this to work correctly
- We switch between GPUs with cudaSetDevice()
- We also need to think (very briefly) about streams
- Each OpenMP thread runs in parallel, so we call cudaSetDevice() with our OpenMP thread number to select a GPU
- We have our main loop inside the parallel region.
Example code
#pragma omp parallel shared(...) private(...) default(...)
{
    while (mainloop < ARRAY_SIZE) {
        // ... some parallel execution of GPU kernels, copies etc.
        #pragma omp single
        {
            // ... code for a single CPU THREAD (compute Rs)
        }
        // ... more code to be executed on BOTH GPUs etc.
    } // End of main loop
} // End of parallel region
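- Note: #pragma omp single has an implicit barrier at its end, so both threads wait until the single-threaded work (computing Rs) has finished before continuing with the GPU work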
private(variables)
- #pragma omp parallel default(shared) private(variables)
- Strictly, this denotes that those variables are private to each thread in the parallel region
- Everything which must remain private to a single GPU goes in this clause
- Everything else is shared
- What happens at the end?
- "shared" variables keep their value
- "private" variables dont! (we can make this happen but we don't need it here)
Multi-GPU exercise
- You will find this in the usual place under exercises/exalat/multi
- 1. Start by trying to implement the kernels as you did in the single-GPU exercise.
- 2. Then work on the first matrix vector operation (Ax).
- 3. Then work on implementing the main code as far as calculating the initial residual.
- 4. Then you can uncomment the rest of the loops and work through those.
- I found it useful to print out the values of R, alpha, beta and mainloop on each iteration.
- Compare with your values from the single-GPU case - they should be equal!