Multi-GPU Methods
Material by: Nick Johnson
Why would you want to use more than one GPU?
- Massively (Massively) parallel performance
- Uncommon to find GPU-enabled HPC machines with only a single GPU per node
- Memory space can be limiting for larger problems, though less of a problem for newer GPUs
- Time to science!
- More GPUs, more difficulties
- You now need to think about how to do things across GPUs
- What goes where: which GPU, which host?
- What are the costs of moving data versus replicating effort?
- Synchronisation across devices?
- Do we require some extra host-side co-ordination?
A simple example
- Our traditional dense Matrix-Vector kernel
- Send all of the matrix and all of the vector to the single GPU, and bring back all of the result vector
- nBlocks = rows
- nThreads/block = cols
- It's easy enough to visualise how this works across SMs (blocks); a sketch of such a kernel is below
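- An illustrative sketch (the kernel and pointer names matvec, d_A, d_x, d_y are assumptions, not the exercise code; the reduction assumes cols is a power of two):

__global__ void matvec(const double *A, const double *x, double *y, int cols)
{
    extern __shared__ double partial[];
    int row = blockIdx.x;   // nBlocks = rows
    int col = threadIdx.x;  // nThreads/block = cols
    partial[col] = A[row * cols + col] * x[col];
    __syncthreads();
    // Tree reduction within the block (assumes cols is a power of two)
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (col < stride) partial[col] += partial[col + stride];
        __syncthreads();
    }
    if (col == 0) y[row] = partial[0];
}
// Launch: matvec<<<rows, cols, cols * sizeof(double)>>>(d_A, d_x, d_y, cols);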
- Now you need to split it up further
- Send half of matrix A to GPU 0 (first 16 rows)
- Send the other half (second 16 rows) to GPU 1
- Send vector X to BOTH!
- Compute M.V on the half-sized matrices (threads/block = 32, nBlocks = 16)
- Synchronise on the host
- Copy back to a host vector with an offset.
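- Host-side, those steps might look like the sketch below for a 32x32 matrix (error checking omitted; A, x, y are assumed host arrays, matvec is the kernel sketched above, and all names are illustrative):

double *d_A[2], *d_x[2], *d_y[2];  // per-GPU device pointers
for (int g = 0; g < 2; g++) {
    cudaSetDevice(g);  // select GPU g
    cudaMalloc((void **)&d_A[g], 16 * 32 * sizeof(double));
    cudaMalloc((void **)&d_x[g], 32 * sizeof(double));
    cudaMalloc((void **)&d_y[g], 16 * sizeof(double));
    // 16 rows of A per GPU, but ALL of x to both
    cudaMemcpy(d_A[g], A + g * 16 * 32, 16 * 32 * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_x[g], x, 32 * sizeof(double), cudaMemcpyHostToDevice);
    matvec<<<16, 32, 32 * sizeof(double)>>>(d_A[g], d_x[g], d_y[g], 32);  // nBlocks = 16, threads/block = 32
}
for (int g = 0; g < 2; g++) {
    cudaSetDevice(g);
    cudaDeviceSynchronize();  // synchronise on the host
    // Copy each half back into the host vector at an offset
    cudaMemcpy(y + g * 16, d_y[g], 16 * sizeof(double), cudaMemcpyDeviceToHost);
}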
OpenMP on the host
- Host-side parallelisation framework
- All parallelisation frameworks (MPI, OpenMP, pthreads, etc.) require some extra effort to use
- Good OpenMP technique is a multi-day course in itself
- We will stick with a very simple case here
- 1 OpenMP thread --> 1 CPU (core) --> 1 GPU
- OpenMP is generally associated with parallelising for loops
- The general flow is:
- A parallel region which surrounds our work and spawns N threads (I have forced this to be equal to the number of GPUs)
- 1 OpenMP thread --> 1 CPU (core) --> 1 GPU
- N threads executing code inside this region, in parallel with some short sections of serial execution
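- A minimal sketch of that flow (assumes <omp.h> and the CUDA runtime header are included; nGPUs and tid are illustrative names):

int nGPUs;
cudaGetDeviceCount(&nGPUs);  // one OpenMP thread per visible GPU
#pragma omp parallel num_threads(nGPUs)
{
    int tid = omp_get_thread_num();  // 0 .. nGPUs-1
    cudaSetDevice(tid);              // 1 OpenMP thread --> 1 GPU
    // ... per-GPU allocations, copies and kernel launches go here ...
}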
OpenMP
Directives are denoted by #pragma omp
- We need to employ a few tricks to get this to work correctly
- We switch between GPUs with cudaSetDevice()
- We also need to think (very briefly) about streams
- Each OpenMP thread runs in parallel, so we call cudaSetDevice() with our OpenMP thread number to select a GPU
- We have our main loop inside the parallel region.
Example code
#pragma omp parallel shared(...) private(...) default(...)
{
    while (mainloop < ARRAY_SIZE) {
        // ... some parallel execution of GPU kernels, copies etc.
        #pragma omp single
        {
            // ... code for a single CPU THREAD (compute Rs)
        }
        // ... more code to be executed on BOTH GPUs etc.
    } // End of main loop
} // End of parallel region
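- Note: #pragma omp single has an implicit barrier at its end, so both threads wait until the single-threaded work (computing Rs) has finished before continuing with the GPU work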
private(variables)
- #pragma omp parallel default(shared) private(variables)
- Strictly, this denotes that those variables are private to each thread in the parallel region
- Everything which must remain private to a single GPU goes in this clause
- Everything else is shared
- What happens at the end?
- "shared" variables keep their value
- "private" variables dont! (we can make this happen but we don't need it here)
Multi-GPU exercise
- You will find this in the usual place under exercises/exalat/multi
- 1. Start by trying to implement the kernels as you did in the single-GPU exercise.
- 2. Then work on the first matrix vector operation (Ax).
- 3. Then work on implementing the main code as far as calculating the initial residual.
- 4. Then you can uncomment the rest of the loops and work through those.
- I found it useful to print out the values of R, alpha, beta and mainloop on each iteration.
- Compare with your values from the single-GPU case - they should be equal!