ExaLAT


Material by: Nick Johnson


Why should we be interested in GPUs?

  • Massively parallel performance
  • Very good at BLAS-like operations
  • Established hardware available in many places
  • Established programming language with lots of examples, help, etc.
  • Good suite of tools: compilers, profilers, etc.
  • C, C++ and Fortran (plus others) supported by the vendor
  • Why wouldn't you want to use CUDA?
    • Low, but not zero, bar to entry
    • Not "open" and not necessarily available on Intel/AMD hardware
    • For a small code, or a quick test of something, even that low bar can feel high to a new user
    • Can be a bit "boilerplate" in places compared to, for example, directives

Example Application: Conjugate Gradient

  • How to exploit the available resources:
    • Make sure you can decompose the problem across GPUs
    • Ideally, make it decomposable across blocks and threads
    • Minimise any halo swapping operations
  • The ideal code can adapt to use N GPUs at runtime
    • You don't want to have to tweak the code for each set of resources
    • Multiple ways to do this
    • ... of escalating complexity
    • Getting the host code into good shape, and then getting a multi-block GPU code working well on a single GPU, helps a lot! (A small sketch of discovering the available GPUs at runtime follows below.)
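
A minimal sketch of that runtime adaptation, assuming a hypothetical distribute_slice() routine for handing out work; the CUDA runtime calls (cudaGetDeviceCount, cudaSetDevice) are the standard ones:

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        int n_gpus = 0;
        // Ask the CUDA runtime how many devices this process can see.
        cudaGetDeviceCount(&n_gpus);
        printf("Found %d GPU(s)\n", n_gpus);

        // One simple scheme: give each device a slice of the problem.
        // distribute_slice() is a hypothetical placeholder, not exercise code.
        for (int dev = 0; dev < n_gpus; dev++) {
            cudaSetDevice(dev);
            /* distribute_slice(dev, n_gpus); */
        }
        return 0;
    }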

Example Application: Conjugate Gradient

  • Today's exercises are based around building a conjugate gradient implementation.
    • What is CG?
    • Effectively an iterative algorithm to find x given a system Ax = b
    • Some may be familiar with this algorithm and variations thereof
  • We will look at the most basic version
  • This does not involve any preconditioners etc
  • A is a simple square matrix with values between 0 and 1 on the leading diagonal
  • x starts as all zeros and b is all ones (a minimal host-side sketch of the loop follows below)
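
For reference, a minimal host-side sketch of that plain CG loop in C, with small serial helpers standing in for the kernels we will build; this is an illustration, not the exercise code:

    #include <math.h>

    /* Serial helpers standing in for the GPU kernels we will write later. */
    static void matvec(const float *A, const float *x, float *y, int n)
    {
        for (int i = 0; i < n; i++) {
            y[i] = 0.0f;
            for (int j = 0; j < n; j++) y[i] += A[i * n + j] * x[j];
        }
    }

    static float dot(const float *a, const float *b, int n)
    {
        float s = 0.0f;
        for (int i = 0; i < n; i++) s += a[i] * b[i];
        return s;
    }

    /* Plain CG for A x = b; x starts at zero, so r = p = b initially. */
    static void cg(const float *A, const float *b, float *x,
                   float *r, float *p, float *Ap,
                   int n, int max_iter, float tol)
    {
        for (int i = 0; i < n; i++) { x[i] = 0.0f; r[i] = b[i]; p[i] = b[i]; }
        float rsold = dot(r, r, n);

        for (int it = 0; it < max_iter && sqrtf(rsold) > tol; it++) {
            matvec(A, p, Ap, n);                                /* Ap = A p         */
            float alpha = rsold / dot(p, Ap, n);
            for (int i = 0; i < n; i++) x[i] += alpha * p[i];   /* x = x + alpha p  */
            for (int i = 0; i < n; i++) r[i] -= alpha * Ap[i];  /* r = r - alpha Ap */
            float rsnew = dot(r, r, n);
            for (int i = 0; i < n; i++) p[i] = r[i] + (rsnew / rsold) * p[i];
            rsold = rsnew;
        }
    }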

Overarching theme of today's exercises.

Basic conjugate gradient algorithm (image from https://en.wikipedia.org/wiki/Conjugate_gradient_method)
  1. Basic CG
  2. On multiple-GPUs
  3. Optimally
  4. Remotely

Overarching theme of today's exercises.

Basic conjugate gradient algorithm (image from https://en.wikipedia.org/wiki/Conjugate_gradient_method)
  1. Let's break it down into some smaller blocks.
  2. We've got some basic building blocks here
  3. They are standard mathematical functions which we should be able to do on paper
  4. There are three different classes of function here; what are they?

How to begin...

  • Let's start with the most basic building blocks
  • If we implement these, we can begin to build up our arsenal of kernels to construct our vanilla-CG algorithm (one such kernel is sketched after this list)
  • Vector-something-Vector operations
    • Scale by a value
    • Add / Subtract
    • Vector Product
    • Transpose
  • Matrix-Vector
  • Single value computation
    • Reduction
    • Max/Min
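
As a sketch of the simplest class above, a vector add/subtract-by-a-factor kernel could look like the following; the kernel name and launch configuration are my own choices rather than part of the exercise skeleton:

    // y[i] = y[i] + factor * x[i]: one element per thread.
    // Pass a negative factor to get the subtract variant.
    __global__ void vec_axpy(float factor, const float *x, float *y, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {                 // guard the final, partially-filled block
            y[i] += factor * x[i];
        }
    }

    // Launched over enough blocks to cover n elements, e.g.
    //   vec_axpy<<<(n + 255) / 256, 256>>>(alpha, d_x, d_y, n);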

Some planning will help a lot

  • Things to think about:
    • How are you going to move data around?
    • How are you going to determine if the answer is correct?
    • How are you going to profile it?
  • Reductions
    • What happens in the first step of a matrix-vector? (sketched below)
    • Does it matter what type of memory you use to hold the results?
    • Can CUDA help us with this?
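
To make the first question concrete, a naive matrix-vector kernel might leave each thread holding one partial product in shared memory, which still has to be summed; a sketch, assuming one block per row and one thread per column, is below:

    #define THREADS_PER_BLOCK 256

    // One block per row, one thread per column (assumes n == THREADS_PER_BLOCK).
    // Each thread computes a single partial product; turning tempa[] into one
    // number per row is exactly the reduction discussed on the next slide.
    __global__ void matrix_vector_partial(const float *A, const float *x,
                                          float *y, int n)
    {
        __shared__ float tempa[THREADS_PER_BLOCK];
        int row = blockIdx.x;
        int col = threadIdx.x;

        tempa[col] = A[row * n + col] * x[col];
        __syncthreads();   // all partial products for this row are now visible

        // y[row] should become the sum of tempa[0..n-1] -- but how?
    }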

Reductions

  • What does the below do on the GPU?
  • How does each thread (recall they operate roughly in lockstep) know which value of temp it is reading?
  • Worse, how does any thread know when to write, or what to write?
  • This is called a reduction in most parallel languages: the vector tempa[] is reduced into the scalar temp with a + operator


    __shared__ int temp;
    __shared__ int tempa[THREADS_PER_BLOCK];
    int v_idx = threadIdx.x;
    // Every thread reads and updates the shared scalar temp at the same time
    // (a race), and each thread sums a different number of elements.
    for (int i = 0; i < v_idx; i++){
        temp += tempa[i];
    }
	
    

AtomicAdd

  • Enter, your new best friend, AtomicAdd...
  • Forces the threads to behave sensibly by adding their contributions one at a time (atomically), though in no guaranteed order
  • Not the fastest method to do a full reduction, but it is a built-in CUDA function, so each individual update is about as fast as it can be
  • Note that the implementation of this does NOT zero the value before starting!


	__global__ void function(..., float *result){
	...
	int t_idx = threadIdx.x;
	__shared__ float temp_array[THREADS_PER_BLOCK];
	// Every thread in the block makes its own atomic addition into *result.
	atomicAdd(result, temp_array[t_idx]);
    

Exercise time

  • You can find this exercise in the usual place, under single
  • My full implementation (i.e. with MV and V.V) is in the single-solution directory
  • Try the following pieces of exercise:
  • 1.1 Create a vector dot product kernel (vector_vector)
  • 1.2 Create a matrix vector kernel (matrix_vector)
  • 1.3 Create the vector minus/add (by a factor) vector kernels
  • 2.1 Profile these kernels to see the effect of atomicAdd being called by every thread in a block
  • 2.2 Can you cut this down by doing it on a single thread?
  • 3.1 Add a finishing condition (sqrt(R) < 1e-5 ?)
  • 4.1 Increase the size of your vectors and matrices until you fill the GPU
  • 4.2 Compare the performance in terms of run-time for the kernels vs copying time for the data
  • 5.1 Optimise by removing unnecessary copy-backs and re-profile

Hints

My demo code in the slides for atomicAdd could be reframed to


	int v_idx = threadIdx.x;
	if (v_idx == 0){
		// Thread 0 sums the whole shared array serially and then makes a
		// single atomic update, instead of one atomicAdd per thread.
		float sum = 0;
		for (int i = 0; i < THREADS_PER_BLOCK; i++){
			sum += temp_array[i];
		}
		atomicAdd(result, sum);
	}
    

Hints

I also do a horrible thing in that I use a pointer to a scalar as the return method from a function:


	__global__ void function(float *A, float *B, float *scalar)
    

If you combine this with an atomicAdd, you could end up in trouble. Why?


	__global__ void function(float *A, float *B, float *scalar)
	...
	atomicAdd(scalar, values[index]);
	...
    
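
The earlier note is the key: atomicAdd does not zero its target, so if the same device scalar is reused across iterations or launches the sums accumulate. One way around it, assuming a device pointer d_scalar and some launch configuration, is to clear it before each launch:

    // Zero the accumulator before every launch that atomically adds into it.
    cudaMemset(d_scalar, 0, sizeof(float));
    function<<<blocks, threads>>>(d_A, d_B, d_scalar);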

Hints

I do too many copies.
  • Does the initialisation need to be done on the GPU?
  • Do we need vector B on the GPU?
  • Do we need to copy everything back to the host at each step, for a single GPU? (A sketch follows this list.)
  • If we are worried about space, can we abuse existing space to hold intermediate products?
  • Could we combine some steps?
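
As a hedged fragment of what fewer copies can look like on a single GPU (d_rr here is an assumed device pointer holding the residual dot product, not a name from the exercise code):

    // Inside the iteration loop: the vectors stay resident on the device;
    // only the residual norm (one float) comes back for the convergence test.
    float rr_host = 0.0f;
    cudaMemcpy(&rr_host, d_rr, sizeof(float), cudaMemcpyDeviceToHost);
    if (sqrtf(rr_host) < 1e-5f) {
        // Converged: copy x back to the host once, here, not every iteration.
    }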