ExaLAT
Material by: Nick Johnson
Why should we be interested in GPUs?
- Massively parallel performance
- Very good at BLAS-like operations
- Established hardware available in many places
- Established programming language with lots of examples, help etc.
- Good suite of tools, compilers, profilers etc.
- C, C++ and Fortran (plus others) supported by vendor
- Why wouldn't you want to use CUDA?
- Low but not zero bar to entry
- Not "open" and not necessarily available on Intel/AMD hardware
- For a small code, or a quick test of something, even that low bar can feel high to a new user
- Can be a bit "boilerplate" in places compared to, for example, directives
Example Application: Conjugate Gradient
- How to exploit the available resources:
- Make sure you can decompose the problem across GPUs
- Ideally, make it decomposable across blocks and threads
- Minimise any halo swapping operations
- The ideal code can adapt to use N GPUs at runtime (see the sketch after this list)
- You don't want to have to tweak the code for each set of resources
- Multiple ways to do this
- ... of escalating complexity
- Getting the host code into good shape, and then getting a multi-block GPU version working well on a single GPU, helps a lot!
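A minimal sketch of the "adapt to use N GPUs at runtime" point, assuming a 1D problem of length n; the function name and the simple contiguous chunking are my own choices, not part of the exercise code:

#include <cuda_runtime.h>

/* Discover how many GPUs are present at runtime and split a length-n
   problem into contiguous chunks, one per GPU. Nothing is hard-coded,
   so the same code runs on 1 GPU or 8. */
void split_across_gpus(int n)
{
    int ngpu = 0;
    cudaGetDeviceCount(&ngpu);
    if (ngpu < 1) return;                       /* no CUDA device visible */

    int chunk = (n + ngpu - 1) / ngpu;          /* ceiling division */
    for (int d = 0; d < ngpu; d++) {
        int start = d * chunk;
        int count = (start + chunk <= n) ? chunk : n - start;
        cudaSetDevice(d);                       /* later allocations/launches target GPU d */
        /* ... allocate, copy and launch for elements [start, start + count) ... */
    }
}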
Example Application: Conjugate Gradient
- Today's exercises are based around building a conjugate gradient implementation.
- What is CG?
- Effectively an iterative algorithm to find x given a system Ax = b
- Some may be familiar with this algorithm and variations thereof
- We will look at the most basic version
- This does not involve any preconditioners etc
- A is a simple square matrix with values between 0 and 1 on the leading diagonal
- x is all zeros and b is all ones
Overarching theme of today's exercises.
Img from: https://en.wikipedia.org/wiki/Conjugate_gradient_method
- Basic CG
- On multiple GPUs
- Optimally
- Remotely
Overarching theme of today's exercises.
Img from: https://en.wikipedia.org/wiki/Conjugate_gradient_method
- Let's break it down into some smaller blocks.
- We've got some basic building blocks here
- They are standard mathematical functions which we should be able to do on paper
- There are 3 different classes of function; what are they? (A plain-C sketch of the full loop follows below, for reference.)
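The iteration itself only appears here as an image, so here is a plain-C reference of the unpreconditioned loop; it is a sketch (the helper names, caller-provided workspaces and single precision are my own choices), but it is handy later for checking that the GPU answer is correct:

#include <math.h>

/* Reference matrix-vector product: out = A v (A stored row-major, n x n). */
static void matvec(int n, const float *A, const float *v, float *out)
{
    for (int i = 0; i < n; i++) {
        out[i] = 0.0f;
        for (int j = 0; j < n; j++)
            out[i] += A[i * n + j] * v[j];
    }
}

/* Reference dot product: a[i]*b[i] reduced into a single scalar. */
static float dot(int n, const float *a, const float *b)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

/* Unpreconditioned CG for Ax = b. x holds the initial guess on entry and the
   answer on exit; r, p and Ap are caller-provided length-n workspaces. */
void cg_reference(int n, const float *A, const float *b, float *x,
                  float *r, float *p, float *Ap, float tol, int max_iter)
{
    matvec(n, A, x, Ap);
    for (int i = 0; i < n; i++) {                /* r = b - Ax,  p = r */
        r[i] = b[i] - Ap[i];
        p[i] = r[i];
    }
    float rsold = dot(n, r, r);

    for (int it = 0; it < max_iter && sqrtf(rsold) >= tol; it++) {
        matvec(n, A, p, Ap);                     /* matrix-vector product */
        float alpha = rsold / dot(n, p, Ap);     /* scalars come from reductions */
        for (int i = 0; i < n; i++) {            /* vector-vector updates */
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        float rsnew = dot(n, r, r);
        for (int i = 0; i < n; i++)
            p[i] = r[i] + (rsnew / rsold) * p[i];
        rsold = rsnew;
    }
}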
How to begin...
- Let's start with the most basic building blocks
- If we implement these, we can begin to build up our arsenal of kernels that we can use to construct our vanilla-CG algorithm (one such kernel is sketched after this list)
- Vector something Vector
- Scale by a value
- Add / Subtract
- Vector Product
- Transpose
- Matrix-Vector
- Single value computation
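As a starting point, a minimal sketch of one "Vector something Vector" kernel, an add-with-factor; the kernel name and argument order are illustrative, not the interface the exercise expects:

/* out = a + factor * b (a negative factor gives subtraction).
   One thread per element; the guard handles a partially full last block. */
__global__ void vector_add_scaled(const float *a, const float *b,
                                  float factor, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a[i] + factor * b[i];
}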
Some planning will help a lot
- Things to think about:
- How are you going to move data around?
- How are you going to determine if the answer is correct?
- How are you going to profile it?
- Reductions
- What happens in the first step of a matrix-vector?
- Does it matter what type of memory you use to hold the results?
- Can CUDA help us with this?
Reductions
- What does the below do on the GPU?
- How does each thread (recall they operate roughly in lockstep) know which value of temp it is reading?
- Worse, how does any thread know when to write, or what to write?
- This is called a reduction in most parallel languages: vector tempa[] is reduced into the scalar temp with a + operator
__shared__ int temp;
__shared__ int tempa[THREADS_PER_BLOCK];
int v_idx = threadIdx.x;
for (int i = 0; i < v_idx; i++){
    temp += tempa[i];
}
AtomicAdd
- Enter, your new best friend, AtomicAdd...
- Forces the threads to behave sensibly: the additions are serialised, each one happening atomically
- Not the fastest way to do a reduction, but it is a built-in CUDA function, so the atomic itself is about as fast as it can be
- Note that the implementation does NOT zero the value before starting! (see the host-side sketch after the code below)
__global__ void function(..., float *result){
    ...
    int t_idx = threadIdx.x;
    __shared__ float temp_array[THREADS_PER_BLOCK];
    ...
    atomicAdd(result, temp_array[t_idx]);   /* every thread atomically adds its own element into *result */
}
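A host-side fragment showing one way to handle that: zero the accumulator before every launch. Here vector_vector is the dot-product kernel from exercise 1.1, and d_a, d_b, d_result, blocks and threads are placeholder names:

float zero = 0.0f, h_result = 0.0f;
cudaMemcpy(d_result, &zero, sizeof(float), cudaMemcpyHostToDevice);    /* reset the accumulator */
/* or: cudaMemset(d_result, 0, sizeof(float)); -- an all-zero bit pattern is 0.0f */

vector_vector<<<blocks, threads>>>(d_a, d_b, d_result);                /* threads atomicAdd into *d_result */
cudaMemcpy(&h_result, d_result, sizeof(float), cudaMemcpyDeviceToHost);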
Exercise time
- You can find this exercise in the usual place, under single
- My full implementation (i.e. with MV and V.V) is in the single-solution directory
- Try the following parts of the exercise:
- 1.1 Create a vector dot product kernel (vector_vector)
- 1.2 Create a matrix vector kernel (matrix_vector)
- 1.3 Create the vector minus/add (factor) vector kernels
- 2.1 Profile these kernels to see the effect of atomicAdd being called by all threads in a block
- 2.2 Can you cut this down by doing it on a single thread?
- 3.1 Add a finishing condition (sqrt(R) < 1e-5 ?)
- 4.1 Increase the size of your vectors and matrices until you fill the GPU
- 4.2 Compare the performance in terms of run-time for the kernels vs copying time for the data (see the timing sketch after this list)
- 5.1 Optimise by removing un-necessary copy-backs and re-profile
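For 2.1 a profiler (Nsight Systems / Nsight Compute) is the usual route; for 4.2 a quick alternative is to time kernels and copies separately with CUDA events. A sketch, with placeholder variable and kernel names:

cudaEvent_t start, stop;
float kernel_ms = 0.0f, copy_ms = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
matrix_vector<<<blocks, threads>>>(d_A, d_x, d_y, n);                  /* kernel time */
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&kernel_ms, start, stop);

cudaEventRecord(start);
cudaMemcpy(h_y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);       /* copy time */
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&copy_ms, start, stop);

cudaEventDestroy(start);
cudaEventDestroy(stop);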
Hints
My demo code in the slides for atomicAdd could be reframed as:
int v_idx = threadIdx.x;
__syncthreads();                    /* wait until every thread has written its temp_array entry */
if (v_idx == 0){
    float sum = 0;
    for (int i = 0; i < THREADS_PER_BLOCK; i++){
        sum += temp_array[i];       /* sum the whole block's contributions on one thread */
    }
    atomicAdd(result, sum);         /* one atomicAdd per block instead of one per thread */
}
Hints
I also do a horrible thing in that I use a pointer to a scalar as the return method from a function
__global__ void function(float *A, float *B, float *scalar)
If you combine this with an atomicAdd, you could end up in trouble. Why?
__global__ void function(float *A, float *B, float *scalar){
    ...
    atomicAdd(scalar, values[index]);
    ...
}
Hints
I do too many copies.
- Does the initialisation need to be done on the GPU?
- Do we need vector B on the GPU?
- Do we need to copy everything back to the host, at each step, for a single GPU? (see the sketch after this list)
- If we are worried about space, can we abuse existing space to hold intermediate products?
- Could we combine some steps?
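A sketch of where those questions lead on a single GPU: keep every vector resident on the device for the whole solve and copy back only the small residual scalar each iteration. All names are placeholders, and sqrtf needs <math.h>:

float h_rsnew = 0.0f;
for (int it = 0; it < max_iter; it++) {
    matrix_vector<<<blocks, threads>>>(d_A, d_p, d_Ap, n);
    /* ... alpha, x and r updates all stay on the device ... */

    cudaMemset(d_rsnew, 0, sizeof(float));                    /* zero the accumulator */
    vector_vector<<<blocks, threads>>>(d_r, d_r, d_rsnew);    /* rsnew = r.r */

    cudaMemcpy(&h_rsnew, d_rsnew, sizeof(float),
               cudaMemcpyDeviceToHost);                        /* only 4 bytes per iteration */
    if (sqrtf(h_rsnew) < 1e-5f)                                /* finishing condition from 3.1 */
        break;
    /* ... the p update also stays on the device ... */
}
cudaMemcpy(h_x, d_x, n * sizeof(float), cudaMemcpyDeviceToHost);   /* copy x back once, at the end */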