ExaLAT


Material by: Nick Johnson


Why should we be interested in GPUs?

  • Massively parallel performance
  • Very good at BLAS-like operations
  • Established hardware available in many places
  • Established programming language with lots of examples, help, etc.
  • Good suite of tools: compilers, profilers, etc.
  • C, C++ and Fortran (plus others) supported by the vendor
  • Why wouldn't you want to use CUDA?
    • Low, but not zero, bar to entry
    • Not "open" and not necessarily available on Intel/AMD hardware
    • For a small code, or a quick test of something, even that low bar can feel high to a new user
    • Can be a bit "boilerplate" in places compared to, for example, directives

Example Application: Conjugate Gradient

  • How to exploit the available resources:
    • Make sure you can decompose the problem across GPUs
    • Ideally, make it decomposable across blocks and threads
    • Minimise any halo swapping operations
  • The ideal code can adapt to use N GPUs at runtime
    • You don't want to have to tweak the code for each set of resources
    • Multiple ways to do this
    • ... of escalating complexity
    • Getting the host code into good shape, and then getting a multi-block GPU code working well on a single GPU, helps a lot! (A small sketch of discovering the available GPUs at runtime follows below.)
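
A minimal sketch of that runtime adaptation, assuming a hypothetical distribute_slice() routine for handing out work; the CUDA runtime calls (cudaGetDeviceCount, cudaSetDevice) are the standard ones:

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        int n_gpus = 0;
        // Ask the CUDA runtime how many devices this process can see.
        cudaGetDeviceCount(&n_gpus);
        printf("Found %d GPU(s)\n", n_gpus);

        // One simple scheme: give each device a slice of the problem.
        // distribute_slice() is a hypothetical placeholder, not exercise code.
        for (int dev = 0; dev < n_gpus; dev++) {
            cudaSetDevice(dev);
            /* distribute_slice(dev, n_gpus); */
        }
        return 0;
    }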

Example Application: Conjugate Gradient

  • Today's exercises are based around building a conjugate gradient implementation.
    • What is CG?
    • Effectively an iterative algorithm to find x given a system Ax = b
    • Some may be familiar with this algorithm and variations thereof
  • We will look at the most basic version
  • This does not involve any preconditioners etc
  • A is a simple square matrix with values between 0 and 1 on the leading diagonal
  • x starts as all zeros and b is all ones (a minimal host-side sketch of the loop follows below)
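
For reference, a minimal host-side sketch of that plain CG loop in C, with small serial helpers standing in for the kernels we will build; this is an illustration, not the exercise code:

    #include <math.h>

    /* Serial helpers standing in for the GPU kernels we will write later. */
    static void matvec(const float *A, const float *x, float *y, int n)
    {
        for (int i = 0; i < n; i++) {
            y[i] = 0.0f;
            for (int j = 0; j < n; j++) y[i] += A[i * n + j] * x[j];
        }
    }

    static float dot(const float *a, const float *b, int n)
    {
        float s = 0.0f;
        for (int i = 0; i < n; i++) s += a[i] * b[i];
        return s;
    }

    /* Plain CG for A x = b; x starts at zero, so r = p = b initially. */
    static void cg(const float *A, const float *b, float *x,
                   float *r, float *p, float *Ap,
                   int n, int max_iter, float tol)
    {
        for (int i = 0; i < n; i++) { x[i] = 0.0f; r[i] = b[i]; p[i] = b[i]; }
        float rsold = dot(r, r, n);

        for (int it = 0; it < max_iter && sqrtf(rsold) > tol; it++) {
            matvec(A, p, Ap, n);                                /* Ap = A p         */
            float alpha = rsold / dot(p, Ap, n);
            for (int i = 0; i < n; i++) x[i] += alpha * p[i];   /* x = x + alpha p  */
            for (int i = 0; i < n; i++) r[i] -= alpha * Ap[i];  /* r = r - alpha Ap */
            float rsnew = dot(r, r, n);
            for (int i = 0; i < n; i++) p[i] = r[i] + (rsnew / rsold) * p[i];
            rsold = rsnew;
        }
    }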

Overarching theme of today's exercises.

Basic conjugate gradient algorithm (image from https://en.wikipedia.org/wiki/Conjugate_gradient_method)
  1. Basic CG
  2. On multiple-GPUs
  3. Optimally
  4. Remotely

Overarching theme of today's exercises.

Basic conjugate gradient algorithm (image from https://en.wikipedia.org/wiki/Conjugate_gradient_method)
  1. Let's break it down into some smaller blocks.
  2. We've got some basic building blocks here
  3. They are standard mathematical functions which we should be able to do on paper
  4. There are three different classes of function here; what are they?

How to begin...

  • Let's start with the most basic building blocks
  • If we implement these, we can begin to build up our arsenal of kernels to construct our vanilla-CG algorithm (one such kernel is sketched after this list)
  • Vector-something-Vector operations
    • Scale by a value
    • Add / Subtract
    • Vector Product
    • Transpose
  • Matrix-Vector
  • Single value computation
    • Reduction
    • Max/Min
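
As a sketch of the simplest class above, a vector add/subtract-by-a-factor kernel could look like the following; the kernel name and launch configuration are my own choices rather than part of the exercise skeleton:

    // y[i] = y[i] + factor * x[i]: one element per thread.
    // Pass a negative factor to get the subtract variant.
    __global__ void vec_axpy(float factor, const float *x, float *y, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {                 // guard the final, partially-filled block
            y[i] += factor * x[i];
        }
    }

    // Launched over enough blocks to cover n elements, e.g.
    //   vec_axpy<<<(n + 255) / 256, 256>>>(alpha, d_x, d_y, n);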

Some planning will help a lot

  • Things to think about:
    • How are you going to move data around?
    • How are you going to determine if the answer is correct?
    • How are you going to profile it?
  • Reductions
    • What happens in the first step of a matrix-vector? (sketched below)
    • Does it matter what type of memory you use to hold the results?
    • Can CUDA help us with this?
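
To make the first question concrete, a naive matrix-vector kernel might leave each thread holding one partial product in shared memory, which still has to be summed; a sketch, assuming one block per row and one thread per column, is below:

    #define THREADS_PER_BLOCK 256

    // One block per row, one thread per column (assumes n == THREADS_PER_BLOCK).
    // Each thread computes a single partial product; turning tempa[] into one
    // number per row is exactly the reduction discussed on the next slide.
    __global__ void matrix_vector_partial(const float *A, const float *x,
                                          float *y, int n)
    {
        __shared__ float tempa[THREADS_PER_BLOCK];
        int row = blockIdx.x;
        int col = threadIdx.x;

        tempa[col] = A[row * n + col] * x[col];
        __syncthreads();   // all partial products for this row are now visible

        // y[row] should become the sum of tempa[0..n-1] -- but how?
    }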

Reductions

  • What does the below do on the GPU?
  • How does each thread (recall they operate roughly in lockstep) know which value of temp it is reading?
  • Worse, how does any thread know when to write, or what to write?
  • This is called a reduction in most parallel languages: the vector tempa[] is reduced into the scalar temp with a + operator


    __shared__ int temp;
    __shared__ int tempa[THREADS_PER_BLOCK];
    int v_idx = threadIdx.x;
    // Every thread reads and updates the shared scalar temp at the same time
    // (a race), and each thread sums a different number of elements.
    for (int i = 0; i < v_idx; i++){
        temp += tempa[i];
    }
	
    

AtomicAdd

  • Enter, your new best friend, AtomicAdd...
  • Forces the threads to behave sensibly by adding their contributions one at a time (atomically), though in no guaranteed order
  • Not the fastest method to do a full reduction, but it is a built-in CUDA function, so each individual update is about as fast as it can be
  • Note that the implementation of this does NOT zero the value before starting!


	__global__ void function(..., float *result){
	...
	int t_idx = threadIdx.x;
	__shared__ float temp_array[THREADS_PER_BLOCK];
	// Every thread in the block makes its own atomic addition into *result.
	atomicAdd(result, temp_array[t_idx]);
    

Exercise time

  • You can find this exercise in the usual place, under single
  • My full implementation (i.e. with MV and V.V) is in the single-solution directory
  • Try the following pieces of exercise:
  • 1.1 Create a vector dot product kernel (vector_vector)
  • 1.2 Create a matrix vector kernel (matrix_vector)
  • 1.3 Create the vector minus/add (by a factor) vector kernels
  • 2.1 Profile these kernels to see the effect of atomicAdd being called by every thread in a block
  • 2.2 Can you cut this down by doing it on a single thread?
  • 3.1 Add a finishing condition (sqrt(R) < 1e-5 ?)
  • 4.1 Increase the size of your vectors and matrices until you fill the GPU
  • 4.2 Compare the performance in terms of run-time for the kernels vs copying time for the data
  • 5.1 Optimise by removing unnecessary copy-backs and re-profile

Hints

My demo code in the slides for atomicAdd could be reframed to


	int v_idx = threadIdx.x;
	if (v_idx == 0){
		// Thread 0 sums the whole shared array serially and then makes a
		// single atomic update, instead of one atomicAdd per thread.
		float sum = 0;
		for (int i = 0; i < THREADS_PER_BLOCK; i++){
			sum += temp_array[i];
		}
		atomicAdd(result, sum);
	}
    

Hints

I also do a horrible thing in that I use a pointer to a scalar as the return method from a function:


	__global__ void function(float *A, float *B, float *scalar)
    

If you combine this with an atomicAdd, you could end up in trouble. Why?


	__global__ void function(float *A, float *B, float *scalar)
	...
	atomicAdd(scalar, values[index]);
	...
    
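
The earlier note is the key: atomicAdd does not zero its target, so if the same device scalar is reused across iterations or launches the sums accumulate. One way around it, assuming a device pointer d_scalar and some launch configuration, is to clear it before each launch:

    // Zero the accumulator before every launch that atomically adds into it.
    cudaMemset(d_scalar, 0, sizeof(float));
    function<<<blocks, threads>>>(d_A, d_B, d_scalar);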

Hints

I do too many copies.
  • Does the initialisation need to be done on the GPU?
  • Do we need vector B on the GPU?
  • Do we need to copy everything back to the host at each step, for a single GPU? (A sketch follows this list.)
  • If we are worried about space, can we abuse existing space to hold intermediate products?
  • Could we combine some steps?
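
As a hedged fragment of what fewer copies can look like on a single GPU (d_rr here is an assumed device pointer holding the residual dot product, not a name from the exercise code):

    // Inside the iteration loop: the vectors stay resident on the device;
    // only the residual norm (one float) comes back for the convergence test.
    float rr_host = 0.0f;
    cudaMemcpy(&rr_host, d_rr, sizeof(float), cudaMemcpyDeviceToHost);
    if (sqrtf(rr_host) < 1e-5f) {
        // Converged: copy x back to the host once, here, not every iteration.
    }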