Using Directives

Material by: Mark Bull, Alan Gray, Alistair Hart (Cray), Kevin Stratford

Aim

An instruction, or directive, to compiler to generate parallel code for GPU (more generally, "accelerator")

OpenACC: Directives for Accelerators

Standard first announced in November 2011

Directive format

C:


#pragma acc parallel [clause ...]
{
  /* ...structured block .. */
}

Fortran:


!$acc parallel [clause ...]
! ... structured block ...
!$acc end parallel

Both ignored if non-OpenACC compilation

Parallelism based on loops

C:


#pragma acc parallel loop
for (n = 0; n < NSIZE; n++) {
  /* ...structured block .. */
}

Fortran:


!$acc parallel loop
do n = 1, NSIZE
  ! ...loop body...
end do
!$acc end parallel loop

More than one loop

C:


#pragma acc parallel
{
  #pragma acc loop
  for (i = 0; i < NI; i++) {
    /* ... loop one ... */
  }

  ...

  #pragma acc loop
  for (i = 0; i < NI; j++) {
    /* ... loop two ... */
  }
}

Fortran:


!$acc parallel
  ...
  !$acc loop
  do i = 1, NI
    ! ... loop one ..
  end do
  !$acc end loop

  ...

  !$acc loop
  do i = 1, NI
    ! ... loop two ...
  end do
  !$acc end loop

!$acc end parallel

Note parallel may do little in itself

How and where is work distributed?

OpenACC abstracts hardware as gangs, vectors
- Gang $\sim$ CUDA block of threads
- Vector $\sim$ CUDA thread
Use both blocks and threads


#pragma acc loop gang vector

May be used with loop or parallel loop

Controlling execution

Set the number of blocks
- num_gangs(nblocks) clause
Set the number of threads per block
- vector_length(nthreads) clause
- nthreads one of 1, 32, 64, 128 (the default), 256, 512, 1024
May be used with loop or parallel loop

Clauses

Many other clauses
- seq: loop should be executed sequentially
- if (condition):conditional execution on accelerator
- reduction: sum, minimum, maximum operations
- collapse(): merge loop nests
- ...

Reductions


/* Consider... */

total = 0.0;
for (i = 1; i < NI; i++) {
  total += data[i];
}

/* In OpenACC: */
total = 0.0;

#pragma acc loop reduction(+: total)
for (i = 0; i < NI; i++) {
  total += data[i];
}

Data movement

Consider:


!$acc parallel loop
do i = 1, NI
  output(i) = input(i)
end do
!$acc end parallel loop

Compiler may identify relevant data transfers
For C (especially) help may be required

From parallel loop


/* Declare shape of arrays [] */

#pragma acc parallel loop copyin(in[0:NI]) copyout(out[0:NI])
for (i = 1; i < NI; i++) {
  out[i] = in[i];
}

/* copyin()  - copy to GPU at start */
/* copyout() - copy to host at end  */

Explicit control of data movement

Use the data directive
C:


#pragma acc data [clause ...]
{
  /* ... structured block ... */
}

Fortran:


!$acc data [clause ...]
! ... structured block ...
!$acc end data

Preventing unnecessary transfers


!$acc data copyin(input(1:NI)) copyout(output(1:NI))

!$acc parallel loop
do i = 1, NI
  output(i) = input(i)
end do
!$acc end parallel loop

write (*,*) "Copied input to output"

!$acc parallel loop
do i = 1, NI
  output(i) = function_of(output(i))
end do
!$add end parallel loop

!$acc end data

Data clauses

copyin: Copy data to accelerator at start of region
copyout: Copy data to host at end of region
copy: Do both (copyin and copyout)
create: No copy at all. Allocate temporary storage on device

Reference counting


program example
  integer :: a(10000)

  !$acc data copy(a)

  !$acc parallel loop
    ! ... initialise a()
  !$add end parallel loop

  call my_double(a)

  !$acc end data
end program example


subroutine my_double(b)
  integer, intent(inout) :: b(:)
  integer :: i

  !$acc parallel loop copy(b)
  do i = 1, size(b)
    b(i) = 2*b(i)
  end do
  !$acc end parallel loop

end subroutine my_double

Runtime counts references to avoid unnecessary copies

Data for unstructured regions



    /* Consider */

    void my_init(my_t * a) {

      a->data = (double *) malloc(a->n*sizeof(double));
      #pragma acc enter data copyin(a)
      #pragma acc enter data copyin(a->data[a->n])
    }

    ...

    void my_finalise(my_t * a) {

      #pragma acc exit data delete(a->data)
      #pragma acc exit data delete(a)
      free(a->data);
    }

Summary

Directives ask the compiler to do the work
- Faster in development
- More portable (?)
- Less direct control (and perhaps poorer performance)
- May be intractable for very complex code