Using Directives


Material by: Mark Bull, Alan Gray, Alistair Hart (Cray), Kevin Stratford

EPCC Logo

Aim

An instruction, or directive, to compiler to generate parallel code for GPU (more generally, "accelerator")
  • OpenACC
  • OpenMP

OpenACC: Directives for Accelerators

Standard first announced in November 2011
  • http://www.openacc.org/
  • Many relevant industry bodies involved
  • C/C++ or Fortran
  • Support from e.g., Cray, NVIDIA, (Gnu since gcc 7) compilers

Directive format

C:

#pragma acc parallel [clause ...]
{
  /* ...structured block .. */
}
      
Fortran:

!$acc parallel [clause ...]
! ... structured block ...
!$acc end parallel 
      
Both ignored if non-OpenACC compilation

Parallelism based on loops

C:

#pragma acc parallel loop
for (n = 0; n < NSIZE; n++) {
  /* ...structured block .. */
}
      
Fortran:

!$acc parallel loop
do n = 1, NSIZE
  ! ...loop body...
end do
!$acc end parallel loop
      

More than one loop

C:

#pragma acc parallel
{
  #pragma acc loop
  for (i = 0; i < NI; i++) {
    /* ... loop one ... */
  }

  ...

  #pragma acc loop
  for (i = 0; i < NI; j++) {
    /* ... loop two ... */
  }
}
      
Fortran:

!$acc parallel
  ...
  !$acc loop
  do i = 1, NI
    ! ... loop one ..
  end do
  !$acc end loop

  ...

  !$acc loop
  do i = 1, NI
    ! ... loop two ...
  end do
  !$acc end loop

!$acc end parallel
      
Note parallel may do little in itself
  • Combined with loop to distribute work

How and where is work distributed?

  • OpenACC abstracts hardware as gangs, vectors
    • Gang $\sim$ CUDA block of threads
    • Vector $\sim$ CUDA thread
  • Use both blocks and threads

#pragma acc loop gang vector
     
  • May be used with loop or parallel loop

Controlling execution

  • Set the number of blocks
    • num_gangs(nblocks) clause
  • Set the number of threads per block
    • vector_length(nthreads) clause
    • nthreads one of 1, 32, 64, 128 (the default), 256, 512, 1024
  • May be used with loop or parallel loop

Clauses

  • Many other clauses
    • seq: loop should be executed sequentially
    • if (condition):conditional execution on accelerator
    • reduction: sum, minimum, maximum operations
    • collapse(): merge loop nests
    • ...

Reductions


/* Consider... */

total = 0.0;
for (i = 1; i < NI; i++) {
  total += data[i];
}

/* In OpenACC: */ total = 0.0; #pragma acc loop reduction(+: total) for (i = 0; i < NI; i++) { total += data[i]; }

Data movement

Consider:

!$acc parallel loop
do i = 1, NI
  output(i) = input(i)
end do
!$acc end parallel loop
    
Compiler may identify relevant data transfers
  • On entry: input() copied to device
  • On exit: output() copied back to host
For C (especially) help may be required

From parallel loop


/* Declare shape of arrays [] */

#pragma acc parallel loop copyin(in[0:NI]) copyout(out[0:NI])
for (i = 1; i < NI; i++) {
  out[i] = in[i];
}

/* copyin()  - copy to GPU at start */
/* copyout() - copy to host at end  */
    

Explicit control of data movement

Use the data directive
C:

#pragma acc data [clause ...]
{
  /* ... structured block ... */
}
    
Fortran:

!$acc data [clause ...]
! ... structured block ...
!$acc end data
    

Preventing unnecessary transfers


!$acc data copyin(input(1:NI)) copyout(output(1:NI))

!$acc parallel loop
do i = 1, NI
  output(i) = input(i)
end do
!$acc end parallel loop

write (*,*) "Copied input to output"

!$acc parallel loop
do i = 1, NI
  output(i) = function_of(output(i))
end do
!$add end parallel loop

!$acc end data
    

Data clauses

copyin
Copy data to accelerator at start of region
copyout
Copy data to host at end of region
copy
Do both (copyin and copyout)
create
No copy at all. Allocate temporary storage on device

Reference counting


program example
  integer :: a(10000)

  !$acc data copy(a)

  !$acc parallel loop
    ! ... initialise a()
  !$add end parallel loop

  call my_double(a)

  !$acc end data
end program example
      

subroutine my_double(b)
  integer, intent(inout) :: b(:)
  integer :: i

  !$acc parallel loop copy(b)
  do i = 1, size(b)
    b(i) = 2*b(i)
  end do
  !$acc end parallel loop

end subroutine my_double
    
Runtime counts references to avoid unnecessary copies

Data for unstructured regions



    /* Consider */

    void my_init(my_t * a) {

      a->data = (double *) malloc(a->n*sizeof(double));
      #pragma acc enter data copyin(a)
      #pragma acc enter data copyin(a->data[a->n])
    }

    ...

    void my_finalise(my_t * a) {

      #pragma acc exit data delete(a->data)
      #pragma acc exit data delete(a)
      free(a->data);
    }
    

Summary

  • Directives ask the compiler to do the work
    • Faster in development
    • More portable (?)
    • Less direct control (and perhaps poorer performance)
    • May be intractable for very complex code