OpenACC

From HPC users

Introduction

OpenACC (for open accelerators) is a programming standard for parallel computing developed by Cray, CAPS, Nvidia and PGI. The standard is designed to simplify parallel programming of heterogeneous CPU/GPU systems. [1]

As in OpenMP, the programmer can annotate C, C++ and Fortran source code with compiler directives and additional functions to identify the areas that should be accelerated (see here for more information). As with OpenMP 4.0 and newer, code can be executed on both the CPU and the GPU.

Support for OpenACC is available in commercial compilers from PGI. Tutorials and documentation can be found at http://www.openacc.org/.

Example: Jacobi Iteration

The code

The following example is taken from the NVIDIA CUDACasts tutorial on OpenACC; the code can be downloaded using

wget https://raw.githubusercontent.com/parallel-forall/cudacasts/master/ep3-first-openacc-program/laplace2d.c
wget https://raw.githubusercontent.com/parallel-forall/cudacasts/master/ep3-first-openacc-program/timer.h

The code performs a Jacobi iteration on a 4096x4096 grid.

Modules

Since we need a compiler with OpenACC support and want to use the GPUs, we load the following modules:

module load PGI CUDA-Toolkit

Serial execution of the program

The serial version of the program can be compiled with

pgcc -fast -o laplace2d_ser laplace2d.c 

To run the executable on a compute node we can use the command

srun -p carl.p ./laplace2d_ser

which after some time should print

Jacobi relaxation Calculation: 4096 x 4096 mesh
    0, 0.250000
  100, 0.002397
  200, 0.001204
  300, 0.000804
  400, 0.000603
  500, 0.000483
  600, 0.000403
  700, 0.000345
  800, 0.000302
  900, 0.000269
 total: 92.824342 s
srun: error: mpcl001: task 0: Exited with exit code 20

The total runtime may differ depending on the compute node used and on other jobs that may be running on that node (use --exclusive to rule this out). Note: the exit code 20 (or any other number) is caused by a missing return 0 at the end of main and can be ignored (or fixed in the code).

Parallel Execution Using OpenMP

To compile an OpenMP-parallel version of the code use the command

pgcc -fast -mp -o laplace2d_omp laplace2d.c

where the option -mp enables OpenMP compilation. The OpenMP-parallel program can be executed with

export OMP_NUM_THREADS=24
srun -p carl.p -n 1 -c 24 ./laplace2d_omp

The first line sets the number of threads to use (note that the user environment persists when executing a program with srun). The second line uses srun to request a single task with 24 cores per task, which ensures that all cores are on the same node and that the program is executed only once. The output should look like this:

Jacobi relaxation Calculation: 4096 x 4096 mesh
    0, 0.250000
  100, 0.002397
  200, 0.001204
  300, 0.000804
  400, 0.000603
  500, 0.000483
  600, 0.000403
  700, 0.000345
  800, 0.000302
  900, 0.000269
 total: 8.505286 s

Note that the speedup is about a factor of 10 and that it is probably more efficient to use fewer cores for this example (with 12 cores it takes about 13 seconds).

Using a GPU with OpenACC

In order to use the GPUs you need to compile the code with

pgcc -fast -acc -ta=tesla:cc60 -Minfo=accel -o laplace2d_acc laplace2d.c

where -acc tells the compiler to interpret OpenACC directives (the lines with #pragma acc in the code). The option