OpenACC
Introduction
OpenACC (for open accelerators) is a programming standard for parallel computing developed by Cray, CAPS, Nvidia and PGI. The standard is designed to simplify parallel programming of heterogeneous CPU/GPU systems. [1]
As in OpenMP, the programmer annotates C, C++ and Fortran source code with compiler directives and additional functions to identify the regions that should be accelerated (see here for more information). As with OpenMP 4.0 and newer, the code can run on both the CPU and the GPU.
OpenACC support is available in the commercial compilers from PGI. Tutorials and documentation can be found at http://www.openacc.org/.
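The directive-based annotation described above can be illustrated with a minimal sketch. The function name scale_add and the choice of data clauses are assumptions made for illustration only and are not taken from the example used later on this page. A compiler without OpenACC support ignores the pragma and builds the loop as ordinary serial code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: y[i] = a * x[i] + y[i] for i in [0, n).
   The pragma asks an OpenACC compiler to offload the loop to the
   accelerator. copyin: x is only read on the device.
   copy: y is transferred to the device and back to the host. */
void scale_add(int n, float a, const float *x, float *y)
{
    assert(x != NULL && y != NULL);

    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```

The loop body itself is unchanged C; only the directive line is OpenACC-specific, which is why the same source file can be built as a serial or as an accelerated program.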
Example: Jacobi Iteration
The code
The following example is taken from this link and the code can be downloaded here or using
wget https://raw.githubusercontent.com/parallel-forall/cudacasts/master/ep3-first-openacc-program/laplace2d.c
wget https://raw.githubusercontent.com/parallel-forall/cudacasts/master/ep3-first-openacc-program/timer.h
The code performs a Jacobi iteration on a 4096x4096 grid.
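For readers unfamiliar with the method, one Jacobi sweep can be sketched as follows. This is not the downloaded laplace2d.c but a simplified illustration under stated assumptions: the function name jacobi_sweep, the flat row-major array layout, and the use of double precision are all chosen here for clarity. Each interior point is replaced by the average of its four neighbours, and the largest change is returned so that the caller can decide when to stop iterating.

```c
#include <assert.h>
#include <stddef.h>

/* One Jacobi sweep over the interior of an n x m row-major grid
   (hypothetical helper, not part of laplace2d.c).
   Reads from a, writes the averaged values to anew, and returns the
   largest absolute change seen at any interior point. Boundary
   entries of anew are left untouched. */
double jacobi_sweep(size_t n, size_t m, const double *a, double *anew)
{
    assert(n >= 3 && m >= 3);   /* need at least one interior point */

    double max_err = 0.0;
    for (size_t j = 1; j < n - 1; j++) {
        for (size_t i = 1; i < m - 1; i++) {
            size_t k = j * m + i;
            /* average of the right, left, upper and lower neighbours */
            double v = 0.25 * (a[k + 1] + a[k - 1] + a[k - m] + a[k + m]);
            double diff = v > a[k] ? v - a[k] : a[k] - v;
            if (diff > max_err) {
                max_err = diff;
            }
            anew[k] = v;
        }
    }
    return max_err;
}
```

A full solver would call such a sweep repeatedly, copying anew back into a after each sweep, until the returned error drops below a tolerance or an iteration limit is reached. On a grid whose left boundary column is 1 and whose other entries are 0, the first sweep returns 0.25.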
Modules
Since we need a compiler with OpenACC support and want to use the GPUs, we load the following modules:
module load PGI CUDA-Toolkit
Serial execution of the program
The serial version of the program can be compiled with
pgcc -fast -o laplace2d_ser laplace2d.c
To run the executable on a compute node we can use the command
srun -p carl.p ./laplace2d_ser
which after some time should print
Jacobi relaxation Calculation: 4096 x 4096 mesh
0, 0.250000
100, 0.002397
200, 0.001204
300, 0.000804
400, 0.000603
500, 0.000483
600, 0.000403
700, 0.000345
800, 0.000302
900, 0.000269
total: 92.824342 s
srun: error: mpcl001: task 0: Exited with exit code 20
The total runtime may differ depending on the compute node used and on other jobs that may be running on that node (use --exclusive to rule that out). Note: the exit code 20 (or any other number) is caused by a missing return 0 and may be ignored (or corrected in the code).
Parallel Execution Using OpenMP
To compile an OpenMP-parallel version of the code use the command
pgcc -fast -mp -o laplace2d_omp laplace2d.c
where the option -mp enables OpenMP compilation. The OpenMP-parallel program can be executed with
export OMP_NUM_THREADS=24
srun -p carl.p -n 1 -c 24 ./laplace2d_omp
The first line sets the number of threads that should be used (note that srun passes the user environment on to the program it executes). The second line uses srun to request a single task with 24 cores per task, which ensures that all CPUs (cores) are on the same node and that the program is executed only once. The output should look like this:
Jacobi relaxation Calculation: 4096 x 4096 mesh
0, 0.250000
100, 0.002397
200, 0.001204
300, 0.000804
400, 0.000603
500, 0.000483
600, 0.000403
700, 0.000345
800, 0.000302
900, 0.000269
total: 8.505286 s
Note that the speedup on 24 cores is only about a factor of 11 (92.8 s / 8.5 s), a parallel efficiency of roughly 45%, so it is probably more efficient to use fewer cores for this example (with 12 cores the run takes about 13 seconds, an efficiency of roughly 60%).
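The -mp option only has an effect where the source contains OpenMP directives. As a minimal sketch of what such a directive looks like, the following hypothetical helper (the name max_abs_diff is an assumption and does not appear in laplace2d.c) computes the largest absolute difference between two arrays, the kind of error measure a Jacobi iteration reports. The reduction(max:err) clause, available since OpenMP 3.1, gives each thread a private copy of err and combines them at the end of the loop.

```c
#include <assert.h>

/* Largest absolute difference between two arrays of length n
   (hypothetical helper for illustration). With OpenMP enabled the
   loop iterations are split across threads; without it the pragma is
   ignored and the loop runs serially with the same result. */
double max_abs_diff(int n, const double *a, const double *b)
{
    assert(n >= 0);

    double err = 0.0;
    #pragma omp parallel for reduction(max:err)
    for (int i = 0; i < n; i++) {
        double d = a[i] > b[i] ? a[i] - b[i] : b[i] - a[i];
        if (d > err) {
            err = d;
        }
    }
    return err;
}
```

The number of threads used for such a loop is taken from OMP_NUM_THREADS at run time, so the same executable can be timed with different core counts without recompiling.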
Using a GPU with OpenACC
In order to use the GPUs you need to compile the code with
pgcc -fast -acc -ta=tesla:cc60 -Minfo=accel -o laplace2d_acc laplace2d.c
where -acc tells the compiler to interpret the OpenACC directives (the lines in the code starting with #pragma acc). The option