GPU COMPUTING WITH OPENACC
3 WAYS TO ACCELERATE APPLICATIONS

Applications:
  Libraries: "drop-in" acceleration
  OpenACC directives: easily accelerate applications
  Programming languages: maximum flexibility
2
OPENACC DIRECTIVES (CPU + GPU)

Simple compiler hints. The compiler parallelizes the code:

Program myscience
  ... serial code ...        ! your original Fortran or C code (runs on the CPU)
  !$acc kernels              ! OpenACC compiler hint
  do k = 1,n1
    do i = 1,n2
      ... parallel code ...  ! offloaded to the GPU
    enddo
  enddo
  !$acc end kernels
  ...
End Program myscience

Works on many-core GPUs & multicore CPUs
3
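The same pattern carries over directly to C. A minimal sketch (the function name and saxpy-style loop are illustrative, not from the deck); without -acc the pragma is simply ignored and the loop runs serially on the CPU:

```c
#include <stddef.h>

/* Illustrative sketch: a C loop annotated the same way as the
   Fortran example above. With -acc the compiler offloads the loop
   to the GPU; without it, the pragma is ignored and the loop runs
   unchanged on the CPU. */
void scale_add(size_t n, float a, const float *restrict x,
               float *restrict y)
{
    #pragma acc kernels
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```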
FAMILIAR TO OPENMP PROGRAMMERS
OpenMP (CPU):

main() {
    double pi = 0.0; long i;

    #pragma omp parallel for reduction(+:pi)
    for (i = 0; i < N; i++) {
        double t = (double)((i + 0.5) / N);
        pi += 4.0 / (1.0 + t * t);
    }

    printf("pi = %f\n", pi / N);
}

To run on the cluster:
> sbatch runit.acc
> qstat -u        # prints job status
Output is placed in slurm.* when finished.
18
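The OpenMP pi example above maps to OpenACC with essentially a one-line change. A hedged sketch, with the midpoint-rule loop body following the OpenMP slide:

```c
/* Sketch of the OpenMP pi example rewritten with OpenACC: the only
   substantive change is the pragma. Without -acc the loop simply
   runs serially and produces the same answer. */
double pi_acc(long N)
{
    double pi = 0.0;
    #pragma acc parallel loop reduction(+:pi)
    for (long i = 0; i < N; i++) {
        double t = (i + 0.5) / N;     /* midpoint of subinterval i */
        pi += 4.0 / (1.0 + t * t);    /* integrand 4/(1+t^2) */
    }
    return pi / N;                    /* scale by interval width 1/N */
}
```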
EXERCISE 1: JACOBI KERNELS
Task: use acc kernels to parallelize the Jacobi loop nests
Edit laplace2D.c or laplace2D.f90 (your choice) in the 001-laplace2D-kernels directory
Add directives where it helps
Figure out the proper compilation flags to use
Optionally: run the OpenACC version with laplace_acc

Q: Can you get a speedup with just kernels directives?
Versus 1 CPU core? Versus 6 CPU cores?
19
EXERCISE 1 SOLUTION: OPENACC C

while ( error > tol && iter < iter_max ) {
    error = 0.0;

    #pragma acc kernels              /* execute GPU kernel for loop nest */
    for (int j = 1; j < n-1; j++) {
        for (int i = 1; i < m-1; i++) {
            Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                                 A[j-1][i] + A[j+1][i]);
            error = max(error, abs(Anew[j][i] - A[j][i]));
        }
    }

    #pragma acc kernels              /* execute GPU kernel for loop nest */
    for (int j = 1; j < n-1; j++) {
        for (int i = 1; i < m-1; i++) {
            A[j][i] = Anew[j][i];
        }
    }

    iter++;
}
20
EXERCISE 1 SOLUTION: OPENACC FORTRAN

do while ( err > tol .and. iter < iter_max )
  err = 0._fp_kind

  !$acc kernels                ! execute GPU kernel for loop nest
  do j = 1, m
    do i = 1, n
      Anew(i,j) = 0.25 * (A(i+1,j) + A(i-1,j) + &
                          A(i,j-1) + A(i,j+1))
      err = max(err, abs(Anew(i,j) - A(i,j)))
    enddo
  enddo
  !$acc end kernels

  !$acc kernels                ! execute GPU kernel for loop nest
  do j = 1, m-2
    do i = 1, n-2
      A(i,j) = Anew(i,j)
    enddo
  enddo
  !$acc end kernels

  iter = iter + 1
enddo
21
EXERCISE 1: COMPILER OUTPUT (C)

pgcc -tp sandybridge-64 -acc -ta=nvidia -Minfo=accel -o laplace2d_acc laplace2d.c

main:
     56, Generating present_or_copyout(Anew[1:4094][1:4094])
         Generating present_or_copyin(A[:][:])
         Generating Tesla code
     57, Loop is parallelizable
     59, Loop is parallelizable
         Accelerator kernel generated
         57, #pragma acc loop gang /* blockIdx.y */
         59, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
         63, Max reduction generated for error
     68, Generating present_or_copyin(Anew[1:4094][1:4094])
         Generating present_or_copyout(A[1:4094][1:4094])
         Generating Tesla code
     69, Loop is parallelizable
     71, Loop is parallelizable
         Accelerator kernel generated
         69, #pragma acc loop gang /* blockIdx.y */
         71, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
22
EXERCISE 1: PERFORMANCE
CPU: Intel E5-2680 v2, 10 cores @ 2.80 GHz
GPU: NVIDIA Tesla K20m

Execution (4096x4096)      Time (s)   Speedup
CPU, 1 OpenMP thread        108.7     --
CPU, 2 OpenMP threads        58.6     1.85x
CPU, 4 OpenMP threads        31.9     3.41x
CPU, 8 OpenMP threads        20.3     5.35x
OpenACC GPU                 176.1     0.12x  FAIL

CPU speedups are vs. 1 CPU core; the OpenACC speedup is vs. 8 CPU cores.
23
WHAT WENT WRONG? Run with PGI_ACC_TIME=1:

time(us): 101,104,174
56: data region reached 1000 times
    56: data copyin reached 8000 times
        device time(us): total=22,030,081 max=2,858 min=2,746 avg=2,753
    68: data copyout reached 8000 times
        device time(us): total=23,018,701 max=6,552 min=2,855 avg=2,877
56: compute region reached 1000 times
    59: kernel launched 1000 times
        grid: [32x4094] block: [128]
        device time(us): total=6,456,517 max=6,516 min=6,447 avg=6,456
        elapsed time(us): total=6,471,110 max=7,066 min=6,460 avg=6,471
    59: reduction kernel launched 1000 times
        grid: [1] block: [256]
        device time(us): total=270,280 max=276 min=268 avg=270
        elapsed time(us): total=283,763 max=353 min=282 avg=283
68: data region reached 1000 times
    68: data copyin reached 8000 times
        device time(us): total=23,271,701 max=2,946 min=2,891 avg=2,908
    77: data copyout reached 8000 times
        device time(us): total=23,016,095 max=2,993 min=2,854 avg=2,877
68: compute region reached 1000 times
    71: kernel launched 1000 times
        grid: [32x4094] block: [128]
        device time(us): total=3,040,799 max=3,050 min=3,037 avg=3,040
        elapsed time(us): total=3,056,315 max=3,131 min=3,052 avg=3,056

Huge data transfer bottleneck! Computation: 10 seconds; data movement: 90 seconds.
24
BASIC CONCEPTS
[Diagram: data is transferred between CPU memory and GPU memory over the PCI bus; computation is offloaded from the CPU to the GPU.]
For efficiency, decouple data movement and compute off-load.
25
EXCESSIVE DATA TRANSFERS

while ( error > tol && iter < iter_max ) {
    error = 0.0;                    /* A, Anew resident on host */

    #pragma acc kernels             /* copy in: A, Anew now resident on accelerator */
    for (int j = 1; j < n-1; j++) {
        for (int i = 1; i < m-1; i++) {
            Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                                 A[j-1][i] + A[j+1][i]);
            error = max(error, abs(Anew[j][i] - A[j][i]));
        }
    }
    /* copy out: A, Anew resident on host again */
    ...
}

These copies happen on every iteration of the outer while loop!*

*Note: there are two #pragma acc kernels regions, so there are 4 copies per while-loop iteration!
26
Data Management
27
DATA CONSTRUCT Fortran
!$acc data [clause …] structured block !$acc end data
C #pragma acc data [clause …] { structured block }
General clauses:
  if( condition )
  async( expression )
Manage data movement. Data regions may be nested. 28
DATA CLAUSES copy ( list )
Allocates memory on GPU and copies data from host to GPU when entering region and copies data to the host when exiting region.
copyin ( list )
Allocates memory on GPU and copies data from host to GPU when entering region.
copyout ( list )
Allocates memory on GPU and copies data to the host when exiting region.
create ( list )
Allocates memory on GPU but does not copy.
present ( list )
Data is already present on GPU from another containing data region.
Also: present_or_copy[in|out], present_or_create, deviceptr.
29
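The clauses above can be sketched in one small function (the function and array names are illustrative, not from the deck): copy moves the array in and out once, create allocates device-only scratch space, and both kernels regions inside the data region reuse the device copies.

```c
/* Illustrative sketch: `copy(a[0:n])` transfers a to the GPU on
   entry and back on exit; `create(b[0:n])` allocates b on the GPU
   with no host transfer. Between the two kernels regions the data
   stays resident on the device. */
void smooth_once(int n, float *restrict a, float *restrict b)
{
    #pragma acc data copy(a[0:n]) create(b[0:n])
    {
        #pragma acc kernels
        for (int i = 1; i < n - 1; i++)
            b[i] = 0.5f * (a[i-1] + a[i+1]);

        #pragma acc kernels
        for (int i = 1; i < n - 1; i++)
            a[i] = b[i];
    }
}
```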
ARRAY SHAPING
The compiler sometimes cannot determine the size of arrays, so you must specify it explicitly using data clauses and an array "shape".
C (subarrays are [start : length])
#pragma acc data copyin(a[0:size]), copyout(b[s/4:3*s/4])
Fortran (subarrays are (start : end))
!$acc data copyin(a(1:size)), copyout(b(s/4:3*s/4))
Note: data clauses can be used on data, kernels, or parallel directives. 30
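To make the C [start : length] shape concrete, a hedged sketch (function and array names are illustrative): a[0:n] is the whole array, and the second half is written as [n/2 : n - n/2].

```c
/* Illustrative sketch of C subarray shaping: the clause takes
   [start : length], so this moves only the second half of a to
   the device and back. */
void double_second_half(int n, float *restrict a)
{
    #pragma acc data copy(a[n/2 : n - n/2])
    {
        #pragma acc kernels
        for (int i = n / 2; i < n; i++)
            a[i] *= 2.0f;
    }
}
```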
EXERCISE 2: JACOBI DATA DIRECTIVES
Task: use acc data to minimize transfers in the Jacobi example
Start from the given laplace2D.c or laplace2D.f90 (your choice) in the 002-laplace2d-data directory
Add directives where it helps (hint: the [do] while loop)

Q: What speedup can you get with data + kernels directives?
Versus 6 CPU cores? OMP_NUM_THREADS=6 ./laplace2d_omp 31
EXERCISE 2 SOLUTION: OPENACC C

#pragma acc data copy(A), create(Anew)   /* copy A in at start, out at end; allocate Anew on the accelerator */
while ( error > tol && iter < iter_max ) {
    error = 0.0;

    #pragma acc kernels
    for (int j = 1; j < n-1; j++) {
        for (int i = 1; i < m-1; i++) {
            Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                                 A[j-1][i] + A[j+1][i]);
            error = max(error, abs(Anew[j][i] - A[j][i]));
        }
    }

    #pragma acc kernels
    for (int j = 1; j < n-1; j++) {
        for (int i = 1; i < m-1; i++) {
            A[j][i] = Anew[j][i];
        }
    }

    iter++;
}
EXERCISE 2 SOLUTION: OPENACC FORTRAN

!$acc data copy(A), create(Anew)   ! copy A in at start, out at end; allocate Anew on the accelerator
do while ( err > tol .and. iter < iter_max )
  err = 0._fp_kind

  !$acc kernels
  do j = 1, m
    do i = 1, n
      Anew(i,j) = .25_fp_kind * (A(i+1, j  ) + A(i-1, j  ) + &
                                 A(i  , j-1) + A(i  , j+1))
      err = max(err, Anew(i,j) - A(i,j))
    end do
  end do
  !$acc end kernels
  ...
  iter = iter + 1
end do
!$acc end data
EXERCISE 2: PERFORMANCE

Execution (4096x4096)      Time (s)   Speedup
CPU, 8 OpenMP threads       20.3      --
OpenACC K20m               176.1      0.12x
OpenACC K20m-opt             8.34     2.43x

Speedups are vs. 8 CPU cores.
34
WHAT WENT RIGHT?

time(us): 9,901,981
50: data region reached 1 time
    50: data copyin reached 8 times
        device time(us): total=22,039 max=2,763 min=2,751 avg=2,754
    82: data copyout reached 9 times
        device time(us): total=21,851 max=2,734 min=14 avg=2,427
56: compute region reached 1000 times
    59: kernel launched 1000 times
        grid: [32x4094] block: [128]
        device time(us): total=6,437,470 max=6,491 min=6,429 avg=6,437
        elapsed time(us): total=6,452,030 max=7,162 min=6,442 avg=6,452
    59: reduction kernel launched 1000 times
        grid: [1] block: [256]
        device time(us): total=269,570 max=324 min=268 avg=269
        elapsed time(us): total=283,752 max=1,057 min=281 avg=283
68: compute region reached 1000 times
    71: kernel launched 1000 times
        grid: [32x4094] block: [128]
        device time(us): total=3,151,051 max=3,206 min=3,147 avg=3,151
        elapsed time(us): total=3,166,372 max=3,924 min=3,160 avg=3,166

Transfer bottleneck eliminated! Computation: 10 seconds; data movement: negligible.
35
FURTHER SPEEDUPS
OpenACC gives us more detailed control over parallelization via the gang, worker, and vector clauses.
By understanding more about the OpenACC execution model and GPU hardware organization, we can get higher speedups on this code.
By understanding bottlenecks in the code via profiling, we can reorganize the code for higher performance.
We will tackle these in later exercises.
36
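As a preview, a hedged sketch of explicit loop mapping (the function name is illustrative, and vector(128) is a tuning knob echoing the compiler's automatic choice in the -Minfo output earlier, not a recommendation):

```c
/* Illustrative sketch: map the outer loop across gangs and the
   inner loop across 128-wide vectors, mirroring what the compiler
   chose automatically for the Jacobi kernels. Without -acc the
   pragmas are ignored and the nest runs serially. */
void scale_2d(int n, int m, float a[n][m])
{
    #pragma acc kernels loop gang
    for (int j = 0; j < n; j++) {
        #pragma acc loop vector(128)
        for (int i = 0; i < m; i++)
            a[j][i] *= 2.0f;
    }
}
```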
FINDING PARALLELISM IN YOUR CODE
(Nested) for loops are best for parallelization
  Large loop counts are needed to offset GPU/memcpy overhead
  Iterations of loops must be independent of each other
  To help the compiler: the restrict keyword (C), the independent clause
The compiler must be able to figure out the sizes of data regions
  You can use directives to explicitly control the sizes
Pointer arithmetic should be avoided if possible
  Use subscripted arrays rather than pointer-indexed arrays
Function calls within an accelerated region must be inlineable
37
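The points above can be sketched in one loop (names are illustrative): restrict promises the compiler the pointers do not alias, `loop independent` asserts the iterations are independent, and the body uses subscripts rather than pointer arithmetic.

```c
/* Illustrative sketch: restrict-qualified parameters plus an
   explicit independent clause. Subscripted access (out[i]) is used
   instead of pointer walking (*out++), which the compiler analyzes
   far more easily. */
void scale_copy(int n, const float *restrict in, float *restrict out)
{
    #pragma acc kernels loop independent
    for (int i = 0; i < n; i++)
        out[i] = 2.0f * in[i];
}
```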
TIPS AND TRICKS (PGI)
Use the time option to learn where time is being spent: -ta=nvidia,time
Eliminate pointer arithmetic
Inline function calls in directive regions (PGI): -inline or -inline,levels()
Use contiguous memory for multi-dimensional arrays
Use data regions to avoid excessive memory transfers
Conditional compilation with the _OPENACC macro 38