GPU COMPUTING WITH OPENACC
3 WAYS TO ACCELERATE APPLICATIONS

Applications:
  Libraries: "drop-in" acceleration
  OpenACC directives: easily accelerate applications
  Programming languages: maximum flexibility
2
OPENACC DIRECTIVES (CPU + GPU)

Simple compiler hints. The compiler parallelizes the code:

Program myscience
  ... serial code ...        ! your original Fortran or C code (runs on the CPU)
  !$acc kernels              ! OpenACC compiler hint
  do k = 1,n1
    do i = 1,n2
      ... parallel code ...  ! offloaded to the GPU
    enddo
  enddo
  !$acc end kernels
  ...
End Program myscience

Works on many-core GPUs & multicore CPUs
3
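The same pattern carries over directly to C. A minimal sketch (the function name and saxpy-style loop are illustrative, not from the deck); without -acc the pragma is simply ignored and the loop runs serially on the CPU:

```c
#include <stddef.h>

/* Illustrative sketch: a C loop annotated the same way as the
   Fortran example above. With -acc the compiler offloads the loop
   to the GPU; without it, the pragma is ignored and the loop runs
   unchanged on the CPU. */
void scale_add(size_t n, float a, const float *restrict x,
               float *restrict y)
{
    #pragma acc kernels
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```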
FAMILIAR TO OPENMP PROGRAMMERS
OpenMP (CPU):

main() {
    double pi = 0.0; long i;

    #pragma omp parallel for reduction(+:pi)
    for (i = 0; i < N; i++) {
        double t = (double)((i + 0.5) / N);
        pi += 4.0 / (1.0 + t * t);
    }

    printf("pi = %f\n", pi / N);
}

To run on the cluster:
> sbatch runit.acc
> qstat -u        # prints job status
Output is placed in slurm.* when finished.
18
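The OpenMP pi example above maps to OpenACC with essentially a one-line change. A hedged sketch, with the midpoint-rule loop body following the OpenMP slide:

```c
/* Sketch of the OpenMP pi example rewritten with OpenACC: the only
   substantive change is the pragma. Without -acc the loop simply
   runs serially and produces the same answer. */
double pi_acc(long N)
{
    double pi = 0.0;
    #pragma acc parallel loop reduction(+:pi)
    for (long i = 0; i < N; i++) {
        double t = (i + 0.5) / N;     /* midpoint of subinterval i */
        pi += 4.0 / (1.0 + t * t);    /* integrand 4/(1+t^2) */
    }
    return pi / N;                    /* scale by interval width 1/N */
}
```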
EXERCISE 1: JACOBI KERNELS
Task: use acc kernels to parallelize the Jacobi loop nests
Edit laplace2D.c or laplace2D.f90 (your choice) in the 001-laplace2D-kernels directory
Add directives where it helps
Figure out the proper compilation flags to use
Optionally: run the OpenACC version with laplace_acc

Q: Can you get a speedup with just kernels directives?
Versus 1 CPU core? Versus 6 CPU cores?
19
EXERCISE 1 SOLUTION: OPENACC C

while ( error > tol && iter < iter_max ) {
    error = 0.0;

    #pragma acc kernels              /* execute GPU kernel for loop nest */
    for (int j = 1; j < n-1; j++) {
        for (int i = 1; i < m-1; i++) {
            Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                                 A[j-1][i] + A[j+1][i]);
            error = max(error, abs(Anew[j][i] - A[j][i]));
        }
    }

    #pragma acc kernels              /* execute GPU kernel for loop nest */
    for (int j = 1; j < n-1; j++) {
        for (int i = 1; i < m-1; i++) {
            A[j][i] = Anew[j][i];
        }
    }

    iter++;
}
20
EXERCISE 1 SOLUTION: OPENACC FORTRAN

do while ( err > tol .and. iter < iter_max )
  err = 0._fp_kind

  !$acc kernels                ! execute GPU kernel for loop nest
  do j = 1, m
    do i = 1, n
      Anew(i,j) = 0.25 * (A(i+1,j) + A(i-1,j) + &
                          A(i,j-1) + A(i,j+1))
      err = max(err, abs(Anew(i,j) - A(i,j)))
    enddo
  enddo
  !$acc end kernels

  !$acc kernels                ! execute GPU kernel for loop nest
  do j = 1, m-2
    do i = 1, n-2
      A(i,j) = Anew(i,j)
    enddo
  enddo
  !$acc end kernels

  iter = iter + 1
enddo
21
EXERCISE 1: COMPILER OUTPUT (C)

pgcc -tp sandybridge-64 -acc -ta=nvidia -Minfo=accel -o laplace2d_acc laplace2d.c

main:
     56, Generating present_or_copyout(Anew[1:4094][1:4094])
         Generating present_or_copyin(A[:][:])
         Generating Tesla code
     57, Loop is parallelizable
     59, Loop is parallelizable
         Accelerator kernel generated
         57, #pragma acc loop gang /* blockIdx.y */
         59, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
         63, Max reduction generated for error
     68, Generating present_or_copyin(Anew[1:4094][1:4094])
         Generating present_or_copyout(A[1:4094][1:4094])
         Generating Tesla code
     69, Loop is parallelizable
     71, Loop is parallelizable
         Accelerator kernel generated
         69, #pragma acc loop gang /* blockIdx.y */
         71, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
22
EXERCISE 1: PERFORMANCE
CPU: Intel E5-2680 v2, 10 cores @ 2.80 GHz
GPU: NVIDIA Tesla K20m

Execution (4096x4096)      Time (s)   Speedup
CPU, 1 OpenMP thread        108.7     --
CPU, 2 OpenMP threads        58.6     1.85x
CPU, 4 OpenMP threads        31.9     3.41x
CPU, 8 OpenMP threads        20.3     5.35x
OpenACC GPU                 176.1     0.12x  FAIL

CPU speedups are vs. 1 CPU core; the OpenACC speedup is vs. 8 CPU cores.
23
WHAT WENT WRONG? Run with PGI_ACC_TIME=1:

time(us): 101,104,174
56: data region reached 1000 times
    56: data copyin reached 8000 times
        device time(us): total=22,030,081 max=2,858 min=2,746 avg=2,753
    68: data copyout reached 8000 times
        device time(us): total=23,018,701 max=6,552 min=2,855 avg=2,877
56: compute region reached 1000 times
    59: kernel launched 1000 times
        grid: [32x4094] block: [128]
        device time(us): total=6,456,517 max=6,516 min=6,447 avg=6,456
        elapsed time(us): total=6,471,110 max=7,066 min=6,460 avg=6,471
    59: reduction kernel launched 1000 times
        grid: [1] block: [256]
        device time(us): total=270,280 max=276 min=268 avg=270
        elapsed time(us): total=283,763 max=353 min=282 avg=283
68: data region reached 1000 times
    68: data copyin reached 8000 times
        device time(us): total=23,271,701 max=2,946 min=2,891 avg=2,908
    77: data copyout reached 8000 times
        device time(us): total=23,016,095 max=2,993 min=2,854 avg=2,877
68: compute region reached 1000 times
    71: kernel launched 1000 times
        grid: [32x4094] block: [128]
        device time(us): total=3,040,799 max=3,050 min=3,037 avg=3,040
        elapsed time(us): total=3,056,315 max=3,131 min=3,052 avg=3,056

Huge data transfer bottleneck! Computation: 10 seconds; data movement: 90 seconds.
24
BASIC CONCEPTS
[Diagram: data is transferred between CPU memory and GPU memory over the PCI bus; computation is offloaded from the CPU to the GPU.]
For efficiency, decouple data movement and compute off-load.
25
EXCESSIVE DATA TRANSFERS

while ( error > tol && iter < iter_max ) {
    error = 0.0;                    /* A, Anew resident on host */

    #pragma acc kernels             /* copy in: A, Anew now resident on accelerator */
    for (int j = 1; j < n-1; j++) {
        for (int i = 1; i < m-1; i++) {
            Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                                 A[j-1][i] + A[j+1][i]);
            error = max(error, abs(Anew[j][i] - A[j][i]));
        }
    }
    /* copy out: A, Anew resident on host again */
    ...
}

These copies happen on every iteration of the outer while loop!*

*Note: there are two #pragma acc kernels regions, so there are 4 copies per while-loop iteration!
26
Data Management
27
DATA CONSTRUCT Fortran
!$acc data [clause …] structured block !$acc end data
C #pragma acc data [clause …] { structured block }
General clauses:
  if( condition )
  async( expression )
Manage data movement. Data regions may be nested. 28
DATA CLAUSES copy ( list )
Allocates memory on GPU and copies data from host to GPU when entering region and copies data to the host when exiting region.
copyin ( list )
Allocates memory on GPU and copies data from host to GPU when entering region.
copyout ( list )
Allocates memory on GPU and copies data to the host when exiting region.
create ( list )
Allocates memory on GPU but does not copy.
present ( list )
Data is already present on GPU from another containing data region.
Also: present_or_copy[in|out], present_or_create, deviceptr.
29
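The clauses above can be sketched in one small function (the function and array names are illustrative, not from the deck): copy moves the array in and out once, create allocates device-only scratch space, and both kernels regions inside the data region reuse the device copies.

```c
/* Illustrative sketch: `copy(a[0:n])` transfers a to the GPU on
   entry and back on exit; `create(b[0:n])` allocates b on the GPU
   with no host transfer. Between the two kernels regions the data
   stays resident on the device. */
void smooth_once(int n, float *restrict a, float *restrict b)
{
    #pragma acc data copy(a[0:n]) create(b[0:n])
    {
        #pragma acc kernels
        for (int i = 1; i < n - 1; i++)
            b[i] = 0.5f * (a[i-1] + a[i+1]);

        #pragma acc kernels
        for (int i = 1; i < n - 1; i++)
            a[i] = b[i];
    }
}
```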
ARRAY SHAPING
The compiler sometimes cannot determine the size of arrays, so you must specify it explicitly using data clauses and an array "shape".
C (subarrays are [start : length])
#pragma acc data copyin(a[0:size]), copyout(b[s/4:3*s/4])
Fortran (subarrays are (start : end))
!$acc data copyin(a(1:size)), copyout(b(s/4:3*s/4))
Note: data clauses can be used on data, kernels, or parallel directives. 30
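To make the C [start : length] shape concrete, a hedged sketch (function and array names are illustrative): a[0:n] is the whole array, and the second half is written as [n/2 : n - n/2].

```c
/* Illustrative sketch of C subarray shaping: the clause takes
   [start : length], so this moves only the second half of a to
   the device and back. */
void double_second_half(int n, float *restrict a)
{
    #pragma acc data copy(a[n/2 : n - n/2])
    {
        #pragma acc kernels
        for (int i = n / 2; i < n; i++)
            a[i] *= 2.0f;
    }
}
```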
EXERCISE 2: JACOBI DATA DIRECTIVES
Task: use acc data to minimize transfers in the Jacobi example
Start from the given laplace2D.c or laplace2D.f90 (your choice) in the 002-laplace2d-data directory
Add directives where it helps (hint: the [do] while loop)

Q: What speedup can you get with data + kernels directives?
Versus 6 CPU cores? OMP_NUM_THREADS=6 ./laplace2d_omp 31
EXERCISE 2 SOLUTION: OPENACC C

#pragma acc data copy(A), create(Anew)   /* copy A in at start, out at end; allocate Anew on the accelerator */
while ( error > tol && iter < iter_max ) {
    error = 0.0;

    #pragma acc kernels
    for (int j = 1; j < n-1; j++) {
        for (int i = 1; i < m-1; i++) {
            Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                                 A[j-1][i] + A[j+1][i]);
            error = max(error, abs(Anew[j][i] - A[j][i]));
        }
    }

    #pragma acc kernels
    for (int j = 1; j < n-1; j++) {
        for (int i = 1; i < m-1; i++) {
            A[j][i] = Anew[j][i];
        }
    }

    iter++;
}
EXERCISE 2 SOLUTION: OPENACC FORTRAN

!$acc data copy(A), create(Anew)   ! copy A in at start, out at end; allocate Anew on the accelerator
do while ( err > tol .and. iter < iter_max )
  err = 0._fp_kind

  !$acc kernels
  do j = 1, m
    do i = 1, n
      Anew(i,j) = .25_fp_kind * (A(i+1, j  ) + A(i-1, j  ) + &
                                 A(i  , j-1) + A(i  , j+1))
      err = max(err, Anew(i,j) - A(i,j))
    end do
  end do
  !$acc end kernels
  ...
  iter = iter + 1
end do
!$acc end data
EXERCISE 2: PERFORMANCE

Execution (4096x4096)      Time (s)   Speedup
CPU, 8 OpenMP threads       20.3      --
OpenACC K20m               176.1      0.12x
OpenACC K20m-opt             8.34     2.43x

Speedups are vs. 8 CPU cores.
34
WHAT WENT RIGHT?

time(us): 9,901,981
50: data region reached 1 time
    50: data copyin reached 8 times
        device time(us): total=22,039 max=2,763 min=2,751 avg=2,754
    82: data copyout reached 9 times
        device time(us): total=21,851 max=2,734 min=14 avg=2,427
56: compute region reached 1000 times
    59: kernel launched 1000 times
        grid: [32x4094] block: [128]
        device time(us): total=6,437,470 max=6,491 min=6,429 avg=6,437
        elapsed time(us): total=6,452,030 max=7,162 min=6,442 avg=6,452
    59: reduction kernel launched 1000 times
        grid: [1] block: [256]
        device time(us): total=269,570 max=324 min=268 avg=269
        elapsed time(us): total=283,752 max=1,057 min=281 avg=283
68: compute region reached 1000 times
    71: kernel launched 1000 times
        grid: [32x4094] block: [128]
        device time(us): total=3,151,051 max=3,206 min=3,147 avg=3,151
        elapsed time(us): total=3,166,372 max=3,924 min=3,160 avg=3,166

Transfer bottleneck eliminated! Computation: 10 seconds; data movement: negligible.
35
FURTHER SPEEDUPS
OpenACC gives us more detailed control over parallelization via the gang, worker, and vector clauses.
By understanding more about the OpenACC execution model and GPU hardware organization, we can get higher speedups on this code.
By understanding bottlenecks in the code via profiling, we can reorganize the code for higher performance.
We will tackle these in later exercises.
36
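As a preview, a hedged sketch of explicit loop mapping (the function name is illustrative, and vector(128) is a tuning knob echoing the compiler's automatic choice in the -Minfo output earlier, not a recommendation):

```c
/* Illustrative sketch: map the outer loop across gangs and the
   inner loop across 128-wide vectors, mirroring what the compiler
   chose automatically for the Jacobi kernels. Without -acc the
   pragmas are ignored and the nest runs serially. */
void scale_2d(int n, int m, float a[n][m])
{
    #pragma acc kernels loop gang
    for (int j = 0; j < n; j++) {
        #pragma acc loop vector(128)
        for (int i = 0; i < m; i++)
            a[j][i] *= 2.0f;
    }
}
```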
FINDING PARALLELISM IN YOUR CODE
(Nested) for loops are best for parallelization
  Large loop counts are needed to offset GPU/memcpy overhead
  Iterations of loops must be independent of each other
  To help the compiler: the restrict keyword (C), the independent clause
The compiler must be able to figure out the sizes of data regions
  You can use directives to explicitly control the sizes
Pointer arithmetic should be avoided if possible
  Use subscripted arrays rather than pointer-indexed arrays
Function calls within an accelerated region must be inlineable
37
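The points above can be sketched in one loop (names are illustrative): restrict promises the compiler the pointers do not alias, `loop independent` asserts the iterations are independent, and the body uses subscripts rather than pointer arithmetic.

```c
/* Illustrative sketch: restrict-qualified parameters plus an
   explicit independent clause. Subscripted access (out[i]) is used
   instead of pointer walking (*out++), which the compiler analyzes
   far more easily. */
void scale_copy(int n, const float *restrict in, float *restrict out)
{
    #pragma acc kernels loop independent
    for (int i = 0; i < n; i++)
        out[i] = 2.0f * in[i];
}
```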
TIPS AND TRICKS (PGI)
Use the time option to learn where time is being spent: -ta=nvidia,time
Eliminate pointer arithmetic
Inline function calls in directive regions (PGI): -inline or -inline,levels()
Use contiguous memory for multi-dimensional arrays
Use data regions to avoid excessive memory transfers
Conditional compilation with the _OPENACC macro 38