Introduction to GPU Computing with OpenACC

Introduction to GPU Computing with OpenACC Michael Wolfe

[email protected] http://www.pgroup.com

November 2012

1-1

Updates to these Tutorial Notes
The latest updates to these tutorial notes are available at:
http://www.pgroup.com/lit/presentations/sc12_tutorial_openacc_intro.pdf

Also available:

Training at Pittsburgh Supercomputing Center, January 15-16
www.psc.edu/index.php/training/openacc-gpu-programming

1-2

Outline
 • Introduction
   - GPU architecture vs. Host architecture
   - GPU parallel programming vs. Host parallel programming
 • Introduction to OpenACC Directives
   - Elements of the Directives-based Model
   - Controlling Data Movement
   - Explicit Parallelism
   - Reductions, the Loop directive
   - Interacting with CUDA
 • Problems, Pitfalls, Wrapup
   - Appropriate algorithms and data structures
   - Obvious Performance Bottlenecks

1-3

AMD “Magny-Cours”

Host Architecture Features
 • Register files (integer, float)
 • Functional units (integer, float, address), Icache, Dcache
 • Execution pipeline (fetch, decode, issue, execute, cache, commit)
   - branch prediction, hazards (control, data, structural)
   - pipelined functional units, superpipelining, register bypass
   - stall, scoreboarding, reservation stations, register renaming
 • Multiscalar execution (superscalar, control unit lookahead)
 • LIW (long instruction word)
 • Multithreading, Simultaneous multithreading
 • Vector instruction set
 • Multiprocessor, Multicore, coherent caches (MESI protocols)
1-5

Making It Faster
 • Processor:
   - Faster clocks
   - More work per clock: superscalar, VLIW, more cores, vector / SIMD instructions
 • Memory:
   - Latency reduction
   - Latency tolerance

1-6

Abstracted x64+Tesla Architecture

1-7

Abstracted x64+Fermi Architecture

1-8


Abstracted x64+Accelerator Architecture

1-10

GPU Architecture Features  Optimized for high degree of regular parallelism  Classically optimized for low precision  High bandwidth memory  Highly multithreaded (slack parallelism)  Hardware thread scheduling  Non-coherent software-managed data caches  some hardware data caches as well

 No multiprocessor memory model guarantees  some guarantees with fence operations

1-11

Tesla-10 Features Summary  Massively parallel thread processors  Organized into multiprocessors  up to 30, see deviceQuery or pgaccelinfo  Physically: 8 thread processors per multiprocessor  Logically: 32 threads per warp

 Memory hierarchy  host memory, device memory, constant memory, shared memory, register

 Queue of operations (kernels) on device

1-12

Fermi (Tesla-20) Features Summary  Massively parallel thread processors  Organized into multiprocessors  up to 16, see deviceQuery or pgaccelinfo  Physically: two groups of 16 thread processors per multiprocessor  Logically: still 32 threads per warp

 Memory hierarchy  host memory, device memory (two level hardware cache), constant memory, (configurable) shared memory, register

 Queue of operations (kernels) on device  ECC memory protection (supported, not default)

 Much improved double precision performance  Hardware 32-bit integer multiply

1-13

Kepler (K20) Features Summary  Massively parallel thread processors  Organized into multiprocessors  8 or more, see deviceQuery or pgaccelinfo  Physically: 12 groups of 16 SP thread processors + 4 groups of DP thread processors per SM  Logically: still 32 threads per warp

 Memory hierarchy  host memory, device memory (hardware cache), constant memory, (configurable) shared memory, register

 Queue of operations (kernels) on device  HyperQ – more asynchronous compute operations

 ECC memory protection  less memory per "CUDA core"

1-14

Parallel Programming on CPUs  Instruction level parallelism (ILP)  Loop unrolling, instruction scheduling

 Vector parallelism  Vectorized loops (or vector intrinsics)

 Thread level / Multiprocessor / multicore parallelism  Parallel loops, parallel tasks  Posix threads, OpenMP, Cilk, TBB, .....

 Large scale cluster / multicomputer parallelism  MPI (& HPF, co-array Fortran, UPC, Titanium, X10, Fortress, Chapel)

1-15

pthreads main routine

call pthread_create( t1, NULL, jacobi, 1 )
call pthread_create( t2, NULL, jacobi, 2 )
call pthread_create( t3, NULL, jacobi, 3 )
call jacobi( 4 )
call pthread_join( t1, NULL )
call pthread_join( t2, NULL )
call pthread_join( t3, NULL )

1-16

pthreads subroutine

subroutine jacobi( threadnum )
  lchange = 0
  do j = threadnum+1, n-1, numthreads
    do i = 2, m-1
      newa(i,j) = w0*a(i,j) + &
        w1 * (a(i-1,j) + a(i,j-1) + &
              a(i+1,j) + a(i,j+1)) + &
        w2 * (a(i-1,j-1) + a(i-1,j+1) + &
              a(i+1,j-1) + a(i+1,j+1))
      lchange = max(lchange,abs(newa(i,j)-a(i,j)))
    enddo
  enddo
  call pthread_mutex_lock( lck )
  change = max(change,lchange)
  call pthread_mutex_unlock( lck )
end subroutine

1-17

Jacobi Relaxation with OpenMP directives

change = 0
!$omp parallel private(i,j)
!$omp do reduction(max:change)
do j = 2, n-1
  do i = 2, m-1
    newa(i,j) = w0*a(i,j) + &
      w1 * (a(i-1,j) + a(i,j-1) + &
            a(i+1,j) + a(i,j+1)) + &
      w2 * (a(i-1,j-1) + a(i-1,j+1) + &
            a(i+1,j-1) + a(i+1,j+1))
    change = max(change,abs(newa(i,j)-a(i,j)))
  enddo
enddo
!$omp end parallel
1-18

Behind the Scenes
 • Compiler generates code for N threads:
   - split up the iterations across N threads
   - accumulate N partial sums (no synchronization)
   - accumulate final sum as threads complete (sketched below)

 • Assumptions
   - uniform memory access costs
   - coherent cache mechanism
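A minimal sketch of that code-generation strategy (names like partial_change and the flattened array indexing are illustrative, not actual compiler output): each thread computes its own partial maximum with no locking, and the partials are combined after the threads finish.

    #include <math.h>

    /* Each thread t of nthr conceptually runs this over its share of rows;
       a and newa are shown as flattened m x n arrays for brevity. */
    static void partial_change( int t, int nthr, int m, int n,
                                const float *a, const float *newa,
                                float *partial )
    {
        float lchange = 0.0f;
        for( int j = 1 + t; j < m-1; j += nthr )   /* cyclic split of the j loop */
            for( int i = 1; i < n-1; ++i )
                lchange = fmaxf( lchange, fabsf( newa[j*n+i] - a[j*n+i] ) );
        partial[t] = lchange;                      /* no synchronization needed */
    }

    /* as threads complete, the final reduction is just:
         for( t = 0; t < nthr; ++t ) change = fmaxf( change, partial[t] );     */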

1-19

More Behind the Scenes
 • Virtualization penalties
   - load balancing
   - cache locality
   - vectorization within the threads
   - thread management
   - loop scheduling (which thread does what iteration)
   - NUMA memory access penalty

1-20

Parallel Programming on GPUs  High degree of regular parallelism  lots of scalar threads  threads organized into thread groups / blocks  SIMD, pseudo-SIMD  thread groups organized into grid  MIMD

 Languages  CUDA, OpenCL, graphics: OpenGL, DirectX  may include vector datatypes (float4, int2)  CAPS HMPP

 Platforms  formerly ArBB (fka Rapidmind, now owned by Intel)  Thrust library

1-21

GPU Programming
 • Allocate data on the GPU
 • Move data from host, or initialize data on GPU
 • Launch kernel(s)
   - GPU driver can generate ISA code at runtime
   - preserves forward compatibility without requiring ISA compatibility
 • Gather results from GPU
 • Deallocate data (see the CUDA C sketch below)
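For the vector add used in the next section, these steps map onto the CUDA runtime API roughly as in the sketch below (a minimal host-side fragment; error checking omitted, and vaddkernel is the device kernel shown later):

    /* CUDA C host code, compiled with nvcc as part of a .cu file */
    float *da, *db, *dc;
    /* allocate data on the GPU */
    cudaMalloc( (void**)&da, n*sizeof(float) );
    cudaMalloc( (void**)&db, n*sizeof(float) );
    cudaMalloc( (void**)&dc, n*sizeof(float) );
    /* move data from host */
    cudaMemcpy( da, a, n*sizeof(float), cudaMemcpyHostToDevice );
    cudaMemcpy( db, b, n*sizeof(float), cudaMemcpyHostToDevice );
    /* launch kernel: one thread per element, 256 threads per block */
    vaddkernel<<< (n+255)/256, 256 >>>( da, db, dc, n );
    /* gather results from GPU */
    cudaMemcpy( c, dc, n*sizeof(float), cudaMemcpyDeviceToHost );
    /* deallocate data */
    cudaFree( da );  cudaFree( db );  cudaFree( dc );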

1-22

Appropriate GPU programs  Characterized by nested parallel loops  High compute intensity  Regular data access  Isolated host/GPU data movement

1-23

Behind the Scenes  What you write is what you get  Implicitly parallel  threads into warps  warps into thread groups  thread groups into a grid

 Hardware thread scheduler  Highly multithreaded

1-24

CUDA C and CUDA Fortran  Simple introductory program  Programming model  Low-level Programming with CUDA  Building CUDA programs

2-25

VADD on Host

subroutine host_vadd(A,B,C,N)
  real(4) :: A(N), B(N), C(N)
  integer :: N, i
  do i = 1,N
    C(i) = A(i) + B(i)
  enddo
end subroutine

void host_vadd( float* A, float* B, float* C, int n ){
  int i;
  for( i = 0; i < n; ++i ){
    C[i] = A[i] + B[i];
  }
}
2-26

CUDA C VADD Device Code

__global__ void vaddkernel( float* A, float* B, float* C, int n ){
  int i;
  i = blockIdx.x*blockDim.x + threadIdx.x;
  if( i < n ) C[i] = A[i] + B[i];
}
2-27

Jacobi Relaxation

change = tolerance + 1.0
do while(change > tolerance)
  change = 0
  do i = 2, m-1
    do j = 2, n-1
      newa(i,j) = w0*a(i,j) + &
        w1 * (a(i-1,j)+a(i,j-1)+a(i+1,j)+a(i,j+1)) + &
        w2 * (a(i-1,j-1)+a(i-1,j+1)+a(i+1,j-1)+a(i+1,j+1))
      change = max(change,abs(newa(i,j)-a(i,j)))
    enddo
  enddo
  do i = 2, m-1
    do j = 2, n-1
      a(i,j) = newa(i,j)
    enddo
  enddo
enddo
3-34

Jacobi Relaxation

do{
  change = 0;
  for( j = 1; j < m-1; ++j ){
    for( i = 1; i < n-1; ++i ){
      newa[j][i] = w0*a[j][i] +
        w1 * (a[j][i-1] + a[j-1][i] + a[j][i+1] + a[j+1][i]) +
        w2 * (a[j-1][i-1] + a[j+1][i-1] + a[j-1][i+1] + a[j+1][i+1]);
      change = fmax(change,fabs(newa[j][i]-a[j][i]));
    }
  }
  tmp = a; a = newa; newa = tmp;
}while( change > tolerance );

3-35

Jacobi Relaxation

change = tolerance + 1.0
!$omp parallel shared(change)
do while(change > tolerance)
  change = 0
!$omp do reduction(max:change) private(i,j)
  do i = 2, m-1
    do j = 2, n-1
      newa(i,j) = w0*a(i,j) + &
        w1 * (a(i-1,j)+a(i,j-1)+a(i+1,j)+a(i,j+1)) + &
        w2 * (a(i-1,j-1)+a(i-1,j+1)+a(i+1,j-1)+a(i+1,j+1))
      change = max(change,abs(newa(i,j)-a(i,j)))
    enddo
  enddo
!$omp do private(i,j)
  do i = 2, m-1
    do j = 2, n-1
      a(i,j) = newa(i,j)
    enddo
  enddo
enddo

3-36

Jacobi Relaxation

#pragma omp parallel shared(change) private(tmp)
do{
  change = 0;
#pragma omp for private(i,j) reduction(max:change)
  for( j = 1; j < m-1; ++j ){
    for( i = 1; i < n-1; ++i ){
      newa[j][i] = w0*a[j][i] +
        w1 * (a[j][i-1] + a[j-1][i] + a[j][i+1] + a[j+1][i]) +
        w2 * (a[j-1][i-1] + a[j+1][i-1] + a[j-1][i+1] + a[j+1][i+1]);
      change = fmax(change,fabs(newa[j][i]-a[j][i]));
    }
  }
  tmp = a; a = newa; newa = tmp;
}while( change > tolerance );
3-37

change = tolerance + 1.0
!$acc data create(newa(1:m,1:n)) copy(a(1:m,1:n))
do while(change > tolerance)
  change = 0
!$acc kernels reduction(max:change)
  do i = 2, m-1
    do j = 2, n-1
      newa(i,j) = w0*a(i,j) + &
        w1 * (a(i-1,j)+a(i,j-1)+a(i+1,j)+a(i,j+1)) + &
        w2 * (a(i-1,j-1)+a(i-1,j+1)+a(i+1,j-1)+a(i+1,j+1))
      change = max(change,abs(newa(i,j)-a(i,j)))
    enddo
  enddo
  do i = 2, m-1
    do j = 2, n-1
      a(i,j) = newa(i,j)
    enddo
  enddo
!$acc end kernels
enddo
!$acc end data

3-38

#pragma acc data create(newa[0:n-1][0:m-1]) \
            copy(a[0:n-1][0:m-1])
do{
  change = 0;
#pragma acc kernels private(i,j) reduction(max:change)
  for( j = 1; j < m-1; ++j ){
    for( i = 1; i < n-1; ++i ){
      newa[j][i] = w0*a[j][i] +
        w1 * (a[j][i-1] + a[j-1][i] + a[j][i+1] + a[j+1][i]) +
        w2 * (a[j-1][i-1] + a[j+1][i-1] + a[j-1][i+1] + a[j+1][i+1]);
      change = fmax(change,fabs(newa[j][i]-a[j][i]));
    }
  }
  tmp = a; a = newa; newa = tmp;
}while( change > tolerance );

3-39

Why use Accelerator Directives?
 • Productivity
   - higher level programming model, a la OpenMP
 • Portability
   - ignore directives, portable to the host
   - portable to other accelerators
   - performance portability
 • Performance feedback
 • Downsides
   - it’s not as easy as inserting a few directives
   - a good host algorithm is not necessarily a good GPU algorithm
3-40

Basic Syntactic Concepts  Fortran accelerator directive syntax  !$acc directive [clause]...  & continuation  Fortran-77 syntax rules  !$acc or C$acc or *$acc in columns 1-5  continuation with nonblank in column 6

 C accelerator directive syntax  #pragma acc directive [clause]... eol  continue to next line with backslash

3-41

Construct  construct is single-entry/single-exit construct  in Fortran, delimited by begin/end directives  in C, a single statement, or {...} region  no jumps into/out of region, no return

 compute region contains loops to send to GPU  loop iterations translated to GPU threads  loop indices become threadidx/blockidx indices

 data region encloses compute regions  data moved at region boundaries

 Construct vs. Region  construct is the lexical instance in the program  region is the dynamic instance when running the program 3-42

Appropriate Algorithm  Nested parallel loops  iterations map to threads  parallelism means threads are independent  nested loops means lots of parallelism

 Regular array indexing  allows for stride-1 array fetches
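In C's row-major layout that means keeping the contiguous subscript in the inner loop; a quick sketch of the preferred ordering versus a strided one, using the arrays from the Jacobi example:

    /* stride-1: the inner index i walks contiguous memory (good) */
    for( j = 1; j < m-1; ++j )
        for( i = 1; i < n-1; ++i )
            newa[j][i] = a[j][i];

    /* strided: the inner index j jumps a whole row per iteration (poor) */
    for( i = 1; i < n-1; ++i )
        for( j = 1; j < m-1; ++j )
            newa[j][i] = a[j][i];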

3-43

#pragma acc data create(newa[0:n][0:m]) \
            copy(a[0:n][0:m])
do{
  change = 0;
#pragma acc kernels private(i,j) reduction(max:change)
  for( j = 1; j < m-1; ++j ){
    for( i = 1; i < n-1; ++i ){
      newa[j][i] = w0*a[j][i] +
        w1 * (a[j][i-1] + a[j-1][i] + a[j][i+1] + a[j+1][i]) +
        w2 * (a[j-1][i-1] + a[j+1][i-1] + a[j-1][i+1] + a[j+1][i+1]);
      change = fmax(change,fabs(newa[j][i]-a[j][i]));
    }
  }
  tmp = a; a = newa; newa = tmp;
}while( change > tolerance );

3-44

#pragma acc data create(newa[0:n][0:m]) \
            copy(a[0:n][0:m])
do{
  change = 0;
#pragma acc parallel private(i,j) reduction(max:change)
  {
#pragma acc loop
    for( j = 1; j < m-1; ++j ){
      for( i = 1; i < n-1; ++i ){
        newa[j][i] = w0*a[j][i] +
          w1 * (a[j][i-1] + a[j-1][i] + a[j][i+1] + a[j+1][i]) +
          w2 * (a[j-1][i-1] + a[j+1][i-1] + a[j-1][i+1] + a[j+1][i+1]);
        change = fmax(change,fabs(newa[j][i]-a[j][i]));
      }
    }
  }
  tmp = a; a = newa; newa = tmp;
}while( change > tolerance );
3-45

Behind the Scenes
 • compiler determines parallelism (kernels) or applies parallelism (parallel)
 • compiler generates thread code
   - split up the iterations into threads, thread groups
   - inserts code to use software data cache
   - accumulate partial sum
   - second kernel to combine final sum
 • compiler also inserts data movement
   - compiler or user determines what data to move
   - data moved at boundaries of data/compute region

3-46

Behind the Scenes  virtualization penalties  fine grain control  thread scheduling  shared memory usage  loop unrolling

3-47

Building Accelerator Programs
 • pgfortran -acc a.f90
 • pgcc -acc a.c
 • Other options:
   - -ta=nvidia[,cc1x|cc2x|cc3x]
     default in siterc file: set COMPUTECAP=30;
   - -ta=nvidia[,cuda5.0]
     default in siterc file: set DEFCUDAVERSION=5.0;
   - -ta=nvidia,fastmath,nofma
 • Enable compiler feedback with -Minfo or -Minfo=accel
3-48

Program Execution Model
 • Host
   - executes most of the program
   - allocates accelerator memory
   - initiates data copy from host memory to accelerator
   - sends kernel code to accelerator
   - queues kernels for execution on accelerator
   - waits for kernel completion
   - initiates data copy from accelerator to host memory
   - deallocates accelerator memory
 • Accelerator
   - executes kernels, one after another
   - concurrently, may transfer data between host and accelerator
3-49

Controlling Data Movement  data clauses  data construct and data region  update directive  asynchronous data movement, the wait directive

3-50

Data Clauses

C:
  #pragma acc data copyin(a[0:n]) copyout(r[0:n])
  { .... }

Fortran:
  !$acc data copyin(a(1:n)) copyout(r(1:n))
  ...
  !$acc end data

3-51

Data Clauses

C:
  #pragma acc data copyin(a[0:n][0:m]) \
              copy(r[0:n][0:m])
  { ... }

Fortran:
  !$acc data copyin(a(1:m,1:n)) copy(r(:,:))
  ...
  !$acc end data

3-52

Data Clauses

C:
  #pragma acc data copyin(a[0:n][0:m]) \
              create(r[0:n][0:m])
  { ... }

Fortran:
  !$acc data copyin(a(1:m,1:n)) create(r)
  ...
  !$acc end data

3-53

Data Clauses

C:
  #pragma acc data copyin(a[0:n][0:m]) \
              create(r[1:n-1][1:m-1])
  { ... }

Fortran:
  !$acc data copyin(a(1:m,1:n)) &
        create(r(2:m-1,2:n-1))
  ...
  !$acc end data

3-54

Data Clauses

C:
  #pragma acc data copyin(a[0:n][0:m])
  /* data copied to Accelerator here */
  { ... }
  /* data copied to Host here */

Fortran:
  !$acc data copyin(a(1:m,1:n))
  ! data copied to Accelerator here
  ...
  ! data copied to Host here
  !$acc end data

3-55

Data Clauses

C:
  #pragma acc data copyin(a[0:n][0:m]) \
              present(r[1:n-1][1:m-1])
  { ... }

Fortran:
  !$acc data copyin(a(1:m,1:n)) &
        present(r(2:m-1,2:n-1))
  ...
  !$acc end data

3-56

Data Clauses

C:
  #pragma acc data present_or_copyin(a[0:n]) \
              present(r[1:n-1])
  { ... }

Fortran:
  !$acc data present_or_copyin(a(1:m)) &
        present(r(2:m-1))
  ...
  !$acc end data

3-57

Data Clauses

C:
  #pragma acc data pcopyin(a[0:n]) \
              present(r[1:n-1])
  { ... }

Fortran:
  !$acc data pcopyin(a(1:m)) &
        present(r(2:m-1))
  ...
  !$acc end data

3-58

Data Clauses
 • copy( list )
 • copyin( list )
 • copyout( list )
 • create( list )
 • present( list )
 • present_or_copy( list )      pcopy( list )
 • present_or_copyin( list )    pcopyin( list )
 • present_or_copyout( list )   pcopyout( list )
 • present_or_create( list )    pcreate( list )
 • deviceptr( list )

3-59

Data Region

C:
  #pragma acc data data-clauses if( condition )
  { .... }

Fortran:
  !$acc data data-clauses if( condition )
  ....
  !$acc end data

 • May be nested and may contain compute regions
 • May not be nested within a compute region
 • May contain procedure calls
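For example, a conditional data region (a sketch; NSMALL is an illustrative threshold, not from these notes) can skip device allocation and data movement for small problem sizes:

    /* device copies of a are created and moved only when n > NSMALL;
       the kernels region is likewise kept on the host for small n */
    #pragma acc data copy(a[0:n]) if(n > NSMALL)
    {
        #pragma acc kernels if(n > NSMALL)
        for( i = 0; i < n; ++i )
            a[i] = 2.0f*a[i];
    }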

3-60

Update Directive

C:
  #pragma acc update host( list )
  #pragma acc update device( list )

Fortran:
  !$acc update host( list )
  !$acc update device( list )

 • data must be in a data clause for an enclosing data region
 • implies present( list )
 • both may be on a single line:
   update host( list ) device( list )
3-61
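A common pattern (a sketch; host_output is a hypothetical host-side routine) is to refresh the host copy periodically inside a long-lived data region instead of ending and re-entering it:

    #pragma acc data copy(a[0:n])
    {
        for( iter = 0; iter < niters; ++iter ){
            #pragma acc kernels
            for( i = 0; i < n; ++i )
                a[i] = 0.5f*a[i] + 1.0f;
            if( iter % 100 == 0 ){
                #pragma acc update host(a[0:n])   /* refresh the host copy */
                host_output( a, n, iter );        /* hypothetical host I/O */
            }
        }
    }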

Asynchronous Update Directive

C:
  #pragma acc update host( list ) async(1)
  #pragma acc update device( list ) async(2)

Fortran:
  !$acc update host( list ) async(3)
  !$acc update device( list ) async(n)

 • async value should be >= 0
 • mapped down to some number of actual async queues
 • updates with same value will execute in program order
 • async with no value is same as async(acc_async_noval)
 • async(acc_async_sync) is same as no async clause
 • host program may continue
3-62

Wait Directive

C:
  #pragma acc wait( 2 )
  #pragma acc wait

Fortran:
  !$acc wait( n, n-1 )
  !$acc wait

 • wait with no values waits for ALL asynchronous queues
 • wait with multiple values waits for each queue
 • NO implied wait at end of data construct
 • async updates in a data region require wait
3-63
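A sketch combining asynchronous updates with explicit waits (the queue numbers are arbitrary):

    #pragma acc data create(a[0:n], r[0:n])
    {
        /* host fills a[], then queues the upload */
        #pragma acc update device(a[0:n]) async(1)
        /* ... unrelated host work can overlap with the transfer ... */
        #pragma acc wait(1)                   /* a must be on the device now */
        #pragma acc kernels
        for( i = 0; i < n; ++i )
            r[i] = 2.0f*a[i];
        #pragma acc update host(r[0:n]) async(2)
        #pragma acc wait(2)                   /* required: no implied wait at
                                                 the end of the data construct */
    }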

Data Regions Across Procedures

subroutine sub( a, b )
  real :: a(:), b(:)
!$acc kernels copyin(b)
  do i = 1,n
    a(i) = a(i) * b(i)
  enddo
!$acc end kernels
  ...
end subroutine

subroutine bus(x, y)
  real :: x(:), y(:)
!$acc data copy(x)
  call sub( x, y )
!$acc end data
end subroutine

3-64

Data Regions Across Procedures

void sub( float* a, float* b, int n ){
  int i;
  #pragma acc kernels copyin(b[0:n])
  for( i = 0; i < n; ++i )
    a[i] *= b[i];
  ...
}

void bus( float* x, float* y, int n ){
  #pragma acc data copy(x[0:n])
  {
    sub( x, y, n );
  }
  ...
}
3-65

Loop Directive

C:
  #pragma acc loop clause...
  for( i = 0; i < n; ++i ){
    ....
  }

Fortran:
  !$acc loop clause...
  do i = 1, n

3-66

Kernels Loop Scheduling Clauses  !$acc loop gang  runs in ‘gang’ mode only (blockIdx)  does not declare that the loop is in fact parallel (use independent)

 !$acc loop gang(32)  runs in ‘parallel’ mode only with gridDim == 32 (32 blocks)

 !$acc loop vector(128)  runs in ‘vector’ mode (threadIdx) with blockDim == 128 (128 threads)  vector size, if present, must be compile-time constant

 !$acc loop gang vector(128)  strip mines loop  inner loop runs in vector mode, 128 threads (threadIdx)  outer loop runs across thread blocks (blockIdx)
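Applied to the Jacobi loop nest, a schedule along these lines (a sketch with a simplified 4-point stencil; the vector length 128 is just an example) puts the stride-1 i loop in vector mode and the j loop across thread blocks:

    #pragma acc kernels
    {
        #pragma acc loop gang                 /* j -> blockIdx                */
        for( j = 1; j < m-1; ++j ){
            #pragma acc loop vector(128)      /* i -> threadIdx, blockDim 128 */
            for( i = 1; i < n-1; ++i )
                newa[j][i] = 0.25f*( a[j][i-1] + a[j][i+1]
                                   + a[j-1][i] + a[j+1][i] );
        }
    }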

3-67

Loop Scheduling Clauses  Want stride-1 loop to be in ‘vector’ mode (threadIdx)  look at –Minfo messages!

 Want lots of parallelism

3-68

Loop Directive Clauses
 • Scheduling Clauses
   - vector or vector(n)
   - gang or gang(n)
   - worker or worker(n)
 • independent
   - use with care, overrides compiler analysis for dependence, private variables
 • private( list )
   - private data for each iteration of the loop
   - different from local (how?)
 • reduction( red:var )
   - reduction across the loop
3-69
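A small sketch using private and reduction together (the scalar t is the point of the private clause; the loop computes a sum of squared differences):

    float t, sum = 0.0f;
    #pragma acc kernels loop private(t) reduction(+:sum)
    for( i = 0; i < n; ++i ){
        t = a[i] - b[i];      /* every iteration gets its own copy of t  */
        sum += t*t;           /* partial sums combined across iterations */
    }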

Compiler Feedback Messages  Data related  Generating copyin(b(1:n,1:m))  Generating copyout(b(2:n-1,2:m-1))  Generating copy(a(1:n,1:n))

 Generating create(c(1:n,1:n))

 Loop or kernel related  Loop is parallelizable  Accelerator kernel generated

 Barriers to GPU code generation  No parallel kernels found, accelerator region ignored  Loop carried dependence due to exposed use of ... prevents parallelization

 Parallelization would require privatization of array ...

Compiler Messages Continued Memory optimization related  Cached references to size [(x+2)x(y+2)] block of ‘b’  Non-stride-1 memory accesses for ‘a’

Combined Directives

C:
  #pragma acc kernels loop copyin(a[0:n]) \
              gang vector(32)
  for( i = 0; i < n; ++i )
    r[i] = a[i]*2.0f;

Fortran:
  !$acc kernels loop copyin(a(1:n)) &
        gang vector(32)
  do i = 1,n
    r(i) = a(i) * 2.0
  enddo

3-72

Parallel Region

C:
  #pragma acc parallel loop copyin(a[0:n])
  for( i = 0; i < n; ++i )
    r[i] = a[i]*2.0f;

Fortran:
  !$acc parallel loop copyin(a(1:n))
  do i = 1,n
    r(i) = a(i) * 2.0
  enddo

3-73

Selecting Device

C:
  #include <openacc.h>
  n = acc_get_num_devices(acc_device_nvidia);
  ...
  acc_set_device_num( 1, acc_device_nvidia);

Fortran:
  use accel_lib
  n = acc_get_num_devices( acc_device_nvidia )
  ...
  call acc_set_device_num( 0, acc_device_nvidia)

Environment Variable:
  setenv ACC_DEVICE_NUM 1
  export ACC_DEVICE_NUM=1

3-74

Performance Profiling  TAU (Tuning and Analysis Utilities, University of Oregon)  collects performance information

 cudaprof (NVIDIA)  gives a trace of kernel execution

 PGI_ACC_TIME environment variable  dump of region-level and kernel-level performance  upload/download data movement time  kernel execution time

 pgcollect a.out  PGI_ACC_PROFILE environment variable  enables profile data collection for accelerator regions

 ACC_NOTIFY environment variable  prints one line for each kernel invocation

3-75

Performance Profiling

Accelerator Kernel Timing data
f3.f90
  smooth
    24: region entered 1 time
        time(us): total=1116701 init=1115986 region=715 kernels=22 data=693
        w/o init: total=715 max=715 min=715 avg=715
        27: kernel launched 5 times
            grid: [7x7]  block: [16x16]
            time(us): total=17 max=10 min=1 avg=3
        34: kernel launched 5 times
            grid: [7x7]  block: [16x16]
            time(us): total=5 max=1 min=1 avg=1

3-76

Directives Summary

 • acc kernels
   - if(condition)
   - copy(list)  copyin(list)  copyout(list)  create(list)

 • acc parallel
   - if(condition)
   - num_gangs(n)  num_workers(n)  vector_length(n)

 • acc data
   - if(condition)
   - copy(list)  copyin(list)  copyout(list)  create(list)

 • acc update device(list) host(list)

 • acc loop
   - gang(width)  worker(width)  vector(width)  independent
   - private(list)  reduction(red:var)

3-77

Runtime Summary
 • acc_get_num_devices( acc_device_nvidia )
 • acc_set_device_num( n, acc_device_nvidia )
 • acc_set_device( acc_device_nvidia | acc_device_host )
 • acc_get_device()
 • acc_init( acc_device_nvidia | acc_device_host )

3-78

Environment Variable Summary
 • ACC_DEVICE_NUM
 • ACC_DEVICE
 • ACC_NOTIFY
 • PGI_ACC_TIME

3-79

Command Line Summary
 • -ta=nvidia
 • -ta=nvidia,nofma
 • -ta=nvidia,fastmath
 • -ta=nvidia,[cc1x|cc2x|cc3x]   (multiple allowed)
 • -ta=nvidia,maxregcount:n
 • -ta=nvidia,cuda5.0

3-80

Accelerator Programming Timeline  late 1990s – GPGPU Programming

 early 2000s – Brook project (and others)  2007 – NVIDIA CUDA release, SC07 tutorial  HiCUDA, HMPP (and others)  EXOCHI: PLDI 2007 (Intel)

 2008 – PGI Accelerator Programming Model  Larrabee paper at SIGGRAPH (Intel)

 2009 – OpenMP BoF at SC09  Programming Model for Heterogeneous x86: PLDI 2009 (Intel)  Larrabee featured in Justin Rattner (Intel) opening address

 2010 – OpenMP Accelerator committee  2011 – OpenACC API 1.0  Language Extensions for Offload (LEO, Intel)

3-81

Where to get help
 • PGI Customer Support - [email protected]
 • PGI User's Forum - http://www.pgroup.com/userforum/index.php
 • PGI Articles - http://www.pgroup.com/resources/articles.htm
                  http://www.pgroup.com/resources/accel.htm
 • PGI User's Guide - http://www.pgroup.com/doc/pgiug.pdf
 • CUDA Fortran Reference Guide - http://www.pgroup.com/doc/pgicudafortug.pdf

OpenACC Members
 • CAPS Entreprise
 • Allinea
 • Cray, Inc.
 • CSCS (Switzerland)
 • NVIDIA
 • Oak Ridge National Lab
 • The Portland Group, Inc.
 • Tech. Univ. Dresden
 • Univ. Houston

OpenACC Supporters
 • Sandia
 • Georgia Tech
 • Rogue Wave
3-83

Copyright Notice

© Contents copyright 2009-2012, The Portland Group, Inc. This material may not be reproduced in any manner without the expressed written permission of The Portland Group.

3-84