Introduction to GPU Computing with OpenACC Michael Wolfe
[email protected] http://www.pgroup.com
November 2012
1-1
Updates to these Tutorial Notes The latest updates to these tutorial notes are available at: http://www.pgroup.com/lit/presentations/ sc12_tutorial_openacc_intro.pdf
Also available:
Training at Pittsburgh Supercomputer Center January 15-16 www.psc.edu/index.php/training/openacc-gpu-programming
1-2
Outline Introduction GPU architecture vs. Host architecture GPU parallel programming vs. Host parallel programming
Introduction to OpenACC Directives
Elements of Directives-based Model Controlling Data Movement Explicit Parallelism Reductions, the Loop directive Interacting with CUDA
Problems, Pitfalls, Wrapup Appropriate algorithms and data structures Obvious Performance Bottlenecks
1-3
AMD “Magny-Cours”
Host Architecture Features Register files (integer, float) Functional units (integer, float, address), Icache, Dcache Execution pipeline (fetch,decode,issue,execute,cache,commit) branch prediction, hazards (control, data, structural) pipelined functional units, superpipelining, register bypass stall, scoreboarding, reservation stations, register renaming
Multiscalar execution (superscalar, control unit lookahead) LIW (long instruction word)
Multithreading, Simultaneous multithreading Vector instruction set Multiprocessor, Multicore, coherent caches (MESI protocols) 1-5
Making It Faster Processor: Faster clocks More work per clock: superscalar VLIW more cores vector / SIMD instructions
Memory Latency reduction Latency tolerance
1-6
Abstracted x64+Tesla Architecture
1-7
Abstracted x64+Fermi Architecture
1-8
1-9
Abstracted x64+Accelerator Architecture
1-10
GPU Architecture Features Optimized for high degree of regular parallelism Classically optimized for low precision High bandwidth memory Highly multithreaded (slack parallelism) Hardware thread scheduling Non-coherent software-managed data caches some hardware data caches as well
No multiprocessor memory model guarantees some guarantees with fence operations
1-11
Tesla-10 Features Summary Massively parallel thread processors Organized into multiprocessors up to 30, see deviceQuery or pgaccelinfo Physically: 8 thread processors per multiprocessor Logically: 32 threads per warp
Memory hierarchy host memory, device memory, constant memory, shared memory, register
Queue of operations (kernels) on device
1-12
Fermi (Tesla-20) Features Summary Massively parallel thread processors Organized into multiprocessors up to 16, see deviceQuery or pgaccelinfo Physically: two groups of 16 thread processors per multiprocessor Logically: still 32 threads per warp
Memory hierarchy host memory, device memory (two level hardware cache), constant memory, (configurable) shared memory, register
Queue of operations (kernels) on device ECC memory protection (supported, not default)
Much improved double precision performance Hardware 32-bit integer multiply
1-13
Kepler (K20) Features Summary Massively parallel thread processors Organized into multiprocessors 8 or more, see deviceQuery or pgaccelinfo Physically: 12 groups of 16 SP thread processors + 4 groups of DP thread processors per SM Logically: still 32 threads per warp
Memory hierarchy host memory, device memory (hardware cache), constant memory, (configurable) shared memory, register
Queue of operations (kernels) on device HyperQ – more asynchronous compute operations
ECC memory protection less memory per "CUDA core"
1-14
Parallel Programming on CPUs Instruction level parallelism (ILP) Loop unrolling, instruction scheduling
Vector parallelism Vectorized loops (or vector intrinsics)
Thread level / Multiprocessor / multicore parallelism Parallel loops, parallel tasks Posix threads, OpenMP, Cilk, TBB, .....
Large scale cluster / multicomputer parallelism MPI (& HPF, co-array Fortran, UPC, Titanium, X10, Fortress, Chapel)
1-15
pthreads main routine call call call call call call call
pthread_create( t1, NULL, jacobi, 1 ) pthread_create( t2, NULL, jacobi, 2 ) pthread_create( t3, NULL, jacobi, 3 ) jacobi( 4 ) pthread_join( t1, NULL ) pthread_join( t2, NULL ) pthread_join( t3, NULL )
1-16
pthreads subroutine subroutine jacobi( threadnum ) lchange = 0 do j = threadnum+1, n-1, numthreads do i = 2, m-1 newa(i,j) = w0*a(i,j) + & w1 * (a(i-1,j) + a(i,j-1) + & a(i+1,j) + a(i,j+1)) + & w2 * (a(i-1,j-1) + a(i-1,j+1) + & a(i+1,j-1) + a(i+1,j+1)) lchange = max(lchange,abs(newa(i,j)-a(i,j))) enddo enddo call pthread_mutex_lock( lck ) change = max(change,lchange) call pthread_mutex_unlock( lck ) end subroutine
1-17
Jacobi Relaxation with OpenMP directives change = 0 !$omp parallel private(i,j) !$omp do reduction(max:change) do j = 2, n-1 do i = 2, m-1 newa(i,j) = w0*a(i,j) + & w1 * (a(i-1,j) + a(i,j-1) + & a(i+1,j) + a(i,j+1)) + & w2 * (a(i-1,j-1) + a(i-1,j+1) + & a(i+1,j-1) + a(i+1,j+1)) change = max(change,abs(newa(i,j)-a(i,j))) enddo enddo !$omp end parallel 1-18
Behind the Scenes Compiler generates code for N threads: split up the iterations across N threads accumulate N partial sums (no synchronization) accumulate final sum as threads complete
Assumptions uniform memory access costs coherent cache mechanism
1-19
More Behind the Scenes Virtualization penalties
load balancing cache locality vectorization within the threads thread management loop scheduling (which thread does what iteration) NUMA memory access penalty
1-20
Parallel Programming on GPUs High degree of regular parallelism lots of scalar threads threads organized into thread groups / blocks SIMD, pseudo-SIMD thread groups organized into grid MIMD
Languages CUDA, OpenCL, graphics: OpenGL, DirectX may include vector datatypes (float4, int2) CAPS HMPP
Platforms formerly ArBB (fka Rapidmind, now owned by Intel) Thrust library
1-21
GPU Programming Allocate data on the GPU Move data from host, or initialize data on GPU Launch kernel(s) GPU driver can generate ISA code at runtime preserves forward compatibility without requiring ISA compatibility
Gather results from GPU Deallocate data
1-22
Appropriate GPU programs Characterized by nested parallel loops High compute intensity Regular data access Isolated host/GPU data movement
1-23
Behind the Scenes What you write is what you get Implicitly parallel threads into warps warps into thread groups thread groups into a grid
Hardware thread scheduler Highly multithreaded
1-24
CUDA C and CUDA Fortran Simple introductory program Programming model Low-level Programming with CUDA Building CUDA programs
2-25
VADD on Host subroutine host_vadd(A,B,C,N) real(4) :: A(N), B(N), C(N) integer :: N, i do i = 1,N C(i) = A(i) + B(i) enddo end subroutine void host_vadd( float* A, float* B, float* C, int n ){ int i; for( i = 0; i < n; ++i ){ C[i] = A[i] + B[i]; } } 2-26
CUDA C VADD Device Code __global__ void vaddkernel( float* A, float* B, float* C, int n ){ int i; i = blockIdx.x*blockDim.x + threadIdx.x; if( i tolerance) change = 0 do i = 2, m-1 do j = 2, n-1 newa(i,j) = w0*a(i,j) + & w1 * (a(i-1,j)+a(i,j-1)+a(i+1,j)+a(i,j+1)) + & w2 * (a(i-1,j-1)+a(i-1,j+1)+a(i+1,j-1)+a(i+1,j+1)) change = max(change,abs(newa(i,j)-a(i,j))) enddo enddo do i = 2, m-1 do j = 2, n-1 a(i,j) = newa(i,j) enddo enddo enddo 3-34
Jacobi Relaxation do{ change = 0; for( j = 1; j < m-1; ++j ){ for( i = 1; i < n-1; ++i ){ newa[j][i] = w0*a[j][i] + w1 * (a[j][i-1] + a[j-1][i] + a[j][i+1] + a[j+1][i]) + w2 * (a[j-1][i-1] + a[j+1][i-1] + a[j-1][i+1] + a[j+1][i+1]); change = fmax(change,fabs(newa[j][i]-a[j][i])); } } tmp = a; a = newa; newa = tmp; }while( change > tolerance );
3-35
Jacobi Relaxation change = tolerance + 1.0 !$omp parallel shared(change) do while(change > tolerance) change = 0 !$omp do reduction(max:change) private(i,j) do i = 2, m-1 do j = 2, n-1 newa(i,j) = w0*a(i,j) + & w1 * (a(i-1,j)+a(i,j-1)+a(i+1,j)+a(i,j+1)) + & w2 * (a(i-1,j-1)+a(i-1,j+1)+a(i+1,j-1)+a(i+1,j+1)) change = max(change,abs(newa(i,j)-a(i,j))) enddo enddo !$omp do private(i,j) do i = 2, m-1 do j = 2, n-1 a(i,j) = newa(i,j) enddo enddo enddo
3-36
Jacobi Relaxation #pragma omp parallel shared(change) private(tmp) do{ change = 0; #pragma omp for private(i,j) reduction(max:change) for( j = 1; j < m-1; ++j ){ for( i = 1; i < n-1; ++i ){ newa[j][i] = w0*a[j][i] + w1 * (a[j][i-1] + a[j-1][i] + a[j][i+1] + a[j+1][i]) + w2 * (a[j-1][i-1] + a[j+1][i-1] + a[j-1][i+1] + a[j+1][i+1]); change = fmax(change,fabs(newa[j][i]-a[j][i])); } } tmp = a; a = newa; newa = tmp; }while( change > tolerance ); 3-37
change = tolerance + 1.0 !$acc data create(newa(1:m,1:n)) copy(a(1:m,1:n)) do while(change > tolerance) change = 0 !$acc kernels reduction(max:change) do i = 2, m-1 do j = 2, n-1 newa(i,j) = w0*a(i,j) + & w1 * (a(i-1,j)+a(i,j-1)+a(i+1,j)+a(i,j+1)) + & w2 * (a(i-1,j-1)+a(i-1,j+1)+a(i+1,j-1)+a(i+1,j+1)) change = max(change,abs(newa(i,j)-a(i,j))) enddo enddo do i = 2, m-1 do j = 2, n-1 a(i,j) = newa(i,j) enddo enddo !$acc end kernels enddo !$acc end data
3-38
#pragma acc data create(newa[0:n-1][0:m-1])\ copy(a[0:n-1][0:m-1]) do{ change = 0; #pragma acc kernels private(i,j) reduction(max:change) for( j = 1; j < m-1; ++j ){ for( i = 1; i < n-1; ++i ){ newa[j][i] = w0*a[j][i] + w1 * (a[j][i-1] + a[j-1][i] + a[j][i+1] + a[j+1][i]) + w2 * (a[j-1][i-1] + a[j+1][i-1] + a[j-1][i+1] + a[j+1][i+1]); change = fmax(change,fabs(newa[j][i]-a[j][i])); } } tmp = a; a = newa; newa = tmp; }while( change > tolerance );
3-39
Why use Accelerator Directives? Productivity Higher level programming model a la OpenMP
Portability ignore directives, portable to the host portable to other accelerators performance portability
Performance feedback Downsides it’s not as easy as inserting a few directives a good host algorithm is not necessarily a good GPU algorithm 3-40
Basic Syntactic Concepts Fortran accelerator directive syntax !$acc directive [clause]... & continuation Fortran-77 syntax rules !$acc or C$acc or *$acc in columns 1-5 continuation with nonblank in column 6
C accelerator directive syntax #pragma acc directive [clause]... eol continue to next line with backslash
3-41
Construct construct is single-entry/single-exit construct in Fortran, delimited by begin/end directives in C, a single statement, or {...} region no jumps into/out of region, no return
compute region contains loops to send to GPU loop iterations translated to GPU threads loop indices become threadidx/blockidx indices
data region encloses compute regions data moved at region boundaries
Construct vs. Region construct is the lexical instance in the program region is the dynamic instance when running the program 3-42
Appropriate Algorithm Nested parallel loops iterations map to threads parallelism means threads are independent nested loops means lots of parallelism
Regular array indexing allows for stride-1 array fetches
3-43
#pragma acc data create(newa[0:n][0:m])\ copy(a[0:n][0:m]) do{ change = 0; #pragma acc kernels private(i,j) reduction(max:change) for( j = 1; j < m-1; ++j ){ for( i = 1; i < n-1; ++i ){ newa[j][i] = w0*a[j][i] + w1 * (a[j][i-1] + a[j-1][i] + a[j][i+1] + a[j+1][i]) + w2 * (a[j-1][i-1] + a[j+1][i-1] + a[j-1][i+1] + a[j+1][i+1]); change = fmax(change,fabs(newa[j][i]-a[j][i])); } } tmp = a; a = newa; newa = tmp; }while( change > tolerance );
3-44
#pragma acc data create(newa[0:n][0:m])\ copy(a[0:n][0:m]) do{ change = 0; #pragma acc parallel private(i,j) reduction(max:change) { #pragma acc loop for( j = 1; j < m-1; ++j ){ for( i = 1; i < n-1; ++i ){ newa[j][i] = w0*a[j][i] + w1 * (a[j][i-1] + a[j-1][i] + a[j][i+1] + a[j+1][i]) + w2 * (a[j-1][i-1] + a[j+1][i-1] + a[j-1][i+1] + a[j+1][i+1]); change = fmax(change,fabs(newa[j][i]-a[j][i])); } } } tmp = a; a = newa; newa = tmp; }while( change > tolerance ); 3-45
Behind the Scenes compiler determines parallelism (kernels) or applies parallelism (parallel) compiler generates thread code
split up the iterations into threads, thread groups inserts code to use software data cache accumulate partial sum second kernel to combine final sum
compiler also inserts data movement compiler or user determines what data to move data moved at boundaries of data/compute region
3-46
Behind the Scenes virtualization penalties fine grain control thread scheduling shared memory usage loop unrolling
3-47
Building Accelerator Programs pgfortran –acc a.f90 pgcc –acc a.c Other options: -ta=nvidia[,cc1x|cc2x|cc3x] default in siterc file: set COMPUTECAP=30; -ta=nvidia[,cuda5.0] default in siterc file: set DEFCUDAVERSION=5.0; -ta=nvidia,fastmath,nofma
Enable compiler feedback with –Minfo or –Minfo=accel 3-48
Program Execution Model Host
executes most of the program allocates accelerator memory initiates data copy from host memory to accelerator sends kernel code to accelerator queues kernels for execution on accelerator waits for kernel completion initiates data copy from accelerator to host memory deallocates accelerator memory
Accelerator executes kernels, one after another concurrently, may transfer data between host and accelerator 3-49
Controlling Data Movement data clauses data construct and data region update directive asynchronous data movement, the wait directive
3-50
Data Clauses C #pragma acc data copyin(a[0:n]) copyout(r[0:n]) { .... } Fortran !$acc data copyin(a(1:n)) copyout(r(1:n)) ... !$acc end data
3-51
Data Clauses C
#pragma acc data copyin(a[0:n][0:m]) \ copy(r[0:n][0:m]) { ... }
Fortran !$acc data copyin(a(1:m,1:n)) copy(r(:,:)) ... !$acc end data
3-52
Data Clauses C
#pragma acc data copyin(a[0:n][0:m]) \ create(r[0:n][0:m]) { ... }
Fortran !$acc data copyin(a(1:m,1:n)) create(r) ... !$acc end data
3-53
Data Clauses C
#pragma acc data copyin(a[0:n][0:m]) \ create(r[1:n-1][1:m-1]) { ... }
Fortran !$acc data copyin(a(1:m,1:n)) & create(r(2:m-1,2:n-1)) ... !$acc end data
3-54
Data Clauses C
#pragma acc data copyin(a[0:n][0:m]) /* data copied to Accelerator here */ { ... } /* data copied to Host here */
Fortran !$acc data copyin(a(1:m,1:n)) ! data copied to Accelerator here ... ! data copied to Host here !$acc end data
3-55
Data Clauses C
#pragma acc data copyin(a[0:n][0:m]) \ present(r[1:n-1][1:m-1]) { ... }
Fortran !$acc data copyin(a(1:m,1:n)) & present(r(2:m-1,2:n-1)) ... !$acc end data
3-56
Data Clauses C
#pragma acc data present_or_copyin(a[0:n]) \ present(r[1:n-1]) { ... }
Fortran !$acc data present_or_copyin(a(1:m)) & present(r(2:m-1)) ... !$acc end data
3-57
Data Clauses C
#pragma acc data pcopyin(a[0:n]) \ present(r[1:n-1]) { ... }
Fortran !$acc data pcopyin(a(1:m)) & present(r(2:m-1)) ... !$acc end data
3-58
Data Clauses
copy( list ) copyin( list ) copyout( list ) create( list ) present( list ) present_or_copy( list ) present_or_copyin( list ) present_or_copyout( list ) present_or_create( list ) deviceptr( list )
pcopy( list ) pcopyin( list ) pcopyout( list ) pcreate( list )
3-59
Data Region C #pragma acc data data-clauses if( condition ) { .... } Fortran !$acc data data-clauses if( condition ) .... !$acc end data
May be nested and may contain compute regions May not be nested within a compute region May contain procedure calls
3-60
Update Directive C #pragma acc update host( list ) #pragma acc update device( list ) Fortran !$acc update host( list ) !$acc update device( list ) data must be in a data clause for an enclosing data region implies present( list ) both may be on a single line
update host( list ) device( list ) 3-61
Asynchronous Update Directive C #pragma acc update host( list ) async(1) #pragma acc update device( list ) async(2) Fortran !$acc update host( list ) async(3) !$acc update device( list ) async(n)
async value should be >= 0 mapped down to some number of actual async queues updates with same value will execute in program order async with no value is same as async(acc_async_noval)
async(acc_async_sync) is same as no async clause host program may continue 3-62
Wait Directive C #pragma acc wait( 2 ) #pragma acc wait Fortran !$acc wait( n, n-1 ) !$acc wait wait with no values waits for ALL asynchronous queues wait with multiple values waits for each queue NO implied wait at end of data construct async updates in a data region require wait 3-63
Data Regions Across Procedures subroutine sub( a, b ) real :: a(:), b(:) !$acc kernels copyin(b) do i = 1,n a(i) = a(i) * b(i) enddo !$acc end kernels ... end subroutine subroutine bus(x, y) real :: x(:), y(:) !$acc data copy(x) call sub( x, y ) !$acc end data
3-64
Data Regions Across Procedures void sub( float* a, float* b, int n ){ int i; #pragma acc kernels copyin(b[0:n]) for( i = 0; i < n; ++i ) a[i] *= b[i]; ... } void bus( float* x, float* y, int n ){ #pragma acc data copy(x[0:n]) { sub( x, y, n ); } ... 3-65
Loop Directive C #pragma acc loop clause... for( i = 0; i < n; ++i ){ .... } Fortran !$acc loop clause... do i = 1, n
3-66
Kernels Loop Scheduling Clauses !$acc loop gang runs in ‘gang’ mode only (blockIdx) does not declare that the loop is in fact parallel (use independent)
!$acc loop gang(32) runs in ‘parallel’ mode only with gridDim == 32 (32 blocks)
!$acc loop vector(128) runs in ‘vector’ mode (threadIdx) with blockDim == 128 (128 threads) vector size, if present, must be compile-time constant
!$acc loop gang vector(128) strip mines loop inner loop runs in vector mode, 128 threads (threadIdx) outer loop runs across thread blocks (blockIdx)
3-67
Loop Scheduling Clauses Want stride-1 loop to be in ‘vector’ mode (threadIdx) look at –Minfo messages!
Want lots of parallelism
3-68
Loop Directive Clauses Scheduling Clauses
vector or vector(n) gang or gang(n) worker or worker(n) independent use with care, overrides compiler analysis for dependence, private variables
private( list ) private data for each iteration of the loop different from local (how?)
reduction( red:var ) reduction across the loop 3-69
Compiler Feedback Messages Data related Generating copyin(b(1:n,1:m)) Generating copyout(b(2:n-1,2:m-1)) Generating copy(a(1:n,1:n))
Generating create(c(1:n,1:n))
Loop or kernel related Loop is parallelizable Accelerator kernel generated
Barriers to GPU code generation No parallel kernels found, accelerator region ignored Loop carried dependence due to exposed use of ... prevents parallelization
Parallelization would require privatization of array ...
Compiler Messages Continued Memory optimization related Cached references to size [(x+2)x(y+2)] block of ‘b’ Non-stride-1 memory accesses for ‘a’
Combined Directives C #pragma acc kernels loop copyin(a[0:n])\ gang vector(32) for( i = 0; i < n; ++i ) r[i] = a[i]*2.0f; Fortran !$acc kernels loop copyin(a(1:n)) & gang vector(32) do i = 1,n r(i) = a(i) * 2.0 enddo
3-72
Parallel Region C #pragma acc parallel loop copyin(a[0:n]) for( i = 0; i < n; ++i ) r[i] = a[i]*2.0f; Fortran !$acc parallel loop copyin(a(1:n)) do i = 1,n r(i) = a(i) * 2.0 enddo
3-73
Selecting Device C #include n = acc_get_num_devices(acc_device_nvidia); ... acc_set_device_num( 1, acc_device_nvidia); Fortran
use accel_lib n = acc_get_num_devices( acc_device_nvidia ) ... call acc_set_device_num( 0, acc_device_nvidia) Environment Variable setenv ACC_DEVICE_NUM 1 export ACC_DEVICE_NUM=1
3-74
Performance Profiling TAU (Tuning and Analysis Utilities, University of Oregon) collects performance information
cudaprof (NVIDIA) gives a trace of kernel execution
PGI_ACC_TIME environment variable dump of region-level and kernel-level performance upload/download data movement time kernel execution time
pgcollect a.out PGI_ACC_PROFILE environment variable enables profile data collection for accelerator regions
ACC_NOTIFY environment variable prints one line for each kernel invocation
3-75
Performance Profiling Accelerator Kernel Timing data f3.f90 smooth 24: region entered 1 time time(us): total=1116701 init=1115986 region=715 kernels=22 data=693 w/o init: total=715 max=715 min=715 avg=715 27: kernel launched 5 times grid: [7x7] block: [16x16] time(us): total=17 max=10 min=1 avg=3 34: kernel launched 5 times grid: [7x7] block: [16x16] time(us): total=5 max=1 min=1 avg=1
3-76
Directives Summary
acc kernels
if(condition) copy(list) copyin(list)
copyout(list)
create(list)
acc parallel if(condition) num_gangs(n) num_workers(n) vector_length(n)
acc data copy(list) copyin(list) if(condition)
copyout(list)
create(list)
acc update device(list) host(list) acc loop
gang(width) worker(width) vector(width) independent private(list) reduction(red:var)
3-77
Runtime Summary acc_get_num_devices( acc_device_nvidia ) acc_set_device_num( n, acc_device_nvidia ) acc_set_device( acc_device_nvidia | acc_device_host ) acc_get_device() acc_init( acc_device_nvidia | acc_device_host )
3-78
Environment Variable Summary ACC_DEVICE_NUM ACC_DEVICE ACC_NOTIFY PGI_ACC_TIME
3-79
Command Line Summary -ta=nvidia -ta=nvidia,nofma -ta=nvidia,fastmath -ta=nvidia,[cc1x|cc2x|cc3x]
(multiple allowed)
-ta=nvidia,maxregcount:n -ta=nvidia,cuda5.0
3-80
Accelerator Programming Timeline late 1990s – GPGPU Programming
early 2000s – Brook project (and others) 2007 – NVIDIA CUDA release, SC07 tutorial HiCUDA, HMPP (and others) EXOCHI: PLDI 2007 (Intel)
2008 – PGI Accelerator Programming Model Larrabee paper at SIGGRAPH (Intel)
2009 – OpenMP BoF at SC09 Programming Model for Heterogeneous x86: PLDI 2009 (Intel) Larrabee featured in Justin Rattner (Intel) opening address
2010 – OpenMP Accelerator committee 2011 – OpenACC API 1.0 Language Extensions for Offload (LEO, Intel)
3-81
Where to get help • PGI Customer Support -
[email protected] • PGI User's Forum - http://www.pgroup.com/userforum/index.php • PGI Articles - http://www.pgroup.com/resources/articles.htm http://www.pgroup.com/resources/accel.htm • PGI User's Guide - http://www.pgroup.com/doc/pgiug.pdf
• CUDA Fortran Reference Guide http://www.pgroup.com/doc/pgicudafortug.pdf
OpenACC Members CAPS Entreprise
Allinea
Cray, Inc.
CSCS (Switzerland)
NVIDIA
Oak Ridge National Lab
The Portland Group, Inc.
Tech. Univ. Dresden Univ. Houston
OpenACC Supporters Sandia Georgia Tech Rogue Wave 3-83
Copyright Notice
© Contents copyright 2009-2012, The Portland Group, Inc. This material may not be reproduced in any manner without the expressed written permission of The Portland Group.
3-84