3/2/2005
Coarse-Grain Parallelism
Optimizing Compilers for Modern Architectures
Introduction
• Transformations in the previous chapter found fine-grained parallelism in the innermost loop
• The transformations targeted vector and superscalar architectures
• In this chapter, we worry about transformations for symmetric multiprocessor machines
• The difference between these transformations tends to be one of granularity
Review
• SMP machines have multiple processors all accessing a central memory
• This is also called a shared memory multiprocessor

      p1    p2    p3    p4
      ────────Bus────────
           Memory

• The processors are unrelated and can run separate processes
• Starting processes and synchronization between processes is expensive
Coarse-grained Parallelism
• Create a thread on each processor
• These threads execute in parallel for a period of time
—There is occasional synchronization
• The threads are synchronized at the end by a barrier
Synchronization
• A basic synchronization element is the barrier
• A barrier in a program forces all processes to reach a certain point before execution continues for any process
• Bus contention can cause slowdowns
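The barrier idea can be sketched directly with Python's `threading.Barrier` (the worker count and phase names here are made up for illustration): no thread enters its second phase until every thread has finished its first.

```python
import threading

# Sketch: four workers each do a "phase 1" of work, then wait at a
# barrier, so no worker starts phase 2 before all have finished phase 1.
NUM_WORKERS = 4
barrier = threading.Barrier(NUM_WORKERS)
order = []
lock = threading.Lock()

def worker(wid):
    with lock:
        order.append(("phase1", wid))
    barrier.wait()                 # all workers must arrive before any proceeds
    with lock:
        order.append(("phase2", wid))

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every phase-1 entry precedes every phase-2 entry.
assert all(p == "phase1" for p, _ in order[:NUM_WORKERS])
assert all(p == "phase2" for p, _ in order[NUM_WORKERS:])
```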
Synchronization
• Our focus will be to find parallel loops with a large granularity so we can overcome the overhead of parallelism initiation and synchronization
• Focus on finding loops with significant amounts of computation within their bodies
• This usually means parallelization of outer loops rather than inner loops
Trade-Offs
• Minimize communication
• Minimize synchronization overhead
• Load balancing
—Make sure the processors are as busy as possible
• These trade-offs can work against each other
Loop Types
• PARALLEL DO
—Iterations can be correctly run in any order
—Also called a DOALL loop
• DOACROSS
—Pipelines parallel loop iterations with cross-iteration synchronization
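The DOACROSS idea — one thread per iteration, with each iteration waiting for the previous one to "post" the value it needs — can be mimicked in Python using `threading.Event` for the cross-iteration synchronization (the recurrence and array sizes below are invented for the sketch):

```python
import threading

# Sketch of DOACROSS-style pipelining: iteration i cannot compute
# A[i] = A[i-1] + 1 until iteration i-1 has posted its result.
N = 8
A = [0] * (N + 1)
done = [threading.Event() for _ in range(N + 1)]
done[0].set()                     # the value before the loop is "ready"

def iteration(i):
    done[i - 1].wait()            # cross-iteration synchronization (wait)
    A[i] = A[i - 1] + 1           # recurrence: needs the previous iteration
    done[i].set()                 # post: let iteration i+1 proceed

threads = [threading.Thread(target=iteration, args=(i,)) for i in range(1, N + 1)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert A == [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

In a real DOACROSS loop only the statements touched by the recurrence are serialized; any independent work in the body can overlap across iterations.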
Single-Loop Methods
• If the loop is sequential (i.e., carries a dependence), find a way to make it parallel
—Any transformation that eliminates a loop-carried dependence (e.g., loop distribution) can achieve this goal
• Increase the granularity of the exposed parallelism
• We loop through the transformations to achieve the above goals
Privatization
• The analog of scalar expansion is privatization
• Temporaries can be given a separate copy for each iteration

Before:
      DO I = 1, N
S1      T = A(I)
S2      A(I) = B(I)
S3      B(I) = T
      ENDDO

After:
      PARALLEL DO I = 1, N
        PRIVATE t
S1      t = A(I)
S2      A(I) = B(I)
S3      B(I) = t
      ENDDO
Privatization
• Definition: A scalar variable x in a loop L is said to be privatizable if every path from the loop entry to a use of x inside the loop passes through a definition of x
• Privatizability can be stated as a data-flow problem:

    up(x) = use(x) ∪ ( ¬def(x) ∩ ∪_{y ∈ succ(x)} up(y) )

    private(L) = ¬up(entry) ∩ ( ∪_{y ∈ L} def(y) )

• up(x) – set of variables upward-exposed at the beginning of block x
• def(x) – set of variables defined in block x
• use(x) – set of variables that have upward-exposed uses in block x
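These equations form a backward data-flow problem that can be solved by fixpoint iteration. The sketch below encodes the T = A(I); A(I) = B(I); B(I) = T example as a tiny two-block loop-body CFG (block names and sets are made up for illustration) and checks that T comes out privatizable:

```python
# Backward data-flow sketch of the privatizability equations:
#   up(x) = use(x) ∪ (up over successors of x, minus def(x))
#   a scalar is privatizable in L if it is defined in L and is not
#   upward-exposed at the loop entry.
cfg = {                       # block -> successors
    "entry": ["b1"],
    "b1": ["b2"],
    "b2": [],
}
# b1 models "t = A(I); A(I) = B(I)" (defines t), b2 models "B(I) = t"
# (upward-exposed use of t).
use = {"entry": set(), "b1": set(), "b2": {"t"}}
defs = {"entry": set(), "b1": {"t"}, "b2": set()}

up = {n: set() for n in cfg}
changed = True
while changed:                # iterate to a fixpoint (backward problem)
    changed = False
    for n in cfg:
        succ_up = set().union(*(up[s] for s in cfg[n]))
        new = use[n] | (succ_up - defs[n])
        if new != up[n]:
            up[n], changed = new, True

defined_in_loop = set().union(*defs.values())
private = defined_in_loop - up["entry"]
assert private == {"t"}       # t's use is covered by its definition
```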
Privatization • A variable x defined in a loop may be made private if and only if the SSA graph for the variable does not have a φ-function at the entry to the loop
Scalar Expansion

Before:
      DO I = 1, N
        T = A(I) + B(I)
        A(I-1) = T
      ENDDO

After:
      PARALLEL DO I = 1, N
        T$(I) = A(I) + B(I)
      ENDDO
      PARALLEL DO I = 1, N
        A(I-1) = T$(I)
      ENDDO
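A quick sanity check, in Python with invented data, that expanding T into the array T$ preserves the result: the expanded form does all the reads of A in the first loop and all the writes in the second, which matches the original execution order.

```python
# Sketch: scalar T expanded into array t_ex, so each iteration writes
# its own element and both loops become independent.
N = 8
A_init = list(range(N + 1))     # indices 0..N stand in for A(0)..A(N)
B = [i * 10 for i in range(N + 1)]

# Original sequential form: T = A(I)+B(I); A(I-1) = T
A1 = A_init[:]
for i in range(1, N + 1):
    T = A1[i] + B[i]
    A1[i - 1] = T

# Expanded form: two loops, each now parallelizable
A2 = A_init[:]
t_ex = [0] * (N + 1)
for i in range(1, N + 1):       # PARALLEL DO: iterations independent
    t_ex[i] = A2[i] + B[i]
for i in range(1, N + 1):       # PARALLEL DO: iterations independent
    A2[i - 1] = t_ex[i]

assert A1 == A2
```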
Array Privatization
• We sometimes need to privatize array variables
• For iteration J, the upward-exposed variables are those exposed by the loop body minus the variables defined in earlier iterations

      DO I = 1, 100
S0      T(1) = X
L1      DO J = 2, N
S1        T(J) = T(J-1) + B(I,J)
S2        A(I,J) = T(J)
        ENDDO
      ENDDO

    up(L1) = ∪_{J=2}^{N} ( {T(J−1)} − T(2:J−1) ) = {T(1)}

• So for this fragment, T(1) is the only upward-exposed variable
Array Privatization
• Using this analysis, we get the following code:

      PARALLEL DO I = 1, 100
        PRIVATE t(N)
S0      t(1) = X
L1      DO J = 2, N
S1        t(J) = t(J-1) + B(I,J)
S2        A(I,J) = t(J)
        ENDDO
      ENDDO
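The effect of PRIVATE t(N) can be simulated in Python by giving each outer iteration a freshly allocated list instead of a shared one (loop bounds and the contents of B and X below are invented; real threads are omitted since independence is the point):

```python
# Sketch: each outer iteration gets its own private copy of t, so the
# outer I loop carries no dependence through t and could run in parallel.
N, M = 6, 5                     # J runs 2..N, I runs 1..M (made-up sizes)
B = [[(i + 1) * (j + 1) for j in range(N + 1)] for i in range(M + 1)]
X = 7.0
A = [[0.0] * (N + 1) for _ in range(M + 1)]

def body(i):
    t = [0.0] * (N + 1)         # PRIVATE t(N): fresh storage per iteration
    t[1] = X
    for j in range(2, N + 1):
        t[j] = t[j - 1] + B[i][j]
        A[i][j] = t[j]

for i in range(1, M + 1):       # iterations are now independent
    body(i)

assert A[1][2] == X + B[1][2]
assert A[2][3] == X + B[2][2] + B[2][3]
```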
Loop Distribution
• Loop distribution eliminates carried dependences
• Consequently, it often creates opportunities for outer-loop parallelism
• We must add extra barriers to keep dependent loops from executing out of order, so the overhead may outweigh the parallel savings
• Attempt other transformations before attempting loop distribution
Loop Distribution

Before:
      DO I = 1, 100
        DO J = 1, 100
S1        A(I,J) = B(I,J) + C(I,J)
S2        D(I,J) = A(I,J-1) ∗ 2.0
        ENDDO
      ENDDO

After:
      DO I = 1, 100
        DO J = 1, 100
S1        A(I,J) = B(I,J) + C(I,J)
        ENDDO
        DO J = 1, 100
S2        D(I,J) = A(I,J-1) ∗ 2.0
        ENDDO
      ENDDO

• Loop distribution converts a sequential loop into multiple parallel loops
• Synchronization barriers must be placed between the loops
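A quick way to convince yourself the distribution is safe is to run both forms and compare. The Python sketch below (made-up data; the "parallel" J loops are run sequentially, which is legal precisely because their iterations are independent) checks the two versions compute identical results:

```python
# Sketch: distributing the J loop. S2 reads A(I,J-1), written by S1 on
# the previous J iteration, so the S1 loop must finish before the S2
# loop starts -- the barrier between the distributed loops.
N = 5
B = [[1.0] * (N + 1) for _ in range(N + 1)]
C = [[2.0] * (N + 1) for _ in range(N + 1)]

# Original fused form
Ao = [[0.0] * (N + 1) for _ in range(N + 1)]
D1 = [[0.0] * (N + 1) for _ in range(N + 1)]
for i in range(1, N + 1):
    for j in range(1, N + 1):
        Ao[i][j] = B[i][j] + C[i][j]
        D1[i][j] = Ao[i][j - 1] * 2.0

# Distributed form: each inner J loop is now parallelizable
Ad = [[0.0] * (N + 1) for _ in range(N + 1)]
D2 = [[0.0] * (N + 1) for _ in range(N + 1)]
for i in range(1, N + 1):
    for j in range(1, N + 1):       # parallel loop 1
        Ad[i][j] = B[i][j] + C[i][j]
    # --- barrier ---
    for j in range(1, N + 1):       # parallel loop 2
        D2[i][j] = Ad[i][j - 1] * 2.0

assert Ao == Ad and D1 == D2
```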
Alignment
• Many carried dependences are due to array alignment issues
• If we can align all the references, the dependences go away and parallelism is possible

Before:
      DO I = 2, N
        A(I) = B(I) + C(I)
        D(I) = A(I-1) ∗ 2.0
      ENDDO

After:
      DO i = 1, N
        IF (i .GT. 1) A(i) = B(i) + C(i)
        IF (i .LT. N) D(i+1) = A(i) ∗ 2.0
      ENDDO
Alignment
• Loop alignment is carried out by increasing the number of loop iterations
• The loop-carried dependence becomes a loop-independent dependence
• There is overhead:
—Executing the loop more times
—Executing the tests every time through the loop
Alignment
• This overhead can be reduced by executing the last iteration of the first statement with the first iteration of the second statement

      DO i = 2, N
        j = i - 1
        IF (i .EQ. 2) j = N
        A(j) = B(j) + C(j)
        D(i) = A(i-1) ∗ 2.0
      ENDDO
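The wrap-around trick is easy to get wrong, so here is a Python check with invented data: at i = 2 the first statement does iteration N's work (j = N), and afterwards j = i − 1 walks through the remaining indices.

```python
# Sketch: wrap-around alignment checked against the original loop.
N = 10
B = [float(k) for k in range(N + 1)]
C = [1.0] * (N + 1)

# Original: DO I = 2, N
A1 = [0.0] * (N + 1)
D1 = [0.0] * (N + 1)
for i in range(2, N + 1):
    A1[i] = B[i] + C[i]
    D1[i] = A1[i - 1] * 2.0

# Aligned with wrap-around: each iteration now writes A(j) and reads
# A(i-1) = A(j), so the dependence is loop-independent.
A2 = [0.0] * (N + 1)
D2 = [0.0] * (N + 1)
for i in range(2, N + 1):
    j = N if i == 2 else i - 1
    A2[j] = B[j] + C[j]
    D2[i] = A2[i - 1] * 2.0

assert A1 == A2 and D1 == D2
```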
Alignment
• There are other ways to align the loop:
—We can peel off the first and last iterations of the loop

Before:
      DO I = 2, N
        A(I) = B(I) + C(I)
        D(I) = A(I-1) ∗ 2.0
      ENDDO

After:
      D(2) = A(1) ∗ 2.0
      DO I = 2, N-1
        A(I) = B(I) + C(I)
        D(I+1) = A(I) ∗ 2.0
      ENDDO
      A(N) = B(N) + C(N)
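The peeled version can be checked the same way (Python, made-up data): inside the shortened loop each iteration pairs A(I) with D(I+1), so nothing is carried between iterations.

```python
# Sketch: peeling the boundary iterations so the loop body carries
# no dependence -- D(I+1) uses the A(I) computed in the same iteration.
N = 10
B = [float(k) for k in range(N + 1)]
C = [1.0] * (N + 1)

A1 = [0.0] * (N + 1)
D1 = [0.0] * (N + 1)
for i in range(2, N + 1):           # original
    A1[i] = B[i] + C[i]
    D1[i] = A1[i - 1] * 2.0

A2 = [0.0] * (N + 1)
D2 = [0.0] * (N + 1)
D2[2] = A2[1] * 2.0                 # peeled first iteration of S2
for i in range(2, N):               # DO I = 2, N-1: now parallel
    A2[i] = B[i] + C[i]
    D2[i + 1] = A2[i] * 2.0
A2[N] = B[N] + C[N]                 # peeled last iteration of S1

assert A1 == A2 and D1 == D2
```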
Alignment
• Is it always possible to align loops to eliminate loop-carried dependences?

Before:
      DO I = 1, N
        A(I) = B(I) + C
        B(I+1) = A(I) + D
      ENDDO

After:
      DO I = 1, N+1
        IF (I .NE. 1) B(I) = A(I-1) + D
        IF (I .NE. N+1) A(I) = B(I) + C
      ENDDO

• Although B is now aligned, the references to A are misaligned, creating a new carried dependence
Code Replication
• If an array is involved in a recurrence, then alignment isn't possible
• If two dependences between the same statements have different dependence distances, then alignment doesn't work
• We can fix the second case by replicating code:

Before:
      DO I = 1, N
        A(I+1) = B(I) + C
        X(I) = A(I+1) + A(I)
      ENDDO

After:
      DO I = 1, N
        A(I+1) = B(I) + C
        ! Replicated statement
        IF (I .EQ. 1) THEN
          t = A(I)
        ELSE
          t = B(I-1) + C
        END IF
        X(I) = A(I+1) + t
      ENDDO
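Here A feeds X(I) with two different distances (0 through A(I+1) and 1 through A(I)); replicating the computation of A(I) as a local temporary removes the carried one. A Python check with invented data:

```python
# Sketch: code replication. t recomputes A(i) locally (as B(i-1)+C for
# i > 1), so no iteration reads a value written by another iteration.
N = 10
B = [float(k) for k in range(N + 2)]
C = 3.0

A1 = [0.0] * (N + 2)
X1 = [0.0] * (N + 1)
for i in range(1, N + 1):            # original: carries a dependence on A
    A1[i + 1] = B[i] + C
    X1[i] = A1[i + 1] + A1[i]

A2 = [0.0] * (N + 2)
X2 = [0.0] * (N + 1)
for i in range(1, N + 1):            # replicated: iterations independent
    A2[i + 1] = B[i] + C
    t = A2[i] if i == 1 else B[i - 1] + C   # replicated computation of A(i)
    X2[i] = A2[i + 1] + t

assert A1 == A2 and X1 == X2
```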
Alignment
• Theorem: Alignment, replication, and statement reordering are sufficient to eliminate all carried dependences in a single loop containing no recurrence, and in which the distance of each dependence is a constant independent of the loop index
• We can establish this constructively
• Let G = (V,E) be a weighted graph. V represents statements and E represents dependences, each edge labeled with its dependence distance. Let o: V → Z give the offset of each vertex
• G is said to be carry-free if o(v1) + d(e) = o(v2) for every edge e = (v1, v2)
Alignment
      DO I = 1, N
S1      A(I+2) = B(I) + C
S2      X(I+1) = A(I) + D
S3      Y(I) = A(I+1) + X(I)
      ENDDO
Alignment Procedure

    procedure Align(V, E, µ, o)
      while V is not empty
        remove element v from V
        for each (w,v) ∈ E
          if w ∈ V
            W ← W ∪ {w}
            o(w) ← o(v) - µ(w,v)
          else if o(w) ≠ o(v) - µ(w,v)
            create vertex w'
            replace (w,v) with (w',v)
            replicate all edges into w onto w'
            W ← W ∪ {w'}
            o(w') ← o(v) - µ(w,v)
        for each (v,w) ∈ E
          if w ∈ V
            W ← W ∪ {w}
            o(w) ← o(v) + µ(v,w)
          else if o(w) ≠ o(v) + µ(v,w)
            create vertex v'
            replace (v,w) with (v',w)
            replicate edges into v onto v'
            W ← W ∪ {v'}
            o(v') ← o(w) - µ(v,w)
    end Align
Alignment
• If we use the code generation given in the book, we get the code shown:

      DO I = 1, N+3
S1      IF (I .GE. 4) A(I-1) = B(I-3) + C
S1'     IF (I .GE. 2 .AND. I .LE. N+1) THEN
          t = B(I-1) + C
        ELSE
          t = A(I+1)
        ENDIF
S2      IF (I .GE. 2 .AND. I .LE. N+1) X(I) = A(I-1) + D
S3      IF (I .LE. N) Y(I) = t + X(I)
      ENDDO
Loop Fusion
• Loop distribution is a method for separating the parallel parts of a loop
• Our solution attempted to find the maximal loop distribution
• The maximal distribution often finds parallelizable components too small for efficient parallelization
• Two obvious solutions:
—Strip-mine large loops to create larger granularity
—Perform maximal distribution, and fuse together parallelizable loops
Loop Fusion

Original:
      DO I = 1, N
S1      A(I) = B(I) + 1
S2      C(I) = A(I) + C(I-1)
S3      D(I) = A(I) + X
      ENDDO

After maximal distribution:
L1    DO I = 1, N
        A(I) = B(I) + 1
      ENDDO
L2    DO I = 1, N
        C(I) = A(I) + C(I-1)
      ENDDO
L3    DO I = 1, N
        D(I) = A(I) + X
      ENDDO

After fusing the parallel loops L1 and L3:
      PARALLEL DO I = 1, N
        A(I) = B(I) + 1
        D(I) = A(I) + X
      ENDDO
L2    DO I = 1, N
        C(I) = A(I) + C(I-1)
      ENDDO
Fusion Safety
• Definition: A loop-independent dependence between statements S1 and S2 in loops L1 and L2 respectively is fusion-preventing if fusing L1 and L2 causes the dependence to be carried by the combined loop in the opposite direction

Before:
      DO I = 1, N
S1      A(I) = B(I) + C
      ENDDO
      DO I = 1, N
S2      D(I) = A(I+1) + E
      ENDDO

After fusing:
      DO I = 1, N
S1      A(I) = B(I) + C
S2      D(I) = A(I+1) + E
      ENDDO

• Notice the loop-independent dependence has become a backward loop-carried antidependence
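That this fusion actually changes the program's meaning is easy to demonstrate in Python (invented data): in the separate loops S2 reads the freshly written A(I+1), but in the fused loop A(I+1) has not been written yet when iteration I runs.

```python
# Sketch: a fusion-preventing dependence. Fusing the two loops makes
# S2 read a stale A(I+1), so the results differ.
N = 6
B = [1.0] * (N + 2)
C, E = 2.0, 5.0

# Separate loops (original, correct)
A1 = [0.0] * (N + 2)
D1 = [0.0] * (N + 1)
for i in range(1, N + 1):
    A1[i] = B[i] + C
for i in range(1, N + 1):
    D1[i] = A1[i + 1] + E        # reads the *new* A(I+1) for i < N

# Fused (incorrectly)
A2 = [0.0] * (N + 2)
D2 = [0.0] * (N + 1)
for i in range(1, N + 1):
    A2[i] = B[i] + C
    D2[i] = A2[i + 1] + E        # A2[i+1] not yet written: stale value

assert A1 == A2
assert D1 != D2                  # fusion changed the semantics
```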
Fusion Safety
• We shouldn't fuse loops if fusing would violate the ordering of the dependence graph
• Ordering Constraint: Two loops can't be validly fused if there exists a path of loop-independent dependences between them containing a loop or statement not being fused with them

      DO I = 1, N
S1      A(I) = B(I) + 1
S2      C(I) = A(I) + C(I-1)
S3      D(I) = A(I) + C(I)
      ENDDO

• After distribution, fusing L1 with L3 violates the ordering constraint: {L1,L3} would have to occur both before and after the node L2
Fusion Profitability
• Parallel loops should generally not be merged with sequential loops
• Definition: An edge between two statements in loops L1 and L2 respectively is said to be parallelism-inhibiting if, after merging L1 and L2, the dependence is carried by the combined loop

Before:
      DO I = 1, N
S1      A(I+1) = B(I) + C
      ENDDO
      DO I = 1, N
S2      D(I) = A(I) + E
      ENDDO

After fusing:
      DO I = 1, N
S1      A(I+1) = B(I) + C
S2      D(I) = A(I) + E
      ENDDO
Typed Fusion
• We start off by classifying loops into two types: parallel and sequential
• We next gather together all edges that inhibit efficient fusion, and call them bad edges
• Given a loop dependence graph (V,E), we want to obtain a graph (V',E') by merging vertices of V subject to the following constraints:
—Bad Edge Constraint: vertices joined by a bad edge aren't fused
—Ordering Constraint: vertices joined by a path containing a non-parallel vertex aren't fused
Typed Fusion Procedure

    procedure TypedFusion(G, T, type, B, t0)
      initialize all variables to zero
      set count[n] to be the in-degree of node n
      initialize W with all nodes of in-degree zero
      while W isn't empty
        remove element n with type t from W
        if t = t0
          if maxBadPrev[n] = 0
            then p ← fused
            else p ← next[maxBadPrev[n]]
          if p ≠ 0
            x ← node[p]
            num[n] ← num[x]
            update_successors(n, t)
            fuse x and n and call the result n
          else
            create_new_fused_node(n)
            update_successors(n, t)
        else
          create_new_node(n)
          update_successors(n, t)
    end TypedFusion
Typed Fusion Example
[Figure: an original loop graph of eight loops, annotated with (maxBadPrev, p) → num; after fusing the parallel loops, nodes {1,3} and {5,8} are merged; after fusing the sequential loops, nodes {2,4,6} are merged]
Cohort Fusion
• We wish not only to parallelize loops but also to run two sequential loops in parallel, or a sequential loop in parallel with a parallel loop
• Find all the loops that can be run in parallel with one another, either by fusion or because they have no dependence between them
• This model combines some of the features of data and task parallelism
• The principal idea behind the algorithm is to identify at each stage a cohort, which is a collection of loops that can be run in parallel
Cohort Fusion
• Given an outer loop containing some number of inner loops, we want to be able to run some inner loops in parallel
• We can do this as follows:
—Run TypedFusion with B = {fusion-preventing edges, parallelism-inhibiting edges, and edges between a parallel loop and a sequential loop}
—Put a barrier at the end of each identified cohort
—Run TypedFusion again to fuse the parallel loops in each cohort
We've Talked About …
• Single-loop methods
• Privatization
• Loop distribution
• Alignment
• Loop fusion
Loop Interchange
• Parallelization moves dependence-free loops to the outermost level
• Vectorization moves loops to the innermost level
• Theorem:
—In a perfect nest of loops, a particular loop can be parallelized at the outermost level if and only if the column of the direction matrix for that nest contains only '=' entries
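The theorem's test is mechanical: collect the direction vectors of all dependences in the nest as rows of a matrix and inspect one column. A minimal Python sketch (the direction matrix below is the one for the upcoming A(I+1,J) example):

```python
# Sketch of the theorem's test: a loop can be parallelized at the
# outermost position iff its column of the direction matrix is all '='.
def parallelizable_outermost(direction_matrix, loop_index):
    return all(row[loop_index] == "=" for row in direction_matrix)

# Direction matrix for A(I+1,J) = A(I,J) + B(I,J): a single dependence
# with direction vector ('<', '=') over the (I, J) nest.
dm = [("<", "=")]
assert not parallelizable_outermost(dm, 0)   # I carries the dependence
assert parallelizable_outermost(dm, 1)       # J can be moved outermost
```

This is exactly why the example below interchanges the loops: the J column is all '=', so J can run as the outer PARALLEL DO.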
Loop Interchange
      DO I = 1, N
        DO J = 1, N
          A(I+1, J) = A(I, J) + B(I, J)
        ENDDO
      ENDDO

• The outer loop (I) carries a dependence
• OK for vectorization
• Problematic for parallelization
Loop Interchange
      PARALLEL DO J = 1, N
        DO I = 1, N
          A(I+1, J) = A(I, J) + B(I, J)
        ENDDO
      END PARALLEL DO
Loop Interchange
• Suppose we have the following loop nest:
• The direction vector is (