Coarse-Grain Parallelism

Optimizing Compilers for Modern Architectures


Introduction
• Transformations in the previous chapter found fine-grained parallelism in the innermost loop
• The transformations targeted vector and superscalar architectures
• In this chapter, we worry about transformations for symmetric multiprocessor machines
• The difference between these transformations tends to be one of granularity


3/2/2005

Review
• SMP machines have multiple processors, all accessing a central memory
• This is also called a shared-memory multiprocessor
• The processors are unrelated and can run separate processes
• Starting processes and synchronizing between processes is expensive

[Figure: processors p1, p2, p3, and p4 connected through a bus to a single shared memory]


Coarse-grained Parallelism
• Create a thread on each processor
• These threads execute in parallel for a period of time
  - There is occasional synchronization
• The threads are synchronized at the end by a barrier


Synchronization
• A basic synchronization element is the barrier
• A barrier in a program forces all processes to reach a certain point before execution continues for any process
• Bus contention can cause slowdowns
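A minimal sketch of barrier behaviour, written in Python with the standard `threading.Barrier`. The function name `run_two_phases` and the two-phase workload are illustrative assumptions, not taken from the slides. Each worker writes its own slot, waits at the barrier, and only then reads the slots written by the others.

```python
import threading

def run_two_phases(num_workers=4):
    """Each worker writes its slot, waits at the barrier, then reads every slot."""
    barrier = threading.Barrier(num_workers)
    partial = [0] * num_workers   # phase 1 output, one slot per worker
    totals = [0] * num_workers    # phase 2 output, one slot per worker

    def worker(rank):
        partial[rank] = (rank + 1) * 10   # phase 1: independent work
        barrier.wait()                    # nobody continues until all have arrived
        totals[rank] = sum(partial)       # phase 2: every slot is now filled in

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return totals
```

Without the `barrier.wait()` call, a fast worker could sum `partial` before a slow worker had filled its slot.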


Synchronization
• Our focus will be to find parallel loops with a large granularity so we can overcome the overhead of parallelism initiation and synchronization
• Focus on finding loops with significant amounts of computation within their bodies
• This usually means parallelization of outer loops rather than inner loops


Trade Offs
• Minimize communication
• Minimize synchronization overhead
• Load balancing
  - Make sure the processors are as busy as possible
• These trade-offs can work against each other


Loop Types
• PARALLEL DO
  - Iterations can be correctly run in any order
  - Also called a DOALL loop
• DOACROSS
  - Pipelines parallel loop iterations with cross-iteration synchronization


Single-Loop Methods
• If the loop is sequential (i.e., carries a dependence), find a way to make it parallel
  - Any transformation that eliminates loop-carried dependences (e.g., loop distribution) can achieve this goal
• Increase the granularity of the exposed parallelism
• We loop through the transformations to achieve the above goals


Privatization
• The analog of scalar expansion is privatization
• Temporaries can be given separate namespaces for each iteration

      DO I = 1, N
S1      T = A(I)
S2      A(I) = B(I)
S3      B(I) = T
      ENDDO

      PARALLEL DO I = 1, N
        PRIVATE t
S1      t = A(I)
S2      A(I) = B(I)
S3      B(I) = t
      ENDDO
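The effect of the PRIVATE declaration can be sketched in Python by simulating iteration orders. This is an illustration under stated assumptions: the function names are hypothetical, and "parallel" execution is modelled by running iterations in an arbitrary order (private case) or in lock step with one shared temporary (shared case).

```python
def swap_sequential(a, b):
    """Reference: the original sequential loop."""
    a, b = list(a), list(b)
    for i in range(len(a)):
        t = a[i]
        a[i] = b[i]
        b[i] = t
    return a, b

def swap_private(a, b, order):
    """Each iteration gets its own t, so any iteration order is correct."""
    a, b = list(a), list(b)

    def body(i):
        t = a[i]          # t is local to this call: the private copy
        a[i] = b[i]
        b[i] = t

    for i in order:
        body(i)
    return a, b

def swap_shared_lockstep(a, b):
    """One shared T; all iterations run S1, then all run S2, then all run S3."""
    a, b = list(a), list(b)
    shared_t = None
    for i in range(len(a)):
        shared_t = a[i]   # S1: every iteration overwrites the same T
    for i in range(len(a)):
        a[i] = b[i]       # S2
    for i in range(len(a)):
        b[i] = shared_t   # S3: every iteration sees the last value written
    return a, b
```

With a shared temporary the lock-step schedule stores the same value into every element of B, which is the interference that privatization removes.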


Privatization
• Definition: A scalar variable x in a loop L is said to be privatizable if every path from the loop entry to a use of x inside the loop passes through a definition of x
• Privatizability can be stated as a data-flow problem:

  $$up(x) = use(x) \cup \Big( \neg def(x) \cap \bigcup_{y \in succ(x)} up(y) \Big)$$

  $$private(L) = \neg up(entry) \cap \Big( \bigcup_{y \in L} def(y) \Big)$$

• up(x): set of upward-exposed variables at the beginning of block x
• def(x): set of variables defined in block x
• use(x): set of variables that have upwards-exposed uses in block x
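The two equations can be evaluated with a simple fixed-point iteration. The sketch below is in Python and assumes a hypothetical representation: `blocks` maps a block name to its `(use, def)` sets and `succ` maps a block name to its successors inside the loop body. Set difference stands in for intersection with the complement.

```python
def upward_exposed(blocks, succ):
    """Iterate up(x) = use(x) | (union of up(y) over successors y, minus def(x))."""
    up = {name: set() for name in blocks}
    changed = True
    while changed:
        changed = False
        for name, (use, defs) in blocks.items():
            from_succ = set()
            for s in succ.get(name, []):
                from_succ |= up[s]
            new = set(use) | (from_succ - set(defs))
            if new != up[name]:
                up[name] = new
                changed = True
    return up

def private_vars(blocks, succ, entry):
    """private(L) = variables defined in L that are not upward exposed at entry."""
    up = upward_exposed(blocks, succ)
    all_defs = set()
    for _use, defs in blocks.values():
        all_defs |= set(defs)
    return all_defs - up[entry]

# Loop body that defines T before using it: T is privatizable.
swap_blocks = {"body": (set(), {"T"})}

# T is defined on only one branch and used after the join: not privatizable.
branch_blocks = {
    "entry": (set(), set()),
    "then": (set(), {"T"}),
    "else": (set(), set()),
    "join": ({"T"}, set()),
}
branch_succ = {"entry": ["then", "else"], "then": ["join"], "else": ["join"]}
```

In the second graph the path through `else` reaches the use of T without passing a definition, so T stays upward exposed at the loop entry.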


Privatization
• A variable x defined in a loop may be made private if and only if the SSA graph for the variable does not have a φ-function at the entry to the loop


Scalar Expansion

      DO I = 1, N
        T = A(I) + B(I)
        A(I-1) = T
      ENDDO

      PARALLEL DO I = 1, N
        T$(I) = A(I) + B(I)
      ENDDO
      PARALLEL DO I = 1, N
        A(I-1) = T$(I)
      ENDDO
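A small Python check of the expanded form, assuming 1-based indexing simulated with an unused slot 0 in `b` and hypothetical function names. The two expanded loops are run in arbitrary iteration orders, with the boundary between them playing the role of the barrier.

```python
def original(a, b, n):
    """Sequential loop with the scalar temporary T."""
    a = list(a)
    for i in range(1, n + 1):
        t = a[i] + b[i]
        a[i - 1] = t
    return a

def expanded(a, b, n, order1, order2):
    """Scalar T expanded into the array t; each loop may run in any order."""
    a = list(a)
    t = [None] * (n + 1)
    for i in order1:            # first PARALLEL DO
        t[i] = a[i] + b[i]
    # barrier: all reads of A finish before any element of A is overwritten
    for i in order2:            # second PARALLEL DO
        a[i - 1] = t[i]
    return a
```

The expanded version works because every read of A completes in the first loop, before the second loop overwrites any element.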


Array Privatization
• We sometimes need to privatize array variables
• For iteration J, upwards-exposed variables are those exposed due to the loop body without variables defined earlier

      DO I = 1, 100
S0      T(1) = X
L1      DO J = 2, N
S1        T(J) = T(J-1) + B(I,J)
S2        A(I,J) = T(J)
        ENDDO
      ENDDO

  $$up(L_1) = \bigcup_{J=2}^{N} \{T(J-1)\} - T(2:J)$$

• So for this fragment, T(1) is the only exposed variable


Array Privatization
• Using this analysis, we get the following code:

      PARALLEL DO I = 1,100
        PRIVATE t(N)
S0      t(1) = X
L1      DO J = 2, N
S1        t(J) = t(J-1) + B(I,J)
S2        A(I,J) = t(J)
        ENDDO
      ENDDO
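A sketch of the privatized nest in Python, with hypothetical names and 1-based indexing simulated by unused row and column 0. The outer iterations are executed in an arbitrary order, each with its own copy of the temporary array, and compared with the sequential version that shares one array.

```python
def shared_temp(x, b, rows, n):
    """Sequential reference: one temporary array t shared by all I iterations."""
    t = [None] * (n + 1)
    a = [[None] * (n + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        t[1] = x
        for j in range(2, n + 1):
            t[j] = t[j - 1] + b[i][j]
            a[i][j] = t[j]
    return a

def private_temp(x, b, rows, n, order):
    """Each I iteration allocates its own t, so the I loop can run in any order."""
    a = [[None] * (n + 1) for _ in range(rows + 1)]

    def body(i):
        t = [None] * (n + 1)      # corresponds to PRIVATE t(N)
        t[1] = x                  # T(1) is defined before the inner loop reads it
        for j in range(2, n + 1):
            t[j] = t[j - 1] + b[i][j]
            a[i][j] = t[j]

    for i in order:
        body(i)
    return a
```

Because T(1) is assigned at the top of every outer iteration, no element of the private array is read before that iteration has written it.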


Loop Distribution
• Loop distribution eliminates carried dependences
• Consequently, it often creates opportunity for outer-loop parallelism
• We must add extra barriers to keep dependent loops from executing out of order, so the overhead may override the parallel savings
• Attempt other transformations before attempting loop distribution


Loop Distribution

      DO I = 1, 100
        DO J = 1, 100
S1        A(I,J) = B(I,J) + C(I,J)
S2        D(I,J) = A(I,J-1) * 2.0
        ENDDO
      ENDDO

      DO I = 1, 100
        DO J = 1, 100
S1        A(I,J) = B(I,J) + C(I,J)
        ENDDO
        DO J = 1, 100
S2        D(I,J) = A(I,J-1) * 2.0
        ENDDO
      ENDDO

• Loop distribution converts a sequential loop into multiple parallel loops
• Synchronization barriers must be placed between the loops
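The distribution of the J loop can be checked on a single row. The Python sketch below uses hypothetical names, treats column 0 of A as existing input (`a0`), and runs each distributed loop in an arbitrary order with a barrier between them.

```python
def fused_row(b, c, a0, n):
    """Original inner loop for one row: S2 reads the A value S1 wrote one iteration earlier."""
    a = [a0] + [None] * n
    d = [None] * (n + 1)
    for j in range(1, n + 1):
        a[j] = b[j] + c[j]
        d[j] = a[j - 1] * 2
    return a, d

def distributed_row(b, c, a0, n, order1, order2):
    """Distributed form: two loops, each free of carried dependences."""
    a = [a0] + [None] * n
    d = [None] * (n + 1)
    for j in order1:            # all of S1
        a[j] = b[j] + c[j]
    # barrier: every A(I,J) is written before any S2 instance reads A(I,J-1)
    for j in order2:            # all of S2
        d[j] = a[j - 1] * 2
    return a, d
```

Dropping the barrier would let an S2 instance read an element of A that its S1 instance had not yet produced.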


Alignment
• Many carried dependences are due to array alignment issues
• If we can align all references, then dependences would go away, and parallelism is possible

      DO I = 2, N
        A(I) = B(I) + C(I)
        D(I) = A(I-1) * 2.0
      ENDDO

      DO i = 1, N
        IF (i .GT. 1) A(i) = B(i) + C(i)
        IF (i .LT. N) D(i+1) = A(i) * 2.0
      ENDDO
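A Python check of guarded alignment for this loop. The names are hypothetical, arrays are 1-based with an unused slot 0, and the guards are chosen so that exactly A(2..N) and D(2..N) are written, as in the original loop running from 2 to N. The aligned loop is executed in an arbitrary iteration order.

```python
def original(a, b, c, d, n):
    """DO I = 2, N: D(I) uses the A value written in the previous iteration."""
    a, d = list(a), list(d)
    for i in range(2, n + 1):
        a[i] = b[i] + c[i]
        d[i] = a[i - 1] * 2
    return a, d

def aligned(a, b, c, d, n, order):
    """Aligned loop over 1..N; the producer and consumer of A(i) share an iteration."""
    a, d = list(a), list(d)
    for i in order:
        if i > 1:
            a[i] = b[i] + c[i]      # first statement, skipped in the extra iteration
        if i < n:
            d[i + 1] = a[i] * 2     # second statement, shifted by one iteration
    return a, d
```

Iteration 1 only runs the shifted second statement (reading the untouched A(1)) and iteration N only runs the first, which is the cost of the extra iteration and the tests.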


Alignment
• Loop alignment is carried out by increasing the number of loop iterations
• The loop-carried dependence becomes a loop-independent dependence
• There is overhead:
  - Executing the loop more times
  - Executing the tests every time through the loop


Alignment
• This overhead can be reduced by executing the last iteration of the first statement with the first iteration of the second statement

      DO i = 2, N
        j = i - 1; IF (i .EQ. 2) j = N
        A(j) = B(j) + C(j)
        D(i) = A(i-1) * 2.0
      ENDDO


Alignment
• There are other ways to align the loop:
  - We can peel off the first and last iterations of the loop

      DO I = 2, N
        A(I) = B(I) + C(I)
        D(I) = A(I-1) * 2.0
      ENDDO

      D(2) = A(1) * 2.0
      DO I = 2, N-1
        A(I) = B(I) + C(I)
        D(I+1) = A(I) * 2.0
      ENDDO
      A(N) = B(N) + C(N)
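Both lower-overhead variants (wrapping the last instance of the first statement into the first iteration, and peeling) can be compared against the original loop. This Python sketch uses hypothetical names and 1-based arrays with an unused slot 0; the loop parts are run in arbitrary orders.

```python
def original(a, b, c, d, n):
    a, d = list(a), list(d)
    for i in range(2, n + 1):
        a[i] = b[i] + c[i]
        d[i] = a[i - 1] * 2
    return a, d

def wrapped(a, b, c, d, n, order):
    """Iteration i = 2 computes A(N); every other iteration computes A(i-1)."""
    a, d = list(a), list(d)
    for i in order:                       # any permutation of 2..N
        j = n if i == 2 else i - 1
        a[j] = b[j] + c[j]
        d[i] = a[i - 1] * 2
    return a, d

def peeled(a, b, c, d, n, order):
    """First D instance and last A instance are peeled out of the loop."""
    a, d = list(a), list(d)
    d[2] = a[1] * 2
    for i in order:                       # any permutation of 2..N-1
        a[i] = b[i] + c[i]
        d[i + 1] = a[i] * 2
    a[n] = b[n] + c[n]
    return a, d
```

Neither variant needs an extra iteration, and the peeled form also removes the per-iteration tests.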


Alignment
• Is it always possible to align loops to eliminate loop-carried dependences?

      DO I = 1, N
        A(I) = B(I) + C
        B(I+1) = A(I) + D
      ENDDO

      DO I = 1, N+1
        IF (I .NE. 1) B(I) = A(I-1) + D
        IF (I .NE. N+1) A(I) = B(I) + C
      ENDDO

• Although B is now aligned, the references to A are misaligned, creating a new carried dependence


Code Replication
• If an array is involved in a recurrence, then alignment isn’t possible
• If two dependences between the same statements have different dependence distances, then alignment doesn’t work
• We can fix the second case by replicating code:

      DO I = 1, N
        A(I+1) = B(I) + C
        X(I) = A(I+1) + A(I)
      ENDDO

      DO I = 1, N
        A(I+1) = B(I) + C
        ! Replicated Statement
        IF (I .EQ. 1) THEN
          t = A(I)
        ELSE
          t = B(I-1) + C
        END IF
        X(I) = A(I+1) + t
      ENDDO
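A Python check that the replicated form no longer depends on iteration order. The names are hypothetical and the arrays are 1-based with an unused slot 0. The replicated statement recomputes the value that the previous iteration would have stored in A(I), so no iteration reads an element of A written by another one.

```python
def original(a, b, c, n):
    """X(I) reads A(I), which iteration I-1 wrote: a carried dependence."""
    a = list(a)
    x = [None] * (n + 1)
    for i in range(1, n + 1):
        a[i + 1] = b[i] + c
        x[i] = a[i + 1] + a[i]
    return a, x

def replicated(a, b, c, n, order):
    """The value of A(I) is recomputed locally instead of being read from A."""
    a = list(a)
    x = [None] * (n + 1)
    for i in order:                       # any permutation of 1..N
        a[i + 1] = b[i] + c
        if i == 1:
            t = a[i]                      # A(1) is never written inside the loop
        else:
            t = b[i - 1] + c              # replica of the statement from iteration I-1
        x[i] = a[i + 1] + t
    return a, x
```

The price is one extra evaluation of `B(I-1) + C` per iteration in exchange for a loop with no carried dependence.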


Alignment
• Theorem: Alignment, replication, and statement reordering are sufficient to eliminate all carried dependences in a single loop containing no recurrence, and in which the distance of each dependence is a constant independent of the loop index
• We can establish this constructively
• Let G = (V,E) be a weighted graph. V represents statements and E represents dependences, where each edge e = (v1,v2) is labeled with the dependence distance d(e) between v1 and v2. Let o: V → Z give the offset of vertices
• G is said to be carry free if o(v1) + d(e) = o(v2) for every edge e = (v1,v2) in E


Alignment

      DO I = 1, N
S1      A(I+2) = B(I) + C
S2      X(I+1) = A(I) + D
S3      Y(I) = A(I+1) + X(I)
      ENDDO
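The carry-free condition can be tested mechanically for this loop. The Python sketch below is an illustration with hypothetical names; the edge list is read off the loop (S1 to S2 on A with distance 2, S1 to S3 on A with distance 1, S2 to S3 on X with distance 1). A brute-force search over small offsets stands in for the constructive procedure.

```python
from itertools import product

def is_carry_free(offsets, edges):
    """True when o(v1) + d(e) == o(v2) holds for every edge (v1, v2, d)."""
    return all(offsets[src] + dist == offsets[dst] for src, dst, dist in edges)

def find_offsets(nodes, edges, span=4):
    """Search offsets in [-span, span] with the first node pinned at 0."""
    first, rest = nodes[0], nodes[1:]
    for values in product(range(-span, span + 1), repeat=len(rest)):
        offsets = {first: 0, **dict(zip(rest, values))}
        if is_carry_free(offsets, edges):
            return offsets
    return None

# Dependences of the three-statement loop.
edges = [("S1", "S2", 2), ("S1", "S3", 1), ("S2", "S3", 1)]

# After replicating S1 as S1' and redirecting the S1 -> S3 edge to it.
edges_replicated = [("S1", "S2", 2), ("S1'", "S3", 1), ("S2", "S3", 1)]
```

The original graph has no solution because the two paths from S1 to S3 demand offsets that differ by 1 and by 3; once S3 is fed by a replica of S1, every edge can be satisfied.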


Alignment Procedure

procedure Align(V,E,µ,o)
  While V is not empty
    remove element v from V
    for each (w,v) ∈ E
      if w ∈ V
        W ← W ∪ {w}
        o(w) ← o(v) - µ(w,v)
      else if o(w) != o(v) - µ(w,v)
        create vertex w’
        replace (w,v) with (w’,v)
        replicate all edges into w onto w’
        W ← W ∪ {w’}
        o(w’) ← o(v) - µ(w,v)
    for each (v,w) ∈ E
      if w ∈ V
        W ← W ∪ {w}
        o(w) ← o(v) + µ(v,w)
      else if o(w) != o(v) + µ(v,w)
        create vertex v’
        replace (v,w) with (v’,w)
        replicate edges into v onto v’
        W ← W ∪ {v’}
        o(v’) ← o(w) - µ(v,w)
end Align


Alignment
• If we use the code generation given in the book, we get the code shown

      DO I = 1, N+3
S1      IF (I .GE. 4) A(I-1) = B(I-3) + C
S1’     IF (I .GE. 2 .AND. I .LE. N+1) THEN
          t = B(I-1) + C
        ELSE
          t = A(I+1)
        ENDIF
S2      IF (I .GE. 2 .AND. I .LE. N+1) THEN
          X(I) = A(I-1) + D
        ENDIF
S3      IF (I .LE. N) Y(I) = t + X(I)
      ENDDO
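The generated loop can be checked against the original three-statement loop in Python. This is a sketch under assumptions: the names are hypothetical, arrays are 1-based with an unused slot 0, `t` is treated as private to each iteration, and A is allocated four elements past N so that the ELSE branch's read of A(I+1) stays in bounds for the trailing iterations.

```python
def original(a, b, x, c, d, n):
    a, x = list(a), list(x)
    y = [None] * (n + 1)
    for i in range(1, n + 1):
        a[i + 2] = b[i] + c            # S1
        x[i + 1] = a[i] + d            # S2
        y[i] = a[i + 1] + x[i]         # S3
    return a, x, y

def aligned(a, b, x, c, d, n, order):
    a, x = list(a), list(x)
    y = [None] * (n + 1)
    for i in order:                    # any permutation of 1..N+3
        if i >= 4:
            a[i - 1] = b[i - 3] + c    # S1, delayed by three iterations
        if 2 <= i <= n + 1:
            t = b[i - 1] + c           # S1': replica feeding S3
        else:
            t = a[i + 1]               # value of A that the loop never writes
        if 2 <= i <= n + 1:
            x[i] = a[i - 1] + d        # S2, delayed by one iteration
        if i <= n:
            y[i] = t + x[i]            # S3
    return a, x, y
```

Every value read from A or X inside an iteration is either written earlier in the same iteration or never written by the loop, so the iterations can run in any order.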


Loop Fusion
• Loop distribution is a method for separating parallel parts of a loop
• Our solution attempted to find the maximal loop distribution
• The maximal distribution often finds parallelizable components too small for efficient parallelizing
• Two obvious solutions:
  - Strip mine large loops to create larger granularity
  - Perform maximal distribution, and fuse together parallelizable loops


Loop Fusion

      DO I = 1, N
S1      A(I) = B(I) + 1
S2      C(I) = A(I) + C(I-1)
S3      D(I) = A(I) + X
      ENDDO

After maximal distribution:

L1    DO I = 1, N
        A(I) = B(I) + 1
      ENDDO
L2    DO I = 1, N
        C(I) = A(I) + C(I-1)
      ENDDO
L3    DO I = 1, N
        D(I) = A(I) + X
      ENDDO

After fusing the parallel loops L1 and L3:

L1    PARALLEL DO I = 1, N
        A(I) = B(I) + 1
L3      D(I) = A(I) + X
      ENDDO
L2    DO I = 1, N
        C(I) = A(I) + C(I-1)
      ENDDO


Fusion Safety
• Definition: A loop-independent dependence between statements S1 and S2 in loops L1 and L2 respectively is fusion-preventing if fusing L1 and L2 causes the dependence to be carried by the combined loop in the opposite direction.

      DO I = 1, N
S1      A(I) = B(I) + C
      ENDDO
      DO I = 1, N
S2      D(I) = A(I+1) + E
      ENDDO

      DO I = 1, N
S1      A(I) = B(I) + C
S2      D(I) = A(I+1) + E
      ENDDO

• Notice the loop-independent dependence has become a backward loop-carried antidependence
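The effect of a fusion-preventing dependence shows up directly when both versions are run. A minimal Python sketch with hypothetical names and 1-based arrays (slot 0 unused, A one element longer than N):

```python
def separate(a, b, c, e, n):
    """Two loops: S2 sees the values of A written by the whole first loop."""
    a = list(a)
    d = [None] * (n + 1)
    for i in range(1, n + 1):
        a[i] = b[i] + c
    for i in range(1, n + 1):
        d[i] = a[i + 1] + e
    return a, d

def fused(a, b, c, e, n):
    """Fused loop: S2 reads A(I+1) before iteration I+1 has written it."""
    a = list(a)
    d = [None] * (n + 1)
    for i in range(1, n + 1):
        a[i] = b[i] + c
        d[i] = a[i + 1] + e
    return a, d
```

A ends up identical in both versions, but D differs because the fused loop reads stale elements of A, so this fusion is not a legal transformation.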


Fusion Safety
• We shouldn’t fuse loops if the fusing will violate ordering of the dependence graph
• Ordering Constraint: Two loops can’t be validly fused if there exists a path of loop-independent dependences between them containing a loop or statement not being fused with them

      DO I = 1, N
S1      A(I) = B(I) + 1
S2      C(I) = A(I) + C(I-1)
S3      D(I) = A(I) + C(I)
      ENDDO

• After distribution into loops L1, L2, and L3 (one per statement), fusing L1 with L3 violates the ordering constraint: {L1,L3} must occur both before and after the node L2.


Fusion Profitability
• Parallel loops should generally not be merged with sequential loops
• Definition: An edge between two statements in loops L1 and L2 respectively is said to be parallelism-inhibiting if after merging L1 and L2, the dependence is carried by the combined loop

      DO I = 1, N
S1      A(I+1) = B(I) + C
      ENDDO
      DO I = 1, N
S2      D(I) = A(I) + E
      ENDDO

      DO I = 1, N
S1      A(I+1) = B(I) + C
S2      D(I) = A(I) + E
      ENDDO
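Unlike a fusion-preventing edge, a parallelism-inhibiting edge leaves the fused loop correct but sequential. The Python sketch below (hypothetical names, 1-based arrays with slot 0 unused) runs the fused loop in forward and in reverse iteration order to show the order dependence.

```python
def separate(a, b, c, e, n):
    """Two loops, each of which could run in parallel on its own."""
    a = list(a)
    d = [None] * (n + 1)
    for i in range(1, n + 1):
        a[i + 1] = b[i] + c
    for i in range(1, n + 1):
        d[i] = a[i] + e
    return a, d

def fused(a, b, c, e, n, order):
    """Fused loop: D(I) needs the A(I) written by iteration I-1."""
    a = list(a)
    d = [None] * (n + 1)
    for i in order:
        a[i + 1] = b[i] + c
        d[i] = a[i] + e
    return a, d
```

Forward order reproduces the two-loop result, while any order that runs iteration I before iteration I-1 does not, so the fused loop can no longer be a PARALLEL DO.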


Typed Fusion
• We start off by classifying loops into two types: parallel and sequential
• We next gather together all edges that inhibit efficient fusion, and call them bad edges
• Given a loop dependence graph (V,E), we want to obtain a graph (V’,E’) by merging vertices of V subject to the following constraints:
  - Bad Edge Constraint: vertices joined by a bad edge aren’t fused
  - Ordering Constraint: vertices joined by a path containing a non-parallel vertex aren’t fused


Typed Fusion Procedure

procedure TypedFusion(G,T,type,B,t0)
  Initialize all variables to zero
  Set count[n] to be the in-degree of node n
  Initialize W with all nodes with in-degree zero
  while W isn’t empty
    remove element n with type t from W
    if t = t0
      if maxBadPrev[n] = 0 then p ← fused
      else p ← next[maxBadPrev[n]]
      if p != 0 then
        x ← node[p]
        num[n] ← num[x]
        update_successors(n,t)
        fuse x and n and call the result n
      else
        create_new_fused_node(n)
        update_successors(n,t)
    else
      create_new_node(n)
      update_successors(n,t)
end TypedFusion
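A simplified Python sketch of the same greedy idea, not the exact bookkeeping of the procedure above. Assumptions: only fused nodes of the selected type `t0` are numbered (1, 2, 3, ...), so "the next fused node after maxBadPrev" is simply group number maxBadPrev + 1; all names are hypothetical. Nodes are visited in topological order, and each `t0` node joins the earliest group that comes after every group it is forbidden to join.

```python
from collections import deque

def typed_fusion(nodes, edges, node_type, bad_edges, t0):
    """Return the groups of type-t0 nodes that end up fused, in creation order."""
    succs = {n: [] for n in nodes}
    indeg = {n: 0 for n in nodes}
    for u, v in edges:
        succs[u].append(v)
        indeg[v] += 1

    max_bad_prev = {n: 0 for n in nodes}  # highest group number n may not join
    groups = []                            # groups[k - 1] is fused node number k
    group_of = {}
    work = deque(n for n in nodes if indeg[n] == 0)

    while work:
        n = work.popleft()
        if node_type[n] == t0:
            k = max_bad_prev[n] + 1        # first group after the forbidden ones
            if k > len(groups):
                groups.append([])          # no such group yet: create a new one
            groups[k - 1].append(n)
            group_of[n] = k
        for m in succs[n]:
            if node_type[n] == t0 and (node_type[m] != t0 or (n, m) in bad_edges):
                bound = group_of[n]        # m must stay out of n's group
            else:
                bound = max_bad_prev[n]    # m inherits n's restriction
            max_bad_prev[m] = max(max_bad_prev[m], bound)
            indeg[m] -= 1
            if indeg[m] == 0:
                work.append(m)
    return groups
```

For the distributed loops L1 (parallel), L2 (sequential), L3 (parallel) with edges L1→L2, L2→L3, and L1→L3, the path through L2 keeps L1 and L3 apart; without the L2→L3 edge they fuse.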


Typed Fusion Example

[Figure: three panels showing the original loop graph (nodes 1 to 8), the graph after fusing parallel loops, and the graph after fusing sequential loops. Nodes are annotated (maxBadPrev, p) → num. Fused node labels that survive in the extracted text: 1,3; 2,4,6; 5,8.]


Cohort Fusion
• We wish to not only parallelize loops but also run two sequential loops in parallel or a sequential loop in parallel with a parallel loop
• Find all the loops that can be run in parallel with one another, either by fusion or because they have no dependence between them
• This model combines some of the features of data and task parallelism
• The principal idea behind the algorithm is to identify at each stage a cohort, which is a collection of loops that can be run in parallel


Cohort Fusion
• Given an outer loop containing some number of inner loops, we want to be able to run some inner loops in parallel
• We can do this as follows:
  - Run TypedFusion with B = {fusion-preventing edges, parallelism-inhibiting edges, and edges between a parallel loop and a sequential loop}
  - Put a barrier at the end of each identified cohort
  - Run TypedFusion again to fuse the parallel loops in each cohort


We’ve Talked About …
• Single loop methods
• Privatization
• Loop distribution
• Alignment
• Loop Fusion


Loop Interchange
• Moves dependence-free loops to outermost level
  - Theorem: In a perfect nest of loops, a particular loop can be parallelized at the outermost level if and only if the column of the direction matrix for that nest contains only '=' entries
• Vectorization moves loops to innermost level
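The theorem translates into a column test on the direction matrix. In this Python sketch the representation is an assumption: one row per dependence, one column per loop (outermost first), with entries '<', '=', or '>'. The function name is hypothetical.

```python
def outermost_parallel_loops(direction_matrix, depth):
    """Return the loop positions whose column contains only '=' entries."""
    return [col for col in range(depth)
            if all(row[col] == "=" for row in direction_matrix)]

# A(I+1, J) = A(I, J) + B(I, J) in a nest ordered (I, J) has the single
# direction vector ('<', '='): the J column is all '=', the I column is not.
single_dependence = [("<", "=")]
```

For that nest the result is `[1]`, meaning the J loop can be moved to the outermost position and run in parallel while the I loop stays sequential.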


Loop Interchange

      DO I = 1, N
        DO J = 1, N
          A(I+1, J) = A(I, J) + B(I, J)
        ENDDO
      ENDDO

• Outer loop (I) carries a dependence
• OK for vectorization
• Problematic for parallelization


Loop Interchange

      PARALLEL DO J = 1, N
        DO I = 1, N
          A(I+1, J) = A(I, J) + B(I, J)
        ENDDO
      END PARALLEL DO
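A Python check that the interchanged nest tolerates any order of the J iterations. The names are hypothetical and the arrays are 1-based with row and column 0 unused; A has one row more than N because the loop writes row I+1.

```python
def original(a, b, n):
    """I outer, J inner: the I loop carries the dependence on A."""
    a = [row[:] for row in a]
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            a[i + 1][j] = a[i][j] + b[i][j]
    return a

def interchanged(a, b, n, j_order):
    """J outer (any order), I inner and still sequential."""
    a = [row[:] for row in a]
    for j in j_order:                  # PARALLEL DO J
        for i in range(1, n + 1):      # the recurrence runs down column j only
            a[i + 1][j] = a[i][j] + b[i][j]
    return a
```

Each column of A is an independent recurrence, so distributing whole columns to processors gives coarse-grained parallelism with a single barrier at the end of the nest.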


Loop Interchange •

Suppose we have the following loop nest:

• •

The direction vector is (