2. Parallel Random Access Machine (PRAM)

Massivparallele Modelle und Architekturen (Massively Parallel Models and Architectures)
R. Hoffmann, FB Informatik, TU Darmstadt, WS 10/11

Contents
 Metrics (recap)
 Optimal algorithms
 Classification and types
 Algorithms
  • Initialize
  • Broadcasting
  • All-to-All Broadcasting
  • Reduction
  • Sum
  • Pointer Jumping
  • Naive Sorting
  • Prefix Sum
 Accelerated Cascading
 Simulations between PRAM Models

Sources
 J. JaJa: An Introduction to Parallel Algorithms, 1992
 Robert van Engelen: The PRAM Model and Algorithms (slides), Advanced Topics, Spring 2008
 Behrooz Parhami: Introduction to Parallel Processing: Algorithms and Architectures, Plenum Series, Kluwer, 2002 (and slides)
 Arvind Krishnamurthy: PRAM Algorithms (slides), Fall 2004
 Keller, Kessler, Träff: Practical PRAM Programming, Wiley Interscience, New York, 2000

Metrics (recap)

 p                         processors provided
 q(t)                      degree of parallelism: processors in use at time t;
                           plotted over all t this gives the parallelism profile
 P(n)                      processors required by a given algorithm as a function
                           of the problem size n (see PRAM)
 T(p)                      total number of time steps
 S(p) = T(p=1)/T(p)        speedup
 E = S/p = T(1)/Cost(p)    efficiency
 Work(p) = Σt q(t)         work performed
 Cost(p) = p · T(p)        cost with p provided processors
 I(p) = Work(p)/T(p)       parallelism index: average degree of parallelism
 Utilization = Work/Cost   average utilization of the processors
 R(p) = Work(p)/Work(1)    overhead introduced by parallel processing with p processors
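The metric definitions above translate directly into code. Below is a minimal sketch (not from the slides; the function name metrics and the list encoding of the profile are illustrative assumptions) that derives the recap metrics from a parallelism profile q(t):

# Hypothetical helper, not from the slides: derive the recap metrics from a
# parallelism profile q, where q[t] = processors busy at time step t.
def metrics(q, p, t1):
    """q: parallelism profile; p: processors provided; t1: sequential time T(1)."""
    T = len(q)                      # T(p): total number of time steps
    work = sum(q)                   # Work(p) = sum over t of q(t)
    cost = p * T                    # Cost(p) = p * T(p)
    return {
        "S(p)": t1 / T,             # speedup S(p) = T(1)/T(p)
        "E": t1 / cost,             # efficiency E = T(1)/Cost(p)
        "I(p)": work / T,           # parallelism index = average parallelism
        "utilization": work / cost, # Work/Cost
        "R(p)": work / t1,          # overhead, taking Work(1) = T(1)
    }

# Example: tree reduction of 8 values with p = 4 (see "Sum on the EREW PRAM (1)"):
print(metrics(q=[4, 2, 1], p=4, t1=7))  # Work = 7, Cost = 12, Utilization = 7/12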

PRAM: Cost, Work

 Notation also used in connection with the PRAM algorithms:
   Cost(n, P(n)) = P(n) · T(n)
   Work(n, P(n)) = work performed (processors/operations × time)
 Both are functions of the problem size n and of the number P(n) of processors
  required by the PRAM algorithm.

Optimal Algorithms

 Sequential algorithm
   Time-optimal: a sequential algorithm whose (sequential) time complexity
    O(Tseq(n)) cannot be improved is called time-optimal.

 Parallel algorithm
   (Work-)optimal: the work complexity lies within the time complexity of the
    best sequential algorithm, i.e. the number of operations of the parallel
    algorithm is asymptotically equal to the number of operations of the
    sequential algorithm: Work(n) = O(Tseq(n)).
    [An even stricter definition: the total number of operations in the parallel
    algorithm is equal to the number of operations in a sequential algorithm.]
   Work-time-optimal (WT-optimal, strictly optimal): the number of parallel
    steps Tpar(n) cannot be reduced any further by any other work-optimal
    algorithm. (There is no faster optimal algorithm.)
   Cost-optimal: Cost(n) = O(Tseq(n))

Parallel Random Access Machine (PRAM) Model

 PRAM removes algorithmic details concerning synchronization and communication,
  allowing the algorithm designer to focus on problem properties
 A PRAM algorithm describes explicitly the operations performed at each time unit
 PRAM design paradigms have turned out to be robust and have been mapped
  efficiently onto many other parallel models and even network models

The PRAM Model of Parallel Computation

 Natural extension of the RAM 1): each processor is a RAM
 Processors operate synchronously
 Earliest and best-known model of parallel computation

[Figure: p processors P1, P2, P3, ..., Pp, each with a private memory,
 connected to a shared memory with m locations]

 All processors operate synchronously, executing load, store, and operations on data

1) Random Access Machine: any memory cell can be accessed by an instruction
   (not to be confused with RAM = Random Access Memory)

Synchronous PRAM

 All processors execute the same program; the model is therefore also called
  SPMD (Single Program Multiple Data)
 All processors execute the same PRAM step instruction stream in lock-step
 The effect of an operation depends on
   the individual or common data which processor(i) accesses
   the processor index i
   local data
 Instructions can be selectively disabled (if-then-else flow)
 Synchronous PRAM is an SIMD-style model

 Asynchronous PRAM
   several competing models
   no lock-step

PRAM Instruction Set

A typical PRAM instruction set looks as follows (L = local memory, G = global shared memory):

(1) Constants:              L[x] := constant, L[x] := input size, L[x] := processor number
(2) Write:                  G[L[x]] := L[y]
(3) Read:                   L[x] := G[L[y]]
(4) Local assignment:       L[x] := L[y]
(5) Conditional assignment: if L[x] > 0 then assignment
(6) Operations:             L[x] := L[y] op L[z]
(7) Jumps:                  goto label y; if L[x] > 0 goto label y
(8) Miscellaneous:          if L[x] > 0 then Halt; if L[y] > 0 then NoOperation

Classification of PRAM Models

 A PRAM step ("clock cycle") consists of three phases:
  1. Read: each processor may read a value from shared memory
  2. Compute: each processor may perform operations on local data
  3. Write: each processor may write a value to shared memory

 The model is refined by its concurrent read/write capability:
   Exclusive Read Exclusive Write (EREW)
   Concurrent Read Exclusive Write (CREW)
   Concurrent Read Concurrent Write (CRCW)
   Concurrent Read Owner Write (CROW)

Types of CRCW PRAM

The CRCW variants differ in the result of a concurrent write:

Common             all processors must write the same value
Arbitrary, Random  one of the processors succeeds in writing
Priority           the processor with the lowest index succeeds
Undefined          the value written is undefined
Ignore             no new value is written in case of conflict
Detecting          a special "collision detection" code is written
Max/Min            the largest / smallest of the values is written
Reduction          the SUM, AND, OR, XOR or another function of the values is written
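To make these write rules concrete, here is a minimal sketch (not from the slides; the function and mode names are illustrative assumptions) of resolving a concurrent write to one shared cell under a few of the variants:

# Hypothetical illustration, not from the slides: resolve one concurrent write.
# `writes` is a list of (processor_index, value) pairs targeting the same cell.
def crcw_write(writes, mode, old=None):
    if mode == "Common":
        values = {v for _, v in writes}
        assert len(values) == 1, "Common CRCW: all writers must agree"
        return values.pop()
    if mode == "Priority":
        return min(writes)[1]               # lowest processor index succeeds
    if mode == "Arbitrary":
        return writes[0][1]                 # any single write may win
    if mode == "Max":
        return max(v for _, v in writes)    # largest value is written
    if mode == "Sum":                       # one of the reduction variants
        return sum(v for _, v in writes)
    if mode == "Ignore":
        return old if len(writes) > 1 else writes[0][1]
    raise ValueError("unknown mode: " + mode)

print(crcw_write([(2, 5), (0, 7), (1, 3)], "Priority"))  # 7: processor 0 wins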

Comparison of PRAM Models

 A model B is more powerful than a model A if either
   the time complexity for solving a problem is asymptotically lower in model B than in A, or
   the time complexity is the same and the work complexity is asymptotically lower in model B than in A

 From weakest to strongest:
   EREW
   CREW
   Common CRCW
   Arbitrary CRCW
   Priority CRCW

Initialize

Initializing an n-vector M (with base address B) to all 0s with P < n processors:

for t = 0 to n/P - 1 do                 // segments
  parallel [i = 0 .. P-1]               // processors
    if (t*P + i < n) then M[B + t*P + i]  0
  endparallel
endfor

[Figure: vector of n cells, indices 0 .. n-1; the P processors clear one
 segment of width P per step, for t = 0, t = 1, ...]

QUESTION 1): Work = ?   Cost = ?
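A minimal executable sketch of the scheme, with a plain Python loop standing in for the lock-step parallel construct (the inner loop over i is conceptually one concurrent step):

# Sketch only: initialize an n-vector with P < n processors, one width-P
# segment per time step; the guard handles the last, possibly partial segment.
def initialize(n, P):
    M = [None] * n
    for t in range((n + P - 1) // P):   # ceil(n/P) segments
        for i in range(P):              # conceptually concurrent
            if t * P + i < n:
                M[t * P + i] = 0
    return M

print(initialize(n=10, P=4))  # T = 3 steps, Work = 10, Cost = 3 * 4 = 12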

Data Broadcasting

Making P copies of B[0]:

// variant 1: the active processors write
for t = 0 to log2 P - 1 do
  parallel [i = 0 .. 2^t - 1]           // active
    B[i + 2^t]  B[i]                   // proc(i) writes
  endparallel
endfor

// variant 2: the active processors read
for t = 0 to log2 P - 1 do
  parallel [i = 2^t .. 2^(t+1) - 1]     // active
    B[i]  B[i - 2^t]                   // proc(i) reads
  endparallel
endfor

[Figure: array B with cells 0 .. 11; in step t the filled prefix of length 2^t
 is copied next to itself]

EREW PRAM data broadcasting without redundant copying.  T = ?
If the array bound is not a power of 2, extend the algorithms by a guard condition.
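A minimal sketch of the write variant (plain Python standing in for the PRAM pseudocode; the guard for array bounds that are not a power of 2 is included, as suggested above):

# Sketch only: EREW broadcast of one value to P cells by recursive doubling.
def broadcast(value, P):
    B = [None] * P
    B[0] = value
    t = 1                                # cells filled so far
    while t < P:
        # one lock-step round: cells 0 .. t-1 are copied next to themselves
        for i in range(min(t, P - t)):   # guard: bound may not be a power of 2
            B[i + t] = B[i]
        t *= 2
    return B

print(broadcast(42, 12))  # 12 copies after ceil(log2(12)) = 4 rounds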

Data Broadcasting, Data Parallel Formulation

parallel [i = 0 .. 2^t - 1]
  B[i + 2^t]  B[i]
endparallel

corresponds to the data-parallel notation

I  [0 .. 2^t - 1]       // I = index vector
B[I + 2^t]  B[I]

which in turn corresponds to the slice assignment

B[2^t .. 2^(t+1) - 1]  B[0 .. 2^t - 1]

[Figure: array B with cells 0 .. 11, as on the previous slide]

All-to-All Broadcasting on EREW PRAM

parallel [i = 0 .. P-1]
  write own data value into B[i]
endparallel
for t = 1 to P - 1 do
  parallel [i = 0 .. P-1]
    read the data value from B[(i + t) mod P]
  endparallel
endfor

In step t, each processor i reads the value of its neighbor i + t.
This O(P)-step algorithm is time-optimal.
Question 2): Why?
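A minimal sketch (again plain Python in place of the pseudocode). Note that in round t all P read addresses (i + t) mod P are distinct, so every read is exclusive, as EREW requires:

# Sketch only: all-to-all broadcast in P-1 rounds on an EREW PRAM.
def all_to_all(values):
    P = len(values)
    B = values[:]                            # each processor writes its own value
    received = [[B[i]] for i in range(P)]    # local copies held by each processor
    for t in range(1, P):                    # P-1 lock-step rounds
        for i in range(P):                   # conceptually concurrent
            received[i].append(B[(i + t) % P])
    return received

print(all_to_all([10, 20, 30]))  # every processor ends up with all three values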

Reduction on the EREW PRAM

 Reduce P values on the P-processor EREW PRAM in O(log P) time
 The reduction algorithm uses exclusive reads and writes
 The algorithm is the basis of other EREW algorithms

[Figure: binary reduction tree over the memory cells; the data flows from the
 leaves to the root, with the processors combining pairs of cells in each step]

Sum on the EREW PRAM (1)

Sum of n values using n/2 processors.
Input: A[1..n], n = 2^k.  Output: s.

parallel [i = 1 .. P]
  B[i]  A[i]
endparallel
for t = 1 to log n do
  parallel [i = 1 .. n/2]
    if i ≤ n/2^t then B[i]  B[2i-1] + B[2i]
    if i = 1 then s  B[i]
  endparallel
endfor

Cost = (n/2) log n
Work = n - 1
Utilization = Work/Cost = (n-1)/((n/2) log n) = O(1/log n)
E = T(1)/Cost(p) = (n-1)/((n/2) log n) = O(1/log n)

[Figure: reduction tree over B[1 .. 8] using 4 processors; after log n = 3
 steps the sum s is in B[1]]

Question 3) for the example: Cost, Work, Utilization?
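A minimal sketch of the reduction (the slice assignment models one lock-step round of the n/2 processors; n is assumed to be a power of 2):

# Sketch only: sum of n = 2^k values with n/2 processors, O(log n) steps.
import math

def pram_sum(A):
    n = len(A)                        # assumes n = 2^k
    B = A[:]                          # B[i] <- A[i]
    for t in range(1, int(math.log2(n)) + 1):
        active = n >> t               # processors 1 .. n/2^t stay active
        # one lock-step round: processor i combines B[2i-1] and B[2i] (1-based)
        B[:active] = [B[2 * i] + B[2 * i + 1] for i in range(active)]
    return B[0]                       # s

print(pram_sum([1, 2, 3, 4, 5, 6, 7, 8]))  # 36; T = 3, Work = 7, Cost = 4 * 3 = 12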

Sum on the EREW PRAM (2)

Sum of n values using n/2 .. 1 processors.
Input: A[1..n], n = 2^k.  Output: s.
Here the number of processors is managed dynamically (this is not standard PRAM!).

parallel [i = 1 .. P]
  B[i]  A[i]
endparallel
for t = 1 to log n do
  dynparallel [i = 1 .. n/2^t]   // dynamic activity
    B[i] := B[2i-1] + B[2i]
  endparallel
endfor
s := B[1]

Cost = n - 1
Work = n - 1
Utilization = 1

[Figure: the same reduction tree, but the number of active processors halves
 in each step]

Pointer Jumping

 Finding the roots of a forest using pointer jumping

Pointer Jumping on the CREW PRAM

Input: a forest of trees, each with a self-loop at its root, consisting of
arcs (i, P(i)) and nodes i, where 1 ≤ i ≤ n.
Output: for each node i, the root S[i] of the tree containing i.

parallel [i = 1 .. n]
  S[i]  P[i]
endparallel
for h = 1 to log n do   // while there exists a node such that S[i] ≠ S[S[i]] do
  parallel [i = 1 .. n]
    if S[i] ≠ S[S[i]] then S[i]  S[S[i]]
  endparallel
endfor

T(n) = O(log h), with h the maximum height of the trees
Cost(n) = O(n log h)
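A minimal sketch with 0-based indices (roots are modeled as self-loops P[i] == i; building a new list per round models the synchronous update of all S[i]):

# Sketch only: pointer jumping. P[i] is the parent of node i; roots point to themselves.
def find_roots(P):
    n = len(P)
    S = P[:]                                   # S[i] <- P[i]
    # each round doubles the distance every pointer has jumped: O(log h) rounds
    while any(S[i] != S[S[i]] for i in range(n)):
        S = [S[S[i]] for i in range(n)]        # one lock-step round: S[i] <- S[S[i]]
    return S

# forest over nodes 0..7: two trees rooted at 3 and 5
print(find_roots([3, 0, 1, 3, 5, 5, 4, 6]))    # [3, 3, 3, 3, 5, 5, 5, 5]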

Example

[Figure: forest over the nodes 1 .. 8; in each pointer-jumping step the
 pointers are forwarded ("transported") toward the roots, until every node
 points directly to the root of its tree]

Naive EREW PRAM Sorting Algorithm (all read from all)

P = n values in S; the target position is computed in R.

parallel [i = 0 .. P-1]
  R[i]  0
endparallel
for t = 1 to P - 1 do
  parallel [i = 0 .. P-1]
    neighbor := (i + t) mod P
    if S[neighbor] < S[i] or (S[neighbor] = S[i] and neighbor < i)
      then R[i] := R[i] + 1          // increment for every smaller element
  endparallel
endfor
// R[i] finally gives the number of smaller elements,
// which is exactly the target position of S[i]
parallel [i = 0 .. P-1]
  S[R[i]]  S[i]
endparallel

Exercise: work through the algorithm on an example.
This O(P)-step sorting algorithm is far from optimal; sorting is possible in O(log P) time.
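A minimal sketch of this enumeration sort (plain Python; R[i] counts the elements smaller than S[i], ties broken by index, which makes all target positions distinct and the final writes exclusive):

# Sketch only: naive EREW sorting by counting smaller elements.
def naive_sort(S):
    P = len(S)
    R = [0] * P
    for t in range(1, P):                      # P-1 lock-step rounds
        for i in range(P):                     # conceptually concurrent
            nb = (i + t) % P                   # all P reads hit distinct cells
            if S[nb] < S[i] or (S[nb] == S[i] and nb < i):
                R[i] += 1
    out = [None] * P
    for i in range(P):                         # exclusive writes: all R[i] differ
        out[R[i]] = S[i]
    return out

print(naive_sort([5, 3, 8, 3, 1]))  # [1, 3, 3, 5, 8]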

Prefix Sum

 Applications
  • deleting marked elements from an array (stream compaction), radix sort,
    solving recurrence equations, solving tri-diagonal linear systems, quicksort

si = Σk=1..i xk, i.e.
s1 = x1
s2 = s1 + x2
...
si = si-1 + xi

Example (n = 8):

i   1   2   3   4   5   6   7   8
x   7   6   5   4   3   2   1   0
s   7  13  18  22  25  27  28  28

Horn's Algorithm (CREW)

for t = 1 to log2 n do
  parallel [i = 2 .. n]
    if i > 2^(t-1) then x[i]  x[i - 2^(t-1)] + x[i]
  endparallel
endfor

Work = 17 (for the example with n = 8)

Problem: multiple read accesses to the same variable can lead to access delays
(read-resource conflicts (congestion) at a memory module).

[Figure: for n = 8, in step t every element i > 2^(t-1) adds the value located
 2^(t-1) positions to its left]
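A minimal sketch of Horn's scan (building a new list per round models the synchronous CREW step, in which several processors may read the same cell):

# Sketch only: CREW inclusive prefix sum in ceil(log2 n) rounds.
import math

def horn_scan(x):
    n = len(x)
    for t in range(int(math.ceil(math.log2(n)))):
        d = 1 << t            # stride 2^(t-1) in the slide's 1-based notation
        x = [x[i] if i < d else x[i - d] + x[i] for i in range(n)]
    return x

print(horn_scan([7, 6, 5, 4, 3, 2, 1, 0]))  # [7, 13, 18, 22, 25, 27, 28, 28]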

Prefix Sum EREW

 The processors and the memory cells M[i] hold the x-values
 Pi has a local register yi
 Complexity as for the EREW reduction
 Differences from the previous algorithm:
  • no multiple read accesses to global variables
  • dynamic parallelism

parallel [i = 0 .. n-1]
  y[i]  M[i]
endparallel
for t = 0 to log n - 1 do
  dynparallel [i = 2^t .. n-1]
    y[i] := y[i] + M[i - 2^t]
    M[i]  y[i]
  endparallel
endfor
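A minimal sketch of the EREW variant (the list y models the processors' local registers; in round t each cell M[i - 2^t] is read by at most one processor, so all accesses stay exclusive):

# Sketch only: EREW prefix sum with local registers y[i].
import math

def erew_prefix_sum(M):
    n = len(M)                          # assumes n = 2^k
    y = M[:]                            # y[i] <- M[i]
    for t in range(int(math.log2(n))):
        d = 1 << t
        for i in range(d, n):           # conceptually concurrent (dynparallel)
            y[i] = y[i] + M[i - d]      # reads go to the old M only
        M = y[:]                        # lock-step: all writes M[i] <- y[i]
    return M

print(erew_prefix_sum([7, 6, 5, 4, 3, 2, 1, 0]))  # [7, 13, 18, 22, 25, 27, 28, 28]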

Prefix Sum with n/2 Processors (CREW)

Read and write accesses per step (for n = 8; er/ew = exclusive read/write,
cr = concurrent read, with the trailing number giving the multiplicity):
 1.  8 er1 + 4 ew1
 2.  2 cr2 + 4 er1 + 4 ew1
 3.  1 cr4 + 4 er1 + 4 ew1

 log n steps
 n/2 processors = const.
 Work(p = n/2) = (n/2) · log n
 Cost = Work
 Overhead of the parallel execution:
  R(p = n/2) = Work(p)/T(1) = ((n/2) · log n)/n = (1/2) log n = O(log n)

Question 4): overhead for this example?

[Figure: prefix-sum data flow over 8 elements in 3 steps using n/2 = 4 processors]

Accelerated Cascading

 Goal: Cost = O(Tseq(n))
 Combine a slow cost-optimal algorithm with a fast non-cost-optimal one
 Example: maximum. Use p = n/log n processors instead of n
 A local phase is followed by a global phase
   Local phase: compute the maximum over log n values
    • use a simple sequential algorithm
    • time for the local phase = O(log n), in parallel for all p
   Global phase: take the n/log n local maxima and compute the global maximum
    using the reduction tree algorithm
    • time for the global phase = O(log(n/log n)) = O(log n - log log n) = O(log n)
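A minimal sketch of the two phases for the maximum (plain Python; group size and the padding of odd tree levels are handled explicitly, which the slides gloss over):

# Sketch only: accelerated cascading for the maximum with p = n/log n processors.
import math

def cascaded_max(A):
    n = len(A)
    g = max(1, int(math.log2(n)))                # group size: log n elements
    # local phase: each processor scans its group sequentially, O(log n) time
    local = [max(A[j:j + g]) for j in range(0, n, g)]
    # global phase: tree reduction over the n/log n local maxima, O(log(n/log n))
    while len(local) > 1:
        if len(local) % 2:
            local.append(local[-1])              # pad odd levels
        local = [max(local[2 * i], local[2 * i + 1]) for i in range(len(local) // 2)]
    return local[0]

print(cascaded_max([3, 1, 4, 1, 5, 9, 2, 6]))  # 9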

Accelerated Cascading: Computing the Maximum

[Figure: n elements are split into n/log n groups of log n elements each;
 processors P1 .. Pn/log n first reduce their own groups sequentially
 (depth log n), then a tree reduction of depth log(n/log n) combines the
 n/log n local maxima]

T(p) = (log n - 1) + log(n/log n) = 2 log n - log log n - 1
Cost = p · T(p) = (n/log n) · T(p)
Work = (n/log n)(log n - 1) + (n/log n) - 1 = n - 1
Utilization = Work/Cost
Speedup = (n-1)/T(p)

Cost = p · T(p) = (n/log n) · T(p)
With D = log(n/log n) and E = D/log n, the cost per element is
Cost(n)/n = 1 - 1/log n + E:

n         log n   n/log n    D = log(n/log n)   E = D/log n   Cost(n)/n
4         2       2.00       1.00               0.50          1.00
8         3       2.67       1.42               0.47          1.14
16        4       4.00       2.00               0.50          1.25
32        5       6.40       2.68               0.54          1.34
64        6       10.67      3.42               0.57          1.40
128       7       18.29      4.19               0.60          1.46
256       8       32.00      5.00               0.63          1.50
512       9       56.89      5.83               0.65          1.54
1024      10      102.40     6.68               0.67          1.57
2048      11      186.18     7.54               0.69          1.59
1048576   20      52428.80   15.68              0.78          1.73

Cost = O(n) — approaching 2n for very large n? (Cost/n → 2.0?)

Numeric Example

 Example: n = 256
 Number of processors: p = n/log n = 32
 Divide the 256 elements into 32 groups of 8 each
 Local phase: each processor computes the maximum of its 8 local elements
 Global phase: performed amongst the maxima computed by the 32 processors
 Question 5): compute T, Cost, Work, Utilization, Speedup

Simulations between PRAM Models

 An algorithm designed for a weaker model can be executed within the same time
  complexity and work complexity on a stronger model
 An algorithm designed for a stronger model can be simulated on a weaker model,
  either with
   asymptotically more processors (more work), or
   asymptotically more time

Summary

 PRAM
  • P(n) processors
  • global memory
  • lock-step
  • EREW, CREW, CRCW, CROW
 Broadcasting
  • write
  • read
  • all-to-all
 Reduction, tree
 Sum, tree
 Pointer Jumping
 Sorting
 Prefix Sum
  • Tseq = n
  • Horn: n-1 .. n/2 active processors
  • n/2 processors
 Accelerated Cascading
  • Cost = O(Tseq)

Answers

1) Work = n, Cost = (n/P) · P = n
2) The sequential execution needs exactly as many copy operations
3) Cost = 4 · 3 = 12; Work = 7; Utilization = 7/12
4) Sequential: 8 operations, parallel: 12 operations; R = 12/8
5) T = 7 + 5 = 12; Cost = 32 · 12 = 384; Work = 255;
   Utilization = 255/384 ≈ 66%; Speedup = 255/12 ≈ 21.3