2. Parallel Random Access Machine (PRAM)
Massivparallele Modelle und Architekturen (Massively Parallel Models and Architectures), R. Hoffmann, FB Informatik, TU Darmstadt, WS 10/11
Contents
• Metrics (recap)
• Optimal algorithms
• Classification and types
• Algorithms
  • Initialize
  • Broadcasting
  • All-to-all broadcasting
  • Reduction
  • Sum
  • Pointer jumping
  • Naive sorting
  • Prefix sum
• Accelerated cascading
• Simulations between PRAM models
Sources
• J. JaJa: An Introduction to Parallel Algorithms, 1992
• Robert van Engelen: The PRAM Model and Algorithms, Advanced Topics, Spring 2008 (slides)
• Behrooz Parhami: Introduction to Parallel Processing, Algorithms and Architectures, Plenum Series, Kluwer 2002 (slides)
• Arvind Krishnamurthy: PRAM Algorithms, Fall 2004 (slides)
• Keller, Kessler, Träff: Practical PRAM Programming, Wiley Interscience, New York, 2000
MMP – WS 10/11, R. Hoffmann, Rechnerarchitektur, TU Darmstadt
Metrics (recap)

p: provided processors
q(t): degree of parallelism, processors in use at time t; plotted over all t: the parallelism profile
P(n): processors required by a given algorithm as a function of the problem size n (see PRAM)
T(p): total number of time steps
S(p) = T(1)/T(p): speedup
E = S/p = T(1)/Cost(p): efficiency
Work(p) = Σt q(t): work performed
Cost(p) = p * T(p): cost with p provided processors
I(p) = Work(p)/T(p): parallelism index, mean degree of parallelism
Utilization = Work/Cost: mean utilization of the processors
R(p) = Work(p)/Work(1): overhead of parallel execution with p processors
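The definitions above can be checked with a small script (not part of the slides; the function name and the example profile are ours). It computes all metrics from a parallelism profile q(t):

```python
def metrics(q, p, t1):
    """q: list of q(t) values (processors busy in step t),
    p: provided processors, t1: sequential time T(1)."""
    T = len(q)                        # T(p): total number of time steps
    work = sum(q)                     # Work(p) = sum over all q(t)
    cost = p * T                      # Cost(p) = p * T(p)
    return {
        "T(p)": T,
        "Speedup": t1 / T,            # S(p) = T(1)/T(p)
        "Efficiency": t1 / cost,      # E = T(1)/Cost(p)
        "Work": work,
        "Cost": cost,
        "I(p)": work / T,             # mean degree of parallelism
        "Utilization": work / cost,   # Work/Cost
    }

# Example: 4 provided processors, profile of a tree sum of 8 values
m = metrics(q=[4, 2, 1], p=4, t1=7)
print(m)  # Work = 7, Cost = 12, Utilization = 7/12
```

With this profile the numbers match Question 3 later in the section: Cost = 12, Work = 7, Utilization = 7/12.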
PRAM: Cost, Work
Notation also used in connection with the PRAM algorithms:
Cost(n, P(n)) = P(n) * T(n)
Work(n, P(n)) = work performed (processors/operations x time) as a function of the problem size n and the number P(n) of processors required by the PRAM algorithm.
Optimal Algorithms
Time-optimal sequential algorithm: a sequential algorithm whose (sequential) time complexity O(Tseq(n)) cannot be improved is called time-optimal.
(Work-)optimal parallel algorithm: the work complexity lies within the time complexity of the best sequential algorithm, i.e. the number of operations of the parallel algorithm is asymptotically equal to the number of operations of the sequential algorithm: Work(n) = O(Tseq(n)). [An even stricter definition: the total number of operations in the parallel algorithm is equal to the number of operations in a sequential algorithm.]
Work-time-optimal (WT-optimal, strictly optimal): the number of parallel steps Tpar(n) cannot be reduced further by any other work-optimal algorithm. (There is no faster optimal one.)
Cost-optimal: Cost(n) = O(Tseq(n))
Parallel Random Access Machine (PRAM) Model
• PRAM removes algorithmic details concerning synchronization and communication, allowing the algorithm designer to focus on problem properties
• A PRAM algorithm describes explicitly the operations performed at each time unit
• PRAM design paradigms have turned out to be robust and have been mapped efficiently onto many other parallel models and even network models
The PRAM Model of Parallel Computation
Earliest and best-known model of parallel computation. Natural extension of the RAM 1): each processor is a RAM.
• P processors P1, P2, ..., Pp, each with private memory
• Shared memory with m locations
• All processors operate synchronously, by executing load, store, and operations on data
(Figure: processors P1 .. Pp connected to the shared memory.)
1) Random Access Machine: an instruction may access an arbitrary memory cell (not to be confused with RAM = Random Access Memory).
Synchronous PRAM
Synchronous PRAM is an SIMD-style model:
• All processors execute the same program; the model is therefore also called SPMD (Single Program Multiple Data)
• All processors execute the same PRAM step instruction stream in lock-step
• The effect of an operation depends on the individual or common data which processor(i) accesses: the processor index i and local data
• Instructions can be selectively disabled (if-then-else flow)
Asynchronous PRAM
• Several competing models
• No lock-step
PRAM Instruction Set
A typical instruction set of a PRAM looks as follows (L: local memory, G: global memory):
(1) Constants: L[x] := constant, L[x] := input size, L[x] := processor number
(2) Write: G[L[x]] := L[y]
(3) Read: L[x] := G[L[y]]
(4) Local assignment: L[x] := L[y]
(5) Conditional assignment: if L[x] > 0 then assignment
(6) Operations: L[x] := L[y] op L[z]
(7) Jumps: goto label y, if L[x] > 0 goto label y
(8) Miscellaneous: if L[x] > 0 then Halt, if L[y] > 0 then NoOperation
Classification of PRAM Models
A PRAM step ("clock cycle") consists of three phases:
1. Read: each processor may read a value from shared memory
2. Compute: each processor may perform operations on local data
3. Write: each processor may write a value to shared memory
The model is refined by its concurrent read/write capability:
• Exclusive Read Exclusive Write (EREW)
• Concurrent Read Exclusive Write (CREW)
• Concurrent Read Concurrent Write (CRCW)
• Concurrent Read Owner Write (CROW)
Types of CRCW PRAM
CRCW variants, by the result of a write conflict:
• Common: all processors must write the same value
• Arbitrary, Random: one of the processors succeeds in writing
• Priority: the processor with the lowest index succeeds
• Undefined: the value written is undefined
• Ignore: no new value is written in case of conflict
• Detecting: a special "collision detection" code is written
• Max/Min: the largest/smallest of the values is written
• Reduction: SUM, AND, OR, XOR or another function of the values is written
Comparison of PRAM Models
A model B is more powerful than a model A if either
• the time complexity for solving a problem is asymptotically less in model B than in A, or
• the time complexity is the same and the work complexity is asymptotically less in model B than in A.
From weakest to strongest: EREW, CREW, Common CRCW, Arbitrary CRCW, Priority CRCW
Initialize
Initializing an n-vector M (with base address B) to all 0s with P < n processors:

for t = 0 to ⌈n/P⌉ - 1 do              // segments
  parallel [i = 0 .. P-1]              // processors
    if (t*P + i < n) then M[B + t*P + i] := 0
  endparallel
endfor

(Figure: the cells 0 .. n-1 are filled P at a time, segment t = 0, then t = 1, and so on.)

Question 1: Work = ?  Cost = ?
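A minimal sequential simulation of the loop above (not from the slides; the function name is ours). The parallel section is modeled by an inner loop over the processor indices:

```python
import math

def initialize(M, B, n, P):
    """Set M[B .. B+n-1] to 0 using P 'processors', segment by segment."""
    for t in range(math.ceil(n / P)):   # segments t = 0 .. ceil(n/P)-1
        for i in range(P):              # parallel [i = 0 .. P-1]
            if t * P + i < n:           # guard for the last, partial segment
                M[B + t * P + i] = 0
    return M

M = [None] * 12
initialize(M, B=2, n=10, P=4)
print(M)  # [None, None, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

The guard in the last segment is exactly the `t*P + i < n` test of the pseudocode; without it, processors past the end of the vector would write out of bounds.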
Data Broadcasting
Making P copies of B[0]: EREW PRAM data broadcasting without redundant copying. Two equivalent formulations; 2^t processors are active in step t:

for t = 0 to log2 P - 1 do
  parallel [i = 0 .. 2^t - 1]          // active
    B[i + 2^t] := B[i]                 // proc(i) writes
  endparallel
endfor

for t = 0 to log2 P - 1 do
  parallel [i = 2^t .. 2^(t+1) - 1]    // active
    B[i] := B[i - 2^t]                 // proc(i) reads
  endparallel
endfor

T = ?
Note: if the array bound is not a power of 2, extend the algorithms with a guard condition.
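A sketch of the write-based doubling broadcast in Python (not from the slides; the function name is ours). The guard `i + 2**t < P` is the extra condition the note above asks for when P is not a power of 2:

```python
import math

def broadcast(value, P):
    """Make P copies of a value by doubling: 2^t writers in round t."""
    B = [None] * P
    B[0] = value
    for t in range(math.ceil(math.log2(P))):  # t = 0 .. ceil(log2 P) - 1
        for i in range(2 ** t):               # parallel [i = 0 .. 2^t - 1]
            if i + 2 ** t < P:                # guard for non-power-of-2 P
                B[i + 2 ** t] = B[i]          # proc(i) writes
    return B

print(broadcast(42, 8))  # [42, 42, 42, 42, 42, 42, 42, 42]
```

Each cell is written by exactly one processor per round, so reads and writes stay exclusive (EREW).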
Data Broadcasting, Data-Parallel Formulation
The loop body

parallel [i = 0 .. 2^t - 1]
  B[i + 2^t] := B[i]
endparallel

corresponds, in data-parallel notation with index vector I, to

I := [0 .. 2^t - 1]
B[I + 2^t] := B[I]

or equivalently

B[2^t .. 2^(t+1) - 1] := B[0 .. 2^t - 1]
All-to-All Broadcasting on EREW PRAM

parallel [i = 0 .. P-1]
  write own data value into B[i]
endparallel
for t = 1 to P - 1 do
  parallel [i = 0 .. P-1]
    read the data value from B[(i + t) mod P]
  endparallel
endfor

In step t, every processor i reads the value of its neighbor i + t (mod P), so no two processors read the same cell in the same step.
This O(P)-step algorithm is time-optimal. Question 2: why?
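A sequential sketch of the rotation scheme (not from the slides; names are ours). Each processor collects all P values without ever sharing a read target with another processor in the same round:

```python
def all_to_all(values):
    """Every 'processor' i collects all P values by rotating reads."""
    P = len(values)
    B = list(values)                     # each processor wrote its own value
    local = [[B[i]] for i in range(P)]   # private collection buffers
    for t in range(1, P):                # P - 1 read rounds
        for i in range(P):               # parallel [i = 0 .. P-1]
            local[i].append(B[(i + t) % P])   # exclusive read per round
    return local

result = all_to_all([10, 20, 30, 40])
print(result[0])  # [10, 20, 30, 40]
print(result[1])  # [20, 30, 40, 10]
```

In round t the read targets (i + t) mod P form a permutation of the processors, which is what makes the reads exclusive.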
Reduction on the EREW PRAM
• Reduce P values on the P-processor EREW PRAM in O(log P) time
• The reduction algorithm uses only exclusive reads and writes
• The algorithm is the basis of other EREW algorithms
(Figure: memory cells versus processors; the data flow forms a binary reduction tree.)
Sum on the EREW PRAM (1)
Sum of n values using n/2 processors.
Input: A[1..n], n = 2^k. Output: s.

parallel [i = 1 .. P]
  B[i] := A[i]
endparallel
for t = 1 to log n do
  parallel [i = 1 .. n/2]
    if i <= n/2^t then B[i] := B[2i-1] + B[2i]
    if i = 1 then s := B[i]
  endparallel
endfor

Cost = (n/2) log n
Work = n - 1
Utilization = U = Work/Cost = (n-1)/((n/2) log n) = O(1/log n)
E = T(1)/Cost(p) = (n-1)/((n/2) log n) = O(1/log n)

(Figure: the values B[1..8] are added pairwise in a tree until the sum s remains.)
Question 3: for the example (n = 8): Cost, Work, Utilization?
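A sequential simulation of the tree sum with a fixed pool of n/2 processors (a sketch, not the slides' code; the function name is ours). In round t only the first n/2^t of them do useful work, which is why the utilization is only O(1/log n):

```python
import math

def tree_sum(A):
    """Sum of n = 2^k values by pairwise tree reduction."""
    n = len(A)
    B = [None] + list(A)                 # 1-based, as in the pseudocode
    for t in range(1, int(math.log2(n)) + 1):
        for i in range(1, n // 2 + 1):   # parallel [i = 1 .. n/2]
            if i <= n // 2 ** t:         # only n/2^t processors are active
                B[i] = B[2 * i - 1] + B[2 * i]
    return B[1]                          # s

print(tree_sum([1, 2, 3, 4, 5, 6, 7, 8]))  # 36
```

Iterating i in ascending order is safe here because processor i writes B[i] but reads B[2i-1] and B[2i], which later (larger) i never write in the same round.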
Sum on the EREW PRAM (2)
Sum of n values using n/2 .. 1 processors. Here the number of processors is managed dynamically (this is not PRAM standard!).
Input: A[1..n], n = 2^k. Output: s.

parallel [i = 1 .. P]
  B[i] := A[i]
endparallel
for t = 1 to log n do
  dynparallel [i = 1 .. n/2^t]         // dynamic activity
    B[i] := B[2i-1] + B[2i]
  endparallel
endfor
s := B[1]

Cost = n - 1
Work = n - 1
Utilization = 1
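The same tree sum, but with the dynamically shrinking processor pool; a sketch (names ours) that also counts the operations to confirm Work = n - 1:

```python
import math

def tree_sum_dyn(A):
    """Tree sum with n/2^t active 'processors' in round t; returns (s, work)."""
    n = len(A)
    B = [None] + list(A)                       # B[i] := A[i], 1-based
    work = 0
    for t in range(1, int(math.log2(n)) + 1):
        for i in range(1, n // 2 ** t + 1):    # dynparallel [i = 1 .. n/2^t]
            B[i] = B[2 * i - 1] + B[2 * i]
            work += 1
    return B[1], work

s, work = tree_sum_dyn([1, 2, 3, 4, 5, 6, 7, 8])
print(s, work)  # 36 7  (Work = n - 1, so Utilization = 1)
```

Because idle processors are released, Cost equals Work, and the algorithm becomes cost-optimal.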
Pointer Jumping on the CREW PRAM
Finding the roots of a forest using pointer jumping.
Input: a forest of trees, each with a self-loop at its root, consisting of arcs (i, P(i)) and nodes i, where 1 <= i <= n.
Output: for each node i, the root S[i] of the tree containing i.

parallel [i = 1 .. n]
  S[i] := P[i]
endparallel
for h = 1 to log n do   // while there exists a node with S[i] != S[S[i]] do
  parallel [i = 1 .. n]
    if S[i] != S[S[i]] then S[i] := S[S[i]]
  endparallel
endfor

T(n) = O(log h), where h is the maximum height of the trees
Cost(n) = O(n log h)
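A sketch of pointer jumping in Python (not the slides' code; the forest below is made-up example data). Building a fresh list per round models the PRAM semantics that all reads of S happen before any write:

```python
import math

def find_roots(P):
    """P[i] is the parent of node i; roots have a self-loop P[r] = r.
    After the doubling rounds, S[i] is the root of i's tree."""
    n = len(P)
    S = list(P)                                   # S[i] := P[i]
    for _ in range(max(1, math.ceil(math.log2(n)))):  # log n rounds suffice
        # one PRAM step: the new list reads only the old S
        S = [S[S[i]] if S[i] != S[S[i]] else S[i] for i in range(n)]
    return S

# Forest over nodes 0..7 (0-based): roots 0 and 4 point to themselves
P = [0, 0, 1, 2, 4, 4, 5, 6]
print(find_roots(P))  # [0, 0, 0, 0, 4, 4, 4, 4]
```

After each round every node points twice as far along its path, so O(log h) rounds already suffice for trees of height h; the fixed log n bound from the pseudocode is simply a safe upper limit.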
Example (figure): a forest over the nodes 1 .. 8. In each step the pointers are forwarded, so every node points twice as far along its path, until all nodes point directly to their roots.
Naive EREW PRAM Sorting Algorithm (all read from all)
P = n values in S; target positions in R.

parallel [i = 0 .. P-1]
  R[i] := 0
endparallel
for t = 1 to P - 1 do
  parallel [i = 0 .. P-1]
    neighbor := (i + t) mod P
    if S[neighbor] < S[i] or (S[neighbor] = S[i] and neighbor < i)
      then R[i] := R[i] + 1        // increment for every smaller element
  endparallel
endfor
// R[i] finally gives the number of smaller elements,
// which is the target position of S[i]
parallel [i = 0 .. P-1]
  S[R[i]] := S[i]
endparallel

This O(P)-step sorting algorithm is far from optimal; sorting is possible in O(log P) time.
Exercise: work through an example.
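A sketch of this rank sort in Python (names ours). R[i] counts the elements smaller than S[i], with ties broken by index so that equal elements get distinct target positions:

```python
def rank_sort(values):
    """Naive rank sort: R[i] = number of smaller elements = target position."""
    P = len(values)
    S = list(values)
    R = [0] * P
    for t in range(1, P):                    # P - 1 rotation rounds
        for i in range(P):                   # parallel [i = 0 .. P-1]
            nb = (i + t) % P                 # neighbor read in this round
            if S[nb] < S[i] or (S[nb] == S[i] and nb < i):
                R[i] += 1                    # one more smaller element
    out = [None] * P                         # models the read-before-write
    for i in range(P):                       # semantics of the final step
        out[R[i]] = S[i]                     # S[R[i]] := S[i]
    return out

print(rank_sort([3, 1, 4, 1, 5]))  # [1, 1, 3, 4, 5]
```

The separate output array models the PRAM step in which all reads of S precede all writes; a literal in-place `S[R[i]] = S[i]` in sequential Python would overwrite values that are still needed.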
Prefix Sum
Applications: deleting marked elements from an array (stream compaction), radix sort, solving recurrence equations, solving tri-diagonal linear systems, quicksort.

Example (n = 8):
i   1   2   3   4   5   6   7   8
x   7   6   5   4   3   2   1   0
s   7  13  18  22  25  27  28  28

Definition: si = Σk=1..i xk, i.e. s1 = x1, si = si-1 + xi.
Horn's Algorithm (CREW)

for t = 1 to log2 n do
  parallel [i = 2 .. n]
    if i > 2^(t-1) then x[i] := x[i - 2^(t-1)] + x[i]
  endparallel
endfor

Problem: multiple read accesses to the same variable can lead to access delays (read congestion at a memory module).
Work = 17 for n = 8 (7 + 6 + 4 operations in the three rounds)
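A Python sketch of Horn's CREW prefix sum (not the slides' code; 0-based indexing, so the slide's condition `i > 2^(t-1)` becomes `i >= d`). In round t every element adds in the value 2^(t-1) positions to its left:

```python
import math

def prefix_sum_crew(x):
    """Prefix sums of n = 2^k values in log n rounds (Horn's algorithm)."""
    n = len(x)
    x = list(x)
    for t in range(1, int(math.log2(n)) + 1):
        d = 2 ** (t - 1)                 # jump distance in round t
        # one PRAM step: build the new list from the old values only
        x = [x[i - d] + x[i] if i >= d else x[i] for i in range(n)]
    return x

print(prefix_sum_crew([7, 6, 5, 4, 3, 2, 1, 0]))
# [7, 13, 18, 22, 25, 27, 28, 28]
```

The output reproduces the example table above. The concurrent reads of the model show up here as the same old x entry feeding several new entries in one round.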
Prefix Sum EREW
The processors and the memory cells M[i] hold the x values; Pi has a local register yi. Complexity as for the EREW reduction.

parallel [i = 0 .. n-1]
  y[i] := M[i]
endparallel
for t = 0 to log n - 1 do
  dynparallel [i = 2^t .. n-1]
    y[i] := y[i] + M[i - 2^t]
    M[i] := y[i]
  endparallel
endfor

Differences from the previous algorithm:
• no multiple read accesses to global variables
• dynamic parallelism
Prefix Sum Algorithm with n/2 Processors (CREW)
Read and write accesses per step (for n = 8):
1. 8er1 + 4ew1
2. 2cr2 + 4er1 + 4ew1
3. 1cr4 + 4er1 + 4ew1

log n steps, n/2 processors = const.
Work(p = n/2) = (n/2) log n
Cost = Work
Overhead of the parallel execution:
R(p = n/2) = Work(p)/T(1) = ((n/2) log n)/n = (1/2) log n = O(log n)
Question 4: overhead for this example (n = 8)?
Accelerated Cascading
Goal: Cost = O(Tseq(n)). Combine a slow cost-optimal algorithm with a fast non-cost-optimal one.
Example: maximum.
• Use p = n/log n processors instead of n
• A local phase is followed by a global phase
• Local phase: each processor computes the maximum over log n values using the simple sequential algorithm; time for the local phase = O(log n), in parallel for all p
• Global phase: take the n/log n local maxima and compute the global maximum using the tree reduction algorithm; time for the global phase = O(log(n/log n)) = O(log n - log log n) = O(log n)
Accelerated Cascading: Computing the Maximum
(Figure: the n elements are split among P1 .. Pn/log n; each processor scans log n elements sequentially, then a tree reduction of depth log(n/log n) combines the n/log n local maxima.)

T(p) = log n - 1 + log(n/log n) = 2 log n - log log n - 1
Cost = p T(p) = (n/log n) T(p)
Work = (n/log n)(log n - 1) + (n/log n) - 1 = n - 1
Utilization = Work/Cost
Speedup = (n-1) / T(p)
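A sketch of the cascaded maximum in Python (names ours; the odd-level padding is our addition for group counts that are not powers of two):

```python
import math
import random

def cascaded_max(A):
    """Maximum via accelerated cascading: sequential scan over groups of
    log n values, then a tree reduction over the local maxima."""
    n = len(A)
    k = max(1, int(math.log2(n)))            # group size: log n
    # local phase: each 'processor' scans its group sequentially
    local = [max(A[j:j + k]) for j in range(0, n, k)]
    # global phase: tree reduction over the ~n/log n local maxima
    while len(local) > 1:
        if len(local) % 2:                   # pad odd-sized levels
            local.append(local[-1])
        local = [max(local[2 * i], local[2 * i + 1])
                 for i in range(len(local) // 2)]
    return local[0]

A = list(range(256))
random.shuffle(A)
print(cascaded_max(A))  # 255
```

For n = 256 this is exactly the numerical example on the next slide: 32 groups of 8, then a tree of depth 5 over the 32 local maxima.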
Cost = p T(p) = (n/log n) T(p). With D = log(n/log n) and E = D/log n, Cost(n)/n = 1 - 1/log n + E:

n        log n   n/log n    D = log(n/log n)   E = D/log n   Cost/n
4        2       2.00       1.00               0.50          1.00
8        3       2.67       1.42               0.47          1.14
16       4       4.00       2.00               0.50          1.25
32       5       6.40       2.68               0.54          1.34
64       6       10.67      3.42               0.57          1.40
128      7       18.29      4.19               0.60          1.46
256      8       32.00      5.00               0.63          1.50
512      9       56.89      5.83               0.65          1.54
1024     10      102.40     6.68               0.67          1.57
2048     11      186.18     7.54               0.69          1.59
1048576  20      52428.80   15.68              0.78          1.73

Cost = O(n); for very large n, Cost/n approaches 2, i.e. Cost approaches 2n.
Numerical Example
n = 256; number of processors p = n/log n = 32.
• Divide the 256 elements into 32 groups of 8 each
• Local phase: each processor computes the maximum of its 8 local elements
• Global phase: performed amongst the maxima computed by the 32 processors
Question 5: compute T, Cost, Work, Utilization, Speedup.
Simulations between PRAM Models
• An algorithm designed for a weaker model can be executed within the same time complexity and work complexity on a stronger model
• An algorithm designed for a stronger model can be simulated on a weaker model, with either asymptotically more processors (more work) or asymptotically more time
Summary: PRAM
• Model: P(n) processors, global memory, lock-step; EREW, CREW, CRCW, CROW
• Broadcasting: write-based, read-based, all-to-all
• Reduction (tree), Sum (tree), Pointer jumping, Sorting
• Prefix sum: Tseq = n; Horn: n-1 .. n/2 active processors; variant with n/2 processors
• Accelerated cascading: Cost = O(Tseq)
Answers
1) Work = n, Cost = ⌈n/P⌉ * P
2) The sequential execution needs just as many copy operations.
3) Cost = 4 * 3 = 12; Work = 7; Utilization = 7/12
4) Sequential: 8 operations; parallel: 12 operations; R = 12/8
5) T = 7 + 5 = 12, Cost = 32 * 12 = 384, Work = 255, Utilization = 255/384 ≈ 66 %, Speedup = 255/12 ≈ 21.3