R: Parallel Random Access Machine (PRAM), Lecture SS 2010
Contents

• Metrics
• Optimal algorithms
• Classification and types
• Algorithms
  • Initialize
  • Broadcasting
  • All-to-All Broadcasting
  • Reduction
  • Sum
  • Pointer Jumping
  • Naive Sorting
  • Prefix Sum
• WT-optimal algorithm
• Simulations between PRAM models
  • Simulating a Priority CRCW on an EREW PRAM
RA – SS 2010 , R. Hoffmann, Rechnerarchitektur, TU Darmstadt
R. Hoffmann, FB Informatik, TU-Darmstadt
Sources Used

• J. JaJa: An Introduction to Parallel Algorithms, 1992
• Robert van Engelen (slides): The PRAM Model and Algorithms, Advanced Topics, Spring 2008
• Behrooz Parhami: Introduction to Parallel Processing: Algorithms and Architectures, Plenum Series, Kluwer 2002 (slides)
• Arvind Krishnamurthy (slides): PRAM Algorithms, Fall 2004
Data-parallel examples

• Mandelbrot set
• Global Climate Modeling Problem
Metrics (review)

• pmax = P = number of OTPP used
• T = number of time steps
• Work(pmax) = Σ p(t)
• Cost(pmax) = pmax · T
• Overhead R(pmax) = Work(pmax) / Work(1)
• Utilization U = Work / Cost
• Speedup S = T(p=1) / T(p)
• Efficiency E = S / p
• Degree of parallelism p(t); plotted over all t: parallelism profile
• Parallelism index = average degree of parallelism I = Work / T

Notation also used with the PRAM algorithms:
• Cost(n, P(n)) = P(n) · T(n)
• Work(n, P(n)) = total number of operations executed, as a function of the problem size n and the number P(n) of OTPP used
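The metrics above can be computed directly from a parallelism profile. A minimal sketch, assuming `profile[t]` holds the number of operations p(t) executed in time step t (the function name `metrics` is hypothetical):

```python
# Sketch: computing Work, Cost, Utilization, and the parallelism index
# from a parallelism profile p(t).
def metrics(profile, pmax):
    T = len(profile)                 # number of time steps
    work = sum(profile)              # Work = sum of p(t) over all t
    cost = pmax * T                  # Cost = pmax * T
    return {
        "T": T,
        "Work": work,
        "Cost": cost,
        "Utilization": work / cost,  # U = Work / Cost
        "I": work / T,               # average degree of parallelism
    }
```

For example, a profile of 4, 2, 1 operations over three steps on pmax = 4 processors gives Work = 7 and Cost = 12.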
Optimal Algorithms

Time-optimal: a sequential algorithm whose (sequential) time complexity O(Tseq(n)) cannot be improved is called time-optimal.

A parallel algorithm is (work-)optimal if its work complexity lies within the time complexity of the corresponding sequential algorithm (the number of operations of the parallel algorithm is asymptotically equal to the number of operations of the sequential algorithm):
Work(n) = O(Tseq(n))
[An even stricter definition: the total number of operations in the parallel algorithm is equal to the number of operations in a sequential algorithm.]

Work-time-optimal (WT-optimal): the number of parallel steps Tpar(n) cannot be reduced further by any other work-optimal algorithm.
Sequential algorithm

• PRAM removes algorithmic details concerning synchronization and communication, allowing the algorithm designer to focus on problem properties
• A PRAM algorithm describes explicitly the operations performed at each time unit
• PRAM design paradigms have turned out to be robust and have been mapped efficiently onto many other parallel models and even network models
Parallel Random Access Machine (PRAM) Model
• Earliest and best-known model of parallel computation
• Natural extension of RAM: each processor is a RAM 1)
• P processors P1, P2, …, Pp, each with private memory
• Shared memory with m locations
• All processors operate synchronously, executing load, store, and operations on data

[Figure: processors P1 … Pp connected to the shared memory]

1) Random Access Machine: each instruction can access an arbitrary memory cell (not to be confused with RAM = Random Access Memory)
The PRAM Model of Parallel Computation

Synchronous PRAM
• All processors execute the same program; the model is therefore also called SPMD (Single Program Multiple Data)
• All processors execute the same PRAM-step instruction stream in lock-step
• The effect of an operation depends on the processor index i, local data, and the individual or common data that processor(i) accesses
• Instructions can be selectively disabled (if-then-else flow)
• Synchronous PRAM is an SIMD-style model

Asynchronous PRAM
• Several competing models
• No lock-step
Classification of PRAM Models

A PRAM step ("clock cycle") consists of three phases:
1. Read: each processor may read a value from shared memory
2. Compute: each processor may perform operations on local data
3. Write: each processor may write a value to shared memory

The model is refined by its concurrent read/write capability:
• Exclusive Read, Exclusive Write (EREW)
• Concurrent Read, Exclusive Write (CREW)
• Concurrent Read, Concurrent Write (CRCW)
Types of CRCW PRAM

Result of a concurrent write:
• Common: all processors must write the same value
• Arbitrary (Random): one of the processors succeeds in writing
• Priority: the processor with the lowest index succeeds
• Undefined: the value written is undefined
• Ignore: no new value is written
• Detecting: a special "collision detection" code is written
• Max/Min: the largest / smallest of the values is written
• Reduction: SUM, AND, OR, XOR, or another function of the values is written
Comparison of PRAM Models

A model B is more powerful than a model A if either
• the time complexity for solving a problem is asymptotically lower in model B than in A, or
• the time complexity is the same and the work complexity is asymptotically lower in model B than in A.

From weakest to strongest:
EREW, CREW, Common CRCW, Arbitrary CRCW, Priority CRCW
Initialize

Initializing an n-vector M (with base address B) to all 0s with P < n processors:

for t = 0 to ⌈n/P⌉ - 1 do              // segments
  parallel [i = 0 .. P-1]              // processors
    if (t·P + i < n) then M[B + t·P + i] ← 0
  endparallel
endfor
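The segmented initialization can be emulated sequentially. A minimal sketch, assuming the inner loop stands for one synchronous parallel step over the P processors (the function name `pram_initialize` is hypothetical):

```python
# Sketch: initialize M[base .. base+n-1] to 0 with P < n "processors".
def pram_initialize(M, base, n, P):
    steps = -(-n // P)                  # ceil(n / P): number of segments
    for t in range(steps):              # sequential over segments
        for i in range(P):              # "parallel" over processors
            if t * P + i < n:           # guard for the last, partial segment
                M[base + t * P + i] = 0
    return M
```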
Data Broadcasting

Two EREW variants; in each step t the number of copies of B[0] doubles.

for t = 0 to log2 P - 1 do
  parallel [i = 0 .. 2^t - 1]          // active
    B[i + 2^t] ← B[i]                  // proc(i) writes
  endparallel
endfor

for t = 0 to log2 P - 1 do
  parallel [i = 2^t .. 2^(t+1) - 1]    // active
    B[i] ← B[i - 2^t]                  // proc(i) reads
  endparallel
endfor

EREW PRAM data broadcasting without redundant copying.
If the array bound is not a power of two, extend the algorithms with a bounds-check condition.
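The doubling broadcast can be sketched as follows, assuming P is a power of two and the inner loop emulates one synchronous PRAM step (the function name `broadcast` is hypothetical):

```python
# Sketch of the EREW doubling broadcast: copy B[0] into B[0..P-1].
def broadcast(B, P):
    t = 1
    while t < P:                  # log2(P) steps
        for i in range(t):        # "parallel": 2^t active processors
            B[i + t] = B[i]       # each step touches distinct cells (EREW)
        t *= 2
    return B
```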
Data Broadcasting, Data-Parallel Formulation

Making P copies of B[0]:

parallel [i = 0 .. 2^t - 1]
  B[i + 2^t] ← B[i]
endparallel

Data-parallel notation (I = index vector):

I ← [0 .. 2^t - 1]
B[I + 2^t] ← B[I]

corresponds to

B[2^t .. 2^(t+1) - 1] ← B[0 .. 2^t - 1]

EREW PRAM data broadcasting without redundant copying.
All-to-All Broadcasting on the EREW PRAM

In step t, each processor i reads the value of its neighbor i + t.

parallel [i = 0 .. P-1]
  write own data value into B[i]
endparallel
for t = 1 to P - 1 do
  parallel [i = 0 .. P-1]
    read the data value from B[(i + t) mod P]
  endparallel
endfor

This O(P)-step algorithm is time-optimal.
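A minimal sketch of this scheme, assuming each "processor" collects what it reads in a local list (the function name `all_to_all` is hypothetical):

```python
# Sketch: EREW all-to-all broadcasting in P - 1 rotation steps.
def all_to_all(values):
    P = len(values)
    B = list(values)                        # each proc i writes B[i]
    local = [[B[i]] for i in range(P)]      # processor-local copies
    for t in range(1, P):                   # P - 1 steps
        for i in range(P):                  # "parallel" over processors
            # for fixed t, i -> (i + t) mod P is a bijection: exclusive reads
            local[i].append(B[(i + t) % P])
    return local
```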
Reduction on the EREW PRAM

• Reduces P values on the P-processor EREW PRAM in O(log P) time
• The reduction algorithm uses only exclusive reads and writes
• The algorithm is the basis of other EREW algorithms

[Figure: memory cells, processors, and data flow of the reduction tree]
Sum on the EREW PRAM (1)

Sum of n values using n/2 processors.
Input: A[1..n], n = 2^k. Output: s.

parallel [i = 1 .. P]
  B[i] ← A[i]
endparallel
for t = 1 to log n do
  parallel [i = 1 .. n/2]
    if i ≤ n/2^t then B[i] ← B[2i-1] + B[2i]
    if i = 1 then s ← B[i]
  endparallel
endfor

Cost = (n/2) · log n
Work = n - 1
Utilization U = Work/Cost = 2(n-1)/(n · log n) = O(1/log n)
Efficiency E = S/p = (n-1)/((n/2) · log n) = O(1/log n)
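The tree reduction above can be sketched like this, assuming n is a power of two and using a padded 1-based working array (the function name `pram_sum` is hypothetical):

```python
# Sketch of the EREW tree reduction: sum n values in log2(n) doubling steps.
def pram_sum(A):
    n = len(A)
    B = [0] + list(A)                  # 1-based working array, B[i] <- A[i]
    width = n // 2                     # number of active "processors"
    while width >= 1:                  # log2(n) steps
        for i in range(1, width + 1):  # "parallel" over active processors
            B[i] = B[2*i - 1] + B[2*i] # pairwise sums move toward B[1]
        width //= 2
    return B[1]                        # s <- B[1]
```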
Sum on the EREW PRAM (2)

Here the number of active processors is managed dynamically (this is not standard PRAM!).

Sum of n values using n/2 .. 1 processors.
Input: A[1..n], n = 2^k. Output: s.

parallel [i = 1 .. P]
  B[i] ← A[i]
endparallel
for t = 1 to log n do
  parallel [i = 1 .. n/2^t]     // dynamic activity
    B[i] := B[2i-1] + B[2i]
  endparallel
endfor
s := B[1]

Cost = n - 1
Work = n - 1
Utilization = 1
Pointer Jumping

Pointer Jumping on the CREW PRAM: finding the roots of a forest using pointer jumping.

Input: a forest of trees, each with a self-loop at its root, consisting of arcs (i, P(i)) and nodes i, where 1 ≤ i ≤ n.
Output: for each node i, the root S[i] of the tree containing i.

parallel [i = 1 .. n]
  S[i] ← P[i]
endparallel
for h = 1 to log n do     // or: while there exists a node i with S[i] ≠ S[S[i]]
  parallel [i = 1 .. n]
    if S[i] ≠ S[S[i]] then S[i] ← S[S[i]]
  endparallel
endfor

T(n) = O(log h), with h the maximum height of the trees
Cost(n) = O(n · log h)
Example

[Figure: pointer jumping on 8 nodes; in each round the pointers are "transported", i.e. every pointer jumps to its successor's successor, until all nodes point to the root]
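Pointer jumping can be sketched as follows, assuming 0-based nodes, a parent array `P` with a self-loop at each root, and a while-loop in place of the fixed log n rounds (the function name `find_roots` is hypothetical):

```python
# Sketch of pointer jumping: each round, every pointer jumps to its
# grandparent, so path lengths halve; O(log h) rounds for height h.
def find_roots(P):
    n = len(P)
    S = list(P)                        # S[i] <- P[i]
    changed = True
    while changed:
        changed = False
        S_next = list(S)               # all reads before all writes (one PRAM step)
        for i in range(n):             # "parallel" over all nodes
            if S[i] != S[S[i]]:
                S_next[i] = S[S[i]]    # jump to successor's successor
                changed = True
        S = S_next
    return S
```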
Naive EREW PRAM sorting algorithm (all read from all)

Values are stored in S; the rank of each value is accumulated in R.

parallel [i = 0 .. P-1]
  R[i] ← 0
endparallel
for t = 1 to P - 1 do
  parallel [i = 0 .. P-1]
    neighbor := (i + t) mod P
    if S[neighbor] < S[i] or (S[neighbor] = S[i] and neighbor < i) then
      R[i] := R[i] + 1
  endparallel
endfor
parallel [i = 0 .. P-1]
  S[R[i]] ← S[i]
endparallel

This O(P)-step sorting algorithm is far from optimal; sorting is possible in O(log P) time.
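A minimal sketch of this rank ("enumeration") sort, with ties broken by index as in the comparison above (the function name `rank_sort` is hypothetical):

```python
# Sketch of the naive rank sort: count, for each value, how many values
# precede it; then scatter each value to its rank.
def rank_sort(S):
    P = len(S)
    R = [0] * P
    for t in range(1, P):                  # P - 1 steps
        for i in range(P):                 # "parallel"; neighbor map is a bijection
            nb = (i + t) % P
            if S[nb] < S[i] or (S[nb] == S[i] and nb < i):
                R[i] += 1                  # one more value ranked before S[i]
    out = [None] * P
    for i in range(P):                     # ranks are distinct: exclusive writes
        out[R[i]] = S[i]
    return out
```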
Prefix Sum

Applications: deleting marked elements from an array (stream compaction), radix sort, solving recurrence equations, solving tri-diagonal linear systems, quicksort.

Prefix sums s_i = x_1 + x_2 + … + x_i, example:

i: 1  2  3  4  5  6  7  8
x: 7  6  5  4  3  2  1  0
s: 7 13 18 22 25 27 28 28

CREW algorithm (in place on x):

for t = 1 to log n do
  parallel [i = 1 .. n]
    if i > 2^(t-1) then x[i] ← x[i - 2^(t-1)] + x[i]
  endparallel
endfor

Problem: multiple read accesses to the same variable can lead to access delays (read congestion at a memory module).
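The CREW prefix-sum loop above can be sketched like this, mapping the slide's 1-based indices onto 0-based Python lists (the function name `prefix_sum` is hypothetical):

```python
# Sketch of the CREW prefix sum: in step t, every element with index >= d
# adds the element d positions to its left; d doubles each step.
def prefix_sum(x):
    n = len(x)
    d = 1
    while d < n:                   # log2(n) steps
        x_next = list(x)           # reads precede writes within one PRAM step
        for i in range(d, n):      # "parallel": elements with i >= d active
            x_next[i] = x[i - d] + x[i]
        x = x_next
        d *= 2
    return x
```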
Prefix Sum EREW

The memory cells M[i] contain the x-values; each processor Pi has a local register y_i. Complexity as for the EREW reduction.

parallel [i = 0 .. n-1]
  y[i] ← M[i]
endparallel
for t = 0 to log n - 1 do
  parallel [i = 2^t .. n-1]
    y[i] := y[i] + M[i - 2^t]
    M[i] ← y[i]
  endparallel
endfor
Prefix Sum Algorithm: Work-Optimal? (CREW)

Read and write accesses per step (er = exclusive read, ew = exclusive write, cr = concurrent read):
1. 8er1 + 4ew1
2. 2cr2 + 4er1 + 4ew1
3. 1cr4 + 4er1 + 4ew1

• log n steps, n/2 processors = const., so Cost = Work
• Work(pmax = n/2) = (n/2) · log n, hence not work-optimal
• Overhead of the parallel execution:
  R(pmax = n/2) = ((n/2) · log n)/(n - 1) = O(log n)
R-26
Use P = n/log n processors instead of n Have a local phase followed by the global phase Local phase: compute maximum over log n values use simple sequential algorithm Time for local phase = O(log n)
Global phase: take (n/log n) local maxima and compute global maximum using the reduction tree algorithm Time for global phase = O(log (n/log n)) = O(log n – log log n) = O(log n)
Work(n) = O(n) Example: n = 16 Number of processors, p = n/log n = 4 Divide 16 elements into four groups of four each Local phase: each processor computes the maximum of its four local elements Global phase: performed amongst the maxima computed by the four processors
R-27
RA – SS 2010 , R. Hoffmann, Rechnerarchitektur, TU Darmstadt
Entwicklung eines WT-optimalen Algorithmus zur Bildung des Maximums von n Zahlen
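The two-phase scheme can be sketched as follows, assuming n is a power of two so the group size log2(n) divides n evenly (the function name `wt_optimal_max` is hypothetical):

```python
import math

# Sketch of the WT-optimal maximum: sequential local phase over groups of
# size log n, then a reduction tree over the n/log n local maxima.
def wt_optimal_max(A):
    n = len(A)
    g = max(1, int(math.log2(n)))      # group size = log n
    # Local phase: each "processor" scans its g elements sequentially.
    maxima = [max(A[j:j + g]) for j in range(0, n, g)]
    # Global phase: pairwise reduction tree over the local maxima.
    while len(maxima) > 1:             # O(log(n/g)) "parallel" steps
        maxima = [max(maxima[k:k + 2]) for k in range(0, len(maxima), 2)]
    return maxima[0]
```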
Simulations between PRAM Models

• An algorithm designed for a weaker model can be executed with the same time complexity and work complexity on a stronger model
• An algorithm designed for a stronger model can be simulated on a weaker model, either with
  • asymptotically more processors (more work), or
  • asymptotically more time