R Parallel Random Access Machine (PRAM)

Lecture, SS 2010. R. Hoffmann, FB Informatik, TU Darmstadt.

Contents

• Metrics
• Optimal algorithms
• Classification and types
• Algorithms
  • Initialize
  • Broadcasting
  • All-to-all broadcasting
  • Reduction
  • Sum
  • Pointer jumping
  • Naive sorting
  • Prefix sum
• WT-optimal algorithm
• Simulations between PRAM models
  • Simulating a Priority CRCW on an EREW PRAM


Sources Used

• J. JaJa: An Introduction to Parallel Algorithms, 1992
• Robert van Engelen: The PRAM Model and Algorithms (slides), Advanced Topics, Spring 2008
• Behrooz Parhami: Introduction to Parallel Processing: Algorithms and Architectures, Plenum Series, Kluwer 2002 (slides)
• Arvind Krishnamurthy: PRAM Algorithms (slides), Fall 2004

Data-Parallel Examples

(Figures: the Mandelbrot set and the global climate modeling problem as motivating data-parallel examples.)

(Review) Metrics

• pmax = P = number of OTPPs (processors) used
• T = number of time steps
• Work(pmax) = Σt p(t)
• Cost(pmax) = pmax · T
• Overhead R(pmax) = Work(pmax) / Work(1)
• Utilization U = Work / Cost
• Speedup S = T(p=1) / T(p)
• Efficiency E = S / p
• Degree of parallelism p(t); its plot over all t is the parallelism profile
• Parallelism index = average degree of parallelism I = Work / T
• Notation also used with the PRAM algorithms:
  • Cost(n, P(n)) = P(n) · T(n)
  • Work(n, P(n)) = total number of operations executed, as a function of the problem size n and the number P(n) of OTPPs used
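A small Python sketch (not from the slides; names are illustrative) that computes these metrics from a parallelism profile p(t):

def metrics(profile, t_seq=None):
    """profile[t] = p(t), the number of active processors in step t."""
    T = len(profile)                        # parallel time steps
    p_max = max(profile)                    # processors used
    work = sum(profile)                     # Work = sum over p(t)
    cost = p_max * T                        # Cost = p_max * T
    out = {"T": T, "p_max": p_max, "work": work, "cost": cost,
           "utilization": work / cost,      # U = Work / Cost
           "parallelism_index": work / T}   # I = Work / T
    if t_seq is not None:                   # sequential reference: T(p=1) = Work(1)
        out["speedup"] = t_seq / T          # S = T(p=1) / T(p)
        out["efficiency"] = out["speedup"] / p_max  # E = S / p
        out["overhead_R"] = work / t_seq    # R = Work(p_max) / Work(1)
    return out

# Example: tree reduction of n = 8 values has profile p(t) = 4, 2, 1:
print(metrics([4, 2, 1], t_seq=7))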

Optimal Algorithms

• Sequential algorithm
  • Time-optimal: a sequential algorithm whose (sequential) time complexity O(Tseq(n)) cannot be improved is called time-optimal.
• Parallel algorithm
  • (Work-)optimal: the work complexity lies within the time complexity of the corresponding sequential algorithm, i.e. the number of operations of the parallel algorithm is asymptotically equal to the number of operations of the sequential algorithm: Work(n) = O(Tseq(n)). [An even stricter definition: the total number of operations in the parallel algorithm is equal to the number of operations in a sequential algorithm.]
  • Work-time-optimal (WT-optimal): the number of parallel steps Tpar(n) cannot be reduced by any other work-optimal algorithm.

Parallel Random Access Machine (PRAM) Model

• PRAM removes algorithmic details concerning synchronization and communication, allowing the algorithm designer to focus on problem properties.
• A PRAM algorithm describes explicitly the operations performed at each time unit.
• PRAM design paradigms have turned out to be robust and have been mapped efficiently onto many other parallel models and even network models.

The PRAM Model of Parallel Computation

• Natural extension of the RAM 1): each processor is a RAM.
• Processors operate synchronously.
• Earliest and best-known model of parallel computation.

(Figure: a shared memory with m locations, connected to P processors P1, P2, P3, …, Pp, each with a private memory.)

• All processors operate synchronously, executing load, store, and compute operations on data.

1) Random Access Machine: an instruction can access an arbitrary memory cell (not to be confused with RAM = Random Access Memory).

Synchronous PRAM

• Synchronous PRAM is an SIMD-style model.
• All processors execute the same program; the model is therefore also called SPMD (Single Program Multiple Data).
• All processors execute the same PRAM step instruction stream in lock-step.
• The effect of an operation depends on
  • the individual or common data that processor(i) accesses
  • the processor index i
  • local data
• Instructions can be selectively disabled (if-then-else flow).

Asynchronous PRAM

• Several competing models.
• No lock-step.

Classification of PRAM Models

• A PRAM step ("clock cycle") consists of three phases:
  1. Read: each processor may read a value from shared memory.
  2. Compute: each processor may perform operations on local data.
  3. Write: each processor may write a value to shared memory.
• The model is refined by its concurrent read/write capability:
  • Exclusive Read, Exclusive Write (EREW)
  • Concurrent Read, Exclusive Write (CREW)
  • Concurrent Read, Concurrent Write (CRCW)

Types of CRCW PRAM

On a concurrent write, the resulting value depends on the CRCW variant:

  Common              all processors must write the same value
  Arbitrary (Random)  one of the processors succeeds in writing
  Priority            the processor with the lowest index succeeds
  Undefined           the value written is undefined
  Ignore              no new value is written
  Detecting           a special "collision detection" code is written
  Max/Min             the largest/smallest of the values is written
  Reduction           SUM, AND, OR, XOR, or another function of the values is written
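A minimal Python sketch (not from the slides; names and the policy subset are illustrative) of how a step-synchronous simulator can resolve the writes that arrive at one memory cell in a single CRCW step:

def crcw_write(requests, mode, old=None):
    """requests: (processor_index, value) pairs targeting one cell."""
    if not requests:
        return old                          # nobody writes: keep old value
    vals = [v for _, v in requests]
    if mode == "common":                    # all values must agree
        assert len(set(vals)) == 1, "Common CRCW requires identical values"
        return vals[0]
    if mode == "arbitrary":                 # any one write succeeds
        return vals[0]
    if mode == "priority":                  # lowest processor index wins
        return min(requests)[1]
    if mode == "max":                       # Max/Min variant (here: Max)
        return max(vals)
    if mode == "sum":                       # one possible Reduction variant
        return sum(vals)
    raise ValueError(mode)

# Processors 2, 0 and 5 write to the same cell in one step:
reqs = [(2, 30), (0, 10), (5, 50)]
print(crcw_write(reqs, "priority"), crcw_write(reqs, "max"), crcw_write(reqs, "sum"))
# -> 10 50 90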

Comparison of PRAM Models

• A model B is more powerful than a model A if either
  • the time complexity for solving a problem is asymptotically less in model B than in A, or
  • the time complexity is the same and the work complexity is asymptotically less in model B than in A.
• From weakest to strongest:
  • EREW
  • CREW
  • Common CRCW
  • Arbitrary CRCW
  • Priority CRCW

Initialize

Initializing an n-vector M (with base address B) to all 0s, using P < n processors for the n elements:

for t = 0 to ⌈n/P⌉ – 1 do            // segments
  parallel [i = 0 .. P-1]            // processors
    if (t·P + i < n) then M[B + t·P + i] ← 0
  endparallel
endfor
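A minimal Python sketch (not from the slides; names are illustrative) that simulates the segmented loop sequentially:

import math

def pram_initialize(M, base, n, P):
    for t in range(math.ceil(n / P)):       # segments
        for i in range(P):                  # one PRAM step: procs 0 .. P-1
            if t * P + i < n:               # guard for the last segment
                M[base + t * P + i] = 0
    return M

print(pram_initialize([9] * 12, base=0, n=11, P=4))
# -> the first 11 cells are cleared in ceil(11/4) = 3 steps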

Data Broadcasting

Making P copies of B[0] on the EREW PRAM (data broadcasting without redundant copying). Write-based formulation:

for t = 0 to log2 P – 1 do
  parallel [i = 0 .. 2^t – 1]          // active
    B[i + 2^t] ← B[i]                  // proc(i) writes
  endparallel
endfor

Equivalent read-based formulation:

for t = 0 to log2 P – 1 do
  parallel [i = 2^t .. 2^(t+1) – 1]    // active
    B[i] ← B[i – 2^t]                  // proc(i) reads
  endparallel
endfor

If the array bound is not a power of 2, extend the algorithms with a bounds check.

(Figure: array B with cells 0 .. 11; the copied prefix doubles in each step.)

Data Broadcasting, Data-Parallel Formulation

parallel [i = 0 .. 2^t – 1]
  B[i + 2^t] ← B[i]
endparallel

corresponds, in data-parallel notation (I = index vector), to

I ← [0 .. 2^t – 1]
B[I + 2^t] ← B[I]

which is equivalent to the slice assignment

B[2^t .. 2^(t+1) – 1] ← B[0 .. 2^t – 1]
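A minimal Python sketch (not from the slides) of the doubling broadcast in this slice form, with the bounds check needed when P is not a power of 2:

def broadcast(B):
    """Copy B[0] into all P cells in ceil(log2 P) EREW steps."""
    P = len(B)
    t = 0
    while (1 << t) < P:
        lo, hi = 1 << t, min(1 << (t + 1), P)   # clip to the array bound
        B[lo:hi] = B[0:hi - lo]                 # B[2^t .. 2^(t+1)-1] <- B[0 .. 2^t-1]
        t += 1
    return B

print(broadcast([42, 0, 0, 0, 0, 0]))   # -> [42, 42, 42, 42, 42, 42]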

All-to-All Broadcasting on the EREW PRAM

parallel [i = 0 .. P-1]
  write own data value into B[i]
endparallel
for t = 1 to P – 1 do
  parallel [i = 0 .. P-1]
    read the data value from B[(i + t) mod P]
  endparallel
endfor

In step t, each processor i reads the value of its neighbor i + t; the P reads of a step target P distinct cells, so all accesses stay exclusive. This O(P)-step algorithm is time-optimal.
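A minimal Python sketch (not from the slides; names are illustrative) of the rotation scheme:

def all_to_all(values):
    P = len(values)
    B = list(values)                        # step 0: proc i writes B[i]
    received = [[B[i]] for i in range(P)]
    for t in range(1, P):                   # P-1 rotation steps
        for i in range(P):                  # one PRAM step
            received[i].append(B[(i + t) % P])
    return received

print(all_to_all([10, 20, 30]))
# -> [[10, 20, 30], [20, 30, 10], [30, 10, 20]]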

Reduction on the EREW PRAM

• Reduce P values on the P-processor EREW PRAM in O(log P) time.
• The reduction algorithm uses only exclusive reads and writes.
• The algorithm is the basis of other EREW algorithms.

(Figure: memory cells, processors, and the data flow of the reduction tree.)

Sum on the EREW PRAM (1)

Sum of n values using n/2 (statically allocated) processors.
Input: A[1..n], n = 2^k. Output: s.

parallel [i = 1 .. P]
  B[i] ← A[i]
endparallel
for t = 1 to log n do
  parallel [i = 1 .. n/2]              // all n/2 processors remain allocated
    if i ≤ n/2^t then B[i] ← B[2i-1] + B[2i]
  endparallel
endfor
s ← B[1]

Cost = (n/2) · log n
Work = n – 1
Utilization U = Work/Cost = 2(n–1)/(n · log n) = O(1/log n)
Efficiency E = (n–1)/((n/2) · log n) = O(1/log n)

(Figure: reduction tree over B[1..8].)
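A minimal Python sketch (not from the slides) of the tree sum, simulating each parallel step synchronously; n is assumed to be a power of two:

import math

def pram_sum(A):
    n = len(A)
    B = [0] + list(A)                       # 1-based indexing as on the slide
    for t in range(1, int(math.log2(n)) + 1):
        # one PRAM step: processors 1 .. n/2^t add disjoint pairs,
        # so every read and write is exclusive (EREW)
        step = {i: B[2 * i - 1] + B[2 * i] for i in range(1, n // 2**t + 1)}
        for i, v in step.items():
            B[i] = v
    return B[1]

print(pram_sum([1, 2, 3, 4, 5, 6, 7, 8]))   # -> 36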

Sum on the EREW PRAM (2)

Sum of n values using n/2, n/4, …, 1 processors. Here the number of processors is managed dynamically (this is not standard PRAM!).
Input: A[1..n], n = 2^k. Output: s.

parallel [i = 1 .. P]
  B[i] ← A[i]
endparallel
for t = 1 to log n do
  parallel [i = 1 .. n/2^t]            // dynamic activity
    B[i] ← B[2i-1] + B[2i]
  endparallel
endfor
s ← B[1]

Cost = n – 1
Work = n – 1
Utilization U = 1

(Figure: reduction tree over B[1..8].)

Pointer Jumping

Pointer Jumping on the CREW PRAM

• Finding the roots of a forest using pointer jumping.

Input: a forest of trees, each with a self-loop at its root, consisting of arcs (i, P(i)) and nodes i, where 1 ≤ i ≤ n.
Output: for each node i, the root S[i] of the tree containing i.

parallel [i = 1 .. n]
  S[i] ← P[i]
endparallel
for h = 1 to log n do        // or: while there exists a node with S[i] ≠ S[S[i]] do
  parallel [i = 1 .. n]
    if S[i] ≠ S[S[i]] then S[i] ← S[S[i]]
  endparallel
endfor

T(n) = O(log h), where h is the maximum height of the trees.
Cost(n) = O(n log h).
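A minimal Python sketch (not from the slides; names are illustrative): each step replaces S[i] by S[S[i]], so every node's distance to its root roughly halves:

def find_roots(P):
    """P[i] = parent of node i (a root points to itself); P[0] is unused."""
    n = len(P) - 1
    S = P[:]                                        # S[i] <- P[i]
    while any(S[i] != S[S[i]] for i in range(1, n + 1)):
        # one CREW step: concurrent reads of S[S[i]], exclusive writes to S[i]
        S = [0] + [S[S[i]] for i in range(1, n + 1)]
    return S

# Forest: 1 -> 2 -> 3 (root), and 5 -> 4 (root)
print(find_roots([0, 2, 3, 3, 4, 4])[1:])           # -> [3, 3, 3, 4, 4]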

Example

(Figure: pointer jumping on a forest with nodes 1 .. 8; in each step the pointers are forwarded, shortening every node's path to its root.)

Naive EREW PRAM Sorting Algorithm (all read from all)

Values are in S; the computed ranks are in R:

parallel [i = 0 .. P-1]
  R[i] ← 0
endparallel
for t = 1 to P – 1 do
  parallel [i = 0 .. P-1]
    neighbor := (i + t) mod P
    if S[neighbor] < S[i] or (S[neighbor] = S[i] and neighbor < i)
      then R[i] := R[i] + 1
  endparallel
endfor
parallel [i = 0 .. P-1]
  S[R[i]] ← S[i]
endparallel

This O(P)-step sorting algorithm is far from optimal; sorting is possible in O(log P) time.
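A minimal Python sketch (not from the slides) of this rank sort; ties are broken by processor index, so all ranks are distinct and the final write step is exclusive:

def rank_sort(S):
    P = len(S)
    R = [0] * P
    for t in range(1, P):                   # P-1 rotation steps
        for i in range(P):                  # one PRAM step
            nb = (i + t) % P
            if S[nb] < S[i] or (S[nb] == S[i] and nb < i):
                R[i] += 1                   # count elements ranked before S[i]
    out = [None] * P
    for i in range(P):                      # exclusive writes: ranks are unique
        out[R[i]] = S[i]
    return out

print(rank_sort([3, 1, 4, 1, 5]))           # -> [1, 1, 3, 4, 5]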

Prefix Sum

• Applications: deleting marked elements from an array (stream compaction), radix sort, solving recurrence equations, solving tridiagonal linear systems, quicksort.

Example (prefix sums s_i = x_1 + … + x_i):

  i   1   2   3   4   5   6   7   8
  x   7   6   5   4   3   2   1   0
  s   7  13  18  22  25  27  28  28

Prefix Sum CREW

for t = 1 to log n do
  parallel [i = 1 .. n]
    if i > 2^(t-1) then x[i] ← x[i – 2^(t-1)] + x[i]
  endparallel
endfor

Problem: multiple read accesses to the same variable can lead to access delays (read-resource conflict, i.e. congestion, at a memory module).
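A minimal Python sketch (not from the slides) of the doubling prefix sum, simulating each parallel step synchronously:

import math

def prefix_sum(x):
    n = len(x)
    s = list(x)
    for t in range(1, math.ceil(math.log2(n)) + 1):
        d = 2 ** (t - 1)
        # one PRAM step: s[i] <- s[i-d] + s[i] for all i >= d (0-based)
        s = [s[i] if i < d else s[i - d] + s[i] for i in range(n)]
    return s

print(prefix_sum([7, 6, 5, 4, 3, 2, 1, 0]))
# -> [7, 13, 18, 22, 25, 27, 28, 28], as in the example above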

Prefix Sum EREW

• The processors and the memory cells M[i] hold the x values.
• Each processor Pi has a local register y_i.
• Complexity as for the EREW reduction.

parallel [i = 0 .. n-1]
  y[i] ← M[i]
endparallel
for t = 0 to log n – 1 do
  parallel [i = 2^t .. n-1]
    y[i] := y[i] + M[i – 2^t]
    M[i] ← y[i]
  endparallel
endfor

(Figure: memory cells M and processors 1 .. 8.)

Prefix Sum Algorithm: Work-Optimal?

CREW read and write accesses per step (for n = 8; er/cr = exclusive/concurrent read, ew = exclusive write, the trailing number giving the multiplicity):

  1. 8 er1 + 4 ew1
  2. 2 cr2 + 4 er1 + 4 ew1
  3. 1 cr4 + 4 er1 + 4 ew1

• log n steps
• n/2 processors = const.
• Work(pmax = n/2) = (n/2) · log n
• Cost = Work → not work-optimal
• Overhead of the parallel execution:
  R(pmax = n/2) = ((n/2) · log n)/(n – 1) = O(log n)

Developing a WT-Optimal Algorithm for the Maximum of n Numbers

• Use P = n/log n processors instead of n.
• Have a local phase followed by the global phase.
• Local phase: compute the maximum over log n values
  • using the simple sequential algorithm
  • time for the local phase = O(log n)
• Global phase: take the n/log n local maxima and compute the global maximum using the reduction tree algorithm
  • time for the global phase = O(log(n/log n)) = O(log n – log log n) = O(log n)
• Work(n) = O(n)
• Example: n = 16 (see the sketch below)
  • number of processors p = n/log n = 4
  • divide the 16 elements into four groups of four each
  • local phase: each processor computes the maximum of its four local elements
  • global phase: performed amongst the maxima computed by the four processors
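A minimal Python sketch (not from the slides; names are illustrative) of the two-phase maximum: blocks of about log n values are reduced sequentially, then a logarithmic tree reduction combines the local maxima:

import math

def wt_optimal_max(A):
    n = len(A)
    k = max(1, int(math.log2(n)))            # block size ~ log n
    # local phase: O(log n) time per processor, O(n) work in total
    maxima = [max(A[j:j + k]) for j in range(0, n, k)]
    # global phase: reduction tree, O(log(n / log n)) steps
    while len(maxima) > 1:
        nxt = [max(a, b) for a, b in zip(maxima[::2], maxima[1::2])]
        if len(maxima) % 2:                   # odd element is carried over
            nxt.append(maxima[-1])
        maxima = nxt
    return maxima[0]

print(wt_optimal_max([3, 14, 1, 5, 9, 2, 6, 5, 35, 8, 9, 7, 9, 3, 2, 3]))  # -> 35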

Simulations between PRAM Models

• An algorithm designed for a weaker model can be executed within the same time complexity and work complexity on a stronger model.
• An algorithm designed for a stronger model can be simulated on a weaker model, either with
  • asymptotically more processors (more work), or
  • asymptotically more time.
