Automatic Extraction of Parallelism for Embedded Software

Daniel Cordes
Informatik Centrum Dortmund, Germany
June 29th, 2011

Outline
1. Motivation / State-of-the-art / Ideas
2. ILP-based Task-Level Parallelization
3. ILP-based Loop-Level Parallelization
4. Conclusion & Future work


Motivation
• Using multiple cores in one system can...
  • reduce CPU frequencies / execution time
  • save energy
  • enable further optimizations

Problem
• Most embedded applications are written in sequential C
• Splitting a program into tasks manually is...
  • error-prone
  • time-consuming
⇒ Automatic parallelization is beneficial


State-of-the-art
• Many semi- or fully automatic parallelization frameworks exist
• Common characteristics:
  • Detailed cost models are rarely used
  • Hard to evaluate whether a parallelization is beneficial
• Due to complexity:
  • (Greedy or other) heuristics are often used
• Most frameworks are not optimized for embedded devices


Idea
• Focus on characteristics of embedded system applications and architectures
  • Often streaming-oriented applications
  • In general: limited OS support compared to host architectures
  • Non-unified memory architectures → more expensive communication
• Use integer linear programming (ILP) for parallelization decisions
  • Clear, mathematical description of the problem
  • Optimal with respect to its model (a sketch of such a formulation follows below)

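The slides never show the ILP itself. Purely as an illustration of the kind of model meant here (all symbols are assumptions, not taken from the CODES/ISSS '10 formulation), a makespan-minimizing assignment of the nodes n of one hierarchical level to at most k tasks could be written as:

    \begin{aligned}
    \min\;& T\\
    \text{s.t.}\;& \sum_{t=1}^{k} x_{n,t} = 1 && \forall n && \text{(each node runs in exactly one task)}\\
    & T \ge \sum_{n} c_n\,x_{n,t} + \sum_{(n,m)} w_{n,m}\,y_{n,m} && \forall t && \text{(execution + communication bound the makespan)}\\
    & y_{n,m} \ge x_{n,t} - x_{m,t} && \forall (n,m),\ \forall t && \text{(}y_{n,m}=1\text{ if edge }(n,m)\text{ crosses tasks)}\\
    & x_{n,t},\,y_{n,m} \in \{0,1\}
    \end{aligned}

Here c_n is a node's execution cost and w_{n,m} the communication cost annotated on edge (n, m). Solving such a model is what makes the result "optimal with respect to its model": within the cost model and the chosen hierarchy level, no better assignment exists.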

Outline
1. Motivation / State-of-the-art / Ideas
2. ILP-based Task-Level Parallelization *
3. ILP-based Loop-Level Parallelization
4. Conclusion & Future work

* D. Cordes, P. Marwedel, A. Mallik, "Automatic parallelization of embedded software using hierarchical task graphs and integer linear programming", CODES/ISSS '10


ILP-based Task-Level Parallelization
• Focus on characteristics of embedded system applications and architectures
• Reduce the search space by introducing hierarchy into the task graph model
• Use integer linear programming (ILP) for parallelization decisions
• Use an adequate cost model to balance tasks

Characteristics
• Coarse-grained parallelization technique
• Possible to limit the number of concurrently executed tasks


Hierarchical Task Graph
• The search space is too large for flat task graphs when using ILP
• Introduction of hierarchy
  • Only a small number of nodes per hierarchical level
  • Communication is redirected through communication in- and out-nodes
  • Decisions can be made locally
• Automatically extracted from the source code
• Annotated with communication / execution costs (a possible data layout is sketched below the figure)


[Figure: Hierarchical task graph. A hierarchical node contains a subgraph of simple nodes; communication in- and out-nodes redirect data across hierarchy levels. Node annotations: iteration count (e.g., 16), execution cost (e.g., 200), reference to statement. Edge annotations: edge type (e.g., RAW), communication cost (e.g., 64), communicated data (e.g., [i]), iteration count (e.g., 8).]

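The node and edge annotations in the figure suggest a data layout roughly like the following C sketch; every type and field name here is invented for illustration and is not the framework's actual API:

    #include <stddef.h>

    /* Edge annotation as in the figure: dependence type,
     * communication cost, communicated data, iteration count. */
    typedef enum { DEP_RAW, DEP_WAR, DEP_WAW } dep_type_t;

    typedef struct htg_edge {
        dep_type_t type;            /* e.g. RAW                        */
        unsigned   comm_cost;       /* e.g. 64                         */
        const char *data;           /* e.g. "[i]"                      */
        unsigned   iterations;      /* e.g. 8                          */
        struct htg_node *src, *dst;
    } htg_edge_t;

    /* A node is a simple statement node, a hierarchical node
     * containing a subgraph, or a communication in-/out-node.  */
    typedef enum {
        NODE_SIMPLE, NODE_HIERARCHICAL, NODE_COMM_IN, NODE_COMM_OUT
    } node_kind_t;

    typedef struct htg_node {
        node_kind_t kind;
        unsigned    iterations;     /* e.g. 16                         */
        unsigned    exec_cost;      /* e.g. 200                        */
        void       *stmt;           /* reference to the IR statement   */
        struct htg_node **children; /* subgraph of a hierarchical node */
        size_t      n_children;
    } htg_node_t;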

Parallelization Methodology
• Example: main loop of the spectral benchmark (ANSI C source code):

    int main() {
      for (i = 0; i < NUMAV; ++i) {
        int index = i * DELTA;
        for (int j = 0; j < SLICE; ++j) {
          sample_real[j] = input_signal[index + j] * hamming[j];
          sample_imag[j] = zero;
        }
      }
    }

• The methodology proceeds in six steps (a control-flow sketch follows this slide):
  1. Extract the hierarchical task graph from the ANSI C source code
  2. Parallelize the nodes bottom-up
  3. Transform each hierarchical node into an ILP
  4. Solve the ILP for different task limitations
  5. Transform the solutions back into the HTG
  6. Attach the results and continue with the other nodes

[Figure: bottom-up parallelization of the example's HTG. An inner hierarchical node (Nodes 2-6, with communication in- and out-nodes) is transformed into an ILP; the solution maps its nodes onto tasks T1 and T2.]
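Taken together, the six steps form a recursive bottom-up pass over the HTG. The following C sketch only illustrates that control flow; all functions are hypothetical placeholders, not the tool's API:

    #include <stddef.h>

    typedef struct htg_node htg_node_t;  /* node type, see sketch above */
    typedef struct ilp ilp_t;            /* opaque ILP instance         */
    typedef struct solution solution_t;  /* opaque ILP solution         */

    /* Hypothetical helpers. */
    size_t      htg_child_count(htg_node_t *n);
    htg_node_t *htg_child(htg_node_t *n, size_t i);
    ilp_t      *build_ilp(htg_node_t *level);                 /* step 3 */
    solution_t *solve_ilp(ilp_t *ilp, int task_limit);        /* step 4 */
    void        attach_solution(htg_node_t *l, solution_t *s);/* steps 5+6 */

    void parallelize(htg_node_t *node, int max_tasks)
    {
        /* Step 2: recurse first, so inner hierarchical nodes are
         * parallelized before their enclosing node (bottom-up).  */
        for (size_t i = 0; i < htg_child_count(node); ++i)
            parallelize(htg_child(node, i), max_tasks);

        /* Steps 3-4: build one ILP for this hierarchical level and
         * solve it once per task limitation (2 .. max_tasks).     */
        ilp_t *ilp = build_ilp(node);
        for (int k = 2; k <= max_tasks; ++k)
            attach_solution(node, solve_ilp(ilp, k));
    }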


Results

[Figure: speedup bar chart for 2, 3, and 4 cores (y-axis: 1.0-4.0) over benchmarks including compress, edge detect, H.263, spectral, adpcm encoder, and boundary value, plus the average.]

• Cycle-accurate simulator: MPARM
• OS: RTEMS + runtime library
• Measured: execution time of the application without OS initialization
• Speedup of up to 1.9x / 2.9x / 3.7x on 2 / 3 / 4 cores
• Average speedup: 1.8x / 2.2x / 2.7x
• This work was part of the MNEMEE European FP7 project



ILP-based Loop-Level Parallelization
• The HTG is both a blessing and a curse
• Problem: hierarchy hides dependencies between different loop iterations
  → Use a Program Dependence Graph (PDG) instead
• Build a PDG for each loop and try to parallelize it
• Finer-grained approach than task-level parallelism

[Figure: PDG of the spectral benchmark's main loop, from Entry to Exit, covering i = 0; i < NUMAV; index = i * DELTA; the inner j-loop filling sample_real / sample_imag; fft(sample_real, sample_imag); the second j-loop accumulating mag[j]; and ++i. Node annotations: iteration count (e.g., 16), execution cost (e.g., 200), reference to statement. Edge annotations: edge type (e.g., RAW), communication cost (e.g., 64), communicated data (e.g., [i]), iteration count (e.g., 16), interleaving level (e.g., 1).]


Parallelization example – Sequential execution
• Main loop of the spectral benchmark
• All iterations are executed sequentially

    for (i = 0; i < NUMAV; ++i) {
      float sample_real[SLICE];
      float sample_imag[SLICE];
      int index = i * DELTA;
      for (int j = 0; j < SLICE; ++j) {
        sample_real[j] = input_signal[index + j] * hamming[j];
        sample_imag[j] = zero;
      }
      fft(sample_real, sample_imag);
      for (int j = 0; j < SLICE; ++j) {
        mag[j] = mag[j] + (((sample_real[j] * sample_real[j]) +
                 (sample_imag[j] * sample_imag[j])) / SLICE_2);
      }
    }

[Figure: timeline in which a single task T1 executes iterations 1, 2, 3, ... one after another at times t0, t2, t4, ...]


Parallelization example – Horizontal splits
• Split the statements into disjoint parts
• Creates a pipeline of calculations
• Execute one iteration
• Communicate the data
• Continue with the next iteration

    for (i = 0; i < NUMAV; ++i) {
      float sample_real[SLICE];
      float sample_imag[SLICE];
      int index = i * DELTA;
      /* T1: */
      for (int j = 0; j < SLICE; ++j) {
        sample_real[j] = input_signal[index + j] * hamming[j];
        sample_imag[j] = zero;
      }
      /* T2: */
      fft(sample_real, sample_imag);
      /* T3: */
      for (int j = 0; j < SLICE; ++j) {
        mag[j] = mag[j] + (((sample_real[j] * sample_real[j]) +
                 (sample_imag[j] * sample_imag[j])) / SLICE_2);
      }
    }

[Figure: pipelined timeline. T1 starts iterations 1, 2, 3, ... at t0, t2, t4, ...; T2 runs one iteration behind and T3 two iterations behind.]


Parallelization example – Horizontal splits (cont.)
• The outer loop is copied to each of the tasks T1, T2, T3
• Each task executes only those statements that are mapped to it
• T1 executes the first iteration, communicates its data to T2, and starts with the next iteration
• While T2 works on iteration 1, T1 already computes iteration 2, and so on

[Figure: annotated timeline of tasks T1, T2, and T3 overlapping in pipeline fashion.]

The per-task code on the original slide is truncated; a hedged reconstruction follows below.
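As a reconstruction of the idea only (not the tool's generated code), the three pipeline tasks for the spectral loop could look roughly like this. send() and receive() are invented stand-ins for the runtime library's FIFO primitives, and the constants carry illustrative values:

    /* Illustrative values; the real ones come from the benchmark. */
    #define NUMAV   16
    #define DELTA   64
    #define SLICE   256
    #define SLICE_2 (SLICE / 2)   /* assumption */

    extern float input_signal[], hamming[], mag[];
    extern const float zero;
    void fft(float *re, float *im);

    enum { T1 = 1, T2, T3 };

    /* Hypothetical FIFO primitives of the runtime library. */
    void send(int to, const float *re, const float *im);
    void receive(int from, float *re, float *im);

    /* Task T1: window the input slice, then pass it on. */
    void task_T1(void) {
        for (int i = 0; i < NUMAV; ++i) {
            float sample_real[SLICE], sample_imag[SLICE];
            int index = i * DELTA;
            for (int j = 0; j < SLICE; ++j) {
                sample_real[j] = input_signal[index + j] * hamming[j];
                sample_imag[j] = zero;
            }
            send(T2, sample_real, sample_imag);
        }
    }

    /* Task T2: run the FFT on each received slice. */
    void task_T2(void) {
        for (int i = 0; i < NUMAV; ++i) {
            float sample_real[SLICE], sample_imag[SLICE];
            receive(T1, sample_real, sample_imag);
            fft(sample_real, sample_imag);
            send(T3, sample_real, sample_imag);
        }
    }

    /* Task T3: accumulate the magnitude spectrum. */
    void task_T3(void) {
        for (int i = 0; i < NUMAV; ++i) {
            float sample_real[SLICE], sample_imag[SLICE];
            receive(T2, sample_real, sample_imag);
            for (int j = 0; j < SLICE; ++j)
                mag[j] += ((sample_real[j] * sample_real[j]) +
                           (sample_imag[j] * sample_imag[j])) / SLICE_2;
        }
    }

Each task keeps its own copy of the outer loop, exactly as the slide's annotations describe, so T1 can already work on iteration i+1 while T2 and T3 are still processing iteration i.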