Automatic Extraction of Parallelism for Embedded Software
Daniel Cordes, Informatik Centrum Dortmund, Germany
June 29th, 2011
Outline
1. Motivation / State-of-the-art / Ideas
2. ILP-based Task-Level Parallelization
3. ILP-based Loop-Level Parallelization
4. Conclusion & Future work
Motivation
Using multiple cores in one system can...
- reduce CPU frequencies / execution time
- save energy
- enable further optimizations

Problem
Most embedded applications are written in sequential C, and splitting a program into tasks manually is...
- error-prone
- time-consuming

=> Automatic parallelization is beneficial
State-of-the-art
Many semi- or fully automatic parallelization frameworks exist. Common characteristics:
- Detailed cost models are rarely used, so it is hard to determine whether a parallelization is beneficial
- Due to complexity, (greedy or other) heuristics are often used
- Most frameworks are not optimized for embedded devices
Idea
Focus on the characteristics of embedded system applications and architectures:
- Often streaming-oriented applications
- In general: limited OS support compared to host architectures
- Non-unified memory architectures => more expensive communication

Use integer linear programming (ILP) for parallelization decisions:
- Clear, mathematical description of the problem
- Optimal with respect to its model
Outline
1. Motivation / State-of-the-art / Ideas
2. ILP-based Task-Level Parallelization *
3. ILP-based Loop-Level Parallelization
4. Conclusion & Future work

* D. Cordes, P. Marwedel, A. Mallik: "Automatic parallelization of embedded software using hierarchical task graphs and integer linear programming", CODES/ISSS '10
ILP-based Task-Level Parallelization
- Focus on the characteristics of embedded system applications and architectures
- Reduce the search space by introducing hierarchy into the task graph model
- Use integer linear programming (ILP) for parallelization decisions (see the sketch below)
- Use an adequate cost model to balance tasks

Characteristics
- Coarse-grained parallelization technique
- Possible to limit the number of concurrently executed tasks
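As a rough illustration of what such a model can look like (a simplified sketch, not the actual formulation from the CODES/ISSS '10 paper cited above): assign the $n$ nodes of one hierarchical level to at most $T$ tasks such that the makespan $C_{\max}$ is minimized:

\begin{aligned}
\min\;\; & C_{\max} \\
\text{s.t.}\;\; & \sum_{t=1}^{T} x_{i,t} = 1 && \forall i \in \{1,\dots,n\} \\
& \sum_{i=1}^{n} e_i\,x_{i,t} + \mathit{comm}_t \le C_{\max} && \forall t \in \{1,\dots,T\} \\
& x_{i,t} \in \{0,1\}
\end{aligned}

Here $x_{i,t} = 1$ iff node $i$ is mapped to task $t$, $e_i$ is the node's execution cost, and $\mathit{comm}_t$ is the communication cost incurred by task $t$ (in a real model itself a linear expression over the $x_{i,t}$, derived from the annotated edge costs).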
Hierarchical Task Graph
- Search space too large for flat task graphs when using ILP
- Therefore: introduction of hierarchy
  - Only a small number of nodes per hierarchical level
  - Communication redirected through communication in- and out-nodes
  - Decisions can be made locally
- Automatically extracted from source code
- Annotated with communication / execution costs
[Figure: hierarchical task graph consisting of simple nodes, hierarchical nodes, and communication in-/out-nodes. Node annotations: iteration count (16), execution cost (200), reference to the source statement. Edge annotations: edge type (RAW), communication cost (64), communicated data ([i]), iteration count (8).]
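The annotations shown in the figure map naturally onto a graph data structure. A minimal C sketch of what such node and edge records could look like (all type and field names are illustrative assumptions, not the framework's actual types):

    /* Sketch of HTG node/edge records matching the annotations in the
     * figure above. All names are illustrative, not the framework's. */
    #include <stddef.h>

    typedef enum { SIMPLE_NODE, HIERARCHICAL_NODE,
                   COMM_IN_NODE, COMM_OUT_NODE } NodeKind;
    typedef enum { RAW_DEP, WAR_DEP, WAW_DEP } EdgeType;

    typedef struct HTGNode HTGNode;

    typedef struct HTGEdge {
        EdgeType    type;        /* e.g. RAW                          */
        unsigned    comm_cost;   /* e.g. 64                           */
        const char *data;        /* communicated data, e.g. "[i]"     */
        unsigned    iterations;  /* e.g. 8                            */
        HTGNode    *src, *dst;
    } HTGEdge;

    struct HTGNode {
        NodeKind  kind;
        unsigned  iterations;    /* e.g. 16                           */
        unsigned  exec_cost;     /* e.g. 200                          */
        void     *stmt;          /* reference to the source statement */
        HTGNode **children;      /* non-empty for hierarchical nodes  */
        size_t    n_children;
    };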
Parallelization Methodology
1. Extract hierarchical task graph from ANSI C source code
2. Parallelize nodes bottom-up
3. Transform to ILP
4. Solve ILP for different task limitations
5. Transform solutions to HTG
6. Attach results and continue with other nodes

Example source code:

    int main() {
        for (i = 0; i < NUMAV; ++i) {
            int index = i * DELTA;
            for (int j = 0; j < SLICE; ++j) {
                sample_real[j] = input_signal[index + j] * hamming[j];
                sample_imag[j] = zero;
            }
        }
    }

[Figure: the HTG extracted from the source (hierarchical nodes 1-6 with communication in-/out-nodes), its ILP representation, and the resulting task assignment T1 / T2.]
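Steps 2-6 suggest a recursive, bottom-up driver. A sketch in C, reusing the HTGNode type (and <stddef.h>) from the sketch above; build_ilp, solve_ilp, and attach_solution are hypothetical placeholders, not the framework's actual API:

    /* Bottom-up parallelization driver (steps 2-6).
     * All helper functions are hypothetical placeholders. */
    typedef struct ILP ILP;            /* opaque ILP instance         */
    typedef struct Solution Solution;  /* one task-assignment result  */

    ILP      *build_ilp(HTGNode *node);              /* step 3   */
    Solution *solve_ilp(ILP *ilp, int task_limit);   /* step 4   */
    void      attach_solution(HTGNode *node, int task_limit,
                              Solution *s);          /* steps 5-6 */

    #define MAX_TASKS 4

    void parallelize(HTGNode *node) {
        /* Step 2: recurse first so inner levels are solved bottom-up. */
        for (size_t c = 0; c < node->n_children; ++c)
            parallelize(node->children[c]);

        /* Step 3: turn this hierarchical level into an ILP. */
        ILP *ilp = build_ilp(node);

        /* Step 4: solve once per task limit (e.g. 2..4 tasks), so the
         * enclosing level can later pick the variant that fits best. */
        for (int limit = 2; limit <= MAX_TASKS; ++limit)
            attach_solution(node, limit, solve_ilp(ilp, limit)); /* 5-6 */
    }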
Results

[Figure: speedups for 2 / 3 / 4 cores (y-axis: 1.0 to 4.0) over benchmarks including boundary value, adpcm encoder, spectral, H.263, edge detect, compress, and the average.]

- Cycle-accurate simulator: MPARM
- OS: RTEMS + runtime library
- Measured: execution time of the application without OS initialization
- Speedup of up to 1.9x / 2.9x / 3.7x on 2 / 3 / 4 cores
- Average speedup: 1.8x / 2.2x / 2.7x
- This work was part of the MNEMEE European FP7 project
ILP-based Loop-Level Parallelization
- The HTG is both a blessing and a curse
- Problem: hierarchy hides dependencies between different loop iterations
  => use a Program Dependence Graph (PDG) instead
- Build a PDG for each loop and try to parallelize it
- Finer-grained approach than task-level parallelism

[Figure: PDG of the spectral main loop: Entry; i = 0; i < NUMAV; index = i * DELTA; the windowing loop (sample_real[j] = input_signal[index + j] * hamming[j]; sample_imag[j] = zero); fft(sample_real, sample_imag); the accumulation loop (mag[j] = mag[j] + (((sample_real[j] * sample_real[j]) + (sample_imag[j] * sample_imag[j])) / SLICE_2)); ++i; Exit. Node annotations: iteration count (16), execution cost (200), reference to the source statement. Edge annotations: edge type (RAW), communication cost (64), communicated data ([i]), iteration count (16), interleaving level (1).]
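To see what the hierarchy hides, consider the mag[] update in the loop above: every iteration of i reads the values the previous iteration wrote, a RAW dependence carried by the outer loop (the "interleaving level 1" in the edge annotation). A minimal, self-contained illustration; power[][] is a stand-in for the per-iteration FFT magnitudes, and the constants are assumed values:

    /* The dependence the HTG hierarchy hides: mag[j] is read and
     * written in every iteration of i, so iteration i+1 reads what
     * iteration i wrote - a RAW dependence carried by the outer loop.
     * The PDG makes this explicit via the edge's interleaving level. */
    #define NUMAV 16
    #define SLICE 256

    float mag[SLICE];

    void accumulate(const float power[NUMAV][SLICE]) {
        for (int i = 0; i < NUMAV; ++i)        /* carries the dependence */
            for (int j = 0; j < SLICE; ++j)
                mag[j] = mag[j] + power[i][j]; /* RAW on mag[j] across i */
    }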
Parallelization example – Sequential execution
- Main loop of the spectral benchmark
- All iterations executed sequentially

    for (i = 0; i < NUMAV; ++i) {
        float sample_real[SLICE];
        float sample_imag[SLICE];
        int index = i * DELTA;
        for (int j = 0; j < SLICE; ++j) {
            sample_real[j] = input_signal[index + j] * hamming[j];
            sample_imag[j] = zero;
        }
        fft(sample_real, sample_imag);
        for (int j = 0; j < SLICE; ++j) {
            mag[j] = mag[j] + (((sample_real[j] * sample_real[j])
                   + (sample_imag[j] * sample_imag[j])) / SLICE_2);
        }
    }

[Figure: timeline of a single task T1 executing iterations 1, 2, 3, ... back to back (t0 ... t10).]
Parallelization example – Horizontal splits
- Split statements into disjunctive parts
- Creates a pipeline of calculations
- Per task: execute one iteration, communicate the data, continue with the next iteration

Task assignment (T1 / T2 / T3):

    for (i = 0; i < NUMAV; ++i) {
        float sample_real[SLICE];
        float sample_imag[SLICE];
        int index = i * DELTA;
        for (int j = 0; j < SLICE; ++j) {                /* T1 */
            sample_real[j] = input_signal[index + j] * hamming[j];
            sample_imag[j] = zero;
        }
        fft(sample_real, sample_imag);                   /* T2 */
        for (int j = 0; j < SLICE; ++j) {                /* T3 */
            mag[j] = mag[j] + (((sample_real[j] * sample_real[j])
                   + (sample_imag[j] * sample_imag[j])) / SLICE_2);
        }
    }

[Figure: pipelined timeline: T1, T2, and T3 work on consecutive iterations in parallel (iterations 1-11 over t0 ... t10).]
Parallelization example – Horizontal splits (continued)
- The outer loop is copied to each task (T1, T2, T3)
- Each task executes the first iteration, communicates the data to its successor (e.g. T1 to T2), and starts with the next iteration
- Each task executes only those statements mapped to it

[Figure: per-task timelines plus the per-task code listings; the listings ("Task T1: { ... for (i = 0; i ...") are cut off in the source.]
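The listing above is truncated in the source, but the bullets describe the resulting structure: each task owns a copy of the outer loop, keeps only its mapped statements, and passes buffers downstream. A sketch of what the three generated tasks could look like, using hypothetical blocking channel_send/channel_recv primitives (the MNEMEE runtime's actual communication API is not shown in the slides, and the constants are assumed values):

    #include <stddef.h>

    #define NUMAV   16            /* assumed values, not from the slides */
    #define SLICE   256
    #define DELTA   128
    #define SLICE_2 (SLICE / 2)

    /* Hypothetical blocking channel primitives. */
    void channel_send(int channel, const void *buf, size_t len);
    void channel_recv(int channel, void *buf, size_t len);

    enum { CH_T1_T2, CH_T2_T3 };

    extern float input_signal[NUMAV * DELTA + SLICE];
    extern float hamming[SLICE];
    extern float mag[SLICE];
    extern const float zero;
    void fft(float *re, float *im);

    /* T1: windowing stage - outer loop copied, only T1's statements kept. */
    void task_T1(void) {
        float sample_real[SLICE], sample_imag[SLICE];
        for (int i = 0; i < NUMAV; ++i) {
            int index = i * DELTA;
            for (int j = 0; j < SLICE; ++j) {
                sample_real[j] = input_signal[index + j] * hamming[j];
                sample_imag[j] = zero;
            }
            channel_send(CH_T1_T2, sample_real, sizeof sample_real);
            channel_send(CH_T1_T2, sample_imag, sizeof sample_imag);
        }                                     /* continue with next i */
    }

    /* T2: FFT stage. */
    void task_T2(void) {
        float sample_real[SLICE], sample_imag[SLICE];
        for (int i = 0; i < NUMAV; ++i) {
            channel_recv(CH_T1_T2, sample_real, sizeof sample_real);
            channel_recv(CH_T1_T2, sample_imag, sizeof sample_imag);
            fft(sample_real, sample_imag);
            channel_send(CH_T2_T3, sample_real, sizeof sample_real);
            channel_send(CH_T2_T3, sample_imag, sizeof sample_imag);
        }
    }

    /* T3: accumulation stage - keeps the loop-carried mag[] update local. */
    void task_T3(void) {
        float sample_real[SLICE], sample_imag[SLICE];
        for (int i = 0; i < NUMAV; ++i) {
            channel_recv(CH_T2_T3, sample_real, sizeof sample_real);
            channel_recv(CH_T2_T3, sample_imag, sizeof sample_imag);
            for (int j = 0; j < SLICE; ++j)
                mag[j] = mag[j] + (((sample_real[j] * sample_real[j])
                       + (sample_imag[j] * sample_imag[j])) / SLICE_2);
        }
    }

With blocking channels, T1 can already compute iteration i+1 while T2 runs the FFT for iteration i and T3 accumulates iteration i-1, which is exactly the pipeline behavior shown in the timeline figure.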