Automatic Extraction of Parallelism for Embedded Software
Daniel Cordes, Informatik Centrum Dortmund, Germany
June 29th, 2011
Outline
1. Motivation / State-of-the-art / Ideas
2. ILP-based Task-Level Parallelization
3. ILP-based Loop-Level Parallelization
4. Conclusion & Future work
Motivation
Using multiple cores in one system can...
- reduce CPU frequencies / execution time
- save energy
- enable further optimizations

Problem
Most embedded applications are written in sequential C, and splitting a program into tasks manually is...
- error-prone
- time-consuming

=> Automatic parallelization is beneficial
State-of-the-art
Many semi- or fully automatic parallelization frameworks exist. Common characteristics:
- Detailed cost models are rarely used, so it is hard to determine whether a parallelization is beneficial
- Due to complexity, (greedy or other) heuristics are often used
- Most frameworks are not optimized for embedded devices
Idea
Focus on the characteristics of embedded system applications and architectures:
- Often streaming-oriented applications
- In general: limited OS support compared to host architectures
- Non-unified memory architectures => more expensive communication

Use integer linear programming (ILP) for parallelization decisions:
- Clear, mathematical description of the problem
- Optimal with respect to its model
Outline
1. Motivation / State-of-the-art / Ideas
2. ILP-based Task-Level Parallelization *
3. ILP-based Loop-Level Parallelization
4. Conclusion & Future work

* D. Cordes, P. Marwedel, A. Mallik: "Automatic parallelization of embedded software using hierarchical task graphs and integer linear programming", CODES/ISSS '10
ILP-based Task-Level Parallelization
- Focus on the characteristics of embedded system applications and architectures
- Reduce the search space by introducing hierarchy into the task graph model
- Use integer linear programming (ILP) for parallelization decisions (see the sketch below)
- Use an adequate cost model to balance tasks

Characteristics
- Coarse-grained parallelization technique
- Possible to limit the number of concurrently executed tasks
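As a rough illustration of what such a model can look like (a simplified sketch, not the actual formulation from the CODES/ISSS '10 paper cited above): assign the $n$ nodes of one hierarchical level to at most $T$ tasks such that the makespan $C_{\max}$ is minimized:

\begin{aligned}
\min\;\; & C_{\max} \\
\text{s.t.}\;\; & \sum_{t=1}^{T} x_{i,t} = 1 && \forall i \in \{1,\dots,n\} \\
& \sum_{i=1}^{n} e_i\,x_{i,t} + \mathit{comm}_t \le C_{\max} && \forall t \in \{1,\dots,T\} \\
& x_{i,t} \in \{0,1\}
\end{aligned}

Here $x_{i,t} = 1$ iff node $i$ is mapped to task $t$, $e_i$ is the node's execution cost, and $\mathit{comm}_t$ is the communication cost incurred by task $t$ (in a real model itself a linear expression over the $x_{i,t}$, derived from the annotated edge costs).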
Hierarchical Task Graph
- Search space too large for flat task graphs when using ILP
- Therefore: introduction of hierarchy
  - Only a small number of nodes per hierarchical level
  - Communication redirected through communication in- and out-nodes
  - Decisions can be made locally
- Automatically extracted from source code
- Annotated with communication / execution costs
[Figure: hierarchical task graph consisting of simple nodes, hierarchical nodes, and communication in-/out-nodes. Node annotations: iteration count (16), execution cost (200), reference to the source statement. Edge annotations: edge type (RAW), communication cost (64), communicated data ([i]), iteration count (8).]
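The annotations shown in the figure map naturally onto a graph data structure. A minimal C sketch of what such node and edge records could look like (all type and field names are illustrative assumptions, not the framework's actual types):

    /* Sketch of HTG node/edge records matching the annotations in the
     * figure above. All names are illustrative, not the framework's. */
    #include <stddef.h>

    typedef enum { SIMPLE_NODE, HIERARCHICAL_NODE,
                   COMM_IN_NODE, COMM_OUT_NODE } NodeKind;
    typedef enum { RAW_DEP, WAR_DEP, WAW_DEP } EdgeType;

    typedef struct HTGNode HTGNode;

    typedef struct HTGEdge {
        EdgeType    type;        /* e.g. RAW                          */
        unsigned    comm_cost;   /* e.g. 64                           */
        const char *data;        /* communicated data, e.g. "[i]"     */
        unsigned    iterations;  /* e.g. 8                            */
        HTGNode    *src, *dst;
    } HTGEdge;

    struct HTGNode {
        NodeKind  kind;
        unsigned  iterations;    /* e.g. 16                           */
        unsigned  exec_cost;     /* e.g. 200                          */
        void     *stmt;          /* reference to the source statement */
        HTGNode **children;      /* non-empty for hierarchical nodes  */
        size_t    n_children;
    };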
Parallelization Methodology
1. Extract hierarchical task graph from ANSI C source code
2. Parallelize nodes bottom-up
3. Transform to ILP
4. Solve ILP for different task limitations
5. Transform solutions to HTG
6. Attach results and continue with other nodes

Example source code:

    int main() {
        for (i = 0; i < NUMAV; ++i) {
            int index = i * DELTA;
            for (int j = 0; j < SLICE; ++j) {
                sample_real[j] = input_signal[index + j] * hamming[j];
                sample_imag[j] = zero;
            }
        }
    }

[Figure: the HTG extracted from the source (hierarchical nodes 1-6 with communication in-/out-nodes), its ILP representation, and the resulting task assignment T1 / T2.]
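Steps 2-6 suggest a recursive, bottom-up driver. A sketch in C, reusing the HTGNode type (and <stddef.h>) from the sketch above; build_ilp, solve_ilp, and attach_solution are hypothetical placeholders, not the framework's actual API:

    /* Bottom-up parallelization driver (steps 2-6).
     * All helper functions are hypothetical placeholders. */
    typedef struct ILP ILP;            /* opaque ILP instance         */
    typedef struct Solution Solution;  /* one task-assignment result  */

    ILP      *build_ilp(HTGNode *node);              /* step 3   */
    Solution *solve_ilp(ILP *ilp, int task_limit);   /* step 4   */
    void      attach_solution(HTGNode *node, int task_limit,
                              Solution *s);          /* steps 5-6 */

    #define MAX_TASKS 4

    void parallelize(HTGNode *node) {
        /* Step 2: recurse first so inner levels are solved bottom-up. */
        for (size_t c = 0; c < node->n_children; ++c)
            parallelize(node->children[c]);

        /* Step 3: turn this hierarchical level into an ILP. */
        ILP *ilp = build_ilp(node);

        /* Step 4: solve once per task limit (e.g. 2..4 tasks), so the
         * enclosing level can later pick the variant that fits best. */
        for (int limit = 2; limit <= MAX_TASKS; ++limit)
            attach_solution(node, limit, solve_ilp(ilp, limit)); /* 5-6 */
    }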
Results

[Figure: speedups for 2 / 3 / 4 cores (y-axis: 1.0 to 4.0) over benchmarks including boundary value, adpcm encoder, spectral, H.263, edge detect, compress, and the average.]

- Cycle-accurate simulator: MPARM
- OS: RTEMS + runtime library
- Measured: execution time of the application without OS initialization
- Speedup of up to 1.9x / 2.9x / 3.7x on 2 / 3 / 4 cores
- Average speedup: 1.8x / 2.2x / 2.7x
- This work was part of the MNEMEE European FP7 project
ILP-based Loop-Level Parallelization
- The HTG is both a blessing and a curse
- Problem: hierarchy hides dependencies between different loop iterations
  => use a Program Dependence Graph (PDG) instead
- Build a PDG for each loop and try to parallelize it
- Finer-grained approach than task-level parallelism

[Figure: PDG of the spectral main loop: Entry; i = 0; i < NUMAV; index = i * DELTA; the windowing loop (sample_real[j] = input_signal[index + j] * hamming[j]; sample_imag[j] = zero); fft(sample_real, sample_imag); the accumulation loop (mag[j] = mag[j] + (((sample_real[j] * sample_real[j]) + (sample_imag[j] * sample_imag[j])) / SLICE_2)); ++i; Exit. Node annotations: iteration count (16), execution cost (200), reference to the source statement. Edge annotations: edge type (RAW), communication cost (64), communicated data ([i]), iteration count (16), interleaving level (1).]
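To see what the hierarchy hides, consider the mag[] update in the loop above: every iteration of i reads the values the previous iteration wrote, a RAW dependence carried by the outer loop (the "interleaving level 1" in the edge annotation). A minimal, self-contained illustration; power[][] is a stand-in for the per-iteration FFT magnitudes, and the constants are assumed values:

    /* The dependence the HTG hierarchy hides: mag[j] is read and
     * written in every iteration of i, so iteration i+1 reads what
     * iteration i wrote - a RAW dependence carried by the outer loop.
     * The PDG makes this explicit via the edge's interleaving level. */
    #define NUMAV 16
    #define SLICE 256

    float mag[SLICE];

    void accumulate(const float power[NUMAV][SLICE]) {
        for (int i = 0; i < NUMAV; ++i)        /* carries the dependence */
            for (int j = 0; j < SLICE; ++j)
                mag[j] = mag[j] + power[i][j]; /* RAW on mag[j] across i */
    }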
Parallelization example – Sequential execution
- Main loop of the spectral benchmark
- All iterations executed sequentially

    for (i = 0; i < NUMAV; ++i) {
        float sample_real[SLICE];
        float sample_imag[SLICE];
        int index = i * DELTA;
        for (int j = 0; j < SLICE; ++j) {
            sample_real[j] = input_signal[index + j] * hamming[j];
            sample_imag[j] = zero;
        }
        fft(sample_real, sample_imag);
        for (int j = 0; j < SLICE; ++j) {
            mag[j] = mag[j] + (((sample_real[j] * sample_real[j])
                   + (sample_imag[j] * sample_imag[j])) / SLICE_2);
        }
    }

[Figure: timeline of a single task T1 executing iterations 1, 2, 3, ... back to back (t0 ... t10).]
Parallelization example – Horizontal splits
- Split statements into disjunctive parts
- Creates a pipeline of calculations
- Per task: execute one iteration, communicate the data, continue with the next iteration

Task assignment (T1 / T2 / T3):

    for (i = 0; i < NUMAV; ++i) {
        float sample_real[SLICE];
        float sample_imag[SLICE];
        int index = i * DELTA;
        for (int j = 0; j < SLICE; ++j) {                /* T1 */
            sample_real[j] = input_signal[index + j] * hamming[j];
            sample_imag[j] = zero;
        }
        fft(sample_real, sample_imag);                   /* T2 */
        for (int j = 0; j < SLICE; ++j) {                /* T3 */
            mag[j] = mag[j] + (((sample_real[j] * sample_real[j])
                   + (sample_imag[j] * sample_imag[j])) / SLICE_2);
        }
    }

[Figure: pipelined timeline: T1, T2, and T3 work on consecutive iterations in parallel (iterations 1-11 over t0 ... t10).]
Parallelization example – Horizontal splits (continued)
- The outer loop is copied to each task (T1, T2, T3)
- Each task executes the first iteration, communicates the data to its successor (e.g. T1 to T2), and starts with the next iteration
- Each task executes only those statements mapped to it

[Figure: per-task timelines plus the per-task code listings; the listings ("Task T1: { ... for (i = 0; i ...") are cut off in the source.]
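The listing above is truncated in the source, but the bullets describe the resulting structure: each task owns a copy of the outer loop, keeps only its mapped statements, and passes buffers downstream. A sketch of what the three generated tasks could look like, using hypothetical blocking channel_send/channel_recv primitives (the MNEMEE runtime's actual communication API is not shown in the slides, and the constants are assumed values):

    #include <stddef.h>

    #define NUMAV   16            /* assumed values, not from the slides */
    #define SLICE   256
    #define DELTA   128
    #define SLICE_2 (SLICE / 2)

    /* Hypothetical blocking channel primitives. */
    void channel_send(int channel, const void *buf, size_t len);
    void channel_recv(int channel, void *buf, size_t len);

    enum { CH_T1_T2, CH_T2_T3 };

    extern float input_signal[NUMAV * DELTA + SLICE];
    extern float hamming[SLICE];
    extern float mag[SLICE];
    extern const float zero;
    void fft(float *re, float *im);

    /* T1: windowing stage - outer loop copied, only T1's statements kept. */
    void task_T1(void) {
        float sample_real[SLICE], sample_imag[SLICE];
        for (int i = 0; i < NUMAV; ++i) {
            int index = i * DELTA;
            for (int j = 0; j < SLICE; ++j) {
                sample_real[j] = input_signal[index + j] * hamming[j];
                sample_imag[j] = zero;
            }
            channel_send(CH_T1_T2, sample_real, sizeof sample_real);
            channel_send(CH_T1_T2, sample_imag, sizeof sample_imag);
        }                                     /* continue with next i */
    }

    /* T2: FFT stage. */
    void task_T2(void) {
        float sample_real[SLICE], sample_imag[SLICE];
        for (int i = 0; i < NUMAV; ++i) {
            channel_recv(CH_T1_T2, sample_real, sizeof sample_real);
            channel_recv(CH_T1_T2, sample_imag, sizeof sample_imag);
            fft(sample_real, sample_imag);
            channel_send(CH_T2_T3, sample_real, sizeof sample_real);
            channel_send(CH_T2_T3, sample_imag, sizeof sample_imag);
        }
    }

    /* T3: accumulation stage - keeps the loop-carried mag[] update local. */
    void task_T3(void) {
        float sample_real[SLICE], sample_imag[SLICE];
        for (int i = 0; i < NUMAV; ++i) {
            channel_recv(CH_T2_T3, sample_real, sizeof sample_real);
            channel_recv(CH_T2_T3, sample_imag, sizeof sample_imag);
            for (int j = 0; j < SLICE; ++j)
                mag[j] = mag[j] + (((sample_real[j] * sample_real[j])
                       + (sample_imag[j] * sample_imag[j])) / SLICE_2);
        }
    }

With blocking channels, T1 can already compute iteration i+1 while T2 runs the FFT for iteration i and T3 accumulates iteration i-1, which is exactly the pipeline behavior shown in the timeline figure.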