Complementing User-Level Coarse-Grain Parallelism with Implicit Speculative Parallelism Nikolas Ioannou, Marcelo Cintra
School of Informatics University of Edinburgh
Introduction
Multi-cores and many-cores are here to stay
Source: Intel
Intl. Symp. on Microarchitecture - December 2011
Introduction
Multi-cores and many-cores are here to stay
Parallel programming is essential to realize their potential
Focus on coarse-grain parallelism
Weak or no scaling of some parallel applications
Can we exploit under-utilized cores to complement coarse-grain parallelism?
– Nested parallelism in multi-threaded applications
– Exploit it using implicit speculative parallelism
Contributions
Evaluation of implicit speculative parallelism on top of explicit parallelism:
– Improves scalability by 40% on avg.
– Same energy consumption
Detailed analysis of multithreaded scalability:
– Performance bottlenecks
– Behavior on different input datasets
Auto-tuning to dynamically select the number of explicit and implicit threads
Outline
Introduction
Motivation
Proposal
Evaluation Methodology
Results
Conclusions
Bottlenecks: Large Critical Sections
[Timeline: threads T0–T3 over time, serialized by a large critical section — Integer Sort (IS), NAS Parallel Benchmarks]
[Graphs: speedup vs. cores (0–64); normalized execution time broken into Busy/Lock/Barrier components for 2–64 cores]
Bottlenecks: Load Imbalance
[Timeline: threads T0–T3 over time, with idle gaps from uneven work — RADIOSITY, SPLASH-2]
[Graphs: speedup vs. cores (0–128, y-axis up to 20); normalized execution time broken into Busy/Lock/Barrier components for 2–128 cores]
Can we use these cores to accelerate this app.?
Outline
Introduction
Motivation
Proposal – Low power nested parallelism
Evaluation Methodology
Results
Conclusions
Proposal
Programming:
– Users explicitly parallelize code
– Trade off development time for performance gains
Architecture and Compiler:
– Exploit fine-grain parallelism on top of user threads
– Thread-Level Speculation (TLS) within each user thread
Hardware:
– Support both explicit and implicit threads simultaneously, in a nested fashion
Proposal

#pragma omp parallel for
for (j = 0; j < M; ++j) {
  …
  for (i = 0; i < N; ++i) {
    … = A[L[i]] + …
    …
    A[K[i]] = …
  }
  …
}

[Figure: each explicit thread (T0, …, TK, TL, TM) executes an outer-loop iteration; explicit threads speculatively spawn implicit threads for successive inner-loop iterations — TK,i, TK,i+1, TK,i+2, TK,i+3 under TK, and likewise TL,i … TL,i+3 under TL]
Proposal: Many-core Architecture
Many-core partitioned in clusters (tiles)
Coherence (MESI):
– Snooping coherence within cluster
– Directory coherence across clusters
Support for TLS only within cluster:
– Snooping TLS protocol
– Speculative buffering in L1 data caches
[Figure — Proposal: Many-core Architecture. 32 cores (T0–T31) arranged in a grid, with memory controllers on the chip edges]
[Figure — Proposal: Many-core Architecture (detail). Cores grouped into clusters (C0–C3); each core has private L1 instruction and data caches (IC/DC); each cluster shares an L2 cache and a directory/router; memory controllers on the chip edges]
Complementing Coarse-Grain Parallelism
[Timelines: baseline with explicit threads T0–T3; doubling to eight explicit threads (2x Explicit Threads, T0–T7); and four explicit threads each augmented with an implicit speculative thread (4 ETs + 4 ISTs)]
Expected Speedup Behavior
[Graph: speedup vs. cores (1–64) for Baseline, 2-way TLS, and 4-way TLS. Region A: baseline fastest; region B: 2-way TLS fastest; region C: 4-way TLS fastest]
Proposal: Auto-Tuning the Thread Count
Find the scalability tipping point dynamically
Choose whether to employ implicit threads
Simple hill-climbing approach
Applicable to OpenMP applications that are amenable to Dynamic Concurrency Throttling (DCT) [Curtis-Maury PACT'08]
Developed a prototype in the Omni OpenMP system
Auto-tuning example

(same loop as before)
#pragma omp parallel for
for (j = 0; j < M; ++j) { … }

Learning phase, each time parallel region i is detected:
– First time: can we compute the iteration count statically, and is it less than the max core count? M = 32, so yes → set initial Tcount to 32; measure execution time ti1
– Next invocation: set Tcount to the next value (16); measure execution time ti2; ti2 < ti1 → continue exploration
– Next invocation: set Tcount to the next value (8); measure execution time ti3; ti3 > ti2 → stop exploration
– Thereafter: use Tcount = 16, no further exploration; set TLS to 4-way
Evaluation Methodology
SESC simulator, extended to model our scheme
Architecture:
– Core: 4-issue OoO superscalar, 96-entry ROB, 3GHz; 32KB 4-way L1 D-cache; 32KB 2-way L1 I-cache; 16Kbit hybrid branch predictor
– Tile/System: 128 cores partitioned into 2-way or 4-way tiles (both evaluated); shared L2 cache, 8MB, 8-way, 64 MSHRs; directory with full bit-vector sharer list; grid interconnect, 64B links; 48GB/s to main memory
Evaluation Methodology
Benchmarks:
– 12 workloads from PARSEC 2.1, SPLASH-2, NASPB
– Parallel regions simulated to completion
Compilation:
– MIPS binaries generated using GCC 3.4.4
– Speculation added automatically through a source-to-source compiler
– Speculation regions selected through manual profiling
Power:
– CACTI 4.2 and Wattch
Evaluation Methodology
Alternative schemes compared against:
– Core Fusion [Ipek ISCA'07]: dynamic combination of cores to handle lowly-threaded apps; approximated through wide 8-issue cores with all core resources doubled and no latency increase → an upper bound
– Frequency Boost: inspired by Turbo Boost [Intel'08]; for each idle core, one other core gains an 800MHz frequency boost with a 200mV increase in voltage (same power cap)
All these schemes shift resources to a subset of cores in order to improve performance
Bottom Line
Speedup over best scalability point
[Bar chart: speedup (0.8–2.0) of TLS-2, TLS-4, CFusion, and FBoost over each benchmark's best baseline scalability point, for bodytrack, canneal, streamcluster, swaptions, cholesky, ocean, radiosity, water, ep, ft, is, sp, and the average]
TLS-4: 41% avg; TLS-2: 27% avg
Energy
Showing best performing point for each scheme
[Bar chart: normalized energy (0.6–2.0) of 2TLS, 4TLS, CFusion, and FBoost for bodytrack, canneal, streamcluster, swaptions, cholesky, ocean-ncp, radiosity, water-nsquared, ep, ft, is, sp, and the average]
Energy consumption slightly lower on avg: less time spent in busy synchronization
High misspeculation: higher energy
Little synchronization: higher energy
Serial/Critical Sections
is (NASPB)
[Graphs: speedup vs. cores (0–120) for base, TLS-2, TLS-4, FBoost, and CFusion; normalized execution time broken into Busy/Lock/Barrier components for 2–128 cores]
Load Imbalance
radiosity (SPLASH-2)
[Graphs: speedup vs. cores (0–120) for base, TLS-2, TLS-4, FBoost, and CFusion; normalized execution time broken into Busy/Lock/Barrier components for 2–128 cores]
Synchronization Heavy
ocean (SPLASH-2)
[Graphs: speedup vs. cores (0–120) for base, TLS-2, TLS-4, FBoost, and CFusion; normalized execution time broken into Busy/Lock/Barrier components for 2–128 cores]
Coarse-Grain Partitioning
swaptions (PARSEC)
[Graphs: speedup vs. cores (0–120) for base, TLS-2, TLS-4, FBoost, and CFusion; normalized execution time broken into Busy/Lock/Barrier components for 2–128 cores]
Poor Static Partitioning
sp (NASPB)
[Graphs: speedup vs. cores (0–120) for base, TLS-2, TLS-4, FBoost, and CFusion; normalized execution time broken into Busy/Lock/Barrier components for 2–128 cores]
Effect of Dataset size
Unchanged behavior: cholesky (also: canneal, ocean, ft, is, sp)
[Graph: speedup vs. cores (10–120) for base, baseL, TLS-2, TLS-2L, TLS-4, TLS-4L]
Effect of Dataset size
Improved scalability, but TLS boost remains: swaptions (also: bodytrack, radiosity, ep)
[Graph: speedup vs. cores (0–120) for base, baseL, TLS-2, TLS-2L, TLS-4, TLS-4L]
Effect of Dataset size
Improved scalability, lessened TLS boost: streamcluster
[Graph: speedup vs. cores (30–120) for base, baseL, TLS-2, TLS-2L, TLS-4, TLS-4L]
Effect of Dataset size
Worse scalability, even better TLS boost: water
[Graph: speedup vs. cores (10–120) for base, baseL, TLS-2, TLS-2L, TLS-4, TLS-4L]
Conclusions
Multi-cores and many-cores are here to stay:
– Parallel programming is essential to exploit new hardware
– Some coarse-grain parallel programs do not scale
– There is enough nested parallelism to improve scalability
Proposed speculative parallelization through implicit speculative threads on top of explicit threads:
– Significant scalability improvement of 40% on avg.
– No increase in total energy consumption
– Presented an auto-tuning mechanism that dynamically chooses the number of threads and performs within 6% of the oracle
Related Work
[von Praun PPoPP'07] Implicit ordered transactions
[Kim Micro'10] Speculative parallel-stage decoupled software pipelining
[Ooi ICS'01] Multiplex
[Madriles ISCA'09] Anaphase
[Rajwar MICRO'01], [Martinez ASPLOS'02] Speculative lock elision, speculative synchronization
[Moravan ASPLOS'06], etc. Nested transactional memory
Bibliography
[Intel'08] Intel Corp. Intel Turbo Boost Technology in Intel Core Microarchitecture (Nehalem) Based Processors. White Paper, 2008.
[Ipek ISCA'07] E. Ipek et al. Core Fusion: Accommodating software diversity in chip multiprocessors. ISCA 2007.
[von Praun PPoPP'07] C. von Praun et al. Implicit parallelism with ordered transactions. PPoPP 2007.
[Kim Micro'10] H. Kim et al. Scalable speculative parallelization on commodity clusters. MICRO 2010.
[Ooi ICS'01] C.-L. Ooi et al. Multiplex: Unifying conventional and speculative thread-level parallelism on a chip multiprocessor. ICS 2001.
[Madriles ISCA'09] C. Madriles et al. Boosting single-thread performance in multi-core systems through fine-grain multi-threading. ISCA 2009.
[Rajwar MICRO'01] R. Rajwar and J. R. Goodman. Speculative lock elision: Enabling highly concurrent multithreaded execution. MICRO 2001.
[Martinez ASPLOS'02] J. Martinez and J. Torrellas. Speculative synchronization: Applying thread-level speculation to explicitly parallel applications. ASPLOS 2002.
[Moravan ASPLOS'06] M. Moravan et al. Supporting nested transactional memory in LogTM. ASPLOS 2006.
[Curtis-Maury PACT'08] M. Curtis-Maury et al. Prediction models for multi-dimensional power-performance optimization on many cores. PACT 2008.
Benchmark details
Fetched Instructions
[Bar chart: normalized total fetched instructions (0.5–1.2) for TLS-2, TLS-4, FBoost, and CFusion across bodytrack, canneal, streamcluster, swaptions, cholesky, ocean-ncp, radiosity, water-nsquared, ep, ft, is, sp, and the average]
Failed Speculation
[Stacked bar chart: normalized execution time (0–100%) split into Restart vs. Busy components for TLS-2 and TLS-4 across bodytrack, canneal, streamcluster, swaptions, cholesky, ocean-ncp, radiosity, water-nsquared, ep, ft, is, sp]