Complementing User-Level Coarse-Grain Parallelism with Implicit Speculative Parallelism

Nikolas Ioannou, Marcelo Cintra

School of Informatics, University of Edinburgh
Intl. Symp. on Microarchitecture (MICRO), December 2011


Introduction
 Multi-cores and many-cores are here to stay
   [Figure omitted; source: Intel]
 Parallel programming is essential to realize their potential
 Focus on coarse-grain parallelism
 Weak or no scaling of some parallel applications
 Can we exploit under-utilized cores to complement coarse-grain parallelism?
  – Nested parallelism in multi-threaded applications
  – Exploit it using implicit speculative parallelism

Contributions
 Evaluation of implicit speculative parallelism on top of explicit parallelism to improve scalability:
  – Improves scalability by 40% on average
  – At the same energy consumption
 Detailed analysis of multithreaded scalability:
  – Performance bottlenecks
  – Behavior on different input datasets
 Auto-tuning to dynamically select the number of explicit and implicit threads

Outline
 Introduction
 Motivation
 Proposal
 Evaluation Methodology
 Results
 Conclusions


Bottlenecks: Large Critical Sections
 Integer Sort (IS), NAS Parallel Benchmarks
 [Figure: execution timeline of threads T0-T3, dominated by waiting on a large critical section]
 [Figure: speedup vs. cores, topping out around 3x even at 64 cores; normalized execution time split into Busy / Lock / Barrier for 2-64 cores]
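For intuition, here is the generic shape of this kind of bottleneck: a histogram-style reduction whose merge sits inside one large critical section, so the serialized portion grows with the thread count and caps the speedup. The sketch is purely illustrative and is not the actual NAS IS source.

/* Illustrative only (not the NAS IS source): each thread builds a private
 * histogram in parallel, then merges it inside one large critical section.
 * The merges execute one thread at a time, so the serialized fraction grows
 * with the thread count and the speedup flattens, as in the figure above. */
#include <stdio.h>

#define NKEYS    (1 << 20)
#define NBUCKETS 1024

static int  keys[NKEYS];
static long hist[NBUCKETS];

static void count_keys(void)
{
    #pragma omp parallel
    {
        long local[NBUCKETS] = {0};          /* per-thread partial counts */

        #pragma omp for
        for (int i = 0; i < NKEYS; ++i)
            local[keys[i] % NBUCKETS]++;

        #pragma omp critical                 /* the large critical section */
        for (int b = 0; b < NBUCKETS; ++b)
            hist[b] += local[b];
    }
}

int main(void)
{
    for (int i = 0; i < NKEYS; ++i)
        keys[i] = i;                         /* illustrative key values */
    count_keys();
    printf("hist[0] = %ld\n", hist[0]);      /* NKEYS / NBUCKETS = 1024 */
    return 0;
}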


Bottlenecks: Load Imbalance
 RADIOSITY, SPLASH-2
 [Figure: execution timeline of threads T0-T3 with uneven amounts of work; speedup vs. cores (y-axis up to 20x, up to 128 cores); normalized execution time split into Busy / Lock / Barrier for 2-128 cores]
 Can we use these under-utilized cores to accelerate this application?
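Again purely for intuition (this is not the RADIOSITY source): when a statically partitioned loop has a few very expensive items, the threads that miss them idle at the implicit barrier while one thread keeps working, which is the idle time the implicit speculative threads are meant to reclaim.

/* Illustrative only: static partitioning of items whose cost is heavily
 * skewed.  The heavy items all land in one thread's contiguous chunk, so the
 * other threads finish early and wait at the implicit barrier, producing the
 * idle regions visible in the timeline above. */
#include <stdio.h>

#define NTASKS 1024

static double cost[NTASKS];      /* per-item work, heavily skewed */
static double result[NTASKS];

static double do_work(double units)
{
    double x = 0.0;
    for (long k = 0; k < (long)(units * 100000.0); ++k)
        x += 1e-9 * (double)k;
    return x;
}

int main(void)
{
    for (int t = 0; t < NTASKS; ++t)
        cost[t] = (t < 16) ? 100.0 : 1.0;    /* a few very heavy items, all
                                                at the front of the range */

    /* schedule(static) hands out contiguous chunks regardless of cost, so
     * the thread holding the heavy items dominates the execution time. */
    #pragma omp parallel for schedule(static)
    for (int t = 0; t < NTASKS; ++t)
        result[t] = do_work(cost[t]);

    printf("result[0] = %f\n", result[0]);
    return 0;
}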


Proposal
 Programming:
  – Users explicitly parallelize the code
  – Trading development time for performance gains
 Architecture and compiler:
  – Exploit fine-grain parallelism on top of the user threads
  – Thread-Level Speculation (TLS) within each user thread
 Hardware:
  – Supports both explicit and implicit threads simultaneously, in a nested fashion

Proposal

#pragma omp parallel for
for (j = 0; j < M; ++j) {
  ...
  for (i = 0; i < N; ++i) {
    ... = A[L[i]] + ...
    ...
    A[K[i]] = ...
  }
  ...
}

 The outer j-loop is explicitly parallelized across user threads T0, ..., TK, TL, ..., TM
 Within each explicit thread, consecutive iterations of the inner i-loop run as implicit speculative threads (TK,i, TK,i+1, TK,i+2, TK,i+3, ...; likewise TL,i, TL,i+1, ...)
 The inner loop cannot safely be parallelized statically: the accesses through A[L[i]] and A[K[i]] may or may not conflict across iterations, which is why the implicit threads are speculative
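To make the pattern concrete, below is a minimal, self-contained C/OpenMP sketch of the loop nest from the slide. The array sizes, the index contents, and the per-j row of A are illustrative assumptions (the slide's schematic code leaves them out); the inner loop simply runs sequentially here, since the implicit speculative threads exist only with the proposed hardware and compiler support.

/* Minimal sketch of the slide's loop nest (illustrative sizes and indices).
 * The outer loop carries the explicit, user-level parallelism; the inner loop
 * is the candidate for implicit speculative parallelization, because the
 * accesses through L[i] and K[i] may or may not conflict across iterations.
 * Without the proposed TLS support it just runs sequentially per thread. */
#include <stdio.h>

#define M 16          /* outer (explicit) iterations: assumed value */
#define N 256         /* inner (speculative) iterations: assumed value */
#define ASIZE (4 * N) /* per-row size of A: assumed value */

static double A[M][ASIZE];   /* one row per outer iteration, so the explicit
                                threads do not race with each other */
static int L[N], K[N];

int main(void)
{
    for (int i = 0; i < N; ++i) {     /* illustrative index patterns */
        L[i] = (3 * i) % ASIZE;
        K[i] = (7 * i + 1) % ASIZE;
    }

    #pragma omp parallel for          /* explicit threads T0 ... TM */
    for (int j = 0; j < M; ++j) {
        /* Each iteration of this inner loop would become an implicit
         * speculative thread (T_j,i, T_j,i+1, ...) under the proposal. */
        for (int i = 0; i < N; ++i) {
            double v = A[j][L[i]] + 1.0;   /* possibly-dependent read  */
            A[j][K[i]] = v;                /* possibly-dependent write */
        }
    }

    printf("A[0][1] = %f\n", A[0][1]);
    return 0;
}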

Proposal: Many-core Architecture
 Many-core partitioned into clusters (tiles)
 Coherence (MESI):
  – Snooping coherence within a cluster
  – Directory coherence across clusters
 Support for TLS only within a cluster:
  – Snooping TLS protocol
  – Speculative buffering in the L1 data caches

Proposal: Many-core Architecture
 [Figure: the chip laid out as a grid of cores (T0-T31 shown) with memory controllers on the chip edges]
 [Figure: cores grouped into clusters (C0-C3); each core has private L1 instruction and data caches (IC/DC), and each cluster adds an L2 cache and a directory/router]
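For intuition about what a snooping TLS protocol with speculative buffering in the L1 data caches has to do, here is a deliberately simplified sketch of the classic TLS dependence check. It shows the textbook idea rather than the paper's exact protocol, and the structure and field names are assumptions.

/* Deliberately simplified: each core tracks which L1 lines it has
 * speculatively read or written.  Stores are snooped inside the cluster;
 * a store that hits a line already speculatively read by a more-speculative
 * thread means that thread consumed stale data, so it is squashed and
 * restarted.  On an in-order commit the bits are cleared and the buffered
 * lines become ordinary dirty data in the L1. */
#include <stdbool.h>
#include <stdio.h>

#define L1_LINES 512                     /* assumed L1 line count */

typedef struct {
    bool spec_read[L1_LINES];            /* line speculatively read        */
    bool spec_write[L1_LINES];           /* line holds speculative data    */
    int  order;                          /* position in speculative order  */
    bool squashed;
} spec_core_t;

/* Called when 'writer' stores to 'line' and the store is snooped by another
 * core 'other' in the same cluster. */
static void snoop_store(const spec_core_t *writer, spec_core_t *other, int line)
{
    if (other->order > writer->order && other->spec_read[line])
        other->squashed = true;          /* RAW violation: restart 'other',
                                            discarding its spec_write lines */
}

int main(void)
{
    spec_core_t t0 = { .order = 0 }, t1 = { .order = 1 };
    t1.spec_read[42] = true;             /* the later thread read line 42    */
    snoop_store(&t0, &t1, 42);           /* the earlier thread now writes it */
    printf("later thread squashed: %d\n", t1.squashed);   /* prints 1 */
    return 0;
}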

Complementing Coarse-Grain Parallelism
 [Figure: execution timelines comparing the baseline explicit threads (T0-T3), 2x Explicit Threads (T0-T7), and 4 ETs + 4 ISTs (four explicit threads, each with four implicit speculative threads)]
 Once the application has stopped scaling, spending the extra cores on 2x explicit threads buys little; spending them on implicit speculative threads instead shortens the work of each explicit thread

Expected Speedup Behavior
 [Figure: expected speedup vs. core count (1-64) for the Baseline, 2-way TLS, and 4-way TLS]
 Region A (few cores): the Baseline is fastest; region B: 2-way TLS takes over once the Baseline stops scaling; region C (many cores): 4-way TLS is fastest

Proposal: Auto-Tuning the Thread Count
 Find the scalability tipping point dynamically
 Choose whether to employ implicit threads
 Simple hill-climbing approach
 Applicable to OpenMP applications that are amenable to Dynamic Concurrency Throttling (DCT) [Curtis-Maury PACT'08]
 Developed a prototype in the Omni OpenMP system

Auto-tuning example
  (the same OpenMP parallel loop as in the Proposal, with outer iteration count M)
 Learning phase, the first time omp parallel region i is detected:
  – Can the iteration count be computed statically, and is it less than the maximum core count? Here M = 32, so yes: set the initial Tcount to 32
  – Measure the execution time ti1
 The next time region i is detected:
  – Set Tcount to the next value (16) and measure ti2
  – ti2 < ti1 → continue exploration
 The next time region i is detected:
  – Set Tcount to the next value (8) and measure ti3
  – ti3 > ti2 → stop exploration
 From then on:
  – Use Tcount = 16, with no further exploration
  – Set TLS to 4-way
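The example above amounts to a small hill-climbing loop inside the OpenMP runtime. The sketch below captures that logic; the hook names (region_init, region_update) and the machine size are illustrative assumptions, not the Omni prototype's actual interface.

/* Sketch of the per-region hill-climbing heuristic walked through above:
 * keep halving the explicit thread count while the measured region time
 * improves, then lock in the best value and hand the freed cores to the
 * implicit speculative threads. */
#include <omp.h>

#define TOTAL_CORES 128              /* assumed machine size */

typedef struct {
    int    tcount;                   /* thread count currently being tried */
    int    best_tcount;
    double best_time;
    int    exploring;
} region_state_t;

static void region_init(region_state_t *r, int static_iters)
{
    int t = TOTAL_CORES;
    if (static_iters > 0 && static_iters < TOTAL_CORES)
        t = static_iters;            /* e.g. M = 32 in the example above */
    r->tcount = r->best_tcount = t;
    r->best_time = -1.0;
    r->exploring = 1;
    omp_set_num_threads(t);
}

/* Called after each execution of the parallel region with its measured time
 * (ti1, ti2, ti3, ... in the example). */
static void region_update(region_state_t *r, double time)
{
    if (!r->exploring)
        return;
    if (r->best_time < 0.0 || time < r->best_time) {
        r->best_time   = time;       /* still improving: keep exploring    */
        r->best_tcount = r->tcount;
        if (r->tcount > 1)
            r->tcount /= 2;          /* try the next, smaller thread count */
        else
            r->exploring = 0;
    } else {
        r->exploring = 0;            /* got worse: stop and revert         */
        r->tcount    = r->best_tcount;
    }
    omp_set_num_threads(r->tcount);
    /* Once exploration stops, the cores freed by the smaller explicit
     * thread count run implicit speculative threads; their number is
     * bounded by the cluster size (2-way or 4-way in the evaluated design). */
}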


Evaluation Methodology
 SESC simulator, extended to model our scheme
 Architecture:
  – Core: 4-issue out-of-order superscalar, 96-entry ROB, 3GHz; 32KB 4-way L1 D-cache; 32KB 2-way L1 I-cache; 16Kbit hybrid branch predictor
  – Tile/System: 128 cores partitioned into 2-way or 4-way tiles (both evaluated); shared 8MB 8-way L2 cache with 64 MSHRs; directory with a full bit-vector sharer list; grid interconnect with 64B links; 48GB/s to main memory

Evaluation Methodology
 Benchmarks:
  – 12 workloads from PARSEC 2.1, SPLASH-2, and the NAS Parallel Benchmarks
  – Parallel regions simulated to completion
 Compilation:
  – MIPS binaries generated with GCC 3.4.4
  – Speculation added automatically by a source-to-source compiler
  – Speculation regions selected through manual profiling
 Power:
  – CACTI 4.2 and Wattch

Evaluation Methodology
 Alternative schemes compared against:
  – Core Fusion [Ipek ISCA'07]: dynamically fuses cores to handle applications with few threads; approximated here by wide 8-issue cores with all core resources doubled and no latency increase, hence an upper bound
  – Frequency Boost: inspired by Turbo Boost [Intel'08]; for each idle core, another core gains an 800MHz frequency boost with a 200mV voltage increase (same power cap)
 All of these schemes shift resources to a subset of the cores in order to improve performance



Bottom Line
 Speedup over each application's best scalability point, for TLS-2, TLS-4, CFusion, and FBoost
 TLS-4: 41% speedup on average; TLS-2: 27% on average
 [Figure: per-benchmark speedups for bodytrack, canneal, streamcluster, swaptions, cholesky, ocean, radiosity, water, ep, ft, is, sp, and the average]

Energy
 Normalized energy, showing the best-performing point for each scheme (2TLS, 4TLS, CFusion, FBoost)
 Energy consumption is slightly lower than the baseline on average: less time is spent in busy synchronization
 Benchmarks with high mispeculation show higher energy
 Benchmarks with little synchronization to begin with also show higher energy
 [Figure: normalized energy per benchmark (bodytrack, canneal, streamcluster, swaptions, cholesky, ocean-ncp, radiosity, water-nsquared, ep, ft, is, sp) and on average]

Serial/Critical Sections
 is (NAS Parallel Benchmarks)
 [Figure: speedup vs. cores (up to 128) for base, TLS-2, TLS-4, FBoost, and CFusion; normalized execution time split into Busy / Lock / Barrier for 2-128 cores]

Load Imbalance
 radiosity (SPLASH-2)
 [Figure: speedup vs. cores (up to 128) for base, TLS-2, TLS-4, FBoost, and CFusion; normalized execution time split into Busy / Lock / Barrier for 2-128 cores]

Synchronization Heavy
 ocean (SPLASH-2)
 [Figure: speedup vs. cores (up to 128) for base, TLS-2, TLS-4, FBoost, and CFusion; normalized execution time split into Busy / Lock / Barrier for 2-128 cores]

Coarse-Grain Partitioning
 swaptions (PARSEC)
 [Figure: speedup vs. cores (up to 128) for base, TLS-2, TLS-4, FBoost, and CFusion; normalized execution time split into Busy / Lock / Barrier for 2-128 cores]

Poor Static Partitioning
 sp (NAS Parallel Benchmarks)
 [Figure: speedup vs. cores (up to 128) for base, TLS-2, TLS-4, FBoost, and CFusion; normalized execution time split into Busy / Lock / Barrier for 2-128 cores]

Effect of Dataset size
 Unchanged behavior: cholesky (also canneal, ocean, ft, is, sp)
 Improved scalability, but the TLS boost remains: swaptions (also bodytrack, radiosity, ep)
 Improved scalability, lessened TLS boost: streamcluster
 Worse scalability, even better TLS boost: water
 [Figures: speedup vs. cores (up to 128) for base, TLS-2, and TLS-4 with the default and larger (L) datasets]


Conclusions
 Multi-cores and many-cores are here to stay
  – Parallel programming is essential to exploit the new hardware
  – Some coarse-grain parallel programs do not scale
  – There is enough nested parallelism to improve scalability
 Proposed speculative parallelization through implicit speculative threads on top of explicit threads:
  – Significant scalability improvement of 40% on average
  – No increase in total energy consumption
  – An auto-tuning mechanism that dynamically chooses the number of threads and performs within 6% of the oracle


Related Work
 [von Praun PPoPP'07] Implicit ordered transactions
 [Kim MICRO'10] Speculative parallel-stage decoupled software pipelining
 [Ooi ICS'01] Multiplex
 [Madriles ISCA'09] Anaphase
 [Rajwar MICRO'01], [Martinez ASPLOS'02] Speculative lock elision and speculative synchronization
 [Moravan ASPLOS'06], etc. Nested transactional memory

Bibliography
 [Intel'08] Intel Corp. Intel Turbo Boost Technology in Intel Core Microarchitecture (Nehalem) Based Processors. White paper, 2008.
 [Ipek ISCA'07] E. Ipek et al. Core Fusion: Accommodating software diversity in chip multiprocessors. ISCA 2007.
 [von Praun PPoPP'07] C. von Praun et al. Implicit parallelism with ordered transactions. PPoPP 2007.
 [Kim MICRO'10] H. Kim et al. Scalable speculative parallelization in commodity clusters. MICRO 2010.
 [Ooi ICS'01] C.-L. Ooi et al. Multiplex: Unifying conventional and speculative thread-level parallelism on a chip multiprocessor. ICS 2001.
 [Madriles ISCA'09] C. Madriles et al. Boosting single-thread performance in multi-core systems through fine-grain multi-threading. ISCA 2009.
 [Rajwar MICRO'01] R. Rajwar and J. R. Goodman. Speculative lock elision: Enabling highly concurrent multithreaded execution. MICRO 2001.
 [Martinez ASPLOS'02] J. Martinez and J. Torrellas. Speculative synchronization: Applying thread-level speculation to explicitly parallel applications. ASPLOS 2002.
 [Moravan ASPLOS'06] M. Moravan et al. Supporting nested transactional memory in LogTM. ASPLOS 2006.
 [Curtis-Maury PACT'08] M. Curtis-Maury et al. Prediction models for multi-dimensional power-performance optimization on many cores. PACT 2008.

Benchmark details
 [Table omitted]

Fetched Instructions
 [Figure: total fetched instructions, normalized to the baseline, for TLS-2, TLS-4, FBoost, and CFusion across bodytrack, canneal, streamcluster, swaptions, cholesky, ocean-ncp, radiosity, water-nsquared, ep, ft, is, sp, and the average]

Failed Speculation
 [Figure: normalized execution time for TLS-2 and TLS-4 per benchmark (bodytrack, canneal, streamcluster, swaptions, cholesky, ocean-ncp, radiosity, water-nsquared, ep, ft, is, sp), split into Busy and Restart (squashed and re-executed work)]