Everything You Always Wanted to Know about Synchronization but Were Afraid to Ask. Tudor David, Rachid Guerraoui, Vasileios Trigonakis


The multi-core revolution

• Big challenges in hardware & software
• Software
  – scalability: ↑ performance by ↑ number of cores

Synchronization is one of the biggest scalability bottlenecks

11/4/2013

Vasileios Trigonakis (EPFL)


Synchronization

• Cannot always be avoided
  – not all applications are embarrassingly parallel
• Synchronization is just an overhead
  – but guarantees correctness
• Scalability of synchronization
  – do not ↓ performance as the number of cores ↑

Scalability of synchronization is key to application scalability

Synchronization is difficult

• Tons of work
  – design of synchronization schemes [ISCA’89, TPDS’90, ASPLOS’91, TOCS’91, PPoPP’01, PODC’95, ICPP’06, SPAA’10, IPDPS’11, PPoPP’12, ATC’12, …]
  – fix synchronization bottlenecks [SOSP’89, HPCA’07, OSDI’99, OSDI’08, SOSP’09, ASPLOS’09, OSR’09, OSDI’10, …]
• Scalability issues?
  – hardware
  – usage of specific atomic operations
  – synchronization algorithm
  – application context
  – workload

Limited understanding of the behavior of synchronization

Take a step back and perform a thorough analysis of synchronization on modern hardware

What is the main source of scalability problems in synchronization?

Answer: Scalability of synchronization is mainly a property of the hardware

Key observations

1. Crossing sockets is a killer
2. Sharing within a socket is necessary but not sufficient
3. Intra-socket uniformity matters
4. Loads & stores can be as expensive as atomic operations
5. Simple locks are powerful

Disclaimer

We do not claim
“Bad synchronization” in software will scale well due to hardware

We claim
“Good synchronization” in software might not scale as expected due to hardware

Analysis method

Hardware processors

Multi-sockets
• AMD Opteron (4x 6172 - 48 cores)
• Intel Xeon (8x E7-8867L - 80 cores)

Single-sockets
• Sun Niagara 2 (8 cores)
• Tilera TILE-Gx36 (36 cores)

Synchronization layers

software
• systems / applications (e.g., hash table)
• software primitives (e.g., locks)

hardware
• atomic operations (e.g., CAS)
• cache coherence (load and store)

Outline

1. Crossing sockets
2. Sharing within a socket
3. Intra-socket uniformity
4. Atomic operations
5. Simple locks

Distance on multi-sockets

Opteron
• Within socket: 40 ns
• Per hop: +40 ns
• Up to 3x more

Xeon
• Within socket: 20 – 40 ns
• Per hop: +50 ns
• Up to 8x more

Crossing sockets is a killer: up to 8x more expensive
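The per-hop figures above can be turned into a rough estimate. The sketch below assumes that latency grows linearly with the number of socket hops and that two hops is the largest distance; both are simplifications made here, not claims from the slides. The names `PLATFORMS`, `estimated_latency_ns`, and `slowdown` are hypothetical. With these assumptions the Opteron numbers reproduce the quoted "up to 3x", while the Xeon's measured "up to 8x" is not reproduced exactly by such a simple model.

```python
# Back-of-envelope model: latency grows linearly with the number of socket hops.
# The per-platform numbers are the ones quoted on the slide; the linear form
# and the two-hop maximum are assumptions made for illustration only.
PLATFORMS = {
    "Opteron": {"within_socket_ns": 40, "per_hop_ns": 40},
    "Xeon": {"within_socket_ns": 20, "per_hop_ns": 50},  # best case of 20-40 ns
}


def estimated_latency_ns(platform: str, hops: int) -> int:
    """Estimated cache-line transfer latency after crossing `hops` sockets."""
    p = PLATFORMS[platform]
    return p["within_socket_ns"] + hops * p["per_hop_ns"]


def slowdown(platform: str, hops: int) -> float:
    """Latency relative to the within-socket case of the same platform."""
    return estimated_latency_ns(platform, hops) / estimated_latency_ns(platform, 0)


if __name__ == "__main__":
    for name in PLATFORMS:
        for hops in range(3):
            print(name, hops, estimated_latency_ns(name, hops), slowdown(name, hops))
```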

Locks on multi-sockets

Locks microbenchmark
• Initialize N locks & T threads
• Each thread repeatedly
  1. Chooses one lock out of N at random
  2. Acquires the lock
  3. Reads and writes the protected data
  4. Releases the lock
• Repeat with 9 different lock algorithms
  – spinlocks, queue-based, hierarchical, mutex
• Report the best total throughput

** Each point is the best result out of 9 lock algorithms

[Figure: throughput (Mops/s) vs. number of threads on Opteron and Xeon. Panels: High contention (4 locks), Low contention (512 locks).]
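The microbenchmark loop described above can be sketched as follows. This is only an illustration of the steps, not the benchmark used for the plots: it uses Python's `threading.Lock` as a stand-in for the nine lock algorithms, the function name `run_lock_microbenchmark` and its parameters are hypothetical, and throughput measured this way reflects the Python interpreter rather than the hardware.

```python
import random
import threading
import time


def run_lock_microbenchmark(num_locks: int, num_threads: int,
                            duration_s: float = 0.2) -> int:
    """Return the total number of critical sections completed by all threads."""
    locks = [threading.Lock() for _ in range(num_locks)]
    protected = [0] * num_locks    # one data word guarded by each lock
    completed = [0] * num_threads  # per-thread operation counters
    stop = threading.Event()

    def worker(tid: int) -> None:
        rng = random.Random(tid)   # per-thread generator, seeded for repeatability
        ops = 0
        while not stop.is_set():
            i = rng.randrange(num_locks)  # 1. choose one lock out of N at random
            with locks[i]:                # 2. acquire (4. release on block exit)
                protected[i] += 1         # 3. read and write the protected data
            ops += 1
        completed[tid] = ops

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    time.sleep(duration_s)
    stop.set()
    for t in threads:
        t.join()

    # Every completed operation incremented exactly one protected word.
    assert sum(protected) == sum(completed)
    return sum(completed)


if __name__ == "__main__":
    for n in (4, 512):  # high vs. low contention, as on the slide
        print(n, "locks:", run_lock_microbenchmark(n, num_threads=4), "operations")
```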

Locks on multi-sockets

** Each point is the best result out of 9 lock algorithms

[Figure: throughput (Mops/s) vs. number of threads on Opteron and Xeon. Panels: High contention (4 locks), Low contention (512 locks).]

Crossing sockets is a killer: big decrease in performance

Hash table on multi-sockets

** Each point is the best result out of 9 lock algorithms

[Figure: throughput (Mops/s) vs. number of threads on Opteron and Xeon. Panels: High contention (12 buckets), Low contention (512 buckets).]

Crossing sockets is a killer
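To make the link between bucket count and contention concrete, here is a minimal sketch of a hash table that protects each bucket with its own lock. The slides do not show the table's implementation, so the per-bucket locking, the class name `BucketLockedHashTable`, and its methods are assumptions for illustration. With 12 buckets, threads collide on the same few locks much more often than with 512.

```python
import threading


class BucketLockedHashTable:
    """Hash table with one lock per bucket (assumed design, for illustration)."""

    def __init__(self, num_buckets: int) -> None:
        self._buckets = [dict() for _ in range(num_buckets)]
        self._locks = [threading.Lock() for _ in range(num_buckets)]

    def _index(self, key) -> int:
        # Fewer buckets means more keys, and more threads, per lock.
        return hash(key) % len(self._buckets)

    def put(self, key, value) -> None:
        i = self._index(key)
        with self._locks[i]:
            self._buckets[i][key] = value

    def get(self, key, default=None):
        i = self._index(key)
        with self._locks[i]:
            return self._buckets[i].get(key, default)

    def remove(self, key) -> bool:
        """Delete `key`; return True if it was present."""
        i = self._index(key)
        with self._locks[i]:
            if key in self._buckets[i]:
                del self._buckets[i][key]
                return True
            return False


if __name__ == "__main__":
    table = BucketLockedHashTable(num_buckets=12)  # high-contention setting
    table.put("a", 1)
    print(table.get("a"), table.remove("a"), table.get("a"))
```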

2. Sharing within a socket

Coherence on multi-sockets

[Figure: cores and directories of the Opteron and the Xeon. Labels: Incomplete directory, Broadcast requests.]

Locality on multi-sockets

Opteron
• Within socket: 40 ns
• Data within a socket
  – served locally (40 ns)
  – broadcast (120 ns)

Xeon
• Within socket: 20 – 40 ns
• Data within a socket
  – served by the LLC (20 – 40 ns)

Sharing within a socket is necessary but not sufficient

Locks on multi-sockets

[Figure: throughput (Mops/s) vs. number of threads on Opteron and Xeon. Panels: High contention (4 locks), Low contention (512 locks).]

Sharing within a socket is not sufficient

Hash table on multi-sockets

[Figure: throughput (Mops/s) vs. number of threads on Opteron and Xeon. Panels: High contention (12 buckets), Low contention (512 buckets).]

Sharing within a socket is necessary but not sufficient

3. Intra-socket uniformity

Distance on single-sockets

Niagara
• Uniform: 23 ns

Tilera
• 1 hop: 40 ns
• Per hop: +2 ns
• Up to 0.5x more

Uniformity is expected to scale better

Locks on single-sockets

[Figure: throughput (Mops/s) vs. number of threads on Niagara and Tilera. Panels: High contention (4 locks), Low contention (512 locks). Scalability annotations on the plots: 3.8x, 2.3x, 25.0x, 21.2x.]

Uniformity leads to up to 70% higher scalability

Hash table on single-sockets

[Figure: throughput (Mops/s) vs. number of threads on Niagara and Tilera. Panels: High contention (12 buckets), Low contention (512 buckets). Scalability annotations on the plots: 20.6x, 25.4x, 6.7x, 10.1x.]

Uniformity leads to up to 50% higher scalability

4. Atomic operations

Atomic ops on local data

[Figure: latency (ns) of Load, Store, and CAS on Opteron and Xeon.]

CAS: an order of magnitude more expensive on local data

Atomic ops on multi-sockets

[Figure: latency (ns) of Load, Store, and CAS on Opteron and Xeon by distance (same socket, one hop, two hops). Plot annotations: 8%, 12%, 17%, 25%, 10%, 35%.]

Loads and stores can be as expensive as atomic operations

5. Simple locks

Hash table – best locks

[Figure: throughput (Mops/s) vs. number of threads on Opteron, Xeon, Niagara, and Tilera, with each data point labeled by its best-performing lock. Panels: High contention (12 buckets), Low contention (512 buckets).]

Best-lock labels as they appear on the plots:
• TICKET TICKET TAS CLH TICKET TAS TAS HTICKET TICKET TICKET TICKET TICKET TAS TAS TAS TAS
• TICKET TAS CLH CLH TAS CLH CLH CLH TICKET TICKET TICKET TICKET TAS TAS TICKET TICKET

Simple locks (TICKET, TAS) are the best in 25 / 32 data points

Simple locks are powerful
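For readers unfamiliar with the two simple locks named in the labels, the sketch below shows the logic of a test-and-set (TAS) lock and a ticket lock. It illustrates the algorithms only: Python has no hardware atomics, so a non-blocking `threading.Lock` acquire stands in for test-and-set and a small guard lock emulates fetch-and-increment. Those stand-ins, and the names `TASLock`, `TicketLock`, and `count_with`, are assumptions of this sketch.

```python
import threading
import time


class TASLock:
    """Test-and-set spinlock: spin until an atomic test-and-set succeeds."""

    def __init__(self) -> None:
        # A non-blocking acquire stands in for the hardware test-and-set.
        self._flag = threading.Lock()

    def acquire(self) -> None:
        while not self._flag.acquire(blocking=False):
            time.sleep(0)  # yield so that the holder can make progress

    def release(self) -> None:
        self._flag.release()


class TicketLock:
    """Ticket lock: threads take a ticket and enter in ticket (FIFO) order."""

    def __init__(self) -> None:
        self._next_ticket = 0
        self._now_serving = 0
        # Guards only the ticket counter, emulating atomic fetch-and-increment.
        self._fai_guard = threading.Lock()

    def acquire(self) -> None:
        with self._fai_guard:
            my_ticket = self._next_ticket
            self._next_ticket += 1
        while self._now_serving != my_ticket:
            time.sleep(0)  # spin (politely) until it is our turn

    def release(self) -> None:
        # Only the lock holder writes _now_serving, so a plain update suffices.
        self._now_serving += 1


def count_with(lock, num_threads: int = 4, iterations: int = 200) -> int:
    """Increment a shared counter under `lock` from several threads."""
    counter = 0

    def worker() -> None:
        nonlocal counter
        for _ in range(iterations):
            lock.acquire()
            try:
                counter += 1
            finally:
                lock.release()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


if __name__ == "__main__":
    print(count_with(TASLock()), count_with(TicketLock()))
```

Both locks keep only a word or two of shared state, which is one way to read "simple" here: there is little to transfer between cores on each handoff.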

Lessons learned

1. Crossing sockets is a killer
   → up to 8x more expensive communication
2. Sharing within a socket is necessary but not sufficient
   → up to 3x more expensive communication
3. Intra-socket uniformity matters
   → up to 70% higher scalability
4. Loads & stores can be as expensive as atomic operations
   → 8 - 35% more expensive on non-locally cached data
5. Simple locks are powerful
   → better in 25 out of 32 data-points on a hash table

Scalability of synchronization is mainly a property of the hardware

Analysis’ space & limitations

• synchronization schemes: locks, message passing, lock-free, combiner approaches, …
• hardware platforms

SSYNC synchronization suite

• systems / applications: ssht, TM2C, Memcached
• software primitives (e.g., locks): libslock, libssmp
• atomic operations (e.g., compare-and-swap) and cache coherence (load and store): ccbench

http://go.epfl.ch/ssync

Thanks!

1. Crossing sockets is a killer
2. Sharing within a socket is necessary but not sufficient
3. Intra-socket uniformity matters
4. Loads & stores can be as expensive as atomic operations
5. Simple locks are powerful

Scalability of synchronization is mainly a property of the hardware

http://go.epfl.ch/ssync