Everything You Always Wanted to Know about Synchronization but Were Afraid to Ask
Tudor David, Rachid Guerraoui, Vasileios Trigonakis
The multi-core revolution
• Big challenges in hardware & software
• Software – scalability: ↑ performance by ↑ number of cores
Synchronization is one of the biggest scalability bottlenecks
11/4/2013 – Vasileios Trigonakis (EPFL)
Synchronization
• Cannot always be avoided – not all applications are embarrassingly parallel
• Synchronization is pure overhead – but it guarantees correctness
• Scalability of synchronization – performance must not ↓ as the number of cores ↑
Scalability of synchronization is key to application scalability
Synchronization is difficult
• Tons of prior work
– design of synchronization schemes [ISCA'89, TPDS'90, ASPLOS'91, TOCS'91, PODC'95, PPoPP'01, ICPP'06, SPAA'10, IPDPS'11, PPoPP'12, ATC'12, …]
– fixing synchronization bottlenecks [SOSP'89, OSDI'99, HPCA'07, OSDI'08, ASPLOS'09, OSR'09, SOSP'09, OSDI'10, …]
• Where do scalability issues come from?
– hardware
– usage of specific atomic operations
– synchronization algorithm
– application context
– workload
Limited understanding of the behavior of synchronization
Approach: take a step back and perform a thorough analysis of synchronization on modern hardware.
Question: what is the main source of scalability problems in synchronization?
Answer: scalability of synchronization is mainly a property of the hardware.
Key observations
1. Crossing sockets is a killer
2. Sharing within a socket is necessary but not sufficient
3. Intra-socket uniformity matters
4. Loads & stores can be as expensive as atomic operations
5. Simple locks are powerful
[Figure: core/socket diagrams illustrating observations 1–3: communication across sockets, sharing through a directory within a socket, and a uniform vs. non-uniform intra-socket topology]
Disclaimer
We do not claim: "bad synchronization" in software will scale well thanks to the hardware.
We do claim: "good synchronization" in software might not scale as expected due to the hardware.
Analysis method
Hardware platforms:
• Multi-sockets
– AMD Opteron (4x 6172 – 48 cores)
– Intel Xeon (8x E7-8867L – 80 cores)
• Single-sockets
– Sun Niagara 2 (8 cores)
– Tilera TILE-Gx36 (36 cores)
Synchronization layers (software → hardware):
• systems / applications (e.g., hash table)
• software primitives (e.g., locks)
• atomic operations (e.g., CAS)
• cache coherence (loads and stores)
Outline
1. Crossing sockets
2. Sharing within a socket
3. Intra-socket uniformity
4. Atomic operations
5. Simple locks
Distance on multi-sockets
• Opteron
– within socket: 40 ns
– per hop: +40 ns
– up to 3x more expensive
• Xeon
– within socket: 20 – 40 ns
– per hop: +50 ns
– up to 8x more expensive
Crossing sockets is a killer: up to 8x more expensive
[Figure: Opteron and Xeon multi-socket topologies, with a directory (dir) per socket]
Locks on multi-sockets
Locks microbenchmark:
• Initialize N locks & T threads
• Each thread repeatedly
1. chooses one lock out of N at random
2. acquires the lock
3. reads and writes the protected data
4. releases the lock
• Repeat with 9 different lock algorithms: spinlocks, queue-based, hierarchical, mutex
• Report the best total throughput (each point is the best result out of the 9 lock algorithms)
[Figure: throughput (Mops/s) vs. threads on Opteron and Xeon, under high contention (4 locks) and low contention (512 locks)]
Locks on multi-sockets
[Figure: throughput (Mops/s) vs. threads on Opteron and Xeon, high contention (4 locks) and low contention (512 locks); each point is the best result out of 9 lock algorithms]
Crossing sockets is a killer: big decrease in performance
Hash table on multi-sockets
[Figure: throughput (Mops/s) vs. threads on Opteron and Xeon, high contention (12 buckets) and low contention (512 buckets); each point is the best result out of 9 lock algorithms]
Crossing sockets is a killer
Outline
1. Crossing sockets
2. Sharing within a socket
3. Intra-socket uniformity
4. Atomic operations
5. Simple locks
Coherence on multi-sockets
• Opteron: incomplete directory – coherence requests are broadcast
• Xeon: a directory (dir) per socket serves requests
[Figure: Opteron and Xeon multi-socket topologies annotated with their directories]
Locality on multi-sockets
• Opteron
– within socket: 40 ns
– data within a socket: served locally (40 ns) or via broadcast (120 ns)
• Xeon
– within socket: 20 – 40 ns
– data within a socket: served by the LLC (20 – 40 ns)
Sharing within a socket is necessary but not sufficient
[Figure: Opteron and Xeon multi-socket topologies]
Locks on multi-sockets
[Figure: throughput (Mops/s) vs. threads on Opteron and Xeon, high contention (4 locks) and low contention (512 locks)]
Sharing within a socket is not sufficient
Hash table on multi-sockets
[Figure: throughput (Mops/s) vs. threads on Opteron and Xeon, high contention (12 buckets) and low contention (512 buckets)]
Sharing within a socket is necessary but not sufficient
Outline
1. Crossing sockets
2. Sharing within a socket
3. Intra-socket uniformity
4. Atomic operations
5. Simple locks
Distance on single-sockets
• Niagara: uniform – 23 ns
• Tilera: 1 hop – 40 ns; per hop: +2 ns; up to 0.5x more expensive
Uniformity is expected to scale better
[Figure: Niagara with a single shared directory vs. Tilera with a directory (dir) per tile]
Locks on single-sockets
[Figure: throughput (Mops/s) vs. threads on Niagara and Tilera; annotated scalability factors: 3.8x and 2.3x under high contention (4 locks), 25.0x and 21.2x under low contention (512 locks)]
Uniformity leads to up to 70% higher scalability
Hash table on single-sockets
[Figure: throughput (Mops/s) vs. threads on Niagara and Tilera; annotated scalability factors: 6.7x and 10.1x under high contention (12 buckets), 20.6x and 25.4x under low contention (512 buckets)]
Uniformity leads to up to 50% higher scalability
Outline
1. Crossing sockets
2. Sharing within a socket
3. Intra-socket uniformity
4. Atomic operations
5. Simple locks
Atomic ops on local data
[Figure: latency (ns) of load, store, and CAS on locally cached data, Opteron and Xeon]
CAS: an order of magnitude more expensive than a plain load or store on local data
Atomic ops on multi-sockets
[Figure: latency (ns) of load, store, and CAS at increasing distance (same socket, one hop, two hops) on Opteron and Xeon; CAS is only 8 – 35% more expensive than a plain load or store]
Loads and stores can be as expensive as atomic operations
Outline
1. Crossing sockets
2. Sharing within a socket
3. Intra-socket uniformity
4. Atomic operations
5. Simple locks
Hash table – best locks
[Figure: throughput (Mops/s) vs. threads on Opteron, Xeon, Niagara, and Tilera, high contention (12 buckets) and low contention (512 buckets); each point is labeled with the best-performing lock: mostly TAS and TICKET, occasionally CLH and HTICKET]
Simple locks (TAS, TICKET) are the best in 25 out of 32 data points
Simple locks are powerful
Lessons learned
1. Crossing sockets is a killer → up to 8x more expensive communication
2. Sharing within a socket is necessary but not sufficient → up to 3x more expensive communication
3. Intra-socket uniformity matters → up to 70% higher scalability
4. Loads & stores can be as expensive as atomic operations → 8 - 35% more expensive on non-locally cached data
5. Simple locks are powerful → better in 25 out of 32 data-points on a hash table
Scalability of synchronization is mainly a property of the hardware
Analysis' space & limitations
[Figure: the analysis space – synchronization schemes (locks, message passing, lock-free, combiner approaches, …) × hardware platforms]
SSYNC synchronization suite
• systems / applications: ssht, TM2C, Memcached
• software primitives (e.g., locks): libslock, libssmp
• atomic operations (e.g., compare-and-swap) & cache coherence (load and store): ccbench
http://go.epfl.ch/ssync
Thanks!
1. Crossing sockets is a killer
2. Sharing within a socket is necessary but not sufficient
3. Intra-socket uniformity matters
4. Loads & stores can be as expensive as atomic operations
5. Simple locks are powerful
Scalability of synchronization is mainly a property of the hardware
http://go.epfl.ch/ssync