Hardware Transactional Memory on Beehive
Andrew Birrell, Tom Rodeheffer, Chuck Thacker
Microsoft Research, Silicon Valley
Basics: Concurrent Execution
Debit-credit with a race
Core #2: var r1 = Read(x); var r2 = r1 + 7; Write(x, r2);
Core #3: var r3 = Read(x); var r4 = r3 - 5; Write(x, r4);
Memory: x starts at 10
• Intended behavior: x ends at 12
• Possible behavior: both cores read 10, giving r2: 17 and r4: 5, so x ends at 17 or 5
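The race can be replayed deterministically by treating each core's Read and Write as separate steps and choosing the interleaving by hand. This is only an illustrative sketch, not Beehive code: the `run` function and the schedule encoding are hypothetical names introduced here.

```python
# Deterministic replay of the debit-credit race. Each core's Read and Write
# are separate steps, and the schedule decides how they interleave.

def run(schedule, x=10):
    """Replay a schedule of (core, step) pairs against one shared location."""
    delta = {2: +7, 3: -5}   # Core #2 credits 7, Core #3 debits 5
    local = {}               # per-core register holding the value it read
    for core, step in schedule:
        if step == "read":
            local[core] = x
        else:                # "write": store the value computed from this core's read
            x = local[core] + delta[core]
    return x

serial = [(2, "read"), (2, "write"), (3, "read"), (3, "write")]
racy_a = [(2, "read"), (3, "read"), (3, "write"), (2, "write")]
racy_b = [(2, "read"), (3, "read"), (2, "write"), (3, "write")]

print(run(serial))  # 12, the intended result
print(run(racy_a))  # 17, Core #3's debit is lost
print(run(racy_b))  # 5, Core #2's credit is lost
```

Whenever both reads happen before either write, one update is lost; which one depends only on which write lands last.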
September 2010
Hardware Transactional Memory on Beehive
2
Classic Solution: Mutual Exclusion
Core #2: lock(m) { var r1 = Read(x); var r2 = r1 + 7; Write(x, r2); }
Core #3: lock(m) { var r3 = Read(x); var r4 = r3 - 5; Write(x, r4); }
Memory: x: 10 → 17 → 12; m: held while a core is inside lock(m), free otherwise
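A minimal sketch of the same fix with real threads, assuming Python's standard `threading.Lock` as a stand-in for `lock(m)`. The `update` helper is a hypothetical name; the two threads play the roles of Core #2 and Core #3.

```python
import threading

x = 10
m = threading.Lock()

def update(delta):
    """Read-modify-write of x, made indivisible by holding m throughout."""
    global x
    with m:              # lock(m) { ... }
        r = x            # Read(x)
        x = r + delta    # Write(x, ...)

threads = [threading.Thread(target=update, args=(delta,)) for delta in (+7, -5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(x)  # 12, whichever thread takes the lock first
```

Because each read-modify-write runs entirely under `m`, the only possible orders are the two serial ones, and both end at 12.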
Lock-based Mutual Exclusion is Hard
• Locking levels
• Composition
• Locking granularity
Locking Levels
• Deadlock if locking order is inconsistent:
  Core #2: lock(p) { lock(q) { … } }
  Core #3: lock(q) { lock(p) { … } }
• Requires partial order on locks
• Doesn’t scale
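One common way to impose an order on locks is to give each lock a rank and always acquire in rank order. The sketch below assumes that approach; `RankedLock` and `acquire_in_order` are hypothetical helpers, not part of any library or of Beehive.

```python
import threading
from contextlib import ExitStack

class RankedLock:
    """A lock with a fixed position in the global locking order (hypothetical helper)."""
    def __init__(self, rank):
        self.rank = rank
        self.lock = threading.Lock()

def acquire_in_order(stack, *locks):
    """Acquire locks lowest rank first, whatever order the caller named them in."""
    for ranked in sorted(locks, key=lambda r: r.rank):
        stack.enter_context(ranked.lock)

p, q = RankedLock(1), RankedLock(2)
completed = []

def core(name, first, second, rounds=1000):
    for _ in range(rounds):
        with ExitStack() as stack:       # releases both locks on exit
            acquire_in_order(stack, first, second)
            completed.append(name)       # critical section: both locks held

# Core #2 names p then q, Core #3 names q then p; both really take p first.
threads = [threading.Thread(target=core, args=("core2", p, q)),
           threading.Thread(target=core, args=("core3", q, p))]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(completed))  # 2000: neither core deadlocked
```

The cost is the one the slide points at: every piece of code that takes two locks must know and respect the same ranking.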
Composition
• Hard to extend existing libraries
• E.g. hash table:
  – add, lookup, remove
• How to enhance with atomic “rename”?
  – rename = { add(); remove() } … leaves temp duplicate
  – rename = { remove(); add() } … leaves temp gap
• Requires access to internal locking mechanism
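The composition problem can be seen in a small sketch. The `HashTable` class below is a hypothetical library with a private lock; `rename_by_composition` is built only from its public operations, and the `between` callback stands in for another core running between the two steps.

```python
import threading

class HashTable:
    """Thread-safe table whose lock is an internal detail."""
    def __init__(self):
        self._lock = threading.Lock()
        self._items = {}

    def add(self, key, value):
        with self._lock:
            self._items[key] = value

    def lookup(self, key):
        with self._lock:
            return self._items.get(key)

    def remove(self, key):
        with self._lock:
            return self._items.pop(key)

    def rename(self, old, new):
        # Atomic only because it lives inside the class and can hold _lock
        # across both steps.
        with self._lock:
            self._items[new] = self._items.pop(old)

def rename_by_composition(table, old, new, between=lambda: None):
    """remove(); add() built from the public API: leaves a temporary gap."""
    value = table.remove(old)
    between()                 # stands in for another core running right here
    table.add(new, value)

table = HashTable()
table.add("a", 1)
seen = []
rename_by_composition(
    table, "a", "b",
    between=lambda: seen.append((table.lookup("a"), table.lookup("b"))))

print(seen)               # [(None, None)]: the entry was visible under neither name
table.rename("b", "c")    # the atomic version needs the internal lock
print(table.lookup("c"))  # 1
```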
Locking Granularity
• Simple locking can inhibit concurrency:
  Core #2: lock(m) { var r1 = Read(x[i]); var r2 = r1 + 7; Write(x[i], r2); }
  Core #3: lock(m) { var r3 = Read(x[j]); var r4 = r3 - 5; Write(x[j], r4); }
• Tricky trade-off of complexity/performance
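One finer-grained alternative is a lock per array element, sketched below. The array size, the indices, and the `update` helper are assumptions for illustration. CPython threads do not execute bytecode in parallel, so this shows the locking structure rather than a measured speedup.

```python
import threading

x = [10, 10, 10, 10]
locks = [threading.Lock() for _ in x]   # one lock per element, replacing the single m

def update(index, delta):
    with locks[index]:                   # only updates to the same element serialize
        r = x[index]
        x[index] = r + delta

threads = [threading.Thread(target=update, args=(0, +7)),   # Core #2 with i = 0
           threading.Thread(target=update, args=(3, -5))]   # Core #3 with j = 3
for t in threads:
    t.start()
for t in threads:
    t.join()

print(x)  # [17, 10, 10, 5]
```

The two updates never contend, but the program now carries one lock per element, and any operation touching several elements runs back into the lock-ordering problem.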
Alternatives
• Rely on experts?
  – Parallel processing libraries (LINQ)
  – Map-Reduce, Hadoop, Dryad
  – Parallelizing compilers
  – GPU graphics
• Use a better abstraction?
  – Atomic Transactions:
    • semantics as if sequential
    • actually, concurrent
Debit-credit with Atomic Transactions
Core #2: atomic { var r1 = Read(x); var r2 = r1 + 7; Write(x, r2); }
Core #3: atomic { var r3 = Read(x); var r4 = r3 - 5; Write(x, r4); }
Memory: x: 10 → 17 → 12
Atomic Transaction Semantics
• Database transactions:
  – Atomicity: all of the transaction happens, or nothing
  – Consistency: failed transactions have no effect
  – Isolation: internal state invisible to others
  – Durability: after commit, effects are permanent
• Transactional memory:
  – Serialization: execution is indistinguishable from some serial execution of the transactions
  – Reality: non-transactional code can see non-atomic effects of transactions
Transactional Memory Research
• ~500 papers in last 15 years
• All about software TM, or about simulations
• Software TM is extremely inefficient
• Hardware TM needs hardware
• So, Beehive …
Implementing TM Transactions
• Execute in parallel, hoping to succeed
• Detect conflicts that prevent serialization
• Roll back all but one and retry them
Implementing TM Debit-Credit
Core #2: tm_startTX(); { var r1 = x; var r2 = r1 + 7; x := r2; } tm_endTX();
  R{ x } W{ x: 17 }
Core #3: tm_startTX(); { var r3 = x; var r4 = r3 - 5; x := r4; } tm_endTX();
  R{ x } W{ x: 5 }, then W{ x: 12 } after rollback and retry
Memory: x: 10 → 17 → 12
Conflict Detection Abstractly
• For each uncommitted transaction “X” maintain:
  – Read set R(X), all locations read so far
  – Write set W(X), all locations written so far
  – Ability to undo writes for rollback (or do them on commit)
• If W(X) intersects with R(Y) or W(Y), and X commits before Y:
  – roll back Y
  – (or could roll back X)
  – (or could delay committing X)
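The abstract rule can be modelled in a few lines of software. This is only a sketch of the rule on this slide, with writes deferred until commit and the "roll back Y" option chosen; the `Tx` class and the `commit` and `debit_credit` functions are hypothetical names, and nothing here reflects how the hardware stores its sets.

```python
class Tx:
    """One uncommitted transaction with its read set and deferred writes."""
    def __init__(self, memory):
        self.memory = memory
        self.reads = set()     # R(X): locations read so far
        self.writes = {}       # W(X): location -> value, applied only on commit
        self.aborted = False

    def read(self, loc):
        self.reads.add(loc)
        return self.writes.get(loc, self.memory[loc])

    def write(self, loc, value):
        self.writes[loc] = value

def commit(tx, active):
    """Apply tx's writes, then roll back every active transaction it conflicts with."""
    tx.memory.update(tx.writes)
    active.remove(tx)
    for other in active:
        if set(tx.writes) & (other.reads | set(other.writes)):
            other.aborted = True

def debit_credit(tx, delta):
    tx.write("x", tx.read("x") + delta)

memory = {"x": 10}
t2, t3 = Tx(memory), Tx(memory)
active = [t2, t3]
debit_credit(t2, +7)        # W(t2) = {x: 17}
debit_credit(t3, -5)        # W(t3) = {x: 5}, computed from the old x
commit(t2, active)          # x becomes 17; t3 read x, so it is rolled back
print(t3.aborted)           # True

retry = Tx(memory)          # t3 retried from the start
active = [retry]
debit_credit(retry, -5)
commit(retry, active)
print(memory["x"])          # 12
```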
Conflict Detection in Beehive Hardware
• During transaction “X”:
  – Maintain R(X) and W(X)
  – Defer writes to DRAM
• To commit “X”:
  – Send writes to DRAM
  – Send W(X) around the ring
• During transaction “Y”:
  – Snoop on W(X), compare with R(Y)
  – Roll back and retry on conflict
Finally, Some Hardware
• R(X) is recorded in a “Bloom filter”
• D-cache evicts go to victim cache, not DRAM
• Filter snoops on Tx writes; a conflict triggers abort
• “Commit” sends writes to ring
[Figure: block diagram with labels Core N, D cache, W(X): Victim Cache, R(X): Bloom Filter, and The Ring]
Sidebar: “What’s a Bloom filter?”
• Probabilistic storage for set membership
• Storage can be less than set size
• Operations:
  – Add x to the set
  – Is x in the set?
    • If x is in the set, answer “yes”
    • If x isn’t in the set, answer either “yes” or “no”
• Skill is controlling probability of false positives
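A minimal software Bloom filter makes the two operations concrete. The sizes (64 bits, 3 hash functions) and the use of SHA-256 to derive bit positions are assumptions chosen for the sketch; the slides do not say what hash functions or sizes the Beehive hardware uses.

```python
import hashlib

class BloomFilter:
    """Set membership with no false negatives and occasional false positives."""
    def __init__(self, num_bits=64, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0                      # the whole filter is one bit vector

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one strong hash.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # "yes" only if every position is set; an unset bit proves absence.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

read_set = BloomFilter()
addresses = [0x1000, 0x1040, 0x2000]
for addr in addresses:
    read_set.add(addr)

print(all(addr in read_set for addr in addresses))  # True: added items always answer "yes"
print(0x1000 in BloomFilter())                      # False: an empty filter answers "no"
```

For a read set this trade-off is safe: a false positive only causes an unnecessary abort, while a false negative, which cannot happen, would miss a real conflict.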
Details (1): Serializing Commits
• tm_endTX() {
    P(commitMutex);
    Flush victim cache and D-cache;
    Clear “inTx” state;
    V(commitMutex);
  }
Details (2): Conflicts and Rollback
• tm_startTX() {
    setjmp(…); // inline at caller
    Flush D-cache;
    Set “inTx”;
  }
• On conflict: clear “inTx”, trap to location 2
• abortHandler() {
    Invalidate D-cache;
    longjmp(…)
  }
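The setjmp/longjmp pairing amounts to a retry loop: the abort handler jumps back to the point saved at the start of the transaction. A rough analogue using an exception in place of the trap is sketched below; `Conflict`, `run_transaction`, and the `body` workload are hypothetical, and the cache flush and invalidate steps are not modelled.

```python
class Conflict(Exception):
    """Stands in for the hardware trap taken when a conflict is detected."""

def run_transaction(body):
    attempt = 0
    while True:                 # top of loop plays the role of the setjmp(...) point
        attempt += 1
        try:
            result = body(attempt)
        except Conflict:        # abortHandler(): drop speculative state, jump back
            continue
        return result, attempt

def body(attempt):
    # Hypothetical workload: pretend the first two attempts are aborted by a snoop.
    if attempt <= 2:
        raise Conflict
    return "committed"

outcome = run_transaction(body)
print(outcome)  # ('committed', 3)
```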
Details (3): Mice and Elephants
• Victim cache can overflow
• Clear “inTx”, trap to location 3
• Elephant:
  – tm_startTX(): { P(commitMutex); flush; set “inTx”; }
  – execute the transaction, completely;
  – tm_endTX(): { clear “inTx”; V(commitMutex); }
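The overflow fallback can be sketched as follows: a transaction first runs as a mouse with a bounded write buffer, and if the buffer overflows it is rerun as an elephant that holds the commit mutex for its whole execution. The class names, the `capacity` of 4, and the `run` helper are assumptions for illustration, and conflict detection between mice is left out.

```python
import threading

commit_mutex = threading.Lock()   # plays the role of commitMutex

class Overflow(Exception):
    """Raised when a mouse's buffered writes exceed the victim-cache capacity."""

class MouseTx:
    """Optimistic transaction: writes are buffered and applied at commit."""
    def __init__(self, memory, capacity):
        self.memory, self.capacity, self.buffer = memory, capacity, {}

    def read(self, loc):
        return self.buffer.get(loc, self.memory[loc])

    def write(self, loc, value):
        self.buffer[loc] = value
        if len(self.buffer) > self.capacity:
            raise Overflow

    def commit(self):
        with commit_mutex:                 # commits are serialized
            self.memory.update(self.buffer)

class ElephantTx:
    """Runs with commit_mutex already held, so it writes memory directly."""
    def __init__(self, memory):
        self.memory = memory

    def read(self, loc):
        return self.memory[loc]

    def write(self, loc, value):
        self.memory[loc] = value

def run(body, memory, capacity=4):
    try:
        tx = MouseTx(memory, capacity)
        body(tx)
        tx.commit()
        return "mouse"
    except Overflow:                       # nothing reached memory yet; start over
        with commit_mutex:
            body(ElephantTx(memory))
        return "elephant"

def small(tx):
    for loc in range(2):
        tx.write(loc, tx.read(loc) + 1)

def big(tx):
    for loc in range(8):
        tx.write(loc, tx.read(loc) + 1)

memory = {loc: 0 for loc in range(8)}
first = run(small, memory)
second = run(big, memory)
print(first, second)  # mouse elephant
print(memory)         # locations 0 and 1 hold 2, the rest hold 1
```

Rerunning from scratch is safe here because the overflowing mouse had only buffered its writes, so memory was untouched when the elephant started.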
Measurements
• With Sungpack Hong
• Eigenbench and STAMP benchmarks (Stanford)
• Comparing with “SwissTM” (EPFL) and “TL2” (Sun) software TM
• Work-in-progress
Eigenbench (1): no conflicts
[Figure: two panels, “Short Transactions (9RD, 1WR)” and “Large Transactions (270RD, 30WR)”. Each plots Speedup against Number of Cores for Unprotected and TM, together with %Overflow and %Abort.]
Eigenbench (2): with Conflicts
[Figure: “Speedup (8 cores, 90R, 10W)”. Speedup plotted against Estimated Degree of Conflicts (From Analytic Model), from 0% to 100%, for Beehive, SwissTM, and TL2.]
Red-Black Trees
[Figure: “Execution time in committed transactions”. Breakdown of RB Tree transaction time, in cycles, against Number of Cores (1, 2, 4, 8, 12), split into tm_startTX, Body, and tm_endTX.]
STAMP Benchmarks
[Figure: STAMP benchmark results, annotated with: Overflow / False Positives; Nondeterministic Execution; Overflow; Small TXs; Floating Point Operations.]
Atomicity: a Cautionary Tale
Core #2: tm_startTX(); { if (ok) { x->f(); } } tm_endTX();
  R{ ok, x }
Core #3: tm_startTX(); { x = NULL; ok = false; } tm_endTX();
  W{ x: 0, ok: 0 }
Memory: x: &y → 0; ok: 1 → 0
• Make TX writes on the ring atomic
• Inhibit mice while running an elephant
Status
• It works
• It needs polishing
• The fundamental question remains:
  – “is TM significantly easier than locks/monitors?”
• Beehive TM will help us answer this