12th Workshop on Silicon Errors in Logic – System Effects (SELSE) Austin, TX, USA, March 29-30, 2016
Software-Defined Error-Correcting Codes
Mark Gottscho, Clayton Schoeny, Lara Dolecek, Puneet Gupta
nanocad.ee.ucla.edu, loris.ee.ucla.edu
Memory Errors are a Major Problem
• System-level effects from embedded to HPC
  • System crashes
  • Silent data corruption
• DRAM reliability worsens with density
  • Google: 70,000 FIT/Mb in commodity DRAM; 8% of modules affected per year; 4% of servers crash per year [Schroeder CACM’11]
  • Facebook: 2.5% of machines see DRAM errors per month [Meza DSN’15]
• SRAM stops working at low voltage
  • 6X fault rate measured from 600 mV to 525 mV [Gottscho TACO’15]
• Flash wears out with usage
  • NASA’s Opportunity Mars rover had to reformat its flash in 2014
• STT-RAM is unpredictable
  • Stochastic write & thermal instability [Zhao Microelec. Rel.’12]
• Memory errors will continue to be a challenge!
[Figure: SRAM at 550 mV and at 525 mV [Gottscho TACO’15]]
Current Techniques: Costly & Data-Oblivious
State-of-the-art techniques are costly. More importantly, they try to protect data in memory without knowing anything about it!
System-Level Fault Tolerance
• Mirroring / Sparing ($$$)
• Checkpoint & Recovery ($$)
• Resource Retirement ($)
Error-Correcting Codes
• BCH Codes ($$$)
• IBM Chipkill ($$)
• SEC-DED Hamming Codes ($)
Separate Abstractions
Software context of data
• Legality
• Criticality
• Statistics
These techniques do not take advantage of available side information about data stored in memory.
Can we do better?
Our Solution: Software-Defined ECC (SWD-ECC)
[Figure: Software-Defined ECC shown with System-Level Fault Tolerance, Error-Correcting Codes, and side-information about data in memory]
Since Hamming’s seminal work 68 years ago, coding theory has generally assumed that all bits are created equal. This is not the case!
Software-Defined Error-Correcting Codes represent a new paradigm in memory resiliency.
Essence of Software-Defined ECC
Conceptual example using SEC-DED
• Heuristic Recovery
  • Determine candidate codewords
  • Filter out illegal codewords
  • Rank remaining codewords using all available side-information
• ECC Code Design
  • Minimize average number of neighboring spheres
  • Geometrically separate critical codewords
Concept is not restricted to SEC-DED codes!
[Figure: codewords and their Hamming spheres. Each dotted edge is a single-bit flip between two n-bit strings. Labeled cases: 1-bit CE; 2-bit DUE with 3 equidistant candidate codewords; 2-bit DUE with 4 equidistant candidate codewords.]
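The three heuristic recovery steps (determine candidates, filter, rank) can be sketched on a toy code. This is a minimal sketch, assuming an (8,4) extended Hamming SEC-DED code rather than any code used in the talk; `encode`, `candidates`, `recover`, and the `is_legal` and `score` callbacks are hypothetical names introduced only for illustration.

```python
def encode(msg):
    """Encode a 4-bit message with an (8,4) extended Hamming SEC-DED code."""
    d = [(msg >> i) & 1 for i in range(4)]
    p = [d[0] ^ d[1] ^ d[3], d[0] ^ d[2] ^ d[3], d[1] ^ d[2] ^ d[3]]
    bits = d + p
    bits.append(sum(bits) % 2)  # overall parity raises the minimum distance to 4
    return sum(b << i for i, b in enumerate(bits))


# Map every valid codeword back to its message.
CODEBOOK = {encode(m): m for m in range(16)}


def candidates(received):
    """Step 1: all codewords at the minimum Hamming distance from the received word."""
    dist = {cw: bin(cw ^ received).count("1") for cw in CODEBOOK}
    nearest = min(dist.values())
    return sorted(cw for cw, d in dist.items() if d == nearest)


def recover(received, is_legal, score):
    """Steps 2 and 3: drop illegal messages, then pick the highest-scoring one."""
    legal = [CODEBOOK[cw] for cw in candidates(received) if is_legal(CODEBOOK[cw])]
    return max(legal, key=score) if legal else None


# A 2-bit error on the codeword for message 5 is a DUE with 4 equidistant candidates.
received = encode(5) ^ 0b00000011
print(len(candidates(received)))  # 4
# Side information (here a made-up score favoring message 5) breaks the tie.
print(recover(received, is_legal=lambda m: True, score=lambda m: 1 if m == 5 else 0))  # 5
```

In this toy code every 2-bit error yields exactly 4 candidates; a real code and real side information would supply the legality test and the ranking score.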
Heuristic Recovery for Data Memory
• Data types
  • uint32_t, double, pointers, packed arrays, classes
• Object states
  • Assertions, invalid pointers
• Data correlation
  • Previously used for compression [Yang MICRO’00, Alameldeen ‘04, Pekhimenko PACT’12]
[Figure: a 64B cache line read from main memory as a burst of 64-bit words over 8 clock cycles, through a memory controller with (72,64) SECDED ECC. Each word is 64-bit data + 8-bit parity (not shown).]
Word 7: 0x0...00000004
Word 6: 0x0...00000003
Word 5: 0x0...00000000
Word 4: 0x0...00000004
Word 3: 0x0...00350001 (DUE: candidate codeword changes 0x00 to 0x35)
Word 2: 0x0...00000003
Word 1: 0x0...0000000B
Word 0: 0x0...00000000
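One way to use data correlation is to prefer the candidate that looks most like the error-free words in the same cache line. The following is a minimal sketch under that assumption, using total Hamming distance as the similarity measure; the function names and the two-entry candidate list are hypothetical and are not computed from a real (72,64) code.

```python
def hamming_distance(a, b):
    """Number of bit positions in which two words differ."""
    return bin(a ^ b).count("1")


def rank_by_similarity(candidate_words, good_neighbors):
    """Order candidates from most to least similar to the neighboring good words."""
    def total_distance(word):
        return sum(hamming_distance(word, n) for n in good_neighbors)
    return sorted(candidate_words, key=total_distance)


# Words 0-2 and 4-7 of the cache line in the example (low 32 bits shown).
neighbors = [0x0, 0xB, 0x3, 0x4, 0x0, 0x3, 0x4]
# Hypothetical candidates for word 3: one with 0x35 in the third byte, one with 0x00.
cands = [0x00350001, 0x00000001]

best = rank_by_similarity(cands, neighbors)[0]
print(hex(best))  # 0x1
```

The candidate containing 0x35 differs from every small-valued neighbor in four extra bits, so the candidate with 0x00 ranks first.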
Heuristic Recovery for Instruction Memory
Example: MIPS Format
• Known instruction set
• MIPS formats: R-type, I-type, J-type
• Illegal instructions
• Reserved values for opcode, fmt, funct
• Instruction frequency
• lw, sw much more common than sqrt.s
• Anomaly detection
• Control flow checks
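The legality and frequency ideas above can be sketched as a filter followed by a sort. This is an illustrative sketch only: the opcode table is a small subset of MIPS32 primary opcodes, the frequencies are made-up placeholders rather than measured values, and any opcode missing from the table is treated as illegal, which is a simplification. The candidate words are hypothetical and are not the output of a real ECC decoder.

```python
# Illustrative subset of MIPS32 primary opcodes with placeholder static frequencies.
OPCODE_FREQ = {
    0x00: 0.25,  # SPECIAL (R-type instructions such as addu)
    0x09: 0.15,  # addiu
    0x23: 0.20,  # lw
    0x2B: 0.10,  # sw
    0x0F: 0.05,  # lui
    0x04: 0.03,  # beq
    0x05: 0.03,  # bne
    0x03: 0.02,  # jal
}


def opcode(word):
    """Primary opcode: bits 31..26 of a 32-bit MIPS instruction."""
    return (word >> 26) & 0x3F


def filter_and_rank(candidate_words):
    """Drop candidates with unknown opcodes, then sort by descending frequency."""
    legal = [w for w in candidate_words if opcode(w) in OPCODE_FREQ]
    return sorted(legal, key=lambda w: OPCODE_FREQ[opcode(w)], reverse=True)


# Hypothetical candidates: opcode 0x3F (not in the table), sw, and lw $t0, 4($sp).
cands = [0xFFA80004, 0xAFA80004, 0x8FA80004]
for word in filter_and_rank(cands):
    print(hex(word))
```

A fuller version would also check the fmt and funct fields and could add control flow checks as a further filter.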
Overall Flow for Heuristic Recovery from DUEs
• ECC Hardware Decode: No Errors or Correctable Errors?
  • Yes: Success
  • No: Error Detected, Attempt Recovery
• Existing options after a DUE
  • Conventional: Crash
  • High-End Mainframes: Poison Data Tagging
  • Costly Fault Tolerance Mechanisms: Clean Page (Page Fault), Checkpointed (Rollback)
• Heuristic Recovery: Compute Candidate ECC Codewords
  • Instruction Memory? Yes: Only 1 Legal Message?
    • Yes: Success
    • No: Decode to most likely based on program statistics
  • Instruction Memory? No: Compute similarity to nearby good messages, then decode to closest fit candidate codeword
  • (Optional) Fork execution, poison data, wait for crash/corruption
  • Outcome: Probabilistic Success
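The decision flow for a DUE can be sketched as a single dispatch function. This is a rough sketch, not the authors' implementation: `recover_due`, its parameters, and the outcome labels are hypothetical, the candidate messages are assumed to have been computed already, and the "unrecovered" branches are an added assumption for the case where no candidate survives.

```python
def hamming_distance(a, b):
    """Number of differing bits between two words."""
    return bin(a ^ b).count("1")


def recover_due(cands, is_instruction, is_legal, frequency, good_neighbors):
    """Pick one candidate message for a DUE and label the outcome."""
    if not cands:
        return None, "unrecovered"  # assumption: hand off to other mechanisms
    if is_instruction:
        legal = [m for m in cands if is_legal(m)]
        if len(legal) == 1:
            return legal[0], "success"  # only 1 legal message
        if not legal:
            return None, "unrecovered"  # assumption: hand off to other mechanisms
        # Several legal messages: decode to the most likely by program statistics.
        return max(legal, key=frequency), "probabilistic success"
    # Data memory: decode to the candidate closest to nearby good messages.
    closest = min(
        cands,
        key=lambda m: sum(hamming_distance(m, n) for n in good_neighbors),
    )
    return closest, "probabilistic success"


print(recover_due([1, 2, 3], True, lambda m: m == 2, lambda m: 0, []))
print(recover_due([0x35, 0x01], False, None, None, [0x0, 0x3]))
```

The optional step of forking execution and poisoning data is left out of this sketch because it depends on operating system support.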
Experimental Setup
Analytically studied all possible 2-bit DUEs that could affect MIPS instruction memory
• Used common Hsiao (39,32) SEC-DED ECC Code
• Multiple benchmarks from the SPEC CPU2006 suite
• Compiled for 32-bit MIPS
These DUEs would normally result in a system crash or silent data corruption.
Results: ECC & Software Analysis
ECC Analysis
• Number of candidate codewords depends on error locations
[Figure: number of candidate codewords for each pair of error locations. Axes: bit position of first error (1 to 39) and bit position of second error (1 to 39). Dark red: 16 candidate codewords (worst case). Light green: 8 candidate codewords (best case).]
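The dependence on error locations can be reproduced on a small model. The sketch below assumes a Hsiao-style (39,32) parity-check matrix built from 7 weight-1 columns and the first 32 weight-3 columns in lexicographic order; this is not the exact matrix used in the study, so the per-pattern counts need not match the figure. For a 2-bit error at positions i and j, the candidate codewords correspond to all column pairs whose XOR equals the observed syndrome.

```python
from collections import Counter
from itertools import combinations

R, N = 7, 39  # check bits and codeword length

# Columns of the parity-check matrix, stored as 7-bit integers.
columns = [1 << i for i in range(R)]  # 7 weight-1 columns
columns += [(1 << a) | (1 << b) | (1 << c)
            for a, b, c in combinations(range(R), 3)][: N - R]  # 32 weight-3 columns

# How many column pairs produce each 2-bit-error syndrome.
syndrome_pairs = Counter(
    columns[a] ^ columns[b] for a, b in combinations(range(N), 2)
)

# Candidate count for every 2-bit error pattern (positions are 0-indexed here).
candidate_count = {
    (i, j): syndrome_pairs[columns[i] ^ columns[j]]
    for i, j in combinations(range(N), 2)
}

print(len(candidate_count))            # number of distinct 2-bit error patterns
print(min(candidate_count.values()),
      max(candidate_count.values()))   # best and worst case for this matrix
```

There are 39 choose 2 = 741 patterns in total, and the count always includes the original codeword itself.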
Program Analysis
• The static frequency of instructions in programs follows a power law distribution
[Figure: relative frequency of each instruction in the program binary, by instruction mnemonic, for bzip2, h264ref, mcf, perlbench, and povray. One panel uses a log scale (1E+0 down to 1E-7, moving average); the other uses a linear scale (0 to 0.25).]
Results: Rate of Successful Recovery
Using our filter-and-rank approach, we can already recover from 33% of all possible DUEs!
[Figure: rate of successful recovery (0 to 0.9) versus index of 2-bit error pattern (0 to 700), for bzip2, h264ref, mcf, perlbench, povray, and the overall mean]
• Starting point: much room for improvement!
  • More sophisticated heuristic recovery using additional side information
  • Customized ECC code suited to protecting instructions
Ongoing Work: Constructing Better Codes
• Codes traditionally considered equivalent can have very different error correcting capabilities with side information
• We can control the geometry of codewords in 𝑛-dimensional space while keeping the following properties constant:
  • Linearity
  • Decoding Complexity
  • 𝑘: Message Size
  • 𝑛: Codeword Size
  • 𝑅: Rate of Code
  • Minimum Hamming Distance
• We can geometrically separate “important” codewords, reducing chance of mis-corrections
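The effect of separating important codewords can be illustrated on a toy code. The sketch below assumes an (8,4) extended Hamming code and a hypothetical side-information rule that only messages 0 to 3 are legal. `encode_a` and `encode_b` produce the same set of 16 codewords (same linearity, n, k, rate, and minimum distance) but map messages to codewords differently; `unique_recoveries` counts the 2-bit errors on legal messages that leave exactly one legal candidate. All names are introduced here for illustration.

```python
from itertools import combinations


def encode_a(msg):
    """Systematic (8,4) extended Hamming encoding of a 4-bit message."""
    d = [(msg >> i) & 1 for i in range(4)]
    p = [d[0] ^ d[1] ^ d[3], d[0] ^ d[2] ^ d[3], d[1] ^ d[2] ^ d[3]]
    bits = d + p
    bits.append(sum(bits) % 2)
    return sum(b << i for i, b in enumerate(bits))


def encode_b(msg):
    """Same codeword set, but message bit 1 now selects the all-ones codeword."""
    basis = [encode_a(0b0001), encode_a(0b1111), encode_a(0b0100), encode_a(0b1000)]
    cw = 0
    for i in range(4):
        if (msg >> i) & 1:
            cw ^= basis[i]
    return cw


def unique_recoveries(encode, important):
    """Count 2-bit errors on important messages with exactly one legal candidate."""
    book = {encode(m): m for m in range(16)}
    unique = 0
    for msg in important:
        for i, j in combinations(range(8), 2):
            received = encode(msg) ^ (1 << i) ^ (1 << j)
            legal = [cw for cw in book
                     if bin(cw ^ received).count("1") == 2 and book[cw] in important]
            unique += (len(legal) == 1)
    return unique


important = {0, 1, 2, 3}
print(unique_recoveries(encode_a, important),
      unique_recoveries(encode_b, important))  # 52 64
```

Out of 112 error cases, the second mapping resolves 64 uniquely versus 52 for the first, because it places two pairs of legal codewords at distance 8 instead of 4, so those pairs never compete as candidates.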
Conclusion
• SWD-ECC: new paradigm for memory resiliency
• Applications to several domains of computing
  • Mobile devices: powerful error correction is too costly
  • Supercomputing: checkpoint rollbacks steal performance
  • Real-time embedded systems: missed deadlines worse than data corruption
• We hope to inspire a new thread of research in the community
  • Sophisticated recovery schemes and novel codes
  • Other systems beyond memory: channel coding for networks or storage
Thank you!