Software-Defined Error-Correcting Codes

12th Workshop on Silicon Errors in Logic – System Effects (SELSE) Austin, TX, USA, March 29-30, 2016

Software-Defined Error-Correcting Codes
Mark Gottscho, Clayton Schoeny, Lara Dolecek, Puneet Gupta
nanocad.ee.ucla.edu, loris.ee.ucla.edu

Memory Errors are a Major Problem
• System-level effects from embedded to HPC
  • System crashes
  • Silent data corruption
• DRAM reliability worsens with density
  • Google: 70,000 FIT/Mb in commodity DRAM; 8% of modules affected per year; 4% of servers crash per year [Schroeder CACM’11]
  • Facebook: 2.5% of machines see DRAM errors per month [Meza DSN’15]
• SRAM stops working at low voltage
  • 6X fault rate measured from 600mV to 525mV [Gottscho TACO’15]
• Flash wears out with usage
  • NASA’s Opportunity Mars rover had to reformat its flash in 2014
• STT-RAM is unpredictable
  • Stochastic write & thermal instability [Zhao Microelec. Rel.’12]
• Memory errors will continue to be a challenge!

[Figure: SRAM at 550 mV and at 525 mV, from Gottscho TACO’15]

Current Techniques: Costly & Data-Oblivious

System-Level Fault Tolerance
• Mirroring / Sparing ($$$)
• Checkpoint & Recovery ($$)
• Resource Retirement ($)

Error-Correcting Codes
• BCH Codes ($$$)
• IBM Chipkill ($$)
• SEC-DED Hamming Codes ($)

Separate Abstractions

These techniques do not take advantage of available side information about data stored in memory.

Software context of data
• Legality
• Criticality
• Statistics

State-of-the-art techniques are costly. More importantly: they try to protect data in memory without knowing anything about it!

Can we do better?

Our Solution: Software-Defined ECC (SWD-ECC)

[Diagram labels: System-Level Fault Tolerance; Error-Correcting Codes; Software-Defined ECC; Side-information about data in memory]

Since Hamming’s seminal work 68 years ago, coding theory has generally assumed that all bits are created equal. This is not the case!

Software-Defined Error-Correcting Codes represent a new paradigm in memory resiliency.

Essence of Software-Defined ECC

[Figure: codewords and their Hamming spheres. Each dotted edge is a single-bit flip between two n-bit strings. Cases shown: a 1-bit CE (correctable error); a 2-bit DUE (detected but uncorrectable error) with 3 equidistant candidate codewords; a 2-bit DUE with 4 equidistant candidate codewords.]

Conceptual example using SEC-DED
• Heuristic Recovery
  • Determine candidate codewords
  • Filter out illegal codewords
  • Rank remaining codewords using all available side-information
• ECC Code Design
  • Minimize average number of neighboring spheres
  • Geometrically separate critical codewords

Concept is not restricted to SEC-DED codes!
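A minimal sketch of the three heuristic recovery steps (determine candidates, filter, rank), in Python. It assumes a toy extended Hamming (8,4) SEC-DED code, not any code used in the talk. The names `encode`, `candidates`, `recover`, and the `is_legal` and `score` callbacks are hypothetical and stand in for whatever side information the system has.

```python
from itertools import product

def encode(d):
    """Encode 4 data bits with a toy extended Hamming (8,4) SEC-DED code."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    word = (d1, d2, d3, d4, p1, p2, p3)
    p0 = sum(word) % 2            # overall parity bit gives double-error detection
    return word + (p0,)

# Map every codeword back to its 4-bit message.
CODEBOOK = {encode(d): d for d in product((0, 1), repeat=4)}

def hamming_distance(a, b):
    return sum(x != y for x, y in zip(a, b))

def candidates(received):
    """Step 1: messages whose codewords sit at distance 2 from the received word."""
    return [msg for cw, msg in CODEBOOK.items()
            if hamming_distance(cw, received) == 2]

def recover(received, is_legal, score):
    """Steps 2 and 3: drop illegal candidates, then pick the best-scoring one."""
    legal = [m for m in candidates(received) if is_legal(m)]
    if not legal:
        return None               # assumption: no legal candidate means recovery fails
    return max(legal, key=score)

# Usage: corrupt two bits of a codeword, then recover with made-up side information.
original = (1, 0, 1, 0)
received = list(encode(original))
received[0] ^= 1
received[1] ^= 1
received = tuple(received)
best = recover(received, is_legal=lambda m: m[0] == 1, score=lambda m: 0)
```

In this toy example the corrupted word has four equidistant candidates, and the hypothetical legality rule (first data bit must be 1) leaves only the original message.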

Heuristic Recovery for Data Memory
• Data types
  • uint32_t, double, pointers, packed arrays, classes
• Object states
  • Assertions, invalid pointers
• Data correlation [Yang MICRO’00, Alameldeen ‘04, Pekhimenko PACT’12]
  • Previously used for compression

[Figure: a 64B cache line moves from main memory to a memory controller with (72,64) SECDED ECC as a burst of 64-bit words over 8 clock cycles. Each word is 64-bit data + 8-bit parity (not shown).]
Word 7: 0x0...00000004
Word 6: 0x0...00000003
Word 5: 0x0...00000000
Word 4: 0x0...00000004
Word 3: 0x0...00350001 (DUE: candidate codeword changes 0x00 to 0x35)
Word 2: 0x0...00000003
Word 1: 0x0...0000000B
Word 0: 0x0...00000000
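One way to exploit data correlation within a cache line, sketched in Python. The similarity metric (total bit-level Hamming distance to the other words in the line) and the names `bit_distance` and `rank_by_neighbor_similarity` are assumptions for illustration. The candidate list is also hypothetical: the slide gives only one candidate value for Word 3.

```python
def bit_distance(a: int, b: int) -> int:
    """Number of differing bits between two words."""
    return bin(a ^ b).count("1")

def rank_by_neighbor_similarity(candidates, neighbors):
    """Order candidate messages from most to least similar to the neighboring words."""
    def total_distance(c):
        return sum(bit_distance(c, n) for n in neighbors)
    return sorted(candidates, key=total_distance)

# The seven error-free words of the cache line shown above (Words 0-2 and 4-7).
neighbors = [0x0, 0xB, 0x3, 0x4, 0x0, 0x3, 0x4]

# Hypothetical candidates for Word 3: one with 0x35 in the upper bytes, one without.
candidates = [0x00350001, 0x00000001]

ranked = rank_by_neighbor_similarity(candidates, neighbors)
```

Under this metric the candidate without the 0x35 byte ranks first, because it differs from its small-valued neighbors in fewer bits.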

Heuristic Recovery for Instruction Memory
Example: MIPS Format
• Known instruction set
  • MIPS formats: R-type, I-type, J-type
• Illegal instructions
  • Reserved values for opcode, fmt, funct
• Instruction frequency
  • lw, sw much more common than sqrt.s
• Anomaly detection
  • Control flow checks
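A rough Python sketch of the legality and frequency checks for MIPS candidates. The opcode and funct tables below cover only a handful of real MIPS32 encodings, so anything outside them is treated as illegal purely for illustration. The frequency values are invented, and `mnemonic` and `rank_candidates` are hypothetical names.

```python
# Tiny illustrative subset of MIPS32 encodings (not the full instruction set).
LEGAL_OPCODES = {0x23: "lw", 0x2B: "sw", 0x09: "addiu", 0x04: "beq", 0x02: "j"}
LEGAL_FUNCTS = {0x21: "addu", 0x08: "jr"}      # R-type, opcode field == 0

# Invented relative frequencies; real values would come from program statistics.
FREQUENCY = {"lw": 0.25, "addiu": 0.20, "sw": 0.15, "addu": 0.10,
             "beq": 0.05, "j": 0.02, "jr": 0.01}

def mnemonic(word: int):
    """Return the mnemonic of a 32-bit instruction word, or None if not in the tables."""
    opcode = (word >> 26) & 0x3F
    if opcode == 0:
        return LEGAL_FUNCTS.get(word & 0x3F)
    return LEGAL_OPCODES.get(opcode)

def rank_candidates(words):
    """Filter out illegal candidates, then order the rest by instruction frequency."""
    legal = [w for w in words if mnemonic(w) is not None]
    return sorted(legal, key=lambda w: FREQUENCY[mnemonic(w)], reverse=True)

# Usage: an unknown opcode (0x3F), jr $ra, and lw $t0, 0($sp).
ranked = rank_candidates([0xFC000000, 0x03E00008, 0x8FA80000])
```

Here the word with opcode 0x3F is filtered out and the `lw` candidate ranks ahead of `jr`.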

Overall Flow for Heuristic Recovery from DUEs

[Flowchart, conventional handling. Node labels: ECC Hardware Decode; No Errors or Correctable Errors? (Yes: Success; No: Error Detected: Attempt Recovery); Conventional: Crash; High-End Mainframes: Poison Data; Tagging; Clean Page; Page Fault; Rollback; Checkpointed; Costly Fault Tolerance Mechanisms?]

[Flowchart, with SWD-ECC added. Node labels: ECC Hardware Decode; No Errors or Correctable Errors? (Yes: Success; No: Error Detected: Attempt Recovery); Compute Candidate ECC Codewords; Instruction Memory?; Only 1 Legal Message? (Yes: Success); Decode to most likely based on program statistics (Optional); Compute similarity to nearby good messages; Decode to closest fit candidate codeword; Fork execution, poison data, wait for crash/corruption; Probabilistic Success. Other paths: Conventional: Crash; High-End Mainframes: Poison Data; Tagging; Clean Page; Page Fault; Rollback; Checkpointed; Costly Fault Tolerance Mechanisms?]
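The decision flow can be read roughly as the following Python sketch. The branch order is an interpretation of the flowchart labels, not a confirmed implementation. The fallback to "crash" when no candidate survives is an assumption, and `recover_from_due` and its parameters are hypothetical names.

```python
def bit_distance(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def recover_from_due(candidates, is_instruction, is_legal=None,
                     frequency=None, neighbors=()):
    """Pick a message for a DUE; returns (message, outcome)."""
    if is_instruction:
        legal = [c for c in candidates if is_legal(c)]
        if len(legal) == 1:
            return legal[0], "success"          # only 1 legal message
        if not legal:
            return None, "crash"                # assumption: nothing left to try
        # Several legal messages: decode to the most likely one.
        best = max(legal, key=frequency)
    else:
        if not candidates:
            return None, "crash"
        # Data memory: decode to the candidate closest to nearby good messages.
        best = min(candidates,
                   key=lambda c: sum(bit_distance(c, n) for n in neighbors))
    return best, "probabilistic success"
```

The caller would still need a containment step for the probabilistic case, which the flowchart labels as forking execution and poisoning the data.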

Experimental Setup
Analytically studied all possible 2-bit DUEs that could affect MIPS instruction memory
• Used common Hsiao (39,32) SEC-DED ECC Code
• Multiple benchmarks from the SPEC CPU2006 suite
• Compiled for 32-bit MIPS

These DUEs would normally result in a system crash or silent data corruption.

Results: ECC & Software Analysis

ECC Analysis
• Number of candidate codewords depends on error locations

[Figure: heatmap of candidate codeword count, bit position of first error vs. bit position of second error (1 to 39). Dark red: 16 candidate codewords (worst case). Light green: 8 candidate codewords (best case).]
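To see why the candidate count depends on error location, here is a small Python enumeration over all 2-bit error patterns of a (39,32) code. The parity-check matrix is an assumption: the slides do not give the Hsiao matrix, so this sketch uses distinct odd-weight columns (32 of weight 3 for data bits, 7 of weight 1 for check bits). That is enough for SEC-DED, but the counts will not match the figure.

```python
from itertools import combinations

R, K = 7, 32                       # check bits and data bits of a (39,32) code

# Hypothetical parity-check columns, each stored as a 7-bit integer.
data_cols = [sum(1 << b for b in bits) for bits in combinations(range(R), 3)][:K]
check_cols = [1 << b for b in range(R)]
H = data_cols + check_cols         # 39 distinct odd-weight columns

def candidate_count(i: int, j: int) -> int:
    """Codewords at distance 2 from a word whose bits i and j were flipped.

    A codeword is a candidate when flipping some pair of bits (a, b)
    produces the same syndrome as the actual error pair (i, j).
    """
    syndrome = H[i] ^ H[j]
    return sum(1 for a, b in combinations(range(len(H)), 2)
               if H[a] ^ H[b] == syndrome)

# One entry per 2-bit error pattern: C(39, 2) = 741 patterns in total.
counts = {pair: candidate_count(*pair) for pair in combinations(range(len(H)), 2)}
average_candidates = sum(counts.values()) / len(counts)
```

The 741 patterns correspond to the cells of the triangular heatmap; a different column assignment changes the counts without changing the code's length, rate, or minimum distance.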

Program Analysis
• The static frequency of instructions in programs follows a power law distribution

[Figure: relative frequency of instruction in program binary (moving avg., log scale from 1E+0 down to 1E-7) vs. instruction mnemonic, for bzip2, h264ref, mcf, perlbench, povray.]

[Figure: relative frequency of instruction in program binary (linear scale, 0 to 0.25) vs. instruction mnemonic, for bzip2, h264ref, mcf, perlbench, povray.]
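Static instruction frequencies of this kind can be gathered with a simple count over a program's disassembly. The Python sketch below assumes the mnemonics have already been extracted into a list; `static_frequencies` is a hypothetical helper and the sample input is invented.

```python
from collections import Counter

def static_frequencies(mnemonics):
    """Relative frequency of each mnemonic, most common first."""
    counts = Counter(mnemonics)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {m: n / total for m, n in counts.most_common()}

# Usage with a made-up four-instruction listing.
freqs = static_frequencies(["lw", "sw", "lw", "addu"])
```

A table like this can serve as the ranking input for candidates that pass the legality filter.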

Results: Rate of Successful Recovery
Using our filter-and-rank approach, we can already recover from 33% of all possible DUEs!

[Figure: rate of successful recovery (0 to 0.9) vs. index of 2-bit error pattern (0 to 700), for bzip2, h264ref, mcf, perlbench, povray, and the overall mean.]

• Starting point: much room for improvement!
  • More sophisticated heuristic recovery using additional side information
  • Customized ECC code suited to protecting instructions

Ongoing Work: Constructing Better Codes
• Codes traditionally considered equivalent can have very different error correcting capabilities with side information
• We can control the geometry of codewords in 𝑛-dimensional space while keeping the following properties constant:
  • Linearity
  • Decoding Complexity
  • 𝑘: Message Size
  • 𝑛: Codeword Size
  • 𝑅: Rate of Code
  • Minimum Hamming Distance

• We can geometrically separate “important” codewords, reducing chance of mis-corrections

Conclusion
• SWD-ECC: new paradigm for memory resiliency
• Applications to several domains of computing
  • Mobile devices: powerful error correction is too costly
  • Supercomputing: checkpoint rollbacks steal performance
  • Real-time embedded systems: missed deadlines worse than data corruption
• We hope to inspire a new thread of research in the community
  • Sophisticated recovery schemes and novel codes
  • Other systems beyond memory: channel coding for networks or storage

Thank you!