Would Error Correction Provide a Benefit in Classical Computers?
27 Jan 2012
Photons, Electrons, Bands
Thomas Szkopek
Canada Research Chair in Nanoscale Electronics
Department of Electrical and Computer Engineering
Acknowledgements
Vwani Roychowdhury (collaborator), UCLA
Eli Yablonovitch (provocateur), UC Berkeley
Funding:
system reliability
Lawrence Livermore National Laboratory
ENIAC, 1946: 17,468 vacuum tubes; mean time between faults: ~2 days
IBM BlueGene/L, 2006: 131,072 processors; mean time between faults: ~6 days
system reliability
“[with] current state‐of‐the‐art fault‐tolerance strategy, checkpoint/restart, for a 1 PFlop/s system… a computational job that could complete in 100 hours in a failure‐free environment will actually take 251 hours”
“While several [high-end computing] vendors are looking to address reliability at the hardware level, the costs are proving to be staggeringly high in both money and power.”
let’s look at the hardware level!
DeBardeleben et al., High‐End Computing Resilience: Analysis of Issues Facing the HEC Community and Path‐Forward for Research and Development, Los Alamos National Laboratory 2010, http://institute.lanl.gov/resilience/docs/
error correction: memory and communications errors
transmitter (write): reliable encoding
channel (memory): identity
receiver (read): reliable decoding & error correction
• reliable encoding, decoding and error correcting hardware
• efficient, complex codes are used
error correction: computation errors
encoder: reliable encoding
logic unit: encoded logic
decoder: reliable decoding & error correction
• reliable encoding, decoding and error correcting hardware
• logic performed in code space (e.g. Reed-Muller codes)
D. Pradhan & S. Reddy, IEEE Trans. Comp. 21, 1331 (1972).
• however, it is likely that all hardware is equally (un)reliable
error correction: computation errors
error correction
logic
error correction
logic
• errors occur in all hardware
• never decode bits or they will be corrupted; in other words, all operations must be performed in protected code space!
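One way to picture logic in code space is a gate applied position by position to repetition codewords. The sketch below is a minimal illustration under that assumption; the repetition code, the choice of NAND, and the names `encode`, `nand_encoded` and `majority` are hypothetical and not taken from the talk. It shows that a single flipped bit in an input codeword produces a single flipped bit in the output codeword, so the result is still correctable without ever collapsing to one unprotected bit.

```python
def encode(bit, n=5):
    """Repetition code: n copies of the logical bit (assumed code, for illustration)."""
    return [bit] * n

def nand_encoded(a, b):
    """Apply NAND position by position; both operands stay encoded throughout."""
    return [1 - (x & y) for x, y in zip(a, b)]

def majority(word):
    """Majority vote, used here only to inspect the encoded result."""
    return 1 if sum(word) > len(word) // 2 else 0

a = encode(1)
b = encode(1)
a[2] ^= 1                   # one physical bit flip in an input codeword
out = nand_encoded(a, b)    # the error stays confined to position 2
print(out, majority(out))
```

The `majority` call at the end is only a check on the encoded output; in a scheme where every component is unreliable, correction would itself have to act on encoded bits.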
protecting 1 bit: repetition
repetition code, error correction by majority vote
“0” = 0 0 0 0 0 → 0 0 0 1 0 → 0 0 0 0 0
“1” = 1 1 1 1 1 → 1 1 0 1 1 → 1 1 1 1 1
0 1 0 1 1 → 1 1 1 1 1
0 1 0 0 1 → 0 0 0 0 0
single bit flip: $p$
logical bit flip: $P = 60p^3 + \dots$
[plot: error rate versus $p$, with curves $P = 60p^3$ and $p$]
J. von Neumann, Lectures on Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components, 1952.
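The majority-vote decoding on this slide can be sketched in a few lines. This is a minimal illustration, assuming the simplest error model: each of the 5 bits flips independently with probability p and the voter itself is perfect. Under that assumption the leading term of the logical error rate is 10p^3; the prefactor of 60 quoted on the slide belongs to the talk's own error model, which this text does not spell out. The function names are hypothetical.

```python
from math import comb

N = 5  # codeword length used on the slide

def encode(bit, n=N):
    """Repetition code: n copies of the logical bit."""
    return [bit] * n

def majority(word):
    """Decode by majority vote over the received bits."""
    return 1 if sum(word) > len(word) // 2 else 0

def logical_error_rate(p, n=N):
    """Probability that more than half of n bits flip, assuming
    independent flips with probability p and an error-free voter."""
    t = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(t, n + 1))

# The four received words shown on the slide
for word in ([0, 0, 0, 1, 0], [1, 1, 0, 1, 1], [0, 1, 0, 1, 1], [0, 1, 0, 0, 1]):
    print(word, "->", encode(majority(word)))

print(logical_error_rate(0.01))  # well below the unprotected rate p = 0.01
```

The benefit appears only when p is small enough that the cubic term falls below p itself, which is the comparison the plot on the slide makes.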
protecting 1 bit
[diagram: circuit of MAJ gates]
MAJ = majority vote
If majority gates are error-free, then the majority voting process is error free if