Would Error Correction Provide a Benefit in Classical Computers?

27 Jan 2012
Photons, Electrons, Bands
Thomas Szkopek
Canada Research Chair in Nanoscale Electronics
Department of Electrical and Computer Engineering

Acknowledgements

Vwani Roychowdhury (collaborator), UCLA

Eli Yablonovitch (provocateur), UC Berkeley

Funding: (sponsor logos not recoverable from the slide)

system reliability

[Figure: transistor with gate, source, and drain labeled]

Lawrence Livermore National Laboratory

ENIAC, 1946: 17,468 vacuum tubes; mean time between faults: ~2 days

IBM BlueGene/L, 2006: 131,072 processors; mean time between faults: ~6 days

system reliability

“[with] current state-of-the-art fault-tolerance strategy, checkpoint/restart, for a 1 PFlop/s system… a computational job that could complete in 100 hours in a failure-free environment will actually take 251 hours”

“While several [high-end computing] vendors are looking to address reliability at the hardware level, the costs are proving to be staggeringly high in both money and power.”

Let's look at the hardware level!

DeBardeleben et al., High-End Computing Resilience: Analysis of Issues Facing the HEC Community and Path-Forward for Research and Development, Los Alamos National Laboratory 2010, http://institute.lanl.gov/resilience/docs/

error correction: memory and communications errors

[Diagram: transmitter (write): reliable encoding → channel (memory): identity → receiver (read): reliable decoding & error correction]

• reliable encoding, decoding and error correcting hardware
• efficient, complex codes are used

error correction: computation errors

[Diagram: encoder: reliable encoding → logic unit: encoded logic → decoder: reliable decoding & error correction]

• reliable encoding, decoding and error correcting hardware
• logic performed in code space (e.g. Reed-Muller codes)
  D. Pradhan & S. Reddy, IEEE Trans. Comp. 21, 1331 (1972).
• however, it is likely that all hardware is equally (un)reliable

error correction: computation errors

[Diagram: error correction → logic → error correction → logic]

• errors occur in all hardware
• never decode bits or they will be corrupted; in other words, all operations must be performed in protected code space!
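The rule of never decoding can be illustrated with a small sketch. It assumes a 5-bit repetition code and a bitwise NAND as the encoded gate; both are illustrative choices, and the names `encode`, `encoded_nand`, and `restore` are hypothetical. The `restore` stage is modeled as an ideal, fault-free majority vote, which is a simplification: on the slide, errors occur in all hardware, including the correction stage.

```python
def encode(bit, n=5):
    """Repetition code (assumed for illustration): copy the bit n times."""
    return [bit] * n

def encoded_nand(a, b):
    """NAND applied position by position; the inputs are never decoded."""
    return [1 - (x & y) for x, y in zip(a, b)]

def restore(word):
    """Error-correction stage: write the majority value back onto every bit.
    Assumption: this voter is ideal (fault-free), unlike real hardware."""
    majority = 1 if 2 * sum(word) > len(word) else 0
    return [majority] * len(word)

# Two encoded inputs, each hit by one bit flip in a different position.
a = encode(1)
b = encode(1)
a[0] ^= 1
b[3] ^= 1

out = encoded_nand(a, b)   # still in code space, now carrying two faulty bits
restored = restore(out)    # correction between logic stages removes them

print(out, "->", restored)
```

The faults from both inputs accumulate in the output word, which is why a correction stage sits between successive logic stages in the diagram.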

protecting 1 bit: repetition

repetition code, error correction by majority vote

codeword → corrupted word → after majority vote
“0” = 0 0 0 0 0 → 0 0 0 1 0 → 0 0 0 0 0
“1” = 1 1 1 1 1 → 1 1 0 1 1 → 1 1 1 1 1
0 1 0 1 1 → 1 1 1 1 1
0 1 0 0 1 → 0 0 0 0 0

single bit flip: $p$
logical bit flip: $P = 60p^{3} + \dots$

[Plot: error rate versus $p$, with curves $P = 60p^{3}$ and $p$]

J. von Neumann, Lectures on Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components, 1952.
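Majority-vote correction on the 5-bit repetition code can be sketched as follows. The function names are hypothetical, and the voter is assumed to be ideal (fault-free). Under that assumption, with each bit flipping independently with probability p, the logical flip probability is the chance that three or more of the five bits flip. Its leading term is 10p^3; the slide's coefficient of 60 comes from the author's own accounting, which this sketch does not attempt to reproduce.

```python
from math import comb

def majority_vote(word):
    """Return the value held by more than half of the bits (odd length assumed)."""
    return 1 if 2 * sum(word) > len(word) else 0

def correct(word):
    """Reset every bit to the majority value. Assumption: the voter is fault-free."""
    return [majority_vote(word)] * len(word)

def logical_flip_probability(p, n=5):
    """Probability that more than half of n independent bits flip,
    each with probability p (ideal-voter model, not the slide's 60p^3)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# The four corrupted words shown on the slide.
for word in ([0, 0, 0, 1, 0], [1, 1, 0, 1, 1], [0, 1, 0, 1, 1], [0, 1, 0, 0, 1]):
    print(word, "->", correct(word))

# Compare the physical and logical flip probabilities for a few values of p.
for p in (1e-1, 1e-2, 1e-3):
    print(p, logical_flip_probability(p))
```

The last two words show the limit of the code: a word with two or three flipped bits is corrected to whichever codeword is nearer, which may not be the one that was originally stored.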

protecting 1 bit

[Diagram: circuit built from MAJ gates]

MAJ = majority vote

If majority gates are error-free, then the majority voting process is error free if
