New Reliability Challenges for Electronics Deep Submicron Reliability issues
Philippe Perdu DCT/AQ/LE
EUROCALCE, May 3rd- 4th, TOULOUSE
Purpose
■ DSM lifetime has been identified as a key issue
  Coming from survey, not from direct experimentation
■ Presentation of compiled data collected through selected references (more than 100)
  Google search for CMOS + DSM + “wear out” OR lifetime: 131,000 results (limited to 1 year)
■ Purpose of this study
  Better evaluation of initial level of risk
  R&T program for lifetime evaluation (models, key parameters…), Design For Reliability, adapted burn-in and life test …
■ Today, it is just an overview
  Limited to bulk CMOS
  Focused on die rather than packaged device
Outline
■ Introduction
■ Technology trends
  Moore’s Law
  Do we need DSM devices?
■ Warning charts and reliability trends
  Lifetime Charts
  Trends
■ DSM Wear out
  Front End Of the Line (FEOL)
  Back End Of the Line (BEOL)
■ Can we estimate and manage lifetime?
  Mandatory studies
  Design For Reliability
  Trade-Off
■ Conclusion
Introduction
■ According to Moore’s Law, technology is still evolving rapidly
■ DSM lifetime is a controversial issue
  Frightening trends have been underlined
  Manufacturers maintain they can manage high performance and long lifetime for specific applications
■ DSM should be used for space applications
  We are not yet using deep DSM devices but there are some requests driven by performance needs (telecom payloads)
Moore’s Law (1)
Area Speed Power Cost
Moore’s Law (2)
OBC From 1970 to 2004 (1) ■ARGOS / EOLE (D2A) control unit, magnetic core memory
OBC From 1970 to 2004 (2) ■ Myriade (microsat)
SoC and SIP
Scaling consequences (1)
■ Constant field technology scaling (scaling factor α)
  Supply voltage: Vdd → Vdd / α
  Gate length: L → L / α
  Gate width: W → W / α
  Gate-oxide thickness: tox → tox / α
  Junction depth: Xj → Xj / α
  Substrate doping: NA → NA × α
■ “No exponential is forever…but forever can be delayed” (Gordon Moore, 2003 ISSCC Conference)
Scaling consequences (2)
Scaling limit
Do we need DSM?
[Table: generation, technology, number of transistors, surface]
■ Self-fulfilling prophecy? “it happened because everyone believed it was going to happen”
■ Device/die area W × L → (1/α)² = 0.49 (more functionality for the same size)
■ Higher frequency, more power for the same surface (High Performance) …
■ Or less consumption, smaller size for same function (C×V²×f → (1/α)² = 0.49) (Low power)
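The constant-field scaling rules and the two 0.49 factors quoted above can be checked with a short sketch. It assumes 1/α = 0.7 per generation (inferred from (1/α)² = 0.49), that capacitance scales as 1/α, and that frequency scales as α; the device values in the usage note are hypothetical.

```python
# Minimal sketch of constant-field scaling (assumptions: 1/alpha = 0.7,
# C -> C/alpha, f -> f*alpha; parameter names are illustrative only).

def scale_constant_field(params, alpha):
    """Return device parameters after one generation of constant-field scaling."""
    scaled = {}
    for name, value in params.items():
        if name == "NA":
            # Substrate doping is the only quantity multiplied by alpha.
            scaled[name] = value * alpha
        else:
            # Vdd, L, W, tox and Xj are all divided by alpha.
            scaled[name] = value / alpha
    return scaled


def area_factor(alpha):
    """Relative device area W x L after scaling."""
    return (1.0 / alpha) ** 2


def dynamic_power_factor(alpha):
    """Relative dynamic power C * V^2 * f after scaling."""
    c = 1.0 / alpha   # capacitance scales as 1/alpha (assumption)
    v = 1.0 / alpha   # supply voltage scales as 1/alpha
    f = alpha         # frequency scales as alpha (assumption)
    return c * v ** 2 * f
```

For example, `scale_constant_field({"Vdd": 1.8, "L": 180e-9, "NA": 1e17}, 1 / 0.7)` shrinks Vdd and L by 0.7 and raises NA by about 1.43, while `area_factor(1 / 0.7)` and `dynamic_power_factor(1 / 0.7)` both evaluate to about 0.49.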
Do we need DSM?
■ IC logic technology is scaling into the deep submicron regime to:
  Increase speed and function density
  Decrease power dissipation (and cost) per function
  • ATMEL 0.18 micron CMOS, 5.5 Mgates, 85 nW/MHz/Gate
  • ST 90 nm CMOS, 15 Mgates, 18 nW/MHz/Gate
■ Some space applications require VLSI with
  High density
  High speed
  Low consumption
■ Unfortunately, these satellites are expected to have a long lifetime (18 years and more)
■ It concerns only a small number of component types (even if they are key components)
Statements (1)
■ Systems are always built with available technologies
  Scaling and integration give more and more performance
  So far, VLSI has offered us intrinsic lifetimes far greater than application lifetimes.
■ This situation no longer seems to be ensured
  Estimated lifetimes are continuously decreasing with technology evolution
  Technical inputs confirm these estimations
  Other reliability-related issues are arising (burn-in, soft error)
Statements (2)
■ Some reference papers
  « Impact of semiconductor technology on aerospace electronic system design production and support », 8th Joint NASA/FAA/DoD conference on aging aircraft (February 2005)
  « Advanced Test Methodologies and Strategies for Semiconductors », IPFA 2006 (July 2006)
  « Rapport sur l’évolution du secteur des semi-conducteurs et ses liens avec les micro et nanotechnologies », office parlementaire d'évaluation des choix scientifiques et technologiques (January 2003)
  « The Impact of Technology Scaling on Lifetime Reliability », The International Conference on Dependable Systems and Networks (June 2004)
  « Is CMOS more reliable with scaling? », BAST, Pacific Northwest Test Workshop (March 2003)
  …
Estimated lifetime (1)
Estimated lifetime (2)
Estimated lifetime (3)
« Is CMOS more reliable with scaling? », TM Mak, Intel Corporation, BAST 2003
Estimated lifetime (4)
Joseph B. Bernstein, University of Maryland, IRPS 2007
Estimated lifetime (5)
Subhasish Mitra (Stanford University), 2005
Trends: ITRS 2006 (1) Reliability Technology Requirements—Near-term
Trends: ITRS 2006 (2)
■ [1] Failures during the first 4000 hours of operation (~1 year's use at 50% duty cycle). Early failures are associated with defects.
■ [2] Long term reliability rate applies for the specified lifetime of the IC.
■ [3] While the overall IC failure rate does not change with time, as the number of transistors per chip increases [from ORTC], the relative failure rate per transistor must decrease.
■ [4] As the length of interconnect per chip increases [from Interconnect Technology Requirements Tables], the failure rate per m of interconnect must decrease. Even more important for reliability is the increase in the number of vias.
Trends: ITRS 2006 (3) MPU and ASIC Interconnect Technology Requirements—Near-term Years
Trends: ITRS 2006 (4)
■ [1] Calculated by assuming that only one of every three minimum pitch wiring tracks for Metal 1 and five intermediate wiring levels are populated. The wiring lengths for each level are then summed to calculate the total interconnect length per square centimeter of active area.
■ [2] This metric is calculated by assuming that a 5 FIT (failures in time, i.e. failures per 10^9 device-hours) reliability budget is apportioned to interconnect for the highest reliability grade MPUs. This number is then divided by the total interconnect length to arrive at the FITs per meter of wiring per one square centimeter of active area.
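To make the FIT budget concrete, the sketch below converts a FIT value into a failure probability over a mission and apportions a budget over an interconnect length. It assumes a constant failure rate (exponential model); the 1000 m interconnect length in the usage note is a hypothetical figure, not an ITRS value.

```python
import math

HOURS_PER_YEAR = 8760.0


def fit_to_rate_per_hour(fit):
    """1 FIT = 1 failure per 1e9 device-hours."""
    return fit * 1e-9


def failure_probability(fit, years):
    """Probability of failure within `years`, assuming a constant failure rate."""
    lam = fit_to_rate_per_hour(fit)
    return 1.0 - math.exp(-lam * years * HOURS_PER_YEAR)


def fit_per_meter(fit_budget, interconnect_length_m):
    """Apportion a FIT budget uniformly over the interconnect length."""
    return fit_budget / interconnect_length_m
```

With these assumptions, `failure_probability(5, 18)` gives a little under 0.08% for a 5 FIT budget over an 18-year mission, and `fit_per_meter(5, 1000.0)` gives 0.005 FIT per meter for the hypothetical length.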
Trends: ITRS 2006 (5)
■ Red Brick = ITRS Technology Requirement with no known solution ■ Alternate definition: Red Brick = something that REQUIRES billions of dollars in R&D investment
Trends: Temperature (1)
■ Dissipated power density grows exponentially with technology evolution
  Scaling factors are not ideal (i.e. mobility).
  Voltage decreases more slowly than the scaling factor in order to keep sufficient noise margin and performance level (ultimate 0.3 V?)
  Leakage currents increase under the combined effect of gate leakage and subthreshold current (off state); working frequency, number of gates … trigger higher dynamic currents
  “To obtain the projected performance gain of 30% per generation, device designers have been forced to relax the device subthreshold leakage continuously from one to several nA/µm for the 250-nm node to hundreds of nA/µm for the 65-nm node. Consequently, passive power density is now a significant portion of the power budget of a high-speed microprocessor.”
■ => IC temperature is increasing
  This high temperature affects the reliability (acceleration factor)
  … and has side effects on the burn-in process
■ More critical for HP VLSI
Trends: Temperature (2)
Trends: Temperature (3)
Trends: Temperature (4)
Evidence: Power PC FIT
[Figure: Power PC FIT, 180 nm versus 65 nm]
Trends: Other parameters
■ Electrical field
  Constant (limited by oxide thickness)
  From 1 MV/cm (1970) up to 6 MV/cm (max breakdown field around 12 MV/cm for SiO2)
■ Interconnect current density: 0.1 MA/cm² (1970) to 1 MA/cm²
■ Integration: more transistors, more interconnections
■ Process variability
  Statistical effects (“atomic” scale)
  Process complexity (OPC for FEOL lithography, CMP, damascene copper for BEOL)
■ New material (metal gate, High K, Low K … and thermomechanical issues)
■ Package …
Trends: T and V effects (1)
■ Burn-in is used to get rid of early failures
  2 acceleration parameters: temperature and voltage
  Acceleration factor decreases by a factor of 10 between 180 nm and 90 nm (γ = 4, Ea = 0.7 eV)
  Burn-in issue
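One common way to combine the two acceleration parameters is an Arrhenius temperature term multiplied by an exponential voltage term. The sketch below uses that form as an assumption (the slide does not state its model), takes γ in units of 1/V, and uses hypothetical use and stress conditions.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def thermal_af(ea_ev, t_use_c, t_stress_c):
    """Arrhenius acceleration factor between use and stress temperatures (in Celsius)."""
    t_use = t_use_c + 273.15
    t_stress = t_stress_c + 273.15
    return math.exp(ea_ev / BOLTZMANN_EV_PER_K * (1.0 / t_use - 1.0 / t_stress))


def voltage_af(gamma_per_v, v_use, v_stress):
    """Exponential voltage acceleration factor (assumed model, gamma in 1/V)."""
    return math.exp(gamma_per_v * (v_stress - v_use))


def burn_in_af(ea_ev, gamma_per_v, t_use_c, t_stress_c, v_use, v_stress):
    """Combined acceleration factor: thermal term times voltage term."""
    return thermal_af(ea_ev, t_use_c, t_stress_c) * voltage_af(gamma_per_v, v_use, v_stress)
```

Under this model, applying the same relative overvoltage (say 40%) to a 1.0 V part gives a smaller voltage term than for a 1.8 V part, because the absolute overvoltage shrinks with Vdd. This illustrates the trend on the slide without reproducing its factor of 10.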
Trends: T and V effects (2)
■ Thermal runaway during burn-in
  Parts heat up -> heat increases leakage currents -> generating more heat -> thermal runaway
  With each generation the power dissipation of parts grows
■ Reduced margins
  Increased on-chip electric fields limit the ability to apply overvoltages
■ Less acceleration due to overvoltage as IC voltage is scaled down
  Assuming overvoltage is also scaled down
■ Danger that burn-in could “turn on” defects
■ From Dr. Ted Dellin, Sandia Natl. Lab.
Trends: Weibull distribution (1)
$$\lambda(t) = \frac{\beta}{c^{\beta}}\, t^{\beta-1}$$
■ Weibull distribution (Empirical generalization of the exponential distribution): 2 parameters (shape parameter, time parameter) provide a wide variety of shapes
Trends: Weibull distribution (2)
Normalized scale t/c
Trends: Weibull distribution (3)
$$\lambda(t) = \frac{\beta}{c^{\beta}}\, t^{\beta-1}$$
β smaller than 1 (failure rate decreases with time)
■ Caused by “defects” and correlates with defect-related yield loss
■ Reduced by improved quality and by screens (e.g., burn-in)
Trends: Weibull distribution (4)
$$\lambda(t) = \frac{\beta}{c^{\beta}}\, t^{\beta-1}$$
β = 1 (constant failure rate)
■ Caused by random defects, random events
■ Often used to design tests to demonstrate a given level of reliability
Trends: Weibull distribution (5)
$$\lambda(t) = \frac{\beta}{c^{\beta}}\, t^{\beta-1}$$
β greater than 1 (failure rate increases with time)
■ Intrinsic wearout depends on design, materials, process, application, environment…
■ Want the onset of intrinsic wearout to be beyond lifetime requirement
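The three regimes of the shape parameter can be illustrated with a small sketch of the Weibull failure rate λ(t) = (β / c^β) t^(β−1) and the matching reliability function R(t) = exp(−(t/c)^β). The numerical values used in the usage note are hypothetical.

```python
import math


def weibull_hazard(t, beta, c):
    """Instantaneous failure rate for shape beta and time parameter c."""
    return (beta / c ** beta) * t ** (beta - 1.0)


def weibull_reliability(t, beta, c):
    """Probability of surviving up to time t."""
    return math.exp(-((t / c) ** beta))


def hazard_trend(beta, c, t_early, t_late):
    """Classify the failure-rate trend between two times."""
    early = weibull_hazard(t_early, beta, c)
    late = weibull_hazard(t_late, beta, c)
    if late < early:
        return "decreasing (early failures)"
    if late > early:
        return "increasing (wearout)"
    return "constant (random failures)"
```

For instance, `hazard_trend(0.5, 10.0, 1.0, 5.0)` reports a decreasing rate, `hazard_trend(1.0, 10.0, 1.0, 5.0)` a constant one, and `hazard_trend(2.0, 10.0, 1.0, 5.0)` an increasing one.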
Trends: Weibull distribution (6)
E. Y. Wu et al., Microelectronics Reliability 43 (2003) 1175-1184
Trends: Weibull distribution (7)
■ Field failures have increasingly shown a constant rate of occurrence.
■ Joseph B. Bernstein, University of Maryland, IRPS 2007
Trends: Soft Defects
[Figure: I/O and core power supply voltage (V, from 5.0 down to 0.7) versus technology node (0.5 µm, 0.35 µm, 0.18 µm, 90 nm, 65 nm); annotations: increase of transient noise, increase of “white” noise]
Trends: Soft defect
■ INTEL: “Soft errors are the second biggest [reliability] concern after leakage current in submicron design”
■ Tim Dell, IBM: “for every 256 Mbytes of memory, you will get one soft error a month due to cosmic-ray-generated neutrons”
■ Link with robustness
  Ageing can reduce robustness (cumulative effects)
  Stresses below “threshold” can
  • Reduce the lifetime (ESD => TDDB, EMI)
  • Reduce the robustness (cumulative ESD)
Trends: Reliability improvements (1)
■ Manufacturers could improve it
  For instance, barrier layers could be optimized in order to limit copper diffusion in low-K dielectrics (TDDB) ...
  But it could jeopardize electromigration robustness, performance and cost, and block impurities inside low K dielectrics
■ …. They won’t do it
■ Performance and cost are the main drivers for manufacturers (consumer market); longer lifetime and reliability are not the main objectives
■ The space market is too small and does not have the economic weight to modify and direct process or design trends
Trends: Reliability improvements (2)
■ Richard Goering, EE Times, Sept. 4, 2006: Chipping away at design for Reliability
  … at 65 nanometers and below, … current densities go through the roof, exacerbating electromigration. Problems such as hot-carrier degradation loom larger. Ultra-thin gate oxides are prone to breakage. Without DFR, many 65- and 45-nm chips will ultimately break. That may not matter for a volume consumer product with a short life. But it matters a lot for chips that go into airplanes, pacemakers or cars.
■ Overview of wear out mechanisms
■ A look at DFR
Scaling: Electrical performances
■ Increase frequency, decrease propagation time
  Decrease RC (BEOL)
  • RP => metal choice: copper rather than aluminum
  • CP => low K material (porous silicon dioxide…)
  Decrease switching times: we want IDsat as high as possible
$$I_{Dsat} = \frac{C_{ox}\,\mu\,Z\,(V_{GS}-V_T)^2}{2L}, \qquad C_{ox} = \frac{\varepsilon_{ox}}{t_{ox}}$$
■ Decrease power consumption
  Ileak as small as possible (scaling worsens it)
  • Ioff
  • Igate
  nW/MHz/Gate targeted by scaling
  • (V, t, IDsat)
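The IDsat expression above can be evaluated numerically to see why thinner oxide and shorter gate length raise the drive current. This is a minimal sketch of the long-channel square-law model only (it ignores velocity saturation); the mobility, geometry and voltages in the usage note are hypothetical.

```python
EPS0_F_PER_M = 8.854e-12   # vacuum permittivity
EPS_R_SIO2 = 3.9           # relative permittivity of SiO2


def cox(t_ox_m, eps_r=EPS_R_SIO2):
    """Gate oxide capacitance per unit area, Cox = eps_ox / tox (F/m^2)."""
    return eps_r * EPS0_F_PER_M / t_ox_m


def idsat(mu, z, l, t_ox_m, vgs, vt):
    """Saturation current IDsat = Cox * mu * Z * (VGS - VT)^2 / (2 L), in amperes."""
    return cox(t_ox_m) * mu * z * (vgs - vt) ** 2 / (2.0 * l)
```

For example, with `mu=0.03` m²/V·s, `z=1e-6` m, `l=180e-9` m, `t_ox_m=4e-9` m, `vgs=1.8` V and `vt=0.5` V, halving either `t_ox_m` or `l` doubles the returned current in this simple model.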
ITRS 2006 identified challenges
Difficult Challenges ≥ 32 nm: Summary of Issues
■ High-κ gate dielectrics with metal gate electrodes
  • Dielectric breakdown characteristics (hard and soft breakdown)
  • Transistor stability (charge trapping, work function stability, metal ion drift or diffusion)
  • Impact of implantation
  • Metal gate thermomechanical issues (coefficient of thermal expansion mismatch)
■ Copper/Low-κ interconnects
  • Stress migration of Cu vias and lines
  • Cu via and line electromigration performance
  • Impact of degradation of properties with lowering k (strength, adhesion, thermal conductivity, coefficient of thermal expansion)
  • Time Dependent Dielectric Breakdown of the Cu/low-κ system
  • Impact of packaging
■ Negative bias temperature instability
  • Degradation of p channel current
  • Dependence on scaling and nitrogen in gate insulator
  • Impact on burn-in
FEOL: known wear out mechanisms
■ Reliability at transistor level
  Hot Carrier Degradation (HCI)
  Gate Oxide Degradation
  • Gate Oxide Breakdown
  • Time Dependent Dielectric Breakdown (TDDB)
  Negative Bias Temperature Instability (NBTI)
■ Scaling down channel length (L) and gate oxide thickness (Tox)
  E-fields in oxide and channel increase
  Device reliability issues (HCI and TDDB) become severe
FEOL: Gate Oxide Breakdown (1) ■ Dielectric Breakdown Mechanism
FEOL: Gate Oxide Breakdown (2) ■ Hard breakdown Current flowing through short in oxide raises temperature and electrode melts and diffuses into oxide Low resistance ohmic path through gate insulator Definitely an IC failure
■ Soft breakdown Less power dissipation results in less thermal effects High resistance ohmic path through gate insulator Increase in noise IC may still function after soft breakdown
■ Trends
  Occurs more frequently in thinner oxides and at lower voltages; can happen very early in DSM device life
  Can induce side effects (power dissipation, soft defects due to increased noise level)
FEOL: Hot Electron Injection (1)
■ Electrons are injected in the channel (NMOS ON)
■ Impact ionization creates electron-hole pairs.
■ Holes drift to substrate (Isub)
■ Hot electrons create damage to the oxide
■ Isub is a measure of H-C generation rate
■ Injected carriers produce damage that reduces transistor current
  Eventually, device becomes too slow
  Lifetime issue
[Figure labels: CHE: Channel Hot Electron; DAHC: Drain Avalanche Hot Carrier]
FEOL: Hot Electron Injection (2)
■ Was an NMOS problem
  N channel: increase in substrate currents
■ With scaling it also becomes a PMOS issue
  P channel: increase in off-state leakage current
FEOL: Mission Profile Dependence (1)
■ DRAM 90 nm: 1 GB, DDR2, 266 MHz, Vdd = 1.8 V, temperature = 75 °C, simulation result
FEOL: Mission Profile Dependence (2) ■ If one unique bit is accessed constantly, HCI failure will dominate.
■ In addition, a 10-year DC lifetime is hard to achieve in the deep submicron regime
FEOL: NBTI (1)
■ Stress conditions
  Negative electrical field over gate oxide
  p-MOS device in inversion
  Elevated temperature
■ Damages
  Stress-induced interface states
  Trapping
  Fixed positive oxide charges
■ Electrical effects
  Increase of the absolute value of Vth
  Decrease of the drain current
  Decrease of carrier mobility
■ Still under investigation: NBTI importance could be related to Si-H bonds in nitrided gate oxide
FEOL: NBTI (2)
■ Comparison of PMOS NBTI lifetimes vs. NMOS and PMOS HCI, 0.13 µm technology
■ C.H. Jeon, IEEE Integrated Reliability 2002 Workshop, final report, pp. 130-132
[Figure panels: 110°C, 150°C]
FEOL: NBTI (3)
■ From Joseph B. Bernstein (University of Maryland/Bar-Ilan University)
FEOL: New materials (1)
■ Nitrided gate oxide
  Boron penetration is a problem for ultra-thin oxides: it lowers TDDB lifetime
  Nitrogen doping limits boron penetration and improves oxide reliability
  But it triggers more NBTI issues!
FEOL: New materials (2)
■ High K dielectric
  Equivalent Oxide Thickness: EOT = Tox = THighK × (3.9/K)
  HfO2 (Keff ~ 15 - 30); HfSiOx (Keff ~ 12 - 16); La-based in future
  Materials, process, integration issues to solve (thermal stability, thermal & chemical compatibility, interface with Si substrate and gate electrode)
  Potential side effect (radiation robustness)
[Figure: gate stack with SiO2 of thickness Tox versus high-k material of thickness TK, each between electrode and Si substrate]
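The equivalent oxide thickness relation can be turned into a two-line calculation. The sketch below applies EOT = THighK × (3.9/K) and its inverse; the thickness and K values in the usage note are hypothetical picks inside the ranges quoted on the slide.

```python
K_SIO2 = 3.9  # relative permittivity of SiO2


def equivalent_oxide_thickness(t_high_k_nm, k):
    """EOT (nm) of a high-k layer of physical thickness t_high_k_nm and permittivity k."""
    return t_high_k_nm * K_SIO2 / k


def physical_thickness(eot_nm, k):
    """Physical high-k thickness (nm) that gives the requested EOT."""
    return eot_nm * k / K_SIO2
```

For example, a 5 nm layer with K = 19.5 behaves electrically like about 1 nm of SiO2, which is the point of the approach: the same gate capacitance with a physically thicker film.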
FEOL: New materials (3)
■ Polysilicon depletion in gate electrode
  Tox(electric) = Tox + Wpoly depletion
  Decreased C
  Reduced Idsat
■ Potential solution
  Wpoly depletion ~ (poly doping)^(-0.5)
  Increase poly doping to reduce Wpoly depletion with scaling, but max. poly doping is limited
  Poly depletion becomes more critical with Tox scaling
  Metal gate electrodes
  Induce new reliability issues
[Figure: polysilicon gate with depletion layer (Wd,Poly), gate oxide, inversion layer, substrate]
FEOL: New materials (4)
■ Stressed Si, SiGe
  Increase the mobility
$$I_{Dsat} = \frac{C_{ox}\,\mu\,Z\,(V_{GS}-V_T)^2}{2L}$$
FEOL: New materials (5)
■ Stress engineering can deliver incredible performance gain through mobility enhancement
■ It can also degrade device reliability (NBTI)
  Even though compressively stressed silicon nitride films could significantly increase mobility in the pFET channel, excess hydrogen in the nitride could degrade NBTI
  Hwa Sung Rhee, Samsung Electronics
BEOL: Electromigration
■ Metal atoms can migrate due to currents and/or stresses
■ Electromigration
  Requires an electrical current
  Atoms move due to collisions with electrons
■ Stress migration
  Atoms move to relieve compressive stresses
  Stress gradients from processing and/or electromigration
BEOL: Cu and Low K (1)
BEOL: Cu and Low K (2)
■ Low k dielectrics present many processing and reliability challenges
■ Compared to SiO2, low k dielectrics are less robust
  Weaker: makes chemical mechanical polishing more difficult
  Porous: can trap process gases and chemicals
  Poorer adhesion: can lead to reliability problems
  Cu diffuses more easily along surfaces than through the bulk
  Especially under top cap layer
  Surface effects are larger in thinner lines
  Thinner lines + lower k dielectric
  • Weaker adhesion
  • Decrease in EM
■ May have to be implemented in a stack using more robust, higher k dielectrics to protect the low k dielectrics
  Increases the effective dielectric constant, reducing the speed
BEOL: Cu and Low K (3)
[Figure: relative time to failure versus line width (μm)]
■ Lifetime worsens with scaling
■ Sato & Ogawa, 2001 Interconnect Tech. Conf.
BEOL: Cu and Low K (4)
[Figure axis: dielectric constant]
■ From Ted Dellin’s IRPS tutorial
■ Proposed low k interlevel dielectrics have reduced thermal conductivity and strength
■ Other things that get worse with lower k: interfacial adhesion, electrical breakdown and coefficient of thermal expansion mismatch
BEOL: Cu and Low K (5) ■ Packaging Challenges: The Poor Mechanical Properties of Low k Dielectrics
BEOL: Cu and Low K (6)
■ Leakage currents (and TDDB) between Cu lines degrade as k is lowered
■ Systematic reduction in dielectric breakdown strength with lower k
  Copper extruding into low k makes things worse
  Sensitive to process damage, porosity, …
BEOL: Cu and Low K (7)
■ Size effect: as linewidths shrink below around 100 nm, they come close to the mean free path of electrons in copper (39 nm)
  Increased resistivity of copper caused by electron scattering at the surface of the line and at grain boundaries.
■ The magnitude of the size effect on interconnect delay has been overestimated
  For the next few device generations, the size effect can be effectively managed through interconnect design
Mandatory studies
■ DSM lifetime has to be taken into account early
■ DSM reliability parameters have to be fine-tuned for lifetime simulation purposes
  Technology dependent
  Application dependent (mission profile)
■ Manufacturer involvement is a critical issue
  We need them
  High reliability / long lifetime is a small market
DFR: example (1) ■ Design for Reliability Example: Layout of Cu Lines and Vias
DFR: example (2)
■ CMOS device reliability – dynamic NBTI recovery
  Lifetime can improve by a factor of 10 – 30
  Recovery is always the same fraction in every cycle.
■ S. Chakravarthi, IEEE IRPS, 2004, pp. 173
DFR: technology tolerance (1)
■ Methodologies for Adaptation to Process Variations, Manufacturing Defects, and Transient Errors in Scaled CMOS (Georgia Institute of Technology, August 2007)
■ Variation-Tolerant Design
  Increase of process parameter variations in CMOS technologies
  Variation-Aware Placement
  • Huge leakage variation problem addressed by looking at the effects that gate placement has on leakage distribution (clusters)
  • Algorithms for the placement of gates in a dual-Vt circuit to mitigate the large leakage variation by reducing the variation caused by correlated within-die process variation
  • Sub-threshold leakage variation reduced by an average of 17% and a maximum of 31%
  • Obtained with a small increase in wire length
  Post-Manufacture Tuning Architecture
  • Tunable gates, supply voltage, or body bias
  • Deal with the delay and leakage variation
  • Self-test/self-adaptation can improve the delay yield by 40%
DFR: technology tolerance (2)
■ Defect-Tolerant CMOS Gate Design
  Significant defectivity due to manufacturing defects, random process variations, and wear-out
  Future circuits must be equipped with a significant defect-tolerance capability
  Little delay overhead (less than 6%) but incurs leakage power dissipation overhead (less than 20%) in the presence of defects.
■ Probabilistic Checksum-Based Error
  Relaxing the requirement of 100% correctness for devices and interconnects may dramatically reduce costs of manufacturing, verification and test
  Hard to achieve 100% correctness because of an increase in transient error rate
  SNR improvements (up to 13 dB) can be obtained in the presence of soft errors
DFR: Dual Vt, Tox … (1) ■ Leakage Optimization using Dual Threshold Voltage ■ Off-State Leakage Current Subthreshold Leakage (ISUB) Gate Induced Drain Leakage (IGIDL) Edge Directed Tunneling Leakage (IEDT) Band to Band Tunneling Leakage (IBTBT)
■ On-State Leakage Current Gate Leakage (IGON)
DFR: Dual Vt, Tox … (2) ■ Gate Delay and Leakage Tradeoff
$$T_{pd} \propto \frac{C\,V_{dd}}{(V_{dd} - V_{th})^{\alpha}}$$
■ Propagation delay has the above dependence on Vt Higher Vt means slower gate (larger propagation delay) But higher Vt means smaller subthreshold leakage (exponential dependence!)
■ Trade-off between delay and leakage is made at design level (fast gate or low consumption gate)
■ The other possibility is to increase gate oxide thickness (Tox)
  Trade-off between delay (Idsat driven) and leakage
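The delay versus leakage trade-off can be sketched numerically. The delay function follows the proportionality above; the leakage function assumes the usual exponential subthreshold law, written here as one decade of current per `swing_v_per_decade` volts of Vt. The exponent 1.3 and the 100 mV/decade swing are hypothetical values, and the α in this formula is the delay-model exponent, not the scaling factor used earlier in the deck.

```python
def relative_delay(vdd, vth, alpha=1.3, c=1.0):
    """Relative propagation delay, Tpd proportional to C * Vdd / (Vdd - Vth)^alpha."""
    return c * vdd / (vdd - vth) ** alpha


def relative_subthreshold_leakage(vth, swing_v_per_decade=0.1):
    """Relative subthreshold leakage: one decade less per `swing` volts of Vt (assumed law)."""
    return 10.0 ** (-vth / swing_v_per_decade)


def compare_vt(vdd, vth_low, vth_high):
    """Return (delay ratio, leakage ratio) of the high-Vt gate relative to the low-Vt gate."""
    delay_ratio = relative_delay(vdd, vth_high) / relative_delay(vdd, vth_low)
    leak_ratio = (relative_subthreshold_leakage(vth_high)
                  / relative_subthreshold_leakage(vth_low))
    return delay_ratio, leak_ratio
```

With `compare_vt(1.2, 0.3, 0.4)`, the high-Vt gate is moderately slower while its leakage drops by about a factor of ten, which is why dual-Vt designs keep low-Vt gates only on critical paths.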
Dual Vt Results
■ Results for ISCAS benchmark circuits [Wei, et al., DAC98]
DFR: Low Voltage (1) ■ Voltage derating Many delay-causing defects have much greater impact at reduced VDD. • Voltage variation has a much greater impact on delay at lower VDD than high VDD. • Latent defects likely to be more pronounced at low VDD – may be larger numbers than 1-2%
DFR: Low Voltage (2)
■ Reliability Implications Reduced VDD will reduce some wear-out mechanisms • Oxide breakdown • NBTI • Some thermal effects (due to reduced heat)
Others will get worse • Latent delay defects • Some thermal effects (due to increased thermal cycles)
■ Rob Aitken, IOLTS 2006
DFR: Lifetime Reliability-Aware µP (1)
■ Ensuring long processor lifetimes by limiting failures due to wear-out related hard errors is a critical requirement for all microprocessor manufacturers
  Average increase of 316% in processor failure rates when scaling from 180 nm to 65 nm
  Some performance and/or die area (and resultant cost) will have to be sacrificed for reliability.
■ Microarchitecture-level model RAMP
  Electromigration, stress migration, time dependent dielectric breakdown, and thermal cycling, + NBTI
  Dynamically tracks processor lifetime reliability, accounting for the behavior of the executing application.
■ Dynamic reliability management (DRM)
  Processor scaling and increasing power densities
  Increasing transistor count
  • More transistors result in more failures, which results in lower processor lifetimes.
  • Hence, not only is the reliability of individual transistors decreasing, the number of transistors that can fail is also increasing.
DFR: Lifetime Reliability-Aware µP (2)
■ Architectural awareness of lifetime reliability
■ Workload
■ Over-designed processors
  Current reliability qualification is based on worst case temperature and utilization; however, most applications will run at lower temperature and utilization, resulting in higher reliability and longer processor lifetimes than required.
  If the processor cooling solution can handle it, this excess reliability can be utilized by the processor to increase application performance.
■ Under-designed processors
  Beneficial to commodity processors where increasing yield and reducing cooling costs would have significant impact on profits, even if they incur some performance loss.
DFR: Lifetime Reliability-Aware µP (3)
■ J. Srinivasan, University of Illinois, P. Bose, IBM T.J. Watson Research Center
■ Two methods for structural redundancy to enhance lifetime reliability
■ Structural Duplication
  Certain redundant microarchitectural structures added to the processor
  Spare structures can be turned on when the original structure fails, increasing the processor’s lifetime
■ Graceful Performance Degradation (GPD)
  Replicated structures that are used for increasing performance for some high parallelism applications (modern processors)
  Replicated structures are not required for functional correctness, so the processor can shut down a failed structure and still maintain functionality, thereby increasing lifetime.
  A processor with GPD would fail only when all redundant structures of a type fail.
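The benefit of GPD-style redundancy can be sketched with elementary reliability arithmetic. The example assumes independent failures and a hypothetical per-structure survival probability; it is an illustration of the "fails only when all copies fail" rule, not the model used by the authors cited above.

```python
def survival_series(survival_probs):
    """Survival probability when every listed structure must survive."""
    result = 1.0
    for p in survival_probs:
        result *= p
    return result


def survival_gpd(p_structure, n_copies):
    """Survival probability when the unit fails only if all n copies fail."""
    return 1.0 - (1.0 - p_structure) ** n_copies


def processor_survival(structure_types):
    """structure_types: list of (per-copy survival probability, number of copies)."""
    return survival_series([survival_gpd(p, n) for p, n in structure_types])
```

For example, with a per-copy survival probability of 0.9 over the mission, `survival_gpd(0.9, 1)` is 0.9 while `survival_gpd(0.9, 2)` rises to about 0.99, at the cost of reduced performance once one copy has been shut down.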
DFR: Lifetime Reliability-Aware µP (4)
■ Main drivers are cost and performance
■ Done to target a minimal acceptable lifetime (7 years) and is $$$$
DFR: FLAW (1)
■ Altera Stratix III CMOS 65 nm
  PowerPlay (development tool Quartus II version 6.1) automatically analyzes the design
  • 0.9 V for low power
  • 1.1 V for high performance and critical path
■ Xilinx FPGAs
  Spartan-3 / UMC-12A 90 nm qualification report
  Claims more than 10 years lifetime
■ FPGA Lifetime Awareness
  FPGA is a “low volume” model
  It also targets the military and space markets
  FPGA manufacturers are involved in proving long lifetime
DFR: FLAW (2)
■ Test structure (65 nm)
  Look Up Tables (LUTs) in FPGAs (made with 16 x 1 multiplexer)
  Studied mechanisms: TDDB, EM, HC
■ Region Constrained Placement for Reliability (RCPFR)
  Periodic re-mapping of the design to less used regions to increase the lifetime of the device
Conclusions (1)
■ Lifetime is a real issue for long term use of high performance devices
  Lifetime decreases for the same surface with technology scaling (more transistors, higher frequency)
  Thermal issues, high electrical field
  Derating after design is difficult to manage
  • Lower voltage has to be decided at design level (optimization)
  • Cooling is always possible (HCI?)
■ Questionnaire
  What is the expected gain when using DSM? (low power or high performance?)
  What are the acceptable trade-offs?
■ Only a review approach
Conclusions (2)
■ From Testing-in Reliability
  Use of end-of-line testing and screening to measure and ensure reliability
  Multiple problems
  • Large number of samples that need to be tested
  • By the time problems are discovered, a lot of product has been affected
■ To Building-in Reliability
  Control reliability by process control and control of the design process
  Emphasis on preventing problems
  Testing is used to validate physical/statistical models and to find critical process variables
  Customer has to move from demanding explicit reliability demonstrations to confidence that reliability processes are under control