Microprocessor Reliability
![Page 1: Microprocessor Reliability](https://reader033.fdocuments.net/reader033/viewer/2022061519/56816411550346895dd5bcdc/html5/thumbnails/1.jpg)
1
Microprocessor Reliability
Robert Pawlowski
ECE 570 – 2/19/2013
![Page 2: Microprocessor Reliability](https://reader033.fdocuments.net/reader033/viewer/2022061519/56816411550346895dd5bcdc/html5/thumbnails/2.jpg)
2
Reliability
• Involves different aspects of a processor that can affect performance and functionality.
  – Ultimately can reduce the lifetime of the processor.
• Issues typically manifest themselves at the device level.
  – Solutions can be implemented at multiple design levels.
![Page 3: Microprocessor Reliability](https://reader033.fdocuments.net/reader033/viewer/2022061519/56816411550346895dd5bcdc/html5/thumbnails/3.jpg)
3
Why the concern?
• Operating at the highest frequencies and/or lowest power possible increases sensitivity to process-related variabilities.
  – Gate length/doping concentration variations
  – Temperature
  – Supply voltage droops
• This decreases processor yield.
• Decreasing device sizes → increased effect of external issues
![Page 4: Microprocessor Reliability](https://reader033.fdocuments.net/reader033/viewer/2022061519/56816411550346895dd5bcdc/html5/thumbnails/4.jpg)
4
Outline
• Error Classification
• Hard Errors
• Soft Errors
• Sources of radiation
• Device/Circuit approaches
• Architectural approaches
• Error detection
• Error correction
• System level impact
![Page 5: Microprocessor Reliability](https://reader033.fdocuments.net/reader033/viewer/2022061519/56816411550346895dd5bcdc/html5/thumbnails/5.jpg)
5
Processor Error Classification
• Hard Errors will result in permanent processor failure.
• Processor lifetime is inversely proportional to hard error rate.
• Soft Errors do not permanently damage the device.
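
The inverse relation between lifetime and hard error rate can be sketched numerically. This is a minimal example that assumes a constant failure rate expressed in FIT (failures per 10^9 device-hours); the slide does not state a failure model, and the function names and FIT values are illustrative only.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def mttf_hours(fit: float) -> float:
    """Mean time to failure in hours for a constant failure rate in FIT.

    1 FIT = 1 failure per 1e9 device-hours, so MTTF = 1e9 / FIT.
    The constant-rate model is an assumption, not taken from the slide.
    """
    if fit <= 0:
        raise ValueError("FIT rate must be positive")
    return 1e9 / fit


def mttf_years(fit: float) -> float:
    """Same quantity expressed in years."""
    return mttf_hours(fit) / HOURS_PER_YEAR


if __name__ == "__main__":
    for fit in (1000, 2000, 4000):  # illustrative rates
        print(f"{fit} FIT -> MTTF of about {mttf_years(fit):.1f} years")
```

Doubling the FIT rate halves the expected lifetime, which is the proportionality the slide refers to.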
![Page 6: Microprocessor Reliability](https://reader033.fdocuments.net/reader033/viewer/2022061519/56816411550346895dd5bcdc/html5/thumbnails/6.jpg)
6
Hard Errors
• Extrinsic failures
  – Caused by process and manufacturing defects
  – Occur with decreasing rate over time
  – No impact from micro-architecture
• Intrinsic failures
  – Related to processor wear-out
  – Occur with increasing rate over time
  – Related to wafer packaging, process parameters, and processor design.
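
The decreasing extrinsic rate and increasing intrinsic rate are often pictured together as a bathtub curve. The sketch below models each with a Weibull hazard function (shape below 1 for a falling rate, above 1 for a rising rate). The Weibull choice and all parameter values are assumptions made for illustration; the slides do not name a distribution.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Instantaneous failure rate h(t) = (k / s) * (t / s) ** (k - 1)."""
    if t <= 0 or shape <= 0 or scale <= 0:
        raise ValueError("t, shape and scale must be positive")
    return (shape / scale) * (t / scale) ** (shape - 1)


def total_hazard(t: float,
                 extrinsic=(0.5, 10.0),   # hypothetical: shape < 1, rate falls
                 intrinsic=(3.0, 10.0)):  # hypothetical: shape > 1, rate rises
    """Sum of the extrinsic (defect) and intrinsic (wear-out) hazards."""
    return weibull_hazard(t, *extrinsic) + weibull_hazard(t, *intrinsic)


if __name__ == "__main__":
    for years in (0.1, 1, 3, 10, 20):
        print(f"t={years:>4} y  hazard={total_hazard(years):.4f}")
```

With these hypothetical parameters the combined rate is high early on, dips in mid-life, and climbs again as wear-out dominates.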
![Page 7: Microprocessor Reliability](https://reader033.fdocuments.net/reader033/viewer/2022061519/56816411550346895dd5bcdc/html5/thumbnails/7.jpg)
7
Hard Errors
![Page 8: Microprocessor Reliability](https://reader033.fdocuments.net/reader033/viewer/2022061519/56816411550346895dd5bcdc/html5/thumbnails/8.jpg)
8
Soft Errors
• Occur in both memory and logic
  – External radiation is the main issue in memory
    • Alpha particles
    • High energy neutrons
    • Thermal neutrons
• Different causes of transient errors in logic
  – External radiation
  – Supply voltage droop
    • Power supply fluctuations
  – Ground bounce, cross-talk
  – Process variation, temperature
  – Affect delay of computational paths
![Page 10: Microprocessor Reliability](https://reader033.fdocuments.net/reader033/viewer/2022061519/56816411550346895dd5bcdc/html5/thumbnails/10.jpg)
10
Radiation-Induced Soft Errors
• Ionized particle strike causing a state change
• No permanent damage (unlike a hard error)
• Combinational logic – Single Event Transients (SET)
• Memory cells – Single Bit Upset (SBU), Multi Bit Upset (MBU)
• Three causes of soft errors
  – Alpha particles
  – Thermal neutrons
  – High-energy neutrons
![Page 11: Microprocessor Reliability](https://reader033.fdocuments.net/reader033/viewer/2022061519/56816411550346895dd5bcdc/html5/thumbnails/11.jpg)
11
Alpha-Particles
• Emitted from impurities in packaging materials.
• Create electron-hole pairs through direct ionization
• Range for a 10 MeV particle < 100 µm
  – Typical energy 4-9 MeV
• Improved manufacturing trends → reduced effect
  – Purified materials
  – Shielding layers
![Page 12: Microprocessor Reliability](https://reader033.fdocuments.net/reader033/viewer/2022061519/56816411550346895dd5bcdc/html5/thumbnails/12.jpg)
12
Neutrons
• Result of cosmic ray reactions with the atmosphere
• High-energy neutrons react with chip materials.
• Concrete is the only shielding material
  – 1.4x lower flux/foot of thickness
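
The shielding figure can be turned into a quick estimate. This sketch assumes the 1.4x reduction compounds with every additional foot of concrete (exponential attenuation); the slide gives only the per-foot factor, so the compounding behaviour and the function names are assumptions.

```python
import math

ATTENUATION_PER_FOOT = 1.4  # per-foot flux reduction quoted on the slide


def shielded_flux(flux0: float, thickness_ft: float) -> float:
    """Neutron flux behind `thickness_ft` feet of concrete (assumed exponential)."""
    return flux0 / ATTENUATION_PER_FOOT ** thickness_ft


def thickness_for_reduction(factor: float) -> float:
    """Feet of concrete needed to cut the flux by `factor`."""
    if factor < 1:
        raise ValueError("reduction factor must be >= 1")
    return math.log(factor) / math.log(ATTENUATION_PER_FOOT)


if __name__ == "__main__":
    print(f"flux after 3 ft: {shielded_flux(1.0, 3):.3f} of unshielded")
    print(f"10x reduction needs {thickness_for_reduction(10):.1f} ft")
```

Under this assumption a tenfold reduction takes nearly seven feet of concrete, which shows why shielding is rarely a practical fix.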
![Page 13: Microprocessor Reliability](https://reader033.fdocuments.net/reader033/viewer/2022061519/56816411550346895dd5bcdc/html5/thumbnails/13.jpg)
13
Neutrons
• Thermal neutrons (<<< 1 MeV) react with the Boron-Doped Phosphosilicate Glass (BPSG) dielectric layer.
  – Produce ionized particles that can cause soft errors
• Solution → remove BPSG from advanced processes
• Mostly solved – SEUs still found in 45nm, 90nm
![Page 15: Microprocessor Reliability](https://reader033.fdocuments.net/reader033/viewer/2022061519/56816411550346895dd5bcdc/html5/thumbnails/15.jpg)
15
Device-level solutions
• Larger device sizes → larger capacitance
  – Increases the amount of charge necessary to flip a bit (critical charge)
• Multiple VT design
  – Sensitivity to variation at low VDD may limit effectiveness.
• Body biasing is also common to both radiation hardening and variation tolerance
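
The link between device size and critical charge in the first bullet can be illustrated with the common first-order approximation Qcrit ≈ Cnode × VDD. The approximation, the function name, and the capacitance and voltage values below are assumptions for illustration; the slides give no formula.

```python
def critical_charge_fc(node_capacitance_ff: float, vdd: float) -> float:
    """First-order critical charge in femtocoulombs.

    Assumes Qcrit ~= C_node * VDD (fF * V = fC). This ignores the
    restoring current of the cell, so it is only a rough estimate.
    """
    if node_capacitance_ff <= 0 or vdd <= 0:
        raise ValueError("capacitance and VDD must be positive")
    return node_capacitance_ff * vdd


if __name__ == "__main__":
    # Hypothetical storage nodes: upsizing the device raises the capacitance.
    for cap_ff in (1.0, 2.0, 4.0):
        for vdd in (1.0, 0.5):
            q = critical_charge_fc(cap_ff, vdd)
            print(f"C={cap_ff} fF, VDD={vdd} V -> Qcrit ~ {q} fC")
```

The same estimate also shows why low-VDD operation makes cells easier to upset: halving VDD halves the charge needed to flip the bit.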
![Page 16: Microprocessor Reliability](https://reader033.fdocuments.net/reader033/viewer/2022061519/56816411550346895dd5bcdc/html5/thumbnails/16.jpg)
16
Circuit-level solutions
• DICE (dual interlocked storage cell)
  – Used for SRAM, FFs, latches
• Built-in current sensors on supply lines of memory cells.
![Page 18: Microprocessor Reliability](https://reader033.fdocuments.net/reader033/viewer/2022061519/56816411550346895dd5bcdc/html5/thumbnails/18.jpg)
18
Modular redundancy
• Dual Modular Redundancy
• Triple Modular Redundancy
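
The difference between the two schemes can be shown with a small behavioural sketch: DMR can only detect a mismatch between two copies, while TMR can mask a single faulty copy with a bitwise majority vote. The function names are hypothetical and integers stand in for the module outputs.

```python
def dmr_check(a: int, b: int) -> tuple[int, bool]:
    """Dual modular redundancy: return (output, error_detected).

    A mismatch is detected, but there is no way to tell which copy is wrong.
    """
    return a, a != b


def tmr_vote(a: int, b: int, c: int) -> int:
    """Triple modular redundancy: bitwise majority of three copies."""
    return (a & b) | (a & c) | (b & c)


if __name__ == "__main__":
    good = 0b1010
    upset = good ^ 0b1000          # one copy suffers a single-bit upset
    print(dmr_check(good, upset))  # error flagged, not corrected
    print(bin(tmr_vote(good, good, upset)))  # faulty copy is outvoted
```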
![Page 19: Microprocessor Reliability](https://reader033.fdocuments.net/reader033/viewer/2022061519/56816411550346895dd5bcdc/html5/thumbnails/19.jpg)
19
Redundant Circuits
• Redundancy increases area/power
• DMR/TMR in sub/near-VT
– Timing variation between circuits increases
• Utilization of redundant lanes for parallel operation can increase throughput at low-VDD
![Page 20: Microprocessor Reliability](https://reader033.fdocuments.net/reader033/viewer/2022061519/56816411550346895dd5bcdc/html5/thumbnails/20.jpg)
20
Self-Checking Circuits
• Partition circuit into smaller blocks
  – Error checker for each block
• Use error detection codes
  – Berger codes
  – Arithmetic codes
• Increases circuit delay for error computation
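
As an illustration of the error detection codes listed above, a Berger code appends the count of 0 bits in the data word as its check symbol, which detects any unidirectional error (all flips 0→1 or all 1→0). The sketch below is a software model with hypothetical function names, not a description of any particular checker circuit.

```python
def berger_check(data: int, width: int) -> int:
    """Berger check symbol: the number of 0 bits in the `width`-bit data word."""
    ones = bin(data & ((1 << width) - 1)).count("1")
    return width - ones


def berger_ok(data: int, check: int, width: int) -> bool:
    """True if the stored check symbol still matches the data word."""
    return berger_check(data, width) == check


if __name__ == "__main__":
    word, width = 0b1011, 4
    check = berger_check(word, width)     # one zero bit in 1011
    corrupted = word | 0b0100             # a 0 -> 1 upset
    print(berger_ok(word, check, width))       # consistent
    print(berger_ok(corrupted, check, width))  # mismatch detected
```

Computing and comparing the check symbol is extra logic on the output path, which is the added delay the last bullet refers to.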
![Page 21: Microprocessor Reliability](https://reader033.fdocuments.net/reader033/viewer/2022061519/56816411550346895dd5bcdc/html5/thumbnails/21.jpg)
21
Circuit-Level Speculation
• Uses approximated circuit implementation
  – Goal is to reduce critical path
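
One way to picture an approximated implementation is an adder whose carry chain is cut short: each sum bit looks at only a few lower bit positions, so the critical path is shorter, and a separate check flags the rare cases where a long carry chain makes the guess wrong. The slide does not name a specific circuit; this windowed-carry adder, its names, and its parameters are a hypothetical illustration.

```python
def speculative_add(a: int, b: int, width: int = 16, window: int = 4) -> int:
    """Add with carries computed from only the `window` bits below each position."""
    result = 0
    for i in range(width):
        lo = max(0, i - window)
        mask = (1 << (i - lo)) - 1
        # Carry into bit i, assuming no carry enters bit `lo`.
        carry = (((a >> lo) & mask) + ((b >> lo) & mask)) >> (i - lo)
        bit = ((a >> i) ^ (b >> i) ^ carry) & 1
        result |= bit << i
    return result


def add_with_check(a: int, b: int, width: int = 16, window: int = 4):
    """Return (speculative_sum, mis_speculated).

    The exact sum stands in for the slower checking logic.
    """
    exact = (a + b) & ((1 << width) - 1)
    approx = speculative_add(a, b, width, window)
    return approx, approx != exact


if __name__ == "__main__":
    print(add_with_check(3, 5))        # short carry chain: speculation holds
    print(add_with_check(0x00FF, 1))   # long carry chain: flagged as wrong
```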
![Page 22: Microprocessor Reliability](https://reader033.fdocuments.net/reader033/viewer/2022061519/56816411550346895dd5bcdc/html5/thumbnails/22.jpg)
22
Tunable Replica Circuits
• Mirrors delay of critical path
• Monitors for errors over voltage/frequency changes
![Page 23: Microprocessor Reliability](https://reader033.fdocuments.net/reader033/viewer/2022061519/56816411550346895dd5bcdc/html5/thumbnails/23.jpg)
23
Timing Speculation
[Figure: Razor flip-flop schematic (DFF, shadow latch on delayed clk, 0/1 mux, error output) with a timing diagram of clk, delayed clk, data in, error, and data out for values D0, D1, D2]
• Razor timing error detection
  – Designed for transient faults
  – Effective against SETs and SBUs on flip-flops
• Requires error recovery
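
The detection idea can be sketched behaviourally: the main flip-flop samples on clk, the shadow latch samples on the delayed clock, and a mismatch between the two raises the error signal. This is a simplified timing model, not a gate-level description; the function name and the timing parameters are assumptions.

```python
def razor_sample(prev_value: int, new_value: int, arrival_time: float,
                 clk_period: float, shadow_delay: float):
    """Return (main_q, shadow_q, error) for one clock cycle.

    Data arriving after the clock edge is missed by the main flip-flop
    (it keeps the previous value) but is still caught by the shadow latch,
    provided it arrives before the delayed clock edge.
    """
    if arrival_time > clk_period + shadow_delay:
        raise ValueError("data also misses the shadow latch; not detectable")
    main_q = new_value if arrival_time <= clk_period else prev_value
    shadow_q = new_value
    return main_q, shadow_q, main_q != shadow_q


if __name__ == "__main__":
    print(razor_sample(0, 1, arrival_time=0.9, clk_period=1.0, shadow_delay=0.3))
    print(razor_sample(0, 1, arrival_time=1.2, clk_period=1.0, shadow_delay=0.3))
```

When the error flag is raised, the shadow value is the correct one, so recovery amounts to restoring it into the pipeline and paying the recovery penalty.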
![Page 25: Microprocessor Reliability](https://reader033.fdocuments.net/reader033/viewer/2022061519/56816411550346895dd5bcdc/html5/thumbnails/25.jpg)
25
Error Recovery Options in Scalar Processors
• Clock Gating:
  – Global error signal
  – Clock gating
  – 1-cycle penalty
![Page 26: Microprocessor Reliability](https://reader033.fdocuments.net/reader033/viewer/2022061519/56816411550346895dd5bcdc/html5/thumbnails/26.jpg)
26
Error Recovery Options in Scalar Processors
• Multiple Issue:
  – Error signals propagated to control unit
  – Instructions must be flushed
  – Error instruction then replayed
  – 2N-cycle penalty
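
To compare the cost of the recovery schemes, the per-error penalty can be folded into an average cycles-per-instruction figure. This is a back-of-the-envelope sketch: the error rates are made up, and N is not defined on the slides, so it is treated here as a free parameter.

```python
def effective_cpi(base_cpi: float, error_rate: float, penalty_cycles: float) -> float:
    """Average CPI when a fraction `error_rate` of instructions needs recovery."""
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")
    return base_cpi + error_rate * penalty_cycles


if __name__ == "__main__":
    n = 5              # hypothetical value for N
    error_rate = 0.01  # hypothetical: 1% of instructions hit a timing error
    gating = effective_cpi(1.0, error_rate, penalty_cycles=1)
    replay = effective_cpi(1.0, error_rate, penalty_cycles=2 * n)
    print(f"clock gating (1-cycle penalty):  CPI ~ {gating:.3f}")
    print(f"flush and replay (2N cycles):    CPI ~ {replay:.3f}")
```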
![Page 27: Microprocessor Reliability](https://reader033.fdocuments.net/reader033/viewer/2022061519/56816411550346895dd5bcdc/html5/thumbnails/27.jpg)
27
Error Recovery Options in Scalar Processors
• Counter-flow pipelining
• Micro-rollback
![Page 28: Microprocessor Reliability](https://reader033.fdocuments.net/reader033/viewer/2022061519/56816411550346895dd5bcdc/html5/thumbnails/28.jpg)
28
Error correcting codes for memories
• Most common is Hamming code
• Check bits stored when data written
• Identifies error and erroneous bit position
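
The three bullets above can be sketched with the smallest standard case, a Hamming(7,4) code: check bits are computed when the data is written, and on a read the syndrome gives the position of a single flipped bit. Real memories use wider words; the function names here are hypothetical.

```python
def hamming74_encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] into 7 bits (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_correct(code):
    """Return (data_bits, error_position); position 0 means no error found."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3   # syndrome = 1-based index of the bad bit
    if position:
        c[position - 1] ^= 1          # flip it back
    return [c[2], c[4], c[5], c[6]], position


if __name__ == "__main__":
    stored = hamming74_encode([1, 0, 1, 1])
    stored[4] ^= 1                    # single-bit upset at position 5
    print(hamming74_correct(stored))  # data recovered, position reported
```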
![Page 29: Microprocessor Reliability](https://reader033.fdocuments.net/reader033/viewer/2022061519/56816411550346895dd5bcdc/html5/thumbnails/29.jpg)
29
Error correcting codes for memories
• Single-bit ECC adds area/power and delay
  – Low VDD → increased delay
  – Hybrid VDD operation will reduce delay
• Overhead increases for multi-bit ECC
  – Increased memory density → higher probability of MBU
  – Current research: increase in ratio of MBU to total SER in sub-VT
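
The storage side of the overhead can be estimated from the Hamming bound: a single-error-correcting code over m data bits needs the smallest r with 2^r ≥ m + r + 1 check bits, and adding double-error detection costs one more bit. The function names are hypothetical, and the sketch covers only bit count, not the encoder/decoder area, power, or delay.

```python
def sec_check_bits(data_bits: int) -> int:
    """Check bits needed for single-error correction over `data_bits` bits."""
    if data_bits < 1:
        raise ValueError("data_bits must be at least 1")
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r


def secded_check_bits(data_bits: int) -> int:
    """SEC plus one overall parity bit for double-error detection."""
    return sec_check_bits(data_bits) + 1


if __name__ == "__main__":
    for m in (8, 16, 32, 64):
        r = secded_check_bits(m)
        print(f"{m:>2} data bits: {r} check bits ({100 * r / m:.1f}% overhead)")
```

Correcting more than one bit per word requires stronger codes with more check bits per word, which is the overhead growth noted for multi-bit ECC.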
![Page 31: Microprocessor Reliability](https://reader033.fdocuments.net/reader033/viewer/2022061519/56816411550346895dd5bcdc/html5/thumbnails/31.jpg)
31
System-Level Impact
• Soft errors can have a large effect on processor functionality
  – Increasing issue with further device scaling
• All methods of error detection/correction are costly
  – Need to be added to system blocks wisely
• SEU distribution
• Effects of process variation
![Page 32: Microprocessor Reliability](https://reader033.fdocuments.net/reader033/viewer/2022061519/56816411550346895dd5bcdc/html5/thumbnails/32.jpg)
32
System-Level Impact
• How to determine what blocks have the highest system-level impact?
  – Mostly through simulation
• For radiation: all-encompassing
  – Includes fault injection at the circuit level
• Different models have been developed
  – ReStore – University of Illinois at Urbana-Champaign
    • Focuses on system level effect of radiation-induced errors
  – RAMP – IBM
    • Directed more towards hard errors and processor failure.
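
The simulation-based approach can be illustrated with a toy fault-injection campaign: flip one random bit in the state, rerun the workload, and count how often the output differs from the fault-free run. This is a software-level sketch with hypothetical names and workloads; it does not reproduce ReStore, RAMP, or any circuit-level injector.

```python
import random


def checksum16(words):
    """Workload where every bit of every word influences the result."""
    total = 0
    for w in words:
        total = (total + w) & 0xFFFF
    return total


def low_byte_sum(words):
    """Workload that ignores the upper byte, so upsets there are masked."""
    return sum(w & 0xFF for w in words) & 0xFF


def inject_bit_flip(words, word_index, bit):
    """Return a copy of `words` with a single bit flipped."""
    faulty = list(words)
    faulty[word_index] ^= 1 << bit
    return faulty


def run_campaign(workload, words, width=16, trials=1000, seed=0):
    """Fraction of injected single-bit upsets that corrupt the output."""
    rng = random.Random(seed)
    golden = workload(words)
    corrupted = 0
    for _ in range(trials):
        faulty = inject_bit_flip(words, rng.randrange(len(words)),
                                 rng.randrange(width))
        if workload(faulty) != golden:
            corrupted += 1
    return corrupted / trials


if __name__ == "__main__":
    state = [0x1234, 0xABCD, 0x00FF]
    print("checksum16  :", run_campaign(checksum16, state))
    print("low_byte_sum:", run_campaign(low_byte_sum, state))
```

Blocks whose upsets are mostly masked contribute less to the system-level error rate, which is the kind of ranking such simulations are meant to produce.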
![Page 33: Microprocessor Reliability](https://reader033.fdocuments.net/reader033/viewer/2022061519/56816411550346895dd5bcdc/html5/thumbnails/33.jpg)
33
Questions?