Self-Adaptable and Error- Resilient Design - VAST...

Page 1: Self-Adaptable and Error- Resilient Design - VAST labcadlab.cs.ucla.edu/icsoc/protected-dir/IC-DFN_2006_Workshop... · Self-Adaptable and Error-Resilient Design- ... Burn-in to catch

DUSD(S&T)

Self-Adaptable and Error-Resilient Design
- coping with increasing variability and reliability concerns

Tim Cheng
Univ. of California, Santa Barbara

2

Sources of Component Failures

[Figure: on-die temperature variations; temperature scale 40 to 110 (C).]

[Figure: sources of component failures (random defects, parametric variations, SEU - soft errors, design errors), classified with the labels catastrophic, parametric, deterministic, probabilistic, transient, permanent, soft, hard.]

3

Test Challenges

Cost of manufacturing test not scaling

ATE falling further behind device speeds

Burn-in running out of steam

Increasing device integration: digital, analog/mixed signal, memory, software, high-speed buses, etc.

[Figure: integrated device with blocks High-Speed IO, CPU, Memory, Flash, DAC, Switch Fabric, ADC, FPGA, DSP.]

[Figure: Cost of Silicon Mfg and Test. Cost in cents/transistor (10^-7 to 1) for the years '82 to '12. Curves: fab capital / transistor (Moore's law); test capital / transistor (Moore's law for test).]

4

Implications

It is becoming harder and harder to design reliable components

Shorter term: better silicon debug technologies are needed to effectively find bugs that escaped verification and timing failures that result from variations

Longer term:
  One-time factory testing will be too costly and insufficient
  Burn-in to catch chip infant mortality will not be practical
  HW needs to dynamically self-test, detect errors, reconfigure, and adapt
  On-line testing techniques will become necessary to trigger correction/reconfiguration

5

From Test to Recovery/Reconfiguration - Examples

Memory
  "BIST → BISD → BISR" a common practice
  Error-Tolerant Cache Architecture [Purdue U.]

Dynamic circuits
  Using programmable keeper and on-chip leakage sensors for tuning performance and robustness

Microprocessor
  DIVA and RAZOR [U. of Michigan] for on-line checking and recovery
  Assumption: low failure rate

Analog/RF/High-speed IO components
  Self-calibration: fine-tuning performance; more robust to process, temperature and voltage variations
  …

6

Fault Statistics in 64K Cache @45nm

[Figure: fault statistics histogram. x-axis: NFaulty-Cells (0 to 1049); y-axis: chip count Nchip (0 to 350). Annotation: conv. yield ≈ 33.4%.]

σVt ≈ 30 mV, using BPTM 45nm technology

NFaulty-Cells = PFault × NCells (NCells: total number of cells in a cache)

Conventional 64K cache results in only 33.4% yield
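The relation NFaulty-Cells = PFault × NCells can be tried out with a few lines of Python. This is only a sketch: the cell count assumes a 64 KB array with one cell per bit, the per-cell failure probability is an illustrative value that does not come from the slides, and the fault-free probability assumes independent cell failures.

```python
def expected_faulty_cells(p_fault: float, n_cells: int) -> float:
    """NFaulty-Cells = PFault x NCells, as stated on the slide."""
    return p_fault * n_cells


def fault_free_probability(p_fault: float, n_cells: int) -> float:
    """Probability that no cell is faulty, assuming independent cell failures."""
    return (1.0 - p_fault) ** n_cells


# Assumption: 64 KB data array, one cell per bit.
N_CELLS = 64 * 1024 * 8
# Assumption: illustrative per-cell failure probability, not a value from the slides.
P_FAULT = 1e-4

print(expected_faulty_cells(P_FAULT, N_CELLS))   # roughly 52.4 faulty cells on average
print(fault_free_probability(P_FAULT, N_CELLS))  # chance of a chip with zero faulty cells
```

Even a small per-cell failure probability leaves almost no chips completely fault-free at this array size, which is why yield without some form of repair or tolerance is low.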

7

Error-Tolerant Cache Architecture

An error-tolerant, dynamically reconfigurable architecture:
  Results in 94% yield vs 33% in conventional architecture
  Does not affect cache access time
  Transparent to the processor
  Minimum performance loss (<4%)

[Figure: cache with 4 blocks in a row, row decoder (row address), column decoder (column address), column MUX, controller, and config storage. Block labels "00" "01" "10" "11" are shown next to the remapped labels "01" "01" "10" "11"; one block is marked faulty.]

Main ideas:

Assume BIST implemented

Resize cache to avoid faulty blocks during regular operation

Force column MUX to select a non-faulty block in same row if the accessed block is faulty
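The column-MUX idea can be sketched in Python as follows. This is one reading of the figure labels ("00" "01" "10" "11" becoming "01" "01" "10" "11"), not the authors' implementation: the function names are hypothetical, and steering a faulty block to the nearest healthy block in the same row is an assumption made for illustration.

```python
from typing import Dict, List, Optional

BLOCKS_PER_ROW = 4  # "CACHE, 4 Blocks in a Row" on the slide


def build_config(faulty: List[bool]) -> Optional[Dict[int, int]]:
    """Map each block index in one row to a usable block index.

    Healthy blocks map to themselves. A faulty block maps to the nearest
    healthy block in the same row (nearest-neighbour choice is an assumption).
    Returns None if every block in the row is faulty.
    """
    healthy = [i for i, bad in enumerate(faulty) if not bad]
    if not healthy:
        return None
    return {
        i: (i if not faulty[i] else min(healthy, key=lambda h: abs(h - i)))
        for i in range(len(faulty))
    }


def select_block(config: Dict[int, int], column_address: int) -> int:
    """Column MUX select: look up the remapped block for the accessed column."""
    return config[column_address]


# Block "00" is marked faulty by BIST, so accesses to it are steered to "01".
config = build_config([True, False, False, False])
print([format(select_block(config, c), "02b") for c in range(BLOCKS_PER_ROW)])
# ['01', '01', '10', '11']
```

Because the remapping is a table lookup stored in config storage, it stays off the data path, which fits the slide's claim that cache access time is unaffected.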

8

Error Tolerant Capability

[Figure: fault statistics. x-axis: NFaulty-Cells (0 to 1049); y-axis: chip count Nchip (0 to 350). Series: chips saved by the proposed architecture + redundancy (R=8, r=3); chips saved by ECC + redundancy (R=16). Annotations: more chips saved compared to ECC; a region where ECC fails to save any chips.]

Proposed architecture can handle more faulty cells than ECC, as high as 890 faulty cells with marginal performance loss

9

Self-Tuning Using On-Chip Current Monitoring – SRAM

[Figure: self-repairing SRAM block diagram with SRAM array, VDD, bypass switch, online leakage monitor (Vout), comparator with references VREF1 and VREF2, calibrate signal, and body-bias generator (Vbody).]

[Figure: yield [%] vs. STD of inter-die Vt variation [V] for a 64KB SRAM array with ZBB and a 64KB self-repairing SRAM array with ABB; annotation: 8%-40%. ZBB = Zero Body Bias, ABB = Adaptive Body Bias.]

Self-repairing SRAM using on-chip current monitoring and adaptive body biasing (ABB)

Effective in achieving high yield in nanometer technologies

Mukhopadhyay et al., ITC'05

Source: K. Roy et al., Purdue
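The comparator stage in the block diagram can be pictured as a three-way decision. The sketch below is an assumption-laden illustration, not the published circuit: it assumes VREF1 < VREF2, that a higher monitor output Vout means higher array leakage, and that the body-bias generator offers reverse (RBB) and forward (FBB) body bias in addition to zero body bias. The slide itself only names ZBB and ABB.

```python
def select_body_bias(v_out: float, v_ref1: float, v_ref2: float) -> str:
    """Pick a body-bias mode from the leakage monitor output.

    Assumes v_ref1 < v_ref2 and that a higher v_out means higher array leakage.
    """
    if v_ref1 >= v_ref2:
        raise ValueError("expected v_ref1 < v_ref2")
    if v_out > v_ref2:
        return "RBB"  # leaky (low-Vt) die: reverse body bias (assumed mode)
    if v_out < v_ref1:
        return "FBB"  # slow (high-Vt) die: forward body bias (assumed mode)
    return "ZBB"      # nominal die: zero body bias


print(select_body_bias(0.9, v_ref1=0.3, v_ref2=0.7))  # RBB
print(select_body_bias(0.5, v_ref1=0.3, v_ref2=0.7))  # ZBB
print(select_body_bias(0.1, v_ref1=0.3, v_ref2=0.7))  # FBB
```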

10

From Test to Recovery/Reconfiguration - Examples

Memory
  "BIST → BISD → BISR" a common practice
  Error-Tolerant Cache Architecture [Purdue U.]

Dynamic circuits
  Using programmable keeper and on-chip leakage sensors for tuning performance and robustness

Microprocessor
  DIVA and RAZOR [U. of Michigan] for on-line checking and recovery
  Assumption: low failure rate

Analog/RF/High-speed IO components
  Self-calibration: fine-tuning performance; more robust to process, temperature and voltage variations
  …

11

Dynamic Circuit Using Static Keeper

[Figure: dynamic circuit with a conventional static keeper; signals clk, RS0 to RS7, D0 to D7, LBL0, LBL1, N0.]

Keeper upsizing degrades average performance

12

Pessimistic Design Hurts Performance

[Figure: histogram of number of dies (0 to 200) vs. normalized IOFF (0 to 6), 130nm CMOS measurements at 110°C, with the nominal corner and the worst-case corner marked.]

Substantial variation in leakage across dies

4-5X variation between nominal and worst-case leakage

Performance determined at nominal leakage

Robustness determined at worst-case leakage

13

Programmable Keeper for Dynamic Ckts

[Figure: dynamic circuit with a 3-bit programmable keeper; control bits b[2:0] select keeper devices of width W, 2W, and 4W; signals clk, RS0 to RS7, D0 to D7, LBL0, LBL1, N0.]

Opportunistic speedup via keeper downsizing

C. Kim et al., VLSI Circuits Symp. '03
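With binary-weighted legs of width W, 2W, and 4W, the 3-bit code sets the effective keeper strength in steps of W. The sketch below illustrates that arithmetic only; the assignment of b[0], b[1], b[2] to the W, 2W, 4W legs is an assumption, since the slide does not state the bit order.

```python
def keeper_width(b: int, w: float = 1.0) -> float:
    """Effective keeper width for a 3-bit code b[2:0] enabling W, 2W, 4W legs."""
    if not 0 <= b <= 7:
        raise ValueError("b must be a 3-bit value")
    legs = (w, 2 * w, 4 * w)  # assumed order: b[0] -> W, b[1] -> 2W, b[2] -> 4W
    return sum(leg for bit, leg in enumerate(legs) if (b >> bit) & 1)


print(keeper_width(0b111))  # 7.0, strongest keeper for a leaky die
print(keeper_width(0b010))  # 2.0, downsized keeper for a low-leakage die
```

A low-leakage die can run with a small code and gain speed, while a leaky die keeps the large code for robustness.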

14

On-Die Leakage Sensor

C. Kim et al., VLSI Circuits Symp. '04

[Figure: sensor layout, 83 μm × 73 μm, showing current reference, comparators, current mirrors, VBIAS gen., NMOS device, and test interface.]

High leakage sensing gain

Compact analog design sharing bias generators

Resolution: 7 levels
VDD: 1.2V
Dimensions: 83 × 73 μm²
Power consumption: 0.66 mW @ 80°C
Technology: 90nm dual-Vt CMOS

15

Leakage Binning Results

[Figure: leakage binning results by output code from the leakage sensor: 001, 010, 011, 100, 101, 110, 111.]
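A 7-level sensor output can be modelled as placing a leakage reading into one of seven bins. The sketch below is illustrative only: the threshold values are hypothetical, and the assumption that a higher code means higher leakage is not stated on the slide.

```python
from bisect import bisect_right
from typing import Sequence


def leakage_code(i_off: float, thresholds: Sequence[float]) -> str:
    """Return the 3-bit bin code ("001" to "111") for a normalized leakage value.

    thresholds: six ascending bin boundaries (hypothetical values).
    """
    if len(thresholds) != 6 or list(thresholds) != sorted(thresholds):
        raise ValueError("need six ascending thresholds")
    level = 1 + bisect_right(thresholds, i_off)  # level in 1..7
    return format(level, "03b")


THRESHOLDS = [0.5, 1.0, 1.5, 2.0, 3.0, 4.0]  # assumption, for illustration only
print(leakage_code(0.2, THRESHOLDS))  # 001
print(leakage_code(1.2, THRESHOLDS))  # 011
print(leakage_code(5.0, THRESHOLDS))  # 111
```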

16

Test Process for Self-Calibrating Design

[Figure: test flow with the stages Fab, Wafer test, Assembly, Burn-in, Package test, Customer. Annotations: process detection; leakage measurement using the on-die leakage sensor; program using fuses.]

17

From Test to Recovery/Reconfiguration - Examples

Memory
  "BIST → BISD → BISR" a common practice
  Error-Tolerant Cache Architecture [Purdue U.]

Dynamic circuits
  Using programmable keeper and on-chip leakage sensors for tuning performance and robustness

Microprocessor
  DIVA and RAZOR [U. of Michigan] for on-line checking and recovery
  Assumption: low failure rate

Analog/RF/High-speed IO components
  Self-calibration: fine-tuning performance; more robust to process, temperature and voltage variations
  …

18

DIVA: On-Line Checking and Correction for Microprocessor

All core function is validated by checker
  Simple checker detects & corrects faulty results, restarts core
  Validates: control, computation, communication, and forward progress

Checker relaxes burden of correctness on core processor
  Tolerates core design errors, electrical faults, defects, and failures
  Core only targets high accuracy prediction, checker alone is 15x slower

Core does the heavy lifting, removes hazards that could slow the simple checker

Source: Todd Austin, Univ. of Michigan
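The check-before-commit idea can be illustrated with a toy model. This is a sketch under simplifying assumptions, not the DIVA microarchitecture: the function name and the use of a Python callable as the "simple checker" are hypothetical, and the core flush and restart is represented only by a flag.

```python
from typing import Callable, Tuple


def diva_commit(op: Callable[[int, int], int], a: int, b: int,
                core_result: int) -> Tuple[int, bool]:
    """Check a core result before commit.

    Returns (committed_value, recovered). On a mismatch the checker's own
    result is committed, and the core would be flushed and restarted.
    """
    checker_result = op(a, b)  # simple, trusted recomputation
    if checker_result == core_result:
        return core_result, False
    return checker_result, True


def add(x: int, y: int) -> int:
    return x + y


print(diva_commit(add, 2, 3, 5))  # (5, False): core result accepted
print(diva_commit(add, 2, 3, 7))  # (5, True): checker result committed, core restarted
```

Only the checker has to be correct for the committed state to be correct, which is why the slides say the burden of correctness moves off the core.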

19

DIVA - Case Study

Performance impacts minimal
  Without faults, less than ½% slowdown for broad array of applications
  At 1 fault/microsecond on a 1GHz processor, only 1% slowdown

Area requirements modest
  Alpha ISA checker less than 6% area of Alpha 21264 processor

Checker lends itself to formal verification

Simple extensions provide excellent SER coverage

[Figure: area comparison. Alpha 21264: 205 mm² (in 0.25um). REMORA checker: 12 mm² (in 0.25um), with 4k data cache, 1/2k inst cache, pipeline, and BIST.]

Source: Todd Austin, Univ. of Michigan
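The quoted figures can be cross-checked with simple arithmetic. The area ratio uses the two numbers from the figure. The per-fault penalty is an inference, not a number from the slides: it assumes the 1% slowdown is entirely due to recovery after faults.

```python
checker_area_mm2 = 12.0   # REMORA checker, 0.25um
core_area_mm2 = 205.0     # Alpha 21264, 0.25um
area_ratio = checker_area_mm2 / core_area_mm2
print(f"{area_ratio:.1%}")  # 5.9%, consistent with "less than 6%"

clock_hz = 1e9
faults_per_second = 1e6   # 1 fault per microsecond
cycles_per_fault = clock_hz / faults_per_second  # 1000 cycles between faults
slowdown = 0.01
# Assumption: all of the slowdown is recovery cost, so this is the implied
# average penalty per fault in cycles.
implied_penalty_cycles = slowdown * cycles_per_fault
print(implied_penalty_cycles)  # about 10 cycles per fault
```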

20

Razor: Timing Error Detection & Correction

Double-sampling metastability tolerant latches detect timing errors

Second sample is correct-by-design

Micro-architectural support restores state
  Timing errors treated like branch mis-predictions

Source: Austin & Blaauw, Michigan
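Double sampling can be modelled in a few lines. This is a behavioural sketch with hypothetical names, not the Razor circuit: the main flip-flop samples at the clock edge, the shadow latch samples a fixed delay later, and the shadow sample is required to meet timing, which is the "correct-by-design" condition on the slide.

```python
from typing import Tuple


def razor_sample(old: int, new: int, arrival: float,
                 period: float, shadow_delay: float) -> Tuple[int, bool]:
    """Model one Razor flip-flop for one cycle.

    The data input switches from `old` to `new` at time `arrival`.
    The main flip-flop samples at `period`; the shadow latch samples at
    `period + shadow_delay`. Returns (value_passed_on, error_flag).
    """
    if arrival > period + shadow_delay:
        raise ValueError("shadow latch must always meet timing (correct by design)")
    main = new if arrival <= period else old
    shadow = new  # guaranteed by the check above
    error = main != shadow
    return (shadow if error else main), error


print(razor_sample(0, 1, arrival=0.9, period=1.0, shadow_delay=0.5))  # (1, False)
print(razor_sample(0, 1, arrival=1.2, period=1.0, shadow_delay=0.5))  # (1, True)
```

When the error flag is raised, the pipeline would restore state from the shadow value, in the same way it recovers from a branch mis-prediction.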

21

From Test to Recovery/Reconfiguration - Examples

Memory
  "BIST → BISD → BISR" a common practice
  Error-Tolerant Cache Architecture [Purdue U.]

Dynamic circuits
  Using programmable keeper and on-chip leakage sensors for tuning performance and robustness

Microprocessor
  DIVA and RAZOR [U. of Michigan] for on-line checking and recovery
  Assumption: low failure rate

Analog/RF/High-speed IO components
  Self-tuning/calibration: fine-tuning performance; more robust to process, temperature and voltage variations
  …

22

Self-Test and Self-Tuning of High-Speed IO

Jitter in DLL/PLL leads to
  Jitter in transmitted data
  Uncertainty in RX's sampling edges

Mismatch & variations in DLL/PLL lead to high BER/yield loss

[Figure: high-speed IO link with blocks TX, RX, DLL, clock recovery, and FF; signals Ref. CLK, parallel data, and recovered data.]

External measurement of DLL is infeasible
  Multiple and matched access points are required for each delay stage

23

More Examples…

On-chip thermal sensing → Cooling adjustment

On-chip delay sensing → Performance tuning

On-chip leakage sensing → Leakage control

…

24

Self-Tuning in the Factory Is Happening

Hardware support for syndrome collection

Hardware support for self-tuning/self-reconfiguration

In need of design methodology for exploring design trade-offs

[Figure: component with self-test & self-tuning/-configuration capability, connected through tuning knobs and test results/syndrome.]

25

Supporting Self-Tuning/-Configuration in the Field

Allow shipping of defective parts!

Require on-line testing/checking capability

Require "self-diagnosis" capability

Diagnosis could be conducted on remote server

[Figure: system containing a component with on-line-test & self-tuning/-configuration capability, connected over a network to a diagnosis server w/ database; test results/syndrome and tuning knobs pass between them.]
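The field loop in the diagram (syndrome out, knob settings back) can be sketched as below. Everything here is hypothetical and for illustration only: the syndrome names, the knob names, the database contents, and the direct function call that stands in for the network round trip.

```python
from typing import Dict

# Hypothetical diagnosis database: syndrome -> tuning-knob settings.
DIAGNOSIS_DB: Dict[str, Dict[str, int]] = {
    "timing_error": {"clock_divider": 2},
    "high_leakage": {"keeper_code": 0b111},
}


def diagnose(syndrome: str) -> Dict[str, int]:
    """Server side: look up knob settings for a reported syndrome."""
    return DIAGNOSIS_DB.get(syndrome, {})


class Component:
    """Field component with on-line test and tuning knobs (names are illustrative)."""

    def __init__(self) -> None:
        self.knobs: Dict[str, int] = {"clock_divider": 1, "keeper_code": 0b011}

    def apply(self, settings: Dict[str, int]) -> None:
        """Update only the knobs named in the returned settings."""
        self.knobs.update(settings)


part = Component()
part.apply(diagnose("high_leakage"))  # stands in for a network round trip
print(part.knobs)  # {'clock_divider': 1, 'keeper_code': 7}
```

An unknown syndrome returns an empty setting and leaves the knobs unchanged, which is a conservative default chosen for this sketch.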

26

Design Challenges

“Self-diagnosis” to support reconfiguration

Low-cost on-line checking, self-repair and self-tuning schemes and design methodologies

Exploration of redundancy and reconfiguration tradeoffs (power, area, performance, reliability)

27

Summary

On-line testing is promising for detecting soft errors, latency failures and marginality failures

Need automatic diagnosis solutions after errors are detected by on-line checker

Need low-cost and low-power recovery or re-configuration schemes

Post-silicon tuning/calibration/reconfiguration is becoming promising, and necessary, for Si nano systems