Assessing Costs of Variability, Reliability and Resilience
Andrew B. Kahng
UCSD CSE and ECE Departments
[email protected]
http://vlsicad.ucsd.edu



UCSD VLSI CAD Laboratory, NSF Variability Expedition / DFG SPP 1500 Colloquium, 141113

Design Capability Gap, Value Scaling Gap
• “Available density” ideally grows at 2x/node
  • = a typical view of “Moore’s Law”
• Even so, “realized density” grows at 1.6x/node
  • Power, performance, area resources spent on guardband, reliability, etc.
  • Designers obtain only part of Moore’s Law scaling benefits


Challenge: Variability + Reliability

• Variability + Reliability = challenges to design closure for a competitive IC product
  • Design costs from margins; “0‐1 benefits”
• Resilience = system product’s ability to mitigate variability and reliability phenomena
  • Error detection and repair mechanisms
  • Alternative guardbanding mechanisms for different system abstractions: stochastic, approximate, …
  • Costs and benefits often less well‐defined

[Figure: sources of variability and reliability concerns. Labels: defocus/dose variation, misalignment, temperature variation, reliability, non-rectangular shapes, line-end shortening, crosstalk, IR-drop, imperfect regulators, non-uniform CD, erosion/dishing in CMP, electromigration, hot-carrier injection, NBTI, alpha-particle, line edge roughness, mask CD error, wafer flatness, lens aberration, flare]


“Cost of Variability and Reliability”

[Figure: design quality (e.g., frequency) vs. technology node. Labels: signoff with larger guardbands; guardbands; lost benefits]

Standard vague picture: increased guardband ⇒ lost benefits of technology = no ROI


Quantified Cost of Guardband [ISQED08]

Can we quantify cost of guardband?
Idea (2007‐2008): study design benefit of reduced guardband

N.B.: going to the next node gives 20% speed, 20% power benefit ⇒ 10% is half a node!

E.g., 50% guardband reduction looks like:
[Figure: parameter range from Param_best (-100%) through 0% to Param_worst (100%)]

Expected impacts of guardband reduction:
• Delay reduction
• Easier optimization
• Smaller gate size
• Smaller area (A)
• Fewer defects
• Less cost
• Shorter wires

[Equations: yield Y in terms of die area A and defect density d; number of dies per wafer N_dies in terms of A and wafer radius r]
(d: defect density)
(r: wafer radius)
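
The chain from smaller area to more good dies per wafer can be sketched numerically. This is a minimal sketch, assuming a Poisson yield model Y = exp(-A·d) and a common gross-die-per-wafer estimate; these are textbook stand-ins and not necessarily the formulas used on this slide. The function names and all numbers are hypothetical.

```python
import math


def gross_dies_per_wafer(die_area_mm2: float, wafer_radius_mm: float) -> float:
    """Assumed estimate: wafer area / die area, minus an edge-loss term."""
    r = wafer_radius_mm
    return (math.pi * r ** 2) / die_area_mm2 - (2 * math.pi * r) / math.sqrt(2 * die_area_mm2)


def poisson_yield(die_area_mm2: float, defect_density_per_mm2: float) -> float:
    """Assumed Poisson defect-limited yield: Y = exp(-A * d)."""
    return math.exp(-die_area_mm2 * defect_density_per_mm2)


def good_dies_per_wafer(die_area_mm2: float, wafer_radius_mm: float,
                        defect_density_per_mm2: float) -> float:
    """Good dies = gross dies per wafer times yield."""
    return (gross_dies_per_wafer(die_area_mm2, wafer_radius_mm)
            * poisson_yield(die_area_mm2, defect_density_per_mm2))


# Hypothetical numbers: a 100 mm^2 die shrunk to 87 mm^2 on a 300 mm wafer.
baseline = good_dies_per_wafer(100.0, 150.0, 0.001)
reduced = good_dies_per_wafer(87.0, 150.0, 0.001)
print(f"good dies per wafer: {baseline:.0f} -> {reduced:.0f}")
```

Under these assumptions a smaller die helps twice: more dies fit on the wafer, and each die is less likely to contain a defect.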


Design Outcomes from Guardband Reduction
• 40% guardband reduction
  • Area: 13% reduction
  • Dynamic power: 13% reduction
  • Leakage power: 19% reduction
  • Wirelength: 12% reduction
  • Tool runtime (S,P&R): 28% reduction
  • #Timing viols.: 100% reduction
  • #Good dies per wafer (w/o process enhancement): 4% increase
    • Raw die per wafer
    • Parametric yield
    • 40nm sweetspot: 20% guardband reduction
• Quantified impact of guardband ⇒ insight into cost of guardband!
• Can we then answer: What is cost of {variability, reliability, resilience}?

[Flow diagram: experiments with industry chip implementation flow. Inputs: RTL design (AES, JPEG, SOC1); technology (90nm, 65nm, 45nm); cell library guardband reduction; RC guardband reduction. Steps: synthesis, placement, clock tree synthesis, routing, analyze outcomes (area, wirelength, runtime, #violations, yield)]


My Group: Reduced Margin = Reduced Cost
• Pessimism removal with more accurate margins
• Explicit tradeoffs across various types of margin, e.g., 1 mV = 5 MHz
• Co‐optimization across engineering scopes, chip implementation phases; includes “cross‐layer”, adaptivity / resilience, …

[Figure labels: Design Time; Margin (ps, nm, mV, …); Product Quality (power, area, fmax, Iddq, …); Model and Analysis Accuracy (rms, %, σ)]


Reducing Cost Measuring Cost
• Measuring Cost of X is difficult! (which is why we’re here …)
  • Reliability margins are intertwined with other margins
  • Tough to isolate specific costs of variability / reliability / resilience, especially in any design‐agnostic way
• Toward Assessing Cost of … (work at UCSD)
  • … Variability
    • Reducing (phantom) margins: BEOL corners, FF timing model
  • … Reliability
    • “cost of EM guardband”
    • AVS‐BTI‐EM: cost of wrong signoff conditions
    • Non‐default routing rules: cost of naïve enforcement of reliability margins
    • Assessment of EM margin considering lifetime (throughput and performance)
  • … Resilience
    • “MinRazor”: tradeoff of resilience mechanism cost vs. margin cost
    • “PVS”: process‐aware voltage scaling (design‐independent, tunable monitors)


Our Usual Playing Field: SOC Implementation

[Flow diagram: RTL designs and technology (90nm, 65nm, 45nm, 28nm), with cell library guardband reduction and RC guardband reduction, feed synthesis, placement, clock tree synthesis, routing, and signoff, leading to outcomes (area, wirelength, runtime, #viols, yield). Stage labels: P&R stage optimization, signoff, runtime optimization. Studies placed on the flow: BEOL corners, FF model, AVS-BTI-EM signoff, naive EM compliance, EM-overdrive, MinRazor]


Another View of the (Reliability) Playing Field

[Figure: graph of relations among reliability-relevant parameters. Nodes: activity factor (α), Jrms, temperature, wire width, wire spacing, driver size, supply voltage, timing slack, |Vthp|, |Vthn|, frequency, slew rate, load/fanout, gate length, junction resistance, lifetime (MTTF). Edge types: inverse relation (if A increases then B decreases) and direct relation (if A increases then B increases). Edge labels: general, EM, TDDB, NBTI, HCI. Node categories: tunable at design or runtime; tunable at design; models and technology parameters (not tunable)]


I. Assessing Costs of Variability
“Phantom margins”: (1) BEOL, (2) FF model pessimism


Pessimism in Conventional BEOL Corners

• Conventional BEOL corners (CBC)
  • Skew all layers in the same direction to guardband for variability
  • Too pessimistic! Impossible to have worst‐case on all layers
• Pessimism in CBC creates “false” timing‐critical paths
  • Fixing “false” paths degrades design quality
  • Slows down design turnaround time

[Figure: BEOL stack cross-section with metal layers M1, M2, M3; wire width W2, spacing S2, thicknesses T1 to T3, dielectric heights H1 to H3; inter-layer dielectric and inter-metal dielectric]

Corner     ∆W        ∆T        ∆H
Typical    typical   typical   typical
Cbest      min       min       max
Cworst     max       max       min
RCbest     max       max       max
RCworst    min       min       min

[ICCD14]


A New Timing Signoff Flow

[Flow diagram, conventional signoff: routed design, then timing analysis using conventional BEOL corners (CBC), then “violation = 0?”; if no, ECO using CBC and re-analyze; if yes, done]

[Flow diagram, our work: routed design, then classify timing-critical paths into G_TBC and G_CBC; timing analysis using TBC and timing analysis using CBC, each with a “violation = 0?” check and an ECO loop (ECO using TBC, ECO using CBC); then done]


Pessimism in Conventional BEOL Corners (CBC)
• Assumption: a max (setup) path p_j is “safe” when the delay evaluated at a given CBC is larger than nominal delay + 3σ_j

    d_j(Y_CBC) ≥ d_j(Y_typ) + 3σ_j

• For a given path, we can compare the statistical delay variation and the delay obtained from a given CBC

    α_j = 3σ_j / Δd_j(Y_CBC)
    Δd_j(Y_CBC) = d_j(Y_CBC) - d_j(Y_typ)
    Y_CBC ∈ {Y_cw, Y_cb, Y_rcw, Y_rcb}

• A small α_j implies there is a large pessimism

[Figure: delay distribution comparing 3σ_j with d_j(Y_CBC) - d_j(Y_typ); the gap is labeled “large pessimism”]
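
The pessimism ratio defined on this slide can be computed directly. The following is a minimal sketch; the function names and the sample delays are hypothetical, and only the two relations from the slide (α_j = 3σ_j / Δd_j and the “safe” condition) are taken from the source.

```python
def pessimism_ratio(d_cbc: float, d_typ: float, sigma: float) -> float:
    """alpha_j = 3*sigma_j / (d_j(Y_CBC) - d_j(Y_typ))."""
    delta = d_cbc - d_typ
    if delta <= 0:
        raise ValueError("corner delay must exceed typical delay for a max (setup) path")
    return 3.0 * sigma / delta


def is_safe(d_cbc: float, d_typ: float, sigma: float) -> bool:
    """A path is 'safe' at a corner when d_j(Y_CBC) >= d_j(Y_typ) + 3*sigma_j."""
    return d_cbc >= d_typ + 3.0 * sigma


# Hypothetical path: 1.00 ns typical, 1.10 ns at the corner, sigma = 10 ps.
alpha = pessimism_ratio(d_cbc=1.10, d_typ=1.00, sigma=0.01)
print(f"alpha = {alpha:.2f}, safe = {is_safe(1.10, 1.00, 0.01)}")
```

A path is safe exactly when α_j ≤ 1; the further α_j falls below 1, the more of the corner's delay margin is pessimism rather than needed guardband.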


Wiring Structure in Timing-Critical Paths

• Wires on critical paths are routed on many layers
• Similar wiring structure is an outcome of design flow

Testcase:
• 45nm foundry library (wire resistivity scaled by 8X)
• Netlist: NETCARD, 1 mm², 570K standard cell instances
• 9 metal layers
• Extract critical paths from different PVT and BEOL corners

[Figure: cumulative probability vs. max. wirelength ratio across all layers (%); marked point at 60% and 0.92]

92% of paths have < 60% of wirelength on any single layer


Wiring Structure in Timing-Critical Paths


Opportunities for Tightened BEOL Corners

• CBC can be pessimistic! Most paths have α < 0.5
• Use tightened BEOL corners, e.g., scale BEOL variation in .itf with α = 0.5

[Figure: plot with axes ∆d_j(Y_rcw)/d_j(Y_typ) x 100% and 3σ_j/d(Y_typ) x 100%]

Challenge: how to avoid underestimating delay variation to preserve parametric yield?


Scaling Factor α vs. Delay Variation @Cw, RCw
• Paths with small Δd_rcw and Δd_cw have large α
• E.g., some paths have α_j > 0.6 when ((Δd_rcw < 3%) AND (Δd_cw < 3%))
• Identify paths for tightened BEOL corners based on Δd_rcw and Δd_cw

[Figure: α plotted against ∆d(Y_cw)/d(Y_typ) and ∆d(Y_rcw)/d(Y_typ)]


A Practical Filter for TBC-Amenable Paths

G_tbc = paths which can be safely signed off using tightened corners:
(Path with (∆d_cw larger than A_cw)) OR (Path with (∆d_rcw larger than A_rcw))

[Figure: paths plotted by ∆d(Y_cw)/d(Y_typ) and ∆d(Y_rcw)/d(Y_typ), with thresholds A_cw and A_rcw]
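
The filter for TBC-amenable paths amounts to a two-threshold test per path. Below is a minimal sketch, assuming delay variations are taken relative to the typical-corner delay as on the plot axes; the `PathDelays` record, the function name, and the threshold values are hypothetical and not from the slides.

```python
from dataclasses import dataclass


@dataclass
class PathDelays:
    name: str
    d_typ: float  # delay at the typical corner
    d_cw: float   # delay at the Cworst corner
    d_rcw: float  # delay at the RCworst corner


def classify_paths(paths, a_cw, a_rcw):
    """Split paths into (G_tbc, G_cbc) with the OR-of-thresholds rule.

    A path joins G_tbc when its relative delay variation at Cworst exceeds
    a_cw OR its relative delay variation at RCworst exceeds a_rcw.
    """
    g_tbc, g_cbc = [], []
    for p in paths:
        rel_cw = (p.d_cw - p.d_typ) / p.d_typ
        rel_rcw = (p.d_rcw - p.d_typ) / p.d_typ
        if rel_cw > a_cw or rel_rcw > a_rcw:
            g_tbc.append(p.name)
        else:
            g_cbc.append(p.name)
    return g_tbc, g_cbc


# Hypothetical paths and thresholds (delays in ns, thresholds as fractions).
paths = [
    PathDelays("p1", d_typ=1.00, d_cw=1.08, d_rcw=1.02),
    PathDelays("p2", d_typ=1.00, d_cw=1.01, d_rcw=1.02),
]
g_tbc, g_cbc = classify_paths(paths, a_cw=0.03, a_rcw=0.03)
print("G_tbc:", g_tbc, "G_cbc:", g_cbc)
```

Paths left in G_cbc keep conventional corners, which preserves margin for the paths whose corner-induced variation is small and whose α is therefore large.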


Benefits of Tightened BEOL Corners

• WNS and TNS are reduced by up to 100ps and 53ns
• #Timing violations reduced by 24% to 100% [Moore’s Law: 1% / week !]
• TBC-0.6 : more benefits
• Tradeoff between reduced margin vs. #paths which use TBC

[Figure: three bar charts of WNS (ns), TNS (ns), and #timing violations for LEON, SUPERBLUE12, and NETCARD, comparing CBC, TBC‐0.5, TBC‐0.6, and TBC‐0.7]


Flexible FF Timing Margin Recovery

• Setup time, hold time and clock-to-q (c2q) delay of FF ⇒ NOT fixed values
• Flexible FF timing model considering operating (function/test) modes
  ⇒ Reduce pessimism in timing analysis
  ⇒ Reassessment of costs of variation
• Sequential LP
  • setup-c2q optimization + hold-c2q optimization
  • Objective: Find the best setup/hold time/c2q for each FF

[Figure: c2q-setup-hold surface, with c2q plotted against setup and against hold; a setup‐hold‐c2q fixed model (setup, hold, c2q) contrasted with a setup‐hold‐c2q flexible model (setup, hold, c2q1 ... c2qn)]

[ISQED14]


Improved Timing Signoff Flow

[Flow diagram: netlist (and SPEF, if routed); extract path timing information; LP formulation with flexible flip‐flop timing model; solve sequential LP (STA_FTmax, STA_FTmin); solution; annotate new timing model for each flip‐flop; timing signoff with annotated timing]

Takeaways
• Fix timing violations “for free”
• 48ps average improvement of slack over 5 designs in a foundry 65nm technology

Next steps
• Study in advanced nodes
• Better exploitation of disjoint cycles/modes
• More accurate modeling of setup-hold-c2q tradeoff
• Circuit optimization exploiting FF timing model flexibility


Takeaways on Variability
• Phantom margins leave O(node) value on the table ⇒ recovering this is essential “equivalent scaling”
• Two examples: BEOL corners, FF timing model
• NOTE: To assess costs/benefits of new methods, need correct starting point!
• Conventional BEOL corners are VERY pessimistic!
  • Bottleneck for wire‐dominated, high‐performance circuits
• Revised signoff flow + tightened BEOL corners reduces WNS, TNS and #timing violations
  • Signoff methodology change under way at sponsor company
• Relaxed timing closure ⇒ shortened design cycle, better PPA


II. Assessing Costs of Reliability
(1) cost of suboptimal AVS-BTI-EM signoff;
(2) cost of naïve EM rule enforcement;
(3) available lifetime throughput and performance benefit from scheduling of multi-cores


Reliability Margin vs. Adaptive Voltage Scaling
• Interaction between reliability margins and AVS mechanism
• BTI aging ⇒ higher |ΔVth| ⇒ lower fmax
  • AVS used to compensate performance degradation
• Higher voltage worsens EM on wires

[Figure: circuit frequency vs. time and Vdd vs. time, without AVS and with AVS, relative to a target]
[Flow diagram: signoff loop of BTI + EM, linking Vlib and VBTI, derated libraries, design implementation, VDD (AVS), and stress on wires through a BTI loop and an EM loop]

[DATE13,SLIP14]


Derated Library Characterization and AVS (BTI Loop)
• VBTI = Voltage for BTI aging estimation
• Vlib = Voltage for circuit performance estimation (library characterization)
• VBTI and Vlib are required in signoff
• VBTI and Vlib selection should consider BTI + AVS interaction
• Aging and Vfinal are unknowns before circuit implementation

[Flow diagram in three steps: VBTI and |Vt| (Step 1); Vlib and derated library (Step 2); circuit implementation and signoff, producing the circuit (Step 3); followed by BTI degradation and AVS, which determine Vfinal]


Derated Library Characterization and AVS (BTI Loop)

• VBTI = Voltage for BTI aging estimation
• Vlib = Voltage for circuit performance estimation (library characterization)
• VBTI and Vlib are required in signoff
• VBTI and Vlib depend on aging during AVS
• Aging and Vfinal are unknowns before circuit implementation

[Flow diagram in three steps: VBTI and |Vt| (Step 1); Vlib and derated library (Step 2); circuit implementation and signoff, producing the circuit (Step 3); followed by BTI degradation and AVS, which determine Vfinal]

No obvious guideline to define VBTI and Vlib

Inconsistency among Vfinal, Vlib, VBTI
• What is the design overhead when timing libraries are not properly characterized?
• Can we define BTI- and AVS-aware signoff corners that ensure product goals with small design, lifetime energy overheads?
• What is the impact of EM for different signoff corners?


Energy vs. Area Across Different Signoffs

“Knee” point for area vs. lifetime energy

Optimistic signoff corner
• AVS increases supply voltage aggressively to compensate aging
• Large lifetime energy overhead
• May fail to meet timing if desired supply voltage > Vmax

Pessimistic signoff corner
• Overestimate aging and/or underestimate circuit performance
• Large area overhead


AVS Impact on EM Lifetime

[Figure: lifetime (year) and Vfinal (V) for implementations #1 to #8; annotations: 30% MTTF penalty, 200mV voltage compensation]

• Assume no EM fix at signoff
• BTI degradation is checked at each step and MTTF is updated as …


Power Penalty of Fixing EM with AVS

[Figure: core power (mW) and P/G power (mW) for implementations #1 to #8; annotations: highest invested guardband, least invested guardband, 14% power penalty]

• Core power increases with elevated voltage
• P/G power increases due to both elevated voltage and PDN degradation
• Tradeoff with guardband investment at design signoff


EM Impact on AVS Scheduling
• AVS affects EM lifetime penalty
• We empirically sweep AVS voltage step size to obtain the impact
• 5 step sizes: S1 – S5 = {8, 10, 15, 18, 20} mV

[Figure: VDD vs. year for step sizes S1 to S5, and MTTF (year) for S1 to S5 (DMA, #3); annotation: 1.2 years MTTF penalty]


Smarter NDRs in CTS (EM Cost Reduction)
• NDRs apply wider wire widths (= costs of EM) and spacing to address EM and parasitic and delay variation for clock tree
• However, a wire does NOT need to be wide if it has a small number of downstream sinks
• Accurate assessment of EM margin should include clock tree topologies (e.g., #downstream sinks)
• Fewer downstream sinks (== less current) at leaf-side in a clock tree

[Figure: clock tree from driver to sinks, with segments driving 4 buffers, 2 buffers, and 1 buffer]

[DAC13]
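
The idea that wire width should follow the number of downstream sinks can be illustrated with a toy width selector. This is a sketch under loudly stated assumptions, not the method of [DAC13]: current is assumed proportional to the number of downstream sinks, the EM current limit is assumed to scale linearly with the width multiple M, and every name and number is hypothetical.

```python
def pick_width_multiple(n_downstream_sinks: int,
                        current_per_sink_ma: float,
                        em_limit_per_min_width_ma: float,
                        max_multiple: int = 4) -> int:
    """Return the smallest width multiple M (wire width = M * min width)
    whose assumed EM current limit covers the assumed segment current."""
    required_ma = n_downstream_sinks * current_per_sink_ma
    for m in range(1, max_multiple + 1):
        if m * em_limit_per_min_width_ma >= required_ma:
            return m
    raise ValueError("no allowed width satisfies the EM limit for this segment")


# Hypothetical clock-tree segments: near the root (16 sinks) and near a leaf (2 sinks).
root_m = pick_width_multiple(16, current_per_sink_ma=0.1, em_limit_per_min_width_ma=0.5)
leaf_m = pick_width_multiple(2, current_per_sink_ma=0.1, em_limit_per_min_width_ma=0.5)
print(f"root-side width: {root_m}x min, leaf-side width: {leaf_m}x min")
```

In this toy model the leaf-side segment meets the limit at minimum width, which is the tapering behavior that a fixed NDR applied to every clock wire would forgo.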


Vicious Cycle vs. Virtuous Cycle
• Excessive margin ⇒ Not just “design overhead”
• Vicious cycle vs. virtuous cycle
• Less-naïve compliance with EM rules ⇒ reduce design overhead, and avoid vicious cycle

[Figure: vicious cycle (Fixed NDR, wider wires): larger cap., more/larger buffers, more power, more EM viol.; virtuous cycle (Smart NDR, tapering): smaller cap., fewer/smaller buffers, less power, less EM viol.]
[Figure: clock tree from driver to sinks, with segments labeled # downstream = 16 and # downstream = 2]


Smart Routing NDRs: Clock Power Reduction
• 9.2% wire capacitance, 4.9% clock switching power reduction
• Still, satisfy skew, max transition limits and EM limit

[Figure: capacitance and clock power reduction [%] (wire cap., clock switching pwr) in two bar charts, one with default NDR 4W5S and one with default NDR 2W4S]
[Figure: proportions of NDRs (1W8S, 2W7S, 3W6S, 4W5S), from 0% to 100%, for designs aes, eth, jpeg_enc, mc, mpeg2, tv80s, usbf, conmax, dma]

NDR {M}W{N}S: wire width = M*width_min, spacing = N*spacing_min


Reliability-Constrained foverdrive Selection

• Reliability and system lifetime guarantees are key design considerations for multicore processors in advanced nodes
  • Task scheduling determines use of cores across operating modes
  • Overdrive (turbo) mode can meet performance and throughput requirements, but incurs faster MTTF degradation
• Two potential failures: throughput and performance
  • Can violate “acceptable throughput” for tasks: cores fail before all assigned tasks are completed
  • Can violate minimum “acceptable performance” for tasks: cores operate only at lower frequencies than needed
• “EMOD”: solves a new Maximum‐Value Reliability‐Constrained Overdrive Frequencies (MVRCOF) optimization (offline) problem
  • When not all cores are simultaneously active, adjust task scheduling on a subset of active cores for balanced wearout
  • Guarantee prescribed levels of performance and lifetime throughput
  • Overdrive frequencies = optimization variables; user experience = objective

[ISQED14]


Comparison vs. Previous Works

Work           Type
Reiss12        NRC, NLG, NPG
Karpuzcu09     RC, NLG, NPG
Mihic04        RC, LG (Dynamic power management), NPG
Rosing07       RC, LG (Dynamic power management), NPG
Rong06         RC, LG (Dynamic power management), NPG
Coskun09       RC, LG (Dynamic thermal management), NPG
Srinivasan04   RC, LG (Dynamic reliability management), NPG
Karl08         RC, LG (Dynamic reliability management), NPG
Our Work       RC, LG (Dynamic reliability management), PG

(N)RC – (Non-) Reliability Constrained
(N)LG – (No) Lifetime Guarantee
(N)PG – (No) Performance Guarantee


Optimal (Discretized) Solution Flow
• For each core
  • For each combination in which the core is active
    • Choose discrete values of overdrive frequencies within a range
    • Perform power and temperature simulations ⇒ one‐time LUT creation
• Example:
  • If a system has 3 cores (Core A, B, C), the number of active cores m can be 1, 2 or 3
  • Core A is active
    • One (out of three) combinations when m = 1; two (out of three) combinations when m = 2; one (out of one) combination when m = 3
• Use exhaustive search based on LUT to find optimal overdrive frequencies that maximize the value of the objective function
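
The combination counts in the example, and the shape of the exhaustive search, can be sketched as follows. This is a minimal illustration only: `evaluate` stands in for the LUT lookup built from power and temperature simulations, and its scoring rule, the frequency grid, and all names are hypothetical.

```python
from itertools import combinations, product


def active_sets_containing(core, cores):
    """Map each active-core count m to the combinations that include `core`."""
    others = [c for c in cores if c != core]
    return {
        m: [(core,) + rest for rest in combinations(others, m - 1)]
        for m in range(1, len(cores) + 1)
    }


def exhaustive_search(freq_choices, n_active, evaluate):
    """Try every discrete frequency assignment; keep the best feasible one.

    `evaluate(freqs)` returns an objective value, or None when the
    assignment is infeasible (e.g., violates a reliability constraint).
    """
    best = None
    for freqs in product(freq_choices, repeat=n_active):
        value = evaluate(freqs)
        if value is not None and (best is None or value > best[1]):
            best = (freqs, value)
    return best


sets_for_a = active_sets_containing("A", ["A", "B", "C"])
print({m: len(combos) for m, combos in sets_for_a.items()})


# Hypothetical stand-in for the LUT: value grows with frequency, and
# assignments whose total frequency exceeds 5.5 GHz are infeasible.
def toy_evaluate(freqs):
    total = sum(freqs)
    return total if total <= 5.5 else None


best = exhaustive_search([2.0, 2.5, 3.0], n_active=2, evaluate=toy_evaluate)
print("best assignment:", best)
```

For Core A in a 3-core system this enumerates one, two, and one combinations for m = 1, 2, 3, matching the counts in the example above.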


Heuristic Flow

• We maximize the overdrive frequency (f_OD,m) in the order of the set of active cores for which the product of weights (w_nom,m, w_OD,m) and execution times (E_nom,m, E_OD,m) is maximum
• Example:
  • If a system has 3 cores, the number of active cores m can be 1, 2 or 3
  • If [ordering of the weight × execution-time products for m = 1, 2, 3; symbols not recoverable], we maximize the corresponding overdrive frequencies in that order
• Empirically, finds large improvements in objective function value


Testcases

• Testcases are described by
  • m: #active cores
  • E_nom,m, E_OD,m: nominal and overdrive execution times
  • w_nom,m, w_OD,m: nominal and overdrive user‐defined weights
• Eight testcases in total
  • Format is (#active cores)‐(Testcase#), e.g., 4-I
  • Seven have optimal solutions
  • One does not have feasible solution
• Example

Name   m   E_nom,m (Kh)   E_OD,m (Kh)   w_nom,m   w_OD,m
4-I    1   1              3             0.5       0.5
       2   2              5             0.3       0.7
       3   3              8             0.2       0.8
       4   2              5             0.4       0.6


Optimal, Heuristic vs. RC-LG (Baseline)

[Figure: objective function value (0 to 45000) for testcases 4-I, 4-II, 4-III, 4-IV, 4-V, 6-I, 8-I, comparing optimal, heuristic, and baseline; annotated differences: -3.3%, -17.4%, -12%, -9%]

Optimal solution improves objective function value by up to 17.4%


Takeaways on Reliability
• Signoff methodology can have huge impact
  • Example: Chicken‐and‐egg loops among BTI, EM, and signoff corner selection in AVS‐enabled systems
  • AVS = new dimension in reliability vs. design cost (power/area) tradeoff space
• Naïve enforcement of reliability rules can be costly
• Post‐IC implementation, reliability awareness at scheduler‐level improves lifetime “user experience” and guaranteed performance
• Basic challenges remain:
  • (i) reliability modeling and calibration
  • (ii) measuring reliability cost ({PPA}) with, without reliability margins
  • (iii) many “don’t turn over rocks” barriers to reliability cost reduction


III. Assessing Costs of Resilience
(1) “MinRazor”


How to Minimize Cost of Resilience?
• Additional circuits ⇒ area and power penalties
• Recovery from errors ⇒ throughput degradation
• Large hold margin ⇒ short‐path padding cost
• Want benefits (e.g., energy) to maximally outweigh costs
• “MinRazor”: Minimum‐Cost Resilient Design Implementation

                   Razor          Razor-Lite     TIMBER
Power penalty      30% [Das08]    ~0% [Kim13]    100% [Choudhury09]
Area penalty       182% [Kim13]   33% [Kim13]    255% [Chen13]
#recovery cycles   5 [Wan09]      11 [Kim13]     0 [Choudhury09]

[GLSVLSI14]


Tradeoff: Resilience Cost vs. Datapath Cost
• Tradeoff: #Razor FFs (resilience cost) vs. power/area of fanin circuits

[Figure: energy (mJ) vs. #Razor FFs (x-axis: 300, 100, 50, 0); curves: total energy, energy of non-resilient part, resilience cost]

We seek to minimize total energy via this tradeoff
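A minimal Python sketch of this tradeoff follows. All numbers are hypothetical placeholders, not data from the slides: `nonresilient_energy_mj` is an assumed energy of the non-resilient part for each choice of #Razor FFs, and `razor_cost_mj` is an assumed per-FF resilience cost.

```python
# Hypothetical energy of the non-resilient part (mJ) for each #Razor FFs choice.
# Fewer Razor FFs means more margin in the fanin circuits, hence more energy there.
nonresilient_energy_mj = {300: 8.0, 100: 8.6, 50: 9.4, 0: 11.0}
razor_cost_mj = 0.01  # assumed resilience cost per Razor FF (mJ)


def total_energy(n_razor):
    """Total energy = energy of non-resilient part + resilience cost."""
    return nonresilient_energy_mj[n_razor] + n_razor * razor_cost_mj


# Sweep the candidate #Razor FFs values and keep the one with minimum total energy.
best_n = min(nonresilient_energy_mj, key=total_energy)
print(best_n, total_energy(best_n))
```

With these made-up values the minimum sits at an intermediate #Razor FFs, which is the shape of curve the slide describes: neither all-resilient nor all-margin is cheapest.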


Selective-Endpoint Optimization (SEOpt)
• Optimize fanin cone of an endpoint w/ tighter constraints → allows replacement of Razor FF w/ normal FF
• Pick endpoints based on heuristic sensitivity functions → vary #endpoints and compare area/power penalty

Candidate Sensitivity Functions
[Table: five candidate sensitivity functions, numbered 1 to 5; the formulas did not survive extraction]
Notation: p = negative-slack endpoint; c = cells within fanin cone; Numcri = number of negative-slack cells


Clock Skew Optimization (SkewOpt)
• Increase slacks on timing-critical and/or frequently-exercised paths
1. Generate sequential graph
2. Find cycle of paths with minimum total weight → adjust clock latencies → contract the cycle into one vertex
3. Iterate Step 2 until all endpoints are optimized

[Figure: sequential graph of FF1, FF2, FF3 with data-path edges of weights W12, W23, W31 and a clock tree; edges on the cycle relabeled W’, where W’ = average weight on cycle]
[Equation: edge weight W of path p-q, defined from the setup slack of path p-q, a weighting factor β, and the toggle rate of path p-q]
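The "adjust clock latencies" part of Step 2 can be sketched in Python for a single cycle. This is an illustration under stated assumptions, not the flow's actual implementation: edge weights are treated as plain setup slacks in ps (the slide's weight also involves β and toggle rate), hold constraints are ignored, and the cycle search and contraction steps are omitted. The function name `balance_cycle` is hypothetical.

```python
def balance_cycle(ffs, weights):
    """Equalize edge weights around one cycle by shifting clock latencies.

    ffs:     FF names in cycle order.
    weights: weights[i] is the weight (setup slack, ps) of the edge
             ffs[i] -> ffs[(i + 1) % n].
    Returns (average weight, latency per FF, new edge weights).
    """
    n = len(ffs)
    avg = sum(weights) / n
    latency = {ffs[0]: 0.0}  # reference FF keeps its clock latency
    for i in range(n - 1):
        # Delaying the capture clock by d adds d to the launch->capture slack.
        latency[ffs[i + 1]] = latency[ffs[i]] + (avg - weights[i])
    new_weights = [
        weights[i] + latency[ffs[(i + 1) % n]] - latency[ffs[i]]
        for i in range(n)
    ]
    return avg, latency, new_weights


# Three-FF cycle as in the figure: W12 = -100 ps, W23 = 300 ps, W31 = 100 ps.
avg, latency, new_weights = balance_cycle(["FF1", "FF2", "FF3"], [-100, 300, 100])
print(avg, latency, new_weights)
```

Because the latency shifts sum to zero around the cycle, the total weight is preserved and every edge ends up at W’, the average weight on the cycle.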


Overall Optimization Flow
• Iteratively optimize with SEOpt and SkewOpt

[Flowchart boxes: Initial placement (all FFs = error-tolerant FFs); SEOpt stage: margin insertion on K paths based on sensitivity function, replace error-tolerant FFs w/ normal FFs; SkewOpt stage: activity-aware clock skew optimization; OR-tree insertion; decision: Energy < min energy?; Save current solution]
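The iterate-and-keep-best structure of this flow can be sketched in Python as follows. The step order (SEOpt, then SkewOpt, then the energy check) is an assumption read off the flowchart boxes, and `combined_opt` plus the `toy_*` stand-ins are hypothetical names with made-up behavior, not the actual tools.

```python
def combined_opt(design, seopt, skewopt, energy, max_iters=10):
    """Iterate SEOpt and SkewOpt, keeping the lowest-energy solution seen."""
    best, min_energy = design, energy(design)
    current = design
    for _ in range(max_iters):
        current = skewopt(seopt(current))  # assumed order: SEOpt, then SkewOpt
        e = energy(current)
        if e < min_energy:                 # "Energy < min energy?"
            best, min_energy = current, e  # "Save current solution"
    return best, min_energy


# Toy stand-ins: a "design" is just its number of error-tolerant FFs.
def toy_seopt(n):
    return max(0, n - 100)  # replace up to 100 error-tolerant FFs with normal FFs


def toy_skewopt(n):
    return n  # no change in this toy model


def toy_energy(n):
    # Made-up model: margin cost grows as FFs are replaced, resilience cost shrinks.
    return 8.0 + 0.5 * ((300 - n) / 100) ** 2 + 0.01 * n


best, min_energy = combined_opt(300, toy_seopt, toy_skewopt, toy_energy)
print(best, min_energy)
```

Passing the steps in as functions keeps the loop independent of any particular tool; only the saved-best bookkeeping is the point of the sketch.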


Benefit of Low-Cost Resilience
• Proposed method (CO) minimizes cost of resilience in terms of energy
• Reference flows
  • Pure-margin (PM): conventional method w/ only margin insertion → cost of pure margin insertion = up to 21% energy overhead
  • Brute-force (BF): use error-tolerant FFs for timing-critical endpoints → cost of resilience w/ poor design method = up to 10% energy overhead
• Cost increases with larger process variation

[Figure: bar charts of energy (mJ) for PM, BF, and CO at small, medium, and large margin; panels: MUL and EXU; stacked components: energy w/o resilience, energy penalty of additional circuits, energy penalty of throughput degradation]
Small/medium/large margin = 1σ/2σ/3σ for SS corner
Technology: foundry 28nm


Increased Benefit of Resilience with AVS
• Adaptive voltage scaling allows a lower supply voltage for resilient designs, and thus reduced power
• Proposed method trades off timing-error penalty against reduced power at a lower supply voltage
• Proposed method achieves an average of 17% energy reduction compared to pure-margin designs → proposed optimization leads to further reduced resilience cost in the context of AVS strategy

[Figure: energy (mJ) vs. supply voltage (V) for pure-margin, brute-force, and CombOpt designs, with minimum achievable energy marked; panels: MUL and EXU; supply-voltage ranges 0.86 to 1.02 V and 0.70 to 0.80 V]
Technology: foundry 28nm


Optimization of TIMBER-Based Designs
• TIMBER FFs use time borrowing to mask timing errors
• Additional constraints to select endpoints as TIMBER FFs
  (1) No loop of TIMBER FFs
  (2) No chained TIMBER FFs with more than two stages (assume two error-detection intervals)
• Require additional timing slacks on fanout paths to mitigate timing errors
• As compared to the solution of the proposed flow (CO):
  → Cost of pure margin insertion = 23% energy overhead
  → Cost of resilience w/ poor design method = 7% energy overhead

[Figure: bar chart of energy (mJ) for PM, BF, and CO; stacked components: energy w/o resilience, energy penalty of additional circuits]
Design: ARM M0
Technology: foundry 40nm
ED interval = 10% of clock period
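The two selection constraints on TIMBER FFs can be checked with a depth-first search over the sequential graph. The Python sketch below assumes that "no more than two stages" means at most two consecutive TIMBER FFs on any launch-to-capture chain; the function name `timber_assignment_ok` and the edge-list representation are hypothetical.

```python
def timber_assignment_ok(edges, timber, max_chain=2):
    """Check a candidate set of TIMBER FFs against constraints (1) and (2).

    edges:  list of (launch_ff, capture_ff) pairs of the sequential graph.
    timber: set of FFs selected as TIMBER FFs.
    """
    # Keep only edges whose launch and capture FFs are both TIMBER FFs.
    succ = {ff: [] for ff in timber}
    for u, v in edges:
        if u in timber and v in timber:
            succ[u].append(v)

    WHITE, GRAY, BLACK = 0, 1, 2
    color = {ff: WHITE for ff in timber}
    chain = {}  # longest chain of TIMBER FFs starting at each finished FF

    def visit(u):
        color[u] = GRAY
        longest = 1
        for v in succ[u]:
            if color[v] == GRAY:
                return False  # back edge: loop of TIMBER FFs, violates (1)
            if color[v] == WHITE and not visit(v):
                return False
            longest = max(longest, 1 + chain[v])
        chain[u] = longest
        color[u] = BLACK
        return longest <= max_chain  # violates (2) if the chain is too long

    return all(color[ff] != WHITE or visit(ff) for ff in timber)


ring = [("A", "B"), ("B", "C"), ("C", "A")]
print(timber_assignment_ok(ring, {"A", "B"}))       # two chained TIMBER FFs
print(timber_assignment_ok(ring, {"A", "B", "C"}))  # loop of TIMBER FFs
```

A check like this would sit inside endpoint selection, rejecting candidates before any timing optimization is spent on them.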


Recent: Iterative Opt for Conventional Designs
• Cost of resilience = area/power overheads, design difficulties, … → Can we achieve similar benefits without resilient circuits, but following the same spirit of optimization for resilient designs?
• Optimization flow
  I. Relax timing constraints on all paths to original clock period + relaxed margin
  II. Calculate sensitivity function (SF) of each endpoint with respect to original clock period (SF = sum of |slack * power| of negative-slack cells in the fanin cone)
  III. Based on SF (sorted in increasing order), select top 10% of endpoints to recover to original clock period (i.e., perform timing optimization with updated SDC file)
  IV. Iterate Steps II and III 10 times
• Design: ARM M0 at foundry 40nm (clock period = 6ns, relaxed margin = 300ps)
• Optimization shows 16% power reduction
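Steps II and III can be sketched in Python as follows. The data layout (each endpoint mapped to a list of `(slack_ns, power_mw)` pairs for the cells in its fanin cone) and the function names are assumptions for illustration. Reading "sorted in increasing order, select top 10%" as "take the 10% of endpoints with the smallest SF" is also an interpretation.

```python
import math


def sensitivity(cells):
    """SF = sum of |slack * power| over negative-slack cells in the fanin cone."""
    return sum(abs(slack * power) for slack, power in cells if slack < 0)


def select_endpoints(cones, fraction=0.10):
    """Rank endpoints by SF (increasing) and return the first `fraction` of them."""
    ranked = sorted(cones, key=lambda ep: sensitivity(cones[ep]))
    k = max(1, math.ceil(fraction * len(ranked)))
    return ranked[:k]


# Ten hypothetical endpoints; each cone has one negative-slack cell and one
# positive-slack cell (the latter does not contribute to SF).
cones = {f"ep{i}": [(-0.01 * (i + 1), 1.0), (0.05, 2.0)] for i in range(10)}
print(select_endpoints(cones))
```

The selected endpoints would then be constrained back to the original clock period in the updated SDC file before the next iteration.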

All power values are reported at clock period = 6ns

Iteration   Power (mW)   Endpoints w/ violation   Area (um^2)
1           2.145        452                      131340
2           2.669        339                      131089
3           2.703        264                      130563
4           2.371        215                      130486
5           2.329        139                      130880
6           2.373        89                       131415
7           2.446        31                       130712
8           2.452        0                        131011
PM          2.934        0                        131319

[Figure: total power (mW) vs. #endpoints with timing violation (0 to 500); series: PM, Opt]
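The 16% figure can be reproduced from the table, assuming it compares the PM row against iteration 8, the first iteration with no violating endpoints. The helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(reference, optimized):
    """Fractional reduction of `optimized` relative to `reference`."""
    return (reference - optimized) / reference


pm_power_mw = 2.934   # pure-margin (PM) row
opt_power_mw = 2.452  # iteration 8: 0 endpoints with violation

reduction = relative_reduction(pm_power_mw, opt_power_mw)
print(f"{reduction:.1%}")  # prints 16.4%
```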


A Different Slack Distribution
• Design optimized with the new iterative optimization flow has a more balanced slack distribution
• More timing paths with small slacks → exploit additional timing slacks for power reduction
• Reopens the question: How to best trade timing slacks for power reduction in IC implementation / performance closure?

[Figure: slack distributions of the optimized design and the PM design]


Takeaways on Resilience
• “Cost of resilience” strongly depends on ability to mix resilient and non-resilient circuits
  • Up to 21% and 10% energy overheads respectively for cost of margin insertion and resilience (with poor design method)
  • Careful reduction of resilience cost can improve resilient design value proposition in the AVS context
• Yet again: hard to obtain correct starting point for benefit/cost assessment!
• Basic challenges remain:
  • (i) measuring cost of resilience at software level
  • (ii) unpredictable dependencies on design, implementation and operating scenarios
  • (iii) missing formulations of resilience as “optimizable objectives” for design tools


In Closing …
• ‘Ground-up’ (crawl before walk, walk before run) approach is still stuck at ground level (in my group)
  • Basics of techno models, reliability models, margins, signoff criteria, implementation flows, design testcases, workload models, narrow windows of opportunity, …, optimization problem statements, … still way too fuzzy for our tastes (!)
• How should we assess the cost of {reliability, resilience}?
  • Is it even possible in a general, non-artifactual way?
  • Can we taxonomize and avoid pitfalls seen in previous works?
• Targets for next / new research?
  • Missing theorems? Missing links? Missing infrastructure? Missing models and data? Missing problem statements?
= to be identified during this colloquium !?!?


THANK YOU !