Performance Optimization and Analysis of Blade Designs under...

39
Performance Optimization and Analysis of Blade Designs under Delay Variability Dylan Hand , Hsin-Ho Huang , Benmao Chang , Yang Zhang * , Matheus Trevisan Moreira , Melvin Breuer , Ney Laert Vilar Calazans , and Peter A. Beerel May 5th, 2015 * University of Southern California, Los Angeles, CA † Pontifcia Universidade Catlica do Rio Grande do Sul, Porto Alegre, Brazil ‡ Dept. of Automation, Tsinghua University, Beijing, China

Transcript of Performance Optimization and Analysis of Blade Designs under...

Page 1: Performance Optimization and Analysis of Blade Designs under …ee.usc.edu/async2015/web/wp-content/uploads/2015/03/S4_P... · 2015-05-14 · Performance Optimization and Analysis

Performance Optimization and Analysis of Blade Designs under Delay Variability

Dylan Hand∗, Hsin-Ho Huang∗, Benmao Chang‡ , Yang Zhang*, Matheus Trevisan Moreira∗†, Melvin Breuer∗,

Ney Laert Vilar Calazans†, and Peter A. Beerel∗

May 5th, 2015

* University of Southern California, Los Angeles, CA† Pontificia Universidade Catolica do Rio Grande do Sul, Porto Alegre, Brazil ‡ Dept. of Automation, Tsinghua University, Beijing, China

Page 2: Performance Optimization and Analysis of Blade Designs under …ee.usc.edu/async2015/web/wp-content/uploads/2015/03/S4_P... · 2015-05-14 · Performance Optimization and Analysis

The Context: Delay Overheads

Traditional synchronous design suffers from increased margins• Worse at low and near-threshold regions

MOTIVATION | 2

PVT margin

Clock margin

Flip-flop alignment

Cycle time ofclocked logic

Logic gates

Logic Time

[Dreslinksi et al., IEEE Proc. 2010]

Page 3: Performance Optimization and Analysis of Blade Designs under …ee.usc.edu/async2015/web/wp-content/uploads/2015/03/S4_P... · 2015-05-14 · Performance Optimization and Analysis

Data delay in Plasma CPU

Potential of Average-Case Data

Delay

Num

ber

of O

pera

tions

Slower

Logic gates

Logic Time

Worst-CaseData

Cycle time ofclocked logic

MOTIVATION | 3

Delay variation due to data is rarely exploited in synchronous designs

Page 4: Performance Optimization and Analysis of Blade Designs under …ee.usc.edu/async2015/web/wp-content/uploads/2015/03/S4_P... · 2015-05-14 · Performance Optimization and Analysis

Data delay in Plasma CPU

Potential of Average-Case Data

Delay

Num

ber

of O

pera

tions

Slower

Logic gates

Logic Time

Worst-CaseData

Worst-Case DelayCycle time ofclocked logic

MOTIVATION | 3

Delay variation due to data is rarely exploited in synchronous designs

Page 5: Performance Optimization and Analysis of Blade Designs under …ee.usc.edu/async2015/web/wp-content/uploads/2015/03/S4_P... · 2015-05-14 · Performance Optimization and Analysis

Data delay in Plasma CPU

Potential of Average-Case Data

Delay

Num

ber

of O

pera

tions

Slower

Logic gates

Logic Time

Worst-CaseData

Worst-Case DelayAverage DelayCycle time ofclocked logic

MOTIVATION | 3

Delay variation due to data is rarely exploited in synchronous designs

Page 6: Performance Optimization and Analysis of Blade Designs under …ee.usc.edu/async2015/web/wp-content/uploads/2015/03/S4_P... · 2015-05-14 · Performance Optimization and Analysis

EDLCombinational

Logic

δ

Error Detection Logic

ControllerB

Δ

Error Detection Logic

ControllerA

Δ

EDL

One Solution: Resiliency

Timing errors delay handshaking by the resiliency window Δ

RESILIENCY | 4

CLK Err

Sam

ple

CLK Err

Sam

ple

Asynchronous Controller

Error Detecting Latch

Single Rail Datapath

Reconfigurable Delay Lines

Page 7: Performance Optimization and Analysis of Blade Designs under …ee.usc.edu/async2015/web/wp-content/uploads/2015/03/S4_P... · 2015-05-14 · Performance Optimization and Analysis

EDLCombinational

Logic

δ

Error Detection Logic

ControllerB

Δ

Error Detection Logic

ControllerA

Δ

EDL

One Solution: Resiliency

Timing errors delay handshaking by the resiliency window Δ

RESILIENCY | 4

CLK Err

Sam

ple

CLK Err

Sam

ple

Page 8: Performance Optimization and Analysis of Blade Designs under …ee.usc.edu/async2015/web/wp-content/uploads/2015/03/S4_P... · 2015-05-14 · Performance Optimization and Analysis

EDLCombinational

Logic

δ

Error Detection Logic

ControllerB

Δ

Error Detection Logic

ControllerA

Δ

EDL

One Solution: Resiliency

Timing errors delay handshaking by the resiliency window Δ

RESILIENCY | 4

CLK Err

Sam

ple

CLK Err

Sam

ple

Page 9: Performance Optimization and Analysis of Blade Designs under …ee.usc.edu/async2015/web/wp-content/uploads/2015/03/S4_P... · 2015-05-14 · Performance Optimization and Analysis

EDLCombinational

Logic

δ

Error Detection Logic

ControllerB

Δ

Error Detection Logic

ControllerA

Δ

EDL

One Solution: Resiliency

Timing errors delay handshaking by the resiliency window Δ

RESILIENCY | 4

CLK Err

Sam

ple

CLK Err

Sam

ple

Page 10: Performance Optimization and Analysis of Blade Designs under …ee.usc.edu/async2015/web/wp-content/uploads/2015/03/S4_P... · 2015-05-14 · Performance Optimization and Analysis

EDLCombinational

Logic

δ

Error Detection Logic

ControllerB

Δ

Error Detection Logic

ControllerA

Δ

EDL

One Solution: Resiliency

Timing errors delay handshaking by the resiliency window Δ

RESILIENCY | 4

CLK Err

Sam

ple

CLK Err

Sam

ple

Page 11: Performance Optimization and Analysis of Blade Designs under …ee.usc.edu/async2015/web/wp-content/uploads/2015/03/S4_P... · 2015-05-14 · Performance Optimization and Analysis

EDLCombinational

Logic

δ

Error Detection Logic

ControllerB

Δ

Error Detection Logic

ControllerA

Δ

EDL

One Solution: Resiliency

Timing errors delay handshaking by the resiliency window Δ

RESILIENCY | 4

CLK

Sam

ple

CLK Err

Sam

pleErr = 1

Page 12: Performance Optimization and Analysis of Blade Designs under …ee.usc.edu/async2015/web/wp-content/uploads/2015/03/S4_P... · 2015-05-14 · Performance Optimization and Analysis

EDLCombinational

Logic

δ

Error Detection Logic

ControllerB

Δ

Error Detection Logic

ControllerA

Δ

EDL

One Solution: Resiliency

Timing errors delay handshaking by the resiliency window Δ

RESILIENCY | 4

CLK

Sam

ple

CLK Err

Sam

pleErr = 1

Page 13: Performance Optimization and Analysis of Blade Designs under …ee.usc.edu/async2015/web/wp-content/uploads/2015/03/S4_P... · 2015-05-14 · Performance Optimization and Analysis

EDLCombinational

Logic

δ

Error Detection Logic

ControllerB

Δ

Error Detection Logic

ControllerA

Δ

EDL

One Solution: Resiliency

Timing errors delay handshaking by the resiliency window Δ

RESILIENCY | 4

CLK Err

Sam

ple

CLK Err

Sam

ple

Page 14: Performance Optimization and Analysis of Blade Designs under …ee.usc.edu/async2015/web/wp-content/uploads/2015/03/S4_P... · 2015-05-14 · Performance Optimization and Analysis

EDLCombinational

Logic

δ

Error Detection Logic

ControllerB

Δ

Error Detection Logic

ControllerA

Δ

EDL

One Solution: Resiliency

Timing errors delay handshaking by the resiliency window Δ

RESILIENCY | 4

CLK Err

Sam

ple

CLK Err

Sam

ple

Page 15: Performance Optimization and Analysis of Blade Designs under …ee.usc.edu/async2015/web/wp-content/uploads/2015/03/S4_P... · 2015-05-14 · Performance Optimization and Analysis

Resiliency Performance Benefit

RESILIENCY | 5

Data delay in Plasma CPU

Delay

Num

ber

of O

pera

tions

Worst-Case DelayNominal Delay

Key Question: How do we set δ to optimize performance

δ Δ

Page 16: Performance Optimization and Analysis of Blade Designs under …ee.usc.edu/async2015/web/wp-content/uploads/2015/03/S4_P... · 2015-05-14 · Performance Optimization and Analysis

Outline

Performance Optimization• Delay models

• Impact of delay line quantization

• Impact of metastability

• Comparison to Bubble Razor

Case Study• Analyze and optimize a 3-stage Blade CPU

Conclusions and Future Work

OUTLINE | 6

Page 17: Performance Optimization and Analysis of Blade Designs under …ee.usc.edu/async2015/web/wp-content/uploads/2015/03/S4_P... · 2015-05-14 · Performance Optimization and Analysis

Delay Models

Our approach

• Analyze the performance of Blade for a variety of delay models

NormalDistribution

Logic Delay

Pro

babi

lity

Logic Delay

Pro

babi

lity

Log-NormalDistribution

Logic Delay

Fre

quen

cy Real WorldDistribution

DELAY MODELS | 7

Page 18: Performance Optimization and Analysis of Blade Designs under …ee.usc.edu/async2015/web/wp-content/uploads/2015/03/S4_P... · 2015-05-14 · Performance Optimization and Analysis

Optimal Average-Case Performance

DELAY MODELS | 8

Pro

babi

lity

δ δ + Δ

σ

μ

Page 19: Performance Optimization and Analysis of Blade Designs under …ee.usc.edu/async2015/web/wp-content/uploads/2015/03/S4_P... · 2015-05-14 · Performance Optimization and Analysis

: Average delay of Blade stage

Definitions• C : Clock Period / Cycle Time• EC : Effective Clock Period• p : Probability of error• 𝑑 : Average delay of Blade

stage

Optimal performance achieved by minimizing 𝑑

Optimal Average-Case Performance

DELAY MODELS | 8

Pro

babi

lity

δ δ + Δ

PR( d ≤ δ )

Pro

babi

lity

δ δ + Δ

1-PR( d ≤ δ )

Probability oferror (p)

𝐶 = 𝛿 + Δ

𝑑 = 𝛿 + p ∗ Δ

Assumes backward latency is hidden via latch retiming

Page 20: Performance Optimization and Analysis of Blade Designs under …ee.usc.edu/async2015/web/wp-content/uploads/2015/03/S4_P... · 2015-05-14 · Performance Optimization and Analysis

popt observations

• Varies between 20% and 35% for log-normal distributions

• Significantly higher than in sync resiliency

• Constant for normal distributions!

Optimal Probability of Error - popt

DELAY MODELS | 9

Higher Variance

Page 21: Performance Optimization and Analysis of Blade Designs under …ee.usc.edu/async2015/web/wp-content/uploads/2015/03/S4_P... · 2015-05-14 · Performance Optimization and Analysis

Proof of constant popt

DELAY MODELS | 10

𝐾 = 𝜇 +𝑚 ∗ 𝜎

𝐾 = 𝛿 + ΔAssume worst case delay per stage is constant

Worst case delay is set by mean, variance, and SER

Systematic Error Rate (ξ) sets the worst-case delay per stage, K

𝑚 = 𝑓(𝜉)

𝜉 = 1 − 𝑃𝑅 𝑑 ≤ 𝐶 𝑁

Page 22: Performance Optimization and Analysis of Blade Designs under …ee.usc.edu/async2015/web/wp-content/uploads/2015/03/S4_P... · 2015-05-14 · Performance Optimization and Analysis

Proof of constant popt

DELAY MODELS | 10

𝐾 = 𝜇 +𝑚 ∗ 𝜎

𝐾 = 𝛿 + ΔAssume worst case delay per stage is constant

Worst case delay is set by mean, variance, and SER

Recall: 𝑑 = 𝛿 + 𝑝 ∗ Δ = 1 − 𝑝 ∗ 𝛿 + 𝑝 ∗ 𝐾

For Normal distribution:

Taking inverse error function of both sides:

1 − 𝑝 =1

2[1 + erf

𝛿−𝜇

2𝜎]

1 − 2𝑝 = erf𝛿−𝜇

2𝜎

erf−1 1 − 2𝑝 =𝛿 − 𝜇

2𝜎𝛿 = 2𝜎[erf−1 1 − 2𝑝 ] + 𝜇

Page 23: Performance Optimization and Analysis of Blade Designs under …ee.usc.edu/async2015/web/wp-content/uploads/2015/03/S4_P... · 2015-05-14 · Performance Optimization and Analysis

Proof of constant popt

DELAY MODELS | 10

Implication: Tuning of delay line may target fixed probability!

𝐾 = 𝜇 +𝑚 ∗ 𝜎

𝐾 = 𝛿 + ΔAssume worst case delay per stage is constant

Worst case delay is set by mean, variance, and SER

Recall: 𝑑 = 𝛿 + 𝑝 ∗ Δ = 1 − 𝑝 ∗ 𝛿 + 𝑝 ∗ 𝐾

Rewrite: 𝑑 = 1 − 𝑝 [ 2𝜎[erf−1 1 − 2𝑝 ] + 𝜇] + p ∗ K

𝜕 𝑑

𝜕𝑝= 1 + 𝑦 2

𝜕erf−1 𝑦

𝜕𝑦− 2 erf−1 𝑦 +𝑚 = 0

𝑦 = 1 − 2𝑝

Minimize 𝑑 by taking derivative and setting it equal to zero :

Note y and p are independent of σ and μ! m depends only on 𝜉

Page 24: Performance Optimization and Analysis of Blade Designs under …ee.usc.edu/async2015/web/wp-content/uploads/2015/03/S4_P... · 2015-05-14 · Performance Optimization and Analysis

Optimal Size of Resiliency Window - Δ

Blade supports maximum Δ of 50% of clock cycleOptimal Δ is larger for designs with high-variance!

DELAY LINE QUANTIZATION | 11

(%)

Page 25: Performance Optimization and Analysis of Blade Designs under …ee.usc.edu/async2015/web/wp-content/uploads/2015/03/S4_P... · 2015-05-14 · Performance Optimization and Analysis

Delay Line Quantification Effects

Quantification effects reduced due to inherenttradeoff between nominal delayδ and error penalty p * Δ

Del

ay

δ p * Δ

𝑑 = 𝛿 + p ∗ ΔRecall:

DELAY LINE QUANTIZATION | 12

Page 26: Performance Optimization and Analysis of Blade Designs under …ee.usc.edu/async2015/web/wp-content/uploads/2015/03/S4_P... · 2015-05-14 · Performance Optimization and Analysis

Delay Line Quantification in BD

Linear relationship between delay line quantizationand average stage delay in Bundled Data

Del

ay

δ 𝑑

𝑑 = 𝛿Bundled Data:

DELAY LINE QUANTIZATION | 13

Page 27: Performance Optimization and Analysis of Blade Designs under …ee.usc.edu/async2015/web/wp-content/uploads/2015/03/S4_P... · 2015-05-14 · Performance Optimization and Analysis

Metastability Effectsp : probability of timing violation

tMS : time for metastability to resolve

tMSE : MS in error-detecting latch

tMSL : MS in detection logic

E[delay]

δ + Δ + tMSL

δ + tMSL

δ + Δ

δ

δ + Δ

δ + Δ

δ

δ + Δ

δ No

MS

MS

in E

DL

(onl

y)

MS

in D

ete

ctio

n Lo

gic

METASTABILITY | 14

Page 28: Performance Optimization and Analysis of Blade Designs under …ee.usc.edu/async2015/web/wp-content/uploads/2015/03/S4_P... · 2015-05-14 · Performance Optimization and Analysis

Metastability Effectsp : probability of timing violation

tMS : time for metastability to resolve

tMSE : MS in error-detecting latch

tMSL : MS in detection logic

E[delay]

δ + Δ + tMSL

δ + tMSL

δ + Δ

δ

δ + Δ

δ + Δ

δ

δ + Δ

δ No

MS

MS

in E

DL

(onl

y)

MS

in D

ete

ctio

n Lo

gic

METASTABILITY | 14

Metastability resolution times most often hidden!

Page 29: Performance Optimization and Analysis of Blade Designs under …ee.usc.edu/async2015/web/wp-content/uploads/2015/03/S4_P... · 2015-05-14 · Performance Optimization and Analysis

Metastability Effects

METASTABILITY | 15

Page 30: Performance Optimization and Analysis of Blade Designs under …ee.usc.edu/async2015/web/wp-content/uploads/2015/03/S4_P... · 2015-05-14 · Performance Optimization and Analysis

Metastability Effects

METASTABILITY | 15

Page 31: Performance Optimization and Analysis of Blade Designs under …ee.usc.edu/async2015/web/wp-content/uploads/2015/03/S4_P... · 2015-05-14 · Performance Optimization and Analysis

Comparison to Sync Resiliency

Synchronous• EC set by systematic

error rate

Bubble Razor [Zhang,2014]

• 𝐸𝐶 = 𝐶 2 − 1 − 𝑝 2𝑁

Blade• 𝐸𝐶 = 𝛿 + 𝑝𝑜𝑝𝑡 ∗ 𝛥

COMPARISON | 16

FF

FF

M S M S

ED

L

ED

L

ED

L

ED

L

N-Stage Rings

Page 32: Performance Optimization and Analysis of Blade Designs under …ee.usc.edu/async2015/web/wp-content/uploads/2015/03/S4_P... · 2015-05-14 · Performance Optimization and Analysis

Comparison to Sync Resiliency

Synchronous• EC set by systematic

error rate

Bubble Razor [Zhang,2014]

• 𝐸𝐶 = 𝐶 2 − 1 − 𝑝 2𝑁

Blade• 𝐸𝐶 = 𝛿 + 𝑝𝑜𝑝𝑡 ∗ 𝛥

COMPARISON | 16

Normal Distribution

Synchronous BR

Blade

35% 23%

Page 33: Performance Optimization and Analysis of Blade Designs under …ee.usc.edu/async2015/web/wp-content/uploads/2015/03/S4_P... · 2015-05-14 · Performance Optimization and Analysis

Comparison to Sync Resiliency

Synchronous• EC set by systematic

error rate

Bubble Razor [Zhang,2014]

• 𝐸𝐶 = 𝐶 2 − 1 − 𝑝 2𝑁

Blade• 𝐸𝐶 = 𝛿 + 𝑝𝑜𝑝𝑡 ∗ 𝛥

COMPARISON | 16

Normal Distribution

Page 34: Performance Optimization and Analysis of Blade Designs under …ee.usc.edu/async2015/web/wp-content/uploads/2015/03/S4_P... · 2015-05-14 · Performance Optimization and Analysis

Plasma MIPS OpenCore

28nm FDSOI

Compare mathematical model of optimal resiliency window (Δ) with simulation results

Application to 3-Stage CPU

[1] http://opencores.org/project,plasma

CASE STUDY | 17

Page 35: Performance Optimization and Analysis of Blade Designs under …ee.usc.edu/async2015/web/wp-content/uploads/2015/03/S4_P... · 2015-05-14 · Performance Optimization and Analysis

Application to 3-Stage CPU

20 40 60 800

5

10

15

20

Distribution A

Delay (%)

Cy

cle

s (%

)

20 40 60 800

5

10

15

20

Distribution C

Delay (%)

Cycle

s (%

)

CASE STUDY | 18

Δ

Δ

Page 36: Performance Optimization and Analysis of Blade Designs under …ee.usc.edu/async2015/web/wp-content/uploads/2015/03/S4_P... · 2015-05-14 · Performance Optimization and Analysis

Application to 3-Stage CPU

Model estimated optimal Δ within 5.4%

• Optimal EC within 99%

CASE STUDY | 19

Distribution A Distribution B Distribution C

Model Sim Model Sim Model Sim

Δmax 27% 35% 43%

Δopt 26% 27% 34% 35% 39% 37%

ECopt 74.8% 75.1% 71.3% 71.9% 75.1% 74.8%

Page 37: Performance Optimization and Analysis of Blade Designs under …ee.usc.edu/async2015/web/wp-content/uploads/2015/03/S4_P... · 2015-05-14 · Performance Optimization and Analysis

Application to 3-Stage CPU

Model estimated optimal Δ within 5.4%

• Optimal EC within 99%

Model allows estimation of optimal Δ w/o limitations of simulated design

CASE STUDY | 19

Distribution A Distribution B Distribution C

Model Sim Model Sim Model Sim

Δmax 27% 35% 43%

Δopt 26% 27% 34% 35% 39% 37%

ECopt 74.8% 75.1% 71.3% 71.9% 75.1% 74.8%

Distribution A Distribution B Distribution C

Model Sim Model Sim Model Sim

Δmax 27% 35% 43%

Δopt 26% 27% 34% 35% 39% 37%

ECopt 74.8% 75.1% 71.3% 71.9% 75.1% 74.8%

Ideal Δopt 48% 53% 46%

Ideal ECopt 63.2% 67.8% 74.5%

Page 38: Performance Optimization and Analysis of Blade Designs under …ee.usc.edu/async2015/web/wp-content/uploads/2015/03/S4_P... · 2015-05-14 · Performance Optimization and Analysis

Summary and Conclusions

Performance model

• Use either analytical and real world delay distributions

• Predicts performance within 99% accuracy

Comparison to sync N-stage rings

• 23% better than Bubble Razor

• 35% better than traditional designs

Several interesting conclusions

• Optimal error rate is relatively high and may be constant

• Programmable delay line need not be fine-grained

• Metastability impact is negligible

• Supporting larger resiliency windows may be useful

CONCLUSIONS | 20

Page 39: Performance Optimization and Analysis of Blade Designs under …ee.usc.edu/async2015/web/wp-content/uploads/2015/03/S4_P... · 2015-05-14 · Performance Optimization and Analysis

Questions?