Performance Optimization and Analysis of Blade Designs under Delay Variability
Dylan Hand∗, Hsin-Ho Huang∗, Benmao Chang‡, Yang Zhang∗, Matheus Trevisan Moreira∗†, Melvin Breuer∗,
Ney Laert Vilar Calazans†, and Peter A. Beerel∗
May 5th, 2015
∗ University of Southern California, Los Angeles, CA
† Pontificia Universidade Catolica do Rio Grande do Sul, Porto Alegre, Brazil
‡ Dept. of Automation, Tsinghua University, Beijing, China
The Context: Delay Overheads
Traditional synchronous design suffers from increased margins
• Worse at low and near-threshold regions
MOTIVATION | 2
[Figure: cycle time of clocked logic, split into logic time (logic gates), flip-flop alignment, clock margin, and PVT margin] [Dreslinski et al., IEEE Proc. 2010]
Potential of Average-Case Data
MOTIVATION | 3
[Figure: histogram of data delay in the Plasma CPU (x-axis: delay, increasing toward "Slower"; y-axis: number of operations), with markers for the average delay and the worst-case delay. The cycle time of clocked logic (logic gates, logic time) is set by the worst-case data.]
Delay variation due to data is rarely exploited in synchronous designs
One Solution: Resiliency
RESILIENCY | 4
Timing errors delay handshaking by the resiliency window Δ
[Figure: Blade pipeline stage. Labeled elements: Error Detecting Latch (EDL), Combinational Logic, Single Rail Datapath, Error Detection Logic, Asynchronous Controllers A and B, Reconfigurable Delay Lines δ and Δ, and the signals CLK, Err, and Sample. The animation steps through a timing error (Err = 1).]
Resiliency Performance Benefit
RESILIENCY | 5
[Figure: histogram of data delay in the Plasma CPU (x-axis: delay; y-axis: number of operations), marking the nominal delay, the worst-case delay, and the intervals δ and Δ]
Key Question: How do we set δ to optimize performance?
Outline
Performance Optimization
• Delay models
• Impact of delay line quantization
• Impact of metastability
• Comparison to Bubble Razor
Case Study
• Analyze and optimize a 3-stage Blade CPU
Conclusions and Future Work
OUTLINE | 6
Delay Models
Our approach
• Analyze the performance of Blade for a variety of delay models
[Figure: three logic-delay distributions: Normal Distribution (probability vs. logic delay), Log-Normal Distribution (probability vs. logic delay), and Real World Distribution (frequency vs. logic delay)]
DELAY MODELS | 7
Optimal Average-Case Performance
DELAY MODELS | 8
[Figure: probability distribution of stage delay with mean μ and spread σ, marked at δ and δ + Δ. The area up to δ is PR(d ≤ δ); the area beyond δ is 1 − PR(d ≤ δ), the probability of error (p).]
Definitions
• C : Clock Period / Cycle Time
• EC : Effective Clock Period
• p : Probability of error, p = 1 − PR(d ≤ δ)
• $\bar{d}$ : Average delay of Blade stage
$C = \delta + \Delta$
$\bar{d} = \delta + p \cdot \Delta$
Optimal performance achieved by minimizing $\bar{d}$
Assumes backward latency is hidden via latch retiming
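As a rough illustration of how the definitions above might be applied to a measured ("real world") delay histogram, the sketch below sweeps candidate values of δ and keeps the one with the lowest average stage delay. The sample delays, the function names, and the choice of the largest sample as the cycle time C = δ + Δ are all assumptions made for this example, not values from the talk.

```python
def average_stage_delay(samples, delta, cycle_time):
    """Return (p, d_bar) using p = 1 - PR(d <= delta) and d_bar = delta + p * Delta."""
    window = cycle_time - delta                       # Delta = C - delta
    on_time = sum(1 for d in samples if d <= delta)   # operations finishing within delta
    p = 1.0 - on_time / len(samples)                  # empirical probability of error
    return p, delta + p * window


def best_nominal_delay(samples, candidates):
    """Pick the candidate delta that minimizes the average stage delay."""
    cycle_time = max(samples)  # assumption: C covers the slowest observed operation
    return min(candidates,
               key=lambda delta: average_stage_delay(samples, delta, cycle_time)[1])


# Hypothetical delay samples (arbitrary time units), not measured data.
delays = [40, 45, 50, 50, 55, 60, 60, 70, 80, 100]
delta_best = best_nominal_delay(delays, sorted(set(delays)))
p_best, d_bar_best = average_stage_delay(delays, delta_best, max(delays))
print(delta_best, round(p_best, 2), round(d_bar_best, 2))
```

With these made-up samples the sweep settles on a δ well below the worst case, accepting a sizeable error rate in exchange for a shorter nominal delay.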
Optimal Probability of Error - p_opt
DELAY MODELS | 9
p_opt observations
• Varies between 20% and 35% for log-normal distributions
• Significantly higher than in sync resiliency
• Constant for normal distributions!
[Figure: plot annotated "Higher Variance"]
Proof of constant p_opt
DELAY MODELS | 10
Assume worst case delay per stage is constant:
$K = \delta + \Delta$
Worst case delay is set by mean, variance, and SER:
$K = \mu + m \cdot \sigma$
Systematic Error Rate (ξ) sets the worst-case delay per stage, K:
$m = f(\xi)$
$\xi = 1 - \mathrm{PR}(d \le C)^{N}$
Recall:
$\bar{d} = \delta + p \cdot \Delta = (1 - p) \cdot \delta + p \cdot K$
For Normal distribution:
$1 - p = \frac{1}{2}\left[1 + \mathrm{erf}\left(\frac{\delta - \mu}{\sqrt{2}\,\sigma}\right)\right]$
$1 - 2p = \mathrm{erf}\left(\frac{\delta - \mu}{\sqrt{2}\,\sigma}\right)$
Taking inverse error function of both sides:
$\mathrm{erf}^{-1}(1 - 2p) = \frac{\delta - \mu}{\sqrt{2}\,\sigma}$
$\delta = \sqrt{2}\,\sigma\,\mathrm{erf}^{-1}(1 - 2p) + \mu$
Rewrite, substituting δ:
$\bar{d} = (1 - p)\left[\sqrt{2}\,\sigma\,\mathrm{erf}^{-1}(1 - 2p) + \mu\right] + p \cdot K$
Minimize $\bar{d}$ by substituting $K = \mu + m \cdot \sigma$, taking the derivative, and setting it equal to zero, with $y = 1 - 2p$:
$\frac{\partial \bar{d}}{\partial p} = \sigma\left[m - \sqrt{2}\,\mathrm{erf}^{-1}(y) - \sqrt{2}\,(1 + y)\,\frac{\partial\,\mathrm{erf}^{-1}(y)}{\partial y}\right] = 0$
Note y and p are independent of σ and μ! m depends only on ξ
Implication: Tuning of delay line may target fixed probability!
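The constant-p_opt result can be checked numerically. The sketch below is only an illustration under stated assumptions: it takes a normally distributed stage delay, fixes a hypothetical margin m = 3 (the talk does not give a value), and grid-searches p below 50% for the minimum of the average stage delay (1 − p)·δ + p·K. The function names are invented for this example.

```python
from statistics import NormalDist


def average_delay(p, mu, sigma, m):
    """Average stage delay (1 - p) * delta + p * K for a normal delay distribution."""
    delta = NormalDist(mu, sigma).inv_cdf(1.0 - p)  # delta with PR(d <= delta) = 1 - p
    worst_case = mu + m * sigma                     # K = mu + m * sigma
    return (1.0 - p) * delta + p * worst_case


def optimal_error_probability(mu, sigma, m, steps=5000):
    """Grid-search p in (0, 0.5) for the smallest average stage delay."""
    grid = [i / (2 * steps) for i in range(1, steps)]
    return min(grid, key=lambda p: average_delay(p, mu, sigma, m))


M = 3.0  # hypothetical margin, chosen only for illustration
for mu, sigma in [(1.0, 0.05), (1.0, 0.2), (5.0, 1.0)]:
    print(mu, sigma, round(optimal_error_probability(mu, sigma, M), 3))
```

Under these assumptions each (μ, σ) pair should report approximately the same optimal error probability, which matches the claim that the optimum depends only on m.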
Optimal Size of Resiliency Window - Δ
Blade supports a maximum Δ of 50% of the clock cycle
Optimal Δ is larger for designs with high variance!
DELAY LINE QUANTIZATION | 11
Delay Line Quantization Effects
Quantization effects are reduced due to the inherent tradeoff between nominal delay δ and error penalty p * Δ
[Figure: delay plot showing the nominal delay δ and the error penalty p * Δ]
Recall: $\bar{d} = \delta + p \cdot \Delta$
DELAY LINE QUANTIZATION | 12
Delay Line Quantization in BD
Linear relationship between delay line quantization and average stage delay in Bundled Data (BD)
[Figure: delay plot showing δ and the average stage delay $\bar{d}$]
Bundled Data: $\bar{d} = \delta$
DELAY LINE QUANTIZATION | 13
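The contrast between the two quantization slides can be sketched numerically. The example below rests on several assumptions that are not from the talk: stage delay is normal, the worst-case delay K = μ + m·σ stays fixed, the delay line can only realize multiples of a hypothetical step size, and Δ simply absorbs whatever remains of K. For Blade the penalty is the increase in average stage delay over a near-continuous sweep of δ; for Bundled Data it is the amount by which the delay line overshoots K after rounding up.

```python
import math
from statistics import NormalDist


def blade_average_delay(delta, dist, worst_case):
    """Average Blade stage delay: delta + p * Delta with Delta = K - delta."""
    p = 1.0 - dist.cdf(delta)  # probability of a timing error
    return delta + p * (worst_case - delta)


def quantization_penalties(mu, sigma, m, step):
    """Return (Blade penalty, Bundled Data penalty) for a delay line with the given step."""
    dist = NormalDist(mu, sigma)
    worst_case = mu + m * sigma  # K = mu + m * sigma
    # Reference: near-continuous sweep of delta between mu and K.
    fine = [mu + i * (worst_case - mu) / 10000 for i in range(10001)]
    ideal = min(blade_average_delay(d, dist, worst_case) for d in fine)
    # Blade: delta limited to multiples of `step` that do not exceed K.
    taps = [k * step for k in range(1, math.floor(worst_case / step) + 1)]
    blade = min(blade_average_delay(d, dist, worst_case) for d in taps)
    # Bundled Data: the delay line must cover K, rounded up to the next multiple.
    bundled = math.ceil(worst_case / step) * step
    return blade - ideal, bundled - worst_case


# Hypothetical numbers in arbitrary time units.
blade_penalty, bd_penalty = quantization_penalties(mu=1.0, sigma=0.1, m=3.0, step=0.08)
print(round(blade_penalty, 4), round(bd_penalty, 4))
```

In this sketch the Blade penalty stays small because the average delay is flat near its optimum, while the Bundled Data penalty can be as large as one full step.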
Metastability Effects
METASTABILITY | 14
p : probability of timing violation
tMS : time for metastability to resolve
tMSE : MS in error-detecting latch
tMSL : MS in detection logic
[Figure: expected stage delay E[delay] for three cases: no MS, MS in EDL (only), and MS in detection logic. Outcome delays shown: δ, δ + Δ, δ + tMSL, and δ + Δ + tMSL.]
Metastability resolution times are most often hidden!
Metastability Effects
METASTABILITY | 15
Comparison to Sync Resiliency
COMPARISON | 16
Synchronous
• EC set by systematic error rate
Bubble Razor [Zhang, 2014]
• $EC = C \left[ 2 - (1 - p)^{2N} \right]$
Blade
• $EC = \delta + p_{opt} \cdot \Delta$
[Figure: N-stage rings for each design style, with elements labeled FF, M, S, and EDL]
[Figure: comparison for a Normal distribution of Synchronous, Bubble Razor (BR), and Blade, annotated 35% and 23%]
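The two effective-clock-period expressions on this slide can be evaluated directly. The sketch below reads the Bubble Razor formula as EC = C·[2 − (1 − p)^(2N)], which is a reconstruction from the slide text; the function names and the sample arguments are assumptions for illustration only.

```python
def bubble_razor_effective_period(clock_period, p, n_stages):
    """EC = C * (2 - (1 - p)^(2N)) for an N-stage ring (formula as read from the slide)."""
    return clock_period * (2.0 - (1.0 - p) ** (2 * n_stages))


def blade_effective_period(delta, window, p_opt):
    """EC = delta + p_opt * Delta."""
    return delta + p_opt * window


# Hypothetical inputs: a 3-stage ring with a 5% error probability per latch,
# and a Blade stage with delta = 0.7, Delta = 0.3, p_opt = 0.25 (arbitrary units).
print(bubble_razor_effective_period(1.0, 0.05, 3))
print(blade_effective_period(0.7, 0.3, 0.25))
```

Read this way, the Bubble Razor period equals C when no errors occur (p = 0) and approaches 2C as p approaches 1, while the Blade period grows only by the fraction p_opt of the resiliency window.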
Application to 3-Stage CPU
CASE STUDY | 17
Plasma MIPS OpenCore [1]
28nm FDSOI
Compare mathematical model of optimal resiliency window (Δ) with simulation results
[1] http://opencores.org/project,plasma
Application to 3-Stage CPU
CASE STUDY | 18
[Figure: histograms of cycles (%) versus delay (%) for Distribution A and Distribution C, each annotated with the resiliency window Δ]
Application to 3-Stage CPU
CASE STUDY | 19
Model estimated optimal Δ within 5.4%
• Optimal EC within 1% (99% accuracy)
Model allows estimation of optimal Δ without the limitations of the simulated design

                        Distribution A    Distribution B    Distribution C
Δmax                    27%               35%               43%
Δopt (Model / Sim)      26% / 27%         34% / 35%         39% / 37%
ECopt (Model / Sim)     74.8% / 75.1%     71.3% / 71.9%     75.1% / 74.8%
Ideal Δopt              48%               53%               46%
Ideal ECopt             63.2%             67.8%             74.5%
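The "within 5.4%" and "within 99%" figures can be reproduced from the Δopt and ECopt rows of the table. The short check below assumes the error is measured relative to the simulated value, which is the reading that yields 5.4%; the variable and function names are chosen for this example.

```python
# (model, simulation) pairs in percent, copied from the Δopt and ECopt rows.
delta_opt = {"A": (26.0, 27.0), "B": (34.0, 35.0), "C": (39.0, 37.0)}
ec_opt = {"A": (74.8, 75.1), "B": (71.3, 71.9), "C": (75.1, 74.8)}


def relative_error(model, sim):
    """Relative difference of the model value, measured against simulation."""
    return abs(model - sim) / sim


worst_delta = max(relative_error(*pair) for pair in delta_opt.values())
worst_ec = max(relative_error(*pair) for pair in ec_opt.values())
print(f"{worst_delta:.1%}")  # prints 5.4%
print(f"{worst_ec:.1%}")     # prints 0.8%
```

The largest Δopt discrepancy comes from Distribution C (39% against 37%), and every ECopt estimate lands within 1% of simulation.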
Summary and Conclusions
Performance model
• Uses either analytical or real-world delay distributions
• Predicts performance with 99% accuracy
Comparison to sync N-stage rings
• 23% better than Bubble Razor
• 35% better than traditional designs
Several interesting conclusions
• Optimal error rate is relatively high and may be constant
• Programmable delay line need not be fine-grained
• Metastability impact is negligible
• Supporting larger resiliency windows may be useful
CONCLUSIONS | 20
Questions?