SIMD Lane Decoupling Improved Timing-Error Resilience
![Page 1: SIMD Lane Decoupling Improved Timing-Error Resilience](https://reader036.fdocuments.net/reader036/viewer/2022062501/5681632f550346895dd3ab51/html5/thumbnails/1.jpg)
SIMD Lane Decoupling: Improved Timing-Error Resilience
Evgeni Krimer (UT Austin), Patrick Chiang (Oregon State), Mattan Erez (UT Austin)
![Page 2: SIMD Lane Decoupling Improved Timing-Error Resilience](https://reader036.fdocuments.net/reader036/viewer/2022062501/5681632f550346895dd3ab51/html5/thumbnails/2.jpg)
All systems power/energy bound
• The good:
  – Transistors still following Moore’s Law
• The bad:
  – Transistor power efficiency improving too slowly
  – Larger fraction of power to non-compute resources
• The conclusion:
  – Better algorithms
  – More efficient architectures
  – Proportionality: waste less of what you have
• This paper: SIMD + timing speculation
  – Efficient architecture + proportional guardbands
SIMD Lane Decoupling (C) M. Erez, E. Krimer
![Page 3: SIMD Lane Decoupling Improved Timing-Error Resilience](https://reader036.fdocuments.net/reader036/viewer/2022062501/5681632f550346895dd3ab51/html5/thumbnails/3.jpg)
Outline
• Setup: efficient architecture + proportional margining
• Proportional margining w/ timing speculation
• Timing speculation with SIMD
  – Problem and DPSP solution
• Methodology and modeling
• Evaluation
![Page 4: SIMD Lane Decoupling Improved Timing-Error Resilience](https://reader036.fdocuments.net/reader036/viewer/2022062501/5681632f550346895dd3ab51/html5/thumbnails/4.jpg)
Voltage/timing margins “waste” energy
• Illustrative only, not to scale
[Figure: one clock cycle ("Today") divided into typical logic delay, maximum logic delay, and guard-bands for noise, wearout, process variation, temperature, …; x-axis: time (1 cycle)]
![Page 5: SIMD Lane Decoupling Improved Timing-Error Resilience](https://reader036.fdocuments.net/reader036/viewer/2022062501/5681632f550346895dd3ab51/html5/thumbnails/5.jpg)
Voltage/timing margins “waste” energy
• Illustrative only, not to scale
[Figure: two clock-cycle bars, "Today" and "Future", each divided into typical logic delay, maximum logic delay, and guard-bands for noise, wearout, process variation, temperature, …; x-axis: time (1 cycle)]
![Page 6: SIMD Lane Decoupling Improved Timing-Error Resilience](https://reader036.fdocuments.net/reader036/viewer/2022062501/5681632f550346895dd3ab51/html5/thumbnails/6.jpg)
Timing speculation to the rescue [Ernst04]
• Razor latches
• Speculate low delay
• Detect violations
  – Early/late mismatch
• Recover by stalling
  – Requires fast “global” signal
  – Alternative: flush
• Requires extra ~10% logic
• Path delay restrictions: Δ < t < Δ+cycle
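The path-delay restriction Δ < t < Δ+cycle can be read as a classification of each path delay against the two sampling points of a Razor-style latch. The following is a minimal sketch, assuming the main flip-flop samples at `cycle` and the shadow latch samples Δ later; the function name and the labels are hypothetical and do not come from the slides.

```python
def classify_path(t, delta, cycle=1.0):
    """Classify a path delay t for a Razor-style latch (illustrative only).

    Assumptions: the main flip-flop samples at `cycle`, the shadow latch
    samples at `cycle + delta`, and all times share the same unit.
    """
    if t <= delta:
        # Too fast: new data could corrupt the delayed shadow sample.
        return "short-path violation"
    if t <= cycle:
        # Arrives before the main clock edge: no speculation error.
        return "on time"
    if t < cycle + delta:
        # Misses the main edge but is caught by the shadow latch.
        return "late, detected"
    # Misses both samples: the early/late comparison cannot see the error.
    return "late, undetected"


for delay in (0.05, 0.5, 1.05, 1.2):
    print(delay, classify_path(delay, delta=0.1))
```

Only the two middle cases satisfy the restriction on the slide; the first and last are the design-time violations it rules out.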
![Page 7: SIMD Lane Decoupling Improved Timing-Error Resilience](https://reader036.fdocuments.net/reader036/viewer/2022062501/5681632f550346895dd3ab51/html5/thumbnails/7.jpg)
Outline
• Setup: SIMD architecture + proportional margining
• Proportional margining w/ timing speculation
• Timing speculation with SIMD
  – Problem and DPSP solution
• Methodology and modeling
• Evaluation
![Page 8: SIMD Lane Decoupling Improved Timing-Error Resilience](https://reader036.fdocuments.net/reader036/viewer/2022062501/5681632f550346895dd3ab51/html5/thumbnails/8.jpg)
SIMD leads to inefficient timing speculation
![Page 9: SIMD Lane Decoupling Improved Timing-Error Resilience](https://reader036.fdocuments.net/reader036/viewer/2022062501/5681632f550346895dd3ab51/html5/thumbnails/9.jpg)
SIMD leads to inefficient timing speculation
[Figure: fraction of peak throughput (y-axis, 0.5 to 1) vs. probability of an error in a single stage, single lane (x-axis, 0 to 0.1); curves: SISD, 16-wide SIMD, 32-wide SIMD]
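One way to see why wider SIMD suffers more: in a lockstep pipeline, an error in any lane stalls every lane. The sketch below is an assumption-laden illustration, not the model behind the plot. It assumes independent errors with per-lane probability `p_err`, a fixed recovery penalty of one stall cycle, and a replay that always succeeds; `lockstep_throughput` is a hypothetical name.

```python
def lockstep_throughput(p_err, width, penalty_cycles=1):
    """Fraction of peak throughput for a lockstep SIMD unit (illustrative).

    Assumptions: errors are independent across lanes, any lane error stalls
    all lanes for `penalty_cycles`, and the replayed operation succeeds.
    """
    # Probability that at least one of `width` lanes sees a timing error.
    p_any = 1.0 - (1.0 - p_err) ** width
    # Each instruction costs one cycle plus the expected stall penalty.
    return 1.0 / (1.0 + penalty_cycles * p_any)


for width in (1, 16, 32):  # width 1 corresponds to SISD
    print(width, round(lockstep_throughput(0.05, width), 3))
```

Under these assumptions the stall probability grows as 1 − (1 − p)^W, so the same per-lane error rate costs a wide SIMD unit far more throughput than a scalar one.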
![Page 10: SIMD Lane Decoupling Improved Timing-Error Resilience](https://reader036.fdocuments.net/reader036/viewer/2022062501/5681632f550346895dd3ab51/html5/thumbnails/10.jpg)
Decoupled Parallel SIMD Pipeline (DPSP)
• Shallow FIFO for control (or between stages)
![Page 11: SIMD Lane Decoupling Improved Timing-Error Resilience](https://reader036.fdocuments.net/reader036/viewer/2022062501/5681632f550346895dd3ab51/html5/thumbnails/11.jpg)
Decoupled Parallel SIMD Pipeline (DPSP)
• Decoupling mitigates SIMD impact
[Figure: fraction of peak throughput (y-axis, 0.5 to 1) vs. probability of an error in a single stage, single lane (x-axis, 0 to 0.1); curves: SISD, 32-wide SIMD, 32-wide DPSP]
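The effect of decoupling can be illustrated with a small Monte Carlo sketch. It is a toy model under stated assumptions, not the authors' simulator: a front end issues at most one instruction per cycle to per-lane FIFOs of depth `fifo_depth`, each lane retires one instruction per cycle, and a timing error costs that lane exactly one extra cycle. With `fifo_depth=1` the model degenerates to lockstep SIMD. All names are hypothetical.

```python
import random


def simulate(width, p_err, fifo_depth, n_instr=5000, seed=0):
    """Return instructions per cycle for `width` lanes (toy model)."""
    rng = random.Random(seed)
    issued = 0                  # instructions sent to every lane's FIFO
    done = [0] * width          # instructions retired by each lane
    stalled = [False] * width   # lane is spending its one-cycle recovery
    cycles = 0
    while min(done) < n_instr:
        cycles += 1
        # Issue only if no lane's FIFO is full (the slowest lane throttles).
        if issued < n_instr and all(issued - d < fifo_depth for d in done):
            issued += 1
        for lane in range(width):
            if stalled[lane]:
                # Recovery cycle: the delayed result is now correct.
                stalled[lane] = False
                done[lane] += 1
            elif done[lane] < issued:
                if rng.random() < p_err:
                    stalled[lane] = True   # only this lane stalls
                else:
                    done[lane] += 1
    return n_instr / cycles


for name, w, depth in (("SISD", 1, 1), ("32-wide SIMD", 32, 1), ("32-wide DPSP", 32, 8)):
    print(name, round(simulate(w, 0.05, depth), 3))
```

In this model a deeper FIFO lets lanes absorb each other's stalls, so the decoupled configuration lands between lockstep SIMD and the scalar case.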
![Page 12: SIMD Lane Decoupling Improved Timing-Error Resilience](https://reader036.fdocuments.net/reader036/viewer/2022062501/5681632f550346895dd3ab51/html5/thumbnails/12.jpg)
DPSP challenge 1: inter-lane communication
• Decoupling may delay producer (store)
• Micro barriers
  – Enforce SIMD semantics
• Not a problem in practice with GPUs
  – Execution model requires explicit sync across CTAs / work-groups
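The producer/consumer hazard and the role of a micro-barrier can be shown with a software analogy. This is only a sketch: lanes are modeled as Python threads that drift apart by random delays, and `threading.Barrier` stands in for the hardware micro-barrier. The function name and the store/load pattern are hypothetical.

```python
import random
import threading
import time


def run_lanes(width=8, seed=0):
    """Each lane stores a value, then loads its neighbour's value."""
    rng = random.Random(seed)
    delays = [rng.uniform(0.0, 0.01) for _ in range(width)]
    shared = [None] * width     # stands in for memory visible to all lanes
    loaded = [None] * width
    barrier = threading.Barrier(width)

    def lane(i):
        time.sleep(delays[i])   # decoupled lanes run at different paces
        shared[i] = i * i       # producer store
        barrier.wait()          # micro-barrier: every store has completed
        loaded[i] = shared[(i + 1) % width]  # consumer load from neighbour

    threads = [threading.Thread(target=lane, args=(i,)) for i in range(width)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return loaded


print(run_lanes())
```

Without the `barrier.wait()` call, a fast lane could load before a slow neighbour has stored, which is the delayed-producer case the slide describes.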
![Page 13: SIMD Lane Decoupling Improved Timing-Error Resilience](https://reader036.fdocuments.net/reader036/viewer/2022062501/5681632f550346895dd3ab51/html5/thumbnails/13.jpg)
DPSP challenge 2: memory access locality
• Loads and stores no longer aligned
  – Memory “divergence”
• May increase pressure on on-chip memory access
• May impact off-chip access
  – Old NVIDIA hardware had memory coalescing issues
  – No problem with coalescing buffers and caches
• Micro-barriers if problematic
  – Can be done implicitly or explicitly in hardware
  – Sync before every load
  – Prediction
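Memory "divergence" can be made concrete by counting how many memory lines a group of lanes touches in one access. The sketch below assumes a 128-byte line, 4-byte elements, and a simple access pattern in which lane `l` at loop iteration `k` reads element `k * width + l`; these values and the helper names are illustrative assumptions, not figures from the slides.

```python
def lines_touched(addresses, line_bytes=128):
    """Number of distinct memory lines covered by a set of byte addresses."""
    return len({addr // line_bytes for addr in addresses})


def lane_addresses(iterations, elem_bytes=4):
    """Byte address read by each lane, given each lane's loop iteration."""
    width = len(iterations)
    return [(k * width + lane) * elem_bytes for lane, k in enumerate(iterations)]


# Lockstep: all 32 lanes are at iteration 0 and read one contiguous line.
aligned = lane_addresses([0] * 32)
# Decoupled: lanes have drifted up to 3 iterations apart.
drifted = lane_addresses([lane % 4 for lane in range(32)])

print(lines_touched(aligned), lines_touched(drifted))
```

A sync before every load, as listed above, would bring the lanes back to the aligned case at the cost of giving up some decoupling.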
![Page 14: SIMD Lane Decoupling Improved Timing-Error Resilience](https://reader036.fdocuments.net/reader036/viewer/2022062501/5681632f550346895dd3ab51/html5/thumbnails/14.jpg)
Outline
• Setup: efficient architecture + proportional margining
• Proportional margining w/ timing speculation
• Timing speculation with SIMD
  – Problem and DPSP solution
• Methodology and modeling
• Evaluation
![Page 15: SIMD Lane Decoupling Improved Timing-Error Resilience](https://reader036.fdocuments.net/reader036/viewer/2022062501/5681632f550346895dd3ab51/html5/thumbnails/15.jpg)
Evaluation flow
• Error Measurements
• Error Probability Model
• Energy-Efficiency Model
• Design Space Exploration
• Arch Sim. Validation
![Page 16: SIMD Lane Decoupling Improved Timing-Error Resilience](https://reader036.fdocuments.net/reader036/viewer/2022062501/5681632f550346895dd3ab51/html5/thumbnails/16.jpg)
Measuring error rate
• Inherently circuit and implementation dependent
• Used 3 exemplary circuits
  – SPICE-simulated adder [Ernst04]
  – FPGA-modeled multiplier [Ernst04]
  – Multiplier fabricated in our IBM 45nm SOI test chip [Pawlowski12]
[Figure: test chip; source: Pawlowski, ISSCC’12]
![Page 17: SIMD Lane Decoupling Improved Timing-Error Resilience](https://reader036.fdocuments.net/reader036/viewer/2022062501/5681632f550346895dd3ab51/html5/thumbnails/17.jpg)
Modeling the error rate function
• 2-parameter model
[Figure: error-rate model annotated with its two parameters, errVmax and Slope; curves: Adder [Ernst04], Mul. [Ernst04]]
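The slides name the two parameters (errVmax and Slope) and state elsewhere that the error rate is exponential in Vdd, but the transcript does not preserve the functional form. The sketch below is therefore a guess at one plausible shape, not the authors' model: errors start at a small onset probability at `err_v_max` and grow by `slope` decades per volt as Vdd drops. `P_ONSET` is a made-up constant needed only to make the sketch runnable.

```python
P_ONSET = 1e-6  # hypothetical error probability right at err_v_max


def error_rate(vdd, err_v_max, slope):
    """Illustrative per-operation error probability at supply voltage vdd.

    Assumptions (not from the slides): no errors at or above err_v_max;
    below it, log10 of the error rate rises linearly at `slope` decades
    per volt; the result is capped at 1.
    """
    if vdd >= err_v_max:
        return 0.0
    return min(1.0, P_ONSET * 10.0 ** (slope * (err_v_max - vdd)))


for v in (0.95, 0.90, 0.85, 0.80):
    print(v, error_rate(v, err_v_max=0.9, slope=20.0))
```

Different circuits would map to different (`err_v_max`, `slope`) pairs, which is how the adder and multiplier can be placed as points in the same two-parameter space.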
![Page 18: SIMD Lane Decoupling Improved Timing-Error Resilience](https://reader036.fdocuments.net/reader036/viewer/2022062501/5681632f550346895dd3ab51/html5/thumbnails/18.jpg)
ET2 energy-efficiency metric
• Energy × (execution) Time²
  – In circuit context: time = delay -> ED2
• Isolates architecture efficiency
  – Independent of DVFS
  – Shows improvements in addition to DVFS
$E \propto V_{dd}^{2}$, $t \propto 1/V_{dd}$
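Why ET2 is independent of DVFS follows from the two first-order scaling relations on this slide, energy proportional to the square of Vdd and delay inversely proportional to Vdd. As a short worked step (both relations are approximations):

```latex
E \propto V_{dd}^{2}, \qquad t \propto \frac{1}{V_{dd}}
\quad\Longrightarrow\quad
E\,t^{2} \;\propto\; V_{dd}^{2}\cdot\frac{1}{V_{dd}^{2}} \;=\; \text{const.}
```

Scaling voltage and frequency together therefore leaves ET2 unchanged to first order, so any ET2 change reflects the architecture rather than the operating point.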
![Page 19: SIMD Lane Decoupling Improved Timing-Error Resilience](https://reader036.fdocuments.net/reader036/viewer/2022062501/5681632f550346895dd3ab51/html5/thumbnails/19.jpg)
Simple ET2 model
• Throughput (1/T): [equation not preserved in transcript]
• Relative energy: Dynamic + Static [equation not preserved in transcript]
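Since the slide's equations did not survive, the following is only a sketch of how a throughput term and a dynamic-plus-static energy term could combine into a relative ET2 figure. Every modeling choice here is an assumption: frequency is held fixed while Vdd is lowered, dynamic energy scales with V², static energy scales with V times execution time, and `static_frac` and `logic_overhead` are hypothetical parameters (the 10% default echoes the extra Razor logic mentioned earlier).

```python
def relative_et2(v_rel, throughput, static_frac=0.2, logic_overhead=0.10):
    """ET^2 relative to a baseline at nominal Vdd and peak throughput.

    v_rel       -- Vdd relative to nominal (1.0 = nominal)
    throughput  -- fraction of peak throughput after error recovery
    static_frac -- assumed share of static energy in the baseline
    """
    t_rel = 1.0 / throughput                      # relative execution time
    dynamic = (1.0 - static_frac) * v_rel ** 2    # switching energy ~ V^2
    static = static_frac * v_rel * t_rel          # leakage accrues over time
    energy = (1.0 + logic_overhead) * (dynamic + static)
    return energy * t_rel ** 2


print(round(relative_et2(1.0, 1.0), 3))    # overhead only, no voltage benefit
print(round(relative_et2(0.9, 0.95), 3))   # lower Vdd, some recovery stalls
```

The squared time term is what makes lost throughput expensive: lowering Vdd only pays off while the error-recovery stalls stay rare.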
![Page 20: SIMD Lane Decoupling Improved Timing-Error Resilience](https://reader036.fdocuments.net/reader036/viewer/2022062501/5681632f550346895dd3ab51/html5/thumbnails/20.jpg)
GP-GPU simulation adds some realism
• Baseline uses ideal margins without speculation
  – Only max delay vs. typical delay left on table
  – Timing speculation overhead is 0 – 15% ET2
• GPGPU-Sim (version 2.1)
  – Cycle-based extendable GP-GPU simulator from UBC
    • Developer-recommended parameters
• Extended to DPSP
  – Recovery through stall
  – Micro-barrier options
    • Explicit CTA/workgroup synchronization only (no mbarriers)
    • Implicit sync before every memory operation
• Power model based on Hong & Kim, ISCA’10
![Page 21: SIMD Lane Decoupling Improved Timing-Error Resilience](https://reader036.fdocuments.net/reader036/viewer/2022062501/5681632f550346895dd3ab51/html5/thumbnails/21.jpg)
Outline
• Setup: efficient architecture + proportional margining
• Proportional margining w/ timing speculation
• Timing speculation with SIMD
  – Problem and DPSP solution
• Methodology and modeling
• Evaluation
  – Design-space exploration
  – Architecture effects
![Page 22: SIMD Lane Decoupling Improved Timing-Error Resilience](https://reader036.fdocuments.net/reader036/viewer/2022062501/5681632f550346895dd3ab51/html5/thumbnails/22.jpg)
ET2 vs. SIMD (no spec.)
[Figure: relative ET2 of DPSP plotted over the errVmax and Slope axes; marked points: Adder [Ernst04], Mul. [Ernst04]. Relative ET2: lower elevation is better.]
![Page 23: SIMD Lane Decoupling Improved Timing-Error Resilience](https://reader036.fdocuments.net/reader036/viewer/2022062501/5681632f550346895dd3ab51/html5/thumbnails/23.jpg)
DPSP vs. SIMD (w/ spec.)
[Figure: ET2 difference (SIMD – DPSP) plotted over the errVmax and Slope axes; marked points: Adder [Ernst04], Mul. [Ernst04]. ET2 difference: higher elevation is better.]
![Page 24: SIMD Lane Decoupling Improved Timing-Error Resilience](https://reader036.fdocuments.net/reader036/viewer/2022062501/5681632f550346895dd3ab51/html5/thumbnails/24.jpg)
Bringing in architecture effects
[Figure panels: Adder, Fabricated MUL]
![Page 25: SIMD Lane Decoupling Improved Timing-Error Resilience](https://reader036.fdocuments.net/reader036/viewer/2022062501/5681632f550346895dd3ab51/html5/thumbnails/25.jpg)
Summary
• Design margins are inefficient
• Naive timing speculation with SIMD is inefficient
• DPSP enables efficient speculation in SIMD
  – Microbarriers maintain semantics when necessary
  – With GPU, frequent mbarriers help memory access
• Simple models can capture error response
  – Error rate exponential with Vdd
  – Dependent on circuit and implementation
• Design-space exploration shows potential
  – When and why timing speculation should (not) be used
  – DPSP consistently improves ET2 (10 – 45%)
  – DPSP achieves 10 – 20% better ET2 than SIMD w/ spec.
![Page 26: SIMD Lane Decoupling Improved Timing-Error Resilience](https://reader036.fdocuments.net/reader036/viewer/2022062501/5681632f550346895dd3ab51/html5/thumbnails/26.jpg)
BACKUP
![Page 27: SIMD Lane Decoupling Improved Timing-Error Resilience](https://reader036.fdocuments.net/reader036/viewer/2022062501/5681632f550346895dd3ab51/html5/thumbnails/27.jpg)
Detailed ET2 vs. Vdd behavior
[Figure panels: NN, AES, BFS, MUM]
![Page 28: SIMD Lane Decoupling Improved Timing-Error Resilience](https://reader036.fdocuments.net/reader036/viewer/2022062501/5681632f550346895dd3ab51/html5/thumbnails/28.jpg)
Frequent micro-barriers improve ET2
[Figure panels: Adder, Multiplier, Fab.]
![Page 29: SIMD Lane Decoupling Improved Timing-Error Resilience](https://reader036.fdocuments.net/reader036/viewer/2022062501/5681632f550346895dd3ab51/html5/thumbnails/29.jpg)
Modeling the error rate function
[Figure: error-rate model annotated with its two parameters, errVmax and Slope; curves: Adder [Ernst04], Mul. [Ernst04]]
![Page 30: SIMD Lane Decoupling Improved Timing-Error Resilience](https://reader036.fdocuments.net/reader036/viewer/2022062501/5681632f550346895dd3ab51/html5/thumbnails/30.jpg)
Proportional margining
• Static margin control
  – Binning
  – Vdd/frequency/biasing adjustment
• Dynamic margin control
  – Vdd/frequency/biasing for slowly varying effects
    • Temperature and aging
  – Clocking tricks
    • From GALS to dynamic and elastic clocking
[Figure: clock cycle divided into typical logic delay, maximum logic delay, and guard-bands for noise, wearout, process variation, clock skew and jitter, and other; x-axis: time]
![Page 31: SIMD Lane Decoupling Improved Timing-Error Resilience](https://reader036.fdocuments.net/reader036/viewer/2022062501/5681632f550346895dd3ab51/html5/thumbnails/31.jpg)
Detailed results summary
• BFS
  – High divergence rate
  – Requires implicit synchronizations
  – Limits DPSP opportunities
• CP, DG, RAY
  – Sensitive to memory coalescing
  – Synchronization between memory operations solves it
• MUM
  – Low SIMD occupancy limits the benefit of decoupling
• WP
  – Not enough registers, lots of memory spills
  – Extremely sensitive to memory latency and the exact scheduling, which DPSP disturbs