Deep Learning Hardware Acceleration · Jorge Albericio+, Alberto Delmas Lascorz, Patrick Judd, Sayeh Sharify, Tayler Hetherington*, Natalie Enright Jerger, Tor Aamodt*, Andreas Moshovos
Deep Learning Hardware Acceleration
Jorge Albericio+  Alberto Delmas Lascorz
Patrick Judd  Sayeh Sharify
Tayler Hetherington*
Natalie Enright Jerger  Tor Aamodt*
Andreas Moshovos
+ now at NVIDIA
Disclaimer
The University of Toronto has filed patent applications for the mentioned technologies.
Deep Learning: Where Time Goes?
Convolutional Neural Networks: e.g., image classification
Time: ~60%-90% goes to inner products of 100s-1000s of multiply-accumulate terms.
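The inner products where this time goes can be sketched in a few lines; a minimal illustration with made-up names and values:

```python
def conv_inner_product(window, weights):
    """One output neuron of a convolutional layer: an inner
    product of 100s-1000s of activation x weight terms."""
    acc = 0
    for a, w in zip(window, weights):
        acc += a * w  # one multiply-accumulate (MAC) per term
    return acc

# Even a small 3x3x64 filter needs 576 MACs for a single output.
out = conv_inner_product([1, 2, 3], [4, 5, 6])  # 1*4 + 2*5 + 3*6 = 32
```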
SIMD: Exploit Computation Structure
DaDianNao: 4K terms/cycle
[Figure: 16 SIMD lanes (0-15) of multipliers, replicated x16]
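A minimal functional sketch of one cycle of such a 16-lane SIMD unit; the x16 tile replication and the memory system are omitted, and all names are illustrative:

```python
def simd_mac(acts, weights, accs):
    """One cycle of a 16-lane SIMD unit (DaDianNao-style):
    each lane multiplies one activation by one weight and
    accumulates into its own partial sum."""
    assert len(acts) == len(weights) == len(accs) == 16
    return [acc + a * w for acc, a, w in zip(accs, acts, weights)]

accs = [0] * 16
accs = simd_mac(list(range(16)), [2] * 16, accs)  # lane i now holds 2*i
```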
Our Approach
Improve by exploiting value properties, while maintaining:
• Massive parallelism
• SIMD lanes
• Wide memory accesses
• No modifications to the networks
Longer Term Goal
One Architecture to Rule Them All
Value Properties to Exploit?
0.0…0a x A
Our Results: Performance
[Bar chart: speedup over DaDianNao (scale 0x-4x) for CNVLUTIN (ISCA'16), STRIPES (MICRO'16), and the TARTAN and PRAGMATIC engines (arxiv), at 100% and 99% accuracy; reported bars include 1.5x, 1.9x, 1.60x, 2.08x, and 3.1x]
vs. DaDianNao, which was ~300x over GPUs
Our Results: Memory Footprint and Bandwidth
• Proteus: 44% less memory bandwidth and footprint
Cnvlutin: ISCA'16
Many ineffectual multiplications: terms with a zero operand (x × 0) contribute nothing to the inner product.
Many Activations and Weights are Intrinsically Ineffectual (zero)
• Zero activations:
  • Calculated at runtime: 45% of runtime activation values are zero
  • The percentage is stable across inputs, but no single activation is always zero
• Zero weights:
  • Known in advance
  • Not pursued in this work
Cnvlutin
Many more ineffectual multiplications.
Cnvlutin
Beating fast and "dumb" SIMD is hard.
• On-the-fly ineffectual-product elimination
• Performance + energy gains
• Optional: trade some accuracy loss for more performance
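The on-the-fly elimination can be sketched functionally: encode each non-zero activation together with its offset so the matching weight can still be located, then multiply only those pairs. A minimal illustration of the idea, not the hardware dataflow:

```python
def dense_inner_product(acts, weights):
    """Baseline: every term is computed, zero or not."""
    return sum(a * w for a, w in zip(acts, weights))

def cnvlutin_style(acts, weights):
    """Skip ineffectual terms: keep (offset, value) pairs for the
    non-zero activations and multiply only those against the
    corresponding weights."""
    encoded = [(i, a) for i, a in enumerate(acts) if a != 0]
    work = len(encoded)  # multiplications actually performed
    total = sum(a * weights[i] for i, a in encoded)
    return total, work

acts = [0, 3, 0, 0, 5, 0, 0, 2]  # mostly zeros, as the talk reports
total, work = cnvlutin_style(acts, [1] * 8)
# Same result as the dense loop, but only 3 of 8 multiplications done.
```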
Cnvlutin
With no accuracy loss:
• +52% performance
• -7% power
• +5% area
Relaxing the ineffectual criterion gives better performance (60%), and even more with some accuracy loss.
Deep Learning: Convolutional Neural Networks
[Figure: an image flows through 10s-100 layers to a classification, e.g., "Swedish Meatballs, maybe"]
Naïve Solution: No Wide Memory Accesses
• 16 independent narrow activation streams (Lane 0 … Lane 15)
Removing Zeroes: At the Output of Each Layer
[Figure: activations flowing from Layer i to Layer i+1 are encoded on the fly before being written to the Neuron Memory]
Cnvlutin: No Accuracy Loss
[Performance chart; higher is better]
Loosening the Ineffectual Neuron Criterion
• Treat neurons close to zero as zero
Open questions: are these thresholds robust? How do we find the best ones?
Another Property of CNNs
Is the operand precision required fixed? 16 bits?
CNNs: Precision Requirements Vary
The operand precision required is not fixed: it varies from 5 bits to 13 bits, rather than a uniform 16 bits.
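The varying precision requirement can be checked with a short sketch; names are illustrative, and signed two's-complement integers are assumed:

```python
def fits_in(value, p):
    """Does a signed integer fit in p bits (two's complement)?"""
    return -(1 << (p - 1)) <= value < (1 << (p - 1))

def min_precision(values):
    """Smallest p such that every value of a layer fits in p bits."""
    p = 1
    while not all(fits_in(v, p) for v in values):
        p += 1
    return p

# A layer whose values span [-9, 12] needs only 5 bits, not 16.
p = min_precision([-9, 0, 3, 12])
```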
Stripes
Process data bit-serially over p bits: execution time scales with p, so performance scales as 16/p relative to the 16-bit baseline.
Performance + energy efficiency + an accuracy knob.
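The 16/p relationship falls directly out of bit-serial arithmetic; a sketch of one activation x weight product processed one activation bit per cycle (unsigned values for simplicity):

```python
def bit_serial_product(act, weight, p):
    """Multiply a p-bit unsigned activation by a weight one
    activation bit per cycle, Stripes-style: cycle b ANDs the
    weight with bit b and adds it, shifted, into the accumulator."""
    acc = 0
    for b in range(p):  # p cycles instead of a fixed 16
        bit = (act >> b) & 1
        acc += (weight * bit) << b
    return acc

# 5-bit data finishes in 5 cycles: a 16/5 = 3.2x potential speedup.
result = bit_serial_product(13, 7, p=5)  # 13 * 7 = 91
```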
Stripes: Key Concept
• Devil in the details: carefully choose what to serialize and what to reuse, keeping the same input wires as the baseline
• Terms/step example: 2 of 2b x 2b, vs. 2 of 1b x 2b, vs. 4 of 1b x 2b
SIMD: Exploit Computation Structure (recap)
DaDianNao: 4K terms/cycle
[Figure: 16 SIMD lanes (0-15) of multipliers, replicated x16]
Stripes Bit-Serial Engine
[Figure: each lane multiplies a 1-bit activation input by a 16-bit weight; 16 lanes (0-15) per unit, units 0 through 255]
Compensating for Bit-Serial's Compute Bandwidth Loss
• Each tile:
  • 16 windows concurrently, 16 neurons each
  • 16 filters
  • 16 partial output neurons
Stripes
With no accuracy loss:
• +192% performance*
• -57% energy
• +32% area
More performance is possible with accuracy loss.
* Without the older networks LeNet and Convnet.
Stripes: Performance Boost
[Performance chart; higher is better]
Fully-Connected Layers?
• Each tile:
  • No weight reuse
  • Cannot have 16 windows
Fully-Connected Layers
• No weight reuse
• Cannot have 16 windows
[Figure: input neurons connected to output neurons through synapses]
TARTAN: Accelerating Fully-Connected Layers
• Bit-parallel engine example:
  • V: activation, I: weight, both 2 bits
Bit-Parallel Engine: Processing One Activation x Weight
• Cycle 1: activation a1 and weight W
Bit-Parallel Engine: Processing Another Pair
• Cycle 2: activation a2 and weight W
• Computes a1 x W + a2 x W over two cycles
TARTAN Engine
• 2 x 1b activation inputs
• 2b or 2 x 1b weight inputs
[Figure: activations enter serially along one dimension, weights along the other]
TARTAN: Convolutional Layer Processing
• Cycle 1: load the 2b weight into the BRs
TARTAN: Weight x 1st Bit of Two Activations
• Cycle 2: multiply W with bit 1 of activations a1 and a2
TARTAN: Weight x 2nd Bit of Two Activations
• Cycle 3: multiply W with the 2nd bit of a1 and a2
• Load the new W' into the BR
• A 3-stage pipeline computes two products: 2b activation x 2b weight
TARTAN: Fully-Connected Layers: Loading Weights
• What is different? Weights cannot be reused.
• Cycle 1: load bit 1 of two different weights into the ARs
TARTAN: Fully-Connected Layers: Loading Weights
• Cycle 2: load bit 2 of w1 and w2 into the ARs
• A different weight has now been loaded into each unit
TARTAN: Fully-Connected Layers: Processing Activations
• Cycle 3: move the AR into the BR and proceed as before over two cycles
• A 5-stage pipeline computes two of (2b activation x 2b weight)
TARTAN: Result Summary
• Bit-serial TARTAN:
  • 2.04x faster than DaDianNao
  • 1.25x more energy efficient at the same frequency
  • 1.5x area overhead
• 2-bit-at-a-time TARTAN:
  • 1.6x faster than DaDianNao
  • Roughly the same energy efficiency
  • 1.25x area overhead
Bit-Pragmatic Engine
Operand information content varies.
Inner Products
• We want to compute A x B
• Look at A: which bits really matter?
Zero Bit Content: 16-bit Fixed-Point
• Only 8% of bits are non-zero once precision is reduced
• 10%-15% otherwise
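The zero-bit statistic is easy to reproduce on any set of values; a small sketch assuming unsigned 16-bit containers and illustrative data:

```python
def nonzero_bit_fraction(values, bits=16):
    """Fraction of bits that are 1 across a set of fixed-point
    values: the 'essential' work for an engine that processes
    only non-zero bits."""
    ones = sum(bin(v & ((1 << bits) - 1)).count("1") for v in values)
    return ones / (bits * len(values))

# Small-magnitude values leave most of their 16 bits at zero.
frac = nonzero_bit_fraction([3, 12, 1, 40], bits=16)  # 7 one-bits / 64
```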
Zero Bit Content: 8-bit Quantized (TensorFlow-like)
• Only 27% of bits are non-zero
Pragmatic Concept: Use Shift-and-Add
• Simply modify Stripes?
• Too large, and requires cross-lane synchronization
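Shift-and-add over only the non-zero activation bits can be sketched as follows; the cycle count then tracks the number of 1 bits rather than the full precision (unsigned values, illustrative names):

```python
def one_offsets(act):
    """Positions of the non-zero bits of an activation
    (its essential bits)."""
    return [b for b in range(act.bit_length()) if (act >> b) & 1]

def pragmatic_product(act, weight):
    """Shift-and-add only over the non-zero activation bits:
    cycles taken = popcount(act), not the precision."""
    acc = 0
    cycles = 0
    for b in one_offsets(act):
        acc += weight << b  # a shifter and an adder, no multiplier
        cycles += 1
    return acc, cycles

acc, cycles = pragmatic_product(0b10001, 7)  # only two essential bits
```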
Bit-Parallel Engine
[Figure: 16 lanes (0-15) of 16b x 16b multipliers]
STRIPES (recap)
[Figure: 1-bit serial activation inputs x 16-bit weights; 16 lanes (0-15) per unit, units 0 through 255]
Pragmatic: Naive STRIPES Extension? Problem #1: Too Large
[Figure: giving every 1-bit lane a full shifter (>>) widens each adder input from 16 to 32 bits across all 256 units: too BIG]
Solution to #1: 2-Stage Shifting
• Process bit offsets in groups with a maximum difference of N
• Example with N = 4
• Some opportunity loss, but much lower area overhead
• Can skip groups that are all zeroes
[Figure: short per-lane shifters plus a shared second-stage shifter keep adder inputs at 20 bits]
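One plausible way to express the grouping: collect the bit offsets into windows of width N, so a cheap per-lane shifter handles the offset within a window and a shared second-stage shifter applies the window base. A sketch of the scheduling idea, not the exact hardware:

```python
def two_stage_schedule(offsets, n=4):
    """Group bit offsets so that within each group all offsets fall
    in a window of width n. Returns (window_base, relative_offsets)
    pairs; windows with no offsets are simply never emitted."""
    groups = []
    offs = sorted(offsets)
    i = 0
    while i < len(offs):
        base = offs[i]  # next window starts at the smallest pending offset
        group = [o - base for o in offs if base <= o < base + n]
        groups.append((base, group))
        i += len(group)
    return groups

# Offsets {0, 2, 9} with n = 4: one window at base 0 covers {0, 2},
# a second window at base 9 covers {9}.
schedule = two_stage_schedule([9, 0, 2], n=4)
```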
Lane Synchronization
• Lanes process different numbers of 1 bits and go out of sync
• May have to fetch up to 256 different activations from NM (the neuron memory)
• Keep lanes synchronized:
  • No cost: synchronize all lanes
  • Extra register per column: some cost, better performance
Bit-Pragmatic
With no accuracy loss:
• +310% performance
• -48% energy
• +45% area
Better still with 8-bit quantized nets.
Processing Only the Essential Information
• 8-bit quantized representation
[Chart: Stripes 8b vs. Pragmatic]
Bit-Pragmatic
• Better encoding is possible and improves performance
Proteus
• Operand precision required varies
• Proteus: store values in reduced precision in memory
• Less bandwidth, less energy
Conventional Format: Base Precision
Data physically aligns with unit inputs.
Conventional Format: Base Precision
Packed data would need a shuffling network to route synapses: any of 4K input bits to any of 4K output bit positions, i.e., a 4K x 4K crossbar.
Proteus' Key Idea: Pack Along Data Lane Columns
• Local shufflers: 16b input to 16b output
• Much simpler than a global crossbar
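The packing idea can be sketched as plain bit-packing along a column. Here p = 9 is an illustrative per-layer precision, and the ~44% figure is the reported average across networks, not a property of this particular p:

```python
def pack(values, p):
    """Pack p-bit unsigned values back-to-back into one bit-stream,
    as Proteus does along a data-lane column."""
    word = 0
    for i, v in enumerate(values):
        assert 0 <= v < (1 << p)
        word |= v << (i * p)
    return word

def unpack(word, p, count):
    """Recover the p-bit values; in hardware this is the local
    16b-to-16b shuffler's job."""
    mask = (1 << p) - 1
    return [(word >> (i * p)) & mask for i in range(count)]

vals = [5, 9, 3, 7]
packed = pack(vals, p=9)  # 36 bits instead of 4 x 16 = 64
restored = unpack(packed, 9, 4)
# At p = 9, storage drops by 1 - 9/16 = ~44%, in line with the
# reported average savings.
```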
Proteus
44% less memory bandwidth
What's Next
• Training
• Prototype
• Design space: lower-end configurations
• Unified architecture
• Dispatcher + compute
• Other workloads: computational photography
• A general-purpose compute class
Our Results: Performance (recap)
vs. DaDianNao, which is ~300x over GPUs
A Value-Based Approach to Acceleration
• More properties to discover and exploit
  • E.g., filters overlap significantly
• CNNs are one class; other networks use the same layers, but their relative importance differs
• Training