Supporting Compressed-Sparse Activations and Weights on SIMD-like Accelerator for Sparse Convolutional Neural Networks
Chien-Yu Lin and Bo-Cheng Lai
Institute of Electronics Engineering, National Chiao Tung University
Convolutional Neural Network
• CNNs now dominate visual recognition applications
• Face recognition, object detection, autonomous vehicles…
• Major components: deep convolutional layers
Figure: Network Structure of VGG-16
K. Simonyan et al., Very Deep Convolutional Networks for Large-Scale Image Recognition, ICLR 2015
http://file.scirp.org/Html/4-7800353_65406.htm
Convolutional Layer
Computations of a conv layer
• A lot of parallel multiplications and additions
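To make "a lot of parallel multiplications and additions" concrete, here is a direct convolution loop nest that also counts its multiply-accumulates (MACs). It is a minimal sketch assuming stride 1, no padding, square kernels, and nested Python lists as tensors; `conv2d`, `act`, and `w` are illustrative names, not from the paper.

```python
from typing import List, Tuple

Tensor3 = List[List[List[float]]]


def conv2d(act: Tensor3, w: List[Tensor3]) -> Tuple[Tensor3, int]:
    """Direct convolution: out[m][y][x] = sum over c, ky, kx of
    w[m][c][ky][kx] * act[c][y + ky][x + kx]. Also returns the MAC count.
    Assumes stride 1, no padding, and square K x K kernels."""
    C, H, W = len(act), len(act[0]), len(act[0][0])
    M, K = len(w), len(w[0][0])
    out_h, out_w = H - K + 1, W - K + 1
    out = [[[0.0] * out_w for _ in range(out_h)] for _ in range(M)]
    macs = 0
    for m in range(M):                  # output channels
        for y in range(out_h):          # output rows
            for x in range(out_w):      # output columns
                acc = 0.0
                for c in range(C):      # input channels
                    for ky in range(K):
                        for kx in range(K):
                            acc += w[m][c][ky][kx] * act[c][y + ky][x + kx]
                            macs += 1
                out[m][y][x] = acc
    return out, macs


# One 3x3 input channel of ones, one 2x2 filter of ones.
out, macs = conv2d([[[1.0] * 3 for _ in range(3)]], [[[[1.0] * 2 for _ in range(2)]]])
print(out, macs)  # [[[4.0, 4.0], [4.0, 4.0]]] 16
```

Apart from the final accumulation, the products in the inner loops do not depend on each other, which is what lets an accelerator spread them over many MAC units.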
CNN Acceleration with SIMD
• MAC units perform CNN computation efficiently and are thus adopted by many CNN accelerators [Google TPU, DianNao, Zhang 2015, Cambricon-X]
Figure: SIMD-like Architecture
N. P. Jouppi et al., In-Datacenter Performance Analysis of a Tensor Processing Unit, ISCA 2017
T. Chen et al., DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning, ASPLOS 2014
C. Zhang et al., Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks, FPGA 2015
S. Zhang et al., Cambricon-X: An Accelerator for Sparse Neural Networks, MICRO 2016
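A SIMD-like datapath can be modelled as N lane multipliers feeding an adder tree. The sketch below is a simplified behavioral model under that assumption (the cited designs differ in detail; the TPU, for instance, uses a systolic array); `simd_mac` is a hypothetical name.

```python
from typing import Sequence


def simd_mac(acts: Sequence[float], weights: Sequence[float], acc: float = 0.0) -> float:
    """One step of an N-lane SIMD-like MAC array: lane i multiplies acts[i] by
    weights[i], and an adder tree reduces the N products into the accumulator."""
    if len(acts) != len(weights):
        raise ValueError("every lane needs one activation and one weight")
    level = [a * w for a, w in zip(acts, weights)]  # N parallel multipliers
    while len(level) > 1:                            # log2(N) adder-tree levels
        if len(level) % 2:
            level.append(0.0)                        # pad an odd-sized level
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return acc + (level[0] if level else 0.0)


print(simd_mac([1, 2, 3, 4], [5, 6, 7, 8]))  # 70.0
```

The array only produces the right sum if lane i receives an activation and a weight that belong to the same position; that implicit alignment is what sparsity breaks.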
CNN is Sparse
• About 60% of weights and activations are zeros
• Zeros in activations are dynamically generated after ReLU
• Zeros in weights are obtained with network pruning
• Sparsity is promising for speedup (zero-skipping) and energy reduction (smaller memory footprint)
A. Parashar et al., SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks, ISCA 2017
S. Han et al., Learning both Weights and Connections for Efficient Neural Network, NIPS 2015
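The two sources of zeros can be illustrated on toy vectors. This is a sketch with made-up numbers: ReLU zeroes negative pre-activations, and a one-shot magnitude threshold stands in for pruning (the cited pruning work also retrains the network, which this omits; the threshold 0.1 is arbitrary).

```python
from typing import List


def relu(xs: List[float]) -> List[float]:
    """Activation zeros appear dynamically: every negative input becomes 0."""
    return [max(0.0, x) for x in xs]


def magnitude_prune(ws: List[float], threshold: float) -> List[float]:
    """Weight zeros come from pruning: drop weights whose magnitude is below threshold."""
    return [w if abs(w) >= threshold else 0.0 for w in ws]


def sparsity(xs: List[float]) -> float:
    """Fraction of entries that are exactly zero."""
    return sum(1 for x in xs if x == 0) / len(xs)


acts = relu([-1.0, 2.0, -0.5, 3.0, 0.0])
weights = magnitude_prune([0.05, -0.8, 0.01, 0.3, -0.02], threshold=0.1)
print(acts, sparsity(acts))        # [0.0, 2.0, 0.0, 3.0, 0.0] 0.6
print(weights, sparsity(weights))  # [0.0, -0.8, 0.0, 0.3, 0.0] 0.6
```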
Sparse CNN on SIMD?
Figure: SIMD-like Architecture
Wrong Alignment!
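The alignment problem can be reproduced numerically. The sketch below uses hypothetical values for an 8-position example with non-zero activations a2, a3, a5, a7 and non-zero weights w3, w5, w6; it drops the zeros from both operands and feeds the compressed streams lane by lane, as a plain SIMD array would.

```python
# Hypothetical values: 8 positions, 1-based names a1..a8 and w1..w8.
dense_act = [0, 2, 3, 0, 5, 0, 7, 0]    # non-zero: a2, a3, a5, a7
dense_w = [0, 0, 30, 0, 50, 60, 0, 0]   # non-zero: w3, w5, w6

correct = sum(a * w for a, w in zip(dense_act, dense_w))  # only a3*w3 + a5*w5 survive

# Compressed streams: zeros dropped, position information lost.
comp_act = [a for a in dense_act if a != 0]    # [2, 3, 5, 7]
comp_w = [w for w in dense_w if w != 0]        # [30, 50, 60]
comp_w += [0] * (len(comp_act) - len(comp_w))  # pad so every lane has an operand

# Lane by lane this pairs a2 with w3, a3 with w5, a5 with w6, and a7 with 0.
naive = sum(a * w for a, w in zip(comp_act, comp_w))

print(correct, naive)  # 340 510
```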
Sparse CNN on SIMD!
Figure: SIMD-like Architecture
A Simple Sparse Layer
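Figure: one output o1 computed from eight activation/weight positions; the non-zero weights are w3, w5, w6 and the non-zero activations are a2, a3, a5, a7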
Compressed-Sparse Data: Only Keep Non-Zeros
Figure: the same layer (non-zero weights w3, w5, w6; non-zero activations a2, a3, a5, a7; output o1)
  Stored weights:     w3 w5 w6
  Stored activations: a2 a3 a5 a7
Plus Index: Bit-Vector Recording Zero/Non-Zero
• Each operand keeps its non-zero values plus a bit-vector index over all eight positions (1 = non-zero, 0 = zero)
  Stored weights:     w3 w5 w6      Weight index:     0 0 1 0 1 1 0 0
  Stored activations: a2 a3 a5 a7   Activation index: 0 1 1 0 1 0 1 0
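The compressed-sparse format with a bit-vector index can be sketched as an encode/decode pair. The numeric values are hypothetical stand-ins for a2…a7 and w3…w6; `encode` and `decode` are illustrative names.

```python
from typing import List, Tuple


def encode(dense: List[int]) -> Tuple[List[int], List[int]]:
    """Compressed-sparse format: the non-zero values in order, plus a bit-vector
    index with 1 where the dense vector is non-zero and 0 where it is zero."""
    values = [x for x in dense if x != 0]
    index = [1 if x != 0 else 0 for x in dense]
    return values, index


def decode(values: List[int], index: List[int]) -> List[int]:
    """Rebuild the dense vector: each 1 bit consumes the next stored value."""
    it = iter(values)
    return [next(it) if bit else 0 for bit in index]


dense_act = [0, 2, 3, 0, 5, 0, 7, 0]   # hypothetical a2, a3, a5, a7
dense_w = [0, 0, 30, 0, 50, 60, 0, 0]  # hypothetical w3, w5, w6
print(encode(dense_act))  # ([2, 3, 5, 7], [0, 1, 1, 0, 1, 0, 1, 0])
print(encode(dense_w))    # ([30, 50, 60], [0, 0, 1, 0, 1, 1, 0, 0])
```

The index costs one bit per position, and the bit patterns match the slide's activation and weight indices.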
Target of the Dual Indexing Module (DIM): Find Out Effectual Pairs
• An effectual pair is a position where both the activation and the weight are non-zero; in this layer only (a3, w3) and (a5, w5) contribute to o1
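A brute-force search over the dense (uncompressed) vectors shows which pairs matter. This sketch uses the same hypothetical values as before; DIM's job is to find the same pairs directly from the compressed values and bit-vector indices, without the dense vectors.

```python
from typing import List, Tuple


def effectual_pairs(dense_act: List[int], dense_w: List[int]) -> List[Tuple[int, int, int]]:
    """Return (position, activation, weight) for every 1-based position where
    both operands are non-zero; all other products are zero and can be skipped."""
    return [
        (i + 1, a, w)
        for i, (a, w) in enumerate(zip(dense_act, dense_w))
        if a != 0 and w != 0
    ]


dense_act = [0, 2, 3, 0, 5, 0, 7, 0]   # hypothetical a2, a3, a5, a7
dense_w = [0, 0, 30, 0, 50, 60, 0, 0]  # hypothetical w3, w5, w6
pairs = effectual_pairs(dense_act, dense_w)
print(pairs)                            # [(3, 3, 30), (5, 5, 50)]
print(sum(a * w for _, a, w in pairs))  # 340, same as the dense dot product
```

Two multiplications suffice here instead of eight.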
Dual Indexing Module: Step 1
• AND the activation index with the weight index to obtain the co-activated index
  Activation index:   0 1 1 0 1 0 1 0
  Weight index:       0 0 1 0 1 1 0 0
  Co-activated index: 0 0 1 0 1 0 0 0
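In hardware, Step 1 is a row of AND gates. The sketch below models it by packing each index into an integer; writing the indices as bit strings (position 1 on the left, as on the slide) is an assumption for readability, and `co_activated` is a hypothetical name.

```python
def co_activated(act_index: str, weight_index: str) -> str:
    """Step 1: bitwise AND of the two bit-vector indices. Bit strings are written
    left to right starting from position 1, matching the slide."""
    if len(act_index) != len(weight_index):
        raise ValueError("indices must cover the same positions")
    anded = int(act_index, 2) & int(weight_index, 2)
    return format(anded, f"0{len(act_index)}b")  # keep leading zeros


print(co_activated("01101010", "00101100"))  # 00101000
```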
Dual Indexing Module: Step 2
• Compute the prefix sum (running count of ones) of each index
  Activation index:            0 1 1 0 1 0 1 0
  Activation index prefix sum: 0 1 2 2 3 3 4 4
  Weight index:                0 0 1 0 1 1 0 0
  Weight index prefix sum:     0 0 1 1 2 3 3 3
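Step 2 is an inclusive prefix sum over each bit-vector. The slides do not say how the adder is built, so the parallel version below is an assumption: a Hillis-Steele scan, which takes log2(N) levels (3 for eight positions); the serial `accumulate` version is included as a reference.

```python
from itertools import accumulate
from typing import List


def prefix_sum_serial(bits: List[int]) -> List[int]:
    """Step 2 reference: inclusive running count of ones."""
    return list(accumulate(bits))


def prefix_sum_parallel(bits: List[int]) -> List[int]:
    """Same result with a Hillis-Steele scan: log2(N) levels, each level
    adding the value `step` positions to the left (one adder per position)."""
    out = list(bits)
    step = 1
    while step < len(out):
        out = [out[i] + (out[i - step] if i >= step else 0) for i in range(len(out))]
        step *= 2
    return out


act_index = [0, 1, 1, 0, 1, 0, 1, 0]
weight_index = [0, 0, 1, 0, 1, 1, 0, 0]
print(prefix_sum_parallel(act_index))     # [0, 1, 2, 2, 3, 3, 4, 4]
print(prefix_sum_parallel(weight_index))  # [0, 0, 1, 1, 2, 3, 3, 3]
```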
Dual Indexing Module: Step 3
• AND each prefix sum with the co-activated index (0 0 1 0 1 0 0 0) to keep the counts only at co-activated positions
  Masked activation prefix sum: 0 0 2 0 3 0 0 0
  Masked weight prefix sum:     0 0 1 0 2 0 0 0
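Step 3 can be sketched as a masked copy of each prefix sum. Why this works: at a position whose bit is 1, the inclusive running count equals the number of non-zeros up to and including that position, which is exactly that value's 1-based slot in the stored array. `mask_prefix` is a hypothetical name.

```python
from typing import List, Tuple


def mask_prefix(prefix: List[int], co_activated: List[int]) -> Tuple[List[int], List[int]]:
    """Step 3: keep each running count only where the co-activated bit is 1.
    Returns the masked vector and the surviving counts in order."""
    masked = [p if c else 0 for p, c in zip(prefix, co_activated)]
    selects = [p for p, c in zip(prefix, co_activated) if c]
    return masked, selects


co = [0, 0, 1, 0, 1, 0, 0, 0]
print(mask_prefix([0, 1, 2, 2, 3, 3, 4, 4], co))  # ([0, 0, 2, 0, 3, 0, 0, 0], [2, 3])
print(mask_prefix([0, 0, 1, 1, 2, 3, 3, 3], co))  # ([0, 0, 1, 0, 2, 0, 0, 0], [1, 2])
```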
Dual Indexing Module: Step 4
• The masked counts drive multiplexers (MUX) that pick the effectual values out of the stored arrays; a count of k selects the k-th stored value
  Effectual activations: a3, a5 (counts 2 and 3 into stored a2 a3 a5 a7)
  Effectual weights:     w3, w5 (counts 1 and 2 into stored w3 w5 w6)
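Step 4 is a gather: each surviving count acts as the select signal of a multiplexer over the stored (compressed) array. This sketch models the MUX as list indexing with 1-based counts; `mux_gather` is a hypothetical name.

```python
from typing import List, Sequence


def mux_gather(stored: Sequence[str], selects: List[int]) -> List[str]:
    """Step 4: each select value k drives a multiplexer over the stored
    (compressed) array; counts are 1-based, so k picks stored[k - 1]."""
    return [stored[k - 1] for k in selects]


print(mux_gather(["a2", "a3", "a5", "a7"], [2, 3]))  # ['a3', 'a5']
print(mux_gather(["w3", "w5", "w6"], [1, 2]))        # ['w3', 'w5']
```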
Alignment Issue Solved!
• The effectual activations (a3, a5) and effectual weights (w3, w5) now arrive in matching order, so o1 = a3·w3 + a5·w5
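Putting the four steps together gives a behavioral model that can be checked against the dense dot product on random sparse vectors. It is a sketch under our own assumptions (1-based inclusive counts, Python lists standing in for bit-vectors and buffers), not the authors' hardware; `dim`, `encode`, and `sparse_dot` are hypothetical names.

```python
import random
from itertools import accumulate
from typing import List, Tuple


def dim(act_vals: List[int], act_idx: List[int],
        w_vals: List[int], w_idx: List[int]) -> Tuple[List[int], List[int]]:
    """Behavioral sketch of the four DIM steps on compressed-sparse inputs."""
    co = [a & w for a, w in zip(act_idx, w_idx)]      # Step 1: AND the indices
    act_ps = list(accumulate(act_idx))                 # Step 2: prefix sums
    w_ps = list(accumulate(w_idx))
    act_sel = [p for p, c in zip(act_ps, co) if c]     # Step 3: mask
    w_sel = [p for p, c in zip(w_ps, co) if c]
    return ([act_vals[k - 1] for k in act_sel],        # Step 4: MUX select
            [w_vals[k - 1] for k in w_sel])


def encode(dense: List[int]) -> Tuple[List[int], List[int]]:
    """Non-zero values plus bit-vector index."""
    return [x for x in dense if x != 0], [1 if x != 0 else 0 for x in dense]


def sparse_dot(dense_act: List[int], dense_w: List[int]) -> int:
    acts, weights = dim(*encode(dense_act), *encode(dense_w))
    return sum(a * w for a, w in zip(acts, weights))  # aligned pairs only


# Check against the dense dot product on random vectors with roughly 60% zeros.
rng = random.Random(0)
for _ in range(1000):
    n = rng.randint(1, 16)
    a = [rng.randint(1, 9) if rng.random() < 0.4 else 0 for _ in range(n)]
    w = [rng.randint(-9, 9) if rng.random() < 0.4 else 0 for _ in range(n)]
    assert sparse_dot(a, w) == sum(x * y for x, y in zip(a, w))

print(sparse_dot([0, 2, 3, 0, 5, 0, 7, 0], [0, 0, 30, 0, 50, 60, 0, 0]))  # 340
```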
Accelerator Design
• Extended from Cambricon-X [MICRO 2016]
S. Zhang et al., Cambricon-X: An Accelerator for Sparse Neural Networks, MICRO 2016
Figure: system block diagram with off-chip memory (DRAM), DMA, controller, encoder, buffers AB-In and AB-Out, and an array of PEs
Accelerator Design
• Plug DIM into each PE
Figure: system block diagram in which each PE receives non-zero activations and their index, and contains a WB, a DIM, and a MAC unit
Accelerator Design
• Encode output activations on-the-fly
• The encoder tests each uncompressed output activation for zero (= 0?), produces the bit-vector index, and writes only the non-zero activations, with their index, to AB-Out
  Uncompressed act: 0 a2 0 a4 a5 0 a7 0
  Index:            0 1 0 1 1 0 1 0
  Non-zero act:     a2 a4 a5 a7
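The on-the-fly encoder can be modelled as a loop that consumes output activations one at a time. This is a sketch: the two Python lists stand in for the AB-Out storage, the numbers are hypothetical values for 0 a2 0 a4 a5 0 a7 0, and `stream_encode` is an illustrative name.

```python
from typing import Iterable, List, Tuple


def stream_encode(outputs: Iterable[int]) -> Tuple[List[int], List[int]]:
    """Encode output activations as they leave the PEs, one per step:
    a zero comparator sets the index bit, and only non-zeros are written to
    the output buffer (modelled here as two Python lists)."""
    ab_out_values: List[int] = []
    ab_out_index: List[int] = []
    for value in outputs:
        is_nonzero = value != 0          # the "= 0?" comparator
        ab_out_index.append(int(is_nonzero))
        if is_nonzero:
            ab_out_values.append(value)  # write only the non-zero activation
    return ab_out_values, ab_out_index


print(stream_encode(iter([0, 2, 0, 4, 5, 0, 7, 0])))  # ([2, 4, 5, 7], [0, 1, 0, 1, 1, 0, 1, 0])
```

In this model the next layer's inputs come out already compressed, so the dense output vector never has to be stored.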
Evaluation Methodology
• Logic: synthesis with TSMC 40nm
• SRAM and DRAM: CACTI
• Benchmark: open Sparse-AlexNet model + ImageNet data
• Experiments: in-house performance simulator
N. Muralimanohar et al., CACTI 6.0: A Tool to Model Large Caches, HP Laboratories, 2009
Sparse AlexNet, https://github.com/songhan/Deep-Compression-AlexNet
J. Deng et al., ImageNet: A Large-Scale Hierarchical Image Database, CVPR 2009
Accelerator Variants
Acc    Act     Weights  Index  Encoder  Area (mm²)  Power (mW)
DAW    Dense   Dense    N/A    N/A      2.05        395
SpA    Sparse  Dense    IM     ✔        2.15        428
SpW    Dense   Sparse   IM     N/A      2.23        441
SpAW   Sparse  Sparse   DIM    ✔        2.34        472
• Overheads of SpAW relative to DAW: 14.4% in area and 19.5% in power
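The overheads can be recomputed from the table. Using the rounded table entries, power comes out at the stated 19.5%, while area comes out at about 14.1% rather than 14.4%, presumably because the areas in the table are rounded to two decimals. The dictionary below simply copies the table; `overheads` is a hypothetical helper.

```python
from typing import Dict, Tuple

# Area (mm^2) and power (mW) copied from the Accelerator Variants table.
variants: Dict[str, Tuple[float, float]] = {
    "DAW": (2.05, 395),
    "SpA": (2.15, 428),
    "SpW": (2.23, 441),
    "SpAW": (2.34, 472),
}


def overheads(name: str, baseline: str = "DAW") -> Tuple[float, float]:
    """Percentage area and power overhead of `name` relative to `baseline`."""
    area, power = variants[name]
    base_area, base_power = variants[baseline]
    return (round((area / base_area - 1) * 100, 1),
            round((power / base_power - 1) * 100, 1))


for name in variants:
    print(name, overheads(name))
# DAW (0.0, 0.0)
# SpA (4.9, 8.4)
# SpW (8.8, 11.6)
# SpAW (14.1, 19.5)
```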
DRAM Access Volume
• SpAW needs 47.3% less DRAM access volume than DAW
Figure: normalized DRAM access volume per accelerator (DAW, SpA, SpW, SpAW), split into DRAM-Act and DRAM-Wei
Energy Consumption
• SpAW consumes 46% less energy than DAW
Figure: normalized energy consumption per accelerator (DAW, SpA, SpW, SpAW), broken down into FU, ABs, WBs, DRAM-Act, and DRAM-Wei
Energy-Delay Product
• SpAW reduces EDP by 55.4% compared to DAW
Figure: normalized EDP per accelerator (DAW, SpA, SpW, SpAW)
Summary
• A SIMD-like accelerator suffers from an alignment issue when performing sparse CNN
• We propose a novel Dual Indexing Module (DIM) to handle the alignment issue efficiently
• By keeping data in a compressed-sparse format, a CNN accelerator with DIM reduces DRAM access volume, energy consumption, and EDP by 47.3%, 46%, and 55.4%, respectively
Thank You!
Additional Materials
Design Parameters of SpAW
Related Work - Cnvlutin
• Decouples neuron lanes to perform zero-skipping on neurons
Figure: CNV unit; Zero-Free Neuron Array Format (ZFNAF)
J. Albericio et al., CNVLUTIN: Ineffectual-neuron-free DNN computing, ISCA 2016
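For comparison, the idea of storing each non-zero neuron together with its offset can be sketched as below. This is a heavily simplified reading of ZFNAF, not its actual layout (the real format organizes neurons into per-lane bricks). Because only activation zeros are skipped and the offset fetches a weight from a dense weight array, the running example (hypothetical values) needs 4 multiplications instead of DIM's 2.

```python
from typing import List, Tuple


def zero_free(dense: List[int]) -> List[Tuple[int, int]]:
    """Simplified zero-free format: each non-zero neuron is stored together
    with its offset (0-based position) in the original dense array."""
    return [(value, offset) for offset, value in enumerate(dense) if value != 0]


def dot_activation_skipping(dense_act: List[int], dense_w: List[int]) -> Tuple[int, int]:
    """Skip zero activations only: the stored offset fetches the matching
    weight from a dense weight array. Returns (result, multiplications)."""
    pairs = zero_free(dense_act)
    return sum(value * dense_w[offset] for value, offset in pairs), len(pairs)


print(dot_activation_skipping([0, 2, 3, 0, 5, 0, 7, 0], [0, 0, 30, 0, 50, 60, 0, 0]))  # (340, 4)
```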
Related Work - Cambricon-X
• Utilizes weight sparsity with step indexing (a compressed-sparse format)
Figure: Step Indexing Module
S. Zhang et al., Cambricon-X: An Accelerator for Sparse Neural Networks, MICRO 2016
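Step indexing can be sketched as storing, for each non-zero weight, its distance from the previous non-zero weight; accumulating the steps recovers positions for fetching activations from a dense array. The convention here (first step measured from position 0, unbounded step width) is an assumption for illustration, not Cambricon-X's exact encoding. Since only weight zeros are skipped, the running example (hypothetical values) needs 3 multiplications, one of them with a zero activation.

```python
from typing import List, Tuple


def step_encode(dense_w: List[int]) -> Tuple[List[int], List[int]]:
    """Simplified step indexing: store non-zero weights plus, for each, the
    distance from the previous non-zero weight (the first from position 0)."""
    values: List[int] = []
    steps: List[int] = []
    prev = 0
    for pos, w in enumerate(dense_w):
        if w != 0:
            values.append(w)
            steps.append(pos - prev)
            prev = pos
    return values, steps


def dot_weight_skipping(dense_act: List[int], w_values: List[int],
                        steps: List[int]) -> Tuple[int, int]:
    """Accumulate the steps to recover each weight's position and fetch the
    matching activation from a dense activation array. Returns (result, multiplications)."""
    total, pos = 0, 0
    for w, step in zip(w_values, steps):
        pos += step
        total += dense_act[pos] * w
    return total, len(w_values)


w_values, steps = step_encode([0, 0, 30, 0, 50, 60, 0, 0])
print(w_values, steps)                                                  # [30, 50, 60] [2, 2, 1]
print(dot_weight_skipping([0, 2, 3, 0, 5, 0, 7, 0], w_values, steps))  # (340, 3)
```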