Deep Compression and EIE: Efficient Inference...
Transcript of Deep Compression and EIE: Efficient Inference...
Deep Compression and EIE: Efficient Inference Engine on Compressed Deep Neural Network
Song Han*, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark Horowitz, Bill Dally
Stanford University
Network Original Size
Compressed Size
Compression Ratio
Original Accuracy
Compressed Accuracy
AlexNet 240MB 6.9MB 35x 80.27% 80.30%
VGGNet 550MB 11.3MB 49x 88.68% 89.09%
GoogleNet 28MB 2.8MB 10x 88.90% 88.92%
SqueezeNet 4.8MB 0.47MB 10x 80.32% 80.35%
Our Prior Work: Deep Compression• Memory reference is expensive.• Small DNN models are critical.
pruning weight sharing
Published as a conference paper at ICLR 2016
Figure 2: Representing the matrix sparsity with relative index. Padding filler zero to prevent overflow.
2.09 -0.98 1.48 0.09
0.05 -0.14 -1.08 2.12
-0.91 1.92 0 -1.03
1.87 0 1.53 1.49
cluster
(32 bit float) centroids
3 0 2 1
1 1 0 3
0 3 1 0
3 1 2 2
(2 bit uint)
2.00
1.50
0.00
-1.00
1:
0:
2:
3:
Figure 3: Weight sharing by scalar quantization (top) and centroids fine-tuning (bottom).
We store the sparse structure that results from pruning using compressed sparse row (CSR) orcompressed sparse column (CSC) format, which requires 2a+n+1 numbers, where a is the numberof non-zero elements and n is the number of rows or columns.
To compress further, we store the index difference instead of the absolute position, and encode thisdifference in 8 bits for conv layer and 5 bits for fc layer. When we need an index difference largerthan the bound, we the zero padding solution shown in Figure 2: in case when the difference exceeds8, the largest 3-bit (as an example) unsigned number, we add a filler zero.
3 TRAINED QUANTIZATION AND WEIGHT SHARING
Network quantization and weight sharing further compresses the pruned network by reducing thenumber of bits required to represent each weight. We limit the number of effective weights we need tostore by having multiple connections share the same weight, and then fine-tune those shared weights.
Weight sharing is illustrated in Figure 3. Suppose we have a layer that has 4 input neurons and 4output neurons, the weight is a 4⇥ 4 matrix. On the top left is the 4⇥ 4 weight matrix, and on thebottom left is the 4⇥ 4 gradient matrix. The weights are quantized to 4 bins (denoted with 4 colors),all the weights in the same bin share the same value, thus for each weight, we then need to store onlya small index into a table of shared weights. During update, all the gradients are grouped by the colorand summed together, multiplied by the learning rate and subtracted from the shared centroids fromlast iteration. For pruned AlexNet, we are able to quantize to 8-bits (256 shared weights) for eachCONV layers, and 5-bits (32 shared weights) for each FC layer without any loss of accuracy.
To calculate the compression rate, given k clusters, we only need log2(k) bits to encode the index. Ingeneral, for a network with n connections and each connection is represented with b bits, constrainingthe connections to have only k shared weights will result in a compression rate of:
r =nb
nlog2(k) + kb(1)
For example, Figure 3 shows the weights of a single layer neural network with four input units andfour output units. There are 4⇥4 = 16 weights originally but there are only 4 shared weights: similarweights are grouped together to share the same value. Originally we need to store 16 weights each
3
Train Connectivity
Prune Connections
Train Weights
Figure 2: Three-Step Training Pipeline.
pruning neurons
pruning synapses
after pruningbefore pruning
Figure 3: Synapses and neurons before and afterpruning.
3 Learning Connections in Addition to Weights
Our pruning method employs a three-step process, as illustrated in Figure 2, which begins by learningthe connectivity via normal network training. Unlike conventional training, however, we are notlearning the final values of the weights, but rather we are learning which connections are important.
The second step is to prune the low-weight connections. All connections with weights below athreshold are removed from the network — converting a dense network into a sparse network, asshown in Figure 3. The final step retrains the network to learn the final weights for the remainingsparse connections. This step is critical. If the pruned network is used without retraining, accuracy issignificantly impacted.
3.1 Regularization
Choosing the correct regularization impacts the performance of pruning and retraining. L1 regulariza-tion penalizes non-zero parameters resulting in more parameters near zero. This gives better accuracyafter pruning, but before retraining. However, the remaining connections are not as good as with L2regularization, resulting in lower accuracy after retraining. Overall, L2 regularization gives the bestpruning results. This is further discussed in experiment section.
3.2 Dropout Ratio Adjustment
Dropout [23] is widely used to prevent over-fitting, and this also applies to retraining. Duringretraining, however, the dropout ratio must be adjusted to account for the change in model capacity.In dropout, each parameter is probabilistically dropped during training, but will come back duringinference. In pruning, parameters are dropped forever after pruning and have no chance to come backduring both training and inference. As the parameters get sparse, the classifier will select the mostinformative predictors and thus have much less prediction variance, which reduces over-fitting. Aspruning already reduced model capacity, the retraining dropout ratio should be smaller.
Quantitatively, let Ci
be the number of connections in layer i, Cio
for the original network, Cir
forthe network after retraining, N
i
be the number of neurons in layer i. Since dropout works on neurons,and C
i
varies quadratically with Ni
, according to Equation 1 thus the dropout ratio after pruning theparameters should follow Equation 2, where D
o
represent the original dropout rate, Dr
represent thedropout rate during retraining.
Ci
= Ni
Ni�1 (1) D
r
= Do
rC
ir
Cio
(2)
3.3 Local Pruning and Parameter Co-adaptation
During retraining, it is better to retain the weights from the initial training phase for the connectionsthat survived pruning than it is to re-initialize the pruned layers. CNNs contain fragile co-adaptedfeatures [24]: gradient descent is able to find a good solution when the network is initially trained,but not after re-initializing some layers and retraining them. So when we retrain the pruned layers,we should keep the surviving parameters instead of re-initializing them.
3
[1]. Han et al. NIPS 2015 [2]. Han et al. ICLR 2016, best paper award
EIE: First Accelerator for Sparse DNN
Fully fits in SRAM120x less energy than DRAM
Sparse Vector70% dynamic sparsity
in the activation3x less computation
Sparse Matrix90% static sparsity
in the weights,10x less computation,
5x less memory footprint
Weight Sharing4bits weights
8x less memoryfootprint
• Deep Compression solves the model size problem.• But it creates another problem: irregular computation pattern.• CPU/GPU are only good at dense linear algebra.• So we create EIE that supports: static-sparse M, dynamic-sparse V,
indirect indexing, weight sharing.
Savings are multiplicative: 5x3x8x120=14,400 theoretical energy improvement.Dally. NIPS tutorial 2015; Han et al. ISCA 2016
4
Technology 45 nm
# PEs 64
on-chip SRAM 8 MB
Max Model Size 84 Million
Static Sparsity 10x
Dynamic Sparsity 3x
Quantization 4-bit
ALU Width 16-bit
Area 40.8 mm^2
MxV Throughput 81,967 layers/s
Power 586 mW
1. Post layout result2. Throughput measured on AlexNet FC-7
EIE: Efficient Inference Engine on Compressed Deep Neural Network
Song Han⇤ Xingyu Liu⇤ Huizi Mao⇤ Jing Pu⇤ Ardavan Pedram⇤
Mark A. Horowitz⇤ William J. Dally⇤†
⇤Stanford University, †NVIDIA{songhan,xyl,huizi,jingpu,perdavan,horowitz,dally}@stanford.edu
Abstract—State-of-the-art deep neural networks (DNNs)have hundreds of millions of connections and are both compu-tationally and memory intensive, making them difficult to de-ploy on embedded systems with limited hardware resources andpower budgets. While custom hardware helps the computation,fetching weights from DRAM is two orders of magnitude moreexpensive than ALU operations, and dominates the requiredpower.
Previously proposed ‘Deep Compression’ makes it possibleto fit large DNNs (AlexNet and VGGNet) fully in on-chipSRAM. This compression is achieved by pruning the redundantconnections and having multiple connections share the sameweight. We propose an energy efficient inference engine (EIE)that performs inference on this compressed network model andaccelerates the resulting sparse matrix-vector multiplicationwith weight sharing. Going from DRAM to SRAM gives EIE120⇥ energy saving; Exploiting sparsity saves 10⇥; Weightsharing gives 8⇥; Skipping zero activations from ReLU savesanother 3⇥. Evaluated on nine DNN benchmarks, EIE is189⇥ and 13⇥ faster when compared to CPU and GPUimplementations of the same DNN without compression. EIEhas a processing power of 102 GOPS/s working directly ona compressed network, corresponding to 3 TOPS/s on anuncompressed network, and processes FC layers of AlexNet at1.88⇥104 frames/sec with a power dissipation of only 600mW.It is 24,000⇥ and 3,400⇥ more energy efficient than a CPUand GPU respectively. Compared with DaDianNao, EIE has2.9⇥, 19⇥ and 3⇥ better throughput, energy efficiency andarea efficiency.
Keywords-Deep Learning; Model Compression; HardwareAcceleration; Algorithm-Hardware co-Design; ASIC;
I. INTRODUCTION
Neural networks have become ubiquitous in applicationsincluding computer vision [1]–[3], speech recognition [4],and natural language processing [4]. In 1998, Lecun etal. classified handwritten digits with less than 1M parame-ters [5], while in 2012, Krizhevsky et al. won the ImageNetcompetition with 60M parameters [1]. Deepface classifiedhuman faces with 120M parameters [6]. Neural Talk [7]automatically converts image to natural language with 130MCNN parameters and 100M RNN parameters. Coates etal. scaled up a network to 10 billion parameters on HPCsystems [8].
Large DNN models are very powerful but consume largeamounts of energy because the model must be stored inexternal DRAM, and fetched every time for each image,
4-bit RelativeIndex
4-bit Virtualweight
16-bitRealweight
16-bit AbsoluteIndex
EncodedWeightRelativeIndexSparseFormat
ALU
Mem
CompressedDNNModel Weight
Look-up
IndexAccum
Prediction
InputImage
Result
Figure 1. Efficient inference engine that works on the compressed deepneural network model for machine learning applications.
word, or speech sample. For embedded mobile applications,these resource demands become prohibitive. Table I showsthe energy cost of basic arithmetic and memory operationsin a 45nm CMOS process [9]. It shows that the total energyis dominated by the required memory access if there isno data reuse. The energy cost per fetch ranges from 5pJfor 32b coefficients in on-chip SRAM to 640pJ for 32bcoefficients in off-chip LPDDR2 DRAM. Large networks donot fit in on-chip storage and hence require the more costlyDRAM accesses. Running a 1G connection neural network,for example, at 20Hz would require (20Hz)(1G)(640pJ) =12.8W just for DRAM accesses, which is well beyond thepower envelope of a typical mobile device.
Previous work has used specialized hardware to accelerateDNNs [10]–[12]. However, these efforts focus on acceler-ating dense, uncompressed models - limiting their utilityto small models or to cases where the high energy costof external DRAM access can be tolerated. Without modelcompression, it is only possible to fit very small neuralnetworks, such as Lenet-5, in on-chip SRAM [12].
Efficient implementation of convolutional layers in CNNhas been intensively studied, as its data reuse and manipu-lation is quite suitable for customized hardware [10]–[15].However, it has been found that fully-connected (FC) layers,widely used in RNN and LSTMs, are bandwidth limitedon large networks [14]. Unlike CONV layers, there is noparameter reuse in FC layers. Data batching has becomean efficient solution when training networks on CPUs orGPUs, however, it is unsuitable for real-time applicationswith latency requirements.
Network compression via pruning and weight sharing[16] makes it possible to fit modern networks such asAlexNet (60M parameters, 240MB), and VGG-16 (130Mparameters, 520MB) in on-chip SRAM. Processing these
arX
iv:1
602.
0152
8v2
[cs.C
V]
3 M
ay 2
016
EIE: First Accelerator for Sparse DNN
Dally. NIPS tutorial 2015; Han et al. ISCA 2016
FC Layers: Speedup / Energy Efficiency
1x 1x 1x 1x 1x 1x 1x 1x 1x 1x2x
5x1x
9x 10x
1x2x 3x 2x 3x
14x25x
14x 24x 22x10x 9x 15x 9x 15x
56x 94x
21x
210x 135x
16x34x 33x 25x
48x
0.6x1.1x
0.5x1.0x 1.0x
0.3x 0.5x 0.5x 0.5x 0.6x
3x5x
1x
8x 9x
1x3x 2x 1x
3x
248x507x
115x
1018x 618x
92x 63x 98x 60x189x
0.1x
1x
10x
100x
1000x
Alex-6 Alex-7 Alex-8 VGG-6 VGG-7 VGG-8 NT-We NT-Wd NT-LSTM Geo Mean
Speedup
CPU Dense (Baseline) CPU Compressed GPU Dense GPU Compressed mGPU Dense mGPU Compressed EIE
Figure 6. Speedups of GPU, mobile GPU and EIE compared with CPU running uncompressed DNN model. There is no batching in all cases.
1x 1x 1x 1x 1x 1x 1x 1x 1x 1x5x
9x3x
17x 20x
2x6x 6x 4x 6x7x 12x 7x 10x 10x
5x 6x 6x 5x 7x26x 37x
10x
78x 61x
8x25x 14x 15x 23x
10x 15x7x 13x 14x
5x 8x 7x 7x 9x
37x 59x18x
101x 102x
14x39x 25x 20x 36x
34,522x 61,533x14,826x
119,797x 76,784x
11,828x 9,485x 10,904x 8,053x24,207x
1x
10x
100x
1000x
10000x
100000x
Alex-6 Alex-7 Alex-8 VGG-6 VGG-7 VGG-8 NT-We NT-Wd NT-LSTM Geo MeanEner
gy E
ffici
ency
CPU Dense (Baseline) CPU Compressed GPU Dense GPU Compressed mGPU Dense mGPU Compressed EIE
Figure 7. Energy efficiency of GPU, mobile GPU and EIE compared with CPU running uncompressed DNN model. There is no batching in all cases.
corner. We placed and routed the PE using the Synopsys ICcompiler (ICC). We used Cacti [25] to get SRAM area andenergy numbers. We annotated the toggle rate from the RTLsimulation to the gate-level netlist, which was dumped toswitching activity interchange format (SAIF), and estimatedthe power using Prime-Time PX.
Comparison Baseline. We compare EIE with three dif-ferent off-the-shelf computing units: CPU, GPU and mobileGPU.
1) CPU. We use Intel Core i-7 5930k CPU, a Haswell-Eclass processor, that has been used in NVIDIA Digits DeepLearning Dev Box as a CPU baseline. To run the benchmarkon CPU, we used MKL CBLAS GEMV to implement theoriginal dense model and MKL SPBLAS CSRMV for thecompressed sparse model. CPU socket and DRAM powerare as reported by the pcm-power utility provided by Intel.
2) GPU. We use NVIDIA GeForce GTX Titan X GPU,a state-of-the-art GPU for deep learning as our baselineusing nvidia-smi utility to report the power. To runthe benchmark, we used cuBLAS GEMV to implementthe original dense layer. For the compressed sparse layer,we stored the sparse matrix in in CSR format, and usedcuSPARSE CSRMV kernel, which is optimized for sparsematrix-vector multiplication on GPUs.
3) Mobile GPU. We use NVIDIA Tegra K1 that has192 CUDA cores as our mobile GPU baseline. We usedcuBLAS GEMV for the original dense model and cuS-PARSE CSRMV for the compressed sparse model. Tegra K1doesn’t have software interface to report power consumption,so we measured the total power consumption with a power-meter, then assumed 15% AC to DC conversion loss, 85%regulator efficiency and 15% power consumed by peripheralcomponents [26], [27] to report the AP+DRAM power forTegra K1.
Benchmarks.We compare the performance on two sets of models:
uncompressed DNN model and the compressed DNN model.
Table IIIBENCHMARK FROM STATE-OF-THE-ART DNN MODELS
Layer Size Weight% Act% FLOP% Description
Alex-6 9216, 9% 35.1% 3% Compressed4096AlexNet [1] forAlex-7 4096, 9% 35.3% 3% large scale image4096classificationAlex-8 4096, 25% 37.5% 10%1000
VGG-6 25088, 4% 18.3% 1% Compressed4096 VGG-16 [3] forVGG-7 4096, 4% 37.5% 2% large scale image4096 classification andVGG-8 4096, 23% 41.1% 9% object detection1000
NT-We 4096, 10% 100% 10% Compressed600 NeuralTalk [7]
NT-Wd 600, 11% 100% 11% with RNN and8791 LSTM for
NTLSTM 1201, 10% 100% 11% automatic2400 image captioning
The uncompressed DNN model is obtained from Caffemodel zoo [28] and NeuralTalk model zoo [7]; The com-pressed DNN model is produced as described in [16], [23].The benchmark networks have 9 layers in total obtainedfrom AlexNet, VGGNet, and NeuralTalk. We use the Image-Net dataset [29] and the Caffe [28] deep learning frameworkas golden model to verify the correctness of the hardwaredesign.
VI. EXPERIMENTAL RESULT
Figure 5 shows the layout (after place-and-route) ofan EIE processing element. The power/area breakdown isshown in Table II. We brought the critical path delay downto 1.15ns by introducing 4 pipeline stages to update oneactivation: codebook lookup and address accumulation (inparallel), output activation read and input activation multiply(in parallel), shift and add, and output activation write. Ac-tivation read and write access a local register and activationbypassing is employed to avoid a pipeline hazard. Using64 PEs running at 800MHz yields a performance of 102
Compared to CPU and GPU: 189x and 13x faster 24,000x and 3,400x more energy efficient
Beyond EIE: a Multi-Dimension Sparse Recipe for Deep Learning
6
[1]. Han et al. “Learning both Weights and Connections for Efficient Neural Networks”, NIPS 2015 [2]. Han et al. “Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding”, Deep Learning Symposium 2015, ICLR 2016 (best paper award) [3]. Han et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network”, ISCA 2016 [4]. Han et al. “DSD: Regularizing Deep Neural Networks with Dense-Sparse-Dense Training Flow”, arXiv 2016 [5]. Iandola, Han,et al. “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size”, arXiv 16 [6]. Yao, Han, et.al, “Hardware-friendly convolutional neural network with even-number filter size”, ICLR workshop 2016
Higher Accuracy: DSD regularization
Smaller Size: Deep Compression, SqueezeNet++
Faster Speed: EIE accelerator
sparsity