
Transcript of Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

Page 1: Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding (forum.stanford.edu/events/posterslides/DeepCompression..., 2016-03-01)

Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

Song Han¹, Huizi Mao², William J. Dally¹,³
¹ Stanford University  ² Tsinghua University  ³ NVIDIA

{songhan, dally}@stanford.edu [email protected]

Deep Compression is a three-stage compression pipeline: pruning, trained quantization, and Huffman coding. Pruning reduces the number of weights by about 10×; quantization further improves the compression rate to between 27× and 31×; Huffman coding pushes it to between 35× and 49×. These compression rates already include the metadata for the sparse representation. Deep Compression incurs no loss of accuracy.
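
The first stage, pruning, removes small-magnitude weights and retrains the surviving connections. Below is a minimal numpy sketch of magnitude-based pruning; the 90% sparsity target and the exact thresholding policy are illustrative, not the paper's per-layer settings.

import numpy as np

def magnitude_prune(weights, sparsity=0.9):
    # Zero out the smallest-magnitude entries so that roughly `sparsity`
    # of the weights are removed. In the real pipeline the surviving
    # weights are then retrained with the pruned connections held at zero.
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy(), np.ones(weights.shape, dtype=bool)
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask, mask

w = np.random.randn(4, 4).astype(np.float32)
pruned, mask = magnitude_prune(w, sparsity=0.9)
print("kept", int(mask.sum()), "of", mask.size, "weights")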

Matrix sparsity is represented with relative indices (the gap to the previous non-zero entry). When a gap is larger than the index bit width can represent, a filler zero is padded in to prevent overflow.
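
A toy Python encoder/decoder for this idea (my own sketch, not the paper's exact storage format): store each non-zero as a (relative index, value) pair, and emit a padded zero entry whenever the gap would overflow. Three-bit indices are used here only to make the overflow case easy to trigger; real layers use wider indices.

def relative_index_encode(dense, index_bits=3):
    # Encode a 1-D sparse array as (relative_index, value) pairs.
    # Gaps larger than the index can hold are broken up with filler zeros.
    max_gap = (1 << index_bits) - 1
    pairs, last = [], -1
    for i, v in enumerate(dense):
        if v == 0:
            continue
        gap = i - last
        while gap > max_gap:          # insert filler zero to avoid overflow
            pairs.append((max_gap, 0.0))
            last += max_gap
            gap -= max_gap
        pairs.append((gap, v))
        last = i
    return pairs

def relative_index_decode(pairs, length):
    # Reconstruct the dense array from (relative_index, value) pairs.
    out = [0.0] * length
    pos = -1
    for gap, v in pairs:
        pos += gap
        out[pos] = v
    return out

dense = [0, 0, 3.4, 0, 0, 0, 0, 0, 0, 0, 0, 1.7]
pairs = relative_index_encode(dense, index_bits=3)
assert relative_index_decode(pairs, len(dense)) == dense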

Weight sharing by scalar quantization (top) and centroid fine-tuning (bottom).
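
A compact numpy sketch of both halves of this step, assuming quantization is plain 1-D k-means over a layer's weights: cluster the weights into a small codebook of shared values, then during fine-tuning accumulate the gradients of all weights that map to the same centroid and update that single shared value. Cluster count, iteration count, and learning rate are illustrative.

import numpy as np

def kmeans_quantize(weights, n_centroids=16, n_iter=20):
    # 1-D k-means: every weight is replaced by the nearest of n_centroids
    # shared values, so only the codebook plus a small index per weight is stored.
    w = weights.ravel()
    codebook = np.linspace(w.min(), w.max(), n_centroids)  # linear init
    idx = np.zeros(w.size, dtype=np.int64)
    for _ in range(n_iter):
        idx = np.abs(w[:, None] - codebook[None, :]).argmin(axis=1)
        for c in range(n_centroids):
            members = w[idx == c]
            if members.size:
                codebook[c] = members.mean()
    return codebook, idx.reshape(weights.shape)

def centroid_finetune_step(codebook, idx, weight_grad, lr=1e-2):
    # Centroid fine-tuning: the gradients of all weights sharing a centroid
    # are summed and applied to that one shared value.
    for c in range(codebook.size):
        mask = (idx == c)
        if mask.any():
            codebook[c] -= lr * weight_grad[mask].sum()
    return codebook

w = np.random.randn(64, 64).astype(np.float32)
codebook, idx = kmeans_quantize(w, n_centroids=16)
w_shared = codebook[idx]  # the weight matrix seen at inference time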

Left: three different methods for centroid initialization. Right: distribution of weights (blue) and distribution of the codebook before (green crosses) and after fine-tuning (red dots).
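
For the left panel, a small helper sketching the three initializations as I read them (Forgy/random, density-based, and linear); the exact quantile placement for the density-based option is my own approximation.

import numpy as np

def init_centroids(weights, k=16, method="linear", seed=0):
    # Three ways to seed the codebook before running k-means:
    #   forgy   - pick k weights at random,
    #   density - place centroids at equally spaced quantiles of the weights,
    #   linear  - space centroids evenly between the min and max weight.
    # Linear init reaches into the rare large-magnitude tails, which the
    # paper reports works best once the centroids are fine-tuned.
    w = weights.ravel()
    if method == "forgy":
        return np.random.default_rng(seed).choice(w, size=k, replace=False)
    if method == "density":
        return np.quantile(w, np.linspace(0.0, 1.0, k + 2)[1:-1])
    return np.linspace(w.min(), w.max(), k)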

The weight and index distributions are biased. Huffman coding represents frequently occurring values with fewer bits and less frequent values with more bits.
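
A minimal sketch of how Huffman code lengths for the quantized-weight and index streams could be computed (standard textbook Huffman, not the paper's specific implementation):

import heapq
from collections import Counter

def huffman_code_lengths(symbols):
    # Returns {symbol: code_length_in_bits}: frequent symbols get short
    # codes, rare symbols get long codes.
    freq = Counter(symbols)
    if len(freq) == 1:
        return {next(iter(freq)): 1}
    # Heap entries: (subtree frequency, tiebreaker, {symbol: depth so far}).
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

# Example: a biased index stream compresses well below the 4 bits/symbol
# a fixed-length code would need for these 12 distinct symbols.
stream = [0] * 80 + [1] * 10 + list(range(2, 12))
lengths = huffman_code_lengths(stream)
freq = Counter(stream)
avg_bits = sum(freq[s] * lengths[s] for s in freq) / len(stream)
print("average bits/symbol:", round(avg_bits, 2))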

Compression statistics for AlexNet. P: pruning, Q: quantization, H: Huffman coding.

Compression statistics for VGG-16. P: pruning, Q: quantization, H: Huffman coding.

Accuracy vs. compression rate under different compression methods. Pruning and quantization work best when combined.

Storage ratio of weights, indices, and the codebook.

Motivation: Make DNNs Smaller

Weight Sharing and Quantization

Results

We pruned, quantized, and Huffman-encoded four networks: LeNet-300-100 and LeNet-5 on MNIST, and AlexNet and VGG-16 on ImageNet. The compression pipeline reduces network storage by 35× to 49× across these networks without loss of accuracy. The total size of AlexNet decreased from 240 MB to 6.9 MB, small enough to fit into on-chip SRAM, eliminating the need to store the model in energy-consuming DRAM.

Pruning, Relative Indexing & Huffman Coding

Demo: Pocket AlexNet