Transcript of a VALSE webinar slide deck by Jian Cheng (NLPR, CASIA)
valser.org/webinar/slide/slides/20190911/2019.9.11... · 2019-09-16
[Page 2]
The Revival of Neural Networks
Source: Russ Salakhutdinov
[Page 3]
The Revival of Neural Networks
[Page 4]
The Challenge of Compute-Intensive Models
| Year | Model | Layers | Parameters | Model Size | FLOPs | ImageNet Top-5 error |
|------|-------|--------|------------|------------|--------|----------------------|
| 2012 | AlexNet | 5+3 | 60M | 233MB | 725M | 16.4% |
| 2013 | Clarifai | 5+3 | 60M | 233MB | 1.17B | 11.7% |
| 2014 | VGG-19 | 16+3 | 143M | 548MB | 19.6B | 7.32% |
| 2014 | GoogLeNet | 22 | 6.8M | 51MB | 1.566B | 6.67% |
| 2015 | ResNet | 152 | 19.4M | 230MB | 11.3B | 3.57% |
| 2016 | Inception-V4 | 112 | 42.6M | 184MB | 12.25B | 3.08% |
| 2019 | GPT-2 | 12-72 | 117M-8.3B | 500MB-6.5GB | — | — |
[Page 5]
The Challenge of Compute-Intensive Models
• Mobile devices: cannot compute well
• Wearable devices: cannot compute at all
• Data centers: cannot afford the compute
[Page 6]
Characteristics of Neural Network Computation
(Figure: memory-access breakdown, from Eyeriss, 2016.)
[Page 7]
Solution Approach
Operators (Conv, Pooling, BN, ReLU, SoftMax, ...) → Learning to Quantize: a unified treatment of representation, learning, storage, and computing, spanning algorithms and chips.
[Page 8]
Part 1: Compressed Model Representation via Quantization Learning
[Page 9]
Representation via Quantization Learning
8-bit fixed-point
1-bit binary
Pros: good generality; effectively reduces storage.
Cons: high-bit quantization offers limited compression ratios; low-bit quantization incurs large accuracy loss.
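As a concrete illustration of the 8-bit case, a minimal symmetric fixed-point quantizer might look like the sketch below (the per-tensor scale and the function names are illustrative; production engines typically use per-channel scales and calibration data):

```python
# Minimal sketch of symmetric 8-bit fixed-point quantization of a weight
# vector. Illustrative only: real engines calibrate scales per channel.

def quantize_int8(weights):
    """Map float weights to int8 codes plus a shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-128, min(127, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float weights from int8 codes."""
    return [c * scale for c in codes]

w = [0.5, -1.27, 0.03, 1.0]
codes, s = quantize_int8(w)
w_hat = dequantize(codes, s)
# per-element reconstruction error is bounded by scale / 2
```

Storage drops 4x versus FP32 (one byte per weight plus one scale), which is the "effectively reduces storage" claim above.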
[Page 10]
Previous methods focus on minimizing the quantization error of the weights:

(1) BinaryConnect: min_B ‖W − B‖², s.t. B ∈ {−1, +1}ⁿ
(2) BWN: min_{α,B} ‖W − αB‖², s.t. α > 0, B ∈ {−1, +1}ⁿ

Here we instead minimize the quantization error of the inner-product similarity between the weights and the inputs:

min_B ‖XᵀW − XᵀB‖², s.t. B ∈ {−1, +1}ⁿ
Qinghao Hu, Peisong Wang, Jian Cheng. From Hashing to CNNs: Training Binary Weight Networks via Hashing. AAAI 2018
Hashing-Based Binary Quantization
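BWN's scaled-sign solution has a simple closed form, which can be sketched in a few lines (flattened weight list; function name is mine):

```python
def binarize_bwn(weights):
    """Closed-form minimizer of ||W - alpha * B||^2 over alpha > 0 and
    B in {-1, +1}^n: B = sign(W), alpha = mean(|W|)."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    codes = [1 if w >= 0 else -1 for w in weights]
    return alpha, codes

alpha, codes = binarize_bwn([1.0, -2.0, 3.0, -4.0])
# alpha = 2.5, codes = [1, -1, 1, -1]
```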
[Page 11]
We multiply the binary weights B by a scaling factor A:

min_{A,B} ‖XᵀW − XᵀBA‖², s.t. B ∈ {−1, +1}ⁿˣᵐ, A a positive diagonal scaling matrix

Learning the binary weights can thus be reformulated as learning a hash function that preserves inner-product similarity.
Hashing-Based Binary Quantization
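A toy sketch of the hashing view: alternately fit a scale and binary codes to minimize the inner-product quantization error. A single output channel and a scalar scale are my simplifications; the paper works with full weight matrices and per-column scales.

```python
# Alternating minimization of ||X^T w - a * X^T b||^2 over a scalar scale a
# and codes b in {-1, +1}^n (toy single-channel version, illustrative).

def matvec(X_T, v):
    return [sum(r * x for r, x in zip(row, v)) for row in X_T]

def objective(X_T, w, a, b):
    s, t = matvec(X_T, w), matvec(X_T, b)
    return sum((p - a * q) ** 2 for p, q in zip(s, t))

def binary_hash_quantize(X_T, w, iters=5):
    b = [1 if x >= 0 else -1 for x in w]          # initialize from sign(w)
    a = 1.0
    for _ in range(iters):
        t = matvec(X_T, b)                        # least-squares scale update
        denom = sum(q * q for q in t)
        if denom:
            s = matvec(X_T, w)
            a = sum(p * q for p, q in zip(s, t)) / denom
        for i in range(len(b)):                   # greedy bit flips
            before = objective(X_T, w, a, b)
            b[i] = -b[i]
            if objective(X_T, w, a, b) >= before:
                b[i] = -b[i]                      # revert if no improvement
    return a, b
```

With identity inputs the objective degenerates to plain BWN, so sign(w) with a least-squares scale is recovered; with correlated inputs the codes adapt to the data, which is the point of the inner-product formulation.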
[Page 12]
(Table: classification accuracy of AlexNet on ImageNet.)
(Table: classification accuracy of ResNet-18 on ImageNet.)
Hashing-Based Binary Quantization
[Page 13]
Ternary Fixed-Point Quantization
Peisong Wang and Jian Cheng, “Fixed-point Factorized Networks”. CVPR 2017
• Proposed Fixed-point Factorized Networks (FFN), which quantize CNN weights to {−1, 0, 1}, turning multiplications into additions and effectively speeding up network inference.
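FFN itself reaches {−1, 0, 1} through a fixed-point factorization of the weight matrix; as a simpler stand-in for the idea, the sketch below uses a threshold-based ternarization (the 0.7·mean|w| threshold follows Ternary Weight Networks, not FFN) and shows the resulting multiplication-free dot product:

```python
def ternarize(weights, delta_ratio=0.7):
    """Map weights to {-1, 0, +1} codes plus a scale (TWN-style threshold)."""
    n = len(weights)
    delta = delta_ratio * sum(abs(w) for w in weights) / n
    codes = [0 if abs(w) <= delta else (1 if w > 0 else -1) for w in weights]
    kept = [abs(w) for w, c in zip(weights, codes) if c != 0]
    alpha = sum(kept) / len(kept) if kept else 0.0
    return codes, alpha

def ternary_dot(x, codes, alpha):
    """Inner product using only additions/subtractions and one final scale."""
    acc = 0.0
    for xi, c in zip(x, codes):
        if c == 1:
            acc += xi
        elif c == -1:
            acc -= xi
    return alpha * acc
```

Each multiply-accumulate in the original convolution becomes an add or subtract (or is skipped entirely for zero codes), which is where the inference speedup comes from.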
[Page 14]
Storage: reduced 20×. Speed: multiplications almost entirely eliminated; additions halved. Accuracy: Top-1/Top-5 accuracy unchanged.
Ternary Fixed-Point Quantization
[Page 15]
(Diagram: the weights of each layer quantized to {−1, 0, +1}.)
Peisong Wang and Jian Cheng, “Two-Step Quantization for Low-bit Neural Networks”. CVPR 2018
Two-Step Fixed-Point Quantization
[Page 16]
(Diagram: Stage 1, activation quantization; Stage 2, weight quantization to {−1, 0, +1}.)
Two-Step Fixed-Point Quantization
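A hedged sketch of the two-step idea: stage 1 fixes a uniform activation quantizer, and stage 2 then fits weight scales against outputs computed with the quantized activations. The uniform quantizer, the function names, and the plain least-squares fit are illustrative simplifications of the paper's method:

```python
def quantize_act(x, clip=1.0, bits=2):
    """Stage 1: ReLU + clipping, then uniform quantization to 2^bits - 1
    levels (a learned clipping threshold is assumed elsewhere)."""
    levels = (1 << bits) - 1
    x = min(max(x, 0.0), clip)
    return round(x / clip * levels) * clip / levels

def fit_weight_scale(acts_q, codes, targets):
    """Stage 2: least-squares alpha so that alpha * (acts_q . codes)
    matches the full-precision layer outputs."""
    preds = [sum(a * c for a, c in zip(row, codes)) for row in acts_q]
    denom = sum(p * p for p in preds)
    return sum(p * t for p, t in zip(preds, targets)) / denom if denom else 0.0
```

Solving the weight step against quantized activations, rather than against the full-precision weights, is what distinguishes the two-step scheme from quantizing weights and activations independently.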
[Page 17]
Two-Step Fixed-Point Quantization
[Page 18]
Part 2: Few-Sample Learning via Quantization Learning
[Page 19]
Quantization-Based Few-Sample Learning
• Revealed the implicit relationship between batch-normalization layers and the weight parameters in deep network compression: updating only the BN statistics with a small amount of unlabeled data greatly improves the performance of the compressed network.
• Experiments confirm that a full-precision network quantized to 4-bit weights, or pruned to 6× compression, can still achieve good performance without any retraining.
Xiangyu He, Jian Cheng. Learning Compression from Limited Unlabeled Data. ECCV 2018
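The BN-statistics trick above can be sketched per channel in plain Python. In practice one runs a few unlabeled forward passes through the compressed network and refreshes each BN layer's running mean and variance; the single-channel functions below are illustrative:

```python
def bn_recalibrate(channel_outputs):
    """Re-estimate one BN channel's mean/variance from the activations the
    *compressed* network actually produces on a few unlabeled samples."""
    n = len(channel_outputs)
    mean = sum(channel_outputs) / n
    var = sum((x - mean) ** 2 for x in channel_outputs) / n
    return mean, var

def bn_apply(x, mean, var, gamma=1.0, beta=0.0, eps=1e-5):
    """Standard BN inference transform with the refreshed statistics."""
    return gamma * (x - mean) / (var + eps) ** 0.5 + beta
```

The learned gamma/beta stay fixed; only the running statistics are updated, which is why no labels and no gradient steps are needed.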
[Page 20]
• Weight quantization
• Network pruning
[Page 21]
Part 3: Quantization-Based Inference Engine
[Page 22]
Inference Engines for Deep Networks

Inference with existing deep-learning frameworks:
• Designed for training: redundant operators remain
• Designed for Nvidia GPUs: poor support for mobile GPUs
• Designed for the x86 architecture: slow on ARM and other platforms

A good inference engine is:
• Designed for inference: computation-graph optimization
• Designed for multiple computing platforms: supports CPUs, embedded GPUs, and more
• Designed for multiple architectures: low-level assembly optimization for ARM and others
[Page 23]
Inference Engines for Deep Networks

Problems with existing inference engines:
• Simplistic computation-graph optimization: every operator in the network takes part in computation, and memory management is naive
• Heavy memory traffic and slow computation on mobile devices
• Limited support for quantized operators

Our approach:
• Optimize the computation graph with respect to the output nodes: operator fusion and removal of redundant operators
• Memory-pool management: allocate contiguous memory and reuse memory blocks
• Quantized + Winograd operators
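The memory-pool idea can be sketched as a greedy, lifetime-based buffer assignment (tensor tuples, the first-fit policy, and all names are illustrative, not the engine's actual algorithm):

```python
# Toy memory-pool planner: each tensor has a size and a [first_use, last_use]
# lifetime; assign it to the first existing block that is both free by its
# first use and large enough, instead of allocating fresh memory per tensor.
# A tensor is considered dead (its block reusable) after its last_use step.

def assign_blocks(tensors):
    """tensors: list of (name, size, first_use, last_use)."""
    blocks = []   # list of (size, free_at_step)
    plan = {}     # tensor name -> block index
    for name, size, start, end in sorted(tensors, key=lambda t: t[2]):
        for i, (bsize, free_at) in enumerate(blocks):
            if free_at <= start and bsize >= size:
                blocks[i] = (bsize, end)          # reuse this block
                plan[name] = i
                break
        else:
            blocks.append((size, end))            # no fit: allocate new block
            plan[name] = len(blocks) - 1
    return plan, blocks
```

With three equal-size tensors where only the last two overlap in time, the planner needs two blocks instead of three; on real graphs this kind of reuse is what keeps the peak memory of inference far below the sum of all tensor sizes.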
[Page 24]
Quantization-Based Inference Engine: QEngine

QEngine: Quantized Neural Network Engine for Efficient Inference
• NNSaveLoader: imports models from Caffe and PyTorch
• NNCompiler: computation-graph optimization and memory optimization
• NNLib: quantized operators, sparse operators, instruction-set acceleration
• Target hardware: CPU, GPU, DSP, ASIC
[Page 25]
Key Features of QEngine

1. Efficient direct convolution
2. Quantized computation
3. Computation-graph optimization
[Page 26]
• Supports combining low-bit quantization with Winograd acceleration
• Optimized quantized computation from 8-bit down to 1-bit
• 8-bit accuracy is lossless compared with FP32
• Compatible with multiple processors and hardware platforms

Key Features of QEngine
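For reference, the Winograd transform that the engine combines with quantization can be shown in its smallest form. The sketch below is the standard floating-point F(2,3) algorithm (Lavin & Gray's transforms), computing two outputs of a 3-tap convolution with 4 multiplications instead of 6; how QEngine fuses this with low-bit arithmetic is not shown:

```python
def winograd_f23(d, g):
    """F(2,3) Winograd: two outputs of a 3-tap convolution from a 4-long
    input tile, using 4 multiplications instead of 6."""
    m0 = (d[0] - d[2]) * g[0]
    m1 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m2 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m3 = (d[1] - d[3]) * g[2]
    return [m0 + m1 + m2, m1 - m2 - m3]

def conv_valid(d, g):
    """Direct 'valid' 3-tap convolution, for comparison."""
    return [sum(d[i + j] * g[j] for j in range(3)) for i in range(len(d) - 2)]
```

The multiplication savings compound with quantization: fewer multiplies, each on narrower operands.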
[Page 27]
Benchmark on Huawei Mate10
| Runtime (ms) | AlexNet | MobileNet-V1 | MobileNet-V2 | ResNet-18 | ResNet-50 |
|--------------|---------|--------------|--------------|-----------|-----------|
| T-Engine | 104.96 | 42.04 | 48.6 | 123.36 | 252.06 |
| TF-Lite | 186 | 97 | 116 | 187 | 435 |
| QEngine_FP32 | 80 | 38.8 | 40.6 | 77 | 225 |
| QEngine_int8 | 39 | 29 | 38.6 | 68.5 | 143 |

Table 1. Runtime benchmark on Kirin 970.
[Page 28]
Benchmark on RK3399
| Runtime (ms) | AlexNet | MobileNet-V1 | MobileNet-V2 | ResNet-18 | ResNet-50 |
|--------------|---------|--------------|--------------|-----------|-----------|
| Caffe | 501.339 | 260.193 | 382.893 | 380.929 | 757.437 |
| TF-Lite | 271.283 | 184.255 | 125.111 | 426.396 | 818.635 |
| Tengine | 184.5 | 70.1 | 87.4 | 242.8 | 483.9 |
| QEngine_FP32 | 172.44 | 69.82 | 75.92 | 137.66 | 393.99 |
| QEngine_int8 | 82.74 | 50.35 | 65.61 | 139.88 | 291.38 |

Table 2. Runtime benchmark on RK3399.
[Page 29]
Part 4: Quantized Neural Processing Unit (QNPU)
[Page 30]
Neural Network Computation
Neural networks are both compute- and memory-intensive, with large volumes of intermediate results; in forward computation, 57.3% of the energy [1] is spent on data movement.
(Figure: intermediate data size (MB) of all layers in VGG-16 inference.)

Quantization: floating-point → integer
Dedicated accelerators speed up deep-learning applications: converting 32-bit floating-point numbers to fixed-point reduces latency and lowers energy consumption.

[1] Amirali Boroumand, et al. "Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks." Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2018.
[Page 31]
Advantages of Quantization on Accelerators
| Operation | Energy Cost [1] |
|-----------|-----------------|
| 8-bit int Add | 0.03 pJ |
| 32-bit int Add | 0.1 pJ |
| 8-bit int Mult | 0.2 pJ |
| 32-bit int Mult | 3.1 pJ |
| 32-bit float Add | 0.9 pJ |
| 32-bit float Mult | 3.7 pJ |
| 64-bit SRAM cache access | 10 pJ |
| 64-bit DRAM access | 1.3 nJ |
[1] Horowitz, Mark. "1.1 Computing's energy problem (and what we can do about it)." 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2014: 10-14.
• Low-bit integer operations (8-bit) greatly improve the energy efficiency of every operation on the accelerator.
• Memory access consumes far more energy than other operations; low-bit quantization drastically reduces memory traffic and therefore the energy spent on it.
• Reducing overall memory access is likewise a primary goal of efficient chip design.
[Page 32]
Advantages of Quantization on Accelerators
• Under the same on-chip resource constraints, quantization lets more activations and parameters reside in on-chip buffers, creating more opportunities for on-chip data reuse and amortizing the energy cost of off-chip DRAM accesses.
• With quantization, each compute unit is simpler and occupies less area, so the same chip area can hold more compute units and deliver higher overall performance; since the operands are low-bit integers, bandwidth demand does not grow.
[Page 33]
Extremely-Low-Bit Quantization Boosts Chip Performance
• Extremely-low-bit quantization such as binary and ternary can further improve the energy efficiency of neural network accelerators: with binary/ternary weights, all multiplications can be eliminated in favor of additions and subtractions, greatly reducing the complexity of the on-chip design.
• Bit-serial compute units can support computation at different bit precisions: they fully exploit the reduced workload of extremely low bit-widths, while still supporting multi-bit, higher-precision computation when the accuracy requirements demand it.
[Page 34]
Block-Based Computation for Memory-Access Optimization
• Off-chip DRAM accesses cost far more energy than on-chip operations; for large networks, layer-by-layer computation requires massive external memory traffic. Block-wise computation combined with layer fusion [1] keeps intermediate results in on-chip buffers, greatly reducing off-chip accesses.
• The block-computation scheme avoids data dependencies between blocks without losing accuracy, further eliminating the memory accesses otherwise needed to handle the overlapping regions between blocks.
(Diagram: feature maps partitioned into independently computed blocks A1-A4, B1-B4, C1-C4.)
[1] Gang Li, et al. “Block convolution: Towards memory-efficient inference of large-scale CNNs on FPGA.” 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE) (2018): 1163-1166.
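A 1-D toy version of block convolution: each block is zero-padded and convolved independently, so no data crosses block boundaries. The cited paper does this in 2-D on feature maps and trains the network with this padding so accuracy is preserved; the sketch below only demonstrates the independence property:

```python
def conv1d_same(x, k):
    """'Same' 1-D convolution with zero padding (direct reference version)."""
    p = len(k) // 2
    xp = [0.0] * p + list(x) + [0.0] * p
    return [sum(xp[i + j] * k[j] for j in range(len(k))) for i in range(len(x))]

def block_conv1d(x, k, block):
    """Block convolution: split the input into blocks, pad and convolve each
    one independently, and concatenate. No inter-block data is needed."""
    out = []
    for s in range(0, len(x), block):
        out.extend(conv1d_same(x[s:s + block], k))
    return out
```

Outputs near block boundaries differ from the standard convolution (zeros replace the neighboring block's values), which is exactly the dependency that is traded away so each block can stay entirely on chip.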
[Page 35]
Block-Based Computation for Memory-Access Optimization
• With the block scheme, VGG-16 deployed on a Xilinx ZC706 board reaches 12.19 fps, with all intermediate results stored in on-chip buffers; no extra DRAM accesses are needed, saving substantial energy.
[Page 36]
Layer Fusion for Memory-Access Optimization
• Different layers can be fused inside the compute units: the results of an earlier layer are passed directly to the units computing the subsequent layer, without going through a buffer to store intermediate results, which greatly reduces on-chip memory accesses and saves energy.
• On-chip buffers store input data, parameters, and intermediate results; when a layer's activations fit entirely on chip, the stored intermediate results can feed the next layer directly, with no external memory access.
[Page 37]
Quantized Neural Processing Unit: QNPU
1. FP32 → INT8/4/2/1: state-of-the-art quantization algorithms, lossless compression at low bit-widths
2. Shift operations (<<, >>) replace ordinary multiplications in the compute architecture
3. Operator fusion: a multi-operation fusion mechanism

Low cost, low power, low latency.

(Diagram: without fusion, convolution layers 1-4 each incur their own reads and writes, stretching total latency; with fusion, layers 1&2 and 3&4 are computed together, shortening it.)
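The shift-for-multiply point relies on weights quantized to signed powers of two; a minimal illustration (the function names are mine, and negative exponents use an arithmetic right shift, which truncates):

```python
import math

def quantize_pow2(w):
    """Round a nonzero weight to the nearest signed power of two,
    returned as (sign, exponent)."""
    sign = -1 if w < 0 else 1
    k = round(math.log2(abs(w)))
    return sign, k

def shift_mul(x, sign, k):
    """Integer x times sign * 2^k using shifts only (no multiplier)."""
    return sign * (x << k) if k >= 0 else sign * (x >> -k)

sign, k = quantize_pow2(3.7)   # nearest power of two is 4 = 2^2
```

In hardware this removes the multiplier array entirely: a barrel shifter plus an adder tree suffices, which is one source of the low-cost, low-power claims above.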
[Page 38]
Application Areas
[Page 39]
Thank you for listening!

Jian Cheng, Professor
Contact: [email protected], www.nlpr.ia.ac.cn/jcheng

For a detailed survey, see: Recent Advances in Efficient Computation of Deep Convolutional Neural Networks. Frontiers of Information Technology & Electronic Engineering (FITEE), 2018.