fpgaConvNet: A Framework for Mapping Convolutional Neural Networks on FPGAs
Transcript of the talk "fpgaConvNet: A Framework for Mapping Convolutional Neural Networks on FPGAs"
Stylianos I. Venieris, Christos-Savvas Bouganis ([email protected])
FCCM 2016, Washington DC, 2 May 2016
Deep Learning and AI
Deep Learning Success Stories - ConvNets
• Image Recognition (Microsoft, 2015)
• Image Captioning (Microsoft, 2015)
• "Deep Face" (Facebook, 2014)
Deep Learning on FPGAs
• Hand-tuned implementations
• Memory I/O Optimisation
• Design Space Exploration
What is missing?
• Caffe
• TensorFlow
• Theano
• Torch
• FPGA
– ConvNet Functionality
– Optimised for High Performance
(Figure: fpgaConvNet bridges deep learning developers' frameworks to the FPGA)
Our approach - fpgaConvNet
(Figure: toolflow. The ConvNet description and the FPGA target platform specifications, supplied by the deep learning expert, enter fpgaConvNet, which performs automated design space exploration and ConvNet hardware mapping, and emits a bitstream.)
Convolutional Neural Networks (ConvNets)
(Figure: convolutional + nonlinearity → pooling → convolutional + nonlinearity → pooling)
• Synchronous Dataflow
– ConvNet as a data-driven graph
– Represented as a matrix
– Each layer mapped to a tunable set of hardware building blocks
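The SDF view above can be sketched numerically: a minimal topology-matrix example in plain Python/NumPy (the layer names, token rates, and three-node pipeline are illustrative assumptions, not fpgaConvNet's actual model), where solving the SDF balance equations yields the repetition vector that underlies a static schedule.

```python
import numpy as np

# topology[a][n]: tokens produced (+) / consumed (-) on arc a per firing of node n.
# Illustrative nodes: [conv (4 filters), pool, conv (4 filters)].
topology = np.array([
    [4.0, -1.0,  0.0],   # arc 0: conv produces 4 tokens/firing, pool consumes 1
    [0.0,  1.0, -4.0],   # arc 1: pool produces 1 token/firing, next conv consumes 4
])

# Balance equations: topology @ r == 0, so the repetition vector r spans the
# null space of the topology matrix (this is what makes static scheduling possible).
_, _, vt = np.linalg.svd(topology)
r = np.abs(vt[-1])
r = r / r.min()        # normalise to the smallest firing count
print(r)               # [1. 4. 1.]: pool fires 4 times per conv firing
```

Because the graph is data-driven and the rates are fixed, this repetition vector is computed once, offline, and fully determines the steady-state schedule.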
fpgaConvNet – ConvNet Modelling Framework
• Streaming
• Analytical power
fpgaConvNet – Modelling ConvNets with SDF
(Figure: ConvNet hardware SDF graph. Input data enters through the memory interface and streams through a convolutional layer with 4 filters, a nonlinearity layer, a pooling layer, and a second convolutional layer with 4 filters.)
fpgaConvNet – Design Space Perspective
Bottlenecks:
– Limited compute resources
– Limited off-chip memory bandwidth
– Limited on-chip memory for model parameters
(Figure: throughput vs. area plot of the design space, marking the current design point and the areas available on FPGA 1 and FPGA 2)
Define a set of actions to move around the design space
Action 1: Coarse-grained Folding
1) Exceeding the available compute resources
4 Convolutions / cycle
2) Not enough off-chip memory bandwidth
Action 1: Coarse-grained Folding
2 Convolutions / cycle
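The folding trade-off above can be sketched as a toy model: time-multiplexing a layer's filters over fewer parallel convolution units, halving both the compute resources and the required bandwidth at the cost of more cycles per window. The function name and cost model are illustrative assumptions, not fpgaConvNet's API.

```python
# Sketch of coarse-grained folding (illustrative toy model).
def fold_layer(num_filters, parallel_units):
    """Return (cycles per input window, convolutions per cycle) after folding."""
    assert num_filters % parallel_units == 0, "fold factor must divide the filter count"
    folding_factor = num_filters // parallel_units
    # Each unit processes `folding_factor` filters in turn, so one input window
    # now takes `folding_factor` cycles instead of one.
    return folding_factor, parallel_units

# Unfolded: 4 filters on 4 units -> 4 convolutions/cycle, 1 cycle per window.
print(fold_layer(4, 4))   # (1, 4)
# Folded by 2: 4 filters on 2 units -> 2 convolutions/cycle, 2 cycles per window.
print(fold_layer(4, 2))   # (2, 2)
```

Folding trades throughput for area and input bandwidth, which is exactly the lever used when the design exceeds the available compute resources or the off-chip memory bandwidth.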
Action 2: Fine-grained Folding
(Figure: trading compute resources against required bandwidth)
(Figure: the pipeline Input Data → ConvLayer 1 → NonlinLayer 1 → PoolLayer 1 → ConvLayer 2 → NonlinLayer 2 → PoolLayer 2 is partitioned into Subgraph 1, Subgraph 2, and Subgraph 3, each compiled into its own bitstream: Bitstream 1, Bitstream 2, Bitstream 3.)
Action 3: Partitioning through Reconfiguration
1) Exceeding the available compute resources
2) Not enough on-chip memory → FPGA Reconfiguration
(Figure: off-chip memory holds the input data, the intermediate results passed between subgraphs, and the final results.)
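The execution schedule implied by this action can be sketched as follows. Every name here (`reconfigure`, `execute`, the mock device) is a hypothetical placeholder standing in for the real toolflow; the point is the control flow: one bitstream per subgraph, with intermediate results routed through off-chip memory between reconfigurations.

```python
# Sketch of partitioning through FPGA reconfiguration (hypothetical API).
def run_partitioned(subgraphs, input_data, fpga):
    """Execute subgraphs in sequence, reconfiguring the FPGA between them."""
    data = input_data                        # the first subgraph reads the input data
    for bitstream, subgraph in subgraphs:
        fpga.reconfigure(bitstream)          # load this subgraph's bitstream
        data = fpga.execute(subgraph, data)  # intermediate results go via off-chip memory
    return data                              # final results

class MockFPGA:
    """Stand-in device that just records the schedule."""
    def __init__(self):
        self.log = []
    def reconfigure(self, bitstream):
        self.log.append(("load", bitstream))
    def execute(self, subgraph, data):
        self.log.append(("run", subgraph))
        return data + [subgraph]             # pretend each subgraph appends its output

fpga = MockFPGA()
out = run_partitioned([("bit1", "sub1"), ("bit2", "sub2"), ("bit3", "sub3")], [], fpga)
print(out)        # ['sub1', 'sub2', 'sub3']
```

The reconfiguration latency is amortised over the work done by each subgraph, which is why partitioning pays off when the full network cannot fit on chip.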
fpgaConvNet – SDF Analytical Power
(Figure: Design 1 vs. Design 2, with window size = K and pool size = P)
• Synchronous Dataflow
– Actions as algebraic operations
– Any local action propagates through the network
– Static scheduling
– Analytical Performance Model
– Cast DSE to a formal resource-constrained optimisation
(Figure: hardware stages and their interconnections)
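The last bullet, casting DSE as a resource-constrained optimisation, can be sketched with an invented toy cost model (the DSP cost per unit, ops per cycle, and exhaustive search below are assumptions for illustration; the real framework uses detailed analytical models of DSPs, on-chip memory, and bandwidth).

```python
# Sketch: DSE as resource-constrained optimisation under a toy analytical model.
def throughput_gops(parallel_units, clock_mhz=100, ops_per_unit_per_cycle=2):
    """Toy performance model: units * ops/cycle * clock, in GOps/s."""
    return parallel_units * ops_per_unit_per_cycle * clock_mhz * 1e6 / 1e9

def dse(max_dsps, dsps_per_unit=5, candidates=range(1, 65)):
    """Pick the highest-throughput folding that fits within the DSP budget."""
    feasible = [u for u in candidates if u * dsps_per_unit <= max_dsps]
    return max(feasible, key=throughput_gops)

best = dse(max_dsps=220)                  # e.g. a 220-DSP budget, as on the XC7Z020
print(best, throughput_gops(best))        # 44 units, 8.8 GOps/s under the toy model
```

Because the performance model is analytical, each candidate design point is evaluated in microseconds rather than by synthesis, which is what makes exploring the space tractable.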
Evaluation - Experimental Setup
• fpgaConvNet
– Xilinx Zynq-7000 XC7Z020 SoC with 220 DSPs at 100 MHz
– Q8.8 fixed-point precision to match existing work (also supports floating-point)
– Current toolflow supports the Vivado HLS toolchain
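For reference, Q8.8 means 8 integer bits and 8 fractional bits in a 16-bit two's-complement word. A minimal sketch of the quantisation (plain Python, with round-to-nearest and saturation as assumed conventions, since the talk does not specify them):

```python
# Sketch of Q8.8 fixed-point quantisation (8 integer bits, 8 fractional bits).
SCALE = 1 << 8                     # 2^8 steps per unit -> resolution of 1/256

def to_q88(x):
    """Quantise a float to a 16-bit two's-complement Q8.8 value (saturating)."""
    v = int(round(x * SCALE))
    return max(-(1 << 15), min((1 << 15) - 1, v))   # clamp to int16 range

def from_q88(v):
    """Recover the real value represented by a Q8.8 word."""
    return v / SCALE

print(to_q88(1.5))                 # 384
print(from_q88(to_q88(3.14159)))   # 3.140625 (rounding error below 1/512)
print(to_q88(200.0))               # 32767 (saturated: 200 exceeds the Q8.8 range)
```

Representable values span roughly [-128, 128) with a resolution of 1/256, which is why the precision suffices for ConvNet inference while halving the datapath width versus single-precision floating point.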
Performance Model Accuracy
(Figure: measured vs. predicted performance in GOps/s on six benchmarks: CFF, LeNet-5, MPCNN, CNP, a Sign Recognition ConvNet, and a Scene Labelling ConvNet)
Error between 1.73% and 11.76%
fpgaConvNet vs. Existing FPGA Work
(Figure: performance density comparison in GOps/s/Slice against a hand-tuned design [1] and a memory-optimised design [2]; the chart annotates a 1.62× improvement for fpgaConvNet.)
[1] C. Farabet et al., “CNP: An FPGA-Based Processor for Convolutional Networks”, in FPL, IEEE, 2009.
[2] M. Peemen et al., “Memory-centric accelerator design for Convolutional Neural Networks”, in ICCD, IEEE, 2013.
(Figure: performance efficiency comparison in GOps/s/Watt between a hand-tuned embedded GPU implementation [3] and fpgaConvNet)
[3] L. Cavigelli et al., “Accelerating real-time embedded scene labeling with convolutional networks”, in DAC, ACM/EDAC/IEEE, 2015.
Hand-tuned Embedded GPU
• Tegra K1 at 800 MHz
• Memory Bandwidth: 12 GB/s
fpgaConvNet
• Zynq-7000 XC7Z020 at 100 MHz
• Memory Bandwidth: 4.26 GB/s
fpgaConvNet vs. Existing Embedded GPU Work
Conclusions
• Caffe
• TensorFlow
• Theano
• Torch
(Figure recap: fpgaConvNet maps ConvNets from deep learning developers' frameworks onto FPGAs)