fpgaConvNet: A Framework for Mapping Convolutional Neural Networks on FPGAs
Stylianos I. Venieris, Christos-Savvas Bouganis
FCCM 2016, Washington DC
2 May 2016


Deep Learning and AI

Deep Learning Success Stories - ConvNets

Image Recognition (Microsoft, 2015)

Image Captioning (Microsoft, 2015)

“Deep Face” (Facebook, 2014)

Deep Learning on FPGAs

• Hand-tuned implementations
• Memory I/O Optimisation
• Design Space Exploration

What is missing?

Deep Learning Developers
• Caffe
• TensorFlow
• Theano
• Torch

fpgaConvNet

• FPGA
– ConvNet Functionality
– Optimised for High Performance

Our approach - fpgaConvNet

Supplied by Deep Learning Expert
• ConvNet Description
• FPGA Target Platform Specifications

fpgaConvNet
• Automated Design Space Exploration
• ConvNet Hardware Mapping

Bitstream

Convolutional Neural Networks (ConvNets)

convolutional + nonlinearity → pooling → convolutional + nonlinearity → pooling

fpgaConvNet – ConvNet Modelling Framework

• Synchronous Data Flow
– ConvNet as a data-driven graph
– Represented as a matrix
– Each layer mapped to a tunable set of hardware building blocks

Streaming

Analytical power

fpgaConvNet – Modelling ConvNets with SDF

ConvNet Hardware SDF Graph

Convolutional Layer with 4 filters, Nonlin Layer, Pooling Layer, Convolutional Layer with 4 filters

[Figure labels: Memory Interface, Input Data]
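One standard way to represent an SDF graph as a matrix is the topology matrix from SDF theory: one row per interconnection, one column per stage, with positive entries for values produced per firing and negative entries for values consumed. The sketch below is illustrative only. The stage names loosely follow the slide, but the rates (a pooling stage that consumes 4 values per output) and the helper `repetition_vector` are assumptions, not details taken from the talk.

```python
from fractions import Fraction
from math import lcm

# Hypothetical chain: conv1 -> nonlin1 -> pool1 -> conv2.
stages = ["conv1", "nonlin1", "pool1", "conv2"]

# Topology matrix: one row per interconnection, one column per stage.
# +n = values produced per firing, -n = values consumed per firing.
# Assumed rates: pool1 consumes 4 values (a 2x2 window) per output value.
topology = [
    [1, -1,  0,  0],   # conv1   -> nonlin1
    [0,  1, -4,  0],   # nonlin1 -> pool1
    [0,  0,  1, -1],   # pool1   -> conv2
]

def repetition_vector(matrix):
    """Smallest positive integer firing counts q such that matrix @ q == 0.

    Assumes a simple chain where row i connects stage i to stage i + 1.
    """
    rates = [Fraction(1)]
    for i, row in enumerate(matrix):
        produced, consumed = row[i], -row[i + 1]
        rates.append(rates[-1] * produced / consumed)
    scale = lcm(*(r.denominator for r in rates))
    return [int(r * scale) for r in rates]

q = repetition_vector(topology)
# Balance check: each interconnection receives as much as is consumed from it.
balanced = all(sum(a * b for a, b in zip(row, q)) == 0 for row in topology)
print(dict(zip(stages, q)), balanced)
```

Under these assumed rates, the stages before the pooling stage fire four times for every firing of the stages after it, which is the kind of static schedule an SDF model makes available at design time.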

fpgaConvNet – Design Space Perspective

ConvNet Hardware SDF Graph

Bottlenecks:
– Limited compute resources
– Limited off-chip memory bandwidth
– Limited on-chip memory for model parameters

[Plot: Design Space, Throughput vs. Area. Labels: Current Design Point, FPGA 1, FPGA 2]

Define a set of actions to move around the design space

Action 1: Coarse-grained Folding

1) Exceeding the available compute resources
2) Not enough off-chip memory bandwidth

4 Convolutions / cycle

Action 1: Coarse-grained Folding

2 Convolutions / cycle

Action 2: Fine-grained Folding

Compute Resources
Required Bandwidth
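The coarse-grained folding example on these slides (4 convolutions per cycle reduced to 2) can be sketched as a small calculation. This is an illustrative model rather than the framework's actual cost model: the function name `fold_coarse` and the idea of counting parallel units and passes are assumptions made for the example.

```python
from math import ceil

def fold_coarse(num_filters, folding_factor):
    """Time-multiplex `num_filters` convolutions onto fewer parallel units.

    Returns (parallel_units, passes): how many convolution units are
    instantiated, and how many passes each unit needs to cover every filter.
    Illustrative model only; names and assumptions are hypothetical.
    """
    if folding_factor < 1:
        raise ValueError("folding_factor must be >= 1")
    parallel_units = ceil(num_filters / folding_factor)
    passes = ceil(num_filters / parallel_units)
    return parallel_units, passes

# Slide example: a convolutional layer with 4 filters.
for factor in (1, 2, 4):
    units, passes = fold_coarse(4, factor)
    print(f"folding x{factor}: {units} convolutions / cycle, {passes} pass(es)")
```

With a folding factor of 2 the layer goes from 4 to 2 parallel convolutions. In this simple model, fewer parallel units mean lower compute resources and required bandwidth, paid for with more passes per input.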

Action 3: Partitioning through Reconfiguration

– Exceeding the available compute resources
– Not enough on-chip memory

Input Data, ConvLayer 1, NonlinLayer 1, PoolLayer 1, ConvLayer 2, NonlinLayer 2, PoolLayer 2

Subgraph 1, Subgraph 2, Subgraph 3

Bitstream 1, Bitstream 2, Bitstream 3

FPGA Reconfiguration

Off-chip Memory: Input Data, Intermediate Results, Intermediate Results, Final Results
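A minimal sketch of how a layer chain might be split into subgraphs, assuming each layer has a known resource cost and there is a single resource budget per FPGA configuration. The greedy strategy, the cost numbers, and the function name `partition_layers` are illustrative assumptions and are not claimed to be the framework's algorithm.

```python
def partition_layers(layers, budget):
    """Greedily split an ordered layer chain into subgraphs that fit `budget`.

    `layers` is a list of (name, cost) pairs; costs are hypothetical
    resource units. Each subgraph would get its own bitstream, with
    intermediate results passed through off-chip memory between them.
    """
    subgraphs, current, used = [], [], 0
    for name, cost in layers:
        if cost > budget:
            raise ValueError(f"{name} alone exceeds the budget")
        if current and used + cost > budget:
            # Close the current subgraph and start a new one.
            subgraphs.append(current)
            current, used = [], 0
        current.append(name)
        used += cost
    if current:
        subgraphs.append(current)
    return subgraphs

# Layer names from the slide; the costs are made up for illustration.
chain = [
    ("ConvLayer 1", 60), ("NonlinLayer 1", 5), ("PoolLayer 1", 10),
    ("ConvLayer 2", 80), ("NonlinLayer 2", 5), ("PoolLayer 2", 10),
]
for i, sub in enumerate(partition_layers(chain, budget=90), start=1):
    print(f"Subgraph {i} -> Bitstream {i}: {sub}")
```

Each extra subgraph adds a reconfiguration and a round trip of intermediate results through off-chip memory, so a real partitioner would need to weigh that overhead against the resources freed.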

fpgaConvNet – SDF Analytical Power

[Figure labels: Design 1, Design 2; Window Size = K, Pool Size = P; Hardware Stages, Interconnections]

• Synchronous Data Flow
– Actions as algebraic operations
– Any local action propagates through the network
– Static scheduling
– Analytical Performance Model
– Cast DSE to formal resource-constrained optimisation
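To illustrate what casting design space exploration as a resource-constrained optimisation can look like, the sketch below exhaustively searches per-stage folding factors, maximising a modelled throughput subject to an area budget. The performance and area models are deliberately simplistic placeholders, and every name and number is hypothetical.

```python
from itertools import product

def explore(workloads, unit_area, area_budget, max_fold=4):
    """Exhaustive search over per-stage folding factors.

    workloads[i]: parallel operations of stage i when unfolded (hypothetical).
    Folding stage i by f keeps workloads[i] / f units, so its area is
    divided by f and its cycles per input are multiplied by f.
    Pipeline throughput is set by the slowest (most folded) stage.
    Returns (throughput, folds, area), or None if no design fits.
    """
    best = None
    for folds in product(range(1, max_fold + 1), repeat=len(workloads)):
        # Only consider folding factors that divide the workload evenly.
        if any(w % f for w, f in zip(workloads, folds)):
            continue
        area = sum(unit_area * w // f for w, f in zip(workloads, folds))
        if area > area_budget:
            continue
        throughput = 1 / max(folds)  # inputs per cycle
        # Prefer higher throughput, then smaller area on ties.
        if best is None or (throughput, -area) > (best[0], -best[2]):
            best = (throughput, folds, area)
    return best

print(explore(workloads=[4, 4], unit_area=10, area_budget=60))
```

Exhaustive search only works for toy examples like this one. The point is that once performance and area are closed-form functions of the tunable parameters, any standard constrained optimiser can replace the loop.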

Evaluation - Experimental Setup

• fpgaConvNet
– Xilinx Zynq-7000 XC7Z020 SoC with 220 DSPs at 100 MHz
– Q8.8 fixed-point precision to match existing work (also supports floating-point)
– Current tool flow supports the Vivado HLS toolchain
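For readers unfamiliar with the notation, Q8.8 is commonly read as a 16-bit fixed-point format with 8 integer bits (including the sign) and 8 fractional bits. The sketch below assumes that signed two's-complement interpretation, with round-to-nearest on conversion and saturation on overflow. The slides do not specify rounding or overflow behaviour, and the helper names are illustrative.

```python
FRAC_BITS = 8
SCALE = 1 << FRAC_BITS                      # 256 steps per unit
Q_MIN, Q_MAX = -(1 << 15), (1 << 15) - 1    # signed 16-bit range

def to_q8_8(x: float) -> int:
    """Quantise a float to a signed 16-bit Q8.8 integer (assumed: round, saturate)."""
    raw = round(x * SCALE)
    return max(Q_MIN, min(Q_MAX, raw))

def from_q8_8(q: int) -> float:
    """Convert a Q8.8 integer back to a float."""
    return q / SCALE

def mul_q8_8(a: int, b: int) -> int:
    """Multiply two Q8.8 values.

    The wide product has 16 fractional bits, so it is shifted right by 8
    (rounding toward negative infinity) and saturated back to 16 bits.
    """
    return max(Q_MIN, min(Q_MAX, (a * b) >> FRAC_BITS))

w, x = to_q8_8(0.75), to_q8_8(-1.5)
print(w, x, from_q8_8(mul_q8_8(w, x)))
```

The smallest representable step in this format is 1/256, and values outside roughly ±128 saturate.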

Performance Model Accuracy

[Bar chart: Measured Performance (GOps/s) vs. Predicted Performance (GOps/s), axis from 0 to 14, for CFF, LeNet-5, MPCNN, CNP, Sign Recognition ConvNet, Scene Labelling ConvNet]

Error between 1.73% and 11.76%

fpgaConvNet vs. Existing FPGA Work

[Bar chart: Performance Density Comparison (GOps/s/Slice), Existing Work vs. fpgaConvNet, axis from 0 to 0.0005, for Hand-tuned [1] and Memory-optimised [2]]

1.62×

[1] C. Farabet et al., “CNP: An FPGA-Based Processor for Convolutional Networks”, in FPL, IEEE, 2009.

[2] M. Peemen et al., “Memory-centric accelerator design for Convolutional Neural Networks”, in ICCD, IEEE, 2013.

fpgaConvNet vs. Existing Embedded GPU Work

[Bar chart: Performance Efficiency Comparison (GOps/s/Watt), Existing Work vs. fpgaConvNet, axis from 0 to 8, for Hand-tuned Embedded GPU [3]]

Hand-tuned Embedded GPU
• Tegra K1 at 800 MHz
• Memory Bandwidth: 12 GB/s

fpgaConvNet
• Zynq-7000 XC7Z020 at 100 MHz
• Memory Bandwidth: 4.26 GB/s

[3] L. Cavigelli et al., “Accelerating real-time embedded scene labeling with convolutional networks”, in DAC, ACM/EDAC/IEEE, 2015.

Conclusions

Deep Learning Developers
• Caffe
• TensorFlow
• Theano
• Torch

fpgaConvNet