Deploying Deep Learning Networks to Embedded GPUs and CPUs


Transcript of Deploying Deep Learning Networks to Embedded GPUs and CPUs

Page 1

© 2015 The MathWorks, Inc.

Deploying Deep Learning Networks to Embedded GPUs and CPUs

Rishu Gupta, PhD

Senior Application Engineer, Computer Vision

Page 2

MATLAB Deep Learning Framework

Access Data → Design + Train → Deploy

▪ Manage large image sets

▪ Automate image labeling

▪ Easy access to models

▪ Acceleration with GPUs

▪ Scale to clusters

Page 3

Multi-Platform Deep Learning Deployment

Embedded: NVIDIA TX1, TX2, TK1; Raspberry Pi; BeagleBone

Mobile

Desktop

Data center

Page 4

Multi-Platform Deep Learning Deployment

▪ Need code that takes advantage of:

– NVIDIA® CUDA libraries, including cuDNN and TensorRT

– Intel® Math Kernel Library for Deep Neural Networks (MKL-DNN) for Intel processors

– ARM® Compute libraries for ARM processors

Page 5

Example targets: Intel Xeon desktop PC, Raspberry Pi board, Android phone, NVIDIA Jetson TX1 board

Multi-Platform Deep Learning Deployment

▪ Need code that takes advantage of:

– NVIDIA® CUDA libraries, including cuDNN and TensorRT

– Intel® Math Kernel Library for Deep Neural Networks (MKL-DNN) for Intel processors

– ARM® Compute libraries for ARM processors

Page 6

Algorithm Design to Embedded Deployment Workflow

Conventional approach:

1. Desktop GPU (high-level language, deep learning framework, C++): large, complex software stack

2. C/C++ implementation (low-level APIs): application-specific libraries

3. Embedded GPU (C/C++, target-optimized libraries): optimize for memory & speed

Challenges

• Integrating multiple libraries and packages

• Verifying and maintaining multiple implementations

• Algorithm & vendor lock-in

Page 7

Solution: GPU Coder for Deep Learning Deployment

Application logic → GPU Coder → Target libraries:

– NVIDIA TensorRT & cuDNN libraries

– ARM Compute Library

– Intel MKL-DNN library

Page 8

Deep Learning Deployment Workflows

INTEGRATED APPLICATION DEPLOYMENT: Pre-processing + Trained DNN + Post-processing → codegen → Portable target code

INFERENCE ENGINE DEPLOYMENT: Trained DNN → cnncodegen → Portable target code

Page 9

Workflow for Inference Engine Deployment

Steps for inference engine deployment:

1. Generate the code for the trained model (a minimal sketch follows below): >> cnncodegen(net, 'targetlib', 'cudnn')

2. Copy the generated code onto the target board

3. Build the code for the inference engine: >> make -C ./codegen -f …mk

4. Use a hand-written main function to call the inference engine

5. Generate the executable and test it: >> make -C ./ ……

INFERENCE ENGINE DEPLOYMENT: Trained DNN → cnncodegen → Portable target code
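A minimal sketch of step 1, assuming a trained network saved in a MAT-file and a GPU Coder installation with cuDNN target support; the file and variable names are illustrative:

% Load a previously trained network (file and variable names are illustrative)
data = load('trainedNet.mat');
net  = data.net;

% Generate portable inference-engine code for the cuDNN target library;
% other targets such as 'tensorrt' or 'arm-compute' follow the same pattern
cnncodegen(net, 'targetlib', 'cudnn');

The generated code lands under ./codegen, which is what steps 2 through 5 copy to the board, build with make, and drive from the hand-written main().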

Page 10

How to get a Trained DNN into MATLAB?

Trained DNN sources: Train in MATLAB | Model importer | Transfer learning from a reference model
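As a hedged illustration of the transfer-learning path, the sketch below adapts a pretrained reference model to a new two-class problem; it assumes the AlexNet support package is installed and that imds is a labeled imageDatastore whose images match the network's 227x227x3 input size:

% Start from a pretrained reference model (assumes the AlexNet support package)
refNet = alexnet;
layersTransfer = refNet.Layers(1:end-3);   % drop the final fully connected, softmax, and classification layers

% Replace the final layers for a new two-class problem
layers = [layersTransfer
          fullyConnectedLayer(2)
          softmaxLayer
          classificationLayer];

% Retrain on the new data (imds: labeled imageDatastore, images resized to 227x227x3)
opts = trainingOptions('sgdm', 'MiniBatchSize', 64, 'InitialLearnRate', 1e-4);
net  = trainNetwork(imds, layers, opts);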

Page 11

Deep Learning Inference Deployment

Trained DNN sources: Train in MATLAB | Model importer | Transfer learning from a reference model

Target libraries: NVIDIA TensorRT & cuDNN libraries | ARM Compute Library | Intel MKL-DNN library

Page 12

Building DNN from Scratch

Load Training Data → Build Layer Architecture → Set Training Options → Train Network

%% Create a datastore
imds = imageDatastore('Data', ...
    'IncludeSubfolders', true, 'LabelSource', 'foldernames');
num_classes = numel(unique(imds.Labels));

%% Build layer architecture
layers = [imageInputLayer([64 32 3])
          convolution2dLayer(5, 20)
          reluLayer()
          maxPooling2dLayer(2, 'Stride', 2)
          fullyConnectedLayer(512)
          fullyConnectedLayer(num_classes)   % one output per class (two classes here)
          softmaxLayer()
          classificationLayer()];

%% Set training options
miniBatchSize = 64;                          % example value
trainOpts = trainingOptions('sgdm', ...
    'MiniBatchSize', miniBatchSize, ...
    'Plots', 'training-progress');

%% Train network
net = trainNetwork(imds, layers, trainOpts);

Page 13

Pedestrian Detection DNN Deployment on ARM Processor

layers = [imageInputLayer([64 32 3])
          convolution2dLayer(5, 20)
          reluLayer()
          maxPooling2dLayer(2, 'Stride', 2)
          crossChannelNormalizationLayer(5, 'K', 1)
          convolution2dLayer(5, 20)
          reluLayer()
          maxPooling2dLayer(2, 'Stride', 2)
          fullyConnectedLayer(512)
          fullyConnectedLayer(2)
          softmaxLayer()
          classificationLayer()];

Page 14

Pedestrian Detection DNN Deployment on ARM Processor

▪ ARM NEON instruction set architecture

– Example: ARM Cortex-A

▪ ARM Compute Library

– Low-level software functions

– Computer vision, machine learning, etc.

▪ Pedestrian detection on Raspberry Pi (see the sketch below)
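A hedged sketch of generating the pedestrian-detection inference engine for the ARM target; it assumes the trained network from the layer definition above and that the ARM Compute Library target support is installed:

% Generate C++ inference code that calls the ARM Compute Library
% ('arm-compute' selects the ARM target; NEON is used on Cortex-A cores)
cnncodegen(net, 'targetlib', 'arm-compute');

% As in the generic workflow: copy ./codegen to the Raspberry Pi, build it with make,
% and call the inference engine from a hand-written main().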

Page 15

Deep Learning Inference Deployment

Trained DNN sources: Train in MATLAB | Model importer | Transfer learning from a reference model

Target libraries: NVIDIA TensorRT & cuDNN libraries | ARM Compute Library | Intel MKL-DNN library

Page 16

Importing DNN from Open Source Framework

Caffe Model Importer (including Caffe Model Zoo)

▪ importCaffeLayers

▪ importCaffeNetwork

TensorFlow-Keras Model Importer

▪ importKerasLayers

▪ importKerasNetwork

network = importCaffeNetwork(protofile, 'yolo.caffemodel');
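A hedged sketch of both import paths; the file names are placeholders, and the Caffe and TensorFlow-Keras importer support packages are assumed to be installed:

% Import a Caffe model (prototxt and caffemodel file names are placeholders)
protofile = 'yolo_deploy.prototxt';
netCaffe  = importCaffeNetwork(protofile, 'yolo.caffemodel');

% Import a TensorFlow-Keras model saved as an HDF5 file (placeholder name)
netKeras  = importKerasNetwork('model.h5');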

Page 17

Deep Learning Inference Deployment

Trained DNN sources: Train in MATLAB | Model importer | Transfer learning from a reference model

Target libraries: NVIDIA TensorRT & cuDNN libraries | ARM Compute Library | Intel MKL-DNN library

Example application: Object detection

Page 18

Deep Learning Inference Deployment

Trained DNN sources: Train in MATLAB | Model importer | Transfer learning from a reference model

Target libraries: NVIDIA TensorRT & cuDNN libraries | ARM Compute Library | Intel MKL-DNN library

Page 19

Deep Learning Inference Deployment

Trained DNN sources: Train in MATLAB | Model importer | Transfer learning from a reference model

Target libraries: NVIDIA TensorRT & cuDNN libraries | ARM Compute Library | Intel MKL-DNN library

Page 20

Layered Architecture for SegNet: Semantic Segmentation

DAG Network

Total number of layers: 91
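A hedged sketch of constructing a SegNet-style DAG network in MATLAB; it assumes the Computer Vision System Toolbox function segnetLayers, and the image size and class count are placeholders rather than the exact values behind the 91-layer network shown here:

% Build SegNet encoder/decoder layers initialized from VGG-16 weights
imageSize  = [360 480 3];   % placeholder input size
numClasses = 11;            % placeholder number of segmentation classes
lgraph = segnetLayers(imageSize, numClasses, 'vgg16');

% Inspect the resulting DAG network architecture
analyzeNetwork(lgraph);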

Page 21

NVIDIA TensorRT: programmable inference accelerator

Supported NVIDIA platforms include Tesla V100, Tesla P4, Drive PX 2, Jetson TX2, and NVIDIA DLA.

Page 22

Performance Summary (VGG-16) on TitanXP

[Bar chart comparing VGG-16 inference performance on Titan Xp: MATLAB (cuDNN fp32), GPU Coder (cuDNN fp32), GPU Coder (TensorRT fp32), GPU Coder (TensorRT int8)]

Page 23

How Good is Generated Code Performance?

▪ Performance of CNN inference (Alexnet) on Titan XP GPU

▪ Performance of CNN inference (Alexnet) on Jetson (Tegra) TX2

Page 24

Alexnet Inference on NVIDIA Titan Xp

[Bar chart: frames per second vs. batch size for TensorFlow (1.6.0), MXNet (1.1.0), GPU Coder + cuDNN, GPU Coder + TensorRT (3.0.1), and GPU Coder + TensorRT (3.0.1, int8)]

Testing platform: Intel Xeon E5-1650 v4 CPU @ 3.60 GHz, NVIDIA Pascal Titan Xp GPU, cuDNN v7

Page 25

VGG-16 Inference on NVIDIA Titan Xp

[Bar chart: frames per second vs. batch size for TensorFlow (1.6.0), MXNet (1.1.0), GPU Coder + cuDNN, GPU Coder + TensorRT (3.0.1), and GPU Coder + TensorRT (3.0.1, int8)]

Testing platform: Intel Xeon E5-1650 v4 CPU @ 3.60 GHz, NVIDIA Pascal Titan Xp GPU, cuDNN v7

Page 26

Alexnet Inference on Jetson TX2: Frame-Rate Performance

[Bar chart: frames per second vs. batch size for C++ Caffe (1.0.0-rc5), GPU Coder + cuDNN, and GPU Coder + TensorRT]

Page 27

Brief Summary

DNN libraries are great for inference …

▪ GPU Coder generates code that takes advantage of:

– NVIDIA® CUDA libraries, including cuDNN and TensorRT

– Intel® Math Kernel Library for Deep Neural Networks (MKL-DNN)

– ARM® Compute libraries for mobile platforms

Page 28

Brief Summary

DNN libraries are great for inference … but applications require more than just inference.

Page 29

Deep Learning Workflows: Integrated Application Deployment

INTEGRATED APPLICATION DEPLOYMENT: Pre-processing + Trained DNN + Post-processing → codegen → Portable target code
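A hedged sketch of the integrated-application path; the entry-point function myApp (which would wrap pre-processing, the network's predict call, and post-processing), its input size, and the cuDNN choice are all illustrative:

% Configure GPU Coder for a standalone static library with the cuDNN deep learning target
cfg = coder.gpuConfig('lib');
cfg.DeepLearningConfig = coder.DeepLearningConfig('cudnn');

% Generate CUDA code for the whole application (placeholder input: 224x224x3 single image)
codegen -config cfg myApp -args {ones(224,224,3,'single')} -report

Unlike cnncodegen, this path compiles the pre- and post-processing MATLAB code together with the network into one portable target library.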

Page 30

Traffic sign detection and recognition

Pipeline: Object detection DNN (YOLO) → Strongest bounding box → Classifier DNN (recognition net)
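A hedged sketch of how such a two-network pipeline might be expressed as a single code-generation entry point; the MAT-file names, the box format, the classifier input size, and the selectStrongestBox helper are hypothetical:

function label = tsdrPipeline(img) %#codegen
% Illustrative two-stage pipeline: detect signs, keep the strongest box, classify it.

persistent detectNet classifyNet
if isempty(detectNet)
    % Load both trained networks inside the generated code (placeholder MAT-files)
    detectNet   = coder.loadDeepLearningNetwork('yoloNet.mat');
    classifyNet = coder.loadDeepLearningNetwork('recognitionNet.mat');
end

% Run the detection network and keep the strongest bounding box [x y w h]
detOut = detectNet.predict(img);
bbox   = selectStrongestBox(detOut);   % hypothetical post-processing helper

% Crop the detection, resize to the assumed classifier input size, and classify
patch = imresize(img(bbox(2):bbox(2)+bbox(4)-1, bbox(1):bbox(1)+bbox(3)-1, :), [48 48]);
[~, label] = max(classifyNet.predict(patch));
end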

Page 31

Traffic sign detection and recognition

Page 32

Traffic sign detection and recognition

Page 33

GPU Coder Helps You Deploy Applications to GPUs Faster

GPU Coder automates CUDA kernel creation, memory allocation, and data transfer minimization, using:

• Library function mapping

• Loop optimizations

• Dependence analysis

• Data locality analysis

• GPU memory allocation

• Data-dependence analysis

• Dynamic memcpy reduction

Page 34

CUDA Code Generation from GPU Coder app

Integrated editor and simplified workflow for code generation

Page 35

Summary: GPU Coder

MATLAB algorithm (functional reference) →

1. Functional test: desktop GPU, build type .mex, call CUDA from MATLAB directly

2. Deployment unit-test: desktop GPU (C++), build type .lib, call CUDA from a (C++) hand-coded main()

3. Deployment integration-test: desktop GPU (C++), build type .lib, call CUDA from a (C++) hand-coded main()

4. Real-time test: embedded GPU, build type cross-compiled .lib, call CUDA from a (C++) hand-coded main()
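A hedged sketch of the first stage in this progression, a MEX build for functional testing against the MATLAB reference; the entry-point name and input size are placeholders:

% Generate a CUDA MEX function so the algorithm can be called from MATLAB directly
cfg = coder.gpuConfig('mex');
cfg.DeepLearningConfig = coder.DeepLearningConfig('cudnn');
codegen -config cfg myAlgorithm -args {ones(224,224,3,'single')}

% Functional test: compare the MEX output against the MATLAB reference
outRef = myAlgorithm(ones(224,224,3,'single'));
outMex = myAlgorithm_mex(ones(224,224,3,'single'));

The later stages rebuild the same entry point with the 'lib' build type and as a cross-compiled library for the embedded GPU.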

Page 36

MATLAB Deep Learning Framework

Access Data → Design + Train → Deployment

▪ Manage large image sets

▪ Automate image labeling

▪ Easy access to models

▪ Acceleration with GPUs

▪ Scale to clusters

Deployment target libraries: NVIDIA TensorRT & cuDNN libraries | Intel MKL-DNN library | ARM Compute Library

Page 37

• Share your experience with MATLAB & Simulink on Social Media

▪ Use #MATLABEXPO

▪ I use #MATLAB because ……………………… Attending #MATLABEXPO

▪ Examples:

▪ I use #MATLAB because it helps me be a data scientist! Attending #MATLABEXPO

▪ Learning new capabilities in #MATLAB and #Simulink at #MATLABEXPO.

• Share your session feedback: Please fill in your feedback for this session in the feedback form

Speaker Details

Email: [email protected]

LinkedIn: https://www.linkedin.com/in/rishu-gupta-72148914/

Contact MathWorks India

Products/Training Enquiry Booth

Call: 080-6632-5749

Email: [email protected]