© 2017 The MathWorks, Inc.
Deep Learning in MATLAB: From Concept to CUDA Code
Roy Fahn
Applications Engineer
Systematics
royf@systematics.co.il
03-7660111
Ram Kokku
Principal Engineer
MathWorks
ram.kokku@mathworks.com
Talk Outline
Design Deep Learning & Vision Algorithms
• Manage large image sets
• Automate image labeling
• Easy access to models
• Pre-built training frameworks
Accelerate and Scale Training
• Acceleration with GPUs
• Scale to clusters
High Performance Deployment
• Automate compilation with GPU Coder
• On Titan Xp: 7x faster than TensorFlow, 5x faster than pyCaffe2
• On Jetson: on par with TensorRT, 2x faster than C++ Caffe
Example: Transfer Learning Workflow
Workflow: Load Reference Network → Modify Network Structure → Learn New Weights → New Classifier
Training Data: Images + Labels
Labels: Cars, Trucks, BigTrucks, SUVs, Vans
Example: Transfer Learning in MATLAB
Set up training dataset:
• Split, shuffle, and re-arrange images
• Read images; apply data augmentation (clip, rotate, resize, etc.)
Easily manage large sets of images:
• A single line of code to access images
• Operates on disk, database, or big-data file system
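The dataset setup above can be sketched with `imageDatastore` (a minimal sketch; the folder name `vehicleImages`, the 80/20 split, and the 227x227 resize are assumptions, not values from the talk):

```matlab
% One line of code to access a large labeled image set:
% one subfolder per class, folder names become labels.
imds = imageDatastore('vehicleImages', ...
    'IncludeSubfolders', true, ...
    'LabelSource', 'foldernames');

% Shuffle and split into training and validation sets
imds = shuffle(imds);
[imdsTrain, imdsVal] = splitEachLabel(imds, 0.8, 'randomized');

% Custom read function: resize every image to the network's input size
imdsTrain.ReadFcn = @(f) imresize(imread(f), [227 227]);
```

The datastore reads lazily, so the same code works whether the images live on disk, in a database, or on a big-data file system.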
Example: Transfer Learning in MATLAB
Load reference network (after setting up the training dataset).
Three ways to create DNNs in MATLAB:
1. Easy access to research models
2. Caffe model importer
3. Build from scratch
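The three routes can be sketched as follows (assumes the AlexNet support package is installed; the Caffe file names are placeholders):

```matlab
% 1. Easy access to research models (requires the AlexNet support package)
net = alexnet;

% 2. Caffe model importer (file names below are placeholders)
% net = importCaffeNetwork('deploy.prototxt', 'weights.caffemodel');

% 3. Build from scratch out of layer objects
% layers = [imageInputLayer([227 227 3]); convolution2dLayer(11, 96); ...];

net.Layers   % inspect the loaded architecture
```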
Example: Transfer Learning in MATLAB
Modify network structure (after loading the reference network and setting up the training dataset).
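Modifying the network for the five new vehicle classes might look like this (a sketch; the indices 23 and 25 assume AlexNet's standard 25-layer series architecture):

```matlab
net = alexnet;                          % reference network
layers = net.Layers;                    % copy the layer array
layers(23) = fullyConnectedLayer(5);    % replace fc8: 1000 classes -> 5 classes
layers(25) = classificationLayer;       % fresh output layer for the new labels
```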
Example: Transfer Learning in MATLAB
Learn new weights (after modifying the network structure, loading the reference network, and setting up the training dataset).
Many more training options are available.
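Learning the new weights then comes down to `trainingOptions` plus `trainNetwork` (a sketch; all hyperparameter values are assumptions, and `imdsTrain`/`layers` are the datastore and modified layer array from the earlier steps):

```matlab
opts = trainingOptions('sgdm', ...
    'InitialLearnRate', 1e-4, ...      % small rate: fine-tuning, not training from scratch
    'MaxEpochs', 20, ...
    'MiniBatchSize', 64, ...
    'Plots', 'training-progress');     % live training-accuracy visualization
trainedNet = trainNetwork(imdsTrain, layers, opts);
```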
Deep learning on CPU, GPU, multi-GPU and clusters
Scale along two axes: more GPUs and more CPUs.
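Scaling from CPU to GPU to multi-GPU to cluster is a single option in `trainingOptions` (sketch):

```matlab
% One flag selects where training runs:
opts = trainingOptions('sgdm', ...
    'ExecutionEnvironment', 'multi-gpu');  % or 'cpu', 'gpu', 'parallel'
% 'parallel' targets a local or cluster parallel pool (requires
% Parallel Computing Toolbox; clusters additionally need MDCS).
```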
Visualizing and Debugging Intermediate Results
• Filters
• Layer activations
• Feature visualization
• Deep Dream
• Training accuracy visualization
Many options for visualization and debugging, with examples to get started.
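A few of these visualizations, sketched in MATLAB (layer names assume AlexNet; `peppers.png` is an image that ships with MATLAB):

```matlab
net = alexnet;
img = imresize(imread('peppers.png'), [227 227]);

w   = net.Layers(2).Weights;               % conv1 filter weights
act = activations(net, img, 'conv1');      % layer activations for one image
dd  = deepDreamImage(net, 'conv5', 1:4);   % Deep Dream feature visualization
```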
GPU Coder for Deployment: New Product in R2017b
GPU Coder: accelerated implementation of parallel algorithms on GPUs.
• Neural networks: deep learning, machine learning (7x faster than state of the art)
• Image processing and computer vision: image filtering, feature detection/extraction (700x faster than CPUs for feature extraction)
• Signal processing and communications: FFT, filtering, cross-correlation (20x faster than CPUs for FFTs)
GPU Coder Compilation Flow
GPU Coder performs CUDA kernel creation, memory allocation, and data transfer minimization:
• Library function mapping
• Loop optimizations
• Dependence analysis
• Data locality analysis
• GPU memory allocation
• Data-dependence analysis
• Dynamic memcpy reduction
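Invoking this flow from MATLAB is a two-liner (sketch; the entry-point name `myAlgorithm` and the input size are placeholders):

```matlab
cfg = coder.gpuConfig('mex');   % build types: 'mex', 'lib', 'dll', 'exe'
codegen -config cfg myAlgorithm -args {ones(480, 640, 3, 'single')}
```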
GPU Coder Generates CUDA from MATLAB: saxpy
Both scalarized and vectorized MATLAB compile to a CUDA kernel for GPU parallelization.
Loops and matrix operations are directly compiled into kernels.
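The saxpy example in both styles (a sketch of the standard saxpy kernel, not the talk's exact code):

```matlab
function y = saxpy(a, x, y)   % scalarized MATLAB: the loop becomes a CUDA kernel
for i = 1:numel(x)
    y(i) = a * x(i) + y(i);
end
end

% Vectorized MATLAB equivalent; compiles to a kernel as well:
%   y = a .* x + y;
```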
Generated CUDA Optimized for Memory Performance
Example: Mandelbrot space, compiled to a CUDA kernel for GPU parallelization.
Kernel data allocation is automatically optimized.
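A Mandelbrot iteration-count routine of the kind shown can be sketched as (the escape radius 2 and the iteration scheme are the standard formulation, not taken from the slide):

```matlab
function count = mandelbrotCount(x0, y0, maxIter)
% x0, y0: matrices of real/imaginary coordinates covering the image grid
z = complex(x0, y0);
c = z;
count = zeros(size(z));
for n = 1:maxIter
    z = z.*z + c;                    % Mandelbrot iteration
    count = count + (abs(z) <= 2);   % accumulate iterations before escape
end
end
```

The element-wise operations over the whole grid map naturally onto one CUDA kernel per pixel.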
Algorithm Design to Embedded Deployment Workflow
1. Functional test: MATLAB algorithm (functional reference)
2. Deployment unit-test: desktop GPU, C++ (build type .mex: call CUDA from MATLAB directly)
3. Deployment integration-test: desktop GPU, C++ (build type .lib: call CUDA from (C++) hand-coded main())
4. Real-time test: embedded GPU (build type cross-compiled .lib: call CUDA from (C++) hand-coded main())
Demo: Alexnet Deployment with ‘mex’ Code Generation
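The demo pattern for ‘mex’ generation looks roughly like this (sketch; the entry-point name `alexnet_predict` is a placeholder):

```matlab
function out = alexnet_predict(in) %#codegen
persistent net;                    % load the network once, reuse across calls
if isempty(net)
    net = coder.loadDeepLearningNetwork('alexnet');
end
out = net.predict(in);
end

% At the MATLAB prompt:
%   cfg = coder.gpuConfig('mex');
%   codegen -config cfg alexnet_predict -args {ones(227, 227, 3, 'single')}
```

The generated mex file is then called from MATLAB exactly like the original function, which is what makes the unit-test stage a drop-in comparison.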
Algorithm Design to Embedded Deployment on Tegra GPU
1. Functional test: test in MATLAB on host
2. Deployment unit-test: test generated code in MATLAB on host + Tesla GPU, C++ (build type .mex: call CUDA from MATLAB directly)
3. Deployment integration-test: test generated code within a C/C++ app on host + Tesla GPU, C++ (build type .lib: call CUDA from (C++) hand-coded main())
4. Real-time test: test generated code within a C/C++ app on the Tegra target, Tegra GPU (build type cross-compiled .lib: call CUDA from (C++) hand-coded main(); cross-compiled on host with the Linaro toolchain)
Alexnet Deployment to Tegra: Cross-Compiled with ‘lib’
Two small changes:
1. Change build-type to ‘lib’
2. Select cross-compile toolchain
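Those two changes, sketched (the toolchain name string varies by installation and is an assumption here, as is the entry-point name):

```matlab
cfg = coder.gpuConfig('lib');              % 1. build type 'lib'
cfg.Toolchain = 'Linaro Toolchain v4.9';   % 2. cross-compile toolchain (install-specific name)
codegen -config cfg alexnet_predict -args {ones(227, 227, 3, 'single')}
```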
End-to-End Application: Lane Detection
Pipeline: Image → Lane detection CNN (transfer learning from Alexnet) → left/right lane coefficients → Post-processing (find left/right lane points) → Image with marked lanes
The CNN output is the lane parabola coefficients in y = ax^2 + bx + c.
GPU Coder generates code for the whole application.
https://tinyurl.com/ybaxnxjg
https://devblogs.nvidia.com/parallelforall/deep-learning-automated-driving-matlab/
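The post-processing stage might be sketched like this (variable names and the sampling are placeholders; only the parabola model y = ax^2 + bx + c comes from the slide):

```matlab
function pts = lanePoints(coeffs, x)
% coeffs = [a b c]; evaluate y = a*x.^2 + b*x + c at the sample points x
y = polyval(coeffs, x);
pts = [x(:), y(:)];
end

% Usage: overlay both detected lanes on the frame
%   x   = linspace(1, width, 50);
%   img = insertShape(img, 'Line', reshape(lanePoints(leftCoeffs,  x)', 1, []));
%   img = insertShape(img, 'Line', reshape(lanePoints(rightCoeffs, x)', 1, []));
```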
How Good Is Generated Code Performance?
• Performance of image processing and computer vision
• Performance of CNN inference (Alexnet) on Titan XP GPU
• Performance of CNN inference (Alexnet) on Jetson (Tegra) TX2
GPU Coder for Image Processing and Computer Vision
• Distance transform
• Fog removal
• SURF feature extraction
• Ray tracing
• Stereo disparity
Orders-of-magnitude speedup over CPU.
Alexnet Inference on NVIDIA Titan XP
[Chart: frames per second vs. batch size]
Frameworks compared: MATLAB GPU Coder (R2017b), MATLAB (R2017b), TensorFlow (1.2.0), Caffe2 (0.8.1), mxNet (0.10)
Annotated speedups: 2x, 5x, and 7x (7x vs. TensorFlow and 5x vs. Caffe2, per the outline slide)
Testing platform: Intel Xeon E5-1650 v3 @ 3.50 GHz CPU; Pascal Titan Xp GPU; cuDNN v5
Alexnet Inference on Jetson TX2: Frame-Rate Performance
[Chart: frames per second (0-400) vs. batch size (1, 16, 32, 64, 128, 256)]
Frameworks compared: MATLAB GPU Coder (R2017b), C++ Caffe (1.0.0-rc5), TensorRT (2.1)
Annotated speedups: 2x vs. C++ Caffe; 0.85x vs. TensorRT
Alexnet Inference on Jetson TX2: Memory Performance
[Chart: peak memory (MB) vs. batch size]
Frameworks compared: MATLAB GPU Coder (R2017b), C++ Caffe (1.0.0-rc5), TensorRT 2.1 (using the giexec wrapper)
Design Your DNNs in MATLAB, Deploy with GPU Coder
Design Deep Learning & Vision Algorithms
• Manage large image sets
• Automate image labeling
• Easy access to models
• Pre-built training frameworks
Accelerate and Scale Training
• Acceleration with GPUs
• Scale to clusters
High Performance Deployment
• Automate compilation with GPU Coder
• On Titan Xp: 7x faster than TensorFlow, 5x faster than pyCaffe2
• On Jetson TX2: on par with TensorRT, 2x faster than C++ Caffe
Check Out Deep Learning in MATLAB and GPU Coder
GPU Coder: https://www.mathworks.com/products/gpu-coder.html
Deep learning in MATLAB: https://www.mathworks.com/solutions/deep-learning.html
Systematics events: http://www.systematics.co.il/mwevents