April 4-7, 2016 | Silicon Valley
Michael Andersch, 7th April 2016
NVIDIA GIE: HIGH-PERFORMANCE GPU INFERENCE ENGINE
WHAT IS INFERENCE, ANYWAYS?
Building a deep neural network based application
Step 1: Use data to train the neural network - training
Step 2: Use the neural network to process unseen data - inference
INFERENCE VS TRAINING
How is inference different from training?
1. No backpropagation / static weights
enables graph optimizations, simplifies memory management
2. Tendency towards smaller batch sizes
harder to amortize weight loading, achieve high GPU utilization
3. Reduced precision requirements
provides opportunity for BW savings and accelerated arithmetic
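To make point 3 concrete, here is a minimal, illustrative sketch (not NVIDIA's implementation) of reduced-precision inference arithmetic: FP32 values are mapped to 8-bit integers with a per-tensor scale, the dot product is accumulated in a wide integer, and the result is dequantized at the end. The scale values are assumptions chosen for this toy example.

```python
# Illustrative sketch of symmetric linear quantization to int8, with the
# dot product accumulated in a wide (32-bit) integer before dequantizing.

def quantize(values, scale):
    """Map floats to the int8 range [-127, 127] using a per-tensor scale."""
    return [max(-127, min(127, round(v / scale))) for v in values]

def int8_dot(a_fp32, b_fp32, a_scale, b_scale):
    """Dot product computed on quantized operands, dequantized at the end."""
    a_q = quantize(a_fp32, a_scale)
    b_q = quantize(b_fp32, b_scale)
    acc = sum(x * y for x, y in zip(a_q, b_q))  # fits in a 32-bit accumulator
    return acc * a_scale * b_scale

a = [0.5, -1.0, 0.25, 2.0]
b = [1.0, 0.5, -0.5, 0.125]
exact = sum(x * y for x, y in zip(a, b))
approx = int8_dot(a, b, a_scale=2.0 / 127, b_scale=1.0 / 127)
```

The quantized result closely tracks the FP32 result while moving one quarter of the bytes, which is where the bandwidth savings come from.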
OPTIMIZING SOFTWARE FOR INFERENCE
Extracting every bit of performance
What’s running on the GPU: cuDNN optimizations
Support for standard tensor layouts and major frameworks
Available automatically and “for free”
How you use it: Framework optimizations
Every last bit of performance matters
Challenging due to framework structure
Changes to one framework don’t propagate to others
OPTIMIZING SOFTWARE FOR INFERENCE
Challenge: Efficient small batch convolutions
Optimal convolution algorithm depends on convolution layer dimensions
Meta-parameters (data layouts, texture memory) afford higher performance
Using texture memory for convolutions: 13% inference speedup
(GoogLeNet, batch size 1)
[Figure: Winograd speedup over GEMM-based convolution for VGG-E layers conv 1.1 through conv 5.0 at N=1 — speedups range from 0.73x to 2.26x, with most layers around 1.8–2.3x]
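Since the optimal algorithm depends on the layer dimensions, a library has to pick per layer. Real frameworks query cuDNN for this (e.g. via its algorithm-selection API) or benchmark candidates; the heuristic below is purely hypothetical, just to illustrate the kind of dimension-driven choice being made.

```python
# Hypothetical dimension-driven chooser (for illustration only; real code
# queries cuDNN or benchmarks each candidate algorithm on the actual layer).
def choose_conv_algorithm(filter_size, batch, channels):
    """Pick a convolution algorithm from coarse layer dimensions."""
    if filter_size == 3 and batch <= 4:
        return "winograd"      # 3x3 filters at small batch favor Winograd
    if filter_size >= 7:
        return "fft"           # large filters amortize FFT overhead
    if batch == 1 and channels < 64:
        return "direct"        # tiny problems: avoid im2col overhead
    return "implicit_gemm"     # robust default

algo = choose_conv_algorithm(filter_size=3, batch=1, channels=256)
```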
OPTIMIZING SOFTWARE FOR INFERENCE
Challenge: Graph optimization

[Figure: GoogLeNet-style inception module — the input tensor feeds four parallel branches (a 1x1 conv.; a 1x1 conv. followed by a 3x3 conv.; a 1x1 conv. followed by a 5x5 conv.; a max pool followed by a 1x1 conv.), whose outputs are concatenated into the output tensor]
OPTIMIZING SOFTWARE FOR INFERENCE
Challenge: Graph optimization

[Figure: the same inception module with every convolution expanded into separate convolution, bias, and ReLU operations — a long chain of many small kernels between the input and the next layer's input]
OPTIMIZING SOFTWARE FOR INFERENCE
Graph optimization: Vertical fusion

[Figure: each convolution + bias + ReLU chain is fused into a single "CBR" kernel — the module becomes 1x1 CBR, 3x3 CBR, 5x5 CBR, and 1x1 CBR branches (plus 1x1 CBR reductions and the max pool) feeding the concat]
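Vertical fusion can be sketched on a toy op list (an assumed representation, not GIE's internal graph): consecutive convolution → bias → ReLU triples collapse into one fused "CBR" node, saving two kernel launches and two round trips to memory per triple.

```python
# Sketch of vertical fusion: collapse each conv -> bias -> relu chain
# into a single fused "CBR" node.
def fuse_vertical(ops):
    fused, i = [], 0
    while i < len(ops):
        if (i + 2 < len(ops) and ops[i].endswith("conv")
                and ops[i + 1] == "bias" and ops[i + 2] == "relu"):
            fused.append(ops[i].replace("conv", "CBR"))  # one fused kernel
            i += 3
        else:
            fused.append(ops[i])
            i += 1
    return fused

branch = ["1x1 conv", "bias", "relu", "3x3 conv", "bias", "relu"]
fused = fuse_vertical(branch)  # six ops become two
```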
OPTIMIZING SOFTWARE FOR INFERENCE
Graph optimization: Horizontal fusion

[Figure: the 1x1 CBR layers that read the same input tensor are merged into a single wider 1x1 CBR layer; the remaining branches (3x3 CBR, 5x5 CBR, and the 1x1 CBR after the max pool) are unchanged]
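Horizontal fusion can be sketched the same way (again an assumed toy graph format, not GIE's): layers that share the same source tensor and the same configuration merge into one wider layer, so the shared input is read once instead of once per branch.

```python
# Sketch of horizontal fusion: layers are (name, config, source) tuples.
# Layers with identical (config, source) merge into one wider layer.
def fuse_horizontal(layers):
    groups = {}
    for name, config, src in layers:
        groups.setdefault((config, src), []).append(name)
    return [("+".join(names), config, src)
            for (config, src), names in groups.items()]

layers = [("a", "1x1 CBR", "input"),
          ("b", "3x3 CBR", "input"),
          ("c", "1x1 CBR", "input")]
merged = fuse_horizontal(layers)  # "a" and "c" become one wider layer
```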
OPTIMIZING SOFTWARE FOR INFERENCE
Graph optimization: Concat elision

[Figure: the concat layers are removed — each branch writes its output directly into the correct region of the next layer's input buffer]
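The idea behind concat elision, sketched with an illustrative memory layout (not GIE's): rather than computing each branch into its own buffer and then copying everything into a concatenated tensor, each branch writes directly at its offset into one pre-allocated output buffer, and the copy disappears.

```python
# Sketch of concat elision: branches write in place at fixed offsets of a
# single pre-allocated buffer, so no separate concat copy is needed.
def run_branches_into(output, branches):
    """branches: list of (offset, values) pairs written in place."""
    for offset, values in branches:
        output[offset:offset + len(values)] = values
    return output

out = [0.0] * 6
run_branches_into(out, [(0, [1.0, 2.0]),      # e.g. 1x1 CBR output
                        (2, [3.0]),           # e.g. 3x3 CBR output
                        (3, [4.0, 5.0, 6.0])])  # e.g. 5x5 CBR output
```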
OPTIMIZING SOFTWARE FOR INFERENCE
Graph optimization: Concurrency

[Figure: the remaining branches have no data dependencies on one another and can execute concurrently]
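Because the surviving branches have no data dependencies, they can run concurrently — on the GPU this would use separate CUDA streams; the sketch below uses Python threads purely to illustrate the dependency structure, and the branch function is a stand-in.

```python
# Independent branches of the graph can run concurrently (CUDA streams on
# the GPU; plain threads here, just to show the dependency structure).
from concurrent.futures import ThreadPoolExecutor

def branch(name, x):
    return (name, x * 2)  # stand-in for one CBR branch

with ThreadPoolExecutor() as pool:
    futures = [pool.submit(branch, n, 10) for n in ("1x1", "3x3", "5x5")]
    results = dict(f.result() for f in futures)
```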
OPTIMIZING SOFTWARE FOR INFERENCE
Challenge: Effective use of cuBLAS intrinsics
Run GEMV instead of GEMM
Small batch sizes shrink the N dimension: the B (activation) matrix becomes narrow
Pre-transpose weight matrices
Allows using NN/NT GEMM variants, where performance ranks NT > NN > TN
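Why GEMV applies: at batch size 1 the activation "matrix" has a single column, so a fully connected layer y = W·x degenerates from a matrix-matrix product into a matrix-vector product, and a GEMV kernel avoids the wasted tiling of a full GEMM. A pure-Python sketch showing the two agree in that case:

```python
# At batch size 1 the activation matrix has one column, so the fully
# connected layer y = W @ x is really a matrix-vector product (GEMV).
def gemv(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def gemm(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

W = [[1, 2], [3, 4], [5, 6]]   # 3x2 weight matrix
x = [10, 1]                    # single activation vector (batch = 1)
as_gemm = gemm(W, [[v] for v in x])   # B is a narrow 2x1 matrix
as_gemv = gemv(W, x)
```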
ACCELERATED INFERENCE ON PASCAL
Support for fast mixed precision arithmetic
Inference products will support a new dedicated vector math instruction
Multi-element dot product, 8-bit integer inputs, 32-bit accumulator
4x the rate of equivalent FP32 operations
Full-speed FP32 processing for any layers that require higher precision
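The instruction described above corresponds to Pascal's 4-way 8-bit dot product (DP4A): four int8 × int8 products summed into a 32-bit accumulator in a single operation. Emulated here in Python to show the semantics; the real instruction runs in hardware.

```python
# Emulation of a 4-way 8-bit dot-product instruction: four int8 products
# summed into a 32-bit accumulator in one operation.
def dp4a(a4, b4, acc):
    """acc += dot(a4, b4), with int8 inputs and a wide accumulator."""
    assert all(-128 <= v <= 127 for v in a4 + b4)
    return acc + sum(x * y for x, y in zip(a4, b4))

# A 12-element int8 dot product issues as three dp4a operations.
a = [1, -2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
b = [12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
acc = 0
for i in range(0, len(a), 4):
    acc = dp4a(a[i:i + 4], b[i:i + 4], acc)
```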
BUT WHO WILL IMPLEMENT IT?
Introducing NVIDIA GIE: GPU Inference Engine
[Diagram: GIE components — OPTIMIZATION ENGINE, STRATEGY, EXECUTION ENGINE]
GPU INFERENCE ENGINE WORKFLOW
[Diagram: a trained network from DIGITS or other training tools enters the OPTIMIZATION ENGINE, which emits a STRATEGY that the EXECUTION ENGINE runs at deployment time]
SUMMARY
Inference on the GPU
GPUs are a great platform for inference
Efficiency: great performance/watt
Scalability: from 3W to 300W
GPU-based inference affords …
… the same performance in a much tighter power envelope
… while freeing up the CPU to do other work
Questions: [email protected], or find me after the talk!
Tesla M4 Hyperscale Accelerator
THANK YOU
JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join