
April 4-7, 2016 | Silicon Valley

Michael Andersch, 7th April 2016

NVIDIA GIE: HIGH-PERFORMANCE GPU INFERENCE ENGINE


WHAT IS INFERENCE, ANYWAYS?

Building a deep neural network based application

Step 1: Use data to train the neural network - training

Step 2: Use the neural network to process unseen data - inference
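The two steps can be sketched with a toy model; a minimal NumPy logistic-regression example (purely illustrative, not the deep CNNs the talk targets), where training iteratively updates weights and inference is just a forward pass with those weights frozen:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: label is 1 when the sum of the features is positive.
X_train = rng.normal(size=(200, 4))
y_train = (X_train.sum(axis=1) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Step 1: training -- repeatedly update the weights via gradient descent.
w = np.zeros(4)
for _ in range(500):
    p = sigmoid(X_train @ w)
    w -= 0.1 * X_train.T @ (p - y_train) / len(y_train)

# Step 2: inference -- weights are frozen; just run the forward pass
# on unseen data.
X_new = rng.normal(size=(50, 4))
predictions = sigmoid(X_new @ w) > 0.5
accuracy = np.mean(predictions == (X_new.sum(axis=1) > 0))
```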


INFERENCE VS TRAINING
How is inference different from training?

1. No backpropagation / static weights

enables graph optimizations, simplifies memory management

2. Tendency towards smaller batch sizes

harder to amortize weight loading, achieve high GPU utilization

3. Reduced precision requirements

provides opportunity for bandwidth savings and accelerated arithmetic
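As a hedged illustration of point 3, symmetric linear quantization maps FP32 weights to INT8 plus a scale factor; this shows the general reduced-precision idea, not NVIDIA's exact calibration scheme:

```python
import numpy as np

def quantize_int8(w):
    """Map a float32 tensor to int8 values plus a per-tensor scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(1).normal(size=(256, 256)).astype(np.float32)
q, scale = quantize_int8(w)

# 4x less memory traffic for the weights ...
assert q.nbytes == w.nbytes // 4
# ... at a small, bounded error (half a quantization step).
err = np.abs(dequantize(q, scale) - w).max()
```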


OPTIMIZING SOFTWARE FOR INFERENCE
Extracting every bit of performance

What’s running on the GPU: cuDNN optimizations

Support for standard tensor layouts and major frameworks

Available automatically and “for free”

How you use it: Framework optimizations

Every last bit of performance matters

Challenging due to framework structure

Changes to one framework don’t propagate to others


OPTIMIZING SOFTWARE FOR INFERENCE
Challenge: Efficient small-batch convolutions

Optimal convolution algorithm depends on convolution layer dimensions

Meta-parameters (data layouts, texture memory) afford higher performance

Using texture memory for convolutions: 13% inference speedup

(GoogLeNet, batch size 1)

[Chart] Winograd speedup over GEMM-based convolution (VGG-E layers, N=1): conv 1.1: 0.73x, conv 1.2: 1.84x, conv 2.1: 1.83x, conv 2.2: 2.03x, conv 3.1: 2.07x, conv 3.2: 2.26x, conv 4.1: 1.92x, conv 4.2: 1.98x, conv 5.0: 1.25x
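For intuition on where those Winograd speedups come from, here is the classic 1D F(2,3) building block (Lavin-Gray formulation): two outputs of a 3-tap filter computed with 4 multiplies in the transform domain instead of 6. This is a self-contained sketch, not cuDNN's implementation:

```python
import numpy as np

# Winograd F(2,3) transform matrices: m = A^T [ (G g) * (B^T d) ],
# where * is an elementwise product in the transform domain.
BT = np.array([[1, 0, -1, 0],
               [0, 1, 1, 0],
               [0, -1, 1, 0],
               [0, 1, 0, -1]], dtype=float)
G = np.array([[1, 0, 0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0, 0, 1]], dtype=float)
AT = np.array([[1, 1, 1, 0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f23(d, g):
    """Two outputs of correlating 4 inputs d with a 3-tap filter g."""
    return AT @ ((G @ g) * (BT @ d))

d = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([0.5, 1.0, -1.0])

# Reference: direct sliding-window correlation (no filter flip).
direct = np.array([d[0:3] @ g, d[1:4] @ g])
```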


OPTIMIZING SOFTWARE FOR INFERENCE
Challenge: Graph optimization

[Diagram: GoogLeNet Inception module: the input feeds parallel 1x1, 3x3, and 5x5 convolution branches plus a max-pool branch (with 1x1 convolutions for dimension reduction), concatenated into the output tensor]


OPTIMIZING SOFTWARE FOR INFERENCE
Challenge: Graph optimization

[Diagram: the same Inception module with every convolution expanded into its constituent convolution, bias, and ReLU layers, showing the many small kernels a naive execution would launch between the input and the concat feeding the next input]


OPTIMIZING SOFTWARE FOR INFERENCE
Graph optimization: Vertical fusion

[Diagram: vertical fusion: each convolution + bias + ReLU chain collapsed into a single CBR node, leaving 1x1, 3x3, and 5x5 CBR branches plus the max pool (with its 1x1 CBRs) feeding the concat]
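The arithmetic behind vertical fusion can be sketched in NumPy, treating a 1x1 convolution as a per-pixel matmul (an assumption made here for brevity): the unfused version makes three passes, each reading and writing a full intermediate tensor, while the fused CBR makes one:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 16))      # 64 pixels, 16 input channels
W = rng.normal(size=(16, 32))      # 1x1 convolution == per-pixel matmul
b = rng.normal(size=(32,))

# Unfused: three passes (three kernel launches on a GPU), each one a
# round trip through memory for the intermediate tensor.
t1 = x @ W
t2 = t1 + b
unfused = np.maximum(t2, 0.0)

# Fused CBR: one pass; intermediates conceptually stay in registers.
fused = np.maximum(x @ W + b, 0.0)
```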


OPTIMIZING SOFTWARE FOR INFERENCE
Graph optimization: Horizontal fusion

[Diagram: horizontal fusion: the 1x1 CBR layers that read the same input merged into one wider 1x1 CBR alongside the 3x3 CBR, 5x5 CBR, and max-pool branches]
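A minimal sketch of horizontal fusion, again modeling 1x1 convolutions as per-pixel matmuls: three layers reading the same input become one wider matmul whose output is split along the channel axis:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 16))                  # shared input tensor
W1, W2, W3 = (rng.normal(size=(16, c)) for c in (8, 12, 4))

# Separate: three launches over the same input.
y1, y2, y3 = x @ W1, x @ W2, x @ W3

# Horizontally fused: one wider matmul, then split the channels.
Wf = np.concatenate([W1, W2, W3], axis=1)      # shape (16, 24)
yf = x @ Wf
z1, z2, z3 = np.split(yf, [8, 20], axis=1)
```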


OPTIMIZING SOFTWARE FOR INFERENCE
Graph optimization: Concat elision

[Diagram: concat elision: the concat layer removed; each branch (1x1 CBR, 3x3 CBR, 5x5 CBR, max pool) writes directly into the next layer's input buffer]
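Concat elision amounts to writing each branch's output directly into its slice of a preallocated buffer, so no copy kernel is needed. A hedged NumPy sketch (two branches instead of four, for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 16))
Wa = rng.normal(size=(16, 8))
Wb = rng.normal(size=(16, 24))

# With an explicit concat: each branch allocates its own output, then a
# separate copy assembles them into one tensor.
concat = np.concatenate([x @ Wa, x @ Wb], axis=1)

# Concat elided: preallocate the concatenated buffer and have each branch
# write directly into its channel slice.
out = np.empty((64, 32))
np.matmul(x, Wa, out=out[:, :8])
np.matmul(x, Wb, out=out[:, 8:])
```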


OPTIMIZING SOFTWARE FOR INFERENCE
Graph optimization: Concurrency

[Diagram: concurrency: the independent branches (1x1 CBR, 3x3 CBR, 5x5 CBR, max pool) issued concurrently, since none depends on another's output]
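As a rough CPU analogy for stream-level concurrency (on the GPU this would be separate CUDA streams, not Python threads): the branches share an input but have no dependencies on each other, so they can be issued in parallel:

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 16))
weights = [rng.normal(size=(16, c)) for c in (8, 8, 8, 8)]

def branch(W):
    # Each independent branch only reads the shared input.
    return x @ W

# Issue the four independent branches concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    outputs = list(pool.map(branch, weights))

# Same results as running them one after another.
serial = [x @ W for W in weights]
```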


OPTIMIZING SOFTWARE FOR INFERENCE
Challenge: Effective use of cuBLAS intrinsics

Run GEMV instead of GEMM

Small batch sizes shrink the GEMM's N dimension

The B matrix becomes narrow

Pre-transpose weight matrices

Allows using NN/NT GEMM, where NT > NN > TN
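The GEMV point can be made concrete in NumPy: at batch size 1 the fully connected layer's GEMM degenerates to a matrix-vector product, and because inference weights are static, a transposed copy can be prepared once offline so the product reads memory with friendlier strides (the layout benefit itself is a cuBLAS-level detail not visible in NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4096, 1024))   # (output features, input features)
x = rng.normal(size=(1024,))        # single-sample activation vector

gemm_n1 = (W @ x.reshape(1024, 1)).ravel()   # GEMM with N == 1
gemv = W @ x                                  # the same math as a GEMV

# Pre-transposed weights, computed once since weights never change:
WT = np.ascontiguousarray(W.T)
via_transposed = x @ WT                       # identical result
```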


ACCELERATED INFERENCE ON PASCAL
Support for fast mixed-precision arithmetic

Inference products will support a new dedicated vector math instruction

Multi-element dot product, 8-bit integer inputs, 32-bit accumulator

4x the rate of equivalent FP32 operations

Full-speed FP32 processing for any layers that require higher precision
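The instruction's semantics can be simulated in NumPy: a 4-element dot product of 8-bit integers accumulated into a 32-bit integer (a behavioral sketch only; the real instruction executes in hardware):

```python
import numpy as np

def dp4a(a, b, c):
    """c += dot(a, b) with int8 inputs and a 32-bit accumulator."""
    assert a.dtype == np.int8 and b.dtype == np.int8
    # Widen to int32 before multiplying so products cannot overflow.
    return c + int(np.dot(a.astype(np.int32), b.astype(np.int32)))

a = np.array([127, -128, 3, 4], dtype=np.int8)
b = np.array([2, 1, -1, 100], dtype=np.int8)
acc = dp4a(a, b, 0)
# 127*2 + (-128)*1 + 3*(-1) + 4*100 = 254 - 128 - 3 + 400 = 523
```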


BUT WHO WILL IMPLEMENT IT?
Introducing NVIDIA GIE: GPU Inference Engine

[Diagram: GIE components: STRATEGY, OPTIMIZATION ENGINE, EXECUTION ENGINE]


GPU INFERENCE ENGINE WORKFLOW

[Diagram: GIE workflow connecting DIGITS TRAINING TOOLS, the OPTIMIZATION ENGINE, STRATEGY, and the EXECUTION ENGINE]


SUMMARY
Inference on the GPU

GPUs are a great platform for inference

Efficiency: great performance/watt

Scalability: from 3W to 300W

GPU-based inference affords …

… the same performance in a much tighter power envelope

… freeing up the CPU to do other work

Questions: [email protected], or find me after the talk!

Tesla M4 Hyperscale Accelerator


April 4-7, 2016 | Silicon Valley

THANK YOU

JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join