Jeff Johnson, Research Engineer, Facebook at MLconf NYC

Post on 15-Jul-2015



Hacking GPUs for Deep Learning

MLConf New York

Jeff Johnson

Facebook AI Research

Deep (convolutional) Neural Networks

Revolution in machine learning

Convolution: since 1980s. Deep: flops since 2000s

Avoid feature engineering

▪ With enough data, let the network discover features


▪ Can work even for NLP. No word segmentation, use raw character data.

2D Convolutional Nets (images)

LeCun, Bottou, Bengio and Haffner, 1998

Krizhevsky, Sutskever and Hinton, 2012

2D Convolutional Nets

Progress towards smaller kernels and deeper nets

Network architecture    ImageNet 1000-class top-5 error
AlexNet                 ~15%
OverFeat                ~13%
ZeilerNet               ~11%
Oxford-VGG              ~7%
GoogLeNet               ~6%, ~4.5%
PReLU (MSR)             ~4.9%
Human performance       3-5%

3D Convolutional Nets (videos)

C3D (Tran et al., 2014)

DeepVideo (Karpathy et al., 2014)

1D Convolutional Nets (text, sequences)

Collobert et al., 2011

Zhang and LeCun, 2015

RNNs and LSTMs (text, sequences)

Graves, Mohamed and Hinton, 2013

Mikolov, 2014

Deep Neural Networks

Supervised learning. Unsupervised ???

Train with back-propagation/SGD variants

Strong scaling is unsolved

▪ Distributed parameter space exploration (e.g., Hogwild!; Niu et al. 2011)

▪ Distributed hyperparameter space exploration (e.g., Bayesian optimization; Snoek et al. 2012)


Deep nets are flop eaters

Convolutions are expensive

Pointwise calculations (log/exp, ReLU, */+, ...)

Neighborhood reductions (pooling, convolution)

Scaling network parameters

increased learning capacity; overfitting

more training data (real or synthetic), regularization required

Deep nets are bandwidth eaters

More parameters = more memory, data to exchange

Barrier to cross-machine parallelism

▪ periodic exchanges, compression, quantization

Increase reuse of memory while local?

▪ interspersed reductions are resistant to fusion of


▪ generalized programming language problem

Deep nets are latency sensitive

Serial dependency of training

fprop => bprop => fprop => ...

Serial dependency of multi-layer networks

layer 1 => layer 2 => layer 3 => ...

Multiple path dependent networks (RNNs, multi-layer


Deep nets are also small?

Deeper = smaller feature planes, more of them

input R^m => expand to R^n => non-lin => reduce to


Problems are tiny in HPC terms

4096×4096 FFT, FE/PDE on massive grids, ...

NLP tasks can be sparse

Setup/kernel launch latency on GPU can dominate


The tools

Vector processors

SIMD: Single Instruction, Multiple Data

Serial processor with ability to operate on more than one piece of data concurrently

Cray-1 (1976)

Vector processors

Hard to use: instructions only operate on 4, 8, 16, ... pieces of data at a time. Boundary/alignment effects. Great if your vectors are large, but...

float* a = ...; // is this aligned ((uintptr_t)a % 16 == 0)?
float* b = ...; // is this aligned ((uintptr_t)b % 16 == 0)?
for (int i = 0; i < 18; ++i) { // how to handle [16, 17]?
  b[i] += a[i]; // SIMD this?!? masking/loop epilogue
}
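The "loop epilogue" option mentioned in the comment can be sketched as follows. This is an illustration only, written in Python rather than with real SIMD intrinsics: `VECTOR_WIDTH` and `vector_add_inplace` are hypothetical names, and a list slice stands in for one vector instruction.

```python
VECTOR_WIDTH = 16  # hypothetical number of lanes per vector instruction


def vector_add_inplace(b, a, width=VECTOR_WIDTH):
    """Add a into b using full-width chunks, then a scalar epilogue."""
    n = len(b)
    full = n - n % width  # elements covered by whole vectors
    vector_ops = 0
    for start in range(0, full, width):
        # One "vector instruction": all lanes of the chunk updated together.
        chunk_b = b[start:start + width]
        chunk_a = a[start:start + width]
        b[start:start + width] = [x + y for x, y in zip(chunk_b, chunk_a)]
        vector_ops += 1
    scalar_ops = 0
    for i in range(full, n):  # epilogue: the leftover elements, here [16, 17]
        b[i] += a[i]
        scalar_ops += 1
    return vector_ops, scalar_ops


a = [float(i) for i in range(18)]
b = [1.0] * 18
counts = vector_add_inplace(b, a)
print(counts)  # one vector op plus two scalar ops for 18 elements
```

With 18 elements and 16 lanes, most of the loop bookkeeping exists to handle two elements, which is the boundary effect the slide complains about. Alignment is not modeled here.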


“Vector cores”?

SIMD variant: NVIDIA calls it “SIMT”


Lots of simple cores (CM)

Hide latency through many threads + switching (Tera)

“Pixel/vertex shaders” in 2000s


CM-1 (1983)

Tera MTA (1995)

GPU versus CPU

GPUs represent a different form of vector programming (“vector cores”)

▪ 32-wide vector of threads (“warp”)

Sufficiently optimized CPU code can be on par with GPU perf (Tflop range with AVX2/512, exploit multi-level caches, deep pipelines, prefetch, ...)

Vector programming: easier with GPUs than CPUs

Sweetspot is different from GPU codes

Parallelization + vectorization

Serial nature of commonly used CPU programming languages sometimes hides opportunities

Auto-vectorizing/parallelizing compilers + DSLs can’t yet compete with expert hand-rolled

▪ DSLs like Halide (Ragan-Kelley et al. 2013) show promise but need a few more generations

Sprinkle in (OpenMP) doesn’t cut it

Who wins: CPU vs GPU

flops
CPU: ✔ (vectorize: AVX2/512 gives Tflop range)
GPU: Tesla K40: 2880 fp32 ALU

main memory b/w
CPU: ✖ (Xeon Phi improves)

latency
CPU: ✔ (high clock, reordering; caches are large and work if you obey them)
GPU: (threads slow, non-smem caches irrelevant, CPU -> GPU control overhead)

boundary effects, small/irregular sizes
CPU: (branches easy, vectorization
GPU: (warp divergence, load

parallel programming model
CPU: ✖ (vectorization hard, perf black box)
GPU: (CUDA is very different, domain knowledge)

Tool + problem = solution?

Dive into 2D Convolutional Nets

Somewhat computationally expensive

O(b × f × f’ × n² × k²)

1st layer AlexNet:

▪ 13.493 Gflop (1 flop here = fp32 multiply-add)

▪ 77.2 Mbyte in, 63.7 Mbyte out (fp32)

▪ Perfect caching + reuse, 175 flop/byte in

▪ No caching + reuse, 0.125 flop/byte in
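The figures for the first AlexNet layer can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: the 55 x 55 output map size is not given in the talk (it is the size commonly quoted for this layer and is the value that reproduces 13.493 Gflop), and counting the filter weights as part of the "in" bytes is a guess that happens to match 77.2 Mbyte. The 63.7 Mbyte output figure is not derived here.

```python
# Layer-1 parameters as listed in the talk.
b, f, f_out, n, k = 128, 3, 96, 224, 11

# Assumption (not stated in the source): a 55 x 55 output feature map.
n_out = 55
BYTES_FP32 = 4

macs = b * f * f_out * n_out**2 * k**2        # 1 flop = one fp32 multiply-add
input_bytes = b * f * n * n * BYTES_FP32      # the minibatch of images
weight_bytes = f_out * f * k * k * BYTES_FP32  # the filters (assumed included)
bytes_in = input_bytes + weight_bytes

print(f"{macs / 1e9:.3f} Gflop")
print(f"{bytes_in / 1e6:.1f} Mbyte in")
print(f"perfect caching + reuse: {macs / bytes_in:.0f} flop/byte in")
# With no reuse, every multiply-add loads two fresh fp32 operands (8 bytes).
print(f"no caching + reuse: {1 / (2 * BYTES_FP32)} flop/byte in")
```

The gap between the two ratios (about three orders of magnitude) is why the amount of reuse a kernel achieves matters more than raw flop counts.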

The problem

Programmable caches (shared memory, registers, ...) not large enough for perfect reuse

Space of all possible square 2D convolution problems is 5/6-dimensional

Parameter                                       Size
minibatch size (b)                              128
input feature maps (f)                          3
output feature maps (f’)                        96
input feature size (n x n)                      224
convolution kernel size (k x k)                 11
convolution kernel stride (S x S) (optional)    4


Space of all possible matrix multiplications = 3-dimensional (A_{N×M} B_{M×P} = C_{N×P})

NVIDIA, Intel, others have put lots of effort into optimizing many parts of this space

▪ Rephrase convolution as a matrix multiplication!
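One common way to rephrase a convolution as a matrix multiplication is to unroll every input patch into a matrix row (often called im2col). The sketch below assumes a single image, stride 1, and no padding. All function names are hypothetical, and the plain `matmul` stands in for a tuned Sgemm call.

```python
def im2col(image, k):
    """Unroll every k x k patch of an (f, n, n) image into one row."""
    f, n = len(image), len(image[0])
    out = n - k + 1
    return [[image[c][y + dy][x + dx]
             for c in range(f) for dy in range(k) for dx in range(k)]
            for y in range(out) for x in range(out)]  # (out*out, f*k*k)


def matmul(a, b):
    """Plain A (N x M) times B (M x P); stands in for an Sgemm call."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def conv_direct(image, filters):
    """Reference: direct sliding-window correlation over the valid region."""
    f, n, k = len(image), len(image[0]), len(filters[0][0])
    out = n - k + 1
    return [[[sum(image[c][y + dy][x + dx] * flt[c][dy][dx]
                  for c in range(f) for dy in range(k) for dx in range(k))
              for x in range(out)] for y in range(out)] for flt in filters]


def conv_as_matmul(image, filters):
    k = len(filters[0][0])
    out = len(image[0]) - k + 1
    patches = im2col(image, k)
    # Flatten each filter in the same (c, dy, dx) order, one column per filter.
    flat_filters = [[w for plane in flt for row in plane for w in row]
                    for flt in filters]
    weights = [list(col) for col in zip(*flat_filters)]  # (f*k*k, f')
    flat = matmul(patches, weights)                      # (out*out, f')
    return [[[flat[y * out + x][j] for x in range(out)] for y in range(out)]
            for j in range(len(filters))]


# Tiny integer example: 2 input maps of 4 x 4, 3 filters of 2 x 2.
image = [[[c * 16 + y * 4 + x for x in range(4)] for y in range(4)]
         for c in range(2)]
filters = [[[[j + c + dy - dx for dx in range(2)] for dy in range(2)]
            for c in range(2)] for j in range(3)]
print(conv_as_matmul(image, filters) == conv_direct(image, filters))
```

The unrolled patch matrix duplicates overlapping input elements, so the trade is extra memory traffic in exchange for access to a heavily optimized matrix multiply.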



Sgemm originally optimized for large problems

13x13 * 3x3 is a small convolution. Unrolling it 192 times might be enough to feed the GPU

Large convolutions are intractable?

Small feature maps/convolutions = boundary effects bad for GPUs

Facebook AI Research work

2D convolution via FFT

Fast convolutional nets with fbfft: A GPU Performance Evaluation (Vasilache, Johnson et al., 2015 ICLR conference track oral)

Convolution => pointwise × in Fourier basis

Choice of basis is wide open! 2^i is great perf

O(b f f’ n² k²) => O(b f f’ n² + (b f + f f’ + b f’) n² log n)

▪ >= 5x5 kernels, faster than cuDNN
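The identity behind this approach (convolution becomes a pointwise product in the Fourier basis) can be checked on a small case. The sketch below is a simplification, not the fbfft method: it is 1D instead of 2D, and it uses a naive O(n²) DFT where a real implementation would use an FFT. All function names are hypothetical.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (illustration only)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * t * u / n)
               for t in range(n)) for u in range(n)]
    return [v / n for v in out] if inverse else out


def conv_direct(signal, kernel):
    """Full linear convolution computed directly, O(n * k)."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, w in enumerate(kernel):
            out[i + j] += s * w
    return out


def conv_fourier(signal, kernel):
    """Same result via transform, pointwise multiply, inverse transform."""
    size = len(signal) + len(kernel) - 1  # zero-pad to avoid wrap-around
    a = dft(list(signal) + [0.0] * (size - len(signal)))
    b = dft(list(kernel) + [0.0] * (size - len(kernel)))
    product = [x * y for x, y in zip(a, b)]  # the pointwise multiply
    return [v.real for v in dft(product, inverse=True)]


signal = [1.0, 2.0, 3.0, 4.0, 5.0]
kernel = [1.0, 0.0, -1.0]
direct = conv_direct(signal, kernel)
fourier = conv_fourier(signal, kernel)
print(all(abs(p - q) < 1e-9 for p, q in zip(direct, fourier)))
```

In the Fourier version the kernel size no longer appears in the multiply step, which is consistent with the k² factor dropping out of the cost expression above.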


cuFFT optimized for large FFT sizes

fbfft: smaller data, fit in registers, focus on warp

Data layout

Different problem sizes => different data layout

▪ cudaconv: DHWB (optimal for large b)

▪ deeper layers: HWBD/BHWD (many feature maps)

▪ b=1 faster convergence?

▪ b=128 better compute utilization

Smaller problems, exploit different layout/batching

▪ fbcunn 1D convolution
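What a layout name such as DHWB or BHWD means for memory access can be sketched by computing element strides. The assumptions here: the last letter of the layout is the contiguous (fastest-varying) dimension, D denotes the feature-map dimension, and the sizes are example values. `strides_for` and `offset` are hypothetical helpers.

```python
def strides_for(layout, dims):
    """Element strides for a dense tensor; the last letter is contiguous."""
    strides, step = {}, 1
    for axis in reversed(layout):
        strides[axis] = step
        step *= dims[axis]
    return strides


def offset(layout, dims, index):
    """Linear offset of one element under the given layout."""
    s = strides_for(layout, dims)
    return sum(index[axis] * s[axis] for axis in layout)


# Example sizes (assumed): batch 128, 96 feature maps of 55 x 55.
dims = {"B": 128, "D": 96, "H": 55, "W": 55}

print(strides_for("DHWB", dims))  # batch index is contiguous
print(strides_for("BHWD", dims))  # feature-map index is contiguous
```

Under DHWB, neighbouring threads that differ only in the batch index touch adjacent addresses, which needs a large b to fill a warp. Under BHWD, the same holds for the feature-map index, which suits deeper layers with many maps.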

Latency hiding: what holds you back?

▪ Compute bound? (math)

▪ Memory b/w bound? (streaming)

▪ Memory latency bound? (sparse)

Almost all “deep learning” algorithms are b/w bound on GPU. Low math intensity!

cuBLAS: Sgemm b/w bound. Dgemm compute bound
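Whether a kernel is compute bound or bandwidth bound can be estimated by comparing its math intensity (flop/byte) with the ratio of peak math rate to memory bandwidth. The sketch below is a roofline-style estimate with assumed, approximate device figures (roughly a Tesla K40 class GPU). It ignores latency, and flop-counting conventions differ between vendors and this talk.

```python
def attainable_gflops(flops_per_byte, peak_gflops, bandwidth_gbytes):
    """Roofline-style bound: limited by math or by memory streaming."""
    return min(peak_gflops, flops_per_byte * bandwidth_gbytes)


def bound_by(flops_per_byte, peak_gflops, bandwidth_gbytes):
    """Name the limiting resource for a given math intensity."""
    balance = peak_gflops / bandwidth_gbytes  # flop/byte needed to saturate math
    return "compute" if flops_per_byte >= balance else "bandwidth"


# Assumed, approximate figures for illustration only.
PEAK, BW = 4300.0, 288.0  # Gflop/s fp32 peak, Gbyte/s main memory

relu = 1 / 8  # pointwise op: one operation per fp32 read plus fp32 write
print(bound_by(relu, PEAK, BW), attainable_gflops(relu, PEAK, BW))
print(bound_by(175, PEAK, BW), attainable_gflops(175, PEAK, BW))
```

A pointwise layer sits far below the balance point, so it streams memory at full speed while most of the math units idle. Only near-perfect reuse would make a convolution compute bound.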


Kernel fusion: CPU vs GPU

Reduces memory b/w pressure

Exploits cache locality and register reuse

CPU: fusion not necessary

Kernel tiling + interleaving works due to caches

GPU: fusion necessary

Tiling + interleaving doesn’t work: smem not persistent, caches too small/irrelevant
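The bandwidth saving from fusion can be illustrated by counting main-memory accesses for two pointwise operations run as separate passes versus one fused pass. This is a toy model in Python, not GPU code: each loop stands in for one kernel, the `traffic` counter is a hypothetical stand-in for element loads and stores, and the scale-then-ReLU pair is just an example.

```python
def unfused(x):
    """Two separate 'kernels': scale, then ReLU, with a temporary in between."""
    traffic = 0
    tmp = []
    for v in x:  # kernel 1: read x, write tmp
        tmp.append(2.0 * v)
        traffic += 2
    out = []
    for v in tmp:  # kernel 2: read tmp, write out
        out.append(max(v, 0.0))
        traffic += 2
    return out, traffic


def fused(x):
    """One 'kernel': the intermediate stays in a local variable (a register)."""
    traffic = 0
    out = []
    for v in x:  # read x, write out
        scaled = 2.0 * v
        out.append(max(scaled, 0.0))
        traffic += 2
    return out, traffic


x = [-1.5, 0.0, 2.0, 3.5]
print(unfused(x))
print(fused(x))
```

Both versions do the same arithmetic, but the fused one moves half as many elements through memory, which matters when the operations are bandwidth bound.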

Kernel fusion

CUDA kernel = hard optimization boundary on GPU

Loop interchange, lifting, better fusion on CPU

CUDA: parallelization layer not visible to optimizer.

Auto-tuning desired. HW specific non-linear tradeoffs

Scripting languages are a further barrier to fusion on both CPU and GPU (Torch)

Kernel fusion

Torch: transposition is common operation

▪ size (80, 40) stride (40, 1) => size (40, 80) stride (1, 40)

▪ Old approach: transpose in memory, perform work, copy back

▪ New approach: rewrite kernel to handle transpositions. Optimize if non-transposed
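The size/stride bookkeeping above can be sketched with a small strided view: transposing swaps the size and stride pairs and leaves the storage untouched, so a kernel that honours strides needs no copy. `StridedView` is a hypothetical class for illustration, not Torch's tensor API.

```python
class StridedView:
    """2D view over flat storage, described by (size, stride) pairs."""

    def __init__(self, storage, size, stride):
        self.storage, self.size, self.stride = storage, size, stride

    def __getitem__(self, index):
        i, j = index
        return self.storage[i * self.stride[0] + j * self.stride[1]]

    def transpose(self):
        # Swap size and stride; the storage is shared, nothing is copied.
        return StridedView(self.storage, self.size[::-1], self.stride[::-1])


storage = list(range(80 * 40))
t = StridedView(storage, (80, 40), (40, 1))
tt = t.transpose()
print(tt.size, tt.stride)   # the transposed size and stride from the slide
print(tt[3, 7] == t[7, 3])  # same element, reached through swapped strides
```

The cost moves into the kernel: reading `tt` row by row now jumps 40 elements at a time, which is why the slide suggests a separate optimized path for the non-transposed case.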

Runtime fusion (CUDA 7.0, Theano)

Exploiting parallelism
