Post on 15-Jul-2015
Hacking GPUs for Deep Learning
MLConf New York
Jeff Johnson
Facebook AI Research
jhj@fb.com
Deep (convolutional) Neural Networks
Revolution in machine learning
Convolution: since 1980s. Deep: flops since 2000s
Avoid feature engineering
▪ With enough data, let network discover feature
representations
▪ Can work even for NLP. No word segmentation, use
raw character data.
2D Convolutional Nets (images)
LeCun, Bottou, Bengio and Haffner, 1998
Krizhevsky, Sutskever and Hinton, 2012
2D Convolutional Nets
Progress towards smaller kernels and deeper nets
Network architecture: ImageNet 1000-class top-5 error
AlexNet ~15%
OverFeat ~13%
ZeilerNet ~11%
Oxford-VGG ~7%
GoogLeNet ~6%, ~4.5%
PReLU (MSR) ~4.9%
Human performance 3-5%
3D Convolutional Nets (videos)
C3D (Tran et al., 2014)
DeepVideo (Karpathy et al., 2014)
1D Convolutional Nets (text, sequences)
Collobert et al., 2011
Zhang and LeCun, 2015
RNNs and LSTMs (text, sequences)
Graves, Mohamed and Hinton, 2013
Mikolov, 2014
Deep Neural Networks
Supervised learning. Unsupervised ???
Train with back-propagation/SGD variants
Strong scaling is unsolved
▪ Distributed parameter space exploration (e.g.,
Hogwild!; Niu et al. 2011)
▪ Distributed hyperparameter space exploration
(e.g., Bayesian optimization; Snoek et al. 2012)
Characteristics
Deep nets are flop eaters
Convolutions are expensive
Pointwise calculations (log/exp, ReLU, */+, ...)
Neighborhood reductions (pooling, convolution)
Scaling network parameters
increased learning capacity; overfitting
more training data (real or synthetic),
regularization required
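The difference between a pointwise calculation and a neighborhood reduction can be sketched in plain C. This is only an illustration: the function names `relu` and `max_pool_1d` are hypothetical, and the pooling is a simplified 1D case with a non-overlapping window.

```c
#include <assert.h>
#include <stddef.h>

/* Pointwise op: each output depends on exactly one input element. */
void relu(const float *in, float *out, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        out[i] = in[i] > 0.0f ? in[i] : 0.0f;
    }
}

/* Neighborhood reduction: 1D max pooling with window w and stride w.
 * Each output reduces over w neighboring inputs; n must be a multiple of w. */
void max_pool_1d(const float *in, float *out, size_t n, size_t w) {
    assert(w > 0 && n % w == 0);
    for (size_t o = 0; o < n / w; ++o) {
        float m = in[o * w];
        for (size_t j = 1; j < w; ++j) {
            if (in[o * w + j] > m) {
                m = in[o * w + j];
            }
        }
        out[o] = m;
    }
}
```

The pointwise loop parallelizes trivially; the reduction makes each output depend on several inputs, which is where boundary handling and data reuse start to matter.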
Deep nets are bandwidth eaters
More parameters = more memory, data to exchange
Barrier to cross-machine parallelism
▪ periodic exchanges, compression, quantization
Increase reuse of memory while local?
▪ interspersed reductions are resistant to fusion of
computations
▪ generalized programming language problem
Deep nets are latency sensitive
Serial dependency of training
fprop => bprop => fprop => ...
Serial dependency of multi-layer networks
layer 1 => layer 2 => layer 3 => ...
Multiple path dependent networks (RNNs, multi-layer
LSTMs)
Deep nets are also small?
Deeper = smaller feature planes, more of them
input R^m => expand to R^n => non-linearity => reduce to R^k
Problems are tiny in HPC terms
4096×4096 FFT, FE/PDE on massive grids, ...
NLP tasks can be sparse
Setup/kernel launch latency on GPU can dominate
compute
The tools
Vector processors
SIMD: Single Instruction, Multiple Data
Serial processor with ability to operate on more than one piece of data concurrently
Cray-1 (1976)
Vector processors
Hard to use: instructions only operate on 4, 8, 16, ...
pieces of data at a time. Boundary/alignment
effects. Great if your vectors are large, but...
float* a = ...; // is this aligned ((uintptr_t)a % 16 == 0)?
float* b = ...; // is this aligned ((uintptr_t)b % 16 == 0)?
for (int i = 0; i < 18; ++i) { // how to handle [16, 17]?
  b[i] += a[i]; // SIMD this?!? masking/loop epilogue
}
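One common way to handle the leftover elements is a full-width main loop followed by a scalar epilogue. The sketch below is portable C rather than real SIMD intrinsics; `VEC_WIDTH` and `add_inplace` are hypothetical names, and the alignment question raised above is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

#define VEC_WIDTH 4 /* hypothetical vector width, in floats */

/* b[i] += a[i] for i in [0, n): whole blocks first, scalar tail after. */
void add_inplace(float *b, const float *a, size_t n) {
    assert(a != NULL && b != NULL);
    size_t i = 0;
    /* Main body: whole blocks of VEC_WIDTH elements. The fixed-trip
     * inner loop is the part a compiler can map to one vector add. */
    for (; i + VEC_WIDTH <= n; i += VEC_WIDTH) {
        for (size_t lane = 0; lane < VEC_WIDTH; ++lane) {
            b[i + lane] += a[i + lane];
        }
    }
    /* Epilogue: leftover elements, e.g. [16, 17] when n == 18. */
    for (; i < n; ++i) {
        b[i] += a[i];
    }
}
```

With n = 18 the main loop covers indices 0 to 15 and the epilogue covers 16 and 17. Masked vector instructions are the other option the slide mentions; they are not shown here.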
“Vector cores”?
SIMD variant: NVIDIA calls it “SIMT”
Lots of simple cores (CM)
Hide latency through many
threads + switching (Tera)
“Pixel/vertex shaders” in 2000s
GPUs => GPGPU
CM-1 (1983)
Tera MTA (1995)
GPU versus CPU
GPUs represent a different form of vector
programming (“vector cores”)
▪ 32-wide vector of threads (“warp”)
Sufficiently optimized CPU code can be on par with
GPU perf (Tflop range with AVX2/512, exploit
multi-level caches, deep pipelines, prefetch, ...)
Vector programming: easier with GPUs than CPUs
Sweet spot is different from GPU codes
Parallelization + vectorization
Serial nature of commonly used CPU programming
languages sometimes hides opportunities
Auto-vectorizing/parallelizing compilers + DSLs can’t
yet compete with expert hand-rolled code
▪ DSLs like Halide (Ragan-Kelley et al. 2013) show
promise but need a few more generations
Sprinkling in pragmas (OpenMP) doesn’t cut it
Who wins: CPU or GPU?
flops
▪ CPU: ✔ (vectorize: AVX2/512 gives Tflop range)
▪ GPU: ✔ (Tesla K40: 2880 fp32 ALU pipelines)
main memory b/w
▪ CPU: ✖ (Xeon Phi improves)
▪ GPU: ✔
latency
▪ CPU: ✔ (high clock, reordering; caches are large and work if you obey them)
▪ GPU: ✖ (threads slow, non-smem caches irrelevant, CPU -> GPU control overhead)
boundary effects, small/irregular sizes
▪ CPU: ✔✖ (branches easy, vectorization hard)
▪ GPU: ✖ (warp divergence, load imbalance)
parallel programming model
▪ CPU: ✖ (vectorization hard, perf black box)
▪ GPU: ✔✖ (CUDA is very different, domain knowledge)
Tool + problem = solution?
Dive into 2D Convolutional Nets
Somewhat computationally expensive
O(b × f × f’ × n² × k²)
1st layer AlexNet:
▪ 13.493 Gflop (1 flop here = fp32 multiply-add)
▪ 77.2 Mbyte in, 63.7 Mbyte out (fp32)
▪ Perfect caching + reuse, 175 flop/byte in
▪ No caching + reuse, 0.125 flop/byte in
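The first-layer figures can be reproduced with a few lines of arithmetic. The sketch below rests on two assumptions that are not stated on the slide: the output feature map is 55 × 55 (with n = 224, k = 11 and stride 4 the unpadded size is not an integer, so some padding is implied), and the "in" byte count includes the filter weights as well as the input images. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply-adds for one 2D convolution layer with an o x o output map:
 * b images, f input maps, fp output maps, k x k kernels. */
int64_t conv_macs(int64_t b, int64_t f, int64_t fp, int64_t o, int64_t k) {
    assert(b > 0 && f > 0 && fp > 0 && o > 0 && k > 0);
    return b * f * fp * o * o * k * k;
}

/* Bytes read in fp32 (4 bytes each): input images plus filter weights. */
int64_t conv_bytes_in(int64_t b, int64_t f, int64_t fp, int64_t n, int64_t k) {
    assert(b > 0 && f > 0 && fp > 0 && n > 0 && k > 0);
    return 4 * (b * f * n * n + fp * f * k * k);
}
```

Under those assumptions, `conv_macs(128, 3, 96, 55, 11)` gives 13,493,145,600 multiply-adds and `conv_bytes_in(128, 3, 96, 224, 11)` gives 77,209,728 bytes, a ratio of about 175 flop/byte. The no-reuse figure follows from loading two fp32 operands (8 bytes) per multiply-add: 1/8 = 0.125 flop/byte.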
The problem
Programmable caches (shared memory, registers, ...)
not large enough for perfect reuse
Space of all possible square 2D convolution
problems is 5/6-dimensional
Parameter Size
minibatch size (b) 128
input feature maps (f) 3
output feature maps (f’) 96
input feature size (n x n) 224
convolution kernel size (k x k) 11
convolution kernel stride (SxS) (optional) 4
Converting
Space of all possible matrix multiplications is
3-dimensional (A[N×M] · B[M×P] = C[N×P])
NVIDIA, Intel, others have put lots of effort into
optimizing many parts of this space
▪ Rephrase convolution as a matrix multiplication!
▪ NVIDIA’s cuDNN
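The rephrasing can be sketched as "unroll patches into a matrix, then multiply". The code below is a minimal illustration of that general idea for one single-channel image with stride 1 and no padding; it is not a description of how cuDNN is implemented. `im2col` is a commonly used name for the unrolling step, and `matmul` is a naive stand-in for Sgemm.

```c
#include <assert.h>
#include <stddef.h>

/* Unroll every k x k patch of an n x n image into one row of cols.
 * cols has o*o rows and k*k columns, with o = n - k + 1. */
void im2col(const float *img, size_t n, size_t k, float *cols) {
    assert(k >= 1 && k <= n);
    size_t o = n - k + 1;
    for (size_t y = 0; y < o; ++y)
        for (size_t x = 0; x < o; ++x)
            for (size_t ky = 0; ky < k; ++ky)
                for (size_t kx = 0; kx < k; ++kx)
                    cols[((y * o + x) * k + ky) * k + kx] =
                        img[(y + ky) * n + (x + kx)];
}

/* Naive row-major C[N x P] = A[N x M] * B[M x P]. */
void matmul(const float *A, const float *B, float *C,
            size_t N, size_t M, size_t P) {
    for (size_t i = 0; i < N; ++i)
        for (size_t j = 0; j < P; ++j) {
            float acc = 0.0f;
            for (size_t m = 0; m < M; ++m)
                acc += A[i * M + m] * B[m * P + j];
            C[i * P + j] = acc;
        }
}
```

With the filters stored as a (k·k) × f’ matrix, `matmul(cols, filters, out, o*o, k*k, fp)` produces all f’ output maps in one multiplication. The kernel is applied without flipping, as is usual in convolutional nets. The price is the extra memory for `cols`, which replicates each input pixel up to k·k times.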
But:
Sgemm originally optimized for large problems
13x13 * 3x3 is a small convolution. Unrolling it 192
times might be enough to feed the GPU
Large convolutions are intractable?
Small feature maps/convolutions = boundary effects bad for GPUs
Facebook AI Research work
2D convolution via FFT
Fast convolutional nets with fbfft: A GPU Performance
Evaluation (Vasilache, Johnson et al., 2015 ICLR
conference track oral)
Convolution => pointwise × in Fourier basis
Choice of basis is wide open! 2^i sizes give great perf
O(b f f’ n² k²) => O(b f f’ n² + (b f + f f’ + b f’) n² log n)
▪ >= 5x5 kernels, faster than cuDNN
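To get a feel for the two complexity expressions, they can simply be evaluated for a given layer shape. This is a rough sketch only: it drops all constant factors, assumes a base-2 logarithm, and says nothing about real crossover points, which depend on the implementation (the talk reports wins from 5x5 kernels upward). The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* floor(log2(n)) for n >= 1, integer only. */
static int64_t ilog2(int64_t n) {
    assert(n >= 1);
    int64_t l = 0;
    while (n > 1) {
        n >>= 1;
        ++l;
    }
    return l;
}

/* Direct convolution term: b f f' n^2 k^2. */
int64_t direct_cost(int64_t b, int64_t f, int64_t fp, int64_t n, int64_t k) {
    return b * f * fp * n * n * k * k;
}

/* FFT term: b f f' n^2 + (b f + f f' + b f') n^2 log n. */
int64_t fft_cost(int64_t b, int64_t f, int64_t fp, int64_t n) {
    return b * f * fp * n * n
         + (b * f + f * fp + b * fp) * n * n * ilog2(n);
}
```

The kernel size k appears only in the direct term, so the FFT term stays flat as kernels grow. The transforms cost (b f + f f’ + b f’) n² log n and are amortized across the b f f’ pointwise products, which is why large batch and feature counts favor the Fourier route.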
fbfft
cuFFT optimized for large FFT sizes
fbfft: smaller data, fit in registers, focus on warp
Data layout
Different problem sizes => different data layout
▪ cudaconv: DHWB (optimal for large b)
▪ deeper layers: HWBD/BHWD (many feature maps)
▪ b=1 faster convergence?
▪ b=128 better compute utilization
Smaller problems, exploit different layout/batching
▪ fbcunn 1D convolution
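A layout name fixes which index varies fastest in memory. The sketch below assumes the letters read from outermost to innermost dimension (so DHWB keeps the batch index contiguous); the `Dims` struct and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Tensor extents: batch B, feature maps (depth) D, height H, width W. */
typedef struct {
    size_t B, D, H, W;
} Dims;

/* DHWB: batch index is innermost, so the same pixel of
 * consecutive images sits at adjacent addresses. */
size_t offset_dhwb(Dims s, size_t b, size_t d, size_t h, size_t w) {
    assert(b < s.B && d < s.D && h < s.H && w < s.W);
    return ((d * s.H + h) * s.W + w) * s.B + b;
}

/* BHWD: feature-map index is innermost, so the same pixel of
 * consecutive feature maps sits at adjacent addresses. */
size_t offset_bhwd(Dims s, size_t b, size_t d, size_t h, size_t w) {
    assert(b < s.B && d < s.D && h < s.H && w < s.W);
    return ((b * s.H + h) * s.W + w) * s.D + d;
}
```

Threads of a warp that walk the innermost index read contiguous memory. That is why a layout with b innermost suits large batches, while a layout with d innermost suits deep layers with many feature maps.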
Latency hiding: what holds you back?
▪ Compute bound? (math)
▪ Memory b/w bound? (streaming)
▪ Memory latency bound? (sparse)
Almost all “deep learning” algorithms are b/w bound
on GPU. Low math intensity!
cuBLAS: Sgemm b/w bound. Dgemm compute
bound
Kernel fusion: CPU vs GPU
Reduces memory b/w pressure
Exploits cache locality and register reuse
CPU: fusion not necessary
Kernel tiling + interleaving works due to caches
GPU: fusion necessary
Tiling + interleaving doesn’t work: smem not
persistent, caches too small/irrelevant
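What fusion buys can be shown with two pointwise steps. The sketch below is plain C, with each function standing in for what would be a separate kernel on the GPU; the names and the choice of scale-then-ReLU are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Unfused: two passes. The intermediate array tmp makes a full
 * round trip through memory between the passes. */
void scale_then_relu(const float *in, float *tmp, float *out,
                     size_t n, float s) {
    assert(in != NULL && tmp != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i) {
        tmp[i] = in[i] * s;                     /* pass 1: write tmp */
    }
    for (size_t i = 0; i < n; ++i) {
        out[i] = tmp[i] > 0.0f ? tmp[i] : 0.0f; /* pass 2: re-read tmp */
    }
}

/* Fused: one pass. The intermediate value stays in a register. */
void scale_relu_fused(const float *in, float *out, size_t n, float s) {
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i) {
        float v = in[i] * s;
        out[i] = v > 0.0f ? v : 0.0f;
    }
}
```

Both versions do the same arithmetic, but the fused one moves roughly half the data: one read and one write per element instead of two reads and two writes. For bandwidth-bound code that difference is most of the runtime.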
Kernel fusion
CUDA kernel = hard optimization boundary on GPU
Loop interchange, lifting, better fusion on CPU
CUDA: parallelization layer not visible to optimizer.
Auto-tuning desired. HW specific non-linear tradeoffs
Scripting languages are further barrier to fusion on
both CPU and GPU (Torch)
Kernel fusion
Torch: transposition is common operation
▪ size (80, 40) stride (40, 1) => size (40, 80) stride
(1, 40)
▪ Old approach: transpose in memory, perform work,
copy back
▪ New approach: rewrite kernel to handle
transpositions. Optimize if non-transposed
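The size/stride pairs above describe a transpose that moves no data. A minimal sketch of the idea follows; `View2D`, `transpose` and `at` are hypothetical names for illustration, not Torch's API.

```c
#include <assert.h>
#include <stddef.h>

/* A 2D view over a flat buffer: sizes and strides, both in elements. */
typedef struct {
    float *data;
    size_t size[2];
    size_t stride[2];
} View2D;

/* Transpose by swapping sizes and strides; the buffer is untouched. */
View2D transpose(View2D v) {
    View2D t = { v.data,
                 { v.size[1], v.size[0] },
                 { v.stride[1], v.stride[0] } };
    return t;
}

/* Element (i, j) of a view, wherever it lives in the buffer. */
float at(View2D v, size_t i, size_t j) {
    assert(i < v.size[0] && j < v.size[1]);
    return v.data[i * v.stride[0] + j * v.stride[1]];
}
```

A view of size (80, 40) and stride (40, 1) becomes size (40, 80) and stride (1, 40), and `at(t, j, i)` reads the same element as `at(v, i, j)`. A kernel that indexes through strides like this handles both cases, at the cost of non-contiguous access in the transposed one, which is why a fast path for the non-transposed case is still worth having.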
Runtime fusion (CUDA 7.0, Theano)
Exploiting parallelism
end