Large-Scale Deep Learning - Hebrew University of Jerusalem · 2015-10-25
Large-Scale Deep Learning
Tal Ben-Nun
Advanced Seminar in Deep Learning, October 2015
The Hebrew University of Jerusalem
Outline
• Motivation
• Training Parallelism
• Distributed Deep Learning
• Specialized Hardware
• Further Optimizations
Why Scale Up?
• Enormous amounts of data
  • ImageNet (1k): 180 GB
  • ImageNet (22k): a few TB
  • Industry: much larger
• Large neural network architectures
  • Around 20 layers deep today, ~100M-2B parameters
• Faster prototyping
  • GoogLeNet GPU throughput: 66-452 images per second
  • Epoch time: tens of hours to days (and weeks)
Outline
• Motivation
• Training Parallelism
• Distributed Deep Learning
• Specialized Hardware
• Further Optimizations
Backpropagation Algorithm
• Update rule: W_t = g(W_{t−1}, ∇L_t(W_{t−1}), [hyperparameters…])
• ∇L_t(W) is the average gradient direction over a mini-batch of size b:
• ∇L_t(W) = (1/b) Σ_{i=1}^{b} ∇ℓ(W; (x_i, y_i))
• A CNN is a Directed Acyclic Graph (DAG)
• At each layer in backpropagation, derivatives are computed w.r.t.:
  • Layer parameters (if necessary)
  • Data (chain rule)
• Additional memory storage required for training
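The update rule above can be sketched in a few lines of NumPy. The least-squares loss, sizes, and hyperparameters here are illustrative choices (not from the slides), with plain SGD as the update function g:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy per-example loss: l(W; (x, y)) = 0.5 * (W·x − y)^2
def example_grad(W, x, y):
    return (W @ x - y) * x

def minibatch_grad(W, X, Y, b):
    """∇L_t(W): average of per-example gradients over a sampled mini-batch of size b."""
    idx = rng.choice(len(X), size=b, replace=False)
    return np.mean([example_grad(W, X[i], Y[i]) for i in idx], axis=0)

def update(W, X, Y, lr=0.05, b=8):
    """One step of W_t = g(W_{t−1}, ∇L_t(W_{t−1}), …) with plain SGD as g."""
    return W - lr * minibatch_grad(W, X, Y, b)

X = rng.normal(size=(64, 3))
true_W = np.array([1.0, -2.0, 0.5])
Y = X @ true_W           # noiseless targets, so SGD can recover true_W exactly
W = np.zeros(3)
for _ in range(500):
    W = update(W, X, Y)
```

Momentum, AdaGrad, etc. differ only in the choice of g; the mini-batch averaging stays the same.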
[Figure: GoogLeNet architecture diagram - input, stacked convolution/pooling layers, concatenations, fully connected layer, softmax]
• ~6.8 million parameters
• 22 layers deep
GoogLeNet [Szegedy et al., 2014]
Szegedy et al. "Going deeper with convolutions." (2014).
[Figure: simple CNN pipeline - images → multi-convolution → pooling → fully connected → loss]
Simple CNN Architecture
[Figure: data parallelism - the mini-batch is split across Proc. 1-3; each processor runs the full pipeline (multi-convolution → pooling → fully connected → loss) on its share of the images]
Data Parallelism
✓ Good for forward pass (independent)
✓ Backpropagation requires all-to-all communication only when accumulating results
× Requires allocation of all parameters on each processor
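A toy single-process simulation of this scheme in NumPy, with a least-squares model standing in for a network and an average standing in for the all-reduce (all names and sizes are illustrative):

```python
import numpy as np

def shard_grad(W, Xs, ys):
    """Gradient of 0.5·||Xs·W − ys||²/n computed on one processor's data shard."""
    return Xs.T @ (Xs @ W - ys) / len(Xs)

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = X @ np.array([2.0, -1.0, 0.5])

P = 4
shards = np.array_split(np.arange(len(X)), P)   # the mini-batch, split across processors
W = np.zeros(3)                                  # full parameters replicated on every processor

for _ in range(500):
    # forward/backward passes run independently on each processor's shard
    local = [shard_grad(W, X[s], y[s]) for s in shards]
    # the only all-to-all step: average (all-reduce) the local gradients
    g = np.mean(local, axis=0)
    W = W - 0.1 * g          # every replica then applies the identical update
```

Note the drawback from the slide: `W` (all parameters) lives on every processor; only the data is partitioned.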
[Figure: model parallelism - each layer's parameters are split across Proc. 1-3; the full mini-batch flows through every processor at every layer]
Model Parallelism
✓ Parameters can be divided across processors
× Mini-batch has to be copied to all processors
× Backpropagation requires all-to-all communication every layer
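The same idea for a single fully connected layer, sketched in NumPy (the layer sizes and the two-way split are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 8))          # full FC weight matrix (6 inputs, 8 outputs)
x = rng.normal(size=(4, 6))          # the mini-batch, copied to every processor

P = 2
W_parts = np.split(W, P, axis=1)     # each "processor" holds a column slice of W

# Forward: each processor computes its own slice of the layer output
y_parts = [x @ Wp for Wp in W_parts]
y = np.concatenate(y_parts, axis=1)  # gathering the slices = all-to-all communication

# Backward: dL/dx needs contributions from every processor (all-to-all at every layer)
dy = rng.normal(size=y.shape)
dy_parts = np.split(dy, P, axis=1)
dx = sum(dyp @ Wp.T for dyp, Wp in zip(dy_parts, W_parts))
```

The gather in the forward pass and the sum in the backward pass are exactly the per-layer all-to-all steps the slide flags as costly.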
Hybrid Data/Model Parallelism
• Conjecture [Krizhevsky, 2014]: most of the computation is performed in the convolutional portion, while most of the parameters are stored in the fully connected portion
• Proposed Solution: Use data parallelism on convolutional portion and model parallelism on the FC portion
Krizhevsky, Alex. "One weird trick for parallelizing convolutional neural networks." (2014).
[Figure: hybrid parallelism - data parallelism across Proc. 1-3 in the convolutional portion (each processor gets part of the images), switching to model parallelism in the fully connected portion (each processor gets part of the parameters)]
Hybrid Data/Model Parallelism [Krizhevsky, 2014]
Krizhevsky, Alex. "One weird trick for parallelizing convolutional neural networks." (2014).
Hybrid Data/Model Parallelism Results
• AlexNet, ILSVRC 2012:
Krizhevsky, Alex. "One weird trick for parallelizing convolutional neural networks." (2014).
Outline
• Motivation
• Training Parallelism
• Distributed Deep Learning
• Specialized Hardware
• Further Optimizations
Distributed Deep Learning
• Runs on a computer cluster
• Each node runs partially autonomously
• Inter-node communication from time to time
• Best result is gathered from the nodes
• Training data can be split to per-node “shards”
[Figure: cluster of nodes (Node 1 … Node N) exchanging updates]
Distributed Deep Learning – Opportunities
• Increased memory: • More data
• More parameters
• Fault tolerance • Protection against node crashes
• Improved stochasticity
Distributed Deep Learning – Determining Factors
• Computational independence
• Communication efficiency
• Network congestion
• Load balancing
• Points of failure
Distributed Deep Learning – DistBelief
• Distributed learning infrastructure used at Google [Dean et al., 2012]
  • Handles communication automatically
  • Users define the computation and the messages sent during fprop/bprop
• Each model replica has the same parameters, but optimizes different data
• Each replica is divided among several machines
• Two distributed optimization schemes for training:
  • Online – Downpour SGD
  • Batch – Sandblaster LBFGS
• Uses a centralized parameter server (several machines, sharded)
  • Handles slow and faulty replicas
Dean, Jeffrey, et al. "Large scale distributed deep networks." NIPS 2012.
Asynchronous SGD – HOGWILD!
• To achieve coherency in distributed SGD, nodes must synchronize w.r.t. parameters
• HOGWILD! [Niu et al., 2011] is a method for asynchronous SGD that skips this synchronization
Niu, Feng, et al. "Hogwild: A lock-free approach to parallelizing stochastic gradient descent." NIPS 2011.
Asynchronous SGD – HOGWILD!
• HOGWILD!:
  • Proven to converge on sparse problems
  • Provides near-linear scaling
  • Assumes a shared-memory architecture (e.g., multicore CPUs)
• Formulates ML problems as hypergraphs G = (V, E), where:
  • w* = argmin_w f(w) = argmin_w Σ_{e∈E} f_e(w_e)
• Each hyperedge e ∈ E represents a subset of the n coordinates
• w_e is w restricted to the coordinates in e
Niu, Feng, et al. "Hogwild: A lock-free approach to parallelizing stochastic gradient descent." NIPS 2011.
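A minimal shared-memory sketch of the lock-free update, with Python threads on a NumPy array standing in for multicore hardware. The least-squares problem is illustrative (and dense, unlike the sparse problems HOGWILD!'s guarantees cover); each thread reads possibly stale parameters and writes back without any lock:

```python
import threading
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, -1.0, 2.0, 0.5])

w = np.zeros(4)   # shared parameter vector; no lock protects it

def worker(rows, steps=2000, lr=0.01):
    rs = np.random.default_rng(rows[0])
    for _ in range(steps):
        i = rs.choice(rows)
        g = (X[i] @ w - y[i]) * X[i]   # gradient computed from a possibly stale w
        w[:] -= lr * g                 # racy, lock-free in-place update

threads = [threading.Thread(target=worker, args=(list(range(p, 200, 4)),))
           for p in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Races occasionally lose an update, but with enough steps the shared vector still converges, which is the point of the method.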
Asynchronous SGD – HOGWILD!
• Example: sparse SVM
  • E represents the training set, denoting the participating parameters as u ∈ e
• argmin_w Σ_i max(0, 1 − y_i w^T x_i) + λ||w||²  is formulated as:
• argmin_w Σ_{e∈E} ( max(0, 1 − y_e w^T x_e) + λ Σ_{u∈e} w_u² / d_u )
• d_u is the number of training examples that are non-zero in component u
Niu, Feng, et al. "Hogwild: A lock-free approach to parallelizing stochastic gradient descent." NIPS 2011.
Asynchronous SGD – HOGWILD!
• Assuming 𝑓 is 𝐿-Lipschitz, strongly convex with constant 𝑐, and its derivatives are bounded by M
• Assuming constant learning rate 𝛾 and 𝛾𝑐 < 1
Niu, Feng, et al. "Hogwild: A lock-free approach to parallelizing stochastic gradient descent." NIPS 2011.
Distributed Deep Learning – Downpour SGD
• Relaxation of HOGWILD! for distributed systems
• Algorithm:
  • Divide the training data into subsets and run a replica on each subset
  • Every n_fetch iterations, fetch up-to-date parameters from the server
  • Every n_push iterations, push local gradients to the server
• Note that parameter shards may be “out-of-sync”
Dean, Jeffrey, et al. "Large scale distributed deep networks." NIPS 2012.
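A single-process simulation of the fetch/push protocol in NumPy. One parameter shard, a least-squares model in place of a network, and round-robin scheduling standing in for true asynchrony; all sizes and rates are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, 2.0, -1.0])

server_w = np.zeros(3)                      # the (single-shard) parameter server
shards = np.array_split(np.arange(60), 3)   # one data shard per model replica
n_fetch, n_push, lr = 5, 5, 0.01
replicas = [{"w": server_w.copy(), "acc": np.zeros(3)} for _ in shards]

for step in range(2000):
    for rep, rows in zip(replicas, shards):
        if step % n_fetch == 0:
            rep["w"] = server_w.copy()       # fetch up-to-date parameters
        i = rng.choice(rows)
        g = (X[i] @ rep["w"] - y[i]) * X[i]  # local gradient (mini-batch of 1)
        rep["w"] -= lr * g                   # local optimization step
        rep["acc"] += g                      # accumulate gradients to push later
        if step % n_push == n_push - 1:
            server_w -= lr * rep["acc"]      # push: server applies the update
            rep["acc"] = np.zeros(3)
```

Between fetches each replica drifts away from the server, so the gradients it pushes are computed on stale parameters - exactly the "out-of-sync" behavior the slide notes.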
Distributed Deep Learning – Downpour SGD
Dean, Jeffrey, et al. "Large scale distributed deep networks." NIPS 2012.
BFGS Nonlinear Optimization Algorithm
• Second-order method (B_k estimates the Hessian)
• Uses line search to determine the learning rate per iteration
• Limited-memory BFGS (LBFGS) uses the m last updates of x instead of storing B_k

B_0 = I
for k = 0 to K:
    Solve B_k p_k = −∇f(x_k) to obtain direction p_k
    Find α_k = argmin_α f(x_k + α p_k)
    Set x_{k+1} = x_k + α_k p_k
    Compute B_{k+1}^{−1} using two rank-1 updates on B_k^{−1}
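The loop above, instantiated as plain BFGS in NumPy. A backtracking (Armijo) line search stands in for the exact argmin, and Rosenbrock is a standard test function, not from the slides:

```python
import numpy as np

def bfgs(f, grad, x0, iters=150):
    """Plain BFGS, maintaining H ≈ B_k^{-1} (the inverse-Hessian estimate)."""
    x = np.asarray(x0, dtype=float)
    n = len(x)
    H = np.eye(n)                      # B_0 = I
    g = grad(x)
    for _ in range(iters):
        p = -H @ g                     # direction: solves B_k p_k = -∇f(x_k)
        a = 1.0                        # backtracking line search for α_k
        while f(x + a * p) > f(x) + 1e-4 * a * (g @ p):
            a *= 0.5
        x_new = x + a * p
        g_new = grad(x_new)
        s, yv = x_new - x, g_new - g
        if s @ yv > 1e-12:             # curvature condition; keeps H positive definite
            rho = 1.0 / (s @ yv)
            I = np.eye(n)
            # BFGS inverse update, built from rank-1 corrections
            H = (I - rho * np.outer(s, yv)) @ H @ (I - rho * np.outer(yv, s)) \
                + rho * np.outer(s, s)
        x, g = x_new, g_new
    return x

# Rosenbrock: a standard nonconvex test function with minimum at (1, 1)
f = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
grad = lambda x: np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
                           200 * (x[1] - x[0]**2)])
x_min = bfgs(f, grad, np.array([-1.0, 1.0]))
```

LBFGS replaces the dense H with the last m (s, y) pairs and a two-loop recursion, which is what makes the Sandblaster distribution over parameter shards practical.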
Distributed Deep Learning – Sandblaster LBFGS
• A coordinator process issues commands (dot product, scaling, multiplication, etc.) to slave nodes, each processing a different parameter shard
• Communication is sparser:
  • Most of the information is stored locally
  • Coordinator messages are small
  • Slaves fetch parameters at the beginning of each batch and send gradients once in a while for fault tolerance
• Employs computation replication and load balancing:
  • Nodes that finish their job get more jobs
  • If one node is slow, additional nodes get the same job
Dean, Jeffrey, et al. "Large scale distributed deep networks." NIPS 2012.
Distributed Deep Learning – Sandblaster LBFGS
Dean, Jeffrey, et al. "Large scale distributed deep networks." NIPS 2012.
DistBelief Results – Scaling
Dean, Jeffrey, et al. "Large scale distributed deep networks." NIPS 2012.
DistBelief Results – Accuracy
Dean, Jeffrey, et al. "Large scale distributed deep networks." NIPS 2012.
DistBelief Results – Time
Dean, Jeffrey, et al. "Large scale distributed deep networks." NIPS 2012.
Distributed Deep Learning – Drawbacks
• Centralized • Single point of failure: Parameter Server(s) / Coordinator
• Network congestion can be a bottleneck
• Only very relaxed theoretical guarantees for Downpour SGD
Outline
• Motivation
• Training Parallelism
• Distributed Deep Learning
• Specialized Hardware
• Further Optimizations
Specialized Hardware
• GPU
  • Thousands of cores, massively parallel (5-10 TFLOP/s per card)
  • Multi-GPU nodes further increase training performance (using data/model parallelism)
  • Drawback: hard to program efficiently
• FPGA
  • Specialized for certain operations (e.g., convolutions)
  • Drawbacks: network-specific, harder to program
• Convolutional processing units?
Distributed Deep Learning with GPUs
• A distributed GPU-based system [Coates et al., 2013] was shown to run DistBelief-scale problems (1,000 machines) with only 3 multi-GPU nodes
• 3 tiers of concurrency: GPU, model parallelism, nodes
Coates et al. "Deep learning with COTS HPC systems." ICML 2013.
Time for a single mini-batch gradient update of a sparse autoencoder
Outline
• Motivation
• Training Parallelism
• Distributed Deep Learning
• Specialized Hardware
• Further Optimizations
Further Optimizations
• Convolution operator replacement
  • As Toeplitz matrix multiplication (im2col)
  • As an operation in the frequency domain
• Lowering floating point precision
• Specific compilers/assemblers
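The im2col idea from the first bullet, sketched in NumPy: unfold every k×k patch into a column, so the whole convolution becomes one matrix product that can run on highly tuned GEMM kernels (the helper names are ours):

```python
import numpy as np

def im2col(img, k):
    """Unfold every k×k patch of a 2-D image into one column of a matrix."""
    H, W = img.shape
    out_h, out_w = H - k + 1, W - k + 1
    cols = np.empty((k * k, out_h * out_w))
    for i in range(out_h):
        for j in range(out_w):
            cols[:, i * out_w + j] = img[i:i + k, j:j + k].ravel()
    return cols

def conv2d_im2col(img, kernel):
    """'Valid' cross-correlation as a single matrix-vector product."""
    k = kernel.shape[0]
    out = kernel.ravel() @ im2col(img, k)   # one GEMM replaces the nested loops
    return out.reshape(img.shape[0] - k + 1, img.shape[1] - k + 1)

img = np.arange(25.0).reshape(5, 5)
kern = np.ones((3, 3))
out = conv2d_im2col(img, kern)   # each entry is the sum over a 3×3 window
```

The price is memory: each pixel is duplicated up to k² times in the unfolded matrix.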
Convolution as FFT
• In convolution, each pixel requires an NxN surrounding region
• Convolution can be computed using FFT [Vasilache et al., 2015]:
• The larger the convolution kernel, the better the performance:
[Figure: performance vs. problem size and output size, one panel per kernel size (3x3, 5x5, 7x7, 13x13)]
Vasilache et al. "Fast Convolutional Nets With fbfft: A GPU Performance Evaluation", ICLR 2015
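A NumPy sketch of the frequency-domain route: pad, transform, multiply pointwise, invert, crop to the valid region. Flipping the kernel makes the convolution theorem compute the cross-correlation that CNNs actually use (function name and sizes are ours):

```python
import numpy as np

def conv2d_fft(img, kernel):
    """'Valid' cross-correlation via pointwise multiplication in the frequency domain."""
    H, W = img.shape
    k = kernel.shape[0]
    # zero-pad both operands to the full linear-convolution size
    full_shape = (H + k - 1, W + k - 1)
    F_img = np.fft.rfft2(img, s=full_shape)
    # flip the kernel so the convolution theorem yields cross-correlation
    F_ker = np.fft.rfft2(kernel[::-1, ::-1], s=full_shape)
    full = np.fft.irfft2(F_img * F_ker, s=full_shape)
    return full[k - 1:H, k - 1:W]   # crop to the 'valid' region

img = np.arange(36.0).reshape(6, 6)
kern = np.arange(9.0).reshape(3, 3)
out = conv2d_fft(img, kern)
```

The FFT cost is independent of the kernel size, which is why the figure shows larger kernels benefiting most, while tiny 3×3 kernels can be slower than direct convolution.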
Sacrificing Accuracy for Performance
• Half-precision (16-bit floating point) [Gupta et al., 2015]
  • Memory is stored in 16-bit format
  • Computations are performed in 32 bits
• Uses stochastic rounding:
  • Goal: preserve the expected value of each rounded quantity, E[Round(x)] = x
Gupta et al. "Deep Learning with Limited Numerical Precision." NIPS 2015.
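A sketch of stochastic rounding to a fixed-point grid with FL fractional bits (the function name is ours; the defining property, from Gupta et al., is that the rounding is unbiased):

```python
import numpy as np

def stochastic_round(x, fl, rng):
    """Round to a fixed-point grid with `fl` fractional bits.
    The probability of rounding up equals the distance to the lower
    grid point, so the rounding is unbiased: E[round(x)] = x."""
    scale = 2.0 ** fl
    scaled = np.asarray(x) * scale
    floor = np.floor(scaled)
    p_up = scaled - floor                         # fractional part
    return (floor + (rng.random(floor.shape) < p_up)) / scale

rng = np.random.default_rng(0)
samples = stochastic_round(np.full(100_000, 0.3), fl=2, rng=rng)
# with 2 fractional bits the neighboring grid points are 0.25 and 0.5,
# chosen with probabilities 0.8 and 0.2, so the sample mean stays near 0.3
```

Unbiasedness is what lets the small gradient updates survive low precision: round-to-nearest would silently zero any update smaller than half a grid step.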
Sacrificing Accuracy for Performance
• Results on MNIST with LeNet:
WL = word length (bits), FL = fractional length (bits)
Gupta et al. "Deep Learning with Limited Numerical Precision." NIPS 2015.
Benchmark Insights
According to Convnet Benchmarks (as of October 2015):
• GPU + 16-bit floating point + custom assembler is the fastest
• GPU + 32-bit floating point + FFT is a close runner-up
• FFT-based convolution surpasses direct convolution, unless the kernel size is small (OxfordNet)
• Performance is close to peak hardware throughput (~4 out of 4.7 TFLOP/s)
• Naïve versions can be >16 times slower