Large-Scale Deep Learning - Hebrew University of Jerusalem · 2015-10-25
Large-Scale Deep Learning
Tal Ben-Nun
Advanced Seminar in Deep Learning, October 2015
The Hebrew University of Jerusalem
Outline
• Motivation
• Training Parallelism
• Distributed Deep Learning
• Specialized Hardware
• Further Optimizations
Why Scale Up?
• Enormous amounts of data
  • ImageNet (1k): 180 GB
  • ImageNet (22k): a few TB
  • Industry: much larger
• Large neural network architectures
  • Around 20 layers deep today, ~100M-2B parameters
• Faster prototyping
  • GoogLeNet GPU throughput: 66-452 images per second
  • Epoch time: tens of hours to days (and weeks)
Outline
• Motivation
• Training Parallelism
• Distributed Deep Learning
• Specialized Hardware
• Further Optimizations
Backpropagation Algorithm
• Update rule: W_t = g(W_{t−1}, ∇L_t(W_{t−1}), [hyperparameters…])
• ∇L_t(W) is the average gradient direction over a mini-batch of size b:
• ∇L_t(W) = (1/b) Σ_{i=1}^{b} ∇ℓ(W; (x_i, y_i))
• A CNN is a Directed Acyclic Graph (DAG)
• At each layer in backpropagation, derivatives are computed w.r.t.:
  • Layer parameters (if necessary)
  • Data (chain rule)
• Additional memory storage required for training
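The update rule above can be sketched in a few lines of NumPy. The least-squares loss, sizes, and hyperparameters here are illustrative choices (not from the slides), with plain SGD as the update function g:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy per-example loss: l(W; (x, y)) = 0.5 * (W·x − y)^2
def example_grad(W, x, y):
    return (W @ x - y) * x

def minibatch_grad(W, X, Y, b):
    """∇L_t(W): average of per-example gradients over a sampled mini-batch of size b."""
    idx = rng.choice(len(X), size=b, replace=False)
    return np.mean([example_grad(W, X[i], Y[i]) for i in idx], axis=0)

def update(W, X, Y, lr=0.05, b=8):
    """One step of W_t = g(W_{t−1}, ∇L_t(W_{t−1}), …) with plain SGD as g."""
    return W - lr * minibatch_grad(W, X, Y, b)

X = rng.normal(size=(64, 3))
true_W = np.array([1.0, -2.0, 0.5])
Y = X @ true_W           # noiseless targets, so SGD can recover true_W exactly
W = np.zeros(3)
for _ in range(500):
    W = update(W, X, Y)
```

Momentum, AdaGrad, etc. differ only in the choice of g; the mini-batch averaging stays the same.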
[Figure: GoogLeNet architecture diagram - input, stacked convolution/pooling layers, concatenations, fully connected layer, softmax]
• ~6.8 million parameters
• 22 layers deep
GoogLeNet [Szegedy et al., 2014]
Szegedy et al. "Going deeper with convolutions." (2014).
[Figure: simple CNN pipeline - images → multi-convolution → pooling → fully connected → loss]
Simple CNN Architecture
[Figure: data parallelism - the mini-batch is split across Proc. 1-3; each processor runs the full pipeline (multi-convolution → pooling → fully connected → loss) on its share of the images]
Data Parallelism
✓ Good for forward pass (independent)
✓ Backpropagation requires all-to-all communication only when accumulating results
× Requires allocation of all parameters on each processor
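A toy single-process simulation of this scheme in NumPy, with a least-squares model standing in for a network and an average standing in for the all-reduce (all names and sizes are illustrative):

```python
import numpy as np

def shard_grad(W, Xs, ys):
    """Gradient of 0.5·||Xs·W − ys||²/n computed on one processor's data shard."""
    return Xs.T @ (Xs @ W - ys) / len(Xs)

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = X @ np.array([2.0, -1.0, 0.5])

P = 4
shards = np.array_split(np.arange(len(X)), P)   # the mini-batch, split across processors
W = np.zeros(3)                                  # full parameters replicated on every processor

for _ in range(500):
    # forward/backward passes run independently on each processor's shard
    local = [shard_grad(W, X[s], y[s]) for s in shards]
    # the only all-to-all step: average (all-reduce) the local gradients
    g = np.mean(local, axis=0)
    W = W - 0.1 * g          # every replica then applies the identical update
```

Note the drawback from the slide: `W` (all parameters) lives on every processor; only the data is partitioned.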
[Figure: model parallelism - each layer's parameters are split across Proc. 1-3; the full mini-batch flows through every processor at every layer]
Model Parallelism
✓ Parameters can be divided across processors
× Mini-batch has to be copied to all processors
× Backpropagation requires all-to-all communication every layer
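The same idea for a single fully connected layer, sketched in NumPy (the layer sizes and the two-way split are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 8))          # full FC weight matrix (6 inputs, 8 outputs)
x = rng.normal(size=(4, 6))          # the mini-batch, copied to every processor

P = 2
W_parts = np.split(W, P, axis=1)     # each "processor" holds a column slice of W

# Forward: each processor computes its own slice of the layer output
y_parts = [x @ Wp for Wp in W_parts]
y = np.concatenate(y_parts, axis=1)  # gathering the slices = all-to-all communication

# Backward: dL/dx needs contributions from every processor (all-to-all at every layer)
dy = rng.normal(size=y.shape)
dy_parts = np.split(dy, P, axis=1)
dx = sum(dyp @ Wp.T for dyp, Wp in zip(dy_parts, W_parts))
```

The gather in the forward pass and the sum in the backward pass are exactly the per-layer all-to-all steps the slide flags as costly.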
Hybrid Data/Model Parallelism
• Conjecture [Krizhevsky, 2014]: most of the computation is performed in the convolutional portion, while most of the parameters are stored in the fully connected portion
• Proposed Solution: Use data parallelism on convolutional portion and model parallelism on the FC portion
Krizhevsky, Alex. "One weird trick for parallelizing convolutional neural networks." (2014).
[Figure: hybrid parallelism - data parallelism across Proc. 1-3 in the convolutional portion (each processor gets part of the images), switching to model parallelism in the fully connected portion (each processor gets part of the parameters)]
Hybrid Data/Model Parallelism [Krizhevsky, 2014]
Krizhevsky, Alex. "One weird trick for parallelizing convolutional neural networks." (2014).
Hybrid Data/Model Parallelism Results
• AlexNet, ILSVRC 2012:
Krizhevsky, Alex. "One weird trick for parallelizing convolutional neural networks." (2014).
Outline
• Motivation
• Training Parallelism
• Distributed Deep Learning
• Specialized Hardware
• Further Optimizations
Distributed Deep Learning
• Runs on a computer cluster
• Each node runs partially autonomously
• Inter-node communication from time to time
• Best result is gathered from the nodes
• Training data can be split to per-node “shards”
[Figure: cluster of nodes (Node 1 … Node N) exchanging updates]
Distributed Deep Learning – Opportunities
• Increased memory: • More data
• More parameters
• Fault tolerance • Protection against node crashes
• Improved stochasticity
Distributed Deep Learning – Determining Factors
• Computational independence
• Communication efficiency
• Network congestion
• Load balancing
• Points of failure
Distributed Deep Learning – DistBelief
• Distributed learning infrastructure used at Google [Dean et al., 2012]
  • Handles communication automatically
  • Users define the computation and the messages sent during fprop/bprop
• Each model replica has the same parameters, but optimizes different data
• Each replica is divided among several machines
• Two distributed optimization schemes for training:
  • Online – Downpour SGD
  • Batch – Sandblaster LBFGS
• Uses a centralized parameter server (several machines, sharded)
  • Handles slow and faulty replicas
Dean, Jeffrey, et al. "Large scale distributed deep networks." NIPS 2012.
Asynchronous SGD – HOGWILD!
• To achieve coherency in distributed SGD, nodes must synchronize w.r.t. parameters
• HOGWILD! [Niu et al., 2011] is a method for asynchronous SGD that skips this synchronization
Niu, Feng, et al. "Hogwild: A lock-free approach to parallelizing stochastic gradient descent." NIPS 2011.
Asynchronous SGD – HOGWILD!
• HOGWILD!:
  • Proven to converge on sparse problems
  • Provides near-linear scaling
  • Assumes a shared-memory architecture (e.g., multicore CPUs)
• Formulates ML problems as hypergraphs G = (V, E), where:
  • w* = argmin_w f(w) = argmin_w Σ_{e∈E} f_e(w_e)
• Each hyperedge e ∈ E represents a subset of the n coordinates
• w_e is w restricted to the coordinates in e
Niu, Feng, et al. "Hogwild: A lock-free approach to parallelizing stochastic gradient descent." NIPS 2011.
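A minimal shared-memory sketch of the lock-free update, with Python threads on a NumPy array standing in for multicore hardware. The least-squares problem is illustrative (and dense, unlike the sparse problems HOGWILD!'s guarantees cover); each thread reads possibly stale parameters and writes back without any lock:

```python
import threading
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, -1.0, 2.0, 0.5])

w = np.zeros(4)   # shared parameter vector; no lock protects it

def worker(rows, steps=2000, lr=0.01):
    rs = np.random.default_rng(rows[0])
    for _ in range(steps):
        i = rs.choice(rows)
        g = (X[i] @ w - y[i]) * X[i]   # gradient computed from a possibly stale w
        w[:] -= lr * g                 # racy, lock-free in-place update

threads = [threading.Thread(target=worker, args=(list(range(p, 200, 4)),))
           for p in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Races occasionally lose an update, but with enough steps the shared vector still converges, which is the point of the method.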
Asynchronous SGD – HOGWILD!
• Example: sparse SVM
  • E represents the training set, denoting the participating parameters as u ∈ e
• argmin_w Σ_i max(0, 1 − y_i w^T x_i) + λ||w||²  is formulated as:
• argmin_w Σ_{e∈E} ( max(0, 1 − y_e w^T x_e) + λ Σ_{u∈e} w_u² / d_u )
• d_u is the number of training examples that are non-zero in component u
Niu, Feng, et al. "Hogwild: A lock-free approach to parallelizing stochastic gradient descent." NIPS 2011.
Asynchronous SGD – HOGWILD!
• Assuming 𝑓 is 𝐿-Lipschitz, strongly convex with constant 𝑐, and its derivatives are bounded by M
• Assuming constant learning rate 𝛾 and 𝛾𝑐 < 1
Niu, Feng, et al. "Hogwild: A lock-free approach to parallelizing stochastic gradient descent." NIPS 2011.
Distributed Deep Learning – Downpour SGD
• Relaxation of HOGWILD! for distributed systems
• Algorithm:
  • Divide the training data into subsets and run a replica on each subset
  • Every n_fetch iterations, fetch up-to-date parameters from the server
  • Every n_push iterations, push local gradients to the server
• Note that parameter shards may be “out-of-sync”
Dean, Jeffrey, et al. "Large scale distributed deep networks." NIPS 2012.
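A single-process simulation of the fetch/push protocol in NumPy. One parameter shard, a least-squares model in place of a network, and round-robin scheduling standing in for true asynchrony; all sizes and rates are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, 2.0, -1.0])

server_w = np.zeros(3)                      # the (single-shard) parameter server
shards = np.array_split(np.arange(60), 3)   # one data shard per model replica
n_fetch, n_push, lr = 5, 5, 0.01
replicas = [{"w": server_w.copy(), "acc": np.zeros(3)} for _ in shards]

for step in range(2000):
    for rep, rows in zip(replicas, shards):
        if step % n_fetch == 0:
            rep["w"] = server_w.copy()       # fetch up-to-date parameters
        i = rng.choice(rows)
        g = (X[i] @ rep["w"] - y[i]) * X[i]  # local gradient (mini-batch of 1)
        rep["w"] -= lr * g                   # local optimization step
        rep["acc"] += g                      # accumulate gradients to push later
        if step % n_push == n_push - 1:
            server_w -= lr * rep["acc"]      # push: server applies the update
            rep["acc"] = np.zeros(3)
```

Between fetches each replica drifts away from the server, so the gradients it pushes are computed on stale parameters - exactly the "out-of-sync" behavior the slide notes.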
Distributed Deep Learning – Downpour SGD
Dean, Jeffrey, et al. "Large scale distributed deep networks." NIPS 2012.
BFGS Nonlinear Optimization Algorithm
• Second-order method (B_k estimates the Hessian)
• Uses line search to determine the learning rate per iteration
• Limited-memory BFGS (LBFGS) uses the m last updates of x instead of storing B_k

B_0 = I
for k = 0 to K:
    Solve B_k p_k = −∇f(x_k) to obtain direction p_k
    Find α_k = argmin_α f(x_k + α p_k)
    Set x_{k+1} = x_k + α_k p_k
    Compute B_{k+1}^{−1} using two rank-1 updates on B_k^{−1}
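The loop above, instantiated as plain BFGS in NumPy. A backtracking (Armijo) line search stands in for the exact argmin, and Rosenbrock is a standard test function, not from the slides:

```python
import numpy as np

def bfgs(f, grad, x0, iters=150):
    """Plain BFGS, maintaining H ≈ B_k^{-1} (the inverse-Hessian estimate)."""
    x = np.asarray(x0, dtype=float)
    n = len(x)
    H = np.eye(n)                      # B_0 = I
    g = grad(x)
    for _ in range(iters):
        p = -H @ g                     # direction: solves B_k p_k = -∇f(x_k)
        a = 1.0                        # backtracking line search for α_k
        while f(x + a * p) > f(x) + 1e-4 * a * (g @ p):
            a *= 0.5
        x_new = x + a * p
        g_new = grad(x_new)
        s, yv = x_new - x, g_new - g
        if s @ yv > 1e-12:             # curvature condition; keeps H positive definite
            rho = 1.0 / (s @ yv)
            I = np.eye(n)
            # BFGS inverse update, built from rank-1 corrections
            H = (I - rho * np.outer(s, yv)) @ H @ (I - rho * np.outer(yv, s)) \
                + rho * np.outer(s, s)
        x, g = x_new, g_new
    return x

# Rosenbrock: a standard nonconvex test function with minimum at (1, 1)
f = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
grad = lambda x: np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
                           200 * (x[1] - x[0]**2)])
x_min = bfgs(f, grad, np.array([-1.0, 1.0]))
```

LBFGS replaces the dense H with the last m (s, y) pairs and a two-loop recursion, which is what makes the Sandblaster distribution over parameter shards practical.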
Distributed Deep Learning – Sandblaster LBFGS
• A coordinator process issues commands (dot product, scaling, multiplication, etc.) to slave nodes, each processing a different parameter shard
• Communication is sparser:
  • Most of the information is stored locally
  • Coordinator messages are small
  • Slaves fetch parameters at the beginning of each batch and send gradients once in a while for fault tolerance
• Employs computation replication and load balancing:
  • Nodes that finish their job get more jobs
  • If one node is slow, additional nodes get the same job
Dean, Jeffrey, et al. "Large scale distributed deep networks." NIPS 2012.
Distributed Deep Learning – Sandblaster LBFGS
Dean, Jeffrey, et al. "Large scale distributed deep networks." NIPS 2012.
DistBelief Results – Scaling
Dean, Jeffrey, et al. "Large scale distributed deep networks." NIPS 2012.
DistBelief Results – Accuracy
Dean, Jeffrey, et al. "Large scale distributed deep networks." NIPS 2012.
DistBelief Results – Time
Dean, Jeffrey, et al. "Large scale distributed deep networks." NIPS 2012.
Distributed Deep Learning – Drawbacks
• Centralized • Single point of failure: Parameter Server(s) / Coordinator
• Network congestion can be a bottleneck
• Only very relaxed theoretical guarantees for Downpour SGD
Outline
• Motivation
• Training Parallelism
• Distributed Deep Learning
• Specialized Hardware
• Further Optimizations
Specialized Hardware
• GPU
  • Thousands of cores, massively parallel (5-10 TFLOP/s per card)
  • Multi-GPU nodes further increase training performance (using data/model parallelism)
  • Drawback: hard to program efficiently
• FPGA
  • Specialized for certain operations (e.g., convolutions)
  • Drawbacks: network-specific, harder to program
• Convolutional processing units?
Distributed Deep Learning with GPUs
• A distributed GPU-based system [Coates et al., 2013] was shown to run DistBelief-scale problems (1,000 machines) with only 3 multi-GPU nodes
• 3 tiers of concurrency: GPU, model parallelism, nodes
Coates et al. "Deep learning with COTS HPC systems." ICML 2013.
Time for a single mini-batch gradient update of a sparse autoencoder
Outline
• Motivation
• Training Parallelism
• Distributed Deep Learning
• Specialized Hardware
• Further Optimizations
Further Optimizations
• Convolution operator replacement
  • As Toeplitz matrix multiplication (im2col)
  • As an operation in the frequency domain
• Lowering floating point precision
• Specific compilers/assemblers
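The im2col idea from the first bullet, sketched in NumPy: unfold every k×k patch into a column, so the whole convolution becomes one matrix product that can run on highly tuned GEMM kernels (the helper names are ours):

```python
import numpy as np

def im2col(img, k):
    """Unfold every k×k patch of a 2-D image into one column of a matrix."""
    H, W = img.shape
    out_h, out_w = H - k + 1, W - k + 1
    cols = np.empty((k * k, out_h * out_w))
    for i in range(out_h):
        for j in range(out_w):
            cols[:, i * out_w + j] = img[i:i + k, j:j + k].ravel()
    return cols

def conv2d_im2col(img, kernel):
    """'Valid' cross-correlation as a single matrix-vector product."""
    k = kernel.shape[0]
    out = kernel.ravel() @ im2col(img, k)   # one GEMM replaces the nested loops
    return out.reshape(img.shape[0] - k + 1, img.shape[1] - k + 1)

img = np.arange(25.0).reshape(5, 5)
kern = np.ones((3, 3))
out = conv2d_im2col(img, kern)   # each entry is the sum over a 3×3 window
```

The price is memory: each pixel is duplicated up to k² times in the unfolded matrix.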
Convolution as FFT
• In convolution, each pixel requires an NxN surrounding region
• Convolution can be computed using FFT [Vasilache et al., 2015]:
• The larger the convolution kernel, the better the performance:
[Figure: performance vs. problem size and output size, one panel per kernel size (3x3, 5x5, 7x7, 13x13)]
Vasilache et al. "Fast Convolutional Nets With fbfft: A GPU Performance Evaluation", ICLR 2015
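A NumPy sketch of the frequency-domain route: pad, transform, multiply pointwise, invert, crop to the valid region. Flipping the kernel makes the convolution theorem compute the cross-correlation that CNNs actually use (function name and sizes are ours):

```python
import numpy as np

def conv2d_fft(img, kernel):
    """'Valid' cross-correlation via pointwise multiplication in the frequency domain."""
    H, W = img.shape
    k = kernel.shape[0]
    # zero-pad both operands to the full linear-convolution size
    full_shape = (H + k - 1, W + k - 1)
    F_img = np.fft.rfft2(img, s=full_shape)
    # flip the kernel so the convolution theorem yields cross-correlation
    F_ker = np.fft.rfft2(kernel[::-1, ::-1], s=full_shape)
    full = np.fft.irfft2(F_img * F_ker, s=full_shape)
    return full[k - 1:H, k - 1:W]   # crop to the 'valid' region

img = np.arange(36.0).reshape(6, 6)
kern = np.arange(9.0).reshape(3, 3)
out = conv2d_fft(img, kern)
```

The FFT cost is independent of the kernel size, which is why the figure shows larger kernels benefiting most, while tiny 3×3 kernels can be slower than direct convolution.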
Sacrificing Accuracy for Performance
• Half-precision (16-bit floating point) [Gupta et al., 2015]
  • Memory is stored in 16-bit format
  • Computations are performed in 32 bits
• Uses stochastic rounding:
  • Goal: preserve the expected value of each rounded quantity, E[Round(x)] = x
Gupta et al. "Deep Learning with Limited Numerical Precision." NIPS 2015.
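A sketch of stochastic rounding to a fixed-point grid with FL fractional bits (the function name is ours; the defining property, from Gupta et al., is that the rounding is unbiased):

```python
import numpy as np

def stochastic_round(x, fl, rng):
    """Round to a fixed-point grid with `fl` fractional bits.
    The probability of rounding up equals the distance to the lower
    grid point, so the rounding is unbiased: E[round(x)] = x."""
    scale = 2.0 ** fl
    scaled = np.asarray(x) * scale
    floor = np.floor(scaled)
    p_up = scaled - floor                         # fractional part
    return (floor + (rng.random(floor.shape) < p_up)) / scale

rng = np.random.default_rng(0)
samples = stochastic_round(np.full(100_000, 0.3), fl=2, rng=rng)
# with 2 fractional bits the neighboring grid points are 0.25 and 0.5,
# chosen with probabilities 0.8 and 0.2, so the sample mean stays near 0.3
```

Unbiasedness is what lets the small gradient updates survive low precision: round-to-nearest would silently zero any update smaller than half a grid step.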
Sacrificing Accuracy for Performance
• Results on MNIST with LeNet:
WL = word length (bits), FL = fractional length (bits)
Gupta et al. "Deep Learning with Limited Numerical Precision." NIPS 2015.
Benchmark Insights
According to Convnet Benchmarks (as of October 2015):
• GPU + 16-bit floating point + custom assembler is the fastest
• GPU + 32-bit floating point + FFT is a close runner-up
• FFT-based convolution surpasses direct convolution, unless the kernel size is small (OxfordNet)
• Performance is close to peak hardware throughput (~4 out of 4.7 TFLOP/s)
• Naïve versions can be >16 times slower