YellowFin and the Art of Momentum Tuning
Jian Zhang^1, Ioannis Mitliagkas^2
^1 Stanford University, ^2 MILA, University of Montreal
Adaptive Optimization
Hyperparameter tuning is a big cost of deep learning.
Momentum: a key hyperparameter for SGD and its variants.
Adaptive methods, e.g. Adam [1], don't tune momentum.
YellowFin optimizer
• Based on the robustness properties of momentum.
• Auto-tuning of momentum and learning rate in SGD.
• Closed-loop momentum control for async. training.
Experiments
ResNet and LSTM: YellowFin runs with no tuning.
Adam, momentum SGD, etc. are tuned on learning-rate grids.
YellowFin can outperform tuned state-of-the-art optimizers on training and validation metrics (figures below).
Facebook convolutional seq-to-seq model, IWSLT 2014 German-English translation: YellowFin outperforms the hand-tuned default optimizer.

                            Val. loss   Val. BLEU@4
Default Nesterov Momentum   2.86        30.75
YellowFin                   2.75        31.59
Extension: Closed-loop YellowFin
Async. distributed training: fast, no sync. barrier.
However, asynchrony induces additional momentum $\bar{\mu}$ on top of the algorithmic momentum $\mu$ [Mitliagkas et al., 2016], for an effective total momentum $\mu_T = \mu + \bar{\mu}$.
Can we auto-match the total momentum $\mu_T$ to the YF-tuned target value $\mu^\star$?
Closed-loop momentum control: measure $\mu_T$ and apply the feedback rule $\mu \leftarrow \mu + \gamma \cdot (\mu^\star - \mu_T)$.
The closed-loop mechanism improves YellowFin in asynchronous training.
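To make the feedback rule concrete, here is a minimal Python sketch of the controller, not the authors' implementation; the fixed async-induced momentum $\bar{\mu}$ and the gain $\gamma$ below are illustrative stand-ins for quantities YellowFin measures during training.

```python
def closed_loop_step(mu, mu_star, mu_T, gamma=0.01):
    """One feedback-control update: mu <- mu + gamma * (mu_star - mu_T)."""
    return mu + gamma * (mu_star - mu_T)

# Toy loop with a fixed async-induced momentum mu_bar (assumed measurable);
# the controller settles where mu + mu_bar ~= mu_star.
mu, mu_bar, mu_star = 0.9, 0.15, 0.9
for _ in range(1000):
    mu_T = mu + mu_bar               # effective total momentum
    mu = closed_loop_step(mu, mu_star, mu_T)
print(f"algorithmic mu = {mu:.3f}, total mu_T = {mu + mu_bar:.3f}")
```

The fixed point of this loop is $\mu = \mu^\star - \bar{\mu}$, i.e., the algorithmic momentum is lowered exactly enough that the total momentum matches the YF target.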
Robustness of Momentum
Momentum operation
• SGD step: $x_{t+1} = x_t - \alpha \nabla f(x_t) + \mu (x_t - x_{t-1})$.
• In a 1-D case, with $\nabla f(x_t) = h(x_t)(x_t - x^\star)$, where $h$ is the curvature in quadratics, the matrix form is
$$\begin{pmatrix} x_{t+1} - x^\star \\ x_t - x^\star \end{pmatrix} = \begin{pmatrix} 1 - \alpha h(x_t) + \mu & -\mu \\ 1 & 0 \end{pmatrix} \begin{pmatrix} x_t - x^\star \\ x_{t-1} - x^\star \end{pmatrix} \triangleq A_t \begin{pmatrix} x_t - x^\star \\ x_{t-1} - x^\star \end{pmatrix}$$
Robust region: given $h_{\min} \le h(x_t) \le h_{\max}$, every curvature satisfies $(1-\sqrt{\mu})^2 \le \alpha h(x_t) \le (1+\sqrt{\mu})^2$ when
$$\frac{(1-\sqrt{\mu})^2}{h_{\min}} \le \alpha \le \frac{(1+\sqrt{\mu})^2}{h_{\max}}.$$
In the robust region, the spectral radius is $\rho(A_t) = \sqrt{\mu}$, giving linear convergence at rate $\sqrt{\mu}$ [2] (numerical check below).
• Linear rate robust to curvature variance (middle panel).
• Linear rate robust to a range of learning rates (left panel).
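A minimal numerical check of the claims above, assuming an illustrative 1-D quadratic and the boundary learning rate $\alpha = (1-\sqrt{\mu})^2/h$; this is a sketch, not code from the poster.

```python
import numpy as np

h, x_star = 2.0, 0.0                 # illustrative curvature and optimum
mu = 0.5                             # momentum
alpha = (1 - np.sqrt(mu))**2 / h     # boundary of the robust region: alpha*h = (1-sqrt(mu))^2

# Momentum operator A_t from the matrix form of the update.
A = np.array([[1 - alpha * h + mu, -mu],
              [1.0, 0.0]])
rho = max(abs(np.linalg.eigvals(A)))
print(f"spectral radius = {rho:.4f}, sqrt(mu) = {np.sqrt(mu):.4f}")  # both ~0.7071

# SGD step with momentum: x_{t+1} = x_t - alpha * grad f(x_t) + mu * (x_t - x_{t-1})
x_prev = x = 5.0
for t in range(50):
    grad = h * (x - x_star)          # grad f(x) = h * (x - x*) for a quadratic
    x, x_prev = x - alpha * grad + mu * (x - x_prev), x
print(f"|x_50 - x*| = {abs(x - x_star):.2e}")  # decays roughly like sqrt(mu)^t
```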
YellowFin
Noisy quadratic model
• Model the stochastic setting with gradient variance $C$.
• Local quadratic approximation [3]: in the 1-D case, within the robust region, the distance to the optimum is approximated by
$$\mathbb{E}(x_t - x^\star)^2 \approx \mu^t\, \mathbb{E}(x_0 - x^\star)^2 + (1-\mu^t)\,\frac{\alpha^2 C}{1-\mu}.$$
Greedy tuning strategy
• Principle I: stay in the robust region.
• Principle II: minimize $\mathbb{E}(x_t - x^\star)^2$ after one step ($t = 1$).
• Solve for the learning rate and momentum in closed form (numerical sketch below).
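The sketch below makes the greedy one-step problem concrete via a grid scan rather than the paper's closed-form solution: at $t = 1$ the bound reduces to $\mu D^2 + \alpha^2 C$, with $\alpha$ pinned to the robust-region boundary. The estimates `h_min`, `h_max`, `C`, and `D2` are illustrative inputs; in YellowFin they come from running measurements.

```python
import numpy as np

def single_step_tune(h_min, h_max, C, D2, n_grid=10_000):
    """Grid scan of the one-step objective mu*D2 + alpha^2*C subject to
    the robust-region constraints (Principles I and II)."""
    kappa = h_max / h_min
    # Principle I: mu must be large enough to keep all curvatures robust.
    mu_lo = ((np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1))**2
    mus = np.linspace(mu_lo, 0.999, n_grid)
    alphas = (1 - np.sqrt(mus))**2 / h_min       # robust-region learning rate
    objs = mus * D2 + alphas**2 * C              # E(x_1 - x*)^2 under the model
    i = int(np.argmin(objs))
    return mus[i], alphas[i]

# Illustrative estimates of curvature range, gradient variance, and distance.
mu, alpha = single_step_tune(h_min=0.1, h_max=10.0, C=1.0, D2=25.0)
print(f"mu = {mu:.4f}, alpha = {alpha:.4f}")
```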
GitHub for PyTorch: https://github.com/JianGoForIt/YellowFin_Pytorch
GitHub for TensorFlow: https://github.com/JianGoForIt/YellowFin
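A hedged usage sketch for the PyTorch package linked above; the `yellowfin` import path and `YFOptimizer` constructor follow the repo's README as I recall it and may have changed, so check the repository for the current API.

```python
# Hedged usage sketch for YellowFin_Pytorch; YFOptimizer and its import
# path are assumptions taken from the repo's README.
import torch
from yellowfin import YFOptimizer

model = torch.nn.Linear(10, 1)
# Momentum and learning rate are auto-tuned; initial values are defaults.
optimizer = YFOptimizer(model.parameters())

x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()   # YellowFin updates its own mu and lr estimates here
```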
[Figures: training loss vs. iterations for ResNet 164 on CIFAR100 (Adam, YellowFin, Closed-loop YellowFin); validation perplexity for a 2-layer PTB LSTM (Momentum SGD, Adam, YellowFin); training loss for a 2-layer TinyShakespeare Char-RNN (Momentum SGD, Adam, YellowFin); training loss for ResNet 110 on CIFAR10 (Momentum SGD, Adam, YellowFin); validation F1 for a 3-layer WSJ LSTM (Vanilla SGD, Adagrad, Momentum SGD, Adam, YellowFin).]
Footnotes
1. [Kingma et al., 2015]
2. Not guaranteed for non-quadratics.
3. [Schaul et al., 2013]
[Mitliagkas et al., 2016]
“Send us your bug reports!”