
YellowFin and the Art of Momentum Tuning
Jian Zhang¹, Ioannis Mitliagkas²

¹Stanford University, ²MILA, University of Montreal

Adaptive Optimization
Hyperparameter tuning is a major cost of deep learning.

Momentum: a key hyperparameter of SGD and its variants.

Adaptive methods, e.g. Adam [1], do not tune momentum.

YellowFin optimizer
• Based on the robustness properties of momentum.

• Auto-tuning of momentum and learning rate in SGD.

• Closed-loop momentum control for asynchronous training.

Experiments: ResNet and LSTM
YellowFin runs with no tuning.

Adam, momentum SGD, etc. are tuned on learning-rate grids.

YellowFin can outperform the tuned state of the art on training and validation metrics.

Facebook Convolutional Seq-to-Seq model, IWSLT 2014 German-English translation
YellowFin outperforms the hand-tuned default optimizer.

Extension: Closed-loop YellowFin
Asynchronous distributed training is fast: no synchronization barrier.

However, asynchrony induces additional momentum µ̄ on top of the algorithmic momentum µ [Mitliagkas et al. 16].

Can we automatically match the total momentum µ_T to the YF-tuned target µ*?

Closed-loop momentum control

The closed-loop mechanism improves YellowFin in asynchronous training.

Momentum operation
• Momentum SGD step: x_{t+1} = x_t − α∇f(x_t) + µ(x_t − x_{t−1}).
• In a 1-D case, matrix form with ∇f(x_t) = h(x_t)(x_t − x*), where h is the curvature in quadratics:

  (x_{t+1} − x*, x_t − x*)ᵀ = A_t (x_t − x*, x_{t−1} − x*)ᵀ,   A_t = [[1 − αh(x_t) + µ, −µ], [1, 0]]

• Robust region: (1 − √µ)² / h_min ≤ α ≤ (1 + √µ)² / h_max, given h_min ≤ h(x_t) ≤ h_max. Inside it, the spectral radius is ρ(A_t) = √µ, i.e., linear convergence at rate √µ [2].
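
The robust-region claim is easy to check numerically. A minimal sketch (our own, with arbitrary µ and h): build A_t for learning rates spanning the robust interval and confirm that ρ(A_t) = √µ throughout.

# Numerical check: for any alpha with alpha*h in
# [(1 - sqrt(mu))^2, (1 + sqrt(mu))^2], the momentum operator A_t
# has spectral radius exactly sqrt(mu).
import numpy as np

mu, h = 0.9, 4.0
lo = (1 - np.sqrt(mu))**2 / h
hi = (1 + np.sqrt(mu))**2 / h
for alpha in np.linspace(lo, hi, 5):
    A = np.array([[1 - alpha * h + mu, -mu],
                  [1.0, 0.0]])
    rho = max(abs(np.linalg.eigvals(A)))
    print(f"alpha={alpha:.4f}  rho={rho:.4f}  sqrt(mu)={np.sqrt(mu):.4f}")

The learning rate moves across the whole interval while the rate √µ stays fixed, which is exactly the robustness the next panel illustrates.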

Robustness of Momentum
• Linear rate is robust to curvature variation (middle panel).
• Linear rate is robust to a range of learning rates (left panel).

YellowFin
Noisy quadratic model
• Model the stochastic setting with gradient variance C.
• Local quadratic approximation [3], 1-D case: in the robust region, the distance to the optimum is approximately

  E(x_t − x*)² ≈ µᵗ E(x_0 − x*)² + (1 − µᵗ) α²C / (1 − µ)

Greedy tuning strategy
• Solve for the learning rate and momentum in closed form (see the tuner sketch after Principle II below).
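
A Monte-Carlo sanity check of the expression above (our sketch, with arbitrary h, C, µ, x_0 chosen inside the robust region); since the expression is an approximation, expect agreement in magnitude rather than exact equality.

# Monte-Carlo check of the noisy-quadratic approximation above.
# Model: f(x) = (h/2) * (x - x*)^2 with x* = 0, additive gradient
# noise of variance C, momentum SGD run in the robust region.
import numpy as np

rng = np.random.default_rng(0)
h, C, mu, x0 = 2.0, 0.01, 0.25, 3.0
alpha = 1.0 / h                # alpha*h = 1 lies in [(1-sqrt(mu))^2, (1+sqrt(mu))^2]
T, runs = 100, 50000

x = np.full(runs, x0)
x_prev = x.copy()
for t in range(T):
    grad = h * x + np.sqrt(C) * rng.standard_normal(runs)
    x, x_prev = x - alpha * grad + mu * (x - x_prev), x

empirical = np.mean(x**2)      # E(x_T - x*)^2 with x* = 0
predicted = mu**T * x0**2 + (1 - mu**T) * alpha**2 * C / (1 - mu)
print(empirical, predicted)    # same order of magnitude; the formula is approximate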

Closed-loop momentum control: measure the effective total momentum µ_T = µ + µ̄, where µ is the algorithmic momentum and µ̄ is the asynchrony-induced momentum, then steer µ_T toward the YF target value µ* with negative feedback:

  µ ← µ + γ (µ* − µ_T)
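
To make the loop concrete, here is a toy simulation of the feedback rule (our sketch: it assumes a constant µ̄ hidden from the controller and a gain γ = 0.3, whereas the real system estimates µ_T from the training dynamics).

# Feedback control mu <- mu + gamma * (mu* - mu_T), with mu_T = mu + mu_bar.
mu_star = 0.9   # YF target total momentum
mu_bar = 0.2    # asynchrony-induced momentum (unknown to the controller)
gamma = 0.3     # feedback gain (assumed value)
mu = mu_star    # start the algorithmic momentum at the target

for step in range(20):
    mu_T = mu + mu_bar                    # effective total momentum
    mu = mu + gamma * (mu_star - mu_T)    # steer the total toward mu*

print(mu, mu + mu_bar)  # algorithmic mu -> 0.7, total -> mu* = 0.9

The algorithmic momentum backs off by exactly the amount asynchrony adds, so the total momentum lands on the YF target.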

GitHub for PyTorch: https://github.com/JianGoForIt/YellowFin_Pytorch
GitHub for TensorFlow: https://github.com/JianGoForIt/YellowFin
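
For completeness, a minimal PyTorch usage sketch; the import path and the YFOptimizer name are assumptions about the repo above, so check its README for the actual interface.

# Hypothetical usage of the PyTorch repo above. The class name
# YFOptimizer and the import path are assumptions; consult the README.
import torch
import torch.nn as nn
from yellowfin import YFOptimizer  # assumed import path

model = nn.Linear(10, 1)
opt = YFOptimizer(model.parameters())  # momentum and learning rate are auto-tuned

x, y = torch.randn(32, 10), torch.randn(32, 1)
for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()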

Principle I: Stay in the robust region

Principle II: Minimize the expected distance to the optimum after one step (t = 1).
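
At t = 1 the noisy-quadratic expression above reduces to µ·D² + α²C with D² = E(x_0 − x*)²; Principle I forces α ≥ (1 − √µ)²/h_min, and since the objective increases with α that bound is tight. The sketch below is our own numerical version of the greedy tuner (a grid search standing in for the poster's closed-form solution).

# Greedy one-step tuner: minimize mu*D2 + alpha^2*C subject to the
# robust region, solved numerically on a grid.
import numpy as np

def greedy_tune(D2, C, h_min, h_max, grid=10000):
    r = h_max / h_min
    mu_low = ((np.sqrt(r) - 1) / (np.sqrt(r) + 1))**2  # smallest mu with a nonempty robust interval
    mus = np.linspace(mu_low, 1.0 - 1e-6, grid)
    alphas = (1 - np.sqrt(mus))**2 / h_min             # smallest robust alpha; objective grows with alpha
    objective = mus * D2 + alphas**2 * C
    i = int(np.argmin(objective))
    return mus[i], alphas[i]

mu, alpha = greedy_tune(D2=1.0, C=0.1, h_min=1.0, h_max=100.0)
print(mu, alpha)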


1. [Kingma et al. 15]
3. [Schaul et al. 13]

[Figure panels (training curves):]
• CIFAR100 ResNet-164 (training loss vs. iterations): Adam, YellowFin, Closed-loop YellowFin.
• 2-layer PTB LSTM (validation perplexity vs. iterations): Momentum SGD, Adam, YellowFin.
• 2-layer TinyShakespeare Char-RNN (training loss vs. iterations): Momentum SGD, Adam, YellowFin.
• ResNet-110 CIFAR10 (training loss vs. iterations): Momentum SGD, Adam, YellowFin.
• 3-layer WSJ LSTM (validation F1 vs. iterations): Momentum SGD, Adam, YellowFin, Adagrad, Vanilla SGD.

IWSLT 2014 German-English (ConvSeq2Seq):

                            Val. loss   Val. BLEU@4
Default Nesterov momentum   2.86        30.75
YellowFin                   2.75        31.59

2. Not guaranteed for non-quadratics.
Asynchrony-induced momentum: [Mitliagkas et al. 16]

“Send us your bug reports!”