Lecture 8: Training Neural Networks, Part 2


Transcript of Lecture 8: Training Neural Networks, Part 2



Administrative:
- A1 grades released: Check Piazza for regrade policy
- Project proposal due yesterday
- A2 due Wednesday 5/1


Last time: Activation Functions
- Sigmoid
- tanh
- ReLU (good default choice)
- Leaky ReLU
- Maxout
- ELU


Last time: Weight Initialization
- Initialization too small: activations go to zero, gradients also zero, no learning =(
- Initialization too big: activations saturate (for tanh), gradients zero, no learning =(
- Initialization just right: nice distribution of activations at all layers, learning proceeds nicely


Last time: Data Preprocessing


Last time: Batch Normalization [Ioffe and Szegedy, 2015]

Input x, shape N x D
Per-channel mean μ, shape D
Per-channel variance σ², shape D
Normalized x_hat = (x - μ) / sqrt(σ² + ε), shape N x D

Learnable scale and shift parameters γ, β:
Output y = γ x_hat + β, shape N x D

Learning γ = σ, β = μ will recover the identity function!


Today
- Improve your training error:
  - Optimizers
  - Learning rate schedules
- Improve your test error:
  - Regularization
  - Choosing Hyperparameters


Optimization

[Figure: loss landscape over weights W_1 and W_2]


Optimization: Problems with SGD

What if loss changes quickly in one direction and slowly in another? What does gradient descent do?

Loss function has high condition number: ratio of largest to smallest singular value of the Hessian matrix is large.


Very slow progress along the shallow dimension, jitter along the steep direction.


Optimization: Problems with SGD

What if the loss function has a local minimum or saddle point?


Zero gradient; gradient descent gets stuck.


Saddle points are much more common in high dimensions.

Dauphin et al, “Identifying and attacking the saddle point problem in high-dimensional non-convex optimization”, NIPS 2014


Also, our gradients come from minibatches, so they can be noisy!


SGD + Momentum
- Build up “velocity” as a running mean of gradients
- Rho gives “friction”; typically rho = 0.9 or 0.99

Sutskever et al, “On the importance of initialization and momentum in deep learning”, ICML 2013
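The slide's side-by-side update rules are not in the transcript; here is a minimal NumPy-style sketch of the two rules being compared (vanilla SGD and SGD+Momentum), with placeholder names rather than the slide's exact notation:

import numpy as np

def sgd_step(x, grad, learning_rate=1e-2):
    # Vanilla SGD: step directly along the negative gradient
    return x - learning_rate * grad

def sgd_momentum_step(x, v, grad, learning_rate=1e-2, rho=0.9):
    # Build up velocity as a running mean of gradients, then step along it;
    # rho plays the role of "friction"
    v = rho * v + grad
    x = x - learning_rate * v
    return x, v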


You may see SGD+Momentum formulated in different ways, but they are equivalent: they give the same sequence of x.


[Figures: SGD vs. SGD+Momentum trajectories on local minima, saddle points, poor conditioning, and gradient noise]


SGD+Momentum update: combine the gradient at the current point with the velocity to get the step used to update the weights.

[Figure: gradient, velocity, and actual step vectors]

Nesterov, “A method of solving a convex programming problem with convergence rate O(1/k^2)”, 1983
Nesterov, “Introductory lectures on convex optimization: a basic course”, 2004
Sutskever et al, “On the importance of initialization and momentum in deep learning”, ICML 2013


Nesterov Momentum

“Look ahead” to the point where updating using the velocity would take us; compute the gradient there and mix it with the velocity to get the actual update direction.

[Figure: momentum update vs. Nesterov momentum update, showing gradient, velocity, and actual step vectors]




Annoying: usually we want the update in terms of x_t and ∇f(x_t).


Change of variables and rearrange:
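The rearranged equations are not in the transcript; this is a minimal sketch of the commonly used form of the Nesterov update after the change of variables, assuming grad_fn returns the gradient at the current (already looked-ahead) point:

def nesterov_momentum_step(x, v, grad_fn, learning_rate=1e-2, rho=0.9):
    # Velocity update uses the gradient evaluated at the current point x
    v_prev = v
    v = rho * v - learning_rate * grad_fn(x)
    # The extra rho * (v - v_prev) term comes from the change of variables,
    # so the update is written directly in terms of x_t and its gradient
    x = x + v + rho * (v - v_prev)
    return x, v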



Nesterov Momentum

[Figure: optimization trajectories for SGD, SGD+Momentum, and Nesterov momentum]


AdaGrad

Added element-wise scaling of the gradient based on the historical sum of squares in each dimension

“Per-parameter learning rates” or “adaptive learning rates”

Duchi et al, “Adaptive subgradient methods for online learning and stochastic optimization”, JMLR 2011
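A minimal NumPy-style sketch of the AdaGrad update described above; grad_squared is the running sum of squared gradients, and eps is a small assumed constant to avoid division by zero:

import numpy as np

def adagrad_step(x, grad_squared, grad, learning_rate=1e-2, eps=1e-7):
    # Accumulate the historical sum of squared gradients, element-wise
    grad_squared = grad_squared + grad * grad
    # Per-parameter learning rate: each dimension is scaled by its own history
    x = x - learning_rate * grad / (np.sqrt(grad_squared) + eps)
    return x, grad_squared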




Q: What happens with AdaGrad? Progress along “steep” directions is damped; progress along “flat” directions is accelerated.




Q2: What happens to the step size over a long time? It decays to zero.


RMSProp: “Leaky AdaGrad”

Tieleman and Hinton, 2012
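A minimal sketch of the “leaky” version: the only change from the AdaGrad sketch above is that the squared-gradient history decays; decay_rate = 0.99 is an assumed typical value:

import numpy as np

def rmsprop_step(x, grad_squared, grad, learning_rate=1e-3,
                 decay_rate=0.99, eps=1e-7):
    # Leaky accumulation: old history decays instead of growing forever,
    # so the effective step size does not shrink to zero
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * grad * grad
    x = x - learning_rate * grad / (np.sqrt(grad_squared) + eps)
    return x, grad_squared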


[Figure: optimization trajectories for SGD, SGD+Momentum, and RMSProp]


Adam (almost)

Kingma and Ba, “Adam: A method for stochastic optimization”, ICLR 2015


The first moment acts like momentum and the second moment acts like AdaGrad/RMSProp: sort of like RMSProp with momentum.

Q: What happens at the first timestep?


Adam (full form)

Kingma and Ba, “Adam: A method for stochastic optimization”, ICLR 2015

Bias correction for the fact that the first and second moment estimates start at zero.


Adam with beta1 = 0.9, beta2 = 0.999, and learning_rate = 1e-3 or 5e-4 is a great starting point for many models!
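A minimal NumPy-style sketch of the full Adam update with bias correction, using the defaults quoted above (t is the timestep, starting at 1):

import numpy as np

def adam_step(x, m, v, grad, t, learning_rate=1e-3,
              beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # momentum-like first moment
    v = beta2 * v + (1 - beta2) * grad * grad   # AdaGrad/RMSProp-like second moment
    m_hat = m / (1 - beta1 ** t)                # bias correction: both moment
    v_hat = v / (1 - beta2 ** t)                # estimates start at zero
    x = x - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v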


Adam

[Figure: optimization trajectories for SGD, SGD+Momentum, RMSProp, and Adam]


SGD, SGD+Momentum, Adagrad, RMSProp, Adam all have learning rate as a hyperparameter.

Q: Which one of these learning rates is best to use?



A: All of them! Start with a large learning rate and decay it over time.


Learning Rate Decay

Step: Reduce the learning rate at a few fixed points. E.g. for ResNets, multiply LR by 0.1 after epochs 30, 60, and 90.


Cosine: α_t = (α_0 / 2) (1 + cos(t π / T))

α_0: initial learning rate
α_t: learning rate at epoch t
T: total number of epochs

Loshchilov and Hutter, “SGDR: Stochastic Gradient Descent with Warm Restarts”, ICLR 2017
Radford et al, “Improving Language Understanding by Generative Pre-Training”, 2018
Feichtenhofer et al, “SlowFast Networks for Video Recognition”, arXiv 2018
Child et al, “Generating Long Sequences with Sparse Transformers”, arXiv 2019




Linear: α_t = α_0 (1 - t / T)

Devlin et al, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, 2018


Inverse sqrt: α_t = α_0 / sqrt(t)

Vaswani et al, “Attention is all you need”, NIPS 2017
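A minimal sketch of the three decay schedules above as functions of the epoch; the +1 in the inverse-sqrt branch is an assumption to keep the value finite at t = 0:

import numpy as np

def lr_at_epoch(t, T, alpha_0=0.1, schedule="cosine"):
    # t: current epoch, T: total number of epochs, alpha_0: initial learning rate
    if schedule == "cosine":
        return 0.5 * alpha_0 * (1 + np.cos(np.pi * t / T))
    if schedule == "linear":
        return alpha_0 * (1 - t / T)
    if schedule == "inverse_sqrt":
        return alpha_0 / np.sqrt(t + 1)
    raise ValueError("unknown schedule: " + schedule)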


Learning Rate Decay: Linear Warmup

High initial learning rates can make the loss explode; linearly increasing the learning rate from 0 over the first ~5000 iterations can prevent this.

Empirical rule of thumb: if you increase the batch size by N, also scale the initial learning rate by N.

Goyal et al, “Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour”, arXiv 2017
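A minimal sketch of linear warmup; the 5000-iteration ramp follows the figure quoted above, and base_lr stands in for whatever decay schedule takes over afterwards:

def lr_with_warmup(it, warmup_iters=5000, base_lr=0.1):
    # Ramp the learning rate linearly from 0 to base_lr, then hand off
    # to the regular decay schedule
    if it < warmup_iters:
        return base_lr * it / warmup_iters
    return base_lr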




First-Order Optimization

(1) Use the gradient to form a linear approximation
(2) Step to minimize the approximation

[Figure: loss as a function of w1, with a linear approximation at the current point]


Second-Order Optimization

(1) Use the gradient and Hessian to form a quadratic approximation
(2) Step to the minimum of the approximation

[Figure: loss as a function of w1, with a quadratic approximation at the current point]


Second-order Taylor expansion:
J(θ) ≈ J(θ_0) + (θ - θ_0)^T ∇J(θ_0) + 1/2 (θ - θ_0)^T H (θ - θ_0)

Solving for the critical point we obtain the Newton parameter update:
θ* = θ_0 - H^{-1} ∇J(θ_0)

Q: Why is this bad for deep learning?


Hessian has O(N^2) elements; inverting takes O(N^3); N = (tens or hundreds of) millions.


Second-Order Optimization

- Quasi-Newton methods (BFGS most popular): instead of inverting the Hessian (O(n^3)), approximate the inverse Hessian with rank-1 updates over time (O(n^2) each).

- L-BFGS (Limited memory BFGS): does not form/store the full inverse Hessian.


L-BFGS

- Usually works very well in full-batch, deterministic mode, i.e. if you have a single, deterministic f(x) then L-BFGS will probably work very nicely.

- Does not transfer very well to the mini-batch setting; gives bad results. Adapting second-order methods to the large-scale, stochastic setting is an active area of research.

Le et al, “On optimization methods for deep learning”, ICML 2011
Ba et al, “Distributed second-order optimization using Kronecker-factored approximations”, ICLR 2017


In practice:

- Adam is a good default choice in many cases; it often works ok even with a constant learning rate.
- SGD+Momentum can outperform Adam but may require more tuning of the learning rate and schedule. Try a cosine schedule: very few hyperparameters!
- If you can afford to do full batch updates then try out L-BFGS (and don't forget to disable all sources of noise).


Beyond Training Error

Better optimization algorithms help reduce training loss

But we really care about error on new data - how to reduce the gap?


Early Stopping: Always do this

[Figures: training loss vs. iteration, and train/val accuracy vs. iteration, with the point of peak val accuracy marked “stop training here”]

Stop training the model when accuracy on the validation set decreases. Or train for a long time, but always keep track of the model snapshot that worked best on val.


Model Ensembles

1. Train multiple independent models
2. At test time average their results (take the average of the predicted probability distributions, then choose the argmax)

Enjoy 2% extra performance


Model Ensembles: Tips and Tricks

Instead of training independent models, use multiple snapshots of a single model during training!

Loshchilov and Hutter, “SGDR: Stochastic gradient descent with restarts”, arXiv 2016
Huang et al, “Snapshot ensembles: train 1, get M for free”, ICLR 2017
Figures copyright Yixuan Li and Geoff Pleiss, 2017. Reproduced with permission.


Cyclic learning rate schedules can make this work even better!


Instead of using the actual parameter vector, keep a moving average of the parameter vector and use that at test time (Polyak averaging).

Polyak and Juditsky, “Acceleration of stochastic approximation by averaging”, SIAM Journal on Control and Optimization, 1992.
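The slide gives no code; a minimal sketch of keeping such a moving average of the parameters, where beta is an assumed decay hyperparameter (e.g. 0.999):

def polyak_update(ema_params, params, beta=0.999):
    # Keep an exponential moving average of the parameters;
    # use ema_params instead of params at test time
    return [beta * e + (1 - beta) * p for e, p in zip(ema_params, params)]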


How to improve single-model performance?

Regularization


Regularization: Add a term to the loss

L = (1/N) Σ_i L_i(f(x_i, W), y_i) + λ R(W)

In common use:
- L2 regularization (weight decay): R(W) = Σ_k Σ_l W_{k,l}²
- L1 regularization: R(W) = Σ_k Σ_l |W_{k,l}|
- Elastic net (L1 + L2): R(W) = Σ_k Σ_l (β W_{k,l}² + |W_{k,l}|)


Regularization: Dropout

In each forward pass, randomly set some neurons to zero. The probability of dropping is a hyperparameter; 0.5 is common.

Srivastava et al, “Dropout: A simple way to prevent neural networks from overfitting”, JMLR 2014


Example forward pass with a 3-layer network using dropout:
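The code from the slide is not in the transcript; below is a sketch in the same spirit: a 3-layer ReLU network with vanilla dropout on both hidden layers, where p is the keep probability (so units are dropped with probability 1 - p):

import numpy as np

p = 0.5  # probability of keeping a unit active

def train_step(X, W1, b1, W2, b2, W3, b3):
    # Forward pass of a 3-layer net with (vanilla) dropout at train time
    H1 = np.maximum(0, X.dot(W1) + b1)
    U1 = np.random.rand(*H1.shape) < p   # first dropout mask
    H1 = H1 * U1                         # drop
    H2 = np.maximum(0, H1.dot(W2) + b2)
    U2 = np.random.rand(*H2.shape) < p   # second dropout mask
    H2 = H2 * U2                         # drop
    return H2.dot(W3) + b3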


How can this possibly be a good idea?

Forces the network to have a redundant representation; prevents co-adaptation of features.

[Figure: a “cat score” computed from features like “has an ear”, “has a tail”, “is furry”, “has claws”, “mischievous look”, with some features randomly crossed out by dropout]


Another interpretation:

Dropout is training a large ensemble of models (that share parameters). Each binary mask is one model.

An FC layer with 4096 units has 2^4096 ~ 10^1233 possible masks! Only ~10^82 atoms in the universe...


Dropout: Test time

Dropout makes our output random: y = f_W(x, z), where x is the input (image), y is the output (label), and z is a random mask.

Want to “average out” the randomness at test time:
y = f(x) = E_z[f(x, z)] = ∫ p(z) f(x, z) dz

But this integral seems hard …


Want to approximate the integral. Consider a single neuron a with inputs x, y and weights w1, w2.


At test time we have: E[a] = w1 x + w2 y


During training we have:
E[a] = 1/4 (w1 x + w2 y) + 1/4 (w1 x + 0·y) + 1/4 (0·x + w2 y) + 1/4 (0·x + 0·y) = 1/2 (w1 x + w2 y)


At test time, multiply by the dropout probability.


At test time all neurons are always active => we must scale the activations so that, for each neuron, output at test time = expected output at training time.


Dropout Summary
- Drop in the forward pass (at train time)
- Scale at test time
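Pairing with the train-time sketch above (same placeholder names), a minimal test-time forward pass where every unit is active and each layer's output is scaled by the keep probability p:

import numpy as np

p = 0.5  # keep probability used at training time

def predict(X, W1, b1, W2, b2, W3, b3):
    # All neurons active; scale activations so that, for each neuron,
    # output at test time = expected output at training time
    H1 = np.maximum(0, X.dot(W1) + b1) * p
    H2 = np.maximum(0, H1.dot(W2) + b2) * p
    return H2.dot(W3) + b3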


More common: “Inverted dropout”

test time is unchanged!
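A minimal sketch of inverted dropout with the same placeholder names: the mask is divided by p at train time, so the test-time forward pass needs no scaling at all:

import numpy as np

p = 0.5  # keep probability

def train_step(X, W1, b1, W2, b2, W3, b3):
    # Scale by 1/p at train time ...
    H1 = np.maximum(0, X.dot(W1) + b1)
    H1 = H1 * (np.random.rand(*H1.shape) < p) / p
    H2 = np.maximum(0, H1.dot(W2) + b2)
    H2 = H2 * (np.random.rand(*H2.shape) < p) / p
    return H2.dot(W3) + b3

def predict(X, W1, b1, W2, b2, W3, b3):
    # ... so the test-time forward pass is completely unchanged
    H1 = np.maximum(0, X.dot(W1) + b1)
    H2 = np.maximum(0, H1.dot(W2) + b2)
    return H2.dot(W3) + b3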


Regularization: A common pattern

Training: Add some kind of randomness
Testing: Average out randomness (sometimes approximate)


Example: Batch Normalization
Training: Normalize using stats from random minibatches
Testing: Use fixed stats to normalize


Regularization: Data Augmentation

[Figure: load image and label (“cat”) -> CNN -> compute loss]

This image by Nikita is licensed under CC-BY 2.0


[Figure: load image and label (“cat”) -> transform image -> CNN -> compute loss]


Data Augmentation: Horizontal Flips


Data Augmentation: Random crops and scales

Training: sample random crops / scales. ResNet:
1. Pick a random L in range [256, 480]
2. Resize the training image, short side = L
3. Sample a random 224 x 224 patch


Testing: average a fixed set of crops. ResNet:
1. Resize the image at 5 scales: {224, 256, 384, 480, 640}
2. For each size, use 10 224 x 224 crops: 4 corners + center, plus flips
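A rough sketch of the ResNet-style training-time crop/scale sampling described above; it uses a crude nearest-neighbor resize so it stays NumPy-only, whereas a real pipeline would use a proper image-resizing library:

import numpy as np

def random_resized_crop(img, short_side_range=(256, 480), crop=224):
    # img: H x W x 3 array. Pick a random L, resize so the short side
    # equals L (nearest-neighbor here), then take a random crop x crop patch.
    H, W, _ = img.shape
    L = np.random.randint(short_side_range[0], short_side_range[1] + 1)
    scale = L / min(H, W)
    newH, newW = int(round(H * scale)), int(round(W * scale))
    ys = np.clip((np.arange(newH) / scale).astype(int), 0, H - 1)
    xs = np.clip((np.arange(newW) / scale).astype(int), 0, W - 1)
    resized = img[ys][:, xs]
    y0 = np.random.randint(0, newH - crop + 1)
    x0 = np.random.randint(0, newW - crop + 1)
    return resized[y0:y0 + crop, x0:x0 + crop]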


Data Augmentation: Color Jitter

Simple: Randomize contrast and brightness


More complex (as seen in [Krizhevsky et al. 2012], ResNet, etc.):
1. Apply PCA to all [R, G, B] pixels in the training set
2. Sample a “color offset” along the principal component directions
3. Add the offset to all pixels of a training image


Data Augmentation: Get creative for your problem!

Random mixes / combinations of:
- translation
- rotation
- stretching
- shearing
- lens distortions, ... (go crazy)


Regularization: A common pattern
Training: Add random noise
Testing: Marginalize over the noise

Examples: Dropout, Batch Normalization, Data Augmentation


Regularization: DropConnect
Training: Drop connections between neurons (set weights to 0)
Testing: Use all the connections

Wan et al, “Regularization of Neural Networks using DropConnect”, ICML 2013


Regularization: Fractional Pooling
Training: Use randomized pooling regions
Testing: Average predictions from several regions

Graham, “Fractional Max Pooling”, arXiv 2014


Regularization: Stochastic Depth
Training: Skip some layers in the network
Testing: Use all the layers

Huang et al, “Deep Networks with Stochastic Depth”, ECCV 2016


Regularization: Cutout
Training: Set random image regions to zero
Testing: Use the full image

DeVries and Taylor, “Improved Regularization of Convolutional Neural Networks with Cutout”, arXiv 2017

Works very well for small datasets like CIFAR, less common for large datasets like ImageNet.


Regularization: Mixup
Training: Train on random blends of images
Testing: Use original images

Zhang et al, “mixup: Beyond Empirical Risk Minimization”, ICLR 2018

Randomly blend the pixels of pairs of training images, e.g. 40% cat and 60% dog; the CNN target label is then cat: 0.4, dog: 0.6.
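A minimal sketch of the mixup blend for one pair of examples; the Beta(alpha, alpha) sampling follows the cited paper, and one-hot label vectors are assumed:

import numpy as np

def mixup_pair(x1, y1, x2, y2, alpha=1.0):
    # Blend two images and their one-hot labels with a random weight lam
    lam = np.random.beta(alpha, alpha)
    x = lam * x1 + (1 - lam) * x2
    y = lam * y1 + (1 - lam) * y2   # e.g. cat: 0.4, dog: 0.6 when lam = 0.4
    return x, y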


Regularization
Training: Add random noise
Testing: Marginalize over the noise

Examples: Dropout, Batch Normalization, Data Augmentation, DropConnect, Fractional Max Pooling, Stochastic Depth, Cutout, Mixup

- Consider dropout for large fully-connected layers
- Batch normalization and data augmentation are almost always a good idea
- Try cutout and mixup, especially for small classification datasets


Choosing Hyperparameters (without tons of GPUs)


Choosing Hyperparameters

Step 1: Check initial loss

Turn off weight decay and sanity-check the loss at initialization, e.g. log(C) for softmax with C classes.
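A quick check of that sanity test, assuming a softmax classifier with C = 10 classes (the class count is just an example):

import numpy as np

C = 10
print(np.log(C))  # ~2.303: the expected initial loss when the classifier
                  # assigns uniform probability 1/C to every class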


Step 2: Overfit a small sample

Try to train to 100% training accuracy on a small sample of training data (~5-10 minibatches); fiddle with architecture, learning rate, weight initialization.

Loss not going down? LR too low, bad initialization.
Loss explodes to Inf or NaN? LR too high, bad initialization.


Step 3: Find LR that makes the loss go down

Use the architecture from the previous step, use all training data, turn on small weight decay, and find a learning rate that makes the loss drop significantly within ~100 iterations.

Good learning rates to try: 1e-1, 1e-2, 1e-3, 1e-4


Step 4: Coarse grid, train for ~1-5 epochs

Choose a few values of learning rate and weight decay around what worked in Step 3, and train a few models for ~1-5 epochs.

Good weight decay values to try: 1e-4, 1e-5, 0


Step 5: Refine grid, train longer

Pick the best models from Step 4 and train them longer (~10-20 epochs) without learning rate decay.


Step 6: Look at loss curves


Look at learning curves! Losses may be noisy; use a scatter plot and also plot a moving average to see trends better.

[Figures: training loss and train / val accuracy over time]


[Figure: loss vs. time] Bad initialization is a prime suspect.


[Figure: loss vs. time] Loss plateaus: try learning rate decay.


[Figure: loss vs. time, with a learning rate step decay] Loss was still going down when the learning rate dropped: you decayed too early!


[Figure: train / val accuracy vs. time] Accuracy is still going up: you need to train longer.


[Figure: train / val accuracy vs. time] Huge train / val gap means overfitting! Increase regularization, get more data.


[Figure: train / val accuracy vs. time] No gap between train / val means underfitting: train longer, use a bigger model.


Choosing Hyperparameters

Step 1: Check initial loss
Step 2: Overfit a small sample
Step 3: Find LR that makes loss go down
Step 4: Coarse grid, train for ~1-5 epochs
Step 5: Refine grid, train longer
Step 6: Look at loss curves
Step 7: GOTO step 5


Hyperparameters to play with:
- network architecture
- learning rate, its decay schedule, update type
- regularization (L2 / dropout strength)

[Image captioned “neural networks practitioner”, “music = loss function”]
This image by Paolo Guereta is licensed under CC-BY 2.0


Cross-validation “command center”


Random Search vs. Grid Search

[Figure: grid layout vs. random layout of hyperparameter samples, with axes “important parameter” and “unimportant parameter”]

Random Search for Hyper-Parameter Optimization, Bergstra and Bengio, 2012
Illustration of Bergstra et al., 2012 by Shayne Longpre, copyright CS231n 2017
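A minimal sketch of random hyperparameter search in log space; the specific ranges for learning rate and weight decay below are assumptions, not values from the slide:

import numpy as np

def sample_random_hyperparams(num_trials=20):
    # Sample log-uniformly rather than on a grid, so the important
    # parameter gets probed at many distinct values
    trials = []
    for _ in range(num_trials):
        lr = 10 ** np.random.uniform(-4, -1)
        wd = 10 ** np.random.uniform(-6, -2)
        trials.append({"learning_rate": lr, "weight_decay": wd})
    return trials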


Track the ratio of weight updates / weight magnitudes:

Ratio between the updates and the values: ~0.0002 / 0.02 = 0.01 (about okay). You want this to be somewhere around 0.001 or so.
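A minimal sketch of computing that ratio for one weight matrix; W, dW, and learning_rate are placeholder names for the parameters, their gradient, and the current step size:

import numpy as np

def update_ratio(W, dW, learning_rate):
    # ||update|| / ||weights||; roughly 1e-3 is a healthy value
    update = -learning_rate * dW
    return np.linalg.norm(update.ravel()) / np.linalg.norm(W.ravel())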


Summary
- Improve your training error:
  - Optimizers
  - Learning rate schedules
- Improve your test error:
  - Regularization
  - Choosing Hyperparameters


Next time: CNN Architecture Design