Lecture 6: Training Neural Networks, Part I


Transcript of Lecture 6: Training Neural Networks, Part I



Administrative

Assignment 1 due Thursday (today), 11:59pm on Canvas

Assignment 2 out today

Project proposal due Tuesday April 25

Notes on backprop for a linear layer and on vector/tensor derivatives are linked to Lecture 4 on the syllabus


Where we are now...

Computational graphs

[Figure: a computational graph where inputs x and W feed a multiply node (*) producing scores s; the hinge loss on the scores and the regularization term R are summed (+) to give the total loss L]


Where we are now...

Linear score function: f = Wx

2-layer Neural Network: f = W_2 max(0, W_1 x)

[Figure: x (3072) -> W1 -> h (100) -> W2 -> s (10)]

Neural Networks


Illustration of LeCun et al. 1998 from CS231n 2017 Lecture 1


Where we are now...

Convolutional Neural Networks


Where we are now... Convolutional Layer

[Figure: a 32x32x3 image convolved with a 5x5x3 filter, sliding over all spatial locations, producing a 28x28x1 activation map]


Where we are now... Convolutional Layer

For example, if we had six 5x5 filters, we'd get 6 separate activation maps, each 28x28 (28 = 32 - 5 + 1 with stride 1 and no padding):

[Figure: the 32x32x3 input passed through a convolution layer produces six 28x28 activation maps]

We stack these up to get a "new image" of size 28x28x6!


Where we are now...

Mini-batch SGD

Loop:
1. Sample a batch of data
2. Forward prop it through the graph (network), get loss
3. Backprop to calculate the gradients
4. Update the parameters using the gradient
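As a concrete illustration (not the course's assignment code), here is a minimal, self-contained numpy sketch of that loop for a linear softmax classifier on random data; all shapes and hyperparameters are arbitrary choices for the example:

import numpy as np

N, D, C = 1000, 50, 10                       # examples, features, classes (toy sizes)
X = np.random.randn(N, D)
y = np.random.randint(C, size=N)
W = 0.01 * np.random.randn(D, C)
lr, batch_size = 1e-2, 64

for it in range(200):
    # 1. Sample a batch of data
    idx = np.random.choice(N, batch_size)
    xb, yb = X[idx], y[idx]
    # 2. Forward prop: scores -> softmax probabilities -> loss
    scores = xb.dot(W)
    scores -= scores.max(axis=1, keepdims=True)
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(batch_size), yb]).mean()
    # 3. Backprop to calculate the gradient on W
    dscores = (probs - np.eye(C)[yb]) / batch_size
    dW = xb.T.dot(dscores)
    # 4. Update the parameters using the gradient (vanilla SGD)
    W -= lr * dW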


Next: Training Neural Networks


Overview

1. One time setup: activation functions, preprocessing, weight initialization, regularization, gradient checking
2. Training dynamics: babysitting the learning process, parameter updates, hyperparameter optimization
3. Evaluation: model ensembles


Part 1
- Activation Functions
- Data Preprocessing
- Weight Initialization
- Batch Normalization
- Babysitting the Learning Process
- Hyperparameter Optimization


Activation Functions


Activation Functions: Sigmoid, tanh, ReLU, Leaky ReLU, Maxout, ELU


Activation Functions

Sigmoid: σ(x) = 1 / (1 + e^(-x))

- Squashes numbers to range [0,1]
- Historically popular since they have a nice interpretation as a saturating "firing rate" of a neuron


Activation Functions

Sigmoid

- Squashes numbers to range [0,1]
- Historically popular since they have a nice interpretation as a saturating "firing rate" of a neuron

3 problems:

1. Saturated neurons “kill” the gradients


[Figure: a sigmoid gate with input x inside a computational graph, with an upstream gradient flowing back through it]

What happens when x = -10? What happens when x = 0? What happens when x = 10?
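To make the saturation problem concrete, here is a small numpy check (not from the slides) of the sigmoid's local gradient σ(x)(1 - σ(x)) at those three inputs: it is essentially zero at ±10 and at most 0.25 at 0.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in (-10.0, 0.0, 10.0):
    s = sigmoid(x)
    # Local gradient of the sigmoid gate: d(sigma)/dx = sigma(x) * (1 - sigma(x))
    print(x, s * (1.0 - s))   # ~4.5e-5, 0.25, ~4.5e-5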


Activation Functions

Sigmoid

- Squashes numbers to range [0,1]
- Historically popular since they have a nice interpretation as a saturating "firing rate" of a neuron

3 problems:

1. Saturated neurons “kill” the gradients

2. Sigmoid outputs are not zero-centered


Consider what happens when the input to a neuron (x) is always positive:

What can we say about the gradients on w?


Consider what happens when the input to a neuron is always positive...

What can we say about the gradients on w?
Always all positive or all negative :(
(this is also why you want zero-mean data!)

[Figure: in the plane of the weights, the allowed gradient update directions cover only the all-positive and all-negative quadrants, so reaching a hypothetical optimal w vector requires a zig-zag path]
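A tiny numpy check (an illustration, not slide code) of why this happens: for f = w.x + b, the gradient on w is the upstream gradient times x, so if every x_i > 0 all components of the gradient share one sign.

import numpy as np

x = np.abs(np.random.randn(5)) + 0.1   # strictly positive inputs
for upstream in (+1.0, -1.0):          # sign of dL/df flowing into the neuron
    dw = upstream * x                  # dL/dw_i = (dL/df) * x_i
    print(np.sign(dw))                 # all +1 or all -1: only two allowed directions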


Activation Functions

Sigmoid

- Squashes numbers to range [0,1]
- Historically popular since they have a nice interpretation as a saturating "firing rate" of a neuron

3 problems:

1. Saturated neurons “kill” the gradients

2. Sigmoid outputs are not zero-centered

3. exp() is a bit computationally expensive


Activation Functions

tanh(x)

- Squashes numbers to range [-1,1]
- Zero centered (nice)
- Still kills gradients when saturated :(

[LeCun et al., 1991]


Activation Functions

ReLU (Rectified Linear Unit): computes f(x) = max(0, x)

- Does not saturate (in + region)
- Very computationally efficient
- Converges much faster than sigmoid/tanh in practice (e.g. 6x)
- Actually more biologically plausible than sigmoid

[Krizhevsky et al., 2012]


Activation Functions

ReLU (Rectified Linear Unit): computes f(x) = max(0, x)

- Does not saturate (in + region)
- Very computationally efficient
- Converges much faster than sigmoid/tanh in practice (e.g. 6x)
- Actually more biologically plausible than sigmoid

- Not zero-centered output
- An annoyance:

hint: what is the gradient when x < 0?


[Figure: a ReLU gate with input x inside a computational graph]

What happens when x = -10? What happens when x = 0? What happens when x = 10?
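A small numpy sketch (not from the slides) of the ReLU forward and backward pass, showing how the gradient is killed for negative inputs; treating the kink at x = 0 as having gradient 0 is the usual convention.

import numpy as np

def relu_forward(x):
    return np.maximum(0, x)

def relu_backward(x, upstream):
    # Local gradient is 1 where x > 0 and 0 where x <= 0
    return upstream * (x > 0)

for x in (-10.0, 0.0, 10.0):
    print(x, relu_forward(x), relu_backward(x, upstream=1.0))
# -10 -> output 0, gradient 0 (the gradient is "killed")
#   0 -> output 0, gradient 0 (by convention)
#  10 -> output 10, gradient 1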


[Figure: the training data cloud in input space, with the half-space of an active ReLU crossing the data and the half-space of a dead ReLU lying entirely off the data cloud]

dead ReLU: will never activate => never update


[Figure: the training data cloud in input space, with an active ReLU crossing the data and a dead ReLU lying entirely off the data cloud]

dead ReLU: will never activate => never update

=> people like to initialize ReLU neurons with slightly positive biases (e.g. 0.01)


Activation Functions

Leaky ReLU: f(x) = max(0.01x, x)

- Does not saturate
- Computationally efficient
- Converges much faster than sigmoid/tanh in practice! (e.g. 6x)
- Will not "die".

[Maas et al., 2013] [He et al., 2015]


Activation Functions

Leaky ReLU: f(x) = max(0.01x, x)

- Does not saturate
- Computationally efficient
- Converges much faster than sigmoid/tanh in practice! (e.g. 6x)
- Will not "die".

Parametric Rectifier (PReLU): f(x) = max(αx, x)
backprop into α (parameter)

[Maas et al., 2013] [He et al., 2015]
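A minimal numpy sketch (an illustration, not the papers' code) of the leaky/parametric rectifier and of how the gradient also flows into α:

import numpy as np

def prelu_forward(x, alpha):
    # f(x) = max(alpha * x, x): identity for x > 0, slope alpha for x <= 0
    return np.where(x > 0, x, alpha * x)

def prelu_backward(x, alpha, upstream):
    dx = upstream * np.where(x > 0, 1.0, alpha)
    # PReLU also backprops into alpha: df/dalpha = x where x <= 0, else 0
    dalpha = np.sum(upstream * np.where(x > 0, 0.0, x))
    return dx, dalpha

x = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
out = prelu_forward(x, alpha=0.1)           # a fixed small alpha gives Leaky ReLU
dx, dalpha = prelu_backward(x, 0.1, np.ones_like(x))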


Activation Functions: Exponential Linear Units (ELU)

f(x) = x if x > 0, α(exp(x) - 1) if x ≤ 0

- All benefits of ReLU
- Closer to zero mean outputs
- Negative saturation regime compared with Leaky ReLU adds some robustness to noise
- Computation requires exp()

[Clevert et al., 2015]
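A one-function numpy sketch of the ELU (illustrative; α = 1 is a common default, not something the slide specifies):

import numpy as np

def elu(x, alpha=1.0):
    # Identity for x > 0; alpha * (exp(x) - 1) otherwise, which saturates
    # smoothly toward -alpha for very negative inputs.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

print(elu(np.array([-10.0, -1.0, 0.0, 2.0])))   # ~[-1.0, -0.63, 0.0, 2.0]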


Maxout "Neuron": max(w_1^T x + b_1, w_2^T x + b_2)

- Does not have the basic form of dot product -> nonlinearity
- Generalizes ReLU and Leaky ReLU
- Linear regime! Does not saturate! Does not die!

Problem: doubles the number of parameters/neuron :(

[Goodfellow et al., 2013]
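A small numpy sketch of a maxout layer (illustrative shapes; not the paper's code), showing why it doubles the parameters per unit:

import numpy as np

def maxout(x, W1, b1, W2, b2):
    # Elementwise max of two affine maps; with W2 = 0, b2 = 0 this reduces to ReLU.
    return np.maximum(x.dot(W1) + b1, x.dot(W2) + b2)

D, H = 4, 3
x = np.random.randn(2, D)
W1, W2 = np.random.randn(D, H), np.random.randn(D, H)   # two weight matrices per layer
b1, b2 = np.zeros(H), np.zeros(H)
print(maxout(x, W1, b1, W2, b2).shape)                   # (2, 3)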


TLDR: In practice:

- Use ReLU. Be careful with your learning rates
- Try out Leaky ReLU / Maxout / ELU
- Try out tanh but don't expect much
- Don't use sigmoid


Data Preprocessing


Step 1: Preprocess the data

(Assume X [NxD] is data matrix, each example in a row)


Remember: Consider what happens when the input to a neuron is always positive...

What can we say about the gradients on w?
Always all positive or all negative :(
(this is also why you want zero-mean data!)

[Figure: in the plane of the weights, the allowed gradient update directions cover only the all-positive and all-negative quadrants, so reaching a hypothetical optimal w vector requires a zig-zag path]


Step 1: Preprocess the data

(Assume X [NxD] is data matrix, each example in a row)


Step 1: Preprocess the data

In practice, you may also see PCA and Whitening of the data
(after PCA the data has a diagonal covariance matrix; after whitening the covariance matrix is the identity matrix)
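A minimal numpy sketch of these preprocessing steps on a random [N x D] matrix (illustrative; the small epsilons are assumptions for numerical stability):

import numpy as np

X = np.random.randn(500, 64)             # assume X is [N x D], one example per row

# Zero-center and normalize (the common case)
X -= np.mean(X, axis=0)
X /= np.std(X, axis=0) + 1e-8

# PCA: rotate into the eigenbasis of the covariance -> decorrelated data
cov = X.T.dot(X) / X.shape[0]
U, S, _ = np.linalg.svd(cov)
Xrot = X.dot(U)                           # diagonal covariance

# Whitening: additionally divide by the square root of the eigenvalues
Xwhite = Xrot / np.sqrt(S + 1e-5)         # covariance ~ identity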


TLDR: In practice for images: center only

- Subtract the mean image (e.g. AlexNet) (mean image = [32,32,3] array)
- Subtract per-channel mean (e.g. VGGNet) (mean along each channel = 3 numbers)

e.g. consider CIFAR-10 with [32,32,3] images

Not common to normalize variance, or to do PCA or whitening


Weight Initialization


- Q: what happens when W=0 init is used?


- First idea: Small random numbers (gaussian with zero mean and 1e-2 standard deviation)


- First idea: Small random numbers (gaussian with zero mean and 1e-2 standard deviation)

Works ~okay for small networks, but problems with deeper networks.


Let's look at some activation statistics

E.g. a 10-layer net with 500 neurons on each layer, using tanh non-linearities, and initialized as described on the last slide.
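A condensed numpy sketch of that experiment (the lecture's actual script also plots histograms of each layer's activations; this version just prints the statistics):

import numpy as np

D = np.random.randn(1000, 500)                  # input batch of unit-gaussian data
hidden_layer_sizes = [500] * 10
Hs = {}
for i in range(len(hidden_layer_sizes)):
    X = D if i == 0 else Hs[i - 1]
    fan_in, fan_out = X.shape[1], hidden_layer_sizes[i]
    W = 0.01 * np.random.randn(fan_in, fan_out)  # "small random numbers" init
    Hs[i] = np.tanh(X.dot(W))
    print('layer %d: mean %+.5f, std %.5f' % (i + 1, Hs[i].mean(), Hs[i].std()))
# With the 0.01 scale the std shrinks layer by layer: all activations become zero.
# With a 1.0 scale instead, tanh saturates at -1/+1 and the gradients become zero.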


All activations become zero!

Q: think about the backward pass. What do the gradients look like?

Hint: think about the backward pass for a W*X gate.


Almost all neurons completely saturated, either -1 or 1. Gradients will be all zero.

(using *1.0 instead of *0.01 as the weight scale)


"Xavier initialization" [Glorot et al., 2010]

Reasonable initialization. (Mathematical derivation assumes linear activations)
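The idea, sketched in numpy (the scaling rule is the point; the layer sizes are just the ones from the experiment above):

import numpy as np

fan_in, fan_out = 500, 500
# Xavier initialization: scale by 1/sqrt(fan_in) so the variance of each
# unit's pre-activation roughly matches the variance of its inputs.
W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)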


But when using the ReLU nonlinearity it breaks.


He et al., 2015 (note additional /2)


He et al., 2015 (note additional /2)
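The He et al. variant for ReLU nets, sketched the same way; the extra factor of 2 compensates for ReLU zeroing out half of its inputs:

import numpy as np

fan_in, fan_out = 500, 500
# He initialization: like Xavier but with fan_in / 2, because ReLU kills
# roughly half the units and therefore halves the variance at each layer.
W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in / 2)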


Proper initialization is an active area of research…

Understanding the difficulty of training deep feedforward neural networks, by Glorot and Bengio, 2010

Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, by Saxe et al., 2013

Random walk initialization for training very deep feedforward networks, by Sussillo and Abbott, 2014

Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, by He et al., 2015

Data-dependent Initializations of Convolutional Neural Networks, by Krähenbühl et al., 2015

All you need is a good init, by Mishkin and Matas, 2015

…


Batch Normalization


Batch Normalization

"you want unit gaussian activations? just make them so."

[Ioffe and Szegedy, 2015]

Consider a batch of activations at some layer. To make each dimension unit gaussian, apply:

x_hat^(k) = (x^(k) - E[x^(k)]) / sqrt(Var[x^(k)])

This is a vanilla differentiable function...


Batch Normalization

"you want unit gaussian activations? just make them so."

[Ioffe and Szegedy, 2015]

[Figure: the input X to the layer is an N x D matrix: N examples in the batch, D feature dimensions]

1. Compute the empirical mean and variance independently for each dimension.
2. Normalize: x_hat^(k) = (x^(k) - E[x^(k)]) / sqrt(Var[x^(k)])
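A minimal numpy sketch of the train-time forward pass (γ and β are the learnable scale and shift introduced a couple of slides below; the epsilon is an assumed small constant for numerical stability):

import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: [N x D] batch of activations at some layer
    mu = x.mean(axis=0)                    # per-dimension empirical mean
    var = x.var(axis=0)                    # per-dimension empirical variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize each dimension
    return gamma * x_hat + beta            # learnable scale and shift

x = 5.0 * np.random.randn(64, 100) + 3.0
out = batchnorm_forward(x, gamma=np.ones(100), beta=np.zeros(100))
print(out.mean(), out.std())               # ~0 and ~1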


Batch Normalization [Ioffe and Szegedy, 2015]

[Figure: a stack of layers: FC -> BN -> tanh -> FC -> BN -> tanh -> ...]

Usually inserted after Fully Connected or Convolutional layers, and before nonlinearity.


Batch Normalization [Ioffe and Szegedy, 2015]

[Figure: a stack of layers: FC -> BN -> tanh -> FC -> BN -> tanh -> ...]

Usually inserted after Fully Connected or Convolutional layers, and before nonlinearity.

Problem: do we necessarily want a unit gaussian input to a tanh layer?


Batch Normalization [Ioffe and Szegedy, 2015]

Normalize:
x_hat^(k) = (x^(k) - E[x^(k)]) / sqrt(Var[x^(k)])

And then allow the network to squash the range if it wants to:
y^(k) = γ^(k) x_hat^(k) + β^(k)

Note, the network can learn
γ^(k) = sqrt(Var[x^(k)]), β^(k) = E[x^(k)]
to recover the identity mapping.


Batch Normalization [Ioffe and Szegedy, 2015]

- Improves gradient flow through the network
- Allows higher learning rates
- Reduces the strong dependence on initialization
- Acts as a form of regularization in a funny way, and slightly reduces the need for dropout, maybe


Batch Normalization [Ioffe and Szegedy, 2015]

Note: at test time the BatchNorm layer functions differently:

The mean/std are not computed based on the batch. Instead, a single fixed empirical mean of activations computed during training is used.

(e.g. can be estimated during training with running averages)
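A sketch of that bookkeeping (the momentum value and function names are illustrative assumptions, not from the paper or the slide):

import numpy as np

D = 100
running_mean, running_var = np.zeros(D), np.ones(D)
momentum = 0.9                                 # assumed value for the running average

def batchnorm_train_step(x, eps=1e-5):
    global running_mean, running_var
    mu, var = x.mean(axis=0), x.var(axis=0)    # batch statistics
    running_mean = momentum * running_mean + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var
    return (x - mu) / np.sqrt(var + eps)

def batchnorm_test(x, eps=1e-5):
    # No batch statistics at test time: use the fixed running estimates.
    return (x - running_mean) / np.sqrt(running_var + eps)

x = np.random.randn(32, D)
print(batchnorm_train_step(x).std(), batchnorm_test(x).std())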


Babysitting the Learning Process


Step 1: Preprocess the data

(Assume X [NxD] is data matrix, each example in a row)


Step 2: Choose the architecture: say we start with one hidden layer of 50 neurons

[Figure: input layer (CIFAR-10 images, 3072 numbers) -> hidden layer (50 hidden neurons) -> output layer (10 output neurons, one per class)]


Double check that the loss is reasonable:

(the call returns the loss and the gradient for all parameters; regularization is disabled)

loss ~2.3: "correct" for 10 classes
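Why ~2.3: with tiny random weights the softmax spreads probability roughly uniformly over the 10 classes, so the expected loss is -log(1/10). A quick check (not the assignment code):

import numpy as np

num_classes = 10
print(-np.log(1.0 / num_classes))   # ~2.3026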


Double check that the loss is reasonable:

crank up regularization

loss went up, good. (sanity check)


Let's try to train now…

Tip: Make sure that you can overfit a very small portion of the training data. The code on the slide:

- takes the first 20 examples from CIFAR-10
- turns off regularization (reg = 0.0)
- uses simple vanilla 'sgd'


Let's try to train now…

Tip: Make sure that you can overfit a very small portion of the training data

Very small loss, train accuracy 1.00, nice!


Let's try to train now…

Start with small regularization and find a learning rate that makes the loss go down.


Let's try to train now…

Start with small regularization and find a learning rate that makes the loss go down.

Loss barely changing


Let's try to train now…

Start with small regularization and find a learning rate that makes the loss go down.

loss not going down: learning rate too low

Loss barely changing: Learning rate is probably too low


Let's try to train now…

Start with small regularization and find a learning rate that makes the loss go down.

loss not going down: learning rate too low

Loss barely changing: Learning rate is probably too low

Notice train/val accuracy goes to 20% though, what’s up with that? (remember this is softmax)


Let's try to train now…

Start with small regularization and find a learning rate that makes the loss go down.

loss not going down: learning rate too low

Now let’s try learning rate 1e6.


cost: NaN almost always means high learning rate...

Let's try to train now…

Start with small regularization and find a learning rate that makes the loss go down.

loss not going down: learning rate too low
loss exploding: learning rate too high


Let's try to train now…

Start with small regularization and find a learning rate that makes the loss go down.

loss not going down: learning rate too low
loss exploding: learning rate too high

3e-3 is still too high. Cost explodes….

=> Rough range for the learning rate we should be cross-validating is somewhere in [1e-3 … 1e-5]


Hyperparameter Optimization


Cross-validation strategy

coarse -> fine cross-validation in stages

First stage: only a few epochs to get a rough idea of what params work
Second stage: longer running time, finer search
… (repeat as necessary)

Tip for detecting explosions in the solver: if the cost is ever > 3 * original cost, break out early


For example: run coarse search for 5 epochs

nice

note it’s best to optimize in log space!
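A minimal sketch of such a coarse random search in log space (the learning-rate range follows the earlier estimate; the regularization range and the commented loop body are illustrative assumptions, not the course code):

import numpy as np

for trial in range(10):
    # Sample the exponent uniformly, so each decade gets equal coverage.
    lr = 10 ** np.random.uniform(-5, -3)
    reg = 10 ** np.random.uniform(-5, 5)
    print('trial %d: lr = %e, reg = %e' % (trial, lr, reg))
    # (train for ~5 epochs with these settings, record validation accuracy,
    #  and break out early if the cost ever exceeds ~3x its starting value)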


Now run finer search... adjust range

53% - relatively good for a 2-layer neural net with 50 hidden neurons.


Now run finer search... adjust range

53% - relatively good for a 2-layer neural net with 50 hidden neurons.

But this best cross-validation result is worrying. Why?


Random Search vs. Grid Search

[Figure: two panels, "Grid Layout" and "Random Layout", each plotting an important parameter against an unimportant parameter; with the same budget, random search samples more distinct values of the important parameter]

Illustration of Bergstra et al., 2012 by Shayne Longpre, copyright CS231n 2017

Random Search for Hyper-Parameter OptimizationBergstra and Bengio, 2012


Hyperparameters to play with:
- network architecture
- learning rate, its decay schedule, update type
- regularization (L2 / Dropout strength)

[Image caption: neural networks practitioner; music = loss function]

This image by Paolo Guereta is licensed under CC-BY 2.0


Cross-validation “command center”


Monitor and visualize the loss curve


[Figure: loss vs. time]


[Figure: loss vs. time; the loss stays flat at first and only then starts to decrease]

Bad initialization: a prime suspect


Monitor and visualize the accuracy:

big gap = overfitting => increase regularization strength?

no gap => increase model capacity?


Track the ratio of weight updates / weight magnitudes:

ratio between the updates and values: ~0.0002 / 0.02 = 0.01 (about okay)
want this to be somewhere around 0.001 or so
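A sketch of how that ratio can be computed for a single parameter matrix under vanilla SGD (variable names and sizes are illustrative):

import numpy as np

W = 0.01 * np.random.randn(500, 10)      # a parameter matrix
dW = 1e-3 * np.random.randn(500, 10)     # its gradient from backprop
learning_rate = 1e-2

param_scale = np.linalg.norm(W.ravel())
update = -learning_rate * dW             # vanilla SGD update
update_scale = np.linalg.norm(update.ravel())
W += update
print(update_scale / param_scale)        # want this to be somewhere around 1e-3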


Summary

We looked in detail at:

- Activation Functions (use ReLU)
- Data Preprocessing (images: subtract mean)
- Weight Initialization (use Xavier init)
- Batch Normalization (use)
- Babysitting the Learning Process
- Hyperparameter Optimization (random sample hyperparams, in log space when appropriate)

TLDRs


Next time: Training Neural Networks, Part 2

- Parameter update schemes
- Learning rate schedules
- Gradient checking
- Regularization (Dropout etc.)
- Evaluation (Ensembles etc.)
- Transfer learning / fine-tuning