Deep Learning & Neural Networks Lecture 1


Transcript of Deep Learning & Neural Networks Lecture 1

Page 1: Deep Learning & Neural Networks Lecture 1

Deep Learning & Neural Networks, Lecture 1

Kevin Duh

Graduate School of Information Science, Nara Institute of Science and Technology

Jan 14, 2014

Page 5: Deep Learning & Neural Networks Lecture 1

What is Deep Learning?

A family of methods that uses deep architectures to learn high-level feature representations

5/40

Page 7: Deep Learning & Neural Networks Lecture 1

Example of Trainable Features

Hierarchical object-parts features in Computer Vision [Lee et al., 2009]

6/40

Page 8: Deep Learning & Neural Networks Lecture 1

Course Outline

Goal: To understand the foundations of neural networks and deep learning, at a level sufficient for reading recent research papers

Schedule:
- Lecture 1 (Jan 14): Machine Learning background & Neural Networks
- Lecture 2 (Jan 16): Deep Architectures (DBN, SAE)
- Lecture 3 (Jan 21): Applications in Vision, Speech, Language
- Lecture 4 (Jan 23): Advanced topics in optimization

Prerequisites: Basic calculus, probability, linear algebra

7/40

Page 11: Deep Learning & Neural Networks Lecture 1

Course Material

Course Website: http://cl.naist.jp/~kevinduh/a/deep2014/

Useful References:
1. Yoshua Bengio's [Bengio, 2009] short book: Learning Deep Architectures for AI (http://www.iro.umontreal.ca/~bengioy/papers/ftml.pdf)
2. Yann LeCun & Marc'Aurelio Ranzato's ICML 2013 tutorial (http://techtalks.tv/talks/deep-learning/58122/)
3. Richard Socher et al.'s NAACL 2013 tutorial (http://www.socher.org/index.php/DeepLearningTutorial/)
4. Geoff Hinton's Coursera course (https://www.coursera.org/course/neuralnets)
5. Theano code samples (http://deeplearning.net/tutorial/)
6. Chris Bishop's book Pattern Recognition and Machine Learning (PRML) (http://research.microsoft.com/en-us/um/people/cmbishop/prml/)

8/40

Page 12: Deep Learning & Neural Networks Lecture 1

Grading

The only criterion for grading: Are you actively participating and asking questions in class?

- If you ask (or answer) 3+ questions, grade = A
- If you ask (or answer) 2 questions, grade = B
- If you ask (or answer) 1 question, grade = C
- If you don't ask (or answer) any questions, you get no credit.

9/40

Page 13: Deep Learning & Neural Networks Lecture 1

Best Advice I got while in Grad School

Always Ask Questions!

If you don't understand, you must ask questions in order to understand.

If you understand, you will naturally have questions.

Having no questions is a sign that you are not thinking.

10/40

Page 17: Deep Learning & Neural Networks Lecture 1

Today’s Topics

1 Machine Learning background
  - Why is Machine Learning needed?
  - Main Concepts: Generalization, Model Expressiveness, Overfitting
  - Formal Notation

2 Neural Networks
  - 1-Layer Nets (Logistic Regression)
  - 2-Layer Nets and Model Expressiveness
  - Training by Backpropagation

11/40

Page 19: Deep Learning & Neural Networks Lecture 1

Write a Program∗ to Recognize the Digit 2

This is hard to do manually!
    bool recognizeDigitAs2(int** imagePixels){...}

Machine Learning solution:

1 Assume you have a database (training data) of 2’s and non-2’s.

2 Automatically "learn" this function from data

*Example from Hinton's Coursera course
13/40

Page 20: Deep Learning & Neural Networks Lecture 1

Write a Program∗ to Recognize the Digit 2

This is hard to do manually!bool recognizeDigitAs2(int** imagePixels){...}

Machine Learning solution:

1 Assume you have a database (training data) of 2’s and non-2’s.

2 Automatically ”learn” this function from data

*example from Hinton’s Coursera course13/40

Page 21: Deep Learning & Neural Networks Lecture 1

A Machine Learning Solution

Training data are represented as pixel matrices. The classifier is parameterized by a weight matrix of the same dimension.

Training procedure:
1 When observing a "2", add 1 to the corresponding matrix elements
2 When observing a "non-2", subtract 1 from the corresponding matrix elements

Matrices shown on the slide (10x10):

(a) an all-zero matrix (e.g. the initial weight matrix):
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0

(b) a pixel matrix of a "2":
0 0 0 0 0 0 0 0 0 0
0 1 1 1 1 1 1 0 0 0
0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 1 0 0 0 0
0 0 0 0 1 0 0 0 0 0
0 0 0 1 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0
0 1 1 1 1 1 1 1 1 0
0 0 0 0 0 0 0 0 0 0

(c) a weight matrix with +1/-1 entries after training updates:
0 0 0 0 0 0 0 0 0 0
0 1 1 1 1 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 1 -1 0 0 0
0 0 0 0 1 0 -1 0 0 0
0 0 0 1 0 0 -1 0 0 0
0 0 1 0 0 0 -1 0 0 0
0 1 1 1 1 1 -1 1 1 0
0 0 0 0 0 0 0 0 0 0

Test procedure: given a new image, take the sum of the element-wise product. If positive, predict "2"; else predict "non-2".

14/40
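As a rough illustration (not from the slides), the add/subtract training rule and the element-wise test procedure above can be written in a few lines of Python with NumPy; the function names and the use of 0/1 image arrays are assumptions of this sketch.

    import numpy as np

    def train(images, labels):
        """Add each '2' image to the weight matrix, subtract each 'non-2' image."""
        weights = np.zeros_like(images[0], dtype=float)
        for img, is_two in zip(images, labels):
            weights += img if is_two else -img
        return weights

    def predict(weights, image):
        """Sum of the element-wise product; positive means predict '2'."""
        return np.sum(weights * image) > 0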

Page 24: Deep Learning & Neural Networks Lecture 1

Today’s Topics

1 Machine Learning background
  - Why is Machine Learning needed?
  - Main Concepts: Generalization, Model Expressiveness, Overfitting
  - Formal Notation

2 Neural Networks
  - 1-Layer Nets (Logistic Regression)
  - 2-Layer Nets and Model Expressiveness
  - Training by Backpropagation

15/40

Page 25: Deep Learning & Neural Networks Lecture 1

Generalization ≠ Memorization

Key Issue in Machine Learning: Training data is limited

If the classifier just memorizes the training data, it may perform poorly on new data

"Generalization" is the ability to extend accurate predictions to new data

E.g. consider a shifted image: Will this classifier generalize?

(weight matrix from the previous slide)
0 0 0 0 0 0 0 0 0 0
0 1 1 1 1 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 1 -1 0 0 0
0 0 0 0 1 0 -1 0 0 0
0 0 0 1 0 0 -1 0 0 0
0 0 1 0 0 0 -1 0 0 0
0 1 1 1 1 1 -1 1 1 0
0 0 0 0 0 0 0 0 0 0

16/40

Page 27: Deep Learning & Neural Networks Lecture 1

Generalization ≠ Memorization

One potential way to increase generalization ability:

Discretize the weight matrix with larger grids (fewer weights to train)

E.g. consider a shifted image: Now will this classifier generalize?

Original 10x10 weight matrix:
0 0 0 0 0 0 0 0 0 0
0 1 1 1 1 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 1 -1 0 0 0
0 0 0 0 1 0 -1 0 0 0
0 0 0 1 0 0 -1 0 0 0
0 0 1 0 0 0 -1 0 0 0
0 1 1 1 1 1 -1 1 1 0
0 0 0 0 0 0 0 0 0 0

Discretized 5x5 weight matrix:
1 1 1 0 0
0 0 0 0 0
0 0 1 -1 0
0 1 0 -1 0
1 1 1 0 1

17/40

Page 28: Deep Learning & Neural Networks Lecture 1

Model Expressiveness and Overfitting

A model with more weight parameters may fit training data better

But since training data is limited, expressive models run the risk of overfitting to peculiarities of the data.

Less Expressive Model (fewer weights) ⇐⇒ More Expressive Model (more weights)

Underfit training data ⇐⇒ Overfit training data

18/40

Page 29: Deep Learning & Neural Networks Lecture 1

Model Expressiveness and Overfitting

Fitting the training data (blue points: x_n) with a polynomial model f(x) = w_0 + w_1 x + w_2 x^2 + ... + w_M x^M under the squared-error objective (1/2) ∑_n (f(x_n) − t_n)^2

[Four panels plot the fitted curve of t against x on [0, 1] for M = 0, 1, 3, and 9]

(from PRML Chapter 1 [Bishop, 2006])
19/40
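A small sketch of this curve-fitting experiment: fit polynomials of degree M by least squares and watch the training error shrink as M grows. The sin(2πx)-plus-noise data follows the PRML setup; the sample size, noise level, and use of np.polyfit are illustrative assumptions, not details from the slide.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 1, size=10)                                    # inputs x_n
    t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.shape)   # noisy targets t_n

    for M in (0, 1, 3, 9):
        w = np.polyfit(x, t, deg=M)                      # least-squares polynomial fit of degree M
        train_err = 0.5 * np.sum((np.polyval(w, x) - t) ** 2)
        print(f"M={M}: training error {train_err:.4f}")  # larger M fits the training data better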

Page 30: Deep Learning & Neural Networks Lecture 1

Today’s Topics

1 Machine Learning background
  - Why is Machine Learning needed?
  - Main Concepts: Generalization, Model Expressiveness, Overfitting
  - Formal Notation

2 Neural Networks
  - 1-Layer Nets (Logistic Regression)
  - 2-Layer Nets and Model Expressiveness
  - Training by Backpropagation

20/40

Page 31: Deep Learning & Neural Networks Lecture 1

Basic Problem Setup in Machine Learning

Training Data: a set of pairs (x^(m), y^(m)), m = 1, 2, ..., M, where input x^(m) ∈ R^d and output y^(m) ∈ {0, 1}
  - e.g. x = vectorized image pixels, y = "2" or "non-2"

Goal: Learn a function f : x → y that predicts correctly on new inputs x.
  - Step 1: Choose a function model family: e.g. logistic regression, support vector machines, neural networks
  - Step 2: Optimize parameters w on the Training Data: e.g. minimize the loss function min_w ∑_{m=1}^{M} (f_w(x^(m)) − y^(m))^2

21/40

Page 34: Deep Learning & Neural Networks Lecture 1

Today’s Topics

1 Machine Learning background
  - Why is Machine Learning needed?
  - Main Concepts: Generalization, Model Expressiveness, Overfitting
  - Formal Notation

2 Neural Networks
  - 1-Layer Nets (Logistic Regression)
  - 2-Layer Nets and Model Expressiveness
  - Training by Backpropagation

22/40

Page 35: Deep Learning & Neural Networks Lecture 1

1-Layer Nets (Logistic Regression)

Function model: f(x) = σ(w^T · x + b)
  - Parameters: vector w ∈ R^d, scalar bias term b
  - σ is a non-linearity, e.g. the sigmoid: σ(z) = 1/(1 + exp(−z))
  - For simplicity, we sometimes write f(x) = σ(w^T x) where w = [w; b] and x = [x; 1]

The non-linearity will be important for the expressiveness of multi-layer nets. Other non-linearities exist, e.g. tanh(z) = (e^z − e^(−z))/(e^z + e^(−z))

23/40
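A minimal sketch of this 1-layer model in NumPy (the function names are my own); the second form shows the absorbed-bias trick w = [w; b], x = [x; 1].

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def f(x, w, b):
        """1-layer net: sigmoid of an affine function of the input."""
        return sigmoid(np.dot(w, x) + b)

    def f_absorbed(x, w):
        """Same model with the bias absorbed: w = [w; b], x = [x; 1]."""
        return sigmoid(np.dot(w, x))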

Page 37: Deep Learning & Neural Networks Lecture 1

Training 1-Layer Nets: Gradient

Assume the Squared-Error* loss: Loss(w) = (1/2) ∑_m (σ(w^T x^(m)) − y^(m))^2

Gradient: ∇_w Loss = ∑_m [σ(w^T x^(m)) − y^(m)] σ′(w^T x^(m)) x^(m)
  - General form of the gradient: ∑_m Error^(m) ∗ σ′(in^(m)) ∗ x^(m)

Derivative of the sigmoid σ(z) = 1/(1 + exp(−z)):

σ′(z) = d/dz [1/(1 + exp(−z))]
      = −[1/(1 + exp(−z))]^2 · d/dz (1 + exp(−z))
      = −[1/(1 + exp(−z))]^2 · exp(−z) · (−1)
      = [1/(1 + exp(−z))] · [exp(−z)/(1 + exp(−z))]
      = σ(z)(1 − σ(z))

*An alternative is the Cross-Entropy loss: ∑_m y^(m) log(σ(w^T x^(m))) + (1 − y^(m)) log(1 − σ(w^T x^(m)))

24/40
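The loss and gradient formulas above translate directly into a vectorized NumPy sketch (my own helper, with the bias absorbed into w); X stacks the inputs x^(m) as rows and y holds the labels y^(m).

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def loss_and_gradient(w, X, y):
        pred = sigmoid(X @ w)               # sigma(w^T x^(m)) for all m
        error = pred - y                    # Error^(m)
        loss = 0.5 * np.sum(error ** 2)     # squared-error loss
        sigma_prime = pred * (1.0 - pred)   # sigma'(z) = sigma(z)(1 - sigma(z))
        grad = X.T @ (error * sigma_prime)  # sum_m Error^(m) * sigma'(in^(m)) * x^(m)
        return loss, grad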

Page 41: Deep Learning & Neural Networks Lecture 1

Training 1-Layer Nets: Gradient Descent Algorithm

General form of the gradient: ∑_m Error^(m) ∗ σ′(in^(m)) ∗ x^(m)

Gradient descent algorithm:
1 Initialize w
2 Compute ∇_w Loss = ∑_m Error^(m) ∗ σ′(in^(m)) ∗ x^(m)
3 w ← w − γ(∇_w Loss)
4 Repeat steps 2-3 until some condition is satisfied

Stochastic gradient descent (SGD) algorithm:
1 Initialize w
2 For each sample (x^(m), y^(m)) in the training set:
3   w ← w − γ(Error^(m) ∗ σ′(in^(m)) ∗ x^(m))
4 Repeat loop 2-3 until some condition is satisfied

The learning rate γ > 0 and the stopping condition are important in practice.

25/40
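A sketch of the SGD loop above for the 1-layer net; the fixed γ, the epoch count, and the zero initialization are illustrative choices, not prescriptions from the slides.

    import numpy as np

    def sgd(X, y, gamma=0.1, epochs=100):
        w = np.zeros(X.shape[1])                    # 1. initialize w
        for _ in range(epochs):                     # 4. repeat until some condition is satisfied
            for x_m, y_m in zip(X, y):              # 2. for each training sample
                pred = 1.0 / (1.0 + np.exp(-np.dot(w, x_m)))
                error = pred - y_m
                # 3. w <- w - gamma * Error^(m) * sigma'(in^(m)) * x^(m)
                w -= gamma * error * pred * (1.0 - pred) * x_m
        return w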

Page 44: Deep Learning & Neural Networks Lecture 1

Intuition of SGD update

For some sample (x^(m), y^(m)):
w ← w − γ((σ(w^T x^(m)) − y^(m)) ∗ σ′(w^T x^(m)) ∗ x^(m))

σ(w^T x^(m)) | y^(m) | Error | new w                  | new prediction
0            | 0     |  0    | no change              | 0
1            | 1     |  0    | no change              | 1
0            | 1     | −1    | w + γσ′(in^(m)) x^(m)  | ≥ 0
1            | 0     | +1    | w − γσ′(in^(m)) x^(m)  | ≤ 1

[w + γσ′(in^(m)) x^(m)]^T x^(m) = w^T x^(m) + γσ′(in^(m)) ||x^(m)||^2 ≥ w^T x^(m)

σ′(w^T x^(m)) is near 0 when the prediction is confident, near 0.25 when it is uncertain.

Large γ = more aggressive updates; small γ = more conservative updates.

SGD improves classification for the current sample, but gives no guarantee about the others.

26/40
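A quick numeric check of the third table row (y = 1 but σ(w^T x) near 0): one update increases w^T x by γσ′(in^(m))||x^(m)||^2, so the new prediction moves toward 1. The specific numbers below are illustrative.

    import numpy as np

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    w, x = np.array([-2.0, -1.0]), np.array([1.0, 1.0])
    y, gamma = 1.0, 0.5

    pred = sigmoid(np.dot(w, x))                      # about 0.05, so Error is about -1
    w_new = w - gamma * (pred - y) * pred * (1 - pred) * x
    print(np.dot(w_new, x) > np.dot(w, x))            # True: w^T x increased
    print(sigmoid(np.dot(w_new, x)) > pred)           # True: prediction moved toward y = 1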

Page 49: Deep Learning & Neural Networks Lecture 1

Geometric view of SGD update

Loss objective contour plot: (1/2) ∑_m (σ(w^T x^(m)) − y^(m))^2 + ||w||
  - Gradient descent moves in the steepest-descent direction, but is slower to compute per iteration for large datasets
  - SGD can be viewed as noisy descent, but is faster per iteration
  - In practice, a good tradeoff is mini-batch SGD

27/40

Page 50: Deep Learning & Neural Networks Lecture 1

Effect of Learning Rate γ on Convergence Speed

SGD update: w ← w − γ(Error^(m) ∗ σ′(in^(m)) ∗ x^(m))
  - Ideally, γ should be as large as possible without causing divergence.
  - Common heuristic: γ(t) = γ_0 / (1 + νt) = O(1/t)

Analysis by [Schaul et al., 2013] (in the plot, η ≡ γ):

28/40
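The O(1/t) heuristic is easy to implement; the γ_0 and ν values below are made-up illustrations, not values from the slides.

    def learning_rate(t, gamma0=0.5, nu=0.01):
        """gamma(t) = gamma_0 / (1 + nu * t), an O(1/t) decay schedule."""
        return gamma0 / (1.0 + nu * t)

    print([round(learning_rate(t), 3) for t in (0, 100, 1000)])  # [0.5, 0.25, 0.045]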

Page 51: Deep Learning & Neural Networks Lecture 1

Generalization issues: Regularization and Early-stopping

Optimizing Loss(w) = (1/2) ∑_m (σ(w^T x^(m)) − y^(m))^2 on the training data does not necessarily lead to generalization.

1 Adding regularization: Loss(w) = (1/2) ∑_m (σ(w^T x^(m)) − y^(m))^2 + ||w|| reduces sensitivity to the training input and decreases the risk of overfitting
2 Early Stopping:
  - Prepare separate training and validation (development) data
  - Optimize Loss(w) on the training data but stop when Loss(w) on the validation data stops improving

[Training-error and validation-error curves over training iterations; figures from Chapter 5, [Bishop, 2006]]
29/40
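A sketch of the early-stopping recipe above. train_step() and loss() are hypothetical helpers (one SGD pass and a loss evaluation); the patience counter is a common practical detail, not something specified in the slides.

    def train_with_early_stopping(w, train_data, valid_data, patience=5):
        best_loss, best_w, bad_epochs = float("inf"), w, 0
        while bad_epochs < patience:
            w = train_step(w, train_data)        # hypothetical: one pass of (mini-batch) SGD
            valid_loss = loss(w, valid_data)     # hypothetical: Loss(w) on validation data
            if valid_loss < best_loss:
                best_loss, best_w, bad_epochs = valid_loss, w, 0
            else:
                bad_epochs += 1                  # validation loss stopped improving
        return best_w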

Page 54: Deep Learning & Neural Networks Lecture 1

Summary

1 Given Training Data: (x^(m), y^(m)), m = 1, 2, ..., M
2 Optimize a model f(x) = σ(w^T · x + b) under Loss(w) = (1/2) ∑_m (σ(w^T x^(m)) − y^(m))^2
3 General form of the gradient: ∑_m Error^(m) ∗ σ′(in^(m)) ∗ x^(m)
4 SGD algorithm: for each sample (x^(m), y^(m)) in the training set, w ← w − γ(Error^(m) ∗ σ′(in^(m)) ∗ x^(m))
5 Important issues:
  - Optimization speed/convergence: batch vs. mini-batch, learning rate γ
  - Generalization ability: regularization, early-stopping

30/40

Page 58: Deep Learning & Neural Networks Lecture 1

Today’s Topics

1 Machine Learning background
  - Why is Machine Learning needed?
  - Main Concepts: Generalization, Model Expressiveness, Overfitting
  - Formal Notation

2 Neural Networks
  - 1-Layer Nets (Logistic Regression)
  - 2-Layer Nets and Model Expressiveness
  - Training by Backpropagation

31/40

Page 59: Deep Learning & Neural Networks Lecture 1

2-Layer Neural Networks

[Network diagram: inputs x1 x2 x3 x4 connect to hidden units h1 h2 h3 via weights wij; hidden units connect to the output y via weights wj]

f(x) = σ(∑_j wj · hj) = σ(∑_j wj · σ(∑_i wij xi))

Hidden units hj can be viewed as new "features" obtained by combining the xi.

This is called a Multilayer Perceptron (MLP), but it is more like multilayer logistic regression.
32/40
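A minimal sketch of this forward pass in NumPy (bias terms omitted, as in the formula above); W1 holds the weights wij and w2 the weights wj.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(x, W1, w2):
        """2-layer net: f(x) = sigma(sum_j w_j * sigma(sum_i w_ij * x_i))."""
        h = sigmoid(W1 @ x)            # hidden units h_j: the new "features"
        return sigmoid(np.dot(w2, h))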

Page 61: Deep Learning & Neural Networks Lecture 1

Modeling complex non-linearities

Given the same number of units (with non-linear activation), a deeper architecture is more expressive than a shallow one [Bishop, 1995]
  - 1-layer nets only model linear hyperplanes
  - 2-layer nets are universal function approximators: given infinitely many hidden nodes, they can express any continuous function
  - ≥3-layer nets can do so with fewer nodes/weights

33/40

Page 62: Deep Learning & Neural Networks Lecture 1

Today’s Topics

1 Machine Learning background
  - Why is Machine Learning needed?
  - Main Concepts: Generalization, Model Expressiveness, Overfitting
  - Formal Notation

2 Neural Networks
  - 1-Layer Nets (Logistic Regression)
  - 2-Layer Nets and Model Expressiveness
  - Training by Backpropagation

34/40

Page 63: Deep Learning & Neural Networks Lecture 1

Training a 2-Layer Net with Backpropagation

[Network diagram: inputs x1 x2 x3 x4, hidden units h1 h2 h3, output y, with weights wij and wj; the forward pass predicts f(x^(m)), then the weights are adjusted]

1. For each sample, compute f(x^(m)) = σ(∑_j wj · σ(∑_i wij x_i^(m)))
2. If f(x^(m)) ≠ y^(m), back-propagate the error and adjust the weights {wij, wj}.

35/40

Page 64: Deep Learning & Neural Networks Lecture 1

Derivatives of the weights

Assume two outputs (y1, y2) per input x, and the loss per sample: Loss = ∑_k (1/2)[σ(in_k) − y_k]^2

[Network diagram: inputs x_i, hidden units h_j, outputs y_k; weights w_ij connect x_i to h_j, weights w_jk connect h_j to y_k]

∂Loss/∂w_jk = (∂Loss/∂in_k)(∂in_k/∂w_jk) = δ_k · ∂(∑_j w_jk h_j)/∂w_jk = δ_k h_j

∂Loss/∂w_ij = (∂Loss/∂in_j)(∂in_j/∂w_ij) = δ_j · ∂(∑_i w_ij x_i)/∂w_ij = δ_j x_i

δ_k = ∂/∂in_k ( ∑_k (1/2)[σ(in_k) − y_k]^2 ) = [σ(in_k) − y_k] σ′(in_k)

δ_j = ∑_k (∂Loss/∂in_k)(∂in_k/∂in_j) = ∑_k δ_k · ∂/∂in_j ( ∑_j w_jk σ(in_j) ) = [∑_k δ_k w_jk] σ′(in_j)

36/40

Page 69: Deep Learning & Neural Networks Lecture 1

Backpropagation Algorithm

All updates involve some scaled error from the output ∗ the input feature:

∂Loss/∂w_jk = δ_k h_j where δ_k = [σ(in_k) − y_k] σ′(in_k)

∂Loss/∂w_ij = δ_j x_i where δ_j = [∑_k δ_k w_jk] σ′(in_j)

After the forward pass, compute δ_k from the final layer, then δ_j for the previous layer. For deeper nets, iterate backwards further.

[Network diagram: inputs x_i, hidden units h_j, outputs y_k, with weights w_ij and w_jk]

37/40
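Putting the δ formulas together, one backpropagation update for the 2-layer net might look like the NumPy sketch below (squared-error loss, sigmoid units, biases omitted; γ and the function name are my own choices, not from the slides).

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def backprop_step(x, y, W1, W2, gamma=0.1):
        # forward pass
        h = sigmoid(W1 @ x)                        # hidden activations sigma(in_j)
        out = sigmoid(W2 @ h)                      # output activations sigma(in_k)
        # backward pass: delta_k at the output, then delta_j one layer back
        delta_k = (out - y) * out * (1 - out)      # [sigma(in_k) - y_k] * sigma'(in_k)
        delta_j = (W2.T @ delta_k) * h * (1 - h)   # [sum_k delta_k w_jk] * sigma'(in_j)
        # gradient-descent updates: dLoss/dw_jk = delta_k * h_j, dLoss/dw_ij = delta_j * x_i
        W2 -= gamma * np.outer(delta_k, h)
        W1 -= gamma * np.outer(delta_j, x)
        return W1, W2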

Page 70: Deep Learning & Neural Networks Lecture 1

Summary

1 By extending from a 1-layer to a 2-layer net, we get a dramatic increase in model expressiveness
2 Backpropagation is an efficient way to train 2-layer nets:
  - Similar to SGD for a 1-layer net, just with more chaining in the gradient
  - General form: update wij by δj xi, where δj is a scaled/weighted sum of the errors from the outgoing layers
3 Ideally, we want even deeper architectures
  - But Backpropagation becomes ineffective due to vanishing gradients
  - Deep Learning comes to the rescue! (next lecture)

38/40

Page 75: Deep Learning & Neural Networks Lecture 1

References I

Bengio, Y. (2009). Learning Deep Architectures for AI. Foundations and Trends in Machine Learning. NOW Publishers.

Bishop, C. (1995). Neural Networks for Pattern Recognition. Oxford University Press.

Bishop, C. (2006). Pattern Recognition and Machine Learning. Springer.

Lee, H., Grosse, R., Ranganath, R., and Ng, A. (2009). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML.

39/40

Page 76: Deep Learning & Neural Networks Lecture 1

References II

Schaul, T., Zhang, S., and LeCun, Y. (2013). No more pesky learning rates. In Proc. International Conference on Machine Learning (ICML'13).

40/40