CS489/698 Lecture 9: Jan 31, 2018 - University of Waterloo · 2018-01-29 · Quick Recap:...

18
CS489/698 Lecture 9: Jan 31, 2018 Multi-layer Neural Networks, Error Backpropagation [D] Chapt. 10, [HTF] Chapt. 11, [B] Sec. 5.2, 5.3, [M] Sec. 16.5, [RN] Sec. 18.7 CS489/698 (c) 2018 P. Poupart 1

Transcript of CS489/698 Lecture 9: Jan 31, 2018 - University of Waterloo · 2018-01-29 · Quick Recap:...

Page 1: CS489/698 Lecture 9: Jan 31, 2018 - University of Waterloo · 2018-01-29 · Quick Recap: Non-linear Models Non-linear classification Non-linear regression CS489/698 (c) 2018 P. Poupart

CS489/698Lecture 9: Jan 31, 2018

Multi-layer Neural Networks, Error Backpropagation

[D] Chapt. 10, [HTF] Chapt. 11, [B] Sec. 5.2, 5.3, [M] Sec. 16.5, [RN] Sec. 18.7

CS489/698 (c) 2018 P. Poupart 1

Page 2: CS489/698 Lecture 9: Jan 31, 2018 - University of Waterloo · 2018-01-29 · Quick Recap: Non-linear Models Non-linear classification Non-linear regression CS489/698 (c) 2018 P. Poupart

Quick Recap: Linear Models

Linear Regression Linear Classification

CS489/698 (c) 2018 P. Poupart 2

Page 3: CS489/698 Lecture 9: Jan 31, 2018 - University of Waterloo · 2018-01-29 · Quick Recap: Non-linear Models Non-linear classification Non-linear regression CS489/698 (c) 2018 P. Poupart

Quick Recap: Non-linear Models

Non-linear classification Non-linear regression

CS489/698 (c) 2018 P. Poupart 3

Page 4: CS489/698 Lecture 9: Jan 31, 2018 - University of Waterloo · 2018-01-29 · Quick Recap: Non-linear Models Non-linear classification Non-linear regression CS489/698 (c) 2018 P. Poupart

Non-linear Models

• Convenient modeling assumption: linearity• Extension: non-linearity can be obtained by mapping ! to a non-linear feature space " !

• Limit: the basis functions "#(!) are chosen a priori and are fixed

• Question: can we work with unrestricted non-linear models?

CS489/698 (c) 2018 P. Poupart 4

Page 5: CS489/698 Lecture 9: Jan 31, 2018 - University of Waterloo · 2018-01-29 · Quick Recap: Non-linear Models Non-linear classification Non-linear regression CS489/698 (c) 2018 P. Poupart

Flexible Non-Linear Models

• Idea 1: Select basis functions that correspond to the training data and retain only a subset of them (e.g., Support Vector Machines)

• Idea 2: Learn non-linear basis functions (e.g., Multi-layer Neural Networks)

CS489/698 (c) 2018 P. Poupart 5

Page 6: CS489/698 Lecture 9: Jan 31, 2018 - University of Waterloo · 2018-01-29 · Quick Recap: Non-linear Models Non-linear classification Non-linear regression CS489/698 (c) 2018 P. Poupart

Two-Layer Architecture

• Feed-forward neural network

• Hidden units: !" = ℎ%('"(%))*)

• Output units: +, = ℎ-(',(-)./)

• Overall: +, = ℎ- ∑ 1,"- ℎ% ∑ 1"2% 322"

CS489/698 (c) 2018 P. Poupart 6

Page 7: CS489/698 Lecture 9: Jan 31, 2018 - University of Waterloo · 2018-01-29 · Quick Recap: Non-linear Models Non-linear classification Non-linear regression CS489/698 (c) 2018 P. Poupart

Common activation functions ℎ• Threshold: ℎ " = $ 1 " ≥ 0

−1 " < 0

• Sigmoid: ℎ " = * " = ++,-./

• Gaussian: ℎ " = 0123/.45

3

• Tanh: ℎ " = tanh " = -/1-./-/,-./

• Identity: ℎ " = "

CS489/698 (c) 2018 P. Poupart 7

Page 8: CS489/698 Lecture 9: Jan 31, 2018 - University of Waterloo · 2018-01-29 · Quick Recap: Non-linear Models Non-linear classification Non-linear regression CS489/698 (c) 2018 P. Poupart

Adaptive non-linear basis functions

• Non-linear regression– ℎ": non-linear function and ℎ#: identity

• Non-linear classification– ℎ#: non-linear function and ℎ#: sigmoid

CS489/698 (c) 2018 P. Poupart 8

Page 9: CS489/698 Lecture 9: Jan 31, 2018 - University of Waterloo · 2018-01-29 · Quick Recap: Non-linear Models Non-linear classification Non-linear regression CS489/698 (c) 2018 P. Poupart

Weight training

• Parameters: < " # ," % ,… >• Objectives:– Error minimization• Backpropagation (aka “backprop”)

– Maximum likelihood– Maximum a posteriori– Bayesian learning

CS489/698 (c) 2018 P. Poupart 9

Page 10: CS489/698 Lecture 9: Jan 31, 2018 - University of Waterloo · 2018-01-29 · Quick Recap: Non-linear Models Non-linear classification Non-linear regression CS489/698 (c) 2018 P. Poupart

Least squared error

• Error function

! " = 12&!' " (

'= 12& ) *+," − .' (

(

'

• When ) *," = ∑ 012( 3 ∑ 0245 6442

then we are optimizing a linear combination of non-linear basis functions

Linear combo Non-linear basis functions

CS489/698 (c) 2018 P. Poupart 10

Page 11: CS489/698 Lecture 9: Jan 31, 2018 - University of Waterloo · 2018-01-29 · Quick Recap: Non-linear Models Non-linear classification Non-linear regression CS489/698 (c) 2018 P. Poupart

Sequential Gradient Descent

• For each example ("#, %#) adjust the weights as

follows:

'() ← '() − ,-.#-'()

• How can we compute the gradient efficiently given

an arbitrary network structure?

• Answer: backpropagation algorithm

CS489/698 (c) 2018 P. Poupart 11

Page 12: CS489/698 Lecture 9: Jan 31, 2018 - University of Waterloo · 2018-01-29 · Quick Recap: Non-linear Models Non-linear classification Non-linear regression CS489/698 (c) 2018 P. Poupart

Backpropagation Algorithm

• Two phases:– Forward phase: compute output !" of each unit #

– Backward phase: compute delta $" at each unit #

CS489/698 (c) 2018 P. Poupart 12

Page 13: CS489/698 Lecture 9: Jan 31, 2018 - University of Waterloo · 2018-01-29 · Quick Recap: Non-linear Models Non-linear classification Non-linear regression CS489/698 (c) 2018 P. Poupart

Forward phase

• Propagate inputs forward to compute the output of each unit

• Output !" at unit #:!" = ℎ &" where a" = ∑ )"*!**

CS489/698 (c) 2018 P. Poupart 13

Page 14: CS489/698 Lecture 9: Jan 31, 2018 - University of Waterloo · 2018-01-29 · Quick Recap: Non-linear Models Non-linear classification Non-linear regression CS489/698 (c) 2018 P. Poupart

Backward phase

• Use chain rule to recursively compute gradient

– For each weight !"#: $%&$'()

= $%&$+(

$+($'()

= ,"-#

– Let ," ≡$%&$+(

then

," = /ℎ′(3") -" − 6"ℎ′(3")∑ !8",88

basecase: @isanoutputunitrecursion: @isahiddenunit

– Since 3" = ∑ !"#-## then $+($'()

= -#

CS489/698 (c) 2018 P. Poupart 14

Page 15: CS489/698 Lecture 9: Jan 31, 2018 - University of Waterloo · 2018-01-29 · Quick Recap: Non-linear Models Non-linear classification Non-linear regression CS489/698 (c) 2018 P. Poupart

Simple Example

• Consider a network with two layers:

– Hidden nodes: ℎ " = tanh " = ()*(+)(),(+)

• Tip: -".ℎ/ " = 1 − (-".ℎ " )4– Output node: ℎ " = "

• Objective: squared error

CS489/698 (c) 2018 P. Poupart 15

Page 16: CS489/698 Lecture 9: Jan 31, 2018 - University of Waterloo · 2018-01-29 · Quick Recap: Non-linear Models Non-linear classification Non-linear regression CS489/698 (c) 2018 P. Poupart

Simple Example

• Forward propagation: – Hidden units: !" = $" =– Output units: !% = $% =

• Backward propagation:– Output units: &% =– Hidden units: &" =

• Gradients:– Hidden layers: '()'*+,

=– Output layer: '()'*-+

=

CS489/698 (c) 2018 P. Poupart 16

Page 17: CS489/698 Lecture 9: Jan 31, 2018 - University of Waterloo · 2018-01-29 · Quick Recap: Non-linear Models Non-linear classification Non-linear regression CS489/698 (c) 2018 P. Poupart

Non-linear regression examples

• Two layer network:

– 3 tanh hidden units and 1 identity output unit

! = #$ ! = sin #

! = |#| ! = ) * + ,+-

./

CS489/698 (c) 2018 P. Poupart 17

Page 18: CS489/698 Lecture 9: Jan 31, 2018 - University of Waterloo · 2018-01-29 · Quick Recap: Non-linear Models Non-linear classification Non-linear regression CS489/698 (c) 2018 P. Poupart

Analysis

• Efficiency: – Fast gradient computation: linear in number of weights

• Convergence: – Slow convergence (linear rate)– May get trapped in local optima

• Prone to overfitting– Solutions: early stopping, regularization (add ! "

"

penalty term to objective), dropout

CS489/698 (c) 2018 P. Poupart 18