
Transcript of: Neural Networks and Backpropagation (Carnegie Mellon University, 10-601 Introduction to Machine Learning, Lecture 19)

Page 1:

Neural Networks and Backpropagation

1

10-601 Introduction to Machine Learning

Matt Gormley
Lecture 19

March 29, 2017

Machine Learning Department
School of Computer Science
Carnegie Mellon University

Neural Net Readings: Murphy --, Bishop 5, HTF 11, Mitchell 4

Page 2:

Reminders

• Homework 6: Unsupervised Learning
– Release: Wed, Mar. 22
– Due: Mon, Apr. 03 at 11:59pm

• Homework 5 (Part II): Peer Review
– Release: Wed, Mar. 29
– Due: Wed, Apr. 05 at 11:59pm

• Peer Tutoring

2

Expectation: You should spend at most 1 hour on your reviews.

Page 3:

Neural Networks Outline
• Logistic Regression (Recap)
– Data, Model, Learning, Prediction
• Neural Networks
– A Recipe for Machine Learning
– Visual Notation for Neural Networks
– Example: Logistic Regression Output Surface
– 2-Layer Neural Network
– 3-Layer Neural Network
• Neural Net Architectures
– Objective Functions
– Activation Functions
• Backpropagation
– Basic Chain Rule (of calculus)
– Chain Rule for Arbitrary Computation Graph
– Backpropagation Algorithm
– Module-based Automatic Differentiation (Autodiff)

3

Page 4:

RECALL: LOGISTIC REGRESSION

4

Page 5:

Using gradient ascent for linear classifiers

Key idea behind today’s lecture:
1. Define a linear classifier (logistic regression)
2. Define an objective function (likelihood)
3. Optimize it with gradient descent to learn parameters
4. Predict the class with highest probability under the model

5
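A concrete sketch of this four-step recipe, written in NumPy; this is an illustrative toy implementation rather than code from the course, and the data, learning rate, and epoch count are made-up values:

```python
import numpy as np

def logistic(u):
    return 1.0 / (1.0 + np.exp(-u))

# Step 2: the objective is the (negative) conditional log-likelihood of the training data
def neg_log_likelihood(theta, X, y):
    p = logistic(X @ theta)                      # model: p_theta(y=1|x) = logistic(theta^T x)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Step 3: optimize by gradient descent on the negative log-likelihood
def train(X, y, lr=0.1, epochs=200):
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = X.T @ (logistic(X @ theta) - y)   # gradient of the negative log-likelihood
        theta -= lr * grad                       # small step opposite the gradient
    return theta

# Step 4: predict the class with highest probability under the model
def predict(theta, X):
    return (logistic(X @ theta) >= 0.5).astype(int)

# Step 1: a linear classifier (logistic regression) on toy data; first column acts as a bias feature
X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, -0.7]])
y = np.array([1, 0, 1, 0])
theta_hat = train(X, y)
print(predict(theta_hat, X))
```

Minimizing the negative log-likelihood by gradient descent is equivalent to maximizing the likelihood by gradient ascent, which is why the slide title says “ascent” while step 3 says “descent.”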

Page 6:

Using gradient ascent for linear classifiers

6

This decision function isn’t differentiable:
$h(\mathbf{x}) = \operatorname{sign}(\theta^T \mathbf{x})$

Use a differentiable function instead:
$\operatorname{logistic}(u) \equiv \frac{1}{1 + e^{-u}}$
$p_\theta(y = 1 \mid \mathbf{x}) = \frac{1}{1 + \exp(-\theta^T \mathbf{x})}$

Page 7:

Using gradient ascent for linear classifiers

7

This decision function isn’t differentiable:
$h(\mathbf{x}) = \operatorname{sign}(\theta^T \mathbf{x})$

Use a differentiable function instead:
$\operatorname{logistic}(u) \equiv \frac{1}{1 + e^{-u}}$
$p_\theta(y = 1 \mid \mathbf{x}) = \frac{1}{1 + \exp(-\theta^T \mathbf{x})}$

Page 8:

Logistic  Regression

8

Data: Inputs are continuous vectors of length K. Outputs are discrete.
$\mathcal{D} = \{ \mathbf{x}^{(i)}, y^{(i)} \}_{i=1}^{N}$ where $\mathbf{x} \in \mathbb{R}^K$ and $y \in \{0, 1\}$

Model: Logistic function applied to dot product of parameters with input vector.
$p_\theta(y = 1 \mid \mathbf{x}) = \frac{1}{1 + \exp(-\theta^T \mathbf{x})}$

Learning: finds the parameters that minimize some objective function.
$\theta^{*} = \underset{\theta}{\operatorname{argmin}} \; J(\theta)$

Prediction: Output is the most probable class.
$\hat{y} = \underset{y \in \{0,1\}}{\operatorname{argmax}} \; p_\theta(y \mid \mathbf{x})$
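The objective $J(\theta)$ is left unspecified above; the standard choice for logistic regression is the negative conditional log-likelihood, whose gradient has the familiar “prediction minus label” form used during learning:

$$J(\theta) = -\sum_{i=1}^{N} \Big[ y^{(i)} \log p_\theta(y = 1 \mid \mathbf{x}^{(i)}) + (1 - y^{(i)}) \log\big(1 - p_\theta(y = 1 \mid \mathbf{x}^{(i)})\big) \Big]$$

$$\nabla_\theta J(\theta) = \sum_{i=1}^{N} \big( p_\theta(y = 1 \mid \mathbf{x}^{(i)}) - y^{(i)} \big)\, \mathbf{x}^{(i)}$$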

Page 9:

NEURAL NETWORKS

9

Page 10:

A Recipe for Machine Learning

1. Given training data:
[Figure: face-detection training examples labeled “Face”, “Face”, “Not a face”]

2. Choose each of these:
– Decision function (Examples: Linear regression, Logistic regression, Neural Network)
– Loss function (Examples: Mean-squared error, Cross Entropy)

10

Background

Page 11:

A Recipe for Machine Learning

1. Given training data:
2. Choose each of these:
– Decision function
– Loss function
3. Define goal:
4. Train with SGD: (take small steps opposite the gradient)

11

Background

Page 12:

A Recipe for Machine Learning

1. Given training data:
2. Choose each of these:
– Decision function
– Loss function
3. Define goal:
4. Train with SGD: (take small steps opposite the gradient)

12

Background

Gradients

Backpropagation can compute this gradient! And it’s a special case of a more general algorithm called reverse-mode automatic differentiation that can compute the gradient of any differentiable function efficiently!
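To make the reverse-mode idea concrete, here is a toy sketch (an illustrative assumption about how such a system can be organized, not the implementation used in this course): each operation records its inputs and local partial derivatives during the forward pass, and a backward sweep in reverse topological order accumulates gradients with the chain rule.

```python
class Var:
    """A scalar node in a computation graph (toy sketch)."""
    def __init__(self, value, parents=()):
        self.value = value        # forward value
        self.grad = 0.0           # accumulated dJ/d(this node)
        self.parents = parents    # tuple of (parent Var, local partial derivative)

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value, ((self, other.value), (other, self.value)))

def backward(output):
    # Topologically order the graph reachable from the output ...
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)
    visit(output)
    # ... then apply the chain rule in reverse, so each node's gradient is
    # fully accumulated before it is propagated to its parents.
    output.grad = 1.0
    for node in reversed(order):
        for parent, local_grad in node.parents:
            parent.grad += node.grad * local_grad

# Example: J = x * w + b  =>  dJ/dw = x, dJ/dx = w, dJ/db = 1
x, w, b = Var(2.0), Var(3.0), Var(1.0)
J = x * w + b
backward(J)
print(w.grad, x.grad, b.grad)   # 2.0 3.0 1.0
```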

Page 13:

A Recipe for Machine Learning

1. Given training data:
2. Choose each of these:
– Decision function
– Loss function
3. Define goal:
4. Train with SGD: (take small steps opposite the gradient)

13

Background

Goals for Today’s Lecture

1. Explore a new class of decision functions (Neural Networks)
2. Consider variants of this recipe for training

Page 14:

Linear  Regression

14

Decision  Functions

Output

Input

θ1 θ2 θ3 θM

$y = h_\theta(\mathbf{x}) = \sigma(\theta^T \mathbf{x})$, where $\sigma(a) = a$

Page 15:

Logistic  Regression

15

Decision  Functions

Output

Input

θ1 θ2 θ3 θM

$y = h_\theta(\mathbf{x}) = \sigma(\theta^T \mathbf{x})$, where $\sigma(a) = \frac{1}{1 + \exp(-a)}$

Page 16:

$y = h_\theta(\mathbf{x}) = \sigma(\theta^T \mathbf{x})$, where $\sigma(a) = \frac{1}{1 + \exp(-a)}$

Logistic  Regression

16

Decision  Functions

Output

Input

θ1 θ2 θ3 θM

[Figure: face-detection examples labeled “Face”, “Face”, “Not a face”]

Page 17:

$y = h_\theta(\mathbf{x}) = \sigma(\theta^T \mathbf{x})$, where $\sigma(a) = \frac{1}{1 + \exp(-a)}$

Logistic  Regression

17

Decision  Functions

Output

Input

θ1 θ2 θ3 θM

In-Class Example

x1   x2   y
 1    1   0

Page 18:

Perceptron

18

Decision  Functions

Output

Input

θ1 θ2 θ3 θM

$y = h_\theta(\mathbf{x}) = \sigma(\theta^T \mathbf{x})$, where $\sigma(a) = \operatorname{sign}(a)$

Page 19:

From Biological to Artificial

The motivation for Artificial Neural Networks comes from biology…

Biological “Model”
• Neuron: an excitable cell
• Synapse: connection between neurons
• A neuron sends an electrochemical pulse along its synapses when a sufficient voltage change occurs
• Biological Neural Network: collection of neurons along some pathway through the brain

Artificial Model
• Neuron: node in a directed acyclic graph (DAG)
• Weight: multiplier on each edge
• Activation Function: nonlinear thresholding function, which allows a neuron to “fire” when the input value is sufficiently high
• Artificial Neural Network: collection of neurons into a DAG, which define some differentiable function

Biological “Computation”
• Neuron switching time: ~ 0.001 sec
• Number of neurons: ~ 10^10
• Connections per neuron: ~ 10^4 to 10^5
• Scene recognition time: ~ 0.1 sec

Artificial Computation
• Many neuron-like threshold switching units
• Many weighted interconnections among units
• Highly parallel, distributed processes

19

Slide adapted from Eric Xing
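To connect the two columns: a single artificial neuron is just a weighted sum of its inputs followed by a nonlinear activation. The sketch below uses a sigmoid activation; the weights and inputs are made-up values for illustration only.

```python
import numpy as np

def neuron(x, w, b=0.0):
    """One artificial neuron: weighted sum of inputs, then a sigmoid 'firing' nonlinearity."""
    a = np.dot(w, x) + b               # linear combination of inputs via edge weights
    return 1.0 / (1.0 + np.exp(-a))    # output approaches 1 when the weighted input is sufficiently high

# Illustrative values only
x = np.array([0.2, 0.9, -0.4])   # inputs (outputs of other neurons or raw features)
w = np.array([1.5, -0.3, 0.8])   # edge weights
print(neuron(x, w))
```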

Page 20:

Logistic  Regression

20

Decision  Functions

Output

Input

θ1 θ2 θ3 θM

$y = h_\theta(\mathbf{x}) = \sigma(\theta^T \mathbf{x})$, where $\sigma(a) = \frac{1}{1 + \exp(-a)}$

Page 21:

Neural  Networks

Whiteboard
– Example: Neural Network w/ 1 Hidden Layer
– Example: Neural Network w/ 2 Hidden Layers
– Example: Feed Forward Neural Network

21

Page 22:

Neural Network Model

[Figure: feed-forward network. The inputs (independent variables) Age = 34, Gender = 2, Stage = 4 connect through a layer of weights to hidden sigmoid units, which connect through a second layer of weights to a single sigmoid output unit; the output (dependent variable / prediction) is “Probability of being Alive” = 0.6. Edge weights such as .6, .5, .8, .2, .1, .3, .7 are shown on the connections.]

© Eric Xing @ CMU, 2006-2011   22

Page 23:

“Combined logistic models”

[Figure: the same network, highlighting the weights into a single hidden sigmoid unit and from it to the output; inputs Age = 34, Gender = 2, Stage = 4; output “Probability of being Alive” = 0.6.]

© Eric Xing @ CMU, 2006-2011   23

Page 24:

[Figure: the same network, highlighting a different subset of weights (another logistic unit within the network); inputs Age = 34, Gender = 2, Stage = 4; output “Probability of being Alive” = 0.6.]

© Eric Xing @ CMU, 2006-2011   24

Page 25:

© Eric Xing @ CMU, 2006-2011   25

[Figure: the same network with inputs Age = 34, Gender = 1, Stage = 4; both layers of weights shown; output “Probability of being Alive” = 0.6.]

Page 26:

[Figure: the full network with both layers of weights and the hidden sigmoid units; inputs Age = 34, Gender = 2, Stage = 4; output “Probability of being Alive” = 0.6.]

Not really, no target for hidden units...

© Eric Xing @ CMU, 2006-2011   26

Page 27:

Neural  Network

28

Decision  Functions

Output

Input

Hidden  Layer

Page 28:

Neural  Network

29

Decision  Functions

Output

Input

Hidden  Layer

(A) Input: given $x_i$, $\forall i$

(B) Hidden (linear): $a_j = \sum_{i=0}^{M} \alpha_{ji} x_i$, $\forall j$

(C) Hidden (sigmoid): $z_j = \frac{1}{1 + \exp(-a_j)}$, $\forall j$

(D) Output (linear): $b = \sum_{j=0}^{D} \beta_j z_j$

(E) Output (sigmoid): $y = \frac{1}{1 + \exp(-b)}$

(F) Loss: $J = \frac{1}{2} (y - y^{(d)})^2$, where $y^{(d)}$ is the target output
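Reading (A) through (F) as a single forward computation, a minimal NumPy sketch of this one-hidden-layer network could look like the following; the weight values, the number of hidden units, and the omission of bias terms are illustrative assumptions, not part of the slide.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def forward(x, alpha, beta, y_target):
    a = alpha @ x                    # (B) Hidden (linear):  a_j = sum_i alpha_ji * x_i
    z = sigmoid(a)                   # (C) Hidden (sigmoid): z_j = 1 / (1 + exp(-a_j))
    b = beta @ z                     # (D) Output (linear):  b = sum_j beta_j * z_j
    y = sigmoid(b)                   # (E) Output (sigmoid): y = 1 / (1 + exp(-b))
    J = 0.5 * (y - y_target) ** 2    # (F) Loss:             J = 1/2 (y - y_target)^2
    return y, J

# (A) Input: one example with M = 3 features (illustrative values)
x = np.array([1.0, 0.5, -0.2])
alpha = 0.1 * np.random.randn(4, 3)   # first-layer weights, shape (D, M) with D = 4 hidden units
beta = 0.1 * np.random.randn(4)       # second-layer weights, shape (D,)
y_hat, loss = forward(x, alpha, beta, y_target=1.0)
print(y_hat, loss)
```

Backpropagation (covered later in the outline) computes the gradients of J with respect to alpha and beta by applying the chain rule to exactly this sequence of steps.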

Page 29:

NEURAL NETWORKS: REPRESENTATIONAL POWER

30

Page 30:

Building  a  Neural  Net

31

Output

Features

Q:  How  many  hidden  units,  D,  should  we  use?

Page 31:

Building  a  Neural  Net

Output

Input

Hidden Layer
D = M

32


Q:  How  many  hidden  units,  D,  should  we  use?

Page 32:

Building  a  Neural  Net

Output

Input

Hidden Layer
D = M

33

Q:  How  many  hidden  units,  D,  should  we  use?

Page 33:

Building  a  Neural  Net

Output

Input

Hidden Layer
D = M

34

Q:  How  many  hidden  units,  D,  should  we  use?

Page 34:

Building  a  Neural  Net

Output

Input

Hidden Layer
D < M

35

What  method(s)  is  this  setting  similar  to?

Q:  How  many  hidden  units,  D,  should  we  use?

Page 35:

Building  a  Neural  Net

Output

Input

Hidden Layer
D > M

36

What  method(s)  is  this  setting  similar  to?

Q:  How  many  hidden  units,  D,  should  we  use?

Page 36:

Deeper  Networks

37

Decision  Functions

Output

Input

Hidden  Layer  1

Q:  How  many  layers  should  we  use?

Page 37:

Deeper  Networks

38

Decision  Functions

…Input

Hidden  Layer  1

Output

Hidden  Layer  2

Q:  How  many  layers  should  we  use?

Page 38:

Q:  How  many  layers  should  we  use?

Deeper  Networks

39

Decision  Functions

…Input

Hidden  Layer  1

…Hidden  Layer  2

Output

Hidden  Layer  3

Page 39:

Deeper  Networks

40

Decision  Functions

Output

Input

Hidden  Layer  1

Q: How many layers should we use?
• Theoretical answer:
– A neural network with 1 hidden layer is a universal function approximator
– Cybenko (1989): For any continuous function g(x), there exists a 1-hidden-layer neural net $h_\theta(x)$ s.t. $|h_\theta(x) - g(x)| < \epsilon$ for all x, assuming sigmoid activation functions
• Empirical answer:
– Before 2006: “Deep networks (e.g. 3 or more hidden layers) are too hard to train”
– After 2006: “Deep networks are easier to train than shallow networks (e.g. 2 or fewer layers) for many problems”

Big caveat: You need to know and use the right tricks.

Page 40:

Decision  Boundary

• 0 hidden layers: linear classifier
– Hyperplanes

[Figure: network with inputs x1, x2 and output y, and the corresponding linear decision boundary]

Example from Eric Postma via Jason Eisner   41

Page 41:

Decision  Boundary

• 1 hidden layer
– Boundary of convex region (open or closed)

[Figure: network with one hidden layer, inputs x1, x2, output y, and the corresponding decision region]

Example from Eric Postma via Jason Eisner   42

Page 42:

Decision  Boundary

• 2 hidden layers
– Combinations of convex regions

Example from Eric Postma via Jason Eisner

[Figure: network with two hidden layers, inputs x1, x2, output y, and the corresponding decision regions]

43
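One way to see this progression, under the simplifying assumption of hard-threshold activations instead of sigmoids: each hidden unit defines a half-space, and an output unit that fires only when all D hidden units fire computes their intersection, which is convex; sigmoid units give a smoothed version of that boundary, and a second hidden layer can then combine several such regions.

$$z_j = \mathbb{1}\big[\alpha_j^{T}\mathbf{x} \ge 0\big] \;\; (j = 1,\dots,D), \qquad y = \mathbb{1}\Big[\sum_{j=1}^{D} z_j \ge D\Big] \;\;\Rightarrow\;\; \{\mathbf{x} : y = 1\} = \bigcap_{j=1}^{D} \{\mathbf{x} : \alpha_j^{T}\mathbf{x} \ge 0\}$$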

Page 43:

Different  Levels  of  Abstraction

• We  don’t  know  the  “right”  levels  of  abstraction

• So  let  the  model  figure  it  out!

44

Decision  Functions

Example  from  Honglak Lee  (NIPS  2010)

Page 44:

Different  Levels  of  Abstraction

Face Recognition:
– Deep Network can build up increasingly higher levels of abstraction
– Lines, parts, regions

45

Decision  Functions

Example  from  Honglak Lee  (NIPS  2010)

Page 45:

Different  Levels  of  Abstraction

46

Decision  Functions

Example  from  Honglak Lee  (NIPS  2010)

…Input

Hidden  Layer  1

…Hidden  Layer  2

Output

Hidden  Layer  3