Lecture 4: Backpropagation

CMSC 35246: Deep Learning

Shubhendu Trivedi & Risi Kondor

University of Chicago

April 5, 2017


Things we will look at today

• More Backpropagation
• Still more backpropagation
• Quiz at 4:05 PM


To understand, let us just calculate!


One Neuron Again

[Diagram: a single neuron with inputs $x_1, x_2, \ldots, x_d$, weights $w_1, w_2, \ldots, w_d$, and output $\hat{y}$]

Consider example $x$; Output for $x$ is $\hat{y}$; Correct Answer is $y$

Loss $L = (\hat{y} - y)^2$

$\hat{y} = x^T w = x_1 w_1 + x_2 w_2 + \dots + x_d w_d$


One Neuron Again

[Diagram: the single neuron, as before]

Want to update $w_i$ (forget closed form solution for a bit!)

Update rule: $w_i := w_i - \eta \frac{\partial L}{\partial w_i}$

Now

$\frac{\partial L}{\partial w_i} = \frac{\partial (\hat{y} - y)^2}{\partial w_i} = 2(\hat{y} - y)\,\frac{\partial (x_1 w_1 + x_2 w_2 + \dots + x_d w_d)}{\partial w_i}$


One Neuron Again

[Diagram: the single neuron, as before]

We have: $\frac{\partial L}{\partial w_i} = 2(\hat{y} - y) x_i$

Update Rule:

$w_i := w_i - \eta(\hat{y} - y) x_i = w_i - \eta\delta x_i$ where $\delta = (\hat{y} - y)$

In vector form: $w := w - \eta\delta x$

Simple enough! Now let's graduate ...
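To make this concrete, here is a minimal numpy sketch (not from the slides) of the update just derived; the factor of 2 is absorbed into the learning rate $\eta$, as the later slides also do.

```python
# Minimal sketch (assumption: plain numpy, factor of 2 folded into eta):
# one SGD step for the single linear neuron y_hat = x^T w with squared loss.
import numpy as np

def neuron_sgd_step(w, x, y, eta=0.1):
    y_hat = x @ w               # forward pass: y_hat = x^T w
    delta = y_hat - y           # local error, delta = (y_hat - y)
    return w - eta * delta * x  # w_i := w_i - eta * delta * x_i, all i at once

w = np.zeros(3)
x = np.array([1.0, 2.0, -1.0])
w = neuron_sgd_step(w, x, y=0.5)   # one update on a single example
```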


Simple Feedforward Network

[Diagram: inputs $x_1, x_2, x_3$ connect to hidden units $z_1, z_2$ through weights $w^{(1)}_{ij}$; the hidden units connect to the output $\hat{y}$ through weights $w^{(2)}_1, w^{(2)}_2$]

$\hat{y} = w^{(2)}_1 z_1 + w^{(2)}_2 z_2$

$z_1 = \tanh(a_1)$ where $a_1 = w^{(1)}_{11} x_1 + w^{(1)}_{21} x_2 + w^{(1)}_{31} x_3$; likewise for $z_2$


Simple Feedforward Network

$z_1 = \tanh(a_1)$ where $a_1 = w^{(1)}_{11} x_1 + w^{(1)}_{21} x_2 + w^{(1)}_{31} x_3$

$z_2 = \tanh(a_2)$ where $a_2 = w^{(1)}_{12} x_1 + w^{(1)}_{22} x_2 + w^{(1)}_{32} x_3$

[Diagram: the same two-layer network]

Output $\hat{y} = w^{(2)}_1 z_1 + w^{(2)}_2 z_2$; Loss $L = (\hat{y} - y)^2$

Want to assign credit for the loss $L$ to each weight
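As a concrete illustration, here is a minimal numpy sketch (not part of the slides) of the forward pass for this toy network; the layout W1[i, j] holding $w^{(1)}_{(i+1)(j+1)}$ is an indexing convention assumed for the sketch.

```python
# Minimal sketch (assumption: numpy, W1[i, j] holds w^(1)_{i+1, j+1}):
# forward pass for three inputs, two tanh hidden units, one linear output.
import numpy as np

def forward(x, W1, w2):
    a = x @ W1            # pre-activations a_1, a_2
    z = np.tanh(a)        # hidden activations z_1, z_2
    y_hat = z @ w2        # output y_hat = w^(2)_1 z_1 + w^(2)_2 z_2
    return a, z, y_hat

rng = np.random.default_rng(0)
x = np.array([0.5, -1.0, 2.0])
W1 = 0.1 * rng.standard_normal((3, 2))
w2 = 0.1 * rng.standard_normal(2)
a, z, y_hat = forward(x, W1, w2)
loss = (y_hat - 1.0) ** 2   # squared loss against an example target y = 1.0
```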


Top Layer

Want to find: $\frac{\partial L}{\partial w^{(2)}_1}$ and $\frac{\partial L}{\partial w^{(2)}_2}$

Consider $w^{(2)}_1$ first

[Diagram: the same two-layer network]

$\frac{\partial L}{\partial w^{(2)}_1} = \frac{\partial (\hat{y} - y)^2}{\partial w^{(2)}_1} = 2(\hat{y} - y)\,\frac{\partial (w^{(2)}_1 z_1 + w^{(2)}_2 z_2)}{\partial w^{(2)}_1} = 2(\hat{y} - y) z_1$

Familiar from earlier! Update for $w^{(2)}_1$ would be

$w^{(2)}_1 := w^{(2)}_1 - \eta \frac{\partial L}{\partial w^{(2)}_1} = w^{(2)}_1 - \eta\delta z_1$ with $\delta = (\hat{y} - y)$

Likewise, for $w^{(2)}_2$ the update would be $w^{(2)}_2 := w^{(2)}_2 - \eta\delta z_2$
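Since this is the first gradient derived for the two-layer network, a quick sanity check helps: the sketch below (not from the slides, and using a finite-difference check that the lecture does not cover) confirms that $2(\hat{y} - y) z_i$ matches a numerical estimate of $\partial L / \partial w^{(2)}_i$.

```python
# Minimal sketch (assumption: numpy; the finite-difference check is an addition,
# not part of the lecture): verify dL/dw2_i = 2 (y_hat - y) z_i numerically.
import numpy as np

def loss_fn(x, y, W1, w2):
    z = np.tanh(x @ W1)
    return (z @ w2 - y) ** 2

rng = np.random.default_rng(0)
x, y = np.array([0.5, -1.0, 2.0]), 1.0
W1, w2 = 0.1 * rng.standard_normal((3, 2)), 0.1 * rng.standard_normal(2)

z = np.tanh(x @ W1)
y_hat = z @ w2
grad_w2 = 2 * (y_hat - y) * z          # analytic gradient from the slide

eps = 1e-6
for i in range(2):                     # central finite differences, one weight at a time
    w_plus, w_minus = w2.copy(), w2.copy()
    w_plus[i] += eps
    w_minus[i] -= eps
    numeric = (loss_fn(x, y, W1, w_plus) - loss_fn(x, y, W1, w_minus)) / (2 * eps)
    assert abs(numeric - grad_w2[i]) < 1e-5
```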


Next Layer

There are six weights to assign credit for the loss incurred

Consider $w^{(1)}_{11}$ for an illustration

Rest are similar

[Diagram: the same two-layer network]

$\frac{\partial L}{\partial w^{(1)}_{11}} = \frac{\partial (\hat{y} - y)^2}{\partial w^{(1)}_{11}} = 2(\hat{y} - y)\,\frac{\partial (w^{(2)}_1 z_1 + w^{(2)}_2 z_2)}{\partial w^{(1)}_{11}}$

Now: $\frac{\partial (w^{(2)}_1 z_1 + w^{(2)}_2 z_2)}{\partial w^{(1)}_{11}} = w^{(2)}_1 \frac{\partial \tanh(w^{(1)}_{11} x_1 + w^{(1)}_{21} x_2 + w^{(1)}_{31} x_3)}{\partial w^{(1)}_{11}} + 0$

Which is: $w^{(2)}_1 (1 - \tanh^2(a_1))\, x_1$ (recall $a_1 = ?$)

So we have: $\frac{\partial L}{\partial w^{(1)}_{11}} = 2(\hat{y} - y)\, w^{(2)}_1 (1 - \tanh^2(a_1))\, x_1$


Next Layer

$\frac{\partial L}{\partial w^{(1)}_{11}} = 2(\hat{y} - y)\, w^{(2)}_1 (1 - \tanh^2(a_1))\, x_1$

Weight update: $w^{(1)}_{11} := w^{(1)}_{11} - \eta \frac{\partial L}{\partial w^{(1)}_{11}}$

[Diagram: the same two-layer network]

Likewise, if we were considering $w^{(1)}_{22}$, we'd have:

$\frac{\partial L}{\partial w^{(1)}_{22}} = 2(\hat{y} - y)\, w^{(2)}_2 (1 - \tanh^2(a_2))\, x_2$

Weight update: $w^{(1)}_{22} := w^{(1)}_{22} - \eta \frac{\partial L}{\partial w^{(1)}_{22}}$
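These six per-weight derivatives can be computed in one shot. Here is a minimal numpy sketch (not from the slides) that forms the local errors $\delta_j$ for both hidden units and then all first-layer gradients at once, with the factor of 2 folded into $\eta$.

```python
# Minimal sketch (assumption: numpy, factor of 2 folded into eta):
# dL/dw1[i, j] = (y_hat - y) * w2[j] * (1 - tanh(a_j)^2) * x_i for all i, j.
import numpy as np

rng = np.random.default_rng(0)
x, y = np.array([0.5, -1.0, 2.0]), 1.0
W1, w2 = 0.1 * rng.standard_normal((3, 2)), 0.1 * rng.standard_normal(2)

a = x @ W1
z = np.tanh(a)
y_hat = z @ w2

delta = y_hat - y                        # top-level error
delta_hidden = delta * w2 * (1 - z**2)   # delta_j for the two hidden units
grad_W1 = np.outer(x, delta_hidden)      # grad_W1[i, j] = delta_j * x_i, shape (3, 2)

eta = 0.1
W1 = W1 - eta * grad_W1                  # the weight update from the slide
```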


Let's clean this up...

Recall, for the top layer: $\frac{\partial L}{\partial w^{(2)}_i} = (\hat{y} - y) z_i = \delta z_i$ (ignoring the 2)

One can think of this as: $\frac{\partial L}{\partial w^{(2)}_i} = \underbrace{\delta}_{\text{local error}}\ \underbrace{z_i}_{\text{local input}}$

For the next layer we had: $\frac{\partial L}{\partial w^{(1)}_{ij}} = (\hat{y} - y)\, w^{(2)}_j (1 - \tanh^2(a_j))\, x_i$

Let $\delta_j = (\hat{y} - y)\, w^{(2)}_j (1 - \tanh^2(a_j)) = \delta\, w^{(2)}_j (1 - \tanh^2(a_j))$

(Notice that $\delta_j$ contains the $\delta$ term (which is the error!))

Then: $\frac{\partial L}{\partial w^{(1)}_{ij}} = \underbrace{\delta_j}_{\text{local error}}\ \underbrace{x_i}_{\text{local input}}$

Neat!


Let's clean this up...

Let's get a cleaner notation to summarize this

Let $w_{i\,j}$ be the weight for the connection FROM node $i$ TO node $j$

Then $\frac{\partial L}{\partial w_{i\,j}} = \delta_j z_i$

$\delta_j$ is the local error (going from $j$ backwards) and $z_i$ is the local input coming from $i$


Credit Assignment: A Graphical Revision

[Diagram: the toy network redrawn with numbered nodes: output node 0 ($\hat{y}$), hidden nodes 1 ($z_1$) and 2 ($z_2$), and input nodes 5 ($x_1$), 4 ($x_2$), 3 ($x_3$), with each weight written $w_{i\,j}$ for the connection from node $i$ to node $j$]

Let's redraw our toy network with the new notation and label the nodes


Credit Assignment: Top Layer

[Diagram: the numbered toy network, with the error signal $\delta$ shown at the output node 0]

Local error from 0: $\delta = (\hat{y} - y)$, local input from 1: $z_1$

$\therefore \frac{\partial L}{\partial w_{1\,0}} = \delta z_1$; and update $w_{1\,0} := w_{1\,0} - \eta\delta z_1$


Credit Assignment: Top Layer

[Diagram: the numbered toy network, with the error signal $\delta$ shown at the output node 0]

Local error from 0: $\delta = (\hat{y} - y)$, local input from 2: $z_2$

$\therefore \frac{\partial L}{\partial w_{2\,0}} = \delta z_2$ and update $w_{2\,0} := w_{2\,0} - \eta\delta z_2$


Credit Assignment: Next Layer

[Diagram: the numbered toy network, with $\delta$ at the output node 0 and $\delta_1$ at hidden node 1]

Local error from 1: $\delta_1 = (\delta)(w_{1\,0})(1 - \tanh^2(a_1))$, local input from 3: $x_3$

$\therefore \frac{\partial L}{\partial w_{3\,1}} = \delta_1 x_3$ and update $w_{3\,1} := w_{3\,1} - \eta\delta_1 x_3$


Credit Assignment: Next Layer

[Diagram: the numbered toy network, with $\delta$ at the output node 0 and $\delta_2$ at hidden node 2]

Local error from 2: $\delta_2 = (\delta)(w_{2\,0})(1 - \tanh^2(a_2))$, local input from 4: $x_2$

$\therefore \frac{\partial L}{\partial w_{4\,2}} = \delta_2 x_2$ and update $w_{4\,2} := w_{4\,2} - \eta\delta_2 x_2$


Let's Vectorize

Let $W^{(2)} = \begin{bmatrix} w_{1\,0} \\ w_{2\,0} \end{bmatrix}$ (ignore that $W^{(2)}$ is a vector and hence it would be more appropriate to use $w^{(2)}$)

Let

$W^{(1)} = \begin{bmatrix} w_{5\,1} & w_{5\,2} \\ w_{4\,1} & w_{4\,2} \\ w_{3\,1} & w_{3\,2} \end{bmatrix}$

Let

$Z^{(1)} = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}$ and $Z^{(2)} = \begin{bmatrix} z_1 \\ z_2 \end{bmatrix}$


Feedforward Computation

1. Compute $A^{(1)} = Z^{(1)T} W^{(1)}$
2. Apply the element-wise non-linearity: $Z^{(2)} = \tanh(A^{(1)})$
3. Compute the output $\hat{y} = Z^{(2)T} W^{(2)}$
4. Compute the loss on the example: $(\hat{y} - y)^2$


Flowing Backward

1. Top: Compute $\delta$
2. Gradient w.r.t. $W^{(2)}$: $\delta Z^{(2)}$
3. Compute $\delta_1 = (W^{(2)T}\delta) \odot (1 - \tanh^2(A^{(1)}))$
   Notes: (a) $\odot$ is the Hadamard product. (b) We have written $W^{(2)T}\delta$ since $\delta$ can be a vector when there are multiple outputs.
4. Gradient w.r.t. $W^{(1)}$: $\delta_1 Z^{(1)}$
5. Update $W^{(2)} := W^{(2)} - \eta\,\delta Z^{(2)}$
6. Update $W^{(1)} := W^{(1)} - \eta\,\delta_1 Z^{(1)}$
7. All the dimensionalities nicely check out!
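Putting the last two slides together, here is a minimal numpy sketch (not from the slides) of one full forward and backward pass in this vectorized notation; repeated steps drive the loss toward zero on a single example.

```python
# Minimal sketch (assumption: numpy, single example, factor of 2 folded into eta):
# one vectorized forward/backward pass and SGD update for the toy network.
import numpy as np

def train_step(Z1, y, W1, W2, eta=0.1):
    # Feedforward
    A1 = Z1 @ W1                            # A^(1) = Z^(1)T W^(1)
    Z2 = np.tanh(A1)                        # Z^(2) = tanh(A^(1))
    y_hat = Z2 @ W2                         # output
    loss = (y_hat - y) ** 2

    # Flowing backward
    delta = y_hat - y                       # top-level error
    grad_W2 = delta * Z2                    # gradient w.r.t. W^(2)
    delta1 = (W2 * delta) * (1 - Z2 ** 2)   # (W^(2)T delta) Hadamard (1 - tanh^2(A^(1)))
    grad_W1 = np.outer(Z1, delta1)          # gradient w.r.t. W^(1)

    # Updates
    return W1 - eta * grad_W1, W2 - eta * grad_W2, loss

rng = np.random.default_rng(0)
W1, W2 = 0.1 * rng.standard_normal((3, 2)), 0.1 * rng.standard_normal(2)
Z1 = np.array([0.5, -1.0, 2.0])             # Z^(1) is just the input x
for _ in range(100):
    W1, W2, loss = train_step(Z1, 1.0, W1, W2)
```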


So Far

Backpropagation in the context of neural networks is all about assigning credit (or blame!) for the error incurred to the weights

• We follow the path from the output (where we have an error signal) to the edge we want to consider
• We find the δs from the top to the edge concerned by using the chain rule
• Once we have the partial derivative, we can write the update rule for that weight


What did we miss?

Exercise: What if there are multiple outputs? (look at the slide from last class)

Another exercise: Add bias neurons. What changes?

As we go down the network, notice that we need the previous δs

If we recompute them each time, it can blow up!

Need to book-keep derivatives as we go down the network and reuse them
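To make the book-keeping point concrete, here is a hedged sketch (not from the slides) for a deeper stack of tanh layers: each layer's $\delta$ is computed once from the layer above and reused for that layer's entire gradient, instead of being re-derived for every individual weight.

```python
# Minimal sketch (assumption: numpy, tanh hidden layers, linear output layer):
# cache one delta vector per layer and reuse it for that layer's whole gradient.
import numpy as np

def backprop(x, y, Ws):
    zs = [x]
    for W in Ws[:-1]:
        zs.append(np.tanh(zs[-1] @ W))        # cache activations on the way up
    y_hat = zs[-1] @ Ws[-1]                   # linear output layer

    grads = [None] * len(Ws)
    delta = np.atleast_1d(y_hat - y)          # top-level error
    grads[-1] = np.outer(zs[-1], delta)
    for l in range(len(Ws) - 2, -1, -1):
        # reuse the delta from the layer above instead of recomputing it
        delta = (Ws[l + 1] @ delta) * (1 - zs[l + 1] ** 2)
        grads[l] = np.outer(zs[l], delta)
    return grads

rng = np.random.default_rng(0)
Ws = [0.1 * rng.standard_normal(s) for s in [(3, 4), (4, 2), (2, 1)]]
grads = backprop(np.array([0.5, -1.0, 2.0]), 1.0, Ws)
```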


A General View of Backpropagation

Some redundancy in upcoming slides, but redundancy can be good!


An Aside

Backpropagation only refers to the method for computing the gradient

This is used with another algorithm, such as SGD, for learning using the gradient

Next: Computing the gradient $\nabla_x f(x, y)$ for an arbitrary $f$

$x$ is the set of variables whose derivatives are desired

Often we require the gradient of the cost $J(\theta)$ with respect to the parameters $\theta$, i.e. $\nabla_\theta J(\theta)$

Note: We restrict to the case where $f$ has a single output

First: Move to a more precise computational graph language!


Computational Graphs

Formalize computation as graphs

Nodes indicate variables (scalar, vector, tensor or another variable)

Operations are simple functions of one or more variables

Our graph language comes with a set of allowable operations

Examples:

z = xy

Graph: variable nodes x and y feed a × operation node, whose output is z

The graph uses the × operation for the computation
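
As a small illustrative sketch (the function names are hypothetical, not from any library), an operation node like × carries a forward rule together with its local partial derivatives ∂z/∂x = y and ∂z/∂y = x, which is all that backpropagation will later need:

def multiply_forward(x, y):
    return x * y                 # the node's value z = xy

def multiply_local_grads(x, y):
    return y, x                  # local partials: dz/dx = y, dz/dy = x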

Logistic Regression

Graph: x and w feed a dot node producing u1; u1 and b feed a + node producing u2; applying σ to u2 gives ŷ

Computes ŷ = σ(x^T w + b)
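
A minimal NumPy sketch of evaluating this graph node by node (the variable names follow the figure; the input values are arbitrary):

import numpy as np

x = np.array([1.0, -2.0, 0.5])
w = np.array([0.3, 0.1, -0.4])
b = 0.2

u1 = x @ w                            # dot node: x^T w
u2 = u1 + b                           # + node
y_hat = 1.0 / (1.0 + np.exp(-u2))     # σ node: logistic sigmoid
print(y_hat)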

H = max{0, XW + b}

Graph: X and W feed an MM node producing U1; U1 and b feed a + node producing U2; applying Rc to U2 gives H

MM is matrix multiplication and Rc is the ReLU activation
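
The same graph as a NumPy sketch (shapes chosen arbitrarily for illustration): U1 is the matrix multiplication, U2 adds the bias to every row, and the ReLU keeps the positive part:

import numpy as np

X = np.random.randn(4, 3)        # a batch of 4 examples with 3 features
W = np.random.randn(3, 5)
b = np.random.randn(5)

U1 = X @ W                       # MM node
U2 = U1 + b                      # + node (bias broadcast across rows)
H = np.maximum(0.0, U2)          # Rc node: ReLU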

Back to backprop: Chain Rule

Backpropagation computes the chain rule, in a manner that is highly efficient

Let f, g : R → R

Suppose y = g(x) and z = f(y) = f(g(x))

Chain rule: dz/dx = (dz/dy)(dy/dx)
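
For instance (an illustrative choice, not from the slides), take g(x) = x² and f(y) = sin(y); the chain rule gives dz/dx = cos(x²) · 2x, which a finite-difference check confirms:

import numpy as np

def g(x): return x ** 2
def f(y): return np.sin(y)

x = 1.3
analytic = np.cos(g(x)) * 2 * x                        # (dz/dy)(dy/dx)
eps = 1e-6
numeric = (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)  # central difference
print(analytic, numeric)                               # the two should agree closely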

Graph: x → y → z, with dy/dx labelling the edge from x to y and dz/dy labelling the edge from y to z

Chain rule: dz/dx = (dz/dy)(dy/dx)

Graph: x feeds both y1 and y2, and y1 and y2 both feed z; the edges carry dy1/dx, dy2/dx, dz/dy1 and dz/dy2

Multiple paths: dz/dx = (dz/dy1)(dy1/dx) + (dz/dy2)(dy2/dx)

Graph: x feeds y1, y2, . . . , yn, and each yj feeds z; the edges carry dyj/dx and dz/dyj

Multiple paths: dz/dx = Σj (dz/dyj)(dyj/dx)
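
A small numerical check of the multiple-path rule (the functions are illustrative): with y1 = x², y2 = sin(x) and z = y1·y2, the two path contributions are (dz/dy1)(dy1/dx) = y2·2x and (dz/dy2)(dy2/dx) = y1·cos(x):

import numpy as np

x = 0.7
y1, y2 = x ** 2, np.sin(x)

dz_dx = y2 * 2 * x + y1 * np.cos(x)     # sum over the two paths through y1 and y2

eps = 1e-6
def z_of(t): return (t ** 2) * np.sin(t)
numeric = (z_of(x + eps) - z_of(x - eps)) / (2 * eps)
print(dz_dx, numeric)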

Chain Rule

Consider x ∈ R^m, y ∈ R^n

Let g : R^m → R^n and f : R^n → R

Suppose y = g(x) and z = f(y), then

∂z/∂xi = Σj (∂z/∂yj)(∂yj/∂xi)

In vector notation:

[∂z/∂x1, . . . , ∂z/∂xm]^T = [Σj (∂z/∂yj)(∂yj/∂x1), . . . , Σj (∂z/∂yj)(∂yj/∂xm)]^T = ∇xz = (∂y/∂x)^T ∇yz
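
A NumPy sketch of the vector case (with an arbitrarily chosen g and f): for y = Ax the Jacobian ∂y/∂x is simply A, and with z = Σj yj² the gradient ∇yz is 2y, so ∇xz = A^T (2y):

import numpy as np

m, n = 4, 3
A = np.random.randn(n, m)       # y = A x maps R^m -> R^n, so the Jacobian dy/dx = A is n x m
x = np.random.randn(m)

y = A @ x
grad_y = 2 * y                  # ∇yz for z = sum(y_j^2)
grad_x = A.T @ grad_y           # ∇xz = (dy/dx)^T ∇yz

# finite-difference check of one coordinate
eps = 1e-6
e0 = np.zeros(m); e0[0] = eps
numeric0 = (np.sum((A @ (x + e0)) ** 2) - np.sum((A @ (x - e0)) ** 2)) / (2 * eps)
print(grad_x[0], numeric0)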

Chain Rule

∇xz = (∂y/∂x)^T ∇yz

(∂y/∂x) is the n × m Jacobian matrix of g

The gradient with respect to x is thus the product of a (transposed) Jacobian matrix (∂y/∂x) with a vector, namely the gradient ∇yz

Backpropagation consists of applying such Jacobian-gradient products to each operation in the computational graph

In general this need not apply only to vectors; it applies to tensors w.l.o.g.
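
In practice each operation supplies its Jacobian-gradient product directly, so the (often huge) Jacobian is never built explicitly. A sketch of two such rules (the helper names are illustrative, not a library API):

import numpy as np

def relu_vjp(x, grad_out):
    # Jacobian of ReLU is diagonal with entries 1[x > 0]; the product is elementwise
    return grad_out * (x > 0)

def matmul_vjp(X, W, grad_out):
    # For U = X W, the Jacobian-gradient products w.r.t. X and W are
    # grad_X = grad_out W^T and grad_W = X^T grad_out
    return grad_out @ W.T, X.T @ grad_out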

Chain Rule

We can of course also write this in terms of tensors

Let the gradient of z with respect to a tensor X be ∇Xz

If Y = g(X) and z = f(Y), then:

∇Xz = Σj (∇X Yj) ∂z/∂Yj

(the sum runs over the elements j of Y)

Recursive Application in a Computational Graph

Writing an algebraic expression for the gradient of a scalar with respect to any node in the computational graph that produced that scalar is straightforward using the chain rule

Let the successors of some node x be {y1, y2, . . . , yn}

Node: Computation result

Edge: Computation dependency

dz/dx = Σi (dz/dyi)(dyi/dx), where the sum runs over i = 1, . . . , n

Flow Graph (for previous slide)

Graph: . . . → x → {y1, y2, . . . , yn} → . . . → z (x feeds each of the yi, which eventually feed into z)

Recursive Application in a Computational Graph

Fprop: Visit nodes in the order given by a topological sort

Compute the value of each node given its ancestors

Bprop: Set the gradient at the output node to 1

Now visit nodes in reverse order

Compute the gradient with respect to each node using the gradient with respect to its successors

Successors of x in the previous slide are {y1, y2, . . . , yn}:

dz/dx = Σi (dz/dyi)(dyi/dx)
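
A minimal Python sketch of this procedure (the Node class and the example graph are illustrative, not the lecture's code): fprop visits nodes in topological order; bprop seeds the output with gradient 1 and lets each node add its (dz/dy)(dy/dx) term to its parents:

class Node:
    def __init__(self, parents=(), forward=None, local_grads=None):
        self.parents = list(parents)     # ancestor nodes this node depends on
        self.forward = forward           # computes the value from the parents' values
        self.local_grads = local_grads   # returns d(self)/d(parent) for each parent
        self.value = None
        self.grad = 0.0                  # accumulates dz/d(this node)

def fprop(topo_order, inputs):
    for node, val in inputs.items():     # leaf nodes get their input values
        node.value = val
    for node in topo_order:
        if node.forward is not None:
            node.value = node.forward(*[p.value for p in node.parents])

def bprop(topo_order):
    for node in topo_order:
        node.grad = 0.0
    topo_order[-1].grad = 1.0            # dz/dz = 1 at the output node
    for node in reversed(topo_order):    # visit nodes in reverse topological order
        if node.local_grads is None:
            continue
        locals_ = node.local_grads(*[p.value for p in node.parents])
        for parent, dnode_dparent in zip(node.parents, locals_):
            parent.grad += node.grad * dnode_dparent   # successor's chain-rule term, summed

# Example: z = (x * y) + y with x = 2, y = 3
x = Node(); y = Node()
u = Node(parents=[x, y], forward=lambda a, b: a * b, local_grads=lambda a, b: (b, a))
z = Node(parents=[u, y], forward=lambda a, b: a + b, local_grads=lambda a, b: (1.0, 1.0))
order = [x, y, u, z]
fprop(order, {x: 2.0, y: 3.0})
bprop(order)
print(z.value, x.grad, y.grad)   # 9.0, dz/dx = y = 3.0, dz/dy = x + 1 = 3.0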

Automatic Differentiation

Computation of the gradient can be automatically inferred from the symbolic expression of fprop

Every node type needs to know:

• How to compute its output

• How to compute its gradients with respect to its inputs, given the gradient w.r.t. its outputs (a minimal sketch of this interface follows below)

Makes for rapid prototyping
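
A sketch of that per-node contract (illustrative classes, not a real library): each op implements forward, plus a backward that maps the gradient w.r.t. its output to gradients w.r.t. its inputs:

import numpy as np

class DotOp:
    def forward(self, x, w):
        self.x, self.w = x, w            # cache inputs for the backward pass
        return x @ w
    def backward(self, grad_out):
        # gradients w.r.t. both inputs, given the gradient w.r.t. the output
        return grad_out * self.w, grad_out * self.x

class SigmoidOp:
    def forward(self, u):
        self.out = 1.0 / (1.0 + np.exp(-u))
        return self.out
    def backward(self, grad_out):
        return grad_out * self.out * (1.0 - self.out)   # sigma'(u) = sigma(u)(1 - sigma(u))

# chaining the two ops reproduces backprop for y_hat = σ(x^T w)
x, w = np.array([1.0, -2.0]), np.array([0.5, 0.25])
dot, sig = DotOp(), SigmoidOp()
y_hat = sig.forward(dot.forward(x, w))
grad_x, grad_w = dot.backward(sig.backward(1.0))        # ∇x y_hat and ∇w y_hat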

Computational Graph for a MLP

Figure: Goodfellow et al.

To train we want to compute ∇W(1)J and ∇W(2)J

Two paths lead backwards from J to the weights: through the cross-entropy cost and through the regularization cost

The weight-decay cost is relatively simple: it will always contribute 2λW(i) to the gradient on W(i)
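
A small sketch of how the two paths combine for one weight matrix, assuming the regularizer is λ‖W(i)‖² (so its gradient is 2λW(i)); the cross-entropy gradient here is just a stand-in for whatever backpropagation returns:

import numpy as np

lam = 1e-3
W1 = np.random.randn(5, 4)

grad_ce = np.random.randn(5, 4)    # stand-in: gradient backpropagated along the cross-entropy path
grad_reg = 2 * lam * W1            # regularization path: always 2λ W(1)
grad_W1 = grad_ce + grad_reg       # the two backward paths simply add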

Symbol to Symbol

Figure: Goodfellow et al.

In this approach backpropagation never accesses any numerical values

Instead it just adds nodes to the graph that describe how to compute derivatives

A graph evaluation engine then does the actual computation

Approach taken by Theano and TensorFlow
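
A minimal sketch in the TensorFlow 1.x graph API that was current when this lecture was given (in TensorFlow 2.x the same calls live under tf.compat.v1): tf.gradients only adds derivative nodes to the graph, and the session does the numerical work:

import tensorflow as tf         # assumes the 1.x graph API (or tf.compat.v1)

x = tf.placeholder(tf.float32, shape=[])
y = tf.sin(x * x)                    # forward graph: y = sin(x²)
dy_dx, = tf.gradients(y, [x])        # symbol-to-symbol: new nodes computing dy/dx = cos(x²)·2x

with tf.Session() as sess:           # the graph evaluation engine does the actual computation
    print(sess.run([y, dy_dx], feed_dict={x: 1.5}))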

Next time

Regularization Methods for Deep Neural Networks
