Backpropagation - NTU Speech Processing Laboratory

Transcript of "Backpropagation", NTU ML 2016 lecture slides (speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2016/Lecture/BP.pdf), 19 pages.

Page 1:

Backpropagation

Page 2:

Gradient Descent

Network parameters: θ = {w1, w2, …, b1, b2, …}

Starting from initial parameters θ0, gradient descent repeatedly updates:

θ1 = θ0 − η∇L(θ0)   (compute ∇L(θ0))
θ2 = θ1 − η∇L(θ1)   (compute ∇L(θ1))
……

where the gradient collects the partial derivatives of the loss with respect to every parameter:

∇L = [∂L/∂w1, ∂L/∂w2, …, ∂L/∂b1, ∂L/∂b2, …]ᵀ

Millions of parameters ……

To compute the gradients efficiently, we use backpropagation.
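The update rule above can be sketched in a few lines of numpy. The quadratic loss in `grad_L` is a stand-in for illustration (an assumption, not from the slides); in a real network this function would return the backpropagated gradient over all weights and biases.

```python
import numpy as np

# theta holds all network parameters {w1, w2, ..., b1, b2, ...}
# flattened into one vector.

def grad_L(theta):
    # Hypothetical loss L(theta) = ||theta||^2 for illustration,
    # so grad L(theta) = 2*theta.
    return 2.0 * theta

def gradient_descent(theta0, eta=0.1, steps=100):
    theta = theta0.copy()
    for _ in range(steps):
        # theta^{i+1} = theta^i - eta * grad L(theta^i)
        theta = theta - eta * grad_L(theta)
    return theta

theta = gradient_descent(np.array([1.0, -2.0, 0.5]))
```

For this toy loss the iterates shrink geometrically toward the minimum at the origin.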

Page 3:

Chain Rule

Case 1: y = g(x), z = h(y)

Δx → Δy → Δz
dz/dx = (dz/dy)(dy/dx)

Case 2: x = g(s), y = h(s), z = k(x, y)

Δs affects z through both x and y:
dz/ds = (∂z/∂x)(dx/ds) + (∂z/∂y)(dy/ds)
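Both cases can be verified numerically against finite differences. The particular choices of g, h, and k below are illustrative assumptions, not from the slides:

```python
import math

# Case 1: y = g(x), z = h(y)  =>  dz/dx = (dz/dy)(dy/dx)
g = math.sin                 # y = sin(x), dy/dx = cos(x)
h = lambda y: y ** 2         # z = y^2,    dz/dy = 2y
x = 0.7
dz_dx = 2 * g(x) * math.cos(x)                      # chain rule
eps = 1e-6
dz_dx_num = (h(g(x + eps)) - h(g(x - eps))) / (2 * eps)  # finite difference
assert abs(dz_dx - dz_dx_num) < 1e-6

# Case 2: x = g(s), y = h(s), z = k(x, y)
# => dz/ds = (dz/dx)(dx/ds) + (dz/dy)(dy/ds)
s = 0.3
x_, y_ = math.exp(s), math.sin(s)        # x = e^s, y = sin(s)
z = lambda s: math.exp(s) * math.sin(s)  # z = k(x, y) = x * y
dz_ds = y_ * math.exp(s) + x_ * math.cos(s)          # chain rule, two paths
dz_ds_num = (z(s + eps) - z(s - eps)) / (2 * eps)
assert abs(dz_ds - dz_ds_num) < 1e-6
```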

Page 4:

Backpropagation

The total loss is the sum of the costs over all N training examples:

L(θ) = Σₙ₌₁ᴺ Cⁿ(θ)    ⇒    ∂L(θ)/∂w = Σₙ₌₁ᴺ ∂Cⁿ(θ)/∂w

For each example, the network NN(θ) maps the input xⁿ to an output yⁿ, and Cⁿ measures the distance between yⁿ and the target ŷⁿ. It therefore suffices to compute ∂Cⁿ/∂w for one example and sum over the data.

Page 5:

Backpropagation

Consider a weight w1 in the first layer, feeding a neuron whose activation-function input is

z = x1·w1 + x2·w2 + b

By the chain rule,

∂C/∂w = (∂z/∂w)(∂C/∂z)

Forward pass: compute ∂z/∂w for all parameters.
Backward pass: compute ∂C/∂z for all activation function inputs z.

Page 6:

Backpropagation – Forward pass

Compute ∂z/∂w for all parameters. With

z = x1·w1 + x2·w2 + b

we have

∂z/∂w1 = x1,  ∂z/∂w2 = x2

i.e. ∂z/∂w is simply the value of the input connected by the weight.

Page 7:

Backpropagation – Forward pass

[Figure: a worked numeric example. With input (1, −1), the first-layer activation-function inputs are 4 and −2, giving activations 0.98 and 0.12; the second layer then produces activations 0.86 and 0.11.]

Compute ∂z/∂w for all parameters: each derivative is just the value of the input connected by that weight, e.g.

∂z/∂w = −1,  ∂z/∂w = 0.12,  ∂z/∂w = 0.11

for weights whose connected inputs are −1, 0.12, and 0.11.
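The forward pass of this example can be reproduced in numpy. The exact weight layout below is reconstructed from the numbers scattered on the slide, so treat it as an assumption; the activations it produces do match the slide's 0.98, 0.12, 0.86, and 0.11:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, -1.0])        # slide input (1, -1)

W1 = np.array([[1.0, -2.0],      # weights into first hidden neuron (assumed layout)
               [-1.0, 1.0]])     # weights into second hidden neuron
b1 = np.array([1.0, 0.0])
z1 = W1 @ x + b1                 # activation-function inputs: [4, -2]
a1 = sigmoid(z1)                 # ~[0.98, 0.12]

W2 = np.array([[2.0, -1.0],
               [-2.0, -1.0]])
b2 = np.array([0.0, 0.0])
z2 = W2 @ a1 + b2
a2 = sigmoid(z2)                 # ~[0.86, 0.11]

# dz/dw for any weight is the value of the input connected by that weight,
# so the forward pass already yields every dz/dw: x for layer 1, a1 for layer 2.
```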

Page 8:

Backpropagation – Backward pass

Compute Τ𝜕𝐶 𝜕𝑧 for all activation function inputs z

Consider a neuron with z = x1·w1 + x2·w2 + b and activation a = σ(z), feeding the next layer's inputs z′ and z′′ through weights w3 and w4. By the chain rule,

∂C/∂z = (∂a/∂z)(∂C/∂a),  where ∂a/∂z = σ′(z) since a = σ(z).

Page 9:

Backpropagation – Backward pass

Compute Τ𝜕𝐶 𝜕𝑧 for all activation function inputs z

∂C/∂z = (∂a/∂z)(∂C/∂a) = σ′(z)(∂C/∂a),  where a = σ(z).

[Figure: the sigmoid σ(z) and its derivative σ′(z).]
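A useful property implicit in the figure: σ′(z) = σ(z)(1 − σ(z)), so the backward pass can reuse the activation a already computed in the forward pass instead of re-evaluating anything. A quick numeric check:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z)): computable from a alone.
    a = sigmoid(z)
    return a * (1.0 - a)

z = np.linspace(-4, 4, 9)
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
assert np.allclose(sigmoid_prime(z), numeric, atol=1e-8)
```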

Page 10:

Backpropagation – Backward pass

Compute Τ𝜕𝐶 𝜕𝑧 for all activation function inputs z

∂C/∂z = (∂a/∂z)(∂C/∂a), with a = σ(z). The activation a influences C through both z′ = a·w3 + ⋯ and z′′ = a·w4 + ⋯, so by the chain rule

∂C/∂a = (∂z′/∂a)(∂C/∂z′) + (∂z′′/∂a)(∂C/∂z′′) = w3(∂C/∂z′) + w4(∂C/∂z′′)

For now, assume ∂C/∂z′ and ∂C/∂z′′ are known.

Page 11:

Backpropagation – Backward pass

Compute ∂C/∂z for all activation function inputs z

Combining the two factors:

∂C/∂z = σ′(z) [ w3(∂C/∂z′) + w4(∂C/∂z′′) ]

Page 12:

Backpropagation – Backward pass

∂C/∂z = σ′(z) [ w3(∂C/∂z′) + w4(∂C/∂z′′) ]

This formula can be read as a neuron working in the reverse direction: ∂C/∂z′ and ∂C/∂z′′ come in through weights w3 and w4, and the sum is amplified by σ′(z). σ′(z) is a constant because z is already determined in the forward pass.
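A minimal sketch of this reversed neuron, with illustrative (assumed) values for z, the weights, and the incoming derivatives:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    a = sigmoid(z)
    return a * (1.0 - a)

z = 0.5                      # fixed by the forward pass (assumed value)
w3, w4 = 2.0, -1.0           # weights to the next layer (assumed values)
dC_dz_next = (0.3, 0.7)      # dC/dz' and dC/dz'', assumed known for now

# The reversed neuron: weighted sum of incoming derivatives,
# amplified by the constant sigma'(z).
dC_dz = sigmoid_prime(z) * (w3 * dC_dz_next[0] + w4 * dC_dz_next[1])
```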

Page 13:

Backpropagation – Backward pass

Compute ∂C/∂z for all activation function inputs z

Case 1. Output Layer: z′ and z′′ directly produce the network outputs y1 and y2. Then

∂C/∂z′ = (∂y1/∂z′)(∂C/∂y1)
∂C/∂z′′ = (∂y2/∂z′′)(∂C/∂y2)

Both factors follow directly from the activation function and the cost. Done!
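As a concrete instance of the output-layer case (assuming a sigmoid output unit and a squared-error cost, both of which the slides leave generic):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z_out = 1.2                      # output-layer activation input (assumed value)
yhat = 1.0                       # target (assumed value)
y = sigmoid(z_out)

# dC/dz' = (dy/dz') * (dC/dy) = sigma'(z') * 2*(y - yhat)
dC_dz = (y * (1 - y)) * 2 * (y - yhat)

# Finite-difference check of the same derivative.
eps = 1e-6
C = lambda z: (sigmoid(z) - yhat) ** 2
dC_dz_num = (C(z_out + eps) - C(z_out - eps)) / (2 * eps)
assert abs(dC_dz - dC_dz_num) < 1e-8
```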

Page 14:

Backpropagation – Backward pass

Compute ∂C/∂z for all activation function inputs z

Case 2. Not Output Layer: z′ and z′′ feed further hidden neurons, so ∂C/∂z′ and ∂C/∂z′′ are not yet known ……

Page 15:

Backpropagation – Backward pass

Compute ∂C/∂z for all activation function inputs z

Case 2. Not Output Layer: suppose a′ = σ(z′) feeds the next layer's inputs za and zb through weights w5 and w6. If ∂C/∂za and ∂C/∂zb were known, ∂C/∂z′ could be computed exactly as before.

Page 16:

Backpropagation – Backward pass

Compute ∂C/∂z for all activation function inputs z

Case 2. Not Output Layer:

∂C/∂z′ = σ′(z′) [ w5(∂C/∂za) + w6(∂C/∂zb) ]

Compute ∂C/∂z recursively, until we reach the output layer ……

Page 17:

Backpropagation – Backward Pass

Compute Τ𝜕𝐶 𝜕𝑧 from the output layer

Compute Τ𝜕𝐶 𝜕𝑧 for all activation function inputs z

[Figure: a network with inputs x1, x2, activation-function inputs z1 … z6, and outputs y1, y2. ∂C/∂z5 and ∂C/∂z6 are computed first at the output layer; from them follow ∂C/∂z3 and ∂C/∂z4, and then ∂C/∂z1 and ∂C/∂z2.]

Page 18:

Backpropagation – Backward Pass

Compute Τ𝜕𝐶 𝜕𝑧 from the output layer

Compute Τ𝜕𝐶 𝜕𝑧 for all activation function inputs z

[Figure: the same network drawn in reverse. ∂C/∂z5 and ∂C/∂z6 flow backward from the outputs y1, y2 through the weights, and each reversed neuron amplifies its sum by σ′(z1), σ′(z2), σ′(z3), or σ′(z4), yielding ∂C/∂z4, ∂C/∂z3, ∂C/∂z2, and ∂C/∂z1.]

Page 19:

Backpropagation – Summary

Forward Pass: for every weight w, compute a = ∂z/∂w (the value of the input connected by w).
Backward Pass: for every activation-function input z, compute ∂C/∂z.

Multiply the two:

∂C/∂w = a × (∂C/∂z),  for all w.
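The whole procedure can be sketched end to end for a two-layer sigmoid network. The squared-error cost and the random weights are assumptions for illustration; the analytic gradient is checked against a finite difference:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = np.array([1.0, -1.0])
yhat = np.array([1.0, 0.0])                      # assumed target
W1, b1 = rng.normal(size=(2, 2)), rng.normal(size=2)
W2, b2 = rng.normal(size=(2, 2)), rng.normal(size=2)

# Forward pass: store a (= dz/dw for the weights each activation feeds).
z1 = W1 @ x + b1; a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; y = sigmoid(z2)
C = np.sum((y - yhat) ** 2)

# Backward pass: dC/dz at the output layer, then propagate inward.
dC_dz2 = (y * (1 - y)) * 2 * (y - yhat)          # Case 1: output layer
dC_dz1 = (a1 * (1 - a1)) * (W2.T @ dC_dz2)       # Case 2: recursive step

# dC/dw = a * dC/dz: pair each connected input with each dC/dz.
dC_dW2 = np.outer(dC_dz2, a1)
dC_dW1 = np.outer(dC_dz1, x)

# Finite-difference check on one weight, W1[0, 1].
eps = 1e-6
W1p = W1.copy(); W1p[0, 1] += eps
Cp = np.sum((sigmoid(W2 @ sigmoid(W1p @ x + b1) + b2) - yhat) ** 2)
W1m = W1.copy(); W1m[0, 1] -= eps
Cm = np.sum((sigmoid(W2 @ sigmoid(W1m @ x + b1) + b2) - yhat) ** 2)
assert abs(dC_dW1[0, 1] - (Cp - Cm) / (2 * eps)) < 1e-6
```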