
Lecture 3 recap


Beyond linear
• Linear score function f = Wx

(Figures on CIFAR-10 and on ImageNet; credit: Li/Karpathy/Johnson)


Neural Network

(Figure; credit: Li/Karpathy/Johnson)

Neural Network

(Figure: the depth and width of a network)

Neural Network
• Linear score function f = Wx
• A neural network is a nesting of 'functions' (see the sketch below):
  – 2 layers: f = W₂ max(0, W₁x)
  – 3 layers: f = W₃ max(0, W₂ max(0, W₁x))
  – 4 layers: f = W₄ tanh(W₃ max(0, W₂ max(0, W₁x)))
  – 5 layers: f = W₅ σ(W₄ tanh(W₃ max(0, W₂ max(0, W₁x))))
  – … up to hundreds of layers
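A minimal NumPy sketch of this nesting for the 3-layer case; the layer sizes and random weights are made up for illustration, only the structure f = W₃ max(0, W₂ max(0, W₁x)) comes from the slide:

```python
import numpy as np

rng = np.random.default_rng(0)
x  = rng.standard_normal(4)        # input vector (4 features, assumed)
W1 = rng.standard_normal((8, 4))   # 1st-layer weights
W2 = rng.standard_normal((8, 8))   # 2nd-layer weights
W3 = rng.standard_normal((3, 8))   # 3rd-layer weights -> 3 output scores

relu = lambda a: np.maximum(0, a)  # max(0, ·), applied element-wise

# 3-layer network: f = W3 max(0, W2 max(0, W1 x)), evaluated inside out
f = W3 @ relu(W2 @ relu(W1 @ x))
print(f)                           # three scores
```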


Computational Graphs
• A neural network is a computational graph:
  – It has compute nodes.
  – It has edges that connect nodes.
  – It is directional.
  – It is organized in 'layers'.


Backprop (cont'd)

The importance of gradients
• All optimization schemes are based on computing gradients.
• One can compute gradients analytically, but what if our function is too complex?
• Break the gradient computation down into small steps → backpropagation [Rumelhart et al. 1986].

Backprop: Forward Pass
• f(x, y, z) = (x + y) · z, with initialization x = 1, y = −3, z = 4
• Forward pass through the graph: the sum node computes d = x + y = −2, then the mult node computes f = d · z = −8
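The forward pass from the slide, written out in a few lines of Python (just the two compute nodes):

```python
# Forward pass through the graph f(x, y, z) = (x + y) * z
x, y, z = 1.0, -3.0, 4.0   # initialization from the slide

d = x + y                  # sum node:  d = -2
f = d * z                  # mult node: f = -8
print(d, f)                # -2.0 -8.0
```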

Backprop: Backward Pass
• f(x, y, z) = (x + y) · z, with x = 1, y = −3, z = 4, so d = x + y = −2 and f = d · z = −8
• Local derivatives of the nodes:
  – d = x + y: ∂d/∂x = 1, ∂d/∂y = 1
  – f = d · z: ∂f/∂d = z, ∂f/∂z = d
• What are ∂f/∂x, ∂f/∂y, ∂f/∂z? Work backwards through the graph:
  – Start at the output: ∂f/∂f = 1
  – ∂f/∂z = d = −2
  – ∂f/∂d = z = 4
  – Chain rule: ∂f/∂y = ∂f/∂d · ∂d/∂y = 4 · 1 = 4
  – Chain rule: ∂f/∂x = ∂f/∂d · ∂d/∂x = 4 · 1 = 4
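The same backward pass worked out by hand in Python (a minimal sketch; the variable names are mine):

```python
# Forward pass: f(x, y, z) = (x + y) * z
x, y, z = 1.0, -3.0, 4.0
d = x + y                   # sum node
f = d * z                   # mult node

# Backward pass: multiply local derivatives along each path (chain rule)
df_df = 1.0                 # gradient of the output w.r.t. itself
df_dz = d * df_df           # ∂f/∂z = d              -> -2
df_dd = z * df_df           # ∂f/∂d = z              ->  4
df_dx = df_dd * 1.0         # ∂f/∂x = ∂f/∂d · ∂d/∂x  ->  4
df_dy = df_dd * 1.0         # ∂f/∂y = ∂f/∂d · ∂d/∂y  ->  4
print(df_dx, df_dy, df_dz)  # 4.0 4.0 -2.0
```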

Compute Graphs -> Neural Networks
• x_k: input variables
• w_{l,m,n}: network weights (note the 3 indices)
  – l: which layer
  – m: which neuron in the layer
  – n: which weight of that neuron
• y_i: computed outputs (i indexes the output dimension, up to n_out)
• t_i: ground-truth targets
• L: loss function

Compute Graphs -> Neural Networks

(Figure: a single-output compute graph. The inputs x₀, x₁ are multiplied by the weights w₀, w₁ (the unknowns!), summed, the target t₀ (e.g., a class label / regression target) is subtracted, and the result is squared: an L2 loss/cost.)

• We want to compute gradients w.r.t. all weights w.

Compute Graphs -> Neural Networks

(Figure: the same compute graph with a ReLU activation max(0, x) inserted after the weighted sum. "Btw., I'm not arguing this is the right choice here.")

• We want to compute gradients w.r.t. all weights w.

Compute Graphs -> Neural Networks

(Figure: the same construction with three outputs y₀, y₁, y₂ and targets t₀, t₁, t₂; each output y_i has its own weights w_{i,0}, w_{i,1} and its own L2 loss/cost node.)

• We want to compute gradients w.r.t. all weights w.

Compute Graphs -> Neural Networks
• General form of one layer: y_i = A(b_i + Σ_k x_k w_{i,k}), where A is the activation function and b_i the bias
• L2 loss: simply sum up the squares, L_i = (y_i − t_i)², L = Σ_i L_i; the energy to minimize is E = L
• We want to compute gradients w.r.t. all weights w *and* biases b:
  ∂L/∂w_{i,k} = ∂L/∂y_i · ∂y_i/∂w_{i,k}
  -> use the chain rule to compute the partials (sketched in code below)
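A sketch of these formulas for one layer, assuming A(x) = max(0, x) as on the following slides; the sizes, random weights, and targets below are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(3)            # inputs x_k
W = rng.standard_normal((2, 3))       # weights w_{i,k}: 2 outputs, 3 inputs
b = rng.standard_normal(2)            # biases b_i
t = np.array([1.0, -1.0])             # targets t_i

s = b + W @ x                         # pre-activation b_i + Σ_k x_k w_{i,k}
y = np.maximum(0, s)                  # y_i = A(...), here A = ReLU
L = np.sum((y - t) ** 2)              # L = Σ_i (y_i - t_i)²

# Chain rule, ∂L/∂w_{i,k} = ∂L/∂y_i · ∂y_i/∂w_{i,k}
# (with the ReLU derivative written out explicitly)
dL_dy = 2 * (y - t)                   # ∂L/∂y_i
dy_ds = (s > 0).astype(float)         # derivative of the ReLU
dL_dW = (dL_dy * dy_ds)[:, None] * x[None, :]   # ∂L/∂w_{i,k}
dL_db = dL_dy * dy_ds                 # ∂L/∂b_i
```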

Gradient Descent

(Figure: a loss surface with the initialization and the optimum marked.)

• From derivative to gradient: the gradient points in the direction of greatest increase of the function
• Gradient descent therefore steps in the direction of the negative gradient, scaled by the learning rate

Gradient Descent for Neural Networks
• For a given training pair {x, t}, we want to update all weights; i.e., we need to compute the derivatives w.r.t. all weights:
  ∇_w f_{x,t}(w) = [∂f/∂w_{0,0,0}, …, ∂f/∂w_{l,m,n}]ᵀ   (l indexes layers, m neurons, n weights)
• Gradient step: w′ = w − ε ∇_w f_{x,t}(w)   (a minimal version in code below)
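A minimal sketch of the gradient step, treating all weights as one flat vector; the gradient here is just a placeholder, in practice it comes from backpropagation:

```python
import numpy as np

def gradient_step(w, grad_w, lr=0.01):
    """One gradient-descent step: w' = w - ε ∇_w f(w)."""
    return w - lr * grad_w

w = np.zeros(10)                       # all weights, flattened (illustrative)
grad_w = np.ones(10)                   # placeholder gradient from backprop
w = gradient_step(w, grad_w, lr=0.1)   # move against the gradient
```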

Gradient Descent for Neural Networks

(Figure: a two-layer network with inputs x₀, x₁, x₂, hidden neurons h₀, …, h₃, outputs y₀, y₁, and targets t₀, t₁.)

• Hidden layer: h_j = A(b_{0,j} + Σ_k x_k w_{0,j,k})
• Output layer: y_i = A(b_{1,i} + Σ_j h_j w_{1,i,j})
• Per-output loss: L_i = (y_i − t_i)²
• The activation is just the simple A(x) = max(0, x)
• The gradient now also contains the biases:
  ∇_w f_{x,t}(w) = [∂f/∂w_{0,0,0}, …, ∂f/∂w_{l,m,n}, …, ∂f/∂b_{l,m}]ᵀ

Gradient Descent for Neural Networks
• Backpropagation: just go through the network layer by layer, using
  h_j = A(b_{0,j} + Σ_k x_k w_{0,j,k}), y_i = A(b_{1,i} + Σ_j h_j w_{1,i,j}), L_i = (y_i − t_i)²
• Output-layer weights: ∂L/∂w_{1,i,j} = ∂L/∂y_i · ∂y_i/∂w_{1,i,j}
• Hidden-layer weights: ∂L/∂w_{0,j,k} = ∂L/∂y_i · ∂y_i/∂h_j · ∂h_j/∂w_{0,j,k} (summed over the outputs i that h_j feeds into)
• The individual pieces: ∂L_i/∂y_i = 2(y_i − t_i), and ∂y_i/∂w_{1,i,j} = h_j if the pre-activation is > 0, else 0
• … (a worked NumPy sketch follows below)
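A NumPy sketch of this layer-by-layer backward pass for the drawn 3-4-2 network; the input, targets, and random initialization are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(3)                           # inputs x_k
t = np.array([0.5, -0.5])                            # targets t_i
W0, b0 = rng.standard_normal((4, 3)), np.zeros(4)    # hidden layer: w_{0,j,k}, b_{0,j}
W1, b1 = rng.standard_normal((2, 4)), np.zeros(2)    # output layer: w_{1,i,j}, b_{1,i}

# Forward pass with A(x) = max(0, x)
s0 = b0 + W0 @ x
h  = np.maximum(0, s0)                               # h_j
s1 = b1 + W1 @ h
y  = np.maximum(0, s1)                               # y_i
L  = np.sum((y - t) ** 2)                            # L = Σ_i (y_i - t_i)²

# Backward pass, layer by layer
dL_dy  = 2 * (y - t)                                 # ∂L/∂y_i
dL_ds1 = dL_dy * (s1 > 0)                            # through the output ReLU
dL_dW1 = np.outer(dL_ds1, h)                         # ∂L/∂w_{1,i,j}
dL_db1 = dL_ds1                                      # ∂L/∂b_{1,i}
dL_dh  = W1.T @ dL_ds1                               # sum over outputs i fed by h_j
dL_ds0 = dL_dh * (s0 > 0)                            # through the hidden ReLU
dL_dW0 = np.outer(dL_ds0, x)                         # ∂L/∂w_{0,j,k}
dL_db0 = dL_ds0                                      # ∂L/∂b_{0,j}
```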

Gradient Descent for Neural Networks
• How many unknown weights does this 3-4-2 network have?
  – Output layer: 2 · 4 + 2
  – Hidden layer: 4 · 3 + 4 (#neurons · #input channels + #biases)
• Note that some activation functions also have weights.

Derivatives of Cross Entropy Loss
• The derivations follow [Sadowski].
• The slides walk through the gradients of the weights of the last layer and of the first layer (equations not reproduced in this transcript).
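The slide equations are not reproduced above; as a standard reference point (and not necessarily in [Sadowski]'s notation), for a softmax output with cross-entropy loss the gradient w.r.t. the last layer's pre-activations is simply p − t, which then gives the last-layer weight gradients via an outer product with that layer's inputs. The values below are made up:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())            # shift for numerical stability
    return e / e.sum()

h = np.array([0.2, 1.0, -0.5])         # activations feeding the last layer
W = np.zeros((4, 3))                   # last-layer weights, 4 classes
t = np.array([0.0, 1.0, 0.0, 0.0])     # one-hot target

z = W @ h                              # logits
p = softmax(z)                         # predicted class probabilities
loss = -np.sum(t * np.log(p))          # cross-entropy loss

dL_dz = p - t                          # softmax + cross-entropy gradient
dL_dW = np.outer(dL_dz, h)             # gradients of the last-layer weights
```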

Derivatives of Neural Networks
• Combining nodes: linear activation node + hinge loss + regularization
  (Credit: Li/Karpathy/Johnson)
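As a concrete instance of such a combination (a sketch in the spirit of the credited material, not its exact code): a multiclass hinge (SVM) loss on linear scores plus L2 regularization, together with its gradient:

```python
import numpy as np

def svm_loss_grad(W, x, y, reg=1e-3, delta=1.0):
    """Hinge loss L = Σ_{j≠y} max(0, s_j - s_y + Δ) + reg·||W||²
    for one example x with correct class y, and its gradient dL/dW."""
    s = W @ x                                  # linear class scores
    margins = np.maximum(0, s - s[y] + delta)  # hinge terms
    margins[y] = 0                             # the true class contributes nothing
    loss = margins.sum() + reg * np.sum(W * W)

    dW = np.zeros_like(W)
    active = (margins > 0).astype(float)       # classes violating the margin
    dW += np.outer(active, x)                  # +x for every violating class
    dW[y] -= active.sum() * x                  # -x per violation for the true class
    dW += 2 * reg * W                          # gradient of the regularizer
    return loss, dW

W0 = np.random.default_rng(0).standard_normal((3, 4))
loss, dW = svm_loss_grad(W0, x=np.ones(4), y=1)
```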

Can become quite complex…

(Figure: the full compute graph unrolled over the layers l and the neurons m per layer.)

• These graphs can be huge!

The flow of the gradients

(Figure: activations flowing forward through an activation-function node, and gradients flowing backward through it.)

The flow of the gradients
• Many, many, many of these nodes form a neural network.
• Each one has its own work to do: the NEURONS each perform their part of the FORWARD AND BACKWARD PASS (a toy node is sketched below).
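One way to picture "each node has its own work to do" (a hypothetical toy gate, not code from the lecture): each node caches what it saw in the forward pass and uses it to route the gradient backward:

```python
class MultiplyGate:
    """A single compute node: forward produces the output,
    backward routes the incoming gradient to the inputs (chain rule)."""

    def forward(self, a, b):
        self.a, self.b = a, b                  # cache inputs for the backward pass
        return a * b

    def backward(self, d_out):
        return d_out * self.b, d_out * self.a  # ∂/∂a and ∂/∂b

gate = MultiplyGate()
f = gate.forward(-2.0, 4.0)   # e.g. the d·z node from the earlier example
dd, dz = gate.backward(1.0)   # gradients w.r.t. d and z: 4.0, -2.0
```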

Gradient descent


Gradient Descent (recap)

(Figure, repeated from above: a loss surface with the initialization and the optimum marked.)

• From derivative to gradient: the gradient is the direction of greatest increase of the function; gradient descent steps in the direction of the negative gradient, scaled by the learning rate.

Gradient Descent
• How do we pick a good learning rate?
• How do we compute the gradient for a single training pair?
• How do we compute the gradient for a large training set?
• How do we speed things up?

Next lecture
• This week:
  – No tutorial this Thursday (due to MaiTUM)
  – Exercise 1 will be released on Thursday as planned
• Next lecture on May 7th:
  – Optimization of Neural Networks
  – In particular, an introduction to SGD (our main method!)

See you next week!
