
Lecture 3 recap


Beyond linear
• Linear score function f = Wx

(Figures on CIFAR-10 and on ImageNet; credit: Li/Karpathy/Johnson)


Neural Network

(Figure; credit: Li/Karpathy/Johnson)

Neural Network

(Figure: the depth and width of a network)

Neural Network
• Linear score function f = Wx
• A neural network is a nesting of 'functions' (see the sketch below):
  – 2 layers: f = W₂ max(0, W₁x)
  – 3 layers: f = W₃ max(0, W₂ max(0, W₁x))
  – 4 layers: f = W₄ tanh(W₃ max(0, W₂ max(0, W₁x)))
  – 5 layers: f = W₅ σ(W₄ tanh(W₃ max(0, W₂ max(0, W₁x))))
  – … up to hundreds of layers
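A minimal NumPy sketch of this nesting for the 3-layer case; the layer sizes and random weights are made up for illustration, only the structure f = W₃ max(0, W₂ max(0, W₁x)) comes from the slide:

```python
import numpy as np

rng = np.random.default_rng(0)
x  = rng.standard_normal(4)        # input vector (4 features, assumed)
W1 = rng.standard_normal((8, 4))   # 1st-layer weights
W2 = rng.standard_normal((8, 8))   # 2nd-layer weights
W3 = rng.standard_normal((3, 8))   # 3rd-layer weights -> 3 output scores

relu = lambda a: np.maximum(0, a)  # max(0, ·), applied element-wise

# 3-layer network: f = W3 max(0, W2 max(0, W1 x)), evaluated inside out
f = W3 @ relu(W2 @ relu(W1 @ x))
print(f)                           # three scores
```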


Computational Graphs
• A neural network is a computational graph:
  – It has compute nodes.
  – It has edges that connect nodes.
  – It is directional.
  – It is organized in 'layers'.


Backprop (cont'd)

The importance of gradients
• All optimization schemes are based on computing gradients.
• One can compute gradients analytically, but what if our function is too complex?
• Break the gradient computation down into small steps → backpropagation [Rumelhart et al. 1986].

Backprop: Forward Pass
• f(x, y, z) = (x + y) · z, with initialization x = 1, y = −3, z = 4
• Forward pass through the graph: the sum node computes d = x + y = −2, then the mult node computes f = d · z = −8
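The forward pass from the slide, written out in a few lines of Python (just the two compute nodes):

```python
# Forward pass through the graph f(x, y, z) = (x + y) * z
x, y, z = 1.0, -3.0, 4.0   # initialization from the slide

d = x + y                  # sum node:  d = -2
f = d * z                  # mult node: f = -8
print(d, f)                # -2.0 -8.0
```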

Backprop: Backward Pass
• f(x, y, z) = (x + y) · z, with x = 1, y = −3, z = 4, so d = x + y = −2 and f = d · z = −8
• Local derivatives of the nodes:
  – d = x + y: ∂d/∂x = 1, ∂d/∂y = 1
  – f = d · z: ∂f/∂d = z, ∂f/∂z = d
• What are ∂f/∂x, ∂f/∂y, ∂f/∂z? Work backwards through the graph:
  – Start at the output: ∂f/∂f = 1
  – ∂f/∂z = d = −2
  – ∂f/∂d = z = 4
  – Chain rule: ∂f/∂y = ∂f/∂d · ∂d/∂y = 4 · 1 = 4
  – Chain rule: ∂f/∂x = ∂f/∂d · ∂d/∂x = 4 · 1 = 4
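The same backward pass worked out by hand in Python (a minimal sketch; the variable names are mine):

```python
# Forward pass: f(x, y, z) = (x + y) * z
x, y, z = 1.0, -3.0, 4.0
d = x + y                   # sum node
f = d * z                   # mult node

# Backward pass: multiply local derivatives along each path (chain rule)
df_df = 1.0                 # gradient of the output w.r.t. itself
df_dz = d * df_df           # ∂f/∂z = d              -> -2
df_dd = z * df_df           # ∂f/∂d = z              ->  4
df_dx = df_dd * 1.0         # ∂f/∂x = ∂f/∂d · ∂d/∂x  ->  4
df_dy = df_dd * 1.0         # ∂f/∂y = ∂f/∂d · ∂d/∂y  ->  4
print(df_dx, df_dy, df_dz)  # 4.0 4.0 -2.0
```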

Compute Graphs -> Neural Networks
• x_k: input variables
• w_{l,m,n}: network weights (note the 3 indices)
  – l: which layer
  – m: which neuron in the layer
  – n: which weight of that neuron
• y_i: computed outputs (i indexes the output dimension, up to n_out)
• t_i: ground-truth targets
• L: loss function

Compute Graphs -> Neural Networks

(Figure: a single-output compute graph. The inputs x₀, x₁ are multiplied by the weights w₀, w₁ (the unknowns!), summed, the target t₀ (e.g., a class label / regression target) is subtracted, and the result is squared: an L2 loss/cost.)

• We want to compute gradients w.r.t. all weights w.

Compute Graphs -> Neural Networks

(Figure: the same compute graph with a ReLU activation max(0, x) inserted after the weighted sum. "Btw., I'm not arguing this is the right choice here.")

• We want to compute gradients w.r.t. all weights w.

Compute Graphs -> Neural Networks

(Figure: the same construction with three outputs y₀, y₁, y₂ and targets t₀, t₁, t₂; each output y_i has its own weights w_{i,0}, w_{i,1} and its own L2 loss/cost node.)

• We want to compute gradients w.r.t. all weights w.

Compute Graphs -> Neural Networks
• General form of one layer: y_i = A(b_i + Σ_k x_k w_{i,k}), where A is the activation function and b_i the bias
• L2 loss: simply sum up the squares, L_i = (y_i − t_i)², L = Σ_i L_i; the energy to minimize is E = L
• We want to compute gradients w.r.t. all weights w *and* biases b:
  ∂L/∂w_{i,k} = ∂L/∂y_i · ∂y_i/∂w_{i,k}
  -> use the chain rule to compute the partials (sketched in code below)
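A sketch of these formulas for one layer, assuming A(x) = max(0, x) as on the following slides; the sizes, random weights, and targets below are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(3)            # inputs x_k
W = rng.standard_normal((2, 3))       # weights w_{i,k}: 2 outputs, 3 inputs
b = rng.standard_normal(2)            # biases b_i
t = np.array([1.0, -1.0])             # targets t_i

s = b + W @ x                         # pre-activation b_i + Σ_k x_k w_{i,k}
y = np.maximum(0, s)                  # y_i = A(...), here A = ReLU
L = np.sum((y - t) ** 2)              # L = Σ_i (y_i - t_i)²

# Chain rule, ∂L/∂w_{i,k} = ∂L/∂y_i · ∂y_i/∂w_{i,k}
# (with the ReLU derivative written out explicitly)
dL_dy = 2 * (y - t)                   # ∂L/∂y_i
dy_ds = (s > 0).astype(float)         # derivative of the ReLU
dL_dW = (dL_dy * dy_ds)[:, None] * x[None, :]   # ∂L/∂w_{i,k}
dL_db = dL_dy * dy_ds                 # ∂L/∂b_i
```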

Gradient Descent

(Figure: a loss surface with the initialization and the optimum marked.)

• From derivative to gradient: the gradient points in the direction of greatest increase of the function
• Gradient descent therefore steps in the direction of the negative gradient, scaled by the learning rate

Gradient Descent for Neural Networks
• For a given training pair {x, t}, we want to update all weights; i.e., we need to compute the derivatives w.r.t. all weights:
  ∇_w f_{x,t}(w) = [∂f/∂w_{0,0,0}, …, ∂f/∂w_{l,m,n}]ᵀ   (l indexes layers, m neurons, n weights)
• Gradient step: w′ = w − ε ∇_w f_{x,t}(w)   (a minimal version in code below)
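A minimal sketch of the gradient step, treating all weights as one flat vector; the gradient here is just a placeholder, in practice it comes from backpropagation:

```python
import numpy as np

def gradient_step(w, grad_w, lr=0.01):
    """One gradient-descent step: w' = w - ε ∇_w f(w)."""
    return w - lr * grad_w

w = np.zeros(10)                       # all weights, flattened (illustrative)
grad_w = np.ones(10)                   # placeholder gradient from backprop
w = gradient_step(w, grad_w, lr=0.1)   # move against the gradient
```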

Gradient Descent for Neural Networks

(Figure: a two-layer network with inputs x₀, x₁, x₂, hidden neurons h₀, …, h₃, outputs y₀, y₁, and targets t₀, t₁.)

• Hidden layer: h_j = A(b_{0,j} + Σ_k x_k w_{0,j,k})
• Output layer: y_i = A(b_{1,i} + Σ_j h_j w_{1,i,j})
• Per-output loss: L_i = (y_i − t_i)²
• The activation is just the simple A(x) = max(0, x)
• The gradient now also contains the biases:
  ∇_w f_{x,t}(w) = [∂f/∂w_{0,0,0}, …, ∂f/∂w_{l,m,n}, …, ∂f/∂b_{l,m}]ᵀ

Gradient Descent for Neural Networks
• Backpropagation: just go through the network layer by layer, using
  h_j = A(b_{0,j} + Σ_k x_k w_{0,j,k}), y_i = A(b_{1,i} + Σ_j h_j w_{1,i,j}), L_i = (y_i − t_i)²
• Output-layer weights: ∂L/∂w_{1,i,j} = ∂L/∂y_i · ∂y_i/∂w_{1,i,j}
• Hidden-layer weights: ∂L/∂w_{0,j,k} = ∂L/∂y_i · ∂y_i/∂h_j · ∂h_j/∂w_{0,j,k} (summed over the outputs i that h_j feeds into)
• The individual pieces: ∂L_i/∂y_i = 2(y_i − t_i), and ∂y_i/∂w_{1,i,j} = h_j if the pre-activation is > 0, else 0
• … (a worked NumPy sketch follows below)
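A NumPy sketch of this layer-by-layer backward pass for the drawn 3-4-2 network; the input, targets, and random initialization are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(3)                           # inputs x_k
t = np.array([0.5, -0.5])                            # targets t_i
W0, b0 = rng.standard_normal((4, 3)), np.zeros(4)    # hidden layer: w_{0,j,k}, b_{0,j}
W1, b1 = rng.standard_normal((2, 4)), np.zeros(2)    # output layer: w_{1,i,j}, b_{1,i}

# Forward pass with A(x) = max(0, x)
s0 = b0 + W0 @ x
h  = np.maximum(0, s0)                               # h_j
s1 = b1 + W1 @ h
y  = np.maximum(0, s1)                               # y_i
L  = np.sum((y - t) ** 2)                            # L = Σ_i (y_i - t_i)²

# Backward pass, layer by layer
dL_dy  = 2 * (y - t)                                 # ∂L/∂y_i
dL_ds1 = dL_dy * (s1 > 0)                            # through the output ReLU
dL_dW1 = np.outer(dL_ds1, h)                         # ∂L/∂w_{1,i,j}
dL_db1 = dL_ds1                                      # ∂L/∂b_{1,i}
dL_dh  = W1.T @ dL_ds1                               # sum over outputs i fed by h_j
dL_ds0 = dL_dh * (s0 > 0)                            # through the hidden ReLU
dL_dW0 = np.outer(dL_ds0, x)                         # ∂L/∂w_{0,j,k}
dL_db0 = dL_ds0                                      # ∂L/∂b_{0,j}
```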

Gradient Descent for Neural Networks
• How many unknown weights does this 3-4-2 network have?
  – Output layer: 2 · 4 + 2
  – Hidden layer: 4 · 3 + 4 (#neurons · #input channels + #biases)
• Note that some activation functions also have weights.

Derivatives of Cross Entropy Loss
• The derivations follow [Sadowski].
• The slides walk through the gradients of the weights of the last layer and of the first layer (equations not reproduced in this transcript).
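The slide equations are not reproduced above; as a standard reference point (and not necessarily in [Sadowski]'s notation), for a softmax output with cross-entropy loss the gradient w.r.t. the last layer's pre-activations is simply p − t, which then gives the last-layer weight gradients via an outer product with that layer's inputs. The values below are made up:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())            # shift for numerical stability
    return e / e.sum()

h = np.array([0.2, 1.0, -0.5])         # activations feeding the last layer
W = np.zeros((4, 3))                   # last-layer weights, 4 classes
t = np.array([0.0, 1.0, 0.0, 0.0])     # one-hot target

z = W @ h                              # logits
p = softmax(z)                         # predicted class probabilities
loss = -np.sum(t * np.log(p))          # cross-entropy loss

dL_dz = p - t                          # softmax + cross-entropy gradient
dL_dW = np.outer(dL_dz, h)             # gradients of the last-layer weights
```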

Derivatives of Neural Networks
• Combining nodes: linear activation node + hinge loss + regularization
  (Credit: Li/Karpathy/Johnson)
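As a concrete instance of such a combination (a sketch in the spirit of the credited material, not its exact code): a multiclass hinge (SVM) loss on linear scores plus L2 regularization, together with its gradient:

```python
import numpy as np

def svm_loss_grad(W, x, y, reg=1e-3, delta=1.0):
    """Hinge loss L = Σ_{j≠y} max(0, s_j - s_y + Δ) + reg·||W||²
    for one example x with correct class y, and its gradient dL/dW."""
    s = W @ x                                  # linear class scores
    margins = np.maximum(0, s - s[y] + delta)  # hinge terms
    margins[y] = 0                             # the true class contributes nothing
    loss = margins.sum() + reg * np.sum(W * W)

    dW = np.zeros_like(W)
    active = (margins > 0).astype(float)       # classes violating the margin
    dW += np.outer(active, x)                  # +x for every violating class
    dW[y] -= active.sum() * x                  # -x per violation for the true class
    dW += 2 * reg * W                          # gradient of the regularizer
    return loss, dW

W0 = np.random.default_rng(0).standard_normal((3, 4))
loss, dW = svm_loss_grad(W0, x=np.ones(4), y=1)
```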

Can become quite complex…

(Figure: the full compute graph unrolled over the layers l and the neurons m per layer.)

• These graphs can be huge!

The flow of the gradients

(Figure: activations flowing forward through an activation-function node, and gradients flowing backward through it.)

The flow of the gradients
• Many, many, many of these nodes form a neural network.
• Each one has its own work to do: the NEURONS each perform their part of the FORWARD AND BACKWARD PASS (a toy node is sketched below).
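One way to picture "each node has its own work to do" (a hypothetical toy gate, not code from the lecture): each node caches what it saw in the forward pass and uses it to route the gradient backward:

```python
class MultiplyGate:
    """A single compute node: forward produces the output,
    backward routes the incoming gradient to the inputs (chain rule)."""

    def forward(self, a, b):
        self.a, self.b = a, b                  # cache inputs for the backward pass
        return a * b

    def backward(self, d_out):
        return d_out * self.b, d_out * self.a  # ∂/∂a and ∂/∂b

gate = MultiplyGate()
f = gate.forward(-2.0, 4.0)   # e.g. the d·z node from the earlier example
dd, dz = gate.backward(1.0)   # gradients w.r.t. d and z: 4.0, -2.0
```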

Gradient descent


Gradient Descent (recap)

(Figure, repeated from above: a loss surface with the initialization and the optimum marked.)

• From derivative to gradient: the gradient is the direction of greatest increase of the function; gradient descent steps in the direction of the negative gradient, scaled by the learning rate.

Gradient Descent
• How do we pick a good learning rate?
• How do we compute the gradient for a single training pair?
• How do we compute the gradient for a large training set?
• How do we speed things up?

Next lecture
• This week:
  – No tutorial this Thursday (due to MaiTUM)
  – Exercise 1 will be released on Thursday as planned
• Next lecture on May 7th:
  – Optimization of Neural Networks
  – In particular, an introduction to SGD (our main method!)

See you next week!
