Fei-Fei Li & Justin Johnson & Serena Yeung
Lecture 4, April 12, 2018
Administrative
Assignment 1 due Wednesday April 18, 11:59pm
Administrative
All office hours this week will use QueueStatus
Where we are...

- scores function
- SVM loss
- data loss + regularization
- want: the gradient of the loss with respect to the weights
Optimization
Landscape image is CC0 1.0 public domain. Walking man image is CC0 1.0 public domain.
Numerical gradient: approximate :(, slow :(, easy to write :)
Analytic gradient: exact :), fast :), error-prone :(

In practice: derive the analytic gradient, then check your implementation with the numerical gradient (a gradient check).
Gradient descent
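The recipe above can be sketched in plain Python (the objective function and step size below are illustrative, not from the slides): a centered-difference numerical gradient checks the analytic gradient, then vanilla gradient descent follows the analytic one.

```python
def num_grad(f, x, h=1e-5):
    """Centered-difference numerical gradient of f at point x (list of floats)."""
    grad = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        grad.append((f(xp) - f(xm)) / (2 * h))
    return grad

# illustrative objective: f(x) = x0^2 + 3*x1^2, analytic gradient (2*x0, 6*x1)
f = lambda x: x[0] ** 2 + 3 * x[1] ** 2
analytic = lambda x: [2 * x[0], 6 * x[1]]

x = [2.0, -1.0]
for n, a in zip(num_grad(f, x), analytic(x)):
    assert abs(n - a) < 1e-6  # gradient check: analytic matches numerical

# vanilla gradient descent using the (checked) analytic gradient
step_size = 0.1
for _ in range(100):
    x = [xi - step_size * gi for xi, gi in zip(x, analytic(x))]
```

After 100 steps, x has converged to the minimum at the origin.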
Computational graphs

[Figure: computational graph for a linear classifier. Inputs x and W feed a multiply node (*) that produces the scores s; the hinge loss and the regularization term R are added to form the total loss L.]
Convolutional network (AlexNet)

[Figure: the same idea at scale. The graph runs from the input image, through the weights of each layer, to the loss.]
Neural Turing Machine
Figure reproduced with permission from a Twitter post by Andrej Karpathy.
Backpropagation: a simple example

f(x, y, z) = (x + y) z
e.g. x = -2, y = 5, z = -4

Want: ∂f/∂x, ∂f/∂y, ∂f/∂z

Introduce the intermediate q = x + y, so f = q z.
Forward pass: q = 3, f = -12.
Local gradients: ∂f/∂q = z, ∂f/∂z = q, ∂q/∂x = 1, ∂q/∂y = 1.

Chain rule: ∂f/∂x = (∂f/∂q)(∂q/∂x), i.e. at each node [downstream gradient] = [upstream gradient] × [local gradient].

Backward pass (right to left): ∂f/∂f = 1, ∂f/∂z = q = 3, ∂f/∂q = z = -4, ∂f/∂x = -4, ∂f/∂y = -4.
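The worked example translates directly into code; the variable names mirror the graph (q = x + y, f = q·z):

```python
x, y, z = -2.0, 5.0, -4.0

# forward pass
q = x + y                 # q = 3
f = q * z                 # f = -12

# backward pass: [downstream] = [upstream] * [local]
df_df = 1.0
df_dz = q * df_df         # local gradient of f = q*z w.r.t. z is q -> 3
df_dq = z * df_df         # local gradient w.r.t. q is z -> -4
df_dx = 1.0 * df_dq       # local gradient of q = x+y w.r.t. x is 1 -> -4
df_dy = 1.0 * df_dq       # -> -4
```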
[Figure: a single node f with inputs x, y and output z. During the forward pass the node computes z = f(x, y). During the backward pass it receives the upstream gradient ∂L/∂z and multiplies it by its “local gradients” ∂z/∂x and ∂z/∂y to produce the gradients it passes back to its inputs.]
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - April 12, 201835
Another example:
Upstream gradient
Upstream gradient
Localgradient
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - April 12, 201837
Another example:
Upstream gradient
Localgradient
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - April 12, 201839
Another example:
Upstream gradient
Localgradient
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - April 12, 201841
Another example:
Upstream gradient
Localgradient
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - April 12, 201843
Another example:
[upstream gradient] x [local gradient][0.2] x [1] = 0.2[0.2] x [1] = 0.2 (both inputs!)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - April 12, 201845
Another example:
[upstream gradient] x [local gradient]x0: [0.2] x [2] = 0.4w0: [0.2] x [-1] = -0.2
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 4 - April 12, 201846
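The same backward pass in code. The concrete input values (w0 = 2, x0 = -1, w1 = -3, x1 = -2, w2 = -3) are assumed here, inferred from the printed gradients:

```python
import math

# input values assumed from the printed gradients above
w0, x0 = 2.0, -1.0
w1, x1 = -3.0, -2.0
w2 = -3.0

# forward pass
dot = w0 * x0 + w1 * x1 + w2       # = 1.0
f = 1.0 / (1.0 + math.exp(-dot))   # sigmoid output, ~0.73

# backward pass: sigmoid local gradient is (1 - f) * f
ddot = 1.0 * (1 - f) * f           # upstream gradient on f is 1.00 -> ~0.2
dx0 = w0 * ddot                    # mul gate switches inputs -> ~0.4
dw0 = x0 * ddot                    # -> ~-0.2
dx1 = w1 * ddot
dw1 = x1 * ddot
dw2 = 1.0 * ddot                   # add gate distributes
```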
Sigmoid function: σ(x) = 1 / (1 + e^-x), with dσ/dx = (1 - σ(x)) σ(x).

Computational graph representation may not be unique. Choose one where local gradients at each node can be easily expressed!

Collapsing the whole expression into a single sigmoid gate, the backward pass is one step: [upstream gradient] × [local gradient] = [1.00] × [(1 - 0.73)(0.73)] = 0.2.
Patterns in backward flow

add gate: gradient distributor
Q: What is a max gate?
max gate: gradient router
Q: What is a mul gate?
mul gate: gradient switcher
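The three patterns as minimal scalar backward functions (an illustrative sketch, not library code): add distributes the upstream gradient unchanged, max routes it entirely to the winning input, and mul swaps the inputs as local gradients.

```python
def add_backward(dout):
    # gradient distributor: both inputs receive the upstream gradient
    return dout, dout

def max_backward(x, y, dout):
    # gradient router: only the larger input receives the gradient
    return (dout, 0.0) if x > y else (0.0, dout)

def mul_backward(x, y, dout):
    # gradient switcher: each input's local gradient is the *other* input
    return y * dout, x * dout
```

For example, mul_backward(3.0, -4.0, 2.0) returns (-8.0, 6.0): each input's gradient is scaled by the value of the other input.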
Gradients for vectorized code

(x, y, z are now vectors.) The “local gradient” of a node f is now the Jacobian matrix (the derivative of each element of z w.r.t. each element of x).
Vectorized operations

f(x) = max(0, x) (elementwise), mapping a 4096-d input vector to a 4096-d output vector.

Q: what is the size of the Jacobian matrix? [4096 x 4096!]

In practice we process an entire minibatch (e.g. 100 examples) at a time, i.e. the Jacobian would technically be a [409,600 x 409,600] matrix :\

Q2: what does it look like? Since the function is elementwise, each output depends only on the corresponding input, so the Jacobian is diagonal; we never form it explicitly.
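Because the Jacobian of an elementwise max(0, x) is diagonal, a backward pass never materializes it; it simply masks the upstream gradient:

```python
import numpy as np

def relu_forward(x):
    return np.maximum(0, x)

def relu_backward(x, dout):
    # implicit diagonal Jacobian: pass the gradient through only where x > 0
    return dout * (x > 0)

x = np.array([1.0, -2.0, 3.0, -0.5])
dout = np.array([0.1, 0.2, 0.3, 0.4])
dx = relu_backward(x, dout)   # gradient is zeroed where the input was negative
```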
A vectorized example: f(x, W) = ‖W·x‖², i.e. the sum of the squared elements of q = W·x.

Always check: the gradient with respect to a variable should have the same shape as the variable.
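A sketch of this vectorized example, assuming f(x, W) = ‖W·x‖² with intermediate q = W·x (the small shapes are illustrative). The backward pass gives dW = 2q·xᵀ and dx = Wᵀ·2q; note each gradient matches the shape of its variable, and one entry is checked numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((2, 3))
x = rng.standard_normal((3, 1))

# forward: q = W x, f = ||q||^2
q = W @ x
f = float((q ** 2).sum())

# backward: df/dq = 2q, then chain rule through q = W x
dq = 2 * q
dW = dq @ x.T   # shape (2, 3), same as W
dx = W.T @ dq   # shape (3, 1), same as x

# numerical check of one entry of dW
h = 1e-5
Wp, Wm = W.copy(), W.copy()
Wp[0, 1] += h
Wm[0, 1] -= h
num = (((Wp @ x) ** 2).sum() - ((Wm @ x) ** 2).sum()) / (2 * h)
assert abs(num - dW[0, 1]) < 1e-4
```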
In discussion section: A matrix example...
Modularized implementation: forward / backward API
Graph (or Net) object (rough pseudo code)
Modularized implementation: forward / backward API

(x, y, z are scalars.) Example: a multiply gate z = x * y. forward computes z from x and y and caches the inputs; backward takes the upstream gradient variable dz and multiplies it by the local gradients: dx = y * dz, dy = x * dz.
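The slide's pseudocode did not survive extraction; below is a minimal runnable sketch of the same forward/backward API, hard-wiring the (x + y)·z graph from earlier (class and variable names are illustrative):

```python
class AddGate:
    def forward(self, x, y):
        return x + y

    def backward(self, dz):
        return dz, dz                    # add gate: gradient distributor

class MultiplyGate:
    def forward(self, x, y):
        self.x, self.y = x, y            # cache inputs needed by backward
        return x * y

    def backward(self, dz):
        # [downstream] = [upstream dz] * [local gradient]
        return self.y * dz, self.x * dz

class Graph:
    """Hard-wired graph for f = (x + y) * z, showing the forward/backward API."""
    def __init__(self):
        self.add, self.mul = AddGate(), MultiplyGate()

    def forward(self, x, y, z):
        q = self.add.forward(x, y)       # gates visited in topological order
        return self.mul.forward(q, z)

    def backward(self):
        dq, dz = self.mul.backward(1.0)  # reverse topological order
        dx, dy = self.add.backward(dq)
        return dx, dy, dz

g = Graph()
loss = g.forward(-2.0, 5.0, -4.0)
grads = g.backward()
```

A real implementation would keep the gates in a topologically sorted list and loop over it forward, then in reverse, rather than hard-wiring the order.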
Example: Caffe layers
Caffe is licensed under BSD 2-Clause
Caffe Sigmoid Layer: the backward pass computes the local sigmoid gradient and multiplies it by top_diff (chain rule).
In Assignment 1: Writing SVM / Softmax

Stage your forward/backward computation! E.g. for the SVM, compute the margins as an intermediate in the forward pass, then reuse them in the backward pass.
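One way the staging can look for the multiclass SVM loss on a single example (function and variable names are illustrative, not the assignment's required code):

```python
import numpy as np

def svm_loss_staged(scores, y):
    """Multiclass SVM loss for one example; scores shape (C,), y = correct class."""
    # forward pass, staged: `margins` is the cached intermediate
    margins = np.maximum(0, scores - scores[y] + 1.0)
    margins[y] = 0.0
    loss = margins.sum()

    # backward pass reuses `margins` instead of recomputing it
    dscores = (margins > 0).astype(float)
    dscores[y] = -dscores.sum()
    return loss, dscores

scores = np.array([3.2, 5.1, -1.7])
loss, dscores = svm_loss_staged(scores, y=0)
```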
Summary so far...

● neural nets will be very large: impractical to write down gradient formulas by hand for all parameters
● backpropagation = recursive application of the chain rule along a computational graph to compute the gradients of all inputs/parameters/intermediates
● implementations maintain a graph structure, where the nodes implement the forward() / backward() API
● forward: compute the result of an operation and save any intermediates needed for gradient computation in memory
● backward: apply the chain rule to compute the gradient of the loss function with respect to the inputs
Neural networks: without the brain stuff

(Before) Linear score function: f = W·x
(Now) 2-layer Neural Network: f = W2 max(0, W1 x)
or 3-layer Neural Network: f = W3 max(0, W2 max(0, W1 x))

x → h → s, e.g. x is 3072-d (the input image), h is 100-d (the hidden layer), and s is 10-d (the class scores); W1 and W2 are the learned weight matrices.
Full implementation of training a 2-layer Neural Network needs ~20 lines:
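The slide's listing was lost in extraction; below is a reconstruction in the same spirit (random data, sigmoid hidden layer, L2 loss, manual backprop; all constants are illustrative):

```python
import numpy as np

np.random.seed(0)
N, D_in, H, D_out = 64, 1000, 100, 10
x, y = np.random.randn(N, D_in), np.random.randn(N, D_out)
w1, w2 = np.random.randn(D_in, H), np.random.randn(H, D_out)

for t in range(300):
    # forward pass: sigmoid hidden layer, linear output, L2 loss
    h = 1.0 / (1.0 + np.exp(-x.dot(w1)))
    y_pred = h.dot(w2)
    loss = np.square(y_pred - y).sum()
    if t == 0:
        first_loss = loss

    # backward pass: chain rule, layer by layer
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_w1 = x.T.dot(grad_h * h * (1 - h))  # sigmoid local gradient: h(1-h)

    # gradient descent update
    w1 -= 1e-4 * grad_w1
    w2 -= 1e-4 * grad_w2
```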
In HW: Writing a 2-layer net
This image by Fotis Bobolas is licensed under CC-BY 2.0
[Figure: a biological neuron. Impulses are carried toward the cell body by the dendrites and away from the cell body along the axon, ending at the presynaptic terminal. This image by Felipe Perucho is licensed under CC-BY 3.0.]
In the mathematical model of the neuron, inputs arrive along the dendrites, the cell body sums the weighted inputs, and the result passes through an activation function, e.g. the sigmoid activation function σ(x) = 1 / (1 + e^-x).
Be very careful with your brain analogies!

Biological neurons:
● Many different types
● Dendrites can perform complex non-linear computations
● Synapses are not a single weight but a complex non-linear dynamical system
● Rate code may not be adequate

[Dendritic Computation. London and Häusser]
Activation functions: Sigmoid, tanh, ReLU, Leaky ReLU, Maxout, ELU
Neural networks: Architectures

“Fully-connected” layers. A “2-layer Neural Net” is also called a “1-hidden-layer Neural Net”; a “3-layer Neural Net” is a “2-hidden-layer Neural Net”.
Example feed-forward computation of a neural network

We can efficiently evaluate an entire layer of neurons.
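A sketch of such a layer-by-layer forward pass with a sigmoid activation (the layer sizes and random weights are illustrative):

```python
import numpy as np

f = lambda x: 1.0 / (1.0 + np.exp(-x))   # activation function (sigmoid)

# illustrative sizes: 3 inputs, two hidden layers of 4 units, 1 output
W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1)
W2, b2 = np.random.randn(4, 4), np.random.randn(4, 1)
W3, b3 = np.random.randn(1, 4), np.random.randn(1, 1)

x = np.random.randn(3, 1)                # random input vector (3x1)
h1 = f(np.dot(W1, x) + b1)               # first hidden layer activations (4x1)
h2 = f(np.dot(W2, h1) + b2)              # second hidden layer activations (4x1)
out = np.dot(W3, h2) + b3                # output neuron (1x1)
```

Each layer is one matrix multiply plus a bias and a nonlinearity, which is exactly what makes the layer abstraction efficient.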
Summary

- We arrange neurons into fully-connected layers
- The abstraction of a layer has the nice property that it allows us to use efficient vectorized code (e.g. matrix multiplies)
- Neural networks are not really neural
- Next time: Convolutional Neural Networks