Deep Learning - Princeton University
3dvision.princeton.edu/courses/COS598/2014sp/...

Transcript of "Deep Learning: Deterministic Neural Networks" by Zhirong Wu (36 slides)

Page 1

Deep Learning: Deterministic Neural Networks

Zhirong Wu

Page 2

Deep Learning

"With massive amounts of computational power, machines can now recognize objects and translate speech in real time. Artificial intelligence is finally getting smart."

Page 3

Neural Net Events

• 1943: the first neural network model, proposed by Warren McCulloch and Walter Pitts
• 1969: criticism by Minsky in his book "Perceptrons"
• 1986: back propagation, by Rumelhart and Hinton
• 2006: deep belief networks, by Hinton
• 2011: Microsoft deploys deep networks in speech recognition
• 2012: Google CAT; ImageNet classification over millions of images
• 2013: Facebook establishes its deep learning lab

Page 4

Neural Networks

basic building block: the neuron

[figure: a neuron with inputs x1, x2, x3, a bias unit +1, weights w1, w2, w3 and bias b, computing z and output y]

z = \sum_i x_i w_i + b, \qquad y = f(z)

where f is an activation function, here the sigmoid:

f(z) = \sigma(z) = \frac{1}{1 + \exp(-z)}

• the sigmoid is bounded between 0 and 1
• monotonically increasing
• derivative: \sigma'(z) = \sigma(z) (1 - \sigma(z))
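to make the neuron concrete, here is a minimal Python/NumPy sketch of the formulas above; the input and weight values are arbitrary illustrative numbers, not from the slides.

import numpy as np

def sigmoid(z):
    # f(z) = 1 / (1 + exp(-z)), bounded between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

# a neuron with 3 inputs: z = sum_i x_i w_i + b, y = f(z)
x = np.array([1.0, 0.0, 1.0])   # inputs (arbitrary example values)
w = np.array([0.5, -0.3, 0.8])  # weights
b = 0.1                         # bias

z = np.dot(w, x) + b
y = sigmoid(z)
print(z, y, sigmoid_prime(z))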

Page 5 (duplicate of Page 4)

Page 6

Neural Networks

[figure: a three-layer network with inputs x1, x2, x3 and bias units +1; hidden pre-activations z^{(2)}_1, z^{(2)}_2 with activations a^{(2)}_1, a^{(2)}_2; output z^{(3)}; parameters W^{(1)}, b^{(1)} between layer 1 and layer 2, and W^{(2)}, b^{(2)} between layer 2 and layer 3]

feed-forward representation, where W and b are the model parameters.

for each neuron in the next layer:

z^{(2)}_i = \sum_{j=1}^{n} w^{(1)}_{ij} x_j + b^{(1)}_i, \qquad a^{(2)}_i = f(z^{(2)}_i)

e.g. z^{(2)}_1 = w^{(1)}_{11} x_1 + w^{(1)}_{12} x_2 + w^{(1)}_{13} x_3 + b^{(1)}_1, \qquad a^{(2)}_1 = f(z^{(2)}_1)

compactly:

z^{(2)} = W^{(1)} x + b^{(1)}, \qquad a^{(2)} = f(z^{(2)})

z^{(3)} = W^{(2)} a^{(2)} + b^{(2)}, \qquad a^{(3)} = f(z^{(3)})

globally, the network models a function y = h_{W,b}(x), written compactly as h(x).
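a minimal NumPy sketch of this feed-forward pass for the 3-input, 2-hidden-unit, 1-output network in the figure; the shapes follow the slide's notation (W^{(1)} is 2x3, W^{(2)} is 1x2), and the random parameter values are placeholders.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feed_forward(x, W1, b1, W2, b2, f=sigmoid):
    # z(2) = W(1) x + b(1),  a(2) = f(z(2))
    z2 = W1 @ x + b1
    a2 = f(z2)
    # z(3) = W(2) a(2) + b(2),  a(3) = f(z(3))
    z3 = W2 @ a2 + b2
    a3 = f(z3)
    return a3  # h_{W,b}(x)

# placeholder parameters with the shapes from the figure
W1 = 0.01 * np.random.randn(2, 3)
b1 = np.zeros(2)
W2 = 0.01 * np.random.randn(1, 2)
b2 = np.zeros(1)

x = np.array([1.0, 0.0, 1.0])
print(feed_forward(x, W1, b1, W2, b2))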

Page 7

Neural Networks

recall: gradient descent, for min F(x)

if the multivariable function F(x) is defined and differentiable in a neighborhood of a point a, then F decreases fastest if one goes from a in the direction of the negative gradient -\nabla F(a):

b = a - \gamma \nabla F(a)

gradient descent ensures that for sufficiently small \gamma we have F(b) < F(a).
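as a sketch of the update b = a - \gamma \nabla F(a), here is plain gradient descent on a toy quadratic F(x) = ||x||^2; the step size gamma and iteration count are arbitrary choices.

import numpy as np

def F(x):
    return np.sum(x ** 2)          # a toy objective with minimum at the origin

def grad_F(x):
    return 2.0 * x                 # its gradient

x = np.array([3.0, -2.0])          # starting point a
gamma = 0.1                        # sufficiently small step size
for _ in range(50):
    x_new = x - gamma * grad_F(x)  # b = a - gamma * grad F(a)
    assert F(x_new) < F(x) or np.allclose(x, 0)  # F(b) < F(a)
    x = x_new
print(x, F(x))                     # close to the minimum at the origin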

Page 8

Neural Networks

supervised learning

• define a loss function over the true label and the model prediction, such as squared error / cross entropy.
• use gradient descent to optimize the model parameters.

back propagation updates, with learning rate \alpha:

w^{(l)}_{ij} := w^{(l)}_{ij} - \alpha \frac{\partial}{\partial w^{(l)}_{ij}} J(W, b; x, y)

b^{(l)}_i := b^{(l)}_i - \alpha \frac{\partial}{\partial b^{(l)}_i} J(W, b; x, y)

Page 9

Neural Networks

back propagation is just the chain rule, used to derive gradients.

minimize J(W, b) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} \| h_{W,b}(x^{(i)}) - y^{(i)} \|^2

for illustration, take just one sample: J(W, b) = \frac{1}{2} \| h(x) - y \|^2

[figure: the same three-layer network, with output h(x) and parameters W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}]

chain rule: y = f(g(x)), \qquad y' = f'(g(x)) \cdot g'(x)

intuition: x \to g(x) \to f(g(x)); the gradient flows backwards, first through f'(g(x)) and then through g'(x).

Page 10

Neural Networks

[figure: the same three-layer network]

working backwards from the output:

h(x): \quad \frac{\partial J}{\partial h(x)} = h(x) - y

z^{(3)}: \quad h(x) = f(z^{(3)}) \;\Rightarrow\; \frac{\partial J}{\partial z^{(3)}} = \frac{\partial J}{\partial h(x)} \cdot f'(z^{(3)})

W^{(2)}, b^{(2)}: \quad z^{(3)} = W^{(2)} a^{(2)} + b^{(2)} \;\Rightarrow\; \frac{\partial J}{\partial w^{(2)}_j} = \frac{\partial J}{\partial z^{(3)}} \cdot a^{(2)}_j, \qquad \frac{\partial J}{\partial b^{(2)}} = \frac{\partial J}{\partial z^{(3)}} \cdot 1

Page 11

Neural Networks

[figure: the same three-layer network]

a^{(2)}_j: \quad z^{(3)} = W^{(2)} a^{(2)} + b^{(2)} \;\Rightarrow\; \frac{\partial J}{\partial a^{(2)}_j} = \frac{\partial J}{\partial z^{(3)}} \cdot w^{(2)}_j

z^{(2)}_j: \quad a^{(2)}_j = f(z^{(2)}_j) \;\Rightarrow\; \frac{\partial J}{\partial z^{(2)}_j} = \frac{\partial J}{\partial a^{(2)}_j} \cdot f'(z^{(2)}_j)

Page 12

Neural Networks

[figure: the same three-layer network]

W^{(1)}, b^{(1)}: \quad z^{(2)} = W^{(1)} x + b^{(1)} \;\Rightarrow\; \frac{\partial J}{\partial w^{(1)}_{ij}} = \frac{\partial J}{\partial z^{(2)}_i} \cdot x_j, \qquad \frac{\partial J}{\partial b^{(1)}_i} = \frac{\partial J}{\partial z^{(2)}_i} \cdot 1
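putting pages 9-12 together: a sketch of one forward and backward pass for the squared-error loss on a single sample, with sigmoid f; it follows the slide's equations step by step, and the parameter values below are placeholders.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, W1, b1, W2, b2):
    # forward pass
    z2 = W1 @ x + b1; a2 = sigmoid(z2)
    z3 = W2 @ a2 + b2; h = sigmoid(z3)
    # dJ/dh = h - y
    dh = h - y
    # dJ/dz(3) = dJ/dh * f'(z(3)), with f'(z) = f(z)(1 - f(z)) for the sigmoid
    dz3 = dh * h * (1 - h)
    # dJ/dW(2)_j = dJ/dz(3) * a(2)_j,  dJ/db(2) = dJ/dz(3) * 1
    dW2 = np.outer(dz3, a2); db2 = dz3
    # dJ/da(2)_j = dJ/dz(3) * w(2)_j
    da2 = W2.T @ dz3
    # dJ/dz(2)_j = dJ/da(2)_j * f'(z(2)_j)
    dz2 = da2 * a2 * (1 - a2)
    # dJ/dW(1)_ij = dJ/dz(2)_i * x_j,  dJ/db(1)_i = dJ/dz(2)_i * 1
    dW1 = np.outer(dz2, x); db1 = dz2
    return dW1, db1, dW2, db2

W1 = 0.01 * np.random.randn(2, 3); b1 = np.zeros(2)
W2 = 0.01 * np.random.randn(1, 2); b2 = np.zeros(1)
grads = backprop(np.array([1.0, 0.0, 1.0]), np.array([1.0]), W1, b1, W2, b2)
print([g.shape for g in grads])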

Page 13

Neural Networks

learning algorithm for neural nets (batch gradient descent):

1. initialize W = 0.01 * N(0, 1), b = 0
2. repeat: compute the gradients \Delta W, \Delta b over the training set by back propagation, then update W := W - \alpha \cdot \Delta W, b := b - \alpha \cdot \Delta b

Page 14

Neural Networks

toy example (linear identity activation function):

input x = (1, 0, 1), target y = 1

W^{(1)} = \begin{bmatrix} 0.1 & 0.3 & 0 \\ -0.2 & 0.5 & 0.2 \end{bmatrix}, \quad b^{(1)} = \begin{bmatrix} -0.5 \\ 0.5 \end{bmatrix}, \quad W^{(2)} = [-1 \;\; 1], \quad b^{(2)} = [0]

forward pass: z^{(2)} = W^{(1)} x + b^{(1)} = (-0.4, 0.5), \qquad z^{(3)} = W^{(2)} z^{(2)} + b^{(2)} = 0.9

Page 15

Neural Networks

toy example, continued: back propagation with J = \frac{1}{2}(h(x) - y)^2, same parameters and identity activation as above.

\frac{\partial J}{\partial z^{(3)}} = h(x) - y = 0.9 - 1 = -0.1

\frac{\partial J}{\partial W^{(2)}} = -0.1 \cdot (z^{(2)})^T = (0.04, -0.05), \qquad \frac{\partial J}{\partial b^{(2)}} = -0.1

\frac{\partial J}{\partial z^{(2)}} = (W^{(2)})^T \cdot (-0.1) = (0.1, -0.1)
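the toy numbers can be checked mechanically; a self-contained NumPy sketch with the slide's parameters and the identity activation:

import numpy as np

# parameters from the toy example (identity activation)
W1 = np.array([[0.1, 0.3, 0.0],
               [-0.2, 0.5, 0.2]])
b1 = np.array([-0.5, 0.5])
W2 = np.array([[-1.0, 1.0]])
b2 = np.array([0.0])
x = np.array([1.0, 0.0, 1.0])
y = 1.0

# forward pass with f(z) = z
z2 = W1 @ x + b1          # [-0.4, 0.5]
z3 = W2 @ z2 + b2         # [0.9]

# backward pass for J = 0.5 * (h - y)^2
dz3 = z3 - y              # [-0.1]
dW2 = np.outer(dz3, z2)   # [[0.04, -0.05]]
db2 = dz3                 # [-0.1]
dz2 = W2.T @ dz3          # [0.1, -0.1]
print(z2, z3, dz3, dW2, db2, dz2)

the printed values match the slide's annotations: z^{(2)} = (-0.4, 0.5), output 0.9, and gradients -0.1, (0.04, -0.05), -0.1, (0.1, -0.1).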

Page 16

Neural Networks

batch gradient descent & stochastic gradient descent

recall: minimize J(W, b) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} \| h_{W,b}(x^{(i)}) - y^{(i)} \|^2

• batch: we have to compute and average the gradients of all the training samples in order to perform one gradient descent step. too costly! each sample has already provided some information about where to go.
• stochastic: shuffle the data and use a mini-batch instead of all samples to evaluate each gradient step. e.g. 100 training samples split into 5 mini-batches of 20 samples each.

Page 17

Neural Networks

stochastic gradient descent algorithm:

1. initialize W = 0.01 * N(0, 1), b = 0
2. shuffle the data and split it into mini-batches
3. for each mini-batch: compute \Delta W, \Delta b by back propagation, then update W := W - \alpha \cdot \Delta W, b := b - \alpha \cdot \Delta b

Page 18

Neural Networks

stochastic gradient descent, practical issues:

1. how to initialize W, b? small values, e.g. W = 0.01 * N(0, 1), b = 0.
2. how to set the learning rate? set it so that the update is a small fraction of the weight, i.e. \alpha \cdot \frac{\partial J}{\partial w} \approx 10^{-3} \cdot w, decreasing it linearly when necessary.
3. how to set the size of the mini-batch? it cannot be too big; 20 - 50 is fine.
4. problem: gradient diffusion. choose a larger learning rate for the front (earlier) layers.

(a training-loop sketch incorporating these choices follows below)
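a sketch of the full mini-batch loop under these recommendations (small random initialization, mini-batches of 20, a decaying learning rate); for brevity it trains a linear least-squares model, which is an assumption for illustration, not the slides' network - with a neural net, the gradient line would be replaced by the averaged backprop gradients.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # 100 training samples
y = X @ np.array([1.0, -2.0, 0.5])       # toy regression targets

W = 0.01 * rng.normal(size=3)            # small random initialization
alpha, batch_size = 0.1, 20              # mini-batch of 20 (20 - 50 is fine)

for epoch in range(50):
    perm = rng.permutation(len(X))       # shuffle the data each epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        # averaged gradient of 0.5 * ||Xb W - yb||^2 over the mini-batch
        grad = Xb.T @ (Xb @ W - yb) / len(idx)
        W -= alpha * grad                # one SGD step per mini-batch
    alpha *= 0.98                        # decrease the learning rate over time
print(W)                                 # close to (1, -2, 0.5)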

Page 19

Neural Networks

more sophisticated ways to update parameters:

weight decay:

J(\theta) := J(\theta) + \frac{\lambda}{2} \|\theta\|^2 = \frac{1}{2} \|h(x) - y\|^2 + \frac{\lambda}{2} \|\theta\|^2, \qquad \lambda \approx 10^{-4} - 10^{-6}

so the raw update becomes \Delta\theta = \frac{\partial J}{\partial \theta} + \lambda \theta.

momentum: brings robustness to stochastic gradient descent.

\Delta\theta_{new} = m \cdot \Delta\theta_{old} + (1 - m) \cdot \Delta\theta, \qquad m \approx 0.9

the parameters are then updated as \theta := \theta - \alpha \cdot \Delta\theta_{new}.
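a sketch of the combined update rule with the slide's suggested constants (lambda around 10^{-4}, m = 0.9); here `grad` stands for dJ/dtheta computed on the current mini-batch, and the numeric values are placeholders.

import numpy as np

def update(theta, grad, velocity, alpha=0.01, lam=1e-4, m=0.9):
    # weight decay: Delta theta = dJ/dtheta + lambda * theta
    delta = grad + lam * theta
    # momentum: Delta theta_new = m * Delta theta_old + (1 - m) * Delta theta
    velocity = m * velocity + (1.0 - m) * delta
    # theta := theta - alpha * Delta theta_new
    return theta - alpha * velocity, velocity

theta = np.array([0.5, -0.3])
vel = np.zeros_like(theta)                # the previous smoothed update
theta, vel = update(theta, grad=np.array([0.2, -0.1]), velocity=vel)
print(theta, vel)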

Page 20

Neural Networks

softmax classifier (the last layer)

[figure: inputs x1, x2, x3, x4 with weights w1, ..., wk feeding outputs y1, y2, y3]

the label is in a 1-of-k coding scheme. writing \hat{y} for the prediction:

\hat{y}_i = \frac{\exp(w_i x)}{\sum_{j=1}^{k} \exp(w_j x)} = p(y_i = 1)

cost function (cross entropy, the maximum likelihood of a Bernoulli distribution):

J(W) = -\frac{1}{m} \left[ \sum_{i=1}^{m} y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right]

derivatives (by back prop):

\frac{\partial}{\partial w_j} J(W) = -\frac{1}{m} \sum_{i=1}^{m} \left[ x^{(i)} \left( y^{(i)}_j - \hat{y}^{(i)}_j \right) \right]
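a sketch of the softmax layer and its gradient on a mini-batch, matching the formulas above; it uses the multiclass form -sum y log y_hat of the cross entropy rather than the binary (Bernoulli) form written on the slide, and the max subtraction is a standard numerical-stability trick that is not on the slide.

import numpy as np

def softmax(scores):
    # subtract the row-wise max for numerical stability (result is unchanged)
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def softmax_loss_and_grad(W, X, Y):
    # W: (k, d) weights, X: (m, d) inputs, Y: (m, k) 1-of-k labels
    m = X.shape[0]
    Y_hat = softmax(X @ W.T)                # y_hat_i = exp(w_i x) / sum_j exp(w_j x)
    loss = -np.sum(Y * np.log(Y_hat)) / m   # cross entropy
    # dJ/dw_j = -(1/m) sum_i x^(i) (y_j^(i) - y_hat_j^(i))
    grad = -(Y - Y_hat).T @ X / m
    return loss, grad

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
Y = np.eye(3)[rng.integers(0, 3, size=5)]   # random 1-of-k labels
loss, grad = softmax_loss_and_grad(0.01 * rng.normal(size=(3, 4)), X, Y)
print(loss, grad.shape)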

Page 21

Convolutional Neural Networks

why convolutional?

• locality and stationarity of features
• for images, a fully connected network has too many model parameters

Page 22

Convolutional Neural Networks

[figure: convolution and detection layers]

Page 23

Convolutional Neural Networks

pooling:
• reduces the size of the representations
• allows small translation invariance

common pooling techniques: max pooling, average pooling

[figure: pooling over a feature map, from Roger Grosse's tutorial]

Page 24

Convolutional Neural Networks

a convolutional network is just a combination of convolution layers, pooling layers, and fully connected layers.

(figure from the LeNet tutorial, http://deeplearning.net/tutorial/lenet.html)

Page 25

Convolutional Neural Networks

practical example: ImageNet classification

(figure from Krizhevsky, Alex et al. 2012)

architecture: 8 layers in total
• the first 5 are convolutional, the last 3 are fully connected
• max-pooling in the 1st, 2nd and 5th layers
• softmax classifier on top
• "stride" in the figure is the step size of the first convolution

Page 26

Convolutional Neural Networks

details:

• activation function: ReLU, f(z) = max(0, z)
• local contrast normalization: normalize at the same spatial location over different feature maps
• overlapping pooling: pooling size 3 with stride 2

Page 27

Convolutional Neural Networks

reduce over-fitting: training 1.2 million images for 60 million parameters.

data augmentation:

• extract random 224x224 patches from the 256x256 images, plus horizontal reflections. this increases the training set by a factor of 2048.
• alter the intensities of the RGB channels; object identity is invariant to the illumination. add to each RGB pixel in each training image the quantity [p_1, p_2, p_3][\alpha_1 \lambda_1, \alpha_2 \lambda_2, \alpha_3 \lambda_3]^T, where the p_i are the 3x1 eigenvectors and the \lambda_i are the eigenvalues of the 3x3 covariance matrix of RGB pixel values, and the \alpha_i are random numbers.

Page 28

Convolutional Neural Networks

reduce over-fitting: dropout, an efficient way of combining different model predictions.

• for each forward pass and each back propagation, randomly deactivate every neuron with probability 0.5 (setting its activation to zero).
• testing: halve all the weights, and use all of the neurons to predict.
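a sketch of dropout as described above: the mask is redrawn on every forward pass during training, and at test time the activations are scaled by 0.5, which is equivalent to halving the weights.

import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(a, p=0.5, train=True):
    # a: activations of a layer, p: drop probability (0.5 on the slide)
    if train:
        mask = rng.random(a.shape) >= p   # keep each neuron with prob 1 - p
        return a * mask                   # dropped activations are set to zero
    # testing: use all neurons but scale down, equivalent to halving the weights
    return a * (1.0 - p)

a = np.array([0.2, 1.5, -0.7, 0.9])
print(dropout_forward(a, train=True))     # a random subset of units is zeroed
print(dropout_forward(a, train=False))    # all units, halved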

Page 29

Convolutional Neural Networks

practical implementation tips (activation functions omitted):

feed forward:
• convolution
• max & average pooling
(a sketch of both follows below)
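a naive sketch of the two forward operations: valid convolution with stride 1 (implemented as cross-correlation, as is conventional in deep learning code) and non-overlapping 2x2 pooling. loops are kept explicit for clarity rather than speed.

import numpy as np

def conv2d(x, k):
    # valid 2D convolution of image x with kernel k, stride 1
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def pool2d(x, size=2, mode="max"):
    # non-overlapping pooling over size x size windows
    H, W = x.shape
    out = np.zeros((H // size, W // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            win = x[i * size:(i + 1) * size, j * size:(j + 1) * size]
            out[i, j] = win.max() if mode == "max" else win.mean()
    return out

x = np.arange(36, dtype=float).reshape(6, 6)
k = np.array([[1.0, 0.0], [0.0, -1.0]])
print(pool2d(conv2d(x, k)))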

Page 30

Convolutional Neural Networks

practical implementation tips (activation functions omitted):

back prop:
• convolution: the gradient at an input neuron sums the contributions of all detection neurons whose windows cover it
• max pooling: the gradient is routed entirely to the input that achieved the max
• average pooling: the gradient is split evenly over the window, e.g. by 1/4 for a 2x2 window
(a sketch of the pooling rules follows below)
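a sketch of the two pooling backward rules for a single 2x2 window:

import numpy as np

def max_pool_backward(window, grad_out):
    # route the gradient entirely to the input that achieved the max
    grad_in = np.zeros_like(window)
    idx = np.unravel_index(np.argmax(window), window.shape)
    grad_in[idx] = grad_out
    return grad_in

def avg_pool_backward(window, grad_out):
    # split the gradient evenly: each input gets 1/4 for a 2x2 window
    return np.full_like(window, grad_out / window.size)

w = np.array([[0.2, 0.9], [0.4, 0.1]])
print(max_pool_backward(w, 1.0))   # [[0, 1], [0, 0]]
print(avg_pool_backward(w, 1.0))   # [[0.25, 0.25], [0.25, 0.25]]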

Page 31

Convolutional Neural Networks

filter visualization:

first layer: what pattern is most likely to turn on this neuron?

[figure: a neuron with inputs x1, x2, x3, weights w1, w2, w3, pre-activation z and output y]

\max_x \; w \cdot x \quad \text{s.t.} \quad \|x\| = 1 \qquad \Rightarrow \qquad x = \frac{w}{\|w\|}
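the closed-form answer x = w / ||w|| is easy to sanity-check numerically: among unit-norm inputs, the normalized weight vector gives the largest response.

import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=3)                 # a first-layer filter

x_star = w / np.linalg.norm(w)         # the maximizing unit input
best = np.dot(w, x_star)               # equals ||w||

# no random unit vector does better
for _ in range(1000):
    v = rng.normal(size=3)
    v /= np.linalg.norm(v)
    assert np.dot(w, v) <= best + 1e-12
print(best, np.linalg.norm(w))         # equal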

Page 32

Convolutional Neural Networks

filter visualization:

higher layers: difficult; the response is a linear combination of the previous layers, e.g. f_1 \cdot W^{(2)}_1 + f_2 \cdot W^{(2)}_2

[figure: a two-layer network with inputs x1, x2, x3, hidden units a^{(2)}_1, a^{(2)}_2, output h(x), and weights W^{(1)}, W^{(2)}]

Page 33

Convolutional Neural Networks

filter visualization: filters learned

(figure from Matthew D. Zeiler et al. 2013)

Page 34

Convolutional Neural Networks

filter visualization:

(figure from Honglak Lee et al. 2009)

Page 35

Convolutional Neural Networks

online web demo

Page 36

Convolutional Neural Networks

matlab code demo