Deep Learning: Deterministic Neural Networks
Zhirong Wu
Deep Learning: With massive amounts of computational power, machines can now recognize objects and translate speech in real time. Artificial intelligence is finally getting smart.
Neural Net Events
• 1943: neural networks founded by Warren McCulloch and Walter Pitts
• 1969: criticism by Minsky in his book "Perceptrons"
• 1986: back propagation by Rumelhart and Hinton
• 2006: deep belief networks by Hinton
• 2011: Microsoft uses deep networks in speech recognition
• 2012: Google CAT; ImageNet classification over millions of images
• 2013: Facebook established its deep learning lab
Neural Networks

basic building block, the neuron:
$$z = \sum_i x_i w_i + b, \qquad y = f(z)$$
where $f$ is an activation function:
$$f(z) = \sigma(z) = \frac{1}{1 + \exp(-z)}$$

[figure: a single neuron with inputs $x_1, x_2, x_3$, bias unit $+1$, weights $w_1, w_2, w_3$, bias $b$, pre-activation $z$, output $y$]

• sigmoid is bounded between 0 and 1
• monotonically increasing
• differentiation: $\sigma'(z) = \sigma(z)(1 - \sigma(z))$
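As a quick sanity check, here is a minimal NumPy sketch of this building block (the array values and names are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    """Bounded in (0, 1) and monotonically increasing."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# a single neuron: z = sum_i x_i * w_i + b, y = f(z)
x = np.array([1.0, 0.0, 1.0])    # inputs x1, x2, x3
w = np.array([0.5, -0.3, 0.2])   # weights w1, w2, w3
b = 0.1                          # bias
z = np.dot(x, w) + b
y = sigmoid(z)
```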
Neural Networks

feed-forward representation

[figure: a 3-layer network, layer1 (inputs $x_1, x_2, x_3$ and bias $+1$), layer2 (hidden units $a^{(2)}_1, a^{(2)}_2$ and bias $+1$), layer3 (output $h(x)$), with parameters $W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}$ and pre-activations $z^{(2)}_1, z^{(2)}_2, z^{(3)}$]

globally the network models a function
$$y = h_{W,b}(x)$$
where $W$ and $b$ are the model parameters.

for each neuron in the next layer:
$$z^{(2)}_i = \sum_{j=1}^{n} w^{(1)}_{ij} x_j + b^{(1)}_i, \qquad a^{(2)}_i = f(z^{(2)}_i)$$
for example,
$$z^{(2)}_1 = w^{(1)}_{11} x_1 + w^{(1)}_{12} x_2 + w^{(1)}_{13} x_3 + b^{(1)}_1, \qquad a^{(2)}_1 = f(z^{(2)}_1)$$

compactly:
$$z^{(2)} = W^{(1)} x + b^{(1)}, \qquad a^{(2)} = f(z^{(2)})$$
$$z^{(3)} = W^{(2)} a^{(2)} + b^{(2)}, \qquad a^{(3)} = f(z^{(3)}) = h(x)$$
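A minimal NumPy sketch of the feed-forward pass, following the compact form above (the 3-2-1 layer sizes and names are my own choices):

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2, f):
    """Compute h_{W,b}(x) for the two-layer network above."""
    z2 = W1 @ x + b1     # z^(2) = W^(1) x + b^(1)
    a2 = f(z2)           # a^(2) = f(z^(2))
    z3 = W2 @ a2 + b2    # z^(3) = W^(2) a^(2) + b^(2)
    return f(z3)         # a^(3) = h_{W,b}(x)

# usage with a 3-2-1 architecture and sigmoid activation
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
W1 = 0.01 * np.random.randn(2, 3); b1 = np.zeros(2)
W2 = 0.01 * np.random.randn(1, 2); b2 = np.zeros(1)
h = feed_forward(np.array([1.0, 0.0, 1.0]), W1, b1, W2, b2, sigmoid)
```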
Neural Networks

recall: gradient descent for $\min F(x)$

if the multivariable function $F(x)$ is defined and differentiable in a neighborhood of a point $a$, then $F(x)$ decreases fastest if one goes from $a$ in the direction of the negative gradient $-\nabla F(a)$.

so gradient descent takes the step
$$b = a - \gamma \nabla F(a)$$
which ensures that, for a sufficiently small step size $\gamma$, we have $F(b) < F(a)$.
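A one-dimensional sketch of the update $b = a - \gamma \nabla F(a)$; the quadratic F here is my own example, not from the slides:

```python
def gradient_descent(grad_F, a, gamma=0.1, steps=100):
    """Repeatedly step in the direction of the negative gradient."""
    for _ in range(steps):
        a = a - gamma * grad_F(a)   # b = a - gamma * grad F(a)
    return a

# minimize F(x) = (x - 3)^2, whose gradient is 2(x - 3)
x_min = gradient_descent(lambda x: 2.0 * (x - 3.0), a=0.0)
print(x_min)   # approaches 3.0 for sufficiently small gamma
```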
Neural Networks

supervised learning
• define a loss function over the true label and the model prediction, such as squared error / cross entropy.
• use gradient descent to optimize the model parameters.

back propagation, with learning rate $\alpha$:
$$w^{(l)}_{ij} := w^{(l)}_{ij} - \alpha \frac{\partial}{\partial w^{(l)}_{ij}} J(W, b; x, y)$$
$$b^{(l)}_{i} := b^{(l)}_{i} - \alpha \frac{\partial}{\partial b^{(l)}_{i}} J(W, b; x, y)$$
Neural Networks

back propagation is just the chain rule used to derive gradients.

minimize
$$J(W, b) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} \left\| h_{W,b}(x^{(i)}) - y^{(i)} \right\|^2$$
for illustration, take just one sample:
$$J(W, b) = \frac{1}{2} \| h(x) - y \|^2$$

chain rule:
$$y = f(g(x)), \qquad y' = f'(g(x)) \cdot g'(x)$$
intuition: $x \mapsto g(x) \mapsto f(g(x))$, with the local derivatives $g'(x)$ and $f'(g(x))$ multiplied along the path.

[figure: the 3-layer network with parameters $W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}$, pre-activations $z^{(2)}_1, z^{(2)}_2, z^{(3)}$, and output $h(x)$]
Neural Networks

$h(x)$:
$$\frac{\partial J}{\partial h(x)} = h(x) - y$$

$z^{(3)}$: since $h(x) = f(z^{(3)})$,
$$\frac{\partial J}{\partial z^{(3)}} = \frac{\partial J}{\partial h(x)} \cdot f'(z^{(3)})$$

$W^{(2)}, b^{(2)}$: since $z^{(3)} = W^{(2)} a^{(2)} + b^{(2)}$,
$$\frac{\partial J}{\partial w^{(2)}_j} = \frac{\partial J}{\partial z^{(3)}} \cdot a^{(2)}_j, \qquad \frac{\partial J}{\partial b^{(2)}} = \frac{\partial J}{\partial z^{(3)}} \cdot 1$$

[figure: the same 3-layer network]
Neural Networks

$a^{(2)}_j$: since $z^{(3)} = W^{(2)} a^{(2)} + b^{(2)}$,
$$\frac{\partial J}{\partial a^{(2)}_j} = \frac{\partial J}{\partial z^{(3)}} \cdot w^{(2)}_j$$

$z^{(2)}_j$: since $a^{(2)}_j = f(z^{(2)}_j)$,
$$\frac{\partial J}{\partial z^{(2)}_j} = \frac{\partial J}{\partial a^{(2)}_j} \cdot f'(z^{(2)}_j)$$

[figure: the same 3-layer network]
Neural Networks

$W^{(1)}, b^{(1)}$: since $z^{(2)} = W^{(1)} x + b^{(1)}$,
$$\frac{\partial J}{\partial w^{(1)}_{ij}} = \frac{\partial J}{\partial z^{(2)}_i} \cdot x_j, \qquad \frac{\partial J}{\partial b^{(1)}_i} = \frac{\partial J}{\partial z^{(2)}_i} \cdot 1$$

[figure: the same 3-layer network]
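Collecting the four steps above, here is a minimal NumPy sketch of back propagation for this two-layer network; the names are my own, and the activation f and its derivative are passed in since the slides keep f generic:

```python
import numpy as np

def backprop(x, y, W1, b1, W2, b2, f, fprime):
    """Gradients of J = 0.5 * ||h(x) - y||^2 for one sample."""
    # forward pass, caching intermediate values
    z2 = W1 @ x + b1
    a2 = f(z2)
    z3 = W2 @ a2 + b2
    h = f(z3)
    # backward pass: apply the chain rule layer by layer
    dJ_dh  = h - y                  # dJ/dh(x) = h(x) - y
    dJ_dz3 = dJ_dh * fprime(z3)     # dJ/dz^(3)
    dJ_dW2 = np.outer(dJ_dz3, a2)   # dJ/dw^(2)_j = dJ/dz^(3) * a^(2)_j
    dJ_db2 = dJ_dz3                 # dJ/db^(2)   = dJ/dz^(3) * 1
    dJ_da2 = W2.T @ dJ_dz3          # dJ/da^(2)_j = dJ/dz^(3) * w^(2)_j
    dJ_dz2 = dJ_da2 * fprime(z2)    # dJ/dz^(2)_j
    dJ_dW1 = np.outer(dJ_dz2, x)    # dJ/dw^(1)_ij = dJ/dz^(2)_i * x_j
    dJ_db1 = dJ_dz2                 # dJ/db^(1)_i  = dJ/dz^(2)_i * 1
    return dJ_dW1, dJ_db1, dJ_dW2, dJ_db2
```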
Neural Networks

learning algorithm for neural nets:
• initialize with small random weights: $W = 0.01 \cdot N(0, 1)$, $b = 0$
• repeat: compute the gradients $\Delta W, \Delta b$ by back propagation, then update
$$W := W - \alpha \cdot \Delta W, \qquad b := b - \alpha \cdot \Delta b$$
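Reusing the backprop sketch above, the whole algorithm fits in a short loop (the toy dataset here is a placeholder of my own):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
sigmoid_prime = lambda z: sigmoid(z) * (1.0 - sigmoid(z))

rng = np.random.default_rng(0)
# initialize: W = 0.01 * N(0, 1), b = 0
W1 = 0.01 * rng.standard_normal((2, 3)); b1 = np.zeros(2)
W2 = 0.01 * rng.standard_normal((1, 2)); b2 = np.zeros(1)

# placeholder dataset of (input, label) pairs
data = [(np.array([1.0, 0.0, 1.0]), np.array([1.0])),
        (np.array([0.0, 1.0, 0.0]), np.array([0.0]))]

alpha = 0.5   # learning rate
for epoch in range(1000):
    for x, y in data:
        dW1, db1, dW2, db2 = backprop(x, y, W1, b1, W2, b2,
                                      sigmoid, sigmoid_prime)
        # W := W - alpha * dW, b := b - alpha * db
        W1 -= alpha * dW1; b1 -= alpha * db1
        W2 -= alpha * dW2; b2 -= alpha * db2
```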
Neural Networks

toy example (linear identity activation function):

input $x = (1, 0, 1)$, with parameters
$$W^{(1)} = \begin{bmatrix} 0.1 & 0.3 & 0 \\ -0.2 & 0.5 & 0.2 \end{bmatrix}, \qquad b^{(1)} = \begin{bmatrix} -0.5 \\ 0.5 \end{bmatrix}$$
$$W^{(2)} = \begin{bmatrix} -1 & 1 \end{bmatrix}, \qquad b^{(2)} = \begin{bmatrix} 0 \end{bmatrix}$$

forward pass: $z^{(2)} = W^{(1)} x + b^{(1)} = (-0.4,\ 0.5)$; output $z^{(3)} = W^{(2)} z^{(2)} + b^{(2)} = 0.9$; true label $y = 1$.

[figure: the 3-layer network annotated with these values]
Neural Networks

toy example (linear identity activation function), back propagation:

$$\frac{\partial J}{\partial h(x)} = h(x) - y = 0.9 - 1 = -0.1$$
with the identity activation $f'(z) = 1$, so $\frac{\partial J}{\partial z^{(3)}} = -0.1$, and
$$\frac{\partial J}{\partial W^{(2)}} = \begin{bmatrix} 0.04 & -0.05 \end{bmatrix}, \qquad \frac{\partial J}{\partial b^{(2)}} = -0.1, \qquad \frac{\partial J}{\partial a^{(2)}} = \begin{bmatrix} 0.1 \\ -0.1 \end{bmatrix}$$

[figure: the network annotated with these gradients]
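These numbers can be checked mechanically; a short NumPy verification (identity activation, so f'(z) = 1 everywhere):

```python
import numpy as np

W1 = np.array([[0.1, 0.3, 0.0], [-0.2, 0.5, 0.2]]); b1 = np.array([-0.5, 0.5])
W2 = np.array([[-1.0, 1.0]]);                       b2 = np.array([0.0])
x  = np.array([1.0, 0.0, 1.0]);                     y  = np.array([1.0])

# forward pass with the identity activation
a2 = W1 @ x + b1            # [-0.4,  0.5]
h  = W2 @ a2 + b2           # [ 0.9]

# backward pass
dz3 = h - y                 # [-0.1]
dW2 = np.outer(dz3, a2)     # [[ 0.04, -0.05]]
db2 = dz3                   # [-0.1]
da2 = W2.T @ dz3            # [ 0.1, -0.1]
```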
Neural Networks

batch gradient descent & stochastic gradient descent

recall: minimize
$$J(W, b) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} \left\| h_{W,b}(x^{(i)}) - y^{(i)} \right\|^2$$

batch gradient descent: we have to compute and average the gradients of all training samples in order to perform one gradient descent step. too costly! each sample already provides some information about where to go.

stochastic gradient descent: shuffle the data, then use a mini-batch instead of all samples to evaluate each gradient step. e.g. 100 training samples split into 5 mini-batches of 20 samples each.
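A sketch of the mini-batch split from the example above (100 samples, 5 mini-batches of 20; the data array is a placeholder):

```python
import numpy as np

rng = np.random.default_rng(0)
m, batch_size = 100, 20
X = rng.standard_normal((m, 3))   # placeholder training inputs

# shuffle the data, then walk through it one mini-batch at a time
perm = rng.permutation(m)
for start in range(0, m, batch_size):
    batch = X[perm[start:start + batch_size]]
    # ... evaluate the gradient on `batch` only and take one descent step
```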
Neural Networks

stochastic gradient descent algorithm:
• initialize $W = 0.01 \cdot N(0, 1)$, $b = 0$
• for each mini-batch, compute the gradients $\Delta W, \Delta b$ on that mini-batch only, then update
$$W := W - \alpha \cdot \Delta W, \qquad b := b - \alpha \cdot \Delta b$$
Neural Networks

stochastic gradient descent, practical issues:
1. how to initialize $W, b$? small values, e.g. $W$ drawn from $0.01 \cdot N(0, 1)$ and $b$ set to 0.
2. how to set the learning rate? set it so that each update is a small fraction of the weight itself:
$$\alpha \cdot \frac{\partial J}{\partial w} \approx 10^{-3} \cdot w$$
and decrease it linearly when necessary.
3. how to set the size of the mini-batch? it cannot be too big; 20 to 50 is fine.
4. problem: gradient diffusion. choose a larger learning rate for the front (earlier) layers.
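One way to apply heuristic 2 in code is to monitor the ratio of update size to weight size (a rough sketch of my own; the $10^{-3}$ target is from the slide):

```python
import numpy as np

def update_ratio(alpha, W, dW):
    """Ratio of update magnitude to weight magnitude; ~1e-3 is healthy.
    If it is much larger or smaller, scale the learning rate accordingly."""
    return np.linalg.norm(alpha * dW) / np.linalg.norm(W)
```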
Neural Networks

more sophisticated ways to update parameters:

weight decay:
$$J(\theta) := J(\theta) + \frac{\lambda}{2} \|\theta\|^2 = \frac{1}{2} \|h(x) - y\|^2 + \frac{\lambda}{2} \|\theta\|^2, \qquad \lambda \approx 10^{-4} \text{ to } 10^{-6}$$
so the gradient becomes $\Delta\theta = \frac{\partial J}{\partial \theta} + \lambda \theta$, and $\theta := \theta - \alpha \cdot \Delta\theta$.

momentum: brings robustness to stochastic gradient descent, with $m \approx 0.9$:
$$\Delta\theta_{\text{new}} = m \cdot \Delta\theta_{\text{old}} + (1 - m) \cdot \Delta\theta$$
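A sketch of one update step combining both tricks (names and default values are illustrative; lambda and m follow the ranges above):

```python
import numpy as np

def sgd_step(theta, grad, velocity, alpha=0.1, lam=1e-5, m=0.9):
    """One SGD step with weight decay and momentum as defined above."""
    d_theta  = grad + lam * theta                  # weight decay: dJ/dtheta + lambda*theta
    velocity = m * velocity + (1.0 - m) * d_theta  # momentum smoothing
    return theta - alpha * velocity, velocity      # theta := theta - alpha * d_theta_new
```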
Neural Networks

softmax classifier (the last layer): the label is in a 1-of-$k$ coding scheme.

[figure: last layer with inputs $x_1, \dots, x_4$, weights $w_i$, and outputs $y_1, y_2, y_3$]

$$\hat{y}_i = \frac{\exp(w_i x)}{\sum_{j=1}^{k} \exp(w_j x)} = p(y_i = 1)$$

cost function, the cross entropy (maximum likelihood of a Bernoulli distribution):
$$J(W) = -\frac{1}{m} \left[ \sum_{i=1}^{m} y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right]$$

derivatives (back prop):
$$\frac{\partial}{\partial w_j} J(W) = -\frac{1}{m} \sum_{i=1}^{m} \left[ x^{(i)} \left( y^{(i)}_j - \hat{y}^{(i)}_j \right) \right]$$
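A NumPy sketch of the softmax layer, using the multiclass form of the cross entropy, $-\sum_i y_i \log \hat{y}_i$ (dimensions and names are illustrative):

```python
import numpy as np

def softmax(scores):
    """y_hat_i = exp(w_i x) / sum_j exp(w_j x); shifted for numerical stability."""
    e = np.exp(scores - np.max(scores))
    return e / np.sum(e)

def cross_entropy(y_hat, y):
    """Cross-entropy loss for a 1-of-k coded label y."""
    return -np.sum(y * np.log(y_hat))

W = 0.01 * np.random.randn(3, 4)    # k = 3 classes, 4 input features
x = np.array([1.0, 0.0, 1.0, 0.5])
y = np.array([0.0, 1.0, 0.0])       # 1-of-k label
y_hat  = softmax(W @ x)
grad_W = np.outer(y_hat - y, x)     # per-sample gradient x (y_hat - y), as above
```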
Convolutional Neural Networks

why convolutional?
• locality and stationarity of image features
• for images, fully connected layers would need too many model parameters
Convolutional Neural Networks

[figure: a convolution stage followed by detection layers]
Convolutional Neural Networks

pooling:
• reduces the size of the representations
• allows small translation invariance

common pooling techniques: max pooling, average pooling.

[figure: pooling over a feature map, from Roger Grosse's tutorial]
Convolutional Neural Networks

a convolutional network is just a combination of convolution layers, pooling layers, and fully connected layers.

(figure from the LeNet tutorial, http://deeplearning.net/tutorial/lenet.html)
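A direct NumPy sketch of a convolution layer's core operation (implemented as cross-correlation, as deep learning libraries conventionally do; the sizes are illustrative):

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2-D convolution of one feature map with one filter."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

fmap = conv2d(np.random.rand(28, 28), np.random.rand(5, 5))
# fmap.shape == (24, 24); the same 25 weights are reused at every location,
# which is exactly why convolution needs so few parameters
```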
Convolutional Neural Networks

practical example: ImageNet classification (figure from Krizhevsky et al. 2012)

architecture: 8 layers in total
• the first 5 are convolutional, the last 3 are fully connected
• max-pooling after the 1st, 2nd, and 5th layers
• a softmax classifier on top
Convolutional Neural Networks

details:
• activation function: ReLU, $f(z) = \max(0, z)$
• local contrast normalization: normalize at the same spatial location over different feature maps
• overlapping pooling: pooling size 3 with stride 2
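A sketch of the overlapping pooling detail (size 3, stride 2), in the same spirit as the convolution sketch above:

```python
import numpy as np

def max_pool(fmap, size=3, stride=2):
    """Overlapping max pooling: windows of `size` stepping by `stride`."""
    H, W = fmap.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = fmap[i * stride:i * stride + size,
                             j * stride:j * stride + size].max()
    return out
```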
Convolutional Neural Networks

reduce over-fitting: training fits 60 million parameters on 1.2 million images.

data augmentation:
• extract random 224×224 patches from the 256×256 images, plus their horizontal reflections. This increases the training set by a factor of 2048.
• alter the intensities of the RGB channels, since object identity is invariant to illumination: add to each RGB pixel in each training image the quantity
$$[p_1, p_2, p_3]\,[\alpha_1 \lambda_1, \alpha_2 \lambda_2, \alpha_3 \lambda_3]^T$$
where $p_i$ are the 3×1 eigenvectors and $\lambda_i$ the eigenvalues of the 3×3 covariance matrix of RGB pixel values, and the $\alpha_i$ are random numbers.
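A sketch of this color augmentation (the function name and the alpha distribution are my assumptions; Krizhevsky et al. draw each alpha from a Gaussian with standard deviation 0.1):

```python
import numpy as np

def pca_color_augment(image, sigma=0.1):
    """Add [p1 p2 p3][a1*l1, a2*l2, a3*l3]^T to every pixel of an HxWx3 image."""
    pixels = image.reshape(-1, 3).astype(float)
    cov = np.cov(pixels, rowvar=False)         # 3x3 covariance of RGB values
    lam, P = np.linalg.eigh(cov)               # eigenvalues lambda_i, eigenvectors p_i
    alpha = np.random.normal(0.0, sigma, 3)    # random alpha_i, drawn once per image
    shift = P @ (alpha * lam)                  # [p1 p2 p3][alpha_i * lambda_i]^T
    return image + shift                       # broadcast the 3-vector over all pixels
```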
Convolutional Neural Networks

reduce over-fitting, dropout: an efficient way of combining different model predictions.
• for each forward pass and each back propagation, randomly disable every neuron with probability 0.5 (setting its activation to zero).
• testing: halve all the weights, and use all of the neurons to predict.

[figure: dropout]
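A sketch of dropout at train and test time (scaling the activations by 1 − p at test time is equivalent to halving the outgoing weights for p = 0.5):

```python
import numpy as np

def dropout(a, p=0.5, train=True):
    """Randomly zero activations with probability p during training."""
    if train:
        mask = np.random.rand(*a.shape) >= p   # each neuron survives with prob 1-p
        return a * mask                        # keep the mask for back propagation
    return a * (1.0 - p)                       # test: all neurons, scaled outputs
```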
Convolutional Neural Networks

practical implementation tips (the figures below do not contain activation functions):

feed forward:
• convolution: [figure]
• max & average pooling: [figure]
Convolutional Neural Networks

practical implementation tips (the figures below do not contain activation functions):

back prop:
• convolution: [figure]
• max & average pooling: [figure: average pooling spreads the incoming gradient equally over the window, 1/4 each for a 2×2 window; max pooling routes it entirely to the input neuron that attained the max]
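A sketch of back prop through a 2×2, stride-2 pooling layer, matching the figure: max pooling routes each incoming gradient to the winning input; average pooling spreads it as 1/4 to each input:

```python
import numpy as np

def pool_backward(x, grad_out, mode="max"):
    """Distribute grad_out (over pooled cells) back onto the input x."""
    grad_in = np.zeros_like(x)
    H, W = grad_out.shape
    for i in range(H):
        for j in range(W):
            win = x[2*i:2*i + 2, 2*j:2*j + 2]
            if mode == "max":
                # route the gradient to the input that attained the max
                grad_in[2*i:2*i + 2, 2*j:2*j + 2] += (win == win.max()) * grad_out[i, j]
            else:
                # average pooling: each of the 4 inputs gets 1/4 of the gradient
                grad_in[2*i:2*i + 2, 2*j:2*j + 2] += grad_out[i, j] / 4.0
    return grad_in
```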
Convolutional Neural Networks

filter visualization, first layer: what input pattern is most likely to turn on this neuron?
$$\max_x \; w \cdot x \quad \text{s.t.} \quad \|x\| = 1 \qquad \Longrightarrow \qquad x = \frac{w}{\|w\|}$$

[figure: a single neuron with inputs $x_1, x_2, x_3$, weights $w_1, w_2, w_3$, pre-activation $z$, output $y$]
Convolutional Neural Networks

filter visualization, higher layers: difficult; a neuron's preferred pattern is approximated by a linear combination of the previous layer's filters, e.g.
$$f_1 \cdot W^{(2)}_1 + f_2 \cdot W^{(2)}_2$$

[figure: the 3-layer network with first-layer filters combined through the weights $W^{(2)}$]
Convolutional Neural Networks

filter visualization: filters learned (figure from Matthew D. Zeiler et al. 2013)
Convolutional Neural Networks

filter visualization: (figure from Honglak Lee et al. 2009)
Convolutional Neural Networks
online web demo:
Convolutional Neural Networks
matlab code demo