Transcript of Lecture 8: Deep Learning (Tuo Zhao, Schools of ISyE and CSE, Georgia Tech)

Page 1:

Lecture 8: Deep Learning

Tuo Zhao

Schools of ISyE and CSE, Georgia Tech

Page 2:

Deep Learning = Artificial Intelligence?

Page 3:

Deep Learning = Artificial Intelligence?

Page 4:

Neural Network

Page 5:

Single Neuron

Basic Building Block

Input: x_1, x_2, x_3, +1

Output: h_{w,b}(x) = σ(w^⊤ x) = σ(∑_{j=1}^{3} w_j x_j + b)

Activation Function σ: R → R

Page 6:

Activation Function

Sigmoid function:

σ(z) = 1 / (1 + exp(−z))

Tanh function:

σ(z) = (exp(z) − exp(−z)) / (exp(z) + exp(−z))

ReLU function:

σ(z) = max{0, z}
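For reference, here is a minimal NumPy sketch of the three activation functions above (the function names are mine, not the lecture's):

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + exp(-z)), squashes to (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # sigma(z) = (exp(z) - exp(-z)) / (exp(z) + exp(-z)), squashes to (-1, 1)
    return np.tanh(z)

def relu(z):
    # sigma(z) = max{0, z}, zero for negative inputs, identity otherwise
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), tanh(z), relu(z))
```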

Page 7:

Activation Function

Page 8:

Neural Network

Page 9:

Multiple Neuron

Supervised Learning: X → Y

Input: x_1, x_2, x_3, +1

Hidden Units:

a_1^{(2)} = σ(W_{11}^{(1)} x_1 + W_{12}^{(1)} x_2 + W_{13}^{(1)} x_3 + b_1^{(1)})

a_2^{(2)} = σ(W_{21}^{(1)} x_1 + W_{22}^{(1)} x_2 + W_{23}^{(1)} x_3 + b_2^{(1)})

a_3^{(2)} = σ(W_{31}^{(1)} x_1 + W_{32}^{(1)} x_2 + W_{33}^{(1)} x_3 + b_3^{(1)})

Output:

h_{W,b}(x) = a_1^{(3)} = σ(W_{11}^{(2)} a_1^{(2)} + W_{12}^{(2)} a_2^{(2)} + W_{13}^{(2)} a_3^{(2)} + b_1^{(2)})

Page 10:

Feedforward Network

h_{W,b}(x) = W^{(3)} σ(W^{(2)} σ(W^{(1)} x + b^{(1)}) + b^{(2)}) + b^{(3)}
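A minimal NumPy sketch of this three-layer forward pass, with hypothetical layer sizes (3 inputs, two hidden layers of 3 units, a scalar output) and sigmoid as the activation σ:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
# Hypothetical sizes: 3 -> 3 -> 3 -> 1
W1, b1 = rng.normal(size=(3, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 3)), np.zeros(3)
W3, b3 = rng.normal(size=(1, 3)), np.zeros(1)

def forward(x):
    # h_{W,b}(x) = W3 sigma(W2 sigma(W1 x + b1) + b2) + b3
    a2 = sigmoid(W1 @ x + b1)
    a3 = sigmoid(W2 @ a2 + b2)
    return W3 @ a3 + b3

print(forward(np.array([1.0, 2.0, 3.0])))
```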

Page 11:

Backpropagation Algorithm

Page 12:

Empirical Risk Minimization

Supervised Learning: (x^{(1)}, y^{(1)}), ..., (x^{(n)}, y^{(n)})

Loss function:

L(W, b) = (1/n) ∑_{i=1}^{n} ℓ(h_{W,b}(x^{(i)}), y^{(i)})

Empirical Risk Minimization:

L(W, b) = (1/n) ∑_{i=1}^{n} ℓ(h_{W,b}(x^{(i)}), y^{(i)}) + λ R(W, b)

Page 13:

Backpropagation Algorithm

Nonconvex Optimization: Convergence to stationary solutions

Gradient Descent: Not scalable

Stochastic Gradient Descent: Most popular

W_{jk}^{(p)} ← W_{jk}^{(p)} − α ∂ℓ(h_{W,b}(x^{(t)}), y^{(t)}) / ∂W_{jk}^{(p)}

b_j^{(p)} ← b_j^{(p)} − α ∂ℓ(h_{W,b}(x^{(t)}), y^{(t)}) / ∂b_j^{(p)}

Step size: α — also known as learning rate
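A minimal sketch of one stochastic gradient step over a dictionary of parameter arrays; the gradient values here are made up, since computing them is exactly what backpropagation is for:

```python
import numpy as np

def sgd_step(params, grads, alpha=0.01):
    # params, grads: dicts of NumPy arrays with matching keys, e.g. {"W1": ..., "b1": ...};
    # grads holds the gradients of the per-example loss l(h_{W,b}(x_t), y_t).
    for key in params:
        params[key] -= alpha * grads[key]   # W <- W - alpha * dl/dW
    return params

# Toy usage with made-up gradients:
params = {"W1": np.ones((2, 2)), "b1": np.zeros(2)}
grads = {"W1": 0.5 * np.ones((2, 2)), "b1": 0.1 * np.ones(2)}
print(sgd_step(params, grads, alpha=0.1)["W1"])
```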

Page 14:

Backpropagation Algorithm

Composite function: h(x) = f(g(x))

Chain Rule: h′(x) = f′(g(x)) g′(x)

Error Backpropagation ⇔ Stochastic Gradient Descent

Momentum:

δW_{jk}^{(p)} ← γ δW_{jk}^{(p)} + α ∂ℓ(h_{W,b}(x^{(t)}), y^{(t)}) / ∂W_{jk}^{(p)}

δb_j^{(p)} ← γ δb_j^{(p)} + α ∂ℓ(h_{W,b}(x^{(t)}), y^{(t)}) / ∂b_j^{(p)}

W_{jk}^{(p)} ← W_{jk}^{(p)} − δW_{jk}^{(p)},   b_j^{(p)} ← b_j^{(p)} − δb_j^{(p)}

Page 15:

Momentum

Page 16:

GPU & Asynchronous SGD

Page 17:

GPU & Asynchronous SGD

Page 18:

GPU & Asynchronous SGD

Page 19:

A Function Approximation Perspective

Supervised Learning: (x^{(1)}, y^{(1)}), ..., (x^{(n)}, y^{(n)})

Decision function f(x): R^d → R

Empirical Risk Minimization:

f̂ = argmin_{f∈F} ∑_{i=1}^{n} ℓ(f(x^{(i)}), y^{(i)}) + R(f)

Linear Model: f(x^{(i)}) = θ^⊤ x^{(i)}

Nonparametric Model: Polynomial Regression

Neural Network: f(x^{(i)}) = h_{W,b}(x^{(i)})

Page 20:

Universal Approximation

Any continuous function f can be approximated arbitrarily well by a neural net with one hidden layer.

A wide and shallow network is sufficient for representation.

The hidden layer may contain a very large number of neurons, which is generally computationally intractable.

How can we get such a good neural net?

Mission Impossible

Page 21:

How to “Hack” a Better Neural Network

Page 22:

Vanishing Gradient

Page 23:

Vanishing Gradient

Overfitting: No Errors to Propagate

Avoid Zero Derivatives
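A small numerical illustration (my own, not from the slides) of why saturation starves early layers of gradient: the sigmoid derivative never exceeds 0.25, so a product of many such factors shrinks geometrically, while the ReLU derivative stays 1 on its active side.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)           # never exceeds 0.25

depth = 20
z = 2.0                             # a mildly saturated pre-activation
print(sigmoid_grad(z) ** depth)     # on the order of 1e-20: the gradient vanishes
print(1.0 ** depth)                 # ReLU derivative on the active side stays 1
```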

Page 24:

Dropout Training

Randomly drop neurons:

High dropout probability: e.g. 0.5

Implicit regularization
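A minimal sketch of dropout applied to one layer's activations at training time, using the common "inverted dropout" variant (survivors are rescaled so the expected activation is unchanged); the 0.5 drop probability matches the slide, everything else is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p_drop=0.5):
    # Randomly zero each unit with probability p_drop and rescale the
    # survivors by 1/(1 - p_drop), so no extra rescaling is needed at test time.
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)

a = np.ones(10)
print(dropout(a))   # roughly half the units zeroed, survivors scaled to 2.0
```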

Page 25:

Batch Normalization

Normalize Each Layer: Standardization

Avoid internal covariate shift

Implicit regularization
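A minimal sketch of the standardization step of batch normalization for one mini-batch of pre-activations; the running statistics used at test time are omitted, and the γ and β parameters are assumed to be learned elsewhere:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x: (batch_size, num_features) pre-activations for one layer.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # standardize each feature over the batch
    return gamma * x_hat + beta               # learned scale and shift

x = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(8, 4))
out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(6), out.std(axis=0).round(3))   # ~0 and ~1 per feature
```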

Page 26:

Step Size Annealing

Page 27:

Noise Annealing

W_{jk}^{(p)} ← W_{jk}^{(p)} − α ∂ℓ(h_{W,b}(x^{(t)}), y^{(t)}) / ∂W_{jk}^{(p)} + ε_{jk}^{(p)}

b_j^{(p)} ← b_j^{(p)} − α ∂ℓ(h_{W,b}(x^{(t)}), y^{(t)}) / ∂b_j^{(p)} + ε_j^{(p)}
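A minimal sketch of the noise-injection step: a plain SGD update plus zero-mean Gaussian noise whose scale is annealed with the iteration counter (the decay schedule below is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_sgd_step(params, grads, alpha, t, sigma0=0.1):
    # Noise standard deviation decays as training progresses.
    sigma_t = sigma0 / np.sqrt(1.0 + t)
    for key in params:
        noise = rng.normal(scale=sigma_t, size=params[key].shape)
        params[key] += -alpha * grads[key] + noise   # W <- W - alpha*grad + eps
    return params

params = {"W": np.zeros(3)}
grads = {"W": np.ones(3)}
print(noisy_sgd_step(params, grads, alpha=0.1, t=0)["W"])
```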

Page 28:

Adaptive Optimization

We solve the following optimization problem,

min_θ f(θ), where g(θ) = ∇f(θ).

We can make the step sizes and momentum terms adaptive to coordinates (Animation 1, Animation 2)

AdaGrad: θ_j^{(t+1)} = θ_j^{(t)} − η_j^{(t)} g_j(θ^{(t)}).

AdaM: θ_j^{(t+1)} = θ_j^{(t)} − η_j^{(t)} g_j(θ^{(t)}) + α_j^{(t)} (θ_j^{(t)} − θ_j^{(t−1)}).

The AdaGrad algorithm takes

θ_j^{(t+1)} = θ_j^{(t)} − η g_j(θ^{(t)}) / √(1 + ∑_{i=1}^{t} g_j(θ^{(i)})^2).
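A minimal sketch of the coordinate-wise AdaGrad step above; the 1 inside the square root follows the slide, while many implementations use a small ε instead:

```python
import numpy as np

def adagrad_step(theta, grad, sq_sum, eta=0.1):
    # sq_sum accumulates g_j(theta^(i))^2 for each coordinate j.
    sq_sum += grad ** 2
    theta -= eta * grad / np.sqrt(1.0 + sq_sum)   # per-coordinate step size
    return theta, sq_sum

theta = np.zeros(3)
sq_sum = np.zeros(3)
grad = np.array([1.0, 0.1, 10.0])
for _ in range(5):
    theta, sq_sum = adagrad_step(theta, grad, sq_sum)
print(theta)   # coordinates with large gradients get smaller effective steps
```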

Page 29:

Early Stopping

Page 30:

Residual Network

Skip-Layer Connection

F_{W,V}(x) = σ(V σ(W x) + x)

Ensemble Multiple Neural Networks
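A minimal sketch of the skip-layer block F_{W,V}(x) = σ(Vσ(Wx) + x), taking σ to be ReLU and using square weight matrices so the branch and the identity path have matching shapes (the dimension is arbitrary):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)
d = 4
W = rng.normal(scale=0.1, size=(d, d))
V = rng.normal(scale=0.1, size=(d, d))

def residual_block(x):
    # F_{W,V}(x) = sigma(V sigma(W x) + x): the input is added back in
    # before the outer activation (the skip connection).
    return relu(V @ relu(W @ x) + x)

print(residual_block(rng.normal(size=d)))
```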

Page 31:

Xavier Initialization

[Embedded excerpt from Glorot & Bengio (2010), "Understanding the difficulty of training deep feedforward neural networks": a gradient-propagation study comparing activation values and back-propagated gradients under standard vs. normalized (Xavier) initialization.]

Standard Initialization: W^{(l)} ∼ U(−√3/√n_l, √3/√n_l)

Xavier Initialization: W^{(l)} ∼ U(−√6/√(n_l + n_{l+1}), √6/√(n_l + n_{l+1}))
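A minimal sketch of both uniform initializations for a weight matrix connecting n_l inputs to n_{l+1} outputs (the layer sizes below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def standard_init(n_in, n_out):
    # W ~ U(-sqrt(3)/sqrt(n_l), +sqrt(3)/sqrt(n_l))
    limit = np.sqrt(3.0) / np.sqrt(n_in)
    return rng.uniform(-limit, limit, size=(n_out, n_in))

def xavier_init(n_in, n_out):
    # W ~ U(-sqrt(6)/sqrt(n_l + n_{l+1}), +sqrt(6)/sqrt(n_l + n_{l+1}))
    limit = np.sqrt(6.0) / np.sqrt(n_in + n_out)
    return rng.uniform(-limit, limit, size=(n_out, n_in))

W = xavier_init(400, 300)
print(W.shape, W.var(), 2.0 / (400 + 300))   # empirical variance ~ 2/(n_l + n_{l+1})
```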

Page 32:

Deep vs. Shallow Networks

AlexNet and VGGNet

GoogLeNet

Page 33:

Deep vs. Shallow Networks

Deep networks are very powerful in representation

Deep networks turn out to be easier to optimize

AlexNet: 8 layers ⇒ GoogLeNet: 22 layers ⇒ ResNet: 152 layers

Why?

Page 34:

Convolutional Neural Networks

Page 35:

The Architecture of CNNs

5 Convolution Layers

3 Max Pooling

3 Dense Layers

Page 36:

Convolutional Neural Networks

Preview: ConvNet is a sequence of Convolution Layers, interspersed with activation functions

[Figure: 32x32x3 image → CONV, ReLU (6 5x5x3 filters) → 28x28x6]

slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson

Page 37:

Convolutional Neural Networks

Preview: ConvNet is a sequence of Convolutional Layers, interspersed with activation functions

[Figure: 32x32x3 image → CONV, ReLU (6 5x5x3 filters) → 28x28x6 → CONV, ReLU (10 5x5x6 filters) → 24x24x10 → ...]

slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson

Page 38:

Convolution Operation

The convolution operation
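A minimal sketch of the 2D convolution of a single-channel image with a single filter, stride 1 and no padding (as in most deep learning libraries, this is really cross-correlation, i.e. the filter is not flipped):

```python
import numpy as np

def conv2d(image, kernel):
    # image: (H, W), kernel: (kH, kW); output: (H - kH + 1, W - kW + 1).
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Dot product of the filter with the image patch under it.
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

image = np.arange(49, dtype=float).reshape(7, 7)
kernel = np.ones((3, 3)) / 9.0           # simple averaging filter
print(conv2d(image, kernel).shape)        # (5, 5), matching (N - F)/1 + 1
```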

Page 39:

Convolution Operation

The convolution operation

Page 40:

Benefits of Convolution

Reason 1: Sparse Connectivity

Page 41:

Benefits of Convolution

Reason 2: Parameter Sharing

Page 42:

Benefits of Convolution

Translational Invariance

Page 43:

Convolution Layer

[Figure: a 32x32x3 image, i.e. width 32, height 32, depth 3]

slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson

Page 44:

Convolution Layer

32x32x3 image, 5x5x3 filter

Convolve the filter with the image, i.e. “slide over the image spatially, computing dot products”

slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson

Page 45:

Convolution Layer

32x32x3 image, 5x5x3 filter

1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e. 5*5*3 = 75-dimensional dot product + bias)

slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson

Page 46:

Convolution Layer

32x32x3 image, 5x5x3 filter: convolve (slide) over all spatial locations → a 28x28x1 activation map

slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson

Page 47:

Convolution Layer

32x32x3 image, 5x5x3 filter: convolve (slide) over all spatial locations; consider a second, green filter → a second 28x28x1 activation map

slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson

Page 48:

Convolution Layer


For example, if we had 6 5x5 filters, we’ll get 6 separate activation maps:

We stack these up to get a “new image” of size 28x28x6!

slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
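A minimal sketch of that computation: 6 filters of size 5x5x3 slid over a 32x32x3 image with stride 1 and no padding, each filter producing one 28x28 activation map, stacked into a 28x28x6 output. Plain loops keep the indexing visible; real implementations vectorize this.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.normal(size=(32, 32, 3))            # H x W x depth
filters = rng.normal(size=(6, 5, 5, 3))         # 6 filters, each 5x5x3
biases = np.zeros(6)

H_out = 32 - 5 + 1                              # 28
maps = np.zeros((H_out, H_out, 6))
for f in range(6):
    for i in range(H_out):
        for j in range(H_out):
            patch = image[i:i + 5, j:j + 5, :]                        # 5x5x3 chunk
            maps[i, j, f] = np.sum(patch * filters[f]) + biases[f]    # 75-dim dot product + bias

print(maps.shape)   # (28, 28, 6): the "new image"
```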

Page 49:

Stride Convolution

Stride

Page 50:

Stride Convolution

A closer look at spatial dimensions: 32x32x3 image, 5x5x3 filter → convolve (slide) over all spatial locations → a 28x28x1 activation map

slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson

Page 51:

Stride Convolution

A closer look at spatial dimensions: 7x7 input (spatially), assume a 3x3 filter

slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson

Page 55:

Stride Convolution

A closer look at spatial dimensions: 7x7 input (spatially), assume a 3x3 filter => 5x5 output

slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson

Page 56:

Stride Convolution

A closer look at spatial dimensions: 7x7 input (spatially), assume a 3x3 filter applied with stride 2

slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson

Page 58:

Stride Convolution

A closer look at spatial dimensions: 7x7 input (spatially), assume a 3x3 filter applied with stride 2 => 3x3 output!

slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson

Page 59:

Stride Convolution

A closer look at spatial dimensions: 7x7 input (spatially), assume a 3x3 filter applied with stride 3?

slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson

Page 60:

Stride Convolution

A closer look at spatial dimensions: 7x7 input (spatially), assume a 3x3 filter applied with stride 3? Doesn’t fit! Cannot apply a 3x3 filter on a 7x7 input with stride 3.

slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson

Page 61:

Stride Convolution

Output size: (N − F) / stride + 1

e.g. N = 7, F = 3:
stride 1 => (7 − 3)/1 + 1 = 5
stride 2 => (7 − 3)/2 + 1 = 3
stride 3 => (7 − 3)/3 + 1 = 2.33 :\

slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
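A small helper (mine, not from the slides) that evaluates the output-size formula for the three cases above and flags the stride that does not fit:

```python
def conv_output_size(N, F, stride):
    # Output size: (N - F) / stride + 1; valid only when (N - F) % stride == 0.
    if (N - F) % stride != 0:
        raise ValueError(f"filter of size {F} with stride {stride} does not fit a {N}x{N} input")
    return (N - F) // stride + 1

for s in (1, 2, 3):
    try:
        print(s, conv_output_size(7, 3, s))   # stride 1 -> 5, stride 2 -> 3
    except ValueError as err:
        print(s, err)                          # stride 3 -> doesn't fit
```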

Page 62:

Zero-Padding


Page 63:

Zero-Padding: common to the border

[Figure: 7x7 input with a 1-pixel border of zeros]

e.g. input 7x7, 3x3 filter, applied with stride 1, pad with 1 pixel border => what is the output?

(recall:) (N − F) / stride + 1

slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson

Page 64:

Zero-Padding: common to the border

e.g. input 7x7, 3x3 filter, applied with stride 1, pad with 1 pixel border => what is the output? 7x7 output!

[Figure: 7x7 input with a 1-pixel border of zeros]

slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
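Extending the same kind of helper with a padding term reproduces the 7x7 output; with P pixels of zero-padding on each side, the output size becomes (N − F + 2P)/stride + 1:

```python
def conv_output_size_padded(N, F, stride, pad):
    # (N - F + 2*pad) / stride + 1
    return (N - F + 2 * pad) // stride + 1

print(conv_output_size_padded(7, 3, stride=1, pad=1))   # 7: the spatial size is preserved
```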

Page 65:

Tiled Convolution

Local connectivity: locally connected layer vs. convolutional layer vs. fully connected layer

Page 66:

Tiled Convolution

Tiled convolution: locally connected layer vs. tiled convolution vs. convolutional layer

Page 67:

Pooling

Effect: invariance to small translations of the input

Page 68:

Pooling


Page 69:

Pooling: makes the representations smaller and more manageable; operates over each activation map independently

slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson

Page 70:

Pooling

Max Pooling: single depth slice (x along the width, y along the height), max pool with 2x2 filters and stride 2

Input (4x4):
1 1 2 4
5 6 7 8
3 2 1 0
1 2 3 4

Output (2x2):
6 8
3 4

slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
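A minimal sketch that reproduces the 2x2, stride-2 max pooling of the 4x4 slice above:

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    # x: (H, W) single depth slice; pooling runs independently per slice.
    H, W = x.shape
    out = np.zeros((H // stride, W // stride))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = x[i * stride:i * stride + size,
                          j * stride:j * stride + size].max()
    return out

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]], dtype=float)
print(max_pool(x))   # [[6. 8.], [3. 4.]]
```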

Page 71:

Case Study: AlexNet

Case Study: AlexNet [Krizhevsky et al. 2012]

Full (simplified) AlexNet architecture:
[227x227x3] INPUT
[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0
[27x27x96] MAX POOL1: 3x3 filters at stride 2
[27x27x96] NORM1: Normalization layer
[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2
[13x13x256] MAX POOL2: 3x3 filters at stride 2
[13x13x256] NORM2: Normalization layer
[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1
[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1
[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1
[6x6x256] MAX POOL3: 3x3 filters at stride 2
[4096] FC6: 4096 neurons
[4096] FC7: 4096 neurons
[1000] FC8: 1000 neurons (class scores)

Details/Retrospectives:
- first use of ReLU
- used Norm layers (not common anymore)
- heavy data augmentation
- dropout 0.5
- batch size 128
- SGD Momentum 0.9
- Learning rate 1e-2, reduced by 10 manually when val accuracy plateaus
- L2 weight decay 5e-4
- 7 CNN ensemble: 18.2% -> 15.4%
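As a small sanity check (my own, not part of the slide), chaining the output-size formula through the convolution and pooling layers reproduces the spatial sizes listed above:

```python
def out_size(N, F, stride, pad=0):
    # (N - F + 2*pad) / stride + 1
    return (N - F + 2 * pad) // stride + 1

N = 227                                            # input: 227x227x3
N = out_size(N, 11, 4, 0); print("CONV1  ", N)     # 55
N = out_size(N, 3, 2);     print("POOL1  ", N)     # 27
N = out_size(N, 5, 1, 2);  print("CONV2  ", N)     # 27
N = out_size(N, 3, 2);     print("POOL2  ", N)     # 13
N = out_size(N, 3, 1, 1);  print("CONV3-5", N)     # 13
N = out_size(N, 3, 2);     print("POOL3  ", N)     # 6
```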

slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson

Page 72:

The End

Congratulations!

Page 73: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech