![Page 1: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/1.jpg)
Lecture 8: Deep Learning
Tuo Zhao
Schools of ISyE and CSE, Georgia Tech
![Page 2: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/2.jpg)
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Deep Learning = Artificial Intelligence?
Tuo Zhao — Lecture 8: Deep Learning 2/73
![Page 3: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/3.jpg)
Deep Learning = Artificial Intelligence?
![Page 4: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/4.jpg)
Neural Network
![Page 5: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/5.jpg)
Single Neuron
Basic Building Block
Input: $x_1, x_2, x_3, +1$
Output: $h_{w,b}(x) = \sigma(w^\top x) = \sigma\left(\sum_{j=1}^{3} w_j x_j + b\right)$
Activation Function $\sigma : \mathbb{R} \to \mathbb{R}$
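A minimal Python sketch of the single neuron above, assuming a sigmoid activation. The helper name `neuron` and all numeric values are illustrative assumptions, not taken from the slides.

```python
import math

def sigmoid(z):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b, activation=sigmoid):
    """Single neuron: activation applied to the weighted input sum plus the bias."""
    z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    return activation(z)

# Illustrative values only (assumed, not from the lecture).
x = [1.0, 2.0, 3.0]
w = [0.5, -0.25, 0.0]
b = 0.0
output = neuron(x, w, b)  # weighted sum is 0.5 - 0.5 + 0.0 = 0.0
```

The bias `b` plays the role of the weight attached to the constant `+1` input.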
![Page 6: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/6.jpg)
Activation Function
Sigmoid function: $\sigma(z) = \frac{1}{1 + \exp(-z)}$
Tanh function: $\sigma(z) = \frac{\exp(z) - \exp(-z)}{\exp(z) + \exp(-z)}$
ReLU function: $\sigma(z) = \max\{0, z\}$
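The three activation functions can be written directly from their definitions. This is a plain Python sketch for scalar inputs; the function names are chosen here for illustration.

```python
import math

def sigmoid(z):
    """Sigmoid: 1 / (1 + exp(-z)), with values in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Tanh written out from its definition, with values in (-1, 1)."""
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))

def relu(z):
    """ReLU: max{0, z}."""
    return max(0.0, z)

# Print each activation at a few sample points.
for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), tanh(z), relu(z))
```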
![Page 7: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/7.jpg)
Activation Function
![Page 8: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/8.jpg)
Neural Network
![Page 9: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/9.jpg)
Multiple Neuron
Supervised Learning: X → Y
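Input: $x_1, x_2, x_3, +1$
Hidden Units:
$a^{(2)}_1 = \sigma(W^{(1)}_{11} x_1 + W^{(1)}_{12} x_2 + W^{(1)}_{13} x_3 + b^{(1)}_1)$
$a^{(2)}_2 = \sigma(W^{(1)}_{21} x_1 + W^{(1)}_{22} x_2 + W^{(1)}_{23} x_3 + b^{(1)}_2)$
$a^{(2)}_3 = \sigma(W^{(1)}_{31} x_1 + W^{(1)}_{32} x_2 + W^{(1)}_{33} x_3 + b^{(1)}_3)$
Output:
$h_{W,b}(x) = a^{(3)}_1 = \sigma(W^{(2)}_{11} a^{(2)}_1 + W^{(2)}_{12} a^{(2)}_2 + W^{(2)}_{13} a^{(2)}_3 + b^{(2)}_1)$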
![Page 10: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/10.jpg)
Feedforward Network
$h_{W,b}(x) = W^{(3)} \sigma(W^{(2)} \sigma(W^{(1)} x + b^{(1)}) + b^{(2)}) + b^{(3)}$
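The nested formula can be sketched as a forward pass in plain Python. The layer sizes, the weights, and the names `affine` and `feedforward` are assumptions for illustration. The last layer is left linear, as in the formula above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, x, b):
    """Compute W x + b, where W is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def feedforward(x, layers, activation=sigmoid):
    """layers is a list of (W, b) pairs; every layer but the last is followed by the activation."""
    a = x
    for W, b in layers[:-1]:
        a = [activation(z) for z in affine(W, a, b)]
    W_out, b_out = layers[-1]
    return affine(W_out, a, b_out)

# Illustrative weights only (assumed): 3 inputs -> 2 hidden -> 2 hidden -> 1 output.
layers = [
    ([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [0.0, 0.0]),  # W1 (2x3), b1
    ([[1.0, 1.0], [1.0, -1.0]], [0.0, 0.0]),            # W2 (2x2), b2
    ([[2.0, 0.0]], [1.0]),                               # W3 (1x2), b3
]
y = feedforward([1.0, 2.0, 3.0], layers)
```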
![Page 11: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/11.jpg)
Backpropagation Algorithm
![Page 12: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/12.jpg)
Empirical Risk Minimization
Supervised Learning: $(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})$
Loss function:
$L(W, b) = \frac{1}{n} \sum_{i=1}^{n} \ell(h_{W,b}(x^{(i)}), y^{(i)})$
Empirical Risk Minimization:
$L(W, b) = \frac{1}{n} \sum_{i=1}^{n} \ell(h_{W,b}(x^{(i)}), y^{(i)}) + \lambda R(W, b)$
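A small sketch of the regularized objective, assuming a squared loss for the per-example loss and a squared L2 norm for R. Both choices, the one-dimensional linear model, and the toy data are assumptions made for illustration.

```python
def squared_loss(prediction, target):
    """Per-example loss (assumed here): 0.5 * (prediction - target)^2."""
    return 0.5 * (prediction - target) ** 2

def l2_penalty(params):
    """Regularizer R (assumed here): sum of squared parameters."""
    return sum(p * p for p in params)

def regularized_risk(predict, params, data, lam):
    """(1/n) * sum of per-example losses + lam * R(params)."""
    n = len(data)
    avg_loss = sum(squared_loss(predict(params, x), y) for x, y in data) / n
    return avg_loss + lam * l2_penalty(params)

def linear(params, x):
    """Stand-in for h_{W,b}: a one-dimensional linear model w * x + b."""
    w, b = params
    return w * x + b

data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]  # toy data generated by y = 2x + 1
risk = regularized_risk(linear, [2.0, 1.0], data, lam=0.1)
```

With a perfect fit the data term vanishes, so only the penalty 0.1 * (2^2 + 1^2) remains.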
![Page 13: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/13.jpg)
Backpropagation Algorithm
Nonconvex Optimization: Convergence to stationary solutions
Gradient Descent: Not scalable
Stochastic Gradient Descent: Most popular
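$W^{(p)}_{jk} \leftarrow W^{(p)}_{jk} - \alpha \frac{\partial \ell(h_{W,b}(x^{(t)}), y^{(t)})}{\partial W^{(p)}_{jk}}$
$b^{(p)}_j \leftarrow b^{(p)}_j - \alpha \frac{\partial \ell(h_{W,b}(x^{(t)}), y^{(t)})}{\partial b^{(p)}_j}$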
Step size: α — also known as learning rate
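The stochastic update can be sketched on a toy problem. This example assumes a one-dimensional linear model with squared loss in place of a neural network, so the gradients can be written by hand. The data, step size, and epoch count are illustrative.

```python
import random

def sgd_linear(data, alpha=0.05, epochs=300, seed=0):
    """SGD for h(x) = w * x + b with per-example loss 0.5 * (h(x) - y)^2."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        samples = data[:]
        rng.shuffle(samples)          # visit one example at a time, in random order
        for x, y in samples:
            err = (w * x + b) - y     # derivative of the loss w.r.t. the prediction
            grad_w = err * x          # chain rule: d loss / d w
            grad_b = err              # chain rule: d loss / d b
            w -= alpha * grad_w
            b -= alpha * grad_b
    return w, b

data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # toy data from y = 2x + 1
w, b = sgd_linear(data)
```

Each update uses a single example, which is what makes the method cheap per step compared with full gradient descent.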
![Page 14: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/14.jpg)
Backpropagation Algorithm
Composite function: h(x) = f(g(x))
Chain Rule: h′(x) = f ′(g(x))g′(x)
Error Backpropagation ⇔ Stochastic Gradient Descent
Momentum:
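$\delta W^{(p)}_{jk} \leftarrow \gamma\, \delta W^{(p)}_{jk} + \alpha \frac{\partial \ell(h_{W,b}(x^{(t)}), y^{(t)})}{\partial W^{(p)}_{jk}}$
$\delta b^{(p)}_j \leftarrow \gamma\, \delta b^{(p)}_j + \alpha \frac{\partial \ell(h_{W,b}(x^{(t)}), y^{(t)})}{\partial b^{(p)}_j}$
$W^{(p)}_{jk} \leftarrow W^{(p)}_{jk} - \delta W^{(p)}_{jk}$, $b^{(p)}_j \leftarrow b^{(p)}_j - \delta b^{(p)}_j$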
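A sketch of the momentum update on a toy quadratic. It assumes the convention that the accumulated step is gamma times the previous step plus alpha times the gradient, and that this step is subtracted from the parameters. The objective and the values of alpha and gamma are illustrative.

```python
def momentum_step(theta, velocity, grad, alpha=0.1, gamma=0.9):
    """velocity <- gamma * velocity + alpha * grad; theta <- theta - velocity."""
    new_velocity = [gamma * v + alpha * g for v, g in zip(velocity, grad)]
    new_theta = [t - v for t, v in zip(theta, new_velocity)]
    return new_theta, new_velocity

# Toy objective (assumed): f(theta) = 0.5 * theta^2, whose gradient is theta itself.
theta, velocity = [1.0], [0.0]
for _ in range(200):
    theta, velocity = momentum_step(theta, velocity, grad=theta)
```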
![Page 15: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/15.jpg)
Momentum
![Page 16: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/16.jpg)
GPU & Asynchronous SGD
![Page 17: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/17.jpg)
GPU & Asynchronous SGD
![Page 18: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/18.jpg)
GPU & Asynchronous SGD
![Page 19: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/19.jpg)
A Function Approximation Perspective
Supervised Learning: $(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})$
Decision function $f(x) : \mathbb{R}^d \to \mathbb{R}$
Empirical Risk Minimization:
$\hat{f} = \arg\min_{f \in \mathcal{F}} \sum_{i=1}^{n} \ell(f(x^{(i)}), y^{(i)}) + R(f)$,
Linear Model: $f(x^{(i)}) = \theta^\top x^{(i)}$
Nonparametric Model: Polynomial Regressions
Neural Network: $f(x^{(i)}) = h_{W,b}(x^{(i)})$
![Page 20: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/20.jpg)
Universal Approximation
Any continuous function f on a compact domain can be approximated arbitrarily well by a neural net with one hidden layer.
A wide and shallow network is sufficient for representation.
The hidden layer may need a very large number of neurons, which is generally computationally intractable.
How can we get such a good neural net?
Mission Impossible
![Page 21: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/21.jpg)
How to “Hack” a Better Neural Network
![Page 22: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/22.jpg)
Vanishing Gradient
![Page 23: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/23.jpg)
Vanishing Gradient
Overfitting: No Errors to Propagate
Avoid Zero Derivative
![Page 24: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/24.jpg)
Dropout Training
Randomly drop neurons:
High dropout probability: e.g. 0.5
Implicit regularization
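A sketch of randomly dropping neurons during training. It assumes the common "inverted dropout" convention, in which surviving activations are rescaled by 1 / (1 - p) so that nothing needs to change at test time. The slide does not specify that convention, and the name `dropout` and its arguments are illustrative.

```python
import random

def dropout(activations, p_drop=0.5, rng=random, training=True):
    """Zero each unit with probability p_drop and rescale the survivors (inverted dropout)."""
    if not training or p_drop == 0.0:
        return list(activations)
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
hidden = [1.0] * 10
dropped = dropout(hidden, p_drop=0.5, rng=rng)         # each entry is either 0.0 or 2.0
at_test_time = dropout(hidden, training=False)         # unchanged when not training
```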
![Page 25: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/25.jpg)
Batch Normalization
Normalize Each Layer: Standardization
Avoid covariate shift
Implicit regularization
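The standardization step can be sketched for one feature over a mini-batch. This shows only the normalization. The learned scale and shift that batch normalization usually applies afterwards are omitted, and the constant `eps` is an assumed numerical safeguard.

```python
import math

def batch_standardize(values, eps=1e-5):
    """Standardize one feature over a mini-batch to zero mean and (near) unit variance."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return [(v - mean) / math.sqrt(var + eps) for v in values]

batch = [1.0, 2.0, 3.0, 4.0]  # illustrative pre-activation values for one unit
normalized = batch_standardize(batch)
```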
![Page 26: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/26.jpg)
Step Size Annealing
![Page 27: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/27.jpg)
Noise Annealing
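$W^{(p)}_{jk} \leftarrow W^{(p)}_{jk} - \alpha \frac{\partial \ell(h_{W,b}(x^{(t)}), y^{(t)})}{\partial W^{(p)}_{jk}} + \epsilon^{(p)}_{jk}$
$b^{(p)}_j \leftarrow b^{(p)}_j - \alpha \frac{\partial \ell(h_{W,b}(x^{(t)}), y^{(t)})}{\partial b^{(p)}_j} + \epsilon^{(p)}_j$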
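A sketch of a gradient step with injected noise whose scale is annealed over time. The slide does not say how the noise is drawn or how it decays, so the zero-mean Gaussian noise and the 0.1 / (1 + t) schedule below are assumptions for illustration only.

```python
import random

def noisy_sgd_step(theta, grad, alpha, noise_std, rng):
    """theta <- theta - alpha * grad + noise, with zero-mean Gaussian noise (assumed)."""
    return [t - alpha * g + rng.gauss(0.0, noise_std) for t, g in zip(theta, grad)]

rng = random.Random(0)
theta = [1.0]
for t in range(500):
    noise_std = 0.1 / (1 + t)  # hypothetical annealing schedule: noise shrinks over time
    # Toy objective (assumed): f(theta) = 0.5 * theta^2, whose gradient is theta.
    theta = noisy_sgd_step(theta, grad=theta, alpha=0.1, noise_std=noise_std, rng=rng)
```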
![Page 28: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/28.jpg)
Adaptive Optimization
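We solve the following optimization problem:
$\min_\theta f(\theta)$, where $g(\theta) = \nabla f(\theta)$.
We can make the step sizes and momentums adaptive to coordinates (Animation 1, Animation 2).
AdaGrad: $\theta^{(t+1)}_j = \theta^{(t)}_j - \eta^{(t)}_j g_j(\theta^{(t)})$.
Adam: $\theta^{(t+1)}_j = \theta^{(t)}_j - \eta^{(t)}_j g_j(\theta^{(t)}) + \alpha^{(t)}_j (\theta^{(t)}_j - \theta^{(t-1)}_j)$.
The AdaGrad algorithm takes
$\theta^{(t+1)}_j = \theta^{(t)}_j - \frac{\eta\, g_j(\theta^{(t)})}{\sqrt{1 + \sum_{i=1}^{t} g_j(\theta^{(i)})^2}}$.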
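A sketch of the coordinate-wise AdaGrad update. It assumes the standard form in which the denominator accumulates squared gradients per coordinate, with the constant 1 under the square root as on the slide. The toy objective and the value of eta are illustrative.

```python
import math

def adagrad(grad_fn, theta, eta=0.5, steps=500):
    """AdaGrad: each coordinate's step is scaled by 1 / sqrt(1 + sum of its squared gradients)."""
    sq_sum = [0.0] * len(theta)
    for _ in range(steps):
        g = grad_fn(theta)
        sq_sum = [s + gj * gj for s, gj in zip(sq_sum, g)]
        theta = [tj - eta * gj / math.sqrt(1.0 + s)
                 for tj, gj, s in zip(theta, g, sq_sum)]
    return theta

def grad(theta):
    """Gradient of the toy objective (assumed) f = 0.5 * (10 * theta_0^2 + theta_1^2)."""
    return [10.0 * theta[0], theta[1]]

theta = adagrad(grad, [1.0, 1.0])
```

The steeper coordinate accumulates larger squared gradients and therefore receives a smaller effective step size.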
![Page 29: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/29.jpg)
Early Stopping
![Page 30: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/30.jpg)
Residual Network
Skip-Layer Connection
$F_{W,V}(x) = \sigma(V \sigma(W x) + x)$
Ensemble Multiple Neural Networks
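The skip-layer connection can be sketched as a single residual block. This assumes ReLU for the activation and small hand-picked matrices. The names `matvec` and `residual_block` are illustrative.

```python
def relu(z):
    return max(0.0, z)

def matvec(M, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def residual_block(x, W, V, activation=relu):
    """F(x) = activation(V activation(W x) + x); V must map back to the size of x."""
    hidden = [activation(z) for z in matvec(W, x)]
    skip_sum = [v + xi for v, xi in zip(matvec(V, hidden), x)]
    return [activation(s) for s in skip_sum]

x = [1.0, 2.0]
zero = [[0.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
out_zero = residual_block(x, zero, zero)          # the skip path alone carries x through
out_id = residual_block(x, identity, identity)    # learned path adds to the skip path
```

With all weights at zero the block reduces to the activation of x, so the input still reaches the output through the skip path.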
![Page 31: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/31.jpg)
Xavier Initialization
Understanding the difficulty of training deep feedforward neural networks
4.2.2 Gradient Propagation Study
To empirically validate the above theoretical ideas, we have plotted some normalized histograms of activation values, weight gradients and of the back-propagated gradients at initialization with the two different initialization methods. The results displayed (Figures 6, 7 and 8) are from experiments on Shapeset-3 × 2, but qualitatively similar results were obtained with the other datasets.
We monitor the singular values of the Jacobian matrix associated with layer i:
$J^i = \frac{\partial z^{i+1}}{\partial z^i}$ (17)
When consecutive layers have the same dimension, the average singular value corresponds to the average ratio of infinitesimal volumes mapped from $z^i$ to $z^{i+1}$, as well as to the ratio of average activation variance going from $z^i$ to $z^{i+1}$. With our normalized initialization, this ratio is around 0.8 whereas with the standard initialization, it drops down to 0.5.
Figure 6: Activation values normalized histograms with hyperbolic tangent activation, with standard (top) vs normalized initialization (bottom). Top: 0-peak increases for higher layers.
4.3 Back-propagated Gradients During Learning
The dynamic of learning in such networks is complex and we would like to develop better tools to analyze and track it. In particular, we cannot use simple variance calculations in our theoretical analysis because the weights values are not anymore independent of the activation values and the linearity hypothesis is also violated.
As first noted by Bradley (2009), we observe (Figure 7) that at the beginning of training, after the standard initialization (eq. 1), the variance of the back-propagated gradients gets smaller as it is propagated downwards. However we find that this trend is reversed very quickly during learning. Using our normalized initialization we do not see such decreasing back-propagated gradients (bottom of Figure 7).
Figure 7: Back-propagated gradients normalized histograms with hyperbolic tangent activation, with standard (top) vs normalized (bottom) initialization. Top: 0-peak decreases for higher layers.
What was initially really surprising is that even when the back-propagated gradients become smaller (standard initialization), the variance of the weights gradients is roughly constant across layers, as shown on Figure 8. However, this is explained by our theoretical analysis above (eq. 14). Interestingly, as shown in Figure 9, these observations on the weight gradient of standard and normalized initialization change during training (here for a tanh network). Indeed, whereas the gradients have initially roughly the same magnitude, they diverge from each other (with larger gradients in the lower layers) as training progresses, especially with the standard initialization. Note that this might be one of the advantages of the normalized initialization, since having gradients of very different magnitudes at different layers may yield to ill-conditioning and slower training.
Finally, we observe that the softsign networks share similarities with the tanh networks with normalized initialization, as can be seen by comparing the evolution of activations in both cases (resp. Figure 3-bottom and Figure 10).
5 Error Curves and Conclusions
The final consideration that we care for is the success of training with different strategies, and this is best illustrated with error curves showing the evolution of test error as training progresses and asymptotes. Figure 11 shows such curves with online training on Shapeset-3 × 2, while Table 1 gives final test error for all the datasets studied (Shapeset-3 × 2, MNIST, CIFAR-10, and Small-ImageNet). As a baseline, we optimized RBF SVM models on one hundred thousand Shapeset examples and obtained 59.47% test error, while on the same set we obtained 50.47% with a depth five hyperbolic tangent network with normalized initialization.
These results illustrate the effect of the choice of activation and initialization. As a reference we include in Fig-
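Standard Initialization: $W^{(l)} \sim U\left(-\frac{\sqrt{3}}{\sqrt{n_l}}, \frac{\sqrt{3}}{\sqrt{n_l}}\right)$
Xavier Initialization: $W^{(l)} \sim U\left(-\frac{\sqrt{6}}{\sqrt{n_l + n_{l+1}}}, \frac{\sqrt{6}}{\sqrt{n_l + n_{l+1}}}\right)$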
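Both initializations draw weights from a symmetric uniform distribution and differ only in the limit. A sketch using Python's standard library follows. The function names and the layer sizes are assumptions for illustration.

```python
import math
import random

def uniform_init(n_in, n_out, limit, rng):
    """n_out x n_in weight matrix with entries drawn from U(-limit, limit)."""
    return [[rng.uniform(-limit, limit) for _ in range(n_in)] for _ in range(n_out)]

def standard_init(n_in, n_out, rng):
    """Limit sqrt(3) / sqrt(n_l), giving weight variance 1 / n_l."""
    return uniform_init(n_in, n_out, math.sqrt(3.0) / math.sqrt(n_in), rng)

def xavier_init(n_in, n_out, rng):
    """Limit sqrt(6) / sqrt(n_l + n_{l+1}), giving weight variance 2 / (n_l + n_{l+1})."""
    return uniform_init(n_in, n_out, math.sqrt(6.0) / math.sqrt(n_in + n_out), rng)

rng = random.Random(0)
W = xavier_init(n_in=300, n_out=100, rng=rng)   # illustrative layer sizes
limit = math.sqrt(6.0 / 400)
flat = [w for row in W for w in row]
variance = sum(w * w for w in flat) / len(flat)  # should be close to 2 / 400 = 0.005
```

A uniform distribution on (-a, a) has variance a^2 / 3, which is where the two variance values in the docstrings come from.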
![Page 32: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/32.jpg)
Deep vs. Shallow Networks
AlexNet and VGGNet
GoogLeNet
![Page 33: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/33.jpg)
Deep vs. Shallow Networks
Deep networks are very powerful in representation
Deep networks turn out to be easier to optimize
AlexNet: 8 ⇒ GoogLeNet: 22 ⇒ ResNet: 152
Why?
![Page 34: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/34.jpg)
Convolutional Neural Networks
![Page 35: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/35.jpg)
The Architecture of CNNs
5 Convolution Layers
3 Max Pooling
3 Dense Layers
![Page 36: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/36.jpg)
Convolutional Neural Networks
Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 27 Jan 2016
Preview: ConvNet is a sequence of Convolution Layers, interspersed with activation functions
[Diagram: 32x32x3 input → CONV, ReLU (e.g. 6 5x5x3 filters) → 28x28x6 output]
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
![Page 37: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/37.jpg)
Convolutional Neural Networks
Preview: ConvNet is a sequence of Convolutional Layers, interspersed with activation functions
[Diagram: 32x32x3 input → CONV, ReLU (e.g. 6 5x5x3 filters) → 28x28x6 → CONV, ReLU (e.g. 10 5x5x6 filters) → 24x24x10 → CONV, ReLU → ….]
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
![Page 38: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/38.jpg)
Convolution Operation
The convolution operation
![Page 39: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/39.jpg)
Convolution Operation
The convolution operation
![Page 40: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/40.jpg)
Benefits of Convolution
Reason 1: Sparse Connectivity
![Page 41: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/41.jpg)
Benefits of Convolution
Reason 2: Parameter Sharing
![Page 42: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/42.jpg)
Benefits of Convolution
Translational Invariance
![Page 43: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/43.jpg)
Convolution Layer
32x32x3 image: width 32, height 32, depth 3
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
![Page 44: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/44.jpg)
Convolution Layer
32x32x3 image, 5x5x3 filter
Convolve the filter with the image, i.e. “slide over the image spatially, computing dot products”
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
![Page 45: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/45.jpg)
Convolution Layer
32x32x3 image, 5x5x3 filter
1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e. 5*5*3 = 75-dimensional dot product + bias)
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
![Page 46: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/46.jpg)
Convolution Layer
32x32x3 image, 5x5x3 filter
convolve (slide) over all spatial locations
activation map: 28x28x1
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
![Page 47: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/47.jpg)
Convolution Layer
32x32x3 image, 5x5x3 filter
convolve (slide) over all spatial locations
consider a second, green filter
activation maps: 28x28x1 each
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
![Page 48: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/48.jpg)
Convolution Layer
[Diagram: 32x32x3 input → Convolution Layer → activation maps, 28x28x6]
For example, if we had 6 5x5 filters, we’ll get 6 separate activation maps:
We stack these up to get a “new image” of size 28x28x6!
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
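The sliding dot product can be sketched in plain Python with nested lists. This is an unoptimized illustration assuming no padding. The name `conv2d_valid` and the all-ones image and filters are assumptions chosen so the result is easy to check.

```python
def conv2d_valid(image, filt, bias=0.0, stride=1):
    """Slide filt (F x F x C) over image (H x W x C): one dot product plus bias per location."""
    H, W, F = len(image), len(image[0]), len(filt)
    out = []
    for i in range(0, H - F + 1, stride):
        row = []
        for j in range(0, W - F + 1, stride):
            total = bias
            for di in range(F):
                for dj in range(F):
                    for c, weight in enumerate(filt[di][dj]):
                        total += weight * image[i + di][j + dj][c]
            row.append(total)
        out.append(row)
    return out

# Illustrative data: a 32x32x3 image of ones and six 5x5x3 filters of ones.
image = [[[1.0] * 3 for _ in range(32)] for _ in range(32)]
filters = [[[[1.0] * 3 for _ in range(5)] for _ in range(5)] for _ in range(6)]
maps = [conv2d_valid(image, f) for f in filters]  # six activation maps, each 28x28
```

Each output value here is a 75-dimensional dot product, and stacking the six maps gives the 28x28x6 volume.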
![Page 49: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/49.jpg)
Stride Convolution
Stride
![Page 50: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/50.jpg)
Stride Convolution
A closer look at spatial dimensions:
32x32x3 image, 5x5x3 filter
convolve (slide) over all spatial locations
activation map: 28x28x1
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
![Page 51: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/51.jpg)
Stride Convolution
A closer look at spatial dimensions:
7x7 input (spatially), assume 3x3 filter
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
![Page 52: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/52.jpg)
Stride Convolution
A closer look at spatial dimensions:
7x7 input (spatially), assume 3x3 filter
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
![Page 53: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/53.jpg)
Stride Convolution
A closer look at spatial dimensions:
7x7 input (spatially), assume 3x3 filter
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
![Page 54: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/54.jpg)
Stride Convolution
A closer look at spatial dimensions:
7x7 input (spatially), assume 3x3 filter
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
![Page 55: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/55.jpg)
Stride Convolution
A closer look at spatial dimensions:
7x7 input (spatially), assume 3x3 filter
=> 5x5 output
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
![Page 56: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/56.jpg)
Stride Convolution
A closer look at spatial dimensions:
7x7 input (spatially), assume 3x3 filter applied with stride 2
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
![Page 57: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/57.jpg)
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Stride Convolution
7x7 input (spatially) assume 3x3 filter applied with stride 2
7
7
A closer look at spatial dimensions:
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
Tuo Zhao — Lecture 8: Deep Learning 57/73
![Page 58: Lecture 8: Deep Learning - ISyE Hometzhao80/Lectures/Lecture_8.pdf · Lecture 8: Deep Learning Tuo Zhao Schools of ISyE and CSE, Georgia Tech](https://reader034.fdocuments.net/reader034/viewer/2022050107/5f455c2f9217406fb54ea804/html5/thumbnails/58.jpg)
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Stride Convolution
7x7 input (spatially) assume 3x3 filter applied with stride 2 => 3x3 output!
7
7
A closer look at spatial dimensions:
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
Stride Convolution
A closer look at spatial dimensions: 7x7 input (spatially), assume a 3x3 filter applied with stride 3?
Doesn't fit! Cannot apply a 3x3 filter to a 7x7 input with stride 3.
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
Stride Convolution
N x N input, F x F filter
Output size: (N - F) / stride + 1
e.g. N = 7, F = 3:
stride 1 => (7 - 3)/1 + 1 = 5
stride 2 => (7 - 3)/2 + 1 = 3
stride 3 => (7 - 3)/3 + 1 = 2.33 (not an integer, so the filter does not fit)
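The output-size rule above can be checked with a small helper. This is a minimal sketch in plain Python; the function name `conv_output_size` is hypothetical, and raising an error when the filter does not tile the input evenly is a design choice made here, not something the slides prescribe.

```python
def conv_output_size(n: int, f: int, stride: int) -> int:
    """Output side length for an n x n input and an f x f filter, no padding.

    Implements (N - F) / stride + 1 and rejects strides that do not fit.
    """
    if stride <= 0:
        raise ValueError("stride must be positive")
    if f > n:
        raise ValueError("filter is larger than the input")
    span = n - f
    if span % stride != 0:
        # Mirrors the "doesn't fit" case: (N - F) is not a multiple of the stride.
        raise ValueError(f"({n} - {f}) / {stride} is not an integer")
    return span // stride + 1


for s in (1, 2, 3):
    try:
        print(f"stride {s}: output {conv_output_size(7, 3, s)}")
    except ValueError as err:
        print(f"stride {s}: does not fit ({err})")
```

Strides 1 and 2 give 5 and 3, matching the worked numbers, while stride 3 takes the error branch.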
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
Zero-Padding
Zero-Padding: common to zero-pad the border
e.g. input 7x7, 3x3 filter applied with stride 1, pad with 1 pixel border => what is the output?
(recall:) (N - F) / stride + 1
With a 1 pixel border the padded input is 9x9, so (9 - 3)/1 + 1 = 7 => 7x7 output!
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
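The zero-padding example can be reproduced directly. The sketch below uses plain Python lists; the helper names `zero_pad` and `correlate2d_valid` are hypothetical, and the filter is applied without flipping (cross-correlation), which is the usual convention in CNN libraries but is an assumption here.

```python
def zero_pad(x, p):
    """Surround a 2D list with a border of p zeros on every side."""
    width = len(x[0]) + 2 * p
    border = [[0.0] * width for _ in range(p)]
    middle = [[0.0] * p + list(row) + [0.0] * p for row in x]
    return border + middle + [[0.0] * width for _ in range(p)]


def correlate2d_valid(x, w, stride=1):
    """Slide an f x f filter w over a square input x (no extra padding)."""
    n, f = len(x), len(w)
    out = (n - f) // stride + 1
    return [
        [
            sum(x[i * stride + a][j * stride + b] * w[a][b]
                for a in range(f) for b in range(f))
            for j in range(out)
        ]
        for i in range(out)
    ]


x = [[float(7 * r + c) for c in range(7)] for r in range(7)]  # 7x7 input
w = [[1.0] * 3 for _ in range(3)]                             # 3x3 sum filter

padded = zero_pad(x, 1)                       # 9x9 after a 1 pixel border
y = correlate2d_valid(padded, w, stride=1)
print(len(padded), len(padded[0]))            # 9 9
print(len(y), len(y[0]))                      # 7 7
```

Padding by 1 with a 3x3 filter at stride 1 preserves the 7x7 spatial size, as the slide states.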
Tiled Convolution
Local connectivity
Locally connected layer
Convolutional layer
Fully connected layer
Tiled Convolution
Tiled convolution
Locally connected layer
Tiled convolution
Convolutional layer
Pooling
Effect = invariance to small translations of the input
Pooling:
- makes the representations smaller and more manageable
- operates over each activation map independently
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
Pooling
Max Pooling
Single depth slice (axes x, y):
1 1 2 4
5 6 7 8
3 2 1 0
1 2 3 4
max pool with 2x2 filters and stride 2 =>
6 8
3 4
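The max pooling example can be verified with a few lines of plain Python. This is only a sketch: `max_pool2d` is a hypothetical helper that handles a single depth slice, in line with pooling operating on each activation map independently.

```python
def max_pool2d(x, size=2, stride=2):
    """Max pooling over one depth slice given as a 2D list."""
    out_rows = (len(x) - size) // stride + 1
    out_cols = (len(x[0]) - size) // stride + 1
    return [
        [
            max(x[i * stride + a][j * stride + b]
                for a in range(size) for b in range(size))
            for j in range(out_cols)
        ]
        for i in range(out_rows)
    ]


depth_slice = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
pooled = max_pool2d(depth_slice, size=2, stride=2)
print(pooled)  # [[6, 8], [3, 4]]
```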
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
Case Study: AlexNet [Krizhevsky et al. 2012]
Full (simplified) AlexNet architecture:
[227x227x3] INPUT
[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0
[27x27x96] MAX POOL1: 3x3 filters at stride 2
[27x27x96] NORM1: Normalization layer
[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2
[13x13x256] MAX POOL2: 3x3 filters at stride 2
[13x13x256] NORM2: Normalization layer
[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1
[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1
[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1
[6x6x256] MAX POOL3: 3x3 filters at stride 2
[4096] FC6: 4096 neurons
[4096] FC7: 4096 neurons
[1000] FC8: 1000 neurons (class scores)
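The spatial sizes in the architecture listing follow from the output-size rule extended with padding, (N + 2P - F) / stride + 1. The sketch below traces them through the conv and pool layers. The names `out_size`, `LAYERS`, and `trace_shapes` are hypothetical, the pooling layers are assumed to use pad 0 (the listing does not state it), and the NORM layers are skipped because they leave the shape unchanged.

```python
def out_size(n, f, stride, pad=0):
    """Output side length: (N + 2P - F) / stride + 1."""
    span = n + 2 * pad - f
    if span % stride != 0:
        raise ValueError("filter does not fit evenly")
    return span // stride + 1


# (name, filter size, stride, pad, number of filters or None for pooling)
LAYERS = [
    ("CONV1", 11, 4, 0, 96),
    ("MAX POOL1", 3, 2, 0, None),
    ("CONV2", 5, 1, 2, 256),
    ("MAX POOL2", 3, 2, 0, None),
    ("CONV3", 3, 1, 1, 384),
    ("CONV4", 3, 1, 1, 384),
    ("CONV5", 3, 1, 1, 256),
    ("MAX POOL3", 3, 2, 0, None),
]


def trace_shapes(size=227, depth=3):
    """Return (name, side, depth) after each layer, starting from the input."""
    shapes = [("INPUT", size, depth)]
    for name, f, stride, pad, filters in LAYERS:
        size = out_size(size, f, stride, pad)
        if filters is not None:
            depth = filters  # a conv layer outputs one map per filter
        shapes.append((name, size, depth))
    return shapes


for name, side, depth in trace_shapes():
    print(f"[{side}x{side}x{depth}] {name}")
```

The final 6x6x256 volume has 9216 values, which is what FC6 takes as input.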
Details/Retrospectives:
- first use of ReLU
- used Norm layers (not common anymore)
- heavy data augmentation
- dropout 0.5
- batch size 128
- SGD Momentum 0.9
- Learning rate 1e-2, reduced by 10 manually when val accuracy plateaus
- L2 weight decay 5e-4
- 7 CNN ensemble: 18.2% -> 15.4%
slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson
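To make the optimizer settings concrete, here is one common way to combine SGD, momentum, and L2 weight decay in a single update, using the hyperparameters listed. The exact update form and the helper names `sgd_momentum_step` and `reduce_lr` are assumptions for illustration; the slide gives only the values.

```python
def sgd_momentum_step(w, v, grad, lr=1e-2, momentum=0.9, weight_decay=5e-4):
    """One SGD step with momentum; L2 weight decay is added to the gradient."""
    new_v = [
        momentum * vi - lr * (gi + weight_decay * wi)
        for wi, vi, gi in zip(w, v, grad)
    ]
    new_w = [wi + vi for wi, vi in zip(w, new_v)]
    return new_w, new_v


def reduce_lr(lr, factor=10.0):
    """Divide the learning rate by 10, applied when val accuracy plateaus."""
    return lr / factor


w, v = [1.0, -2.0], [0.0, 0.0]   # toy weights and velocity
grad = [0.5, 0.25]               # toy gradient, not from a real network
w, v = sgd_momentum_step(w, v, grad)
lr = reduce_lr(1e-2)
print(w, v, lr)
```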
The End
Congratulations!