Training Convolutional Neural Networks
R. Q. FEITOSA
Overview
1. Activation Functions
2. Data Preprocessing
3. Weight Initialization
4. Batch Normalization
Activation Functions
Activation Layer
[Figure: a single neuron. The inputs x_0 … x_n are weighted by w_0 … w_n, summed together with a bias b, and passed through the activation function f, producing f(Σ_i w_i x_i + b).]
Activation Functions
Sigmoid: σ(x) = 1 / (1 + e^(−x))
tanh: tanh(x)
ReLU: max(0, x)
Leaky ReLU: max(0.01x, x)
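For reference, here is a minimal NumPy sketch of these four activations (the function names are mine, not from the slides):

```python
import numpy as np

def sigmoid(x):
    # squashes inputs to the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # squashes inputs to (-1, 1) and is zero-centered
    return np.tanh(x)

def relu(x):
    # max(0, x): identity for positive inputs, zero otherwise
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # max(alpha * x, x): small negative slope instead of a hard zero
    return np.maximum(alpha * x, x)
```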
Sigmoid
σ(x) = 1 / (1 + e^(−x))
• Squashes numbers to the range [0, 1].
• Historically popular, since it has a nice interpretation as the saturating “firing rate” of a neuron.
Three shortcomings:
1) Saturated neurons “kill” the gradients.
2) Sigmoid outputs are not zero-centered.
3) exp() is somewhat expensive to compute.
Sigmoid kills gradient
[Figure: backpropagation through a sigmoid gate. With σ(x) = 1 / (1 + e^(−x)), the chain rule gives ∂L/∂x = (∂σ/∂x) · (∂L/∂σ): the local gradient ∂σ/∂x multiplies the upstream gradient ∂L/∂σ. Where the sigmoid saturates (large |x|), the local gradient is ≈ 0, so the gradient is “killed” at both tails.]
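A quick numerical check of this effect (a sketch, using the standard identity dσ/dx = σ(x)(1 − σ(x))):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_local_grad(x):
    # local gradient of the sigmoid gate: dσ/dx = σ(x) * (1 - σ(x))
    s = sigmoid(x)
    return s * (1.0 - s)

# near the tails the local gradient is ~0, so the upstream gradient gets "killed"
for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(x, sigmoid_local_grad(x))   # ~4.5e-05, 0.105, 0.25, 0.105, ~4.5e-05
```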
Sigmoid output not zero-centered
What happens when the input to a neuron (x) is always positive?
Since the neuron computes f(Σ_i w_i x_i + b), the gradients on w are then all positive or all negative.
As a result, the possible gradient update directions are restricted, and the weights can only approach a hypothetical optimal w vector through zig-zag updates.
Therefore, zero-mean data is highly desirable!
[Figure: in the (w1, w2) plane, the allowed gradient-update directions cover only two quadrants, forcing zig-zag updates toward the optimal w vector.]
Tanh
f(x) = tanh(x)
• Squashes numbers to the range [-1, 1].
• Zero-centered.
Shortcoming:
Still “kills” the gradients when saturated (at both tails).
Rectified Linear Unit
f(x) = max(0, x)
• Does not saturate in the positive region.
• Very computationally efficient.
• Converges much faster than sigmoid/tanh in practice (roughly 6×).
Shortcoming:
not zero-centered output
ReLU kills gradient for negative input
[Figure: backpropagation through a ReLU gate. With z = ReLU(x) = max(0, x), the chain rule gives ∂L/∂x = (∂z/∂x) · (∂L/∂z). The local gradient ∂z/∂x is 1 for x > 0 and 0 for x < 0, so for negative inputs the upstream gradient ∂L/∂z is “killed”.]
Dead ReLU
[Figure: a data cloud in the (x1, x2) plane, split by the hyperplane Wx + d = 0.]
With y = Wx + d and z = max(0, Wx + d), the chain rule gives ∂L/∂y = (∂z/∂y) · (∂L/∂z), where the local gradient is
∂z/∂y = 1 if Wx + d > 0, and 0 otherwise.
Only data on the positive side of the hyperplane produces a non-zero derivative. If the whole data cloud lies on the negative side, the unit’s gradient is zero for every sample and its weights are never updated: the ReLU is “dead”.
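A small sketch of this situation (the data cloud, weights, and bias below are hypothetical, chosen so that the pre-activation is negative for every sample):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=1.0, scale=0.5, size=(100, 2))  # data cloud around (1, 1)
W = np.array([[-1.0, -1.0]])                       # weights of a single ReLU unit
d = -5.0                                           # bias such that Wx + d < 0 for all samples

y = X @ W.T + d                # pre-activations, shape (100, 1): all negative
z = np.maximum(0.0, y)         # ReLU outputs: all zero

# local gradient dz/dy is 1 where y > 0 and 0 elsewhere, so (taking the
# upstream gradient as 1) the gradient reaching W is zero for every sample
dz_dy = (y > 0).astype(float)
grad_W = dz_dy.T @ X
print(z.max(), grad_W)         # 0.0 [[0. 0.]]
```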
Leaky ReLU
f(x) = max(0.01x, x)
• Never saturates.
• Computationally efficient.
• Converges much faster than sigmoid/tanh in practice (roughly 6×).
• Will not “die”.
Shortcoming:
not zero-centered output

Parametric Rectifier (PReLU):
f(x) = max(αx, x), with the slope α learned during training.
Data Preprocessing
Preprocessing Operations
In practice, for images:
• No normalization (pixel values are already in the same range).
• Subtract the mean image (e.g. AlexNet).
• Subtract the per-channel mean (e.g. VGGNet).
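A sketch of the two mean-subtraction variants (array shapes and the random data are assumptions for illustration):

```python
import numpy as np

# a (hypothetical) training set of N images, H x W pixels, 3 channels
X_train = np.random.rand(100, 32, 32, 3).astype(np.float32) * 255.0

# subtract the mean image (e.g. AlexNet): one H x W x 3 mean
mean_image = X_train.mean(axis=0)                  # shape (32, 32, 3)
X_centered = X_train - mean_image

# subtract the per-channel mean (e.g. VGGNet): one scalar per channel
mean_per_channel = X_train.mean(axis=(0, 1, 2))    # shape (3,)
X_centered_pc = X_train - mean_per_channel
```

In either case the mean is computed on the training set, and the same values are subtracted from validation and test images.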
Weight Initialization
Small Initial Weights
With small initial weights, as the data flows forward the input (x) at each layer tends to concentrate around zero.
Recall that for a linear layer Wx the local gradient is ∂(Wx)/∂W = x.
As the gradient backpropagates, it therefore tends to vanish: no update.
[Figure: histograms of the input data at each convolutional layer, collapsing toward zero in the deeper layers.]
Large Initial Weights
Assume a tanh or a sigmoid activation function.
With large initial weights, the inputs to the activation functions lie in the saturated regions, where the gradients are small.
As the gradients propagate back, they tend to vanish: no update.
Optimal Weight Initialization
Still an active research topic:
• Understanding the difficulty of training deep feedforward neural networks by Glorot and Bengio, 2010
• Exact solutions to the nonlinear dynamics of learning in deep linear neural networks by Saxe et al., 2013
• Random walk initialization for training very deep feedforward networks by Sussillo and Abbott, 2014
• Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification by He et al., 2015
• Data-dependent Initializations of Convolutional Neural Networks by Krähenbühl et al., 2015
• All you need is a good init by Mishkin and Matas, 2015
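As an illustration of why the scale of the initial weights matters, here is a sketch (the layer width, depth, and tanh nonlinearity are assumptions) comparing small, large, and Xavier-style initialization through the spread of the activations at each layer:

```python
import numpy as np

def forward_stats(init_fn, n_layers=6, dim=512, n_samples=1000):
    # push zero-mean, unit-variance data through a stack of tanh layers and
    # report the standard deviation of the activations after each layer
    rng = np.random.default_rng(0)
    x = rng.standard_normal((n_samples, dim))
    stds = []
    for _ in range(n_layers):
        W = init_fn(rng, dim, dim)
        x = np.tanh(x @ W)
        stds.append(float(x.std()))
    return stds

small  = lambda rng, fan_in, fan_out: rng.standard_normal((fan_in, fan_out)) * 0.01
large  = lambda rng, fan_in, fan_out: rng.standard_normal((fan_in, fan_out)) * 1.0
xavier = lambda rng, fan_in, fan_out: rng.standard_normal((fan_in, fan_out)) / np.sqrt(fan_in)

print(forward_stats(small))   # shrinks toward 0: activations (and gradients) vanish
print(forward_stats(large))   # pushed to ±1: tanh saturates, gradients vanish
print(forward_stats(xavier))  # stays in a healthy range across the layers
```

For ReLU networks, He et al. (2015) scale by √(2/fan_in) instead.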
Batch Normalization
Making the Activations Gaussian
Take a batch of activations at some layer. To make each dimension k look like a unit Gaussian, apply

x̂(k) = (x(k) − E[x(k)]) / √Var[x(k)]

Notice that this function has a simple derivative.
Making the Activations Gaussian
In practice, we take the empirical mean and variance. For x over a mini-batch ℬ = {x_1…m}:

μ_ℬ = (1/m) Σ_{i=1}^{m} x_i
σ²_ℬ = (1/m) Σ_{i=1}^{m} (x_i − μ_ℬ)²
x̂_i = (x_i − μ_ℬ) / √(σ²_ℬ + ε)

where ε is a small constant added for numerical stability, and μ_ℬ and σ²_ℬ are computed for each dimension independently.
[Figure: the mini-batch as a matrix with samples x_1, x_2, …, x_m as rows and features as columns.]
Recovering the Identity Mapping
Normalize:

x̂(k) = (x(k) − E[x(k)]) / √Var[x(k)]

and then allow the network to undo it, if “wished”, through a scale and shift with learnable parameters γ(k) and β(k):

y(k) = γ(k) x̂(k) + β(k)

Notice that the network can learn
γ(k) = √Var[x(k)], β(k) = E[x(k)]
to recover the identity mapping.
Obs.: recall that k refers to a dimension.
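A tiny numerical check of this point (a sketch; γ and β are set by hand rather than learned):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(128, 4))   # mini-batch with 4 dimensions

mean = x.mean(axis=0)
var = x.var(axis=0)
x_hat = (x - mean) / np.sqrt(var + 1e-5)            # normalize each dimension k

gamma = np.sqrt(var)                                # γ(k) = sqrt(Var[x(k)])
beta = mean                                         # β(k) = E[x(k)]
y = gamma * x_hat + beta                            # scale and shift

print(np.abs(y - x).max())                          # ~0: identity recovered (up to ε)
```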
ConvNet Layer with BN
Batch normalization is inserted prior to the nonlinearity:
convolution → batch normalization → activation function → pooling
where the BN block applies y(k) = γ(k) x̂(k) + β(k).
Batch Normalizing Transform
Input: values of x over a mini-batch ℬ = {x_1…m}; parameters to be learned: γ, β
Output: {y_i = BN_γ,β(x_i)}

μ_ℬ = (1/m) Σ_{i=1}^{m} x_i              // mini-batch mean
σ²_ℬ = (1/m) Σ_{i=1}^{m} (x_i − μ_ℬ)²    // mini-batch variance
x̂_i = (x_i − μ_ℬ) / √(σ²_ℬ + ε)          // normalize
y_i = γ x̂_i + β ≡ BN_γ,β(x_i)            // scale and shift

• Improves gradient flow through the network
• Allows higher learning rates
• Reduces the dependence on the initialization

Ioffe, S. and Szegedy, C., Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015, arXiv. Available online: https://arxiv.org/abs/1502.03167
At test time, BN functions differently: the mean/std are not computed from the batch. Instead, fixed empirical values computed during training are used.
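A sketch of a complete BN layer in this spirit (forward pass only, in NumPy; the momentum value and the use of running averages as the fixed test-time statistics are common practice rather than something taken from the slide):

```python
import numpy as np

class BatchNorm1D:
    def __init__(self, num_features, eps=1e-5, momentum=0.9):
        self.gamma = np.ones(num_features)        # learnable scale γ
        self.beta = np.zeros(num_features)        # learnable shift β
        self.eps = eps
        self.momentum = momentum
        # fixed empirical statistics accumulated during training, used at test time
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)

    def forward(self, x, training=True):
        if training:
            mean = x.mean(axis=0)                 # mini-batch mean (per dimension)
            var = x.var(axis=0)                   # mini-batch variance (per dimension)
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mean
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        else:
            mean, var = self.running_mean, self.running_var
        x_hat = (x - mean) / np.sqrt(var + self.eps)   # normalize
        return self.gamma * x_hat + self.beta          # scale and shift: BN_γ,β(x)

# usage: batch statistics during training, fixed statistics at test time
bn = BatchNorm1D(4)
x = np.random.randn(32, 4) * 2.0 + 5.0
y_train = bn.forward(x, training=True)
y_test = bn.forward(x, training=False)
```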
Summary
• Activation Functions (use ReLU or Leaky ReLU)
• Data Preprocessing (subtract the mean)
• Weight Initialization (use Xavier init)
• Batch Normalization (use it)
Training Convolutional Neural Networks
Thank you!
R. Q. FEITOSA