Convolutional Neural Networks Summary
S. Papadopoulos, P. Kaplanoglou, Prof. Ioannis Pitas
Aristotle University of Thessaloniki
www.aiia.csd.auth.gr
Version 2.6
Convolutional Neural Networks Summary
Deep Neural Networks
• Deep Neural Networks (DNNs) have a number of layers (depth) L ≥ 3.
▪ There are multiple hidden layers, in contrast to the MLP reference model.
Deep Neural Network with L = 4.
• Color N1 × N2 pixel images are represented by vectors of huge dimensionality (N1 × N2 × 3) that cannot be processed by fully connected Multilayer Perceptrons.
▪ Problem: increased computational complexity.
▪ Solution: Convolutional Neural Networks (CNNs), which employ sparse connectivity and weight replication.
Deep Neural Networks
Input image tensor: width W, height H, C = 3 channels.
Convolutional Neural Networks
• Foundations: Rumelhart, Hinton and Williams (1985) described a solution to the T-C problem.
• First CNN: LeCun et al. (1989) proposed the convolutional layer for character recognition; LeCun, Bottou, Bengio and Haffner (1998) consolidated the design as LeNet-5.
• Success: Krizhevsky, Sutskever and Hinton (2012) significantly outperformed the state of the art for image classification in the ImageNet competition.
Convolutional Neural Networks
ImageNet competition:
• The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is an annual competition started in 2010.
• Large scale image datasets are used, with over a million images of natural scenes and objects.
• Since 2012, research in the CNN area has produced significant advances, some through the ILSVRC competition.
▪ The CNN that won the classification task (CLS) in 2015 surpassed human classification accuracy on the same test set.
Convolutional Neural Networks
Notations:
• The standard CNN input is an N1 × N2 × C multi-channel (color) image, represented by a 3D tensor X.
• A multichannel image is a tensor X, or a 3D signal x(i, j, c): i = 1,…,N1, j = 1,…,N2, c = 1,…,C.
• The number of channels C is sometimes called depth.
• A multichannel image block of dimensions M1 × M2, with M1 = 2k1 + 1 and M2 = 2k2 + 1, centered at (i, j), is denoted by the 3D tensor X_ij.
• The 2D matrix X_ij(c) is a slice of X_ij that contains the values of channel c (e.g. the intensity of red/green/blue) for the M1 × M2 image block centered at (i, j).
Matrix of feature (channel) c for the image block centered at (i, j):
X_ij(c) = [x(i1, j1, c): i1 = i − k1,…,i + k1, j1 = j − k2,…,j + k2].
Convolutional Module 1
• Convolutional Layer 1
• Max Pooling 1
• Local Response Normalization 1
Convolutional Module 2
• Convolutional Layer 2
• Max Pooling 2
• Local Response Normalization 2
Convolutional Module 3
• Convolutional Layer 3
• Local Response Normalization 3
Convolutional Module 4
• Convolutional Layer 4
• Local Response Normalization 4
Convolutional Module 5
• Convolutional Layer 5
• Max Pooling 5
• Local Response Normalization 5
Fully Connected Module 1
• FC Layer 6
• Dropout
Fully Connected Module 2
• FC Layer 7
• Dropout
Softmax Output Module
• FC Layer 8
• Softmax Function
• There are several convolutional layers.
• A classifier consists of 3 fully connected layers.
• Class predictions are given by softmax functions.
Reference CNN Classifier: The AlexNet CNN
Convolutional Layer
Synaptic summation:
• Classic neurons compute a weighted sum on (1D) vectors of feature values:
Perceptron Neuron Summation
z = w0 + Σ_{i=1}^{n} w_i x_i = w0 + wᵀx.
• Convolutional neurons implement a convolution operation between a 2D single-feature image x and a 2D convolution kernel (coefficient matrix) W, plus a scalar bias b:
• Using an odd convolution window with M1 = 2k1 + 1 and M2 = 2k2 + 1, the filter output is mapped to the filter window center at location (i, j).
Convolutional Layer
Convolutional Neuron Summation
z(i, j) = b + Σ_{m1=1}^{M1} Σ_{m2=1}^{M2} w(m1, m2) x(i − m1, j − m2).
Convolutional Neuron Summation - Centered on Input:
u(i, j) = b + Σ_{m1=−k1}^{k1} Σ_{m2=−k2}^{k2} w(m1, m2) x(i − m1, j − m2),
for a convolutional filter of size M1 × M2, where M1 = 2k1 + 1 and M2 = 2k2 + 1.
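The centered summation above can be sketched directly in NumPy (an illustrative example; the function name conv2d_centered is ours, not from the slides):

```python
import numpy as np

def conv2d_centered(x, w, b=0.0):
    """2D convolution of image x with an odd-sized kernel w, centered output.

    Implements u(i, j) = b + sum_{m1, m2} w(m1, m2) * x(i - m1, j - m2),
    evaluated only where the whole window fits inside the image.
    """
    k1, k2 = w.shape[0] // 2, w.shape[1] // 2
    N1, N2 = x.shape
    u = np.zeros((N1 - 2 * k1, N2 - 2 * k2))
    for i in range(k1, N1 - k1):
        for j in range(k2, N2 - k2):
            s = b
            for m1 in range(-k1, k1 + 1):
                for m2 in range(-k2, k2 + 1):
                    # kernel index shifted to [0, M-1]; note the flipped image index
                    s += w[m1 + k1, m2 + k2] * x[i - m1, j - m2]
            u[i - k1, j - k2] = s
    return u
```

Note the index x[i − m1, j − m2]: the kernel is flipped relative to cross-correlation, exactly as in the formula above.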
Single feature extracted from grayscale images:
• For a convolutional layer l with an activation function f_l(·), a single input feature (e.g., grayscale image intensities) and a single output feature:
▪ The activation map A_l(m), for m = 1, is a table of the output feature values for all N_l × N_l regions.
Convolutional Layer Activation Map (2D matrix) from a single input feature:
a_ij,l(1) = f_l( b_l(1) + W_l(1,1) ∗∗ X_ij(1) ).
W_l(1,1): 2D filter for the 1st output feature; X_ij(1): grayscale intensities of the local block at (i, j).
A_l(1) = [a_ij,l(1): i = 1,…,N_l, j = 1,…,N_l].
The activation map for the 1st output feature at layer l is the matrix A_l(1), which contains a_ij,l(1) across all regions.
Convolutional Layer
Output feature extracted from the image block centered at (i, j):
y_l(i, j) = f_l( b_l + Σ_{m1=1}^{M1,l} Σ_{m2=1}^{M2,l} w_l(m1, m2) x_l(i − m1, j − m2) ).
Convolutional Layer
Basic Activation Functions:
• The sigmoid and hyperbolic tangent functions are not well suited for CNNs, because they lead to the vanishing gradients problem.
• Rectifiers are more suitable activation functions.
▪ ReLU - Rectified Linear Unit:
y = relu(u) = max(u, 0) : ℝ → [0, +∞)
▪ ReLU6 - Rectified Linear Unit bounded by 6:
y = relu6(u) = min(max(u, 0), 6) : ℝ → [0, 6]
▪ Softplus:
y = softplus(u) = log(1 + e^u) : ℝ → (0, +∞)
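These rectifiers are one-liners in NumPy; a minimal sketch (function names ours, matching the definitions above):

```python
import numpy as np

def relu(u):
    # ReLU: max(u, 0), maps R -> [0, +inf)
    return np.maximum(u, 0.0)

def relu6(u):
    # ReLU6: min(max(u, 0), 6), maps R -> [0, 6]
    return np.minimum(np.maximum(u, 0.0), 6.0)

def softplus(u):
    # Softplus: log(1 + e^u), a smooth approximation of ReLU
    return np.log1p(np.exp(u))
```

All three accept scalars or whole activation arrays, since they rely on NumPy's elementwise operations.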
Neural Features for Image Representation:
• Visualizing the n_out = 96 features learned from RGB pixels in the 1st layer of the Zeiler & Fergus CNN (ZFNet) shows characteristics of biological vision:
▪ orientation selectivity, found in V1 simple cells;
▪ green/red color opponency, observed in retinal neurons and human visual perception;
▪ blue/yellow color opponency, observed in retinal neurons and human visual perception.
• ZFNet is an improvement of AlexNet with 1st layer: [7 × 7 / 2 | 3 → 96].
• Feature visualization indicated poorly trained convolutional kernels in AlexNet.
Convolutional Layer
Pooling layers are added inside a CNN architecture primarily for downsampling, aiming to reduce the computational cost. Secondarily, they help with translation invariance.
▪ The pooling window is moved over an activation map A_l along (i, j) with stride s.
▪ Typical pooling window sizes: 2 × 2, 3 × 3.
▪ Downsampling is usually done with s = 2.
▪ Pools can overlap, e.g., [3 × 3 / 2].
• Whether to use pooling is an ad-hoc decision.
• There is no formal justification for the effect of overlapping pooling regions.
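A minimal sketch of max pooling with a square window and stride (illustrative code; the function name is ours):

```python
import numpy as np

def max_pool2d(a, pool=2, stride=2):
    """Max pooling over a 2D activation map with a pool x pool window."""
    H, W = a.shape
    out_h = (H - pool) // stride + 1
    out_w = (W - pool) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # take the maximum inside each window
            out[i, j] = a[i * stride:i * stride + pool,
                          j * stride:j * stride + pool].max()
    return out
```

With pool=3 and stride=2 the windows overlap, as in the [3 × 3 / 2] example above.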
Convolutional Layer
Simple Convolutional Module:
• The simplest convolutional module has one convolutional layer with stride s = 1 and one max pooling layer with stride s_pool = 2 for downsampling.
▪ The input has N × N spatial regions and n_in features.
Module activation tensors: N × N × n_in input → conv → N × N × n_out → max pool → (N/2) × (N/2) × n_out output.
Convolutional Layer
Advanced multilayer CNN architecture.
Convolutional Neural Networks
Classification/Recognition/Identification
• Training: Given N pairs of training samples D = {(x_i, y_i), i = 1,…,N}, where x_i ∈ ℝ^n and y_i ∈ {0,1}^m, estimate θ by minimizing a loss function: min_θ J(y, ŷ).
• Inference/testing: Given N_t pairs of testing examples D_t = {(x_i, y_i), i = 1,…,N_t}, where x_i ∈ ℝ^n and y_i ∈ {0,1}^m, compute (predict) ŷ_i and calculate a performance metric, e.g., classification accuracy.
Regression
Given a sample x ∈ ℝ^n and a function y = f(x), the model predicts real-valued quantities for that sample: ŷ = f(x; θ), where ŷ ∈ ℝ^m and θ are the learnable parameters of the model.
• Training: Given N pairs of training examples D = {(x_i, y_i), i = 1,…,N}, where x_i ∈ ℝ^n and y_i ∈ ℝ^m, estimate θ by minimizing a loss function: min_θ J(y, ŷ).
• Testing: Given N_t pairs of testing examples D_t = {(x_i, y_i), i = 1,…,N_t}, where x_i ∈ ℝ^n and y_i ∈ ℝ^m, compute (predict) ŷ_i and calculate a performance metric, e.g., MSE.
CNN Training
• Mean Square Error (MSE):
J(θ) = J(y, ŷ) = (1/N) Σ_{i=1}^{N} ||ŷ_i − y_i||².
• It is suitable for regression and classification.
• Categorical Cross Entropy Error:
J_CCE = −Σ_{i=1}^{N} Σ_{j=1}^{m} y_ij log ŷ_ij.
• It is suitable for classifiers that use softmax output layers.
• Exponential Loss:
J(θ) = Σ_{i=1}^{N} e^(−β y_iᵀ ŷ_i).
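The MSE and categorical cross entropy losses above can be sketched as follows (illustrative NumPy code, names ours; predictions and targets hold one sample per row):

```python
import numpy as np

def mse_loss(y_pred, y_true):
    # Mean square error: (1/N) * sum_i ||y_pred_i - y_true_i||^2
    return np.mean(np.sum((y_pred - y_true) ** 2, axis=1))

def cce_loss(y_pred, y_true, eps=1e-12):
    # Categorical cross entropy: -sum_i sum_j y_ij * log(y_pred_ij)
    # eps guards log(0); y_pred is expected to hold softmax probabilities.
    return -np.sum(y_true * np.log(y_pred + eps))
```

For one-hot targets, the CCE sum reduces to the negative log-probability assigned to each correct class.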
Backpropagation
• The most widespread algorithm for supervised training of CNNs is the Backpropagation algorithm.
▪ Unlike in typical MLPs, the summation that occurs inside each convolutional neuron during the forward pass uses convolution instead of plain multiplication.
▪ Training is performed as a minimization problem on a cost function.
▪ MSE is the most common cost function.
• A reminder of the forward pass for a convolutional neuron with a single input channel:
Convolutional Neuron Summation
u(i, j) = b + Σ_{m1=1}^{M1} Σ_{m2=1}^{M2} w(m1, m2) x(i − m1, j − m2) = b + W ∗∗ X(i, j).
CNN Training
• CNNs are trained with the same gradient descent methods as multilayer perceptrons.
▪ Convolution is a differentiable operation.
▪ Mini-batch Stochastic Gradient Descent methods are used on image batches.
• Optimization methods:
▪ Learning rate decay: scheduled changes to the learning rate at various training epochs, used with non-adaptive mini-batch SGD methods, e.g. Momentum.
▪ ADAM is an optimization method with an adaptive learning rate.
• Large scale datasets are needed to adequately train a CNN.
▪ CNNs are prone to over-fitting.
▪ Training image counts are in the magnitude of tens or hundreds of thousands.
Data Preprocessing
• Apply a whitening (projection) operator P on the centered data matrix X̄:
X_W = P X̄.
• PCA Whitening: P = Λ^(−1/2) Uᵀ.
• ZCA Whitening: P = U Λ^(−1/2) Uᵀ.
▪ U and Λ are the eigenvector and eigenvalue matrices of the data covariance.
• ZCA is preferred for natural images.
• Preprocessing helps illumination invariance and convergence on certain visual feature distributions, e.g. the natural images of ILSVRC2012.
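A minimal ZCA whitening sketch following the formulas above (illustrative; rows of X are samples, and a small eps is added to the eigenvalues for numerical stability, an implementation detail not in the slides):

```python
import numpy as np

def zca_whitening(X, eps=1e-5):
    """ZCA whitening of a data matrix X (rows = samples, columns = features)."""
    Xc = X - X.mean(axis=0)            # center the data
    cov = Xc.T @ Xc / Xc.shape[0]      # sample covariance matrix
    lam, U = np.linalg.eigh(cov)       # cov = U diag(lam) U^T
    P = U @ np.diag(1.0 / np.sqrt(lam + eps)) @ U.T   # ZCA operator U L^(-1/2) U^T
    return Xc @ P.T                    # whitened data X_W = P X̄ (row convention)
```

After whitening, the sample covariance of the output is (approximately) the identity matrix; ZCA keeps the result maximally close to the original data, which is why it is preferred for natural images.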
CNN Training
Weight initialization:
• Completely random initialization of the weights has experimentally proven ineffective for training deep CNNs.
• Instead, values are chosen from a uniform distribution restricted to an interval [−a, a] whose half-width relates to the layer input/output feature depths n_in and n_out.
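The exact interval depends on the layer shape; one widely used instance of such a depth-dependent uniform scheme is the Glorot (Xavier) initializer, sketched here under that assumption (the slides' own formula may differ):

```python
import numpy as np

def glorot_uniform(d_in, d_out, rng=None):
    """Glorot/Xavier uniform initialization: a common depth-dependent interval."""
    if rng is None:
        rng = np.random.default_rng()
    a = np.sqrt(6.0 / (d_in + d_out))   # interval half-width from fan-in + fan-out
    return rng.uniform(-a, a, size=(d_in, d_out))
```

The half-width shrinks as the layer gets wider, keeping activation and gradient variances roughly constant across layers.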
CNN Training
Data Augmentation:
• It is used to avoid overfitting.
• The training image set is augmented during training with label-preserving transformations of the samples:
▪ Image translations and random image crops.
▪ Photometric distortions, i.e. altering the intensities of the RGB channels.
▪ Scaling and rotation, e.g. by ≤ 90°.
▪ Reflections, e.g. mirroring.
▪ Addition of salt-and-pepper noise.
• Data augmentation can be done with minimal computational cost inside the training process.
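Two of the listed transformations, random crops and mirror reflections, can be sketched as follows (illustrative; the function name and crop size are ours):

```python
import numpy as np

def random_augment(img, crop=24, rng=None):
    """Label-preserving augmentation sketch: random crop + random mirror."""
    if rng is None:
        rng = np.random.default_rng()
    H, W = img.shape[:2]
    top = rng.integers(0, H - crop + 1)    # random crop position
    left = rng.integers(0, W - crop + 1)
    patch = img[top:top + crop, left:left + crop]
    if rng.random() < 0.5:                 # mirror reflection half of the time
        patch = patch[:, ::-1]
    return patch
```

Because both operations are plain array slicing, the augmentation cost per image is negligible compared to the forward/backward pass.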
CNN Training
Normalization Layers:
• They level the intermediate activations to a desired range.
• They prevent the exploding gradients problem during gradient descent training of CNNs when using activation functions that have no upper bound.
▪ LCN - Local Contrast Normalization performs both subtractive and divisive normalization (2009).
▪ LRN - Local Response Normalization resembles LCN (2012).
CNN Training
Softmax Layer:
• It is the last layer in many neural network classifiers.
• The response of neuron i in the softmax layer L is calculated with regard to the value of its activation function a_i = f(z_i):
ŷ_i = σ(a_i) = e^(a_i) / Σ_{k=1}^{k_L} e^(a_k) : ℝ → (0, 1), i = 1,…,k_L.
• The responses of the softmax neurons sum up to one: Σ_{i=1}^{k_L} ŷ_i = 1.
• This is a better representation for mutually exclusive class labels.
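A numerically stable sketch of the softmax formula for a single activation vector (subtracting the maximum is an implementation detail not in the formula; it leaves the result mathematically unchanged):

```python
import numpy as np

def softmax(a):
    """Softmax over a 1D activation vector: y_i = e^(a_i) / sum_k e^(a_k)."""
    e = np.exp(a - np.max(a))  # shift for numerical stability
    return e / e.sum()
```

The outputs are positive and sum to one, so they can be read as class probabilities; for a batch, the same operation is applied along the class axis.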
CNN Training
Regularization
• Regularizers aim to reduce over-fitting during the training of deep CNNs on image data:
▪ Weight decay is a well-known approach that adds a norm Ω(θ) of the network's weights to the cost function, penalizing large parameter values:
J_reg = J + (λ/2) Ω(θ).
• λ is a regularization parameter controlling the trade-off between minimizing the cost function and penalizing large parameters.
CNN Training
Dropout randomly excludes a set of neurons from a given training epoch, keeping each neuron with a constant probability p_keep.
• The activations of dropped-out neurons are set to zero.
▪ They do not participate in the loss and are thus excluded from back-propagation.
▪ Dropout was initially used in AlexNet after each fully connected layer.
▪ When testing a trained model, all neurons are used with their already learned weights.
• Dropout induces dynamic sparsity during training.
• It prevents complex co-adaptations of the synaptic weights, which may lead to correlated activations of neurons.
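A sketch of dropout with keep probability p_keep (illustrative; the 1/p_keep scaling during training is the common "inverted dropout" variant, which keeps expected activations unchanged so that no rescaling is needed at test time, whereas AlexNet rescaled at test time instead):

```python
import numpy as np

def dropout(a, p_keep=0.5, training=True, rng=None):
    """Inverted-dropout sketch: zero each activation with probability 1 - p_keep."""
    if not training:
        return a                      # at test time all neurons are used
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(a.shape) < p_keep   # Bernoulli keep mask
    return a * mask / p_keep              # scale so E[output] == input
```

The zeroed entries are exactly the "dropped-out" neurons: they contribute nothing to the loss and therefore receive no gradient.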
CNN Training
Residual Convolutional Module:
• The basic module of ResNets.
• A shortcut connection bypasses layers; Batch Normalization (BN) is used before the activation function. Together these implement an identity mapping.
A_{l+2} = f_{l+2}( A_l + BN_{l+2}( b_{l+2} + W_{l+2} ∗∗ f_{l+1}( BN_{l+1}( b_{l+1} + W_{l+1} ∗∗ A_l ) ) ) ).
• Residual learning makes it possible to train extremely deep CNNs, up to 192 layers.
CNN architectures
Inception Convolutional Module:
• The basic idea is to use multiple window sizes for convolutions that operate over the same input region.
▪ It is the building block of the GoogLeNet (Inception v1) CNN.
▪ The latest upgrade of the model is Inception v4.
• The activation volumes are concatenated at the module output along the feature dimension:
▪ n_out = n_{1×1} + n_{3×3} + n_{5×5} + n_maxpool.
• 1 × 1 convolutional layers are used before the computationally intensive 3 × 3 and 5 × 5 convolution operations.
CNN architectures
CNN Computational Issues
• Current research aims to reduce the number of CNN model parameters, thus reducing the complexity.
▪ Fewer parameters mean less memory and faster recall of patterns.
▪ The prediction accuracy of the models should remain competitive or the same.
• Lightweight CNNs need to fit in the memory of mobile accelerators, like the NVIDIA Jetson TX2 that is used in MultiDrone.
▪ Fewer computations mean less energy consumption.
▪ Better models may lead to increased flight time.
“Goldilocks” zone of CNN models: candidates for real-world applications.
A. Canziani, A. Paszke, E. Culurciello, “An Analysis of Deep Neural Network Models for Practical Applications”, arXiv:1605.07678 [cs], May 2016.
CNN Computational Issues
Bibliography
[ZHA2018] X. Zhang, X. Zhou, M. Lin, J. Sun, “ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices”, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 6848-6856.
[WAN2018] S. Wang, S. Suo, W.C. Ma, A. Pokrovsky, R. Urtasun, “Deep Parametric Continuous Convolutional Neural Networks”, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 2589-2597.
[GU2018] J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuai, T. Liu, X. Wang, G. Wang, J. Cai, T. Chen, “Recent advances in convolutional neural networks”, Pattern Recognition, Volume 77, May 2018, pp. 354-377.
[HOW2017] A.G. Howard, et al., “MobileNets: Efficient convolutional neural networks for mobile vision applications”, arXiv preprint arXiv:1704.04861, 2017.