Deep Learning for Computer Vision. Pr. Jenny Benois-Pineau, LaBRI UMR 5800 / Université de Bordeaux. Chapter 2. From Shallow to Deep
Chapter 2
Summary: 1. Kinds of Machine Learning (1.1. Unsupervised learning; 1.2. Supervised learning, main formulations); 2. Artificial Neural Networks; 3. Multi-Layered Perceptron (MLP)
1. Kinds of Machine Learning
➔ Once more: machine learning teaches computers to do what comes naturally to humans and animals: learn from experience.
➔ Machine learning algorithms use computational methods to “learn” information directly from data, without relying on a predetermined equation as a model.
➔ The algorithms adaptively improve their performance as the number of samples available for learning increases.
Oge Marques, IPTA’2017
Types of Learning
Evaluation Metrics
Confusion Matrix
Quality Metrics
➔ ACC = (TP + TN)/(TP + FP + TN + FN)
➔ BACC = (TP/P + TN/N)/2 *
➔ TPR = TP/(TP + FN), also called recall (R)
➔ TNR = TN/(TN + FP)
➔ Precision P = TP/(TP + FP)
➔ F-score = 2/(1/R + 1/P) *
➔ FPR = FP/(FP + TN)
➔ FNR = FN/(FN + TP)
➔ * recommended for unbalanced classes
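These formulas map directly onto code. Below is a minimal sketch (the function name and the example counts are illustrative, not from the slides):

def quality_metrics(TP, FP, TN, FN):
    P_all, N_all = TP + FN, TN + FP            # actual positives / negatives
    acc  = (TP + TN) / (TP + FP + TN + FN)
    bacc = (TP / P_all + TN / N_all) / 2       # balanced accuracy
    tpr  = TP / (TP + FN)                      # recall R
    tnr  = TN / (TN + FP)
    prec = TP / (TP + FP)                      # precision
    f1   = 2 / (1 / tpr + 1 / prec)            # harmonic mean of R and precision
    fpr  = FP / (FP + TN)
    fnr  = FN / (FN + TP)
    return dict(ACC=acc, BACC=bacc, TPR=tpr, TNR=tnr,
                P=prec, F=f1, FPR=fpr, FNR=fnr)

print(quality_metrics(TP=40, FP=10, TN=45, FN=5))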
ROC curve
➔ Consider a binary classification problem where the classifier's decision depends on a threshold: each value of the threshold yields one (FPR, TPR) pair, and plotting these pairs gives the ROC curve.
➔ In a multi-class classification problem: one vs. all, and we plot one curve for each of the N classes.
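To make the construction concrete, here is a minimal sketch (illustrative scores and labels, not from the slides) that sweeps the threshold and collects the (FPR, TPR) points of the ROC curve:

import numpy as np

def roc_points(scores, labels):
    points = []
    for thr in np.sort(np.unique(scores))[::-1]:   # sweep thresholds, high to low
        pred = (scores >= thr).astype(int)
        tp = np.sum((pred == 1) & (labels == 1))
        fp = np.sum((pred == 1) & (labels == 0))
        fn = np.sum((pred == 0) & (labels == 1))
        tn = np.sum((pred == 0) & (labels == 0))
        points.append((fp / (fp + tn), tp / (tp + fn)))  # (FPR, TPR)
    return points

scores = np.array([0.9, 0.8, 0.6, 0.4, 0.3, 0.1])
labels = np.array([1, 1, 0, 1, 0, 0])
print(roc_points(scores, labels))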
1.1. Unsupervised Learning (1)
➔ Unsupervised learning finds hidden patterns or intrinsic structures in data.
➔ It is used to draw inferences from datasets consisting of input data without labelled responses.
➔ Clustering is the most common unsupervised learning technique.
➔ It is used for exploratory data analysis to find hidden patterns or groupings in data.
➔ Applications in computer vision: visual data summarization (collections of images, video)
K-means clustering
➔ J. MacQueen, “Some methods for classification and analysis of multivariate observations”, Proc. of the Fifth Berkeley Symposium on Math. Stat. and Prob., pp. 281–296, 1967
➔ Principle: unsupervised classification with an a priori known number of clusters.
➔ Parameter: the number k of clusters
➔ Input data: a sample of M descriptor vectors x1, ..., xM
➔ (1) Choose k initial centers c1, ..., ck
➔ (2) For each of the M vectors, assign it to the i-th cluster whose center ci is closest in the sense of the chosen metric
➔ (3) If no vector changes its cluster, stop
➔ (4) Compute new centers: for each i, ci is the mean of the vectors of class i
➔ (5) Go to (2)
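The algorithm above can be written in a few lines of NumPy. A minimal sketch with a Euclidean metric (the function and the data are illustrative):

import numpy as np

def kmeans(X, k, rng=np.random.default_rng(0)):
    # (1) choose k initial centers among the sample vectors
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    assign = np.full(len(X), -1)
    while True:
        # (2) assign each vector to the cluster with the closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        # (3) if no vector changes its cluster, stop
        if np.array_equal(new_assign, assign):
            return centers, assign
        assign = new_assign
        # (4) recompute each center as the mean of its cluster, (5) iterate
        for i in range(k):
            if np.any(assign == i):
                centers[i] = X[assign == i].mean(axis=0)

centers, assign = kmeans(np.random.default_rng(1).random((100, 2)), k=3)
print(centers)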
Application Example (1)
➔ Lifelogging (C. Gurrin, A. Smeaton, DCU)
http://www.slideshare.net/cgurrin/biohackers-summit-2015-lifelogging-a-new-era-of-personal-data
Application Example (2)
➔ Grouping of similar images
➔ Selection of the cluster center: “hyper-scenes”
[Figure: four hyper-scenes H1–H4]
H. Nicolas, A. Manoury, J. Benois-Pineau, W. Dupuis, D. Barba: Grouping video shots into scenes based on 1D mosaic descriptors. IEEE ICIP 2004: 637-640
Hierarchical Agglomerative Clustering (HAC)
➔ Principle:
➔ (1) At initialisation, each descriptor vector in the sample forms its own class
➔ (2) While the number of clusters is larger than k (limit k = 1), merge the two closest classes in the sense of a distance d between clusters. Common choices of the distance between clusters are max-link, min-link and mean-link:
$d_{\max}(C_i, C_j) = \max_{x \in C_i,\, y \in C_j} d(x, y)$

$d_{\min}(C_i, C_j) = \min_{x \in C_i,\, y \in C_j} d(x, y)$

$d_{\mathrm{mean}}(C_i, C_j) = \frac{1}{n_i \times n_j} \sum_{l=1}^{n_i} \sum_{p=1}^{n_j} d(x_l, y_p)$
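These three distances correspond to standard linkage methods. A minimal sketch, assuming SciPy is available ('complete' = max-link, 'single' = min-link, 'average' = mean-link; the data are illustrative):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).random((20, 2))     # sample of descriptor vectors
Z = linkage(X, method='average')                 # mean-link agglomeration
labels = fcluster(Z, t=3, criterion='maxclust')  # cut the hierarchy at k = 3 clusters
print(labels)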
Dendrogram
S. Benini et al. Extraction of Significant Video Summaries by Dendrogram Analysis. IEEE ICIP 2006: 133-136
1.2. Supervised Learning (1)
➔ Problem statement:
➔ Let us consider a set of pairs $\{(x_1, y_1), \dots, (x_n, y_n), \dots, (x_N, y_N)\}$
➔ $x_n \in \mathbb{R}^K = X$: the feature vectors
➔ $y_n \in Y$: the labels (of classes)
➔ Let us consider a function $g(x, \alpha): X \to Y$, which produces predictions $\hat{y}_n = g(x_n, \alpha)$
➔ Let us now consider a function $L(y_n, \hat{y}_n)$: the loss of predicting $\hat{y}_n$
1.2. Supervised Learning (2)
➔ Empirical risk minimization: find a function $g$ which minimizes
$R_{emp}(g) = \frac{1}{N} \sum_{n=1}^{N} L\left(y_n, g(x_n, \alpha)\right)$
➔ Structural risk minimization: consider a penalty $C(g): G \to \mathbb{R}^+$ and minimize
$J(g) = R_{emp}(g) + \lambda C(g), \quad \lambda \ge 0$
➔ If the variable y is discrete we speak of classification, otherwise of regression
➔ For a given form of $g$, the problem consists in finding the optimal parameters $\alpha^*$
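As a concrete illustration, here is a minimal sketch of both risks for a linear model $g(x, \alpha) = x^T \alpha$ with a squared loss and penalty $C(g) = \|\alpha\|^2$ (all of these choices are illustrative, not from the slides):

import numpy as np

def empirical_risk(alpha, X, y):
    y_hat = X @ alpha                        # predictions g(x_n, alpha)
    return np.mean((y - y_hat) ** 2)         # (1/N) * sum of L(y_n, g(x_n, alpha))

def structural_risk(alpha, X, y, lam=0.1):
    # empirical risk plus the penalty lambda * C(g)
    return empirical_risk(alpha, X, y) + lam * np.sum(alpha ** 2)

rng = np.random.default_rng(0)
X, y = rng.random((50, 3)), rng.random(50)
print(structural_risk(np.zeros(3), X, y))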
Quality of prediction
[Figure: Type I and Type II errors]
2. Artificial Neural Networks
➔ Biological inspiration
➔ The basic computational unit of the brain is the neuron. Approximately 86 billion neurons can be found in the human nervous system, and they are connected by approximately $10^{14}$–$10^{15}$ synapses.
$g(x, W, b) = f\left(\sum_i w_i x_i + b\right)$
McCulloch and Pitts, 1943, activation function
$f(t) = \begin{cases} 1, & t > 0 \\ 0, & \text{otherwise} \end{cases}$
Commonly used non-linear functions
➔ Sigmoid: $f(t) = \dfrac{1}{1 + e^{-t}}$
➔ Tanh: $f(t) = \dfrac{e^t - e^{-t}}{e^t + e^{-t}}$
➔ Sigmoids saturate and kill gradients (!)
➔ The tanh non-linearity is always preferred to the sigmoid non-linearity.
[Figure: a) Sigmoid, b) Tanh]
ReLU non-linearity
➔ Rectified Linear Unit: $f(t) = \max(t, 0)$
➔ Lower computational cost w.r.t. sigmoid and tanh
➔ Faster convergence
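The three non-linearities side by side, as a minimal NumPy sketch:

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))          # saturates for large |t|

def tanh(t):
    return (np.exp(t) - np.exp(-t)) / (np.exp(t) + np.exp(-t))  # same as np.tanh

def relu(t):
    return np.maximum(t, 0.0)                # cheap: a single comparison

t = np.linspace(-3.0, 3.0, 7)
print(sigmoid(t), tanh(t), relu(t), sep='\n')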
A simple neuron
$x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \quad w = \begin{pmatrix} w_1 \\ w_2 \end{pmatrix}, \quad x^T w = x_1 w_1 + x_2 w_2, \quad y = f(x_1 w_1 + x_2 w_2)$
Our simplest function f : Heaviside Step function
$f(t) = \begin{cases} 1, & t > 0 \\ 0, & \text{otherwise} \end{cases}$
(Source: Wikipedia)
$y = f(x_1 w_1 + x_2 w_2) = \begin{cases} 1, & x_1 w_1 + x_2 w_2 > 0 \\ 0, & \text{otherwise} \end{cases}$
How to determine the weights $w_i$?
Training a neuron
➔ The artificial neuron can be trained to act as an elementary linear classifier, i.e. we can determine the weights $w_i$ which minimize the empirical (or structural) risk of our classifier.
➔ The elementary training algorithm, the “Perceptron”, was proposed by Rosenblatt (1958).
➔ Consider the training set $\{(x_1, y_1), \dots, (x_n, y_n), \dots, (x_N, y_N)\}$, $y_n \in \{0, 1\}$ (in our case of a simple neuron, the label is binary)
➔ Initialize the weights $w^0$ randomly
➔ Then at each iteration t the weights are updated as
$w_i^{t+1} = w_i^t + \eta \left(y_n - \hat{y}_n^t\right) x_{i,n}$
➔ “Back propagation”
➔ Limitations: the set of classifier functions which can be simulated by the Perceptron is narrow (cf. Minsky and Papert (1969), XOR); a minimal training sketch follows below.
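A minimal sketch of the Rosenblatt update rule on a linearly separable problem (the AND function; the data, learning rate and epoch count are illustrative). On XOR the same loop would never converge, which is exactly the Minsky and Papert limitation:

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])                   # AND: binary labels
w, b, eta = np.zeros(2), 0.0, 0.1            # initial weights and learning rate
for epoch in range(20):
    for x_n, y_n in zip(X, y):
        y_hat = 1 if x_n @ w + b > 0 else 0  # Heaviside activation
        w += eta * (y_n - y_hat) * x_n       # w_i <- w_i + eta*(y_n - y_hat)*x_in
        b += eta * (y_n - y_hat)
print(w, b)                                  # a separating hyperplane for AND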
3. Multi-Layered Perceptron (MLP)
➔ Perceptron (1958) -> MLP (1961), Rosenblatt
➔ Let us consider a binary classification problem
➔ The input layer is just our data
➔ Hidden layers produce more abstract features
➔ Hidden layers are fully connected
➔ The output of hidden layers is usually not binary (ReLU, tanh, sigmoid)
http://neuralnetworksanddeeplearning.com/chap1.html
Example : Recognition of Handwritten digits
➔ Input is a binarised image
➔ Each matrix is of size 28×28
➔ 10-class classification problem
The architecture of an MLP with 1 hidden layer
http://neuralnetworksanddeeplearning.com/chap1.html
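A minimal sketch of one forward pass through such a network, sized for the 28×28 digit example above (the weights here are random and untrained, and the layer sizes are illustrative):

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.1, (784, 30)), np.zeros(30)  # input -> hidden layer
W2, b2 = rng.normal(0, 0.1, (30, 10)), np.zeros(10)   # hidden -> output layer
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

x = rng.random(784)              # a flattened 28x28 binarised image
h = sigmoid(x @ W1 + b1)         # hidden layer: more abstract features
o = sigmoid(h @ W2 + b2)         # one score per digit class
print(o.argmax())                # index of the predicted class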
Training of MLP
➔ As in the case of the Perceptron, training an MLP consists in finding the optimal configuration of the weights at all layers
➔ The principle is the same, i.e. to minimize the loss L between predictions and ground-truth labels
➔ To stress the multilayer architecture, let us denote by $w_{ij}^{l,(l+1)}$ the weight between the i-th neuron of layer l and the j-th neuron of layer l+1; the update rule is
$w_{ij}^{(t+1),l,(l+1)} = w_{ij}^{(t),l,(l+1)} - \eta \frac{\partial L}{\partial w_{ij}^{l,(l+1)}}$
➔ $\eta$ is the learning rate
➔ I.e. each parameter of the MLP is updated in the direction opposite to the gradient of the loss function L (gradient descent). To compute the derivatives of the loss function at each layer we use the “chain rule”.
Back-propagation and chain rule(1)
➔ Let us consider a trivial neural network
➔ x is the input
➔ h stands for the hidden layer
➔ o stands for the output layer
➔ f is a non-linear transformation
➔ w are the synaptic weights
➔ y is the known output
➔ $\hat{y}$ is the predicted output
[Diagram: $x \xrightarrow{w_h} x \cdot w_h \to f(x \cdot w_h) \xrightarrow{w_o} w_o \cdot f(x \cdot w_h) \to \hat{y} = f(w_o \cdot f(x \cdot w_h))$]
Back-propagation(2)
➔ Let us consider the error/loss function
$L = \frac{1}{2}(y - \hat{y})^2$
➔ The goal is to find
$(w_o^*, w_h^*)^T = \operatorname{Argmin} L(w_o, w_h)$
➔ Method: gradient descent
$w_o^{(t+1)} = w_o^{(t)} - \eta L'_{w_o}\left(w_o^{(t)}, w_h^{(t)}\right)$
$w_h^{(t+1)} = w_h^{(t)} - \eta L'_{w_h}\left(w_o^{(t)}, w_h^{(t)}\right)$
➔ $\eta$ is the “learning rate”, for simplicity the same for all layers
Back-propagation(3)
➔ How to compute the partial derivatives?
[Diagram: the same trivial network, $\hat{y} = f(w_o \cdot f(x \cdot w_h))$]
$L = \frac{1}{2}(y - \hat{y})^2, \quad L'_{w_o} = ?$
$L'_{w_o} = (\hat{y} - y) \cdot \hat{y}'_{w_o} = (\hat{y} - y) \cdot f' \cdot f(x \cdot w_h)$
$L'_{w_h} = ?$
➔ Chain rule: $\left(a\left(b\left(c(x)\right)\right)\right)' = a'(b) \cdot b'(c) \cdot c'(x)$
$L'_{w_h} = (\hat{y} - y) \cdot \hat{y}'_{w_h} = (\hat{y} - y) \cdot f' \cdot w_o \cdot f' \cdot x$
Back-propagation(4)
➔ How to compute the partial derivatives? Each gradient factorizes into a layer error times a layer input:
[Diagram: the same trivial network, $\hat{y} = f(w_o \cdot f(x \cdot w_h))$]
$L'_{w_o} = (\hat{y} - y) \cdot \hat{y}'_{w_o} = \underbrace{(\hat{y} - y) \cdot f'}_{\text{layer error}} \cdot \underbrace{y_h}_{\text{layer input}}, \quad y_h = f(x \cdot w_h)$
$L'_{w_h} = (\hat{y} - y) \cdot \hat{y}'_{w_h} = \underbrace{(\hat{y} - y) \cdot f' \cdot w_o \cdot f'}_{\text{layer error}} \cdot \underbrace{x}_{\text{layer input}}$
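Putting the two derivatives to work, here is a minimal sketch of gradient descent on this trivial network (scalar input, sigmoid f; the data and learning rate are illustrative):

import numpy as np

f  = lambda t: 1.0 / (1.0 + np.exp(-t))
df = lambda t: f(t) * (1.0 - f(t))           # derivative of the sigmoid

x, y = 0.5, 1.0                              # input and known output
wh, wo, eta = 0.1, 0.1, 0.5
for step in range(2000):
    yh    = f(x * wh)                        # layer input y_h = f(x*wh)
    y_hat = f(wo * yh)                       # prediction f(wo * f(x*wh))
    err   = y_hat - y                        # from L = (y - y_hat)^2 / 2
    dwo = err * df(wo * yh) * yh             # layer error * layer input
    dwh = err * df(wo * yh) * wo * df(x * wh) * x
    wo -= eta * dwo                          # gradient descent step
    wh -= eta * dwh
print(y_hat)                                 # approaches the target y = 1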
MLP: conclusion
➔ The MLP is a fully connected network: each output of a previous layer is connected to all inputs of the next layer
➔ The MLP is a feed-forward neural network: at test time the input data pass directly from the input layer up to the output layer
➔ An MLP trained with back-propagation proved to be a very effective supervised learning algorithm
➔ It was widely used for character recognition, face recognition, etc.
➔ Nevertheless, if we work with high-resolution images, the number of parameters to train becomes very high, which would severely degrade performance.
➔ Solution: Convolutional Neural Networks (CNN)