Deep Learning for Computer Vision. Pr. Jenny Benois-Pineau, LaBRI UMR 5800 / Université de Bordeaux. Chapter 2. From Shallow to Deep
Chapter 2
Summary: 1. Kinds of Machine Learning (1.1. Unsupervised learning; 1.2. Supervised learning, main formulations); 2. Artificial Neural Networks; 3. Multi-Layered Perceptron (MLP)
1. Kinds of Machine Learning
➔ Once more: machine learning teaches computers to do what comes naturally to humans and animals: learn from experience.
➔ Machine learning algorithms use computational methods to “learn” information directly from data, without relying on a predetermined equation as a model.
➔ The algorithms adaptively improve their performance as the number of samples available for learning increases.
Oge Marques, IPTA’2017
Types of Learning
Evaluation Metrics
Confusion Matrix
Quality Metrics
➔ ACC = (TP + TN)/(TP + FP + TN + FN)
➔ BACC = (TP/P + TN/N)/2 *
➔ TPR = TP/(TP + FN), also called recall (R)
➔ TNR = TN/(TN + FP)
➔ Precision P = TP/(TP + FP)
➔ F-score = 2/(1/R + 1/P) *
➔ FPR = FP/(FP + TN)
➔ FNR = FN/(FN + TP)
➔ * recommended for unbalanced classes
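These formulas map directly onto code. Below is a minimal sketch (the function name and the example counts are illustrative, not from the slides):

def quality_metrics(TP, FP, TN, FN):
    P_all, N_all = TP + FN, TN + FP            # actual positives / negatives
    acc  = (TP + TN) / (TP + FP + TN + FN)
    bacc = (TP / P_all + TN / N_all) / 2       # balanced accuracy
    tpr  = TP / (TP + FN)                      # recall R
    tnr  = TN / (TN + FP)
    prec = TP / (TP + FP)                      # precision
    f1   = 2 / (1 / tpr + 1 / prec)            # harmonic mean of R and precision
    fpr  = FP / (FP + TN)
    fnr  = FN / (FN + TP)
    return dict(ACC=acc, BACC=bacc, TPR=tpr, TNR=tnr,
                P=prec, F=f1, FPR=fpr, FNR=fnr)

print(quality_metrics(TP=40, FP=10, TN=45, FN=5))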
ROC curve
➔ Consider a binary classification problem where the classifier's decision depends on a threshold: each value of the threshold yields one (FPR, TPR) pair, and plotting these pairs gives the ROC curve.
➔ In a multi-class classification problem: one vs. all, and we plot one curve for each of the N classes.
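To make the construction concrete, here is a minimal sketch (illustrative scores and labels, not from the slides) that sweeps the threshold and collects the (FPR, TPR) points of the ROC curve:

import numpy as np

def roc_points(scores, labels):
    points = []
    for thr in np.sort(np.unique(scores))[::-1]:   # sweep thresholds, high to low
        pred = (scores >= thr).astype(int)
        tp = np.sum((pred == 1) & (labels == 1))
        fp = np.sum((pred == 1) & (labels == 0))
        fn = np.sum((pred == 0) & (labels == 1))
        tn = np.sum((pred == 0) & (labels == 0))
        points.append((fp / (fp + tn), tp / (tp + fn)))  # (FPR, TPR)
    return points

scores = np.array([0.9, 0.8, 0.6, 0.4, 0.3, 0.1])
labels = np.array([1, 1, 0, 1, 0, 0])
print(roc_points(scores, labels))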
1.1. Unsupervised Learning (1)
➔ Unsupervised learning finds hidden patterns or intrinsic structures in data.
➔ It is used to draw inferences from datasets consisting of input data without labelled responses.
➔ Clustering is the most common unsupervised learning technique.
➔ It is used for exploratory data analysis to find hidden patterns or groupings in data.
➔ Applications in computer vision: visual data summarization (collections of images, video)
K-means clustering
➔ J. MacQueen, “Some methods for classification and analysis of multivariate observations”, Proc. of the Fifth Berkeley Symposium on Math. Stat. and Prob., pp. 281–296, 1967
➔ Principle: unsupervised classification with an a priori known number of clusters.
➔ Parameter: the number k of clusters
➔ Input data: a sample of M descriptor vectors x1, ..., xM
➔ (1) Choose k initial centers c1, ..., ck
➔ (2) For each of the M vectors, assign it to the i-th cluster whose center ci is closest in the sense of the chosen metric
➔ (3) If no vector changes its cluster, stop
➔ (4) Compute new centers: for each i, ci is the mean of the vectors of class i
➔ (5) Go to (2)
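The algorithm above can be written in a few lines of NumPy. A minimal sketch with a Euclidean metric (the function and the data are illustrative):

import numpy as np

def kmeans(X, k, rng=np.random.default_rng(0)):
    # (1) choose k initial centers among the sample vectors
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    assign = np.full(len(X), -1)
    while True:
        # (2) assign each vector to the cluster with the closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        # (3) if no vector changes its cluster, stop
        if np.array_equal(new_assign, assign):
            return centers, assign
        assign = new_assign
        # (4) recompute each center as the mean of its cluster, (5) iterate
        for i in range(k):
            if np.any(assign == i):
                centers[i] = X[assign == i].mean(axis=0)

centers, assign = kmeans(np.random.default_rng(1).random((100, 2)), k=3)
print(centers)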
Application Example (1)
➔ Lifelogging (C. Gurrin, A. Smeaton, DCU)
http://www.slideshare.net/cgurrin/biohackers-summit-2015-lifelogging-a-new-era-of-personal-data
Application Example (2)
➔ Grouping of similar images
➔ Selection of the cluster center: “hyper-scenes”
[Figure: four hyper-scenes H1–H4]
H. Nicolas, A. Manoury, J. Benois-Pineau, W. Dupuis, D. Barba: Grouping video shots into scenes based on 1D mosaic descriptors. IEEE ICIP 2004: 637-640
Hierarchical Agglomerative Clustering (HAC)
➔ Principle:
➔ (1) At initialisation, each descriptor vector in the sample forms its own class
➔ (2) While the number of clusters is larger than k (limit k = 1), merge the two closest classes in the sense of a distance d between clusters. Common choices of the distance between clusters are max-link, min-link and mean-link:
$d_{\max}(C_i, C_j) = \max_{x \in C_i,\, y \in C_j} d(x, y)$

$d_{\min}(C_i, C_j) = \min_{x \in C_i,\, y \in C_j} d(x, y)$

$d_{\mathrm{mean}}(C_i, C_j) = \frac{1}{n_i \times n_j} \sum_{l=1}^{n_i} \sum_{p=1}^{n_j} d(x_l, y_p)$
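These three distances correspond to standard linkage methods. A minimal sketch, assuming SciPy is available ('complete' = max-link, 'single' = min-link, 'average' = mean-link; the data are illustrative):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).random((20, 2))     # sample of descriptor vectors
Z = linkage(X, method='average')                 # mean-link agglomeration
labels = fcluster(Z, t=3, criterion='maxclust')  # cut the hierarchy at k = 3 clusters
print(labels)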
Dendrogram
S. Benini et al. Extraction of Significant Video Summaries by Dendrogram Analysis. IEEE ICIP 2006: 133-136
1.2. Supervised Learning (1)
➔ Problem statement:
➔ Let us consider a set of pairs $\{(x_1, y_1), \dots, (x_n, y_n), \dots, (x_N, y_N)\}$
➔ $x_n \in \mathbb{R}^K = X$: the feature vectors
➔ $y_n \in Y$: the labels (of classes)
➔ Let us consider a function $g(x, \alpha): X \to Y$, which produces predictions $\hat{y}_n = g(x_n, \alpha)$
➔ Let us now consider a function $L(y_n, \hat{y}_n)$: the loss of predicting $\hat{y}_n$
1.2. Supervised Learning (2)
➔ Empirical risk minimization: find a function $g$ which minimizes
$R_{emp}(g) = \frac{1}{N} \sum_{n=1}^{N} L\left(y_n, g(x_n, \alpha)\right)$
➔ Structural risk minimization: consider a penalty $C(g): G \to \mathbb{R}^+$ and minimize
$J(g) = R_{emp}(g) + \lambda C(g), \quad \lambda \ge 0$
➔ If the variable y is discrete we speak of classification, otherwise of regression
➔ For a given form of $g$, the problem consists in finding the optimal parameters $\alpha^*$
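As a concrete illustration, here is a minimal sketch of both risks for a linear model $g(x, \alpha) = x^T \alpha$ with a squared loss and penalty $C(g) = \|\alpha\|^2$ (all of these choices are illustrative, not from the slides):

import numpy as np

def empirical_risk(alpha, X, y):
    y_hat = X @ alpha                        # predictions g(x_n, alpha)
    return np.mean((y - y_hat) ** 2)         # (1/N) * sum of L(y_n, g(x_n, alpha))

def structural_risk(alpha, X, y, lam=0.1):
    # empirical risk plus the penalty lambda * C(g)
    return empirical_risk(alpha, X, y) + lam * np.sum(alpha ** 2)

rng = np.random.default_rng(0)
X, y = rng.random((50, 3)), rng.random(50)
print(structural_risk(np.zeros(3), X, y))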
Quality of prediction
[Figure: Type I and Type II errors]
2. Artificial Neural Networks
➔ Biological inspiration
➔ The basic computational unit of the brain is the neuron. Approximately 86 billion neurons can be found in the human nervous system, and they are connected by approximately $10^{14}$–$10^{15}$ synapses.
$g(x, W, b) = f\left(\sum_i w_i x_i + b\right)$
McCulloch and Pitts, 1943, activation function
$f(t) = \begin{cases} 1, & t > 0 \\ 0, & \text{otherwise} \end{cases}$
Commonly used non-linear functions
➔ Sigmoid: $f(t) = \dfrac{1}{1 + e^{-t}}$
➔ Tanh: $f(t) = \dfrac{e^t - e^{-t}}{e^t + e^{-t}}$
➔ Sigmoids saturate and kill gradients (!)
➔ The tanh non-linearity is always preferred to the sigmoid non-linearity.
[Figure: a) Sigmoid, b) Tanh]
ReLU non-linearity
➔ Rectified Linear Unit: $f(t) = \max(t, 0)$
➔ Lower computational cost w.r.t. sigmoid and tanh
➔ Faster convergence
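The three non-linearities side by side, as a minimal NumPy sketch:

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))          # saturates for large |t|

def tanh(t):
    return (np.exp(t) - np.exp(-t)) / (np.exp(t) + np.exp(-t))  # same as np.tanh

def relu(t):
    return np.maximum(t, 0.0)                # cheap: a single comparison

t = np.linspace(-3.0, 3.0, 7)
print(sigmoid(t), tanh(t), relu(t), sep='\n')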
A simple neuron
$x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \quad w = \begin{pmatrix} w_1 \\ w_2 \end{pmatrix}, \quad x^T w = x_1 w_1 + x_2 w_2, \quad y = f(x_1 w_1 + x_2 w_2)$
Our simplest function f : Heaviside Step function
$f(t) = \begin{cases} 1, & t > 0 \\ 0, & \text{otherwise} \end{cases}$
(Source: Wikipedia)
$y = f(x_1 w_1 + x_2 w_2) = \begin{cases} 1, & x_1 w_1 + x_2 w_2 > 0 \\ 0, & \text{otherwise} \end{cases}$
How to determine the weights $w_i$?
Training a neuron
➔ The artificial neuron can be trained to act as an elementary linear classifier, i.e. we can determine the weights $w_i$ which minimize the empirical (or structural) risk of our classifier.
➔ The elementary training algorithm, the “Perceptron”, was proposed by Rosenblatt (1958).
➔ Consider the training set $\{(x_1, y_1), \dots, (x_n, y_n), \dots, (x_N, y_N)\}$, $y_n \in \{0, 1\}$ (in our case of a simple neuron, the label is binary)
➔ Initialize the weights $w^0$ randomly
➔ Then at each iteration t the weights are updated as
$w_i^{t+1} = w_i^t + \eta \left(y_n - \hat{y}_n^t\right) x_{i,n}$
➔ “Back propagation”
➔ Limitations: the set of classifier functions which can be simulated by the Perceptron is narrow (cf. Minsky and Papert (1969), XOR); a minimal training sketch follows below.
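A minimal sketch of the Rosenblatt update rule on a linearly separable problem (the AND function; the data, learning rate and epoch count are illustrative). On XOR the same loop would never converge, which is exactly the Minsky and Papert limitation:

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])                   # AND: binary labels
w, b, eta = np.zeros(2), 0.0, 0.1            # initial weights and learning rate
for epoch in range(20):
    for x_n, y_n in zip(X, y):
        y_hat = 1 if x_n @ w + b > 0 else 0  # Heaviside activation
        w += eta * (y_n - y_hat) * x_n       # w_i <- w_i + eta*(y_n - y_hat)*x_in
        b += eta * (y_n - y_hat)
print(w, b)                                  # a separating hyperplane for AND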
3. Multi-Layered Perceptron (MLP)
➔ Perceptron (1958) -> MLP (1961), Rosenblatt
➔ Let us consider a binary classification problem
➔ The input layer is just our data
➔ Hidden layers produce more abstract features
➔ Hidden layers are fully connected
➔ The output of hidden layers is usually not binary (ReLU, tanh, sigmoid)
http://neuralnetworksanddeeplearning.com/chap1.html
Example : Recognition of Handwritten digits
➔ Input is a binarised image
➔ Each matrix is of size 28×28
➔ 10-class classification problem
The architecture of an MLP with 1 hidden layer
http://neuralnetworksanddeeplearning.com/chap1.html
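A minimal sketch of one forward pass through such a network, sized for the 28×28 digit example above (the weights here are random and untrained, and the layer sizes are illustrative):

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.1, (784, 30)), np.zeros(30)  # input -> hidden layer
W2, b2 = rng.normal(0, 0.1, (30, 10)), np.zeros(10)   # hidden -> output layer
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

x = rng.random(784)              # a flattened 28x28 binarised image
h = sigmoid(x @ W1 + b1)         # hidden layer: more abstract features
o = sigmoid(h @ W2 + b2)         # one score per digit class
print(o.argmax())                # index of the predicted class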
Training of MLP
➔ As in the case of the Perceptron, training an MLP consists in finding the optimal configuration of the weights at all layers
➔ The principle is the same, i.e. to minimize the loss L between predictions and ground-truth labels
➔ To stress the multilayer architecture, let us denote by $w_{ij}^{l,(l+1)}$ the weight between the i-th neuron of layer l and the j-th neuron of layer l+1; the update rule is
$w_{ij}^{(t+1),l,(l+1)} = w_{ij}^{(t),l,(l+1)} - \eta \frac{\partial L}{\partial w_{ij}^{l,(l+1)}}$
➔ $\eta$ is the learning rate
➔ I.e. each parameter of the MLP is updated in the direction opposite to the gradient of the loss function L (gradient descent). To compute the derivatives of the loss function at each layer we use the “chain rule”.
Back-propagation and chain rule(1)
➔ Let us consider a trivial neural network
➔ x is the input
➔ h stands for the hidden layer
➔ o stands for the output layer
➔ f is a non-linear transformation
➔ w are the synaptic weights
➔ y is the known output
➔ $\hat{y}$ is the predicted output
[Diagram: $x \xrightarrow{w_h} x \cdot w_h \to f(x \cdot w_h) \xrightarrow{w_o} w_o \cdot f(x \cdot w_h) \to \hat{y} = f(w_o \cdot f(x \cdot w_h))$]
Back-propagation(2)
➔ Let us consider the error/loss function
$L = \frac{1}{2}(y - \hat{y})^2$
➔ The goal is to find
$(w_o^*, w_h^*)^T = \operatorname{Argmin} L(w_o, w_h)$
➔ Method: gradient descent
$w_o^{(t+1)} = w_o^{(t)} - \eta L'_{w_o}\left(w_o^{(t)}, w_h^{(t)}\right)$
$w_h^{(t+1)} = w_h^{(t)} - \eta L'_{w_h}\left(w_o^{(t)}, w_h^{(t)}\right)$
➔ $\eta$ is the “learning rate”, for simplicity the same for all layers
Back-propagation(3)
➔ How to compute the partial derivatives?
[Diagram: the same trivial network, $\hat{y} = f(w_o \cdot f(x \cdot w_h))$]
$L = \frac{1}{2}(y - \hat{y})^2, \quad L'_{w_o} = ?$
$L'_{w_o} = (\hat{y} - y) \cdot \hat{y}'_{w_o} = (\hat{y} - y) \cdot f' \cdot f(x \cdot w_h)$
$L'_{w_h} = ?$
➔ Chain rule: $\left(a\left(b\left(c(x)\right)\right)\right)' = a'(b) \cdot b'(c) \cdot c'(x)$
$L'_{w_h} = (\hat{y} - y) \cdot \hat{y}'_{w_h} = (\hat{y} - y) \cdot f' \cdot w_o \cdot f' \cdot x$
Back-propagation(4)
➔ How to compute the partial derivatives? Each gradient factorizes into a layer error times a layer input:
[Diagram: the same trivial network, $\hat{y} = f(w_o \cdot f(x \cdot w_h))$]
$L'_{w_o} = (\hat{y} - y) \cdot \hat{y}'_{w_o} = \underbrace{(\hat{y} - y) \cdot f'}_{\text{layer error}} \cdot \underbrace{y_h}_{\text{layer input}}, \quad y_h = f(x \cdot w_h)$
$L'_{w_h} = (\hat{y} - y) \cdot \hat{y}'_{w_h} = \underbrace{(\hat{y} - y) \cdot f' \cdot w_o \cdot f'}_{\text{layer error}} \cdot \underbrace{x}_{\text{layer input}}$
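Putting the two derivatives to work, here is a minimal sketch of gradient descent on this trivial network (scalar input, sigmoid f; the data and learning rate are illustrative):

import numpy as np

f  = lambda t: 1.0 / (1.0 + np.exp(-t))
df = lambda t: f(t) * (1.0 - f(t))           # derivative of the sigmoid

x, y = 0.5, 1.0                              # input and known output
wh, wo, eta = 0.1, 0.1, 0.5
for step in range(2000):
    yh    = f(x * wh)                        # layer input y_h = f(x*wh)
    y_hat = f(wo * yh)                       # prediction f(wo * f(x*wh))
    err   = y_hat - y                        # from L = (y - y_hat)^2 / 2
    dwo = err * df(wo * yh) * yh             # layer error * layer input
    dwh = err * df(wo * yh) * wo * df(x * wh) * x
    wo -= eta * dwo                          # gradient descent step
    wh -= eta * dwh
print(y_hat)                                 # approaches the target y = 1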
MLP: conclusion
➔ The MLP is a fully connected network: each output of a previous layer is connected to all inputs of the next layer
➔ The MLP is a feed-forward neural network: at test time the input data pass directly from the input layer up to the output layer
➔ An MLP trained with back-propagation proved to be a very effective supervised learning algorithm
➔ It was widely used for character recognition, face recognition, etc.
➔ Nevertheless, if we work with high-resolution images, the number of parameters to train becomes very high, which would severely degrade performance.
➔ Solution: Convolutional Neural Networks (CNN)