Deep Neural Networks
Romain HERAULT
Normandie Universite - INSA de Rouen - LITIS
April 29 2015
R. HERAULT (INSA LITIS) Deep Neural Networks April 29 2015 1 / 56
Introduction to supervised learning
Outline
1 Introduction to supervised learning
2 Introduction to Neural Networks
3 Multi-Layer Perceptron - Feed-forward network
4 Deep Neural Networks
5 Extension to structured output
Supervised learning: Concept
Setup
An input (or feature) space $X \subseteq \mathbb{R}^m$,
An output (or target) space $Y$,
Objective
Find the link $f : X \to Y$ (or the dependencies $p(y \mid x)$) between the input and output spaces.
Supervised learning: general framework
Hypotheses space
f belongs to a hypothesis space H that depends on the chosen method (MLP, SVM, decision trees, ...). How should f be chosen within H?
Expected Prediction Error
Also called generalization error or generalization risk:
$$R(f) = \mathbb{E}_{X,Y}[L(f(X), Y)] = \iint L(f(x), y)\, p(x, y)\, dx\, dy$$
where L is a loss function that measures how far a prediction f(x) is from the target value y.
Supervised learning: different tasks, different losses
Regression
If $Y \subseteq \mathbb{R}^o$, it is a regression task. Standard losses are $(y - f(x))^2$ or $|y - f(x)|$.
Figure: Support Vector Machine regression (y as a function of x)
Classification / Discrimination
If Y is a discrete set, it is a classification or discrimination task. A standard loss is $\Theta(-y f(x))^2$, where $\Theta$ is the step function.
Figure: Two-dimensional classification example, with contour lines labelled −1, 0 and 1
Supervised learning: Experimental setup
Available data
The data consist of a set of n examples (x, y), where x ∈ X and y ∈ Y. It is split into:
A training set used to choose f, i.e. to learn the parameters w of the model,
A test set to evaluate the chosen f,
(A validation set to choose the hyper-parameters of f.)
Because labelling data has a human cost, a separate unlabelled set is often available, i.e. examples with only the features x (see semi-supervised learning).
Evaluation: Empirical risk
$$R_S(f) = \frac{1}{\operatorname{card}(S)} \sum_{(x,y) \in S} L(f(x), y)$$
where S is the training set during learning and the test set during final evaluation.
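As a minimal sketch of the empirical risk, the snippet below averages a loss over a set S of (x, y) pairs. The toy data, the fixed predictor `f` and the function names are illustrative assumptions, not part of the slides.

```python
def squared_loss(pred, target):
    """Regression loss (y - f(x))^2."""
    return (target - pred) ** 2

def zero_one_loss(pred, target):
    """Step-function loss for labels in {-1, +1}: 1 when the signs disagree."""
    return 1.0 if target * pred < 0 else 0.0

def empirical_risk(f, samples, loss):
    """Average loss of predictor f over a set S of (x, y) pairs."""
    return sum(loss(f(x), y) for x, y in samples) / len(samples)

# Hypothetical toy data, split into a training set and a test set.
train = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.1)]
test = [(3.0, 3.0), (4.0, 4.4)]
f = lambda x: x  # a fixed candidate predictor, chosen by hand for the example

print("train risk:", empirical_risk(f, train, squared_loss))
print("test risk:", empirical_risk(f, test, squared_loss))
```

The same `empirical_risk` function serves for learning and for final evaluation; only the set S changes.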
Supervised learning: Overfitting
Figure: Empirical risk versus model complexity (low to high), for the learning set and the test set
Adding noise to data or to model parameters (dark age)
Limiting model capacity ⇒ Regularization
Supervised learning as an optimization problem
Tikhonov regularization scheme
$$\arg\min_{w} \sum_{(x,y) \in S_{train}} L(f(x; w), y) + \lambda\, \Omega(w)$$
where
L is a loss term that measures the accuracy of the model,
Ω is a regularization term that limits the capacity of the model,
λ ∈ [0,∞[ is the regularization hyper-parameter.
Example: Ridge regression
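Linear regression with the sum of squared errors as loss and a squared L2-norm as regularization:
$$\arg\min_{w \in \mathbb{R}^d} \|Y - Xw\|^2 + \lambda \sum_{d} \|w_d\|^2$$
Solution:
$$w(\lambda) = (X^\top X + \lambda I)^{-1} X^\top Y$$
Regularization path:
$$\{\, w(\lambda) \mid \lambda \in [0, \infty[ \,\}$$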
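The closed-form solution and the regularization path can be sketched with NumPy as follows. The synthetic data, the true weights `w_true` and the grid of λ values are assumptions made for illustration only.

```python
import numpy as np

def ridge_solution(X, Y, lam):
    """Closed-form ridge weights w(lambda) = (X^T X + lambda I)^{-1} X^T Y."""
    d = X.shape[1]
    # Solving the linear system avoids forming the inverse explicitly.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
w_true = np.array([3.0, 1.0])            # hypothetical ground-truth weights
Y = X @ w_true + 0.1 * rng.normal(size=50)

# A few points of the regularization path: the weights shrink as lambda grows.
for lam in [0.0, 1.0, 10.0, 100.0, 1e4]:
    w = ridge_solution(X, Y, lam)
    print(lam, w, np.linalg.norm(w))
```

At λ = 0 this is ordinary least squares; as λ grows, the solution moves along the path toward the origin.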
Ridge regression: illustration
Figure: Ridge regression in the (w0, w1) plane, showing the regularization term, the loss term and the regularization path from λ = 0 to λ = +∞
Why do we care about sparsity ?
Sparsity is a very useful property of some machine learning algorithms.
Machine learning is model selection
Cheap to store & transmit
Sparse coefficients are meaningful.
They make more sense.
More robust to errors
Need fewer data to begin with
Provides scalable optimization
In the Big Data era, as datasets become larger, it becomes desirable to process the structured information contained within the data, rather than the data itself.
For lectures on sparsity, see Stéphane Canu's website.
Introducing sparsity
Lasso
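Linear regression with the sum of squared errors as loss and an L1-norm as regularization:
$$\arg\min_{w \in \mathbb{R}^d} \|Y - Xw\|^2 + \lambda \sum_{d} |w_d|$$
which is equivalent to
$$\arg\min_{w^+ \in \mathbb{R}^d,\; w^- \in \mathbb{R}^d} \|Y - X(w^+ - w^-)\|^2 + \lambda \sum_{d} (w_d^+ + w_d^-)$$
$$\text{s.t.} \quad w_i^+ \ge 0 \;\; \forall i \in [1..d], \qquad w_i^- \ge 0 \;\; \forall i \in [1..d]$$
Why is it sparse?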
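One way to see the sparsity numerically is to solve the Lasso with proximal gradient descent (ISTA), a technique that is not in the slides and is used here only as an illustration. Its soft-thresholding step sets small coefficients to exactly zero. The data, `w_true` and λ = 20 are hypothetical choices.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * |.|: shrinks toward zero, small values become exactly 0."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, Y, lam, n_iter=2000):
    """Minimize ||Y - X w||^2 + lam * sum_d |w_d| by proximal gradient descent (ISTA)."""
    w = np.zeros(X.shape[1])
    # Step = 1 / Lipschitz constant of the gradient of the squared-error term.
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ w - Y)
        w = soft_threshold(w - step * grad, step * lam)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = np.array([2.0, 0.0, 0.0, -1.5, 0.0])   # only two relevant features
Y = X @ w_true + 0.1 * rng.normal(size=100)

w_lasso = lasso_ista(X, Y, lam=20.0)
print(w_lasso)
```

With the L2 penalty of ridge regression the irrelevant coefficients would only be shrunk; the L1 penalty drives them to exactly zero.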
Lasso: illustration
Figure: Lasso in the (w0, w1) plane, showing the regularization term, the loss term and the regularization path from λ = 0 to λ = +∞
Introduction to Neural Networks
History . . .
1940 : Turing machine
1943 : Formal neuron (McCulloch & Pitts)
1948 : Automata networks (von Neumann)
1949 : First learning rules (Hebb)
1957 : Perceptron (Rosenblatt)
1960 : Adaline (Widrow & Hoff)
1969 : Perceptrons (Minsky & Papert)
Limitation of the perceptron
Need for more complex architectures, but then how to learn?
1974 : Gradient back-propagation (Werbos)
no success !?!?
1986 : Gradient back-propagation bis (Rumelhart & McClelland, LeCun)
New neural network architectures
New applications:
Character recognition
Speech recognition and synthesis
Vision (image processing)
1990-2010 : Information society
New fields:
Web crawlers
Information extraction
Multimedia (indexing, ...)
Data mining
Need to combine many models and build adequate features
1992-1995 : Kernel methods
Support Vector Machine (Boser, Guyon and Vapnik)
2005 : Deep networks
Deep Belief Machine, DBM (Hinton and Salakhutdinov, 2006)
Deep Neural Network, DNN
Biological neuron
Figure: Scheme of a biological neuron [Wikimedia commons - M. R. Villarreal]
Formal neuron (1)
Origin
Warren McCulloch and Walter Pitts (1943), Frank Rosenblatt (1957),
Mathematical representation of a biological neuron
Schematic
Figure: Formal neuron with inputs x1, x2, ..., xm, weights w1, w2, ..., wm, bias b, a summation unit Σ and output y1
Formal neuron (2)
Formulation
$$\hat{y} = f(\langle w, x \rangle + b) \qquad (1)$$
where
x, input vector,
$\hat{y}$, output estimation,
w, weights linked to each input (model parameter),
b, bias (model parameter),
f, activation function.
Evaluation
Typical losses are
Classification
$$L(\hat{y}, y) = -\left( y \log(\hat{y}) + (1 - y) \log(1 - \hat{y}) \right)$$
Regression
$$L(\hat{y}, y) = \|y - \hat{y}\|^2$$
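The formal neuron and its two typical losses fit in a few lines. This is a sketch with a sigmoid activation; the input, weights and bias values are hypothetical and chosen so that the pre-activation is zero.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def neuron(x, w, b, f=sigmoid):
    """Formal neuron: activation f applied to <w, x> + b."""
    return f(sum(wi * xi for wi, xi in zip(w, x)) + b)

def cross_entropy(y_hat, y):
    """Classification loss for a target y in {0, 1} and an estimate y_hat in (0, 1)."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def squared_error(y_hat, y):
    """Regression loss for a scalar output."""
    return (y - y_hat) ** 2

x, w, b = [1.0, 2.0], [0.5, -0.25], 0.0   # here <w, x> + b = 0
y_hat = neuron(x, w, b)
print(y_hat, cross_entropy(y_hat, 1), squared_error(y_hat, 1))
```

The cross-entropy assumes the output lies strictly between 0 and 1, which the sigmoid guarantees for finite inputs.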
Formal neuron (3)
Activation functions are typically the step function, the sigmoid function (values in [0, 1]) or the hyperbolic tangent (values in [−1, 1]).
Figure: Sigmoid, f(x) = sigm(x)
Figure: Hyperbolic tangent, f(x) = tanh(x)
If the loss and the activation function are differentiable, the parameters w and b can be learned by gradient descent.
A perceptron
Figure: Perceptron with inputs x0 = 1, x1, x2, x3, sums S1, S2, activation f, outputs y1, y2 and weights w10, ..., w23
Let $x_i$ be input number i and $y_j$ be output number j:
$$S_j = \sum_i W_{ji}\, x_i$$
$$y_j = f(S_j)$$
with $W_{j0} = b_j$ and $x_0 = 1$.
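As the loss is differentiable, we can compute $\partial L / \partial y_j$.
$$\frac{\partial L}{\partial w_{ji}} = \frac{\partial L}{\partial y_j} \frac{\partial y_j}{\partial S_j} \frac{\partial S_j}{\partial w_{ji}}$$
$$\frac{\partial L}{\partial w_{ji}} = \frac{\partial L}{\partial y_j}\, f'(S_j)\, x_i$$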
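The chain rule above can be checked numerically. The sketch below assumes a sigmoid activation and a squared-error loss (both are choices made for the example) and compares one analytic gradient entry with a centered finite difference.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def forward(W, x):
    """Perceptron: S_j = sum_i W_ji x_i and y_j = f(S_j); x_0 = 1 carries the bias."""
    S = W @ x
    return S, sigmoid(S)

def loss(y, target):
    return np.sum((target - y) ** 2)

def gradient(W, x, target):
    """dL/dW_ji = dL/dy_j * f'(S_j) * x_i."""
    _, y = forward(W, x)
    dL_dy = 2.0 * (y - target)
    f_prime = y * (1.0 - y)          # derivative of the sigmoid at S_j
    return np.outer(dL_dy * f_prime, x)

rng = np.random.default_rng(0)
x = np.concatenate(([1.0], rng.normal(size=3)))   # x_0 = 1
W = rng.normal(size=(2, 4))
target = np.array([0.0, 1.0])

# Centered finite difference on a single weight.
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[1, 2] += eps
W_minus[1, 2] -= eps
numeric = (loss(forward(W_plus, x)[1], target)
           - loss(forward(W_minus, x)[1], target)) / (2 * eps)
print(gradient(W, x, target)[1, 2], numeric)
```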
Gradient descent : general algorithm
Input: Integer Nb: number of batches
Input: Boolean Sto: stochastic gradient?
Input: (Xtrain, Ytrain): training set
W ← random initialization
(Xsplit, Ysplit) ← split((Xtrain, Ytrain), Nb)
while stopping criterion not reached do
    if Sto then
        (Xsplit, Ysplit) ← randperm((Xsplit, Ysplit))
    end if
    for (Xbloc, Ybloc) ∈ (Xsplit, Ysplit) do
        ∆W ← 0
        for (x, y) ∈ (Xbloc, Ybloc) do
            ∆Wi ← ∆Wi + ∂L(x, W, y)/∂Wi   ∀i
        end for
        ∆W ← ∆W / card(Xbloc)
        W ← W − η ∆W
    end for
end while
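A possible Python transcription of this algorithm is sketched below. The stopping criterion (a fixed number of epochs), the learning rate `eta`, and the linear model with squared loss used in the example are assumptions; the slides leave them open.

```python
import numpy as np

def gradient_descent(X, Y, grad, n_batches, stochastic, eta=0.1, n_epochs=200, seed=0):
    """Mini-batch gradient descent; grad(W, x, y) returns dL/dW for one example."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=X.shape[1])              # W <- random initialization
    batches = np.array_split(np.arange(len(X)), n_batches)  # split into Nb blocks
    order = np.arange(n_batches)
    for _ in range(n_epochs):                  # stopping criterion: fixed epoch budget
        if stochastic:
            order = rng.permutation(n_batches)  # visit the blocks in a random order
        for k in order:
            delta = np.zeros_like(W)
            for i in batches[k]:
                delta += grad(W, X[i], Y[i])    # accumulate per-example gradients
            W = W - eta * delta / len(batches[k])
    return W

def squared_loss_grad(W, x, y):
    """Gradient of (y - <W, x>)^2 with respect to W."""
    return 2.0 * (W @ x - y) * x

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))
W_true = np.array([1.0, -2.0, 0.5])            # hypothetical noise-free linear model
Y = X @ W_true

W_hat = gradient_descent(X, Y, squared_loss_grad, n_batches=6, stochastic=True)
print(W_hat)
```

Setting `n_batches=1` gives plain batch gradient descent, while `n_batches=len(X)` with `stochastic=True` gives one update per example.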
Neural network
A perceptron can only solve linearly separable problems
Neural network
To solve more complex problems, we need to build a network of perceptrons
Principles
The network is a directed graph; each node represents a formal neuron,
Information follows the graph edges,
Computation is distributed over the nodes.
Multi-Layer Perceptron - Feed-forward network
Figure: Feed-forward network, with two layers and one hidden representation
Neurons are layered.
Computation always flows in one direction.
Recurrent network
At least one feedback loop
Hysteresis effect
Figure: Recurrent network
Figure: NARX Recurrent network
Multi-Layer Perceptron - Feed-forward network
Scheme of a Multi Layer Perceptron
Figure: Example of feed-forward network: a 2-layer perceptron
Formalism:
Layer: computational element,
Representation: data element.
This MLP has
an input layer and an output layer (2 layers),
an input, a hidden and an output representation (3 representations).
Estimation of y: Forward path
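Figure: Layer (l) with inputs $I_0^{(l)} = 1, I_1^{(l)}, I_2^{(l)}, I_3^{(l)}$, sums $S_1^{(l)}, S_2^{(l)}$, activation $f^{(l)}$, outputs $O_1^{(l)}, O_2^{(l)}$ and weights w10, ..., w23
Looking at layer (l), let $I_i^{(l)}$ be input number i and $O_j^{(l)}$ be output number j:
$$S_j^{(l)} = \sum_i W_{ji}^{(l)}\, I_i^{(l)}$$
$$O_j^{(l)} = f^{(l)}(S_j^{(l)}) = I_j^{(l+1)}$$
The recurrence starts with $I^{(0)} = x$ and finishes with $O^{(last)} = y$.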
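The forward recurrence can be sketched as a loop over layers. The layer sizes and the tanh activations below are hypothetical; column 0 of each weight matrix holds the biases, matching the convention $I_0^{(l)} = 1$.

```python
import numpy as np

def forward(weights, activations, x):
    """Forward path: I^(0) = x, O^(l) = f^(l)(W^(l) [1, I^(l)]), I^(l+1) = O^(l)."""
    I = x
    for W, f in zip(weights, activations):
        I_ext = np.concatenate(([1.0], I))   # prepend I_0 = 1 for the bias column
        S = W @ I_ext                        # S_j = sum_i W_ji I_i
        I = f(S)                             # output of layer l feeds layer l + 1
    return I

rng = np.random.default_rng(0)
# Hypothetical 2-layer perceptron: 4 inputs, 3 hidden units, 2 outputs.
weights = [rng.normal(size=(3, 4 + 1)), rng.normal(size=(2, 3 + 1))]
activations = [np.tanh, np.tanh]
y = forward(weights, activations, rng.normal(size=4))
print(y)
```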
How to learn parameters ? Gradient back-propagation
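Assume that $\partial L / \partial O_j^{(l)}$ is known.
$$\frac{\partial L}{\partial w_{ji}^{(l)}} = \frac{\partial L}{\partial O_j^{(l)}} \frac{\partial O_j^{(l)}}{\partial S_j^{(l)}} \frac{\partial S_j^{(l)}}{\partial w_{ji}^{(l)}}$$
$$\frac{\partial L}{\partial w_{ji}^{(l)}} = \frac{\partial L}{\partial O_j^{(l)}}\, f'^{(l)}(S_j^{(l)})\, I_i^{(l)}$$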
Now we compute $\partial L / \partial I_i^{(l)}$:
$$\frac{\partial L}{\partial I_i^{(l)}} = \sum_j \frac{\partial L}{\partial O_j^{(l)}} \frac{\partial O_j^{(l)}}{\partial I_i^{(l)}}$$
$$\frac{\partial L}{\partial I_i^{(l)}} = \sum_j \frac{\partial L}{\partial O_j^{(l)}} \frac{\partial O_j^{(l)}}{\partial S_j^{(l)}} \frac{\partial S_j^{(l)}}{\partial I_i^{(l)}}$$
$$\frac{\partial L}{\partial I_i^{(l)}} = \sum_j \frac{\partial L}{\partial O_j^{(l)}}\, f'^{(l)}(S_j^{(l)})\, w_{ji}^{(l)}$$
Start:
$$\frac{\partial L}{\partial O_j^{(last)}} = \frac{\partial L}{\partial y_j}$$
Backward recurrence:
$$\frac{\partial L}{\partial w_{ji}^{(l)}} = \frac{\partial L}{\partial O_j^{(l)}}\, f'^{(l)}(S_j^{(l)})\, I_i^{(l)}$$
$$\frac{\partial L}{\partial I_i^{(l)}} = \sum_j \frac{\partial L}{\partial O_j^{(l)}}\, f'^{(l)}(S_j^{(l)})\, w_{ji}^{(l)}$$
$$\frac{\partial L}{\partial O_i^{(l-1)}} = \frac{\partial L}{\partial I_i^{(l)}}$$
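The backward recurrence can be sketched for a small MLP and verified against a finite difference. The tanh activations, the squared-error loss and the layer sizes are assumptions made for the example.

```python
import numpy as np

def forward(weights, x):
    """Return the inputs I^(l) (with I_0 = 1) and outputs O^(l) of every tanh layer."""
    inputs, outputs = [], []
    I = x
    for W in weights:
        I_ext = np.concatenate(([1.0], I))
        O = np.tanh(W @ I_ext)
        inputs.append(I_ext)
        outputs.append(O)
        I = O
    return inputs, outputs

def loss(weights, x, target):
    return np.sum((forward(weights, x)[1][-1] - target) ** 2)

def backward(weights, x, target):
    """Backward recurrence: returns dL/dW^(l) for every layer."""
    inputs, outputs = forward(weights, x)
    dL_dO = 2.0 * (outputs[-1] - target)          # start: dL/dO^(last) = dL/dy
    grads = [None] * len(weights)
    for l in reversed(range(len(weights))):
        delta = dL_dO * (1.0 - outputs[l] ** 2)   # dL/dO_j * f'(S_j); tanh' = 1 - tanh^2
        grads[l] = np.outer(delta, inputs[l])     # dL/dw_ji = delta_j * I_i
        dL_dI = weights[l].T @ delta              # dL/dI_i = sum_j delta_j * w_ji
        dL_dO = dL_dI[1:]                         # skip I_0 = 1: dL/dO^(l-1) = dL/dI^(l)
    return grads

rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 5)), rng.normal(size=(2, 4))]
x, target = rng.normal(size=4), np.array([0.5, -0.5])
grads = backward(weights, x, target)

# Centered finite difference on one weight of the first layer.
eps = 1e-6
plus = [W.copy() for W in weights]
minus = [W.copy() for W in weights]
plus[0][1, 2] += eps
minus[0][1, 2] -= eps
numeric = (loss(plus, x, target) - loss(minus, x, target)) / (2 * eps)
print(grads[0][1, 2], numeric)
```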
Deep Neural Networks
Deep architecture
Figure: Deep architecture with inputs x1, ..., x5 and output y1
Why?
Some problems need an exponential number of neurons in the hidden representation,
Build / extract features inside the NN in order not to rely on handmade extraction (human prior).
The vanishing gradient problem
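Figure: Hyperbolic tangent, f(x) = tanh(x)
$$\frac{\partial L}{\partial I_i^{(l)}} = \sum_j \frac{\partial L}{\partial O_j^{(l)}}\, f'^{(l)}(S_j^{(l)})\, w_{ji}^{(l)}$$
When neurons at higher layers are saturated, $f'^{(l)}(S_j^{(l)})$ is close to zero, so the gradient passed down to the lower layers decreases toward zero.
Solution
Better topology, better initialization of the weights,
Regularization!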
Convolutional network
A unit on representation (l) is connected to a sub-slice of o units from representation (l − 1). All the weights between units are tied, leading to only o weights. Warning: biases are not tied.
If representation (l − 1) is in $\mathbb{R}^m$ and (l) is in $\mathbb{R}^n$, the number of parameters goes from
$$(m + 1) \cdot n \;\to\; (o + 1) \cdot n$$
and, because the o weights are shared by all n units, only o + n of these values are distinct.
Figure: 1D convolutional network
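A 1D layer with tied weights can be sketched as a sliding dot product. The sketch assumes no padding and a stride of one, so that n = m − o + 1; this relation and the tanh activation are assumptions, not statements from the slides. The parameter counts follow the text: o shared weights plus n untied biases.

```python
import numpy as np

def conv1d_layer(x, w, b, f=np.tanh):
    """Unit j sees the slice x[j : j + o]; the o weights w are shared, the biases b are not."""
    o, n = len(w), len(b)
    assert len(x) == n + o - 1, "no padding, stride 1: n = m - o + 1"
    S = np.array([w @ x[j:j + o] for j in range(n)]) + b
    return f(S)

m, o = 8, 3
n = m - o + 1
rng = np.random.default_rng(0)
x, w, b = rng.normal(size=m), rng.normal(size=o), rng.normal(size=n)
h = conv1d_layer(x, w, b)

dense_params = (m + 1) * n      # fully connected layer with biases
tied_params = o + n             # o shared weights + n untied biases
print(h.shape, dense_params, tied_params)
```

The explicit loop is equivalent to `np.correlate(x, w, mode="valid")` followed by the bias and the activation.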
Convolutional network : 2D example
Figure: [LeCun 2010]
LeCun, Y. (1989). Generalization and network design strategies. Connections in Perspective. North-Holland, Amsterdam, 143-55.
LeCun, Y., Kavukcuoglu, K., & Farabet, C. (2010, May). Convolutional networks and applications in vision. In Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium on (pp. 253-256). IEEE.
Better initialization through unsupervised learning
The learning is split into two steps:
Pre-training
An unsupervised pre-training of the input layers with auto-encoders. Intuition: learning the manifold where the input data resides. It can take an unlabelled dataset into account.
Finetuning
A finetuning of the whole network with supervised back-propagation.
Hinton, G. E., Osindero, S. and Teh, Y. (2006) A fast learning algorithm for deep belief nets. Neural Computation, 18, pp 1527-1554
Hinton, G. E. and Salakhutdinov, R. R. (2006) Reducing the dimensionality of data with neural networks. Science, Vol. 313, no. 5786, pp. 504-507, 28 July 2006.
Diabolo network, Autoencoders
Autoencoders are neural networks in which the input and output representations have the same number of units. The learned target is the input itself.
Figure: Diabolo network, with inputs x1, ..., x5, hidden units h1, h2 and reconstructed outputs x1, ..., x5
With 2 layers:
The input layer is called the encoder,
The output layer, the decoder.
Tied weights $W_{dec} = W_{enc}^\top$, convergence? PCA?
Undercomplete, size(h) < size(x)
Overcomplete, size(x) < size(h).
Building from auto-encoders
Autoencoders are stacked and learned layer by layer. When a layer is pre-trained, its weights are fixed until the finetuning.
Figure: Stacked auto-encoders built step by step: a first auto-encoder on x1, ..., x5 gives h1,1, ..., h1,5; a second auto-encoder on that representation gives h2,1, h2,2, h2,3; the final network maps x1, ..., x5 to y1
Simplified stacked AE Algorithm
Input: X, a training feature set of size Nb_examples × Nb_features
Input: Y, a corresponding training label set of size Nb_examples × Nb_labels
Input: N_input, the number of input layers to be pre-trained
Input: N_output, the number of output layers to be pre-trained
Input: N, the number of layers in the IODA, N_input + N_output < N
Output: [w1, w2, ..., wN], the parameters for all the layers
Randomly initialize [w1, w2, ..., wN]
Input pre-training
R ← X
for i ← 1..N_input do
    Train an AE on R and keep its encoding parameters
    [wi, wdummy] ← MLPTRAIN([wi, wiᵀ], R, R)
    Drop wdummy
    R ← MLPFORWARD([wi], R)
end for
Final supervised learning
[w1, w2, ..., wN] ← MLPTRAIN([w1, w2, ..., wN], X, Y)
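The input pre-training loop can be sketched in NumPy as follows. This is an illustration under several assumptions: a sigmoid encoder, a linear decoder initialized with the transposed encoder weights, a mean squared reconstruction error, plain batch gradient descent, and hypothetical names (`train_autoencoder`, `pretrain_input_layers`) standing in for MLPTRAIN and MLPFORWARD. The final supervised finetuning is left out.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def train_autoencoder(R, n_hidden, rng, eta=0.1, n_iter=500):
    """Train a 2-layer AE on R (target = input) and return its encoder (W, b) only."""
    n, d = R.shape
    W = rng.normal(scale=0.1, size=(n_hidden, d))
    b = np.zeros(n_hidden)
    V, c = W.T.copy(), np.zeros(d)          # decoder initialized with the transposed encoder
    for _ in range(n_iter):
        H = sigmoid(R @ W.T + b)            # encoding
        E = 2.0 * (H @ V.T + c - R) / n     # gradient of the mean squared reconstruction error
        dS = (E @ V) * H * (1.0 - H)        # back-propagated through the sigmoid
        V, c = V - eta * E.T @ H, c - eta * E.sum(axis=0)
        W, b = W - eta * dS.T @ R, b - eta * dS.sum(axis=0)
    return W, b                             # the decoder (w_dummy) is dropped

def pretrain_input_layers(X, hidden_sizes, rng):
    """Greedy layer-wise pre-training: each AE is trained on the previous representation."""
    layers, R = [], X
    for n_hidden in hidden_sizes:
        W, b = train_autoencoder(R, n_hidden, rng)
        layers.append((W, b))
        R = sigmoid(R @ W.T + b)            # forward pass: representation fed to the next AE
    return layers, R

rng = np.random.default_rng(0)
X = rng.random(size=(200, 5))               # hypothetical unlabelled features in [0, 1)
layers, R = pretrain_input_layers(X, hidden_sizes=[4, 3], rng=rng)
print([W.shape for W, _ in layers], R.shape)
```

The returned `layers` would serve as the initial values of the first weight matrices before supervised back-propagation on (X, Y).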
Improve optimization by adding noise 1/3
Denoising (undercomplete) auto-encoders
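The auto-encoder is learned from $\tilde{x}$, a disturbed version of x; the target is still x.
Figure: Denoising auto-encoder: the input x1, ..., x5 goes through a disturbance to give $\tilde{x}$, which is encoded into h1, h2 and decoded to reconstruct x1, ..., x5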
Improve optimization by adding noise 2/3
Prevent co-adaptation in (overcomplete) autoencoders
During training, randomly disconnect hidden units.
Figure: Overcomplete auto-encoder with inputs x1, x2, x3, hidden units h1, ..., h5 and reconstructed outputs x1, x2, x3; hidden units are randomly disconnected during training
Figure: MNIST [Hinton 2012]
Improve optimization by adding noise 3/3
Dropout
During training, at each iteration, randomly disconnect weights with probability p.
At test time, multiply the weights by (# actual disconnections) / (# iterations) (≠ p).
Figure: Dropout on a network with inputs x1, ..., x5 and output y1
Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580.
Figure: Reuters dataset
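A minimal sketch of the idea on a single linear layer is given below. It disconnects individual weights with probability p during training and, at test time, scales the weights by the keep probability 1 − p. That rescaling is the common convention and an assumption here; it is not necessarily the exact empirical ratio mentioned on the slide. The function names are hypothetical.

```python
import numpy as np

def dropout_forward(W, x, p, rng):
    """Training-time pass: each weight is disconnected independently with probability p."""
    mask = rng.random(W.shape) >= p      # True (kept) with probability 1 - p
    return (W * mask) @ x

def inference_forward(W, x, p):
    """Test-time pass: all weights are kept but scaled by the keep probability 1 - p."""
    return ((1.0 - p) * W) @ x

rng = np.random.default_rng(0)
W, x, p = rng.normal(size=(2, 5)), rng.normal(size=5), 0.5

# Averaged over many random masks, the training-time output approaches the test-time output.
mean_train = np.mean([dropout_forward(W, x, p, rng) for _ in range(20000)], axis=0)
print(mean_train, inference_forward(W, x, p))
```

For a linear layer the match is exact in expectation; with non-linear activations the rescaled network is only an approximation of the average over masks.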
Tikhonov regularization scheme
Noise and early stopping are connected to regularization. So why not use the Tikhonov regularization scheme?
$$J = \sum_i L(y_i, f(x_i; w)) + \lambda\, \Omega(w)$$
Notation
2-layer MLP
$$\hat{y} = f_{MLP}(x; w_{in}, w_{out}) = f_{out}(b_{out} + w_{out}\, f_{in}(b_{in} + w_{in}\, x))$$
AE
$$\hat{x} = f_{AE}(x; w_{enc}, w_{dec}) = f_{dec}(b_{dec} + w_{dec}\, f_{enc}(b_{enc} + w_{enc}\, x))$$
Tied weights
$$w_{in} \leftrightarrow w_{enc}, \qquad w_{dec} \leftrightarrow w_{enc}^\top$$
Bishop, C. M. (1995). Training with noise is equivalent to Tikhonov regularization. Neural computation, 7(1), 108-116
Collobert, R. and Bengio, S. (2004). Links between perceptrons, MLPs and SVMs. In ICML’2004
Regularization on weights
$$J = \sum_i L(y_i, f_{MLP}(x_i; w)) + \lambda\, \Omega(w_{out})$$
It is enough to regularize output-layer weights.
L2 (Gaussian prior):
$$\Omega(w_{out}) = \sum_d \|w_d\|^2$$
L1 (Laplace prior):
$$\Omega(w_{out}) = \sum_d |w_d|$$
t-Student:
$$\Omega(w_{out}) = \sum_d \log(1 + w_d^2)$$
With infinite units,
L1 : boosting
L2 : SVM
Bengio, Y., Roux, N. L., Vincent, P., Delalleau, O., & Marcotte, P. (2005). Convex neural networks. In Advances in neural information processing
systems (pp. 123-130)
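The three penalties and the regularized objective can be written directly. This sketch takes the data-fit term as a precomputed number; the example weight vector is hypothetical.

```python
import numpy as np

def l2_penalty(w):
    """Sum of squared weights (Gaussian prior)."""
    return np.sum(w ** 2)

def l1_penalty(w):
    """Sum of absolute weights (Laplace prior)."""
    return np.sum(np.abs(w))

def student_penalty(w):
    """Sum of log(1 + w_d^2) (Student-t prior)."""
    return np.sum(np.log(1.0 + w ** 2))

def objective(data_loss, w_out, lam, penalty):
    """J = sum_i L(...) + lambda * Omega(w_out); only the output weights are penalized."""
    return data_loss + lam * penalty(w_out)

w_out = np.array([0.0, 1.0, -2.0])
print(l2_penalty(w_out), l1_penalty(w_out), student_penalty(w_out))
print(objective(data_loss=1.5, w_out=w_out, lam=0.1, penalty=l1_penalty))
```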
Contractive autoencoder 1/2
Figure: Input manifold
The AE must be sensitive to the [blue] direction to reconstruct well.
It can be insensitive to the [orange] direction.
Contractive autoencoder 2/2
The autoencoder should:
reconstruct correctly x which lies on the input manifold
$$\sum_i L(x_i, f_{AE}(x_i; w_{enc}))$$
be insensitive to small changes on x outside the manifold (i.e. project on the manifold) ⇒ penalize by the Jacobian
$$\|J_{f_{enc}}(x; w_{enc})\|_F^2 = \sum_{ij} \left( \frac{\partial f_j(x; w_{enc})}{\partial x_i} \right)^2$$
Objective function
$$J = \sum_i L(x_i, f_{AE}(x_i; w_{enc})) + \lambda\, \|J_{f_{enc}}(x; w_{enc})\|_F^2$$
Rifai, S., Vincent, P., Muller, X., Glorot, X., and Bengio, Y. (2011). Contractive auto-encoders: Explicit invariance during feature extraction. In
Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 833–840
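For a sigmoid encoder h = sigm(W x + b), which is an assumption made for this sketch, the Jacobian has the closed form $\partial h_j / \partial x_i = h_j (1 - h_j) W_{ji}$, so the penalty needs no explicit Jacobian matrix. The snippet checks it against a finite-difference Jacobian.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def encoder(x, W, b):
    return sigmoid(W @ x + b)

def contractive_penalty(x, W, b):
    """||J||_F^2 for h = sigmoid(W x + b), using dh_j/dx_i = h_j (1 - h_j) W_ji."""
    h = encoder(x, W, b)
    return np.sum((h * (1.0 - h)) ** 2 * np.sum(W ** 2, axis=1))

def numeric_penalty(x, W, b, eps=1e-6):
    """Same quantity from a centered finite-difference Jacobian, as a sanity check."""
    columns = [(encoder(x + eps * e, W, b) - encoder(x - eps * e, W, b)) / (2 * eps)
               for e in np.eye(len(x))]
    J = np.stack(columns, axis=1)        # J[j, i] = dh_j / dx_i
    return np.sum(J ** 2)

rng = np.random.default_rng(0)
W, b, x = rng.normal(size=(3, 5)), rng.normal(size=3), rng.normal(size=5)
print(contractive_penalty(x, W, b), numeric_penalty(x, W, b))
```

Saturated hidden units (h close to 0 or 1) contribute almost nothing to the penalty, which is how the encoder becomes insensitive to some input directions.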
Regularization brought by multi-task learning / embedding
Combine multiple tasks in the same optimization problem. Tasks share parameters.
$$J = \lambda_L \sum_{i \in L} L(y_i, f_{MLP}(x_i; w_{out}, w_{in})) + \lambda_U \sum_{i \in L \cup U} L(x_i, f_{AE}(x_i; w_{in})) + \lambda_\Omega\, \Omega(w_{out}) + \lambda_J\, \|J_{f_{in}}(x; w_{in})\|_F^2 + \dots$$
Mix supervised and unsupervised data.
Weston, J., Ratle, F., and Collobert, R. Deep learning via semi-supervised embedding. ICML, pages 1168–1175, 2008
Extension to structured output
Structured output
Ad-hoc definition
Data that consists of several parts, and not only the parts themselves contain information, but also the way in which the parts belong together. (Christoph Lampert)
Automatic transcription
Automatic translation
Point matching
Image labeling (semantic image segmentation)
Landmark detection
Input/Output Deep Architecture (IODA)
Learn output dependencies the same way a DNN learns input dependencies.
B. Labbe, R. Herault & C. Chatelain Learning Deep Neural Networks for High Dimensional Output Problems. In IEEE International Conference on
Machine Learning and Applications, 2009 (pp. 63-68).
J. Lerouge, R. Herault, C. Chatelain, F. Jardin, R. Modzelewski, IODA: An input/output deep architecture for image labeling, Pattern Recognition,
Available online 27 March 2015, ISSN 0031-3203
The image labeling problem
Figure: Input and target images for the Toy and Sarcopenia datasets
Input/Output Deep Architecture (IODA) for Image Labeling
Figure: The IODA architecture. It directly links the pixel matrix to the label matrix. The input layers (left, light) are pre-trained to provide a high-level representation of the image pixels, while the output layers (right, dark) are pre-trained to learn the a priori knowledge of the problem.
Simplified IODA Algorithm 1/2
Input: X, a training feature set of size Nb_examples × Nb_features
Input: Y, a corresponding training label set of size Nb_examples × Nb_labels
Input: N_input, the number of input layers to be pre-trained
Input: N_output, the number of output layers to be pre-trained
Input: N, the number of layers in the IODA, N_input + N_output < N
Output: [w1, w2, ..., wN], the parameters for all the layers
Randomly initialize [w1, w2, ..., wN]
Input pre-training
R ← X
for i ← 1..N_input do
    Train an AE on R and keep its encoding parameters
    [wi, wdummy] ← MLPTRAIN([wi, wiᵀ], R, R)
    Drop wdummy
    R ← MLPFORWARD([wi], R)
end for
Simplified IODA Algorithm 2/2
Output pre-training
R ← Y
for i ← N..N − N_output + 1 step −1 do
    Train an AE on R and keep its decoding parameters
    [u, wi] ← MLPTRAIN([wiᵀ, wi], R, R)
    R ← MLPFORWARD([u], R)
    Drop u
end for
Final supervised learning
[w1, w2, ..., wN] ← MLPTRAIN([w1, w2, ..., wN], X, Y)
Qualitative results 1/3
Figure: Evolution of the output image of the architecture according to the number of batch gradient descent iterations for the three learning strategies, using the validation example #10. Rows: NDA, IDA, IODA. Columns: iterations 10, 100, 200, 300.
Qualitative results 2/3
(a) CT image (b) Ground truth
(c) Chung (d) IODA
Figure: Non-sarcopenic patient
Qualitative results 3/3
(a) CT image (b) Ground truth
(c) Chung (d) IODA
Figure: Sarcopenic patient
Quantitative results
Architecture (X, r1, r2, Y)    Train error    Test error
128², 2048, 2048, 128²         2.64e-02       3.48e-02
128², 1024, 1024, 128²         3.11e-02       3.91e-02
128², 2048, 2048, 128²         3.86e-02       4.59e-02
128², 1024, 1024, 128²         4.44e-02       5.13e-02
128², 2048, 2048, 128²         5.20e-02       5.75e-02
128², 1024, 1024, 128²         6.29e-02       6.77e-02
128², 2048, 2048, 128²         6.30e-02       6.79e-02
128², 1024, 1024, 128²         7.09e-02       7.55e-02
128², 2048, 2048, 128²         9.03e-02       9.40e-02
128², 1024, 1024, 128²         1.03e-01       1.06e-01
Rows with the same architecture differ in how each layer is initialized: input pre-training, no pre-training, or output pre-training.
Table: Toy dataset: 3-layer MLP

Method    Diff. (%)    Jaccard (%)
Chung     -10.6        60.3
NDA       0.12         85.88
IDA       0.15         85.91
IODA      3.37         88.47
Table: Sarcopenia.
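For readers unfamiliar with the Jaccard column, the sketch below computes the standard Jaccard index (intersection over union) between a predicted and a ground-truth label mask. It assumes the table uses this standard definition; the tiny masks are hypothetical.

```python
import numpy as np

def jaccard_index(pred_mask, true_mask):
    """|A ∩ B| / |A ∪ B| for two boolean label masks of the same shape."""
    pred_mask = np.asarray(pred_mask, dtype=bool)
    true_mask = np.asarray(true_mask, dtype=bool)
    union = np.logical_or(pred_mask, true_mask).sum()
    if union == 0:
        return 1.0   # convention chosen here: two empty masks agree perfectly
    return np.logical_and(pred_mask, true_mask).sum() / union

pred = np.array([[1, 1, 0], [0, 1, 0]])
true = np.array([[1, 0, 0], [0, 1, 1]])
print(100 * jaccard_index(pred, true))   # expressed in percent, as in the table
```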
Why not using multi-tasking + Tikhonov schemes ?
Notation
3-layer MLP
$$\hat{y} = f_{MLP}(x; w_{in}, w_{link}, w_{out}) = f_{out}(b_{out} + w_{out}\, f_{link}(b_{link} + w_{link}\, f_{in}(b_{in} + w_{in}\, x)))$$
Input AE
$$\hat{x} = f_{AE_i}(x; w_{in}) = f_{dec}(b_{dec} + w_{in}^\top f_{enc}(b_{enc} + w_{in}\, x))$$
Output AE
$$\hat{y} = f_{AE_o}(y; w_{out}) = f_{dec}(b'_{dec} + w_{out}\, f_{enc}(b'_{enc} + w_{out}^\top y))$$
Objective function
$$J = \lambda_L \sum_{i \in L} L(y_i, f_{MLP}(x_i; w_{in}, w_{link}, w_{out})) + \lambda_U \sum_{i \in L \cup U} L(x_i, f_{AE_i}(x_i; w_{in})) + \lambda_{L'} \sum_{i \in L} L(y_i, f_{AE_o}(y_i; w_{out})) + \lambda_\Omega\, \Omega(w_{link})$$
Submitted to ECML: Input/Output Deep Architecture for Structured Output Problems, Soufiane Belharbi, Clement Chatelain, Romain Herault and Sebastien Adam, arXiv:1504.07550
Facial landmark detection problem
Competition i-bug: http://ibug.doc.ic.ac.uk/resources/300-W_IMAVIS
Images from:
Zhanpeng Zhang, Ping Luo, Chen Change Loy, Xiaoou Tang. Learning and Transferring Multi-task Deep Representation for Face Alignment. Technical report, arXiv:1408.3967, 2014
Facial landmark detection, some results
Figure: Early results on facial landmark detection
Questions ?
References
Bengio, Y., Courville, A., & Vincent, P. (2013). Representation Learning: A Review and New Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 1798-1828. arXiv:1206.5538.
Bengio, Y. (2012). Practical recommendations for gradient-based training of deep architectures. In Neural Networks: Tricks of the Trade (pp. 437-478). Springer Berlin Heidelberg.
Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580.
LeCun, Y., Kavukcuoglu, K., & Farabet, C. (2010, May). Convolutional networks and applications in vision. In Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium on (pp. 253-256). IEEE.
Lerouge, J., Herault, R., Chatelain, C., Jardin, F., & Modzelewski, R. (2015). IODA: An input/output deep architecture for image labeling. Pattern Recognition. Available online 27 March 2015, ISSN 0031-3203, http://dx.doi.org/10.1016/j.patcog.2015.03.017.
Hugo Larochelle lectures: http://info.usherbrooke.ca/hlarochelle/cours/ift725_A2013/contenu.html