Restricted Boltzmann Machine and Deep Belief Net
1
Restricted Boltzmann Machine and Deep Belief Net
Wanli Ouyang, [email protected]
Animations are available for illustration
2
Outline
• Short introduction on deep learning
• Short introduction on statistical models and graphical models
• Restricted Boltzmann Machine (RBM) and contrastive divergence
• Deep belief net (DBN)

Deep belief nets are trained using RBMs and CD.
RBMs and DBNs are statistical models.
The deep belief net gives an unsupervised training algorithm for deep neural networks.
3
Good learning resources
• Webpages:
– Geoffrey E. Hinton's readings (with source code available for DBN): http://www.cs.toronto.edu/~hinton/csc2515/deeprefs.html
– Notes on Deep Belief Networks: http://www.quantumg.net/dbns.php
– MLSS Tutorial, October 2010, ANU Canberra, Marcus Frean: http://videolectures.net/mlss2010au_frean_deepbeliefnets/
– Deep Learning Tutorials: http://deeplearning.net/tutorial/
– Hinton's tutorial: http://videolectures.net/mlss09uk_hinton_dbn/
– Fergus's tutorial: http://cs.nyu.edu/~fergus/presentations/nips2013_final.pdf
– CUHK MMlab project: http://mmlab.ie.cuhk.edu.hk/project_deep_learning.html
• People:
– Geoffrey E. Hinton: http://www.cs.toronto.edu/~hinton
– Andrew Ng: http://www.cs.stanford.edu/people/ang/index.html
– Ruslan Salakhutdinov: http://www.utstat.toronto.edu/~rsalakhu/
– Yee-Whye Teh: http://www.gatsby.ucl.ac.uk/~ywteh/
– Yoshua Bengio: www.iro.umontreal.ca/~bengioy
– Yann LeCun: http://yann.lecun.com/
– Marcus Frean: http://ecs.victoria.ac.nz/Main/MarcusFrean
– Rob Fergus: http://cs.nyu.edu/~fergus/pmwiki/pmwiki.php
• Acknowledgement
– Many materials in this presentation are from these papers and tutorials (especially Hinton's and Frean's); apologies for not listing them all in full detail.
Dumitru Erhan, Aaron Courville, Yoshua Bengio. Understanding Representations Learned in Deep Architectures. Technical report.
Neural networks and back-propagation (1986)
• Solve general learning problems
• Tied with biological systems
But they were given up on:
• Hard to train
• Insufficient computational resources
• Small training sets
• Did not work well
In the meantime: SVM, boosting, decision trees, KNN, …
• Loose tie with biological systems
• Shallow models
• Specific methods for specific tasks, with hand-crafted features (GMM-HMM, SIFT, LBP, HOG); see Kruger et al., TPAMI'13
Deep belief net (Science, 2006)
• Unsupervised, layer-wise pre-training
• Better designs for modeling and training (normalization, nonlinearity, dropout)
• Feature learning
• New developments in computer architectures
– GPU
– Multi-core computer systems
• Large-scale databases
Deep learning results: speech recognition (2011); "How many computers to identify a cat?" 16,000 CPU cores (2012); object recognition over 1,000,000 images and 1,000 categories with 2 GPUs (2012).
5
Outline
• Short introduction on deep learning
• Short introduction on statistical models and graphical models
• Restricted Boltzmann Machine (RBM) and contrastive divergence
• Deep belief net (DBN)
6
Graphical model for statistics
• Expresses conditional independence between random variables.
• Given C, A and B are independent:
  P(A, B|C) = P(A|C)P(B|C)
• P(A,B,C) = P(A, B|C) P(C) = P(A|C)P(B|C)P(C)
• Each node is conditionally independent of its non-descendants given the values of its parents.
http://www.eecs.qmul.ac.uk/~norman/BBNs/Independence_and_conditional_independence.htm
[Figure: parent node C = "Smoker?" with children A = "Has lung cancer" and B = "Has bronchitis"]
(a numeric check of this factorization follows)
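To make this concrete, here is a minimal numeric sketch (all probability values are made up for illustration) that builds the joint P(A,B,C) = P(A|C)P(B|C)P(C) and verifies the conditional independence P(A,B|C) = P(A|C)P(B|C):

```python
import numpy as np

# Hypothetical conditional probability tables for the smoker example.
p_C = np.array([0.7, 0.3])                  # P(C): smoker? no / yes
p_A_given_C = np.array([[0.99, 0.01],       # P(A|C): lung cancer given C
                        [0.90, 0.10]])
p_B_given_C = np.array([[0.95, 0.05],       # P(B|C): bronchitis given C
                        [0.80, 0.20]])

# Joint P(A,B,C) = P(A|C) P(B|C) P(C), indexed as [c, a, b].
joint = p_C[:, None, None] * p_A_given_C[:, :, None] * p_B_given_C[:, None, :]

for c in range(2):
    p_AB_given_C = joint[c] / joint[c].sum()   # P(A,B | C=c)
    p_A = p_AB_given_C.sum(axis=1)             # P(A | C=c)
    p_B = p_AB_given_C.sum(axis=0)             # P(B | C=c)
    assert np.allclose(p_AB_given_C, np.outer(p_A, p_B))
print("Given C, A and B are independent")
```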
7
Directed and undirected graphical models
• Directed graphical model
– P(A,B,C) = P(A|C)P(B|C)P(C)
– Each node is conditionally independent of its non-descendants given the values of its parents.
– With a fourth node D: P(A,B,C,D) = P(D|A,B)P(B|C)P(A|C)P(C)
• Undirected graphical model
– P(A,B,C) ∝ f1(B,C) f2(A,C), a product of potential functions over the edges
– Also called a Markov Random Field (MRF)
[Figure: directed graphs C → A, C → B (and A, B → D), and the corresponding undirected graph with edges A–C and B–C]
8
Modeling undirected models
• Probability:

  P(x; θ) = f(x; θ) / Z(θ),   Z(θ) = Σx f(x; θ)   (partition function)

  so that Σx P(x; θ) = 1.

• Example (the undirected factorization P(A,B,C) ∝ f1(B,C) f2(A,C), with one weight per edge):

  P(A,B,C) = exp(w1·AC + w2·BC) / Z(w1, w2),   Z(w1, w2) = ΣA,B,C exp(w1·AC + w2·BC)

[Figure: three-node MRF with edges A–C (weight w1) and B–C (weight w2); node labels "Is smoker?", "Is healthy", "Has lung cancer"]
(a brute-force sketch of Z follows)
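A small sketch of this example with assumed weights w1 and w2, computing the partition function Z by enumerating all 2^3 joint states:

```python
import itertools
import numpy as np

w1, w2 = 1.5, 0.5    # assumed edge weights, for illustration only

def f(a, b, c):
    """Unnormalized potential exp(w1*A*C + w2*B*C)."""
    return np.exp(w1 * a * c + w2 * b * c)

# Partition function: sum the potential over all joint states.
Z = sum(f(a, b, c) for a, b, c in itertools.product([0, 1], repeat=3))

P = {s: f(*s) / Z for s in itertools.product([0, 1], repeat=3)}
assert abs(sum(P.values()) - 1.0) < 1e-12   # probabilities sum to one
print(P[(1, 1, 1)])                          # e.g. P(A=1, B=1, C=1)
```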
9
More directed and undirected models
[Figure: a hidden Markov model (hidden chain h1, h2, h3 with observations y1, y2, y3) and an MRF in 2D (a 3×3 grid of nodes A to I with edges between neighbors)]
10
More directed and undirected models
• Directed: P(A,B,C,D) = P(A)P(B)P(C|B)P(D|A,B,C)
• Hidden Markov model:
  P(y1, y2, y3, h1, h2, h3) = P(h1)P(h2|h1)P(h3|h2)P(y1|h1)P(y2|h2)P(y3|h3)
11
More directed and undirected models
[Figure: (a) HMM; (b) RBM, with visible layer v and hidden layer h connected by weights W; (c) DBN, with layers x, h1, h2, h3 and weights W0, W1, W2; and our deep model, with layers x, h1, h2, h3 and weights W1, W2]
12
Extended reading on graphical models
• Zoubin Ghahramani's video lecture on graphical models:
  http://videolectures.net/mlss07_ghahramani_grafm/
13
Outline
• Short introduction on deep learning
• Short introduction on statistical models and graphical models
• Restricted Boltzmann machine and contrastive divergence
– Product of experts
– Contrastive divergence
– Restricted Boltzmann Machine
• Deep belief net
(Contrastive divergence: a training algorithm for products of experts.)
14
Outline
• Short introduction on deep learning
• Short introduction on statistical models and graphical models
• Restricted Boltzmann machine and contrastive divergence
– Product of experts
– Contrastive divergence
– Restricted Boltzmann Machine
• Deep belief net
(Restricted Boltzmann Machine: a specific, useful case of a product of experts.)
15
Outline
• Short introduction on deep learning
• Short introduction on statistical models and graphical models
• Restricted Boltzmann machine and contrastive divergence
– Product of experts
– Contrastive divergence
– Restricted Boltzmann Machine
• Deep belief net

Product of Experts
16
P(x; θ) = Πm fm(x; θm) / Z(θ) = Πm e^(−Em(x; θm)) / Σx Πm e^(−Em(x; θm))

Z(θ) = Σx Πm fm(x; θm)   (partition function)

Em(x; θm) = −log fm(x; θm)   (energy function)

• The MRF in 2D is a product of experts, with one expert per edge:

  E(x; w) = w1·AB + w2·BC + w3·AD + w4·BE + w5·CF + …

[Figure: 3×3 grid MRF with nodes A, B, C / D, E, F / G, H, I]
Product of Experts
• Example: a product of 15 experts,

  P(x; θ) ∝ Π(i=1..15) fi(x; ui, ci),   fi(x; ui, ci) = 1 + e^(ci + ui^T x)

  (each expert has the form contributed by one hidden unit of an RBM when it is summed out; see the RBM slides)
18
Products of experts versus mixture models
• Products of experts:
– "and" operation
– Sharper than a mixture
– Each expert can constrain a different subset of dimensions.
• Mixture model, e.g. Gaussian mixture model:
– "or" operation
– A weighted sum of many density functions:

  P(x; θ) = Σm αm fm(x; θm),   Σm αm = 1

(a numeric comparison follows)
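A quick numeric illustration of "and" versus "or", using two hypothetical 1-D Gaussian experts: the renormalized product concentrates where both experts agree and is sharper than the mixture:

```python
import numpy as np

xs = np.linspace(-6, 6, 2001)

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

f1, f2 = gauss(xs, -1.0, 2.0), gauss(xs, 1.0, 2.0)   # two experts

mixture = 0.5 * f1 + 0.5 * f2        # "or": weighted sum of densities
product = f1 * f2                    # "and": product of experts ...
product /= np.trapz(product, xs)     # ... renormalized (partition function)

def spread(p):
    mean = np.trapz(xs * p, xs)
    return np.sqrt(np.trapz((xs - mean) ** 2 * p, xs))

# The product has a smaller standard deviation than the mixture.
print(f"mixture std {spread(mixture):.2f}, product std {spread(product):.2f}")
```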
19
Outline
• Basic background on statistical learning and graphical models
• Contrastive divergence and Restricted Boltzmann machine
– Product of experts
– Contrastive divergence
– Restricted Boltzmann Machine
• Deep belief net
20
Contrastive Divergence (CD)
• Probability: P(x; θ) = f(x; θ) / Z(θ), with Z(θ) = Σx Πm fm(x; θm)
• Maximum likelihood:

  maxθ L(X; θ) = maxθ Πk P(x^(k); θ) = maxθ (1/K) Σ(k=1..K) log P(x^(k); θ)

  L(X; θ) = (1/K) Σ(k=1..K) log f(x^(k); θ) − log Z(θ)

• Gradient descent/ascent:

  ∂L(X; θ)/∂θ = (1/K) Σ(k=1..K) ∂log f(x^(k); θ)/∂θ − Σx p(x; θ) ∂log f(x; θ)/∂θ
              = ⟨∂log f(x; θ)/∂θ⟩_data − ⟨∂log f(x; θ)/∂θ⟩_model

  (an expectation under the data distribution minus an expectation under the model distribution)

• Learning: set ∂L(X; θ)/∂θ = 0, or iterate

  θ^(t+1) = θ^t + η ∂L(X; θ)/∂θ

(a numeric sketch of this gradient follows)
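A toy sketch of this gradient on the three-node MRF from before, where ∂log f/∂w1 = A·C and ∂log f/∂w2 = B·C. The tiny dataset and step size are made up, and the model expectation is computed exactly by enumeration (exactly the expensive part that CD will approximate):

```python
import itertools
import numpy as np

states = np.array(list(itertools.product([0, 1], repeat=3)))  # all (A,B,C)
feats = np.stack([states[:, 0] * states[:, 2],                # A*C
                  states[:, 1] * states[:, 2]], axis=1)       # B*C

data = np.array([[1, 1, 1], [1, 0, 1], [0, 1, 1], [1, 1, 1]]) # assumed samples
data_feats = np.stack([data[:, 0] * data[:, 2],
                       data[:, 1] * data[:, 2]], axis=1)

w = np.zeros(2)
for step in range(200):
    p = np.exp(feats @ w)
    p /= p.sum()                                 # exact model distribution
    grad = data_feats.mean(axis=0) - p @ feats   # data term minus model term
    w += 0.5 * grad                              # gradient ascent on L
print(w)   # weights that match the model statistics to the data statistics
```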
21
Contrastive Divergence (CD)
• Gradient of the likelihood:

  ∂L(X; θ)/∂θ = (1/K) Σ(k=1..K) ∂log f(x^(k); θ)/∂θ − Σx p(x; θ) ∂log f(x; θ)/∂θ

– The first term is tractable and easy to compute.
– The second term is intractable: a sum over all configurations x. It can be approximated by sampling p(z1, z2, …, zM) with Gibbs sampling; contrastive divergence makes this fast by running the Gibbs chain for only T = 1 step.
• Update: θ^(t+1) = θ^t + η ∂L_CD(X; θ)/∂θ
[Figure: the exact gradient is accurate but slow; the CD gradient is approximate but fast, yet both head toward the minimum. The three-node example P(A,B,C) = P(A|C)P(B|C)P(C) is used for illustration.]
22
Gibbs sampling for graphical models
[Figure: bipartite graph with visible nodes x1, x2, x3 and hidden nodes h1 to h5]
More information on Gibbs sampling: Pattern Recognition and Machine Learning (PRML). A minimal sampler follows.
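A minimal Gibbs sampler for the same toy MRF (assumed weights as before): each binary variable is resampled from its conditional given the others, and long-run averages approximate the true marginals:

```python
import numpy as np

rng = np.random.default_rng(0)
w1, w2 = 1.5, 0.5   # assumed weights, as in the earlier sketch

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gibbs_sweep(a, b, c):
    # For P(A,B,C) ∝ exp(w1*A*C + w2*B*C): P(A=1|B,C) = σ(w1*C), etc.
    a = rng.random() < sigmoid(w1 * c)
    b = rng.random() < sigmoid(w2 * c)
    c = rng.random() < sigmoid(w1 * a + w2 * b)
    return int(a), int(b), int(c)

state, samples = (0, 0, 0), []
for t in range(20000):
    state = gibbs_sweep(*state)
    if t > 1000:                       # discard burn-in
        samples.append(state)
print(np.mean(samples, axis=0))        # ≈ marginals P(A=1), P(B=1), P(C=1)
```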
23
Convergence of contrastive divergence (CD)
• The fixed points of ML are not fixed points of CD, and vice versa.
– CD is a biased learning algorithm.
– But the bias is typically very small.
– CD can be used to get close to the ML solution; ML learning can then be used for fine-tuning.
• It is not clear whether CD learning converges (to a stable fixed point). As of 2005, no proof was available.
• Further theoretical results? Please inform us.
M. A. Carreira-Perpiñán and G. E. Hinton, "On Contrastive Divergence Learning," Artificial Intelligence and Statistics (AISTATS), 2005.
24
Outline
• Basic background on statistical learning and graphical models
• Contrastive divergence and Restricted Boltzmann machine
– Product of experts
– Contrastive divergence
– Restricted Boltzmann Machine
• Deep belief net
25
Boltzmann Machine
• An undirected graphical model with hidden nodes.
• Energy over binary units:

  E(x; θ) = Σij wij xi xj + Σi θi xi,   with parameters θ: {wij, θi}

• Probability (as a product of experts):

  P(x; θ) = Πm fm(x; θm) / Z(θ) = Πm e^(−Em(x; θm)) / Σx Πm e^(−Em(x; θm))

• With visible units x and hidden units h:

  Boltzmann machine: E(x, h) = b'x + c'h + h'Wx + x'Ux + h'Vh
Restricted Boltzmann Machine (RBM)
• Undirected, loopy, and layered: a bipartite graph with no connections within a layer.
• E(x, h) = b'x + c'h + h'Wx
  (a Boltzmann machine E(x, h) = b'x + c'h + h'Wx + x'Ux + h'Vh with U = 0 and V = 0)
• Joint and marginal:

  P(x, h) = e^(E(x,h)) / Z,   P(x) = Σh e^(E(x,h)) / Z,   Z = Σx,h e^(E(x,h))   (partition function)

• The conditionals factorize:

  P(h|x) = Πi P(hi|x)
  P(x|h) = Πj P(xj|h)
  P(xj = 1|h) = σ(bj + W'•j · h)
  P(hi = 1|x) = σ(ci + Wi· · x)

[Figure: bipartite graph with visible units x1 to x3, hidden units h1 to h5, and weight matrix W]
Read the manuscript for details. A sampling sketch of these conditionals follows.
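A small NumPy sketch of the factorized conditionals, with hypothetical layer sizes and W stored as (hidden × visible) so that P(hi = 1|x) = σ(ci + Wi· · x) and P(xj = 1|h) = σ(bj + W'•j · h):

```python
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 4                    # assumed sizes
W = rng.normal(scale=0.1, size=(n_hidden, n_visible))
b = np.zeros(n_visible)                       # visible biases
c = np.zeros(n_hidden)                        # hidden biases

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_h_given_x(x):
    p = sigmoid(c + W @ x)        # P(hi = 1 | x); factorizes over i
    return (rng.random(n_hidden) < p).astype(float)

def sample_x_given_h(h):
    p = sigmoid(b + W.T @ h)      # P(xj = 1 | h); factorizes over j
    return (rng.random(n_visible) < p).astype(float)

x0 = rng.integers(0, 2, n_visible).astype(float)
h0 = sample_h_given_x(x0)         # one block-Gibbs half-step up
x1 = sample_x_given_h(h0)         # and one half-step back down
```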
27
Restricted Boltzmann Machine (RBM)
• E(x, h) = b'x + c'h + h'Wx
• x = [x1 x2 …]^T, h = [h1 h2 …]^T
• Parameter learning: maximum log-likelihood

  maxθ L(X; θ) = maxθ Πk P(x^(k); θ) = maxθ (1/K) Σ(k=1..K) log P(x^(k); θ)

  where P(x; θ) = Σh e^(b'x + c'h + h'Wx) / Z(θ) = f(x; θ) / Z(θ)

Geoffrey E. Hinton, "Training Products of Experts by Minimizing Contrastive Divergence," Neural Computation 14, 1771–1800 (2002).
28
CD for RBM
• CD for RBM is very fast!

  P(x; θ) = Σh e^(b'x + c'h + h'Wx) / Z(θ) = f(x; θ) / Z(θ)

  ∂L(X; θ)/∂wij = (1/K) Σ(k=1..K) ∂log f(x^(k); θ)/∂wij − Σx p(x; θ) ∂log f(x; θ)/∂wij
                = ⟨xj hi⟩_X⁰ − ⟨xj hi⟩_p(x;θ) ≈ ⟨xj hi⟩^0 − ⟨xj hi⟩^1   (CD)

  θ^(t+1) = θ^t + η ∂L(X; θ)/∂θ

  P(xj = 1|h) = σ(bj + W'•j · h),   P(hi = 1|x) = σ(ci + Wi· · x)

Here ⟨·⟩^0 is computed with x clamped to the data X⁰ (and h sampled from P(h|x)); ⟨·⟩^1 uses the one-step Gibbs reconstruction X¹.
29
CD for RBM

  ∂L(X; θ)/∂wij ≈ ⟨xj hi⟩^0 − ⟨xj hi⟩^1

[Figure: the CD-1 Gibbs chain x^0 → h^0 → x^1 → h^1, alternating
P(hi = 1|x) = σ(ci + Wi· · x) and P(xj = 1|h) = σ(bj + W'•j · h)]
(a code sketch of this update follows)
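Putting the pieces together, a compact sketch of one CD-1 update for a binary RBM on a batch X0 (shapes, learning rate, and the toy data are assumptions; hidden probabilities are used in the statistics, as is customary):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(W, b, c, X0, lr=0.1):
    """One contrastive-divergence (T=1) step on a batch X0: (K, n_visible)."""
    ph0 = sigmoid(c + X0 @ W.T)                       # P(h|x^0)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)  # h^0
    px1 = sigmoid(b + h0 @ W)                         # P(x|h^0)
    X1 = (rng.random(px1.shape) < px1).astype(float)  # x^1 (reconstruction)
    ph1 = sigmoid(c + X1 @ W.T)                       # P(h|x^1)
    K = X0.shape[0]
    W += lr * (ph0.T @ X0 - ph1.T @ X1) / K           # <xj hi>^0 - <xj hi>^1
    b += lr * (X0 - X1).mean(axis=0)
    c += lr * (ph0 - ph1).mean(axis=0)
    return W, b, c

# Toy usage with hypothetical binary data.
n_vis, n_hid = 6, 3
W = rng.normal(scale=0.1, size=(n_hid, n_vis))
b, c = np.zeros(n_vis), np.zeros(n_hid)
X = (rng.random((32, n_vis)) < 0.5).astype(float)
for _ in range(100):
    W, b, c = cd1_update(W, b, c, X)
```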
30
RBM for classification
• y: classification label
Hugo Larochelle and Yoshua Bengio, Classification using Discriminative Restricted Boltzmann Machines, ICML 2008.
31
RBM itself has many applications
• Multiclass classification
• Collaborative filtering
• Motion capture modeling
• Information retrieval
• Modeling natural images
• Segmentation
Y. Li, D. Tarlow, R. Zemel, "Exploring compositional high order pattern potentials for structured output learning," CVPR 2013.
V. Mnih, H. Larochelle, G. E. Hinton, "Conditional Restricted Boltzmann Machines for Structured Output Prediction," Uncertainty in Artificial Intelligence, 2011.
H. Larochelle and Y. Bengio, "Classification using discriminative restricted Boltzmann machines," ICML 2008.
R. Salakhutdinov, A. Mnih, and G. E. Hinton, "Restricted Boltzmann machines for collaborative filtering," ICML 2007.
R. Salakhutdinov and G. E. Hinton, "Replicated softmax: an undirected topic model," NIPS 2009.
S. Osindero and G. E. Hinton, "Modeling image patches with a directed hierarchy of Markov random fields," NIPS 2008.
32
Outline
• Basic background on statistical learning and graphical models
• Contrastive divergence and Restricted Boltzmann machine
• Deep belief net (DBN)
– Why deep learning?
– Learning and inference
– Applications
33
Belief Nets
• A belief net is a directed acyclic graph composed of random variables.
[Figure: random hidden causes at the top; visible effects at the bottom]
34
Deep Belief Net
• A belief net that is deep
• A generative model:

  P(x, h1, …, hl) = p(x|h1) p(h1|h2) … p(hl−2|hl−1) p(hl−1, hl)

• Used for unsupervised training of multi-layer deep models.
[Figure: stack of layers x, h1, h2, h3]

  P(x, h1, h2, h3) = p(x|h1) p(h1|h2) p(h2, h3)

Pixels => edges => local shapes => object parts
(a generation sketch follows)
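A sketch of generating from this model with placeholder weights (biases omitted for brevity): run block Gibbs in the top-level RBM p(h2, h3), then sample down through the directed layers p(h1|h2) and p(x|h1):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
sizes = [8, 6, 4, 3]     # sizes of x, h1, h2, h3 (assumed)
W1, W2, W3 = (rng.normal(scale=0.1, size=(sizes[i + 1], sizes[i]))
              for i in range(3))

def bernoulli(p):
    return (rng.random(p.shape) < p).astype(float)

# 1) Approximate equilibrium sample from the top RBM p(h2, h3) by block Gibbs.
h2 = bernoulli(np.full(sizes[2], 0.5))
for _ in range(200):
    h3 = bernoulli(sigmoid(W3 @ h2))
    h2 = bernoulli(sigmoid(W3.T @ h3))

# 2) Ancestral pass down the directed layers.
h1 = bernoulli(sigmoid(W2.T @ h2))   # h1 ~ p(h1|h2)
x = bernoulli(sigmoid(W1.T @ h1))    # x  ~ p(x|h1)
print(x)
```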
35
Why deep learning?
• The mammal brain is organized in a deep architecture, with a given input percept represented at multiple levels of abstraction, each level corresponding to a different area of cortex.
• An architecture with insufficient depth can require many more computational elements, potentially exponentially more (with respect to input size), than an architecture whose depth is matched to the task.
• Since the number of computational elements one can afford depends on the number of training examples available to tune or select them, the consequences are not just computational but also statistical: poor generalization may be expected when using an insufficiently deep architecture for representing some functions.
T. Serre et al., "A quantitative theory of immediate visual recognition," Progress in Brain Research, Computational Neuroscience: Theoretical Insights into Brain Function, vol. 165, pp. 33–56, 2007.
Yoshua Bengio, "Learning Deep Architectures for AI," Foundations and Trends in Machine Learning, 2009.
36
Why deep learning?
• Linear regression, logistic regression: depth 1
• Kernel SVM: depth 2
• Decision tree: depth 2
• Boosting: depth 2
• The basic conclusion these results suggest: when a function can be compactly represented by a deep architecture, it might need a very large architecture to be represented by an insufficiently deep one. (Example: logic gates, multi-layer NNs with linear threshold units and positive weights.)
Yoshua Bengio, "Learning Deep Architectures for AI," Foundations and Trends in Machine Learning, 2009.
37
Example: sum-product network (SPN)
[Figure: a deep SPN over variables X1, …, X5 and their negations]
• A shallow representation of this function needs N·2^(N−1) parameters (2^(N−1) terms); the deep SPN needs only O(N) parameters.
38
Depth of existing approaches
• Boosting (2 layers)
– Layer 1: base learners
– Layer 2: vote or linear combination of layer 1
• Decision tree, LLE, KNN, kernel SVM (2 layers)
– Layer 1: matching degree to a set of local templates
– Layer 2: combine these degrees, e.g. f(x) = b + Σi αi K(x, xi)
• Brain: 5–10 layers
39
Why does a decision tree have depth 2?
• It relies on a partition of the input space.
• It is a local estimator: it partitions the input space and uses separate parameters for each region; each region is associated with a leaf.
• It needs as many training samples as there are variations of interest in the target function, so it is not good for highly varying functions.
• The number of training samples needed is exponential in the number of dimensions to achieve a fixed error rate.
40
Outline
• Basic background on statistical learning and graphical models
• Contrastive divergence and Restricted Boltzmann machine
• Deep belief net (DBN)
– Why DBN?
– Learning and inference
– Applications
41
Deep Belief Net
• Inference problem: infer the states of the unobserved variables.
• Learning problem: adjust the interactions between variables to make the network more likely to generate the observed data.
[Figure: stack of layers x, h1, h2, h3]

  P(x, h1, h2, h3) = p(x|h1) p(h1|h2) p(h2, h3)
42
Deep Belief Net
• Inference problem (the problem of explaining away):
– Conditioning on a common parent gives independence: P(A,B|C) = P(A|C)P(B|C).
– But conditioning on a common child does not:

  P(h11, h12 | x1) ≠ P(h11|x1) P(h12|x1)

[Figure: parent C with children A and B, versus hidden parents h11, h12 with visible child x1]
An example is in the manuscript. Solution: complementary priors.
43
Deep Belief Net
[Figure: a deep belief net with input x and hidden layers h1 to h4, with layer sizes 2000, 1000, 500, and 30]
Inference problem (the problem of explaining away). Solution: complementary priors.
44
Deep Belief Net
• Explaining-away problem of inference (see the manuscript)
– Solution: complementary priors, see the manuscript.
• Learning problem
– Greedy layer-by-layer RBM training (optimizing a lower bound), followed by fine-tuning; a code sketch of the greedy procedure appears below.
– Contrastive divergence for RBM training, using P(hi = 1|x) = σ(ci + Wi· · x).
[Figure: the stack x, h1, h2, h3 is trained as a sequence of RBMs: (x, h1), then (h1, h2), then (h2, h3)]
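A minimal sketch of the greedy procedure (layer sizes, epochs, and the toy data are hypothetical): train a CD-1 RBM on the data, then feed its hidden-unit probabilities to the next RBM as if they were data:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def train_rbm(X, n_hidden, epochs=50, lr=0.1):
    """Train one binary RBM with CD-1; return its weights and biases."""
    K, n_visible = X.shape
    W = rng.normal(scale=0.1, size=(n_hidden, n_visible))
    b, c = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(epochs):
        ph0 = sigmoid(c + X @ W.T)                         # P(h|x^0)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        X1 = (rng.random((K, n_visible))
              < sigmoid(b + h0 @ W)).astype(float)         # reconstruction x^1
        ph1 = sigmoid(c + X1 @ W.T)
        W += lr * (ph0.T @ X - ph1.T @ X1) / K             # CD-1 gradient
        b += lr * (X - X1).mean(axis=0)
        c += lr * (ph0 - ph1).mean(axis=0)
    return W, b, c

X = (rng.random((128, 20)) < 0.5).astype(float)   # toy binary "data"
params, H = [], X
for n_hidden in [15, 10, 5]:                      # assumed layer sizes
    W, b, c = train_rbm(H, n_hidden)
    params.append((W, b, c))
    H = sigmoid(c + H @ W.T)   # hidden probabilities become next layer's data
```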
45
Code reading
• The DeepLearningToolbox is much easier to read for understanding DBNs.
46
47
Deep Belief Net
• Why does greedy layer-wise learning work?
• It optimizes a lower bound:

  log P(x) = log Σh1 P(x, h1)
           ≥ Σh1 Q(h1|x) [log P(h1) + log P(x|h1) − log Q(h1|x)]   (1)

• When we fix the parameters for layer 1 and optimize the parameters for layer 2, we are optimizing P(h1) in (1).
[Figure: the (h1, h2) RBM is trained while the x–h1 weights stay fixed]
48
Deep Belief Net and RBM
• An RBM can be considered as a DBN with infinitely many layers.
[Figure: unrolling an RBM into an infinite directed stack x0, h0, x1, h1, x2, … with tied weights alternating W and W^T]
49
Pretrain, fine-tune and inference (autoencoder)
[Figure: pretraining with stacked RBMs, then fine-tuning with back-propagation (BP)]
50
Pretrain, fine-tune and inference - 2
• y: identity or rotation degree
[Figure: pretraining stage, then fine-tuning stage]
51
How many layers should we use?
• There might be no universally right depth.
– Bengio suggests that several layers are better than one.
– Results are robust against changes in the size of a layer, but the top layer should be big.
– Depth is a parameter; it depends on your task.
– With enough narrow layers, we can model any distribution over binary vectors [1].
Copied from http://videolectures.net/mlss09uk_hinton_dbn/
[1] Sutskever, I. and Hinton, G. E., "Deep Narrow Sigmoid Belief Networks are Universal Approximators," Neural Computation, 2007.
52
Effect of Unsupervised Pre-training
Erhan et al., AISTATS 2009
53
Effect of Depth
[Figure: results with and without pre-training, as a function of depth]
54
Why unsupervised pre-training makes sense
[Figure: two generative stories relating hidden "stuff", images, and labels; in the second, the stuff-to-image pathway has high bandwidth and the stuff-to-label pathway has low bandwidth]
If image-label pairs were generated the first way, it would make sense to try to go straight from images to labels. For example: do the pixels have even parity?
If image-label pairs are generated the second way, it makes sense to first learn to recover the stuff that caused the image by inverting the high-bandwidth pathway.
55
Beyond layer-wise pretraining
• Layer-wise pretraining is efficient but not optimal.
• It is possible to train the parameters of all layers using a wake-sleep algorithm:
– bottom-up, in a layer-wise manner;
– top-down, refitting the earlier layers.
56
Fine-tuning with a contrastive version of the "wake-sleep" algorithm
After learning many layers of features, we can fine-tune the features to improve generation.
1. Do a stochastic bottom-up pass.
– Adjust the top-down weights to be good at reconstructing the feature activities in the layer below.
2. Do a few iterations of sampling in the top-level RBM.
– Adjust the weights in the top-level RBM.
3. Do a stochastic top-down pass.
– Adjust the bottom-up weights to be good at reconstructing the feature activities in the layer above.
57
Include lateral connections
• An RBM has no connections within a layer. This can be generalized.
• Lateral connections in the first (visible) layer [1]:
– Sampling from P(h|x) is still easy, but sampling from P(x|h) becomes more difficult.
• Lateral connections at multiple layers [2]:
– Generate more realistic images.
– CD is still applicable, with small modifications.
[1] B. A. Olshausen and D. J. Field, "Sparse coding with an overcomplete basis set: a strategy employed by V1?," Vision Research, vol. 37, pp. 3311–3325, December 1997.
[2] S. Osindero and G. E. Hinton, "Modeling image patches with a directed hierarchy of Markov random fields," NIPS, 2007.
58
Without lateral connection
59
With lateral connection
60
My data is real-valued …
• Map it to [0, 1] linearly: x ← ax + b
• Or use another distribution (e.g. Gaussian visible units)
61
My data has temporal dependency …
• Static:• Temporal
62
My data has temporal dependency …
• Static:• Temporal
63
I consider DBN as…
• A statistical model that is used for unsupervised training of fully connected deep models
• A directed graphical model that is approximated by fast learning and inference algorithms
• A directed graphical model that is fine-tuned using a mature neural network learning approach: back-propagation (BP)
64
Outline
• Basic background on statistical learning and graphical models
• Contrastive divergence and Restricted Boltzmann machine
• Deep belief net (DBN)
– Why DBN?
– Learning and inference
– Applications
65
Applications of deep learning
• Handwritten digit recognition
• Dimensionality reduction
• Information retrieval
• Segmentation
• Denoising
• Phone recognition
• Object recognition
• Object detection
• …
Hinton, G. E., Osindero, S., and Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation.
Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science.
Welling, M., et al. (2004). Exponential Family Harmoniums with an Application to Information Retrieval. NIPS.
Mohamed, A. R., et al. (2009). Deep Belief Networks for phone recognition. NIPS 2009 workshop on deep learning for speech recognition.
Nair, V. and Hinton, G. E. (2009). 3-D Object recognition with deep belief nets. NIPS.
66
Object recognition
• NORB
– Logistic regression 19.6%, kNN (k=1) 18.4%, Gaussian-kernel SVM 11.6%, convolutional neural net 6.0%, convolutional net + SVM hybrid 5.9%, DBN 6.5%.
– With extra unlabeled data (and the same amount of labeled data as before), the DBN achieves 5.2%.
67
Object recognition: ImageNet

Rank   Name          Error rate   Description
1      U. Toronto    0.15315      Deep learning
2      U. Tokyo      0.26172      Hand-crafted features and learning models; bottleneck
3      U. Oxford     0.26979
4      Xerox/INRIA   0.27058
68
Learning to extract the orientation of a face patch (Salakhutdinov & Hinton, NIPS 2007)
69
The training and test sets
• 11,000 unlabeled cases; 100, 500, or 1000 labeled cases
• Test: face patches from new people
70
The root mean squared error in the orientation when combining GPs with deep belief nets:

                                             100 labels   500 labels   1000 labels
GP on the pixels                             22.2         17.9         15.2
GP on top-level features                     17.2         12.7         7.2
GP on top-level features with fine-tuning    16.3         11.2         6.4
Conclusion: The deep features are much better than the pixels. Fine-tuning helps a lot.
71
Deep Autoencoders (Hinton & Salakhutdinov, 2006)
• They always looked like a really nice way to do non-linear dimensionality reduction:
– But it is very difficult to optimize deep autoencoders using backpropagation.
• We now have a much better way to optimize them:
– First train a stack of 4 RBMs
– Then "unroll" them
– Then fine-tune with backprop
[Figure: encoder 28×28 → 1000 → 500 → 250 → 30 linear units with weights W1 to W4; the decoder mirrors it back to 28×28 with the transposed weights W4^T to W1^T]
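A sketch of the "unroll" step with placeholder weights standing in for pretrained RBM parameters: the encoder applies W1…W4 down to a linear 30-unit code, the decoder applies the transposes in reverse order, and backprop fine-tuning would then adjust everything jointly:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
sizes = [784, 1000, 500, 250, 30]    # 28x28 -> ... -> 30, as on the slide

# params[i] = (W, visible bias, hidden bias) for RBM i; random placeholders
# here, where pretrained values would normally be plugged in.
params = [(rng.normal(scale=0.01, size=(sizes[i + 1], sizes[i])),
           np.zeros(sizes[i]), np.zeros(sizes[i + 1])) for i in range(4)]

def encode(x):
    for i, (W, _, c) in enumerate(params):
        pre = c + W @ x
        x = pre if i == len(params) - 1 else sigmoid(pre)  # linear code units
    return x

def decode(code):
    for W, b, _ in reversed(params):   # decoder reuses transposed weights
        code = sigmoid(b + W.T @ code)
    return code

x = rng.random(784)
recon = decode(encode(x))              # 784 -> 30 -> 784 reconstruction
```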
72
Deep Autoencoders (Hinton & Salakhutdinov, 2006)
[Figure: digit images, comparing real data, 30-D deep autoencoder reconstructions, and 30-D PCA reconstructions]
73
A comparison of methods for compressing digit images to 30 real numbers.
[Figure: real data vs. 30-D deep autoencoder vs. 30-D logistic PCA vs. 30-D PCA reconstructions]
74
Representation of DBN
Our works
75
http://mmlab.ie.cuhk.edu.hk/project_deep_learning.html
• Pedestrian detection (CVPR'12, CVPR'13, ICCV'13)
• Facial keypoint detection, CVPR'13 (2% average error on LFPW)
• Face parsing, CVPR'12
• Pedestrian parsing, ICCV'13
• Face recognition and face attribute recognition (LFW: 96.45%)
– Face verification, ICCV'13; Recovering canonical-view face images, ICCV'13
– Face attribute recognition, ICCV'13
79
Summary
• Deep belief net (DBN)
– is a network with deep layers, which provides strong representation power;
– is a generative model;
– can be learned layer-wise with RBMs trained using contrastive divergence;
– has many applications, and more applications are yet to be found.
Generative models explicitly or implicitly model the distribution of inputs and outputs. Discriminative models model the posterior probabilities directly.
80
DBN vs. SVM
• A very controversial topic
• Model
– DBN is generative, SVM is discriminative. But the fine-tuning of a DBN is discriminative.
• Application
– SVM is widely applied.
– Researchers are expanding the application areas of DBN.
• Learning
– DBN training is non-convex and slow.
– SVM training is convex and fast (in the linear case).
• Which one is better?
– Time will tell.
– You can contribute.
Hinton: "The superior classification performance of discriminative learning methods holds only for domains in which it is not possible to learn a good generative model. This set of domains is being eroded by Moore's law."
81