Bayesian Nonparametrics
Yee Whye Teh
Gatsby Computational Neuroscience Unit, University College London
Acknowledgements: Charles Blundell, Lloyd Elliott, Jan Gasthaus, Vinayak Rao, Dilan Görür, Andriy Mnih, Frank Wood, Cedric Archambeau, Hal Daume III, Zoubin Ghahramani, Katherine Heller, Lancelot James, Michael I. Jordan, Daniel Roy, Jurgen Van Gael, Max Welling
December, 2010 / KAIST, South Korea
1 / 127
Outline
Introduction
Regression and Gaussian Processes
Density Estimation, Clustering and Dirichlet Processes
Latent Variable Models and Indian Buffet and Beta Processes
Topic Modelling and Hierarchical Processes
Hierarchical Structure Discovery and Nested Processes
Time Series Models
Modelling Power-laws with Pitman-Yor Processes
Summary
Probabilistic Machine Learning
I Probabilistic model of data {xi}_{i=1}^n given parameters θ:

P(x1, x2, . . . , xn, y1, y2, . . . , yn | θ)

where yi is a latent variable associated with xi.
I Often thought of as generative models of data.
I Inference of latent variables given observations:

P(y1, . . . , yn | θ, x1, . . . , xn) = P(x1, . . . , xn, y1, . . . , yn | θ) / P(x1, . . . , xn | θ)

I Learning, typically by maximum likelihood:

θML = argmax_θ P(x1, . . . , xn | θ)
Probabilistic Machine Learning
I Prediction:
P(xn+1|θML)
P(yn+1|xn+1, θML)
P(yn+1|x1, . . . , xn+1, θML)
I Classification:

argmax_C P(xn+1 | θML^C)

I Visualization, interpretation, summarization.
Bayesian Machine Learning
I Probabilistic model of data {xi}_{i=1}^n given parameters θ:

P(x1, x2, . . . , xn, y1, y2, . . . , yn | θ)

I Prior distribution:

P(θ)

I Posterior distribution:

P(θ, y1, . . . , yn | x1, . . . , xn) = P(θ) P(x1, . . . , xn, y1, . . . , yn | θ) / P(x1, . . . , xn)

I Prediction:

P(xn+1 | x1, . . . , xn) = ∫ P(xn+1 | θ) P(θ | x1, . . . , xn) dθ

I (Easier said than done...)
Computing Posterior Distributions
I Posterior distribution:

P(θ, y1, . . . , yn | x1, . . . , xn) = P(θ) P(x1, . . . , xn, y1, . . . , yn | θ) / P(x1, . . . , xn)

I High-dimensional, no closed form, multi-modal...
I Variational approximations [Wainwright and Jordan 2008]: a simple parametrized form is “fit” to the true posterior. Includes the mean field approximation, variational Bayes, belief propagation, expectation propagation.
I Monte Carlo methods, including Markov chain Monte Carlo [Neal 1993, Robert and Casella 2004] and sequential Monte Carlo [Doucet et al. 2001]: construct generators for random samples from the posterior.
Bayesian Model Selection
I Model selection is often necessary to prevent overfitting and underfitting.
I The Bayesian approach to model selection uses the marginal likelihood:

p(x | Mk) = ∫ p(x | θk, Mk) p(θk | Mk) dθk

Model selection: M* = argmax_{Mk} p(x | Mk)

Model averaging: p(Mk, θk | x) = p(Mk) p(θk | Mk) p(x | θk, Mk) / Σ_{k'} p(Mk') p(x | Mk')

I Other approaches to model selection: cross validation, regularization, sparse models...
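As a toy illustration of selection by marginal likelihood (an added sketch, not from the slides; the models and numbers are illustrative), compare a fixed fair-coin model against a Beta-Bernoulli model on n coin flips with h heads:

```python
import math

# Hypothetical example: n coin flips with h heads.
# M1: fair coin, so p(x | M1) = (1/2)^n (no free parameters).
# M2: theta ~ Uniform(0, 1); integrating theta out gives the
#     marginal likelihood p(x | M2) = h! (n - h)! / (n + 1)!.
n, h = 20, 15
m1 = 0.5 ** n
m2 = math.factorial(h) * math.factorial(n - h) / math.factorial(n + 1)
print(m2 > m1)  # on biased data the flexible model has higher evidence
```

Note that the flexible model pays an automatic complexity penalty: on balanced data (h ≈ n/2), m1 would win instead.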
Side-Stepping Model Selection
I Strategies for model selection often entail significant complexities.
I But reasonable and proper Bayesian methods should not overfit anyway [Rasmussen and Ghahramani 2001].
I Idea: use a large model, and be Bayesian so it will not overfit.
I Bayesian nonparametric idea: using a very large Bayesian model avoids both overfitting and underfitting.
Direct Modelling of Very Large Spaces
I Regression: learn about functions from an input to an output space.
I Density estimation: learn about densities over R^d.
I Clustering: learn about partitions of a large space.
I Objects of interest are often infinite dimensional. Model these directly:
I Using models that can learn any such object;
I Using models that can approximate any such object to arbitrary accuracy.
I Many theoretical and practical issues to resolve:
I Convergence and consistency.
I Practical inference algorithms.
Regression and Classification
I Learn a function f* : X → Y from training data {xi, yi}_{i=1}^n.
I Regression: if yi = f*(xi) + εi.
I Classification: e.g. P(yi = 1 | f*(xi)) = Φ(f*(xi)).
Parametric Regression with Basis Functions
I Assume a set of basis functions φ1, . . . , φK and parametrize a function:

f(x; w) = Σ_{k=1}^K wk φk(x)

Parameters w = {w1, . . . , wK}.
I Find the optimal parameters:

argmin_w Σ_{i=1}^n ||yi − f(xi; w)||² = argmin_w Σ_{i=1}^n ||yi − Σ_{k=1}^K wk φk(xi)||²

I What family of basis functions to use?
I How many?
I What if the true function cannot be parametrized as such?
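To make the parametric setup concrete, here is a small least-squares fit of this form (an added sketch: the Gaussian bump basis, widths, data, and K = 8 are all illustrative assumptions, not choices made on the slide):

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(x, centers, width=0.5):
    # Design matrix: phi_k(x) = exp(-(x - c_k)^2 / (2 width^2)).
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)

x = np.linspace(0.0, 5.0, 40)
y = np.sin(x) + 0.1 * rng.standard_normal(x.size)   # noisy targets

K = 8                                   # "how many?" -- here, a guess
centers = np.linspace(0.0, 5.0, K)
Phi = phi(x, centers)                   # n x K design matrix
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # argmin_w sum_i ||y_i - f(x_i; w)||^2
mse = np.mean((Phi @ w - y) ** 2)
print(w.shape, mse)
```

The questions on the slide show up directly as the hand-picked `K`, `centers`, and `width` arguments.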
Towards Nonparametric Regression
I What we are interested in is the output values of the function,

f(x1), f(x2), . . . , f(xn), f(xn+1)

Why not model these directly?
I In regression, each f(xi) is continuous and real-valued, so a natural choice is to model f(xi) using a Gaussian.
I Assume that the function f is smooth. If two inputs xi and xj are close by, then f(xi) and f(xj) should be close by as well. This translates into correlations among the outputs f(xi).
Towards Nonparametric Regression
I We can use a multi-dimensional Gaussian to model correlated function outputs:

(f(x1), . . . , f(xn+1)) ∼ N(0, C)

where the mean is zero and C = [Cij] is the (n+1)×(n+1) covariance matrix.
I Each observed output yi can be modelled as

yi | f(xi) ∼ N(f(xi), σ²)

I Learning: compute the posterior distribution

p(f(x1), . . . , f(xn) | y1, . . . , yn)

Straightforward since the whole model is Gaussian.
I Prediction: compute

p(f(xn+1) | y1, . . . , yn)
Gaussian Processes
I A Gaussian process (GP) is a random function f : X → R such that for any finite set of input points x1, . . . , xn,

(f(x1), . . . , f(xn)) ∼ N((m(x1), . . . , m(xn)), [c(xi, xj)]_{i,j=1,...,n})

where the parameters are the mean function m(x) and covariance kernel c(x, y).
I Difference from before: the GP defines a distribution over f(x) for every input value x simultaneously. The prior is defined even before observing inputs x1, . . . , xn.
I Such a random function f is known as a stochastic process. It is a collection of random variables {f(x)}_{x∈X}.
I Demo: GPgenerate.

[Rasmussen and Williams 2006]
Posterior and Predictive Distributions
I How do we compute the posterior and predictive distributions?
I Training set (x1, y1), (x2, y2), . . . , (xn, yn) and test input xn+1.
I Out of the (uncountably infinitely many) random variables {f(x)}_{x∈X} making up the GP, only n + 1 have to do with the data:

f(x1), f(x2), . . . , f(xn+1)

I The training data give observations f(x1) = y1, . . . , f(xn) = yn. The predictive distribution of f(xn+1) is simply

p(f(xn+1) | f(x1) = y1, . . . , f(xn) = yn)

which is easy to compute since f(x1), . . . , f(xn+1) is Gaussian.
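The predictive distribution above is just a Gaussian conditional. A minimal numerical sketch, assuming a squared-exponential kernel and a small diagonal jitter for numerical stability (both assumptions, not specified on the slide):

```python
import numpy as np

def sqexp(a, b, ell=1.0):
    # Squared-exponential covariance kernel c(x, y) (an assumed choice).
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 20)
y = np.sin(x) + 0.05 * rng.standard_normal(x.size)  # observed f(x_i) = y_i
xs = np.array([0.5])                      # test input x_{n+1}

K = sqexp(x, x) + 1e-4 * np.eye(x.size)   # small jitter on the diagonal
ks = sqexp(x, xs)
mean = ks.T @ np.linalg.solve(K, y)       # E[f(x_{n+1}) | data]
var = sqexp(xs, xs) - ks.T @ np.linalg.solve(K, ks)
print(float(mean), float(var))
```

The mean and variance are the standard Gaussian conditioning formulas; with noisy observations one would replace the jitter by the noise variance σ² from the earlier slide.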
Consistency and Existence
I The definition of Gaussian processes only gives the finite dimensional marginal distributions of the stochastic process.
I Fortunately these marginal distributions are consistent.
I For every finite set x ⊂ X we have a distinct distribution p_x([f(x)]_{x∈x}). These distributions are said to be consistent if

p_x([f(x)]_{x∈x}) = ∫ p_{x∪y}([f(x)]_{x∈x∪y}) d[f(x)]_{x∈y}

for disjoint and finite x, y ⊂ X.
I The marginal distributions of the GP are consistent because Gaussians are closed under marginalization.
I The Kolmogorov Consistency Theorem guarantees the existence of GPs, i.e. the whole stochastic process {f(x)}_{x∈X}.
Density Estimation with Mixture Models
I Unsupervised learning of a density f*(x) from training samples {xi}.
I Can use a mixture model for a flexible family of densities, e.g.

f(x) = Σ_{k=1}^K πk N(x; μk, Σk)

I How many mixture components to use?
I What family of mixture components?
I Do we believe that the true density is a mixture of K components?
Bayesian Mixture Models
I Let’s be Bayesian about mixture models, and place priors over our parameters (and compute posteriors).
I First, introduce conjugate priors for the parameters:

π ∼ Dirichlet(α/K, . . . , α/K)
(μk, Σk) = θ*k ∼ H = N-IW(0, s, d, Φ)

I Second, introduce a variable zi indicating which component xi belongs to:

zi | π ∼ Multinomial(π)
xi | zi = k, μ, Σ ∼ N(μk, Σk)
Gibbs Sampling for Bayesian Mixture Models
I All conditional distributions are simple to compute:

p(zi = k | others) ∝ πk N(xi; μk, Σk)
π | z ∼ Dirichlet(α/K + n1(z), . . . , α/K + nK(z))
μk, Σk | others ∼ N-IW(ν′, s′, d′, Φ′)

I Not as efficient as collapsed Gibbs sampling, which integrates out π, μ, Σ:

p(zi = k | others) ∝ (α/K + nk(z^{−i})) / (α + n − 1) × p(xi | {xi′ : i′ ≠ i, zi′ = k})

I Demo: fm_demo (interactive).
Infinite Bayesian Mixture Models
I We will take K → ∞.
I Imagine a very large value of K.
I There are at most n < K occupied components, so most components are empty. We can lump these empty components together:

Occupied clusters:
p(zi = k | others) ∝ (α/K + nk(z^{−i})) / (n − 1 + α) × p(xi | x_k^{−i})

Empty clusters:
p(zi = kempty | z^{−i}) ∝ (α(K − K*)/K) / (n − 1 + α) × p(xi | {})

I Demo: dpm_demo (interactive).

[Rasmussen 2000]
Density Estimation

(Figure: posterior density estimates. F(· | μ, Σ) is Gaussian with mean μ and covariance Σ; H(μ, Σ) is the Gaussian-inverse-Wishart conjugate prior. Red: mean density. Blue: median density. Grey: 5-95 quantile. Others: posterior samples. Black: data points.)
Infinite Bayesian Mixture Models
I Taken literally, the K → ∞ limit of finite mixture models does not make mathematical sense.
I There are better ways of making this infinite limit precise:
I Look at the prior clustering structure induced by the Dirichlet prior over mixing proportions: the Chinese restaurant process.
I Re-order components so that those with larger mixing proportions tend to occur first, before taking the infinite limit: the stick-breaking construction.
I Both are different views of the Dirichlet process (DP).
I The K → ∞ Gibbs sampler is for DP mixture models.
A Tiny Bit of Measure Theoretic Probability Theory
I A σ-algebra Σ is a family of subsets of a set Θ such that:
I Σ is not empty;
I If A ∈ Σ then Θ\A ∈ Σ;
I If A1, A2, . . . ∈ Σ then ∪_{i=1}^∞ Ai ∈ Σ.
I (Θ, Σ) is a measure space and A ∈ Σ are the measurable sets.
I A measure μ over (Θ, Σ) is a function μ : Σ → [0, ∞] such that:
I μ(∅) = 0;
I If A1, A2, . . . ∈ Σ are disjoint then μ(∪_{i=1}^∞ Ai) = Σ_{i=1}^∞ μ(Ai).
I Everything we consider here will be measurable.
I A probability measure is one where μ(Θ) = 1.
I Given two measure spaces (Θ, Σ) and (∆, Φ), a function f : Θ → ∆ is measurable if f^{−1}(A) ∈ Σ for every A ∈ Φ.
A Tiny Bit of Measure Theoretic Probability Theory
I If p is a probability measure on (Θ, Σ), a random variable X taking values in ∆ is simply a measurable function X : Θ → ∆.
I Think of the probability space (Θ, Σ, p) as a black-box random number generator, and X as a function taking random samples in Θ and producing random samples in ∆.
I The probability of an event A ∈ Φ is p(X ∈ A) = p(X^{−1}(A)).
I A stochastic process is simply a collection of random variables {Xi}_{i∈I} over the same measure space (Θ, Σ), where I is an index set.
I Can think of a stochastic process as a random function X(i).
I Stochastic processes form the core of many Bayesian nonparametric models:
I Gaussian processes, Poisson processes, Dirichlet processes, beta processes, completely random measures...
Dirichlet Distributions
I A Dirichlet distribution is a distribution over the K-dimensional probability simplex:

∆K = {(π1, . . . , πK) : πk ≥ 0, Σk πk = 1}

I We say (π1, . . . , πK) is Dirichlet distributed,

(π1, . . . , πK) ∼ Dirichlet(λ1, . . . , λK)

with parameters (λ1, . . . , λK), if

p(π1, . . . , πK) = (Γ(Σk λk) / ∏k Γ(λk)) ∏_{k=1}^K πk^{λk−1}

I Equivalent to normalizing a set of independent gamma variables:

(π1, . . . , πK) = (γ1, . . . , γK) / Σk γk
γk ∼ Gamma(λk) for k = 1, . . . , K
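The gamma-normalization view is easy to check by simulation (the parameter values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = np.array([2.0, 3.0, 5.0])
g = rng.gamma(lam, size=(100_000, 3))   # independent Gamma(lambda_k) draws
pi = g / g.sum(axis=1, keepdims=True)   # normalize each row onto the simplex
print(pi.mean(axis=0))                  # approaches lam / lam.sum() = [0.2, 0.3, 0.5]
```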
Dirichlet Distributions
Dirichlet Processes
I A Dirichlet process (DP) is a random probability measure G over (Θ, Σ) such that for any finite measurable partition A1 ∪ . . . ∪ AK = Θ,

(G(A1), . . . , G(AK)) ∼ Dirichlet(λ(A1), . . . , λ(AK))

where λ is a base measure.
I The above family of distributions is consistent (next), and the Kolmogorov Consistency Theorem can be applied to show existence (but there are technical conditions restricting the generality of the definition).

[Ferguson 1973, Blackwell and MacQueen 1973]
Consistency of Dirichlet Marginals
I If we have two partitions (A1, . . . , AK) and (B1, . . . , BJ) of Θ, how do we see if the two Dirichlets are consistent?
I Because Dirichlet variables are normalized gamma variables and sums of gammas are gammas, if (I1, . . . , Ij) is a partition of (1, . . . , K),

(Σ_{i∈I1} πi, . . . , Σ_{i∈Ij} πi) ∼ Dirichlet(Σ_{i∈I1} λi, . . . , Σ_{i∈Ij} λi)
Consistency of Dirichlet Marginals
I Form the common refinement (C1, . . . , CL), where each Cℓ is the intersection of some Ak with some Bj. Then:

By definition, (G(C1), . . . , G(CL)) ∼ Dirichlet(λ(C1), . . . , λ(CL))

(G(A1), . . . , G(AK)) = (Σ_{Cℓ⊂A1} G(Cℓ), . . . , Σ_{Cℓ⊂AK} G(Cℓ)) ∼ Dirichlet(λ(A1), . . . , λ(AK))

Similarly, (G(B1), . . . , G(BJ)) ∼ Dirichlet(λ(B1), . . . , λ(BJ))

so the distributions of (G(A1), . . . , G(AK)) and (G(B1), . . . , G(BJ)) are consistent.
I Demonstration: DPgenerate.
Parameters of Dirichlet Processes
I Usually we split the base measure λ into two parameters, λ = αH:
I Base distribution H, which is like the mean of the DP.
I Strength parameter α, which is like an inverse variance of the DP.
I We write

G ∼ DP(α, H)

if for any partition (A1, . . . , AK) of Θ:

(G(A1), . . . , G(AK)) ∼ Dirichlet(αH(A1), . . . , αH(AK))

I The first and second moments of the DP:

Expectation: E[G(A)] = H(A)
Variance: V[G(A)] = H(A)(1 − H(A)) / (α + 1)

where A is any measurable subset of Θ.
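These moments can be checked on a single measurable set A, since the definition gives the marginal G(A) ∼ Beta(αH(A), α(1 − H(A))) (the values of α and H(A) below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, HA = 4.0, 0.3
# Samples of G(A) from its Beta marginal under the DP:
s = rng.beta(alpha * HA, alpha * (1.0 - HA), size=200_000)
print(s.mean())   # ~ H(A) = 0.3
print(s.var())    # ~ H(A)(1 - H(A)) / (alpha + 1) = 0.042
```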
Representations of Dirichlet Processes
I Draws from Dirichlet processes always place all their mass on a countable set of points:

G = Σ_{k=1}^∞ πk δ_{θ*k}

where Σk πk = 1 and θ*k ∈ Θ.
I What is the joint distribution over π1, π2, . . . and θ*1, θ*2, . . .?
I Since G is a (random) probability measure over Θ, we can treat it as a distribution and draw samples from it. Let

θ1, θ2, . . . ∼ G

be random variables with distribution G.
I Can we describe G by describing its effect on θ1, θ2, . . .?
I What is the marginal distribution of θ1, θ2, . . . with G integrated out?
Stick-breaking Construction
G = Σ_{k=1}^∞ πk δ_{θ*k}

I There is a simple construction giving the joint distribution of π1, π2, . . . and θ*1, θ*2, . . . called the stick-breaking construction:

θ*k ∼ H
vk ∼ Beta(1, α)
πk = vk ∏_{i=1}^{k−1} (1 − vi)

I Also known as the GEM distribution; write π ∼ GEM(α).

[Sethuraman 1994]
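A truncated stick-breaking draw is a few lines of numpy (the truncation level K and the base distribution H = N(0, 1) are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, K = 2.0, 1000                     # truncate the infinite sum at K atoms
v = rng.beta(1.0, alpha, size=K)         # v_k ~ Beta(1, alpha)
# pi_k = v_k * prod_{i<k} (1 - v_i):
pi = v * np.cumprod(np.concatenate(([1.0], 1.0 - v[:-1])))
theta = rng.standard_normal(K)           # atom locations theta*_k ~ H = N(0, 1)
print(pi.sum())                          # approaches 1 as K grows
```

The pair `(pi, theta)` is then a (truncated) sample of G = Σk πk δ_{θ*k}.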
Posterior of Dirichlet Processes
I Since G is a probability measure, we can draw samples from it,

G ∼ DP(α, H)
θ1, . . . , θn | G ∼ G

What is the posterior of G given observations of θ1, . . . , θn?
I The usual Dirichlet-multinomial conjugacy carries over to the nonparametric DP as well:

G | θ1, . . . , θn ∼ DP(α + n, (αH + Σ_{i=1}^n δ_{θi}) / (α + n))
Pólya Urn Scheme
G ∼ DP(α, H)
θ1, . . . , θn | G ∼ G

I The marginal distribution of θ1, θ2, . . . has a simple generative process called the Pólya urn scheme (aka the Blackwell-MacQueen urn scheme):

θn | θ1:n−1 ∼ (αH + Σ_{i=1}^{n−1} δ_{θi}) / (α + n − 1)

I Picking balls of different colors from an urn:
I Start with no balls in the urn.
I With probability ∝ α, draw θn ∼ H, and add a ball of color θn into the urn.
I With probability ∝ n − 1, pick a ball at random from the urn, record θn to be its color, and return two balls of color θn into the urn.

[Blackwell and MacQueen 1973]
Chinese Restaurant Process
I θ1, . . . , θn take on K < n distinct values, say θ*1, . . . , θ*K.
I This defines a partition of (1, . . . , n) into K clusters, such that if i is in cluster k, then θi = θ*k.
I The distribution over partitions is a Chinese restaurant process (CRP).
I Generating from the CRP:
I The first customer sits at the first table.
I Customer n sits at:
I Table k with probability nk / (α + n − 1), where nk is the number of customers at table k.
I A new table K + 1 with probability α / (α + n − 1).
I Customers ⇔ integers, tables ⇔ clusters.
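The generative description above is directly simulatable (the values of α and n below are arbitrary):

```python
import numpy as np

def crp(n, alpha, rng):
    # Returns table occupancy counts n_k after seating n customers.
    counts = []
    for i in range(1, n + 1):
        # Customer i sits at existing table k w.p. n_k/(alpha + i - 1),
        # or at a new table w.p. alpha/(alpha + i - 1).
        probs = np.array(counts + [alpha], dtype=float)
        k = rng.choice(probs.size, p=probs / probs.sum())
        if k == len(counts):
            counts.append(1)      # opens a new table
        else:
            counts[k] += 1
    return counts

rng = np.random.default_rng(0)
tables = crp(1000, alpha=5.0, rng=rng)
print(sum(tables), len(tables))   # 1000 customers; roughly alpha*log(n) tables
```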
Chinese Restaurant Process
(Figure: number of occupied tables versus number of customers, for α = 30, d = 0.)

I The CRP exhibits the clustering property of the DP.
I The rich-gets-richer effect implies a small number of large clusters.
I The expected number of clusters is K = O(α log n).
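The expected number of clusters has the closed form E[K] = Σ_{i=1}^n α/(α + i − 1), since customer i opens a new table with probability α/(α + i − 1); the following evaluates it for α = 30 (matching the α log n growth quoted above):

```python
import numpy as np

alpha, n = 30.0, 10_000
i = np.arange(1, n + 1)
EK = (alpha / (alpha + i - 1)).sum()   # E[K] = sum_i alpha/(alpha + i - 1)
print(round(EK, 1))                    # grows like alpha * log(n)
```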
Clustering
I To partition a heterogeneous data set into distinct, homogeneous clusters.
I The CRP is a canonical nonparametric prior over partitions that can be used as part of a Bayesian model for clustering.
I There are other priors over partitions ([Lijoi and Pruenster 2010]).
Inferring Discrete Latent Structures
I DPs have also found uses in applications where the aim is to discover latent objects, and where the number of objects is not known or unbounded:
I Nonparametric probabilistic context free grammars.
I Visual scene analysis.
I Infinite hidden Markov models/trees.
I Genetic ancestry inference.
I ...
I In many such applications it is important to be able to model the same set of objects in different contexts.
I This can be tackled using hierarchical Dirichlet processes.

[Teh et al. 2006, Teh and Jordan 2010]
Exchangeability
I Instead of deriving the Pólya urn scheme by marginalizing out a DP, consider starting directly from the conditional distributions:

θn | θ1:n−1 ∼ (αH + Σ_{i=1}^{n−1} δ_{θi}) / (α + n − 1)

I For any n, the joint distribution of θ1, . . . , θn is:

p(θ1, . . . , θn) = (α^K ∏_{k=1}^K h(θ*k)(mnk − 1)!) / ∏_{i=1}^n (i − 1 + α)

where h(θ) is the density of θ under H, θ*1, . . . , θ*K are the unique values, and θ*k occurred mnk times among θ1, . . . , θn.
I The joint distribution is exchangeable with respect to permutations of θ1, . . . , θn.
I De Finetti’s Theorem says that there must be a random probability measure G making θ1, θ2, . . . iid. This is the DP.
De Finetti’s Theorem
Let θ1, θ2, . . . be an infinite sequence of random variables with joint distribution p. Suppose that for all n ≥ 1, and all permutations σ ∈ Σn on n objects,

p(θ1, . . . , θn) = p(θσ(1), . . . , θσ(n))

that is, the sequence is infinitely exchangeable. Then there exists a (unique) latent random parameter G such that:

p(θ1, . . . , θn) = ∫ p(G) ∏_{i=1}^n p(θi | G) dG

where p(G) is the distribution of G.
I The θi’s are independent given G.
I It is sufficient to define G through the conditionals p(θn | θ1, . . . , θn−1).
I G can be infinite dimensional (indeed it is often a random measure).
Latent Variable Modelling
I Say we have n vector observations x1, . . . , xn.
I Model each observation as a linear combination of K latent sources:

xi = Σ_{k=1}^K Λk yik + εi

yik: activity of source k in datum i.
Λk: basis vector describing the effect of source k.
I Examples include principal components analysis, factor analysis, independent components analysis.
I How many sources are there?
I Do we believe that K sources are sufficient to explain all our data?
I What prior distribution should we use for sources?
Binary Latent Variable Models
I Consider a latent variable model with binary sources/features:

zik = 1 with probability μk; 0 with probability 1 − μk.

I Example: data items could be movies like “Terminator 2”, “Shrek” and “Lord of the Rings”, and features could be “science fiction”, “fantasy”, “action” and “Arnold Schwarzenegger”.
I Place a beta prior over the probabilities of features:

μk ∼ Beta(α/K, 1)

I We will again take K → ∞.
Indian Buffet Processes
I The Indian buffet process (IBP) describes each customer with a binary vector instead of a cluster assignment.
I Generating from an IBP:
I Parameter α.
I The first customer picks Poisson(α) dishes to eat.
I Subsequent customer i picks dish k with probability mk/i, and picks Poisson(α/i) new dishes.

(Figure: customers-by-tables matrix for the CRP versus customers-by-dishes binary matrix for the IBP.)

[Griffiths and Ghahramani 2006]
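The buffet analogy translates directly into code; this sketch builds the binary customers-by-dishes matrix Z, whose number of columns grows stochastically (α and n are arbitrary values):

```python
import numpy as np

def ibp(n, alpha, rng):
    # Returns the n x K binary matrix Z of which customer ate which dish.
    m = []        # m_k: number of previous customers who tried dish k
    rows = []
    for i in range(1, n + 1):
        row = [rng.random() < mk / i for mk in m]   # old dish k w.p. m_k / i
        new = rng.poisson(alpha / i)                # Poisson(alpha/i) new dishes
        m = [mk + took for mk, took in zip(m, row)] + [1] * new
        rows.append(row + [True] * new)
    Z = np.zeros((n, len(m)), dtype=int)
    for i, row in enumerate(rows):
        Z[i, :len(row)] = row
    return Z

rng = np.random.default_rng(0)
Z = ibp(100, alpha=3.0, rng=rng)
print(Z.shape)   # the total number of dishes K has mean alpha * sum_i 1/i
```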
Indian Buffet Processes and Exchangeability
I The IBP is infinitely exchangeable. For this to make sense, we need to “forget” the ordering of the dishes.
I “Name” each dish k with a Λ*k drawn iid from H.
I Each customer now eats a set of dishes: Ψi = {Λ*k : zik = 1}.
I The joint probability of Ψ1, . . . , Ψn can be calculated:

p(Ψ1, . . . , Ψn) = exp(−α Σ_{i=1}^n 1/i) α^K ∏_{k=1}^K ((mk − 1)!(n − mk)! / n!) h(Λ*k)

K: total number of dishes tried by the n customers.
Λ*k: name of the kth dish tried.
mk: number of customers who tried dish Λ*k.
I De Finetti’s Theorem again states that there is some random measure underlying the IBP.
I This random measure is the beta process.

[Griffiths and Ghahramani 2006, Thibaux and Jordan 2007]
Applications of Indian Buffet Processes
I The IBP can be used in concert with different likelihood models in a variety of applications:

Z ∼ IBP(α)    Y ∼ H    X ∼ F(Z, Y)
p(Z, Y | X) = p(Z, Y) p(X | Z, Y) / p(X)

I Latent factor models for distributed representation [Griffiths and Ghahramani 2005].
I Matrix factorization for collaborative filtering [Meeds et al. 2007].
I Latent causal discovery for medical diagnostics [Wood et al. 2006].
I Protein complex discovery [Chu et al. 2006].
I Psychological choice behaviour [Görür et al. 2006].
I Independent components analysis [Knowles and Ghahramani 2007].
I Learning the structure of deep belief networks [Adams et al. 2010].
Infinite Independent Components Analysis
I Each image Xi is a linear combination of sparse features:

Xi = Σk Λ*k yik

where yik is the activity of feature k, with a sparse prior. One possibility is a mixture of a Gaussian and a point mass at 0:

yik = zik aik    aik ∼ N(0, 1)    Z ∼ IBP(α)

I An ICA model with an infinite number of features.

[Knowles and Ghahramani 2007, Teh et al. 2007]
Beta Processes
I A one-parameter beta process B ∼ BP(α, H) is a random discrete measure of the form

B = Σ_{k=1}^∞ μk δ_{θ*k}

where the points P = {(θ*1, μ1), (θ*2, μ2), . . .} are spikes in a 2D Poisson process with rate measure

αμ^{−1} dμ H(dθ)

I It is the de Finetti measure for the IBP.
I This is an example of a completely random measure.
I A beta process does not have beta distributed marginals.

[Hjort 1990, Thibaux and Jordan 2007]
Beta Processes
Stick-breaking Construction for Beta Processes
I The following generates a draw of B:

vk ∼ Beta(1, α)
μk = (1 − vk) ∏_{i=1}^{k−1} (1 − vi)
θ*k ∼ H
B = Σ_{k=1}^∞ μk δ_{θ*k}

I The above is the complement of the stick-breaking construction for DPs.

[Teh et al. 2007]
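A truncated draw mirrors the DP code but keeps the complementary stick lengths (truncation at K atoms and H = N(0, 1) are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, K = 2.0, 500
v = rng.beta(1.0, alpha, size=K)
# mu_k = (1 - v_k) * prod_{i<k} (1 - v_i) = prod_{i<=k} (1 - v_i):
mu = np.cumprod(1.0 - v)
theta = rng.standard_normal(K)           # atom locations theta*_k ~ H = N(0, 1)
print(mu[0], mu[-1])                     # weights lie in (0, 1) and decrease
```

Unlike the DP weights, these μk need not sum to one; B is a measure with random total mass, not a probability measure.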
Topic Modelling with Latent Dirichlet Allocation
I Infer topics from a document corpus, topics being sets of words that tend to co-occur together.
I Using (Bayesian) latent Dirichlet allocation:

πj ∼ Dirichlet(α/K, . . . , α/K)
θk ∼ Dirichlet(β/W, . . . , β/W)
zji | πj ∼ Multinomial(πj)
xji | zji, θ ∼ Multinomial(θ_{zji})

I How many topics can we find from the corpus?
I Can we take the number of topics K → ∞?
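The generative process above can be sampled end to end (corpus sizes and hyperparameter values here are arbitrary toy choices):

```python
import numpy as np

rng = np.random.default_rng(0)
K, W, D, n_d = 5, 50, 10, 20            # topics, vocab size, docs, words per doc
alpha, beta = 10.0, 25.0

theta = rng.dirichlet([beta / W] * W, size=K)   # theta_k: topic's word distribution
docs = []
for j in range(D):
    pi_j = rng.dirichlet([alpha / K] * K)       # pi_j: document's topic proportions
    z = rng.choice(K, size=n_d, p=pi_j)         # z_ji: topic of word i in doc j
    x = np.array([rng.choice(W, p=theta[k]) for k in z])  # x_ji: the word itself
    docs.append(x)
print(len(docs), docs[0].shape)
```

Inference would reverse this process, recovering θ and π from the observed words x alone.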
Hierarchical Dirichlet Processes
I Use a DP mixture for each group.
I Unfortunately there is no sharing of clusters across different groups because H is smooth.
I Solution: make the base distribution H discrete.
I Put a DP prior on the common base distribution.

[Teh et al. 2006]
Hierarchical Dirichlet Processes
I A hierarchical Dirichlet process:

G0 ∼ DP(α0, H)
G1, G2 | G0 ∼ DP(α, G0) iid

I Extension to larger hierarchies is straightforward.
Hierarchical Dirichlet Processes
I Making G0 discrete forces shared clusters between G1 and G2.
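A truncated sketch of this sharing (the truncation level, α values, and H = N(0, 1) are illustrative assumptions): once G0 is truncated to K atoms, each group's weights over those same atoms are simply Dirichlet with parameters α·π0.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha0, alpha, K = 5.0, 3.0, 200
v = rng.beta(1.0, alpha0, size=K)                # stick-breaking for G0
pi0 = v * np.cumprod(np.concatenate(([1.0], 1.0 - v[:-1])))
pi0 /= pi0.sum()                                 # renormalize the truncation
atoms = rng.standard_normal(K)                   # shared atoms theta*_k ~ H

# G_j | G0 ~ DP(alpha, G0): group weights over the SAME atoms,
# which is exactly what makes clusters shared across groups.
pi1 = rng.dirichlet(alpha * pi0)
pi2 = rng.dirichlet(alpha * pi0)
print(pi1.shape, pi2.shape)
```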
Hierarchical Dirichlet Processes
(Figures: perplexity on test abstracts of LDA versus the HDP mixture, as a function of the number of LDA topics; and the posterior over the number of topics in the HDP mixture.)
Chinese Restaurant Franchise
(Figure: a Chinese restaurant franchise, with one restaurant per group j = 1, 2, 3; each table in a group's restaurant is in turn a customer of a global restaurant, so clusters are shared across groups.)
Visual Scene Analysis with Transformed DPs
(Figure 15 of [Sudderth et al. 2008]: TDP model for 2D visual scenes, with a cartoon illustration of the generative process. A global mixture G0 describes the expected frequency and image position of visual categories, whose internal structure is represented by part-based appearance models. Each image distribution Gj instantiates a randomly chosen set of objects at transformed locations; image features with appearance wji and position vji are then sampled from transformed parameters corresponding to parts of object oji. The model employs three stick-breaking processes, allowing uncertainty in the number of visual categories, the parts composing each category, and the object instances depicted in each image.)

[Sudderth et al. 2008]
Visual Scene Analysis with Transformed DPs
(Figure 16 of [Sudderth et al. 2008]: learned contextual, fixed-order models for street scenes and office scenes, each containing four objects. Top: Gaussian distributions over the positions of other objects given the location of the car (street) or computer screen (office). Bottom: parts generating at least 5% of each category's features, with intensity proportional to probability; dashed ellipses indicate each object's marginal transformation covariance. Intuitively, street scenes have a vertically layered structure, while in office scenes the keyboard is typically located beneath the monitor, and the mouse to the keyboard's right.)

[Sudderth et al. 2008]
Hierarchical Modelling
(Graphical model: three groups of data x1i, x2i, x3i, each with its own parameter θ1, θ2, θ3.)

[Gelman et al. 1995]
Hierarchical Modelling

(Graphical model: the group parameters θ1, θ2, θ3 are now tied together through a shared parent parameter θ0.)

[Gelman et al. 1995]
Topic Hierarchies
(Figure: a topic hierarchy learned from a corpus of journal abstracts, with more general topics near the root and increasingly specialized topics toward the leaves.)

[Blei et al. 2010]
Nested Chinese Restaurant Process
[Figure: customers 1–9 descend a nested sequence of restaurants, being partitioned into tables at each level of the tree]
[Blei et al. 2010]66 / 127
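The seating process in the figure is easy to simulate. The sketch below is illustrative code (our own naming, not from Blei et al.): each document descends a tree of Chinese restaurant processes, choosing at every node an existing child in proportion to how many earlier documents chose it, or a new child with probability proportional to α.

```python
import random

def crp_next(counts, alpha, rng):
    """Sample a table from a Chinese restaurant process with current
    table counts; returns len(counts) to indicate a new table."""
    n = sum(counts)
    r = rng.random() * (n + alpha)
    acc = 0.0
    for k, c in enumerate(counts):
        acc += c
        if r < acc:
            return k
    return len(counts)

def nested_crp(num_docs, depth, alpha, seed=0):
    """Each document descends `depth` levels through a tree of CRPs,
    one restaurant per node; its path is the sequence of child choices."""
    rng = random.Random(seed)
    tree = {}    # path prefix (tuple) -> child occupancy counts
    paths = []
    for _ in range(num_docs):
        path = ()
        for _ in range(depth):
            counts = tree.setdefault(path, [])
            k = crp_next(counts, alpha, rng)
            if k == len(counts):
                counts.append(0)
            counts[k] += 1
            path += (k,)
        paths.append(path)
    return paths

paths = nested_crp(num_docs=9, depth=3, alpha=1.0)
```

The first document always creates a fresh path down the left spine of the tree; later documents either follow popular paths or branch off.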
Visual Taxonomies
[Figure: the learned taxonomy over the 13 scene categories (CALsuburb, MITcoast, MITforest, MIThighway, MITinsidecity, MITmountain, MITopencountry, MITstreet, MITtallbuilding, PARoffice, bedroom, kitchen, livingroom), with major groups A, B, C; below, top categories, images and topics for leaves 1 and 2 and the shared node A]
Figure 5. Unsupervised taxonomy learned on the 13 scenes data set. Top: the entire taxonomy shown as a tree. Each category is color-coded according to the legend on the right. The proportion of a given color in a node corresponds to the proportion of images of the corresponding category. Note that this category information was not used when inferring the taxonomy. There are three large groups, marked A, B, and C. Roughly, group C contains natural scenes, such as forest and mountains. Group A contains cluttered man-made scenes, such as tall buildings and city scenes. Group B contains man-made scenes that are less cluttered, such as highways. These groups split into finer sub-groups at the third level. Each of the third-level groups splits into several tens of fourth-level subgroups, typically with 10 or less images in each. These are omitted from the figure for clarity. Below the tree, the top row shows the information represented in leaf 1 (the leftmost leaf). Two categories most frequent in that node were selected, and two most probable images from each category are shown. The most probable topic in the node is also displayed. For that topic, six most probable visual words are shown. The display for each visual word has two parts. First, in the top left corner the pixel-wise average of all image patches assigned to that visual word is shown. It gives a rough idea of the overall structure of the visual word. For example, in the topic for leaf 1, the visual words seem to represent vertical edges. Second, six patches assigned to that visual word were selected at random. These are shown at the bottom of each visual word, on a 2 × 3 grid. Next, information for node 2 is shown in a similar format. Finally, the bottom row shows the top two topics from node A, which is shared between leaves 1 and 2. Both leaves have clutter (topic 1) and horizontal bars (topic 2), and these are represented at the shared node.
In [1], a taxonomy of object parts was learned. In contrast, we learn a taxonomy of object categories.
Finally, the original NCRP model was independently applied to image data in a concurrent publication [16]. The differences between TAX and NCRP are summarized in section 2. In addition, [16] uses different sets of visual words at different levels of the taxonomy. This encourages the taxonomy to learn different representations at different levels. The disadvantage is that the sets had to be manually defined, and without this NCRP performed poorly. In contrast, in TAX the same set of visual words is used throughout the taxonomy, and different representations at different levels emerge completely automatically.
6. Discussion
We presented TAX, a nonparametric probabilistic model for learning visual taxonomies. In the context of computer vision, it is the first fully unsupervised model that can organize images into a hierarchy of categories.
Our experiments in section 4.1 show that an intuitive hi-
[Bart et al. 2008] 67 / 127
Hierarchical Clustering
68 / 127
Hierarchical Clustering
I Bayesian approach to hierarchical clustering: place prior over tree structures, and infer posterior.
I The nested DP can be used as a prior over layered tree structures.
I Another prior is a Dirichlet diffusion tree, which produces binary ultrametric trees, and which can be obtained as an infinitesimal limit of a nested DP. It is an example of a fragmentation process.
I Yet another prior is Kingman's coalescent, which also produces binary ultrametric trees, but is an example of a coalescent process.
[Neal 2003, Teh et al. 2008, Bertoin 2006]
69 / 127
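Kingman's coalescent is simple to simulate from the leaves backwards in time; the sketch below illustrates the process itself (our own naming, not the clustering algorithms of the references): with k lineages remaining, the next merger happens after an Exp(k(k−1)/2) waiting time and joins a uniformly chosen pair, yielding a binary tree.

```python
import random

def kingman_coalescent(n, seed=0):
    """Simulate Kingman's coalescent for n leaves: while k lineages
    remain, wait an Exp(k*(k-1)/2)-distributed time, then merge a
    uniformly chosen pair; returns a nested-tuple binary tree and
    the increasing coalescence times."""
    rng = random.Random(seed)
    lineages = list(range(n))          # leaves labelled 0..n-1
    t, times = 0.0, []
    while len(lineages) > 1:
        k = len(lineages)
        t += rng.expovariate(k * (k - 1) / 2.0)
        i, j = sorted(rng.sample(range(k), 2))
        pair = (lineages[i], lineages[j])
        lineages = [x for idx, x in enumerate(lineages) if idx not in (i, j)]
        lineages.append(pair)          # merged internal node
        times.append(t)
    return lineages[0], times

tree, times = kingman_coalescent(5)
```

Because early waiting rates k(k−1)/2 are large, most mergers happen quickly and the final pair takes the longest, which is the characteristic shape of coalescent trees.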
Nested Dirichlet Process
I Underlying stochastic process for the nested CRP is a nested DP.
Hierarchical DP:
G0 ∼ DP(α0,H)
Gj |G0 ∼ DP(α,G0)
xji |Gj ∼ Gj
Nested DP:
G0 ∼ DP(α,DP(α0,H))
Gi ∼ G0
xi |Gi ∼ Gi
I The hierarchical DP starts with groups of data items, and analyses them together by introducing dependencies through G0.
I The nested DP starts with one set of data items, partitions them into different groups, and analyses each group separately.
I The two constructions have orthogonal effects, and can be used together.
[Rodríguez et al. 2008]
70 / 127
Nested Beta/Indian Buffet Processes
[Figure: items 1–5 assigned to features along a layered tree]
I Exchangeable distribution over layered trees.
71 / 127
Hierarchical Beta/Indian Buffet Processes
[Figure: binary feature matrices for several groups, with columns (features) shared across groups through a common base measure]
I Different from the hierarchical beta process of [Thibaux and Jordan 2007].
72 / 127
Deep Structure Learning
Learning the Structure of Deep Sparse Graphical Models
[Figure 4 panels (a)–(d)]
Figure 4: Olivetti faces a) Test images on the left, with reconstructed bottom halves on the right. b) Sixty features learned in the bottom layer, where black shows absence of an edge. Note the learning of sparse features corresponding to specific facial structures such as mouth shapes, noses and eyebrows. c) Raw predictive fantasies. d) Feature activations from individual units in the second hidden layer.
by k′. We calculate η^{(m)¬k}_{k′}, the number of nonzero entries in the k′th column of Z^{(m+1)}, excluding any entry in the kth row. If η^{(m)¬k}_{k′} is zero, we call the unit k′ a singleton parent, to be dealt with in the second phase. If η^{(m)¬k}_{k′} is nonzero, we introduce (or keep) the edge from unit u^{(m+1)}_{k′} to u^{(m)}_k with Bernoulli probability

p(Z^{(m+1)}_{k,k′} = 1 | Θ∖Z^{(m+1)}_{k,k′}) = (1/Z) · [η^{(m)¬k}_{k′} / (K^{(m)} + β^{(m)} − 1)] · ∏_{n=1}^N p(u^{(m)}_{n,k} | Z^{(m+1)}_{k,k′} = 1, Θ∖Z^{(m+1)}_{k,k′})

p(Z^{(m+1)}_{k,k′} = 0 | Θ∖Z^{(m+1)}_{k,k′}) = (1/Z) · [1 − η^{(m)¬k}_{k′} / (K^{(m)} + β^{(m)} − 1)] · ∏_{n=1}^N p(u^{(m)}_{n,k} | Z^{(m+1)}_{k,k′} = 0, Θ∖Z^{(m+1)}_{k,k′}),

where Z is the appropriate normalization constant.

In the second phase, we consider deleting connections to singleton parents of unit k, or adding new singleton parents. We use Metropolis–Hastings with a birth/death process. If there are currently K∗ singleton parents, then with probability 1/2 we propose adding a new one by drawing it recursively from deeper layers, as above. We accept the proposal to insert a connection to this new parent unit with M–H ratio¹

[α^{(m)}β^{(m)} / ((K∗ + 1)² (β^{(m)} + K^{(m)} − 1))] · ∏_{n=1}^N [p(u^{(m)}_{n,k} | Z^{(m+1)}_{k,j} = 1, Θ∖Z^{(m+1)}_{k,j}) / p(u^{(m)}_{n,k} | Z^{(m+1)}_{k,j} = 0, Θ∖Z^{(m+1)}_{k,j})].

If we do not propose to insert a unit and K∗ > 0, then with probability 1/2 we select uniformly from among the singleton parents of unit k and propose removing the connection to it. We accept the proposal to remove the jth one with M–H acceptance ratio given by

[K∗² (β^{(m)} + K^{(m)} − 1) / (α^{(m)}β^{(m)})] · ∏_{n=1}^N [p(u^{(m)}_{n,k} | Z^{(m+1)}_{k,j} = 0, Θ∖Z^{(m+1)}_{k,j}) / p(u^{(m)}_{n,k} | Z^{(m+1)}_{k,j} = 1, Θ∖Z^{(m+1)}_{k,j})].
After these phases, chains of units that are not ancestors of the visible units can be discarded. Notably, this birth/death operator samples from the IBP posterior with a nontruncated equilibrium distribution, even without conjugacy. Unlike the stick-breaking approach of Teh et al. (2007), it allows use of the two-parameter IBP, which is important to this model.
5 Reconstructing Images
We applied our approach to three image data sets—the Olivetti faces, the MNIST digits and the Frey faces—and analyzed the structures that arose in the model posteriors. To assess the model, we constructed a missing-data problem using held-out images from each set. We removed the bottom halves of the test images and used the model to reconstruct the missing data, conditioned on the top half. Prediction itself was done by integrating out the parameters and structure via MCMC and averaging over predictive samples.
Olivetti Faces The Olivetti faces data (Samaria and Harter, 1994) are 400 64×64 grayscale images of the faces of 40 distinct subjects, which we divided randomly into 350 training and 50 test data. Fig 4a shows six bottom-half test set reconstructions on the right, compared to the ground truth on the left. Fig 4b shows a subset of sixty weight patterns from a posterior sample of the structure, with black indicating that no edge is present from that hidden unit to the visible unit (pixel). The algorithm is clearly assigning hidden units to specific and interpretable features, such as mouth shapes, facial hair, and skin tone. Fig 4c shows ten pure fantasies from the model, easily generated in a directed acyclic belief network. Fig 4d shows the result of activating individual units in the second hidden layer, while keeping the rest unactivated, and propagating the activations down to the visible pixels. This provides an idea of the image space spanned by the principal components of these deeper units. A typical posterior network structure had three hidden layers, with approximately seventy units in each layer.
1These equations had an error in the original version.
[Adams et al. 2010]
73 / 127
Transfer Learning
I Many recent machine learning paradigms can be understood as trying to model data from heterogeneous sources and types.
I Semi-supervised learning: we have labelled data, and unlabelled data.
I Multi-task learning: we have multiple tasks with different distributions but structurally similar.
I Domain adaptation: we have a small amount of pertinent data, and a large amount of data from a related problem or domain.
I The transfer learning problem is how to transfer information between different sources and types.
I Flexible nonparametric models can allow for more information extraction and transfer.
I Hierarchies and nestings are different ways of putting together multiple stochastic processes to form complex models.
[Jordan 2010]
74 / 127
Outline
Introduction
Regression and Gaussian Processes
Density Estimation, Clustering and Dirichlet Processes
Latent Variable Models and Indian Buffet and Beta Processes
Topic Modelling and Hierarchical Processes
Hierarchical Structure Discovery and Nested Processes
Time Series Models
Modelling Power-laws with Pitman-Yor Processes
Summary75 / 127
Hidden Markov Models
[Figure: HMM graphical model with states z0, z1, z2, . . . , zτ, observations x1, x2, . . . , xτ, transition rows πk and emission parameters θ∗k]
πk ∼ Dirichlet(α/K , . . . , α/K )   zi |zi−1,πzi−1 ∼ Multinomial(πzi−1 )
θ∗k ∼ H   xi |zi , θ∗zi ∼ F (θ∗zi )
I Can we take K →∞?
I Can we do so while imposing structure in transition probability matrix?
76 / 127
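The finite model above can be sampled directly from its prior; the sketch below is a minimal illustration (our own naming, emissions left abstract) that draws the transition matrix from the symmetric Dirichlet prior and then simulates the hidden state sequence.

```python
import random

def dirichlet(alphas, rng):
    """Dirichlet sample via normalized independent Gamma draws."""
    g = [rng.gammavariate(a, 1.0) for a in alphas]
    s = sum(g)
    return [x / s for x in g]

def sample_bayesian_hmm(K, T, alpha, seed=0):
    """Sample from the finite Bayesian HMM prior: each row
    pi_k ~ Dirichlet(alpha/K, ..., alpha/K), then states
    z_t | z_{t-1} ~ Multinomial(pi_{z_{t-1}}); the emission
    model F and parameters theta*_k ~ H are omitted."""
    rng = random.Random(seed)
    pi = [dirichlet([alpha / K] * K, rng) for _ in range(K)]
    z, state = [], 0                  # z_0 = 0 for simplicity
    for _ in range(T):
        r, acc = rng.random(), 0.0
        for k, p in enumerate(pi[state]):
            acc += p
            if r < acc:
                state = k
                break
        z.append(state)
    return pi, z

pi, z = sample_bayesian_hmm(K=5, T=100, alpha=2.0)
```

With small α/K most Dirichlet mass concentrates on a few columns, so sampled chains tend to reuse a small subset of the K states; this is the behaviour the K → ∞ limit preserves.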
Infinite Hidden Markov Models
[Figure: HMM graphical model with states z0, z1, z2, . . . , zτ, observations x1, x2, . . . , xτ, transition rows πk and emission parameters θ∗k]
β ∼ GEM(γ)   πk |β ∼ DP(α,β)   zi |zi−1,πzi−1 ∼ Multinomial(πzi−1 )
θ∗k ∼ H   xi |zi , θ∗zi ∼ F (θ∗zi )
I Hidden Markov models with an infinite number of states: infinite HMM.
I Hierarchical DPs used to share information among transition probability vectors prevent "run-away" states: HDP-HMM.
[Beal et al. 2002, Teh et al. 2006]
77 / 127
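A convenient computational handle on this prior is a degree-K truncation, a standard device sketched here with our own naming (this is not the inference of Beal et al. or Teh et al.): draw the shared weights β by stick-breaking, then approximate each row πk ∼ DP(α, β) by a finite Dirichlet with parameters αβ1, . . . , αβK.

```python
import random

def gem_weights(gamma, K, rng):
    """Truncated stick-breaking GEM(gamma): beta_k = v_k * prod_{i<k}(1 - v_i)
    with v_k ~ Beta(1, gamma); the last weight absorbs the remaining stick."""
    beta, stick = [], 1.0
    for _ in range(K - 1):
        v = rng.betavariate(1.0, gamma)
        beta.append(v * stick)
        stick *= 1.0 - v
    beta.append(stick)
    return beta

def hdp_hmm_prior(gamma, alpha, K, seed=0):
    """Truncated HDP-HMM prior: shared beta ~ GEM(gamma); each row
    pi_k ~ DP(alpha, beta) is approximated on the truncation by
    Dirichlet(alpha*beta_1, ..., alpha*beta_K)."""
    rng = random.Random(seed)
    beta = gem_weights(gamma, K, rng)
    pi = []
    for _ in range(K):
        # clamp tiny parameters so gammavariate stays well defined
        g = [rng.gammavariate(max(alpha * b, 1e-12), 1.0) for b in beta]
        s = sum(g)
        pi.append([x / s for x in g])
    return beta, pi

beta, pi = hdp_hmm_prior(gamma=1.0, alpha=5.0, K=20)
```

Because every row is centred on the same β, all rows place mass on the same subset of states, which is exactly the sharing that prevents run-away states.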
Word Segmentation
I Given sequences of utterances or characters, can a probabilistic model segment sequences into coherent chunks ("words")?
canyoureadthissentencewithoutspaces?
can you read this sentence without spaces?
I Use an infinite HMM: each chunk/word is a state, with Markov model of state transitions.
I Nonparametric model is natural, since number of words unknown before segmentation.
[Goldwater et al. 2006b]
78 / 127
Word Segmentation
         Words  Lexicon  Boundaries
NGS-u    68.9   82.6     52.0
MBDP-1   68.2   82.3     52.4
DP       53.8   74.3     57.2
NGS-b    68.3   82.1     55.7
HDP      76.6   87.7     63.1
I NGS-u: n-gram Segmentation (unigram) [Venkataraman 2001].
I NGS-b: n-gram Segmentation (bigram) [Venkataraman 2001].
I MBDP-1: Model-based Dynamic Programming [Brent 1999].
I DP, HDP: Nonparametric model, without and with Markov dependencies.
[Goldwater et al. 2006a]
79 / 127
Sticky HDP-HMM
I In typical HMMs or in infinite HMMs the model does not give special treatment to self-transitions (from a state to itself).
I In many HMM applications self-transitions are much more likely.
I Example application of HMMs: speaker diarization.
I Straightforward extension of HDP-HMM prior encourages higher self-transition probabilities:
πk |β ∼ DP(α + κ, (αβ + κδk)/(α + κ))
[Beal et al. 2002, Fox et al. 2008]
80 / 127
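In a finite truncation the sticky prior amounts to adding κ to the self-transition pseudo-count of row k. A small illustrative sketch (our own naming, not code from Fox et al.):

```python
import random

def sticky_row(beta, alpha, kappa, k, rng):
    """Row k of a truncated sticky HDP-HMM transition matrix:
    pi_k ~ Dirichlet(alpha*beta_1, ..., alpha*beta_k + kappa, ..., alpha*beta_K),
    the finite-truncation form of DP(alpha + kappa, (alpha*beta + kappa*delta_k)/(alpha + kappa)).
    kappa > 0 boosts the self-transition probability."""
    params = [alpha * b for b in beta]
    params[k] += kappa
    g = [rng.gammavariate(a, 1.0) for a in params]
    s = sum(g)
    return [x / s for x in g]

beta = [0.4, 0.3, 0.2, 0.1]            # illustrative shared weights
sticky = sticky_row(beta, 2.0, 50.0, 1, random.Random(0))
plain = sticky_row(beta, 2.0, 0.0, 1, random.Random(0))
```

With κ large relative to α, almost all of row k's mass sits on the self-transition, which is what makes state durations persist in applications like diarization.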
Sticky HDP-HMM
[Figure: bar charts of diarization error rates (DERs) per meeting]
FIG 14. (a) Chart comparing the DERs of the sticky and original HDP-HMM with DP emissions to those of ICSI for each of the 21 meetings. Here, we chose the state sequence at the 10,000th Gibbs iteration that minimizes the expected Hamming distance. For meeting 16 using the sticky HDP-HMM with DP emissions, we chose between state sequences at Gibbs iteration 50,000. (b) DERs associated with using ground truth speaker labels for the post-processed data. Here, we assign undetected non-speech a label different than the pre-processed non-speech.
9. Discussion. We have developed a Bayesian nonparametric approach to the problem of speaker diarization, building on the HDP-HMM presented in Teh et al. (2006). Although the original HDP-HMM does not yield competitive speaker diarization performance due to its inadequate modeling of the temporal persistence of states, the sticky HDP-HMM that we have presented here resolves this problem and yields a state-of-the-art solution to the speaker diarization problem. We have also shown that this sticky HDP-HMM allows a fully Bayesian nonparametric treatment of multimodal emissions, disambiguated by its bias towards self-transitions. Accommodating multimodal emissions is essential for the speaker diarization problem and is likely to be an important ingredient in other applications of the HDP-HMM to problems in speech technology.
We also presented efficient sampling techniques with mixing rates that improve on the state-of-the-art by harnessing the Markovian structure of the HDP-HMM. Specifically, we proposed employing a truncated approximation to the HDP and block-sampling the state sequence using a variant of the forward-backward algorithm. Although the blocked samplers yield substantially improved mixing rates over the sequential, direct assignment samplers, there are still some pitfalls to these sampling methods. One issue is that for each new considered state, the parameter sampled from the prior distribution must better explain the data than the parameters associated with other states that have already been informed by the data. In high-dimensional applications, and in cases where state-specific emission distributions are not clearly distinguishable, this method for adding new states poses a signif-
[Fox et al. 2008]
81 / 127
Infinite Factorial HMM
Figure 1: The Hidden Markov Model Figure 2: The Factorial Hidden Markov Model
in a factored form. This way, information from the past is propagated in a distributed manner through a set of parallel Markov chains. The parallel chains can be viewed as latent features which evolve over time according to Markov dynamics. Formally, the FHMM defines a probability distribution over observations y1, y2, · · · , yT as follows: M latent chains s(1), s(2), · · · , s(M) evolve according to Markov dynamics and at each timestep t, the Markov chains generate an output yt using some likelihood model F parameterized by a joint state-dependent parameter θ_{s_t^{(1:M)}}. The graphical model in figure 2 shows how the FHMM is a special case of a dynamic Bayesian network. The FHMM has been successfully applied in vision [3], audio processing [4] and natural language processing [5]. Unfortunately, the dimensionality M of our factorial representation or equivalently, the number of parallel Markov chains, is a new free parameter for the FHMM which we would prefer learning from data rather than specifying it beforehand.
Recently, [6] introduced the basic building block for nonparametric Bayesian factor models called the Indian Buffet Process (IBP). The IBP defines a distribution over infinite binary matrices Z where element znk denotes whether datapoint n has feature k or not. The IBP can be combined with distributions over real numbers or integers to make the features useful for practical problems.
In this work, we derive the basic building block for nonparametric Bayesian factor models for time series which we call the Markov Indian Buffet Process (mIBP). Using this distribution we build a nonparametric extension of the FHMM which we call the Infinite Factorial Hidden Markov Model (iFHMM). This construction allows us to learn a factorial representation for time series.
In the next section, we develop the novel and generic nonparametric mIBP distribution. Section 3 describes how to use the mIBP to build the iFHMM, which in turn can be used to perform independent component analysis on time series data. Section 4 shows results of our application of the iFHMM to a blind source separation problem. Finally, we conclude with a discussion in section 5.
2 The Markov Indian Buffet Process
Similar to the IBP, we define a distribution over binary matrices to model whether a feature at time t is on or off. In this representation rows correspond to timesteps and the columns to features or Markov chains. We want the distribution over matrices to satisfy the following two properties: (1) the potential number of columns (representing latent features) should be able to be arbitrarily large; (2) the rows (representing timesteps) should evolve according to a Markov process.
Below, we will formally derive the mIBP distribution in two steps: first, we describe a distribution over binary matrices with a finite number of columns. We choose the hyperparameters carefully so we can easily integrate out the parameters of the model. In a second phase, we take the limit as the number of features goes to infinity in a manner analogous to [7]'s derivation of infinite mixtures.
2.1 A finite model
Let S represent a binary matrix with T rows (datapoints) and M columns (features). stm represents the hidden state at time t for Markov chain m. Each Markov chain evolves according to the transition matrix
W(m) = [ 1 − am   am ; 1 − bm   bm ]    (3)
I Take M →∞ for the following model specification:
P(s(m)t = 1 | s(m)t−1 = 0) = am,   am ∼ Beta(α/M, 1)
P(s(m)t = 1 | s(m)t−1 = 1) = bm,   bm ∼ Beta(γ, δ)
I Stochastic process is a Markov Indian buffet process. It is an example of a dependent random measure.
[Van Gael et al. 2009]82 / 127
Nonparametric Grammars, Hierarchical HMMs etc
I In linguistics, grammars are much more plausible as generative models of sentences.
I Learning the structure of probabilistic grammars is even more difficult, and Bayesian nonparametrics provides a compelling alternative.
[Liang et al. 2007, Finkel et al. 2007, Johnson et al. 2007, Heller et al. 2009]
83 / 127
Motion Capture Analysis
I Goal: find coherent "behaviour" in the time series that transfers to other time series.
Slides courtesy of [Fox et al. 2010]
84 / 127
Motion Capture Analysis
I Transfer knowledge among related time series in the form of a library of "behaviours".
I Allow each time series model to make use of an arbitrary subset of the behaviours.
I Method: represent behaviors as states in an autoregressive HMM, and use the beta/Bernoulli process to pick out subsets of states.
Slides courtesy of [Fox et al. 2010]
85 / 127
BP-AR-HMM
Slides courtesy of [Fox et al. 2010]
86 / 127
Motion Capture Results
Slides courtesy of [Fox et al. 2010]
87 / 127
High Order Markov Models
I Decompose the joint distribution of a sequence of variables into conditional distributions:
P(x1, x2, . . . , xT ) = ∏_{t=1}^T P(xt |x1, . . . , xt−1)
I An Nth order Markov model approximates the joint distribution as:
P(x1, x2, . . . , xT ) = ∏_{t=1}^T P(xt |xt−N , . . . , xt−1)
I Such models are particularly prevalent in natural language processing, compression and biological sequence modelling.
toad, in, a, hole
t, o, a, d, _, i, n, _, a, _, h, o, l, e
A, C, G, T, C, C, A
I Would like to take N →∞.
88 / 127
High Order Markov Models
I Difficult to fit such models due to data sparsity.
P(xt |xt−N , . . . , xt−1) = C(xt−N , . . . , xt−1, xt) / C(xt−N , . . . , xt−1)
I Sharing information via hierarchical models.
P(xt |xt−N:t−1 = u) = Gu(xt )
I A context tree.
[Figure: context tree with root G∅, child Ga, its children Gin a, Gis a, Gabout a, and leaves Gtoad in a, Gstuck in a below Gin a]
[MacKay and Peto 1994, Teh 2006a]
89 / 127
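The sharing implied by the context tree can be realized by recursive interpolation: each context's predictive distribution backs off to its parent, the context with the oldest symbol dropped. The sketch below is plain interpolated smoothing with strength α, an illustrative stand-in for the full hierarchical Dirichlet/Pitman-Yor model (all names are ours):

```python
from collections import defaultdict

class BackoffNgram:
    """Back-off n-gram model over a context tree: each context u stores
    counts C(u, x), and the prediction at u interpolates its empirical
    distribution with the parent context's prediction, mirroring
    P(x_t | context u) = G_u with G_u centred on its parent.
    Simple interpolated smoothing, not the full Bayesian model."""

    def __init__(self, order, alpha, vocab_size):
        self.order, self.alpha, self.V = order, alpha, vocab_size
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, seq):
        # record counts for every context length 0..order at each position
        for t, x in enumerate(seq):
            for n in range(self.order + 1):
                u = tuple(seq[max(0, t - n):t])
                if len(u) == n:
                    self.counts[u][x] += 1

    def prob(self, u, x):
        u = tuple(u)[-self.order:] if self.order else ()
        if len(u) == 0:
            c = self.counts[()]
            n = sum(c.values())
            return (c[x] + self.alpha / self.V) / (n + self.alpha)
        c = self.counts[u]
        n = sum(c.values())
        # back off to the parent context (drop the oldest symbol)
        return (c[x] + self.alpha * self.prob(u[1:], x)) / (n + self.alpha)

model = BackoffNgram(order=2, alpha=1.0, vocab_size=27)
model.observe(list("toad_in_a_hole"))
```

A context never seen in training simply returns its parent's prediction, so data observed in short contexts is shared with all longer contexts that extend them, which is exactly the role of the hierarchy over the Gu.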
Outline
Introduction
Regression and Gaussian Processes
Density Estimation, Clustering and Dirichlet Processes
Latent Variable Models and Indian Buffet and Beta Processes
Topic Modelling and Hierarchical Processes
Hierarchical Structure Discovery and Nested Processes
Time Series Models
Modelling Power-laws with Pitman-Yor Processes
Summary90 / 127
Pitman-Yor Processes
I Two-parameter generalization of the Chinese restaurant process:
p(customer n sat at table k |past) = (nk − β)/(n − 1 + α) if k is an occupied table, (α + βK)/(n − 1 + α) if k is a new table
I Associating each cluster k with a unique draw θ∗k ∼ H, the corresponding Pólya urn scheme is also exchangeable.
[Figure: number of occupied tables versus number of customers (up to 10,000), for a Dirichlet process draw (α=30, d=0) and a Pitman-Yor draw (α=1, d=0.5)]
91 / 127
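The two-parameter seating rule above simulates directly; the sketch below (illustrative code) seats 10,000 customers so the Pitman-Yor (β = 0.5) and Dirichlet (β = 0) regimes can be compared:

```python
import random

def pitman_yor_crp(n, alpha, beta, seed=0):
    """Simulate the two-parameter CRP: customer n joins occupied table k
    with probability (n_k - beta)/(n - 1 + alpha), or opens a new table
    with probability (alpha + beta*K)/(n - 1 + alpha). Returns table sizes."""
    rng = random.Random(seed)
    tables = []                        # occupancy counts
    for i in range(n):
        total = i + alpha              # (n - 1) + alpha for the (i+1)th customer
        r = rng.random() * total
        acc, new = 0.0, True
        for k, nk in enumerate(tables):
            acc += nk - beta           # discounted occupied-table mass
            if r < acc:
                tables[k] += 1
                new = False
                break
        if new:                        # leftover mass alpha + beta*K
            tables.append(1)
    return tables

py = pitman_yor_crp(10000, alpha=1.0, beta=0.5)
dp = pitman_yor_crp(10000, alpha=1.0, beta=0.0)
```

With β = 0.5 the number of tables grows like a power of n rather than logarithmically, so the Pitman-Yor run occupies far more tables than the Dirichlet run.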
Pitman-Yor Processes
I De Finetti's Theorem states that there is a random measure underlying this two-parameter generalization.
I This is the Pitman-Yor process.
I The Pitman-Yor process also has a stick-breaking construction:
πk = vk ∏_{i=1}^{k−1}(1 − vi)   vk ∼ Beta(1 − β, α + kβ)   θ∗k ∼ H   G = ∑_{k=1}^∞ πk δθ∗k
I The Pitman-Yor process cannot be obtained as the infinite limit of a simple parametric model.
[Perman et al. 1992, Pitman and Yor 1997, Ishwaran and James 2001]
92 / 127
Pitman-Yor Processes
I Two salient features of the Pitman-Yor process:
I With more occupied tables, the chance of even more tables becomes higher.
I Tables with smaller occupancy numbers tend to have lower chance of getting new customers.
I The above means that Pitman-Yor processes produce Zipf's Law type behaviour, with K = O(αn^β).
[Figure: log-log plots against the number of customers of the number of tables, the number of tables with one customer, and the proportion of tables with one customer, for α=10 and d = 0.9, 0.5, 0]
93 / 127
Pitman-Yor Processes
Draw from a Pitman-Yor process (α=1, d=0.5):
[Figure: number of occupied tables versus customers, and numbers of customers per table on linear and log-log axes]
Draw from a Dirichlet process (α=30, d=0):
[Figure: number of occupied tables versus customers, and numbers of customers per table on linear and log-log axes]
94 / 127
Pitman-Yor Processes
95 / 127
Hierarchical Pitman-Yor Markov Models
I Use a hierarchical Pitman-Yor prior for high order Markov models.
I Can now take N →∞, making use of coagulation and fragmentation properties of Pitman-Yor processes for computational tractability.
I Non-Markov model called the sequence memoizer .
[Goldwater et al. 2006a, Teh 2006b, Wood et al. 2009, Gasthaus et al. 2010]
96 / 127
Language Modelling
I Compare hierarchical Pitman-Yor model against hierarchical Dirichlet model, and two state-of-the-art language models (interpolated Kneser-Ney, modified Kneser-Ney).
I Results reported as perplexity scores.

T      N  IKN    MKN    HPYLM  HDLM
2e6    3  148.8  144.1  144.3  191.2
4e6    3  137.1  132.7  132.7  172.7
6e6    3  130.6  126.7  126.4  162.3
8e6    3  125.9  122.3  121.9  154.7
10e6   3  122.0  118.6  118.2  148.7
12e6   3  119.0  115.8  115.4  144.0
14e6   3  116.7  113.6  113.2  140.5
14e6   2  169.9  169.2  169.3  180.6
14e6   4  106.1  102.4  101.9  136.6
[Teh 2006b]
97 / 127
Compression
I Predictive models can be used to compress sequence data using entropic coding techniques.
I Compression results on Calgary corpus:

Model              Average bits/byte
gzip               2.61
bzip2              2.11
CTW                1.99
PPM                1.93
Sequence Memoizer  1.89
I See http://deplump.com and http://sequencememoizer.com.
[Gasthaus et al. 2010]
98 / 127
Comparing Finite and Infinite Order Markov Models
[Wood et al. 2009]
99 / 127
Image Segmentation with Pitman-Yor Processes
I Human segmentations of images also seem to follow power-law.
I An unsupervised image segmentation model based on a dependenthierarchical Pitman-Yor processes achieves state-of-the-art results.
[Sudderth and Jordan 2009]100 / 127
Stable Beta Process
I Extensions allow for different aspects of the generative process to be modelled:
I α: controls the expected number of dishes picked by each customer.
I c: controls the overall number of dishes picked by all customers.
I σ: controls power-law scaling (ratio of popular dishes to unpopular ones).
I A completely random measure, with Lévy measure:
αΓ(1 + c)/(Γ(1 − σ)Γ(c + σ)) · µ^(−σ−1) (1 − µ)^(c+σ−1) dµ H(dθ)
[Ghahramani et al. 2007, Teh and Görür 2009]
101 / 127
Stable Beta Process
[Figure: draws for α = 1, 10, 100 (c = 1, σ = 0.5)]
102 / 127
Stable Beta Process
[Figure: draws for c = 0.1, 1, 10 (α = 10, σ = 0.5)]
102 / 127
Stable Beta Process
[Figure: draws for σ = 0.2, 0.5, 0.8 (α = 10, c = 1)]
102 / 127
Modelling Word Occurrences in Documents
[Figure: cumulative number of distinct words versus number of documents, comparing the beta process (BP), the stable beta process (SBP), and the data]
103 / 127
Outline
Introduction
Regression and Gaussian Processes
Density Estimation, Clustering and Dirichlet Processes
Latent Variable Models and Indian Buffet and Beta Processes
Topic Modelling and Hierarchical Processes
Hierarchical Structure Discovery and Nested Processes
Time Series Models
Modelling Power-laws with Pitman-Yor Processes
Summary104 / 127
Summary
I Motivation for Bayesian nonparametrics:
I Allows practitioners to define and work with models with large support, sidestepping model selection.
I New models with useful properties.
I Large variety of applications.
I Various standard Bayesian nonparametric models:
I Dirichlet processes
I Hierarchical Dirichlet processes
I Infinite hidden Markov models
I Indian buffet and beta processes
I Pitman-Yor processes
I Touched upon two important theoretical tools:
I Consistency and Kolmogorov's Consistency Theorem
I Exchangeability and de Finetti's Theorem
I Described a number of applications of Bayesian nonparametrics.
105 / 127
Advertisement
I PhD studentships and postdoctoral fellowships are available.
I In both machine learning and computational neuroscience.
I With both Arthur Gretton and me.
I Happy to speak with anyone interested.
106 / 127
Outline
Relating Different Representations of Dirichlet Processes
Representations of Hierarchical Dirichlet Processes
107 / 127
Representations of Dirichlet Processes
I Posterior Dirichlet process:

G ∼ DP(α, H)
θ | G ∼ G
⇐⇒
θ ∼ H
G | θ ∼ DP(α + 1, (αH + δ_θ)/(α + 1))

I Pólya urn scheme:

θ_n | θ_1:n−1 ∼ (αH + ∑_{i=1}^{n−1} δ_{θ_i}) / (α + n − 1)

I Chinese restaurant process:

p(customer n sat at table k | past) =
  n_k / (n − 1 + α)   if k is an occupied table
  α / (n − 1 + α)     if k is a new table

I Stick-breaking construction:

π_k = β_k ∏_{i=1}^{k−1} (1 − β_i)    β_k ∼ Beta(1, α)    θ*_k ∼ H    G = ∑_{k=1}^∞ π_k δ_{θ*_k}
108 / 127
Posterior Dirichlet Processes
I Suppose G is DP distributed, and θ is G distributed:
G ∼ DP(α,H)
θ|G ∼ G
I We are interested in:
I The marginal distribution of θ with G integrated out.
I The posterior distribution of G conditioning on θ.
109 / 127
Posterior Dirichlet Processes
Conjugacy between the Dirichlet distribution and the multinomial.
I Consider:

(π_1, . . . , π_K) ∼ Dirichlet(α_1, . . . , α_K)
z | (π_1, . . . , π_K) ∼ Discrete(π_1, . . . , π_K)

z is a multinomial variate, taking on value i ∈ {1, . . . , K} with probability π_i.
I Then:

z ∼ Discrete(α_1/∑_i α_i, . . . , α_K/∑_i α_i)
(π_1, . . . , π_K) | z ∼ Dirichlet(α_1 + δ_1(z), . . . , α_K + δ_K(z))

where δ_i(z) = 1 if z takes on value i, and 0 otherwise.
I The converse is also true.
110 / 127
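The conjugacy can be checked numerically: integrating out π, the marginal of z is Discrete(α/∑_i α_i). A small Monte Carlo sketch (illustrative, not from the slides):

```python
import numpy as np

# Monte Carlo sketch of Dirichlet-multinomial conjugacy: drawing
# pi ~ Dirichlet(alpha) and then z | pi ~ Discrete(pi) should give the
# marginal z ~ Discrete(alpha / sum(alpha)) once pi is integrated out.
rng = np.random.default_rng(0)
alpha = np.array([1.0, 2.0, 3.0])

draws = []
for _ in range(20000):
    pi = rng.dirichlet(alpha)         # (pi_1, ..., pi_K) ~ Dirichlet(alpha)
    z = rng.choice(len(alpha), p=pi)  # z | pi ~ Discrete(pi)
    draws.append(z)

freq = np.bincount(draws, minlength=len(alpha)) / len(draws)
# freq should be close to alpha / sum(alpha) = (1/6, 2/6, 3/6)
print(freq)
```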
Posterior Dirichlet Processes
I Fix a partition (A_1, . . . , A_K) of Θ. Then

(G(A_1), . . . , G(A_K)) ∼ Dirichlet(αH(A_1), . . . , αH(A_K))
P(θ ∈ A_i | G) = G(A_i)

I Using Dirichlet-multinomial conjugacy,

P(θ ∈ A_i) = H(A_i)
(G(A_1), . . . , G(A_K)) | θ ∼ Dirichlet(αH(A_1) + δ_θ(A_1), . . . , αH(A_K) + δ_θ(A_K))

I The above is true for every finite partition of Θ. In particular, taking a really fine partition,

p(dθ) = H(dθ)

i.e. θ ∼ H with G integrated out.
I The posterior G | θ is also a Dirichlet process:

G | θ ∼ DP(α + 1, (αH + δ_θ)/(α + 1))
111 / 127
Posterior Dirichlet Processes
G ∼ DP(α, H)
θ | G ∼ G
⇐⇒
θ ∼ H
G | θ ∼ DP(α + 1, (αH + δ_θ)/(α + 1))
112 / 127
Pólya Urn Scheme
I First sample:

θ_1 | G ∼ G,  G ∼ DP(α, H)
⇐⇒ θ_1 ∼ H,  G | θ_1 ∼ DP(α + 1, (αH + δ_{θ_1})/(α + 1))

I Second sample:

θ_2 | θ_1, G ∼ G,  G | θ_1 ∼ DP(α + 1, (αH + δ_{θ_1})/(α + 1))
⇐⇒ θ_2 | θ_1 ∼ (αH + δ_{θ_1})/(α + 1),  G | θ_1, θ_2 ∼ DP(α + 2, (αH + δ_{θ_1} + δ_{θ_2})/(α + 2))

I nth sample:

θ_n | θ_1:n−1, G ∼ G,  G | θ_1:n−1 ∼ DP(α + n − 1, (αH + ∑_{i=1}^{n−1} δ_{θ_i})/(α + n − 1))
⇐⇒ θ_n | θ_1:n−1 ∼ (αH + ∑_{i=1}^{n−1} δ_{θ_i})/(α + n − 1),  G | θ_1:n ∼ DP(α + n, (αH + ∑_{i=1}^{n} δ_{θ_i})/(α + n))
113 / 127
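The Pólya urn predictive rule above is easy to simulate: each new draw is fresh from H with probability α/(α + n − 1), and otherwise repeats a uniformly chosen earlier draw. An illustrative Python sketch (the function name `polya_urn` and the standard-normal base measure in the usage line are my own choices):

```python
import numpy as np

def polya_urn(n, alpha, base_sampler, rng=None):
    """Marginal draws theta_1..theta_n from DP(alpha, H) via the Polya
    urn: theta_i is a fresh draw from H with probability
    alpha / (alpha + i - 1), else a uniformly chosen earlier draw."""
    rng = np.random.default_rng(rng)
    thetas = []
    for i in range(n):
        if rng.random() < alpha / (alpha + i):
            thetas.append(base_sampler(rng))        # new atom from H
        else:
            thetas.append(thetas[rng.integers(i)])  # reuse an old draw
    return thetas

# Usage: a DP(2, N(0,1)) sample of size 1000; the number of distinct
# values grows only logarithmically in n.
thetas = polya_urn(1000, 2.0, lambda r: float(r.standard_normal()), rng=0)
```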
Stick-breaking Construction
I Returning to the posterior process:

G ∼ DP(α, H)
θ | G ∼ G
⇔
θ ∼ H
G | θ ∼ DP(α + 1, (αH + δ_θ)/(α + 1))

I Consider a partition ({θ}, Θ\{θ}) of Θ. We have:

(G({θ}), G(Θ\{θ})) | θ ∼ Dirichlet((α + 1) (αH + δ_θ)/(α + 1) ({θ}), (α + 1) (αH + δ_θ)/(α + 1) (Θ\{θ}))
= Dirichlet(1, α)

I So G has a point mass located at θ:

G = βδ_θ + (1 − β)G′  with β ∼ Beta(1, α)

and G′ is the (renormalized) probability measure with the point mass removed.
I What is G′?
114 / 127
Stick-breaking Construction
I Currently, we have:

G ∼ DP(α, H)
θ ∼ G
⇒
θ ∼ H
G | θ ∼ DP(α + 1, (αH + δ_θ)/(α + 1))
G = βδ_θ + (1 − β)G′
β ∼ Beta(1, α)

I Consider a further partition ({θ}, A_1, . . . , A_K) of Θ:

(G({θ}), G(A_1), . . . , G(A_K))
= (β, (1 − β)G′(A_1), . . . , (1 − β)G′(A_K))
∼ Dirichlet(1, αH(A_1), . . . , αH(A_K))

I The agglomerative/decimative property of the Dirichlet distribution implies:

(G′(A_1), . . . , G′(A_K)) | θ ∼ Dirichlet(αH(A_1), . . . , αH(A_K))
G′ ∼ DP(α, H)
115 / 127
Stick-breaking Construction
I We have:

G ∼ DP(α, H)
G = β_1 δ_{θ*_1} + (1 − β_1)G_1
G = β_1 δ_{θ*_1} + (1 − β_1)(β_2 δ_{θ*_2} + (1 − β_2)G_2)
...
G = ∑_{k=1}^∞ π_k δ_{θ*_k}

where

π_k = β_k ∏_{i=1}^{k−1} (1 − β_i)    β_k ∼ Beta(1, α)    θ*_k ∼ H

[Figure: a unit stick broken into lengths π(1), π(2), . . . , π(6), . . .]
116 / 127
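The stick-breaking recursion truncates naturally for simulation: since the remaining stick length shrinks geometrically, a few hundred sticks capture essentially all the mass. An illustrative Python sketch (the function name `stick_breaking` and the truncation level are my own choices):

```python
import numpy as np

def stick_breaking(alpha, base_sampler, truncation=500, rng=None):
    """Truncated stick-breaking sketch of a draw from DP(alpha, H):
    pi_k = beta_k * prod_{i<k} (1 - beta_i),  beta_k ~ Beta(1, alpha)."""
    rng = np.random.default_rng(rng)
    betas = rng.beta(1.0, alpha, size=truncation)
    # length of stick remaining before the k-th break
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    weights = betas * remaining
    atoms = np.array([base_sampler(rng) for _ in range(truncation)])
    return weights, atoms

# Usage: an (approximate) draw from DP(2, N(0,1)).
weights, atoms = stick_breaking(2.0, lambda r: float(r.standard_normal()), rng=0)
```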
Outline
Relating Different Representations of Dirichlet Processes
Representations of Hierarchical Dirichlet Processes
117 / 127
Stick-breaking Construction
I We shall assume the following HDP hierarchy:

G_0 ∼ DP(γ, H)
G_j | G_0 ∼ DP(α, G_0)  for j = 1, . . . , J

I The stick-breaking construction for the HDP is:

G_0 = ∑_{k=1}^∞ π_0k δ_{θ*_k}    θ*_k ∼ H
π_0k = β_0k ∏_{l=1}^{k−1} (1 − β_0l)    β_0k ∼ Beta(1, γ)
G_j = ∑_{k=1}^∞ π_jk δ_{θ*_k}
π_jk = β_jk ∏_{l=1}^{k−1} (1 − β_jl)    β_jk ∼ Beta(απ_0k, α(1 − ∑_{l=1}^{k} π_0l))
118 / 127
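The HDP stick-breaking construction can likewise be simulated by truncation: global weights π_0 from Beta(1, γ) sticks, then group-level sticks β_jk ∼ Beta(απ_0k, α(1 − ∑_{l≤k} π_0l)) so that E[π_j | π_0] = π_0. A hedged Python sketch (the function name `hdp_sticks` and the numerical floor on the tail mass are my own):

```python
import numpy as np

def hdp_sticks(gamma, alpha, n_groups, truncation=100, rng=None):
    """Truncated stick-breaking sketch for the HDP: global weights pi0
    from Beta(1, gamma) sticks; group weights pi_j from sticks
    beta_jk ~ Beta(alpha * pi0_k, alpha * (1 - sum_{l<=k} pi0_l)),
    so that E[pi_j | pi0] = pi0."""
    rng = np.random.default_rng(rng)
    beta0 = rng.beta(1.0, gamma, size=truncation)
    pi0 = beta0 * np.concatenate(([1.0], np.cumprod(1.0 - beta0[:-1])))
    tail = 1.0 - np.cumsum(pi0)            # 1 - sum_{l<=k} pi0_l
    tail = np.clip(tail, 1e-12, None)      # guard against FP round-off
    groups = []
    for _ in range(n_groups):
        betaj = rng.beta(alpha * pi0, alpha * tail)
        pij = betaj * np.concatenate(([1.0], np.cumprod(1.0 - betaj[:-1])))
        groups.append(pij)
    return pi0, groups

# Usage: three groups sharing the same countable set of atoms.
pi0, groups = hdp_sticks(gamma=5.0, alpha=10.0, n_groups=3, rng=0)
```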
Hierarchical Pólya Urn Scheme
I Let G ∼ DP(α, H).
I We can visualize the Pólya urn scheme as follows:

[Figure: draws θ_1, θ_2, . . . , θ_7, . . . with arrows to the atoms θ*_1, . . . , θ*_6, . . . to which they were assigned.]

where the arrows denote to which θ*_k each θ_i was assigned, and

θ_1, θ_2, . . . ∼ G i.i.d.
θ*_1, θ*_2, . . . ∼ H i.i.d.

(but θ_1, θ_2, . . . are not independent of θ*_1, θ*_2, . . .).
119 / 127
Hierarchical Pólya Urn Scheme
I Let G_0 ∼ DP(γ, H) and G_1, G_2 | G_0 ∼ DP(α, G_0).
I The hierarchical Pólya urn scheme to generate draws from G_1, G_2:

[Figure: draws θ_11, θ_12, . . . and θ_21, θ_22, . . . from the two group-level urns, each assigned via arrows to shared base-level atoms θ*_01, θ*_02, . . . , θ*_06, . . . drawn from H.]
120 / 127
Chinese Restaurant Franchise
I Let G_0 ∼ DP(γ, H) and G_1, G_2 | G_0 ∼ DP(α, G_0).
I The Chinese restaurant franchise describes the clustering of data items in the hierarchy:

[Figure: two restaurants with customers 1–7 seated at tables; each table is labelled by a dish A–G drawn from a franchise-wide menu shared between the restaurants.]
121 / 127
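The Chinese restaurant franchise is straightforward to simulate: a CRP(α) over tables within each restaurant, and a franchise-level CRP(γ) over dishes whenever a new table is opened, so dishes (atoms of G_0) are shared across restaurants. An illustrative Python sketch (the function name `crf` is my own):

```python
import numpy as np

def crf(group_sizes, alpha, gamma, rng=None):
    """Illustrative Chinese restaurant franchise sampler: within each
    restaurant j, customers pick tables by a CRP(alpha); each new table
    orders a dish from a franchise-wide CRP(gamma), so dishes (atoms of
    G0) are shared across restaurants."""
    rng = np.random.default_rng(rng)
    dish_counts = []   # m_k: number of tables (across groups) serving dish k
    assignments = []   # per group: the dish index of each customer
    for n_j in group_sizes:
        table_counts, table_dish, dishes = [], [], []
        for _ in range(n_j):
            # sit at an existing table with prob. prop. to its occupancy,
            # or open a new table with prob. prop. to alpha
            p = np.array(table_counts + [alpha])
            t = rng.choice(len(p), p=p / p.sum())
            if t == len(table_counts):
                # new table orders a dish: an existing dish with prob.
                # prop. to m_k, or a brand-new dish with prob. prop. to gamma
                q = np.array(dish_counts + [gamma])
                k = rng.choice(len(q), p=q / q.sum())
                if k == len(dish_counts):
                    dish_counts.append(0)
                dish_counts[k] += 1
                table_counts.append(0)
                table_dish.append(k)
            table_counts[t] += 1
            dishes.append(table_dish[t])
        assignments.append(dishes)
    return assignments, dish_counts

# Usage: two restaurants of 30 customers each, sharing one menu.
assignments, dish_counts = crf([30, 30], alpha=1.0, gamma=1.0, rng=0)
```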
References I
I Adams, R. P., Wallach, H. M., and Ghahramani, Z. (2010). Learning the structure of deep sparse graphical models. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics.
I Bart, E., Porteous, I., Perona, P., and Welling, M. (2008). Unsupervised learning of visual taxonomies. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
I Beal, M. J., Ghahramani, Z., and Rasmussen, C. E. (2002). The infinite hidden Markov model. In Advances in Neural Information Processing Systems, volume 14.
I Bertoin, J. (2006). Random Fragmentation and Coagulation Processes. Cambridge University Press.
I Blackwell, D. and MacQueen, J. B. (1973). Ferguson distributions via Pólya urn schemes. Annals of Statistics, 1:353–355.
I Blei, D. M., Griffiths, T. L., and Jordan, M. I. (2010). The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM, 57(2):1–30.
I Chu, W., Ghahramani, Z., Krause, R., and Wild, D. L. (2006). Identifying protein complexes in high-throughput protein interaction screens using an infinite latent feature model. In BIOCOMPUTING: Proceedings of the Pacific Symposium.
I Doucet, A., de Freitas, N., and Gordon, N. J. (2001). Sequential Monte Carlo Methods in Practice. Statistics for Engineering and Information Science. New York: Springer-Verlag.
I Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1(2):209–230.
I Finkel, J. R., Grenager, T., and Manning, C. D. (2007). The infinite tree. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.
I Fox, E., Sudderth, E., Jordan, M. I., and Willsky, A. (2008). An HDP-HMM for systems with state persistence. In Proceedings of the International Conference on Machine Learning.
122 / 127
References II
I Fox, E. B., Sudderth, E. B., Jordan, M. I., and Willsky, A. S. (2010). Sharing features among dynamical systems with beta processes. In Neural Information Processing Systems 22. MIT Press.
I Gasthaus, J., Wood, F., and Teh, Y. W. (2010). Lossless compression based on the sequence memoizer. In Data Compression Conference.
I Gelman, A., Carlin, J., Stern, H., and Rubin, D. (1995). Bayesian Data Analysis. Chapman & Hall, London.
I Ghahramani, Z., Griffiths, T. L., and Sollich, P. (2007). Bayesian nonparametric latent feature models (with discussion and rejoinder). In Bayesian Statistics, volume 8.
I Goldwater, S., Griffiths, T., and Johnson, M. (2006a). Interpolating between types and tokens by estimating power-law generators. In Advances in Neural Information Processing Systems, volume 18.
I Goldwater, S., Griffiths, T. L., and Johnson, M. (2006b). Contextual dependencies in unsupervised word segmentation. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics.
I Görür, D., Jäkel, F., and Rasmussen, C. E. (2006). A choice model with infinitely many latent features. In Proceedings of the International Conference on Machine Learning, volume 23.
I Griffiths, T. L. and Ghahramani, Z. (2006). Infinite latent feature models and the Indian buffet process. In Advances in Neural Information Processing Systems, volume 18.
I Heller, K. A., Teh, Y. W., and Görür, D. (2009). Infinite hierarchical hidden Markov models. In JMLR Workshop and Conference Proceedings: AISTATS 2009, volume 5, pages 224–231.
I Hjort, N. L. (1990). Nonparametric Bayes estimators based on beta processes in models for life history data. Annals of Statistics, 18(3):1259–1294.
123 / 127
References III
I Ishwaran, H. and James, L. F. (2001). Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96(453):161–173.
I Johnson, M., Griffiths, T. L., and Goldwater, S. (2007). Adaptor grammars: A framework for specifying compositional nonparametric Bayesian models. In Advances in Neural Information Processing Systems, volume 19.
I Jordan, M. I. (2010). Hierarchical models, nested models and completely random measures. In Frontiers of Statistical Decision Making and Bayesian Analysis: In Honor of James O. Berger. New York: Springer.
I Knowles, D. and Ghahramani, Z. (2007). Infinite sparse factor analysis and infinite independent components analysis. In International Conference on Independent Component Analysis and Signal Separation, volume 7 of Lecture Notes in Computer Science. Springer.
I Liang, P., Petrov, S., Jordan, M. I., and Klein, D. (2007). The infinite PCFG using hierarchical Dirichlet processes. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
I Lijoi, A. and Pruenster, I. (2010). Models beyond the Dirichlet process. In Hjort, N., Holmes, C., Müller, P., and Walker, S., editors, Bayesian Nonparametrics. Cambridge University Press.
I MacKay, D. and Peto, L. (1994). A hierarchical Dirichlet language model. Natural Language Engineering.
I Meeds, E., Ghahramani, Z., Neal, R. M., and Roweis, S. T. (2007). Modeling dyadic data with binary latent factors. In Advances in Neural Information Processing Systems, volume 19.
I Neal, R. M. (1993). Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto.
I Neal, R. M. (2003). Density modeling and clustering using Dirichlet diffusion trees. In Bayesian Statistics, volume 7, pages 619–629.
124 / 127
References IV
I Perman, M., Pitman, J., and Yor, M. (1992). Size-biased sampling of Poisson point processes and excursions. Probability Theory and Related Fields, 92(1):21–39.
I Pitman, J. and Yor, M. (1997). The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. Annals of Probability, 25:855–900.
I Rasmussen, C. E. (2000). The infinite Gaussian mixture model. In Advances in Neural Information Processing Systems, volume 12.
I Rasmussen, C. E. and Ghahramani, Z. (2001). Occam's razor. In Advances in Neural Information Processing Systems, volume 13.
I Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press.
I Robert, C. P. and Casella, G. (2004). Monte Carlo Statistical Methods. Springer-Verlag.
I Rodríguez, A., Dunson, D. B., and Gelfand, A. E. (2008). The nested Dirichlet process. Journal of the American Statistical Association, 103(483):1131–1154.
I Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica, 4:639–650.
I Sudderth, E. and Jordan, M. I. (2009). Shared segmentation of natural scenes using dependent Pitman-Yor processes. In Advances in Neural Information Processing Systems, volume 21.
I Sudderth, E., Torralba, A., Freeman, W., and Willsky, A. (2008). Describing visual scenes using transformed objects and parts. International Journal of Computer Vision, 77.
I Teh, Y. W. (2006a). A Bayesian interpretation of interpolated Kneser-Ney. Technical Report TRA2/06, School of Computing, National University of Singapore.
125 / 127
References V
I Teh, Y. W. (2006b). A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 985–992.
I Teh, Y. W., Daume III, H., and Roy, D. M. (2008). Bayesian agglomerative clustering with coalescents. In Advances in Neural Information Processing Systems, volume 20, pages 1473–1480.
I Teh, Y. W. and Görür, D. (2009). Indian buffet processes with power-law behavior. In Advances in Neural Information Processing Systems, volume 22, pages 1838–1846.
I Teh, Y. W., Görür, D., and Ghahramani, Z. (2007). Stick-breaking construction for the Indian buffet process. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 11.
I Teh, Y. W. and Jordan, M. I. (2010). Hierarchical Bayesian nonparametric models with applications. In Hjort, N., Holmes, C., Müller, P., and Walker, S., editors, Bayesian Nonparametrics. Cambridge University Press.
I Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581.
I Thibaux, R. and Jordan, M. I. (2007). Hierarchical beta processes and the Indian buffet process. In Proceedings of the International Workshop on Artificial Intelligence and Statistics, volume 11, pages 564–571.
I Van Gael, J., Teh, Y. W., and Ghahramani, Z. (2009). The infinite factorial hidden Markov model. In Advances in Neural Information Processing Systems, volume 21, pages 1697–1704.
I Wainwright, M. J. and Jordan, M. I. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305.
I Wood, F., Archambeau, C., Gasthaus, J., James, L. F., and Teh, Y. W. (2009). A stochastic memoizer for sequence data. In Proceedings of the International Conference on Machine Learning, volume 26, pages 1129–1136.
126 / 127
References VI
I Wood, F., Griffiths, T. L., and Ghahramani, Z. (2006). A non-parametric Bayesian method for inferring hidden causes. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, volume 22.
127 / 127