Transcript of "Bayesian Nonparametrics" - University College London (ywteh/teaching/npbayes/kaist2010.pdf)

Page 1:

Bayesian Nonparametrics

Yee Whye Teh

Gatsby Computational Neuroscience Unit, University College London

Acknowledgements: Charles Blundell, Lloyd Elliott, Jan Gasthaus, Vinayak Rao, Dilan Görür, Andriy Mnih, Frank Wood, Cedric Archambeau, Hal Daume III, Zoubin Ghahramani, Katherine Heller, Lancelot James, Michael I. Jordan, Daniel Roy, Jurgen Van Gael, Max Welling

December, 2010 / KAIST, South Korea

1 / 127

Page 2:

Outline

Introduction

Regression and Gaussian Processes

Density Estimation, Clustering and Dirichlet Processes

Latent Variable Models and Indian Buffet and Beta Processes

Topic Modelling and Hierarchical Processes

Hierarchical Structure Discovery and Nested Processes

Time Series Models

Modelling Power-laws with Pitman-Yor Processes

Summary

2 / 127

Page 3:

Outline

(The outline from the previous slide is repeated; the next section is Introduction.)

3 / 127

Page 4:

Probabilistic Machine Learning

I Probabilistic model of data $\{x_i\}_{i=1}^n$ given parameters $\theta$:

  $P(x_1, \ldots, x_n, y_1, \ldots, y_n \mid \theta)$

  where $y_i$ is a latent variable associated with $x_i$.

I Often thought of as generative models of data.

I Inference, of latent variables given observations:

  $P(y_1, \ldots, y_n \mid \theta, x_1, \ldots, x_n) = \dfrac{P(x_1, \ldots, x_n, y_1, \ldots, y_n \mid \theta)}{P(x_1, \ldots, x_n \mid \theta)}$

I Learning, typically by maximum likelihood:

  $\theta^{ML} = \operatorname{argmax}_\theta\, P(x_1, \ldots, x_n \mid \theta)$

4 / 127

Page 5:

Probabilistic Machine Learning

I Prediction:

  $P(x_{n+1} \mid \theta^{ML})$
  $P(y_{n+1} \mid x_{n+1}, \theta^{ML})$
  $P(y_{n+1} \mid x_1, \ldots, x_{n+1}, \theta^{ML})$

I Classification:

  $\operatorname{argmax}_C\, P(x_{n+1} \mid \theta^{ML}_C)$

I Visualization, interpretation, summarization.

5 / 127

Page 6:

Bayesian Machine Learning

I Probabilistic model of data $\{x_i\}_{i=1}^n$ given parameters $\theta$:

  $P(x_1, \ldots, x_n, y_1, \ldots, y_n \mid \theta)$

I Prior distribution:

  $P(\theta)$

I Posterior distribution:

  $P(\theta, y_1, \ldots, y_n \mid x_1, \ldots, x_n) = \dfrac{P(\theta)\, P(x_1, \ldots, x_n, y_1, \ldots, y_n \mid \theta)}{P(x_1, \ldots, x_n)}$

I Prediction:

  $P(x_{n+1} \mid x_1, \ldots, x_n) = \int P(x_{n+1} \mid \theta)\, P(\theta \mid x_1, \ldots, x_n)\, d\theta$

I (Easier said than done...)

6 / 127

Page 7:

Computing Posterior Distributions

I Posterior distribution:

  $P(\theta, y_1, \ldots, y_n \mid x_1, \ldots, x_n) = \dfrac{P(\theta)\, P(x_1, \ldots, x_n, y_1, \ldots, y_n \mid \theta)}{P(x_1, \ldots, x_n)}$

I High-dimensional, no closed-form, multi-modal...

I Variational approximations [Wainwright and Jordan 2008]: simple parametrized form, "fit" to the true posterior. Includes mean field approximation, variational Bayes, belief propagation, expectation propagation.

I Monte Carlo methods, including Markov chain Monte Carlo [Neal 1993, Robert and Casella 2004] and sequential Monte Carlo [Doucet et al. 2001]: construct generators for random samples from the posterior.

7 / 127

Page 8:

Bayesian Model Selection

I Model selection is often necessary to prevent overfitting and underfitting.

I Bayesian approach to model selection uses the marginal likelihood (a naive Monte Carlo sketch follows below):

  $p(x \mid M_k) = \int p(x \mid \theta_k, M_k)\, p(\theta_k \mid M_k)\, d\theta_k$

  Model selection: $M^* = \operatorname{argmax}_{M_k}\, p(x \mid M_k)$

  Model averaging: $p(M_k, \theta_k \mid x) = \dfrac{p(M_k)\, p(\theta_k \mid M_k)\, p(x \mid \theta_k, M_k)}{\sum_{k'} p(M_{k'})\, p(x \mid M_{k'})}$

I Other approaches to model selection: cross validation, regularization, sparse models...
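To make the marginal likelihood concrete, here is a naive Monte Carlo sketch (not from the slides): it simply averages the likelihood over draws from the prior. The toy Gaussian model, its prior and the sample size are illustrative assumptions, and in practice more sophisticated estimators are usually needed.

```python
import numpy as np

def marginal_likelihood_mc(x, log_likelihood, prior_sampler, n_samples=10000,
                           rng=np.random.default_rng(0)):
    """Naive MC estimate of p(x|M) = E_{theta ~ p(theta|M)}[p(x|theta, M)]."""
    thetas = prior_sampler(n_samples, rng)
    log_liks = np.array([log_likelihood(x, th) for th in thetas])
    m = log_liks.max()                              # log-mean-exp for stability
    return m + np.log(np.mean(np.exp(log_liks - m)))

# Toy model M: x_i ~ N(theta, 1) with prior theta ~ N(0, 10).
x = np.array([0.5, 1.2, -0.3, 0.8])
log_lik = lambda x, th: -0.5 * np.sum((x - th) ** 2) - 0.5 * len(x) * np.log(2 * np.pi)
prior = lambda n, rng: rng.normal(0.0, np.sqrt(10.0), size=n)
print(marginal_likelihood_mc(x, log_lik, prior))
```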

8 / 127

Page 9:

Side-Stepping Model Selection

I Strategies for model selection often entail significant complexities.

I But reasonable and proper Bayesian methods should not overfit anyway [Rasmussen and Ghahramani 2001].

I Idea: use a large model, and be Bayesian so it will not overfit.

I Bayesian nonparametric idea: using a very large Bayesian model avoids both overfitting and underfitting.

9 / 127

Page 10:

Direct Modelling of Very Large Spaces

I Regression: learn about functions from an input to an output space.

I Density estimation: learn about densities over $\mathbb{R}^d$.

I Clustering: learn about partitions of a large space.

I Objects of interest are often infinite dimensional. Model these directly:

  I Using models that can learn any such object;
  I Using models that can approximate any such object to arbitrary accuracy.

I Many theoretical and practical issues to resolve:

  I Convergence and consistency.
  I Practical inference algorithms.

10 / 127

Page 11:

Outline

(The outline is repeated; the next section is Regression and Gaussian Processes.)

11 / 127

Page 12:

Regression and Classification

I Learn a function $f^* : \mathcal{X} \to \mathcal{Y}$ from training data $\{x_i, y_i\}_{i=1}^n$.

  [Figure: training data points.]

I Regression: $y_i = f^*(x_i) + \epsilon_i$.
I Classification: e.g. $P(y_i = 1 \mid f^*(x_i)) = \Phi(f^*(x_i))$.

12 / 127

Page 13:

Parametric Regression with Basis Functions

I Assume a set of basis functions $\phi_1, \ldots, \phi_K$ and parametrize a function:

  $f(x; w) = \sum_{k=1}^K w_k \phi_k(x)$

  Parameters $w = \{w_1, \ldots, w_K\}$.

I Find optimal parameters (a least-squares code sketch follows below):

  $\operatorname{argmin}_w \sum_{i=1}^n \big\| y_i - f(x_i; w) \big\|^2 = \operatorname{argmin}_w \sum_{i=1}^n \Big\| y_i - \sum_{k=1}^K w_k \phi_k(x_i) \Big\|^2$

I What family of basis functions to use?

I How many?

I What if the true function cannot be parametrized as such?
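A minimal sketch of the least-squares fit above, assuming a polynomial basis and toy data; both are illustrative choices, not from the slides.

```python
import numpy as np

def fit_basis_regression(x, y, basis_fns):
    """Least-squares fit of f(x; w) = sum_k w_k phi_k(x) to training data (x, y)."""
    Phi = np.column_stack([phi(x) for phi in basis_fns])   # n x K design matrix
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

# Toy usage with polynomial basis functions phi_k(x) = x^k.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = np.sin(2 * x) + 0.1 * rng.standard_normal(30)
basis = [lambda x, k=k: x ** k for k in range(4)]          # K = 4 basis functions
w = fit_basis_regression(x, y, basis)
```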

13 / 127

Page 14:

Towards Nonparametric Regression

I What we are interested in is the output values of the function,

  $f(x_1), f(x_2), \ldots, f(x_n), f(x_{n+1})$

  Why not model these directly?

  [Figure: training data points.]

I In regression, each $f(x_i)$ is continuous and real-valued, so a natural choice is to model $f(x_i)$ using a Gaussian.

I Assume that the function $f$ is smooth. If two inputs $x_i$ and $x_j$ are close by, then $f(x_i)$ and $f(x_j)$ should be close by as well. This translates into correlations among the outputs $f(x_i)$.

14 / 127

Page 15:

Towards Nonparametric Regression

I We can use a multi-dimensional Gaussian to model correlated function outputs:

  $\begin{bmatrix} f(x_1) \\ \vdots \\ f(x_{n+1}) \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} 0 \\ \vdots \\ 0 \end{bmatrix}, \begin{bmatrix} C_{1,1} & \cdots & C_{1,n+1} \\ \vdots & \ddots & \vdots \\ C_{n+1,1} & \cdots & C_{n+1,n+1} \end{bmatrix} \right)$

  where the mean is zero, and $C = [C_{ij}]$ is the covariance matrix.

I Each observed output $y_i$ can be modelled as

  $y_i \mid f(x_i) \sim \mathcal{N}(f(x_i), \sigma^2)$

I Learning: compute the posterior distribution

  $p(f(x_1), \ldots, f(x_n) \mid y_1, \ldots, y_n)$

  Straightforward since the whole model is Gaussian.

I Prediction: compute

  $p(f(x_{n+1}) \mid y_1, \ldots, y_n)$

15 / 127

Page 16:

Gaussian Processes

I A Gaussian process (GP) is a random function $f : \mathcal{X} \to \mathbb{R}$ such that for any finite set of input points $x_1, \ldots, x_n$,

  $\begin{bmatrix} f(x_1) \\ \vdots \\ f(x_n) \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} m(x_1) \\ \vdots \\ m(x_n) \end{bmatrix}, \begin{bmatrix} c(x_1, x_1) & \cdots & c(x_1, x_n) \\ \vdots & \ddots & \vdots \\ c(x_n, x_1) & \cdots & c(x_n, x_n) \end{bmatrix} \right)$

  where the parameters are the mean function $m(x)$ and covariance kernel $c(x, y)$.

I Difference from before: the GP defines a distribution over $f(x)$ for every input value $x$ simultaneously. The prior is defined even before observing the inputs $x_1, \ldots, x_n$.

I Such a random function $f$ is known as a stochastic process. It is a collection of random variables $\{f(x)\}_{x \in \mathcal{X}}$.

I Demo: GPgenerate.

[Rasmussen and Williams 2006]

16 / 127

Page 17:

Posterior and Predictive Distributions

I How do we compute the posterior and predictive distributions?

I Training set (x1, y1), (x2, y2), . . . , (xn, yn) and test input xn+1.

I Out of the (uncountably infinitely) many random variables $\{f(x)\}_{x \in \mathcal{X}}$ making up the GP, only $n + 1$ have to do with the data:

  $f(x_1), f(x_2), \ldots, f(x_{n+1})$

I Training data gives observations $f(x_1) = y_1, \ldots, f(x_n) = y_n$. The predictive distribution of $f(x_{n+1})$ is simply

  $p(f(x_{n+1}) \mid f(x_1) = y_1, \ldots, f(x_n) = y_n)$

  which is easy to compute since $f(x_1), \ldots, f(x_{n+1})$ is jointly Gaussian (a short code sketch follows below).
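The sketch referred to above: a minimal numpy implementation of the GP posterior predictive, assuming a squared-exponential covariance kernel and Gaussian observation noise as on the earlier slide. The kernel, its hyperparameters and the toy data are illustrative choices, not part of the slides.

```python
import numpy as np

def sq_exp_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance c(x, y) between two sets of 1-D inputs."""
    d = A[:, None] - B[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_predict(x_train, y_train, x_test, noise_var=0.1):
    """Posterior mean and covariance of f(x_test) given noisy observations y_train."""
    K = sq_exp_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    K_star = sq_exp_kernel(x_test, x_train)
    K_ss = sq_exp_kernel(x_test, x_test)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))   # K^{-1} y
    mean = K_star @ alpha
    v = np.linalg.solve(L, K_star.T)
    cov = K_ss - v.T @ v
    return mean, cov

# Toy usage: noisy observations of a sine function.
rng = np.random.default_rng(0)
x_train = np.linspace(-3, 3, 10)
y_train = np.sin(x_train) + 0.1 * rng.standard_normal(10)
x_test = np.linspace(-4, 4, 50)
mean, cov = gp_predict(x_train, y_train, x_test)
```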

17 / 127

Page 18:

Consistency and Existence

I The definition of Gaussian processes only gives the finite-dimensional marginal distributions of the stochastic process.

I Fortunately these marginal distributions are consistent.

  I For every finite set $\mathbf{x} \subset \mathcal{X}$ we have a distinct distribution $p_{\mathbf{x}}([f(x)]_{x \in \mathbf{x}})$. These distributions are said to be consistent if

    $p_{\mathbf{x}}([f(x)]_{x \in \mathbf{x}}) = \int p_{\mathbf{x} \cup \mathbf{y}}([f(x)]_{x \in \mathbf{x} \cup \mathbf{y}})\, d[f(x)]_{x \in \mathbf{y}}$

    for disjoint and finite $\mathbf{x}, \mathbf{y} \subset \mathcal{X}$.
  I The marginal distributions for the GP are consistent because Gaussians are closed under marginalization.

I The Kolmogorov Consistency Theorem guarantees existence of GPs, i.e. the whole stochastic process $\{f(x)\}_{x \in \mathcal{X}}$.

18 / 127

Page 19:

Outline

(The outline is repeated; the next section is Density Estimation, Clustering and Dirichlet Processes.)

19 / 127

Page 20:

Density Estimation with Mixture Models

I Unsupervised learning of a density f ∗(x) from training samples {xi}.

  [Figure: training samples along the real line.]

I Can use a mixture model for a flexible family of densities (a small code sketch follows below), e.g.

  $f(x) = \sum_{k=1}^K \pi_k\, \mathcal{N}(x; \mu_k, \Sigma_k)$

I How many mixture components to use?

I What family of mixture components?

I Do we believe that the true density is a mixture of K components?
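The sketch referred to above, restricted to 1-D components with standard deviations in place of full covariance matrices; the weights and component parameters are illustrative.

```python
import numpy as np
from scipy.stats import norm

def mixture_density(x, weights, means, stds):
    """Evaluate a 1-D Gaussian mixture f(x) = sum_k pi_k N(x; mu_k, sigma_k^2)."""
    return sum(w * norm.pdf(x, loc=m, scale=s)
               for w, m, s in zip(weights, means, stds))

# Illustrative K = 3 mixture; the weights must sum to one.
x = np.linspace(-15, 15, 500)
f = mixture_density(x, weights=[0.5, 0.3, 0.2], means=[-5.0, 0.0, 6.0], stds=[1.5, 1.0, 2.0])
```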

20 / 127

Page 21:

Bayesian Mixture Models

I Let's be Bayesian about mixture models, and place priors over our parameters (and compute posteriors).

I First, introduce conjugate priors for the parameters:

  $\pi \sim \text{Dirichlet}(\tfrac{\alpha}{K}, \ldots, \tfrac{\alpha}{K})$
  $\mu_k, \Sigma_k = \theta^*_k \sim H = \mathcal{N}\text{-IW}(0, s, d, \Phi)$

I Second, introduce an indicator variable $z_i$ specifying which component $x_i$ belongs to:

  $z_i \mid \pi \sim \text{Multinomial}(\pi)$
  $x_i \mid z_i = k, \mu, \Sigma \sim \mathcal{N}(\mu_k, \Sigma_k)$

  [Graphical model: $\alpha \to \pi \to z_i \to x_i \leftarrow \theta^*_k \leftarrow H$, with plates $i = 1, \ldots, n$ and $k = 1, \ldots, K$.]

21 / 127

Page 22:

Gibbs Sampling for Bayesian Mixture Models

I All conditional distributions are simple to compute:

  $p(z_i = k \mid \text{others}) \propto \pi_k\, \mathcal{N}(x_i; \mu_k, \Sigma_k)$
  $\pi \mid z \sim \text{Dirichlet}(\tfrac{\alpha}{K} + n_1(z), \ldots, \tfrac{\alpha}{K} + n_K(z))$
  $\mu_k, \Sigma_k \mid \text{others} \sim \mathcal{N}\text{-IW}(\nu', s', d', \Phi')$

I Not as efficient as collapsed Gibbs sampling, which integrates out $\pi, \mu, \Sigma$:

  $p(z_i = k \mid \text{others}) \propto \dfrac{\tfrac{\alpha}{K} + n_k(z_{-i})}{\alpha + n - 1} \times p(x_i \mid \{x_{i'} : i' \neq i,\ z_{i'} = k\})$

I Demo: fm_demointeractive.

  [Graphical model: as on the previous slide.]

22 / 127

Page 23:

Infinite Bayesian Mixture Models

I We will take $K \to \infty$.

I Imagine a very large value of $K$.

I There are at most $n < K$ occupied components, so most components are empty. We can lump these empty components together (a code sketch follows below):

  Occupied clusters: $p(z_i = k \mid \text{others}) \propto \dfrac{\tfrac{\alpha}{K} + n_k(z_{-i})}{n - 1 + \alpha}\, p(x_i \mid x^{-i}_k)$

  Empty clusters: $p(z_i = k_{\text{empty}} \mid z_{-i}) \propto \dfrac{\alpha \tfrac{K - K^*}{K}}{n - 1 + \alpha}\, p(x_i \mid \{\})$

I Demo: dpm_demointeractive.

[Rasmussen 2000]

  [Graphical model: as before, with $k = 1, \ldots, K$.]
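The sketch referred to above: one sweep of the $K \to \infty$ collapsed Gibbs step. It assumes a 1-D Gaussian likelihood with known variance and a conjugate Normal prior on each cluster mean, rather than the Normal-inverse-Wishart prior on the slides, so that the predictive densities have a simple closed form; all hyperparameters and data are illustrative.

```python
import numpy as np

def predictive_logpdf(x, cluster_xs, sigma2=1.0, mu0=0.0, tau0_2=10.0):
    """Posterior predictive density of x for a Gaussian likelihood with known variance
    sigma2 and a conjugate Normal(mu0, tau0_2) prior on the cluster mean."""
    m = len(cluster_xs)
    prec = 1.0 / tau0_2 + m / sigma2
    mu_n = (mu0 / tau0_2 + np.sum(cluster_xs) / sigma2) / prec
    var = 1.0 / prec + sigma2
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (x - mu_n) ** 2 / var

def gibbs_sweep(x, z, alpha=1.0, rng=np.random.default_rng(0)):
    """One sweep of collapsed Gibbs sampling over cluster assignments z (K -> infinity limit)."""
    n = len(x)
    for i in range(n):
        z[i] = -1                                      # remove x_i from its cluster
        labels = sorted(k for k in set(z) if k >= 0)
        logp = []
        for k in labels:                               # occupied clusters: weight n_k
            members = x[np.array(z) == k]
            logp.append(np.log(len(members)) + predictive_logpdf(x[i], members))
        logp.append(np.log(alpha) + predictive_logpdf(x[i], np.array([])))  # new cluster
        logp = np.array(logp)
        p = np.exp(logp - logp.max()); p /= p.sum()
        choice = rng.choice(len(p), p=p)
        z[i] = labels[choice] if choice < len(labels) else (max(labels, default=-1) + 1)
    return z

# Toy usage: two well-separated Gaussian clusters.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-4, 1, 30), rng.normal(3, 1, 30)])
z = [0] * len(x)
for _ in range(20):
    z = gibbs_sweep(x, z, alpha=1.0)
```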

23 / 127

Page 24:

(Same slide as above, with the plate in the graphical model now ranging over $k = 1, \ldots, \infty$ instead of $k = 1, \ldots, K$.)

23 / 127

Page 25:

Density Estimation

  [Figure: posterior density estimates on a 1-D data set.]

  $F(\cdot \mid \mu, \Sigma)$ is Gaussian with mean $\mu$, covariance $\Sigma$. $H(\mu, \Sigma)$ is the Gaussian-inverse-Wishart conjugate prior. Red: mean density. Blue: median density. Grey: 5-95 quantile. Others: posterior samples. Black: data points.

24 / 127

Page 26:

Density Estimation

  [Figure: posterior density estimates on another data set; same model and legend as on the previous slide.]

24 / 127

Page 27:

Infinite Bayesian Mixture Models

I The naive infinite limit of finite mixture models does not actually make mathematical sense.

I Other, better ways of making this infinite limit precise:

  I Look at the prior clustering structure induced by the Dirichlet prior over mixing proportions (the Chinese restaurant process).
  I Re-order components so that those with larger mixing proportions tend to occur first, before taking the infinite limit (the stick-breaking construction).

I Both are different views of the Dirichlet process (DP).

I The K →∞ Gibbs sampler is for DP mixture models.

25 / 127

Page 28:

A Tiny Bit of Measure Theoretic Probability Theory

I A σ-algebra $\Sigma$ is a family of subsets of a set $\Theta$ such that

  I $\Sigma$ is not empty;
  I If $A \in \Sigma$ then $\Theta \setminus A \in \Sigma$;
  I If $A_1, A_2, \ldots \in \Sigma$ then $\cup_{i=1}^\infty A_i \in \Sigma$.

I $(\Theta, \Sigma)$ is a measurable space and the $A \in \Sigma$ are the measurable sets.

I A measure $\mu$ over $(\Theta, \Sigma)$ is a function $\mu : \Sigma \to [0, \infty]$ such that

  I $\mu(\emptyset) = 0$;
  I If $A_1, A_2, \ldots \in \Sigma$ are disjoint then $\mu(\cup_{i=1}^\infty A_i) = \sum_{i=1}^\infty \mu(A_i)$.

I Everything we consider here will be measurable.
I A probability measure is one where $\mu(\Theta) = 1$.

I Given two measurable spaces $(\Theta, \Sigma)$ and $(\Delta, \Phi)$, a function $f : \Theta \to \Delta$ is measurable if $f^{-1}(A) \in \Sigma$ for every $A \in \Phi$.

26 / 127

Page 29:

A Tiny Bit of Measure Theoretic Probability Theory

I If $p$ is a probability measure on $(\Theta, \Sigma)$, a random variable $X$ taking values in $\Delta$ is simply a measurable function $X : \Theta \to \Delta$.

  I Think of the probability space $(\Theta, \Sigma, p)$ as a black-box random number generator, and $X$ as a function taking random samples in $\Theta$ and producing random samples in $\Delta$.
  I The probability of an event $A \in \Phi$ is $p(X \in A) = p(X^{-1}(A))$.

I A stochastic process is simply a collection of random variables $\{X_i\}_{i \in I}$ over the same measurable space $(\Theta, \Sigma)$, where $I$ is an index set.

  I Can think of a stochastic process as a random function $X(i)$.

I Stochastic processes form the core of many Bayesian nonparametric models.

  I Gaussian processes, Poisson processes, Dirichlet processes, beta processes, completely random measures...

27 / 127

Page 30:

Dirichlet Distributions

I A Dirichlet distribution is a distribution over the $K$-dimensional probability simplex:

  $\Delta_K = \{(\pi_1, \ldots, \pi_K) : \pi_k \geq 0,\ \sum_k \pi_k = 1\}$

I We say $(\pi_1, \ldots, \pi_K)$ is Dirichlet distributed,

  $(\pi_1, \ldots, \pi_K) \sim \text{Dirichlet}(\lambda_1, \ldots, \lambda_K)$

  with parameters $(\lambda_1, \ldots, \lambda_K)$, if

  $p(\pi_1, \ldots, \pi_K) = \dfrac{\Gamma(\sum_k \lambda_k)}{\prod_k \Gamma(\lambda_k)} \prod_{k=1}^K \pi_k^{\lambda_k - 1}$

I Equivalent to normalizing a set of independent gamma variables (a code sketch follows below):

  $(\pi_1, \ldots, \pi_K) = \dfrac{1}{\sum_k \gamma_k}(\gamma_1, \ldots, \gamma_K)$, where $\gamma_k \sim \text{Gamma}(\lambda_k)$ for $k = 1, \ldots, K$.
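The sketch referred to above: sampling a Dirichlet vector by normalizing independent gamma draws (the parameter vector is illustrative).

```python
import numpy as np

def dirichlet_via_gammas(lam, rng=np.random.default_rng(0)):
    """Sample (pi_1,...,pi_K) ~ Dirichlet(lam) by normalizing independent Gamma(lam_k, 1) draws."""
    gammas = rng.gamma(shape=np.asarray(lam, dtype=float), scale=1.0)
    return gammas / gammas.sum()

pi = dirichlet_via_gammas([1.0, 2.0, 5.0])
```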

28 / 127

Page 31:

Dirichlet Distributions

29 / 127

Page 32:

Dirichlet Processes

I A Dirichlet Process (DP) is a random probability measure $G$ over $(\Theta, \Sigma)$ such that for any finite measurable partition $A_1 \cup \ldots \cup A_K = \Theta$,

  $(G(A_1), \ldots, G(A_K)) \sim \text{Dirichlet}(\lambda(A_1), \ldots, \lambda(A_K))$

  where $\lambda$ is a base measure.

  [Figure: a partition of $\Theta$ into regions $A_1, \ldots, A_6$.]

I The above family of distributions is consistent (next), and the Kolmogorov Consistency Theorem can be applied to show existence (but there are technical conditions restricting the generality of the definition).

[Ferguson 1973, Blackwell and MacQueen 1973]

30 / 127

Page 33:

Consistency of Dirichlet Marginals

I If we have two partitions $(A_1, \ldots, A_K)$ and $(B_1, \ldots, B_J)$ of $\Theta$, how do we see if the two Dirichlets are consistent?

I Because Dirichlet variables are normalized gamma variables and sums of gammas are gammas, if $(I_1, \ldots, I_j)$ is a partition of $(1, \ldots, K)$,

  $\left( \sum_{i \in I_1} \pi_i, \ldots, \sum_{i \in I_j} \pi_i \right) \sim \text{Dirichlet}\left( \sum_{i \in I_1} \lambda_i, \ldots, \sum_{i \in I_j} \lambda_i \right)$

31 / 127

Page 34:

Consistency of Dirichlet Marginals

I Form the common refinement $(C_1, \ldots, C_L)$ where each $C_\ell$ is the intersection of some $A_k$ with some $B_j$. Then:

  By definition, $(G(C_1), \ldots, G(C_L)) \sim \text{Dirichlet}(\lambda(C_1), \ldots, \lambda(C_L))$

  $(G(A_1), \ldots, G(A_K)) = \left( \sum_{C_\ell \subset A_1} G(C_\ell), \ldots, \sum_{C_\ell \subset A_K} G(C_\ell) \right) \sim \text{Dirichlet}(\lambda(A_1), \ldots, \lambda(A_K))$

  Similarly, $(G(B_1), \ldots, G(B_J)) \sim \text{Dirichlet}(\lambda(B_1), \ldots, \lambda(B_J))$

  so the distributions of $(G(A_1), \ldots, G(A_K))$ and $(G(B_1), \ldots, G(B_J))$ are consistent.

I Demonstration: DPgenerate.

32 / 127

Page 35:

Parameters of Dirichlet Processes

I Usually we split the base measure $\lambda$ into two parameters, $\lambda = \alpha H$:

  I Base distribution $H$, which is like the mean of the DP.
  I Strength parameter $\alpha$, which is like an inverse-variance of the DP.

I We write:

  $G \sim \text{DP}(\alpha, H)$

  if for any partition $(A_1, \ldots, A_K)$ of $\Theta$:

  $(G(A_1), \ldots, G(A_K)) \sim \text{Dirichlet}(\alpha H(A_1), \ldots, \alpha H(A_K))$

I The first and second moments of the DP:

  Expectation: $\mathbb{E}[G(A)] = H(A)$
  Variance: $\mathbb{V}[G(A)] = \dfrac{H(A)(1 - H(A))}{\alpha + 1}$

  where $A$ is any measurable subset of $\Theta$.

33 / 127

Page 36:

Representations of Dirichlet Processes

I Draws from Dirichlet processes will always place all their mass on a countable set of points:

  $G = \sum_{k=1}^\infty \pi_k \delta_{\theta^*_k}$

  where $\sum_k \pi_k = 1$ and $\theta^*_k \in \Theta$.

I What is the joint distribution over $\pi_1, \pi_2, \ldots$ and $\theta^*_1, \theta^*_2, \ldots$?

I Since $G$ is a (random) probability measure over $\Theta$, we can treat it as a distribution and draw samples from it. Let

  $\theta_1, \theta_2, \ldots \sim G$

  be random variables with distribution $G$.

I Can we describe $G$ by describing its effect on $\theta_1, \theta_2, \ldots$?
I What is the marginal distribution of $\theta_1, \theta_2, \ldots$ with $G$ integrated out?

34 / 127

Page 37:

Stick-breaking Construction

  $G = \sum_{k=1}^\infty \pi_k \delta_{\theta^*_k}$

I There is a simple construction giving the joint distribution of $\pi_1, \pi_2, \ldots$ and $\theta^*_1, \theta^*_2, \ldots$ called the stick-breaking construction (a code sketch follows below):

  $\theta^*_k \sim H$
  $v_k \sim \text{Beta}(1, \alpha)$
  $\pi_k = v_k \prod_{i=1}^{k-1} (1 - v_i)$

  [Figure: a unit-length stick broken into pieces $\pi_{(1)}, \pi_{(2)}, \ldots, \pi_{(6)}, \ldots$]

I Also known as the GEM distribution; write $\pi \sim \text{GEM}(\alpha)$.
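The sketch referred to above: a truncated draw from the stick-breaking construction, assuming a standard normal base distribution $H$ and a fixed truncation level (both illustrative).

```python
import numpy as np

def dp_stick_breaking(alpha, base_sampler, truncation=1000, rng=np.random.default_rng(0)):
    """Approximate (truncated) draw G = sum_k pi_k delta_{theta_k} from DP(alpha, H)."""
    v = rng.beta(1.0, alpha, size=truncation)
    pi = v * np.cumprod(np.concatenate(([1.0], 1.0 - v[:-1])))   # pi_k = v_k prod_{i<k}(1-v_i)
    atoms = base_sampler(truncation, rng)
    return pi, atoms

# Usage with H = N(0, 1) as the base distribution.
pi, atoms = dp_stick_breaking(alpha=5.0, base_sampler=lambda n, rng: rng.standard_normal(n))
```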

[Sethuraman 1994]

35 / 127

Page 38:

Posterior of Dirichlet Processes

I Since $G$ is a probability measure, we can draw samples from it:

  $G \sim \text{DP}(\alpha, H)$
  $\theta_1, \ldots, \theta_n \mid G \sim G$

  What is the posterior of $G$ given observations of $\theta_1, \ldots, \theta_n$?

I The usual Dirichlet-multinomial conjugacy carries over to the nonparametric DP as well:

  $G \mid \theta_1, \ldots, \theta_n \sim \text{DP}\left( \alpha + n,\ \dfrac{\alpha H + \sum_{i=1}^n \delta_{\theta_i}}{\alpha + n} \right)$

36 / 127

Page 39:

Pólya Urn Scheme

  $G \sim \text{DP}(\alpha, H)$
  $\theta_1, \ldots, \theta_n \mid G \sim G$

I The marginal distribution of $\theta_1, \theta_2, \ldots$ has a simple generative process called the Pólya urn scheme (aka the Blackwell-MacQueen urn scheme):

  $\theta_n \mid \theta_{1:n-1} \sim \dfrac{\alpha H + \sum_{i=1}^{n-1} \delta_{\theta_i}}{\alpha + n - 1}$

I Picking balls of different colors from an urn:

  I Start with no balls in the urn.
  I With probability $\propto \alpha$, draw $\theta_n \sim H$, and add a ball of color $\theta_n$ into the urn.
  I With probability $\propto n - 1$, pick a ball at random from the urn, record $\theta_n$ to be its color, and return two balls of color $\theta_n$ into the urn.

[Blackwell and MacQueen 1973]

37 / 127

Page 40:

Chinese Restaurant Process

I $\theta_1, \ldots, \theta_n$ take on $K \leq n$ distinct values, say $\theta^*_1, \ldots, \theta^*_K$.

I This defines a partition of $(1, \ldots, n)$ into $K$ clusters, such that if $i$ is in cluster $k$, then $\theta_i = \theta^*_k$.

I The distribution over partitions is a Chinese restaurant process (CRP).

I Generating from the CRP (a code sketch follows below):

  I First customer sits at the first table.
  I Customer $n$ sits at:
    I Table $k$ with probability $\frac{n_k}{\alpha + n - 1}$, where $n_k$ is the number of customers at table $k$.
    I A new table $K + 1$ with probability $\frac{\alpha}{\alpha + n - 1}$.
  I Customers $\Leftrightarrow$ integers, tables $\Leftrightarrow$ clusters.

  [Figure: customers 1-9 seated around tables in a Chinese restaurant.]
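The sketch referred to above: a direct simulation of the CRP seating process ($\alpha$ and $n$ are illustrative).

```python
import numpy as np

def crp_partition(n, alpha, rng=np.random.default_rng(0)):
    """Simulate table assignments for n customers under a CRP with concentration alpha."""
    assignments = [0]                    # first customer sits at table 0
    counts = [1]                         # customers per table
    for i in range(1, n):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()             # existing tables prop. to n_k, new table prop. to alpha
        table = rng.choice(len(probs), p=probs)
        if table == len(counts):
            counts.append(1)             # open a new table
        else:
            counts[table] += 1
        assignments.append(table)
    return assignments, counts

# Rich-gets-richer: the number of occupied tables grows like O(alpha * log n).
assignments, counts = crp_partition(n=10000, alpha=30.0)
print(len(counts))
```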

38 / 127

Page 41:

Chinese Restaurant Process

  [Figure: number of occupied tables versus number of customers (up to 10,000), for $\alpha = 30$, $d = 0$.]

I The CRP exhibits the clustering property of the DP.
I The rich-gets-richer effect implies a small number of large clusters.
I The expected number of clusters is $K = O(\alpha \log n)$.

39 / 127

Page 42:

Clustering

I To partition a heterogeneous data set into distinct, homogeneous clusters.

I The CRP is a canonical nonparametric prior over partitions that can be used as part of a Bayesian model for clustering.

I There are other priors over partitions ([Lijoi and Pruenster 2010]).

40 / 127

Page 43:

Inferring Discrete Latent Structures

I DPs have also found uses in applications where the aim is to discover latent objects, and where the number of objects is unknown or unbounded.

  I Nonparametric probabilistic context free grammars.
  I Visual scene analysis.
  I Infinite hidden Markov models/trees.
  I Genetic ancestry inference.
  I ...

I In many such applications it is important to be able to model the same set of objects in different contexts.

I This can be tackled using hierarchical Dirichlet processes.

[Teh et al. 2006, Teh and Jordan 2010]

41 / 127

Page 44:

Exchangeability

I Instead of deriving the Pólya urn scheme by marginalizing out a DP, consider starting directly from the conditional distributions:

  $\theta_n \mid \theta_{1:n-1} \sim \dfrac{\alpha H + \sum_{i=1}^{n-1} \delta_{\theta_i}}{\alpha + n - 1}$

I For any $n$, the joint distribution of $\theta_1, \ldots, \theta_n$ is:

  $p(\theta_1, \ldots, \theta_n) = \dfrac{\alpha^K \prod_{k=1}^K h(\theta^*_k)\,(m_{nk} - 1)!}{\prod_{i=1}^n (i - 1 + \alpha)}$

  where $h(\theta)$ is the density of $\theta$ under $H$, $\theta^*_1, \ldots, \theta^*_K$ are the unique values, and $\theta^*_k$ occurred $m_{nk}$ times among $\theta_1, \ldots, \theta_n$.

I The joint distribution is exchangeable with respect to permutations of $\theta_1, \ldots, \theta_n$.

I De Finetti's Theorem says that there must be a random probability measure $G$ making $\theta_1, \theta_2, \ldots$ iid. This is the DP.

42 / 127

Page 45:

De Finetti’s Theorem

Let $\theta_1, \theta_2, \ldots$ be an infinite sequence of random variables with joint distribution $p$. If for all $n \geq 1$ and all permutations $\sigma \in \Sigma_n$ on $n$ objects,

  $p(\theta_1, \ldots, \theta_n) = p(\theta_{\sigma(1)}, \ldots, \theta_{\sigma(n)})$

that is, the sequence is infinitely exchangeable, then there exists a (unique) latent random parameter $G$ such that:

  $p(\theta_1, \ldots, \theta_n) = \int p(G) \prod_{i=1}^n p(\theta_i \mid G)\, dG$

where $p$ is a joint distribution over $G$ and the $\theta_i$'s.

I θi ’s are independent given G.

I Sufficient to define G through the conditionals p(θn|θ1, . . . , θn−1).

I G can be infinite dimensional (indeed it is often a random measure).

43 / 127

Page 46:

Outline

(The outline is repeated; the next section is Latent Variable Models and Indian Buffet and Beta Processes.)

44 / 127

Page 47:

Latent Variable Modelling

I Say we have n vector observations x1, . . . , xn.

I Model each observation as a linear combination of $K$ latent sources:

  $x_i = \sum_{k=1}^K \Lambda_k y_{ik} + \epsilon_i$

  $y_{ik}$: activity of source $k$ in datum $i$.
  $\Lambda_k$: basis vector describing the effect of source $k$.

I Examples include principal components analysis, factor analysis, independent components analysis.

I How many sources are there?

I Do we believe that K sources is sufficient to explain all our data?

I What prior distribution should we use for sources?

45 / 127

Page 48:

Binary Latent Variable Models

I Consider a latent variable model with binary sources/features,

  $z_{ik} = \begin{cases} 1 & \text{with probability } \mu_k; \\ 0 & \text{with probability } 1 - \mu_k. \end{cases}$

I Example: data items could be movies like "Terminator 2", "Shrek" and "Lord of the Rings", and features could be "science fiction", "fantasy", "action" and "Arnold Schwarzenegger".

I Place a beta prior over the probabilities of features:

  $\mu_k \sim \text{Beta}(\tfrac{\alpha}{K}, 1)$

I We will again take K →∞.

46 / 127

Page 49:

Indian Buffet Processes

I The Indian Buffet Process (IBP) describes each customer with a binary vector instead of a cluster assignment.

I Generating from an IBP (a code sketch follows below):

  I Parameter $\alpha$.
  I First customer picks $\text{Poisson}(\alpha)$ dishes to eat.
  I Subsequent customer $i$ picks dish $k$ with probability $\frac{m_k}{i}$; and picks $\text{Poisson}(\frac{\alpha}{i})$ new dishes.

  [Figures: a customers-by-tables assignment matrix (CRP) and a customers-by-dishes binary matrix (IBP).]

[Griffiths and Ghahramani 2006]

47 / 127
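The sketch referred to above: a direct simulation of the IBP, returning the binary customers-by-dishes matrix ($\alpha$ and the number of customers are illustrative).

```python
import numpy as np

def ibp_matrix(n_customers, alpha, rng=np.random.default_rng(0)):
    """Simulate a binary customers-by-dishes matrix Z from the Indian buffet process."""
    dishes = []                               # dishes[k] = number of previous customers who tried dish k
    rows = []
    for i in range(1, n_customers + 1):
        row = [1 if rng.random() < m_k / i else 0 for m_k in dishes]
        for k, z in enumerate(row):
            dishes[k] += z
        n_new = rng.poisson(alpha / i)        # new dishes picked by this customer
        row += [1] * n_new
        dishes += [1] * n_new
        rows.append(row)
    Z = np.zeros((n_customers, len(dishes)), dtype=int)
    for i, row in enumerate(rows):
        Z[i, :len(row)] = row
    return Z

Z = ibp_matrix(n_customers=20, alpha=3.0)
```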

Page 50:

Indian Buffet Processes and Exchangeability

I The IBP is infinitely exchangeable. For this to make sense, we need to "forget" the ordering of the dishes.

  I "Name" each dish $k$ with a $\Lambda^*_k$ drawn iid from $H$.
  I Each customer now eats a set of dishes: $\Psi_i = \{\Lambda^*_k : z_{ik} = 1\}$.
  I The joint probability of $\Psi_1, \ldots, \Psi_n$ can be calculated:

  $p(\Psi_1, \ldots, \Psi_n) = \exp\left( -\alpha \sum_{i=1}^n \frac{1}{i} \right) \alpha^K \prod_{k=1}^K \frac{(m_k - 1)!\,(n - m_k)!}{n!}\, h(\Lambda^*_k)$

  $K$: total number of dishes tried by the $n$ customers.
  $\Lambda^*_k$: name of the $k$th dish tried.
  $m_k$: number of customers who tried dish $\Lambda^*_k$.

I De Finetti's Theorem again states that there is some random measure underlying the IBP.

I This random measure is the beta process.

[Griffiths and Ghahramani 2006, Thibaux and Jordan 2007]

48 / 127

Page 51:

Applications of Indian Buffet Processes

I The IBP can be used in concert with different likelihood models in a variety of applications.

  $Z \sim \text{IBP}(\alpha)$    $Y \sim H$    $X \sim F(Z, Y)$

  $p(Z, Y \mid X) = \dfrac{p(Z, Y)\, p(X \mid Z, Y)}{p(X)}$

I Latent factor models for distributed representation [Griffiths and Ghahramani 2005].

I Matrix factorization for collaborative filtering [Meeds et al. 2007].

I Latent causal discovery for medical diagnostics [Wood et al. 2006]

I Protein complex discovery [Chu et al. 2006].

I Psychological choice behaviour [Görür et al. 2006].

I Independent components analysis [Knowles and Ghahramani 2007].

I Learning the structure of deep belief networks [Adams et al. 2010].

49 / 127

Page 52:

Infinite Independent Components Analysis

I Each image $X_i$ is a linear combination of sparse features:

  $X_i = \sum_k \Lambda^*_k y_{ik}$

  where $y_{ik}$ is the activity of feature $k$ with a sparse prior. One possibility is a mixture of a Gaussian and a point mass at 0:

  $y_{ik} = z_{ik} a_{ik}$,  $a_{ik} \sim \mathcal{N}(0, 1)$,  $Z \sim \text{IBP}(\alpha)$

I An ICA model with an infinite number of features.

[Knowles and Ghahramani 2007, Teh et al. 2007]
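A self-contained toy sketch of data generated from this kind of sparse latent feature model, using the finite Beta($\alpha/K$, 1) prior from the earlier slide in place of the IBP (its $K \to \infty$ limit); all sizes and the Gaussian basis vectors are illustrative assumptions.

```python
import numpy as np
# z_ik ~ Bernoulli(mu_k) with mu_k ~ Beta(alpha/K, 1); y_ik = z_ik * a_ik, a_ik ~ N(0, 1).
rng = np.random.default_rng(0)
n, D, K, alpha = 100, 10, 20, 3.0
mu = rng.beta(alpha / K, 1.0, size=K)                 # feature probabilities
Z = (rng.random((n, K)) < mu).astype(float)           # binary feature ownership
A = rng.standard_normal((n, K))                       # Gaussian activations
Y = Z * A                                             # sparse activities y_ik
Lam = rng.standard_normal((K, D))                     # basis vectors Lambda_k
X = Y @ Lam + 0.1 * rng.standard_normal((n, D))       # observations (one row per image)
```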

50 / 127

Page 53:

Beta Processes

I A one-parameter beta process $B \sim \text{BP}(\alpha, H)$ is a random discrete measure of the form:

  $B = \sum_{k=1}^\infty \mu_k \delta_{\theta^*_k}$

  where the points $P = \{(\theta^*_1, \mu_1), (\theta^*_2, \mu_2), \ldots\}$ are spikes in a 2D Poisson process with rate measure:

  $\alpha \mu^{-1}\, d\mu\, H(d\theta)$

I It is the de Finetti measure for the IBP.

I This is an example of a completely random measure.

I A beta process does not have Beta distributed marginals.

[Hjort 1990, Thibaux and Jordan 2007]

51 / 127

Page 54:

Beta Processes

52 / 127

Page 55:

Stick-breaking Construction for Beta Processes

I The following generates a draw of $B$ (a code sketch follows below):

  $v_k \sim \text{Beta}(1, \alpha)$    $\mu_k = (1 - v_k) \prod_{i=1}^{k-1} (1 - v_i)$    $\theta^*_k \sim H$

  $B = \sum_{k=1}^\infty \mu_k \delta_{\theta^*_k}$

I The above is the complement of the stick-breaking construction for DPs.

  [Figure: the DP stick-breaking weights $\pi_{(k)}$ and the corresponding beta process weights $\mu_{(k)}$ on the unit stick.]

[Teh et al. 2007]
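The sketch referred to above: a truncated draw of the beta-process stick-breaking weights; since $\mu_k = \prod_{i=1}^{k}(1 - v_i)$, the weights decrease stochastically. $\alpha$ and the truncation level are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, truncation = 2.0, 1000
v = rng.beta(1.0, alpha, size=truncation)
mu = np.cumprod(1.0 - v)          # mu_k = prod_{i<=k} (1 - v_i)
```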

53 / 127

Page 56:

Outline

(The outline is repeated; the next section is Topic Modelling and Hierarchical Processes.)

54 / 127

Page 57:

Topic Modelling with Latent Dirichlet Allocation

I Infer topics from a document corpus, topics being sets of words that tend to co-occur together.

I Using (Bayesian) latent Dirichlet allocation (a generative sketch follows below):

  $\pi_j \sim \text{Dirichlet}(\tfrac{\alpha}{K}, \ldots, \tfrac{\alpha}{K})$
  $\theta_k \sim \text{Dirichlet}(\tfrac{\beta}{W}, \ldots, \tfrac{\beta}{W})$
  $z_{ji} \mid \pi_j \sim \text{Multinomial}(\pi_j)$
  $x_{ji} \mid z_{ji}, \theta_{z_{ji}} \sim \text{Multinomial}(\theta_{z_{ji}})$

I How many topics can we find from the corpus?

I Can we take the number of topics $K \to \infty$?

  [Graphical model: plates over topics $k = 1, \ldots, K$, documents $j = 1, \ldots, D$ and words $i = 1, \ldots, n_d$, with $\pi_j \to z_{ji} \to x_{ji} \leftarrow \theta_k$.]
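The sketch referred to above: sampling a toy corpus from the LDA model (all sizes and hyperparameters are illustrative).

```python
import numpy as np

def lda_generate(D, n_words, K, W, alpha=1.0, beta=1.0, rng=np.random.default_rng(0)):
    """Generate a toy corpus from the LDA model above."""
    theta = rng.dirichlet([beta / W] * W, size=K)        # topic-word distributions
    docs = []
    for j in range(D):
        pi_j = rng.dirichlet([alpha / K] * K)            # document-topic proportions
        z = rng.choice(K, size=n_words, p=pi_j)          # topic for each word
        words = [rng.choice(W, p=theta[k]) for k in z]   # word drawn from its topic
        docs.append(words)
    return docs, theta

docs, theta = lda_generate(D=5, n_words=50, K=3, W=20)
```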

55 / 127

Page 58:

Hierarchical Dirichlet Processes

I Use a DP mixture for each group.

I Unfortunately there is no sharing of clusters across different groups because H is smooth.

I Solution: make the base distribution H discrete.

I Put a DP prior on the common base distribution.

[Teh et al. 2006]

  [Graphical model: two group-specific DPs $G_1$ and $G_2$ with common base distribution $H$, each generating $\theta_{ji} \to x_{ji}$.]

56 / 127

Page 59:

Hierarchical Dirichlet Processes

I A hierarchical Dirichlet process (a truncated sketch follows below):

  $G_0 \sim \text{DP}(\alpha_0, H)$
  $G_1, G_2 \mid G_0 \sim \text{DP}(\alpha, G_0)$ iid

I Extension to larger hierarchies is straightforward.

  [Graphical model: $H \to G_0$, with $G_0$ shared by $G_1$ and $G_2$, each generating $\theta_{ji} \to x_{ji}$.]
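The sketch referred to above: $G_0$ is approximated by $K$ stick-breaking weights over shared atoms, and each group's $\text{DP}(\alpha, G_0)$ draw is represented by Dirichlet-distributed weights over those atoms (exact when $G_0$ has finitely many atoms). $K$, the concentrations and $H = \mathcal{N}(0, 1)$ are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
K, alpha0, alpha = 50, 5.0, 2.0
v = rng.beta(1.0, alpha0, size=K)
beta = v * np.cumprod(np.concatenate(([1.0], 1.0 - v[:-1])))   # truncated GEM(alpha0) weights
beta /= beta.sum()                                             # renormalize the truncation
atoms = rng.standard_normal(K)                                 # shared atoms theta*_k ~ H
pi = rng.dirichlet(alpha * beta, size=2)                       # group weights for G_1 and G_2
```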

57 / 127

Page 60:

Hierarchical Dirichlet Processes

I Making $G_0$ discrete forces shared clusters between $G_1$ and $G_2$.

58 / 127

Page 61:

Hierarchical Dirichlet Processes

  [Figure: left, perplexity on test abstracts for LDA and the HDP mixture as a function of the number of LDA topics; right, the posterior over the number of topics in HDP mixtures.]

59 / 127

Page 62:

Chinese Restaurant Franchise

  [Figure: the Chinese restaurant franchise, with group-specific restaurants $j = 1, 2, 3$ whose tables share dishes from a global menu.]

60 / 127

Page 63:

Visual Scene Analysis with Transformed DPs

  [Figure 15 from the source paper: the TDP model for 2D visual scenes and a cartoon of its generative process. A global mixture $G_0$ describes the expected frequency and image position of visual categories, whose internal structure is represented by part-based appearance models; each image distribution $G_j$ instantiates a randomly chosen set of objects at transformed locations.]

[Sudderth et al. 2008]

61 / 127

Page 64:

Visual Scene Analysis with Transformed DPs

  [Figure 16 from the source paper: learned contextual, fixed-order models for street scenes and office scenes, each containing four objects. Top: Gaussian distributions over the positions of other objects given the location of the car (left) or computer screen (right). Bottom: the parts generating each category's features.]

[Sudderth et al. 2008]

62 / 127

Page 65:

Hierarchical Modelling

  [Graphical model: three independent groups, each with its own parameter $\theta_j$ generating data $x_{ji}$, $i = 1, \ldots, n_j$.]

[Gelman et al. 1995]

63 / 127

Page 66:

Hierarchical Modelling

  [Graphical model: as above, but with a shared top-level parameter $\theta_0$ tying the group-specific parameters $\theta_1, \theta_2, \theta_3$ together.]

[Gelman et al. 1995]

63 / 127

Page 67:

Outline

(The outline is repeated; the next section is Hierarchical Structure Discovery and Nested Processes.)

64 / 127

Page 68:

Topic Hierarchies

  [Figure: a topic hierarchy learned from a document corpus.]

[Blei et al. 2010]

65 / 127

Page 69:

Nested Chinese Restaurant Process

  [Figure: customers 1-9 at the root restaurant of a nested CRP.]

[Blei et al. 2010]

66 / 127

Pages 70-71:

(The same nested CRP figure repeated as the customers are recursively partitioned into restaurants at deeper levels.)

[Blei et al. 2010]

66 / 127

Page 72:

Visual Taxonomies

  [Figure 5 from the source paper: an unsupervised taxonomy learned on the 13-scenes data set, shown as a tree whose nodes are colour-coded by scene category. Three large groups emerge: natural scenes (forest, mountains), cluttered man-made scenes (tall buildings, city scenes), and less cluttered man-made scenes (highways), which split into finer sub-groups at lower levels.]

[Bart et al. 2008]

67 / 127

Page 73:

Hierarchical Clustering

68 / 127

Page 74:

Hierarchical Clustering

I Bayesian approach to hierarchical clustering: place a prior over tree structures, and infer the posterior.

I The nested DP can be used as a prior over layered tree structures.

I Another prior is the Dirichlet diffusion tree, which produces binary ultrametric trees, and which can be obtained as an infinitesimal limit of a nested DP. It is an example of a fragmentation process.

I Yet another prior is Kingman's coalescent, which also produces binary ultrametric trees, but is an example of a coalescent process.

[Neal 2003, Teh et al. 2008, Bertoin 2006]

69 / 127

Page 75:

Nested Dirichlet Process

I Underlying stochastic process for the nested CRP is a nested DP.

  Hierarchical DP:
  $G_0 \sim \text{DP}(\alpha_0, H)$
  $G_j \mid G_0 \sim \text{DP}(\alpha, G_0)$
  $x_{ji} \mid G_j \sim G_j$

  Nested DP:
  $G_0 \sim \text{DP}(\alpha, \text{DP}(\alpha_0, H))$
  $G_i \sim G_0$
  $x_i \mid G_i \sim G_i$

I The hierarchical DP starts with groups of data items, and analyses them together by introducing dependencies through $G_0$.

I The nested DP starts with one set of data items, partitions them into different groups, and analyses each group separately.

I Orthogonal effects, can be used together.

[Rodríguez et al. 2008]

70 / 127

Page 76:

Nested Beta/Indian Buffet Processes

  [Figure: items 1-5 arranged in a layered tree.]

I Exchangeable distribution over layered trees.

71 / 127

Pages 77-82:

(The same slide repeated as the layered tree is built up step by step.)

71 / 127

Page 83:

Hierarchical Beta/Indian Buffet Processes

  [Figure: items 1-9 and their binary feature assignments in a hierarchy of IBPs.]

I Different from the hierarchical beta process of [Thibaux and Jordan 2007].

72 / 127

Pages 84-85:

(The same slide repeated as further groups of items are added to the hierarchy.)

72 / 127

Page 86:

Deep Structure Learning

[Slide reproduces a page of Adams et al. 2010, "Learning the Structure of Deep Sparse Graphical Models". Figure 4 (Olivetti faces): (a) test images on the left with reconstructed bottom halves on the right; (b) sixty features learned in the bottom layer, where black shows absence of an edge, with sparse features corresponding to specific facial structures such as mouth shapes, noses and eyebrows; (c) raw predictive fantasies; (d) feature activations from individual units in the second hidden layer. The surrounding text gives the Gibbs and Metropolis–Hastings birth/death updates used to sample the structure matrices Z(m), notes that this sampler targets the IBP posterior without truncation or conjugacy and allows the two-parameter IBP, and reports reconstruction experiments on the Olivetti faces, MNIST digits and Frey faces; a typical posterior network had three hidden layers with roughly seventy units per layer.]

[Adams et al. 2010]

73 / 127


Page 88: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Transfer Learning

I Many recent machine learning paradigms can be understood as trying to model data from heterogeneous sources and types.

I Semi-supervised learning: we have labelled data, and unlabelled data.

I Multi-task learning: we have multiple tasks with different distributions but structurally similar.

I Domain adaptation: we have a small amount of pertinent data, and a large amount of data from a related problem or domain.

I The transfer learning problem is how to transfer information between different sources and types.

I Flexible nonparametric models can allow for more information extraction and transfer.

I Hierarchies and nestings are different ways of putting together multiple stochastic processes to form complex models.

[Jordan 2010]

74 / 127

Page 89: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Outline

Introduction

Regression and Gaussian Processes

Density Estimation, Clustering and Dirichlet Processes

Latent Variable Models and Indian Buffet and Beta Processes

Topic Modelling and Hierarchical Processes

Hierarchical Structure Discovery and Nested Processes

Time Series Models

Modelling Power-laws with Pitman-Yor Processes

Summary

75 / 127

Page 90: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Hidden Markov Models

[Graphical model: a Markov chain of hidden states z0, z1, z2, . . . , zτ with observations x1, x2, . . . , xτ; state k has transition vector πk and emission parameter θ*k.]

πk ∼ Dirichlet(α/K, . . . , α/K)    zi | zi−1, πzi−1 ∼ Multinomial(πzi−1)

θ*k ∼ H    xi | zi, θ*zi ∼ F(θ*zi)

I Can we take K →∞?

I Can we do so while imposing structure in transition probability matrix?
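A minimal generative sketch of the finite Bayesian HMM above, in Python with numpy (an assumption of this transcript, not part of the original slides); the emission family F(θ*) = N(θ*, 1) and base measure H = N(0, 10²) are illustrative choices only:

```python
import numpy as np

def sample_finite_bayes_hmm(T, K, alpha=1.0, seed=None):
    """Sample a state/observation sequence from the finite Bayesian HMM prior.

    Illustrative choices: emissions F(theta*) = N(theta*, 1), base measure
    H = N(0, 10^2); the Dirichlet(alpha/K, ..., alpha/K) rows match the slide.
    """
    rng = np.random.default_rng(seed)
    pi = rng.dirichlet([alpha / K] * K, size=K)   # one transition row per state
    theta = rng.normal(0.0, 10.0, size=K)         # theta*_k ~ H
    z = np.empty(T, dtype=int)
    x = np.empty(T)
    z_prev = rng.integers(K)                      # initial state z_0 (uniform here)
    for t in range(T):
        z[t] = rng.choice(K, p=pi[z_prev])        # z_t | z_{t-1} ~ Multinomial(pi_{z_{t-1}})
        x[t] = rng.normal(theta[z[t]], 1.0)       # x_t | z_t ~ F(theta*_{z_t})
        z_prev = z[t]
    return z, x

z, x = sample_finite_bayes_hmm(T=200, K=10, alpha=2.0, seed=0)
```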

76 / 127

Page 91: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Infinite Hidden Markov Models

[Graphical model: same HMM structure as before, now with an unbounded number of states.]

β ∼ GEM(γ)    πk | β ∼ DP(α, β)    zi | zi−1, πzi−1 ∼ Multinomial(πzi−1)

θ*k ∼ H    xi | zi, θ*zi ∼ F(θ*zi)

I Hidden Markov models with an infinite number of states: infinite HMM.

I A hierarchical DP is used to share information among the transition probability vectors, which prevents “run-away” states: the HDP-HMM.

[Beal et al. 2002, Teh et al. 2006]
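A hedged sketch of sampling from the HDP-HMM prior using a finite truncation to K states: β is drawn by truncated stick-breaking, and πk | β ∼ DP(α, β) reduces, over the K retained atoms, to Dirichlet(αβ1, . . . , αβK). Python/numpy and the Gaussian choices for F and H are illustrative assumptions of this transcript:

```python
import numpy as np

def sample_truncated_hdp_hmm_prior(T, K=20, gamma=3.0, alpha=3.0, seed=None):
    """Truncated (K-state) approximation to the HDP-HMM prior above."""
    rng = np.random.default_rng(seed)
    # beta ~ GEM(gamma), truncated: stick-breaking with Beta(1, gamma) proportions.
    v = rng.beta(1.0, gamma, size=K)
    v[-1] = 1.0                                    # absorb the leftover stick mass
    beta = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    # pi_k | beta ~ DP(alpha, beta); over the K retained atoms this is
    # Dirichlet(alpha * beta_1, ..., alpha * beta_K).
    pi = np.vstack([rng.dirichlet(alpha * beta) for _ in range(K)])
    theta = rng.normal(0.0, 10.0, size=K)          # theta*_k ~ H = N(0, 10^2)
    z, x = np.empty(T, dtype=int), np.empty(T)
    z_prev = rng.choice(K, p=beta)                 # initial state drawn from beta
    for t in range(T):
        z[t] = rng.choice(K, p=pi[z_prev])
        x[t] = rng.normal(theta[z[t]], 1.0)        # F(theta*) = N(theta*, 1)
        z_prev = z[t]
    return z, x

z, x = sample_truncated_hdp_hmm_prior(T=500, seed=0)
```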

77 / 127

Page 92: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Word Segmentation

I Given sequences of utterances or characters, can a probabilistic model segment sequences into coherent chunks (“words”)?

canyoureadthissentencewithoutspaces?
can you read this sentence without spaces?

I Use an infinite HMM: each chunk/word is a state, with a Markov model of state transitions.

I A nonparametric model is natural, since the number of words is unknown before segmentation.

[Goldwater et al. 2006b]

78 / 127

Page 93: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Word Segmentation

          Words   Lexicon   Boundaries
NGS-u     68.9    82.6      52.0
MBDP-1    68.2    82.3      52.4
DP        53.8    74.3      57.2
NGS-b     68.3    82.1      55.7
HDP       76.6    87.7      63.1

I NGS-u: n-gram Segmentation (unigram) [Venkataraman 2001].

I NGS-b: n-gram Segmentation (bigram) [Venkataraman 2001].

I MBDP-1: Model-based Dynamic Programming [Brent 1999].

I DP, HDP: Nonparametric model, without and with Markov dependencies.

[Goldwater et al. 2006a]

79 / 127

Page 94: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Sticky HDP-HMM

I In typical HMMs or in infinite HMMs the model does not give special treatment to self-transitions (from a state to itself).

I In many HMM applications self-transitions are much more likely.

I Example application of HMMs: speaker diarization.

I Straightforward extension of the HDP-HMM prior encourages higher self-transition probabilities:

πk | β ∼ DP(α + κ, (αβ + κδk)/(α + κ))

[Beal et al. 2002, Fox et al. 2008]
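In the same truncated setting as before, the sticky prior DP(α + κ, (αβ + κδk)/(α + κ)) for row k becomes Dirichlet(αβ + κek); a small illustrative sketch (Python/numpy assumed):

```python
import numpy as np

def sticky_transition_rows(beta, alpha=3.0, kappa=20.0, seed=None):
    """Truncated sticky HDP-HMM transitions: row k ~ Dirichlet(alpha*beta + kappa*e_k).

    `beta` is a length-K probability vector (e.g. a truncated GEM draw);
    kappa > 0 boosts the expected self-transition mass of each row.
    """
    rng = np.random.default_rng(seed)
    K = len(beta)
    rows = []
    for k in range(K):
        concentration = alpha * np.asarray(beta, dtype=float)
        concentration[k] += kappa          # extra mass on the self-transition
        rows.append(rng.dirichlet(concentration))
    return np.vstack(rows)

beta = np.full(10, 0.1)                    # toy global weights summing to one
P = sticky_transition_rows(beta)           # rows have large diagonal entries on average
```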

80 / 127

Page 95: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Sticky HDP-HMM

[Slide reproduces results from Fox et al.: Figure 14 compares diarization error rates (DER) on 21 meetings for the sticky and the original (non-sticky) HDP-HMM with DP emissions against the ICSI system, (a) using the state sequence minimizing the expected Hamming distance and (b) using ground-truth speaker labels for the post-processed data. The accompanying discussion argues that the original HDP-HMM models the temporal persistence of states inadequately, while the sticky HDP-HMM with DP (multimodal) emissions, together with blocked forward–backward sampling over a truncated HDP, yields state-of-the-art speaker diarization performance.]

[Fox et al. 2008]

81 / 127

Page 96: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Infinite Factorial HMM

[Slide reproduces an excerpt from Van Gael et al. 2009 (Figures 1 and 2: the HMM and the factorial HMM). The factorial HMM models observations y1, . . . , yT with M parallel binary Markov chains s(1), . . . , s(M), but M is a free parameter that must normally be fixed in advance. The Markov Indian buffet process (mIBP) defines a distribution over binary matrices S whose rows are timesteps and whose columns are chains/features, with an unbounded number of columns; each chain m evolves according to the transition matrix

W(m) = [ 1 − am   am ;  1 − bm   bm ]

and the mIBP is the building block of the infinite factorial HMM (iFHMM), applied in the paper to blind source separation.]

I Take M → ∞ for the following model specification:

P(s(m)t = 1 | s(m)t−1 = 0) = am,   am ∼ Beta(α/M, 1)
P(s(m)t = 1 | s(m)t−1 = 1) = bm,   bm ∼ Beta(γ, δ)

I The resulting stochastic process is a Markov Indian buffet process. It is an example of a dependent random measure.
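A small simulation of the finite-M specification above (before the M → ∞ limit), with Python/numpy and the parameter values as illustrative assumptions:

```python
import numpy as np

def sample_finite_markov_ibp(T, M, alpha=2.0, gamma=5.0, delta=1.0, seed=None):
    """Simulate the finite-M binary-chain model above."""
    rng = np.random.default_rng(seed)
    a = rng.beta(alpha / M, 1.0, size=M)   # P(chain m turns on)   a_m ~ Beta(alpha/M, 1)
    b = rng.beta(gamma, delta, size=M)     # P(chain m stays on)   b_m ~ Beta(gamma, delta)
    S = np.zeros((T, M), dtype=int)        # rows = timesteps, columns = chains
    state = np.zeros(M, dtype=int)         # all chains start in the off state
    for t in range(T):
        p_on = np.where(state == 1, b, a)  # probability of being on at time t
        state = (rng.random(M) < p_on).astype(int)
        S[t] = state
    return S

S = sample_finite_markov_ibp(T=50, M=200, seed=0)   # only a few columns are ever used
```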

[Van Gael et al. 2009]

82 / 127

Page 97: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Nonparametric Grammars, Hierarchical HMMs etc

I In linguistics, grammars are much more plausible as generative models of sentences.

I Learning the structure of probabilistic grammars is even more difficult, and Bayesian nonparametrics provides a compelling alternative.

[Liang et al. 2007, Finkel et al. 2007, Johnson et al. 2007, Heller et al. 2009]

83 / 127

Page 98: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Motion Capture Analysis

I Goal: find coherent “behaviour” in the time series that transfers to other time series.

Slides courtesy of [Fox et al. 2010]

84 / 127

Page 99: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Motion Capture Analysis

I Transfer knowledge among related time series in the form of a library of “behaviours”.

I Allow each time series model to make use of an arbitrary subset of the behaviours.

I Method: represent behaviours as states in an autoregressive HMM, and use the beta/Bernoulli process to pick out subsets of states.

Slides courtesy of [Fox et al. 2010]

85 / 127

Page 100: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

BP-AR-HMM

Slides courtesy of [Fox et al. 2010]

86 / 127

Page 101: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Motion Capture Results

Slides courtesy of [Fox et al. 2010]

87 / 127

Page 102: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

High Order Markov Models

I Decompose the joint distribution of a sequence of variables into conditional distributions:

P(x1, x2, . . . , xT) = ∏_{t=1}^{T} P(xt | x1, . . . , xt−1)

I An Nth order Markov model approximates the joint distribution as:

P(x1, x2, . . . , xT) = ∏_{t=1}^{T} P(xt | xt−N, . . . , xt−1)

I Such models are particularly prevalent in natural language processing, compression and biological sequence modelling.

toad, in, a, hole
t, o, a, d, _, i, n, _, a, _, h, o, l, e

A, C, G, T, C, C, A

I Would like to take N →∞.

88 / 127

Page 103: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

High Order Markov Models

I Difficult to fit such models due to data sparsity.

P(xt | xt−N, . . . , xt−1) = C(xt−N, . . . , xt−1, xt) / C(xt−N, . . . , xt−1)

I Sharing information via hierarchical models.

P(xt | xt−N:t−1 = u) = Gu(xt)

I A context tree.

[Figure: context tree with G∅ at the root; G_a is a child of G∅; G_{in a}, G_{is a} and G_{about a} are children of G_a; G_{toad in a} and G_{stuck in a} are children of G_{in a}.]

[MacKay and Peto 1994, Teh 2006a]
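To make the sharing idea concrete, here is a toy count-based model that recursively interpolates each context with its parent context; it is a simplified stand-in written for this transcript (Python assumed), not interpolated Kneser-Ney and not the hierarchical Pitman-Yor model discussed later:

```python
from collections import defaultdict

class BackoffNgram:
    """Toy N-gram model: each context's counts are interpolated with its parent context."""

    def __init__(self, order=3, discount=0.75):
        self.order = order
        self.discount = discount
        self.counts = defaultdict(lambda: defaultdict(float))   # context -> word -> count
        self.totals = defaultdict(float)                        # context -> total count
        self.vocab = set()

    def fit(self, tokens):
        for t, w in enumerate(tokens):
            self.vocab.add(w)
            for n in range(self.order):                         # contexts of length 0..order-1
                u = tuple(tokens[max(0, t - n):t])
                self.counts[u][w] += 1.0
                self.totals[u] += 1.0

    def prob(self, word, context=()):
        u = tuple(context)[-(self.order - 1):] if self.order > 1 else ()
        return self._prob(word, u)

    def _prob(self, word, u):
        if len(u) == 0:                                         # add-one smoothed unigram level
            V = max(len(self.vocab), 1)
            return (self.counts[()].get(word, 0.0) + 1.0) / (self.totals[()] + V)
        total = self.totals[u]
        parent = self._prob(word, u[1:])                        # back off to the shorter context
        if total == 0.0:
            return parent
        c = self.counts[u].get(word, 0.0)
        lam = self.discount * len(self.counts[u]) / total       # mass handed to the parent
        return max(c - self.discount, 0.0) / total + lam * parent

lm = BackoffNgram(order=3)
lm.fit("to be or not to be that is the question".split())
print(lm.prob("be", context=("not", "to")))
```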

89 / 127

Page 104: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Outline

Introduction

Regression and Gaussian Processes

Density Estimation, Clustering and Dirichlet Processes

Latent Variable Models and Indian Buffet and Beta Processes

Topic Modelling and Hierarchical Processes

Hierarchical Structure Discovery and Nested Processes

Time Series Models

Modelling Power-laws with Pitman-Yor Processes

Summary

90 / 127

Page 105: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Pitman-Yor Processes

I Two-parameter generalization of the Chinese restaurant process:

p(customer n sat at table k | past) = (nk − β)/(n − 1 + α)   if k is an occupied table
                                      (α + βK)/(n − 1 + α)   if k is a new table

(with K the current number of occupied tables)

I Associating each cluster k with a unique draw θ*k ∼ H, the corresponding Pólya urn scheme is also exchangeable.

[Figure: number of occupied tables versus number of customers (up to 10,000), for a Dirichlet process draw (α = 30, d = 0) and a Pitman-Yor process draw (α = 1, d = 0.5).]
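A short simulation of the two-parameter Chinese restaurant process above (Python/numpy assumed; β = 0 recovers the ordinary Dirichlet-process CRP), which reproduces the qualitative difference in table growth shown in the figure:

```python
import numpy as np

def pitman_yor_crp(n_customers, alpha=1.0, beta=0.5, seed=None):
    """Simulate table occupancy counts from the two-parameter CRP above."""
    rng = np.random.default_rng(seed)
    counts = []                                    # n_k for each occupied table
    for n in range(1, n_customers + 1):
        K = len(counts)
        probs = np.array([nk - beta for nk in counts] + [alpha + beta * K])
        probs /= (n - 1 + alpha)                   # probabilities sum to one
        k = rng.choice(K + 1, p=probs)
        if k == K:
            counts.append(1)                       # start a new table
        else:
            counts[k] += 1
    return counts

tables_dp = pitman_yor_crp(10000, alpha=30.0, beta=0.0, seed=0)  # DP-like, ~ alpha*log(n) tables
tables_py = pitman_yor_crp(10000, alpha=1.0, beta=0.5, seed=0)   # power-law, ~ n**beta tables
```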

91 / 127

Page 106: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Pitman-Yor Processes

I De Finetti’s Theorem states that there is a random measure underlying this two-parameter generalization.

I This is the Pitman-Yor process.

I The Pitman-Yor process also has a stick-breaking construction:

πk = vk ∏_{i=1}^{k−1} (1 − vi)    vk ∼ Beta(1 − β, α + kβ)    θ*k ∼ H    G = Σ_{k=1}^{∞} πk δθ*k

I The Pitman-Yor process cannot be obtained as the infinite limit of a simple parametric model.

[Perman et al. 1992, Pitman and Yor 1997, Ishwaran and James 2001]
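A truncated version of this stick-breaking construction is easy to sample; the sketch below (Python/numpy assumed, truncation level K = 1000 chosen arbitrarily) returns the renormalised weights π1, . . . , πK:

```python
import numpy as np

def pitman_yor_stick_breaking(alpha=1.0, beta=0.5, K=1000, seed=None):
    """Truncated stick-breaking draw of the Pitman-Yor weights (pi_1, ..., pi_K)."""
    rng = np.random.default_rng(seed)
    k = np.arange(1, K + 1)
    v = rng.beta(1.0 - beta, alpha + k * beta)     # v_k ~ Beta(1 - beta, alpha + k*beta)
    pi = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return pi / pi.sum()                           # renormalise the truncated weights

pi = pitman_yor_stick_breaking(seed=0)
# Atoms theta*_k ~ H would be attached to these weights to form G.
```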

92 / 127

Page 107: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Pitman-Yor Processes

I Two salient features of the Pitman-Yor process:

I With more occupied tables, the chance of even more tables becomes higher.

I Tables with smaller occupancy numbers tend to have lower chance of getting new customers.

I The above means that Pitman-Yor processes produce Zipf’s Law type behaviour, with K = O(αn^β).

[Figure (log–log plots, α = 10, d ∈ {0.9, 0.5, 0}): number of tables versus number of customers; number of tables with exactly one customer versus number of customers; and proportion of tables with one customer versus number of customers, for up to 10^6 customers.]

93 / 127

Page 108: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Pitman-Yor Processes

Draw from a Pitman-Yor process (α = 1, d = 0.5)

[Figure: table assignment of each of 10,000 customers, and the number of customers per table plotted against table index on semi-log and log–log scales.]

Draw from a Dirichlet process (α = 30, d = 0)

[Figure: the same three plots for a Dirichlet process draw.]

94 / 127

Page 109: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Pitman-Yor Processes

95 / 127

Page 110: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Hierarchical Pitman-Yor Markov Models

I Use a hierarchical Pitman-Yor prior for high order Markov models.

I Can now take N → ∞, making use of coagulation and fragmentation properties of Pitman-Yor processes for computational tractability.

I The resulting non-Markov model is called the sequence memoizer.

[Goldwater et al. 2006a, Teh 2006b, Wood et al. 2009, Gasthaus et al. 2010]

96 / 127

Page 111: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Language Modelling

I Compare hierarchical Pitman-Yor model against hierarchical Dirichlet model, and two state-of-the-art language models (interpolated Kneser-Ney, modified Kneser-Ney).

I Results reported as perplexity scores.

T      N    IKN     MKN     HPYLM   HDLM
2e6    3    148.8   144.1   144.3   191.2
4e6    3    137.1   132.7   132.7   172.7
6e6    3    130.6   126.7   126.4   162.3
8e6    3    125.9   122.3   121.9   154.7
10e6   3    122.0   118.6   118.2   148.7
12e6   3    119.0   115.8   115.4   144.0
14e6   3    116.7   113.6   113.2   140.5
14e6   2    169.9   169.2   169.3   180.6
14e6   4    106.1   102.4   101.9   136.6

[Teh 2006b]

97 / 127

Page 112: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Compression

I Predictive models can be used to compress sequence data using entropic coding techniques.

I Compression results on Calgary corpus:

Model                Average bits / byte
gzip                 2.61
bzip2                2.11
CTW                  1.99
PPM                  1.93
Sequence Memoizer    1.89

I See http://deplump.com and http://sequencememoizer.com.

[Gasthaus et al. 2010]

98 / 127

Page 113: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Comparing Finite and Infinite Order Markov Models

[Wood et al. 2009]

99 / 127

Page 114: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Image Segmentation with Pitman-Yor Processes

I Human segmentations of images also seem to follow a power law.

I An unsupervised image segmentation model based on dependent hierarchical Pitman-Yor processes achieves state-of-the-art results.

[Sudderth and Jordan 2009]

100 / 127

Page 115: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Stable Beta Process

I Extensions allow for different aspects of the generative process to be modelled:

I α: controls the expected number of dishes picked by each customer.

I c: controls the overall number of dishes picked by all customers.

I σ: controls power-law scaling (ratio of popular dishes to unpopular ones).

I A completely random measure, with Lévy measure:

(α Γ(1 + c) / (Γ(1 − σ) Γ(c + σ))) µ^{−σ−1} (1 − µ)^{c+σ−1} dµ H(dθ)

[Ghahramani et al. 2007, Teh and Görür 2009]

101 / 127

Page 116: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Stable Beta Process

[Figure panels: α = 1, c = 1, σ = 0.5;  α = 10, c = 1, σ = 0.5;  α = 100, c = 1, σ = 0.5]

102 / 127

Page 117: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Stable Beta Process

[Figure panels: α = 10, c = 0.1, σ = 0.5;  α = 10, c = 1, σ = 0.5;  α = 10, c = 10, σ = 0.5]

102 / 127

Page 118: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Stable Beta Process

[Figure panels: α = 10, c = 1, σ = 0.2;  α = 10, c = 1, σ = 0.5;  α = 10, c = 1, σ = 0.8]

102 / 127

Page 119: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Modelling Word Occurrences in Documents

[Figure: cumulative number of distinct words versus number of documents (up to about 550 documents), comparing the beta process (BP), the stable beta process (SBP) and the data.]

103 / 127

Page 120: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Outline

Introduction

Regression and Gaussian Processes

Density Estimation, Clustering and Dirichlet Processes

Latent Variable Models and Indian Buffet and Beta Processes

Topic Modelling and Hierarchical Processes

Hierarchical Structure Discovery and Nested Processes

Time Series Models

Modelling Power-laws with Pitman-Yor Processes

Summary

104 / 127

Page 121: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Summary

I Motivation for Bayesian nonparametrics:

I Allows practitioners to define and work with models with large support, sidesteps model selection.
I New models with useful properties.
I Large variety of applications.

I Various standard Bayesian nonparametric models:

I Dirichlet processes
I Hierarchical Dirichlet processes
I Infinite hidden Markov models
I Indian buffet and beta processes
I Pitman-Yor processes

I Touched upon two important theoretical tools:

I Consistency and Kolmogorov’s Consistency Theorem
I Exchangeability and de Finetti’s Theorem

I Described a number of applications of Bayesian nonparametrics.

105 / 127

Page 122: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Advertisement

I PhD studentships and postdoctoral fellowships are available.

I In both machine learning and computational neuroscience.

I With both Arthur Gretton and me.

I Happy to speak with anyone interested.

106 / 127

Page 123: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Outline

Relating Different Representations of Dirichlet Processes

Representations of Hierarchical Dirichlet Processes

107 / 127

Page 124: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Representations of Dirichlet Processes

I Posterior Dirichlet process:

G ∼ DP(α, H), θ | G ∼ G   ⟺   θ ∼ H, G | θ ∼ DP(α + 1, (αH + δθ)/(α + 1))

I Pólya urn scheme:

θn | θ1:n−1 ∼ (αH + Σ_{i=1}^{n−1} δθi) / (α + n − 1)

I Chinese restaurant process:

p(customer n sat at table k | past) = nk/(n − 1 + α)   if k is an occupied table
                                      α/(n − 1 + α)    if k is a new table

I Stick-breaking construction:

πk = βk ∏_{i=1}^{k−1} (1 − βi)    βk ∼ Beta(1, α)    θ*k ∼ H    G = Σ_{k=1}^{∞} πk δθ*k
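The Pólya urn row of this summary can be simulated directly; the sketch below (Python/numpy assumed) uses an illustrative base measure H = N(0, 1):

```python
import numpy as np

def polya_urn(n, alpha=2.0, seed=None):
    """Draw theta_1, ..., theta_n marginally from DP(alpha, H) via the Polya urn.

    With probability alpha/(alpha + i - 1) the i-th draw is a new atom from
    H = N(0, 1) (illustrative); otherwise it copies an earlier draw uniformly.
    """
    rng = np.random.default_rng(seed)
    thetas = []
    for i in range(1, n + 1):
        if rng.random() < alpha / (alpha + i - 1):
            thetas.append(rng.normal())                        # new atom from H
        else:
            thetas.append(thetas[rng.integers(len(thetas))])   # reuse an earlier draw
    return np.array(thetas)

theta = polya_urn(1000, seed=0)
print(len(np.unique(theta)))     # number of distinct atoms, roughly O(alpha * log n)
```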

108 / 127

Page 125: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Posterior Dirichlet Processes

I Suppose G is DP distributed, and θ is G distributed:

G ∼ DP(α,H)

θ|G ∼ G

I We are interested in:

I The marginal distribution of θ with G integrated out.
I The posterior distribution of G conditioning on θ.

109 / 127

Page 126: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Posterior Dirichlet Processes

Conjugacy between Dirichlet Distribution and Multinomial.

I Consider:

(π1, . . . , πK) ∼ Dirichlet(α1, . . . , αK)
z | (π1, . . . , πK) ∼ Discrete(π1, . . . , πK)

z is a multinomial variate, taking on value i ∈ {1, . . . , K} with probability πi.

I Then:

z ∼ Discrete(α1/Σi αi, . . . , αK/Σi αi)
(π1, . . . , πK) | z ∼ Dirichlet(α1 + δ1(z), . . . , αK + δK(z))

where δi(z) = 1 if z takes on value i, 0 otherwise.

I Converse also true.
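A quick Monte Carlo sanity check of this conjugacy (Python/numpy assumed, with illustrative values K = 3, α = (1, 2, 3) and an observed z = 2):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([1.0, 2.0, 3.0])

pi = rng.dirichlet(alpha, size=50000)                 # pi ~ Dirichlet(alpha)
z = np.array([rng.choice(3, p=p) for p in pi]) + 1    # z | pi ~ Discrete(pi), 1-based
mc_mean = pi[z == 2].mean(axis=0)                     # estimate of E[pi | z = 2]

alpha_post = alpha + np.array([0.0, 1.0, 0.0])        # Dirichlet(alpha + delta(z=2))
print(mc_mean)                                        # roughly [0.143, 0.429, 0.429]
print(alpha_post / alpha_post.sum())                  # exact posterior mean
```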

110 / 127

Page 127: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Posterior Dirichlet Processes

I Fix a partition (A1, . . . , AK) of Θ. Then

(G(A1), . . . ,G(AK )) ∼ Dirichlet(αH(A1), . . . , αH(AK ))

P(θ ∈ Ai |G) = G(Ai )

I Using Dirichlet-multinomial conjugacy,

P(θ ∈ Ai ) = H(Ai )

(G(A1), . . . ,G(AK ))|θ ∼ Dirichlet(αH(A1)+δθ(A1), . . . , αH(AK )+δθ(AK ))

I The above is true for every finite partition of Θ. In particular, taking a really fine partition,

p(dθ) = H(dθ)

i.e. θ ∼ H with G integrated out.

I The posterior G | θ is also a Dirichlet process:

G | θ ∼ DP(α + 1, (αH + δθ)/(α + 1))

111 / 127

Page 128: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Posterior Dirichlet Processes

G ∼ DP(α, H), θ | G ∼ G   ⟺   θ ∼ H, G | θ ∼ DP(α + 1, (αH + δθ)/(α + 1))

112 / 127

Page 129: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Pólya Urn Scheme

I First sample:

θ1 | G ∼ G,  G ∼ DP(α, H)   ⟺   θ1 ∼ H,  G | θ1 ∼ DP(α + 1, (αH + δθ1)/(α + 1))

I Second sample:

θ2 | θ1, G ∼ G,  G | θ1 ∼ DP(α + 1, (αH + δθ1)/(α + 1))   ⟺   θ2 | θ1 ∼ (αH + δθ1)/(α + 1),  G | θ1, θ2 ∼ DP(α + 2, (αH + δθ1 + δθ2)/(α + 2))

I nth sample:

θn | θ1:n−1, G ∼ G,  G | θ1:n−1 ∼ DP(α + n − 1, (αH + Σ_{i=1}^{n−1} δθi)/(α + n − 1))   ⟺   θn | θ1:n−1 ∼ (αH + Σ_{i=1}^{n−1} δθi)/(α + n − 1),  G | θ1:n ∼ DP(α + n, (αH + Σ_{i=1}^{n} δθi)/(α + n))

113 / 127

Page 130: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Stick-breaking Construction

I Returning to the posterior process:

G ∼ DP(α, H), θ | G ∼ G   ⟺   θ ∼ H, G | θ ∼ DP(α + 1, (αH + δθ)/(α + 1))

I Consider a partition (θ, Θ\θ) of Θ. We have:

(G(θ), G(Θ\θ)) | θ ∼ Dirichlet((α + 1) (αH + δθ)/(α + 1) (θ), (α + 1) (αH + δθ)/(α + 1) (Θ\θ))
                   = Dirichlet(1, α)

I G has a point mass located at θ:

G = βδθ + (1 − β)G′   with β ∼ Beta(1, α)

and G′ is the (renormalized) probability measure with the point mass removed.

I What is G′?

114 / 127

Page 131: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Stick-breaking Construction

I Currently, we have:

G ∼ DP(α, H), θ ∼ G   ⟹   θ ∼ H, G | θ ∼ DP(α + 1, (αH + δθ)/(α + 1)),
                            G = βδθ + (1 − β)G′, β ∼ Beta(1, α)

I Consider a further partition (θ, A1, . . . , AK) of Θ:

(G(θ), G(A1), . . . , G(AK)) = (β, (1 − β)G′(A1), . . . , (1 − β)G′(AK))
                            ∼ Dirichlet(1, αH(A1), . . . , αH(AK))

I The agglomerative/decimative property of Dirichlet implies:

(G′(A1), . . . , G′(AK)) | θ ∼ Dirichlet(αH(A1), . . . , αH(AK))

G′ ∼ DP(α, H)

115 / 127

Page 132: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Stick-breaking Construction

I We have:

G ∼ DP(α, H)
G = β1 δθ*1 + (1 − β1) G1
G = β1 δθ*1 + (1 − β1)(β2 δθ*2 + (1 − β2) G2)
...
G = Σ_{k=1}^{∞} πk δθ*k

where

πk = βk ∏_{i=1}^{k−1} (1 − βi)    βk ∼ Beta(1, α)    θ*k ∼ H

[Figure: stick-breaking of the unit interval into weights π(1), π(2), . . . , π(6), . . .]

116 / 127

Page 133: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Outline

Relating Different Representations of Dirichlet Processes

Representations of Hierarchical Dirichlet Processes

117 / 127

Page 134: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Stick-breaking Construction

I We shall assume the following HDP hierarchy:

G0 ∼ DP(γ,H)

Gj |G0 ∼ DP(α,G0) for j = 1, . . . , J

I The stick-breaking construction for the HDP is:

G0 = Σ_{k=1}^{∞} π0k δθ*k    θ*k ∼ H
π0k = β0k ∏_{l=1}^{k−1} (1 − β0l)    β0k ∼ Beta(1, γ)

Gj = Σ_{k=1}^{∞} πjk δθ*k
πjk = βjk ∏_{l=1}^{k−1} (1 − βjl)    βjk ∼ Beta(α π0k, α(1 − Σ_{l=1}^{k} π0l))
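A truncated sampler for this construction (Python/numpy assumed; truncation at K atoms and the numerical guard are illustrative choices):

```python
import numpy as np

def hdp_stick_breaking(J, K=50, gamma=5.0, alpha=5.0, seed=None):
    """Truncated HDP stick-breaking: global weights pi0 and group weights pi_j.

    Atoms theta*_k ~ H (not sampled here) are shared by all of the G_j.
    """
    rng = np.random.default_rng(seed)
    # Global DP: beta_0k ~ Beta(1, gamma), pi_0k = beta_0k * prod_{l<k}(1 - beta_0l).
    b0 = rng.beta(1.0, gamma, size=K)
    b0[-1] = 1.0                                    # absorb the leftover stick mass
    pi0 = b0 * np.concatenate(([1.0], np.cumprod(1.0 - b0[:-1])))
    # Group sticks: beta_jk ~ Beta(alpha*pi_0k, alpha*(1 - sum_{l<=k} pi_0l)).
    pis = []
    for _ in range(J):
        tail = np.clip(1.0 - np.cumsum(pi0), 1e-12, None)   # guard against round-off
        bj = rng.beta(alpha * pi0, alpha * tail)
        bj[-1] = 1.0
        pij = bj * np.concatenate(([1.0], np.cumprod(1.0 - bj[:-1])))
        pis.append(pij)
    return pi0, np.vstack(pis)

pi0, pi = hdp_stick_breaking(J=3, seed=0)
```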

118 / 127

Page 135: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Hierarchical Pólya Urn Scheme

I Let G ∼ DP(α,H).

I We can visualize the Pólya urn scheme as follows:

[Figure: draws θ1, θ2, . . . , θ7 from G, each linked by an arrow to one of the unique atoms θ*1, θ*2, . . . drawn from H.]

where the arrows denote to which θ*k each θi was assigned and

θ1, θ2, . . . ∼ G i.i.d.
θ*1, θ*2, . . . ∼ H i.i.d.

(but θ1, θ2, . . . are not independent of θ*1, θ*2, . . .).

119 / 127

Page 136: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Hierarchical Pólya Urn Scheme

I Let G0 ∼ DP(γ,H) and G1,G2|G0 ∼ DP(α,G0).

I The hierarchical Pólya urn scheme to generate draws from G1, G2:

[Figure: draws θ11, . . . , θ17 from G1 and θ21, . . . , θ27 from G2 are each linked to unique atoms θ*1k and θ*2k of their own urns, which are in turn linked to shared top-level atoms θ*01, θ*02, . . . drawn from the base urn for G0.]

120 / 127

Page 137: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Chinese Restaurant Franchise

I Let G0 ∼ DP(γ,H) and G1,G2|G0 ∼ DP(α,G0).

I The Chinese restaurant franchise describes the clustering of data items in the hierarchy:

[Figure: two restaurants, each with customers 1–7 seated at tables; every table serves a dish labelled A–G chosen from a shared franchise-wide menu, so dishes (atoms) are shared across restaurants.]

121 / 127

Page 138: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

References I

I Adams, R. P., Wallach, H. M., and Ghahramani, Z. (2010). Learning the structure of deep sparse graphical models. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics.

I Bart, E., Porteous, I., Perona, P., and Welling, M. (2008). Unsupervised learning of visual taxonomies. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

I Beal, M. J., Ghahramani, Z., and Rasmussen, C. E. (2002). The infinite hidden Markov model. In Advances in Neural Information Processing Systems, volume 14.

I Bertoin, J. (2006). Random Fragmentation and Coagulation Processes. Cambridge University Press.

I Blackwell, D. and MacQueen, J. B. (1973). Ferguson distributions via Pólya urn schemes. Annals of Statistics, 1:353–355.

I Blei, D. M., Griffiths, T. L., and Jordan, M. I. (2010). The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of the Association for Computing Machines, 57(2):1–30.

I Chu, W., Ghahramani, Z., Krause, R., and Wild, D. L. (2006). Identifying protein complexes in high-throughput protein interaction screens using an infinite latent feature model. In BIOCOMPUTING: Proceedings of the Pacific Symposium.

I Doucet, A., de Freitas, N., and Gordon, N. J. (2001). Sequential Monte Carlo Methods in Practice. Statistics for Engineering and Information Science. New York: Springer-Verlag.

I Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1(2):209–230.

I Finkel, J. R., Grenager, T., and Manning, C. D. (2007). The infinite tree. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

I Fox, E., Sudderth, E., Jordan, M. I., and Willsky, A. (2008). An HDP-HMM for systems with state persistence. In Proceedings of the International Conference on Machine Learning.

122 / 127

Page 139: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

References II

I Fox, E. B., Sudderth, E. B., Jordan, M. I., and Willsky, A. S. (2010). Sharing features among dynamical systems with beta processes. In Neural Information Processing Systems 22. MIT Press.

I Gasthaus, J., Wood, F., and Teh, Y. W. (2010). Lossless compression based on the sequence memoizer. In Data Compression Conference.

I Gelman, A., Carlin, J., Stern, H., and Rubin, D. (1995). Bayesian data analysis. Chapman & Hall, London.

I Ghahramani, Z., Griffiths, T. L., and Sollich, P. (2007). Bayesian nonparametric latent feature models (with discussion and rejoinder). In Bayesian Statistics, volume 8.

I Goldwater, S., Griffiths, T., and Johnson, M. (2006a). Interpolating between types and tokens by estimating power-law generators. In Advances in Neural Information Processing Systems, volume 18.

I Goldwater, S., Griffiths, T. L., and Johnson, M. (2006b). Contextual dependencies in unsupervised word segmentation. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics.

I Görür, D., Jäkel, F., and Rasmussen, C. E. (2006). A choice model with infinitely many latent features. In Proceedings of the International Conference on Machine Learning, volume 23.

I Griffiths, T. L. and Ghahramani, Z. (2006). Infinite latent feature models and the Indian buffet process. In Advances in Neural Information Processing Systems, volume 18.

I Heller, K. A., Teh, Y. W., and Görür, D. (2009). Infinite hierarchical hidden Markov models. In JMLR Workshop and Conference Proceedings: AISTATS 2009, volume 5, pages 224–231.

I Hjort, N. L. (1990). Nonparametric Bayes estimators based on beta processes in models for life history data. Annals of Statistics, 18(3):1259–1294.

123 / 127

Page 140: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

References III

I Ishwaran, H. and James, L. F. (2001). Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96(453):161–173.

I Johnson, M., Griffiths, T. L., and Goldwater, S. (2007). Adaptor grammars: A framework for specifying compositional nonparametric Bayesian models. In Advances in Neural Information Processing Systems, volume 19.

I Jordan, M. I. (2010). Hierarchical models, nested models and completely random measures. In Frontiers of Statistical Decision Making and Bayesian Analysis: In Honor of James O. Berger. New York: Springer.

I Knowles, D. and Ghahramani, Z. (2007). Infinite sparse factor analysis and infinite independent components analysis. In International Conference on Independent Component Analysis and Signal Separation, volume 7 of Lecture Notes in Computer Science. Springer.

I Liang, P., Petrov, S., Jordan, M. I., and Klein, D. (2007). The infinite PCFG using hierarchical Dirichlet processes. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

I Lijoi, A. and Pruenster, I. (2010). Models beyond the Dirichlet process. In Hjort, N., Holmes, C., Müller, P., and Walker, S., editors, Bayesian Nonparametrics. Cambridge University Press.

I MacKay, D. and Peto, L. (1994). A hierarchical Dirichlet language model. Natural Language Engineering.

I Meeds, E., Ghahramani, Z., Neal, R. M., and Roweis, S. T. (2007). Modeling dyadic data with binary latent factors. In Advances in Neural Information Processing Systems, volume 19.

I Neal, R. M. (1993). Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto.

I Neal, R. M. (2003). Density modeling and clustering using Dirichlet diffusion trees. In Bayesian Statistics, volume 7, pages 619–629.

124 / 127

Page 141: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

References IV

I Perman, M., Pitman, J., and Yor, M. (1992). Size-biased sampling of Poisson point processes and excursions. Probability Theory and Related Fields, 92(1):21–39.

I Pitman, J. and Yor, M. (1997). The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. Annals of Probability, 25:855–900.

I Rasmussen, C. E. (2000). The infinite Gaussian mixture model. In Advances in Neural Information Processing Systems, volume 12.

I Rasmussen, C. E. and Ghahramani, Z. (2001). Occam’s razor. In Advances in Neural Information Processing Systems, volume 13.

I Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press.

I Robert, C. P. and Casella, G. (2004). Monte Carlo statistical methods. Springer Verlag.

I Rodríguez, A., Dunson, D. B., and Gelfand, A. E. (2008). The nested Dirichlet process. Journal of the American Statistical Association, 103(483):1131–1154.

I Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica, 4:639–650.

I Sudderth, E. and Jordan, M. I. (2009). Shared segmentation of natural scenes using dependent Pitman-Yor processes. In Advances in Neural Information Processing Systems, volume 21.

I Sudderth, E., Torralba, A., Freeman, W., and Willsky, A. (2008). Describing visual scenes using transformed objects and parts. International Journal of Computer Vision, 77.

I Teh, Y. W. (2006a). A Bayesian interpretation of interpolated Kneser-Ney. Technical Report TRA2/06, School of Computing, National University of Singapore.

125 / 127

Page 142: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

References V

I Teh, Y. W. (2006b). A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 985–992.

I Teh, Y. W., Daume III, H., and Roy, D. M. (2008). Bayesian agglomerative clustering with coalescents. In Advances in Neural Information Processing Systems, volume 20, pages 1473–1480.

I Teh, Y. W. and Görür, D. (2009). Indian buffet processes with power-law behavior. In Advances in Neural Information Processing Systems, volume 22, pages 1838–1846.

I Teh, Y. W., Görür, D., and Ghahramani, Z. (2007). Stick-breaking construction for the Indian buffet process. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 11.

I Teh, Y. W. and Jordan, M. I. (2010). Hierarchical Bayesian nonparametric models with applications. In Hjort, N., Holmes, C., Müller, P., and Walker, S., editors, Bayesian Nonparametrics. Cambridge University Press.

I Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581.

I Thibaux, R. and Jordan, M. I. (2007). Hierarchical beta processes and the Indian buffet process. In Proceedings of the International Workshop on Artificial Intelligence and Statistics, volume 11, pages 564–571.

I Van Gael, J., Teh, Y. W., and Ghahramani, Z. (2009). The infinite factorial hidden Markov model. In Advances in Neural Information Processing Systems, volume 21, pages 1697–1704.

I Wainwright, M. J. and Jordan, M. I. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305.

I Wood, F., Archambeau, C., Gasthaus, J., James, L. F., and Teh, Y. W. (2009). A stochastic memoizer for sequence data. In Proceedings of the International Conference on Machine Learning, volume 26, pages 1129–1136.

126 / 127

Page 143: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

References VI

I Wood, F., Griffiths, T. L., and Ghahramani, Z. (2006). A non-parametric Bayesian method for inferring hidden causes. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, volume 22.

127 / 127