Transcript of "Bayesian Nonparametrics" - University College London (ywteh/teaching/npbayes/kaist2010.pdf)

Page 1:

Bayesian Nonparametrics

Yee Whye Teh

Gatsby Computational Neuroscience Unit, University College London

Acknowledgements: Charles Blundell, Lloyd Elliott, Jan Gasthaus, Vinayak Rao, Dilan Görür, Andriy Mnih, Frank Wood, Cedric Archambeau, Hal Daume III, Zoubin Ghahramani, Katherine Heller, Lancelot James, Michael I. Jordan, Daniel Roy, Jurgen Van Gael, Max Welling

December, 2010 / KAIST, South Korea

1 / 127

Page 2:

Outline

Introduction

Regression and Gaussian Processes

Density Estimation, Clustering and Dirichlet Processes

Latent Variable Models and Indian Buffet and Beta Processes

Topic Modelling and Hierarchical Processes

Hierarchical Structure Discovery and Nested Processes

Time Series Models

Modelling Power-laws with Pitman-Yor Processes

Summary

2 / 127

Page 3:

Outline

(The outline from the previous slide is repeated; the next section is Introduction.)

3 / 127

Page 4:

Probabilistic Machine Learning

I Probabilistic model of data $\{x_i\}_{i=1}^n$ given parameters $\theta$:

  $P(x_1, \ldots, x_n, y_1, \ldots, y_n \mid \theta)$

  where $y_i$ is a latent variable associated with $x_i$.

I Often thought of as generative models of data.

I Inference, of latent variables given observations:

  $P(y_1, \ldots, y_n \mid \theta, x_1, \ldots, x_n) = \dfrac{P(x_1, \ldots, x_n, y_1, \ldots, y_n \mid \theta)}{P(x_1, \ldots, x_n \mid \theta)}$

I Learning, typically by maximum likelihood:

  $\theta^{ML} = \operatorname{argmax}_\theta\, P(x_1, \ldots, x_n \mid \theta)$

4 / 127

Page 5:

Probabilistic Machine Learning

I Prediction:

  $P(x_{n+1} \mid \theta^{ML})$
  $P(y_{n+1} \mid x_{n+1}, \theta^{ML})$
  $P(y_{n+1} \mid x_1, \ldots, x_{n+1}, \theta^{ML})$

I Classification:

  $\operatorname{argmax}_C\, P(x_{n+1} \mid \theta^{ML}_C)$

I Visualization, interpretation, summarization.

5 / 127

Page 6:

Bayesian Machine Learning

I Probabilistic model of data $\{x_i\}_{i=1}^n$ given parameters $\theta$:

  $P(x_1, \ldots, x_n, y_1, \ldots, y_n \mid \theta)$

I Prior distribution:

  $P(\theta)$

I Posterior distribution:

  $P(\theta, y_1, \ldots, y_n \mid x_1, \ldots, x_n) = \dfrac{P(\theta)\, P(x_1, \ldots, x_n, y_1, \ldots, y_n \mid \theta)}{P(x_1, \ldots, x_n)}$

I Prediction:

  $P(x_{n+1} \mid x_1, \ldots, x_n) = \int P(x_{n+1} \mid \theta)\, P(\theta \mid x_1, \ldots, x_n)\, d\theta$

I (Easier said than done...)

6 / 127

Page 7:

Computing Posterior Distributions

I Posterior distribution:

  $P(\theta, y_1, \ldots, y_n \mid x_1, \ldots, x_n) = \dfrac{P(\theta)\, P(x_1, \ldots, x_n, y_1, \ldots, y_n \mid \theta)}{P(x_1, \ldots, x_n)}$

I High-dimensional, no closed-form, multi-modal...

I Variational approximations [Wainwright and Jordan 2008]: simple parametrized form, "fit" to the true posterior. Includes mean field approximation, variational Bayes, belief propagation, expectation propagation.

I Monte Carlo methods, including Markov chain Monte Carlo [Neal 1993, Robert and Casella 2004] and sequential Monte Carlo [Doucet et al. 2001]: construct generators for random samples from the posterior.

7 / 127

Page 8:

Bayesian Model Selection

I Model selection is often necessary to prevent overfitting and underfitting.

I Bayesian approach to model selection uses the marginal likelihood (a naive Monte Carlo sketch follows below):

  $p(x \mid M_k) = \int p(x \mid \theta_k, M_k)\, p(\theta_k \mid M_k)\, d\theta_k$

  Model selection: $M^* = \operatorname{argmax}_{M_k}\, p(x \mid M_k)$

  Model averaging: $p(M_k, \theta_k \mid x) = \dfrac{p(M_k)\, p(\theta_k \mid M_k)\, p(x \mid \theta_k, M_k)}{\sum_{k'} p(M_{k'})\, p(x \mid M_{k'})}$

I Other approaches to model selection: cross validation, regularization, sparse models...
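To make the marginal likelihood concrete, here is a naive Monte Carlo sketch (not from the slides): it simply averages the likelihood over draws from the prior. The toy Gaussian model, its prior and the sample size are illustrative assumptions, and in practice more sophisticated estimators are usually needed.

```python
import numpy as np

def marginal_likelihood_mc(x, log_likelihood, prior_sampler, n_samples=10000,
                           rng=np.random.default_rng(0)):
    """Naive MC estimate of p(x|M) = E_{theta ~ p(theta|M)}[p(x|theta, M)]."""
    thetas = prior_sampler(n_samples, rng)
    log_liks = np.array([log_likelihood(x, th) for th in thetas])
    m = log_liks.max()                              # log-mean-exp for stability
    return m + np.log(np.mean(np.exp(log_liks - m)))

# Toy model M: x_i ~ N(theta, 1) with prior theta ~ N(0, 10).
x = np.array([0.5, 1.2, -0.3, 0.8])
log_lik = lambda x, th: -0.5 * np.sum((x - th) ** 2) - 0.5 * len(x) * np.log(2 * np.pi)
prior = lambda n, rng: rng.normal(0.0, np.sqrt(10.0), size=n)
print(marginal_likelihood_mc(x, log_lik, prior))
```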

8 / 127

Page 9:

Side-Stepping Model Selection

I Strategies for model selection often entail significant complexities.

I But reasonable and proper Bayesian methods should not overfit anyway [Rasmussen and Ghahramani 2001].

I Idea: use a large model, and be Bayesian so it will not overfit.

I Bayesian nonparametric idea: using a very large Bayesian model avoids both overfitting and underfitting.

9 / 127

Page 10:

Direct Modelling of Very Large Spaces

I Regression: learn about functions from an input to an output space.

I Density estimation: learn about densities over $\mathbb{R}^d$.

I Clustering: learn about partitions of a large space.

I Objects of interest are often infinite dimensional. Model these directly:

  I Using models that can learn any such object;
  I Using models that can approximate any such object to arbitrary accuracy.

I Many theoretical and practical issues to resolve:

  I Convergence and consistency.
  I Practical inference algorithms.

10 / 127

Page 11:

Outline

(The outline is repeated; the next section is Regression and Gaussian Processes.)

11 / 127

Page 12:

Regression and Classification

I Learn a function $f^* : \mathcal{X} \to \mathcal{Y}$ from training data $\{x_i, y_i\}_{i=1}^n$.

  [Figure: training data points.]

I Regression: $y_i = f^*(x_i) + \epsilon_i$.
I Classification: e.g. $P(y_i = 1 \mid f^*(x_i)) = \Phi(f^*(x_i))$.

12 / 127

Page 13:

Parametric Regression with Basis Functions

I Assume a set of basis functions $\phi_1, \ldots, \phi_K$ and parametrize a function:

  $f(x; w) = \sum_{k=1}^K w_k \phi_k(x)$

  Parameters $w = \{w_1, \ldots, w_K\}$.

I Find optimal parameters (a least-squares code sketch follows below):

  $\operatorname{argmin}_w \sum_{i=1}^n \big\| y_i - f(x_i; w) \big\|^2 = \operatorname{argmin}_w \sum_{i=1}^n \Big\| y_i - \sum_{k=1}^K w_k \phi_k(x_i) \Big\|^2$

I What family of basis functions to use?

I How many?

I What if the true function cannot be parametrized as such?
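A minimal sketch of the least-squares fit above, assuming a polynomial basis and toy data; both are illustrative choices, not from the slides.

```python
import numpy as np

def fit_basis_regression(x, y, basis_fns):
    """Least-squares fit of f(x; w) = sum_k w_k phi_k(x) to training data (x, y)."""
    Phi = np.column_stack([phi(x) for phi in basis_fns])   # n x K design matrix
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

# Toy usage with polynomial basis functions phi_k(x) = x^k.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = np.sin(2 * x) + 0.1 * rng.standard_normal(30)
basis = [lambda x, k=k: x ** k for k in range(4)]          # K = 4 basis functions
w = fit_basis_regression(x, y, basis)
```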

13 / 127

Page 14:

Towards Nonparametric Regression

I What we are interested in is the output values of the function,

  $f(x_1), f(x_2), \ldots, f(x_n), f(x_{n+1})$

  Why not model these directly?

  [Figure: training data points.]

I In regression, each $f(x_i)$ is continuous and real-valued, so a natural choice is to model $f(x_i)$ using a Gaussian.

I Assume that the function $f$ is smooth. If two inputs $x_i$ and $x_j$ are close by, then $f(x_i)$ and $f(x_j)$ should be close by as well. This translates into correlations among the outputs $f(x_i)$.

14 / 127

Page 15:

Towards Nonparametric Regression

I We can use a multi-dimensional Gaussian to model correlated function outputs:

  $\begin{bmatrix} f(x_1) \\ \vdots \\ f(x_{n+1}) \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} 0 \\ \vdots \\ 0 \end{bmatrix}, \begin{bmatrix} C_{1,1} & \cdots & C_{1,n+1} \\ \vdots & \ddots & \vdots \\ C_{n+1,1} & \cdots & C_{n+1,n+1} \end{bmatrix} \right)$

  where the mean is zero, and $C = [C_{ij}]$ is the covariance matrix.

I Each observed output $y_i$ can be modelled as

  $y_i \mid f(x_i) \sim \mathcal{N}(f(x_i), \sigma^2)$

I Learning: compute the posterior distribution

  $p(f(x_1), \ldots, f(x_n) \mid y_1, \ldots, y_n)$

  Straightforward since the whole model is Gaussian.

I Prediction: compute

  $p(f(x_{n+1}) \mid y_1, \ldots, y_n)$

15 / 127

Page 16:

Gaussian Processes

I A Gaussian process (GP) is a random function $f : \mathcal{X} \to \mathbb{R}$ such that for any finite set of input points $x_1, \ldots, x_n$,

  $\begin{bmatrix} f(x_1) \\ \vdots \\ f(x_n) \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} m(x_1) \\ \vdots \\ m(x_n) \end{bmatrix}, \begin{bmatrix} c(x_1, x_1) & \cdots & c(x_1, x_n) \\ \vdots & \ddots & \vdots \\ c(x_n, x_1) & \cdots & c(x_n, x_n) \end{bmatrix} \right)$

  where the parameters are the mean function $m(x)$ and covariance kernel $c(x, y)$.

I Difference from before: the GP defines a distribution over $f(x)$ for every input value $x$ simultaneously. The prior is defined even before observing the inputs $x_1, \ldots, x_n$.

I Such a random function $f$ is known as a stochastic process. It is a collection of random variables $\{f(x)\}_{x \in \mathcal{X}}$.

I Demo: GPgenerate.

[Rasmussen and Williams 2006]

16 / 127

Page 17:

Posterior and Predictive Distributions

I How do we compute the posterior and predictive distributions?

I Training set (x1, y1), (x2, y2), . . . , (xn, yn) and test input xn+1.

I Out of the (uncountably infinitely) many random variables $\{f(x)\}_{x \in \mathcal{X}}$ making up the GP, only $n + 1$ have to do with the data:

  $f(x_1), f(x_2), \ldots, f(x_{n+1})$

I Training data gives observations $f(x_1) = y_1, \ldots, f(x_n) = y_n$. The predictive distribution of $f(x_{n+1})$ is simply

  $p(f(x_{n+1}) \mid f(x_1) = y_1, \ldots, f(x_n) = y_n)$

  which is easy to compute since $f(x_1), \ldots, f(x_{n+1})$ is jointly Gaussian (a short code sketch follows below).
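The sketch referred to above: a minimal numpy implementation of the GP posterior predictive, assuming a squared-exponential covariance kernel and Gaussian observation noise as on the earlier slide. The kernel, its hyperparameters and the toy data are illustrative choices, not part of the slides.

```python
import numpy as np

def sq_exp_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance c(x, y) between two sets of 1-D inputs."""
    d = A[:, None] - B[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_predict(x_train, y_train, x_test, noise_var=0.1):
    """Posterior mean and covariance of f(x_test) given noisy observations y_train."""
    K = sq_exp_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    K_star = sq_exp_kernel(x_test, x_train)
    K_ss = sq_exp_kernel(x_test, x_test)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))   # K^{-1} y
    mean = K_star @ alpha
    v = np.linalg.solve(L, K_star.T)
    cov = K_ss - v.T @ v
    return mean, cov

# Toy usage: noisy observations of a sine function.
rng = np.random.default_rng(0)
x_train = np.linspace(-3, 3, 10)
y_train = np.sin(x_train) + 0.1 * rng.standard_normal(10)
x_test = np.linspace(-4, 4, 50)
mean, cov = gp_predict(x_train, y_train, x_test)
```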

17 / 127

Page 18:

Consistency and Existence

I The definition of Gaussian processes only gives the finite-dimensional marginal distributions of the stochastic process.

I Fortunately these marginal distributions are consistent.

  I For every finite set $\mathbf{x} \subset \mathcal{X}$ we have a distinct distribution $p_{\mathbf{x}}([f(x)]_{x \in \mathbf{x}})$. These distributions are said to be consistent if

    $p_{\mathbf{x}}([f(x)]_{x \in \mathbf{x}}) = \int p_{\mathbf{x} \cup \mathbf{y}}([f(x)]_{x \in \mathbf{x} \cup \mathbf{y}})\, d[f(x)]_{x \in \mathbf{y}}$

    for disjoint and finite $\mathbf{x}, \mathbf{y} \subset \mathcal{X}$.
  I The marginal distributions for the GP are consistent because Gaussians are closed under marginalization.

I The Kolmogorov Consistency Theorem guarantees existence of GPs, i.e. the whole stochastic process $\{f(x)\}_{x \in \mathcal{X}}$.

18 / 127

Page 19:

Outline

(The outline is repeated; the next section is Density Estimation, Clustering and Dirichlet Processes.)

19 / 127

Page 20:

Density Estimation with Mixture Models

I Unsupervised learning of a density f ∗(x) from training samples {xi}.

  [Figure: training samples along the real line.]

I Can use a mixture model for a flexible family of densities (a small code sketch follows below), e.g.

  $f(x) = \sum_{k=1}^K \pi_k\, \mathcal{N}(x; \mu_k, \Sigma_k)$

I How many mixture components to use?

I What family of mixture components?

I Do we believe that the true density is a mixture of K components?
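The sketch referred to above, restricted to 1-D components with standard deviations in place of full covariance matrices; the weights and component parameters are illustrative.

```python
import numpy as np
from scipy.stats import norm

def mixture_density(x, weights, means, stds):
    """Evaluate a 1-D Gaussian mixture f(x) = sum_k pi_k N(x; mu_k, sigma_k^2)."""
    return sum(w * norm.pdf(x, loc=m, scale=s)
               for w, m, s in zip(weights, means, stds))

# Illustrative K = 3 mixture; the weights must sum to one.
x = np.linspace(-15, 15, 500)
f = mixture_density(x, weights=[0.5, 0.3, 0.2], means=[-5.0, 0.0, 6.0], stds=[1.5, 1.0, 2.0])
```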

20 / 127

Page 21:

Bayesian Mixture Models

I Let's be Bayesian about mixture models, and place priors over our parameters (and compute posteriors).

I First, introduce conjugate priors for the parameters:

  $\pi \sim \text{Dirichlet}(\tfrac{\alpha}{K}, \ldots, \tfrac{\alpha}{K})$
  $\mu_k, \Sigma_k = \theta^*_k \sim H = \mathcal{N}\text{-IW}(0, s, d, \Phi)$

I Second, introduce an indicator variable $z_i$ specifying which component $x_i$ belongs to:

  $z_i \mid \pi \sim \text{Multinomial}(\pi)$
  $x_i \mid z_i = k, \mu, \Sigma \sim \mathcal{N}(\mu_k, \Sigma_k)$

  [Graphical model: $\alpha \to \pi \to z_i \to x_i \leftarrow \theta^*_k \leftarrow H$, with plates $i = 1, \ldots, n$ and $k = 1, \ldots, K$.]

21 / 127

Page 22:

Gibbs Sampling for Bayesian Mixture Models

I All conditional distributions are simple to compute:

  $p(z_i = k \mid \text{others}) \propto \pi_k\, \mathcal{N}(x_i; \mu_k, \Sigma_k)$
  $\pi \mid z \sim \text{Dirichlet}(\tfrac{\alpha}{K} + n_1(z), \ldots, \tfrac{\alpha}{K} + n_K(z))$
  $\mu_k, \Sigma_k \mid \text{others} \sim \mathcal{N}\text{-IW}(\nu', s', d', \Phi')$

I Not as efficient as collapsed Gibbs sampling, which integrates out $\pi, \mu, \Sigma$:

  $p(z_i = k \mid \text{others}) \propto \dfrac{\tfrac{\alpha}{K} + n_k(z_{-i})}{\alpha + n - 1} \times p(x_i \mid \{x_{i'} : i' \neq i,\ z_{i'} = k\})$

I Demo: fm_demointeractive.

  [Graphical model: as on the previous slide.]

22 / 127

Page 23:

Infinite Bayesian Mixture Models

I We will take $K \to \infty$.

I Imagine a very large value of $K$.

I There are at most $n < K$ occupied components, so most components are empty. We can lump these empty components together (a code sketch follows below):

  Occupied clusters: $p(z_i = k \mid \text{others}) \propto \dfrac{\tfrac{\alpha}{K} + n_k(z_{-i})}{n - 1 + \alpha}\, p(x_i \mid x^{-i}_k)$

  Empty clusters: $p(z_i = k_{\text{empty}} \mid z_{-i}) \propto \dfrac{\alpha \tfrac{K - K^*}{K}}{n - 1 + \alpha}\, p(x_i \mid \{\})$

I Demo: dpm_demointeractive.

[Rasmussen 2000]

  [Graphical model: as before, with $k = 1, \ldots, K$.]
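The sketch referred to above: one sweep of the $K \to \infty$ collapsed Gibbs step. It assumes a 1-D Gaussian likelihood with known variance and a conjugate Normal prior on each cluster mean, rather than the Normal-inverse-Wishart prior on the slides, so that the predictive densities have a simple closed form; all hyperparameters and data are illustrative.

```python
import numpy as np

def predictive_logpdf(x, cluster_xs, sigma2=1.0, mu0=0.0, tau0_2=10.0):
    """Posterior predictive density of x for a Gaussian likelihood with known variance
    sigma2 and a conjugate Normal(mu0, tau0_2) prior on the cluster mean."""
    m = len(cluster_xs)
    prec = 1.0 / tau0_2 + m / sigma2
    mu_n = (mu0 / tau0_2 + np.sum(cluster_xs) / sigma2) / prec
    var = 1.0 / prec + sigma2
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (x - mu_n) ** 2 / var

def gibbs_sweep(x, z, alpha=1.0, rng=np.random.default_rng(0)):
    """One sweep of collapsed Gibbs sampling over cluster assignments z (K -> infinity limit)."""
    n = len(x)
    for i in range(n):
        z[i] = -1                                      # remove x_i from its cluster
        labels = sorted(k for k in set(z) if k >= 0)
        logp = []
        for k in labels:                               # occupied clusters: weight n_k
            members = x[np.array(z) == k]
            logp.append(np.log(len(members)) + predictive_logpdf(x[i], members))
        logp.append(np.log(alpha) + predictive_logpdf(x[i], np.array([])))  # new cluster
        logp = np.array(logp)
        p = np.exp(logp - logp.max()); p /= p.sum()
        choice = rng.choice(len(p), p=p)
        z[i] = labels[choice] if choice < len(labels) else (max(labels, default=-1) + 1)
    return z

# Toy usage: two well-separated Gaussian clusters.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-4, 1, 30), rng.normal(3, 1, 30)])
z = [0] * len(x)
for _ in range(20):
    z = gibbs_sweep(x, z, alpha=1.0)
```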

23 / 127

Page 24:

(Same slide as above, with the plate in the graphical model now ranging over $k = 1, \ldots, \infty$ instead of $k = 1, \ldots, K$.)

23 / 127

Page 25:

Density Estimation

  [Figure: posterior density estimates on a 1-D data set.]

  $F(\cdot \mid \mu, \Sigma)$ is Gaussian with mean $\mu$, covariance $\Sigma$. $H(\mu, \Sigma)$ is the Gaussian-inverse-Wishart conjugate prior. Red: mean density. Blue: median density. Grey: 5-95 quantile. Others: posterior samples. Black: data points.

24 / 127

Page 26:

Density Estimation

  [Figure: posterior density estimates on another data set; same model and legend as on the previous slide.]

24 / 127

Page 27:

Infinite Bayesian Mixture Models

I The naive infinite limit of finite mixture models does not actually make mathematical sense.

I Other, better ways of making this infinite limit precise:

  I Look at the prior clustering structure induced by the Dirichlet prior over mixing proportions (the Chinese restaurant process).
  I Re-order components so that those with larger mixing proportions tend to occur first, before taking the infinite limit (the stick-breaking construction).

I Both are different views of the Dirichlet process (DP).

I The K →∞ Gibbs sampler is for DP mixture models.

25 / 127

Page 28:

A Tiny Bit of Measure Theoretic Probability Theory

I A σ-algebra $\Sigma$ is a family of subsets of a set $\Theta$ such that

  I $\Sigma$ is not empty;
  I If $A \in \Sigma$ then $\Theta \setminus A \in \Sigma$;
  I If $A_1, A_2, \ldots \in \Sigma$ then $\cup_{i=1}^\infty A_i \in \Sigma$.

I $(\Theta, \Sigma)$ is a measurable space and the $A \in \Sigma$ are the measurable sets.

I A measure $\mu$ over $(\Theta, \Sigma)$ is a function $\mu : \Sigma \to [0, \infty]$ such that

  I $\mu(\emptyset) = 0$;
  I If $A_1, A_2, \ldots \in \Sigma$ are disjoint then $\mu(\cup_{i=1}^\infty A_i) = \sum_{i=1}^\infty \mu(A_i)$.

I Everything we consider here will be measurable.
I A probability measure is one where $\mu(\Theta) = 1$.

I Given two measurable spaces $(\Theta, \Sigma)$ and $(\Delta, \Phi)$, a function $f : \Theta \to \Delta$ is measurable if $f^{-1}(A) \in \Sigma$ for every $A \in \Phi$.

26 / 127

Page 29:

A Tiny Bit of Measure Theoretic Probability Theory

I If $p$ is a probability measure on $(\Theta, \Sigma)$, a random variable $X$ taking values in $\Delta$ is simply a measurable function $X : \Theta \to \Delta$.

  I Think of the probability space $(\Theta, \Sigma, p)$ as a black-box random number generator, and $X$ as a function taking random samples in $\Theta$ and producing random samples in $\Delta$.
  I The probability of an event $A \in \Phi$ is $p(X \in A) = p(X^{-1}(A))$.

I A stochastic process is simply a collection of random variables $\{X_i\}_{i \in I}$ over the same measurable space $(\Theta, \Sigma)$, where $I$ is an index set.

  I Can think of a stochastic process as a random function $X(i)$.

I Stochastic processes form the core of many Bayesian nonparametric models.

  I Gaussian processes, Poisson processes, Dirichlet processes, beta processes, completely random measures...

27 / 127

Page 30:

Dirichlet Distributions

I A Dirichlet distribution is a distribution over the $K$-dimensional probability simplex:

  $\Delta_K = \{(\pi_1, \ldots, \pi_K) : \pi_k \geq 0,\ \sum_k \pi_k = 1\}$

I We say $(\pi_1, \ldots, \pi_K)$ is Dirichlet distributed,

  $(\pi_1, \ldots, \pi_K) \sim \text{Dirichlet}(\lambda_1, \ldots, \lambda_K)$

  with parameters $(\lambda_1, \ldots, \lambda_K)$, if

  $p(\pi_1, \ldots, \pi_K) = \dfrac{\Gamma(\sum_k \lambda_k)}{\prod_k \Gamma(\lambda_k)} \prod_{k=1}^K \pi_k^{\lambda_k - 1}$

I Equivalent to normalizing a set of independent gamma variables (a code sketch follows below):

  $(\pi_1, \ldots, \pi_K) = \dfrac{1}{\sum_k \gamma_k}(\gamma_1, \ldots, \gamma_K)$, where $\gamma_k \sim \text{Gamma}(\lambda_k)$ for $k = 1, \ldots, K$.
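The sketch referred to above: sampling a Dirichlet vector by normalizing independent gamma draws (the parameter vector is illustrative).

```python
import numpy as np

def dirichlet_via_gammas(lam, rng=np.random.default_rng(0)):
    """Sample (pi_1,...,pi_K) ~ Dirichlet(lam) by normalizing independent Gamma(lam_k, 1) draws."""
    gammas = rng.gamma(shape=np.asarray(lam, dtype=float), scale=1.0)
    return gammas / gammas.sum()

pi = dirichlet_via_gammas([1.0, 2.0, 5.0])
```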

28 / 127

Page 31:

Dirichlet Distributions

29 / 127

Page 32:

Dirichlet Processes

I A Dirichlet Process (DP) is a random probability measure $G$ over $(\Theta, \Sigma)$ such that for any finite measurable partition $A_1 \cup \ldots \cup A_K = \Theta$,

  $(G(A_1), \ldots, G(A_K)) \sim \text{Dirichlet}(\lambda(A_1), \ldots, \lambda(A_K))$

  where $\lambda$ is a base measure.

  [Figure: a partition of $\Theta$ into regions $A_1, \ldots, A_6$.]

I The above family of distributions is consistent (next), and the Kolmogorov Consistency Theorem can be applied to show existence (but there are technical conditions restricting the generality of the definition).

[Ferguson 1973, Blackwell and MacQueen 1973]

30 / 127

Page 33:

Consistency of Dirichlet Marginals

I If we have two partitions $(A_1, \ldots, A_K)$ and $(B_1, \ldots, B_J)$ of $\Theta$, how do we see if the two Dirichlets are consistent?

I Because Dirichlet variables are normalized gamma variables and sums of gammas are gammas, if $(I_1, \ldots, I_j)$ is a partition of $(1, \ldots, K)$,

  $\left( \sum_{i \in I_1} \pi_i, \ldots, \sum_{i \in I_j} \pi_i \right) \sim \text{Dirichlet}\left( \sum_{i \in I_1} \lambda_i, \ldots, \sum_{i \in I_j} \lambda_i \right)$

31 / 127

Page 34:

Consistency of Dirichlet Marginals

I Form the common refinement $(C_1, \ldots, C_L)$ where each $C_\ell$ is the intersection of some $A_k$ with some $B_j$. Then:

  By definition, $(G(C_1), \ldots, G(C_L)) \sim \text{Dirichlet}(\lambda(C_1), \ldots, \lambda(C_L))$

  $(G(A_1), \ldots, G(A_K)) = \left( \sum_{C_\ell \subset A_1} G(C_\ell), \ldots, \sum_{C_\ell \subset A_K} G(C_\ell) \right) \sim \text{Dirichlet}(\lambda(A_1), \ldots, \lambda(A_K))$

  Similarly, $(G(B_1), \ldots, G(B_J)) \sim \text{Dirichlet}(\lambda(B_1), \ldots, \lambda(B_J))$

  so the distributions of $(G(A_1), \ldots, G(A_K))$ and $(G(B_1), \ldots, G(B_J))$ are consistent.

I Demonstration: DPgenerate.

32 / 127

Page 35:

Parameters of Dirichlet Processes

I Usually we split the base measure $\lambda$ into two parameters, $\lambda = \alpha H$:

  I Base distribution $H$, which is like the mean of the DP.
  I Strength parameter $\alpha$, which is like an inverse-variance of the DP.

I We write:

  $G \sim \text{DP}(\alpha, H)$

  if for any partition $(A_1, \ldots, A_K)$ of $\Theta$:

  $(G(A_1), \ldots, G(A_K)) \sim \text{Dirichlet}(\alpha H(A_1), \ldots, \alpha H(A_K))$

I The first and second moments of the DP:

  Expectation: $\mathbb{E}[G(A)] = H(A)$
  Variance: $\mathbb{V}[G(A)] = \dfrac{H(A)(1 - H(A))}{\alpha + 1}$

  where $A$ is any measurable subset of $\Theta$.

33 / 127

Page 36:

Representations of Dirichlet Processes

I Draws from Dirichlet processes will always place all their mass on a countable set of points:

  $G = \sum_{k=1}^\infty \pi_k \delta_{\theta^*_k}$

  where $\sum_k \pi_k = 1$ and $\theta^*_k \in \Theta$.

I What is the joint distribution over $\pi_1, \pi_2, \ldots$ and $\theta^*_1, \theta^*_2, \ldots$?

I Since $G$ is a (random) probability measure over $\Theta$, we can treat it as a distribution and draw samples from it. Let

  $\theta_1, \theta_2, \ldots \sim G$

  be random variables with distribution $G$.

I Can we describe $G$ by describing its effect on $\theta_1, \theta_2, \ldots$?
I What is the marginal distribution of $\theta_1, \theta_2, \ldots$ with $G$ integrated out?

34 / 127

Page 37:

Stick-breaking Construction

  $G = \sum_{k=1}^\infty \pi_k \delta_{\theta^*_k}$

I There is a simple construction giving the joint distribution of $\pi_1, \pi_2, \ldots$ and $\theta^*_1, \theta^*_2, \ldots$ called the stick-breaking construction (a code sketch follows below):

  $\theta^*_k \sim H$
  $v_k \sim \text{Beta}(1, \alpha)$
  $\pi_k = v_k \prod_{i=1}^{k-1} (1 - v_i)$

  [Figure: a unit-length stick broken into pieces $\pi_{(1)}, \pi_{(2)}, \ldots, \pi_{(6)}, \ldots$]

I Also known as the GEM distribution; write $\pi \sim \text{GEM}(\alpha)$.
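The sketch referred to above: a truncated draw from the stick-breaking construction, assuming a standard normal base distribution $H$ and a fixed truncation level (both illustrative).

```python
import numpy as np

def dp_stick_breaking(alpha, base_sampler, truncation=1000, rng=np.random.default_rng(0)):
    """Approximate (truncated) draw G = sum_k pi_k delta_{theta_k} from DP(alpha, H)."""
    v = rng.beta(1.0, alpha, size=truncation)
    pi = v * np.cumprod(np.concatenate(([1.0], 1.0 - v[:-1])))   # pi_k = v_k prod_{i<k}(1-v_i)
    atoms = base_sampler(truncation, rng)
    return pi, atoms

# Usage with H = N(0, 1) as the base distribution.
pi, atoms = dp_stick_breaking(alpha=5.0, base_sampler=lambda n, rng: rng.standard_normal(n))
```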

[Sethuraman 1994]

35 / 127

Page 38:

Posterior of Dirichlet Processes

I Since $G$ is a probability measure, we can draw samples from it:

  $G \sim \text{DP}(\alpha, H)$
  $\theta_1, \ldots, \theta_n \mid G \sim G$

  What is the posterior of $G$ given observations of $\theta_1, \ldots, \theta_n$?

I The usual Dirichlet-multinomial conjugacy carries over to the nonparametric DP as well:

  $G \mid \theta_1, \ldots, \theta_n \sim \text{DP}\left( \alpha + n,\ \dfrac{\alpha H + \sum_{i=1}^n \delta_{\theta_i}}{\alpha + n} \right)$

36 / 127

Page 39:

Pólya Urn Scheme

  $G \sim \text{DP}(\alpha, H)$
  $\theta_1, \ldots, \theta_n \mid G \sim G$

I The marginal distribution of $\theta_1, \theta_2, \ldots$ has a simple generative process called the Pólya urn scheme (aka the Blackwell-MacQueen urn scheme):

  $\theta_n \mid \theta_{1:n-1} \sim \dfrac{\alpha H + \sum_{i=1}^{n-1} \delta_{\theta_i}}{\alpha + n - 1}$

I Picking balls of different colors from an urn:

  I Start with no balls in the urn.
  I With probability $\propto \alpha$, draw $\theta_n \sim H$, and add a ball of color $\theta_n$ into the urn.
  I With probability $\propto n - 1$, pick a ball at random from the urn, record $\theta_n$ to be its color, and return two balls of color $\theta_n$ into the urn.

[Blackwell and MacQueen 1973]

37 / 127

Page 40:

Chinese Restaurant Process

I $\theta_1, \ldots, \theta_n$ take on $K \leq n$ distinct values, say $\theta^*_1, \ldots, \theta^*_K$.

I This defines a partition of $(1, \ldots, n)$ into $K$ clusters, such that if $i$ is in cluster $k$, then $\theta_i = \theta^*_k$.

I The distribution over partitions is a Chinese restaurant process (CRP).

I Generating from the CRP (a code sketch follows below):

  I First customer sits at the first table.
  I Customer $n$ sits at:
    I Table $k$ with probability $\frac{n_k}{\alpha + n - 1}$, where $n_k$ is the number of customers at table $k$.
    I A new table $K + 1$ with probability $\frac{\alpha}{\alpha + n - 1}$.
  I Customers $\Leftrightarrow$ integers, tables $\Leftrightarrow$ clusters.

  [Figure: customers 1-9 seated around tables in a Chinese restaurant.]
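The sketch referred to above: a direct simulation of the CRP seating process ($\alpha$ and $n$ are illustrative).

```python
import numpy as np

def crp_partition(n, alpha, rng=np.random.default_rng(0)):
    """Simulate table assignments for n customers under a CRP with concentration alpha."""
    assignments = [0]                    # first customer sits at table 0
    counts = [1]                         # customers per table
    for i in range(1, n):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()             # existing tables prop. to n_k, new table prop. to alpha
        table = rng.choice(len(probs), p=probs)
        if table == len(counts):
            counts.append(1)             # open a new table
        else:
            counts[table] += 1
        assignments.append(table)
    return assignments, counts

# Rich-gets-richer: the number of occupied tables grows like O(alpha * log n).
assignments, counts = crp_partition(n=10000, alpha=30.0)
print(len(counts))
```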

38 / 127

Page 41:

Chinese Restaurant Process

  [Figure: number of occupied tables versus number of customers (up to 10,000), for $\alpha = 30$, $d = 0$.]

I The CRP exhibits the clustering property of the DP.
I The rich-gets-richer effect implies a small number of large clusters.
I The expected number of clusters is $K = O(\alpha \log n)$.

39 / 127

Page 42:

Clustering

I To partition a heterogeneous data set into distinct, homogeneous clusters.

I The CRP is a canonical nonparametric prior over partitions that can be used as part of a Bayesian model for clustering.

I There are other priors over partitions ([Lijoi and Pruenster 2010]).

40 / 127

Page 43:

Inferring Discrete Latent Structures

I DPs have also found uses in applications where the aim is to discover latent objects, and where the number of objects is unknown or unbounded.

  I Nonparametric probabilistic context free grammars.
  I Visual scene analysis.
  I Infinite hidden Markov models/trees.
  I Genetic ancestry inference.
  I ...

I In many such applications it is important to be able to model the same set of objects in different contexts.

I This can be tackled using hierarchical Dirichlet processes.

[Teh et al. 2006, Teh and Jordan 2010]

41 / 127

Page 44:

Exchangeability

I Instead of deriving the Pólya urn scheme by marginalizing out a DP, consider starting directly from the conditional distributions:

  $\theta_n \mid \theta_{1:n-1} \sim \dfrac{\alpha H + \sum_{i=1}^{n-1} \delta_{\theta_i}}{\alpha + n - 1}$

I For any $n$, the joint distribution of $\theta_1, \ldots, \theta_n$ is:

  $p(\theta_1, \ldots, \theta_n) = \dfrac{\alpha^K \prod_{k=1}^K h(\theta^*_k)\,(m_{nk} - 1)!}{\prod_{i=1}^n (i - 1 + \alpha)}$

  where $h(\theta)$ is the density of $\theta$ under $H$, $\theta^*_1, \ldots, \theta^*_K$ are the unique values, and $\theta^*_k$ occurred $m_{nk}$ times among $\theta_1, \ldots, \theta_n$.

I The joint distribution is exchangeable with respect to permutations of $\theta_1, \ldots, \theta_n$.

I De Finetti's Theorem says that there must be a random probability measure $G$ making $\theta_1, \theta_2, \ldots$ iid. This is the DP.

42 / 127

Page 45:

De Finetti’s Theorem

Let $\theta_1, \theta_2, \ldots$ be an infinite sequence of random variables with joint distribution $p$. If for all $n \geq 1$ and all permutations $\sigma \in \Sigma_n$ on $n$ objects,

  $p(\theta_1, \ldots, \theta_n) = p(\theta_{\sigma(1)}, \ldots, \theta_{\sigma(n)})$

that is, the sequence is infinitely exchangeable, then there exists a (unique) latent random parameter $G$ such that:

  $p(\theta_1, \ldots, \theta_n) = \int p(G) \prod_{i=1}^n p(\theta_i \mid G)\, dG$

where $p$ is a joint distribution over $G$ and the $\theta_i$'s.

I θi ’s are independent given G.

I Sufficient to define G through the conditionals p(θn|θ1, . . . , θn−1).

I G can be infinite dimensional (indeed it is often a random measure).

43 / 127

Page 46:

Outline

(The outline is repeated; the next section is Latent Variable Models and Indian Buffet and Beta Processes.)

44 / 127

Page 47:

Latent Variable Modelling

I Say we have n vector observations x1, . . . , xn.

I Model each observation as a linear combination of $K$ latent sources:

  $x_i = \sum_{k=1}^K \Lambda_k y_{ik} + \epsilon_i$

  $y_{ik}$: activity of source $k$ in datum $i$.
  $\Lambda_k$: basis vector describing the effect of source $k$.

I Examples include principal components analysis, factor analysis, independent components analysis.

I How many sources are there?

I Do we believe that K sources is sufficient to explain all our data?

I What prior distribution should we use for sources?

45 / 127

Page 48:

Binary Latent Variable Models

I Consider a latent variable model with binary sources/features,

  $z_{ik} = \begin{cases} 1 & \text{with probability } \mu_k; \\ 0 & \text{with probability } 1 - \mu_k. \end{cases}$

I Example: data items could be movies like "Terminator 2", "Shrek" and "Lord of the Rings", and features could be "science fiction", "fantasy", "action" and "Arnold Schwarzenegger".

I Place a beta prior over the probabilities of features:

  $\mu_k \sim \text{Beta}(\tfrac{\alpha}{K}, 1)$

I We will again take K →∞.

46 / 127

Page 49:

Indian Buffet Processes

I The Indian Buffet Process (IBP) describes each customer with a binary vector instead of a cluster assignment.

I Generating from an IBP (a code sketch follows below):

  I Parameter $\alpha$.
  I First customer picks $\text{Poisson}(\alpha)$ dishes to eat.
  I Subsequent customer $i$ picks dish $k$ with probability $\frac{m_k}{i}$; and picks $\text{Poisson}(\frac{\alpha}{i})$ new dishes.

  [Figures: a customers-by-tables assignment matrix (CRP) and a customers-by-dishes binary matrix (IBP).]

[Griffiths and Ghahramani 2006]

47 / 127
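The sketch referred to above: a direct simulation of the IBP, returning the binary customers-by-dishes matrix ($\alpha$ and the number of customers are illustrative).

```python
import numpy as np

def ibp_matrix(n_customers, alpha, rng=np.random.default_rng(0)):
    """Simulate a binary customers-by-dishes matrix Z from the Indian buffet process."""
    dishes = []                               # dishes[k] = number of previous customers who tried dish k
    rows = []
    for i in range(1, n_customers + 1):
        row = [1 if rng.random() < m_k / i else 0 for m_k in dishes]
        for k, z in enumerate(row):
            dishes[k] += z
        n_new = rng.poisson(alpha / i)        # new dishes picked by this customer
        row += [1] * n_new
        dishes += [1] * n_new
        rows.append(row)
    Z = np.zeros((n_customers, len(dishes)), dtype=int)
    for i, row in enumerate(rows):
        Z[i, :len(row)] = row
    return Z

Z = ibp_matrix(n_customers=20, alpha=3.0)
```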

Page 50:

Indian Buffet Processes and Exchangeability

I The IBP is infinitely exchangeable. For this to make sense, we need to "forget" the ordering of the dishes.

  I "Name" each dish $k$ with a $\Lambda^*_k$ drawn iid from $H$.
  I Each customer now eats a set of dishes: $\Psi_i = \{\Lambda^*_k : z_{ik} = 1\}$.
  I The joint probability of $\Psi_1, \ldots, \Psi_n$ can be calculated:

  $p(\Psi_1, \ldots, \Psi_n) = \exp\left( -\alpha \sum_{i=1}^n \frac{1}{i} \right) \alpha^K \prod_{k=1}^K \frac{(m_k - 1)!\,(n - m_k)!}{n!}\, h(\Lambda^*_k)$

  $K$: total number of dishes tried by the $n$ customers.
  $\Lambda^*_k$: name of the $k$th dish tried.
  $m_k$: number of customers who tried dish $\Lambda^*_k$.

I De Finetti's Theorem again states that there is some random measure underlying the IBP.

I This random measure is the beta process.

[Griffiths and Ghahramani 2006, Thibaux and Jordan 2007]

48 / 127

Page 51:

Applications of Indian Buffet Processes

I The IBP can be used in concert with different likelihood models in a variety of applications.

  $Z \sim \text{IBP}(\alpha)$    $Y \sim H$    $X \sim F(Z, Y)$

  $p(Z, Y \mid X) = \dfrac{p(Z, Y)\, p(X \mid Z, Y)}{p(X)}$

I Latent factor models for distributed representation [Griffiths and Ghahramani 2005].

I Matrix factorization for collaborative filtering [Meeds et al. 2007].

I Latent causal discovery for medical diagnostics [Wood et al. 2006]

I Protein complex discovery [Chu et al. 2006].

I Psychological choice behaviour [Görür et al. 2006].

I Independent components analysis [Knowles and Ghahramani 2007].

I Learning the structure of deep belief networks [Adams et al. 2010].

49 / 127

Page 52:

Infinite Independent Components Analysis

I Each image $X_i$ is a linear combination of sparse features:

  $X_i = \sum_k \Lambda^*_k y_{ik}$

  where $y_{ik}$ is the activity of feature $k$ with a sparse prior. One possibility is a mixture of a Gaussian and a point mass at 0:

  $y_{ik} = z_{ik} a_{ik}$,  $a_{ik} \sim \mathcal{N}(0, 1)$,  $Z \sim \text{IBP}(\alpha)$

I An ICA model with an infinite number of features.

[Knowles and Ghahramani 2007, Teh et al. 2007]
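A self-contained toy sketch of data generated from this kind of sparse latent feature model, using the finite Beta($\alpha/K$, 1) prior from the earlier slide in place of the IBP (its $K \to \infty$ limit); all sizes and the Gaussian basis vectors are illustrative assumptions.

```python
import numpy as np
# z_ik ~ Bernoulli(mu_k) with mu_k ~ Beta(alpha/K, 1); y_ik = z_ik * a_ik, a_ik ~ N(0, 1).
rng = np.random.default_rng(0)
n, D, K, alpha = 100, 10, 20, 3.0
mu = rng.beta(alpha / K, 1.0, size=K)                 # feature probabilities
Z = (rng.random((n, K)) < mu).astype(float)           # binary feature ownership
A = rng.standard_normal((n, K))                       # Gaussian activations
Y = Z * A                                             # sparse activities y_ik
Lam = rng.standard_normal((K, D))                     # basis vectors Lambda_k
X = Y @ Lam + 0.1 * rng.standard_normal((n, D))       # observations (one row per image)
```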

50 / 127

Page 53:

Beta Processes

I A one-parameter beta process $B \sim \text{BP}(\alpha, H)$ is a random discrete measure of the form:

  $B = \sum_{k=1}^\infty \mu_k \delta_{\theta^*_k}$

  where the points $P = \{(\theta^*_1, \mu_1), (\theta^*_2, \mu_2), \ldots\}$ are spikes in a 2D Poisson process with rate measure:

  $\alpha \mu^{-1}\, d\mu\, H(d\theta)$

I It is the de Finetti measure for the IBP.

I This is an example of a completely random measure.

I A beta process does not have Beta distributed marginals.

[Hjort 1990, Thibaux and Jordan 2007]

51 / 127

Page 54:

Beta Processes

52 / 127

Page 55:

Stick-breaking Construction for Beta Processes

I The following generates a draw of $B$ (a code sketch follows below):

  $v_k \sim \text{Beta}(1, \alpha)$    $\mu_k = (1 - v_k) \prod_{i=1}^{k-1} (1 - v_i)$    $\theta^*_k \sim H$

  $B = \sum_{k=1}^\infty \mu_k \delta_{\theta^*_k}$

I The above is the complement of the stick-breaking construction for DPs.

  [Figure: the DP stick-breaking weights $\pi_{(k)}$ and the corresponding beta process weights $\mu_{(k)}$ on the unit stick.]

[Teh et al. 2007]
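The sketch referred to above: a truncated draw of the beta-process stick-breaking weights; since $\mu_k = \prod_{i=1}^{k}(1 - v_i)$, the weights decrease stochastically. $\alpha$ and the truncation level are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, truncation = 2.0, 1000
v = rng.beta(1.0, alpha, size=truncation)
mu = np.cumprod(1.0 - v)          # mu_k = prod_{i<=k} (1 - v_i)
```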

53 / 127

Page 56:

Outline

(The outline is repeated; the next section is Topic Modelling and Hierarchical Processes.)

54 / 127

Page 57:

Topic Modelling with Latent Dirichlet Allocation

I Infer topics from a document corpus, topics being sets of words that tend to co-occur together.

I Using (Bayesian) latent Dirichlet allocation (a generative sketch follows below):

  $\pi_j \sim \text{Dirichlet}(\tfrac{\alpha}{K}, \ldots, \tfrac{\alpha}{K})$
  $\theta_k \sim \text{Dirichlet}(\tfrac{\beta}{W}, \ldots, \tfrac{\beta}{W})$
  $z_{ji} \mid \pi_j \sim \text{Multinomial}(\pi_j)$
  $x_{ji} \mid z_{ji}, \theta_{z_{ji}} \sim \text{Multinomial}(\theta_{z_{ji}})$

I How many topics can we find from the corpus?

I Can we take the number of topics $K \to \infty$?

  [Graphical model: plates over topics $k = 1, \ldots, K$, documents $j = 1, \ldots, D$ and words $i = 1, \ldots, n_d$, with $\pi_j \to z_{ji} \to x_{ji} \leftarrow \theta_k$.]
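The sketch referred to above: sampling a toy corpus from the LDA model (all sizes and hyperparameters are illustrative).

```python
import numpy as np

def lda_generate(D, n_words, K, W, alpha=1.0, beta=1.0, rng=np.random.default_rng(0)):
    """Generate a toy corpus from the LDA model above."""
    theta = rng.dirichlet([beta / W] * W, size=K)        # topic-word distributions
    docs = []
    for j in range(D):
        pi_j = rng.dirichlet([alpha / K] * K)            # document-topic proportions
        z = rng.choice(K, size=n_words, p=pi_j)          # topic for each word
        words = [rng.choice(W, p=theta[k]) for k in z]   # word drawn from its topic
        docs.append(words)
    return docs, theta

docs, theta = lda_generate(D=5, n_words=50, K=3, W=20)
```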

55 / 127

Page 58:

Hierarchical Dirichlet Processes

I Use a DP mixture for each group.

I Unfortunately there is no sharing of clusters across different groups because H is smooth.

I Solution: make the base distribution H discrete.

I Put a DP prior on the common base distribution.

[Teh et al. 2006]

  [Graphical model: two group-specific DPs $G_1$ and $G_2$ with common base distribution $H$, each generating $\theta_{ji} \to x_{ji}$.]

56 / 127

Page 59:

Hierarchical Dirichlet Processes

I A hierarchical Dirichlet process (a truncated sketch follows below):

  $G_0 \sim \text{DP}(\alpha_0, H)$
  $G_1, G_2 \mid G_0 \sim \text{DP}(\alpha, G_0)$ iid

I Extension to larger hierarchies is straightforward.

  [Graphical model: $H \to G_0$, with $G_0$ shared by $G_1$ and $G_2$, each generating $\theta_{ji} \to x_{ji}$.]
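The sketch referred to above: $G_0$ is approximated by $K$ stick-breaking weights over shared atoms, and each group's $\text{DP}(\alpha, G_0)$ draw is represented by Dirichlet-distributed weights over those atoms (exact when $G_0$ has finitely many atoms). $K$, the concentrations and $H = \mathcal{N}(0, 1)$ are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
K, alpha0, alpha = 50, 5.0, 2.0
v = rng.beta(1.0, alpha0, size=K)
beta = v * np.cumprod(np.concatenate(([1.0], 1.0 - v[:-1])))   # truncated GEM(alpha0) weights
beta /= beta.sum()                                             # renormalize the truncation
atoms = rng.standard_normal(K)                                 # shared atoms theta*_k ~ H
pi = rng.dirichlet(alpha * beta, size=2)                       # group weights for G_1 and G_2
```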

57 / 127

Page 60:

Hierarchical Dirichlet Processes

I Making $G_0$ discrete forces shared clusters between $G_1$ and $G_2$.

58 / 127

Page 61:

Hierarchical Dirichlet Processes

  [Figure: left, perplexity on test abstracts for LDA and the HDP mixture as a function of the number of LDA topics; right, the posterior over the number of topics in HDP mixtures.]

59 / 127

Page 62:

Chinese Restaurant Franchise

  [Figure: the Chinese restaurant franchise, with group-specific restaurants $j = 1, 2, 3$ whose tables share dishes from a global menu.]

60 / 127

Page 63:

Visual Scene Analysis with Transformed DPs

  [Figure 15 from the source paper: the TDP model for 2D visual scenes and a cartoon of its generative process. A global mixture $G_0$ describes the expected frequency and image position of visual categories, whose internal structure is represented by part-based appearance models; each image distribution $G_j$ instantiates a randomly chosen set of objects at transformed locations.]

[Sudderth et al. 2008]

61 / 127

Page 64:

Visual Scene Analysis with Transformed DPs

  [Figure 16 from the source paper: learned contextual, fixed-order models for street scenes and office scenes, each containing four objects. Top: Gaussian distributions over the positions of other objects given the location of the car (left) or computer screen (right). Bottom: the parts generating each category's features.]

[Sudderth et al. 2008]

62 / 127

Page 65:

Hierarchical Modelling

  [Graphical model: three independent groups, each with its own parameter $\theta_j$ generating data $x_{ji}$, $i = 1, \ldots, n_j$.]

[Gelman et al. 1995]

63 / 127

Page 66:

Hierarchical Modelling

  [Graphical model: as above, but with a shared top-level parameter $\theta_0$ tying the group-specific parameters $\theta_1, \theta_2, \theta_3$ together.]

[Gelman et al. 1995]

63 / 127

Page 67:

Outline

(The outline is repeated; the next section is Hierarchical Structure Discovery and Nested Processes.)

64 / 127

Page 68:

Topic Hierarchies

  [Figure: a topic hierarchy learned from a document corpus.]

[Blei et al. 2010]

65 / 127

Page 69:

Nested Chinese Restaurant Process

  [Figure: customers 1-9 at the root restaurant of a nested CRP.]

[Blei et al. 2010]

66 / 127

Pages 70-71:

(The same nested CRP figure repeated as the customers are recursively partitioned into restaurants at deeper levels.)

[Blei et al. 2010]

66 / 127

Page 72:

Visual Taxonomies

  [Figure 5 from the source paper: an unsupervised taxonomy learned on the 13-scenes data set, shown as a tree whose nodes are colour-coded by scene category. Three large groups emerge: natural scenes (forest, mountains), cluttered man-made scenes (tall buildings, city scenes), and less cluttered man-made scenes (highways), which split into finer sub-groups at lower levels.]

[Bart et al. 2008]

67 / 127

Page 73:

Hierarchical Clustering

68 / 127

Page 74:

Hierarchical Clustering

I Bayesian approach to hierarchical clustering: place a prior over tree structures, and infer the posterior.

I The nested DP can be used as a prior over layered tree structures.

I Another prior is the Dirichlet diffusion tree, which produces binary ultrametric trees, and which can be obtained as an infinitesimal limit of a nested DP. It is an example of a fragmentation process.

I Yet another prior is Kingman's coalescent, which also produces binary ultrametric trees, but is an example of a coalescent process.

[Neal 2003, Teh et al. 2008, Bertoin 2006]

69 / 127

Page 75:

Nested Dirichlet Process

I Underlying stochastic process for the nested CRP is a nested DP.

  Hierarchical DP:
  $G_0 \sim \text{DP}(\alpha_0, H)$
  $G_j \mid G_0 \sim \text{DP}(\alpha, G_0)$
  $x_{ji} \mid G_j \sim G_j$

  Nested DP:
  $G_0 \sim \text{DP}(\alpha, \text{DP}(\alpha_0, H))$
  $G_i \sim G_0$
  $x_i \mid G_i \sim G_i$

I The hierarchical DP starts with groups of data items, and analyses them together by introducing dependencies through $G_0$.

I The nested DP starts with one set of data items, partitions them into different groups, and analyses each group separately.

I Orthogonal effects, can be used together.

[Rodríguez et al. 2008]

70 / 127

Page 76:

Nested Beta/Indian Buffet Processes

  [Figure: items 1-5 arranged in a layered tree.]

I Exchangeable distribution over layered trees.

71 / 127

Pages 77-82:

(The same slide repeated as the layered tree is built up step by step.)

71 / 127

Page 83:

Hierarchical Beta/Indian Buffet Processes

  [Figure: items 1-9 and their binary feature assignments in a hierarchy of IBPs.]

I Different from the hierarchical beta process of [Thibaux and Jordan 2007].

72 / 127

Pages 84-85:

(The same slide repeated as further groups of items are added to the hierarchy.)

72 / 127

Page 86:

Deep Structure Learning

[Slide reproduces a page of Adams et al. 2010, "Learning the Structure of Deep Sparse Graphical Models". Figure 4 (Olivetti faces): (a) test images on the left with reconstructed bottom halves on the right; (b) sixty features learned in the bottom layer, where black shows absence of an edge, with sparse features corresponding to specific facial structures such as mouth shapes, noses and eyebrows; (c) raw predictive fantasies; (d) feature activations from individual units in the second hidden layer. The surrounding text gives the Gibbs and Metropolis–Hastings birth/death updates used to sample the structure matrices Z(m), notes that this sampler targets the IBP posterior without truncation or conjugacy and allows the two-parameter IBP, and reports reconstruction experiments on the Olivetti faces, MNIST digits and Frey faces; a typical posterior network had three hidden layers with roughly seventy units per layer.]

[Adams et al. 2010]

73 / 127


Page 88: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Transfer Learning

I Many recent machine learning paradigms can be understood as trying to model data from heterogeneous sources and types.

I Semi-supervised learning: we have labelled data, and unlabelled data.

I Multi-task learning: we have multiple tasks with different distributions but structurally similar.

I Domain adaptation: we have a small amount of pertinent data, and a large amount of data from a related problem or domain.

I The transfer learning problem is how to transfer information between different sources and types.

I Flexible nonparametric models can allow for more information extraction and transfer.

I Hierarchies and nestings are different ways of putting together multiple stochastic processes to form complex models.

[Jordan 2010]

74 / 127

Page 89: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Outline

Introduction

Regression and Gaussian Processes

Density Estimation, Clustering and Dirichlet Processes

Latent Variable Models and Indian Buffet and Beta Processes

Topic Modelling and Hierarchical Processes

Hierarchical Structure Discovery and Nested Processes

Time Series Models

Modelling Power-laws with Pitman-Yor Processes

Summary

75 / 127

Page 90: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Hidden Markov Models

[Graphical model: a Markov chain of hidden states z0, z1, z2, . . . , zτ with observations x1, x2, . . . , xτ; state k has transition vector πk and emission parameter θ*k.]

πk ∼ Dirichlet(α/K, . . . , α/K)    zi | zi−1, πzi−1 ∼ Multinomial(πzi−1)

θ*k ∼ H    xi | zi, θ*zi ∼ F(θ*zi)

I Can we take K →∞?

I Can we do so while imposing structure in transition probability matrix?
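A minimal generative sketch of the finite Bayesian HMM above, in Python with numpy (an assumption of this transcript, not part of the original slides); the emission family F(θ*) = N(θ*, 1) and base measure H = N(0, 10²) are illustrative choices only:

```python
import numpy as np

def sample_finite_bayes_hmm(T, K, alpha=1.0, seed=None):
    """Sample a state/observation sequence from the finite Bayesian HMM prior.

    Illustrative choices: emissions F(theta*) = N(theta*, 1), base measure
    H = N(0, 10^2); the Dirichlet(alpha/K, ..., alpha/K) rows match the slide.
    """
    rng = np.random.default_rng(seed)
    pi = rng.dirichlet([alpha / K] * K, size=K)   # one transition row per state
    theta = rng.normal(0.0, 10.0, size=K)         # theta*_k ~ H
    z = np.empty(T, dtype=int)
    x = np.empty(T)
    z_prev = rng.integers(K)                      # initial state z_0 (uniform here)
    for t in range(T):
        z[t] = rng.choice(K, p=pi[z_prev])        # z_t | z_{t-1} ~ Multinomial(pi_{z_{t-1}})
        x[t] = rng.normal(theta[z[t]], 1.0)       # x_t | z_t ~ F(theta*_{z_t})
        z_prev = z[t]
    return z, x

z, x = sample_finite_bayes_hmm(T=200, K=10, alpha=2.0, seed=0)
```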

76 / 127

Page 91: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Infinite Hidden Markov Models

[Graphical model: same HMM structure as before, now with an unbounded number of states.]

β ∼ GEM(γ)    πk | β ∼ DP(α, β)    zi | zi−1, πzi−1 ∼ Multinomial(πzi−1)

θ*k ∼ H    xi | zi, θ*zi ∼ F(θ*zi)

I Hidden Markov models with an infinite number of states: infinite HMM.

I A hierarchical DP is used to share information among the transition probability vectors, which prevents “run-away” states: the HDP-HMM.

[Beal et al. 2002, Teh et al. 2006]
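A hedged sketch of sampling from the HDP-HMM prior using a finite truncation to K states: β is drawn by truncated stick-breaking, and πk | β ∼ DP(α, β) reduces, over the K retained atoms, to Dirichlet(αβ1, . . . , αβK). Python/numpy and the Gaussian choices for F and H are illustrative assumptions of this transcript:

```python
import numpy as np

def sample_truncated_hdp_hmm_prior(T, K=20, gamma=3.0, alpha=3.0, seed=None):
    """Truncated (K-state) approximation to the HDP-HMM prior above."""
    rng = np.random.default_rng(seed)
    # beta ~ GEM(gamma), truncated: stick-breaking with Beta(1, gamma) proportions.
    v = rng.beta(1.0, gamma, size=K)
    v[-1] = 1.0                                    # absorb the leftover stick mass
    beta = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    # pi_k | beta ~ DP(alpha, beta); over the K retained atoms this is
    # Dirichlet(alpha * beta_1, ..., alpha * beta_K).
    pi = np.vstack([rng.dirichlet(alpha * beta) for _ in range(K)])
    theta = rng.normal(0.0, 10.0, size=K)          # theta*_k ~ H = N(0, 10^2)
    z, x = np.empty(T, dtype=int), np.empty(T)
    z_prev = rng.choice(K, p=beta)                 # initial state drawn from beta
    for t in range(T):
        z[t] = rng.choice(K, p=pi[z_prev])
        x[t] = rng.normal(theta[z[t]], 1.0)        # F(theta*) = N(theta*, 1)
        z_prev = z[t]
    return z, x

z, x = sample_truncated_hdp_hmm_prior(T=500, seed=0)
```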

77 / 127

Page 92: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Word Segmentation

I Given sequences of utterances or characters, can a probabilistic model segment sequences into coherent chunks (“words”)?

canyoureadthissentencewithoutspaces?
can you read this sentence without spaces?

I Use an infinite HMM: each chunk/word is a state, with a Markov model of state transitions.

I A nonparametric model is natural, since the number of words is unknown before segmentation.

[Goldwater et al. 2006b]

78 / 127

Page 93: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Word Segmentation

          Words   Lexicon   Boundaries
NGS-u     68.9    82.6      52.0
MBDP-1    68.2    82.3      52.4
DP        53.8    74.3      57.2
NGS-b     68.3    82.1      55.7
HDP       76.6    87.7      63.1

I NGS-u: n-gram Segmentation (unigram) [Venkataraman 2001].

I NGS-b: n-gram Segmentation (bigram) [Venkataraman 2001].

I MBDP-1: Model-based Dynamic Programming [Brent 1999].

I DP, HDP: Nonparametric model, without and with Markov dependencies.

[Goldwater et al. 2006a]

79 / 127

Page 94: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Sticky HDP-HMM

I In typical HMMs or in infinite HMMs the model does not give special treatment to self-transitions (from a state to itself).

I In many HMM applications self-transitions are much more likely.

I Example application of HMMs: speaker diarization.

I Straightforward extension of the HDP-HMM prior encourages higher self-transition probabilities:

πk | β ∼ DP(α + κ, (αβ + κδk)/(α + κ))

[Beal et al. 2002, Fox et al. 2008]
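In the same truncated setting as before, the sticky prior DP(α + κ, (αβ + κδk)/(α + κ)) for row k becomes Dirichlet(αβ + κek); a small illustrative sketch (Python/numpy assumed):

```python
import numpy as np

def sticky_transition_rows(beta, alpha=3.0, kappa=20.0, seed=None):
    """Truncated sticky HDP-HMM transitions: row k ~ Dirichlet(alpha*beta + kappa*e_k).

    `beta` is a length-K probability vector (e.g. a truncated GEM draw);
    kappa > 0 boosts the expected self-transition mass of each row.
    """
    rng = np.random.default_rng(seed)
    K = len(beta)
    rows = []
    for k in range(K):
        concentration = alpha * np.asarray(beta, dtype=float)
        concentration[k] += kappa          # extra mass on the self-transition
        rows.append(rng.dirichlet(concentration))
    return np.vstack(rows)

beta = np.full(10, 0.1)                    # toy global weights summing to one
P = sticky_transition_rows(beta)           # rows have large diagonal entries on average
```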

80 / 127

Page 95: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Sticky HDP-HMM

[Slide reproduces results from Fox et al.: Figure 14 compares diarization error rates (DER) on 21 meetings for the sticky and the original (non-sticky) HDP-HMM with DP emissions against the ICSI system, (a) using the state sequence minimizing the expected Hamming distance and (b) using ground-truth speaker labels for the post-processed data. The accompanying discussion argues that the original HDP-HMM models the temporal persistence of states inadequately, while the sticky HDP-HMM with DP (multimodal) emissions, together with blocked forward–backward sampling over a truncated HDP, yields state-of-the-art speaker diarization performance.]

[Fox et al. 2008]

81 / 127

Page 96: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Infinite Factorial HMM

[Slide reproduces an excerpt from Van Gael et al. 2009 (Figures 1 and 2: the HMM and the factorial HMM). The factorial HMM models observations y1, . . . , yT with M parallel binary Markov chains s(1), . . . , s(M), but M is a free parameter that must normally be fixed in advance. The Markov Indian buffet process (mIBP) defines a distribution over binary matrices S whose rows are timesteps and whose columns are chains/features, with an unbounded number of columns; each chain m evolves according to the transition matrix

W(m) = [ 1 − am   am ;  1 − bm   bm ]

and the mIBP is the building block of the infinite factorial HMM (iFHMM), applied in the paper to blind source separation.]

I Take M → ∞ for the following model specification:

P(s(m)t = 1 | s(m)t−1 = 0) = am,   am ∼ Beta(α/M, 1)
P(s(m)t = 1 | s(m)t−1 = 1) = bm,   bm ∼ Beta(γ, δ)

I The resulting stochastic process is a Markov Indian buffet process. It is an example of a dependent random measure.
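A small simulation of the finite-M specification above (before the M → ∞ limit), with Python/numpy and the parameter values as illustrative assumptions:

```python
import numpy as np

def sample_finite_markov_ibp(T, M, alpha=2.0, gamma=5.0, delta=1.0, seed=None):
    """Simulate the finite-M binary-chain model above."""
    rng = np.random.default_rng(seed)
    a = rng.beta(alpha / M, 1.0, size=M)   # P(chain m turns on)   a_m ~ Beta(alpha/M, 1)
    b = rng.beta(gamma, delta, size=M)     # P(chain m stays on)   b_m ~ Beta(gamma, delta)
    S = np.zeros((T, M), dtype=int)        # rows = timesteps, columns = chains
    state = np.zeros(M, dtype=int)         # all chains start in the off state
    for t in range(T):
        p_on = np.where(state == 1, b, a)  # probability of being on at time t
        state = (rng.random(M) < p_on).astype(int)
        S[t] = state
    return S

S = sample_finite_markov_ibp(T=50, M=200, seed=0)   # only a few columns are ever used
```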

[Van Gael et al. 2009]

82 / 127

Page 97: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Nonparametric Grammars, Hierarchical HMMs etc

I In linguistics, grammars are much more plausible as generative models of sentences.

I Learning the structure of probabilistic grammars is even more difficult, and Bayesian nonparametrics provides a compelling alternative.

[Liang et al. 2007, Finkel et al. 2007, Johnson et al. 2007, Heller et al. 2009]

83 / 127

Page 98: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Motion Capture Analysis

I Goal: find coherent “behaviour” in the time series that transfers to other time series.

Slides courtesy of [Fox et al. 2010]

84 / 127

Page 99: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Motion Capture Analysis

I Transfer knowledge among related time series in the form of a library of “behaviours”.

I Allow each time series model to make use of an arbitrary subset of the behaviours.

I Method: represent behaviours as states in an autoregressive HMM, and use the beta/Bernoulli process to pick out subsets of states.

Slides courtesy of [Fox et al. 2010]

85 / 127

Page 100: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

BP-AR-HMM

Slides courtesy of [Fox et al. 2010]

86 / 127

Page 101: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Motion Capture Results

Slides courtesy of [Fox et al. 2010]

87 / 127

Page 102: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

High Order Markov Models

I Decompose the joint distribution of a sequence of variables into conditional distributions:

P(x1, x2, . . . , xT) = ∏_{t=1}^{T} P(xt | x1, . . . , xt−1)

I An Nth order Markov model approximates the joint distribution as:

P(x1, x2, . . . , xT) = ∏_{t=1}^{T} P(xt | xt−N, . . . , xt−1)

I Such models are particularly prevalent in natural language processing, compression and biological sequence modelling.

toad, in, a, hole
t, o, a, d, _, i, n, _, a, _, h, o, l, e

A, C, G, T, C, C, A

I Would like to take N →∞.

88 / 127

Page 103: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

High Order Markov Models

I Difficult to fit such models due to data sparsity.

P(xt | xt−N, . . . , xt−1) = C(xt−N, . . . , xt−1, xt) / C(xt−N, . . . , xt−1)

I Sharing information via hierarchical models.

P(xt | xt−N:t−1 = u) = Gu(xt)

I A context tree.

[Figure: context tree with G∅ at the root; G_a is a child of G∅; G_{in a}, G_{is a} and G_{about a} are children of G_a; G_{toad in a} and G_{stuck in a} are children of G_{in a}.]

[MacKay and Peto 1994, Teh 2006a]
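To make the sharing idea concrete, here is a toy count-based model that recursively interpolates each context with its parent context; it is a simplified stand-in written for this transcript (Python assumed), not interpolated Kneser-Ney and not the hierarchical Pitman-Yor model discussed later:

```python
from collections import defaultdict

class BackoffNgram:
    """Toy N-gram model: each context's counts are interpolated with its parent context."""

    def __init__(self, order=3, discount=0.75):
        self.order = order
        self.discount = discount
        self.counts = defaultdict(lambda: defaultdict(float))   # context -> word -> count
        self.totals = defaultdict(float)                        # context -> total count
        self.vocab = set()

    def fit(self, tokens):
        for t, w in enumerate(tokens):
            self.vocab.add(w)
            for n in range(self.order):                         # contexts of length 0..order-1
                u = tuple(tokens[max(0, t - n):t])
                self.counts[u][w] += 1.0
                self.totals[u] += 1.0

    def prob(self, word, context=()):
        u = tuple(context)[-(self.order - 1):] if self.order > 1 else ()
        return self._prob(word, u)

    def _prob(self, word, u):
        if len(u) == 0:                                         # add-one smoothed unigram level
            V = max(len(self.vocab), 1)
            return (self.counts[()].get(word, 0.0) + 1.0) / (self.totals[()] + V)
        total = self.totals[u]
        parent = self._prob(word, u[1:])                        # back off to the shorter context
        if total == 0.0:
            return parent
        c = self.counts[u].get(word, 0.0)
        lam = self.discount * len(self.counts[u]) / total       # mass handed to the parent
        return max(c - self.discount, 0.0) / total + lam * parent

lm = BackoffNgram(order=3)
lm.fit("to be or not to be that is the question".split())
print(lm.prob("be", context=("not", "to")))
```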

89 / 127

Page 104: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Outline

Introduction

Regression and Gaussian Processes

Density Estimation, Clustering and Dirichlet Processes

Latent Variable Models and Indian Buffet and Beta Processes

Topic Modelling and Hierarchical Processes

Hierarchical Structure Discovery and Nested Processes

Time Series Models

Modelling Power-laws with Pitman-Yor Processes

Summary

90 / 127

Page 105: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Pitman-Yor Processes

I Two-parameter generalization of the Chinese restaurant process:

p(customer n sat at table k | past) = (nk − β)/(n − 1 + α)   if k is an occupied table
                                      (α + βK)/(n − 1 + α)   if k is a new table

(with K the current number of occupied tables)

I Associating each cluster k with a unique draw θ*k ∼ H, the corresponding Pólya urn scheme is also exchangeable.

[Figure: number of occupied tables versus number of customers (up to 10,000), for a Dirichlet process draw (α = 30, d = 0) and a Pitman-Yor process draw (α = 1, d = 0.5).]
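A short simulation of the two-parameter Chinese restaurant process above (Python/numpy assumed; β = 0 recovers the ordinary Dirichlet-process CRP), which reproduces the qualitative difference in table growth shown in the figure:

```python
import numpy as np

def pitman_yor_crp(n_customers, alpha=1.0, beta=0.5, seed=None):
    """Simulate table occupancy counts from the two-parameter CRP above."""
    rng = np.random.default_rng(seed)
    counts = []                                    # n_k for each occupied table
    for n in range(1, n_customers + 1):
        K = len(counts)
        probs = np.array([nk - beta for nk in counts] + [alpha + beta * K])
        probs /= (n - 1 + alpha)                   # probabilities sum to one
        k = rng.choice(K + 1, p=probs)
        if k == K:
            counts.append(1)                       # start a new table
        else:
            counts[k] += 1
    return counts

tables_dp = pitman_yor_crp(10000, alpha=30.0, beta=0.0, seed=0)  # DP-like, ~ alpha*log(n) tables
tables_py = pitman_yor_crp(10000, alpha=1.0, beta=0.5, seed=0)   # power-law, ~ n**beta tables
```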

91 / 127

Page 106: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Pitman-Yor Processes

I De Finetti’s Theorem states that there is a random measure underlying this two-parameter generalization.

I This is the Pitman-Yor process.

I The Pitman-Yor process also has a stick-breaking construction:

πk = vk ∏_{i=1}^{k−1} (1 − vi)    vk ∼ Beta(1 − β, α + kβ)    θ*k ∼ H    G = Σ_{k=1}^{∞} πk δθ*k

I The Pitman-Yor process cannot be obtained as the infinite limit of a simple parametric model.

[Perman et al. 1992, Pitman and Yor 1997, Ishwaran and James 2001]
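A truncated version of this stick-breaking construction is easy to sample; the sketch below (Python/numpy assumed, truncation level K = 1000 chosen arbitrarily) returns the renormalised weights π1, . . . , πK:

```python
import numpy as np

def pitman_yor_stick_breaking(alpha=1.0, beta=0.5, K=1000, seed=None):
    """Truncated stick-breaking draw of the Pitman-Yor weights (pi_1, ..., pi_K)."""
    rng = np.random.default_rng(seed)
    k = np.arange(1, K + 1)
    v = rng.beta(1.0 - beta, alpha + k * beta)     # v_k ~ Beta(1 - beta, alpha + k*beta)
    pi = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return pi / pi.sum()                           # renormalise the truncated weights

pi = pitman_yor_stick_breaking(seed=0)
# Atoms theta*_k ~ H would be attached to these weights to form G.
```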

92 / 127

Page 107: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Pitman-Yor Processes

I Two salient features of the Pitman-Yor process:

I With more occupied tables, the chance of even more tables becomes higher.

I Tables with smaller occupancy numbers tend to have lower chance of getting new customers.

I The above means that Pitman-Yor processes produce Zipf’s Law type behaviour, with K = O(αn^β).

[Figure (log–log plots, α = 10, d ∈ {0.9, 0.5, 0}): number of tables versus number of customers; number of tables with exactly one customer versus number of customers; and proportion of tables with one customer versus number of customers, for up to 10^6 customers.]

93 / 127

Page 108: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Pitman-Yor Processes

Draw from a Pitman-Yor process (α = 1, d = 0.5)

[Figure: table assignment of each of 10,000 customers, and the number of customers per table plotted against table index on semi-log and log–log scales.]

Draw from a Dirichlet process (α = 30, d = 0)

[Figure: the same three plots for a Dirichlet process draw.]

94 / 127

Page 109: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Pitman-Yor Processes

95 / 127

Page 110: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Hierarchical Pitman-Yor Markov Models

I Use a hierarchical Pitman-Yor prior for high order Markov models.

I Can now take N → ∞, making use of coagulation and fragmentation properties of Pitman-Yor processes for computational tractability.

I The resulting non-Markov model is called the sequence memoizer.

[Goldwater et al. 2006a, Teh 2006b, Wood et al. 2009, Gasthaus et al. 2010]

96 / 127

Page 111: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Language Modelling

I Compare hierarchical Pitman-Yor model against hierarchical Dirichlet model, and two state-of-the-art language models (interpolated Kneser-Ney, modified Kneser-Ney).

I Results reported as perplexity scores.

T      N    IKN     MKN     HPYLM   HDLM
2e6    3    148.8   144.1   144.3   191.2
4e6    3    137.1   132.7   132.7   172.7
6e6    3    130.6   126.7   126.4   162.3
8e6    3    125.9   122.3   121.9   154.7
10e6   3    122.0   118.6   118.2   148.7
12e6   3    119.0   115.8   115.4   144.0
14e6   3    116.7   113.6   113.2   140.5
14e6   2    169.9   169.2   169.3   180.6
14e6   4    106.1   102.4   101.9   136.6

[Teh 2006b]

97 / 127

Page 112: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Compression

I Predictive models can be used to compress sequence data using entropic coding techniques.

I Compression results on Calgary corpus:

Model                Average bits / byte
gzip                 2.61
bzip2                2.11
CTW                  1.99
PPM                  1.93
Sequence Memoizer    1.89

I See http://deplump.com and http://sequencememoizer.com.

[Gasthaus et al. 2010]

98 / 127

Page 113: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Comparing Finite and Infinite Order Markov Models

[Wood et al. 2009]

99 / 127

Page 114: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Image Segmentation with Pitman-Yor Processes

I Human segmentations of images also seem to follow a power law.

I An unsupervised image segmentation model based on dependent hierarchical Pitman-Yor processes achieves state-of-the-art results.

[Sudderth and Jordan 2009]

100 / 127

Page 115: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Stable Beta Process

I Extensions allow for different aspects of the generative process to be modelled:

I α: controls the expected number of dishes picked by each customer.

I c: controls the overall number of dishes picked by all customers.

I σ: controls power-law scaling (ratio of popular dishes to unpopular ones).

I A completely random measure, with Lévy measure:

(α Γ(1 + c) / (Γ(1 − σ) Γ(c + σ))) µ^{−σ−1} (1 − µ)^{c+σ−1} dµ H(dθ)

[Ghahramani et al. 2007, Teh and Görür 2009]

101 / 127

Page 116: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Stable Beta Process

[Figure panels: α = 1, c = 1, σ = 0.5;  α = 10, c = 1, σ = 0.5;  α = 100, c = 1, σ = 0.5]

102 / 127

Page 117: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Stable Beta Process

[Figure panels: α = 10, c = 0.1, σ = 0.5;  α = 10, c = 1, σ = 0.5;  α = 10, c = 10, σ = 0.5]

102 / 127

Page 118: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Stable Beta Process

[Figure panels: α = 10, c = 1, σ = 0.2;  α = 10, c = 1, σ = 0.5;  α = 10, c = 1, σ = 0.8]

102 / 127

Page 119: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Modelling Word Occurrences in Documents

[Figure: cumulative number of distinct words versus number of documents (up to about 550 documents), comparing the beta process (BP), the stable beta process (SBP) and the data.]

103 / 127

Page 120: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Outline

Introduction

Regression and Gaussian Processes

Density Estimation, Clustering and Dirichlet Processes

Latent Variable Models and Indian Buffet and Beta Processes

Topic Modelling and Hierarchical Processes

Hierarchical Structure Discovery and Nested Processes

Time Series Models

Modelling Power-laws with Pitman-Yor Processes

Summary

104 / 127

Page 121: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Summary

I Motivation for Bayesian nonparametrics:

I Allows practitioners to define and work with models with large support, sidesteps model selection.
I New models with useful properties.
I Large variety of applications.

I Various standard Bayesian nonparametric models:

I Dirichlet processes
I Hierarchical Dirichlet processes
I Infinite hidden Markov models
I Indian buffet and beta processes
I Pitman-Yor processes

I Touched upon two important theoretical tools:

I Consistency and Kolmogorov’s Consistency Theorem
I Exchangeability and de Finetti’s Theorem

I Described a number of applications of Bayesian nonparametrics.

105 / 127

Page 122: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Advertisement

I PhD studentships and postdoctoral fellowships are available.

I In both machine learning and computational neuroscience.

I With both Arthur Gretton and me.

I Happy to speak with anyone interested.

106 / 127

Page 123: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Outline

Relating Different Representations of Dirichlet Processes

Representations of Hierarchical Dirichlet Processes

107 / 127

Page 124: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Representations of Dirichlet Processes

I Posterior Dirichlet process:

G ∼ DP(α, H), θ | G ∼ G   ⟺   θ ∼ H, G | θ ∼ DP(α + 1, (αH + δθ)/(α + 1))

I Pólya urn scheme:

θn | θ1:n−1 ∼ (αH + Σ_{i=1}^{n−1} δθi) / (α + n − 1)

I Chinese restaurant process:

p(customer n sat at table k | past) = nk/(n − 1 + α)   if k is an occupied table
                                      α/(n − 1 + α)    if k is a new table

I Stick-breaking construction:

πk = βk ∏_{i=1}^{k−1} (1 − βi)    βk ∼ Beta(1, α)    θ*k ∼ H    G = Σ_{k=1}^{∞} πk δθ*k
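The Pólya urn row of this summary can be simulated directly; the sketch below (Python/numpy assumed) uses an illustrative base measure H = N(0, 1):

```python
import numpy as np

def polya_urn(n, alpha=2.0, seed=None):
    """Draw theta_1, ..., theta_n marginally from DP(alpha, H) via the Polya urn.

    With probability alpha/(alpha + i - 1) the i-th draw is a new atom from
    H = N(0, 1) (illustrative); otherwise it copies an earlier draw uniformly.
    """
    rng = np.random.default_rng(seed)
    thetas = []
    for i in range(1, n + 1):
        if rng.random() < alpha / (alpha + i - 1):
            thetas.append(rng.normal())                        # new atom from H
        else:
            thetas.append(thetas[rng.integers(len(thetas))])   # reuse an earlier draw
    return np.array(thetas)

theta = polya_urn(1000, seed=0)
print(len(np.unique(theta)))     # number of distinct atoms, roughly O(alpha * log n)
```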

108 / 127

Page 125: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Posterior Dirichlet Processes

I Suppose G is DP distributed, and θ is G distributed:

G ∼ DP(α,H)

θ|G ∼ G

I We are interested in:

I The marginal distribution of θ with G integrated out.
I The posterior distribution of G conditioning on θ.

109 / 127

Page 126: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Posterior Dirichlet Processes

Conjugacy between Dirichlet Distribution and Multinomial.

I Consider:

(π1, . . . , πK) ∼ Dirichlet(α1, . . . , αK)
z | (π1, . . . , πK) ∼ Discrete(π1, . . . , πK)

z is a multinomial variate, taking on value i ∈ {1, . . . , K} with probability πi.

I Then:

z ∼ Discrete(α1/Σi αi, . . . , αK/Σi αi)
(π1, . . . , πK) | z ∼ Dirichlet(α1 + δ1(z), . . . , αK + δK(z))

where δi(z) = 1 if z takes on value i, 0 otherwise.

I Converse also true.
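A quick Monte Carlo sanity check of this conjugacy (Python/numpy assumed, with illustrative values K = 3, α = (1, 2, 3) and an observed z = 2):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([1.0, 2.0, 3.0])

pi = rng.dirichlet(alpha, size=50000)                 # pi ~ Dirichlet(alpha)
z = np.array([rng.choice(3, p=p) for p in pi]) + 1    # z | pi ~ Discrete(pi), 1-based
mc_mean = pi[z == 2].mean(axis=0)                     # estimate of E[pi | z = 2]

alpha_post = alpha + np.array([0.0, 1.0, 0.0])        # Dirichlet(alpha + delta(z=2))
print(mc_mean)                                        # roughly [0.143, 0.429, 0.429]
print(alpha_post / alpha_post.sum())                  # exact posterior mean
```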

110 / 127

Page 127: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Posterior Dirichlet Processes

I Fix a partition (A1, . . . , AK) of Θ. Then

(G(A1), . . . ,G(AK )) ∼ Dirichlet(αH(A1), . . . , αH(AK ))

P(θ ∈ Ai |G) = G(Ai )

I Using Dirichlet-multinomial conjugacy,

P(θ ∈ Ai ) = H(Ai )

(G(A1), . . . ,G(AK ))|θ ∼ Dirichlet(αH(A1)+δθ(A1), . . . , αH(AK )+δθ(AK ))

I The above is true for every finite partition of Θ. In particular, taking a really fine partition,

p(dθ) = H(dθ)

i.e. θ ∼ H with G integrated out.

I The posterior G | θ is also a Dirichlet process:

G | θ ∼ DP(α + 1, (αH + δθ)/(α + 1))

111 / 127

Page 128: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Posterior Dirichlet Processes

G ∼ DP(α, H), θ | G ∼ G   ⟺   θ ∼ H, G | θ ∼ DP(α + 1, (αH + δθ)/(α + 1))

112 / 127

Page 129: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Pólya Urn Scheme

I First sample:

θ1 | G ∼ G,  G ∼ DP(α, H)   ⟺   θ1 ∼ H,  G | θ1 ∼ DP(α + 1, (αH + δθ1)/(α + 1))

I Second sample:

θ2 | θ1, G ∼ G,  G | θ1 ∼ DP(α + 1, (αH + δθ1)/(α + 1))   ⟺   θ2 | θ1 ∼ (αH + δθ1)/(α + 1),  G | θ1, θ2 ∼ DP(α + 2, (αH + δθ1 + δθ2)/(α + 2))

I nth sample:

θn | θ1:n−1, G ∼ G,  G | θ1:n−1 ∼ DP(α + n − 1, (αH + Σ_{i=1}^{n−1} δθi)/(α + n − 1))   ⟺   θn | θ1:n−1 ∼ (αH + Σ_{i=1}^{n−1} δθi)/(α + n − 1),  G | θ1:n ∼ DP(α + n, (αH + Σ_{i=1}^{n} δθi)/(α + n))

113 / 127

Page 130: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Stick-breaking Construction

I Returning to the posterior process:

G ∼ DP(α, H), θ | G ∼ G   ⟺   θ ∼ H, G | θ ∼ DP(α + 1, (αH + δθ)/(α + 1))

I Consider a partition (θ, Θ\θ) of Θ. We have:

(G(θ), G(Θ\θ)) | θ ∼ Dirichlet((α + 1) (αH + δθ)/(α + 1) (θ), (α + 1) (αH + δθ)/(α + 1) (Θ\θ))
                   = Dirichlet(1, α)

I G has a point mass located at θ:

G = βδθ + (1 − β)G′   with β ∼ Beta(1, α)

and G′ is the (renormalized) probability measure with the point mass removed.

I What is G′?

114 / 127

Page 131: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Stick-breaking Construction

I Currently, we have:

G ∼ DP(α, H), θ ∼ G   ⟹   θ ∼ H, G | θ ∼ DP(α + 1, (αH + δθ)/(α + 1)),
                            G = βδθ + (1 − β)G′, β ∼ Beta(1, α)

I Consider a further partition (θ, A1, . . . , AK) of Θ:

(G(θ), G(A1), . . . , G(AK)) = (β, (1 − β)G′(A1), . . . , (1 − β)G′(AK))
                            ∼ Dirichlet(1, αH(A1), . . . , αH(AK))

I The agglomerative/decimative property of Dirichlet implies:

(G′(A1), . . . , G′(AK)) | θ ∼ Dirichlet(αH(A1), . . . , αH(AK))

G′ ∼ DP(α, H)

115 / 127

Page 132: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Stick-breaking Construction

I We have:

G ∼ DP(α, H)
G = β1 δθ*1 + (1 − β1) G1
G = β1 δθ*1 + (1 − β1)(β2 δθ*2 + (1 − β2) G2)
...
G = Σ_{k=1}^{∞} πk δθ*k

where

πk = βk ∏_{i=1}^{k−1} (1 − βi)    βk ∼ Beta(1, α)    θ*k ∼ H

[Figure: stick-breaking of the unit interval into weights π(1), π(2), . . . , π(6), . . .]

116 / 127

Page 133: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Outline

Relating Different Representations of Dirichlet Processes

Representations of Hierarchical Dirichlet Processes

117 / 127

Page 134: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Stick-breaking Construction

I We shall assume the following HDP hierarchy:

G0 ∼ DP(γ,H)

Gj |G0 ∼ DP(α,G0) for j = 1, . . . , J

I The stick-breaking construction for the HDP is:

G0 = Σ_{k=1}^{∞} π0k δθ*k    θ*k ∼ H
π0k = β0k ∏_{l=1}^{k−1} (1 − β0l)    β0k ∼ Beta(1, γ)

Gj = Σ_{k=1}^{∞} πjk δθ*k
πjk = βjk ∏_{l=1}^{k−1} (1 − βjl)    βjk ∼ Beta(α π0k, α(1 − Σ_{l=1}^{k} π0l))
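A truncated sampler for this construction (Python/numpy assumed; truncation at K atoms and the numerical guard are illustrative choices):

```python
import numpy as np

def hdp_stick_breaking(J, K=50, gamma=5.0, alpha=5.0, seed=None):
    """Truncated HDP stick-breaking: global weights pi0 and group weights pi_j.

    Atoms theta*_k ~ H (not sampled here) are shared by all of the G_j.
    """
    rng = np.random.default_rng(seed)
    # Global DP: beta_0k ~ Beta(1, gamma), pi_0k = beta_0k * prod_{l<k}(1 - beta_0l).
    b0 = rng.beta(1.0, gamma, size=K)
    b0[-1] = 1.0                                    # absorb the leftover stick mass
    pi0 = b0 * np.concatenate(([1.0], np.cumprod(1.0 - b0[:-1])))
    # Group sticks: beta_jk ~ Beta(alpha*pi_0k, alpha*(1 - sum_{l<=k} pi_0l)).
    pis = []
    for _ in range(J):
        tail = np.clip(1.0 - np.cumsum(pi0), 1e-12, None)   # guard against round-off
        bj = rng.beta(alpha * pi0, alpha * tail)
        bj[-1] = 1.0
        pij = bj * np.concatenate(([1.0], np.cumprod(1.0 - bj[:-1])))
        pis.append(pij)
    return pi0, np.vstack(pis)

pi0, pi = hdp_stick_breaking(J=3, seed=0)
```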

118 / 127

Page 135: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Hierarchical Pólya Urn Scheme

I Let G ∼ DP(α,H).

I We can visualize the Pólya urn scheme as follows:

[Figure: draws θ1, θ2, . . . , θ7 from G, each linked by an arrow to one of the unique atoms θ*1, θ*2, . . . drawn from H.]

where the arrows denote to which θ*k each θi was assigned and

θ1, θ2, . . . ∼ G i.i.d.
θ*1, θ*2, . . . ∼ H i.i.d.

(but θ1, θ2, . . . are not independent of θ*1, θ*2, . . .).

119 / 127

Page 136: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Hierarchical Pólya Urn Scheme

I Let G0 ∼ DP(γ,H) and G1,G2|G0 ∼ DP(α,G0).

I The hierarchical Pólya urn scheme to generate draws from G1, G2:

[Figure: draws θ11, . . . , θ17 from G1 and θ21, . . . , θ27 from G2 are each linked to unique atoms θ*1k and θ*2k of their own urns, which are in turn linked to shared top-level atoms θ*01, θ*02, . . . drawn from the base urn for G0.]

120 / 127

Page 137: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

Chinese Restaurant Franchise

I Let G0 ∼ DP(γ,H) and G1,G2|G0 ∼ DP(α,G0).

I The Chinese restaurant franchise describes the clustering of data items in the hierarchy:

[Figure: two restaurants, each with customers 1–7 seated at tables; every table serves a dish labelled A–G chosen from a shared franchise-wide menu, so dishes (atoms) are shared across restaurants.]

121 / 127

Page 138: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

References I

I Adams, R. P., Wallach, H. M., and Ghahramani, Z. (2010). Learning the structure of deep sparse graphical models. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics.

I Bart, E., Porteous, I., Perona, P., and Welling, M. (2008). Unsupervised learning of visual taxonomies. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

I Beal, M. J., Ghahramani, Z., and Rasmussen, C. E. (2002). The infinite hidden Markov model. In Advances in Neural Information Processing Systems, volume 14.

I Bertoin, J. (2006). Random Fragmentation and Coagulation Processes. Cambridge University Press.

I Blackwell, D. and MacQueen, J. B. (1973). Ferguson distributions via Pólya urn schemes. Annals of Statistics, 1:353–355.

I Blei, D. M., Griffiths, T. L., and Jordan, M. I. (2010). The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of the Association for Computing Machines, 57(2):1–30.

I Chu, W., Ghahramani, Z., Krause, R., and Wild, D. L. (2006). Identifying protein complexes in high-throughput protein interaction screens using an infinite latent feature model. In BIOCOMPUTING: Proceedings of the Pacific Symposium.

I Doucet, A., de Freitas, N., and Gordon, N. J. (2001). Sequential Monte Carlo Methods in Practice. Statistics for Engineering and Information Science. New York: Springer-Verlag.

I Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1(2):209–230.

I Finkel, J. R., Grenager, T., and Manning, C. D. (2007). The infinite tree. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

I Fox, E., Sudderth, E., Jordan, M. I., and Willsky, A. (2008). An HDP-HMM for systems with state persistence. In Proceedings of the International Conference on Machine Learning.

122 / 127

Page 139: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

References II

I Fox, E. B., Sudderth, E. B., Jordan, M. I., and Willsky, A. S. (2010). Sharing features among dynamical systems with beta processes. In Neural Information Processing Systems 22. MIT Press.

I Gasthaus, J., Wood, F., and Teh, Y. W. (2010). Lossless compression based on the sequence memoizer. In Data Compression Conference.

I Gelman, A., Carlin, J., Stern, H., and Rubin, D. (1995). Bayesian data analysis. Chapman & Hall, London.

I Ghahramani, Z., Griffiths, T. L., and Sollich, P. (2007). Bayesian nonparametric latent feature models (with discussion and rejoinder). In Bayesian Statistics, volume 8.

I Goldwater, S., Griffiths, T., and Johnson, M. (2006a). Interpolating between types and tokens by estimating power-law generators. In Advances in Neural Information Processing Systems, volume 18.

I Goldwater, S., Griffiths, T. L., and Johnson, M. (2006b). Contextual dependencies in unsupervised word segmentation. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics.

I Görür, D., Jäkel, F., and Rasmussen, C. E. (2006). A choice model with infinitely many latent features. In Proceedings of the International Conference on Machine Learning, volume 23.

I Griffiths, T. L. and Ghahramani, Z. (2006). Infinite latent feature models and the Indian buffet process. In Advances in Neural Information Processing Systems, volume 18.

I Heller, K. A., Teh, Y. W., and Görür, D. (2009). Infinite hierarchical hidden Markov models. In JMLR Workshop and Conference Proceedings: AISTATS 2009, volume 5, pages 224–231.

I Hjort, N. L. (1990). Nonparametric Bayes estimators based on beta processes in models for life history data. Annals of Statistics, 18(3):1259–1294.

123 / 127

Page 140: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

References III

I Ishwaran, H. and James, L. F. (2001). Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96(453):161–173.

I Johnson, M., Griffiths, T. L., and Goldwater, S. (2007). Adaptor grammars: A framework for specifying compositional nonparametric Bayesian models. In Advances in Neural Information Processing Systems, volume 19.

I Jordan, M. I. (2010). Hierarchical models, nested models and completely random measures. In Frontiers of Statistical Decision Making and Bayesian Analysis: In Honor of James O. Berger. New York: Springer.

I Knowles, D. and Ghahramani, Z. (2007). Infinite sparse factor analysis and infinite independent components analysis. In International Conference on Independent Component Analysis and Signal Separation, volume 7 of Lecture Notes in Computer Science. Springer.

I Liang, P., Petrov, S., Jordan, M. I., and Klein, D. (2007). The infinite PCFG using hierarchical Dirichlet processes. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

I Lijoi, A. and Pruenster, I. (2010). Models beyond the Dirichlet process. In Hjort, N., Holmes, C., Müller, P., and Walker, S., editors, Bayesian Nonparametrics. Cambridge University Press.

I MacKay, D. and Peto, L. (1994). A hierarchical Dirichlet language model. Natural Language Engineering.

I Meeds, E., Ghahramani, Z., Neal, R. M., and Roweis, S. T. (2007). Modeling dyadic data with binary latent factors. In Advances in Neural Information Processing Systems, volume 19.

I Neal, R. M. (1993). Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto.

I Neal, R. M. (2003). Density modeling and clustering using Dirichlet diffusion trees. In Bayesian Statistics, volume 7, pages 619–629.

124 / 127

Page 141: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

References IV

I Perman, M., Pitman, J., and Yor, M. (1992). Size-biased sampling of Poisson point processes and excursions. Probability Theory and Related Fields, 92(1):21–39.

I Pitman, J. and Yor, M. (1997). The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. Annals of Probability, 25:855–900.

I Rasmussen, C. E. (2000). The infinite Gaussian mixture model. In Advances in Neural Information Processing Systems, volume 12.

I Rasmussen, C. E. and Ghahramani, Z. (2001). Occam’s razor. In Advances in Neural Information Processing Systems, volume 13.

I Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press.

I Robert, C. P. and Casella, G. (2004). Monte Carlo statistical methods. Springer Verlag.

I Rodríguez, A., Dunson, D. B., and Gelfand, A. E. (2008). The nested Dirichlet process. Journal of the American Statistical Association, 103(483):1131–1154.

I Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica, 4:639–650.

I Sudderth, E. and Jordan, M. I. (2009). Shared segmentation of natural scenes using dependent Pitman-Yor processes. In Advances in Neural Information Processing Systems, volume 21.

I Sudderth, E., Torralba, A., Freeman, W., and Willsky, A. (2008). Describing visual scenes using transformed objects and parts. International Journal of Computer Vision, 77.

I Teh, Y. W. (2006a). A Bayesian interpretation of interpolated Kneser-Ney. Technical Report TRA2/06, School of Computing, National University of Singapore.

125 / 127

Page 142: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

References V

I Teh, Y. W. (2006b). A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 985–992.

I Teh, Y. W., Daume III, H., and Roy, D. M. (2008). Bayesian agglomerative clustering with coalescents. In Advances in Neural Information Processing Systems, volume 20, pages 1473–1480.

I Teh, Y. W. and Görür, D. (2009). Indian buffet processes with power-law behavior. In Advances in Neural Information Processing Systems, volume 22, pages 1838–1846.

I Teh, Y. W., Görür, D., and Ghahramani, Z. (2007). Stick-breaking construction for the Indian buffet process. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 11.

I Teh, Y. W. and Jordan, M. I. (2010). Hierarchical Bayesian nonparametric models with applications. In Hjort, N., Holmes, C., Müller, P., and Walker, S., editors, Bayesian Nonparametrics. Cambridge University Press.

I Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581.

I Thibaux, R. and Jordan, M. I. (2007). Hierarchical beta processes and the Indian buffet process. In Proceedings of the International Workshop on Artificial Intelligence and Statistics, volume 11, pages 564–571.

I Van Gael, J., Teh, Y. W., and Ghahramani, Z. (2009). The infinite factorial hidden Markov model. In Advances in Neural Information Processing Systems, volume 21, pages 1697–1704.

I Wainwright, M. J. and Jordan, M. I. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305.

I Wood, F., Archambeau, C., Gasthaus, J., James, L. F., and Teh, Y. W. (2009). A stochastic memoizer for sequence data. In Proceedings of the International Conference on Machine Learning, volume 26, pages 1129–1136.

126 / 127

Page 143: Bayesian Nonparametrics - University College Londonywteh/teaching/npbayes/kaist2010.pdfBayesian Nonparametrics Yee Whye Teh Gatsby Computational Neuroscience Unit University College

References VI

I Wood, F., Griffiths, T. L., and Ghahramani, Z. (2006). A non-parametric Bayesian method for inferring hidden causes. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, volume 22.

127 / 127