Transcript of "Bayesian Nonparametric Mixture, Admixture, and Language Models" (statmath.wu.ac.at/.../Yee_Whye_Teh_WU_2015.pdf)
Yee Whye TehUniversity of Oxford
Nov 2015
Bayesian Nonparametric Mixture, Admixture,
and Language Models
Overview
• Bayesian nonparametrics and random probability measures
  • Mixture models and clustering
• Hierarchies of Dirichlet processes
  • Modelling document collections with topic models
  • Modelling genetic admixtures in human populations
• Hierarchies of Pitman-Yor processes
  • Language modelling with high-order Markov models and power law statistics
  • Non-Markov language models with the sequence memoizer
Bayesian Nonparametrics
• Data x1,...,xn assumed iid from an underlying distribution μ.
• Inference on μ is done nonparametrically, within a Bayesian framework.
• “There are two desirable properties of a prior distribution for nonparametric problems:
• (I) The support of the prior distribution should be large—with respect to some suitable topology on the space of probability distributions on the sample space.
• (II) Posterior distributions given a sample of observations from the true probability distribution should be manageable analytically.”
• — Ferguson (1973)
μ ∼ P
x_i | μ ∼ μ   iid, for i = 1,…,n
[Hjort et al (eds) 2010]
Dirichlet Process
• Random probability measure μ ∼ DP(α, H).
• For each partition (A_1,…,A_m) of the sample space,

  (μ(A_1), …, μ(A_m)) ∼ Dir(αH(A_1), …, αH(A_m))

• Cannot use the Kolmogorov Consistency Theorem to construct the DP:
  • The space of probability measures is not in the product σ-field on [0,1]^B.
  • Instead, use a countable generator F for B and view μ ∈ [0,1]^F.
• Easier constructions:
  • Define an infinitely exchangeable sequence with directing random measure μ.
  • Define a gamma process and normalize it.
  • Explicit construction using the stick-breaking process.
[Ferguson 1973, Blackwell & MacQueen 1973, Sethuraman 1994, Pitman 2006]
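The stick-breaking construction mentioned above is easy to sketch. Below is a minimal truncated sampler; the truncation level and the Gaussian base measure H are illustrative choices, not from the slides:

```python
import numpy as np

def stick_breaking(alpha, H_sampler, n_atoms, rng):
    """Truncated stick-breaking draw from DP(alpha, H).

    w_k = v_k * prod_{j<k}(1 - v_j), with v_k ~ Beta(1, alpha),
    and atoms theta_k drawn iid from the base measure H.
    """
    v = rng.beta(1.0, alpha, size=n_atoms)
    w = v * np.cumprod(np.concatenate(([1.0], 1.0 - v[:-1])))
    theta = H_sampler(n_atoms)
    return theta, w

rng = np.random.default_rng(0)
# H = standard normal here, purely for illustration.
theta, w = stick_breaking(alpha=2.0,
                          H_sampler=lambda k: rng.standard_normal(k),
                          n_atoms=1000, rng=rng)
```

With 1000 sticks the leftover mass is negligible, so the weights are, up to truncation, a draw of the DP's atom masses.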
Dirichlet Process
• Analytically tractable posterior distribution.
• Well-studied process:
  • Rank-ordered masses have the Poisson-Dirichlet distribution.
  • Size-biased permuted masses have a simple iid Beta structure.
  • The corresponding exchangeable random partition is described by the Chinese restaurant process.
• Large support over the space of probability measures in the weak topology.
• Variety of convergence (and non-convergence) results.
• Draws from the DP are discrete w.p. 1.
Dirichlet Process Mixture Models
• Draws from DPs are discrete probability measures:

  μ = Σ_{k=1}^∞ w_k δ_{θ_k}

  where the w_k, θ_k are random.
• Typically used within a hierarchical model,

  φ_i | μ ∼ μ  (iid),   x_i | φ_i ∼ F(φ_i)

  leading to nonparametric mixture models.
• The discrete nature of μ induces repeated values among φ_{1:n}.
  • Induces a partition Π of [n] = {1,…,n}.
  • Leads to a clustering model with an unbounded/infinite number of clusters.
• Properties of the model for cluster analysis depend on the properties of the induced random partition Π (a Chinese restaurant process (CRP)).
• Generalisations of DPs allow for more flexible prior specifications.

[Antoniak 1974, Lo 1984]
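As a toy illustration of the generative process above: the atom set is truncated, and F(φ) = N(φ, 1) is an assumed likelihood, not specified on the slide. The discreteness of μ makes the φ_i repeat, inducing the partition of [n]:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, n, K = 2.0, 500, 200

# Truncated stick-breaking weights for mu; atoms theta_k ~ H = N(0, 5^2).
v = rng.beta(1.0, alpha, size=K)
w = v * np.cumprod(np.concatenate(([1.0], 1.0 - v[:-1])))
w /= w.sum()                       # renormalise the truncation
theta = 5.0 * rng.standard_normal(K)

# phi_i ~ mu, then x_i | phi_i ~ F(phi_i) = N(phi_i, 1).
ks = rng.choice(K, size=n, p=w)    # discreteness => repeated atoms
phi = theta[ks]
x = phi + rng.standard_normal(n)
n_clusters = len(np.unique(ks))    # induced partition of [n]
```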
Chinese Restaurant Processes

• Defines an exchangeable stochastic process over sequences φ_1, φ_2, …

  [Figure: Chinese restaurant seating — customers φ_1,…,φ_9 seated at tables serving dishes θ_1,…,θ_4]

  p(sit at table k) = c_k / (α + Σ_{j=1}^K c_j)
  p(sit at new table) = α / (α + Σ_{j=1}^K c_j)
  p(table serves dish y) = H(y)
  If i sits at table j: φ_i = θ_j.

• The de Finetti measure [Kingman 1978] is the Dirichlet process:

  μ ∼ DP(α, H)
  φ_i | μ ∼ μ   for i = 1, 2, …

[Blackwell & MacQueen 1973, Pitman 2006]
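The seating probabilities above translate directly into a sampler; this is an illustrative sketch of the induced random partition:

```python
import random

def crp(n, alpha, seed=0):
    """Sample a partition of [n] from the Chinese restaurant process.

    Customer i (0-indexed, so i customers are already seated) joins
    table k w.p. c_k / (alpha + i) and opens a new table
    w.p. alpha / (alpha + i), where c_k is the current table size.
    """
    rng = random.Random(seed)
    tables = []       # tables[k] = number of customers at table k
    assignment = []   # assignment[i] = table of customer i
    for i in range(n):
        u = rng.random() * (alpha + i)
        acc = 0.0
        for k, c in enumerate(tables):
            acc += c
            if u < acc:
                tables[k] += 1
                assignment.append(k)
                break
        else:  # no break: open a new table
            tables.append(1)
            assignment.append(len(tables) - 1)
    return assignment, tables

assignment, tables = crp(100, alpha=1.0)
```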
Density Estimation and Clustering
[Figure: predictive density estimate and co-clustering probability matrix (scale 0–1)]
[Favaro & Teh 2013]
Spike Sorting
Spike Sorting

[Figure: spike waveform clusters with sizes — 1: 319, 2: 62, 3: 40, 4: 270, 5: 44, 6: 306, 7: 826 (7a: 267, 7b: 525), 8: 123]
[Favaro & Teh 2013]
Families of Random Probability Measures
[Diagram: nested families of random probability measures — Poisson-Kingman ⊃ {Gibbs type, normalized random measures}; Gibbs type contains Pitman-Yor, Dirichlet, normalized stable, normalized generalized gamma, and normalized inverse Gaussian processes]

• Gibbs type: f(π_n) = V_{n,K} ∏_{k=1}^K W_{n_k}
• Normalized random measure: Σ ∼ CRM(ρ), μ = Σ / Σ(Θ)
• Poisson-Kingman: T ∼ γ, Σ | T ∼ CRM(ρ | Σ(Θ) = T), μ = Σ / T
• Gibbs-type index σ < 0: mixtures of finite Dirichlets
• Gibbs-type index σ = 0: mixtures of Dirichlets
• Gibbs-type index σ > 0: σ-stable Poisson-Kingman
Gibbs Type Partitions

• An exchangeable random partition Π is of Gibbs type if

  p(Π_n = π_n) = V_{n,K} ∏_{k=1}^K W_{n_k}

  where π_n has K clusters with sizes n_1,…,n_K.
• Exchangeability and the Gibbs form imply that, wlog,

  W_m = (1 − σ)(2 − σ) ⋯ (m − 1 − σ)

  where −∞ ≤ σ ≤ 1.
• The number of clusters K_n grows with n, with asymptotic distribution

  K_n / f(n) → S_σ

  for some random variable S_σ, where f(n) = 1, log n, n^σ for σ < 0, σ = 0, σ > 0 respectively.
• The choice of S_σ and σ is arbitrary and part of the prior specification.
• σ < 0: Bayesian finite mixture model.
• σ = 0: DP mixture model with hyperprior on α.
• σ > 0: σ-stable Poisson-Kingman process mixture model.

[Gnedin & Pitman 2006, De Blasi et al 2015, Lomeli et al 2015]
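The different K_n growth rates can be checked by simulation with the (two-parameter) Chinese restaurant process; a rough sketch, with arbitrary parameter values:

```python
import random

def num_tables(n, alpha, d, seed=0):
    """Number of clusters K_n under the two-parameter CRP:
    a new table is opened w.p. (alpha + d*K) / (alpha + i),
    where i customers are already seated and K tables exist."""
    rng = random.Random(seed)
    K = 0
    for i in range(n):
        if rng.random() * (alpha + i) < alpha + d * K:
            K += 1
    return K

n = 100_000
k_dp = num_tables(n, alpha=1.0, d=0.0)   # sigma = 0:   K_n ~ alpha * log n
k_py = num_tables(n, alpha=1.0, d=0.5)   # sigma = 0.5: K_n ~ S * n^sigma
```

With these settings k_dp stays near log n ≈ 12 while k_py is in the hundreds, matching the logarithmic vs polynomial regimes above.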
Other Uses of Random Probability Measures
• Species sampling [Lijoi, Pruenster, Favaro, Mena]
• Nonparametric regression [MacEachern, Dunson, Griffin etc]
• Flexibly modelling heterogeneity in data
• More general random measures:
  • Survival analysis [Hjort 1990]
  • Feature models [Griffiths & Ghahramani 2011, Broderick et al 2012]
• Building more complex models via different motifs:
  • hierarchical Bayes
  • measure-valued stochastic processes
  • spatial and temporal processes
  • relational models
[Hjort et al (eds) 2010]
Overview
• Bayesian nonparametrics and random probability measures
  • Mixture models and clustering
• Hierarchies of Dirichlet processes
  • Modelling document collections with topic models
  • Modelling genetic admixtures in human populations
• Hierarchies of Pitman-Yor processes
  • Language modelling with high-order Markov models and power law statistics
  • Non-Markov language models with the sequence memoizer
Hierarchical Bayesian Models
• Hierarchical modelling is an important overarching theme in modern statistics [Gelman et al 1995, James & Stein 1961].
• In machine learning, hierarchical models have been used for multitask learning, transfer learning, learning-to-learn and domain adaptation.

[Diagram: hierarchical model — shared μ_0 at the top, group-level μ_1, μ_2, μ_3, observations x_{1i}, x_{2i}, x_{3i}]
Clustering of Related Groups of Data
• Multiple groups of data.
• Wish to cluster each group, using DP mixture models.
• Clusters are shared across multiple groups.
Document Topic Modeling

• Model each document as a bag of words coming from an underlying set of topics [Hofmann 2001, Blei et al 2003].

  Example documents:
  "CARSON, Calif., April 3 - Nissan Motor Corp said it is raising the suggested retail price for its cars and trucks sold in the United States by 1.9 pct, or an average 212 dollars per vehicle, effective April 6...."
  "DETROIT, April 3 - Sales of U.S.-built new cars surged during the last 10 days of March to the second highest levels of 1987. Sales of imports, meanwhile, fell for the first time in years, succumbing to price hikes by foreign carmakers....."

  Example topics: auto industry, market economy, plain old English, US geography.

• Summarize documents.
• Document/query comparisons.
• Topics are shared across documents.
• Don't know #topics beforehand.
Multi-Population Genetics
European Asian African
• Individuals can be clustered into a number of genotypes, with each population having a different proportion of genotypes [Xing et al 2006].
• Sharing genotypes among individuals in a population, and across different populations.
• Indeterminate number of genotypes.
Genetic Admixtures
Dirichlet Process Mixture for Grouped Data?

• Introduce dependencies between groups by making parameters random?
• If H is smooth, then clusters will not be shared between groups: the atoms of G_1 and G_2 do not match up.
• But if the base distribution were discrete….

[Diagram: two DP mixtures — H and α generate G_1 and G_2; cluster parameters φ_{1i}, φ_{2i}; data x_{1i}, x_{2i}]
Hierarchical Dirichlet Process Mixture Models

• Making the base distribution discrete forces groups to share clusters.
• Hierarchical Dirichlet process:

  G_0 ∼ DP(γ, H)
  G_1 | G_0 ∼ DP(α, G_0)
  G_2 | G_0 ∼ DP(α, G_0)

• Extension to deeper hierarchies is straightforward.

[Diagram: graphical model — H and γ generate G_0; G_0 and α generate G_1 and G_2; cluster parameters φ_{1i}, φ_{2i}; data x_{1i}, x_{2i}]

[Teh et al 2006]
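A minimal truncated sketch of the hierarchy above: over a fixed finite set of atoms, G_j | G_0 ∼ DP(α, G_0) reduces to a Dirichlet distribution with parameter α·β, where β are the truncated stick-breaking weights of G_0. The truncation level and parameter values are my choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(2)
gamma_, alpha, K = 5.0, 3.0, 100

# G0 ~ DP(gamma, H) via truncated stick-breaking over K shared atoms.
v = rng.beta(1.0, gamma_, size=K)
beta = v * np.cumprod(np.concatenate(([1.0], 1.0 - v[:-1])))
beta /= beta.sum()

# G_j | G0 ~ DP(alpha, G0): restricted to the K atoms of G0, this is
# Dirichlet(alpha * beta) -- so G1 and G2 share G0's atoms (clusters)
# but weight them differently.
G1 = rng.dirichlet(alpha * beta)
G2 = rng.dirichlet(alpha * beta)
```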
Hierarchical Dirichlet Process Mixture Models

[Figure: draws of G_0, G_1, G_2 from the hierarchy, alongside the graphical model]
Document Topic Modeling
• Comparison of HDP and latent Dirichlet allocation (LDA).
• LDA is a parametric model, for which model selection is needed.
• HDP bypasses this step in the analysis.
Shared Topics
• Used a 3-level HDP to model shared topics in a collection of machine learning conference papers.
• Shown are the two largest topics shared between the Visual Sciences section and four other sections.
• Topics are summarized by the 10 most frequent words in each.

[Table: top-10 words of the two topics shared with each section]
• Cognitive Science: (1) task, representation, pattern, processing, trained, representations, three, process, unit, patterns; (2) examples, concept, similarity, Bayesian, hypotheses, generalization, numbers, positive, classes, hypothesis
• Neuroscience: (1) cells, cell, activity, response, neuron, visual, patterns, pattern, single, fig; (2) visual, cells, cortical, orientation, receptive, contrast, spatial, cortex, stimulus, tuning
• Algorithms & Architecture: (1) algorithms, test, approach, methods, based, point, problems, form, large, paper; (2) distance, tangent, image, images, transformation, transformations, pattern, vectors, convolution, simard
• Signal Processing: (1) visual, images, video, language, image, pixel, acoustic, delta, lowpass, flow; (2) signals, separation, signal, sources, source, matrix, blind, mixing, gradient, eq
Genetic Admixtures

G_0 ∼ DP(γ, H)
G_i | G_0 ∼ DP(α, G_0)
s_{i,l+1} ∼ Bernoulli(e^{−r d_l})
z_{i,l+1} | s_{i,l+1}, z_{il} ∼ δ_{z_{il}} if s_{i,l+1} = 1, and G_i if s_{i,l+1} = 0
x_{il} | z_{il} = θ_k ∼ Discrete(θ_{kl})

[de Iorio et al 2015]
Overview
• Bayesian nonparametrics and random probability measures
  • Mixture models and clustering
• Hierarchies of Dirichlet processes
  • Modelling document collections with topic models
  • Modelling genetic admixtures in human populations
• Hierarchies of Pitman-Yor processes
  • Language modelling with high-order Markov models and power law statistics
  • Non-Markov language models with the sequence memoizer
Sequence Models for Language and Text
• Probabilistic models for sequences of words and characters, e.g.
• Uses:
  • Natural language processing: speech recognition, OCR, machine translation.
  • Compression.
  • Cognitive models of language acquisition.
  • Sequence data arises in many other domains.
south, parks, road
s, o, u, t, h, _, p, a, r, k, s, _, r, o, a, d
Markov Models for Language and Text
• Probabilistic models for sequences of words and characters.
• Usually makes a Markov assumption:
• Order of Markov model typically ranges from ~3 to > 10.
P(south parks road) = P(south) × P(parks | south) × P(road | south parks)

P(south parks road) ≈ P(south) × P(parks | south) × P(road | parks)

[Photos: Andrey Markov, George E. P. Box]
• Consider a high-order Markov model:
• Large vocabulary size means naïvely estimating the parameters of this model from data counts is problematic for N > 2.
• Naïve regularization fails as well: most parameters have no associated data.
  • Smoothing.
  • Hierarchical Bayesian models.
High Dimensional Estimation
P(sentence) = ∏_i P(word_i | word_{i−N+1} … word_{i−1})

P_ML(word_i | word_{i−N+1} … word_{i−1}) = C(word_{i−N+1} … word_i) / C(word_{i−N+1} … word_{i−1})
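The count-based maximum-likelihood estimator above can be sketched for bigrams (N = 2); the toy corpus is invented for illustration:

```python
from collections import Counter

words = "south parks road is south of university parks".split()
N = 2  # bigram model

# C(context, w) and C(context): counts of N-grams and their contexts.
num = Counter(tuple(words[i:i + N]) for i in range(len(words) - N + 1))
den = Counter(tuple(words[i:i + N - 1]) for i in range(len(words) - N + 1))

def p_ml(w, context):
    """P_ML(w | context) = C(context, w) / C(context)."""
    return num[context + (w,)] / den[context]

p = p_ml("parks", ("south",))  # "south" occurs twice, once followed by "parks"
```

Any bigram never seen in the corpus gets probability zero here, which is exactly the sparsity problem smoothing addresses.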
Smoothing in Language Models
• Smoothing is a way of dealing with data sparsity by combining large and small models together.
• Combines expressive power of large models with better estimation of small models (cf bias-variance trade-off and hierarchical modelling).
P_smooth(word_i | word_{i−N+1..i−1}) = Σ_{n=1}^N λ(n) Q_n(word_i | word_{i−n+1..i−1})

P_smooth(road | south parks) = λ(3) Q_3(road | south parks) + λ(2) Q_2(road | parks) + λ(1) Q_1(road | ∅)
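A sketch of the interpolation formula above, with invented component models Q_1, Q_2, Q_3 and mixing weights λ (all values hypothetical):

```python
def smoothed(w, context, Q, lambdas):
    """P_smooth(w | context) = sum_n lambda(n) * Q_n(w | last n-1 context words).

    Q[n] is a function (context_tuple, w) -> probability; lambdas sum to 1.
    """
    total = 0.0
    for n, lam in enumerate(lambdas, start=1):
        start = max(0, len(context) - (n - 1))  # keep the last n-1 words
        total += lam * Q[n](tuple(context[start:]), w)
    return total

# Hypothetical unigram/bigram/trigram component models.
Q = {
    1: lambda ctx, w: {"road": 0.1}.get(w, 0.0),
    2: lambda ctx, w: 0.4 if ctx == ("parks",) and w == "road" else 0.0,
    3: lambda ctx, w: 0.7 if ctx == ("south", "parks") and w == "road" else 0.0,
}
p = smoothed("road", ("south", "parks"), Q, lambdas=[0.2, 0.3, 0.5])
# 0.2*0.1 + 0.3*0.4 + 0.5*0.7 = 0.49
```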
Smoothing in Language Models
[Chen and Goodman 1998]
Context Tree

• Contexts of conditional probabilities are naturally organized using a tree.
• Smoothing makes conditional probabilities of neighbouring contexts more similar.
• Later words in the context are more important in predicting the next word.

[Diagram: context tree with root ∅, child "parks", grandchild "south parks", and further contexts "to parks", "university parks", "along south parks", "at south parks"]

P_smooth(road | south parks) = λ(3) Q_3(road | south parks) + λ(2) Q_2(road | parks) + λ(1) Q_1(road | ∅)
Hierarchical Bayesian Models on Context Tree

• Parametrize the conditional probabilities of the Markov model:

  P(word_i = w | word_{i−N+1..i−1} = u) = G_u(w)
  G_u = [G_u(w)]_{w ∈ vocabulary}

• G_u is a probability vector associated with context u.
• Obvious choice: hierarchical Dirichlet distributions.

[Diagram: context tree with nodes G_∅, G_parks, G_south parks, G_to parks, G_university parks, G_along south parks, G_at south parks]

[MacKay and Peto 1994]
Hierarchical Dirichlet Language Models
• What is P(G_u | G_pa(u))? [MacKay and Peto 1994] proposed using the standard Dirichlet distribution over probability vectors.
• We will use Pitman-Yor processes instead [Pitman and Yor 1997, Ishwaran and James 2001].

Perplexity results (T = training set size, N−1 = context length; IKN/MKN = interpolated/modified Kneser-Ney, HDLM = hierarchical Dirichlet language model):

| T | N−1 | IKN | MKN | HDLM |
|---|-----|-----|-----|------|
| 2×10⁶ | 2 | 148.8 | 144.1 | 191.2 |
| 4×10⁶ | 2 | 137.1 | 132.7 | 172.7 |
| 6×10⁶ | 2 | 130.6 | 126.7 | 162.3 |
| 8×10⁶ | 2 | 125.9 | 122.3 | 154.7 |
| 10×10⁶ | 2 | 122.0 | 118.6 | 148.7 |
| 12×10⁶ | 2 | 119.0 | 115.8 | 144.0 |
| 14×10⁶ | 2 | 116.7 | 113.6 | 140.5 |
| 14×10⁶ | 1 | 169.9 | 169.2 | 180.6 |
| 14×10⁶ | 3 | 106.1 | 102.4 | 136.6 |
Exchangeable Random Partition

• Easiest to understand Pitman-Yor processes using Chinese restaurant processes.
• Defines an exchangeable stochastic process over sequences x_1, x_2, …

  [Figure: Chinese restaurant seating — customers x_1,…,x_9 seated at tables serving dishes y_1,…,y_4]

  p(sit at table k) = (c_k − d) / (α + Σ_{j=1}^K c_j)
  p(sit at new table) = (α + dK) / (α + Σ_{j=1}^K c_j)
  p(table serves dish y) = H(y)
  If i sits at table c: x_i = y_c.

• The de Finetti measure [Kingman 1978] is the Pitman-Yor process:

  G ∼ PY(α, d, H)
  x_i | G ∼ G   for i = 1, 2, …

[Pitman & Yor 1997]
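The two-parameter seating rule above can be simulated to see the heavy tail of table sizes; parameter values here are illustrative:

```python
import random

def pitman_yor_crp(n, alpha, d, seed=0):
    """Two-parameter CRP: existing table k w.p. (c_k - d)/(alpha + i),
    new table w.p. (alpha + d*K)/(alpha + i). Returns sorted table sizes."""
    rng = random.Random(seed)
    counts = []
    for i in range(n):
        u = rng.random() * (alpha + i)
        K = len(counts)
        if u < alpha + d * K:
            counts.append(1)          # open a new table
        else:
            u -= alpha + d * K
            acc, placed = 0.0, False
            for k in range(K):
                acc += counts[k] - d
                if u < acc:
                    counts[k] += 1
                    placed = True
                    break
            if not placed:            # guard against float rounding
                counts[-1] += 1
    return sorted(counts, reverse=True)

counts = pitman_yor_crp(5000, alpha=1.0, d=0.8)
# Heavy tail: a large share of tables hold a single customer (rare "words"),
# while a few tables are very large (common "words").
singletons = sum(c == 1 for c in counts)
```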
Power Law Properties of Pitman-Yor Processes
• Chinese restaurant process:

  p(sit at table k) ∝ c_k − d
  p(sit at new table) ∝ α + dK

• Pitman-Yor processes produce distributions over words given by a power-law distribution with index 1 + d:
  • customers = word instances, tables = dictionary look-up;
  • small number of common word types;
  • large number of rare word types.
• This is more suitable for languages than Dirichlet distributions.
• [Goldwater, Griffiths and Johnson 2005] investigated the Pitman-Yor process from this perspective.
Pitman-Yor Processes

[Figure: log-log plot of word frequency vs rank (10⁰–10⁶ on both axes) — the Pitman-Yor process matches the power law of English text, while the Dirichlet does not]
Power Law Properties of Pitman-Yor Processes

[Figure: number of tables vs number of customers (0–10000) for α=1, d=0.5 and for α=30, d=0]
Hierarchical Pitman-Yor Language Models
• Parametrize the conditional probabilities of the Markov model:

  P(word_i = w | word_{i−N+1..i−1} = u) = G_u(w)
  G_u = [G_u(w)]_{w ∈ vocabulary}

• G_u is a probability vector associated with context u.
• Place a Pitman-Yor process prior on each G_u.

[Diagram: context tree with nodes G_∅, G_parks, G_south parks, G_to parks, G_university parks, G_along south parks, G_at south parks]
Hierarchical Pitman-Yor Language Models
• Significantly improved on the hierarchical Dirichlet language model.
• Results are better than Kneser-Ney smoothing, the state-of-the-art language models.
• Similarity of perplexities is not a surprise: Kneser-Ney can be derived as a particular approximate inference method.

| T | N−1 | IKN | MKN | HDLM | HPYLM |
|---|-----|-----|-----|------|-------|
| 2×10⁶ | 2 | 148.8 | 144.1 | 191.2 | 144.3 |
| 4×10⁶ | 2 | 137.1 | 132.7 | 172.7 | 132.7 |
| 6×10⁶ | 2 | 130.6 | 126.7 | 162.3 | 126.4 |
| 8×10⁶ | 2 | 125.9 | 122.3 | 154.7 | 121.9 |
| 10×10⁶ | 2 | 122.0 | 118.6 | 148.7 | 118.2 |
| 12×10⁶ | 2 | 119.0 | 115.8 | 144.0 | 115.4 |
| 14×10⁶ | 2 | 116.7 | 113.6 | 140.5 | 113.2 |
| 14×10⁶ | 1 | 169.9 | 169.2 | 180.6 | 169.3 |
| 14×10⁶ | 3 | 106.1 | 102.4 | 136.6 | 101.9 |
[Teh 2006]
Markov Models for Language and Text
• Usually makes a Markov assumption to simplify the model:

  P(south parks road) ≈ P(south) × P(parks | south) × P(road | south parks)

• Language models: usually Markov models of order 2-4 (3-5-grams).
• How do we determine the order of our Markov models?
• Is the Markov assumption a reasonable assumption?
• Be nonparametric about Markov order...
Non-Markov Models for Language and Text
• Model the conditional probabilities of each possible word occurring after each possible context (of unbounded length).
• Use hierarchical Pitman-Yor process prior to share information across all contexts.
• Hierarchy is infinitely deep.
• Sequence memoizer.
[Diagram: infinitely deep context tree with nodes G_∅, G_parks, G_south parks, G_to parks, G_university parks, G_along south parks, G_at south parks, G_meet at south parks, …]

[Wood et al 2009]
Model Size: Infinite → O(T²)

• The sequence memoizer model is very large (actually, infinite).
• Given a training sequence (e.g. o,a,c,a,c), most of the model can be ignored (integrated out), leaving a finite number of nodes in the context tree.
• But there are still O(T²) nodes in the context tree...

[Diagram: context tree for o,a,c,a,c with nodes H, G_[], G_[o], G_[a], G_[c], G_[oa], G_[ac], G_[ca], G_[oac], G_[aca], G_[cac], G_[oaca], G_[acac], G_[oacac]]
Model Size: Infinite → O(T²) → 2T

• Idea: integrate out non-branching, non-leaf nodes of the context tree.
• The resulting tree is related to a suffix tree data structure, and has at most 2T nodes.
• There are linear time construction algorithms [Ukkonen 1995].

[Diagram: compacted context tree with nodes H, G_[], G_[o], G_[a], G_[ac], G_[oa], G_[oac], G_[oaca], G_[oacac]; edges labelled with substrings such as "oac" and "ac"]
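The idea of collapsing non-branching nodes can be sketched on a toy trie, with nested dicts standing in for the context tree. This is only an illustration of path compression, not Ukkonen's algorithm:

```python
def compress(node):
    """Collapse non-branching internal nodes: a chain of single-child
    nodes a -> b -> c becomes one edge labelled 'abc'."""
    out = {}
    for label, child in node.items():
        # Follow the chain while the child has exactly one child.
        while len(child) == 1:
            (nxt_label, nxt_child), = child.items()
            label += nxt_label
            child = nxt_child
        out[label] = compress(child)
    return out

# A small hypothetical context trie (one symbol per edge).
trie = {
    "o": {"a": {"c": {}}},           # chain o-a-c collapses to edge "oac"
    "a": {"c": {}},                  # chain a-c collapses to edge "ac"
    "c": {"a": {"c": {}}, "o": {}},  # branching node: kept as-is
}
compressed = compress(trie)
# -> {'oac': {}, 'ac': {}, 'c': {'ac': {}, 'o': {}}}
```

Leaves and branching nodes survive; every non-branching interior node is absorbed into an edge label, mirroring the O(T²) → 2T reduction above.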
Closure under Marginalization
• In marginalizing out non-branching interior nodes, we need to ensure that the resulting conditional distributions are still tractable.
• E.g. if each conditional is Dirichlet, the resulting conditional is not of known analytic form:

  G_[ca] | G_[a] ∼ PY(α_2, d_2, G_[a])
  G_[aca] | G_[ca] ∼ PY(α_3, d_3, G_[ca])
  G_[aca] | G_[a] ∼ ?
Closure under Marginalization
• In marginalizing out non-branching interior nodes, we need to ensure that the resulting conditional distributions are still tractable.
• The hierarchical construction is equivalent to coagulation, so the marginal process is Pitman-Yor distributed as well:

  If G_[ca] | G_[a] ∼ PY(α_2, d_2, G_[a]) and G_[aca] | G_[ca] ∼ PY(α_2 d_3, d_3, G_[ca]),
  then marginally G_[aca] | G_[a] ∼ PY(α_2 d_3, d_2 d_3, G_[a]).
Comparison to Finite Order HPYLM
Compression Results

Calgary corpus. SM inference: particle filter. PPM: Prediction by Partial Matching. CTW: Context Tree Weighting. Online inference, entropy coding.

| Model | Average bits/byte |
|-------|-------------------|
| gzip | 2.61 |
| bzip2 | 2.11 |
| CTW | 1.99 |
| PPM | 1.93 |
| Sequence Memoizer | 1.89 |
Summary
• Random probability measures are building blocks of many Bayesian nonparametric models.
• Motivated by problems in text and language processing, we discussed methods of constructing hierarchies of random measures.
• We used Pitman-Yor processes to capture the power law behaviour of language data.
• We used the equivalence between hierarchies and coagulations, and a duality between fragmentations and coagulations, to construct an efficient non-Markov language model.
Thank You!

Acknowledgements:
Matt Beal, Dave Blei, Mike Jordan, Frank Wood, Jan Gasthaus, Cedric Archambeau, Lancelot James.

Lee Kuan Yew Foundation
Gatsby Charitable Foundation
European Research Council
EPSRC