Language Modeling, Pitman-Yor, Beta Process & Stable Beta Process
Lawrence Carin
Department of Electrical & Computer Engineering
Duke University
Durham, NC
Review of the following papers:
Y.W. Teh, “A Bayesian Interpretation of Interpolated Kneser-Ney”
R. Thibaux and M.I. Jordan, “Hierarchical Beta Process and the IBP”
Y.W. Teh and D. Gorur, “Indian Buffet Processes with Power-Law Behavior”
Lawrence Carin Language Modeling, Pitman-Yor, Beta Process & Stable Beta Process
Agenda
I Language modeling, n-grams, and Kneser-Ney
I Dirichlet process, Pitman-Yor process, power-law behavior
I Pitman-Yor for hierarchical language modeling & connection to Kneser-Ney
I Other power-law phenomena: Beta process, stable beta process, and IBP with power-law behavior
I Future research
Word Histories and n-Grams
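I Assume we have a sentence of $t$ words: $\mathrm{word}_1, \mathrm{word}_2, \ldots, \mathrm{word}_t$
I In language modeling the probability of these words is typically represented as
\[ p(\mathrm{word}_1, \mathrm{word}_2, \ldots, \mathrm{word}_t) = \prod_{i=1}^{t} p(\mathrm{word}_i \mid \mathrm{word}_1, \ldots, \mathrm{word}_{i-1}) \]
I In n-gram models, we only retain up to $n-1$ previous words:
\[ p(\mathrm{word}_i \mid \mathrm{word}_1, \ldots, \mathrm{word}_{i-1}) = p(\mathrm{word}_i \mid \mathrm{word}_{i-n+1}, \ldots, \mathrm{word}_{i-1}) \]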
Data and Modeling
I Assume a word vocabulary $W$, composed of $V$ words
I Each word is $w \in W$, and the context of $n-1$ prior words is represented as $u \in W^{n-1}$
I Based on the available corpus, let $c_{uw}$ represent the number of times word $w$ followed context $u$
I The simplest model estimate for context-word probabilities is
\[ P^{ML}_u(w) = \frac{c_{uw}}{c_{u\cdot}}, \qquad \text{where } c_{u\cdot} = \sum_{w'} c_{uw'} \]
I This is expected to be a very poor estimate given a realistic corpus size, since any word that never followed $u$ in the corpus is assigned probability zero
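The count-based estimate above can be sketched in a few lines of Python. This is a minimal illustration, not code from the reviewed papers; the function name `ml_ngram_estimates` and the toy corpus are assumptions made for the example.

```python
from collections import Counter, defaultdict


def ml_ngram_estimates(tokens, n):
    """Return {context u: {word w: c_uw / c_u.}} for an n-gram model."""
    counts = defaultdict(Counter)            # counts[u][w] = c_uw
    for i in range(n - 1, len(tokens)):
        u = tuple(tokens[i - n + 1:i])       # the n-1 preceding words
        counts[u][tokens[i]] += 1
    probs = {}
    for u, followers in counts.items():
        total = sum(followers.values())      # c_u.
        probs[u] = {w: c / total for w, c in followers.items()}
    return probs


# Hypothetical toy corpus, bigram model (n = 2).
toy = "the cat sat on the mat the cat ran".split()
p = ml_ngram_estimates(toy, n=2)
print(p[("the",)])   # only "cat" and "mat" receive non-zero probability
```

Every word not seen after "the" gets probability zero here, which is the weakness that smoothing addresses.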
Smoothing and Interpolation
I Introduce “Interpolated Kneser-Ney”, or IKN
I We “discount” all non-zero counts by $d_{|u|}$, and define $t_{u\cdot} = \#\{w' : c_{uw'} > 0\}$, so $t_{u\cdot}$ is the number of distinct words that follow $u$ in the corpus
I Finally, let $\pi(u)$ represent context $u$ with the first (most distant) word removed
I The IKN estimate for the probability of $w \in W$ given context $u$ is
\[ P^{IKN}_u(w) = \left[\frac{c_{u\cdot} - t_{u\cdot} d_{|u|}}{c_{u\cdot}}\right] \left[\frac{\max(0, c_{uw} - d_{|u|})}{c_{u\cdot} - t_{u\cdot} d_{|u|}}\right] + \frac{d_{|u|} t_{u\cdot}}{c_{u\cdot}} P^{IKN}_{\pi(u)}(w) \]
I With probability $1 - d_{|u|} t_{u\cdot}/c_{u\cdot}$ use the discounted empirical probabilities, and with probability $d_{|u|} t_{u\cdot}/c_{u\cdot}$ use the IKN probability from the context reduced by one word
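As a quick check on the estimate above, the two bracketed factors can be multiplied out and the result summed over $w$. The derivation assumes $0 \le d_{|u|} \le 1$ (so that every non-zero count is at least as large as the discount) and that $P^{IKN}_{\pi(u)}$ already sums to one; neither assumption is stated explicitly on the slide.

```latex
\begin{align*}
P^{IKN}_u(w) &= \frac{\max(0,\, c_{uw} - d_{|u|})}{c_{u\cdot}}
              + \frac{d_{|u|}\, t_{u\cdot}}{c_{u\cdot}}\, P^{IKN}_{\pi(u)}(w), \\
\sum_{w \in W} P^{IKN}_u(w) &= \frac{c_{u\cdot} - t_{u\cdot}\, d_{|u|}}{c_{u\cdot}}
              + \frac{d_{|u|}\, t_{u\cdot}}{c_{u\cdot}} = 1 .
\end{align*}
```

The mass removed by discounting, $d_{|u|} t_{u\cdot}/c_{u\cdot}$, is exactly the weight given to the shorter context.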
Recursive Estimation
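\[ P^{IKN}_u(w) = \left[\frac{c_{u\cdot} - t_{u\cdot} d_{|u|}}{c_{u\cdot}}\right] \left[\frac{\max(0, c_{uw} - d_{|u|})}{c_{u\cdot} - t_{u\cdot} d_{|u|}}\right] + \frac{d_{|u|} t_{u\cdot}}{c_{u\cdot}} P^{IKN}_{\pi(u)}(w) \]
I The expression $P^{IKN}_{\pi(u)}(w)$ is represented recursively in terms of context $\pi(\pi(u))$, and so on successively until there is no more context
I The discount $d_{m-1}$ is a function of the successive context length $m$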
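The recursion can be sketched as below. This is a hedged illustration under stated assumptions rather than a reference implementation: raw counts are used at every context length (the slide does not say how lower-order counts are formed, and the original Kneser-Ney method modifies them), the recursion bottoms out in a uniform distribution over the vocabulary, and the names `count_all_orders` and `p_ikn` are hypothetical.

```python
from collections import Counter, defaultdict


def count_all_orders(tokens, n):
    """counts[u][w] = c_uw for every context length |u| = 0 .. n-1."""
    counts = defaultdict(Counter)
    for i in range(len(tokens)):
        for k in range(n):                   # k = context length |u|
            if i - k >= 0:
                counts[tuple(tokens[i - k:i])][tokens[i]] += 1
    return counts


def p_ikn(w, u, counts, discounts, vocab_size):
    """Recursive interpolated estimate P_u(w), following the slide's formula."""
    if len(u) == 0:
        lower = 1.0 / vocab_size             # assumption: uniform below the empty context
    else:
        lower = p_ikn(w, u[1:], counts, discounts, vocab_size)  # pi(u)
    followers = counts.get(u)
    if not followers:                        # context never observed: back off fully
        return lower
    c_total = sum(followers.values())        # c_u.
    t = len(followers)                       # t_u. = number of distinct followers
    d = discounts[len(u)]                    # d_|u|
    return max(0.0, followers[w] - d) / c_total + (d * t / c_total) * lower


toy = "the cat sat on the mat the cat ran".split()
vocab = sorted(set(toy))
counts = count_all_orders(toy, n=3)
d = [0.5, 0.5, 0.5]                          # hypothetical discounts d_0, d_1, d_2
print(p_ikn("mat", ("the", "cat"), counts, d, len(vocab)))  # unseen trigram, still > 0
```

Unlike the maximum-likelihood estimate, an unseen continuation now receives mass from the shorter contexts.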
Tree Structure
I For visualization purposes, consider a vocabulary of size V = 4
I There are four words: w1, w2, w3, w4
I Consider n-gram with n = 3
Agenda
I Language modeling, n-grams, and Kneser-Ney
I Dirichlet process, Pitman-Yor process, power-law behavior
I Pitman-Yor for hierarchical language modeling & connection to Kneser-Ney
I Other power-law phenomena: Beta process, stable beta process, and IBP with power-law behavior
I Future research
Generalized Stick-Breaking Priors
I Consider the probability measure
\[ P(\cdot) = \sum_{k=1}^{\infty} p_k \delta_{Z_k}(\cdot) \]
with $Z_k$ iid draws from a measure $H$, and
\[ p_k = V_k \prod_{h=1}^{k-1} (1 - V_h), \qquad V_i \sim \mathrm{Beta}(a_i, b_i) \]
with $a_i, b_i > 0$
I May also have a finite sum with $N$ terms, with $V_N = 1$
I The Pitman-Yor process is defined by two parameters $0 \le a < 1$ and $b > -a$, with $a_k = 1 - a$ and $b_k = b + ka$
I The special case $a = 0$ and $b = \alpha > 0$ corresponds to the Dirichlet process
I The draw of $Y_1, \ldots, Y_n$ from $\mathrm{PY}(a, b, H)$ is represented as
\[ Y_i \mid P \sim P, \quad i = 1, \ldots, n, \qquad P \sim \mathrm{PY}(a, b, H) \]
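The stick-breaking weights can be simulated directly from the definition. The sketch below uses the finite truncation with $V_N = 1$ mentioned above; the function name `py_stick_weights`, the truncation level, and the parameter values are assumptions chosen for illustration.

```python
import random


def py_stick_weights(a, b, num_sticks, rng):
    """Truncated stick-breaking weights p_1..p_N for PY(a, b, H)."""
    weights, remaining = [], 1.0
    for k in range(1, num_sticks + 1):
        # V_k ~ Beta(1 - a, b + k a); the last stick takes all remaining mass (V_N = 1).
        v = 1.0 if k == num_sticks else rng.betavariate(1.0 - a, b + k * a)
        weights.append(v * remaining)        # p_k = V_k * prod_{h<k} (1 - V_h)
        remaining *= 1.0 - v
    return weights


rng = random.Random(0)
p_dp = py_stick_weights(a=0.0, b=1.0, num_sticks=1000, rng=rng)  # Dirichlet process
p_py = py_stick_weights(a=0.5, b=1.0, num_sticks=1000, rng=rng)  # Pitman-Yor
print(sum(p_dp), sum(p_py))                  # both sum to 1 up to rounding
```

Pairing each weight with an independent draw $Z_k \sim H$ gives a truncated sample of the random measure $P$.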
Exchangeability and Full Conditionals
I Assume we observe $Y_1, \ldots, Y_{i-1}$, where $Y^*_{1,i}, \ldots, Y^*_{m_i,i}$ represent the $m_i$ unique draws and $n^*_{j,i}$ represents the number of times $Y^*_{j,i}$ is drawn
I Pitman proved the following prediction rule
\[ P(Y_i \in \cdot \mid Y_1, \ldots, Y_{i-1}) = \left[\frac{i - 1 - m_i a}{b + i - 1}\right] \sum_{j=1}^{m_i} \frac{n^*_{j,i} - a}{i - 1 - m_i a} \delta_{Y^*_{j,i}}(\cdot) + \frac{b + a m_i}{b + i - 1} H(\cdot) \]
I Given that we observed $Y_1, \ldots, Y_{i-1}$ with base measure $H$, this gives the probability of what we will see for $Y_i$
I Note that when $b = 0$, this looks exactly like the IKN construction discussed above for language models, with $a$ in the role of the discount $d_{|u|}$, $i - 1$ in the role of $c_{u\cdot}$, $m_i$ in the role of $t_{u\cdot}$, and $H$ in the role of $P^{IKN}_{\pi(u)}$
I The joint distribution implied by this rule does not depend on the order of the data, so the sequence is exchangeable
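The prediction rule can be used to generate $Y_1, \ldots, Y_n$ one at a time without ever representing $P$. The following is a minimal sketch; the function name `py_predictive_draws` and the choice of a uniform base measure on [0, 1] in the usage lines are assumptions made for the example.

```python
import random


def py_predictive_draws(a, b, n, base_draw, rng):
    """Draw Y_1..Y_n sequentially from the Pitman-Yor prediction rule."""
    atoms, counts, draws = [], [], []        # Y*_j and n*_j
    for i in range(1, n + 1):
        m = len(atoms)
        # New atom from H with probability (b + a m) / (b + i - 1); always for i = 1.
        if i == 1 or rng.random() * (b + i - 1) < b + a * m:
            atoms.append(base_draw(rng))
            counts.append(1)
            j = m
        else:
            # Existing atom j with probability proportional to n*_j - a.
            j = rng.choices(range(m), weights=[c - a for c in counts])[0]
            counts[j] += 1
        draws.append(atoms[j])
    return draws, counts


rng = random.Random(0)
draws, counts = py_predictive_draws(0.5, 1.0, 1000, lambda r: r.random(), rng)
print(len(counts), "unique values among", len(draws), "draws")
```

The discount `a` is subtracted from each existing count, and that removed mass is what makes new atoms more likely than under the Dirichlet process.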
Asymptotic Behavior of PY & Power Law
I In the language model, the $\delta_{Y^*_{j,i}}(\cdot)$ correspond to words that have been observed previously in the corpus, following a particular context
I Interested in the probability of occurrence of new-word “outliers” not seen in corpus, as outliers are important in language
I For language model we desire a longer “tail” on distribution of new words
I Note that this is also witnessed by decreasing-amplitude sticks with increasing $k$
\[ P(\cdot) = \sum_{k=1}^{\infty} p_k \delta_{Z_k}(\cdot), \qquad p_k = V_k \prod_{h=1}^{k-1} (1 - V_h), \qquad V_i \sim \mathrm{Beta}(1 - a, b + ia) \]
I If $T$ is the total number of draws from such processes ($T$ word occurrences for a given context), then the number of unique words scales as $O(bT^a)$ for the PY, and $O(\alpha \log T)$ for DP ($a = 0$ and $b = \alpha > 0$)
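The two growth rates can be seen in a small simulation. Only the number of unique values is tracked, using the new-value probability $(b + am)/(b + i - 1)$ from the prediction rule. This is an illustrative sketch; the function name `num_unique` and the parameter values are assumptions, and a single run gives only a rough picture of the scaling.

```python
import random


def num_unique(a, b, T, rng):
    """Number of distinct values after T >= 1 draws from PY(a, b, H), H continuous."""
    m = 1                                    # the first draw is always new
    for i in range(2, T + 1):
        # Draw i is a new value with probability (b + a m) / (b + i - 1).
        if rng.random() * (b + i - 1) < b + a * m:
            m += 1
    return m


rng = random.Random(0)
for T in (100, 1000, 10000):
    dp = num_unique(0.0, 1.0, T, rng)        # Dirichlet process: logarithmic growth
    py = num_unique(0.7, 1.0, T, rng)        # Pitman-Yor: power-law growth
    print(T, dp, py)
```

Under these settings the Pitman-Yor count keeps growing polynomially in T while the Dirichlet-process count barely moves.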
Agenda
I Language modeling, n-grams, and Kneser-Ney
I Dirichlet process, Pitman-Yor process, power-law behavior
I Pitman-Yor for hierarchical language modeling & connection to Kneser-Ney
I Other power-law phenomena: Beta process, stable beta process, and IBP with power-law behavior
I Future research
Hierarchical Bayesian Construction
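I Consider contexts $u$ corresponding to n-grams, yielding the hierarchical construction
\[ \begin{aligned}
G_{\emptyset} &\sim \mathrm{PY}(d_0, \alpha_0, U) \\
G_{[x_1]} &\sim \mathrm{PY}(d_1, \alpha_1, G_{\emptyset}) \\
&\;\;\vdots \\
G_{[x_{n-1} \ldots x_1]} &\sim \mathrm{PY}(d_{n-1}, \alpha_{n-1}, G_{[x_{n-2} \ldots x_1]}) \\
w_t \mid w_{t-n+1} \cdots w_{t-1} &\sim G_{[w_{t-n+1} \ldots w_{t-1}]}
\end{aligned} \]
I The draws from the PY measures are performed for all word combinations, e.g., $G_{[x_{n-1} \ldots x_1]}$ for all combinations $x_{n-1} \ldots x_1 \in W^{n-1}$
I Unknown parameters $\{d_i, \alpha_i\}_{i=0,n-1}$, implying $2n$ unknown parameters (on which we place priors)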
Learning
I We consider all mixture measures $G_{\emptyset}, G_{[x_1]}, \ldots, G_{[x_{n-1} \ldots x_1]}$ as Chinese restaurants (considering all $x \in W$)
I Assume we are given a corpus of data $\{u_i, w_i\}_{i=1,N}$ where $u_i \in W^{n-1}$ is the $i$th example context, and $w_i$ is the associated observed word
I We wish to infer the unknown parameters $\{d_i, \alpha_i\}_{i=0,n-1}$, as well as where all $N$ samples are “sitting” in the restaurants
I We find a posterior on these quantities, and then use the posterior to make an inference for $w_{N+1}$ given a new $u_{N+1} \in W^{n-1}$
I By construction, the training data are exchangeable, so their order does not matter
Sequence of Restaurant Customers - 1/2
I Consider first sample $u_1$, where we represent $u_1 = w^{(1)}_{n-1} \ldots w^{(1)}_1$
I Since this is the first “customer” in this restaurant, we must draw from the base measure $G_{[w^{(1)}_{n-2} \ldots w^{(1)}_1]}$, which corresponds to a separate restaurant
I Since there are no customers in this restaurant, we must then draw from the base $G_{[w^{(1)}_{n-3} \ldots w^{(1)}_1]}$
I The first customer sequentially visits PY draws, corresponding to restaurants, until finally drawing from $U$
I Assume word drawn for $u_1$ is represented $w^*_1 \in W$
I There is now a “table” at each restaurant visited until arriving at $U$, with corresponding “dish” $w^*_1$, and one customer at each such table
Sequence of Restaurant Customers - 2/2
I Now consider “customer” $K > 1$, corresponding to context $u_K = [w^{(K)}_{n-1} \ldots w^{(K)}_1]$
I There are $K - 1$ customers distributed among the “franchise” of restaurants, with at most $K - 1$ tables at any restaurant
I The first restaurant visited by $u_K$ corresponds to $G_{[w^{(K)}_{n-1} \ldots w^{(K)}_1]}$, which has $m \le K - 1$ tables, with $c_j$ customers at table $j$
\[ G_{[w^{(K)}_{n-1} \ldots w^{(K)}_1]} = \sum_{j=1}^{m} \frac{c_j - d}{\alpha + K - 1} \delta_{w^*_j} + \frac{\alpha + md}{\alpha + K - 1} G_{[w^{(K)}_{n-2} \ldots w^{(K)}_1]} \]
I With probability $\frac{c_j - d}{\alpha + K - 1}$ the customer will sit at table $j$ and “eat” $w^*_j$, and with probability $\frac{\alpha + md}{\alpha + K - 1}$ will transfer to the “restaurant” corresponding to $G_{[w^{(K)}_{n-2} \ldots w^{(K)}_1]}$ (here $K - 1$ stands for the number of customers already seated in this restaurant, $\sum_j c_j$)
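The seating process on the two slides above can be sketched as a chain of restaurant objects, each of which sends a customer to its parent whenever it opens a new table. This is a generative (prior) simulation only, not the posterior inference over seating for an observed corpus. The class `Restaurant`, the helper `get_restaurant`, the uniform choice for $U$, and the parameter values are all assumptions made for illustration.

```python
import random


class Restaurant:
    """One Pitman-Yor 'restaurant' G_u with discount d, strength alpha, and a parent."""

    def __init__(self, d, alpha, parent, vocab):
        self.d, self.alpha, self.parent, self.vocab = d, alpha, parent, vocab
        self.tables = []                     # each entry is [dish, customer_count]

    def seat(self, rng):
        """Seat one new customer and return the dish (word) served."""
        n = sum(c for _, c in self.tables)   # customers already in this restaurant
        m = len(self.tables)
        # Open a new table with probability (alpha + m d) / (alpha + n).
        if n == 0 or rng.random() * (self.alpha + n) < self.alpha + m * self.d:
            if self.parent is None:
                dish = rng.choice(self.vocab)        # draw from U, assumed uniform
            else:
                dish = self.parent.seat(rng)         # one customer enters the parent
            self.tables.append([dish, 1])
            return dish
        # Otherwise join table j with probability proportional to c_j - d.
        table = rng.choices(self.tables, weights=[c - self.d for _, c in self.tables])[0]
        table[1] += 1
        return table[0]


def get_restaurant(u, franchise, d, alpha, vocab):
    """Return the restaurant for context u; its parent drops the most distant word."""
    if u not in franchise:
        parent = get_restaurant(u[1:], franchise, d, alpha, vocab) if u else None
        franchise[u] = Restaurant(d[len(u)], alpha[len(u)], parent, vocab)
    return franchise[u]


rng = random.Random(0)
vocab = ["w1", "w2", "w3", "w4"]             # V = 4, as in the tree-structure slide
franchise = {}
leaf = get_restaurant(("w1", "w2"), franchise, [0.5] * 3, [1.0] * 3, vocab)
print([leaf.seat(rng) for _ in range(10)])
```

Each table in a child restaurant corresponds to exactly one customer in its parent, which is how words observed in long contexts come to inform the shorter ones.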
Inference
I Given training data $\{u_i, w_i\}_{i=1,N}$ and an n-gram model, there are $2n$ unknown PY parameters, plus the latent seating arrangements across all the restaurants
I The $\{w_i\}_{i=1,N}$ are here observed, and they appear in the likelihood function (multinomial, with parameters defined by the corresponding table in the restaurant franchise)
I May be implemented using a Gibbs sampler, with aChinese-restaurant formulation (done by Teh)
I May also wish to consider a stick-breaking construction
Agenda
I Language modeling, n-grams, and Kneser-Ney
I Dirichlet process, Pitman-Yor process, power-law behavior
I Pitman-Yor for hierarchical language modeling & connection to Kneser-Ney
I Other power-law phenomena: Beta process, stable beta process, and IBP with power-law behavior
I Future research
Power-Law Formulations
I The prediction rule of a Pitman-Yor model is a mixture of previously drawn discrete atoms plus the base measure
I Dirichlet process is a special case
I The PY model is more general, in that the probability of drawing from the base measure is higher with PY than with DP
I The PY model therefore has more mixture components (discrete atoms/tables), with fewer data “sitting” around any one table
I Provides a more flexible model for some domains, of significant utility for language modeling
I DP-based performance for language model significantly worse than that of PY
I Are there other types of measures that we may generalize to have power-law characteristics?
Levy Process and Measure
I Thibaux and Jordan represented a Beta Process (BP) in terms of a Levy measure, and related it to the Indian Buffet Process (IBP)
I A random measure $B$ on space $\Omega$ is a Levy process if the masses $B(S_1), \ldots, B(S_k)$ are independent, for all disjoint subsets $S_1, \ldots, S_k$ of $\Omega$
I A Levy process is uniquely defined by its Levy measure
I A beta process $B \sim \mathrm{BP}(c, B_0)$ is a positive Levy process whose Levy measure depends on two parameters: $c$, a positive function over $\Omega$, and $B_0$, a fixed measure on $\Omega$
I If $B_0$ is continuous, the BP Levy measure is
\[ \nu(d\omega, dp) = c(\omega)\, p^{-1} (1 - p)^{c(\omega) - 1} \, dp \, B_0(d\omega) \]
on $\Omega \times [0, 1]$
Beta and Bernoulli Processes
I To draw $B \sim \mathrm{BP}(c, B_0)$, draw a set of points $(\omega_i, p_i) \in \Omega \times [0, 1]$ from a Poisson process with base measure $\nu$
\[ B = \sum_i p_i \delta_{\omega_i} \]
I If $X$ is drawn from a Bernoulli process with measure $B$, represented $X \sim \mathrm{BeP}(B)$, then
\[ X = \sum_i b_i \delta_{\omega_i} \]
where $b_i \sim \mathrm{Bernoulli}(p_i)$
I Assume $n$ samples of infinite-dimensional binary vectors
\[ X_i \sim \mathrm{BeP}(B), \quad i = 1, \ldots, n, \qquad B \sim \mathrm{BP}(c, B_0) \]
where $c$ is a constant
I The posterior of $B \mid X_1, \ldots, X_n$ is also a BP (conjugacy)
Indian Buffet Process
I The posterior of $B \mid X_1, \ldots, X_n$ is also a BP (conjugacy), with
\[ B \mid X_1, \ldots, X_n \sim \mathrm{BP}\left(c + n, \; \frac{c}{c + n} B_0 + \frac{1}{c + n} \sum_{i=1}^{n} X_i\right) \]
I We may integrate out the measure $B$ to draw the $X_i$ directly, yielding an Indian Buffet process
I Suppose $X_1, \ldots, X_n$ have been drawn, and “dish” $\omega_j$ has been used $m_{n,j}$ times among the previous $n$ “customers”
I For $X_{n+1}$, component $j$ is equal to one with probability $\frac{m_{n,j}}{c + n}$, and $\mathrm{Poi}(c\gamma/(c + n))$ new dishes are drawn, with $\gamma = B_0(\Omega)$
I Customers try Poi(nγ) dishes
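The buffet scheme above can be simulated by tracking only the dish counts $m_{n,j}$. The sketch below is a hedged illustration: the function names `poisson` and `ibp` are hypothetical, the Poisson sampler is Knuth's multiplication method (adequate only for small rates), and the parameter values are arbitrary.

```python
import math
import random


def poisson(lam, rng):
    """Knuth's multiplication method for a Poisson draw; suitable for small rates only."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def ibp(n_customers, c, gamma, rng):
    """Two-parameter IBP; returns m with m[j] = number of customers who tried dish j."""
    m = []
    for n in range(n_customers):             # n = number of previous customers
        for j in range(len(m)):
            if rng.random() < m[j] / (c + n):   # existing dish j
                m[j] += 1
        m.extend([1] * poisson(c * gamma / (c + n), rng))  # new dishes
    return m


rng = random.Random(0)
m = ibp(n_customers=50, c=1.0, gamma=3.0, rng=rng)
print(len(m), "dishes; most popular tried by", max(m), "customers")
```

With `c = 1` the first customer tries Poisson(γ) dishes and later customers add new dishes at a rate that decays as 1/(1 + n).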
Motivation for Stable Beta Process
I The Pitman-Yor process generalizes DP, to produce more distinct mixture components for a given size of data
I The beta process utilizes a certain set of dishes for a given number of samples
I Can we do for the BP what PY does for DP?
I Can we construct a model that is more general in the number of dishes created, with BP as a special limiting case, like DP is a limiting case for PY?
I This will involve a “discount” on the empirical counts employed in the Bernoulli distribution
Generalized Indian Buffet Process
I Consider parameters $\alpha > 0$, $\sigma \in [0, 1]$ and $c > -\sigma$
I Generalized IBP:
1. Customer 1 draws and tries $\mathrm{Poisson}(\alpha)$ dishes
2. Let $m_k$ denote the number of times dish $k$ was sampled by the first $n$ customers
3. Subsequently, customer $n + 1$ tries dish $k$ with discounted probability $\frac{m_k - \sigma}{n + c}$, and tries $\mathrm{Poisson}\left(\alpha \frac{\Gamma(1 + c)\,\Gamma(n + c + \sigma)}{\Gamma(n + 1 + c)\,\Gamma(c + \sigma)}\right)$ new dishes
I Total number of dishes: $O(n^{\sigma})$, and the proportion of dishes tried by $m$ customers is $O(m^{-1-\sigma})$
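The three steps above translate almost line for line into a simulation. As before this is a sketch under stated assumptions: the names `poisson`, `new_dish_rate`, and `generalized_ibp` are hypothetical, the Poisson sampler is Knuth's method (small rates only), and the Gamma ratio is evaluated through `math.lgamma` for numerical stability.

```python
import math
import random


def poisson(lam, rng):
    """Knuth's multiplication method for a Poisson draw; suitable for small rates only."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def new_dish_rate(alpha, c, sigma, n):
    """alpha * Gamma(1+c) Gamma(n+c+sigma) / (Gamma(n+1+c) Gamma(c+sigma))."""
    log_ratio = (math.lgamma(1 + c) + math.lgamma(n + c + sigma)
                 - math.lgamma(n + 1 + c) - math.lgamma(c + sigma))
    return alpha * math.exp(log_ratio)


def generalized_ibp(n_customers, alpha, c, sigma, rng):
    """Three-parameter IBP; returns m with m[k] = number of customers who tried dish k."""
    m = []
    for n in range(n_customers):             # n = number of previous customers
        for k in range(len(m)):
            if rng.random() < (m[k] - sigma) / (n + c):   # discounted probability
                m[k] += 1
        m.extend([1] * poisson(new_dish_rate(alpha, c, sigma, n), rng))
    return m


rng = random.Random(0)
for sigma in (0.0, 0.5):
    m = generalized_ibp(200, alpha=3.0, c=1.0, sigma=sigma, rng=rng)
    print("sigma =", sigma, "total dishes =", len(m))
```

At `n = 0` the rate equals `alpha`, matching step 1, and at `sigma = 0` it reduces to `alpha * c / (c + n)`, the new-dish rate of the two-parameter IBP with total mass `alpha`.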
Stable Beta Process
I Teh and Gorur develop the theory in terms of a “completely random measure” (CRM)
I A CRM may be represented as a sum of three parts: (i) a non-random measure, (ii) an atomic measure with fixed atoms but random masses, and (iii) an atomic measure with random atoms and masses
I Without the non-random measure
\[ \mu = \sum_{k=1}^{N} u_k \delta_{\phi_k} + \sum_{l=1}^{M} v_l \delta_{\psi_l} \]
where $u_k \sim F_k$, and the random atoms $(v_l, \psi_l)$ are drawn from a Poisson process with Levy measure $\Lambda$
\[ \mu \sim \mathrm{CRM}(\Lambda, \{\phi_k, F_k\}_{k=1,N}) \]
I For the stable beta process (SBP), $\mu \sim \mathrm{CRM}(\Lambda_0, \{\})$ with
\[ \Lambda_0(du \times d\theta) = \alpha \frac{\Gamma(1 + c)}{\Gamma(1 - \sigma)\,\Gamma(c + \sigma)} u^{-\sigma - 1} (1 - u)^{c + \sigma - 1} \, du \, H(d\theta) \]
Agenda
I Language modeling, n-grams, and Kneser-Ney
I Dirichlet process, Pitman-Yor process, power-law behavior
I Pitman-Yor for hierarchical language modeling & connection to Kneser-Ney
I Other power-law phenomena: Beta process, stable beta process, and IBP with power-law behavior
I Future research