Page 1

Language Modeling, Pitman-Yor, Beta Process & Stable Beta Process

Lawrence Carin

Department of Electrical & Computer Engineering
Duke University

Durham, NC

Review of the following papers:

Y.W. Teh, “A Bayesian Interpretation of Interpolated Kneser-Ney”
R. Thibaux and M.I. Jordan, “Hierarchical Beta Process and the IBP”

Y.W. Teh and D. Gorur, “Indian Buffet Processes with Power-Law Behavior”


Page 2

Agenda

- Language modeling, n-grams, and Kneser-Ney
- Dirichlet process, Pitman-Yor process, power-law behavior
- Pitman-Yor for hierarchical language modeling & connection to Kneser-Ney
- Other power-law phenomena: beta process, stable beta process, and IBP with power-law behavior
- Future research

Page 3

Word Histories and n-Grams

- Assume we have a sentence of t words: word_1, word_2, ..., word_t

- In language modeling, the probability of these words is typically represented as

    p(word_1, word_2, ..., word_t) = ∏_{i=1}^{t} p(word_i | word_1, ..., word_{i−1})

- In n-gram models, we retain only up to n − 1 previous words:

    p(word_i | word_1, ..., word_{i−1}) = p(word_i | word_{i−n+1}, ..., word_{i−1})
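
For example (an illustrative expansion, not on the slide), a trigram model (n = 3) factorizes a four-word sentence as p(word_1, word_2, word_3, word_4) = p(word_1) p(word_2 | word_1) p(word_3 | word_1, word_2) p(word_4 | word_2, word_3): each conditional retains at most the two preceding words.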

Page 4

Data and Modeling

- Assume a word vocabulary W, composed of V words
- Each word w ∈ W, and each (n − 1)-word context is represented as u ∈ W^{n−1}
- Based on the available corpus, let c_{uw} represent the number of times word w followed context u
- The simplest model estimate for the context-word probabilities is

    P^{ML}_u(w) = c_{uw} / c_{u·}, where c_{u·} = ∑_{w′} c_{uw′}

- This is expected to be a very poor estimate given a realistic corpus size
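
As a concrete illustration, here is a minimal sketch (not from the slides; the toy corpus, whitespace tokenization, and n = 3 are assumptions for the example) of collecting the counts c_{uw} and forming the maximum-likelihood estimate:

    from collections import Counter, defaultdict

    def ngram_counts(tokens, n):
        """Count c_uw: the number of times word w follows the (n-1)-word context u."""
        counts = defaultdict(Counter)
        for i in range(n - 1, len(tokens)):
            u = tuple(tokens[i - n + 1:i])      # context: the previous n-1 words
            counts[u][tokens[i]] += 1
        return counts

    def p_ml(counts, u, w):
        """Maximum-likelihood estimate P_u^ML(w) = c_uw / c_u."""
        c_u = counts[u]
        total = sum(c_u.values())               # c_u. = sum over w' of c_uw'
        return c_u[w] / total if total else 0.0

    tokens = "the cat sat on the mat".split()
    counts = ngram_counts(tokens, n=3)
    print(p_ml(counts, ("the", "cat"), "sat"))  # 1.0 in this toy corpus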

Page 5

Smoothing and Interpolation

- Introduce “Interpolated Kneser-Ney,” or IKN
- We “discount” all non-zero counts by d_{|u|}, and define t_{u·} = #{w′ : c_{uw′} > 0}, so t_{u·} is the number of distinct words that follow u in the corpus
- Finally, let π(u) represent context u with the first (most distant) word removed
- The IKN estimate for the probability of w ∈ W given context u is

    P^{IKN}_u(w) = [(c_{u·} − t_{u·} d_{|u|}) / c_{u·}] [max(0, c_{uw} − d_{|u|}) / (c_{u·} − t_{u·} d_{|u|})] + (d_{|u|} t_{u·} / c_{u·}) P^{IKN}_{π(u)}(w)

- With probability 1 − d_{|u|} t_{u·}/c_{u·} use the discounted empirical probabilities, and with probability d_{|u|} t_{u·}/c_{u·} use the IKN probability from a one-word reduction in context
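
The recursion is compact enough to sketch directly. The following is a minimal illustration (not a reference implementation): it assumes counts are available at every context length, uses one fixed discount per context length, and bottoms out in the uniform distribution 1/V at the empty context. Full Kneser-Ney additionally replaces lower-order counts with continuation counts, omitted here for brevity.

    from collections import Counter, defaultdict

    def all_order_counts(tokens, n):
        """Counts c_uw for every context length m = 0, ..., n-1."""
        counts = defaultdict(Counter)
        for m in range(n):
            for i in range(m, len(tokens)):
                counts[tuple(tokens[i - m:i])][tokens[i]] += 1
        return counts

    def p_ikn(counts, u, w, discounts, V):
        """P_u^IKN(w): discounted empirical estimate, backed off to pi(u)."""
        # Base distribution: recurse on pi(u) (drop the most distant word),
        # bottoming out at the uniform distribution over the vocabulary.
        base = p_ikn(counts, u[1:], w, discounts, V) if u else 1.0 / V
        c_u = counts.get(u, Counter())
        c_total = sum(c_u.values())
        if c_total == 0:
            return base                         # unseen context: fully back off
        d = discounts[len(u)]
        t_u = len(c_u)                          # number of distinct words after u
        return max(0.0, c_u[w] - d) / c_total + (d * t_u / c_total) * base

    tokens = "the cat sat on the mat and the cat ran".split()
    counts = all_order_counts(tokens, n=3)
    print(p_ikn(counts, ("the", "cat"), "sat", discounts=[0.75, 0.75, 0.75], V=7))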

Page 6

Recursive Estimation

    P^{IKN}_u(w) = [(c_{u·} − t_{u·} d_{|u|}) / c_{u·}] [max(0, c_{uw} − d_{|u|}) / (c_{u·} − t_{u·} d_{|u|})] + (d_{|u|} t_{u·} / c_{u·}) P^{IKN}_{π(u)}(w)

- The term P^{IKN}_{π(u)}(w) is itself represented recursively, in terms of context π(π(u)) and so on, successively until there is no more context
- The discount d_m is a function of the context length m

Page 7

Tree Structure

- For visualization purposes, consider a vocabulary of size V = 4
- There are four words: w_1, w_2, w_3, w_4
- Consider an n-gram model with n = 3

Page 8

[Figure slide: tree of n-gram contexts for the V = 4, n = 3 example]

Page 9

Agenda

- Language modeling, n-grams, and Kneser-Ney
- Dirichlet process, Pitman-Yor process, power-law behavior
- Pitman-Yor for hierarchical language modeling & connection to Kneser-Ney
- Other power-law phenomena: beta process, stable beta process, and IBP with power-law behavior
- Future research

Page 10

Generalized Stick-Breaking Priors

- Consider the probability measure

    P(·) = ∑_{k=1}^{∞} p_k δ_{Z_k}(·)

  with Z_k iid draws from a measure H, and

    p_k = V_k ∏_{h=1}^{k−1} (1 − V_h), V_i ∼ Beta(a_i, b_i)

  with a_i, b_i > 0
- We may also have a finite sum with N terms, with V_N = 1
- The Pitman-Yor process is defined by two parameters 0 ≤ a < 1 and b > −a, with a_k = 1 − a and b_k = b + ka
- The special case a = 0 and b = α > 0 corresponds to the Dirichlet process
- A draw of Y_1, ..., Y_n from PY(a, b, H) is represented as

    Y_i | P ∼ P, i = 1, ..., n, P ∼ PY(a, b, H)
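
A draw from this construction is easy to simulate by truncation. The sketch below is an illustration only; the truncation level K and the standard-normal choice of H are assumptions:

    import numpy as np

    def py_sticks(a, b, K, rng):
        """Truncated PY stick-breaking: V_k ~ Beta(1-a, b+ka), p_k = V_k prod_{h<k}(1-V_h).

        Setting a = 0 recovers the Dirichlet-process stick-breaking construction."""
        k = np.arange(1, K + 1)
        V = rng.beta(1.0 - a, b + k * a)
        V[-1] = 1.0                            # finite truncation: V_K = 1
        remaining = np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
        return V * remaining

    rng = np.random.default_rng(0)
    p = py_sticks(a=0.5, b=1.0, K=1000, rng=rng)
    Z = rng.standard_normal(p.size)            # atoms Z_k drawn iid from H
    print(p.sum())                             # sums to 1 (up to floating point)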

Page 11

Exchangeability and Full Conditionals

- Assume we observe Y_1, ..., Y_{i−1}, where Y*_{1,i}, ..., Y*_{m_i,i} represent the m_i unique draws and n*_{j,i} represents the number of times Y*_{j,i} is drawn
- Pitman proved the following prediction rule:

    P(Y_i ∈ · | Y_1, ..., Y_{i−1}) = [(i − 1 − m_i a) / (b + i − 1)] ∑_{j=1}^{m_i} [(n*_{j,i} − a) / (i − 1 − m_i a)] δ_{Y*_{j,i}}(·) + [(b + a m_i) / (b + i − 1)] H(·)

- Given that we observed Y_1, ..., Y_{i−1} with base measure H, this tells us the probability of what we will see for Y_i

- Note that when b = 0, this looks exactly like the IKN construction discussed above for language models
- The above conditional distribution does not depend on the order of the data, and the sequence is therefore exchangeable
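
For instance (an illustrative calculation, not on the slide), take a = 0.5 and b = 0, and suppose the first three draws produced m = 2 unique values with counts n* = (2, 1). Then Y_4 repeats the first value with probability (2 − 0.5)/3 = 0.5, repeats the second with probability (1 − 0.5)/3 ≈ 0.167, and is a fresh draw from H with probability (0 + 0.5 · 2)/3 ≈ 0.333; the three terms sum to one, and the mass on new values persists however much data has been seen.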

Page 12

Asymptotic Behavior of PY & Power Law

- In the language model, the atoms δ_{Y*_{j,i}}(·) correspond to words that have been observed previously in the corpus, following a particular context
- We are interested in the probability of occurrence of new-word “outliers” not seen in the corpus, as outliers are important in language
- For a language model we desire a longer “tail” on the distribution of new words
- Note that this is also witnessed by the decreasing amplitude of the sticks with increasing k:

    P(·) = ∑_{k=1}^{∞} p_k δ_{Z_k}(·), p_k = V_k ∏_{h=1}^{k−1} (1 − V_h), V_i ∼ Beta(1 − a, b + ia)

- If T is the total number of draws from such a process (T word occurrences for a given context), then the number of unique words scales as O(b T^a) for the PY process, and as O(α log T) for the DP (a = 0 and b = α > 0)
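
The two growth rates are easy to see empirically by sampling from the prediction rule on the previous slide. A minimal sketch, with arbitrary parameter choices, tracking only the per-cluster counts:

    import numpy as np

    def num_unique(T, a, b, rng):
        """Draw T samples via the PY prediction rule; return the number of clusters."""
        counts = []                            # n*_j: number of draws in cluster j
        for i in range(T):
            w = np.array([c - a for c in counts] + [b + a * len(counts)])
            j = rng.choice(len(w), p=w / w.sum())
            if j == len(counts):
                counts.append(1)               # new draw from the base measure H
            else:
                counts[j] += 1
        return len(counts)

    rng = np.random.default_rng(0)
    for T in (100, 1000, 10000):
        py = num_unique(T, a=0.5, b=1.0, rng=rng)   # grows like T^0.5
        dp = num_unique(T, a=0.0, b=1.0, rng=rng)   # grows like log T
        print(T, py, dp)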

Page 13

Agenda

- Language modeling, n-grams, and Kneser-Ney
- Dirichlet process, Pitman-Yor process, power-law behavior
- Pitman-Yor for hierarchical language modeling & connection to Kneser-Ney
- Other power-law phenomena: beta process, stable beta process, and IBP with power-law behavior
- Future research

Page 14

[Figure slide]

Page 15

Hierarchical Bayesian Construction

- Consider contexts u corresponding to n-grams, yielding the hierarchical construction

    G_∅ ∼ PY(d_0, α_0, U)
    G_{[x_1]} ∼ PY(d_1, α_1, G_∅)
    ...
    G_{[x_{n−1} ... x_1]} ∼ PY(d_{n−1}, α_{n−1}, G_{[x_{n−2} ... x_1]})
    w_t | w_{t−n+1} ... w_{t−1} ∼ G_{[w_{t−n+1} ... w_{t−1}]}

- The draws from the PY measures are performed for all word combinations, e.g., G_{[x_{n−1} ... x_1]} for all W^{n−1} combinations of x_{n−1} ... x_1
- The unknown parameters are {d_i, α_i}_{i=0,...,n−1}, implying 2n unknown parameters (on which we place priors)

Page 16

Learning

- We consider all mixture measures G_∅, G_{[x_1]}, ..., G_{[x_{n−1} ... x_1]} as Chinese restaurants (considering all x ∈ W)
- Assume we are given a corpus {u_i, w_i}_{i=1,...,N}, where u_i ∈ W^{n−1} is the i-th example context and w_i is the associated observed word
- We wish to infer the unknown parameters {d_i, α_i}_{i=0,...,n−1}, as well as where all N samples are “sitting” in the restaurants
- We find a posterior on these quantities, and then use the posterior to make an inference for w_{N+1} given a new u_{N+1} ∈ W^{n−1}
- By construction, the ordering of the training data is exchangeable

Page 17

Sequence of Restaurant Customers - 1/2

- Consider the first sample u_1, which we represent as u_1 = w^{(1)}_{n−1} ... w^{(1)}_1
- Since this is the first “customer” in this restaurant, we must draw from the base measure G_{[w^{(1)}_{n−2} ... w^{(1)}_1]}, which corresponds to a separate restaurant
- Since there are no customers in that restaurant either, we must then draw from its base G_{[w^{(1)}_{n−3} ... w^{(1)}_1]}
- The first customer sequentially visits PY draws, corresponding to restaurants, until finally drawing from U
- Assume the word drawn for u_1 is represented w*_1 ∈ W
- There is now a “table” at each restaurant visited on the way to U, with corresponding “dish” w*_1, and one customer at each such table

Page 18

Sequence of Restaurant Customers - 2/2

- Now consider “customer” K > 1, corresponding to context u_K = [w^{(K)}_{n−1} ... w^{(K)}_1]
- There are K − 1 customers distributed among the “franchise” of restaurants, with at most K − 1 tables at any restaurant
- The first restaurant visited by u_K corresponds to G_{[w^{(K)}_{n−1} ... w^{(K)}_1]}, which has m ≤ K − 1 tables, with c_j customers at table j:

    G_{[w^{(K)}_{n−1} ... w^{(K)}_1]} = ∑_{j=1}^{m} [(c_j − d) / (α + K − 1)] δ_{w*_j} + [(α + md) / (α + K − 1)] G_{[w^{(K)}_{n−2} ... w^{(K)}_1]}

- With probability (c_j − d)/(α + K − 1) the customer will sit at table j and “eat” w*_j, and with probability (α + md)/(α + K − 1) will transfer to the “restaurant” corresponding to G_{[w^{(K)}_{n−2} ... w^{(K)}_1]}
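
The two cases above translate directly into a generative sampler. Below is a minimal sketch (an illustration, not Teh's implementation): d and alpha are per-context-length parameter lists matching the d_i, α_i of the construction, the base measure U is uniform over a toy vocabulary, and the denominator uses the number of customers currently in the given restaurant (the slide's α + K − 1 plays this role):

    import random
    from collections import defaultdict

    restaurants = defaultdict(list)            # context tuple -> [dish, count] tables

    def draw_word(context, d, alpha, vocab, rng=random):
        """Draw one word from G_context in the Pitman-Yor restaurant franchise."""
        m = len(context)                       # context length, indexes d_m, alpha_m
        tables = restaurants[context]
        n = sum(count for _, count in tables)  # customers in this restaurant
        u = rng.random() * (alpha[m] + n)
        for table in tables:
            u -= table[1] - d[m]               # existing table: mass c_j - d_m
            if u < 0.0:
                table[1] += 1
                return table[0]                # sit down and "eat" this dish
        # Otherwise open a new table (mass alpha_m + m_tables * d_m); its dish
        # is drawn from the parent restaurant pi(context), or from U at the root.
        dish = draw_word(context[1:], d, alpha, vocab, rng) if context else rng.choice(vocab)
        tables.append([dish, 1])
        return dish

    vocab = ["w1", "w2", "w3", "w4"]
    d, alpha = [0.5, 0.5, 0.5], [1.0, 1.0, 1.0]   # per-level parameters (arbitrary)
    print(draw_word(("w2", "w1"), d, alpha, vocab))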

Page 19

[Figure slide]

Page 20

Inference

- Given training data {u_i, w_i}_{i=1,...,N} and an n-gram model, there are 2n unknown PY parameters, plus the latent seating arrangements across all the restaurants
- The {w_i}_{i=1,...,N} are observed here, and they appear in the likelihood function (multinomial, with parameters defined by the corresponding table in the restaurant franchise)
- Inference may be implemented using a Gibbs sampler, with a Chinese-restaurant formulation (done by Teh)
- One may also wish to consider a stick-breaking construction

Page 21

Agenda

- Language modeling, n-grams, and Kneser-Ney
- Dirichlet process, Pitman-Yor process, power-law behavior
- Pitman-Yor for hierarchical language modeling & connection to Kneser-Ney
- Other power-law phenomena: beta process, stable beta process, and IBP with power-law behavior
- Future research

Page 22

Power-Law Formulations

- A draw from a Pitman-Yor model is a mixture of discrete atoms plus a base measure
- The Dirichlet process is a special case
- The PY model is more general, in that the probability of drawing from the base measure is higher with PY than with DP
- The PY model therefore has more mixture components (discrete atoms/tables), with fewer data “sitting” around any one table
- This provides a more flexible model for some domains, of significant utility for language modeling
- DP-based performance for language modeling is significantly worse than that of PY
- Are there other types of measures we may generalize to have power-law characteristics?

Page 23

Lévy Process and Measure

- Thibaux and Jordan represented the beta process (BP) in terms of a Lévy measure, and related it to the Indian buffet process (IBP)
- A random measure B on a space Ω is a Lévy process if the masses B(S_1), ..., B(S_k) are independent, for all disjoint subsets S_1, ..., S_k of Ω
- A Lévy process is uniquely defined by its Lévy measure
- A beta process B ∼ BP(c, B_0) is a positive Lévy process whose Lévy measure depends on two parameters: c, a positive function over Ω, and B_0, a fixed measure on Ω
- If B_0 is continuous, the BP Lévy measure is

    ν(dω, dp) = c(ω) p^{−1} (1 − p)^{c(ω)−1} dp B_0(dω)

  on Ω × [0, 1]
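
A quick check of the normalization (a side calculation, not on the slide): by Campbell's theorem, E[B(S)] = ∫_S ∫_0^1 p ν(dω, dp) = ∫_S B_0(dω) ∫_0^1 c(ω)(1 − p)^{c(ω)−1} dp = B_0(S), since the inner integral equals one; so B_0 is precisely the mean measure of the beta process.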

Page 24

Beta and Bernoulli Processes

- To draw B ∼ BP(c, B_0), draw a set of points (ω_i, p_i) ∈ Ω × [0, 1] from a Poisson process with base measure ν, and set

    B = ∑_i p_i δ_{ω_i}

- If X is drawn from a Bernoulli process with measure B, represented X ∼ BeP(B), then

    X = ∑_i b_i δ_{ω_i}, where b_i ∼ Bernoulli(p_i)

- Assume n samples of infinite-dimensional binary vectors:

    X_i ∼ BeP(B), i = 1, ..., n, B ∼ BP(c, B_0)

  where c is a constant
- The posterior of B | X_1, ..., X_n is also a BP (conjugacy)
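
Given (a truncation of) the atoms of B, drawing the binary vectors X_i is one line of broadcasting. A minimal sketch, with the atom masses invented for illustration (a faithful draw of B would come from the Poisson process with the Lévy measure ν above):

    import numpy as np

    rng = np.random.default_rng(0)

    # Truncated B = sum_i p_i * delta_{omega_i}: 20 atoms with illustrative masses
    omega = rng.uniform(size=20)               # atom locations omega_i in Omega
    p = rng.beta(0.5, 2.0, size=20)            # atom masses p_i in [0, 1] (assumed)

    # n draws X_i ~ BeP(B): independent Bernoulli(p_j) coin flips per atom
    n = 5
    X = (rng.random((n, p.size)) < p).astype(int)
    print(X)                                   # each row is one binary "feature" vector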

Page 25

Indian Buffet Process

- The posterior of B | X_1, ..., X_n is also a BP (conjugacy), with

    B | X_1, ..., X_n ∼ BP(c + n, [c/(c + n)] B_0 + [1/(c + n)] ∑_{i=1}^{n} X_i)

- We may integrate out the measure B and draw the X_i directly, yielding the Indian buffet process
- Given X_1, ..., X_n, suppose “dish” ω_j has been tried m_{n,j} times among the previous n “customers”
- For X_{n+1}, component j equals one with probability m_{n,j}/(c + n), and the number of new dishes is Poisson(cγ/(c + n)), with γ = B_0(Ω)

- Each customer marginally tries Poisson(γ) dishes, so the n customers make nγ dish selections in expectation (counting repeats)
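
Note the resulting growth rate: summing the expected number of new dishes over customers gives γ ∑_{i=1}^{n} c/(c + i − 1) distinct dishes in expectation, which grows only as O(log n) for fixed c. This logarithmic growth is the featural analogue of the DP's O(α log T) cluster growth, and it is exactly what the stable beta process of the following slides lifts to a power law.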

Page 26

Motivation for Stable Beta Process

- The Pitman-Yor process generalizes the DP, to produce more distinct mixture components for a given size of data
- The beta process utilizes a certain number of dishes for a given number of samples
- Can we do for the BP what PY does for DP?
- Can we construct a model that is more general in the number of dishes created, with the BP as a special case, just as the DP is a special case of PY?
- This will involve a “discount” on the empirical counts employed in the Bernoulli distribution

Page 27

Generalized Indian Buffet Process

- Consider parameters α > 0, σ ∈ [0, 1), and c > −σ
- Generalized IBP:
  1. Customer 1 tries Poisson(α) dishes
  2. Let m_k denote the number of times dish k has been sampled among the first n customers
  3. Customer n + 1 tries dish k with discounted probability (m_k − σ)/(n + c), and tries Poisson(α Γ(1 + c) Γ(n + c + σ) / (Γ(n + 1 + c) Γ(c + σ))) new dishes
- Total number of dishes: O(n^σ), and the proportion of dishes tried by m customers is O(m^{−1−σ})
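
The power-law growth is easy to check by direct simulation of the scheme above (a sketch with arbitrary α, σ, c; setting σ = 0 recovers the standard IBP and its logarithmic growth):

    import numpy as np
    from math import lgamma, log, exp

    def stable_ibp_dishes(n_customers, alpha, sigma, c, rng):
        """Simulate the generalized IBP; return the number of dishes after each customer."""
        m = []                                 # m_k: times dish k has been sampled
        totals = []
        for n in range(n_customers):           # customer n+1 arrives
            for k in range(len(m)):
                if rng.random() < (m[k] - sigma) / (n + c):
                    m[k] += 1                  # tries existing dish k
            # New dishes: Poisson, rate alpha*G(1+c)G(n+c+sigma)/(G(n+1+c)G(c+sigma))
            rate = exp(log(alpha) + lgamma(1 + c) + lgamma(n + c + sigma)
                       - lgamma(n + 1 + c) - lgamma(c + sigma))
            m.extend([1] * rng.poisson(rate))
            totals.append(len(m))
        return totals

    rng = np.random.default_rng(0)
    for sigma in (0.0, 0.5):
        t = stable_ibp_dishes(2000, alpha=2.0, sigma=sigma, c=1.0, rng=rng)
        print(f"sigma={sigma}: {t[-1]} dishes")   # ~log n vs ~n^0.5 growth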

Page 28

Stable Beta Process

- Teh and Gorur develop the theory in terms of a “completely random measure” (CRM)
- A CRM may be represented as a sum of three parts: (i) a non-random measure, (ii) an atomic measure with fixed atoms but random masses, and (iii) an atomic measure with random atoms and masses
- Without the non-random measure,

    µ = ∑_{k=1}^{N} u_k δ_{φ_k} + ∑_{l=1}^{M} v_l δ_{ψ_l}

  where u_k ∼ F_k, and the random atoms {v_l, ψ_l} are drawn from a Poisson process with Lévy measure Λ:

    µ ∼ CRM(Λ, {φ_k, F_k}_{k=1,...,N})

- For the stable beta process (SBP), µ ∼ CRM(Λ_0, {}) with

    Λ_0(du × dθ) = α [Γ(1 + c) / (Γ(1 − σ) Γ(c + σ))] u^{−σ−1} (1 − u)^{c+σ−1} du H(dθ)
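
A quick consistency check (a side calculation, not on the slide): setting σ = 0 gives Γ(1 − σ) = 1 and Γ(1 + c)/Γ(c + σ) = Γ(1 + c)/Γ(c) = c, so Λ_0 reduces to α c u^{−1}(1 − u)^{c−1} du H(dθ), which is the beta-process Lévy measure of page 23 with constant c and B_0 = αH. The BP is thus recovered as the σ = 0 special case of the SBP, just as the DP is the a = 0 special case of the PY process.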

Page 29

Agenda

- Language modeling, n-grams, and Kneser-Ney
- Dirichlet process, Pitman-Yor process, power-law behavior
- Pitman-Yor for hierarchical language modeling & connection to Kneser-Ney
- Other power-law phenomena: beta process, stable beta process, and IBP with power-law behavior
- Future research