
A survey of the Dirichlet process and its role in nonparametrics

Jayaram Sethuraman
Department of Statistics
Florida State University
Tallahassee, FL 32306
[email protected]

May 9, 2013


Summary

• Elementary frequentist and Bayesian methods

• Bayesian nonparametrics

• Nonparametric priors in a natural way

• Nonparametric priors and exchangeable random variables; Polya urn sequences

• Nonparametric priors through other approaches

• Sethuraman construction of Dirichlet priors

• Some properties of Dirichlet priors

• Bayes hierarchical models


Frequentist methods

Data X1, . . . , Xn are i.i.d. Fθ, where {Fθ : θ ∈ Θ} is a family of distributions.

For instance, the data X1, . . . , Xn can be i.i.d. N(θ, σ²), with θ ∈ Θ = (−∞, ∞) and σ² known.

Then the frequentist estimate of θ is $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$.


Bayesian methods

However, a Bayesian would consider θ to be random (by seeking experts' opinion, etc.) and would put a distribution on Θ, say a N(µ, τ²) distribution, and call it the prior distribution.

Then the posterior distribution of θ, the conditional distribution of θ given the data X1, . . . , Xn, is

$$N\left(\frac{n\bar{X}_n/\sigma^2 + \mu/\tau^2}{n/\sigma^2 + 1/\tau^2},\; \frac{1}{n/\sigma^2 + 1/\tau^2}\right).$$

The Bayes estimate of θ is the expectation of θ under its posterior distribution and is

$$\frac{n\bar{X}_n/\sigma^2 + \mu/\tau^2}{n/\sigma^2 + 1/\tau^2}.$$
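
A minimal numerical sketch of this conjugate update (the data and the values of µ, τ and σ below are arbitrary illustrative choices, not from the slides):

```python
import numpy as np

# Conjugate normal-normal update: prior theta ~ N(mu, tau^2), data X_i ~ N(theta, sigma^2).
def normal_posterior(x, sigma, mu, tau):
    n = len(x)
    xbar = np.mean(x)
    precision = n / sigma**2 + 1.0 / tau**2            # posterior precision
    post_mean = (n * xbar / sigma**2 + mu / tau**2) / precision
    post_var = 1.0 / precision
    return post_mean, post_var

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=50)            # simulated data with sigma = 1
print(normal_posterior(x, sigma=1.0, mu=0.0, tau=1.0))
```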


Beginning nonparametrics

Suppose that X1, . . . , Xn are i.i.d. and take on only a finite number of values, say in X = {1, . . . , k}. The most general probability distribution for X1 is p = (p1, . . . , pk), taking values in the unit simplex in Rk, i.e. pj ≥ 0, j = 1, . . . , k, and ∑ pj = 1.

Let N = (N1, . . . , Nk), where Nj = #{Xi = j, i = 1, . . . , n} is the observed frequency of j.

The frequentist estimate of pj is Nj/n and the estimate of p is (N1, . . . , Nk)/n.


Bayes analysis

The class of all priors for p is just the class of distributions of the k − 1 dimensional vector (p1, . . . , pk−1).

A special distribution is the finite dimensional Dirichlet distribution D(a) with parameter a = (a1, . . . , ak), with density proportional to

$$\prod_{j=1}^{k} p_j^{a_j-1}, \qquad a_j \ge 0,\ j = 1, \dots, k, \quad A = \sum_j a_j > 0.$$

One should really use ratios of independent Gammas to their sum to take care of cases where some aj = 0.


Bayes analysis

Then the posterior distribution of p given the data X1, . . . , Xn has a density proportional to

$$\prod_{j=1}^{k} p_j^{a_j + N_j - 1},$$

which is the finite dimensional Dirichlet distribution with parameter a + N.

Thus the Bayes estimate of p is $\frac{a + N}{A + n}$.
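
A minimal numerical sketch of this conjugate update; the prior parameter a and the counts N below are made up for illustration:

```python
import numpy as np

# Finite Dirichlet prior D(a) for p = (p_1, ..., p_k); the data give counts N = (N_1, ..., N_k).
a = np.array([1.0, 1.0, 2.0])      # prior parameter a, so A = a.sum()
N = np.array([4, 10, 6])           # observed frequencies, n = N.sum()

posterior_param = a + N            # posterior is D(a + N)
bayes_estimate = posterior_param / posterior_param.sum()   # (a + N) / (A + n)
print(bayes_estimate)
```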


The standard nonparametric problem

Let X = (X1, . . . , Xn) be i.i.d. random variables on R1 with a common distribution function (df) F (or with a common probability measure (pm) P).

Let $F_n(x) = \frac{1}{n}\sum_i I(X_i \le x)$ and $P_n(A) = \frac{1}{n}\sum_i I(X_i \in A)$ be the empirical distribution function (edf) and the empirical probability measure (epm) of X.

Then Fn (Pn) is the frequentist estimate of F (P).

In fact, Fn(x) → F (x) and Pn(A) → P(A) for each x ∈ R1 and each A in the Borel sigma field B in R1.
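
A minimal sketch of computing the edf at a few points (the simulated data are illustrative):

```python
import numpy as np

def edf(x, grid):
    """Empirical distribution function F_n evaluated at each point of `grid`."""
    x = np.asarray(x)
    return np.array([np.mean(x <= t) for t in grid])

rng = np.random.default_rng(1)
sample = rng.normal(size=200)
print(edf(sample, grid=[-1.0, 0.0, 1.0]))
```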


Nonparametric priors in a natural way

The parameter F or P in this nonparametric problem is different from the finite parameter case.

The first and most natural attempt to introduce a distribution for P was to mimic the case of random variables taking a finite number of values.

Consider a finite partition A = (A1, . . . , Ak) of R1. Ferguson (1973) defined the distribution of (P(A1), . . . , P(Ak)) to be the finite dimensional Dirichlet distribution D(αβ(A1), . . . , αβ(Ak)), where α > 0 and β(·) is a pm on (R1, B).


Nonparametric priors in a natural way

By a distribution for P we mean a distribution on P, the space of all probability measures on R1. More precisely, we will also have to specify a σ-field S in P. We will take as our S the smallest σ-field containing the sets of the form {P : P(A) ≤ r} for all A ∈ B and 0 ≤ r ≤ 1.

It can be shown that Ferguson's choice is a probability distribution on (P, S); it is called the Dirichlet distribution with parameter αβ(·) and is denoted by D(αβ(·)).

This is also called the Dirichlet process since it is the distribution of F (x), x ∈ R1.


Nonparametric priors in a natural way

Current usage calls α the scale factor and β(·) the base measure of the Dirichlet distribution D(αβ(·)).

Ferguson showed that the posterior distribution of P given the data X is the Dirichlet distribution D(αβ(·) + nPn(·)),

and thus the Bayes estimate of P is $\frac{\alpha\beta(\cdot) + nP_n(\cdot)}{\alpha + n}$.

This is analogous to the results in the finite sample space case.
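
Since the posterior is D(αβ(·) + nPn(·)), the posterior mean of F(t) is the convex combination (αβ(t) + nFn(t))/(α + n). A minimal sketch, assuming a standard normal base measure β (an illustrative choice, not from the slides):

```python
import math
import numpy as np

def std_normal_cdf(t):
    """CDF of the assumed base measure beta = N(0, 1)."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def bayes_cdf_estimate(x, t, alpha, base_cdf):
    """Posterior mean of F(t) under a D(alpha*beta) prior: (alpha*beta(t) + n*F_n(t)) / (alpha + n)."""
    x = np.asarray(x)
    n = len(x)
    Fn = np.mean(x <= t)
    return (alpha * base_cdf(t) + n * Fn) / (alpha + n)

rng = np.random.default_rng(2)
data = rng.exponential(size=30)
print(bayes_cdf_estimate(data, t=1.0, alpha=5.0, base_cdf=std_normal_cdf))
```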


Nonparametric priors in a natural way

Under D(αβ), (P(A1), . . . , P(Ak)) has a finite dimensional Dirichlet distribution.

Two assertions were made in the last slide.

• D(αβ) is a probability measure on (P, S).

• The posterior distribution given X is D(αβ + nPn).

Another property of the Dirichlet prior that was disconcerting at first is that

• D(αβ) gives probability 1 to the subset of discrete probability measures on (R1, B).

The proofs needed some effort and were sometimes mystifying.


Nonparametric priors and exchangeable random variables

The class of all nonparametric priors is the same as the class of all exchangeable sequences of random variables!

This follows from De Finetti's theorem (1931). See also Hewitt and Savage (1955), Kingman (1978).

Let X1, X2, . . . be an infinite exchangeable sequence of random variables with a joint distribution Q. Then

1. The empirical distribution functions satisfy Fn(x) → F (x) with probability 1 for all x. In fact, supx |Fn(x) − F (x)| → 0 with probability 1. (Note that F (x) is a random distribution function.)


Nonparametric priors and exchangeable random variables

2. The empirical probability measures Pn converge weakly to P. This P is a random probability measure.

3. Given P, X1, X2, . . . are i.i.d. P.

4. Thus the distribution of P under Q, denoted by νQ, is a nonparametric prior.

5. The class of all nonparametric priors arises in this fashion.

6. The distribution of X2, X3, . . . given X1 is also exchangeable; it will be denoted by QX1.

7. The limit P of the empirical probability measures of X1, X2, . . . is also the limit of the empirical probability measures of X2, X3, . . . . Thus the distribution of P given X1 (the posterior distribution) is the distribution of P under QX1 and, by mere notation, is νQX1.


Dirichlet prior based on Polya urn sequences

The Polya urn sequence is an example of an infinite exchangeable sequence of random variables.

Let β be a pm on R1 and let α > 0. Define the joint distribution Pol(α, β) of X1, X2, . . . through

$$X_1 \sim \beta, \qquad X_n \mid (X_1, \dots, X_{n-1}) \sim \frac{\alpha\beta + \sum_{i=1}^{n-1} \delta_{X_i}}{\alpha + n - 1}, \quad n = 2, 3, \dots$$

This defines Pol(α, β) as an exchangeable probability measure.
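
A minimal sketch of sampling such a sequence; the choice β = N(0, 1) is an arbitrary illustrative base measure. At each step a fresh value is drawn from β with probability α/(α + n − 1), and otherwise a uniformly chosen earlier value is repeated:

```python
import numpy as np

def polya_urn(n, alpha, base_sampler, rng):
    """Draw X_1, ..., X_n from Pol(alpha, beta): X_1 ~ beta, and
    X_k | X_1..X_{k-1} ~ (alpha*beta + sum of point masses at past values) / (alpha + k - 1)."""
    x = [base_sampler(rng)]
    for k in range(2, n + 1):
        if rng.random() < alpha / (alpha + k - 1):
            x.append(base_sampler(rng))          # fresh draw from the base measure
        else:
            x.append(x[rng.integers(k - 1)])     # repeat a uniformly chosen earlier value
    return np.array(x)

rng = np.random.default_rng(3)
seq = polya_urn(20, alpha=2.0, base_sampler=lambda r: r.normal(), rng=rng)
print(seq)
```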


Dirichlet prior based on Polya urn sequences

The nonparametric prior νPol(α,β) is the same as the Dirichlet prior D(αβ)!

• That is, the distribution of (P(A1), . . . , P(Ak)) for any partition (A1, . . . , Ak), under Pol(α, β), is the finite dimensional Dirichlet D(αβ(A1), . . . , αβ(Ak)). In particular, P(A) ∼ B(αβ(A), αβ(Ac)).

• The conditional distribution of (X2, X3, . . . ) given X1 is Pol(α + 1, (αβ + δX1)/(α + 1)). Thus the posterior distribution given X1 is D(αβ + δX1).

• Each Pn is a discrete rpm and the limit P is also a discrete rpm. For this case of a Polya urn sequence, we can show that P({X1, . . . , Xn}) → 1 with probability 1 and thus P is a discrete rpm.


Dirichlet prior based on Polya urn sequences

The conditional distribution of P({X1}) given X1 is

B(1 + αβ({X1}), αβ(R1 \ {X1})).

This is tricky. Is P({X1}) measurable to begin with?

For the moment assume that β is nonatomic.

Then the above conditional distribution does not depend on X1, and thus X1 and P({X1}) are independent and

P({X1}) ∼ B(1, α).


Dirichlet prior based on Polya urn sequences

Let Y1, Y2, . . . be the distinct values among X1, X2, . . . , listed in the order of their appearance.

Then Y1 = X1,

Y1 and P({Y1}) are independent, and Y1 ∼ β, P({Y1}) ∼ B(1, α).


Dirichlet prior based on Polya urn sequences

Consider the sequence X2, X3, . . . with all occurrences of X1 removed. This reduced sequence is again a Polya urn sequence Pol(α, β) and is independent of Y1.

As before, Y2 and P({Y2})/(1 − P({Y1})) are independent, with Y2 ∼ β and P({Y2})/(1 − P({Y1})) ∼ B(1, α).

Thus

$$P(\{Y_1\}),\ \frac{P(\{Y_2\})}{1 - P(\{Y_1\})},\ \frac{P(\{Y_3\})}{1 - P(\{Y_1\}) - P(\{Y_2\})},\ \dots \ \text{ are i.i.d. } B(1, \alpha),$$

and all of these are independent of Y1, Y2, Y3, . . . , which are i.i.d. β.


Dirichlet prior based on Polya urn sequences

Since P is discrete and sits on the set {X1, X2, . . . }, which is {Y1, Y2, . . . }, it is also equal to $\sum_{i=1}^{\infty} P(\{Y_i\})\,\delta_{Y_i}$. In other words, we have the Sethuraman construction of the Dirichlet prior (if β is nonatomic).

Blackwell and MacQueen (1973) do not obtain this result; they show only that the rpm P is discrete for all Polya urn sequences.


Dirichlet prior based on Polya urn sequences

The Polya urn sequence also gives the predictive distribution under a Dirichlet prior by its very definition.

Let the distribution of X1, X2, . . . given P be i.i.d. P, where P has the Dirichlet prior D(αβ). Then X1, X2, . . . is the Polya urn sequence Pol(α, β).

Hence, the distribution of Xn given X1, . . . , Xn−1 is

$$\frac{\alpha\beta + \sum_{i=1}^{n-1} \delta_{X_i}}{\alpha + n - 1}.$$


Dirichlet prior based on Polya urn sequences

The awkwardness of the rpm P being discrete has turned into a gold mine for finding clusters in data.

Let Y1, . . . , YMn be the distinct values among X1, . . . , Xn and let their multiplicities be k1, . . . , kMn. From the predictive distribution given above we can show that the probability of Mn = m and (k1, . . . , km) is

$$\frac{\alpha^m\, \Gamma(\alpha) \prod_{j=1}^{m} \Gamma(k_j)}{\Gamma(\alpha + n)}.$$

From this we can obtain the marginal distribution of Mn, the number of distinct values (or clusters), and also the conditional distributions of the multiplicities.


Dirichlet prior based on Polya urn sequences

A curious property:

The number of distinct values (clusters) Mn goes to ∞ slower than n; in fact, $M_n/\log(n) \to \alpha$.

Also, from the LLN, $\frac{1}{M_n}\sum_{j=1}^{M_n} I(Y_j \le x) \to \beta(x)$ for all x with probability 1.

Note that $\frac{1}{n}\sum_{j=1}^{n} I(X_j \le x) \to F(x)$ with probability 1, where F is a rdf with distribution D(αβ).

Note that all of this required β to be nonatomic.
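
A quick simulation sketch of the Mn/log(n) → α behaviour, reusing the Polya urn scheme above; the values of α and n are illustrative:

```python
import numpy as np

def num_clusters(n, alpha, rng):
    """Number of distinct values M_n in a Polya urn sequence of length n.
    With a nonatomic base measure every 'fresh' draw is a new distinct value,
    so we only need to track when a fresh draw occurs."""
    m = 1                                        # X_1 is always a fresh draw
    for k in range(2, n + 1):
        if rng.random() < alpha / (alpha + k - 1):
            m += 1
    return m

rng = np.random.default_rng(4)
alpha, n = 2.0, 100_000
m = np.mean([num_clusters(n, alpha, rng) for _ in range(20)])
print(m / np.log(n))   # should be roughly alpha (the convergence is slow)
```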


Nonparametric priors through completely random measures

Kingman (1967) introduced the concept of completely random measures.

X(A), A ∈ B, is a completely random measure if X(·) is a random measure and X(A1), . . . , X(Ak) are independent whenever A1, . . . , Ak are disjoint.

If X(R1) < ∞ with probability 1, then P(A) = X(A)/X(R1) will be a random probability measure.

Kingman also characterized the class of all completely random measures (subject to a σ-finiteness condition) and showed how to generate them from Poisson processes and transition functions.

The Dirichlet prior is a special case of this.


Nonparametric priors and independent increment processes

A df is just an increasing function on the real line. Consider [0, 1] instead.

The class of processes X(t), t ∈ [0, 1], with nonnegative independent increments is well known from the theory of infinitely divisible laws.

When some simple cases are excluded, such a process has only a countable number of jumps, which are independent of their locations.


Nonparametric priors and independent increment processes

A special case is when X(t) ∼ Gamma(αβ(t)) for t ∈ [0, 1].

F(t) = X(t)/X(1) is a random distribution function (since X(1) < ∞ with probability 1). Ferguson (in a second definition) shows that its distribution is the nonparametric prior D(αβ(·)).

The properties of the Gamma process show that the rpm is discrete and also has the representation

$$P([0, t]) = F(t) = \sum_{i=1}^{\infty} p_i^{*}\, \delta_{Y_i}([0, t]),$$

where p∗1 > p∗2 > · · · are the (normalized) jumps of the Gamma process in decreasing order. The jumps are independent of the locations, and thus (p∗1, p∗2, . . . ) is independent of Y1, Y2, . . . , which are i.i.d. β.


Sethuraman construction of Dirichlet priors
Sethuraman (1994)

Let α > 0 and let β(·) be a pm. We do not assume that β is nonatomic. Let V1, V2, . . . be i.i.d. B(1, α) and let Y1, Y2, . . . be independent of V1, V2, . . . and i.i.d. β(·).

Let p1 = V1, p2 = (1 − V1)V2, p3 = (1 − V1)(1 − V2)V3, . . . . Note that (p1, p2, . . . ) is a random discrete distribution with i.i.d. discrete failure rates V1, V2, . . . : "stick breaking." In other contexts, like species sampling, this is called the GEM(α) distribution.

Then the convex mixture

$$P(\cdot) = \sum_{i=1}^{\infty} p_i\, \delta_{Y_i}(\cdot) \;\Big[= p_1 \delta_{Y_1}(\cdot) + (1 - p_1) \sum_{i=2}^{\infty} \frac{p_i}{1 - p_1}\, \delta_{Y_i}(\cdot)\Big]$$

is a random discrete probability measure and its distribution is the Dirichlet prior D(αβ).
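
A minimal sketch of this construction, truncated at a finite number of atoms; the truncation level and the choice β = N(0, 1) are illustrative assumptions, not part of the definition:

```python
import numpy as np

def stick_breaking_dp(alpha, base_sampler, rng, n_atoms=500):
    """Truncated draw from D(alpha*beta): weights p_i by stick breaking with V_i ~ Beta(1, alpha),
    locations Y_i i.i.d. from the base measure beta."""
    v = rng.beta(1.0, alpha, size=n_atoms)
    p = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))   # p_i = V_i * prod_{j<i} (1 - V_j)
    y = base_sampler(rng, n_atoms)
    return p, y

rng = np.random.default_rng(5)
p, y = stick_breaking_dp(alpha=3.0, base_sampler=lambda r, m: r.normal(size=m), rng=rng)
print(p.sum())                   # close to 1 for a large enough truncation
print(np.sum(p * (y <= 0.0)))    # one realization of P((-inf, 0]); across draws this is B(alpha*0.5, alpha*0.5)
```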


Sethuraman construction of Dirichlet priors

The random variables Y1, Y2, . . . can take values in any measure space.

The proof proceeds by rewriting the constructive definition as

$$P = p_1 \delta_{Y_1} + (1 - p_1) P^{*},$$

where all the random variables on the right are independent, p1 ∼ B(1, α), Y1 ∼ β, and the two rpm's P and P∗ have the same distribution.

For the distribution of P we have the distributional identity

$$P \stackrel{d}{=} p_1 \delta_{Y_1} + (1 - p_1) P.$$

We first show that D(αβ) is a solution to this distributional equation and then that the solution is unique.


Sethuraman construction of Dirichlet priors

To summarize,

$$P(\cdot) = \sum_{i=1}^{\infty} p_i\, \delta_{Y_i}(\cdot)$$

is a concrete representation of a random probability measure and its distribution is D(αβ(·)).

It was not assumed that β is a nonatomic probability measure, and the Yi's can be general random variables.


Sethuraman construction of Dirichlet priors

The Sethuraman construction is similar to the representation

$$P(\cdot) = \sum_{i=1}^{\infty} p_i^{*}\, \delta_{Y_i}(\cdot)$$

obtained when using a Gamma process to define the Dirichlet prior.

It can be shown that (p∗1, p∗2, . . . ) has the same distribution as (p1, p2, . . . ) arranged in decreasing order.

Definition: One-time size-biased sampling converts (p∗1, p∗2, . . . ) to (p∗∗1, p∗∗2, . . . ) as follows. Let J be an integer valued random variable with Prob(J = n | p∗1, p∗2, . . . ) = p∗n. If J = 1, define (p∗∗1, p∗∗2, . . . ) to be the same as (p∗1, p∗2, . . . ). If J > 1, put p∗∗1 = p∗J and let p∗∗2, p∗∗3, . . . be p∗1, p∗2, . . . with p∗J removed.
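
A minimal sketch of one round of this size-biased sampling, operating on a finite vector of weights (an illustrative simplification of the infinite sequence):

```python
import numpy as np

def size_biased_step(p, rng):
    """One-time size-biased sampling: pick index J with P(J = j) proportional to p_j,
    move p_J to the front, and keep the remaining weights in their original order."""
    p = np.asarray(p, dtype=float)
    j = rng.choice(len(p), p=p / p.sum())
    return np.concatenate(([p[j]], np.delete(p, j)))

rng = np.random.default_rng(6)
weights = np.array([0.5, 0.3, 0.15, 0.05])
print(size_biased_step(weights, rng))
```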


Sethuraman construction of Dirichlet priors

As a converse result, the distribution of (p1, p2, . . . ) is the limiting distribution obtained by repeated size-biased sampling of (p∗1, p∗2, . . . ). McCloskey (1965), Pitman (1996), etc.

The distribution of (p1, p2, . . . ) does not change after a single size-biased sampling.

This property can be used to establish that in the nonparametric problem with one observation X1, the posterior distribution of P given X1 is D(αβ + δX1). This is simpler than the proof given in Sethuraman (1994).


Sethuraman construction of Dirichlet priors

Ferguson showed that the support of D(αβ) is the collection of probability measures in P whose support is contained in the support of β.

If the support of β is R1, then the support of D(αβ) is all of P.

This assures us that even though D(αβ) gives probability 1 to the class of discrete pm's, it gives positive probability to every neighborhood of every pm.


Dirichlet priors are not discrete

The Dirichlet prior D(αβ) is not a discrete pm; it just sits on the set of all discrete pm's.

The rpm P with distribution D(αβ) is a random discrete probability measure.


Absolute continuity

Consider D(αiβi), i = 1, 2. These two measures are either mutually absolutely continuous or orthogonal.

Let βi1 and βi2 be the atomic and nonatomic parts of βi, i = 1, 2.

D(α1β1) and D(α2β2) are orthogonal to each other if α1 ≠ α2, or if β11 ≠ β21, or if the support of β12 differs from the support of β22. There is a necessary and sufficient condition for orthogonality when all of these agree.


Absolute continuity

The curious result that we can consistently estimate the parameters of the Dirichlet prior from the sample arises because of the orthogonality of Dirichlet distributions with different parameters.

Another curious result: when β is nonatomic, the prior D(αβ), the posterior given X1, the posterior given X1, X2, the posterior given X1, X2, X3, etc. are all orthogonal to one another!


Some properties of Dirichlet priors

A simple problem is the estimation of the "true" mean, i.e. ∫ x dP(x), from data X1, X2, . . . , Xn which are i.i.d. P.

In the Bayesian nonparametric problem, P has a prior distribution D(αβ) and, given P, the data X1, . . . , Xn are i.i.d. P.

The Bayesian estimate of ∫ x dP(x) is its mean under the posterior distribution.

However, before estimating ∫ x dP(x), one should first check that the prior and the posterior give probability 1 to the set {P : ∫ |x| dP(x) < ∞}.


Some properties of Dirichlet priors

Feigin and Tweedie (1989), and others later, gave a necessary and sufficient condition for this, namely

$$\int \log(\max(1, |x|))\, d\beta(x) < \infty.$$

From the constructive definition,

$$\int |x|\, dP(x) = \sum_{i=1}^{\infty} p_i |Y_i|.$$

The Kolmogorov three series theorem gives a simple direct proof; Sethuraman (2010).


Some properties of Dirichlet priors

The Bayes estimate of the mean is the mean under the posterior distribution D(αβ + nPn), and it is

$$\frac{\alpha \int x\, d\beta(x) + n \bar{X}_n}{\alpha + n}.$$

One should first check that ∫ log(max(1, |x|)) dβ(x) < ∞. Note that the Bayesian can estimate ∫ x dP(x) in this case even though the base measure β may not have a mean. (What does the frequentist estimate X̄n estimate in this case?)
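
A minimal sketch of this estimate, assuming the base measure β has a known mean (the data, α, and the base mean below are illustrative choices):

```python
import numpy as np

def bayes_mean_estimate(x, alpha, base_mean):
    """Posterior mean of the 'true' mean under a D(alpha*beta) prior:
    (alpha * mean of beta + n * sample mean) / (alpha + n)."""
    x = np.asarray(x)
    n = len(x)
    return (alpha * base_mean + n * x.mean()) / (alpha + n)

rng = np.random.default_rng(7)
data = rng.normal(loc=3.0, size=40)
print(bayes_mean_estimate(data, alpha=5.0, base_mean=0.0))
```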


Some properties of Dirichlet priors

The actual distribution of ∫ x dP(x) under D(αβ) is a vexing problem. Regazzini, Lijoi and Prunster (2003) and Lijoi and Prunster (2009) have the best results.

When β is the Cauchy distribution, it is easy to see that

$$\int x\, dP(x) = \sum_{i=1}^{\infty} p_i Y_i,$$

where Y1, Y2, . . . are i.i.d. Cauchy, and hence ∫ x dP(x) is Cauchy.

One does not need the GEM property of (p1, p2, . . . ) for this; it is enough for it to be independent of (Y1, Y2, . . . ). Yamato (1984) was the first to prove this.


Some properties of Dirichlet priors

Convergence of Dirichlet priors when their parameters converge is easy to establish using the constructive definition.

When α → ∞ and β is fixed, D(αβ) converges weakly to δβ.

When α → 0 and β is fixed, D(αβ) converges weakly to the distribution of δY, where Y ∼ β.

Sethuraman and Tiwari (1982).


Some properties of Dirichlet priors

The constructive definition

$$P(\cdot) = \sum_{i=1}^{\infty} p_i\, \delta_{Y_i}(\cdot)$$

leads to the inequality

$$\Big\| P - \sum_{i=1}^{M} p_i\, \delta_{Y_i} \Big\| \le \prod_{i=1}^{M} (1 - p_i).$$

So one can allow for several kinds of random stopping to stay within chosen errors. One can also stop at nonrandom times and have probability bounds for the errors. Muliere and Tardella (1998) have several results of this type.
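
A small sketch of the kind of random stopping this allows: keep breaking the stick until the unassigned mass 1 − Σ pi = Π(1 − Vi), which is at most the product bound above, falls below a chosen tolerance (the tolerance and the base measure are illustrative choices):

```python
import numpy as np

def truncated_stick_breaking(alpha, base_sampler, rng, tol=1e-3, max_atoms=10_000):
    """Break the stick until the unassigned mass prod(1 - V_i) = 1 - sum(p_i) drops below `tol`;
    the truncation error ||P - sum_{i<=M} p_i delta_{Y_i}|| is then below `tol` as well."""
    p, leftover = [], 1.0
    while leftover > tol and len(p) < max_atoms:
        v = rng.beta(1.0, alpha)
        p.append(leftover * v)
        leftover *= (1.0 - v)
    p = np.array(p)
    y = base_sampler(rng, len(p))
    return p, y, leftover

rng = np.random.default_rng(8)
p, y, err = truncated_stick_breaking(alpha=3.0, base_sampler=lambda r, m: r.normal(size=m), rng=rng)
print(len(p), err)
```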


Some properties of Dirichlet priors

Let P ∼ D(αβ) and let f be a function of X. Then Pf −1, the distribution of f under P, has the distribution D(αβf −1), where βf −1 is the distribution of f under β.


Some properties of Dirichlet priors

Regression problems are studied when the data are of the form (Y1, X1), (Y2, X2), . . . and the model is Yi = f(Xi, εi). If the distribution of ε is P, then the conditional distribution of Y given X = x is the distribution of f(x, ·) under P and is denoted by Mx.

Let P ∼ D(αβ) and, given P, let ε1, ε2, . . . be i.i.d. P. The constructive definition gives

$$P = \sum_{i=1}^{\infty} p_i\, \delta_{Z_i},$$

where Z1, Z2, . . . are i.i.d. β.


Some properties of Dirichlet priors

The df Mx is the distribution of f(x, ·) under P; it thus has a Dirichlet distribution and can be represented by

$$P_x = \sum_{i=1}^{\infty} p_i\, \delta_{Z_{i,x}},$$

where Z1,x, Z2,x, . . . are i.i.d. from the distribution of f(x, ·) under β. MacEachern (1999) calls this the dependent Dirichlet process (DDP).


Some properties of Dirichlet priors

Ishwaran and James (2011) allow the Vi's appearing in the definition of the pi's to be independent B(ai, bi) random variables. Rodriguez (2011) allows Vi = Φ(Ni), where N1, N2, . . . are i.i.d. N(µ, σ²), and calls it "probit stick breaking". Bayes analysis with such priors can be handled only by computational methods.


Bayes hierarchical models

A popular Bayes hierarchical model is usually stated as follows.

The data X1, . . . , Xn are independent with distributions K(·, θ1), . . . , K(·, θn), where K(·, θ) is a nice continuous df for each θ.

Next it is assumed that there is an rpm P and, given P, θ1, . . . , θn are i.i.d. P.

Finally, it is assumed that P has the D(αβ) distribution.

This is also called the DP mixture model and it has many applications.

One wants the posterior distribution of P given the data X1, . . . , Xn.


Bayes hierarchical models

One should state the Bayes hierarchical model more carefully.

Let the rpm P have the D(αβ) distribution.

Conditional on P, let θ1, . . . , θn be i.i.d. P.

Conditional on (P, θ1, . . . , θn), let X1, . . . , Xn be independent with distributions K(·, θ1), . . . , K(·, θn), where K(·, θ) is a df for each θ.

Then (X1, . . . , Xn, θ1, . . . , θn, P) will have a well defined joint distribution and all the necessary conditional distributions are also well defined. Ghosh and Ramamoorthi (2003) are careful to state the model this way throughout their book.


Bayes hierarchical models

Several computational methods have been proposed in the literature. West, Muller, and Escobar (1994), Escobar and West (1998), and MacEachern (1998) have studied computational methods. They integrate out the rpm P and deal with the joint distribution of (X1, . . . , Xn, θ1, . . . , θn).

The hierarchical model can also be expressed as follows, suppressing the latent variables θ1, . . . , θn. For any pm P, let K(·, P) = ∫ K(·, θ) dP(θ).

Let P ∼ D(αβ) and, given P, let X1, . . . , Xn be i.i.d. K(·, P).


Bayes hierarchical models

To perform an MCMC one can use the constructive definition and put

$$K(\cdot, P) = \sum_{i=1}^{\infty} p_i\, K(\cdot, Y_i),$$

where (p1, p2, . . . ) is the usual GEM(α) and Y1, Y2, . . . are i.i.d. β. To do the MCMC here one has to truncate this infinite series appropriately. Doss (1994), Ishwaran and James (2011), and others.
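
A minimal generative sketch of this mixture representation, using a normal kernel K(·, θ) = N(θ, 1), base measure β = N(0, 1), and a truncated stick-breaking P; the kernel, base measure, and truncation level are illustrative assumptions:

```python
import numpy as np

def sample_dp_mixture(n, alpha, rng, n_atoms=200):
    """Draw X_1, ..., X_n from K(., P) = sum_i p_i K(., Y_i), with P a truncated stick-breaking
    draw from D(alpha*beta), beta = N(0, 1), and kernel K(., theta) = N(theta, 1)."""
    v = rng.beta(1.0, alpha, size=n_atoms)
    p = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    p /= p.sum()                                   # renormalize the truncated weights
    y = rng.normal(size=n_atoms)                   # atoms Y_i ~ beta
    components = rng.choice(n_atoms, size=n, p=p)  # latent theta_i = Y at the chosen atom
    return rng.normal(loc=y[components], scale=1.0)

rng = np.random.default_rng(9)
print(sample_dp_mixture(5, alpha=2.0, rng=rng))
```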


More

• More Bayes hierarchical models: Neal (2000), Rodriguez, Dunson and Gelfand (2008)

• Clustering applications: Blei and Jordan (2006), Wang, Shan and Banerjee (2009)

• Two parameter Dirichlet process: Perman, Pitman and Yor (1992), Pitman and Yor (1997)

• Partition based priors, useful in reliability and repair models: Hollander and Sethuraman (2009)