EM algorithm and its application in Probabilistic Latent Semantic Analysis (pLSA)
Duc-Hieu Tran, tdh.net [at] gmail.com
Nanyang Technological University
July 27, 2010
Outline

The parameter estimation problem
EM algorithm
Probabilistic Latent Semantic Analysis
Reference
The parameter estimation problem
Introduction
Given the prior probabilities P(ωi) and the class-conditional densities p(x|ωi), we obtain the optimal classifier:
- P(ωj|x) ∝ p(x|ωj)P(ωj)
- decide ωi if P(ωi|x) > P(ωj|x), ∀j ≠ i
In practice, p(x|ωi) is unknown and must be estimated from training samples (e.g., by assuming p(x|ωi) ∼ N(µi, Σi)).
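As a concrete illustration of this decision rule, here is a minimal sketch assuming two classes with known priors and univariate Gaussian class-conditional densities; all numbers are made-up assumptions, not from the slides:

```python
# Bayes decision rule with assumed Gaussian class-conditional densities.
from math import exp, pi, sqrt

priors = {"w1": 0.6, "w2": 0.4}                 # P(omega_i), assumed known
params = {"w1": (0.0, 1.0), "w2": (2.5, 1.0)}   # (mu_i, sigma_i), assumed known

def density(x, mu, sigma):
    """Class-conditional density p(x | omega_i) ~ N(mu, sigma^2)."""
    return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

def classify(x):
    # P(omega_i | x) is proportional to p(x | omega_i) P(omega_i); the
    # normalizer p(x) is common to both classes, so argmax suffices.
    scores = {w: density(x, *params[w]) * priors[w] for w in priors}
    return max(scores, key=scores.get)

print(classify(0.3))   # -> "w1"
print(classify(2.2))   # -> "w2"
```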
Frequentist vs. Bayesian schools
Frequentist
- parameters are quantities whose values are fixed but unknown.
- the best estimate of their values is the one that maximizes the probability of obtaining the observed samples.

Bayesian
- parameters are random variables having some known prior distribution.
- observation of the samples converts this to a posterior density, revising our opinion about the true values of the parameters.
Examples
- training samples: S = {(x(1), y(1)), . . . , (x(m), y(m))}
- frequentist: maximum likelihood

  max_θ ∏_i p(y(i)|x(i); θ)

- Bayesian: a prior P(θ), e.g., P(θ) ∼ N(0, I)

  P(θ|S) ∝ ( ∏_{i=1}^{m} P(y(i)|x(i), θ) ) · P(θ)

  θ_MAP = arg max_θ P(θ|S)
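To make the two schools concrete, here is a minimal sketch comparing the MLE and the MAP estimate for the mean of a univariate Gaussian with a N(0, 1) prior; the slide's setting is a conditional model p(y|x; θ), so this unconditional analogue is an illustrative simplification:

```python
# MLE vs. MAP for the mean of a univariate Gaussian -- an illustrative
# sketch; the data and prior here are assumptions, not from the slides.
import numpy as np

rng = np.random.default_rng(0)
true_theta, sigma = 2.0, 1.0
x = rng.normal(true_theta, sigma, size=20)   # observed samples

# frequentist: maximize prod_i p(x(i); theta)  =>  the sample mean
theta_mle = x.mean()

# Bayesian: prior theta ~ N(0, 1); the posterior is Gaussian and its
# mode (the MAP estimate) has the closed form below.
m = len(x)
theta_map = (x.sum() / sigma**2) / (m / sigma**2 + 1.0)

print(f"MLE: {theta_mle:.3f}")
print(f"MAP: {theta_map:.3f} (shrunk toward the prior mean 0)")
```

The MAP estimate is pulled toward the prior mean; as m grows, the two estimates coincide.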
EM algorithm
An estimation problem
- training set of m independent samples: {x(1), x(2), . . . , x(m)}
- goal: fit the parameters of a model p(x, z) to the data
- the likelihood:

  ℓ(θ) = ∑_{i=1}^{m} log p(x(i); θ) = ∑_{i=1}^{m} log ∑_z p(x(i), z; θ)

- explicitly maximizing ℓ(θ) might be difficult.
- z is a latent random variable; if z(i) were observed, then maximum likelihood estimation would be easy.
- strategy: repeatedly construct a lower bound on ℓ (E-step) and optimize that lower bound (M-step).
EM algorithm (1)
- digression, Jensen's inequality: for a convex function f, E[f(X)] ≥ f(E[X]).
- for each i, let Qi be a distribution over z: ∑_z Qi(z) = 1, Qi(z) ≥ 0. Then

  ℓ(θ) = ∑_i log p(x(i); θ)
       = ∑_i log ∑_{z(i)} p(x(i), z(i); θ)
       = ∑_i log ∑_{z(i)} Qi(z(i)) · [ p(x(i), z(i); θ) / Qi(z(i)) ]        (1)

  applying Jensen's inequality to the concave function log:

       ≥ ∑_i ∑_{z(i)} Qi(z(i)) log [ p(x(i), z(i); θ) / Qi(z(i)) ]          (2)
(More detail in the appendix.)
EM algorithm (2)
- for any set of distributions Qi, formula (2) gives a lower bound on ℓ(θ)
- how to choose Qi?
- strategy: make the inequality hold with equality at our particular value of θ.
- require:

  p(x(i), z(i); θ) / Qi(z(i)) = c,  where c is a constant that does not depend on z(i)

- choose: Qi(z(i)) ∝ p(x(i), z(i); θ)
- we know ∑_z Qi(z) = 1, so

  Qi(z(i)) = p(x(i), z(i); θ) / ∑_z p(x(i), z; θ) = p(x(i), z(i); θ) / p(x(i); θ) = p(z(i)|x(i); θ)
EM algorithm (3)
- Qi is the posterior distribution of z(i) given x(i) and the parameters θ.

EM algorithm: repeat until convergence
- E-step: for each i, set

  Qi(z(i)) := p(z(i)|x(i); θ)

- M-step: set

  θ := arg max_θ ∑_i ∑_{z(i)} Qi(z(i)) log [ p(x(i), z(i); θ) / Qi(z(i)) ]

The algorithm converges, since ℓ(θ(t)) ≤ ℓ(θ(t+1)); a worked sketch follows below.
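As a worked instance of these two steps, here is a minimal sketch fitting a two-component univariate Gaussian mixture, where z(i) is the unobserved component label of sample x(i); the data and initialization are illustrative assumptions:

```python
# EM for a two-component 1-D Gaussian mixture. The slides state EM
# generically for any p(x, z; theta); this is one canonical instance.
import numpy as np

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 1.0, 300),
                    rng.normal( 3.0, 1.0, 200)])

# initial parameters theta = (pi, mu, var)
pi  = np.array([0.5, 0.5])      # mixing weights
mu  = np.array([-1.0, 1.0])     # component means
var = np.array([1.0, 1.0])      # component variances

def gauss(x, m, v):
    return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

for step in range(50):
    # E-step: Q_i(z) = p(z | x(i); theta), the posterior responsibilities
    joint = pi * gauss(x[:, None], mu, var)        # shape (n, 2)
    Q = joint / joint.sum(axis=1, keepdims=True)

    # M-step: maximize the lower bound; closed form for a Gaussian mixture
    nk  = Q.sum(axis=0)
    pi  = nk / len(x)
    mu  = (Q * x[:, None]).sum(axis=0) / nk
    var = (Q * (x[:, None] - mu) ** 2).sum(axis=0) / nk

print("weights:", pi.round(3), "means:", mu.round(3), "vars:", var.round(3))
```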
EM algorithm (4)

Digression: coordinate ascent algorithm
- goal: max_α W(α1, . . . , αm)
- loop until convergence: for i ∈ 1, . . . , m:

  αi := arg max_α̂i W(α1, . . . , α̂i, . . . , αm)

EM algorithm as coordinate ascent
- define

  J(Q, θ) = ∑_i ∑_{z(i)} Qi(z(i)) log [ p(x(i), z(i); θ) / Qi(z(i)) ]

- ℓ(θ) ≥ J(Q, θ)
- the EM algorithm can be viewed as coordinate ascent on J: the E-step maximizes J w.r.t. Q, the M-step maximizes J w.r.t. θ (see the sketch after this list).
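A minimal sketch of coordinate ascent itself, on a made-up concave function W(α1, α2) with closed-form per-coordinate maximizers; in EM the two "coordinates" are Q and θ:

```python
# Coordinate ascent on W(a1, a2) = -(a1 - 1)^2 - (a2 + 2)^2 - a1*a2,
# a concave function chosen purely for illustration.
def argmax_a1(a2):  # dW/da1 = -2(a1 - 1) - a2 = 0
    return 1.0 - a2 / 2.0

def argmax_a2(a1):  # dW/da2 = -2(a2 + 2) - a1 = 0
    return -2.0 - a1 / 2.0

a1, a2 = 0.0, 0.0
for it in range(20):
    a1 = argmax_a1(a2)   # maximize over the first coordinate
    a2 = argmax_a2(a1)   # then over the second, holding the first fixed
print(a1, a2)            # converges to the stationary point (8/3, -10/3)
```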
Probabilistic Latent Semantic Analysis
Probabilistic Latent Semantic Analysis (1)
- set of documents D = {d1, . . . , dN}
- set of words W = {w1, . . . , wM}
- set of unobserved classes Z = {z1, . . . , zK}
- conditional independence assumption:

  P(di, wj|zk) = P(di|zk)P(wj|zk)                                            (3)

- so,

  P(wj|di) = ∑_{k=1}^{K} P(zk|di)P(wj|zk)                                    (4)

  P(di, wj) = P(di) ∑_{k=1}^{K} P(wj|zk)P(zk|di)

(More detail in the appendix.)
Probabilistic Latent Semantic Analysis (2)
- n(di, wj) – number of occurrences of word wj in document di
- likelihood:

  L = ∏_{i=1}^{N} ∏_{j=1}^{M} [P(di, wj)]^{n(di, wj)}
    = ∏_{i=1}^{N} ∏_{j=1}^{M} [ P(di) ∑_{k=1}^{K} P(wj|zk)P(zk|di) ]^{n(di, wj)}

- log-likelihood ℓ = log L:

  ℓ = ∑_{i=1}^{N} ∑_{j=1}^{M} [ n(di, wj) log P(di) + n(di, wj) log ∑_{k=1}^{K} P(wj|zk)P(zk|di) ]
Probabilistic Latent Semantic Analysis (3)

- maximize ℓ w.r.t. P(wj|zk) and P(zk|di)
- equivalently, maximize (the term involving log P(di) does not depend on these parameters):

  ∑_{i=1}^{N} ∑_{j=1}^{M} n(di, wj) log ∑_{k=1}^{K} P(wj|zk)P(zk|di)
  = ∑_{i=1}^{N} ∑_{j=1}^{M} n(di, wj) log ∑_{k=1}^{K} Qk(zk) · [ P(wj|zk)P(zk|di) / Qk(zk) ]
  ≥ ∑_{i=1}^{N} ∑_{j=1}^{M} n(di, wj) ∑_{k=1}^{K} Qk(zk) log [ P(wj|zk)P(zk|di) / Qk(zk) ]

- choose, as in the E-step of the generic EM algorithm:

  Qk(zk) = P(wj|zk)P(zk|di) / ∑_{l=1}^{K} P(wj|zl)P(zl|di) = P(zk|di, wj)

(More detail in the appendix.)
Probabilistic Latent Semantic Analysis (4)
- maximize, w.r.t. P(wj|zk) and P(zk|di),

  ∑_{i=1}^{N} ∑_{j=1}^{M} n(di, wj) ∑_{k=1}^{K} P(zk|di, wj) log [ P(wj|zk)P(zk|di) / P(zk|di, wj) ]

- equivalently, since P(zk|di, wj) is held fixed during this step, maximize

  ∑_{i=1}^{N} ∑_{j=1}^{M} n(di, wj) ∑_{k=1}^{K} P(zk|di, wj) log[P(wj|zk)P(zk|di)]
Probabilistic Latent Semantic Analysis (5)

EM algorithm
- E-step: update

  P(zk|di, wj) = P(wj|zk)P(zk|di) / ∑_{l=1}^{K} P(wj|zl)P(zl|di)

- M-step: maximize, w.r.t. P(wj|zk) and P(zk|di),

  ∑_{i=1}^{N} ∑_{j=1}^{M} n(di, wj) ∑_{k=1}^{K} P(zk|di, wj) log[P(wj|zk)P(zk|di)]

  subject to

  ∑_{j=1}^{M} P(wj|zk) = 1, k ∈ {1, . . . , K}
  ∑_{k=1}^{K} P(zk|di) = 1, i ∈ {1, . . . , N}
Probabilistic Latent Semantic Analysis (6)
Solution of the maximization problem in the M-step:

  P(wj|zk) = ∑_{i=1}^{N} n(di, wj)P(zk|di, wj) / ∑_{m=1}^{M} ∑_{n=1}^{N} n(dn, wm)P(zk|dn, wm)

  P(zk|di) = ∑_{j=1}^{M} n(di, wj)P(zk|di, wj) / n(di)

where n(di) = ∑_{j=1}^{M} n(di, wj).

(More detail in the appendix.)
Probabilistic Latent Semantic Analysis (7)
All together:
- E-step:

  P(zk|di, wj) = P(wj|zk)P(zk|di) / ∑_{l=1}^{K} P(wj|zl)P(zl|di)

- M-step:

  P(wj|zk) = ∑_{i=1}^{N} n(di, wj)P(zk|di, wj) / ∑_{m=1}^{M} ∑_{n=1}^{N} n(dn, wm)P(zk|dn, wm)

  P(zk|di) = ∑_{j=1}^{M} n(di, wj)P(zk|di, wj) / n(di)
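The two updates translate directly into code. Below is a minimal numpy sketch of the pLSA EM iteration on a toy count matrix; the variable names (p_w_z for P(wj|zk), p_z_d for P(zk|di), p_z_dw for P(zk|di, wj)) and the toy data are assumptions for illustration:

```python
# pLSA via EM, mirroring the E- and M-steps above.
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 8, 12, 2                       # docs, words, latent classes
n = rng.integers(0, 5, size=(N, M))      # toy term counts n(d_i, w_j)

# random normalized initialization of the parameters
p_w_z = rng.random((K, M)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)
p_z_d = rng.random((N, K)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)

for step in range(100):
    # E-step: P(z_k | d_i, w_j) proportional to P(w_j | z_k) P(z_k | d_i)
    joint = p_z_d[:, :, None] * p_w_z[None, :, :]     # shape (N, K, M)
    p_z_dw = joint / joint.sum(axis=1, keepdims=True)

    # M-step:
    # P(w_j | z_k) proportional to sum_i n(d_i, w_j) P(z_k | d_i, w_j)
    nw = (n[:, None, :] * p_z_dw).sum(axis=0)         # (K, M)
    p_w_z = nw / nw.sum(axis=1, keepdims=True)
    # P(z_k | d_i) = sum_j n(d_i, w_j) P(z_k | d_i, w_j) / n(d_i)
    nd = (n[:, None, :] * p_z_dw).sum(axis=2)         # (N, K)
    p_z_d = nd / n.sum(axis=1, keepdims=True)

# objective, up to the constant term sum_ij n(d_i, w_j) log P(d_i)
ll = (n * np.log(p_z_d @ p_w_z + 1e-12)).sum()
print("objective per count:", ll / n.sum())
```

Each EM iteration is non-decreasing in this objective, so the printed value can be used to monitor convergence.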
Reference
- R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification, Wiley-Interscience, 2001.
- T. Hofmann, "Unsupervised learning by probabilistic latent semantic analysis," Machine Learning, vol. 42, 2001, pp. 177-196.
- A. Ng, "Machine Learning (CS229)," course notes, Stanford University.
Appendix
Generative model for word/document co-occurrence
- select a document di with probability (w.p.) P(di)
- pick a latent class zk w.p. P(zk|di)
- generate a word wj w.p. P(wj|zk)
  P(di, wj) = ∑_{k=1}^{K} P(di, wj|zk)P(zk) = ∑_{k=1}^{K} P(wj|zk)P(di|zk)P(zk)
            = ∑_{k=1}^{K} P(wj|zk)P(zk|di)P(di)
            = P(di) ∑_{k=1}^{K} P(wj|zk)P(zk|di)

  P(di, wj) = P(wj|di)P(di)

  ⇒ P(wj|di) = ∑_{k=1}^{K} P(zk|di)P(wj|zk)
  P(wj|di) = ∑_{k=1}^{K} P(zk|di)P(wj|zk)

Since ∑_{k=1}^{K} P(zk|di) = 1, P(wj|di) is a convex combination of the P(wj|zk), i.e., each document is modelled as a mixture of topics.
  P(zk|di, wj) = P(di, wj|zk)P(zk) / P(di, wj)                               (5)
               = P(wj|zk)P(di|zk)P(zk) / P(di, wj)                           (6)
               = P(wj|zk)P(zk|di) / P(wj|di)                                 (7)
               = P(wj|zk)P(zk|di) / ∑_{l=1}^{K} P(wj|zl)P(zl|di)             (8)

From (5) to (6) by the conditional independence assumption (3); from (7) to (8) by (4).
Lagrange multipliers τk, ρi:

  H = ∑_{i=1}^{N} ∑_{j=1}^{M} n(di, wj) ∑_{k=1}^{K} P(zk|di, wj) log[P(wj|zk)P(zk|di)]
    + ∑_{k=1}^{K} τk [ 1 − ∑_{j=1}^{M} P(wj|zk) ] + ∑_{i=1}^{N} ρi [ 1 − ∑_{k=1}^{K} P(zk|di) ]

Setting the partial derivatives to zero:

  ∂H/∂P(wj|zk) = ∑_{i=1}^{N} P(zk|di, wj)n(di, wj) / P(wj|zk) − τk = 0

  ∂H/∂P(zk|di) = ∑_{j=1}^{M} n(di, wj)P(zk|di, wj) / P(zk|di) − ρi = 0
From ∑_{j=1}^{M} P(wj|zk) = 1:

  τk = ∑_{j=1}^{M} ∑_{i=1}^{N} P(zk|di, wj)n(di, wj)

From ∑_{k=1}^{K} P(zk|di, wj) = 1:

  ρi = n(di)

Substituting τk and ρi back into the stationarity conditions yields the M-step updates for P(wj|zk) and P(zk|di), as shown below.
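For completeness, a short LaTeX rendering of that final substitution, consistent with the M-step formulas given earlier:

```latex
% Substituting the multipliers back into the stationarity conditions.
\begin{align*}
P(w_j|z_k) &= \frac{\sum_{i=1}^{N} n(d_i,w_j)\,P(z_k|d_i,w_j)}{\tau_k}
            = \frac{\sum_{i=1}^{N} n(d_i,w_j)\,P(z_k|d_i,w_j)}
                   {\sum_{m=1}^{M}\sum_{n=1}^{N} n(d_n,w_m)\,P(z_k|d_n,w_m)}, \\[4pt]
P(z_k|d_i) &= \frac{\sum_{j=1}^{M} n(d_i,w_j)\,P(z_k|d_i,w_j)}{\rho_i}
            = \frac{\sum_{j=1}^{M} n(d_i,w_j)\,P(z_k|d_i,w_j)}{n(d_i)}.
\end{align*}
```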
Applying Jensen's inequality
- f(x) = log(x) is a concave function, so

  f( E_{z(i)∼Qi}[ p(x(i), z(i); θ) / Qi(z(i)) ] ) ≥ E_{z(i)∼Qi}[ f( p(x(i), z(i); θ) / Qi(z(i)) ) ]