EM algorithm and its application in Probabilistic Latent Semantic Analysis (pLSA)
Duc-Hieu Tran, tdh.net [at] gmail.com
Nanyang Technological University
July 27, 2010
Outline

The parameter estimation problem
EM algorithm
Probabilistic Latent Semantic Analysis
Reference
The parameter estimation problem
Introduction
Given the prior probabilities P(ωi) and the class-conditional densities p(x|ωi), we obtain the optimal classifier:
- P(ωj|x) ∝ p(x|ωj)P(ωj)
- decide ωi if P(ωi|x) > P(ωj|x), ∀j ≠ i
In practice, p(x|ωi) is unknown and must be estimated from training samples (e.g., by assuming p(x|ωi) ∼ N(µi, Σi)).
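As a concrete illustration of this decision rule, here is a minimal sketch assuming two classes with known priors and univariate Gaussian class-conditional densities; all numbers are made-up assumptions, not from the slides:

```python
# Bayes decision rule with assumed Gaussian class-conditional densities.
from math import exp, pi, sqrt

priors = {"w1": 0.6, "w2": 0.4}                 # P(omega_i), assumed known
params = {"w1": (0.0, 1.0), "w2": (2.5, 1.0)}   # (mu_i, sigma_i), assumed known

def density(x, mu, sigma):
    """Class-conditional density p(x | omega_i) ~ N(mu, sigma^2)."""
    return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

def classify(x):
    # P(omega_i | x) is proportional to p(x | omega_i) P(omega_i); the
    # normalizer p(x) is common to both classes, so argmax suffices.
    scores = {w: density(x, *params[w]) * priors[w] for w in priors}
    return max(scores, key=scores.get)

print(classify(0.3))   # -> "w1"
print(classify(2.2))   # -> "w2"
```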
Frequentist vs. Bayesian schools
Frequentist
- parameters are quantities whose values are fixed but unknown.
- the best estimate of their values is the one that maximizes the probability of obtaining the observed samples.

Bayesian
- parameters are random variables having some known prior distribution.
- observation of the samples converts this to a posterior density, revising our opinion about the true values of the parameters.
Examples
- training samples: S = {(x(1), y(1)), . . . , (x(m), y(m))}
- frequentist: maximum likelihood

  max_θ ∏_i p(y(i)|x(i); θ)

- Bayesian: a prior P(θ), e.g., P(θ) ∼ N(0, I)

  P(θ|S) ∝ ( ∏_{i=1}^{m} P(y(i)|x(i), θ) ) · P(θ)

  θ_MAP = arg max_θ P(θ|S)
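To make the two schools concrete, here is a minimal sketch comparing the MLE and the MAP estimate for the mean of a univariate Gaussian with a N(0, 1) prior; the slide's setting is a conditional model p(y|x; θ), so this unconditional analogue is an illustrative simplification:

```python
# MLE vs. MAP for the mean of a univariate Gaussian -- an illustrative
# sketch; the data and prior here are assumptions, not from the slides.
import numpy as np

rng = np.random.default_rng(0)
true_theta, sigma = 2.0, 1.0
x = rng.normal(true_theta, sigma, size=20)   # observed samples

# frequentist: maximize prod_i p(x(i); theta)  =>  the sample mean
theta_mle = x.mean()

# Bayesian: prior theta ~ N(0, 1); the posterior is Gaussian and its
# mode (the MAP estimate) has the closed form below.
m = len(x)
theta_map = (x.sum() / sigma**2) / (m / sigma**2 + 1.0)

print(f"MLE: {theta_mle:.3f}")
print(f"MAP: {theta_map:.3f} (shrunk toward the prior mean 0)")
```

The MAP estimate is pulled toward the prior mean; as m grows, the two estimates coincide.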
EM algorithm
An estimation problem
- training set of m independent samples: {x(1), x(2), . . . , x(m)}
- goal: fit the parameters of a model p(x, z) to the data
- the likelihood:

  ℓ(θ) = ∑_{i=1}^{m} log p(x(i); θ) = ∑_{i=1}^{m} log ∑_z p(x(i), z; θ)

- explicitly maximizing ℓ(θ) might be difficult.
- z is a latent random variable; if z(i) were observed, then maximum likelihood estimation would be easy.
- strategy: repeatedly construct a lower bound on ℓ (E-step) and optimize that lower bound (M-step).
EM algorithm (1)
- digression, Jensen's inequality: for a convex function f, E[f(X)] ≥ f(E[X]).
- for each i, let Qi be a distribution over z: ∑_z Qi(z) = 1, Qi(z) ≥ 0. Then

  ℓ(θ) = ∑_i log p(x(i); θ)
       = ∑_i log ∑_{z(i)} p(x(i), z(i); θ)
       = ∑_i log ∑_{z(i)} Qi(z(i)) · [ p(x(i), z(i); θ) / Qi(z(i)) ]        (1)

  applying Jensen's inequality to the concave function log:

       ≥ ∑_i ∑_{z(i)} Qi(z(i)) log [ p(x(i), z(i); θ) / Qi(z(i)) ]          (2)
(More detail in the appendix.)
EM algorithm (2)
- for any set of distributions Qi, formula (2) gives a lower bound on ℓ(θ)
- how to choose Qi?
- strategy: make the inequality hold with equality at our particular value of θ.
- require:

  p(x(i), z(i); θ) / Qi(z(i)) = c,  where c is a constant that does not depend on z(i)

- choose: Qi(z(i)) ∝ p(x(i), z(i); θ)
- we know ∑_z Qi(z) = 1, so

  Qi(z(i)) = p(x(i), z(i); θ) / ∑_z p(x(i), z; θ) = p(x(i), z(i); θ) / p(x(i); θ) = p(z(i)|x(i); θ)
EM algorithm (3)
- Qi is the posterior distribution of z(i) given x(i) and the parameters θ.

EM algorithm: repeat until convergence
- E-step: for each i, set

  Qi(z(i)) := p(z(i)|x(i); θ)

- M-step: set

  θ := arg max_θ ∑_i ∑_{z(i)} Qi(z(i)) log [ p(x(i), z(i); θ) / Qi(z(i)) ]

The algorithm converges, since ℓ(θ(t)) ≤ ℓ(θ(t+1)); a worked sketch follows below.
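As a worked instance of these two steps, here is a minimal sketch fitting a two-component univariate Gaussian mixture, where z(i) is the unobserved component label of sample x(i); the data and initialization are illustrative assumptions:

```python
# EM for a two-component 1-D Gaussian mixture. The slides state EM
# generically for any p(x, z; theta); this is one canonical instance.
import numpy as np

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 1.0, 300),
                    rng.normal( 3.0, 1.0, 200)])

# initial parameters theta = (pi, mu, var)
pi  = np.array([0.5, 0.5])      # mixing weights
mu  = np.array([-1.0, 1.0])     # component means
var = np.array([1.0, 1.0])      # component variances

def gauss(x, m, v):
    return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

for step in range(50):
    # E-step: Q_i(z) = p(z | x(i); theta), the posterior responsibilities
    joint = pi * gauss(x[:, None], mu, var)        # shape (n, 2)
    Q = joint / joint.sum(axis=1, keepdims=True)

    # M-step: maximize the lower bound; closed form for a Gaussian mixture
    nk  = Q.sum(axis=0)
    pi  = nk / len(x)
    mu  = (Q * x[:, None]).sum(axis=0) / nk
    var = (Q * (x[:, None] - mu) ** 2).sum(axis=0) / nk

print("weights:", pi.round(3), "means:", mu.round(3), "vars:", var.round(3))
```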
EM algorithm (4)

Digression: coordinate ascent algorithm
- goal: max_α W(α1, . . . , αm)
- loop until convergence: for i ∈ 1, . . . , m:

  αi := arg max_α̂i W(α1, . . . , α̂i, . . . , αm)

EM algorithm as coordinate ascent
- define

  J(Q, θ) = ∑_i ∑_{z(i)} Qi(z(i)) log [ p(x(i), z(i); θ) / Qi(z(i)) ]

- ℓ(θ) ≥ J(Q, θ)
- the EM algorithm can be viewed as coordinate ascent on J: the E-step maximizes J w.r.t. Q, the M-step maximizes J w.r.t. θ (see the sketch after this list).
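A minimal sketch of coordinate ascent itself, on a made-up concave function W(α1, α2) with closed-form per-coordinate maximizers; in EM the two "coordinates" are Q and θ:

```python
# Coordinate ascent on W(a1, a2) = -(a1 - 1)^2 - (a2 + 2)^2 - a1*a2,
# a concave function chosen purely for illustration.
def argmax_a1(a2):  # dW/da1 = -2(a1 - 1) - a2 = 0
    return 1.0 - a2 / 2.0

def argmax_a2(a1):  # dW/da2 = -2(a2 + 2) - a1 = 0
    return -2.0 - a1 / 2.0

a1, a2 = 0.0, 0.0
for it in range(20):
    a1 = argmax_a1(a2)   # maximize over the first coordinate
    a2 = argmax_a2(a1)   # then over the second, holding the first fixed
print(a1, a2)            # converges to the stationary point (8/3, -10/3)
```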
Probabilistic Latent Semantic Analysis
Probabilistic Latent Semantic Analysis (1)
- set of documents D = {d1, . . . , dN}
- set of words W = {w1, . . . , wM}
- set of unobserved classes Z = {z1, . . . , zK}
- conditional independence assumption:

  P(di, wj|zk) = P(di|zk)P(wj|zk)                                            (3)

- so,

  P(wj|di) = ∑_{k=1}^{K} P(zk|di)P(wj|zk)                                    (4)

  P(di, wj) = P(di) ∑_{k=1}^{K} P(wj|zk)P(zk|di)

(More detail in the appendix.)
Probabilistic Latent Semantic Analysis (2)
- n(di, wj) – number of occurrences of word wj in document di
- likelihood:

  L = ∏_{i=1}^{N} ∏_{j=1}^{M} [P(di, wj)]^{n(di, wj)}
    = ∏_{i=1}^{N} ∏_{j=1}^{M} [ P(di) ∑_{k=1}^{K} P(wj|zk)P(zk|di) ]^{n(di, wj)}

- log-likelihood ℓ = log L:

  ℓ = ∑_{i=1}^{N} ∑_{j=1}^{M} [ n(di, wj) log P(di) + n(di, wj) log ∑_{k=1}^{K} P(wj|zk)P(zk|di) ]
Probabilistic Latent Semantic Analysis (3)

- maximize ℓ w.r.t. P(wj|zk) and P(zk|di)
- equivalently, maximize (the term involving log P(di) does not depend on these parameters):

  ∑_{i=1}^{N} ∑_{j=1}^{M} n(di, wj) log ∑_{k=1}^{K} P(wj|zk)P(zk|di)
  = ∑_{i=1}^{N} ∑_{j=1}^{M} n(di, wj) log ∑_{k=1}^{K} Qk(zk) · [ P(wj|zk)P(zk|di) / Qk(zk) ]
  ≥ ∑_{i=1}^{N} ∑_{j=1}^{M} n(di, wj) ∑_{k=1}^{K} Qk(zk) log [ P(wj|zk)P(zk|di) / Qk(zk) ]

- choose, as in the E-step of the generic EM algorithm:

  Qk(zk) = P(wj|zk)P(zk|di) / ∑_{l=1}^{K} P(wj|zl)P(zl|di) = P(zk|di, wj)

(More detail in the appendix.)
Probabilistic Latent Semantic Analysis (4)
- maximize, w.r.t. P(wj|zk) and P(zk|di),

  ∑_{i=1}^{N} ∑_{j=1}^{M} n(di, wj) ∑_{k=1}^{K} P(zk|di, wj) log [ P(wj|zk)P(zk|di) / P(zk|di, wj) ]

- equivalently, since P(zk|di, wj) is held fixed during this step, maximize

  ∑_{i=1}^{N} ∑_{j=1}^{M} n(di, wj) ∑_{k=1}^{K} P(zk|di, wj) log[P(wj|zk)P(zk|di)]
Probabilistic Latent Semantic Analysis (5)

EM algorithm
- E-step: update

  P(zk|di, wj) = P(wj|zk)P(zk|di) / ∑_{l=1}^{K} P(wj|zl)P(zl|di)

- M-step: maximize, w.r.t. P(wj|zk) and P(zk|di),

  ∑_{i=1}^{N} ∑_{j=1}^{M} n(di, wj) ∑_{k=1}^{K} P(zk|di, wj) log[P(wj|zk)P(zk|di)]

  subject to

  ∑_{j=1}^{M} P(wj|zk) = 1, k ∈ {1, . . . , K}
  ∑_{k=1}^{K} P(zk|di) = 1, i ∈ {1, . . . , N}
Probabilistic Latent Semantic Analysis (6)
Solution of the maximization problem in the M-step:

  P(wj|zk) = ∑_{i=1}^{N} n(di, wj)P(zk|di, wj) / ∑_{m=1}^{M} ∑_{n=1}^{N} n(dn, wm)P(zk|dn, wm)

  P(zk|di) = ∑_{j=1}^{M} n(di, wj)P(zk|di, wj) / n(di)

where n(di) = ∑_{j=1}^{M} n(di, wj).

(More detail in the appendix.)
Probabilistic Latent Semantic Analysis (7)
All together:
- E-step:

  P(zk|di, wj) = P(wj|zk)P(zk|di) / ∑_{l=1}^{K} P(wj|zl)P(zl|di)

- M-step:

  P(wj|zk) = ∑_{i=1}^{N} n(di, wj)P(zk|di, wj) / ∑_{m=1}^{M} ∑_{n=1}^{N} n(dn, wm)P(zk|dn, wm)

  P(zk|di) = ∑_{j=1}^{M} n(di, wj)P(zk|di, wj) / n(di)
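The two updates translate directly into code. Below is a minimal numpy sketch of the pLSA EM iteration on a toy count matrix; the variable names (p_w_z for P(wj|zk), p_z_d for P(zk|di), p_z_dw for P(zk|di, wj)) and the toy data are assumptions for illustration:

```python
# pLSA via EM, mirroring the E- and M-steps above.
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 8, 12, 2                       # docs, words, latent classes
n = rng.integers(0, 5, size=(N, M))      # toy term counts n(d_i, w_j)

# random normalized initialization of the parameters
p_w_z = rng.random((K, M)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)
p_z_d = rng.random((N, K)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)

for step in range(100):
    # E-step: P(z_k | d_i, w_j) proportional to P(w_j | z_k) P(z_k | d_i)
    joint = p_z_d[:, :, None] * p_w_z[None, :, :]     # shape (N, K, M)
    p_z_dw = joint / joint.sum(axis=1, keepdims=True)

    # M-step:
    # P(w_j | z_k) proportional to sum_i n(d_i, w_j) P(z_k | d_i, w_j)
    nw = (n[:, None, :] * p_z_dw).sum(axis=0)         # (K, M)
    p_w_z = nw / nw.sum(axis=1, keepdims=True)
    # P(z_k | d_i) = sum_j n(d_i, w_j) P(z_k | d_i, w_j) / n(d_i)
    nd = (n[:, None, :] * p_z_dw).sum(axis=2)         # (N, K)
    p_z_d = nd / n.sum(axis=1, keepdims=True)

# objective, up to the constant term sum_ij n(d_i, w_j) log P(d_i)
ll = (n * np.log(p_z_d @ p_w_z + 1e-12)).sum()
print("objective per count:", ll / n.sum())
```

Each EM iteration is non-decreasing in this objective, so the printed value can be used to monitor convergence.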
Reference
- R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification, Wiley-Interscience, 2001.
- T. Hofmann, "Unsupervised learning by probabilistic latent semantic analysis," Machine Learning, vol. 42, 2001, pp. 177-196.
- A. Ng, "Machine Learning (CS229)," course notes, Stanford University.
Appendix
Generative model for word/document co-occurrence
- select a document di with probability (w.p.) P(di)
- pick a latent class zk w.p. P(zk|di)
- generate a word wj w.p. P(wj|zk)
  P(di, wj) = ∑_{k=1}^{K} P(di, wj|zk)P(zk) = ∑_{k=1}^{K} P(wj|zk)P(di|zk)P(zk)
            = ∑_{k=1}^{K} P(wj|zk)P(zk|di)P(di)
            = P(di) ∑_{k=1}^{K} P(wj|zk)P(zk|di)

  P(di, wj) = P(wj|di)P(di)

  ⇒ P(wj|di) = ∑_{k=1}^{K} P(zk|di)P(wj|zk)
  P(wj|di) = ∑_{k=1}^{K} P(zk|di)P(wj|zk)

Since ∑_{k=1}^{K} P(zk|di) = 1, P(wj|di) is a convex combination of the P(wj|zk), i.e., each document is modelled as a mixture of topics.
  P(zk|di, wj) = P(di, wj|zk)P(zk) / P(di, wj)                               (5)
               = P(wj|zk)P(di|zk)P(zk) / P(di, wj)                           (6)
               = P(wj|zk)P(zk|di) / P(wj|di)                                 (7)
               = P(wj|zk)P(zk|di) / ∑_{l=1}^{K} P(wj|zl)P(zl|di)             (8)

From (5) to (6) by the conditional independence assumption (3); from (7) to (8) by (4).
Lagrange multipliers τk, ρi:

  H = ∑_{i=1}^{N} ∑_{j=1}^{M} n(di, wj) ∑_{k=1}^{K} P(zk|di, wj) log[P(wj|zk)P(zk|di)]
    + ∑_{k=1}^{K} τk [ 1 − ∑_{j=1}^{M} P(wj|zk) ] + ∑_{i=1}^{N} ρi [ 1 − ∑_{k=1}^{K} P(zk|di) ]

Setting the partial derivatives to zero:

  ∂H/∂P(wj|zk) = ∑_{i=1}^{N} P(zk|di, wj)n(di, wj) / P(wj|zk) − τk = 0

  ∂H/∂P(zk|di) = ∑_{j=1}^{M} n(di, wj)P(zk|di, wj) / P(zk|di) − ρi = 0
From ∑_{j=1}^{M} P(wj|zk) = 1:

  τk = ∑_{j=1}^{M} ∑_{i=1}^{N} P(zk|di, wj)n(di, wj)

From ∑_{k=1}^{K} P(zk|di, wj) = 1:

  ρi = n(di)

Substituting τk and ρi back into the stationarity conditions yields the M-step updates for P(wj|zk) and P(zk|di), as shown below.
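For completeness, a short LaTeX rendering of that final substitution, consistent with the M-step formulas given earlier:

```latex
% Substituting the multipliers back into the stationarity conditions.
\begin{align*}
P(w_j|z_k) &= \frac{\sum_{i=1}^{N} n(d_i,w_j)\,P(z_k|d_i,w_j)}{\tau_k}
            = \frac{\sum_{i=1}^{N} n(d_i,w_j)\,P(z_k|d_i,w_j)}
                   {\sum_{m=1}^{M}\sum_{n=1}^{N} n(d_n,w_m)\,P(z_k|d_n,w_m)}, \\[4pt]
P(z_k|d_i) &= \frac{\sum_{j=1}^{M} n(d_i,w_j)\,P(z_k|d_i,w_j)}{\rho_i}
            = \frac{\sum_{j=1}^{M} n(d_i,w_j)\,P(z_k|d_i,w_j)}{n(d_i)}.
\end{align*}
```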
Applying Jensen's inequality
- f(x) = log(x) is a concave function, so

  f( E_{z(i)∼Qi}[ p(x(i), z(i); θ) / Qi(z(i)) ] ) ≥ E_{z(i)∼Qi}[ f( p(x(i), z(i); θ) / Qi(z(i)) ) ]