Expectation-Maximization
Nuno Vasconcelos, ECE Department, UCSD

Recall
last class: we will have “Cheetah Day”
what:
• 4 teams, average of 6 people
• each team will write a report on the 4 cheetah problems
• each team will give a presentation on one of the problems
I am waiting to hear on the teams

Plan for today
we have been talking about mixture models
last time we introduced the basics of EM
today we study the application of EM for ML estimation of mixture parameters
next class:
• proof that EM maximizes the likelihood of the incomplete data

mixture model
two types of random variables
• Z – hidden state variable
• X – observed variable
observations are sampled with a two-step procedure
• a state (class) is sampled from the distribution of the hidden variable
  PZ(z) → zi
• an observation is drawn from the class-conditional density for the selected state
  PX|Z(x|zi) → xi
[figure: hidden state zi selects one of the class-conditional densities PX|Z(x|0), PX|Z(x|1), …, PX|Z(x|K)]

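as an illustration of this two-step sampling procedure, here is a minimal sketch (mine, not from the slides) that draws a sample from a 1-D Gaussian mixture; the weights and component parameters are made-up values

```python
import numpy as np

rng = np.random.default_rng(0)

# made-up mixture: weights pi_c and Gaussian class-conditionals P_X|Z(x|c)
weights = np.array([0.5, 0.3, 0.2])   # P_Z(z), must sum to 1
means   = np.array([-2.0, 0.0, 3.0])
stds    = np.array([0.5, 1.0, 0.8])

def sample_mixture(n):
    # step 1: sample the hidden state z_i from P_Z(z)
    z = rng.choice(len(weights), size=n, p=weights)
    # step 2: sample x_i from the class conditional P_X|Z(x|z_i)
    x = rng.normal(means[z], stds[z])
    return x, z

x, z = sample_mixture(1000)   # in the incomplete-data setting we only keep x
```
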
mixture model
the sample consists of pairs (xi, zi)
  D = {(x1,z1), …, (xn,zn)}
but we never get to see the zi
the pdf of the observed data is
  PX(x) = Σc=1..C πc PX|Z(x|c)
where C is the # of mixture components, πc is the component “weight”, and PX|Z(x|c) is the cth “mixture component”

The basics of EM
as usual, we start from an iid sample D = {x1,…,xN}
the goal is to find parameters Ψ* that maximize the likelihood with respect to D
the set
  Dc = {(x1,z1), …, (xN,zN)}
is called the complete data
the set
  D = {x1, …, xN}
is called the incomplete data

Learning with incomplete data (EM)
the basic idea is quite simple
1. start with an initial parameter estimate Ψ(0)
2. E-step: given current parameters Ψ(i) and observations in D, “guess” what the values of the zi are
3. M-step: with the new zi, we have a complete-data problem; solve this problem for the parameters, i.e. compute Ψ(i+1)
4. go to 2.
this can be summarized as
[diagram: E-step “fill in class assignments zi” alternating with M-step “estimate parameters Ψ(i)”]

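to make the alternation concrete, here is a minimal sketch of the loop structure (my own illustration, not from the slides); e_step and m_step are hypothetical placeholders for the problem-specific computations described in the next slides

```python
def em(x, params0, e_step, m_step, n_iters=100):
    """Generic EM skeleton: alternate guessing assignments and re-estimating parameters.

    e_step(x, params) -> soft/hard assignments of each point to each hidden state
    m_step(x, assignments) -> updated parameters
    """
    params = params0                      # 1. initial estimate Psi(0)
    for _ in range(n_iters):
        assignments = e_step(x, params)   # 2. E-step: "fill in" the z_i
        params = m_step(x, assignments)   # 3. M-step: complete-data estimation
    return params                         # 4. repeat (fixed iteration count here)
```
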
Classification-maximization
C-step:
• given estimates Ψ(i) = {Ψ1(i), …, ΨC(i)}
• determine the zi by the BDR
• split the training set according to the labels zi
  D1 = {xi|zi=1}, D2 = {xi|zi=2}, … , DC = {xi|zi=C}
M-step:
• as before, determine the parameters of each class independently

For Gaussian mixtures
C-step:
• [assign each xi to the class of largest posterior, i.e. zi = argmaxj πj G(xi; µj, Σj)]
• split the training set according to the labels zi
  D1 = {xi|zi=1}, D2 = {xi|zi=2}, … , DC = {xi|zi=C}
M-step:
• [for each class j, re-estimate πj, µj, Σj from Dj with the usual Gaussian ML formulas]

K-means
when the covariances are identity and the priors uniform
C-step:
• [assign each xi to the closest mean: zi = argminj ||xi − µj||²]
• split the training set according to the labels zi
  D1 = {xi|zi=1}, D2 = {xi|zi=2}, … , DC = {xi|zi=C}
M-step:
• [recompute each mean: µj = (1/|Dj|) Σ_{xi∈Dj} xi]
this is the K-means algorithm, aka the generalized Lloyd algorithm, aka the LBG algorithm in the vector quantization literature:
• “assign points to the closest mean; recompute the means”

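a minimal sketch of this “assign points to the closest mean; recompute the means” loop (my own illustration; the variable names and fixed iteration count are assumptions)

```python
import numpy as np

def kmeans(x, mu0, n_iters=50):
    """x: (n, d) data, mu0: (C, d) initial means."""
    mu = np.array(mu0, dtype=float)
    for _ in range(n_iters):
        # C-step: assign each point to the closest mean
        d2 = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # (n, C) squared distances
        z = d2.argmin(axis=1)
        # M-step: recompute each mean from the points assigned to it
        for j in range(mu.shape[0]):
            if np.any(z == j):
                mu[j] = x[z == j].mean(axis=0)
    return mu, z
```
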
The Q function
is defined as
  Q(Ψ; Ψ(n)) = E_Z|X[ log PX,Z(Dc; Ψ) | X = D; Ψ(n) ]
and is a bit tricky:
• it is the expected value of the likelihood with respect to the complete data (joint X and Z)
• given that we observed the incomplete data (X = D)
• note that the likelihood is a function of Ψ (the parameters that we want to determine)
• but to compute the expected value we need to use the parameter values from the previous iteration (because we need a distribution for Z|X)
the EM algorithm is, therefore, as follows

Expectation-maximization
E-step:
• given estimates Ψ(n) = {Ψ1(n), …, ΨC(n)}
• compute the expected log-likelihood of the complete data
  Q(Ψ; Ψ(n)) = E_Z|X[ log PX,Z(Dc; Ψ) | X = D; Ψ(n) ]
M-step:
• find the parameter set that maximizes this expected log-likelihood
  Ψ(n+1) = argmax_Ψ Q(Ψ; Ψ(n))
let’s make this more concrete by looking at the mixture case

Expectation-maximization
to derive an EM algorithm you need to do the following
1. write down the likelihood of the COMPLETE data
2. E-step: write down the Q function, i.e. its expectation given the observed data
3. M-step: solve the maximization, deriving a closed-form solution if there is one

EM for mixtures (step 1)
the first thing we always do in an EM problem is
• compute the likelihood of the COMPLETE data
very neat trick to use when z is discrete (classes)
• instead of using z in {1, 2, ..., C}
• use a binary vector of size equal to the # of classes
  z = (z1, …, zC), zj ∈ {0,1}
• what was z = j in the z in {1, 2, ..., C} notation now becomes zj = 1 and zk = 0 for all k ≠ j

EM for mixtures (step 1)
we can now write the complete data likelihood as
  PX,Z(x, z) = Πj [ πj PX|Z(x|j) ]^zj
for example, if z = k in the z in {1, 2, ..., C} notation, this is just πk PX|Z(x|k)
the advantage is that the complete data log-likelihood
  log PX,Z(x, z) = Σj zj log [ πj PX|Z(x|j) ]
becomes LINEAR in the components zj !!!

The assignment vector trick
this is similar to something that we used already
a Bernoulli random variable
  PZ(z) = { p       if z = 1
          { 1 − p   if z = 0
can be written, using z ∈ {0,1}, as
  PZ(z) = p^z (1−p)^(1−z)
or, using z ∈ { [1 0]^T, [0 1]^T } instead of z ∈ {0,1}, as
  PZ(z) = p^z1 (1−p)^z2

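a tiny sketch (mine, not from the slides) of why the one-hot encoding is convenient: with z one-hot, the product over components picks out exactly the selected class probability, and its log is linear in the entries of z

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])     # class probabilities pi_j (made-up values)
z = np.array([0, 1, 0])           # one-hot encoding of "z = 2"

prob     = np.prod(p ** z)        # prod_j pi_j^{z_j}  ->  p[1] = 0.3
log_prob = np.sum(z * np.log(p))  # sum_j z_j log pi_j, linear in z
assert np.isclose(prob, p[1]) and np.isclose(log_prob, np.log(p[1]))
```
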
EM for mixtures (step 1)
for the complete iid dataset Dc = {(x1,z1), …, (xN,zN)}
  PX,Z(Dc; Ψ) = Πi Πj [ πj PX|Z(xi|j; Ψj) ]^zij
and the complete data log-likelihood is
  log PX,Z(Dc; Ψ) = Σi Σj zij log [ πj PX|Z(xi|j; Ψj) ]
anything in it that does not depend on z simply becomes a constant for the expectation that we have to compute in the E-step

Expectation-maximization
to derive an EM algorithm you need to do the following
1. write down the likelihood of the COMPLETE data
2. E-step: write down the Q function, i.e. its expectation given the observed data
3. M-step: solve the maximization, deriving a closed-form solution if there is one
important E-step advice:
• do not compute terms that you do not need
• at the end of the day we only care about the parameters
• terms of Q that do not depend on the parameters are useless, e.g. in
  Q = f(z,Ψ) + log(sin z)
  the expected value of log(sin z) appears to be difficult and is completely unnecessary, since it is dropped in the M-step

EM for mixtures (step 2)
once we have the complete data likelihood
  log PX,Z(Dc; Ψ) = Σi Σj zij log [ πj PX|Z(xi|j; Ψj) ]
which is linear in the zij, to compute the Q function we only need to compute
  E[zij | xi; Ψ(n)]
note that this expectation can only be computed because we use Ψ(n)
note that the Q function will be a function of both Ψ and Ψ(n)

EM for mixtures (step 2)
since zij is binary and only depends on xi
  E[zij | xi; Ψ(n)] = PZ|X(j | xi; Ψ(n))
the E-step reduces to computing the posterior probability of each point under each class!
defining
  hij = PZ|X(j | xi; Ψ(n))
the Q function is
  Q(Ψ; Ψ(n)) = Σi Σj hij log [ πj PX|Z(xi|j; Ψj) ]

Expectation-maximization
to derive an EM algorithm you need to do the following
1. write down the likelihood of the COMPLETE data
2. E-step: write down the Q function, i.e. its expectation given the observed data
3. M-step: solve the maximization, deriving a closed-form solution if there is one

EM vs CM
let’s compare this with the CM algorithm
• the C-step assigns each point to the class of largest posterior
• the E-step assigns the point to all classes, with weight given by the posterior
for this, EM is said to make “soft assignments”
• it does not commit to any of the classes (unless the posterior is one for that class), i.e. it is less greedy
• it no longer partitions space into rigid cells; the boundaries are now soft

EM vs CM
what about the M-steps?
• for CM, the new parameters maximize
  Σi log [ πzi PX|Z(xi|zi; Ψzi) ]
• for EM, the new parameters maximize
  Q(Ψ; Ψ(n)) = Σi Σj hij log [ πj PX|Z(xi|j; Ψj) ]
these are the same if we threshold the hij to make, for each i, maxj hij = 1 and all other hij = 0
the M-steps are the same up to the difference of assignments

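a small sketch (my own) of the thresholding that turns EM’s soft assignments into CM’s hard ones: for each point, the largest responsibility becomes 1 and the rest become 0

```python
import numpy as np

h = np.array([[0.7, 0.2, 0.1],    # made-up soft assignments h_ij (rows sum to 1)
              [0.1, 0.1, 0.8]])

hard = np.zeros_like(h)
hard[np.arange(h.shape[0]), h.argmax(axis=1)] = 1.0   # max_j h_ij -> 1, others -> 0
# using `hard` instead of `h` in the M-step recovers the CM updates
```
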
EM for Gaussian mixtures
in summary:
• CM = EM + hard assignments
• CM is a special case, so it cannot be better
let’s look at the special case of Gaussian mixtures
E-step:
  hij = πj(n) G(xi; µj(n), Σj(n)) / Σc πc(n) G(xi; µc(n), Σc(n))
where G(x; µ, Σ) denotes the Gaussian density

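a sketch of this E-step for a 1-D Gaussian mixture (my own illustration, a simplification of the multivariate case on the slides); it computes the responsibilities hij by normalizing per point

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def e_step(x, pi, mu, var):
    """x: (n,) data; pi, mu, var: (C,) mixture parameters. Returns h: (n, C)."""
    # unnormalized responsibilities pi_j * G(x_i; mu_j, var_j)
    h = pi[None, :] * gaussian_pdf(x[:, None], mu[None, :], var[None, :])
    return h / h.sum(axis=1, keepdims=True)   # posterior of each class for each point
```
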
M-step for Gaussian mixtures
M-step:
  Ψ(n+1) = argmax_Ψ Σi Σj hij log [ πj G(xi; µj, Σj) ]
important note:
• in the M-step, the optimization must be subject to whatever constraint may hold
• in particular, we always have the constraint Σj πj = 1
• as usual, we introduce a Lagrangian

M-step for Gaussian mixtures
Lagrangian
  L = Σi Σj hij log [ πj G(xi; µj, Σj) ] + κ ( Σj πj − 1 )
setting derivatives to zero

M-step for Gaussian mixtures
leads to the update equations
  πj(n+1) = (1/n) Σi hij
  µj(n+1) = Σi hij xi / Σi hij
  Σj(n+1) = Σi hij (xi − µj(n+1))(xi − µj(n+1))^T / Σi hij
comparing to those of CM, they are the same up to hard vs. soft assignments.

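a sketch of these M-step updates for the same 1-D Gaussian mixture as above (my own illustration; x is the data vector and h the responsibility matrix from the E-step); alternating e_step and m_step in the generic loop sketched earlier gives the full EM algorithm

```python
import numpy as np

def m_step(x, h):
    """x: (n,) data; h: (n, C) responsibilities. Returns updated (pi, mu, var)."""
    nk = h.sum(axis=0)                       # effective number of points per class
    pi = nk / x.shape[0]                     # pi_j = (1/n) sum_i h_ij
    mu = (h * x[:, None]).sum(axis=0) / nk   # responsibility-weighted means
    var = (h * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / nk   # weighted variances
    return pi, mu, var
```
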
Expectation-maximization
note that the procedure is the same for all mixtures
1. write down the likelihood of the COMPLETE data
2. E-step: write down the Q function, i.e. its expectation given the observed data
3. M-step: solve the maximization, deriving a closed-form solution if there is one

Expectation-maximization
E.g. for a mixture of exponential distributions
  PX(x) = Σc=1..C πc λc e^(−λc x)
1. E-step: write down the Q function, i.e. its expectation given the observed data
  hij = PZ|X(j | xi) = πj λj e^(−λj xi) / Σc=1..C πc λc e^(−λc xi)
2. M-step: solve the maximization, deriving a closed-form solution if there is one

M-step for exponential mixtures
M-step:
  Ψ(n+1) = argmax_Ψ Σij hij log [ πj λj e^(−λj xi) ]
         = argmin_Ψ Σij hij [ λj xi − log λj − log πj ]
the Lagrangian is
  L = Σij hij ( λj xi − log λj − log πj ) + κ ( Σj πj − 1 )

M-step for exponential mixtures
  L = Σij hij ( λj xi − log λj − log πj ) + κ ( Σj πj − 1 )
and has minimum at
  ∂L/∂λk = Σi hik ( xi − 1/λk ) = 0   ⇒   λk = Σi hik / Σi hik xi
  ∂L/∂πk = −(1/πk) Σi hik + κ = 0
  ∂L/∂κ = Σj πj − 1 = 0               ⇒   πk = Σi hik / Σij hij

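a sketch of the resulting EM iteration for an exponential mixture (my own illustration, following the E-step and M-step formulas above; the function and parameter names are assumptions)

```python
import numpy as np

def em_exponential_mixture(x, pi, lam, n_iters=100):
    """x: (n,) nonnegative data; pi, lam: (C,) initial weights and rates."""
    for _ in range(n_iters):
        # E-step: h_ij = pi_j lam_j exp(-lam_j x_i) / sum_c pi_c lam_c exp(-lam_c x_i)
        h = pi[None, :] * lam[None, :] * np.exp(-lam[None, :] * x[:, None])
        h /= h.sum(axis=1, keepdims=True)
        # M-step: pi_k = sum_i h_ik / n,  lam_k = sum_i h_ik / sum_i h_ik x_i
        nk = h.sum(axis=0)
        pi = nk / x.shape[0]
        lam = nk / (h * x[:, None]).sum(axis=0)
    return pi, lam
```
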
EM algorithm
note, however, that EM is much more general than this recipe for mixtures
it can be applied to any problem where we have observed and hidden random variables
here is a very simple example
• X observed Gaussian variable, X ~ N(µ,1)
• Z hidden exponential variable
• it is known that Z is independent of X
• sample D = {x1, …, xn} of iid observations from X
note that the assumption of independence does not really make sense (why?)
how does this affect EM?

Example
toy model: X iid, Z iid, Xi ~ N(µ,1), Zi ~ λe^(−λz), X independent of Z

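the slide’s derivation is not legible in this transcript; here is a hedged reconstruction of the EM updates for this toy model under the stated independence assumption: because Z is independent of X, the posterior of zi given xi is just its prior, so the two parameters decouple

```latex
% complete-data log-likelihood (X_i ~ N(mu,1), Z_i ~ Exp(lambda), independent)
\log P_{X,Z}(D_c) = \sum_i \Big[ \log G(x_i;\mu,1) + \log\lambda - \lambda z_i \Big]

% E-step: since Z \perp X, E[z_i \mid x_i; \lambda^{(n)}] = E[z_i; \lambda^{(n)}] = 1/\lambda^{(n)}, so
Q(\mu,\lambda;\mu^{(n)},\lambda^{(n)}) = \sum_i \log G(x_i;\mu,1) + n\log\lambda - \frac{n\,\lambda}{\lambda^{(n)}}

% M-step: maximizing over mu and lambda gives
\mu^{(n+1)} = \frac{1}{n}\sum_i x_i \qquad \lambda^{(n+1)} = \lambda^{(n)}
```
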
Example
this makes sense:
• since the hidden variables Z are independent of the observed X
• the ML estimate of µ is always the same: the sample mean, with no dependence on the zi
• the ML estimate of λ is always the initial estimate λ(0): since the observations are independent of the zi, we have no information on what λ should be other than the initial guess
note that it is the model that does not make sense, not the EM solution
