Machine Learning, Saarland University, SS 2007
Holger Bast
Max-Planck-Institut für Informatik, Saarbrücken, Germany
Lecture 9, Friday June 15th, 2007 (EM algorithm + convergence)
Overview of this Lecture
Quick recap of last lecture
– maximum likelihood principle / our 3 examples
The EM algorithm
– writing down the formula (very easy)
– understanding the formula (very hard)
– Example: mixture of two normal distributions
Convergence
– to local maximum (under mild assumptions)
Exercise Sheet
– explain / discuss / make a start
Maximum Likelihood: Example 1
Sequence of coin flips HHTTTTTTHTTTTTHTTHHT
– say 5 times H and 15 times T
– which Prob(H) and Prob(T) are most likely?
Formalization
– Data X = (x1, … , xn), xi in {H,T}
– Parameters Θ = (pH, pT), pH + pT = 1
– Likelihood L(X,Θ) = pH^h · pT^t, where h = #{i : xi = H}, t = #{i : xi = T}
– Log Likelihood Q(X,Θ) = log L(X,Θ) = h · log pH + t · log pT
– find Θ* = argmaxΘ L(X,Θ) = argmaxΘ Q(X,Θ)
Solution
– here pH = h / (h + t) and pT = t / (h + t)
– looks like Prob(H) = ¼, Prob(T) = ¾
– simple calculus [blackboard]
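For illustration (not part of the original slides), a minimal Python sketch of this closed-form solution:

```python
# Coin-flip MLE: with h heads and t tails, maximizing
# Q(X, Theta) = h*log(p_H) + t*log(p_T) subject to p_H + p_T = 1
# gives the closed form p_H = h / (h + t).
h, t = 5, 15                  # counts from the slide
p_H = h / (h + t)
p_T = t / (h + t)
print(p_H, p_T)               # 0.25 0.75, i.e., Prob(H) = 1/4, Prob(T) = 3/4
```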
Maximum Likelihood: Example 2
Sequence of reals drawn from N(μ, σ)
– which μ and σ are most likely?
Formalization
– Data X = (x1, … , xn), xi real number
– Parameters Θ = (μ, σ)
– Likelihood L(X,Θ) = Πi 1/(√(2π)·σ) · exp( -(xi - μ)^2 / (2σ^2) )
– Log Likelihood Q(X,Θ) = -n/2 · log(2π) - n · log σ - Σi (xi - μ)^2 / (2σ^2)
– find Θ* = argmaxΘ L(X,Θ) = argmaxΘ Q(X,Θ)
Solution
– here μ = 1/n · Σi xi and σ^2 = 1/n · Σi (xi - μ)^2
– simple calculus [blackboard]
(N(μ, σ) denotes the normal distribution with mean μ and standard deviation σ)
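For illustration (not from the slides; the data below is made up), the same closed forms in Python:

```python
# Gaussian MLE: mu is the sample mean, sigma^2 the (1/n) sample variance --
# the closed forms obtained by setting the derivatives of Q(X, Theta) to zero.
import math

x = [2.1, 1.7, 3.0, 2.4, 1.8]   # hypothetical data for illustration
n = len(x)
mu = sum(x) / n
sigma2 = sum((xi - mu) ** 2 for xi in x) / n
print(mu, math.sqrt(sigma2))
```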
Maximum Likelihood: Example 3
Sequence of real numbers
– each drawn from either N1(μ1, σ1) or N2(μ2, σ2)
– from N1 with prob p1, and from N2 with prob p2
– which μ1, σ1, μ2, σ2, p1, p2 are most likely?
Formalization
– Data X = (x1, … , xn), xi real number
– Hidden data Z = (z1, … , zn), zi = j iff xi drawn from Nj
– Parameters Θ = (μ1, σ1, μ2, σ2, p1, p2), p1 + p2 = 1
– Likelihood L(X,Θ) = [blackboard]
– Log Likelihood Q(X,Θ) = [blackboard]
– find Θ* = argmaxΘ L(X,Θ) = argmaxΘ Q(X,Θ)
standard calculus fails (the log likelihood is a sum of logs of sums)
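For reference (a sketch of the standard form; the lecture itself develops this on the blackboard), the mixture likelihood is a product over data points of a two-term sum, so its logarithm is exactly such a sum of logs of sums:

```latex
L(X,\Theta) = \prod_{i=1}^{n}\left(
  p_1\,\frac{1}{\sqrt{2\pi}\,\sigma_1}\,e^{-\frac{(x_i-\mu_1)^2}{2\sigma_1^2}}
  + p_2\,\frac{1}{\sqrt{2\pi}\,\sigma_2}\,e^{-\frac{(x_i-\mu_2)^2}{2\sigma_2^2}}
\right),
\quad
Q(X,\Theta) = \sum_{i=1}^{n}\log\left(
  p_1\,\frac{1}{\sqrt{2\pi}\,\sigma_1}\,e^{-\frac{(x_i-\mu_1)^2}{2\sigma_1^2}}
  + p_2\,\frac{1}{\sqrt{2\pi}\,\sigma_2}\,e^{-\frac{(x_i-\mu_2)^2}{2\sigma_2^2}}
\right)
```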
The EM algorithm — Formula
Given
– Data X = (x1, … ,xn)
– Hidden data Z = (z1, … ,zn)
– Parameters Θ + an initial guess θ1
Expectation-Step:
– Pr(Z|X;θt) = Pr(X|Z;θt) ∙ Pr(Z|θt) / ΣZ’ Pr(X|Z’;θt) ∙ Pr(Z’|θt)
Maximization-Step:
– θt+1 = argmaxΘ EZ[ log Pr(X,Z|Θ) | X;θt ]
What the hell does this mean? It is crucial to understand each of these probabilities / expected values.
What is fixed? What is random, and how? What do the conditionals mean?
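As a structural sketch (not from the slides; e_step and m_step are placeholders whose concrete two-Gaussian versions are sketched with the later slides), the alternation in Python:

```python
# Skeleton of the EM loop: alternate the E-Step (compute Pr(Z | X; theta_t))
# and the M-Step (re-estimate theta as the argmax of the expected
# complete-data log likelihood). e_step and m_step are placeholders here;
# concrete versions for the mixture of two Gaussians follow below.
def em(x, theta, iterations=100):
    for _ in range(iterations):
        posteriors = e_step(x, theta)   # E-Step: Pr(Zi = 1 | xi; theta_t)
        theta = m_step(x, posteriors)   # M-Step: new parameter guess theta_{t+1}
    return theta
```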
Three attempts to maximize the likelihood
(consider the mixture of two Gaussians as an example)
1. The direct way …
– given x1, … ,xn
– find parameters μ1, σ1, μ2, σ2, p1, p2
– such that log L(x1, … ,xn) is maximized
– optimization too hard (sum of logs of sums)
2. If only we knew …
– given data x1, … ,xn and hidden data z1, … ,zn
– find parameters μ1, σ1, μ2, σ2, p1, p2
– such that log L(x1, … ,xn, z1, … ,zn) is maximized
– would be feasible [show on blackboard], but we don't know the z1, … ,zn
3. The EM way … (this is the M-Step of the EM algorithm; the E-Step provides the Z1, … ,Zn)
– given x1, … ,xn and random variables Z1, … ,Zn
– find parameters μ1, σ1, μ2, σ2, p1, p2
– such that E log L(x1, … ,xn, Z1, … ,Zn) is maximized
E-Step — Formula
(consider the mixture of two Gaussians as an example)
We have (at the beginning of each iteration)
– the data x1, … ,xn
– the fully specified distributions N1(μ1,σ1) and N2(μ2,σ2)
– the probability of choosing between N1 and N2
= random variable Z with p1 = Pr(Z=1) and p2 = Pr(Z=2)
We want
– for each data point xi a probability of choosing N1 or N2
= random variables Z1, … ,Zn
Solution (the actual E-Step)
– take Zi as the conditional Z | xi
– Pr(Zi=1) = Pr(Z=1 | xi) = Pr(xi | Z=1) ∙ Pr(Z=1) / Pr(xi)   (Bayes' law)
– with Pr(xi) = Σz Pr(xi | Z=z) ∙ Pr(Z=z)
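A minimal Python sketch of this E-Step for the two-Gaussian mixture (variable names are mine, not the lecture's; theta stands for the tuple (μ1, σ1, μ2, σ2, p1, p2)):

```python
import math

def normal_pdf(x, mu, sigma):
    # density of N(mu, sigma) at x
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def e_step(x, theta):
    mu1, sigma1, mu2, sigma2, p1, p2 = theta
    posteriors = []
    for xi in x:
        a = p1 * normal_pdf(xi, mu1, sigma1)   # Pr(xi | Z=1) * Pr(Z=1)
        b = p2 * normal_pdf(xi, mu2, sigma2)   # Pr(xi | Z=2) * Pr(Z=2)
        posteriors.append(a / (a + b))         # Bayes' law: Pr(Zi=1 | xi)
    return posteriors
```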
E-Step — analogy to a simple example
Setup: draw a ball from one of two urns; Urn 1 is picked with prob 1/3, Urn 2 with prob 2/3.
Pr(Urn 1) = 1/3, Pr(Urn 2) = 2/3
Pr(Blue | Urn 1) = 1/2, Pr(Blue | Urn 2) = 1/4
Pr(Blue) = Pr(Blue | Urn 1) ∙ Pr(Urn 1) + Pr(Blue | Urn 2) ∙ Pr(Urn 2)
= 1/2 ∙ 1/3 + 1/4 ∙ 2/3 = 1/3
Pr(Urn 1 | Blue) = Pr(Blue | Urn 1) ∙ Pr(Urn 1) / Pr(Blue)
= (1/2 ∙ 1/3) / (1/3) = 1/2
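The same arithmetic as a quick Python check (illustration only):

```python
# Bayes' law on the urn example from the slide.
p_urn1, p_urn2 = 1/3, 2/3
p_blue_1, p_blue_2 = 1/2, 1/4                    # Pr(Blue | Urn 1), Pr(Blue | Urn 2)
p_blue = p_blue_1 * p_urn1 + p_blue_2 * p_urn2   # total probability: 1/3
p_urn1_given_blue = p_blue_1 * p_urn1 / p_blue   # posterior: 1/2
print(p_blue, p_urn1_given_blue)
```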
M-Step — Formula
[Blackboard]
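The slide leaves the derivation to the blackboard; as a hedged sketch (not necessarily the lecture's derivation), the standard M-Step updates for the two-Gaussian mixture re-use the Example-2 formulas with each data point weighted by its E-Step posterior:

```python
# Standard M-Step for the two-Gaussian mixture (a sketch): weighted means,
# variances, and mixing weights, with weights g_i = Pr(Zi = 1 | xi)
# taken from the E-Step above.
def m_step(x, posteriors):
    n = len(x)
    g1 = posteriors                  # Pr(Zi = 1 | xi)
    g2 = [1 - g for g in g1]         # Pr(Zi = 2 | xi)
    n1, n2 = sum(g1), sum(g2)
    mu1 = sum(g * xi for g, xi in zip(g1, x)) / n1
    mu2 = sum(g * xi for g, xi in zip(g2, x)) / n2
    sigma1 = (sum(g * (xi - mu1) ** 2 for g, xi in zip(g1, x)) / n1) ** 0.5
    sigma2 = (sum(g * (xi - mu2) ** 2 for g, xi in zip(g2, x)) / n2) ** 0.5
    p1, p2 = n1 / n, n2 / n
    return (mu1, sigma1, mu2, sigma2, p1, p2)
```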
Convergence of EM Algorithm
Two (log) likelihoods
– true: log L(x1,…,xn)
– EM: E log L(x1,…,xn, Z1,…,Zn)
Lemma 1 (lower bound) [blackboard]
– E log L(x1,…,xn, Z1,…,Zn) ≤ log L(x1,…,xn)
Lemma 2 (touch) [blackboard]
– E log L(x1,…,xn, Z1,…,Zn)(θt) = log L(x1,…,xn)(θt), both viewed as functions of Θ and evaluated at θt
Convergence
– if the expected likelihood function is well-behaved, e.g., if the first derivative at local maxima exists and the second derivative is < 0
– then Lemmas 1 and 2 imply convergence
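A one-line sketch of why Lemma 1 holds (my notation, not the blackboard proof; the expectation is over Z ~ Pr(· | X; θt), and Θ is any parameter value): since log L(X,Z) = log L(X) + log Pr(Z | X; Θ) and log Pr(Z | X; Θ) ≤ 0,

```latex
\mathbb{E}\big[\log L(X,Z)\big]
  = \log L(X) + \mathbb{E}\big[\log \Pr(Z \mid X;\Theta)\big]
  \;\le\; \log L(X).
```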
Attempt Two: Calculations
If only we knew …
– given data x1, … ,xn and hidden data z1, … ,zn
– find parameters μ1, σ1, μ2, σ2, p1, p2
– such that log L(x1, … ,xn, z1, … ,zn) is maximized
– let I1 = {i : zi = 1} and I2 = {i : zi = 2}
L(x1, … ,xn, z1, … ,zn) = Πi in I1 1/(√(2π)·σ1) · exp( -(xi - μ1)^2 / (2σ1^2) ) · Πi in I2 1/(√(2π)·σ2) · exp( -(xi - μ2)^2 / (2σ2^2) )
The two products can be maximized separately
– here μ1 = Σi in I1 xi / |I1| and σ1^2 = Σi in I1 (xi – μ1)^2 / |I1|
– here μ2 = Σi in I2 xi / |I2| and σ2^2 = Σi in I2 (xi – μ2)^2 / |I2|
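A minimal Python sketch of these per-subset formulas (data and labels below are made up for illustration):

```python
# Attempt Two: with the hidden labels z known, the likelihood factors over
# I1 and I2, so each Gaussian is fit with the Example-2 formulas on its subset.
def fit_known_labels(x, z):
    params = []
    for j in (1, 2):
        xs = [xi for xi, zi in zip(x, z) if zi == j]   # subset I_j
        mu = sum(xs) / len(xs)
        sigma2 = sum((xi - mu) ** 2 for xi in xs) / len(xs)
        params.append((mu, sigma2))
    return params  # [(mu1, sigma1^2), (mu2, sigma2^2)]

x = [1.0, 1.2, 0.8, 5.1, 4.9]   # hypothetical data
z = [1, 1, 1, 2, 2]             # hypothetical labels
print(fit_known_labels(x, z))
```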