Machine Learning, Saarland University, SS 2007
Holger Bast
Max-Planck-Institut für Informatik, Saarbrücken, Germany
Lecture 9, Friday June 15th, 2007 (EM algorithm + convergence)
Overview of this Lecture
Quick recap of last lecture
– maximum likelihood principle / our 3 examples
The EM algorithm
– writing down the formula (very easy)
– understanding the formula (very hard)
– Example: mixture of two normal distributions
Convergence
– to local maximum (under mild assumptions)
Exercise Sheet
– explain / discuss / make a start
Maximum Likelihood: Example 1
Sequence of coin flips HHTTTTTTHTTTTTHTTHHT
– say 5 times H and 15 times T
– which Prob(H) and Prob(T) are most likely?
Formalization
– Data X = (x1, … , xn), xi in {H,T}
– Parameters Θ = (pH, pT), pH + pT = 1
– Likelihood L(X,Θ) = pH^h · pT^t, where h = #{i : xi = H}, t = #{i : xi = T}
– Log Likelihood Q(X,Θ) = log L(X,Θ) = h · log pH + t · log pT
– find Θ* = argmaxΘ L(X,Θ) = argmaxΘ Q(X,Θ)
Solution
– here pH = h / (h + t) and pT = t / (h + t)
– looks like Prob(H) = ¼, Prob(T) = ¾
– simple calculus [blackboard]
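For illustration (not part of the original slides), a minimal Python sketch of this closed-form solution:

```python
# Coin-flip MLE: with h heads and t tails, maximizing
# Q(X, Theta) = h*log(p_H) + t*log(p_T) subject to p_H + p_T = 1
# gives the closed form p_H = h / (h + t).
h, t = 5, 15                  # counts from the slide
p_H = h / (h + t)
p_T = t / (h + t)
print(p_H, p_T)               # 0.25 0.75, i.e., Prob(H) = 1/4, Prob(T) = 3/4
```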
Maximum Likelihood: Example 2
Sequence of reals drawn from N(μ, σ)
– which μ and σ are most likely?
Formalization
– Data X = (x1, … , xn), xi real number
– Parameters Θ = (μ, σ)
– Likelihood L(X,Θ) = Πi 1/(√(2π)·σ) · exp( -(xi - μ)^2 / (2σ^2) )
– Log Likelihood Q(X,Θ) = -n/2 · log(2π) - n · log σ - Σi (xi - μ)^2 / (2σ^2)
– find Θ* = argmaxΘ L(X,Θ) = argmaxΘ Q(X,Θ)
Solution
– here μ = 1/n · Σi xi and σ^2 = 1/n · Σi (xi - μ)^2
– simple calculus [blackboard]
(N(μ, σ) denotes the normal distribution with mean μ and standard deviation σ)
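For illustration (not from the slides; the data below is made up), the same closed forms in Python:

```python
# Gaussian MLE: mu is the sample mean, sigma^2 the (1/n) sample variance --
# the closed forms obtained by setting the derivatives of Q(X, Theta) to zero.
import math

x = [2.1, 1.7, 3.0, 2.4, 1.8]   # hypothetical data for illustration
n = len(x)
mu = sum(x) / n
sigma2 = sum((xi - mu) ** 2 for xi in x) / n
print(mu, math.sqrt(sigma2))
```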
Maximum Likelihood: Example 3
Sequence of real numbers
– each drawn from either N1(μ1, σ1) or N2(μ2, σ2)
– from N1 with prob p1, and from N2 with prob p2
– which μ1, σ1, μ2, σ2, p1, p2 are most likely?
Formalization
– Data X = (x1, … , xn), xi real number
– Hidden data Z = (z1, … , zn), zi = j iff xi drawn from Nj
– Parameters Θ = (μ1, σ1, μ2, σ2, p1, p2), p1 + p2 = 1
– Likelihood L(X,Θ) = [blackboard]
– Log Likelihood Q(X,Θ) = [blackboard]
– find Θ* = argmaxΘ L(X,Θ) = argmaxΘ Q(X,Θ)
standard calculus fails (the log likelihood is a sum of logs of sums)
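For reference (a sketch of the standard form; the lecture itself develops this on the blackboard), the mixture likelihood is a product over data points of a two-term sum, so its logarithm is exactly such a sum of logs of sums:

```latex
L(X,\Theta) = \prod_{i=1}^{n}\left(
  p_1\,\frac{1}{\sqrt{2\pi}\,\sigma_1}\,e^{-\frac{(x_i-\mu_1)^2}{2\sigma_1^2}}
  + p_2\,\frac{1}{\sqrt{2\pi}\,\sigma_2}\,e^{-\frac{(x_i-\mu_2)^2}{2\sigma_2^2}}
\right),
\quad
Q(X,\Theta) = \sum_{i=1}^{n}\log\left(
  p_1\,\frac{1}{\sqrt{2\pi}\,\sigma_1}\,e^{-\frac{(x_i-\mu_1)^2}{2\sigma_1^2}}
  + p_2\,\frac{1}{\sqrt{2\pi}\,\sigma_2}\,e^{-\frac{(x_i-\mu_2)^2}{2\sigma_2^2}}
\right)
```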
The EM algorithm — Formula
Given
– Data X = (x1, … ,xn)
– Hidden data Z = (z1, … ,zn)
– Parameters Θ + an initial guess θ1
Expectation-Step:
– Pr(Z|X;θt) = Pr(X|Z;θt) ∙ Pr(Z|θt) / ΣZ’ Pr(X|Z’;θt) ∙ Pr(Z’|θt)
Maximization-Step:
– θt+1 = argmaxΘ EZ[ log Pr(X,Z|Θ) | X;θt ]
What the hell does this mean? It is crucial to understand each of these probabilities / expected values.
What is fixed? What is random, and how? What do the conditionals mean?
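As a structural sketch (not from the slides; e_step and m_step are placeholders whose concrete two-Gaussian versions are sketched with the later slides), the alternation in Python:

```python
# Skeleton of the EM loop: alternate the E-Step (compute Pr(Z | X; theta_t))
# and the M-Step (re-estimate theta as the argmax of the expected
# complete-data log likelihood). e_step and m_step are placeholders here;
# concrete versions for the mixture of two Gaussians follow below.
def em(x, theta, iterations=100):
    for _ in range(iterations):
        posteriors = e_step(x, theta)   # E-Step: Pr(Zi = 1 | xi; theta_t)
        theta = m_step(x, posteriors)   # M-Step: new parameter guess theta_{t+1}
    return theta
```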
Three attempts to maximize the likelihood
(consider the mixture of two Gaussians as an example)
1. The direct way …
– given x1, … ,xn
– find parameters μ1, σ1, μ2, σ2, p1, p2
– such that log L(x1, … ,xn) is maximized
– optimization too hard (sum of logs of sums)
2. If only we knew …
– given data x1, … ,xn and hidden data z1, … ,zn
– find parameters μ1, σ1, μ2, σ2, p1, p2
– such that log L(x1, … ,xn, z1, … ,zn) is maximized
– would be feasible [show on blackboard], but we don't know the z1, … ,zn
3. The EM way … (this is the M-Step of the EM algorithm; the E-Step provides the Z1, … ,Zn)
– given x1, … ,xn and random variables Z1, … ,Zn
– find parameters μ1, σ1, μ2, σ2, p1, p2
– such that E log L(x1, … ,xn, Z1, … ,Zn) is maximized
E-Step — Formula
(consider the mixture of two Gaussians as an example)
We have (at the beginning of each iteration)
– the data x1, … ,xn
– the fully specified distributions N1(μ1,σ1) and N2(μ2,σ2)
– the probability of choosing between N1 and N2
= random variable Z with p1 = Pr(Z=1) and p2 = Pr(Z=2)
We want
– for each data point xi a probability of choosing N1 or N2
= random variables Z1, … ,Zn
Solution (the actual E-Step)
– take Zi as the conditional Z | xi
– Pr(Zi=1) = Pr(Z=1 | xi) = Pr(xi | Z=1) ∙ Pr(Z=1) / Pr(xi)   (Bayes' law)
– with Pr(xi) = Σz Pr(xi | Z=z) ∙ Pr(Z=z)
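A minimal Python sketch of this E-Step for the two-Gaussian mixture (variable names are mine, not the lecture's; theta stands for the tuple (μ1, σ1, μ2, σ2, p1, p2)):

```python
import math

def normal_pdf(x, mu, sigma):
    # density of N(mu, sigma) at x
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def e_step(x, theta):
    mu1, sigma1, mu2, sigma2, p1, p2 = theta
    posteriors = []
    for xi in x:
        a = p1 * normal_pdf(xi, mu1, sigma1)   # Pr(xi | Z=1) * Pr(Z=1)
        b = p2 * normal_pdf(xi, mu2, sigma2)   # Pr(xi | Z=2) * Pr(Z=2)
        posteriors.append(a / (a + b))         # Bayes' law: Pr(Zi=1 | xi)
    return posteriors
```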
E-Step — analogy to a simple example
Setup: draw a ball from one of two urns; Urn 1 is picked with prob 1/3, Urn 2 with prob 2/3.
Pr(Urn 1) = 1/3, Pr(Urn 2) = 2/3
Pr(Blue | Urn 1) = 1/2, Pr(Blue | Urn 2) = 1/4
Pr(Blue) = Pr(Blue | Urn 1) ∙ Pr(Urn 1) + Pr(Blue | Urn 2) ∙ Pr(Urn 2)
= 1/2 ∙ 1/3 + 1/4 ∙ 2/3 = 1/3
Pr(Urn 1 | Blue) = Pr(Blue | Urn 1) ∙ Pr(Urn 1) / Pr(Blue)
= (1/2 ∙ 1/3) / (1/3) = 1/2
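The same arithmetic as a quick Python check (illustration only):

```python
# Bayes' law on the urn example from the slide.
p_urn1, p_urn2 = 1/3, 2/3
p_blue_1, p_blue_2 = 1/2, 1/4                    # Pr(Blue | Urn 1), Pr(Blue | Urn 2)
p_blue = p_blue_1 * p_urn1 + p_blue_2 * p_urn2   # total probability: 1/3
p_urn1_given_blue = p_blue_1 * p_urn1 / p_blue   # posterior: 1/2
print(p_blue, p_urn1_given_blue)
```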
M-Step — Formula
[Blackboard]
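The slide leaves the derivation to the blackboard; as a hedged sketch (not necessarily the lecture's derivation), the standard M-Step updates for the two-Gaussian mixture re-use the Example-2 formulas with each data point weighted by its E-Step posterior:

```python
# Standard M-Step for the two-Gaussian mixture (a sketch): weighted means,
# variances, and mixing weights, with weights g_i = Pr(Zi = 1 | xi)
# taken from the E-Step above.
def m_step(x, posteriors):
    n = len(x)
    g1 = posteriors                  # Pr(Zi = 1 | xi)
    g2 = [1 - g for g in g1]         # Pr(Zi = 2 | xi)
    n1, n2 = sum(g1), sum(g2)
    mu1 = sum(g * xi for g, xi in zip(g1, x)) / n1
    mu2 = sum(g * xi for g, xi in zip(g2, x)) / n2
    sigma1 = (sum(g * (xi - mu1) ** 2 for g, xi in zip(g1, x)) / n1) ** 0.5
    sigma2 = (sum(g * (xi - mu2) ** 2 for g, xi in zip(g2, x)) / n2) ** 0.5
    p1, p2 = n1 / n, n2 / n
    return (mu1, sigma1, mu2, sigma2, p1, p2)
```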
Convergence of EM Algorithm
Two (log) likelihoods
– true: log L(x1,…,xn)
– EM: E log L(x1,…,xn, Z1,…,Zn)
Lemma 1 (lower bound) [blackboard]
– E log L(x1,…,xn, Z1,…,Zn) ≤ log L(x1,…,xn)
Lemma 2 (touch) [blackboard]
– E log L(x1,…,xn, Z1,…,Zn)(θt) = log L(x1,…,xn)(θt), both viewed as functions of Θ and evaluated at θt
Convergence
– if the expected likelihood function is well-behaved, e.g., if the first derivative at local maxima exists and the second derivative is < 0
– then Lemmas 1 and 2 imply convergence
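A one-line sketch of why Lemma 1 holds (my notation, not the blackboard proof; the expectation is over Z ~ Pr(· | X; θt), and Θ is any parameter value): since log L(X,Z) = log L(X) + log Pr(Z | X; Θ) and log Pr(Z | X; Θ) ≤ 0,

```latex
\mathbb{E}\big[\log L(X,Z)\big]
  = \log L(X) + \mathbb{E}\big[\log \Pr(Z \mid X;\Theta)\big]
  \;\le\; \log L(X).
```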
Attempt Two: Calculations
If only we knew …
– given data x1, … ,xn and hidden data z1, … ,zn
– find parameters μ1, σ1, μ2, σ2, p1, p2
– such that log L(x1, … ,xn, z1, … ,zn) is maximized
– let I1 = {i : zi = 1} and I2 = {i : zi = 2}
L(x1, … ,xn, z1, … ,zn) = Πi in I1 1/(√(2π)·σ1) · exp( -(xi - μ1)^2 / (2σ1^2) ) · Πi in I2 1/(√(2π)·σ2) · exp( -(xi - μ2)^2 / (2σ2^2) )
The two products can be maximized separately
– here μ1 = Σi in I1 xi / |I1| and σ1^2 = Σi in I1 (xi – μ1)^2 / |I1|
– here μ2 = Σi in I2 xi / |I2| and σ2^2 = Σi in I2 (xi – μ2)^2 / |I2|
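A minimal Python sketch of these per-subset formulas (data and labels below are made up for illustration):

```python
# Attempt Two: with the hidden labels z known, the likelihood factors over
# I1 and I2, so each Gaussian is fit with the Example-2 formulas on its subset.
def fit_known_labels(x, z):
    params = []
    for j in (1, 2):
        xs = [xi for xi, zi in zip(x, z) if zi == j]   # subset I_j
        mu = sum(xs) / len(xs)
        sigma2 = sum((xi - mu) ** 2 for xi in xs) / len(xs)
        params.append((mu, sigma2))
    return params  # [(mu1, sigma1^2), (mu2, sigma2^2)]

x = [1.0, 1.2, 0.8, 5.1, 4.9]   # hypothetical data
z = [1, 1, 1, 2, 2]             # hypothetical labels
print(fit_known_labels(x, z))
```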