CSE 527, Autumn 2009
Lecture 4: MLE, EM
Walter L. Ruzzo, University of Washington

FYI: Hemoglobin History

[In-class browser demo.]

Outline

MLE: Maximum Likelihood Estimators

EM: the Expectation Maximization Algorithm

Next: Motif description & discovery


Maximum Likelihood Estimators

Learning From Data: MLE


Probability Basics, I

Two running examples, one discrete and one continuous:

Ex. (discrete)
Sample space: {1, 2, ..., 6}
Distribution: $p_1, \ldots, p_6 \ge 0$; $\sum_{1 \le i \le 6} p_i = 1$
e.g., $p_1 = \cdots = p_6 = 1/6$

Ex. (continuous)
Sample space: $\mathbb{R}$
Distribution: $f(x) \ge 0$; $\int_{\mathbb{R}} f(x)\,dx = 1$ (f is a pdf, not a probability)
e.g., $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2/(2\sigma^2)}$
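Since "pdf, not probability" trips people up, here is a quick numeric illustration (mine, not the slides'): a density value can exceed 1, yet the density still integrates to 1.

```python
import math

def normal_pdf(x, mu=0.0, var=1.0):
    """Density of N(mu, var) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# For small variance, the density at the mean exceeds 1 (so not a probability)...
print(normal_pdf(0.0, var=0.01))  # ~3.99
# ...but it still integrates to 1 (crude Riemann sum over [-5, 5]).
dx = 0.001
print(sum(normal_pdf(-5 + i * dx, var=0.01) * dx for i in range(10_000)))  # ~1.0
```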


Probability Basics, II


Parameter Estimation

Assuming sample x1, x2, ..., xn is from a parametric distribution f(x|θ), estimate θ.

E.g.: Given the sample HHTTTTTHTHTTTHH of (possibly biased) coin flips, estimate θ = probability of heads.
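Previewing the punchline of a later slide (a minimal Python sketch of mine, not from the lecture): for i.i.d. coin flips, the MLE is simply the observed fraction of heads.

```python
# MLE of theta = P(Heads) from the sample above: for i.i.d. Bernoulli
# data, the likelihood-maximizing theta is the observed heads fraction.
sample = "HHTTTTTHTHTTTHH"
theta_hat = sample.count("H") / len(sample)
print(theta_hat)  # 6 heads in 15 flips -> 0.4
```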


Likelihood

P(x | θ): probability of event x given model θ.

Viewed as a function of x (θ fixed), it's a probability. E.g., Σx P(x | θ) = 1.

Viewed as a function of θ (x fixed), it's a likelihood. E.g., Σθ P(x | θ) can be anything; only relative values are of interest. E.g., if θ = probability of heads in a sequence of coin flips, then P(HHTHH | 0.6) > P(HHTHH | 0.5); i.e., the event HHTHH is more likely when θ = 0.6 than when θ = 0.5.

And what θ makes HHTHH most likely?


Likelihood Function

Probability of HHTHH, given P(H) = θ: L(θ) = θ⁴(1 − θ)

θ      θ⁴(1 − θ)
0.2    0.0013
0.5    0.0313
0.8    0.0819
0.95   0.0407

[Plot: P(HHTHH | θ) vs. θ over [0, 1]; the curve peaks at θ = 0.8.]
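The table and the peak are easy to reproduce (a small sketch of mine):

```python
# Likelihood of HHTHH as a function of theta = P(H): four heads, one tail.
def likelihood(theta):
    return theta ** 4 * (1 - theta)

for theta in (0.2, 0.5, 0.8, 0.95):
    print(f"{theta:4.2f}  {likelihood(theta):.4f}")

# A grid search confirms the maximum at theta = 0.8 (= 4/5 analytically).
print(max((i / 1000 for i in range(1001)), key=likelihood))
```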


Maximum Likelihood Parameter Estimation

One (of many) approaches to parameter estimation. Likelihood of (independent) observations x1, x2, ..., xn:

$L(x_1, x_2, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta)$

As a function of θ, what θ maximizes the likelihood of the data actually observed? Typical approach:

$\frac{\partial}{\partial\theta} L(\vec{x} \mid \theta) = 0$   or   $\frac{\partial}{\partial\theta} \log L(\vec{x} \mid \theta) = 0$


Example 1

n coin flips, x1, x2, ..., xn; n0 tails, n1 heads, n0 + n1 = n; θ = probability of heads. Set dL/dθ = 0. (Also verify it's a max, not a min, and not better on the boundary.)

Observed fraction of successes in the sample is the MLE of the success probability in the population.
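Spelling out the algebra (the slide's equations did not survive extraction; this is the standard derivation):

$L(x_1, \ldots, x_n \mid \theta) = \theta^{n_1}(1 - \theta)^{n_0}$

$\log L = n_1 \log \theta + n_0 \log(1 - \theta)$

$\frac{d}{d\theta} \log L = \frac{n_1}{\theta} - \frac{n_0}{1 - \theta} = 0 \;\Rightarrow\; \hat{\theta} = \frac{n_1}{n_0 + n_1} = \frac{n_1}{n}$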


Parameter Estimation

Assuming sample x1, x2, ..., xn is from a parametric distribution f(x | θ), estimate θ.

E.g.: Given n normal samples, estimate mean & variance:

$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2/(2\sigma^2)}, \qquad \theta = (\mu, \sigma^2)$

[Plot: normal density with µ ± σ marked.]


Ex. 2: xi ∼ N(µ, σ²), σ² = 1, µ unknown

Sample mean is the MLE of the population mean. (And verify it's a max, not a min, and not better on the boundary.)
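Again, the lost algebra, reconstructed from the standard derivation (with σ² = 1):

$\log L(x_1, \ldots, x_n \mid \mu) = \sum_{1 \le i \le n} \left( -\tfrac{1}{2}\ln 2\pi - \tfrac{(x_i - \mu)^2}{2} \right)$

$\frac{\partial}{\partial\mu} \log L = \sum_{1 \le i \le n} (x_i - \mu) = 0 \;\Rightarrow\; \hat{\mu} = \frac{1}{n} \sum_{1 \le i \le n} x_i = \bar{x}$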


Ex. 3: xi ∼ N(µ, σ²), µ, σ² both unknown

[3-D plot: log-likelihood surface over (θ1, θ2) = (µ, σ²).]

Sample mean is the MLE of the population mean, again.


Ex. 3, (cont.)

$\ln L(x_1, x_2, \ldots, x_n \mid \theta_1, \theta_2) = \sum_{1 \le i \le n} \left( -\tfrac{1}{2} \ln 2\pi\theta_2 - \frac{(x_i - \theta_1)^2}{2\theta_2} \right)$

$\frac{\partial}{\partial\theta_2} \ln L(x_1, x_2, \ldots, x_n \mid \theta_1, \theta_2) = \sum_{1 \le i \le n} \left( -\frac{1}{2\theta_2} + \frac{(x_i - \theta_1)^2}{2\theta_2^2} \right) = 0$

$\hat{\theta}_2 = \left( \sum_{1 \le i \le n} (x_i - \hat{\theta}_1)^2 \right) / n = \bar{s}^2$

A consistent (i.e., lim as n→∞ is correct), but biased estimate of the population variance. (An example of overfitting.) The unbiased estimate divides by n − 1 instead of n.

Moral: MLE is a great idea, but not a magic bullet.


Aside: Is it Biased? Why?

Is it? Yes. As an extreme, when n = 1, θ̂2 = 0.

Why? A bit harder to see, but think about n = 2. Then θ̂1 is exactly between the two sample points, the position that exactly minimizes the expression for θ̂2. Any other choices of θ1, θ2 make the likelihood of the observed data slightly lower. But it's actually pretty unlikely that two sample points would fall exactly equidistant from, and on opposite sides of, the true mean, so the MLE θ̂2 systematically underestimates θ2.
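A simulation makes the underestimate concrete (my sketch, not the lecture's; for N(0, 1) samples of size n, the MLE's expectation works out to (n − 1)/n · σ²):

```python
import random

# Average the MLE variance estimate, sum((x - xbar)^2)/n, over many
# N(0, 1) samples of size n = 2; the average is near 0.5, not 1.
random.seed(0)
n, trials = 2, 100_000
total = 0.0
for _ in range(trials):
    xs = [random.gauss(0, 1) for _ in range(n)]
    xbar = sum(xs) / n
    total += sum((x - xbar) ** 2 for x in xs) / n
print(total / trials)  # ~0.5 = (n - 1)/n, the bias described above
```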


EM: The Expectation-Maximization Algorithm


More Complex Example

This? Or this? (A modeling decision, not a math problem... but if the latter, what math?)


A Real Example: CpG content of human gene promoters

"A genome-wide analysis of CpG dinucleotides in the human genome distinguishes two distinct classes of promoters," Saxonov, Berg, and Brutlag, PNAS 2006;103:1412-1417. ©2006 by National Academy of Sciences.


Gaussian Mixture Models / Model-based Clustering

Parameters θ: means µ1, µ2; variances σ1², σ2²; mixing parameters τ1, τ2 = 1 − τ1.

P.D.F.: f(x | µ1, σ1²), f(x | µ2, σ2²)

Likelihood:

$L(x_1, x_2, \ldots, x_n \mid \mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \tau_1, \tau_2) = \prod_{i=1}^{n} \sum_{j=1}^{2} \tau_j f(x_i \mid \mu_j, \sigma_j^2)$

No closed-form max.
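As code, this likelihood looks like the following (my sketch; the data set and parameter settings are the ones on the likelihood-surface slides below):

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def mixture_log_likelihood(xs, mus, variances, taus):
    """log prod_i sum_j tau_j f(x_i | mu_j, var_j), computed as a sum of logs."""
    return sum(
        math.log(sum(t * normal_pdf(x, m, v)
                     for t, m, v in zip(taus, mus, variances)))
        for x in xs)

xs = [-10.2, -10, -9.8, -0.2, 0, 0.2, 11.8, 12, 12.2]
print(mixture_log_likelihood(xs, mus=[-5, 12], variances=[1, 1], taus=[0.5, 0.5]))
```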


Likelihood Surface

[3-D plot: mixture likelihood as a function of the two component means, each axis running from −20 to 20.]


[Same surface, for σ² = 1.0, τ1 = .5, τ2 = .5, and data xi = −10.2, −10, −9.8, −0.2, 0, 0.2, 11.8, 12, 12.2.]


[Same surface and data, with four peaks marked near (−5, 12), (−10, 6), (6, −10), and (12, −5).]


A What-If Puzzle

Messy: no closed-form solution is known for finding the θ maximizing L.

But what if we knew the hidden data?


EM as Egg vs Chicken

IF zij known, could estimate parameters θ. E.g., only points in cluster 2 influence µ2, σ2.

IF parameters θ known, could estimate zij. E.g., if |xi − µ1|/σ1 ≪ |xi − µ2|/σ2, then zi1 ≫ zi2.

But we know neither; (optimistically) iterate:
E: calculate expected zij, given parameters.
M: calculate the "MLE" of the parameters, given E(zij).

Overall, a clever "hill-climbing" strategy.


Simple Version: "Classification EM"

If zij < .5, pretend it's 0; if zij > .5, pretend it's 1. I.e., classify points as component 0 or 1.

Now recalculate θ, assuming that partition.

Then recalculate zij assuming that θ, then re-recalculate θ assuming the new zij, etc., etc.

"Full EM" is a bit more involved, but this is the crux.
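A minimal sketch of this loop (mine, not the course's code; two components with fixed, equal variances, so only the means move; with hard 0/1 assignments it is essentially 2-means clustering):

```python
def classification_em(xs, mu1, mu2, iters=20):
    """Hard-assignment ("classification") EM: put each point in the
    component with the nearer mean (z rounded to 0/1), then recompute
    each mean from its assigned points."""
    for _ in range(iters):
        c1 = [x for x in xs if abs(x - mu1) <= abs(x - mu2)]
        c2 = [x for x in xs if abs(x - mu1) > abs(x - mu2)]
        if c1:
            mu1 = sum(c1) / len(c1)
        if c2:
            mu2 = sum(c2) / len(c2)
    return mu1, mu2

xs = [-10.2, -10, -9.8, -0.2, 0, 0.2, 11.8, 12, 12.2]
print(classification_em(xs, mu1=-1.0, mu2=1.0))  # -> (-5.0, 12.0)
```

On this data the loop settles at means (−5, 12), one of the peaks marked on the likelihood surface earlier; a different start can settle on a different peak.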


Full EM


The E-step: Find E(Zij), i.e., P(Zij = 1)

Assume θ known & fixed. Let A (resp. B) be the event that xi was drawn from f1 (resp. f2), and let D be the observed datum xi. Then the expected value of zi1 is P(A | D), since E(zi1) = 0 · P(0) + 1 · P(1). Repeat for each xi.
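Concretely, by Bayes' rule (the slide's formula did not survive extraction; this is the standard expansion):

$E(z_{i1}) = P(A \mid D) = \frac{P(D \mid A)\,P(A)}{P(D)} = \frac{\tau_1 f(x_i \mid \mu_1, \sigma_1^2)}{\tau_1 f(x_i \mid \mu_1, \sigma_1^2) + \tau_2 f(x_i \mid \mu_2, \sigma_2^2)}$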


Complete Data Likelihood

(Better):
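The slide's equations were not recoverable, but the standard complete-data likelihood for this mixture (presumably what the "(Better):" annotation introduces) is:

$L(\vec{x}, \vec{z} \mid \theta) = \prod_{i=1}^{n} \prod_{j=1}^{2} \left( \tau_j f(x_i \mid \mu_j, \sigma_j^2) \right)^{z_{ij}}$

Its log is linear in the zij, which is what makes taking expectations over the hidden data tractable.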


M-step: Find θ maximizing E(log(Likelihood))
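For the Gaussian mixture, the maximizing θ has a closed form (a standard result, reconstructed here; write z̄ij for E(zij)):

$\mu_j = \frac{\sum_i \bar{z}_{ij}\, x_i}{\sum_i \bar{z}_{ij}}, \qquad \sigma_j^2 = \frac{\sum_i \bar{z}_{ij}\,(x_i - \mu_j)^2}{\sum_i \bar{z}_{ij}}, \qquad \tau_j = \frac{1}{n} \sum_i \bar{z}_{ij}$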


2-Component Mixture: σ1 = σ2 = 1; τ = 0.5


EM Summary

Fundamentally a maximum likelihood parameter estimation problem.

Useful if analysis is more tractable when the 0/1 hidden data z are known.

Iterate:
E-step: estimate E(z) for each z, given θ.
M-step: estimate the θ maximizing E(log likelihood), given E(z) [where "E(log L)" is wrt random z ~ E(z) = p(z = 1)].
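Putting the two steps together (my sketch, under the same simplifications as above: two components with σ² = 1 and τ = 0.5 held fixed, so only the means are re-estimated):

```python
import math

def normal_pdf(x, mu, var=1.0):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_means(xs, mu1, mu2, tau=0.5, iters=100):
    """EM for a 2-component 1-D Gaussian mixture, updating only the means."""
    for _ in range(iters):
        # E-step: expected membership in component 1, via Bayes' rule.
        z1 = []
        for x in xs:
            p1 = tau * normal_pdf(x, mu1)
            p2 = (1 - tau) * normal_pdf(x, mu2)
            z1.append(p1 / (p1 + p2))
        # M-step: means weighted by the expected memberships.
        mu1 = sum(z * x for z, x in zip(z1, xs)) / sum(z1)
        mu2 = sum((1 - z) * x for z, x in zip(z1, xs)) / sum(1 - z for z in z1)
    return mu1, mu2

xs = [-10.2, -10, -9.8, -0.2, 0, 0.2, 11.8, 12, 12.2]
print(em_two_means(xs, mu1=-1.0, mu2=1.0))
```

Each iteration provably does not decrease the likelihood, but, as the next slide notes, which local max it reaches depends on where it starts.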


EM Issues

Under mild assumptions (sect. 11.6), EM is guaranteed to increase the likelihood with every E-M iteration, hence will converge.

But it may converge to a local, not global, max. (Recall the 4-bump surface...)

The issue is (probably) intrinsic, since EM is often applied to NP-hard problems (including clustering, above, and motif discovery, soon).

Nevertheless, it is widely used and often effective.