Expectation maximization
CMPSCI 689: Machine Learning, Subhransu Maji (UMASS)
14 April 2015

Transcript of lecture slides (lec21, 19 slides).

Slide 2: Motivation

Suppose you are building a naive Bayes spam classifier. After you are done, your boss tells you that there is no money to label the data.
‣ You have a probabilistic model that assumes labeled data, but you don't have any labels. Can you still do something?
Amazingly, you can!
‣ Treat the labels as hidden variables and try to learn them simultaneously along with the parameters of the model.
Expectation Maximization (EM)
‣ A broad family of algorithms for solving hidden variable problems
‣ In today's lecture we will derive EM algorithms for clustering and naive Bayes classification, and learn why EM works

Slide 3: Gaussian mixture model for clustering

Suppose the data comes from a Gaussian mixture model (GMM): you have K clusters, and the data from cluster k is drawn from a Gaussian with mean μ_k and variance σ²_k.
We will assume that the data comes with labels (we will soon remove this assumption).
Generative story of the data:
‣ For each example n = 1, 2, ..., N
➡ Choose a label: y_n \sim \mathrm{Mult}(\theta_1, \theta_2, \ldots, \theta_K)
➡ Choose an example: x_n \sim \mathcal{N}(\mu_{y_n}, \sigma^2_{y_n})
Likelihood of the data:

p(D) = \prod_{n=1}^{N} p(y_n)\, p(x_n \mid y_n) = \prod_{n=1}^{N} \theta_{y_n} \mathcal{N}(x_n; \mu_{y_n}, \sigma^2_{y_n})

p(D) = \prod_{n=1}^{N} \theta_{y_n} \left( 2\pi\sigma^2_{y_n} \right)^{-D/2} \exp\left( -\frac{\|x_n - \mu_{y_n}\|^2}{2\sigma^2_{y_n}} \right)
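To make the generative story concrete, here is a minimal NumPy sketch that samples a dataset from this model. The cluster count, mixing weights, means, and variances below are made-up illustrative values, not parameters from the lecture.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters: K = 3 spherical clusters in D = 2 dimensions.
theta = np.array([0.5, 0.3, 0.2])                        # mixing weights, sum to 1
mu = np.array([[0.0, 0.0], [4.0, 4.0], [-4.0, 3.0]])     # cluster means mu_k
sigma2 = np.array([1.0, 0.5, 2.0])                       # cluster variances sigma_k^2

N = 200
y = rng.choice(len(theta), size=N, p=theta)              # y_n ~ Mult(theta_1, ..., theta_K)
noise = rng.standard_normal((N, mu.shape[1]))
x = mu[y] + np.sqrt(sigma2[y])[:, None] * noise          # x_n ~ N(mu_{y_n}, sigma_{y_n}^2 I)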

Slide 4: GMM: known labels

Likelihood of the data:

p(D) = \prod_{n=1}^{N} \theta_{y_n} \left( 2\pi\sigma^2_{y_n} \right)^{-D/2} \exp\left( -\frac{\|x_n - \mu_{y_n}\|^2}{2\sigma^2_{y_n}} \right)

If you knew the labels y_n, then the maximum-likelihood estimates of the parameters are easy to compute:

\theta_k = \frac{1}{N} \sum_n [y_n = k]    // fraction of examples with label k

\mu_k = \frac{\sum_n [y_n = k]\, x_n}{\sum_n [y_n = k]}    // mean of all the examples with label k

\sigma^2_k = \frac{\sum_n [y_n = k]\, \|x_n - \mu_k\|^2}{\sum_n [y_n = k]}    // variance of all the examples with label k
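A minimal sketch of these estimates in NumPy, assuming x is an N x D array and y holds integer labels in {0, ..., K-1} as in the sampling sketch above (the function and variable names are my own):

import numpy as np

def gmm_mle(x, y, K):
    """Maximum-likelihood estimates of (theta_k, mu_k, sigma2_k) given the labels y."""
    N, D = x.shape
    theta = np.zeros(K)
    mu = np.zeros((K, D))
    sigma2 = np.zeros(K)
    for k in range(K):
        mask = (y == k)
        theta[k] = mask.mean()                                   # fraction of examples with label k
        mu[k] = x[mask].mean(axis=0)                             # mean of the examples with label k
        sigma2[k] = ((x[mask] - mu[k]) ** 2).sum(axis=1).mean()  # mean squared distance to mu_k (the slide's sigma_k^2)
    return theta, mu, sigma2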

Slide 5: GMM: unknown labels

Now suppose you don't have the labels y_n. Analogous to k-means, one solution is to iterate: start by guessing the parameters and then repeat two steps:
‣ Estimate labels given the parameters
‣ Estimate parameters given the labels
In k-means we assigned each point to a single cluster, also called a hard assignment (point 10 goes to cluster 2).
In expectation maximization (EM) we will use a soft assignment (point 10 goes half to cluster 2 and half to cluster 5).
Let's define a random variable z_n = (z_{n,1}, z_{n,2}, ..., z_{n,K}) to denote the assignment vector for the nth point:
‣ Hard assignment: exactly one z_{n,k} is 1, the rest are 0
‣ Soft assignment: the z_{n,k} are nonnegative and sum to 1

Slide 6: GMM: parameter estimation

Formally, z_{n,k} is the probability that the nth point goes to cluster k:

z_{n,k} = p(y_n = k \mid x_n) = \frac{P(y_n = k, x_n)}{P(x_n)} \propto P(y_n = k)\, P(x_n \mid y_n = k) = \theta_k \mathcal{N}(x_n; \mu_k, \sigma^2_k)

Given a set of parameters (θ_k, μ_k, σ²_k), z_{n,k} is easy to compute. Given z_{n,k}, we can update the parameters (θ_k, μ_k, σ²_k) as:

\theta_k = \frac{1}{N} \sum_n z_{n,k}    // fraction of examples with label k

\mu_k = \frac{\sum_n z_{n,k}\, x_n}{\sum_n z_{n,k}}    // mean of all the fractional examples with label k

\sigma^2_k = \frac{\sum_n z_{n,k}\, \|x_n - \mu_k\|^2}{\sum_n z_{n,k}}    // variance of all the fractional examples with label k
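Putting the two steps together, here is a minimal sketch of the resulting EM loop for a spherical GMM. It follows the E-step and M-step formulas above, but the initialization, iteration count, and the max-subtraction trick for numerical stability are my own choices.

import numpy as np

def gmm_em(x, K, n_iters=100, seed=0):
    """EM for a spherical GMM: soft E step (z_{n,k}) followed by the closed-form M step."""
    rng = np.random.default_rng(seed)
    N, D = x.shape
    theta = np.full(K, 1.0 / K)                      # mixing weights, initialized uniform
    mu = x[rng.choice(N, size=K, replace=False)]     # means initialized at random data points
    sigma2 = np.full(K, x.var())                     # variances initialized to the global variance
    for _ in range(n_iters):
        # E step: z[n, k] = p(y_n = k | x_n), proportional to theta_k * N(x_n; mu_k, sigma2_k I)
        sq = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)              # squared distances, shape (N, K)
        log_z = np.log(theta) - 0.5 * D * np.log(2 * np.pi * sigma2) - sq / (2 * sigma2)
        log_z -= log_z.max(axis=1, keepdims=True)                             # subtract the max before exponentiating
        z = np.exp(log_z)
        z /= z.sum(axis=1, keepdims=True)
        # M step: the updates above, with soft counts z_{n,k}.
        Nk = z.sum(axis=0)
        theta = Nk / N
        mu = (z.T @ x) / Nk[:, None]
        sq = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)              # recompute with the new means
        sigma2 = (z * sq).sum(axis=0) / Nk
    return theta, mu, sigma2, z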

Slide 7: GMM: example

We have replaced the indicator variable [y_n = k] with p(y_n = k), which is the expectation of [y_n = k]. This is our guess of the labels.
Just like k-means, EM is susceptible to local minima.
Clustering example (k-means vs. GMM): http://nbviewer.ipython.org/github/NICTA/MLSS/tree/master/clustering/

Slide 8: The EM framework

We have data with observations x_n and hidden variables y_n, and would like to estimate the parameters θ.
The likelihood of the data and hidden variables:

p(D) = \prod_n p(x_n, y_n \mid \theta)

Only the x_n are known, so we compute the data likelihood by marginalizing out the y_n:

p(X \mid \theta) = \prod_n \sum_{y_n} p(x_n, y_n \mid \theta)

Parameter estimation by maximizing the log-likelihood:

\theta_{\mathrm{ML}} = \arg\max_\theta \sum_n \log \sum_{y_n} p(x_n, y_n \mid \theta)

This is hard to maximize directly since the sum is inside the log.
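Although it is hard to maximize in closed form, the objective itself is easy to evaluate. A minimal sketch for the spherical GMM used earlier, computing the sum inside the log via log-sum-exp (assumes NumPy and SciPy; the function name is mine):

import numpy as np
from scipy.special import logsumexp

def gmm_log_likelihood(x, theta, mu, sigma2):
    """sum_n log sum_k theta_k N(x_n; mu_k, sigma2_k I): the sum over y_n sits inside the log."""
    D = x.shape[1]
    sq = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # squared distances, shape (N, K)
    log_joint = np.log(theta) - 0.5 * D * np.log(2 * np.pi * sigma2) - sq / (2 * sigma2)
    return logsumexp(log_joint, axis=1).sum()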

Slide 9: Jensen's inequality

Given a concave function f and a set of weights λ_i ≥ 0 with ∑_i λ_i = 1, Jensen's inequality states that f(∑_i λ_i x_i) ≥ ∑_i λ_i f(x_i).
This is a direct consequence of concavity:
‣ f(ax + by) ≥ a f(x) + b f(y) when a ≥ 0, b ≥ 0, a + b = 1
[Figure: a concave curve f; for two points x and y, the curve value f(ax + by) lies above the value a f(x) + b f(y) on the chord joining (x, f(x)) and (y, f(y)).]
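A quick numerical check of the inequality for the concave function f = log, with arbitrary made-up weights and points (my own example, not from the slides):

import numpy as np

lam = np.array([0.2, 0.5, 0.3])      # weights: nonnegative and summing to 1
x = np.array([1.0, 4.0, 9.0])        # arbitrary positive points

lhs = np.log(np.dot(lam, x))         # f(sum_i lam_i x_i), about 1.589
rhs = np.dot(lam, np.log(x))         # sum_i lam_i f(x_i), about 1.352
print(lhs >= rhs)                    # True: log is concave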

Slide 10: The EM framework

Construct a lower bound on the log-likelihood using Jensen's inequality:

L(X \mid \theta) = \sum_n \log \sum_{y_n} p(x_n, y_n \mid \theta)
 = \sum_n \log \sum_{y_n} q(y_n) \frac{p(x_n, y_n \mid \theta)}{q(y_n)}
 \ge \sum_n \sum_{y_n} q(y_n) \log \frac{p(x_n, y_n \mid \theta)}{q(y_n)}    (Jensen's inequality)
 = \sum_n \sum_{y_n} \left[ q(y_n) \log p(x_n, y_n \mid \theta) - q(y_n) \log q(y_n) \right]
 \triangleq \hat{L}(X \mid \theta)

Maximize the lower bound (the second term is independent of θ):

\theta \leftarrow \arg\max_\theta \sum_n \sum_{y_n} q(y_n) \log p(x_n, y_n \mid \theta)

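The bound can also be checked numerically: for any distribution q over the hidden variable, Σ_y q(y) log(p(x, y | θ) / q(y)) never exceeds log Σ_y p(x, y | θ). A minimal sketch for a single example with made-up joint probabilities (my own illustration):

import numpy as np

p_xy = np.array([0.05, 0.15, 0.02])      # p(x, y | theta) for the K possible values of y (made up)
log_lik = np.log(p_xy.sum())             # log sum_y p(x, y | theta)

rng = np.random.default_rng(0)
for _ in range(5):
    q = rng.dirichlet(np.ones(len(p_xy)))              # an arbitrary distribution over y
    lower_bound = np.sum(q * (np.log(p_xy) - np.log(q)))
    assert lower_bound <= log_lik + 1e-12               # Jensen: the bound never exceeds the log-likelihood

# The bound is tight when q(y) = p(y | x, theta) = p(x, y | theta) / sum_y p(x, y | theta).
q_star = p_xy / p_xy.sum()
print(np.isclose(np.sum(q_star * (np.log(p_xy) - np.log(q_star))), log_lik))   # True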

Slide 11: Lower bound illustrated

Maximizing the lower bound increases the value of the original function if the lower bound touches the function at the current value.
[Figure: the log-likelihood L(X | θ) with two successive lower bounds; L̂(X | θ_t) touches L at θ_t, and maximizing it moves the estimate to θ_{t+1}, where the new bound L̂(X | θ_{t+1}) is constructed.]

Slide 12: An optimal lower bound

Any choice of the probability distribution q(y_n) is valid as long as the lower bound touches the function at the current estimate of θ:

L(X \mid \theta_t) = \hat{L}(X \mid \theta_t)

We can pick the optimal q(y_n) by maximizing the lower bound:

\arg\max_q \sum_{y_n} \left[ q(y_n) \log p(x_n, y_n \mid \theta) - q(y_n) \log q(y_n) \right]

This gives us q(y_n) = p(y_n \mid x_n, \theta_t).
‣ Proof: use Lagrange multipliers with the "sum to one" constraint (a sketch of the argument follows below)
This is the distribution of the hidden variables conditioned on the data and the current estimate of the parameters.
‣ This is exactly what we computed in the GMM example
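A sketch of that Lagrange-multiplier argument, written for a single example and dropping the subscript n (my own write-up of the standard derivation):

\begin{align*}
\Lambda(q, \lambda) &= \sum_{y} \big[ q(y) \log p(x, y \mid \theta) - q(y) \log q(y) \big]
  + \lambda \Big( 1 - \sum_{y} q(y) \Big) \\
\frac{\partial \Lambda}{\partial q(y)} &= \log p(x, y \mid \theta) - \log q(y) - 1 - \lambda = 0
  \;\;\Longrightarrow\;\; q(y) \propto p(x, y \mid \theta) \\
q(y) &= \frac{p(x, y \mid \theta)}{\sum_{y'} p(x, y' \mid \theta)} = p(y \mid x, \theta)
\end{align*}

The last line uses the sum-to-one constraint to fix the proportionality constant.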

Slide 13: The EM algorithm

We have data with observations x_n and hidden variables y_n, and would like to estimate the parameters θ of the distribution p(x | θ).
EM algorithm:
‣ Initialize the parameters θ randomly
‣ Iterate between the following two steps:
➡ E step: compute the probability distribution over the hidden variables:

q(y_n) = p(y_n \mid x_n, \theta)

➡ M step: maximize the lower bound:

\theta \leftarrow \arg\max_\theta \sum_n \sum_{y_n} q(y_n) \log p(x_n, y_n \mid \theta)

The EM algorithm is a great candidate when the M step can be done easily but p(x | θ) cannot be easily optimized over θ.
‣ For example, for GMMs it was easy to compute the means and variances given the memberships
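In code, the framework is just a loop around two problem-specific pieces. A minimal sketch of a generic driver (the function names, the convergence test, and the iteration cap are my own choices, not from the slides):

def em(x, theta_init, e_step, m_step, log_likelihood, tol=1e-6, max_iters=200):
    """Generic EM: alternate q = e_step(x, theta) and theta = m_step(x, q)."""
    theta = theta_init
    prev = -float("inf")
    for _ in range(max_iters):
        q = e_step(x, theta)             # E step: q(y_n) = p(y_n | x_n, theta)
        theta = m_step(x, q)             # M step: maximize the lower bound over theta
        cur = log_likelihood(x, theta)   # non-decreasing across iterations
        if cur - prev < tol:             # stop when the improvement is negligible
            break
        prev = cur
    return theta

For the GMM, e_step and m_step would be the z_{n,k} computation and the parameter updates from slide 6.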

Slide 14: Naive Bayes: revisited

Consider the binary prediction problem. Let the data be distributed according to a probability distribution:

p_\theta(y, \mathbf{x}) = p_\theta(y, x_1, x_2, \ldots, x_D)

We can simplify this using the chain rule of probability:

p_\theta(y, \mathbf{x}) = p_\theta(y)\, p_\theta(x_1 \mid y)\, p_\theta(x_2 \mid x_1, y) \cdots p_\theta(x_D \mid x_1, x_2, \ldots, x_{D-1}, y)
 = p_\theta(y) \prod_{d=1}^{D} p_\theta(x_d \mid x_1, x_2, \ldots, x_{d-1}, y)

Naive Bayes assumption:

p_\theta(x_d \mid x_{d'}, y) = p_\theta(x_d \mid y), \quad \forall d' \neq d

E.g., the words "free" and "money" are independent given that a message is spam.

Slide 15: Naive Bayes: a simple case

Case: binary labels and binary features.

p_\theta(y) = \mathrm{Bernoulli}(\theta_0)
p_\theta(x_d \mid y = +1) = \mathrm{Bernoulli}(\theta^+_d)    // label +1
p_\theta(x_d \mid y = -1) = \mathrm{Bernoulli}(\theta^-_d)    // label -1

(1 + 2D parameters in total)

Probability of the data:

p_\theta(y, \mathbf{x}) = p_\theta(y) \prod_{d=1}^{D} p_\theta(x_d \mid y)
 = \theta_0^{[y=+1]} (1 - \theta_0)^{[y=-1]}
   \times \prod_{d=1}^{D} (\theta^+_d)^{[x_d=1, y=+1]} (1 - \theta^+_d)^{[x_d=0, y=+1]}
   \times \prod_{d=1}^{D} (\theta^-_d)^{[x_d=1, y=-1]} (1 - \theta^-_d)^{[x_d=0, y=-1]}
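A minimal sketch of this joint probability in log space, assuming a binary feature vector x in {0,1}^D, a label y in {+1, -1}, and parameter arrays matching the slide's θ_0, θ_d^+, θ_d^- (the function and variable names are my own):

import numpy as np

def nb_log_joint(x, y, theta0, theta_pos, theta_neg):
    """log p_theta(y, x) for the Bernoulli naive Bayes model above."""
    if y == +1:
        log_prior = np.log(theta0)
        th = theta_pos                   # theta_d^+
    else:
        log_prior = np.log(1.0 - theta0)
        th = theta_neg                   # theta_d^-
    # sum_d [x_d = 1] log theta_d + [x_d = 0] log (1 - theta_d)
    return log_prior + np.sum(x * np.log(th) + (1 - x) * np.log(1.0 - th))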

Slide 16: Naive Bayes: parameter estimation

Given data, we can estimate the parameters by maximizing the data likelihood. The maximum-likelihood estimates are:

\hat{\theta}_0 = \frac{\sum_n [y_n = +1]}{N}    // fraction of the data with label +1

\hat{\theta}^+_d = \frac{\sum_n [x_{d,n} = 1, y_n = +1]}{\sum_n [y_n = +1]}    // fraction of instances with x_d = 1 among examples with label +1

\hat{\theta}^-_d = \frac{\sum_n [x_{d,n} = 1, y_n = -1]}{\sum_n [y_n = -1]}    // fraction of instances with x_d = 1 among examples with label -1
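A minimal NumPy sketch of these counts, assuming X is an N x D binary matrix and y a vector of ±1 labels (names are mine):

import numpy as np

def nb_mle(X, y):
    """Maximum-likelihood estimates for Bernoulli naive Bayes with labels."""
    pos, neg = (y == +1), (y == -1)
    theta0 = pos.mean()                  # fraction of the data with label +1
    theta_pos = X[pos].mean(axis=0)      # fraction of x_d = 1 among examples with label +1
    theta_neg = X[neg].mean(axis=0)      # fraction of x_d = 1 among examples with label -1
    return theta0, theta_pos, theta_neg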

Slide 17: Naive Bayes: EM

Now suppose you don’t have labels yn!Initialize the parameters θ randomly!E step: compute the distribution over the hidden variables q(yn)!!!M step: estimate θ given the guesses

Naive Bayes: EM

17

q(yn

= 1) = p(yn

= +1|xn

, ✓) / ✓+0

DY

d=1

✓+[xd,n=1]d

(1� ✓+d

)[xd,n=0]

✓0 =

Pn q(yn = 1)

N

+d =

Pn[xd,n = 1]q(yn = 1)P

n q(yn = 1)

�d =

Pn[xd,n = 1]q(yn = �1)P

n q(yn = �1)

// fraction of the data with label as +1

// fraction of the instances with 1 among +1

// fraction of the instances with 1 among -1
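A minimal sketch of this EM loop for the unlabeled case, following the E and M steps above; the initialization, clipping, and iteration count are my own assumptions, and as with any EM run the result depends on the random start.

import numpy as np

def nb_em(X, n_iters=50, eps=1e-6, seed=0):
    """EM for Bernoulli naive Bayes without labels; q[n] holds q(y_n = +1)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    theta0 = 0.5
    theta_pos = rng.uniform(0.25, 0.75, size=D)      # random initialization
    theta_neg = rng.uniform(0.25, 0.75, size=D)
    for _ in range(n_iters):
        # E step: q(y_n = +1) proportional to theta0 * prod_d theta_d^+^[x=1] (1 - theta_d^+)^[x=0], similarly for -1.
        log_pos = np.log(theta0) + X @ np.log(theta_pos) + (1 - X) @ np.log(1 - theta_pos)
        log_neg = np.log(1 - theta0) + X @ np.log(theta_neg) + (1 - X) @ np.log(1 - theta_neg)
        q = 1.0 / (1.0 + np.exp(log_neg - log_pos))  # normalize the two unnormalized log-probabilities
        # M step: fractional counts in place of the indicators [y_n = +1].
        theta0 = q.mean()
        theta_pos = (X * q[:, None]).sum(axis=0) / (q.sum() + eps)
        theta_neg = (X * (1 - q)[:, None]).sum(axis=0) / ((1 - q).sum() + eps)
        theta_pos = np.clip(theta_pos, eps, 1 - eps)  # keep the logs finite
        theta_neg = np.clip(theta_neg, eps, 1 - eps)
    return theta0, theta_pos, theta_neg, q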

Slide 18: Summary

Expectation maximization:
‣ A general technique to estimate parameters of probabilistic models when some observations are hidden
‣ EM iterates between estimating the hidden variables and optimizing the parameters given the hidden variables
‣ EM can be seen as maximizing a lower bound on the data log-likelihood; we used Jensen's inequality to switch the log-sum to a sum-log
EM can be used for learning:
‣ mixtures of distributions for clustering, e.g. GMMs
‣ parameters of hidden Markov models (next lecture)
‣ topic models in NLP
‣ probabilistic PCA
‣ ...

Slide 19: Slides credit

Some of the slides are based on the CIML book by Hal Daume III.
The figure for the EM lower bound is based on https://cxwangyi.wordpress.com/2008/11/
The k-means vs. GMM clustering example is from http://nbviewer.ipython.org/github/NICTA/MLSS/tree/master/clustering/