Expectation-Maximization
News o’ the day
•First “3-D” picture of sun
•Anybody got red/green sunglasses?
Administrivia
•No noose is good noose
Where we’re at
•Last time:
•E^3
•Finished up (our brief survey of) RL
•Today:
• Intro to unsupervised learning
•The expectation-maximization “algorithm”
What’s with this EM thing?
Nobody expects...
Unsupervised learning
•EM is (one form of) unsupervised learning:
•Given: data
•Find: “structure” of that data
•Clusters -- what points “group together”? (we’ll do this one today)
•Taxonomies -- what’s descended from/related to what?
•Parses -- grammatical structure of a sentence
•Hidden variables -- “behind the scenes”
Example task
We can see the clusters easily...
... but the computer can’t. How can we get the computer to identify the clusters?
Need: algorithm that takes data and returns a label (cluster ID) for each data point
Parsing example
What’s the grammatical structure of this sentence?
He never claimed to be a god.
Parsing example
He never claimed to be a god.
What’s the grammatical structure of this sentence?
(S (NP He/N) (VP never/Adv claimed/V (VP to be/V (NP a/Det god/N))))
Note: entirely hidden information! Need to infer (guess) it in an ~unsupervised way.
EM assumptions
•All learning algorithms require data assumptions
•EM: generative model
•Description of process that generates your data
•Assumes: hidden (latent) variables
•Probability model: assigns probability to data + hidden variables
•Often think: generate hidden var, then generate data based on that hidden var
Classic latent var model
•Data generator looks like this:
•Behind a curtain:
• I flip a weighted coin
•Heads: I roll a 6-sided die
•Tails: I roll a 4-sided die
• I show you:
•Outcome of die
Your mission
•Data you get is sequence of die outcomes
•6, 3, 3, 1, 5, 4, 2, 1, 6, 3, 1, 5, 2, ...
•Your task: figure out what the coin flip was for each of these numbers
•Hidden variable: c ≡ outcome of coin flip
•What makes this hard?
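The generative story behind the curtain can be sketched in a few lines of Python (the coin weight p = 0.7 and all function names here are illustrative choices, not from the slides):

```python
import random

def generate(n, p=0.7, seed=0):
    """Simulate the curtain: flip a weighted coin, then roll a
    6-sided die on heads or a 4-sided die on tails.
    Returns (outcomes, coins); the learner only sees outcomes."""
    rng = random.Random(seed)
    outcomes, coins = [], []
    for _ in range(n):
        heads = rng.random() < p          # weighted coin flip
        sides = 6 if heads else 4         # pick which die to roll
        outcomes.append(rng.randint(1, sides))
        coins.append("H" if heads else "T")
    return outcomes, coins
```

What makes inference hard is visible in the simulation: any outcome in 1..4 is consistent with either die, so only a 5 or 6 gives the coin away.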
A more “practical” example
•Robot navigating in physical world
•Locations in world can be occupied or unoccupied
•Robot wants occupancy map (so it doesn’t bump into things)
•Sensors are imperfect (noise, object variation, etc.)
•Given: sensor data
•Infer: occupied/unoccupied for each location
Classic latent var model
•This process describes (generates) prob distribution over numbers
•Hidden state: outcome of coin flip
•Observed state: outcome of die given (conditioned on) coin flip result
Probability of observations
•Final probability of outcome (x) is mixture of probability for each possible coin result:
•Pr[x] = p · Pr[x | c=heads] + (1 - p) · Pr[x | c=tails]
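With fair dice, this mixture probability can be written out directly (p = 0.7 is again an illustrative choice, not a value from the slides):

```python
def outcome_prob(x, p=0.7):
    """Mixture probability of die face x, marginalizing over the
    hidden coin: Pr[x] = p*Pr[x|heads] + (1-p)*Pr[x|tails].
    Assumes fair dice (uniform faces)."""
    pr_heads = 1 / 6 if 1 <= x <= 6 else 0.0   # 6-sided die
    pr_tails = 1 / 4 if 1 <= x <= 4 else 0.0   # 4-sided die
    return p * pr_heads + (1 - p) * pr_tails
```

Note that the probabilities over all six faces still sum to 1, and that faces 5 and 6 get probability only from the heads branch.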
Your goal
•Given model, and data, x1, x2, ..., xn
•Find Pr[ci|xi]
•So we need the model
•Model given by parameters: Θ = ⟨p, θheads, θtails⟩
•Where θheads and θtails are die outcome probabilities; p is prob of heads
Where’s the problem?
•To get Pr[ci|xi], you need Pr[xi|ci]
•To get Pr[xi|ci], you need model parameters
•To get model parameters, you need Pr[ci|xi]
•Uh oh...
EM to the rescue!
•Turns out that you can run this “chicken and egg” process in a loop and eventually get the right* answer
•Make an initial guess about coin assignments
•Repeat:
•Use guesses to get parameters (M step)
•Use parameters to update coin guesses (E step)
•Until converged
EM to the rescue!
function [Prc, Theta] = EM(X)
  // initialization
  Prc = pick_random_values()
  // the EM loop
  repeat {
    // M step: pick maximum likelihood parameters:
    //   argmax_theta(Pr[x,c | theta])
    Theta = get_params_from_c(Prc)
    // E step: use complete model to get data likelihood:
    //   Pr[c|x] = 1/z * Pr[x|c,theta]
    Prc = get_labels_from_params(X, Theta)
  } until (converged)
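A runnable version of this loop for the coin-and-dice model might look like the following Python sketch; all names are our own, with the pseudocode’s get_params_from_c and get_labels_from_params expanded into the M- and E-step bodies:

```python
import random

def em_coin_die(xs, iters=50, seed=1):
    """EM for the coin-and-dice model: soft-assign each outcome to
    heads/tails (E step), then refit p and both die distributions
    (M step). A sketch, not a reference implementation."""
    rng = random.Random(seed)
    # initialization: random soft guesses r[i] = Pr[coin_i = heads]
    r = [rng.random() for _ in xs]
    p, th, tt = 0.5, [1 / 6] * 6, [1 / 4] * 4
    for _ in range(iters):
        # M step: maximum-likelihood parameters given the guesses
        p = sum(r) / len(r)
        wh = [0.0] * 6              # soft counts for the 6-sided die
        wt = [0.0] * 4              # soft counts for the 4-sided die
        for xi, ri in zip(xs, r):
            wh[xi - 1] += ri
            if xi <= 4:
                wt[xi - 1] += 1 - ri
        th = [w / max(sum(wh), 1e-12) for w in wh]
        tt = [w / max(sum(wt), 1e-12) for w in wt]
        # E step: Pr[c=heads | x] = p*Pr[x|heads] / Pr[x]
        r = []
        for xi in xs:
            ph = p * th[xi - 1]
            pt = (1 - p) * tt[xi - 1] if xi <= 4 else 0.0
            r.append(ph / (ph + pt) if ph + pt > 0 else 0.5)
    return p, th, tt, r
```

As expected, the loop pins every 5 or 6 to heads with certainty, while outcomes 1..4 keep fractional coin guesses.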
Weird, but true
•This is counterintuitive, but it works
•Essentially, you’re improving guesses on each step
•M step “maximizes” parameters, Θ, given data
•E step finds “expectation” of hidden data, given Θ
•Both are driving toward max likelihood joint soln
•Guaranteed to converge
•Not guaranteed to find global optimum...
Very easy example
•Two Gaussian (“bell curve”) clusters
•Well separated in space
•Two dimensions
In more detail
•Gaussian mixture w/ k “components” (clusters/blobs)
•Mixture probability:
Pr[x] = Σi=1..k αi · (2π)^(-d/2) |Σi|^(-1/2) · exp(-(1/2) (x-μi)ᵀ Σi⁻¹ (x-μi))
•Sum: one term for each component
•αi: weight (probability) of each component
•Each term: Gaussian distribution for each component w/ mean vector μi and covariance matrix Σi
•(2π)^(d/2) |Σi|^(1/2): normalizing term for Gaussian
•(x-μi)ᵀ Σi⁻¹ (x-μi): squared distance of data point x from mean μi (with respect to Σi)
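For intuition, the mixture density can be written in code, here in one dimension so each term is visible (the slides’ version is multivariate, with mean vectors μi and covariance matrices Σi, but the terms line up the same way):

```python
import math

def gaussian_mixture_pdf(x, alphas, mus, sigmas):
    """Evaluate the 1-D Gaussian mixture density
    Pr[x] = sum_i alpha_i * N(x; mu_i, sigma_i^2)."""
    total = 0.0
    for a, mu, s in zip(alphas, mus, sigmas):
        norm = 1.0 / (math.sqrt(2 * math.pi) * s)  # normalizing term
        dist = ((x - mu) / s) ** 2                 # squared distance
        total += a * norm * math.exp(-0.5 * dist)
    return total
```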
Hidden variables
• Introduce the “hidden variable”, ci(x) (or just ci for short)
•Denotes “amount by which data point x belongs to cluster i”
•Sometimes called “cluster ownership”, “salience”, “relevance”, etc.
M step
•Need: parameters (Θ) given hidden variables (ci) and N data points, x1, x2, ..., xN
•Q: what are the parameters of the model? (What do we need to learn?)
•A: Θ = ⟨αi, μi, Σi⟩, i=1..k
•αi = (1/N) Σj ci(xj)
•μi = Σj ci(xj) xj / Σj ci(xj)
•Σi = Σj ci(xj) (xj - μi)(xj - μi)ᵀ / Σj ci(xj)
E step
•Need: probability of hidden variable (ci) given fixed parameters (Θ) and observed data (x1, ..., xN)
•ci(xj) = αi N(xj; μi, Σi) / Σl αl N(xj; μl, Σl)
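The two steps for a Gaussian mixture can be sketched as follows, again in one dimension for brevity (function names are our own; this assumes every data point has nonzero density under some component):

```python
import math

def e_step(xs, alphas, mus, sigmas):
    """Cluster ownerships c_i(x) = alpha_i*N(x; mu_i, sigma_i^2) / Pr[x]."""
    C = []
    for x in xs:
        likes = [a * math.exp(-0.5 * ((x - m) / s) ** 2)
                 / (math.sqrt(2 * math.pi) * s)
                 for a, m, s in zip(alphas, mus, sigmas)]
        z = sum(likes)                      # the 1/z normalizer
        C.append([l / z for l in likes])
    return C

def m_step(xs, C):
    """Refit weights, means, and variances from the ownerships."""
    k = len(C[0])
    alphas, mus, sigmas = [], [], []
    for i in range(k):
        w = sum(c[i] for c in C)            # total ownership of cluster i
        mu = sum(c[i] * x for c, x in zip(C, xs)) / w
        var = sum(c[i] * (x - mu) ** 2 for c, x in zip(C, xs)) / w
        alphas.append(w / len(xs))
        mus.append(mu)
        sigmas.append(math.sqrt(max(var, 1e-12)))  # guard vs. collapse
    return alphas, mus, sigmas
```

Alternating the two on well-separated data quickly snaps the means onto the blob centers, which is the behavior the next example illustrates.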
Another example
•k=3 Gaussian clusters
•Different means, covariances
•Well separated
Restart
•Problem: EM has found a “minimum energy” solution
• It’s only “locally” optimal
•B/c of poor starting choice, it ended up in wrong local optimum -- not global optimum
•Default answer: pick a new random start and re-run
Final example
•More Gaussians. How many clusters here?
Note...
•Doesn’t always work out this well in practice
•Sometimes the machine is smarter than humans
•Usually, if it’s hard for us, it’s hard for the machine too...
•First ~7-10 times I ran this one, it lost one cluster altogether (α3→0.0001)
Unresolved issues
•Notice: different cluster IDs (colors) end up on different blobs of points in each run
•Answer is “unique only up to permutation”
• I can swap around cluster IDs without changing solution
•Can’t tell what “right” cluster assignment is
Unresolved issues
•“Order” of model
• I.e., what k should you use?
•Hard to know, in general
•Can just try a bunch and find one that “works best”
•Problem: answer tends to get monotonically better w/ increasing k
•Best answer to date: Chinese restaurant process