Copyright © 2001, 2004, Andrew W. Moore
Clustering with Gaussian Mixtures
Andrew W. Moore
Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
[email protected]
412-268-7599
Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew’s tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.
Copyright © 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 2
Unsupervised Learning
• You walk into a bar. A stranger approaches and tells you: "I've got data from k classes. Each class produces observations with a normal distribution and variance σ²I. Standard simple multivariate Gaussian assumptions. I can tell you all the P(wi)'s."
• So far, looks straightforward. "I need a maximum likelihood estimate of the µi's."
• No problem: "There's just one thing. None of the data are labeled. I have datapoints, but I don't know what class they're from (any of them!)"
• Uh oh!!
Gaussian Bayes Classifier Reminder
\[
P(y=i \mid x) \;=\; \frac{p(x \mid y=i)\, P(y=i)}{p(x)}
\;=\; \frac{1}{(2\pi)^{m/2}\,\lVert\Sigma_i\rVert^{1/2}}
\exp\!\left(-\tfrac{1}{2}\,(x-\mu_i)^{T}\Sigma_i^{-1}(x-\mu_i)\right)\frac{P(y=i)}{p(x)}
\]
How do we deal with that?
Predicting wealth from age
Learning: modelyear, mpg → maker
General: O(m²) parameters
\[
\Sigma = \begin{pmatrix}
\sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1m} \\
\sigma_{12} & \sigma_2^2 & \cdots & \sigma_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{1m} & \sigma_{2m} & \cdots & \sigma_m^2
\end{pmatrix}
\]
Aligned: O(m) parameters
\[
\Sigma = \begin{pmatrix}
\sigma_1^2 & 0 & \cdots & 0 & 0 \\
0 & \sigma_2^2 & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & \sigma_{m-1}^2 & 0 \\
0 & 0 & \cdots & 0 & \sigma_m^2
\end{pmatrix}
\]
Spherical: O(1) cov parameters
\[
\Sigma = \begin{pmatrix}
\sigma^2 & 0 & \cdots & 0 & 0 \\
0 & \sigma^2 & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & \sigma^2 & 0 \\
0 & 0 & \cdots & 0 & \sigma^2
\end{pmatrix}
\]
Making a Classifier from a Density Estimator
Inputs → Classifier → Predict category (categorical inputs only: Dec Tree; mixed real/categorical okay: Joint BC, Naïve BC; real-valued inputs only: Gauss BC)
Inputs → Density Estimator → Probability (mixed real/categorical okay: Joint DE, Naïve DE; real-valued inputs only: Gauss DE)
Inputs → Regressor → Predict real no.
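The "density estimator → classifier" trick above can be sketched in a few lines: fit one class-conditional density per class, then predict the class maximizing p(x | class) · P(class). A minimal 1-D sketch; the class labels, means and priors are made-up illustration values, not from the slides:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """1-D Gaussian density N(x | mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def bayes_classify(x, class_params, priors):
    """Pick the class c that maximizes p(x | class c) * P(class c).

    class_params: {label: (mu, sigma)} -- one density estimator per class.
    priors: {label: P(class)}.
    """
    return max(class_params,
               key=lambda c: gaussian_pdf(x, *class_params[c]) * priors[c])

# Two hypothetical 1-D classes: "a" centered at 0, "b" centered at 4.
params = {"a": (0.0, 1.0), "b": (4.0, 1.0)}
priors = {"a": 0.5, "b": 0.5}
label_near_zero = bayes_classify(0.3, params, priors)
label_near_four = bayes_classify(3.8, params, priors)
```

With equal priors this reduces to picking the nearest class mean; unequal priors shift the decision boundary toward the rarer class.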
Next… back to Density Estimation
What if we want to do density estimation with multimodal or clumpy data?
The GMM assumption
• There are k components. The i'th component is called ωi.
• Component ωi has an associated mean vector µi.
• Each component generates data from a Gaussian with mean µi and covariance matrix σ²I.
Assume that each datapoint is generated according to the following recipe:
1. Pick a component at random. Choose component i with probability P(ωi).
2. Datapoint ~ N(µi, σ²I)
The General GMM assumption
• There are k components. The i'th component is called ωi.
• Component ωi has an associated mean vector µi.
• Each component generates data from a Gaussian with mean µi and covariance matrix Σi.
Assume that each datapoint is generated according to the following recipe:
1. Pick a component at random. Choose component i with probability P(ωi).
2. Datapoint ~ N(µi, Σi)
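The two-step generative recipe can be sketched directly. A minimal sketch assuming spherical covariance σ²I; the component names, priors and means are illustrative:

```python
import random

def sample_gmm(n, priors, means, sigma=1.0, dim=2, seed=0):
    """Draw n points from a mixture, following the two-step recipe:
    1. pick component i with probability P(w_i);
    2. draw the point from N(mu_i, sigma^2 I)  (spherical covariance here).
    """
    rng = random.Random(seed)
    labels = list(priors)
    weights = [priors[c] for c in labels]
    points = []
    for _ in range(n):
        i = rng.choices(range(len(labels)), weights=weights)[0]            # step 1
        mu = means[labels[i]]
        points.append(tuple(rng.gauss(mu[d], sigma) for d in range(dim)))  # step 2
    return points

pts = sample_gmm(5, {"w1": 0.3, "w2": 0.7}, {"w1": (0.0, 0.0), "w2": (5.0, 5.0)})
```

For the general GMM one would replace step 2 with a draw from N(µi, Σi) using a per-component covariance.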
Unsupervised Learning: not as hard as it looks
Sometimes easy, sometimes impossible, and sometimes in between.
(In case you're wondering what these diagrams are, they show 2-d unlabeled data (x vectors) distributed in 2-d space. The top one has three very clear Gaussian centers.)
Computing likelihoods in unsupervised case
We have x1, x2, … xN.
We know P(w1), P(w2), .., P(wk). We know σ.
P(x|wi, µ1, … µk) = Prob that an observation from class wi would have value x given class means µ1 … µk.
Can we write an expression for that?
Likelihoods in unsupervised case
We have x1, x2, … xn. We have P(w1) .. P(wk). We have σ. We can define, for any x, P(x|wi, µ1, µ2 .. µk).
Can we define P(x | µ1, µ2 .. µk)?
Can we define P(x1, x2, .. xn | µ1, µ2 .. µk)?
[Yes, if we assume the xi's were drawn independently.]
Unsupervised Learning: Mediumly Good News
We now have a procedure s.t. if you give me a guess at µ1, µ2 .. µk, I can tell you the prob of the unlabeled data given those µ's.
Suppose x's are 1-dimensional. There are two classes: w1 and w2.
P(w1) = 1/3, P(w2) = 2/3, σ = 1.
There are 25 unlabeled datapoints:
x1 = 0.608
x2 = -1.590
x3 = 0.235
x4 = 3.949
:
x25 = -0.712
(From Duda and Hart)
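With the setup above (known priors and σ), any guess at the means can already be scored, since log P(x1..xn | µ1..µk) = Σi log Σj P(wj) N(xi | µj, σ²). A sketch, using only the five datapoints listed above (the remaining twenty are not reproduced here), comparing a good guess against a plainly bad one:

```python
import math

def log_likelihood(data, mus, priors, sigma=1.0):
    """log P(x1..xn | mu1..muk) for a 1-D mixture with known priors P(w_j)
    and known shared sigma, assuming the points were drawn independently."""
    total = 0.0
    for x in data:
        mix = sum(p * math.exp(-0.5 * ((x - mu) / sigma) ** 2)
                  / (sigma * math.sqrt(2 * math.pi))
                  for mu, p in zip(mus, priors))
        total += math.log(mix)
    return total

# The five datapoints printed on the slide; P(w1) = 1/3, P(w2) = 2/3, sigma = 1.
listed_points = [0.608, -1.590, 0.235, 3.949, -0.712]
good = log_likelihood(listed_points, [-2.13, 1.668], [1 / 3, 2 / 3])
bad = log_likelihood(listed_points, [10.0, 12.0], [1 / 3, 2 / 3])
```

Means near the data give a much higher log-likelihood than means far from it, which is exactly the surface graphed on the next slide.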
Graph of log P(x1, x2 .. x25 | µ1, µ2) against µ1 (→) and µ2 (↑).
Max likelihood = (µ1 = -2.13, µ2 = 1.668)
There is also a local optimum, very close to the global one, at (µ1 = 2.085, µ2 = -1.257)*
* corresponds to switching w1 ↔ w2.
Duda & Hart's Example
Duda & Hart's Example
We can graph the prob. dist. function of data given our µ1 and µ2 estimates.
We can also graph the true function from which the data was randomly generated.
• They are close. Good.
• The 2nd solution tries to put the “2/3” hump where the “1/3” hump should go, and vice versa.
• In this example unsupervised is almost as good as supervised. If the x1 .. x25 are given the class which was used to learn them, then the results are (µ1=-2.176, µ2=1.684). Unsupervised got (µ1=-2.13, µ2=1.668).
Finding the max likelihood µ1, µ2 .. µk
We can compute P(data | µ1, µ2 .. µk). How do we find the µi's which give max likelihood?
• The normal max likelihood trick: set ∂ log Prob(….) / ∂µi = 0 and solve for the µi's. Here you get non-linear, non-analytically-solvable equations.
• Use gradient descent. Slow but doable.
• Use a much faster, cuter, and recently very popular method…
Expectation Maximization
The E.M. Algorithm
• We'll get back to unsupervised learning soon.
• But now we'll look at an even simpler case with hidden information.
• The EM algorithm:
  - Can do trivial things, such as the contents of the next few slides.
  - An excellent way of doing our unsupervised learning problem, as we'll see.
  - Many, many other uses, including inference of Hidden Markov Models (future lecture).
DETOUR
Silly Example
Let events be "grades in a class":
w1 = Gets an A, P(A) = ½
w2 = Gets a B, P(B) = µ
w3 = Gets a C, P(C) = 2µ
w4 = Gets a D, P(D) = ½-3µ
(Note 0 ≤ µ ≤ 1/6.)
Assume we want to estimate µ from data. In a given class there were a A's, b B's, c C's, d D's.
What's the maximum likelihood estimate of µ given a, b, c, d?
Trivial Statistics
P(A) = ½, P(B) = µ, P(C) = 2µ, P(D) = ½-3µ
\[ P(a,b,c,d \mid \mu) = K\,(1/2)^a\,\mu^b\,(2\mu)^c\,(1/2 - 3\mu)^d \]
\[ \log P(a,b,c,d \mid \mu) = \log K + a\log\tfrac12 + b\log\mu + c\log 2\mu + d\log(\tfrac12 - 3\mu) \]
FOR MAX LIKE µ, SET ∂LogP/∂µ = 0:
\[ \frac{\partial \log P}{\partial \mu} = \frac{b}{\mu} + \frac{2c}{2\mu} - \frac{3d}{1/2 - 3\mu} = 0 \]
Gives max like
\[ \tilde\mu = \frac{b+c}{6\,(b+c+d)} \]
So if class got A = 14, B = 6, C = 9, D = 10, then max like µ = 1/10.
Boring, but true!
Same Problem with Hidden Information
Someone tells us that
Number of High grades (A's + B's) = h
Number of C's = c
Number of D's = d
What is the max. like estimate of µ now?
REMEMBER: P(A) = ½, P(B) = µ, P(C) = 2µ, P(D) = ½-3µ
Same Problem with Hidden Information
Someone tells us that Number of High grades (A's + B's) = h, Number of C's = c, Number of D's = d. What is the max. like estimate of µ now?
We can answer this question circularly:
EXPECTATION: If we know the value of µ we could compute the expected values of a and b, since the ratio a : b should be the same as the ratio ½ : µ:
\[ a = \frac{1/2}{1/2 + \mu}\,h \qquad b = \frac{\mu}{1/2 + \mu}\,h \]
MAXIMIZATION: If we know the expected values of a and b we could compute the maximum likelihood value of µ:
\[ \tilde\mu = \frac{b+c}{6\,(b+c+d)} \]
REMEMBER: P(A) = ½, P(B) = µ, P(C) = 2µ, P(D) = ½-3µ
E.M. for our Trivial Problem
We begin with a guess for µ. We iterate between EXPECTATION and MAXIMIZATION to improve our estimates of µ and a and b.
Define µ(t) = the estimate of µ on the t'th iteration, b(t) = the estimate of b on the t'th iteration.
\[ \mu(0) = \text{initial guess} \]
E-step:
\[ b(t) = \frac{\mu(t)\,h}{1/2 + \mu(t)} = \mathrm{E}[b \mid \mu(t)] \]
M-step:
\[ \mu(t{+}1) = \frac{b(t) + c}{6\,(b(t) + c + d)} = \text{max like est of } \mu \text{ given } b(t) \]
REMEMBER: P(A) = ½, P(B) = µ, P(C) = 2µ, P(D) = ½-3µ
Continue iterating until converged.
Good news: Converging to local optimum is assured.
Bad news: I said "local" optimum.
E.M. Convergence
• Convergence proof based on fact that Prob(data | µ) must increase or remain same between each iteration [NOT OBVIOUS]
• But it can never exceed 1 [OBVIOUS]
So it must therefore converge [OBVIOUS]

In our example, suppose we had h = 20, c = 10, d = 10, µ(0) = 0:

t | µ(t)   | b(t)
--|--------|------
0 | 0      | 0
1 | 0.0833 | 2.857
2 | 0.0937 | 3.158
3 | 0.0947 | 3.185
4 | 0.0948 | 3.187
5 | 0.0948 | 3.187
6 | 0.0948 | 3.187

Convergence is generally linear: error decreases by a constant factor each time step.
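The E-step/M-step recursion that produces this table can be checked in a few lines (h, c, d and µ(0) as in the example above):

```python
def em_grades(h, c, d, mu0=0.0, iters=20):
    """EM for the grades problem: P(A)=1/2, P(B)=mu, P(C)=2mu, P(D)=1/2-3mu,
    where only h = (#A + #B), c and d are observed."""
    mu = mu0
    b = 0.0
    for _ in range(iters):
        b = mu / (0.5 + mu) * h              # E-step: b(t) = E[b | mu(t)]
        mu = (b + c) / (6.0 * (b + c + d))   # M-step: max-like mu given b(t)
    return mu, b

mu, b = em_grades(h=20, c=10, d=10, mu0=0.0)   # the table's setup
```

After a handful of iterations the estimates settle at µ ≈ 0.0948 and b ≈ 3.187, matching the table.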
Back to Unsupervised Learning of GMMs
Remember:
We have unlabeled data x1 x2 … xR
We know there are k classes
We know P(w1) P(w2) P(w3) … P(wk)
We don't know µ1 µ2 .. µk
We can write P(data | µ1…µk):
\[
p(x_1 \ldots x_R \mid \mu_1 \ldots \mu_k)
= \prod_{i=1}^{R} p(x_i \mid \mu_1 \ldots \mu_k)
= \prod_{i=1}^{R} \sum_{j=1}^{k} p(x_i \mid w_j, \mu_1 \ldots \mu_k)\, P(w_j)
= \prod_{i=1}^{R} \sum_{j=1}^{k} K \exp\!\left( -\frac{1}{2\sigma^2}\,(x_i - \mu_j)^2 \right) P(w_j)
\]
E.M. for GMMs
For Max likelihood we know
\[ \frac{\partial}{\partial \mu_j} \log \mathrm{Prob}(\text{data} \mid \mu_1 \ldots \mu_k) = 0 \]
Some wild'n'crazy algebra turns this into: "For Max likelihood, for each j,
\[ \mu_j = \frac{\sum_{i=1}^{R} P(w_j \mid x_i, \mu_1 \ldots \mu_k)\, x_i}{\sum_{i=1}^{R} P(w_j \mid x_i, \mu_1 \ldots \mu_k)} \]
This is n nonlinear equations in the µj's."
If, for each xi, we knew the prob that xi was in class wj, i.e. P(wj|xi, µ1…µk), then… we would easily compute µj.
If we knew each µj then we could easily compute P(wj|xi, µ1…µk) for each wj and xi.
…I feel an EM experience coming on!!
See http://www.cs.cmu.edu/~awm/doc/gmm-algebra.pdf
E.M. for GMMs
Iterate. On the t'th iteration let our estimates be λt = { µ1(t), µ2(t) … µc(t) }

E-step: Compute "expected" classes of all datapoints for each class:
\[
P(w_i \mid x_k, \lambda_t)
= \frac{p(x_k \mid w_i, \lambda_t)\, P(w_i \mid \lambda_t)}{p(x_k \mid \lambda_t)}
= \frac{p(x_k \mid w_i, \mu_i(t), \sigma^2 I)\, p_i(t)}{\sum_{j=1}^{c} p(x_k \mid w_j, \mu_j(t), \sigma^2 I)\, p_j(t)}
\]
(Just evaluate a Gaussian at xk.)

M-step: Compute Max. like µ given our data's class membership distributions:
\[
\mu_i(t{+}1) = \frac{\sum_k P(w_i \mid x_k, \lambda_t)\, x_k}{\sum_k P(w_i \mid x_k, \lambda_t)}
\]
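A minimal 1-D sketch of these two updates, with the P(wj)'s and σ known and only the means re-estimated; the datapoints and starting means are illustrative:

```python
import math

def em_means(data, priors, mus0, sigma=1.0, iters=50):
    """EM for a 1-D mixture where the priors P(w_j) and sigma are known and
    only the means mu_j are re-estimated (the setting of this slide, in 1-D)."""
    mus = list(mus0)
    k = len(mus)
    for _ in range(iters):
        # E-step: responsibilities P(w_j | x_i, mu_1..mu_k).
        # The (2*pi*sigma^2)^(-1/2) factor cancels in the normalization.
        resp = []
        for x in data:
            pj = [priors[j] * math.exp(-0.5 * ((x - mus[j]) / sigma) ** 2)
                  for j in range(k)]
            s = sum(pj)
            resp.append([p / s for p in pj])
        # M-step: mu_j = sum_i P(w_j|x_i) x_i / sum_i P(w_j|x_i)
        for j in range(k):
            den = sum(resp[i][j] for i in range(len(data)))
            mus[j] = sum(resp[i][j] * x for i, x in enumerate(data)) / den
    return mus

data = [-2.1, -1.9, -2.3, 1.8, 2.2, 1.9, 2.1]
mus = em_means(data, priors=[0.5, 0.5], mus0=[-1.0, 1.0])
```

With two well-separated clumps the means converge to roughly the per-clump averages, near -2 and +2 here.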
E.M. Convergence
• This algorithm is REALLY USED. And in high dimensional state spaces, too. E.G. Vector Quantization for Speech Data
• Your lecturer will (unless out of time) give you a nice intuitive explanation of why this rule works.
• As with all EM procedures, convergence to a local optimum guaranteed.
E.M. for General GMMs
Iterate. On the t'th iteration let our estimates be λt = { µ1(t), µ2(t) … µc(t), Σ1(t), Σ2(t) … Σc(t), p1(t), p2(t) … pc(t) }
(pi(t) is shorthand for the estimate of P(ωi) on the t'th iteration.)

E-step: Compute "expected" classes of all datapoints for each class:
\[
P(w_i \mid x_k, \lambda_t)
= \frac{p(x_k \mid w_i, \mu_i(t), \Sigma_i(t))\, p_i(t)}{\sum_{j=1}^{c} p(x_k \mid w_j, \mu_j(t), \Sigma_j(t))\, p_j(t)}
\]
(Just evaluate a Gaussian at xk.)

M-step: Compute Max. like µ given our data's class membership distributions:
\[
\mu_i(t{+}1) = \frac{\sum_k P(w_i \mid x_k, \lambda_t)\, x_k}{\sum_k P(w_i \mid x_k, \lambda_t)}
\]
\[
\Sigma_i(t{+}1) = \frac{\sum_k P(w_i \mid x_k, \lambda_t)\,\big[x_k - \mu_i(t{+}1)\big]\big[x_k - \mu_i(t{+}1)\big]^T}{\sum_k P(w_i \mid x_k, \lambda_t)}
\]
\[
p_i(t{+}1) = \frac{\sum_k P(w_i \mid x_k, \lambda_t)}{R} \qquad (R = \#\text{records})
\]
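A 1-D sketch of the general updates, re-estimating means, variances and mixing weights pi each iteration. The datapoints and the spread-out initialization are illustrative choices, not from the slides:

```python
import math

def em_general(data, k, iters=100):
    """1-D version of the general-GMM updates: each iteration re-estimates
    the means mu_i, the variances, and the mixing weights p_i."""
    R = len(data)
    xs = sorted(data)
    mus = [xs[i * (R - 1) // max(k - 1, 1)] for i in range(k)]  # spread-out init
    variances = [1.0] * k
    ps = [1.0 / k] * k
    for _ in range(iters):
        # E-step: P(w_i | x_k, lambda_t) -- just evaluate a Gaussian at each point.
        resp = []
        for x in data:
            pj = [ps[j] * math.exp(-0.5 * (x - mus[j]) ** 2 / variances[j])
                  / math.sqrt(2 * math.pi * variances[j]) for j in range(k)]
            s = sum(pj)
            resp.append([p / s for p in pj])
        # M-step: weighted mean, weighted variance, and p_i = sum_k P(...) / R.
        for j in range(k):
            wj = sum(resp[i][j] for i in range(R))
            mus[j] = sum(resp[i][j] * data[i] for i in range(R)) / wj
            variances[j] = max(sum(resp[i][j] * (data[i] - mus[j]) ** 2
                                   for i in range(R)) / wj, 1e-6)  # floor for stability
            ps[j] = wj / R
    return mus, variances, ps

mus, variances, ps = em_general([-2.2, -1.8, -2.0, 3.9, 4.1, 4.0], k=2)
```

The variance floor guards against a component collapsing onto a single point, a well-known failure mode of maximum-likelihood GMM fitting.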
Gaussian Mixture Example: Start
Advance apologies: in black and white this example will be incomprehensible.
After first iteration
After 2nd iteration
After 3rd iteration
After 4th iteration
After 5th iteration
After 6th iteration
After 20th iteration
Some Bio Assay data
GMM clustering of the assay data
Resulting Density Estimator
Where are we now?
Inputs → Classifier (predict category): Dec Tree, Sigmoid Perceptron, Sigmoid N.Net, Gauss/Joint BC, Gauss Naïve BC, N.Neigh, Bayes Net Based BC, Cascade Correlation
Inputs → Density Estimator (probability): Joint DE, Naïve DE, Gauss/Joint DE, Gauss Naïve DE, Bayes Net Structure Learning, GMMs
Inputs → Regressor (predict real no.): Linear Regression, Polynomial Regression, Perceptron, Neural Net, N.Neigh, Kernel, LWR, RBFs, Robust Regression, Cascade Correlation, Regression Trees, GMDH, Multilinear Interp, MARS
Inputs → Inference Engine (learn P(E1|E2)): Joint DE, Bayes Net Structure Learning
The old trick…
Inputs → Classifier (predict category): Dec Tree, Sigmoid Perceptron, Sigmoid N.Net, Gauss/Joint BC, Gauss Naïve BC, N.Neigh, Bayes Net Based BC, Cascade Correlation, GMM-BC
Inputs → Density Estimator (probability): Joint DE, Naïve DE, Gauss/Joint DE, Gauss Naïve DE, Bayes Net Structure Learning, GMMs
Inputs → Regressor (predict real no.): Linear Regression, Polynomial Regression, Perceptron, Neural Net, N.Neigh, Kernel, LWR, RBFs, Robust Regression, Cascade Correlation, Regression Trees, GMDH, Multilinear Interp, MARS
Inputs → Inference Engine (learn P(E1|E2)): Joint DE, Bayes Net Structure Learning
Three classes of assay (each learned with its own mixture model). (Sorry, this will again be semi-useless in black and white.)
Resulting Bayes Classifier
Resulting Bayes Classifier, using posterior probabilities to alert about ambiguity and anomalousness
Yellow means anomalous
Cyan means ambiguous
Unsupervised learning with symbolic attributes
It's just a "learning Bayes net with known structure but hidden values" problem. Can use Gradient Descent.
EASY, fun exercise to do an EM formulation for this case too.
(Figure: a small Bayes net over # KIDS, MARRIED and NATION, with missing values.)
Final Comments
• Remember, E.M. can get stuck in local minima, and empirically it DOES.
• Our unsupervised learning example assumed P(wi)'s known, and variances fixed and known. Easy to relax this.
• It's possible to do Bayesian unsupervised learning instead of max. likelihood.
• There are other algorithms for unsupervised learning. We'll visit K-means soon. Hierarchical clustering is also interesting.
• Neural-net algorithms called "competitive learning" turn out to have interesting parallels with the EM method we saw.
What you should know
• How to "learn" maximum likelihood parameters (locally max. like.) in the case of unlabeled data.
• Be happy with this kind of probabilistic analysis.
• Understand the two examples of E.M. given in these notes.
For more info, see Duda + Hart. It's a great book. There's much more in the book than in your handout.
Other unsupervised learning methods
• K-means (see next lecture)
• Hierarchical clustering (e.g. Minimum spanning trees) (see next lecture)
• Principal Component Analysis: simple, useful tool
• Non-linear PCA: Neural Auto-Associators, Locally weighted PCA, Others…