Clustering
Machine Learning Workshop
UC Berkeley
Junming Yin
August 24, 2007
Outline
• Introduction
• Unsupervised learning
• What is clustering? Applications
• Dissimilarity (similarity) of samples
• Clustering algorithms
• K-means
• Gaussian mixture model (GMM), EM
• Hierarchical clustering
• Spectral clustering
Unsupervised Learning
• Recall that in the setting of classification and regression, the training data are represented as $\{(x_i, y_i)\}_{i=1}^{n}$, and the goal is to predict $y$ given a new point $x$; these settings are called supervised learning
• In the unsupervised setting, we only have unlabelled data $\{x_i\}_{i=1}^{n}$
• Why bother with unsupervised learning? Can we learn anything of value from unlabelled samples?
Unsupervised Learning
• Most of the time, collecting raw data is cheap, but labeling it can be very costly
• The samples might lie in a high-dimensional space; we might find some low-dimensional features that are sufficient to describe the samples
• In the early stages of an investigation it may be valuable to perform exploratory data analysis and thus gain some insight into the nature or structure of the data
• Clustering is one such unsupervised learning method
What is Clustering?
• Roughly speaking, cluster analysis aims to discover distinct clusters or groups of samples such that samples within the same group are more similar to each other than they are to the members of other groups
• This requires:
• a dissimilarity (similarity) function between samples
• a criterion to evaluate a grouping of samples into clusters
• an algorithm that optimizes this criterion function
Applications of Clustering
• Image segmentation: decompose the image into regions with
coherent color and texture inside them
• Search result clustering: group the search result set and
provide a better user interface
• Computational biology: group homologous protein sequences
into families; gene expression data analysis
• Signal processing: compress the signal using a codebook derived from vector quantization (VQ)
Outline
• Introduction
• Unsupervised learning
• What is clustering? Applications
• Dissimilarity (similarity) of samples
• Clustering algorithms
• K-means
• Gaussian mixture model (GMM), EM
• Hierarchical clustering
• Spectral clustering
Dissimilarity of Samples
• The natural question now is: how should we measure the dissimilarity between samples?
• the results depend on the choice of dissimilarity
• usually based on subject-matter considerations
• not necessarily a metric (i.e., the triangle inequality need not hold)
• possible to learn the dissimilarity from data for a particular application (later)
Dissimilarity Based on Features
• Most of the time, data have measurements on features: $x_i \in \mathbb{R}^p$
• A common choice of dissimilarity function between samples is the Euclidean distance: $d(x_i, x_j) = \|x_i - x_j\| = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2}$
• Clusters defined by the Euclidean distance are invariant to translations and rotations in feature space, but not invariant to scaling of the features
• You might consider standardizing the data: translate and scale the features so that all of the features have zero mean and unit variance (a small sketch follows)
• BE CAREFUL! This is not always desirable
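A minimal sketch of this effect, assuming NumPy is available; the toy data and names are illustrative, not from the slides:

```python
import numpy as np

# Toy data: the second feature has a much larger scale than the first.
X = np.array([[1.0, 200.0],
              [2.0, 180.0],
              [1.5, 900.0]])

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

# Raw distances are dominated by the large-scale feature.
print(euclidean(X[0], X[1]), euclidean(X[0], X[2]))

# Standardize: translate and scale so each feature has zero mean and unit variance.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
print(euclidean(Z[0], Z[1]), euclidean(Z[0], Z[2]))
```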
Case Studies
Simulated data, 2-means without standardization
Simulated data, 2-means with standardization
Learning Dissimilarity
• Suppose a user indicates that certain pairs of objects are considered similar: $S = \{(x_i, x_j) : x_i \text{ and } x_j \text{ are similar}\}$
• Consider learning a dissimilarity that respects this subjective information: $d_A(x, y) = \|x - y\|_A = \sqrt{(x - y)^{\top} A \,(x - y)}$
• If $A$ is the identity matrix, this reduces to the Euclidean distance
• In general, $A$ parameterizes a family of Mahalanobis distances
• Learning such a dissimilarity is equivalent to finding a rescaling of the data, replacing $x$ with $A^{1/2} x$, and then applying the standard Euclidean distance (a sketch follows)
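A hedged sketch of this dissimilarity and of the equivalent rescaling view, assuming NumPy; the matrix A below is hand-picked for illustration rather than learned:

```python
import numpy as np

def d_A(x, y, A):
    """Mahalanobis-style dissimilarity sqrt((x - y)^T A (x - y))."""
    diff = x - y
    return np.sqrt(diff @ A @ diff)

# A hand-picked positive semidefinite matrix standing in for a learned A.
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
x, y = np.array([1.0, 0.0]), np.array([0.0, 1.0])

# Equivalent view: rescale the data by A^{1/2} and use plain Euclidean distance.
w, V = np.linalg.eigh(A)
A_sqrt = V @ np.diag(np.sqrt(w)) @ V.T
print(d_A(x, y, A), np.linalg.norm(A_sqrt @ x - A_sqrt @ y))  # the two values agree
```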
Learning Dissimilarity
• A simple way to define a criterion for the desired dissimilarity (with $S$ the set of similar pairs and $D$ a set of dissimilar pairs):
$$\min_{A} \; \sum_{(x_i, x_j) \in S} \|x_i - x_j\|_A^2 \quad \text{s.t.} \quad \sum_{(x_i, x_j) \in D} \|x_i - x_j\|_A \ge 1, \quad A \succeq 0$$
• A convex optimization problem, which can be solved by gradient descent and iterative projection
• For details, see [Xing, Ng, Jordan, Russell '03]
Learning Dissimilarity
Outline
• Introduction
• Unsupervised learning
• What is clustering? Applications
• Dissimilarity (similarity) of samples
• Clustering algorithms
• K-means
• Gaussian mixture model (GMM), EM
• Hierarchical clustering
• Spectral clustering
K-means: Idea
• Represent the data set in terms of K clusters, each of which is summarized by a prototype $\mu_k$
• Each data point is assigned to one of the K clusters
• The assignment is represented by responsibilities $r_{ik} \in \{0, 1\}$ such that $\sum_{k} r_{ik} = 1$ for all data indices $i$
• Example: 4 data points and 3 clusters (see the illustrative matrix below)
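One possible hard assignment, shown only for illustration (the original example matrix is not reproduced here): each row of $r$ corresponds to a data point, each column to a cluster, and each row contains exactly one 1.

$$r = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$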
K-means: Idea
• Criterion function: the sum-of-squared distances from each data point to its assigned prototype,
$$J = \sum_{i=1}^{n} \sum_{k=1}^{K} r_{ik} \, \|x_i - \mu_k\|^2$$
where the $\mu_k$ are the prototypes, the $r_{ik}$ are the responsibilities, and the $x_i$ are the data
Minimizing the Cost Function
• Chicken-and-egg problem; we have to resort to an iterative method
• E-step: fix the prototypes $\mu_k$ and minimize $J$ w.r.t. the responsibilities $r_{ik}$
• assigns each data point to its nearest prototype
• M-step: fix the responsibilities $r_{ik}$ and minimize $J$ w.r.t. the prototypes $\mu_k$; this gives
$$\mu_k = \frac{\sum_i r_{ik} \, x_i}{\sum_i r_{ik}}$$
• each prototype is set to the mean of the points in that cluster
• Convergence is guaranteed since there are only a finite number of possible settings for the responsibilities
• It can only find a local minimum, so we should start the algorithm with many different initial settings (a minimal sketch follows)
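A minimal K-means sketch that follows the two steps above, assuming NumPy; the function and variable names are illustrative:

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize the prototypes with K randomly chosen data points.
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # E-step: assign each point to its nearest prototype.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (n, K) squared distances
        assign = d2.argmin(axis=1)
        # M-step: move each prototype to the mean of its assigned points.
        new_mu = np.array([X[assign == k].mean(axis=0) if np.any(assign == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    J = ((X - mu[assign]) ** 2).sum()   # sum-of-squared-distances criterion
    return mu, assign, J
```

In practice one would run this from several random initializations and keep the solution with the smallest J, as the slide suggests.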
How to Choose K?
• In some cases it is known a priori from the problem domain
• Generally, it has to be estimated from the data and is usually selected by some heuristic in practice
• The cost function J generally decreases with increasing K
• Idea: assume that K* is the right number
• We assume that for K < K* each estimated cluster contains a subset of the true underlying groups
• For K > K* some natural groups must be split
• Thus we expect that for K < K* the cost function falls substantially, and afterwards not much more (a sketch of this elbow heuristic follows)
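A small sketch of this elbow heuristic, assuming scikit-learn is available; the three-group synthetic data set is made up for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Three well-separated synthetic groups, so the "right" K is 3.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in [(0, 0), (3, 0), (0, 3)]])

for K in range(1, 8):
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
    print(K, round(km.inertia_, 2))   # the cost J falls sharply up to K = 3, then flattens
```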
Limitations of K-means
• Hard assignments of data points to clusters
• Small shift of a data point can flip it to a different cluster
• Solution: replace hard clustering of K-means with soft
probabilistic assignments (GMM)
• Hard to choose the value of K
• As K is increased, the cluster memberships can change in an arbitrary way; the resulting clusters are not necessarily nested
• Solution: hierarchical clustering
The Gaussian Distribution
• Multivariate Gaussian, with mean $\mu$ and covariance $\Sigma$:
$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x - \mu)^{\top} \Sigma^{-1} (x - \mu)\right)$$
• Maximum likelihood estimation:
$$\mu_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} x_n, \qquad \Sigma_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{\mathrm{ML}})(x_n - \mu_{\mathrm{ML}})^{\top}$$
Gaussian Mixture
• Linear combination of Gaussians:
$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \text{where} \quad \sum_{k=1}^{K} \pi_k = 1, \quad 0 \le \pi_k \le 1$$
• Parameters to be estimated: $\pi_k$, $\mu_k$, $\Sigma_k$ for $k = 1, \dots, K$
[Figure (a): samples drawn from a Gaussian mixture in 2D]
Gaussian Mixture
[Figure (a): samples drawn from a Gaussian mixture in 2D]
• To generate a data point:
• first pick one of the components with probability $\pi_k$
• then draw a sample from that component distribution
• Each data point $x_n$ is generated by one of the K components; a latent variable $z_n$ is associated with each $x_n$ (a small sampling sketch follows)
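A small sketch of this generative process, assuming NumPy; the mixture parameters below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
pi   = np.array([0.5, 0.3, 0.2])                        # mixing coefficients (sum to 1)
mus  = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])   # component means
covs = np.array([np.eye(2) * 0.2 for _ in range(3)])    # component covariances

def sample_gmm(n):
    # Latent variables: pick a component for each point with probability pi_k ...
    z = rng.choice(len(pi), size=n, p=pi)
    # ... then draw each point from its component's Gaussian.
    X = np.array([rng.multivariate_normal(mus[k], covs[k]) for k in z])
    return X, z

X, z = sample_gmm(500)
```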
Synthetic Data Set Without Colours
[Figure (b): the same synthetic data set, shown without the component colours]
Fitting the Gaussian Mixture
• Given the complete data set $\{(x_n, z_n)\}_{n=1}^{N}$, maximize the complete log likelihood
$$\log p(X, Z \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \left[ \log \pi_k + \log \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right]$$
• trivial closed-form solution: fit each component to the corresponding set of data points
• Without knowing the values of the latent variables, we have to maximize the incomplete log likelihood:
$$\log p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$$
• The sum over components appears inside the logarithm, so there is no closed-form solution
EM Algorithm
• E-step: for given parameter values we can compute the expected values of the latent variables (the responsibilities of the data points) via Bayes' rule:
$$\gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$$
• Note that $\gamma(z_{nk}) \in [0, 1]$ instead of $\{0, 1\}$, but we still have $\sum_{k} \gamma(z_{nk}) = 1$
EM Algorithm
• M-step: maximize the expected complete log likelihood w.r.t. the parameters
• update parameters (with $N_k = \sum_{n} \gamma(z_{nk})$):
$$\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, x_n, \qquad \Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (x_n - \mu_k)(x_n - \mu_k)^{\top}, \qquad \pi_k = \frac{N_k}{N}$$
EM Algorithm
• Iterate the E-step and M-step until the log likelihood of the data no longer increases (a condensed sketch follows)
• converges to a local optimum
• need to restart the algorithm with different initial guesses of the parameters (as in K-means)
• Relation to K-means
• Consider a GMM with common covariance $\Sigma_k = \sigma^2 I$
• As $\sigma^2 \to 0$, the two methods coincide
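A condensed EM sketch mirroring the E-step and M-step above, assuming NumPy; this is a teaching sketch (only a small covariance jitter, no log-likelihood monitoring), not a robust implementation:

```python
import numpy as np

def gaussian_pdf(X, mu, cov):
    d = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(cov)
    norm = np.sqrt(((2 * np.pi) ** d) * np.linalg.det(cov))
    quad = np.einsum('nd,de,ne->n', diff, inv, diff)   # (x - mu)^T cov^{-1} (x - mu)
    return np.exp(-0.5 * quad) / norm

def em_gmm(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(n, size=K, replace=False)]
    cov = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    for _ in range(n_iter):
        # E-step: responsibilities gamma_{nk} via Bayes' rule.
        dens = np.stack([pi[k] * gaussian_pdf(X, mu[k], cov[k]) for k in range(K)], axis=1)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate pi, mu, cov from the responsibility-weighted data.
        Nk = gamma.sum(axis=0)
        pi = Nk / n
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            cov[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return pi, mu, cov, gamma
```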
Hierarchical Clustering
• Organize the clusters in a hierarchical way
• Produces a rooted (binary) tree (dendrogram)
[Figure: dendrogram over points a, b, c, d, e — read left to right (steps 0–4) for agglomerative clustering, right to left for divisive]
Hierarchical Clustering
• Two kinds of strategy
• Bottom-up (agglomerative): recursively merge the two groups with the smallest between-cluster dissimilarity (defined later on)
• Top-down (divisive): in each step, split the least coherent cluster (e.g., the one with the largest diameter); splitting a cluster is itself a clustering problem (usually done in a greedy way); less popular than the bottom-up approach
Hierarchical Clustering
• User can choose a cut through the hierarchy to
represent the most natural division into clusters
• e.g., choose the cut where the intergroup dissimilarity exceeds some threshold
[Figure: the same dendrogram; cutting it at different heights yields, e.g., 3 or 2 clusters]
Hierarchical Clustering
• We have to measure the dissimilarity $d(G, H)$ between two disjoint groups G and H; it is computed from the pairwise dissimilarities $d(i, j)$ with $i \in G$, $j \in H$
• Single linkage: $d_{SL}(G, H) = \min_{i \in G,\, j \in H} d(i, j)$; tends to yield extended clusters
• Complete linkage: $d_{CL}(G, H) = \max_{i \in G,\, j \in H} d(i, j)$; tends to yield compact, round clusters
• Group average: $d_{GA}(G, H) = \frac{1}{|G|\,|H|} \sum_{i \in G} \sum_{j \in H} d(i, j)$; a tradeoff between the two (a SciPy sketch follows)
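A short sketch of agglomerative clustering with these three linkages, assuming SciPy and NumPy are available; the two-blob data set is synthetic:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(30, 2)),
               rng.normal(3.0, 0.3, size=(30, 2))])     # two synthetic groups

for method in ['single', 'complete', 'average']:        # single / complete / group-average linkage
    Z = linkage(X, method=method)                       # bottom-up merge tree (dendrogram)
    labels = fcluster(Z, t=2, criterion='maxclust')     # cut the tree into 2 clusters
    print(method, np.bincount(labels)[1:])               # cluster sizes
```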
Example: Human Tumor Microarray Data
• A 6830 × 64 matrix of real numbers
• Rows correspond to genes, columns to tissue samples
• Clustering rows (genes): we can deduce the functions of unknown genes from known genes with similar expression profiles
• Clustering columns (samples): we can identify disease profiles, since tissues with similar diseases should yield similar expression profiles
[Figure: gene expression matrix]
Example: Human Tumor Microarray Data
• A 6830 × 64 matrix of real numbers
• Group-average (GA) clustering of the microarray data
• applied separately to rows and columns
• subtrees with tighter clusters placed on the left
• produces a more informative picture of genes and samples than the randomly ordered rows and columns
Example: Two circles
Spectral Clustering [Ng, Jordan, and Weiss '01]
• Given a set of points $x_1, \dots, x_n$, we'd like to cluster them into k clusters
• Form an affinity matrix A where $A_{ij} = \exp(-\|x_i - x_j\|^2 / 2\sigma^2)$ for $i \neq j$ and $A_{ii} = 0$
• Define $D$ to be the diagonal matrix with $D_{ii} = \sum_j A_{ij}$, and $L = D^{-1/2} A D^{-1/2}$
• Find the k largest eigenvectors of L and stack them columnwise to obtain $X \in \mathbb{R}^{n \times k}$
• Form the matrix Y by normalizing each row of X to have unit length
• Think of the n rows of Y as a new representation of the original n data points; cluster them into k clusters using K-means (a sketch follows)
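A sketch of this procedure, assuming NumPy and scikit-learn (for the final K-means step); the value of sigma is an illustrative choice and would normally be tuned:

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(X, k, sigma=0.1):
    # Affinity matrix: A_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)), with A_ii = 0.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    A = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)
    # Normalized affinity L = D^{-1/2} A D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    # Stack the k largest eigenvectors of L columnwise, normalize each row to unit length.
    _, vecs = np.linalg.eigh(L)          # eigenvalues are returned in ascending order
    X_hat = vecs[:, -k:]
    Y = X_hat / np.linalg.norm(X_hat, axis=1, keepdims=True)
    # Cluster the n rows of Y with K-means.
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Y)
```

On a two-circles data set like the example above, this representation lets K-means separate the two rings, which it typically cannot do on the raw coordinates.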
Example: Two circles
Had Fun?
Hope you enjoyed…