Clustering Introduction Preprocessing: dimensional reduction with SVD Clustering methods: K-means,...
Posted: 15-Jan-2016
Clustering
Introduction
Preprocessing: dimensional reduction with SVD
Clustering methods: K-means, FCM
Hierarchical methods
Model based methods (at the end)
Competitive NN (SOM) (not shown here)
SVC, QC
Applications
COMPACT
(an ill-defined problem)
What Is Clustering?
Why? To help understand the natural grouping or structure in a data set
When? Used either as a stand-alone tool to get insight into data distribution or as a preprocessing step for other algorithms, e.g., to discover
classes
Not Classification!!!!
Clustering is partitioning of data into meaningful (?) groups called clusters. A cluster is a collection of objects that are “similar” to one another. But what is similar? Unsupervised learning: no predefined classes.
Clustering Applications
Operations Research: Facility location problem: locate fire stations so as to minimize the maximum/average distance a fire truck must travel.
Signal Processing: Vector quantization: transmit large files (e.g., video, speech) by computing quantizers.
Astronomy: SkyCat: clustered 2x10^9 sky objects into stars, galaxies, quasars, etc., based on radiation emitted in different spectrum bands.
Clustering Applications
Marketing: Segmentation of customers for target marketing; segmentation of customers based on online clickstream data.
Web: To discover categories of content; search results.
Bioinformatics: Gene expression: finding groups of individuals (sick vs. healthy), finding groups of genes; motif search. …
In practice, clustering is one of the most widely used data mining techniques: association rule algorithms produce too many rules, and other machine learning algorithms require labeled data.
Points/Metric Space
Points could be in R^d, {0,1}^d, …
Metric Space: dist(x,y) is a distance metric if
Reflexive: dist(x,y) = 0 iff x = y
Symmetric: dist(x,y) = dist(y,x)
Triangle Inequality: dist(x,y) ≤ dist(x,z) + dist(z,y)
Example of Distance Metrics
The distance between x = <x1,…,xn> and y = <y1,…,yn> is:
L2 norm (Euclidean): dist(x,y) = sqrt((x1−y1)^2 + … + (xn−yn)^2)
Manhattan Distance (L1 norm): dist(x,y) = |x1−y1| + … + |xn−yn|
Documents: Cosine measure. A similarity: more similar -> close to 1, less similar -> close to 0.
Not a metric space, but 1−cos is.
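As a minimal sketch, the three measures above can be written directly in Python (the vectors x and y below are made up for illustration):

```python
import math

def l2(x, y):
    # Euclidean (L2) distance
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def l1(x, y):
    # Manhattan (L1) distance
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine_sim(x, y):
    # cosine similarity: near 1 for similar, near 0 for dissimilar vectors
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

x, y = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
print(l2(x, y))              # 5.0
print(l1(x, y))              # 7.0
print(1 - cosine_sim(x, y))  # 1 - cos, a dissimilarity
```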
Correlation
We might care more about the overall shape of expression profiles rather than the actual magnitudes
That is, we might want to consider genes similar when they are “up” and “down” together
When might we want this kind of measure? What experimental issues might make this appropriate?
Pearson Linear Correlation
We’re shifting the expression profiles down (subtracting the means) and scaling by the standard deviations (i.e., making the data have mean = 0 and std = 1)
ρ(x,y) = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) / sqrt( Σ_{i=1}^n (x_i − x̄)^2 · Σ_{i=1}^n (y_i − ȳ)^2 )
where x̄ = (1/n) Σ_{i=1}^n x_i and ȳ = (1/n) Σ_{i=1}^n y_i
Pearson Linear Correlation Pearson linear correlation (PLC) is a measure that is
invariant to scaling and shifting (vertically) of the expression values
Always between –1 and +1 (perfectly anti-correlated and perfectly correlated)
This is a similarity measure, but we can easily make it into a dissimilarity measure:
d_p(x,y) = (1 − ρ(x,y)) / 2
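The correlation and its derived dissimilarity can be sketched as follows (the two profiles below are made-up examples; y is a scaled copy of x, so PLC should be 1):

```python
import math

def pearson(x, y):
    # shift by the means and scale by the standard deviations
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def d_p(x, y):
    # dissimilarity derived from PLC: 0 when rho = +1, 1 when rho = -1
    return (1 - pearson(x, y)) / 2

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]   # same shape as x, different magnitude
print(pearson(x, y), d_p(x, y))
```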
PLC (cont.)
PLC only measures the degree of a linear relationship between two expression profiles!
If you want to measure other relationships, there are many other possible measures (see Jagota book and project #3 for more examples)
ρ = 0.0249, so d_p = (1 − 0.0249)/2 ≈ 0.4876
The green curve is the square of the blue curve – this relationship is not captured with PLC
More correlation examples
What do you think the correlation is here? Is this what we want?
How about here? Is this what we want?
Missing Values
A common problem w/ microarray data.
One approach with Euclidean distance or PLC is just to ignore missing values (i.e., pretend the data has fewer dimensions)
There are more sophisticated approaches that use information such as continuity of a time series or related genes to estimate missing values – better to use these if possible
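The simple "ignore missing values" approach can be sketched like this (None marks a missing value; the vectors are illustrative):

```python
import math

def euclidean_ignore_missing(x, y):
    # distance over only the dimensions where both values are present,
    # i.e., pretend the data has fewer dimensions
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if not pairs:
        raise ValueError("no overlapping dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in pairs))

x = [1.0, None, 3.0, 5.0]
y = [4.0, 2.0, None, 1.0]
print(euclidean_ignore_missing(x, y))  # uses dimensions 0 and 3 only -> 5.0
```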
Preprocessing
For methods that are not applicable in very high dimensions you may want to apply
- Dimensional reduction, e.g. consider the first few SVD components (truncate S at r dimensions) and use the remaining values of the U or V matrices
- Dimensional reduction + normalization: after applying dimensional reduction normalize all resulting vectors to unit length (i.e. consider angles as proximity measures)
- Feature selection, e.g. consider only features that have large variance. More on feature selection in the future.
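The first two preprocessing options can be sketched with NumPy (the random matrix X is stand-in data):

```python
import numpy as np

def svd_reduce(X, r, normalize=False):
    # Truncate the SVD at r dimensions: keep the first r singular
    # directions and represent each row of X by U_r * S_r.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Z = U[:, :r] * s[:r]
    if normalize:
        # normalize to unit length so that only angles
        # (cosine proximity) matter in the reduced space
        Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    return Z

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 8))   # stand-in data matrix
Z = svd_reduce(X, r=3, normalize=True)
print(Z.shape)   # (20, 3)
```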
Clustering Types
Exclusive vs. Overlapping Clustering Hierarchical vs. Global Clustering Formal vs. Heuristic Clustering
First two examples:
K-Means: exclusive, global, heuristic
FCM (fuzzy c-means): overlapping, global, heuristic
Two classes of data described by (o) and (*). The objective is to reproduce the two classes by K=2 clustering.
[Figure: Tiles data: o = whole tiles, * = cracked tiles, x = centres. Axes: log(intensity) 475 Hz vs. log(intensity) 557 Hz.]
1. Place two cluster centres (x) at random. 2. Assign each data point (* and o) to the nearest cluster centre (x)
[Figure: Tiles data: o = whole tiles, * = cracked tiles, x = centres.]
1. Compute the new centre of each class. 2. Move the crosses (x)
Iteration 2
[Figure: Tiles data: o = whole tiles, * = cracked tiles, x = centres.]
Iteration 3
[Figure: Tiles data: o = whole tiles, * = cracked tiles, x = centres.]
Iteration 4 (then stop, because no visible change). Each data point belongs to the cluster defined by the nearest centre.
[Figure: Tiles data: o = whole tiles, * = cracked tiles, x = centres.]
The membership matrix M: 1. The last five data points (rows) belong to the first cluster (column). 2. The first five data points (rows) belong to the second cluster (column).
M =
0.0000 1.0000
0.0000 1.0000
0.0000 1.0000
0.0000 1.0000
0.0000 1.0000
1.0000 0.0000
1.0000 0.0000
1.0000 0.0000
1.0000 0.0000
1.0000 0.0000
Membership matrix M
m_ik = 1 if ||u_k − c_i||^2 ≤ ||u_k − c_j||^2 for every other cluster centre c_j, and m_ik = 0 otherwise
(u_k = data point k, c_i = cluster centre i, c_j = cluster centre j; ||·|| is the distance)
Results of K-means depend on the starting point of the algorithm. Repeat it several times to get a better feeling whether the results are meaningful.
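The procedure above, including the repeat-several-times advice, can be sketched in a few lines (the 2-D points below are made up, not the tiles data):

```python
import random

def kmeans(points, k, iters=50):
    # Lloyd's algorithm: assign each point to the nearest centre,
    # then move each centre to the mean of its cluster.
    centres = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centres[i])))
            clusters[nearest].append(p)
        centres = [tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centres[i]
                   for i, cl in enumerate(clusters)]
    # objective J: total squared distance of points to their nearest centre
    J = sum(min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centres) for p in points)
    return centres, J

def kmeans_restarts(points, k, restarts=10, seed=0):
    # The result depends on the random starting centres, so repeat
    # several times and keep the run with the smallest objective.
    random.seed(seed)
    return min((kmeans(points, k) for _ in range(restarts)), key=lambda run: run[1])

pts = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.9, 5.1)]
centres, J = kmeans_restarts(pts, k=2)
print(len(centres), round(J, 3))
```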
c-partition
A c-partition of the universe U = {u_1, …, u_K} into clusters C_1, …, C_c satisfies:
∪_{i=1}^c C_i = U (all clusters C together fill the whole universe U)
C_i ∩ C_j = Ø for all i ≠ j (clusters do not overlap)
Ø ⊂ C_i ⊂ U for all i (a cluster C_i is never empty and is smaller than the whole universe U)
2 ≤ c ≤ K (there must be at least 2 clusters in a c-partition, and at most as many as the number of data points K)
Objective function
J = Σ_{i=1}^c J_i = Σ_{i=1}^c Σ_{k, u_k ∈ C_i} ||u_k − c_i||^2
Minimise the total sum of all squared distances
Algorithm: fuzzy c-means (FCM)
Each data point belongs to two clusters to different degrees
[Figure: Tiles data: o = whole tiles, * = cracked tiles, x = centres.]
1. Place two cluster centres
2. Assign a fuzzy membership to each data point depending on distance
[Figure: Tiles data: o = whole tiles, * = cracked tiles, x = centres.]
1. Compute the new centre of each class. 2. Move the crosses (x)
[Figure: Tiles data: o = whole tiles, * = cracked tiles, x = centres.]
Iteration 2
[Figure: Tiles data: o = whole tiles, * = cracked tiles, x = centres.]
Iteration 5
[Figure: Tiles data: o = whole tiles, * = cracked tiles, x = centres.]
Iteration 10
[Figure: Tiles data: o = whole tiles, * = cracked tiles, x = centres.]
Iteration 13 (then stop, because no visible change). Each data point belongs to the two clusters to a degree.
[Figure: Tiles data: o = whole tiles, * = cracked tiles, x = centres.]
The membership matrix M: 1. The last five data points (rows) belong mostly to the first cluster (column). 2. The first five data points (rows) belong mostly to the second cluster (column).
M =
0.0025 0.9975
0.0091 0.9909
0.0129 0.9871
0.0001 0.9999
0.0107 0.9893
0.9393 0.0607
0.9638 0.0362
0.9574 0.0426
0.9906 0.0094
0.9807 0.0193
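A fuzzy membership matrix like the one above can be produced by a minimal fuzzy c-means sketch (toy 2-D points and a crude deterministic initialization, not the tiles data):

```python
def fcm(points, c, m=2.0, iters=100):
    # Fuzzy c-means sketch: every point belongs to every cluster to a
    # degree u_ik, and each point's memberships sum to 1.
    d = len(points[0])
    centres = [points[0], points[-1]]  # crude deterministic init (c = 2 here)
    for _ in range(iters):
        # membership update: u_ik proportional to dist_ik^(-2/(m-1))
        U = []
        for p in points:
            dists = [max(sum((a - b) ** 2 for a, b in zip(p, ce)) ** 0.5, 1e-12)
                     for ce in centres]
            U.append([1.0 / sum((dists[i] / dists[j]) ** (2.0 / (m - 1.0))
                                for j in range(c)) for i in range(c)])
        # centre update: mean of all points weighted by u_ik^m
        centres = [tuple(sum(U[k][i] ** m * points[k][j] for k in range(len(points)))
                         / sum(U[k][i] ** m for k in range(len(points)))
                         for j in range(d)) for i in range(c)]
    return centres, U

pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centres, U = fcm(pts, c=2)
for row in U:
    print([round(u, 3) for u in row])  # each row sums to 1
```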
Hard Classifier (HCM)
Classes: Ok, light, moderate, severe
A cell is either one or the other class, defined by a colour.
Fuzzy Classifier (FCM)
Classes: Ok, light, moderate, severe
A cell can belong to several classes to a degree, i.e., one column may have several colours.
Dendrograms allow us to visualize the hierarchy, but the visualization is not unique!
Tends to be sensitive to small changes in the data
Provides clusters of every size: where to “cut” is user-determined
Large storage demand
Running Time: O(n^2 * |levels|) = O(n^3)
Depends on: distance measure, linkage method
Hierarchical Clustering
• Greedy
• Agglomerative vs. Divisive
Hierarchical Agglomerative Clustering
We start with every data point in a separate cluster
We keep merging the most similar pairs of data points/clusters until we have one big cluster left
This is called a bottom-up or agglomerative method
Hierarchical Clustering (cont.)
This produces a binary tree or dendrogram
The final cluster is the root and each data item is a leaf
The height of the bars indicates how close the items are
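The bottom-up procedure can be sketched with single linkage on toy 1-D points (the points and the linkage choice are illustrative):

```python
def agglomerative(points, dist):
    # Bottom-up: start with every point in its own cluster, repeatedly
    # merge the closest pair (single linkage), and record each merge
    # together with its height (the dendrogram).
    clusters = [[i] for i in range(len(points))]
    merges = []  # (members_a, members_b, height)
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                h = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or h < best[2]:
                    best = (a, b, h)
        a, b, h = best
        merges.append((clusters[a][:], clusters[b][:], h))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges

euclid = lambda p, q: sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5
pts = [(0.0,), (0.1,), (5.0,), (5.2,)]
merges = agglomerative(pts, euclid)
for a, b, h in merges:
    print(a, b, round(h, 2))
```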
Hierarchical Clustering Demo
Hierarchical Clustering Issues Distinct clusters are not produced –
sometimes this can be good, if the data has a hierarchical structure w/o clear boundaries
There are methods for producing distinct clusters, but these usually involve specifying somewhat arbitrary cutoff values
What if data doesn’t have a hierarchical structure? Is HC appropriate?
Support Vector Clustering
Given points x in data space, define images in Hilbert space.
Require all images to be enclosed by a minimal sphere in Hilbert space.
Reflection of this sphere in data space defines cluster boundaries.
Two parameters: width of Gaussian kernel and fraction of outliers
Ben-Hur, Horn, Siegelmann & Vapnik. JMLR 2 (2001) 125-127
Variation of q allows for clustering solutions on various scales
q = 1, 20, 24, 48
Example that allows for SV clustering only in the presence of outliers. Procedure: limit β < C = 1/(pN), where p = fraction of assumed outliers in the data.
q=3.5, p=0; q=1, p=0.3
Similarity to scale space approach for high values of q and p. Probability distribution obtained from R(x) .
q=4.8 p=0.7
From Scale-space to Quantum Clustering
Parzen window approach: estimate the probability density by kernel functions (Gaussians) located at data points.
P(x) = c f(x) = c Σ_{i=1}^N exp(−(x − x_i)^2 / (2σ^2)), with σ = 1/√(2q)
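The Parzen estimate (up to the constant c) can be sketched in one dimension; the sample points and the value of q below are made up:

```python
import math

def parzen(x, data, sigma):
    # unnormalized Parzen estimate: a Gaussian located at every data point
    return sum(math.exp(-(x - xi) ** 2 / (2 * sigma ** 2)) for xi in data)

data = [0.0, 0.2, 5.0, 5.1]   # made-up 1-D sample with two groups
q = 2.0
sigma = 1 / math.sqrt(2 * q)  # sigma = 1/sqrt(2q)
# the estimate is high near the groups of points and low in between
print(parzen(0.1, data, sigma), parzen(2.5, data, sigma))
```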
Quantum Clustering
View the Parzen estimator ψ as the solution of the Schrödinger equation:
Hψ = (−(σ^2/2) ∇^2 + V(x)) ψ = E ψ
with the potential V(x) responsible for attraction to cluster centers and the Lagrangian (the kinetic term) causing the spread.
Find V(x):
V(x) = E + (σ^2/2) ∇^2ψ / ψ = E − d/2 + (1/(2σ^2 ψ)) Σ_i ||x − x_i||^2 exp(−||x − x_i||^2 / (2σ^2))
Horn and Gottlieb, Phys. Rev. Lett. 88 (2002) 018702
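In one dimension the potential can be evaluated directly from the formula above (the sample and σ are made up; E is dropped, since only differences of V matter for locating minima):

```python
import math

def qc_potential(x, data, sigma, d=1):
    # V(x) = E - d/2 + (1/(2 sigma^2 psi)) * sum_i (x - x_i)^2 exp(-(x - x_i)^2 / (2 sigma^2)),
    # with the additive constant E dropped.
    g = [math.exp(-(x - xi) ** 2 / (2 * sigma ** 2)) for xi in data]
    psi = sum(g)
    return -d / 2 + sum((x - xi) ** 2 * gi
                        for xi, gi in zip(data, g)) / (2 * sigma ** 2 * psi)

data = [0.0, 0.2, 5.0, 5.1]   # made-up 1-D sample with two clusters
sigma = 0.5
# the potential is low at the cluster centres and high between them
print(qc_potential(0.1, data, sigma), qc_potential(2.5, data, sigma))
```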
The Crabs Example (from Ripley’s textbook): 4 classes, 50 samples each, d=5
A topographic map of the probability distribution for the crab data set with σ=1/2 using principal components 2 and 3. There exists only one maximum.
The Crabs Example: the QC potential exhibits four minima, identified with cluster centers
A topographic map of the potential for the crab data set with σ=1/2 using principal components 2 and 3. The four minima are denoted by crossed circles. The contours are set at values V=cE for c=0.2,…,1.
The Crabs Example - Contd.
A three-dimensional plot of the potential for the crab data set with σ=1/3 using principal components 2 and 3
The Crabs Example - Contd.
A three-dimensional plot of the potential for the crab data set with σ=1/2 using principal components 2 and 3
Identifying Clusters
Local minima of the potential are identified with cluster centers.
Data points are assigned to clusters according to:
- minimal distance from centers, or,
- sliding points down the slopes of the potential with gradient descent until they reach the centers.
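The sliding step can be sketched with plain gradient descent; the double-well function below is a toy stand-in for the QC potential, with "cluster centres" (minima) at -1 and +1:

```python
def descend(x0, V, step=0.01, iters=2000, eps=1e-5):
    # slide a point down the slopes of V using a numerical
    # derivative, until it settles in a local minimum
    x = x0
    for _ in range(iters):
        x -= step * (V(x + eps) - V(x - eps)) / (2 * eps)
    return x

# toy potential standing in for V(x), with minima at -1 and +1
V = lambda x: (x * x - 1) ** 2
print(round(descend(0.5, V), 2))   # 1.0  -> assigned to the centre at +1
print(round(descend(-0.3, V), 2))  # -1.0 -> assigned to the centre at -1
```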
The Iris Example: 3 classes, each containing 50 samples, d=4
A topographic map of the potential for the iris data set with σ=0.25 using principal components 1 and 2. The three minima are denoted by crossed circles. The contours are set at values V=cE for c=0.2,…,1.
The Iris Example - Gradient Descent Dynamics
The Iris Example - Using Raw Data in 4D
There are only 5 misclassifications. σ=0.21.
Example – Yeast cell cycle
Yeast cell cycle data were studied by several groups who have applied SVD (Spellman et al., Molecular Biology of the Cell, 9, Dec. 2000). We use it to test clustering of genes, whose classification into groups was investigated by Spellman et al.
The gene/sample matrix that we start from has dimensions of 798x72, using the same selection as made by Shamir, R. and Sharan, R. (2002).
We truncate it to r=4 and obtain, once again, our best results for σ=0.5, where four clusters follow from the QC algorithm.
Example – Yeast cell cycle
The five gene families as represented in two coordinates of our r=4 dimensional space.
Example – Yeast cell cycle
Cluster assignments of genes for QC with σ=0.46, as compared to the classification by Spellman into five classes, shown as alternating gray and white areas.
Yeast cell cycle in normalized 2 dimensions
Hierarchical Quantum Clustering (HQC)
Start with raw data matrix containing gene expression profiles of the samples.
Apply SVD and truncate to r-space by selecting the first r significant eigenvectors
Apply QC in r dimensions starting at a small scale σ, obtaining many clusters. Move data points to cluster centers and reiterate the process at higher σ. This produces hierarchical clustering that can be represented by a dendrogram.
Example – Clustering of human cancer cells
The NCI60 set is a gene expression profile of ~8000 genes in 60 human cancer cells.
NCI60 includes cell lines derived from cancers of colorectal, renal, ovarian, breast, prostate, lung and central nervous system origin, as well as leukemias and melanomas.
After application of selective filters the number of gene spots is reduced to a 1,376 gene subset (Scherf et al., Nature Genetics 24, 2000).
We applied HQC with r=5 dimensions.
Example – Clustering of human cancer cells
Dendrogram of 60 cancer cell samples. The clustering was done in 5 truncated dimensions. The first 2 letters in each sample represent the tissue/cancer type.
Example - Projection onto the unit sphere
Representation of data of four classes of cancer cells on two dimensions of the truncated space. The circles denote the locations of the data points before this normalization was applied
COMPACT – a comparative package for clustering assessment
Compact is a GUI Matlab tool that enables an easy and intuitive way to compare some clustering methods.
Compact is a five-step wizard that contains basic Matlab clustering methods as well as the quantum clustering algorithm. Compact provides a flexible and customizable interface for clustering data with high dimensionality.
Compact allows both textual and graphical display of the clustering results
How to Install?
COMPACT is a self-extracting package. In order to install and run the GUI tool, follow these three easy steps:
Download the COMPACT.zip package to your local drive.
Add the COMPACT destination directory to your Matlab path.
Within Matlab, type ‘compact’ at the command prompt.
Steps – 1
Input parameters
Steps – 1
Selecting variables
Steps – 2
Determining the matrix shape and vectors to cluster
Steps – 3
Preprocessing Procedures Components’ variance
graphs Preprocessing parameters
Steps – 4
Points distribution preview
and clustering method selection
Steps – 5
Parameters for clustering algorithms K-means
Steps – 5
Parameters for clustering algorithms FCM
Steps – 5
Parameters for clustering algorithms NN
Steps – 5
Parameters for clustering algorithms QC
Steps – 6
COMPACT results
Steps – 6
Results
Clustering Methods: Model-Based
Data are generated from a mixture of underlying probability distributions
Some Examples
Two univariate normal components, equal proportions, common variance σ^2=1, means μ = 1, 2, 3, 4
Two univariate normal components, proportions 0.75 and 0.25, common variance σ^2=1, means μ = 1, 2, 3, 4
and some more
Probability Models
Classification Likelihood:
L_C(θ_1, …, θ_G; ℓ_1, …, ℓ_n | x) = Π_{i=1}^n f_{ℓ_i}(x_i | θ_{ℓ_i})
θ_k is the set of parameters of cluster k; ℓ_i = k if x_i belongs to cluster k
Mixture Likelihood:
L_M(θ_1, …, θ_G; τ_1, …, τ_G | x) = Π_{i=1}^n Σ_{k=1}^G τ_k f_k(x_i | θ_k)
τ_k is the probability that an observation belongs to cluster k (τ_k ≥ 0; Σ_{k=1}^G τ_k = 1)
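The mixture likelihood is easy to evaluate numerically; the observations and the two-component parameters below are made up for illustration:

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def mixture_log_likelihood(xs, taus, mus, variances):
    # log L_M = sum_i log( sum_k tau_k f_k(x_i | theta_k) )
    return sum(math.log(sum(t * normal_pdf(x, m, v)
                            for t, m, v in zip(taus, mus, variances)))
               for x in xs)

xs = [0.9, 1.1, 3.0, 3.2]   # made-up 1-D observations
# two components with tau_1 = tau_2 = 0.5
print(mixture_log_likelihood(xs, [0.5, 0.5], [1.0, 3.0], [1.0, 1.0]))
```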
Probability Models (Cont.)
Most used: the multivariate normal distribution. θ_k has a means vector μ_k and a covariance matrix Σ_k:
f_k(x_i | μ_k, Σ_k) = exp(−(1/2)(x_i − μ_k)^T Σ_k^{−1} (x_i − μ_k)) / ((2π)^{p/2} |Σ_k|^{1/2})
How is the covariance matrix Σk calculated?
Calculating the covariance matrix Σ_k
The idea: parameterize the covariance matrix as Σ_k = λ_k D_k A_k D_k^T
D_k – orthogonal matrix of eigenvectors; determines the orientation of the PCs of Σ_k
A_k – diagonal matrix whose elements are proportional to the eigenvalues of Σ_k; determines the shape of the density contours
λ_k – scalar; determines the volume of the corresponding ellipsoid
The definition of Σ_k determines the model:
Σ_k = λI – spherical, equal (SOS criterion)
Σ_k = λDAD^T – all ellipsoids are equal
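The parameterization can be illustrated by building a 2-D covariance from its three factors (the rotation angle, eigenvalue ratios, and λ below are arbitrary example values):

```python
import numpy as np

# Sigma_k = lambda_k * D_k A_k D_k^T  (illustrative 2-D values)
theta = np.pi / 4
D = np.array([[np.cos(theta), -np.sin(theta)],    # orientation: eigenvectors
              [np.sin(theta),  np.cos(theta)]])
A = np.diag([4.0, 1.0])                           # shape: relative eigenvalues
lam = 0.5                                         # volume
Sigma = lam * D @ A @ D.T
print(np.linalg.eigvalsh(Sigma))  # lam times the entries of A: [0.5, 2.0]
```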
How is Θk computed? The EM algorithm
The complete-data log-likelihood (*):
l(θ_k, τ_k, z_ik | x) = Σ_{i=1}^n Σ_{k=1}^G z_ik log[τ_k f_k(x_i | θ_k)]
z_ik = 1 if x_i belongs to group k, 0 otherwise
The density of an observation x_i given z_i is Π_{k=1}^G f_k(x_i | θ_k)^{z_ik}
ẑ_ik is the conditional expectation of z_ik given x_i and θ_1, …, θ_G:
ẑ_ik = E[z_ik | x_i, θ_1, …, θ_G]
• E-step: calculate ẑ_ik:
ẑ_ik = τ̂_k f_k(x_i | θ̂_k) / Σ_{j=1}^G τ̂_j f_j(x_i | θ̂_j)
• M-step: given ẑ_ik, maximize (*):
n̂_k = Σ_{i=1}^n ẑ_ik,  τ̂_k = n̂_k / n,  μ̂_k = (Σ_{i=1}^n ẑ_ik x_i) / n̂_k
Σ̂_k depends on the model
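The E and M updates above can be sketched for the simplest model, a 1-D normal mixture with common variance 1 (the data and the crude initialization are illustrative):

```python
import math

def em_1d(xs, G, iters=100):
    # EM sketch for a 1-D normal mixture with common variance 1
    mus = list(xs[:G])            # crude deterministic starting points
    taus = [1.0 / G] * G
    phi = lambda x, m: math.exp(-(x - m) ** 2 / 2) / math.sqrt(2 * math.pi)
    for _ in range(iters):
        # E-step: z_ik = tau_k f_k(x_i) / sum_j tau_j f_j(x_i)
        Z = []
        for x in xs:
            w = [t * phi(x, m) for t, m in zip(taus, mus)]
            s = sum(w)
            Z.append([wi / s for wi in w])
        # M-step: n_k = sum_i z_ik, tau_k = n_k / n, mu_k = sum_i z_ik x_i / n_k
        nks = [sum(Z[i][k] for i in range(len(xs))) for k in range(G)]
        taus = [nk / len(xs) for nk in nks]
        mus = [sum(Z[i][k] * xs[i] for i in range(len(xs))) / nks[k] for k in range(G)]
    return taus, mus

xs = [0.0, 0.1, -0.1, 5.0, 5.1, 4.9]   # two well-separated groups
taus, mus = em_1d(xs, G=2)
print([round(t, 2) for t in taus], [round(m, 1) for m in sorted(mus)])
```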
Limitations of the EM Algorithm
Low rate of convergence: you should start with good starting points and hope for separable clusters…
Not practical for a large number of clusters (== probabilities)
"Crashes" when the covariance matrix becomes singular
Problems when there are few observations in a cluster
EM must not get more clusters than exist in nature…