Transcript of “Clustering” (novel.ict.ac.cn/files/Day 5.pdf)
Clustering
Jian Pei: Big Data Analytics -- Clustering 2
What Is Clustering?
• Group data into clusters
  – Similar to one another within the same cluster
  – Dissimilar to the objects in other clusters
  – Unsupervised learning: no predefined classes
Cluster 1 Cluster 2
Outliers
Similarity and Dissimilarity
• Distances are the normally used measures
• Minkowski distance: a generalization
• If q = 2, d is Euclidean distance
• If q = 1, d is Manhattan distance
• If q = ∞, d is Chebyshev distance
• Weighted distance
    d(i, j) = (|x_i1 − x_j1|^q + |x_i2 − x_j2|^q + … + |x_ip − x_jp|^q)^{1/q}    (q > 0)

    Weighted: d(i, j) = (w_1 |x_i1 − x_j1|^q + w_2 |x_i2 − x_j2|^q + … + w_p |x_ip − x_jp|^q)^{1/q}    (q > 0)
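Not part of the original slides: a minimal plain-Python sketch of the (optionally weighted) Minkowski distance, with q = ∞ handled as the Chebyshev maximum. The function name and the way weights enter the q = ∞ case are illustrative assumptions.

```python
import math

def minkowski(x, y, q, w=None):
    """Weighted Minkowski distance between vectors x and y.
    q = 2: Euclidean; q = 1: Manhattan; q = inf: Chebyshev."""
    if w is None:
        w = [1.0] * len(x)                 # unweighted by default
    if math.isinf(q):
        # Chebyshev: the largest (weighted) coordinate difference
        return max(wi * abs(a - b) for wi, a, b in zip(w, x, y))
    return sum(wi * abs(a - b) ** q for wi, a, b in zip(w, x, y)) ** (1.0 / q)
```

For the point pair (0,0) and (3,4) this yields 5 (Euclidean), 7 (Manhattan), and 4 (Chebyshev).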
Manhattan and Chebyshev Distance
Picture from Wikipedia
Manhattan Distance
http://brainking.com/images/rules/chess/02.gif
Chebyshev Distance
In 2 dimensions, Chebyshev distance is the chessboard (“king move”) distance
Properties of Minkowski Distance
• Nonnegative: d(i,j) ≥ 0 • The distance of an object to itself is 0
– d(i,i) = 0 • Symmetric: d(i,j) = d(j,i) • Triangular inequality
– d(i,j) ≤ d(i,k) + d(k,j)
[Figure: a triangle with vertices i, j, k illustrating the triangle inequality]
Clustering Methods
• K-means and partitioning methods • Hierarchical clustering • Density-based clustering • Grid-based clustering • Pattern-based clustering • Other clustering methods
Partitioning Algorithms: Ideas
• Partition n objects into k clusters
  – Optimize the chosen partitioning criterion
• Global optimum: examine all possible partitions
  – Exponentially many (on the order of k^n) possible partitions, too expensive!
• Heuristic methods: k-means and k-medoids
  – K-means: a cluster is represented by its center (mean)
  – K-medoids or PAM (partitioning around medoids): each cluster is represented by one of the objects in the cluster
K-means
• Arbitrarily choose k objects as the initial cluster centers
• Until no change, do
  – (Re)assign each object to the cluster whose mean it is most similar to
  – Update the cluster means, i.e., calculate the mean value of the objects in each cluster
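The two-step loop above can be sketched as follows. This is an illustrative plain-Python version for 2-D points, not the slides' own code; the random initialization, seed, and tie-breaking are assumptions.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's k-means on 2-D points; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)             # arbitrary initial centers
    labels = [0] * len(points)
    for _ in range(max_iter):
        # (Re)assignment step: each object goes to the nearest center
        labels = [min(range(k),
                      key=lambda j: (p[0] - centers[j][0]) ** 2 +
                                    (p[1] - centers[j][1]) ** 2)
                  for p in points]
        # Update step: recompute each cluster's mean
        new_centers = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                new_centers.append((sum(p[0] for p in members) / len(members),
                                    sum(p[1] for p in members) / len(members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:              # no change: converged
            break
        centers = new_centers
    return centers, labels
```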
K-Means: Example
[Figure: scatter plots on 0–10 axes illustrating k-means with K = 2: arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign and repeat until no change.]
Pros and Cons of K-means
• Relatively efficient: O(tkn)
  – n: # objects, k: # clusters, t: # iterations; normally k, t << n
• Often terminates at a local optimum
• Applicable only when the mean is defined
  – What about categorical data?
• Need to specify the number of clusters
• Unable to handle noisy data and outliers
• Unsuitable for discovering non-convex clusters
Variations of K-means
• Aspects of variation
  – Selection of the initial k means
  – Dissimilarity calculations
  – Strategies to calculate cluster means
• Handling categorical data: k-modes
  – Use the mode instead of the mean
    • Mode: the most frequent item(s)
  – A mixture of categorical and numerical data: the k-prototype method
• EM (expectation maximization): assign each object a probability of belonging to each cluster
A Problem of K-means
• Sensitive to outliers – Outlier: objects with extremely large values
• May substantially distort the distribution of the data
• K-medoids: the most centrally located object in a cluster
[Figure: two scatter plots on 0–10 axes; an outlier drags the k-means centroid (+) away from the bulk of the cluster, while the medoid remains centrally located.]
PAM: A K-medoids Method
• PAM: Partitioning Around Medoids
• Arbitrarily choose k objects as the initial medoids
• Until no change, do
  – (Re)assign each object to the cluster of its nearest medoid
  – Randomly select a non-medoid object o′ and compute the total cost S of swapping a medoid o with o′
  – If S < 0, swap o with o′ to form the new set of k medoids
Swapping Cost
• Measure whether o’ is better than o as a medoid
• Use the squared-error criterion
    E = ∑_{i=1}^{k} ∑_{p∈C_i} d(p, o_i)²
  – Compute E_{o′} − E_{o}
  – Negative: swapping brings benefit
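The swapping cost can be sketched directly from the criterion above. Not from the slides; a small plain-Python illustration using squared Euclidean distance, with hypothetical helper names:

```python
def total_cost(points, medoids):
    """Squared-error E: each point charged to its nearest medoid."""
    def d2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    return sum(min(d2(p, m) for m in medoids) for p in points)

def swap_gain(points, medoids, o, o_new):
    """E_{o'} - E_o: negative means swapping medoid o for o_new helps."""
    trial = [o_new if m == o else m for m in medoids]
    return total_cost(points, trial) - total_cost(points, medoids)
```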
PAM: Example
[Figure: scatter plots on 0–10 axes illustrating PAM with K = 2: arbitrarily choose k objects as initial medoids (total cost = 20); assign each remaining object to the nearest medoid; randomly select a non-medoid object O_random and compute the total cost of swapping (total cost = 26); swap O and O_random if quality is improved; loop until no change.]
Pros and Cons of PAM
• PAM is more robust than k-means in the presence of noise and outliers
  – Medoids are less influenced by outliers
• PAM is efficient for small data sets but does not scale well to large data sets
  – O(k(n−k)²) per iteration
• Sampling-based method: CLARA
CLARA
• CLARA: Clustering LARge Applications (Kaufmann and Rousseeuw, 1990)
  – Built into statistical analysis packages, such as S+
• Draw multiple samples of the data set, apply PAM to each sample, and return the best clustering
• Performs better than PAM on larger data sets
• Efficiency depends on the sample size
  – A good clustering of the samples may not be a good clustering of the whole data set
CLARANS
• Clustering Large Applications based upon RANdomized Search
• The problem space is a graph of clusterings
  – A vertex is a choice of k medoids out of the n objects: C(n, k) vertices in total
  – PAM searches the whole graph
  – CLARA searches some random sub-graphs
• CLARANS climbs hills
  – Randomly sample a set and select k medoids
  – Consider neighbors of the current medoids as candidates for new medoids
  – Use the sample set to verify
  – Repeat multiple times to avoid bad samples
Hierarchy
• An arrangement or classification of things according to inclusiveness
• A natural way of abstraction, summarization, compression, and simplification for understanding
• Typical setting: organize a given set of objects into a hierarchy
  – No or very little supervision
  – Some heuristic guidance on the quality of the hierarchy
• Group data objects into a tree of clusters • Top-down versus bottom-up
Hierarchical Clustering
[Figure: five objects a, b, c, d, e; a and b merge into {a, b}, d and e into {d, e}, then {c, d, e}, and finally {a, b, c, d, e}. Read left to right (Step 0 → Step 4) for agglomerative clustering (AGNES), right to left (Step 4 → Step 0) for divisive clustering (DIANA).]
AGNES (Agglomerative Nesting)
• Initially, each object is a cluster
• Merge clusters step by step until all objects form one cluster
  – Single-link approach
  – Each cluster is represented by all of the objects in the cluster
  – The similarity between two clusters is measured by the similarity of the closest pair of data points belonging to different clusters
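The single-link merging rule can be sketched as follows. Not the slides' own code: an illustrative O(n³)-ish plain-Python version that merges until k clusters remain; the stopping criterion and squared-distance shortcut are assumptions.

```python
def single_link_agnes(points, k):
    """Repeatedly merge the two clusters with the closest pair of points."""
    def d2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    clusters = [[p] for p in points]        # every object starts as a cluster
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single-link: distance of the closest inter-cluster pair
                dist = min(d2(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        clusters[i] += clusters[j]          # merge cluster j into cluster i
        del clusters[j]
    return clusters
```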
Dendrogram
• Shows how clusters are merged hierarchically
• Decomposes data objects into a multi-level nested partitioning (a tree of clusters)
• A clustering of the data objects: cut the dendrogram at the desired level
  – Each connected component forms a cluster
DIANA (Divisive ANAlysis)
• Initially, all objects are in one cluster
• Split clusters step by step until each cluster contains only one object
[Figure: three scatter plots on 0–10 axes showing one cluster being split step by step into smaller clusters.]
Distance Measures
• Minimum distance: d_min(C_i, C_j) = min_{p∈C_i, q∈C_j} d(p, q)
• Maximum distance: d_max(C_i, C_j) = max_{p∈C_i, q∈C_j} d(p, q)
• Mean distance: d_mean(C_i, C_j) = d(m_i, m_j)
• Average distance: d_avg(C_i, C_j) = (1 / (n_i n_j)) ∑_{p∈C_i} ∑_{q∈C_j} d(p, q)

m_i: the mean of cluster C_i; n_i: the number of objects in C_i
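The four measures can be computed side by side in a few lines. Not from the slides; a plain-Python sketch for point tuples using Euclidean distance (the dict of results is an illustrative convention):

```python
import math

def cluster_distances(Ci, Cj):
    """Minimum, maximum, mean-of-means, and average inter-cluster distances."""
    pairs = [math.dist(p, q) for p in Ci for q in Cj]
    mean_i = tuple(sum(c) / len(Ci) for c in zip(*Ci))
    mean_j = tuple(sum(c) / len(Cj) for c in zip(*Cj))
    return {
        'min': min(pairs),
        'max': max(pairs),
        'mean': math.dist(mean_i, mean_j),         # distance of the two means
        'avg': sum(pairs) / (len(Ci) * len(Cj)),   # average over all pairs
    }
```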
Challenges
• Hard to choose merge/split points
  – Merging/splitting can never be undone
  – Merging/splitting decisions are critical
• High complexity: O(n²)
• Integrating hierarchical clustering with other techniques
  – BIRCH, CURE, CHAMELEON, ROCK
BIRCH
• Balanced Iterative Reducing and Clustering using Hierarchies
• CF (Clustering Feature) tree: a hierarchical data structure summarizing object info
  – Clustering objects → clustering leaf nodes of the CF tree
Clustering Feature Vector
• Clustering Feature: CF = (N, LS, SS)
  – N: number of data points
  – LS: ∑_{i=1}^{N} o_i  (linear sum of the points)
  – SS: ∑_{i=1}^{N} o_i²  (square sum of the points)
• Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8),
  CF = (5, (16,30), (54,190))
CF-tree in BIRCH
• Clustering feature:
  – Summarizes the statistics for a cluster
  – Many cluster quality measures (e.g., radius, distance) can be derived
  – Additivity: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
• A CF tree: a height-balanced tree storing the clustering features for a hierarchical clustering
  – A non-leaf node in the tree has descendants or “children”
  – The non-leaf nodes store sums of the CFs of their children
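The CF summary and its additivity property are easy to verify in code. Not from the slides; a plain-Python sketch for 2-D points (the tuple layout mirrors CF = (N, LS, SS)):

```python
def cf(points):
    """Clustering feature (N, LS, SS) of a set of 2-D points."""
    n = len(points)
    ls = tuple(sum(col) for col in zip(*points))            # linear sum
    ss = tuple(sum(v * v for v in col) for col in zip(*points))  # square sum
    return (n, ls, ss)

def cf_add(cf1, cf2):
    """Additivity: merging two clusters just adds their CFs component-wise."""
    (n1, ls1, ss1), (n2, ls2, ss2) = cf1, cf2
    return (n1 + n2,
            tuple(a + b for a, b in zip(ls1, ls2)),
            tuple(a + b for a, b in zip(ss1, ss2)))
```

Using the five example points from the slide reproduces CF = (5, (16,30), (54,190)).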
CF Tree
[Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6. The root and non-leaf nodes hold entries CF_i, each with a pointer child_i to a child node; leaf nodes hold CF entries for sub-clusters and are chained by prev/next pointers.]
Parameters of a CF-tree
• Branching factor: the maximum number of children
• Threshold: max diameter of sub-clusters stored at the leaf nodes
BIRCH Clustering
• Phase 1: scan DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data)
• Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree
Pros & Cons of BIRCH
• Linear scalability
  – Good clustering with a single scan
  – Quality can be further improved by a few additional scans
• Can handle only numeric data
• Sensitive to the order of the data records
Drawbacks of Square-Error-Based Methods
• One representative per cluster
  – Good only for convex-shaped clusters of similar size and density
• k, the number of clusters, is a parameter
  – Good only if k can be reasonably estimated
CURE: the Ideas
• Each cluster has c representatives
  – Choose c well-scattered points in the cluster
  – Shrink them towards the mean of the cluster by a fraction α
  – The representatives capture the physical shape and geometry of the cluster
• Merge the closest two clusters
  – Distance of two clusters: the distance between their two closest representatives
Cure: The Algorithm
• Draw a random sample S
• Partition the sample into p partitions
• Partially cluster each partition
• Eliminate outliers
  – Random sampling + remove clusters growing too slowly
• Cluster the partial clusters until only k clusters are left
  – Shrink the representatives of clusters towards the cluster center
Data Partitioning and Clustering
[Figure: a 2-D sample of points from two clusters (marked x and y) is partitioned, each partition is partially clustered, and the partial clusters are then merged.]
Shrinking Representative Points
• Shrink the multiple representative points towards the gravity center by a fraction of α
• Representatives capture the shape
[Figure: representative points of two clusters before (left) and after (right) shrinking towards the gravity center.]
Clustering Categorical Data: ROCK
• RObust Clustering using linKs
  – Links: the number of common neighbors between two points
  – Use links to measure similarity/proximity; not distance-based
  – Complexity: O(n² + n·m_m·m_a + n² log n)
• Basic ideas
  – Similarity function and neighbors:
      Sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2|
  – Example: let T1 = {1, 2, 3} and T2 = {3, 4, 5}; then
      Sim(T1, T2) = |{3}| / |{1, 2, 3, 4, 5}| = 1/5 = 0.2
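The Jaccard-style similarity used above is a one-liner in Python (not the slides' code, just an illustration of the worked example):

```python
def jaccard(t1, t2):
    """Similarity of two transactions: |T1 ∩ T2| / |T1 ∪ T2|."""
    t1, t2 = set(t1), set(t2)
    return len(t1 & t2) / len(t1 | t2)
```

For T1 = {1, 2, 3} and T2 = {3, 4, 5} this returns 0.2, matching the slide.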
Limitations
• Merging decisions are based on static modeling
  – No special characteristics of clusters are considered
[Figure: CURE and BIRCH would merge C1 and C2, although C1′ and C2′ are more appropriate for merging.]
Chameleon
• Hierarchical clustering using dynamic modeling
• Measures similarity based on a dynamic model
  – The interconnectivity and closeness (proximity) between two clusters, versus the interconnectivity of the clusters and the closeness of items within the clusters
• A two-phase algorithm
  – Use a graph-partitioning algorithm to cluster objects into a large number of relatively small sub-clusters
  – Find the genuine clusters by repeatedly combining sub-clusters
Overall Framework of CHAMELEON
Data Set → Construct Sparse Graph → Partition the Graph → Merge Partitions → Final Clusters
Distance-based Methods: Drawbacks
• Hard to find clusters with irregular shapes • Hard to specify the number of clusters • Heuristic: a cluster must be dense
How to Find Irregular Clusters?
• Divide the whole space into many small areas – The density of an area can be estimated – Areas may or may not be exclusive – A dense area is likely in a cluster
• Start from a dense area, traverse connected dense areas, and discover clusters of irregular shape
Directly Density Reachable
• Parameters
  – Eps: maximum radius of the neighborhood
  – MinPts: minimum number of points in an Eps-neighborhood of that point
  – N_Eps(p) = {q | dist(p, q) ≤ Eps}
• Core object p: |N_Eps(p)| ≥ MinPts
  – A core object is in a dense area
• Point q is directly density-reachable from p iff q ∈ N_Eps(p) and p is a core object
[Figure: p and q with MinPts = 3, Eps = 1 cm]
Density-Based Clustering
• Density-reachable
  – If p1 → p2, p2 → p3, …, p_{n−1} → p_n are each directly density-reachable, then p_n is density-reachable from p1
• Density-connected
  – If points p and q are both density-reachable from some object o, then p and q are density-connected
DBSCAN
• A cluster: a maximal set of density-connected points
  – Discovers clusters of arbitrary shape in spatial databases with noise
[Figure: core, border, and outlier points, with Eps = 1 cm and MinPts = 5]
DBSCAN: the Algorithm
• Arbitrarily select a point p
• Retrieve all points density-reachable from p w.r.t. Eps and MinPts
• If p is a core point, a cluster is formed
• If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database
• Continue the process until all of the points have been processed
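The procedure above can be sketched in plain Python. This is an illustrative, unoptimized version (O(n²) neighborhood queries, Euclidean distance); the label convention of −1 for noise is an assumption, not part of the slides.

```python
def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    def neighbors(i):
        return [j for j in range(len(points))
                if sum((a - b) ** 2 for a, b in zip(points[i], points[j])) <= eps ** 2]

    labels = [None] * len(points)
    cid = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:       # not a core point: tentatively noise
            labels[i] = -1
            continue
        labels[i] = cid                # start a new cluster from core point i
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:        # border point previously marked noise
                labels[j] = cid
            if labels[j] is not None:
                continue
            labels[j] = cid
            nb = neighbors(j)
            if len(nb) >= min_pts:     # j is also a core point: expand through it
                queue.extend(nb)
        cid += 1
    return labels
```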
Challenges for DBSCAN
• Different clusters may have very different densities
• Clusters may be in hierarchies
OPTICS: A Cluster-ordering Method
• Idea: ordering points to identify the clustering structure
• “Group” points by density connectivity – Hierarchies of clusters
• Visualize clusters and the hierarchy
Ordering Points
• Points strongly density-connected should be close to one another
• Clusters density-connected should be close to one another and form a “cluster” of clusters
OPTICS: An Example
[Figure: reachability-distance plotted over the cluster order of the objects; valleys below a threshold ε′ ≤ ε indicate clusters, separated by points with large or undefined reachability-distance.]
DENCLUE: Using Density Functions
• DENsity-based CLUstEring
• Major features
  – Solid mathematical foundation
  – Good for data sets with large amounts of noise
  – Allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets
  – Significantly faster than existing algorithms (faster than DBSCAN by a factor of up to 45)
  – But needs a large number of parameters
DENCLUE: Techniques
• Use grid cells
  – Only keep grid cells actually containing data points
  – Manage the cells in a tree-based access structure
• Influence function: describes the impact of a data point on its neighborhood
• The overall density of the data space is the sum of the influence functions of all data points
• Cluster by identifying density attractors
  – Density attractor: a local maximum of the overall density function
Density Attractor
Center-defined and Arbitrary Clusters
A Shrinking-based Approach
• Difficulties of multi-dimensional clustering
  – Noise (outliers)
  – Clusters of various densities
  – Shapes that are not well defined
• A novel preprocessing concept: “shrinking”
• A shrinking-based clustering approach
Intuition & Purpose
• For data points in a data set, what if we could make them move towards the centroid of the natural subgroup they belong to?
• Natural sparse subgroups become denser and thus easier to detect
  – Noise is further isolated
Inspiration
• Newton’s Universal Law of Gravitation
  – Any two objects exert a gravitational force of attraction on each other
  – The direction of the force is along the line joining the objects
  – The magnitude of the force is directly proportional to the product of the gravitational masses of the objects, and inversely proportional to the square of the distance between them:
      F_g = G · m_1 · m_2 / r²
  – G: universal gravitational constant, G = 6.67 × 10⁻¹¹ N·m²/kg²
The Concept of Shrinking
• A data preprocessing technique
  – Aims to optimize the inner structure of real data sets
• Each data point is “attracted” by other data points and moves in the direction in which the attraction is strongest
• Can be applied in different fields
Applying Shrinking to Clustering
• Shrink the natural sparse clusters to make them much denser, to facilitate the subsequent cluster-detecting process
[Figure: multi-attribute hyperspace]
Data Shrinking
• Each data point moves along the direction of the density gradient, and the data set shrinks towards the inside of the clusters
• Points are “attracted” by their neighbors and move to create denser clusters
• The process proceeds iteratively, repeated until the data are stabilized or the number of iterations exceeds a threshold
Approximation & Simplification
• Problem: computing the mutual attraction of every pair of data points is too time-consuming: O(n²)
• Solutions
  – Drop Newton’s constant G; set m_1 and m_2 to unit mass
  – Only aggregate the gravitation surrounding each data point
  – Use grids to simplify the computation
Termination condition
• Average movement of all points in the current iteration is less than a threshold
• The number of iterations exceeds a threshold
Optics on Pendigits Data
Before data shrinking After data shrinking
Fuzzy Clustering
• Each point x_i takes a probability w_ij of belonging to cluster C_j
• Requirements
  – For each point x_i: ∑_{j=1}^{k} w_ij = 1
  – For each cluster C_j: 0 < ∑_{i=1}^{m} w_ij < m
Fuzzy C-Means (FCM)
Select an initial fuzzy pseudo-partition, i.e., assign values to all the w_ij
Repeat
    Compute the centroid of each cluster using the fuzzy pseudo-partition
    Recompute the fuzzy pseudo-partition, i.e., the w_ij
Until the centroids do not change (or the change is below some threshold)
Critical Details
• Optimize the sum of the squared error (SSE):
    SSE(C_1, …, C_k) = ∑_{j=1}^{k} ∑_{i=1}^{m} w_ij^p · dist(x_i, c_j)²
• Computing centroids:
    c_j = ∑_{i=1}^{m} w_ij^p · x_i / ∑_{i=1}^{m} w_ij^p
• Updating the fuzzy pseudo-partition:
    w_ij = (1 / dist(x_i, c_j)²)^{1/(p−1)} / ∑_{q=1}^{k} (1 / dist(x_i, c_q)²)^{1/(p−1)}
  – When p = 2:
    w_ij = (1 / dist(x_i, c_j)²) / ∑_{q=1}^{k} (1 / dist(x_i, c_q)²)
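The centroid and membership updates above can be sketched for 1-D data. This is an illustrative plain-Python version, not the slides' code; the random initialization, fixed iteration count, and the small epsilon guarding division by zero are assumptions.

```python
import random

def fcm(points, k, p=2, iters=50, seed=0):
    """Fuzzy c-means on 1-D data; returns (centroids, membership matrix w)."""
    rng = random.Random(seed)
    # Initial fuzzy pseudo-partition: random weights, normalized per point
    w = []
    for _ in points:
        row = [rng.random() for _ in range(k)]
        s = sum(row)
        w.append([v / s for v in row])
    c = [0.0] * k
    for _ in range(iters):
        # Centroids: means weighted by w_ij^p
        c = [sum(w[i][j] ** p * points[i] for i in range(len(points))) /
             sum(w[i][j] ** p for i in range(len(points)))
             for j in range(k)]
        # Membership update: w_ij proportional to (1 / dist^2)^(1/(p-1))
        for i, x in enumerate(points):
            inv = [(1.0 / max((x - cj) ** 2, 1e-12)) ** (1.0 / (p - 1)) for cj in c]
            s = sum(inv)
            w[i] = [v / s for v in inv]
    return c, w
```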
Choice of P
• When p → 1, FCM behaves like traditional k-means
• When p is larger, the cluster centroids approach the global centroid of all data points
• The partition becomes fuzzier as p increases
Effectiveness
Mixture Models
• A cluster can be modeled as a probability distribution
  – Practically, assume a distribution can be approximated well using a multivariate normal distribution
• Multiple clusters form a mixture of different probability distributions
• A data set is a set of observations from a mixture of models
Object Probability
• Suppose there are k clusters and a set X of m objects
  – Let the j-th cluster have parameters θ_j = (µ_j, σ_j)
  – The probability that a point is in the j-th cluster is w_j, with w_1 + … + w_k = 1
• The probability of an object x is
    prob(x | Θ) = ∑_{j=1}^{k} w_j · p_j(x | θ_j)
• The probability of the data set X is
    prob(X | Θ) = ∏_{i=1}^{m} prob(x_i | Θ) = ∏_{i=1}^{m} ∑_{j=1}^{k} w_j · p_j(x_i | θ_j)
Example

For a univariate Gaussian component,
    prob(x | θ_j) = (1 / (√(2π) σ)) e^{−(x−µ)² / (2σ²)}

With two components θ_1 = (−4, 2) and θ_2 = (4, 2),
    prob(x | Θ) = (1 / (2√(2π))) e^{−(x+4)²/8} + (1 / (2√(2π))) e^{−(x−4)²/8}
Maximal Likelihood Estimation
• Maximum likelihood principle: if we know that a set of objects comes from one distribution but do not know the parameters, we can choose the parameters that maximize the probability of the data
• Maximize
    prob(X | Θ) = ∏_{i=1}^{m} (1 / (√(2π) σ)) e^{−(x_i−µ)² / (2σ²)}
• Equivalently, maximize
    log prob(X | Θ) = −∑_{i=1}^{m} (x_i − µ)² / (2σ²) − 0.5 m log 2π − m log σ
EM Algorithm
• Expectation Maximization algorithm
Select an initial set of model parameters
Repeat
    Expectation step: for each object x_i, calculate the probability that it belongs to each distribution θ_j, i.e., prob(θ_j | x_i)
    Maximization step: given the probabilities from the expectation step, find the new estimates of the parameters that maximize the expected likelihood
Until the parameters are stable
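The E/M loop above can be sketched for the two-component example. This simplified illustration (not from the slides) fixes equal variances and equal mixing weights and estimates only the two means; the initialization from the extreme values is an assumption.

```python
import math

def em_two_means(xs, sigma=1.0, iters=50):
    """EM for a 1-D mixture of two Gaussians with fixed, equal sigma and
    equal mixing weights; only the two means are estimated (a simplification)."""
    def pdf(x, mu):
        return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

    mu1, mu2 = min(xs), max(xs)        # crude initial estimates
    for _ in range(iters):
        # E-step: posterior probability that each x came from component 1
        r = [pdf(x, mu1) / (pdf(x, mu1) + pdf(x, mu2)) for x in xs]
        # M-step: responsibility-weighted means
        mu1 = sum(ri * x for ri, x in zip(r, xs)) / sum(r)
        mu2 = sum((1 - ri) * x for ri, x in zip(r, xs)) / sum(1 - ri for ri in r)
    return mu1, mu2
```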
Advantages and Disadvantages
• Mixture models are more general than k-means and fuzzy c-means
• Clusters can be characterized by a small number of parameters
• The results may satisfy the statistical assumptions of the generative models
• Computationally expensive • Need large data sets • Hard to estimate the number of clusters
Grid-based Clustering Methods
• Ideas – Using multi-resolution grid data structures – Using dense grid cells to form clusters
• Several interesting methods – CLIQUE – STING – WaveCluster
CLIQUE
• CLustering In QUEst
• Automatically identifies subspaces of a high-dimensional data space
• Both density-based and grid-based
CLIQUE: the Ideas
• Partition each dimension into the same number of equal-length intervals
  – This partitions an m-dimensional data space into non-overlapping rectangular units
• A unit is dense if the number of data points in it exceeds a threshold
• A cluster is a maximal set of connected dense units within a subspace
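The bottom-up starting point, counting points per equal-length interval on a single dimension and keeping the dense units, can be sketched as follows. This is a hypothetical helper for illustration only, not CLIQUE's actual implementation.

```python
from collections import Counter

def dense_units_1d(values, n_intervals, lo, hi, threshold):
    """Count the points falling into each equal-length interval of [lo, hi]
    on one dimension and return the indices of the dense intervals."""
    width = (hi - lo) / n_intervals
    # clamp so the value exactly at hi lands in the last interval
    counts = Counter(min(int((v - lo) / width), n_intervals - 1) for v in values)
    return {unit for unit, c in counts.items() if c > threshold}
```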
CLIQUE: the Method
• Partition the data space and count the number of points in each cell of the partition
  – Apriori property: a k-d cell cannot be dense if one of its (k−1)-d projections is not dense
• Identify clusters:
  – Determine dense units in all subspaces of interest, and connected dense units in all subspaces of interest
• Generate a minimal description for the clusters
  – Determine the minimal cover for each cluster
CLIQUE: An Example
[Figure: points plotted against age (20–60) in two 2-D subspaces, salary (×10,000) vs. age and vacation (weeks) vs. age; dense units found in each subspace (e.g., age in 30–50) are intersected to suggest a cluster in the full space.]
CLIQUE: Pros and Cons
• Automatically finds the subspaces of highest dimensionality with high-density clusters
• Insensitive to the order of the input
  – Does not presume any canonical data distribution
• Scales linearly with the size of the input
• Scales well with the number of dimensions
• The clustering result may be degraded, the price of the method’s simplicity
Bad Cases for CLIQUE
Parts of a cluster may be missed
A cluster from CLIQUE may contain noise
Biclustering
• Clustering both objects and attributes simultaneously
• Four requirements
  – Only a small set of objects participates in a cluster (bicluster)
  – A bicluster involves only a small number of attributes
  – An object may participate in multiple biclusters, or in none
  – An attribute may be involved in multiple biclusters, or in none
Application Examples
• Recommender systems – Objects: users – Attributes: items – Values: user ratings
• Microarray data – Objects: genes – Attributes: samples – Values: expression levels
[Figure: an n × m data matrix with genes as rows and samples/conditions as columns, entries w_11 … w_nm.]
Biclusters with Constant Values
      …  b6  …  b12  …  b36  …  b99  …
a1    …  60  …  60   …  60   …  60   …
…     …  …   …  …    …  …    …  …    …
a33   …  60  …  60   …  60   …  60   …
…     …  …   …  …    …  …    …  …    …
a86   …  60  …  60   …  60   …  60   …
…     …  …   …  …    …  …    …  …    …

Figure 11.5: A gene-condition matrix, a submatrix, and a bi-cluster.
subset of products. For example, AllElectronics is highly interested in finding a group of customers who all like the same group of products. Such a cluster is a submatrix in the customer-product matrix, where all elements have a high value. Using such a cluster, AllElectronics can make recommendations in two directions. First, the company can recommend products to new customers who are similar to the customers in the cluster. Second, the company can recommend to customers new products that are similar to those involved in the cluster.

As with bi-clusters in a gene expression data matrix, the bi-clusters in a customer-product matrix usually have the following characteristics:

• Only a small set of customers participate in a cluster;
• A cluster involves only a small subset of products;
• A customer can participate in multiple clusters, or may not participate in any cluster at all; and
• A product may be involved in multiple clusters, or may not be involved in any cluster at all.

Bi-clustering can be applied to customer-product matrices to mine clusters satisfying the above requirements.
Types of Bi-clusters
“How can we model bi-clusters and mine them?” Let’s start with some basic notation. For the sake of simplicity, we’ll use “genes” and “conditions” to refer to the two dimensions in our discussion. Our discussion can easily be extended to other applications. For example, we can simply replace “genes” and “conditions” by “customers” and “products” to tackle the customer-product bi-clustering problem.

Let A = {a1, …, an} be a set of genes and B = {b1, …, bm} be a set of conditions. Let E = [e_ij] be a gene expression data matrix, that is, a gene-condition matrix, where 1 ≤ i ≤ n and 1 ≤ j ≤ m. A submatrix I × J is
10  10  10  10  10
20  20  20  20  20
50  50  50  50  50
 0   0   0   0   0

Figure 11.6: A bi-cluster with constant values on rows.
10  50  30   70  20
20  60  40   80  30
50  90  70  110  60
 0  40  20   60  10

Figure 11.7: A bi-cluster with coherent values.
defined by a subset I ⊆ A of genes and a subset J ⊆ B of conditions. For example, in the matrix shown in Figure 11.5, {a1, a33, a86} × {b6, b12, b36, b99} is a submatrix.

A bi-cluster is a submatrix where genes and conditions follow consistent patterns. We can define different types of bi-clusters based on such patterns:

• As the simplest case, a submatrix I × J (I ⊆ A, J ⊆ B) is a bi-cluster with constant values if for any i ∈ I and j ∈ J, e_ij = c, where c is a constant. For example, the submatrix {a1, a33, a86} × {b6, b12, b36, b99} in Figure 11.5 is a bi-cluster with constant values.

• A bi-cluster is interesting if each row has a constant value, though different rows may have different values. A bi-cluster with constant values on rows is a submatrix I × J such that for any i ∈ I and j ∈ J, e_ij = c + α_i, where α_i is the adjustment for row i. For example, Figure 11.6 shows a bi-cluster with constant values on rows. Symmetrically, a bi-cluster with constant values on columns is a submatrix I × J such that for any i ∈ I and j ∈ J, e_ij = c + β_j, where β_j is the adjustment for column j.

• More generally, a bi-cluster is interesting if the rows change in a synchronized way with respect to the columns and vice versa. Mathematically, a bi-cluster with coherent values (also known as a pattern-based cluster) is a submatrix I × J such that for any i ∈ I and j ∈ J, e_ij = c + α_i + β_j, where α_i and β_j are the adjustments for row i and column j, respectively. For example, Figure 11.7 shows a bi-cluster with coherent values.

It can be shown that I × J is a bi-cluster with coherent values if and only if for any i1, i2 ∈ I and j1, j2 ∈ J, e_{i1 j1} − e_{i2 j1} = e_{i1 j2} − e_{i2 j2}. Moreover, instead of using addition, we can define bi-clusters with coherent
On rows
Biclusters with Coherent Values
• Also known as pattern-based clusters
Biclusters with Coherent Evolutions
• Only up- or down-regulated changes over rows or columns
10   50  30    70  20
20  100  50  1000  30
50  100  90   120  80
 0   80  20   100  10

Figure 11.8: A bi-cluster with coherent evolutions on rows.
values using multiplication, that is, e_ij = c · α_i · β_j. Clearly, bi-clusters with constant values on rows or columns are special cases of bi-clusters with coherent values.

• In some applications, we may only be interested in the up- or down-regulated changes across genes or conditions without constraining the exact values. A bi-cluster with coherent evolutions on rows is a submatrix I × J such that for any i1, i2 ∈ I and j1, j2 ∈ J, (e_{i1 j1} − e_{i1 j2})(e_{i2 j1} − e_{i2 j2}) ≥ 0. For example, Figure 11.8 shows a bi-cluster with coherent evolutions on rows. Symmetrically, we can define bi-clusters with coherent evolutions on columns.
Next, we study how to mine bi-clusters.
Bi-clustering Methods
The above specification of the types of bi-clusters only considers ideal cases. In real data sets, such perfect bi-clusters rarely exist. When they do exist, they are usually very small. Instead, random noise can affect the readings of e_ij and thus prevent a bi-cluster in nature from appearing in a perfect shape.

There are two major types of methods for discovering bi-clusters in data that may come with noise. Optimization-based methods conduct an iterative search. At each iteration, the submatrix with the highest significance score is identified as a bi-cluster. The process terminates when a user-specified condition is met. Due to cost concerns in computation, greedy search is often employed to find locally optimal bi-clusters. Enumeration methods use a tolerance threshold to specify the degree of noise allowed in the bi-clusters to be mined, and then try to enumerate all submatrices of bi-clusters that satisfy the requirements. We use the δ-Cluster and MaPle algorithms as examples to illustrate these ideas.
Optimization Using the δ-Cluster Algorithm
For a submatrix, I × J, the mean of the i-th row is

e_{iJ} = \frac{1}{|J|} \sum_{j \in J} e_{ij}. \qquad (11.16)
Coherent evolutions on rows
Jian Pei: Big Data Analytics -- Clustering 88
Differences from Subspace Clustering
• Subspace clustering uses global distance/similarity measure
• Pattern-based clustering looks at patterns • A subspace cluster according to a globally
defined similarity measure may not follow the same pattern
Jian Pei: Big Data Analytics -- Clustering 89
Objects Follow the Same Pattern?
(Figure: the values of two objects, shown in blue and green, plotted over dimensions D1 and D2; pScore measures how far the two curves deviate from a parallel pattern.)
The smaller the pScore, the more consistent the two objects are.
Jian Pei: Big Data Analytics -- Clustering 90
Pattern-based Clusters
• pScore: the similarity between two objects rx, ry on two attributes au, av
• δ-pCluster (R, D): for any objects rx, ry∈R and any attributes au, av∈D,
pScore\left( \begin{bmatrix} r_x.a_u & r_x.a_v \\ r_y.a_u & r_y.a_v \end{bmatrix} \right) = \left| (r_x.a_u - r_y.a_u) - (r_x.a_v - r_y.a_v) \right|

pScore\left( \begin{bmatrix} r_x.a_u & r_x.a_v \\ r_y.a_u & r_y.a_v \end{bmatrix} \right) \le \delta \quad (\delta \ge 0)
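The definition above transcribes directly into a few lines of Python. This is a minimal sketch with our own function names (not code from the course); objects are modeled as dicts mapping attribute names to values.

```python
from itertools import combinations

# Hypothetical sketch of pScore and the delta-pCluster test (our naming).
def p_score(rx, ry, u, v):
    """pScore of objects rx, ry on attributes u, v."""
    return abs((rx[u] - ry[u]) - (rx[v] - ry[v]))

def is_delta_pcluster(objects, attrs, delta):
    """(R, D) is a delta-pCluster if every pair of objects and every
    pair of attributes has pScore at most delta."""
    return all(
        p_score(rx, ry, u, v) <= delta
        for rx, ry in combinations(objects, 2)
        for u, v in combinations(attrs, 2)
    )

# Two objects that differ by a constant shift form a 0-pCluster.
r1 = {"a": 1.0, "b": 2.0, "c": 3.0}
r2 = {"a": 4.0, "b": 5.0, "c": 6.0}
print(is_delta_pcluster([r1, r2], ["a", "b", "c"], delta=0.0))  # True
```

Note the shift-pattern reading: pScore is 0 exactly when the two objects change by the same amount between the two attributes.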
Jian Pei: Big Data Analytics -- Clustering 91
Maximal pCluster
• If (R, D) is a δ-pCluster, then every sub-cluster (R', D'), where R' ⊆ R and D' ⊆ D, is also a δ-pCluster
– An anti-monotonic property
– A large pCluster is accompanied by many small pClusters: enumerating them all is inefficient
• Idea: mine only the maximal pClusters!
– A δ-pCluster is maximal if there exists no proper super-cluster that is also a δ-pCluster
Jian Pei: Big Data Analytics -- Clustering 92
Mining Maximal pClusters
• Given – A cluster threshold δ – An attribute threshold mina – An object threshold mino
• Task: mine the complete set of significant maximal δ-pClusters – A significant δ-pCluster has at least mino objects
on at least mina attributes
Jian Pei: Big Data Analytics -- Clustering 93
pClusters and Frequent Itemsets
• A transaction database can be modeled as a binary matrix
• Frequent itemset: a sub-matrix of all 1's
– A 0-pCluster on binary data
– min_o: the support threshold
– min_a: no fewer than min_a attributes
– Maximal pClusters correspond to closed itemsets
• Frequent itemset mining algorithms cannot be extended straightforwardly for mining pClusters on numeric data
Jian Pei: Big Data Analytics -- Clustering 94
Where Should We Start from?
• How about the pClusters having only 2 objects or 2 attributes? – MDS (maximal dimension set) – A pCluster must have at least 2 objects and 2
attributes • Finding MDSs
Attribute    a   b   c   d   e   f   g   h
Object x    13  11   9   7   9  13   2  15
Object y     7   4  10   1  12   3   4   7
x - y        6   7  -1   6  -3  10  -2   8
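For two objects, the pScore on attributes a_u, a_v equals |d_u − d_v|, where d = x − y is the difference row in the table above. So an attribute set D is a δ-pCluster for {x, y} exactly when max d − min d over D is at most δ, and the MDSs are the maximal windows over the sorted differences. A hedged sketch of this idea (our own naming, not the MaPle implementation):

```python
# Sketch: maximal dimension sets (MDSs) for a pair of objects via a
# sliding window over the sorted difference values x - y (our code).
def maximal_dimension_sets(x, y, delta):
    diffs = sorted((x[a] - y[a], a) for a in x)   # attributes sorted by x - y
    vals = [v for v, _ in diffs]
    n = len(vals)
    sets, end = [], 0
    for start in range(n):
        end = max(end, start)
        # grow the window while the value range stays within delta
        while end + 1 < n and vals[end + 1] - vals[start] <= delta:
            end += 1
        # keep only windows that cannot be extended to the left,
        # and that have at least 2 attributes
        if (start == 0 or vals[end] - vals[start - 1] > delta) and end - start + 1 >= 2:
            sets.append(sorted(a for _, a in diffs[start:end + 1]))
    return sets

# The table's values:
x = {"a": 13, "b": 11, "c": 9, "d": 7, "e": 9, "f": 13, "g": 2, "h": 15}
y = {"a": 7, "b": 4, "c": 10, "d": 1, "e": 12, "f": 3, "g": 4, "h": 7}
print(maximal_dimension_sets(x, y, delta=1))
```

With δ = 1 this yields the MDSs {e, g}, {c, g}, {a, b, d}, and {b, h} for the pair {x, y}.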
Jian Pei: Big Data Analytics -- Clustering 95
How to Assemble Larger pClusters? • Systematically enumerate
every combination of attributes D – For each attribute subset,
find the maximal subsets of objects R s.t. (R, D) is a pCluster
– Check whether (R, D) is maximal
• Prune search branches as early as possible
• Why attribute-first-object-later? – # of objects >> # attributes
• Algorithm MaPle (Pei et al, 2003)
Jian Pei: Big Data Analytics -- Clustering 96
More Pruning Techniques
• Only possible attributes should be considered to get larger pClusters
• Pruning local maximal pClusters having insufficient possible attributes
• Extracting common attributes from possible attribute set directly
• Prune non-maximal pClusters
Jian Pei: Big Data Analytics -- Clustering 97
Gene-Sample-Time Series Data
(Figure: gene-sample-time series data forms a 3-D cube with Gene, Sample, and Time axes; each cell holds the expression level of gene i on sample j at time k. Slicing the cube yields the Gene-Time, Gene-Sample, and Sample-Time matrices.)
Jian Pei: Big Data Analytics -- Clustering 98
Mining GST Microarray Data
• Reduce the gene-sample-time series data to gene-sample data
– Use Pearson's correlation coefficient as the coherence measure
Jian Pei: Big Data Analytics -- Clustering 99
Basic Approaches
• Sample-gene search – Enumerate the subsets of samples
systematically – For each subset of samples, find the genes that
are coherent on the samples • Gene-sample search
– Enumerate the subsets of genes systematically – For each subset of genes, find the samples on
which the genes are coherent
Jian Pei: Big Data Analytics -- Clustering 100
Basic Tools
• Set enumeration tree • Sample-gene search and gene-sample
search are not symmetric! – Many genes, but a few samples – No requirement on samples coherent on genes
Jian Pei: Big Data Analytics -- Clustering 101
Phenotypes and Informative Genes
(Figure: a gene × sample matrix over samples 1-7; the informative genes show expression patterns that distinguish the sample groups, while the non-informative genes show no such pattern.)
Jian Pei: Big Data Analytics -- Clustering 102
The Phenotype Mining Problem
• Input: a microarray matrix and k • Output: phenotypes and informative genes
– Partitioning the samples into k exclusive subsets – phenotypes
– Informative genes discriminating the phenotypes
• Machine learning methods – Heuristic search – Mutual reinforcing adjustment
Jian Pei: Big Data Analytics -- Clustering 103
Requirements
• The expression levels of each informative gene should be similar over the samples within each phenotype
• The expression levels of each informative gene should display a clear dissimilarity between each pair of phenotypes
Jian Pei: Big Data Analytics -- Clustering 104
Intra-phenotype Consistency
• In a subset of genes (candidate informative genes), does every gene have good consistency on a set of samples?
• Average of variance of the subset of genes – the smaller the intra-phenotype consistency, the better
Con(G', S') = \frac{1}{|G'| \cdot |S'|} \sum_{g_i \in G'} \sum_{s_j \in S'} (w_{i,j} - \bar{w}_{i,S'})^2

where \bar{w}_{i,S'} is the average expression level of gene g_i over the samples in S'.
Jian Pei: Big Data Analytics -- Clustering 105
Inter-phenotype Divergence
• How a subset of genes (candidate informative genes) can discriminate two phenotypes of samples?
• Sum of the average difference between the phenotypes – the larger the inter-phenotype divergence, the better
Div(G', (S_1, S_2)) = \frac{\sum_{g_i \in G'} |\bar{w}_{i,S_1} - \bar{w}_{i,S_2}|}{|G'|}
Jian Pei: Big Data Analytics -- Clustering 106
Quality of Phenotypes and Informative Genes
• The higher the value, the better the quality
\Omega = \sum_{S_i, S_j\ (1 \le i, j \le K;\ i \ne j)} \left( \frac{1}{Con(G', S_i) + Con(G', S_j)} + Div(G', (S_i, S_j)) \right)
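The three measures can be prototyped in a few lines of numpy. This is a sketch: the matrix W (genes × samples), the index-list conventions, and iterating Ω over ordered pairs (so each unordered pair counts twice) are our own reading of the slides.

```python
import numpy as np

# Sketch of Con, Div, and Omega (our own names and conventions).
# W: genes x samples matrix; a phenotype is a list of column indices.

def con(W, genes, samples):
    """Intra-phenotype consistency: mean squared deviation of each
    candidate informative gene from its per-gene mean over the samples."""
    sub = W[np.ix_(genes, samples)]
    return np.mean((sub - sub.mean(axis=1, keepdims=True)) ** 2)

def div(W, genes, s1, s2):
    """Inter-phenotype divergence: average |mean difference| per gene."""
    m1 = W[np.ix_(genes, s1)].mean(axis=1)
    m2 = W[np.ix_(genes, s2)].mean(axis=1)
    return np.mean(np.abs(m1 - m2))

def quality(W, genes, phenotypes):
    """Omega, summed over ordered pairs of distinct phenotypes."""
    total = 0.0
    for i, si in enumerate(phenotypes):
        for j, sj in enumerate(phenotypes):
            if i != j:
                total += 1.0 / (con(W, genes, si) + con(W, genes, sj)) \
                         + div(W, genes, si, sj)
    return total

# Two genes, four samples, two phenotypes {0,1} and {2,3} (toy data).
W = np.array([[1.0, 1.2, 5.0, 5.2],
              [2.0, 2.1, 8.0, 8.3]])
print(quality(W, [0, 1], [[0, 1], [2, 3]]))
```

Smaller Con and larger Div both push Ω up, matching the slide's "the higher the value, the better the quality".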
Jian Pei: Big Data Analytics -- Clustering 107
Heuristic Search
• Start from a random subset of genes and an arbitrary partition of the samples
• Iteratively adjust the partition and the gene set toward a better solution – For each possible adjustment, compute ΔΩ
• For each gene, try possible insert/remove • For each sample, try the best movement
– ΔΩ > 0 → conduct the adjustment
– ΔΩ < 0 → conduct the adjustment with probability e^{\frac{\Delta\Omega}{\Omega} \cdot T(i)}
• T(i) is a decreasing simulated annealing function and i is the iteration number; T(i) = 1/(i+1) in our implementation
Jian Pei: Big Data Analytics -- Clustering 108
Possible Adjustments
Insert a gene Remove a gene Move a sample
Jian Pei: Big Data Analytics -- Clustering 109
Disadvantages of Heuristic Search
• Samples and genes are examined and adjusted with equal chances – # samples << # genes – Samples should play more important roles
• Outliers in the samples should be handled specifically
– Outliers strongly interfere with the quality measure and the adjustment decisions
Jian Pei: Big Data Analytics -- Clustering 110
Mutual Reinforcing Adjustment
• A two-phase approach – Iteration phase – Refinement phase
• Mutual reinforcement – Use gene partition to improve the sample
partition – Use the sample partition to improve the gene
partition
Jian Pei: Big Data Analytics -- Clustering 111
Dimensionality Reduction
• Clustering a high dimensional data set is challenging – Distance between two points could be dominated by
noise • Dimensionality reduction: choosing the informative
dimensions for clustering analysis – Feature selection: choosing a subset of existing
dimensions – Feature construction: construct a new (small) set of
informative attributes
Jian Pei: Big Data Analytics -- Clustering 112
Variance and Covariance
• Given a set of 1-d points, how different are those points? – Standard deviation: – Variance:
• Given a set of 2-d points, are the two dimensions correlated? – Covariance:
s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} \qquad s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}

cov(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}
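These sample formulas, with the n − 1 denominator, correspond to ddof=1 in numpy. A quick numeric check on toy data:

```python
import numpy as np

# Verify the sample variance and covariance formulas against numpy (ddof=1).
X = np.array([1.0, 2.0, 3.0, 4.0])
Y = np.array([2.0, 4.0, 6.0, 8.0])

var_manual = np.sum((X - X.mean()) ** 2) / (len(X) - 1)
cov_manual = np.sum((X - X.mean()) * (Y - Y.mean())) / (len(X) - 1)

print(var_manual, np.var(X, ddof=1))             # both 5/3
print(cov_manual, np.cov(X, Y, ddof=1)[0, 1])    # both 10/3
```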
Jian Pei: Big Data Analytics -- Clustering 113
Principal Components
Art work and example from http://csnet.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
Jian Pei: Big Data Analytics -- Clustering 114
Step 1: Mean Subtraction
• Subtract the mean from each dimension for each data point
• Intuition: centralizing the data set
Jian Pei: Big Data Analytics -- Clustering 115
Step 2: Covariance Matrix
C = \begin{pmatrix} cov(D_1, D_1) & cov(D_1, D_2) & \cdots & cov(D_1, D_n) \\ cov(D_2, D_1) & cov(D_2, D_2) & \cdots & cov(D_2, D_n) \\ \vdots & \vdots & \ddots & \vdots \\ cov(D_n, D_1) & cov(D_n, D_2) & \cdots & cov(D_n, D_n) \end{pmatrix}
Jian Pei: Big Data Analytics -- Clustering 116
Step 3: Eigenvectors and Eigenvalues
• Compute the eigenvectors and the eigenvalues of the covariance matrix
– Intuition: find the direction-invariant vectors as candidates for new attributes
– Eigenvalues indicate how much the direction-invariant vectors are scaled: the larger the eigenvalue, the more of the data variance that direction manifests
Jian Pei: Big Data Analytics -- Clustering 117
Step 4: Forming New Features
• Choose the principal components and form new features
– Typically, choose the top-k components
Jian Pei: Big Data Analytics -- Clustering 118
New Features
NewData = RowFeatureVector x RowDataAdjust
The first principal component is used
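Steps 1-4 can be sketched end-to-end with numpy. The variable names and the synthetic correlated data are our own, purely for illustration:

```python
import numpy as np

# A minimal PCA sketch following steps 1-4 above (numpy only).
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 2)) @ np.array([[2.0, 1.0],
                                             [0.0, 0.5]])   # correlated 2-d data

# Step 1: mean subtraction (centralize the data set)
centered = data - data.mean(axis=0)
# Step 2: covariance matrix
C = np.cov(centered, rowvar=False)
# Step 3: eigenvectors and eigenvalues, sorted by decreasing eigenvalue
eigvals, eigvecs = np.linalg.eigh(C)       # eigh returns ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
# Step 4: keep the top-k components and form the new features
k = 1
new_data = centered @ eigvecs[:, :k]
print(new_data.shape)  # (100, 1)
```

The variance of the projected data equals the top eigenvalue, which is exactly the "larger eigenvalue manifests more variance" intuition from step 3.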
Clustering in Derived Space
Jian Pei: Big Data Analytics -- Clustering 119
(Figure: 2-d points in the X-Y plane with origin O, projected onto the first principal component, the direction -0.707x + 0.707y; clustering is conducted in this derived space.)
Spectral Clustering
Jian Pei: Big Data Analytics -- Clustering 120
Data → Affinity matrix W = [w_{ij}] → A = f(W) → computing the k leading eigenvectors of A (Av = λv) → clustering in the new space → projecting back to cluster the original data
Affinity Matrix
• Using a distance measure, where σ is a scaling parameter controlling how fast the affinity Wij decreases as the distance increases
• In the Ng-Jordan-Weiss algorithm, Wii is set to 0
Jian Pei: Big Data Analytics -- Clustering 121
W_{ij} = e^{-\frac{dist(o_i, o_j)}{\sigma}}
Clustering
• In the Ng-Jordan-Weiss algorithm, we define a diagonal matrix such that
• Then, • Use the k leading eigenvectors to form a
new space • Map the original data to the new space and
conduct clustering Jian Pei: Big Data Analytics -- Clustering 122
D_{ii} = \sum_{j=1}^{n} W_{ij} \qquad A = D^{-\frac{1}{2}} W D^{-\frac{1}{2}}
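The pipeline can be sketched in numpy. The affinity follows the slide's e^{-dist/σ} formula, and the small k-means helper (with a farthest-point initialization) is our own stand-in for the final clustering step, not part of the algorithm specification:

```python
import numpy as np

# Sketch of the Ng-Jordan-Weiss pipeline described above (our naming).
def njw_embedding(points, sigma, k):
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    W = np.exp(-d / sigma)                    # affinity, per the slide's formula
    np.fill_diagonal(W, 0.0)                  # Ng-Jordan-Weiss sets W_ii = 0
    D_inv_sqrt = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
    A = D_inv_sqrt @ W @ D_inv_sqrt           # A = D^{-1/2} W D^{-1/2}
    eigvals, eigvecs = np.linalg.eigh(A)
    V = eigvecs[:, np.argsort(eigvals)[::-1][:k]]   # k leading eigenvectors
    return V / np.linalg.norm(V, axis=1, keepdims=True)  # row-normalize

def simple_kmeans(X, k, iters=20):
    centers = [X[0]]                           # greedy farthest-point init (our choice)
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d2)])
    centers = np.array(centers)
    for _ in range(iters):                     # Lloyd iterations
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(axis=2), axis=1)
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels

rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 0.1, (10, 2)), rng.normal(5, 0.1, (10, 2))])
labels = simple_kmeans(njw_embedding(pts, sigma=1.0, k=2), 2)
print(labels)
```

The original Ng-Jordan-Weiss paper uses the squared-distance affinity e^{-dist²/2σ²}; for this well-separated toy data the slide's simpler form behaves the same way.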
Is a Clustering Good?
• Feasibility – Applying any clustering methods on a uniformly
distributed data set is meaningless • Quality
– Are the clustering results meeting users' interest?
– Clustering patients into clusters corresponding to various diseases or sub-phenotypes is meaningful
– Clustering patients into clusters corresponding to male or female is not meaningful
Jian Pei: Big Data Analytics -- Clustering 123
Major Tasks
• Assessing clustering tendency – Are there non-random structures in the data?
• Determining the number of clusters or other critical parameters
• Measuring clustering quality
Jian Pei: Big Data Analytics -- Clustering 124
Uniformly Distributed Data
• Clustering uniformly distributed data is meaningless
• A uniformly distributed data set is generated by a uniform data distribution
Jian Pei: Big Data Analytics -- Clustering 125
Figure 10.21: A data set that is uniformly distributed in the data space.
• Measuring clustering quality. After applying a clustering method on a data set, we want to assess how good the resulting clusters are. A number of measures can be used. Some methods measure how well the clusters fit the data set, while others measure how well the clusters match the ground truth, if such truth is available. There are also measures that score clusterings and thus can compare two sets of clustering results on the same data set.
In the rest of this section, we discuss each of the above three topics.
10.6.1 Assessing Clustering Tendency
Clustering tendency assessment determines whether a given data set has a non-random structure, which may lead to meaningful clusters. Consider a data set that does not have any non-random structure, such as a set of uniformly distributed points in a data space. Even though a clustering algorithm may return clusters for the data, those clusters are random and are not meaningful.
Example 10.9 Clustering requires a non-uniform distribution of data. Figure 10.21 shows a data set that is uniformly distributed in 2-dimensional data space. Although a clustering algorithm may still artificially partition the points into groups, the groups are unlikely to mean anything significant to the application due to the uniform distribution of the data.
“How can we assess the clustering tendency of a data set?” Intuitively, we can try to measure the probability that the data set is generated by a uniform data distribution. This can be achieved using statistical tests for spatial randomness. To illustrate this idea, let's look at a simple yet effective statistic called the Hopkins Statistic.
The Hopkins Statistic is a spatial statistic that tests the spatial randomness of a variable as distributed in a space. Given a data set, D, which is regarded as a sample of a random variable, o, we want to determine how far away o is from being uniformly distributed in the data space. We calculate the Hopkins Statistic as follows:
Hopkins Statistic
• Hypothesis: the data is generated by a uniform distribution in a space
• Sample n points, p1, …, pn, uniformly from the space of D
• For each point pi, find the nearest neighbor of pi in D, let xi be the distance between pi and its nearest neighbor in D
Jian Pei: Big Data Analytics -- Clustering 126
x_i = \min_{v \in D} \{ dist(p_i, v) \}
Hopkins Statistic
• Sample n points, q1, …, qn, uniformly from D • For each qi, find the nearest neighbor of qi
in D – {qi}, let yi be the distance between qi and its nearest neighbor in D – {qi}
• Calculate the Hopkins Statistic H
Jian Pei: Big Data Analytics -- Clustering 127
y_i = \min_{v \in D, v \ne q_i} \{ dist(q_i, v) \}

H = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i}
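The two sampling steps above transcribe into a short numpy function. This is a sketch; the sample size, the random seeds, and using the bounding box of D as "the data space" are our own choices:

```python
import numpy as np

# Hopkins-statistic sketch (our naming and conventions).
def hopkins(D, n, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = D.min(axis=0), D.max(axis=0)
    # p_1..p_n: points sampled uniformly from the data space
    P = rng.uniform(lo, hi, size=(n, D.shape[1]))
    x = np.array([np.min(np.linalg.norm(D - p, axis=1)) for p in P])
    # q_1..q_n: points sampled from D; nearest neighbor taken in D - {q_i}
    idx = rng.choice(len(D), n, replace=False)
    y = []
    for i in idx:
        dists = np.linalg.norm(D - D[i], axis=1)
        dists[i] = np.inf                      # exclude q_i itself
        y.append(dists.min())
    y = np.array(y)
    return y.sum() / (x.sum() + y.sum())

rng = np.random.default_rng(1)
uniform = rng.uniform(0, 1, (500, 2))
clustered = np.vstack([rng.normal(0.2, 0.01, (250, 2)),
                       rng.normal(0.8, 0.01, (250, 2))])
print(hopkins(uniform, 50), hopkins(clustered, 50))
```

On the uniform set H comes out near 0.5, and on the two tight clusters it comes out near 0, matching the explanation that follows.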
Explanation
• If D is uniformly distributed, then \sum_{i=1}^{n} x_i and \sum_{i=1}^{n} y_i would be close to each other, and thus H would be around 0.5
• If D is skewed, then \sum_{i=1}^{n} y_i would be substantially smaller, and thus H would be close to 0
• If H > 0.5, then it is unlikely that D has statistically significant clusters
Jian Pei: Big Data Analytics -- Clustering 128
Finding the Number of Clusters
• Depending on many factors – The shape and scale of the distribution in the
data set – The clustering resolution required by the user
• Many methods exist
– Set k = \sqrt{n/2}; each cluster has \sqrt{2n} points on average
– Plot the sum of within-cluster variances with respect to k; find the first (or the most significant) turning point
Jian Pei: Big Data Analytics -- Clustering 129
A Cross-Validation Method • Divide the data set D into m parts • Use m – 1 parts to find a clustering • Use the remaining part as the test set to test
the quality of the clustering – For each point in the test set, find the closest
centroid or cluster center – Use the squared distances between all points in the
test set and the corresponding centroids to measure how well the clustering model fits the test set
• Repeat m times for each value of k, use the average as the quality measure
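The m-fold procedure can be sketched as below. The k-means routine (with a farthest-point initialization) is a stand-in for whatever clustering method is being evaluated; all names are our own:

```python
import numpy as np

# Sketch of the cross-validation quality measure for a given k (our code).
def kmeans_fit(X, k, iters=20):
    centers = [X[0]]                           # farthest-point init (our choice)
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d2)])
    centers = np.array(centers)
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(axis=2), axis=1)
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return centers

def cv_quality(X, k, m=5):
    folds = np.array_split(np.arange(len(X)), m)
    total = 0.0
    for test in folds:
        train = np.setdiff1d(np.arange(len(X)), test)
        centers = kmeans_fit(X[train], k)      # cluster on m - 1 parts
        # squared distance from each test point to its closest centroid
        total += ((X[test][:, None] - centers[None]) ** 2).sum(axis=2).min(axis=1).sum()
    return total / m

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.05, (60, 2)), rng.normal(3, 0.05, (60, 2))])
X = X[rng.permutation(len(X))]
print(cv_quality(X, 1), cv_quality(X, 2))
```

On this two-blob toy data the average test error drops sharply from k = 1 to k = 2, which is the turning-point signal the slide describes.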
Jian Pei: Big Data Analytics -- Clustering 130
Measuring Clustering Quality
• Ground truth: the ideal clustering determined by human experts
• Two situations – There is a known ground truth – the extrinsic
(supervised) methods, comparing the clustering against the ground truth
– The ground truth is unavailable – the intrinsic (unsupervised) methods, measuring how well the clusters are separated
Jian Pei: Big Data Analytics -- Clustering 131
Quality in Extrinsic Methods • Cluster homogeneity: the more pure the
clusters in a clustering are, the better the clustering
• Cluster completeness: objects in the same cluster in the ground truth should be clustered together
• Rag bag: putting a heterogeneous object into a pure cluster is worse than putting it into a rag bag
• Small cluster preservation: splitting a small cluster in the ground truth into pieces is worse than splitting a bigger one
Jian Pei: Big Data Analytics -- Clustering 132
Bcubed Precision and Recall
• D = {o1, …, on} – L(oi) is the cluster of oi given by the ground truth
• C is a clustering on D – C(oi) is the cluster-id of oi in C
• For two objects oi and oj, the correctness is 1 if L(oi) = L(oj) ⇔ C(oi) = C(oj), and 0 otherwise
Jian Pei: Big Data Analytics -- Clustering 133
Bcubed Precision and Recall
• Precision
• Recall
Jian Pei: Big Data Analytics -- Clustering 134
one, denoted by o, belong to the same category according to ground truth. Consider a clustering C2 identical to C1 except that o is assigned to a cluster C′ ≠ C in C2 such that C′ contains objects from various categories according to ground truth, and thus is noisy. In other words, C′ in C2 is a rag bag. Then, a clustering quality measure Q respecting the rag bag criterion should give a higher score to C2, that is, Q(C2, Cg) > Q(C1, Cg).
• Small cluster preservation. If a small category is split into small pieces in a clustering, those small pieces may likely become noise and thus the small category cannot be discovered from the clustering. The small cluster preservation criterion states that splitting a small category into pieces is more harmful than splitting a large category into pieces. Consider an extreme case. Let D be a data set of n + 2 objects such that, according to the ground truth, n objects, denoted by o1, . . . , on, belong to one category and the other 2 objects, denoted by on+1, on+2, belong to another category. Suppose clustering C1 has three clusters, C1 = {o1, . . . , on}, C2 = {on+1}, and C3 = {on+2}. Let clustering C2 have three clusters, too, namely C1 = {o1, . . . , on−1}, C2 = {on}, and C3 = {on+1, on+2}. In other words, C1 splits the small category and C2 splits the big category. A clustering quality measure Q preserving small clusters should give a higher score to C2, that is, Q(C2, Cg) > Q(C1, Cg).
Many clustering quality measures satisfy some of the above four criteria. Here, we introduce the BCubed precision and recall metrics, which satisfy all of the above criteria.
BCubed evaluates the precision and recall for every object in a clustering on a given data set according to the ground truth. The precision of an object indicates how many other objects in the same cluster belong to the same category as the object. The recall of an object reflects how many objects of the same category are assigned to the same cluster.
Formally, let D = {o1, . . . , on} be a set of objects, and C be a clustering on D. Let L(oi) (1 ≤ i ≤ n) be the category of oi given by ground truth, and C(oi) be the cluster ID of oi in C. Then, for two objects, oi and oj (1 ≤ i, j ≤ n, i ≠ j), the correctness of the relation between oi and oj in clustering C is given by
Correctness(o_i, o_j) = \begin{cases} 1 & \text{if } L(o_i) = L(o_j) \Leftrightarrow C(o_i) = C(o_j) \\ 0 & \text{otherwise} \end{cases} \qquad (10.28)
BCubed precision is defined as

Precision\ BCubed = \frac{\sum_{i=1}^{n} \frac{\sum_{o_j : i \ne j,\, C(o_i) = C(o_j)} Correctness(o_i, o_j)}{\| \{ o_j \mid i \ne j,\, C(o_i) = C(o_j) \} \|}}{n}. \qquad (10.29)
BCubed recall is defined as

Recall\ BCubed = \frac{\sum_{i=1}^{n} \frac{\sum_{o_j : i \ne j,\, L(o_i) = L(o_j)} Correctness(o_i, o_j)}{\| \{ o_j \mid i \ne j,\, L(o_i) = L(o_j) \} \|}}{n}. \qquad (10.30)
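The two formulas transcribe almost literally into Python. One caveat: because the textbook sums exclude i = j, the denominator is zero for a singleton cluster or category; the max(1, …) guard below is our own addition for that unspecified case.

```python
# Sketch of BCubed precision and recall (our function names; the
# max(1, ...) guard for singletons is our addition, not the textbook's).
def bcubed(L, C):
    """L[i]: ground-truth category of o_i; C[i]: cluster ID of o_i."""
    n = len(L)

    def correctness(i, j):
        # 1 iff "same category" and "same cluster" agree (the iff in 10.28)
        return 1.0 if (L[i] == L[j]) == (C[i] == C[j]) else 0.0

    def avg_over(i, same):
        peers = [j for j in range(n) if j != i and same(i, j)]
        return sum(correctness(i, j) for j in peers) / max(1, len(peers))

    precision = sum(avg_over(i, lambda i, j: C[i] == C[j]) for i in range(n)) / n
    recall = sum(avg_over(i, lambda i, j: L[i] == L[j]) for i in range(n)) / n
    return precision, recall

L = [1, 1, 1, 2, 2]   # ground-truth categories
C = [1, 1, 2, 2, 2]   # clustering: o_3 is misplaced
print(bcubed(L, C))   # (0.6, 0.6)
```

The misplaced object o_3 drags down both its own score and the scores of the objects it is grouped with, which is how BCubed penalizes mixed clusters.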
Intrinsic Methods
When the ground truth of a data set is not available, we have to use an intrinsic method to assess the clustering quality. In general, intrinsic methods evaluate a clustering by examining how well the clusters are separated and how compact the clusters are. Many intrinsic methods take advantage of a similarity metric between objects in the data set.
The silhouette coefficient is such a measure. For a data set D of n objects, suppose D is partitioned into k clusters, C1, . . . , Ck. For each object o ∈ D, we calculate a(o) as the average distance between o and all other objects in the cluster to which o belongs. Similarly, b(o) is the minimum average distance from o to all clusters to which o does not belong. Formally, suppose o ∈ Ci (1 ≤ i ≤ k); then
a(o) = \frac{\sum_{o' \in C_i, o' \ne o} dist(o, o')}{|C_i| - 1} \qquad (10.31)

and

b(o) = \min_{C_j : 1 \le j \le k,\, j \ne i} \left\{ \frac{\sum_{o' \in C_j} dist(o, o')}{|C_j|} \right\}. \qquad (10.32)

The silhouette coefficient of o is then defined as

s(o) = \frac{b(o) - a(o)}{\max\{ a(o), b(o) \}}. \qquad (10.33)
The value of the silhouette coefficient is between −1 and 1. The value of a(o) reflects the compactness of the cluster to which o belongs. The smaller the value is, the more compact the cluster is. The value of b(o) captures the degree to which o is separated from other clusters. The larger b(o) is, the more separated o is from other clusters. Therefore, when the silhouette coefficient value of o approaches 1, the cluster containing o is compact and o is far away from other clusters, which is the preferable case. However, when the silhouette coefficient value is negative (that is, b(o) < a(o)), this means that, in expectation, o is closer to the objects in another cluster than to the objects in the same cluster as o. In many cases, this is a bad case and should be avoided.
To measure the fitness of a cluster within a clustering, we can compute the average silhouette coefficient value of all objects in the cluster. To measure the quality of a clustering, we can use the average silhouette coefficient value of all objects in the data set. The silhouette coefficient and other intrinsic measures
Silhouette Coefficient
• No ground truth is assumed • Suppose a data set D of n objects is partitioned
into k clusters, C1, …, Ck • For each object o,
– Calculate a(o), the average distance between o and every other object in the same cluster – compactness of a cluster, the smaller, the better
– Calculate b(o), the minimum average distance from o to all objects in a cluster that o does not belong to – degree of separation from other clusters, the larger, the better
Jian Pei: Big Data Analytics -- Clustering 135
Silhouette Coefficient
• Then
• Use the average silhouette coefficient of all objects as the overall measure
Jian Pei: Big Data Analytics -- Clustering 136
a(o) = \frac{\sum_{o' \in C_i, o' \ne o} dist(o, o')}{|C_i| - 1} \qquad b(o) = \min_{C_j : o \notin C_j} \left\{ \frac{\sum_{o' \in C_j} dist(o, o')}{|C_j|} \right\}

s(o) = \frac{b(o) - a(o)}{\max\{ a(o), b(o) \}}
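The definitions of a(o), b(o), and s(o) can be sketched directly in numpy (our own naming; we assume every cluster has at least two objects, so |C_i| − 1 > 0):

```python
import numpy as np

# Silhouette-coefficient sketch following the slide's formulas (our code).
def silhouette(X, labels):
    s = []
    for i, o in enumerate(X):
        d = np.linalg.norm(X - o, axis=1)
        same = (labels == labels[i])
        # a(o): average distance to the other objects in o's own cluster
        a = d[same & (np.arange(len(X)) != i)].mean()
        # b(o): minimum average distance to the objects of any other cluster
        b = min(d[labels == c].mean() for c in set(labels) if c != labels[i])
        s.append((b - a) / max(a, b))
    return np.array(s)

# Two compact, well-separated pairs of points.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels = np.array([0, 0, 1, 1])
print(silhouette(X, labels).mean())
```

With compact, far-apart clusters the average silhouette is close to 1; moving points between the two groups pushes it toward 0 and below.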
Multi-Clustering
• A data set may be clustered in different ways – In different subspaces, that is, using different
attributes – Using different similarity measures – Using different clustering methods
• Some different clusterings may capture different meanings of categorization – Orthogonal clusterings
• Putting users in the loop Jian Pei: Big Data Analytics -- Clustering 137