
Clustering

Jian Pei: Big Data Analytics -- Clustering 2

What Is Clustering?

•  Group data into clusters – Similar to one another within the same cluster – Dissimilar to the objects in other clusters – Unsupervised learning: no predefined classes

Cluster 1 Cluster 2

Outliers

Jian Pei: Big Data Analytics -- Clustering 3

Similarity and Dissimilarity

•  Distances are normally used as measures •  Minkowski distance: a generalization

•  If q = 2, d is Euclidean distance •  If q = 1, d is Manhattan distance •  If q = ∞, d is Chebyshev distance •  Weighted distance

$d(i,j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q\right)^{1/q} \quad (q > 0)$

$d(i,j) = \left(w_1 |x_{i1} - x_{j1}|^q + w_2 |x_{i2} - x_{j2}|^q + \cdots + w_p |x_{ip} - x_{jp}|^q\right)^{1/q} \quad (q > 0)$
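To make the formulas concrete, here is a small NumPy sketch (the helper name `minkowski` and the two sample points are illustrative, not from the slides):

```python
import numpy as np

def minkowski(x, y, q):
    """Minkowski distance of order q > 0 between two p-dimensional points."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    if np.isinf(q):                      # q = infinity: Chebyshev distance
        return np.abs(x - y).max()
    return (np.abs(x - y) ** q).sum() ** (1.0 / q)

x, y = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]
print(minkowski(x, y, 1))        # Manhattan distance: 5.0
print(minkowski(x, y, 2))        # Euclidean distance: ~3.61
print(minkowski(x, y, np.inf))   # Chebyshev distance: 3.0
```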

Jian Pei: Big Data Analytics -- Clustering 4

Manhattan and Chebyshev Distance

Picture from Wikipedia

Manhattan Distance

http://brainking.com/images/rules/chess/02.gif

Chebyshev Distance

When n = 2, this is the chessboard (king-move) distance

Jian Pei: Big Data Analytics -- Clustering 5

Properties of Minkowski Distance

•  Nonnegative: d(i,j) ≥ 0 •  The distance of an object to itself is 0

– d(i,i) = 0 •  Symmetric: d(i,j) = d(j,i) •  Triangle inequality

– d(i,j) ≤ d(i,k) + d(k,j)

Clustering Methods

•  K-means and partitioning methods •  Hierarchical clustering •  Density-based clustering •  Grid-based clustering •  Pattern-based clustering •  Other clustering methods

Jian Pei: Big Data Analytics -- Clustering 6


Jian Pei: Big Data Analytics -- Clustering 7

Partitioning Algorithms: Ideas

•  Partition n objects into k clusters –  Optimize the chosen partitioning criterion

•  Global optimum: examine all possible partitions –  roughly (kⁿ − (k−1)ⁿ − … − 1) possible partitions, too expensive!

•  Heuristic methods: k-means and k-medoids –  K-means: a cluster is represented by the center –  K-medoids or PAM (partition around medoids): each

cluster is represented by one of the objects in the cluster

Jian Pei: Big Data Analytics -- Clustering 8

K-means

•  Arbitrarily choose k objects as the initial cluster centers

•  Until no change, do –  (Re)assign each object to the cluster to which

the object is the most similar, based on the mean value of the objects in the cluster

– Update the cluster means, i.e., calculate the mean value of the objects for each cluster
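A minimal NumPy sketch of this loop, shown for illustration only (the function name, the random initialization, and the empty-cluster handling are my own choices):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain k-means on an (n, d) array X with k clusters."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # arbitrarily chosen initial centers
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # (Re)assign each object to the most similar (closest) center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)
        if np.array_equal(new_labels, labels):                # no change: stop
            break
        labels = new_labels
        # Update the cluster means (keep the old center if a cluster becomes empty)
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                            for j in range(k)])
    return labels, centers
```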

Jian Pei: Big Data Analytics -- Clustering 9

K-Means: Example

[Figure: k-means on 2-d points with K = 2. Arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign objects and update the means again until no change.]

Jian Pei: Big Data Analytics -- Clustering 10

Pros and Cons of K-means

•  Relatively efficient: O(tkn) – n: # objects, k: # clusters, t: # iterations; k, t <<

n. •  Often terminates at a local optimum •  Applicable only when the mean is defined

– What about categorical data? •  Need to specify the number of clusters •  Unable to handle noisy data and outliers •  Unsuitable for discovering clusters with non-convex shapes

Jian Pei: Big Data Analytics -- Clustering 11

Variations of the K-means •  Aspects of variations

–  Selection of the initial k means –  Dissimilarity calculations –  Strategies to calculate cluster means

•  Handling categorical data: k-modes –  Use mode instead of mean

•  Mode: the most frequent item(s) –  A mixture of categorical and numerical data: k-prototype

method •  EM (expectation maximization): assign a

probability of an object to a cluster

Jian Pei: Big Data Analytics -- Clustering 12

A Problem of K-means

•  Sensitive to outliers – Outlier: objects with extremely large values

•  May substantially distort the distribution of the data

•  K-medoids: the most centrally located object in a cluster

0 1 2 3 4 5 6 7 8 9 10

0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10

0 1 2 3 4 5 6 7 8 9 10

+ +


Jian Pei: Big Data Analytics -- Clustering 13

PAM: A K-medoids Method

•  PAM: partitioning around Medoids •  Arbitrarily choose k objects as the initial medoids •  Until no change, do

–  (Re)assign each object to the cluster with the nearest medoid

–  Randomly select a non-medoid object o’, compute the total cost, S, of swapping medoid o with o’

–  If S < 0 then swap o with o’ to form the new set of k medoids
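For illustration, a brute-force NumPy sketch of this idea; unlike the slide, it tries every possible medoid/non-medoid swap rather than a random one, and uses the squared-error criterion of the next slide as the cost:

```python
import numpy as np

def pam(X, k, max_iter=50, seed=0):
    """A small PAM sketch: medoids are data points; a swap is kept only if it lowers E."""
    rng = np.random.default_rng(seed)
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)    # pairwise distances
    medoids = list(rng.choice(n, size=k, replace=False))            # arbitrary initial medoids

    def error(meds):
        # E = sum over all objects of the squared distance to the nearest medoid
        return (dist[:, meds].min(axis=1) ** 2).sum()

    for _ in range(max_iter):
        improved = False
        for mi in range(k):
            for o in range(n):
                if o in medoids:
                    continue
                candidate = medoids[:mi] + [o] + medoids[mi + 1:]
                if error(candidate) < error(medoids):    # S = E_o' - E_o < 0: swap
                    medoids = candidate
                    improved = True
        if not improved:
            break
    labels = dist[:, medoids].argmin(axis=1)
    return medoids, labels
```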

Jian Pei: Big Data Analytics -- Clustering 14

Swapping Cost

•  Measure whether o’ is better than o as a medoid

•  Use the squared-error criterion

– Compute Eo’-Eo

– Negative: swapping brings benefit

$E = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, o_i)^2$

Jian Pei: Big Data Analytics -- Clustering 15

PAM: Example

[Figure: PAM on 2-d points with K = 2 (total cost = 20). Arbitrarily choose k objects as the initial medoids; assign each remaining object to the nearest medoid; randomly select a non-medoid object O_random and compute the total cost of swapping (here 26); swap O and O_random if the quality is improved; loop until no change.]

Jian Pei: Big Data Analytics -- Clustering 16

Pros and Cons of PAM

•  PAM is more robust than k-means in the presence of noise and outliers – Medoids are less influenced by outliers

•  PAM is efficient for small data sets but does not scale well for large data sets – O(k(n−k)²) per iteration

•  Sampling based method: CLARA

Jian Pei: Big Data Analytics -- Clustering 17

CLARA

•  CLARA: Clustering LARge Applications (Kaufmann and Rousseeuw, 1990) – Built into statistical analysis packages, such as S+

•  Draw multiple samples of the data set, apply PAM on each sample, and return the best clustering

•  Performs better than PAM on larger data sets •  Efficiency depends on the sample size

– A good clustering on samples may not be a good clustering of the whole data set

Jian Pei: Big Data Analytics -- Clustering 18

CLARANS •  Clustering large applications based upon

randomized search •  The problem space graph of clustering

–  A vertex is a set of k medoids chosen from the n objects; there are $\binom{n}{k}$ vertices in total –  PAM searches the whole graph –  CLARA searches some random sub-graphs

•  CLARANS climbs hills –  Randomly sample a set and select k medoids –  Consider neighbors of the medoids as candidates for new medoids –  Use the sample set to verify –  Repeat multiple times to avoid bad samples


Hierarchy

•  An arrangement or classification of things according to inclusiveness

•  A natural way of abstraction, summarization, compression, and simplification for understanding

•  Typical setting: organize a given set of objects into a hierarchy – No or very little supervision – Some heuristic guidance on the quality of the hierarchy

Jian Pei: Big Data Analytics -- Clustering 19

•  Group data objects into a tree of clusters •  Top-down versus bottom-up

Jian Pei: Big Data Analytics -- Clustering 20

Hierarchical Clustering

[Figure: objects a, b, c, d, e are merged bottom-up into {a,b}, {d,e}, {c,d,e}, and finally {a,b,c,d,e} over Steps 0–4 (agglomerative, AGNES); read in the reverse direction, Steps 4–0 split the single cluster back into individual objects (divisive, DIANA).]

Jian Pei: Big Data Analytics -- Clustering 21

AGNES (Agglomerative Nesting)

•  Initially, each object is a cluster •  Step-by-step cluster merging, until all

objects form a cluster – Single-link approach – Each cluster is represented by all of the objects

in the cluster – The similarity between two clusters is measured

by the similarity of the closest pair of data points belonging to different clusters

Jian Pei: Big Data Analytics -- Clustering 22

Dendrogram

•  Show how to merge clusters hierarchically

•  Decompose data objects into a multi-level nested partitioning (a tree of clusters)

•  A clustering of the data objects: cutting the dendrogram at the desired level – Each connected component forms a cluster

Jian Pei: Big Data Analytics -- Clustering 23

DIANA (Divisive ANAlysis)

•  Initially, all objects are in one cluster •  Step-by-step splitting clusters until each

cluster contains only one object

[Figure: DIANA splitting a set of 2-d points step by step into smaller clusters.]

Jian Pei: Big Data Analytics -- Clustering 24

Distance Measures

•  Minimum distance: $d_{\min}(C_i, C_j) = \min_{p \in C_i,\, q \in C_j} d(p, q)$
•  Maximum distance: $d_{\max}(C_i, C_j) = \max_{p \in C_i,\, q \in C_j} d(p, q)$
•  Mean distance: $d_{mean}(C_i, C_j) = d(m_i, m_j)$
•  Average distance: $d_{avg}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{q \in C_j} d(p, q)$

m: mean of a cluster; C: a cluster; n: the number of objects in a cluster
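For reference, SciPy's hierarchical clustering implements the single-link (minimum), complete-link (maximum), and average-distance versions of these measures; a small usage sketch with made-up data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.5, 1.2], [1.2, 0.8],    # one small group
              [8.0, 8.0], [8.3, 7.7], [7.8, 8.4]])    # another small group

for method in ("single", "complete", "average"):       # min, max, average cluster distance
    Z = linkage(X, method=method)                      # agglomerative merge tree (dendrogram)
    labels = fcluster(Z, t=2, criterion="maxclust")    # cut the dendrogram into 2 clusters
    print(method, labels)
```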


Jian Pei: Big Data Analytics -- Clustering 25

Challenges

•  Hard to choose merge/split points – Never undo merging/splitting – Merging/splitting decisions are critical

•  High complexity O(n²) •  Integrating hierarchical clustering with other

techniques – BIRCH, CURE, CHAMELEON, ROCK

Jian Pei: Big Data Analytics -- Clustering 26

BIRCH

•  Balanced Iterative Reducing and Clustering using Hierarchies

•  CF (Clustering Feature) tree: a hierarchical data structure summarizing object info – Clustering objects à clustering leaf nodes of

the CF tree

Jian Pei: Big Data Analytics -- Clustering 27

Clustering Feature: CF = (N, LS, SS)

N: number of data points

LS: $\sum_{i=1}^{N} o_i$ (linear sum of the points)

SS: $\sum_{i=1}^{N} o_i^2$ (square sum of the points)

Example: the points (3,4), (2,6), (4,5), (4,7), (3,8) give CF = (5, (16,30), (54,190))

Clustering Feature Vector

Jian Pei: Big Data Analytics -- Clustering 28

CF-tree in BIRCH

•  Clustering feature: –  Summarizes the statistics for a cluster –  Many cluster quality measures (e.g., radius, distance) can be derived –  Additivity: CF1 + CF2 = (N1+N2, LS1+LS2, SS1+SS2)

•  A CF tree: a height-balanced tree storing the clustering features for a hierarchical clustering –  A nonleaf node in a tree has descendants or “children” –  The nonleaf nodes store sums of the CFs of children
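A tiny sketch of the CF triple and its additivity, reusing the five points of the earlier example (the helper names are mine):

```python
import numpy as np

def cf(points):
    """Clustering feature of a set of points: (N, LS, SS)."""
    P = np.asarray(points, dtype=float)
    return len(P), P.sum(axis=0), (P ** 2).sum(axis=0)

def cf_merge(cf1, cf2):
    """Additivity: CF1 + CF2 = (N1+N2, LS1+LS2, SS1+SS2)."""
    return cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2]

a = cf([(3, 4), (2, 6), (4, 5)])
b = cf([(4, 7), (3, 8)])
print(cf_merge(a, b))   # (5, array([16., 30.]), array([ 54., 190.]))
```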

Jian Pei: Big Data Analytics -- Clustering 29

CF Tree

[Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6. The root and non-leaf nodes hold entries (CF_i, child_i); leaf nodes hold CF entries and are chained by prev/next pointers.]

Jian Pei: Big Data Analytics -- Clustering 30

Parameters of a CF-tree

•  Branching factor: the maximum number of children

•  Threshold: max diameter of sub-clusters stored at the leaf nodes


Jian Pei: Big Data Analytics -- Clustering 31

BIRCH Clustering

•  Phase 1: scan DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data)

•  Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree

Jian Pei: Big Data Analytics -- Clustering 32

Pros & Cons of BIRCH

•  Linear scalability – Good clustering with a single scan – Quality can be further improved by a few

additional scans •  Can handle only numeric data •  Sensitive to the order of the data records

Jian Pei: Big Data Analytics -- Clustering 33

Drawbacks of Square Error Based Methods •  One representative per cluster

– Good only for convex-shaped clusters of similar size and density

•  k, the number of clusters, is a parameter – Good only if k can be reasonably estimated

Jian Pei: Big Data Analytics -- Clustering 34

CURE: the Ideas

•  Each cluster has c representatives – Choose c well scattered points in the cluster – Shrink them towards the mean of the cluster by

a fraction of α – The representatives capture the physical shape

and geometry of the cluster •  Merge the closest two clusters

– Distance of two clusters: the distance between the two closest representatives

Jian Pei: Big Data Analytics -- Clustering 35

CURE: The Algorithm

•  Draw a random sample S •  Partition the sample into p partitions •  Partially cluster each partition •  Eliminate outliers

–  Random sampling + remove clusters growing too slowly

•  Cluster partial clusters until only k clusters left –  Shrink representatives of clusters towards the cluster

center

Jian Pei: Big Data Analytics -- Clustering 36

Data Partitioning and Clustering



Jian Pei: Big Data Analytics -- Clustering 37

Shrinking Representative Points

•  Shrink the multiple representative points towards the gravity center by a fraction of α

•  Representatives capture the shape


Jian Pei: Big Data Analytics -- Clustering 38

Clustering Categorical Data: ROCK •  Robust Clustering using links

– # of common neighbors between two points – Use links to measure similarity/proximity – Not distance based – 

•  Basic ideas: – Similarity function and neighbors:

•  Let T1 = {1,2,3}, T2={3,4,5}

– Computational complexity: $O(n^2 + n m_m m_a + n^2 \log n)$

$Sim(T_1, T_2) = \dfrac{|T_1 \cap T_2|}{|T_1 \cup T_2|}$

$Sim(T_1, T_2) = \dfrac{|\{3\}|}{|\{1,2,3,4,5\}|} = \dfrac{1}{5} = 0.2$

Jian Pei: Big Data Analytics -- Clustering 39

Limitations

•  Merging decision based on static modeling – No special characteristics of clusters are

considered

C1 C2 C1’ C2’

CURE and BIRCH merge C1 and C2 C1’ and C2’ are more appropriate for merging

Jian Pei: Big Data Analytics -- Clustering 40

Chameleon

•  Hierarchical clustering using dynamic modeling •  Measures the similarity based on a dynamic model

–  Interconnectivity & closeness (proximity) between two clusters vs interconnectivity of the clusters and closeness of items within the clusters

•  A two-phase algorithm –  Use a graph partitioning algorithm: cluster objects into a

large number of relatively small sub-clusters –  Find the genuine clusters by repeatedly combining sub-

clusters

Jian Pei: Big Data Analytics -- Clustering 41

Overall Framework of CHAMELEON

Data Set → Construct Sparse Graph → Partition the Graph → Merge Partitions → Final Clusters

Jian Pei: Big Data Analytics -- Clustering 42

Distance-based Methods: Drawbacks

•  Hard to find clusters with irregular shapes •  Hard to specify the number of clusters •  Heuristic: a cluster must be dense


How to Find Irregular Clusters?

•  Divide the whole space into many small areas – The density of an area can be estimated – Areas may or may not be exclusive – A dense area is likely in a cluster

•  Start from a dense area, traverse connected dense areas and discover clusters in irregular shape

Jian Pei: Big Data Analytics -- Clustering 43 Jian Pei: Big Data Analytics -- Clustering 44

Directly Density Reachable

•  Parameters – Eps: Maximum radius of the neighborhood – MinPts: Minimum number of points in an Eps-

neighborhood of that point – N_Eps(p) = {q | dist(p,q) ≤ Eps}

•  Core object p: |N_Eps(p)| ≥ MinPts – A core object is in a dense area

•  Point q is directly density-reachable from p iff q ∈ N_Eps(p) and p is a core object

p q

MinPts = 3 Eps = 1 cm

Jian Pei: Big Data Analytics -- Clustering 45

Density-Based Clustering

•  Density-reachable –  Directly density reachable: p1→p2, p2→p3, …, pn-1→pn –  pn is density-reachable from p1

•  Density-connected –  If points p, q are density-reachable from o then p and q

are density-connected


Jian Pei: Big Data Analytics -- Clustering 46

DBSCAN

•  A cluster: a maximal set of density-connected points – Discover clusters of arbitrary shape in spatial

databases with noise

Core

Border

Outlier

Eps = 1cm

MinPts = 5

Jian Pei: Big Data Analytics -- Clustering 47

DBSCAN: the Algorithm

•  Arbitrarily select a point p •  Retrieve all points density-reachable from p

wrt Eps and MinPts •  If p is a core point, a cluster is formed •  If p is a border point, no points are density-

reachable from p and DBSCAN visits the next point of the database

•  Continue the process until all of the points have been processed
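As a usage illustration (not part of the slides), scikit-learn's DBSCAN takes exactly these two parameters; the synthetic data and the Eps/MinPts values below are assumptions:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
blob1 = rng.normal(loc=(0, 0), scale=0.3, size=(50, 2))
blob2 = rng.normal(loc=(5, 5), scale=0.3, size=(50, 2))
noise = rng.uniform(low=-2, high=7, size=(10, 2))
X = np.vstack([blob1, blob2, noise])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)   # Eps and MinPts
print(set(labels))    # cluster ids; -1 marks noise/outlier points
```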

Jian Pei: Big Data Analytics -- Clustering 48

Challenges for DBSCAN

•  Different clusters may have very different densities

•  Clusters may be in hierarchies


Jian Pei: Big Data Analytics -- Clustering 49

OPTICS: A Cluster-ordering Method

•  Idea: ordering points to identify the clustering structure

•  “Group” points by density connectivity – Hierarchies of clusters

•  Visualize clusters and the hierarchy

Jian Pei: Big Data Analytics -- Clustering 50

Ordering Points

•  Points strongly density-connected should be close to one another

•  Clusters density-connected should be close to one another and form a “cluster” of clusters

Jian Pei: Big Data Analytics -- Clustering 51

[Figure: reachability-distance plotted over the cluster order of the objects; valleys below a threshold ε′ < ε correspond to clusters, and the reachability of the first object of each cluster is undefined.]

OPTICS: An Example

Jian Pei: Big Data Analytics -- Clustering 52

DENCLUE: Using Density Functions

•  DENsity-based CLUstEring •  Major features

– Solid mathematical foundation – Good for data sets with large amounts of noise – Allow a compact mathematical description of

arbitrarily shaped clusters in high-dimensional data sets

– Significantly faster than existing algorithms (faster than DBSCAN by a factor of up to 45)

– But needs a large number of parameters

Jian Pei: Big Data Analytics -- Clustering 53

DENCLUE: Techniques

•  Use grid cells –  Only keep grid cells actually containing data points –  Manage cells in a tree-based access structure

•  Influence function: describe the impact of a data point on its neighborhood

•  Overall density of the data space is the sum of the influence function of all data points

•  Clustering by identifying density attractors –  Density attractor: a local maximum of the overall density

function

Jian Pei: Big Data Analytics -- Clustering 54

Density Attractor


Jian Pei: Big Data Analytics -- Clustering 55

Center-defined and Arbitrary Clusters

Jian Pei: Big Data Analytics -- Clustering 56

A Shrinking-based Approach

•  Difficulties of Multi-dimensional Clustering – Noise (outliers) – Clusters of various densities – Not well-defined shapes

•  A novel preprocessing concept “Shrinking” •  A shrinking-based clustering approach

Jian Pei: Big Data Analytics -- Clustering 57

Intuition & Purpose

•  For data points in a data set, what if we could make them move towards the centroid of the natural subgroup they belong to?

•  Natural sparse subgroups become denser and thus easier to detect – Noise is further isolated

Jian Pei: Big Data Analytics -- Clustering 58

Inspiration

•  Newton’s Universal Law of Gravitation –  Any two objects exert a gravitational force of attraction

on each other –  The direction of the force is along the line joining the

objects –  The magnitude of the force is directly proportional to

the product of the gravitational masses of the objects, and inversely proportional to the square of the distance between them

–  G: universal gravitational constant, G = 6.67 × 10⁻¹¹ N m²/kg²

$F_g = G \dfrac{m_1 m_2}{r^2}$

Jian Pei: Big Data Analytics -- Clustering 59

The Concept of Shrinking

•  A data preprocessing technique – Aim to optimize the inner structure of real data

sets •  Each data point is “attracted” by other data

points and moves to the direction in which way the attraction is the strongest

•  Can be applied in different fields

Jian Pei: Big Data Analytics -- Clustering 60

Applying Shrinking to Clustering •  Shrink the natural sparse clusters to make them much denser, to facilitate the subsequent cluster-detection process.

Multi-attribute hyperspace


Jian Pei: Big Data Analytics -- Clustering 61

Data Shrinking

•  Each data point moves along the direction of the density gradient and the data set shrinks towards the inside of the clusters

•  Points are “attracted” by their neighbors and move to create denser clusters

•  It proceeds iteratively; repeated until the data are stabilized or the number of iterations exceeds a threshold

Jian Pei: Big Data Analytics -- Clustering 62

Approximation & Simplification

•  Problem: computing the mutual attraction of every pair of data points is too time consuming, O(n²) – Solution: drop Newton's constant G and set m1 and m2 to 1

•  Only aggregate the gravitation surrounding

each data point •  Use grids to simplify the computation

Jian Pei: Big Data Analytics -- Clustering 63

Termination condition

•  Average movement of all points in the current iteration is less than a threshold

•  The number of iterations exceeds a threshold

Jian Pei: Big Data Analytics -- Clustering 64

Optics on Pendigits Data

Before data shrinking After data shrinking

Jian Pei: Big Data Analytics -- Clustering 65

Fuzzy Clustering

•  Each point xi takes a probability wij to belong to a cluster Cj

•  Requirements – For each point $x_i$: $\sum_{j=1}^{k} w_{ij} = 1$ – For each cluster $C_j$: $0 < \sum_{i=1}^{m} w_{ij} < m$

Jian Pei: Big Data Analytics -- Clustering 66

Fuzzy C-Means (FCM)

Select an initial fuzzy pseudo-partition, i.e., assign values to all the wij

Repeat Compute the centroid of each cluster using the fuzzy

pseudo-partition Recompute the fuzzy pseudo-partition, i.e., the wij

Until the centroids do not change (or the change is below some threshold)


Jian Pei: Big Data Analytics -- Clustering 67

Critical Details

•  Optimization on the sum of the squared error (SSE):

$SSE(C_1, \ldots, C_k) = \sum_{j=1}^{k} \sum_{i=1}^{m} w_{ij}^p \, dist(x_i, c_j)^2$

•  Computing centroids:

$c_j = \sum_{i=1}^{m} w_{ij}^p x_i \Big/ \sum_{i=1}^{m} w_{ij}^p$

•  Updating the fuzzy pseudo-partition:

$w_{ij} = \left(1/dist(x_i, c_j)^2\right)^{\frac{1}{p-1}} \Big/ \sum_{q=1}^{k} \left(1/dist(x_i, c_q)^2\right)^{\frac{1}{p-1}}$

– When p = 2: $w_{ij} = \dfrac{1/dist(x_i, c_j)^2}{\sum_{q=1}^{k} 1/dist(x_i, c_q)^2}$
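A compact NumPy sketch of these update equations with p = 2; the random initialization, the small constant added to avoid division by zero, and the convergence test are my own choices:

```python
import numpy as np

def fuzzy_cmeans(X, k, p=2, max_iter=100, tol=1e-5, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    W = rng.random((n, k))
    W /= W.sum(axis=1, keepdims=True)            # initial fuzzy pseudo-partition, rows sum to 1
    for _ in range(max_iter):
        Wp = W ** p
        centers = (Wp.T @ X) / Wp.sum(axis=0)[:, None]      # c_j = sum w_ij^p x_i / sum w_ij^p
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2) + 1e-12
        W_new = (1.0 / d2) ** (1.0 / (p - 1))
        W_new /= W_new.sum(axis=1, keepdims=True)            # recompute the w_ij
        if np.abs(W_new - W).max() < tol:                    # change below threshold: stop
            break
        W = W_new
    return W, centers
```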

Jian Pei: Big Data Analytics -- Clustering 68

Choice of P

•  When p → 1, FCM behaves like traditional k-means

•  When p is larger, the cluster centroids approach the global centroid of all data points

•  The partition becomes fuzzier as p increases

Jian Pei: Big Data Analytics -- Clustering 69

Effectiveness

Jian Pei: Big Data Analytics -- Clustering 70

Mixture Models

•  A cluster can be modeled as a probability distribution – Practically, assume a distribution can be

approximated well using multivariate normal distribution

•  Multiple clusters is a mixture of different probability distributions

•  A data set is a set of observations from a mixture of models

Jian Pei: Big Data Analytics -- Clustering 71

Object Probability

•  Suppose there are k clusters and a set X of m objects – Let the j-th cluster have parameter θj = (µj, σj) – The probability that a point is in the j-th cluster is

wj, w1 + …+ wk = 1 •  The probability of an object x is

$prob(x \mid \Theta) = \sum_{j=1}^{k} w_j \, p_j(x \mid \theta_j)$

$prob(X \mid \Theta) = \prod_{i=1}^{m} prob(x_i \mid \Theta) = \prod_{i=1}^{m} \sum_{j=1}^{k} w_j \, p_j(x_i \mid \theta_j)$

Jian Pei: Big Data Analytics -- Clustering 72

Example

$prob(x \mid \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

$\theta_1 = (-4, 2) \qquad \theta_2 = (4, 2)$

$prob(x \mid \Theta) = \frac{1}{2\sqrt{2\pi}} e^{-\frac{(x+4)^2}{8}} + \frac{1}{2\sqrt{2\pi}} e^{-\frac{(x-4)^2}{8}}$


Jian Pei: Big Data Analytics -- Clustering 73

Maximal Likelihood Estimation

•  Maximum likelihood principle: if we know a set of objects are from one distribution, but do not know the parameter, we can choose the parameter maximizing the probability

•  Maximize $prob(X \mid \Theta) = \prod_{j=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x_j - \mu)^2}{2\sigma^2}}$

– Equivalently, maximize $\log prob(X \mid \Theta) = -\sum_{i=1}^{m} \frac{(x_i - \mu)^2}{2\sigma^2} - 0.5\, m \log 2\pi - m \log \sigma$

Jian Pei: Big Data Analytics -- Clustering 74

EM Algorithm

•  Expectation Maximization algorithm Select an initial set of model parameters Repeat

Expectation Step: for each object, calculate the probability that it belongs to each distribution θi, i.e., prob(xi|θi)

Maximization Step: given the probabilities from the expectation step, find the new estimates of the parameters that maximize the expected likelihood

Until the parameters are stable
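For a concrete run of EM on a mixture of Gaussians (not from the slides), scikit-learn's GaussianMixture implements this loop; the synthetic 1-d data below mirrors the earlier two-component example with means -4 and 4 and σ = 2:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-4, 2, 200), rng.normal(4, 2, 200)]).reshape(-1, 1)

gm = GaussianMixture(n_components=2, random_state=0).fit(x)   # EM under the hood
print(gm.means_.ravel())        # estimated cluster means, roughly -4 and 4
print(gm.weights_)              # estimated mixing weights w_j
print(gm.predict_proba(x[:3]))  # soft membership probabilities (E-step output)
```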

Jian Pei: Big Data Analytics -- Clustering 75

Advantages and Disadvantages

•  Mixture models are more general than k-means and fuzzy c-means

•  Clusters can be characterized by a small number of parameters

•  The results may satisfy the statistical assumptions of the generative models

•  Computationally expensive •  Need large data sets •  Hard to estimate the number of clusters

Jian Pei: Big Data Analytics -- Clustering 76

Grid-based Clustering Methods

•  Ideas – Using multi-resolution grid data structures – Using dense grid cells to form clusters

•  Several interesting methods – CLIQUE – STING – WaveCluster

Jian Pei: Big Data Analytics -- Clustering 77

CLIQUE

•  Clustering In QUEst •  Automatically identify subspaces of a high

dimensional data space •  Both density-based and grid-based

Jian Pei: Big Data Analytics -- Clustering 78

CLIQUE: the Ideas

•  Partition each dimension into the same number of equal length intervals – Partition an m-dimensional data space into non-

overlapping rectangular units •  A unit is dense if the number of data points

in the unit exceeds a threshold •  A cluster is a maximal set of connected

dense units within a subspace


Jian Pei: Big Data Analytics -- Clustering 79

CLIQUE: the Method

•  Partition the data space and find the number of points in each cell of the partition –  Apriori: a k-d cell cannot be dense if one of its (k-1)-d

projections is not dense •  Identify clusters:

–  Determine dense units in all subspaces of interests and connected dense units in all subspaces of interests

•  Generate minimal description for the clusters –  Determine the minimal cover for each cluster

Jian Pei: Big Data Analytics -- Clustering 80

CLIQUE: An Example

[Figure: CLIQUE on a data set with dimensions age (20–60), salary (×$10,000), and vacation (weeks). Dense units are found in the (age, salary) and (age, vacation) planes; connected dense units within each subspace form the clusters.]

Jian Pei: Big Data Analytics -- Clustering 81

CLIQUE: Pros and Cons

•  Automatically find subspaces of the highest dimensionality with high density clusters

•  Insensitive to the order of input – Does not presume any canonical data distribution

•  Scales linearly with the size of the input •  Scales well with the number of dimensions •  The clustering quality may be degraded as the price paid for the simplicity of the method

Jian Pei: Big Data Analytics -- Clustering 82

Bad Cases for CLIQUE

Parts of a cluster may be missed

A cluster from CLIQUE may contain noise

Biclustering

•  Clustering both objects and attributes simultaneously

•  Four requirements – Only a small set of objects in a cluster (bicluster) – A bicluster only involves a small number of

attributes – An object may participate in multiple biclusters or

no biclusters – An attribute may be involved in multiple biclusters,

or no biclusters

Jian Pei: Big Data Analytics -- Clustering 83

Application Examples

•  Recommender systems – Objects: users – Attributes: items – Values: user ratings

•  Microarray data – Objects: genes – Attributes: samples – Values: expression levels

Jian Pei: Big Data Analytics -- Clustering 84

[Figure: a gene × sample/condition matrix W = [w_ij], where w_ij is the expression level of gene i under condition j.]


Biclusters with Constant Values

Jian Pei: Big Data Analytics -- Clustering 85

(Excerpt shown on the slide, from Section 11.2 of Han, Kamber and Pei:)

Figure 11.5: a gene-condition matrix in which the submatrix {a1, a33, a86} × {b6, b12, b36, b99} has the constant value 60; it is a bi-cluster with constant values.

Figure 11.6: a bi-cluster with constant values on rows:
10 10 10 10 10
20 20 20 20 20
50 50 50 50 50
0 0 0 0 0
Formally, e_ij = c + α_i, where α_i is the adjustment for row i; symmetrically, e_ij = c + β_j gives constant values on columns.

Figure 11.7: a bi-cluster with coherent values (a pattern-based cluster):
10 50 30 70 20
20 60 40 80 30
50 90 70 110 60
0 40 20 60 10
Formally, e_ij = c + α_i + β_j; equivalently, for any i1, i2 ∈ I and j1, j2 ∈ J, e_{i1 j1} − e_{i2 j1} = e_{i1 j2} − e_{i2 j2}.

On rows

Biclusters with Coherent Values

•  Also known as pattern-based clusters

Jian Pei: Big Data Analytics -- Clustering 86

Biclusters with Coherent Evolutions

•  Only up- or down-regulated changes over rows or columns

Jian Pei: Big Data Analytics -- Clustering 87

(Excerpt shown on the slide, from Section 11.2 of Han, Kamber and Pei:)

Figure 11.8: a bi-cluster with coherent evolutions on rows:
10 50 30 70 20
20 100 50 1000 30
50 100 90 120 80
0 80 20 100 10
Formally, a submatrix I × J such that for any i1, i2 ∈ I and j1, j2 ∈ J, (e_{i1 j1} − e_{i1 j2})(e_{i2 j1} − e_{i2 j2}) ≥ 0.

Perfect bi-clusters rarely exist in real data because of noise. Two major types of mining methods: optimization-based methods, which greedily search for the submatrix with the highest significance score at each iteration (e.g., the δ-Cluster algorithm), and enumeration methods, which enumerate all submatrices satisfying a noise-tolerance threshold (e.g., MaPle). For a submatrix I × J, the mean of the i-th row is e_{iJ} = (1/|J|) Σ_{j∈J} e_ij.

Coherent evolutions on rows

Jian Pei: Big Data Analytics -- Clustering 88

Differences from Subspace Clustering

•  Subspace clustering uses global distance/similarity measure

•  Pattern-based clustering looks at patterns •  A subspace cluster according to a globally

defined similarity measure may not follow the same pattern

Jian Pei: Big Data Analytics -- Clustering 89

Objects Follow the Same Pattern?

[Figure: two objects (blue and green) measured on attributes D1 and D2; pScore compares how their values change from D1 to D2.]

The smaller the pScore, the more consistent the objects Jian Pei: Big Data Analytics -- Clustering 90

Pattern-based Clusters

•  pScore: the similarity between two objects rx, ry on two attributes au, av

•  δ-pCluster (R, D): for any objects rx, ry∈R and any attributes au, av∈D,

$pScore\!\left(\begin{bmatrix} r_x.a_u & r_x.a_v \\ r_y.a_u & r_y.a_v \end{bmatrix}\right) = \left|(r_x.a_u - r_y.a_u) - (r_x.a_v - r_y.a_v)\right|$

$pScore\!\left(\begin{bmatrix} r_x.a_u & r_x.a_v \\ r_y.a_u & r_y.a_v \end{bmatrix}\right) \le \delta \quad (\delta \ge 0)$
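A small sketch that checks the δ-pCluster condition by brute force over all object pairs and attribute pairs; the helper names are mine, and this is only the definition, not an efficient miner such as MaPle:

```python
import numpy as np
from itertools import combinations

def pscore(block):
    """pScore of a 2x2 block [[rx_au, rx_av], [ry_au, ry_av]]."""
    (a, b), (c, d) = block
    return abs((a - c) - (b - d))

def is_delta_pcluster(M, rows, cols, delta):
    """(rows, cols) is a delta-pCluster iff every 2x2 sub-block has pScore <= delta."""
    for rx, ry in combinations(rows, 2):
        for au, av in combinations(cols, 2):
            if pscore([[M[rx, au], M[rx, av]], [M[ry, au], M[ry, av]]]) > delta:
                return False
    return True

M = np.array([[13, 11, 9], [7, 4, 10], [9, 2, 15.0]])
print(is_delta_pcluster(M, rows=[0, 1], cols=[0, 1], delta=1))   # |(13-7)-(11-4)| = 1 -> True
```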


Jian Pei: Big Data Analytics -- Clustering 91

Maximal pCluster

•  If (R, D) is a δ-pCluster, then every sub-cluster (R', D') with R'⊆R and D'⊆D is a δ-pCluster – An anti-monotonic property – A large pCluster is accompanied by many small pClusters, which is inefficient

•  Idea: mine only the maximal pClusters!

– A δ-pCluster is maximal if there exists no proper super cluster as a δ-pCluster

Jian Pei: Big Data Analytics -- Clustering 92

Mining Maximal pClusters

•  Given – A cluster threshold δ – An attribute threshold mina – An object threshold mino

•  Task: mine the complete set of significant maximal δ-pClusters – A significant δ-pCluster has at least mino objects

on at least mina attributes

Jian Pei: Big Data Analytics -- Clustering 93

pClusters and Frequent Itemsets

•  A transaction database can be modeled as a binary matrix

•  Frequent itemset: a sub-matrix of all 1's – a 0-pCluster on binary data – min_o: the support threshold – min_a: no fewer than min_a attributes – Maximal pClusters correspond to closed itemsets

•  Frequent itemset mining algorithms cannot be extended straightforwardly for mining pClusters on numeric data

Jian Pei: Big Data Analytics -- Clustering 94

Where Should We Start from?

•  How about the pClusters having only 2 objects or 2 attributes? – MDS (maximal dimension set) – A pCluster must have at least 2 objects and 2

attributes •  Finding MDSs

Attribute:   a    b    c    d    e    f    g    h
x:          13   11    9    7    9   13    2   15
y:           7    4   10    1   12    3    4    7
x − y:       6    7   −1    6   −3   10   −2    8

Jian Pei: Big Data Analytics -- Clustering 95

How to Assemble Larger pClusters? •  Systematically enumerate

every combination of attributes D –  For each attribute subset,

find the maximal subsets of objects R s.t. (R, D) is a pCluster

–  Check whether (R, D) is maximal

•  Prune search branches as early as possible

•  Why attribute-first-object-later? –  # of objects >> # attributes

•  Algorithm MaPle (Pei et al, 2003)

Jian Pei: Big Data Analytics -- Clustering 96

More Pruning Techniques

•  Only possible attributes should be considered to get larger pClusters

•  Pruning local maximal pClusters having insufficient possible attributes

•  Extracting common attributes from possible attribute set directly

•  Prune non-maximal pClusters


Jian Pei: Big Data Analytics -- Clustering 97

Gene-Sample-Time Series Data

[Figure: a gene × sample × time cube; each cell stores the expression level of gene i on sample j at time k. Projections give the Gene-Time, Gene-Sample, and Sample-Time matrices.]

Jian Pei: Big Data Analytics -- Clustering 98

Mining GST Microarray Data

•  Reduce the gene-sample-time series data to gene-sample data – Use Pearson's correlation coefficient as the

coherence measure

Jian Pei: Big Data Analytics -- Clustering 99

Basic Approaches

•  Sample-gene search – Enumerate the subsets of samples

systematically – For each subset of samples, find the genes that

are coherent on the samples •  Gene-sample search

– Enumerate the subsets of genes systematically – For each subset of genes, find the samples on

which the genes are coherent

Jian Pei: Big Data Analytics -- Clustering 100

Basic Tools

•  Set enumeration tree •  Sample-gene search and gene-sample

search are not symmetric! – Many genes, but a few samples – No requirement on samples coherent on genes

Jian Pei: Big Data Analytics -- Clustering 101

Phenotypes and Informative Genes

[Figure: a gene × sample matrix (samples 1–7) in which informative genes discriminate the phenotypes of the samples while non-informative genes do not.]

Jian Pei: Big Data Analytics -- Clustering 102

The Phenotype Mining Problem

•  Input: a microarray matrix and k •  Output: phenotypes and informative genes

– Partitioning the samples into k exclusive subsets – phenotypes

–  Informative genes discriminating the phenotypes

•  Machine learning methods – Heuristic search – Mutual reinforcing adjustment


Jian Pei: Big Data Analytics -- Clustering 103

Requirements

•  The expression levels of each informative gene should be similar over the samples within each phenotype

•  The expression levels of each informative gene should display a clear dissimilarity between each pair of phenotypes

Jian Pei: Big Data Analytics -- Clustering 104

Intra-phenotype Consistency

•  In a subset of genes (candidate informative genes), does every gene have good consistency on a set of samples?

•  Average of variance of the subset of genes – the smaller the intra-phenotype consistency, the better

$Con(G', S') = \frac{1}{|G'| \cdot |S'|} \sum_{g_i \in G'} \sum_{s_j \in S'} \left(w_{i,j} - \overline{w}_{i,S'}\right)^2$

Jian Pei: Big Data Analytics -- Clustering 105

Inter-phenotype Divergence

•  How a subset of genes (candidate informative genes) can discriminate two phenotypes of samples?

•  Sum of the average difference between the phenotypes – the larger the inter-phenotype divergence, the better

$Div(G', S_1, S_2) = \frac{\sum_{g_i \in G'} \left|\overline{w}_{i,S_1} - \overline{w}_{i,S_2}\right|}{|G'|}$

Jian Pei: Big Data Analytics -- Clustering 106

Quality of Phenotypes and Informative Genes

•  The higher the value, the better the quality

$\Omega = \sum_{S_i, S_j\ (1 \le i, j \le K;\ i \ne j)} \frac{Div(G', S_i, S_j)}{Con(G', S_i) + Con(G', S_j)}$

Jian Pei: Big Data Analytics -- Clustering 107

Heuristic Search

•  Start from a random subset of genes and an arbitrary partition of the samples

•  Iteratively adjust the partition and the gene set toward a better solution –  For each possible adjustment, compute ΔΩ

•  For each gene, try possible insert/remove •  For each sample, try the best movement

–  ΔΩ > 0 → conduct the adjustment –  ΔΩ < 0 → conduct the adjustment with probability $e^{\frac{\Delta\Omega}{\Omega \cdot T(i)}}$

•  T(i) is a decreasing simulated-annealing function and i is the iteration number; T(i) = 1/(i+1) in our implementation

Jian Pei: Big Data Analytics -- Clustering 108

Possible Adjustments

Insert a gene Remove a gene Move a sample


Jian Pei: Big Data Analytics -- Clustering 109

Disadvantages of Heuristic Search

•  Samples and genes are examined and adjusted with equal chances – # samples << # genes – Samples should play more important roles

•  Outliers in the samples should be handled specifically – Outliers strongly interfere with the quality measure and the

adjustment decisions

Jian Pei: Big Data Analytics -- Clustering 110

Mutual Reinforcing Adjustment

•  A two-phase approach –  Iteration phase – Refinement phase

•  Mutual reinforcement – Use gene partition to improve the sample

partition – Use the sample partition to improve the gene

partition

Jian Pei: Big Data Analytics -- Clustering 111

Dimensionality Reduction

•  Clustering a high dimensional data set is challenging –  Distance between two points could be dominated by

noise •  Dimensionality reduction: choosing the informative

dimensions for clustering analysis –  Feature selection: choosing a subset of existing

dimensions –  Feature construction: construct a new (small) set of

informative attributes

Jian Pei: Big Data Analytics -- Clustering 112

Variance and Covariance

•  Given a set of 1-d points, how different are those points?

– Standard deviation: $s = \sqrt{\dfrac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}}$

– Variance: $s^2 = \dfrac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}$

•  Given a set of 2-d points, are the two dimensions correlated?

– Covariance: $cov(X, Y) = \dfrac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{n-1}$

Jian Pei: Big Data Analytics -- Clustering 113

Principal Components

Art work and example from http://csnet.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf

Jian Pei: Big Data Analytics -- Clustering 114

Step 1: Mean Subtraction

•  Subtract the mean from each dimension for each data point

•  Intuition: centralizing the data set


Jian Pei: Big Data Analytics -- Clustering 115

Step 2: Covariance Matrix

$C = \begin{pmatrix} cov(D_1, D_1) & cov(D_1, D_2) & \cdots & cov(D_1, D_n) \\ cov(D_2, D_1) & cov(D_2, D_2) & \cdots & cov(D_2, D_n) \\ \vdots & \vdots & \ddots & \vdots \\ cov(D_n, D_1) & cov(D_n, D_2) & \cdots & cov(D_n, D_n) \end{pmatrix}$

Jian Pei: Big Data Analytics -- Clustering 116

Step 3: Eigenvectors and Eigenvalues

•  Compute the eigenvectors and the eigenvalues of the covariance matrix –  Intuition: find those direction invariant vectors as

candidates of new attributes – Eigenvalues indicate how much the direction

invariant vectors are scaled – the larger, the better for manifesting the data variance

Jian Pei: Big Data Analytics -- Clustering 117

Step 4: Forming New Features

•  Choose the principal components and form new features –  Typically, choose the top-k components

Jian Pei: Big Data Analytics -- Clustering 118

New Features

NewData = RowFeatureVector x RowDataAdjust

The first principal component is used
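Putting Steps 1-4 together, a minimal NumPy sketch (the toy data and the helper name `pca` are assumptions for illustration):

```python
import numpy as np

def pca(X, k):
    """Steps 1-4: mean-subtract, covariance, eigen-decompose, project onto top-k components."""
    X = np.asarray(X, dtype=float)
    X_adj = X - X.mean(axis=0)                       # Step 1: mean subtraction
    C = np.cov(X_adj, rowvar=False)                  # Step 2: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)             # Step 3: eigenvalues and eigenvectors
    order = np.argsort(eigvals)[::-1]                # largest eigenvalues first
    feature_vector = eigvecs[:, order[:k]]           # Step 4: keep the top-k components
    return X_adj @ feature_vector                    # data expressed in the new feature space

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
print(pca(X, k=1).ravel())   # the data projected onto the first principal component
```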

Clustering in Derived Space

Jian Pei: Big Data Analytics -- Clustering 119

[Figure: 2-d data in the (X, Y) plane projected onto the derived axis −0.707x + 0.707y.]

Spectral Clustering

Jian Pei: Big Data Analytics -- Clustering 120

[Figure: the spectral clustering pipeline. Data → affinity matrix W = [w_ij] → A = f(W) → compute the k leading eigenvectors of A (Av = λv) → clustering in the new space → project the clusters back to the original data.]


Affinity Matrix

•  Using a distance measure, where σ is a scaling parameter controlling how fast the affinity Wij decreases as the distance increases

•  In the Ng-Jordan-Weiss algorithm, Wii is set to 0

Jian Pei: Big Data Analytics -- Clustering 121

$W_{ij} = e^{-\frac{dist(o_i, o_j)}{\sigma}}$

Clustering

•  In the Ng-Jordan-Weiss algorithm, we define a diagonal matrix such that

•  Then, •  Use the k leading eigenvectors to form a

new space •  Map the original data to the new space and

conduct clustering Jian Pei: Big Data Analytics -- Clustering 122

$D_{ii} = \sum_{j=1}^{n} W_{ij}$

$A = D^{-\frac{1}{2}} W D^{-\frac{1}{2}}$
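A minimal NumPy sketch of this pipeline, using the affinity of the previous slide and scikit-learn's KMeans for the final clustering step; this is an illustration under those assumptions, not the exact Ng-Jordan-Weiss implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(X, k, sigma=1.0):
    # Affinity matrix: W_ij = exp(-dist(o_i, o_j) / sigma), with W_ii = 0
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    W = np.exp(-dist / sigma)
    np.fill_diagonal(W, 0.0)

    # A = D^{-1/2} W D^{-1/2}, where D_ii = sum_j W_ij
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    A = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

    # The k leading eigenvectors of A form the new space
    eigvals, eigvecs = np.linalg.eigh(A)
    U = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    U /= np.linalg.norm(U, axis=1, keepdims=True)     # row-normalize the embedding

    # Cluster in the new space; labels map back to the original objects
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)
```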

Is a Clustering Good?

•  Feasibility – Applying any clustering methods on a uniformly

distributed data set is meaningless •  Quality

– Are the clustering results meeting the users' interest? – Clustering patients into clusters corresponding to various diseases or sub-phenotypes is meaningful – Clustering patients into clusters corresponding to

male or female is not meaningful

Jian Pei: Big Data Analytics -- Clustering 123

Major Tasks

•  Assessing clustering tendency – Are there non-random structures in the data?

•  Determining the number of clusters or other critical parameters

•  Measuring clustering quality

Jian Pei: Big Data Analytics -- Clustering 124

Uniformly Distributed Data

•  Clustering uniformly distributed data is meaningless

•  A uniformly distributed data set is generated by a uniform data distribution

Jian Pei: Big Data Analytics -- Clustering 125

(Excerpt shown on the slide, from Section 10.6 of Han, Kamber and Pei:)

Figure 10.21 shows a data set that is uniformly distributed in a 2-dimensional data space. A clustering algorithm may still artificially partition the points into groups, but those groups are random and unlikely to mean anything to the application. Clustering tendency assessment therefore measures the probability that the data set was generated by a uniform data distribution, using statistical tests for spatial randomness; the Hopkins Statistic, a spatial statistic that tests the spatial randomness of a variable distributed in a space, is computed as follows.

Hopkins Statistic

•  Hypothesis: the data is generated by a uniform distribution in a space

•  Sample n points, p1, …, pn, uniformly from the space of D

•  For each point pi, find the nearest neighbor of pi in D, let xi be the distance between pi and its nearest neighbor in D

Jian Pei: Big Data Analytics -- Clustering 126

$x_i = \min_{v \in D}\{dist(p_i, v)\}$


Hopkins Statistic

•  Sample n points, q1, …, qn, uniformly from D •  For each qi, find the nearest neighbor of qi

in D – {qi}, let yi be the distance between qi and its nearest neighbor in D – {qi}

•  Calculate the Hopkins Statistic H

Jian Pei: Big Data Analytics -- Clustering 127

$y_i = \min_{v \in D,\, v \ne q_i}\{dist(q_i, v)\}$

$H = \dfrac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i}$
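A small NumPy sketch of the statistic as defined above; the sample size and the axis-aligned bounding box used as "the space of D" are my own simplifications:

```python
import numpy as np

def hopkins(D, n_samples=50, seed=0):
    rng = np.random.default_rng(seed)
    D = np.asarray(D, dtype=float)
    lo, hi = D.min(axis=0), D.max(axis=0)

    # p_1..p_n sampled uniformly from the data space; x_i = distance to the nearest point of D
    P = rng.uniform(lo, hi, size=(n_samples, D.shape[1]))
    x = np.array([np.linalg.norm(D - p, axis=1).min() for p in P])

    # q_1..q_n sampled from D; y_i = distance to the nearest other point of D
    idx = rng.choice(len(D), size=n_samples, replace=False)
    y = []
    for i in idx:
        d = np.linalg.norm(D - D[i], axis=1)
        d[i] = np.inf                        # exclude q_i itself
        y.append(d.min())
    y = np.array(y)
    return y.sum() / (x.sum() + y.sum())

uniform = np.random.default_rng(1).uniform(size=(500, 2))
clustered = np.vstack([np.random.default_rng(2).normal(0, 0.05, (250, 2)),
                       np.random.default_rng(3).normal(1, 0.05, (250, 2))])
print(hopkins(uniform))    # close to 0.5
print(hopkins(clustered))  # close to 0
```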

Explanation

•  If D is uniformly distributed, then $\sum_{i=1}^{n} y_i$ and $\sum_{i=1}^{n} x_i$ would be close to each other, and thus H would be around 0.5

•  If D is skewed, then $\sum_{i=1}^{n} y_i$ would be substantially smaller, and thus H would be close to 0

•  If H > 0.5, then it is unlikely that D has statistically significant clusters

Jian Pei: Big Data Analytics -- Clustering 128

Finding the Number of Clusters

•  Depending on many factors – The shape and scale of the distribution in the

data set – The clustering resolution required by the user

•  Many methods exist – Set $k \approx \sqrt{n/2}$; each cluster then has about $\sqrt{2n}$ points on average – Plot the sum of within-cluster variances with respect to k, and find the first (or the most significant) turning point

Jian Pei: Big Data Analytics -- Clustering 129

A Cross-Validation Method •  Divide the data set D into m parts •  Use m – 1 parts to find a clustering •  Use the remaining part as the test set to test

the quality of the clustering – For each point in the test set, find the closest

centroid or cluster center – Use the squared distances between all points in the

test set and the corresponding centroids to measure how well the clustering model fits the test set

•  Repeat m times for each value of k, use the average as the quality measure

Jian Pei: Big Data Analytics -- Clustering 130

Measuring Clustering Quality

•  Ground truth: the ideal clustering determined by human experts

•  Two situations – There is a known ground truth – the extrinsic

(supervised) methods, comparing the clustering against the ground truth

– The ground truth is unavailable – the intrinsic (unsupervised) methods, measuring how well the clusters are separated

Jian Pei: Big Data Analytics -- Clustering 131

Quality in Extrinsic Methods •  Cluster homogeneity: the more pure the

clusters in a clustering are, the better the clustering

•  Cluster completeness: objects in the same cluster in the ground truth should be clustered together

•  Rag bag: putting a heterogeneous object into a pure cluster is worse than putting it into a rag bag

•  Small cluster preservation: splitting a small cluster in the ground truth into pieces is worse than splitting a bigger one

Jian Pei: Big Data Analytics -- Clustering 132


Bcubed Precision and Recall

•  D = {o1, …, on} – L(oi) is the cluster of oi given by the ground truth

•  C is a clustering on D – C(oi) is the cluster-id of oi in C

•  For two objects oi and oj, the correctness is 1 if L(oi) = L(oj) ⟺ C(oi) = C(oj), and 0 otherwise

Jian Pei: Big Data Analytics -- Clustering 133

Bcubed Precision and Recall

•  Precision

•  Recall

Jian Pei: Big Data Analytics -- Clustering 134

(Excerpt shown on the slide, from Section 10.6 of Han, Kamber and Pei:) The precision of an object indicates how many other objects in the same cluster belong to the same category as the object; the recall of an object reflects how many objects of the same category are assigned to the same cluster.

$Correctness(o_i, o_j) = \begin{cases} 1 & \text{if } L(o_i) = L(o_j) \Leftrightarrow C(o_i) = C(o_j) \\ 0 & \text{otherwise} \end{cases}$

$Precision\ BCubed = \dfrac{\sum_{i=1}^{n} \dfrac{\sum_{o_j:\, i \ne j,\, C(o_i) = C(o_j)} Correctness(o_i, o_j)}{\|\{o_j \mid i \ne j,\, C(o_i) = C(o_j)\}\|}}{n}$

$Recall\ BCubed = \dfrac{\sum_{i=1}^{n} \dfrac{\sum_{o_j:\, i \ne j,\, L(o_i) = L(o_j)} Correctness(o_i, o_j)}{\|\{o_j \mid i \ne j,\, L(o_i) = L(o_j)\}\|}}{n}$
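A direct transcription of the two formulas into plain Python (the function name and the toy labels are illustrative):

```python
def bcubed(L, C):
    """BCubed precision and recall; L[i] is the ground-truth category of o_i, C[i] its cluster id."""
    n = len(L)
    precision = recall = 0.0
    for i in range(n):
        same_cluster = [j for j in range(n) if j != i and C[i] == C[j]]
        same_category = [j for j in range(n) if j != i and L[i] == L[j]]
        # Correctness(o_i, o_j) = 1 iff (L(o_i) = L(o_j)) <=> (C(o_i) = C(o_j))
        correct = lambda j: (L[i] == L[j]) == (C[i] == C[j])
        if same_cluster:
            precision += sum(correct(j) for j in same_cluster) / len(same_cluster)
        if same_category:
            recall += sum(correct(j) for j in same_category) / len(same_category)
    return precision / n, recall / n

print(bcubed(L=[0, 0, 0, 1, 1], C=[0, 0, 1, 1, 1]))   # (0.6, 0.6)
```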

Intrinsic Methods

(Excerpt shown on the slide, from Section 10.6 of Han, Kamber and Pei:) When no ground truth is available, intrinsic methods evaluate a clustering by how well the clusters are separated and how compact they are, usually via a similarity metric between objects; the silhouette coefficient, defined on the next slides, is such a measure. A silhouette value near 1 means the cluster containing o is compact and o is far from the other clusters; a negative value (b(o) < a(o)) means o is, in expectation, closer to objects in another cluster than to objects in its own cluster, which is usually undesirable. The average silhouette over a cluster measures the fitness of that cluster, and the average over all objects measures the quality of the whole clustering.

Silhouette Coefficient

•  No ground truth is assumed •  Suppose a data set D of n objects is partitioned

into k clusters, C1, …, Ck •  For each object o,

– Calculate a(o), the average distance between o and every other object in the same cluster – compactness of a cluster, the smaller, the better

– Calculate b(o), the minimum average distance from o to all objects in a cluster that o does not belong to – degree of separation from other clusters, the larger, the better

Jian Pei: Big Data Analytics -- Clustering 135

Silhouette Coefficient

•  Then

•  Use the average silhouette coefficient of all objects as the overall measure

Jian Pei: Big Data Analytics -- Clustering 136

$a(o) = \dfrac{\sum_{o' \in C_i,\, o' \ne o} dist(o, o')}{|C_i| - 1}$

$b(o) = \min_{C_j:\, o \notin C_j}\left\{\dfrac{\sum_{o' \in C_j} dist(o, o')}{|C_j|}\right\}$

$s(o) = \dfrac{b(o) - a(o)}{\max\{a(o), b(o)\}}$
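A direct NumPy sketch of a(o), b(o), and s(o); it assumes every cluster contains at least two objects, and the overall measure is the average over all objects:

```python
import numpy as np

def silhouette(X, labels):
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        own = labels == labels[i]
        a = dist[i, own & (np.arange(len(X)) != i)].mean()     # avg distance within o's own cluster
        b = min(dist[i, labels == c].mean()                     # min avg distance to another cluster
                for c in set(labels.tolist()) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return np.mean(scores)      # overall measure: average silhouette coefficient

X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10.0]])
print(silhouette(X, labels=[0, 0, 0, 1, 1, 1]))   # close to 1: compact, well-separated clusters
```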

Multi-Clustering

•  A data set may be clustered in different ways –  In different subspaces, that is, using different

attributes – Using different similarity measures – Using different clustering methods

•  Some different clusterings may capture different meanings of categorization – Orthogonal clusterings

•  Putting users in the loop Jian Pei: Big Data Analytics -- Clustering 137