Transcript of “Clustering” (novel.ict.ac.cn/files/Day 5.pdf)
Clustering
Jian Pei: Big Data Analytics -- Clustering 2
What Is Clustering?
• Group data into clusters
  – Similar to one another within the same cluster
  – Dissimilar to the objects in other clusters
  – Unsupervised learning: no predefined classes
Cluster 1 Cluster 2
Outliers
Similarity and Dissimilarity
• Distances are the normally used measures
• Minkowski distance: a generalization
• If q = 2, d is Euclidean distance
• If q = 1, d is Manhattan distance
• If q = ∞, d is Chebyshev distance
• Weighted distance
    d(i, j) = (|x_i1 − x_j1|^q + |x_i2 − x_j2|^q + … + |x_ip − x_jp|^q)^{1/q}    (q > 0)

    Weighted: d(i, j) = (w_1 |x_i1 − x_j1|^q + w_2 |x_i2 − x_j2|^q + … + w_p |x_ip − x_jp|^q)^{1/q}    (q > 0)
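Not part of the original slides: a minimal plain-Python sketch of the (optionally weighted) Minkowski distance, with q = ∞ handled as the Chebyshev maximum. The function name and the way weights enter the q = ∞ case are illustrative assumptions.

```python
import math

def minkowski(x, y, q, w=None):
    """Weighted Minkowski distance between vectors x and y.
    q = 2: Euclidean; q = 1: Manhattan; q = inf: Chebyshev."""
    if w is None:
        w = [1.0] * len(x)                 # unweighted by default
    if math.isinf(q):
        # Chebyshev: the largest (weighted) coordinate difference
        return max(wi * abs(a - b) for wi, a, b in zip(w, x, y))
    return sum(wi * abs(a - b) ** q for wi, a, b in zip(w, x, y)) ** (1.0 / q)
```

For the point pair (0,0) and (3,4) this yields 5 (Euclidean), 7 (Manhattan), and 4 (Chebyshev).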
Manhattan and Chebyshev Distance
Picture from Wikipedia
Manhattan Distance
http://brainking.com/images/rules/chess/02.gif
Chebyshev Distance
In 2 dimensions, Chebyshev distance is the chessboard (“king move”) distance
Properties of Minkowski Distance
• Nonnegative: d(i,j) ≥ 0 • The distance of an object to itself is 0
– d(i,i) = 0 • Symmetric: d(i,j) = d(j,i) • Triangular inequality
– d(i,j) ≤ d(i,k) + d(k,j)
[Figure: a triangle with vertices i, j, k illustrating the triangle inequality]
Clustering Methods
• K-means and partitioning methods • Hierarchical clustering • Density-based clustering • Grid-based clustering • Pattern-based clustering • Other clustering methods
Partitioning Algorithms: Ideas
• Partition n objects into k clusters
  – Optimize the chosen partitioning criterion
• Global optimum: examine all possible partitions
  – Exponentially many (on the order of k^n) possible partitions, too expensive!
• Heuristic methods: k-means and k-medoids
  – K-means: a cluster is represented by its center (mean)
  – K-medoids or PAM (partitioning around medoids): each cluster is represented by one of the objects in the cluster
K-means
• Arbitrarily choose k objects as the initial cluster centers
• Until no change, do
  – (Re)assign each object to the cluster whose mean it is most similar to
  – Update the cluster means, i.e., calculate the mean value of the objects in each cluster
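The two-step loop above can be sketched as follows. This is an illustrative plain-Python version for 2-D points, not the slides' own code; the random initialization, seed, and tie-breaking are assumptions.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's k-means on 2-D points; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)             # arbitrary initial centers
    labels = [0] * len(points)
    for _ in range(max_iter):
        # (Re)assignment step: each object goes to the nearest center
        labels = [min(range(k),
                      key=lambda j: (p[0] - centers[j][0]) ** 2 +
                                    (p[1] - centers[j][1]) ** 2)
                  for p in points]
        # Update step: recompute each cluster's mean
        new_centers = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                new_centers.append((sum(p[0] for p in members) / len(members),
                                    sum(p[1] for p in members) / len(members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:              # no change: converged
            break
        centers = new_centers
    return centers, labels
```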
K-Means: Example
[Figure: scatter plots on 0–10 axes illustrating k-means with K = 2: arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign and repeat until no change.]
Pros and Cons of K-means
• Relatively efficient: O(tkn)
  – n: # objects, k: # clusters, t: # iterations; normally k, t << n
• Often terminates at a local optimum
• Applicable only when the mean is defined
  – What about categorical data?
• Need to specify the number of clusters
• Unable to handle noisy data and outliers
• Unsuitable for discovering non-convex clusters
Variations of K-means
• Aspects of variation
  – Selection of the initial k means
  – Dissimilarity calculations
  – Strategies to calculate cluster means
• Handling categorical data: k-modes
  – Use the mode instead of the mean
    • Mode: the most frequent item(s)
  – A mixture of categorical and numerical data: the k-prototype method
• EM (expectation maximization): assign each object a probability of belonging to each cluster
A Problem of K-means
• Sensitive to outliers – Outlier: objects with extremely large values
• May substantially distort the distribution of the data
• K-medoids: the most centrally located object in a cluster
[Figure: two scatter plots on 0–10 axes; an outlier drags the k-means centroid (+) away from the bulk of the cluster, while the medoid remains centrally located.]
PAM: A K-medoids Method
• PAM: Partitioning Around Medoids
• Arbitrarily choose k objects as the initial medoids
• Until no change, do
  – (Re)assign each object to the cluster of its nearest medoid
  – Randomly select a non-medoid object o′ and compute the total cost S of swapping a medoid o with o′
  – If S < 0, swap o with o′ to form the new set of k medoids
Swapping Cost
• Measure whether o’ is better than o as a medoid
• Use the squared-error criterion
    E = ∑_{i=1}^{k} ∑_{p∈C_i} d(p, o_i)²
  – Compute E_{o′} − E_{o}
  – Negative: swapping brings benefit
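The swapping cost can be sketched directly from the criterion above. Not from the slides; a small plain-Python illustration using squared Euclidean distance, with hypothetical helper names:

```python
def total_cost(points, medoids):
    """Squared-error E: each point charged to its nearest medoid."""
    def d2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    return sum(min(d2(p, m) for m in medoids) for p in points)

def swap_gain(points, medoids, o, o_new):
    """E_{o'} - E_o: negative means swapping medoid o for o_new helps."""
    trial = [o_new if m == o else m for m in medoids]
    return total_cost(points, trial) - total_cost(points, medoids)
```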
PAM: Example
[Figure: scatter plots on 0–10 axes illustrating PAM with K = 2: arbitrarily choose k objects as initial medoids (total cost = 20); assign each remaining object to the nearest medoid; randomly select a non-medoid object O_random and compute the total cost of swapping (total cost = 26); swap O and O_random if quality is improved; loop until no change.]
Pros and Cons of PAM
• PAM is more robust than k-means in the presence of noise and outliers
  – Medoids are less influenced by outliers
• PAM is efficient for small data sets but does not scale well to large data sets
  – O(k(n−k)²) per iteration
• Sampling-based method: CLARA
CLARA
• CLARA: Clustering LARge Applications (Kaufmann and Rousseeuw, 1990)
  – Built into statistical analysis packages, such as S+
• Draw multiple samples of the data set, apply PAM to each sample, and return the best clustering
• Performs better than PAM on larger data sets
• Efficiency depends on the sample size
  – A good clustering of the samples may not be a good clustering of the whole data set
CLARANS
• Clustering Large Applications based upon RANdomized Search
• The problem space is a graph of clusterings
  – A vertex is a choice of k medoids out of the n objects: C(n, k) vertices in total
  – PAM searches the whole graph
  – CLARA searches some random sub-graphs
• CLARANS climbs hills
  – Randomly sample a set and select k medoids
  – Consider neighbors of the current medoids as candidates for new medoids
  – Use the sample set to verify
  – Repeat multiple times to avoid bad samples
Hierarchy
• An arrangement or classification of things according to inclusiveness
• A natural way of abstraction, summarization, compression, and simplification for understanding
• Typical setting: organize a given set of objects into a hierarchy
  – No or very little supervision
  – Some heuristic guidance on the quality of the hierarchy
• Group data objects into a tree of clusters • Top-down versus bottom-up
Hierarchical Clustering
[Figure: five objects a, b, c, d, e; a and b merge into {a, b}, d and e into {d, e}, then {c, d, e}, and finally {a, b, c, d, e}. Read left to right (Step 0 → Step 4) for agglomerative clustering (AGNES), right to left (Step 4 → Step 0) for divisive clustering (DIANA).]
AGNES (Agglomerative Nesting)
• Initially, each object is a cluster
• Merge clusters step by step until all objects form one cluster
  – Single-link approach
  – Each cluster is represented by all of the objects in the cluster
  – The similarity between two clusters is measured by the similarity of the closest pair of data points belonging to different clusters
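The single-link merging rule can be sketched as follows. Not the slides' own code: an illustrative O(n³)-ish plain-Python version that merges until k clusters remain; the stopping criterion and squared-distance shortcut are assumptions.

```python
def single_link_agnes(points, k):
    """Repeatedly merge the two clusters with the closest pair of points."""
    def d2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    clusters = [[p] for p in points]        # every object starts as a cluster
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single-link: distance of the closest inter-cluster pair
                dist = min(d2(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        clusters[i] += clusters[j]          # merge cluster j into cluster i
        del clusters[j]
    return clusters
```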
Dendrogram
• Shows how clusters are merged hierarchically
• Decomposes data objects into a multi-level nested partitioning (a tree of clusters)
• A clustering of the data objects: cut the dendrogram at the desired level
  – Each connected component forms a cluster
DIANA (Divisive ANAlysis)
• Initially, all objects are in one cluster
• Split clusters step by step until each cluster contains only one object
[Figure: three scatter plots on 0–10 axes showing one cluster being split step by step into smaller clusters.]
Distance Measures
• Minimum distance: d_min(C_i, C_j) = min_{p∈C_i, q∈C_j} d(p, q)
• Maximum distance: d_max(C_i, C_j) = max_{p∈C_i, q∈C_j} d(p, q)
• Mean distance: d_mean(C_i, C_j) = d(m_i, m_j)
• Average distance: d_avg(C_i, C_j) = (1 / (n_i n_j)) ∑_{p∈C_i} ∑_{q∈C_j} d(p, q)

m_i: the mean of cluster C_i; n_i: the number of objects in C_i
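The four measures can be computed side by side in a few lines. Not from the slides; a plain-Python sketch for point tuples using Euclidean distance (the dict of results is an illustrative convention):

```python
import math

def cluster_distances(Ci, Cj):
    """Minimum, maximum, mean-of-means, and average inter-cluster distances."""
    pairs = [math.dist(p, q) for p in Ci for q in Cj]
    mean_i = tuple(sum(c) / len(Ci) for c in zip(*Ci))
    mean_j = tuple(sum(c) / len(Cj) for c in zip(*Cj))
    return {
        'min': min(pairs),
        'max': max(pairs),
        'mean': math.dist(mean_i, mean_j),         # distance of the two means
        'avg': sum(pairs) / (len(Ci) * len(Cj)),   # average over all pairs
    }
```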
Challenges
• Hard to choose merge/split points
  – Merging/splitting can never be undone
  – Merging/splitting decisions are critical
• High complexity: O(n²)
• Integrating hierarchical clustering with other techniques
  – BIRCH, CURE, CHAMELEON, ROCK
BIRCH
• Balanced Iterative Reducing and Clustering using Hierarchies
• CF (Clustering Feature) tree: a hierarchical data structure summarizing object info
  – Clustering objects → clustering leaf nodes of the CF tree
Clustering Feature Vector
• Clustering Feature: CF = (N, LS, SS)
  – N: number of data points
  – LS: ∑_{i=1}^{N} o_i  (linear sum of the points)
  – SS: ∑_{i=1}^{N} o_i²  (square sum of the points)
• Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8),
  CF = (5, (16,30), (54,190))
CF-tree in BIRCH
• Clustering feature:
  – Summarizes the statistics for a cluster
  – Many cluster quality measures (e.g., radius, distance) can be derived
  – Additivity: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
• A CF tree: a height-balanced tree storing the clustering features for a hierarchical clustering
  – A non-leaf node in the tree has descendants or “children”
  – The non-leaf nodes store sums of the CFs of their children
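The CF summary and its additivity property are easy to verify in code. Not from the slides; a plain-Python sketch for 2-D points (the tuple layout mirrors CF = (N, LS, SS)):

```python
def cf(points):
    """Clustering feature (N, LS, SS) of a set of 2-D points."""
    n = len(points)
    ls = tuple(sum(col) for col in zip(*points))            # linear sum
    ss = tuple(sum(v * v for v in col) for col in zip(*points))  # square sum
    return (n, ls, ss)

def cf_add(cf1, cf2):
    """Additivity: merging two clusters just adds their CFs component-wise."""
    (n1, ls1, ss1), (n2, ls2, ss2) = cf1, cf2
    return (n1 + n2,
            tuple(a + b for a, b in zip(ls1, ls2)),
            tuple(a + b for a, b in zip(ss1, ss2)))
```

Using the five example points from the slide reproduces CF = (5, (16,30), (54,190)).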
CF Tree
[Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6. The root and non-leaf nodes hold entries CF_i, each with a pointer child_i to a child node; leaf nodes hold CF entries for sub-clusters and are chained by prev/next pointers.]
Parameters of a CF-tree
• Branching factor: the maximum number of children
• Threshold: max diameter of sub-clusters stored at the leaf nodes
BIRCH Clustering
• Phase 1: scan DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data)
• Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree
Pros & Cons of BIRCH
• Linear scalability
  – Good clustering with a single scan
  – Quality can be further improved by a few additional scans
• Can handle only numeric data
• Sensitive to the order of the data records
Drawbacks of Square-Error-Based Methods
• One representative per cluster
  – Good only for convex-shaped clusters of similar size and density
• k, the number of clusters, is a parameter
  – Good only if k can be reasonably estimated
CURE: the Ideas
• Each cluster has c representatives
  – Choose c well-scattered points in the cluster
  – Shrink them towards the mean of the cluster by a fraction α
  – The representatives capture the physical shape and geometry of the cluster
• Merge the closest two clusters
  – Distance of two clusters: the distance between their two closest representatives
Cure: The Algorithm
• Draw a random sample S
• Partition the sample into p partitions
• Partially cluster each partition
• Eliminate outliers
  – Random sampling + remove clusters growing too slowly
• Cluster the partial clusters until only k clusters are left
  – Shrink the representatives of clusters towards the cluster center
Data Partitioning and Clustering
[Figure: a 2-D sample of points from two clusters (marked x and y) is partitioned, each partition is partially clustered, and the partial clusters are then merged.]
Shrinking Representative Points
• Shrink the multiple representative points towards the gravity center by a fraction of α
• Representatives capture the shape
[Figure: representative points of two clusters before (left) and after (right) shrinking towards the gravity center.]
Clustering Categorical Data: ROCK
• RObust Clustering using linKs
  – Links: the number of common neighbors between two points
  – Use links to measure similarity/proximity; not distance-based
  – Complexity: O(n² + n·m_m·m_a + n² log n)
• Basic ideas
  – Similarity function and neighbors:
      Sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2|
  – Example: let T1 = {1, 2, 3} and T2 = {3, 4, 5}; then
      Sim(T1, T2) = |{3}| / |{1, 2, 3, 4, 5}| = 1/5 = 0.2
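The Jaccard-style similarity used above is a one-liner in Python (not the slides' code, just an illustration of the worked example):

```python
def jaccard(t1, t2):
    """Similarity of two transactions: |T1 ∩ T2| / |T1 ∪ T2|."""
    t1, t2 = set(t1), set(t2)
    return len(t1 & t2) / len(t1 | t2)
```

For T1 = {1, 2, 3} and T2 = {3, 4, 5} this returns 0.2, matching the slide.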
Limitations
• Merging decisions are based on static modeling
  – No special characteristics of clusters are considered
[Figure: CURE and BIRCH would merge C1 and C2, although C1′ and C2′ are more appropriate for merging.]
Chameleon
• Hierarchical clustering using dynamic modeling
• Measures similarity based on a dynamic model
  – The interconnectivity and closeness (proximity) between two clusters, versus the interconnectivity of the clusters and the closeness of items within the clusters
• A two-phase algorithm
  – Use a graph-partitioning algorithm to cluster objects into a large number of relatively small sub-clusters
  – Find the genuine clusters by repeatedly combining sub-clusters
Overall Framework of CHAMELEON
Data Set → Construct Sparse Graph → Partition the Graph → Merge Partitions → Final Clusters
Distance-based Methods: Drawbacks
• Hard to find clusters with irregular shapes • Hard to specify the number of clusters • Heuristic: a cluster must be dense
How to Find Irregular Clusters?
• Divide the whole space into many small areas – The density of an area can be estimated – Areas may or may not be exclusive – A dense area is likely in a cluster
• Start from a dense area, traverse connected dense areas, and discover clusters of irregular shape
Directly Density Reachable
• Parameters
  – Eps: maximum radius of the neighborhood
  – MinPts: minimum number of points in an Eps-neighborhood of that point
  – N_Eps(p) = {q | dist(p, q) ≤ Eps}
• Core object p: |N_Eps(p)| ≥ MinPts
  – A core object is in a dense area
• Point q is directly density-reachable from p iff q ∈ N_Eps(p) and p is a core object
[Figure: p and q with MinPts = 3, Eps = 1 cm]
Density-Based Clustering
• Density-reachable
  – If p1 → p2, p2 → p3, …, p_{n−1} → p_n are each directly density-reachable, then p_n is density-reachable from p1
• Density-connected
  – If points p and q are both density-reachable from some object o, then p and q are density-connected
DBSCAN
• A cluster: a maximal set of density-connected points
  – Discovers clusters of arbitrary shape in spatial databases with noise
[Figure: core, border, and outlier points, with Eps = 1 cm and MinPts = 5]
DBSCAN: the Algorithm
• Arbitrarily select a point p
• Retrieve all points density-reachable from p w.r.t. Eps and MinPts
• If p is a core point, a cluster is formed
• If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database
• Continue the process until all of the points have been processed
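The procedure above can be sketched in plain Python. This is an illustrative, unoptimized version (O(n²) neighborhood queries, Euclidean distance); the label convention of −1 for noise is an assumption, not part of the slides.

```python
def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    def neighbors(i):
        return [j for j in range(len(points))
                if sum((a - b) ** 2 for a, b in zip(points[i], points[j])) <= eps ** 2]

    labels = [None] * len(points)
    cid = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:       # not a core point: tentatively noise
            labels[i] = -1
            continue
        labels[i] = cid                # start a new cluster from core point i
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:        # border point previously marked noise
                labels[j] = cid
            if labels[j] is not None:
                continue
            labels[j] = cid
            nb = neighbors(j)
            if len(nb) >= min_pts:     # j is also a core point: expand through it
                queue.extend(nb)
        cid += 1
    return labels
```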
Challenges for DBSCAN
• Different clusters may have very different densities
• Clusters may be in hierarchies
OPTICS: A Cluster-ordering Method
• Idea: ordering points to identify the clustering structure
• “Group” points by density connectivity – Hierarchies of clusters
• Visualize clusters and the hierarchy
Ordering Points
• Points strongly density-connected should be close to one another
• Clusters density-connected should be close to one another and form a “cluster” of clusters
OPTICS: An Example
[Figure: reachability-distance plotted over the cluster order of the objects; valleys below a threshold ε′ ≤ ε indicate clusters, separated by points with large or undefined reachability-distance.]
DENCLUE: Using Density Functions
• DENsity-based CLUstEring
• Major features
  – Solid mathematical foundation
  – Good for data sets with large amounts of noise
  – Allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets
  – Significantly faster than existing algorithms (faster than DBSCAN by a factor of up to 45)
  – But needs a large number of parameters
DENCLUE: Techniques
• Use grid cells
  – Only keep grid cells actually containing data points
  – Manage the cells in a tree-based access structure
• Influence function: describes the impact of a data point on its neighborhood
• The overall density of the data space is the sum of the influence functions of all data points
• Cluster by identifying density attractors
  – Density attractor: a local maximum of the overall density function
Density Attractor
Center-defined and Arbitrary Clusters
A Shrinking-based Approach
• Difficulties of multi-dimensional clustering
  – Noise (outliers)
  – Clusters of various densities
  – Shapes that are not well defined
• A novel preprocessing concept: “shrinking”
• A shrinking-based clustering approach
Intuition & Purpose
• For data points in a data set, what if we could make them move towards the centroid of the natural subgroup they belong to?
• Natural sparse subgroups become denser and thus easier to detect
  – Noise is further isolated
Inspiration
• Newton’s Universal Law of Gravitation
  – Any two objects exert a gravitational force of attraction on each other
  – The direction of the force is along the line joining the objects
  – The magnitude of the force is directly proportional to the product of the gravitational masses of the objects, and inversely proportional to the square of the distance between them:
      F_g = G · m_1 · m_2 / r²
  – G: universal gravitational constant, G = 6.67 × 10⁻¹¹ N·m²/kg²
The Concept of Shrinking
• A data preprocessing technique
  – Aims to optimize the inner structure of real data sets
• Each data point is “attracted” by other data points and moves in the direction in which the attraction is strongest
• Can be applied in different fields
Applying Shrinking to Clustering
• Shrink the natural sparse clusters to make them much denser, to facilitate the subsequent cluster-detecting process
[Figure: multi-attribute hyperspace]
Data Shrinking
• Each data point moves along the direction of the density gradient, and the data set shrinks towards the inside of the clusters
• Points are “attracted” by their neighbors and move to create denser clusters
• The process proceeds iteratively, repeated until the data are stabilized or the number of iterations exceeds a threshold
Approximation & Simplification
• Problem: computing the mutual attraction of every pair of data points is too time-consuming: O(n²)
• Solutions
  – Drop Newton’s constant G; set m_1 and m_2 to unit mass
  – Only aggregate the gravitation surrounding each data point
  – Use grids to simplify the computation
Termination condition
• Average movement of all points in the current iteration is less than a threshold
• The number of iterations exceeds a threshold
Optics on Pendigits Data
Before data shrinking After data shrinking
Fuzzy Clustering
• Each point x_i takes a probability w_ij of belonging to cluster C_j
• Requirements
  – For each point x_i: ∑_{j=1}^{k} w_ij = 1
  – For each cluster C_j: 0 < ∑_{i=1}^{m} w_ij < m
Fuzzy C-Means (FCM)
Select an initial fuzzy pseudo-partition, i.e., assign values to all the w_ij
Repeat
    Compute the centroid of each cluster using the fuzzy pseudo-partition
    Recompute the fuzzy pseudo-partition, i.e., the w_ij
Until the centroids do not change (or the change is below some threshold)
Critical Details
• Optimize the sum of the squared error (SSE):
    SSE(C_1, …, C_k) = ∑_{j=1}^{k} ∑_{i=1}^{m} w_ij^p · dist(x_i, c_j)²
• Computing centroids:
    c_j = ∑_{i=1}^{m} w_ij^p · x_i / ∑_{i=1}^{m} w_ij^p
• Updating the fuzzy pseudo-partition:
    w_ij = (1 / dist(x_i, c_j)²)^{1/(p−1)} / ∑_{q=1}^{k} (1 / dist(x_i, c_q)²)^{1/(p−1)}
  – When p = 2:
    w_ij = (1 / dist(x_i, c_j)²) / ∑_{q=1}^{k} (1 / dist(x_i, c_q)²)
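The centroid and membership updates above can be sketched for 1-D data. This is an illustrative plain-Python version, not the slides' code; the random initialization, fixed iteration count, and the small epsilon guarding division by zero are assumptions.

```python
import random

def fcm(points, k, p=2, iters=50, seed=0):
    """Fuzzy c-means on 1-D data; returns (centroids, membership matrix w)."""
    rng = random.Random(seed)
    # Initial fuzzy pseudo-partition: random weights, normalized per point
    w = []
    for _ in points:
        row = [rng.random() for _ in range(k)]
        s = sum(row)
        w.append([v / s for v in row])
    c = [0.0] * k
    for _ in range(iters):
        # Centroids: means weighted by w_ij^p
        c = [sum(w[i][j] ** p * points[i] for i in range(len(points))) /
             sum(w[i][j] ** p for i in range(len(points)))
             for j in range(k)]
        # Membership update: w_ij proportional to (1 / dist^2)^(1/(p-1))
        for i, x in enumerate(points):
            inv = [(1.0 / max((x - cj) ** 2, 1e-12)) ** (1.0 / (p - 1)) for cj in c]
            s = sum(inv)
            w[i] = [v / s for v in inv]
    return c, w
```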
Choice of P
• When p → 1, FCM behaves like traditional k-means
• When p is larger, the cluster centroids approach the global centroid of all data points
• The partition becomes fuzzier as p increases
Effectiveness
Mixture Models
• A cluster can be modeled as a probability distribution
  – Practically, assume a distribution can be approximated well using a multivariate normal distribution
• Multiple clusters form a mixture of different probability distributions
• A data set is a set of observations from a mixture of models
Object Probability
• Suppose there are k clusters and a set X of m objects
  – Let the j-th cluster have parameters θ_j = (µ_j, σ_j)
  – The probability that a point is in the j-th cluster is w_j, with w_1 + … + w_k = 1
• The probability of an object x is
    prob(x | Θ) = ∑_{j=1}^{k} w_j · p_j(x | θ_j)
• The probability of the data set X is
    prob(X | Θ) = ∏_{i=1}^{m} prob(x_i | Θ) = ∏_{i=1}^{m} ∑_{j=1}^{k} w_j · p_j(x_i | θ_j)
Example

For a univariate Gaussian component,
    prob(x | θ_j) = (1 / (√(2π) σ)) e^{−(x−µ)² / (2σ²)}

With two components θ_1 = (−4, 2) and θ_2 = (4, 2),
    prob(x | Θ) = (1 / (2√(2π))) e^{−(x+4)²/8} + (1 / (2√(2π))) e^{−(x−4)²/8}
Maximal Likelihood Estimation
• Maximum likelihood principle: if we know that a set of objects comes from one distribution but do not know the parameters, we can choose the parameters that maximize the probability of the data
• Maximize
    prob(X | Θ) = ∏_{i=1}^{m} (1 / (√(2π) σ)) e^{−(x_i−µ)² / (2σ²)}
• Equivalently, maximize
    log prob(X | Θ) = −∑_{i=1}^{m} (x_i − µ)² / (2σ²) − 0.5 m log 2π − m log σ
EM Algorithm
• Expectation Maximization algorithm
Select an initial set of model parameters
Repeat
    Expectation step: for each object x_i, calculate the probability that it belongs to each distribution θ_j, i.e., prob(θ_j | x_i)
    Maximization step: given the probabilities from the expectation step, find the new estimates of the parameters that maximize the expected likelihood
Until the parameters are stable
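The E/M loop above can be sketched for the two-component example. This simplified illustration (not from the slides) fixes equal variances and equal mixing weights and estimates only the two means; the initialization from the extreme values is an assumption.

```python
import math

def em_two_means(xs, sigma=1.0, iters=50):
    """EM for a 1-D mixture of two Gaussians with fixed, equal sigma and
    equal mixing weights; only the two means are estimated (a simplification)."""
    def pdf(x, mu):
        return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

    mu1, mu2 = min(xs), max(xs)        # crude initial estimates
    for _ in range(iters):
        # E-step: posterior probability that each x came from component 1
        r = [pdf(x, mu1) / (pdf(x, mu1) + pdf(x, mu2)) for x in xs]
        # M-step: responsibility-weighted means
        mu1 = sum(ri * x for ri, x in zip(r, xs)) / sum(r)
        mu2 = sum((1 - ri) * x for ri, x in zip(r, xs)) / sum(1 - ri for ri in r)
    return mu1, mu2
```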
Advantages and Disadvantages
• Mixture models are more general than k-means and fuzzy c-means
• Clusters can be characterized by a small number of parameters
• The results may satisfy the statistical assumptions of the generative models
• Computationally expensive • Need large data sets • Hard to estimate the number of clusters
Grid-based Clustering Methods
• Ideas – Using multi-resolution grid data structures – Using dense grid cells to form clusters
• Several interesting methods – CLIQUE – STING – WaveCluster
CLIQUE
• CLustering In QUEst
• Automatically identifies subspaces of a high-dimensional data space
• Both density-based and grid-based
CLIQUE: the Ideas
• Partition each dimension into the same number of equal-length intervals
  – This partitions an m-dimensional data space into non-overlapping rectangular units
• A unit is dense if the number of data points in it exceeds a threshold
• A cluster is a maximal set of connected dense units within a subspace
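The bottom-up starting point, counting points per equal-length interval on a single dimension and keeping the dense units, can be sketched as follows. This is a hypothetical helper for illustration only, not CLIQUE's actual implementation.

```python
from collections import Counter

def dense_units_1d(values, n_intervals, lo, hi, threshold):
    """Count the points falling into each equal-length interval of [lo, hi]
    on one dimension and return the indices of the dense intervals."""
    width = (hi - lo) / n_intervals
    # clamp so the value exactly at hi lands in the last interval
    counts = Counter(min(int((v - lo) / width), n_intervals - 1) for v in values)
    return {unit for unit, c in counts.items() if c > threshold}
```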
CLIQUE: the Method
• Partition the data space and count the number of points in each cell of the partition
  – Apriori property: a k-d cell cannot be dense if one of its (k−1)-d projections is not dense
• Identify clusters:
  – Determine dense units in all subspaces of interest, and connected dense units in all subspaces of interest
• Generate a minimal description for the clusters
  – Determine the minimal cover for each cluster
CLIQUE: An Example
[Figure: points plotted against age (20–60) in two 2-D subspaces, salary (×10,000) vs. age and vacation (weeks) vs. age; dense units found in each subspace (e.g., age in 30–50) are intersected to suggest a cluster in the full space.]
CLIQUE: Pros and Cons
• Automatically finds the subspaces of highest dimensionality with high-density clusters
• Insensitive to the order of the input
  – Does not presume any canonical data distribution
• Scales linearly with the size of the input
• Scales well with the number of dimensions
• The clustering result may be degraded, the price of the method’s simplicity
Bad Cases for CLIQUE
Parts of a cluster may be missed
A cluster from CLIQUE may contain noise
Biclustering
• Clustering both objects and attributes simultaneously
• Four requirements
  – Only a small set of objects participates in a cluster (bicluster)
  – A bicluster involves only a small number of attributes
  – An object may participate in multiple biclusters, or in none
  – An attribute may be involved in multiple biclusters, or in none
Application Examples
• Recommender systems – Objects: users – Attributes: items – Values: user ratings
• Microarray data – Objects: genes – Attributes: samples – Values: expression levels
[Figure: an n × m data matrix with genes as rows and samples/conditions as columns, entries w_11 … w_nm.]
Biclusters with Constant Values
      …  b6  …  b12  …  b36  …  b99  …
a1    …  60  …  60   …  60   …  60   …
…     …  …   …  …    …  …    …  …    …
a33   …  60  …  60   …  60   …  60   …
…     …  …   …  …    …  …    …  …    …
a86   …  60  …  60   …  60   …  60   …
…     …  …   …  …    …  …    …  …    …

Figure 11.5: A gene-condition matrix, a submatrix, and a bi-cluster.
subset of products. For example, AllElectronics is highly interested in finding a group of customers who all like the same group of products. Such a cluster is a submatrix in the customer-product matrix, where all elements have a high value. Using such a cluster, AllElectronics can make recommendations in two directions. First, the company can recommend products to new customers who are similar to the customers in the cluster. Second, the company can recommend to customers new products that are similar to those involved in the cluster.

As with bi-clusters in a gene expression data matrix, the bi-clusters in a customer-product matrix usually have the following characteristics:

• Only a small set of customers participate in a cluster;
• A cluster involves only a small subset of products;
• A customer can participate in multiple clusters, or may not participate in any cluster at all; and
• A product may be involved in multiple clusters, or may not be involved in any cluster at all.

Bi-clustering can be applied to customer-product matrices to mine clusters satisfying the above requirements.
Types of Bi-clusters
“How can we model bi-clusters and mine them?” Let’s start with some basic notation. For the sake of simplicity, we’ll use “genes” and “conditions” to refer to the two dimensions in our discussion. Our discussion can easily be extended to other applications. For example, we can simply replace “genes” and “conditions” by “customers” and “products” to tackle the customer-product bi-clustering problem.

Let A = {a1, …, an} be a set of genes and B = {b1, …, bm} be a set of conditions. Let E = [e_ij] be a gene expression data matrix, that is, a gene-condition matrix, where 1 ≤ i ≤ n and 1 ≤ j ≤ m. A submatrix I × J is
10  10  10  10  10
20  20  20  20  20
50  50  50  50  50
 0   0   0   0   0

Figure 11.6: A bi-cluster with constant values on rows.
10  50  30   70  20
20  60  40   80  30
50  90  70  110  60
 0  40  20   60  10

Figure 11.7: A bi-cluster with coherent values.
defined by a subset I ⊆ A of genes and a subset J ⊆ B of conditions. For example, in the matrix shown in Figure 11.5, {a1, a33, a86} × {b6, b12, b36, b99} is a submatrix.

A bi-cluster is a submatrix where genes and conditions follow consistent patterns. We can define different types of bi-clusters based on such patterns:

• As the simplest case, a submatrix I × J (I ⊆ A, J ⊆ B) is a bi-cluster with constant values if for any i ∈ I and j ∈ J, e_ij = c, where c is a constant. For example, the submatrix {a1, a33, a86} × {b6, b12, b36, b99} in Figure 11.5 is a bi-cluster with constant values.

• A bi-cluster is interesting if each row has a constant value, though different rows may have different values. A bi-cluster with constant values on rows is a submatrix I × J such that for any i ∈ I and j ∈ J, e_ij = c + α_i, where α_i is the adjustment for row i. For example, Figure 11.6 shows a bi-cluster with constant values on rows. Symmetrically, a bi-cluster with constant values on columns is a submatrix I × J such that for any i ∈ I and j ∈ J, e_ij = c + β_j, where β_j is the adjustment for column j.

• More generally, a bi-cluster is interesting if the rows change in a synchronized way with respect to the columns and vice versa. Mathematically, a bi-cluster with coherent values (also known as a pattern-based cluster) is a submatrix I × J such that for any i ∈ I and j ∈ J, e_ij = c + α_i + β_j, where α_i and β_j are the adjustments for row i and column j, respectively. For example, Figure 11.7 shows a bi-cluster with coherent values.

It can be shown that I × J is a bi-cluster with coherent values if and only if for any i1, i2 ∈ I and j1, j2 ∈ J, e_{i1 j1} − e_{i2 j1} = e_{i1 j2} − e_{i2 j2}. Moreover, instead of using addition, we can define bi-clusters with coherent
On rows
Biclusters with Coherent Values
• Also known as pattern-based clusters
Biclusters with Coherent Evolutions
• Only up- or down-regulated changes over rows or columns
10   50  30    70  20
20  100  50  1000  30
50  100  90   120  80
 0   80  20   100  10

Figure 11.8: A bi-cluster with coherent evolutions on rows.
values using multiplication, that is, e_ij = c · α_i · β_j. Clearly, bi-clusters with constant values on rows or columns are special cases of bi-clusters with coherent values.

• In some applications, we may only be interested in the up- or down-regulated changes across genes or conditions without constraining the exact values. A bi-cluster with coherent evolutions on rows is a submatrix I × J such that for any i1, i2 ∈ I and j1, j2 ∈ J, (e_{i1 j1} − e_{i1 j2})(e_{i2 j1} − e_{i2 j2}) ≥ 0. For example, Figure 11.8 shows a bi-cluster with coherent evolutions on rows. Symmetrically, we can define bi-clusters with coherent evolutions on columns.
Next, we study how to mine bi-clusters.
Bi-clustering Methods
The above specification of the types of bi-clusters only considers ideal cases. In real data sets, such perfect bi-clusters rarely exist. When they do exist, they are usually very small. Instead, random noise can affect the readings of e_ij and thus prevent a bi-cluster in nature from appearing in a perfect shape.

There are two major types of methods for discovering bi-clusters in data that may come with noise. Optimization-based methods conduct an iterative search. At each iteration, the submatrix with the highest significance score is identified as a bi-cluster. The process terminates when a user-specified condition is met. Due to cost concerns in computation, greedy search is often employed to find locally optimal bi-clusters. Enumeration methods use a tolerance threshold to specify the degree of noise allowed in the bi-clusters to be mined, and then try to enumerate all submatrices of bi-clusters that satisfy the requirements. We use the δ-Cluster and MaPle algorithms as examples to illustrate these ideas.
Optimization Using the δ-Cluster Algorithm
For a submatrix, I × J, the mean of the i-th row is

e_{iJ} = \frac{1}{|J|} \sum_{j \in J} e_{ij}. \qquad (11.16)
Coherent evolutions on rows
Jian Pei: Big Data Analytics -- Clustering 88
Differences from Subspace Clustering
• Subspace clustering uses global distance/similarity measure
• Pattern-based clustering looks at patterns • A subspace cluster according to a globally
defined similarity measure may not follow the same pattern
Jian Pei: Big Data Analytics -- Clustering 89
Objects Follow the Same Pattern?
(Figure: the values of two objects, shown in blue and green, plotted over dimensions D1 and D2; pScore measures how far the two curves deviate from a parallel pattern.)
The smaller the pScore, the more consistent the two objects are.
Jian Pei: Big Data Analytics -- Clustering 90
Pattern-based Clusters
• pScore: the similarity between two objects rx, ry on two attributes au, av
• δ-pCluster (R, D): for any objects rx, ry∈R and any attributes au, av∈D,
pScore\left( \begin{bmatrix} r_x.a_u & r_x.a_v \\ r_y.a_u & r_y.a_v \end{bmatrix} \right) = \left| (r_x.a_u - r_y.a_u) - (r_x.a_v - r_y.a_v) \right|

pScore\left( \begin{bmatrix} r_x.a_u & r_x.a_v \\ r_y.a_u & r_y.a_v \end{bmatrix} \right) \le \delta \quad (\delta \ge 0)
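The definition above transcribes directly into a few lines of Python. This is a minimal sketch with our own function names (not code from the course); objects are modeled as dicts mapping attribute names to values.

```python
from itertools import combinations

# Hypothetical sketch of pScore and the delta-pCluster test (our naming).
def p_score(rx, ry, u, v):
    """pScore of objects rx, ry on attributes u, v."""
    return abs((rx[u] - ry[u]) - (rx[v] - ry[v]))

def is_delta_pcluster(objects, attrs, delta):
    """(R, D) is a delta-pCluster if every pair of objects and every
    pair of attributes has pScore at most delta."""
    return all(
        p_score(rx, ry, u, v) <= delta
        for rx, ry in combinations(objects, 2)
        for u, v in combinations(attrs, 2)
    )

# Two objects that differ by a constant shift form a 0-pCluster.
r1 = {"a": 1.0, "b": 2.0, "c": 3.0}
r2 = {"a": 4.0, "b": 5.0, "c": 6.0}
print(is_delta_pcluster([r1, r2], ["a", "b", "c"], delta=0.0))  # True
```

Note the shift-pattern reading: pScore is 0 exactly when the two objects change by the same amount between the two attributes.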
Jian Pei: Big Data Analytics -- Clustering 91
Maximal pCluster
• If (R, D) is a δ-pCluster, then every sub-cluster (R', D'), where R' ⊆ R and D' ⊆ D, is also a δ-pCluster
– An anti-monotonic property
– A large pCluster is accompanied by many small pClusters: enumerating them all is inefficient
• Idea: mine only the maximal pClusters!
– A δ-pCluster is maximal if there exists no proper super-cluster that is also a δ-pCluster
Jian Pei: Big Data Analytics -- Clustering 92
Mining Maximal pClusters
• Given – A cluster threshold δ – An attribute threshold mina – An object threshold mino
• Task: mine the complete set of significant maximal δ-pClusters – A significant δ-pCluster has at least mino objects
on at least mina attributes
Jian Pei: Big Data Analytics -- Clustering 93
pClusters and Frequent Itemsets
• A transaction database can be modeled as a binary matrix
• Frequent itemset: a sub-matrix of all 1's
– A 0-pCluster on binary data
– min_o: the support threshold
– min_a: no fewer than min_a attributes
– Maximal pClusters correspond to closed itemsets
• Frequent itemset mining algorithms cannot be extended straightforwardly for mining pClusters on numeric data
Jian Pei: Big Data Analytics -- Clustering 94
Where Should We Start from?
• How about the pClusters having only 2 objects or 2 attributes? – MDS (maximal dimension set) – A pCluster must have at least 2 objects and 2
attributes • Finding MDSs
Attribute    a   b   c   d   e   f   g   h
Object x    13  11   9   7   9  13   2  15
Object y     7   4  10   1  12   3   4   7
x - y        6   7  -1   6  -3  10  -2   8
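For two objects, the pScore on attributes a_u, a_v equals |d_u − d_v|, where d = x − y is the difference row in the table above. So an attribute set D is a δ-pCluster for {x, y} exactly when max d − min d over D is at most δ, and the MDSs are the maximal windows over the sorted differences. A hedged sketch of this idea (our own naming, not the MaPle implementation):

```python
# Sketch: maximal dimension sets (MDSs) for a pair of objects via a
# sliding window over the sorted difference values x - y (our code).
def maximal_dimension_sets(x, y, delta):
    diffs = sorted((x[a] - y[a], a) for a in x)   # attributes sorted by x - y
    vals = [v for v, _ in diffs]
    n = len(vals)
    sets, end = [], 0
    for start in range(n):
        end = max(end, start)
        # grow the window while the value range stays within delta
        while end + 1 < n and vals[end + 1] - vals[start] <= delta:
            end += 1
        # keep only windows that cannot be extended to the left,
        # and that have at least 2 attributes
        if (start == 0 or vals[end] - vals[start - 1] > delta) and end - start + 1 >= 2:
            sets.append(sorted(a for _, a in diffs[start:end + 1]))
    return sets

# The table's values:
x = {"a": 13, "b": 11, "c": 9, "d": 7, "e": 9, "f": 13, "g": 2, "h": 15}
y = {"a": 7, "b": 4, "c": 10, "d": 1, "e": 12, "f": 3, "g": 4, "h": 7}
print(maximal_dimension_sets(x, y, delta=1))
```

With δ = 1 this yields the MDSs {e, g}, {c, g}, {a, b, d}, and {b, h} for the pair {x, y}.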
Jian Pei: Big Data Analytics -- Clustering 95
How to Assemble Larger pClusters? • Systematically enumerate
every combination of attributes D – For each attribute subset,
find the maximal subsets of objects R s.t. (R, D) is a pCluster
– Check whether (R, D) is maximal
• Prune search branches as early as possible
• Why attribute-first-object-later? – # of objects >> # attributes
• Algorithm MaPle (Pei et al, 2003)
Jian Pei: Big Data Analytics -- Clustering 96
More Pruning Techniques
• Only possible attributes should be considered to get larger pClusters
• Pruning local maximal pClusters having insufficient possible attributes
• Extracting common attributes from possible attribute set directly
• Prune non-maximal pClusters
Jian Pei: Big Data Analytics -- Clustering 97
Gene-Sample-Time Series Data
(Figure: gene-sample-time series data forms a 3-D cube with Gene, Sample, and Time axes; each cell holds the expression level of gene i on sample j at time k. Slicing the cube yields the Gene-Time, Gene-Sample, and Sample-Time matrices.)
Jian Pei: Big Data Analytics -- Clustering 98
Mining GST Microarray Data
• Reduce the gene-sample-time series data to gene-sample data
– Use Pearson's correlation coefficient as the coherence measure
Jian Pei: Big Data Analytics -- Clustering 99
Basic Approaches
• Sample-gene search – Enumerate the subsets of samples
systematically – For each subset of samples, find the genes that
are coherent on the samples • Gene-sample search
– Enumerate the subsets of genes systematically – For each subset of genes, find the samples on
which the genes are coherent
Jian Pei: Big Data Analytics -- Clustering 100
Basic Tools
• Set enumeration tree • Sample-gene search and gene-sample
search are not symmetric! – Many genes, but a few samples – No requirement on samples coherent on genes
Jian Pei: Big Data Analytics -- Clustering 101
Phenotypes and Informative Genes
(Figure: a gene × sample matrix over samples 1-7; the informative genes show expression patterns that distinguish the sample groups, while the non-informative genes show no such pattern.)
Jian Pei: Big Data Analytics -- Clustering 102
The Phenotype Mining Problem
• Input: a microarray matrix and k • Output: phenotypes and informative genes
– Partitioning the samples into k exclusive subsets – phenotypes
– Informative genes discriminating the phenotypes
• Machine learning methods – Heuristic search – Mutual reinforcing adjustment
Jian Pei: Big Data Analytics -- Clustering 103
Requirements
• The expression levels of each informative gene should be similar over the samples within each phenotype
• The expression levels of each informative gene should display a clear dissimilarity between each pair of phenotypes
Jian Pei: Big Data Analytics -- Clustering 104
Intra-phenotype Consistency
• In a subset of genes (candidate informative genes), does every gene have good consistency on a set of samples?
• Average of variance of the subset of genes – the smaller the intra-phenotype consistency, the better
Con(G', S') = \frac{1}{|G'| \cdot |S'|} \sum_{g_i \in G'} \sum_{s_j \in S'} (w_{i,j} - \bar{w}_{i,S'})^2

where \bar{w}_{i,S'} is the average expression level of gene g_i over the samples in S'.
Jian Pei: Big Data Analytics -- Clustering 105
Inter-phenotype Divergence
• How a subset of genes (candidate informative genes) can discriminate two phenotypes of samples?
• Sum of the average difference between the phenotypes – the larger the inter-phenotype divergence, the better
Div(G', (S_1, S_2)) = \frac{\sum_{g_i \in G'} |\bar{w}_{i,S_1} - \bar{w}_{i,S_2}|}{|G'|}
Jian Pei: Big Data Analytics -- Clustering 106
Quality of Phenotypes and Informative Genes
• The higher the value, the better the quality
\Omega = \sum_{S_i, S_j\ (1 \le i, j \le K;\ i \ne j)} \left( \frac{1}{Con(G', S_i) + Con(G', S_j)} + Div(G', (S_i, S_j)) \right)
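The three measures can be prototyped in a few lines of numpy. This is a sketch: the matrix W (genes × samples), the index-list conventions, and iterating Ω over ordered pairs (so each unordered pair counts twice) are our own reading of the slides.

```python
import numpy as np

# Sketch of Con, Div, and Omega (our own names and conventions).
# W: genes x samples matrix; a phenotype is a list of column indices.

def con(W, genes, samples):
    """Intra-phenotype consistency: mean squared deviation of each
    candidate informative gene from its per-gene mean over the samples."""
    sub = W[np.ix_(genes, samples)]
    return np.mean((sub - sub.mean(axis=1, keepdims=True)) ** 2)

def div(W, genes, s1, s2):
    """Inter-phenotype divergence: average |mean difference| per gene."""
    m1 = W[np.ix_(genes, s1)].mean(axis=1)
    m2 = W[np.ix_(genes, s2)].mean(axis=1)
    return np.mean(np.abs(m1 - m2))

def quality(W, genes, phenotypes):
    """Omega, summed over ordered pairs of distinct phenotypes."""
    total = 0.0
    for i, si in enumerate(phenotypes):
        for j, sj in enumerate(phenotypes):
            if i != j:
                total += 1.0 / (con(W, genes, si) + con(W, genes, sj)) \
                         + div(W, genes, si, sj)
    return total

# Two genes, four samples, two phenotypes {0,1} and {2,3} (toy data).
W = np.array([[1.0, 1.2, 5.0, 5.2],
              [2.0, 2.1, 8.0, 8.3]])
print(quality(W, [0, 1], [[0, 1], [2, 3]]))
```

Smaller Con and larger Div both push Ω up, matching the slide's "the higher the value, the better the quality".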
Jian Pei: Big Data Analytics -- Clustering 107
Heuristic Search
• Start from a random subset of genes and an arbitrary partition of the samples
• Iteratively adjust the partition and the gene set toward a better solution – For each possible adjustment, compute ΔΩ
• For each gene, try possible insert/remove • For each sample, try the best movement
– ΔΩ > 0 → conduct the adjustment
– ΔΩ < 0 → conduct the adjustment with probability e^{\frac{\Delta\Omega}{\Omega} \cdot T(i)}
• T(i) is a decreasing simulated annealing function and i is the iteration number; T(i) = 1/(i+1) in our implementation
Jian Pei: Big Data Analytics -- Clustering 108
Possible Adjustments
Insert a gene Remove a gene Move a sample
Jian Pei: Big Data Analytics -- Clustering 109
Disadvantages of Heuristic Search
• Samples and genes are examined and adjusted with equal chances – # samples << # genes – Samples should play more important roles
• Outliers in the samples should be handled specifically
– Outliers strongly interfere with the quality measure and the adjustment decisions
Jian Pei: Big Data Analytics -- Clustering 110
Mutual Reinforcing Adjustment
• A two-phase approach – Iteration phase – Refinement phase
• Mutual reinforcement – Use gene partition to improve the sample
partition – Use the sample partition to improve the gene
partition
Jian Pei: Big Data Analytics -- Clustering 111
Dimensionality Reduction
• Clustering a high dimensional data set is challenging – Distance between two points could be dominated by
noise • Dimensionality reduction: choosing the informative
dimensions for clustering analysis – Feature selection: choosing a subset of existing
dimensions – Feature construction: construct a new (small) set of
informative attributes
Jian Pei: Big Data Analytics -- Clustering 112
Variance and Covariance
• Given a set of 1-d points, how different are those points? – Standard deviation: – Variance:
• Given a set of 2-d points, are the two dimensions correlated? – Covariance:
s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} \qquad s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}

cov(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}
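These sample formulas, with the n − 1 denominator, correspond to ddof=1 in numpy. A quick numeric check on toy data:

```python
import numpy as np

# Verify the sample variance and covariance formulas against numpy (ddof=1).
X = np.array([1.0, 2.0, 3.0, 4.0])
Y = np.array([2.0, 4.0, 6.0, 8.0])

var_manual = np.sum((X - X.mean()) ** 2) / (len(X) - 1)
cov_manual = np.sum((X - X.mean()) * (Y - Y.mean())) / (len(X) - 1)

print(var_manual, np.var(X, ddof=1))             # both 5/3
print(cov_manual, np.cov(X, Y, ddof=1)[0, 1])    # both 10/3
```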
Jian Pei: Big Data Analytics -- Clustering 113
Principal Components
Art work and example from http://csnet.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
Jian Pei: Big Data Analytics -- Clustering 114
Step 1: Mean Subtraction
• Subtract the mean from each dimension for each data point
• Intuition: centralizing the data set
Jian Pei: Big Data Analytics -- Clustering 115
Step 2: Covariance Matrix
C = \begin{pmatrix} cov(D_1, D_1) & cov(D_1, D_2) & \cdots & cov(D_1, D_n) \\ cov(D_2, D_1) & cov(D_2, D_2) & \cdots & cov(D_2, D_n) \\ \vdots & \vdots & \ddots & \vdots \\ cov(D_n, D_1) & cov(D_n, D_2) & \cdots & cov(D_n, D_n) \end{pmatrix}
Jian Pei: Big Data Analytics -- Clustering 116
Step 3: Eigenvectors and Eigenvalues
• Compute the eigenvectors and the eigenvalues of the covariance matrix
– Intuition: find the direction-invariant vectors as candidates for new attributes
– Eigenvalues indicate how much the direction-invariant vectors are scaled: the larger the eigenvalue, the more of the data variance that direction manifests
Jian Pei: Big Data Analytics -- Clustering 117
Step 4: Forming New Features
• Choose the principal components and form new features
– Typically, choose the top-k components
Jian Pei: Big Data Analytics -- Clustering 118
New Features
NewData = RowFeatureVector x RowDataAdjust
The first principal component is used
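Steps 1-4 can be sketched end-to-end with numpy. The variable names and the synthetic correlated data are our own, purely for illustration:

```python
import numpy as np

# A minimal PCA sketch following steps 1-4 above (numpy only).
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 2)) @ np.array([[2.0, 1.0],
                                             [0.0, 0.5]])   # correlated 2-d data

# Step 1: mean subtraction (centralize the data set)
centered = data - data.mean(axis=0)
# Step 2: covariance matrix
C = np.cov(centered, rowvar=False)
# Step 3: eigenvectors and eigenvalues, sorted by decreasing eigenvalue
eigvals, eigvecs = np.linalg.eigh(C)       # eigh returns ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
# Step 4: keep the top-k components and form the new features
k = 1
new_data = centered @ eigvecs[:, :k]
print(new_data.shape)  # (100, 1)
```

The variance of the projected data equals the top eigenvalue, which is exactly the "larger eigenvalue manifests more variance" intuition from step 3.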
Clustering in Derived Space
Jian Pei: Big Data Analytics -- Clustering 119
(Figure: 2-d points in the X-Y plane with origin O, projected onto the first principal component, the direction -0.707x + 0.707y; clustering is conducted in this derived space.)
Spectral Clustering
Jian Pei: Big Data Analytics -- Clustering 120
Data → Affinity matrix W = [w_{ij}] → A = f(W) → computing the k leading eigenvectors of A (Av = λv) → clustering in the new space → projecting back to cluster the original data
Affinity Matrix
• Using a distance measure, where σ is a scaling parameter controlling how fast the affinity Wij decreases as the distance increases
• In the Ng-Jordan-Weiss algorithm, Wii is set to 0
Jian Pei: Big Data Analytics -- Clustering 121
W_{ij} = e^{-\frac{dist(o_i, o_j)}{\sigma}}
Clustering
• In the Ng-Jordan-Weiss algorithm, we define a diagonal matrix such that
• Then, • Use the k leading eigenvectors to form a
new space • Map the original data to the new space and
conduct clustering Jian Pei: Big Data Analytics -- Clustering 122
D_{ii} = \sum_{j=1}^{n} W_{ij} \qquad A = D^{-\frac{1}{2}} W D^{-\frac{1}{2}}
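The pipeline can be sketched in numpy. The affinity follows the slide's e^{-dist/σ} formula, and the small k-means helper (with a farthest-point initialization) is our own stand-in for the final clustering step, not part of the algorithm specification:

```python
import numpy as np

# Sketch of the Ng-Jordan-Weiss pipeline described above (our naming).
def njw_embedding(points, sigma, k):
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    W = np.exp(-d / sigma)                    # affinity, per the slide's formula
    np.fill_diagonal(W, 0.0)                  # Ng-Jordan-Weiss sets W_ii = 0
    D_inv_sqrt = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
    A = D_inv_sqrt @ W @ D_inv_sqrt           # A = D^{-1/2} W D^{-1/2}
    eigvals, eigvecs = np.linalg.eigh(A)
    V = eigvecs[:, np.argsort(eigvals)[::-1][:k]]   # k leading eigenvectors
    return V / np.linalg.norm(V, axis=1, keepdims=True)  # row-normalize

def simple_kmeans(X, k, iters=20):
    centers = [X[0]]                           # greedy farthest-point init (our choice)
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d2)])
    centers = np.array(centers)
    for _ in range(iters):                     # Lloyd iterations
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(axis=2), axis=1)
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels

rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 0.1, (10, 2)), rng.normal(5, 0.1, (10, 2))])
labels = simple_kmeans(njw_embedding(pts, sigma=1.0, k=2), 2)
print(labels)
```

The original Ng-Jordan-Weiss paper uses the squared-distance affinity e^{-dist²/2σ²}; for this well-separated toy data the slide's simpler form behaves the same way.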
Is a Clustering Good?
• Feasibility – Applying any clustering methods on a uniformly
distributed data set is meaningless • Quality
– Are the clustering results meeting users' interest?
– Clustering patients into clusters corresponding to various diseases or sub-phenotypes is meaningful
– Clustering patients into clusters corresponding to male or female is not meaningful
Jian Pei: Big Data Analytics -- Clustering 123
Major Tasks
• Assessing clustering tendency – Are there non-random structures in the data?
• Determining the number of clusters or other critical parameters
• Measuring clustering quality
Jian Pei: Big Data Analytics -- Clustering 124
Uniformly Distributed Data
• Clustering uniformly distributed data is meaningless
• A uniformly distributed data set is generated by a uniform data distribution
Jian Pei: Big Data Analytics -- Clustering 125
Figure 10.21: A data set that is uniformly distributed in the data space.
• Measuring clustering quality. After applying a clustering method on a data set, we want to assess how good the resulting clusters are. A number of measures can be used. Some methods measure how well the clusters fit the data set, while others measure how well the clusters match the ground truth, if such truth is available. There are also measures that score clusterings and thus can compare two sets of clustering results on the same data set.
In the rest of this section, we discuss each of the above three topics.
10.6.1 Assessing Clustering Tendency
Clustering tendency assessment determines whether a given data set has a non-random structure, which may lead to meaningful clusters. Consider a data set that does not have any non-random structure, such as a set of uniformly distributed points in a data space. Even though a clustering algorithm may return clusters for the data, those clusters are random and are not meaningful.
Example 10.9 Clustering requires a non-uniform distribution of data. Figure 10.21 shows a data set that is uniformly distributed in 2-dimensional data space. Although a clustering algorithm may still artificially partition the points into groups, the groups are unlikely to mean anything significant to the application due to the uniform distribution of the data.
“How can we assess the clustering tendency of a data set?” Intuitively, we can try to measure the probability that the data set is generated by a uniform data distribution. This can be achieved using statistical tests for spatial randomness. To illustrate this idea, let's look at a simple yet effective statistic called the Hopkins Statistic.
The Hopkins Statistic is a spatial statistic that tests the spatial randomness of a variable as distributed in a space. Given a data set, D, which is regarded as a sample of a random variable, o, we want to determine how far away o is from being uniformly distributed in the data space. We calculate the Hopkins Statistic as follows:
Hopkins Statistic
• Hypothesis: the data is generated by a uniform distribution in a space
• Sample n points, p1, …, pn, uniformly from the space of D
• For each point pi, find the nearest neighbor of pi in D, let xi be the distance between pi and its nearest neighbor in D
Jian Pei: Big Data Analytics -- Clustering 126
x_i = \min_{v \in D} \{ dist(p_i, v) \}
Hopkins Statistic
• Sample n points, q1, …, qn, uniformly from D • For each qi, find the nearest neighbor of qi
in D – {qi}, let yi be the distance between qi and its nearest neighbor in D – {qi}
• Calculate the Hopkins Statistic H
Jian Pei: Big Data Analytics -- Clustering 127
y_i = \min_{v \in D, v \ne q_i} \{ dist(q_i, v) \}

H = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i}
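The two sampling steps above transcribe into a short numpy function. This is a sketch; the sample size, the random seeds, and using the bounding box of D as "the data space" are our own choices:

```python
import numpy as np

# Hopkins-statistic sketch (our naming and conventions).
def hopkins(D, n, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = D.min(axis=0), D.max(axis=0)
    # p_1..p_n: points sampled uniformly from the data space
    P = rng.uniform(lo, hi, size=(n, D.shape[1]))
    x = np.array([np.min(np.linalg.norm(D - p, axis=1)) for p in P])
    # q_1..q_n: points sampled from D; nearest neighbor taken in D - {q_i}
    idx = rng.choice(len(D), n, replace=False)
    y = []
    for i in idx:
        dists = np.linalg.norm(D - D[i], axis=1)
        dists[i] = np.inf                      # exclude q_i itself
        y.append(dists.min())
    y = np.array(y)
    return y.sum() / (x.sum() + y.sum())

rng = np.random.default_rng(1)
uniform = rng.uniform(0, 1, (500, 2))
clustered = np.vstack([rng.normal(0.2, 0.01, (250, 2)),
                       rng.normal(0.8, 0.01, (250, 2))])
print(hopkins(uniform, 50), hopkins(clustered, 50))
```

On the uniform set H comes out near 0.5, and on the two tight clusters it comes out near 0, matching the explanation that follows.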
Explanation
• If D is uniformly distributed, then \sum_{i=1}^{n} x_i and \sum_{i=1}^{n} y_i would be close to each other, and thus H would be around 0.5
• If D is skewed, then \sum_{i=1}^{n} y_i would be substantially smaller, and thus H would be close to 0
• If H > 0.5, then it is unlikely that D has statistically significant clusters
Jian Pei: Big Data Analytics -- Clustering 128
Finding the Number of Clusters
• Depending on many factors – The shape and scale of the distribution in the
data set – The clustering resolution required by the user
• Many methods exist
– Set k = \sqrt{n/2}; each cluster has \sqrt{2n} points on average
– Plot the sum of within-cluster variances with respect to k; find the first (or the most significant) turning point
Jian Pei: Big Data Analytics -- Clustering 129
A Cross-Validation Method • Divide the data set D into m parts • Use m – 1 parts to find a clustering • Use the remaining part as the test set to test
the quality of the clustering – For each point in the test set, find the closest
centroid or cluster center – Use the squared distances between all points in the
test set and the corresponding centroids to measure how well the clustering model fits the test set
• Repeat m times for each value of k, use the average as the quality measure
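The m-fold procedure can be sketched as below. The k-means routine (with a farthest-point initialization) is a stand-in for whatever clustering method is being evaluated; all names are our own:

```python
import numpy as np

# Sketch of the cross-validation quality measure for a given k (our code).
def kmeans_fit(X, k, iters=20):
    centers = [X[0]]                           # farthest-point init (our choice)
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d2)])
    centers = np.array(centers)
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(axis=2), axis=1)
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return centers

def cv_quality(X, k, m=5):
    folds = np.array_split(np.arange(len(X)), m)
    total = 0.0
    for test in folds:
        train = np.setdiff1d(np.arange(len(X)), test)
        centers = kmeans_fit(X[train], k)      # cluster on m - 1 parts
        # squared distance from each test point to its closest centroid
        total += ((X[test][:, None] - centers[None]) ** 2).sum(axis=2).min(axis=1).sum()
    return total / m

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.05, (60, 2)), rng.normal(3, 0.05, (60, 2))])
X = X[rng.permutation(len(X))]
print(cv_quality(X, 1), cv_quality(X, 2))
```

On this two-blob toy data the average test error drops sharply from k = 1 to k = 2, which is the turning-point signal the slide describes.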
Jian Pei: Big Data Analytics -- Clustering 130
Measuring Clustering Quality
• Ground truth: the ideal clustering determined by human experts
• Two situations – There is a known ground truth – the extrinsic
(supervised) methods, comparing the clustering against the ground truth
– The ground truth is unavailable – the intrinsic (unsupervised) methods, measuring how well the clusters are separated
Jian Pei: Big Data Analytics -- Clustering 131
Quality in Extrinsic Methods • Cluster homogeneity: the more pure the
clusters in a clustering are, the better the clustering
• Cluster completeness: objects in the same cluster in the ground truth should be clustered together
• Rag bag: putting a heterogeneous object into a pure cluster is worse than putting it into a rag bag
• Small cluster preservation: splitting a small cluster in the ground truth into pieces is worse than splitting a bigger one
Jian Pei: Big Data Analytics -- Clustering 132
Bcubed Precision and Recall
• D = {o1, …, on} – L(oi) is the cluster of oi given by the ground truth
• C is a clustering on D – C(oi) is the cluster-id of oi in C
• For two objects oi and oj, the correctness is 1 if L(oi) = L(oj) ⇔ C(oi) = C(oj), and 0 otherwise
Jian Pei: Big Data Analytics -- Clustering 133
Bcubed Precision and Recall
• Precision
• Recall
Jian Pei: Big Data Analytics -- Clustering 134
one, denoted by o, belong to the same category according to ground truth. Consider a clustering C2 identical to C1 except that o is assigned to a cluster C′ ≠ C in C2 such that C′ contains objects from various categories according to ground truth, and thus is noisy. In other words, C′ in C2 is a rag bag. Then, a clustering quality measure Q respecting the rag bag criterion should give a higher score to C2, that is, Q(C2, Cg) > Q(C1, Cg).
• Small cluster preservation. If a small category is split into small pieces in a clustering, those small pieces may likely become noise and thus the small category cannot be discovered from the clustering. The small cluster preservation criterion states that splitting a small category into pieces is more harmful than splitting a large category into pieces. Consider an extreme case. Let D be a data set of n + 2 objects such that, according to the ground truth, n objects, denoted by o1, . . . , on, belong to one category and the other 2 objects, denoted by on+1, on+2, belong to another category. Suppose clustering C1 has three clusters, C1 = {o1, . . . , on}, C2 = {on+1}, and C3 = {on+2}. Let clustering C2 have three clusters, too, namely C1 = {o1, . . . , on−1}, C2 = {on}, and C3 = {on+1, on+2}. In other words, C1 splits the small category and C2 splits the big category. A clustering quality measure Q preserving small clusters should give a higher score to C2, that is, Q(C2, Cg) > Q(C1, Cg).
Many clustering quality measures satisfy some of the above four criteria. Here, we introduce the BCubed precision and recall metrics, which satisfy all of the above criteria.
BCubed evaluates the precision and recall for every object in a clustering on a given data set according to the ground truth. The precision of an object indicates how many other objects in the same cluster belong to the same category as the object. The recall of an object reflects how many objects of the same category are assigned to the same cluster.
Formally, let D = {o1, . . . , on} be a set of objects, and C be a clustering on D. Let L(oi) (1 ≤ i ≤ n) be the category of oi given by ground truth, and C(oi) be the cluster ID of oi in C. Then, for two objects, oi and oj (1 ≤ i, j ≤ n, i ≠ j), the correctness of the relation between oi and oj in clustering C is given by
Correctness(o_i, o_j) = \begin{cases} 1 & \text{if } L(o_i) = L(o_j) \Leftrightarrow C(o_i) = C(o_j) \\ 0 & \text{otherwise} \end{cases} \qquad (10.28)
BCubed precision is defined as

Precision\ BCubed = \frac{\sum_{i=1}^{n} \frac{\sum_{o_j : i \ne j,\, C(o_i) = C(o_j)} Correctness(o_i, o_j)}{\| \{ o_j \mid i \ne j,\, C(o_i) = C(o_j) \} \|}}{n}. \qquad (10.29)
BCubed recall is defined as

Recall\ BCubed = \frac{\sum_{i=1}^{n} \frac{\sum_{o_j : i \ne j,\, L(o_i) = L(o_j)} Correctness(o_i, o_j)}{\| \{ o_j \mid i \ne j,\, L(o_i) = L(o_j) \} \|}}{n}. \qquad (10.30)
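The two formulas transcribe almost literally into Python. One caveat: because the textbook sums exclude i = j, the denominator is zero for a singleton cluster or category; the max(1, …) guard below is our own addition for that unspecified case.

```python
# Sketch of BCubed precision and recall (our function names; the
# max(1, ...) guard for singletons is our addition, not the textbook's).
def bcubed(L, C):
    """L[i]: ground-truth category of o_i; C[i]: cluster ID of o_i."""
    n = len(L)

    def correctness(i, j):
        # 1 iff "same category" and "same cluster" agree (the iff in 10.28)
        return 1.0 if (L[i] == L[j]) == (C[i] == C[j]) else 0.0

    def avg_over(i, same):
        peers = [j for j in range(n) if j != i and same(i, j)]
        return sum(correctness(i, j) for j in peers) / max(1, len(peers))

    precision = sum(avg_over(i, lambda i, j: C[i] == C[j]) for i in range(n)) / n
    recall = sum(avg_over(i, lambda i, j: L[i] == L[j]) for i in range(n)) / n
    return precision, recall

L = [1, 1, 1, 2, 2]   # ground-truth categories
C = [1, 1, 2, 2, 2]   # clustering: o_3 is misplaced
print(bcubed(L, C))   # (0.6, 0.6)
```

The misplaced object o_3 drags down both its own score and the scores of the objects it is grouped with, which is how BCubed penalizes mixed clusters.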
Intrinsic Methods
When the ground truth of a data set is not available, we have to use an intrinsic method to assess the clustering quality. In general, intrinsic methods evaluate a clustering by examining how well the clusters are separated and how compact the clusters are. Many intrinsic methods take advantage of a similarity metric between objects in the data set.
The silhouette coefficient is such a measure. For a data set D of n objects, suppose D is partitioned into k clusters, C1, . . . , Ck. For each object o ∈ D, we calculate a(o) as the average distance between o and all other objects in the cluster to which o belongs. Similarly, b(o) is the minimum average distance from o to all clusters to which o does not belong. Formally, suppose o ∈ Ci (1 ≤ i ≤ k); then
a(o) = \frac{\sum_{o' \in C_i, o' \ne o} dist(o, o')}{|C_i| - 1} \qquad (10.31)

and

b(o) = \min_{C_j : 1 \le j \le k,\, j \ne i} \left\{ \frac{\sum_{o' \in C_j} dist(o, o')}{|C_j|} \right\}. \qquad (10.32)

The silhouette coefficient of o is then defined as

s(o) = \frac{b(o) - a(o)}{\max\{ a(o), b(o) \}}. \qquad (10.33)
The value of the silhouette coefficient is between −1 and 1. The value of a(o) reflects the compactness of the cluster to which o belongs. The smaller the value is, the more compact the cluster is. The value of b(o) captures the degree to which o is separated from other clusters. The larger b(o) is, the more separated o is from other clusters. Therefore, when the silhouette coefficient value of o approaches 1, the cluster containing o is compact and o is far away from other clusters, which is the preferable case. However, when the silhouette coefficient value is negative (that is, b(o) < a(o)), this means that, in expectation, o is closer to the objects in another cluster than to the objects in the same cluster as o. In many cases, this is a bad case and should be avoided.
To measure the fitness of a cluster within a clustering, we can compute the average silhouette coefficient value of all objects in the cluster. To measure the quality of a clustering, we can use the average silhouette coefficient value of all objects in the data set. The silhouette coefficient and other intrinsic measures
Silhouette Coefficient
• No ground truth is assumed • Suppose a data set D of n objects is partitioned
into k clusters, C1, …, Ck • For each object o,
– Calculate a(o), the average distance between o and every other object in the same cluster – compactness of a cluster, the smaller, the better
– Calculate b(o), the minimum average distance from o to all objects in a cluster that o does not belong to – degree of separation from other clusters, the larger, the better
Jian Pei: Big Data Analytics -- Clustering 135
Silhouette Coefficient
• Then
• Use the average silhouette coefficient of all objects as the overall measure
Jian Pei: Big Data Analytics -- Clustering 136
a(o) = \frac{\sum_{o' \in C_i, o' \ne o} dist(o, o')}{|C_i| - 1} \qquad b(o) = \min_{C_j : o \notin C_j} \left\{ \frac{\sum_{o' \in C_j} dist(o, o')}{|C_j|} \right\}

s(o) = \frac{b(o) - a(o)}{\max\{ a(o), b(o) \}}
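The definitions of a(o), b(o), and s(o) can be sketched directly in numpy (our own naming; we assume every cluster has at least two objects, so |C_i| − 1 > 0):

```python
import numpy as np

# Silhouette-coefficient sketch following the slide's formulas (our code).
def silhouette(X, labels):
    s = []
    for i, o in enumerate(X):
        d = np.linalg.norm(X - o, axis=1)
        same = (labels == labels[i])
        # a(o): average distance to the other objects in o's own cluster
        a = d[same & (np.arange(len(X)) != i)].mean()
        # b(o): minimum average distance to the objects of any other cluster
        b = min(d[labels == c].mean() for c in set(labels) if c != labels[i])
        s.append((b - a) / max(a, b))
    return np.array(s)

# Two compact, well-separated pairs of points.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels = np.array([0, 0, 1, 1])
print(silhouette(X, labels).mean())
```

With compact, far-apart clusters the average silhouette is close to 1; moving points between the two groups pushes it toward 0 and below.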
Multi-Clustering
• A data set may be clustered in different ways – In different subspaces, that is, using different
attributes – Using different similarity measures – Using different clustering methods
• Some different clusterings may capture different meanings of categorization – Orthogonal clusterings
• Putting users in the loop Jian Pei: Big Data Analytics -- Clustering 137