Clustering
COSC 526 Class 12
Arvind Ramanathan
Computational Science & Engineering Division
Oak Ridge National Laboratory, Oak Ridge
Ph: 865-576-7266
E-mail: [email protected]
Slides inspired by: Andrew Moore (CMU), Andrew Ng (Stanford), Tan, Steinbach and Kumar
Assignment 1: Your first hand at random walks
• Write up is here…
• Pair up and do the assignment
– It helps to work in small teams
– Maximize your productivity
• Most of the assignment and its notes are in the handouts (class web-page)
Clustering: Basics…
Clustering
• Finding groups of items (or objects) in a dataset such that the items within a group are related to one another and different from the items in other groups
Inter-cluster distances are maximized
Intra-cluster distances are minimized
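These two criteria can be checked numerically. A minimal sketch with two hypothetical, well-separated groups of 2-D points (the coordinates are made up for illustration):

```python
from itertools import combinations, product

def dist(a, b):
    """Euclidean distance between 2-D points."""
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

# Two illustrative, clearly separated groups of points.
cluster_a = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
cluster_b = [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

# Average distance between points within the same cluster (to be minimized).
intra = [dist(p, q) for c in (cluster_a, cluster_b) for p, q in combinations(c, 2)]
avg_intra = sum(intra) / len(intra)

# Average distance between points in different clusters (to be maximized).
inter = [dist(p, q) for p, q in product(cluster_a, cluster_b)]
avg_inter = sum(inter) / len(inter)
# avg_intra ≈ 1.14, avg_inter ≈ 14.2: a good clustering keeps intra << inter
```
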
Applications
• Grouping regions together based on precipitation
• Grouping genes together based on expression patterns in cells
• Finding ensembles of folded/unfolded protein structures
What is not clustering?
• Supervised classification
– Uses class label information
• Simple segmentation
– Dividing students into different registration groups (alphabetically, by major, etc.)
• Results of a query
– Grouping is the result of an external specification
• Graph partitioning
– Related, but the areas are not identical…
Take Home Message:
• Clustering is essentially driven by the data at hand!!
• The meaning or interpretation of the clusters should also be driven by the data!!
Constitution of a cluster can be ambiguous
• How to decide between 8 clusters and 2 clusters?
Types of Clustering
• Partitional Clustering
– A division of data into non-overlapping subsets (clusters) such that each data point is in exactly one subset
• Hierarchical Clustering
– A set of nested clusters organized as a hierarchical tree
(Figure: a partitional clustering of points p1–p6 and the corresponding dendrogram)
Other types of distinctions…
• Exclusive vs. Non-exclusive:
– In non-exclusive clustering, points may belong to multiple clusters
• Fuzzy vs. Non-fuzzy:
– In fuzzy clustering, a point belongs to every cluster with a weight between 0 and 1
– Similar to probabilistic clustering
• Partial vs. Complete:
– We may want to cluster only some of the data
• Heterogeneous vs. Homogeneous:
– Clusters of widely different sizes, …
Types of Clusters
• Well-separated clusters
• Center-based clusters
• Contiguous clusters
• Density based clusters
• Property/conceptual
• Described by an objective function
– Well-separated: a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster
Types of Clusters
• Center-based: a cluster is a set of objects such that an object in a cluster is closer to the center of its cluster (the centroid) than to the center of any other cluster…
Types of Clusters
• Contiguous clusters: defined by nearest-neighbor (transitive) connectivity…
Types of Clusters
• Density-based: a cluster is a dense region of points separated by low-density regions; used when clusters are irregular and when noise/outliers are present
Types of Clusters
• Property/conceptual: find clusters that share a common property or representation, e.g., taste, smell, …
Types of Clusters
• Described by an objective function: find clusters that minimize or maximize an objective function
– Enumerate all possible ways of dividing the points into clusters and evaluate the goodness of each candidate clustering with the objective function
– This is an NP-hard problem
– Global vs. local objectives: hierarchical clustering algorithms typically have local objectives, while partitional algorithms typically have global objectives
More on objective functions… (1)
• Objective functions tend to map the clustering problem to a different domain and solve a related problem:
– E.g., defining a proximity matrix as a weighted graph
– Clustering is then equivalent to breaking the graph into connected components
– Minimize the edge weight between clusters and maximize the edge weight within clusters
More on objective functions… (2)
• Best clustering usually minimizes/maximizes an objective function
• Mixture models assume that the data is a mixture of a number of parametric statistical distributions (e.g., Gaussians)
Characteristics of input data
• Type of proximity or density measure
– a derived measure, central to clustering
• Sparseness
– Dictates the type of similarity
– Adds to efficiency
• Attribute type
– Dictates the type of similarity
• Type of data
– Dictates the type of similarity
• Dimensionality
• Noise and outliers
• Type of distribution
Clustering Algorithms:
• K-means Clustering
• Hierarchical Clustering
• Density-based Clustering
K-means Clustering
• Partitional clustering:
– Each cluster is associated with a centroid
– Each point is assigned to the cluster with the closest centroid
– We need to specify the total number of clusters, k, as one of the inputs
• Simple algorithm: the K-means algorithm
1: Select K points as the initial centroids
2: repeat
3:   Form K clusters by assigning each point to the closest centroid
4:   Recompute the centroid of each cluster
5: until the centroids do not change
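The five steps above can be sketched in plain Python (Euclidean distance on 2-D points; the dataset, seed, and k below are illustrative):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's algorithm on 2-D points; a sketch, not production code."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # step 1: pick K initial centroids
    for _ in range(iters):
        # Step 3: assign every point to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: (p[0] - centroids[i][0]) ** 2
                                          + (p[1] - centroids[i][1]) ** 2)
            clusters[j].append(p)
        # Step 4: recompute each centroid as the mean of its cluster
        # (an empty cluster keeps its old centroid).
        new = [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
               if c else centroids[i] for i, c in enumerate(clusters)]
        if new == centroids:                   # step 5: stop when centroids settle
            break
        centroids = new
    return centroids, clusters

# Two well-separated blobs; k-means should recover one centroid per blob.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, _ = kmeans(pts, k=2)
```
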
K-means Clustering
• Initial centroids are chosen randomly:
– the resulting clusters can vary depending on how you started
• The centroid is the mean of the points in the cluster
• “Closeness” is usually measured with Euclidean distance
• K-means will typically converge quickly
– Points stop changing assignments
– Another stopping criterion: only a few points change clusters
• Time complexity: O(nKId)
– n: number of points; K: number of clusters
– I: number of iterations; d: number of attributes
K-means example
How to initialize (seed) K-means?
• If there are K “real” clusters, the chance of randomly selecting exactly one centroid from each cluster is small
– The chance shrinks rapidly as K grows
– If the clusters all have the same size (say m), then P = K! m^K / (Km)^K = K!/K^K (treating the K picks as independent)
– For K = 10, P ≈ 0.00036 (really small!!)
• The choice of initial centroids can have a deep impact on how the clusters are determined…
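The slide's number can be reproduced: with K equal-size clusters and independently drawn centroids, K! of the K^K ways of distributing the picks over clusters give exactly one centroid per cluster.

```python
from math import factorial

def prob_one_per_cluster(k):
    """P(each of K random centroids lands in a different one of K equal-size clusters)."""
    return factorial(k) / k ** k

p = prob_one_per_cluster(10)  # 10!/10^10 ≈ 0.00036, matching the slide
```
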
Choosing K
What are the solutions for this problem?
• Multiple runs!!
– Usually helps
• Sample the points so that you can guesstimate the number of clusters
– Depends on how we have sampled
– We may have sampled outliers in the data
• Select more than k centroids initially and then pick k among them
– Choose the k most widely separated centroids
How to evaluate k-means clusters
• The most common measure is the sum of squared errors (SSE):
– SSE = Σ_i Σ_{x ∈ C_i} dist(m_i, x)², where m_i is the centroid of cluster C_i
• Given two clustering outputs from k-means, we can choose the one with the lower error
• Only compare clusterings with the same K
• Important side note: K-means is a heuristic for minimizing SSE
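The SSE measure is a one-liner to compute; a sketch on a made-up 1-D example where the centroids are the cluster means:

```python
def sse(clusters, centroids):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    total = 0.0
    for c, m in zip(clusters, centroids):
        for p in c:
            total += sum((pi - mi) ** 2 for pi, mi in zip(p, m))
    return total

# Hypothetical 1-D example: two clusters, each centroid at its cluster mean.
clusters = [[(0,), (2,)], [(10,), (12,)]]
centroids = [(1,), (11,)]
# sse(clusters, centroids) -> 1 + 1 + 1 + 1 = 4.0
```
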
Pre-processing and Post-processing
• Pre-processing:
– Normalize the data (e.g., scale the data to unit standard deviation)
– Eliminate outliers
• Post-processing:
– Eliminate small clusters that may represent outliers
– Split clusters that have a high SSE
– Merge clusters that have a low SSE
Limitations of using K-means
• K-means can have problems when the data has:
– different sizes
– different densities
– non-globular shapes
– outliers!
How does this scale… (for MapReduce)
In the map step:
• Read the cluster centers into memory from a SequenceFile
• Iterate over each cluster center for each input key/value pair
• Measure the distances and save the nearest center, i.e., the one with the lowest distance to the vector
• Write the cluster center with its vector to the filesystem
In the reduce step (we get the associated vectors for each center):
• Iterate over the value vectors and calculate the average vector (sum the vectors and divide each component by the number of vectors received)
• This is the new center; save it into a SequenceFile
• Check the convergence between the cluster center stored in the key object and the new center
• If they are not equal, increment an update counter
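The map and reduce steps above can be simulated in-process; `mapper`, `reducer`, and the data below are illustrative stand-ins, not Hadoop API calls:

```python
from collections import defaultdict

def mapper(centers, point):
    """Map step: emit (index of the nearest center, point)."""
    j = min(range(len(centers)),
            key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])))
    return j, point

def reducer(points):
    """Reduce step: average the vectors assigned to one center."""
    n = len(points)
    return tuple(sum(xs) / n for xs in zip(*points))

def kmeans_round(centers, data):
    """One MapReduce-style round: group mapper output by key, reduce to new centers."""
    groups = defaultdict(list)
    for point in data:
        j, p = mapper(centers, point)
        groups[j].append(p)
    return [reducer(groups[j]) if j in groups else centers[j]
            for j in range(len(centers))]

data = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
new_centers = kmeans_round([(0.0, 0.0), (10.0, 10.0)], data)
# new_centers -> [(0.0, 1.0), (10.0, 11.0)]
```
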
Making k-means streaming
• Two broad approaches:
– Solving k-means as the data arrives:
• Guha, Mishra, Motwani, O’Callaghan (2001)
• Charikar, O’Callaghan, and Panigrahy (2003)
• Braverman, Meyerson, Ostrovsky, Roytman, Shindler, and Tagiku (2011)
– Solving k-means using weighted coresets:
• Select a small sample of weighted points
• The weights are chosen such that the k-means solution on the subset is similar to that on the original dataset
Fast Streaming K-means
Shindler, Wong, Meyerson, NIPS (2011)
Shindler, NIPS presentation (2011)
Fast Streaming K-means
• Intuition on why this works: the probability that point x belongs to some cluster is proportional to its distance from the “mean”
– referred to as a “facility” here
• Costliest step: measuring δ
– Use approximate nearest-neighbor algorithms
• Space complexity: Ω(k log n)
– You are only storing neighborhood information
– Uses hashing and metric embedding (not discussed)
• Time complexity: o(nk)
Shindler, Wong, Meyerson, NIPS (2011)
Hierarchical Clustering
Hierarchical Clustering
• Produce a set of nested clusters organized as a hierarchical tree
• Can be conveniently visualized as a dendrogram:
– a tree-like representation that records the sequences of merges and splits
Types of Hierarchical Clustering
• Agglomerative Clustering:
– Start with the individual points as clusters (leaves)
– At each step, merge the closest pair of clusters until one cluster (or k clusters) remain
• Divisive Clustering:
– Start with one, all-inclusive cluster
– At each step, split a cluster until each cluster contains a single point (or there are k clusters)
• Traditional hierarchical clustering:
– uses a similarity or distance matrix
– merges or splits one cluster at a time
Agglomerative Clustering
• One of the more popular algorithms
• The basic algorithm is straightforward:
Agglomerative Clustering Algorithm
1: Compute the distance matrix
2: Let each data point be a cluster
3: repeat
4:   Merge the two closest clusters
5:   Update the distance matrix
6: until only a single cluster remains
The key operation is the computation of the proximity of two clusters → different approaches to defining the distance between clusters distinguish the different algorithms
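A naive rendering of the steps above (O(N³); the linkage choice is left as a parameter, anticipating the next slides, and the function names and 1-D dataset are illustrative):

```python
def agglomerate(points, linkage=min):
    """Naive agglomerative clustering: merge the closest pair until one cluster
    remains, recording every merge. `linkage` combines cross-cluster point
    distances (min = single link, max = complete link)."""
    def d(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    clusters = [[p] for p in points]   # step 2: every point starts as a cluster
    merges = []
    while len(clusters) > 1:
        # steps 4-5: find the closest pair of clusters under the chosen linkage
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dij = linkage(d(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or dij < best[0]:
                    best = (dij, i, j)
        _, i, j = best
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]   # merge, then drop cluster j
        del clusters[j]
    return merges

merges = agglomerate([(0.0,), (1.0,), (5.0,)])
# first merge joins the two closest points, then the remaining two clusters
```
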
Starting Situation
• Start with clusters of individual data points and a distance matrix
(Figure: distance matrix with one row and column per point p1, p2, p3, …)
Next step: Group points…
• After merging a few of these data points, we have clusters C1–C5 and a cluster-level distance matrix
(Figure: distance matrix over clusters C1–C5)
Next step: Merge clusters…
• After merging two of the clusters, the distance matrix must be updated
(Figure: cluster-level distance matrix after merging two of C1–C5)
How to merge and update the distance matrix?
• Measures of similarity:
– Min
– Max
– Group average
– Distance between centroids
– Other methods driven by an objective function
• How do these look on the clustering process?
Defining inter-cluster similarity
• Min (single link)
• Max (complete link)
• Group Average (average link)
• Distance between centroids
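The four definitions, written out on a made-up pair of 1-D clusters (function names are illustrative):

```python
from itertools import product

def d(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def single_link(c1, c2):
    """Min: distance between the closest pair across the two clusters."""
    return min(d(p, q) for p, q in product(c1, c2))

def complete_link(c1, c2):
    """Max: distance between the farthest pair across the two clusters."""
    return max(d(p, q) for p, q in product(c1, c2))

def average_link(c1, c2):
    """Group average: mean of all cross-cluster pairwise distances."""
    pairs = [d(p, q) for p, q in product(c1, c2)]
    return sum(pairs) / len(pairs)

def centroid_link(c1, c2):
    """Distance between the two cluster centroids."""
    m1 = tuple(sum(xs) / len(c1) for xs in zip(*c1))
    m2 = tuple(sum(xs) / len(c2) for xs in zip(*c2))
    return d(m1, m2)

c1, c2 = [(0.0,), (2.0,)], [(5.0,), (9.0,)]
# single = 3, complete = 9, average = 6, centroid = 6
```
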
Single Link
• Strength: can handle non-spherical/non-convex clusters
• Limitation: sensitive to noise and outliers (chaining)
Complete Link Clustering
• Better suited for datasets with noise
• Tends to form smaller clusters
• Biased toward more globular clusters
Average link / Unweighted Pair Group Method using Arithmetic Averages (UPGMA)
• Compromise between single and complete linkage
• Works generally well in practice
How do we say when two clusters are similar?
• Ward’s method
– The similarity of two clusters is based on the increase in SSE when the two clusters are merged
• Advantages:
– Less susceptible to errors/outliers in the data
– Hierarchical analogue of K-means
– Can be used to initialize K-means
• Disadvantage:
– Biased toward more globular clusters
Space and Time Complexity
• Space complexity: O(N²)
– N is the number of data points
– N² entries in the distance matrix
• Time complexity: O(N³)
– In many cases: N steps for tree construction, and at each step a distance matrix with O(N²) entries must be updated
– The complexity can be reduced to O(N² log N) in some cases
Let’s talk about Scaling!
• A specific type of hierarchical clustering algorithm:
– UPGMA (average linkage)
– Most widely used in the bioinformatics literature
• However, it is impractical for scaling to an entire genome!
– Needs the whole distance/dissimilarity matrix in memory (N² entries)!
– How can we exploit sparsity?
Problem of interest…
• We are given a large number of sequences and a way to determine how similar two or more sequences are
• We have a pairwise dissimilarity matrix
• Build a hierarchical clustering routine for understanding how proteins (or other bio-molecules) have evolved
The problem with UPGMA: the distance matrix computation is expensive
• We are computing the arithmetic mean between the sequences
• This is not defined when we have sparse inputs
• The triangle inequality is not satisfied, based on how we have defined the way clusters are built…
Strategy to scale this up for Big Data
• Two aspects to handle:
– Missing edges
– Sparsity in the distance matrix
• Use a detection threshold ψ for missing edge data: we are completing “missing” values in D using ψ!
Sparse UPGMA: Speeding up
• Space: O(E) (note E ≪ N²)
• Time: O(E log V)
• Still expensive: E can be arbitrarily large!
• How do we deal with this?
Streaming for Sparsity: Multi-round Memory-Constrained UPGMA (MC-UPGMA)
• Two components needed:
– Memory-constrained clustering unit:
• Holds only the subset of the edges E that needs to be processed in the current round
– Memory-constrained merging unit:
• Ensures we get only valid edges
• Space: only O(N), depending on how many sequences we have to load at any given time…
• Time: O(E log V)
Limitations of Hierarchical Clustering
• Greedy: once a merge decision is made, it usually cannot be undone
– Or it can be expensive to undo
– Methods exist to alter this
• No global objective function is being minimized or maximized
• Different schemes of hierarchical clustering have their own limitations:
– Sensitivity to noise and outliers
– Difficulty in handling clusters of different shapes
– Chaining, or breaking of clusters…
Density-based Spatial Clustering of Applications with Noise (DBSCAN)
Preliminaries
• Density is defined as the number of points within a radius ε
– In this example, density = 9
• A core point has more than a specified number of points (minPts) within ε
– Core points are in the interior of a cluster
• A border point has fewer than minPts within ε but is in the neighborhood of a core point
• A noise point is any point that is neither a core point nor a border point
(Figure: example with minPts = 4 showing a core point, a border point, and a noise point)
DBSCAN Algorithm
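The algorithm listing itself lives in the slide figure; below is a minimal pure-Python rendering of standard DBSCAN, not the exact slide listing (label −1 marks noise; the dataset, ε, and minPts are illustrative):

```python
def dbscan(points, eps, min_pts):
    """Naive O(N^2) DBSCAN: grow a cluster outward from each unvisited core point."""
    def neighbors(i):
        # Indices within radius eps of point i (the point itself counts).
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]

    labels = [None] * len(points)        # None = unvisited, -1 = noise
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:         # not a core point: tentatively noise
            labels[i] = -1
            continue
        cluster += 1                     # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:          # noise reachable from a core point -> border
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            js = neighbors(j)
            if len(js) >= min_pts:       # j is also a core point: keep expanding
                queue.extend(js)
    return labels

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5),
       (10, 10), (10, 11), (11, 10), (11, 11)]
labels = dbscan(pts, eps=1.5, min_pts=3)  # two clusters plus one noise point
```
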
Illustration of DBSCAN: Assignment of Core, Border and Noise Points
DBSCAN: Finding Clusters
Advantages and Limitations
• Resistant to noise
• Can handle clusters of different sizes and shapes
• Eps and MinPts are dependent on each other
– Can be difficult to specify
• Different-density clusters within the same class can be difficult to find
Advantages and Limitations
• Limitation: data with widely varying densities
• Limitation: high-dimensional data, where density becomes harder to define
61
How to determine Eps and MinPts
• For points within a cluster, the kth nearest neighbors are at roughly the same distance
• Noise points are generally farther away from their kth nearest neighbor
• Sort and plot the distance of every point to its kth nearest neighbor; the knee of the curve suggests a value for Eps (with MinPts = k)
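The k-distance heuristic above can be sketched as follows (assuming NumPy; function and variable names are my own, and the knee itself is typically picked by eye from the plot):

```python
import numpy as np

def k_distance(X, k):
    """Distance from each point to its k-th nearest neighbor, sorted descending.
    A knee in this curve is a common heuristic for Eps (with MinPts = k)."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    dist.sort(axis=1)            # each row ascending; column 0 is the self-distance 0
    return np.sort(dist[:, k])[::-1]

# Same toy data as before: two tight blobs plus one isolated point
X = np.array([[0.0, 0], [0, 0.5], [0.5, 0], [0.5, 0.5],
              [10, 10], [10, 10.5], [10.5, 10], [10.5, 10.5],
              [5, 5]])
kd = k_distance(X, k=3)
```

Plotting `kd` shows a sharp drop: the lone noise point has a large 3rd-nearest-neighbor distance, while all blob points sit well below it, suggesting an Eps just above the flat part of the curve.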
62
How do we validate clusters?
63
Cluster validity
• For supervised learning:
– we had a class label,
– which meant we could measure how good our training and testing errors were
– Metrics: Accuracy, Precision, Recall
• For clustering:
– How do we measure the “goodness” of the resulting clusters?
64
Clustering random data (overfitting)
If you ask a clustering algorithm to find clusters, it will find some
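This can be demonstrated with a toy k-means run on pure noise (a hypothetical minimal implementation, not an example from the slides): even uniform random data gets partitioned into k “clusters”.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal Lloyd's-algorithm k-means sketch (no convergence check)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center
        labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=-1), axis=1)
        # Recompute centers; keep the old center if a cluster empties out
        centers = np.array([X[labels == c].mean(axis=0) if np.any(labels == c)
                            else centers[c] for c in range(k)])
    return labels

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))   # uniform noise: no real cluster structure
labels = kmeans(X, k=3)          # ...yet every point still gets a cluster label
```

The labels look like a legitimate partition, which is exactly why a separate validity check is needed.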
65
Different aspects of validating clusters
• Determining the clustering tendency of a set of data, i.e., whether non-random structure actually exists in the data (e.g., to avoid overfitting)
• External validation: comparing the results of a cluster analysis to externally known class labels (ground truth)
• Internal validation: evaluating how well the results of a cluster analysis fit the data without reference to external information
• Comparing clusterings to determine which is better
• Determining the “correct” number of clusters
66
Measures of cluster validity
• External index: measures the extent to which cluster labels match externally supplied class labels
– Entropy, Purity, Rand Index
• Internal index: measures the goodness of a clustering structure without respect to external information
– Sum of Squared Error (SSE), Silhouette coefficient
• Relative index: compares two different clusterings or clusters
– Often an external or internal index is used for this purpose, e.g., SSE or entropy
67
Measuring Cluster Validation with Correlation
• Proximity matrix vs. incidence matrix:
– The incidence matrix K has Kij = 1 if points i and j belong to the same cluster, 0 otherwise
• Compute the correlation between the two matrices:
– Only n(n−1)/2 entries need to be computed (the matrices are symmetric)
– High correlation indicates that points in the same cluster are close to each other
• Not well suited for density-based clusterings
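A minimal sketch of this correlation check (assuming NumPy; since the slides' proximity matrix is a similarity, distance is negated here so that high values mean "similar"):

```python
import numpy as np

def cluster_correlation(X, labels):
    """Correlation between the incidence matrix (1 if same cluster, 0 otherwise)
    and negated pairwise distance, over the n(n-1)/2 upper-triangular entries.
    High correlation -> points in the same cluster are close together."""
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    incidence = (labels[:, None] == labels[None, :]).astype(float)
    iu = np.triu_indices(n, k=1)              # only the n(n-1)/2 unique pairs
    return np.corrcoef(-dist[iu], incidence[iu])[0, 1]

# Two tight, well-separated clusters -> correlation should be near 1
X = np.array([[0.0, 0], [0, 0.5], [0.5, 0], [0.5, 0.5],
              [10, 10], [10, 10.5], [10.5, 10], [10.5, 10.5]])
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
c = cluster_correlation(X, labels)
```

For a density-based clustering with irregular shapes, this value can be low even for a good clustering, which is the limitation noted above.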
68
Another approach: use similarity matrix for cluster validation
69
Internal Measures: SSE
• SSE is also a good measure of how good a clustering is
– Lower SSE ⇒ tighter, better clusters
• Plotting SSE against the number of clusters can be used to estimate the right number of clusters (the “elbow” method)
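SSE is straightforward to compute from a labeling (a NumPy sketch; the function name is my own):

```python
import numpy as np

def sse(X, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for c in np.unique(labels):
        pts = X[labels == c]
        total += ((pts - pts.mean(axis=0)) ** 2).sum()
    return total

X = np.array([[0.0, 0], [0, 2], [10, 0], [10, 2]])
labels = np.array([0, 0, 1, 1])
e = sse(X, labels)   # centroids (0,1) and (10,1): four unit deviations -> SSE = 4
```

To estimate the number of clusters, one would compute this for several values of k and look for the knee where adding clusters stops reducing SSE substantially.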
70
More on Clustering a little later…
• We will discuss other forms of clustering in the following classes
• Next class:
– please bring your brief write-up on the two papers