Other Clustering Algorithms
MST: Divisive Hierarchical Clustering
- Build the MST (Minimum Spanning Tree):
  - Start with a tree that consists of any single point.
  - In successive steps, look for the closest pair of points (p, q) such that one point (p) is in the current tree but the other (q) is not.
  - Add q to the tree and put an edge between p and q.
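For concreteness, a minimal sketch of this construction (Prim's algorithm), assuming a precomputed n x n pairwise distance matrix `dist` as input:

```python
import numpy as np

def build_mst(dist):
    """Return MST edges (p, q, d) for an n x n distance matrix."""
    n = dist.shape[0]
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True                      # start with an arbitrary point
    edges = []
    for _ in range(n - 1):
        # closest pair (p, q) with p in the tree and q outside it
        masked = np.where(np.outer(in_tree, ~in_tree), dist, np.inf)
        p, q = np.unravel_index(np.argmin(masked), masked.shape)
        edges.append((p, q, dist[p, q]))
        in_tree[q] = True                  # add q and the edge (p, q)
    return edges
```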
MST: Divisive Hierarchical Clustering
- Use the MST to construct a hierarchy of clusters: each divisive step breaks the largest remaining edge of the MST, splitting one cluster into two.
DBSCAN
- DBSCAN is a density-based algorithm.
  - Density = number of points within a specified radius (Eps).
- A point is a core point if it has more than a specified number of points (MinPts) within Eps.
  - These are points that are in the interior of a cluster.
- A border point has fewer than MinPts points within Eps, but is in the neighborhood of a core point.
- A noise point is any point that is not a core point or a border point.
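A minimal sketch of these definitions, assuming a NumPy array `X` of points; `eps` and `min_pts` are the parameters above (conventions differ on whether the count is strict, so `>=` here is a choice):

```python
import numpy as np

def point_types(X, eps, min_pts):
    d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # pairwise distances
    counts = (d <= eps).sum(axis=1)          # neighborhood sizes (includes self)
    core = counts >= min_pts                 # core: at least MinPts within Eps
    border = ~core & (d[:, core] <= eps).any(axis=1)  # non-core near a core point
    noise = ~core & ~border                  # neither core nor border
    return core, border, noise
```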
DBSCAN: Core, Border, and Noise Points
DBSCAN Algorithm
- Eliminate noise points.
- Perform clustering on the remaining points.
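DBSCAN is widely available off the shelf; a usage sketch with scikit-learn, assuming a data array `X` (the parameter values echo the Eps = 10, MinPts = 4 example on the next slide):

```python
from sklearn.cluster import DBSCAN

labels = DBSCAN(eps=10, min_samples=4).fit_predict(X)
# label -1 marks noise points; all other labels are cluster ids
```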
DBSCAN: Core, Border and Noise Points
[Figure: original points and their point types (core, border, noise); Eps = 10, MinPts = 4.]
When DBSCAN Works Well
[Figure: original points and the resulting clusters.]
- Resistant to noise.
- Can handle clusters of different shapes and sizes.
When DBSCAN Does NOT Work Well
[Figure: original points and two failure cases, with (MinPts = 4, Eps = 9.75) and (MinPts = 4, Eps = 9.92).]
- Varying densities.
- High-dimensional data.
DBSCAN: Determining Eps and MinPts
- The idea is that for points in a cluster, their k-th nearest neighbors are at roughly the same distance.
- Noise points have their k-th nearest neighbor at a farther distance.
- So, plot the sorted distance of every point to its k-th nearest neighbor; a sharp bend ("knee") in the curve suggests a good value for Eps.
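A sketch of this diagnostic plot with scikit-learn, assuming a data array `X`; k = MinPts is the usual choice:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

k = 4                                             # k = MinPts is the usual choice
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1: each point is its own 0th neighbor
dist, _ = nn.kneighbors(X)
kth = np.sort(dist[:, -1])                        # each point's k-th NN distance
plt.plot(kth)                                     # read Eps off the knee of the curve
plt.ylabel(f"distance to {k}th nearest neighbor")
plt.show()
```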
Cluster Validity
- For supervised classification we have a variety of measures to evaluate how good our model is: accuracy, precision, recall.
- For cluster analysis, the analogous question is: how do we evaluate the "goodness" of the resulting clusters?
- But clusters are in the eye of the beholder!
- Then why do we want to evaluate them?
  - To avoid finding patterns in noise
  - To compare clustering algorithms
  - To compare two sets of clusters
  - To compare two clusters
Graph Theoretic Interpretation
- As we successively find the nearest pair of clusters to join, will this joining distance go up or down?
- Let d_n now be the distance between the clusters merged at step n (counting from the bottom).
- Let G(d_n) be a graph that has edges between all data points within distance d_n of each other. Then the clusters after step n are the connected components of G(d_n). Why?
- This explains the potentially long, straggly clusters.
- Assumptions?
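This connected-components view corresponds to single link (MIN). A minimal sketch with SciPy, assuming a pairwise distance matrix `dist`:

```python
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def clusters_at(dist, d_n):
    adj = csr_matrix(dist <= d_n)                  # edges within distance d_n
    n_comp, labels = connected_components(adj, directed=False)
    return labels                                  # component id for each point
```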
How to Define Inter-Cluster Similarity
- MIN
- MAX
- Group Average
- Distance Between Centroids
- Other methods driven by an objective function
  - Ward's method uses squared error
- Paraphrasing The President: "You're either part of my cluster or you're not."
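These choices map directly onto SciPy's hierarchical clustering; a usage sketch, assuming a data array `X`:

```python
from scipy.cluster.hierarchy import linkage

Z_min  = linkage(X, method="single")    # MIN (single link)
Z_max  = linkage(X, method="complete")  # MAX (complete link)
Z_avg  = linkage(X, method="average")   # group average
Z_cent = linkage(X, method="centroid")  # distance between centroids
Z_ward = linkage(X, method="ward")      # Ward's squared-error criterion
```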
Graph Theoretic Interpretation
- Assume each pair of points has a unique distance. Let d_n be the diameter of the cluster formed at step n.
- Let G(d_n) be a graph that has edges between all data points within distance d_n of one another. Then the clusters after step n are *cliques* of G(d_n). Why?
- How can we make use of these graph-theoretic interpretations?
Graph-Based Clustering
- Graph-based clustering uses the proximity graph:
  - Start with the proximity matrix.
  - Consider each point as a node in a graph.
  - Each edge between two nodes has a weight, which is the proximity between the two points.
  - Initially the proximity graph is fully connected.
  - MIN (single link) and MAX (complete link) can be viewed as starting with this graph.
- In the simplest case, clusters are connected components in the graph.
Graph-Based Clustering: Sparsification
- Clustering may work better after sparsification:
  - Sparsification techniques keep the connections to the most similar (nearest) neighbors of a point while breaking the connections to less similar points.
  - The nearest neighbors of a point tend to belong to the same class as the point itself.
  - This reduces the impact of noise and outliers and sharpens the distinction between clusters.
- Sparsification facilitates the use of graph partitioning algorithms (or algorithms based on graph partitioning), e.g., Chameleon and hypergraph-based clustering; see the sketch below.
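A minimal sketch of k-NN sparsification with scikit-learn, assuming a data array `X`; the value of k is arbitrary here:

```python
from sklearn.neighbors import kneighbors_graph

k = 10
# keep only each point's k strongest (nearest-neighbor) connections;
# all other edges of the fully connected proximity graph are dropped
sparse_graph = kneighbors_graph(X, n_neighbors=k, mode="distance")
```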
Sparsification in the Clustering Process
Chameleon: Clustering Using Dynamic Modeling
- Adapt to the characteristics of the data set to find the natural clusters.
- Use a dynamic model to measure the similarity between clusters:
  - The main properties are the relative closeness and relative interconnectivity of the clusters.
  - Two clusters are combined if the resulting cluster shares certain properties with the constituent clusters.
  - The merging scheme preserves self-similarity.
- One of the areas of application is spatial data.
Characteristics of Spatial Data Sets
- Clusters are defined as densely populated regions of the space.
- Clusters have arbitrary shapes, orientations, and non-uniform sizes.
- Differences in density across clusters and variation in density within clusters.
- Existence of special artifacts (streaks) and noise.
- The clustering algorithm must address the above characteristics and also require minimal supervision.
Chameleon: Steps
- Preprocessing step: represent the data by a graph.
  - Given a set of points, construct the k-nearest-neighbor (k-NN) graph to capture the relationship between a point and its k nearest neighbors.
  - The concept of neighborhood is captured dynamically (even if the region is sparse).
- Phase 1: use a multilevel graph partitioning algorithm on the graph to find a large number of clusters of well-connected vertices.
  - Each cluster should contain mostly points from one true cluster, i.e., be a sub-cluster of a real cluster.
Chameleon: Steps
- Phase 2: use hierarchical agglomerative clustering to merge sub-clusters.
  - Two clusters are combined if the resulting cluster shares certain properties with the constituent clusters.
  - Two key properties are used to model cluster similarity, trying to capture how similar two sub-graphs are:
    - Relative Interconnectivity: absolute interconnectivity of two clusters normalized by the internal connectivity of the clusters.
    - Relative Closeness: absolute closeness of two clusters normalized by the internal closeness of the clusters.
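For reference, these two measures can be written out explicitly; the following formulas come from the original Chameleon paper (Karypis, Han, and Kumar, 1999) rather than from anything shown on the slide. Here EC(C_i, C_j) is the set of edges connecting the two clusters, EC(C_i) is the min-cut bisector of C_i, and S-bar denotes average edge weight:

```latex
RI(C_i, C_j) = \frac{|EC(C_i, C_j)|}{\tfrac{1}{2}\left(|EC(C_i)| + |EC(C_j)|\right)}
\qquad
RC(C_i, C_j) = \frac{\bar{S}_{EC}(C_i, C_j)}
  {\frac{|C_i|}{|C_i| + |C_j|}\,\bar{S}_{EC}(C_i) + \frac{|C_j|}{|C_i| + |C_j|}\,\bar{S}_{EC}(C_j)}
```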
Experimental Results: CHAMELEON
Experimental Results: CHAMELEON
Experimental Results: CURE (10 clusters)
Experimental Results: CURE (15 clusters)
Experimental Results: CHAMELEON
Experimental Results: CURE (9 clusters)
Experimental Results: CURE (15 clusters)
Shared Near Neighbor Approach
- SNN graph: the weight of an edge is the number of shared neighbors between the vertices, given that the vertices are connected.
[Figure: two connected vertices i and j whose edge receives weight 4, the number of neighbors they share.]
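A minimal sketch of this edge weight, assuming `knn` is a hypothetical dict mapping each point to the set of its k nearest neighbors:

```python
def snn_weight(i, j, knn):
    # vertices must be connected in the k-NN graph to get an SNN edge
    if j not in knn[i] and i not in knn[j]:
        return 0
    return len(knn[i] & knn[j])   # number of shared neighbors
```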
Creating the SNN Graph
[Figure: a sparse graph, whose link weights are similarities between neighboring points, and the derived shared near neighbor graph, whose link weights are the numbers of shared nearest neighbors.]
ROCK (RObust Clustering using linKs)
- A clustering algorithm for data with categorical and Boolean attributes.
  - A pair of points is defined to be neighbors if their similarity is greater than some threshold.
- Uses a hierarchical clustering scheme to cluster the data:
  1. Obtain a sample of points from the data set.
  2. Compute the link value for each set of points, i.e., transform the original similarities (computed by the Jaccard coefficient) into similarities that reflect the number of shared neighbors between points.
  3. Perform an agglomerative hierarchical clustering on the data, using the number of shared neighbors as the similarity measure and maximizing the shared-neighbors objective function.
  4. Assign the remaining points to the clusters that have been found.
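A sketch of the link computation in step 2, assuming a binary attribute matrix `B` (points x attributes) and a hypothetical neighbor threshold `theta`; this is an illustration, not the paper's implementation:

```python
import numpy as np

def rock_links(B, theta):
    inter = B @ B.T                               # pairwise intersection sizes
    union = B.sum(1)[:, None] + B.sum(1)[None, :] - inter
    jaccard = inter / np.maximum(union, 1)        # Jaccard coefficient
    nbr = (jaccard >= theta).astype(int)          # neighbor indicator matrix
    return nbr @ nbr                              # links = shared-neighbor counts
```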
Jarvis-Patrick Clustering
- First, the k nearest neighbors of all points are found.
  - In graph terms this can be regarded as breaking all but the k strongest links from a point to the other points in the proximity graph.
- A pair of points is put in the same cluster if:
  - the two points share more than T neighbors, and
  - the two points are in each other's k-nearest-neighbor lists.
- For instance, we might choose a nearest-neighbor list of size 20 and put points in the same cluster if they share more than 10 near neighbors.
- Jarvis-Patrick clustering is too brittle, as the examples below show.
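Putting the rule together, a minimal sketch, assuming a data array `X`; the defaults for `k` and `T` match the example above (20 and 10):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import NearestNeighbors

def jarvis_patrick(X, k=20, T=10):
    n = len(X)
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    knn = [set(row[1:]) for row in idx]            # drop self at column 0
    adj = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in knn[i]:
            if i in knn[j] and len(knn[i] & knn[j]) > T:
                adj[i, j] = adj[j, i] = True       # mutual k-NN and > T shared
    return connected_components(csr_matrix(adj), directed=False)[1]
```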
When Jarvis-Patrick Works Reasonably Well
[Figure: original points and the Jarvis-Patrick clustering, requiring 6 shared neighbors out of 20.]
When Jarvis-Patrick Does NOT Work Well
[Figure: results with the smallest threshold, T, that does not merge clusters, and with a threshold of T - 1.]
SNN Clustering Algorithm
1. Compute the similarity matrix.
   - This corresponds to a similarity graph with data points for nodes and edges whose weights are the similarities between data points.
2. Sparsify the similarity matrix by keeping only the k most similar neighbors.
   - This corresponds to keeping only the k strongest links of the similarity graph.
3. Construct the shared nearest neighbor graph from the sparsified similarity matrix.
   - At this point, we could apply a similarity threshold and find the connected components to obtain the clusters (the Jarvis-Patrick algorithm).
4. Find the SNN density of each point.
   - Using a user-specified parameter, Eps, find the number of points that have an SNN similarity of Eps or greater to each point. This is the SNN density of the point.
SNN Clustering Algorithm
5. Find the core points.
   - Using a user-specified parameter, MinPts, find the core points, i.e., all points that have an SNN density greater than MinPts.
6. Form clusters from the core points.
   - If two core points are within a radius, Eps, of each other, they are placed in the same cluster.
7. Discard all noise points.
   - All non-core points that are not within a radius of Eps of a core point are discarded.
8. Assign all non-noise, non-core points to clusters.
   - This can be done by assigning such points to the nearest core point.
(Note that steps 4-8 are DBSCAN.)
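One plausible end-to-end realization of steps 1-8, assuming a data array `X`; turning SNN similarity into a distance (k minus similarity) so that scikit-learn's DBSCAN can consume it is an implementation choice, not something prescribed by the slides:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

def snn_dbscan(X, k=20, eps=10, min_pts=4):
    n = len(X)
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    knn = np.zeros((n, n), dtype=int)
    for i, row in enumerate(idx):
        knn[i, row[1:]] = 1                      # k-NN indicator, self dropped
    shared = knn @ knn.T                         # shared-neighbor counts
    connected = (knn * knn.T).astype(bool)       # SNN edge only if mutually linked
    snn = np.where(connected, shared, 0)         # SNN similarity graph (steps 1-3)
    dist = (k - snn).astype(float)               # similarity -> distance
    np.fill_diagonal(dist, 0.0)
    # steps 4-8 are DBSCAN on the SNN distances; SNN similarity >= eps
    # corresponds to distance <= k - eps
    return DBSCAN(eps=k - eps, min_samples=min_pts,
                  metric="precomputed").fit_predict(dist)
```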
SNN Density
[Figure: (a) all points, (b) high SNN density, (c) medium SNN density, (d) low SNN density.]
SNN Clustering Can Handle Differing Densities
[Figure: original points and the SNN clustering.]
SNN Clustering Can Handle Other Difficult Situations
Finding Clusters of Time Series in Spatio-Temporal Data
[Figure: SNN clusters of SLP; 26 SLP clusters found via shared nearest neighbor clustering (100 NN, 1982-1994), plotted on a longitude/latitude map of the globe.]
[Figure: SNN density of SLP time series data; SNN density of points on the globe.]
Features and Limitations of SNN Clustering
- Does not cluster all the points.
- The complexity of SNN clustering is high:
  - O(n × time to find a point's neighbors within Eps); in the worst case, this is O(n^2).
  - For lower dimensions, there are more efficient ways to find the nearest neighbors: R* trees and k-d trees.