Clustering Prof. Navneet Goyal BITS, Pilani Density-based methods Based on connectivity and density...
![Page 1: Clustering Prof. Navneet Goyal BITS, Pilani Density-based methods Based on connectivity and density functions Filter out noise, find clusters of.](https://reader031.fdocuments.net/reader031/viewer/2022031921/56649cc55503460f9498e241/html5/thumbnails/1.jpg)
Clustering
Prof. Navneet Goyal, BITS, Pilani
![Page 2: Clustering Prof. Navneet Goyal BITS, Pilani Density-based methods Based on connectivity and density functions Filter out noise, find clusters of.](https://reader031.fdocuments.net/reader031/viewer/2022031921/56649cc55503460f9498e241/html5/thumbnails/2.jpg)
Density-based methods
Based on connectivity and density functions
Filter out noise, find clusters of arbitrary shape
Grid-based methods
Quantize the object space into a grid structure
Other Approaches to Clustering
![Page 3: Clustering Prof. Navneet Goyal BITS, Pilani Density-based methods Based on connectivity and density functions Filter out noise, find clusters of.](https://reader031.fdocuments.net/reader031/viewer/2022031921/56649cc55503460f9498e241/html5/thumbnails/3.jpg)
Density-Based Clustering Methods
Major features:
• Discover clusters of arbitrary shape
• Handle noise
• One scan
• Need density parameters as termination condition
Several interesting studies:
DBSCAN: Ester, et al. (KDD’96)
OPTICS: Ankerst, et al. (SIGMOD’99)
DENCLUE: Hinneburg & Keim (KDD’98)
CLIQUE: Agrawal, et al. (SIGMOD’98)
![Page 4: Clustering Prof. Navneet Goyal BITS, Pilani Density-based methods Based on connectivity and density functions Filter out noise, find clusters of.](https://reader031.fdocuments.net/reader031/viewer/2022031921/56649cc55503460f9498e241/html5/thumbnails/4.jpg)
Density-Based Spatial Clustering of Applications with Noise
Clusters are dense regions of objects separated by regions of low density (noise)
Outliers will not affect the creation of clusters
Input:
– MinPts – minimum number of points in any cluster
– Eps – for each point in a cluster there must be another point in it less than this distance away
Density-Based Method: DBSCAN
![Page 5: Clustering Prof. Navneet Goyal BITS, Pilani Density-based methods Based on connectivity and density functions Filter out noise, find clusters of.](https://reader031.fdocuments.net/reader031/viewer/2022031921/56649cc55503460f9498e241/html5/thumbnails/5.jpg)
• Eps-neighborhood: Points within Eps distance of a point.
• Core point: Eps-neighborhood dense enough (MinPts)
• Directly density-reachable: A point p is directly density-reachable from a point q if the distance is small (Eps) and q is a core point.
• Density-reachable: A point is density-reachable from another point if there is a chain of core points leading from one to the other (the end point itself need not be a core point).
DBSCAN Density Concepts
![Page 6: Clustering Prof. Navneet Goyal BITS, Pilani Density-based methods Based on connectivity and density functions Filter out noise, find clusters of.](https://reader031.fdocuments.net/reader031/viewer/2022031921/56649cc55503460f9498e241/html5/thumbnails/6.jpg)
Density-Based Method: DBSCAN
• Eps-neighborhood: Points within Eps distance of a point.
N_Eps(p) = {q ∈ D | dist(p, q) <= Eps}
• Core point: Eps-neighborhood dense enough (MinPts)
• Directly density-reachable: A point p is directly density-reachable from a point q wrt. Eps, MinPts if
1) p belongs to N_Eps(q)
2) core point condition: |N_Eps(q)| >= MinPts
Figure: points p and q, with MinPts = 5 and Eps = 1 cm
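The definitions above translate almost directly into code. The following is a minimal sketch, assuming points are tuples of coordinates, Euclidean distance, and a brute-force scan; the function names and the sample data are illustrative and do not come from the slides.

```python
from math import dist  # Euclidean distance, Python 3.8+


def eps_neighborhood(points, p, eps):
    """N_Eps(p): indices q with dist(p, q) <= eps (p itself is included)."""
    return [q for q in range(len(points)) if dist(points[p], points[q]) <= eps]


def is_core(points, q, eps, min_pts):
    """Core point condition: |N_Eps(q)| >= MinPts."""
    return len(eps_neighborhood(points, q, eps)) >= min_pts


def directly_density_reachable(points, p, q, eps, min_pts):
    """True if p lies in N_Eps(q) and q is a core point."""
    return p in eps_neighborhood(points, q, eps) and is_core(points, q, eps, min_pts)


# Hypothetical sample: a small dense group plus one point on its fringe.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (-0.5, 0.0), (0.0, -0.5), (1.4, 0.0)]
print(directly_density_reachable(points, 5, 1, eps=1.0, min_pts=5))
print(directly_density_reachable(points, 1, 5, eps=1.0, min_pts=5))
```

The two calls at the end illustrate that the relation is not symmetric: a fringe point can be directly density-reachable from a core point without the reverse holding.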
![Page 7: Clustering Prof. Navneet Goyal BITS, Pilani Density-based methods Based on connectivity and density functions Filter out noise, find clusters of.](https://reader031.fdocuments.net/reader031/viewer/2022031921/56649cc55503460f9498e241/html5/thumbnails/7.jpg)
Density-Based Method: DBSCAN
• Density-reachable: A point is density-reachable from another point if there is a chain of core points leading from one to the other (the end point itself need not be a core point).
• Formally, a point p is density-reachable from a point q wrt. Eps, MinPts if there is a chain of points p1, …, pn, with p1 = q and pn = p, such that pi+1 is directly density-reachable from pi for all i in 1, …, n-1.
Figure: chain from q through p1 to p
![Page 8: Clustering Prof. Navneet Goyal BITS, Pilani Density-based methods Based on connectivity and density functions Filter out noise, find clusters of.](https://reader031.fdocuments.net/reader031/viewer/2022031921/56649cc55503460f9498e241/html5/thumbnails/8.jpg)
Density-connected
– A point p is density-connected to a point q wrt. Eps, MinPts if there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts.
Figure: points p and q, each density-reachable from o
Density-Based Method: DBSCAN
![Page 9: Clustering Prof. Navneet Goyal BITS, Pilani Density-based methods Based on connectivity and density functions Filter out noise, find clusters of.](https://reader031.fdocuments.net/reader031/viewer/2022031921/56649cc55503460f9498e241/html5/thumbnails/9.jpg)
DBSCAN relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
Discovers clusters of arbitrary shape in spatial databases with noise
Figure: core, border, and outlier points, with Eps = 1 cm and MinPts = 5
![Page 10: Clustering Prof. Navneet Goyal BITS, Pilani Density-based methods Based on connectivity and density functions Filter out noise, find clusters of.](https://reader031.fdocuments.net/reader031/viewer/2022031921/56649cc55503460f9498e241/html5/thumbnails/10.jpg)
DBSCAN: Core, Border, and Noise Points
![Page 11: Clustering Prof. Navneet Goyal BITS, Pilani Density-based methods Based on connectivity and density functions Filter out noise, find clusters of.](https://reader031.fdocuments.net/reader031/viewer/2022031921/56649cc55503460f9498e241/html5/thumbnails/11.jpg)
1. Label all points as core, border, or noise points
2. Eliminate noise points
3. Put an edge between all core points that are within ε of each other
4. Make each group of connected core points into a separate cluster
5. Assign each border point to one of its associated core points
DBSCAN: The Algorithm
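The five steps above can be sketched as follows. This is a brute-force illustration, assuming Euclidean distance and points given as coordinate tuples; the function name `dbscan`, the `NOISE` label, and the sample data are assumptions made for the example, and a practical implementation would use a spatial index rather than the O(n^2) neighbor scan used here.

```python
from math import dist

NOISE = -1  # label used here for noise points (an assumption of this sketch)


def dbscan(points, eps, min_pts):
    n = len(points)
    # Step 1: find each point's Eps-neighborhood and mark core points.
    neighbors = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    core = [len(nb) >= min_pts for nb in neighbors]

    # Step 2: every point starts as noise; only reachable points get relabeled.
    labels = [NOISE] * n

    # Steps 3-4: core points within eps of each other are connected;
    # each connected group of core points becomes one cluster.
    cluster_id = 0
    for i in range(n):
        if not core[i] or labels[i] != NOISE:
            continue
        labels[i] = cluster_id
        stack = [i]
        while stack:
            u = stack.pop()
            for v in neighbors[u]:
                if core[v] and labels[v] == NOISE:
                    labels[v] = cluster_id
                    stack.append(v)
        cluster_id += 1

    # Step 5: a border point joins the cluster of one core point in its neighborhood.
    for i in range(n):
        if not core[i]:
            for j in neighbors[i]:
                if core[j]:
                    labels[i] = labels[j]
                    break
    return labels


# Hypothetical data: two dense groups, one border point (2, 1), one outlier.
sample = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1),
          (10, 10), (10, 11), (11, 10), (11, 11), (50, 50)]
print(dbscan(sample, eps=1.5, min_pts=4))
```

When a border point lies near core points of two clusters, this sketch simply takes the first one found, which matches the slide's "one of its associated core points".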
![Page 12: Clustering Prof. Navneet Goyal BITS, Pilani Density-based methods Based on connectivity and density functions Filter out noise, find clusters of.](https://reader031.fdocuments.net/reader031/viewer/2022031921/56649cc55503460f9498e241/html5/thumbnails/12.jpg)
DBSCAN: Core, Border and Noise Points
Figure panels: original points; point types (core, border, and noise), with Eps = 10, MinPts = 4
Source of figure: Introduction to Data Mining by Tan et al.
![Page 13: Clustering Prof. Navneet Goyal BITS, Pilani Density-based methods Based on connectivity and density functions Filter out noise, find clusters of.](https://reader031.fdocuments.net/reader031/viewer/2022031921/56649cc55503460f9498e241/html5/thumbnails/13.jpg)
When DBSCAN Works Well
Figure panels: original points; clusters
• Resistant to Noise
• Can handle clusters of different shapes and sizes
Source of figure: Introduction to Data Mining by Tan et al.
![Page 14: Clustering Prof. Navneet Goyal BITS, Pilani Density-based methods Based on connectivity and density functions Filter out noise, find clusters of.](https://reader031.fdocuments.net/reader031/viewer/2022031921/56649cc55503460f9498e241/html5/thumbnails/14.jpg)
When DBSCAN Does NOT Work Well
Figure panels: original points; result with (MinPts=4, Eps=9.75); result with (MinPts=4, Eps=9.92)
• Varying densities
• High-dimensional data
Source of figure: Introduction to Data Mining by Tan et al.
![Page 15: Clustering Prof. Navneet Goyal BITS, Pilani Density-based methods Based on connectivity and density functions Filter out noise, find clusters of.](https://reader031.fdocuments.net/reader031/viewer/2022031921/56649cc55503460f9498e241/html5/thumbnails/15.jpg)
• Idea is that for points in a cluster, their kth nearest neighbors are at roughly the same distance
• Noise points have their kth nearest neighbor at a farther distance
• So, plot the sorted distance of every point to its kth nearest neighbor
DBSCAN: Determining Eps and MinPts
Figure annotation: Eps = 10, MinPts = 4
Source of figure: Introduction to Data Mining by Tan et al.
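The values behind such a plot can be computed as sketched below. The sketch assumes Euclidean distance and a brute-force scan, and it returns the sorted k-distances rather than drawing them; the function name and the sample data are illustrative only.

```python
from math import dist


def sorted_k_distances(points, k):
    """Distance from every point to its kth nearest neighbor, sorted ascending."""
    k_dists = []
    for i, p in enumerate(points):
        # Distances to all other points (the point itself is excluded).
        d = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        k_dists.append(d[k - 1])
    return sorted(k_dists)


# Hypothetical data: four points in a tight square and one far-away point.
sample = [(0, 0), (0, 1), (1, 0), (1, 1), (20, 20)]
for value in sorted_k_distances(sample, k=3):
    print(round(value, 3))
```

Points inside the cluster share roughly the same small k-distance, while the isolated point shows up as a sharp jump at the end of the sorted list.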
![Page 16: Clustering Prof. Navneet Goyal BITS, Pilani Density-based methods Based on connectivity and density functions Filter out noise, find clusters of.](https://reader031.fdocuments.net/reader031/viewer/2022031921/56649cc55503460f9498e241/html5/thumbnails/16.jpg)
• Ordering Points To Identify Clustering Structure
• DBSCAN is sensitive to the choice of input parameters
• Parameter setting is done empirically
• High-dimensional data – the problem is more pronounced
• High-dimensional data clustering structures are not generally characterized by global density parameters like Eps & MinPts
• OPTICS as a solution!
OPTICS: Self Study
![Page 17: Clustering Prof. Navneet Goyal BITS, Pilani Density-based methods Based on connectivity and density functions Filter out noise, find clusters of.](https://reader031.fdocuments.net/reader031/viewer/2022031921/56649cc55503460f9498e241/html5/thumbnails/17.jpg)
• Computes an augmented cluster ordering
• Ordering represents the density based clustering structure of the data
• Contains information that is equivalent to density based clustering obtained from a wide range of parameter settings
• Cluster ordering can be used to extract basic clustering information
OPTICS
![Page 18: Clustering Prof. Navneet Goyal BITS, Pilani Density-based methods Based on connectivity and density functions Filter out noise, find clusters of.](https://reader031.fdocuments.net/reader031/viewer/2022031921/56649cc55503460f9498e241/html5/thumbnails/18.jpg)
• In DBSCAN, for a constant MinPts, clusters with higher density (lower Eps) are completely contained in density-connected sets obtained with a lower density (higher Eps)
• Extend DBSCAN to process a set of distance parameters (Eps values) at the same time
• For this, the objects need to be processed in a specific order
• This order always selects next an object that is density-reachable wrt. the lowest Eps, so that clusters of higher density are finished first
OPTICS
![Page 19: Clustering Prof. Navneet Goyal BITS, Pilani Density-based methods Based on connectivity and density functions Filter out noise, find clusters of.](https://reader031.fdocuments.net/reader031/viewer/2022031921/56649cc55503460f9498e241/html5/thumbnails/19.jpg)
• 2 values need to be stored for each object:
– Core distance
– Reachability distance
• Core distance of p – smallest Eps that makes p a core object. If p is not a core object, it is undefined.
• Reachability distance of q wrt. p is the greater of the core distance of p and the Euclidean distance between p & q. If p is not a core object, the reachability distance between p & q is undefined.
OPTICS
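The two stored values can be sketched as below. The sketch assumes Euclidean distance, a brute-force scan, and that a point counts toward its own neighborhood (as in the N_Eps definition used for DBSCAN); `None` stands in for "undefined", and all names and sample data are illustrative.

```python
from math import dist


def core_distance(points, p, eps, min_pts):
    """Smallest radius that makes points[p] a core object, or None if it is
    not a core object for the given eps."""
    # Sorted distances from p to every point; p itself contributes 0.
    d = sorted(dist(points[p], q) for q in points)
    if len(d) < min_pts or d[min_pts - 1] > eps:
        return None
    return d[min_pts - 1]


def reachability_distance(points, q, p, eps, min_pts):
    """max(core-distance(p), dist(p, q)), or None if p is not a core object."""
    cd = core_distance(points, p, eps, min_pts)
    if cd is None:
        return None
    return max(cd, dist(points[p], points[q]))


# Hypothetical data: four points on a line and one distant point.
sample = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
print(core_distance(sample, 0, eps=3, min_pts=3))
print(reachability_distance(sample, 1, 0, eps=3, min_pts=3))
print(reachability_distance(sample, 3, 0, eps=3, min_pts=3))
print(core_distance(sample, 4, eps=3, min_pts=3))
```

For a neighbor closer than the core distance, the reachability distance is clamped up to the core distance; for a farther neighbor it is the actual distance.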
![Page 20: Clustering Prof. Navneet Goyal BITS, Pilani Density-based methods Based on connectivity and density functions Filter out noise, find clusters of.](https://reader031.fdocuments.net/reader031/viewer/2022031921/56649cc55503460f9498e241/html5/thumbnails/20.jpg)
• Index-based:
– k = number of dimensions
– N = 20
– p = 75%
– M = N(1-p) = 5
– Complexity: O(kN^2)
• Core Distance
• Reachability Distance
OPTICS: Some Extension from DBSCAN
Figure: reachability distance r(p, o) = max(core-distance(o), d(o, p)); with MinPts = 5 and Eps = 3 cm, r(p1, o) = 2.8 cm and r(p2, o) = 4 cm
![Page 21: Clustering Prof. Navneet Goyal BITS, Pilani Density-based methods Based on connectivity and density functions Filter out noise, find clusters of.](https://reader031.fdocuments.net/reader031/viewer/2022031921/56649cc55503460f9498e241/html5/thumbnails/21.jpg)
• Efficiency issues with DBSCAN
• Finding clusters in subspaces
• Modeling density accurately
We now look at:
• Grid-based clustering
– Partitions data space into grid cells and forms clusters from cells that are dense enough
– Efficient approach for low-dimensional data
• Subspace clustering
– Finds clusters in subsets of all dimensions
– 2^n - 1 subspaces to be searched!
Density-based Clustering Contd…
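The size of the subspace search space follows from counting the non-empty subsets of n dimensions. A small sketch, with hypothetical attribute names, makes the count concrete:

```python
from itertools import combinations


def nonempty_subspaces(dims):
    """All non-empty subsets of the given dimensions, smallest first."""
    return [subset
            for size in range(1, len(dims) + 1)
            for subset in combinations(dims, size)]


# Hypothetical attribute names, used only for illustration.
subspaces = nonempty_subspaces(["x", "y", "z"])
print(len(subspaces))  # 2**3 - 1
print(subspaces)
```

With only 20 attributes the same count already exceeds a million, which is why exhaustive search over subspaces is impractical.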
![Page 22: Clustering Prof. Navneet Goyal BITS, Pilani Density-based methods Based on connectivity and density functions Filter out noise, find clusters of.](https://reader031.fdocuments.net/reader031/viewer/2022031921/56649cc55503460f9498e241/html5/thumbnails/22.jpg)
• GRIDCLUS
• STING
• CLIQUE
• WaveCluster
Grid-based Clustering
![Page 23: Clustering Prof. Navneet Goyal BITS, Pilani Density-based methods Based on connectivity and density functions Filter out noise, find clusters of.](https://reader031.fdocuments.net/reader031/viewer/2022031921/56649cc55503460f9498e241/html5/thumbnails/23.jpg)
• Significant reduction in time complexity, especially for large data sets
• Number of cells << number of data points
• Instead of clustering data points, the neighborhoods surrounding the data points are clustered
Grid-based Clustering
![Page 24: Clustering Prof. Navneet Goyal BITS, Pilani Density-based methods Based on connectivity and density functions Filter out noise, find clusters of.](https://reader031.fdocuments.net/reader031/viewer/2022031921/56649cc55503460f9498e241/html5/thumbnails/24.jpg)
Steps involved:
1. Creating the grid structure
2. Calculating cell density for each cell
3. Sorting of the cells according to their densities
4. Identifying cluster centers
5. Traversal of neighborhood cells
Grid-based Clustering
![Page 25: Clustering Prof. Navneet Goyal BITS, Pilani Density-based methods Based on connectivity and density functions Filter out noise, find clusters of.](https://reader031.fdocuments.net/reader031/viewer/2022031921/56649cc55503460f9498e241/html5/thumbnails/25.jpg)
Algorithm:
1. Define a set of grid cells
2. Assign objects to appropriate grid cells and compute the density of each cell
3. Eliminate cells having density below a specified threshold
4. Form clusters from contiguous groups of dense cells
Grid-based Clustering
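The four algorithm steps above can be sketched as follows. The sketch assumes equal-width cells (so the point count serves as the density), stores only occupied cells in a dictionary, and treats cells as adjacent when they differ by at most one index in every dimension (8-adjacency in 2-D); the function name, parameters, and sample data are illustrative.

```python
from collections import defaultdict
from itertools import product


def grid_cluster(points, cell_width, threshold):
    # Steps 1-2: map each point to its cell; only occupied cells are stored.
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        key = tuple(int(c // cell_width) for c in p)
        cells[key].append(idx)

    # Step 3: keep cells whose count reaches the density threshold.
    dense = {key for key, members in cells.items() if len(members) >= threshold}

    # Step 4: flood-fill contiguous groups of dense cells into clusters.
    cell_label = {}
    next_id = 0
    for start in sorted(dense):
        if start in cell_label:
            continue
        cell_label[start] = next_id
        stack = [start]
        while stack:
            cell = stack.pop()
            for offset in product((-1, 0, 1), repeat=len(cell)):
                nb = tuple(c + o for c, o in zip(cell, offset))
                if nb in dense and nb not in cell_label:
                    cell_label[nb] = next_id
                    stack.append(nb)
        next_id += 1

    # Points in discarded (non-dense) cells keep the label -1.
    labels = [-1] * len(points)
    for cell, cid in cell_label.items():
        for idx in cells[cell]:
            labels[idx] = cid
    return labels


# Hypothetical data: two adjacent dense cells, one separate dense cell, one sparse cell.
sample = [(0.1, 0.1), (0.2, 0.3), (1.1, 0.2), (1.5, 0.5),
          (5.1, 5.1), (5.2, 5.3), (9.5, 9.5)]
print(grid_cluster(sample, cell_width=1.0, threshold=2))
```

Changing `cell_width` or `threshold` can merge, split, or discard clusters, which is the sensitivity to grid definition that grid-based methods are known for.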
![Page 26: Clustering Prof. Navneet Goyal BITS, Pilani Density-based methods Based on connectivity and density functions Filter out noise, find clusters of.](https://reader031.fdocuments.net/reader031/viewer/2022031921/56649cc55503460f9498e241/html5/thumbnails/26.jpg)
• Defining Grid Cells
– Key step
– Equal-width intervals along all dimensions
  • Each cell has the same volume
  • Density of a cell is defined as the no. of points in the cell
– Alternatively, an equi-depth approach can be used
  • Equal number of points in each interval
  • Also called equal-frequency discretization
– MAFIA: subspace clustering algorithm that initially uses equal-width intervals and then combines intervals of similar density
• Definition of the grid has a strong impact on clustering results
Grid-based Clustering
![Page 27: Clustering Prof. Navneet Goyal BITS, Pilani Density-based methods Based on connectivity and density functions Filter out noise, find clusters of.](https://reader031.fdocuments.net/reader031/viewer/2022031921/56649cc55503460f9498e241/html5/thumbnails/27.jpg)
• Density of Grid Cells
– No. of points in the cell divided by the volume of the cell
  • No. of road signs per km
  • No. of tigers in a sq. km
  • No. of molecules of a gas in a cu. cm
Grid-based Clustering
Source of figure: Introduction to Data Mining by Tan et al.
![Page 28: Clustering Prof. Navneet Goyal BITS, Pilani Density-based methods Based on connectivity and density functions Filter out noise, find clusters of.](https://reader031.fdocuments.net/reader031/viewer/2022031921/56649cc55503460f9498e241/html5/thumbnails/28.jpg)
• Forming Clusters from dense grid cells
– Relatively straightforward
– In the example on the previous slide: 2 clusters
– Define adjacency
  • 4 or 8 adjacent cells in 2-D?
  • Efficient technique to find adjacent cells (only occupied cells are stored)
– Partially empty cells on the fringe of clusters are not dense and will be discarded
– 4 parts of the larger cluster will be lost if the threshold is 9
Grid-based Clustering
Source of figure: Introduction to Data Mining by Tan et al.
![Page 29: Clustering Prof. Navneet Goyal BITS, Pilani Density-based methods Based on connectivity and density functions Filter out noise, find clusters of.](https://reader031.fdocuments.net/reader031/viewer/2022031921/56649cc55503460f9498e241/html5/thumbnails/29.jpg)
• Strengths & Limitations
– Single pass is enough to determine the cell of every object and the count of every cell
– Grid cells created only for non-empty cells
– Complexity of O(m) to O(m log m)
– Grids are rectangular
– Curse of dimensionality
– Grid cells containing just one element
Grid-based Clustering
Source of figure: Introduction to Data Mining by Tan et al.
![Page 30: Clustering Prof. Navneet Goyal BITS, Pilani Density-based methods Based on connectivity and density functions Filter out noise, find clusters of.](https://reader031.fdocuments.net/reader031/viewer/2022031921/56649cc55503460f9498e241/html5/thumbnails/30.jpg)
• Clustering algorithms considered so far take into account all attributes
• Consider only a subspace of data
Subspace Clustering
Source of figure: Introduction to Data Mining by Tan et al.
![Page 31: Clustering Prof. Navneet Goyal BITS, Pilani Density-based methods Based on connectivity and density functions Filter out noise, find clusters of.](https://reader031.fdocuments.net/reader031/viewer/2022031921/56649cc55503460f9498e241/html5/thumbnails/31.jpg)
Subspace Clustering
Source of figure: Introduction to Data Mining by Tan et al.
![Page 32: Clustering Prof. Navneet Goyal BITS, Pilani Density-based methods Based on connectivity and density functions Filter out noise, find clusters of.](https://reader031.fdocuments.net/reader031/viewer/2022031921/56649cc55503460f9498e241/html5/thumbnails/32.jpg)
• Ensemble Clustering
• Parallelizing Clustering Algorithms to leverage a Cluster
Some Research Directions
![Page 33: Clustering Prof. Navneet Goyal BITS, Pilani Density-based methods Based on connectivity and density functions Filter out noise, find clusters of.](https://reader031.fdocuments.net/reader031/viewer/2022031921/56649cc55503460f9498e241/html5/thumbnails/33.jpg)
• Similar to Ensemble Classification
• Consensus Clustering
• Obtain different clustering solutions and then reconcile them
Ensemble Clustering
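One possible way to reconcile several clustering solutions is a co-association (voting) scheme: two points end up together if a majority of the base clusterings put them in the same cluster. The slides do not name a specific method, so the sketch below is only an illustration of the idea; the function name, the `min_agreement` parameter, and the sample labelings are assumptions.

```python
def consensus_clusters(labelings, min_agreement=0.5):
    """Group points that share a cluster in more than `min_agreement`
    of the given labelings (label values need not match across labelings)."""
    n = len(labelings[0])

    def together(i, j):
        # Fraction of base clusterings that place i and j in the same cluster.
        return sum(lab[i] == lab[j] for lab in labelings) / len(labelings)

    labels = [-1] * n
    next_id = 0
    for i in range(n):
        if labels[i] != -1:
            continue
        # Grow a consensus cluster from point i by following strong agreements.
        labels[i] = next_id
        stack = [i]
        while stack:
            u = stack.pop()
            for v in range(n):
                if labels[v] == -1 and together(u, v) > min_agreement:
                    labels[v] = next_id
                    stack.append(v)
        next_id += 1
    return labels


# Three hypothetical base clusterings of five points; they disagree on point 2.
base = [[0, 0, 0, 1, 1],
        [1, 1, 0, 0, 0],
        [0, 0, 1, 1, 1]]
print(consensus_clusters(base))
```

Because only co-membership is compared, the arbitrary cluster numbering used by each base clustering does not matter.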
![Page 34: Clustering Prof. Navneet Goyal BITS, Pilani Density-based methods Based on connectivity and density functions Filter out noise, find clusters of.](https://reader031.fdocuments.net/reader031/viewer/2022031921/56649cc55503460f9498e241/html5/thumbnails/34.jpg)
• Parallelize to leverage a cluster
• Two levels of parallelism
– Node Level
– Core Level
• Not Necessarily Orthogonal
• Hybrid – Non Trivial
• Programming Environment:
– MPI
– OpenMP
Parallelizing Clustering Algorithms