DMTM Lecture 14 Density based clustering
Prof. Pier Luca Lanzi
Density Based Clustering
Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)
What is density-based clustering?
• Clustering based on density (local cluster criterion), such as density-connected points
• Major features:
§ Discover clusters of arbitrary shape
§ Handle noise
§ One scan
§ Need density parameters as termination condition
• Several interesting studies:
§ DBSCAN: Ester et al. (KDD’96)
§ OPTICS: Ankerst et al. (SIGMOD’99)
§ DENCLUE: Hinneburg & Keim (KDD’98)
§ CLIQUE: Agrawal et al. (SIGMOD’98) (more grid-based)
DBSCAN: Basic Concepts
• The neighborhood within a radius ε of a given object is called the ε-neighborhood of the object
• If the ε-neighborhood of an object contains at least MinPts objects, then the object is a core object
• An object p is directly density-reachable from object q if p is within the ε-neighborhood of q and q is a core object
• An object p is density-reachable from object q if there is a chain of objects p1, …, pn, where p1 = q and pn = p, such that pi+1 is directly density-reachable from pi
• An object p is density-connected to object q with respect to ε and MinPts if there is an object o such that both p and q are density-reachable from o
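The definitions above can be turned into small predicates. The following is a minimal Python sketch using only the standard library. All function names are hypothetical, and it assumes Euclidean distance and that an object counts as a member of its own ε-neighborhood (conventions differ on this point).

```python
from math import dist  # Euclidean distance, Python 3.8+


def eps_neighborhood(points, i, eps):
    """Indices of all points within distance eps of points[i] (i itself included)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def is_core(points, i, eps, min_pts):
    """A core object has at least min_pts objects in its eps-neighborhood."""
    return len(eps_neighborhood(points, i, eps)) >= min_pts


def directly_density_reachable(points, p, q, eps, min_pts):
    """True if p lies in the eps-neighborhood of q and q is a core object."""
    return is_core(points, q, eps, min_pts) and p in eps_neighborhood(points, q, eps)


def density_reachable(points, p, q, eps, min_pts):
    """True if a chain of directly density-reachable steps leads from q to p."""
    seen, frontier = {q}, [q]
    while frontier:
        cur = frontier.pop()
        if not is_core(points, cur, eps, min_pts):
            continue  # only core objects can extend the chain
        for j in eps_neighborhood(points, cur, eps):
            if j == p:
                return True
            if j not in seen:
                seen.add(j)
                frontier.append(j)
    return False


def density_connected(points, p, q, eps, min_pts):
    """True if some object o exists from which both p and q are density-reachable."""
    return any(
        density_reachable(points, p, o, eps, min_pts)
        and density_reachable(points, q, o, eps, min_pts)
        for o in range(len(points))
    )


# Toy data (an assumption for illustration): four points on a line plus one far away.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
eps, min_pts = 1, 3
print(directly_density_reachable(points, 0, 1, eps, min_pts))  # True: point 1 is core
print(directly_density_reachable(points, 1, 0, eps, min_pts))  # False: point 0 is not core
print(density_connected(points, 0, 3, eps, min_pts))           # True: both reachable from point 1
```

Note that direct density-reachability is not symmetric: a border object can be reached from a core object, but not the other way around. Density-connectedness restores the symmetry that a cluster definition needs.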
DBSCAN: Basic Concepts
• Density = number of points within a specified radius (Eps)
• A border point has fewer than MinPts points within Eps, but is in the neighborhood of a core point
• A noise point is any point that is not a core point or a border point
• A density-based cluster is a set of density-connected objects that is maximal with respect to density-reachability
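The three point types can be computed directly from neighbor counts. This is a minimal Python sketch, not a library API: the function name `classify_points` is hypothetical, distance is assumed to be Euclidean, and a point is assumed to count toward its own neighborhood.

```python
from math import dist  # Euclidean distance, Python 3.8+


def classify_points(points, eps, min_pts):
    """Label each point as 'core', 'border', or 'noise'."""
    # Neighborhood of every point (the point itself is included by assumption).
    neighbors = [
        [j for j, q in enumerate(points) if dist(p, q) <= eps] for p in points
    ]
    # Core points: at least min_pts points within eps.
    core = {i for i, nb in enumerate(neighbors) if len(nb) >= min_pts}
    labels = []
    for i, nb in enumerate(neighbors):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nb):
            labels.append("border")  # not dense itself, but next to a core point
        else:
            labels.append("noise")
    return labels


# Toy data (an assumption for illustration).
points = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
print(classify_points(points, eps=1, min_pts=3))
# ['border', 'core', 'core', 'border', 'noise']
```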
Density-Reachable & Density-Connected
• Directly density-reachable
• Density-reachable
• Density-connected
[Figure: three diagrams, one per concept, with points labeled p, q, p1, and o; MinPts = 5, Eps = 1 cm]
DBSCAN: Core, Border, and Noise Points
DBSCAN Density Based Spatial Clustering
• Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
• Discovers clusters of arbitrary shape in spatial databases with noise
• The Algorithm
§ Arbitrarily select a point p
§ Retrieve all points density-reachable from p given Eps and MinPts
§ If p is a core point, a cluster is formed
§ If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database
§ Continue the process until all of the points have been processed
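The steps above can be sketched in a few lines of Python. This is an illustrative sketch, not the original KDD’96 implementation: it uses a brute-force neighborhood search, Euclidean distance, counts a point in its own neighborhood, and visits points in list order rather than arbitrarily. The names `dbscan`, `region_query`, and `NOISE` are choices made for this example.

```python
from math import dist  # Euclidean distance, Python 3.8+

NOISE = -1  # label for points that end up in no cluster


def dbscan(points, eps, min_pts):
    """Return one cluster label per point (0, 1, ...), or NOISE."""

    def region_query(i):
        # Brute-force eps-neighborhood of points[i], including i itself.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(i)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # provisional: may turn out to be a border point
            continue
        # points[i] is a core point: start a new cluster and grow it.
        cluster_id += 1
        labels[i] = cluster_id
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster_id
            j_neighbors = region_query(j)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)  # j is core, so keep expanding
    return labels


# Toy data (an assumption for illustration): two groups and one isolated point.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0), (20, 0), (21, 0), (22, 0)]
print(dbscan(points, eps=1, min_pts=3))
# [0, 0, 0, 0, -1, 1, 1, 1]
```

The brute-force `region_query` makes this sketch quadratic in the number of points. Practical implementations typically rely on a spatial index for the neighborhood search.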
Core, Border and Noise Points
[Figure: two panels, "Original Points" and "Point types: core, border and noise", with Eps = 10, MinPts = 4]
When DBSCAN Works Well
• Resistant to noise
• Can handle clusters of different shapes and sizes
[Figure: two panels, "Original Points" and "Clusters"]
When DBSCAN May Fail?
• Varying densities
• High-dimensional data
[Figure: three panels, "Original Points", clusters with (MinPts=4, Eps=9.75), and clusters with (MinPts=4, Eps=9.92)]
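The varying-density problem can be reproduced on a tiny one-dimensional example. This is a sketch under stated assumptions: the data set is invented for illustration (it is not the one in the figure), `dbscan_1d` and `summarize` are hypothetical helper names, and a point counts toward its own neighborhood. The data contain two dense groups separated by a small gap and one sparse group far away.

```python
NOISE = -1  # label for points that end up in no cluster


def dbscan_1d(xs, eps, min_pts):
    """Compact DBSCAN for 1-D data; returns one label per value."""

    def region_query(i):
        return [j for j, x in enumerate(xs) if abs(xs[i] - x) <= eps]

    labels = [None] * len(xs)
    cluster_id = -1
    for i in range(len(xs)):
        if labels[i] is not None:
            continue
        neighbors = region_query(i)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # provisional, may become a border point
            continue
        cluster_id += 1
        labels[i] = cluster_id
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(j)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)
    return labels


def summarize(labels):
    """Return (number of clusters, number of noise points)."""
    return len({l for l in labels if l != NOISE}), labels.count(NOISE)


# Two dense groups (spacing 1, gap 4 between them) and one sparse group (spacing 5).
xs = [0, 1, 2, 3, 4, 8, 9, 10, 11, 12, 100, 105, 110, 115, 120]

# Small Eps: the dense groups are separated, but the sparse group is all noise.
print(summarize(dbscan_1d(xs, eps=1.5, min_pts=3)))  # (2, 5)
# Large Eps: the sparse group is found, but the two dense groups merge.
print(summarize(dbscan_1d(xs, eps=5, min_pts=3)))    # (2, 0)
```

With MinPts fixed at 3, neither of these two Eps values recovers all three groups, because the Eps needed to connect the sparse group is larger than the gap that separates the dense ones.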
Run the Python notebook on density-based clustering
Examples using R
Density-Based Clustering in R
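library(fpc)  # provides dbscan()
set.seed(665544)
n <- 600
# 10 random centers in [0, 10] x [0, 10], recycled across the n points,
# each perturbed with Gaussian noise (sd = 0.2)
x <- cbind(runif(10, 0, 10)+rnorm(n, sd=0.2), runif(10, 0, 10)+rnorm(n, sd=0.2))
par(bg="grey40")
# eps = 0.2; MinPts is left at the package default
ds <- dbscan(x, 0.2, showplot=1)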
Density-Based Clustering in R
library(fpc)
set.seed(665544)
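# 63 points along one period of the sine curve
x <- seq(0,6.28,0.1)
y <- sin(x)
# x and y are recycled 10 times to match the 630 noise samples
xd <- x+rnorm(630,sd=0.2)
yd <- y+rnorm(630,sd=0.2)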
plot(xd,yd)
par(bg="grey40")
d <- cbind(xd,yd)
# this works nicely since epsilon is
# the same size as the standard deviation (0.2)
# used to generate the data
ds <- dbscan(d, 0.2, showplot=1)
# this does not work so nicely
ds <- dbscan(d, 0.1, showplot=1)
Clustering Comparisons on Sin Data
[Figure: two panels, "hierarchical clustering" and "kmeans clustering"]
Clustering Comparisons on Sin Data (k-means with 10 clusters)
http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/Density-Based_Clustering
Software Packages