DMTM Lecture 14 Density based clustering

Prof. Pier Luca Lanzi
Density Based Clustering
Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)

What is density-based clustering?

• Clustering based on density (a local cluster criterion), such as density-connected points
• Major features:
§ Discover clusters of arbitrary shape
§ Handle noise
§ One scan
§ Need density parameters as termination condition

• Several interesting studies:
§ DBSCAN: Ester, et al. (KDD’96)
§ OPTICS: Ankerst, et al. (SIGMOD’99)
§ DENCLUE: Hinneburg & Keim (KDD’98)
§ CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-based)

DBSCAN: Basic Concepts

• The neighborhood within a radius ε of a given object is called the ε-neighborhood of the object
• If the ε-neighborhood of an object contains at least MinPts objects, then the object is a core object
• An object p is directly density-reachable from object q if p is within the ε-neighborhood of q and q is a core object
• An object p is density-reachable from object q if there is a chain of objects p1, …, pn, where p1 = q and pn = p, such that pi+1 is directly density-reachable from pi
• An object p is density-connected to q with respect to ε and MinPts if there is an object o such that both p and q are density-reachable from o
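
The definitions above can be sketched in a few lines of Python (the language of the notebook mentioned later in the lecture). This is only a minimal illustration, not the lecture's code: the function names, the toy points, and the choice to count an object as part of its own ε-neighborhood are assumptions.

```python
from math import dist

def eps_neighborhood(points, i, eps):
    """Indices of all points within distance eps of points[i].
    Assumption: the point itself counts as part of its neighborhood."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def is_core(points, i, eps, min_pts):
    """A core object has at least min_pts objects in its eps-neighborhood."""
    return len(eps_neighborhood(points, i, eps)) >= min_pts

def directly_density_reachable(points, p, q, eps, min_pts):
    """True if p lies in the eps-neighborhood of q and q is a core object."""
    return is_core(points, q, eps, min_pts) and p in eps_neighborhood(points, q, eps)

def density_reachable(points, p, q, eps, min_pts):
    """True if a chain q = p1, ..., pn = p exists in which every step
    is directly density-reachable from the previous one."""
    frontier, seen = [q], {q}
    while frontier:
        current = frontier.pop()
        if not is_core(points, current, eps, min_pts):
            continue  # only core objects can extend the chain
        for j in eps_neighborhood(points, current, eps):
            if j == p:
                return True
            if j not in seen:
                seen.add(j)
                frontier.append(j)
    return False

def density_connected(points, p, q, eps, min_pts):
    """True if some object o exists from which both p and q are density-reachable."""
    return any(density_reachable(points, p, o, eps, min_pts)
               and density_reachable(points, q, o, eps, min_pts)
               for o in range(len(points)))

# Hypothetical toy data: four points on a line plus one far away.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
eps, min_pts = 1.0, 3

print(directly_density_reachable(points, 0, 1, eps, min_pts))  # True: point 1 is core
print(directly_density_reachable(points, 1, 0, eps, min_pts))  # False: point 0 is not core
print(density_reachable(points, 3, 1, eps, min_pts))           # True: chain 1 -> 2 -> 3
print(density_connected(points, 0, 3, eps, min_pts))           # True: both reachable from 1
```

Note that density-reachability is not symmetric (point 3 is reachable from point 1, but not the other way round), whereas density-connectivity is.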

DBSCAN: Basic Concepts

• Density = number of points within a specified radius (Eps)

• A core point (a core object) has at least MinPts points within Eps

• A border point has fewer than MinPts points within Eps, but is in the neighborhood of a core point

• A noise point is any point that is neither a core point nor a border point

• A density-based cluster is a set of density-connected objects that is maximal with respect to density-reachability
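
As a rough illustration of the three point types, the sketch below labels every point as core, border, or noise. The function name `classify_points` and the toy data are hypothetical, and the point itself is assumed to count towards MinPts.

```python
from math import dist

def classify_points(points, eps, min_pts):
    """Label each point as 'core', 'border' or 'noise'."""
    # Eps-neighborhood of every point (assumption: it includes the point itself).
    neighbors = [[j for j, q in enumerate(points) if dist(p, q) <= eps]
                 for p in points]
    core = {i for i, nb in enumerate(neighbors) if len(nb) >= min_pts}
    labels = []
    for i, nb in enumerate(neighbors):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nb):
            labels.append("border")   # not dense itself, but next to a core point
        else:
            labels.append("noise")
    return labels

# Hypothetical toy data: four points on a line plus one isolated point.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
labels = classify_points(points, eps=1.0, min_pts=3)
print(labels)  # ['border', 'core', 'core', 'border', 'noise']
```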

Density-Reachable & Density-Connected

[Figure: diagrams of directly density-reachable, density-reachable, and density-connected points (labels p, q, p1, o), drawn with MinPts = 5 and Eps = 1 cm]

DBSCAN: Core, Border, and Noise Points

DBSCAN: Density-Based Spatial Clustering of Applications with Noise

• Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
• Discovers clusters of arbitrary shape in spatial databases with noise
• The algorithm:
§ Arbitrarily select a point p
§ Retrieve all points density-reachable from p given Eps and MinPts
§ If p is a core point, a cluster is formed
§ If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database
§ Continue the process until all of the points have been processed
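
The steps above can be sketched in plain Python as follows. This is a minimal, quadratic-time illustration under stated assumptions, not the original DBSCAN implementation or any library's API: the names `dbscan`, `region_query`, and `NOISE`, the brute-force neighbor search, and the toy data are all hypothetical choices.

```python
from math import dist

NOISE = -1

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE for noise."""
    def region_query(i):
        # Brute-force Eps-neighborhood; the point itself is included.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    labels = [None] * len(points)        # None = not visited yet
    cluster_id = -1
    for i in range(len(points)):         # "arbitrarily select a point p"
        if labels[i] is not None:
            continue
        neighbors = region_query(i)
        if len(neighbors) < min_pts:
            labels[i] = NOISE            # may later turn out to be a border point
            continue
        cluster_id += 1                  # p is a core point: a cluster is formed
        labels[i] = cluster_id
        seeds = list(neighbors)
        while seeds:                     # collect everything density-reachable from p
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id   # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(j)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)  # j is core too: keep expanding
    return labels

# Hypothetical toy data: two small squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

A point first marked as noise is relabeled if a later cluster reaches it, which is how border points end up assigned to a cluster.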

Core, Border and Noise Points

[Figure: two panels, "Original Points" and "Point types: core, border and noise", with Eps = 10, MinPts = 4]

When DBSCAN Works Well

• Resistant to noise
• Can handle clusters of different shapes and sizes

[Figure: two panels, "Original Points" and "Clusters"]

When DBSCAN May Fail

• Varying densities
• High-dimensional data

[Figure: "Original Points" and two DBSCAN results, one with MinPts=4, Eps=9.75 and one with MinPts=4, Eps=9.92]
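
The varying-densities problem can be reproduced on a tiny one-dimensional example. The sketch below is an illustration under assumptions, not the data behind the figure: the helper `dbscan_1d`, the values, and the Eps settings are all made up. Two dense groups sit close together and a third, sparser group lies far away. A small Eps separates the dense groups but turns the sparse one into noise, while a larger Eps recovers the sparse group but merges the two dense ones.

```python
def dbscan_1d(values, eps, min_pts):
    """Label 1-D values with cluster ids 0, 1, ...; -1 marks noise."""
    n = len(values)
    neighbors = [[j for j in range(n) if abs(values[i] - values[j]) <= eps]
                 for i in range(n)]
    labels = [-1] * n
    cluster_id = 0
    for i in range(n):
        if labels[i] != -1 or len(neighbors[i]) < min_pts:
            continue                      # already clustered, or not a core point
        labels[i] = cluster_id
        stack = [i]                       # holds core points still to be expanded
        while stack:
            for j in neighbors[stack.pop()]:
                if labels[j] == -1:
                    labels[j] = cluster_id
                    if len(neighbors[j]) >= min_pts:
                        stack.append(j)   # only core points expand the cluster
        cluster_id += 1
    return labels

def summarize(labels):
    """Return (number of clusters, number of noise points)."""
    return len(set(labels) - {-1}), labels.count(-1)

# Hypothetical data: dense groups A and B (spacing 0.1) and sparse group C (spacing 1).
group_a = [k / 10 for k in range(5)]         # 0.0 ... 0.4
group_b = [0.9 + k / 10 for k in range(5)]   # 0.9 ... 1.3
group_c = [10.0, 11.0, 12.0, 13.0]
values = group_a + group_b + group_c

small = dbscan_1d(values, eps=0.15, min_pts=3)
large = dbscan_1d(values, eps=1.05, min_pts=3)
print(summarize(small))  # (2, 4): A and B found, C is all noise
print(summarize(large))  # (2, 0): C found, but A and B are merged
```

In this toy setting no single Eps yields the three groups with MinPts = 3, since any Eps large enough to connect the sparse group also bridges the gap between the two dense ones.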

Run the Python notebook on density-based clustering

Examples using R

Density-Based Clustering in R

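library(fpc)
set.seed(665544)
n <- 600
# 10 random cluster centers, recycled over the n points, plus Gaussian noise (sd = 0.2)
x <- cbind(runif(10, 0, 10) + rnorm(n, sd = 0.2),
           runif(10, 0, 10) + rnorm(n, sd = 0.2))
par(bg = "grey40")
# eps = 0.2; MinPts is not given, so the default of fpc's dbscan is used
# showplot = 1 redraws the plot while the clusters are being built
ds <- dbscan(x, 0.2, showplot = 1)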

Density-Based Clustering in R

library(fpc)
set.seed(665544)
x <- seq(0, 6.28, 0.1)   # 63 values, roughly one period of the sine
y <- sin(x)
# x and y are recycled 10 times, giving 630 noisy points around the sine curve
xd <- x + rnorm(630, sd = 0.2)
yd <- y + rnorm(630, sd = 0.2)
plot(xd, yd)
par(bg = "grey40")
d <- cbind(xd, yd)
# this works nicely since epsilon is the same size as
# the standard deviation (0.2) used to generate the data
ds <- dbscan(d, 0.2, showplot = 1)
# this does not work so nicely
ds <- dbscan(d, 0.1, showplot = 1)

Clustering Comparisons on Sin Data

[Figure: two panels, "hierarchical clustering" and "kmeans clustering"]

Clustering Comparisons on Sin Data (k-means with 10 clusters)

Software Packages

http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/Density-Based_Clustering