
Chapter 15

Density-based Clustering

Clustering methods like K-means or Expectation-Maximization are suitable for finding ellipsoid-shaped clusters, or at best convex clusters. However, for non-convex clusters, such as those shown in Figure 15.1, these methods have trouble finding the true clusters, since two points from different clusters may be closer than two points in the same cluster. The density-based methods we consider in this chapter are able to mine such non-convex or shape-based clusters.

Figure 15.1: Density-based Dataset


15.1 The DBSCAN Algorithm

Density-based clustering uses the local density of points to determine the clusters, rather than using only the distance between points. Define a ball of radius ε around a point x, called the ε-neighborhood of x, as follows

N_\epsilon(x) = B_d(x, \epsilon) = \{ y \mid \delta(x, y) \le \epsilon \}

Here δ(x, y) represents the distance between points x and y, which is usually assumed to be the Euclidean distance, i.e., δ(x, y) = ‖x − y‖_2. However, other distance metrics can also be used.

For any point x ∈ D, we say that x is a core point if there are at least minpts points in its neighborhood. In other words, x is a core point if |N_ε(x)| ≥ minpts, where minpts is a user-defined local density or frequency threshold. A border point is defined as a point that does not meet the minpts threshold, i.e., it has |N_ε(x)| < minpts, but it belongs to the neighborhood of some core point z, i.e., x ∈ N_ε(z). Finally, if a point is neither a core nor a border point, then it is called a noise point or an outlier.

Figure 15.2: Neighborhood of a Point

Figure 15.3: Core, Border and Noise Points


Example 15.1: Figure 15.2 shows the ε-neighborhood of the point x, using the Euclidean distance metric. Figure 15.3 shows the three different types of points, using minpts = 6. Here x is a core point since |N_ε(x)| = 6. y is a border point, since |N_ε(y)| = 3, but it is reachable from x. Finally, z is a noise point.

We say that a point x is directly density reachable from another point y if x ∈ N_ε(y) and y is a core point. We say that x is density reachable from y if there exists a chain of points, x = x_0, x_1, ..., x_l = y, such that x_{i−1} is directly density reachable from x_i. In other words, there is a set of core points leading from y to x. Note that density reachability is an asymmetric or directed relationship. Finally, we define any two points x and y to be density connected if there exists a core point z such that both x and y are density reachable from z. We can now define a density-based cluster as a maximal set of density connected points.

Algorithm 15.1: Density-based Clustering Algorithm

DBSCAN (D, ε, minpts):
 1  Cores ← ∅
    // Find the core points
 2  foreach x ∈ D do
 3      if |N_ε(x)| ≥ minpts then
 4          Cores ← Cores ∪ {x}
 5  k ← 0
 6  foreach x ∈ Cores, such that x is unmarked do
 7      k ← k + 1
 9      DENSITYCONNECTED (x, k)
10  C ← {C_i}_{i=1}^{k}, where C_i = {x ∈ D : x has cluster id i}
11  Noise ← {x ∈ D : x is unmarked}
12  Border ← D \ (C ∪ Noise)
13  return C, Border, Noise

DENSITYCONNECTED (x, k):
14  Mark x with current cluster id k
15  foreach y ∈ N_ε(x) do
16      Mark y with current cluster id k
17      if y ∈ Cores and y is unmarked then
18          DENSITYCONNECTED (y, k)

Algorithm 15.1 shows the pseudo-code for the DBSCAN algorithm. Initially, all the points are unmarked. First, it computes the neighborhood N_ε(x) for each point x in the dataset D, and checks if it is a core point (lines 2-4). Next, starting from each unmarked core, the method recursively finds all other points density connected to it; all such points belong to the same cluster (line 9). Some border points may be reachable from core points in more than one cluster. Such points may either be arbitrarily assigned to one of the clusters, or they may be assigned to all of them. Those points that do not belong to any cluster are treated as outliers or noise.
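To make Algorithm 15.1 concrete, here is a minimal Python sketch of DBSCAN using a brute-force neighborhood computation; the function name dbscan, the array-based bookkeeping, and the iterative (rather than recursive) expansion are our own choices, not taken from the text.

import numpy as np

def dbscan(D, eps, minpts):
    # Precompute the eps-neighborhood N_eps(x) of every point (brute force, O(n^2)).
    n = len(D)
    dists = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=2)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    cores = {i for i in range(n) if len(neighbors[i]) >= minpts}

    cluster_id = np.full(n, -1)   # -1 means "unmarked"
    k = -1
    for i in cores:
        if cluster_id[i] != -1:
            continue
        k += 1
        stack = [i]               # iterative version of DENSITYCONNECTED
        cluster_id[i] = k
        while stack:
            x = stack.pop()
            for y in neighbors[x]:
                if cluster_id[y] == -1:
                    cluster_id[y] = k
                    if y in cores:
                        stack.append(y)
    noise = np.flatnonzero(cluster_id == -1)
    return cluster_id, cores, noise

A border point reachable from cores of two different clusters keeps the label it receives first, which is one of the arbitrary assignments mentioned above. Replacing the brute-force distance matrix with a spatial index gives the faster neighborhood computation discussed under Computational Complexity below.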

The density-based clustering algorithm can also be described as a search for the connected components in a graph whose vertices correspond to the core points in the dataset, with an (undirected) edge between two vertices (core points) if the distance between them is at most ε (i.e., each of them is in the neighborhood of the other point). The connected components of this graph correspond to the core points of each cluster. Next, each core point incorporates into its cluster any border points in its neighborhood.

One limitation of DBSCAN is that it is sensitive to the choice of ε, in particular if clusters have different densities. If ε is too small, sparser clusters will be categorized as noise. If ε is too large, denser clusters may be merged together. In other words, if there are clusters with different local densities, then a single ε value may not suffice.

Figure 15.4: Density-based Clusters

Example 15.2: Figure 15.4 shows the clusters discovered by DBSCAN on the density-based dataset in Figure 15.1. For the parameter values ε = 15 and minpts = 10, found after parameter tuning, DBSCAN yields a near-perfect clustering, with k = 9 clusters. Clusters are shown using different symbols and shading, and noise points are shown as plus symbols.

Figure 15.5: DBSCAN Clustering: Iris 2D Dataset. (a) ε = 0.2, minpts = 5; (b) ε = 0.325, minpts = 5. Axes: X1 (sepal length) and X2 (sepal width); noise points shown as plus symbols.

Example 15.3: Figure 15.5 shows the clusterings obtained via DBSCAN on the Iris two-dimensional dataset, for two different parameter settings. Figure 15.5a shows the clusters obtained with ε = 0.2 and minpts = 5. The three clusters are plotted using different shaped points, namely circles, squares, and triangles. Shaded points are core points, whereas the border points for each cluster are shown unshaded (white). Noise points are shown as plus symbols. Figure 15.5b shows the clusters obtained with a larger value of the radius (ε = 0.325), with minpts = 5. Two clusters are found, corresponding to the two dense regions of points.

Computational Complexity   The main effort in density-based clustering is to compute the ε-neighborhood for each point. If the dimensionality is not too high, this can be done efficiently using a spatial index structure in O(n log n) time. When the dimensionality is high, it takes O(n^2) time to compute the neighborhood for each point. Once N_ε(x) has been computed, the algorithm needs only a single pass over all the points to find the maximal density connected clusters. Thus, the overall complexity is O(n^2) in the worst case.


15.2 Kernel Density Estimation

The density-based clustering described above is a special case of kernel density estimation. The main goal of density estimation is to find the dense regions of points, which is essentially the same as clustering. Kernel density estimation is a non-parametric technique that does not assume any fixed probability model of the clusters, as in the case of K-means or model-based clustering via the EM algorithm. Instead, kernel density estimation tries to infer the underlying probability density at each point in the dataset.

15.2.1 Univariate Density Estimation

Assume that X is a continuous random variable, and let x_1, x_2, ..., x_n be a random sample drawn from the underlying probability density function f(x), which is assumed to be unknown. Let F(x) denote the cumulative distribution function. For a continuous distribution it is given as

F(x) = \int_{-\infty}^{x} f(x)\, dx

subject to the condition that \int_{-\infty}^{\infty} f(x)\, dx = 1.

Notice how the probability density f(x) is the derivative of the cumulative distribution F(x), whereas F(x) is the integral of f(x). Let F̂(x) be the estimate of F(x) at x. The cumulative distribution can be directly estimated from the data by counting how many points are less than or equal to x

\hat{F}(x) = \frac{|\{x_i \mid x_i \le x\}|}{n}

Let f̂(x) denote the estimate of f(x) at x. We can estimate f̂(x) by considering a window of width h centered at x, and then computing the derivative at x

\hat{f}(x) = \frac{\hat{F}\left(x + \frac{h}{2}\right) - \hat{F}\left(x - \frac{h}{2}\right)}{h}    (15.1)

Here h plays the role of “influence”. That is, a large h estimates the probability density over a large window by considering many points, which has the effect of smoothing the estimate. On the other hand, if h is small, then only the points in close proximity to x are considered. In general we want a small value of h, but not too small, since in that case no points will fall in the window and we will not be able to get an accurate estimate of the probability density.
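As a quick illustration of (15.1), the following Python snippet estimates the density at a point directly from the empirical CDF; the function names and the synthetic sample are ours.

import numpy as np

def ecdf(xs, x):
    # Empirical CDF: fraction of sample points less than or equal to x.
    return np.sum(xs <= x) / len(xs)

def density_from_cdf(xs, x, h):
    # Equation (15.1): difference of the empirical CDF across a window of width h.
    return (ecdf(xs, x + h / 2) - ecdf(xs, x - h / 2)) / h

xs = np.random.default_rng(0).normal(5.8, 0.8, size=150)   # synthetic stand-in for an attribute
print(density_from_cdf(xs, 5.8, h=0.5))

A larger h averages over more points and smooths the estimate; a very small h may leave the window empty, exactly as noted above.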

The probability density f̂(x) can be estimated using an alternate formulation. For any closed interval [a, b], the cumulative probability is given as

P_{ab} = \int_{a}^{b} f(x)\, dx


Given the n points x_1, x_2, ..., x_n, P_{ab} gives the probability that a point belongs to the closed interval [a, b]. Assuming that the points are independently and identically distributed according to f(x), the probability that exactly m points belong to the closed interval is given by the binomial distribution

f(m \mid n, P_{ab}) = \binom{n}{m} (P_{ab})^{m} (1 - P_{ab})^{n-m}

The expected number of points that fall in the interval [a, b] is then given as

E[m] = n P_{ab}

Since the binomial distribution has a very sharp peak at the expected value, letting E[m] = k, we can estimate P_{ab} as

\hat{P}_{ab} = \frac{k}{n}    (15.2)

Now, assuming that the closed interval [a, b] centered at x is small, i.e., assuming that the width of the interval h = b − a is small, then another way to estimate P_{ab} is as follows

\hat{P}_{ab} = \int_{a}^{b} f(x)\, dx \approx \hat{f}(x)\, h    (15.3)

Combining (15.2) and (15.3), we have

\hat{f}(x)\, h = \frac{k}{n} \;\Longrightarrow\; \hat{f}(x) = \frac{k/n}{h} = \frac{k}{nh}    (15.4)

Combining the above with (15.1) we see that

\frac{k}{n} = \hat{F}\left(x + \frac{h}{2}\right) - \hat{F}\left(x - \frac{h}{2}\right)

which equates the fraction of points in the closed interval with the difference between the cumulative distribution at the two ends of that interval. In other words, (15.1) and (15.4) are equivalent, though the latter is more intuitive, since it gives the density estimate as the ratio of the fraction of the points in the region to the volume of the region.

Discrete Kernel   The density estimate f̂(x) from (15.4) can also be re-written in terms of a kernel function as shown below

\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)    (15.5)


where the discrete kernel function computes the number of points k in the window of width h, and is defined as

K(z) = \begin{cases} 1 & \text{if } |z| \le \frac{1}{2} \\ 0 & \text{otherwise} \end{cases}    (15.6)

We can see that if |z| = \left|\frac{x - x_i}{h}\right| \le \frac{1}{2}, then the point x_i is within the window of width h centered at x, since

\left|\frac{x - x_i}{h}\right| \le \frac{1}{2} \;\Longrightarrow\; -\frac{1}{2} \le \frac{x - x_i}{h} \le \frac{1}{2} \;\Longrightarrow\; -\frac{h}{2} \le x - x_i \le \frac{h}{2} \;\Longrightarrow\; x - \frac{h}{2} \le x_i \le x + \frac{h}{2}

Example 15.4: Figure 15.6 shows the kernel density estimates using the discrete kernel for different values of the influence parameter h, for the one-dimensional Iris dataset comprising the sepal length attribute. The x-axis also plots the n = 150 data points. Since several points have the same value, they are shown stacked, where the stack height corresponds to the value frequency.

When h is small, as shown in Figure 15.6a, the density function has many local maxima or modes. However, as we increase h from 0.25 to 2, the number of modes decreases, until h becomes large enough to yield a unimodal distribution (as shown in Figure 15.6d). We can observe that the discrete kernel yields a non-smooth (or jagged) density function.
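The estimate (15.5) with the discrete kernel (15.6) takes only a few lines of Python; this sketch (names are ours) evaluates f̂ over a grid of points, which is how curves such as those in Figure 15.6 can be drawn. The data file name is a placeholder.

import numpy as np

def discrete_kernel(z):
    # Equation (15.6): 1 inside the unit-width window, 0 outside.
    return (np.abs(z) <= 0.5).astype(float)

def kde_discrete(xs, x, h):
    # Equation (15.5): count the points in the width-h window around x, scaled by 1/(n h).
    return discrete_kernel((x - xs) / h).sum() / (len(xs) * h)

xs = np.loadtxt("sepal_length.txt")            # hypothetical file with the 150 sepal-length values
grid = np.linspace(4, 8, 200)
fhat = np.array([kde_discrete(xs, x, h=0.5) for x in grid])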

Gaussian Kernel   The width h is a parameter that denotes the spread or smoothness of the density estimate. If the spread is too large we get a more averaged value. If it is too small we do not have enough points in the window. Furthermore, the kernel function in (15.6) has an abrupt influence. For points within the window (|z| ≤ 1/2) there is a net contribution of 1/(hn) to the probability estimate f̂(x). On the other hand, points outside the window (|z| > 1/2) contribute 0.

Instead of the discrete kernel above, we can define a smoother transition of influence via a Gaussian kernel

K(z) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{z^2}{2}\right)    (15.7)


Figure 15.6: Kernel Density Estimation: Discrete Kernel (varying h). Panels: (a) h = 0.25, (b) h = 0.5, (c) h = 1.0, (d) h = 2.0; x-axis shows the attribute value x, y-axis the estimate f(x).

Thus we have

K\left(\frac{x - x_i}{h}\right) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(x - x_i)^2}{2h^2}\right)

Here x (the center of the window) acts as the mean of the distribution, and h acts as the standard deviation of the distribution.

Example 15.5: Figure 15.7 shows the univariate density function for the one-dimensional Iris dataset, using the Gaussian kernel, for increasing values of the spread parameter h. The data points are shown stacked along the x-axis, with the heights corresponding to the value frequencies.

As h varies from 0.1 to 0.5, we can clearly see the smoothing effect of increasing h on the density function. For instance, for h = 0.1 there are many local maxima, whereas for h = 0.5 there is only one density peak. Compared to the discrete kernel case shown in Figure 15.6, we can clearly see that the Gaussian kernel yields much smoother estimates, without discontinuities.
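Only the kernel function changes when moving from the discrete to the Gaussian estimate; the following minimal sketch reuses the structure of the previous snippet (again with our own naming).

import numpy as np

def gaussian_kernel(z):
    # Equation (15.7).
    return np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)

def kde_gaussian(xs, x, h):
    # Every point contributes a smoothly decaying weight, so the estimate has no jumps.
    return gaussian_kernel((x - xs) / h).sum() / (len(xs) * h)

Because every point now contributes, the resulting curve has no discontinuities, which is exactly the contrast between Figures 15.6 and 15.7.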


Figure 15.7: Kernel Density Estimation: Gaussian Kernel (varying h). Panels: (a) h = 0.1, (b) h = 0.15, (c) h = 0.25, (d) h = 0.5; x-axis shows the attribute value x, y-axis the estimate f(x).

15.2.2 Multivariate Density Estimation

To estimate the probability density at a d-dimensional point x = (x_1, x_2, ..., x_d), we define the d-dimensional “window” as a hypercube in d dimensions with edge length h, i.e., a window centered at x with width h. The volume of such a d-dimensional hypercube is given as

\mathrm{vol}(H_d(h)) = h^d

The density is then estimated as

\hat{f}(x) = \frac{1}{n h^d} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)    (15.8)

For any d-dimensional vector z = (z_1, z_2, ..., z_d), the discrete kernel function in d dimensions is given as

K(z) = \begin{cases} 1 & \text{if } |z_j| \le \frac{1}{2}, \text{ for all dimensions } j = 1, \dots, d \\ 0 & \text{otherwise} \end{cases}    (15.9)


For z = \frac{x - x_i}{h}, we see that the kernel computes the number of points within the hypercube of width h centered at x, since K\left(\frac{x - x_i}{h}\right) = 1 if and only if \left|\frac{x_j - x_{ij}}{h}\right| \le \frac{1}{2} for all dimensions j. Further, the d-dimensional Gaussian kernel (with \Sigma = I_d) is given as

K(z) = \frac{1}{(2\pi)^{d/2}} \exp\left(-\frac{z^T z}{2}\right)    (15.10)

Putting z = \frac{x - x_i}{h}, we have

K\left(\frac{x - x_i}{h}\right) = \frac{1}{(2\pi)^{d/2}} \exp\left(-\frac{(x - x_i)^T (x - x_i)}{2h^2}\right)

Figure 15.8: Density Estimation: 2D Iris Dataset (varying h). Panels: (a) h = 0.1, (b) h = 0.2, (c) h = 0.35, (d) h = 0.6.

Example 15.6: Figure 15.8 shows the probability density function for the 2D Iris dataset comprising the sepal length and sepal width attributes, using the Gaussian kernel. As expected, for small values of h the density function has several local maxima, whereas for larger values the number of maxima reduces, and ultimately for a large enough value we obtain a unimodal distribution.
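A vectorized Python sketch of the multivariate estimate (15.8) with the Gaussian kernel (15.10); the function name and argument layout are ours.

import numpy as np

def kde_gaussian_md(X, x, h):
    # X is the (n, d) data matrix, x the d-dimensional query point, h the bandwidth.
    n, d = X.shape
    z = (x - X) / h                                       # scaled differences, shape (n, d)
    k = np.exp(-0.5 * np.sum(z * z, axis=1)) / (2 * np.pi) ** (d / 2)
    return k.sum() / (n * h ** d)

Evaluating this on a grid of (sepal length, sepal width) pairs produces density surfaces of the kind shown in Figure 15.8.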


Figure 15.9: Density Estimation: Density-based Dataset

Example 15.7: Figure 15.9 shows the kernel density estimate for the density-based dataset in Figure 15.1, using a Gaussian kernel with h = 20. One can clearly discern that the density peaks closely correspond to regions with a higher density of points.

15.2.3 Nearest Neighbor Density Estimation

In the density estimation formulation above we implicitly fixed the volume of the hypercube by fixing the edge length h. The kernel function was used to find the number of points that lie inside the fixed-volume region.

An alternative approach to density estimation is to fix k, the number of points required to estimate the density, and allow the volume of the enclosing hypercube to vary to accommodate those k points in (15.4). This approach is called the k-nearest neighbor (KNN) approach to density estimation. Like kernel density estimation, KNN density estimation is also a non-parametric approach.

Given k, the number of neighbors to consider, we estimate the density at x as follows

\hat{f}(x) = \frac{k}{n\, \mathrm{vol}_d(h_x)} = \frac{k}{n (h_x)^d}

where h_x is the distance from x to its k-th nearest neighbor. In other words, the width of the hypercube is now a variable, which depends on x and the chosen value k. As before, we assume that the d-dimensional hypercube is centered at x.
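A small Python sketch of KNN density estimation, using scipy's k-d tree to obtain the distance to the k-th nearest neighbor; the function name and the handling of the k = 1 case are ours.

import numpy as np
from scipy.spatial import cKDTree

def knn_density(X, x, k):
    n, d = X.shape
    tree = cKDTree(X)
    dists, _ = tree.query(x, k=k)                # distances to the k nearest points
    h_x = dists[-1] if k > 1 else float(dists)   # distance to the k-th nearest neighbor
    return k / (n * h_x ** d)

In sparse regions h_x grows and the estimate shrinks, while in dense regions the adaptive window becomes narrow, so no single global bandwidth has to be chosen. Note that if x is itself a data point, it counts as one of its own k neighbors.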

15.3 Density-based Clustering: DENCLUE

Having laid the foundations of kernel density estimation, we can develop a general formulation of density-based clustering. The basic approach is to find the peaks in the density landscape, via gradient-based optimization, and to find those regions with density above a given threshold.

Density Attractors and Gradient   A point x* is called a density attractor if it is a local maximum of the probability density function f. A density attractor can be found via a gradient ascent approach starting at some point x. The idea is to compute the density gradient, the direction of the largest increase in the density, and to move in the direction of the gradient in small steps, until we reach a local maximum.

The gradient at a point x can be computed as the multivariate derivative of the probability density estimate in (15.8), i.e.,

\nabla \hat{f}(x) = \frac{\partial}{\partial x} \hat{f}(x) = \frac{1}{n h^d} \sum_{i=1}^{n} \frac{\partial}{\partial x} K\left(\frac{x - x_i}{h}\right)    (15.11)

For the Gaussian kernel (15.10), we have

\frac{\partial}{\partial x} K(z) = \left(\frac{1}{(2\pi)^{d/2}} \exp\left(-\frac{z^T z}{2}\right)\right) \cdot (-z) \cdot \frac{\partial z}{\partial x} = K(z) \cdot (-z) \cdot \frac{\partial z}{\partial x}

Setting z = \frac{x - x_i}{h} above, we get

\frac{\partial}{\partial x} K\left(\frac{x - x_i}{h}\right) = K\left(\frac{x - x_i}{h}\right) \cdot \left(\frac{x_i - x}{h}\right) \cdot \left(\frac{1}{h}\right)

Substituting the above in (15.11), the gradient at a point x is given as

\nabla \hat{f}(x) = \frac{1}{n h^{d+2}} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right) \cdot (x_i - x)    (15.12)

The above equation can be thought of as having two parts: a vector (x_i − x) and a scalar influence value K((x − x_i)/h). Thus, the gradient is the net vector obtained as the weighted sum of the contributions of all other points in terms of the kernel function K. For each point x_i, we first compute the direction away from x, i.e., the vector (x_i − x). Next, we scale the magnitude of this vector using the Gaussian kernel value K((x − x_i)/h) as the influence. Finally, the vector ∇f̂(x) is the net influence at x, as illustrated in Figure 15.10.

Figure 15.10: The Gradient Vector

We say that a point x is density attracted to another point x* if a gradient ascent process started at x converges to x*. That is, there exists a sequence of points x = x_0, x_1, ..., x_m, such that ‖x_m − x*‖ ≤ ε, and each intermediate point is obtained after a small move in the direction of the gradient vector

x_{t+1} = x_t + \delta \cdot \nabla \hat{f}(x_t)    (15.13)

where δ > 0 is the step size.
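The gradient (15.12) and the ascent step (15.13) map directly onto code; this Python sketch (step size, tolerance, and names are our assumptions) follows the gradient until the iterates stop moving.

import numpy as np

def density_gradient(X, x, h):
    # Equation (15.12) for the Gaussian kernel.
    n, d = X.shape
    z = (x - X) / h
    k = np.exp(-0.5 * np.sum(z * z, axis=1)) / (2 * np.pi) ** (d / 2)
    return (k[:, None] * (X - x)).sum(axis=0) / (n * h ** (d + 2))

def find_attractor_ascent(X, x, h, delta=0.5, tol=1e-4, max_iter=1000):
    # Equation (15.13): repeat small moves along the gradient until convergence.
    for _ in range(max_iter):
        x_next = x + delta * density_gradient(X, x, h)
        if np.linalg.norm(x_next - x) <= tol:
            return x_next
        x = x_next
    return x

As the DENCLUE discussion below points out, this plain gradient ascent can converge slowly; the direct update rule (15.15) is usually preferred.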

Center-defined Cluster   A cluster C ⊆ D is called a center-defined cluster if all the points x ∈ C are density attracted to a unique density attractor x*, such that f̂(x*) ≥ ξ, where ξ is a user-defined minimum density threshold. In other words,

\hat{f}(x^*) = \frac{1}{n h^d} \sum_{i=1}^{n} K\left(\frac{x^* - x_i}{h}\right) \ge \xi    (15.14)

Density-based Cluster   An arbitrary-shaped cluster C ⊆ D is called a density-based cluster if there exists a set of density attractors x*_1, x*_2, ..., x*_m, such that

1. Each point x ∈ C is attracted to some attractor x*_i.

2. Each density attractor has density above ξ. That is, f̂(x*_i) ≥ ξ.

3. Any two density attractors x*_i and x*_j are density reachable, i.e., there exists a path from x*_i to x*_j, such that for all points y on the path, f̂(y) ≥ ξ.

Algorithm 15.2: DENCLUE Algorithm

DENCLUE (D, h, ξ, ε):
 1  A ← ∅
    // find density attractors
 2  foreach x ∈ D do
 4      x* ← FINDATTRACTOR (x, h, ε)
 5      if f̂(x*) ≥ ξ then
 7          A ← A ∪ {x*}
 9          N(x*) ← N(x*) ∪ {x}
11  M ← {A ⊆ A : ∀ x*_i, x*_j ∈ A, x*_i and x*_j are density reachable}
12  C ← ∅
    // density-based clusters
14  foreach A ∈ M do
15      C_A ← ⋃_{x* ∈ A} N(x*)
17      C ← C ∪ {C_A}
18  return C

FINDATTRACTOR (x, h, ε):
20  t ← 0
21  x_0 ← x
22  repeat
23      if Gradient Ascent then
25          x_{t+1} ← x_t + δ · ∇f̂(x_t)
26      else // direct update rule (15.15)
28          x_{t+1} ← ( Σ_{i=1}^{n} K((x_t − x_i)/h) · x_i ) / ( Σ_{i=1}^{n} K((x_t − x_i)/h) )
29      t ← t + 1
30  until ‖x_t − x_{t−1}‖ ≤ ε
32  return x_t

The pseudo-code for DENCLUE is shown in Algorithm 15.2. The first step is to compute the density attractor x* for each point x in the dataset by invoking the FindAttractor routine (line 4). If the density at x* is above the minimum density threshold ξ, the attractor is added to the set of attractors A (line 7). The method also maintains the set of all points N(x*) attracted to each attractor x* (line 9). In the second step, DENCLUE finds all the maximal subsets of attractors A ⊆ A, such that any pair of attractors in A is density reachable from each other (line 11); further, no more attractors can be added to A without destroying this property. The set of all these maximal subsets M defines the density-based clusters. Each final cluster comprises all the points x ∈ D that are density attracted to some attractor in A, obtained as the union of the N(x*) sets for each x* ∈ A (lines 14-17).

The FindAttractor method (lines 20-32) implements a hill-climbing process. The standard approach is to use gradient ascent (line 25). Starting from a given point x, we iteratively update it using the gradient-ascent update rule (15.13). However, this approach can be slow to converge. Instead, one can directly optimize the move direction by setting the gradient (15.12) to zero

\nabla \hat{f}(x) = 0

\Longrightarrow \frac{1}{n h^{d+2}} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right) \cdot (x_i - x) = 0

\Longrightarrow x \cdot \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right) = \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right) x_i

\Longrightarrow x = \frac{\sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right) x_i}{\sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)}

The point x appears on both the left- and right-hand sides above; however, the expression can be used to obtain the following iterative update rule

x_{t+1} = \frac{\sum_{i=1}^{n} K\left(\frac{x_t - x_i}{h}\right) x_i}{\sum_{i=1}^{n} K\left(\frac{x_t - x_i}{h}\right)}    (15.15)

where t denotes the current iteration. This direct update rule is essentially a weighted average of the influence (computed via the kernel function K) of each point x_i ∈ D on the current point x_t. The direct update rule (line 28) results in much faster convergence of the hill-climbing process, and is the preferred approach.
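The direct update rule (15.15) is a kernel-weighted mean of the data, so FindAttractor becomes very short; in this Python sketch (our own naming and tolerance) the Gaussian normalization constants are omitted because they cancel in the ratio.

import numpy as np

def kernel_weights(X, x, h):
    # Gaussian kernel values K((x - x_i)/h), up to a constant factor that cancels below.
    z = (x - X) / h
    return np.exp(-0.5 * np.sum(z * z, axis=1))

def find_attractor(X, x, h, tol=1e-5, max_iter=100):
    # Direct update rule (15.15): move x to the kernel-weighted mean of all points.
    for _ in range(max_iter):
        w = kernel_weights(X, x, h)
        x_next = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(x_next - x) <= tol:
            return x_next
        x = x_next
    return x

Running find_attractor from every point of D, keeping attractors whose density exceeds ξ, and grouping points whose attractors are density reachable from one another reproduces the clustering steps of Algorithm 15.2.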

For faster influence computation, it is possible to compute the kernel values for only the nearest neighbors of each point x_t. That is, we can index the points in the dataset D using a spatial index structure (e.g., a k-d tree), so that we can quickly compute all the nearest neighbors of x_t within some radius r. For the Gaussian kernel, we can set r = h · z, where h is the influence parameter that plays the role of the standard deviation, and z specifies the number of standard deviations (e.g., we can set z = 3 for a two-dimensional dataset, without losing any accuracy in the density evaluation). Let B(x_t, r) denote the set of all points in D that lie within the ball of radius r centered at x_t. The nearest-neighbors update rule can then be expressed as

x_{t+1} = \frac{\sum_{x_i \in B(x_t, r)} K\left(\frac{x_t - x_i}{h}\right) x_i}{\sum_{x_i \in B(x_t, r)} K\left(\frac{x_t - x_i}{h}\right)}    (15.16)


When the data dimensionality is not high, this can result in a significant speedup. However, the effectiveness deteriorates rapidly with an increasing number of dimensions. This is due to two effects. The first is that finding B(x_t, r) reduces to a linear scan of the data, taking O(n) time for each query. The second is that, due to the curse of dimensionality (see Chapter 6), nearly all points appear to be equally close to x_t, thereby nullifying any benefits of computing the nearest neighbors.

Figure 15.11: DENCLUE: Iris 2D Dataset

Example 15.8: Figure 15.11 shows the DENCLUE clustering for the 2D Iris dataset, with h = 0.2 and ξ = 0.08, using a Gaussian kernel. The clustering is obtained by thresholding the probability density function in Figure 15.8b at ξ = 0.08. The two peaks correspond to the two final clusters.

Example 15.9: Figure 15.12 shows the clusters obtained by DENCLUE on the density-based dataset from Figure 15.1. Using the parameters h = 10 and ξ = 9.5 × 10⁻⁵, with a Gaussian kernel, we obtain k = 8 clusters. The figure is obtained by slicing the density function at the density value ξ, so that only the regions above that value are plotted. All the clusters are correctly identified, with the exception of the two semi-circular clusters on the lower right that appear merged into one cluster.

Figure 15.12: DENCLUE: Density-based Dataset

DENCLUE: Special Cases   It can be shown that DBSCAN is a special case of the general kernel density estimation based clustering approach, DENCLUE. If we let h = ε and ξ = minpts, then using a discrete kernel DENCLUE yields exactly the same clusters as DBSCAN. Each density attractor corresponds to a core point, and the set of connected core points defines the attractors of a density-based cluster. It can also be shown that K-means is a special case of density-based clustering for appropriate values of h and ξ, with the density attractors corresponding to the cluster centroids. Further, it is worth noting that the density-based approach can produce hierarchical clusters, by varying the ξ threshold. For example, decreasing ξ can result in the merging of several clusters found at higher threshold values. In addition, new clusters may emerge if the peak density satisfies the lower ξ value.

Computational Complexity   The computational complexity of DENCLUE is dominated by the cost of the hill-climbing process. For each point x ∈ D, finding the density attractor takes O(nt) time, where t is the maximum number of hill-climbing iterations. This is because each iteration takes O(n) time to compute the sum of the influence function over all the points x_i ∈ D. The total cost to compute the density attractors is therefore O(n²t). It is assumed that for reasonable values of h and ξ there are only a few density attractors, i.e., |A| = m ≪ n. The cost of finding the maximal reachable subsets of attractors is O(m²), and the final clusters can be obtained in O(n) time. When the dimensionality is small, the use of a spatial index can reduce the complexity of finding all the attractors to O((n log n) t).

15.4 Annotated References
