Transcript of Chapter 9 UNSUPERVISED LEARNING: Clustering Part 1 (Cios / Pedrycz / Swiniarski / Kurgan).

Page 1:

Chapter 9 UNSUPERVISED LEARNING:

Clustering Part 1

Cios / Pedrycz / Swiniarski / Kurgan

Page 2:

Outline

• What is Clustering?
  - Categories of clustering methods
  - Similarity measures
• Partition-Based Clustering
• Hierarchical Clustering
• Model-Based (mixture of probabilities) Clustering
• Scalable Clustering
• Grid-Based Clustering
• Cluster Validity
• Clustering of Large Datasets

Page 3:

What is Clustering?

How do we understand data?

We look for structure in data by revealing groups/clusters.

Clusters are about abstraction of data.

The structure is formed based on similarities between patterns (data).

Page 4:

How hard is clustering?

Consider N data points to be split into "c" groups (clusters). The number of possible splits (partitions) is given by

$$\frac{1}{c!}\sum_{i=1}^{c}(-1)^{c-i}\binom{c}{i}\, i^{N}$$

Even for a small problem of N = 100 and c = 5 we end up with about 10^67 partitions.
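As a rough check of this count, here is a small Python sketch of the formula above (exact integer arithmetic; the function name is ours):

```python
from math import comb, factorial

def num_partitions(N: int, c: int) -> int:
    """Number of ways to split N data points into c non-empty clusters
    (Stirling number of the second kind), per the formula above."""
    total = sum((-1) ** (c - i) * comb(c, i) * i ** N for i in range(1, c + 1))
    return total // factorial(c)          # the sum is always divisible by c!

print(num_partitions(100, 5))             # roughly 6.6e67, i.e., on the order of 10^67
```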

Page 5:

Clustering Challenges – from Bezdek

Page 6:

Clustering Challenges – from Bezdek

Page 7:

Categories of Clustering

We distinguish between three main categories (classes) of clustering methods

• Partition-based

• Hierarchical

• Model-based (mixture of probabilities)

Page 8:

Major Clustering Approaches (I)

• Partitioning approach:
  - Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
  - Typical methods: k-means, k-medoids, CLARANS
• Hierarchical approach:
  - Create a hierarchical decomposition of the set of data (or objects) using some criterion
  - Typical methods: Diana, Agnes, BIRCH, CHAMELEON
• Density-based approach:
  - Based on connectivity and density functions
  - Typical methods: DBSCAN, OPTICS, DenClue
• Grid-based approach:
  - Based on a multiple-level granularity structure
  - Typical methods: STING, WaveCluster, CLIQUE


Page 9:

Major Clustering Approaches (II)

• Model-based:
  - A model is hypothesized for each of the clusters, and the aim is to find the best fit of the data to that model
  - Typical methods: EM, SOM, COBWEB
• Frequent pattern-based:
  - Based on the analysis of frequent patterns
  - Typical methods: p-Cluster
• User-guided or constraint-based:
  - Clustering by considering user-specified or application-specific constraints
  - Typical methods: COD (obstacles), constrained clustering
• Link-based clustering:
  - Objects are often linked together in various ways
  - Massive links can be used to cluster objects: SimRank, LinkClus


Page 10:

Partition-Based Clustering

It is also referred to as objective-function clustering; it relies on the minimization of a certain objective function (performance index).

The result of minimization is a partition matrix and a collection of prototypes

The methods in this class are conceptually and algorithmically appealing

Page 11:

Partitioning Algorithms: Basic Concept

• Partitioning method: partition a database D of n objects into a set of k clusters such that the sum of squared distances

$$E = \sum_{i=1}^{k}\sum_{p \in C_i} (p - c_i)^2$$

is minimized (where c_i is the centroid or medoid of cluster C_i)

• Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
  - Global optimum: exhaustively enumerate all partitions
  - Heuristic methods: k-means and k-medoids algorithms
  - k-means (MacQueen'67, Lloyd'57/'82): each cluster is represented by the center of the cluster
  - k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw'87): each cluster is represented by one of the objects in the cluster

Page 12:

Model-Based Clustering

In MBC we assume a certain probabilistic model of data and estimate its parameters, such as mean, covariance matrix, etc.

The mixture density model is the common approach: we assume that the data result from a mixture of "c" sources of data, and each source is treated as a separate cluster.

Page 13:

Similarity measures

Similarity measures are the most fundamental components of every clustering method; they are used to quantify similarity (or dissimilarity) between data points.

The data with the highest similarity (like the lowest distance) are candidates to form a single cluster.

Page 14:

Examples of Distance Functions: Continuous Data (1)

Euclidean distance (p = 2 in Minkowski):
$$d(\mathbf{x},\mathbf{y}) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2}$$

Hamming (city-block) distance (p = 1 in Minkowski):
$$d(\mathbf{x},\mathbf{y}) = \sum_{i=1}^{n}|x_i - y_i|$$

Tchebyschev distance (p = ∞ in Minkowski):
$$d(\mathbf{x},\mathbf{y}) = \max_{i=1,2,\dots,n}|x_i - y_i|$$

Page 15:

Examples of Distance Functions: Continuous Data (2)

Minkowski distance:
$$d(\mathbf{x},\mathbf{y}) = \left(\sum_{i=1}^{n}|x_i - y_i|^{p}\right)^{1/p}, \qquad p > 0$$

(drawing from Wikipedia)

Page 16:

Examples of Distance Functions: Continuous Data (3)

Canberra distance (x and y positive):
$$d(\mathbf{x},\mathbf{y}) = \sum_{i=1}^{n}\frac{|x_i - y_i|}{x_i + y_i}$$

Example: v1(1,1,1), v2(1,1,0), v3(10,5,0), v4(1,2,3), v5(2,4,6)
d12 = 1, d13 = 2.485, d45 = 1

Angular separation:
$$d(\mathbf{x},\mathbf{y}) = \frac{\sum_{i=1}^{n} x_i y_i}{\left[\sum_{i=1}^{n} x_i^{2}\ \sum_{i=1}^{n} y_i^{2}\right]^{1/2}}$$

Example: v1(7,6,3,-1), v2(0,3,4,5)
d12 = 0.363

Page 17:

Examples of Distance Functions: Discrete Data (1)

Binary data x = [x1 x2 ... xn], y = [y1 y2 ... yn]

a - number of occurrences where both xi and yi are 1
d - number of occurrences where both xi and yi are 0
b, c - number of occurrences where xi and yi are different (1-0 and 0-1)

        y = 1   y = 0
x = 1     a       b
x = 0     c       d

Page 18:

Examples of Distances: Discrete Data (2)

Binary data x = [x1 x2 ... xn]

Matching index: (a + d) / (a + b + c + d)

Russell & Rao: a / (a + b + c + d)

Page 19:

Examples of Distances: Discrete Data (3)

Binary data x = [x1 x2 ... xn]

Jaccard index: a / (a + b + c)

Czekanowski: 2a / (2a + b + c)
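A short sketch of these binary indices (the two example vectors are made up for illustration):

```python
import numpy as np

def binary_counts(x, y):
    """Counts a, b, c, d of the 2x2 table for two binary vectors."""
    x, y = np.asarray(x), np.asarray(y)
    a = int(np.sum((x == 1) & (y == 1)))
    b = int(np.sum((x == 1) & (y == 0)))
    c = int(np.sum((x == 0) & (y == 1)))
    d = int(np.sum((x == 0) & (y == 0)))
    return a, b, c, d

a, b, c, d = binary_counts([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 0, 1])
print("matching     :", (a + d) / (a + b + c + d))
print("Russell & Rao:", a / (a + b + c + d))
print("Jaccard      :", a / (a + b + c))
print("Czekanowski  :", 2 * a / (2 * a + b + c))
```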

Page 20:

Hierarchical Clustering

HC provides a graphical illustration of relationships between the data in the form of a dendrogram, which is a binary tree.

There are two approaches to HC:

• Bottom-up / agglomerative

• Top-down / divisive

Page 21:

Hierarchical Clustering

• Agglomerative / bottom-up method starts with each object in the data forming its own cluster, and then successively merges the clusters until one large cluster is formed, which encompasses the entire dataset

• Divisive / top-down method starts by considering the entire data as one cluster and then splits up the cluster(s) until each object forms its own cluster

Page 22:

Hierarchical Clustering

[Dendrogram over objects a-h; cutting it yields the clusters {a}, {b,c,d,e}, {f,g,h}. The tree can be read top-down (divisive) or bottom-up (agglomerative).]

Page 23:

[Dendrogram over data points a, b, c, d, e. Reading it bottom-up: {a}, {b}, {c,d}, {e} (4 clusters), then {a,b}, {c,d}, {e} (3 clusters), then {a,b,c,d}, {e} (2 clusters); the number of clusters depends on the level at which the tree is cut.]

Hierarchical Agglomerative Clustering

Page 24:

Given: a data set and the distance function.
1. Start with "N" clusters by assigning each pattern to a separate cluster.
2. Proceed with this initial configuration of the clusters and merge the clusters that are the closest. In other words, if S and T are the two clusters recognized as the closest, form a single cluster {S, T} and reduce the number of clusters by one.
3. Repeat step 2 until a minimal number of clusters has been reached.
Result: clusters of data (a partition).

Hierarchical Agglomerative Clustering
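A compact sketch of this bottom-up procedure using SciPy's hierarchical clustering routines (the small two-dimensional data set is invented for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Illustrative data: two loose groups in the plane
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.3],
              [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]])

# Step 2: repeatedly merge the closest clusters (single linkage here)
Z = linkage(X, method="single")          # alternatives: "complete", "average"

# Step 3: stop when the desired number of clusters remains
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)                            # e.g., [1 1 1 2 2 2]
```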

Page 25:

Distance Between Clusters

Single linkage method:
$$\|S - T\| = \min_{\mathbf{x}\in T,\ \mathbf{y}\in S} \|\mathbf{x} - \mathbf{y}\|$$

Complete linkage:
$$\|S - T\| = \max_{\mathbf{x}\in T,\ \mathbf{y}\in S} \|\mathbf{x} - \mathbf{y}\|$$

Average linkage:
$$\|S - T\| = \frac{1}{\mathrm{card}(S)\,\mathrm{card}(T)} \sum_{\mathbf{x}\in T}\sum_{\mathbf{y}\in S} \|\mathbf{x} - \mathbf{y}\|$$

Page 26:

Single Linkage

Similarity between S and T is calculated based on the minimal distance between the elements belonging to the corresponding clusters:
$$\|S - T\| = \min_{\mathbf{x}\in T,\ \mathbf{y}\in S} \|\mathbf{x} - \mathbf{y}\|$$

Page 27:

Complete Linkage

We rely on the maximal distance between the patterns in the analyzed clusters:
$$\|S - T\| = \max_{\mathbf{x}\in T,\ \mathbf{y}\in S} \|\mathbf{x} - \mathbf{y}\|$$

Page 28:

Average Linkage

We combine two clusters based upon the averaged distance between the patterns in the two clusters:
$$\|S - T\| = \frac{1}{\mathrm{card}(S)\,\mathrm{card}(T)} \sum_{\mathbf{x}\in T}\sum_{\mathbf{y}\in S} \|\mathbf{x} - \mathbf{y}\|$$

Page 29:

Hausdorff Distance Function

$$d(A,B) = \max\left\{\ \max_{\mathbf{x}\in A}\min_{\mathbf{y}\in B} d(\mathbf{x},\mathbf{y}),\ \ \max_{\mathbf{y}\in B}\min_{\mathbf{x}\in A} d(\mathbf{x},\mathbf{y})\ \right\}$$

(picture from Wikipedia)

In words: d(A,B) is the larger of the two max-min distances; in the continuous case the max and min become sup and inf.
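A small sketch of the Hausdorff distance between two finite point sets, written directly from the max-min formula above (the two sets are made up):

```python
import numpy as np
from scipy.spatial.distance import cdist

def hausdorff(A, B):
    D = cdist(A, B)                       # pairwise distances d(x, y)
    return max(D.min(axis=1).max(),       # max over x in A of min over y in B
               D.min(axis=0).max())       # max over y in B of min over x in A

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[0.0, 1.0], [3.0, 0.0]])
print(hausdorff(A, B))                    # 2.0 for these two sets
```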

Page 30:

Lance-Williams updating formula

$$d_{A\cup B,\,C} = \alpha_A\, d_{A,C} + \alpha_B\, d_{B,C} + \beta\, d_{A,B} + \gamma\, |d_{A,C} - d_{B,C}|$$

Clustering method | alpha_A (alpha_B)      | beta                      | gamma
Single link       | 1/2                    | 0                         | -1/2
Complete link     | 1/2                    | 0                         | 1/2
Centroid          | n_A / (n_A + n_B)      | -n_A n_B / (n_A + n_B)^2  | 0
Median            | 1/2                    | -1/4                      | 0

Page 31:

Hierarchical Divisive Method

The HD algorithm starts by considering all divisions of the data into two nonempty subsets, which amounts to $2^{N-1} - 1$ possibilities.

However, it is possible to construct divisive methods that do not consider all divisions, most of which may be incorrect anyway.

One such algorithm is by MacNaughton-Smith (1964).

Page 32:

Hierarchical Divisive Method

1. At first A := C and B := ∅. Move one object at a time from A to B.

For each object i ∈ A we compute its average dissimilarity to all other objects of A:
$$a(i) = \frac{1}{|A| - 1} \sum_{j \in A,\ j \neq i} d(i,j)$$

The object m of A for which a(m) is the largest is moved to B:
$$A := A \setminus \{m\}, \qquad B := B \cup \{m\}$$

Page 33:

2. Move other objects from A to B (B is called the "splinter group").

If |A| = 1, stop. Otherwise compute a(i) for all i ∈ A, and the average dissimilarity of i to all objects of B, denoted d(i,B):
$$a(i) = \frac{1}{|A| - 1} \sum_{j \in A,\ j \neq i} d(i,j), \qquad d(i,B) = \frac{1}{|B|} \sum_{k \in B} d(i,k)$$

Hierarchical Divisive Method

Page 34:

Select the object h ∈ A for which
$$a(h) - d(h,B) = \max_{i \in A}\big(a(i) - d(i,B)\big)$$

If a(h) - d(h,B) > 0, move h from A to B and go to step 2.

If a(h) - d(h,B) ≤ 0, the process stops; the division of C into clusters A and B is complete.

Hierarchical Divisive Method

Page 35:

Dissimilarity matrix:

      a     b     c     d     e
a    0.0   2.0   6.0  10.0   9.0
b    2.0   0.0   5.0   9.0   8.0
c    6.0   5.0   0.0   4.0   5.0
d   10.0   9.0   4.0   0.0   3.0
e    9.0   8.0   5.0   3.0   0.0

Object | Average dissimilarity to the other objects
a | (2.0 + 6.0 + 10.0 + 9.0)/4 = 6.75
b | (2.0 + 5.0 + 9.0 + 8.0)/4 = 6.00
c | (6.0 + 5.0 + 4.0 + 5.0)/4 = 5.00
d | (10.0 + 9.0 + 4.0 + 3.0)/4 = 6.50
e | (9.0 + 8.0 + 5.0 + 3.0)/4 = 5.25

In our example, object a is chosen to initiate the splinter group.

At this stage we have groups A={b,c,d,e} and B={a}, but we don’t stop here.

Hierarchical Divisive Method

Page 36:

Splinter group B = {a}, remaining group A = {b, c, d, e}:

Object | Average dissimilarity to remaining objects | Average dissimilarity to objects of splinter group | Difference
b | (5.0 + 9.0 + 8.0)/3 = 7.33 | 2.00  | 5.33
c | (5.0 + 4.0 + 5.0)/3 = 4.67 | 6.00  | -1.33
d | (9.0 + 4.0 + 3.0)/3 = 5.33 | 10.00 | -4.67
e | (8.0 + 5.0 + 3.0)/3 = 5.33 | 9.00  | -3.67

Therefore object b changes sides, so the new splinter group is B = {a, b} and the remaining group becomes A = {c, d, e}.

Object | Average dissimilarity to remaining objects | Average dissimilarity to objects of splinter group | Difference
c | (4.0 + 5.0)/2 = 4.50 | (6.0 + 5.0)/2 = 5.50  | -1.00
d | (4.0 + 3.0)/2 = 3.50 | (10.0 + 9.0)/2 = 9.50 | -6.00
e | (5.0 + 3.0)/2 = 4.00 | (9.0 + 8.0)/2 = 8.50  | -4.50

All differences are now negative, so the process stops with clusters {a, b} and {c, d, e}.

Hierarchical Divisive Method
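The worked example above can be reproduced with a short NumPy sketch of the MacNaughton-Smith splinter procedure (using the dissimilarity matrix from the example):

```python
import numpy as np

D = np.array([[ 0,  2,  6, 10,  9],      # dissimilarities between a, b, c, d, e
              [ 2,  0,  5,  9,  8],
              [ 6,  5,  0,  4,  5],
              [10,  9,  4,  0,  3],
              [ 9,  8,  5,  3,  0]], dtype=float)
names = list("abcde")

A, B = list(range(len(names))), []

# Step 1: move the object with the largest average dissimilarity into the splinter group B
avg = [D[i, [j for j in A if j != i]].mean() for i in A]
m = A[int(np.argmax(avg))]
A.remove(m); B.append(m)

# Step 2: keep moving objects while some object is, on average, closer to B than to A
while len(A) > 1:
    a_i = {i: D[i, [j for j in A if j != i]].mean() for i in A}
    d_iB = {i: D[i, B].mean() for i in A}
    h = max(A, key=lambda i: a_i[i] - d_iB[i])
    if a_i[h] - d_iB[h] <= 0:
        break
    A.remove(h); B.append(h)

print("A =", [names[i] for i in A])      # ['c', 'd', 'e']
print("B =", [names[i] for i in B])      # ['a', 'b']
```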

Page 37:

Objective Function Clustering

Develop a partition matrix so that a certain performance index is optimized.

The lower the value of the objective function, the better.

Page 38:

Objective Function Clustering

[Diagram: data → objective function → minimization → structure (clusters)]

Page 39:

Objective Function Clustering

• Depends on minimization of a certain performance index Q:

$$Q = \sum_{i=1}^{c}\sum_{k=1}^{N} u_{ik}^{m}\, d_{ik}^{2}, \qquad d_{ik}^{2} = \|\mathbf{x}_k - \mathbf{v}_i\|^{2}$$

c - number of clusters
U = [u_ik] - partition matrix
v_i - prototypes

Page 40:

Clustering: representation issues

How do we represent clusters?

Partition matrix U (c rows = clusters, N columns = data points):

U =
0 1 1 0 0 0 0 0
0 0 0 0 0 1 1 0
1 0 0 1 1 0 0 1

Page 41:

Partition Matrix

U =
0 1 1 0 0 0 0 0
0 0 0 0 0 1 1 0
1 0 0 1 1 0 0 1

cluster 1: {1, 4, 5, 8}
cluster 2: {2, 3}
cluster 3: {6, 7}

$$\mathbf{U} = \left\{\, U \ \middle|\ \sum_{i=1}^{c} u_{ik} = 1 \ \text{for each } k, \quad 0 < \sum_{k=1}^{N} u_{ik} < N \ \text{for each } i \,\right\}$$
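A minimal sketch showing how such a hard partition matrix can be built from cluster labels (the label vector mirrors the example above, with clusters numbered from 0):

```python
import numpy as np

def partition_matrix(labels, c):
    """c x N hard partition matrix U from integer cluster labels in 0..c-1."""
    labels = np.asarray(labels)
    U = np.zeros((c, labels.size), dtype=int)
    U[labels, np.arange(labels.size)] = 1
    return U

labels = [0, 1, 1, 0, 0, 2, 2, 0]         # data points 1..8 in clusters 1, 2, 3 (0-based)
U = partition_matrix(labels, c=3)
print(U)
print(U.sum(axis=0))                      # every column sums to 1: each point lies in exactly one cluster
```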

Page 42:

Objective Function-Based Clustering

Clustering is guided by the minimization of some objective function/performance index Q.

Representation of structure is in the form of:
• Partition matrix U = [u_ik], i = 1,2,...,c; k = 1,2,...,N
• Prototypes v_i, i = 1,2,...,c

with the constraints
$$\sum_{i=1}^{c} u_{ik} = 1 \ \text{ for } k = 1,2,\dots,N, \qquad 0 < \sum_{k=1}^{N} u_{ik} < N \ \text{ for } i = 1,2,\dots,c$$

Page 43:

Algorithm:

Given: the (guessed!) number of clusters (c), decide on the similarity function

(and on the value of the power factor (m) - for fuzzy clustering only)

Compute the prototypes (v) and update the partition matrix (U) based on the conditions of the minimized objective function

Result: partition matrix and prototypes

Objective Function Clustering

Page 44:

K - means Algorithm

Input:
c = number of clusters
d = distance function
m = power (fuzziness) factor (not used in hard K-means)
t = termination criterion (amount of movement between clusters)
v = cluster centers (randomly chosen at each run)

Output:
U = partition matrix
V = cluster centers

Page 45:

The K-Means Clustering Method

• Given k, the k-means algorithm is implemented in four steps:
  - Partition objects into k nonempty subsets
  - Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., the mean point, of the cluster)
  - Assign each object to the cluster with the nearest seed point
  - Go back to Step 2; stop when there are no more new assignments

Page 46:

The K-Means Clustering Method

• Example

[Figure: k-means iterations with K = 2. Arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign objects; repeat updating the means and reassigning until nothing changes.]

Page 47:

Comments on the K-Means Method

• Strength: relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations; normally k, t << n.
• Comparison: PAM: O(k(n-k)^2), CLARA: O(ks^2 + k(n-k))
• Comment: often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms
• Weakness
  - Applicable only when a mean is defined; what about categorical data?
  - Need to specify k, the number of clusters, in advance
  - Unable to handle noisy data and outliers
  - Not suitable for discovering clusters with non-convex shapes

Page 48:

What Is the Problem of the K-Means Method?

• The k-means algorithm is sensitive to outliers!
  - An object with an extremely large value may substantially distort the distribution of the data
• K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used: the most centrally located object in a cluster


Page 49:

Termination criteria

When the summed difference of the old and new partitions (partition matrices) is less than a threshold

• Hard: U_new - U_old == 0

• Fuzzy: U_new - U_old < a user-chosen threshold

K - means Algorithm

Page 50:

Minimizes the objective function by allocating data points to different clusters

Given: the number of clusters c

1. select initial c means

2. calculate distance between the pattern and the means of the clusters

3. allocate the pattern to the cluster whose mean is nearest to this pattern

4. recalculate the mean of the cluster from the patterns allocated to it

5. repeat 2-4 until a termination criterion is satisfied

Result: a collection of means (prototypes) of the clusters

K - means Algorithm

Page 51:

K-Means Clustering Algorithm (1)

Objective function
$$Q = \sum_{i=1}^{c}\sum_{k=1}^{N} u_{ik}\, \|\mathbf{x}_k - \mathbf{v}_i\|^{2}$$

Minimize Q subject to the constraints
$$u_{ik} \in \{0,1\}, \qquad \sum_{i=1}^{c} u_{ik} = 1 \ \text{ for } k = 1,2,\dots,N, \qquad 0 < \sum_{k=1}^{N} u_{ik} < N \ \text{ for } i = 1,2,\dots,c$$

Page 52:

K-Means Clustering Algorithm (2)

Start with some initial configuration of the prototypes v_i, i = 1,2,...,c (e.g., choose them randomly).

Iterate:
- construct a partition matrix by assigning numeric values to U according to the following rule:
$$u_{ik} = \begin{cases} 1, & \text{if } d(\mathbf{x}_k, \mathbf{v}_i) = \min_{j} d(\mathbf{x}_k, \mathbf{v}_j) \\ 0, & \text{otherwise} \end{cases}$$

- update the prototypes by computing the weighted average that involves the entries of the partition matrix:
$$\mathbf{v}_i = \frac{\sum_{k=1}^{N} u_{ik}\,\mathbf{x}_k}{\sum_{k=1}^{N} u_{ik}}$$

until the performance index Q stabilizes and does not change, or the changes are negligible.
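A minimal NumPy sketch of these two alternating steps (illustrative data; empty clusters are not handled):

```python
import numpy as np

def k_means(X, c, n_iter=100, seed=0):
    """Plain k-means following the two update rules above."""
    rng = np.random.default_rng(seed)
    v = X[rng.choice(len(X), size=c, replace=False)]                 # initial prototypes
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - v[None, :, :], axis=2)    # N x c distance table
        labels = d.argmin(axis=1)                                    # u_ik = 1 for the nearest prototype
        new_v = np.array([X[labels == i].mean(axis=0) for i in range(c)])
        if np.allclose(new_v, v):                                    # Q no longer changes
            break
        v = new_v
    return labels, v

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
labels, v = k_means(X, c=2)
print(v)                                 # the two prototypes found
```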

Page 53:

Growing the Hierarchy of Clusters

[Diagram: a set of c clusters, one of which is split again into c clusters.]

Develop "c" clusters; split the most heterogeneous cluster into "c" clusters, etc.

Page 54:

Growing the Hierarchy of Clusters

[Diagram: balanced growth vs. imbalanced growth of the hierarchy of c-cluster splits.]

Page 55:

Kernel-Based Clustering

Idea: original data in the n-dimensional space are transformed through some mapping φ into elements of an m-dimensional space, where m >> n.

Objective function in the new space:
$$Q = \sum_{i=1}^{c}\sum_{k=1}^{N} u_{ik}^{m}\, \|\varphi(\mathbf{x}_k) - \varphi(\mathbf{v}_i)\|^{2}$$

Page 56:

Kernel-Based Clustering

Given the dimensionality of the new space, m >> n, we calculate in the new space a kernel function K(x,v) as a dot product:

K(x, v) = φ^T(x) φ(v)

We can use a Gaussian kernel:

K(x, v) = exp(-||x - v||^2 / σ^2)

For such a kernel, ||φ(x_k) - φ(v_i)||^2 = 2 - 2 K(x_k, v_i)

Page 57:

Kernel K-means


k-means cannot separate clusters that are non-linearly separable

To solve this problem, the kernel k-means algorithm is used: before doing the clustering, all points are mapped into a higher-dimensional space using some nonlinear function, and then the algorithm partitions the points in the new space.

The major difference from k-means is that in kernel k-means the distances are calculated by the kernel method, not, say, by the Euclidean distance.

Page 58:

Kernel K-means


Weighted form of the objective function:
$$Q = \sum_{j=1}^{c}\sum_{\mathbf{x}\in C_j} w(\mathbf{x})\, \|\varphi(\mathbf{x}) - \mathbf{m}_j\|^{2}$$

Membership form:
$$Q = \sum_{i=1}^{c}\sum_{k=1}^{N} u_{ik}^{m}\, \|\varphi(\mathbf{x}_k) - \varphi(\mathbf{v}_i)\|^{2}$$

To calculate the distances between the points in the new space and the cluster centers m_j we use a kernel function that is specified in the kernel matrix K.

Page 59:

Kernel K-means

Input: K - kernel function, k - number of clusters

1. Initialize the k clusters: C1(0), ..., Ck(0)
2. Set t = 0
3. For each point x, find its new cluster by: J*(x) = argmin_j ||φ(x) - m_j||^2
4. Compute the updated clusters as C_j(t+1) = {x : J*(x) = j}
5. If not converged, set t = t + 1 and go to step 3; otherwise, stop

Result: partition into clusters C1, ..., Ck

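A hedged sketch of kernel k-means: the distance of a point to a cluster mean in the feature space is expressed entirely through the kernel matrix, so φ never has to be computed explicitly (data and parameter values are invented):

```python
import numpy as np

def gaussian_kernel(X, sigma=1.0):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / sigma ** 2)

def kernel_kmeans(K, k, n_iter=50, seed=0):
    N = K.shape[0]
    labels = np.random.default_rng(seed).integers(0, k, size=N)
    for _ in range(n_iter):
        dist = np.zeros((N, k))
        for j in range(k):
            idx = np.where(labels == j)[0]
            if idx.size == 0:
                dist[:, j] = np.inf
                continue
            # ||phi(x_i) - m_j||^2 = K_ii - (2/|C_j|) sum_l K_il + (1/|C_j|^2) sum_{l,l'} K_ll'
            dist[:, j] = np.diag(K) - 2.0 * K[:, idx].mean(axis=1) + K[np.ix_(idx, idx)].mean()
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(3, 0.3, (30, 2))])
print(kernel_kmeans(gaussian_kernel(X, sigma=2.0), k=2))   # label assignment (depends on the initialization)
```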

Page 60:

K-Medoids Clustering

To enhance robustness of clustering we use medoids instead of prototype mean values

In the one-dimensional case the medoid is the median.

Page 61:

The K-Medoid Clustering Method

• K-Medoids Clustering: find representative objects (medoids) in clusters
  - PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw 1987)
    • Starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if doing so improves the total distance of the resulting clustering
    • PAM works effectively for small data sets, but does not scale well for large data sets (due to its computational complexity)
  - Efficiency improvements on PAM:
    • CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples
    • CLARANS (Ng & Han, 1994): randomized re-sampling

Page 62:

Median as a Robust Estimator

Consider an ordered collection of real data

x1 <= x2 <= ... <= xN

The median is the central point of this sequence (if N is odd) or the average of the two points in the middle (if N is even).

The median is a robust estimator (an order statistic).

[Figure: the same data shown with and without an outlier; the mean is pulled toward the outlier, while the median barely changes.]

Page 63:

Median as a Robust Estimator

The median is a solution to the minimization problem
$$\min_{v} \sum_{k=1}^{N} |x_k - v| \;=\; \sum_{k=1}^{N} |x_k - \mathrm{med}|$$

We enhance the robustness of clustering by considering the objective function
$$Q = \sum_{i=1}^{c}\sum_{k=1}^{N} u_{ik} \sum_{j=1}^{n} |x_{kj} - v_{ij}|$$

Advantage: one of the original points becomes the cluster center.

Page 64:

Partitioning Around Medoids (PAM)

Medoids: a family of the most centrally positioned data points.

PAM clustering: represent the structure in the data by a collection of medoids; each data point is grouped around the medoid to which its distance is the shortest.

PAM starts with an arbitrary collection of elements treated as medoids. At each step of the optimization, we swap a certain data point with one of the medoids, provided that the swap improves the quality of the clustering.

Limitation: the size of the dataset. PAM works well for small datasets with a small number of clusters (around 100 data points and 5 clusters).

Page 65:

PAM (Partitioning Around Medoids) (1987)

• PAM (Kaufman and Rousseeuw, 1987), built into S-Plus
• Use real objects to represent the clusters:
  - Select k representative objects arbitrarily
  - For each pair of a non-selected object h and a selected object i, calculate the total swapping cost TC_ih
  - For each pair of i and h:
    • If TC_ih < 0, i is replaced by h
    • Then assign each non-selected object to the most similar representative object
  - Repeat steps 2-3 until there is no change
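A naive sketch of the PAM swap idea on a precomputed dissimilarity matrix; for simplicity it recomputes the total cost of a candidate medoid set directly instead of accumulating TC_ih from the four cases discussed on the next slides (the data are invented):

```python
import numpy as np
from itertools import product

def pam(D, k, seed=0):
    """Swap-based k-medoids sketch on an n x n dissimilarity matrix D."""
    n = D.shape[0]
    medoids = list(np.random.default_rng(seed).choice(n, size=k, replace=False))

    def total_cost(meds):
        return D[:, meds].min(axis=1).sum()    # each object pays its distance to the nearest medoid

    improved = True
    while improved:
        improved = False
        for i, h in product(list(medoids), range(n)):
            if h in medoids:
                continue
            candidate = [h if m == i else m for m in medoids]
            if total_cost(candidate) < total_cost(medoids):   # i.e., the swap cost is negative
                medoids, improved = candidate, True
    return medoids, D[:, medoids].argmin(axis=1)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(6, 1, (10, 2))])
D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
print(pam(D, k=2))
```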

Page 66:

PAM Clustering: Finding the Best Cluster Center

Four Cases

Page 67:

PAM Clustering: Finding the Best Cluster Center

• Case 1: Suppose p currently belongs to the cluster represented by Oj. Furthermore, d(p, Oi) < d(p, Orandom). Then, if Oj is replaced by Orandom, p will belong to the cluster represented by Oi.

Swap cost: C = d(p, Oi) - d(p, Oj)

Page 68:

PAM Clustering: Finding the Best Cluster Center

• Case 2: p currently belongs to the cluster represented by Oj, but this time we assume d(p, Oi) > d(p, Orandom). Then, if Oj is replaced by Orandom, p will belong to the cluster represented by Orandom.

Swap cost: C = d(p, Orandom) - d(p, Oj)

Page 69:

PAM Clustering: Finding the Best Cluster Center

• Case 3: p currently belongs to a cluster represented by Oi rather than Oj, and d(p, Oi) < d(p, Orandom). Then, if Oj is replaced by Orandom, p will still belong to the cluster represented by Oi.

Swap cost: C = 0

Page 70:

PAM Clustering: Finding the Best Cluster Center

• Case 4: p currently belongs to the cluster represented by Oi, but d(p, Oi) > d(p, Orandom). If Oj is replaced by Orandom, p will belong to the cluster represented by Orandom.

Swap cost: C = d(p, Orandom) - d(p, Oi)

Page 71:

PAM: A Typical K-Medoids Algorithm

[Figure: a typical run of k-medoids (PAM) with K = 2: arbitrarily choose k objects as initial medoids; assign each remaining object to the nearest medoid; randomly select a non-medoid object O_random; compute the total cost of swapping (total cost = 20 vs. 26 in the two configurations shown); swap O and O_random if the quality is improved; loop until no change.]

Page 72:

What Is the Problem with PAM?

• PAM is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean
• PAM works efficiently for small data sets but does not scale well for large data sets:
  - O(k(n-k)^2) per iteration, where n is # of data and k is # of clusters
• Sampling-based method: CLARA (Clustering LARge Applications)

Page 73:

CLARA (Clustering Large Applications) (1990)

• CLARA (Kaufmann and Rousseeuw, 1990)
  - Built into statistical analysis packages, such as S-Plus
  - It draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output
• Strength: deals with larger data sets than PAM
• Weakness:
  - Efficiency depends on the sample size
  - A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased

Page 74:

CLARA

• Algorithm CLARA (five samples of size 40 + 2k):

1. For i = 1 to 5, repeat the following steps:
2. Draw a sample of 40 + 2k objects randomly from the entire data set, and call Algorithm PAM to find k medoids of the sample.
3. For each object Oj in the entire data set, determine which of the k medoids is the most similar to Oj.
4. Calculate the average dissimilarity of the clustering obtained in the previous step. If this value is less than the current minimum, use this value as the current minimum, and retain the k medoids found in Step 2 as the best set of medoids obtained so far.
5. Return to Step 1 to start the next iteration.
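A sketch of CLARA built on top of the pam() sketch shown earlier (it assumes that function is available in the same session; the sample size 40 + 2k follows the algorithm above):

```python
import numpy as np

def clara(X, k, n_samples=5, seed=0):
    """Run PAM on small random samples and keep the best medoid set for the whole data set."""
    rng = np.random.default_rng(seed)
    best_meds, best_cost = None, np.inf
    for _ in range(n_samples):
        idx = rng.choice(len(X), size=min(40 + 2 * k, len(X)), replace=False)
        S = X[idx]
        D_s = np.linalg.norm(S[:, None] - S[None, :], axis=2)
        meds_local, _ = pam(D_s, k)                   # pam() as sketched on the PAM slide
        meds = idx[np.asarray(meds_local)]            # map sample indices back to the full data set
        # average dissimilarity of every object to its nearest medoid
        cost = np.linalg.norm(X[:, None] - X[meds][None, :], axis=2).min(axis=1).mean()
        if cost < best_cost:
            best_meds, best_cost = meds, cost
    return best_meds, best_cost
```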

Page 75:

CLARANS (“Randomized” CLARA) (1994)

• CLARANS (A Clustering Algorithm based on RANdomized Search) (Ng and Han'94)
  - Draws a sample of neighbors dynamically
  - The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of k medoids
  - If a local optimum is found, it starts with a new randomly selected node in search of a new local optimum
• Advantages: more efficient and scalable than both PAM and CLARA
• Further improvement: focusing techniques and spatial access structures (Ester et al.'95)

Page 76:

CLARANS

• Finding k medoids can be viewed abstractly as searching through a certain graph.
• A node is represented by a set of k objects {O1, ..., Ok}, the currently selected medoids.
• Two nodes are neighbors if and only if their sets differ by only one object. Each node corresponds to a clustering and can be assigned a cost, defined by the dissimilarity between every object and the medoid of its cluster.

Page 77:

CLARANS

• Algorithm CLARANS
1. Input parameters numlocal and maxneighbor. Initialize i to 1, and mincost to a large number.
2. Set current to an arbitrary node in G_{n,k}.
3. Set j to 1.
4. Consider a random neighbor S of current and, based on the cost-swap function, calculate the cost differential of the two nodes.
5. If S has a lower cost, set current to S, and go to Step 3.
6. Otherwise, increment j by 1. If j <= maxneighbor, go to Step 4.
7. Otherwise, when j > maxneighbor, compare the cost of current with mincost. If the former is less than mincost, set mincost to the cost of current and set bestnode to current.
8. Increment i by 1. If i > numlocal, output bestnode and halt. Otherwise, go to Step 2.

Page 78:

Clustering Large Applications (CLARA)

Modification of PAM to deal with large data sets:

Instead of processing all the data, sample it and apply PAM to the sample.

K-Medoids clustering:

• PAM

• CLARA

Advantages: • robustness, and • interpretability

Page 79:

Fuzzy C-Means

How do we deal with (quantify) data that are in-between clusters? Consider partial membership in clusters - the emergence of fuzzy sets.

[Figure: two clusters with some elements in between that have partial membership in both.]

Page 80:

Fuzzy C-Means

Partial membership in clusters leads to a fuzzy partition matrix U.

Objective function:
$$Q = \sum_{i=1}^{c}\sum_{k=1}^{N} u_{ik}^{m}\, \|\mathbf{x}_k - \mathbf{v}_i\|^{2}$$

U = [u_ik]; u_ik - degree of membership of the k-th data point in the i-th cluster
m - fuzzification coefficient, m > 1
||.|| - distance function

Page 81:

Fuzzy C - Means: Optimization

$$\min_{U,\ \mathbf{v}_1,\dots,\mathbf{v}_c} \; Q = \sum_{i=1}^{c}\sum_{k=1}^{N} u_{ik}^{m}\,\|\mathbf{x}_k - \mathbf{v}_i\|^{2}$$

Minimization of Q with respect to the prototypes: require
$$\frac{\partial Q}{\partial v_{ij}} = 0, \qquad i = 1, 2, \dots, c, \quad j = 1, 2, \dots, n$$

Minimization of Q with respect to the partition matrix: require
$$\frac{\partial Q}{\partial u_{ik}} = 0, \qquad i = 1, 2, \dots, c, \quad k = 1, 2, \dots, N$$

subject to the constraint that U is a partition matrix.

Page 82:

$$\min_{\mathbf{v}_1,\dots,\mathbf{v}_c} \sum_{i=1}^{c}\sum_{k=1}^{N} u_{ik}^{m}\,\|\mathbf{x}_k - \mathbf{v}_i\|^{2}$$

$$\frac{\partial}{\partial \mathbf{v}_i} \sum_{k=1}^{N} u_{ik}^{m} (\mathbf{x}_k - \mathbf{v}_i)^{T}(\mathbf{x}_k - \mathbf{v}_i) = -2 \sum_{k=1}^{N} u_{ik}^{m} (\mathbf{x}_k - \mathbf{v}_i) = 0$$

so that

$$\mathbf{v}_i = \frac{\sum_{k=1}^{N} u_{ik}^{m}\,\mathbf{x}_k}{\sum_{k=1}^{N} u_{ik}^{m}}$$

Fuzzy C-Means: Calculations

Page 83:

FCM: Algorithm

Initialize: select the number of clusters (c), the stopping value (ε), and the fuzzification coefficient (m). The distance function is the Euclidean or a weighted Euclidean distance. The initial partition matrix consists of random entries.

Repeat

update the prototypes:
$$\mathbf{v}_i = \frac{\sum_{k=1}^{N} u_{ik}^{m}\,\mathbf{x}_k}{\sum_{k=1}^{N} u_{ik}^{m}}$$

update the partition matrix:
$$u_{ik} = \frac{1}{\sum_{j=1}^{c} \left( \dfrac{\|\mathbf{x}_k - \mathbf{v}_i\|}{\|\mathbf{x}_k - \mathbf{v}_j\|} \right)^{2/(m-1)}}$$

until a certain stopping criterion has been satisfied.
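A minimal NumPy sketch of FCM following the two update formulas above (illustrative data; a small constant guards against division by zero when a point coincides with a prototype):

```python
import numpy as np

def fcm(X, c, m=2.0, eps=1e-4, max_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((c, len(X)))
    U /= U.sum(axis=0)                                   # random fuzzy partition matrix
    for _ in range(max_iter):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)     # prototype update
        D = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2) + 1e-12
        U_new = 1.0 / ((D[:, None, :] / D[None, :, :]) ** (2.0 / (m - 1))).sum(axis=1)
        if np.abs(U_new - U).max() < eps:                # stopping criterion (see the next slide)
            return U_new, V
        U = U_new
    return U, V

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(4, 0.5, (30, 2))])
U, V = fcm(X, c=2)
print(V)          # prototypes near the two group centres; U holds the graded memberships
```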

Page 84:

FCM – Algorithm

Design aspects:

Stopping criterion (termination of iterations):
$$\max_{i,k} |u_{ik}(\mathrm{iter}+1) - u_{ik}(\mathrm{iter})| \le \varepsilon$$

Fuzzification coefficient (m), m > 1, controls the shape of the membership functions:
m = 2.0 - typical value
m close to 1 - set-like (crisp) membership functions
m higher than 2.0 - spike-like membership functions

Page 85:

Model-Based Clustering

A mixture of data as the underlying model.

Each component is described by a conditional probability density function with parameters θ_i:

$$p(\mathbf{x}\,|\,\theta_1, \theta_2, \dots, \theta_c) = \sum_{i=1}^{c} p_i\, p(\mathbf{x}\,|\,\theta_i)$$

Clustering amounts to parameter estimation for this mixture of data.

Page 86:

Mixture of Data Model

Maximum likelihood estimation:

Given data x1, x2, ..., xN, choose the parameters θ such that

$$P(\mathbf{X}\,|\,\boldsymbol{\theta}) = \prod_{k=1}^{N} p(\mathbf{x}_k\,|\,\boldsymbol{\theta})$$

becomes maximized.

Page 87:

Scalable Clustering

Clustering algorithms need to be scalable to deal with large data sets

Example algorithms:

•Density-Based Clustering (DBSCAN)

•OPTICS

•DENCLUE

•CLIQUE

Page 88:

Density-Based Clustering Methods

• Clustering based on density (a local cluster criterion), such as density-connected points
• Major features:
  - Discover clusters of arbitrary shape
  - Handle noise
  - One scan
  - Need density parameters as a termination condition
• Several interesting studies:
  - DBSCAN: Ester, et al. (KDD'96)
  - OPTICS: Ankerst, et al. (SIGMOD'99)
  - DENCLUE: Hinneburg & D. Keim (KDD'98)
  - CLIQUE: Agrawal, et al. (SIGMOD'98) (more grid-based)

Page 89:

Density-Based Clustering: Basic Concepts

• Two parameters:

– Eps: Maximum radius of the neighbourhood

– MinPts: Minimum number of points in an Eps-neighbourhood of that point

• NEps(p): {q belongs to D | dist(p,q) <= Eps}

• Directly density-reachable: A point p is directly density-reachable from a point q w.r.t. Eps, MinPts if

– p belongs to NEps(q)

– core point condition:

|NEps (q)| >= MinPts

[Figure: point q with an Eps-neighborhood of 1 cm containing at least MinPts = 5 points; p lies inside that neighborhood, so p is directly density-reachable from q.]

Page 90:

Density-Based Clustering DBSCAN

The ε-neighborhood of x_k, denoted N_ε(x_k), is given as

N_ε(x_k) = { x | d(x, x_k) ≤ ε }

DBSCAN is based on the concepts of the ε-neighborhood, N_Pts, and density-based reachability.

N_Pts - the minimum number of points that must fall within the neighborhood.

x_i is density-reachable from x_k with parameters ε and N_Pts if the following conditions are satisfied:
(a) x_i belongs to N_ε(x_k), and (b) card(N_ε(x_k)) ≥ N_Pts.

x_k is then a CORE point.

Page 91:

DBSCAN: Density-based Reachability

x_i is density-reachable from x_k through a chain of data points x_{k+1}, x_{k+2}, ..., x_{i-1}, each directly density-reachable from the previous one.

[Figure: overlapping neighborhoods N_ε(x_k), ..., N_ε(x_1) forming such a chain.]

Page 92:

DBSCAN Algorithm

Set up the parameters of the neighborhood: ε and N_Pts.
(a) arbitrarily select a data point, say x_k
(b) find (retrieve) all data that are density-reachable from x_k
(c) if x_k is a core point, then a cluster has been formed (all points density-reachable from x_k)
(d) otherwise consider x_k to be a border point and move on to the next data point
The sequence (a)-(d) is repeated until all data points have been processed.

Page 93:

Density-Reachable and Density-Connected

• Density-reachable:

– A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn, p1 = q, pn = p such that pi+1 is directly density-reachable from pi

• Density-connected

– A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both, p and q are density-reachable from o w.r.t. Eps and MinPts

[Figure: left, a chain p1, ..., pn linking q to p (density-reachable); right, points p and q both density-reachable from a common point o (density-connected).]

Page 94:

DBSCAN: Density-Based Spatial Clustering of Applications with Noise

• Relies on a density-based notion of cluster: A cluster is defined as a maximal set of density-connected points

• Discovers clusters of arbitrary shape in spatial databases with noise

[Figure: with Eps = 1 cm and MinPts = 5, points are labeled as core, border, or outlier.]

Page 95:

DBSCAN: The Algorithm

• Arbitrarily select a point p
• Retrieve all points density-reachable from p w.r.t. Eps and MinPts
• If p is a core point, a cluster is formed
• If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database
• Continue the process until all of the points have been processed

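A hedged example of this procedure via scikit-learn's DBSCAN; Eps and MinPts map to the eps and min_samples parameters, and the data (a noisy ring plus a blob plus scattered points) are invented:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
t = rng.uniform(0, 2 * np.pi, 200)
X = np.vstack([np.c_[np.cos(t), np.sin(t)] * 5 + rng.normal(0, 0.2, (200, 2)),   # ring
               rng.normal(0, 0.5, (50, 2)),                                      # blob
               rng.uniform(-8, 8, (20, 2))])                                     # scattered noise

labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
print(set(labels))     # cluster ids; the label -1 marks points treated as noise/outliers
```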

Page 96:

OPTICS: A Cluster-Ordering Method (1999)

• OPTICS: Ordering Points To Identify the Clustering Structure
  - Ankerst, Breunig, Kriegel, and Sander (SIGMOD'99)
  - Produces a special order of the database w.r.t. its density-based clustering structure
  - This cluster ordering contains information equivalent to the density-based clusterings corresponding to a broad range of parameter settings
  - Good for both automatic and interactive cluster analysis, including finding intrinsic clustering structure
  - Can be represented graphically or using visualization techniques

Page 97:

OPTICS: Some Extension from DBSCAN

• Index-based: k = number of dimensions, N = 20, p = 75%, M = N(1-p) = 5
  - Complexity: O(kN^2)
• Core distance of an object o: the smallest radius that makes o a core point
• Reachability distance of p from o: max(core-distance(o), d(o, p))

[Figure: with ε = 3 cm and MinPts = 5, the core distance of o and the reachability distances r(p1, o) = 2.8 cm and r(p2, o) = 4 cm are illustrated.]

Page 98:

DENCLUE: Using Statistical Density Functions

$$f_{\text{Gaussian}}(x, y) = e^{-\frac{d(x,y)^{2}}{2\sigma^{2}}}$$

Page 99:

DENCLUE: Using Statistical Density Functions

• Density attractor: a local maximum of the overall density function
• Density-attracted: a point x is density-attracted to a density attractor x* if there exists a set of points x0, x1, ..., xk such that x0 = x, xk = x*, and the gradient at x_{i-1} is in the direction of x_i for 0 < i < k
• In general, points that are density-attracted to x* may form a cluster

Page 100:

Denclue

• Center-defined cluster: for a density attractor x*, a subset of points C ⊆ D that are density-attracted by x*, where the density function value at x* is no less than a threshold.
• Arbitrary-shape cluster: for a set of density attractors, a union of such sets C, each density-attracted to its respective density attractor, where
  (1) the density function value at each density attractor is no less than the threshold, and
  (2) there exists a path P from each density attractor to another along which the density function value at each point is no less than the threshold.

Page 101:

Grid-Based Clustering

Describe the structure in data in the language of generic geometric constructs - hyperboxes and their combinations.

[Figure: a collection of clusters of different geometry, formed by merging adjacent hyperboxes of the grid.]

Page 102:

Grid-Based Clustering

Hyperboxes

{B1, B2, ..., Bp} with three requirements:
a) each Bi is nonempty, in the sense that it includes some data points,
b) the hyperboxes are disjoint, that is Bi ∩ Bj = ∅ if i ≠ j,
c) the union of all hyperboxes covers all the data, that is
$$X \subseteq \bigcup_{i=1}^{p} B_i$$
where X = {x1, x2, ..., xN}. It is also required that such hyperboxes "cover" some maximal number (say b_max) of data points.

Page 103:

Grid-Based Clustering Steps
- Formation of the grid structure
- Insertion of data into the grid structure
- Computation of the density index of each hyperbox of the grid structure
- Sorting the hyperboxes with respect to the values of their density index
- Identification of cluster centres (viz. the hyperboxes of the highest density)
- Traversal of neighboring hyperboxes and the merging process

Choice of the grid:
- too rough a grid may not help capture the details of the structure in the data
- too detailed a grid produces a significant computational overhead
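A rough sketch of the first few steps (grid formation, data insertion, density index, sorting) using a NumPy histogram as the grid; merging adjacent dense hyperboxes is omitted, and the bin count and density threshold are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.4, (200, 2)), rng.normal(3, 0.4, (200, 2))])

# Grid formation + insertion of data: counts per hyperbox serve as the density index
counts, edges = np.histogramdd(X, bins=(10, 10))

# Sort the hyperboxes by density; the densest ones are candidate cluster centres
flat_order = np.argsort(counts, axis=None)[::-1]
cells = np.column_stack(np.unravel_index(flat_order, counts.shape))
dense = [tuple(c) for c in cells if counts[tuple(c)] >= 20]

print(len(dense), "dense cells; densest cell", tuple(cells[0]),
      "holds", int(counts[tuple(cells[0])]), "points")
```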

Page 104:

Clustering High-Dimensional Data

• Clustering high-dimensional data

– Many applications: text documents, DNA micro-array data

– Major challenges:

• Many irrelevant dimensions may mask clusters

• Distance measure becomes meaningless—due to equi-distance

• Clusters may exist only in some subspaces

• Methods

– Feature transformation: only effective if most dimensions are relevant

• PCA & SVD useful only when features are highly correlated/redundant

– Feature selection: wrapper or filter approaches

• useful to find a subspace where the data have nice clusters

– Subspace-clustering: find clusters in all the possible subspaces

• CLIQUE, ProClus, and frequent pattern-based clustering

Page 105:

CLIQUE (Clustering In QUEst)

• Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98)

• Automatically identifying subspaces of a high dimensional data space that allow better clustering than original space

• CLIQUE can be considered as both density-based and grid-based

– It partitions each dimension into the same number of equal-length intervals

– It partitions an m-dimensional data space into non-overlapping rectangular units

– A unit is dense if the fraction of total data points contained in the unit exceeds the input model parameter

– A cluster is a maximal set of connected dense units within a subspace

Page 106:

CLIQUE: The Major Steps

• Partition the data space and find the number of points that lie inside each cell of the partition
• Identify the subspaces that contain clusters using the Apriori principle
• Identify clusters
  - Determine dense units in all subspaces of interest
  - Determine connected dense units in all subspaces of interest
• Generate a minimal description for the clusters
  - Determine maximal regions that cover a cluster of connected dense units for each cluster
  - Determine a minimal cover for each cluster

Page 107:

[Figure: CLIQUE example with density threshold τ = 3, showing dense units in the (age, salary (×10,000)) and (age, vacation (weeks)) planes, roughly for ages 30-50.]

Page 108:

Strength and Weakness of CLIQUE

• Strength
  - Automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces
  - Insensitive to the order of records in the input and does not presume any canonical data distribution
  - Scales linearly with the size of the input and has good scalability as the number of dimensions in the data increases
• Weakness
  - The accuracy of the clustering result may be degraded at the expense of the simplicity of the method

Page 109:

Reference

• Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques
• Raymond T. Ng and Jiawei Han, CLARANS: A Method for Clustering Objects for Spatial Data Mining