Lecture 10 Clustering


Transcript of Lecture 10 Clustering

Page 1: Lecture 10 Clustering


Lecture 10 Clustering

Page 2: Lecture 10 Clustering


Preview

Introduction
Partitioning methods
Hierarchical methods
Model-based methods
Density-based methods

Page 3: Lecture 10 Clustering


Examples of Clustering Applications

Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs

Land use: Identification of areas of similar land use in an earth observation database

Insurance: Identifying groups of motor insurance policy holders with a high average claim cost

Urban planning: Identifying groups of houses according to their house type, value, and geographical location

Seismology: Observed earthquake epicenters tend to be clustered along continental faults

Page 4: Lecture 10 Clustering


What Is a Good Clustering?

A good clustering method will produce clusters with:
High intra-class similarity
Low inter-class similarity

A precise definition of clustering quality is difficult:
Application-dependent
Ultimately subjective

Page 5: Lecture 10 Clustering


Requirements for Clustering in Data Mining

Scalability
Ability to deal with different types of attributes
Discovery of clusters with arbitrary shape
Minimal domain knowledge required to determine input parameters
Ability to deal with noise and outliers
Insensitivity to the order of input records
Robustness with respect to high dimensionality
Incorporation of user-specified constraints
Interpretability and usability

Page 6: Lecture 10 Clustering


Similarity and Dissimilarity Between Objects

Euclidean distance (the Minkowski distance with exponent 2):

d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}

Properties of a metric d(i,j):
d(i,j) ≥ 0
d(i,i) = 0
d(i,j) = d(j,i)
d(i,j) ≤ d(i,k) + d(k,j)
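To make the formula concrete, here is a minimal Python sketch (not part of the original slides); the sample points are invented so that the four metric properties can be spot-checked:

```python
import math

def euclidean(x, y):
    """Euclidean distance between two p-dimensional points."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Invented sample points for spot-checking the metric properties:
a, b, c = (0.0, 0.0), (3.0, 4.0), (6.0, 0.0)
assert euclidean(a, b) == 5.0                                # d(i,j) >= 0
assert euclidean(a, a) == 0.0                                # d(i,i) = 0
assert euclidean(a, b) == euclidean(b, a)                    # symmetry
assert euclidean(a, c) <= euclidean(a, b) + euclidean(b, c)  # triangle inequality
```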

Page 7: Lecture 10 Clustering


Major Clustering Approaches

Partitioning: Construct various partitions and then evaluate them by some criterion

Hierarchical: Create a hierarchical decomposition of the set of objects using some criterion

Model-based: Hypothesize a model for each cluster and find the best fit of the models to the data

Density-based: Guided by connectivity and density functions

Page 8: Lecture 10 Clustering


Partitioning Algorithms

Partitioning method: Construct a partition of a database D of n objects into a set of k clusters

Given k, find the partition into k clusters that optimizes the chosen partitioning criterion:
Global optimum: exhaustively enumerate all partitions
Heuristic methods: the k-means and k-medoids algorithms

k-means (MacQueen, 1967): Each cluster is represented by the center (mean) of the cluster

k-medoids or PAM, Partitioning Around Medoids (Kaufman & Rousseeuw, 1987): Each cluster is represented by one of the objects in the cluster
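The difference between the two representatives can be shown in a few lines of NumPy; this is an illustrative sketch, and the helper names are my own, not from the lecture:

```python
import numpy as np

def centroid(cluster):
    """k-means representative: the mean, which need not be a data object."""
    return cluster.mean(axis=0)

def medoid(cluster):
    """k-medoids/PAM representative: the actual object with the smallest
    total distance to all other objects in the cluster."""
    d = np.linalg.norm(cluster[:, None, :] - cluster[None, :, :], axis=2)
    return cluster[d.sum(axis=1).argmin()]

pts = np.array([[1.0, 1.0], [1.5, 2.0], [9.0, 9.0]])  # includes one outlier
print(centroid(pts))  # [3.83... 4.0], pulled toward the outlier
print(medoid(pts))    # [1.5 2.0], stays on a real data point
```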

Page 9: Lecture 10 Clustering


K-Means Clustering

Given k, the k-means algorithm consists of four steps:

1. Select k initial centroids at random.
2. Assign each object to the cluster with the nearest centroid.
3. Recompute each centroid as the mean of the objects assigned to it.
4. Repeat steps 2 and 3 until nothing changes.
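A compact NumPy sketch of these four steps follows; it is a bare-bones illustration (initialization by sampling k objects, no handling of empty clusters), not production code:

```python
import numpy as np

def kmeans(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: select k initial centroids at random (here: k distinct objects).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    while True:
        # Step 2: assign each object to the cluster with the nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned objects.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: repeat steps 2 and 3 until the centroids stop changing.
        if np.allclose(new_centroids, centroids):
            return labels, centroids
        centroids = new_centroids
```

Note that the result depends on the randomly chosen initial centroids, which is why the algorithm often terminates at a local optimum, as discussed below.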

Page 10: Lecture 10 Clustering

k-Means [MacQueen ’67]

Pages 11-18: Lecture 10 Clustering

[Figure-only slides: step-by-step k-means example]

Another example (k=3)

[Figure: six scatter plots (x vs. y) titled Iteration 1 through Iteration 6, showing the k = 3 clustering converging]

Page 19: Lecture 10 Clustering


K-Means Clustering (contd.)

Example

[Figure: four scatter plots on 0-10 axes showing successive steps of the k-means example]

Page 20: Lecture 10 Clustering


Comments on the K-Means Method

Strengths:
Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n
Often terminates at a local optimum; the global optimum may be found using techniques such as simulated annealing and genetic algorithms

Weaknesses:
Applicable only when the mean is defined (what about categorical data?)
Need to specify k, the number of clusters, in advance
Trouble with noisy data and outliers
Not suited to discovering clusters with non-convex shapes

Page 21: Lecture 10 Clustering


Hierarchical Clustering

Uses the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it does need a termination condition.

[Figure: objects a, b, c, d, e merged step by step (Step 0 through Step 4) into {a, b}, {d, e}, {c, d, e}, and finally {a, b, c, d, e}; agglomerative clustering (AGNES) proceeds left to right, divisive clustering (DIANA) proceeds right to left]
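A short sketch using SciPy's agglomerative implementation (the data set and the 0.5 cut-off height are invented for illustration); the fcluster call performs the dendrogram cut discussed on the cut-off slide below:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.1, 0.1],   # two tight pairs of points
              [1.0, 1.0], [1.1, 0.9]])
Z = linkage(X, method='single')          # agglomerative (AGNES-style), single-link
labels = fcluster(Z, t=0.5, criterion='distance')  # terminate: cut at height 0.5
print(labels)  # [1 1 2 2]: two clusters remain below the cut-off
```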

Page 22: Lecture 10 Clustering

Simple example

[Figure: six numbered points in the plane and the corresponding dendrogram, with leaves ordered 1, 3, 2, 5, 4, 6 and merge heights between 0.05 and 0.2]

Page 23: Lecture 10 Clustering

Cut-off point

Cutting the dendrogram at a chosen height (the cut-off point) turns the hierarchy into a flat set of clusters.

Page 24: Lecture 10 Clustering

Single link example

Pages 25-32: Lecture 10 Clustering

[Figure-only slides: step-by-step single-link example]
Page 33: Lecture 10 Clustering

When are two clusters "close"? How should inter-cluster similarity be defined?

Page 34: Lecture 10 Clustering

MIN (single-link)

Page 35: Lecture 10 Clustering

MAX (complete-link)

Page 36: Lecture 10 Clustering

Distance Between Centroids

Page 37: Lecture 10 Clustering

Group Average (average-link)
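The four inter-cluster measures just listed can be written as small NumPy helpers; this is an illustrative sketch with invented function names:

```python
import numpy as np

def pairwise(A, B):
    """All Euclidean distances between objects of clusters A and B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def d_min(A, B):        # MIN (single-link)
    return pairwise(A, B).min()

def d_max(A, B):        # MAX (complete-link)
    return pairwise(A, B).max()

def d_centroid(A, B):   # distance between centroids
    return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))

def d_avg(A, B):        # group average (average-link)
    return pairwise(A, B).mean()
```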

Page 38: Lecture 10 Clustering

What is the density-based clustering problem?

Clusters of arbitrary spatial shape

Desired properties:
Good efficiency on large databases
Minimal domain knowledge required to determine the input parameters

Intuition: a notion of density. Clusters are dense regions of objects separated by regions of low density.
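One way to make the density intuition concrete, in the spirit of DBSCAN rather than anything stated on the slide, is to count the objects in an epsilon-neighborhood of each point; the eps value and the data below are invented:

```python
import numpy as np

def neighborhood_density(X, eps):
    """Number of objects within distance eps of each object (itself included)."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    return (dists <= eps).sum(axis=1)

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.1], [5.0, 5.0]])
print(neighborhood_density(X, eps=0.5))  # [3 3 3 1]: the isolated point has low density
```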