Dbm630 lecture09


Transcript of Dbm630 lecture09

Page 1: Dbm630 lecture09

DBM630: Data Mining and

Data Warehousing

MS.IT. Rangsit University

1

Semester 2/2011

Lecture 9

Clustering

by Kritsada Sriphaew (sriphaew.k AT gmail.com)

Page 2: Dbm630 lecture09

Topics

2

What is Cluster Analysis?

Types of Attributes in Cluster Analysis

Major Clustering Approaches

Partitioning Algorithms

Hierarchical Algorithms

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 3: Dbm630 lecture09

Clustering Analysis 3

Classification vs. Clustering

Classification: Supervised learning

Learns a method for predicting the instance class from pre-labeled (classified) instances

Page 4: Dbm630 lecture09

4

Clustering

Unsupervised learning:

Finds “natural” grouping of instances given un-labeled data

Clustering Analysis

Page 5: Dbm630 lecture09

Clustering Analysis 5

Clustering Methods

Many different methods and algorithms:

For numeric and/or symbolic data

Deterministic vs. probabilistic

Exclusive vs. overlapping

Hierarchical vs. flat

Top-down vs. bottom-up

Page 6: Dbm630 lecture09

Clustering Analysis 6

What is Cluster Analysis?

Cluster: a collection of data objects

High similarity of objects within a cluster

Low similarity of objects across clusters

Cluster analysis

Grouping a set of data objects into clusters

Clustering is an unsupervised classification: no predefined classes

Typical applications

As a stand-alone tool to get insight into data distribution

As a preprocessing step for other algorithms

Page 7: Dbm630 lecture09

Clustering Analysis 7

General Applications of Clustering

Pattern Recognition

In biology, derive plant and animal taxonomies, categorize genes

Spatial Data Analysis

create thematic maps in GIS by clustering feature spaces

detect spatial clusters and explain them in spatial data mining

Image Processing

Economic Science (e.g., market research): discovering distinct groups in customer bases and characterizing customer groups based on purchasing patterns

WWW

Document classification

Cluster Weblog data to discover groups of similar access patterns

Page 8: Dbm630 lecture09

Clustering Analysis 8

Examples of Clustering Applications

Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs

Land use: Identification of areas of similar land use in an earth observation database

Insurance: Identifying groups of motor insurance policy holders with a high average claim cost

City-planning: Identifying groups of houses according to their house type, value, and geographical location

Earthquake studies: Observed earthquake epicenters should be clustered along continental faults

Page 9: Dbm630 lecture09

Clustering Analysis 9

Clustering Evaluation

Manual inspection

Benchmarking on existing labels

Cluster quality measures

distance measures

high similarity within a cluster, low across clusters

Page 10: Dbm630 lecture09

Clustering Analysis 10

Criteria for Clustering

A good clustering method will produce high-quality clusters with

high intra-class similarity

low inter-class similarity

The quality of a clustering result depends on:

both the similarity measure used by the method and its implementation

ability to discover some or all of the hidden patterns

Page 11: Dbm630 lecture09

Clustering Analysis 11

Requirements of Clustering in Data Mining

Scalability

Ability to deal with different types of attributes

Discovery of clusters with arbitrary shape

Minimal requirements for domain knowledge to determine input parameters

Able to deal with noise and outliers

Insensitive to order of input records

High dimensionality

Incorporation of user-specified constraints

Interpretability and usability

Page 12: Dbm630 lecture09

Clustering Analysis 12

The distance function

Simplest case: one numeric attribute A

Distance(X,Y) = |A(X) – A(Y)|

Several numeric attributes:

Distance(X,Y) = Euclidean distance between X,Y

Nominal attributes: distance is set to 1 if values are different, 0 if they are equal

Are all attributes equally important?

Weighting the attributes might be necessary

Page 13: Dbm630 lecture09

Clustering Analysis 13

From Data Matrix to Similarity or Dissimilarity Matrices

Data matrix (or object-by-attribute structure)

m objects with n attributes, e.g., relational data

Similarity and dissimilarity matrices: a collection of proximities for all pairs of the m objects.

Data matrix (m objects × n attributes):

    x11  …  x1f  …  x1n
    …       …       …
    xi1  …  xif  …  xin
    …       …       …
    xm1  …  xmf  …  xmn

Dissimilarity matrix (m × m; d(i,j) = d(j,i), d(i,i) = 0):

    0
    d(2,1)   0
    d(3,1)   d(3,2)   0
    …        …        …
    d(m,1)   d(m,2)   …   0

Similarity matrix (m × m; s(i,j) = s(j,i), s(i,i) = 1):

    1
    s(2,1)   1
    s(3,1)   s(3,2)   1
    …        …        …
    s(m,1)   s(m,2)   …   1

Page 14: Dbm630 lecture09

Clustering Analysis 14

Distance Functions (Overview)

To transform a data matrix to similarity or dissimilarity matrices, we need a definition of distance.

Some definitions of distance functions depend on the type of attributes:

interval-scaled attributes

Boolean attributes

nominal, ordinal and ratio attributes.

Weights should be associated with different attributes based on applications and data semantics.

It is hard to define “similar enough” or “good enough”; the answer is typically highly subjective.

Page 15: Dbm630 lecture09

Clustering Analysis 15

Similarity and Dissimilarity Between Objects (I)

Distances are normally used to measure the similarity or dissimilarity between two data objects

Some popular ones include the Minkowski distance:

d(i,j) = (|xi1 − xj1|^q + |xi2 − xj2|^q + … + |xip − xjp|^q)^(1/q)

where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and q is a positive integer.

If q = 1, d is the Manhattan distance:

d(i,j) = |xi1 − xj1| + |xi2 − xj2| + … + |xip − xjp|

Page 16: Dbm630 lecture09

Clustering Analysis 16

Similarity and Dissimilarity Between Objects (II)

If q = 2, d is Euclidean distance:

Properties

d(i,j) >= 0

d(i,i) = 0

d(i,j) = d(j,i)

d(i,j) <= d(i,k) + d(k,j)

Also one can use a weighted distance, the parametric Pearson product-moment correlation, or other dissimilarity measures.

Euclidean distance (q = 2):

d(i,j) = sqrt(|xi1 − xj1|² + |xi2 − xj2|² + … + |xip − xjp|²)

Weighted Euclidean distance:

d(i,j) = sqrt(w1|xi1 − xj1|² + w2|xi2 − xj2|² + … + wp|xip − xjp|²)
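To make the distance formulas above concrete, here is a small illustrative Python sketch (not part of the original slides); the function name and sample vectors are made up for the example.

```python
# Illustrative sketch of the Minkowski distance family (q = 1: Manhattan, q = 2: Euclidean).
def minkowski_distance(x, y, q=2):
    """d(i,j) = (sum_k |x_k - y_k|^q)^(1/q) for two p-dimensional objects."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

xi = [1.0, 3.0, 5.0]
xj = [4.0, 1.0, 5.0]
print(minkowski_distance(xi, xj, q=1))  # Manhattan distance: 5.0
print(minkowski_distance(xi, xj, q=2))  # Euclidean distance: ~3.606
```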

Page 17: Dbm630 lecture09

Clustering Analysis: Types of Attributes 17

Types of Attributes in Clustering

Interval-scaled attributes

Continuous measures of a roughly linear scale

Binary attributes

Two-state measures: 0 or 1

Nominal, ordinal, and ratio attributes

More than two states, nominal or ordinal or nonlinear scale

Mixed types

Mixture of interval-scaled, symmetric binary, asymmetric binary, nominal, ordinal, or ratio-scaled attributes

Page 18: Dbm630 lecture09

18

Interval-valued Attributes

Standardize data

Calculate the mean absolute deviation:

s_f = (1/n)(|x1f − mf| + |x2f − mf| + … + |xnf − mf|)

where mf = (1/n)(x1f + x2f + … + xnf)

Calculate the standardized measurement (mean-absolute-based z-score):

z_if = (x_if − mf) / s_f

Using the mean absolute deviation is more robust than using the standard deviation, since the z-scores of outliers do not become too small; hence, the outliers remain detectable.

Clustering Analysis: Types of Attributes
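As an illustration of the standardization above, the following Python sketch (not from the slides) computes the mean-absolute-deviation z-scores for a small made-up list of values containing an outlier.

```python
# Sketch of the mean-absolute-deviation z-score for one interval-scaled attribute f.
def standardize(values):
    n = len(values)
    m_f = sum(values) / n                          # mean of attribute f
    s_f = sum(abs(x - m_f) for x in values) / n    # mean absolute deviation
    return [(x - m_f) / s_f for x in values]       # z_if = (x_if - m_f) / s_f

print(standardize([1, 3, 5, 7, 1009]))  # the outlier 1009 stays clearly visible
```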

Page 19: Dbm630 lecture09

19

Binary Attributes

A contingency table for binary data (objects i and j):

                 Object j
                  1     0     sum
    Object i  1   a     b     a+b
              0   c     d     c+d
            sum  a+c   b+d     p

Simple matching coefficient (invariant, if the binary variable is symmetric):

d(i,j) = (b + c) / (a + b + c + d)

Jaccard coefficient (noninvariant if the binary variable is asymmetric):

d(i,j) = (b + c) / (a + b + c)

A binary variable contains two possible outcomes: 1 (positive/present) or 0 (negative/absent).

If there is no preference for which outcome should be coded as 0 and which as 1, the binary variable is called symmetric.

If the outcomes of a binary variable are not equally important, the binary variable is called asymmetric, such as "is color-blind" for a human being. The most important outcome is usually coded as 1 (present) and the other is coded as 0 (absent).

Clustering Analysis: Types of Attributes

Page 20: Dbm630 lecture09

20

Dissimilarity on Binary Attributes: Example

Gender is a symmetric attribute; the remaining attributes are asymmetric binary. Here, let the values Y and P be set to 1, and the value N be set to 0. Then calculate using only the asymmetric binary attributes.

Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4

Jack M Y N P N N N

Mary F Y N P N P N

Jim M Y P N N N N

d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33

d(jack, jim) = (1 + 1) / (1 + 1 + 1) = 0.67

d(jim, mary) = (1 + 2) / (1 + 1 + 2) = 0.75

Clustering Analysis: Types of Attributes
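The example above can be checked with a short Python sketch (added here for illustration); the 0/1 encodings follow the Y/P → 1, N → 0 convention stated on the slide, and gender is excluded.

```python
# Sketch reproducing the asymmetric-binary example (negative matches d are ignored).
def asym_binary_dissim(i, j):
    a = sum(1 for x, y in zip(i, j) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(i, j) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(i, j) if x == 0 and y == 1)
    return (b + c) / (a + b + c)

jack = [1, 0, 1, 0, 0, 0]   # Fever, Cough, Test-1 .. Test-4
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]
print(round(asym_binary_dissim(jack, mary), 2))  # 0.33
print(round(asym_binary_dissim(jack, jim), 2))   # 0.67
print(round(asym_binary_dissim(jim, mary), 2))   # 0.75
```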

Page 21: Dbm630 lecture09

21

Nominal Attributes

A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green

Method 1: Simple matching

d(i,j) = (p − m) / p

where m is the # of matches and p is the total # of nominal attributes

Method 2: Use a large number of binary variables

creating a new binary variable for each of the M nominal states

Clustering Analysis: Types of Attributes

Page 22: Dbm630 lecture09

22

Ordinal Attributes

An ordinal variable can be discrete or continuous

order is important, e.g., rank

Can be treated like interval-scaled:

replace xif by its rank rif ∈ {1, …, Mf}

map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

zif = (rif − 1) / (Mf − 1)

compute the dissimilarity using methods for interval-scaled variables

Clustering Analysis: Types of Attributes

Page 23: Dbm630 lecture09

23

Ratio-Scaled Attributes

Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as A·e^(Bt) or A·e^(−Bt)

Methods:

(1) treat them like interval-scaled attributes

not a good choice!

(2) apply logarithmic transformation

yif = log(xif)

(3) treat them as continuous ordinal data and

treat their rank as interval-scaled.

Clustering Analysis: Types of Attributes

Page 24: Dbm630 lecture09

24

Mixed Types

A database may contain all the six types of attributes

symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio.

One may use a weighted formula to combine their effects.

f is binary or nominal:

dij(f) = 0 if xif = xjf , or dij(f) = 1 otherwise

f is interval-based: use the normalized distance

f is ordinal or ratio-scaled

compute ranks rif and

treat (normalized) zif as interval-scaled

d(i,j) = [ Σ(f=1..p) δij(f) · dij(f) ] / [ Σ(f=1..p) δij(f) ]

where δij(f) = 1 if both measurements xif and xjf are present (and not an asymmetric-binary double zero), and 0 otherwise

Clustering Analysis: Types of Attributes
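A possible Python sketch of this weighted mixed-type formula is shown below (illustrative only; the attribute types, ranges, and example objects are invented for the example).

```python
# Sketch of the mixed-type formula d(i,j) = sum_f delta * d_f / sum_f delta.
def mixed_dissim(obj_i, obj_j, types, ranges):
    num, den = 0.0, 0.0
    for f, (xi, xj) in enumerate(zip(obj_i, obj_j)):
        if xi is None or xj is None:          # delta = 0 when a value is missing
            continue
        if types[f] in ("binary", "nominal"):
            d_f = 0.0 if xi == xj else 1.0
        else:                                  # interval (or rank-normalized ordinal/ratio)
            d_f = abs(xi - xj) / ranges[f]     # normalized distance
        num += d_f
        den += 1.0
    return num / den if den else 0.0

# two objects with (nominal, interval, interval) attributes
print(mixed_dissim(("red", 10.0, 3.0), ("blue", 30.0, 3.0),
                   ("nominal", "interval", "interval"),
                   {1: 100.0, 2: 10.0}))   # (1 + 0.2 + 0) / 3 = 0.4
```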

Page 25: Dbm630 lecture09

Clustering Analysis: Clustering Approaches 25

Major Clustering Approaches

Partitioning algorithms: Construct various partitions and then evaluate them by some criterion

Hierarchy algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion

Density-based: based on connectivity and density functions

Grid-based: based on a multiple-level granularity structure

Model-based: A model is hypothesized for each of the clusters and the idea is to find the best fit of the model to the data

Page 26: Dbm630 lecture09

Clustering Analysis: Partitioning Algorithms 26

Partitioning Approach

Construct a partition of a database D of n objects into a set of k clusters

Given k, find a partition of k clusters that optimizes the chosen partitioning criterion

Global optimal: exhaustively enumerate all partitions

Heuristic methods: k-means and k-medoids algorithms

k-means (MacQueen’67): Each cluster is represented by the center of the cluster

k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87): Each cluster is represented by one of the objects in the cluster

Page 27: Dbm630 lecture09

27

The K-Means Clustering Method (Overview)

Given k, the k-means algorithm is implemented in 4 steps:

Partition objects into k nonempty subsets

Compute seed points as the centroids of the clusters of the current partition. The centroid is the center (mean point) of the cluster.

Assign each object to the cluster with the nearest seed point.

Go back to Step 2; stop when there are no new assignments.

Clustering Analysis: Partitioning Algorithms
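A minimal Python sketch of these four steps is given below (illustrative, not the lecture's own code); it uses the common variant that starts from k randomly chosen seed points rather than a random initial partition.

```python
import random

# Minimal k-means sketch following the steps above (random initial seeds).
def kmeans(points, k, max_iter=100):
    centers = random.sample(points, k)                       # initial seeds
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                                      # assign to nearest seed
            idx = min(range(k), key=lambda c: sum((a - b) ** 2
                      for a, b in zip(p, centers[c])))
            clusters[idx].append(p)
        new_centers = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[i]
                       for i, cl in enumerate(clusters)]      # recompute centroids
        if new_centers == centers:                            # stop when stable
            break
        centers = new_centers
    return centers, clusters

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
print(kmeans(pts, k=2))
```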

Page 28: Dbm630 lecture09

28

The K-Means Clustering Method (A Graphical Example)

[Four scatter plots on a 0–10 × 0–10 grid showing objects being partitioned, centroids recomputed, and objects reassigned over successive iterations]

Clustering Analysis: Partitioning Algorithms

Page 29: Dbm630 lecture09

29

K-means example, step 1

Given k = 3, pick 3 initial cluster centers k1, k2, k3 (randomly). [Scatter plot of the points in the X–Y plane with the three centers marked]

Clustering Analysis: Partitioning Algorithms

Page 30: Dbm630 lecture09

30

K-means example, step 2

Assign each point to the closest cluster center. [Scatter plot with points grouped by their nearest center k1, k2, k3]

Clustering Analysis: Partitioning Algorithms

Page 31: Dbm630 lecture09

31

K-means example, step 3

Move each cluster center to the mean of its cluster. [Scatter plot showing k1, k2, k3 shifted to the cluster means]

Clustering Analysis: Partitioning Algorithms

Page 32: Dbm630 lecture09

32

K-means example, step 4

Reassign points closest to a different new cluster center. Q: Which points are reassigned? [Scatter plot with the updated centers k1, k2, k3]

Clustering Analysis: Partitioning Algorithms

Page 33: Dbm630 lecture09

33

K-means example, step 4 …

A: three points are reassigned. [Scatter plot highlighting the three points that move between clusters]

Clustering Analysis: Partitioning Algorithms

Page 34: Dbm630 lecture09

34

K-means example, step 4b

Re-compute the cluster means. [Scatter plot with the new means of the updated clusters]

Clustering Analysis: Partitioning Algorithms

Page 35: Dbm630 lecture09

35

K-means example, step 5

Move the cluster centers to the cluster means. [Scatter plot with k1, k2, k3 at their final positions; the algorithm has converged]

Clustering Analysis: Partitioning Algorithms

Page 36: Dbm630 lecture09

36

Problems to be considered

What can be the problems with k-means clustering?

Result can vary significantly depending on the initial choice of seeds (number and position)

Can get trapped in a local minimum

Example: [a set of instances with poorly placed initial cluster centers]

Q: What can be done?

A: To increase the chance of finding the global optimum, restart with different random seeds.

What can be done about outliers?

Clustering Analysis: Partitioning Algorithms

Page 37: Dbm630 lecture09

37

The K-Means Clustering Method (Strength and Weakness)

Strength

Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n

Good for finding clusters with spherical shapes

Often terminates at a local optimum. The global optimum may be found using techniques such as: deterministic annealing and genetic algorithms

Weakness

Applicable only when mean is defined, then what about categorical data?

Need to specify k, no. of clusters, in advance

Unable to handle noisy data and outliers

Not suitable to discover clusters with non-convex shapes

Clustering Analysis: Partitioning Algorithms

Page 38: Dbm630 lecture09

38

The K-Means Clustering Method (Variations – I)

A few variants of the k-means which differ in

Selection of the initial k means

Dissimilarity calculations

Strategies to calculate cluster means

K-medoids – instead of the mean, use the median of each cluster

For large databases, use sampling

Handling categorical data: k-modes (Huang’98)

Replacing means of clusters with modes

Using new dissimilarity measures to deal with categorical objects

Using a frequency-based method to update modes of clusters

A mixture of categorical/numerical data: k-prototype method

Mean of 1, 3, 5, 7, 9 is 5

Mean of 1, 3, 5, 7, 1009 is 205

Median of 1, 3, 5, 7, 1009 is 5

Median advantage: not affected by extreme values

Clustering Analysis: Partitioning Algorithms

Page 39: Dbm630 lecture09

39

The K-Medoids Clustering Method (Overview)

Find representative objects, called medoids, in clusters

PAM (Partitioning Around Medoids, 1987)

starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering

PAM works effectively for small data sets, but does not scale well for large data sets

CLARA (Kaufmann & Rousseeuw, 1990)

CLARANS (Ng & Han, 1994): Randomized sampling

Clustering Analysis: Partitioning Algorithms

Page 40: Dbm630 lecture09

40

The K-Medoids Clustering Method (PAM - Partitioning Around Medoids)

PAM (Kaufman and Rousseeuw, 1987), built in Splus

Use real object to represent the cluster

Select k representative objects arbitrarily

For each pair of non-selected object h and selected object i, calculate the total swapping cost Sih

For each pair of i and h, If Sih < 0, i is replaced by h

Then assign each non-selected object to the most similar representative object

repeat steps 2-3 until there is no change

Clustering Analysis: Partitioning Algorithms

Page 41: Dbm630 lecture09

PAM example

Clustering Analysis 41

Cluster the following data set of ten objects into two clusters, i.e., k = 2.

Point  X  Y
X1     2  6
X2     3  4
X3     3  8
X4     4  7
X5     6  2
X6     6  4
X7     7  3
X8     7  4
X9     8  5
X10    7  6

Page 42: Dbm630 lecture09

PAM example, step 1

Clustering Analysis 42

Initialise the k medoids. Let us assume c1 = (3,4) and c2 = (7,4).

Calculate the distances so as to associate each data object with its nearest medoid. Assume that cost is calculated using the Minkowski distance metric with r = 1 (Manhattan distance).

Costs with respect to c1 = (3,4):

  Data object (Xi)   Cost (distance)
  (2,6)              3
  (3,8)              4
  (4,7)              4
  (6,2)              5
  (6,4)              3
  (7,3)              5
  (8,5)              6
  (7,6)              6

Costs with respect to c2 = (7,4):

  Data object (Xi)   Cost (distance)
  (2,6)              7
  (3,8)              8
  (4,7)              6
  (6,2)              3
  (6,4)              1
  (7,3)              1
  (8,5)              2
  (7,6)              2

Page 43: Dbm630 lecture09

PAM example, step 1b

Clustering Analysis 43

Then the clusters become:

Cluster1 = {(3,4), (2,6), (3,8), (4,7)}

Cluster2 = {(7,4), (6,2), (6,4), (7,3), (8,5), (7,6)}

Total cost = 3 + 4 + 4 + 3 + 1 + 1 + 2 + 2 = 20

(The cost tables for c1 and c2 from step 1 are repeated on this slide.)

Page 44: Dbm630 lecture09

PAM example, step 2

Clustering Analysis 44

Randomly select a non-medoid object O′. Let us assume O′ = (7,3). So now the medoids are c1 = (3,4) and O′ = (7,3).

Calculate the cost of the new medoid configuration using the formula from step 1: Total cost = 3 + 4 + 4 + 2 + 2 + 1 + 3 + 3 = 22

Costs with respect to c1 = (3,4): (unchanged from step 1)

Costs with respect to O′ = (7,3):

  Data object (Xi)   Cost (distance)
  (2,6)              8
  (3,8)              9
  (4,7)              7
  (6,2)              2
  (6,4)              2
  (7,4)              1
  (8,5)              3
  (7,6)              3

Page 45: Dbm630 lecture09

PAM example, step 2b

Clustering Analysis 45

So the cost of swapping the medoid from c2 to O′ is

S = current total cost − past total cost = 22 − 20 = 2 > 0

So moving to O′ would be a bad idea; the previous choice was good, and the algorithm terminates here (i.e., there is no change in the medoids).

Note that some data points may shift from one cluster to another depending on their closeness to the medoids.
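The whole worked example can be reproduced with a short Python sketch (added for illustration); it recomputes the Manhattan-distance costs for the two medoid configurations and the swap cost S.

```python
# Sketch reproducing the PAM cost calculation above (Manhattan distance, k = 2).
points = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2), (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def total_cost(medoids):
    # each non-medoid object is assigned to its nearest medoid
    return sum(min(manhattan(p, m) for m in medoids)
               for p in points if p not in medoids)

print(total_cost([(3, 4), (7, 4)]))   # 20  (current medoids c1, c2)
print(total_cost([(3, 4), (7, 3)]))   # 22  (swap c2 with O' = (7,3))
# swap cost S = 22 - 20 = 2 > 0, so the swap is rejected
```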

Page 46: Dbm630 lecture09

46

CLARA (Clustering Large Applications) (1990)

CLARA (Kaufmann and Rousseeuw in 1990)

Built into statistical analysis packages, such as S+

It draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output

Strength: deals with larger data sets than PAM

Weakness:

Efficiency depends on the sample size

A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased

Clustering Analysis: Partitioning Algorithms

Page 47: Dbm630 lecture09

47

CLARANS (“Randomized” CLARA) (1994)

CLARANS (A Clustering Algorithm based on Randomized Search) (Ng and Han '94)

CLARANS draws a sample of neighbors dynamically

The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of k medoids

If a local optimum is found, CLARANS starts with a new randomly selected node in search of a new local optimum

It is more efficient and scalable than both PAM and CLARA

Focusing techniques and spatial access structures may further improve its performance (Ester et al. '95)

Clustering Analysis: Partitioning Algorithms

Page 48: Dbm630 lecture09

48

The Partition-Based Clustering (Discussion)

Result can vary significantly based on initial choice of seeds

Algorithm can get trapped in a local minimum

Example: four instances at the vertices of a two-dimensional rectangle

Local minimum: two cluster centers at the midpoints of the rectangle’s long sides

Simple way to increase chance of finding a global optimum: restart with different random seeds

Clustering Analysis: Hierarchical Algorithms

Page 49: Dbm630 lecture09

49

Hierarchical Clustering

Use the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition.

[Diagram: objects a, b, c, d, e are merged step by step (Step 0 → Step 4) into clusters {a,b}, {d,e}, {c,d,e}, {a,b,c,d,e} in the agglomerative direction (AGNES), and split in the reverse direction (Step 4 → Step 0) in the divisive direction (DIANA)]

Clustering Analysis: Hierarchical Algorithms

Page 50: Dbm630 lecture09

50

AGNES (Agglomerative Nesting)

Introduced in Kaufmann and Rousseeuw (1990)

Implemented in statistical analysis packages, e.g., Splus

Use the Single-Link method and the dissimilarity matrix.

Merge nodes that have the least dissimilarity

Go on in a non-descending fashion

Eventually all nodes belong to the same cluster

[Three scatter plots on a 0–10 × 0–10 grid showing clusters being merged step by step]

Clustering Analysis: Hierarchical Algorithms

Page 51: Dbm630 lecture09

51

Dendrogram for Hierarchical Clustering

Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.

A clustering of the data objects is obtained by cutting the dendrogram at the desired level; then each connected component forms a cluster.

Clustering Analysis: Hierarchical Algorithms

Page 52: Dbm630 lecture09

52

DIANA - Divisive Analysis

Introduced in Kaufmann and Rousseeuw (1990)

Implemented in statistical analysis packages, e.g., Splus

Inverse order of AGNES

Eventually each node forms a cluster on its own

[Three scatter plots on a 0–10 × 0–10 grid showing one cluster being split step by step until each object stands alone]

Clustering Analysis: Hierarchical Algorithms

Page 53: Dbm630 lecture09

53

Hierarchical Clustering

Major weaknesses of agglomerative clustering methods:

do not scale well: time complexity of at least O(n²), where n is the total number of objects

can never undo what was done previously

Clustering Analysis: Hierarchical Algorithms

Page 54: Dbm630 lecture09

54

Distances - Hierarchical Clustering (Overview)

Four measures for the distance between clusters are:

Single linkage (minimum distance): d_min(Ci, Cj) = min { |p − p′| : p ∈ Ci, p′ ∈ Cj }

Complete linkage (maximum distance): d_max(Ci, Cj) = max { |p − p′| : p ∈ Ci, p′ ∈ Cj }

Centroid comparison (mean distance): d_mean(Ci, Cj) = |mi − mj|, where mi and mj are the cluster means

Element comparison (average distance): d_avg(Ci, Cj) = (1 / (ni·nj)) Σ(p ∈ Ci) Σ(p′ ∈ Cj) |p − p′|

Clustering Analysis: Hierarchical Algorithms
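For illustration, the four inter-cluster distances can be written as small Python functions (not from the slides); clusters are lists of coordinate tuples and the base distance is Euclidean.

```python
# Sketch of the four inter-cluster distance measures (Euclidean base distance).
from math import dist  # Python 3.8+

def d_min(ci, cj):   # single linkage
    return min(dist(p, q) for p in ci for q in cj)

def d_max(ci, cj):   # complete linkage
    return max(dist(p, q) for p in ci for q in cj)

def d_mean(ci, cj):  # centroid comparison
    mi = [sum(col) / len(ci) for col in zip(*ci)]
    mj = [sum(col) / len(cj) for col in zip(*cj)]
    return dist(mi, mj)

def d_avg(ci, cj):   # element (average) comparison
    return sum(dist(p, q) for p in ci for q in cj) / (len(ci) * len(cj))
```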

Page 55: Dbm630 lecture09

55

Distances - Hierarchical Clustering (Graphical Representation)

Four measures for the distance between clusters are (1) single linkage, (2) complete linkage, (3) centroid comparison, and (4) element comparison: the average distance among all elements in two clusters.

[Diagram: three clusters with their centroids marked (x); arrows illustrate the single-link, complete-link, and centroid distances between Cluster 1, Cluster 2, and Cluster 3]

Clustering Analysis: Hierarchical Algorithms

Page 56: Dbm630 lecture09

Practice

Clustering Analysis 56

Use single and complete link agglomerative clustering to group the data described by the following distance matrix. Show the dendrograms.

A B C D

A 0 1 4 5

B 0 2 6

C 0 3

D 0

Page 57: Dbm630 lecture09

Clustering Analysis 57

Other Clustering Methods

Incremental Clustering

Probability-based Clustering, Bayesian Clustering

EM Algorithm

Cluster Schemes:

Partitioning Methods

Hierarchical Methods

Density-Based Methods

Grid-Based Methods

Model-Based Clustering Methods

Page 58: Dbm630 lecture09

Clustering Analysis 58

Advanced Method: BIRCH (Overview)

Balanced Iterative Reducing and Clustering using Hierarchies [Tian Zhang, Raghu Ramakrishnan, Miron Livny, 1996]

Incremental, hierarchical, one scan

Save clustering information in a tree

Each entry in the tree contains information about one cluster

New nodes are inserted in the closest entry in the tree

Only works with "metric" attributes: must have Euclidean coordinates

Designed for very large data sets

Time and memory constraints are explicit

Treats dense regions of data points as sub-clusters

Not all data points are important for clustering

Only one scan of the data is necessary

Page 59: Dbm630 lecture09

Clustering Analysis 59

BIRCH (Merits)

Incremental, distance-based approach

Decisions are made without scanning all data points, or all currently existing clusters

Does not need the whole data set in advance

Unique approach: distance-based algorithms generally need all the data points to work

Makes best use of available memory while minimizing I/O costs

Does not assume that the probability distributions of the attributes are independent

Page 60: Dbm630 lecture09

Clustering Analysis 60

BIRCH – Clustering Feature and Clustering Feature Tree

BIRCH introduces two concepts, the clustering feature and the clustering feature tree (CF Tree), which are used to summarize cluster representations.

These structures help the clustering method achieve good speed and scalability in large databases and make it effective for incremental and dynamic clustering of incoming objects.

Given n d-dimensional data objects or points in a cluster, we can define the centroid x0, radius R and diameter D of the cluster as follows:

Page 61: Dbm630 lecture09

Clustering Analysis 61

BIRCH – Centroid, Radius and Diameter

Given a cluster of instances, we define:

Centroid: the center of a cluster

Radius: average distance from member points to the centroid

Diameter: average pairwise distance within a cluster
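As a sketch of the usual definitions (the slide's formulas were not captured in this transcript), the centroid, radius, and diameter of a cluster of points can be computed as follows in Python; this assumes the standard BIRCH definitions.

```python
from math import sqrt

# Sketch of the usual BIRCH definitions for a cluster of n d-dimensional points.
def centroid(pts):
    n = len(pts)
    return [sum(col) / n for col in zip(*pts)]

def radius(pts):          # average distance from members to the centroid
    n, x0 = len(pts), centroid(pts)
    return sqrt(sum(sum((xi - ci) ** 2 for xi, ci in zip(p, x0)) for p in pts) / n)

def diameter(pts):        # average pairwise distance within the cluster
    n = len(pts)
    s = sum(sum((a - b) ** 2 for a, b in zip(p, q)) for p in pts for q in pts)
    return sqrt(s / (n * (n - 1)))
```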

Page 62: Dbm630 lecture09

Clustering Analysis 62

BIRCH – Centroid Euclidean and Manhattan distances

• The centroid Euclidean distance and centroid Manhattan distance are defined between any two clusters. • Centroid Euclidean distance

• Centroid Manhattan distance

Page 63: Dbm630 lecture09

Clustering Analysis 63

BIRCH (Average inter-cluster, Average intra-cluster, Variance increase)

• The average inter-cluster, the average intra-cluster, and the variance increase distances are defined as follows

• Average inter-cluster

• Average intra-cluster

• Variance increase distances

Page 64: Dbm630 lecture09

Clustering Analysis 64

Clustering Feature CF = (N,LS,SS)

N: Number of points in cluster

LS: Sum of points in the cluster

SS: Sum of squares of the points in the cluster

CF Tree

Balanced search tree

A node has a CF triple for each child

A leaf node represents a cluster and has a CF value for each subcluster in it

Each subcluster has a maximum diameter (bounded by a threshold)

Page 65: Dbm630 lecture09

Clustering Analysis 65

Clustering Feature Vector

Clustering Feature: CF = (N, LS, SS)

N: number of data points

LS: Σ(i=1..N) Xi (linear sum of the points)

SS: Σ(i=1..N) Xi² (sum of squares of the points)

CF vectors are additive: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)

Example (points plotted on a 0–10 × 0–10 grid):

Points (3,4), (4,7), (2,6), (3,8), (4,5): CF = (5, (16,30), (54,190))

Points (6,2), (8,4), (7,2), (8,5), (7,4): CF = (5, (36,17), (262,65))

Merged cluster: CF = (10, (52,47), (316,255))
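The CF numbers above can be reproduced with a short Python sketch (illustrative); it builds CF = (N, LS, SS) per dimension and uses the additive merge.

```python
# Sketch of a clustering feature CF = (N, LS, SS) and the additive merge used by BIRCH.
def cf(points):
    n = len(points)
    ls = [sum(col) for col in zip(*points)]                   # linear sum of the points
    ss = [sum(x * x for x in col) for col in zip(*points)]    # sum of squares
    return n, ls, ss

def merge(cf1, cf2):
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return n1 + n2, [a + b for a, b in zip(ls1, ls2)], [a + b for a, b in zip(ss1, ss2)]

cf1 = cf([(3, 4), (4, 7), (2, 6), (3, 8), (4, 5)])   # (5, [16, 30], [54, 190])
cf2 = cf([(6, 2), (8, 4), (7, 2), (8, 5), (7, 4)])   # (5, [36, 17], [262, 65])
print(merge(cf1, cf2))                               # (10, [52, 47], [316, 255])
```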

Page 66: Dbm630 lecture09

Clustering Analysis 66

Merging two clusters

Cluster {Xi}:

i = 1, 2, …, N1

Cluster {Xj}:

j = N1+1, N1+2, …, N1+N2

Cluster Xl = {Xi} + {Xj}:

l = 1, 2, …, N1, N1+1, N1+2, …, N1+N2

Page 67: Dbm630 lecture09

Clustering Analysis 67

CF Tree

[Diagram: a CF tree with branching factor B = 7 and leaf capacity L = 6. The root and non-leaf nodes hold entries CF1, CF2, …, CF6, each pointing to a child node; leaf nodes hold CF entries for subclusters and are chained together with prev/next pointers]

Page 68: Dbm630 lecture09

Clustering Analysis 68

Properties of a CF-Tree

Each non-leaf node has at most B entries

Each leaf node has at most L CF entries, each of which satisfies a threshold T

Node size is determined by the dimensionality of the data space and the input parameter P (page size)

B is the branching factor and T is the threshold

Page 69: Dbm630 lecture09

Clustering Analysis 69

BIRCH Algorithm (CF-Tree Insertion)

Recurse down from the root to find the appropriate leaf

Follow the "closest"-CF path, w.r.t. D0 / … / D4

Modify the leaf

If the closest-CF leaf cannot absorb the point, make a new CF entry. If there is no room for a new leaf, split the parent node

Traverse back and up

Update the CFs on the path, splitting nodes where necessary

Page 70: Dbm630 lecture09

Clustering Analysis 70

Improve Clusters

Page 71: Dbm630 lecture09

Clustering Analysis 71

BIRCH Algorithm (Overall steps)

Page 72: Dbm630 lecture09

Clustering Analysis 72

Details of Each Step

Phase 1: Load data into memory

Build an initial in-memory CF-tree with the data (one scan)

Subsequent phases become fast, accurate, and less order sensitive

Phase 2: Condense data

Rebuild the CF-tree with a larger T

Condensing is optional

Phase 3: Global clustering

Use an existing clustering algorithm on the CF entries

Helps fix the problem where natural clusters span nodes

Phase 4: Cluster refining

Do additional passes over the dataset and reassign data points to the closest centroid from phase 3

Refining is optional

Page 73: Dbm630 lecture09

Clustering Analysis 73

Summary of BIRCH

BIRCH works with very large data sets

Explicitly bounded by computational resources.

The computation complexity is O(n), where n is the number of objects to be clustered.

Runs with specified amount of memory (P)

Superior to CLARANS and k-MEANS

Quality, speed, stability and scalability

Page 74: Dbm630 lecture09

Clustering Analysis 74

CURE (Clustering Using REpresentatives)

CURE was proposed by Guha, Rastogi & Shim, 1998

It stops the creation of a cluster hierarchy if a level consists of k clusters

Each cluster has c representatives (instead of one)

Choose c well scattered points in the cluster

Shrink them towards the mean of the cluster by a fraction α

The representatives capture the physical shape and geometry of the cluster

It can treat arbitrarily shaped clusters and avoids the single-link effect.

Merge the closest two clusters

Distance of two clusters: the distance between the two closest representatives
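A minimal Python sketch of the shrinking step is shown below (illustrative only; the cluster, representatives, and shrink fraction alpha are made up for the example).

```python
# Sketch of CURE's representative points: pick c scattered points, shrink toward the mean.
def shrink_representatives(cluster, reps, alpha=0.3):
    mean = [sum(col) / len(cluster) for col in zip(*cluster)]
    # move each representative a fraction alpha of the way toward the cluster mean
    return [tuple(r + alpha * (m - r) for r, m in zip(rep, mean)) for rep in reps]

cluster = [(0, 0), (0, 4), (4, 0), (4, 4)]
reps = [(0, 0), (4, 4)]                     # two well-scattered points
print(shrink_representatives(cluster, reps, alpha=0.5))  # [(1.0, 1.0), (3.0, 3.0)]
```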

Page 75: Dbm630 lecture09

Clustering Analysis 75

CURE Algorithm

Page 76: Dbm630 lecture09

Clustering Analysis 76

CURE Algorithm

Page 77: Dbm630 lecture09

Clustering Analysis 77

CURE Algorithm (Another form)

Page 78: Dbm630 lecture09

Clustering Analysis 78

CURE Algorithm (for Large Databases)

Page 79: Dbm630 lecture09

Clustering Analysis 79

Data Partitioning and Clustering

s = 50, p = 2, s/p = 25, s/pq = 5

[Scatter plots illustrating the sample of s points being divided into p partitions and partially clustered, with representative points marked]

Page 80: Dbm630 lecture09

Clustering Analysis 80

CURE: Shrinking Representative Points

Shrink the multiple representative points towards the gravity center by a fraction α.

Multiple representatives capture the shape of the cluster.

[Two scatter plots showing the representative points before and after shrinking]

Page 81: Dbm630 lecture09

Clustering Analysis 81

Clustering Categorical Data: ROCK

ROCK: RObust Clustering using linKs, by S. Guha, R. Rastogi, K. Shim (ICDE’99).

Use links to measure similarity/proximity

Not distance-based with categorical attributes

Computational complexity:

Basic ideas (Jaccard coefficient):

Similarity function and neighbors:

Let T1 = {1,2,3}, T2={3,4,5}

Computational complexity: O(n² + n·mm·ma + n² log n), where ma is the average number of neighbors and mm is the maximum number of neighbors

Sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2|

Sim(T1, T2) = |{3}| / |{1,2,3,4,5}| = 1/5 = 0.2
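The Jaccard similarity used here is easy to check with a tiny Python sketch (added for illustration).

```python
# Sketch of the Jaccard similarity used by ROCK for transactions (sets of items).
def jaccard(t1, t2):
    return len(t1 & t2) / len(t1 | t2)

T1, T2 = {1, 2, 3}, {3, 4, 5}
print(jaccard(T1, T2))   # |{3}| / |{1,2,3,4,5}| = 1/5 = 0.2
```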

Page 82: Dbm630 lecture09

Clustering Analysis 82

ROCK: An Example

Links: the number of common neighbors of two points, computed using the Jaccard coefficient

Use Distances to determine neighbors

(pt1,pt4) = 0, (pt1,pt2) = 0, (pt1,pt3) = 0

(pt2,pt3) = 0.6, (pt2,pt4) = 0.2

(pt3,pt4) = 0.2

Use 0.2 as threshold for neighbors

Pt2 and Pt3 have 3 common neighbors

Pt3 and Pt4 have 3 common neighbors

Pt2 and Pt4 have 3 common neighbors

Resulting clusters: {1}, {2,3,4}, which makes more sense

Page 83: Dbm630 lecture09

Clustering Analysis 83

ROCK: Property & Algorithm

Links: The number of common neighbours for the two points.

Algorithm:

Draw a random sample

Cluster with links (maybe agglomerative hierarchical)

Label the data on disk

{1,2,3}, {1,2,4}, {1,2,5}, {1,3,4}, {1,3,5}

{1,4,5}, {2,3,4}, {2,3,5}, {2,4,5}, {3,4,5}

{1,2,3} {1,2,4} 3

Page 84: Dbm630 lecture09

Clustering Analysis 84

CHAMELEON

CHAMELEON: hierarchical clustering using dynamic modeling, by G. Karypis, E.H. Han and V. Kumar’99

Measures the similarity based on a dynamic model

Two clusters are merged only if the interconnectivity and closeness (proximity) between two clusters are high relative to the internal interconnectivity of the clusters and closeness of items within the clusters

A two phase algorithm

1. Use a graph partitioning algorithm: cluster objects into a large number of relatively small sub-clusters

2. Use an agglomerative hierarchical clustering algorithm: find the genuine clusters by repeatedly combining these sub-clusters

Page 85: Dbm630 lecture09

Clustering Analysis 85

Graph-based clustering

Sparsification techniques keep the connections to the most similar (nearest) neighbors of a point while breaking the connections to less similar points.

The nearest neighbors of a point tend to belong to the same class as the point itself.

This reduces the impact of noise and outliers and sharpens the distinction between clusters.

Page 86: Dbm630 lecture09

Clustering Analysis 86

Overall Framework of CHAMELEON

Data Set → Construct Sparse Graph → Partition the Graph → Merge Partitions → Final Clusters

Page 87: Dbm630 lecture09

Clustering Analysis 87

Density-Based Clustering Methods

Clustering based on density (local cluster criterion), such as density-connected points

Major features:

Discover clusters of arbitrary shape

Handle noise

One scan

Need density parameters as a termination condition

Several interesting studies:

DBSCAN: Ester, et al. (KDD'96)

OPTICS: Ankerst, et al (SIGMOD’99).

DENCLUE: Hinneburg & D. Keim (KDD’98)

CLIQUE: Agrawal, et al. (SIGMOD’98)

Page 88: Dbm630 lecture09

Clustering Analysis 88

Density-Based Clustering (Background)

Two parameters:

Eps: Maximum radius of the neighbourhood

MinPts: Minimum number of points in an Eps-neighbourhood of that point

NEps(p): {q belongs to D | dist(p,q) <= Eps}

Directly density-reachable: A point p is directly density-reachable from a point q wrt. Eps, MinPts if

1) p belongs to NEps(q)

2) core point condition:

|NEps (q)| >= MinPts

[Diagram: point p inside the Eps-neighborhood of core point q; MinPts = 5, Eps = 1 cm]
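For illustration, the Eps-neighborhood, core-point, and directly-density-reachable definitions can be written as short Python functions (not from the slides).

```python
# Sketch of the Eps-neighborhood and core-point test used by density-based methods.
from math import dist

def eps_neighborhood(p, D, eps):
    return [q for q in D if dist(p, q) <= eps]

def is_core(p, D, eps, min_pts):
    return len(eps_neighborhood(p, D, eps)) >= min_pts

def directly_density_reachable(p, q, D, eps, min_pts):
    # p is directly density-reachable from q if p lies in q's Eps-neighborhood
    # and q is a core point
    return dist(p, q) <= eps and is_core(q, D, eps, min_pts)
```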

Page 89: Dbm630 lecture09

Clustering Analysis 89

Density-Based Clustering (Background)

The neighborhood within a radius Eps of a given object is called the Eps-neighborhood of the object.

If the Eps-neighborhood of an object contains at least a minimum number, MinPts, of objects, then the object is called a core object.

Given a set of objects, D, we say that an object p is directly density-reachable from object q if p is within the Eps-neighborhood of q, and q is a core object.

An object p is density-reachable from object q with respect to Eps and MinPts in a set of objects D if there is a chain of objects p1, …, pn, where p1 = q and pn = p, such that p(i+1) is directly density-reachable from pi with respect to Eps and MinPts, for 1 ≤ i < n, pi ∈ D.

An object p is density-connected to object q with respect to Eps and MinPts if there is an object o ∈ D such that both p and q are density-reachable from o with respect to Eps and MinPts.

Page 90: Dbm630 lecture09

Clustering Analysis 90

Density-Based Clustering

Density-reachable:

A point p is density-reachable from a point q wrt. Eps, MinPts if there is a chain of points p1, …, pn, p1 = q, pn = p such that pi+1 is directly density-reachable from pi

Density-connected

A point p is density-connected to a point q wrt. Eps, MinPts if there is a point o such that both, p and q are density-reachable from o wrt. Eps and MinPts.

[Diagrams: a chain of points q, p1, …, p illustrating density-reachability, and a point o from which both p and q are density-reachable, illustrating density-connectivity]

Page 91: Dbm630 lecture09

Clustering Analysis 91

DBSCAN: Density Based Spatial Clustering of Applications with Noise

Relies on a density-based notion of cluster: A cluster is defined as a maximal set of density-connected points

Discovers clusters of arbitrary shape in spatial databases with noise

[Diagram: core, border, and outlier points for Eps = 1 cm, MinPts = 5]

Page 92: Dbm630 lecture09

Clustering Analysis 92

DBSCAN: The Algorithm

Arbitrarily select a point p

Retrieve all points density-reachable from p wrt Eps and MinPts.

If p is a core point, a cluster is formed.

If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.

Continue the process until all of the points have been processed.
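A minimal, brute-force Python sketch of this algorithm is shown below (illustrative only; real implementations use spatial indexes for the neighborhood queries).

```python
# Minimal DBSCAN sketch following the algorithm above (brute-force neighborhoods).
from math import dist

def dbscan(D, eps, min_pts):
    labels = {p: None for p in D}          # None = unvisited, -1 = noise
    cluster_id = 0
    for p in D:
        if labels[p] is not None:
            continue
        neighbors = [q for q in D if dist(p, q) <= eps]
        if len(neighbors) < min_pts:       # p is not a core point (border or noise)
            labels[p] = -1
            continue
        cluster_id += 1                    # p is a core point: grow a new cluster
        labels[p] = cluster_id
        seeds = [q for q in neighbors if q != p]
        while seeds:
            q = seeds.pop()
            if labels[q] in (None, -1):
                labels[q] = cluster_id
                q_neighbors = [r for r in D if dist(q, r) <= eps]
                if len(q_neighbors) >= min_pts:      # expand only through core points
                    seeds.extend(r for r in q_neighbors if labels[r] is None)
    return labels

pts = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (15, 15)]
print(dbscan(pts, eps=1.5, min_pts=3))
```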