What Is Good Clustering?


Page 1: What Is Good Clustering?

8

What Is Good Clustering?

A good clustering method will produce high-quality clusters with high intra-class similarity and low inter-class similarity.

The quality of a clustering result depends on the similarity measure used by the method.

The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
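
As a rough illustration of the criterion above, the sketch below compares the mean distance between points in the same cluster with the mean distance between points in different clusters. Euclidean distance, the function name `mean_intra_inter_distance`, and the toy points are assumptions made for this example only; distance is a dissimilarity, so high intra-class similarity shows up as a small within-cluster distance.

```python
from itertools import combinations
from math import dist


def mean_intra_inter_distance(points, labels):
    """Return (mean within-cluster distance, mean between-cluster distance)."""
    intra, inter = [], []
    for i, j in combinations(range(len(points)), 2):
        d = dist(points[i], points[j])  # Euclidean distance (assumed measure)
        if labels[i] == labels[j]:
            intra.append(d)
        else:
            inter.append(d)
    return sum(intra) / len(intra), sum(inter) / len(inter)


# Hypothetical toy data: two tight groups that are far apart.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
good = [0, 0, 1, 1]  # groups nearby points together
bad = [0, 1, 0, 1]   # mixes the two groups

print(mean_intra_inter_distance(points, good))
print(mean_intra_inter_distance(points, bad))
```

For the `good` labelling the within-cluster distance is much smaller than the between-cluster distance; for the `bad` labelling the relation is reversed.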

Page 2: What Is Good Clustering?

9

Vocabulary of Clustering

Records, data points, samples, items, objects, patterns…

Attributes, features, variables…

Similarity, dissimilarity, distances.

Centre, Centroid, Prototype.

Hard Clustering (Crisp Clustering)

Page 3: What Is Good Clustering?

10

Requirements of Clustering

Scalability

Ability to deal with different types of attributes

Discovery of clusters with arbitrary shape

Minimal requirements for domain knowledge to determine input parameters

Ability to deal with noise and outliers

Insensitivity to the order of input records

Insensitivity to the initial conditions

Ability to handle high dimensionality

Page 4: What Is Good Clustering?

11

Clustering Algorithms

Page 5: What Is Good Clustering?

12

Clustering Algorithms

Page 6: What Is Good Clustering?

13

Data Representation

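Data matrix (two mode): n objects with p attributes

$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$

Dissimilarity matrix (one mode): d(i,j) is the dissimilarity between objects i and j

$$
\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots &        &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$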
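
A minimal sketch of going from the data matrix to the dissimilarity matrix. The slide does not fix a dissimilarity measure, so Euclidean distance is assumed here, and `dissimilarity_matrix` is a hypothetical helper name.

```python
from math import dist


def dissimilarity_matrix(data):
    """Build the lower-triangular matrix of d(i, j) from an n x p data matrix.

    Row i holds d(i, 0), ..., d(i, i); the last entry of each row is d(i, i) = 0.
    """
    n = len(data)
    return [[dist(data[i], data[j]) for j in range(i + 1)] for i in range(n)]


# Hypothetical data matrix: n = 3 objects, p = 2 attributes.
data = [[1.0, 2.0], [4.0, 6.0], [1.0, 2.0]]
D = dissimilarity_matrix(data)
for row in D:
    print(row)
```

Only the lower triangle is stored because d(i,j) = d(j,i) for a symmetric measure.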

Page 7: What Is Good Clustering?

14

How to deal with missing values?

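$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$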

Page 8: What Is Good Clustering?

15

Types of Clusters: Well-Separated

Well-separated clusters: A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster.

3 well-separated clusters

Page 9: What Is Good Clustering?

16

Types of Clusters: Center-Based

Center-based: A cluster is a set of objects such that an object in a cluster is closer (more similar) to the “center” of its own cluster than to the center of any other cluster.

The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most “representative” point of a cluster

4 center-based clusters
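
The two kinds of center can be sketched as follows. The medoid is taken here to be the member with the smallest total Euclidean distance to the other members, which is one common reading of “most representative” and an assumption of this example; the function names and the sample cluster are hypothetical.

```python
from math import dist


def centroid(points):
    """Average of all points, coordinate by coordinate."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def medoid(points):
    """Cluster member with the smallest total distance to all members."""
    return min(points, key=lambda p: sum(dist(p, q) for q in points))


# Hypothetical cluster with one far-away member.
cluster = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
print(centroid(cluster))  # (3.2, 0.0)
print(medoid(cluster))    # (2.0, 0.0)
```

The centroid need not be an actual data point and is pulled toward the far-away member, while the medoid is always one of the cluster's own points.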

Page 10: What Is Good Clustering?

17

Types of Clusters: Contiguity-Based

Contiguous cluster (nearest neighbor or transitive): A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.

8 contiguous clusters

Page 11: What Is Good Clustering?

18

Types of Clusters: Density-Based

Density-based: A cluster is a dense region of points that is separated from other high-density regions by regions of low density.

Used when the clusters are irregular or intertwined, and when noise and outliers are present.

6 density-based clusters

Page 12: What Is Good Clustering?

19

Types of Clusters: Conceptual Clusters

Shared property or conceptual clusters: Finds clusters that share some common property or represent a particular concept.

2 Overlapping Circles

Page 13: What Is Good Clustering?

20

Types of Clusters: Objective Function

Clusters defined by an objective function

Finds clusters that minimize or maximize an objective function.

Enumerate all possible ways of dividing the points into clusters and evaluate the “goodness” of each potential set of clusters by using the given objective function.
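
The enumeration idea can be sketched on a tiny one-dimensional data set. The objective used here, the sum of squared errors (SSE) to each cluster mean, is an assumption chosen for illustration, as are the function names. The loop visits k^n label assignments, so this is only workable for a handful of points.

```python
from itertools import product


def sse(clusters):
    """Sum of squared distances from each 1-D point to its cluster mean."""
    total = 0.0
    for members in clusters:
        mean = sum(members) / len(members)
        total += sum((x - mean) ** 2 for x in members)
    return total


def best_partition(points, k):
    """Try every assignment of points to k non-empty clusters; keep the lowest SSE."""
    best, best_score = None, float("inf")
    for labels in product(range(k), repeat=len(points)):
        if len(set(labels)) < k:  # skip assignments that leave a cluster empty
            continue
        clusters = [[x for x, lab in zip(points, labels) if lab == j] for j in range(k)]
        score = sse(clusters)
        if score < best_score:
            best, best_score = clusters, score
    return best, best_score


points = [1.0, 2.0, 9.0, 10.0]
clusters, score = best_partition(points, 2)
print(clusters, score)  # [[1.0, 2.0], [9.0, 10.0]] 1.0
```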

Page 14: What Is Good Clustering?

April 20, 2023 21

Type of data in clustering analysis

Page 15: What Is Good Clustering?

April 20, 2023 22

Symbol Table

Page 16: What Is Good Clustering?

April 20, 2023 23

Symbol Table

Page 17: What Is Good Clustering?

April 20, 2023 24

Frequency Table

Page 18: What Is Good Clustering?

April 20, 2023 25

Frequency Table

Page 19: What Is Good Clustering?

April 20, 2023 26

Frequency Table

Page 20: What Is Good Clustering?

April 20, 2023 27

Frequency Table

Page 21: What Is Good Clustering?

April 20, 2023 28

Type of data in clustering analysis

Binary variables

Nominal variables

Ordinal variables

Interval-scaled variables

Ratio variables

Variables of mixed types

Page 22: What Is Good Clustering?

April 20, 2023 29

Binary variables

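Contingency table for binary data

                  Object j
                  1      0      sum
Object i    1     a      b      a+b
            0     c      d      c+d
            sum   a+c    b+d    p

The binary variable is symmetric (simple matching coefficient):

$$ d(i,j) = \frac{b + c}{a + b + c + d} $$

The binary variable is asymmetric (Jaccard coefficient):

$$ d(i,j) = \frac{b + c}{a + b + c} $$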
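
A small sketch of both coefficients for two 0/1 vectors. Here a counts attributes where both objects are 1, b where only object i is 1, c where only object j is 1, and d where both are 0. The function names and the sample vectors are hypothetical.

```python
def contingency(x, y):
    """Return the counts (a, b, c, d) for two equal-length 0/1 vectors."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    return a, b, c, d


def symmetric_dissimilarity(x, y):
    """Simple matching: 1-1 and 0-0 agreements both count as matches."""
    a, b, c, d = contingency(x, y)
    return (b + c) / (a + b + c + d)


def asymmetric_dissimilarity(x, y):
    """Jaccard-based: 0-0 agreements (d) are ignored.

    Raises ZeroDivisionError when a + b + c == 0 (both vectors all zero).
    """
    a, b, c, _ = contingency(x, y)
    return (b + c) / (a + b + c)


obj_i = [1, 0, 1, 1, 0, 0]
obj_j = [1, 1, 0, 1, 0, 0]
print(contingency(obj_i, obj_j))               # (2, 1, 1, 2)
print(asymmetric_dissimilarity(obj_i, obj_j))  # 0.5
```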

Page 23: What Is Good Clustering?

April 20, 2023 30

Binary variables

Page 24: What Is Good Clustering?

April 20, 2023 31

Dissimilarity between Binary Variables

Example

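gender is a symmetric attribute

the remaining attributes are asymmetric binary

let the values Y and P be set to 1, and the value N be set to 0

Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      P      N       N       N       N

The dissimilarities are computed over the asymmetric attributes only:

$$ d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33 $$

$$ d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67 $$

$$ d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75 $$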
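
The three values in this example can be checked with a few lines of code. Each patient is written as a string of the six asymmetric attributes in the order Fever, Cough, Test-1 to Test-4; that encoding and the helper names are choices made for this sketch.

```python
# Asymmetric attributes only (Gender is symmetric and is left out).
patients = {"jack": "YNPNNN", "mary": "YNPNPN", "jim": "YPNNNN"}


def encode(values):
    """Map Y and P to 1 and N to 0."""
    return [1 if v in "YP" else 0 for v in values]


def asymmetric_dissimilarity(x, y):
    """(b + c) / (a + b + c), where b + c is the number of mismatches."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    mismatches = sum(1 for u, v in zip(x, y) if u != v)
    return mismatches / (a + mismatches)


coded = {name: encode(values) for name, values in patients.items()}
d_jack_mary = asymmetric_dissimilarity(coded["jack"], coded["mary"])
d_jack_jim = asymmetric_dissimilarity(coded["jack"], coded["jim"])
d_jim_mary = asymmetric_dissimilarity(coded["jim"], coded["mary"])
print(round(d_jack_mary, 2), round(d_jack_jim, 2), round(d_jim_mary, 2))
# 0.33 0.67 0.75
```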

Page 25: What Is Good Clustering?

April 20, 2023 32

Nominal Variables

A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green

Method 1: Simple matching

$$ d(i,j) = \frac{p - m}{p} $$

m: # of matches, p: total # of variables

Method 2: use a large number of binary variables, creating a new binary variable for each of the M nominal states
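
Both methods can be sketched briefly. `nominal_dissimilarity` implements simple matching with m matches out of p variables, and `one_hot` shows the binary recoding of Method 2; the function names and the sample objects are hypothetical.

```python
def nominal_dissimilarity(x, y):
    """Simple matching: d(i, j) = (p - m) / p."""
    p = len(x)
    m = sum(1 for u, v in zip(x, y) if u == v)
    return (p - m) / p


def one_hot(value, states):
    """Method 2: one binary variable per nominal state."""
    return [1 if value == s else 0 for s in states]


obj_i = ["red", "circle", "small"]
obj_j = ["red", "square", "small"]
print(nominal_dissimilarity(obj_i, obj_j))                 # one mismatch out of three
print(one_hot("blue", ["red", "yellow", "blue", "green"]))  # [0, 0, 1, 0]
```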

Page 26: What Is Good Clustering?

April 20, 2023 33

Nominal Variables

Examples: eye color, days of the week, religion, seasons, job title

Page 27: What Is Good Clustering?

April 20, 2023 34

Nominal Variables

Find the Proximity Matrix?

Page 28: What Is Good Clustering?

April 20, 2023 35

Ordinal Variables

Order is important, e.g., rank

Can be treated like interval-scaled variables:

replace $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$

map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

$$ z_{if} = \frac{r_{if} - 1}{M_f - 1} $$

compute the dissimilarity using methods for interval-scaled variables
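
A minimal sketch of the rank mapping, assuming the ordered list of states is known in advance and has at least two entries. The function name and the grade labels are hypothetical.

```python
def ordinal_to_unit_interval(values, ordered_states):
    """Replace each value by z = (r - 1) / (M - 1), where r is its rank in 1..M."""
    M = len(ordered_states)
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    return [(rank[v] - 1) / (M - 1) for v in values]


grades = ["fair", "good", "excellent", "fair"]
z = ordinal_to_unit_interval(grades, ["fair", "good", "excellent"])
print(z)  # [0.0, 0.5, 1.0, 0.0]
```

The resulting z values can then be fed to any interval-scaled distance.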

Page 29: What Is Good Clustering?

April 20, 2023 36

Ordinal Variables

Find the Proximity Matrix?

Page 30: What Is Good Clustering?

April 20, 2023 37

Interval-valued variables

Examples: temperature, weight, time, age, length

Page 31: What Is Good Clustering?

April 20, 2023 38

Interval-valued variables

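Standardize data

Calculate the mean absolute deviation:

$$ s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right) $$

where

$$ m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right) $$

Calculate the standardized measurement (z-score):

$$ z_{if} = \frac{x_{if} - m_f}{s_f} $$

Using mean absolute deviation is more robust than using standard deviation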
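
The standardization can be sketched for a single variable f, given as a list of its n values. The function name and the sample column are hypothetical, and the sketch assumes the values are not all equal (otherwise the mean absolute deviation is zero).

```python
def standardize(column):
    """z-scores using the mean absolute deviation s_f instead of the standard deviation."""
    n = len(column)
    m = sum(column) / n                      # mean m_f
    s = sum(abs(x - m) for x in column) / n  # mean absolute deviation s_f
    return [(x - m) / s for x in column]


ages = [20.0, 30.0, 40.0, 50.0, 60.0]  # m_f = 40, s_f = 12
z = standardize(ages)
print(z)
```

After this step the standardized values have mean 0 and mean absolute deviation 1.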

Page 32: What Is Good Clustering?

April 20, 2023 39

Ratio-Scaled Variables

Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as $Ae^{Bt}$ or $Ae^{-Bt}$

Methods:

treat them like interval-scaled variables: not a good choice! (why?)

apply logarithmic transformation: $y_{if} = \log(x_{if})$

treat them as continuous ordinal data and treat their ranks as interval-scaled
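
A short sketch of the logarithmic transformation. The slide does not state the base of the logarithm, so the natural log is assumed; the function name and the sample values are hypothetical.

```python
from math import log


def log_transform(column):
    """y = log(x) for each positive measurement x (natural log assumed)."""
    return [log(x) for x in column]


growth = [1.0, 10.0, 100.0, 1000.0]  # grows by a constant factor
y = log_transform(growth)
print([round(v, 3) for v in y])
```

Values that grow by a constant factor become evenly spaced after the transformation, so interval-scaled distances on y behave sensibly.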

Page 33: What Is Good Clustering?

April 20, 2023 40

Ratio-Scaled Variables

Find the Proximity Matrix?

Page 34: What Is Good Clustering?

Variables of Mixed Types

A database may contain all six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio.

One may use a weighted formula to combine their effects:

$$ d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}} $$

f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise

f is interval-based: use the normalized distance

f is ordinal or ratio-scaled: compute ranks $r_{if}$ and

$$ z_{if} = \frac{r_{if} - 1}{M_f - 1} $$

and treat $z_{if}$ as interval-scaled
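
A minimal sketch of the weighted formula under assumptions the slide does not spell out: the weight for variable f is taken as 0 when either value is missing (`None`) or when f is asymmetric binary and both values are 0, and as 1 otherwise; the normalized interval distance is taken as |x_if - x_jf| divided by the range of f. All names here are hypothetical. Ordinal values would first be mapped to z_if and then treated as interval with range 1.

```python
def mixed_dissimilarity(x, y, kinds, ranges):
    """Combine per-variable dissimilarities d_f, weighted by 0/1 indicators.

    kinds[f] is "nominal", "sym_binary", "asym_binary" or "interval";
    ranges[f] is max - min of variable f (used only for "interval").
    """
    num = den = 0.0
    for f, kind in enumerate(kinds):
        xf, yf = x[f], y[f]
        if xf is None or yf is None:
            continue  # weight 0: a value is missing (assumption)
        if kind == "asym_binary" and xf == 0 and yf == 0:
            continue  # weight 0: joint absence carries no information (assumption)
        if kind == "interval":
            d = abs(xf - yf) / ranges[f]  # normalized distance (assumption)
        else:
            d = 0.0 if xf == yf else 1.0
        num += d
        den += 1.0
    if den == 0.0:
        raise ValueError("no comparable variables")
    return num / den


kinds = ["nominal", "asym_binary", "interval"]
ranges = [None, None, 40.0]
obj_i = ["red", 1, 20.0]
obj_j = ["blue", 0, 30.0]
print(mixed_dissimilarity(obj_i, obj_j, kinds, ranges))  # (1 + 1 + 0.25) / 3 = 0.75
```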

Page 35: What Is Good Clustering?

April 20, 2023 42

Variables of Mixed Types

Find the Proximity Matrix?