Post on 01-Jan-2016
description
8
What Is Good Clustering?
A good clustering method will produce high quality clusters with high intra-class similarity low inter-class similarity
The quality of a clustering result depends on the similarity measure used by the method.
The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
9
Vocabulary of Clustering
Records, data points, samples, items, objects, patterns…
Attributes, features, variables…
Similarity, dissimilarity, distances.
Centre, Centroid, Prototype.
Hard Clustering (Crisp Clustering)
10
Requirements of Clustering
Scalability Ability to deal with different types of attributes Discovery of clusters with arbitrary shape Minimal requirements for domain knowledge to
determine input parameters Able to deal with noise and outliers Insensitive to order of input records Insensitive to the initial conditions High dimensionality
11
Clustering Algorithms
12
Clustering Algorithms
13
Data Representation
Data matrix (two mode) N objects with p attributes
Dissimilarity matrix (one mode) d(i,j) : dissimilarity between i and j with p attributes
npx...
nfx...
n1x
...............ip
x...if
x...i1
x
...............1p
x...1f
x...11
x
0...)2,()1,(
:::
)2,3()
...ndnd
0dd(3,1
0d(2,1)
0
14
How to deal with missing values?
npx...
nfx...
n1x
...............ip
x...if
x...i1
x
...............1p
x...1f
x...11
x
15
Types of Clusters: Well-Separated
Well-separated clusters A cluster is a set of points such that any point
in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster
3 well-separated clusters
16
Types of Clusters: Center-Based
Center-based A cluster is a set of objects such that an
object in a cluster is closer (more similar) to the “center” of a cluster, than to the center of any other cluster
The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most “representative” point of a cluster
4 center-based clusters
17
Types of Clusters: Contiguity-Based
Contiguous Cluster (Nearest neighbor or Transitive) A cluster is a set of points such that a point in
a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.
8 contiguous clusters
18
Types of Clusters: Density-Based
Density-based A cluster is a dense region of points, which is
separated by low-density regions, from other regions of high density.
Used when the clusters are irregular or intertwined, and when noise and outliers are present.
6 density-based clusters
19
Types of Clusters: Conceptual Clusters
Shared Property or Conceptual Clusters Finds clusters that share some common
property or represent a particular concept.
2 Overlapping Circles
20
Types of Clusters: Objective Function
Clusters Defined by an Objective Function Finds clusters that minimize or maximize an
objective function. Enumerate all possible ways of dividing the
points into clusters and evaluate the `goodness' of each potential set of clusters by using the given objective function.
April 20, 2023 21
Type of data in clustering analysis
April 20, 2023 22
Symbol Table
April 20, 2023 23
Symbol Table
April 20, 2023 24
Frequency Table
April 20, 2023 25
Frequency Table
April 20, 2023 26
Frequency Table
April 20, 2023 27
Frequency Table
April 20, 2023 28
Type of data in clustering analysis
Binary variables
Nominal variables
Ordinal variables
Interval-scaled variables
Ratio variables
Variables of mixed types
April 20, 2023 29
Binary variables
The binary variable is symmetric (Simple match
coefficient)
The binary variable is asymmetric (Jaccard
coefficient)
pdbcasum
dcdc
baba
sum
0
1
01
Object i
Object j
dcbacb jid
),(
cbacb jid
),(
April 20, 2023 30
Binary variables
April 20, 2023 31
Dissimilarity between Binary Variables
Example
gender is a symmetric attribute the remaining attributes are asymmetric binary let the values Y and P be set to 1, and the value N be
set to 0
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N NMary F Y N P N P NJim M Y P N N N N
75.0211
21),(
67.0111
11),(
33.0102
10),(
maryjimd
jimjackd
maryjackd
April 20, 2023 32
Nominal Variables
A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
Method 1: Simple matching m: # of matches, p: total # of variables
Method 2: use a large number of binary variables creating a new binary variable for each of the M
nominal states
pmpjid ),(
April 20, 2023 33
Nominal Variables
Examples Eye Color Days of the week Religion Seasons Job title
April 20, 2023 34
Nominal Variables
Find the Proximity Matrix?
April 20, 2023 35
Ordinal Variables
Order is important, e.g., rank Can be treated like interval-scaled
replacing xif by their rank
map the range of each variable onto [0, 1] by replacing i-th object in the f-th variable by
compute the dissimilarity using methods for interval-scaled variables
11
f
ifif M
rz
},...,1{fif
Mr
April 20, 2023 36
Ordinal Variables
Find the Proximity Matrix?
April 20, 2023 37
Interval-valued variables
Examples Temperature Weight Time Age Length
April 20, 2023 38
Interval-valued variables
Standardize data
Calculate the mean absolute deviation:
where
Calculate the standardized measurement (z-
score)
Using mean absolute deviation is more robust than
using standard deviation
.)...21
1nffff
xx(xn m
|)|...|||(|121 fnffffff
mxmxmxns
f
fifif s
mx z
April 20, 2023 39
Ratio-Scaled Variables
Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale,
such as AeBt or Ae-Bt Methods:
treat them like interval-scaled variables — not a good choice! (why?)
apply logarithmic transformation
yif = log(xif)
treat them as continuous ordinal data treat their rank as interval-scaled.
April 20, 2023 40
Ratio-Scaled Variables
Find the Proximity Matrix?
Variables of Mixed Types
A database may contain all the six types of variables symmetric binary, asymmetric binary, nominal,
ordinal, interval and ratio. One may use a weighted formula to combine their
effects.
f is binary or nominal:dij
(f) = 0 if xif = xjf , or dij(f) = 1 o.w.
f is interval-based: use the normalized distance f is ordinal or ratio-scaled
compute ranks rif and and treat zif as interval-scaled
)(1
)()(1),(
fij
pf
fij
fij
pf
djid
1
1
f
if
Mrz
if
April 20, 2023 42
Variables of Mixed Types
Find the Proximity Matrix?