Clustering with Mixed Type Variables and Determination of ... · COMPSTAT 2010 3 Motivation Task:...

COMPSTAT 2010 1

Clustering with Mixed Type Variablesand Determination of Cluster Numbers

Hana Řezanková, Dušan HúsekTomáš Löster

University of Economics, PragueICS, Academy of Sciences of the Czech Republic

COMPSTAT 2010 2

Outline

Motivation Methods for clustering with mixed type variables Implementation in software packages Proposal of new criteria for cluster evaluation Application Conclusion

COMPSTAT 2010 3

Motivation

Task: We are looking for groups of similar Task: We are looking for groups of similar objects (e.g. respondents)objects (e.g. respondents),, i.e. we will i.e. we will concentrate on concentrate on thethe problem of object clusteringproblem of object clustering

The objects are characterized by both quantitative and qualitative (nominal) variables(e.g. respondent opinions, numbers of actions)

The number of clusters is unknown in advance The number of clusters is unknown in advance ––i.e. we should cope with appropriate number of i.e. we should cope with appropriate number of clusters determination (assignment)clusters determination (assignment)

COMPSTAT 2010 4

Methods for clustering with mixed type variables

Using a specialized dissimilarity measureUsing a specialized dissimilarity measure(Gower(Gower’’s coefficient, cluster variability based)s coefficient, cluster variability based)and application of agglomerative hierarchicaland application of agglomerative hierarchicalcluster analysiscluster analysis (AHCA)(AHCA)

Clustering objects separately with quantitative and qualitative variables and combining the results by cluster-based similarity partitioning algorithm (CSPA)

Latent class models

COMPSTAT 2010 5

Implementation in software packages

Specialized dissimilarity measuresSpecialized dissimilarity measures-- are not implementedare not implemented for for AHCAAHCA

Clustering objects with qualitative variables- is implemented only rarely (disagreement coef.)

Cluster-based similarity partitioning algorithm- is not implementednot implemented but it could be realized

LC Cluster models (Latent GOLD) LogLog--likelihood distance measurelikelihood distance measure between clusters

- implemented in two-step cluster analysis (SPSS)

COMPSTAT 2010 6


LogLog--likelihood distance measurelikelihood distance measure between clustersbetween clusters- implemented in two-step cluster analysis (SPSS)

)(, hhhhhhD

)1( )2(

1 1

22 )ln(21m

l

m

lglgllgg Hssn

g

gluK

u g

glugl n

nnn

Hl

ln1

…… entropyentropy

COMPSTAT 2010 7


LogLog--likelihood distance measurelikelihood distance measure between objectsbetween objects- implemented in two-step cluster analysis (SPSS)

)(, hhhhhhD

)1( )2(

1 1

22 )ln(21m

l

m

lglgllgg Hssn

jijiD xxxx ,),(

COMPSTAT 2010 8

Evaluation criteria implemented in software packages

BIC (BIC (Bayesian Information Criterion)Bayesian Information Criterion)AICAIC (Akaike Information Criterion) - implemented in two-step cluster analysis (SPSS)

k

gkg nwI

1BIC )ln(2

)2(

1

)1( )1(2m

llk Kmkw

k

gkg wI

1AIC 22

… minimum

only for initial estimationof number of clusters

COMPSTAT 2010 9

Proposed evaluation criteria

Within-cluster variability for k clusters:

Variability of the whole data set:

k

g

m

l

m

lglgllg

k

gg Hssnk

1 1 1

22

1

)1( )2(

)ln(21)(

)1( )2(

1 1

2 )2ln(21)1(

m

l

m

lll Hsn

COMPSTAT 2010 10

Proposed evaluation criteria

k

g

m

l

m

lglgllg

k

gg Hssnk

1 1 1

22

1

)1( )2(

)ln(21)(

Within-cluster variability for k clusters:

)()1()( kkkdiff difference

it should be maximalfor the suitable number of clusters

COMPSTAT 2010 11

Evaluation criteria modified for qualitative variables

1. Uncertainty index (R-square (RSQ) index)

2. Semipartial uncertainty index (optimal number of clusters - minimum)

)()1()( UUSPU kIkIkI

)1()()1()(

T

WT

T

BU

kV

VVVVkI

COMPSTAT 2010 12


3. Pseudo (Calinski and Habarasz) F index– PSF (SAS), CHF ( SYSTAT)

4. Pseudo T-squared statistic – PST2 (SAS)PTS (SYSTAT)

)()1())()1(()(1)(

W

B

CHFU kkkkn

knVkV

kI

2

)()( ,

PTSU

hh

hh

hhhh

nn

kI

COMPSTAT 2010 13


SYSTAT

COMPSTAT 2010 14


5. Modified Davies and Bouldin (DB) index

kD

ss

kI

k

h hh

hDhD

hhh

1

,,

,

DB

max)(

kkI

k

h hhhh

hh

hhh

1 ,

,

DBU

)(max

)(

COMPSTAT 2010 15


6. Dunn’s index

gkg

hhkhkh diam

DkI1

11D maxminmin)(

),(min, jiCChh DD

hjhi

xxxx

),(max, jiCg Ddiam

gji

xxxx

COMPSTAT 2010 16

Modified evaluation criteria

k

ggGkG

1

)(

CClusterluster variability variability based on the variance and GiniGini’’ss coefficient of mutabilitycoefficient of mutability

)1( )2(

1 1

22 )ln(21m

l

m

lglgllgg GssnG

lK

u g

glugl n

nG

1

2

1 GiniGini’’ss coefficient ofcoefficient of mutabilitymutability

k

gkg nwGI

1BGC )ln(2

COMPSTAT 2010 17


1. Tau index (RSQ index)

2. Semipartial tau index (optimal number of clusters - minimum)

)()1()(SP kIkIkI

)1()()1()(

T

WT

T

B

GkGG

VVV

VVkI

COMPSTAT 2010 18

Application to a real data file

Data from a questionnaire surveyData from a questionnaire survey((for the participants of the chemistry seminarfor the participants of the chemistry seminar))

7 qualitative and 1 quantitative (count) variables Two-step cluster analysis for clustering of

respondents (experiments for the numbers of clusters from 2 to 4)

LC Cluster model (experiments for the numbers of clusters from 2 to 6) – the quantitative variable was recoded to 5 categories

COMPSTAT 2010 19

Application to a real data file

Number of clusters Measure 1 2 3 4 Within-cluster variability

273.92 241.17 206.39 186.51

Variability difference

- 32.75 34.78 19.88

IU 0 0.12 0.25 0.32 ISPU 0.12 0.13 0.07 - ICHFU 0 6.52 7.69 7.19 IBIC 590.85 568.41 541.88 545.15

Criteria based on the entropy (TSCA in SPSS)

COMPSTAT 2010 20

Application to a real data fileCriteria based on the Gini’s coefficient (TSCA in SPSS)

Number of clusters Measure 1 2 3 4 Within-cluster variability

185.41 162.57 137.83 127.86

Variability difference

- 22.84 24.74 9.97

I 0 0.12 0.26 0.31 ISP 0.12 0.13 0.05 - ICHF 0 6.74 8.11 6.90 IBGC 413.85 411.20 404.75 427.84

COMPSTAT 2010 21

Application to a real data fileComparison of BIC

Number of clusters Method 1 2 3 4 Two-step CA 590.85 568.41 541.88 545.15 LC Cluster Model 1397.01 1059.24 1019.18 1036.90

COMPSTAT 2010 22

Conclusion

If the distance between objects, distance between clusters, within-cluster variability and the total variability are defined for the case when objects are characterized by mixed-type variables, then the evaluation criteria for quantitative variables can be modified.

One possibility is an application of log-likelihood distance measure based on the entropy

Another possibility is to use the analogous measure with using of Gini’s coefficient

COMPSTAT 2010 23

Thank you for your attention

Clustering with Mixed Type Variables and Determination of ... · COMPSTAT 2010 3 Motivation Task:...

Documents

Transcript of Clustering with Mixed Type Variables and Determination of ... · COMPSTAT 2010 3 Motivation Task:...