Clustering with Mixed Type Variables and Determination of ... · COMPSTAT 2010 3 Motivation Task:...
Transcript of Clustering with Mixed Type Variables and Determination of ... · COMPSTAT 2010 3 Motivation Task:...
COMPSTAT 2010 1
Clustering with Mixed Type Variablesand Determination of Cluster Numbers
Hana Řezanková, Dušan HúsekTomáš Löster
University of Economics, PragueICS, Academy of Sciences of the Czech Republic
COMPSTAT 2010 2
Outline
Motivation Methods for clustering with mixed type variables Implementation in software packages Proposal of new criteria for cluster evaluation Application Conclusion
COMPSTAT 2010 3
Motivation
Task: We are looking for groups of similar Task: We are looking for groups of similar objects (e.g. respondents)objects (e.g. respondents),, i.e. we will i.e. we will concentrate on concentrate on thethe problem of object clusteringproblem of object clustering
The objects are characterized by both quantitative and qualitative (nominal) variables(e.g. respondent opinions, numbers of actions)
The number of clusters is unknown in advance The number of clusters is unknown in advance ––i.e. we should cope with appropriate number of i.e. we should cope with appropriate number of clusters determination (assignment)clusters determination (assignment)
COMPSTAT 2010 4
Methods for clustering with mixed type variables
Using a specialized dissimilarity measureUsing a specialized dissimilarity measure(Gower(Gower’’s coefficient, cluster variability based)s coefficient, cluster variability based)and application of agglomerative hierarchicaland application of agglomerative hierarchicalcluster analysiscluster analysis (AHCA)(AHCA)
Clustering objects separately with quantitative and qualitative variables and combining the results by cluster-based similarity partitioning algorithm (CSPA)
Latent class models
COMPSTAT 2010 5
Implementation in software packages
Specialized dissimilarity measuresSpecialized dissimilarity measures-- are not implementedare not implemented for for AHCAAHCA
Clustering objects with qualitative variables- is implemented only rarely (disagreement coef.)
Cluster-based similarity partitioning algorithm- is not implementednot implemented but it could be realized
LC Cluster models (Latent GOLD) LogLog--likelihood distance measurelikelihood distance measure between clusters
- implemented in two-step cluster analysis (SPSS)
COMPSTAT 2010 6
Implementation in software packages
LogLog--likelihood distance measurelikelihood distance measure between clustersbetween clusters- implemented in two-step cluster analysis (SPSS)
)(, hhhhhhD
)1( )2(
1 1
22 )ln(21m
l
m
lglgllgg Hssn
g
gluK
u g
glugl n
nnn
Hl
ln1
…… entropyentropy
COMPSTAT 2010 7
Implementation in software packages
LogLog--likelihood distance measurelikelihood distance measure between objectsbetween objects- implemented in two-step cluster analysis (SPSS)
)(, hhhhhhD
)1( )2(
1 1
22 )ln(21m
l
m
lglgllgg Hssn
jijiD xxxx ,),(
COMPSTAT 2010 8
Evaluation criteria implemented in software packages
BIC (BIC (Bayesian Information Criterion)Bayesian Information Criterion)AICAIC (Akaike Information Criterion) - implemented in two-step cluster analysis (SPSS)
k
gkg nwI
1BIC )ln(2
)2(
1
)1( )1(2m
llk Kmkw
k
gkg wI
1AIC 22
… minimum
only for initial estimationof number of clusters
COMPSTAT 2010 9
Proposed evaluation criteria
Within-cluster variability for k clusters:
Variability of the whole data set:
k
g
m
l
m
lglgllg
k
gg Hssnk
1 1 1
22
1
)1( )2(
)ln(21)(
)1( )2(
1 1
2 )2ln(21)1(
m
l
m
lll Hsn
COMPSTAT 2010 10
Proposed evaluation criteria
k
g
m
l
m
lglgllg
k
gg Hssnk
1 1 1
22
1
)1( )2(
)ln(21)(
Within-cluster variability for k clusters:
)()1()( kkkdiff difference
it should be maximalfor the suitable number of clusters
COMPSTAT 2010 11
Evaluation criteria modified for qualitative variables
1. Uncertainty index (R-square (RSQ) index)
2. Semipartial uncertainty index (optimal number of clusters - minimum)
)()1()( UUSPU kIkIkI
)1()()1()(
T
WT
T
BU
kV
VVVVkI
COMPSTAT 2010 12
Evaluation criteria modified for qualitative variables
3. Pseudo (Calinski and Habarasz) F index– PSF (SAS), CHF ( SYSTAT)
4. Pseudo T-squared statistic – PST2 (SAS)PTS (SYSTAT)
)()1())()1(()(1)(
W
B
CHFU kkkkn
knVkV
kI
2
)()( ,
PTSU
hh
hh
hhhh
nn
kI
COMPSTAT 2010 13
Evaluation criteria modified for qualitative variables
SYSTAT
COMPSTAT 2010 14
Evaluation criteria modified for qualitative variables
5. Modified Davies and Bouldin (DB) index
kD
ss
kI
k
h hh
hDhD
hhh
1
,,
,
DB
max)(
kkI
k
h hhhh
hh
hhh
1 ,
,
DBU
)(max
)(
COMPSTAT 2010 15
Evaluation criteria modified for qualitative variables
6. Dunn’s index
gkg
hhkhkh diam
DkI1
11D maxminmin)(
),(min, jiCChh DD
hjhi
xxxx
),(max, jiCg Ddiam
gji
xxxx
COMPSTAT 2010 16
Modified evaluation criteria
k
ggGkG
1
)(
CClusterluster variability variability based on the variance and GiniGini’’ss coefficient of mutabilitycoefficient of mutability
)1( )2(
1 1
22 )ln(21m
l
m
lglgllgg GssnG
lK
u g
glugl n
nG
1
2
1 GiniGini’’ss coefficient ofcoefficient of mutabilitymutability
k
gkg nwGI
1BGC )ln(2
COMPSTAT 2010 17
Evaluation criteria modified for qualitative variables
1. Tau index (RSQ index)
2. Semipartial tau index (optimal number of clusters - minimum)
)()1()(SP kIkIkI
)1()()1()(
T
WT
T
B
GkGG
VVV
VVkI
COMPSTAT 2010 18
Application to a real data file
Data from a questionnaire surveyData from a questionnaire survey((for the participants of the chemistry seminarfor the participants of the chemistry seminar))
7 qualitative and 1 quantitative (count) variables Two-step cluster analysis for clustering of
respondents (experiments for the numbers of clusters from 2 to 4)
LC Cluster model (experiments for the numbers of clusters from 2 to 6) – the quantitative variable was recoded to 5 categories
COMPSTAT 2010 19
Application to a real data file
Number of clusters Measure 1 2 3 4 Within-cluster variability
273.92 241.17 206.39 186.51
Variability difference
- 32.75 34.78 19.88
IU 0 0.12 0.25 0.32 ISPU 0.12 0.13 0.07 - ICHFU 0 6.52 7.69 7.19 IBIC 590.85 568.41 541.88 545.15
Criteria based on the entropy (TSCA in SPSS)
COMPSTAT 2010 20
Application to a real data fileCriteria based on the Gini’s coefficient (TSCA in SPSS)
Number of clusters Measure 1 2 3 4 Within-cluster variability
185.41 162.57 137.83 127.86
Variability difference
- 22.84 24.74 9.97
I 0 0.12 0.26 0.31 ISP 0.12 0.13 0.05 - ICHF 0 6.74 8.11 6.90 IBGC 413.85 411.20 404.75 427.84
COMPSTAT 2010 21
Application to a real data fileComparison of BIC
Number of clusters Method 1 2 3 4 Two-step CA 590.85 568.41 541.88 545.15 LC Cluster Model 1397.01 1059.24 1019.18 1036.90
COMPSTAT 2010 22
Conclusion
If the distance between objects, distance between clusters, within-cluster variability and the total variability are defined for the case when objects are characterized by mixed-type variables, then the evaluation criteria for quantitative variables can be modified.
One possibility is an application of log-likelihood distance measure based on the entropy
Another possibility is to use the analogous measure with using of Gini’s coefficient
COMPSTAT 2010 23
Thank you for your attention