CLUSTERING MIXED-TYPE DATA
Marianthi Markatou
Dept. of Biostatistics
University at Buffalo
Statistical & Computational Challenges in Precision Medicine
November 6-9, 2018
Institute for Mathematics & Its Applications
Collaborators
Alex Foss*, Senior Scientist, Sandia National Laboratories
Bonnie Ray, VP, Talkspace
Aliza Heching, IBM Watson Research Center
OUTLINE
Introduction & Problem Statement
Background/Literature Review
KAMILA: A new algorithm for clustering mixed-type data
Software Platform
Examples
References
INTRODUCTION
Precision Medicine: an approach to disease treatment and prevention that is designed to optimize efficiency or therapeutic benefit for groups of patients. These subgroups are identified via data-driven techniques. The data are obtained from large, heterogeneous combinations of sources:
Demographic information: age, gender, race, ethnicity, economic status;
Diagnostic testing: inflammation score, tumor size, disease stage;
High-throughput "omics" data sets: interval-scale gene expression data, categorical SNP data.
INTRODUCTION/LITERATURE REVIEW
Many "big data" sets contain variables of both interval and categorical (nominal/ordinal) scale. Many commonly used approaches for clustering mixed-scale (or mixed-type) data adapt existing techniques for single-type data. That is:
Interval-scale data are discretized, and techniques for clustering categorical data are then applied;
Categorical variables are dummy-coded, and interval-scale clustering techniques are then applied.
There are significant problems associated with both strategies.
INTRODUCTION
Mixed-type data: refers to data that combine realizations from both continuous (e.g. height, weight, systolic blood pressure) and categorical (e.g. gender, race, ethnicity, HCV genotype) random variables.
Cluster analysis: aims to identify groups of similar units in a data set.
Definition: A cluster is a set of data points that is compact and isolated, that is, a set of points that are similar to each other and well-separated from points not belonging to the cluster (Cormack, 1971; Gordon, 1981; Jain, 2010; McNicholas, 2016).
INTRODUCTION
Additional characteristics of clusters that may be desirable, depending on the context, include stability of the identified clusters, independence of variables within a cluster, and the degree to which a cluster can be well-represented by its centroid.
Recent literature providing interesting discussions of these issues includes: 1) Hennig, C. (2015). What are the true clusters? Pattern Recognition Letters, 64, 53-62; 2) McNicholas, P.D. (2016). Model-based clustering. Journal of Classification, 33, 331-373.
PROBLEM STATEMENT
We focus on clustering mixed-scale data, that is, data sets that contain realizations of both interval and categorical (nominal and/or ordinal) scale variables.
QUESTION: Why does this problem matter?
Answer: Informative subgroup identification, i.e. clustering.
LITERATURE REVIEW
Fundamental challenges are:
1) equitably balancing the contributions of the continuous and categorical variables;
2) handling data sets in which only a subset of the variables is related to the underlying cluster structure of interest; current clustering algorithms are unable to do this properly.
In what follows, we examine each of these challenges closely.
FRAMEWORK
There are many perspectives on and definitions of clusters. Here we invoke the mixture model perspective. This perspective is particularly effective because: (a) it produces mathematically rigorous generative models; (b) it can naturally accommodate multiple-scale data without transformations and approximations; (c) it can handle dependencies within and between data types; and (d) it is flexible enough to capture a very wide range of scenarios of practical significance.
However, clustering is not always about recovering mixture components. The definition of a cluster is highly dependent upon the context of the problem and the available data.
DATA TRANSFORMATIONS: DISCRETIZATION
Interval-scale variables are discretized and a clustering method suitable for exclusively categorical variables is applied.
EXAMPLE: Consider a Monte Carlo study in which we cluster realizations of the random vector (V, W) defined by a mixture of two well-separated populations as follows:
Population 1: V ∼ N(0, 5.2), W ∼ Multinomial(n = 1, pᵀ = (0.45, 0.45, 0.05, 0.05))
Population 2: V ∼ N(5.2, 5.2), W ∼ Multinomial(n = 1, pᵀ = (0.05, 0.05, 0.45, 0.45))
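The two-population design can be simulated directly. A minimal numpy sketch, assuming an equal mixing proportion and reading the second Normal parameter as the variance (neither is stated on the slide):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500  # sample size used in the study

# Population label for each observation; the 50/50 mixing proportion
# is an assumption, since the slide does not state it.
z = rng.integers(0, 2, size=N)

# Continuous variable V: N(0, 5.2) vs N(5.2, 5.2), reading the second
# parameter as the variance (an assumption), so sd = sqrt(5.2).
v = rng.normal(np.where(z == 0, 0.0, 5.2), np.sqrt(5.2))

# Categorical variable W: one multinomial draw over 4 categories,
# with probabilities depending on the population.
probs = np.array([[0.45, 0.45, 0.05, 0.05],
                  [0.05, 0.05, 0.45, 0.45]])
w = np.array([rng.choice(4, p=probs[k]) for k in z])

# Median-split discretization of V (the 2-bin condition of the study).
v_binned = (v > np.median(v)).astype(int)
```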
DISCRETIZATION
Sample size is N = 500.
V is discretized with: 1) a median split into 2 bins; 2) a tertile split into 3 bins; and so on, up to 9 bins. We use the k-modes and latent class analysis (LCA) algorithms.
Figure 1 (next slide) shows the adjusted Rand index (ARI) for each of the discretization conditions.
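The quantile splits above (2 bins, 3 bins, and so on up to 9) can be produced with a small helper; a sketch, where `quantile_bins` is our name, not from the slides:

```python
import numpy as np

def quantile_bins(v, n_bins):
    """Discretize a continuous variable into n_bins equal-frequency bins:
    a median split for n_bins = 2, a tertile split for n_bins = 3, etc."""
    edges = np.quantile(v, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.digitize(v, edges)

v = np.arange(100.0)        # toy stand-in for the simulated V
bins = quantile_bins(v, 4)  # quartile split into 4 bins
```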
FIGURE 1: Performance of k-modes and LCA for various quantile splits of the data
RESULTS
The k-modes algorithm performs best for the median split of the interval-scale variable and degrades as the number of bins increases;
LCA does not degrade as the number of cut points increases;
The normal-multinomial mixture model, which uses the untransformed interval-scale variable, outperforms both competing algorithms for all choices of cut points.
NUMERICAL CODING METHOD
Involves converting categorical variables into numeric variables and clustering the new set with methods suitable for interval-scale data only (e.g. k-means).
In practice, clustering with a numerical coding technique always involves 0-1 dummy coding together with standardized continuous variables. But this is ineffective, as the following proposition shows.
Consider (V, W) defined as follows:
V = Y₁ with probability π ∈ [0, 1];
V = Y₂ with probability 1 − π.
NUMERICAL CODING METHODS
Y₁, Y₂ are two continuous random variables with means 0 and μ and variances σ₁² and σ₂². Let η = E[V] and ν² = Var(V) = πσ₁² + (1 − π)σ₂² + π(1 − π)μ². Let also W = B₁ with probability π and W = B₂ with probability 1 − π, where B₁ ∼ Bern(p₁) and B₂ ∼ Bern(p₂).
The squared Euclidean distance between populations 1 and 2 is then:
((Y₁ − η)/ν − (Y₂ − η)/ν)² + (B₁ − B₂)².
Proposition
Let (V, W) be a mixed-type bivariate vector as above. Then the expectation of the continuous contribution to the distance is
E[((Y₁ − Y₂)/ν)²] = φ,
where φ > 1 when σ₁ ≠ σ₂, and φ > 2 when σ₁ = σ₂. Furthermore, since |B₁ − B₂| is 0 or 1,
0 < E[(B₁ − B₂)²] < 1 for all p₁, p₂ ∈ (0, 1).
NUMERICAL CODING
The continuous contribution has expectation > 1 for σ₁ ≠ σ₂;
The continuous contribution has expectation > 2 for equal variances;
The categorical contribution has expectation < 1.
This implies an unbalanced treatment of the continuous and categorical variables.
How do we deal with this unbalanced treatment? Perhaps the categorical variables should be up-weighted, i.e. coded 0-2 instead of 0-1? This is an ineffective strategy in the general case.
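The proposition is easy to verify numerically. A quick Monte Carlo sketch under illustrative parameter choices (equal variances, π = 0.5, and hypothetical p₁, p₂ that are our picks, not the slides'):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
pi, mu, sigma = 0.5, 3.0, 1.0   # equal-variance case of the proposition
p1, p2 = 0.3, 0.8               # illustrative Bernoulli parameters

# Continuous contribution: E[((Y1 - Y2)/nu)^2], with nu^2 = Var(V).
y1 = rng.normal(0.0, sigma, n)
y2 = rng.normal(mu, sigma, n)
nu2 = pi * sigma**2 + (1 - pi) * sigma**2 + pi * (1 - pi) * mu**2
cont = np.mean((y1 - y2) ** 2) / nu2

# Categorical contribution from the 0-1 dummy-coded variable.
b1 = rng.binomial(1, p1, n)
b2 = rng.binomial(1, p2, n)
cat = np.mean((b1 - b2) ** 2)

print(f"continuous contribution ~ {cont:.3f} (> 2 for equal variances)")
print(f"categorical contribution ~ {cat:.3f} (always in (0, 1))")
```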
HYBRID DISTANCE
HYBRID DISTANCE METHOD: MODHA-SPANGLER WEIGHTING
MODHA-SPANGLER WEIGHTING
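The weighting scheme itself appears on the slides as figures that did not survive extraction. As a rough numpy sketch of the brute-force weight-search idea — the scoring function below is a simplified within/total distortion ratio, and the details of the objective in Modha & Spangler (2003) differ:

```python
import numpy as np

def kmeans(x, k, iters=25, seed=0):
    """Plain Lloyd's algorithm: returns cluster labels and centroids."""
    rng = np.random.default_rng(seed)
    cent = x[rng.choice(len(x), size=k, replace=False)].copy()
    lab = np.zeros(len(x), dtype=int)
    for _ in range(iters):
        d = ((x[:, None, :] - cent[None, :, :]) ** 2).sum(-1)
        lab = d.argmin(1)
        for j in range(k):
            if (lab == j).any():
                cent[j] = x[lab == j].mean(0)
    return lab, cent

def distortion_ratio(x, lab):
    """Within-cluster sum of squares over total sum of squares for one
    block of variables (a simplified stand-in for the M-S objective)."""
    within = sum(((x[lab == j] - x[lab == j].mean(0)) ** 2).sum()
                 for j in np.unique(lab))
    total = ((x - x.mean(0)) ** 2).sum()
    return within / total

def modha_spangler(x_con, x_dum, k, weights=np.linspace(0.05, 0.95, 19)):
    """Brute-force search over the continuous weight w: cluster
    [sqrt(w)*continuous, sqrt(1-w)*dummies] with k-means and keep the
    w that minimizes the product of the per-block distortion ratios."""
    best_score, best_w, best_lab = np.inf, None, None
    for w in weights:
        xw = np.hstack([np.sqrt(w) * x_con, np.sqrt(1 - w) * x_dum])
        lab, _ = kmeans(xw, k)
        score = distortion_ratio(x_con, lab) * distortion_ratio(x_dum, lab)
        if score < best_score:
            best_score, best_w, best_lab = score, w, lab
    return best_score, best_w, best_lab

# Illustrative two-cluster mixed data.
rng = np.random.default_rng(2)
x_con = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
x_dum = np.vstack([rng.binomial(1, 0.9, (50, 2)),
                   rng.binomial(1, 0.1, (50, 2))]).astype(float)
score, w_best, labels = modha_spangler(x_con, x_dum, k=2)
```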
KAMILA ALGORITHM
► We introduce a novel algorithm for clustering mixed data.
► "KAy means for MIxed LArge datasets":
■ Speed of k-means;
■ Desirable continuous/categorical weighting strategy of finite mixture models;
■ Categorical variables are modeled as multinomial random variables;
■ For continuous variables, distances to the nearest centroids are calculated, and an appropriate univariate kernel density estimator is constructed;
■ Clusters in the continuous dimensions are modeled as arbitrary spherical/elliptical mixtures using a kernel density-based method.
Full KAMILA Algorithm
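The slides for the full algorithm, radial KDE, and data model are figures that did not survive extraction. As a rough, simplified sketch of one KAMILA-style iteration in numpy — using a plain univariate Gaussian KDE on distances in place of the radial KDE of the actual algorithm, with an initialization and Laplace smoothing that are our illustrative choices, not the paper's:

```python
import numpy as np

def kde(sample, h=None):
    """Univariate Gaussian kernel density estimate; returns a callable.
    (KAMILA proper uses a radial KDE with a dimension correction; this
    plain KDE is a simplification.)"""
    sample = np.asarray(sample, float)
    if h is None:  # Silverman-style rule-of-thumb bandwidth
        h = 1.06 * sample.std() * len(sample) ** (-1 / 5)
    def f(x):
        u = (np.asarray(x, float)[..., None] - sample) / h
        return np.exp(-0.5 * u ** 2).mean(-1) / (h * np.sqrt(2 * np.pi))
    return f

def kamila_step(v, w, centroids, theta):
    """One simplified KAMILA-style iteration.
    v: (n, p) continuous block; w: (n,) categorical codes in 0..L-1;
    centroids: (k, p); theta: (k, L) per-cluster multinomial probs."""
    # Distances to each centroid; fit the KDE on the minimum distances.
    d = np.linalg.norm(v[:, None, :] - centroids[None, :, :], axis=-1)
    f = kde(d.min(axis=1))
    # Combine continuous (KDE) and categorical (multinomial) log-scores
    # and assign each point to its best-scoring cluster.
    score = np.log(f(d) + 1e-12) + np.log(theta[:, w].T + 1e-12)
    labels = score.argmax(axis=1)
    # Re-estimate centroids and multinomial parameters per cluster.
    k, L = theta.shape
    new_c = np.vstack([v[labels == g].mean(axis=0) if (labels == g).any()
                       else centroids[g] for g in range(k)])
    new_t = np.vstack([np.bincount(w[labels == g], minlength=L) + 1.0
                       for g in range(k)])  # +1: Laplace smoothing
    new_t /= new_t.sum(axis=1, keepdims=True)
    return labels, new_c, new_t

# Toy run: one continuous and one binary categorical variable, 2 clusters.
rng = np.random.default_rng(3)
n = 200
z = rng.integers(0, 2, n)
v = rng.normal(np.where(z == 0, 0.0, 5.0), 1.0)[:, None]
w = np.where(rng.random(n) < 0.8, z, 1 - z)  # 80% informative categories
cent, theta = np.array([[1.0], [4.0]]), np.full((2, 2), 0.5)
for _ in range(5):
    labels, cent, theta = kamila_step(v, w, cent, theta)
```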
RADIAL KDE
DATA MODEL
ALGORITHM
IDENTIFIABILITY OF THE MIXTURE MODEL
TIME COMPLEXITY
SOFTWARE DEVELOPMENT
The software can be downloaded from https://CRAN.R-project.org/package=kamila and incorporates the KAMILA and Modha-Spangler techniques.
Weighting techniques: Hennig-Liao and others.
A Hadoop implementation of KAMILA has been developed, designed for clustering very large data sets stored in distributed file systems; it uses the Map-Reduce programming model.
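The Hadoop implementation is described in Foss & Markatou (2018). The map-reduce pattern it relies on can be illustrated generically — this is not the package's code, just a single distributed centroid-update step in numpy:

```python
from collections import defaultdict
import numpy as np

def map_phase(chunk, centroids):
    """Mapper: assign each point of one data chunk to its nearest
    centroid and emit (cluster id, (partial sum, count)) pairs."""
    pairs = []
    for x in chunk:
        g = int(np.argmin(((centroids - x) ** 2).sum(axis=1)))
        pairs.append((g, (x, 1)))
    return pairs

def reduce_phase(pairs, k, p):
    """Reducer: aggregate the mappers' partial sums into new centroids."""
    acc = defaultdict(lambda: (np.zeros(p), 0))
    for g, (x, c) in pairs:
        s, n = acc[g]
        acc[g] = (s + x, n + c)
    return np.vstack([acc[g][0] / max(acc[g][1], 1) for g in range(k)])

# One distributed centroid-update step over three chunks.
rng = np.random.default_rng(4)
data = np.vstack([rng.normal(0, 0.5, (60, 2)), rng.normal(5, 0.5, (60, 2))])
chunks = np.array_split(rng.permutation(data), 3)
cent = np.array([[1.0, 1.0], [4.0, 4.0]])
all_pairs = [pr for ch in chunks for pr in map_phase(ch, cent)]
cent = reduce_phase(all_pairs, k=2, p=2)
```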
SIMULATIONS
UNDERSTANDING PROSTATE CANCER MORTALITY
PROSTATE CANCER MORTALITY
CONCLUSIONS
REFERENCES
1. Foss, A., Markatou, M., Ray, B. (2018). Distance metrics and clustering methods for mixed-type data. International Statistical Review, https://doi.org/10.1111/insr.12274.
2. Huang, Z. (1998). Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining & Knowledge Discovery, 2(3), 283-304.
3. Byar, D. & Green, S. (1980). The choice of treatment for cancer patients based on covariate information: applications to prostate cancer. Bulletin du Cancer, 67, 477-490.
4. Foss, A., Markatou, M., Ray, B., Heching, A. (2016). A semi-parametric method for clustering mixed data. Machine Learning, 105(3), 419-458.
5. Holzmann, H., Munk, A., Gneiting, T. (2006). Identifiability of finite mixtures of elliptical distributions. Scandinavian Journal of Statistics, 33(4), 753-763.
6. Tibshirani, R. & Walther, G. (2005). Cluster validation by prediction strength. Journal of Computational and Graphical Statistics, 14(3), 511-528.
7. Foss, A., Markatou, M. (2018). kamila: Clustering mixed-type data in R and Hadoop. Journal of Statistical Software, 83(13), doi:10.18637/jss.v083.i13.
8. Modha, D. S. & Spangler, W. S. (2003). Feature weighting in k-means clustering. Machine Learning, 52(3), 217-237.