CLUSTERING MIXED-TYPE DATA


Marianthi Markatou

Dept. of Biostatistics

University at Buffalo

Statistical & Computational Challenges in Precision Medicine

November 6-9, 2018

Institute for Mathematics & Its Applications

Collaborators

Alex Foss*, Senior Scientist, Sandia National Laboratories
Bonnie Ray, VP, Talkspace
Aliza Heching, IBM Watson Research Center

OUTLINE
Introduction & Problem Statement
Background/Literature Review
KAMILA: A New Algorithm for Clustering Mixed-Type Data
Software Platform
Examples
References

INTRODUCTION
Precision Medicine: an approach to disease treatment and prevention that is designed to optimize efficiency or therapeutic benefit for groups of patients. These subgroups are identified via data-driven techniques. The data are obtained from large, heterogeneous combinations of sources:

Demographic information: age, gender, race, ethnicity, economic status;
Diagnostic testing: inflammation score, tumor size, disease stage;
High-throughput “omics” data sets: interval-scale gene expression data, categorical SNP data.

INTRODUCTION/LITERATURE REVIEW
Many “big data” sets contain variables of both interval and categorical (nominal/ordinal) scale. Many commonly used approaches for clustering mixed-scale (or mixed-type) data involve strategies that adapt existing techniques to single-type data. That is:

Interval-scale data are discretized, and techniques for clustering categorical-scale data are then applied;
Categorical-scale variables are dummy-coded, and interval-scale clustering techniques are then applied.

There are significant problems associated with both of these strategies.
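
As a minimal illustration of the two strategies, the base-R sketch below discretizes a continuous variable and dummy-codes a categorical one; the variables age and diagnosis are hypothetical and used only for illustration.

set.seed(1)
age <- rnorm(10, mean = 50, sd = 10)                    # interval-scale variable
diagnosis <- factor(sample(c("A", "B", "C"), 10, replace = TRUE))

# Strategy 1: discretize the interval-scale variable (here, a tertile split)
age_binned <- cut(age,
                  breaks = quantile(age, probs = seq(0, 1, length.out = 4)),
                  include.lowest = TRUE)

# Strategy 2: dummy-code the categorical variable as 0-1 indicators
diagnosis_dummy <- model.matrix(~ diagnosis - 1)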

INTRODUCTION
Mixed-type data: data that are a combination of realizations from both continuous (e.g., height, weight, systolic blood pressure) and categorical (e.g., gender, race, ethnicity, HCV genotype) random variables.
Cluster analysis: aims to identify groups of similar units in a data set.
Definition: A cluster is a set of data points that is compact and isolated, that is, a set of points that are similar to each other and well-separated from points not belonging to the cluster (Cormack, 1971; Gordon, 1981; Jain, 2010; McNicholas, 2016).

INTRODUCTION
Additional characteristics of clusters that may be desirable, depending on the context, include stability of the identified clusters, independence of variables within a cluster, and the degree to which a cluster can be well represented by its centroid.

Recent literature providing interesting discussions of these issues includes: 1) Hennig, C. (2015). What are the true clusters? Pattern Recognition Letters, 64, 53-62; 2) McNicholas, P.D. (2016). Model-based clustering. Journal of Classification, 33, 331-373.

PROBLEM STATEMENT

We focus on clustering mixed-scale data, that is, data sets that contain realizations of both interval and categorical (nominal and/or ordinal) scale variables.

QUESTION: Why does this problem matter?
ANSWER: Informative subgroup identification, i.e., clustering.

LITERATURE REVIEW
Fundamental challenges are:

1) to equitably balance the contributions of the continuous and categorical variables;

2) to handle data sets in which only a subset of the variables is related to the underlying cluster structure of interest, which current clustering algorithms cannot do properly.

In what follows, we examine each of these challenges closely.

FRAMEWORK
There are many perspectives on, and definitions of, clusters. Here we invoke the mixture-model perspective. This perspective is particularly effective because: (a) it produces mathematically rigorous generative models; (b) it can naturally accommodate multiple-scale data without transformations and approximations; (c) it can handle dependencies within and between data types; and (d) it is flexible enough to capture a very wide range of scenarios of practical significance.

However, clustering is not always about recovering mixture components. The definition of a cluster is highly dependent upon the context of the problem and the available data.

DATA TRANSFORMATIONS: DISCRETIZATION
Interval-scale variables are discretized, and a clustering method suitable for exclusively categorical variables is applied.
EXAMPLE: Consider a Monte Carlo study in which we cluster realizations of the random vector (V, W), defined by a mixture of two well-separated populations as follows:

Population 1: V ∼ N(0, 5.2), W ∼ Multinomial(n = 1, p^T = (.45, .45, .05, .05))

Population 2: V ∼ N(5.2, 5.2), W ∼ Multinomial(n = 1, p^T = (.05, .05, .45, .45))

DISCRETIZATION

Sample size is N=500;

V is discretized with: 1) a median split into 2 bins; 2) a tertile split into 3 bins; and so on, up to 9 bins. We use the k-modes algorithm and the LCA (latent class analysis) algorithm.

Figure 1 (below) shows the adjusted Rand index (ARI) for each of the discretization conditions; a simulation sketch follows the figure caption.

FIGURE 1: Performance of k-modes and LCA for various quantile splits of the data
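
A minimal sketch of this simulation in R, assuming the second parameter of N(·, ·) above is a variance; k-modes is taken from the klaR package and the ARI from mclust (both package choices are assumptions, and the LCA arm is omitted):

library(klaR)    # kmodes()
library(mclust)  # adjustedRandIndex()

set.seed(42)
N <- 500
z <- rbinom(N, 1, 0.5) + 1                        # true population labels
V <- rnorm(N, mean = c(0, 5.2)[z], sd = sqrt(5.2))
p <- rbind(c(.45, .45, .05, .05),
           c(.05, .05, .45, .45))
W <- sapply(z, function(g) sample(1:4, 1, prob = p[g, ]))

ari <- sapply(2:9, function(nbins) {
  # quantile split of V into nbins bins, then k-modes on the all-categorical data
  breaks <- quantile(V, probs = seq(0, 1, length.out = nbins + 1))
  Vbin   <- cut(V, breaks = breaks, include.lowest = TRUE)
  fit    <- kmodes(data.frame(Vbin = Vbin, W = factor(W)), modes = 2)
  adjustedRandIndex(fit$cluster, z)
})
ari   # ARI for 2 through 9 bins, cf. Figure 1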

RESULTS
The k-modes algorithm performs best for the median split of the interval-scale variable and degrades as the number of bins increases;
LCA does not degrade as the number of cut points increases;
The normal-multinomial mixture model, which uses the untransformed interval-scale variable, outperforms both competing algorithms for all choices of cut point.

NUMERICAL CODING METHOD
This method involves converting categorical variables into numeric variables and clustering the new data set with methods suitable for interval-scale data only (e.g., k-means).

In practice, clustering with a numerical coding technique typically involves 0-1 dummy coding of the categorical variables together with standardized continuous variables. This is ineffective, as the following proposition shows.

Consider (V, W) defined as follows: V = Y₁ with probability π ∈ [0, 1], and V = Y₂ with probability 1 − π.

Numerical Coding Methods
Y₁, Y₂ are two continuous random variables with means 0 and μ and variances σ₁² and σ₂². Let η = E[V] and ν² = Var(V) = πσ₁² + (1 − π)σ₂² + π(1 − π)μ². Let also W = B₁ with probability π and W = B₂ with probability 1 − π, where B₁ ∼ Bern(p₁) and B₂ ∼ Bern(p₂).
The squared Euclidean distance between populations 1 and 2 is then:

((Y₁ − η)/ν − (Y₂ − η)/ν)² + (B₁ − B₂)².

Proposition
Let (V, W) be the mixed-type bivariate vector defined above. Then the expectation of the continuous contribution to the distance is

E[((Y₁ − Y₂)/ν)²] = φ,

where φ > 1 when σ₁ ≠ σ₂, and φ > 2 when σ₁ = σ₂. Furthermore, since |B₁ − B₂| is 0 or 1,

0 < E[(B₁ − B₂)²] < 1 for all p₁, p₂ ∈ (0, 1).

Numerical Coding
The continuous contribution has expectation > 1 for σ₁ ≠ σ₂;
The continuous contribution has expectation > 2 for equal variances;
The categorical contribution has expectation < 1.

This implies an unbalanced treatment of the continuous and categorical variables.

How do we deal with this unbalanced treatment?

Perhaps the categorical variables should be up-weighted, i.e., coded 0-2 instead of 0-1? This is an ineffective strategy in the general case; a numerical check of the proposition is sketched below.
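
A quick Monte Carlo check of the proposition in base R; the specific distributions and parameter values below are illustrative assumptions (normal Y's with equal variances):

set.seed(7)
n   <- 1e6
mu  <- 3; s1 <- 1; s2 <- 1          # equal variances, so we expect phi > 2
pi_ <- 0.5; p1 <- 0.2; p2 <- 0.8

Y1 <- rnorm(n, 0, s1)
Y2 <- rnorm(n, mu, s2)
nu2 <- pi_ * s1^2 + (1 - pi_) * s2^2 + pi_ * (1 - pi_) * mu^2   # Var(V)

B1 <- rbinom(n, 1, p1)
B2 <- rbinom(n, 1, p2)

mean((Y1 - Y2)^2) / nu2   # continuous contribution: approx. 3.4 here, > 2
mean((B1 - B2)^2)         # categorical contribution: approx. 0.68, < 1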

HYBRID DISTANCE METHOD: MODHA-SPANGLER WEIGHTING
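
Briefly, Modha-Spangler weighting performs a brute-force search over convex weightings of the continuous and categorical distance contributions, choosing the weight that minimizes a ratio of within-cluster to between-cluster distortion (Modha & Spangler, 2003; reference 8). Below is a minimal sketch using the kamila package's gmsClust function; the argument names, the dummyCodeFactorDf helper, and the returned structure follow my reading of Foss & Markatou (2018) and should be checked against the package documentation:

library(kamila)

set.seed(1)
con  <- data.frame(v1 = rnorm(100), v2 = rnorm(100))           # continuous variables
cats <- data.frame(w1 = factor(sample(1:3, 100, replace = TRUE)),
                   w2 = factor(sample(1:2, 100, replace = TRUE)))

# gmsClust's default distances treat catData numerically, so dummy-code first
# (dummyCodeFactorDf is a kamila helper, if memory serves)
catDummy <- dummyCodeFactorDf(cats)

ms <- gmsClust(conData = con, catData = catDummy, nclust = 2)
str(ms, max.level = 1)   # inspect the fitted object, incl. the selected weight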

KAMILA ALGORITHM
►We introduce a novel algorithm for clustering mixed-type data.
►“KAy-means for MIxed LArge data sets”:
■ Speed of k-means;
■ Desirable continuous/categorical weighting strategy of finite mixture models;
■ Categorical variables are modeled as multinomial random variables;
■ For continuous variables, distances to the nearest centroids are calculated, and an appropriate univariate kernel density estimator is constructed;
■ Clusters in the continuous dimensions are modeled as arbitrary spherical/elliptical mixtures using a kernel density-based method.
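
As a heavily simplified illustration of one KAMILA-style iteration, the base-R sketch below substitutes a plain univariate KDE of the point-to-centroid distances for the paper's radial kernel density estimator, and omits multiple initializations and convergence checks; it conveys the structure of the idea, not the published algorithm:

# Toy mixed-type data: two continuous dimensions, one categorical variable
set.seed(3)
n <- 200; p <- 2; G <- 2
X <- rbind(matrix(rnorm(n/2 * p, 0), ncol = p),
           matrix(rnorm(n/2 * p, 3), ncol = p))
W <- c(sample(1:3, n/2, TRUE, prob = c(.8, .1, .1)),
       sample(1:3, n/2, TRUE, prob = c(.1, .1, .8)))

mu    <- X[sample(n, G), , drop = FALSE]   # initial centroids
theta <- matrix(1/3, G, 3)                 # initial multinomial parameters

for (iter in 1:10) {
  # Continuous part: distances to each centroid, scored by a KDE fitted to
  # the minimum distances (a stand-in for KAMILA's radial KDE)
  D <- sapply(1:G, function(g)
    sqrt(rowSums((X - matrix(mu[g, ], n, p, byrow = TRUE))^2)))
  kde  <- density(apply(D, 1, min))
  logf <- matrix(log(pmax(approx(kde$x, kde$y, xout = D, rule = 2)$y, 1e-12)), n, G)

  # Categorical part: multinomial log-likelihood per cluster
  logc <- sapply(1:G, function(g) log(theta[g, W]))

  # Assign each point to the cluster maximizing the combined log score
  z <- max.col(logf + logc)

  # Update centroids and (smoothed) multinomial parameters
  for (g in 1:G) {
    if (sum(z == g) == 0) next
    mu[g, ]    <- colMeans(X[z == g, , drop = FALSE])
    theta[g, ] <- (tabulate(W[z == g], 3) + 1) / (sum(z == g) + 3)
  }
}
table(z, rep(1:2, each = n/2))   # recovered clusters vs. true labels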

Full KAMILA Algorithm

RADIAL KDE

DATA MODEL

ALGORITHM

IDENTIFIABILITY OF THE MIXTURE MODEL

TIME COMPLEXITY

SOFTWARE DEVELOPMENT
The software can be downloaded from https://CRAN.R-project.org/package=kamila and incorporates both the KAMILA and Modha-Spangler techniques.

Weighting techniques: Hennig-Liao and others.

A Hadoop implementation of KAMILA has also been developed, designed for clustering very large data sets stored in distributed file systems; it uses the Map-Reduce programming model.
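
A minimal usage sketch for the CRAN package; the argument and component names (conVar, catFactor, numClust, numInit, finalMemb) follow my recollection of Foss & Markatou (2018) and should be verified with ?kamila:

library(kamila)

set.seed(2)
conVar    <- data.frame(v1 = rnorm(100), v2 = rnorm(100))
catFactor <- data.frame(w1 = factor(sample(1:3, 100, replace = TRUE)))

fit <- kamila(conVar = conVar, catFactor = catFactor,
              numClust = 2, numInit = 10)
table(fit$finalMemb)   # cluster memberships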

SIMULATIONS

UNDERSTANDING PROSTATE CANCER MORTALITY

PROSTATE CANCER MORTALITY

CONCLUSIONS

REFERENCES
1. Foss, A., Markatou, M., Ray, B. (2018). Distance metrics and clustering methods for mixed-type data. International Statistical Review, https://doi.org/10.1111/insr.12274

2. Huang, Z. (1998). Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining & Knowledge Discovery, 2(3), 283-304.

3. Byar, D. & Green, S. (1980). The choice of treatment for cancer patients based on covariate information: applications to prostate cancer. Bulletin du Cancer, 67, 477-490.

4. Foss, A., Markatou, M., Ray, B., Heching, A. (2016). A semi-parametric method for clustering mixed data. Machine Learning, 105(3), 419-458.

5. Holzmann, H., Munk, A., Gneiting, T. (2006). Identifiability of finite mixtures of elliptical distributions. Scandinavian Journal of Statistics, 33(4), 753-763.

6. Tibshirani, R. & Walther, G. (2005). Cluster validation by prediction strength. Journal of Computational and Graphical Statistics, 14(3), 511-528.

7. Foss, A., Markatou, M. (2018). kamila: Clustering mixed-type data in R and Hadoop. Journal of Statistical Software, 83(13), doi:10.18637/jss.v083.i13.

8. Modha, D.S. & Spangler, W.S. (2003). Feature weighting in k-means clustering. Machine Learning, 52(3), 217-237.