
Bayesian Variable Selection in Clustering High-Dimensional Data

Mahlet G. TADESSE, Naijun SHA, and Marina VANNUCCI

Over the last decade, technological advances have generated an explosion of data with substantially smaller sample size relative to the number of covariates (p ≫ n). A common goal in the analysis of such data involves uncovering the group structure of the observations and identifying the discriminating variables. In this article we propose a methodology for addressing these problems simultaneously. Given a set of variables, we formulate the clustering problem in terms of a multivariate normal mixture model with an unknown number of components and use the reversible-jump Markov chain Monte Carlo technique to define a sampler that moves between different dimensional spaces. We handle the problem of selecting a few predictors among the prohibitively vast number of variable subsets by introducing a binary exclusion/inclusion latent vector, which gets updated via stochastic search techniques. We specify conjugate priors and exploit the conjugacy by integrating out some of the parameters. We describe strategies for posterior inference and explore the performance of the methodology with simulated and real datasets.

KEY WORDS: Bayesian variable selection; Bayesian clustering; Label switching; Reversible-jump Markov chain Monte Carlo.

1. INTRODUCTION

Over the last decade, technological advances have generated an explosion of data for which the number of covariates is considerably larger than the sample size. These data pose a challenge to standard statistical methods and have revived a strong interest in clustering algorithms. The goals are to uncover the group structure of the observations and to determine the discriminating variables. The analysis of DNA microarray datasets is a typical high-dimensional example, where variable selection and outcome prediction have become a major focus of research. The technology provides an automated method for simultaneously quantifying thousands of genes, but its high cost constrains researchers to a few experimental units. The discovery of different types of tissues or subtypes of a disease and the identification of genes that best distinguish them is believed to provide a better understanding of the underlying biological processes. This in turn could lead to better treatment choice for patients in different risk groups, and the selected genes can serve as biomarkers to improve diagnosis and therapeutic intervention.

When dealing with high-dimensional datasets, the cluster structure of the observations is often confined to a small subset of variables. As pointed out by several authors (e.g., Fowlkes, Gnanadesikan, and Kettering 1988; Milligan 1989; Gnanadesikan, Kettering, and Tao 1995; Brusco and Cradit 2001), the inclusion of unnecessary covariates could complicate or even mask the recovery of the clusters. Common approaches to mitigating the effect of noisy variables or identifying those that define true cluster structure involve differentially weighting the covariates or selecting the discriminating ones. Gnanadesikan et al. (1995) showed that variable weighting schemes were often outperformed by variable selection procedures.

Mahlet G. Tadesse is Assistant Professor, Department of Biostatistics, University of Pennsylvania, Philadelphia, PA 19104 (E-mail: mtadesse@cceb.upenn.edu). Naijun Sha is Assistant Professor, Department of Mathematical Sciences, University of Texas at El Paso, El Paso, TX 79968 (E-mail: nsha@utep.edu). Marina Vannucci is Associate Professor, Department of Statistics, Texas A&M University, College Station, TX 77843 (E-mail: mvannucci@stat.tamu.edu). Tadesse's research is supported by grant CA90301 from the National Cancer Institute, Sha's work is supported in part by BBRC/RCMI National Institute of Health grant 2G12RR08124 and a University Research Institute (UTEP) grant, and Vannucci's research is supported by National Science Foundation CAREER award number DMS-00-93208. The authors thank the associate editor and two referees for their thorough reading of the manuscript and their input, which have improved the article considerably.

In addition to improving the prediction of cluster membership, the variable selection task reduces the measurement and storage requirements for future samples, thereby providing more cost-effective predictors.

In this article we focus on the analysis of high-dimensional datasets characterized by a sample size, n, that is substantially smaller than the number of covariates, p. We propose a Bayesian approach for simultaneously selecting discriminating variables and uncovering the cluster structure of the observations. The article is organized as follows. In Section 2 we give a brief review of existing procedures and discuss their shortcomings. In Section 3 we describe our model specification, which makes use of model-based clustering and introduces latent indicators to identify discriminating variables. In Section 4 we present the prior assumptions and the resulting full conditionals; we also provide guidelines for choosing the hyperparameters. In Section 5 we discuss in detail the Markov chain Monte Carlo (MCMC) updates for each of the model parameters, and in Section 6 we address the issue of label switching and describe the inference mechanism. In Section 7 we assess the performance of our methodology using various datasets. We conclude the article with a brief discussion in Section 8.

2. REVIEW OF EXISTING METHODS

The most widely used approaches separate the variable selection/weighting and clustering tasks. One common variable selection approach involves fitting univariate models on each covariate and selecting a small subset that passes some threshold for significance (McLachlan, Bean, and Peel 2002). Another standard approach uses dimension-reduction techniques, such as principal component analysis, and focuses on the leading components (Ghosh and Chinnaiyan 2002). The reduced data are then clustered using the procedure of choice. These selection steps are suboptimal. The first approach does not assess the joint effect of multiple variables and could throw away potentially valuable covariates, which are not predictive individually but may provide significant improvement in conjunction with others. The second approach results in linear combinations and does not allow evaluation of the original variables.


In addition, it has been shown that the leading components do not necessarily contain the most information about the cluster structure (Chang 1983).

The literature includes several procedures that combine the variable selection and clustering tasks. Fowlkes et al. (1988) used a forward selection approach in the context of complete linkage hierarchical clustering. The variables are added using information on the between-cluster and total sums of squares, and their significance is judged based on graphical information; the authors characterized this as an "informal" assessment. Brusco and Cradit (2001) proposed a variable selection heuristic for k-means clustering using a similar forward selection procedure. Their approach uses the adjusted Rand index to measure cluster recovery. Recently, Friedman and Meulman (2003) proposed a hierarchical clustering procedure that uncovers cluster structure on separate subsets of variables. The algorithm does not explicitly select variables, but rather assigns them different weights, which can be used to extract relevant covariates. This approach is somewhat different from ours, in that it detects subgroups of observations that preferentially cluster on different subsets of variables, rather than clustering on the same subset of variables. These methods all work in conjunction with k-means or hierarchical clustering algorithms, which have several limitations. One major shortcoming is the lack of a statistical criterion for assessing the number of clusters. These procedures also do not provide a measure of the uncertainty associated with the sample allocations. In addition, with respect to variable selection, the existing approaches use greedy deterministic search procedures, which can get stuck at local minima. They also have the limitation of presuming the existence of a single "best" subset of clustering variables; in practice, however, there may be several equally good subsets that define the true cluster structure.

The Bayesian paradigm offers a coherent framework for addressing the problems of variable selection and clustering simultaneously. Recent developments in MCMC techniques (Gilks, Richardson, and Spiegelhalter 1996) have led to substantial research in both areas. Bayesian variable selection methods so far have been developed mainly in the context of regression analysis (see George and McCulloch 1997 for a review of prior specifications and MCMC implementations, and Brown, Vannucci, and Fearn 1998b for extensions to the case of multivariate responses). The Bayesian clustering approach uses a mixture of probability distributions, each distribution representing a different cluster. Diebolt and Robert (1994) presented a comprehensive treatment of MCMC strategies when the number of mixture components is known. For the general case of an unknown number of components, Richardson and Green (1997) successfully applied the reversible-jump MCMC (RJMCMC) technique, but considered only univariate data. Stephens (2000a) proposed an approach based on continuous-time Markov birth–death processes and applied it to a bivariate setting.

We propose a method that simultaneously selects the discriminating variables and clusters the samples into G groups, where G is unknown. The computational burden of searching through all possible 2^p variable subsets is handled through the introduction of a latent p-vector with binary entries. We use a stochastic search method to explore the space of possible values for this latent vector. The clustering problem is formulated in terms of a multivariate mixture model with an unknown number of components. We work with a marginalized likelihood, where some of the model parameters are integrated out, and adapt the RJMCMC technique of Richardson and Green (1997) to the multivariate setting. Our inferential procedure provides selection of discriminating variables, estimates for the number of clusters and the sample allocations, and class prediction for future observations.

3. MODEL SPECIFICATION

3.1 Clustering via Mixture Models

The goal of cluster analysis is to partition a collection of observations into homogeneous groups. Model-based clustering has received great attention recently and has provided promising results in various applications. In this approach the data are viewed as coming from a mixture of distributions, each distribution representing a different cluster. Let X = (x_1, ..., x_n) be independent p-dimensional observations from G populations. The problem of clustering the n samples can be formulated in terms of a mixture of G underlying probability distributions,

    f(x_i | w, θ) = \sum_{k=1}^G w_k f(x_i | θ_k),    (1)

where f(x_i | θ_k) is the density of an observation x_i from the kth component and w = (w_1, ..., w_G)^T are the component weights (w_k ≥ 0, \sum_{k=1}^G w_k = 1).

To identify the cluster from which each observation is drawn, latent variables y = (y_1, ..., y_n)^T are introduced, where y_i = k if the ith observation comes from component k. The y_i's are assumed to be independently and identically distributed with probability mass function p(y_i = k) = w_k. We consider the case where f(x_i | θ_k) is a multivariate normal density with parameters θ_k = (μ_k, Σ_k). Thus, for sample i we have

    x_i | y_i = k, w, θ ~ N(μ_k, Σ_k).    (2)

3.2 Identifying Discriminating Variables

In high-dimensional data it is often the case that a large number of variables provide very little, if any, information about the group structure of the observations. Their inclusion could be detrimental, because it might obscure the recovery of the true cluster structure. To identify the relevant variables, we introduce a latent p-vector γ with binary entries: γ_j takes value 1 if the jth variable defines a mixture distribution for the data, and 0 otherwise. We use (γ) and (γ^c) to index the discriminating variables and those that favor a single multivariate normal density. Notice that this approach is different from what is commonly done in Bayesian variable selection for regression models, where the latent indicator is used to induce mixture priors on the regression coefficients. The likelihood function is then given by

    L(G, γ, w, μ, Σ, η, Ω | X, y)
      = (2π)^{-(p - p_γ)n/2} |Ω_{(γ^c)}|^{-n/2}
        × \exp\{ -\tfrac{1}{2} \sum_{i=1}^n (x_{i(γ^c)} - η_{(γ^c)})^T Ω_{(γ^c)}^{-1} (x_{i(γ^c)} - η_{(γ^c)}) \}
      × \prod_{k=1}^G (2π)^{-p_γ n_k/2} |Σ_{k(γ)}|^{-n_k/2} w_k^{n_k}
        × \exp\{ -\tfrac{1}{2} \sum_{x_i ∈ C_k} (x_{i(γ)} - μ_{k(γ)})^T Σ_{k(γ)}^{-1} (x_{i(γ)} - μ_{k(γ)}) \},    (3)

where (η_{(γ^c)}, Ω_{(γ^c)}) denote the mean and covariance parameters for the nondiscriminating variables, (μ_{k(γ)}, Σ_{k(γ)}) are the mean and covariance parameters of cluster k, C_k = {x_i | y_i = k} with cardinality n_k, and p_γ = \sum_{j=1}^p γ_j.

4. PRIOR SETTING AND FULL CONDITIONALS

4.1 Prior Formulation and Choice of Hyperparameters

For the variable selection indicator γ, we assume that its elements γ_j are independent Bernoulli random variables,

    p(γ) = \prod_{j=1}^p ϕ^{γ_j} (1 - ϕ)^{1 - γ_j}.    (4)

The number of covariates included in the model, p_γ, thus follows a binomial distribution, and ϕ can be elicited as the proportion of variables expected a priori to discriminate the different groups, ϕ = p_prior / p. This prior assumption can be relaxed by formulating a beta(a, b) hyperprior on ϕ, which leads to a beta-binomial prior for p_γ with expectation p a/(a + b). A vague prior can be elicited by setting a + b = 2, which leads to setting a = 2 p_prior / p and b = 2 - a.

Natural priors for the number of components G are a truncated Poisson,

    P(G = g) = \frac{e^{-λ} λ^g / g!}{1 - \left( e^{-λ}(1 + λ) + \sum_{j = G_max + 1}^{∞} e^{-λ} λ^j / j! \right)},    g = 2, ..., G_max,    (5)

or a discrete uniform on [2, G_max], P(G = g) = 1/(G_max - 1), where G_max is chosen arbitrarily large. For the vector of component weights w, we specify a symmetric Dirichlet prior, w | G ~ Dirichlet(α, ..., α).
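For illustration, the truncated Poisson prior (5) simply renormalizes the Poisson probability mass over g = 2, ..., G_max. A minimal sketch (function name is ours):

```python
import numpy as np
from scipy.stats import poisson

def prior_G(lam, G_max):
    """Truncated Poisson prior (5) on G: the Poisson(lam) pmf restricted to
    g = 2, ..., G_max and renormalized."""
    g = np.arange(2, G_max + 1)
    pmf = poisson.pmf(g, lam)
    return g, pmf / pmf.sum()
```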

Updating the mean and covariance parameters is somewhat intricate. When deleting or creating new components, the corresponding changes must be made to the component parameters. However, it is not clear whether adequate proposals can be constructed for the reversible jump in the multivariate setting. Even if this difficulty can be overcome, as γ gets updated the dimensions of μ_{k(γ)}, Σ_{k(γ)}, η_{(γ^c)}, and Ω_{(γ^c)} change, requiring another sampler that moves between different dimensional spaces. The algorithm is much more efficient if we can integrate out these parameters. This also helps substantially accelerate model fitting, because the set of parameters that need to be updated would then consist only of (γ, w, y, G). The integration can be facilitated by taking conjugate priors,

    μ_{k(γ)} | Σ_{k(γ)}, G ~ N(μ_{0(γ)}, h_1 Σ_{k(γ)}),
    η_{(γ^c)} | Ω_{(γ^c)} ~ N(μ_{0(γ^c)}, h_0 Ω_{(γ^c)}),
    Σ_{k(γ)} | G ~ IW(δ, Q_{1(γ)}),    (6)
    Ω_{(γ^c)} ~ IW(δ, Q_{0(γ^c)}),

where IW(δ, Q_{1(γ)}) indicates the inverse-Wishart distribution with dimension p_γ, shape parameter δ, and mean Q_{1(γ)}/(δ - 2) (Brown 1993). Weak prior information is specified by taking δ small; we set it to 3, the smallest integer such that the expectations of the covariance matrices are defined. We take Q_1 = κ_1 I_{p×p} and Q_0 = κ_0 I_{p×p}. Some care in the choice of κ_1 and κ_0 is needed; these hyperparameters need to be in the range of variability of the data. At the same time, we do not want the prior information to overwhelm the data and drive the inference. We experimented with different values and found that values commensurate with the order of magnitude of the eigenvalues of cov(X), where X is the full data matrix, lead to reasonable results. We suggest setting κ_1 between 1% and 10% of the upper decile of the n - 1 nonzero eigenvalues and κ_0 proportional to their lower decile.

For the mean parameters, we take the priors to be fairly flat over the region where the data are defined. Each element of μ_0 is set to the corresponding covariate interval midpoint, and h_0 and h_1 are chosen arbitrarily large. This prevents the specification of priors that do not overlap with the likelihood and allows for mixtures with widely different component means. We found that values of h_0 and h_1 between 10 and 1,000 performed reasonably well.
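To make these guidelines concrete, here is a small sketch of a data-driven hyperparameter setup. The function name and the fraction applied to the upper decile are illustrative assumptions, not prescriptions from the paper, which only asks that κ_1 and κ_0 be commensurate with the data's variability.

```python
import numpy as np

def default_hyperparameters(X, frac_upper=0.1):
    """Data-driven defaults for (kappa1, kappa0, mu0), in the spirit of Section 4.1.

    X: (n, p) data matrix. `frac_upper` is an assumed fraction of the upper
    decile of the nonzero eigenvalues of cov(X).
    """
    # Eigenvalues of the sample covariance; at most n - 1 are nonzero when p > n.
    eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))
    nonzero = eigvals[eigvals > 1e-10]
    kappa1 = frac_upper * np.percentile(nonzero, 90)  # scale for discriminating variables
    kappa0 = np.percentile(nonzero, 10)               # scale for nondiscriminating variables
    # Prior mean mu0: midpoint of each covariate's observed range.
    mu0 = (X.min(axis=0) + X.max(axis=0)) / 2
    return kappa1, kappa0, mu0
```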

4.2 Full Conditionals

As we stated earlier, the parameters μ, Σ, η, and Ω are integrated out, and we need to sample only from the joint posterior of (G, y, γ, w). We have

    f(X, y | G, w, γ) = \int f(X, y | γ, G, w, μ, Σ, η, Ω) f(μ | Σ, G, γ) f(Σ | G, γ) f(η | Ω, γ) f(Ω | γ) dμ dΣ dη dΩ.    (7)

For completeness, we provide the full conditionals under the assumptions of equal and unequal covariances across clusters. However, in practice, unless there is prior knowledge that favors the former, we recommend working with heterogeneous covariances. After some algebra (see the Appendix), we get the following.

1. Under the assumption of homogeneous covariance,

    f(X, y | G, w, γ) = π^{-np/2} K_{(γ)} · |Q_{1(γ)}|^{(δ + p_γ - 1)/2} \left| Q_{1(γ)} + \sum_{k=1}^G S_{k(γ)} \right|^{-(n + δ + p_γ - 1)/2}
        × H_{(γ^c)} · |Q_{0(γ^c)}|^{(δ + p - p_γ - 1)/2} |Q_{0(γ^c)} + S_{0(γ^c)}|^{-(n + δ + p - p_γ - 1)/2},    (8)

where

    K_{(γ)} = \left[ \prod_{k=1}^G w_k^{n_k} (h_1 n_k + 1)^{-p_γ/2} \right] × \prod_{j=1}^{p_γ} \frac{Γ((n + δ + p_γ - j)/2)}{Γ((δ + p_γ - j)/2)},

    H_{(γ^c)} = (h_0 n + 1)^{-(p - p_γ)/2} × \prod_{j=1}^{p - p_γ} \frac{Γ((n + δ + p - p_γ - j)/2)}{Γ((δ + p - p_γ - j)/2)},

    S_{k(γ)} = \sum_{x_{i(γ)} ∈ C_k} (x_{i(γ)} - \bar{x}_{k(γ)})(x_{i(γ)} - \bar{x}_{k(γ)})^T + \frac{n_k}{h_1 n_k + 1} (μ_{0(γ)} - \bar{x}_{k(γ)})(μ_{0(γ)} - \bar{x}_{k(γ)})^T,

and

    S_{0(γ^c)} = \sum_{i=1}^n (x_{i(γ^c)} - \bar{x}_{(γ^c)})(x_{i(γ^c)} - \bar{x}_{(γ^c)})^T + \frac{n}{h_0 n + 1} (μ_{0(γ^c)} - \bar{x}_{(γ^c)})(μ_{0(γ^c)} - \bar{x}_{(γ^c)})^T.

Here \bar{x}_{k(γ)} is the sample mean of cluster k, and \bar{x}_{(γ^c)} corresponds to the sample means of the nondiscriminating variables. As h_1 → ∞, S_{k(γ)} reduces to the within-group sum of squares for the kth cluster, and \sum_{k=1}^G S_{k(γ)} becomes the total within-group sum of squares.

2. Assuming heterogeneous covariances across clusters,

    f(X, y | G, w, γ) = π^{-np/2} \prod_{k=1}^G \left[ K_{k(γ)} · |Q_{1(γ)}|^{(δ + p_γ - 1)/2} |Q_{1(γ)} + S_{k(γ)}|^{-(n_k + δ + p_γ - 1)/2} \right]
        × H_{(γ^c)} · |Q_{0(γ^c)}|^{(δ + p - p_γ - 1)/2} |Q_{0(γ^c)} + S_{0(γ^c)}|^{-(n + δ + p - p_γ - 1)/2},    (9)

where K_{k(γ)} = w_k^{n_k} (h_1 n_k + 1)^{-p_γ/2} \prod_{j=1}^{p_γ} Γ((n_k + δ + p_γ - j)/2) / Γ((δ + p_γ - j)/2).

The full conditionals for the model parameters are then given by

    f(y | G, w, γ, X) ∝ f(X, y | G, w, γ),    (10)

    f(γ | G, w, y, X) ∝ f(X, y | G, w, γ) p(γ | G),    (11)

and

    w | G, γ, y, X ~ Dirichlet(α + n_1, ..., α + n_G).    (12)
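A minimal numerical sketch of the heterogeneous-covariance marginal likelihood (9) is given below; the function and argument names are ours, and it assumes Q_1 = κ_1 I and Q_0 = κ_0 I as in Section 4.1.

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_likelihood(X, y, w, gamma, mu0, h0, h1, kappa0, kappa1, delta=3):
    """Log of f(X, y | G, w, gamma) under heterogeneous covariances, eq. (9).

    X: (n, p) data; y: labels in 1..G; gamma: binary p-vector; mu0: prior mean.
    """
    n, p = X.shape
    G = len(w)
    sel = np.asarray(gamma, dtype=bool)
    p_g = int(sel.sum())
    logdet = lambda A: np.linalg.slogdet(A)[1]
    out = -0.5 * n * p * np.log(np.pi)

    # Discriminating variables: one factor per cluster.
    Q1 = kappa1 * np.eye(p_g)
    for k in range(1, G + 1):
        Xk = X[y == k][:, sel]
        nk = Xk.shape[0]
        xbar = Xk.mean(axis=0) if nk > 0 else np.zeros(p_g)
        Sk = (Xk - xbar).T @ (Xk - xbar) \
             + nk / (h1 * nk + 1) * np.outer(mu0[sel] - xbar, mu0[sel] - xbar)
        out += nk * np.log(w[k - 1]) - 0.5 * p_g * np.log(h1 * nk + 1)
        j = np.arange(1, p_g + 1)
        out += gammaln((nk + delta + p_g - j) / 2).sum() - gammaln((delta + p_g - j) / 2).sum()
        out += 0.5 * (delta + p_g - 1) * logdet(Q1) \
             - 0.5 * (nk + delta + p_g - 1) * logdet(Q1 + Sk)

    # Nondiscriminating variables: a single multivariate normal component.
    Q0 = kappa0 * np.eye(p - p_g)
    X0 = X[:, ~sel]
    xbar0 = X0.mean(axis=0)
    S0 = (X0 - xbar0).T @ (X0 - xbar0) \
         + n / (h0 * n + 1) * np.outer(mu0[~sel] - xbar0, mu0[~sel] - xbar0)
    out += -0.5 * (p - p_g) * np.log(h0 * n + 1)
    j = np.arange(1, p - p_g + 1)
    out += gammaln((n + delta + p - p_g - j) / 2).sum() - gammaln((delta + p - p_g - j) / 2).sum()
    out += 0.5 * (delta + p - p_g - 1) * logdet(Q0) \
         - 0.5 * (n + delta + p - p_g - 1) * logdet(Q0 + S0)
    return out
```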

5. MODEL FITTING

Posterior samples for the parameters of interest are obtained using Metropolis moves and RJMCMC embedded within a Gibbs sampler. Our MCMC procedure comprises the following five steps:

1. Update γ from its full conditional in (11).
2. Update w from its full conditional in (12).
3. Update y from its full conditional in (10).
4. Split one mixture component into two, or merge two into one.
5. Birth or death of an empty component.

The birth/death moves specifically deal with creating/deleting empty components and do not involve reallocation of the observations. We could have extended these to include nonempty components and removed the split/merge moves. However, as described by Richardson and Green (1997), including both steps provides more efficient moves and improves convergence.

We would like to point out that, as we iterate through these steps, the cluster structure evolves with the choice of variables. This makes the problem of variable selection in the context of clustering much more complicated than in classification or regression analysis. In classification, the group structure of the observations is prespecified in the training data and guides the selection; here, however, the analysis is fully data-based. In addition, in the context of clustering, including unnecessary variables can have severe implications, because it may obscure the true grouping of the observations.

5.1 Variable Selection Update

In step 1, we update γ using a Metropolis search, as suggested for model selection by Madigan and York (1995) and applied in regression by Brown et al. (1998a), among others, and recently by Sha et al. (2004) in multinomial probit models for classification. In this approach a new candidate γ^new is generated by randomly choosing one of the following two transition moves (a sketch of the resulting update is given after the acceptance rule below):

1. Add/delete: Randomly choose one of the p indices in γ^old and change its value, from 0 to 1 or from 1 to 0, to obtain γ^new.
2. Swap: Choose independently and at random a 0 and a 1 in γ^old, and switch their values to obtain γ^new.

The new candidate γ^new is accepted with probability

    min\left\{ 1, \frac{f(γ^new | X, y, w, G)}{f(γ^old | X, y, w, G)} \right\}.
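A minimal sketch of this Metropolis update, assuming a helper `log_post(gamma)` that returns log f(γ | X, y, w, G) up to a constant (e.g., the marginal likelihood sketched above plus the Bernoulli prior); the equal move-choice probability is our assumption:

```python
import numpy as np

def update_gamma(gamma, log_post, rng):
    """One Metropolis step on gamma (Section 5.1): add/delete or swap proposal."""
    proposal = gamma.copy()
    ones = np.flatnonzero(gamma == 1)
    zeros = np.flatnonzero(gamma == 0)
    if rng.random() < 0.5 or len(ones) == 0 or len(zeros) == 0:
        # Add/delete: flip one randomly chosen entry.
        j = rng.integers(len(gamma))
        proposal[j] = 1 - proposal[j]
    else:
        # Swap: exchange a randomly chosen 1 with a randomly chosen 0.
        proposal[rng.choice(ones)] = 0
        proposal[rng.choice(zeros)] = 1
    if np.log(rng.random()) < log_post(proposal) - log_post(gamma):
        return proposal
    return gamma
```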

5.2 Update of Weights and Cluster Allocation

In step 2, we use a Gibbs move to sample w from its full conditional in (12). This can be done by drawing independent gamma random variables with common scale and with shape parameters α + n_1, ..., α + n_G, then scaling them to sum to 1.
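For instance, the Dirichlet draw can be implemented via normalized gamma variates (a short sketch; names are ours):

```python
import numpy as np

def update_weights(counts, alpha, rng):
    """Gibbs draw of w | G, y ~ Dirichlet(alpha + n_1, ..., alpha + n_G);
    `counts` holds the cluster sizes n_k."""
    g = rng.gamma(shape=alpha + np.asarray(counts, dtype=float), scale=1.0)
    return g / g.sum()
```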

In step 3, we update the allocation vector y one element at a time via a sub-Gibbs sampling strategy. The full conditional probability that the ith observation is in the kth cluster is given by

    f(y_i = k | X, y_{(-i)}, γ, w, G) ∝ f(X, y_i = k, y_{(-i)} | G, w, γ),    (13)

where y_{(-i)} is the cluster assignment for all observations except the ith one.
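A naive sketch of this sub-Gibbs update, recomputing the marginalized likelihood for each candidate label (in an efficient implementation one would update the statistics S_{k(γ)} incrementally; `log_lik(y)` is assumed to return log f(X, y | G, w, γ)):

```python
import numpy as np

def update_allocations(y, G, log_lik, rng):
    """Sub-Gibbs update of the allocation vector y, one element at a time (eq. 13)."""
    for i in range(len(y)):
        logp = np.empty(G)
        for k in range(1, G + 1):
            y_try = y.copy()
            y_try[i] = k
            logp[k - 1] = log_lik(y_try)
        probs = np.exp(logp - logp.max())
        probs /= probs.sum()
        y[i] = rng.choice(np.arange(1, G + 1), p=probs)
    return y
```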

5.3 Reversible-Jump MCMC

The last two steps, the split/merge and birth/death moves, cause the creation or deletion of new components and consequently require a sampler that jumps between different dimensional spaces. A popular method for defining such a sampler is RJMCMC (Green 1995; Richardson and Green 1997), which extends the Metropolis–Hastings algorithm to general state spaces. Suppose that a move of type m, from a countable family of move types, is proposed from ψ to a higher-dimensional space ψ'. This will very often be implemented by drawing a vector of continuous random variables u, independent of ψ, and setting ψ' to be a deterministic and invertible function of ψ and u. The reverse of this move (from ψ' to ψ) can be accomplished by using the inverse transformation, so that in this direction the proposal is deterministic. The move is then accepted with probability

    min\left\{ 1, \frac{f(ψ' | X) r_m(ψ')}{f(ψ | X) r_m(ψ) q(u)} \left| \frac{∂ψ'}{∂(ψ, u)} \right| \right\},    (14)

where r_m(ψ) is the probability of choosing move type m when in state ψ, and q(u) is the density function of u. The final term in the foregoing ratio is a Jacobian arising from the change of variable from (ψ, u) to ψ'.

5.3.1 Split and Merge Moves. In step 4, we make a random choice between attempting to split a nonempty cluster or combine two clusters, with probabilities b_k = .5 and d_k = 1 - b_k (k = 2, ..., G_max - 1, and b_{G_max} = 0).

The split move begins by randomly selecting a nonempty cluster l and dividing it into two components, l_1 and l_2, with weights w_{l_1} = w_l u and w_{l_2} = w_l(1 - u), where u is a beta(2, 2) random variable. The parameters ψ = (G, w^old, γ, y^old) are updated to ψ' = (G + 1, w^new, γ, y^new). The acceptance probability for this move is given by min(1, A), where

    A = \frac{f(G + 1, w^new, γ, y^new | X)}{f(G, w^old, γ, y^old | X)} × \frac{q(ψ | ψ')}{q(ψ' | ψ) × f(u)} × \left| \frac{∂(w_{l_1}, w_{l_2})}{∂(w_l, u)} \right|

      = \frac{f(X, y^new | w^new, γ, G + 1) × f(w^new | G + 1) × f(G + 1)}{f(X, y^old | w^old, γ, G) × f(w^old | G) × f(G)} × \frac{d_{G+1} × P_chos}{b_G × G_1^{-1} × P_alloc × f(u)} × w_l,    (15)

    \frac{f(w^new | G + 1)}{f(w^old | G)} = \frac{w_{l_1}^{α-1} w_{l_2}^{α-1}}{w_l^{α-1} B(α, Gα)},

and

    \frac{f(G + 1)}{f(G)} = 1 if G ~ uniform[2, G_max];  = λ/(G + 1) if G ~ truncated Poisson(λ).

Here G_1 is the number of nonempty components before the split, and P_alloc is the probability that the particular allocation is made, which is set to u^{n_{l_1}}(1 - u)^{n_{l_2}}, where n_{l_1} and n_{l_2} are the numbers of observations assigned to l_1 and l_2. Note that Richardson and Green (1997) calculated P_alloc using the full conditionals of the allocation variables, P(y_i = k | rest). In our case, because the component parameters μ and Σ are integrated out, the allocation variables are no longer independent, and calculating these full conditionals is computationally prohibitive. The random allocation approach that we took is substantially faster, even with the longer chains that need to be run to compensate for the low acceptance rate of proposed moves. P_chos is the probability of selecting two components for the reverse move. Obviously, we want to combine clusters that are closest to one another. However, choosing an optimal similarity metric in the multivariate setting is not straightforward. We define closeness in terms of the Euclidean distance between cluster means and consider three possible ways of splitting a cluster:

a. The split results in two mutually adjacent components ⇒ P_chos = (G_1^*/G^*) × (2/G_1^*).
b. The split results in components that are not mutually adjacent, that is, l_1 has l_2 as its closest component, but there is another cluster that is closer to l_2 than l_1 ⇒ P_chos = (G_1^*/G^*) × (1/G_1^*).
c. One of the resulting components is empty ⇒ P_chos = (G_0^*/G^*) × (1/G_0^*) × (1/G_1^*).

Here G^* is the total number of clusters after the split, and G_0^* and G_1^* are the numbers of empty and nonempty components after the split. Thus splits that result in mutually close components are assigned a larger probability, and those that yield empty clusters have lower probability.
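The proposal part of the split move can be sketched as follows (labeling conventions, names, and assigning the new component the label G + 1 are our assumptions; the acceptance ratio (15) still has to be evaluated separately):

```python
import numpy as np

def propose_split(w, y, rng):
    """Propose splitting one nonempty cluster l into l1 = l and l2 = G + 1."""
    G = len(w)
    nonempty = [k for k in range(1, G + 1) if np.any(y == k)]
    l = rng.choice(nonempty)                  # cluster to split
    u = rng.beta(2, 2)                        # weight-splitting variable
    w_new = np.append(w, w[l - 1] * (1 - u))  # new component G+1 gets w_l (1 - u)
    w_new[l - 1] = w[l - 1] * u               # component l keeps w_l * u
    # Randomly allocate the members of cluster l between l and G+1;
    # this gives P_alloc = u^{n_l1} (1 - u)^{n_l2}.
    y_new = y.copy()
    members = np.flatnonzero(y == l)
    move = members[rng.random(len(members)) > u]
    y_new[move] = G + 1
    return w_new, y_new, u, l
```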

The merge move begins by randomly choosing two components, l_1 and l_2, to be combined into a single cluster l. The updated parameters are ψ = (G, w^old, γ, y^old) → ψ' = (G - 1, w^new, γ, y^new), where w_l = w_{l_1} + w_{l_2}. The acceptance probability for this move is given by min(1, A), where

    A = \frac{f(X, y^new | γ, w^new, G - 1) × f(w^new | G - 1) × f(G - 1)}{f(X, y^old | γ, w^old, G) × f(w^old | G) × f(G)} × \frac{b_{G-1} × P_alloc × (G_1 - 1)^{-1} × f(u)}{d_G × P_chos} × (w_{l_1} + w_{l_2})^{-1},    (16)

where f(w^new | G - 1)/f(w^old | G) = w_l^{α-1} B(α, (G - 1)α) / (w_{l_1}^{α-1} w_{l_2}^{α-1}), and f(G - 1)/f(G) equals 1 or G/λ for the uniform and Poisson priors on G. G_1 is the number of nonempty clusters before the combining move, f(u) is the beta(2, 2) density evaluated at u = w_{l_1}/w_l, and P_alloc and P_chos are defined similarly to the split move, but now G_0^* and G_1^* in P_chos correspond to the numbers of empty and nonempty clusters after the merge.

5.3.2 Birth and Death Moves. We first make a random choice between a birth move and a death move, using the same probabilities b_k and d_k = 1 - b_k as earlier. For a birth move, the updated parameters are ψ = (G, w^old, γ, y) → ψ' = (G + 1, w^new, γ, y). The weight for the proposed new component is drawn using w* ~ beta(1, G), and the existing weights are rescaled to w'_k = w_k(1 - w*), k = 1, ..., G (where k indexes the component labels), so that all of the weights sum to 1. Let G_0 be the number of empty components before the birth. The acceptance probability for a birth is min(1, A), where

    A = \frac{f(X, y | w^new, γ, G + 1) × f(w^new | G + 1) × f(G + 1)}{f(X, y | w^old, γ, G) × f(w^old | G) × f(G)} × \frac{d_{G+1} × (G_0 + 1)^{-1}}{b_G × f(w*)} × (1 - w*)^{G-1},    (17)

f(w^new | G + 1)/f(w^old | G) = w*^{α-1}(1 - w*)^{G(α-1)} / B(α, Gα), and (1 - w*)^{G-1} is the Jacobian of the transformation (w^old, w*) → w^new.

For the death move, an empty component is chosen at random and deleted. Let w_l be the weight of the deleted component. The remaining weights are rescaled to sum to 1 (w'_k = w_k/(1 - w_l)), and ψ = (G, w^old, γ, y) → ψ' = (G - 1, w^new, γ, y). The acceptance probability for this move is min(1, A), where

    A = \frac{f(X, y | w^new, γ, G - 1) × f(w^new | G - 1) × f(G - 1)}{f(X, y | w^old, γ, G) × f(w^old | G) × f(G)} × \frac{b_{G-1} × f(w_l)}{d_G × G_0^{-1}} × (1 - w_l)^{-(G-2)},    (18)

f(w^new | G - 1)/f(w^old | G) = B(α, (G - 1)α) / [w_l^{α-1}(1 - w_l)^{(G-1)(α-1)}], G_0 is the number of empty components before the death move, and f(w_l) is the beta(1, G) density.

6. POSTERIOR INFERENCE

We draw inference on the sample allocations conditional on G. Thus we first need to compute an estimate for this parameter, which we take to be the value most frequently visited by the MCMC sampler. In addition, we need to address the label-switching problem.

6.1 Label Switching

In finite mixture models, an identifiability problem arises from the invariance of the likelihood under permutation of the component labels. In the Bayesian paradigm, this leads to symmetric and multimodal posterior distributions, with up to G! copies of each "genuine" mode, complicating inference on the parameters. In particular, it is not possible to form ergodic averages over the MCMC samples. Traditional approaches to this problem impose identifiability constraints on the parameters, for instance w_1 < ··· < w_G. These constraints do not always solve the problem, however. Recently, Stephens (2000b) proposed a relabeling algorithm that takes a decision-theoretic approach. The procedure defines an appropriate loss function and postprocesses the MCMC output to minimize the posterior expected loss.

Let P(ξ) = [p_{ik}(ξ)] be the matrix of allocation probabilities,

    p_{ik}(ξ) = Pr(y_i = k | X, γ, w, θ) = \frac{w_k f(x_{i(γ)} | θ_{k(γ)})}{\sum_j w_j f(x_{i(γ)} | θ_{j(γ)})}.    (19)

In the context of clustering, the loss function can be defined on the cluster labels y,

    L_0(y, ξ) = -\sum_{i=1}^n \log p_{i y_i}(ξ).    (20)

Let ξ^(1), ..., ξ^(M) be the sampled parameters, and let ν_1, ..., ν_M be the permutations applied to them. The parameters ξ^(t) correspond to the component parameters w^(t), μ^(t), and Σ^(t) sampled at iteration t. However, in our methodology the component mean and covariance parameters are integrated out, and we do not draw their posterior samples. Therefore, to calculate the matrix of allocation probabilities in (19), we estimate these parameters and set ξ^(t) = (w_1^(t), ..., w_G^(t), \bar{x}_{1(γ)}^(t), ..., \bar{x}_{G(γ)}^(t), S_{1(γ)}^(t), ..., S_{G(γ)}^(t)), where \bar{x}_{k(γ)}^(t) and S_{k(γ)}^(t) are the estimates of cluster k's mean and covariance based on the model visited at iteration t.

The relabeling algorithm proceeds by selecting initial values for the ν_t's, which we take to be the identity permutation, then iterating the following two steps until a fixed point is reached (a sketch follows the list):

a. Choose \hat{y} to minimize \sum_{t=1}^M L_0(\hat{y}, ν_t(ξ^(t))).
b. For t = 1, ..., M, choose ν_t to minimize L_0(\hat{y}, ν_t(ξ^(t))).
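A brute-force sketch of this fixed-point iteration, operating on the stored allocation-probability matrices P(ξ^(t)); all names are ours, and enumerating the G! column permutations is practical only for small G:

```python
import numpy as np
from itertools import permutations

def relabel(P_list, max_iter=50):
    """Relabeling by alternating steps (a) and (b) to minimize the loss (20).

    P_list: length-M list of (n, G) allocation-probability matrices.
    Returns the permutation applied to each draw and the reference labels.
    """
    M = len(P_list)
    n, G = P_list[0].shape
    nu = [np.arange(G) for _ in range(M)]            # identity permutations to start
    logP = [np.log(P + 1e-300) for P in P_list]
    y_hat = None
    for _ in range(max_iter):
        # Step a: reference labels minimizing the summed loss under current nu.
        total = sum(logP[t][:, nu[t]] for t in range(M))
        y_hat = total.argmax(axis=1)
        # Step b: for each draw, the permutation minimizing its own loss.
        new_nu = []
        for t in range(M):
            best = max(permutations(range(G)),
                       key=lambda perm: logP[t][np.arange(n), np.take(perm, y_hat)].sum())
            new_nu.append(np.array(best))
        if all(np.array_equal(a, b) for a, b in zip(nu, new_nu)):
            break
        nu = new_nu
    return nu, y_hat
```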

6.2 Posterior Densities and Posterior Estimates

Once the label switching is taken care of, the MCMC samples can be used to draw posterior inference. Of particular interest are the allocation vector y and the variable selection vector γ.

We compute the marginal posterior probability that sample i is allocated to cluster k, y_i = k, as

    p(y_i = k | X, G) = \int p(y_i = k, y_{(-i)}, γ, w | X, G) dy_{(-i)} dγ dw
        ∝ \int p(X, y_i = k, y_{(-i)}, γ, w | G) dy_{(-i)} dγ dw
        = \int p(X, y_i = k, y_{(-i)} | G, γ, w) p(γ | G) p(w | G) dy_{(-i)} dγ dw
        ≈ \sum_{t=1}^M p(X, y_i^(t) = k, y_{(-i)}^(t) | G, γ^(t), w^(t)) p(γ^(t) | G) p(w^(t) | G),    (21)

where y_{(-i)}^(t) is the vector y^(t) at the tth MCMC iteration without the ith element. The posterior allocation of sample i can then be estimated by the mode of its marginal posterior density,

    \hat{y}_i = arg max_{1 ≤ k ≤ G} p(y_i = k | X, G).    (22)

Similarly, the marginal posterior for γ_j = 1 can be computed as

    p(γ_j = 1 | X, G) = \int p(γ_j = 1, γ_{(-j)}, w, y | X, G) dγ_{(-j)} dy dw
        ∝ \int p(X, y, γ_j = 1, γ_{(-j)}, w | G) dγ_{(-j)} dy dw
        = \int p(X, y | G, γ_j = 1, γ_{(-j)}, w) p(γ | G) p(w | G) dγ_{(-j)} dy dw
        ≈ \sum_{t=1}^M p(X, y^(t) | G, γ_j^(t) = 1, γ_{(-j)}^(t), w^(t)) p(γ_j = 1, γ_{(-j)}^(t) | G) p(w^(t) | G),    (23)

where γ_{(-j)}^(t) is the vector γ^(t) at the tth iteration without the jth element. The best discriminating variables can then be identified as those with largest marginal posterior, p(γ_j = 1 | X, G) > a, where a is chosen arbitrarily,

    \hat{γ}_j = I_{\{p(γ_j = 1 | X, G) > a\}}.    (24)
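A sketch of how the marginal posteriors (21)-(24) can be approximated from stored draws; using a single vector of log scores as relative weights is our simplification of the weighting that appears in (21) and (23) (setting the weights equal instead yields plain visit frequencies):

```python
import numpy as np

def posterior_summaries(gamma_draws, y_draws, log_scores, threshold=0.5):
    """Approximate marginal inclusion probabilities, selected variables,
    allocation probabilities, and modal allocations from M post-burn-in draws.

    gamma_draws: (M, p) binary; y_draws: (M, n) labels in 1..G;
    log_scores: (M,) log joint scores used as relative weights.
    """
    w = np.exp(log_scores - log_scores.max())
    w /= w.sum()
    # Eq. (23)-(24): marginal inclusion probabilities and thresholded selection.
    p_gamma = w @ gamma_draws
    selected = p_gamma > threshold
    # Eq. (21)-(22): marginal allocation probabilities and modal assignment.
    G = int(y_draws.max())
    p_alloc = np.stack([w @ (y_draws == k) for k in range(1, G + 1)], axis=1)
    y_hat = p_alloc.argmax(axis=1) + 1
    return p_gamma, selected, p_alloc, y_hat
```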

An alternative variable selection can be performed by considering the vector γ with largest posterior probability among all visited vectors,

    γ* = arg max_{1 ≤ t ≤ M} p(γ^(t) | X, G, \hat{w}, \hat{y}),    (25)

where \hat{y} is the vector of sample allocations estimated via (22) and \hat{w} is given by \hat{w} = (1/M) \sum_{t=1}^M w^(t). The estimate given in (25) considers the joint density of γ, rather than the marginal distributions of the individual elements as in (24). In the same spirit, an estimate of the joint allocation of the samples can be obtained as the configuration that yields the largest posterior probability,

    y* = arg max_{1 ≤ t ≤ M} p(y^(t) | X, G, \hat{w}, \hat{γ}),    (26)

where \hat{γ} is the estimate in (24).

6.3 Class Prediction

The MCMC output can also be used to predict the class membership of future observations X^f. The predictive density is given by

    p(y^f = k | X^f, X, G) = \int p(y^f = k, y, w, γ | X^f, X, G) dy dγ dw
        ∝ \int p(X^f, X, y^f = k, y | G, w, γ) p(γ | G) p(w | G) dy dγ dw
        ≈ \sum_{t=1}^M p(X^f, X, y^f = k, y^(t) | γ^(t), w^(t), G) p(γ^(t) | G) p(w^(t) | G),    (27)

and the observations are allocated according to \hat{y}^f,

    \hat{y}^f = arg max_{1 ≤ k ≤ G} p(y^f = k | X^f, X, G).    (28)

7. PERFORMANCE OF OUR METHODOLOGY

In this section we investigate the performance of our methodology using three datasets. Because the Bayesian clustering problem with an unknown number of components via RJMCMC has been considered only in the univariate setting (Richardson and Green 1997), we first examine the efficiency of our algorithm in the multivariate setting without variable selection. For this we use the benchmark iris data (Anderson 1935). We then explore the performance of our methodology for simultaneous clustering and variable selection using a series of simulated high-dimensional datasets. We also analyze these data using the COSA algorithm of Friedman and Meulman (2003). Finally, we illustrate an application with DNA microarray data from an endometrial cancer study.

7.1 Iris Data

This dataset consists of four covariates (petal width, petal height, sepal width, and sepal height) measured on 50 flowers from each of three species: Iris setosa, Iris versicolor, and Iris virginica. This dataset has been analyzed by several authors (see, e.g., Murtagh and Hernández-Pajares 1995) and can be accessed in S-PLUS as a three-dimensional array.

We rescaled each variable by its range and specified the prior distributions as described in Section 4.1. We took δ = 3, α = 1, and h_1 = 100. We set G_max = 10 and κ_1 = 0.3, a value commensurate with the variability in the data. We considered different prior distributions for G: truncated Poisson with λ = 5 and λ = 10, and a discrete uniform on [1, G_max]. We conducted the analysis assuming both homogeneous and heterogeneous covariances across clusters. In all cases we used a burn-in of 10,000 and a main run of 40,000 iterations for inference.

Let us first focus on the results using a truncated Poisson(λ = 5) prior for G. Figure 1 displays the trace plots for the number of visited components under the assumptions of heterogeneous and homogeneous covariances. Table 1, column 1, gives the corresponding posterior probabilities p(G = k|X). Table 2 shows the posterior estimates for the sample allocations, computed conditional on the most probable G according to (22). Under the assumption of heterogeneous covariances across clusters, there is strong support for G = 3. Two of the clusters successfully isolate all of the setosa and virginica plants, and the third cluster contains all of the versicolor, except for five identified as virginica. These results are satisfactory, because the versicolor and virginica species are known to have some overlap, whereas the setosa is rather well separated. Under the assumption of constant covariance, there is strong support for four clusters. Again, one cluster contains exclusively all of the setosa, and all of the versicolor are grouped together, except for one.

Figure 1. Iris Data: Trace Plots of the Number of Clusters G. (a) Heterogeneous covariance; (b) homogeneous covariance.


Table 1. Iris Data: Posterior Distribution of G

                Poisson(λ = 5)          Poisson(λ = 10)         Uniform[1, 10]
 k        Different Σk   Equal Σ   Different Σk   Equal Σ   Different Σk   Equal Σ
 3           .6466         0          .5019         0          .8548         0
 4           .3124        .7413       .4291        .4384       .1417        .6504
 5           .0403        .3124       .0545        .2335       .0035        .3446
 6           .0007        .0418       .0138        .2786       0            .0050
 7           0             0          .0007        .0484       0            0
 8           0             0          0            .0011       0            0

Two virginica flowers are assigned to the versicolor group, and the remaining are divided into two new components. Figure 2 shows the marginal posterior probabilities p(y_i = k | X, G = 4), computed using equation (21). We note that some of the flowers among the 12 virginica allocated to cluster IV had a close call; for instance, observation 104 had marginal posteriors p(y_104 = 3 | X, G) = .4169 and p(y_104 = 4 | X, G) = .5823.

We examined the sensitivity of the results to the choice of the prior distribution on G and found them to be quite robust. The posterior probabilities p(G = k|X), assuming G ~ Poisson(λ = 10) and G ~ uniform[1, G_max], are given in Table 1. Under both priors we obtained sample allocations that are identical to those in Table 2. We also found little sensitivity to the choice of the hyperparameter h_1, with values between 10 and 1,000 giving satisfactory results. Smaller values of h_1 tended to favor slightly more components, but with some clusters containing very few observations. For example, for h_1 = 10, under the assumption of equal covariance across clusters, Table 3 shows that the MCMC sampler gives stronger support for G = 6. However, one of the clusters contains only one observation and another contains only four observations. Figure 3 displays the marginal posterior probabilities for the allocation of these five samples, p(y_i = k | X, G); there is little support for assigning them to separate clusters.

We use this example to also illustrate the label-switching phenomenon described in Section 6.1. Figure 4 gives the trace plots of the component weights under the assumption of homogeneous covariance, before and after processing of the MCMC samples. We can easily identify two label-switching instances in the raw output of Figure 4(a). One of these occurred between the first and third components near iteration 1,000; the other occurred around iteration 2,400, between components 2, 3, and 4. Figure 4(b) shows the trace plots after postprocessing of the MCMC output. We note that the label switching is eliminated and the multimodality in the estimates of the marginal posterior distributions of w_k is removed. It is now straightforward to obtain sensible estimates using the posterior means.

Table 2. Iris Data: Sample Allocation \hat{y}

                     Heterogeneous covariance      Homogeneous covariance
                        I      II     III           I     II     III    IV
 Iris setosa           50       0       0          50      0       0     0
 Iris versicolor        0      45       5           0     49       1     0
 Iris virginica         0       0      50           0      2      36    12

Figure 2. Iris Data, Poisson(λ = 5) With Equal Σ: Marginal Posterior Probabilities of Sample Allocations p(y_i = k | X, G = 4), i = 1, ..., 150, k = 1, ..., 4.

7.2 Simulated Data

We generated a dataset of 15 observations arising from four multivariate normal densities with 20 variables, such that

    x_{ij} ~ I_{1≤i≤4} N(μ_1, σ_1^2) + I_{5≤i≤7} N(μ_2, σ_2^2) + I_{8≤i≤13} N(μ_3, σ_3^2) + I_{14≤i≤15} N(μ_4, σ_4^2),
        i = 1, ..., 15,  j = 1, ..., 20,

where I_{·} is the indicator function, equal to 1 if the condition is met. Thus the first four samples arise from the same distribution, the next three come from the second group, the following six are from the third group, and the last two are from the fourth group. The component means were set to μ_1 = 5, μ_2 = 2, μ_3 = -3, and μ_4 = -6, and the component variances were set to σ_1^2 = 1.5, σ_2^2 = 1, σ_3^2 = .5, and σ_4^2 = 2. We drew an additional set of (p - 20) noisy variables, which do not distinguish between the clusters, from a standard normal density. We considered several values of p: p = 50, 100, 500, 1000. These data thus contain a small subset of discriminating variables together with a large set of nonclustering variables. The purpose is to study the ability of our method to uncover the cluster structure and identify the relevant covariates in the presence of an increasing number of noisy variables.
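For reference, a short sketch that reproduces this simulation design (the function name and seed are arbitrary):

```python
import numpy as np

def simulate_data(p=1000, seed=0):
    """Simulated dataset of Section 7.2: n = 15 observations, 20 discriminating
    variables from four normal components, p - 20 standard-normal noise variables,
    and a random column permutation as described in the paper."""
    rng = np.random.default_rng(seed)
    sizes = [4, 3, 6, 2]                       # observations 1-4, 5-7, 8-13, 14-15
    means = [5, 2, -3, -6]
    variances = [1.5, 1, 0.5, 2]
    blocks = [rng.normal(m, np.sqrt(v), size=(nk, 20))
              for nk, m, v in zip(sizes, means, variances)]
    X_disc = np.vstack(blocks)                 # (15, 20) discriminating variables
    X_noise = rng.normal(size=(15, p - 20))    # nonclustering variables
    X = np.hstack([X_disc, X_noise])
    X = X[:, rng.permutation(p)]               # disperse the predictors
    labels = np.repeat([1, 2, 3, 4], sizes)
    return X, labels
```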

We permuted the columns of the data matrix X to disperse the predictors.

Table 3. Iris Data: Sensitivity to Hyperparameter h_1. Results With h_1 = 10

Posterior distribution of G
 k             4       5       6       7       8       9       10
 p(G = k|X)   .0747   .1874   .3326   .2685   .1120   .0233   .0015

Sample allocations \hat{y}
                      I     II    III    IV     V    VI
 Iris setosa         50      0      0     0     0     0
 Iris versicolor      0     45      1     0     4     0
 Iris virginica       0      2     35    12     0     1


Figure 3. Iris Data, Sensitivity to h_1: Results for h_1 = 10. Marginal Posterior Probabilities for Samples Allocated to Clusters V and VI, p(y_i = k | X, G = 6), i = 1, ..., 5, k = 1, ..., 6.

For each dataset we took δ = 3, α = 1, h_1 = h_0 = 100, and G_max = n, and assumed unequal covariances across clusters. Results from the previous example have shown little sensitivity to the choice of prior on G; here we considered a truncated Poisson prior with λ = 5. As described in Section 4.1, some care is needed when choosing κ_1 and κ_0; their values need to be commensurate with the variability of the data. The amount of total variability in the different simulated datasets is of course widely different, and in each case these hyperparameters were chosen proportionally to the upper and lower deciles of the nonzero eigenvalues. For the prior on γ, we chose a Bernoulli distribution and set the expected number of included variables to 10. We used a starting model with one randomly selected variable and ran the MCMC chains for 100,000 iterations, with 40,000 sweeps as burn-in.

Here we do inference on both the sample allocations and the variable selection. We obtained satisfactory results for all datasets. In all cases we successfully recovered the cluster structure used to simulate the data and identified the 20 discriminating variables. We present the summary plots associated with the largest dataset, where p = 1000. Figure 5(a) shows the trace plot for the number of visited components, G, and Table 4 reports the marginal posterior probabilities p(G|X). There is strong support for G = 4 and 5, with a slightly larger probability for the former. Figure 6 shows the marginal posterior probabilities of the sample allocations, p(y_i = k | X, G = 4). We note that these match the group structure used to simulate the data. As for selection of the predictors, Figure 5(b) displays the number of variables selected at each MCMC iteration. In the first 30,000 iterations, before burn-in, the chain visited models with around 30–35 variables. After about 40,000 iterations, the chain stabilized to models with 15–20 covariates. Figure 7 displays the marginal posterior probabilities p(γ_j = 1 | X, G = 4). There were 17 variables with marginal probabilities greater than .9, all of which were in the set of 20 covariates simulated to effectively discriminate the four clusters. If we lower the threshold for inclusion to .4, then all 20 variables are selected. Conditional on the allocations obtained via (22), the γ vector with largest posterior probability (25) among all visited models contained 19 of the 20 discriminating variables.

We also analyzed these datasets using the COSA algorithm of Friedman and Meulman (2003). As we mentioned in Section 2, this procedure performs variable selection in conjunction with hierarchical clustering. We present the results for the analysis of the simulated data with p = 1000. Figure 8 shows the dendrograms of the clustering results based on the non-targeted COSA distance. We considered single, average, and complete linkage, but none of these methods was able to recover the true cluster structure. The average linkage dendrogram [Fig. 8(b)] delineates one of the clusters that was partially recovered (the true cluster contains observations 8–13). Figure 8(d) displays the 20 highest relevance values of the variables that lead to this clustering. The variables that define the true cluster structure are indexed as 1–20, and we note that 16 of them appear in this plot.

Figure 4. Iris Data: Trace Plots of Component Weights Before and After Removal of the Label Switching. (a) Raw output; (b) processed output.


Figure 5. Trace Plots for Simulated Data With p = 1000. (a) Number of clusters G; (b) number of included variables p_γ.

Table 4. Simulated Data With p = 1000: Posterior Distribution of G

 k     p(G = k|X)
 2       .0111
 3       .1447
 4       .3437
 5       .3375
 6       .1326
 7       .0216
 8       .0040
 9       .0015
 10      .0008
 11      .0005
 12      .0007
 13      .0012
 14      .0002

Figure 6. Simulated Data With p = 1000: Marginal Posterior Probabilities of Sample Allocations p(y_i = k | X, G = 4), i = 1, ..., 15, k = 1, ..., 4.

7.3 Application to Microarray Data

We now illustrate the methodology on microarray data from an endometrial cancer study. Endometrioid endometrial adenocarcinoma is a common gynecologic malignancy arising within the uterine lining. The disease usually occurs in postmenopausal women and is associated with the hormonal risk factor of protracted estrogen exposure unopposed by progestins. Despite its prevalence, the molecular mechanisms of its genesis are not completely known. DNA microarrays, with their ability to examine thousands of genes simultaneously, could be an efficient tool to isolate relevant genes and cluster tissues into different subtypes. This is especially important in cancer treatment, where different clinicopathologic groups are known to vary in their response to therapy, and genes identified to discriminate among the different subtypes may represent targets for therapeutic intervention and biomarkers for improved diagnosis.

Figure 7. Simulated Data With p = 1000: Marginal Posterior Probabilities for Inclusion of Variables p(γ_j = 1 | X, G = 4).


Figure 8. Analysis of Simulated Data With p = 1000 Using COSA. (a) Single linkage; (b) average linkage; (c) complete linkage; (d) relevance for delineated cluster.

Four normal endometrium samples and 10 endometrioid endometrial adenocarcinomas were collected from hysterectomy specimens. RNA samples were isolated from the 14 tissues and reverse-transcribed. The resulting cDNA targets were prepared following the manufacturer's instructions and hybridized to Affymetrix Hu6800 GeneChip arrays, which contain 7,070 probe sets. The scanned images were processed with the Affymetrix software, version 3.1. The average difference derived by the software was used to indicate transcript abundance, and a global normalization procedure was applied. The technology does not reliably quantify low expression levels, and it is subject to spot saturation at the high end. For this particular array, the limits of reliable detection were previously set to 20 and 16,000 (Tadesse, Ibrahim, and Mutter 2003). We used these same thresholds and removed probe sets with at least one unreliable reading from the analysis. This left us with p = 762 variables. We then log-transformed the expression readings to satisfy the assumption of normality, as suggested by the Box–Cox transformation. We also rescaled each variable by its range.

As described in Section 4.1, we specified the priors with the hyperparameters δ = 3, α = 1, and h_1 = h_0 = 100. We set κ_1 = 0.01 and κ_0 = 0.1 and assumed unequal covariances across clusters. We used a truncated Poisson(λ = 5) prior for G, with G_max = n. We took a Bernoulli prior for γ, with an expectation of 10 variables to be included in the model.

To avoid possible dependence of the results on the initial model, we ran four MCMC chains with widely different starting points: (1) γ_j set to 0 for all j's except one randomly selected; (2) γ_j set to 1 for 10 randomly selected j's; (3) γ_j set to 1 for 25 randomly selected j's; and (4) γ_j set to 1 for 50 randomly selected j's. For each of the MCMC chains we ran 100,000 iterations, with 40,000 sweeps taken as a burn-in.

We looked at both the allocation of tissues and the selection of genes whose expression best discriminates among the different groups. The MCMC sampler mixed steadily over the iterations. As shown in the histogram of Figure 9(a), the sampler visited mostly between two and five components, with stronger support for G = 3 clusters. Figure 9(b) shows the trace plot of the number of included variables for one of the chains, which visited mostly models with 25–35 variables. The other chains behaved similarly. Figure 10 gives the marginal posterior probabilities p(γ_j = 1 | X, G = 3) for each of the chains. Despite the very different starting points, the four chains visited similar regions and exhibit broadly similar marginal plots. To assess the concordance of the results across the four chains, we looked at the correlation coefficients between the frequencies for inclusion of genes. These are reported in Figure 11, along with the corresponding pairwise scatterplots.


Figure 9. MCMC Output for Endometrial Data. (a) Number of visited clusters G; (b) number of included variables p_γ.

We see that there is very good agreement across the different MCMC runs. The marginal probabilities, on the other hand, are not as concordant across the chains, because their derivation makes use of model posterior probabilities [see (23)], which are affected by the inclusion of correlated variables. This is a problem inherent to DNA microarray data, where multiple genes with similar expression profiles are often observed.

We pooled and relabeled the output of the four chains, normalized the relative posterior probabilities, and recomputed p(γ_j = 1 | X, G = 3). These marginal probabilities are displayed in Figure 12. There are 31 genes with posterior probability greater than .5. We also estimated the marginal posterior probabilities for the sample allocations, p(y_i = k | X, G), based on the 31 selected genes. These are shown in Figure 13. We successfully identified the four normal tissues. The results also seem to suggest that there are possibly two subtypes within the malignant tissues.

Figure 10. Endometrial Data: Marginal posterior probabilities for inclusion of variables, p(γj = 1|X, G = 3), for each of the four chains.

We also present the results from analyzing this dataset with the COSA algorithm. Figure 14 shows the resulting single linkage and average linkage dendrograms, along with the top 30 variables that distinguish the normal and tumor tissues. There is some overlap between the genes selected by COSA and those identified by our procedure. Among the sets of 30 "best" discriminating genes selected by the two methods, 10 are common to both.

8. DISCUSSION

We have proposed a method for simultaneously clustering high-dimensional data with an unknown number of components and selecting the variables that best discriminate the different groups. We successfully applied the methodology to various datasets. Our method is fully Bayesian, and we provided standard default recommendations for the choice of priors.

Figure 11. Endometrial Data: Concordance of results among the four chains. Pairwise scatterplots of frequencies for inclusion of genes.

Figure 12. Endometrial Data: Union of four chains. Marginal posterior probabilities p(γj = 1|X, G = 3).

Figure 13. Endometrial Data: Union of four chains. Marginal posterior probabilities of sample allocations p(yi = k|X, G = 3), i = 1, ..., 14; k = 1, ..., 3.

Figure 14. Analysis of Endometrial Data Using COSA: (a) single linkage; (b) average linkage; (c) variables leading to cluster 1; (d) variables leading to cluster 2.

Here we mention some possible extensions and directions for future research. We have drawn posterior inference conditional on a fixed number of components. A more attractive approach would incorporate the uncertainty on this parameter and combine the results over all of the different values of G visited by the MCMC sampler. To our knowledge, this is an open problem that requires further research. One promising approach involves considering the posterior probability of pairs of observations being allocated to the same cluster, regardless of the number of visited components. As long as there is no interest in estimating the component parameters, the problem of label switching would not be a concern. But this leads to $\binom{n}{2}$ probabilities, and it is not clear how to best summarize them. For example, Medvedovic and Sivaganesan (2002) used hierarchical clustering based on these pairwise probabilities as a similarity measure.
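A sketch of that idea, not part of the paper's implementation: estimate the pairwise co-clustering probabilities from the pooled allocation draws and hand one minus these probabilities to a standard hierarchical clustering routine as a dissimilarity.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

def coclustering_probabilities(y_draws):
    """y_draws: (M, n) matrix of sampled allocation vectors (any number of components)."""
    M, n = y_draws.shape
    P = np.zeros((n, n))
    for y in y_draws:
        P += (y[:, None] == y[None, :])   # 1 where the pair shares a cluster in this draw
    return P / M

def pairwise_similarity_clustering(y_draws, method="average"):
    P = coclustering_probabilities(y_draws)
    D = 1.0 - P                           # dissimilarity; label switching is irrelevant here
    np.fill_diagonal(D, 0.0)
    return linkage(squareform(D, checks=False), method=method)
```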

An interesting future avenue would be to investigate the performance of our method when covariance constraints are imposed, as was proposed by Banfield and Raftery (1993). Their parameterization allows one to incorporate different assumptions on the volume, orientation, and shape of the clusters.
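As a point of reference, these constraints are typically expressed through the eigendecomposition of each component covariance matrix (standard notation, not used elsewhere in this article):

$$
\Sigma_k = \lambda_k\, D_k\, A_k\, D_k^{T},
$$

where $\lambda_k$ controls the volume of the kth cluster, the orthogonal matrix $D_k$ its orientation, and the diagonal matrix $A_k$ its shape; forcing some or all of these factors to be common across clusters yields the constrained models considered by Banfield and Raftery (1993).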

Another possible extension is to use empirical Bayes estimates to elicit the hyperparameters. For example, conditional on a fixed number of clusters, a complete log-likelihood can be computed using all covariates, and a Monte Carlo EM approach developed for estimation. Alternatively, if prior information were available or subjective priors were preferable, then the prior setting could be modified accordingly. If interactions among predictors were of interest, then additional interaction terms could be included in the model, and the prior on γ could be modified to accommodate the additional terms.

APPENDIX: FULL CONDITIONALS

Here we give details on the derivation of the marginalized full conditionals under the assumptions of equal and unequal covariances across clusters.

Homogeneous Covariance: Σ1 = ··· = ΣG = Σ

$$
f(X, y \mid G, w, \gamma) = \int f(X, y \mid G, w, \mu, \Sigma, \eta, \Omega, \gamma)\, f(\mu \mid G, \Sigma, \gamma)\, f(\Sigma \mid G, \gamma)\, f(\eta \mid \Omega, \gamma)\, f(\Omega \mid \gamma)\, d\mu \, d\Sigma \, d\eta \, d\Omega .
$$

The integrand is the product of the normal likelihood contributions of the clustering variables (γ) and of the nondiscriminating variables (γc), the normal priors (6) on the μk(γ) and on η(γc), and the inverse-Wishart priors on Σ(γ) and Ω(γc). Completing the square in each μk(γ), whose factor is proportional to a normal density with mean (Σ_{xi(γ)∈Ck} xi(γ) + h1⁻¹μ0(γ))/(nk + h1⁻¹) and covariance (nk + h1⁻¹)⁻¹Σ(γ), doing the same for η(γc), and integrating these mean parameters out leaves

$$
\begin{aligned}
f(X, y \mid G, w, \gamma)
={}& \Biggl[\prod_{k=1}^{G} (h_1 n_k + 1)^{-p_\gamma/2} w_k^{n_k}\Biggr] \pi^{-n p_\gamma/2}\,
\bigl|Q_{1(\gamma)}\bigr|^{(\delta + p_\gamma - 1)/2}\,
2^{-p_\gamma(\delta + n + p_\gamma - 1)/2}\, \pi^{-p_\gamma(p_\gamma - 1)/4}
\Biggl(\prod_{j=1}^{p_\gamma} \Gamma\Bigl(\tfrac{\delta + p_\gamma - j}{2}\Bigr)\Biggr)^{-1} \\
&\times \int \bigl|\Sigma_{(\gamma)}\bigr|^{-(\delta + n + 2 p_\gamma)/2}
\exp\Biggl\{-\tfrac{1}{2}\operatorname{tr}\Biggl(\Sigma_{(\gamma)}^{-1}\Biggl[Q_{1(\gamma)} + \sum_{k=1}^{G} S_{k(\gamma)}\Biggr]\Biggr)\Biggr\}\, d\Sigma_{(\gamma)} \\
&\times (h_0 n + 1)^{-(p - p_\gamma)/2}\, \pi^{-n(p - p_\gamma)/2}\,
\bigl|Q_{0(\gamma^c)}\bigr|^{(\delta + p - p_\gamma - 1)/2}\,
2^{-(p - p_\gamma)(\delta + n + p - p_\gamma - 1)/2}\, \pi^{-(p - p_\gamma)(p - p_\gamma - 1)/4}
\Biggl(\prod_{j=1}^{p - p_\gamma} \Gamma\Bigl(\tfrac{\delta + p - p_\gamma - j}{2}\Bigr)\Biggr)^{-1} \\
&\times \int \bigl|\Omega_{(\gamma^c)}\bigr|^{-(\delta + n + 2(p - p_\gamma))/2}
\exp\Bigl\{-\tfrac{1}{2}\operatorname{tr}\bigl(\Omega_{(\gamma^c)}^{-1}\bigl[Q_{0(\gamma^c)} + S_{0(\gamma^c)}\bigr]\bigr)\Bigr\}\, d\Omega_{(\gamma^c)} .
\end{aligned}
$$

The two remaining integrals are inverse-Wishart normalizing constants, so that

$$
\begin{aligned}
f(X, y \mid G, w, \gamma)
={}& \pi^{-np/2} \Biggl[\prod_{k=1}^{G} (h_1 n_k + 1)^{-p_\gamma/2} w_k^{n_k}\Biggr] (h_0 n + 1)^{-(p - p_\gamma)/2} \\
&\times \prod_{j=1}^{p_\gamma} \frac{\Gamma\bigl((\delta + n + p_\gamma - j)/2\bigr)}{\Gamma\bigl((\delta + p_\gamma - j)/2\bigr)}
\times \prod_{j=1}^{p - p_\gamma} \frac{\Gamma\bigl((\delta + n + p - p_\gamma - j)/2\bigr)}{\Gamma\bigl((\delta + p - p_\gamma - j)/2\bigr)} \\
&\times \bigl|Q_{1(\gamma)}\bigr|^{(\delta + p_\gamma - 1)/2}\, \bigl|Q_{0(\gamma^c)}\bigr|^{(\delta + p - p_\gamma - 1)/2}
\times \Biggl|Q_{1(\gamma)} + \sum_{k=1}^{G} S_{k(\gamma)}\Biggr|^{-(\delta + n + p_\gamma - 1)/2}
\bigl|Q_{0(\gamma^c)} + S_{0(\gamma^c)}\bigr|^{-(\delta + n + p - p_\gamma - 1)/2},
\end{aligned}
$$

where S_k(γ) and S_0(γc) are given by

$$
\begin{aligned}
S_{k(\gamma)} &= \sum_{x_{i(\gamma)} \in C_k} x_{i(\gamma)} x_{i(\gamma)}^T + h_1^{-1} \mu_{0(\gamma)} \mu_{0(\gamma)}^T
- (n_k + h_1^{-1})^{-1} \Biggl(\sum_{x_{i(\gamma)} \in C_k} x_{i(\gamma)} + h_1^{-1} \mu_{0(\gamma)}\Biggr)
\Biggl(\sum_{x_{i(\gamma)} \in C_k} x_{i(\gamma)} + h_1^{-1} \mu_{0(\gamma)}\Biggr)^T \\
&= \sum_{x_{i(\gamma)} \in C_k} \bigl(x_{i(\gamma)} - \bar{x}_{k(\gamma)}\bigr)\bigl(x_{i(\gamma)} - \bar{x}_{k(\gamma)}\bigr)^T
+ \frac{n_k}{h_1 n_k + 1} \bigl(\mu_{0(\gamma)} - \bar{x}_{k(\gamma)}\bigr)\bigl(\mu_{0(\gamma)} - \bar{x}_{k(\gamma)}\bigr)^T
\end{aligned}
$$

and

$$
\begin{aligned}
S_{0(\gamma^c)} &= \sum_{i=1}^{n} x_{i(\gamma^c)} x_{i(\gamma^c)}^T + h_0^{-1} \mu_{0(\gamma^c)} \mu_{0(\gamma^c)}^T
- (n + h_0^{-1})^{-1} \Biggl(\sum_{i=1}^{n} x_{i(\gamma^c)} + h_0^{-1} \mu_{0(\gamma^c)}\Biggr)
\Biggl(\sum_{i=1}^{n} x_{i(\gamma^c)} + h_0^{-1} \mu_{0(\gamma^c)}\Biggr)^T \\
&= \sum_{i=1}^{n} \bigl(x_{i(\gamma^c)} - \bar{x}_{(\gamma^c)}\bigr)\bigl(x_{i(\gamma^c)} - \bar{x}_{(\gamma^c)}\bigr)^T
+ \frac{n}{h_0 n + 1} \bigl(\mu_{0(\gamma^c)} - \bar{x}_{(\gamma^c)}\bigr)\bigl(\mu_{0(\gamma^c)} - \bar{x}_{(\gamma^c)}\bigr)^T ,
\end{aligned}
$$

with $\bar{x}_{k(\gamma)} = \frac{1}{n_k} \sum_{x_{i(\gamma)} \in C_k} x_{i(\gamma)}$ and $\bar{x}_{(\gamma^c)} = \frac{1}{n} \sum_{i=1}^{n} x_{i(\gamma^c)}$.
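To make the closed form above concrete, the following sketch evaluates the homogeneous-covariance marginal log-likelihood with numpy/scipy. It assumes Q1 = (1/κ1)I and Q0 = (1/κ0)I restricted to the selected and nonselected coordinates, as in Section 4.1, and that cluster labels run from 0 to G − 1; the function and variable names are ours, not the paper's.

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_homogeneous(X, y, w, gamma, delta, h1, h0, kappa1, kappa0, mu0):
    """Marginalized log f(X, y | G, w, gamma) under a common covariance (Appendix).

    X: (n, p) data matrix; y: length-n integer labels 0..G-1; w: length-G weights;
    gamma: length-p 0/1 inclusion vector; mu0: length-p prior mean.
    """
    n, p = X.shape
    sel = gamma.astype(bool)
    p_g = int(sel.sum())

    def s_matrix(Z, mu0_sub, h):
        # S = sum (z_i - zbar)(z_i - zbar)^T + m/(h*m + 1) (mu0 - zbar)(mu0 - zbar)^T
        m = Z.shape[0]
        zbar = Z.mean(axis=0)
        C = Z - zbar
        d = mu0_sub - zbar
        return C.T @ C + (m / (h * m + 1.0)) * np.outer(d, d)

    def logdet(A):
        return np.linalg.slogdet(A)[1]

    out = -0.5 * n * p * np.log(np.pi)

    # Contribution of the selected (clustering) variables.
    if p_g > 0:
        Q1 = np.eye(p_g) / kappa1
        S_sum = np.zeros((p_g, p_g))
        for k in np.unique(y):
            Zk = X[y == k][:, sel]
            nk = Zk.shape[0]
            out += nk * np.log(w[k]) - 0.5 * p_g * np.log(h1 * nk + 1.0)
            S_sum += s_matrix(Zk, mu0[sel], h1)
        j = np.arange(1, p_g + 1)
        out += np.sum(gammaln((delta + n + p_g - j) / 2.0) - gammaln((delta + p_g - j) / 2.0))
        out += 0.5 * (delta + p_g - 1) * logdet(Q1)
        out -= 0.5 * (delta + n + p_g - 1) * logdet(Q1 + S_sum)

    # Contribution of the nonselected variables (single multivariate normal).
    p_c = p - p_g
    if p_c > 0:
        Q0 = np.eye(p_c) / kappa0
        S0 = s_matrix(X[:, ~sel], mu0[~sel], h0)
        out += -0.5 * p_c * np.log(h0 * n + 1.0)
        j = np.arange(1, p_c + 1)
        out += np.sum(gammaln((delta + n + p_c - j) / 2.0) - gammaln((delta + p_c - j) / 2.0))
        out += 0.5 * (delta + p_c - 1) * logdet(Q0)
        out -= 0.5 * (delta + n + p_c - 1) * logdet(Q0 + S0)

    return out
```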

Heterogeneous Covariances

$$
f(X, y \mid G, w, \gamma) = \int f(X, y \mid G, w, \mu, \Sigma, \eta, \Omega, \gamma)\, f(\mu \mid G, \Sigma, \gamma)\, f(\Sigma \mid G, \gamma)\, f(\eta \mid \Omega, \gamma)\, f(\Omega \mid \gamma)\, d\mu \, d\Sigma \, d\eta \, d\Omega .
$$

The factor corresponding to the nondiscriminating variables (γc) is the same as in the homogeneous case and integrates to

$$
(h_0 n + 1)^{-(p - p_\gamma)/2}\, \pi^{-n(p - p_\gamma)/2}
\prod_{j=1}^{p - p_\gamma} \frac{\Gamma\bigl((\delta + n + p - p_\gamma - j)/2\bigr)}{\Gamma\bigl((\delta + p - p_\gamma - j)/2\bigr)}
\,\bigl|Q_{0(\gamma^c)}\bigr|^{(\delta + p - p_\gamma - 1)/2}\,
\bigl|Q_{0(\gamma^c)} + S_{0(\gamma^c)}\bigr|^{-(\delta + n + p - p_\gamma - 1)/2} .
$$

For the clustering variables, completing the square in each μk(γ) and integrating it out, and then integrating each Σk(γ) against its inverse-Wishart prior, gives

$$
\begin{aligned}
f(X, y \mid G, w, \gamma)
={}& \pi^{-np/2} \prod_{k=1}^{G} \Biggl[(h_1 n_k + 1)^{-p_\gamma/2} w_k^{n_k}
\prod_{j=1}^{p_\gamma} \frac{\Gamma\bigl((\delta + n_k + p_\gamma - j)/2\bigr)}{\Gamma\bigl((\delta + p_\gamma - j)/2\bigr)}
\,\bigl|Q_{1(\gamma)}\bigr|^{(\delta + p_\gamma - 1)/2}\,
\bigl|Q_{1(\gamma)} + S_{k(\gamma)}\bigr|^{-(\delta + n_k + p_\gamma - 1)/2}\Biggr] \\
&\times (h_0 n + 1)^{-(p - p_\gamma)/2}
\prod_{j=1}^{p - p_\gamma} \frac{\Gamma\bigl((\delta + n + p - p_\gamma - j)/2\bigr)}{\Gamma\bigl((\delta + p - p_\gamma - j)/2\bigr)}
\,\bigl|Q_{0(\gamma^c)}\bigr|^{(\delta + p - p_\gamma - 1)/2}\,
\bigl|Q_{0(\gamma^c)} + S_{0(\gamma^c)}\bigr|^{-(\delta + n + p - p_\gamma - 1)/2} ,
\end{aligned}
$$

with S_k(γ) and S_0(γc) as defined above.
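A corresponding sketch for the heterogeneous case, with the same illustrative conventions as above; the only change is that each cluster contributes its own determinant and Gamma-ratio factors.

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_heterogeneous(X, y, w, gamma, delta, h1, h0, kappa1, kappa0, mu0):
    """Marginalized log f(X, y | G, w, gamma) with cluster-specific covariances (Appendix)."""
    n, p = X.shape
    sel = gamma.astype(bool)
    p_g = int(sel.sum())
    p_c = p - p_g

    def s_matrix(Z, mu0_sub, h):
        m, zbar = Z.shape[0], Z.mean(axis=0)
        C, d = Z - zbar, mu0_sub - zbar
        return C.T @ C + (m / (h * m + 1.0)) * np.outer(d, d)

    ld = lambda A: np.linalg.slogdet(A)[1]
    out = -0.5 * n * p * np.log(np.pi)

    if p_g > 0:
        Q1 = np.eye(p_g) / kappa1
        j = np.arange(1, p_g + 1)
        for k in np.unique(y):                      # assumes labels 0..G-1 matching w
            Zk = X[y == k][:, sel]
            nk = Zk.shape[0]
            out += nk * np.log(w[k]) - 0.5 * p_g * np.log(h1 * nk + 1.0)
            out += np.sum(gammaln((delta + nk + p_g - j) / 2.0) - gammaln((delta + p_g - j) / 2.0))
            out += 0.5 * (delta + p_g - 1) * ld(Q1)
            out -= 0.5 * (delta + nk + p_g - 1) * ld(Q1 + s_matrix(Zk, mu0[sel], h1))

    if p_c > 0:
        Q0 = np.eye(p_c) / kappa0
        S0 = s_matrix(X[:, ~sel], mu0[~sel], h0)
        j = np.arange(1, p_c + 1)
        out += -0.5 * p_c * np.log(h0 * n + 1.0)
        out += np.sum(gammaln((delta + n + p_c - j) / 2.0) - gammaln((delta + p_c - j) / 2.0))
        out += 0.5 * (delta + p_c - 1) * ld(Q0)
        out -= 0.5 * (delta + n + p_c - 1) * ld(Q0 + S0)

    return out
```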

[Received May 2003. Revised August 2004.]

REFERENCES

Anderson, E. (1935), "The Irises of the Gaspe Peninsula," Bulletin of the American Iris Society, 59, 2–5.

Banfield, J. D., and Raftery, A. E. (1993), "Model-Based Gaussian and Non-Gaussian Clustering," Biometrics, 49, 803–821.

Brown, P. J. (1993), Measurement, Regression and Calibration, Oxford, U.K.: Clarendon Press.

Brown, P. J., Vannucci, M., and Fearn, T. (1998a), "Bayesian Wavelength Selection in Multicomponent Analysis," Chemometrics, 12, 173–182.

Brown, P. J., Vannucci, M., and Fearn, T. (1998b), "Multivariate Bayesian Variable Selection and Prediction," Journal of the Royal Statistical Society, Ser. B, 60, 627–641.

Brusco, M. J., and Cradit, J. D. (2001), "A Variable Selection Heuristic for k-Means Clustering," Psychometrika, 66, 249–270.

Chang, W.-C. (1983), "On Using Principal Components Before Separating a Mixture of Two Multivariate Normal Distributions," Applied Statistics, 32, 267–275.

Diebolt, J., and Robert, C. P. (1994), "Estimation of Finite Mixture Distributions Through Bayesian Sampling," Journal of the Royal Statistical Society, Ser. B, 56, 363–375.

Fowlkes, E. B., Gnanadesikan, R., and Kettering, J. R. (1988), "Variable Selection in Clustering," Journal of Classification, 5, 205–228.

Friedman, J. H., and Meulman, J. J. (2003), "Clustering Objects on Subsets of Attributes," technical report, Stanford University, Dept. of Statistics and Stanford Linear Accelerator Center.

George, E. I., and McCulloch, R. E. (1997), "Approaches for Bayesian Variable Selection," Statistica Sinica, 7, 339–373.

Ghosh, D., and Chinnaiyan, A. M. (2002), "Mixture Modelling of Gene Expression Data From Microarray Experiments," Bioinformatics, 18, 275–286.

Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. (eds.) (1996), Markov Chain Monte Carlo in Practice, London: Chapman & Hall.

Gnanadesikan, R., Kettering, J. R., and Tao, S. L. (1995), "Weighting and Selection of Variables for Cluster Analysis," Journal of Classification, 12, 113–136.

Green, P. J. (1995), "Reversible-Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination," Biometrika, 82, 711–732.

Madigan, D., and York, J. (1995), "Bayesian Graphical Models for Discrete Data," International Statistical Review, 63, 215–232.

McLachlan, G. J., Bean, R. W., and Peel, D. (2002), "A Mixture Model-Based Approach to the Clustering of Microarray Expression Data," Bioinformatics, 18, 413–422.

Medvedovic, M., and Sivaganesan, S. (2002), "Bayesian Infinite Mixture Model-Based Clustering of Gene Expression Profiles," Bioinformatics, 18, 1194–1206.

Milligan, G. W. (1989), "A Validation Study of a Variable Weighting Algorithm for Cluster Analysis," Journal of Classification, 6, 53–71.

Murtagh, F., and Hernández-Pajares, M. (1995), "The Kohonen Self-Organizing Map Method: An Assessment," Journal of Classification, 12, 165–190.

Richardson, S., and Green, P. J. (1997), "On Bayesian Analysis of Mixtures With an Unknown Number of Components" (with discussion), Journal of the Royal Statistical Society, Ser. B, 59, 731–792.

Sha, N., Vannucci, M., Tadesse, M. G., Brown, P. J., Dragoni, I., Davies, N., Roberts, T., Contestabile, A., Salmon, M., Buckley, C., and Falciani, F. (2004), "Bayesian Variable Selection in Multinomial Probit Models to Identify Molecular Signatures of Disease Stage," Biometrics, 60, 812–819.

Stephens, M. (2000a), "Bayesian Analysis of Mixture Models With an Unknown Number of Components: An Alternative to Reversible-Jump Methods," The Annals of Statistics, 28, 40–74.

Stephens, M. (2000b), "Dealing With Label Switching in Mixture Models," Journal of the Royal Statistical Society, Ser. B, 62, 795–809.

Tadesse, M. G., Ibrahim, J. G., and Mutter, G. L. (2003), "Identification of Differentially Expressed Genes in High-Density Oligonucleotide Arrays Accounting for the Quantification Limits of the Technology," Biometrics, 59, 542–554.


Four normal endometrium samples and 10 endometrioidendometrial adenocarcinomas were collected from hysterec-tomy specimens RNA samples were isolated from the 14 tis-sues and reverse-transcribed The resulting cDNA targets wereprepared following the manufacturerrsquos instructions and hy-bridized to Affymetrix Hu6800 GeneChip arrays which con-tain 7070 probe sets The scanned images were processed withthe Affymetrix software version 31 The average differencederived by the software was used to indicate a transcript abun-dance and a global normalization procedure was applied Thetechnology does not reliably quantify low expression levels andit is subject to spot saturation at the high end For this particu-lar array the limits of reliable detection were previously set to20 and 16000 (Tadesse Ibrahim and Mutter 2003) We usedthese same thresholds and removed probe sets with at least oneunreliable reading from the analysis This left us with p = 762variables We then log-transformed the expression readings tosatisfy the assumption of normality as suggested by the BoxndashCox transformation We also rescaled each variable by its range

As described in Section 41 we specified the priors with thehyperparameters δ = 3 α = 1 and h1 = h0 = 100 We setκ1 = 001 and κ0 = 01 and assumed unequal covariancesacross clusters We used a truncated Poisson(λ = 5) prior for G

with Gmax = n We took a Bernoulli prior for γ with an expec-tation of 10 variables to be included in the model

To avoid possible dependence of the results on the initialmodel we ran four MCMC chains with widely different startingpoints (1) γj set to 0 for all jrsquos except one randomly selected(2) γj set to 1 for 10 randomly selected jrsquos (3) γj set to 1 for25 randomly selected jrsquos and (4) γj set to 1 for 50 randomlyselected jrsquos For each of the MCMC chains we ran 100000 it-erations with 40000 sweeps taken as a burn-in

We looked at both the allocation of tissues and the selec-tion of genes whose expression best discriminate among thedifferent groups The MCMC sampler mixed steadily overthe iterations As shown in the histogram of Figure 9(a) thesampler visited mostly between two and five components witha stronger support for G = 3 clusters Figure 9(b) shows thetrace plot for the number of included variables for one ofthe chains which visited mostly models with 25ndash35 variablesThe other chains behaved similarly Figure 10 gives the mar-ginal posterior probabilities p(γj = 1|XG = 3) for each ofthe chains Despite the very different starting points the fourchains visited similar regions and exhibit broadly similar mar-ginal plots To assess the concordance of the results across thefour chains we looked at the correlation coefficients betweenthe frequencies for inclusion of genes These are reported in


Figure 9. MCMC Output for Endometrial Data: (a) number of visited clusters G; (b) number of included variables pγ.

Figure 11, along with the corresponding pairwise scatterplots. We see that there is very good agreement across the different MCMC runs. The marginal probabilities, on the other hand, are not as concordant across the chains, because their derivation makes use of model posterior probabilities [see (23)], which are affected by the inclusion of correlated variables. This is a problem inherent to DNA microarray data, where multiple genes with similar expression profiles are often observed.
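As an illustration of this concordance check (our own sketch; it assumes each chain's post-burn-in γ draws are stored as a 0/1 matrix), one can compute per-chain inclusion frequencies and correlate them across chains:

import numpy as np

def inclusion_frequencies(gamma_draws):
    # gamma_draws: (iterations x p) 0/1 matrix of post-burn-in draws from one chain
    return gamma_draws.mean(axis=0)

def chain_concordance(chains):
    # chains: list of such matrices, one per MCMC chain
    freqs = np.column_stack([inclusion_frequencies(c) for c in chains])
    return freqs, np.corrcoef(freqs, rowvar=False)  # p x 4 frequencies and 4 x 4 correlation matrix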

We pooled and relabeled the output of the four chains, normalized the relative posterior probabilities, and recomputed p(γj = 1|X, G = 3). These marginal probabilities are displayed in Figure 12. There are 31 genes with posterior probability greater than .5. We also estimated the marginal posterior probabilities for the sample allocations, p(yi = k|X, G), based on the 31 selected genes. These are shown in Figure 13. We successfully identified the four normal tissues. The results also seem to suggest that there are possibly two subtypes within the malignant tissues.
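A simplified sketch of this post-processing step, assuming the pooled, relabeled γ draws and the unnormalized log posterior value of each visited model are available (names are ours):

import numpy as np

def pooled_inclusion_probabilities(gamma_pooled, log_post, threshold=0.5):
    # gamma_pooled: (draws x p) 0/1 matrix pooled over the relabeled chains
    # log_post: unnormalized log posterior of each visited model, as used in (23)
    w = np.exp(log_post - log_post.max())
    w /= w.sum()                                   # normalized relative posterior probabilities
    marg = w @ gamma_pooled                        # approximate p(gamma_j = 1 | X, G = 3)
    return marg, np.flatnonzero(marg > threshold)  # indices of selected genes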

Figure 10. Endometrial Data: Marginal posterior probabilities for inclusion of variables, p(γj = 1|X, G = 3), for each of the four chains.

We also present the results from analyzing this dataset with the COSA algorithm. Figure 14 shows the resulting single linkage and average linkage dendrograms, along with the top 30 variables that distinguish the normal and tumor tissues. There is some overlap between the genes selected by COSA and those identified by our procedure. Among the sets of 30 "best" discriminating genes selected by the two methods, 10 are common to both.

8. DISCUSSION

We have proposed a method for simultaneously clustering high-dimensional data with an unknown number of components and selecting the variables that best discriminate the different groups. We successfully applied the methodology to various datasets. Our method is fully Bayesian, and we provided standard default recommendations for the choice of priors.

Here we mention some possible extensions and directions for future research. We have drawn posterior inference conditional on a fixed number of components. A more attractive

Figure 11. Endometrial Data: Concordance of results among the four chains. Pairwise scatterplots of frequencies for inclusion of genes.


Figure 12. Endometrial Data: Union of four chains. Marginal posterior probabilities p(γj = 1|X, G = 3).

approach would incorporate the uncertainty on this parameter and combine the results for all different values of G visited by the MCMC sampler. To our knowledge, this is an open

Figure 13. Endometrial Data: Union of four chains. Marginal posterior probabilities of sample allocations, p(yi = k|X, G = 3), i = 1, ..., 14, k = 1, ..., 3.

problem that requires further research. One promising approach involves considering the posterior probability of pairs of

Figure 14. Analysis of Endometrial Data Using COSA: (a) single linkage; (b) average linkage; (c) variables leading to cluster 1; (d) variables leading to cluster 2.


observations being allocated to the same cluster, regardless of the number of visited components. As long as there is no interest in estimating the component parameters, the problem of label switching would not be a concern. But this leads to $\binom{n}{2}$ probabilities, and it is not clear how to best summarize them. For example, Medvedovic and Sivaganesan (2002) used hierarchical clustering based on these pairwise probabilities as a similarity measure.
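For concreteness, a sketch of that pairwise summary (our own illustration of the Medvedovic and Sivaganesan idea, not the method proposed in this article): estimate the posterior co-clustering probabilities from the sampled allocation vectors, then pass one minus these probabilities as a distance to a hierarchical clustering routine.

import numpy as np
from scipy.cluster.hierarchy import linkage

def coclustering_matrix(y_draws):
    # y_draws: (draws x n) matrix of sampled allocation vectors y
    n = y_draws.shape[1]
    P = np.zeros((n, n))
    for y in y_draws:
        P += (y[:, None] == y[None, :])
    return P / y_draws.shape[0]                  # posterior pairwise co-clustering probabilities

def pairwise_dendrogram(y_draws, method="average"):
    P = coclustering_matrix(y_draws)
    d = 1.0 - P[np.triu_indices_from(P, k=1)]    # condensed distances: 1 - co-clustering probability
    return linkage(d, method=method)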

An interesting future avenue would be to investigate the performance of our method when covariance constraints are imposed, as was proposed by Banfield and Raftery (1993). Their parameterization allows one to incorporate different assumptions on the volume, orientation, and shape of the clusters.

Another possible extension is to use empirical Bayes estimates to elicit the hyperparameters. For example, conditional on a fixed number of clusters, a complete log-likelihood can be computed using all covariates and a Monte Carlo EM approach developed for estimation. Alternatively, if prior information were available or subjective priors were preferable, then the prior setting could be modified accordingly. If interactions among predictors were of interest, then additional interaction terms could be included in the model and the prior on γ could be modified to accommodate the additional terms.

APPENDIX: FULL CONDITIONALS

Here we give details on the derivation of the marginalized full conditionals under the assumption of equal and unequal covariances across clusters.

Homogeneous Covariance: \(\Sigma_1 = \cdots = \Sigma_G = \Sigma\)

\[
f(X, y \mid G, w, \gamma) = \int f(X, y \mid G, w, \mu, \Sigma, \eta, \Omega, \gamma)\, f(\mu \mid G, \Sigma, \gamma)\, f(\Sigma \mid G, \gamma)\, f(\eta \mid \Omega, \gamma)\, f(\Omega \mid \gamma)\, d\mu\, d\Sigma\, d\eta\, d\Omega
\]

\[
\begin{aligned}
&= \int \prod_{k=1}^{G} w_k^{n_k}\, |\Sigma_{(\gamma)}|^{-(n_k+1)/2}\, (2\pi)^{-(n_k+1)p_\gamma/2}\, h_1^{-p_\gamma/2} \\
&\quad \times \exp\left\{-\frac{1}{2}\sum_{k=1}^{G}\sum_{x_{i(\gamma)}\in C_k}(x_{i(\gamma)}-\mu_{k(\gamma)})^T \Sigma_{(\gamma)}^{-1}(x_{i(\gamma)}-\mu_{k(\gamma)})\right\}
\exp\left\{-\frac{1}{2}\sum_{k=1}^{G}(\mu_{k(\gamma)}-\mu_{0(\gamma)})^T (h_1\Sigma_{(\gamma)})^{-1}(\mu_{k(\gamma)}-\mu_{0(\gamma)})\right\} \\
&\quad \times 2^{-p_\gamma(\delta+p_\gamma-1)/2}\, \pi^{-p_\gamma(p_\gamma-1)/4}\left(\prod_{j=1}^{p_\gamma}\Gamma\left(\frac{\delta+p_\gamma-j}{2}\right)\right)^{-1}
|Q_{1(\gamma)}|^{(\delta+p_\gamma-1)/2}\, |\Sigma_{(\gamma)}|^{-(\delta+2p_\gamma)/2}\exp\left\{-\frac{1}{2}\operatorname{tr}\left(\Sigma_{(\gamma)}^{-1}Q_{1(\gamma)}\right)\right\} \\
&\quad \times |\Omega_{(\gamma^c)}|^{-(n+1)/2}\,(2\pi)^{-(n+1)(p-p_\gamma)/2}\, h_0^{-(p-p_\gamma)/2}
\exp\left\{-\frac{1}{2}\sum_{i=1}^{n}(x_{i(\gamma^c)}-\eta_{(\gamma^c)})^T \Omega_{(\gamma^c)}^{-1}(x_{i(\gamma^c)}-\eta_{(\gamma^c)})\right\} \\
&\quad \times \exp\left\{-\frac{1}{2}(\eta_{(\gamma^c)}-\mu_{0(\gamma^c)})^T (h_0\Omega_{(\gamma^c)})^{-1}(\eta_{(\gamma^c)}-\mu_{0(\gamma^c)})\right\}
\, 2^{-(p-p_\gamma)(\delta+p-p_\gamma-1)/2}\, \pi^{-(p-p_\gamma)(p-p_\gamma-1)/4}\left(\prod_{j=1}^{p-p_\gamma}\Gamma\left(\frac{\delta+p-p_\gamma-j}{2}\right)\right)^{-1} \\
&\quad \times |Q_{0(\gamma^c)}|^{(\delta+p-p_\gamma-1)/2}\, |\Omega_{(\gamma^c)}|^{-(\delta+2(p-p_\gamma))/2}\exp\left\{-\frac{1}{2}\operatorname{tr}\left(\Omega_{(\gamma^c)}^{-1}Q_{0(\gamma^c)}\right)\right\}\, d\mu\, d\Sigma\, d\eta\, d\Omega
\end{aligned}
\]

\[
\begin{aligned}
&= \int \prod_{k=1}^{G}(2\pi)^{-(n_k+1)p_\gamma/2}\, h_1^{-p_\gamma/2}\, w_k^{n_k}\, |\Sigma_{(\gamma)}|^{-(n_k+1)/2}\exp\left\{-\frac{1}{2}\operatorname{tr}\left(\Sigma_{(\gamma)}^{-1}S_{k(\gamma)}\right)\right\} \\
&\quad \times \exp\left\{-\frac{1}{2}\sum_{k=1}^{G}\left(\mu_{k(\gamma)}-\frac{\sum_{x_{i(\gamma)}\in C_k}x_{i(\gamma)}+h_1^{-1}\mu_{0(\gamma)}}{n_k+h_1^{-1}}\right)^T\left[(n_k+h_1^{-1})^{-1}\Sigma_{(\gamma)}\right]^{-1}\left(\mu_{k(\gamma)}-\frac{\sum_{x_{i(\gamma)}\in C_k}x_{i(\gamma)}+h_1^{-1}\mu_{0(\gamma)}}{n_k+h_1^{-1}}\right)\right\} \\
&\quad \times 2^{-p_\gamma(\delta+p_\gamma-1)/2}\, \pi^{-p_\gamma(p_\gamma-1)/4}\left(\prod_{j=1}^{p_\gamma}\Gamma\left(\frac{\delta+p_\gamma-j}{2}\right)\right)^{-1}|Q_{1(\gamma)}|^{(\delta+p_\gamma-1)/2}\, |\Sigma_{(\gamma)}|^{-(\delta+2p_\gamma)/2}\exp\left\{-\frac{1}{2}\operatorname{tr}\left(\Sigma_{(\gamma)}^{-1}Q_{1(\gamma)}\right)\right\} \\
&\quad \times (2\pi)^{-(n+1)(p-p_\gamma)/2}\, h_0^{-(p-p_\gamma)/2}\, |\Omega_{(\gamma^c)}|^{-(n+1)/2}\exp\left\{-\frac{1}{2}\operatorname{tr}\left(\Omega_{(\gamma^c)}^{-1}S_{0(\gamma^c)}\right)\right\} \\
&\quad \times \exp\left\{-\frac{1}{2}\left(\eta_{(\gamma^c)}-\frac{\sum_{i=1}^{n}x_{i(\gamma^c)}+h_0^{-1}\mu_{0(\gamma^c)}}{n+h_0^{-1}}\right)^T\left[(n+h_0^{-1})^{-1}\Omega_{(\gamma^c)}\right]^{-1}\left(\eta_{(\gamma^c)}-\frac{\sum_{i=1}^{n}x_{i(\gamma^c)}+h_0^{-1}\mu_{0(\gamma^c)}}{n+h_0^{-1}}\right)\right\} \\
&\quad \times 2^{-(p-p_\gamma)(\delta+p-p_\gamma-1)/2}\, \pi^{-(p-p_\gamma)(p-p_\gamma-1)/4}\left(\prod_{j=1}^{p-p_\gamma}\Gamma\left(\frac{\delta+p-p_\gamma-j}{2}\right)\right)^{-1}|Q_{0(\gamma^c)}|^{(\delta+p-p_\gamma-1)/2}\, |\Omega_{(\gamma^c)}|^{-(\delta+2(p-p_\gamma))/2}\exp\left\{-\frac{1}{2}\operatorname{tr}\left(\Omega_{(\gamma^c)}^{-1}Q_{0(\gamma^c)}\right)\right\}\, d\mu\, d\Sigma\, d\eta\, d\Omega
\end{aligned}
\]

\[
\begin{aligned}
&= \left[\prod_{k=1}^{G}(h_1 n_k+1)^{-p_\gamma/2}\, w_k^{n_k}\right]\pi^{-np_\gamma/2}\,
|Q_{1(\gamma)}|^{(\delta+p_\gamma-1)/2}\, 2^{-p_\gamma(\delta+n+p_\gamma-1)/2}\, \pi^{-p_\gamma(p_\gamma-1)/4}\left(\prod_{j=1}^{p_\gamma}\Gamma\left(\frac{\delta+p_\gamma-j}{2}\right)\right)^{-1} \\
&\quad \times \int |\Sigma_{(\gamma)}|^{-(\delta+n+2p_\gamma)/2}\exp\left\{-\frac{1}{2}\operatorname{tr}\left(\Sigma_{(\gamma)}^{-1}\left[Q_{1(\gamma)}+\sum_{k=1}^{G}S_{k(\gamma)}\right]\right)\right\}d\Sigma \\
&\quad \times (h_0 n+1)^{-(p-p_\gamma)/2}\, \pi^{-n(p-p_\gamma)/2}\, |Q_{0(\gamma^c)}|^{(\delta+p-p_\gamma-1)/2}\, 2^{-(p-p_\gamma)(\delta+n+p-p_\gamma-1)/2}\, \pi^{-(p-p_\gamma)(p-p_\gamma-1)/4}\left(\prod_{j=1}^{p-p_\gamma}\Gamma\left(\frac{\delta+p-p_\gamma-j}{2}\right)\right)^{-1} \\
&\quad \times \int |\Omega_{(\gamma^c)}|^{-(\delta+n+2(p-p_\gamma))/2}\exp\left\{-\frac{1}{2}\operatorname{tr}\left(\Omega_{(\gamma^c)}^{-1}\left[Q_{0(\gamma^c)}+S_{0(\gamma^c)}\right]\right)\right\}d\Omega
\end{aligned}
\]

\[
\begin{aligned}
&= \pi^{-np/2}\left[\prod_{k=1}^{G}(h_1 n_k+1)^{-p_\gamma/2}\, w_k^{n_k}\right](h_0 n+1)^{-(p-p_\gamma)/2}
\prod_{j=1}^{p_\gamma}\frac{\Gamma\left(\frac{\delta+n+p_\gamma-j}{2}\right)}{\Gamma\left(\frac{\delta+p_\gamma-j}{2}\right)}
\prod_{j=1}^{p-p_\gamma}\frac{\Gamma\left(\frac{\delta+n+p-p_\gamma-j}{2}\right)}{\Gamma\left(\frac{\delta+p-p_\gamma-j}{2}\right)} \\
&\quad \times |Q_{1(\gamma)}|^{(\delta+p_\gamma-1)/2}\, |Q_{0(\gamma^c)}|^{(\delta+p-p_\gamma-1)/2}
\left|Q_{1(\gamma)}+\sum_{k=1}^{G}S_{k(\gamma)}\right|^{-(\delta+n+p_\gamma-1)/2}
\left|Q_{0(\gamma^c)}+S_{0(\gamma^c)}\right|^{-(\delta+n+p-p_\gamma-1)/2},
\end{aligned}
\]

where \(S_{k(\gamma)}\) and \(S_{0(\gamma^c)}\) are given by

\[
\begin{aligned}
S_{k(\gamma)} &= \sum_{x_{i(\gamma)}\in C_k}x_{i(\gamma)}x_{i(\gamma)}^T + h_1^{-1}\mu_{0(\gamma)}\mu_{0(\gamma)}^T
- (n_k+h_1^{-1})^{-1}\left(\sum_{x_{i(\gamma)}\in C_k}x_{i(\gamma)}+h_1^{-1}\mu_{0(\gamma)}\right)\left(\sum_{x_{i(\gamma)}\in C_k}x_{i(\gamma)}+h_1^{-1}\mu_{0(\gamma)}\right)^T \\
&= \sum_{x_{i(\gamma)}\in C_k}(x_{i(\gamma)}-\bar{x}_{k(\gamma)})(x_{i(\gamma)}-\bar{x}_{k(\gamma)})^T
+ \frac{n_k}{h_1 n_k+1}(\mu_{0(\gamma)}-\bar{x}_{k(\gamma)})(\mu_{0(\gamma)}-\bar{x}_{k(\gamma)})^T
\end{aligned}
\]

and

\[
\begin{aligned}
S_{0(\gamma^c)} &= \sum_{i=1}^{n}x_{i(\gamma^c)}x_{i(\gamma^c)}^T + h_0^{-1}\mu_{0(\gamma^c)}\mu_{0(\gamma^c)}^T
- (n+h_0^{-1})^{-1}\left(\sum_{i=1}^{n}x_{i(\gamma^c)}+h_0^{-1}\mu_{0(\gamma^c)}\right)\left(\sum_{i=1}^{n}x_{i(\gamma^c)}+h_0^{-1}\mu_{0(\gamma^c)}\right)^T \\
&= \sum_{i=1}^{n}(x_{i(\gamma^c)}-\bar{x}_{(\gamma^c)})(x_{i(\gamma^c)}-\bar{x}_{(\gamma^c)})^T
+ \frac{n}{h_0 n+1}(\mu_{0(\gamma^c)}-\bar{x}_{(\gamma^c)})(\mu_{0(\gamma^c)}-\bar{x}_{(\gamma^c)})^T,
\end{aligned}
\]

with \(\bar{x}_{k(\gamma)} = \frac{1}{n_k}\sum_{x_{i(\gamma)}\in C_k}x_{i(\gamma)}\) and \(\bar{x}_{(\gamma^c)} = \frac{1}{n}\sum_{i=1}^{n}x_{i(\gamma^c)}\).
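As a numerical companion to these definitions, a short sketch (our notation; it assumes the data matrices have already been restricted to the selected and nonselected variables) computes S_k(γ) and S_0(γ^c) from the centered form:

import numpy as np

def S_k(Xk, mu0, h1):
    # Xk: (n_k x p_gamma) rows of cluster C_k restricted to the selected variables
    nk = Xk.shape[0]
    xbar = Xk.mean(axis=0)
    C = Xk - xbar
    d = (mu0 - xbar)[:, None]
    return C.T @ C + (nk / (h1 * nk + 1.0)) * (d @ d.T)

def S_0(Xc, mu0c, h0):
    # Xc: (n x (p - p_gamma)) data on the nondiscriminating variables
    n = Xc.shape[0]
    xbar = Xc.mean(axis=0)
    C = Xc - xbar
    d = (mu0c - xbar)[:, None]
    return C.T @ C + (n / (h0 * n + 1.0)) * (d @ d.T)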

Heterogeneous Covariances

\[
f(X, y \mid G, w, \gamma) = \int f(X, y \mid G, w, \mu, \Sigma, \eta, \Omega, \gamma)\, f(\mu \mid G, \Sigma, \gamma)\, f(\Sigma \mid G, \gamma)\, f(\eta \mid \Omega, \gamma)\, f(\Omega \mid \gamma)\, d\mu\, d\Sigma\, d\eta\, d\Omega
\]

\[
\begin{aligned}
&= \int \prod_{k=1}^{G} w_k^{n_k}\, |\Sigma_{k(\gamma)}|^{-(n_k+1)/2}\, (2\pi)^{-(n_k+1)p_\gamma/2}\, h_1^{-p_\gamma/2}
\, 2^{-p_\gamma(\delta+p_\gamma-1)/2}\, \pi^{-p_\gamma(p_\gamma-1)/4}\left(\prod_{j=1}^{p_\gamma}\Gamma\left(\frac{\delta+p_\gamma-j}{2}\right)\right)^{-1} \\
&\quad \times \exp\left\{-\frac{1}{2}\sum_{k=1}^{G}\sum_{x_{i(\gamma)}\in C_k}(x_{i(\gamma)}-\mu_{k(\gamma)})^T \Sigma_{k(\gamma)}^{-1}(x_{i(\gamma)}-\mu_{k(\gamma)})\right\}
\exp\left\{-\frac{1}{2}\sum_{k=1}^{G}(\mu_{k(\gamma)}-\mu_{0(\gamma)})^T (h_1\Sigma_{k(\gamma)})^{-1}(\mu_{k(\gamma)}-\mu_{0(\gamma)})\right\} \\
&\quad \times \prod_{k=1}^{G}|Q_{1(\gamma)}|^{(\delta+p_\gamma-1)/2}\, |\Sigma_{k(\gamma)}|^{-(\delta+2p_\gamma)/2}\exp\left\{-\frac{1}{2}\operatorname{tr}\left(\Sigma_{k(\gamma)}^{-1}Q_{1(\gamma)}\right)\right\}\, d\mu\, d\Sigma \\
&\quad \times \pi^{-n(p-p_\gamma)/2}(h_0 n+1)^{-(p-p_\gamma)/2}
\prod_{j=1}^{p-p_\gamma}\frac{\Gamma\left(\frac{\delta+n+p-p_\gamma-j}{2}\right)}{\Gamma\left(\frac{\delta+p-p_\gamma-j}{2}\right)}
|Q_{0(\gamma^c)}|^{(\delta+p-p_\gamma-1)/2}\left|Q_{0(\gamma^c)}+S_{0(\gamma^c)}\right|^{-(\delta+n+p-p_\gamma-1)/2}
\end{aligned}
\]

\[
\begin{aligned}
&= \int_{\Sigma}\int_{\mu} \prod_{k=1}^{G}(2\pi)^{-(n_k+1)p_\gamma/2}\, h_1^{-p_\gamma/2}\, w_k^{n_k}\,
2^{-p_\gamma(\delta+p_\gamma-1)/2}\, \pi^{-p_\gamma(p_\gamma-1)/4}\left(\prod_{j=1}^{p_\gamma}\Gamma\left(\frac{\delta+p_\gamma-j}{2}\right)\right)^{-1}|\Sigma_{k(\gamma)}|^{-(n_k+1)/2} \\
&\quad \times \exp\left\{-\frac{1}{2}\sum_{k=1}^{G}\left(\mu_{k(\gamma)}-\frac{\sum_{x_{i(\gamma)}\in C_k}x_{i(\gamma)}+h_1^{-1}\mu_{0(\gamma)}}{n_k+h_1^{-1}}\right)^T\left[(n_k+h_1^{-1})^{-1}\Sigma_{k(\gamma)}\right]^{-1}\left(\mu_{k(\gamma)}-\frac{\sum_{x_{i(\gamma)}\in C_k}x_{i(\gamma)}+h_1^{-1}\mu_{0(\gamma)}}{n_k+h_1^{-1}}\right)\right\} \\
&\quad \times \prod_{k=1}^{G}|Q_{1(\gamma)}|^{(\delta+p_\gamma-1)/2}\, |\Sigma_{k(\gamma)}|^{-(\delta+2p_\gamma)/2}\exp\left\{-\frac{1}{2}\operatorname{tr}\left(\Sigma_{k(\gamma)}^{-1}\left[Q_{1(\gamma)}+S_{k(\gamma)}\right]\right)\right\}\, d\mu\, d\Sigma \\
&\quad \times \pi^{-n(p-p_\gamma)/2}(h_0 n+1)^{-(p-p_\gamma)/2}
\prod_{j=1}^{p-p_\gamma}\frac{\Gamma\left(\frac{\delta+n+p-p_\gamma-j}{2}\right)}{\Gamma\left(\frac{\delta+p-p_\gamma-j}{2}\right)}
|Q_{0(\gamma^c)}|^{(\delta+p-p_\gamma-1)/2}\left|Q_{0(\gamma^c)}+S_{0(\gamma^c)}\right|^{-(\delta+n+p-p_\gamma-1)/2}
\end{aligned}
\]

\[
\begin{aligned}
&= \pi^{-np/2}\prod_{k=1}^{G}(h_1 n_k+1)^{-p_\gamma/2}\, w_k^{n_k}\,
|Q_{1(\gamma)}|^{(\delta+p_\gamma-1)/2}\, 2^{-p_\gamma(\delta+n_k+p_\gamma-1)/2}\, \pi^{-p_\gamma(p_\gamma-1)/4}\left(\prod_{j=1}^{p_\gamma}\Gamma\left(\frac{\delta+p_\gamma-j}{2}\right)\right)^{-1} \\
&\quad \times \int \prod_{k=1}^{G}|\Sigma_{k(\gamma)}|^{-(\delta+n_k+2p_\gamma)/2}\exp\left\{-\frac{1}{2}\operatorname{tr}\left(\Sigma_{k(\gamma)}^{-1}\left[Q_{1(\gamma)}+S_{k(\gamma)}\right]\right)\right\}d\Sigma \\
&\quad \times (h_0 n+1)^{-(p-p_\gamma)/2}
\prod_{j=1}^{p-p_\gamma}\frac{\Gamma\left(\frac{\delta+n+p-p_\gamma-j}{2}\right)}{\Gamma\left(\frac{\delta+p-p_\gamma-j}{2}\right)}
|Q_{0(\gamma^c)}|^{(\delta+p-p_\gamma-1)/2}\left|Q_{0(\gamma^c)}+S_{0(\gamma^c)}\right|^{-(\delta+n+p-p_\gamma-1)/2}
\end{aligned}
\]

\[
\begin{aligned}
&= \pi^{-np/2}\prod_{k=1}^{G}\left[(h_1 n_k+1)^{-p_\gamma/2}\, w_k^{n_k}
\prod_{j=1}^{p_\gamma}\frac{\Gamma\left(\frac{\delta+n_k+p_\gamma-j}{2}\right)}{\Gamma\left(\frac{\delta+p_\gamma-j}{2}\right)}
|Q_{1(\gamma)}|^{(\delta+p_\gamma-1)/2}\left|Q_{1(\gamma)}+S_{k(\gamma)}\right|^{-(\delta+n_k+p_\gamma-1)/2}\right] \\
&\quad \times (h_0 n+1)^{-(p-p_\gamma)/2}
\prod_{j=1}^{p-p_\gamma}\frac{\Gamma\left(\frac{\delta+n+p-p_\gamma-j}{2}\right)}{\Gamma\left(\frac{\delta+p-p_\gamma-j}{2}\right)}
|Q_{0(\gamma^c)}|^{(\delta+p-p_\gamma-1)/2}\left|Q_{0(\gamma^c)}+S_{0(\gamma^c)}\right|^{-(\delta+n+p-p_\gamma-1)/2}.
\end{aligned}
\]
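The final expression above can be evaluated on the log scale. The following sketch is our own (hypothetical argument names; it assumes 0-indexed cluster labels, the (γ) and (γ^c) blocks of Q1 and Q0 supplied directly, and the hyperparameter values of Section 7.3 as defaults) and computes log f(X, y | G, w, γ) under unequal covariances:

import numpy as np
from scipy.special import gammaln

def _scatter(Xsub, m0, h):
    # S-matrix of the Appendix: centered scatter plus shrinkage toward mu0
    m = Xsub.shape[0]
    xbar = Xsub.mean(axis=0)
    C = Xsub - xbar
    d = (m0 - xbar)[:, None]
    return C.T @ C + (m / (h * m + 1.0)) * (d @ d.T)

def log_marglik_hetero(X, y, gamma, w, Q1, Q0, mu0, delta=3, h1=100.0, h0=100.0):
    # X: (n x p) data; y: cluster labels 0..G-1; gamma: 0/1 inclusion vector; w: length-G weights
    n, p = X.shape
    gamma = gamma.astype(bool)
    pg, pc = int(gamma.sum()), p - int(gamma.sum())
    out = -0.5 * n * p * np.log(np.pi)
    for k in np.unique(y):
        Xk = X[y == k][:, gamma]
        nk = Xk.shape[0]
        Sk = _scatter(Xk, mu0[gamma], h1)
        j = np.arange(1, pg + 1)
        out += nk * np.log(w[int(k)]) - 0.5 * pg * np.log(h1 * nk + 1.0)
        out += np.sum(gammaln((delta + nk + pg - j) / 2.0) - gammaln((delta + pg - j) / 2.0))
        out += 0.5 * (delta + pg - 1) * np.linalg.slogdet(Q1)[1]
        out -= 0.5 * (delta + nk + pg - 1) * np.linalg.slogdet(Q1 + Sk)[1]
    S0 = _scatter(X[:, ~gamma], mu0[~gamma], h0)
    j = np.arange(1, pc + 1)
    out += -0.5 * pc * np.log(h0 * n + 1.0)
    out += np.sum(gammaln((delta + n + pc - j) / 2.0) - gammaln((delta + pc - j) / 2.0))
    out += 0.5 * (delta + pc - 1) * np.linalg.slogdet(Q0)[1]
    out -= 0.5 * (delta + n + pc - 1) * np.linalg.slogdet(Q0 + S0)[1]
    return out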

[Received May 2003. Revised August 2004.]

REFERENCES

Anderson, E. (1935), "The Irises of the Gaspé Peninsula," Bulletin of the American Iris Society, 59, 2-5.

Banfield, J. D., and Raftery, A. E. (1993), "Model-Based Gaussian and Non-Gaussian Clustering," Biometrics, 49, 803-821.

Brown, P. J. (1993), Measurement, Regression and Calibration, Oxford, U.K.: Clarendon Press.

Brown, P. J., Vannucci, M., and Fearn, T. (1998a), "Bayesian Wavelength Selection in Multicomponent Analysis," Chemometrics, 12, 173-182.

Brown, P. J., Vannucci, M., and Fearn, T. (1998b), "Multivariate Bayesian Variable Selection and Prediction," Journal of the Royal Statistical Society, Ser. B, 60, 627-641.

Brusco, M. J., and Cradit, J. D. (2001), "A Variable Selection Heuristic for k-Means Clustering," Psychometrika, 66, 249-270.

Chang, W.-C. (1983), "On Using Principal Components Before Separating a Mixture of Two Multivariate Normal Distributions," Applied Statistics, 32, 267-275.

Diebolt, J., and Robert, C. P. (1994), "Estimation of Finite Mixture Distributions Through Bayesian Sampling," Journal of the Royal Statistical Society, Ser. B, 56, 363-375.

Fowlkes, E. B., Gnanadesikan, R., and Kettering, J. R. (1988), "Variable Selection in Clustering," Journal of Classification, 5, 205-228.

Friedman, J. H., and Meulman, J. J. (2003), "Clustering Objects on Subsets of Attributes," technical report, Stanford University, Dept. of Statistics and Stanford Linear Accelerator Center.

George, E. I., and McCulloch, R. E. (1997), "Approaches for Bayesian Variable Selection," Statistica Sinica, 7, 339-373.

Ghosh, D., and Chinnaiyan, A. M. (2002), "Mixture Modelling of Gene Expression Data From Microarray Experiments," Bioinformatics, 18, 275-286.

Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. (eds.) (1996), Markov Chain Monte Carlo in Practice, London: Chapman & Hall.

Gnanadesikan, R., Kettering, J. R., and Tao, S. L. (1995), "Weighting and Selection of Variables for Cluster Analysis," Journal of Classification, 12, 113-136.

Green, P. J. (1995), "Reversible-Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination," Biometrika, 82, 711-732.

Madigan, D., and York, J. (1995), "Bayesian Graphical Models for Discrete Data," International Statistical Review, 63, 215-232.

McLachlan, G. J., Bean, R. W., and Peel, D. (2002), "A Mixture Model-Based Approach to the Clustering of Microarray Expression Data," Bioinformatics, 18, 413-422.

Medvedovic, M., and Sivaganesan, S. (2002), "Bayesian Infinite Mixture Model-Based Clustering of Gene Expression Profiles," Bioinformatics, 18, 1194-1206.

Milligan, G. W. (1989), "A Validation Study of a Variable Weighting Algorithm for Cluster Analysis," Journal of Classification, 6, 53-71.

Murtagh, F., and Hernández-Pajares, M. (1995), "The Kohonen Self-Organizing Map Method: An Assessment," Journal of Classification, 12, 165-190.

Richardson, S., and Green, P. J. (1997), "On Bayesian Analysis of Mixtures With an Unknown Number of Components" (with discussion), Journal of the Royal Statistical Society, Ser. B, 59, 731-792.

Sha, N., Vannucci, M., Tadesse, M. G., Brown, P. J., Dragoni, I., Davies, N., Roberts, T., Contestabile, A., Salmon, M., Buckley, C., and Falciani, F. (2004), "Bayesian Variable Selection in Multinomial Probit Models to Identify Molecular Signatures of Disease Stage," Biometrics, 60, 812-819.

Stephens, M. (2000a), "Bayesian Analysis of Mixture Models With an Unknown Number of Components: An Alternative to Reversible-Jump Methods," The Annals of Statistics, 28, 40-74.

Stephens, M. (2000b), "Dealing With Label Switching in Mixture Models," Journal of the Royal Statistical Society, Ser. B, 62, 795-809.

Tadesse, M. G., Ibrahim, J. G., and Mutter, G. L. (2003), "Identification of Differentially Expressed Genes in High-Density Oligonucleotide Arrays Accounting for the Quantification Limits of the Technology," Biometrics, 59, 542-554.

We now illustrate the methodology on microarray datafrom an endometrial cancer study Endometrioid endometrialadenocarcinoma is a common gynecologic malignancy aris-ing within the uterine lining The disease usually occurs inpostmenopausal women and is associated with the hormonalrisk factor of protracted estrogen exposure unopposed byprogestins Despite its prevalence the molecular mechanismsof its genesis are not completely known DNA microarrayswith their ability to examine thousands of genes simultaneouslycould be an efficient tool to isolate relevant genes and clustertissues into different subtypes This is especially important incancer treatment where different clinicopathologic groups areknown to vary in their response to therapy and genes identifiedto discriminate among the different subtypes may represent tar-gets for therapeutic intervention and biomarkers for improveddiagnosis

Figure 7 Simulated Data With p = 1000 Marginal Posterior Proba-bilities for Inclusion of Variables p(γj = 1|X G = 4)

612 Journal of the American Statistical Association June 2005

(a) (b)

(c) (d)

Figure 8 Analysis of Simulated Data With p = 1000 Using COSA (a) Single linkage (b) average linkage (c) complete linkage (d) relevancefor delineated cluster

Four normal endometrium samples and 10 endometrioidendometrial adenocarcinomas were collected from hysterec-tomy specimens RNA samples were isolated from the 14 tis-sues and reverse-transcribed The resulting cDNA targets wereprepared following the manufacturerrsquos instructions and hy-bridized to Affymetrix Hu6800 GeneChip arrays which con-tain 7070 probe sets The scanned images were processed withthe Affymetrix software version 31 The average differencederived by the software was used to indicate a transcript abun-dance and a global normalization procedure was applied Thetechnology does not reliably quantify low expression levels andit is subject to spot saturation at the high end For this particu-lar array the limits of reliable detection were previously set to20 and 16000 (Tadesse Ibrahim and Mutter 2003) We usedthese same thresholds and removed probe sets with at least oneunreliable reading from the analysis This left us with p = 762variables We then log-transformed the expression readings tosatisfy the assumption of normality as suggested by the BoxndashCox transformation We also rescaled each variable by its range

As described in Section 41 we specified the priors with thehyperparameters δ = 3 α = 1 and h1 = h0 = 100 We setκ1 = 001 and κ0 = 01 and assumed unequal covariancesacross clusters We used a truncated Poisson(λ = 5) prior for G

with Gmax = n We took a Bernoulli prior for γ with an expec-tation of 10 variables to be included in the model

To avoid possible dependence of the results on the initialmodel we ran four MCMC chains with widely different startingpoints (1) γj set to 0 for all jrsquos except one randomly selected(2) γj set to 1 for 10 randomly selected jrsquos (3) γj set to 1 for25 randomly selected jrsquos and (4) γj set to 1 for 50 randomlyselected jrsquos For each of the MCMC chains we ran 100000 it-erations with 40000 sweeps taken as a burn-in

We looked at both the allocation of tissues and the selec-tion of genes whose expression best discriminate among thedifferent groups The MCMC sampler mixed steadily overthe iterations As shown in the histogram of Figure 9(a) thesampler visited mostly between two and five components witha stronger support for G = 3 clusters Figure 9(b) shows thetrace plot for the number of included variables for one ofthe chains which visited mostly models with 25ndash35 variablesThe other chains behaved similarly Figure 10 gives the mar-ginal posterior probabilities p(γj = 1|XG = 3) for each ofthe chains Despite the very different starting points the fourchains visited similar regions and exhibit broadly similar mar-ginal plots To assess the concordance of the results across thefour chains we looked at the correlation coefficients betweenthe frequencies for inclusion of genes These are reported in

Tadesse Sha and Vannucci Bayesian Variable Selection 613

(a) (b)

Figure 9 MCMC Output for Endometrial Data (a) Number of visited clusters G (b) number of included variables pγ

Figure 11 along with the corresponding pairwise scatterplotsWe see that there is very good agreement across the differentMCMC runs The marginal probabilities on the other hand arenot as concordant across the chains because their derivationmakes use of model posterior probabilities [see (23)] which areaffected by the inclusion of correlated variables This is a prob-lem inherent to DNA microarray data where multiple geneswith similar expression profiles are often observed

We pooled and relabeled the output of the four chains nor-malized the relative posterior probabilities and recomputedp(γj = 1|XG = 3) These marginal probabilities are displayedin Figure 12 There are 31 genes with posterior probabilitygreater than 5 We also estimated the marginal posterior proba-bilities for the sample allocations p( yi = k|XG) based on the31 selected genes These are shown in Figure 13 We success-fully identified the four normal tissues The results also seem tosuggest that there are possibly two subtypes within the malig-nant tissues

Figure 10 Endometrial Data Marginal posterior probabilities for in-clusion of variables p(γj = 1|X G = 3) for each of the four chains

We also present the results from analyzing this dataset withthe COSA algorithm Figure 14 shows the resulting singlelinkage and average linkage dendrograms along with the top30 variables that distinguish the normal and tumor tissuesThere is some overlap between the genes selected by COSA andthose identified by our procedure Among the sets of 30 ldquobestrdquodiscriminating genes selected by the two methods 10 are com-mon to both

8 DISCUSSION

We have proposed a method for simultaneously clusteringhigh-dimensional data with an unknown number of componentsand selecting the variables that best discriminate the differentgroups We successfully applied the methodology to variousdatasets Our method is fully Bayesian and we provided stan-dard default recommendations for the choice of priors

Here we mention some possible extensions and directionsfor future research We have drawn posterior inference con-ditional on a fixed number of components A more attractive

Figure 11 Endometrial Data Concordance of results among the fourchains Pairwise scatterplots of frequencies for inclusion of genes

614 Journal of the American Statistical Association June 2005

Figure 12 Endometrial Data Union of four chains Marginal posteriorprobabilities p(γj = 1|X G = 3)

approach would incorporate the uncertainty on this parame-ter and combine the results for all different values of G vis-ited by the MCMC sampler To our knowledge this is an open

Figure 13 Endometrial Data Union of four chains Marginal poste-rior probabilities of sample allocations p(yi = k|X G = 3) i = 1 14k = 1 3

problem that requires further research One promising approachinvolves considering the posterior probability of pairs of ob-

(a) (b)

(c) (d)

Figure 14 Analysis of Endometrial Data Using COSA (a) Single linkage (b) average linkage (c) variables leading to cluster 1 (d) variablesleading to cluster 2

Tadesse Sha and Vannucci Bayesian Variable Selection 615

servations being allocated to the same cluster regardless of thenumber of visited components As long as there is no interestin estimating the component parameters the problem of labelswitching would not be a concern But this leads to

(n2

)proba-

bilities and it is not clear how to best summarize them For ex-ample Medvedovic and Sivaganesan (2002) used hierarchicalclustering based on these pairwise probabilities as a similaritymeasure

An interesting future avenue would be to investigate the per-formance of our method when covariance constraints are im-posed as was proposed by Banfield and Raftery (1993) Theirparameterization allows one to incorporate different assump-tions on the volume orientation and shape of the clusters

Another possible extension is to use empirical Bayes esti-mates to elicit the hyperparameters For example conditionalon a fixed number of clusters a complete log-likelihood canbe computed using all covariates and a Monte Carlo EM ap-proach developed for estimation Alternatively if prior infor-mation were available or subjective priors were preferable thenthe prior setting could be modified accordingly If interactionsamong predictors were of interest then additional interactionterms could be included in the model and the prior on γ couldbe modified to accommodate the additional terms

APPENDIX FULL CONDITIONALS

Here we give details on the derivation of the marginalized full con-ditionals under the assumption of equal and unequal covariances acrossclusters

Homogeneous Covariance, $\Sigma_1 = \cdots = \Sigma_G = \Sigma$

\[
\begin{aligned}
f(X, y \mid G, w, \gamma)
&= \int f(X, y \mid G, w, \mu, \Sigma, \eta, \Omega, \gamma)\, f(\mu \mid G, \Sigma, \gamma)\, f(\Sigma \mid G, \gamma)\, f(\eta \mid \Omega, \gamma)\, f(\Omega \mid \gamma)\, d\mu\, d\Sigma\, d\eta\, d\Omega \\
&= \int \prod_{k=1}^{G} w_k^{n_k}\, \bigl|\Sigma_{(\gamma)}\bigr|^{-(n_k+1)/2} (2\pi)^{-(n_k+1)p_\gamma/2} h_1^{-p_\gamma/2} \\
&\quad \times \exp\Biggl\{-\frac{1}{2} \sum_{k=1}^{G} \sum_{x_{i(\gamma)} \in C_k} \bigl(x_{i(\gamma)} - \mu_{k(\gamma)}\bigr)^{\mathsf T} \Sigma_{(\gamma)}^{-1} \bigl(x_{i(\gamma)} - \mu_{k(\gamma)}\bigr)\Biggr\}
 \exp\Biggl\{-\frac{1}{2} \sum_{k=1}^{G} \bigl(\mu_{k(\gamma)} - \mu_{0(\gamma)}\bigr)^{\mathsf T} \bigl(h_1 \Sigma_{(\gamma)}\bigr)^{-1} \bigl(\mu_{k(\gamma)} - \mu_{0(\gamma)}\bigr)\Biggr\} \\
&\quad \times 2^{-p_\gamma(\delta + p_\gamma - 1)/2}\, \pi^{-p_\gamma(p_\gamma - 1)/4} \Biggl(\prod_{j=1}^{p_\gamma} \Gamma\Bigl(\frac{\delta + p_\gamma - j}{2}\Bigr)\Biggr)^{-1}
 \bigl|Q_{1(\gamma)}\bigr|^{(\delta + p_\gamma - 1)/2} \bigl|\Sigma_{(\gamma)}\bigr|^{-(\delta + 2p_\gamma)/2} \exp\Bigl\{-\tfrac{1}{2}\,\mathrm{tr}\bigl(\Sigma_{(\gamma)}^{-1} Q_{1(\gamma)}\bigr)\Bigr\} \\
&\quad \times \bigl|\Omega_{(\gamma^c)}\bigr|^{-(n+1)/2} (2\pi)^{-(n+1)(p - p_\gamma)/2} h_0^{-(p - p_\gamma)/2}
 \exp\Biggl\{-\frac{1}{2} \sum_{i=1}^{n} \bigl(x_{i(\gamma^c)} - \eta_{(\gamma^c)}\bigr)^{\mathsf T} \Omega_{(\gamma^c)}^{-1} \bigl(x_{i(\gamma^c)} - \eta_{(\gamma^c)}\bigr)\Biggr\} \\
&\quad \times \exp\Bigl\{-\tfrac{1}{2} \bigl(\eta_{(\gamma^c)} - \mu_{0(\gamma^c)}\bigr)^{\mathsf T} \bigl(h_0 \Omega_{(\gamma^c)}\bigr)^{-1} \bigl(\eta_{(\gamma^c)} - \mu_{0(\gamma^c)}\bigr)\Bigr\} \\
&\quad \times 2^{-(p - p_\gamma)(\delta + p - p_\gamma - 1)/2}\, \pi^{-(p - p_\gamma)(p - p_\gamma - 1)/4} \Biggl(\prod_{j=1}^{p - p_\gamma} \Gamma\Bigl(\frac{\delta + p - p_\gamma - j}{2}\Bigr)\Biggr)^{-1} \\
&\quad \times \bigl|Q_{0(\gamma^c)}\bigr|^{(\delta + p - p_\gamma - 1)/2} \bigl|\Omega_{(\gamma^c)}\bigr|^{-(\delta + 2(p - p_\gamma))/2} \exp\Bigl\{-\tfrac{1}{2}\,\mathrm{tr}\bigl(\Omega_{(\gamma^c)}^{-1} Q_{0(\gamma^c)}\bigr)\Bigr\}\, d\mu\, d\Sigma\, d\eta\, d\Omega.
\end{aligned}
\]

Completing the square in each $\mu_{k(\gamma)}$ and in $\eta_{(\gamma^c)}$, integrating out these mean parameters, and then integrating $\Sigma_{(\gamma)}$ and $\Omega_{(\gamma^c)}$ against the resulting inverse-Wishart kernels gives

\[
\begin{aligned}
f(X, y \mid G, w, \gamma)
&= \pi^{-np/2} \Biggl[\prod_{k=1}^{G} (h_1 n_k + 1)^{-p_\gamma/2} w_k^{n_k}\Biggr] (h_0 n + 1)^{-(p - p_\gamma)/2} \\
&\quad \times \prod_{j=1}^{p_\gamma} \frac{\Gamma\bigl(\frac{\delta + n + p_\gamma - j}{2}\bigr)}{\Gamma\bigl(\frac{\delta + p_\gamma - j}{2}\bigr)}
 \prod_{j=1}^{p - p_\gamma} \frac{\Gamma\bigl(\frac{\delta + n + p - p_\gamma - j}{2}\bigr)}{\Gamma\bigl(\frac{\delta + p - p_\gamma - j}{2}\bigr)} \\
&\quad \times \bigl|Q_{1(\gamma)}\bigr|^{(\delta + p_\gamma - 1)/2}\, \bigl|Q_{0(\gamma^c)}\bigr|^{(\delta + p - p_\gamma - 1)/2}
 \Biggl|Q_{1(\gamma)} + \sum_{k=1}^{G} S_{k(\gamma)}\Biggr|^{-(\delta + n + p_\gamma - 1)/2}
 \bigl|Q_{0(\gamma^c)} + S_{0(\gamma^c)}\bigr|^{-(\delta + n + p - p_\gamma - 1)/2},
\end{aligned}
\]

where $S_{k(\gamma)}$ and $S_{0(\gamma^c)}$ are given by

\[
\begin{aligned}
S_{k(\gamma)} &= \sum_{x_{i(\gamma)} \in C_k} x_{i(\gamma)} x_{i(\gamma)}^{\mathsf T} + h_1^{-1} \mu_{0(\gamma)} \mu_{0(\gamma)}^{\mathsf T}
 - \bigl(n_k + h_1^{-1}\bigr)^{-1} \Biggl(\sum_{x_{i(\gamma)} \in C_k} x_{i(\gamma)} + h_1^{-1} \mu_{0(\gamma)}\Biggr) \Biggl(\sum_{x_{i(\gamma)} \in C_k} x_{i(\gamma)} + h_1^{-1} \mu_{0(\gamma)}\Biggr)^{\mathsf T} \\
&= \sum_{x_{i(\gamma)} \in C_k} \bigl(x_{i(\gamma)} - \bar{x}_{k(\gamma)}\bigr)\bigl(x_{i(\gamma)} - \bar{x}_{k(\gamma)}\bigr)^{\mathsf T}
 + \frac{n_k}{h_1 n_k + 1} \bigl(\mu_{0(\gamma)} - \bar{x}_{k(\gamma)}\bigr)\bigl(\mu_{0(\gamma)} - \bar{x}_{k(\gamma)}\bigr)^{\mathsf T}
\end{aligned}
\]

and

\[
\begin{aligned}
S_{0(\gamma^c)} &= \sum_{i=1}^{n} x_{i(\gamma^c)} x_{i(\gamma^c)}^{\mathsf T} + h_0^{-1} \mu_{0(\gamma^c)} \mu_{0(\gamma^c)}^{\mathsf T}
 - \bigl(n + h_0^{-1}\bigr)^{-1} \Biggl(\sum_{i=1}^{n} x_{i(\gamma^c)} + h_0^{-1} \mu_{0(\gamma^c)}\Biggr) \Biggl(\sum_{i=1}^{n} x_{i(\gamma^c)} + h_0^{-1} \mu_{0(\gamma^c)}\Biggr)^{\mathsf T} \\
&= \sum_{i=1}^{n} \bigl(x_{i(\gamma^c)} - \bar{x}_{(\gamma^c)}\bigr)\bigl(x_{i(\gamma^c)} - \bar{x}_{(\gamma^c)}\bigr)^{\mathsf T}
 + \frac{n}{h_0 n + 1} \bigl(\mu_{0(\gamma^c)} - \bar{x}_{(\gamma^c)}\bigr)\bigl(\mu_{0(\gamma^c)} - \bar{x}_{(\gamma^c)}\bigr)^{\mathsf T},
\end{aligned}
\]

with $\bar{x}_{k(\gamma)} = \frac{1}{n_k} \sum_{x_{i(\gamma)} \in C_k} x_{i(\gamma)}$ and $\bar{x}_{(\gamma^c)} = \frac{1}{n} \sum_{i=1}^{n} x_{i(\gamma^c)}$.
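To make the closed form concrete, the following sketch (an illustration, not code from the paper) evaluates the log of the marginalized likelihood above for the homogeneous-covariance case. It assumes integer cluster labels 0, ..., G-1 in y, a boolean inclusion vector gamma, and at least one selected and one nonselected variable.

```python
import numpy as np
from scipy.special import gammaln


def log_marginal_homogeneous(X, y, w, gamma, mu0, Q1, Q0, h1, h0, delta):
    """Log of f(X, y | G, w, gamma) with a common covariance across clusters.

    X is n x p, y holds labels 0, ..., G-1, gamma is a length-p boolean
    vector, mu0 is the length-p prior mean, and Q1, Q0 are the
    inverse-Wishart scale matrices for the selected / nonselected
    coordinates (argument names are illustrative)."""
    n, p = X.shape
    sel = np.asarray(gamma, dtype=bool)
    pg = int(sel.sum())

    def scatter(Z, prior_mean, h):
        # S = sum_i (z_i - zbar)(z_i - zbar)^T
        #     + m / (h m + 1) * (prior_mean - zbar)(prior_mean - zbar)^T
        m = Z.shape[0]
        zbar = Z.mean(axis=0)
        R = Z - zbar
        d = (prior_mean - zbar)[:, None]
        return R.T @ R + (m / (h * m + 1.0)) * (d @ d.T)

    logdet = lambda A: np.linalg.slogdet(A)[1]

    out = -0.5 * n * p * np.log(np.pi)
    S_sum = np.zeros((pg, pg))
    for k in np.unique(y):
        Zk = X[y == k][:, sel]
        nk = Zk.shape[0]
        out += nk * np.log(w[k]) - 0.5 * pg * np.log(h1 * nk + 1.0)
        S_sum += scatter(Zk, mu0[sel], h1)
    S0 = scatter(X[:, ~sel], mu0[~sel], h0)
    out -= 0.5 * (p - pg) * np.log(h0 * n + 1.0)

    # Ratios of multivariate gamma terms from integrating out the covariances.
    j1 = np.arange(1, pg + 1)
    j0 = np.arange(1, p - pg + 1)
    out += np.sum(gammaln((delta + n + pg - j1) / 2.0)
                  - gammaln((delta + pg - j1) / 2.0))
    out += np.sum(gammaln((delta + n + p - pg - j0) / 2.0)
                  - gammaln((delta + p - pg - j0) / 2.0))

    # Determinant terms for the selected and nonselected coordinates.
    out += 0.5 * (delta + pg - 1) * logdet(Q1)
    out -= 0.5 * (delta + n + pg - 1) * logdet(Q1 + S_sum)
    out += 0.5 * (delta + p - pg - 1) * logdet(Q0)
    out -= 0.5 * (delta + n + p - pg - 1) * logdet(Q0 + S0)
    return out
```

A function of this kind is all that an implementation needs for the Metropolis-type updates of γ and of the sample allocations, since the cluster means and covariances are integrated out.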

Heterogeneous Covariances

\[
f(X, y \mid G, w, \gamma)
= \int f(X, y \mid G, w, \mu, \Sigma, \eta, \Omega, \gamma)\, f(\mu \mid G, \Sigma, \gamma)\, f(\Sigma \mid G, \gamma)\, f(\eta \mid \Omega, \gamma)\, f(\Omega \mid \gamma)\, d\mu\, d\Sigma\, d\eta\, d\Omega.
\]

The factor corresponding to the nondiscriminating variables $(\eta_{(\gamma^c)}, \Omega_{(\gamma^c)})$ is identical to the homogeneous case and integrates to
\[
\pi^{-n(p - p_\gamma)/2} (h_0 n + 1)^{-(p - p_\gamma)/2}
 \prod_{j=1}^{p - p_\gamma} \frac{\Gamma\bigl(\frac{\delta + n + p - p_\gamma - j}{2}\bigr)}{\Gamma\bigl(\frac{\delta + p - p_\gamma - j}{2}\bigr)}
 \bigl|Q_{0(\gamma^c)}\bigr|^{(\delta + p - p_\gamma - 1)/2} \bigl|Q_{0(\gamma^c)} + S_{0(\gamma^c)}\bigr|^{-(\delta + n + p - p_\gamma - 1)/2}.
\]
For the discriminating variables, completing the square in each $\mu_{k(\gamma)}$, integrating it out, and then integrating each $\Sigma_{k(\gamma)}$ against the resulting inverse-Wishart kernel
$\bigl|\Sigma_{k(\gamma)}\bigr|^{-(\delta + n_k + 2p_\gamma)/2} \exp\bigl\{-\tfrac{1}{2}\,\mathrm{tr}\bigl(\Sigma_{k(\gamma)}^{-1} [Q_{1(\gamma)} + S_{k(\gamma)}]\bigr)\bigr\}$
yields a product of cluster-specific terms, so that

\[
\begin{aligned}
f(X, y \mid G, w, \gamma)
&= \pi^{-np/2} \prod_{k=1}^{G} \Biggl[ (h_1 n_k + 1)^{-p_\gamma/2} w_k^{n_k}
 \prod_{j=1}^{p_\gamma} \frac{\Gamma\bigl(\frac{\delta + n_k + p_\gamma - j}{2}\bigr)}{\Gamma\bigl(\frac{\delta + p_\gamma - j}{2}\bigr)}
 \bigl|Q_{1(\gamma)}\bigr|^{(\delta + p_\gamma - 1)/2} \bigl|Q_{1(\gamma)} + S_{k(\gamma)}\bigr|^{-(\delta + n_k + p_\gamma - 1)/2} \Biggr] \\
&\quad \times (h_0 n + 1)^{-(p - p_\gamma)/2}
 \prod_{j=1}^{p - p_\gamma} \frac{\Gamma\bigl(\frac{\delta + n + p - p_\gamma - j}{2}\bigr)}{\Gamma\bigl(\frac{\delta + p - p_\gamma - j}{2}\bigr)}
 \bigl|Q_{0(\gamma^c)}\bigr|^{(\delta + p - p_\gamma - 1)/2} \bigl|Q_{0(\gamma^c)} + S_{0(\gamma^c)}\bigr|^{-(\delta + n + p - p_\gamma - 1)/2},
\end{aligned}
\]

with $S_{k(\gamma)}$ and $S_{0(\gamma^c)}$ as defined above.
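Under unequal covariances, the only change relative to the homogeneous case is that the discriminating variables contribute one factor per cluster. A small sketch of that per-cluster term (again an illustration with hypothetical argument names, in the same style as the sketch above) is:

```python
import numpy as np
from scipy.special import gammaln


def log_cluster_factor(Zk, wk, mu0_sel, Q1, h1, delta):
    """Log of the kth cluster's factor in the unequal-covariance form:
    w_k^{n_k} (h1 n_k + 1)^{-p_g/2}
      * prod_j Gamma((delta + n_k + p_g - j)/2) / Gamma((delta + p_g - j)/2)
      * |Q1|^{(delta + p_g - 1)/2} |Q1 + S_k|^{-(delta + n_k + p_g - 1)/2}.
    Zk holds the selected variables for the observations in cluster k."""
    nk, pg = Zk.shape
    zbar = Zk.mean(axis=0)
    R = Zk - zbar
    d = (mu0_sel - zbar)[:, None]
    Sk = R.T @ R + (nk / (h1 * nk + 1.0)) * (d @ d.T)
    j = np.arange(1, pg + 1)
    out = nk * np.log(wk) - 0.5 * pg * np.log(h1 * nk + 1.0)
    out += np.sum(gammaln((delta + nk + pg - j) / 2.0)
                  - gammaln((delta + pg - j) / 2.0))
    out += 0.5 * (delta + pg - 1) * np.linalg.slogdet(Q1)[1]
    out -= 0.5 * (delta + nk + pg - 1) * np.linalg.slogdet(Q1 + Sk)[1]
    return out
```

Summing these terms over k, adding the nondiscriminating-variable factor (identical to the homogeneous case), and including the overall constant $-(np/2)\log\pi$ gives the log of the closed form above.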

[Received May 2003. Revised August 2004.]

REFERENCES

Anderson, E. (1935), "The Irises of the Gaspé Peninsula," Bulletin of the American Iris Society, 59, 2–5.
Banfield, J. D., and Raftery, A. E. (1993), "Model-Based Gaussian and Non-Gaussian Clustering," Biometrics, 49, 803–821.
Brown, P. J. (1993), Measurement, Regression and Calibration, Oxford, U.K.: Clarendon Press.
Brown, P. J., Vannucci, M., and Fearn, T. (1998a), "Bayesian Wavelength Selection in Multicomponent Analysis," Chemometrics, 12, 173–182.
Brown, P. J., Vannucci, M., and Fearn, T. (1998b), "Multivariate Bayesian Variable Selection and Prediction," Journal of the Royal Statistical Society, Ser. B, 60, 627–641.
Brusco, M. J., and Cradit, J. D. (2001), "A Variable Selection Heuristic for k-Means Clustering," Psychometrika, 66, 249–270.
Chang, W.-C. (1983), "On Using Principal Components Before Separating a Mixture of Two Multivariate Normal Distributions," Applied Statistics, 32, 267–275.
Diebolt, J., and Robert, C. P. (1994), "Estimation of Finite Mixture Distributions Through Bayesian Sampling," Journal of the Royal Statistical Society, Ser. B, 56, 363–375.
Fowlkes, E. B., Gnanadesikan, R., and Kettering, J. R. (1988), "Variable Selection in Clustering," Journal of Classification, 5, 205–228.
Friedman, J. H., and Meulman, J. J. (2003), "Clustering Objects on Subsets of Attributes," technical report, Stanford University, Dept. of Statistics and Stanford Linear Accelerator Center.
George, E. I., and McCulloch, R. E. (1997), "Approaches for Bayesian Variable Selection," Statistica Sinica, 7, 339–373.
Ghosh, D., and Chinnaiyan, A. M. (2002), "Mixture Modelling of Gene Expression Data From Microarray Experiments," Bioinformatics, 18, 275–286.
Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. (eds.) (1996), Markov Chain Monte Carlo in Practice, London: Chapman & Hall.
Gnanadesikan, R., Kettering, J. R., and Tao, S. L. (1995), "Weighting and Selection of Variables for Cluster Analysis," Journal of Classification, 12, 113–136.
Green, P. J. (1995), "Reversible-Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination," Biometrika, 82, 711–732.
Madigan, D., and York, J. (1995), "Bayesian Graphical Models for Discrete Data," International Statistical Review, 63, 215–232.
McLachlan, G. J., Bean, R. W., and Peel, D. (2002), "A Mixture Model-Based Approach to the Clustering of Microarray Expression Data," Bioinformatics, 18, 413–422.
Medvedovic, M., and Sivaganesan, S. (2002), "Bayesian Infinite Mixture Model-Based Clustering of Gene Expression Profiles," Bioinformatics, 18, 1194–1206.
Milligan, G. W. (1989), "A Validation Study of a Variable Weighting Algorithm for Cluster Analysis," Journal of Classification, 6, 53–71.
Murtagh, F., and Hernández-Pajares, M. (1995), "The Kohonen Self-Organizing Map Method: An Assessment," Journal of Classification, 12, 165–190.
Richardson, S., and Green, P. J. (1997), "On Bayesian Analysis of Mixtures With an Unknown Number of Components" (with discussion), Journal of the Royal Statistical Society, Ser. B, 59, 731–792.
Sha, N., Vannucci, M., Tadesse, M. G., Brown, P. J., Dragoni, I., Davies, N., Roberts, T., Contestabile, A., Salmon, M., Buckley, C., and Falciani, F. (2004), "Bayesian Variable Selection in Multinomial Probit Models to Identify Molecular Signatures of Disease Stage," Biometrics, 60, 812–819.
Stephens, M. (2000a), "Bayesian Analysis of Mixture Models With an Unknown Number of Components–An Alternative to Reversible-Jump Methods," The Annals of Statistics, 28, 40–74.
Stephens, M. (2000b), "Dealing With Label Switching in Mixture Models," Journal of the Royal Statistical Society, Ser. B, 62, 795–809.
Tadesse, M. G., Ibrahim, J. G., and Mutter, G. L. (2003), "Identification of Differentially Expressed Genes in High-Density Oligonucleotide Arrays Accounting for the Quantification Limits of the Technology," Biometrics, 59, 542–554.


Figure 10 Endometrial Data Marginal posterior probabilities for in-clusion of variables p(γj = 1|X G = 3) for each of the four chains

We also present the results from analyzing this dataset withthe COSA algorithm Figure 14 shows the resulting singlelinkage and average linkage dendrograms along with the top30 variables that distinguish the normal and tumor tissuesThere is some overlap between the genes selected by COSA andthose identified by our procedure Among the sets of 30 ldquobestrdquodiscriminating genes selected by the two methods 10 are com-mon to both

8 DISCUSSION

We have proposed a method for simultaneously clusteringhigh-dimensional data with an unknown number of componentsand selecting the variables that best discriminate the differentgroups We successfully applied the methodology to variousdatasets Our method is fully Bayesian and we provided stan-dard default recommendations for the choice of priors

Here we mention some possible extensions and directionsfor future research We have drawn posterior inference con-ditional on a fixed number of components A more attractive

Figure 11 Endometrial Data Concordance of results among the fourchains Pairwise scatterplots of frequencies for inclusion of genes

614 Journal of the American Statistical Association June 2005

Figure 12 Endometrial Data Union of four chains Marginal posteriorprobabilities p(γj = 1|X G = 3)

approach would incorporate the uncertainty on this parame-ter and combine the results for all different values of G vis-ited by the MCMC sampler To our knowledge this is an open

Figure 13 Endometrial Data Union of four chains Marginal poste-rior probabilities of sample allocations p(yi = k|X G = 3) i = 1 14k = 1 3

problem that requires further research One promising approachinvolves considering the posterior probability of pairs of ob-

(a) (b)

(c) (d)

Figure 14 Analysis of Endometrial Data Using COSA (a) Single linkage (b) average linkage (c) variables leading to cluster 1 (d) variablesleading to cluster 2

Tadesse Sha and Vannucci Bayesian Variable Selection 615

servations being allocated to the same cluster regardless of thenumber of visited components As long as there is no interestin estimating the component parameters the problem of labelswitching would not be a concern But this leads to

(n2

)proba-

bilities and it is not clear how to best summarize them For ex-ample Medvedovic and Sivaganesan (2002) used hierarchicalclustering based on these pairwise probabilities as a similaritymeasure

An interesting future avenue would be to investigate the per-formance of our method when covariance constraints are im-posed as was proposed by Banfield and Raftery (1993) Theirparameterization allows one to incorporate different assump-tions on the volume orientation and shape of the clusters

Another possible extension is to use empirical Bayes esti-mates to elicit the hyperparameters For example conditionalon a fixed number of clusters a complete log-likelihood canbe computed using all covariates and a Monte Carlo EM ap-proach developed for estimation Alternatively if prior infor-mation were available or subjective priors were preferable thenthe prior setting could be modified accordingly If interactionsamong predictors were of interest then additional interactionterms could be included in the model and the prior on γ couldbe modified to accommodate the additional terms

APPENDIX FULL CONDITIONALS

Here we give details on the derivation of the marginalized full con-ditionals under the assumption of equal and unequal covariances acrossclusters

Homogeneous Covariance 1 = middot middot middot = G =

f (Xy|Gwγ )

=int

f (Xy|Gwmicroηγ )f (micro|Gγ )f (|Gγ )

times f (η|γ )f (|γ )dmicrod dη d

=int Gprod

k=1

wnk

k

∣∣(γ )

∣∣minus(nk+1)2(2π)minus(nk+1)pγ 2h

minuspγ 21

times exp

minus1

2

Gsum

k=1

sum

xi(γ )isinCk

(xi(γ ) minus microk(γ )

)Tminus1

(γ )

(xi(γ ) minus microk(γ )

)

times exp

minus1

2

Gsum

k=1

(microk(γ ) minus micro0(γ )

)T(h1(γ )

)minus1(microk(γ ) minus micro0(γ )

)

times 2minuspγ (δ+pγ minus1)2πminuspγ (pγ minus1)4

( pγprod

j=1

(δ + pγ minus j

2

))minus1

times ∣∣Q1(γ )

∣∣(δ+pγ minus1)2∣∣(γ )

∣∣minus(δ+2pγ )2

times exp

minus1

2tr(minus1

(γ )Q1(γ )

)

times ∣∣(γ c)

∣∣minus(n+1)2(2π)minus(n+1)(pminuspγ )2h

minus(pminuspγ )20

times exp

minus1

2

nsum

i=1

(xi(γ c) minus η(γ c)

)Tminus1

(γ c)

(xi(γ c) minus η(γ c)

)

times exp

minus1

2

(η(γ c) minus micro0(γ c)

)T(h0(γ c)

)minus1(η(γ c) minus micro0(γ c)

)

times 2minus(pminuspγ )(δ+pminuspγ minus1)2πminus(pminuspγ )(pminuspγ minus1)4

times( pminuspγprod

j=1

(δ + p minus pγ minus j

2

))minus1

times ∣∣Q0(γ c)

∣∣(δ+pminuspγ minus1)2∣∣(γ c)

∣∣minus(δ+2(pminuspγ ))2

times exp

minus1

2tr(minus1

(γ c)Q0(γ c)

)

dmicrod dη d

=int Gprod

k=1

(2π)minus(nk+1)pγ 2h

minuspγ 21 wnk

k

∣∣(γ )

∣∣minus(nk+1)2

times exp

minus1

2tr(minus1

(γ )Sk(γ )

)

times exp

minus1

2

Gsum

k=1

(microk(γ ) minus

sumxi(γ )isinCk

xi(γ ) + hminus11 micro0(γ )

nk + hminus11

)T

times (nk + hminus11 )minus1

(γ )

(microk(γ ) minus

sumxi(γ )isinCk

xi(γ ) + hminus11 micro0(γ )

nk + hminus11

)

times 2minuspγ (δ+pγ minus1)2πminuspγ (pγ minus1)4

( pγprod

j=1

(δ + pγ minus j

2

))minus1

times ∣∣Q1(γ )

∣∣(δ+pγ minus1)2∣∣(γ )

∣∣minus(δ+2pγ )2

times exp

minus1

2tr(minus1

(γ )Q1(γ )

)

times (2π)minus(n+1)(pminuspγ )2hminus(pminuspγ )20

∣∣(γ c)

∣∣minus(n+1)2

times exp

minus1

2tr(minus1

(γ c)S0(γ c)

)

times exp

minus1

2

(η(γ c) minus

sumni=1 xi(γ c) + hminus1

0 micro0(γ c)

n + hminus10

)T

times (n + hminus10 )minus1

(γ c)

(η(γ c) minus

sumni=1 xi(γ c) + hminus1

0 micro0(γ c)

n + hminus10

)

times 2minus(pminuspγ )(δ+pminuspγ minus1)2πminus(pminuspγ )(pminuspγ minus1)4

times( pminuspγprod

j=1

(δ + p minus pγ minus j

2

))minus1

times ∣∣Q0(γ c)

∣∣(δ+pminuspγ minus1)2∣∣(γ c)

∣∣minus(δ+2(pminuspγ ))2

times exp

minus1

2tr(minus1

0(γ c)Q0(γ c)

)dmicrod dη d

=[ Gprod

k=1

(h1nk + 1)minuspγ 2wnkk

]πminusnpγ 2

times ∣∣Q1(γ )

∣∣(δ+pγ minus1)22minuspγ (δ+n+pγ minus1)2πminuspγ (pγ minus1)4

times( pγprod

j=1

(δ + pγ minus j

2

))minus1

timesint

∣∣(γ )

∣∣minus(δ+n+2pγ )2

times exp

minus1

2tr

(minus1

(γ )

[Q1(γ ) +

Gsum

k=1

Sk(γ )

])d

616 Journal of the American Statistical Association June 2005

times (h0n + 1)minus(pminuspγ )2πminusn(pminuspγ )2

times ∣∣Q0(γ c)

∣∣(δ+pminuspγ minus1)2

times 2minus(pminuspγ )(δ+n+pminuspγ minus1)2πminus(pminuspγ )(pminuspγ minus1)4

times( pminuspγprod

j=1

(δ + p minus pγ minus j

2

))minus1

timesint

∣∣(γ c)

∣∣minus(δ+n+2(pminuspγ ))2

times exp

minus1

2tr(minus1

(γ c)

[Q0(γ ) + S0(γ c)

])d13

= πminusnp2

[ Gprod

k=1

(h1nk + 1)minuspγ 2wnkk

](h0n + 1)minus(pminuspγ )2

timespγprod

j=1

(

(δ + n + pγ minus j

2

)

(δ + pγ minus j

2

))

timespminuspγprod

j=1

(

(δ + n + p minus pγ minus j

2

)

(δ + p minus pγ minus j

2

))

times ∣∣Q1(γ )

∣∣(δ+pγ minus1)2∣∣Q0(γ c)

∣∣(δ+pminuspγ minus1)2

times∣∣∣∣∣Q1(γ ) +

Gsum

k=1

Sk(γ )

∣∣∣∣∣

minus(δ+n+pγ minus1)2

times ∣∣Q0(γ c) + S0(γ c)

∣∣minus(δ+n+pminuspγ minus1)2

where Sk(γ ) and S0(γ ) are given by

Sk(γ ) =sum

xi(γ )isinCk

xi(γ )xTi(γ ) + hminus1

1 micro0(γ )microT0(γ )

minus (nk + hminus11 )minus1

( sum

xi(γ )isinCk

xi(γ ) + hminus11 micro0(γ )

)

times( sum

xi(γ )isinCk

xi(γ ) + hminus11 micro0(γ )

)T

=sum

xi(γ )isinCk

(xi(γ ) minus xk(γ )

)(xi(γ ) minus xk(γ )

)T

+ nk

h1nk + 1

(micro0(γ ) minus xk(γ )

)(micro0(γ ) minus xk(γ )

)T

and

S0(γ ) =nsum

i=1

xi(γ c)xTi(γ c) + hminus1

0 micro0(γ c)microT0(γ c)

minus (n + hminus10 )minus1

( nsum

i=1

xi(γ c) + hminus10 micro0(γ c)

)

times( nsum

i=1

xi(γ c) + hminus10 micro0(γ c)

)T

=nsum

i=1

(xi(γ c) minus x(γ c)

)(xi(γ c) minus x(γ c)

)T

+ n

h0n + 1

(micro0(γ c) minus x(γ c)

)(micro0(γ c) minus x(γ c)

)T

with xk(γ ) = 1nk

sumxi(γ )isinCk

xi(γ ) and x(γ c) = 1n

sumni=1 xi(γ c)

Heterogeneous Covariances

f (Xy|Gwγ )

=int

f (Xy|Gwmicroηγ )f (micro|Gγ )f (|Gγ )

times f (η|γ )f (|γ )dmicrod dη d

=int Gprod

k=1

wnk

k

∣∣k(γ )

∣∣minus(nk+1)2(2π)minus(nk+1)pγ 2h

minuspγ 21

times 2minuspγ (δ+pγ minus1)2πminuspγ (pγ minus1)4

( pγprod

j=1

(δ + pγ minus j

2

))minus1

times exp

minus1

2

Gsum

k=1

sum

xi(γ )isinCk

(xi(γ ) minus microk

)Tminus1

k(γ )

(xi(γ ) minus microk

)

times exp

minus1

2

Gsum

k=1

(microk(γ ) minus micro0(γ )

)T(h1k(γ )

)minus1

times (microk(γ ) minus micro0(γ )

)

timesGprod

k=1

∣∣Q1(γ )

∣∣(δ+pγ minus1)2∣∣k(γ )

∣∣minus(δ+2pγ )2

times exp

minus1

2tr(minus1

k(γ )Q1(γ )

)dmicrod

times πminusn(pminuspγ )2(h0n + 1)minus(pminuspγ )2

timespminuspγprod

j=1

(

(δ + n + p minus pγ minus j

2

)

(δ + p minus pγ minus j

2

))

times ∣∣Q0(γ c)

∣∣(δ+pminuspγ minus1)2∣∣Q0(γ c) + S0(γ c)

∣∣minus(δ+n+pminuspγ minus1)2

=int

int

micro

Gprod

k=1

(2π)minus(nk+1)pγ 2h

minuspγ 21 wnk

k 2minuspγ (δ+pγ minus1)2

times πminuspγ (pγ minus1)4

( pγprod

j=1

(δ + pγ minus j

2

))minus1

times ∣∣k(γ )

∣∣minus(nk+1)2

times exp

minus1

2

Gsum

k=1

(microk(γ ) minus

sumxi(γ )isinCk

xi(γ ) + hminus11 micro0(γ )

nk + hminus11

)T

times (nk + hminus11 )minus1

k(γ )

(microk(γ ) minus

sumxi(γ )isinCk

xi(γ ) + hminus11 micro0(γ )

nk + hminus11

)

timesGprod

k=1

∣∣Q1(γ )

∣∣(δ+pγ minus1)2∣∣k(γ )

∣∣minus(δ+2pγ )2

times exp

minus1

2tr(minus1

k(γ )

[Q1(γ ) + Sk(γ )

])dmicrod

times πminusn(pminuspγ )2(h0n + 1)minus(pminuspγ )2

Tadesse Sha and Vannucci Bayesian Variable Selection 617

timespminuspγprod

j=1

(

(δ + n + p minus pγ minus j

2

)

(δ + p minus pγ minus j

2

))

times ∣∣Q0(γ c)

∣∣(δ+pminuspγ minus1)2∣∣Q0(γ c) + S0(γ c)

∣∣minus(δ+n+pminuspγ minus1)2

= πminusnp2Gprod

k=1

(hnk + 1)minuspγ 2wnk

k

∣∣Q(γ )

∣∣(δ+pγ minus1)2

times 2minuspγ (δ+nk+pγ minus1)2πminuspγ (pγ minus1)4

times( pγprod

j=1

(δ + pγ minus j

2

))minus1

timesint

Gprod

k=1

∣∣k(γ )

∣∣minus(δ+nk+2pγ )2

times exp

minus1

2tr(minus1

k(γ )

[Q(γ ) + Sk(γ )

])d

times (h0n + 1)minus(pminuspγ )2

timespminuspγprod

j=1

(

(δ + n + p minus pγ minus j

2

)

(δ + p minus pγ minus j

2

))

times ∣∣Q0(γ c)

∣∣(δ+pminuspγ minus1)2∣∣Q0(γ c) + S0(γ c)

∣∣minus(δ+n+pminuspγ minus1)2

= πminusnp2Gprod

k=1

(hnk + 1)minuspγ 2wnk

k

timespγprod

j=1

(

(δ + nk + pγ minus 1

2

)(

(δ + pγ minus 1

2

)))

timesGprod

k=1

∣∣Q(γ )

∣∣(δ+pγ minus1)2∣∣Q(γ ) + Sk(γ )

∣∣minus(δ+nk+pγ minus1)2

times (h0n + 1)minus(pminuspγ )2

timespminuspγprod

j=1

(

(δ + n + p minus pγ minus j

2

)(

(δ + p minus pγ minus j

2

)))

times ∣∣Q0(γ c)

∣∣(δ+pminuspγ minus1)2∣∣Q0(γ c) + S0(γ c)

∣∣minus(δ+n+pminuspγ minus1)2

[Received May 2003 Revised August 2004]

REFERENCES

Anderson E (1935) ldquoThe Irises of the Gaspe Peninsulardquo Bulletin of the Amer-ican Iris Society 59 2ndash5

Banfield J D and Raftery A E (1993) ldquoModel-Based Gaussian and Non-Gaussian Clusteringrdquo Biometrics 49 803ndash821

Brown P J (1993) Measurement Regression and Calibration Oxford UKClarendon Press

Brown P J Vannucci M and Fearn T (1998a) ldquoBayesian Wavelength Se-lection in Multicomponent Analysisrdquo Chemometrics 12 173ndash182

(1998b) ldquoMultivariate Bayesian Variable Selection and PredictionrdquoJournal of the Royal Statistical Society Ser B 60 627ndash641

Brusco M J and Cradit J D (2001) ldquoA Variable Selection Heuristic fork-Means Clusteringrdquo Psychometrika 66 249ndash270

Chang W-C (1983) ldquoOn Using Principal Components Before Separatinga Mixture of Two Multivariate Normal Distributionsrdquo Applied Statistics 32267ndash275

Diebolt J and Robert C P (1994) ldquoEstimation of Finite Mixture Distribu-tions Through Bayesian Samplingrdquo Journal of the Royal Statistical SocietySer B 56 363ndash375

Fowlkes E B Gnanadesikan R and Kettering J R (1988) ldquoVariable Selec-tion in Clusteringrdquo Journal of Classification 5 205ndash228

Friedman J H and Meulman J J (2003) ldquoClustering Objects on Subsetsof Attributesrdquo technical report Stanford University Dept of Statistics andStanford Linear Accelerator Center

George E I and McCulloch R E (1997) ldquoApproaches for Bayesian VariableSelectionrdquo Statistica Sinica 7 339ndash373

Ghosh D and Chinnaiyan A M (2002) ldquoMixture Modelling of Gene Ex-pression Data From Microarray Experimentsrdquo Bioinformatics 18 275ndash286

Gilks W R Richardson S and Spiegelhalter D J (eds) (1996) MarkovChain Monte Carlo in Practice London Chapman amp Hall

Gnanadesikan R Kettering J R and Tao S L (1995) ldquoWeighting andSelection of Variables for Cluster Analysisrdquo Journal of Classification 12113ndash136

Green P J (1995) ldquoReversible-Jump Markov Chain Monte Carlo Computationand Bayesian Model Determinationrdquo Biometrika 82 711ndash732

Madigan D and York J (1995) ldquoBayesian Graphical Models for DiscreteDatardquo International Statistical Review 63 215ndash232

McLachlan G J Bean R W and Peel D (2002) ldquoA Mixture Model-BasedApproach to the Clustering of Microarray Expression Datardquo Bioinformatics18 413ndash422

Medvedovic M and Sivaganesan S (2002) ldquoBayesian Infinite MixtureModel-Based Clustering of Gene Expression Profilesrdquo Bioinformatics 181194ndash1206

Milligan G W (1989) ldquoA Validation Study of a Variable Weighting Algorithmfor Cluster Analysisrdquo Journal of Classification 6 53ndash71

Murtagh F and Hernaacutendez-Pajares M (1995) ldquoThe Kohonen Self-Organizing Map Method An Assessmentrdquo Journal of Classification 12165ndash190

Richardson S and Green P J (1997) ldquoOn Bayesian Analysis of MixturesWith an Unknown Number of Componentsrdquo (with discussion) Journal of theRoyal Statistical Society Ser B 59 731ndash792

Sha N Vannucci M Tadesse M G Brown P J Dragoni I Davies NRoberts T Contestabile A Salmon M Buckley C and Falciani F(2004) ldquoBayesian Variable Selection in Multinomial Probit Models to Iden-tify Molecular Signatures of Disease Stagerdquo Biometrics 60 812ndash819

Stephens M (2000a) ldquoBayesian Analysis of Mixture Models With an Un-known Number of ComponentsmdashAn Alternative to Reversible-Jump Meth-odsrdquo The Annals of Statistics 28 40ndash74

(2000b) ldquoDealing With Label Switching in Mixture Modelsrdquo Journalof the Royal Statistical Society Ser B 62 795ndash809

Tadesse M G Ibrahim J G and Mutter G L (2003) ldquoIdentification ofDifferentially Expressed Genes in High-Density Oligonucleotide Arrays Ac-counting for the Quantification Limits of the Technologyrdquo Biometrics 59542ndash554


In the reverse direction, the proposal is deterministic. The move is then accepted with probability
\[
\min\Biggl\{1,\ \frac{f(\psi' \mid X)\, r_m(\psi')}{f(\psi \mid X)\, r_m(\psi)\, q(u)}\ \Biggl|\frac{\partial \psi'}{\partial(\psi, u)}\Biggr|\Biggr\}, \qquad (14)
\]
where r_m(ψ) is the probability of choosing move type m when in state ψ, and q(u) is the density function of u. The final term in the foregoing ratio is a Jacobian arising from the change of variable from (ψ, u) to ψ'.
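For concreteness, the acceptance test in (14) can be carried out on the log scale. The sketch below is illustrative only; the function and argument names are ours, and the model-specific quantities (log target densities, move probabilities, proposal density, Jacobian) are assumed to be computed elsewhere.

```python
import numpy as np

def rjmcmc_accept(log_f_new, log_f_old, log_rm_new, log_rm_old,
                  log_q_u, log_jacobian, rng):
    """Metropolis-Hastings-Green test for a dimension-changing move, i.e.,
    accept with probability min{1, [f(psi'|X) r_m(psi')] /
    [f(psi|X) r_m(psi) q(u)] * |d psi' / d(psi, u)|}, evaluated on the log scale."""
    log_A = (log_f_new + log_rm_new) - (log_f_old + log_rm_old + log_q_u) + log_jacobian
    return np.log(rng.uniform()) < min(0.0, log_A)
```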

5.3.1 Split and Merge Moves. In step 4 we make a random choice between attempting to split a nonempty cluster or combine two clusters, with probabilities b_k = .5 and d_k = 1 − b_k (k = 2, ..., G_max − 1, and b_{G_max} = 0).

The split move begins by randomly selecting a nonempty cluster l and dividing it into two components, l_1 and l_2, with weights w_{l_1} = w_l u and w_{l_2} = w_l(1 − u), where u is a beta(2, 2) random variable. The parameters ψ = (G, w_old, γ, y_old) are updated to ψ' = (G + 1, w_new, γ, y_new). The acceptance probability for this move is given by min(1, A), where
\[
\begin{aligned}
A ={}& \frac{f(G+1, w_{\text{new}}, \gamma, y_{\text{new}} \mid X)}{f(G, w_{\text{old}}, \gamma, y_{\text{old}} \mid X)} \times \frac{q(\psi \mid \psi')}{q(\psi' \mid \psi)\, f(u)} \times \Biggl|\frac{\partial(w_{l_1}, w_{l_2})}{\partial(w_l, u)}\Biggr| \\
={}& \frac{f(X, y_{\text{new}} \mid w_{\text{new}}, \gamma, G+1)\, f(w_{\text{new}} \mid G+1)\, f(G+1)}{f(X, y_{\text{old}} \mid w_{\text{old}}, \gamma, G)\, f(w_{\text{old}} \mid G)\, f(G)} \times \frac{d_{G+1}\, P_{\text{chos}}}{b_G\, G_1^{-1}\, P_{\text{alloc}}\, f(u)} \times w_l, \qquad (15)
\end{aligned}
\]
where
\[
\frac{f(w_{\text{new}} \mid G+1)}{f(w_{\text{old}} \mid G)} = \frac{w_{l_1}^{\alpha-1}\, w_{l_2}^{\alpha-1}}{w_l^{\alpha-1}\, B(\alpha, G\alpha)}
\quad\text{and}\quad
\frac{f(G+1)}{f(G)} =
\begin{cases}
1 & \text{if } G \sim \text{uniform}[1, \ldots, G_{\max}],\\
\lambda/(G+1) & \text{if } G \sim \text{truncated Poisson}(\lambda).
\end{cases}
\]
Here G_1 is the number of nonempty components before the split, and P_alloc is the probability that the particular allocation is made; it is set to u^{n_{l_1}}(1 − u)^{n_{l_2}}, where n_{l_1} and n_{l_2} are the numbers of observations assigned to l_1 and l_2. Note that Richardson and Green (1997) calculated P_alloc using the full conditionals of the allocation variables, P(y_i = k | rest). In our case, because the component parameters μ and Σ are integrated out, the allocation variables are no longer independent, and calculating these full conditionals is computationally prohibitive. The random allocation approach that we took is substantially faster, even with the longer chains that need to be run to compensate for the low acceptance rate of proposed moves. P_chos is the probability of selecting the two components for the reverse move. Obviously, we want to combine clusters that are closest to one another. However, choosing an optimal similarity metric in the multivariate setting is not straightforward. We define closeness in terms of the Euclidean distance between cluster means and consider three possible ways of splitting a cluster:

a. The split results in two mutually adjacent components ⇒ P_chos = (G*_1 / G*) × (2 / G*_1).

b. The split results in components that are not mutually adjacent; that is, l_1 has l_2 as its closest component, but there is another cluster that is closer to l_2 than l_1 ⇒ P_chos = (G*_1 / G*) × (1 / G*_1).

c. One of the resulting components is empty ⇒ P_chos = (G*_0 / G*) × (1 / G*_0) × (1 / G*_1).

Here G* is the total number of clusters after the split, and G*_0 and G*_1 are the numbers of empty and nonempty components after the split. Thus splits that result in mutually close components are assigned a larger probability, and those that yield empty clusters have lower probability.
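A minimal sketch of the split proposal described above, assuming cluster labels 0, ..., G − 1 that index the weight vector; the helper returns the pieces (P_alloc, f(u), Jacobian) that enter (15). Names are illustrative, not from the paper.

```python
import numpy as np
from scipy.stats import beta

def propose_split(w, y, rng):
    """Pick a nonempty cluster l, split its weight as (w_l u, w_l (1 - u)) with
    u ~ Beta(2, 2), and reallocate its members at random (prob. u to l1, 1 - u to l2)."""
    l = rng.choice(np.unique(y))              # nonempty cluster to split
    u = rng.beta(2.0, 2.0)
    members = np.where(y == l)[0]
    to_l2 = rng.random(members.size) > u      # each member goes to l2 with prob. 1 - u
    n_l1, n_l2 = int(np.sum(~to_l2)), int(np.sum(to_l2))
    w_new = np.append(np.delete(w, l), [w[l] * u, w[l] * (1.0 - u)])
    log_P_alloc = n_l1 * np.log(u) + n_l2 * np.log(1.0 - u)
    log_f_u = beta.logpdf(u, 2.0, 2.0)
    log_jacobian = np.log(w[l])               # |d(w_l1, w_l2) / d(w_l, u)| = w_l
    return w_new, members[to_l2], log_P_alloc, log_f_u, log_jacobian
```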

The merge move begins by randomly choosing two components, l_1 and l_2, to be combined into a single cluster l. The updated parameters are ψ = (G, w_old, γ, y_old) → ψ' = (G − 1, w_new, γ, y_new), where w_l = w_{l_1} + w_{l_2}. The acceptance probability for this move is given by min(1, A), where
\[
A = \frac{f(X, y_{\text{new}} \mid \gamma, w_{\text{new}}, G-1)\, f(w_{\text{new}} \mid G-1)\, f(G-1)}{f(X, y_{\text{old}} \mid \gamma, w_{\text{old}}, G)\, f(w_{\text{old}} \mid G)\, f(G)} \times \frac{b_{G-1}\, P_{\text{alloc}}\, (G_1 - 1)^{-1}\, f(u)}{d_G\, P_{\text{chos}}} \times (w_{l_1} + w_{l_2})^{-1}, \qquad (16)
\]
where f(w_new | G − 1)/f(w_old | G) = w_l^{α−1} B(α, (G−1)α) / (w_{l_1}^{α−1} w_{l_2}^{α−1}), and f(G − 1)/f(G) equals 1 or G/λ for the uniform and Poisson priors on G. G_1 is the number of nonempty clusters before the combining move, f(u) is the beta(2, 2) density evaluated at u = w_{l_1}/w_l, and P_alloc and P_chos are defined similarly to the split move, but now G*_0 and G*_1 in P_chos correspond to the numbers of empty and nonempty clusters after the merge.

5.3.2 Birth and Death Moves. We first make a random choice between a birth move and a death move, using the same probabilities b_k and d_k = 1 − b_k as earlier. For a birth move the updated parameters are ψ = (G, w_old, γ, y) → ψ' = (G + 1, w_new, γ, y). The weight for the proposed new component is drawn using w* ∼ beta(1, G), and the existing weights are rescaled to w'_k = w_k(1 − w*), k = 1, ..., G (where k indexes the component labels), so that all of the weights sum to 1. Let G_0 be the number of empty components before the birth. The acceptance probability for a birth is min(1, A), where
\[
A = \frac{f(X, y \mid w_{\text{new}}, \gamma, G+1)\, f(w_{\text{new}} \mid G+1)\, f(G+1)}{f(X, y \mid w_{\text{old}}, \gamma, G)\, f(w_{\text{old}} \mid G)\, f(G)} \times \frac{d_{G+1}\, (G_0 + 1)^{-1}}{b_G\, f(w^{*})} \times (1 - w^{*})^{G-1}, \qquad (17)
\]
f(w_new | G + 1)/f(w_old | G) = w*^{α−1}(1 − w*)^{G(α−1)} / B(α, Gα), and (1 − w*)^{G−1} is the Jacobian of the transformation (w_old, w*) → w_new.

For the death move, an empty component is chosen at random and deleted. Let w_l be the weight of the deleted component. The remaining weights are rescaled to sum to 1 (w'_k = w_k/(1 − w_l)), and ψ = (G, w_old, γ, y) → ψ' = (G − 1, w_new, γ, y). The acceptance probability for this move is min(1, A), where
\[
A = \frac{f(X, y \mid w_{\text{new}}, \gamma, G-1)\, f(w_{\text{new}} \mid G-1)\, f(G-1)}{f(X, y \mid w_{\text{old}}, \gamma, G)\, f(w_{\text{old}} \mid G)\, f(G)} \times \frac{b_{G-1}\, f(w_l)}{d_G\, G_0^{-1}} \times (1 - w_l)^{-(G-2)}, \qquad (18)
\]
f(w_new | G − 1)/f(w_old | G) = B(α, (G−1)α) / [w_l^{α−1}(1 − w_l)^{(G−1)(α−1)}], G_0 is the number of empty components before the death move, and f(w_l) is the beta(1, G) density.
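The weight bookkeeping for the birth and death moves can be sketched as follows (again with illustrative names); the returned log-density and log-Jacobian terms are the ones appearing in (17) and (18). Whether the reverse-move density for the death step should use G or G − 1 components in the beta parameter is not fully spelled out by the extracted text, so the choice below simply follows the wording above.

```python
import numpy as np
from scipy.stats import beta

def propose_birth(w, rng):
    """Draw w* ~ Beta(1, G), rescale existing weights by (1 - w*); the Jacobian
    of (w_old, w*) -> w_new is (1 - w*)^(G - 1)."""
    G = len(w)
    w_star = rng.beta(1.0, G)
    w_new = np.append(w * (1.0 - w_star), w_star)
    return w_new, beta.logpdf(w_star, 1.0, G), (G - 1) * np.log1p(-w_star)

def propose_death(w, empty, rng):
    """Delete a randomly chosen empty component and renormalize the rest."""
    G = len(w)
    l = rng.choice(empty)
    w_l = w[l]
    w_new = np.delete(w, l) / (1.0 - w_l)
    # f(w_l) evaluated as the beta(1, G) density, following the text above
    return w_new, beta.logpdf(w_l, 1.0, G), -(G - 2) * np.log1p(-w_l)
```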

6 POSTERIOR INFERENCE

We draw inference on the sample allocations conditional on G. Thus we first need to compute an estimate for this parameter, which we take to be the value most frequently visited by the MCMC sampler. In addition, we need to address the label-switching problem.

6.1 Label Switching

In finite mixture models, an identifiability problem arises from the invariance of the likelihood under permutation of the component labels. In the Bayesian paradigm, this leads to symmetric and multimodal posterior distributions, with up to G! copies of each "genuine" mode, complicating inference on the parameters. In particular, it is not possible to form ergodic averages over the MCMC samples. Traditional approaches to this problem impose identifiability constraints on the parameters, for instance w_1 < ··· < w_G. These constraints do not always solve the problem, however. Recently, Stephens (2000b) proposed a relabeling algorithm that takes a decision-theoretic approach. The procedure defines an appropriate loss function and postprocesses the MCMC output to minimize the posterior expected loss.

Let P(ξ) = [p_{ik}(ξ)] be the matrix of allocation probabilities,
\[
p_{ik}(\xi) = \Pr(y_i = k \mid X, \gamma, w, \theta) = \frac{w_k\, f(x_{i(\gamma)} \mid \theta_{k(\gamma)})}{\sum_j w_j\, f(x_{i(\gamma)} \mid \theta_{j(\gamma)})}. \qquad (19)
\]
In the context of clustering, the loss function can be defined on the cluster labels y,
\[
L_0(y, \xi) = -\sum_{i=1}^{n} \log p_{i y_i}(\xi). \qquad (20)
\]

Let ξ^(1), ..., ξ^(M) be the sampled parameters, and let ν_1, ..., ν_M be the permutations applied to them. The parameters ξ^(t) correspond to the component parameters w^(t), μ^(t), and Σ^(t) sampled at iteration t. However, in our methodology the component mean and covariance parameters are integrated out, and we do not draw their posterior samples. Therefore, to calculate the matrix of allocation probabilities in (19), we estimate these parameters and set ξ^(t) = {w_1^(t), ..., w_G^(t), x̄_{1(γ)}^(t), ..., x̄_{G(γ)}^(t), S_{1(γ)}^(t), ..., S_{G(γ)}^(t)}, where x̄_{k(γ)}^(t) and S_{k(γ)}^(t) are the estimates of cluster k's mean and covariance based on the model visited at iteration t.

The relabeling algorithm proceeds by selecting initial values for the ν_t's, which we take to be the identity permutations, then iterating the following steps until a fixed point is reached (a minimal sketch follows the list):

a. Choose ŷ to minimize Σ_{t=1}^{M} L_0(ŷ, ν_t(ξ^(t))).

b. For t = 1, ..., M, choose ν_t to minimize L_0(ŷ, ν_t(ξ^(t))).
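A minimal sketch of this relabeling scheme, assuming the allocation-probability matrices of (19) have already been evaluated for every sweep and stacked into an array P of shape (M, n, G). It searches permutations by brute force, so it is only practical for small G; variable names are ours.

```python
import numpy as np
from itertools import permutations

def relabel(P, max_iter=50):
    """Iterate steps (a) and (b) for the loss in (20) until a fixed point:
    (a) y_hat[i] = argmax_k sum_t log p_{i, nu_t(k)};
    (b) nu_t = argmin over permutations of -sum_i log p_{i, nu_t(y_hat_i)}."""
    M, n, G = P.shape
    logP = np.log(P + 1e-300)
    perms = [np.array(pm) for pm in permutations(range(G))]
    nu = [np.arange(G) for _ in range(M)]          # identity permutations to start
    y_hat = np.zeros(n, dtype=int)
    for _ in range(max_iter):
        summed = sum(logP[t][:, nu[t]] for t in range(M))
        y_hat = summed.argmax(axis=1)              # step (a)
        new_nu = [max(perms, key=lambda pm: logP[t][np.arange(n), pm[y_hat]].sum())
                  for t in range(M)]               # step (b)
        if all((a == b).all() for a, b in zip(new_nu, nu)):
            break                                  # fixed point reached
        nu = new_nu
    return y_hat, nu
```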

6.2 Posterior Densities and Posterior Estimates

Once the label switching is taken care of, the MCMC samples can be used to draw posterior inference. Of particular interest are the allocation vector y and the variable selection vector γ.

We compute the marginal posterior probability that sample i is allocated to cluster k, y_i = k, as
\[
\begin{aligned}
p(y_i = k \mid X, G) &= \int p\bigl(y_i = k, y_{(-i)}, \gamma, w \mid X, G\bigr)\, dy_{(-i)}\, d\gamma\, dw\\
&\propto \int p\bigl(X, y_i = k, y_{(-i)}, \gamma, w \mid G\bigr)\, dy_{(-i)}\, d\gamma\, dw\\
&= \int p\bigl(X, y_i = k, y_{(-i)} \mid G, \gamma, w\bigr)\, p(\gamma \mid G)\, p(w \mid G)\, dy_{(-i)}\, d\gamma\, dw\\
&\approx \sum_{t=1}^{M} p\bigl(X, y_i^{(t)} = k, y_{(-i)}^{(t)} \mid G, \gamma^{(t)}, w^{(t)}\bigr)\, p\bigl(\gamma^{(t)} \mid G\bigr)\, p\bigl(w^{(t)} \mid G\bigr), \qquad (21)
\end{aligned}
\]
where y_{(−i)}^{(t)} is the vector y^{(t)} at the tth MCMC iteration without the ith element. The posterior allocation of sample i can then be estimated by the mode of its marginal posterior density,
\[
\hat{y}_i = \arg\max_{1 \le k \le G}\, p(y_i = k \mid X, G). \qquad (22)
\]
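As a simplified sketch, the marginal allocation probabilities can be approximated from the relabeled draws by the relative frequency with which each sample visits each cluster (i.e., weighting every post-burn-in sweep equally rather than by the joint densities in (21)); the mode then gives (22). Array names are ours.

```python
import numpy as np

def allocation_estimates(Y, G):
    """Y: (M, n) array of relabeled allocation draws y^(t).
    Returns the (n, G) matrix of empirical allocation probabilities and
    the modal allocation y_hat of (22)."""
    probs = np.stack([(Y == k).mean(axis=0) for k in range(G)], axis=1)
    return probs, probs.argmax(axis=1)
```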

Similarly, the marginal posterior for γ_j = 1 can be computed as
\[
\begin{aligned}
p(\gamma_j = 1 \mid X, G) &= \int p\bigl(\gamma_j = 1, \gamma_{(-j)}, w, y \mid X, G\bigr)\, d\gamma_{(-j)}\, dy\, dw\\
&\propto \int p\bigl(X, y, \gamma_j = 1, \gamma_{(-j)}, w \mid G\bigr)\, d\gamma_{(-j)}\, dy\, dw\\
&= \int p\bigl(X, y \mid G, \gamma_j = 1, \gamma_{(-j)}, w\bigr)\, p(\gamma \mid G)\, p(w \mid G)\, d\gamma_{(-j)}\, dy\, dw\\
&\approx \sum_{t=1}^{M} p\bigl(X, y^{(t)} \mid G, \gamma_j^{(t)} = 1, \gamma_{(-j)}^{(t)}, w^{(t)}\bigr)\, p\bigl(\gamma_j = 1, \gamma_{(-j)}^{(t)} \mid G\bigr)\, p\bigl(w^{(t)} \mid G\bigr), \qquad (23)
\end{aligned}
\]
where γ_{(−j)}^{(t)} is the vector γ^{(t)} at the tth iteration without the jth element. The best discriminating variables can then be identified as those with largest marginal posterior, p(γ_j = 1 | X, G) > a, where a is chosen arbitrarily,
\[
\hat{\gamma}_j = I_{\{p(\gamma_j = 1 \mid X, G) > a\}}. \qquad (24)
\]
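Analogously, a simple equal-weight approximation to (23) uses the frequency with which each variable is included across the sampled γ^(t); (24) then thresholds these frequencies at a. The default threshold below is only illustrative (a value of .5 is the one used for the microarray example in Section 7.3).

```python
import numpy as np

def selected_variables(Gamma, a=0.5):
    """Gamma: (M, p) binary array of sampled gamma^(t).
    Returns the marginal inclusion frequencies and the indices with frequency > a."""
    marginal = Gamma.mean(axis=0)
    return marginal, np.where(marginal > a)[0]
```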

An alternative variable selection can be performed by considering the vector γ with largest posterior probability among all visited vectors,
\[
\gamma^{*} = \arg\max_{1 \le t \le M}\, p\bigl(\gamma^{(t)} \mid X, G, \hat{w}, \hat{y}\bigr), \qquad (25)
\]
where ŷ is the vector of sample allocations estimated via (22) and ŵ is given by ŵ = (1/M) Σ_{t=1}^{M} w^{(t)}. The estimate given in (25) considers the joint density of γ, rather than the marginal distributions of the individual elements as in (24). In the same spirit, an estimate of the joint allocation of the samples can be obtained as the configuration that yields the largest posterior probability,
\[
y^{*} = \arg\max_{1 \le t \le M}\, p\bigl(y^{(t)} \mid X, G, \hat{w}, \hat{\gamma}\bigr), \qquad (26)
\]
where γ̂ is the estimate in (24).

6.3 Class Prediction

The MCMC output can also be used to predict the class membership of future observations X_f. The predictive density is given by
\[
\begin{aligned}
p(y_f = k \mid X_f, X, G) &= \int p\bigl(y_f = k, y, w, \gamma \mid X_f, X, G\bigr)\, dy\, d\gamma\, dw\\
&\propto \int p\bigl(X_f, X, y_f = k, y \mid G, w, \gamma\bigr)\, p(\gamma \mid G)\, p(w \mid G)\, dy\, d\gamma\, dw\\
&\approx \sum_{t=1}^{M} p\bigl(X_f, X, y_f = k, y^{(t)} \mid \gamma^{(t)}, w^{(t)}, G\bigr)\, p\bigl(\gamma^{(t)} \mid G\bigr)\, p\bigl(w^{(t)} \mid G\bigr), \qquad (27)
\end{aligned}
\]
and the observations are allocated according to
\[
\hat{y}_f = \arg\max_{1 \le k \le G}\, p(y_f = k \mid X_f, X, G). \qquad (28)
\]

7 PERFORMANCE OF OUR METHODOLOGY

In this section we investigate the performance of our methodology using three datasets. Because the Bayesian clustering problem with an unknown number of components via RJMCMC has been considered only in the univariate setting (Richardson and Green 1997), we first examine the efficiency of our algorithm in the multivariate setting without variable selection. For this we use the benchmark iris data (Anderson 1935). We then explore the performance of our methodology for simultaneous clustering and variable selection using a series of simulated high-dimensional datasets. We also analyze these data using the COSA algorithm of Friedman and Meulman (2003). Finally, we illustrate an application with DNA microarray data from an endometrial cancer study.

7.1 Iris Data

This dataset consists of four covariates, petal width, petal height, sepal width, and sepal height, measured on 50 flowers from each of three species: Iris setosa, Iris versicolor, and Iris virginica. This dataset has been analyzed by several authors (see, e.g., Murtagh and Hernández-Pajares 1995) and can be accessed in S-PLUS as a three-dimensional array.

We rescaled each variable by its range and specified the prior distributions as described in Section 4.1. We took δ = 3, α = 1, and h1 = 100. We set G_max = 10 and κ1 = .03, a value commensurate with the variability in the data. We considered different prior distributions for G: truncated Poisson with λ = 5 and λ = 10, and a discrete uniform on [1, ..., G_max]. We conducted the analysis assuming both homogeneous and heterogeneous covariances across clusters. In all cases we used a burn-in of 10,000 and a main run of 40,000 iterations for inference.
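One reading of "rescaled each variable by its range" is simply dividing each column by its observed range, as in this small sketch (an assumption on our part; centering is not mentioned).

```python
import numpy as np

def rescale_by_range(X):
    """Divide each column of the n x p data matrix by (max - min) of that column."""
    return X / (X.max(axis=0) - X.min(axis=0))
```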

[Figure 1. Iris Data: Trace Plots of the Number of Clusters G. (a) Heterogeneous covariance; (b) homogeneous covariance.]

Table 1. Iris Data: Posterior Distribution of G

        Poisson(λ = 5)            Poisson(λ = 10)           Uniform[1, 10]
  k     Different Σ_k  Equal Σ    Different Σ_k  Equal Σ    Different Σ_k  Equal Σ
  3     .6466          0          .5019          0          .8548          0
  4     .3124          .7413      .4291          .4384      .1417          .6504
  5     .0403          .3124      .0545          .2335      .0035          .3446
  6     .0007          .0418      .0138          .2786      0              .0050
  7     0              0          .0007          .0484      0              0
  8     0              0          0              .0011      0              0

Let us first focus on the results using a truncated Poisson(λ = 5) prior for G. Figure 1 displays the trace plots for the number of visited components under the assumptions of heterogeneous and homogeneous covariances. Table 1, column 1, gives the corresponding posterior probabilities p(G = k|X). Table 2 shows the posterior estimates for the sample allocations ŷ, computed conditional on the most probable G according to (22). Under the assumption of heterogeneous covariances across clusters, there is strong support for G = 3. Two of the clusters successfully isolate all of the setosa and virginica plants, and the third cluster contains all of the versicolor except for five identified as virginica. These results are satisfactory, because the versicolor and virginica species are known to have some overlap, whereas the setosa is rather well separated. Under the assumption of constant covariance, there is strong support for four clusters. Again, one cluster contains exclusively all of the setosa, and all of the versicolor are grouped together except for one. Two virginica flowers are assigned to the versicolor group, and the remaining are divided into two new components. Figure 2 shows the marginal posterior probabilities p(y_i = k | X, G = 4), computed using equation (21). We note that some of the flowers among the 12 virginica allocated to cluster IV had a close call; for instance, observation 104 had marginal posteriors p(y_104 = 3 | X, G) = .4169 and p(y_104 = 4 | X, G) = .5823.

We examined the sensitivity of the results to the choice of the prior distribution on G and found them to be quite robust. The posterior probabilities p(G = k|X) assuming G ∼ Poisson(λ = 10) and G ∼ uniform[1, ..., G_max] are given in Table 1. Under both priors we obtained sample allocations that are identical to those in Table 2. We also found little sensitivity to the choice of the hyperparameter h1, with values between 10 and 1,000 giving satisfactory results. Smaller values of h1 tended to favor slightly more components, but with some clusters containing very few observations. For example, for h1 = 10, under the assumption of equal covariance across clusters, Table 3 shows that the MCMC sampler gives stronger support for G = 6. However, one of the clusters contains only one observation, and another contains only four observations. Figure 3 displays the marginal posterior probabilities for the allocation of these five samples, p(y_i = k | X, G); there is little support for assigning them to separate clusters.

We use this example also to illustrate the label-switching phenomenon described in Section 6.1. Figure 4 gives the trace plots of the component weights under the assumption of homogeneous covariance, before and after processing of the MCMC samples. We can easily identify two label-switching instances in the raw output of Figure 4(a). One of these occurred between the first and third components near iteration 1,000; the other occurred around iteration 2,400, between components 2, 3, and 4. Figure 4(b) shows the trace plots after postprocessing of the MCMC output. We note that the label switching is eliminated and the multimodality in the estimates of the marginal posterior distributions of the w_k is removed. It is now straightforward to obtain sensible estimates using the posterior means.

Table 2. Iris Data: Sample Allocation ŷ

                       Heterogeneous covariance      Homogeneous covariance
                         I     II    III               I     II    III    IV
  Iris setosa           50      0      0              50      0      0     0
  Iris versicolor        0     45      5               0     49      1     0
  Iris virginica         0      0     50               0      2     36    12

[Figure 2. Iris Data, Poisson(λ = 5) With Equal Σ: Marginal Posterior Probabilities of Sample Allocations p(y_i = k | X, G = 4), i = 1, ..., 150, k = 1, ..., 4.]

7.2 Simulated Data

We generated a dataset of 15 observations arising from four multivariate normal densities with 20 variables, such that
\[
x_{ij} \sim I_{\{1 \le i \le 4\}}\, \mathcal{N}(\mu_1, \sigma_1^2) + I_{\{5 \le i \le 7\}}\, \mathcal{N}(\mu_2, \sigma_2^2) + I_{\{8 \le i \le 13\}}\, \mathcal{N}(\mu_3, \sigma_3^2) + I_{\{14 \le i \le 15\}}\, \mathcal{N}(\mu_4, \sigma_4^2),
\qquad i = 1, \ldots, 15,\ j = 1, \ldots, 20,
\]
where I_{·} is the indicator function, equal to 1 if the condition is met. Thus the first four samples arise from the same distribution, the next three come from the second group, the following six are from the third group, and the last two are from the fourth group. The component means were set to μ1 = 5, μ2 = 2, μ3 = −3, and μ4 = −6, and the component variances were set to σ1² = 1.5, σ2² = 1, σ3² = .5, and σ4² = 2. We drew an additional set of (p − 20) noisy variables, which do not distinguish between the clusters, from a standard normal density. We considered several values of p: p = 50, 100, 500, 1,000. These data thus contain a small subset of discriminating variables together with a large set of nonclustering variables. The purpose is to study the ability of our method to uncover the cluster structure and identify the relevant covariates in the presence of an increasing number of noisy variables.
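A sketch that generates data of this form; the component variances use the decimal values as reconstructed above, and the seed and function name are ours.

```python
import numpy as np

def simulate_dataset(p=1000, seed=0):
    """15 observations: 20 discriminating variables from four normal components
    plus (p - 20) standard-normal noise variables, with columns permuted."""
    rng = np.random.default_rng(seed)
    sizes = [4, 3, 6, 2]
    means = [5.0, 2.0, -3.0, -6.0]
    sds = np.sqrt([1.5, 1.0, 0.5, 2.0])
    X_disc = np.vstack([rng.normal(m, s, size=(nk, 20))
                        for nk, m, s in zip(sizes, means, sds)])
    X = np.hstack([X_disc, rng.normal(size=(15, p - 20))])
    X = X[:, rng.permutation(p)]              # disperse the predictors
    y_true = np.repeat([0, 1, 2, 3], sizes)
    return X, y_true
```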

We permuted the columns of the data matrix X to disperse the predictors. For each dataset we took δ = 3, α = 1, h1 = h0 = 100, and G_max = n, and assumed unequal covariances across clusters. Results from the previous example have shown little sensitivity to the choice of prior on G; here we considered a truncated Poisson prior with λ = 5. As described in Section 4.1, some care is needed when choosing κ1 and κ0: their values need to be commensurate with the variability of the data. The amount of total variability in the different simulated datasets is of course widely different, and in each case these hyperparameters were chosen proportionally to the upper and lower deciles of the nonzero eigenvalues. For the prior of γ, we chose a Bernoulli distribution and set the expected number of included variables to 10. We used a starting model with one randomly selected variable and ran the MCMC chains for 100,000 iterations, with 40,000 sweeps as burn-in.

Table 3. Iris Data: Sensitivity to Hyperparameter h1 (Results With h1 = 10)

  Posterior distribution of G
  k              4      5      6      7      8      9     10
  p(G = k|X)   .0747  .1874  .3326  .2685  .1120  .0233  .0015

  Sample allocations ŷ
                       I     II    III    IV     V     VI
  Iris setosa         50      0      0     0     0      0
  Iris versicolor      0     45      1     0     4      0
  Iris virginica       0      2     35    12     0      1

[Figure 3. Iris Data, Sensitivity to h1: Results for h1 = 10. Marginal Posterior Probabilities for Samples Allocated to Clusters V and VI, p(y_i = k | X, G = 6), i = 1, ..., 5, k = 1, ..., 6.]

Here we do inference on both the sample allocations and the variable selection. We obtained satisfactory results for all datasets: in all cases we successfully recovered the cluster structure used to simulate the data and identified the 20 discriminating variables. We present the summary plots associated with the largest dataset, where p = 1,000. Figure 5(a) shows the trace plot for the number of visited components G, and Table 4 reports the marginal posterior probabilities p(G|X). There is strong support for G = 4 and 5, with a slightly larger probability for the former. Figure 6 shows the marginal posterior probabilities of the sample allocations, p(y_i = k | X, G = 4). We note that these match the group structure used to simulate the data. As for selection of the predictors, Figure 5(b) displays the number of variables selected at each MCMC iteration. In the first 30,000 iterations, before burn-in, the chain visited models with around 30–35 variables. After about 40,000 iterations, the chain stabilized to models with 15–20 covariates. Figure 7 displays the marginal posterior probabilities p(γ_j = 1 | X, G = 4). There were 17 variables with marginal probabilities greater than .9, all of which were in the set of 20 covariates simulated to effectively discriminate the four clusters. If we lower the threshold for inclusion to .4, then all 20 variables are selected. Conditional on the allocations obtained via (22), the γ vector with largest posterior probability (25) among all visited models contained 19 of the 20 discriminating variables.

We also analyzed these datasets using the COSA algorithm of Friedman and Meulman (2003). As we mentioned in Section 2, this procedure performs variable selection in conjunction with hierarchical clustering. We present the results for the analysis of the simulated data with p = 1,000. Figure 8 shows the dendrograms of the clustering results based on the non-targeted COSA distance. We considered single, average, and complete linkage, but none of these methods was able to recover the true cluster structure. The average linkage dendrogram [Fig. 8(b)] delineates one of the clusters that was partially recovered (the true cluster contains observations 8–13). Figure 8(d) displays the 20 highest relevance values of the variables that lead to this clustering. The variables that define the true cluster structure are indexed as 1–20, and we note that 16 of them appear in this plot.

[Figure 4. Iris Data: Trace Plots of Component Weights Before and After Removal of the Label Switching. (a) Raw output; (b) processed output.]

[Figure 5. Trace Plots for Simulated Data With p = 1,000. (a) Number of clusters G; (b) number of included variables p_γ.]

Table 4. Simulated Data With p = 1,000: Posterior Distribution of G

   k     p(G = k|X)
   2       .0111
   3       .1447
   4       .3437
   5       .3375
   6       .1326
   7       .0216
   8       .0040
   9       .0015
  10       .0008
  11       .0005
  12       .0007
  13       .0012
  14       .0002

[Figure 6. Simulated Data With p = 1,000: Marginal Posterior Probabilities of Sample Allocations p(y_i = k | X, G = 4), i = 1, ..., 15, k = 1, ..., 4.]

7.3 Application to Microarray Data

We now illustrate the methodology on microarray data from an endometrial cancer study. Endometrioid endometrial adenocarcinoma is a common gynecologic malignancy arising within the uterine lining. The disease usually occurs in postmenopausal women and is associated with the hormonal risk factor of protracted estrogen exposure unopposed by progestins. Despite its prevalence, the molecular mechanisms of its genesis are not completely known. DNA microarrays, with their ability to examine thousands of genes simultaneously, could be an efficient tool to isolate relevant genes and cluster tissues into different subtypes. This is especially important in cancer treatment, where different clinicopathologic groups are known to vary in their response to therapy, and genes identified to discriminate among the different subtypes may represent targets for therapeutic intervention and biomarkers for improved diagnosis.

[Figure 7. Simulated Data With p = 1,000: Marginal Posterior Probabilities for Inclusion of Variables, p(γ_j = 1 | X, G = 4).]

[Figure 8. Analysis of Simulated Data With p = 1,000 Using COSA. (a) Single linkage; (b) average linkage; (c) complete linkage; (d) relevance for the delineated cluster.]

Four normal endometrium samples and 10 endometrioid endometrial adenocarcinomas were collected from hysterectomy specimens. RNA samples were isolated from the 14 tissues and reverse-transcribed. The resulting cDNA targets were prepared following the manufacturer's instructions and hybridized to Affymetrix Hu6800 GeneChip arrays, which contain 7,070 probe sets. The scanned images were processed with the Affymetrix software, version 3.1. The average difference derived by the software was used to indicate transcript abundance, and a global normalization procedure was applied. The technology does not reliably quantify low expression levels, and it is subject to spot saturation at the high end. For this particular array, the limits of reliable detection were previously set to 20 and 16,000 (Tadesse, Ibrahim, and Mutter 2003). We used these same thresholds and removed probe sets with at least one unreliable reading from the analysis. This left us with p = 762 variables. We then log-transformed the expression readings to satisfy the assumption of normality, as suggested by the Box–Cox transformation. We also rescaled each variable by its range.
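A sketch of this preprocessing under our reading of the text: probe sets with any value outside [20, 16000] are dropped, the remaining readings are log-transformed (the base of the logarithm is not stated, so natural log is assumed here), and each column is divided by its range.

```python
import numpy as np

def preprocess_expression(X, lower=20.0, upper=16000.0):
    """X: (n, p) matrix of average-difference values.  Returns the filtered,
    log-transformed, range-rescaled matrix and the mask of retained probe sets."""
    keep = np.all((X >= lower) & (X <= upper), axis=0)
    Z = np.log(X[:, keep])
    return Z / (Z.max(axis=0) - Z.min(axis=0)), keep
```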

As described in Section 4.1, we specified the priors with the hyperparameters δ = 3, α = 1, and h1 = h0 = 100. We set κ1 = .001 and κ0 = .01 and assumed unequal covariances across clusters. We used a truncated Poisson(λ = 5) prior for G, with G_max = n. We took a Bernoulli prior for γ, with an expectation of 10 variables to be included in the model.

To avoid possible dependence of the results on the initial model, we ran four MCMC chains with widely different starting points: (1) γ_j set to 0 for all j's except one randomly selected; (2) γ_j set to 1 for 10 randomly selected j's; (3) γ_j set to 1 for 25 randomly selected j's; and (4) γ_j set to 1 for 50 randomly selected j's. For each of the MCMC chains, we ran 100,000 iterations, with 40,000 sweeps taken as burn-in.

We looked at both the allocation of tissues and the selection of genes whose expression best discriminates among the different groups. The MCMC sampler mixed steadily over the iterations. As shown in the histogram of Figure 9(a), the sampler visited mostly between two and five components, with stronger support for G = 3 clusters. Figure 9(b) shows the trace plot for the number of included variables for one of the chains, which visited mostly models with 25–35 variables; the other chains behaved similarly. Figure 10 gives the marginal posterior probabilities p(γ_j = 1 | X, G = 3) for each of the chains. Despite the very different starting points, the four chains visited similar regions and exhibit broadly similar marginal plots. To assess the concordance of the results across the four chains, we looked at the correlation coefficients between the frequencies for inclusion of genes. These are reported in Figure 11, along with the corresponding pairwise scatterplots. We see that there is very good agreement across the different MCMC runs. The marginal probabilities, on the other hand, are not as concordant across the chains, because their derivation makes use of model posterior probabilities [see (23)], which are affected by the inclusion of correlated variables. This is a problem inherent to DNA microarray data, where multiple genes with similar expression profiles are often observed.

[Figure 9. MCMC Output for Endometrial Data. (a) Number of visited clusters G; (b) number of included variables p_γ.]

[Figure 10. Endometrial Data: Marginal posterior probabilities for inclusion of variables, p(γ_j = 1 | X, G = 3), for each of the four chains.]

We pooled and relabeled the output of the four chains, normalized the relative posterior probabilities, and recomputed p(γ_j = 1 | X, G = 3). These marginal probabilities are displayed in Figure 12. There are 31 genes with posterior probability greater than .5. We also estimated the marginal posterior probabilities for the sample allocations, p(y_i = k | X, G), based on the 31 selected genes. These are shown in Figure 13. We successfully identified the four normal tissues. The results also seem to suggest that there are possibly two subtypes within the malignant tissues.

We also present the results from analyzing this dataset with the COSA algorithm. Figure 14 shows the resulting single linkage and average linkage dendrograms, along with the top 30 variables that distinguish the normal and tumor tissues. There is some overlap between the genes selected by COSA and those identified by our procedure: among the sets of 30 "best" discriminating genes selected by the two methods, 10 are common to both.

8 DISCUSSION

We have proposed a method for simultaneously clustering high-dimensional data with an unknown number of components and selecting the variables that best discriminate the different groups. We successfully applied the methodology to various datasets. Our method is fully Bayesian, and we provided standard default recommendations for the choice of priors.

[Figure 11. Endometrial Data: Concordance of results among the four chains. Pairwise scatterplots of frequencies for inclusion of genes.]

[Figure 12. Endometrial Data, Union of four chains: Marginal posterior probabilities p(γ_j = 1 | X, G = 3).]

[Figure 13. Endometrial Data, Union of four chains: Marginal posterior probabilities of sample allocations, p(y_i = k | X, G = 3), i = 1, ..., 14, k = 1, ..., 3.]

[Figure 14. Analysis of Endometrial Data Using COSA. (a) Single linkage; (b) average linkage; (c) variables leading to cluster 1; (d) variables leading to cluster 2.]

Here we mention some possible extensions and directions for future research. We have drawn posterior inference conditional on a fixed number of components. A more attractive approach would incorporate the uncertainty on this parameter and combine the results for all different values of G visited by the MCMC sampler. To our knowledge, this is an open problem that requires further research. One promising approach involves considering the posterior probability of pairs of observations being allocated to the same cluster, regardless of the number of visited components. As long as there is no interest in estimating the component parameters, the problem of label switching would not be a concern. But this leads to $\binom{n}{2}$ probabilities, and it is not clear how best to summarize them. For example, Medvedovic and Sivaganesan (2002) used hierarchical clustering based on these pairwise probabilities as a similarity measure.
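The pairwise summary mentioned here is straightforward to compute from the allocation draws, because it does not depend on the component labels; a small sketch (names ours):

```python
import numpy as np

def coclustering_probabilities(Y):
    """Y: (M, n) array of allocation draws.  Returns the n x n matrix whose
    (i, j) entry estimates the posterior probability that samples i and j
    are assigned to the same cluster."""
    M, n = Y.shape
    C = np.zeros((n, n))
    for y in Y:
        C += (y[:, None] == y[None, :])
    return C / M
```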

An interesting future avenue would be to investigate the performance of our method when covariance constraints are imposed, as proposed by Banfield and Raftery (1993). Their parameterization allows one to incorporate different assumptions on the volume, orientation, and shape of the clusters.

Another possible extension is to use empirical Bayes estimates to elicit the hyperparameters. For example, conditional on a fixed number of clusters, a complete log-likelihood can be computed using all covariates and a Monte Carlo EM approach developed for estimation. Alternatively, if prior information were available or subjective priors were preferable, then the prior setting could be modified accordingly. If interactions among predictors were of interest, then additional interaction terms could be included in the model and the prior on γ could be modified to accommodate the additional terms.

APPENDIX: FULL CONDITIONALS

Here we give details on the derivation of the marginalized full conditionals under the assumptions of equal and unequal covariances across clusters.

Homogeneous Covariance: Σ_1 = ··· = Σ_G = Σ

\[
f(X, y \mid G, w, \gamma) = \int f(X, y \mid G, w, \mu, \Sigma, \eta, \Omega, \gamma)\, f(\mu \mid G, \Sigma, \gamma)\, f(\Sigma \mid G, \gamma)\, f(\eta \mid \Omega, \gamma)\, f(\Omega \mid \gamma)\; d\mu\, d\Sigma\, d\eta\, d\Omega
\]
\[
\begin{aligned}
={}& \int \prod_{k=1}^{G} w_k^{n_k}\, \bigl|\Sigma_{(\gamma)}\bigr|^{-(n_k+1)/2} (2\pi)^{-(n_k+1)p_\gamma/2}\, h_1^{-p_\gamma/2}\\
&\times \exp\Biggl\{-\frac{1}{2} \sum_{k=1}^{G} \sum_{x_{i(\gamma)} \in C_k} \bigl(x_{i(\gamma)} - \mu_{k(\gamma)}\bigr)^{T} \Sigma_{(\gamma)}^{-1} \bigl(x_{i(\gamma)} - \mu_{k(\gamma)}\bigr)\Biggr\}
\times \exp\Biggl\{-\frac{1}{2} \sum_{k=1}^{G} \bigl(\mu_{k(\gamma)} - \mu_{0(\gamma)}\bigr)^{T} \bigl(h_1 \Sigma_{(\gamma)}\bigr)^{-1} \bigl(\mu_{k(\gamma)} - \mu_{0(\gamma)}\bigr)\Biggr\}\\
&\times 2^{-p_\gamma(\delta + p_\gamma - 1)/2}\, \pi^{-p_\gamma(p_\gamma - 1)/4} \Biggl(\prod_{j=1}^{p_\gamma} \Gamma\Bigl(\frac{\delta + p_\gamma - j}{2}\Bigr)\Biggr)^{-1}
\bigl|Q_{1(\gamma)}\bigr|^{(\delta + p_\gamma - 1)/2}\, \bigl|\Sigma_{(\gamma)}\bigr|^{-(\delta + 2 p_\gamma)/2} \exp\Bigl\{-\tfrac{1}{2} \operatorname{tr}\bigl(\Sigma_{(\gamma)}^{-1} Q_{1(\gamma)}\bigr)\Bigr\}\\
&\times \bigl|\Omega_{(\gamma^c)}\bigr|^{-(n+1)/2} (2\pi)^{-(n+1)(p - p_\gamma)/2}\, h_0^{-(p - p_\gamma)/2}
\times \exp\Biggl\{-\frac{1}{2} \sum_{i=1}^{n} \bigl(x_{i(\gamma^c)} - \eta_{(\gamma^c)}\bigr)^{T} \Omega_{(\gamma^c)}^{-1} \bigl(x_{i(\gamma^c)} - \eta_{(\gamma^c)}\bigr)\Biggr\}\\
&\times \exp\Bigl\{-\tfrac{1}{2} \bigl(\eta_{(\gamma^c)} - \mu_{0(\gamma^c)}\bigr)^{T} \bigl(h_0 \Omega_{(\gamma^c)}\bigr)^{-1} \bigl(\eta_{(\gamma^c)} - \mu_{0(\gamma^c)}\bigr)\Bigr\}\\
&\times 2^{-(p - p_\gamma)(\delta + p - p_\gamma - 1)/2}\, \pi^{-(p - p_\gamma)(p - p_\gamma - 1)/4} \Biggl(\prod_{j=1}^{p - p_\gamma} \Gamma\Bigl(\frac{\delta + p - p_\gamma - j}{2}\Bigr)\Biggr)^{-1}\\
&\times \bigl|Q_{0(\gamma^c)}\bigr|^{(\delta + p - p_\gamma - 1)/2}\, \bigl|\Omega_{(\gamma^c)}\bigr|^{-(\delta + 2(p - p_\gamma))/2} \exp\Bigl\{-\tfrac{1}{2} \operatorname{tr}\bigl(\Omega_{(\gamma^c)}^{-1} Q_{0(\gamma^c)}\bigr)\Bigr\}\; d\mu\, d\Sigma\, d\eta\, d\Omega.
\end{aligned}
\]
Completing the squares in the μ_{k(γ)} and in η_{(γ^c)}, integrating these mean parameters out, and then using the normalizing constant of the inverse-Wishart distribution to integrate out Σ_{(γ)} and Ω_{(γ^c)} gives
\[
\begin{aligned}
f(X, y \mid G, w, \gamma) ={}& \pi^{-np/2} \Biggl[\prod_{k=1}^{G} (h_1 n_k + 1)^{-p_\gamma/2}\, w_k^{n_k}\Biggr] (h_0 n + 1)^{-(p - p_\gamma)/2}\\
&\times \prod_{j=1}^{p_\gamma} \Biggl(\Gamma\Bigl(\frac{\delta + n + p_\gamma - j}{2}\Bigr) \Bigl/ \Gamma\Bigl(\frac{\delta + p_\gamma - j}{2}\Bigr)\Biggr)
\times \prod_{j=1}^{p - p_\gamma} \Biggl(\Gamma\Bigl(\frac{\delta + n + p - p_\gamma - j}{2}\Bigr) \Bigl/ \Gamma\Bigl(\frac{\delta + p - p_\gamma - j}{2}\Bigr)\Biggr)\\
&\times \bigl|Q_{1(\gamma)}\bigr|^{(\delta + p_\gamma - 1)/2}\, \bigl|Q_{0(\gamma^c)}\bigr|^{(\delta + p - p_\gamma - 1)/2}
\times \Biggl|Q_{1(\gamma)} + \sum_{k=1}^{G} S_{k(\gamma)}\Biggr|^{-(\delta + n + p_\gamma - 1)/2} \bigl|Q_{0(\gamma^c)} + S_{0(\gamma^c)}\bigr|^{-(\delta + n + p - p_\gamma - 1)/2},
\end{aligned}
\]
where S_{k(γ)} and S_{0(γ^c)} are given by
\[
\begin{aligned}
S_{k(\gamma)} &= \sum_{x_{i(\gamma)} \in C_k} x_{i(\gamma)} x_{i(\gamma)}^{T} + h_1^{-1} \mu_{0(\gamma)} \mu_{0(\gamma)}^{T}
- \bigl(n_k + h_1^{-1}\bigr)^{-1} \Biggl(\sum_{x_{i(\gamma)} \in C_k} x_{i(\gamma)} + h_1^{-1} \mu_{0(\gamma)}\Biggr) \Biggl(\sum_{x_{i(\gamma)} \in C_k} x_{i(\gamma)} + h_1^{-1} \mu_{0(\gamma)}\Biggr)^{T}\\
&= \sum_{x_{i(\gamma)} \in C_k} \bigl(x_{i(\gamma)} - \bar{x}_{k(\gamma)}\bigr) \bigl(x_{i(\gamma)} - \bar{x}_{k(\gamma)}\bigr)^{T}
+ \frac{n_k}{h_1 n_k + 1} \bigl(\mu_{0(\gamma)} - \bar{x}_{k(\gamma)}\bigr) \bigl(\mu_{0(\gamma)} - \bar{x}_{k(\gamma)}\bigr)^{T}
\end{aligned}
\]
and
\[
\begin{aligned}
S_{0(\gamma^c)} &= \sum_{i=1}^{n} x_{i(\gamma^c)} x_{i(\gamma^c)}^{T} + h_0^{-1} \mu_{0(\gamma^c)} \mu_{0(\gamma^c)}^{T}
- \bigl(n + h_0^{-1}\bigr)^{-1} \Biggl(\sum_{i=1}^{n} x_{i(\gamma^c)} + h_0^{-1} \mu_{0(\gamma^c)}\Biggr) \Biggl(\sum_{i=1}^{n} x_{i(\gamma^c)} + h_0^{-1} \mu_{0(\gamma^c)}\Biggr)^{T}\\
&= \sum_{i=1}^{n} \bigl(x_{i(\gamma^c)} - \bar{x}_{(\gamma^c)}\bigr) \bigl(x_{i(\gamma^c)} - \bar{x}_{(\gamma^c)}\bigr)^{T}
+ \frac{n}{h_0 n + 1} \bigl(\mu_{0(\gamma^c)} - \bar{x}_{(\gamma^c)}\bigr) \bigl(\mu_{0(\gamma^c)} - \bar{x}_{(\gamma^c)}\bigr)^{T},
\end{aligned}
\]
with \(\bar{x}_{k(\gamma)} = \frac{1}{n_k} \sum_{x_{i(\gamma)} \in C_k} x_{i(\gamma)}\) and \(\bar{x}_{(\gamma^c)} = \frac{1}{n} \sum_{i=1}^{n} x_{i(\gamma^c)}\).
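The centered form of S_{k(γ)} is convenient computationally; a small sketch (names ours), with X_k the n_k × p_γ matrix of selected variables for the observations in cluster k:

```python
import numpy as np

def S_k(X_k, mu0, h1):
    """Within-cluster sums of squares about the cluster mean, plus the
    shrinkage term n_k / (h1 n_k + 1) toward the prior mean mu0."""
    nk = X_k.shape[0]
    xbar = X_k.mean(axis=0)
    C = X_k - xbar
    d = (mu0 - xbar)[:, None]
    return C.T @ C + (nk / (h1 * nk + 1.0)) * (d @ d.T)
```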

Heterogeneous Covariances

In this case the terms involving the nondiscriminating variables, η_{(γ^c)}, and Ω_{(γ^c)} are exactly as before, whereas each cluster has its own covariance matrix Σ_{k(γ)} with prior scale Q_{1(γ)}, and μ_{k(γ)} | Σ_{k(γ)} ∼ N(μ_{0(γ)}, h_1 Σ_{k(γ)}). Starting from
\[
f(X, y \mid G, w, \gamma) = \int f(X, y \mid G, w, \mu, \Sigma, \eta, \Omega, \gamma)\, f(\mu \mid G, \Sigma, \gamma)\, f(\Sigma \mid G, \gamma)\, f(\eta \mid \Omega, \gamma)\, f(\Omega \mid \gamma)\; d\mu\, d\Sigma\, d\eta\, d\Omega,
\]
completing the squares in the μ_{k(γ)}, integrating them out, and then integrating each Σ_{k(γ)} against its inverse-Wishart factor yields
\[
\begin{aligned}
f(X, y \mid G, w, \gamma) ={}& \pi^{-np/2} \prod_{k=1}^{G} \Biggl[(h_1 n_k + 1)^{-p_\gamma/2}\, w_k^{n_k}
\prod_{j=1}^{p_\gamma} \Biggl(\Gamma\Bigl(\frac{\delta + n_k + p_\gamma - j}{2}\Bigr) \Bigl/ \Gamma\Bigl(\frac{\delta + p_\gamma - j}{2}\Bigr)\Biggr)\\
&\qquad\qquad \times \bigl|Q_{1(\gamma)}\bigr|^{(\delta + p_\gamma - 1)/2}\, \bigl|Q_{1(\gamma)} + S_{k(\gamma)}\bigr|^{-(\delta + n_k + p_\gamma - 1)/2}\Biggr]\\
&\times (h_0 n + 1)^{-(p - p_\gamma)/2} \prod_{j=1}^{p - p_\gamma} \Biggl(\Gamma\Bigl(\frac{\delta + n + p - p_\gamma - j}{2}\Bigr) \Bigl/ \Gamma\Bigl(\frac{\delta + p - p_\gamma - j}{2}\Bigr)\Biggr)\\
&\times \bigl|Q_{0(\gamma^c)}\bigr|^{(\delta + p - p_\gamma - 1)/2}\, \bigl|Q_{0(\gamma^c)} + S_{0(\gamma^c)}\bigr|^{-(\delta + n + p - p_\gamma - 1)/2},
\end{aligned}
\]
with S_{k(γ)} and S_{0(γ^c)} as defined above.
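For completeness, the closed form above can be evaluated on the log scale roughly as follows. This is a sketch under several assumptions of ours: cluster labels 0, ..., G − 1 index w, gamma is a boolean mask over the p columns, Q1 and Q0 are the prior scale matrices already restricted to the selected and nonselected coordinates, and degenerate cases (for example, p_γ = 0) are not handled.

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_hetero(X, y, w, gamma, delta, h1, h0, Q1, Q0, mu0):
    """Log of f(X, y | w, gamma, G) under cluster-specific covariances."""
    n, p = X.shape
    sel, nsel = np.where(gamma)[0], np.where(~gamma)[0]
    pg, q = sel.size, p - sel.size

    def S(Z, m0, h):  # centered cross-product matrix S_k / S_0
        m = Z.shape[0]
        zbar = Z.mean(axis=0)
        C = Z - zbar
        d = (m0 - zbar)[:, None]
        return C.T @ C + (m / (h * m + 1.0)) * (d @ d.T)

    out = -0.5 * n * p * np.log(np.pi)
    _, ldQ1 = np.linalg.slogdet(Q1)
    j = np.arange(1, pg + 1)
    for k in np.unique(y):
        Xk = X[np.ix_(y == k, sel)]
        nk = Xk.shape[0]
        out += nk * np.log(w[k]) - 0.5 * pg * np.log(h1 * nk + 1.0)
        out += np.sum(gammaln((delta + nk + pg - j) / 2.0)
                      - gammaln((delta + pg - j) / 2.0))
        _, ldQS = np.linalg.slogdet(Q1 + S(Xk, mu0[sel], h1))
        out += 0.5 * (delta + pg - 1) * ldQ1 - 0.5 * (delta + nk + pg - 1) * ldQS
    # contribution of the nondiscriminating variables
    jq = np.arange(1, q + 1)
    out += -0.5 * q * np.log(h0 * n + 1.0)
    out += np.sum(gammaln((delta + n + q - jq) / 2.0) - gammaln((delta + q - jq) / 2.0))
    _, ldQ0 = np.linalg.slogdet(Q0)
    _, ldQ0S = np.linalg.slogdet(Q0 + S(X[:, nsel], mu0[nsel], h0))
    out += 0.5 * (delta + q - 1) * ldQ0 - 0.5 * (delta + n + q - 1) * ldQ0S
    return out
```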

[Received May 2003 Revised August 2004]

REFERENCES

Anderson, E. (1935), "The Irises of the Gaspe Peninsula," Bulletin of the American Iris Society, 59, 2–5.

Banfield, J. D., and Raftery, A. E. (1993), "Model-Based Gaussian and Non-Gaussian Clustering," Biometrics, 49, 803–821.

Brown, P. J. (1993), Measurement, Regression and Calibration, Oxford, U.K.: Clarendon Press.

Brown, P. J., Vannucci, M., and Fearn, T. (1998a), "Bayesian Wavelength Selection in Multicomponent Analysis," Chemometrics, 12, 173–182.

Brown, P. J., Vannucci, M., and Fearn, T. (1998b), "Multivariate Bayesian Variable Selection and Prediction," Journal of the Royal Statistical Society, Ser. B, 60, 627–641.

Brusco, M. J., and Cradit, J. D. (2001), "A Variable Selection Heuristic for k-Means Clustering," Psychometrika, 66, 249–270.

Chang, W.-C. (1983), "On Using Principal Components Before Separating a Mixture of Two Multivariate Normal Distributions," Applied Statistics, 32, 267–275.

Diebolt, J., and Robert, C. P. (1994), "Estimation of Finite Mixture Distributions Through Bayesian Sampling," Journal of the Royal Statistical Society, Ser. B, 56, 363–375.

Fowlkes, E. B., Gnanadesikan, R., and Kettering, J. R. (1988), "Variable Selection in Clustering," Journal of Classification, 5, 205–228.

Friedman, J. H., and Meulman, J. J. (2003), "Clustering Objects on Subsets of Attributes," technical report, Stanford University, Dept. of Statistics and Stanford Linear Accelerator Center.

George, E. I., and McCulloch, R. E. (1997), "Approaches for Bayesian Variable Selection," Statistica Sinica, 7, 339–373.

Ghosh, D., and Chinnaiyan, A. M. (2002), "Mixture Modelling of Gene Expression Data From Microarray Experiments," Bioinformatics, 18, 275–286.

Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. (eds.) (1996), Markov Chain Monte Carlo in Practice, London: Chapman & Hall.

Gnanadesikan, R., Kettering, J. R., and Tao, S. L. (1995), "Weighting and Selection of Variables for Cluster Analysis," Journal of Classification, 12, 113–136.

Green, P. J. (1995), "Reversible-Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination," Biometrika, 82, 711–732.

Madigan, D., and York, J. (1995), "Bayesian Graphical Models for Discrete Data," International Statistical Review, 63, 215–232.

McLachlan, G. J., Bean, R. W., and Peel, D. (2002), "A Mixture Model-Based Approach to the Clustering of Microarray Expression Data," Bioinformatics, 18, 413–422.

Medvedovic, M., and Sivaganesan, S. (2002), "Bayesian Infinite Mixture Model-Based Clustering of Gene Expression Profiles," Bioinformatics, 18, 1194–1206.

Milligan, G. W. (1989), "A Validation Study of a Variable Weighting Algorithm for Cluster Analysis," Journal of Classification, 6, 53–71.

Murtagh, F., and Hernández-Pajares, M. (1995), "The Kohonen Self-Organizing Map Method: An Assessment," Journal of Classification, 12, 165–190.

Richardson, S., and Green, P. J. (1997), "On Bayesian Analysis of Mixtures With an Unknown Number of Components" (with discussion), Journal of the Royal Statistical Society, Ser. B, 59, 731–792.

Sha, N., Vannucci, M., Tadesse, M. G., Brown, P. J., Dragoni, I., Davies, N., Roberts, T., Contestabile, A., Salmon, M., Buckley, C., and Falciani, F. (2004), "Bayesian Variable Selection in Multinomial Probit Models to Identify Molecular Signatures of Disease Stage," Biometrics, 60, 812–819.

Stephens, M. (2000a), "Bayesian Analysis of Mixture Models With an Unknown Number of Components—An Alternative to Reversible-Jump Methods," The Annals of Statistics, 28, 40–74.

Stephens, M. (2000b), "Dealing With Label Switching in Mixture Models," Journal of the Royal Statistical Society, Ser. B, 62, 795–809.

Tadesse, M. G., Ibrahim, J. G., and Mutter, G. L. (2003), "Identification of Differentially Expressed Genes in High-Density Oligonucleotide Arrays Accounting for the Quantification Limits of the Technology," Biometrics, 59, 542–554.

Page 6: Bayesian Variable Selection in Clustering High-Dimensional ...marina/papers/jasa05.pdf · Bayesian Variable Selection in Clustering High-Dimensional Data Mahlet G. T ADESSE, Naijun

Tadesse Sha and Vannucci Bayesian Variable Selection 607

f (wnew|Gminus1)f (wold|G)

= B(α(Gminus1)α)

wαminus1l (1minuswl)

(Gminus1)(αminus1) G0 is the number of empty

components before the death move and f (wl) is the beta(1G)

density

6 POSTERIOR INFERENCE

We draw inference on the sample allocations conditionalon G Thus we first need to compute an estimate for this para-meter which we take to be the value most frequently visited bythe MCMC sampler In addition we need to address the label-switching problem

61 Label Switching

In finite mixture models an identifiability problem arisesfrom the invariance of the likelihood under permutation of thecomponent labels In the Bayesian paradigm this leads to sym-metric and multimodal posterior distributions with up to Gcopies of each ldquogenuinerdquo mode complicating inference on theparameters In particular it is not possible to form ergodic av-erages over the MCMC samples Traditional approaches to thisproblem impose identifiability constraints on the parametersfor instance w1 lt middot middot middot lt wG These constraints do not alwayssolve the problem however Recently Stephens (2000b) pro-posed a relabeling algorithm that takes a decision-theoretic ap-proach The procedure defines an appropriate loss function andpostprocesses the MCMC output to minimize the posterior ex-pected loss

Let P(ξ) = [pik(ξ)] be the matrix of allocation probabilities

pik(ξ) = Pr( yi = k|Xγ w θ)

= wk f (xi(γ )|θk(γ ))sumj wjf (xi(γ )|θ j(γ ))

(19)

In the context of clustering the loss function can be defined onthe cluster labels y

L0(y ξ) = minusnsum

i=1

logpiyi(ξ ) (20)

Let ξ (1) ξ (M) be the sampled parameters and let ν1 νM

be the permutations applied to them The parameters ξ (t) corre-spond to the component parameters w(t) micro(t) and (t) sampledat iteration t However in our methodology the componentmean and covariance parameters are integrated out and wedo not draw their posterior samples Therefore to calculatethe matrix of allocation probabilities in (19) we estimatethese parameters and set ξ (t) = w(t)

1 w(t)G x(t)

1(γ ) x(t)

G(γ )

S(t)1(γ )

S(t)G(γ )

where x(t)k(γ )

and S(t)k(γ )

are the estimates ofcluster krsquos mean and covariance based on the model visited atiteration t

The relabeling algorithm proceeds by selecting initial valuesfor the νtrsquos which we take to be the identity permutation theniterating the following steps until a fixed point is reached

a Choose y to minimizesumM

t=1 L0yνt(ξ(t))

b For t = 1 M choose νt to minimize L0yνt(ξ(t))

62 Posterior Densities and Posterior Estimates

Once the label switching is taken care of the MCMC samplescan be used to draw posterior inference Of particular interestare the allocation vector y and the variable selection vector γ

We compute the marginal posterior probability that sample iis allocated to cluster k yi = k as

p( yi = k|XG) =int

p(yi = ky(minusi)γ w

∣∣XG)

dy(minusi) dγ dw

propint

p(X yi = ky(minusi)γ w

∣∣G)

dy(minusi) dγ dw

=int

p(X yi = ky(minusi)

∣∣Gγ w)

times p(γ |G)p(w|G)dy(minusi) dγ dw

asympMsum

t=1

p(X y(t)

i = ky(t)(minusi)

∣∣Gγ (t)w(t))

times p(γ (t)

∣∣G)p(w(t)

∣∣G) (21)

where y(t)(minusi) is the vector y(t) at the tth MCMC iteration without

the ith element The posterior allocation of sample i can then beestimated by the mode of its marginal posterior density

yi = arg max1lekleG

p( yi = k|XG) (22)

Similarly the marginal posterior for γj = 1 can be com-puted as

p(γj = 1|XG) =int

p(γj = 1γ (minusj)wy|XG

)dγ (minusj) dy dw

propint

p(Xy γj = 1γ (minusj)w

∣∣G)

dγ (minusj) dy dw

=int

p(Xy

∣∣G γj = 1γ (minusj)w)

times p(γ |G)p(w|G)dγ (minusj) dy dw

asympMsum

t=1

p(Xy(t)

∣∣G γ(t)j = 1γ

(t)(minusj)w(t))

times p(γj = 1γ

(t)(minusj)

∣∣G)p(w(t)

∣∣G) (23)

where γ(t)(minusj) is the vector γ (t) at the tth iteration without the

jth element The best discriminating variables can then beidentified as those with largest marginal posterior p(γj = 1|XG) gt a where a is chosen arbitrarily

γj = Ip(γj=1|XG)gta (24)

An alternative variable selection can be performed by con-sidering the vector γ with largest posterior probability amongall visited vectors

γ lowast = arg max1letleM

p(γ (t)

∣∣XG w y)

(25)

where y is the vector of sample allocations estimated via (22)and w is given by w = 1

M

sumMt=1 w(t) The estimate given in (25)

considers the joint density of γ rather than the marginal distri-butions of the individual elements as (24) In the same spirit an

608 Journal of the American Statistical Association June 2005

estimate of the joint allocation of the samples can be obtainedas the configuration that yields the largest posterior probability

ylowast = arg max1letleM

p(y(t)

∣∣XG w γ)

(26)

where γ is the estimate in (24)

63 Class Prediction

The MCMC output can also be used to predict the class mem-bership of future observations Xf The predictive density isgiven by

p( yf = k|Xf XG)

=int

p( yf = kywγ |Xf XG)dy dγ dw

propint

p(Xf X yf = ky|Gwγ )p(γ |G)p(w|G)dy dγ dw

asympMsum

t=1

p(Xf X yf = ky(t)

∣∣γ (t)w(t)G)

times p(γ (t)

∣∣G)p(w(t)

∣∣G) (27)

and the observations are allocated according to yf

yf = arg max1lekleG

p( yf = k|Xf XG) (28)

7 PERFORMANCE OF OUR METHODOLOGY

In this section we investigate the performance of our method-ology using three datasets Because the Bayesian cluster-ing problem with an unknown number of components viaRJMCMC has been considered only in the univariate setting(Richardson and Green 1997) we first examine the efficiency ofour algorithm in the multivariate setting without variable selec-tion For this we use the benchmark iris data (Anderson 1935)We then explore the performance of our methodology for simul-taneous clustering and variable selection using a series of sim-ulated high-dimensional datasets We also analyze these data

using the COSA algorithm of Friedman and Meulman (2003)Finally we illustrate an application with DNA microarray datafrom an endometrial cancer study

71 Iris Data

This dataset consists of four covariatesmdashpetal width petalheight sepal width and sepal heightmdashmeasured on 50 flow-ers from each of three species Iris setosa Iris versicolor andIris virginica This dataset has been analyzed by several authors(see eg Murtagh and Hernaacutendez-Pajares 1995) and can be ac-cessed in SndashPLUS as a three-dimensional array

We rescaled each variable by its range and specified the priordistributions as described in Section 41 We took δ = 3 α = 1and h1 = 100 We set Gmax = 10 and κ1 = 03 a value com-mensurate with the variability in the data We considered differ-ent prior distributions for G truncated Poisson with λ = 5 andλ = 10 and a discrete uniform on [1 Gmax] We conductedthe analysis assuming both homogeneous and heterogeneouscovariances across clusters In all cases we used a burn-inof 10000 and a main run of 40000 iterations for inference

Let us first focus on the results using a truncated Pois-son(λ = 5) prior for G Figure 1 displays the trace plots forthe number of visited components under the assumption of het-erogeneous and homogeneous covariances Table 1 column 1gives the corresponding posterior probabilities p(G = k|X)Table 2 shows the posterior estimates for sample allocations ycomputed conditional on the most probable G accordingto (22) Under the assumption of heterogeneous covariancesacross clusters there is a strong support for G = 3 Two ofthe clusters successfully isolate all of the setosa and virginicaplants and the third cluster contains all of the versicolor ex-cept for five identified as virginica These results are satisfac-tory because the versicolor and virginica species are knownto have some overlap whereas the setosa is rather well sepa-rated Under the assumption of constant covariance there is astrong support for four clusters Again one cluster contains ex-clusively all of the setosa and all of the versicolor are grouped

(a) (b)

Figure 1 Iris Data Trace Plots of the Number of Clusters G (a) Heterogeneous covariance (b) homogeneous covariance

Tadesse Sha and Vannucci Bayesian Variable Selection 609

Table 1 Iris Data Posterior Distribution of G

Poisson(λ = 5) Poisson(λ = 10) Uniform[1 10]

k Different k Equal Different Σk Equal Σ Different Σk Equal Σ

3 6466 0 5019 0 8548 04 3124 7413 4291 4384 1417 65045 0403 3124 0545 2335 0035 34466 0007 0418 0138 2786 0 00507 0 0 0007 0484 0 08 0 0 0 0011 0 0

together except for one Two virginica flowers are assigned tothe versicolor group and the remaining are divided into twonew components Figure 2 shows the marginal posterior prob-abilities p( yi = k|XG = 4) computed using equation (21)We note that some of the flowers among the 12 virginica al-located to cluster IV had a close call for instance observa-tion 104 had marginal posteriors p( y104 = 3|XG) = 4169 andp( y104 = 4|XG) = 5823

We examined the sensitivity of the results to the choiceof the prior distribution on G and found them to be quiterobust The posterior probabilities p(G = k|X) assumingG sim Poisson(λ = 10) and G sim uniform[1 Gmax] are givenin Table 1 Under both priors we obtained sample allocationsthat are identical to those in Table 2 We also found littlesensitivity to the choice of the hyperparameter h1 with val-ues between 10 and 1000 giving satisfactory results Smallervalues of h1 tended to favor slightly more components but withsome clusters containing very few observations For examplefor h1 = 10 under the assumption of equal covariance acrossclusters Table 3 shows that the MCMC sampler gives strongersupport for G = 6 However one of the clusters contains onlyone observation and another contains only four observationsFigure 3 displays the marginal posterior probabilities for theallocation of these five samples p( yi = k|XG) there is littlesupport for assigning them to separate clusters

We use this example to also illustrate the label-switchingphenomenon described in Section 61 Figure 4 gives the traceplots of the component weights under the assumption of homo-geneous covariance before and after processing of the MCMCsamples We can easily identify two label-switching instancesin the raw output of Figure 4(a) One of these occurred be-tween the first and third components near iteration 1000 theother occurred around iteration 2400 between components 23 and 4 Figure 4(b) shows the trace plots after postprocessingof the MCMC output We note that the label switching is elim-inated and the multimodality in the estimates of the marginalposterior distributions of wk is removed It is now straightfor-ward to obtain sensible estimates using the posterior means

Table 2 Iris Data Sample Allocation y

Heterogeneouscovariance

Homogeneouscovariance

I II III I II III IV

Iris setosa 50 0 0 50 0 0 0Iris versicolor 0 45 5 0 49 1 0Iris virginica 0 0 50 0 2 36 12

Figure 2 Iris Data Poisson(λ = 5) With Equal Σ -Marginal PosteriorProbabilities of Sample Allocations p(yi = k|X G = 4) i = 1 150k = 1 4

72 Simulated Data

We generated a dataset of 15 observations arising from fourmultivariate normal densities with 20 variables such that

xij sim I1leile4N (micro1 σ21 ) + I5leile7N (micro2 σ

22 )

+ I8leile13N (micro3 σ23 ) + I14leile15N (micro4 σ

24 )

i = 1 15 j = 1 20

where Imiddot is the indicator function equal to 1 if the conditionis met Thus the first four samples arise from the same dis-tribution the next three come from the second group the fol-lowing six are from the third group and the last two are fromthe fourth group The component means were set to micro1 = 5micro2 = 2 micro3 = minus3 and micro4 = minus6 and the component varianceswere set to σ 2

1 = 15 σ 22 = 1 σ 2

3 = 5 and σ 24 = 2 We drew an

additional set of (pminus20) noisy variables that do not distinguishbetween the clusters from a standard normal density We con-sidered several values of p p = 50100500 1000 These datathus contain a small subset of discriminating variables togetherwith a large set of nonclustering variables The purpose is tostudy the ability of our method to uncover the cluster structureand identify the relevant covariates in the presence of increasingnumber of noisy variables

We permuted the columns of the data matrix X to dis-perse the predictors For each dataset we took δ = 3 α = 1

Table 3 Iris Data Sensitivity to Hyperparameter h Results With h = 10

Posterior distribution of G

k 4 5 6 7 8 9 10

p(G = k |X) 0747 1874 3326 2685 1120 0233 0015

Sample allocations yI II III IV V VI

Iris setosa 50 0 0 0 0 0Iris versicolor 0 45 1 0 4 0Iris virginica 0 2 35 12 0 1

610 Journal of the American Statistical Association June 2005

Figure 3 Iris Data Sensitivity to h1 Results for h1 = 10ndashMarginalPosterior Probabilities for Samples Allocated to Clusters V and VIp(yi = k|X G = 6) i = 1 5 k = 1 6

h1 = h0 = 100 and Gmax = n and assumed unequal covari-ances across clusters Results from the previous example haveshown little sensitivity to the choice of prior on G Here weconsidered a truncated Poisson prior with λ = 5 As describedin Section 41 some care is needed when choosing κ1 and κ0their values need to be commensurate with the variability of thedata The amount of total variability in the different simulateddatasets is of course widely different and in each case thesehyperparameters were chosen proportionally to the upper andlower decile of the non-zero eigenvalues For the prior of γ we chose a Bernoulli distribution and set the expected num-ber of included variables to 10 We used a starting model withone randomly selected variable and ran the MCMC chains for100000 iterations with 40000 sweeps as burn-in

Here we do inference on both the sample allocations and the variable selection. We obtained satisfactory results for all datasets. In all cases we successfully recovered the cluster structure used to simulate the data and identified the 20 discriminating variables. We present the summary plots associated with the largest dataset, where p = 1000. Figure 5(a) shows the trace plot for the number of visited components, G, and Table 4 reports the marginal posterior probabilities p(G|X). There is strong support for G = 4 and 5, with a slightly larger probability for the former. Figure 6 shows the marginal posterior probabilities of the sample allocations, p(yi = k|X, G = 4). We note that these match the group structure used to simulate the data. As for selection of the predictors, Figure 5(b) displays the number of variables selected at each MCMC iteration. In the first 30,000 iterations, before burn-in, the chain visited models with around 30–35 variables. After about 40,000 iterations, the chain stabilized to models with 15–20 covariates. Figure 7 displays the marginal posterior probabilities p(γj = 1|X, G = 4). There were 17 variables with marginal probabilities greater than .9, all of which were in the set of 20 covariates simulated to effectively discriminate the four clusters. If we lower the threshold for inclusion to .4, then all 20 variables are selected. Conditional on the allocations obtained via (22), the γ vector with largest posterior probability (25) among all visited models contained 19 of the 20 discriminating variables.
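A minimal sketch of this selection summary, assuming the post-burn-in draws of γ are stored as a 0/1 matrix: it uses simple ergodic averaging of the draws, whereas the estimate in (23) reweights visited models by their posterior probabilities, which is not reproduced here.

import numpy as np

def select_variables(gamma_draws, threshold=0.9):
    """gamma_draws: (n_sweeps, p) array of 0/1 inclusion indicators after burn-in.
    Returns Monte Carlo estimates of p(gamma_j = 1 | data) and the indices
    whose marginal probability exceeds the threshold."""
    marginal = gamma_draws.mean(axis=0)
    selected = np.flatnonzero(marginal > threshold)
    return marginal, selected

# e.g., with the thresholds used in the text:
# _, top = select_variables(gamma_draws, threshold=0.9)
# _, all20 = select_variables(gamma_draws, threshold=0.4)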

We also analyzed these datasets using the COSA algorithm of Friedman and Meulman (2003). As we mentioned in Section 2, this procedure performs variable selection in conjunction with hierarchical clustering. We present the results for the analysis of the simulated data with p = 1000. Figure 8 shows the dendrograms of the clustering results based on the non-targeted COSA distance. We considered single, average, and complete linkage, but none of these methods was able to recover the true cluster structure. The average linkage dendrogram [Fig. 8(b)] delineates one of the clusters that was partially recovered (the true cluster contains observations 8–13). Figure 8(d) displays the 20 highest relevance values of the variables that lead to this clustering. The variables that define the true cluster structure are indexed as 1–20, and we note that 16 of them appear in this plot.

Figure 4. Iris Data: Trace Plots of Component Weights Before and After Removal of the Label Switching. (a) Raw output; (b) processed output.


Figure 5. Trace Plots for Simulated Data With p = 1000. (a) Number of clusters, G; (b) number of included variables, pγ.

Table 4. Simulated Data With p = 1000: Posterior Distribution of G

 k     p(G = k|X)
 2       .0111
 3       .1447
 4       .3437
 5       .3375
 6       .1326
 7       .0216
 8       .0040
 9       .0015
10       .0008
11       .0005
12       .0007
13       .0012
14       .0002

Figure 6. Simulated Data With p = 1000: Marginal Posterior Probabilities of Sample Allocations p(yi = k | X, G = 4), i = 1, ..., 15, k = 1, ..., 4.

7.3 Application to Microarray Data

We now illustrate the methodology on microarray data from an endometrial cancer study. Endometrioid endometrial adenocarcinoma is a common gynecologic malignancy arising within the uterine lining. The disease usually occurs in postmenopausal women and is associated with the hormonal risk factor of protracted estrogen exposure unopposed by progestins. Despite its prevalence, the molecular mechanisms of its genesis are not completely known. DNA microarrays, with their ability to examine thousands of genes simultaneously, could be an efficient tool to isolate relevant genes and cluster tissues into different subtypes. This is especially important in cancer treatment, where different clinicopathologic groups are known to vary in their response to therapy, and genes identified to discriminate among the different subtypes may represent targets for therapeutic intervention and biomarkers for improved diagnosis.

Figure 7. Simulated Data With p = 1000: Marginal Posterior Probabilities for Inclusion of Variables p(γj = 1 | X, G = 4).


Figure 8. Analysis of Simulated Data With p = 1000 Using COSA. (a) Single linkage; (b) average linkage; (c) complete linkage; (d) relevance for delineated cluster.

Four normal endometrium samples and 10 endometrioid endometrial adenocarcinomas were collected from hysterectomy specimens. RNA samples were isolated from the 14 tissues and reverse-transcribed. The resulting cDNA targets were prepared following the manufacturer's instructions and hybridized to Affymetrix Hu6800 GeneChip arrays, which contain 7,070 probe sets. The scanned images were processed with the Affymetrix software, version 3.1. The average difference derived by the software was used to indicate transcript abundance, and a global normalization procedure was applied. The technology does not reliably quantify low expression levels, and it is subject to spot saturation at the high end. For this particular array, the limits of reliable detection were previously set to 20 and 16,000 (Tadesse, Ibrahim, and Mutter 2003). We used these same thresholds and removed probe sets with at least one unreliable reading from the analysis. This left us with p = 762 variables. We then log-transformed the expression readings to satisfy the assumption of normality, as suggested by the Box–Cox transformation. We also rescaled each variable by its range.
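A sketch of this filtering and transformation step, assuming the average-difference values are held in a samples-by-probes array (probe annotation and the global normalization are omitted); the thresholds follow the text.

import numpy as np

def preprocess_expression(X, lower=20.0, upper=16000.0):
    """X: (n_samples, n_probes) array of average-difference values.
    Drop probes with any reading outside the reliable range, log-transform,
    and rescale each remaining variable by its range."""
    reliable = np.all((X >= lower) & (X <= upper), axis=0)
    Xf = np.log(X[:, reliable])
    Xf = Xf / (Xf.max(axis=0) - Xf.min(axis=0))   # rescale by range
    return Xf, reliable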

As described in Section 4.1, we specified the priors with the hyperparameters δ = 3, α = 1, and h1 = h0 = 100. We set κ1 = 0.01 and κ0 = 0.1 and assumed unequal covariances across clusters. We used a truncated Poisson(λ = 5) prior for G, with Gmax = n. We took a Bernoulli prior for γ, with an expectation of 10 variables to be included in the model.

To avoid possible dependence of the results on the initial model, we ran four MCMC chains with widely different starting points: (1) γj set to 0 for all j's except one randomly selected; (2) γj set to 1 for 10 randomly selected j's; (3) γj set to 1 for 25 randomly selected j's; and (4) γj set to 1 for 50 randomly selected j's. For each of the MCMC chains, we ran 100,000 iterations, with 40,000 sweeps taken as burn-in.

We looked at both the allocation of tissues and the selection of genes whose expression best discriminates among the different groups. The MCMC sampler mixed steadily over the iterations. As shown in the histogram of Figure 9(a), the sampler visited mostly between two and five components, with stronger support for G = 3 clusters. Figure 9(b) shows the trace plot for the number of included variables for one of the chains, which visited mostly models with 25–35 variables. The other chains behaved similarly. Figure 10 gives the marginal posterior probabilities p(γj = 1|X, G = 3) for each of the chains. Despite the very different starting points, the four chains visited similar regions and exhibit broadly similar marginal plots. To assess the concordance of the results across the four chains, we looked at the correlation coefficients between the frequencies for inclusion of genes. These are reported in Figure 11, along with the corresponding pairwise scatterplots.


Figure 9. MCMC Output for Endometrial Data. (a) Number of visited clusters, G; (b) number of included variables, pγ.

We see that there is very good agreement across the different MCMC runs. The marginal probabilities, on the other hand, are not as concordant across the chains, because their derivation makes use of model posterior probabilities [see (23)], which are affected by the inclusion of correlated variables. This is a problem inherent to DNA microarray data, where multiple genes with similar expression profiles are often observed.
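A rough sketch of this concordance check, assuming each chain's post-burn-in γ draws are stored as a 0/1 matrix: it computes per-gene inclusion frequencies and their pairwise correlations across chains.

import numpy as np

def chain_concordance(gamma_draws_by_chain):
    """gamma_draws_by_chain: list of (n_sweeps, p) 0/1 arrays, one per chain.
    Returns the (n_chains x n_chains) correlation matrix of per-gene
    inclusion frequencies; the off-diagonal entries correspond to the
    coefficients reported in Figure 11."""
    freqs = np.column_stack([g.mean(axis=0) for g in gamma_draws_by_chain])  # p x n_chains
    return np.corrcoef(freqs, rowvar=False)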

We pooled and relabeled the output of the four chains, normalized the relative posterior probabilities, and recomputed p(γj = 1|X, G = 3). These marginal probabilities are displayed in Figure 12. There are 31 genes with posterior probability greater than .5. We also estimated the marginal posterior probabilities for the sample allocations, p(yi = k|X, G), based on the 31 selected genes. These are shown in Figure 13. We successfully identified the four normal tissues. The results also seem to suggest that there are possibly two subtypes within the malignant tissues.

Figure 10. Endometrial Data: Marginal posterior probabilities for inclusion of variables, p(γj = 1 | X, G = 3), for each of the four chains.

We also present the results from analyzing this dataset with the COSA algorithm. Figure 14 shows the resulting single linkage and average linkage dendrograms, along with the top 30 variables that distinguish the normal and tumor tissues. There is some overlap between the genes selected by COSA and those identified by our procedure: among the sets of 30 "best" discriminating genes selected by the two methods, 10 are common to both.

8. DISCUSSION

We have proposed a method for simultaneously clustering high-dimensional data with an unknown number of components and selecting the variables that best discriminate the different groups. We successfully applied the methodology to various datasets. Our method is fully Bayesian, and we provided standard default recommendations for the choice of priors.

Here we mention some possible extensions and directions for future research. We have drawn posterior inference conditional on a fixed number of components. A more attractive approach would incorporate the uncertainty on this parameter and combine the results for all the different values of G visited by the MCMC sampler. To our knowledge, this is an open problem that requires further research. One promising approach involves considering the posterior probability of pairs of observations being allocated to the same cluster, regardless of the number of visited components. As long as there is no interest in estimating the component parameters, the problem of label switching would not be a concern. But this leads to \binom{n}{2} probabilities, and it is not clear how best to summarize them. For example, Medvedovic and Sivaganesan (2002) used hierarchical clustering based on these pairwise probabilities as a similarity measure; a sketch of that summary is given below.
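A small sketch of this pairwise summary, assuming the allocation draws y(t) are stored row-wise (only co-membership matters, so label switching and a varying number of components are harmless); the scipy calls then turn the co-clustering probabilities into a distance for hierarchical clustering, in the spirit of Medvedovic and Sivaganesan (2002).

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def coclustering_probabilities(alloc_draws):
    """alloc_draws: (n_sweeps, n) integer cluster labels per MCMC sweep.
    Returns the n x n matrix of posterior co-clustering probabilities.
    (Builds a (T, n, n) boolean array; fine for small n such as these examples.)"""
    draws = np.asarray(alloc_draws)
    return (draws[:, :, None] == draws[:, None, :]).mean(axis=0)

# turn co-clustering probabilities into a condensed distance and cluster hierarchically
# P = coclustering_probabilities(alloc_draws)
# Z = linkage(1.0 - P[np.triu_indices_from(P, k=1)], method='average')
# labels = fcluster(Z, t=4, criterion='maxclust')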

Figure 11. Endometrial Data: Concordance of results among the four chains; pairwise scatterplots of frequencies for inclusion of genes.


Figure 12. Endometrial Data, Union of four chains: Marginal posterior probabilities p(γj = 1 | X, G = 3).


Figure 13. Endometrial Data, Union of four chains: Marginal posterior probabilities of sample allocations p(yi = k | X, G = 3), i = 1, ..., 14, k = 1, ..., 3.



Figure 14. Analysis of Endometrial Data Using COSA. (a) Single linkage; (b) average linkage; (c) variables leading to cluster 1; (d) variables leading to cluster 2.



An interesting future avenue would be to investigate the performance of our method when covariance constraints are imposed, as was proposed by Banfield and Raftery (1993). Their parameterization allows one to incorporate different assumptions on the volume, orientation, and shape of the clusters.

Another possible extension is to use empirical Bayes estimates to elicit the hyperparameters. For example, conditional on a fixed number of clusters, a complete log-likelihood can be computed using all covariates and a Monte Carlo EM approach developed for estimation. Alternatively, if prior information were available or subjective priors were preferable, then the prior setting could be modified accordingly. If interactions among predictors were of interest, then additional interaction terms could be included in the model and the prior on γ modified to accommodate the additional terms.

APPENDIX: FULL CONDITIONALS

Here we give details on the derivation of the marginalized full conditionals under the assumptions of equal and unequal covariances across clusters.

Homogeneous Covariance: Σ1 = · · · = ΣG = Σ

The marginal likelihood is obtained by integrating the component means, the mean of the nondiscriminating variables, and the two covariance matrices out of the complete-data likelihood,

\[
f(X, y \mid G, w, \gamma) = \int f(X, y \mid G, w, \mu, \Sigma, \eta, \Omega, \gamma)\, f(\mu \mid G, \Sigma, \gamma)\, f(\Sigma \mid G, \gamma)\, f(\eta \mid \Omega, \gamma)\, f(\Omega \mid \gamma)\, d\mu\, d\Sigma\, d\eta\, d\Omega .
\]

Completing the squares in the \(\mu_{k(\gamma)}\) and in \(\eta_{(\gamma^c)}\) (normal conjugacy) and then recognizing inverse-Wishart normalizing constants for \(\Sigma_{(\gamma)}\) and \(\Omega_{(\gamma^c)}\) gives the closed form

\[
\begin{aligned}
f(X, y \mid G, w, \gamma)
&= \pi^{-np/2}\Biggl[\prod_{k=1}^{G}(h_1 n_k + 1)^{-p_\gamma/2}\, w_k^{n_k}\Biggr](h_0 n + 1)^{-(p-p_\gamma)/2}\\
&\quad\times \prod_{j=1}^{p_\gamma}\frac{\Gamma\bigl((\delta+n+p_\gamma-j)/2\bigr)}{\Gamma\bigl((\delta+p_\gamma-j)/2\bigr)}
\;\prod_{j=1}^{p-p_\gamma}\frac{\Gamma\bigl((\delta+n+p-p_\gamma-j)/2\bigr)}{\Gamma\bigl((\delta+p-p_\gamma-j)/2\bigr)}\\
&\quad\times \bigl|Q_1(\gamma)\bigr|^{(\delta+p_\gamma-1)/2}\,\bigl|Q_0(\gamma^c)\bigr|^{(\delta+p-p_\gamma-1)/2}\\
&\quad\times \Biggl|Q_1(\gamma)+\sum_{k=1}^{G}S_k(\gamma)\Biggr|^{-(\delta+n+p_\gamma-1)/2}\,
\bigl|Q_0(\gamma^c)+S_0(\gamma^c)\bigr|^{-(\delta+n+p-p_\gamma-1)/2},
\end{aligned}
\]

where \(S_k(\gamma)\) and \(S_0(\gamma^c)\) are given by

\[
\begin{aligned}
S_k(\gamma) &= \sum_{x_{i(\gamma)}\in C_k} x_{i(\gamma)} x_{i(\gamma)}^{T} + h_1^{-1}\mu_{0(\gamma)}\mu_{0(\gamma)}^{T}
 - (n_k+h_1^{-1})^{-1}\Biggl(\sum_{x_{i(\gamma)}\in C_k} x_{i(\gamma)} + h_1^{-1}\mu_{0(\gamma)}\Biggr)\Biggl(\sum_{x_{i(\gamma)}\in C_k} x_{i(\gamma)} + h_1^{-1}\mu_{0(\gamma)}\Biggr)^{T}\\
&= \sum_{x_{i(\gamma)}\in C_k}\bigl(x_{i(\gamma)}-\bar{x}_{k(\gamma)}\bigr)\bigl(x_{i(\gamma)}-\bar{x}_{k(\gamma)}\bigr)^{T}
 + \frac{n_k}{h_1 n_k+1}\bigl(\mu_{0(\gamma)}-\bar{x}_{k(\gamma)}\bigr)\bigl(\mu_{0(\gamma)}-\bar{x}_{k(\gamma)}\bigr)^{T}
\end{aligned}
\]

and

\[
\begin{aligned}
S_0(\gamma^c) &= \sum_{i=1}^{n} x_{i(\gamma^c)} x_{i(\gamma^c)}^{T} + h_0^{-1}\mu_{0(\gamma^c)}\mu_{0(\gamma^c)}^{T}
 - (n+h_0^{-1})^{-1}\Biggl(\sum_{i=1}^{n} x_{i(\gamma^c)} + h_0^{-1}\mu_{0(\gamma^c)}\Biggr)\Biggl(\sum_{i=1}^{n} x_{i(\gamma^c)} + h_0^{-1}\mu_{0(\gamma^c)}\Biggr)^{T}\\
&= \sum_{i=1}^{n}\bigl(x_{i(\gamma^c)}-\bar{x}_{(\gamma^c)}\bigr)\bigl(x_{i(\gamma^c)}-\bar{x}_{(\gamma^c)}\bigr)^{T}
 + \frac{n}{h_0 n+1}\bigl(\mu_{0(\gamma^c)}-\bar{x}_{(\gamma^c)}\bigr)\bigl(\mu_{0(\gamma^c)}-\bar{x}_{(\gamma^c)}\bigr)^{T},
\end{aligned}
\]

with \(\bar{x}_{k(\gamma)} = n_k^{-1}\sum_{x_{i(\gamma)}\in C_k} x_{i(\gamma)}\) and \(\bar{x}_{(\gamma^c)} = n^{-1}\sum_{i=1}^{n} x_{i(\gamma^c)}\).
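As a numerical companion to the homogeneous-covariance expression reconstructed above, the sketch below evaluates log f(X, y | G, w, γ) for given hyperparameters (δ, h1, h0, μ0, Q1, Q0), a fixed allocation y, and a boolean selection vector γ; it assumes every cluster is nonempty and is only a sanity-check implementation of the formula as reconstructed here. Working on the log scale with gammaln and slogdet avoids overflow in the Γ ratios and determinants.

import numpy as np
from scipy.special import gammaln

def log_marginal_homogeneous(X, y, gamma, w, delta, h1, h0, mu0, Q1, Q0):
    """X: (n, p) data; y: integer labels in {0, ..., G-1}; gamma: boolean length-p mask;
    Q1: (p_g, p_g) prior scale for selected variables; Q0 for the remaining ones."""
    n, p = X.shape
    G = len(w)
    pg = int(gamma.sum())
    Xg, Xc = X[:, gamma], X[:, ~gamma]
    mu0g, mu0c = mu0[gamma], mu0[~gamma]

    def centered_scatter(Z, m0, h):
        # S = sum_i (z_i - zbar)(z_i - zbar)^T + m/(h m + 1) (mu0 - zbar)(mu0 - zbar)^T
        zbar = Z.mean(axis=0)
        ctr = Z - zbar
        m = Z.shape[0]
        return ctr.T @ ctr + m / (h * m + 1.0) * np.outer(m0 - zbar, m0 - zbar)

    logf = -0.5 * n * p * np.log(np.pi)
    S_sum = np.zeros((pg, pg))
    for k in range(G):
        Zk = Xg[y == k]
        nk = Zk.shape[0]
        logf += -0.5 * pg * np.log(h1 * nk + 1.0) + nk * np.log(w[k])
        S_sum += centered_scatter(Zk, mu0g, h1)
    S0 = centered_scatter(Xc, mu0c, h0)
    logf += -0.5 * (p - pg) * np.log(h0 * n + 1.0)

    j = np.arange(1, pg + 1)
    logf += np.sum(gammaln((delta + n + pg - j) / 2.0) - gammaln((delta + pg - j) / 2.0))
    jc = np.arange(1, p - pg + 1)
    logf += np.sum(gammaln((delta + n + p - pg - jc) / 2.0) - gammaln((delta + p - pg - jc) / 2.0))

    logf += 0.5 * (delta + pg - 1) * np.linalg.slogdet(Q1)[1]
    logf += 0.5 * (delta + p - pg - 1) * np.linalg.slogdet(Q0)[1]
    logf += -0.5 * (delta + n + pg - 1) * np.linalg.slogdet(Q1 + S_sum)[1]
    logf += -0.5 * (delta + n + p - pg - 1) * np.linalg.slogdet(Q0 + S0)[1]
    return logf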

Heterogeneous Covariances

The marginal likelihood has the same integral form as before,

\[
f(X, y \mid G, w, \gamma) = \int f(X, y \mid G, w, \mu, \Sigma, \eta, \Omega, \gamma)\, f(\mu \mid G, \Sigma, \gamma)\, f(\Sigma \mid G, \gamma)\, f(\eta \mid \Omega, \gamma)\, f(\Omega \mid \gamma)\, d\mu\, d\Sigma\, d\eta\, d\Omega ,
\]

where now \(\Sigma = (\Sigma_{1(\gamma)}, \ldots, \Sigma_{G(\gamma)})\) is cluster specific. The integrations over \(\eta_{(\gamma^c)}\) and \(\Omega_{(\gamma^c)}\) are exactly as in the homogeneous case; for the selected variables, each pair \((\mu_{k(\gamma)}, \Sigma_{k(\gamma)})\) is integrated out within its own cluster, giving

\[
\begin{aligned}
f(X, y \mid G, w, \gamma)
&= \pi^{-np/2}\prod_{k=1}^{G}\Biggl\{(h_1 n_k + 1)^{-p_\gamma/2}\, w_k^{n_k}
\prod_{j=1}^{p_\gamma}\frac{\Gamma\bigl((\delta+n_k+p_\gamma-j)/2\bigr)}{\Gamma\bigl((\delta+p_\gamma-j)/2\bigr)}\\
&\qquad\qquad\times
\bigl|Q_1(\gamma)\bigr|^{(\delta+p_\gamma-1)/2}\,\bigl|Q_1(\gamma)+S_k(\gamma)\bigr|^{-(\delta+n_k+p_\gamma-1)/2}\Biggr\}\\
&\quad\times (h_0 n + 1)^{-(p-p_\gamma)/2}
\prod_{j=1}^{p-p_\gamma}\frac{\Gamma\bigl((\delta+n+p-p_\gamma-j)/2\bigr)}{\Gamma\bigl((\delta+p-p_\gamma-j)/2\bigr)}\\
&\quad\times \bigl|Q_0(\gamma^c)\bigr|^{(\delta+p-p_\gamma-1)/2}\,\bigl|Q_0(\gamma^c)+S_0(\gamma^c)\bigr|^{-(\delta+n+p-p_\gamma-1)/2},
\end{aligned}
\]

with \(S_k(\gamma)\) and \(S_0(\gamma^c)\) as defined earlier.

[Received May 2003. Revised August 2004.]

REFERENCES

Anderson, E. (1935), "The Irises of the Gaspe Peninsula," Bulletin of the American Iris Society, 59, 2–5.

Banfield, J. D., and Raftery, A. E. (1993), "Model-Based Gaussian and Non-Gaussian Clustering," Biometrics, 49, 803–821.

Brown, P. J. (1993), Measurement, Regression and Calibration, Oxford, U.K.: Clarendon Press.

Brown, P. J., Vannucci, M., and Fearn, T. (1998a), "Bayesian Wavelength Selection in Multicomponent Analysis," Chemometrics, 12, 173–182.

——— (1998b), "Multivariate Bayesian Variable Selection and Prediction," Journal of the Royal Statistical Society, Ser. B, 60, 627–641.

Brusco, M. J., and Cradit, J. D. (2001), "A Variable Selection Heuristic for k-Means Clustering," Psychometrika, 66, 249–270.

Chang, W.-C. (1983), "On Using Principal Components Before Separating a Mixture of Two Multivariate Normal Distributions," Applied Statistics, 32, 267–275.

Diebolt, J., and Robert, C. P. (1994), "Estimation of Finite Mixture Distributions Through Bayesian Sampling," Journal of the Royal Statistical Society, Ser. B, 56, 363–375.

Fowlkes, E. B., Gnanadesikan, R., and Kettering, J. R. (1988), "Variable Selection in Clustering," Journal of Classification, 5, 205–228.

Friedman, J. H., and Meulman, J. J. (2003), "Clustering Objects on Subsets of Attributes," technical report, Stanford University, Dept. of Statistics and Stanford Linear Accelerator Center.

George, E. I., and McCulloch, R. E. (1997), "Approaches for Bayesian Variable Selection," Statistica Sinica, 7, 339–373.

Ghosh, D., and Chinnaiyan, A. M. (2002), "Mixture Modelling of Gene Expression Data From Microarray Experiments," Bioinformatics, 18, 275–286.

Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. (eds.) (1996), Markov Chain Monte Carlo in Practice, London: Chapman & Hall.

Gnanadesikan, R., Kettering, J. R., and Tao, S. L. (1995), "Weighting and Selection of Variables for Cluster Analysis," Journal of Classification, 12, 113–136.

Green, P. J. (1995), "Reversible-Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination," Biometrika, 82, 711–732.

Madigan, D., and York, J. (1995), "Bayesian Graphical Models for Discrete Data," International Statistical Review, 63, 215–232.

McLachlan, G. J., Bean, R. W., and Peel, D. (2002), "A Mixture Model-Based Approach to the Clustering of Microarray Expression Data," Bioinformatics, 18, 413–422.

Medvedovic, M., and Sivaganesan, S. (2002), "Bayesian Infinite Mixture Model-Based Clustering of Gene Expression Profiles," Bioinformatics, 18, 1194–1206.

Milligan, G. W. (1989), "A Validation Study of a Variable Weighting Algorithm for Cluster Analysis," Journal of Classification, 6, 53–71.

Murtagh, F., and Hernández-Pajares, M. (1995), "The Kohonen Self-Organizing Map Method: An Assessment," Journal of Classification, 12, 165–190.

Richardson, S., and Green, P. J. (1997), "On Bayesian Analysis of Mixtures With an Unknown Number of Components" (with discussion), Journal of the Royal Statistical Society, Ser. B, 59, 731–792.

Sha, N., Vannucci, M., Tadesse, M. G., Brown, P. J., Dragoni, I., Davies, N., Roberts, T., Contestabile, A., Salmon, M., Buckley, C., and Falciani, F. (2004), "Bayesian Variable Selection in Multinomial Probit Models to Identify Molecular Signatures of Disease Stage," Biometrics, 60, 812–819.

Stephens, M. (2000a), "Bayesian Analysis of Mixture Models With an Unknown Number of Components—An Alternative to Reversible-Jump Methods," The Annals of Statistics, 28, 40–74.

——— (2000b), "Dealing With Label Switching in Mixture Models," Journal of the Royal Statistical Society, Ser. B, 62, 795–809.

Tadesse, M. G., Ibrahim, J. G., and Mutter, G. L. (2003), "Identification of Differentially Expressed Genes in High-Density Oligonucleotide Arrays Accounting for the Quantification Limits of the Technology," Biometrics, 59, 542–554.


We use this example to also illustrate the label-switchingphenomenon described in Section 61 Figure 4 gives the traceplots of the component weights under the assumption of homo-geneous covariance before and after processing of the MCMCsamples We can easily identify two label-switching instancesin the raw output of Figure 4(a) One of these occurred be-tween the first and third components near iteration 1000 theother occurred around iteration 2400 between components 23 and 4 Figure 4(b) shows the trace plots after postprocessingof the MCMC output We note that the label switching is elim-inated and the multimodality in the estimates of the marginalposterior distributions of wk is removed It is now straightfor-ward to obtain sensible estimates using the posterior means

Table 2 Iris Data Sample Allocation y

Heterogeneouscovariance

Homogeneouscovariance

I II III I II III IV

Iris setosa 50 0 0 50 0 0 0Iris versicolor 0 45 5 0 49 1 0Iris virginica 0 0 50 0 2 36 12

Figure 2 Iris Data Poisson(λ = 5) With Equal Σ -Marginal PosteriorProbabilities of Sample Allocations p(yi = k|X G = 4) i = 1 150k = 1 4

72 Simulated Data

We generated a dataset of 15 observations arising from fourmultivariate normal densities with 20 variables such that

xij sim I1leile4N (micro1 σ21 ) + I5leile7N (micro2 σ

22 )

+ I8leile13N (micro3 σ23 ) + I14leile15N (micro4 σ

24 )

i = 1 15 j = 1 20

where Imiddot is the indicator function equal to 1 if the conditionis met Thus the first four samples arise from the same dis-tribution the next three come from the second group the fol-lowing six are from the third group and the last two are fromthe fourth group The component means were set to micro1 = 5micro2 = 2 micro3 = minus3 and micro4 = minus6 and the component varianceswere set to σ 2

1 = 15 σ 22 = 1 σ 2

3 = 5 and σ 24 = 2 We drew an

additional set of (pminus20) noisy variables that do not distinguishbetween the clusters from a standard normal density We con-sidered several values of p p = 50100500 1000 These datathus contain a small subset of discriminating variables togetherwith a large set of nonclustering variables The purpose is tostudy the ability of our method to uncover the cluster structureand identify the relevant covariates in the presence of increasingnumber of noisy variables

We permuted the columns of the data matrix X to dis-perse the predictors For each dataset we took δ = 3 α = 1

Table 3 Iris Data Sensitivity to Hyperparameter h Results With h = 10

Posterior distribution of G

k 4 5 6 7 8 9 10

p(G = k |X) 0747 1874 3326 2685 1120 0233 0015

Sample allocations yI II III IV V VI

Iris setosa 50 0 0 0 0 0Iris versicolor 0 45 1 0 4 0Iris virginica 0 2 35 12 0 1

610 Journal of the American Statistical Association June 2005

Figure 3 Iris Data Sensitivity to h1 Results for h1 = 10ndashMarginalPosterior Probabilities for Samples Allocated to Clusters V and VIp(yi = k|X G = 6) i = 1 5 k = 1 6

h1 = h0 = 100 and Gmax = n and assumed unequal covari-ances across clusters Results from the previous example haveshown little sensitivity to the choice of prior on G Here weconsidered a truncated Poisson prior with λ = 5 As describedin Section 41 some care is needed when choosing κ1 and κ0their values need to be commensurate with the variability of thedata The amount of total variability in the different simulateddatasets is of course widely different and in each case thesehyperparameters were chosen proportionally to the upper andlower decile of the non-zero eigenvalues For the prior of γ we chose a Bernoulli distribution and set the expected num-ber of included variables to 10 We used a starting model withone randomly selected variable and ran the MCMC chains for100000 iterations with 40000 sweeps as burn-in

Here we do inference on both the sample allocations andthe variable selection We obtained satisfactory results for all

datasets In all cases we successfully recovered the clusterstructure used to simulate the data and identified the 20 dis-criminating variables We present the summary plots associatedwith the largest dataset where p = 1000 Figure 5(a) shows thetrace plot for the number of visited components G and Table 4reports the marginal posterior probabilities p(G|X) There is astrong support for G = 4 and 5 with a slightly larger probabilityfor the former Figure 6 shows the marginal posterior probabili-ties of the sample allocations p( yi = k|XG = 4) We note thatthese match the group structure used to simulate the data Asfor selection of the predictors Figure 5(b) displays the num-ber of variables selected at each MCMC iteration In the first30000 iterations before burn-in the chain visited models witharound 30ndash35 variables After about 40000 iterations the chainstabilized to models with 15ndash20 covariates Figure 7 displaysthe marginal posterior probabilities p(γj = 1|XG = 4) Therewere 17 variables with marginal probabilities greater than 9 allof which were in the set of 20 covariates simulated to effectivelydiscriminate the four clusters If we lower the threshold for in-clusion to 4 then all 20 variables are selected Conditional onthe allocations obtained via (22) the γ vector with largest pos-terior probability (25) among all visited models contained 19 ofthe 20 discriminating variables

We also analyzed these datasets using the COSA algorithm ofFriedman and Meulman (2003) As we mentioned in Section 2this procedure performs variable selection in conjunction withhierarchical clustering We present the results for the analy-sis of the simulated data with p = 1000 Figure 8 shows thedendrogram of the clustering results based on the non-targetedCOSA distance We considered single average and completelinkage but none of these methods was able to recover the truecluster structure The average linkage dendrogram [Fig 8(b)]delineates one of the clusters that was partially recovered (thetrue cluster contains observations 8ndash13) Figure 8(d) displaysthe 20 highest relevance values of the variables that lead to thisclustering The variables that define the true cluster structure areindexed as 1ndash20 and we note that 16 of them appear in this plot

(a) (b)

Figure 4 Iris Data Trace Plots of Component Weights Before and After Removal of the Label Switching (a) Raw output (b) processed output

Tadesse Sha and Vannucci Bayesian Variable Selection 611

(a) (b)

Figure 5 Trace Plots for Simulated Data With p = 1000 (a) Number of clusters G (b) number of included variables pγ

Table 4 Simulated Data With p = 1000Posterior Distribution of G

k p(G = k|X )

2 01113 14474 34375 33756 13267 02168 00409 0015

10 000811 000512 000713 001214 0002

Figure 6 Simulated Data With p = 1000 Marginal Posterior Prob-abilities of Sample Allocations p(yi = k|X G = 4) i = 1 15k = 1 4

73 Application to Microarray Data

We now illustrate the methodology on microarray datafrom an endometrial cancer study Endometrioid endometrialadenocarcinoma is a common gynecologic malignancy aris-ing within the uterine lining The disease usually occurs inpostmenopausal women and is associated with the hormonalrisk factor of protracted estrogen exposure unopposed byprogestins Despite its prevalence the molecular mechanismsof its genesis are not completely known DNA microarrayswith their ability to examine thousands of genes simultaneouslycould be an efficient tool to isolate relevant genes and clustertissues into different subtypes This is especially important incancer treatment where different clinicopathologic groups areknown to vary in their response to therapy and genes identifiedto discriminate among the different subtypes may represent tar-gets for therapeutic intervention and biomarkers for improveddiagnosis

Figure 7 Simulated Data With p = 1000 Marginal Posterior Proba-bilities for Inclusion of Variables p(γj = 1|X G = 4)

612 Journal of the American Statistical Association June 2005

(a) (b)

(c) (d)

Figure 8 Analysis of Simulated Data With p = 1000 Using COSA (a) Single linkage (b) average linkage (c) complete linkage (d) relevancefor delineated cluster

Four normal endometrium samples and 10 endometrioidendometrial adenocarcinomas were collected from hysterec-tomy specimens RNA samples were isolated from the 14 tis-sues and reverse-transcribed The resulting cDNA targets wereprepared following the manufacturerrsquos instructions and hy-bridized to Affymetrix Hu6800 GeneChip arrays which con-tain 7070 probe sets The scanned images were processed withthe Affymetrix software version 31 The average differencederived by the software was used to indicate a transcript abun-dance and a global normalization procedure was applied Thetechnology does not reliably quantify low expression levels andit is subject to spot saturation at the high end For this particu-lar array the limits of reliable detection were previously set to20 and 16000 (Tadesse Ibrahim and Mutter 2003) We usedthese same thresholds and removed probe sets with at least oneunreliable reading from the analysis This left us with p = 762variables We then log-transformed the expression readings tosatisfy the assumption of normality as suggested by the BoxndashCox transformation We also rescaled each variable by its range

As described in Section 41 we specified the priors with thehyperparameters δ = 3 α = 1 and h1 = h0 = 100 We setκ1 = 001 and κ0 = 01 and assumed unequal covariancesacross clusters We used a truncated Poisson(λ = 5) prior for G

with Gmax = n We took a Bernoulli prior for γ with an expec-tation of 10 variables to be included in the model

To avoid possible dependence of the results on the initialmodel we ran four MCMC chains with widely different startingpoints (1) γj set to 0 for all jrsquos except one randomly selected(2) γj set to 1 for 10 randomly selected jrsquos (3) γj set to 1 for25 randomly selected jrsquos and (4) γj set to 1 for 50 randomlyselected jrsquos For each of the MCMC chains we ran 100000 it-erations with 40000 sweeps taken as a burn-in

We looked at both the allocation of tissues and the selec-tion of genes whose expression best discriminate among thedifferent groups The MCMC sampler mixed steadily overthe iterations As shown in the histogram of Figure 9(a) thesampler visited mostly between two and five components witha stronger support for G = 3 clusters Figure 9(b) shows thetrace plot for the number of included variables for one ofthe chains which visited mostly models with 25ndash35 variablesThe other chains behaved similarly Figure 10 gives the mar-ginal posterior probabilities p(γj = 1|XG = 3) for each ofthe chains Despite the very different starting points the fourchains visited similar regions and exhibit broadly similar mar-ginal plots To assess the concordance of the results across thefour chains we looked at the correlation coefficients betweenthe frequencies for inclusion of genes These are reported in

Tadesse Sha and Vannucci Bayesian Variable Selection 613

(a) (b)

Figure 9 MCMC Output for Endometrial Data (a) Number of visited clusters G (b) number of included variables pγ

Figure 11 along with the corresponding pairwise scatterplotsWe see that there is very good agreement across the differentMCMC runs The marginal probabilities on the other hand arenot as concordant across the chains because their derivationmakes use of model posterior probabilities [see (23)] which areaffected by the inclusion of correlated variables This is a prob-lem inherent to DNA microarray data where multiple geneswith similar expression profiles are often observed

We pooled and relabeled the output of the four chains nor-malized the relative posterior probabilities and recomputedp(γj = 1|XG = 3) These marginal probabilities are displayedin Figure 12 There are 31 genes with posterior probabilitygreater than 5 We also estimated the marginal posterior proba-bilities for the sample allocations p( yi = k|XG) based on the31 selected genes These are shown in Figure 13 We success-fully identified the four normal tissues The results also seem tosuggest that there are possibly two subtypes within the malig-nant tissues

Figure 10 Endometrial Data Marginal posterior probabilities for in-clusion of variables p(γj = 1|X G = 3) for each of the four chains

We also present the results from analyzing this dataset withthe COSA algorithm Figure 14 shows the resulting singlelinkage and average linkage dendrograms along with the top30 variables that distinguish the normal and tumor tissuesThere is some overlap between the genes selected by COSA andthose identified by our procedure Among the sets of 30 ldquobestrdquodiscriminating genes selected by the two methods 10 are com-mon to both

8 DISCUSSION

We have proposed a method for simultaneously clusteringhigh-dimensional data with an unknown number of componentsand selecting the variables that best discriminate the differentgroups We successfully applied the methodology to variousdatasets Our method is fully Bayesian and we provided stan-dard default recommendations for the choice of priors

Here we mention some possible extensions and directionsfor future research We have drawn posterior inference con-ditional on a fixed number of components A more attractive

Figure 11 Endometrial Data Concordance of results among the fourchains Pairwise scatterplots of frequencies for inclusion of genes

614 Journal of the American Statistical Association June 2005

Figure 12 Endometrial Data Union of four chains Marginal posteriorprobabilities p(γj = 1|X G = 3)

approach would incorporate the uncertainty on this parame-ter and combine the results for all different values of G vis-ited by the MCMC sampler To our knowledge this is an open

Figure 13 Endometrial Data Union of four chains Marginal poste-rior probabilities of sample allocations p(yi = k|X G = 3) i = 1 14k = 1 3

problem that requires further research One promising approachinvolves considering the posterior probability of pairs of ob-

(a) (b)

(c) (d)

Figure 14 Analysis of Endometrial Data Using COSA (a) Single linkage (b) average linkage (c) variables leading to cluster 1 (d) variablesleading to cluster 2

Tadesse Sha and Vannucci Bayesian Variable Selection 615

servations being allocated to the same cluster regardless of thenumber of visited components As long as there is no interestin estimating the component parameters the problem of labelswitching would not be a concern But this leads to

(n2

)proba-

bilities and it is not clear how to best summarize them For ex-ample Medvedovic and Sivaganesan (2002) used hierarchicalclustering based on these pairwise probabilities as a similaritymeasure

An interesting future avenue would be to investigate the per-formance of our method when covariance constraints are im-posed as was proposed by Banfield and Raftery (1993) Theirparameterization allows one to incorporate different assump-tions on the volume orientation and shape of the clusters

Another possible extension is to use empirical Bayes esti-mates to elicit the hyperparameters For example conditionalon a fixed number of clusters a complete log-likelihood canbe computed using all covariates and a Monte Carlo EM ap-proach developed for estimation Alternatively if prior infor-mation were available or subjective priors were preferable thenthe prior setting could be modified accordingly If interactionsamong predictors were of interest then additional interactionterms could be included in the model and the prior on γ couldbe modified to accommodate the additional terms

APPENDIX FULL CONDITIONALS

Here we give details on the derivation of the marginalized full con-ditionals under the assumption of equal and unequal covariances acrossclusters

Homogeneous Covariance 1 = middot middot middot = G =

f (Xy|Gwγ )

=int

f (Xy|Gwmicroηγ )f (micro|Gγ )f (|Gγ )

times f (η|γ )f (|γ )dmicrod dη d

=int Gprod

k=1

wnk

k

∣∣(γ )

∣∣minus(nk+1)2(2π)minus(nk+1)pγ 2h

minuspγ 21

times exp

minus1

2

Gsum

k=1

sum

xi(γ )isinCk

(xi(γ ) minus microk(γ )

)Tminus1

(γ )

(xi(γ ) minus microk(γ )

)

times exp

minus1

2

Gsum

k=1

(microk(γ ) minus micro0(γ )

)T(h1(γ )

)minus1(microk(γ ) minus micro0(γ )

)

times 2minuspγ (δ+pγ minus1)2πminuspγ (pγ minus1)4

( pγprod

j=1

(δ + pγ minus j

2

))minus1

times ∣∣Q1(γ )

∣∣(δ+pγ minus1)2∣∣(γ )

∣∣minus(δ+2pγ )2

times exp

minus1

2tr(minus1

(γ )Q1(γ )

)

times ∣∣(γ c)

∣∣minus(n+1)2(2π)minus(n+1)(pminuspγ )2h

minus(pminuspγ )20

times exp

minus1

2

nsum

i=1

(xi(γ c) minus η(γ c)

)Tminus1

(γ c)

(xi(γ c) minus η(γ c)

)

times exp

minus1

2

(η(γ c) minus micro0(γ c)

)T(h0(γ c)

)minus1(η(γ c) minus micro0(γ c)

)

times 2minus(pminuspγ )(δ+pminuspγ minus1)2πminus(pminuspγ )(pminuspγ minus1)4

times( pminuspγprod

j=1

(δ + p minus pγ minus j

2

))minus1

times ∣∣Q0(γ c)

∣∣(δ+pminuspγ minus1)2∣∣(γ c)

∣∣minus(δ+2(pminuspγ ))2

times exp

minus1

2tr(minus1

(γ c)Q0(γ c)

)

dmicrod dη d

=int Gprod

k=1

(2π)minus(nk+1)pγ 2h

minuspγ 21 wnk

k

∣∣(γ )

∣∣minus(nk+1)2

times exp

minus1

2tr(minus1

(γ )Sk(γ )

)

times exp

minus1

2

Gsum

k=1

(microk(γ ) minus

sumxi(γ )isinCk

xi(γ ) + hminus11 micro0(γ )

nk + hminus11

)T

times (nk + hminus11 )minus1

(γ )

(microk(γ ) minus

sumxi(γ )isinCk

xi(γ ) + hminus11 micro0(γ )

nk + hminus11

)

times 2minuspγ (δ+pγ minus1)2πminuspγ (pγ minus1)4

( pγprod

j=1

(δ + pγ minus j

2

))minus1

times ∣∣Q1(γ )

∣∣(δ+pγ minus1)2∣∣(γ )

∣∣minus(δ+2pγ )2

times exp

minus1

2tr(minus1

(γ )Q1(γ )

)

times (2π)minus(n+1)(pminuspγ )2hminus(pminuspγ )20

∣∣(γ c)

∣∣minus(n+1)2

times exp

minus1

2tr(minus1

(γ c)S0(γ c)

)

times exp

minus1

2

(η(γ c) minus

sumni=1 xi(γ c) + hminus1

0 micro0(γ c)

n + hminus10

)T

times (n + hminus10 )minus1

(γ c)

(η(γ c) minus

sumni=1 xi(γ c) + hminus1

0 micro0(γ c)

n + hminus10

)

times 2minus(pminuspγ )(δ+pminuspγ minus1)2πminus(pminuspγ )(pminuspγ minus1)4

times( pminuspγprod

j=1

(δ + p minus pγ minus j

2

))minus1

times ∣∣Q0(γ c)

∣∣(δ+pminuspγ minus1)2∣∣(γ c)

∣∣minus(δ+2(pminuspγ ))2

times exp

minus1

2tr(minus1

0(γ c)Q0(γ c)

)dmicrod dη d

=[ Gprod

k=1

(h1nk + 1)minuspγ 2wnkk

]πminusnpγ 2

times ∣∣Q1(γ )

∣∣(δ+pγ minus1)22minuspγ (δ+n+pγ minus1)2πminuspγ (pγ minus1)4

times( pγprod

j=1

(δ + pγ minus j

2

))minus1

timesint

∣∣(γ )

∣∣minus(δ+n+2pγ )2

times exp

minus1

2tr

(minus1

(γ )

[Q1(γ ) +

Gsum

k=1

Sk(γ )

])d

616 Journal of the American Statistical Association June 2005

times (h0n + 1)minus(pminuspγ )2πminusn(pminuspγ )2

times ∣∣Q0(γ c)

∣∣(δ+pminuspγ minus1)2

times 2minus(pminuspγ )(δ+n+pminuspγ minus1)2πminus(pminuspγ )(pminuspγ minus1)4

times( pminuspγprod

j=1

(δ + p minus pγ minus j

2

))minus1

timesint

∣∣(γ c)

∣∣minus(δ+n+2(pminuspγ ))2

times exp

minus1

2tr(minus1

(γ c)

[Q0(γ ) + S0(γ c)

])d13

= πminusnp2

[ Gprod

k=1

(h1nk + 1)minuspγ 2wnkk

](h0n + 1)minus(pminuspγ )2

timespγprod

j=1

(

(δ + n + pγ minus j

2

)

(δ + pγ minus j

2

))

timespminuspγprod

j=1

(

(δ + n + p minus pγ minus j

2

)

(δ + p minus pγ minus j

2

))

times ∣∣Q1(γ )

∣∣(δ+pγ minus1)2∣∣Q0(γ c)

∣∣(δ+pminuspγ minus1)2

times∣∣∣∣∣Q1(γ ) +

Gsum

k=1

Sk(γ )

∣∣∣∣∣

minus(δ+n+pγ minus1)2

times ∣∣Q0(γ c) + S0(γ c)

∣∣minus(δ+n+pminuspγ minus1)2

where Sk(γ ) and S0(γ ) are given by

Sk(γ ) =sum

xi(γ )isinCk

xi(γ )xTi(γ ) + hminus1

1 micro0(γ )microT0(γ )

minus (nk + hminus11 )minus1

( sum

xi(γ )isinCk

xi(γ ) + hminus11 micro0(γ )

)

times( sum

xi(γ )isinCk

xi(γ ) + hminus11 micro0(γ )

)T

=sum

xi(γ )isinCk

(xi(γ ) minus xk(γ )

)(xi(γ ) minus xk(γ )

)T

+ nk

h1nk + 1

(micro0(γ ) minus xk(γ )

)(micro0(γ ) minus xk(γ )

)T

and

S0(γ ) =nsum

i=1

xi(γ c)xTi(γ c) + hminus1

0 micro0(γ c)microT0(γ c)

minus (n + hminus10 )minus1

( nsum

i=1

xi(γ c) + hminus10 micro0(γ c)

)

times( nsum

i=1

xi(γ c) + hminus10 micro0(γ c)

)T

=nsum

i=1

(xi(γ c) minus x(γ c)

)(xi(γ c) minus x(γ c)

)T

+ n

h0n + 1

(micro0(γ c) minus x(γ c)

)(micro0(γ c) minus x(γ c)

)T

with xk(γ ) = 1nk

sumxi(γ )isinCk

xi(γ ) and x(γ c) = 1n

sumni=1 xi(γ c)

Heterogeneous Covariances

f (Xy|Gwγ )

=int

f (Xy|Gwmicroηγ )f (micro|Gγ )f (|Gγ )

times f (η|γ )f (|γ )dmicrod dη d

=int Gprod

k=1

wnk

k

∣∣k(γ )

∣∣minus(nk+1)2(2π)minus(nk+1)pγ 2h

minuspγ 21

times 2minuspγ (δ+pγ minus1)2πminuspγ (pγ minus1)4

( pγprod

j=1

(δ + pγ minus j

2

))minus1

times exp

minus1

2

Gsum

k=1

sum

xi(γ )isinCk

(xi(γ ) minus microk

)Tminus1

k(γ )

(xi(γ ) minus microk

)

times exp

minus1

2

Gsum

k=1

(microk(γ ) minus micro0(γ )

)T(h1k(γ )

)minus1

times (microk(γ ) minus micro0(γ )

)

timesGprod

k=1

∣∣Q1(γ )

∣∣(δ+pγ minus1)2∣∣k(γ )

∣∣minus(δ+2pγ )2

times exp

minus1

2tr(minus1

k(γ )Q1(γ )

)dmicrod

times πminusn(pminuspγ )2(h0n + 1)minus(pminuspγ )2

timespminuspγprod

j=1

(

(δ + n + p minus pγ minus j

2

)

(δ + p minus pγ minus j

2

))

times ∣∣Q0(γ c)

∣∣(δ+pminuspγ minus1)2∣∣Q0(γ c) + S0(γ c)

∣∣minus(δ+n+pminuspγ minus1)2

=int

int

micro

Gprod

k=1

(2π)minus(nk+1)pγ 2h

minuspγ 21 wnk

k 2minuspγ (δ+pγ minus1)2

times πminuspγ (pγ minus1)4

( pγprod

j=1

(δ + pγ minus j

2

))minus1

times ∣∣k(γ )

∣∣minus(nk+1)2

times exp

minus1

2

Gsum

k=1

(microk(γ ) minus

sumxi(γ )isinCk

xi(γ ) + hminus11 micro0(γ )

nk + hminus11

)T

times (nk + hminus11 )minus1

k(γ )

(microk(γ ) minus

sumxi(γ )isinCk

xi(γ ) + hminus11 micro0(γ )

nk + hminus11

)

timesGprod

k=1

∣∣Q1(γ )

∣∣(δ+pγ minus1)2∣∣k(γ )

∣∣minus(δ+2pγ )2

times exp

minus1

2tr(minus1

k(γ )

[Q1(γ ) + Sk(γ )

])dmicrod

times πminusn(pminuspγ )2(h0n + 1)minus(pminuspγ )2

Tadesse Sha and Vannucci Bayesian Variable Selection 617

timespminuspγprod

j=1

(

(δ + n + p minus pγ minus j

2

)

(δ + p minus pγ minus j

2

))

times ∣∣Q0(γ c)

∣∣(δ+pminuspγ minus1)2∣∣Q0(γ c) + S0(γ c)

∣∣minus(δ+n+pminuspγ minus1)2

= πminusnp2Gprod

k=1

(hnk + 1)minuspγ 2wnk

k

∣∣Q(γ )

∣∣(δ+pγ minus1)2

times 2minuspγ (δ+nk+pγ minus1)2πminuspγ (pγ minus1)4

times( pγprod

j=1

(δ + pγ minus j

2

))minus1

timesint

Gprod

k=1

∣∣k(γ )

∣∣minus(δ+nk+2pγ )2

times exp

minus1

2tr(minus1

k(γ )

[Q(γ ) + Sk(γ )

])d

times (h0n + 1)minus(pminuspγ )2

timespminuspγprod

j=1

(

(δ + n + p minus pγ minus j

2

)

(δ + p minus pγ minus j

2

))

times ∣∣Q0(γ c)

∣∣(δ+pminuspγ minus1)2∣∣Q0(γ c) + S0(γ c)

∣∣minus(δ+n+pminuspγ minus1)2

= πminusnp2Gprod

k=1

(hnk + 1)minuspγ 2wnk

k

timespγprod

j=1

(

(δ + nk + pγ minus 1

2

)(

(δ + pγ minus 1

2

)))

timesGprod

k=1

∣∣Q(γ )

∣∣(δ+pγ minus1)2∣∣Q(γ ) + Sk(γ )

∣∣minus(δ+nk+pγ minus1)2

times (h0n + 1)minus(pminuspγ )2

timespminuspγprod

j=1

(

(δ + n + p minus pγ minus j

2

)(

(δ + p minus pγ minus j

2

)))

times ∣∣Q0(γ c)

∣∣(δ+pminuspγ minus1)2∣∣Q0(γ c) + S0(γ c)

∣∣minus(δ+n+pminuspγ minus1)2

[Received May 2003 Revised August 2004]

REFERENCES

Anderson E (1935) ldquoThe Irises of the Gaspe Peninsulardquo Bulletin of the Amer-ican Iris Society 59 2ndash5

Banfield J D and Raftery A E (1993) ldquoModel-Based Gaussian and Non-Gaussian Clusteringrdquo Biometrics 49 803ndash821

Brown P J (1993) Measurement Regression and Calibration Oxford UKClarendon Press

Brown P J Vannucci M and Fearn T (1998a) ldquoBayesian Wavelength Se-lection in Multicomponent Analysisrdquo Chemometrics 12 173ndash182

(1998b) ldquoMultivariate Bayesian Variable Selection and PredictionrdquoJournal of the Royal Statistical Society Ser B 60 627ndash641

Brusco M J and Cradit J D (2001) ldquoA Variable Selection Heuristic fork-Means Clusteringrdquo Psychometrika 66 249ndash270

Chang W-C (1983) ldquoOn Using Principal Components Before Separatinga Mixture of Two Multivariate Normal Distributionsrdquo Applied Statistics 32267ndash275

Diebolt J and Robert C P (1994) ldquoEstimation of Finite Mixture Distribu-tions Through Bayesian Samplingrdquo Journal of the Royal Statistical SocietySer B 56 363ndash375

Fowlkes E B Gnanadesikan R and Kettering J R (1988) ldquoVariable Selec-tion in Clusteringrdquo Journal of Classification 5 205ndash228

Friedman J H and Meulman J J (2003) ldquoClustering Objects on Subsetsof Attributesrdquo technical report Stanford University Dept of Statistics andStanford Linear Accelerator Center

George E I and McCulloch R E (1997) ldquoApproaches for Bayesian VariableSelectionrdquo Statistica Sinica 7 339ndash373

Ghosh D and Chinnaiyan A M (2002) ldquoMixture Modelling of Gene Ex-pression Data From Microarray Experimentsrdquo Bioinformatics 18 275ndash286

Gilks W R Richardson S and Spiegelhalter D J (eds) (1996) MarkovChain Monte Carlo in Practice London Chapman amp Hall

Gnanadesikan R Kettering J R and Tao S L (1995) ldquoWeighting andSelection of Variables for Cluster Analysisrdquo Journal of Classification 12113ndash136

Green P J (1995) ldquoReversible-Jump Markov Chain Monte Carlo Computationand Bayesian Model Determinationrdquo Biometrika 82 711ndash732

Madigan D and York J (1995) ldquoBayesian Graphical Models for DiscreteDatardquo International Statistical Review 63 215ndash232

McLachlan G J Bean R W and Peel D (2002) ldquoA Mixture Model-BasedApproach to the Clustering of Microarray Expression Datardquo Bioinformatics18 413ndash422

Medvedovic M and Sivaganesan S (2002) ldquoBayesian Infinite MixtureModel-Based Clustering of Gene Expression Profilesrdquo Bioinformatics 181194ndash1206

Milligan G W (1989) ldquoA Validation Study of a Variable Weighting Algorithmfor Cluster Analysisrdquo Journal of Classification 6 53ndash71

Murtagh F and Hernaacutendez-Pajares M (1995) ldquoThe Kohonen Self-Organizing Map Method An Assessmentrdquo Journal of Classification 12165ndash190

Richardson S and Green P J (1997) ldquoOn Bayesian Analysis of MixturesWith an Unknown Number of Componentsrdquo (with discussion) Journal of theRoyal Statistical Society Ser B 59 731ndash792

Sha N Vannucci M Tadesse M G Brown P J Dragoni I Davies NRoberts T Contestabile A Salmon M Buckley C and Falciani F(2004) ldquoBayesian Variable Selection in Multinomial Probit Models to Iden-tify Molecular Signatures of Disease Stagerdquo Biometrics 60 812ndash819

Stephens M (2000a) ldquoBayesian Analysis of Mixture Models With an Un-known Number of ComponentsmdashAn Alternative to Reversible-Jump Meth-odsrdquo The Annals of Statistics 28 40ndash74

(2000b) ldquoDealing With Label Switching in Mixture Modelsrdquo Journalof the Royal Statistical Society Ser B 62 795ndash809

Tadesse M G Ibrahim J G and Mutter G L (2003) ldquoIdentification ofDifferentially Expressed Genes in High-Density Oligonucleotide Arrays Ac-counting for the Quantification Limits of the Technologyrdquo Biometrics 59542ndash554

Page 8: Bayesian Variable Selection in Clustering High-Dimensional ...marina/papers/jasa05.pdf · Bayesian Variable Selection in Clustering High-Dimensional Data Mahlet G. T ADESSE, Naijun

Tadesse Sha and Vannucci Bayesian Variable Selection 609

Table 1 Iris Data Posterior Distribution of G

Poisson(λ = 5) Poisson(λ = 10) Uniform[1 10]

k Different k Equal Different Σk Equal Σ Different Σk Equal Σ

3 6466 0 5019 0 8548 04 3124 7413 4291 4384 1417 65045 0403 3124 0545 2335 0035 34466 0007 0418 0138 2786 0 00507 0 0 0007 0484 0 08 0 0 0 0011 0 0

together except for one Two virginica flowers are assigned tothe versicolor group and the remaining are divided into twonew components Figure 2 shows the marginal posterior prob-abilities p( yi = k|XG = 4) computed using equation (21)We note that some of the flowers among the 12 virginica al-located to cluster IV had a close call for instance observa-tion 104 had marginal posteriors p( y104 = 3|XG) = 4169 andp( y104 = 4|XG) = 5823

We examined the sensitivity of the results to the choiceof the prior distribution on G and found them to be quiterobust The posterior probabilities p(G = k|X) assumingG sim Poisson(λ = 10) and G sim uniform[1 Gmax] are givenin Table 1 Under both priors we obtained sample allocationsthat are identical to those in Table 2 We also found littlesensitivity to the choice of the hyperparameter h1 with val-ues between 10 and 1000 giving satisfactory results Smallervalues of h1 tended to favor slightly more components but withsome clusters containing very few observations For examplefor h1 = 10 under the assumption of equal covariance acrossclusters Table 3 shows that the MCMC sampler gives strongersupport for G = 6 However one of the clusters contains onlyone observation and another contains only four observationsFigure 3 displays the marginal posterior probabilities for theallocation of these five samples p( yi = k|XG) there is littlesupport for assigning them to separate clusters

We also use this example to illustrate the label-switching phenomenon described in Section 6.1. Figure 4 gives the trace plots of the component weights under the assumption of homogeneous covariance, before and after processing of the MCMC samples. We can easily identify two label-switching instances in the raw output of Figure 4(a). One of these occurred between the first and third components near iteration 1,000; the other occurred around iteration 2,400 between components 2, 3, and 4. Figure 4(b) shows the trace plots after postprocessing of the MCMC output. We note that the label switching is eliminated and the multimodality in the estimates of the marginal posterior distributions of wk is removed. It is now straightforward to obtain sensible estimates using the posterior means.

Table 2. Iris Data: Sample Allocation y

                       Heterogeneous covariance      Homogeneous covariance
                          I      II     III            I      II     III     IV
  Iris setosa            50       0       0           50       0       0      0
  Iris versicolor         0      45       5            0      49       1      0
  Iris virginica          0       0      50            0       2      36     12

Figure 2. Iris Data, Poisson(λ = 5) With Equal Σ: Marginal Posterior Probabilities of Sample Allocations p(yi = k|X, G = 4), i = 1, ..., 150, k = 1, ..., 4.

7.2 Simulated Data

We generated a dataset of 15 observations arising from four multivariate normal densities with 20 variables, such that

\[
x_{ij} \sim I_{\{1\le i\le 4\}}\, N(\mu_1, \sigma_1^2) + I_{\{5\le i\le 7\}}\, N(\mu_2, \sigma_2^2) + I_{\{8\le i\le 13\}}\, N(\mu_3, \sigma_3^2) + I_{\{14\le i\le 15\}}\, N(\mu_4, \sigma_4^2), \qquad i = 1, \ldots, 15,\ j = 1, \ldots, 20,
\]

where I{·} is the indicator function, equal to 1 if the condition is met. Thus the first four samples arise from the same distribution, the next three come from the second group, the following six are from the third group, and the last two are from the fourth group. The component means were set to μ1 = 5, μ2 = 2, μ3 = −3, and μ4 = −6, and the component variances were set to σ1² = 1.5, σ2² = 1, σ3² = .5, and σ4² = 2. We drew an additional set of (p − 20) noisy variables, which do not distinguish between the clusters, from a standard normal density. We considered several values of p: p = 50, 100, 500, 1000. These data thus contain a small subset of discriminating variables together with a large set of nonclustering variables. The purpose is to study the ability of our method to uncover the cluster structure and identify the relevant covariates in the presence of an increasing number of noisy variables.
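For concreteness, a small simulation in this design can be generated as follows. The code is only a sketch (the seed, variable names, and the use of numpy are our own choices, not part of the paper): the 20 discriminating variables are drawn group by group, the remaining p − 20 variables are standard normal noise, and the columns are then permuted.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p_disc = 15, 20                                  # 15 observations, 20 discriminating variables
groups = [range(0, 4), range(4, 7), range(7, 13), range(13, 15)]
means = [5.0, 2.0, -3.0, -6.0]                      # mu_1, ..., mu_4
sds = np.sqrt([1.5, 1.0, 0.5, 2.0])                 # component standard deviations

def simulate(p_total):
    X = rng.standard_normal((n, p_total))           # noisy variables ~ N(0, 1)
    for g, mu, sd in zip(groups, means, sds):
        for i in g:
            X[i, :p_disc] = rng.normal(mu, sd, size=p_disc)
    return X[:, rng.permutation(p_total)]           # permute columns to disperse the predictors

X = simulate(p_total=1000)                          # p = 50, 100, 500, or 1000 in the study
```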

We permuted the columns of the data matrix X to disperse the predictors. For each dataset we took δ = 3, α = 1, h1 = h0 = 100, and Gmax = n, and assumed unequal covariances across clusters.

Table 3. Iris Data: Sensitivity to Hyperparameter h1 (Results With h1 = 10)

  Posterior distribution of G
  k             4       5       6       7       8       9      10
  p(G = k|X)  .0747   .1874   .3326   .2685   .1120   .0233   .0015

  Sample allocations y
                          I      II     III     IV      V      VI
  Iris setosa            50       0       0      0       0       0
  Iris versicolor         0      45       1      0       4       0
  Iris virginica          0       2      35     12       0       1


Figure 3. Iris Data, Sensitivity to h1 (Results for h1 = 10): Marginal Posterior Probabilities for the Samples Allocated to Clusters V and VI, p(yi = k|X, G = 6), i = 1, ..., 5, k = 1, ..., 6.

Results from the previous example have shown little sensitivity to the choice of prior on G; here we considered a truncated Poisson prior with λ = 5. As described in Section 4.1, some care is needed when choosing κ1 and κ0: their values need to be commensurate with the variability of the data. The amount of total variability in the different simulated datasets is of course widely different, and in each case these hyperparameters were chosen proportionally to the upper and lower deciles of the nonzero eigenvalues. For the prior on γ, we chose a Bernoulli distribution and set the expected number of included variables to 10. We used a starting model with one randomly selected variable and ran the MCMC chains for 100,000 iterations, with 40,000 sweeps as burn-in.
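The text states only that κ1 and κ0 were chosen proportionally to the upper and lower deciles of the nonzero eigenvalues of the data; one plausible reading is sketched below. The proportionality constant and the pairing of the two deciles with κ1 and κ0 are assumptions made for illustration, not values taken from the paper.

```python
import numpy as np

def eigenvalue_deciles(X):
    """Lower and upper deciles of the nonzero eigenvalues of the sample covariance.
    For p >> n, the nonzero eigenvalues coincide with those of the n x n Gram
    matrix, which is much cheaper to compute than the p x p covariance."""
    Xc = X - X.mean(axis=0)
    gram = Xc @ Xc.T / (X.shape[0] - 1)
    eig = np.linalg.eigvalsh(gram)
    eig = eig[eig > 1e-10]
    return np.percentile(eig, 10), np.percentile(eig, 90)

# lower, upper = eigenvalue_deciles(X)
# kappa1, kappa0 = ...  # pairing and scale factor are the analyst's choice, per the text
```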

Here we do inference on both the sample allocations and the variable selection. We obtained satisfactory results for all datasets: in all cases we successfully recovered the cluster structure used to simulate the data and identified the 20 discriminating variables. We present the summary plots associated with the largest dataset, where p = 1000. Figure 5(a) shows the trace plot for the number of visited components G, and Table 4 reports the marginal posterior probabilities p(G|X). There is strong support for G = 4 and 5, with a slightly larger probability for the former. Figure 6 shows the marginal posterior probabilities of the sample allocations p(yi = k|X, G = 4); we note that these match the group structure used to simulate the data. As for the selection of predictors, Figure 5(b) displays the number of variables selected at each MCMC iteration. In the first 30,000 iterations, before burn-in, the chain visited models with around 30–35 variables. After about 40,000 iterations, the chain stabilized to models with 15–20 covariates. Figure 7 displays the marginal posterior probabilities p(γj = 1|X, G = 4). There were 17 variables with marginal probabilities greater than .9, all of which were in the set of 20 covariates simulated to effectively discriminate the four clusters. If we lower the threshold for inclusion to .4, then all 20 variables are selected. Conditional on the allocations obtained via (22), the γ vector with largest posterior probability (25) among all visited models contained 19 of the 20 discriminating variables.
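A variable-selection summary of this kind can be read off the sampled inclusion vectors. The sketch below estimates the marginal inclusion probabilities by simple frequency of visits, which is a cruder estimator than the model-probability-weighted version in (23), and thresholds them as described above; names and shapes are illustrative.

```python
import numpy as np

def select_variables(gamma_draws, threshold=0.9):
    """gamma_draws : (n_sweeps, p) binary array of post-burn-in inclusion indicators.
    Returns the estimated p(gamma_j = 1 | X) and the indices exceeding the threshold."""
    incl_prob = gamma_draws.mean(axis=0)
    return incl_prob, np.flatnonzero(incl_prob > threshold)

# With threshold 0.9 the paper reports 17 of the 20 discriminating variables;
# lowering it to 0.4 recovers all 20.
```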

We also analyzed these datasets using the COSA algorithm of Friedman and Meulman (2003). As we mentioned in Section 2, this procedure performs variable selection in conjunction with hierarchical clustering. We present the results for the analysis of the simulated data with p = 1000. Figure 8 shows the dendrograms of the clustering results based on the non-targeted COSA distance. We considered single, average, and complete linkage, but none of these methods was able to recover the true cluster structure. The average linkage dendrogram [Fig. 8(b)] delineates one of the clusters, which was partially recovered (the true cluster contains observations 8–13). Figure 8(d) displays the 20 highest relevance values of the variables that lead to this clustering. The variables that define the true cluster structure are indexed as 1–20, and we note that 16 of them appear in this plot.

Figure 4. Iris Data: Trace Plots of Component Weights Before and After Removal of the Label Switching. (a) Raw output; (b) processed output.


Figure 5. Trace Plots for Simulated Data With p = 1000. (a) Number of clusters G; (b) number of included variables pγ.

Table 4. Simulated Data With p = 1000: Posterior Distribution of G

   k     p(G = k|X)
   2       .0111
   3       .1447
   4       .3437
   5       .3375
   6       .1326
   7       .0216
   8       .0040
   9       .0015
  10       .0008
  11       .0005
  12       .0007
  13       .0012
  14       .0002

Figure 6. Simulated Data With p = 1000: Marginal Posterior Probabilities of Sample Allocations p(yi = k|X, G = 4), i = 1, ..., 15, k = 1, ..., 4.

7.3 Application to Microarray Data

We now illustrate the methodology on microarray data from an endometrial cancer study. Endometrioid endometrial adenocarcinoma is a common gynecologic malignancy arising within the uterine lining. The disease usually occurs in postmenopausal women and is associated with the hormonal risk factor of protracted estrogen exposure unopposed by progestins. Despite its prevalence, the molecular mechanisms of its genesis are not completely known. DNA microarrays, with their ability to examine thousands of genes simultaneously, could be an efficient tool to isolate relevant genes and cluster tissues into different subtypes. This is especially important in cancer treatment, where different clinicopathologic groups are known to vary in their response to therapy, and genes identified to discriminate among the different subtypes may represent targets for therapeutic intervention and biomarkers for improved diagnosis.

Figure 7. Simulated Data With p = 1000: Marginal Posterior Probabilities for Inclusion of Variables p(γj = 1|X, G = 4).


Figure 8. Analysis of Simulated Data With p = 1000 Using COSA. (a) Single linkage; (b) average linkage; (c) complete linkage; (d) relevance for the delineated cluster.

Four normal endometrium samples and 10 endometrioid endometrial adenocarcinomas were collected from hysterectomy specimens. RNA samples were isolated from the 14 tissues and reverse-transcribed. The resulting cDNA targets were prepared following the manufacturer's instructions and hybridized to Affymetrix Hu6800 GeneChip arrays, which contain 7,070 probe sets. The scanned images were processed with the Affymetrix software, version 3.1. The average difference derived by the software was used to indicate transcript abundance, and a global normalization procedure was applied. The technology does not reliably quantify low expression levels, and it is subject to spot saturation at the high end. For this particular array, the limits of reliable detection were previously set to 20 and 16,000 (Tadesse, Ibrahim, and Mutter 2003). We used these same thresholds and removed from the analysis probe sets with at least one unreliable reading. This left us with p = 762 variables. We then log-transformed the expression readings to satisfy the assumption of normality, as suggested by the Box–Cox transformation. We also rescaled each variable by its range.
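The filtering and transformation steps just described translate into a few lines of preprocessing. The sketch below assumes an (n_samples × n_probesets) array of average-difference values, omits the global normalization step, and uses function and variable names of our own choosing.

```python
import numpy as np

def preprocess(expr, lower=20.0, upper=16000.0):
    """Drop probe sets with any reading outside the reliable detection limits,
    log-transform, and rescale each remaining variable by its range."""
    reliable = np.all((expr >= lower) & (expr <= upper), axis=0)
    X = np.log(expr[:, reliable])
    return X / (X.max(axis=0) - X.min(axis=0))

# For the endometrial study this filtering left p = 762 probe sets.
```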

As described in Section 4.1, we specified the priors with the hyperparameters δ = 3, α = 1, and h1 = h0 = 100. We set κ1 = .001 and κ0 = .01 and assumed unequal covariances across clusters. We used a truncated Poisson(λ = 5) prior for G, with Gmax = n. We took a Bernoulli prior for γ, with an expectation of 10 variables to be included in the model.

To avoid possible dependence of the results on the initial model, we ran four MCMC chains with widely different starting points: (1) γj set to 0 for all j's except one randomly selected; (2) γj set to 1 for 10 randomly selected j's; (3) γj set to 1 for 25 randomly selected j's; and (4) γj set to 1 for 50 randomly selected j's. For each of the MCMC chains, we ran 100,000 iterations, with 40,000 sweeps taken as burn-in.
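The four over-dispersed starting values for γ are easy to reproduce; the sketch below simply draws the stated numbers of randomly selected indices (the random-number generator and function name are illustrative).

```python
import numpy as np

def initial_gammas(p, rng):
    """Starting inclusion vectors with 1, 10, 25, and 50 randomly selected variables."""
    starts = []
    for n_incl in (1, 10, 25, 50):
        gamma = np.zeros(p, dtype=int)
        gamma[rng.choice(p, size=n_incl, replace=False)] = 1
        starts.append(gamma)
    return starts

chains = initial_gammas(p=762, rng=np.random.default_rng(1))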

We looked at both the allocation of tissues and the selection of genes whose expression best discriminates among the different groups. The MCMC sampler mixed steadily over the iterations. As shown in the histogram of Figure 9(a), the sampler visited mostly between two and five components, with stronger support for G = 3 clusters. Figure 9(b) shows the trace plot for the number of included variables for one of the chains, which visited mostly models with 25–35 variables; the other chains behaved similarly. Figure 10 gives the marginal posterior probabilities p(γj = 1|X, G = 3) for each of the chains. Despite the very different starting points, the four chains visited similar regions and exhibit broadly similar marginal plots. To assess the concordance of the results across the four chains, we looked at the correlation coefficients between the frequencies for inclusion of genes; these are reported in Figure 11, along with the corresponding pairwise scatterplots.


Figure 9. MCMC Output for Endometrial Data. (a) Number of visited clusters G; (b) number of included variables pγ.

We see that there is very good agreement across the different MCMC runs. The marginal probabilities, on the other hand, are not as concordant across the chains, because their derivation makes use of model posterior probabilities [see (23)], which are affected by the inclusion of correlated variables. This is a problem inherent to DNA microarray data, where multiple genes with similar expression profiles are often observed.
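The concordance check reported in Figure 11 amounts to correlating the per-gene inclusion frequencies of the four chains; a minimal sketch, with illustrative names, follows.

```python
import numpy as np

def chain_concordance(gamma_draws_by_chain):
    """Pairwise correlations between per-gene inclusion frequencies across chains.
    gamma_draws_by_chain : list of (n_sweeps, p) binary arrays, one per chain."""
    freqs = np.vstack([g.mean(axis=0) for g in gamma_draws_by_chain])
    return np.corrcoef(freqs)  # (n_chains, n_chains) correlation matrix

# corr = chain_concordance([chain1, chain2, chain3, chain4])
```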

We pooled and relabeled the output of the four chains, normalized the relative posterior probabilities, and recomputed p(γj = 1|X, G = 3). These marginal probabilities are displayed in Figure 12. There are 31 genes with posterior probability greater than .5. We also estimated the marginal posterior probabilities for the sample allocations, p(yi = k|X, G), based on the 31 selected genes; these are shown in Figure 13. We successfully identified the four normal tissues. The results also seem to suggest that there are possibly two subtypes within the malignant tissues.

Figure 10. Endometrial Data: Marginal Posterior Probabilities for Inclusion of Variables p(γj = 1|X, G = 3) for Each of the Four Chains.

We also present the results from analyzing this dataset with the COSA algorithm. Figure 14 shows the resulting single linkage and average linkage dendrograms, along with the top 30 variables that distinguish the normal and tumor tissues. There is some overlap between the genes selected by COSA and those identified by our procedure: among the sets of 30 "best" discriminating genes selected by the two methods, 10 are common to both.

8. DISCUSSION

We have proposed a method for simultaneously clustering high-dimensional data with an unknown number of components and selecting the variables that best discriminate the different groups. We successfully applied the methodology to various datasets. Our method is fully Bayesian, and we provided standard default recommendations for the choice of priors.

Here we mention some possible extensions and directions for future research. We have drawn posterior inference conditional on a fixed number of components.

Figure 11. Endometrial Data, Concordance of Results Among the Four Chains: Pairwise Scatterplots of Frequencies for Inclusion of Genes.


Figure 12. Endometrial Data, Union of Four Chains: Marginal Posterior Probabilities p(γj = 1|X, G = 3).

A more attractive approach would incorporate the uncertainty on this parameter and combine the results for all the different values of G visited by the MCMC sampler. To our knowledge, this is an open problem that requires further research.

Figure 13. Endometrial Data, Union of Four Chains: Marginal Posterior Probabilities of Sample Allocations p(yi = k|X, G = 3), i = 1, ..., 14, k = 1, ..., 3.

One promising approach involves considering the posterior probability of pairs of observations being allocated to the same cluster, regardless of the number of visited components.

Figure 14. Analysis of Endometrial Data Using COSA. (a) Single linkage; (b) average linkage; (c) variables leading to cluster 1; (d) variables leading to cluster 2.


As long as there is no interest in estimating the component parameters, the problem of label switching would not be a concern. But this leads to n(n − 1)/2 pairwise probabilities, and it is not clear how best to summarize them. For example, Medvedovic and Sivaganesan (2002) used hierarchical clustering based on these pairwise probabilities as a similarity measure.
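To make the idea concrete, the pairwise co-clustering probabilities can be accumulated across sweeps and then fed to an off-the-shelf hierarchical clustering routine, in the spirit of Medvedovic and Sivaganesan (2002). The sketch below uses scipy and is only an illustration of that summary, not part of the methodology proposed here.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def coclustering_matrix(y_draws):
    """Posterior probability that each pair of observations shares a cluster,
    estimated over MCMC sweeps regardless of the number of visited components.
    y_draws : (n_sweeps, n_obs) array of sampled allocations."""
    P = np.zeros((y_draws.shape[1],) * 2)
    for y in y_draws:
        P += (y[:, None] == y[None, :])
    return P / y_draws.shape[0]

def cluster_from_pairs(P, n_clusters):
    """Average-linkage clustering using 1 - P as the dissimilarity."""
    D = 1.0 - P
    np.fill_diagonal(D, 0.0)
    Z = linkage(D[np.triu_indices_from(D, k=1)], method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```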

An interesting future avenue would be to investigate the performance of our method when covariance constraints are imposed, as proposed by Banfield and Raftery (1993). Their parameterization allows one to incorporate different assumptions on the volume, orientation, and shape of the clusters.

Another possible extension is to use empirical Bayes estimates to elicit the hyperparameters. For example, conditional on a fixed number of clusters, a complete log-likelihood can be computed using all covariates and a Monte Carlo EM approach developed for estimation. Alternatively, if prior information were available or subjective priors were preferable, then the prior setting could be modified accordingly. If interactions among predictors were of interest, then additional interaction terms could be included in the model, and the prior on γ could be modified to accommodate the additional terms.

APPENDIX: FULL CONDITIONALS

Here we give details on the derivation of the marginalized full conditionals under the assumptions of equal and unequal covariances across clusters.

Homogeneous Covariance: Σ1 = ··· = ΣG = Σ

\[
\begin{aligned}
f(X, y \mid G, w, \gamma)
&= \int f(X, y \mid G, w, \mu, \Sigma, \eta, \Omega, \gamma)\, f(\mu \mid G, \Sigma, \gamma)\, f(\Sigma \mid G, \gamma)\, f(\eta \mid \Omega, \gamma)\, f(\Omega \mid \gamma)\; d\mu\, d\Sigma\, d\eta\, d\Omega \\
&= \int \prod_{k=1}^{G} w_k^{n_k}\, \big|\Sigma_{(\gamma)}\big|^{-(n_k+1)/2} (2\pi)^{-(n_k+1)p_\gamma/2}\, h_1^{-p_\gamma/2} \\
&\qquad \times \exp\!\Big\{-\tfrac{1}{2}\sum_{k=1}^{G}\sum_{x_{i(\gamma)}\in C_k}\big(x_{i(\gamma)}-\mu_{k(\gamma)}\big)^{T}\Sigma_{(\gamma)}^{-1}\big(x_{i(\gamma)}-\mu_{k(\gamma)}\big)\Big\}
\exp\!\Big\{-\tfrac{1}{2}\sum_{k=1}^{G}\big(\mu_{k(\gamma)}-\mu_{0(\gamma)}\big)^{T}\big(h_1\Sigma_{(\gamma)}\big)^{-1}\big(\mu_{k(\gamma)}-\mu_{0(\gamma)}\big)\Big\} \\
&\qquad \times 2^{-p_\gamma(\delta+p_\gamma-1)/2}\,\pi^{-p_\gamma(p_\gamma-1)/4}\Big(\prod_{j=1}^{p_\gamma}\Gamma\big(\tfrac{\delta+p_\gamma-j}{2}\big)\Big)^{-1}
\big|Q_{1(\gamma)}\big|^{(\delta+p_\gamma-1)/2}\big|\Sigma_{(\gamma)}\big|^{-(\delta+2p_\gamma)/2}
\exp\!\Big\{-\tfrac{1}{2}\,\mathrm{tr}\big(\Sigma_{(\gamma)}^{-1}Q_{1(\gamma)}\big)\Big\} \\
&\qquad \times \big|\Omega_{(\gamma^c)}\big|^{-(n+1)/2}(2\pi)^{-(n+1)(p-p_\gamma)/2}\, h_0^{-(p-p_\gamma)/2}
\exp\!\Big\{-\tfrac{1}{2}\sum_{i=1}^{n}\big(x_{i(\gamma^c)}-\eta_{(\gamma^c)}\big)^{T}\Omega_{(\gamma^c)}^{-1}\big(x_{i(\gamma^c)}-\eta_{(\gamma^c)}\big)\Big\} \\
&\qquad \times \exp\!\Big\{-\tfrac{1}{2}\big(\eta_{(\gamma^c)}-\mu_{0(\gamma^c)}\big)^{T}\big(h_0\Omega_{(\gamma^c)}\big)^{-1}\big(\eta_{(\gamma^c)}-\mu_{0(\gamma^c)}\big)\Big\} \\
&\qquad \times 2^{-(p-p_\gamma)(\delta+p-p_\gamma-1)/2}\,\pi^{-(p-p_\gamma)(p-p_\gamma-1)/4}\Big(\prod_{j=1}^{p-p_\gamma}\Gamma\big(\tfrac{\delta+p-p_\gamma-j}{2}\big)\Big)^{-1} \\
&\qquad \times \big|Q_{0(\gamma^c)}\big|^{(\delta+p-p_\gamma-1)/2}\big|\Omega_{(\gamma^c)}\big|^{-(\delta+2(p-p_\gamma))/2}
\exp\!\Big\{-\tfrac{1}{2}\,\mathrm{tr}\big(\Omega_{(\gamma^c)}^{-1}Q_{0(\gamma^c)}\big)\Big\}\; d\mu\, d\Sigma\, d\eta\, d\Omega.
\end{aligned}
\]

Completing the squares in the μk(γ) and in η(γ^c) and integrating these mean parameters out leaves integrals over Σ(γ) and Ω(γ^c) only:

\[
\begin{aligned}
f(X, y \mid G, w, \gamma)
&= \Big[\prod_{k=1}^{G}(h_1 n_k+1)^{-p_\gamma/2} w_k^{n_k}\Big]\,\pi^{-np_\gamma/2}\,
\big|Q_{1(\gamma)}\big|^{(\delta+p_\gamma-1)/2}\, 2^{-p_\gamma(\delta+n+p_\gamma-1)/2}\,\pi^{-p_\gamma(p_\gamma-1)/4}
\Big(\prod_{j=1}^{p_\gamma}\Gamma\big(\tfrac{\delta+p_\gamma-j}{2}\big)\Big)^{-1} \\
&\qquad \times \int \big|\Sigma_{(\gamma)}\big|^{-(\delta+n+2p_\gamma)/2}
\exp\!\Big\{-\tfrac{1}{2}\,\mathrm{tr}\Big(\Sigma_{(\gamma)}^{-1}\Big[Q_{1(\gamma)}+\sum_{k=1}^{G}S_{k(\gamma)}\Big]\Big)\Big\}\, d\Sigma \\
&\qquad \times (h_0 n+1)^{-(p-p_\gamma)/2}\,\pi^{-n(p-p_\gamma)/2}\,
\big|Q_{0(\gamma^c)}\big|^{(\delta+p-p_\gamma-1)/2}\, 2^{-(p-p_\gamma)(\delta+n+p-p_\gamma-1)/2}\,\pi^{-(p-p_\gamma)(p-p_\gamma-1)/4}
\Big(\prod_{j=1}^{p-p_\gamma}\Gamma\big(\tfrac{\delta+p-p_\gamma-j}{2}\big)\Big)^{-1} \\
&\qquad \times \int \big|\Omega_{(\gamma^c)}\big|^{-(\delta+n+2(p-p_\gamma))/2}
\exp\!\Big\{-\tfrac{1}{2}\,\mathrm{tr}\big(\Omega_{(\gamma^c)}^{-1}\big[Q_{0(\gamma^c)}+S_{0(\gamma^c)}\big]\big)\Big\}\, d\Omega.
\end{aligned}
\]

The remaining integrands are inverse-Wishart kernels, so that

\[
\begin{aligned}
f(X, y \mid G, w, \gamma)
&= \pi^{-np/2}\Big[\prod_{k=1}^{G}(h_1 n_k+1)^{-p_\gamma/2} w_k^{n_k}\Big](h_0 n+1)^{-(p-p_\gamma)/2}
\prod_{j=1}^{p_\gamma}\frac{\Gamma\big((\delta+n+p_\gamma-j)/2\big)}{\Gamma\big((\delta+p_\gamma-j)/2\big)}\;
\prod_{j=1}^{p-p_\gamma}\frac{\Gamma\big((\delta+n+p-p_\gamma-j)/2\big)}{\Gamma\big((\delta+p-p_\gamma-j)/2\big)} \\
&\qquad \times \big|Q_{1(\gamma)}\big|^{(\delta+p_\gamma-1)/2}\,\big|Q_{0(\gamma^c)}\big|^{(\delta+p-p_\gamma-1)/2}\,
\Big|Q_{1(\gamma)}+\sum_{k=1}^{G}S_{k(\gamma)}\Big|^{-(\delta+n+p_\gamma-1)/2}\,
\big|Q_{0(\gamma^c)}+S_{0(\gamma^c)}\big|^{-(\delta+n+p-p_\gamma-1)/2},
\end{aligned}
\]

where Sk(γ) and S0(γ^c) are given by

\[
\begin{aligned}
S_{k(\gamma)} &= \sum_{x_{i(\gamma)}\in C_k} x_{i(\gamma)}x_{i(\gamma)}^{T} + h_1^{-1}\mu_{0(\gamma)}\mu_{0(\gamma)}^{T}
- (n_k+h_1^{-1})^{-1}\Big(\sum_{x_{i(\gamma)}\in C_k} x_{i(\gamma)}+h_1^{-1}\mu_{0(\gamma)}\Big)\Big(\sum_{x_{i(\gamma)}\in C_k} x_{i(\gamma)}+h_1^{-1}\mu_{0(\gamma)}\Big)^{T} \\
&= \sum_{x_{i(\gamma)}\in C_k}\big(x_{i(\gamma)}-\bar{x}_{k(\gamma)}\big)\big(x_{i(\gamma)}-\bar{x}_{k(\gamma)}\big)^{T}
+ \frac{n_k}{h_1 n_k+1}\big(\mu_{0(\gamma)}-\bar{x}_{k(\gamma)}\big)\big(\mu_{0(\gamma)}-\bar{x}_{k(\gamma)}\big)^{T}
\end{aligned}
\]

and

\[
\begin{aligned}
S_{0(\gamma^c)} &= \sum_{i=1}^{n} x_{i(\gamma^c)}x_{i(\gamma^c)}^{T} + h_0^{-1}\mu_{0(\gamma^c)}\mu_{0(\gamma^c)}^{T}
- (n+h_0^{-1})^{-1}\Big(\sum_{i=1}^{n} x_{i(\gamma^c)}+h_0^{-1}\mu_{0(\gamma^c)}\Big)\Big(\sum_{i=1}^{n} x_{i(\gamma^c)}+h_0^{-1}\mu_{0(\gamma^c)}\Big)^{T} \\
&= \sum_{i=1}^{n}\big(x_{i(\gamma^c)}-\bar{x}_{(\gamma^c)}\big)\big(x_{i(\gamma^c)}-\bar{x}_{(\gamma^c)}\big)^{T}
+ \frac{n}{h_0 n+1}\big(\mu_{0(\gamma^c)}-\bar{x}_{(\gamma^c)}\big)\big(\mu_{0(\gamma^c)}-\bar{x}_{(\gamma^c)}\big)^{T},
\end{aligned}
\]

with \(\bar{x}_{k(\gamma)} = n_k^{-1}\sum_{x_{i(\gamma)}\in C_k} x_{i(\gamma)}\) and \(\bar{x}_{(\gamma^c)} = n^{-1}\sum_{i=1}^{n} x_{i(\gamma^c)}\).

Heterogeneous Covariances

\[
\begin{aligned}
f(X, y \mid G, w, \gamma)
&= \int f(X, y \mid G, w, \mu, \Sigma, \eta, \Omega, \gamma)\, f(\mu \mid G, \Sigma, \gamma)\, f(\Sigma \mid G, \gamma)\, f(\eta \mid \Omega, \gamma)\, f(\Omega \mid \gamma)\; d\mu\, d\Sigma\, d\eta\, d\Omega \\
&= \int \prod_{k=1}^{G}\bigg\{ w_k^{n_k}\,\big|\Sigma_{k(\gamma)}\big|^{-(n_k+1)/2}(2\pi)^{-(n_k+1)p_\gamma/2}\, h_1^{-p_\gamma/2}\,
2^{-p_\gamma(\delta+p_\gamma-1)/2}\,\pi^{-p_\gamma(p_\gamma-1)/4}\Big(\prod_{j=1}^{p_\gamma}\Gamma\big(\tfrac{\delta+p_\gamma-j}{2}\big)\Big)^{-1} \\
&\hspace{4.5em} \times \big|Q_{1(\gamma)}\big|^{(\delta+p_\gamma-1)/2}\big|\Sigma_{k(\gamma)}\big|^{-(\delta+2p_\gamma)/2}
\exp\!\Big\{-\tfrac{1}{2}\,\mathrm{tr}\big(\Sigma_{k(\gamma)}^{-1}Q_{1(\gamma)}\big)\Big\}\bigg\} \\
&\qquad \times \exp\!\Big\{-\tfrac{1}{2}\sum_{k=1}^{G}\sum_{x_{i(\gamma)}\in C_k}\big(x_{i(\gamma)}-\mu_{k(\gamma)}\big)^{T}\Sigma_{k(\gamma)}^{-1}\big(x_{i(\gamma)}-\mu_{k(\gamma)}\big)\Big\}
\exp\!\Big\{-\tfrac{1}{2}\sum_{k=1}^{G}\big(\mu_{k(\gamma)}-\mu_{0(\gamma)}\big)^{T}\big(h_1\Sigma_{k(\gamma)}\big)^{-1}\big(\mu_{k(\gamma)}-\mu_{0(\gamma)}\big)\Big\}\; d\mu\, d\Sigma \\
&\qquad \times \pi^{-n(p-p_\gamma)/2}(h_0 n+1)^{-(p-p_\gamma)/2}
\prod_{j=1}^{p-p_\gamma}\frac{\Gamma\big((\delta+n+p-p_\gamma-j)/2\big)}{\Gamma\big((\delta+p-p_\gamma-j)/2\big)}\,
\big|Q_{0(\gamma^c)}\big|^{(\delta+p-p_\gamma-1)/2}\big|Q_{0(\gamma^c)}+S_{0(\gamma^c)}\big|^{-(\delta+n+p-p_\gamma-1)/2},
\end{aligned}
\]

where the factor involving the nonselected variables has already been integrated out exactly as in the homogeneous case. Completing the square in each μk(γ), integrating the means out, and then integrating each Σk(γ) against its inverse-Wishart kernel (with scale Q1(γ) + Sk(γ) and degrees of freedom updated by nk) gives

\[
\begin{aligned}
f(X, y \mid G, w, \gamma)
&= \pi^{-np/2}\prod_{k=1}^{G}\bigg\{(h_1 n_k+1)^{-p_\gamma/2} w_k^{n_k}
\prod_{j=1}^{p_\gamma}\frac{\Gamma\big((\delta+n_k+p_\gamma-j)/2\big)}{\Gamma\big((\delta+p_\gamma-j)/2\big)}\,
\big|Q_{1(\gamma)}\big|^{(\delta+p_\gamma-1)/2}\big|Q_{1(\gamma)}+S_{k(\gamma)}\big|^{-(\delta+n_k+p_\gamma-1)/2}\bigg\} \\
&\qquad \times (h_0 n+1)^{-(p-p_\gamma)/2}
\prod_{j=1}^{p-p_\gamma}\frac{\Gamma\big((\delta+n+p-p_\gamma-j)/2\big)}{\Gamma\big((\delta+p-p_\gamma-j)/2\big)}\,
\big|Q_{0(\gamma^c)}\big|^{(\delta+p-p_\gamma-1)/2}\big|Q_{0(\gamma^c)}+S_{0(\gamma^c)}\big|^{-(\delta+n+p-p_\gamma-1)/2},
\end{aligned}
\]

with Sk(γ) and S0(γ^c) as defined above.

[Received May 2003. Revised August 2004.]

REFERENCES

Anderson, E. (1935), "The Irises of the Gaspe Peninsula," Bulletin of the American Iris Society, 59, 2–5.

Banfield, J. D., and Raftery, A. E. (1993), "Model-Based Gaussian and Non-Gaussian Clustering," Biometrics, 49, 803–821.

Brown, P. J. (1993), Measurement, Regression and Calibration, Oxford, U.K.: Clarendon Press.

Brown, P. J., Vannucci, M., and Fearn, T. (1998a), "Bayesian Wavelength Selection in Multicomponent Analysis," Chemometrics, 12, 173–182.

(1998b), "Multivariate Bayesian Variable Selection and Prediction," Journal of the Royal Statistical Society, Ser. B, 60, 627–641.

Brusco, M. J., and Cradit, J. D. (2001), "A Variable Selection Heuristic for k-Means Clustering," Psychometrika, 66, 249–270.

Chang, W.-C. (1983), "On Using Principal Components Before Separating a Mixture of Two Multivariate Normal Distributions," Applied Statistics, 32, 267–275.

Diebolt, J., and Robert, C. P. (1994), "Estimation of Finite Mixture Distributions Through Bayesian Sampling," Journal of the Royal Statistical Society, Ser. B, 56, 363–375.

Fowlkes, E. B., Gnanadesikan, R., and Kettering, J. R. (1988), "Variable Selection in Clustering," Journal of Classification, 5, 205–228.

Friedman, J. H., and Meulman, J. J. (2003), "Clustering Objects on Subsets of Attributes," technical report, Stanford University, Dept. of Statistics and Stanford Linear Accelerator Center.

George, E. I., and McCulloch, R. E. (1997), "Approaches for Bayesian Variable Selection," Statistica Sinica, 7, 339–373.

Ghosh, D., and Chinnaiyan, A. M. (2002), "Mixture Modelling of Gene Expression Data From Microarray Experiments," Bioinformatics, 18, 275–286.

Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. (eds.) (1996), Markov Chain Monte Carlo in Practice, London: Chapman & Hall.

Gnanadesikan, R., Kettering, J. R., and Tao, S. L. (1995), "Weighting and Selection of Variables for Cluster Analysis," Journal of Classification, 12, 113–136.

Green, P. J. (1995), "Reversible-Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination," Biometrika, 82, 711–732.

Madigan, D., and York, J. (1995), "Bayesian Graphical Models for Discrete Data," International Statistical Review, 63, 215–232.

McLachlan, G. J., Bean, R. W., and Peel, D. (2002), "A Mixture Model-Based Approach to the Clustering of Microarray Expression Data," Bioinformatics, 18, 413–422.

Medvedovic, M., and Sivaganesan, S. (2002), "Bayesian Infinite Mixture Model-Based Clustering of Gene Expression Profiles," Bioinformatics, 18, 1194–1206.

Milligan, G. W. (1989), "A Validation Study of a Variable Weighting Algorithm for Cluster Analysis," Journal of Classification, 6, 53–71.

Murtagh, F., and Hernández-Pajares, M. (1995), "The Kohonen Self-Organizing Map Method: An Assessment," Journal of Classification, 12, 165–190.

Richardson, S., and Green, P. J. (1997), "On Bayesian Analysis of Mixtures With an Unknown Number of Components" (with discussion), Journal of the Royal Statistical Society, Ser. B, 59, 731–792.

Sha, N., Vannucci, M., Tadesse, M. G., Brown, P. J., Dragoni, I., Davies, N., Roberts, T., Contestabile, A., Salmon, M., Buckley, C., and Falciani, F. (2004), "Bayesian Variable Selection in Multinomial Probit Models to Identify Molecular Signatures of Disease Stage," Biometrics, 60, 812–819.

Stephens, M. (2000a), "Bayesian Analysis of Mixture Models With an Unknown Number of Components—An Alternative to Reversible-Jump Methods," The Annals of Statistics, 28, 40–74.

(2000b), "Dealing With Label Switching in Mixture Models," Journal of the Royal Statistical Society, Ser. B, 62, 795–809.

Tadesse, M. G., Ibrahim, J. G., and Mutter, G. L. (2003), "Identification of Differentially Expressed Genes in High-Density Oligonucleotide Arrays Accounting for the Quantification Limits of the Technology," Biometrics, 59, 542–554.

Page 9: Bayesian Variable Selection in Clustering High-Dimensional ...marina/papers/jasa05.pdf · Bayesian Variable Selection in Clustering High-Dimensional Data Mahlet G. T ADESSE, Naijun

610 Journal of the American Statistical Association June 2005

Figure 3 Iris Data Sensitivity to h1 Results for h1 = 10ndashMarginalPosterior Probabilities for Samples Allocated to Clusters V and VIp(yi = k|X G = 6) i = 1 5 k = 1 6

h1 = h0 = 100 and Gmax = n and assumed unequal covari-ances across clusters Results from the previous example haveshown little sensitivity to the choice of prior on G Here weconsidered a truncated Poisson prior with λ = 5 As describedin Section 41 some care is needed when choosing κ1 and κ0their values need to be commensurate with the variability of thedata The amount of total variability in the different simulateddatasets is of course widely different and in each case thesehyperparameters were chosen proportionally to the upper andlower decile of the non-zero eigenvalues For the prior of γ we chose a Bernoulli distribution and set the expected num-ber of included variables to 10 We used a starting model withone randomly selected variable and ran the MCMC chains for100000 iterations with 40000 sweeps as burn-in

Here we do inference on both the sample allocations andthe variable selection We obtained satisfactory results for all

datasets In all cases we successfully recovered the clusterstructure used to simulate the data and identified the 20 dis-criminating variables We present the summary plots associatedwith the largest dataset where p = 1000 Figure 5(a) shows thetrace plot for the number of visited components G and Table 4reports the marginal posterior probabilities p(G|X) There is astrong support for G = 4 and 5 with a slightly larger probabilityfor the former Figure 6 shows the marginal posterior probabili-ties of the sample allocations p( yi = k|XG = 4) We note thatthese match the group structure used to simulate the data Asfor selection of the predictors Figure 5(b) displays the num-ber of variables selected at each MCMC iteration In the first30000 iterations before burn-in the chain visited models witharound 30ndash35 variables After about 40000 iterations the chainstabilized to models with 15ndash20 covariates Figure 7 displaysthe marginal posterior probabilities p(γj = 1|XG = 4) Therewere 17 variables with marginal probabilities greater than 9 allof which were in the set of 20 covariates simulated to effectivelydiscriminate the four clusters If we lower the threshold for in-clusion to 4 then all 20 variables are selected Conditional onthe allocations obtained via (22) the γ vector with largest pos-terior probability (25) among all visited models contained 19 ofthe 20 discriminating variables

We also analyzed these datasets using the COSA algorithm ofFriedman and Meulman (2003) As we mentioned in Section 2this procedure performs variable selection in conjunction withhierarchical clustering We present the results for the analy-sis of the simulated data with p = 1000 Figure 8 shows thedendrogram of the clustering results based on the non-targetedCOSA distance We considered single average and completelinkage but none of these methods was able to recover the truecluster structure The average linkage dendrogram [Fig 8(b)]delineates one of the clusters that was partially recovered (thetrue cluster contains observations 8ndash13) Figure 8(d) displaysthe 20 highest relevance values of the variables that lead to thisclustering The variables that define the true cluster structure areindexed as 1ndash20 and we note that 16 of them appear in this plot

(a) (b)

Figure 4 Iris Data Trace Plots of Component Weights Before and After Removal of the Label Switching (a) Raw output (b) processed output

Tadesse Sha and Vannucci Bayesian Variable Selection 611

(a) (b)

Figure 5 Trace Plots for Simulated Data With p = 1000 (a) Number of clusters G (b) number of included variables pγ

Table 4 Simulated Data With p = 1000Posterior Distribution of G

k p(G = k|X )

2 01113 14474 34375 33756 13267 02168 00409 0015

10 000811 000512 000713 001214 0002

Figure 6 Simulated Data With p = 1000 Marginal Posterior Prob-abilities of Sample Allocations p(yi = k|X G = 4) i = 1 15k = 1 4

73 Application to Microarray Data

We now illustrate the methodology on microarray datafrom an endometrial cancer study Endometrioid endometrialadenocarcinoma is a common gynecologic malignancy aris-ing within the uterine lining The disease usually occurs inpostmenopausal women and is associated with the hormonalrisk factor of protracted estrogen exposure unopposed byprogestins Despite its prevalence the molecular mechanismsof its genesis are not completely known DNA microarrayswith their ability to examine thousands of genes simultaneouslycould be an efficient tool to isolate relevant genes and clustertissues into different subtypes This is especially important incancer treatment where different clinicopathologic groups areknown to vary in their response to therapy and genes identifiedto discriminate among the different subtypes may represent tar-gets for therapeutic intervention and biomarkers for improveddiagnosis

Figure 7 Simulated Data With p = 1000 Marginal Posterior Proba-bilities for Inclusion of Variables p(γj = 1|X G = 4)

612 Journal of the American Statistical Association June 2005

(a) (b)

(c) (d)

Figure 8 Analysis of Simulated Data With p = 1000 Using COSA (a) Single linkage (b) average linkage (c) complete linkage (d) relevancefor delineated cluster

Four normal endometrium samples and 10 endometrioidendometrial adenocarcinomas were collected from hysterec-tomy specimens RNA samples were isolated from the 14 tis-sues and reverse-transcribed The resulting cDNA targets wereprepared following the manufacturerrsquos instructions and hy-bridized to Affymetrix Hu6800 GeneChip arrays which con-tain 7070 probe sets The scanned images were processed withthe Affymetrix software version 31 The average differencederived by the software was used to indicate a transcript abun-dance and a global normalization procedure was applied Thetechnology does not reliably quantify low expression levels andit is subject to spot saturation at the high end For this particu-lar array the limits of reliable detection were previously set to20 and 16000 (Tadesse Ibrahim and Mutter 2003) We usedthese same thresholds and removed probe sets with at least oneunreliable reading from the analysis This left us with p = 762variables We then log-transformed the expression readings tosatisfy the assumption of normality as suggested by the BoxndashCox transformation We also rescaled each variable by its range

As described in Section 41 we specified the priors with thehyperparameters δ = 3 α = 1 and h1 = h0 = 100 We setκ1 = 001 and κ0 = 01 and assumed unequal covariancesacross clusters We used a truncated Poisson(λ = 5) prior for G

with Gmax = n We took a Bernoulli prior for γ with an expec-tation of 10 variables to be included in the model

To avoid possible dependence of the results on the initialmodel we ran four MCMC chains with widely different startingpoints (1) γj set to 0 for all jrsquos except one randomly selected(2) γj set to 1 for 10 randomly selected jrsquos (3) γj set to 1 for25 randomly selected jrsquos and (4) γj set to 1 for 50 randomlyselected jrsquos For each of the MCMC chains we ran 100000 it-erations with 40000 sweeps taken as a burn-in

We looked at both the allocation of tissues and the selec-tion of genes whose expression best discriminate among thedifferent groups The MCMC sampler mixed steadily overthe iterations As shown in the histogram of Figure 9(a) thesampler visited mostly between two and five components witha stronger support for G = 3 clusters Figure 9(b) shows thetrace plot for the number of included variables for one ofthe chains which visited mostly models with 25ndash35 variablesThe other chains behaved similarly Figure 10 gives the mar-ginal posterior probabilities p(γj = 1|XG = 3) for each ofthe chains Despite the very different starting points the fourchains visited similar regions and exhibit broadly similar mar-ginal plots To assess the concordance of the results across thefour chains we looked at the correlation coefficients betweenthe frequencies for inclusion of genes These are reported in

Tadesse Sha and Vannucci Bayesian Variable Selection 613

(a) (b)

Figure 9 MCMC Output for Endometrial Data (a) Number of visited clusters G (b) number of included variables pγ

Figure 11 along with the corresponding pairwise scatterplotsWe see that there is very good agreement across the differentMCMC runs The marginal probabilities on the other hand arenot as concordant across the chains because their derivationmakes use of model posterior probabilities [see (23)] which areaffected by the inclusion of correlated variables This is a prob-lem inherent to DNA microarray data where multiple geneswith similar expression profiles are often observed

We pooled and relabeled the output of the four chains nor-malized the relative posterior probabilities and recomputedp(γj = 1|XG = 3) These marginal probabilities are displayedin Figure 12 There are 31 genes with posterior probabilitygreater than 5 We also estimated the marginal posterior proba-bilities for the sample allocations p( yi = k|XG) based on the31 selected genes These are shown in Figure 13 We success-fully identified the four normal tissues The results also seem tosuggest that there are possibly two subtypes within the malig-nant tissues

Figure 10 Endometrial Data Marginal posterior probabilities for in-clusion of variables p(γj = 1|X G = 3) for each of the four chains

We also present the results from analyzing this dataset withthe COSA algorithm Figure 14 shows the resulting singlelinkage and average linkage dendrograms along with the top30 variables that distinguish the normal and tumor tissuesThere is some overlap between the genes selected by COSA andthose identified by our procedure Among the sets of 30 ldquobestrdquodiscriminating genes selected by the two methods 10 are com-mon to both

8 DISCUSSION

We have proposed a method for simultaneously clusteringhigh-dimensional data with an unknown number of componentsand selecting the variables that best discriminate the differentgroups We successfully applied the methodology to variousdatasets Our method is fully Bayesian and we provided stan-dard default recommendations for the choice of priors

Here we mention some possible extensions and directionsfor future research We have drawn posterior inference con-ditional on a fixed number of components A more attractive

Figure 11 Endometrial Data Concordance of results among the fourchains Pairwise scatterplots of frequencies for inclusion of genes

614 Journal of the American Statistical Association June 2005

Figure 12 Endometrial Data Union of four chains Marginal posteriorprobabilities p(γj = 1|X G = 3)

approach would incorporate the uncertainty on this parame-ter and combine the results for all different values of G vis-ited by the MCMC sampler To our knowledge this is an open

Figure 13 Endometrial Data Union of four chains Marginal poste-rior probabilities of sample allocations p(yi = k|X G = 3) i = 1 14k = 1 3

problem that requires further research One promising approachinvolves considering the posterior probability of pairs of ob-

(a) (b)

(c) (d)

Figure 14 Analysis of Endometrial Data Using COSA (a) Single linkage (b) average linkage (c) variables leading to cluster 1 (d) variablesleading to cluster 2

Tadesse Sha and Vannucci Bayesian Variable Selection 615

servations being allocated to the same cluster regardless of thenumber of visited components As long as there is no interestin estimating the component parameters the problem of labelswitching would not be a concern But this leads to

(n2

)proba-

bilities and it is not clear how to best summarize them For ex-ample Medvedovic and Sivaganesan (2002) used hierarchicalclustering based on these pairwise probabilities as a similaritymeasure

An interesting future avenue would be to investigate the per-formance of our method when covariance constraints are im-posed as was proposed by Banfield and Raftery (1993) Theirparameterization allows one to incorporate different assump-tions on the volume orientation and shape of the clusters

Another possible extension is to use empirical Bayes esti-mates to elicit the hyperparameters For example conditionalon a fixed number of clusters a complete log-likelihood canbe computed using all covariates and a Monte Carlo EM ap-proach developed for estimation Alternatively if prior infor-mation were available or subjective priors were preferable thenthe prior setting could be modified accordingly If interactionsamong predictors were of interest then additional interactionterms could be included in the model and the prior on γ couldbe modified to accommodate the additional terms

APPENDIX FULL CONDITIONALS

Here we give details on the derivation of the marginalized full con-ditionals under the assumption of equal and unequal covariances acrossclusters

Homogeneous Covariance 1 = middot middot middot = G =

f (Xy|Gwγ )

=int

f (Xy|Gwmicroηγ )f (micro|Gγ )f (|Gγ )

times f (η|γ )f (|γ )dmicrod dη d

=int Gprod

k=1

wnk

k

∣∣(γ )

∣∣minus(nk+1)2(2π)minus(nk+1)pγ 2h

minuspγ 21

times exp

minus1

2

Gsum

k=1

sum

xi(γ )isinCk

(xi(γ ) minus microk(γ )

)Tminus1

(γ )

(xi(γ ) minus microk(γ )

)

times exp

minus1

2

Gsum

k=1

(microk(γ ) minus micro0(γ )

)T(h1(γ )

)minus1(microk(γ ) minus micro0(γ )

)

times 2minuspγ (δ+pγ minus1)2πminuspγ (pγ minus1)4

( pγprod

j=1

(δ + pγ minus j

2

))minus1

times ∣∣Q1(γ )

∣∣(δ+pγ minus1)2∣∣(γ )

∣∣minus(δ+2pγ )2

times exp

minus1

2tr(minus1

(γ )Q1(γ )

)

times ∣∣(γ c)

∣∣minus(n+1)2(2π)minus(n+1)(pminuspγ )2h

minus(pminuspγ )20

times exp

minus1

2

nsum

i=1

(xi(γ c) minus η(γ c)

)Tminus1

(γ c)

(xi(γ c) minus η(γ c)

)

times exp

minus1

2

(η(γ c) minus micro0(γ c)

)T(h0(γ c)

)minus1(η(γ c) minus micro0(γ c)

)

times 2minus(pminuspγ )(δ+pminuspγ minus1)2πminus(pminuspγ )(pminuspγ minus1)4

times( pminuspγprod

j=1

(δ + p minus pγ minus j

2

))minus1

times ∣∣Q0(γ c)

∣∣(δ+pminuspγ minus1)2∣∣(γ c)

∣∣minus(δ+2(pminuspγ ))2

times exp

minus1

2tr(minus1

(γ c)Q0(γ c)

)

dmicrod dη d

=int Gprod

k=1

(2π)minus(nk+1)pγ 2h

minuspγ 21 wnk

k

∣∣(γ )

∣∣minus(nk+1)2

times exp

minus1

2tr(minus1

(γ )Sk(γ )

)

times exp

minus1

2

Gsum

k=1

(microk(γ ) minus

sumxi(γ )isinCk

xi(γ ) + hminus11 micro0(γ )

nk + hminus11

)T

times (nk + hminus11 )minus1

(γ )

(microk(γ ) minus

sumxi(γ )isinCk

xi(γ ) + hminus11 micro0(γ )

nk + hminus11

)

times 2minuspγ (δ+pγ minus1)2πminuspγ (pγ minus1)4

( pγprod

j=1

(δ + pγ minus j

2

))minus1

times ∣∣Q1(γ )

∣∣(δ+pγ minus1)2∣∣(γ )

∣∣minus(δ+2pγ )2

times exp

minus1

2tr(minus1

(γ )Q1(γ )

)

times (2π)minus(n+1)(pminuspγ )2hminus(pminuspγ )20

∣∣(γ c)

∣∣minus(n+1)2

times exp

minus1

2tr(minus1

(γ c)S0(γ c)

)

times exp

minus1

2

(η(γ c) minus

sumni=1 xi(γ c) + hminus1

0 micro0(γ c)

n + hminus10

)T

times (n + hminus10 )minus1

(γ c)

(η(γ c) minus

sumni=1 xi(γ c) + hminus1

0 micro0(γ c)

n + hminus10

)

times 2minus(pminuspγ )(δ+pminuspγ minus1)2πminus(pminuspγ )(pminuspγ minus1)4

times( pminuspγprod

j=1

(δ + p minus pγ minus j

2

))minus1

times ∣∣Q0(γ c)

∣∣(δ+pminuspγ minus1)2∣∣(γ c)

∣∣minus(δ+2(pminuspγ ))2

times exp

minus1

2tr(minus1

0(γ c)Q0(γ c)

)dmicrod dη d

=[ Gprod

k=1

(h1nk + 1)minuspγ 2wnkk

]πminusnpγ 2

times ∣∣Q1(γ )

∣∣(δ+pγ minus1)22minuspγ (δ+n+pγ minus1)2πminuspγ (pγ minus1)4

times( pγprod

j=1

(δ + pγ minus j

2

))minus1

timesint

∣∣(γ )

∣∣minus(δ+n+2pγ )2

times exp

minus1

2tr

(minus1

(γ )

[Q1(γ ) +

Gsum

k=1

Sk(γ )

])d

616 Journal of the American Statistical Association June 2005

times (h0n + 1)minus(pminuspγ )2πminusn(pminuspγ )2

times ∣∣Q0(γ c)

∣∣(δ+pminuspγ minus1)2

times 2minus(pminuspγ )(δ+n+pminuspγ minus1)2πminus(pminuspγ )(pminuspγ minus1)4

times( pminuspγprod

j=1

(δ + p minus pγ minus j

2

))minus1

timesint

∣∣(γ c)

∣∣minus(δ+n+2(pminuspγ ))2

times exp

minus1

2tr(minus1

(γ c)

[Q0(γ ) + S0(γ c)

])d13

= πminusnp2

[ Gprod

k=1

(h1nk + 1)minuspγ 2wnkk

](h0n + 1)minus(pminuspγ )2

timespγprod

j=1

(

(δ + n + pγ minus j

2

)

(δ + pγ minus j

2

))

timespminuspγprod

j=1

(

(δ + n + p minus pγ minus j

2

)

(δ + p minus pγ minus j

2

))

times ∣∣Q1(γ )

∣∣(δ+pγ minus1)2∣∣Q0(γ c)

∣∣(δ+pminuspγ minus1)2

times∣∣∣∣∣Q1(γ ) +

Gsum

k=1

Sk(γ )

∣∣∣∣∣

minus(δ+n+pγ minus1)2

times ∣∣Q0(γ c) + S0(γ c)

∣∣minus(δ+n+pminuspγ minus1)2

where Sk(γ ) and S0(γ ) are given by

Sk(γ ) =sum

xi(γ )isinCk

xi(γ )xTi(γ ) + hminus1

1 micro0(γ )microT0(γ )

minus (nk + hminus11 )minus1

( sum

xi(γ )isinCk

xi(γ ) + hminus11 micro0(γ )

)

times( sum

xi(γ )isinCk

xi(γ ) + hminus11 micro0(γ )

)T

=sum

xi(γ )isinCk

(xi(γ ) minus xk(γ )

)(xi(γ ) minus xk(γ )

)T

+ nk

h1nk + 1

(micro0(γ ) minus xk(γ )

)(micro0(γ ) minus xk(γ )

)T

and

S0(γ ) =nsum

i=1

xi(γ c)xTi(γ c) + hminus1

0 micro0(γ c)microT0(γ c)

minus (n + hminus10 )minus1

( nsum

i=1

xi(γ c) + hminus10 micro0(γ c)

)

times( nsum

i=1

xi(γ c) + hminus10 micro0(γ c)

)T

=nsum

i=1

(xi(γ c) minus x(γ c)

)(xi(γ c) minus x(γ c)

)T

+ n

h0n + 1

(micro0(γ c) minus x(γ c)

)(micro0(γ c) minus x(γ c)

)T

with xk(γ ) = 1nk

sumxi(γ )isinCk

xi(γ ) and x(γ c) = 1n

sumni=1 xi(γ c)

Heterogeneous Covariances

f (Xy|Gwγ )

=int

f (Xy|Gwmicroηγ )f (micro|Gγ )f (|Gγ )

times f (η|γ )f (|γ )dmicrod dη d

=int Gprod

k=1

wnk

k

∣∣k(γ )

∣∣minus(nk+1)2(2π)minus(nk+1)pγ 2h

minuspγ 21

times 2minuspγ (δ+pγ minus1)2πminuspγ (pγ minus1)4

( pγprod

j=1

(δ + pγ minus j

2

))minus1

times exp

minus1

2

Gsum

k=1

sum

xi(γ )isinCk

(xi(γ ) minus microk

)Tminus1

k(γ )

(xi(γ ) minus microk

)

times exp

minus1

2

Gsum

k=1

(microk(γ ) minus micro0(γ )

)T(h1k(γ )

)minus1

times (microk(γ ) minus micro0(γ )

)

timesGprod

k=1

∣∣Q1(γ )

∣∣(δ+pγ minus1)2∣∣k(γ )

∣∣minus(δ+2pγ )2

times exp

minus1

2tr(minus1

k(γ )Q1(γ )

)dmicrod

times πminusn(pminuspγ )2(h0n + 1)minus(pminuspγ )2

timespminuspγprod

j=1

(

(δ + n + p minus pγ minus j

2

)

(δ + p minus pγ minus j

2

))

times ∣∣Q0(γ c)

∣∣(δ+pminuspγ minus1)2∣∣Q0(γ c) + S0(γ c)

∣∣minus(δ+n+pminuspγ minus1)2

=int

int

micro

Gprod

k=1

(2π)minus(nk+1)pγ 2h

minuspγ 21 wnk

k 2minuspγ (δ+pγ minus1)2

times πminuspγ (pγ minus1)4

( pγprod

j=1

(δ + pγ minus j

2

))minus1

times ∣∣k(γ )

∣∣minus(nk+1)2

times exp

minus1

2

Gsum

k=1

(microk(γ ) minus

sumxi(γ )isinCk

xi(γ ) + hminus11 micro0(γ )

nk + hminus11

)T

times (nk + hminus11 )minus1

k(γ )

(microk(γ ) minus

sumxi(γ )isinCk

xi(γ ) + hminus11 micro0(γ )

nk + hminus11

)

timesGprod

k=1

∣∣Q1(γ )

∣∣(δ+pγ minus1)2∣∣k(γ )

∣∣minus(δ+2pγ )2

times exp

minus1

2tr(minus1

k(γ )

[Q1(γ ) + Sk(γ )

])dmicrod

times πminusn(pminuspγ )2(h0n + 1)minus(pminuspγ )2

Tadesse Sha and Vannucci Bayesian Variable Selection 617

timespminuspγprod

j=1

(

(δ + n + p minus pγ minus j

2

)

(δ + p minus pγ minus j

2

))

times ∣∣Q0(γ c)

∣∣(δ+pminuspγ minus1)2∣∣Q0(γ c) + S0(γ c)

∣∣minus(δ+n+pminuspγ minus1)2

= πminusnp2Gprod

k=1

(hnk + 1)minuspγ 2wnk

k

∣∣Q(γ )

∣∣(δ+pγ minus1)2

times 2minuspγ (δ+nk+pγ minus1)2πminuspγ (pγ minus1)4

times( pγprod

j=1

(δ + pγ minus j

2

))minus1

timesint

Gprod

k=1

∣∣k(γ )

∣∣minus(δ+nk+2pγ )2

times exp

minus1

2tr(minus1

k(γ )

[Q(γ ) + Sk(γ )

])d

times (h0n + 1)minus(pminuspγ )2

timespminuspγprod

j=1

(

(δ + n + p minus pγ minus j

2

)

(δ + p minus pγ minus j

2

))

times ∣∣Q0(γ c)

∣∣(δ+pminuspγ minus1)2∣∣Q0(γ c) + S0(γ c)

∣∣minus(δ+n+pminuspγ minus1)2

= πminusnp2Gprod

k=1

(hnk + 1)minuspγ 2wnk

k

timespγprod

j=1

(

(δ + nk + pγ minus 1

2

)(

(δ + pγ minus 1

2

)))

timesGprod

k=1

∣∣Q(γ )

∣∣(δ+pγ minus1)2∣∣Q(γ ) + Sk(γ )

∣∣minus(δ+nk+pγ minus1)2

times (h0n + 1)minus(pminuspγ )2

timespminuspγprod

j=1

(

(δ + n + p minus pγ minus j

2

)(

(δ + p minus pγ minus j

2

)))

times ∣∣Q0(γ c)

∣∣(δ+pminuspγ minus1)2∣∣Q0(γ c) + S0(γ c)

∣∣minus(δ+n+pminuspγ minus1)2

[Received May 2003 Revised August 2004]

REFERENCES

Anderson E (1935) ldquoThe Irises of the Gaspe Peninsulardquo Bulletin of the Amer-ican Iris Society 59 2ndash5

Banfield J D and Raftery A E (1993) ldquoModel-Based Gaussian and Non-Gaussian Clusteringrdquo Biometrics 49 803ndash821

Brown P J (1993) Measurement Regression and Calibration Oxford UKClarendon Press

Brown P J Vannucci M and Fearn T (1998a) ldquoBayesian Wavelength Se-lection in Multicomponent Analysisrdquo Chemometrics 12 173ndash182

(1998b) ldquoMultivariate Bayesian Variable Selection and PredictionrdquoJournal of the Royal Statistical Society Ser B 60 627ndash641

Brusco M J and Cradit J D (2001) ldquoA Variable Selection Heuristic fork-Means Clusteringrdquo Psychometrika 66 249ndash270

Chang W-C (1983) ldquoOn Using Principal Components Before Separatinga Mixture of Two Multivariate Normal Distributionsrdquo Applied Statistics 32267ndash275

Diebolt J and Robert C P (1994) ldquoEstimation of Finite Mixture Distribu-tions Through Bayesian Samplingrdquo Journal of the Royal Statistical SocietySer B 56 363ndash375

Fowlkes E B Gnanadesikan R and Kettering J R (1988) ldquoVariable Selec-tion in Clusteringrdquo Journal of Classification 5 205ndash228

Friedman J H and Meulman J J (2003) ldquoClustering Objects on Subsetsof Attributesrdquo technical report Stanford University Dept of Statistics andStanford Linear Accelerator Center

George E I and McCulloch R E (1997) ldquoApproaches for Bayesian VariableSelectionrdquo Statistica Sinica 7 339ndash373

Ghosh D and Chinnaiyan A M (2002) ldquoMixture Modelling of Gene Ex-pression Data From Microarray Experimentsrdquo Bioinformatics 18 275ndash286

Gilks W R Richardson S and Spiegelhalter D J (eds) (1996) MarkovChain Monte Carlo in Practice London Chapman amp Hall

Gnanadesikan R Kettering J R and Tao S L (1995) ldquoWeighting andSelection of Variables for Cluster Analysisrdquo Journal of Classification 12113ndash136

Green P J (1995) ldquoReversible-Jump Markov Chain Monte Carlo Computationand Bayesian Model Determinationrdquo Biometrika 82 711ndash732

Madigan D and York J (1995) ldquoBayesian Graphical Models for DiscreteDatardquo International Statistical Review 63 215ndash232

McLachlan G J Bean R W and Peel D (2002) ldquoA Mixture Model-BasedApproach to the Clustering of Microarray Expression Datardquo Bioinformatics18 413ndash422

Medvedovic M and Sivaganesan S (2002) ldquoBayesian Infinite MixtureModel-Based Clustering of Gene Expression Profilesrdquo Bioinformatics 181194ndash1206

Milligan G W (1989) ldquoA Validation Study of a Variable Weighting Algorithmfor Cluster Analysisrdquo Journal of Classification 6 53ndash71

Murtagh F and Hernaacutendez-Pajares M (1995) ldquoThe Kohonen Self-Organizing Map Method An Assessmentrdquo Journal of Classification 12165ndash190

Richardson S and Green P J (1997) ldquoOn Bayesian Analysis of MixturesWith an Unknown Number of Componentsrdquo (with discussion) Journal of theRoyal Statistical Society Ser B 59 731ndash792

Sha N Vannucci M Tadesse M G Brown P J Dragoni I Davies NRoberts T Contestabile A Salmon M Buckley C and Falciani F(2004) ldquoBayesian Variable Selection in Multinomial Probit Models to Iden-tify Molecular Signatures of Disease Stagerdquo Biometrics 60 812ndash819

Stephens M (2000a) ldquoBayesian Analysis of Mixture Models With an Un-known Number of ComponentsmdashAn Alternative to Reversible-Jump Meth-odsrdquo The Annals of Statistics 28 40ndash74

(2000b) ldquoDealing With Label Switching in Mixture Modelsrdquo Journalof the Royal Statistical Society Ser B 62 795ndash809

Tadesse M G Ibrahim J G and Mutter G L (2003) ldquoIdentification ofDifferentially Expressed Genes in High-Density Oligonucleotide Arrays Ac-counting for the Quantification Limits of the Technologyrdquo Biometrics 59542ndash554

Page 10: Bayesian Variable Selection in Clustering High-Dimensional ...marina/papers/jasa05.pdf · Bayesian Variable Selection in Clustering High-Dimensional Data Mahlet G. T ADESSE, Naijun

Tadesse Sha and Vannucci Bayesian Variable Selection 611

(a) (b)

Figure 5 Trace Plots for Simulated Data With p = 1000 (a) Number of clusters G (b) number of included variables pγ

Table 4 Simulated Data With p = 1000Posterior Distribution of G

k p(G = k|X )

2 01113 14474 34375 33756 13267 02168 00409 0015

10 000811 000512 000713 001214 0002

Figure 6 Simulated Data With p = 1000 Marginal Posterior Prob-abilities of Sample Allocations p(yi = k|X G = 4) i = 1 15k = 1 4

73 Application to Microarray Data

We now illustrate the methodology on microarray datafrom an endometrial cancer study Endometrioid endometrialadenocarcinoma is a common gynecologic malignancy aris-ing within the uterine lining The disease usually occurs inpostmenopausal women and is associated with the hormonalrisk factor of protracted estrogen exposure unopposed byprogestins Despite its prevalence the molecular mechanismsof its genesis are not completely known DNA microarrayswith their ability to examine thousands of genes simultaneouslycould be an efficient tool to isolate relevant genes and clustertissues into different subtypes This is especially important incancer treatment where different clinicopathologic groups areknown to vary in their response to therapy and genes identifiedto discriminate among the different subtypes may represent tar-gets for therapeutic intervention and biomarkers for improveddiagnosis

Figure 7 Simulated Data With p = 1000 Marginal Posterior Proba-bilities for Inclusion of Variables p(γj = 1|X G = 4)

612 Journal of the American Statistical Association June 2005

(a) (b)

(c) (d)

Figure 8 Analysis of Simulated Data With p = 1000 Using COSA (a) Single linkage (b) average linkage (c) complete linkage (d) relevancefor delineated cluster

Four normal endometrium samples and 10 endometrioidendometrial adenocarcinomas were collected from hysterec-tomy specimens RNA samples were isolated from the 14 tis-sues and reverse-transcribed The resulting cDNA targets wereprepared following the manufacturerrsquos instructions and hy-bridized to Affymetrix Hu6800 GeneChip arrays which con-tain 7070 probe sets The scanned images were processed withthe Affymetrix software version 31 The average differencederived by the software was used to indicate a transcript abun-dance and a global normalization procedure was applied Thetechnology does not reliably quantify low expression levels andit is subject to spot saturation at the high end For this particu-lar array the limits of reliable detection were previously set to20 and 16000 (Tadesse Ibrahim and Mutter 2003) We usedthese same thresholds and removed probe sets with at least oneunreliable reading from the analysis This left us with p = 762variables We then log-transformed the expression readings tosatisfy the assumption of normality as suggested by the BoxndashCox transformation We also rescaled each variable by its range

As described in Section 41 we specified the priors with thehyperparameters δ = 3 α = 1 and h1 = h0 = 100 We setκ1 = 001 and κ0 = 01 and assumed unequal covariancesacross clusters We used a truncated Poisson(λ = 5) prior for G

with Gmax = n We took a Bernoulli prior for γ with an expec-tation of 10 variables to be included in the model

To avoid possible dependence of the results on the initialmodel we ran four MCMC chains with widely different startingpoints (1) γj set to 0 for all jrsquos except one randomly selected(2) γj set to 1 for 10 randomly selected jrsquos (3) γj set to 1 for25 randomly selected jrsquos and (4) γj set to 1 for 50 randomlyselected jrsquos For each of the MCMC chains we ran 100000 it-erations with 40000 sweeps taken as a burn-in

We looked at both the allocation of tissues and the selec-tion of genes whose expression best discriminate among thedifferent groups The MCMC sampler mixed steadily overthe iterations As shown in the histogram of Figure 9(a) thesampler visited mostly between two and five components witha stronger support for G = 3 clusters Figure 9(b) shows thetrace plot for the number of included variables for one ofthe chains which visited mostly models with 25ndash35 variablesThe other chains behaved similarly Figure 10 gives the mar-ginal posterior probabilities p(γj = 1|XG = 3) for each ofthe chains Despite the very different starting points the fourchains visited similar regions and exhibit broadly similar mar-ginal plots To assess the concordance of the results across thefour chains we looked at the correlation coefficients betweenthe frequencies for inclusion of genes These are reported in

Tadesse Sha and Vannucci Bayesian Variable Selection 613

(a) (b)

Figure 9 MCMC Output for Endometrial Data (a) Number of visited clusters G (b) number of included variables pγ

Figure 11 along with the corresponding pairwise scatterplotsWe see that there is very good agreement across the differentMCMC runs The marginal probabilities on the other hand arenot as concordant across the chains because their derivationmakes use of model posterior probabilities [see (23)] which areaffected by the inclusion of correlated variables This is a prob-lem inherent to DNA microarray data where multiple geneswith similar expression profiles are often observed

We pooled and relabeled the output of the four chains nor-malized the relative posterior probabilities and recomputedp(γj = 1|XG = 3) These marginal probabilities are displayedin Figure 12 There are 31 genes with posterior probabilitygreater than 5 We also estimated the marginal posterior proba-bilities for the sample allocations p( yi = k|XG) based on the31 selected genes These are shown in Figure 13 We success-fully identified the four normal tissues The results also seem tosuggest that there are possibly two subtypes within the malig-nant tissues

Figure 10 Endometrial Data Marginal posterior probabilities for in-clusion of variables p(γj = 1|X G = 3) for each of the four chains

We also present the results from analyzing this dataset withthe COSA algorithm Figure 14 shows the resulting singlelinkage and average linkage dendrograms along with the top30 variables that distinguish the normal and tumor tissuesThere is some overlap between the genes selected by COSA andthose identified by our procedure Among the sets of 30 ldquobestrdquodiscriminating genes selected by the two methods 10 are com-mon to both

8. DISCUSSION

We have proposed a method for simultaneously clustering high-dimensional data with an unknown number of components and selecting the variables that best discriminate the different groups. We successfully applied the methodology to various datasets. Our method is fully Bayesian, and we provided standard default recommendations for the choice of priors.

Here we mention some possible extensions and directions for future research. We have drawn posterior inference conditional on a fixed number of components.

Figure 11. Endometrial Data: concordance of results among the four chains; pairwise scatterplots of frequencies for inclusion of genes.


Figure 12. Endometrial Data, union of four chains: marginal posterior probabilities p(γj = 1|X, G = 3).

A more attractive approach would incorporate the uncertainty on this parameter and combine the results for all the different values of G visited by the MCMC sampler. To our knowledge, this is an open problem that requires further research.

Figure 13. Endometrial Data, union of four chains: marginal posterior probabilities of sample allocations, p(yi = k|X, G = 3), i = 1, ..., 14, k = 1, ..., 3.

One promising approach involves considering the posterior probability of pairs of observations being allocated to the same cluster, regardless of the number of visited components.


Figure 14. Analysis of Endometrial Data Using COSA: (a) single linkage; (b) average linkage; (c) variables leading to cluster 1; (d) variables leading to cluster 2.


As long as there is no interest in estimating the component parameters, the problem of label switching would not be a concern. But this leads to $\binom{n}{2}$ probabilities, and it is not clear how best to summarize them. For example, Medvedovic and Sivaganesan (2002) used hierarchical clustering based on these pairwise probabilities as a similarity measure.
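To make the idea concrete, the sketch below computes the matrix of pairwise co-clustering probabilities from saved MCMC allocation vectors and then applies hierarchical clustering with 1 minus that probability as the dissimilarity, in the spirit of Medvedovic and Sivaganesan (2002); the input format is an assumption, not the authors' implementation.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def coclustering_matrix(allocations):
    """allocations: (n_draws x n) array of sample labels y_i saved over the MCMC.
    Returns the n x n matrix of posterior probabilities that two samples fall in
    the same cluster, regardless of the number of components or label switching."""
    draws, n = allocations.shape
    P = np.zeros((n, n))
    for y in allocations:
        P += (y[:, None] == y[None, :])
    return P / draws

def cluster_from_pairs(P, n_groups):
    """Hierarchical clustering using 1 - P as a dissimilarity matrix."""
    D = 1.0 - P
    condensed = D[np.triu_indices_from(D, k=1)]   # condensed upper-triangle distances
    Z = linkage(condensed, method='average')
    return fcluster(Z, t=n_groups, criterion='maxclust')
```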

An interesting future avenue would be to investigate the performance of our method when covariance constraints are imposed, as was proposed by Banfield and Raftery (1993). Their parameterization allows one to incorporate different assumptions on the volume, orientation, and shape of the clusters.

Another possible extension is to use empirical Bayes estimates to elicit the hyperparameters. For example, conditional on a fixed number of clusters, a complete log-likelihood can be computed using all covariates and a Monte Carlo EM approach developed for estimation. Alternatively, if prior information were available or subjective priors were preferable, then the prior setting could be modified accordingly. If interactions among predictors were of interest, then additional interaction terms could be included in the model and the prior on γ could be modified to accommodate the additional terms.

APPENDIX: FULL CONDITIONALS

Here we give details on the derivation of the marginalized full conditionals under the assumptions of equal and unequal covariances across clusters.

Homogeneous Covariance: $\Sigma_1 = \cdots = \Sigma_G = \Sigma$

\begin{align*}
f(X, y \mid G, w, \gamma)
&= \int f(X, y \mid G, w, \mu, \Sigma, \eta, \Omega, \gamma)\, f(\mu \mid G, \Sigma, \gamma)\, f(\Sigma \mid G, \gamma) \\
&\qquad \times f(\eta \mid \Omega, \gamma)\, f(\Omega \mid \gamma)\, d\mu\, d\Sigma\, d\eta\, d\Omega .
\end{align*}

Writing out the normal likelihood contributions of the selected variables within each cluster and of the nonselected variables, together with the normal priors on $\mu_{k(\gamma)}$ and $\eta_{(\gamma^c)}$ and the inverse-Wishart priors on $\Sigma_{(\gamma)}$ and $\Omega_{(\gamma^c)}$, then completing the squares in $\mu_{k(\gamma)}$ and $\eta_{(\gamma^c)}$ and integrating these mean parameters out, leaves inverse-Wishart integrals in $\Sigma_{(\gamma)}$ and $\Omega_{(\gamma^c)}$:

\begin{align*}
f(X, y \mid G, w, \gamma)
&= \Biggl[\prod_{k=1}^G (h_1 n_k + 1)^{-p_\gamma/2} w_k^{n_k}\Biggr] \pi^{-n p_\gamma/2}
\bigl|Q_{1(\gamma)}\bigr|^{(\delta+p_\gamma-1)/2}
2^{-p_\gamma(\delta+n+p_\gamma-1)/2} \pi^{-p_\gamma(p_\gamma-1)/4}
\Biggl(\prod_{j=1}^{p_\gamma} \Gamma\Bigl(\tfrac{\delta+p_\gamma-j}{2}\Bigr)\Biggr)^{-1} \\
&\quad \times \int \bigl|\Sigma_{(\gamma)}\bigr|^{-(\delta+n+2p_\gamma)/2}
\exp\Biggl\{-\tfrac{1}{2}\,\mathrm{tr}\Biggl(\Sigma_{(\gamma)}^{-1}\Bigl[Q_{1(\gamma)} + \sum_{k=1}^G S_{k(\gamma)}\Bigr]\Biggr)\Biggr\}\, d\Sigma \\
&\quad \times (h_0 n + 1)^{-(p-p_\gamma)/2} \pi^{-n(p-p_\gamma)/2}
\bigl|Q_{0(\gamma^c)}\bigr|^{(\delta+p-p_\gamma-1)/2}
2^{-(p-p_\gamma)(\delta+n+p-p_\gamma-1)/2} \pi^{-(p-p_\gamma)(p-p_\gamma-1)/4}
\Biggl(\prod_{j=1}^{p-p_\gamma} \Gamma\Bigl(\tfrac{\delta+p-p_\gamma-j}{2}\Bigr)\Biggr)^{-1} \\
&\quad \times \int \bigl|\Omega_{(\gamma^c)}\bigr|^{-(\delta+n+2(p-p_\gamma))/2}
\exp\Bigl\{-\tfrac{1}{2}\,\mathrm{tr}\bigl(\Omega_{(\gamma^c)}^{-1}\bigl[Q_{0(\gamma^c)} + S_{0(\gamma^c)}\bigr]\bigr)\Bigr\}\, d\Omega .
\end{align*}

The remaining integrals are inverse-Wishart normalizing constants, so

\begin{align*}
f(X, y \mid G, w, \gamma)
&= \pi^{-np/2} \Biggl[\prod_{k=1}^G (h_1 n_k + 1)^{-p_\gamma/2} w_k^{n_k}\Biggr] (h_0 n + 1)^{-(p-p_\gamma)/2} \\
&\quad \times \prod_{j=1}^{p_\gamma} \frac{\Gamma\bigl(\tfrac{\delta+n+p_\gamma-j}{2}\bigr)}{\Gamma\bigl(\tfrac{\delta+p_\gamma-j}{2}\bigr)}
\prod_{j=1}^{p-p_\gamma} \frac{\Gamma\bigl(\tfrac{\delta+n+p-p_\gamma-j}{2}\bigr)}{\Gamma\bigl(\tfrac{\delta+p-p_\gamma-j}{2}\bigr)} \\
&\quad \times \bigl|Q_{1(\gamma)}\bigr|^{(\delta+p_\gamma-1)/2} \bigl|Q_{0(\gamma^c)}\bigr|^{(\delta+p-p_\gamma-1)/2}
\Biggl|Q_{1(\gamma)} + \sum_{k=1}^G S_{k(\gamma)}\Biggr|^{-(\delta+n+p_\gamma-1)/2}
\bigl|Q_{0(\gamma^c)} + S_{0(\gamma^c)}\bigr|^{-(\delta+n+p-p_\gamma-1)/2},
\end{align*}

where $S_{k(\gamma)}$ and $S_{0(\gamma^c)}$ are given by

\begin{align*}
S_{k(\gamma)}
&= \sum_{x_{i(\gamma)}\in C_k} x_{i(\gamma)} x_{i(\gamma)}^T + h_1^{-1}\mu_{0(\gamma)}\mu_{0(\gamma)}^T
- (n_k + h_1^{-1})^{-1}\Biggl(\sum_{x_{i(\gamma)}\in C_k} x_{i(\gamma)} + h_1^{-1}\mu_{0(\gamma)}\Biggr)
\Biggl(\sum_{x_{i(\gamma)}\in C_k} x_{i(\gamma)} + h_1^{-1}\mu_{0(\gamma)}\Biggr)^T \\
&= \sum_{x_{i(\gamma)}\in C_k} \bigl(x_{i(\gamma)} - \bar{x}_{k(\gamma)}\bigr)\bigl(x_{i(\gamma)} - \bar{x}_{k(\gamma)}\bigr)^T
+ \frac{n_k}{h_1 n_k + 1}\bigl(\mu_{0(\gamma)} - \bar{x}_{k(\gamma)}\bigr)\bigl(\mu_{0(\gamma)} - \bar{x}_{k(\gamma)}\bigr)^T
\end{align*}

and

\begin{align*}
S_{0(\gamma^c)}
&= \sum_{i=1}^n x_{i(\gamma^c)} x_{i(\gamma^c)}^T + h_0^{-1}\mu_{0(\gamma^c)}\mu_{0(\gamma^c)}^T
- (n + h_0^{-1})^{-1}\Biggl(\sum_{i=1}^n x_{i(\gamma^c)} + h_0^{-1}\mu_{0(\gamma^c)}\Biggr)
\Biggl(\sum_{i=1}^n x_{i(\gamma^c)} + h_0^{-1}\mu_{0(\gamma^c)}\Biggr)^T \\
&= \sum_{i=1}^n \bigl(x_{i(\gamma^c)} - \bar{x}_{(\gamma^c)}\bigr)\bigl(x_{i(\gamma^c)} - \bar{x}_{(\gamma^c)}\bigr)^T
+ \frac{n}{h_0 n + 1}\bigl(\mu_{0(\gamma^c)} - \bar{x}_{(\gamma^c)}\bigr)\bigl(\mu_{0(\gamma^c)} - \bar{x}_{(\gamma^c)}\bigr)^T,
\end{align*}

with $\bar{x}_{k(\gamma)} = \frac{1}{n_k}\sum_{x_{i(\gamma)}\in C_k} x_{i(\gamma)}$ and $\bar{x}_{(\gamma^c)} = \frac{1}{n}\sum_{i=1}^n x_{i(\gamma^c)}$.
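To illustrate how the closed form above can be evaluated, here is a sketch of the log marginal likelihood for the equal-covariance case. It is not the authors' code: it assumes Q1(γ) = κ1·I and Q0(γᶜ) = κ0·I (a simple stand-in consistent with the κ1, κ0 settings used in the examples) and that the weight vector w is ordered to match np.unique(y).

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_homogeneous(X, y, w, gamma, mu0, h1, h0, delta, kappa1, kappa0):
    """Log of f(X, y | G, w, gamma) under a common covariance, using the
    closed form derived above.  X: n x p data; y: cluster labels; w: weights;
    gamma: 0/1 inclusion vector; mu0: prior mean of length p."""
    n, p = X.shape
    sel = gamma.astype(bool)
    pg = int(sel.sum())
    Xg, Xc = X[:, sel], X[:, ~sel]
    m0g, m0c = mu0[sel], mu0[~sel]

    def lgr(m, q):
        # sum_j [ logGamma((delta + m + q - j)/2) - logGamma((delta + q - j)/2) ]
        j = np.arange(1, q + 1)
        return np.sum(gammaln((delta + m + q - j) / 2.0)
                      - gammaln((delta + q - j) / 2.0))

    out = -0.5 * n * p * np.log(np.pi)
    S_sum = np.zeros((pg, pg))
    for g, lab in enumerate(np.unique(y)):
        idx = (y == lab)
        nk = int(idx.sum())
        xbar_k = Xg[idx].mean(axis=0)
        R = Xg[idx] - xbar_k
        S_sum += R.T @ R + nk / (h1 * nk + 1.0) * np.outer(m0g - xbar_k, m0g - xbar_k)
        out += nk * np.log(w[g]) - 0.5 * pg * np.log(h1 * nk + 1.0)

    xbar = Xc.mean(axis=0)
    Rc = Xc - xbar
    S0 = Rc.T @ Rc + n / (h0 * n + 1.0) * np.outer(m0c - xbar, m0c - xbar)

    out += -0.5 * (p - pg) * np.log(h0 * n + 1.0)
    out += lgr(n, pg) + lgr(n, p - pg)
    out += 0.5 * (delta + pg - 1) * pg * np.log(kappa1)            # log|kappa1 * I|
    out += 0.5 * (delta + p - pg - 1) * (p - pg) * np.log(kappa0)  # log|kappa0 * I|
    out -= 0.5 * (delta + n + pg - 1) * np.linalg.slogdet(kappa1 * np.eye(pg) + S_sum)[1]
    out -= 0.5 * (delta + n + p - pg - 1) * np.linalg.slogdet(kappa0 * np.eye(p - pg) + S0)[1]
    return out
```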

Heterogeneous Covariances

\begin{align*}
f(X, y \mid G, w, \gamma)
&= \int f(X, y \mid G, w, \mu, \Sigma, \eta, \Omega, \gamma)\, f(\mu \mid G, \Sigma, \gamma)\, f(\Sigma \mid G, \gamma) \\
&\qquad \times f(\eta \mid \Omega, \gamma)\, f(\Omega \mid \gamma)\, d\mu\, d\Sigma\, d\eta\, d\Omega .
\end{align*}

The factor corresponding to the nonselected variables is identical to that of the homogeneous case. For the selected variables, the same steps are applied cluster by cluster: completing the square in $\mu_{k(\gamma)}$, integrating it out, and then integrating $\Sigma_{k(\gamma)}$ against its inverse-Wishart prior with scale $Q_{1(\gamma)}$. This yields

\begin{align*}
f(X, y \mid G, w, \gamma)
&= \pi^{-np/2} \prod_{k=1}^G \Biggl[(h_1 n_k + 1)^{-p_\gamma/2} w_k^{n_k}
\prod_{j=1}^{p_\gamma} \frac{\Gamma\bigl(\tfrac{\delta+n_k+p_\gamma-j}{2}\bigr)}{\Gamma\bigl(\tfrac{\delta+p_\gamma-j}{2}\bigr)}
\bigl|Q_{1(\gamma)}\bigr|^{(\delta+p_\gamma-1)/2}
\bigl|Q_{1(\gamma)} + S_{k(\gamma)}\bigr|^{-(\delta+n_k+p_\gamma-1)/2}\Biggr] \\
&\quad \times (h_0 n + 1)^{-(p-p_\gamma)/2}
\prod_{j=1}^{p-p_\gamma} \frac{\Gamma\bigl(\tfrac{\delta+n+p-p_\gamma-j}{2}\bigr)}{\Gamma\bigl(\tfrac{\delta+p-p_\gamma-j}{2}\bigr)}
\bigl|Q_{0(\gamma^c)}\bigr|^{(\delta+p-p_\gamma-1)/2}
\bigl|Q_{0(\gamma^c)} + S_{0(\gamma^c)}\bigr|^{-(\delta+n+p-p_\gamma-1)/2},
\end{align*}

with $S_{k(\gamma)}$ and $S_{0(\gamma^c)}$ as defined above.

[Received May 2003. Revised August 2004.]

REFERENCES

Anderson, E. (1935), "The Irises of the Gaspe Peninsula," Bulletin of the American Iris Society, 59, 2–5.

Banfield, J. D., and Raftery, A. E. (1993), "Model-Based Gaussian and Non-Gaussian Clustering," Biometrics, 49, 803–821.

Brown, P. J. (1993), Measurement, Regression, and Calibration, Oxford, U.K.: Clarendon Press.

Brown, P. J., Vannucci, M., and Fearn, T. (1998a), "Bayesian Wavelength Selection in Multicomponent Analysis," Chemometrics, 12, 173–182.

——— (1998b), "Multivariate Bayesian Variable Selection and Prediction," Journal of the Royal Statistical Society, Ser. B, 60, 627–641.

Brusco, M. J., and Cradit, J. D. (2001), "A Variable Selection Heuristic for k-Means Clustering," Psychometrika, 66, 249–270.

Chang, W.-C. (1983), "On Using Principal Components Before Separating a Mixture of Two Multivariate Normal Distributions," Applied Statistics, 32, 267–275.

Diebolt, J., and Robert, C. P. (1994), "Estimation of Finite Mixture Distributions Through Bayesian Sampling," Journal of the Royal Statistical Society, Ser. B, 56, 363–375.

Fowlkes, E. B., Gnanadesikan, R., and Kettering, J. R. (1988), "Variable Selection in Clustering," Journal of Classification, 5, 205–228.

Friedman, J. H., and Meulman, J. J. (2003), "Clustering Objects on Subsets of Attributes," technical report, Stanford University, Dept. of Statistics and Stanford Linear Accelerator Center.

George, E. I., and McCulloch, R. E. (1997), "Approaches for Bayesian Variable Selection," Statistica Sinica, 7, 339–373.

Ghosh, D., and Chinnaiyan, A. M. (2002), "Mixture Modelling of Gene Expression Data From Microarray Experiments," Bioinformatics, 18, 275–286.

Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. (eds.) (1996), Markov Chain Monte Carlo in Practice, London: Chapman & Hall.

Gnanadesikan, R., Kettering, J. R., and Tao, S. L. (1995), "Weighting and Selection of Variables for Cluster Analysis," Journal of Classification, 12, 113–136.

Green, P. J. (1995), "Reversible-Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination," Biometrika, 82, 711–732.

Madigan, D., and York, J. (1995), "Bayesian Graphical Models for Discrete Data," International Statistical Review, 63, 215–232.

McLachlan, G. J., Bean, R. W., and Peel, D. (2002), "A Mixture Model-Based Approach to the Clustering of Microarray Expression Data," Bioinformatics, 18, 413–422.

Medvedovic, M., and Sivaganesan, S. (2002), "Bayesian Infinite Mixture Model-Based Clustering of Gene Expression Profiles," Bioinformatics, 18, 1194–1206.

Milligan, G. W. (1989), "A Validation Study of a Variable Weighting Algorithm for Cluster Analysis," Journal of Classification, 6, 53–71.

Murtagh, F., and Hernández-Pajares, M. (1995), "The Kohonen Self-Organizing Map Method: An Assessment," Journal of Classification, 12, 165–190.

Richardson, S., and Green, P. J. (1997), "On Bayesian Analysis of Mixtures With an Unknown Number of Components" (with discussion), Journal of the Royal Statistical Society, Ser. B, 59, 731–792.

Sha, N., Vannucci, M., Tadesse, M. G., Brown, P. J., Dragoni, I., Davies, N., Roberts, T., Contestabile, A., Salmon, M., Buckley, C., and Falciani, F. (2004), "Bayesian Variable Selection in Multinomial Probit Models to Identify Molecular Signatures of Disease Stage," Biometrics, 60, 812–819.

Stephens, M. (2000a), "Bayesian Analysis of Mixture Models With an Unknown Number of Components—An Alternative to Reversible-Jump Methods," The Annals of Statistics, 28, 40–74.

——— (2000b), "Dealing With Label Switching in Mixture Models," Journal of the Royal Statistical Society, Ser. B, 62, 795–809.

Tadesse, M. G., Ibrahim, J. G., and Mutter, G. L. (2003), "Identification of Differentially Expressed Genes in High-Density Oligonucleotide Arrays Accounting for the Quantification Limits of the Technology," Biometrics, 59, 542–554.

Page 11: Bayesian Variable Selection in Clustering High-Dimensional ...marina/papers/jasa05.pdf · Bayesian Variable Selection in Clustering High-Dimensional Data Mahlet G. T ADESSE, Naijun

612 Journal of the American Statistical Association June 2005

(a) (b)

(c) (d)

Figure 8 Analysis of Simulated Data With p = 1000 Using COSA (a) Single linkage (b) average linkage (c) complete linkage (d) relevancefor delineated cluster

Four normal endometrium samples and 10 endometrioidendometrial adenocarcinomas were collected from hysterec-tomy specimens RNA samples were isolated from the 14 tis-sues and reverse-transcribed The resulting cDNA targets wereprepared following the manufacturerrsquos instructions and hy-bridized to Affymetrix Hu6800 GeneChip arrays which con-tain 7070 probe sets The scanned images were processed withthe Affymetrix software version 31 The average differencederived by the software was used to indicate a transcript abun-dance and a global normalization procedure was applied Thetechnology does not reliably quantify low expression levels andit is subject to spot saturation at the high end For this particu-lar array the limits of reliable detection were previously set to20 and 16000 (Tadesse Ibrahim and Mutter 2003) We usedthese same thresholds and removed probe sets with at least oneunreliable reading from the analysis This left us with p = 762variables We then log-transformed the expression readings tosatisfy the assumption of normality as suggested by the BoxndashCox transformation We also rescaled each variable by its range

As described in Section 41 we specified the priors with thehyperparameters δ = 3 α = 1 and h1 = h0 = 100 We setκ1 = 001 and κ0 = 01 and assumed unequal covariancesacross clusters We used a truncated Poisson(λ = 5) prior for G

with Gmax = n We took a Bernoulli prior for γ with an expec-tation of 10 variables to be included in the model

To avoid possible dependence of the results on the initialmodel we ran four MCMC chains with widely different startingpoints (1) γj set to 0 for all jrsquos except one randomly selected(2) γj set to 1 for 10 randomly selected jrsquos (3) γj set to 1 for25 randomly selected jrsquos and (4) γj set to 1 for 50 randomlyselected jrsquos For each of the MCMC chains we ran 100000 it-erations with 40000 sweeps taken as a burn-in

We looked at both the allocation of tissues and the selec-tion of genes whose expression best discriminate among thedifferent groups The MCMC sampler mixed steadily overthe iterations As shown in the histogram of Figure 9(a) thesampler visited mostly between two and five components witha stronger support for G = 3 clusters Figure 9(b) shows thetrace plot for the number of included variables for one ofthe chains which visited mostly models with 25ndash35 variablesThe other chains behaved similarly Figure 10 gives the mar-ginal posterior probabilities p(γj = 1|XG = 3) for each ofthe chains Despite the very different starting points the fourchains visited similar regions and exhibit broadly similar mar-ginal plots To assess the concordance of the results across thefour chains we looked at the correlation coefficients betweenthe frequencies for inclusion of genes These are reported in

Tadesse Sha and Vannucci Bayesian Variable Selection 613

(a) (b)

Figure 9 MCMC Output for Endometrial Data (a) Number of visited clusters G (b) number of included variables pγ

Figure 11 along with the corresponding pairwise scatterplotsWe see that there is very good agreement across the differentMCMC runs The marginal probabilities on the other hand arenot as concordant across the chains because their derivationmakes use of model posterior probabilities [see (23)] which areaffected by the inclusion of correlated variables This is a prob-lem inherent to DNA microarray data where multiple geneswith similar expression profiles are often observed

We pooled and relabeled the output of the four chains nor-malized the relative posterior probabilities and recomputedp(γj = 1|XG = 3) These marginal probabilities are displayedin Figure 12 There are 31 genes with posterior probabilitygreater than 5 We also estimated the marginal posterior proba-bilities for the sample allocations p( yi = k|XG) based on the31 selected genes These are shown in Figure 13 We success-fully identified the four normal tissues The results also seem tosuggest that there are possibly two subtypes within the malig-nant tissues

Figure 10 Endometrial Data Marginal posterior probabilities for in-clusion of variables p(γj = 1|X G = 3) for each of the four chains

We also present the results from analyzing this dataset withthe COSA algorithm Figure 14 shows the resulting singlelinkage and average linkage dendrograms along with the top30 variables that distinguish the normal and tumor tissuesThere is some overlap between the genes selected by COSA andthose identified by our procedure Among the sets of 30 ldquobestrdquodiscriminating genes selected by the two methods 10 are com-mon to both

8 DISCUSSION

We have proposed a method for simultaneously clusteringhigh-dimensional data with an unknown number of componentsand selecting the variables that best discriminate the differentgroups We successfully applied the methodology to variousdatasets Our method is fully Bayesian and we provided stan-dard default recommendations for the choice of priors

Here we mention some possible extensions and directionsfor future research We have drawn posterior inference con-ditional on a fixed number of components A more attractive

Figure 11 Endometrial Data Concordance of results among the fourchains Pairwise scatterplots of frequencies for inclusion of genes

614 Journal of the American Statistical Association June 2005

Figure 12 Endometrial Data Union of four chains Marginal posteriorprobabilities p(γj = 1|X G = 3)

approach would incorporate the uncertainty on this parame-ter and combine the results for all different values of G vis-ited by the MCMC sampler To our knowledge this is an open

Figure 13 Endometrial Data Union of four chains Marginal poste-rior probabilities of sample allocations p(yi = k|X G = 3) i = 1 14k = 1 3

problem that requires further research One promising approachinvolves considering the posterior probability of pairs of ob-

(a) (b)

(c) (d)

Figure 14 Analysis of Endometrial Data Using COSA (a) Single linkage (b) average linkage (c) variables leading to cluster 1 (d) variablesleading to cluster 2

Tadesse Sha and Vannucci Bayesian Variable Selection 615

servations being allocated to the same cluster regardless of thenumber of visited components As long as there is no interestin estimating the component parameters the problem of labelswitching would not be a concern But this leads to

(n2

)proba-

bilities and it is not clear how to best summarize them For ex-ample Medvedovic and Sivaganesan (2002) used hierarchicalclustering based on these pairwise probabilities as a similaritymeasure

An interesting future avenue would be to investigate the per-formance of our method when covariance constraints are im-posed as was proposed by Banfield and Raftery (1993) Theirparameterization allows one to incorporate different assump-tions on the volume orientation and shape of the clusters

Another possible extension is to use empirical Bayes esti-mates to elicit the hyperparameters For example conditionalon a fixed number of clusters a complete log-likelihood canbe computed using all covariates and a Monte Carlo EM ap-proach developed for estimation Alternatively if prior infor-mation were available or subjective priors were preferable thenthe prior setting could be modified accordingly If interactionsamong predictors were of interest then additional interactionterms could be included in the model and the prior on γ couldbe modified to accommodate the additional terms

APPENDIX FULL CONDITIONALS

Here we give details on the derivation of the marginalized full con-ditionals under the assumption of equal and unequal covariances acrossclusters

Homogeneous Covariance 1 = middot middot middot = G =

f (Xy|Gwγ )

=int

f (Xy|Gwmicroηγ )f (micro|Gγ )f (|Gγ )

times f (η|γ )f (|γ )dmicrod dη d

=int Gprod

k=1

wnk

k

∣∣(γ )

∣∣minus(nk+1)2(2π)minus(nk+1)pγ 2h

minuspγ 21

times exp

minus1

2

Gsum

k=1

sum

xi(γ )isinCk

(xi(γ ) minus microk(γ )

)Tminus1

(γ )

(xi(γ ) minus microk(γ )

)

times exp

minus1

2

Gsum

k=1

(microk(γ ) minus micro0(γ )

)T(h1(γ )

)minus1(microk(γ ) minus micro0(γ )

)

times 2minuspγ (δ+pγ minus1)2πminuspγ (pγ minus1)4

( pγprod

j=1

(δ + pγ minus j

2

))minus1

times ∣∣Q1(γ )

∣∣(δ+pγ minus1)2∣∣(γ )

∣∣minus(δ+2pγ )2

times exp

minus1

2tr(minus1

(γ )Q1(γ )

)

times ∣∣(γ c)

∣∣minus(n+1)2(2π)minus(n+1)(pminuspγ )2h

minus(pminuspγ )20

times exp

minus1

2

nsum

i=1

(xi(γ c) minus η(γ c)

)Tminus1

(γ c)

(xi(γ c) minus η(γ c)

)

times exp

minus1

2

(η(γ c) minus micro0(γ c)

)T(h0(γ c)

)minus1(η(γ c) minus micro0(γ c)

)

times 2minus(pminuspγ )(δ+pminuspγ minus1)2πminus(pminuspγ )(pminuspγ minus1)4

times( pminuspγprod

j=1

(δ + p minus pγ minus j

2

))minus1

times ∣∣Q0(γ c)

∣∣(δ+pminuspγ minus1)2∣∣(γ c)

∣∣minus(δ+2(pminuspγ ))2

times exp

minus1

2tr(minus1

(γ c)Q0(γ c)

)

dmicrod dη d

=int Gprod

k=1

(2π)minus(nk+1)pγ 2h

minuspγ 21 wnk

k

∣∣(γ )

∣∣minus(nk+1)2

times exp

minus1

2tr(minus1

(γ )Sk(γ )

)

times exp

minus1

2

Gsum

k=1

(microk(γ ) minus

sumxi(γ )isinCk

xi(γ ) + hminus11 micro0(γ )

nk + hminus11

)T

times (nk + hminus11 )minus1

(γ )

(microk(γ ) minus

sumxi(γ )isinCk

xi(γ ) + hminus11 micro0(γ )

nk + hminus11

)

times 2minuspγ (δ+pγ minus1)2πminuspγ (pγ minus1)4

( pγprod

j=1

(δ + pγ minus j

2

))minus1

times ∣∣Q1(γ )

∣∣(δ+pγ minus1)2∣∣(γ )

∣∣minus(δ+2pγ )2

times exp

minus1

2tr(minus1

(γ )Q1(γ )

)

times (2π)minus(n+1)(pminuspγ )2hminus(pminuspγ )20

∣∣(γ c)

∣∣minus(n+1)2

times exp

minus1

2tr(minus1

(γ c)S0(γ c)

)

times exp

minus1

2

(η(γ c) minus

sumni=1 xi(γ c) + hminus1

0 micro0(γ c)

n + hminus10

)T

times (n + hminus10 )minus1

(γ c)

(η(γ c) minus

sumni=1 xi(γ c) + hminus1

0 micro0(γ c)

n + hminus10

)

times 2minus(pminuspγ )(δ+pminuspγ minus1)2πminus(pminuspγ )(pminuspγ minus1)4

times( pminuspγprod

j=1

(δ + p minus pγ minus j

2

))minus1

times ∣∣Q0(γ c)

∣∣(δ+pminuspγ minus1)2∣∣(γ c)

∣∣minus(δ+2(pminuspγ ))2

times exp

minus1

2tr(minus1

0(γ c)Q0(γ c)

)dmicrod dη d

=[ Gprod

k=1

(h1nk + 1)minuspγ 2wnkk

]πminusnpγ 2

times ∣∣Q1(γ )

∣∣(δ+pγ minus1)22minuspγ (δ+n+pγ minus1)2πminuspγ (pγ minus1)4

times( pγprod

j=1

(δ + pγ minus j

2

))minus1

timesint

∣∣(γ )

∣∣minus(δ+n+2pγ )2

times exp

minus1

2tr

(minus1

(γ )

[Q1(γ ) +

Gsum

k=1

Sk(γ )

])d

616 Journal of the American Statistical Association June 2005

times (h0n + 1)minus(pminuspγ )2πminusn(pminuspγ )2

times ∣∣Q0(γ c)

∣∣(δ+pminuspγ minus1)2

times 2minus(pminuspγ )(δ+n+pminuspγ minus1)2πminus(pminuspγ )(pminuspγ minus1)4

times( pminuspγprod

j=1

(δ + p minus pγ minus j

2

))minus1

timesint

∣∣(γ c)

∣∣minus(δ+n+2(pminuspγ ))2

times exp

minus1

2tr(minus1

(γ c)

[Q0(γ ) + S0(γ c)

])d13

= πminusnp2

[ Gprod

k=1

(h1nk + 1)minuspγ 2wnkk

](h0n + 1)minus(pminuspγ )2

timespγprod

j=1

(

(δ + n + pγ minus j

2

)

(δ + pγ minus j

2

))

timespminuspγprod

j=1

(

(δ + n + p minus pγ minus j

2

)

(δ + p minus pγ minus j

2

))

times ∣∣Q1(γ )

∣∣(δ+pγ minus1)2∣∣Q0(γ c)

∣∣(δ+pminuspγ minus1)2

times∣∣∣∣∣Q1(γ ) +

Gsum

k=1

Sk(γ )

∣∣∣∣∣

minus(δ+n+pγ minus1)2

times ∣∣Q0(γ c) + S0(γ c)

∣∣minus(δ+n+pminuspγ minus1)2

where Sk(γ ) and S0(γ ) are given by

Sk(γ ) =sum

xi(γ )isinCk

xi(γ )xTi(γ ) + hminus1

1 micro0(γ )microT0(γ )

minus (nk + hminus11 )minus1

( sum

xi(γ )isinCk

xi(γ ) + hminus11 micro0(γ )

)

times( sum

xi(γ )isinCk

xi(γ ) + hminus11 micro0(γ )

)T

=sum

xi(γ )isinCk

(xi(γ ) minus xk(γ )

)(xi(γ ) minus xk(γ )

)T

+ nk

h1nk + 1

(micro0(γ ) minus xk(γ )

)(micro0(γ ) minus xk(γ )

)T

and

S0(γ ) =nsum

i=1

xi(γ c)xTi(γ c) + hminus1

0 micro0(γ c)microT0(γ c)

minus (n + hminus10 )minus1

( nsum

i=1

xi(γ c) + hminus10 micro0(γ c)

)

times( nsum

i=1

xi(γ c) + hminus10 micro0(γ c)

)T

=nsum

i=1

(xi(γ c) minus x(γ c)

)(xi(γ c) minus x(γ c)

)T

+ n

h0n + 1

(micro0(γ c) minus x(γ c)

)(micro0(γ c) minus x(γ c)

)T

with xk(γ ) = 1nk

sumxi(γ )isinCk

xi(γ ) and x(γ c) = 1n

sumni=1 xi(γ c)

Heterogeneous Covariances

f (Xy|Gwγ )

=int

f (Xy|Gwmicroηγ )f (micro|Gγ )f (|Gγ )

times f (η|γ )f (|γ )dmicrod dη d

=int Gprod

k=1

wnk

k

∣∣k(γ )

∣∣minus(nk+1)2(2π)minus(nk+1)pγ 2h

minuspγ 21

times 2minuspγ (δ+pγ minus1)2πminuspγ (pγ minus1)4

( pγprod

j=1

(δ + pγ minus j

2

))minus1

times exp

minus1

2

Gsum

k=1

sum

xi(γ )isinCk

(xi(γ ) minus microk

)Tminus1

k(γ )

(xi(γ ) minus microk

)

times exp

minus1

2

Gsum

k=1

(microk(γ ) minus micro0(γ )

)T(h1k(γ )

)minus1

times (microk(γ ) minus micro0(γ )

)

timesGprod

k=1

∣∣Q1(γ )

∣∣(δ+pγ minus1)2∣∣k(γ )

∣∣minus(δ+2pγ )2

times exp

minus1

2tr(minus1

k(γ )Q1(γ )

)dmicrod

times πminusn(pminuspγ )2(h0n + 1)minus(pminuspγ )2

timespminuspγprod

j=1

(

(δ + n + p minus pγ minus j

2

)

(δ + p minus pγ minus j

2

))

times ∣∣Q0(γ c)

∣∣(δ+pminuspγ minus1)2∣∣Q0(γ c) + S0(γ c)

∣∣minus(δ+n+pminuspγ minus1)2

=int

int

micro

Gprod

k=1

(2π)minus(nk+1)pγ 2h

minuspγ 21 wnk

k 2minuspγ (δ+pγ minus1)2

times πminuspγ (pγ minus1)4

( pγprod

j=1

(δ + pγ minus j

2

))minus1

times ∣∣k(γ )

∣∣minus(nk+1)2

times exp

minus1

2

Gsum

k=1

(microk(γ ) minus

sumxi(γ )isinCk

xi(γ ) + hminus11 micro0(γ )

nk + hminus11

)T

times (nk + hminus11 )minus1

k(γ )

(microk(γ ) minus

sumxi(γ )isinCk

xi(γ ) + hminus11 micro0(γ )

nk + hminus11

)

timesGprod

k=1

∣∣Q1(γ )

∣∣(δ+pγ minus1)2∣∣k(γ )

∣∣minus(δ+2pγ )2

times exp

minus1

2tr(minus1

k(γ )

[Q1(γ ) + Sk(γ )

])dmicrod

times πminusn(pminuspγ )2(h0n + 1)minus(pminuspγ )2

Tadesse Sha and Vannucci Bayesian Variable Selection 617

timespminuspγprod

j=1

(

(δ + n + p minus pγ minus j

2

)

(δ + p minus pγ minus j

2

))

times ∣∣Q0(γ c)

∣∣(δ+pminuspγ minus1)2∣∣Q0(γ c) + S0(γ c)

∣∣minus(δ+n+pminuspγ minus1)2

= πminusnp2Gprod

k=1

(hnk + 1)minuspγ 2wnk

k

∣∣Q(γ )

∣∣(δ+pγ minus1)2

times 2minuspγ (δ+nk+pγ minus1)2πminuspγ (pγ minus1)4

times( pγprod

j=1

(δ + pγ minus j

2

))minus1

timesint

Gprod

k=1

∣∣k(γ )

∣∣minus(δ+nk+2pγ )2

times exp

minus1

2tr(minus1

k(γ )

[Q(γ ) + Sk(γ )

])d

times (h0n + 1)minus(pminuspγ )2

timespminuspγprod

j=1

(

(δ + n + p minus pγ minus j

2

)

(δ + p minus pγ minus j

2

))

times ∣∣Q0(γ c)

∣∣(δ+pminuspγ minus1)2∣∣Q0(γ c) + S0(γ c)

∣∣minus(δ+n+pminuspγ minus1)2

= πminusnp2Gprod

k=1

(hnk + 1)minuspγ 2wnk

k

timespγprod

j=1

(

(δ + nk + pγ minus 1

2

)(

(δ + pγ minus 1

2

)))

timesGprod

k=1

∣∣Q(γ )

∣∣(δ+pγ minus1)2∣∣Q(γ ) + Sk(γ )

∣∣minus(δ+nk+pγ minus1)2

times (h0n + 1)minus(pminuspγ )2

timespminuspγprod

j=1

(

(δ + n + p minus pγ minus j

2

)(

(δ + p minus pγ minus j

2

)))

times ∣∣Q0(γ c)

∣∣(δ+pminuspγ minus1)2∣∣Q0(γ c) + S0(γ c)

∣∣minus(δ+n+pminuspγ minus1)2

[Received May 2003 Revised August 2004]

REFERENCES

Anderson E (1935) ldquoThe Irises of the Gaspe Peninsulardquo Bulletin of the Amer-ican Iris Society 59 2ndash5

Banfield J D and Raftery A E (1993) ldquoModel-Based Gaussian and Non-Gaussian Clusteringrdquo Biometrics 49 803ndash821

Brown P J (1993) Measurement Regression and Calibration Oxford UKClarendon Press

Brown P J Vannucci M and Fearn T (1998a) ldquoBayesian Wavelength Se-lection in Multicomponent Analysisrdquo Chemometrics 12 173ndash182

(1998b) ldquoMultivariate Bayesian Variable Selection and PredictionrdquoJournal of the Royal Statistical Society Ser B 60 627ndash641

Brusco M J and Cradit J D (2001) ldquoA Variable Selection Heuristic fork-Means Clusteringrdquo Psychometrika 66 249ndash270

Chang W-C (1983) ldquoOn Using Principal Components Before Separatinga Mixture of Two Multivariate Normal Distributionsrdquo Applied Statistics 32267ndash275

Diebolt J and Robert C P (1994) ldquoEstimation of Finite Mixture Distribu-tions Through Bayesian Samplingrdquo Journal of the Royal Statistical SocietySer B 56 363ndash375

Fowlkes E B Gnanadesikan R and Kettering J R (1988) ldquoVariable Selec-tion in Clusteringrdquo Journal of Classification 5 205ndash228

Friedman J H and Meulman J J (2003) ldquoClustering Objects on Subsetsof Attributesrdquo technical report Stanford University Dept of Statistics andStanford Linear Accelerator Center

George E I and McCulloch R E (1997) ldquoApproaches for Bayesian VariableSelectionrdquo Statistica Sinica 7 339ndash373

Ghosh D and Chinnaiyan A M (2002) ldquoMixture Modelling of Gene Ex-pression Data From Microarray Experimentsrdquo Bioinformatics 18 275ndash286

Gilks W R Richardson S and Spiegelhalter D J (eds) (1996) MarkovChain Monte Carlo in Practice London Chapman amp Hall

Gnanadesikan R Kettering J R and Tao S L (1995) ldquoWeighting andSelection of Variables for Cluster Analysisrdquo Journal of Classification 12113ndash136

Green P J (1995) ldquoReversible-Jump Markov Chain Monte Carlo Computationand Bayesian Model Determinationrdquo Biometrika 82 711ndash732

Madigan D and York J (1995) ldquoBayesian Graphical Models for DiscreteDatardquo International Statistical Review 63 215ndash232

McLachlan G J Bean R W and Peel D (2002) ldquoA Mixture Model-BasedApproach to the Clustering of Microarray Expression Datardquo Bioinformatics18 413ndash422

Medvedovic M and Sivaganesan S (2002) ldquoBayesian Infinite MixtureModel-Based Clustering of Gene Expression Profilesrdquo Bioinformatics 181194ndash1206

Milligan G W (1989) ldquoA Validation Study of a Variable Weighting Algorithmfor Cluster Analysisrdquo Journal of Classification 6 53ndash71

Murtagh F and Hernaacutendez-Pajares M (1995) ldquoThe Kohonen Self-Organizing Map Method An Assessmentrdquo Journal of Classification 12165ndash190

Richardson S and Green P J (1997) ldquoOn Bayesian Analysis of MixturesWith an Unknown Number of Componentsrdquo (with discussion) Journal of theRoyal Statistical Society Ser B 59 731ndash792

Sha N Vannucci M Tadesse M G Brown P J Dragoni I Davies NRoberts T Contestabile A Salmon M Buckley C and Falciani F(2004) ldquoBayesian Variable Selection in Multinomial Probit Models to Iden-tify Molecular Signatures of Disease Stagerdquo Biometrics 60 812ndash819

Stephens M (2000a) ldquoBayesian Analysis of Mixture Models With an Un-known Number of ComponentsmdashAn Alternative to Reversible-Jump Meth-odsrdquo The Annals of Statistics 28 40ndash74

(2000b) ldquoDealing With Label Switching in Mixture Modelsrdquo Journalof the Royal Statistical Society Ser B 62 795ndash809

Tadesse M G Ibrahim J G and Mutter G L (2003) ldquoIdentification ofDifferentially Expressed Genes in High-Density Oligonucleotide Arrays Ac-counting for the Quantification Limits of the Technologyrdquo Biometrics 59542ndash554

Page 12: Bayesian Variable Selection in Clustering High-Dimensional ...marina/papers/jasa05.pdf · Bayesian Variable Selection in Clustering High-Dimensional Data Mahlet G. T ADESSE, Naijun

Tadesse Sha and Vannucci Bayesian Variable Selection 613

(a) (b)

Figure 9 MCMC Output for Endometrial Data (a) Number of visited clusters G (b) number of included variables pγ

Figure 11 along with the corresponding pairwise scatterplotsWe see that there is very good agreement across the differentMCMC runs The marginal probabilities on the other hand arenot as concordant across the chains because their derivationmakes use of model posterior probabilities [see (23)] which areaffected by the inclusion of correlated variables This is a prob-lem inherent to DNA microarray data where multiple geneswith similar expression profiles are often observed

We pooled and relabeled the output of the four chains nor-malized the relative posterior probabilities and recomputedp(γj = 1|XG = 3) These marginal probabilities are displayedin Figure 12 There are 31 genes with posterior probabilitygreater than 5 We also estimated the marginal posterior proba-bilities for the sample allocations p( yi = k|XG) based on the31 selected genes These are shown in Figure 13 We success-fully identified the four normal tissues The results also seem tosuggest that there are possibly two subtypes within the malig-nant tissues

Figure 10 Endometrial Data Marginal posterior probabilities for in-clusion of variables p(γj = 1|X G = 3) for each of the four chains

We also present the results from analyzing this dataset withthe COSA algorithm Figure 14 shows the resulting singlelinkage and average linkage dendrograms along with the top30 variables that distinguish the normal and tumor tissuesThere is some overlap between the genes selected by COSA andthose identified by our procedure Among the sets of 30 ldquobestrdquodiscriminating genes selected by the two methods 10 are com-mon to both

8 DISCUSSION

We have proposed a method for simultaneously clusteringhigh-dimensional data with an unknown number of componentsand selecting the variables that best discriminate the differentgroups We successfully applied the methodology to variousdatasets Our method is fully Bayesian and we provided stan-dard default recommendations for the choice of priors

Here we mention some possible extensions and directionsfor future research We have drawn posterior inference con-ditional on a fixed number of components A more attractive

Figure 11 Endometrial Data Concordance of results among the fourchains Pairwise scatterplots of frequencies for inclusion of genes

614 Journal of the American Statistical Association June 2005

Figure 12 Endometrial Data Union of four chains Marginal posteriorprobabilities p(γj = 1|X G = 3)

approach would incorporate the uncertainty on this parame-ter and combine the results for all different values of G vis-ited by the MCMC sampler To our knowledge this is an open

Figure 13 Endometrial Data Union of four chains Marginal poste-rior probabilities of sample allocations p(yi = k|X G = 3) i = 1 14k = 1 3

problem that requires further research One promising approachinvolves considering the posterior probability of pairs of ob-

(a) (b)

(c) (d)

Figure 14 Analysis of Endometrial Data Using COSA (a) Single linkage (b) average linkage (c) variables leading to cluster 1 (d) variablesleading to cluster 2

Tadesse Sha and Vannucci Bayesian Variable Selection 615

servations being allocated to the same cluster regardless of thenumber of visited components As long as there is no interestin estimating the component parameters the problem of labelswitching would not be a concern But this leads to

(n2

)proba-

bilities and it is not clear how to best summarize them For ex-ample Medvedovic and Sivaganesan (2002) used hierarchicalclustering based on these pairwise probabilities as a similaritymeasure

An interesting future avenue would be to investigate the per-formance of our method when covariance constraints are im-posed as was proposed by Banfield and Raftery (1993) Theirparameterization allows one to incorporate different assump-tions on the volume orientation and shape of the clusters

Another possible extension is to use empirical Bayes esti-mates to elicit the hyperparameters For example conditionalon a fixed number of clusters a complete log-likelihood canbe computed using all covariates and a Monte Carlo EM ap-proach developed for estimation Alternatively if prior infor-mation were available or subjective priors were preferable thenthe prior setting could be modified accordingly If interactionsamong predictors were of interest then additional interactionterms could be included in the model and the prior on γ couldbe modified to accommodate the additional terms

APPENDIX FULL CONDITIONALS

Here we give details on the derivation of the marginalized full con-ditionals under the assumption of equal and unequal covariances acrossclusters

Homogeneous Covariance 1 = middot middot middot = G =

f (Xy|Gwγ )

=int

f (Xy|Gwmicroηγ )f (micro|Gγ )f (|Gγ )

times f (η|γ )f (|γ )dmicrod dη d

=int Gprod

k=1

wnk

k

∣∣(γ )

∣∣minus(nk+1)2(2π)minus(nk+1)pγ 2h

minuspγ 21

times exp

minus1

2

Gsum

k=1

sum

xi(γ )isinCk

(xi(γ ) minus microk(γ )

)Tminus1

(γ )

(xi(γ ) minus microk(γ )

)

times exp

minus1

2

Gsum

k=1

(microk(γ ) minus micro0(γ )

)T(h1(γ )

)minus1(microk(γ ) minus micro0(γ )

)

times 2minuspγ (δ+pγ minus1)2πminuspγ (pγ minus1)4

( pγprod

j=1

(δ + pγ minus j

2

))minus1

times ∣∣Q1(γ )

∣∣(δ+pγ minus1)2∣∣(γ )

∣∣minus(δ+2pγ )2

times exp

minus1

2tr(minus1

(γ )Q1(γ )

)

times ∣∣(γ c)

∣∣minus(n+1)2(2π)minus(n+1)(pminuspγ )2h

minus(pminuspγ )20

times exp

minus1

2

nsum

i=1

(xi(γ c) minus η(γ c)

)Tminus1

(γ c)

(xi(γ c) minus η(γ c)

)

times exp

minus1

2

(η(γ c) minus micro0(γ c)

)T(h0(γ c)

)minus1(η(γ c) minus micro0(γ c)

)

times 2minus(pminuspγ )(δ+pminuspγ minus1)2πminus(pminuspγ )(pminuspγ minus1)4

times( pminuspγprod

j=1

(δ + p minus pγ minus j

2

))minus1

times ∣∣Q0(γ c)

∣∣(δ+pminuspγ minus1)2∣∣(γ c)

∣∣minus(δ+2(pminuspγ ))2

times exp

minus1

2tr(minus1

(γ c)Q0(γ c)

)

dmicrod dη d

=int Gprod

k=1

(2π)minus(nk+1)pγ 2h

minuspγ 21 wnk

k

∣∣(γ )

∣∣minus(nk+1)2

times exp

minus1

2tr(minus1

(γ )Sk(γ )

)

times exp

minus1

2

Gsum

k=1

(microk(γ ) minus

sumxi(γ )isinCk

xi(γ ) + hminus11 micro0(γ )

nk + hminus11

)T

times (nk + hminus11 )minus1

(γ )

(microk(γ ) minus

sumxi(γ )isinCk

xi(γ ) + hminus11 micro0(γ )

nk + hminus11

)

times 2minuspγ (δ+pγ minus1)2πminuspγ (pγ minus1)4

( pγprod

j=1

(δ + pγ minus j

2

))minus1

times ∣∣Q1(γ )

∣∣(δ+pγ minus1)2∣∣(γ )

∣∣minus(δ+2pγ )2

times exp

minus1

2tr(minus1

(γ )Q1(γ )

)

times (2π)minus(n+1)(pminuspγ )2hminus(pminuspγ )20

∣∣(γ c)

∣∣minus(n+1)2

times exp

minus1

2tr(minus1

(γ c)S0(γ c)

)

times exp

minus1

2

(η(γ c) minus

sumni=1 xi(γ c) + hminus1

0 micro0(γ c)

n + hminus10

)T

times (n + hminus10 )minus1

(γ c)

(η(γ c) minus

sumni=1 xi(γ c) + hminus1

0 micro0(γ c)

n + hminus10

)

times 2minus(pminuspγ )(δ+pminuspγ minus1)2πminus(pminuspγ )(pminuspγ minus1)4

times( pminuspγprod

j=1

(δ + p minus pγ minus j

2

))minus1

times ∣∣Q0(γ c)

∣∣(δ+pminuspγ minus1)2∣∣(γ c)

∣∣minus(δ+2(pminuspγ ))2

times exp

minus1

2tr(minus1

0(γ c)Q0(γ c)

)dmicrod dη d

=[ Gprod

k=1

(h1nk + 1)minuspγ 2wnkk

]πminusnpγ 2

times ∣∣Q1(γ )

∣∣(δ+pγ minus1)22minuspγ (δ+n+pγ minus1)2πminuspγ (pγ minus1)4

times( pγprod

j=1

(δ + pγ minus j

2

))minus1

timesint

∣∣(γ )

∣∣minus(δ+n+2pγ )2

times exp

minus1

2tr

(minus1

(γ )

[Q1(γ ) +

Gsum

k=1

Sk(γ )

])d

616 Journal of the American Statistical Association June 2005

times (h0n + 1)minus(pminuspγ )2πminusn(pminuspγ )2

times ∣∣Q0(γ c)

∣∣(δ+pminuspγ minus1)2

times 2minus(pminuspγ )(δ+n+pminuspγ minus1)2πminus(pminuspγ )(pminuspγ minus1)4

times( pminuspγprod

j=1

(δ + p minus pγ minus j

2

))minus1

timesint

∣∣(γ c)

∣∣minus(δ+n+2(pminuspγ ))2

times exp

minus1

2tr(minus1

(γ c)

[Q0(γ ) + S0(γ c)

])d13

= πminusnp2

[ Gprod

k=1

(h1nk + 1)minuspγ 2wnkk

](h0n + 1)minus(pminuspγ )2

timespγprod

j=1

(

(δ + n + pγ minus j

2

)

(δ + pγ minus j

2

))

timespminuspγprod

j=1

(

(δ + n + p minus pγ minus j

2

)

(δ + p minus pγ minus j

2

))

times ∣∣Q1(γ )

∣∣(δ+pγ minus1)2∣∣Q0(γ c)

∣∣(δ+pminuspγ minus1)2

times∣∣∣∣∣Q1(γ ) +

Gsum

k=1

Sk(γ )

∣∣∣∣∣

minus(δ+n+pγ minus1)2

times ∣∣Q0(γ c) + S0(γ c)

∣∣minus(δ+n+pminuspγ minus1)2

where Sk(γ ) and S0(γ ) are given by

Sk(γ ) =sum

xi(γ )isinCk

xi(γ )xTi(γ ) + hminus1

1 micro0(γ )microT0(γ )

minus (nk + hminus11 )minus1

( sum

xi(γ )isinCk

xi(γ ) + hminus11 micro0(γ )

)

times( sum

xi(γ )isinCk

xi(γ ) + hminus11 micro0(γ )

)T

=sum

xi(γ )isinCk

(xi(γ ) minus xk(γ )

)(xi(γ ) minus xk(γ )

)T

+ nk

h1nk + 1

(micro0(γ ) minus xk(γ )

)(micro0(γ ) minus xk(γ )

)T

and

S0(γ ) =nsum

i=1

xi(γ c)xTi(γ c) + hminus1

0 micro0(γ c)microT0(γ c)

minus (n + hminus10 )minus1

( nsum

i=1

xi(γ c) + hminus10 micro0(γ c)

)

times( nsum

i=1

xi(γ c) + hminus10 micro0(γ c)

)T

=nsum

i=1

(xi(γ c) minus x(γ c)

)(xi(γ c) minus x(γ c)

)T

+ n

h0n + 1

(micro0(γ c) minus x(γ c)

)(micro0(γ c) minus x(γ c)

)T

with xk(γ ) = 1nk

sumxi(γ )isinCk

xi(γ ) and x(γ c) = 1n

sumni=1 xi(γ c)

Heterogeneous Covariances

f (Xy|Gwγ )

=int

f (Xy|Gwmicroηγ )f (micro|Gγ )f (|Gγ )

times f (η|γ )f (|γ )dmicrod dη d

=int Gprod

k=1

wnk

k

∣∣k(γ )

∣∣minus(nk+1)2(2π)minus(nk+1)pγ 2h

minuspγ 21

times 2minuspγ (δ+pγ minus1)2πminuspγ (pγ minus1)4

( pγprod

j=1

(δ + pγ minus j

2

))minus1

times exp

minus1

2

Gsum

k=1

sum

xi(γ )isinCk

(xi(γ ) minus microk

)Tminus1

k(γ )

(xi(γ ) minus microk

)

times exp

minus1

2

Gsum

k=1

(microk(γ ) minus micro0(γ )

)T(h1k(γ )

)minus1

times (microk(γ ) minus micro0(γ )

)

timesGprod

k=1

∣∣Q1(γ )

∣∣(δ+pγ minus1)2∣∣k(γ )

∣∣minus(δ+2pγ )2

times exp

minus1

2tr(minus1

k(γ )Q1(γ )

)dmicrod

times πminusn(pminuspγ )2(h0n + 1)minus(pminuspγ )2

timespminuspγprod

j=1

(

(δ + n + p minus pγ minus j

2

)

(δ + p minus pγ minus j

2

))

times ∣∣Q0(γ c)

∣∣(δ+pminuspγ minus1)2∣∣Q0(γ c) + S0(γ c)

∣∣minus(δ+n+pminuspγ minus1)2

=int

int

micro

Gprod

k=1

(2π)minus(nk+1)pγ 2h

minuspγ 21 wnk

k 2minuspγ (δ+pγ minus1)2

times πminuspγ (pγ minus1)4

( pγprod

j=1

(δ + pγ minus j

2

))minus1

times ∣∣k(γ )

∣∣minus(nk+1)2

times exp

minus1

2

Gsum

k=1

(microk(γ ) minus

sumxi(γ )isinCk

xi(γ ) + hminus11 micro0(γ )

nk + hminus11

)T

times (nk + hminus11 )minus1

k(γ )

(microk(γ ) minus

sumxi(γ )isinCk

xi(γ ) + hminus11 micro0(γ )

nk + hminus11

)

timesGprod

k=1

∣∣Q1(γ )

∣∣(δ+pγ minus1)2∣∣k(γ )

∣∣minus(δ+2pγ )2

times exp

minus1

2tr(minus1

k(γ )

[Q1(γ ) + Sk(γ )

])dmicrod

times πminusn(pminuspγ )2(h0n + 1)minus(pminuspγ )2

Tadesse Sha and Vannucci Bayesian Variable Selection 617

timespminuspγprod

j=1

(

(δ + n + p minus pγ minus j

2

)

(δ + p minus pγ minus j

2

))

times ∣∣Q0(γ c)

∣∣(δ+pminuspγ minus1)2∣∣Q0(γ c) + S0(γ c)

∣∣minus(δ+n+pminuspγ minus1)2

= πminusnp2Gprod

k=1

(hnk + 1)minuspγ 2wnk

k

∣∣Q(γ )

∣∣(δ+pγ minus1)2

times 2minuspγ (δ+nk+pγ minus1)2πminuspγ (pγ minus1)4

times( pγprod

j=1

(δ + pγ minus j

2

))minus1

timesint

Gprod

k=1

∣∣k(γ )

∣∣minus(δ+nk+2pγ )2

times exp

minus1

2tr(minus1

k(γ )

[Q(γ ) + Sk(γ )

])d

times (h0n + 1)minus(pminuspγ )2

timespminuspγprod

j=1

(

(δ + n + p minus pγ minus j

2

)

(δ + p minus pγ minus j

2

))

times ∣∣Q0(γ c)

∣∣(δ+pminuspγ minus1)2∣∣Q0(γ c) + S0(γ c)

∣∣minus(δ+n+pminuspγ minus1)2

= πminusnp2Gprod

k=1

(hnk + 1)minuspγ 2wnk

k

timespγprod

j=1

(

(δ + nk + pγ minus 1

2

)(

(δ + pγ minus 1

2

)))

timesGprod

k=1

∣∣Q(γ )

∣∣(δ+pγ minus1)2∣∣Q(γ ) + Sk(γ )

∣∣minus(δ+nk+pγ minus1)2

times (h0n + 1)minus(pminuspγ )2

timespminuspγprod

j=1

(

(δ + n + p minus pγ minus j

2

)(

(δ + p minus pγ minus j

2

)))

times ∣∣Q0(γ c)

∣∣(δ+pminuspγ minus1)2∣∣Q0(γ c) + S0(γ c)

∣∣minus(δ+n+pminuspγ minus1)2

[Received May 2003 Revised August 2004]

REFERENCES

Anderson E (1935) ldquoThe Irises of the Gaspe Peninsulardquo Bulletin of the Amer-ican Iris Society 59 2ndash5

Banfield J D and Raftery A E (1993) ldquoModel-Based Gaussian and Non-Gaussian Clusteringrdquo Biometrics 49 803ndash821

Brown P J (1993) Measurement Regression and Calibration Oxford UKClarendon Press

Brown P J Vannucci M and Fearn T (1998a) ldquoBayesian Wavelength Se-lection in Multicomponent Analysisrdquo Chemometrics 12 173ndash182

(1998b) ldquoMultivariate Bayesian Variable Selection and PredictionrdquoJournal of the Royal Statistical Society Ser B 60 627ndash641

Brusco M J and Cradit J D (2001) ldquoA Variable Selection Heuristic fork-Means Clusteringrdquo Psychometrika 66 249ndash270

Chang W-C (1983) ldquoOn Using Principal Components Before Separatinga Mixture of Two Multivariate Normal Distributionsrdquo Applied Statistics 32267ndash275

Diebolt J and Robert C P (1994) ldquoEstimation of Finite Mixture Distribu-tions Through Bayesian Samplingrdquo Journal of the Royal Statistical SocietySer B 56 363ndash375

Fowlkes E B Gnanadesikan R and Kettering J R (1988) ldquoVariable Selec-tion in Clusteringrdquo Journal of Classification 5 205ndash228

Friedman J H and Meulman J J (2003) ldquoClustering Objects on Subsetsof Attributesrdquo technical report Stanford University Dept of Statistics andStanford Linear Accelerator Center

George E I and McCulloch R E (1997) ldquoApproaches for Bayesian VariableSelectionrdquo Statistica Sinica 7 339ndash373

Ghosh D and Chinnaiyan A M (2002) ldquoMixture Modelling of Gene Ex-pression Data From Microarray Experimentsrdquo Bioinformatics 18 275ndash286

Gilks W R Richardson S and Spiegelhalter D J (eds) (1996) MarkovChain Monte Carlo in Practice London Chapman amp Hall

Gnanadesikan R Kettering J R and Tao S L (1995) ldquoWeighting andSelection of Variables for Cluster Analysisrdquo Journal of Classification 12113ndash136

Green P J (1995) ldquoReversible-Jump Markov Chain Monte Carlo Computationand Bayesian Model Determinationrdquo Biometrika 82 711ndash732

Madigan D and York J (1995) ldquoBayesian Graphical Models for DiscreteDatardquo International Statistical Review 63 215ndash232

McLachlan G J Bean R W and Peel D (2002) ldquoA Mixture Model-BasedApproach to the Clustering of Microarray Expression Datardquo Bioinformatics18 413ndash422

Medvedovic M and Sivaganesan S (2002) ldquoBayesian Infinite MixtureModel-Based Clustering of Gene Expression Profilesrdquo Bioinformatics 181194ndash1206

Milligan G W (1989) ldquoA Validation Study of a Variable Weighting Algorithmfor Cluster Analysisrdquo Journal of Classification 6 53ndash71

Murtagh F and Hernaacutendez-Pajares M (1995) ldquoThe Kohonen Self-Organizing Map Method An Assessmentrdquo Journal of Classification 12165ndash190

Richardson S and Green P J (1997) ldquoOn Bayesian Analysis of MixturesWith an Unknown Number of Componentsrdquo (with discussion) Journal of theRoyal Statistical Society Ser B 59 731ndash792

Sha N Vannucci M Tadesse M G Brown P J Dragoni I Davies NRoberts T Contestabile A Salmon M Buckley C and Falciani F(2004) ldquoBayesian Variable Selection in Multinomial Probit Models to Iden-tify Molecular Signatures of Disease Stagerdquo Biometrics 60 812ndash819

Stephens M (2000a) ldquoBayesian Analysis of Mixture Models With an Un-known Number of ComponentsmdashAn Alternative to Reversible-Jump Meth-odsrdquo The Annals of Statistics 28 40ndash74

(2000b) ldquoDealing With Label Switching in Mixture Modelsrdquo Journalof the Royal Statistical Society Ser B 62 795ndash809

Tadesse M G Ibrahim J G and Mutter G L (2003) ldquoIdentification ofDifferentially Expressed Genes in High-Density Oligonucleotide Arrays Ac-counting for the Quantification Limits of the Technologyrdquo Biometrics 59542ndash554

Page 13: Bayesian Variable Selection in Clustering High-Dimensional ...marina/papers/jasa05.pdf · Bayesian Variable Selection in Clustering High-Dimensional Data Mahlet G. T ADESSE, Naijun

614 Journal of the American Statistical Association June 2005

Figure 12 Endometrial Data Union of four chains Marginal posteriorprobabilities p(γj = 1|X G = 3)

approach would incorporate the uncertainty on this parame-ter and combine the results for all different values of G vis-ited by the MCMC sampler To our knowledge this is an open

Figure 13 Endometrial Data Union of four chains Marginal poste-rior probabilities of sample allocations p(yi = k|X G = 3) i = 1 14k = 1 3

problem that requires further research One promising approachinvolves considering the posterior probability of pairs of ob-

(a) (b)

(c) (d)

Figure 14 Analysis of Endometrial Data Using COSA (a) Single linkage (b) average linkage (c) variables leading to cluster 1 (d) variablesleading to cluster 2

Tadesse Sha and Vannucci Bayesian Variable Selection 615

servations being allocated to the same cluster regardless of thenumber of visited components As long as there is no interestin estimating the component parameters the problem of labelswitching would not be a concern But this leads to

(n2

)proba-

bilities and it is not clear how to best summarize them For ex-ample Medvedovic and Sivaganesan (2002) used hierarchicalclustering based on these pairwise probabilities as a similaritymeasure

An interesting future avenue would be to investigate the per-formance of our method when covariance constraints are im-posed as was proposed by Banfield and Raftery (1993) Theirparameterization allows one to incorporate different assump-tions on the volume orientation and shape of the clusters

Another possible extension is to use empirical Bayes esti-mates to elicit the hyperparameters For example conditionalon a fixed number of clusters a complete log-likelihood canbe computed using all covariates and a Monte Carlo EM ap-proach developed for estimation Alternatively if prior infor-mation were available or subjective priors were preferable thenthe prior setting could be modified accordingly If interactionsamong predictors were of interest then additional interactionterms could be included in the model and the prior on γ couldbe modified to accommodate the additional terms

APPENDIX FULL CONDITIONALS

Here we give details on the derivation of the marginalized full con-ditionals under the assumption of equal and unequal covariances acrossclusters

Homogeneous Covariance 1 = middot middot middot = G =

f (Xy|Gwγ )

=int

f (Xy|Gwmicroηγ )f (micro|Gγ )f (|Gγ )

times f (η|γ )f (|γ )dmicrod dη d

=int Gprod

k=1

wnk

k

∣∣(γ )

∣∣minus(nk+1)2(2π)minus(nk+1)pγ 2h

minuspγ 21

times exp

minus1

2

Gsum

k=1

sum

xi(γ )isinCk

(xi(γ ) minus microk(γ )

)Tminus1

(γ )

(xi(γ ) minus microk(γ )

)

times exp

minus1

2

Gsum

k=1

(microk(γ ) minus micro0(γ )

)T(h1(γ )

)minus1(microk(γ ) minus micro0(γ )

)

times 2minuspγ (δ+pγ minus1)2πminuspγ (pγ minus1)4

( pγprod

j=1

(δ + pγ minus j

2

))minus1

times ∣∣Q1(γ )

∣∣(δ+pγ minus1)2∣∣(γ )

∣∣minus(δ+2pγ )2

times exp

minus1

2tr(minus1

(γ )Q1(γ )

)

times ∣∣(γ c)

∣∣minus(n+1)2(2π)minus(n+1)(pminuspγ )2h

minus(pminuspγ )20

times exp

minus1

2

nsum

i=1

(xi(γ c) minus η(γ c)

)Tminus1

(γ c)

(xi(γ c) minus η(γ c)

)

times exp

minus1

2

(η(γ c) minus micro0(γ c)

)T(h0(γ c)

)minus1(η(γ c) minus micro0(γ c)

)

times 2minus(pminuspγ )(δ+pminuspγ minus1)2πminus(pminuspγ )(pminuspγ minus1)4

times( pminuspγprod

j=1

(δ + p minus pγ minus j

2

))minus1

times ∣∣Q0(γ c)

∣∣(δ+pminuspγ minus1)2∣∣(γ c)

∣∣minus(δ+2(pminuspγ ))2

times exp

minus1

2tr(minus1

(γ c)Q0(γ c)

)

dmicrod dη d

=int Gprod

k=1

(2π)minus(nk+1)pγ 2h

minuspγ 21 wnk

k

∣∣(γ )

∣∣minus(nk+1)2

times exp

minus1

2tr(minus1

(γ )Sk(γ )

)

times exp

minus1

2

Gsum

k=1

(microk(γ ) minus

sumxi(γ )isinCk

xi(γ ) + hminus11 micro0(γ )

nk + hminus11

)T

times (nk + hminus11 )minus1

(γ )

(microk(γ ) minus

sumxi(γ )isinCk

xi(γ ) + hminus11 micro0(γ )

nk + hminus11

)

times 2minuspγ (δ+pγ minus1)2πminuspγ (pγ minus1)4

( pγprod

j=1

(δ + pγ minus j

2

))minus1

times ∣∣Q1(γ )

∣∣(δ+pγ minus1)2∣∣(γ )

∣∣minus(δ+2pγ )2

times exp

minus1

2tr(minus1

(γ )Q1(γ )

)

times (2π)minus(n+1)(pminuspγ )2hminus(pminuspγ )20

∣∣(γ c)

∣∣minus(n+1)2

times exp

minus1

2tr(minus1

(γ c)S0(γ c)

)

times exp

minus1

2

(η(γ c) minus

sumni=1 xi(γ c) + hminus1

0 micro0(γ c)

n + hminus10

)T

times (n + hminus10 )minus1

(γ c)

(η(γ c) minus

sumni=1 xi(γ c) + hminus1

0 micro0(γ c)

n + hminus10

)

times 2minus(pminuspγ )(δ+pminuspγ minus1)2πminus(pminuspγ )(pminuspγ minus1)4

times( pminuspγprod

j=1

(δ + p minus pγ minus j

2

))minus1

times ∣∣Q0(γ c)

∣∣(δ+pminuspγ minus1)2∣∣(γ c)

∣∣minus(δ+2(pminuspγ ))2

times exp

minus1

2tr(minus1

0(γ c)Q0(γ c)

)dmicrod dη d

=[ Gprod

k=1

(h1nk + 1)minuspγ 2wnkk

]πminusnpγ 2

times ∣∣Q1(γ )

∣∣(δ+pγ minus1)22minuspγ (δ+n+pγ minus1)2πminuspγ (pγ minus1)4

times( pγprod

j=1

(δ + pγ minus j

2

))minus1

timesint

∣∣(γ )

∣∣minus(δ+n+2pγ )2

times exp

minus1

2tr

(minus1

(γ )

[Q1(γ ) +

Gsum

k=1

Sk(γ )

])d

616 Journal of the American Statistical Association June 2005

times (h0n + 1)minus(pminuspγ )2πminusn(pminuspγ )2

times ∣∣Q0(γ c)

∣∣(δ+pminuspγ minus1)2

times 2minus(pminuspγ )(δ+n+pminuspγ minus1)2πminus(pminuspγ )(pminuspγ minus1)4

times( pminuspγprod

j=1

(δ + p minus pγ minus j

2

))minus1

timesint

∣∣(γ c)

∣∣minus(δ+n+2(pminuspγ ))2

times exp

minus1

2tr(minus1

(γ c)

[Q0(γ ) + S0(γ c)

])d13

= πminusnp2

[ Gprod

k=1

(h1nk + 1)minuspγ 2wnkk

](h0n + 1)minus(pminuspγ )2

timespγprod

j=1

(

(δ + n + pγ minus j

2

)

(δ + pγ minus j

2

))

timespminuspγprod

j=1

(

(δ + n + p minus pγ minus j

2

)

(δ + p minus pγ minus j

2

))

times ∣∣Q1(γ )

∣∣(δ+pγ minus1)2∣∣Q0(γ c)

∣∣(δ+pminuspγ minus1)2

times∣∣∣∣∣Q1(γ ) +

Gsum

k=1

Sk(γ )

∣∣∣∣∣

minus(δ+n+pγ minus1)2

times ∣∣Q0(γ c) + S0(γ c)

∣∣minus(δ+n+pminuspγ minus1)2

where Sk(γ ) and S0(γ ) are given by

Sk(γ ) =sum

xi(γ )isinCk

xi(γ )xTi(γ ) + hminus1

1 micro0(γ )microT0(γ )

minus (nk + hminus11 )minus1

( sum

xi(γ )isinCk

xi(γ ) + hminus11 micro0(γ )

)

times( sum

xi(γ )isinCk

xi(γ ) + hminus11 micro0(γ )

)T

=sum

xi(γ )isinCk

(xi(γ ) minus xk(γ )

)(xi(γ ) minus xk(γ )

)T

+ nk

h1nk + 1

(micro0(γ ) minus xk(γ )

)(micro0(γ ) minus xk(γ )

)T

and

S0(γ ) =nsum

i=1

xi(γ c)xTi(γ c) + hminus1

0 micro0(γ c)microT0(γ c)

minus (n + hminus10 )minus1

( nsum

i=1

xi(γ c) + hminus10 micro0(γ c)

)

times( nsum

i=1

xi(γ c) + hminus10 micro0(γ c)

)T

=nsum

i=1

(xi(γ c) minus x(γ c)

)(xi(γ c) minus x(γ c)

)T

+ n

h0n + 1

(micro0(γ c) minus x(γ c)

)(micro0(γ c) minus x(γ c)

)T

with xk(γ ) = 1nk

sumxi(γ )isinCk

xi(γ ) and x(γ c) = 1n

sumni=1 xi(γ c)

Heterogeneous Covariances

\[
\begin{aligned}
f(X, y \mid G, w, \gamma)
&= \int f(X, y \mid G, w, \mu, \Sigma, \eta, \Omega, \gamma)\, f(\mu \mid G, \Sigma, \gamma)\, f(\Sigma \mid G, \gamma)\, f(\eta \mid \Omega, \gamma)\, f(\Omega \mid \gamma)\, d\mu\, d\Sigma\, d\eta\, d\Omega \\[6pt]
&= \int \prod_{k=1}^{G}\bigg[ w_k^{n_k}\,\big|\Sigma_{k(\gamma)}\big|^{-(n_k+1)/2}(2\pi)^{-(n_k+1)p_\gamma/2}\, h_1^{-p_\gamma/2}\,
2^{-p_\gamma(\delta+p_\gamma-1)/2}\,\pi^{-p_\gamma(p_\gamma-1)/4}\bigg(\prod_{j=1}^{p_\gamma}\Gamma\Big(\frac{\delta+p_\gamma-j}{2}\Big)\bigg)^{-1}\bigg] \\
&\qquad\times \exp\bigg\{-\frac{1}{2}\sum_{k=1}^{G}\sum_{x_{i(\gamma)}\in C_k}\big(x_{i(\gamma)}-\mu_{k(\gamma)}\big)^{T}\Sigma_{k(\gamma)}^{-1}\big(x_{i(\gamma)}-\mu_{k(\gamma)}\big)\bigg\}
\exp\bigg\{-\frac{1}{2}\sum_{k=1}^{G}\big(\mu_{k(\gamma)}-\mu_{0(\gamma)}\big)^{T}\big(h_1\Sigma_{k(\gamma)}\big)^{-1}\big(\mu_{k(\gamma)}-\mu_{0(\gamma)}\big)\bigg\} \\
&\qquad\times \prod_{k=1}^{G}\big|Q_{1(\gamma)}\big|^{(\delta+p_\gamma-1)/2}\big|\Sigma_{k(\gamma)}\big|^{-(\delta+2p_\gamma)/2}\exp\Big\{-\tfrac{1}{2}\mathrm{tr}\big(\Sigma_{k(\gamma)}^{-1}Q_{1(\gamma)}\big)\Big\}\, d\mu\, d\Sigma \\
&\qquad\times \pi^{-n(p-p_\gamma)/2}(h_0 n+1)^{-(p-p_\gamma)/2}
\prod_{j=1}^{p-p_\gamma}\frac{\Gamma\big(\frac{\delta+n+p-p_\gamma-j}{2}\big)}{\Gamma\big(\frac{\delta+p-p_\gamma-j}{2}\big)}\,
\big|Q_{0(\gamma^c)}\big|^{(\delta+p-p_\gamma-1)/2}\big|Q_{0(\gamma^c)}+S_{0(\gamma^c)}\big|^{-(\delta+n+p-p_\gamma-1)/2} \\[6pt]
&= \int_{\Sigma}\int_{\mu}\prod_{k=1}^{G}\bigg[(2\pi)^{-(n_k+1)p_\gamma/2}\, h_1^{-p_\gamma/2}\, w_k^{n_k}\,
2^{-p_\gamma(\delta+p_\gamma-1)/2}\,\pi^{-p_\gamma(p_\gamma-1)/4}\bigg(\prod_{j=1}^{p_\gamma}\Gamma\Big(\frac{\delta+p_\gamma-j}{2}\Big)\bigg)^{-1}\big|\Sigma_{k(\gamma)}\big|^{-(n_k+1)/2}\bigg] \\
&\qquad\times \exp\bigg\{-\frac{1}{2}\sum_{k=1}^{G}\bigg(\mu_{k(\gamma)}-\frac{\sum_{x_{i(\gamma)}\in C_k}x_{i(\gamma)}+h_1^{-1}\mu_{0(\gamma)}}{n_k+h_1^{-1}}\bigg)^{T}\big[(n_k+h_1^{-1})^{-1}\Sigma_{k(\gamma)}\big]^{-1}\bigg(\mu_{k(\gamma)}-\frac{\sum_{x_{i(\gamma)}\in C_k}x_{i(\gamma)}+h_1^{-1}\mu_{0(\gamma)}}{n_k+h_1^{-1}}\bigg)\bigg\} \\
&\qquad\times \prod_{k=1}^{G}\big|Q_{1(\gamma)}\big|^{(\delta+p_\gamma-1)/2}\big|\Sigma_{k(\gamma)}\big|^{-(\delta+2p_\gamma)/2}\exp\Big\{-\tfrac{1}{2}\mathrm{tr}\big(\Sigma_{k(\gamma)}^{-1}\big[Q_{1(\gamma)}+S_{k(\gamma)}\big]\big)\Big\}\, d\mu\, d\Sigma \\
&\qquad\times \pi^{-n(p-p_\gamma)/2}(h_0 n+1)^{-(p-p_\gamma)/2}
\prod_{j=1}^{p-p_\gamma}\frac{\Gamma\big(\frac{\delta+n+p-p_\gamma-j}{2}\big)}{\Gamma\big(\frac{\delta+p-p_\gamma-j}{2}\big)}\,
\big|Q_{0(\gamma^c)}\big|^{(\delta+p-p_\gamma-1)/2}\big|Q_{0(\gamma^c)}+S_{0(\gamma^c)}\big|^{-(\delta+n+p-p_\gamma-1)/2} \\[6pt]
&= \pi^{-np/2}\prod_{k=1}^{G}\bigg[(h_1 n_k+1)^{-p_\gamma/2}\, w_k^{n_k}\,\big|Q_{1(\gamma)}\big|^{(\delta+p_\gamma-1)/2}\,
2^{-p_\gamma(\delta+n_k+p_\gamma-1)/2}\,\pi^{-p_\gamma(p_\gamma-1)/4}\bigg(\prod_{j=1}^{p_\gamma}\Gamma\Big(\frac{\delta+p_\gamma-j}{2}\Big)\bigg)^{-1}\bigg] \\
&\qquad\times \int\prod_{k=1}^{G}\big|\Sigma_{k(\gamma)}\big|^{-(\delta+n_k+2p_\gamma)/2}\exp\Big\{-\tfrac{1}{2}\mathrm{tr}\big(\Sigma_{k(\gamma)}^{-1}\big[Q_{1(\gamma)}+S_{k(\gamma)}\big]\big)\Big\}\, d\Sigma \\
&\qquad\times (h_0 n+1)^{-(p-p_\gamma)/2}
\prod_{j=1}^{p-p_\gamma}\frac{\Gamma\big(\frac{\delta+n+p-p_\gamma-j}{2}\big)}{\Gamma\big(\frac{\delta+p-p_\gamma-j}{2}\big)}\,
\big|Q_{0(\gamma^c)}\big|^{(\delta+p-p_\gamma-1)/2}\big|Q_{0(\gamma^c)}+S_{0(\gamma^c)}\big|^{-(\delta+n+p-p_\gamma-1)/2} \\[6pt]
&= \pi^{-np/2}\prod_{k=1}^{G}\bigg[(h_1 n_k+1)^{-p_\gamma/2}\, w_k^{n_k}
\prod_{j=1}^{p_\gamma}\frac{\Gamma\big(\frac{\delta+n_k+p_\gamma-j}{2}\big)}{\Gamma\big(\frac{\delta+p_\gamma-j}{2}\big)}\,
\big|Q_{1(\gamma)}\big|^{(\delta+p_\gamma-1)/2}\big|Q_{1(\gamma)}+S_{k(\gamma)}\big|^{-(\delta+n_k+p_\gamma-1)/2}\bigg] \\
&\qquad\times (h_0 n+1)^{-(p-p_\gamma)/2}
\prod_{j=1}^{p-p_\gamma}\frac{\Gamma\big(\frac{\delta+n+p-p_\gamma-j}{2}\big)}{\Gamma\big(\frac{\delta+p-p_\gamma-j}{2}\big)}\,
\big|Q_{0(\gamma^c)}\big|^{(\delta+p-p_\gamma-1)/2}\big|Q_{0(\gamma^c)}+S_{0(\gamma^c)}\big|^{-(\delta+n+p-p_\gamma-1)/2}.
\end{aligned}
\]
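Under the same assumptions as the earlier sketch, the heterogeneous-covariance expression can be evaluated cluster by cluster, accumulating the Gamma ratios and determinant terms with each cluster's own n_k and S_k(gamma). The snippet below is again illustrative only, with the same assumed argument names as before.

```python
import numpy as np
from scipy.special import gammaln


def log_marginal_heterogeneous(X, y, w, gamma, mu0, h1, h0, delta, Q1, Q0):
    """Minimal sketch of log f(X, y | G, w, gamma) when each cluster has its
    own covariance matrix; arguments as in the homogeneous sketch above."""
    X = np.asarray(X, float)
    y = np.asarray(y)
    mu0 = np.asarray(mu0, float)
    sel = np.asarray(gamma, bool)
    n, p = X.shape
    p_gam = int(sel.sum())
    p0 = p - p_gam

    out = -0.5 * n * p * np.log(np.pi)
    for k, wk in enumerate(w):
        # Per-cluster scatter S_k(gamma) with its own Gamma and determinant terms
        Xk = X[y == k][:, sel]
        nk = Xk.shape[0]
        xbar = Xk.mean(axis=0)
        Sk = (Xk - xbar).T @ (Xk - xbar) \
            + nk / (h1 * nk + 1.0) * np.outer(mu0[sel] - xbar, mu0[sel] - xbar)
        out += nk * np.log(wk) - 0.5 * p_gam * np.log(h1 * nk + 1.0)
        out += sum(gammaln((delta + nk + p_gam - j) / 2.0) - gammaln((delta + p_gam - j) / 2.0)
                   for j in range(1, p_gam + 1))
        out += 0.5 * (delta + p_gam - 1) * np.linalg.slogdet(Q1)[1]
        out -= 0.5 * (delta + nk + p_gam - 1) * np.linalg.slogdet(Q1 + Sk)[1]

    # Contribution of the nonselected coordinates, identical to the homogeneous case
    X0 = X[:, ~sel]
    xbar0 = X0.mean(axis=0)
    S0 = (X0 - xbar0).T @ (X0 - xbar0) \
        + n / (h0 * n + 1.0) * np.outer(mu0[~sel] - xbar0, mu0[~sel] - xbar0)
    out += -0.5 * p0 * np.log(h0 * n + 1.0)
    out += sum(gammaln((delta + n + p0 - j) / 2.0) - gammaln((delta + p0 - j) / 2.0)
               for j in range(1, p0 + 1))
    out += 0.5 * (delta + p0 - 1) * np.linalg.slogdet(Q0)[1]
    out -= 0.5 * (delta + n + p0 - 1) * np.linalg.slogdet(Q0 + S0)[1]
    return out
```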

[Received May 2003. Revised August 2004.]

REFERENCES

Anderson, E. (1935), "The Irises of the Gaspé Peninsula," Bulletin of the American Iris Society, 59, 2–5.

Banfield, J. D., and Raftery, A. E. (1993), "Model-Based Gaussian and Non-Gaussian Clustering," Biometrics, 49, 803–821.

Brown, P. J. (1993), Measurement, Regression and Calibration, Oxford, U.K.: Clarendon Press.

Brown, P. J., Vannucci, M., and Fearn, T. (1998a), "Bayesian Wavelength Selection in Multicomponent Analysis," Chemometrics, 12, 173–182.

Brown, P. J., Vannucci, M., and Fearn, T. (1998b), "Multivariate Bayesian Variable Selection and Prediction," Journal of the Royal Statistical Society, Ser. B, 60, 627–641.

Brusco, M. J., and Cradit, J. D. (2001), "A Variable Selection Heuristic for k-Means Clustering," Psychometrika, 66, 249–270.

Chang, W.-C. (1983), "On Using Principal Components Before Separating a Mixture of Two Multivariate Normal Distributions," Applied Statistics, 32, 267–275.

Diebolt, J., and Robert, C. P. (1994), "Estimation of Finite Mixture Distributions Through Bayesian Sampling," Journal of the Royal Statistical Society, Ser. B, 56, 363–375.

Fowlkes, E. B., Gnanadesikan, R., and Kettering, J. R. (1988), "Variable Selection in Clustering," Journal of Classification, 5, 205–228.

Friedman, J. H., and Meulman, J. J. (2003), "Clustering Objects on Subsets of Attributes," technical report, Stanford University, Dept. of Statistics and Stanford Linear Accelerator Center.

George, E. I., and McCulloch, R. E. (1997), "Approaches for Bayesian Variable Selection," Statistica Sinica, 7, 339–373.

Ghosh, D., and Chinnaiyan, A. M. (2002), "Mixture Modelling of Gene Expression Data From Microarray Experiments," Bioinformatics, 18, 275–286.

Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. (eds.) (1996), Markov Chain Monte Carlo in Practice, London: Chapman & Hall.

Gnanadesikan, R., Kettering, J. R., and Tao, S. L. (1995), "Weighting and Selection of Variables for Cluster Analysis," Journal of Classification, 12, 113–136.

Green, P. J. (1995), "Reversible-Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination," Biometrika, 82, 711–732.

Madigan, D., and York, J. (1995), "Bayesian Graphical Models for Discrete Data," International Statistical Review, 63, 215–232.

McLachlan, G. J., Bean, R. W., and Peel, D. (2002), "A Mixture Model-Based Approach to the Clustering of Microarray Expression Data," Bioinformatics, 18, 413–422.

Medvedovic, M., and Sivaganesan, S. (2002), "Bayesian Infinite Mixture Model-Based Clustering of Gene Expression Profiles," Bioinformatics, 18, 1194–1206.

Milligan, G. W. (1989), "A Validation Study of a Variable Weighting Algorithm for Cluster Analysis," Journal of Classification, 6, 53–71.

Murtagh, F., and Hernández-Pajares, M. (1995), "The Kohonen Self-Organizing Map Method: An Assessment," Journal of Classification, 12, 165–190.

Richardson, S., and Green, P. J. (1997), "On Bayesian Analysis of Mixtures With an Unknown Number of Components" (with discussion), Journal of the Royal Statistical Society, Ser. B, 59, 731–792.

Sha, N., Vannucci, M., Tadesse, M. G., Brown, P. J., Dragoni, I., Davies, N., Roberts, T., Contestabile, A., Salmon, M., Buckley, C., and Falciani, F. (2004), "Bayesian Variable Selection in Multinomial Probit Models to Identify Molecular Signatures of Disease Stage," Biometrics, 60, 812–819.

Stephens, M. (2000a), "Bayesian Analysis of Mixture Models With an Unknown Number of Components--An Alternative to Reversible-Jump Methods," The Annals of Statistics, 28, 40–74.

Stephens, M. (2000b), "Dealing With Label Switching in Mixture Models," Journal of the Royal Statistical Society, Ser. B, 62, 795–809.

Tadesse, M. G., Ibrahim, J. G., and Mutter, G. L. (2003), "Identification of Differentially Expressed Genes in High-Density Oligonucleotide Arrays Accounting for the Quantification Limits of the Technology," Biometrics, 59, 542–554.

