RJMCMC in clustering

Clustering by mixture model

Pham The Thong

Intelligent Systems Laboratory (知能システム研究室)

April 22, 2011

Pham The Thong (Intelligent Systems Laboratory) Clustering by mixture model April 22, 2011 1 / 44

Outline

1 RJMCMC in clustering
    Clustering overview
    Reversible Jump MCMC

2 Richardson & Green (1997): On Bayesian Analysis of Mixtures with an Unknown Number of Components
    Overview
    Split/Merge and Birth/Death Mechanism
    Algorithm
    Result

3 Tadesse et al. (2005): Bayesian Variable Selection in Clustering High-Dimensional Data
    Overview
    Variable Selection
    RJMCMC Mechanism
    Result
    Weakness of the model


Clustering overview

Divide the observations into groups.

Predict group of a new observation.

Model-based clustering: select a probabilistic model underlying the observations and make statistical inferences based on that model. One popular model is the mixture model.



Clustering via mixture model

Let X = (x1, · · · , xn) be independent p-dimensional observations from G populations.

$$f(x_i \mid w, \theta) = \sum_{k=1}^{G} w_k \, f(x_i \mid \theta_k)$$

f(xi | θk) is the density of an observation xi from the kth component.

w = (w1, · · · , wG)^T are the component weights.

θ = (θ1, · · · , θG)^T are the component parameters.

Clustering is done via the allocation vector y = (y1, · · · , yn)^T: yi = k if the ith observation xi comes from component k.
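The mixture density and the allocation rule above can be illustrated with a minimal sketch. It assumes univariate normal components with theta_k = (mu_k, sigma_k), which is a simplification of the p-dimensional setting on this slide; the function names are hypothetical.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a univariate normal N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, weights, params):
    """f(x | w, theta) = sum_k w_k f(x | theta_k), with theta_k = (mu_k, sigma_k)."""
    return sum(w * normal_pdf(x, mu, sigma)
               for w, (mu, sigma) in zip(weights, params))

def allocation_probs(x, weights, params):
    """P(y = k | x, w, theta), which is proportional to w_k f(x | theta_k)."""
    terms = [w * normal_pdf(x, mu, sigma)
             for w, (mu, sigma) in zip(weights, params)]
    total = sum(terms)
    return [t / total for t in terms]

# Two assumed components; the observation 2.5 lies close to the second mean.
weights = [0.3, 0.7]
params = [(-2.0, 1.0), (3.0, 1.0)]
probs = allocation_probs(2.5, weights, params)
```

Here `probs` gives the probability that the observation belongs to each component, which is what the allocation vector y summarizes once a component is chosen.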



Some approaches

Model selection: compare some model selection criteria of fixed-G models for various values of G to choose the best G. Inference on a fixed-G model is often done via the EM algorithm or a Gibbs sampler.

Nonparametric method: use a Dirichlet process.

Trans-dimensional Markov chain Monte Carlo (MCMC): allow G to change during the inference process by combining a Gibbs sampler with MCMC moves that can change the dimension of the model. Reversible jump MCMC (RJMCMC) is one possible scheme.


RJMCMC in clustering: Reversible Jump MCMC









Overview

First developed in Green (1995).

Its applications range well beyond mixture model analysis.

Its power for mixture model analysis was first demonstrated in Richardson & Green (1997), who considered only the 1-dimensional case.

Applied to the multidimensional setting in Tadesse et al. (2005).



Some advantages of clustering by RJMCMC

Avoids the task of model selection.

Provides a coherent Bayesian framework. The cluster number G is not treated as a special parameter.

Can provide a useful summary of the data that is difficult to obtain by other methods.



General ideas of RJMCMC I

Simulate a Markov chain that converges to the full posterior distribution p(G, y, w, θ | X).

The hybrid sampler consists of a Gibbs sampler (the base) and jump moves (the extension).

The Gibbs sampler samples (y, w, θ). The jump moves sample the cluster number G.

The jump moves come in pairs: Split/Merge and Birth/Death.



General ideas of RJMCMC II

Split move: split one component into two components.

Merge move: combine two components into one component.

Birth move: create an empty component.

Death move: delete an empty component.

At each iteration, propose a Split (Birth) move with some fixed probability bk, and with probability 1 − bk propose a Merge (Death) move.

For each proposal, calculate all the changes to the model as if the move had been made.
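The choice between a dimension-increasing and a dimension-decreasing proposal can be sketched as follows. The specific values of bk (1 at a single component, 0 at an assumed upper bound `k_max`, and 0.5 otherwise) are assumptions made for illustration; the slides only say that bk is fixed.

```python
import random

def birth_probability(k, k_max):
    """b_k: probability of proposing a dimension-increasing move (split or birth).

    Assumed values: with one component only an increase is possible, and at
    the hypothetical upper bound k_max only a decrease is possible.
    """
    if k <= 1:
        return 1.0
    if k >= k_max:
        return 0.0
    return 0.5

def propose_move_type(k, k_max, rng):
    """Return "up" (split or birth) with probability b_k, otherwise "down"."""
    return "up" if rng.random() < birth_probability(k, k_max) else "down"
```

For example, `propose_move_type(3, 10, random.Random(0))` returns either "up" or "down" with equal probability under these assumed values.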



General ideas of RJMCMC III

Calculate the acceptance probability A, which is the product of three terms:

    the ratio of the posterior of the new model to that of the old model
    the ratio of the probability of proposing the reverse move (from the new model back to the old model) to that of proposing the forward move (from the old model to the new model)
    the Jacobian arising from the change of dimension

To ensure convergence to the desired distribution, actually carry out the move only with probability min(1, A).
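A minimal sketch of the accept/reject step, assuming the three terms have already been computed on the log scale (working with logs is a common numerical choice, not something stated on the slides). The function and argument names are hypothetical.

```python
import math
import random

def acceptance_probability(log_posterior_ratio, log_proposal_ratio, log_abs_jacobian):
    """Return min(1, A), where A = posterior ratio * proposal ratio * |Jacobian|."""
    log_a = log_posterior_ratio + log_proposal_ratio + log_abs_jacobian
    return 1.0 if log_a >= 0.0 else math.exp(log_a)

def accept_move(log_posterior_ratio, log_proposal_ratio, log_abs_jacobian, rng):
    """Carry out the proposed move with probability min(1, A)."""
    prob = acceptance_probability(log_posterior_ratio, log_proposal_ratio,
                                  log_abs_jacobian)
    return rng.random() < prob
```

If `accept_move` returns False, the chain keeps the old model for this iteration.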


Richardson & Green (1997): Overview

Overview

1-dimensional data.

Goal:

    Clustering data.
    Estimating component parameters.
    Estimating the distribution of data.
    Predicting the group of new data.

Demonstrated on three real datasets: Enzyme, Acid, and Galaxy.


Richardson & Green (1997): Split/Merge and Birth/Death Mechanism









Split/Merge Mechanism

In a Split move, select one component (wj∗, µj∗, σj∗) to split into two components (wj1, µj1, σj1) and (wj2, µj2, σj2).

In a Merge move, select two components (wj1, µj1, σj1) and (wj2, µj2, σj2) to merge into one new component (wj∗, µj∗, σj∗).

The moves equate the zeroth, first, and second moments of the new component with those of the combination of the two old components.
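The moment-matching constraint can be sketched for univariate normal components stored as (w, mu, sigma) tuples. The merge follows directly from equating the three moments. The split uses three auxiliary numbers u1, u2, u3 in (0, 1); drawing them from Beta(2,2), Beta(2,2), and Beta(1,1) is an assumption for this illustration, as are the function names.

```python
import math
import random

def merge(c1, c2):
    """Merge two components by matching the zeroth, first and second moments."""
    w1, mu1, s1 = c1
    w2, mu2, s2 = c2
    w = w1 + w2                                   # zeroth moment
    mu = (w1 * mu1 + w2 * mu2) / w                # first moment
    second = (w1 * (mu1**2 + s1**2) + w2 * (mu2**2 + s2**2)) / w
    return w, mu, math.sqrt(second - mu**2)       # second moment

def split(c, u1, u2, u3):
    """Split one component so that merging the two children recovers it."""
    w, mu, s = c
    w1, w2 = w * u1, w * (1.0 - u1)
    mu1 = mu - u2 * s * math.sqrt(w2 / w1)
    mu2 = mu + u2 * s * math.sqrt(w1 / w2)
    var1 = u3 * (1.0 - u2**2) * s**2 * w / w1
    var2 = (1.0 - u3) * (1.0 - u2**2) * s**2 * w / w2
    return (w1, mu1, math.sqrt(var1)), (w2, mu2, math.sqrt(var2))

rng = random.Random(0)
# Assumed distributions for the auxiliary numbers.
u1, u2, u3 = rng.betavariate(2, 2), rng.betavariate(2, 2), rng.betavariate(1, 1)
parent = (0.4, 1.5, 2.0)
child1, child2 = split(parent, u1, u2, u3)
recovered = merge(child1, child2)
```

Because the split preserves all three moments, `recovered` equals `parent` up to rounding, which is the reversibility the Split/Merge pair relies on.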



Birth/Death Mechanism

Birth move

    Generate wj∗, µj∗, σj∗ from some distributions.
    Rescale the weights.

Death move

    Delete a randomly chosen empty component.
    Rescale the weights.
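The weight bookkeeping of the two moves can be sketched as below. Only the weights are handled; drawing µj∗ and σj∗ is omitted. Drawing the new weight from Beta(1, k), where k is the current number of components, is an assumption for this sketch, and `counts` is a hypothetical list holding the number of observations allocated to each component.

```python
import random

def birth(weights, rng):
    """Add an empty component: draw its weight, then rescale the old weights."""
    k = len(weights)
    w_new = rng.betavariate(1, k)            # assumed proposal for the new weight
    rescaled = [w * (1.0 - w_new) for w in weights]
    return rescaled + [w_new]

def death(weights, counts, rng):
    """Delete a randomly chosen empty component, then rescale the weights."""
    empty = [j for j, n in enumerate(counts) if n == 0]
    if not empty:
        return weights, counts               # no empty component: nothing to delete
    j = rng.choice(empty)
    remaining = [w for i, w in enumerate(weights) if i != j]
    total = sum(remaining)
    new_counts = [n for i, n in enumerate(counts) if i != j]
    return [w / total for w in remaining], new_counts
```

In both directions the rescaling keeps the weights summing to one, and a death applied to the component just born restores the original weights.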


Richardson & Green (1997): Algorithm

One iteration contains

Gibbs sampler:

    Updating the weights w
    Updating the parameters µ, σ
    Updating the allocation y

Split/Merge move

Birth/Death move
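Two of the Gibbs updates in the sweep above can be sketched for univariate normal components. The symmetric Dirichlet prior with parameter `delta` on the weights is an assumption of this sketch, and the update of µ and σ is omitted because it depends on priors that the slides do not specify.

```python
import math
import random

def sample_allocations(x, weights, mus, sigmas, rng):
    """Draw each y_i with probability proportional to w_k N(x_i | mu_k, sigma_k^2)."""
    y = []
    for xi in x:
        # The common factor 1/sqrt(2*pi) cancels after normalisation.
        probs = [w * math.exp(-0.5 * ((xi - m) / s) ** 2) / s
                 for w, m, s in zip(weights, mus, sigmas)]
        y.append(rng.choices(range(len(weights)), weights=probs)[0])
    return y

def sample_weights(y, n_components, delta, rng):
    """Draw w | y from Dirichlet(delta + n_1, ..., delta + n_G) via Gamma draws."""
    counts = [y.count(k) for k in range(n_components)]
    draws = [rng.gammavariate(delta + n, 1.0) for n in counts]
    total = sum(draws)
    return [d / total for d in draws]
```

A full iteration would call these updates, then the parameter update, and then attempt one Split/Merge move and one Birth/Death move.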


Richardson & Green (1997): Result









Post simulation

By processing the raw output of the simulation, one can

cluster the data by selecting the allocation vector y that has the highest frequency.

estimate component parameters by their posterior means.

estimate the distribution of the data.

predict the group of new data.
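These summaries reduce to simple counting and averaging over the stored draws. The sketch below assumes the sampler output is kept as plain Python lists (one entry per iteration) and that component labels are comparable across iterations; the function names are hypothetical.

```python
from collections import Counter

def posterior_of_G(g_samples):
    """Relative frequency of each visited number of components G."""
    counts = Counter(g_samples)
    n = len(g_samples)
    return {g: c / n for g, c in sorted(counts.items())}

def most_frequent_allocation(y_samples):
    """Allocation vector y that appears most often in the output."""
    return Counter(tuple(y) for y in y_samples).most_common(1)[0][0]

def posterior_mean(values):
    """Posterior mean of a scalar parameter, estimated by the sample average."""
    return sum(values) / len(values)
```

For component parameters, the average would normally be taken over iterations that share the same G, since the meaning of each parameter changes with the number of components.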



The three datasets

Enzyme data: enzymatic activity of one enzyme in the blood of 245 unrelated people. The interest is in identifying subgroups of slow or fast activity as a marker of genetic polymorphism in the general population (i.e., to some extent, people in the same subgroup may have a similar genetic structure although they are unrelated).

Acid data: acidity level of 155 lakes in Wisconsin.

Galaxy data: velocities of 82 galaxies diverging from our galaxy.


Tadesse et al. (2005): Overview

Overview

High-dimensional data.

Goal:

    Variable selection.
    Clustering data.
    Predicting the group of new data.

Applied to microarray data.


Tadesse et al. (2005): Variable Selection









Concept

Perhaps not all variables are useful for clustering.

By throwing away non-discriminating variables (irrelevant variables) and clustering only on discriminating variables (relevant variables), we may improve clustering accuracy.

We can think of variable selection as one way to generalize the basic approach “clustering by the full set of variables” to “clustering by a subset of variables”.



The model of Tadesse et al. I

Introduce γ = (γ1, · · · , γp): γj = 1 if the jth variable is a discriminating variable and 0 if it is not.

Use (γ) and (γc) to index discriminating variables and non-discriminating variables.

Three assumptions:

The set of discriminating variables and the set of non-discriminating variables are independent.

If we look only at (γc), the data X(γc) have a normal distribution (hence unsuitable for clustering).

If we look only at (γ), the data X(γ) have a mixture distribution of G normal components (hence suitable for clustering).



The model of Tadesse et al. II

(η(γc), Ω(γc)): mean and covariance for the non-discriminating variables.

(µk(γ), Σk(γ)): mean and covariance for the kth component Ck.

The three assumptions can be written as

$$p(X \mid G, \gamma, w, y, \mu, \Sigma, \eta, \Omega) = \prod_{i=1}^{n} N\big(x_{i(\gamma^c)}, \eta_{(\gamma^c)}, \Omega_{(\gamma^c)}\big) \prod_{k=1}^{G} \prod_{x_i \in C_k} N\big(x_{i(\gamma)}, \mu_{k(\gamma)}, \Sigma_{k(\gamma)}\big)$$


Searching for γ

The problem of variable selection is recast as a problem of searching for the most probable binary vector γ.

Use a Metropolis search (of which simulated annealing is one type).

At each step, randomly choose one of the following two transitional moves: flip one bit or swap two bits of γ, and accept the move with probability

$$\min\left(1, \frac{p(\gamma^{new} \mid X, y, w, G)}{p(\gamma^{old} \mid X, y, w, G)}\right).$$
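The search can be sketched with a generic log-posterior function. Choosing between the flip and the swap with equal probability, and treating an impossible swap as a null proposal, are assumptions of this sketch; `toy_log_post` is a made-up target used only to exercise the code and is not the marginalized posterior of the model.

```python
import math
import random

def propose_gamma(gamma, rng):
    """Flip one bit, or swap a 1 with a 0 (a null move if no swap is possible)."""
    new = list(gamma)
    if rng.random() < 0.5:
        j = rng.randrange(len(new))
        new[j] = 1 - new[j]
    else:
        ones = [j for j, g in enumerate(new) if g == 1]
        zeros = [j for j, g in enumerate(new) if g == 0]
        if ones and zeros:
            i, j = rng.choice(ones), rng.choice(zeros)
            new[i], new[j] = 0, 1
    return new

def metropolis_step(gamma, log_post, rng):
    """Accept the proposal with probability min(1, p(new | ...) / p(old | ...))."""
    proposal = propose_gamma(gamma, rng)
    log_ratio = log_post(proposal) - log_post(gamma)
    if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
        return proposal
    return gamma

# Toy target (assumption): each bit that differs from `target` costs 5 log units.
target = [1, 0, 1, 0]

def toy_log_post(gamma):
    return -5.0 * sum(a != b for a, b in zip(gamma, target))

rng = random.Random(0)
gamma = [0, 0, 0, 0]
visits = {}
for _ in range(2000):
    gamma = metropolis_step(gamma, toy_log_post, rng)
    visits[tuple(gamma)] = visits.get(tuple(gamma), 0) + 1
most_visited = max(visits, key=visits.get)
```

On this toy target the chain spends most of its time at `target`, so `most_visited` recovers it; in the actual model the log posterior would come from the marginalized posterior given X, y, w, and G.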


Tadesse et al. (2005): RJMCMC Mechanism









Difficulties in high dimension

Unlike the 1-dimensional case, there is no obvious way to split a covariance matrix into two covariance matrices. Even if this could be done [4], the Jacobian may not have a closed form.

The number of model parameters grows on the order of p^2, so the chain may converge very slowly.



Approach of Tadesse et al.

Integrate out the mean vectors and the covariance matrices to obtain a marginalized posterior in which only G, w, γ, and y are involved.

Although quite tedious, the math follows a standard framework: define conjugate priors for the mean and covariance matrix and then integrate them out.

Only the weights of the components need to be split or merged in the Split/Merge move. The Birth/Death moves are the same as in the 1-dimensional case.



Algorithm

One iteration contains

Metropolis search for γ

Gibbs sampler:

    Updating the weights w
    Updating the allocation y

Split/Merge move

Birth/Death move


Tadesse et al. (2005): Result









Post simulation

Since the mean and covariance are integrated out, there is no estimate of the component parameters.

Variable selection:

    Method 1: select the vector γ that has the highest frequency.
    Method 2: select all variables j whose marginal posterior probability p(γj = 1 | X, G) is at least some threshold a: p(γj = 1 | X, G) ≥ a.

Clustering and group prediction can be done in the same way as in the univariate case.
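Method 2 amounts to thresholding the relative frequency with which each variable is included. The sketch below assumes the sampled γ vectors are stored as lists of 0/1 values, all drawn under the G of interest; the function names are hypothetical.

```python
def marginal_inclusion(gamma_samples):
    """Estimate p(gamma_j = 1 | X, G) by the fraction of samples with gamma_j = 1."""
    n = len(gamma_samples)
    p = len(gamma_samples[0])
    return [sum(g[j] for g in gamma_samples) / n for j in range(p)]

def select_variables(gamma_samples, threshold):
    """Indices j whose estimated inclusion probability is at least the threshold."""
    freqs = marginal_inclusion(gamma_samples)
    return [j for j, freq in enumerate(freqs) if freq >= threshold]

samples = [[1, 0, 1], [1, 0, 0], [1, 1, 1], [1, 0, 1]]
selected = select_variables(samples, 0.5)
```

With these four made-up samples the inclusion frequencies are 1.0, 0.25, and 0.75, so variables 0 and 2 are selected at a threshold of 0.5.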



Microarray data

14 samples (the samples come from tissues).

Variables are genes. There are 762 variables.

By clustering the samples into subgroups, one may find out which genes are relevant to each subgroup.


Tadesse et al. (2005): Weakness of the model









Weakness of the model [5]

The independence assumption can often lead to an irrelevant variable being wrongly identified as a discriminating one because it is related to some discriminating variables.

It is not known whether this assumption can be relaxed while still allowing an RJMCMC-based full Bayesian analysis.



References

[1] P. J. Green (1995), Reversible jump Markov chain Monte Carlo computation and Bayesian model determination, Biometrika 82(4), 711-732.

[2] S. Richardson and P. J. Green (1997), On Bayesian Analysis of Mixtures with an Unknown Number of Components, J. R. Statist. Soc. B 59(4), 731-792.

[3] M. G. Tadesse, N. Sha, and M. Vannucci (2005), Bayesian Variable Selection in Clustering High-Dimensional Data, Journal of the American Statistical Association 100(470), 602-617.

[4] P. Dellaportas and I. Papageorgiou (2006), Multivariate mixtures of normals with unknown number of components, Statistics and Computing 16(1), 57-68.

[5] Maugis et al. (2009), Variable Selection for Clustering with Gaussian Mixture Models, Biometrics 65, 701-709.



Thank you for your attention

