Download - Symmetrical based projects

8/8/2019 Symmetrical based projects

1/105



2/105

MICROARRAY DATA

Represented by an $N \times M$ matrix $(y_1, \ldots, y_M)$, where $y_j$ contains the gene expressions for the $N$ genes of the $j$th tissue sample ($j = 1, \ldots, M$).

$N$ = number of genes ($10^3$ to $10^4$)
$M$ = number of tissue samples ($10$ to $10^2$)

STANDARD STATISTICAL METHODOLOGY IS APPROPRIATE FOR $M \gg N$

HERE $N \gg M$


3/105


4/105

[Figure: Microarray data represented as an N x M matrix. The N rows (Gene 1 to Gene N, N ~ 10^4) are genes and the M columns (Sample 1 to Sample M, M ~ 10^2) are samples. Each row is an expression profile; each column is an expression signature.]


5/105

Two Clustering Problems:

Clustering of genes on the basis of tissues: genes not independent

Clustering of tissues on the basis of genes: the latter is a nonstandard problem in cluster analysis (n


6/105


7/105

The notion of a cluster is not easy to define.

There is a very large literature devoted to clustering when there is a metric known in advance; e.g. k-means. Usually, there is no a priori metric (or equivalently a user-defined distance matrix) for a cluster analysis.

That is, the difficulty is that the shape of the clusters is not known until the clusters have been identified, and the clusters cannot be effectively identified unless the shapes are known.


8/105

In this case, one attractive feature of adopting mixture models with elliptically symmetric components, such as the normal or t densities, is that the implied clustering is invariant under affine transformations of the data (that is, under operations relating to changes in location, scale, and rotation of the data).

Thus the clustering process does not depend on irrelevant factors such as the units of measurement or the orientation of the clusters in space.


9/105


10/105


11/105

"Hierarchical clustering methods for the analysis of gene expression data caught on like the hula hoop. I, for one, will be glad to see them fade."

Gary Churchill (The Jackson Laboratory). Contribution to the discussion of the paper by Sebastiani, Gussoni, Kohane, and Ramoni. Statistical Science (2003) 18, 64-69.


12/105

Hierarchical (agglomerative) clustering algorithms are largely heuristically motivated and there exist a number of unresolved issues associated with their use, including how to determine the number of clusters.

(Yeung et al., 2001, Model-Based Clustering and Data Transformations for Gene Expression Data, Bioinformatics 17)

"in the absence of a well-grounded statistical model, it seems difficult to define what is meant by a good clustering algorithm or the right number of clusters."


13/105

McLachlan and Khan (2004). On a resampling approach for tests on the number of clusters with mixture model-based clustering of the tissue samples.

Special issue of the Journal of Multivariate Analysis 90 (2004), edited by Mark van der Laan and Sandrine Dudoit (UC Berkeley).


14/105

Attention is now turning towards a model-based approach to the analysis of microarray data.

For example:

Broet, Richardson, and Radvanyi (2002). Bayesian hierarchical model for identifying changes in gene expression from microarray experiments. Journal of Computational Biology 9

Ghosh and Chinnaiyan (2002). Mixture modelling of gene expression data from microarray experiments. Bioinformatics 18

Liu, Zhang, Palumbo, and Lawrence (2003). Bayesian clustering with variable and transformation selection. In Bayesian Statistics 7

Pan, Lin, and Le (2002). Model-based cluster analysis of microarray gene expression data. Genome Biology 3

Yeung et al. (2001). Model-based clustering and data transformations for gene expression data. Bioinformatics 17


15/105



16/105



17/105

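[Figure: an observation vector y with components Height, Weight, and BP, shown alongside a transformed vector with components W+H, W-H, and BP.]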


18/105

p://www.maths.uq.edu.au/~gj

McLachlan and Peel (2000), Finite Mixture Models. Wiley.


19/105


20/105

Mixture Software: EMMIX

McLachlan, Peel, Adams, and Basford
http://www.maths.uq.edu.au/~gjm/emmix/emmix.html

EMMIX for UNIX


21/105

Basic Definition

We let $Y_1, \ldots, Y_n$ denote a random sample of size $n$, where $Y_j$ is a $p$-dimensional random vector with probability density function

$$f(y_j) = \pi_1 f_1(y_j) + \cdots + \pi_g f_g(y_j),$$

where the $f_i(y_j)$ are densities and the $\pi_i$ are nonnegative quantities that sum to one.


22/105

Mixture distributions are applied to data with two main purposes in mind:

To provide an appealing semiparametric framework in which to model unknown distributional shapes, as an alternative to, say, the kernel density method.

To use the mixture model to provide a model-based clustering.

(In both situations, there is the question of how many components to include in the mixture.)


23/105

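Shapes of Some Univariate Normal Mixtures

Consider

$$f(y_j) = \pi_1 \phi(y_j; \mu_1, \sigma^2) + \pi_2 \phi(y_j; \mu_2, \sigma^2),$$

where

$$\phi(y_j; \mu, \sigma^2) = (2\pi\sigma^2)^{-1/2} \exp\{-\tfrac{1}{2}(y_j - \mu)^2 / \sigma^2\}$$

denotes the univariate normal density with mean $\mu$ and variance $\sigma^2$.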
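The shape of such a mixture can be explored numerically. The following is a minimal sketch, assuming the two component means are placed at 0 and `delta` with common variance 1; the function names and the grid-based mode count are illustrative choices, not part of the original slides.

```python
import math

def normal_pdf(y, mu, sigma2):
    """Univariate normal density with mean mu and variance sigma2."""
    return math.exp(-0.5 * (y - mu) ** 2 / sigma2) / math.sqrt(2.0 * math.pi * sigma2)

def mixture_pdf(y, pi1, mu1, mu2, sigma2=1.0):
    """Two-component normal mixture density with common variance sigma2."""
    return pi1 * normal_pdf(y, mu1, sigma2) + (1.0 - pi1) * normal_pdf(y, mu2, sigma2)

def count_modes(pi1, delta, step=0.01):
    """Count strict local maxima of the mixture density on a grid.

    Assumption for illustration: component means at 0 and delta, variance 1.
    """
    n_points = int(round((delta + 10.0) / step)) + 1
    grid = [-5.0 + step * k for k in range(n_points)]
    dens = [mixture_pdf(y, pi1, 0.0, float(delta)) for y in grid]
    return sum(1 for k in range(1, len(dens) - 1)
               if dens[k - 1] < dens[k] > dens[k + 1])

# Equal proportions, separations 1 to 4
for delta in (1, 2, 3, 4):
    print(delta, count_modes(0.5, delta))
```

With equal proportions the density has a single mode for small separations and two modes once the means are far enough apart, which is the behaviour the plots are meant to show.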


24/105

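Figure 1: Plot of a mixture density of two univariate normal components in equal proportions with common variance $\sigma^2 = 1$. Panels: $\Delta = 1$, $\Delta = 2$, $\Delta = 3$, $\Delta = 4$, where $\Delta = |\mu_1 - \mu_2| / \sigma$.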


25/105

Figure 2: Plot of a mixture density of two univariate normal components in proportions 0.75 and 0.25 with common variance. Panels: $\Delta = 1$, $\Delta = 2$, $\Delta = 3$, $\Delta = 4$.


26/105


27/105


28/105

Normal Mixtures

Computationally convenient for multivariate data

Provide an arbitrarily accurate estimate of the underlying density with g sufficiently large

Provide a probabilistic clustering of the data into g clusters; an outright clustering is obtained by assigning a data point to the component to which it has the greatest posterior probability of belonging


29/105

Synthetic Data Set 1


30/105

Synthetic Data Set 2


31/105

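Parameter  | True Values | Initial Values | Estimates by EM
$\pi_1$    | 0.333 | 0.333 | 0.294
$\pi_2$    | 0.333 | 0.333 | 0.337
$\pi_3$    | 0.333 | 0.333 | 0.370
$\mu_1$    | $(0\ \ 2)^T$ | $(-1\ \ 0)^T$ | $(-0.154\ \ 1.961)^T$
$\mu_2$    | $(0\ \ 0)^T$ | $(0\ \ 0)^T$ | $(0.360\ \ 0.115)^T$
$\mu_3$    | $(0\ \ 2)^T$ | $(1\ \ 0)^T$ | $(-0.004\ \ 2.027)^T$
$\Sigma_1$ | $\begin{pmatrix} 2 & 0 \\ 0 & 0.2 \end{pmatrix}$ | $\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$ | $\begin{pmatrix} 1.961 & 0.016 \\ 0.016 & 0.218 \end{pmatrix}$
$\Sigma_2$ | $\begin{pmatrix} 2 & 0 \\ 0 & 0.2 \end{pmatrix}$ | $\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$ | $\begin{pmatrix} 2.346 & 0.553 \\ 0.553 & 0.218 \end{pmatrix}$
$\Sigma_3$ | $\begin{pmatrix} 2 & 0 \\ 0 & 0.2 \end{pmatrix}$ | $\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$ | $\begin{pmatrix} 2.339 & 0.042 \\ 0.042 & 0.206 \end{pmatrix}$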


32/105

Figure 7


33/105


34/105

Figure 8


35/105


36/105

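MIXTURE OF g NORMAL COMPONENTS

$$f(y) = \pi_1 \phi(y; \mu_1, \Sigma_1) + \cdots + \pi_g \phi(y; \mu_g, \Sigma_g)$$

where

$$-2 \log \phi(y; \mu, \Sigma) = (y - \mu)^T \Sigma^{-1} (y - \mu) + \text{constant}$$

MAHALANOBIS DISTANCE

$$(y - \mu)^T \Sigma^{-1} (y - \mu)$$

EUCLIDEAN DISTANCE

$$(y - \mu)^T (y - \mu)$$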
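The difference between the two distances can be seen in a small two-dimensional calculation. This is a sketch only: the helper names and the numerical values are made up for illustration. It rescales the first coordinate by a factor of 100 (a change of units) and compares how each distance responds.

```python
def mahalanobis_sq(y, mu, cov):
    """Squared Mahalanobis distance (y - mu)^T cov^{-1} (y - mu) for 2-D inputs."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Explicit inverse of a 2 x 2 matrix
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = y[0] - mu[0], y[1] - mu[1]
    return (dx * (inv[0][0] * dx + inv[0][1] * dy)
            + dy * (inv[1][0] * dx + inv[1][1] * dy))

def euclidean_sq(y, mu):
    """Squared Euclidean distance (y - mu)^T (y - mu) for 2-D inputs."""
    return (y[0] - mu[0]) ** 2 + (y[1] - mu[1]) ** 2

# Illustrative values (not from the slides)
y, mu = (3.0, 1.0), (1.0, 0.0)
cov = ((4.0, 0.0), (0.0, 1.0))

# Change the units of the first coordinate by a factor s; the covariance
# matrix transforms accordingly.
s = 100.0
y_s, mu_s = (s * y[0], y[1]), (s * mu[0], mu[1])
cov_s = ((s * s * cov[0][0], s * cov[0][1]), (s * cov[1][0], cov[1][1]))

print(mahalanobis_sq(y, mu, cov), mahalanobis_sq(y_s, mu_s, cov_s))
print(euclidean_sq(y, mu), euclidean_sq(y_s, mu_s))
```

The Mahalanobis distance is unchanged by the rescaling, while the Euclidean distance is dominated by the rescaled coordinate.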


37/105

SPHERICAL CLUSTERS

MIXTURE OF g NORMAL COMPONENTS

$$f(y) = \pi_1 \phi(y; \mu_1, \Sigma_1) + \cdots + \pi_g \phi(y; \mu_g, \Sigma_g)$$

k-means: $\Sigma_1 = \cdots = \Sigma_g = \sigma^2 I$


38/105

Equal spherical covariance matrices


39/105

With a mixture model-based approach to clustering, an observation is assigned outright to the ith cluster if its density in the ith component of the mixture distribution (weighted by the prior probability of that component) is greater than in the other (g-1) components.

$$f(y) = \pi_1 \phi(y; \mu_1, \Sigma_1) + \cdots + \pi_i \phi(y; \mu_i, \Sigma_i) + \cdots + \pi_g \phi(y; \mu_g, \Sigma_g)$$
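The assignment rule can be sketched for the univariate case as follows. The function names and the example weights, means, and variances are assumptions chosen for illustration (proportions 0.75 and 0.25, means 0 and 3, unit variances).

```python
import math

def normal_pdf(y, mu, sigma2):
    """Univariate normal density with mean mu and variance sigma2."""
    return math.exp(-0.5 * (y - mu) ** 2 / sigma2) / math.sqrt(2.0 * math.pi * sigma2)

def posterior_probs(y, weights, means, variances):
    """Posterior probabilities of component membership for observation y."""
    num = [w * normal_pdf(y, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(num)
    return [x / total for x in num]

def assign(y, weights, means, variances):
    """Index of the component with the largest weighted density at y."""
    tau = posterior_probs(y, weights, means, variances)
    return max(range(len(tau)), key=tau.__getitem__)

weights, means, variances = [0.75, 0.25], [0.0, 3.0], [1.0, 1.0]
for y in (0.0, 1.6, 2.0, 3.0):
    print(y, assign(y, weights, means, variances),
          posterior_probs(y, weights, means, variances))
```

Note that y = 1.6 is nearer to the second mean than to the first, yet it is assigned to the first component, because the densities are weighted by the mixing proportions.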


40/105

Figure 7: Contours of the fitted component densities on the 2nd and 3rd variates for the blue crab data set.


41/105

Estimation of Mixture Distributions

It was the publication of the seminal paper of Dempster, Laird, and Rubin (1977) on the EM algorithm that greatly stimulated interest in the use of finite mixture distributions to model heterogeneous data.

McLachlan and Krishnan (1997, Wiley)
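To make the EM iteration concrete, here is a minimal sketch for a two-component univariate normal mixture. It is not the EMMIX implementation: the starting values (smallest and largest observation as initial means, pooled variance for both components), the fixed iteration count, and the simulated data are all assumptions made for illustration.

```python
import math
import random

def normal_pdf(y, mu, sigma2):
    """Univariate normal density with mean mu and variance sigma2."""
    return math.exp(-0.5 * (y - mu) ** 2 / sigma2) / math.sqrt(2.0 * math.pi * sigma2)

def em_two_normals(data, n_iter=200):
    """EM for a two-component univariate normal mixture (unequal variances)."""
    n = len(data)
    pi1 = 0.5
    mu = [min(data), max(data)]  # crude starting means (assumption)
    mean = sum(data) / n
    var0 = sum((y - mean) ** 2 for y in data) / n
    var = [var0, var0]
    for _ in range(n_iter):
        # E-step: posterior probability that each point belongs to component 1
        tau = []
        for y in data:
            a = pi1 * normal_pdf(y, mu[0], var[0])
            b = (1.0 - pi1) * normal_pdf(y, mu[1], var[1])
            tau.append(a / (a + b))
        # M-step: weighted proportions, means, and variances
        n1 = sum(tau)
        n2 = n - n1
        pi1 = n1 / n
        mu = [sum(t * y for t, y in zip(tau, data)) / n1,
              sum((1.0 - t) * y for t, y in zip(tau, data)) / n2]
        var = [sum(t * (y - mu[0]) ** 2 for t, y in zip(tau, data)) / n1,
               sum((1.0 - t) * (y - mu[1]) ** 2 for t, y in zip(tau, data)) / n2]
    return pi1, mu, var

# Simulated data: 300 points from N(0, 1) and 100 points from N(5, 1)
random.seed(0)
data = ([random.gauss(0.0, 1.0) for _ in range(300)]
        + [random.gauss(5.0, 1.0) for _ in range(100)])
pi1, mu, var = em_two_normals(data)
print(pi1, mu, var)
```

In practice one would monitor the log likelihood for convergence and try several starting values, since the likelihood for mixtures generally has multiple local maxima.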


42/105

If need be, the normal mixture model can be made less sensitive to outlying observations by using t component densities.

With this t mixture model-based approach, the normal distribution for each component in the mixture is embedded in a wider class of elliptically symmetric distributions with an additional parameter called the degrees of freedom.
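The reduced sensitivity to outliers comes from the weights that appear when a t density is fitted by EM: each observation receives weight (nu + p) / (nu + squared distance), so distant points count for little. The sketch below illustrates this for a single univariate t component with the scale held fixed at 1 and nu = 3; both simplifications, the function name, and the data are assumptions for illustration.

```python
def t_location(data, nu=3.0, sigma2=1.0, n_iter=100):
    """Location estimate under a univariate t model with fixed scale.

    Iteratively reweighted mean with weights (nu + 1) / (nu + d^2),
    where d^2 is the squared standardized distance from the current estimate.
    """
    mu = sorted(data)[len(data) // 2]  # start near the middle of the data
    for _ in range(n_iter):
        weights = [(nu + 1.0) / (nu + (y - mu) ** 2 / sigma2) for y in data]
        mu = sum(w * y for w, y in zip(weights, data)) / sum(weights)
    return mu

# Five well-behaved points and one gross outlier (made-up values)
data = [-1.0, -0.5, 0.0, 0.5, 1.0, 50.0]
sample_mean = sum(data) / len(data)
mu_t = t_location(data)
print(sample_mean, mu_t)
```

The sample mean, which is the normal-model estimate, is pulled far towards the outlier, whereas the t-based estimate stays close to the bulk of the data.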


43/105

The advantage of the t mixture model is that, although the number of outliers needed for breakdown is almost the same as with the normal mixture model, the outliers have to be much larger.


44/105

In exploring high-dimensional data sets for group structure, it is typical to rely on principal component analysis.


45/105

Two Groups in Two Dimensions. All cluster information would be lost by collapsing to the first principal component. The principal ellipses of the two groups are shown as solid curves.


46/105

Mixtures of Factor Analyzers

A normal mixture model without restrictions on the component-covariance matrices may be viewed as too general for many situations in practice, in particular, with high-dimensional data.

One approach for reducing the number of parameters is to work in a lower dimensional space by using principal components; another is to use mixtures of factor analyzers.


47/105

Mixtures of Factor Analyzers

Principal components or a single-factor analysis model provides only a global linear model.

A global nonlinear approach by postulating a mixture of linear submodels


48/105

$$f(y_j) = \sum_{i=1}^{g} \pi_i \phi(y_j; \mu_i, \Sigma_i),$$

where

$$\Sigma_i = B_i B_i^T + D_i \quad (i = 1, \ldots, g).$$

$B_i$ is a $p \times q$ matrix and $D_i$ is a diagonal matrix.
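The saving in parameters can be quantified. An unrestricted p x p covariance matrix has p(p+1)/2 free parameters, while the factor-analytic form has pq + p - q(q-1)/2 (the last term accounts for the rotational indeterminacy of the loadings). The sketch below uses hypothetical helper names and made-up inputs to compute both counts and to build a covariance matrix of this form.

```python
def full_cov_params(p):
    """Free parameters in an unrestricted p x p covariance matrix."""
    return p * (p + 1) // 2

def factor_cov_params(p, q):
    """Free parameters in B B^T + D with B of size p x q and D diagonal."""
    return p * q + p - q * (q - 1) // 2

def factor_covariance(B, D):
    """Build Sigma = B B^T + diag(D); B is a list of p rows of length q."""
    p = len(B)
    return [[sum(B[r][k] * B[c][k] for k in range(len(B[r])))
             + (D[r] if r == c else 0.0)
             for c in range(p)]
            for r in range(p)]

# Illustrative dimensions: p = 50 variables, q = 2 factors
print(full_cov_params(50), factor_cov_params(50, 2))

# A tiny example with p = 2, q = 1
print(factor_covariance([[1.0], [2.0]], [0.5, 0.5]))
```

These counts are per component, so in a g-component mixture the difference is multiplied by g.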


49/105

Single-Factor Analysis Model

$$Y_j = \mu + B U_j + e_j \quad (j = 1, \ldots, n),$$

where $U_j$ is a $q$-dimensional ($q < p$) vector of latent or unobservable variables called factors and $B$ is a $p \times q$ matrix of factor loadings.

g 0.

$H_0: g = g_0$ versus $H_1: g = g_1$


66/105

We let $\hat{\Psi}_i$ denote the MLE of $\Psi$ calculated under $H_i$ ($i = 0, 1$). Then the evidence against $H_0$ will be strong if $\lambda$ is sufficiently small, or equivalently, if $-2 \log \lambda$ is sufficiently large, where

$$-2 \log \lambda = 2\{\log L(\hat{\Psi}_1) - \log L(\hat{\Psi}_0)\}.$$


67/105

Bootstrapping the LRTS

McLachlan (1987) proposed a resampling approach to the assessment of the P-value of the LRTS in testing

$$H_0: g = g_0 \quad \text{v} \quad H_1: g = g_1$$

for a specified value of $g_0$.
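The final step of such a resampling scheme can be sketched as follows. The expensive part, which is not shown, is generating bootstrap samples from the mixture fitted under the null hypothesis and refitting both models to each sample to obtain replicate values of the LRTS. The (count + 1) / (B + 1) convention for the P-value and all numbers below are assumptions for illustration.

```python
def lrts(loglik_null, loglik_alt):
    """Likelihood ratio test statistic -2 log(lambda)."""
    return 2.0 * (loglik_alt - loglik_null)

def bootstrap_p_value(observed, replicates):
    """P-value of the observed statistic relative to bootstrap replicates.

    Uses the common (count + 1) / (B + 1) convention, where count is the
    number of replicates at least as large as the observed value.
    """
    count = sum(1 for r in replicates if r >= observed)
    return (count + 1) / (len(replicates) + 1)

# Made-up maximized log likelihoods under H0 and H1
observed = lrts(-120.0, -112.5)

# Made-up replicate values of the LRTS from B = 9 bootstrap samples
replicates = [3.1, 7.4, 16.2, 5.0, 9.9, 2.2, 11.3, 4.8, 6.7]
print(observed, bootstrap_p_value(observed, replicates))
```

A realistic application would use many more than nine replicates; the small number here only keeps the arithmetic easy to follow.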


68/105

Bayesian Information Criterion

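The Bayesian information criterion (BIC) of Schwarz (1978) is given by

$$-2 \log L(\hat{\Psi}) + d \log n,$$

where $d$ is the number of free parameters and $n$ is the sample size. Written in this form (minus twice the penalized log likelihood), the criterion is to be minimized in model selection, including the present situation for the number of components $g$ in a mixture model.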
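A small sketch of how this criterion could be used to choose g. It assumes the form -2 log L + d log n, for which smaller values are preferred, and counts parameters for a g-component normal mixture in p dimensions with unrestricted component covariance matrices. The function names and the log likelihood values are made up for illustration.

```python
import math

def n_free_params(g, p):
    """Free parameters: (g - 1) proportions, g*p means, g*p*(p+1)/2 covariances."""
    return (g - 1) + g * p + g * p * (p + 1) // 2

def bic(loglik, d, n):
    """BIC in the form -2 log L + d log n (smaller is better)."""
    return -2.0 * loglik + d * math.log(n)

def select_g(logliks, p, n):
    """Return the number of components g with the smallest BIC.

    logliks maps each candidate g to its maximized log likelihood.
    """
    return min(logliks, key=lambda g: bic(logliks[g], n_free_params(g, p), n))

# Made-up maximized log likelihoods for g = 1, 2, 3 with p = 1 and n = 100
logliks = {1: -250.0, 2: -210.0, 3: -208.0}
for g in sorted(logliks):
    print(g, bic(logliks[g], n_free_params(g, 1), 100))
print(select_g(logliks, p=1, n=100))
```

In this made-up example the small gain in log likelihood from g = 2 to g = 3 does not offset the penalty for the three extra parameters.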


69/105

Gap statistic (Tibshirani et al., 2001)

Clest (Dudoit and Fridlyand, 2002)


70/105

PROVIDES A MODEL-BASED APPROACH TO CLUSTERING

McLachlan, Bean, and Peel, 2002, A Mixture Model-Based Approach to the Clustering of Microarray Expression Data, Bioinformatics 18, 413-422

http://www.bioinformatics.oupjournals.org/cgi/screenpdf/18/3/413.pdf


71/105


72/105

Example: Microarray Data

Colon Data of Alon et al. (1999)

M = 62 (40 tumours; 22 normals) tissue samples of N = 2,000 genes in a 2,000 x 62 matrix.


73/105


74/105


75/105

Mixture of 2 normal components


76/105

Mixture of 2 t components



77/105

The t distribution does not have substantially better breakdown behavior than the normal (Tyler, 1994).

The advantage of the t mixture model is that, although the number of outliers needed for breakdown is almost the same as with the normal mixture model, the outliers have to be much larger.

This point is made more precise in Hennig (2002), who has provided an excellent account of breakdown points for ML estimation of location-scale mixtures with a fixed number of components g.

Of course, as explained in Hennig (2002), mixture models can be made more robust by allowing the number of components g to grow with the number of outliers.


78/105

For Normal mixtures, breakdown begins with an additional point at about 15.2. For a mixture of $t_3$-distributions, the outlier must lie at about 800, $t_1$-mixtures need the outlier at about $3.8 \times 10^7$, and a Normal mixture with additional noise component breaks down with an additional point at $3.5 \times 10^7$.


79/105


80/105


81/105

Clustering of COLON Data

Genes using EMMIX-GENE

Grouping for Colon Data

[Figure: panels for groups 1 to 5]


82/105

[Figure: panels for groups 6 to 20]


83/105


84/105


85/105

Clustering of COLON Data

Tissues using EMMIX-GENE

Grouping for Colon Data

[Figure: panels for groups 1 to 5]


86/105

[Figure: panels for groups 6 to 20]


87/105

Heat Map Displaying the Reduced Set of 4,869 Genes on the 98 Breast Cancer Tumours


88/105

Heat Map of Top 1867 Genes


89/105


90/105


91/105


92/105

[Figure: panels for groups 1 to 20]


93/105

[Figure: panels for groups 21 to 40]


94/105

where $i$ = group number, $m_i$ = number in group $i$, $U_i = -2 \log \lambda_i$

 i   m_i   U_i    |  i   m_i   U_i   |  i   m_i   U_i   |  i   m_i   U_i
 1   146  112.98  | 11    66  25.72  | 21    44  13.77  | 31    53   9.84
 2    93   74.95  | 12    38  25.45  | 22    30  13.28  | 32    36   8.95
 3    61   46.08  | 13    28  25.00  | 23    25  13.10  | 33    36   8.89
 4    55   35.20  | 14    53  21.33  | 24    67  13.01  | 34    38   8.86
 5    43   30.40  | 15    47  18.14  | 25    12  12.04  | 35    44   8.02
 6    92   29.29  | 16    23  18.00  | 26    58  12.03  | 36    56   7.43
 7    71   28.77  | 17    27  17.62  | 27    27  11.74  | 37    46   7.21
 8    20   28.76  | 18    45  17.51  | 28    64  11.61  | 38    19   6.14
 9    23   28.44  | 19    80  17.28  | 29    38  11.38  | 39    29   4.64
10    23   27.73  | 20    55  13.79  | 30    21  10.72  | 40    35   2.44


95/105

Heat Map of Genes in Group G1


96/105



97/105


Clustering of gene expression profiles


98/105

Longitudinal (with or without replication, for example time-course)

Cross-sectional data


A Mixture Model with Random-Effects Components for Clustering Correlated Gene-Expression Profiles. S.K. Ng, G. J. McLachlan, K. Wang, L. Ben-Tovim Jones, S-W. Ng.

EMMIX-WIRE

EM-based MIXture analysis With Random Effects

Clustering of Correlated Gene Profiles


99/105

Clustering of Correlated Gene Profiles

$$y_j = X\beta_h + U b_{hj} + V c_h + \varepsilon_{hj}$$


100/105

Longitudinal (with or without replication, for example time course)

Cross-section data

Clustering of gene expression profiles


101/105

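$$\tau_h(y_j, c; \Psi) = \mathrm{pr}\{Z_{hj} = 1 \mid y_j, c\} = \frac{\pi_h f(y_j \mid z_{hj} = 1, c_h; \psi_h)}{\sum_{i=1}^{g} \pi_i f(y_j \mid z_{ij} = 1, c_i; \psi_i)}$$

where the component density $f(y_j \mid z_{hj} = 1, c_h; \psi_h)$ is $N(\mu_h, B_h)$, with

$$\mu_h = X\beta_h + V c_h, \qquad B_h = A_h + \theta_{bh} U U^T.$$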

Yeast Cell Cycle


102/105

X is an 18 x 2 matrix with the $(l+1)$th row ($l = 0, \ldots, 17$)

$$\big(\cos(2\pi(7l)/\omega + \Phi),\ \sin(2\pi(7l)/\omega + \Phi)\big)$$

Yeast data is from Spellman (1998); the 18 rows represent the 18 time points of the $\alpha$-factor (pheromone) synchronization, where the yeast cells were sampled at 7 minute intervals for 119 minutes. $\omega$ is the period of the cell cycle and $\Phi$ is the phase offset, estimated using least squares to be $\omega = 53$ and $\Phi = 0$.
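The design matrix described here can be built directly. This is a sketch that assumes each row has the form (cos(2*pi*t/period + phase), sin(2*pi*t/period + phase)) with sampling times t = 0, 7, ..., 119 minutes; the function and argument names are illustrative.

```python
import math

def design_matrix(period=53.0, phase=0.0, n_times=18, interval=7.0):
    """Rows (cos, sin) of the periodic regression terms at each sampling time."""
    rows = []
    for l in range(n_times):
        angle = 2.0 * math.pi * interval * l / period + phase
        rows.append((math.cos(angle), math.sin(angle)))
    return rows

X = design_matrix()
for l, (c, s) in enumerate(X):
    print(7 * l, round(c, 3), round(s, 3))
```

With a period of 53 minutes, the 119 minutes of sampling cover a little more than two cell cycles.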

Clustering Results for Spellman Yeast Cell Cycle Data


103/105


104/105

Plots of First versus Second Principal Components

(a) Our clustering (b) Muro clustering



105/105

A Mixture Model with Random-Effects Components for Clustering Correlated Gene-Expression Profiles.

S.K. Ng, G. J. McLachlan, K. Wang, L. Ben-Tovim Jones, S-W. Ng.