8/8/2019 Symmetrical based projects
EXPERT SYSTEMS AND SOLUTIONS
Email: [email protected]@yahoo.com
Cell: 9952749533
www.researchprojects.info
PAIYANOOR, OMR, CHENNAI
Call for research projects: final-year students of B.E. in EEE, ECE, EI; M.E. (Power Systems); M.E. (Applied Electronics); M.E. (Power Electronics); Ph.D. Electrical and Electronics.
Students can assemble their hardware in our research labs. Experts will guide the projects.
MICROARRAY DATA
Represented by an N × M matrix $(y_1, \ldots, y_M)$, where $y_j$ contains the gene expressions for the N genes of the jth tissue sample (j = 1, ..., M).
N = No. of genes ($10^3$ - $10^4$)
M = No. of tissue samples ($10$ - $10^2$)
STANDARD STATISTICAL METHODOLOGY APPROPRIATE FOR M >> N
HERE N >> M
[Figure: Microarray data represented as an N × M matrix. N rows (genes: Gene 1, Gene 2, ..., Gene N), N ~ $10^4$; M columns (samples: Sample 1, Sample 2, ..., Sample M), M ~ $10^2$. Labels: Expression Profile, Expression Signature.]
Two Clustering Problems:
Clustering of genes on basis of tissues: genes not independent
Clustering of tissues on basis of genes: latter is a nonstandard problem in cluster analysis (n
The notion of a cluster is not easy to define.
There is a very large literature devoted to clustering when there is a metric known in advance; e.g. k-means. Usually, there is no a priori metric (or equivalently a user-defined distance matrix) for a cluster analysis.
That is, the difficulty is that the shape of the clusters is not known until the clusters have been identified, and the clusters cannot be effectively identified unless the shapes are known.
In this case, one attractive feature of adopting mixture models with elliptically symmetric components, such as the normal or t densities, is that the implied clustering is invariant under affine transformations of the data (that is, under operations relating to changes in location, scale, and rotation of the data).
Thus the clustering process does not depend on irrelevant factors such as the units of measurement or the orientation of the clusters in space.
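The invariance can be illustrated with a minimal univariate sketch. It assumes a two-component normal mixture whose parameter values and data are hypothetical, and it only checks one consequence of the statement above: after a change of units y -> a*y + b, with the component parameters transformed in the same way, the posterior probabilities of component membership (and so the cluster labels) do not change.

```python
import math

def normal_pdf(y, mu, sigma):
    """Univariate normal density with mean mu and standard deviation sigma."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def posteriors(y, pis, mus, sigmas):
    """Posterior probabilities that y belongs to each mixture component."""
    weighted = [p * normal_pdf(y, m, s) for p, m, s in zip(pis, mus, sigmas)]
    total = sum(weighted)
    return [w / total for w in weighted]

def cluster(data, pis, mus, sigmas):
    """Assign each observation to the component with the largest posterior."""
    labels = []
    for y in data:
        post = posteriors(y, pis, mus, sigmas)
        labels.append(post.index(max(post)))
    return labels

# Hypothetical parameters and data, chosen only for illustration.
pis, mus, sigmas = [0.6, 0.4], [0.0, 3.0], [1.0, 1.5]
data = [-1.2, 0.4, 1.7, 2.9, 4.1]

# Affine change of units y -> a*y + b, applied to data and parameters alike.
a, b = 2.54, 10.0
data_t = [a * y + b for y in data]
mus_t = [a * m + b for m in mus]
sigmas_t = [abs(a) * s for s in sigmas]

labels = cluster(data, pis, mus, sigmas)
labels_t = cluster(data_t, pis, mus_t, sigmas_t)
print(labels == labels_t)
```

Each component density is divided by the same factor |a| under the transformation, so the factor cancels in the posterior ratio. A distance-based method with a fixed Euclidean metric would not, in general, share this property.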
Hierarchical clustering methods for the analysis of gene expression data caught on like the hula hoop. I, for one, will be glad to see them fade.
Gary Churchill (The Jackson Laboratory). Contribution to the discussion of the paper by Sebastiani, Gussoni, Kohane, and Ramoni. Statistical Science (2003) 18, 64-69.
Hierarchical (agglomerative) clustering algorithms are largely heuristically motivated and there exist a number of unresolved issues associated with their use, including how to determine the number of clusters.
(Yeung et al., 2001, Model-Based Clustering and Data Transformations for Gene Expression Data, Bioinformatics 17)
In the absence of a well-grounded statistical model, it seems difficult to define what is meant by a good clustering algorithm or the right number of clusters.
McLachlan and Khan (2004). On a resampling approach for tests on the number of clusters with mixture model-based clustering of the tissue samples.
Special issue of the Journal of Multivariate Analysis 90 (2004), edited by Mark van der Laan and Sandrine Dudoit (UC Berkeley).
Attention is now turning towards a model-based approach to the analysis of microarray data.
For example:
Broet, Richardson, and Radvanyi (2002). Bayesian hierarchical model for identifying changes in gene expression from microarray experiments. Journal of Computational Biology 9
Ghosh and Chinnaiyan (2002). Mixture modelling of gene expression data from microarray experiments. Bioinformatics 18
Liu, Zhang, Palumbo, and Lawrence (2003). Bayesian clustering with variable and transformation selection. In Bayesian Statistics 7
Pan, Lin, and Le (2002). Model-based cluster analysis of microarray gene expression data. Genome Biology 3
Yeung et al. (2001). Model-based clustering and data transformations for gene expression data. Bioinformatics 17
[Figure: example data vector y with components BP, Weight, and Height, and a transformed version with components BP, W-H, and W+H.]
p://www.maths.uq.edu.au/~gj
McLachlan and Peel (2000), Finite Mixture Models. Wiley.
Mixture Software: EMMIX
McLachlan, Peel, Adams, and Basford
http://www.maths.uq.edu.au/~gjm/emmix/emmix.html
EMMIX for UNIX
Basic Definition
We let $Y_1, \ldots, Y_n$ denote a random sample of size n, where $Y_j$ is a p-dimensional random vector with probability density function $f(y_j)$,
$$f(y_j) = \pi_1 f_1(y_j) + \cdots + \pi_g f_g(y_j),$$
where the $f_i(y_j)$ are densities and the $\pi_i$ are nonnegative quantities that sum to one.
Mixture distributions are applied to data with two main purposes in mind:
To provide an appealing semiparametric framework in which to model unknown distributional shapes, as an alternative to, say, the kernel density method.
To use the mixture model to provide a model-based clustering.
(In both situations, there is the question of how many components to include in the mixture.)
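Shapes of Some Univariate Normal Mixtures
Consider
$$f(y_j) = \pi_1 \phi(y_j; \mu_1, \sigma^2) + \pi_2 \phi(y_j; \mu_2, \sigma^2),$$
where
$$\phi(y_j; \mu, \sigma^2) = (2\pi\sigma^2)^{-1/2} \exp\{-\tfrac{1}{2}(y_j - \mu)^2 / \sigma^2\}$$
denotes the univariate normal density with mean $\mu$ and variance $\sigma^2$.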
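Figure 1: Plot of a mixture density of two univariate normal components in equal proportions with common variance $\sigma^2 = 1$. Panels: $\Delta = 1$, $\Delta = 2$, $\Delta = 3$, $\Delta = 4$, where $\Delta$ is the separation between the two component means.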
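The change in shape as the component means move apart can be checked numerically. The sketch below is illustrative only: it assumes a two-component normal mixture with common variance 1 and component means at 0 and `delta` (a hypothetical parameterisation of the separation), evaluates the density on a grid, and counts its strict local maxima.

```python
import math

def normal_pdf(y, mu, sigma2):
    """Univariate normal density with mean mu and variance sigma2."""
    return math.exp(-0.5 * (y - mu) ** 2 / sigma2) / math.sqrt(2.0 * math.pi * sigma2)

def mixture_pdf(y, pi1, mu1, mu2, sigma2=1.0):
    """Two-component normal mixture density with common variance sigma2."""
    return pi1 * normal_pdf(y, mu1, sigma2) + (1.0 - pi1) * normal_pdf(y, mu2, sigma2)

def count_modes(delta, pi1=0.5, step=0.01):
    """Count strict local maxima of the mixture density on a grid.

    The component means are placed at 0 and delta; the grid runs from
    5 units below the first mean to about 5 units above the second.
    """
    n_points = int((delta + 10.0) / step) + 1
    grid = [-5.0 + i * step for i in range(n_points)]
    dens = [mixture_pdf(y, pi1, 0.0, delta) for y in grid]
    return sum(
        1 for i in range(1, len(dens) - 1)
        if dens[i - 1] < dens[i] > dens[i + 1]
    )

for delta in (1.0, 3.0, 4.0):
    print(delta, count_modes(delta))
```

With equal proportions and a common variance, the mixture is bimodal only when the means are more than two standard deviations apart, so a separation of 2 is the boundary case and is left out of the loop because a grid search is unreliable exactly at the boundary.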
Figure 2: Plot of a mixture density of two univariate normal components in proportions 0.75 and 0.25 with common variance. Panels: $\Delta = 1$, $\Delta = 2$, $\Delta = 3$, $\Delta = 4$.
Normal Mixtures
Computationally convenient for multivariate data
Provide an arbitrarily accurate estimate of the underlying density with g sufficiently large
Provide a probabilistic clustering of the data into g clusters - outright clustering by assigning a data point to the component to which it has the greatest posterior probability of belonging
Synthetic Data Set 1
Synthetic Data Set 2
Parameter   True Values        Initial Values     Estimates by EM
π1          0.333              0.333              0.294
π2          0.333              0.333              0.337
π3          0.333              0.333              0.370
μ1          (0 2)^T            (-1 0)^T           (-0.154 1.961)^T
μ2          (0 0)^T            (0 0)^T            (0.360 0.115)^T
μ3          (0 2)^T            (1 0)^T            (-0.004 2.027)^T
Σ1          [2 0; 0 0.2]       [1 0; 0 1]         [1.961 0.016; 0.016 0.218]
Σ2          [2 0; 0 0.2]       [1 0; 0 1]         [2.346 0.553; 0.553 0.218]
Σ3          [2 0; 0 0.2]       [1 0; 0 1]         [2.339 0.042; 0.042 0.206]
Figure 7
Figure 8
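MIXTURE OF g NORMAL COMPONENTS
$$f(y) = \pi_1 \phi(y; \mu_1, \Sigma_1) + \cdots + \pi_g \phi(y; \mu_g, \Sigma_g),$$
where
$$-2 \log \phi(y; \mu, \Sigma) = (y - \mu)^T \Sigma^{-1} (y - \mu) + \text{constant}.$$
MAHALANOBIS DISTANCE
$$(y - \mu)^T \Sigma^{-1} (y - \mu)$$
EUCLIDEAN DISTANCE
$$(y - \mu)^T (y - \mu)$$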
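The difference between the two distances can be seen in a small numerical sketch. It assumes a two-dimensional component with a hypothetical covariance matrix that is elongated along the first axis; the function names and values are illustrative and are not taken from any particular software.

```python
def euclidean_sq(y, mu):
    """Squared Euclidean distance (y - mu)^T (y - mu)."""
    return sum((a - b) ** 2 for a, b in zip(y, mu))

def mahalanobis_sq(y, mu, cov):
    """Squared Mahalanobis distance (y - mu)^T cov^{-1} (y - mu), 2 x 2 case."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx, dy = y[0] - mu[0], y[1] - mu[1]
    return (dx * (inv[0][0] * dx + inv[0][1] * dy)
            + dy * (inv[1][0] * dx + inv[1][1] * dy))

mu = (0.0, 0.0)
cov = [[2.0, 0.0], [0.0, 0.2]]    # hypothetical elongated component
p1, p2 = (1.0, 0.0), (0.0, 1.0)   # both at Euclidean distance 1 from mu

print(euclidean_sq(p1, mu), euclidean_sq(p2, mu))
print(mahalanobis_sq(p1, mu, cov), mahalanobis_sq(p2, mu, cov))
```

The two points are equally far from the mean in the Euclidean sense, but the point lying across the short axis of the component is much farther in the Mahalanobis sense. With the identity matrix as covariance the two distances coincide.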
MIXTURE OF g NORMAL COMPONENTS
$$f(y) = \pi_1 \phi(y; \mu_1, \Sigma_1) + \cdots + \pi_g \phi(y; \mu_g, \Sigma_g)$$
SPHERICAL CLUSTERS
k-means: $\Sigma_1 = \cdots = \Sigma_g = \sigma^2 I$
Equal spherical covariance matrices
With a mixture model-based approach to clustering, an observation is assigned outright to the ith cluster if its density in the ith component of the mixture distribution (weighted by the prior probability of that component) is greater than in the other (g-1) components.
$$f(y) = \pi_1 \phi(y; \mu_1, \Sigma_1) + \cdots + \pi_i \phi(y; \mu_i, \Sigma_i) + \cdots + \pi_g \phi(y; \mu_g, \Sigma_g)$$
Figure 7: Contours of the fitted component densities on the 2nd & 3rd variates for the blue crab data set.
Estimation of Mixture Distributions
It was the publication of the seminal paper of Dempster, Laird, and Rubin (1977) on the EM algorithm that greatly stimulated interest in the use of finite mixture distributions to model heterogeneous data.
McLachlan and Krishnan (1997, Wiley)
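For readers who have not seen the EM algorithm applied to a mixture, the following is a minimal sketch for a univariate normal mixture. It is not the EMMIX implementation: the function name, the simulated data, the starting values, and the fixed number of iterations are all assumptions made for illustration.

```python
import math
import random

def normal_pdf(y, mu, var):
    """Univariate normal density with mean mu and variance var."""
    return math.exp(-0.5 * (y - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_normal_mixture(data, pis, mus, variances, n_iter=100):
    """Run EM for a g-component univariate normal mixture.

    Returns the final proportions, means, variances, and the
    log-likelihood evaluated at the start of each iteration.
    """
    pis, mus, variances = list(pis), list(mus), list(variances)
    g, n = len(pis), len(data)
    trace = []
    for _ in range(n_iter):
        # E-step: posterior probabilities of component membership.
        tau, loglik = [], 0.0
        for y in data:
            w = [pis[i] * normal_pdf(y, mus[i], variances[i]) for i in range(g)]
            total = sum(w)
            loglik += math.log(total)
            tau.append([wi / total for wi in w])
        trace.append(loglik)
        # M-step: posterior-weighted proportions, means, and variances.
        for i in range(g):
            n_i = sum(t[i] for t in tau)
            pis[i] = n_i / n
            mus[i] = sum(t[i] * y for t, y in zip(tau, data)) / n_i
            variances[i] = sum(t[i] * (y - mus[i]) ** 2 for t, y in zip(tau, data)) / n_i
    return pis, mus, variances, trace

# Simulated data from two normal components (hypothetical example).
random.seed(0)
data = ([random.gauss(0.0, 1.0) for _ in range(200)]
        + [random.gauss(4.0, 1.0) for _ in range(200)])
pis, mus, variances, trace = em_normal_mixture(
    data, [0.5, 0.5], [-1.0, 5.0], [1.0, 1.0])
print(pis, mus, variances)
```

The log-likelihood trace should never decrease from one iteration to the next, which is a convenient check on any EM implementation. A practical implementation would also use several starting values and a convergence criterion rather than a fixed iteration count.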
If need be, the normal mixture model can be made less sensitive to outlying observations by using t component densities.
With this t mixture model-based approach, the normal distribution for each component in the mixture is embedded in a wider class of elliptically symmetric distributions with an additional parameter called the degrees of freedom.
The advantage of the t mixture model is that, although the number of outliers needed for breakdown is almost the same as with the normal mixture model, the outliers have to be much larger.
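One way to see why t components are less sensitive to outliers is through the weights used when a t distribution is fitted by EM. The sketch below assumes the standard weight u = (ν + p) / (ν + δ²), where δ² is the squared Mahalanobis distance of an observation from the component mean, ν is the degrees of freedom, and p is the dimension. The single-component location estimate, the fixed scale, and the data are hypothetical simplifications.

```python
def t_weight(delta_sq, nu, p=1):
    """EM weight for a t component: small for observations far from the mean."""
    return (nu + p) / (nu + delta_sq)

def t_location(data, nu=3.0, scale_sq=1.0, n_iter=50):
    """Location estimate under a univariate t model with known scale and nu.

    Each iteration recomputes the weights and then takes a weighted mean.
    """
    mu = sum(data) / len(data)
    for _ in range(n_iter):
        weights = [t_weight((y - mu) ** 2 / scale_sq, nu) for y in data]
        mu = sum(w * y for w, y in zip(weights, data)) / sum(weights)
    return mu

data = [0.1, -0.2, 0.3, 0.0, 50.0]   # four typical points and one gross outlier
plain_mean = sum(data) / len(data)   # pulled strongly towards the outlier
robust = t_location(data)            # stays close to the bulk of the data
print(plain_mean, robust)
```

As ν grows, the weight tends to 1 for every observation and the estimate reduces to the ordinary mean used for a normal component.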
In exploring high-dimensional data sets for group structure, it is typical to rely on principal component analysis.
Two Groups in Two Dimensions. All cluster information would be lost by collapsing to the first principal component. The principal ellipses of the two groups are shown as solid curves.
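The situation in the figure can be reproduced with a small constructed example. The data below are hypothetical: two groups that are widely spread along the first coordinate and separated only along the second. The first principal component of the pooled data then follows the direction of spread, and projecting onto it removes the separation between the groups.

```python
import math

xs = (-10.0, -5.0, 0.0, 5.0, 10.0)
group_a = [(x, 1.0) for x in xs]    # hypothetical group A
group_b = [(x, -1.0) for x in xs]   # hypothetical group B
pooled = group_a + group_b

def first_pc(points):
    """Direction of the first principal component of 2-D points, and their mean."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Angle of the leading eigenvector of a symmetric 2 x 2 matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (math.cos(theta), math.sin(theta)), (mx, my)

def mean_score(points, direction, centre):
    """Mean of the projections of the points onto a direction."""
    scores = [(p[0] - centre[0]) * direction[0] + (p[1] - centre[1]) * direction[1]
              for p in points]
    return sum(scores) / len(scores)

pc1, centre = first_pc(pooled)
pc2 = (-pc1[1], pc1[0])   # orthogonal direction, the second principal component

print(mean_score(group_a, pc1, centre), mean_score(group_b, pc1, centre))
print(mean_score(group_a, pc2, centre), mean_score(group_b, pc2, centre))
```

On the first component the two groups have identical scores, so no clustering method applied to that component alone could recover them; the separation appears only on the second component.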
Mixtures of Factor Analyzers
A normal mixture model without restrictions on the component-covariance matrices may be viewed as too general for many situations in practice, in particular with high-dimensional data.
One approach for reducing the number of parameters is to work in a lower-dimensional space by using principal components; another is to use mixtures of factor analyzers.
Mixtures of Factor Analyzers
Principal components or a single-factor analysis model provides only a global linear model.
A global nonlinear approach by postulating a mixture of linear submodels
$$f(y_j; \Psi) = \sum_{i=1}^{g} \pi_i \phi(y_j; \mu_i, \Sigma_i),$$
where
$$\Sigma_i = B_i B_i^T + D_i \quad (i = 1, \ldots, g),$$
$B_i$ is a p x q matrix and $D_i$ is a diagonal matrix.
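The saving in parameters from the factor-analytic form can be made concrete. The sketch below counts the free parameters in one component-covariance matrix, assuming the usual count for a factor model, pq + p - q(q-1)/2, which allows for the rotational indeterminacy of the loadings. The values of p and q are hypothetical.

```python
def full_cov_params(p):
    """Free parameters in an unrestricted p x p covariance matrix."""
    return p * (p + 1) // 2

def factor_cov_params(p, q):
    """Free parameters in B B^T + D with B of size p x q and D diagonal."""
    return p * q + p - q * (q - 1) // 2

p, q = 50, 2   # hypothetical dimension and number of factors
print(full_cov_params(p), factor_cov_params(p, q))
```

For p = 50 and q = 2 the unrestricted matrix has 1275 free parameters and the factor-analytic form has 149, per component. The gap widens quickly as p grows, which is what makes the approach attractive for high-dimensional data.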
Single-Factor Analysis Model
$$Y_j = \mu + B U_j + e_j \quad (j = 1, \ldots, n),$$
where $U_j$ is a q-dimensional (q < p) vector of latent or unobservable variables called factors and B is a p x q matrix of factor loadings.
$H_0: g = g_0$ versus $H_1: g = g_1$
We let $\hat{\Psi}_i$ denote the MLE of $\Psi$ calculated under $H_i$ (i = 0, 1). Then the evidence against $H_0$ will be strong if $\lambda$ is sufficiently small, or equivalently, if $-2 \log \lambda$ is sufficiently large, where
$$-2 \log \lambda = 2\{\log L(\hat{\Psi}_1) - \log L(\hat{\Psi}_0)\}.$$
Bootstrapping the LRTS
McLachlan (1987) proposed a resampling approach to the assessment of the P-value of the LRTS in testing
$$H_0: g = g_0 \quad \text{v} \quad H_1: g = g_1$$
for a specified value of $g_0$.
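The final step of such a resampling scheme can be sketched as follows. The sketch assumes that the bootstrap replicates of -2 log λ have already been obtained by simulating samples from the mixture fitted under H0 and refitting under both hypotheses. The log-likelihood values, the replicate values, and the (count + 1)/(B + 1) convention are illustrative assumptions, not necessarily the exact procedure of McLachlan (1987).

```python
def lrts(loglik_h0, loglik_h1):
    """Likelihood ratio test statistic -2 log(lambda)."""
    return 2.0 * (loglik_h1 - loglik_h0)

def bootstrap_p_value(observed, replicates):
    """Approximate P-value: share of bootstrap statistics at least as large
    as the observed one, using the (count + 1) / (B + 1) convention."""
    exceed = sum(1 for r in replicates if r >= observed)
    return (exceed + 1) / (len(replicates) + 1)

# Hypothetical maximized log-likelihoods under H0 and H1.
observed = lrts(-520.3, -512.1)

# Hypothetical bootstrap replicates of the statistic (B = 19).
replicates = [1.2, 3.4, 0.8, 5.9, 2.2, 7.1, 4.4, 6.3, 2.9, 3.8,
              1.7, 9.5, 5.1, 0.4, 8.2, 3.3, 6.8, 4.9, 2.6]

print(observed, bootstrap_p_value(observed, replicates))
```

Resampling is used here because the usual chi-squared reference distribution for -2 log λ does not hold when testing the number of components in a mixture.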
Bayesian Information Criterion
The Bayesian information criterion (BIC) of Schwarz (1978) is given by
$$-2 \log L(\hat{\Psi}) + d \log n,$$
where d is the number of free parameters and n is the sample size, as the penalized log likelihood to be minimized in model selection, including the present situation for the number of components g in a mixture model.
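A short sketch of how the criterion might be used to choose g. It takes BIC in the form -2 log L + d log n, so that smaller values are preferred. The log-likelihood values, the sample size, and the parameter count (for a univariate normal mixture with unequal variances) are hypothetical.

```python
import math

def bic(loglik, d, n):
    """BIC in the form -2 log L + d log n (smaller is better)."""
    return -2.0 * loglik + d * math.log(n)

def n_params(g):
    """Free parameters in a univariate g-component normal mixture with
    unequal variances: g - 1 proportions, g means, g variances."""
    return 3 * g - 1

# Hypothetical maximized log-likelihoods for g = 1, ..., 4 with n = 150.
fits = {1: -310.2, 2: -281.5, 3: -279.9, 4: -279.1}
n = 150

scores = {g: bic(loglik, n_params(g), n) for g, loglik in fits.items()}
best_g = min(scores, key=scores.get)
print(scores, best_g)
```

In this made-up example the log-likelihood keeps increasing with g, as it always does, but the penalty term outweighs the small gains beyond two components.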
Gap statistic (Tibshirani et al., 2001)
Clest (Dudoit and Fridlyand, 2002)
PROVIDES A MODEL-BASED APPROACH TO CLUSTERING
McLachlan, Bean, and Peel, 2002, A Mixture Model-Based Approach to the Clustering of Microarray Expression Data, Bioinformatics 18, 413-422
http://www.bioinformatics.oupjournals.org/cgi/screenpdf/18/3/413.pdf
Example: Microarray Data
Colon Data of Alon et al. (1999)
M = 62 (40 tumours; 22 normals) tissue samples of N = 2,000 genes in a 2,000 × 62 matrix.
Mixture of 2 normal components
Mixture of 2 t components
The t distribution does not have substantially better breakdown
behavior than the normal (Tyler, 1994).
The advantage of the t mixture model is that, although the number of
outliers needed for breakdown is almost the same as with the normal
mixture model, the outliers have to be much larger.
This point is made more precise in Hennig (2002) who has provided an
excellent account of breakdown points for ML estimation of location-scale
mixtures with a fixed number of components g.
Of course as explained in Hennig (2002), mixture models can be made
more robust by allowing the number of components g to grow with the
number of outliers.
For Normal mixtures breakdown begins with an additional point at about 15.2. For a mixture of $t_3$-distributions, the outlier must lie at about 800, $t_1$-mixtures need the outlier at about $3.8 \times 10^7$, and a Normal mixture with additional noise component breaks down with an additional point at $3.5 \times 10^7$.
Clustering of COLON Data
Genes using EMMIX-GENE
[Figure: Grouping for Colon Data; panels for groups 1 to 20.]
Clustering of COLON Data
Tissues using EMMIX-GENE
[Figure: Grouping for Colon Data; panels for groups 1 to 20.]
Heat Map Displaying the Reduced Set of 4,869 Genes
on the 98 Breast Cancer Tumours
Heat Map of Top 1867 Genes
[Figure: panels for gene groups 1 to 40.]
 i   m_i     U_i      i   m_i     U_i      i   m_i     U_i      i   m_i     U_i
 1   146  112.98     11    66   25.72     21    44   13.77     31    53    9.84
 2    93   74.95     12    38   25.45     22    30   13.28     32    36    8.95
 3    61   46.08     13    28   25.00     23    25   13.10     33    36    8.89
 4    55   35.20     14    53   21.33     24    67   13.01     34    38    8.86
 5    43   30.40     15    47   18.14     25    12   12.04     35    44    8.02
 6    92   29.29     16    23   18.00     26    58   12.03     36    56    7.43
 7    71   28.77     17    27   17.62     27    27   11.74     37    46    7.21
 8    20   28.76     18    45   17.51     28    64   11.61     38    19    6.14
 9    23   28.44     19    80   17.28     29    38   11.38     39    29    4.64
10    23   27.73     20    55   13.79     30    21   10.72     40    35    2.44
where i = group number, $m_i$ = number in group i, $U_i = -2 \log \lambda_i$
Heat Map of Genes in Group G1
Heat Map of Genes in Group G2
Heat Map of Genes in Group G3
Clustering of gene expression profiles
Longitudinal (with or without replication, for example time-course)
Cross-sectional data
A Mixture Model with Random-Effects Components for Clustering Correlated Gene-Expression Profiles. S.K. Ng, G. J. McLachlan, K. Wang, L. Ben-Tovim Jones, S-W. Ng.
EMMIX-WIRE
EM-based MIXture analysis With Random Effects
Clustering of Correlated Gene Profiles
Clustering of Correlated Gene Profiles
$$y_j = X\beta_h + U b_{hj} + V c_h + \varepsilon_{hj}$$
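$$\tau(y_j, c; \Psi) = \mathrm{pr}\{Z_{hj} = 1 \mid y_j, c\} = \frac{\pi_h f(y_j \mid z_{hj} = 1, c_h; \psi_h)}{\sum_{i=1}^{g} \pi_i f(y_j \mid z_{ij} = 1, c_i; \psi_i)},$$
where the component density is $N(\mu_h, B_h)$, with
$$\mu_h = X\beta_h + V c_h, \qquad B_h = A_h + \theta_{bh} U U^T.$$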
Yeast Cell Cycle
X is an 18 x 2 matrix with the (l+1)th row (l = 0, ..., 17)
$$\big(\cos(2\pi(7l)/\omega + \Phi) \quad \sin(2\pi(7l)/\omega + \Phi)\big)$$
Yeast data is from Spellman (1998); the 18 rows represent the 18 time points of the α-factor (pheromone) synchronization, where the yeast cells were sampled at 7 minute intervals for 119 minutes. $\omega$ is the period of the cell cycle and $\Phi$ is the phase offset, estimated using least squares to be $\omega = 53$ and $\Phi = 0$.
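The design matrix described above is straightforward to construct. The sketch below uses the stated sampling scheme (18 time points at 7 minute intervals) and the stated estimates of the period (53) and phase offset (0); the function and argument names are hypothetical.

```python
import math

def design_matrix(n_times=18, interval=7.0, omega=53.0, phi=0.0):
    """Rows (cos(2*pi*(7*l)/omega + phi), sin(2*pi*(7*l)/omega + phi)), l = 0..17."""
    rows = []
    for l in range(n_times):
        angle = 2.0 * math.pi * interval * l / omega + phi
        rows.append((math.cos(angle), math.sin(angle)))
    return rows

X = design_matrix()
for row in X[:3]:
    print(row)
```

Each row lies on the unit circle, and the last row corresponds to the sample taken at 7 × 17 = 119 minutes, a little over two cell-cycle periods after the first.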
Clustering Results for Spellman Yeast Cell Cycle Data
Plots of First versus Second Principal Components
(a) Our clustering (b) Muro clustering
A Mixture Model with Random-Effects Components for Clustering Correlated Gene-Expression Profiles.
S.K. Ng, G. J. McLachlan, K. Wang, L. Ben-Tovim Jones, S-W. Ng.