MINING THE GENE EXPRESSION MATRIX: INFERRING GENE RELATIONSHIPS FROM LARGE SCALE GENE EXPRESSION DATA

Patrik D'haeseleer, Xiling Wen, Stefanie Fuhrman, and Roland Somogyi

Information Processing in Cells and Tissues, pp. 203-212, 1998

Presented by Bin He


Motivations

It is necessary to determine large-scale temporal gene expression patterns.

To decipher the logic of gene regulation, we should aim to monitor the expression levels of all genes simultaneously.

Gene time series

Assay the expression levels of large numbers of genes in a tissue at different time points.

The relative amounts of mRNA produced at these time points provide a gene expression time series for each gene.

Gene Expression Matrix

Wen, X., Fuhrman, S., Michaels, G.S., Carr, D.B., Smith, S., Barker, J.L., and Somogyi, R., 1997, Large-scale temporal gene expression mapping of CNS development, Proc. Natl. Acad. Sci., in press

Previous Approach

Euclidean distance and information theoretic measures were used to cluster the genes into related expression time series.

A significant problem with this approach is the variety of measures that can be used.

Each measure produces its own, different clustering of gene expression patterns.

Contributions

Determining significant relationships between individual genes, based on:

  linear correlation

  rank correlation

  information theory

Linear correlation ------ positive correlation

[Figure: positive linear correlation]

Linear correlation ------ negative correlation

[Figure: negative linear correlation]

Linear correlation ------ restriction

For 112 different genes, 112 x 111 / 2 = 6216 pairs of expression time series need to be examined.

To restrict the number of relationships, we might want to test which correlations are significantly larger than a certain value.
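
The pair count and the pairwise correlations can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names `pearson` and `all_pair_correlations` and the three-gene toy data are assumptions made for the example, and a constant series (zero variance) would need separate handling.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson linear correlation coefficient r of two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def all_pair_correlations(series):
    """Return {(gene_i, gene_j): r} for every unordered pair of genes."""
    return {
        (g1, g2): pearson(series[g1], series[g2])
        for g1, g2 in combinations(sorted(series), 2)
    }


# Number of unordered pairs among 112 genes, i.e. 112 x 111 / 2.
n_pairs_112 = sum(1 for _ in combinations(range(112), 2))

# Hypothetical toy data: three genes measured at five time points.
toy = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.0, 4.0, 6.0, 8.0, 10.0],  # rises with geneA
    "geneC": [5.0, 4.0, 3.0, 2.0, 1.0],   # falls as geneA rises
}
toy_r = all_pair_correlations(toy)
```

With real data, `series` would hold one entry per gene, so the dictionary returned for 112 genes has 6216 entries.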

Linear correlation ------ restriction

For instance, to find those relationships in which at least 50% of the variance is explained by the correlation, i.e. rho^2 > 0.5, we need |r| > 0.96 to reject at the 1% significance level the null hypothesis that |rho| < 0.7071.
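
One way to see where a threshold of this size can come from is Fisher's z-transformation. The sketch below is an assumption, since the slides do not say which test was used: it takes a two-sided normal critical value, a standard error of 1/sqrt(n - 3), and n = 9 time points (the number of columns in the mapping table later in the deck). The function name `min_sample_r` is made up for the example.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist


def min_sample_r(rho0, n, alpha=0.01):
    """Smallest sample |r| that rejects H0: |rho| <= rho0.

    ASSUMPTION: Fisher z approximation, atanh(r) ~ Normal with standard
    error 1 / sqrt(n - 3), and a two-sided critical value at level alpha.
    """
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return tanh(atanh(rho0) + z_crit / sqrt(n - 3))


# rho^2 > 0.5 corresponds to |rho| > sqrt(0.5) = 0.7071.
# ASSUMPTION: n = 9 time points per expression series.
threshold = min_sample_r(rho0=sqrt(0.5), n=9)
print(round(threshold, 2))
```

Under these assumptions the threshold comes out close to the 0.96 quoted on the slide. With so few time points, only very strong sample correlations are significant.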

Linear correlation ------ visualization

Residual-variance-based distance measure: d = 1 - r^2

d = 0 if perfectly correlated, d = 1 if uncorrelated

Multidimensional scaling maps the time series into a two-dimensional plane
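
The distance itself is a one-line formula. The sketch below converts a correlation matrix into a d = 1 - r^2 matrix; the three-gene correlation matrix is hypothetical, and the multidimensional scaling step is not shown.

```python
def residual_variance_distance(r):
    """Distance d = 1 - r^2 for a correlation coefficient r in [-1, 1]."""
    return 1.0 - r * r


def distance_matrix(corr):
    """Convert a square correlation matrix (list of lists) to d = 1 - r^2."""
    return [[residual_variance_distance(r) for r in row] for row in corr]


# Hypothetical correlation matrix: genes 0 and 1 are perfectly
# anti-correlated, gene 2 is uncorrelated with both.
corr = [
    [1.0, -1.0, 0.0],
    [-1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
dist = distance_matrix(corr)
```

Because r is squared, a perfect negative correlation also gives d = 0, so anti-correlated genes end up close together once the matrix is passed to a scaling routine.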

Linear correlation ------ visualization

[Figure: multidimensional scaling of 34 time series with high correlation]

Nonlinear correlation ------ Model

Spearman rank correlation, r_s

A measure of monotonic relationships

Can be used for non-Gaussian distributions

491 pairs of expression time series, involving 98 genes, have a significant r_s, ranging from -0.979 to 0.996
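
Spearman's r_s is the Pearson correlation of the ranks of the two series. The sketch below shows this with average ranks for ties; the helper names and the exponential toy series are assumptions chosen to illustrate a monotonic but non-linear relationship.

```python
from math import exp, sqrt


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson linear correlation coefficient r."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical series: y grows exponentially with x (monotonic, not linear).
x = [float(t) for t in range(1, 10)]
y = [exp(t) for t in x]
r_lin = pearson(x, y)
r_s = spearman(x, y)
```

Here `r_s` is 1 because the ordering of the two series is identical, while `r_lin` is noticeably lower because the relationship is not a straight line.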

Nonlinear correlation ------ Example

High rank correlation but low linear correlation between mGluR1 and GRa2

Information Theory ------ mutual information

If H(A) and H(B) are the entropies of sources A and B respectively, and H(A,B) is the joint entropy of the sources, then M(A,B) = H(A) + H(B) - H(A,B)

The discrete form is much easier to use

We need to discretize the time series by partitioning the expression levels into bins
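
For series that are already discretized, the formula above can be evaluated directly from symbol frequencies. This is a minimal sketch: the function names and the two four-point toy series are assumptions, and the entropies are plug-in estimates in bits.

```python
from collections import Counter
from math import log2


def entropy(symbols):
    """Shannon entropy in bits, estimated from symbol frequencies."""
    n = len(symbols)
    return -sum((c / n) * log2(c / n) for c in Counter(symbols).values())


def mutual_information(a, b):
    """M(A,B) = H(A) + H(B) - H(A,B) for two aligned discrete series."""
    joint = list(zip(a, b))  # each time point becomes one joint symbol
    return entropy(a) + entropy(b) - entropy(joint)


# Hypothetical binned series: every combination of a and b occurs equally often.
a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
m_independent = mutual_information(a, b)
m_self = mutual_information(a, a)
```

In this toy case `m_independent` is 0 because knowing `a` says nothing about `b`, while `m_self` equals H(A), the most a series can share with itself.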

Information Theory ------ Bin size

The fewer bins we use to discretize the data, the more information about the original time series we ignore.

On the other hand, too fine a binning will leave us with too few points per bin to get a reasonable estimate of the frequency of each bin.
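
The trade-off can be seen by binning one series at two resolutions. The sketch assumes equal-width bins between the minimum and maximum of each series, which the slides do not specify; the function name `discretize` and the expression levels are hypothetical.

```python
from collections import Counter


def discretize(series, n_bins=3):
    """Map each value to a bin index 0 .. n_bins - 1.

    ASSUMPTION: equal-width bins spanning the series minimum to maximum.
    """
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0] * len(series)  # constant series: everything in one bin
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in series]


# Hypothetical expression levels at nine time points.
levels = [0.0, 0.1, 0.2, 0.5, 0.6, 0.7, 0.9, 1.0, 0.95]
coarse = discretize(levels, n_bins=3)
fine = discretize(levels, n_bins=9)

# Number of time points that land in each bin.
coarse_counts = Counter(coarse)
fine_counts = Counter(fine)
```

With three bins each bin holds several of the nine points but nearby levels become indistinguishable. With nine bins most bins hold one point or none, so bin frequencies are poorly estimated.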

Information Theory ------ Mapping

Some time series map to the same discretized series

In total, from 112 unique continuous-valued time series we get 91 discretized time series

Information Theory ------ Mapping

E11  E13  E15  E18  E21  P0   P7   P14  A    genes
0    0    2    2    2    2    2    2    2    MAP2, pre-GAD67, GAT1
0    0    0    0    0    0    0    1    2    NFM, mGluR1, NMDA2A
0    0    0    1    1    1    1    2    2    S100 beta, GRg1
0    0    0    2    2    2    2    1    1    GAD67, mGluR5, NMDA1

Information Theory ------ Mapping

Eliminate one-to-one mappings obtained by permuting the bin numbers

For such pairs H(A) = H(B) = M(A,B), e.g. row 3 and row 4 of the preceding table

Replace such time series by one single series, leaving us with a set of 77 unique, non-equivalent time series.
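
One way to detect series that differ only by a permutation of bin numbers is to relabel the bins in order of first appearance. This is a sketch of that idea, not necessarily the procedure the authors used. The function name `canonical_form` is an assumption; the rows are taken from the mapping table in the slides.

```python
def canonical_form(series):
    """Relabel bins in order of first appearance.

    Series that differ only by a permutation of bin numbers map to the
    same tuple, so they can be merged into a single representative.
    """
    relabel = {}
    for s in series:
        relabel.setdefault(s, len(relabel))  # first new symbol -> 0, next -> 1, ...
    return tuple(relabel[s] for s in series)


# Rows 1 to 4 of the mapping table.
rows = [
    [0, 0, 2, 2, 2, 2, 2, 2, 2],
    [0, 0, 0, 0, 0, 0, 0, 1, 2],
    [0, 0, 0, 1, 1, 1, 1, 2, 2],
    [0, 0, 0, 2, 2, 2, 2, 1, 1],
]
unique_series = {canonical_form(r) for r in rows}
```

Rows 3 and 4 collapse to the same canonical tuple because swapping bins 1 and 2 turns one into the other, so the four rows leave three non-equivalent series.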

Information Theory ------ Measurement

Symmetric measures

  M(A,B)/max(H(A),H(B))

  M(A,B)/H(A,B)

Asymmetric measure: relative mutual information

  R(A,B) = M(A,B)/H(B)

  R(A,B) = 1.0 means that all the information about time series B is contained in time series A
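
The asymmetry of R can be illustrated with a series and a coarser copy of it. In this sketch `a` is row 3 of the mapping table and `b` is a hypothetical series obtained by merging bins 1 and 2 of `a`; the function names are assumptions, and a constant series (zero entropy) would need separate handling.

```python
from collections import Counter
from math import log2


def entropy(symbols):
    """Shannon entropy in bits, estimated from symbol frequencies."""
    n = len(symbols)
    return -sum((c / n) * log2(c / n) for c in Counter(symbols).values())


def mutual_information(a, b):
    """M(A,B) = H(A) + H(B) - H(A,B)."""
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))


def relative_mi(a, b):
    """R(A,B) = M(A,B) / H(B): share of B's information contained in A."""
    return mutual_information(a, b) / entropy(b)


def symmetric_measures(a, b):
    """Return (M / max(H(A), H(B)), M / H(A,B))."""
    m = mutual_information(a, b)
    return m / max(entropy(a), entropy(b)), m / entropy(list(zip(a, b)))


a = [0, 0, 0, 1, 1, 1, 1, 2, 2]      # row 3 of the mapping table
b = [0 if s == 0 else 1 for s in a]  # hypothetical: bins 1 and 2 merged

r_ab = relative_mi(a, b)  # how much of b is explained by a
r_ba = relative_mi(b, a)  # how much of a is explained by b
sym_max, sym_joint = symmetric_measures(a, b)
```

Since `b` is fully determined by `a`, `r_ab` is 1 up to rounding, while `r_ba` is below 1 because `b` cannot tell bins 1 and 2 of `a` apart. Both symmetric measures are below 1 as well.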

Conclusion

Linear correlation can be used very effectively to detect linear relationships

  It detects relationships not captured by Euclidean distance, such as high negative correlations

Rank correlation can be used to detect non-linear relationships

  It is much more robust with respect to the distribution of expression levels

Information theory can be used to detect genes whose (binned) expression patterns share information

  It will detect any mapping from time series A to B