Metabolic Network Inference from Multiple Types of Genomic Data

Yoshihiro Yamanishi
Centre de Bio-informatique, Ecole des Mines de Paris

Page 2: Outline

Motivation: metabolic network
Method: network inference
- Supervised network inference
- Multiple data integration
Application
- Global network prediction
Concluding remarks

Page 3: Metabolic network

The metabolic network consists of enzyme proteins and chemical compounds.

- 6018 genes in the yeast genome
- 1120 genes with EC numbers
- 668 genes with pathway information
(in KEGG as of Sep. 2004)

Problem: parts of the pathways are unknown, and many enzyme genes are missing.

Page 4: Network inference methods

For gene regulatory networks
- Bayesian network (Friedman et al., 2000; Imoto et al., 2002)
- Boolean network (Akutsu et al., 2000)
- Graphical modeling (Toh et al., 2001)

For protein interaction networks
- Joint graph method (Marcotte et al., 1999)
- Mirror tree method (Pazos et al., 2001)

Page 5: Objectives

- Develop a method to infer metabolic gene networks in a supervised context
- Integrate heterogeneous genomic data in the framework of network inference
- Reconstruct unknown pathways and identify genes for missing enzymes

Page 6: Kernel in this study

Kernel $k(x, x')$: a representation of the similarity between two genes $x$ and $x'$ (e.g., the correlation coefficient)

Kernel matrix: the similarity matrix of a set of $N$ genes $x_1, x_2, \ldots, x_N$

$$K_{ij} = k(x_i, x_j), \quad i, j = 1, 2, \ldots, N$$

Page 7: An example of the kernel

Suppose we have a set of genes $x_1, x_2, \ldots, x_N$ and represent them by gene expression profiles, for example

$$x_1 = (0.1, 0.4, 0.2, 0.3)^\top, \quad x_2 = (0.2, 0.3, 0.3, 0.2)^\top$$

$$K(x_1, x_2) = \langle x_1, x_2 \rangle = 0.1 \times 0.2 + 0.4 \times 0.3 + 0.2 \times 0.3 + 0.3 \times 0.2 = 0.26$$

Page 8: An example of kernel matrix

Kernel matrix:

$$K = \begin{pmatrix} K(x_1, x_1) & K(x_1, x_2) \\ K(x_2, x_1) & K(x_2, x_2) \end{pmatrix} = \begin{pmatrix} 0.30 & 0.26 \\ 0.26 & 0.26 \end{pmatrix}$$

This can be regarded as a kind of similarity matrix.
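The worked example above can be checked numerically. The following is a minimal sketch, assuming the inner-product kernel used in the example and NumPy for the arithmetic; the function name `linear_kernel` and the variable names are illustrative and do not come from the slides.

```python
import numpy as np

# Expression profiles of the two genes from the worked example
x1 = np.array([0.1, 0.4, 0.2, 0.3])
x2 = np.array([0.2, 0.3, 0.3, 0.2])

def linear_kernel(a, b):
    """Inner-product kernel k(a, b) = <a, b>."""
    return float(np.dot(a, b))

# Stack the profiles as rows; entry K[i, j] equals k(x_i, x_j)
X = np.vstack([x1, x2])
K = X @ X.T

print(round(linear_kernel(x1, x2), 2))  # 0.26
print(np.round(K, 2))
```

Computing `X @ X.T` evaluates every pairwise kernel value at once, so the same line would scale to all N genes.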

Pages 9-18: Direct network inference

Assumption: connected proteins in the network share high similarity in the data

[Figure, repeated across these slides: similarity matrix of genes 1 to 9 based on a genomic dataset; configuration of genes; predicted network]
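One simple way to read the direct approach is as thresholding of the similarity matrix: a pair of genes is predicted to be connected when its similarity is high enough. The sketch below assumes that reading; the threshold value, the toy matrix, and the function name `direct_inference` are made up for illustration and are not taken from the slides.

```python
import numpy as np

def direct_inference(S, threshold):
    """Predict an undirected edge (i, j) whenever S[i, j] >= threshold."""
    n = S.shape[0]
    edges = []
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only: no self-loops, no duplicates
            if S[i, j] >= threshold:
                edges.append((i, j))
    return edges

# Toy symmetric similarity matrix for 4 genes (made-up values)
S = np.array([
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.7, 0.3],
    [0.2, 0.7, 1.0, 0.8],
    [0.1, 0.3, 0.8, 1.0],
])

print(direct_inference(S, threshold=0.6))  # [(0, 1), (1, 2), (2, 3)]
```

Lowering the threshold adds edges and raising it removes them, which is what sweeping out a ROC curve amounts to.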

Page 19: Evaluation of the direct approach: using gene expression data

Gold standard data: metabolic network of 668 yeast genes in KEGG/Pathway
Expression data: 157 experiments (SMD)

[Figure: ROC curve; axes: false positives, true positives]

Page 20: Outline

Motivation: metabolic network
Method: network inference
- Supervised network inference
- Multiple data integration
Application
- Global network prediction
- Missing enzyme gene estimation
Concluding remarks

Pages 21-22: An illustration of formalism

[Figure: protein network and similarity matrix in expression; labels: unknown pathway, training]

Pages 23-25: Supervised network inference

Key idea: use of partially known network information

[Figure: genes $x_1$, $x_2$, $x_3$ in the original space; legend: training set, edge predicted by direct approach, true edge]

Page 26: Supervised network inference 1/2

Step 1: map proteins to a space where interacting proteins are close to each other

[Figure: mapping $f$ from the original space ($x_1$, $x_2$, $x_3$) to the feature space ($f(x_1)$, $f(x_2)$, $f(x_3)$); legend: training set, true edge]

Pages 27-28: Supervised network inference 2/2

Step 2: predict interacting protein pairs involving the test set

[Figure: mapping $f$ from the original space ($x_1$, $x_2$, $x_3$) to the feature space ($f(x_1)$, $f(x_2)$, $f(x_3)$); legend: training set, test set, true edge]
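Step 2 can be sketched once feature-space coordinates are available. The example below assumes the coordinates $f(x)$ are already given as rows of an array and predicts a pair as interacting when its Euclidean distance falls below a cutoff; the cutoff `delta`, the toy coordinates, and the function name `predict_edges` are hypothetical choices for illustration.

```python
import numpy as np

def predict_edges(F, test_idx, delta):
    """Predict pairs (i, j) that involve at least one test gene and whose
    feature-space distance ||F[i] - F[j]|| is below delta."""
    test = set(test_idx)
    n = F.shape[0]
    predicted = []
    for i in range(n):
        for j in range(i + 1, n):
            if i in test or j in test:
                if np.linalg.norm(F[i] - F[j]) < delta:
                    predicted.append((i, j))
    return predicted

# Made-up 2-D feature coordinates f(x) for 4 genes; gene 3 is the only test gene
F = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1]])
print(predict_edges(F, test_idx=[3], delta=0.5))  # [(2, 3)]
```

Pairs made up only of training genes are skipped, since their edges are already known.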

Page 29: Algorithm

Suppose we have a partially known graph $G = (V, E)$ with $V = \{x_1, x_2, \ldots, x_n\}$

$$f = \arg\min \sum_{(x_i, x_j) \in E} \left( f(x_i) - f(x_j) \right)^2$$

- Kernel CCA (Yamanishi et al., 2004)
- Distance metric learning (Vert et al., 2004)
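The objective on this slide can be made concrete under some simplifying assumptions. The sketch below is not the kernel CCA or distance metric learning formulation cited above. It assumes a linear map $f(x) = w \cdot x$ on centered profiles and adds a normalization constraint with a small ridge term `lam`; both are assumptions, since the slide states only the objective, and without a constraint the minimum would be the trivial constant map. The names `fit_linear_embedding` and `embed` are hypothetical.

```python
import numpy as np

def fit_linear_embedding(X, edges, n_components=1, lam=1e-3):
    """Find directions w that minimize the sum over known edges of
    (w.x_i - w.x_j)^2 subject to w^T (Xc^T Xc + lam*I) w = 1 (assumed constraint)."""
    mean = X.mean(axis=0)
    Xc = X - mean                      # center the profiles
    d = Xc.shape[1]
    # A = sum over known edges of (x_i - x_j)(x_i - x_j)^T
    A = np.zeros((d, d))
    for i, j in edges:
        diff = Xc[i] - Xc[j]
        A += np.outer(diff, diff)
    B = Xc.T @ Xc + lam * np.eye(d)    # regularized scatter matrix
    # Reduce the generalized problem A w = mu B w to a standard one via Cholesky
    C = np.linalg.cholesky(B)          # B = C C^T
    Cinv = np.linalg.inv(C)
    M = Cinv @ A @ Cinv.T
    vals, vecs = np.linalg.eigh(M)     # eigenvalues in ascending order
    W = Cinv.T @ vecs[:, :n_components]  # smallest eigenvalues: closest along edges
    return W, mean

def embed(X, W, mean):
    """Map profiles (training or test) into the learned feature space."""
    return (X - mean) @ W

rng = np.random.default_rng(0)
# Two groups of genes: coordinate 0 separates the groups, coordinate 1 is noise
X = np.vstack([
    np.column_stack([rng.normal(0.0, 0.1, 5), rng.normal(0.0, 1.0, 5)]),
    np.column_stack([rng.normal(3.0, 0.1, 5), rng.normal(0.0, 1.0, 5)]),
])
# Known edges (made up) connect genes within the same group only
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (5, 6), (6, 7), (7, 8), (8, 9)]

W, mean = fit_linear_embedding(X, edges)
F = embed(X, W, mean)
print(F.shape)  # (10, 1)
```

Because `embed` only needs a profile, the same map can be applied to test genes that were not part of the known graph, which is what step 2 of the supervised approach requires.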

Page 30: Result of the supervised learning: ROC curve by cross-validation

[Figure: ROC curves for the direct approach and the supervised approach]

Page 31: Outline

Motivation: metabolic network
Method: network inference
- Supervised network inference
- Multiple data integration
Application
- Global network prediction
- Missing enzyme gene estimation
Concluding remarks

Page 32: Various genomic data

Data                  Gene-gene relationship      Structure
Gene expression       Co-expression similarity    Numerical vectors
Localization data     Co-localization similarity  Bit strings
Phylogenetic profile  Evolutionary similarity     Bit strings

Page 33: Data of the yeast S. cerevisiae

- Expression: 6059 genes with 157 experiments (SMD database)
- Localization: 6059 proteins with 23 intracellular locations (Huh et al., 2003)
- Phylogenetic profile: 6059 proteins with 145 organisms (KEGG/Ortholog Cluster)

Page 34: Gene expression profiles

Numerical vectors of the gene expression ratio (rows: genes; columns: experiments or time series)

         exp1  exp2  exp3  exp4  exp5  …  exp P
gene 1  (0.1,  0.4,  0.6,  0.2, -0.3,  …,  1.5)
gene 2  (0.2,  0.9,  1.8,  0.7, -0.3,  …,  0.4)
gene 3  (0.6,  0.7, -1.0,  0.8,  1.2,  …,  0.6)
…
gene N  (1.2,  0.3,  1.9, -0.1, -0.7,  …,  0.1)

Gene-gene similarity: $K_{\mathrm{exp}}(x, x') = x \cdot x'$, where $x$ is the vector of the gene expression profile

Page 35: Phylogenetic profiles

Bit strings in which the presence and absence of the genes are coded as 1 or 0 across organisms (rows: genes; columns: organisms)

         org1 org2 org3 org4 org5 … org P
gene 1  (1,   1,   0,   0,   0,   …,  1)
gene 2  (1,   0,   1,   0,   1,   …,  0)
gene 3  (0,   1,   0,   0,   1,   …,  0)
…
gene N  (1,   0,   1,   0,   0,   …,  1)

Gene-gene similarity: $K_{\mathrm{phy}}(x, x') = x \cdot x'$, where $x$ is the bit string of the phylogenetic profile
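For bit strings, an inner-product similarity has a direct interpretation: it counts the organisms in which both genes are present. The sketch below assumes the similarity is the plain inner product of the profiles; the short profiles are illustrative bit strings rather than real data, and the variable names are not from the slides.

```python
import numpy as np

# Short illustrative phylogenetic profiles: rows are genes, columns are organisms
profiles = np.array([
    [1, 1, 0, 0, 0, 1],
    [1, 0, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
])

# Inner product of two bit strings = number of organisms where both genes are present
K_phy = profiles @ profiles.T
print(K_phy)
```

The diagonal of `K_phy` holds the number of organisms in which each gene occurs, so genes that are widespread get larger values overall.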

Page 36: An illustration of our network inference procedure

[Figure: INPUT (similarity matrices of genes from gene expression, protein localization, and phylogenetic profile) -> infer -> OUTPUT (gene network)]

Page 37: Data representation and integration

Genomic data          Similarity matrix
expression data       $K_{\mathrm{exp}}$
localization data     $K_{\mathrm{loc}}$
phylogenetic profile  $K_{\mathrm{phy}}$
integration           $K_{\mathrm{int}} = K_{\mathrm{exp}} + K_{\mathrm{loc}} + K_{\mathrm{phy}}$
weighted integration  $K_{\mathrm{wint}} = w_1 K_{\mathrm{exp}} + w_2 K_{\mathrm{loc}} + w_3 K_{\mathrm{phy}}$
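Both forms of integration are sums of kernel matrices over the same set of genes. The sketch below illustrates this with made-up 2x2 matrices; the weights are the normalized values reported later in the talk, and the function name `integrate` is hypothetical.

```python
import numpy as np

def integrate(kernels, weights=None):
    """Return the (weighted) sum of same-shaped kernel matrices."""
    if weights is None:
        weights = [1.0] * len(kernels)   # unweighted sum: K_int
    if len(weights) != len(kernels):
        raise ValueError("need one weight per kernel matrix")
    return sum(w * K for w, K in zip(weights, kernels))

# Made-up similarity matrices for two genes
K_exp = np.array([[1.0, 0.2], [0.2, 1.0]])
K_loc = np.array([[1.0, 0.5], [0.5, 1.0]])
K_phy = np.array([[1.0, 0.8], [0.8, 1.0]])

K_int = integrate([K_exp, K_loc, K_phy])
K_wint = integrate([K_exp, K_loc, K_phy], weights=[0.31, 0.16, 0.53])
print(K_int)
print(K_wint)
```

A sum of valid kernel matrices with non-negative weights is again a valid kernel matrix, so the integrated matrix can be passed to the same inference step as any single data source.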

Page 38: Evaluating the weight for each data source

1. Apply the method to each data source individually
2. Evaluate its biological relevance by the ROC score

ROC score: area under the ROC curve

[Figure: ROC curve]
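The area under the ROC curve equals the fraction of (positive, negative) pairs that are ranked correctly, which gives a compact way to compute it. The sketch below uses that equivalence; the scores and labels are made up, with label 1 standing for a gene pair that is an edge in the gold standard, and the function name `roc_score` is hypothetical.

```python
import numpy as np

def roc_score(scores, labels):
    """Area under the ROC curve: the fraction of (positive, negative) pairs
    in which the positive scores higher (ties count one half)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos = scores[labels]
    neg = scores[~labels]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("need at least one positive and one negative example")
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Made-up similarity scores for five gene pairs; label 1 = edge in the gold standard
scores = [0.9, 0.8, 0.7, 0.4, 0.3]
labels = [1, 0, 1, 1, 0]
print(roc_score(scores, labels))
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect one. The pairwise comparison is quadratic in the number of examples, so a rank-based formula would be preferable for a full gene network.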

Page 39: Evaluating the weight by the ROC scores

For each data source, compute (ROC score - 0.5) and use it as the weight.

[Figure: comparison across expression, localization, and phylogenetic profile]

Evolutionary information seems to be useful.
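The weighting rule can be sketched in a few lines. The example below assumes the weights are rescaled to sum to 1, which is inferred from the phrase "normalized weights" on the next slide. The ROC scores are hypothetical values chosen only for illustration and are not the results of the talk; the function name `roc_weights` is also an assumption.

```python
def roc_weights(roc_scores):
    """Weight each data source by (ROC score - 0.5), then normalize to sum to 1."""
    raw = {name: score - 0.5 for name, score in roc_scores.items()}
    total = sum(raw.values())
    if total <= 0:
        raise ValueError("ROC scores must exceed 0.5 on aggregate")
    return {name: value / total for name, value in raw.items()}

# Hypothetical ROC scores (not the talk's results)
example = {"expression": 0.60, "localization": 0.55, "phylogenetic profile": 0.65}
weights = roc_weights(example)
for name, w in weights.items():
    print(f"{name}: {w:.2f}")
```

Subtracting 0.5 means a data source that ranks edges no better than chance receives a weight of zero.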

Page 40: The effect of data integration

The resulting normalized weights:
- $w_1 = 0.31$ (expression)
- $w_2 = 0.16$ (localization)
- $w_3 = 0.53$ (phylogenetic profile)

$$K_{\mathrm{wint}} = w_1 K_{\mathrm{exp}} + w_2 K_{\mathrm{loc}} + w_3 K_{\mathrm{phy}}$$

[Figure: ROC curve]

Page 41: Outline

Motivation: metabolic network
Method: network inference
- Supervised network inference
- Multiple data integration
Application
- Global network prediction
- Missing enzyme gene estimation
Concluding remarks

Page 42: Comprehensive prediction of a global gene network

We predicted a network of 6059 genes.

Possible biological applications
1. Estimate unknown pathways
2. Predict biochemical function for hypothetical proteins
3. Identify missing enzyme genes

Pages 43-44: Prediction for a role in pathways

YJR137C (whose detailed function was unknown as of Sep. 2003) is connected with EC:1.8.4.8 and EC:2.5.1.47 in the predicted network.

Recently, it has been reported that YJR137C is annotated as EC:1.8.1.2.

Page 45: Outline

Motivation: metabolic network
Method: network inference
- Supervised network inference
- Multiple data integration
Application
- Global network prediction
Concluding remarks

Page 46: Summary

- We developed supervised approaches to infer the metabolic network from multiple genomic data
- The accuracy improved with supervised learning and weighted data integration
- We showed some possibilities for obtaining new biological findings

Page 47: Collaborators

For the methods
- Jean-Philippe Vert (Ecole des Mines)
- Minoru Kanehisa (Kyoto University)

For the biochemical experiments
- Hisaaki Mihara, Motoharu Ohsaki, Hisashi Muramatsu, Nobuyoshi Esaki (Kyoto University)