
Page 1: Anis.karimpour-fard@uchsc.edu   Anis Karimpour-Fard ‡, Ryan T. Gill †,

Anis.karimpour-fard@uchsc.edu
http://www.colorado.edu/che/research/faculty/gill/
http://compbio.uchsc.edu/Hunter

Anis Karimpour-Fard‡, Ryan T. Gill†, and Lawrence Hunter‡

‡ University of Colorado School of Medicine

† Department of Chemical and Biological Engineering, University of Colorado, Boulder

Investigation of factors affecting prediction of protein-protein interaction networks by phylogenetic profiling

Dec 1, 2007

Page 2:

The meaning of protein function

Eisenberg, D. et al. Nature 2000

Biochemical view

The function of protein A is its action on a substrate (S) to form a product (P).

[Diagram: S → P, with A acting on S]

Post-genomic view

The function of A is the context of its interactions with other proteins in the cell.

[Diagram: protein A in a network with proteins B, C, D, M, N, X, Y, Z]

The problem ……

More than 500 microbial genomes are fully sequenced, and a high percentage of their genes have unknown function.

For example:
E. coli K12: 15%
P. aeruginosa: 45%

http://www.genomesonline.org/

Page 3:

Predicting protein function

• Homology-based methods (give only a partial understanding of a protein's role)
– Simple sequence similarity searches (BLAST)
– Profile searches (PSI-BLAST)
– Databases of conserved domains (Pfam, SMART)

• Prediction from genomic context
– Phylogenetic profile
– Gene cluster
– Gene neighbor
– Rosetta Stone

• Prediction from high-throughput experimental data
– Microarray gene expression data
– Protein-protein interaction screens
– ...

Page 4:

Phylogenetic Profile

Pellegrini et al. PNAS 96, 4285 (1999)
Marcotte et al. PNAS 97, 12115 (2000)

1- Select a set of genomes as the reference set

2- Create the phylogenetic profile matrix for the target organism:

• Run a one-against-all BLAST search to identify homologs of every target gene in the diverse reference organisms.
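As a minimal sketch of step 2, assume the one-against-all BLAST results have already been reduced to the best E-value per (target protein, reference genome) pair. The input layout, the function name `build_profiles`, the genome and protein names, and the example threshold of 1e-5 are all illustrative assumptions, not details taken from the slides.

```python
def build_profiles(best_evalues, genomes, threshold=1e-5):
    """Return {protein: [0/1, ...]} with one bit per reference genome.

    best_evalues: {protein: {genome: best BLAST E-value}} (hypothetical layout);
    a genome missing from the inner dict means no hit was reported.
    """
    profiles = {}
    for protein, hits in best_evalues.items():
        # A homolog is called present when the best hit is at or below the threshold.
        profiles[protein] = [
            1 if hits.get(genome, float("inf")) <= threshold else 0
            for genome in genomes
        ]
    return profiles


# Made-up E-values for two proteins against three reference genomes.
genomes = ["genome_A", "genome_B", "genome_C"]
best_evalues = {
    "protX": {"genome_A": 1e-40, "genome_B": 1e-3},
    "protY": {"genome_A": 1e-12, "genome_C": 1e-8},
}
profiles = build_profiles(best_evalues, genomes)
print(profiles)
```

Calling `build_profiles` again with a looser or stricter `threshold` flips some bits from absent to present or back, which is one way to probe the E-value question raised on this slide.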

Does the selection of the reference genomes influence the prediction? If so, how?

How does the E-value threshold affect the prediction of protein-protein interactions?

[Workflow diagram labels: reference selection; BLAST E-value threshold (present or absent); measure profile similarities]

Page 5:

3- Measure profile similarities

Protein X: 110001111001001110001111
Protein Y: 111000111100000110001111
19 matching bits out of 24

4- Generate the protein-protein interaction network (two nodes are connected if the two proteins have similar profiles)

[Diagram: Protein X connected to Protein Y]

5- Create clusters from the set of protein-protein interactions

6- Visualize the network
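Steps 3 to 5 can be sketched as follows, using the simple bit-matching comparison shown for Protein X and Protein Y. The 0.75 similarity cutoff, the extra profile "Z", and the use of connected components as the clustering method are assumptions made for illustration; the slides do not specify them.

```python
from itertools import combinations


def matching_bits(a, b):
    """Count positions where two equal-length bit strings agree."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    return sum(x == y for x, y in zip(a, b))


def build_edges(profiles, min_fraction):
    """Connect two proteins when the fraction of matching bits reaches min_fraction."""
    edges = []
    for (p, a), (q, b) in combinations(sorted(profiles.items()), 2):
        if matching_bits(a, b) / len(a) >= min_fraction:
            edges.append((p, q))
    return edges


def connected_components(nodes, edges):
    """Group nodes into clusters by following edges (depth-first search)."""
    neighbors = {n: set() for n in nodes}
    for p, q in edges:
        neighbors[p].add(q)
        neighbors[q].add(p)
    seen, clusters = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, cluster = [start], set()
        while stack:
            node = stack.pop()
            if node in cluster:
                continue
            cluster.add(node)
            stack.extend(neighbors[node] - cluster)
        seen |= cluster
        clusters.append(sorted(cluster))
    return clusters


profiles = {
    "X": "110001111001001110001111",  # from the slide
    "Y": "111000111100000110001111",  # from the slide
    "Z": "000111000011111000110000",  # made-up profile for contrast
}
print(matching_bits(profiles["X"], profiles["Y"]))  # 19, as on the slide
edges = build_edges(profiles, min_fraction=0.75)  # cutoff is an assumption
print(edges)
print(connected_components(profiles, edges))
```

With these three profiles only X and Y are linked, so the clustering step returns one two-protein cluster and one singleton.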

Page 6:

Measure profile similarities

[Diagram: Protein X connected to Protein Y]

Protein X: 110001111001001110001111
Protein Y: 111000111100000110001111

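Two nodes are connected if the two proteins have similar profiles.

• Mutual information

$MI(X, Y) = H(X) + H(Y) - H(X, Y)$

$H(Y) = -\sum_{i=0}^{1} p(i) \ln p(i)$

$H(X, Y) = -\sum_{i=0}^{1} \sum_{j=0}^{1} p(i, j) \ln p(i, j)$

where p(i), i = 0, 1, is the fraction of genomes in which protein Y is in state i, and p(i, j) is the fraction of genomes in which protein X is in state i and protein Y is in state j.

• Pearson correlation coefficient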
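The two similarity measures named above can be sketched directly from their definitions, applied to the Protein X and Protein Y profiles from the slide. The function names are illustrative, and the Pearson formula used is the standard textbook one, which the slide names but does not write out.

```python
import math


def entropy(counts, total):
    """Shannon entropy (natural log) from a list of state counts."""
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def mutual_information(x, y):
    """MI(X, Y) = H(X) + H(Y) - H(X, Y) for two equal-length bit strings."""
    n = len(x)
    h_x = entropy([x.count("0"), x.count("1")], n)
    h_y = entropy([y.count("0"), y.count("1")], n)
    pairs = list(zip(x, y))
    # Joint counts for the four state combinations (i, j).
    joint = [pairs.count((i, j)) for i in "01" for j in "01"]
    return h_x + h_y - entropy(joint, n)


def pearson(x, y):
    """Pearson correlation coefficient between two bit strings."""
    a, b = [int(c) for c in x], [int(c) for c in y]
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)  # undefined for constant profiles


x = "110001111001001110001111"
y = "111000111100000110001111"
print(round(mutual_information(x, y), 3), round(pearson(x, y), 3))
```

Unlike the raw bit-match count, mutual information is zero for statistically independent profiles, so it does not reward two proteins merely for both being present in most genomes.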

• Inverse homology

– Calculate the homology between two genomes: H(i,j) is the ratio of the number of homologs in each reference organism j to the number of proteins in the target genome i.

– P(i,j) = 1/H(i,j) when a homolog is present in reference genome j; otherwise P(i,j) = 0.

Karimpour-Fard et al. BMC Genomics. 2007;8(1):393
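One possible reading of the inverse homology weighting is sketched below. It assumes that H for reference genome j is the fraction of target proteins with a homolog in j, computed from the binary profile matrix itself, and that the weight 1/H replaces each "present" bit. The function name and the toy matrix are hypothetical; the cited BMC Genomics paper is the authority on the exact definition.

```python
def inverse_homology_profiles(binary_profiles):
    """Reweight present bits by 1 / H_j, where H_j is the fraction of
    target proteins that have a homolog in reference genome j (assumed reading)."""
    proteins = list(binary_profiles)
    n_proteins = len(proteins)
    n_genomes = len(binary_profiles[proteins[0]])
    # H_j: homologs found in genome j divided by proteins in the target genome.
    h = [
        sum(binary_profiles[p][j] for p in proteins) / n_proteins
        for j in range(n_genomes)
    ]
    # A present bit implies H_j > 0, so the division is safe.
    return {
        p: [1 / h[j] if binary_profiles[p][j] else 0.0 for j in range(n_genomes)]
        for p in proteins
    }


# Toy target genome with four proteins and three reference genomes.
binary = {
    "p1": [1, 1, 0],
    "p2": [1, 0, 0],
    "p3": [1, 1, 0],
    "p4": [1, 0, 1],
}
weighted = inverse_homology_profiles(binary)
print(weighted["p1"], weighted["p4"])
```

Under this reading, a hit in a distantly related genome that shares few homologs with the target carries more weight than a hit in a close relative that shares nearly all of them.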

Page 7:

Comparison of different combinations of reference genomes and E-value thresholds using COG

[Figure, panel c]

• PPV = TP / (TP + FP)

– TP = number of predicted pairs in the same functional category

– FP = number of predicted pairs that were classified but were not in the same functional category
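The PPV definition above can be sketched as a short function. It assumes, for simplicity, that each classified protein has exactly one functional category, and it skips pairs in which either protein is unclassified, as the FP definition implies. The protein names and category letters are made up.

```python
def ppv(predicted_pairs, category):
    """PPV = TP / (TP + FP), skipping pairs with an unclassified protein."""
    tp = fp = 0
    for a, b in predicted_pairs:
        if a not in category or b not in category:
            continue  # unclassified pairs count as neither TP nor FP
        if category[a] == category[b]:
            tp += 1
        else:
            fp += 1
    return tp / (tp + fp) if tp + fp else float("nan")


# Hypothetical category assignments; "p5" is unclassified.
category = {"p1": "C", "p2": "C", "p3": "E", "p4": "E"}
pairs = [("p1", "p2"), ("p1", "p3"), ("p3", "p4"), ("p4", "p5")]
print(ppv(pairs, category))
```

Here two pairs share a category, one does not, and one is skipped, so the result is 2/3.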

[Figure legend, reference genome sets compared: random sets, all, low GC, aerobic]

Karimpour-Fard et al. BMC Genomics. 2007;8(1):393

Page 8:

Co-evolution can be used to assign function to unstudied genes

Hypothetical proteins YcgB, YeaH, YeaG are co-conserved across different species. Comparison of sub-graphs across species (CS-CCC) suggested that a previously unstudied S. typhimurium gene, ycgB, is functionally related to yeaH. Experimental data support the hypothesis that both genes are important for antimicrobial peptide resistance.

Edge color code:

• E. coli K12 (green)
• E. coli O157 (blue)
• Shigella flexneri (black)
• S. typhimurium LT2 (purple)
• P. aeruginosa (mustard)

Karimpour-Fard et al. Genome Biology 2007 8:R185