T. M. Murali Department of Computer Science Virginia Tech Slides prepared by Arjun Krishnan
How a cell is wired
[Figure labels: DNA, mRNA, Protein, Small molecules, Environment, Regulatory RNA]
The dynamics of such interactions emerge as cellular processes and functions
How do the genes and their products interact to collectively perform a function?
[Figure labels: A, B, Gene G, 35, RPM, Inhibitor, U2AF]
Molecular interaction networks
A network containing genes connected to each other whenever they physically or functionally interact
Proteins that interact/co-complex (ribosomal, polymerase, etc.)
Transcription factors and their targets
Enzymes catalyzing different steps in the same metabolic pathway
Genes with correlation in expression
Genes with similar phylogenetic profiles
Arabidopsis is the primary model organism for plants
Complex organization from molecular to whole organism level.
A key challenge …
Understanding the cellular machinery that sustains this complexity.
In the current post-genomic times, a main aspect of this challenge is ‘gene function prediction’:
Identification of functions of all the (~30,000) genes in the genome.
Total of ~30,000 genes in the genome
Extent of gene annotations in Arabidopsis
~15% with some experimental annotation
~8% with ‘expert’ annotation
~13% with annotations based on manually curated computational analysis
~14% with electronic annotations
Leaving ~50% of the genome without any annotation
Ashburner et al. (2000) Nat. Genet.
Swarbreck et al. (2008) Nucleic Acids Res.
Exploit high-throughput data
Integrating functional genomic data could lead to
Network models of gene interactions that resemble the underlying cellular map.
Typically these networks contain gene functional interactions
Connecting pairs of genes that participate in the same biological processes.
In such a network, the very place of a gene establishes the functional context of that gene.
‘Guilt-by-association’: genes of unknown function can be assigned the functions of their annotated neighbors.
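A minimal sketch of guilt-by-association on a weighted network, assuming a simple scoring rule: each unannotated gene is scored by the total weight of its edges to genes already annotated with the function. The gene names, edge weights, and the helper names `build_network` and `guilt_by_association` are hypothetical and only for illustration.

```python
from collections import defaultdict


def build_network(edges):
    """Build an undirected weighted adjacency map from (gene, gene, weight) triples."""
    adj = defaultdict(dict)
    for u, v, w in edges:
        adj[u][v] = w
        adj[v][u] = w
    return adj


def guilt_by_association(adj, annotated):
    """Score each unannotated gene by its total edge weight to annotated neighbors."""
    scores = {}
    for gene, neighbors in adj.items():
        if gene in annotated:
            continue
        scores[gene] = sum(w for n, w in neighbors.items() if n in annotated)
    return scores


# Hypothetical toy network: g1 and g3 are annotated with the function of interest.
edges = [("g1", "g2", 1.0), ("g2", "g3", 0.5), ("g3", "g4", 2.0), ("g4", "g5", 1.0)]
adj = build_network(edges)
scores = guilt_by_association(adj, annotated={"g1", "g3"})

# Rank candidate genes from most to least strongly associated.
ranked = sorted(scores, key=scores.get, reverse=True)
```

Genes with no annotated neighbors receive a score of zero, which is one reason direct-neighbor methods struggle when few genes are known for a function.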
Functional interaction networks
Functional interaction network models have been developed for Arabidopsis.
Lee et al. (2010) Rational association of genes with traits using a genome-scale gene network for Arabidopsis thaliana.
Very comprehensive in terms of using and integrating datasets in other organisms for application in plants.
Integrated 24 datasets: 5 datasets from Arabidopsis and the rest from other models.
AraNet: 19,647 genes, 1,062,222 interactions.
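To give a feel for the scale of these numbers, the following sketch derives the average number of interactions per gene and the fraction of all possible gene pairs that are linked. It assumes the network is undirected, has no self-loops, and counts each interaction once; these assumptions are not stated on the slide.

```python
# Figures quoted for AraNet.
n_genes = 19_647
n_interactions = 1_062_222

# Each undirected interaction contributes to the degree of two genes.
avg_degree = 2 * n_interactions / n_genes

# Number of distinct gene pairs, and the fraction of them that are connected.
possible_pairs = n_genes * (n_genes - 1) // 2
density = n_interactions / possible_pairs

print(f"average degree: {avg_degree:.1f}")
print(f"unweighted density: {density:.4f}")
```

Under these assumptions a gene has roughly 108 neighbors on average, while well under 1% of all possible gene pairs are connected.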
Goal of this study …
We examine the state of network-based gene function prediction in Arabidopsis.
Evaluate the performance of multiple prediction algorithms on AraNet.
Assess the influence of the number of genes annotated to a function and the source of annotation evidence.
Compute the correlation of prediction performance with network properties.
Evaluate prediction performance for plant-specific functions.
Network-based gene function prediction algorithms
Propagation of functional annotations across the network
Use positive and negative examples: SinkSource, Hopfield
Use only positive examples: FunctionalFlow (multiple phases)
Guilt-by-association using direct interactions
Use positive and negative examples: Local
Use only positive examples: FunctionalFlow (1 phase), Local+
Each gene in the network
Network-based gene function prediction
[Figure labels: Function A, Function B]
[Figure labels: Sink, Source]
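The sink/source picture can be sketched in code. This is a simplified illustration, assuming the usual formulation in which positive examples (sources) are fixed at 1, negative examples (sinks) are fixed at 0, and every other gene repeatedly takes the weighted average of its neighbors' scores. The function name, iteration count, and toy network are assumptions, not the implementation used in the study.

```python
def sinksource(adj, positives, negatives, iterations=100):
    """Propagate scores: sources stay at 1, sinks stay at 0, others are averaged."""
    score = {gene: 0.0 for gene in adj}
    for gene in positives:
        score[gene] = 1.0
    fixed = set(positives) | set(negatives)

    for _ in range(iterations):
        updated = dict(score)
        for gene, neighbors in adj.items():
            if gene in fixed:
                continue  # labeled genes never change
            total_weight = sum(neighbors.values())
            if total_weight > 0:
                # Weighted average of the neighbors' scores from the previous round.
                updated[gene] = (
                    sum(w * score[n] for n, w in neighbors.items()) / total_weight
                )
        score = updated
    return score


# Hypothetical path network: source p, unlabeled a and b, sink n (unit weights).
adj = {
    "p": {"a": 1.0},
    "a": {"p": 1.0, "b": 1.0},
    "b": {"a": 1.0, "n": 1.0},
    "n": {"b": 1.0},
}
score = sinksource(adj, positives={"p"}, negatives={"n"})
```

On this path the unlabeled gene closer to the source ends up with the higher score, so ranking genes by score orders them by their effective proximity to known positives.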
In this study …
Recall: fraction of known examples predicted correctly
Recall = TP / (TP + FN)
Precision: fraction of predictions that are correct
Precision = TP / (TP + FP)
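A small sketch of the two measures, assuming predictions and known annotations are given as sets of gene names. The gene names and the function name `precision_recall` are hypothetical.

```python
def precision_recall(predicted, known):
    """Return (precision, recall) for a set of predicted genes against known genes."""
    predicted, known = set(predicted), set(known)
    tp = len(predicted & known)   # predicted and truly annotated
    fp = len(predicted - known)   # predicted but not annotated
    fn = len(known - predicted)   # annotated but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical example: 4 predictions, 3 known genes, 2 in common.
precision, recall = precision_recall(
    predicted={"g1", "g2", "g3", "g4"}, known={"g1", "g2", "g5"}
)
```

Here TP = 2, FP = 2, and FN = 1, so precision is 2/4 and recall is 2/3.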
Performance of different algorithms
Computational gene function prediction precedes and guides experimental validation
What we get is a ranked list of novel predictions
An experimenter would choose a manageable number of top-scoring predictions to pursue
Precision at the top of the prediction list
We choose precision at 20% recall (P20R) as the measure of performance
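One way to compute precision at 20% recall from a ranked prediction list is sketched below. It assumes the simplest convention: walk down the list and report precision at the first rank where at least 20% of the known genes have been recovered. The exact procedure used in the study (for example, how cross-validation or ties are handled) is not given on the slides, and the names here are hypothetical.

```python
def precision_at_recall(ranked_genes, known, recall_level=0.2):
    """Precision at the first rank where recall reaches recall_level."""
    known = set(known)
    if not known:
        raise ValueError("at least one known gene is required")
    true_positives = 0
    for rank, gene in enumerate(ranked_genes, start=1):
        if gene in known:
            true_positives += 1
            if true_positives / len(known) >= recall_level:
                return true_positives / rank
    return 0.0  # the list never reaches the requested recall


# Hypothetical ranked list: k* genes are known positives, u* genes are not.
known = {f"k{i}" for i in range(1, 11)}  # 10 known genes
ranked = ["k1", "u1", "u2", "k2", "k3", "u3", "k4"]
p20r = precision_at_recall(ranked, known)
```

With 10 known genes, 20% recall is reached at the second true positive, which sits at rank 4 in this list, so the value is 2/4.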
Performance of different algorithms
SinkSource (SS) seems to perform better than the other algorithms
What about the influence of the number of genes in a function?
[Figure legend: 3rd quartile, median, 1st quartile; using only annotations based on experimental/expert evidence]
Performance of different algorithms
[Figure: number of functions against the number of genes annotated with a function, divided into first, second, and third groups]
Each group contains ~125 functions
Performance of different algorithms
For ‘small’ functions
- The algorithm does not matter!
- Using just experimental annotations is better when you know little about a function.
For ‘medium’ functions
- SS is a little better
- The effect of using ‘electronic’ evidence is mixed.
For ‘large’ functions
- SS is clearly the best
- Using all annotations is better
Performance of different algorithms
[Panels: all evidence codes (ECs); sans IEA/ISS]
Wilcoxon test: SS vs. other algorithms
Overall, SinkSource appears to be the best algorithm.
Correlation of performance with network properties
Performance on a particular function might depend on how its genes are organized / connected among themselves in the network.
Number of nodes
Number of components
Fraction of nodes in the largest connected component
Total edge weight
Weighted density
Average weighted degree
Average segregation
Correlation of performance with network properties
Number of nodes = 9
Number of components = 3
Fraction of nodes in the largest connected component = 4/9
Total edge weight = 8
Weighted density = 8/36
Average weighted degree = 16/9
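The sketch below computes these properties for a small graph. The slide's actual example graph is not recoverable from the transcript, so the graph here is a hypothetical one chosen to reproduce the same numbers: 9 nodes in components of size 4, 3, and 2, with 8 edges of weight 1. Weighted density is taken as total edge weight divided by the number of possible node pairs, and average weighted degree as twice the total edge weight divided by the number of nodes, which matches 8/36 and 16/9.

```python
from collections import defaultdict
from fractions import Fraction


def network_properties(nodes, edges):
    """Topological properties of an undirected graph with integer edge weights."""
    adj = defaultdict(set)
    for u, v, _ in edges:
        adj[u].add(v)
        adj[v].add(u)

    # Find connected components with a depth-first search.
    seen, sizes = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, size = [start], 0
        seen.add(start)
        while stack:
            node = stack.pop()
            size += 1
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        sizes.append(size)

    n = len(nodes)
    total_weight = sum(w for _, _, w in edges)
    return {
        "nodes": n,
        "components": len(sizes),
        "largest_component_fraction": Fraction(max(sizes), n),
        "total_edge_weight": total_weight,
        "weighted_density": Fraction(total_weight, n * (n - 1) // 2),
        "average_weighted_degree": Fraction(2 * total_weight, n),
    }


# Hypothetical graph: a 4-cycle, a triangle, and a single edge, all with weight 1.
nodes = list("abcdefghi")
edges = [
    ("a", "b", 1), ("b", "c", 1), ("c", "d", 1), ("d", "a", 1),
    ("e", "f", 1), ("f", "g", 1), ("g", "e", 1),
    ("h", "i", 1),
]
props = network_properties(nodes, edges)
```

`Fraction` keeps the ratios exact, so the results can be compared directly with the fractions on the slide (note that `Fraction(8, 36)` is stored in lowest terms as 2/9).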
Correlation of performance with network properties
Functional modularity: average segregation
[Figure: two example networks, one with Avg. seg = 8/22 and one with Avg. seg = 12/15]
We have …
Vector of SS P20R values for each function
Vector of values of a particular topological property for each function
Spearman rank correlation
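A minimal sketch of the Spearman rank correlation between two such vectors, written with the standard library only. It ranks both vectors (tied values share their average rank) and then takes the Pearson correlation of the ranks. The P20R and property values are made up for illustration, and the helper names are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 upward; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values: one entry per function.
p20r_values = [0.9, 0.4, 0.7, 0.1]
property_values = [0.8, 0.3, 0.5, 0.2]
rho = spearman(p20r_values, property_values)
```

Because only the ordering matters, the two vectors above are perfectly rank-correlated even though their values differ.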
Correlation of performance with network properties
[Figure: weighted density]
[Figure: Spearman rank correlation]
Performance on plant-specific functions
For ‘conserved’ functions
- Performance is better than that for all functions
- Using all annotations is better
For ‘plant-specific’ functions
- Performance is much worse compared to ‘conserved’ functions
- Using only experimental annotations is better
The underlying network is built based on data from multiple non-plant species
[Figure legend: 3rd quartile, median, 1st quartile; using only annotations based on experimental/expert evidence]
Most predictable ‘conserved’ functions
protein folding
nucleotide transport
innate immunity
cytoskeleton organization, and
cell cycle
Least predictable ‘conserved’ functions
regulation of …
Specialized functions
Most predictable ‘plant-specific’ functions
cell wall modification
auxin/cytokinin signaling, and
photosynthesis
Contribution from Arabidopsis datasets
Least predictable ‘plant-specific’ functions
development, morphogenesis
pattern formation
phase transitions of various tissues, organs / growth stages
Conclusions
Evaluated the performance of various prediction algorithms on AraNet.
SinkSource is the overall best prediction algorithm.
Measured the influence of the number of genes annotated to a function and the source of annotation evidence.
All algorithms perform poorly when only a small number of genes are ‘known’ or when annotating very specific functions.
When only a small number of genes are ‘known’, use only experimentally verified annotations to make new predictions.
When a considerable number of genes are ‘known’, use all annotations to make new predictions.
Conclusions
Measured the correlation of performance with network properties
Several topological properties correlate well with performance.
‘Average segregation’ has the strongest correlation.
Conclusions
Assessed performance on conserved/plant-specific functions
Performance on basic ‘conserved’ functions is better than that for all the functions.
Specialized ‘conserved’ functions are hard to predict.
Performance on ‘plant-specific’ functions is very poor.
Also a consequence of the fact that ‘plant-specific’ functions generally have a small number of annotations.
Conclusions
Avenues for improvement in functional interaction networks
Build functional interaction networks that are based on a larger collection of plant datasets.
Rely as little as possible on data from other species.
Avenues for future experimental work
‘Plant-specific’ functions and
Specialized ‘conserved’ functions.
Acknowledgements
Arjun Krishnan
Brett Tyler
Andy Pereira