IOP PUBLISHING PHYSICAL BIOLOGY
Phys. Biol. 8 (2011) 035008 (13pp) doi:10.1088/1478-3975/8/3/035008
Systematic computational prediction of protein interaction networks
J G Lees1, J K Heriche2, I Morilla3, J A Ranea1,3 and C A Orengo1
1 Research Department of Structural & Molecular Biology, University College London, London, UK
2 Cell Biology/Biophysics Unit, European Molecular Biology Laboratory (EMBL), Meyerhofstrasse 1, D-69117 Heidelberg, Germany
3 Department of Molecular Biology and Biochemistry-CIBER de Enfermedades Raras, University of Malaga, Malaga, Spain
E-mail: [email protected]
Received 2 November 2010
Accepted for publication 9 February 2011
Published 13 May 2011
Online at stacks.iop.org/PhysBio/8/035008
Abstract
Determining the network of physical protein associations is an important first step in
developing mechanistic evidence for elucidating biological pathways. Despite rapid advances
in the field of high-throughput experiments to determine protein interactions, the majority of
associations remain unknown. Here we describe computational methods for significantly
expanding protein association networks. We describe methods for integrating multiple
independent sources of evidence to obtain higher quality predictions and we compare the
major publicly available resources for experimentalists to use.
1. Introduction
New technologies in biology have given us the genomes
for thousands of species, including humans. Understanding
how all of these molecular parts assemble into functional
pathways is a major challenge. It has been noted that an
organism's complexity arises in part from the intricate and
dynamic networks of protein associations. While several
resources [1–5] provide experimental information on protein
associations, the experimental data, although growing rapidly,
are still limited (e.g. perhaps
first method made use of sequence information. As detailed
below, there are different ways of using sequences to make
inferences about protein associations. The advantage of these
methods is that sequence data have become abundant and
are easily available through public databases in standardized
formats. A second class of methods has followed the
development of high-throughput experiments and functional
annotation databases.
2.1. Genomic context methods
2.1.1. Co-occurrence profiles. Genome context association
prediction algorithms are a family of sequence-based
methods used to predict associations between proteins.
These techniques are based on principles derived from
known evolutionary processes. For example, co-occurrence
(phylogenetic) profiles are genomic context methods based
on the principle that if genes are functionally related, they
will tend to be co-inherited as a unit since the loss of any
one gene would compromise the functioning of the others.
Phylogenetic profiling algorithms look for similar patterns of
the presence/absence of genes across species (figure 1(a)). It
is unclear exactly what the predictions correspond to, although
they are generally considered to reflect biological process type
associations. More functional significance can be assigned
if the patterns are seen over distant evolutionary periods
or if they occur independently in multiple lineages. The
original phylogenetic profiling idea [9] has been developed
in many different ways including more complex logical
rules to associate genes [10], the use of domain instead of
whole protein profiles [11] and the integration of species
phylogenetic information [12, 13]. Gene duplications can
lead to spurious predictions with high similarity between
the duplicated genes' profiles. Some resources implement
a scoring scheme whereby homologous proteins are down-weighted
in accordance with their level of homology [14]. It is
also important to filter low information content profiles [11].
Phylogenetic profiles have been used successfully in
archaea and bacteria to discover, for example, novel essential
members of synthetic pathways [15, 16], environmental
adaptation factors [17] and thiamine biosynthesis [18].
Interestingly some studies have made use of anti-correlation
in the pattern as a signal to make predictions of functional
association [18]. A domain-based phylogenetic profile
method, phylotuner [11], has recently been developed to
improve performance in eukaryotes where the multigene
families and protein domain rearrangements create challenges
for these types of approaches.
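The profile comparison described in section 2.1.1 can be illustrated with a minimal sketch. It assumes binary presence/absence profiles over a fixed list of species, uses Jaccard similarity as one possible similarity measure (the text does not prescribe a particular one), and filters low information content profiles by Shannon entropy. The gene names, thresholds and function names are hypothetical, and no correction for homology or species phylogeny is attempted.

```python
import math

def jaccard(profile_a, profile_b):
    """Jaccard similarity of two presence/absence profiles (sequences of 0/1)."""
    both = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    either = sum(1 for a, b in zip(profile_a, profile_b) if a or b)
    return both / either if either else 0.0

def entropy(profile):
    """Shannon entropy in bits; 0 for genes present in all or in no species."""
    p = sum(profile) / len(profile)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def predict_pairs(profiles, min_entropy=0.5, min_similarity=0.8):
    """Return (gene1, gene2, similarity) for informative, similar profiles."""
    genes = sorted(g for g, prof in profiles.items() if entropy(prof) >= min_entropy)
    hits = []
    for i, g1 in enumerate(genes):
        for g2 in genes[i + 1:]:
            similarity = jaccard(profiles[g1], profiles[g2])
            if similarity >= min_similarity:
                hits.append((g1, g2, similarity))
    return hits

# Hypothetical profiles across eight species (1 = gene present).
profiles = {
    "geneA": (1, 1, 0, 0, 1, 0, 1, 0),
    "geneB": (1, 1, 0, 0, 1, 0, 1, 0),
    "geneC": (0, 1, 1, 0, 0, 1, 0, 1),
    "geneD": (1, 1, 1, 1, 1, 1, 1, 1),  # present everywhere: uninformative
}
predictions = predict_pairs(profiles)
```

In this toy example only geneA and geneB are paired; geneD is discarded before comparison because a gene found in every species would match any other widespread gene.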
2.1.2. Gene fusion. Evolutionary pressure can produce
fusion of separate but functionally related genes (A and B
in figure 1(b)) into a single gene. In their simplest form,
gene fusion prediction methods identify pairs of proteins in
a genome that are homologous to proteins fused together
in a different genome, and use this as supporting evidence
for the functional association of these individual genes [19].
The predicted association type is again unclear but most often
corresponds to a shared biological process or consecutive
steps in a pathway [19]. Proteins encoded by genes involved in
fusion events detected in mammals showed a propensity to
interact [20]. As with
phylogenetic profiling, analogous domain-based equivalents
exist [21]. These methods identify domains on two distinct
protein sequences in the same genome that are found fused
into a single sequence in a different genome (figure 1(b)).
However, because of large and/or promiscuous domain
families (e.g. kinase domains), domain fusion predictors
require a scoring mechanism to prevent a great number of
non-specific predictions. An example of fusion data used in
conjunction with profiles is given by a method developed for
identifying biothiol synthetic enzymes [22].
2.1.3. Genomic neighborhood. Due to functional constraints,
genes can be maintained close together on a chromosome
over long evolutionary time periods (figure 1(d)). Genomic
neighborhood prediction methods identify genes that cluster
within a certain base distance across multiple genomes. As
with other genomic context methods, artifacts can arise through
shared ancestry due to inadequate time for reshuffling of
genes. The genomic neighborhood method is not to be
confused with methods such as the operon method [23]
based on the intergenic distance in a single genome. A
recent comprehensive study in prokaryotes demonstrated the
genomic neighborhood method to be the best among genomic
context methods [24].
Some genomic neighborhood methods apply the rule that
gene order needs to be maintained. However, since some gene
rearrangement can be tolerated, not all methods enforce this
constraint [25]. Some resources allow neighboring genes that
are diametrically opposed in a head-to-head orientation to be
considered [14]. The genomic neighborhood approach has
been used to predict the archaeal exosome [26], which was
subsequently experimentally validated [27].
Figure 1. Illustration of the principles behind several of the protein association prediction methods described in the text (panels (a)–(e)).
2.2. Sequence-based prediction
2.2.1. Sequence co-evolution. Interactions between proteins
are mediated through specific residue interfaces [28, 29]. It
has been observed that physically interacting proteins have
greater similarity of their phylogenetic trees than expected
by chance. One process put forward to explain this
is compensatory mutation, whereby a deleterious mutation
of an interaction-mediating residue in one protein can be
ameliorated by a compensatory mutation in the binding partner
[30]. More recent analyses suggest that additional sources
contribute to the co-evolutionary signal (see [31, 32]), making
the type of functional linkage derived from these methods
less clear-cut. The prediction methods developed around the
principle of sequence co-evolution make use of multiple
sequence alignments for each putative interacting protein from
which distance matrices are calculated. High correlation of
these distance matrices is taken as evidence of a potential
physical interaction [33] (figure 1(c)). Other methods exploit
gene evolutionary information in multiple sequence analysis
by comparing the phylogenetic tree distance matrices of pairs
of gene families. These methods have been improved by
integrating species tree information [34] or by implementing
algorithms that efficiently deal with the multi-gene families of
paralogues found in eukaryotes [35].
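The distance-matrix comparison of section 2.2.1 can be sketched as follows. The sketch assumes that the two matrices were computed from alignments covering the same species in the same order, and it uses the Pearson correlation of the upper triangles as the score. The matrices are made-up numbers, the function names are hypothetical, and none of the corrections for species phylogeny mentioned in the text are applied.

```python
import math

def upper_triangle(matrix):
    """Flatten the entries above the diagonal of a square distance matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)

def coevolution_score(dist_a, dist_b):
    """Correlation between two protein families' inter-species distances."""
    return pearson(upper_triangle(dist_a), upper_triangle(dist_b))

# Hypothetical evolutionary distances between the same four species.
dist_a = [[0, 1, 4, 5], [1, 0, 4, 5], [4, 4, 0, 2], [5, 5, 2, 0]]
dist_b = [[0.0, 0.5, 2.0, 2.5], [0.5, 0.0, 2.0, 2.5],
          [2.0, 2.0, 0.0, 1.0], [2.5, 2.5, 1.0, 0.0]]   # dist_a scaled by 0.5
dist_c = [[0, 5, 1, 2], [5, 0, 2, 1], [1, 2, 0, 4], [2, 1, 4, 0]]

score_ab = coevolution_score(dist_a, dist_b)
score_ac = coevolution_score(dist_a, dist_c)
```

Families A and B evolve at different rates but with the same relative distances, so they correlate perfectly, whereas A and C are anti-correlated in this toy example.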
2.2.2. Commonly occurring domain pairs. A large
portion of interactions are mediated through domain–domain
associations, and these interactions are conserved across
species [36], with certain domain pairs re-occurring in
multiple protein interactions [37, 38]. One use for this
domain association network is that knowledge of the
underlying domains mediating protein interactions can be used
to help predict novel protein interactions. These methods
are potentially powerful since domains can be assigned reliably
and quickly to any genome using HMMER3 [39, 40]. Several
approaches for predicting domain interactions have been
developed, including over-representation methods [41] and
random forest methods [42].
Multiple methods are available from the DIMA resource [43].
UniDomInt merges the predictions of nine different domain
interaction prediction methods to provide a meta-database of
more reliable associations [44]. High scoring domain pairs
between proteins have frequently been used to make predictions
as part of an integration strategy (for example [45]).
2.2.3. Simple sequence features. A variety of
methods have been developed to predict protein–protein
interactions from intrinsic features of sequences such as
3-mers of neighboring residues [46]. The high accuracy achieved
by these methods has recently been called into question and
may be an artifact of the sets used to train and validate the
methods [47].
2.3. Homology-based methods
2.3.1. Inheriting protein interactions from sequence.
There are many known protein interactions (interologs)
that are conserved across species [48], although the
general level of protein interaction conservation remains
unclear [49]. Algorithms have been developed that inherit
such protein interactions from an experimentally confirmed
interaction in one genome to the query genome, via homology
[48, 50, 51]. At their simplest, these methods make orthology
assignments (by reciprocal best hit or using an orthology
database) and transfer interactions where possible (figure 1(e)).
More complex methods make use of heterogeneous features,
including domain combination, subcellular localization and
tissue specificity, to try to give increased confidence to
the interolog assignment [52]. Interolog-based methods are
especially powerful for organisms with few experimentally
determined interactions, allowing for a substantial number of
high confidence protein interaction predictions. Interolog-based
methods are among the most successfully applied protein
association prediction methods in terms of uptake and use by
experimentalists, with over 80 publications associated with the
I2D [53] method alone.
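The simplest form of interolog transfer described above amounts to mapping each known interaction through an orthology table. The following sketch assumes that orthology assignments are already available as a dictionary (for example from reciprocal best hits); all protein identifiers are hypothetical, and self-interactions are skipped as a simplifying assumption.

```python
def transfer_interologs(known_interactions, orthologs):
    """Map interactions from a source organism onto a query organism.

    known_interactions: iterable of (protein_a, protein_b) in the source organism.
    orthologs: dict mapping a source protein to its query-organism orthologs.
    Returns a set of sorted (query_a, query_b) tuples.
    """
    predicted = set()
    for a, b in known_interactions:
        for query_a in orthologs.get(a, ()):
            for query_b in orthologs.get(b, ()):
                if query_a != query_b:  # assumption: ignore self-interactions
                    predicted.add(tuple(sorted((query_a, query_b))))
    return predicted

# Hypothetical source-organism interactions and orthology assignments.
known = [("YA", "YB"), ("YB", "YC"), ("YD", "YE")]
orthologs = {"YA": ["HA"], "YB": ["HB1", "HB2"], "YC": ["HC"]}

predicted = transfer_interologs(known, orthologs)
```

Because YB has two orthologs in the query organism, each of its interactions is transferred twice; this is where the additional features mentioned in the text (domain combination, localization, tissue specificity) could help to choose between candidates. The YD–YE interaction is not transferred because neither protein has an assigned ortholog.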
2.3.2. Inheriting protein interactions from structure.
Structural complexes can also be used to transfer interactions
from a known interacting pair to proteins with similar structure.
These methods can provide insight into the physical details
of the interaction and are likely to become more important
in the near future as more structures become available [54].
Although the number of complexes of known 3D structure is
relatively small, it is possible to expand this set by considering
homologous proteins. An early example of this is the
InterPreTS method [55], which, given a known 3D complex
structure and homologous sequences for each interacting
protein, ranks interactions between homologues of the same
species. Another method, Struct2Net [56], threads sequences
onto structures and computes scores from the interfacial energy
for the sequence pair. The iWRAP method allows for
inheriting interactions in cases with low sequence identity by
focusing on the interface residues in the threading process
[57]. Recently, methods have been developed that add an extra
layer of evidence for the predicted interaction by considering
the evolutionary conservation of binding site residues via
structural alignments
[58, 59]. The IBIS server predicts interaction partners
and binding sites for a given protein using experimentally
observed or homology inferred complexes. IBIS checks for
several features to ensure biological relevance of the inferred
complex. For example, binding site residues are assessed
for evolutionary conservation, using a set of non-redundant
homologous proteins. As another check, IBIS makes use of
PISA [60] validation, which considers the physicochemical
properties of the protein interaction interface. Structural
data have been used as a means for inheriting interactions
on a genome-wide scale, using structural alignment scores to
generate kernels for use with a support vector machine (SVM)
[61].
2.4. Exploiting experimental data
2.4.1. Microarray profiles. Microarrays were one of
the first genome-wide experimental methods developed
[8]. Compendia of microarray experiments across various
experimental conditions have been assembled. It has been
demonstrated in yeast that genes with high co-expression,
defined by Pearson correlation across the different conditions,
were more likely to be physically interacting than randomly
chosen pairs [62] although this signal strength varies for
different organisms [63]. Large information-rich data sets
with 6000 microarray experiments have been assembled
[64]. The development of statistical processing tools to find
signals from such large data sets, using a subset of conditions,
has broadened the applicability of the method to predicting
co-complex membership in Homo sapiens [65]. There are
other approaches for using microarray data in the context of
an integration strategy for protein association prediction. One
method is to look for genes whose co-expression is conserved
across multiple genomes [66]. Another is to identify genes
expressed in similar subcellular/tissue types [45] as supporting
evidence of their interaction. Clearly such pieces of evidence
are very weak on their own, but can provide useful supporting
information when used in combination with other data.
2.4.2. Other experimental screens. Other types of
experimental screens can also be assembled and processed
into similarity profiles. For example, phenotypic vectors
from high-throughput loss of function experiments can be
clustered to give sets of functionally related proteins, which are
often used as a basis to test for physical interactions. Similarities
between trajectories of subcellular localization have also been
used to generate hypotheses about physical interactions of
uncharacterized proteins [67]. Other examples, such as in
vivo genomic binding maps, provide information on positional
targeting of chromatin components that can be used to generate
predictions on the network of interactions in chromatin
assembly [68]. As more high-throughput experimental data
sets become available their general usefulness for prediction
will increase.
2.5. Literature-derived associations
2.5.1. Text mining. Only a portion of the experimental
interactions are captured by the interaction database resources
[1–5, 69, 70]. Information on other experimentally detected
interactions is available from PubMed and other online
resources [71–73]. Text mining is a very powerful method
either for expanding interactomes automatically [14] or for
speeding up the curation process for certain interaction
databases [2, 3]. Protein associations can be obtained by
searching for statistically significant co-occurrences between
gene names [74]. In its simplest form, the principle behind
such methods is that the more frequently two genes
occur in the same sentence/paragraph/abstract or article, the
more likely their functional association. Another common
way to generate networks is by natural language processing
of abstracts, considering gene names as nodes and the verbs
as edges. Restricting verbs of association to those such as
'binds', 'interacts', etc provides physical interaction networks.
A major uncertainty associated with text mining results is in
assigning the gene names in the text to a corresponding entity
in the sequence databases. In more recent developments,
protein interactions have been extracted from the literature
using kernel methods [75].
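The co-occurrence principle of section 2.5.1 can be sketched at the sentence level as follows. This is a deliberately naive illustration: it only counts co-occurrences and does not assess their statistical significance, it assumes gene names are single whitespace-delimited tokens, and it ignores the name-to-database mapping problem noted in the text. The sentences and gene names are invented.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(sentences, gene_names):
    """Count how often each pair of gene names appears in the same sentence."""
    counts = Counter()
    for sentence in sentences:
        # Crude tokenization: strip basic punctuation, split on whitespace.
        tokens = set(sentence.replace(".", " ").replace(",", " ").split())
        present = sorted(gene_names & tokens)
        counts.update(combinations(present, 2))
    return counts

# Invented example sentences and gene names.
gene_names = {"geneA", "geneB", "geneC", "geneD"}
sentences = [
    "geneA binds geneB.",
    "geneA and geneB form a complex.",
    "geneC interacts with geneD.",
    "geneA was studied in geneC mutant cells.",
]

counts = cooccurrence_counts(sentences, gene_names)
```

Here the geneA/geneB pair is counted twice and would rank above the incidental geneA/geneC co-mention; a real pipeline would compare such counts against the frequency expected by chance.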
2.5.2. Functional semantic similarity. The Gene Ontology
(GO) [76] is a controlled vocabulary used to describe various
attributes of genes including their functions. Terms that
describe the functions are stored as nodes in a directed
graph, with specific terms sharing more general terms as
parents. For example, 'apoptotic chromosome condensation'
and 'mitotic chromosome condensation' both share the parent
term 'chromosome condensation'. Primary annotations are
derived from the literature through manual curation efforts.
There are several types of evidence associated with the
annotations, ranging from the mostly reliable manual annotations to
automated electronic annotations. Various methods have
sought to derive networks of functional associations between
proteins using their associated GO terms (see [77] for a recent
review). Problems in using the GO graph directly arise from
issues such as the variation of term specificity in the graph. A
common solution is to make use of information content-based
measures such as the Resnik score [78]. The final choice
of evidence to use, similarity measure and gene transference
method needs to be made on a case-by-case basis [77].
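As an illustration of an information content-based measure, the sketch below computes the Resnik similarity of two terms as the information content, -ln p(t), of their most informative common ancestor, where p(t) is the fraction of annotated genes carrying term t or one of its descendants. The three-level hierarchy is a toy fragment built around the example terms above, and the annotation probabilities are invented for illustration.

```python
import math

# Toy hierarchy: each term maps to its parent terms (assumed structure).
PARENTS = {
    "root": [],
    "chromosome condensation": ["root"],
    "apoptotic chromosome condensation": ["chromosome condensation"],
    "mitotic chromosome condensation": ["chromosome condensation"],
}

# Invented annotation probabilities p(t); the root covers every annotated gene.
PROBABILITY = {
    "root": 1.0,
    "chromosome condensation": 0.1,
    "apoptotic chromosome condensation": 0.02,
    "mitotic chromosome condensation": 0.05,
}

def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def information_content(term):
    """Rarer (more specific) terms carry more information."""
    return -math.log(PROBABILITY[term])

def resnik(term_a, term_b):
    """Information content of the most informative common ancestor."""
    common = ancestors(term_a) & ancestors(term_b)
    return max(information_content(t) for t in common)

similarity = resnik("apoptotic chromosome condensation",
                    "mitotic chromosome condensation")
```

The two specific terms are scored by their shared parent 'chromosome condensation', whereas any comparison whose only common ancestor is the root scores zero, which is how the measure compensates for uneven term specificity in the graph.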
3. Integrating prediction methods
Each source data set, whether experimental or not, has bias
and errors. It is, however, unlikely, given the potential number
of interactions (and provided appropriate confidence cut-offs are
used), that two independent prediction methods will give rise to
the same false positive prediction. In general, we could expect
the prediction power (accuracy and coverage) to increase
proportionally to the number of independent approaches
supporting the association. The simplest approaches exploit
this principle by using a joint observation approach for
combining prediction methods, where a greater number of
independent methods predicting the association corresponds to
a higher prediction accuracy [23]. Other tests have shown that
integrating multiple predictions using more advanced methods
can improve the prediction power [14, 79, 80] by combining
and reinforcing observations. A wide variety of integration
methods are available, including Fisher's method, Bayesian
integration, logistic regression and kernel methods. Some methods
provide confidence estimates for the outputs (e.g. Bayesian and
logistic regression) that may be useful in certain scenarios.
3.1. Simple integration
Each of the protein association prediction methods described
above yields scores which correlate with the likelihood of
functional association. However, it can be difficult to directly
combine these scores since the scores for each method can
differ both in scale and in predicted biological association
type. To help overcome this problem, output scores from
individual prediction methods need to be transformed into
confidence measures using a set of known true positives. The
Prolinks prediction pipeline [81] simply chooses the maximum
score from all the individual methods as the gene
association score. Other methods make use of a formula for
combining the scores from each method, optionally after
weighting each method by its general performance [82].
3.2. Bayesian integration
Bayesian integration is the most widely used strategy for
integrating protein association predictions [14, 80]. It has
several features that make it suitable for data integration of
this type (table 1). Each individual data channel is implicitly
weighted according to its reliability, and hence it is easy
to interpret the probability relationships for each channel.
Crucially, this method can accommodate missing data, which
typically lead to problems for supervised learning methods.
Naive Bayesian integration presumes that the different data
channels are statistically independent of one another, and
failure to remove or merge redundant data sources can lead to
over-prediction. Bayesian integrators have been used in many
successful applications [83], allowing for multiple types of data
to be integrated (including numerical and categorical). It may
not always be advantageous to add in increasing numbers of
data sources. One study has shown that choosing a small
number of the best features from those available can improve
performance, and adding in additional input data types does
not give further improvements [84].
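A minimal sketch of naive Bayesian integration is given below. It assumes that each data channel has been reduced to a binary observation with a likelihood ratio, P(observation | interacting) / P(observation | not interacting), estimated beforehand from a gold standard; the posterior odds are then the prior odds multiplied by the likelihood ratios of all observed channels. The channel names, likelihood ratios and prior are invented.

```python
def posterior_probability(evidence, likelihood_ratios, prior):
    """Combine independent evidence channels into a posterior probability.

    evidence: dict channel -> True, False or None (None = missing data).
    likelihood_ratios: dict channel -> {True: LR, False: LR}.
    prior: prior probability that a random protein pair is associated.
    """
    odds = prior / (1 - prior)
    for channel, observed in evidence.items():
        if observed is None:
            continue  # a missing channel leaves the odds unchanged
        odds *= likelihood_ratios[channel][observed]
    return odds / (1 + odds)

# Invented likelihood ratios for three hypothetical channels.
likelihood_ratios = {
    "coexpression": {True: 5.0, False: 0.5},
    "gene_fusion": {True: 20.0, False: 0.9},
    "text_mining": {True: 10.0, False: 0.8},
}
evidence = {"coexpression": True, "gene_fusion": None, "text_mining": True}

probability = posterior_probability(evidence, likelihood_ratios, prior=0.01)
```

The missing gene fusion channel is simply skipped, which is the property that lets this scheme cope with incomplete data. Multiplying the ratios is only justified if the channels are independent; two redundant channels would be counted twice and inflate the posterior.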
3.3. Fisher's method
Fisher's method is one of the general non-Bayesian methods that
has been successfully used to integrate protein interaction
predictions from diverse methods. Fisher's algorithm is a
solution (the Pareto optimal solution) to the problem of
combining independent tests [85]. The method is highly
flexible and is able to deal with the low overlap between
source data sets. Some recent studies have successfully
applied Fisher's method to protein association prediction
[86]. Fisher's method does not need trained or supervised
predictions based on experimental gold standard data sets of
protein interactions. Hence, if only genomic context methods
are used, Fisher's predictions can be considered independent
of the public repositories of protein interactions. A weighted
version of Fisher's method provides the ability to optimize
contributions from each data source.
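For reference, the standard (unweighted) form of Fisher's method combines k independent p-values through the statistic X = -2 Σ ln p_i, which follows a chi-square distribution with 2k degrees of freedom under the null hypothesis. The sketch below assumes that each prediction method supplies a p-value for a given protein pair, and evaluates the chi-square tail probability with the closed form available for even degrees of freedom. The function name is hypothetical and the weighted variant mentioned above is not covered.

```python
import math

def fisher_combined_pvalue(pvalues):
    """Combine independent p-values (each in (0, 1]) with Fisher's method."""
    k = len(pvalues)
    half_statistic = -sum(math.log(p) for p in pvalues)  # X / 2
    # Chi-square survival function for 2k degrees of freedom:
    # exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_statistic / i
        total += term
    return math.exp(-half_statistic) * total

# Two methods each give weak support (p = 0.05) for the same protein pair.
combined = fisher_combined_pvalue([0.05, 0.05])
```

Two individually marginal p-values of 0.05 combine to roughly 0.017. Only the methods that actually scored a pair need to be passed in, which is one way to read the statement that the approach tolerates low overlap between source data sets.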
Table 1. Methods commonly used for integrating multiple protein association prediction methods. A 1 or 0 denotes the presence or absence of a particular desirable property. Column headings give the integration method with example references.
Advantageous property | Naive Bayes [80, 90] | Fisher's [85, 91] | SVM [92, 93] | Graph kernel + SVM [46, 88] | Random forest [94, 95]
Copes well with missing values | 1 | 1 | 0 | 0 | 1
Importance of input features can be readily obtained | 1 | 0 | 0 | 0 | 1
Copes well with high-dimensional data | 0 | 0 | 1 | 1 | 1
Complex relationships between input variables can be learned | 0 | 0 | 1 | 1 | 1
Probability estimate readily obtained from output | 1 | 0 | 0 | 0 | 0
No parameter optimization required | 0 | 1 | 0 | 0 | 0
No requirement for independence between input data | 0 | 0 | 1 | 1 | 1
No training data required | 0 | 1 | 0 | 0 | 0
3.4. Kernel methods
As these methods have gained in popularity in recent years, we
give here a brief summary of kernel properties that make them
attractive for data integration. For more details, we refer the
interested reader to [87]. By definition, a kernel is a function
that gives the dot product between two vectors in some
multi-dimensional space (called feature space). A kernel matrix
(often abbreviated as kernel) contains the evaluation of the
kernel function for all pairs of data points under consideration.
A kernel can be viewed as a matrix of similarities between
data points, and different kernels capture different notions of
similarity as they correspond to embedding the data in different
feature spaces.
The first property of interest is that any symmetric matrix
with non-negative eigenvalues is a valid kernel matrix. This
means we can test whether a similarity matrix is a valid kernel
without knowing the feature space in which the kernel function
operates. This makes kernel methods applicable not only to
real-valued vectors, but to any data (e.g. sequences, graphs) for
which we can define a similarity measure. The second property
of interest for data integration is that various mathematical
combinations of kernels (e.g. linear combination with
non-negative weights) produce a
valid kernel. So far, most data integration approaches using
kernels for predicting protein interactions have been used
in a classification framework with support vector machines
[46, 88]. This leads to the requirement of a negative
data set which can be problematic to generate (section 5.1).
Alternatively, kernels can be used for link prediction in a semi-
supervised setting [89].
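The second property can be made concrete with a small sketch. Two linear kernels are computed from two hypothetical feature sets describing the same three proteins and combined with non-negative weights. The combined matrix is then shown to equal the Gram matrix of an explicit feature space (the rescaled, concatenated features), which is why the combination is itself a valid kernel. The feature values, weights and function names are illustrative assumptions.

```python
import math

def gram(vectors):
    """Linear kernel matrix: dot products between all pairs of vectors."""
    return [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]

def combine(kernels, weights):
    """Weighted sum of kernel matrices; weights must be non-negative."""
    if any(w < 0 for w in weights):
        raise ValueError("negative weights can break positive semi-definiteness")
    n = len(kernels[0])
    return [[sum(w * k[i][j] for k, w in zip(kernels, weights))
             for j in range(n)] for i in range(n)]

# Hypothetical features for three proteins from two data sources.
expression = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
domains = [(1.0,), (1.0,), (0.0,)]
weights = (0.5, 2.0)

combined = combine([gram(expression), gram(domains)], weights)

# Equivalent explicit embedding: scale each source by sqrt(weight), concatenate.
embedded = [
    tuple(math.sqrt(weights[0]) * x for x in e)
    + tuple(math.sqrt(weights[1]) * x for x in d)
    for e, d in zip(expression, domains)
]
explicit = gram(embedded)
```

In practice the weights would be tuned or learned, and the individual kernels need not be linear; the point of the sketch is only that heterogeneous data sources can be merged at the level of their similarity matrices.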
3.5. Random forest classifiers
Decision trees are supervised classification algorithms that use
tree-like graphs to make predictions.
In its simplest form, a decision tree makes multiple
binary tests in a tree structure such that a given input vector
of attributes is propagated through the tree using the internal
nodes to test an attribute's value and terminal nodes to give a
classification. Random forests are ensemble classifiers made
up of many individual decision trees [96]. Random forests
provide an efficient means of increasing performance, and are
less prone to over-fitting than individual decision trees. Each
individual decision tree is generated by selecting a random
subset of the training data with replacement. The final output
of the RF classifier is from the majority vote of its individual
decision trees. The random forest classifier has been shown
to be consistently amongst the best methods on a wide range
of protein association tasks including protein interaction and
co-complex membership [92, 94, 97]. Unlike with most supervised
learning methods, with random forests it is possible to obtain
a measure of the importance of each input channel to the overall
performance [97]. They can also deal with large and sparse
input vectors, as seen in their use in predicting interactions
from protein domain content [42]. Random forests have also
been used to integrate structural information with more typical
protein association data types [56].
3.6. Logistic regression
Logistic regression has been used to integrate data to provide
output predictions [92, 97–99] but has been shown to
be outperformed by random forests [92, 97]. Another
approach found good performance for logistic regression after
subdividing the input data into different natural groupings [99].
3.7. Random walks on a graph
Some authors have used an approach using matrices derived
from random walks on graphs to prioritize genes. A random
walk on a graph describes the sequence of steps taken by
a walker who moves from one node to a randomly selected
adjacent node with a probability proportional to the weight
associated with the edge connecting the two nodes. A random
walk is a type of Markov chain from which different measures
of similarity between nodes of the graph can be computed. If
the Markov chain is regular it gives rise to a valid kernel [100].
Random walks have been used in two ways for data integration.
In the first, each data set is considered separately as a graph
from which a random walk-based similarity is derived and
used to rank the genes. A rank aggregation method is then
used for the data integration step. Although this approach has
mainly been used to predict disease genes [101, 102], it
could be applicable to protein–protein interaction prediction
or at least to the prediction of functional relationships. The
second integration approach consists of merging the different
source data sets into one graph from which a random walk-based
measure of similarity is derived and used for ranking
genes. Again, this has been used to identify disease genes
[103] but could be used for other types of predictions.
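To make the idea of a random walk-based ranking concrete, the sketch below uses a random walk with restart, one common choice of such a measure; the text above does not specify which measure the cited studies use, so this is an illustrative assumption. At each step the walker either returns to a seed node with probability `restart` or moves to a neighbor with probability proportional to the edge weight. The graph, parameter values and function name are hypothetical, and every node is assumed to have at least one edge.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.3, iterations=100):
    """Approximate the stationary visiting probabilities by power iteration.

    adjacency: dict node -> {neighbor: edge weight} (undirected, symmetric).
    seeds: set of nodes from which the walker (re)starts.
    """
    nodes = sorted(adjacency)
    strength = {u: sum(adjacency[u].values()) for u in nodes}
    start = {u: (1.0 / len(seeds) if u in seeds else 0.0) for u in nodes}
    prob = dict(start)
    for _ in range(iterations):
        nxt = {u: restart * start[u] for u in nodes}
        for u in nodes:
            for v, weight in adjacency[u].items():
                # Move from u to v with probability proportional to edge weight.
                nxt[v] += (1 - restart) * prob[u] * weight / strength[u]
        prob = nxt
    return prob

# Hypothetical path-shaped network A - B - C - D with unit edge weights.
graph = {
    "A": {"B": 1.0},
    "B": {"A": 1.0, "C": 1.0},
    "C": {"B": 1.0, "D": 1.0},
    "D": {"C": 1.0},
}
scores = random_walk_with_restart(graph, seeds={"A"})
ranking = sorted(scores, key=scores.get, reverse=True)
```

Nodes are ranked by how often the walker visits them, so proteins close to the seed in the network score highest. For the first integration strategy described above, one such ranking would be computed per data set and the rankings aggregated; for the second, the walk would be run once on the merged graph.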
Figure 2. Examples of networks generated before (left; green balls represent baits in the Mitocheck experiments [67]) and after (right; colors represent different complexes/clusters) spectral clustering.
4. Exploiting the network structure
The prediction methods described above produce pairwise
protein interaction data sets that can be used to construct
protein–protein interaction graphs (also called protein
networks in systems biology), which are natural data structures
to model relationships between proteins. Several methods
use the experimentally determined protein interaction network
graph structure itself as the primary data source to infer
complex membership. The underlying assumption for these
methods is that proteins in a complex are more densely
connected to each other than to the rest of the graph. Over
the years various clustering algorithms have been applied
to this problem (for a review see [104]). Most of these
are heuristics and come with free parameters that are sometimes
hard to tune. The Markov clustering algorithm (MCL) appears
to be one of the best methods currently available for clustering
protein interaction graphs [105, 106]. MCL simulates a
random walk on the graph and iteratively prunes the weaker
edges. An exact analysis of a random walk on a graph
leads to spectral clustering algorithms which have recently
been applied to the protein complex prediction problem
(figure 2) [67, 107, 108], although an early application of
spectral clustering for complex detection was described in
matrices derived from the interaction graph had also been
used to find complexes [110, 111]. Although the nature of the
structure and properties of an entire proteome graph remains
controversial [112], topological properties have been used to
guide protein interaction predictions [113].
5. Benchmarking
5.1. Gold standards
In order to validate the prediction methods described above, agold standard reference set is required. Known 3D structures
formally provide direct evidence of physical interaction
although care needs to be taken to determine the biological
unit and ignore irrelevant crystal contacts. The resource
for this was initially the protein quaternary structure (PQS)
resource [114] although this has been replaced by PISA [60].
For evaluating physical interactions in yeast, the most widely
used gold standard was initially the curated MIPS
protein interaction data set [115]. However, this
data set was later shown to be highly unrepresentative, with
over half of its interactions coming from ribosomal proteins
[116]. More recently, up-to-date gold standard databases have
been generated [5, 117]. In addition, despite misgivings about
the quality of yeast two-hybrid (Y2H) data sets, work has shown
that commonly used Y2H data sets are of similar quality to other
experimental interaction data and even to curated data sets [7]. Such Y2H
data sets can be processed further to give higher quality data
sets potentially suitable for benchmarking [118]. Certain
integration methods such as SVMs additionally require a
negative gold standard data set (i.e. a set of protein pairs known
not to interact). A common approach for generating negative
data sets is to select random pairs of proteins from the genome.
However, this is not an optimal solution and can lead to various
problems, such as the prediction method learning the pattern of
missing values, which causes over-prediction of associations [95].
Also, unless care is taken, the negative data set network can
have a different structure to the positive data set, leading
to overestimates of performance for certain algorithms [47].
Recently, carefully curated true negative data sets have been
assembled from the literature [119]. They may help with the
over-prediction problem in the future, although they are likely
to contain biases (e.g. toward well-studied proteins). Other
tools are available, providing negatives based on functional
dissimilarity, subcellular location, non-interacting domain
pairs [120, 121] and shortest path lengths [122].
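The random-pair approach to building a negative set can be sketched as follows. This is a minimal illustration, assuming that proteins are plain identifiers and that known positives are supplied as unordered pairs; the function name and the toy identifiers are hypothetical. It inherits all of the caveats above, since a randomly drawn pair is only presumed, not known, to be non-interacting.

```python
import random

def sample_random_negatives(proteins, positives, n, seed=0):
    """Draw n distinct protein pairs at random, skipping known positives."""
    rng = random.Random(seed)
    proteins = sorted(set(proteins))
    known = {frozenset(pair) for pair in positives}
    in_scope = {k for k in known if len(k) == 2 and k <= set(proteins)}
    available = len(proteins) * (len(proteins) - 1) // 2 - len(in_scope)
    if n > available:
        raise ValueError("not enough non-positive pairs to sample from")
    negatives = set()
    while len(negatives) < n:
        pair = frozenset(rng.sample(proteins, 2))   # two distinct proteins
        if pair not in known:
            negatives.add(pair)
    return sorted(tuple(sorted(pair)) for pair in negatives)

# Toy example with hypothetical identifiers.
proteins = ["A", "B", "C", "D", "E"]
positives = [("A", "B"), ("B", "C")]
print(sample_random_negatives(proteins, positives, n=5))
```

Matching the degree distribution of the positive network, or restricting negatives by subcellular location, would require additional constraints in the sampling loop.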
With regard to validation of functional associations, many
different resources have been used, including KEGG [123], GO
[76] and Panther [124]. KEGG annotation can be considered
a high-quality resource, with 1500 genomes annotated.
KEGG marks some organisms, such as human, as manually curated
and others as automatically annotated from the
curated genomes. STRING [14] benchmarks its predictors
against KEGG to provide an interpretable output score. One
advantage of using GO is that it is possible to make use of
the ontology to define semantic similarities between proteins;
thus, all pairs of proteins within a certain similarity threshold
can be considered within the benchmark.
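As a rough illustration of a similarity-thresholded benchmark, the sketch below scores two proteins by the Jaccard overlap of their ancestor-closed annotation sets. The Jaccard score is a deliberately simple stand-in for published semantic similarity measures, and the ontology, term names, protein names and threshold are all invented for the example.

```python
from itertools import combinations

def ancestors(term, parents):
    """Return the term together with all of its ancestors in the ontology DAG."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen

def semantic_similarity(terms_a, terms_b, parents):
    """Jaccard overlap of the ancestor-closed term sets of two proteins."""
    closed_a = set().union(*[ancestors(t, parents) for t in terms_a])
    closed_b = set().union(*[ancestors(t, parents) for t in terms_b])
    union = closed_a | closed_b
    return len(closed_a & closed_b) / len(union) if union else 0.0

def benchmark_pairs(annotations, parents, threshold):
    """All protein pairs whose similarity reaches the threshold."""
    return [(p, q) for p, q in combinations(sorted(annotations), 2)
            if semantic_similarity(annotations[p], annotations[q], parents) >= threshold]

# Hypothetical toy ontology: child term -> parent terms.
parents = {"A": ["root"], "B": ["root"], "A1": ["A"], "A2": ["A"], "B1": ["B"]}
annotations = {"p1": {"A1"}, "p2": {"A2"}, "p3": {"B1"}}
print(benchmark_pairs(annotations, parents, threshold=0.4))  # [('p1', 'p2')]
```

Here p1 and p2 share two of four terms (similarity 0.5), while either of them shares only the root with p3 (similarity 0.2), so only the first pair enters the benchmark.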
5.2. Data set bias
Many of the prediction methods are faced with the problem of
bias in the available data. For example, supervised methods
are hampered by the lack of true negative data sets. More
subtly, biological research is mostly focused on disease-
related and well-characterized genes. As a consequence, a
small number of genes and their products contribute a lot of
(possibly irrelevant [125, 126]) information, while for most of
the genome little is available. Genome-wide experiments (for
example [127]) should help alleviate this problem. Several
large-scale Y2H data sets are available (for example [128]),
although these are not devoid of experimental biases of their
own. For example, classic Y2H requires the proteins involved
in the interaction to be translocated to the nucleus and does
not perform well in all cases, for example with membrane-associated
proteins and transient interactions [129].
5.3. The importance of independent benchmarks
A major problem in benchmarking protein association
prediction methods is the presence of circularity between the
data used as source input to the methods and the testing set.
This circularity can be quite subtle and papers do not always
take sufficient care to eliminate this issue. For example,
once knowledge enters one realm (e.g. protein interaction
databases) it can be quickly integrated into a secondary data
set (e.g. Reactome). Even the genomic context methods and
microarray data sets are now partly incorporated into the GO.
This problem goes further than affecting the benchmarking
since the lack of an independent test set precludes the
ability to accurately optimize prediction methods, leading
to over-fitting. Although it is possible to improve
benchmarking independence through careful filtering of data
sets, the only safe option is experimental validation of the
predictions. However, this is expensive and often allows only
a small number of targets to be validated, with low statistical
significance (for example [80]). An alternative is to implement
a rollback benchmark, where source-training data sets are rolled
back to a given date and the test data are from after this date.
In practice, this approach suffers from social bias, in that
biologists are not testing the predictions but interactions with
well-characterized, disease-related genes. Also, circularity
is still not completely removed by a rollback benchmark,
since today's text mining associations and interologs are a
source of tomorrow's curated database entries and protein
interaction experiments, respectively. In the future, a
CASP-style benchmark would be a good first step in providing
real performance measures for the many prediction methods
available.
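A rollback split of this kind can be sketched in a few lines. The example assumes that each interaction record carries the date on which it was first reported; the record layout, function name and dates are hypothetical. Pairs already known before the cutoff are removed from the test set, so that later re-reports do not leak into the evaluation.

```python
from datetime import date

def rollback_split(records, cutoff):
    """Split (protein_a, protein_b, first_reported) records at a cutoff date."""
    train = {frozenset((a, b)) for a, b, reported in records if reported <= cutoff}
    later = {frozenset((a, b)) for a, b, reported in records if reported > cutoff}
    test = later - train          # drop pairs that were already known at the cutoff
    return train, test

# Hypothetical records: (protein, protein, date first reported).
records = [
    ("A", "B", date(2006, 5, 1)),
    ("B", "C", date(2008, 3, 9)),
    ("B", "A", date(2010, 1, 15)),   # re-report of a pair known before the cutoff
    ("C", "D", date(2010, 7, 2)),
]
train, test = rollback_split(records, cutoff=date(2009, 1, 1))
print(len(train), len(test))  # 2 1
```

Such a split addresses only the timing of database entries; it does not remove the social bias or the indirect circularity discussed above.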
5.4. Real world performance measure
The expected number of interactions found in an organism
[130] is much smaller than the total number of possible
interactions, so true positives (TPs) are found very
infrequently relative to false positives (FPs). As an example,
suppose that for an organism TPs constitute only 0.1% of all
possible protein pairs. A predictor with a reported 1% false
discovery rate on a balanced test set of TPs and true negatives
(TNs) would then still produce about ten false predictions for
every TP in its real world application. The imbalance of TPs to TNs
should be considered an important factor when considering the
usefulness of a prediction and the size and type of the validation
screen required to get a useful number of TP experimental
validations.
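The arithmetic behind this example can be made explicit. The sketch below assumes that "balanced" means equal numbers of positives and negatives in the test set, and that the predictor's sensitivity and false positive rate carry over unchanged to the real application; the function names are illustrative.

```python
def false_predictions_per_true_positive(prevalence, balanced_fdr, sensitivity=1.0):
    """Expected number of false predictions per true positive in real use."""
    # On a balanced test set, FDR = FPR / (FPR + sensitivity), hence:
    false_positive_rate = balanced_fdr * sensitivity / (1.0 - balanced_fdr)
    true_positives = prevalence * sensitivity
    false_positives = (1.0 - prevalence) * false_positive_rate
    return false_positives / true_positives

def real_world_fdr(prevalence, balanced_fdr, sensitivity=1.0):
    """False discovery rate once the true class imbalance is taken into account."""
    ratio = false_predictions_per_true_positive(prevalence, balanced_fdr, sensitivity)
    return ratio / (1.0 + ratio)

ratio = false_predictions_per_true_positive(prevalence=0.001, balanced_fdr=0.01)
print(round(ratio, 2))                           # 10.09
print(round(real_world_fdr(0.001, 0.01), 2))     # 0.91
```

Under these assumptions the ratio does not depend on the sensitivity, because it scales true and false positives equally; the 1% error rate measured on the balanced set becomes a false discovery rate of roughly 91% in application.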
6. Existing resources
6.1. Online resources
A quick survey of resources hosting interaction data
and predicted interaction data is quite daunting (e.g.
http://ppi.fli-leibniz.de/jcb_ppi_databases.html). The most
widely used of these is STRING which combines information
from multiple sources and includes predictions from genomic
context (gene neighborhood, domain fusion, phylogenetic
profiles), high-throughput experiments (co-expression) and
previous knowledge (text mining, known protein interactions).
The majority of the associations in STRING come from its text
mining and inherited interactions [14]. STRING v8.3 provides
information for 2.5 million sequences in 630 organisms with
regular updates. Another regularly updated resource with
an easy-to-use interface is GeneMania, which contains
both known and predicted protein associations. An alternative
integration strategy is used by the online resource FuncNet
(http://funcnet.eu/), which uses the weighted Fisher's approach
and integrates, online, eight independent prediction methods
hosted at different geographical locations throughout Europe.
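For orientation, the classical unweighted form of Fisher's method for combining independent p-values is sketched below. The weighting scheme used by FuncNet is not described in this section, so this is the textbook version and not FuncNet's implementation; the function name is illustrative. The statistic X = -2 Σ ln p_i follows a chi-square distribution with 2k degrees of freedom under the null hypothesis, and for even degrees of freedom its tail probability has a closed form.

```python
import math

def fisher_combined_pvalue(pvalues):
    """Combine independent p-values with Fisher's (unweighted) method."""
    k = len(pvalues)
    if k == 0 or any(not 0.0 < p <= 1.0 for p in pvalues):
        raise ValueError("need at least one p-value, each in (0, 1]")
    half_stat = -sum(math.log(p) for p in pvalues)   # X / 2, with X = -2 * sum(ln p)
    # Chi-square tail with 2k degrees of freedom:
    #   P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total

# Two predictors that are each only marginally significant on their own.
print(round(fisher_combined_pvalue([0.05, 0.05]), 4))  # 0.0175
```

The method assumes that the combined predictors are independent, which is one reason why the independence of the eight integrated methods matters.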
Many prediction methods exist that have been shown to be
powerful enough for experimentalists to use as part of their
standard experimental screens (e.g. table 2). Despite this,
even for well-studied organisms such as human, large
portions of the interactome are missing. As an example of the
utility of the integration methods above, we have constructed
a network using only those genes with no known physical
interactions (after merging eight public databases). Even
with these very poorly characterized Ensembl genes we were
able to construct substantial networks (figure 3). Extreme
examples such as this suggest that much could be gained
from experimentalists sampling more of the genome using
established prediction methods as a guide.
Figure 3. Example networks predicted from FuncNet CODA, FuncNet
Hippo and STRING (score filtered at 500) with the database channel
removed. The network has been filtered to remove any genes with a
known database physical interaction from one of IntAct, MINT, MIPS,
STRING, BIOGRID, DIP, HPRD and Reactome. Example subnetworks
predominantly made up of phylogenetic profile, CODA or text
mining associations (from left to right) are shown.

Table 2. Example online protein association prediction resources.
Online resource  URL/reference                                                        Comments
IBIS             http://www.ncbi.nlm.nih.gov/Structure/ibis/ [59]                     Predicts interactions and binding residues
FuncNet          http://funcnet.eu/ [86]                                              Integrates eight data sources using Fisher's method
PPI E.Coli       http://sunserver.cdfd.org.in:8080/protease/PPI/ [93]                 Example of an SVM-based integration resource
BCI              http://amdec-bioinfo.cu-genome.org/html/BCellInteractome.html [131]  Cell-type specific predictions
I2D              http://ophid.utoronto.ca/ophidv2.201/index.jsp [53]                  Interologs for expanding protein interaction networks
GeneMania        http://genemania.org/search.jsf [132]                                High coverage of available known associations
STRING           http://string-db.org/ [14]                                           Largest number of genomes covered

6.2. Context specific resources
Experiments are most usually designed to focus on a specific
pathway or biological process. Resources such as STRING
provide the union of many available protein interactions.
However, for reasons such as differential expression, any
given cell will only express a subset of all protein interactions
found in an organism. In view of this, certain resources
have been developed that apply contextual information to give
interactomes specific for a cell type. One example of such a
resource is the B-cell interactome [133], which predicts B-cell
specific protein associations. Tailoring of the data in the
B-cell interactome, to help ensure B-cell specific interactions, is
achieved by filtering to only include those proteins expressed
in B-cells, and by including B-cell relevant microarray data sets
as inputs to the Bayesian integrator. These B-cell specific
networks have been used to extend our knowledge of B-cell
biology [131]. Other resources available (POINTILLIST [85])
can be readily tailored with data sources specific to the system
of interest [91].
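The expression-based filtering step can be illustrated with a short sketch. It assumes that the organism-wide network is an edge list of scored protein pairs and that the set of proteins expressed in the cell type of interest is already known; the function name, protein identifiers and scores are hypothetical, and this is not the B-cell interactome pipeline itself.

```python
def context_specific_network(edges, expressed):
    """Keep only associations whose two proteins are both expressed in the context."""
    expressed = set(expressed)
    return [(a, b, score) for a, b, score in edges
            if a in expressed and b in expressed]

# Hypothetical organism-wide associations and an expression set for one cell type.
edges = [("P1", "P2", 0.9), ("P2", "P3", 0.7), ("P3", "P4", 0.8)]
expressed_in_cell_type = {"P1", "P2", "P3"}
print(context_specific_network(edges, expressed_in_cell_type))
# [('P1', 'P2', 0.9), ('P2', 'P3', 0.7)]
```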
6.3. PPI prediction pipelines
Many large-scale experimental projects have been carried out
[128, 134–138]. Such projects are costly and time consuming,
and a strategy for effective protein pair prioritization is
desirable. A recent study [45] trialing various approaches
for this task showed that a protein interaction prediction
method using a naïve Bayes integration of several of the
methods described in this section (expression data, GO,
interologs, domain interactions) gave the largest improvement
in efficiency. Even though this method had a high false
discovery rate (92%), there were still large reductions in cost
(>50 fold at 50% coverage) in comparison to not using the
predicted protein interactions.
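The idea of a naïve Bayes integration used for prioritization can be sketched as follows. Under the naïve assumption that evidence types are conditionally independent, the posterior odds of interaction are the prior odds multiplied by the likelihood ratio of each evidence type. The prior, the likelihood ratios and the pair names below are invented for illustration and are not values from [45].

```python
import math

def naive_bayes_posterior(prior, likelihood_ratios):
    """Posterior probability of interaction from independent likelihood ratios."""
    log_odds = math.log(prior / (1.0 - prior))
    log_odds += sum(math.log(lr) for lr in likelihood_ratios)
    odds = math.exp(log_odds)
    return odds / (1.0 + odds)

def prioritize(evidence, prior):
    """Rank candidate pairs for testing, most probable interaction first."""
    return sorted(evidence,
                  key=lambda pair: naive_bayes_posterior(prior, evidence[pair]),
                  reverse=True)

# Hypothetical likelihood ratios per pair (e.g. co-expression, interolog, domain pair).
evidence = {
    ("P1", "P2"): [10.0, 5.0],
    ("P3", "P4"): [2.0],
    ("P5", "P6"): [0.5, 4.0, 3.0],
}
print(prioritize(evidence, prior=0.001))
print(round(naive_bayes_posterior(0.001, [10.0, 5.0]), 4))  # 0.0477
```

Even the top-ranked pair here has a posterior below 5%, which is consistent with the point made above: a ranking can greatly reduce screening cost while most individual predictions remain false.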
7. Conclusion
The genomic context methods provide a fascinating field of
study, at the juncture of evolutionary theory and modern
computational biology. Despite the relatively short time these
methods have been available, they have proven to be very
useful in guiding experiments, and there is considerable potential for
greater uptake of these methods by experimentalists. Over
the coming years we can expect to see improvements in
the prediction methods, particularly the genomic context methods,
which will benefit from targeted genome sequencing efforts
such as the GEBA project [139]. Such projects are expected
to provide improved sampling, fill in major phylogenetic
gaps and provide wider evolutionary distances. There is a
growing list of examples in the literature where prediction
methods have been used successfully when combined by statistical
integration methods. An example is the application of the FuncNet
protocol to human mitotic spindle proteins in the ENFIN [140]
network for systems biology, which combined prediction data
using Fisher integration and showed an increase in prediction
accuracy from 35% to 76%. Given the many prediction
methods available, it is likely that greater coordination between
computational groups will lead to reduced redundancy,
improved resources and ultimately greater usage of protein
interaction predictions by experimentalists.
Acknowledgments
This work was funded in part by the European Commission
via the Sixth Framework Program Network of Excellence
ENFIN (contract number LSHG-CT-2005-518254). JGL and
JKH acknowledge funding from ENFIN. JAR acknowledges
funding from SAF2009-09839 and the Ramon y Cajal program
(RYC-2007-01649; Ministerio de Ciencia e Innovacion,
Spain). CIBERER is an initiative of the ISCIII.
References
[1] Kerrien S et al 2007 IntAct – open source resource for molecular interaction data Nucleic Acids Res. 35 D561–5
[2] Chatr-aryamontri A, Ceol A, Palazzi L M, Nardelli G, Schneider M V, Castagnoli L and Cesareni G 2007 MINT: the Molecular INTeraction database Nucleic Acids Res. 35 D572–4
[3] Xenarios I, Rice D W, Salwinski L, Baron M K, Marcotte E M and Eisenberg D 2000 DIP: the database of interacting proteins Nucleic Acids Res. 28 289–91
[4] Keshava Prasad T S et al 2009 Human protein reference database – 2009 update Nucleic Acids Res. 37 D767–72
[5] Ruepp A et al 2008 CORUM: the comprehensive resource of mammalian protein complexes Nucleic Acids Res. 36 D646–50
[6] Stumpf M P, Thorne T, de Silva E, Stewart R, An H J, Lappe M and Wiuf C 2008 Estimating the size of the human interactome Proc. Natl Acad. Sci. USA 105 6959–64
[7] Venkatesan K et al 2009 An empirical framework for binary interactome mapping Nat. Methods 6 83–90
[8] Suthram S, Sittler T and Ideker T 2005 The Plasmodium protein network diverges from those of other eukaryotes Nature 438 108–12
[9] Pellegrini M, Marcotte E M, Thompson M J, Eisenberg D and Yeates T O 1999 Assigning protein functions by comparative genome analysis: protein phylogenetic profiles Proc. Natl Acad. Sci. USA 96 4285–8
[10] Bowers P M, Cokus S J, Eisenberg D and Yeates T O 2004 Use of logic relationships to decipher protein network organization Science 306 2246–9
[11] Ranea J A, Yeats C, Grant A and Orengo C A 2007 Predicting protein function with hierarchical phylogenetic profiles: the Gene3D phylo-tuner method applied to eukaryotic genomes PLoS Comput. Biol. 3 e237
[12] Barker D and Pagel M 2005 Predicting functional gene links from phylogenetic-statistical analyses of whole genomes PLoS Comput. Biol. 1 e3
[13] Zhou Y, Wang R, Li L, Xia X and Sun Z 2006 Inferring functional linkages between proteins from evolutionary scenarios J. Mol. Biol. 359 1150–9
[14] Jensen L J et al 2009 STRING 8 – a global view on proteins and their functional interactions in 630 organisms Nucleic Acids Res. 37 D412–6
[15] Luttgen H et al 2000 Biosynthesis of terpenoids: YchB protein of Escherichia coli phosphorylates the 2-hydroxy group of 4-diphosphocytidyl-2 C-methyl-D-erythritol Proc. Natl Acad. Sci. USA 97 1062–7
[16] Carlson B A, Xu X M, Kryukov G V, Rao M, Berry M J, Gladyshev V N and Hatfield D L 2004 Identification and characterization of phosphoseryl-tRNA[Ser]Sec kinase Proc. Natl Acad. Sci. USA 101 12848–53
[17] Forterre P 2002 A hot story from comparative genomics: reverse gyrase is the only hyperthermophile-specific protein Trends Genet. 18 236–7
[18] Morett E, Korbel J O, Rajan E, Saab-Rincon G, Olvera L, Olvera M, Schmidt S, Snel B and Bork P 2003 Systematic discovery of analogous enzymes in thiamin biosynthesis Nat. Biotechnol. 21 790–5
[19] Marcotte E M, Pellegrini M, Ng H L, Rice D W, Yeates T O and Eisenberg D 1999 Detecting protein function and protein–protein interactions from genome sequences Science 285 751–3
[20] Zhang Z et al 2006 Genome-wide analysis of mammalian DNA segment fusion/fission J. Theor. Biol. 240 200–8
[21] Reid A J, Ranea J A, Clegg A B and Orengo C A 2010 CODA: accurate detection of functional associations between proteins in eukaryotic genomes using domain fusion PLoS ONE 5 e10908
[22] Gaballa A, Newton G L, Antelmann H, Parsonage D, Upton H, Rawat M, Claiborne A, Fahey R C and Helmann J D 2010 Biosynthesis and functions of bacillithiol, a major low-molecular-weight thiol in Bacilli Proc. Natl Acad. Sci. USA 107 6482–6
[23] Strong M, Mallick P, Pellegrini M, Thompson M J and Eisenberg D 2003 Inference of protein function and protein linkages in Mycobacterium tuberculosis based on prokaryotic genome organization: a combined computational approach Genome Biol. 4 R59
[24] Ferrer L, Dale J M and Karp P D 2010 A systematic study of genome context methods: calibration, normalization and combination BMC Bioinformatics 11 493
[25] Itoh T, Takemoto K, Mori H and Gojobori T 1999 Evolutionary instability of operon structures disclosed by sequence comparisons of complete microbial genomes Mol. Biol. Evol. 16 332–46
[26] Koonin E V, Wolf Y I and Aravind L 2001 Prediction of the archaeal exosome and its connections with the proteasome and the translation and transcription machineries by a comparative-genomic approach Genome Res. 11 240–52
[27] Evguenieva-Hackenberg E, Walter P, Hochleitner E, Lottspeich F and Klug G 2003 An exosome-like complex in Sulfolobus solfataricus EMBO Rep. 4 889–93
[28] Tuncbag N, Gursoy A, Guney E, Nussinov R and Keskin O 2008 Architectures and functional coverage of protein–protein interfaces J. Mol. Biol. 381 785–802
[29] Tuncbag N, Kar G, Keskin O, Gursoy A and Nussinov R 2009 A survey of available tools and web servers for analysis of protein–protein interactions and interfaces Brief Bioinform 10 217–32
[30] Pazos F, Helmer-Citterich M, Ausiello G and Valencia A 1997 Correlated mutations contain information about protein–protein interaction J. Mol. Biol. 271 511–23
[31] Juan D, Pazos F and Valencia A 2008 Co-evolution and co-adaptation in protein networks FEBS Lett. 582 1225–30
[32] Kann M G, Shoemaker B A, Panchenko A R and Przytycka T M 2009 Correlated evolution of interacting proteins: looking behind the mirrortree J. Mol. Biol. 385 91–8
[33] Pazos F and Valencia A 2001 Similarity of phylogenetic trees as indicator of protein–protein interaction Protein Eng. 14 609–14
[34] Pazos F, Ranea J A, Juan D and Sternberg M J 2005 Assessing protein co-evolution in the context of the tree of life assists in the prediction of the interactome J. Mol. Biol. 352 1002–15
[35] Izarzugaza J M, Juan D, Pons C, Ranea J A, Valencia A and Pazos F 2006 TSEMA: interactive prediction of protein pairings between interacting families Nucleic Acids Res. 34 W315–9
[36] Itzhaki Z, Akiva E, Altuvia Y and Margalit H 2006 Evolutionary conservation of domain–domain interactions Genome Biol. 7 R125
[37] Finn R D et al 2008 The Pfam protein families database Nucleic Acids Res. 36 D281–8
[38] Stein A, Panjkovich A and Aloy P 2009 3did Update: domain–domain and peptide-mediated interactions of known 3D structure Nucleic Acids Res. 37 D300–4
[39] Eddy S R 2009 A new generation of homology search tools based on probabilistic inference Genome Inform 23 205–11
[40] Lees J, Yeats C, Redfern O, Clegg A and Orengo C 2010 Gene3D: merging structure and function for a thousand genomes Nucleic Acids Res. 38 D296–300
[41] Kim W K, Park J and Suh J K 2002 Large scale statistical prediction of protein–protein interaction by potentially interacting domain (PID) pair Genome Inform 13 42–50
[42] Chen X W and Liu M 2005 Prediction of protein–protein interactions using random decision forest framework Bioinformatics 21 4394–400
[43] Luo Q, Pagel P, Vilne B and Frishman D 2011 DIMA 3.0: domain interaction map Nucleic Acids Res. 39 D724–9
[44] Bjorkholm P and Sonnhammer E L 2009 Comparative analysis and unification of domain–domain interaction networks Bioinformatics 25 3020–5
[45] Schwartz A S, Yu J, Gardenour K R, Finley R L Jr and Ideker T 2009 Cost-effective strategies for completing the interactome Nat. Methods 6 55–61
[46] Ben-Hur A and Noble W S 2005 Kernel methods for predicting protein–protein interactions Bioinformatics 21 (Suppl. 1) i38–46
[47] Yu J, Guo M, Needham C J, Huang Y, Cai L and Westhead D R 2010 Simple sequence-based kernels do not predict protein–protein interactions Bioinformatics 26 2610–4
[48] Matthews L R, Vaglio P, Reboul J, Ge H, Davis B P, Garrels J, Vincent S and Vidal M 2001 Identification of potential interaction networks using sequence-based searches for conserved protein–protein interactions or interologs Genome Res. 11 2120–6
[49] Mika S and Rost B 2006 Protein–protein interactions more conserved within species than across species PLoS Comput. Biol. 2 e79
[50] Persico M, Ceol A, Gavrila C, Hoffmann R, Florio A and Cesareni G 2005 HomoMINT: an inferred human network based on orthology mapping of protein interactions discovered in model organisms BMC Bioinformatics 6 (Suppl. 4) S21
[51] Kemmer D et al 2005 Ulysses – an application for the projection of molecular interactions across species Genome Biol. 6 R106
[52] Huang T W, Lin C Y and Kao C Y 2007 Reconstruction of human protein interolog network using evolutionary conserved network BMC Bioinformatics 8 152
[53] Brown K R and Jurisica I 2007 Unequal evolutionary conservation of human protein interactions in interologous networks Genome Biol. 8 R95
[54] Ezkurdia I, Bartoli L, Fariselli P, Casadio R, Valencia A and Tress M L 2009 Progress and challenges in predicting protein–protein interaction sites Brief Bioinform 10 233–46
[55] Aloy P and Russell R B 2003 InterPreTS: protein interaction prediction through tertiary structure Bioinformatics 19 161–2
[56] Singh R, Park D, Xu J, Hosur R and Berger B 2010 Struct2Net: a web service to predict protein–protein interactions using a structure-based approach Nucleic Acids Res. 38 (Suppl.) W508–15
[57] Hosur R, Xu J, Bienkowska J and Berger B 2011 iWRAP: an interface threading approach with application to prediction of cancer-related protein–protein interactions J. Mol. Biol. 405 1295–310
[58] Zhang Q C, Petrey D, Norel R and Honig B H 2010 Protein interface conservation across structure space Proc. Natl Acad. Sci. USA 107 10896–901
[59] Shoemaker B A, Zhang D, Thangudu R R, Tyagi M, Fong J H, Marchler-Bauer A, Bryant S H, Madej T and Panchenko A R 2010 Inferred biomolecular interaction server – a web server to analyze and predict protein interacting partners and binding sites Nucleic Acids Res. 38 D518–24
[60] Krissinel E and Henrick K 2007 Inference of macromolecular assemblies from crystalline state J. Mol. Biol. 372 774–97
[61] Hue M, Riffle M, Vert J P and Noble W S 2010 Large-scale prediction of protein–protein interactions from structures BMC Bioinformatics 11 144
[62] Grigoriev A 2001 A relationship between gene expression and protein interactions on the proteome scale: analysis of the bacteriophage T7 and the yeast Saccharomyces cerevisiae Nucleic Acids Res. 29 3513–9
[63] Bhardwaj N and Lu H 2005 Correlation between gene expression profiles and protein–protein interactions within and across genomes Bioinformatics 21 2730–8
[64] Lukk M, Kapushesky M, Nikkila J, Parkinson H, Goncalves A, Huber W, Ukkonen E and Brazma A 2010 A global map of human gene expression Nat. Biotechnol. 28 322–4
[65] Adler P, Kolde R, Kull M, Tkachenko A, Peterson H, Reimand J and Vilo J 2009 Mining for coexpression across hundreds of datasets using novel rank aggregation and visualization methods Genome Biol. 10 R139
[66] Stuart J M, Segal E, Koller D and Kim S K 2003 A gene-coexpression network for global discovery of conserved genetic modules Science 302 249–55
[67] Hutchins J R et al 2010 Systematic analysis of human protein complexes identifies chromosome segregation proteins Science 328 593–9
[68] van Steensel B, Braunschweig U, Filion G J, Chen M, van Bemmel J G and Ideker T 2010 Bayesian network analysis of targeting interactions in chromatin Genome Res. 20 190–200
[69] Schaefer C F, Anthony K, Krupa S, Buchoff J, Day M,Hannay T and Buetow K H 2009 PID: the pathwayinteraction database Nucleic Acids Res. 37 D6749
[70] Matthews L et al 2009 Reactome knowledgebase of humanbiological pathways and processes Nucleic Acids Res.37 D61922
[71] Cherry J M et al 1998 SGD: saccharomyces genome databaseNucleic Acids Res. 26 739
[72] Amberger J, Bocchini C A, Scott A F and Hamosh A 2009McKusicks online Mendelian inheritance in man (OMIM)Nucleic Acids Res. 37 D7936
[73] Tweedie S et al 2009 FlyBase: enhancing Drosophila GeneOntology annotations Nucleic Acids Res. 37 D5559
[74] Blaschke C, Hoffmann R, Oliveros J C and Valencia A 2001Extracting information automatically from biologicalliterature Comp. Funct. Genomics 2 3103
[75] Tikk D, Thomas P, Palaga P, Hakenberg J and Leser U 2010A comprehensive benchmark of kernel methods to extractproteinprotein interactions from literature PLoS Comput.Biol. 6 e1000837
[76] Ashburner M et al 2000 Gene ontology: tool for theunification of biology. The Gene Ontology Consortium
Nat. Genet. 25 259[77] Pesquita C, Faria D, Falcao A O, Lord P and Couto F M 2009
Semantic similarity in biomedical ontologies PLoSComput. Biol. 5 e1000443
[78] Lord P W, Stevens R D, Brass A and Goble C A 2003Investigating semantic similarity measures across the GeneOntology: the relationship between sequence andannotation Bioinformatics 19 127583
[79] Scott M S and Barton G J 2007 Probabilistic prediction andranking of human proteinprotein interactions BMCBioinformatics 8 239
[80] Jansen R et al 2003 A Bayesian networks approach forpredicting proteinprotein interactions from genomic dataScience 302 44953
[81] Bowers P M, Pellegrini M, Thompson M J, Fierro J,
Yeates T O and Eisenberg D 2004 Prolinks: a database ofprotein functional linkages derived from coevolutionGenome Biol. 5 R35
[82] Sun J, Sun Y, Ding G, Liu Q, Wang C, He Y, Shi T, Li Yand Zhao Z 2007 InPrePPI: an integrated evaluationmethod based on genomic context for predictingproteinprotein interactions in prokaryotic genomes BMCBioinformatics 8 414
[83] Wilkinson D J 2007 Bayesian methods in bioinformatics andcomputational systems biology Brief Bioinform8 10916
[84] Lu L J, Xia Y, Paccanaro A, Yu H and Gerstein M 2005Assessing the limits of genomic data integration forpredicting protein networks Genome Res. 15 94553
[85] Hwang D et al 2005 A data integration methodology forsystems biology Proc. Natl Acad. Sci. USA102 17296301
[86] Ranea J A, Morilla I, Lees J G, Reid A J, Yeats C, Clegg A B,Sanchez-Jimenez F and Orengo C 2010 Finding the darkmatter in human and yeast protein network prediction andmodelling PLoS Comput. Biol. 6 e1000945
[87] Shawe-Taylor J and Cristianini N (eds) 2004 Kernel Methodsfor Pattern Analysis (Cambridge: Cambridge UniversityPress)
[88] Qiu J and Noble W S 2008 Predicting co-complexed proteinpairs from heterogeneous data PLoS Comput. Biol.4 e1000054
[89] Zhou D and Scholkopf B 2004 A regularization frameworkfor learning from graph data ICML Workshop on
Statistical Relational Learning[90] Xia K, Dong D and Han J D 2006 IntNetDB v1.0: anintegrated proteinprotein interaction network database
generated by a probabilistic model BMC Bioinformatics7 508
[91] Hwang D et al 2005 A data integration methodology forsystems biology: experimental verification Proc. NatlAcad. Sci. USA 102 173027
[92] Qi Y, Bar-Joseph Z and Klein-Seetharaman J 2006Evaluation of different biological data and computational
classification methods for use in protein interactionprediction Proteins 63 490500[93] Yellaboina S, Goyal K and Mande S C 2007 Inferring
genome-wide functional linkages in E. coli by combiningimproved genome context methods: comparison withhigh-throughput experimental data Genome Res.17 52735
[94] Qi Y, Klein-Seetharaman J and Bar-Joseph Z 2005 Randomforest similarity for proteinprotein interaction predictionfrom multiple sources Pac. Symp. Biocomput. 10 53142
[95] Mohamed T P, Carbonell J G and Ganapathiraju M K 2010Active learning for human proteinprotein interactionprediction BMC Bioinformatics 11 (Suppl. 1) S57
[96] Geurts P, Irrthum A and Wehenkel L 2009 Supervisedlearning with decision tree-based methods in
computational and systems biology Mol. Biosyst.5 1593605
[97] Lin N, Wu B, Jansen R, Gerstein M and Zhao H 2004Information assessment on predicting proteinproteininteractions BMC Bioinformatics 5 154
[98] Sprinzak E, Altuvia Y and Margalit H 2006 Characterizationand prediction of proteinprotein interactions within andbetween complexes Proc. Natl Acad. Sci. USA103 1471823
[99] Qi Y, Klein-Seetharaman J and Bar-Joseph Z 2007 A mixtureof feature experts approach for proteinprotein interactionprediction BMC Bioinformatics 8 (Suppl. 10) S6
[100] Fouss F, Francoisse K, Yen L, Pirotte A and Saerens M 2006An experimental investigation of graph kernels on acollaborative recommendation taskProc. 6th Int. Conf. on
Data Mining pp 8638[101] Kohler S, Bauer S, Horn D and Robinson P N 2008 Walking
the interactome for prioritization of candidate diseasegenes Am. J. Human Genet. 82 94958
[102] Li Y and Patra J C 2010 Integration of multiple data sourcesto prioritize candidate genes using discounted ratingsystem BMC Bioinformatics 11 (Suppl. 1) S20
[103] Li Y and Patra J C 2010 Genome-wide inferringgene-phenotype relationship by walking on theheterogeneous networkBioinformatics 26 121924
[104] Li X, Wu M, Kwoh C K and Ng S K 2010 Computationalapproaches for detecting protein complexes from proteininteraction networks: a survey BMC Genomics11 (Suppl. 1) S3
[105] Brohee S and van Helden J 2006 Evaluation of clusteringalgorithms for proteinprotein interaction networks BMCBioinformatics 7 488
[106] Vlasblom J and Wodak S J 2009 Markov clustering versusaffinity propagation for the partitioning of proteininteraction graphs BMC Bioinformatics 10 99
[107] Inoue K, Li W and Kurata H 2010 Diffusion model basedspectral clustering for proteinprotein interaction networksPLoS ONE5 e12623
[108] Qin G and Gao L 2010 Spectral clustering for detectingprotein complexes in proteinprotein interaction (PPI)networks Math. Comput. Modell. 52 206674
[109] Ding C, He X, Meraz R F and Holbrook S R 2004 A unifiedrepresentation of multiprotein complex data for modelinginteraction networks Proteins 57 99108
[110] Bu D et al 2003 Topological structure analysis of theproteinprotein interaction network in budding yeastNucleic Acids Res. 31 244350
[111] Sen T Z, Kloczkowski A and Jernigan R L 2006 Functionalclustering of yeast proteins from the proteinproteininteraction networkBMC Bioinformatics 7 355
[112] Lima-Mendez G and van Helden J 2009 The powerful law ofthe power law and other myths in network biology Mol.Biosyst. 5 148293
[113] Gomez S M and Rzhetsky A 2002 Towards the prediction ofcomplete proteinprotein interaction networks Pac. Symp.Biocomput. 7 41324
[114] Henrick K and Thornton J M 1998 PQS: a protein quaternarystructure file server Trends Biochem. Sci. 23 35861
[115] Guldener U, Munsterkotter M, Oesterheld M, Pagel P,Ruepp A, Mewes H W and Stumpflen V 2006 MPact: theMIPS protein interaction resource on yeast Nucleic AcidsRes. 34 D43641
[116] Hart G T, Lee I and Marcotte E R 2007 A high-accuracyconsensus map of yeast protein complexes reveals modularnature of gene essentiality BMC Bioinformatics8 236
[117] Pu S, Wong J, Turner B, Cho E and Wodak S J 2009Up-to-date catalogues of yeast protein complexes NucleicAcids Res. 37 82531
[118] Yu Het al
2008 High-quality binary protein interaction mapof the yeast interactome networkScience 322 10410[119] Smialowski P et al 2010 The Negatome database: a reference
set of non-interacting protein pairs Nucleic Acids Res.38 D5404
[120] Browne F, Wang H, Zheng H and Azuaje F 2009 GRIP: aweb-based system for constructing gold standard datasetsfor proteinprotein interaction prediction Source CodeBiol. Med. 4 2
[121] Chen X W, Jeong J C and Dermyer P 2010 KUPS:constructing datasets of interacting and non-interactingprotein pairs with associated attributions Nucleic AcidsRes 39 D7504
[122] Sharan R, Suthram S, Kelley R M, Kuhn T, McCuine S,Uetz P, Sittler T, Karp R M and Ideker T 2005 Conserved
patterns of protein interaction in multiple species ProcNatl. Acad. Sci. USA 102 19749[123] Kanehisa M et al 2008 KEGG for linking genomes to life and
the environment Nucleic Acids Res. 36 D4804[124] Thomas P D, Campbell M J, Kejariwal A, Mi H, Karlak B,
Daverman R, Diemer K, Muruganujan A andNarechania A 2003 PANTHER: a library of proteinfamilies and subfamilies indexed by function Genome Res.13 212941
[125] Ioannidis J P 2007 Why most published research findings arefalse: authors reply to Goodman and Greenland PLoSMed. 4 e215
[126] Pfeiffer T, Rand D G and Dreber A 2009 Decision-making inresearch tasks with sequential testing PLoS ONE4 e4607
[127] Neumann B et al 2010 Phenotypic profiling of the humangenome by time-lapse microscopy reveals cell divisiongenes Nature 464 7217
[128] Uetz P et al 2000 A comprehensive analysis ofproteinprotein interactions in Saccharomyces cerevisiaeNature 403 6237
[129] Russell R B and Aloy P 2008 Targeting and tinkering withinteraction networks Nat. Chem. Biol. 4 66673
[130] Hart G T, Ramani A K and Marcotte E M 2006 Howcomplete are current yeast and human protein-interactionnetworks? Genome Biol. 7 120
[131] Lefebvre C et al 2010 A human B-cell interactome identifiesMYB and FOXM1 as master regulators of proliferation ingerminal centers Mol. Syst. Biol. 6 377
[132] Warde-Farley D et al 2010 The GeneMANIA predictionserver: biological network integration for geneprioritization and predicting gene function Nucleic Acids
Res. 38 W21420[133] Lefebvre C, Lim W K, Basso K, dalla-Favera R andCalifano A 2007 A context-specific network ofproteinDNA and proteinprotein interactions reveals newregulatory motifs in human B cells Lect. NotesBioinformatics (LNCS) 4532 4256
[134] Krogan N J et al 2006 Global landscape of protein complexesin the yeast Saccharomyces cerevisiae Nature440 63743
[135] Gavin A C et al 2006 Proteome survey reveals modularity ofthe yeast cell machinery Nature 440 6316
[136] Li S et al 2004 A map of the interactome network of themetazoan C. elegans Science 303 5403
[137] Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M andSakaki Y 2001 A comprehensive two-hybrid analysis to
explore the yeast protein interactome Proc. Natl Acad. Sci.USA 98 456974[138] Giot L et al 2003 A protein interaction map of Drosophila
melanogaster Science 302 172736[139] Wu D et al 2009 A phylogeny-driven genomic encyclopaedia
of bacteria and archaea Nature 462 105660[140] Kahlem P and Birney E 2007 ENFIN a network to enhance
integrative systems biology Ann. New York Acad. Sci.1115 2331