    Systematic computational prediction of protein interaction networks

    2011 Phys. Biol. 8 035008

    (http://iopscience.iop.org/1478-3975/8/3/035008)

    IOP PUBLISHING PHYSICAL BIOLOGY

    Phys. Biol. 8 (2011) 035008 (13pp) doi:10.1088/1478-3975/8/3/035008

    Systematic computational prediction of protein interaction networks

    J G Lees1, J K Heriche2, I Morilla3, J A Ranea1,3 and C A Orengo1

    1 Research Department of Structural & Molecular Biology, University College London, London, UK

    2 Cell Biology/Biophysics Unit, European Molecular Biology Laboratory (EMBL), Meyerhofstrasse 1, D-69117 Heidelberg, Germany

    3 Department of Molecular Biology and Biochemistry-CIBER de Enfermedades Raras, University of Malaga, Malaga, Spain

    E-mail: [email protected]

    Received 2 November 2010

    Accepted for publication 9 February 2011

    Published 13 May 2011

    Online at stacks.iop.org/PhysBio/8/035008

    Abstract

    Determining the network of physical protein associations is an important first step in

    developing mechanistic evidence for elucidating biological pathways. Despite rapid advances

    in the field of high throughput experiments to determine protein interactions, the majority of

    associations remain unknown. Here we describe computational methods for significantly

    expanding protein association networks. We describe methods for integrating multiple

    independent sources of evidence to obtain higher quality predictions and we compare the

    major publicly available resources for experimentalists to use.

    1. Introduction

    New technologies in biology have given us the genomes

    for thousands of species, including humans. Understanding

    how all of these molecular parts assemble into functional

    pathways is a major challenge. It has been noted that an

    organism's complexity arises in part from the intricate and

    dynamic networks of protein associations. While several

    resources [1–5] provide experimental information on protein

    associations, the experimental data, although growing rapidly,

    are still limited (e.g. perhaps

    Phys. Biol. 8 (2011) 035008 J G Lees et al

    first method made use of sequence information. As detailed

    below, there are different ways of using sequences to make

    inference about protein associations. The advantage of these

    methods is that sequence data have become abundant and

    are easily available through public databases in standardized

    formats. A second class of methods has followed the

    development of high-throughput experiments and functional annotation databases.

    2.1. Genomic context methods

    2.1.1. Co-occurrence profiles. Genome context association

    prediction algorithms are a family of sequence-based

    methods used to predict associations between proteins.

    These techniques are based on principles derived from

    known evolutionary processes. For example, co-occurrence

    (phylogenetic) profiles are genomic context methods based

    on the principle that if genes are functionally related, they

    will tend to be co-inherited as a unit since the loss of any

    one gene would compromise the functioning of the others.

    Phylogenetic profiling algorithms look for similar patterns of

    the presence/absence of genes across species (figure 1(a)). It

    is unclear exactly what the predictions correspond to, although

    it is generally considered to be a biological process type

    association. More functional significance can be assigned

    if the patterns are seen over distant evolutionary periods

    or if they occur independently in multiple lineages. The

    original phylogenetic profiling idea [9] has been developed

    in many different ways including more complex logical

    rules to associate genes [10], the use of domain instead of

    whole protein profiles [11] and through integrating species

    phylogenetic information [12, 13]. Gene duplications can

    lead to spurious predictions with high similarity between

    the duplicated genes' profiles. Some resources implement

    a scoring scheme whereby homologous proteins are down-weighted

    in accordance with their level of homology [14]. It is

    also important to filter low information content profiles [11].

    Phylogenetic profiles have been used successfully in

    archaea and bacteria to discover, for example, novel essential

    members of synthetic pathways [15, 16], environmental

    adaptation factors [17] and thiamine biosynthesis [18].

    Interestingly some studies have made use of anti-correlation

    in the pattern as a signal to make predictions of functional

    association [18]. A domain-based phylogenetic profile

    method, phylotuner [11], has recently been developed to

    improve performance in eukaryotes where the multigene

    families and protein domain rearrangements create challenges

    for these types of approaches.
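The co-occurrence idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the scoring used by any of the cited resources: profiles are binary presence/absence vectors, similarity is measured with the Jaccard index (one of several possible choices), low-information profiles are filtered by Shannon entropy, and the gene names, data and thresholds are all hypothetical.

```python
from math import log2

def profile_entropy(profile):
    """Shannon entropy (bits) of a binary presence/absence profile."""
    p = sum(profile) / len(profile)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))

def jaccard(a, b):
    """Jaccard similarity between two binary profiles of equal length."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0

def co_occurrence_pairs(profiles, min_entropy=0.5, min_similarity=0.8):
    """Return (gene1, gene2, score) for informative, similar profiles."""
    # Drop profiles that are (almost) always present or always absent.
    informative = {g: p for g, p in profiles.items()
                   if profile_entropy(p) >= min_entropy}
    genes = sorted(informative)
    hits = []
    for i, g1 in enumerate(genes):
        for g2 in genes[i + 1:]:
            score = jaccard(informative[g1], informative[g2])
            if score >= min_similarity:
                hits.append((g1, g2, score))
    return hits

# Toy data (hypothetical): 1 = gene present in that species, 0 = absent.
profiles = {
    "geneA": [1, 1, 0, 0, 1, 0, 1, 0],
    "geneB": [1, 1, 0, 0, 1, 0, 1, 0],
    "geneC": [0, 0, 1, 1, 0, 1, 0, 1],
    "geneD": [1, 1, 1, 1, 1, 1, 1, 1],  # ubiquitous, carries no information
}
pairs = co_occurrence_pairs(profiles)
```

In this toy example only geneA and geneB are paired; geneD is removed by the entropy filter before any comparison is made. A real pipeline would also need to handle the homology down-weighting and species phylogeny discussed in the text.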

    2.1.2. Gene fusion. Evolutionary pressure can produce

    fusion of separate but functionally related genes (A and B

    in figure 1(b)) into a single gene. In their simplest form, the

    gene fusion prediction methods identify pairs of proteins in

    a genome, which are homologous to proteins fused together

    in a different genome, and use this as supporting evidence

    for the functional association of these individual genes [19].

    The predicted association type is again unclear but most often

    corresponds to a shared biological process or consecutive

    steps in a pathway [19]. Proteins linked by gene fusion events

    detected in mammals showed a propensity to interact [20]. As with

    phylogenetic profiling, analogous domain-based equivalents

    exist [21]. These methods identify domains on two distinct

    protein sequences in the same genome that are found fused

    into a single sequence in a different genome (figure 1(b)).

    However, because of large and/or promiscuous domain

    families (e.g. kinase domains) domain fusion predictors

    require a scoring mechanism to prevent a great number of

    non-specific predictions. An example of fusion data used in

    conjunction with profiles is given by a method developed for

    identifying biothiol synthetic enzymes [22].
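The domain fusion principle can be illustrated with a small sketch. It assumes each protein is already described by a set of domain family names (for example from a domain assignment step) and, as a deliberate simplification of the scoring mechanisms mentioned above, it simply ignores a user-supplied list of promiscuous families. The function name, protein names and family names are hypothetical.

```python
from itertools import combinations

def fusion_links(query, reference, promiscuous=frozenset()):
    """Link query proteins whose domains occur fused in one reference protein.

    query, reference: dict mapping protein name -> set of domain families.
    promiscuous: domain families to ignore (a crude stand-in for scoring).
    """
    links = {}
    for (p1, d1), (p2, d2) in combinations(sorted(query.items()), 2):
        d1 = d1 - promiscuous
        d2 = d2 - promiscuous
        if not d1 or not d2 or d1 & d2:
            # Nothing left to match, or the proteins share a domain
            # (probable homologues rather than fusion partners).
            continue
        for fused, domains in reference.items():
            if d1 <= domains and d2 <= domains:
                links.setdefault((p1, p2), []).append(fused)
    return links

# Toy genomes (hypothetical names).
query = {
    "protA": {"famA"},
    "protB": {"famB"},
    "protC": {"famC"},
    "kin1": {"kinase"},
}
reference = {
    "fusedAB": {"famA", "famB"},
    "multi": {"kinase", "famC"},
}
links_all = fusion_links(query, reference)
links_filtered = fusion_links(query, reference,
                              promiscuous=frozenset({"kinase"}))
```

Without the filter, the kinase-containing protein is linked to protC through the multi-domain reference protein; with the filter, only the protA/protB link supported by fusedAB remains.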

    2.1.3. Genomic neighborhood. Due to functional constraints

    genes can be maintained close together on a chromosome

    over long evolutionary time periods (figure 1(d)). Genomic

    neighborhood prediction methods identify genes that cluster

    within a certain base distance across multiple genomes. As

    with other genomic context methods, artifacts can arise through

    shared ancestry due to inadequate time for reshuffling of

    genes. The genomic neighborhood method is not to be

    confused with methods such as the operon method [23]

    based on the intergenic distance in a single genome. A

    recent comprehensive study in prokaryotes demonstrated the

    genomic neighborhood method to be the best among genomic

    context methods [24].

    Some genomic neighborhood methods assert the rule that

    gene order needs to be maintained. However, since some gene

    rearrangement can be tolerated not all methods enforce this

    constraint [25]. Some resources allow neighboring genes that

    are diametrically opposed in a head-to-head orientation to be

    considered [14]. The genomic neighborhood approach has

    been used to predict the archaeal exosome [26], which was subsequently

    validated experimentally [27].

    2.2. Sequence-based prediction

    2.2.1. Sequence co-evolution. Interactions between proteins

    are mediated through specific residue interfaces [28, 29]. It

    has been observed that physically interacting proteins have

    greater similarity of their phylogenetic trees than expected

    by chance. One process put forward to explain this

    is compensatory mutation, whereby a deleterious mutation

    of an interaction-mediating residue in one protein can be

    ameliorated by a compensatory mutation of the binding partner

    [30]. More recent analyses suggest that additional sources

    contribute to the co-evolutionary signal (see [31, 32]) making

    the type of functional linkages derived from these methods

    more fuzzy. The prediction methods developed around the

    principle of sequence co-evolution make use of multiple

    sequence alignments for each putative interacting protein from

    which distance matrices are calculated. High correlation of

    these distance matrices is taken as evidence of a potential

    physical interaction [33] (figure 1(c)). Other methods exploit

    gene evolutionary information in multiple sequence analysis

    by comparing the phylogenetic tree distance matrices of pairs

    of gene families. These methods have been improved by

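    Figure 1. Illustration of the principles behind several of the protein association prediction methods described in the text: (a) co-occurrence profiles, (b) gene fusion, (c) sequence co-evolution, (d) genomic neighborhood and (e) interolog transfer.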

    integrating species tree information [34] or by implementing

    algorithms that efficiently deal with the multi-gene families of

    paralogues found in eukaryotes [35].
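The distance-matrix comparison at the heart of the co-evolution methods can be sketched as follows. This is a minimal illustration that assumes the two distance matrices have already been computed from multiple sequence alignments over the same ordered set of species; it uses the plain Pearson correlation of the upper triangles and omits the species-tree corrections mentioned above. Function names and the toy matrices are hypothetical.

```python
from math import sqrt

def upper_triangle(matrix):
    """Flatten the strict upper triangle of a square distance matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

def coevolution_score(dist_a, dist_b):
    """Correlate inter-species distances of two protein families."""
    return pearson(upper_triangle(dist_a), upper_triangle(dist_b))

# Toy matrices over the same four species (hypothetical values).
dist_a = [[0, 1, 4, 5],
          [1, 0, 3, 6],
          [4, 3, 0, 2],
          [5, 6, 2, 0]]
# Family B evolves twice as fast but with the same tree shape.
dist_b = [[2 * d for d in row] for row in dist_a]
score = coevolution_score(dist_a, dist_b)
```

Because family B is an exact rescaling of family A, the score is 1 up to rounding. Real families share a background correlation simply through the species phylogeny, which is why the corrections cited above matter in practice.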

    2.2.2. Commonly occurring domain pairs. A large

    portion of interactions are mediated through domain–domain

    associations and these interactions are conserved across

    species [36], with certain domain pairs re-occurring in

    multiple protein interactions [37, 38]. One use for this

    domain association network is that knowledge of the

    underlying domains mediating protein interactions can be used

    to help predict novel protein interactions. These methods

    are potentially powerful since domains can be reliably and

    quickly assigned to any genome using

    HMMER3 [39, 40]. Several approaches for predicting

    domain interactions have been developed, including over-representation

    methods [41] and random forest methods [42].

    Multiple methods are available from the DIMA resource [43].

    UniDomInt merges the predictions of nine different domain

    interaction prediction methods to provide a meta-database of

    more reliable associations [44]. High scoring domain pairs

    between proteins have frequently been used to make predictions

    as part of an integration strategy (for example [45]).

    2.2.3. Simple sequence features. A variety of prediction

    methods have been developed to predict protein–protein

    interactions from intrinsic features of sequences such as

    3-mers of neighboring residues [46]. The high accuracy achieved

    by these methods has recently been called into question and

    may be an artifact of the data sets used to train and validate the

    methods [47].

    2.3. Homology-based methods

    2.3.1. Inheriting protein interactions from sequence.

    There are many known protein interactions (interologs)

    that are conserved across species [48], although the

    general level of protein interaction conservation remains

    unclear [49]. Algorithms have been developed that inherit

    such protein interactions from an experimentally confirmed

    interaction in a genome to the query genome, via homology

    [48, 50, 51]. In their simplest form, these methods make orthology

    assignments (by reciprocal best hit or using an orthology

    database) and transfer interactions where possible (figure 1(e)).

    More complex methods make use of heterogeneous features,

    including domain combination, subcellular localization and

    tissue specificity, to try to give increased confidence to

    the interolog assignment [52]. Interolog-based methods are

    especially powerful for organisms with few experimentally

    determined interactions allowing for a substantial number of

    high confidence protein interaction predictions. Interolog-

    based methods are one of the most successfully applied protein

    association prediction methods in terms of uptake and use by

    experimentalists with over 80 publications associated with the

    I2D [53] method alone.
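The simplest form of interolog transfer described above amounts to a lookup through an orthology map. The sketch below assumes a one-to-one orthologue table (for instance from reciprocal best hits) is already available; it is not the procedure of I2D or any other cited resource, and all protein identifiers are hypothetical.

```python
def transfer_interologs(known_interactions, orthologs):
    """Map interactions from a source species onto a target species.

    known_interactions: iterable of (source_a, source_b) pairs.
    orthologs: dict mapping a source protein to its target orthologue;
               proteins without an assigned orthologue are simply absent.
    """
    predicted = set()
    for a, b in known_interactions:
        if a in orthologs and b in orthologs:
            # Store each pair in sorted order so (x, y) and (y, x) coincide.
            predicted.add(tuple(sorted((orthologs[a], orthologs[b]))))
    return predicted

# Toy data (hypothetical identifiers): "s*" source species, "t*" target.
source_interactions = [("sA", "sB"), ("sB", "sC"), ("sC", "sD")]
ortholog_map = {"sA": "tA", "sB": "tB", "sC": "tC"}  # sD has no orthologue
predicted = transfer_interologs(source_interactions, ortholog_map)
```

The sC/sD interaction is not transferred because sD has no orthologue in the target species. The more complex methods mentioned in the text would attach a confidence to each transferred pair rather than treating all of them equally.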

    2.3.2. Inheriting protein interactions from structure.

    Structural complexes can also be used to transfer interactions

    from a known interacting pair to proteins with similar structure.

    These methods can provide insight into the physical details

    of the interaction and are likely to become more important

    in the near future as more structures become available [54].

    Although the number of complexes of a known 3D structure is

    relatively small, it is possible to expand this set by considering

    homologous proteins. An early example of this is the

    InterPreTS method [55], which, given a known 3D complex

    structure and homologous sequences for each interacting

    protein, ranks interactions between homologues of the same

    species. Another method, Struct2Net [56], threads sequences

    to structures and computes scores from the interfacial energy

    for the sequence pair. The iWRAP method allows for

    inheriting interactions in cases with low sequence identity by

    focusing on the interface residues in the threading process

    [57]. Recently, methods have been developed that add an extra

    layer of evidence for the predicted interaction by considering

    the evolutionary conservation of binding site residues,

    assessed via structural alignments

    [58, 59]. The IBIS server predicts interaction partners

    and binding sites for a given protein using experimentally

    observed or homology inferred complexes. IBIS checks for

    several features to ensure biological relevance of the inferred

    complex. For example, binding site residues are assessed

    for evolutionary conservation, using a set of non-redundant

    homologous proteins. As another check, IBIS makes use of

    PISA [60] validation, which considers the physicochemical

    properties of the protein interaction interface. Structural

    data have been used as a means for inheriting interactions

    on a genome-wide scale, using structural alignment scores to

    generate kernels for use with a support vector machine (SVM)

    [61].

    2.4. Exploiting experimental data

    2.4.1. Microarray profiles. Microarrays were one of

    the first genome-wide experimental methods developed

    [8]. Compendia of microarray experiments across various

    experimental conditions have been assembled. It has been

    demonstrated in yeast that genes with high co-expression,

    defined by Pearson correlation across the different conditions,

    were more likely to be physically interacting than randomly

    chosen pairs [62] although this signal strength varies for

    different organisms [63]. Large, information-rich data sets

    with 6000 microarray experiments have been assembled

    [64]. The development of statistical processing tools to find

    signals from such large data sets, using a subset of conditions,

    has broadened the applicability of the method to predicting

    co-complex membership in Homo sapiens [65]. There are

    other approaches for using microarray data in the context of

    an integration strategy for protein association prediction. One

    method is to look for genes whose co-expression is conserved

    across multiple genomes [66]. Another is to identify genes

    expressed in similar subcellular/tissue types [45] as supporting

    evidence of their interaction. Clearly such pieces of evidence

    are very weak on their own, but can provide useful supporting

    information when used in combination with other data.

    2.4.2. Other experimental screens. Other types of

    experimental screens can also be assembled and processed

    into similarity profiles. For example, phenotypic vectors

    from high-throughput loss of function experiments can be

    clustered to give sets of functionally related proteins often

    used as a basis to test for physical interactions. Similarities

    between trajectories of subcellular localization have also been

    used to generate hypotheses about physical interactions of

    uncharacterized proteins [67]. Other examples such as in

    vivo genomic binding maps provide information on positional

    targeting of chromatin components that can be used to generate

    predictions on the network of interactions in chromatin

    assembly [68]. As more high-throughput experimental data

    sets become available, their general usefulness for prediction

    will increase.

    2.5. Literature-derived associations

    2.5.1. Text mining. Only a portion of the experimental

    interactions are captured by the interaction database resources

    [15, 69, 70]. Information on other experimentally detected

    interactions is available from Pubmed and other online

    resources [71–73]. Text mining is a very powerful method

    for expanding interactomes either automatically [14] or for

    speeding up the curation process for certain interaction

    databases [2, 3]. Protein associations can be obtained by

    searching for statistically significant co-occurrences between

    gene names [74]. In its simplest form, the principle behind

    such methods is that the more frequently two genes

    occur in the same sentence, paragraph, abstract or article, the

    more likely they are to be functionally associated. Another common

    way to generate networks is by natural language processing

    of abstracts, considering gene names as nodes and the verbs

    as edges. Restricting verbs of association to those such as

    'binds', 'interacts', etc. provides physical interaction networks.

    A major uncertainty associated with text mining results is in

    assigning the gene names in the text to a corresponding entity

    in the sequence databases. In more recent developments,

    protein interactions have been extracted from the literature

    using kernel methods [75].
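The co-occurrence principle can be sketched at the sentence level as below. This is only a raw counter under simplifying assumptions: gene names are matched as exact tokens from a supplied list (sidestepping the name-to-entity mapping problem noted above), sentences are split on terminal punctuation, and no statistical significance is computed. The gene names and text are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

def cooccurrence_counts(text, gene_names):
    """Count how often each pair of gene names shares a sentence."""
    counts = Counter()
    # Naive sentence splitter: break after '.', '!' or '?' plus whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        tokens = set(re.findall(r"[A-Za-z0-9]+", sentence))
        found = sorted(g for g in gene_names if g in tokens)
        for pair in combinations(found, 2):
            counts[pair] += 1
    return counts

# Toy abstract with hypothetical gene names.
text = ("geneA binds geneB in vitro. "
        "geneB interacts with geneA and geneC. "
        "geneD was not studied here.")
counts = cooccurrence_counts(text, {"geneA", "geneB", "geneC", "geneD"})
```

Here geneA and geneB co-occur in two sentences, while geneD never shares a sentence with another gene. Turning such counts into statistically significant associations requires a background model of how often each name occurs on its own, which this sketch leaves out.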

    2.5.2. Functional semantic similarity. The Gene Ontology

    (GO) [76] is a controlled vocabulary used to describe various

    attributes of genes including their functions. Terms that

    describe the functions are stored as nodes in a directed

    graph with specific terms sharing more general terms as

    parents. For example, 'apoptotic chromosome condensation'

    and 'mitotic chromosome condensation' both share the parent

    term 'chromosome condensation'. Primary annotations are

    derived from the literature through manual curation efforts.

    There are several types of evidence associated with the annotations,

    ranging from the mostly reliable manual annotations to

    automated electronic annotations. Various methods have

    sought to derive networks of functional associations between

    proteins using their associated GO terms (see [77] for recent

    review). Problems in using the GO graph directly arise from

    issues such as the variation of term specificity in the graph. A

    common solution is to make use of information content-based

    measures such as the Resnik score [78]. The final choice

    of evidence types, similarity measure and gene transference

    method needs to be made on a case-by-case basis [77].
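An information content-based measure of the Resnik type can be sketched on a toy ontology. The sketch assumes the usual formulation in which the information content of a term is the negative log of the fraction of annotated genes that reach it (directly or through descendants), and the similarity of two terms is the highest information content among their shared ancestors. The tiny graph, the placement of the terms directly under a single root, and the gene annotations are hypothetical simplifications.

```python
from math import log

def ancestors(term, parents):
    """Return the term itself plus all of its ancestors in the DAG."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def information_content(annotations, parents):
    """IC(t) = -ln(fraction of annotated genes that reach term t)."""
    genes_at = {}
    for gene, terms in annotations.items():
        for term in terms:
            for anc in ancestors(term, parents):
                genes_at.setdefault(anc, set()).add(gene)
    total = len(annotations)
    return {t: -log(len(g) / total) for t, g in genes_at.items()}

def resnik(term_a, term_b, parents, ic):
    """Highest information content among the shared ancestors."""
    common = ancestors(term_a, parents) & ancestors(term_b, parents)
    return max((ic.get(t, 0.0) for t in common), default=0.0)

# Toy ontology: child -> parents (hypothetical, heavily simplified).
parents = {
    "apoptotic chromosome condensation": ["chromosome condensation"],
    "mitotic chromosome condensation": ["chromosome condensation"],
    "chromosome condensation": ["root"],
    "other process": ["root"],
}
# Toy annotations: gene -> directly annotated terms (hypothetical).
annotations = {
    "g1": ["apoptotic chromosome condensation"],
    "g2": ["mitotic chromosome condensation"],
    "g3": ["other process"],
    "g4": ["other process"],
}
ic = information_content(annotations, parents)
sim_related = resnik("apoptotic chromosome condensation",
                     "mitotic chromosome condensation", parents, ic)
sim_unrelated = resnik("apoptotic chromosome condensation",
                       "other process", parents, ic)
```

The two condensation terms share a specific parent reached by half of the genes, so their similarity is ln 2; the unrelated pair shares only the root, whose information content is zero. This illustrates why such measures are less sensitive to uneven term depth than raw graph distances.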

    3. Integrating prediction methods

    Each source data set, whether experimental or not, has biases

    and errors. It is, however, unlikely, given the potential number

    of interactions (and provided appropriate confidence cut-offs are

    used), that two independent prediction methods will give rise to

    the same false positive prediction. In general, we could expect

    the prediction power (accuracy and coverage) to increase

    proportionally to the number of independent approaches

    supporting the association. The simplest approaches exploit

    this principle by using a joint observation approach for

    combining prediction methods, where a greater number of

    independent methods predicting the association correspond to

    a higher prediction accuracy [23]. Other tests have shown that

    integrating multiple predictions using more advanced methods

    can improve the prediction power [14, 79, 80] by combining

    and reinforcing observations. A wide variety of integration

    methods are available, including Fisher's method, Bayesian integration, logistic

    regression and kernel methods. Some methods provide

    confidence estimates for the outputs (e.g. Bayesian integration and logistic

    regression) that may be useful in certain scenarios.

    3.1. Simple integration

    Each of the protein association prediction methods described

    above yields scores which correlate with the likelihood of

    functional association. However, it can be difficult to directly

    combine these scores since the scores for each method can

    differ both in scale and in predicted biological association

    type. To help overcome this problem, output scores from

    individual prediction methods need to be transformed into

    confidence measures using a set of known true positives. The

    Prolinks prediction pipeline [81] simply chooses the maximum

    score from all the individual methods as the gene

    association score. Other methods make use of a formula for

    combining the scores from each method, optionally after

    weighting each method by its general performance [82].

    3.2. Bayesian integration

    Bayesian integration is the most widely used strategy for

    integrating protein association predictions [14, 80]. It has

    several features that make it suitable for data integration of

    this type (table 1). Each individual data channel is implicitly

    weighted according to its reliability, and hence it is easy

    to interpret the probability relationships for each channel.

    Crucially, this method can accommodate missing data, which

    typically lead to problems for supervised learning methods.

    Naïve Bayesian integration presumes that the different data

    channels are statistically independent of one another, and

    failure to remove or merge redundant data sources can lead to

    over-prediction. Bayesian integrators have been used in many

    successful applications [83], allowing for multiple types of data

    to be integrated (including numerical and categorical). It may

    not always be advantageous to add in increasing numbers of

    data sources. One study has shown that choosing a small

    number of the best features from those available can improve

    performance, and adding in additional input data types does

    not give further improvements [84].
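One common way to write naïve Bayesian integration is in terms of likelihood ratios, sketched below. It is an illustration under stated assumptions rather than the scheme of any cited resource: each channel contributes the ratio P(evidence | associated) / P(evidence | not associated), estimated beforehand from a gold standard, and the channels are treated as independent. The prior and the ratios in the example are hypothetical numbers.

```python
def posterior_probability(prior, likelihood_ratios):
    """Combine per-channel likelihood ratios under naive independence.

    prior: prior probability that a random protein pair is associated.
    likelihood_ratios: one entry per channel; None marks a channel with
        no data for this pair.
    """
    odds = prior / (1.0 - prior)
    for lr in likelihood_ratios:
        if lr is not None:  # a missing channel leaves the odds unchanged
            odds *= lr
    return odds / (1.0 + odds)

# Hypothetical pair: strong evidence in channel 1, no data in channel 2,
# moderate evidence in channel 3.
p_pair = posterior_probability(0.01, [20.0, None, 5.0])
# With no evidence at all, the posterior falls back to the prior.
p_no_data = posterior_probability(0.01, [None, None, None])
```

The missing channel is simply skipped, which is the sense in which this approach accommodates missing data. Feeding the same evidence in twice under two channel names would multiply its ratio twice, which is the over-prediction risk from redundant sources noted above.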

    3.3. Fisher's method

    Fisher's method is a general non-Bayesian method that

    has been successfully used to integrate protein interaction

    predictions from diverse methods. Fisher's algorithm is a

    solution (the Pareto optimal solution) to the problem of

    combining independent tests [85]. The method is highly

    flexible and is able to deal with the low overlap between

    source data sets. Some recent studies have successfully

    applied Fisher's method to protein association prediction

    [86]. Fisher's method does not need trained or supervised

    predictions based on experimental gold standard data sets of

    protein interactions. Hence, if only genomic context methods

    are used, Fisher's predictions can be considered independent

    of the public repositories of protein interactions. A weighted

    version of Fisher's method provides the ability to optimize contributions

    from each data source.
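The unweighted form of Fisher's method is short enough to sketch directly. For k independent p-values, the statistic -2 Σ ln p_i follows a chi-square distribution with 2k degrees of freedom under the null hypothesis; because the degrees of freedom are even, the tail probability has a closed form and no statistics library is needed. The function name and the example p-values are hypothetical, and the weighted variant mentioned above is not covered.

```python
from math import exp, log

def fisher_combined_pvalue(p_values):
    """Combine independent p-values (each > 0) with Fisher's method.

    Returns (statistic, combined_p), where statistic = -2 * sum(ln p_i).
    """
    k = len(p_values)
    statistic = -2.0 * sum(log(p) for p in p_values)
    # Chi-square survival function for 2k degrees of freedom:
    # P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    half = statistic / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return statistic, exp(-half) * total

# Two prediction methods each give p = 0.05 for the same protein pair.
stat_two, p_two = fisher_combined_pvalue([0.05, 0.05])
# A single p-value is returned unchanged.
stat_one, p_one = fisher_combined_pvalue([0.2])
```

Two moderately significant, independent observations combine into a smaller p-value than either alone, and a pair covered by only one method can still be scored using that method's p-value. This is one way the low overlap between source data sets can be handled.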

    Table 1. Methods commonly used for integrating multiple protein association prediction methods. A 1 or 0 denotes the presence or absence of a particular desirable property. Column headings give the integration method and example references.

    Advantageous property                                        | Naïve Bayes | Fisher's | SVM      | Graph kernel + SVM | Random forest
                                                                 | [80, 90]    | [85, 91] | [92, 93] | [46, 88]           | [94, 95]
    Copes well with missing values                               | 1           | 1        | 0        | 0                  | 1
    Importance of input features can be readily obtained         | 1           | 0        | 0        | 0                  | 1
    Copes well with high-dimensional data                        | 0           | 0        | 1        | 1                  | 1
    Complex relationships between input variables can be learned | 0           | 0        | 1        | 1                  | 1
    Probability estimate readily obtained from output            | 1           | 0        | 0        | 0                  | 0
    No parameter optimization required                           | 0           | 1        | 0        | 0                  | 0
    No requirement for independence between input data           | 0           | 0        | 1        | 1                  | 1
    No training data required                                    | 0           | 1        | 0        | 0                  | 0

    3.4. Kernel methods

    As these methods have gained in popularity in recent years, we

    give here a brief summary of kernel properties that make them

    attractive for data integration. For more details, we refer the

    interested reader to [87]. By definition, a kernel is a function

    that gives the dot product between two vectors in some multi-

    dimensional space (called feature space). A kernel matrix

    (often abbreviated as kernel) contains the evaluation of the

    kernel function for all pairs of data points under consideration.

    A kernel can be viewed as a matrix of similarities between

    data points and different kernels capture different notions of

    similarity as they correspond to embedding the data in different

    feature spaces.

    The first property of interest is that any symmetric matrix

    with non-negative eigenvalues is a valid kernel matrix. This

    means we can test whether a similarity matrix is a valid kernel

    without knowing the feature space in which the kernel function

    operates. This makes kernel methods applicable, not only to

    real-valued vectors, but to any data (e.g. sequences, graphs) for

    which we can define a similarity measure. The second property

    of interest for data integration is that various mathematical

    combinations of kernels (e.g. linear combination) produce a

    valid kernel. So far, most data integration approaches using

    kernels for predicting protein interactions have been used

    in a classification framework with support vector machines

    [46, 88]. This leads to the requirement of a negative

    data set which can be problematic to generate (section 5.1).

    Alternatively, kernels can be used for link prediction in a semi-

    supervised setting [89].
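The two kernel properties used above can be written out explicitly. The notation is an assumption introduced for illustration: \phi is the feature map, K is the kernel matrix over data points x_1, ..., x_n, z is an arbitrary real vector, and \mu_m are non-negative weights on M source kernels.

```latex
% A kernel is a dot product in feature space; K collects all pairwise values.
k(x, y) = \langle \phi(x), \phi(y) \rangle, \qquad K_{ij} = k(x_i, x_j).

% K is symmetric with non-negative eigenvalues, because for every vector z
z^{\top} K z = \sum_{i,j} z_i z_j \langle \phi(x_i), \phi(x_j) \rangle
             = \Bigl\| \sum_i z_i \phi(x_i) \Bigr\|^2 \ge 0.

% A non-negative linear combination of kernel matrices is again a valid kernel:
K = \sum_{m=1}^{M} \mu_m K_m, \quad \mu_m \ge 0
\;\Longrightarrow\;
z^{\top} K z = \sum_{m=1}^{M} \mu_m \, z^{\top} K_m z \ge 0.
```

The last line is what makes kernels convenient for integration: each data source can be turned into its own similarity matrix, and a weighted sum of those matrices can be passed to a single kernel-based learner.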

    3.5. Random forest classifiers

    Decision trees are supervised classification algorithms that use

    tree-like graphs to make predictions. In its simplest form, a decision tree makes multiple

    binary tests in a tree structure such that a given input vector

    of attributes is propagated through the tree, using the internal

    nodes to test an attribute's value and terminal nodes to give a

    classification. Random forests are ensemble classifiers made

    up of many individual decision trees [96]. Random forests

    provide an efficient means of increasing performance, and are

    less prone to over-fitting than individual decision trees. Each

    individual decision tree is generated by selecting a random

    subset of the training data with replacement. The final output

    of the RF classifier is from the majority vote of its individual

    decision trees. The random forest classifier has been shown

    to be consistently amongst the best methods on a wide range

    of protein association tasks including protein interaction and

    co-complex membership [92, 94, 97]. Unlike most supervised

    learning methods, random forests make it possible to obtain

    a measure of importance for each input channel to the overall

    performance [97]. They can also deal with large and sparse

    input vectors, as seen for their use in predicting interactions

    from protein domain content [42]. Random forests have also

    been used to integrate structural information with more typical

    protein association data types [56].

    3.6. Logistic regression

    Logistic regression has been used to integrate data to provide

    output predictions [92, 97–99] but has been shown to

    be outperformed by random forests [92, 97]. Another

    approach found good performance for logistic regression after

    subdividing the input data into different natural groupings

    [99].

    3.7. Random walks on a graph

    Some authors have used an approach using matrices derived

    from random walks on graphs to prioritize genes. A random

    walk on a graph describes the sequence of steps taken by

    a walker who moves from one node to a randomly selected

    adjacent node with a probability proportional to the weight

    associated with the edge connecting the two nodes. A random

    walk is a type of Markov chain from which different measures

    of similarity between nodes of the graph can be computed. If

    the Markov chain is regular it gives rise to a valid kernel [100].

    Random walks have been used in two ways for data integration.

    In the first, each data set is considered separately as a graph

    from which a random walk-based similarity is derived and

    used to rank the genes. A rank aggregation method is then

    used for the data integration step. Although this approach has

    mainly been used to predict disease genes [101, 102], it

    could be applicable to protein–protein interaction prediction

    or at least to the prediction of functional relationships. The

    second integration approach consists in merging the different

    source data sets into one graph from which a random walk-

    based measure of similarity is derived and used for ranking

    genes. Again, this has been used to identify disease genes

    [103] but could be used for other types of predictions.
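A random walk-based similarity can be sketched with a walk that restarts at a set of seed nodes. Random walk with restart is one common choice and is used here as an assumption; the text above does not say which random walk measure the cited studies use. The graph, the seed, the restart probability and the function name are all hypothetical.

```python
def random_walk_with_restart(weights, seeds, restart=0.3,
                             tol=1e-10, max_iter=1000):
    """Stationary visiting probabilities of a walk restarting at the seeds.

    weights: dict node -> dict neighbour -> edge weight (symmetric graph).
    seeds: set of nodes the walker jumps back to with probability `restart`.
    """
    nodes = sorted(weights)
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    prob = dict(start)
    for _ in range(max_iter):
        nxt = {n: restart * start[n] for n in nodes}
        for n in nodes:
            total = sum(weights[n].values())
            if total == 0:
                continue  # isolated node: nothing to propagate
            for m, w in weights[n].items():
                # Step to a neighbour with probability proportional
                # to the weight of the connecting edge.
                nxt[m] += (1.0 - restart) * prob[n] * w / total
        change = sum(abs(nxt[n] - prob[n]) for n in nodes)
        prob = nxt
        if change < tol:
            break
    return prob

# Toy path graph a - b - c - d with unit weights (hypothetical).
graph = {
    "a": {"b": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"b": 1.0, "d": 1.0},
    "d": {"c": 1.0},
}
scores = random_walk_with_restart(graph, {"a"})
ranking = sorted(scores, key=scores.get, reverse=True)
```

Nodes closer to the seed receive higher scores, so the ranking follows the path away from node a. In an integration setting, one such ranking per data source could be passed to a rank aggregation step, or the sources could first be merged into a single weighted graph.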

    Figure 2. Examples of networks generated before (left) and after (right) spectral clustering. Green balls represent baits in the Mitocheck experiments [67]; colors in the right network represent different complexes/clusters.

    4. Exploiting the network structure

    The prediction methods described above produce pairwise

    protein interaction data sets that can be used to construct

    protein–protein interaction graphs (also called protein

    networks in systems biology) which are natural data structures

    to model relationships between proteins. Several methods

    use the experimentally determined protein interaction network

    graph structure itself as the primary data source to infer

    complex membership. The underlying assumption for these

    methods is that proteins in a complex are more densely

    connected to each other than to the rest of the graph. Over

    the years various clustering algorithms have been applied

    to this problem (for a review see [104]). Most of these

    are heuristics and come with free parameters that are sometimes

    hard to tune. The Markov clustering algorithm (MCL) appears

    to be one of the best methods currently available for clustering

    protein interaction graphs [105, 106]. The MCL simulates a

    random walk on the graph and iteratively prunes the weaker

    edges. An exact analysis of a random walk on a graph

    leads to spectral clustering algorithms which have recently

    been applied to the protein complex prediction problem

    (figure 2) [67, 107, 108] although an early application of spectral clustering for complex detection was described in

    [109]. Prior to these methods, spectral decomposition of

    matrices derived from the interaction graph had also been

    used to find complexes [110, 111]. Although the nature of the

    structure and properties of an entire proteome graph remains

    controversial [112], topological properties have been used to

    guide protein interaction predictions [113].
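
    The random-walk view of MCL can be made concrete with a small sketch. This is a simplified pure-Python version of the expansion/inflation loop by which MCL is usually described, not the reference implementation used in [105, 106]; the inflation value of 2.0, the fixed iteration count and the toy graph (two triangles joined by one edge) are assumptions for illustration.

```python
def mcl(adjacency, inflation=2.0, iterations=50):
    """Cluster a small undirected graph given as a square 0/1 matrix."""
    n = len(adjacency)
    # Add self-loops so every column has mass.
    m = [[float(adjacency[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]

    def normalise(mat):
        # Make each column sum to 1 (column-stochastic matrix).
        sums = [sum(mat[i][j] for i in range(n)) for j in range(n)]
        return [[mat[i][j] / sums[j] for j in range(n)] for i in range(n)]

    m = normalise(m)
    for _ in range(iterations):
        # Expansion: two steps of the random walk (matrix squaring).
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: strengthen strong transitions, weaken (prune) weak ones.
        m = normalise([[x ** inflation for x in row] for row in m])
    # Rows that keep non-negligible mass are attractors; the columns
    # each attractor row covers form one cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return clusters

# Hypothetical graph: triangles {0,1,2} and {3,4,5} linked by edge 2-3.
adjacency = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
clusters = mcl(adjacency)
```

    The inflation exponent is the main free parameter: larger values give finer clusters, which is one instance of the tuning difficulty mentioned above.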

    5. Benchmarking

    5.1. Gold standards

    In order to validate the prediction methods described above, a gold standard reference set is required. Known 3D structures

    formally provide direct evidence of physical interaction

    although care needs to be taken to determine the biological

    unit and ignore irrelevant crystal contacts. The resource

    for this was initially the protein quaternary structure (PQS)

    resource [114] although this has been replaced by PISA [60].

    For evaluating physical interactions in yeast, the most widely

    used gold standard data set was initially the curated MIPS

    protein interaction data set [115]. However, this

    data set was later shown to be highly unrepresentative, with

    over half the interactions coming from ribosomal proteins

    [116]. Recently, more up-to-date gold standard databases have been generated

    [5, 117]. Also despite misgivings about the quality of Yeast-

    2-Hybrid (Y2H) data sets, work has shown that commonly

    used Y2H data sets are of similar quality to other experimental interaction data and even curated data sets [7]. Such Y2H

    data sets can be processed further to give higher quality data

    sets potentially suitable for benchmarking [118]. Certain

    integration methods such as SVMs additionally require a

    negative gold standard data set (i.e. a set of proteins known

    not to interact). A common approach for generating negative

    data sets is to select random pairs of proteins from the genome.

    However, this is not an optimal solution and can lead to various problems such as the prediction method learning the pattern of

    missing values causing over-prediction of associations [95].

    Also unless care is taken the negative data set network can

    have a different structure to the positive data set leading

    to overestimates of performance for certain algorithms [47].

    Recently carefully curated true negative data sets have been

    assembled from the literature [119]. They may help with the

    over-prediction problem in the future although they are likely

    to contain biases (e.g. toward well-studied proteins). Other

    tools are available, providing negatives based on functional

    dissimilarity, subcellular location, non-interacting domain

    pairs [120, 121] and shortest path lengths [122].
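
    The common random-pair approach to negative sets can be sketched as follows. This is an illustration only: the function name, the protein identifiers and the fixed seed are hypothetical, and, as noted above, negatives drawn this way may still contain undiscovered true interactions.

```python
import random

def sample_negative_pairs(proteins, positives, n, seed=0):
    """Draw n random unordered protein pairs that are not known positives."""
    known = {frozenset(p) for p in positives}
    proteins = sorted(set(proteins))
    max_pairs = len(proteins) * (len(proteins) - 1) // 2
    # Conservative guard so the sampling loop below always terminates.
    if n > max_pairs - len(known):
        raise ValueError("not enough candidate pairs")
    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < n:
        pair = frozenset(rng.sample(proteins, 2))
        if pair not in known:
            negatives.add(pair)
    return negatives

# Hypothetical identifiers, for illustration only.
proteins = ["P1", "P2", "P3", "P4", "P5", "P6"]
positives = [("P1", "P2"), ("P3", "P4")]
negatives = sample_negative_pairs(proteins, positives, n=5)
```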

    With regard to validation of functional associations many different resources have been used including KEGG [123], GO


    [76] and Panther [124]. KEGG annotation can be considered

    as a high-quality resource with 1500 genomes annotated.

    KEGG marks up some organisms as manually curated such

    as human, and others as automatically annotated from the

    curated genomes. STRING [14] benchmarks its predictors

    using KEGG to provide an interpretable output score. One

    advantage of using GO is that it is possible to make use of the ontology to define semantic similarities between proteins;

    thus, all pairs of proteins within a certain similarity threshold

    can be considered within the benchmark.

    5.2. Data set bias

    Many of the prediction methods are faced with the problem of

    bias in the available data. For example, supervised methods

    are hampered by the lack of true negative data sets. More

    subtly, biological research is mostly focused on disease-

    related and well-characterized genes. As a consequence, a

    small number of genes and their products contribute a lot of

    (possibly irrelevant [125, 126]) information while for most of the genome little is available. Genome-wide experiments (for

    example [127]) should help alleviate this problem. Several

    large-scale Y2H data sets are available (for example [128]),

    although these are not devoid of experimental biases of their

    own. For example, classic Y2H requires translocation of

    the interacting proteins to the nucleus and does

    not perform well in all cases, for example with membrane-associated

    proteins and transient interactions [129].

    5.3. The importance of independent benchmarks

    A major problem in benchmarking protein association

    prediction methods is the presence of circularity between the data used as source input to the methods and the testing set.

    This circularity can be quite subtle and papers do not always

    take sufficient care to eliminate this issue. For example,

    once knowledge enters one realm (e.g. protein interaction

    databases) it can be quickly integrated into a secondary data

    set (e.g. Reactome). Even the genomic context methods and

    microarray data sets are now partly incorporated into the GO.

    This problem goes further than affecting the benchmarking

    since the lack of an independent test set precludes the

    ability to accurately optimize prediction methods, leading

    to over-fitting. Although it is possible to improve the

    benchmarking independence through careful filtering of data sets, the only safe option is to do experimental validation of the

    predictions. However, this is expensive and often only allows

    a small number of targets to be validated with low statistical

    significance (for example [80]). An alternative is to implement

    a rollback benchmark where source-training data sets are rolled

    back to a given date and the test data are from after this date.

    In practice this approach suffers from social bias in that

    biologists are not testing the predictions but interactions with

    well-characterized, disease-related genes. Also circularity

    is still not completely removed by a rollback benchmark

    since today's text mining associations and interologs are a

    source of tomorrow's curated database entries and protein

    interaction experiments, respectively. In the future a CASP-style benchmark would be a good first step in providing

    real performance measures for the many prediction methods

    available.
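
    A rollback benchmark of the kind described in this section can be sketched as a date-based split. The record layout (a pair plus the date it entered a database), the cutoff and the protein names are hypothetical; the sketch also drops test pairs already present in the training set, which limits but does not remove circularity.

```python
from datetime import date

def rollback_split(records, cutoff):
    """Split dated interaction records into training pairs (on or before
    the cutoff) and test pairs (after the cutoff and unseen in training)."""
    train = {frozenset(pair) for pair, when in records if when <= cutoff}
    test = {frozenset(pair) for pair, when in records if when > cutoff}
    return train, test - train

# Hypothetical records: (protein pair, date the pair was first reported).
records = [
    (("A", "B"), date(2008, 1, 1)),
    (("B", "C"), date(2010, 6, 1)),
    (("A", "B"), date(2010, 7, 1)),  # re-reported later, so not a fair test case
    (("C", "D"), date(2011, 1, 1)),
]
train, test = rollback_split(records, cutoff=date(2009, 12, 31))
```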

    5.4. Real world performance measure

    The expected number of interactions found in an organism

    [130] is much smaller than the total number of possible

    protein pairs, so true positives (TPs) are found very

    infrequently relative to false positives (FPs). As an example,

    suppose that for an organism TPs constitute only 0.1% of all

    possible protein pairs. A predictor with a reported 1% false

    discovery rate on a balanced test set of TPs and TNs would

    still produce ten false predictions for every TP in its real world

    application. The imbalance of TPs to true negatives (TNs)

    should be considered an important factor when considering the

    usefulness of a prediction and the size and type of the validation

    screen required to get a useful number of TP experimental

    validations.
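
    The arithmetic behind this example can be checked with a short calculation. The sketch assumes that the predictor's sensitivity and false positive rate are the same on the balanced benchmark and in real use; the function name is hypothetical.

```python
def real_world_fp_per_tp(balanced_fdr, prevalence):
    """False positives expected per true positive when a predictor
    benchmarked on a balanced set is applied at a given prevalence."""
    # On a balanced set, FP/TP equals (false positive rate) / (sensitivity),
    # and FDR = FP / (FP + TP), so FP/TP = FDR / (1 - FDR).
    fpr_over_sensitivity = balanced_fdr / (1.0 - balanced_fdr)
    # In real use that ratio is scaled by the number of negatives per positive.
    return fpr_over_sensitivity * (1.0 - prevalence) / prevalence

# Values from the text: 1% FDR on a balanced set, 0.1% of pairs interact.
ratio = real_world_fp_per_tp(balanced_fdr=0.01, prevalence=0.001)
real_world_fdr = ratio / (1.0 + ratio)
```

    Under these assumptions `ratio` is about 10, so roughly ten of every eleven predictions (about 91%) would be false in real use.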

    6. Existing resources

    6.1. Online resources

    A quick survey of resources hosting interaction data

    and predicted interaction data is quite daunting (e.g.

    http://ppi.fli-leibniz.de/jcb_ppi_databases.html). The most

    widely used of these is STRING which combines information

    from multiple sources and includes predictions from genomic

    context (gene neighborhood, domain fusion, phylogenetic

    profiles), high-throughput experiments (co-expression) and

    previous knowledge (text mining, known protein interactions).

    The majority of the associations in STRING come from its text

    mining and inherited interactions [14]. STRING v8.3 provides

    information for 2.5 million sequences in 630 organisms with

    regular updates. Another regularly updated resource with

    an easy-to-use interface is the GeneMania resource which has

    both known and predicted protein associations. An alternative

    integration strategy is used by the online resource FuncNet

    (http://funcnet.eu/) which uses the weighted Fisher's approach

    and integrates, online, eight independent prediction methods

    hosted at different geographical locations throughout Europe.
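
    For orientation, the unweighted form of Fisher's method for combining independent p-values can be sketched as below. This is not the weighted variant used by FuncNet, whose weights are not given here; the function name and the example p-values are hypothetical, and independence of the p-values is assumed.

```python
import math

def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's (unweighted) method."""
    k = len(p_values)
    # Test statistic X = -2 * sum(ln p) follows a chi-square distribution
    # with 2k degrees of freedom under the null; we work with X / 2.
    half_stat = -sum(math.log(p) for p in p_values)
    # Closed-form chi-square survival function for even degrees of freedom:
    # exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total

# Two hypothetical predictors that each give weak evidence for a pair.
combined = fisher_combined_p([0.05, 0.05])
```

    Two individually borderline scores combine into stronger evidence (here about 0.017), which is the motivation for integrating several independent predictors.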

    Many prediction methods exist that have been shown to be

    powerful enough for experimentalists to use as part of their

    standard experimental screens (e.g. table 2). Despite this

    even for well-studied organisms such as human there are large portions of the interactome missing. As an example of the

    utility of the integration methods above, we have constructed

    a network using only those genes with no known physical

    interactions (after merging eight public databases). Even

    with these very poorly characterized Ensembl genes we were

    able to construct substantial networks (figure 3). Extreme

    examples such as this suggest that much could be gained

    from experimentalists sampling more of the genome using

    established prediction methods as a guide.

    6.2. Context specific resources

    Experiments are most usually designed to focus on a specific pathway or biological process. Resources such as STRING


    Figure 3. Example networks predicted from FuncNet CODA, FuncNet Hippo and STRING (score filtered at 500) with the database channel removed. The network has been filtered to remove any genes with a known database physical interaction from one of IntAct, MINT, MIPS, STRING, BIOGRID, DIP, HPRD and Reactome. Example subnetworks predominantly made up of phylogenetic profile, CODA or text mining associations (from left to right) are shown.

    Table 2. Example online protein association prediction resources.

    Online resource  URL/reference                                                        Comments
    IBIS             http://www.ncbi.nlm.nih.gov/Structure/ibis/ [59]                     Predicts interactions and binding residues
    FuncNet          http://funcnet.eu/ [86]                                              Integrates eight data sources using Fisher's method
    PPI E.Coli       http://sunserver.cdfd.org.in:8080/protease/PPI/ [93]                 Example of an SVM-based integration resource
    BCI              http://amdec-bioinfo.cu genome.org/html/BCellInteractome.html [131]  Cell-type specific predictions
    I2D              http://ophid.utoronto.ca/ophidv2.201/index.jsp [53]                  Interologs for expanding protein interaction networks
    GeneMania        http://genemania.org/search.jsf [132]                                High coverage of available known associations
    STRING           http://string-db.org/ [14]                                           Largest number of genomes covered

    provide the union of many available protein interactions.

    However, for reasons such as differential expression, any given cell will only express a subset of all protein interactions found in an organism. In view of this, certain resources have been developed that apply contextual information to give interactomes specific for a cell type. One example of such a resource is the B-cell interactome [133], which predicts B-cell specific protein associations. Tailoring of the data in the B-cell interactome, to help ensure B-cell specific interactions, is achieved by filtering to only include those proteins expressed in B-cells, and by including B-cell relevant microarray data sets as inputs to the Bayesian integrator. These B-cell specific networks have been used to extend our knowledge of B-cell biology [131]. Other resources available (POINTILLIST [85]) can be readily tailored with data sources specific to the system of interest [91].
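
    The expression-based filtering step can be illustrated with a short sketch. It is not the B-cell interactome pipeline itself: the function name, the protein identifiers and the expressed set are hypothetical, and only the "both partners expressed" criterion is shown.

```python
def filter_to_context(interactions, expressed):
    """Keep only interactions whose two partners are both expressed
    in the cell type of interest."""
    expressed = set(expressed)
    return [(a, b) for a, b in interactions if a in expressed and b in expressed]

# Hypothetical organism-wide network and cell-type expression set.
interactions = [("P1", "P2"), ("P2", "P3"), ("P3", "P4")]
expressed_in_cell_type = {"P1", "P2", "P3"}
context_network = filter_to_context(interactions, expressed_in_cell_type)
```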

    6.3. PPI prediction pipelines

    Many large-scale experimental projects have been carried out

    [128, 134-138]. Such projects are costly and time consuming

    and a strategy for effective protein pair prioritization is

    desirable. A recent study [45] trialing various approaches

    for this task showed that a protein interaction prediction

    method using a naive Bayes integration of several of the

    methods described in this section (expression data, GO,

    interologs, domain interactions) gave the largest improvement

    in efficiency. Even though this method had a high false

    discovery rate (92%) there were still large reductions in cost

    (>50 fold at 50% coverage) in comparison to not using the

    predicted protein interactions.
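
    A naive Bayes integration of this kind can be sketched as a product of likelihood ratios applied to the prior odds of interaction. The sketch is generic rather than the exact procedure of [45]; the prior, the likelihood ratios and the function name are hypothetical.

```python
def naive_bayes_posterior(prior, likelihood_ratios):
    """Posterior probability of interaction from independent evidence.

    Each likelihood ratio is P(evidence | interact) / P(evidence | not interact).
    """
    odds = prior / (1.0 - prior)
    for lr in likelihood_ratios:
        odds *= lr  # naive assumption: evidence sources are independent
    return odds / (1.0 + odds)

# Hypothetical pair: prior of 0.1% and three supporting evidence types
# (for instance co-expression, shared GO terms and an interolog).
posterior = naive_bayes_posterior(prior=0.001, likelihood_ratios=[20.0, 5.0, 3.0])
```

    Here the posterior is only about 0.23, so most such predictions would still be wrong, yet the pair is far more likely to interact than a random pair. This is why ranking candidates by posterior can cut screening costs even at a high false discovery rate.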


    7. Conclusion

    The genomic context methods provide a fascinating field of

    study, at the juncture of evolutionary theory and modern

    computational biology. Despite the relatively short time these

    methods have been available they have proven to be very

    useful in guiding experiments. There is great potential for greater uptake of these methods by experimentalists. Over

    the coming years we can expect to see improvements in

    the prediction methods particularly genomic context methods

    which will benefit from targeted genome sequencing efforts

    such as the GEBA project [139]. Such projects are expected

    to provide improved sampling, fill in major phylogenetic

    gaps and provide wider evolutionary distances. There is a

    growing list of examples in the literature where they have been

    used successfully when combined by statistical integration

    methods. An example is the application of the FuncNet

    protocol to human mitotic spindle proteins in the ENFIN [140]

    network for systems biology, which combined prediction data

    using Fisher integration and showed an increase in prediction

    accuracy from 35% to 76%. Given the many prediction

    methods available it is likely that greater coordination between

    computational groups will lead to reduced redundancy,

    improved resources and ultimately greater usage of protein

    interaction predictions by experimentalists.

    Acknowledgments

    This work was funded in part by the European Commission

    via the Sixth Framework Program Network of Excellence

    ENFIN (contract number LSHG-CT-2005-518254). JGL and JKH acknowledge funding from ENFIN. JAR acknowledges

    funding from SAF2009-09839 and the Ramón y Cajal program

    (RYC-2007-01649; Ministerio de Ciencia e Innovación,

    Spain). CIBERER is an initiative of the ISCIII.

    References

    [1] Kerrien S et al 2007 IntAct - open source resource for molecular interaction data Nucleic Acids Res. 35 D561-5

    [2] Chatr-aryamontri A, Ceol A, Palazzi L M, Nardelli G, Schneider M V, Castagnoli L and Cesareni G 2007 MINT: the Molecular INTeraction database Nucleic Acids Res. 35 D572-4

    [3] Xenarios I, Rice D W, Salwinski L, Baron M K, Marcotte E M and Eisenberg D 2000 DIP: the database of interacting proteins Nucleic Acids Res. 28 289-91

    [4] Keshava Prasad T S et al 2009 Human protein reference database - 2009 update Nucleic Acids Res. 37 D767-72

    [5] Ruepp A et al 2008 CORUM: the comprehensive resource of mammalian protein complexes Nucleic Acids Res. 36 D646-50

    [6] Stumpf M P, Thorne T, de Silva E, Stewart R, An H J, Lappe M and Wiuf C 2008 Estimating the size of the human interactome Proc. Natl Acad. Sci. USA 105 6959-64

    [7] Venkatesan K et al 2009 An empirical framework for binary interactome mapping Nat. Methods 6 83-90

    [8] Suthram S, Sittler T and Ideker T 2005 The Plasmodium protein network diverges from those of other eukaryotes Nature 438 108-12

    [9] Pellegrini M, Marcotte E M, Thompson M J, Eisenberg D and Yeates T O 1999 Assigning protein functions by comparative genome analysis: protein phylogenetic profiles Proc. Natl Acad. Sci. USA 96 4285-8

    [10] Bowers P M, Cokus S J, Eisenberg D and Yeates T O 2004 Use of logic relationships to decipher protein network organization Science 306 2246-9

    [11] Ranea J A, Yeats C, Grant A and Orengo C A 2007 Predicting protein function with hierarchical phylogenetic profiles: the Gene3D phylo-tuner method applied to eukaryotic genomes PLoS Comput. Biol. 3 e237

    [12] Barker D and Pagel M 2005 Predicting functional gene links from phylogenetic-statistical analyses of whole genomes PLoS Comput. Biol. 1 e3

    [13] Zhou Y, Wang R, Li L, Xia X and Sun Z 2006 Inferring functional linkages between proteins from evolutionary scenarios J. Mol. Biol. 359 1150-9

    [14] Jensen L J et al 2009 STRING 8 - a global view on proteins and their functional interactions in 630 organisms Nucleic Acids Res. 37 D412-6

    [15] Luttgen H et al 2000 Biosynthesis of terpenoids: YchB protein of Escherichia coli phosphorylates the 2-hydroxy group of 4-diphosphocytidyl-2 C-methyl-D-erythritol Proc. Natl Acad. Sci. USA 97 1062-7

    [16] Carlson B A, Xu X M, Kryukov G V, Rao M, Berry M J, Gladyshev V N and Hatfield D L 2004 Identification and characterization of phosphoseryl-tRNA[Ser]Sec kinase Proc. Natl Acad. Sci. USA 101 12848-53

    [17] Forterre P 2002 A hot story from comparative genomics: reverse gyrase is the only hyperthermophile-specific protein Trends Genet. 18 236-7

    [18] Morett E, Korbel J O, Rajan E, Saab-Rincon G, Olvera L, Olvera M, Schmidt S, Snel B and Bork P 2003 Systematic discovery of analogous enzymes in thiamin biosynthesis Nat. Biotechnol. 21 790-5

    [19] Marcotte E M, Pellegrini M, Ng H L, Rice D W, Yeates T O and Eisenberg D 1999 Detecting protein function and protein-protein interactions from genome sequences Science 285 751-3

    [20] Zhang Z et al 2006 Genome-wide analysis of mammalian DNA segment fusion/fission J. Theor. Biol. 240 200-8

    [21] Reid A J, Ranea J A, Clegg A B and Orengo C A 2010 CODA: accurate detection of functional associations between proteins in eukaryotic genomes using domain fusion PLoS ONE 5 e10908

    [22] Gaballa A, Newton G L, Antelmann H, Parsonage D, Upton H, Rawat M, Claiborne A, Fahey R C and Helmann J D 2010 Biosynthesis and functions of bacillithiol, a major low-molecular-weight thiol in Bacilli Proc. Natl Acad. Sci. USA 107 6482-6

    [23] Strong M, Mallick P, Pellegrini M, Thompson M J and Eisenberg D 2003 Inference of protein function and protein linkages in Mycobacterium tuberculosis based on prokaryotic genome organization: a combined computational approach Genome Biol. 4 R59

    [24] Ferrer L, Dale J M and Karp P D 2010 A systematic study of genome context methods: calibration, normalization and combination BMC Bioinformatics 11 493

    [25] Itoh T, Takemoto K, Mori H and Gojobori T 1999 Evolutionary instability of operon structures disclosed by sequence comparisons of complete microbial genomes Mol. Biol. Evol. 16 332-46

    [26] Koonin E V, Wolf Y I and Aravind L 2001 Prediction of the archaeal exosome and its connections with the proteasome and the translation and transcription machineries by a comparative-genomic approach Genome Res. 11 240-52

    [27] Evguenieva-Hackenberg E, Walter P, Hochleitner E, Lottspeich F and Klug G 2003 An exosome-like complex in Sulfolobus solfataricus EMBO Rep. 4 889-93


    [28] Tuncbag N, Gursoy A, Guney E, Nussinov R and Keskin O 2008 Architectures and functional coverage of protein-protein interfaces J. Mol. Biol. 381 785-802

    [29] Tuncbag N, Kar G, Keskin O, Gursoy A and Nussinov R 2009 A survey of available tools and web servers for analysis of protein-protein interactions and interfaces Brief Bioinform 10 217-32

    [30] Pazos F, Helmer-Citterich M, Ausiello G and Valencia A 1997 Correlated mutations contain information about protein-protein interaction J. Mol. Biol. 271 511-23

    [31] Juan D, Pazos F and Valencia A 2008 Co-evolution and co-adaptation in protein networks FEBS Lett. 582 1225-30

    [32] Kann M G, Shoemaker B A, Panchenko A R and Przytycka T M 2009 Correlated evolution of interacting proteins: looking behind the mirrortree J. Mol. Biol. 385 91-8

    [33] Pazos F and Valencia A 2001 Similarity of phylogenetic trees as indicator of protein-protein interaction Protein Eng. 14 609-14

    [34] Pazos F, Ranea J A, Juan D and Sternberg M J 2005 Assessing protein co-evolution in the context of the tree of life assists in the prediction of the interactome J. Mol. Biol. 352 1002-15

    [35] Izarzugaza J M, Juan D, Pons C, Ranea J A, Valencia A and Pazos F 2006 TSEMA: interactive prediction of protein pairings between interacting families Nucleic Acids Res. 34 W315-9

    [36] Itzhaki Z, Akiva E, Altuvia Y and Margalit H 2006 Evolutionary conservation of domain-domain interactions Genome Biol. 7 R125

    [37] Finn R D et al 2008 The Pfam protein families database Nucleic Acids Res. 36 D281-8

    [38] Stein A, Panjkovich A and Aloy P 2009 3did Update: domain-domain and peptide-mediated interactions of known 3D structure Nucleic Acids Res. 37 D300-4

    [39] Eddy S R 2009 A new generation of homology search tools based on probabilistic inference Genome Inform 23 205-11

    [40] Lees J, Yeats C, Redfern O, Clegg A and Orengo C 2010 Gene3D: merging structure and function for a thousand genomes Nucleic Acids Res. 38 D296-300

    [41] Kim W K, Park J and Suh J K 2002 Large scale statistical prediction of protein-protein interaction by potentially interacting domain (PID) pair Genome Inform 13 42-50

    [42] Chen X W and Liu M 2005 Prediction of protein-protein interactions using random decision forest framework Bioinformatics 21 4394-400

    [43] Luo Q, Pagel P, Vilne B and Frishman D 2011 DIMA 3.0: domain interaction map Nucleic Acids Res. 39 D724-9

    [44] Bjorkholm P and Sonnhammer E L 2009 Comparative analysis and unification of domain-domain interaction networks Bioinformatics 25 3020-5

    [45] Schwartz A S, Yu J, Gardenour K R, Finley R L Jr and Ideker T 2009 Cost-effective strategies for completing the interactome Nat. Methods 6 55-61

    [46] Ben-Hur A and Noble W S 2005 Kernel methods for predicting protein-protein interactions Bioinformatics 21 (Suppl. 1) i38-46

    [47] Yu J, Guo M, Needham C J, Huang Y, Cai L and Westhead D R 2010 Simple sequence-based kernels do not predict protein-protein interactions Bioinformatics 26 2610-4

    [48] Matthews L R, Vaglio P, Reboul J, Ge H, Davis B P, Garrels J, Vincent S and Vidal M 2001 Identification of potential interaction networks using sequence-based searches for conserved protein-protein interactions or interologs Genome Res. 11 2120-6

    [49] Mika S and Rost B 2006 Protein-protein interactions more conserved within species than across species PLoS Comput. Biol. 2 e79

    [50] Persico M, Ceol A, Gavrila C, Hoffmann R, Florio A and Cesareni G 2005 HomoMINT: an inferred human network based on orthology mapping of protein interactions discovered in model organisms BMC Bioinformatics 6 (Suppl. 4) S21

    [51] Kemmer D et al 2005 Ulysses - an application for the projection of molecular interactions across species Genome Biol. 6 R106

    [52] Huang T W, Lin C Y and Kao C Y 2007 Reconstruction of human protein interolog network using evolutionary conserved network BMC Bioinformatics 8 152

    [53] Brown K R and Jurisica I 2007 Unequal evolutionary conservation of human protein interactions in interologous networks Genome Biol. 8 R95

    [54] Ezkurdia I, Bartoli L, Fariselli P, Casadio R, Valencia A and Tress M L 2009 Progress and challenges in predicting protein-protein interaction sites Brief Bioinform 10 233-46

    [55] Aloy P and Russell R B 2003 InterPreTS: protein interaction prediction through tertiary structure Bioinformatics 19 161-2

    [56] Singh R, Park D, Xu J, Hosur R and Berger B 2010 Struct2Net: a web service to predict protein-protein interactions using a structure-based approach Nucleic Acids Res. 38 (Suppl.) W508-15

    [57] Hosur R, Xu J, Bienkowska J and Berger B 2011 iWRAP: an interface threading approach with application to prediction of cancer-related protein-protein interactions J. Mol. Biol. 405 1295-310

    [58] Zhang Q C, Petrey D, Norel R and Honig B H 2010 Protein interface conservation across structure space Proc. Natl Acad. Sci. USA 107 10896-901

    [59] Shoemaker B A, Zhang D, Thangudu R R, Tyagi M, Fong J H, Marchler-Bauer A, Bryant S H, Madej T and Panchenko A R 2010 Inferred biomolecular interaction server - a web server to analyze and predict protein interacting partners and binding sites Nucleic Acids Res. 38 D518-24

    [60] Krissinel E and Henrick K 2007 Inference of macromolecular assemblies from crystalline state J. Mol. Biol. 372 774-97

    [61] Hue M, Riffle M, Vert J P and Noble W S 2010 Large-scale prediction of protein-protein interactions from structures BMC Bioinformatics 11 144

    [62] Grigoriev A 2001 A relationship between gene expression and protein interactions on the proteome scale: analysis of the bacteriophage T7 and the yeast Saccharomyces cerevisiae Nucleic Acids Res. 29 3513-9

    [63] Bhardwaj N and Lu H 2005 Correlation between gene expression profiles and protein-protein interactions within and across genomes Bioinformatics 21 2730-8

    [64] Lukk M, Kapushesky M, Nikkila J, Parkinson H, Goncalves A, Huber W, Ukkonen E and Brazma A 2010 A global map of human gene expression Nat. Biotechnol. 28 322-4

    [65] Adler P, Kolde R, Kull M, Tkachenko A, Peterson H, Reimand J and Vilo J 2009 Mining for coexpression across hundreds of datasets using novel rank aggregation and visualization methods Genome Biol. 10 R139

    [66] Stuart J M, Segal E, Koller D and Kim S K 2003 A gene-coexpression network for global discovery of conserved genetic modules Science 302 249-55

    [67] Hutchins J R et al 2010 Systematic analysis of human protein complexes identifies chromosome segregation proteins Science 328 593-9

    [68] van Steensel B, Braunschweig U, Filion G J, Chen M, van Bemmel J G and Ideker T 2010 Bayesian network analysis of targeting interactions in chromatin Genome Res. 20 190-200

8-152http://dx.doi.org/10.1186/gb-2005-6-12-r106http://dx.doi.org/10.1186/1471-2105-6-S4-S21http://dx.doi.org/10.1371/journal.pcbi.0020079http://dx.doi.org/10.1101/gr.205301http://dx.doi.org/10.1093/bioinformatics/btq483http://dx.doi.org/10.1093/bioinformatics/bti1016http://dx.doi.org/10.1038/nmeth.1283http://dx.doi.org/10.1093/bioinformatics/btp522http://dx.doi.org/10.1093/nar/gkq1200http://dx.doi.org/10.1093/bioinformatics/bti721http://dx.doi.org/10.1093/nar/gkp987http://dx.doi.org/10.1142/9781848165632_0019http://dx.doi.org/10.1093/nar/gkn690http://dx.doi.org/10.1093/nar/gkm960http://dx.doi.org/10.1186/gb-2006-7-12-r125http://dx.doi.org/10.1093/nar/gkl112http://dx.doi.org/10.1016/j.jmb.2005.07.005http://dx.doi.org/10.1093/protein/14.9.609http://dx.doi.org/10.1016/j.jmb.2008.09.078http://dx.doi.org/10.1016/j.febslet.2008.02.017http://dx.doi.org/10.1006/jmbi.1997.1198http://dx.doi.org/10.1093/bib/bbp001http://dx.doi.org/10.1016/j.jmb.2008.04.071
  • 8/2/2019 (1) survey 2011

    13/14

    Phys. Biol. 8 (2011) 035008 J G Lees et al

    [69] Schaefer C F, Anthony K, Krupa S, Buchoff J, Day M, Hannay T and Buetow K H 2009 PID: the pathway interaction database Nucleic Acids Res. 37 D674-9

    [70] Matthews L et al 2009 Reactome knowledgebase of human biological pathways and processes Nucleic Acids Res. 37 D619-22

    [71] Cherry J M et al 1998 SGD: Saccharomyces genome database Nucleic Acids Res. 26 73-9

    [72] Amberger J, Bocchini C A, Scott A F and Hamosh A 2009 McKusick's online Mendelian inheritance in man (OMIM) Nucleic Acids Res. 37 D793-6

    [73] Tweedie S et al 2009 FlyBase: enhancing Drosophila Gene Ontology annotations Nucleic Acids Res. 37 D555-9

    [74] Blaschke C, Hoffmann R, Oliveros J C and Valencia A 2001 Extracting information automatically from biological literature Comp. Funct. Genomics 2 310-3

    [75] Tikk D, Thomas P, Palaga P, Hakenberg J and Leser U 2010 A comprehensive benchmark of kernel methods to extract protein-protein interactions from literature PLoS Comput. Biol. 6 e1000837

    [76] Ashburner M et al 2000 Gene ontology: tool for the unification of biology. The Gene Ontology Consortium Nat. Genet. 25 25-9

    [77] Pesquita C, Faria D, Falcao A O, Lord P and Couto F M 2009 Semantic similarity in biomedical ontologies PLoS Comput. Biol. 5 e1000443

    [78] Lord P W, Stevens R D, Brass A and Goble C A 2003 Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation Bioinformatics 19 1275-83

    [79] Scott M S and Barton G J 2007 Probabilistic prediction and ranking of human protein-protein interactions BMC Bioinformatics 8 239

    [80] Jansen R et al 2003 A Bayesian networks approach for predicting protein-protein interactions from genomic data Science 302 449-53

    [81] Bowers P M, Pellegrini M, Thompson M J, Fierro J, Yeates T O and Eisenberg D 2004 Prolinks: a database of protein functional linkages derived from coevolution Genome Biol. 5 R35

    [82] Sun J, Sun Y, Ding G, Liu Q, Wang C, He Y, Shi T, Li Y and Zhao Z 2007 InPrePPI: an integrated evaluation method based on genomic context for predicting protein-protein interactions in prokaryotic genomes BMC Bioinformatics 8 414

    [83] Wilkinson D J 2007 Bayesian methods in bioinformatics and computational systems biology Brief. Bioinform. 8 109-16

    [84] Lu L J, Xia Y, Paccanaro A, Yu H and Gerstein M 2005 Assessing the limits of genomic data integration for predicting protein networks Genome Res. 15 945-53

    [85] Hwang D et al 2005 A data integration methodology for systems biology Proc. Natl Acad. Sci. USA 102 17296-301

    [86] Ranea J A, Morilla I, Lees J G, Reid A J, Yeats C, Clegg A B, Sanchez-Jimenez F and Orengo C 2010 Finding the dark matter in human and yeast protein network prediction and modelling PLoS Comput. Biol. 6 e1000945

    [87] Shawe-Taylor J and Cristianini N (eds) 2004 Kernel Methods for Pattern Analysis (Cambridge: Cambridge University Press)

    [88] Qiu J and Noble W S 2008 Predicting co-complexed protein pairs from heterogeneous data PLoS Comput. Biol. 4 e1000054

    [89] Zhou D and Scholkopf B 2004 A regularization framework for learning from graph data ICML Workshop on Statistical Relational Learning

    [90] Xia K, Dong D and Han J D 2006 IntNetDB v1.0: an integrated protein-protein interaction network database generated by a probabilistic model BMC Bioinformatics 7 508

    [91] Hwang D et al 2005 A data integration methodology for systems biology: experimental verification Proc. Natl Acad. Sci. USA 102 17302-7

    [92] Qi Y, Bar-Joseph Z and Klein-Seetharaman J 2006 Evaluation of different biological data and computational classification methods for use in protein interaction prediction Proteins 63 490-500

    [93] Yellaboina S, Goyal K and Mande S C 2007 Inferring genome-wide functional linkages in E. coli by combining improved genome context methods: comparison with high-throughput experimental data Genome Res. 17 527-35

    [94] Qi Y, Klein-Seetharaman J and Bar-Joseph Z 2005 Random forest similarity for protein-protein interaction prediction from multiple sources Pac. Symp. Biocomput. 10 531-42

    [95] Mohamed T P, Carbonell J G and Ganapathiraju M K 2010 Active learning for human protein-protein interaction prediction BMC Bioinformatics 11 (Suppl. 1) S57

    [96] Geurts P, Irrthum A and Wehenkel L 2009 Supervised learning with decision tree-based methods in computational and systems biology Mol. Biosyst. 5 1593-605

    [97] Lin N, Wu B, Jansen R, Gerstein M and Zhao H 2004 Information assessment on predicting protein-protein interactions BMC Bioinformatics 5 154

    [98] Sprinzak E, Altuvia Y and Margalit H 2006 Characterization and prediction of protein-protein interactions within and between complexes Proc. Natl Acad. Sci. USA 103 14718-23

    [99] Qi Y, Klein-Seetharaman J and Bar-Joseph Z 2007 A mixture of feature experts approach for protein-protein interaction prediction BMC Bioinformatics 8 (Suppl. 10) S6

    [100] Fouss F, Francoisse K, Yen L, Pirotte A and Saerens M 2006 An experimental investigation of graph kernels on a collaborative recommendation task Proc. 6th Int. Conf. on Data Mining pp 863-8

    [101] Kohler S, Bauer S, Horn D and Robinson P N 2008 Walking the interactome for prioritization of candidate disease genes Am. J. Human Genet. 82 949-58

    [102] Li Y and Patra J C 2010 Integration of multiple data sources to prioritize candidate genes using discounted rating system BMC Bioinformatics 11 (Suppl. 1) S20

    [103] Li Y and Patra J C 2010 Genome-wide inferring gene-phenotype relationship by walking on the heterogeneous network Bioinformatics 26 1219-24

    [104] Li X, Wu M, Kwoh C K and Ng S K 2010 Computational approaches for detecting protein complexes from protein interaction networks: a survey BMC Genomics 11 (Suppl. 1) S3

    [105] Brohee S and van Helden J 2006 Evaluation of clustering algorithms for protein-protein interaction networks BMC Bioinformatics 7 488

    [106] Vlasblom J and Wodak S J 2009 Markov clustering versus affinity propagation for the partitioning of protein interaction graphs BMC Bioinformatics 10 99

    [107] Inoue K, Li W and Kurata H 2010 Diffusion model based spectral clustering for protein-protein interaction networks PLoS ONE 5 e12623

    [108] Qin G and Gao L 2010 Spectral clustering for detecting protein complexes in protein-protein interaction (PPI) networks Math. Comput. Modell. 52 2066-74

    [109] Ding C, He X, Meraz R F and Holbrook S R 2004 A unified representation of multiprotein complex data for modeling interaction networks Proteins 57 99-108

    [110] Bu D et al 2003 Topological structure analysis of the protein-protein interaction network in budding yeast Nucleic Acids Res. 31 2443-50

    [111] Sen T Z, Kloczkowski A and Jernigan R L 2006 Functional clustering of yeast proteins from the protein-protein interaction network BMC Bioinformatics 7 355

    [112] Lima-Mendez G and van Helden J 2009 The powerful law of the power law and other myths in network biology Mol. Biosyst. 5 1482-93

    [113] Gomez S M and Rzhetsky A 2002 Towards the prediction of complete protein-protein interaction networks Pac. Symp. Biocomput. 7 413-24

    [114] Henrick K and Thornton J M 1998 PQS: a protein quaternary structure file server Trends Biochem. Sci. 23 358-61

    [115] Guldener U, Munsterkotter M, Oesterheld M, Pagel P, Ruepp A, Mewes H W and Stumpflen V 2006 MPact: the MIPS protein interaction resource on yeast Nucleic Acids Res. 34 D436-41

    [116] Hart G T, Lee I and Marcotte E R 2007 A high-accuracy consensus map of yeast protein complexes reveals modular nature of gene essentiality BMC Bioinformatics 8 236

    [117] Pu S, Wong J, Turner B, Cho E and Wodak S J 2009 Up-to-date catalogues of yeast protein complexes Nucleic Acids Res. 37 825-31

    [118] Yu H et al 2008 High-quality binary protein interaction map of the yeast interactome network Science 322 104-10

    [119] Smialowski P et al 2010 The Negatome database: a reference set of non-interacting protein pairs Nucleic Acids Res. 38 D540-4

    [120] Browne F, Wang H, Zheng H and Azuaje F 2009 GRIP: a web-based system for constructing gold standard datasets for protein-protein interaction prediction Source Code Biol. Med. 4 2

    [121] Chen X W, Jeong J C and Dermyer P 2010 KUPS: constructing datasets of interacting and non-interacting protein pairs with associated attributions Nucleic Acids Res. 39 D750-4

    [122] Sharan R, Suthram S, Kelley R M, Kuhn T, McCuine S, Uetz P, Sittler T, Karp R M and Ideker T 2005 Conserved patterns of protein interaction in multiple species Proc. Natl Acad. Sci. USA 102 1974-9

    [123] Kanehisa M et al 2008 KEGG for linking genomes to life and the environment Nucleic Acids Res. 36 D480-4

    [124] Thomas P D, Campbell M J, Kejariwal A, Mi H, Karlak B, Daverman R, Diemer K, Muruganujan A and Narechania A 2003 PANTHER: a library of protein families and subfamilies indexed by function Genome Res. 13 2129-41

    [125] Ioannidis J P 2007 Why most published research findings are false: author's reply to Goodman and Greenland PLoS Med. 4 e215

    [126] Pfeiffer T, Rand D G and Dreber A 2009 Decision-making in research tasks with sequential testing PLoS ONE 4 e4607

    [127] Neumann B et al 2010 Phenotypic profiling of the human genome by time-lapse microscopy reveals cell division genes Nature 464 721-7

    [128] Uetz P et al 2000 A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae Nature 403 623-7

    [129] Russell R B and Aloy P 2008 Targeting and tinkering with interaction networks Nat. Chem. Biol. 4 666-73

    [130] Hart G T, Ramani A K and Marcotte E M 2006 How complete are current yeast and human protein-interaction networks? Genome Biol. 7 120

    [131] Lefebvre C et al 2010 A human B-cell interactome identifies MYB and FOXM1 as master regulators of proliferation in germinal centers Mol. Syst. Biol. 6 377

    [132] Warde-Farley D et al 2010 The GeneMANIA prediction server: biological network integration for gene prioritization and predicting gene function Nucleic Acids Res. 38 W214-20

    [133] Lefebvre C, Lim W K, Basso K, dalla-Favera R and Califano A 2007 A context-specific network of protein-DNA and protein-protein interactions reveals new regulatory motifs in human B cells Lect. Notes Bioinformatics (LNCS) 4532 42-56

    [134] Krogan N J et al 2006 Global landscape of protein complexes in the yeast Saccharomyces cerevisiae Nature 440 637-43

    [135] Gavin A C et al 2006 Proteome survey reveals modularity of the yeast cell machinery Nature 440 631-6

    [136] Li S et al 2004 A map of the interactome network of the metazoan C. elegans Science 303 540-3

    [137] Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M and Sakaki Y 2001 A comprehensive two-hybrid analysis to explore the yeast protein interactome Proc. Natl Acad. Sci. USA 98 4569-74

    [138] Giot L et al 2003 A protein interaction map of Drosophila melanogaster Science 302 1727-36

    [139] Wu D et al 2009 A phylogeny-driven genomic encyclopaedia of bacteria and archaea Nature 462 1056-60

    [140] Kahlem P and Birney E 2007 ENFIN - a network to enhance integrative systems biology Ann. New York Acad. Sci. 1115 23-31
