

Topological Network Alignment Uncovers Biological Function and Phylogeny

Oleksii Kuchaiev¹, Tijana Milenković¹, Vesna Memišević, Wayne Hayes, Nataša Pržulj²

Department of Computer Science, University of California, Irvine

¹ These authors contributed equally to this work. ² Corresponding author (e-mail: [email protected])

Introduction

Results

Fig. 1: The thirty 2- to 5-node graphlets G0, G1,..., G29 and their 73 symmetry groups, i.e., automorphism orbits; in graphlet Gi, i ∈ {0, 1,..., 29}, nodes belonging to the same orbit are of the same shade [1,2,3].

Network alignment is the problem of finding structural (i.e., topological) similarities between networks. Solving it exactly is computationally infeasible because the underlying subgraph isomorphism problem is NP-complete, so heuristics must be sought. There are two core challenges: scoring “similarities” between nodes, and rapidly identifying high-scoring alignments from the exponentially large set of possible alignments.

When aligning two graphs G(V,E) and H(U,F), where |V| ≤ |U|, GRAAL first computes the cost of topologically aligning each node in G with each node in H. It relies on graphlets, small induced subgraphs of a network (Fig. 1) [1]. The cost of aligning two nodes is based on their topological signature similarity [2], where the signature of a node, also called its graphlet degree vector [2,3], describes the topology of its neighborhood: it is a 73-dimensional vector whose 1st coordinate is the node’s degree, corresponding to the number of times the node touches G0 at orbit 0 in Fig. 1, whose 2nd coordinate is the number of times the node touches G1 at orbit 1 in Fig. 1, and so on for all 73 orbits in Fig. 1. Fig. 2 illustrates the graphlet degree vector of a node.
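To make the idea of a graphlet degree vector concrete, the following minimal Python sketch counts only the first four coordinates for one node. It assumes the orbit numbering commonly used in [1,2,3]: orbit 0 is the degree, orbit 1 is the end of an induced 3-node path, orbit 2 is the middle of such a path, and orbit 3 is a triangle. Only orbits 0 and 1 are named in the text above, so the meaning of orbits 2 and 3, the function name `small_orbit_counts`, and the adjacency-dictionary representation are illustrative assumptions; a full signature would need all 73 orbits.

```python
from itertools import combinations


def small_orbit_counts(adj, v):
    """Return counts for orbits 0-3 of node v in an undirected simple graph.

    adj maps each node to the set of its neighbours (hypothetical format).
    """
    nbrs = adj[v]
    # Orbit 0: the degree, i.e. the number of edges that v touches.
    orbit0 = len(nbrs)
    # Orbit 3 (assumed numbering): neighbour pairs that are adjacent
    # close a triangle at v.
    orbit3 = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    # Orbit 2 (assumed numbering): the remaining neighbour pairs make v
    # the middle node of an induced 3-node path.
    orbit2 = orbit0 * (orbit0 - 1) // 2 - orbit3
    # Orbit 1: v is an end of an induced path v - a - w whenever w is a
    # neighbour of a that is neither v itself nor adjacent to v.
    orbit1 = sum(1 for a in nbrs for w in adj[a] if w != v and w not in nbrs)
    return [orbit0, orbit1, orbit2, orbit3]
```

For a triangle a-b-c with a pendant node d attached to c, node c has degree 3, sits in one triangle, and is the middle of two induced paths (a-c-d and b-c-d).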

The distance between the i-th orbits of nodes u and v is denoted D_i(u,v); it is scaled by w_i, a weight of orbit i that accounts for dependencies between orbits (see [2] for details). The total distance between nodes u and v is denoted D(u,v).

Finally, the signature similarity, S(u,v), between nodes u and v is: S(u,v) = 1 − D(u,v).
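The computation can be sketched in a few lines of Python. This sketch assumes the orbit distance has the form D_i(u,v) = w_i · |log(u_i + 1) − log(v_i + 1)| / log(max{u_i, v_i} + 2) and that the total distance is the sum of the orbit distances divided by the sum of the weights, as in [2]. The function name and the choice to pass the weights in as a plain list are illustrative assumptions; the actual weight values are defined in [2] and are not reproduced here.

```python
import math


def signature_similarity(gdv_u, gdv_v, weights):
    """Return S(u,v) = 1 - D(u,v) for two graphlet degree vectors.

    gdv_u, gdv_v: orbit counts of nodes u and v (same length).
    weights: one non-negative weight per orbit, supplied by the caller.
    """
    if not (len(gdv_u) == len(gdv_v) == len(weights)):
        raise ValueError("vectors and weights must have the same length")
    total = 0.0
    for u_i, v_i, w_i in zip(gdv_u, gdv_v, weights):
        # Orbit distance D_i: weighted, log-scaled difference of the counts.
        numerator = abs(math.log(u_i + 1) - math.log(v_i + 1))
        total += w_i * numerator / math.log(max(u_i, v_i) + 2)
    # Total distance D: normalise by the sum of the weights.
    distance = total / sum(weights)
    return 1.0 - distance
```

Two nodes with identical vectors get similarity 1; the logarithms keep orbits with very large counts from dominating the sum.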

GRAAL is a seed-and-extend approach that aligns the densest parts of the networks first. It chooses as the initial seed a pair of nodes (v,u), v from V and u from U, with the smallest cost; ties are broken randomly. Once the seed is found, GRAAL builds “spheres” of all possible radii around nodes v and u; a sphere of radius r around node v is the set of nodes S_G(v,r) = {x ∈ V: d(v,x) = r}, where d(v,x) is the length of a shortest path from v to x. Spheres of the same radius in G and H are then greedily aligned by searching for pairs (v′,u′) that are not already aligned and that can be aligned with minimal cost. GRAAL can align a path of length up to 2 in G to a single edge in H, which is analogous to allowing “insertions” or “deletions” in sequence alignment. GRAAL stops when each node from G is aligned to exactly one node in H. For more details, see [4].
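The seed and sphere steps can be illustrated with a simplified Python sketch. It is not the GRAAL implementation: it performs a single seed-and-extend pass, breaks ties deterministically by node name rather than randomly, and omits both re-seeding for nodes left unaligned and the path-of-length-2 extension. The function names, the adjacency-dictionary format, and the `cost` dictionary keyed by node pairs are all assumptions made for illustration.

```python
from collections import deque


def spheres(adj, center):
    """Group nodes by shortest-path distance from `center` using BFS."""
    dist = {center: 0}
    queue = deque([center])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    by_radius = {}
    for node, r in dist.items():
        if r > 0:
            by_radius.setdefault(r, set()).add(node)
    return by_radius


def seed_and_extend_once(adj_g, adj_h, cost):
    """One simplified pass: pick the cheapest seed, then align spheres.

    cost maps every pair (node of G, node of H) to its alignment cost.
    """
    # Seed: the pair with the smallest cost (first minimum, not random).
    seed_g, seed_h = min(cost, key=cost.get)
    alignment = {seed_g: seed_h}
    used_h = {seed_h}
    sph_g, sph_h = spheres(adj_g, seed_g), spheres(adj_h, seed_h)
    for r in sorted(set(sph_g) & set(sph_h)):
        # Candidate pairs within spheres of the same radius, cheapest first.
        pairs = sorted((cost[(x, y)], x, y) for x in sph_g[r] for y in sph_h[r])
        for _, x, y in pairs:
            if x not in alignment and y not in used_h:
                alignment[x] = y
                used_h.add(y)
    return alignment
```

Nodes of G that remain unaligned after this pass (for example, those in spheres with no same-radius counterpart in H) would need further handling, which the sketch deliberately leaves out.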

We use edge correctness (EC) scores to measure the quality of alignments produced by GRAAL. Given an alignment g: V → U, where |V| ≤ |U|, its EC is the percentage of edges in G that are aligned to edges in H.
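As a small worked illustration, edge correctness can be computed as follows. The sketch assumes undirected graphs without self-loops, given as lists of node pairs, and an alignment given as a dictionary from nodes of G to nodes of H; these representations and the function name are assumptions for the example.

```python
def edge_correctness(edges_g, edges_h, g):
    """Percentage of edges (u,v) of G whose image (g[u], g[v]) is an edge of H."""
    # Store undirected edges as frozensets so (u,v) and (v,u) coincide.
    h_edges = {frozenset(e) for e in edges_h}
    g_edges = {frozenset(e) for e in edges_g}
    preserved = sum(
        1 for e in g_edges if frozenset(g[x] for x in e) in h_edges
    )
    return 100.0 * preserved / len(g_edges)
```

For example, if G is the path a-b-c, H is the path p-q-r-s, and the alignment maps a to p, b to q, and c to s, then only the edge (a,b) is preserved and the EC is 50%.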

Fig. 4: Comparison of the phylogenetic trees for protists obtained by genetic sequence alignments (left) and by GRAAL’s metabolic network alignments (right). The following abbreviations are used for species: CHO - Cryptosporidium hominis, DDI - Dictyostelium discoideum, CPV - Cryptosporidium parvum, PFA - Plasmodium falciparum, EHI - Entamoeba histolytica, TAN - Theileria annulata, TPV - Theileria parva. The species are grouped into the following classes: “Alveolates,” “Entamoeba,” and “Cellular Slime mold.”

Fig. 3: (A) The largest common connected subgraph (CCS) in GRAAL’s yeast-human alignment, consisting of 1,001 interactions amongst 290 proteins. (B) The second largest CCS that has the same biological function (splicing) in both yeast and human PPI networks. Each node contains a label denoting a pair of aligned yeast and human proteins.

Biological Significance. (i) Across our entire alignment, we find that 42%, 12.5%, 4.6%, 1.7%, and 0.55% of aligned protein pairs share at least one, two, three, four, and five Gene Ontology (GO) terms, respectively. Compared to random alignments, the p-values for these percentages are all in the 10^-2 to 10^-3 range. Our results are superior to those produced by other methods. (ii) GRAAL identifies large conserved functional modules across species (Fig. 3B shows an example). (iii) From the list of yeast mitochondria-related genes that have human orthologs involved in mitochondrial disease, GRAAL successfully aligns 30% of such yeast proteins to human mitochondrial disease genes, with a p-value of about 10^-4. (iv) GRAAL aligns human cancer proteins with yeast proteins whose orthologs in human are involved in cancer. See [4] for details.

Protein Function Prediction. With the above validations in hand, we believe that GRAAL’s yeast-human alignment can be used to predict biological characteristics of un-annotated proteins based on their alignments with annotated ones. Of the 2,390 protein pairs in the alignment, we make protein function predictions for 36 human and 228 yeast proteins. We validate in the literature 39% of our human predictions and 38% of our yeast predictions, while the remainder were neither corroborated nor contradicted.

Phylogeny. In the KEGG pathway database, there are 17 eukaryotic organisms with fully sequenced genomes, of which seven are protists, six are fungi, two are plants, and two are animals. Here we focus on protists. For each of the seven organisms, we extract the union of all metabolic pathways from KEGG, and then we use GRAAL to find all-to-all pairwise network alignments between the organisms.

We create the phylogenetic tree using the nearest distance algorithm, with pairwise edge correctness as the distance measure. Our phylogenetic tree for protists is both statistically significant (p-value < 2.2×10^-4) and similar to the published one obtained from genetic sequence alignments (Fig. 4). Hence, topological alignments produced by GRAAL can be used as an independent source of phylogenetic information.
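One way such a tree could be built from pairwise scores is sketched below. The sketch rests on two assumptions that the text does not confirm: that the “nearest distance algorithm” behaves like single-linkage agglomerative clustering, and that an EC percentage is turned into a distance as 1 − EC/100 so that better-aligned pairs are closer. The function names are hypothetical, and the tree is returned as nested tuples without branch lengths.

```python
def ec_to_distance(ec_percent):
    """Assumed conversion: a higher edge correctness means a smaller distance."""
    return 1.0 - ec_percent / 100.0


def single_linkage_tree(labels, dist):
    """Repeatedly merge the two closest clusters; return nested tuples.

    dist maps frozenset({a, b}) to the distance between labels a and b.
    """
    # Each cluster is a pair (subtree, set of member labels).
    clusters = [(lab, frozenset([lab])) for lab in labels]

    def cluster_distance(c1, c2):
        # Single linkage: distance between the closest pair of members.
        return min(dist[frozenset((a, b))] for a in c1[1] for b in c2[1])

    while len(clusters) > 1:
        index_pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(
            index_pairs,
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = (
            (clusters[i][0], clusters[j][0]),
            clusters[i][1] | clusters[j][1],
        )
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]
```

With four placeholder organisms A, B, C, and D, where the pairs A-B and C-D have much higher EC than any cross pair, the function returns `(("A", "B"), ("C", "D"))`.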

Acknowledgments and References

This project was supported by the NSF CAREER IIS-0644424 grant.

[1] N. Pržulj, D. G. Corneil, and I. Jurisica, Bioinformatics, 20(18), 2004.
[2] T. Milenković and N. Pržulj, Cancer Informatics, 6, 2008.
[3] N. Pržulj, Bioinformatics, 23, 2007.
[4] O. Kuchaiev, T. Milenković, V. Memišević, W. Hayes, and N. Pržulj, arXiv:0810.3280v1 [q-bio.MN], 2009.
[5] D. Higham, M. Rašajski, and N. Pržulj, Bioinformatics, 24(8), 2008.

Sequence comparison and alignment have had an enormous impact on our understanding of evolution, biology, and disease. Comparison and alignment of biological networks will likely have a similar impact. Existing network alignments in comparative proteomics use information external to the networks, such as protein sequence, because no good algorithm for purely topological alignment has yet been devised. We present GRAAL (GRAph ALigner), an algorithm based solely on network topology that can be used to align any two networks, not necessarily biological ones. We apply it to biological networks to produce by far the most complete topological alignments to date. Both species phylogeny and detailed biological function of individual proteins can be extracted from our alignments. Our alignment of the protein-protein interaction (PPI) networks of two very different species, yeast and human, indicates that even distant species share a surprising amount of network topology, suggesting broad similarities in internal cellular wiring across all life on Earth.

Fig. 2: Generalization of the degree of node v (left) into its graphlet degree vector (GDV(v)) that counts the number of different graphlets that the node touches, such as triangles (middle) or squares (right). Values of the 73 coordinates (corresponding to orbits) of GDV(v) are presented in the table.

The signature similarity is computed as follows [2]. For a node u from G, u_i denotes the i-th coordinate of its signature vector, i.e., u_i is the number of times node u touches orbit i in G.

$$D_i(u,v) = w_i \times \frac{\left|\log(u_i + 1) - \log(v_i + 1)\right|}{\log(\max\{u_i, v_i\} + 2)}$$

$$D(u,v) = \frac{\sum_{i=0}^{72} D_i(u,v)}{\sum_{i=0}^{72} w_i}$$

Topology. Using GRAAL, we align the human PPI network by Radivojac et al. (2008) to the yeast PPI network by Collins et al. (2007). GRAAL aligns 1,623 (i.e., 10.06%) of the edges in yeast to edges in human; thus, the edge correctness (EC) is 10.06%. We define a common connected subgraph (CCS) as a connected subgraph (not necessarily induced) that appears in both networks. Our largest yeast-human CCS (Fig. 3A) has 1,001 interactions amongst 290 proteins, which is an order of magnitude better than the best result produced by any of the other existing methods.

GRAAL’s yeast-human alignment is statistically significant: (i) given a random alignment of yeast and human, the probability of obtaining EC of 10.06% or better (p-value) is less than 7×10^-8; (ii) EC for aligning geometric random graphs (GEO) of the same size as yeast and human is significantly lower than EC of GRAAL’s alignment, with p-value less than 0.022. Given that GEO is the best null model for PPI networks [1,3,5], this implies that yeast and human, two very different species, enjoy more network similarity than chance would allow.

$$g: V \to U, \qquad EC = \frac{\left|\{(u,v) \in E : (g(u), g(v)) \in F\}\right|}{|E|} \times 100\%$$

GRAAL (GRAph ALigner) algorithm
