Protein Interaction Networks
Thanks to Mehmet Koyuturk
Protein-Protein Interactions
Physical association between proteins
  Signal transduction, phosphorylation
  Docking, complex formation
  Permanent vs. transient interactions
Co-location of proteins
  Proteins that work in the same cellular component
  Soluble location: lysosome, mitochondrial matrix
  Membrane location: receptors in plasma membrane, transporters in mitochondrial membrane
Functional association of proteins
  Proteins involved in the same biomolecular activity
  Enzymes in the same pathway, co-regulated proteins
7. Protein Interaction Networks
Permanent vs Transient Interactions
Permanent interactions
  Some proteins form a stable protein complex that carries out a structural or functional biomolecular role
  These proteins are subunits of the complex and work together
  Examples: ATPase subunits, subunits of the nuclear pore
Transient interactions
  Proteins that come together in certain cellular states to undertake a biomolecular function
  Examples: DNA replicative complex, signal transduction
Signal Transduction
Phosphorylation
  Protein-kinase interaction
  Enzyme activation
[Figure: signaling cascade]
Why Study Protein Interactions?
Identification of functional modules and interconnections between these modules
Functional annotation based on binding partners and interaction patterns
Identification of evolutionarily conserved pathways
Identification of drug target proteins to minimize side effects
Identification of Protein Interactions
Traditionally, protein interactions are identified by wet-lab experiments based on hypotheses about candidate proteins
Small-scale assays
  Coimmunoprecipitation: immunoprecipitate one protein, see if the other is also precipitated
  Reliable, but can only verify interactions between suspected partners
High-throughput screening
  Throw in thousands of ORFs and see which ones bind to each other
  Yeast two-hybrid, tandem affinity purification
  Large scale, but a lot of noise
Yeast Two Hybrid
Split the yeast GAL4 gene, which encodes a transcription factor required for activation of the GAL genes, into two parts
  Activation domain, DNA-binding domain
The split protein does not work unless the two parts are in physical contact
  Fuse one candidate protein to each part: reporter activation then indicates that the two candidates interact
Protein Interaction Networks
Organize all identified interactions in a network, where proteins are represented by nodes and interactions are represented by edges
TAP identifies a group of proteins that are caught by a target protein
  Spoke model (star network) vs. matrix model (clique)
[Figure: example network; labels: Protein, Interaction]
Functional Modularity in PPI Networks
A protein complex
  Dense subgraph
A signal transduction pathway
  Simple path, parallel paths
A protein with a common, key, fundamental role (e.g., a kinase)
  Hub node
Computational Prediction of PPIs
Functional association is a higher-level conceptualization of interaction
  Example: proteins that act as enzymes catalyzing reactions in the same metabolic pathway
Functionally associated proteins are likely to show up in similar contexts
  Co-regulation, co-expression, co-evolution, co-citation…
Functional association between proteins can be computationally identified by looking at different sources of data, such as sequences, gene expression, and literature
  Can also be extended to capture physical associations, for example, by taking into account evolution at the structural level
Conservation of Gene Neighborhood
In bacteria, the genome of an organism is organized in such a way that functionally related proteins are coded by neighboring regions
  Operons
When more than one bacterial species is considered, this neighborhood relationship becomes even more relevant
[Figure: Distribution of neighboring genes in H. influenzae and E. coli into functional classes]
Comparison of Nine Bacterial Genomes
trpB-trpA is the only gene pair whose proximity is conserved across nine prokaryotic genomes
  These genes encode the two subunits of tryptophan synthase, which interact and catalyze a single reaction
Close Orthologs
Run of genes
  A set of genes on one strand, such that the gap between adjacent genes is less than a threshold (in practice, 300 bp)
  Any pair of genes on the same run are said to be close
Bidirectional best hits
  Genes X1 and X2 from genomes G1 and G2 are BBH if their sequence similarity is significant and there is no Y1 (Y2) in G1 (G2) that is more similar to X2 (X1) than X1 (X2) is
[Figure: Pair of close bidirectional best hits: Xa, Ya close in G1; Xb, Yb close in G2; Xa & Xb BBH; Ya & Yb BBH]
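The two definitions above (runs of genes and bidirectional best hits) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the gene tuples, the similarity dictionary, and the function names are hypothetical, and the similarity scores are assumed to have already been filtered for significance.

```python
from itertools import combinations

def gene_runs(genes, max_gap=300):
    """Group genes into runs: same strand, gap between neighbors < max_gap bp.

    genes: list of (name, strand, start, end) tuples for one genome (assumed format).
    """
    runs = []
    for strand in ("+", "-"):
        on_strand = sorted((g for g in genes if g[1] == strand), key=lambda g: g[2])
        current = []
        for gene in on_strand:
            # Start a new run when the gap to the previous gene is too large.
            if current and gene[2] - current[-1][3] >= max_gap:
                runs.append(current)
                current = []
            current.append(gene)
        if current:
            runs.append(current)
    return runs

def close_pairs(genes, max_gap=300):
    """All unordered pairs of gene names that lie on the same run."""
    pairs = set()
    for run in gene_runs(genes, max_gap):
        for a, b in combinations(run, 2):
            pairs.add(frozenset((a[0], b[0])))
    return pairs

def bidirectional_best_hits(sim):
    """sim[(x1, x2)]: similarity of gene x1 in G1 to gene x2 in G2 (significant hits only)."""
    best_for_g1, best_for_g2 = {}, {}
    for (x1, x2), s in sim.items():
        if x1 not in best_for_g1 or s > sim[(x1, best_for_g1[x1])]:
            best_for_g1[x1] = x2
        if x2 not in best_for_g2 or s > sim[(best_for_g2[x2], x2)]:
            best_for_g2[x2] = x1
    # Keep only pairs that are each other's best hit.
    return {(x1, x2) for x1, x2 in best_for_g1.items() if best_for_g2[x2] == x1}

# Hypothetical toy data.
genes_g1 = [("a", "+", 0, 900), ("b", "+", 1000, 1800),
            ("c", "+", 5000, 6000), ("d", "-", 1900, 2500)]
sim = {("a", "A"): 90, ("a", "B"): 40, ("b", "B"): 80, ("c", "B"): 85}
```

In the toy data, a and b share a run (100 bp apart on the same strand), while c is too far away and d is on the other strand. Gene b is not a BBH of B because c is more similar to B than b is.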
Predicting Interactions
For each pair of close orthologs (occurring in at least one pair of genomes), calculate a score
  The score should increase with the phylogenetic distance between the two genomes, since closely related organisms are more likely to have similar genes nearby due to chance alone
  The existence of a triplet (P1, P2, P3) should be stronger evidence than the existence of two pairs (P1, P2 and P1, P3)
  Triplet distance can be estimated as the minimum distance between any pair of organisms (in addition to the pair score)
Reconstructing Pathways
[Figure: Purine Metabolism]
Can identify the association between unknown proteins and known pathways!
Projection of Gene Neighborhood
The composition of operons is evolutionarily variable
  A particular set of functionally related genes does not always comprise an operon
  The application of gene-neighborhood-based interaction prediction is limited for a single organism
With multiple organisms, it is possible to statistically strengthen conclusions and project findings onto other organisms
  If an operon with functionally related genes exists in several genomes, a functional association can be predicted for other organisms, even if the corresponding genes are scattered
  Variability turns out to be an advantage for prediction
Gene Neighborhood - Limitations
It is only directly applicable to bacteria (and archaea), because the relevance of gene order does not necessarily extend to eukaryotes
For closely related species, conserved gene order might just be due to lack of time for genome rearrangements
  We are interested in selective constraints that preserve gene order
  Compared species should be distant enough
  But not too distant, because we need a sufficient number of orthologs to derive statistically meaningful results
Gene Fusion
Domain fusion events
  Two protein domains that act as independent proteins (components) in one organism may form (part of) a single polypeptide chain (composite) in another organism
Most proteins that are involved in domain fusion events are known to be subunits of multiprotein complexes (76% in the E. coli metabolic network)
Gene Fusion Based PPI Prediction
A pair of proteins in the query genome is a candidate interacting pair if
  They show (local) sequence similarity to the same protein (rosetta stone) in the reference genome
  They do not show sequence similarity with each other
Requires complete genomes!
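The two conditions above translate directly into a filter over similarity search results. The sketch below assumes the search has already been run; `hits`, `similar_in_query`, and the function name are hypothetical, and it does not check which regions of the rosetta stone protein each query protein aligns to.

```python
from itertools import combinations

def rosetta_stone_candidates(hits, similar_in_query):
    """Find candidate interacting pairs from gene fusion evidence.

    hits: query protein -> set of reference proteins it is locally similar to.
    similar_in_query: set of frozenset pairs of query proteins similar to each other.
    Returns a dict: candidate pair -> set of shared rosetta stone proteins.
    """
    candidates = {}
    for a, b in combinations(sorted(hits), 2):
        shared = hits[a] & hits[b]
        # Condition 1: both hit the same reference protein.
        # Condition 2: the two query proteins are not similar to each other.
        if shared and frozenset((a, b)) not in similar_in_query:
            candidates[(a, b)] = shared
    return candidates

# Hypothetical toy data: q3 and q4 are similar to each other, so they are excluded.
hits = {"q1": {"R1"}, "q2": {"R1", "R2"}, "q3": {"R2"}, "q4": {"R2"}}
paralogs = {frozenset(("q3", "q4"))}
result = rosetta_stone_candidates(hits, paralogs)
```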
Predicted Interactions
[Figure labels: Known physical interactions, Proteins in the same pathway]
Gene Fusion Based Prediction - Results
Interactions predicted based on gene fusion events
  Distance on the circle shows distance on the genome
Co-evolution of Interacting Proteins
Selective pressure is likely to act on common function
  Interacting proteins are expected to either be conserved together along with their interactions, or not be conserved at all
Hypothesis I: Orthologs of interacting proteins also interact in other species (supported by evidence, but there are subtleties, which we will discuss later)
Hypothesis II: If two proteins are interacting, then they will show similar conservation patterns
  Phylogenetic profiles
Phylogenetic Profiles
Correlation of Phylogenetic Profiles
Assume we have N genomes; protein X has homologs in x of them, Y has homologs in y of them, and they co-occur in z genomes
Measures of profile similarity
  Hamming distance
  Pearson correlation
  Mutual information
  Statistical significance
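All four measures can be written in terms of N, x, y and z alone. The sketch below uses the usual textbook definitions for binary presence/absence profiles, and takes statistical significance to be a hypergeometric tail probability. Both choices are assumptions, as is the function name.

```python
from math import comb, log2, sqrt

def profile_similarity(N, x, y, z):
    """Similarity of two binary phylogenetic profiles (assumes 0 < x < N and 0 < y < N)."""
    # Counts of genomes in each cell of the 2x2 presence/absence table.
    both, only_x, only_y = z, x - z, y - z
    neither = N - x - y + z
    # Hamming distance: genomes where exactly one of the two proteins is present.
    hamming = only_x + only_y
    # Pearson correlation of the two 0/1 vectors.
    pearson = (N * z - x * y) / sqrt(x * (N - x) * y * (N - y))
    # Mutual information (in bits) between the two presence indicators.
    mi = 0.0
    for joint, mx, my in ((both, x, y), (only_x, x, N - y),
                          (only_y, N - x, y), (neither, N - x, N - y)):
        if joint > 0:
            mi += (joint / N) * log2(joint * N / (mx * my))
    # Hypergeometric tail: chance of at least z co-occurrences if profiles were independent.
    p_value = sum(comb(x, k) * comb(N - x, y - k)
                  for k in range(z, min(x, y) + 1)) / comb(N, y)
    return hamming, pearson, mi, p_value
```

For two identical profiles present in 5 of 10 genomes, `profile_similarity(10, 5, 5, 5)` gives a Hamming distance of 0, a correlation of 1, one bit of mutual information, and a p-value of 1/252.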
Phylogenetic Profiles - Limitations
Many processes may be common across lineages
  Too many false positives
The database of genomes may be biased
  All organisms are treated equally
  Improvement: Use trees instead of profiles
Proteins are assumed to be conserved as a whole
  It is domains that interact
  Improvement: Use domain profiles
[Figure axis labels: Yeast nucleoli and ribosomal proteins, Organisms]
Phylogenetic Tree Based Prediction
[Figure: Phylogenetic trees of Ntr-family two-component sensor histidine kinases and their corresponding regulators]
Mirror Tree Method
Need to have a sufficient number of genomes that contain homologs of both proteins
Matrix Method
Start with families of proteins that are suspected to interact
Identify specific pairs of proteins that interact by aligning the phylogenetic trees that underlie the two families
Assumption: Identical number of proteins in each family
Correlated Mutations
Co-evolution of interacting proteins can be followed more closely by quantifying the degree of co-variation between pairs of residues from these proteins
  Correlated mutations may correspond to compensatory mutations, in which changes in one protein are stabilized by changes in the other
[Figure: Distribution of distances between amino acid positions on a folded protein]
In silico Two-Hybrid
The correlation of mutations between two positions (possibly on different proteins) can be estimated from pairwise assessment of multiple aligned sequences
  Position pairs with high correlation are potential contact points
Interaction index
  For a protein pair, compute the aggregate correlation (of mutations) across all positions
In silico Two-Hybrid
Performance of I2H
I2H predicts physical, rather than functional, association
  It requires complete genomes and a sufficient number of homologs
Co-citation Based PPI Prediction
Functionally associated proteins are likely to be cited in the same research article
  We can assess the statistical significance of co-citation based on a hypergeometric model
Algorithmic problem: How to recognize and match protein names?
  Train an algorithm on annotated abstracts using conditional random fields (CRF)
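A hypergeometric test for co-citation can be sketched as follows. The parameterization is an assumption (total number of articles, articles mentioning each protein, articles mentioning both), and the function name is hypothetical.

```python
from math import comb

def cocitation_p_value(n_articles, n_a, n_b, n_both):
    """P(at least n_both shared articles) if the n_a and n_b articles were chosen independently."""
    upper = min(n_a, n_b)
    # Sum the hypergeometric probabilities for n_both, n_both + 1, ..., upper shared articles.
    tail = sum(comb(n_a, k) * comb(n_articles - n_a, n_b - k)
               for k in range(n_both, upper + 1))
    return tail / comb(n_articles, n_b)
```

A small p-value means the two proteins are cited together more often than their individual citation counts would explain.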
Performance of Co-citation
The method is robust to the choice of parameters for name recognition
Statistical significance is quite relevant until it saturates
Integrating PPI Networks
Interaction data come from multiple sources
  Different sources refer to different levels of interaction
Can integration handle noise, making interaction data more reliable?
  Superpose interactions based on their reliability
Bayesian Integration
For each prediction method, compute a log-likelihood score
  Let P(L|E) be the number of interactions predicted by method E for which functional association between the corresponding proteins is known
  Let ~P(L|E) be the number of false positives
  Let P(L) and ~P(L) be the corresponding priors
Assign weights to methods based on their log-likelihood scores
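The quantities above are commonly combined as LLS = ln[(P(L|E) / ~P(L|E)) / (P(L) / ~P(L))], i.e., the log of the odds of a true link given the evidence relative to the prior odds. Assuming that form, a minimal sketch with a hypothetical function name looks like this:

```python
from math import log

def log_likelihood_score(tp_method, fp_method, pos_prior, neg_prior):
    """ln of (odds of a true link given evidence E) / (prior odds of a true link).

    tp_method, fp_method: P(L|E) and ~P(L|E) for the method.
    pos_prior, neg_prior: P(L) and ~P(L) for the background.
    """
    return log((tp_method / fp_method) / (pos_prior / neg_prior))
```

A score of zero means the method does no better than the prior. Larger scores indicate methods whose predictions deserve more weight in the integrated network.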
Comparison of Prediction Methods
The integrated network captures functional association better
  Note that the integrated network is “trained” using available data on functional association
Classification Based Integration
Points: Proteins; Space: Expression, Conservation; Labels: Function
Points: Protein Pairs; Space: Co-expression, Co-evolution, etc.; Labels: Existence of Interaction
Performance of Domain Co-evolution
Co-Evolutionary Matrix
Domain Identification
Difference between Predicted PPIs
Pattern Discovery in Signaling Networks
Reconstruction of Cellular Signaling
Network reconstruction includes a chemically accurate representation of all biochemical events occurring within a defined signaling network, and incorporates the interconnectivity and functional relationships that are inferred from experimental data
Cellular signaling networks operate over several orders of magnitude in spatio-temporal scale
  Quick responses (< 10^-1 s), e.g., protein modifications
  Slow responses (minutes to hours), e.g., transcriptional regulation
Cellular Signaling
Who are the actors?
  Receptors reside inside or on the surface of the cell and bind specific chemicals with high specificity and affinity
  Protein kinases catalyze the transfer of phosphate from high-energy donor molecules, such as ATP, to proteins, which results in activation of those proteins
  Protein phosphatases dephosphorylate active proteins
  Transcription factors
Combinatorics of Cellular Signaling
What is the scope of these actors?
  In how many different ways can a signal be transmitted?
  In how many different states can a cell be?
  Number of receptors, kinases, phosphatases, transcription factors, and the number of possible interactions between these
Alternative splicing
  In eukaryotes, introns are spliced out before translation
  Different combinations of introns can be spliced out, resulting in different products of the same gene
  One more level of combinatorial complexity
  If a gene has k exons, then splicing of alternative exons can generate up to 2^k isoforms
Scope of Human Signaling Network
Combinatorial Effects
Genes that code for signaling proteins compose 75% of all alternatively spliced genes
  This implies that cells use alternative splicing extensively to achieve the extraordinary specificity that is required in signaling systems
After post-transcriptional modification, number of mRNA transcripts
  3858 for receptors, 1295 for kinases, 375 for phosphatases
After post-translational modification (phosphorylation, acetylation, methylation), number of distinct protein states
  30864 for receptors, 10360 for kinases, 3000 for phosphatases
  20-fold increase in number of protein states over genes
Links and Connectivity
Interactions allow for an even greater degree of combinatorial control
  Homo- and heterodimerization of 224 proteins can provide sufficient specificity to control the expression of 25000 genes in the human genome (n(n-1)/2)
If receptors assume only ligand-bound and unbound states, then k receptors can recognize 2^k different ligand combinations
  If 1% of the estimated 1543 receptors in the human genome can be independently expressed, then the cell could potentially respond to 32768 (2^15) different ligand combinations
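The two counts quoted above are easy to reproduce. The short check below is only arithmetic on the slide's numbers, and the function names are hypothetical.

```python
def dimer_pairs(n):
    # Number of distinct unordered pairs among n proteins: n(n-1)/2.
    return n * (n - 1) // 2

def ligand_combinations(k):
    # Each of k receptors is either ligand-bound or unbound.
    return 2 ** k

independent_receptors = round(0.01 * 1543)          # 1% of 1543 receptors, about 15
print(dimer_pairs(224))                             # 24976, roughly 25000
print(ligand_combinations(independent_receptors))   # 32768
```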
Signal Reception
Based on the average surface area of a cell and the average area of a receptor, it is estimated that there can be as many as a few million receptors on the surface of a typical somatic cell at a given time
  ~30000 distinct receptor types
  ~130 receptors of each receptor type
A few receptors (~10-40) in high numbers (~10^5 per cell) for highly differentiated and specialized cells
Many receptors (~2000-3000) in small numbers (~10^2 per cell) for stem cells or undifferentiated cells
Reconstructing Signaling Networks
Focusing on Parts of the Network
Nodes
  Who does a single protein interact with? In what contexts?
Modules
  Group of related interactions, e.g., a protein complex
Pathways
  Chain of interactions that connect a signaling input to an output
Protein Complexes in PPI Networks
Spoke vs matrix model
  Recall that PCP methods like TAP identify a group of proteins that bind to each other using a single protein as bait
  How to encode this into a network of pairwise interactions?
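The two encodings differ only in which pairs become edges. A minimal sketch, with hypothetical function names and a bait plus its co-purified preys as input:

```python
from itertools import combinations

def spoke_edges(bait, preys):
    # Spoke model: connect the bait to every prey (star network).
    return {frozenset((bait, p)) for p in preys}

def matrix_edges(bait, preys):
    # Matrix model: connect every pair in the purified group (clique).
    return {frozenset(pair) for pair in combinations([bait, *preys], 2)}

spoke = spoke_edges("bait", ["p1", "p2", "p3"])
matrix = matrix_edges("bait", ["p1", "p2", "p3"])
```

For one bait and three preys the spoke model yields 3 edges and the matrix model 6. The spoke model risks missing prey-prey interactions, while the matrix model risks adding false ones.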
[Figure panels: Actual Complex, Spoke Model, Matrix Model]
Protein Complexes in Matrix Model
Modules and Quotients
Define a module as a group of proteins such that the interactions of the proteins with those outside the module are identical
Quotient: Replace the proteins in a module with a single node
  The edges of the representative node represent the interactions of all proteins in the module
Types of Modules
Parallel module
  No interaction between proteins in the module
  These are likely to correspond to proteins that are functionally related, but do not interact with each other
Series module
  Proteins in the module form a clique among themselves
  All proteins in the module perform some function together (single complex or multiple related complexes)
Prime module
  All other topologies
  This is probably what you will observe most of the time
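The module definition and the three module types can be checked mechanically on an adjacency-set representation. The sketch below follows the slide's definitions literally; the function names and the toy network are hypothetical.

```python
def is_module(adj, group):
    """True if every protein in `group` has the same neighbors outside the group."""
    group = set(group)
    outside = [adj[v] - group for v in group]
    return all(nb == outside[0] for nb in outside)

def module_type(adj, group):
    """Classify a module as parallel (no inner edges), series (clique), or prime (anything else)."""
    group = set(group)
    inside_degrees = {len(adj[v] & group) for v in group}
    if inside_degrees == {0}:
        return "parallel"
    if inside_degrees == {len(group) - 1}:
        return "series"
    return "prime"

# Toy network: a and b both bind only x; c and d bind x and each other.
adj = {"a": {"x"}, "b": {"x"}, "x": {"a", "b", "c", "d"},
       "c": {"x", "d"}, "d": {"x", "c"}}
```

Here {a, b} is a parallel module and {c, d} is a series module, while {a, c} is not a module because c has an outside neighbor (d) that a lacks.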
Hierarchical Decomposition
Recursively identify and contract modules
This results in a tree representation of the network
  Each node is a quotient graph
  Leaves are proteins
  Root is the entire network
Decomposition of Yeast PPI Network
Identification of Modules
Graph clustering
  Find groups of nodes with high interconnectivity (and relatively low connectivity with the outside)
Issues
  Definition of clustering metrics
  Density
    Has to be normalized by subgraph size
  Distance-based metrics
    A module has low diameter
  Normalizing intra-cluster connectivity with outer connectivity
Algorithms
The problems are generalizations of maximum clique
  Maximum clique itself is NP-hard (enumeration of cliques in early PPI networks was possible, though, and these were used as seed subgraphs for dense clusters)
Heuristic approaches
  Graph clustering is very well studied
  Recall that, while clustering vectors in metric spaces (e.g., gene expression data), it is common to generate similarity graphs
Bottom-up heuristics
  Start with a single node, grow the subgraph until “density” is lost
Top-down heuristics
  Recursively partition the entire network until the subgraph is dense enough
MCODE Algorithm
Three stages
  Vertex weighting
  Complex prediction
  Post-processing for finding overlapping clusters
Vertex weighting
  How “clustered” is a vertex’s neighborhood?
  Use core clustering coefficient instead of clustering coefficient
    N: subgraph induced by the neighbors of v
    K: k-core subgraph of N that maximizes k
    d: density of K
    weight(v) = k × d
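The vertex weighting step can be sketched with plain adjacency sets. This follows the definitions as written above (N is induced by the neighbors of v only) and is not the reference MCODE implementation; all function names are hypothetical.

```python
def highest_k_core(adj):
    """Return (k, nodes) for the non-empty k-core with the largest k."""
    nodes = set(adj)
    best_k, best_nodes = 0, set(nodes)
    k = 1
    while nodes:
        # Repeatedly peel nodes whose degree inside `nodes` is below k.
        changed = True
        while changed:
            low = {v for v in nodes if len(adj[v] & nodes) < k}
            changed = bool(low)
            nodes -= low
        if nodes:
            best_k, best_nodes = k, set(nodes)
        k += 1
    return best_k, best_nodes

def density(adj, nodes):
    """Fraction of possible edges that are present among `nodes`."""
    n = len(nodes)
    if n < 2:
        return 0.0
    edges = sum(len(adj[v] & nodes) for v in nodes) / 2
    return edges / (n * (n - 1) / 2)

def mcode_weight(adj, v):
    # N: subgraph induced by the neighbors of v.
    nbrs = adj[v]
    sub = {u: adj[u] & nbrs for u in nbrs}
    # K: highest k-core of N; weight is k times the density of K.
    k, core = highest_k_core(sub)
    return k * density(sub, core)

# Toy network: v touches a triangle (a, b, c) and a pendant node d.
adj = {"v": {"a", "b", "c", "d"}, "a": {"v", "b", "c"}, "b": {"v", "a", "c"},
       "c": {"v", "a", "b"}, "d": {"v"}}
```

In the toy network the neighborhood of v contains a triangle, which is a 2-core of density 1, so v gets weight 2. The pendant node d gets weight 0.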
MCODE Algorithm (cont’d)
Complex prediction
  Seed a complex with the node with the highest weight
  At each node addition, check the neighbors of that node; if a neighbor’s weight is above a given threshold relative to that of the seed vertex, add that neighbor to the complex as well
  Repeat until no node can be added
  Once a complex is identified, remove those nodes and find other complexes
Post-processing
  Filter out complexes that do not contain at least a 2-core
  Add nodes to allow overlaps up to a given threshold
Complex score: density × size
Scoring Subgraphs
Observe the trade-off between size and density
  A single interaction has density one
  What is a good cut-off for density?
Statistical significance
  What is the expected size of the largest dense subgraph?
  Implicitly trades off density and size
If we can analytically characterize the distribution of the size of the largest dense subgraph, then we can use statistical significance as a score function (stopping criterion)
  This also implicitly handles correction for multiple hypothesis testing
G(n,p) Model
Let random variable R be the size of the largest subgraph with density $\rho$
The typical value of R is given by
  $r_0 = \frac{\log n - \log\log n + \log H_p(\rho)}{H_p(\rho)}$
  where $H_p(\rho) = \rho \log\frac{\rho}{p} + (1-\rho)\log\frac{1-\rho}{1-p}$ denotes divergence
The p-value of a larger dense subgraph is given by
  $P(R \ge r_0) \le O\left(\frac{\log n}{n^{1/H_p(\rho)}}\right)$
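The divergence and the typical size can be evaluated numerically. The sketch below assumes the density symbol is ρ and uses natural logarithms; the function names are hypothetical.

```python
from math import log

def divergence(rho, p):
    """H_p(rho) = rho*log(rho/p) + (1-rho)*log((1-rho)/(1-p)), for 0 < p < 1 and 0 < rho < 1."""
    return rho * log(rho / p) + (1 - rho) * log((1 - rho) / (1 - p))

def typical_size(n, p, rho):
    """r0 = (log n - log log n + log H_p(rho)) / H_p(rho)."""
    h = divergence(rho, p)
    return (log(n) - log(log(n)) + log(h)) / h
```

The divergence is zero when ρ equals p and grows as the required density moves away from the background edge probability. For fixed n and p, a higher density therefore gives a smaller typical size.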
Piecewise G(n,p) Model
Two protein groups: hubs ($V_h$) and regulars ($V_l$)
There is an edge between u and v with probability
  $p_h$ if $u, v \in V_h$
  $p_b$ if $u \in V_h$, $v \in V_l$, or vice versa
  $p_l$ if $u, v \in V_l$ (written $p$ in the formulas that follow)
$p_h > p_b > p_l$, $|V_h| < |V_l|$
If $|V_h| \ll |V_l|$, it contributes an additive factor
  $r_1 = \frac{\log n + 2|V_h| \log B - \log\log n + \log H_p(\rho)}{H_p(\rho)}$
  where $B = \frac{p_b (1-p)}{p} + 1 - p_b$
SIDES Algorithm
Recursive minimum-cut partitioning
  Partition nodes into two parts such that the number of edges in between is minimized, then recurse
[Figure: recursive partitioning; annotations: p << 1]
MCODE vs SIDES
[Figure axis labels: -log(p-value), Specificity (%), Sensitivity (%), Cluster Size; Correlation: SIDES 0.76, MCODE 0.43]
MCODE vs SIDES
[Figure axis labels: Specificity (%), Sensitivity (%), Module Size; Correlation: SIDES 0.22, MCODE -0.02; Correlation: SIDES 0.27, MCODE 0.36]
Fiedler Vector
For network G, the Laplacian L is defined as follows:
  $L(i,i) = \sum_{j \neq i} w(u_i, u_j)$
  $L(i,j) = -w(u_i, u_j)$ for $i \neq j$
Here, $w(u_i, u_j)$ denotes the weight of edge $u_i u_j$.
It can be shown that
  Matrix L is positive semi-definite, with exactly one zero eigenvalue for each connected component
  The eigenvector x that corresponds to the smallest non-zero eigenvalue minimizes, over unit vectors orthogonal to the zero-eigenvalue eigenvectors,
    $x^T L x = \sum_{i<j} w(u_i, u_j)\,(x(i) - x(j))^2$
This vector is known as the Fiedler vector of network G.
Spectral Graph Clustering
The Fiedler vector provides the optimal mapping of the nodes of the network onto one-dimensional Euclidean space, in the mean-squares sense
  This also generalizes to an optimal k-dimensional mapping
Once a one-dimensional mapping is obtained, clustering algorithms can be used on this one-dimensional space
  Find cut points in one-dimensional space
  Top-down: Partition the one-dimensional space by finding two cut points and recurse on each part
  Bottom-up: Merge the two closest nodes, recurse
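For a small connected network, the Fiedler vector can be approximated without any linear algebra library by power iteration on a shifted Laplacian. This is a sketch for illustration, not an efficient solver. The function name, the edge-dictionary input, and the choice to cut at zero are all assumptions.

```python
def fiedler_vector(weights, n, iterations=2000):
    """weights: dict {(i, j): w} of undirected edges on nodes 0..n-1 (connected graph assumed)."""
    # Build the Laplacian: L(i,i) = sum of incident weights, L(i,j) = -w(i,j).
    L = [[0.0] * n for _ in range(n)]
    for (i, j), w in weights.items():
        L[i][j] -= w
        L[j][i] -= w
        L[i][i] += w
        L[j][j] += w
    shift = 2 * max(L[i][i] for i in range(n))    # upper bound on the eigenvalues of L
    x = [i - (n - 1) / 2 for i in range(n)]        # start orthogonal to the constant vector
    for _ in range(iterations):
        # One step of power iteration on (shift*I - L).
        y = [shift * x[i] - sum(L[i][j] * x[j] for j in range(n)) for i in range(n)]
        mean = sum(y) / n
        y = [v - mean for v in y]                  # project out the constant vector
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    return x

# Two triangles (0,1,2) and (3,4,5) joined by the single edge 2-3.
edges = {(0, 1): 1, (0, 2): 1, (1, 2): 1, (2, 3): 1, (3, 4): 1, (3, 5): 1, (4, 5): 1}
x = fiedler_vector(edges, 6)
clusters = [v > 0 for v in x]   # cut the one-dimensional mapping at zero
```

In this toy network the sign of the Fiedler vector separates the two triangles, which is the simplest possible choice of cut point.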
Identification of Signaling Pathways
We would like to identify simple paths (chains of interactions) in the PPI network, which might correspond to, for example, signaling cascades, highlighting the group of proteins and interactions that are responsible for the transduction of a specific signal
What can we do based solely on interaction data?
  In the PPI network, there may be plenty of paths connecting each pair of nodes
  Which ones are interesting?
  How long can a pathway be?
  How about identifying the “most reliable” paths?
Formulating Pathway Identification
Assume that the edges are scored, such that p(u,v) denotes the likelihood that proteins u and v interact
  Then the product of edge scores along the path quantifies the likelihood that the path exists
Let w(u,v) = -log p(u,v) denote the weight of an edge
  Then, if we define the weight of a path as the sum of the weights of the edges on the path, paths with less weight will be more reliable paths
For a given set I of proteins, find all minimum-weight paths of length k from I to each protein in the network
  I might be the set of receptor proteins
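The equivalence between multiplying edge likelihoods and summing -log weights can be checked on a toy example. The protein names and probabilities below are hypothetical.

```python
from math import exp, log

def path_weight(path, p):
    """Sum of -log p(u, v) over consecutive proteins on the path."""
    return sum(-log(p[frozenset(edge)]) for edge in zip(path, path[1:]))

# Hypothetical interaction likelihoods: receptor r, kinases k1 and k2, transcription factor tf.
p = {frozenset(("r", "k1")): 0.9, frozenset(("k1", "tf")): 0.8,
     frozenset(("r", "k2")): 0.5, frozenset(("k2", "tf")): 0.95}

via_k1 = path_weight(["r", "k1", "tf"], p)   # path likelihood 0.9 * 0.8 = 0.72
via_k2 = path_weight(["r", "k2", "tf"], p)   # path likelihood 0.5 * 0.95 = 0.475
```

The path through k1 has the smaller weight, and exp(-weight) recovers its likelihood of 0.72.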
Enumerating Pathways
Dynamic programming
  For $v \in S \subseteq V$, let W(v, S) be the minimum weight of a simple path that starts from a protein in I, visits all proteins in S, and ends in v
  This function can be tabulated using the following recursion
    $W(v, S) = \min_{u \in S \setminus \{v\}} \left[ W(u, S \setminus \{v\}) + w(u, v) \right]$ for $|S| > 1$
    where $W(v, \{v\}) = 0$ if $v \in I$, and $\infty$ otherwise
For a given v, the minimum-weight path of length k from I to v is given by the minimum W(v, S) over all S of size k that contain v
The running time of this algorithm is $O(k n^k)$
  Not feasible for k larger than a few
Color Coding
Color each protein randomly using a set of k colors
  Search for paths that contain one protein from each color => no vertex will be repeated on the path
  The dynamic programming algorithm can be modified to solve this problem
The running time of this algorithm is $O(2^k k m)$
  However, this algorithm misses an optimal path if two proteins on the path happen to be colored identically
  For each path, the algorithm succeeds with probability $k!/k^k > e^{-k}$
  Repeat $e^k \ln(n/\epsilon)$ times to make sure that the probability that the algorithm will fail for at least one protein is less than $\epsilon$
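The color-coding idea can be sketched as a dynamic program over (end vertex, set of colors used), repeated over random colorings. This is an illustrative implementation under assumptions: path length k is counted in vertices, the graph is given as adjacency sets, and the function names and toy weights are hypothetical.

```python
import random
from math import inf

def colorful_min_path(adj, w, sources, k, colors):
    """Min weight of a path with k distinctly colored vertices from a source to each vertex.

    adj: node -> set of neighbors; w: frozenset({u, v}) -> edge weight;
    colors: node -> color in range(k). Returns {end vertex: weight}.
    """
    # W[(v, S)]: best weight of a path from a source to v that uses exactly the colors in S.
    W = {(v, frozenset([colors[v]])): 0.0 for v in sources}
    for _ in range(k - 1):
        nxt = {}
        for (u, S), weight in W.items():
            for v in adj[u]:
                if colors[v] not in S:            # distinct colors => no repeated vertex
                    key = (v, S | {colors[v]})
                    cand = weight + w[frozenset((u, v))]
                    if cand < nxt.get(key, inf):
                        nxt[key] = cand
        W = nxt
    best = {}
    for (v, _), weight in W.items():
        best[v] = min(weight, best.get(v, inf))
    return best

def color_coding(adj, w, sources, k, trials, seed=0):
    """Repeat the colorful-path search over random colorings and keep the best weights."""
    rng = random.Random(seed)
    best = {}
    for _ in range(trials):
        colors = {v: rng.randrange(k) for v in adj}
        for v, weight in colorful_min_path(adj, w, sources, k, colors).items():
            best[v] = min(weight, best.get(v, inf))
    return best

# Toy network with hypothetical -log weights.
adj = {"a": {"b", "c"}, "b": {"a", "c", "d"}, "c": {"a", "b", "d"}, "d": {"b", "c"}}
w = {frozenset(("a", "b")): 1, frozenset(("a", "c")): 5, frozenset(("b", "c")): 1,
     frozenset(("b", "d")): 5, frozenset(("c", "d")): 1}
result = color_coding(adj, w, {"a"}, 3, trials=200)
```

With source a and k = 3, the lightest three-vertex path ends at c (a-b-c, weight 2). The best paths ending at b and d both have weight 6.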
Hunting Biologically Meaningful Paths
Constraining the set of proteins
  If a protein is required to be in the path, assign a unique color to the target protein
  If a family is required, assign a color to the family
Constraining order of occurrence
  Signal transduction often progresses in inward order, from membrane proteins to nuclear proteins and transcription factors
  Segmented pathways: Assign labels to proteins, where labels represent cellular components, and require paths to be monotonic with respect to labels
  Labels can also be generalized to intervals (consistent segments)