Protein Interaction Networks


Thanks to Mehmet Koyuturk

Protein-Protein Interactions

Physical association between proteins: signal transduction, phosphorylation; docking, complex formation; permanent vs. transient interactions

Co-location of proteins: proteins that work in the same cellular component. Soluble location: lysosome, mitochondrial matrix. Membrane location: receptors in plasma membrane, transporters in mitochondrial membrane

Functional association of proteins: proteins involved in the same biomolecular activity, e.g., enzymes in the same pathway, co-regulated proteins

7. Protein Interaction Networks


Permanent vs Transient Interactions

Permanent interactions: some proteins form a stable protein complex that carries out a structural or functional biomolecular role

These proteins are subunits of the complex and they work together

Examples: ATPase subunits, subunits of the nuclear pore

Transient interactions: proteins that come together in certain cellular states to undertake a biomolecular function

Examples: DNA replicative complex, signal transduction

Signal Transduction

Phosphorylation: protein-kinase interaction, enzyme activation

Signaling cascade

Why Study Protein Interactions?

Identification of functional modules and interconnections between these modules

Functional annotation based on binding partners and interaction patterns

Identification of evolutionarily conserved pathways

Identification of drug target proteins to minimize side effects

Identification of Protein Interactions

Traditionally, protein interactions are identified by wet-lab experiments based on hypotheses about candidate proteins

Small-scale assays: co-immunoprecipitation (immunoprecipitate one protein, see if the other is also precipitated). Reliable, but can only verify interactions between suspected partners

High-throughput screening: throw in thousands of ORFs and see which ones bind to each other. Yeast two-hybrid, tandem affinity purification. Large scale, but a lot of noise

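Yeast Two Hybrid

The yeast GAL4 gene encodes a transcription factor required for activation of GAL genes

Split GAL4 into two parts: activating domain, binding domain

The split protein does not work unless the two parts are in physical contact

Fuse each part to one of the two candidate proteins; reporter activation then indicates that the candidates bind to each other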

Protein Interaction Networks

Organize all identified interactions in a network, where proteins are represented by nodes and interactions are represented by edges

TAP identifies a group of proteins that are caught by the target protein: spoke model (star network) vs. matrix model (clique)

[Figure labels: Protein, Interaction]

Functional Modularity in PPI Networks

A protein complex: dense subgraph

A signal transduction pathway: simple path, parallel paths

A protein with a common, key, fundamental role (e.g., a kinase): hub node

Computational Prediction of PPIs

Functional association is a higher level conceptualization of interaction, e.g., proteins that act as enzymes catalyzing reactions in the same metabolic pathway

Functionally associated proteins are likely to show up in similar contexts: co-regulation, co-expression, co-evolution, co-citation…

Functional association between proteins can be computationally identified by looking at different sources of data such as sequences, gene expression, literature

Can also be extended to capture physical associations, for example, by taking into account evolution at the structural level

Conservation of Gene Neighborhood

In bacteria, the genome is organized in such a way that functionally related proteins are coded by neighboring regions (operons)

When more than one bacterial species is considered, this neighborhood relationship becomes even more informative: proximity that is conserved across species is stronger evidence of functional association

[Figure: Distribution of neighboring genes in H. influenzae and E. coli into functional classes]

Comparison of Nine Bacterial Genomes

trpB-trpA is the only gene pair whose proximity is conserved across nine prokaryotic genomes

These genes encode the two subunits of tryptophan synthase that interact and catalyze a single reaction

Close Orthologs

Run of genes: a set of genes on one strand, such that the gap between adjacent genes is less than a threshold (in practice, 300 bp)

Any pair of genes on the same run are said to be close

Bidirectional best hits: genes X1 and X2 from genomes G1 and G2 are BBH if their sequence similarity is significant, there is no Y1 in G1 that is more similar to X2 than X1 is, and there is no Y2 in G2 that is more similar to X1 than X2 is

Pair of close bidirectional best hits: Xa, Ya close in G1; Xb, Yb close in G2; Xa & Xb BBH; Ya & Yb BBH
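The definitions of runs, closeness, and bidirectional best hits can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the original method's implementation: the gene tuples, the `sim` similarity table (assumed to contain only statistically significant hits), and all function names are hypothetical, and circular genomes are ignored.

```python
from itertools import combinations

def runs(genes, max_gap=300):
    """Group genes into runs: same strand, gap between adjacent genes < max_gap bp.

    genes: list of (name, strand, start, end) tuples (hypothetical layout).
    """
    result = []
    for strand in ("+", "-"):
        current = []
        on_strand = sorted((g for g in genes if g[1] == strand), key=lambda g: g[2])
        for g in on_strand:
            # Start a new run when the gap to the previous gene is too large.
            if current and g[2] - current[-1][3] >= max_gap:
                result.append(current)
                current = []
            current.append(g)
        if current:
            result.append(current)
    return result

def close_pairs(genes, max_gap=300):
    """All unordered pairs of gene names that lie on the same run."""
    pairs = set()
    for run in runs(genes, max_gap):
        for a, b in combinations([g[0] for g in run], 2):
            pairs.add(frozenset((a, b)))
    return pairs

def bidirectional_best_hits(sim):
    """sim[(x1, x2)]: similarity of gene x1 in G1 and gene x2 in G2."""
    best1, best2 = {}, {}
    for (x1, x2), s in sim.items():
        if x1 not in best1 or s > sim[(x1, best1[x1])]:
            best1[x1] = x2
        if x2 not in best2 or s > sim[(best2[x2], x2)]:
            best2[x2] = x1
    return {(x1, x2) for x1, x2 in best1.items() if best2[x2] == x1}

def close_bbh_pairs(genes1, genes2, sim):
    """Pairs close in G1 whose BBH partners are close in G2."""
    bbh = dict(bidirectional_best_hits(sim))
    close2 = close_pairs(genes2)
    out = set()
    for pair in close_pairs(genes1):
        xa, ya = tuple(pair)
        if xa in bbh and ya in bbh and frozenset((bbh[xa], bbh[ya])) in close2:
            out.add(pair)
    return out

# Toy data: a1/b1 are neighbors in G1, and their best hits a2/b2 are neighbors in G2.
genes1 = [("a1", "+", 0, 900), ("b1", "+", 1000, 1900), ("c1", "+", 5000, 5900)]
genes2 = [("a2", "-", 100, 1000), ("b2", "-", 1100, 2000), ("c2", "+", 3000, 3900)]
sim = {("a1", "a2"): 0.9, ("a1", "b2"): 0.2, ("b1", "b2"): 0.8,
       ("c1", "c2"): 0.7, ("b1", "a2"): 0.3}
result = close_bbh_pairs(genes1, genes2, sim)
```

In the toy data only the pair (a1, b1) qualifies, because c1 sits on its own run.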

Predicting Interactions

For each pair of close orthologs (occurring in at least one pair of genomes), calculate a score

Score should increase with the phylogenetic distance between the two genomes, since closely related organisms are more likely to have similar genes nearby due to chance alone

Existence of a triplet (P1, P2, P3) should be stronger evidence than the existence of two pairs (P1, P2 and P1, P3)

Triplet distance can be estimated as the minimum distance between any pair of organisms (in addition to pair score)

Reconstructing Pathways



Purine Metabolism

Can identify the association between unknown proteins and known pathways!

Projection of Gene Neighborhood

The composition of operons is evolutionarily variable

A particular set of functionally related genes does not always comprise an operon

The application of gene neighborhood based interaction prediction is limited for a single organism

With multiple organisms, it is possible to statistically strengthen conclusions and project findings onto other organisms

If an operon with functionally related genes exists in several genomes, a functional association can be predicted for other organisms, even if the corresponding genes are scattered

Variability turns out to be an advantage for prediction

Gene Neighborhood - Limitations

It is only directly applicable to bacteria (and archaea), because relevance of gene order does not necessarily extend to eukaryotes

For closely related species, conserved gene order might just be due to lack of time for genome rearrangements

We are interested in selective constraints that preserve gene order

Compared species should be distant enough, but not too distant, because we need a sufficient number of orthologs to be able to derive statistically meaningful results

Gene Fusion

Domain fusion events: two protein domains that act as independent proteins (components) in one organism may form (part of) a single polypeptide chain (composite) in another organism

Most proteins that are involved in domain fusion events are known to be subunits of multiprotein complexes (76% in E. coli metabolic network)

Gene Fusion Based PPI Prediction

A pair of proteins in the query genome is a candidate interacting pair if

They show (local) sequence similarity to the same protein (rosetta stone) in the reference genome

They do not show sequence similarity with each other

Requires complete genomes!
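The two criteria above translate into a simple set operation once similarity searches have been run. The sketch below assumes the search results are already available as plain Python collections; the protein names, the `hits` and `similar_pairs` inputs, and the function name are hypothetical, and the check that the two query proteins match different regions of the rosetta stone protein is omitted.

```python
from itertools import combinations

def rosetta_candidates(hits, similar_pairs):
    """Candidate interacting pairs from gene fusion evidence.

    hits: query protein -> set of reference proteins it is locally similar to.
    similar_pairs: set of frozensets of query proteins similar to each other.
    Returns {(a, b): shared reference proteins}.
    """
    out = {}
    for a, b in combinations(sorted(hits), 2):
        shared = hits[a] & hits[b]
        # Keep the pair only if it shares a composite and is not homologous itself.
        if shared and frozenset((a, b)) not in similar_pairs:
            out[(a, b)] = shared
    return out

# Toy input: P1, P2, P3 all match reference protein R1; P1 and P3 are paralogs.
hits = {"P1": {"R1"}, "P2": {"R1"}, "P3": {"R1"}, "P4": {"R2"}}
similar_pairs = {frozenset(("P1", "P3"))}
candidates = rosetta_candidates(hits, similar_pairs)
```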



Predicted Interactions


[Figure labels: Known physical interactions; Proteins in the same pathway]

Gene Fusion Based Prediction - Results

Interactions predicted based on gene fusion events

Distance on circle shows distance on genome

Co-evolution of Interacting Proteins

Selective pressure is likely to act on common function

Proteins that are interacting are expected to either be conserved together along with their interactions, or not conserved at all

Hypothesis I: Orthologs of interacting proteins also interact in other species (supported by evidence, but there are subtleties, which we will discuss later)

Hypothesis II: If two proteins are interacting, then they will show similar conservation patterns (phylogenetic profiles)

Phylogenetic Profiles



Correlation of Phylogenetic Profiles

Assume we have N genomes, protein X has homologs in x of them, Y has y, and they co-occur in z genomes

Measures of profile similarity: Hamming distance, Pearson correlation, mutual information, statistical significance
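All four measures can be computed from N, x, y, and z alone. The sketch below uses the standard textbook forms for binary presence/absence profiles, which is an assumption about what the slide intended: Hamming distance x + y - 2z, the Pearson (phi) coefficient, mutual information over the 2x2 contingency table, and a hypergeometric tail probability for observing at least z co-occurrences. The function name is hypothetical, and 0 < x, y < N is assumed.

```python
from math import comb, log2, sqrt

def profile_stats(N, x, y, z):
    """Similarity measures for two binary phylogenetic profiles (0 < x, y < N)."""
    hamming = x + y - 2 * z
    pearson = (N * z - x * y) / sqrt(x * (N - x) * y * (N - y))

    # Mutual information (in bits) from the 2x2 presence/absence table.
    cells = {(1, 1): z, (1, 0): x - z, (0, 1): y - z, (0, 0): N - x - y + z}
    px = {1: x / N, 0: 1 - x / N}
    py = {1: y / N, 0: 1 - y / N}
    mi = 0.0
    for (a, b), count in cells.items():
        if count > 0:
            p = count / N
            mi += p * log2(p / (px[a] * py[b]))

    # Hypergeometric tail: probability of >= z co-occurrences by chance.
    tail = sum(comb(x, k) * comb(N - x, y - k) for k in range(z, min(x, y) + 1))
    pval = tail / comb(N, y)
    return hamming, pearson, mi, pval

# Two identical profiles: present in the same 4 out of 10 genomes.
hamming, pearson, mi, pval = profile_stats(10, 4, 4, 4)
```

For identical profiles the Hamming distance is zero, the correlation is one, and the p-value equals the chance of picking exactly the same 4 genomes out of 10.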



Phylogenetic Profiles - Limitations

Many processes may be common across lineages: too many false positives

Database of genomes may be biased

All organisms are treated equally. Improvement: use trees instead of profiles

Proteins are assumed to be conserved as a whole, but it is domains that interact. Improvement: use domain profiles

[Figure labels: Yeast nucleoli and ribosomal proteins; Organisms]

Phylogenetic Tree Based Prediction

Phylogenetic trees of Ntr-family two-component sensor histidine kinases and their corresponding regulators



Mirror Tree Method

Need to have a sufficient number of genomes that contain homologs of both proteins
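Mirror tree compares the phylogenetic trees of two protein families. A common way to do this, assumed here, is the linear correlation between the two families' pairwise distance matrices computed over the same set of species. The function name and the toy matrices are hypothetical; the sketch also shows why enough shared genomes are needed, since with s species only s(s-1)/2 distance pairs enter the correlation.

```python
from math import sqrt

def mirror_tree_score(dist_a, dist_b):
    """Pearson correlation between the upper triangles of two distance matrices.

    dist_a[i][j] and dist_b[i][j] are evolutionary distances between the
    homologs found in species i and j, with species in the same order.
    """
    n = len(dist_a)
    xs = [dist_a[i][j] for i in range(n) for j in range(i + 1, n)]
    ys = [dist_b[i][j] for i in range(n) for j in range(i + 1, n)]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = sqrt(sum((a - mx) ** 2 for a in xs))
    sy = sqrt(sum((b - my) ** 2 for b in ys))
    return cov / (sx * sy)

# Family B evolves twice as fast as family A but with the same tree shape.
dist_a = [[0, 1, 2, 3], [1, 0, 1.5, 2.5], [2, 1.5, 0, 1], [3, 2.5, 1, 0]]
dist_b = [[2 * v for v in row] for row in dist_a]
score = mirror_tree_score(dist_a, dist_b)
```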



Matrix Method

Start with families of proteins that are suspected to interact

Identify specific pairs of proteins that interact by aligning the phylogenetic trees that underlie the two families

Assumption: identical number of proteins in each family

Correlated Mutations

Co-evolution of interacting proteins can be followed more closely by quantifying the degree of co-variation between pairs of residues from these proteins

Correlated mutations may correspond to compensatory mutations, where mutations in one protein are stabilized by changes in the other

[Figure: Distribution of distances between amino acid positions on a folded protein]

In silico Two-Hybrid

The correlation of mutations between two positions (which may be on different proteins) can be estimated from pairwise assessment of aligned multiple sequences

Position pairs with high correlation are potential contact points

Interaction index: for a protein pair, compute the aggregate correlation (of mutations) across all positions

In silico Two-Hybrid



Performance of I2H

I2H predicts physical, rather than functional, association

It requires complete genomes and a sufficient number of homologs

Co-citation Based PPI Prediction

Functionally associated proteins are likely to be cited in the same research article

We can assess the statistical significance of co-citation based on a hypergeometric model

Algorithmic problem: How to recognize and match protein names? Train the algorithm using annotated abstracts via conditional random fields (CRF)

Performance of Co-citation

The method is robust to choice of parameters for name recognition

Statistical significance is quite relevant until it saturates


Integrating PPI Networks

Interaction data coming from multiple sources

Different sources refer to different levels of interaction

Can integration handle noise, making interaction data more reliable?

Superpose interactions based on their reliability

Bayesian Integration

For each prediction method, compute a log-likelihood score

Let P(L|E) be the number of interactions predicted by method E for which functional association between the corresponding proteins is known

Let ~P(L|E) be the number of false positives

Let P(L) and ~P(L) be the corresponding priors

Assign weights to methods based on their log-likelihood scores
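A common form of the log-likelihood score, assumed here because the slide's formula did not survive, compares the odds of a true association among a method's predictions with the prior odds: LLS = ln[(P(L|E) / ~P(L|E)) / (P(L) / ~P(L))]. Because only ratios appear, counts can be used in place of probabilities. The function name and the numbers in the example are hypothetical.

```python
from math import log

def log_likelihood_score(pos_pred, neg_pred, pos_prior, neg_prior):
    """ln of (odds of true association given the evidence) / (prior odds).

    pos_pred / neg_pred: predicted pairs with known / contradicted association.
    pos_prior / neg_prior: the same counts over all annotated protein pairs.
    """
    return log((pos_pred / neg_pred) / (pos_prior / neg_prior))

# Hypothetical method: 80 of 100 predictions are supported, against prior odds of 1:9.
informative = log_likelihood_score(80, 20, 1000, 9000)
# A method no better than chance has the same odds as the prior, so it scores 0.
uninformative = log_likelihood_score(10, 90, 1000, 9000)
```

A score above zero means the method enriches for true associations; weighting methods by this score lets stronger evidence dominate the integrated network.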



Comparison of Prediction Methods

Integrated network captures functional association better

Note that the integrated network is “trained” using available data on functional association

Classification Based Integration

Points: Proteins, Space: Expression, Conservation, Labels: Function

Points: Protein Pairs, Space: Co-expression, Co-evolution, etc., Labels: Existence of Interaction

Performance of Domain Co-evolution

Co-Evolutionary Matrix

Domain Identification

Difference between Predicted PPIs

Pattern Discovery in Signaling Networks

Reconstruction of Cellular Signaling

Network reconstruction includes a chemically accurate representation of all biochemical events occurring within a defined signaling network, and incorporates the interconnectivity and functional relationships that are inferred from experimental data

Cellular signaling networks operate over several orders of magnitude in spatio-temporal scales

Quick responses (< 10^-1 s), e.g., protein modifications

Slow responses (minutes to hours), e.g., transcriptional regulation

Cellular Signaling

Who are the actors?

Receptors reside inside or on the surface of the cell and bind to specific chemicals with high specificity and affinity

Protein kinases catalyze the transfer of phosphate from high-energy donor molecules, such as ATP, to proteins, which results in activation of the target proteins

Protein phosphatases dephosphorylate active proteins

Transcription factors

Combinatorics of Cellular Signaling

What is the scope of these actors?

In how many different ways can a signal be transmitted?

In how many different states can a cell be?

Number of receptors, kinases, phosphatases, transcription factors, and the number of possible interactions between these

Alternative splicing: in eukaryotes, introns are spliced out before translation

Different combinations of introns can be spliced out, resulting in different products of the same gene

One more level of combinatorial complexity: if a gene has k exons, then splicing of alternative exons can generate up to 2^k isoforms

Scope of Human Signaling Network


Combinatorial Effects

Genes that code for signaling proteins compose 75% of all alternatively spliced genes

This implies that cells use alternative splicing extensively to achieve the extraordinary specificity that is required in signaling systems

After post-transcriptional modification, number of mRNA transcripts: 3858 for receptors, 1295 for kinases, 375 for phosphatases

After post-translational modification (phosphorylation, acetylation, methylation), number of distinct protein states: 30864 for receptors, 10360 for kinases, 3000 for phosphatases

20-fold increase in number of protein states over genes

Links and Connectivity

Interactions allow for an even greater degree of combinatorial control

Homo- and heterodimerization of 224 proteins can provide sufficient specificity to control the expression of 25000 genes in the human genome (n(n-1)/2 ≈ 25000 pairs for n = 224)

If receptors assume only ligand-bound and unbound states, then k receptors can recognize 2^k different ligand combinations

If 1% of the estimated 1543 receptors in the human genome (about 15) can be independently expressed, then the cell could potentially respond to 2^15 = 32768 different ligand combinations
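The counts on this slide can be checked with a few lines of arithmetic. The variable names are illustrative; the only assumption is that 1% of 1543 is rounded down to 15 receptors.

```python
# Pairwise combinations of 224 dimerizing proteins.
n = 224
hetero_pairs = n * (n - 1) // 2          # distinct unordered pairs, n(n-1)/2
with_homodimers = hetero_pairs + n       # also counting each protein paired with itself

# Ligand combinations recognized by k two-state receptors.
receptors = 1543
k = receptors // 100                     # 1% of receptors, rounded down
ligand_combinations = 2 ** k
```

The pair count n(n-1)/2 gives 24976, just under 25000; counting homodimers as well adds another 224.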


Signal Reception

Based on the average surface area of a cell and the average area of a receptor, it is estimated that there can be as many as a few million receptors on the surface of the typical somatic cell at a given time

~30000 distinct receptor types, ~130 receptors of each receptor type

A few receptors (~10-40) in high numbers (~10^5 per cell) for highly differentiated and specialized cells

Many receptors (~2000-3000) in small numbers (~10^2 per cell) for stem cells or undifferentiated cells

Reconstructing Signaling Networks


Focusing on Parts of the Network

Nodes: Who does a single protein interact with? In what contexts?

Modules: Group of related interactions, e.g., a protein complex

Pathways: Chain of interactions that connect a signaling input to output

Protein Complexes in PPI Networks

Spoke vs matrix model

Recall that PCP methods like TAP identify a group of proteins that bind to each other, using a single protein as bait

How to encode this into a network of pairwise interactions?
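The two encodings differ only in which pairs receive an edge. In this sketch the spoke model connects the bait to every protein it pulls down, while the matrix model connects every pair in the purified group. The function names and protein labels are hypothetical.

```python
from itertools import combinations

def spoke_edges(bait, preys):
    """Spoke model: a star centered on the bait."""
    return {frozenset((bait, p)) for p in preys}

def matrix_edges(bait, preys):
    """Matrix model: a clique over the bait and all preys."""
    return {frozenset(pair) for pair in combinations([bait, *preys], 2)}

# One purification: bait B pulls down P1, P2, P3.
spoke = spoke_edges("B", ["P1", "P2", "P3"])
matrix = matrix_edges("B", ["P1", "P2", "P3"])
```

With m preys the spoke model adds m edges and the matrix model adds m(m+1)/2, so the matrix model tends to introduce more false positives while the spoke model misses prey-prey contacts.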

[Figure labels: Actual Complex, Spoke Model, Matrix Model]

Protein Complexes in Matrix Model


Modules and Quotients

Define a module as a group of proteins such that the interactions of the proteins with those outside the module are identical

Quotient: replace the proteins in a module with a single node

The edges of the representative node will represent the interactions of all proteins in the module
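A module test and the quotient operation can be sketched directly from these definitions. The adjacency-dictionary representation, the function names, and the node label "M" are assumptions made for illustration, and `members` is assumed to be non-empty.

```python
def is_module(adj, members):
    """True if all members have identical neighbors outside the group."""
    members = set(members)
    outside = [adj[v] - members for v in members]
    return all(o == outside[0] for o in outside)

def quotient(adj, members, name="M"):
    """Contract a module into a single representative node called `name`."""
    members = set(members)
    if not is_module(adj, members):
        raise ValueError("not a module")
    external = adj[next(iter(members))] - members
    q = {}
    for v, nbrs in adj.items():
        if v in members:
            continue
        # Redirect edges that pointed into the module to the representative.
        q[v] = (nbrs - members) | ({name} if nbrs & members else set())
    q[name] = set(external)
    return q

# a and b both interact with exactly c and d outside the group, so {a, b} is a module.
adj = {"a": {"c", "d"}, "b": {"c", "d"}, "c": {"a", "b", "e"},
       "d": {"a", "b"}, "e": {"c"}}
contracted = quotient(adj, {"a", "b"})
```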


Types of Modules

Parallel module: no interaction between proteins in the module. These are likely to correspond to proteins that are functionally related, but do not interact with each other

Series module: proteins in the module form a clique among themselves. All proteins in the module perform some function together (single complex or multiple related complexes)

Prime module: all other topologies. This is probably what you will observe most of the time

Hierarchical Decomposition

Recursively identify and contract modules

This results in a tree representation of the network

Each node is a quotient graph

Leaves are proteins

Root is entire network

Decomposition of Yeast PPI Network


Identification of Modules

Graph clustering: find groups of nodes with high interconnectivity (and relatively low connectivity with the outside)

Issues: definition of clustering metrics

Density: has to be normalized by subgraph size

Distance-based metrics: a module has low diameter

Normalizing intra-cluster connectivity with outer connectivity

Algorithms

The problems are generalizations of maximum clique

Maximum clique itself is NP-hard (enumeration of cliques in early PPI networks was possible, though, and these were used as seed subgraphs for dense clusters)

Heuristic approaches: graph clustering is very well studied

Recall that, while clustering vectors in metric spaces (e.g., gene expression data), it is common to generate similarity graphs

Bottom-up heuristics: start with a single node, grow the subgraph until “density” is lost

Top-down heuristics: recursively partition the entire network until each subgraph is dense enough

MCODE Algorithm

Three stages: vertex weighting, complex prediction, post-processing for finding overlapping clusters

Vertex weighting: how “clustered” is a node's neighborhood? Use core clustering coefficient instead of clustering coefficient

N: subgraph induced by neighbors of v

K: k-core subgraph of N that maximizes k

d: density of K

weight(v) = k × d
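The vertex weight can be computed by peeling low-degree nodes from the neighborhood subgraph until the highest non-empty k-core remains. This is a minimal sketch of that definition, not the published MCODE code: it takes N to exclude v itself and uses density = edges / (n(n-1)/2), both of which are assumptions, and the function names are hypothetical.

```python
def highest_k_core(adj):
    """Return (k, nodes) of the highest non-empty k-core of an undirected graph."""
    best_k, best_nodes = 0, set(adj)
    nodes = set(adj)
    k = 1
    while nodes:
        # Repeatedly remove nodes with fewer than k neighbors inside the subgraph.
        changed = True
        while changed:
            low = {v for v in nodes if len(adj[v] & nodes) < k}
            changed = bool(low)
            nodes -= low
        if nodes:
            best_k, best_nodes = k, set(nodes)
        k += 1
    return best_k, best_nodes

def density(adj, nodes):
    """Fraction of possible edges present among `nodes`."""
    n = len(nodes)
    if n < 2:
        return 0.0
    edges = sum(len(adj[v] & nodes) for v in nodes) / 2
    return edges / (n * (n - 1) / 2)

def mcode_weight(adj, v):
    """weight(v) = k * d for the highest k-core K of v's neighborhood."""
    nbrs = adj[v]
    sub = {u: adj[u] & nbrs for u in nbrs}   # subgraph induced by neighbors of v
    k, core = highest_k_core(sub)
    return k * density(sub, core)

# v's neighbors a, b, c form a triangle; d is a neighbor with no other links.
adj = {"v": {"a", "b", "c", "d"}, "a": {"v", "b", "c"}, "b": {"v", "a", "c"},
       "c": {"v", "a", "b"}, "d": {"v"}}
weight_v = mcode_weight(adj, "v")
weight_d = mcode_weight(adj, "d")
```

Here the highest core of v's neighborhood is the triangle (k = 2, density 1), so the sparsely connected neighbor d does not dilute the weight as it would with the plain clustering coefficient.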


MCODE Algorithm (cont’d)

Complex prediction: seed a complex with the node with highest weight

At each node addition, check the neighbors of that node; if a neighbor's weight is above a given threshold relative to that of the seed vertex, add that neighbor to the complex as well

Repeat until no node can be added

Once a complex is identified, remove those nodes and find other complexes

Post-processing: filter out complexes that do not contain at least a 2-core

Add nodes to allow overlaps up to a given threshold

Complex score: density × size

Scoring Subgraphs

Observe the trade-off between size and density

A single interaction has density one

What is a good cut-off for density?

Statistical significance: what is the expected size of the largest dense subgraph? Implicitly trades off density and size

If we can analytically characterize the distribution of the size of the largest dense subgraph, then we can use statistical significance as a score function (stopping criterion)

This also implicitly handles correction for multiple hypothesis testing

G(n,p) Model

Let random variable $R_\rho$ be the size of the largest subgraph with density $\rho$

The typical value of $R_\rho$ is given by

$r_0 = \dfrac{\log n - \log\log n + \log H_p(\rho)}{H_p(\rho)}$

where $H_p(\rho) = \rho \log\dfrac{\rho}{p} + (1-\rho)\log\dfrac{1-\rho}{1-p}$ denotes divergence

The p-value of a larger dense subgraph is given by

$P(R_\rho \ge r_0) \le O\!\left(\dfrac{\log n}{n^{1/H_p(\rho)}}\right)$
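The divergence and the typical size r0 are easy to evaluate numerically. The sketch below assumes natural logarithms and treats the term (1-ρ) log((1-ρ)/(1-p)) as zero when ρ = 1; the function names and the example parameters are illustrative.

```python
from math import log

def divergence(rho, p):
    """H_p(rho) = rho*log(rho/p) + (1-rho)*log((1-rho)/(1-p)), natural log."""
    h = rho * log(rho / p)
    if rho < 1:
        h += (1 - rho) * log((1 - rho) / (1 - p))
    return h

def typical_size(n, rho, p):
    """r0 = (log n - log log n + log H_p(rho)) / H_p(rho)."""
    h = divergence(rho, p)
    return (log(n) - log(log(n)) + log(h)) / h

# Example: a sparse random graph with 5000 nodes and edge probability 0.01.
r_small = typical_size(5000, 0.8, 0.01)
r_large = typical_size(10**6, 0.8, 0.01)
```

Since r0 grows only logarithmically with n, a dense subgraph much larger than r0 is unlikely to arise by chance, which is what makes size a usable significance score.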

Piecewise G(n,p) Model

Two protein groups: hubs ($V_h$) and regulars ($V_l$)

There is an edge between u and v with probability

$p_h$ if $u, v \in V_h$

$p_b$ if $u \in V_h$, $v \in V_l$, or vice versa

$p_l$ if $u, v \in V_l$

$p_h > p_b > p_l$, $|V_h| < |V_l|$

If $|V_h| \ll |V_l|$, it contributes an additive factor:

$r_1 = \dfrac{\log n + 2|V_h|\log B - \log\log n + \log H_p(\rho)}{H_p(\rho)}$

where $B = \dfrac{p_b(1-p)}{p} + 1 - p_b$ and $p = p_l$

SIDES Algorithm

Recursive minimum-cut partitioning

Partition nodes into two parts such that the number of edges in between is minimized, then recurse

[Figure labels: p << 1]

MCODE vs SIDES

[Figure: axes labeled -log(p-value), Specificity (%), Sensitivity (%), Cluster Size. Correlation: SIDES 0.76, MCODE 0.43]

MCODE vs SIDES

[Figure: two panels with axes labeled Module Size, Specificity (%), Sensitivity (%). Correlation: SIDES 0.22, MCODE -0.02. Correlation: SIDES 0.27, MCODE 0.36]

Fiedler Vector

For network G, Laplacian L is defined as follows:

$L(i,i) = \sum_{j \ne i} w(u_i, u_j)$

$L(i,j) = -w(u_i, u_j)$ for $i \ne j$

Here, $w(u_i,u_j)$ denotes the weight of edge $u_i u_j$.

It can be shown that matrix L is positive semi-definite, with exactly one zero eigenvalue for each connected component

The eigenvector x that corresponds to the smallest non-zero eigenvalue minimizes

$x^T L x = \sum_{i<j} w(u_i,u_j)\,(x(i) - x(j))^2$

over unit vectors orthogonal to the constant vector (for a connected network)

This vector is known as the Fiedler vector of network G.

Spectral Graph Clustering

Fiedler vector provides the optimal mapping of the nodes of the network onto one-dimensional Euclidean space, in the mean-squares sense

This also generalizes to optimal k-dimensional mapping

Once a one-dimensional mapping is obtained, clustering algorithms can be used on this one-dimensional space: find cut points in one-dimensional space

Top-down: partition the one-dimensional space by finding two cut-points and recurse on each part

Bottom-up: merge the two closest nodes, recurse
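For a small connected network the Fiedler vector can be approximated without a linear algebra library. The sketch below is one possible approach, not the method used in the slides: it runs power iteration on cI - L, with c chosen as twice the largest degree so that all eigenvalues stay non-negative, and projects out the constant vector at each step. The function names, the iteration count, and the toy graph are assumptions; in practice a numerical eigensolver would be used.

```python
from math import sqrt

def laplacian(n, weighted_edges):
    """Dense Laplacian from (i, j, weight) triples of an undirected graph."""
    L = [[0.0] * n for _ in range(n)]
    for i, j, w in weighted_edges:
        L[i][j] -= w
        L[j][i] -= w
        L[i][i] += w
        L[j][j] += w
    return L

def fiedler_vector(L, iters=500):
    """Approximate Fiedler vector of a connected graph by power iteration."""
    n = len(L)
    c = 2 * max(L[i][i] for i in range(n))    # upper bound on the largest eigenvalue
    x = [float(i) for i in range(n)]          # arbitrary non-constant start vector
    for _ in range(iters):
        mean = sum(x) / n
        x = [xi - mean for xi in x]           # remove the constant component
        y = [c * x[i] - sum(L[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    return x

# Two triangles {0,1,2} and {3,4,5} joined by the single edge 2-3.
edges = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
         (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0), (2, 3, 1.0)]
x = fiedler_vector(laplacian(6, edges))
left = {i for i in range(6) if x[i] < 0}
right = {i for i in range(6) if x[i] >= 0}
```

Cutting the one-dimensional embedding at zero separates the two triangles, which is the top-down partitioning step on the smallest possible example.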


Identification of Signaling Pathways

We would like to identify simple paths (chains of interactions) in the PPI networks, which might correspond to, for example, signaling cascades, highlighting the group of proteins and interactions that are responsible for the transduction of a specific signal

What can we do based solely on interaction data?

In the PPI network, there may be plenty of paths connecting each pair of nodes

Which ones are interesting? How long can a pathway be? How about identifying “most reliable” paths?

Formulating Pathway Identification

Assume that the edges are scored, such that p(u,v) denotes the likelihood that proteins u and v interact

Then the product of edge scores along the path quantifies the likelihood that the path exists (treating edges as independent)

Let w(u,v) = -log p(u,v) denote the weight of an edge

Then, if we define the weight of a path as the sum of the weights of the edges on the path, paths with less weight will be more reliable paths

For a given set I of proteins, find all minimum-weight paths of length k from I to each protein in the network

I might be the set of receptor proteins

Enumerating Pathways

Dynamic programming

For $v \in S \subseteq V$, let W(v, S) be the minimum weight of a simple path that starts from a protein in I, visits all proteins in S, and ends in v

This function can be tabulated using the following recursion

$W(v, S) = \min_{u \in S \setminus \{v\}} \left[ W(u, S \setminus \{v\}) + w(u, v) \right]$ for $|S| > 1$

where $W(v, \{v\}) = 0$ if $v \in I$, and $\infty$ otherwise

For given v, the minimum path from I to v is given by the minimum W(v, S) over all S of size k that contain v

The running time of this algorithm is $O(k n^k)$

Not feasible for k larger than a few

Color Coding

Color each protein randomly using a set of k colors

Search for paths that contain one protein from each color => no vertex will be repeated on the path

The dynamic programming algorithm can be modified to solve this problem

The running time of this algorithm is $O(2^k k m)$

However, this algorithm misses an optimal path if two proteins on the path happen to be colored identically

For each path, the algorithm succeeds with probability $k!/k^k > e^{-k}$

Repeat on the order of $e^k \ln(n/\epsilon)$ times to make sure that the probability that the algorithm will fail for at least one protein is less than $\epsilon$
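One color-coding trial replaces the set of visited proteins in the dynamic program with the set of colors used so far, so the table is indexed by (end vertex, color set). The sketch below is an illustration under stated assumptions: "length k" is read as k proteins on the path, edge weights are -log of interaction probabilities, every source appears in at least one edge, and the function names, toy network, and trial count are hypothetical.

```python
import random
from math import log, inf

def colorful_min_paths(edges, sources, k, colors):
    """One trial: min weight of a path with k distinctly colored proteins per end vertex."""
    adj = {}
    for (u, v), p in edges.items():
        w = -log(p)                       # reliable edges get small weights
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    # W[(v, S)]: best path from a source ending at v that uses exactly colors S.
    W = {(v, frozenset([colors[v]])): 0.0 for v in sources}
    for _ in range(k - 1):
        nxt = {}
        for (u, S), wu in W.items():
            for v, w in adj.get(u, []):
                if colors[v] not in S:    # a new color implies a new vertex
                    key = (v, S | {colors[v]})
                    if wu + w < nxt.get(key, inf):
                        nxt[key] = wu + w
        W = nxt
    best = {}
    for (v, _S), w in W.items():
        best[v] = min(w, best.get(v, inf))
    return best

def min_weight_paths(edges, sources, k, trials, seed=0):
    """Repeat random colorings and keep the best weight found for each end vertex."""
    rng = random.Random(seed)
    nodes = sorted({x for e in edges for x in e})
    best = {}
    for _ in range(trials):
        colors = {v: rng.randrange(k) for v in nodes}
        for v, w in colorful_min_paths(edges, sources, k, colors).items():
            best[v] = min(w, best.get(v, inf))
    return best

# Toy network: receptor r, with a reliable route r-a-b and a weak shortcut r-b.
edges = {("r", "a"): 0.9, ("a", "b"): 0.8, ("r", "b"): 0.1}
one_trial = colorful_min_paths(edges, {"r"}, 3, {"r": 0, "a": 1, "b": 2})
best = min_weight_paths(edges, {"r"}, 3, trials=200)
```

With k = 3 a given path is colorful in a single trial with probability 3!/3^3 = 2/9, so a few hundred trials make a miss on this toy network extremely unlikely.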


Hunting Biologically Meaningful Paths

Constraining the set of proteins

If a protein is required to be in the path, assign a unique color to the target protein

If a family is required, assign a color to the family

Constraining order of occurrence

Signal transduction often progresses in inward order, from membrane proteins to nuclear proteins and transcription factors

Segmented pathways: assign labels to proteins, where labels represent cellular component, and require paths to be monotonic with respect to labels

Labels can also be generalized to intervals (consistent segments)
