PrePPI: structure-based protein-protein interaction prediction
BL5203: Molecular Recognition & Interaction Lecture 6: Modeling Protein Structure and...
BL5203: Molecular Recognition & Interaction
Lecture 6: Modeling Protein Structure and Protein-Protein Interaction
Y.Z. Chen, Department of Pharmacy
National University of Singapore Tel: 65-6616-6877; Email: [email protected]; Web: http://bidd.nus.edu.sg
Content
• Protein fold and structure
• Homology modeling
• Protein-protein docking
Sizes of protein databases
[Figure: bar chart on a logarithmic axis from 1 to 10,000,000,000]
• Protein residues: 500M
• Protein sequences: 1.6M
• Protein structures: 26K
• Protein complexes: 1K
Swiss-Prot database
Protein structure classification
• Protein world
• Protein fold
• Protein superfamily
• Protein family
PDB New Fold Growth
• The number of unique folds in nature is fairly small (possibly a few thousand)
• 90% of new structures submitted to PDB in the past three years have similar structural folds in PDB
[Figure: new PDB structures, divided into new folds and old folds]
Protein classification
• Number of protein sequences grows exponentially
• Number of solved structures grows exponentially
• Number of new folds identified is very small (and close to constant)
• Protein classification can
– Generate overview of structure types
– Detect similarities (evolutionary relationships) between protein sequences
Problems in Protein Bioinformatics
• 20,000 entries of proteins in the PDB
• 1000 - 2000 distinct protein folds in nature
• Thought to be only several thousand unique folds in all
• Prediction of structure from sequence
– Fold recognition
– Fragment construction
• Proteome annotation
• Protein-protein docking
Protein folding code
• Protein sequence → protein folding code → protein structure
Prediction of correct fold
• Query sequence → fold recognition → matched fold
• Match sequence against library of known folds
• Eisenberg et al.; Jones, Taylor, Thornton
Computational Requirements
• 1 sequence search takes 12 mins (3 GHz)
• Benchmarking on 100 proteins with 100 runs for a simplex search of parameter space = 80 days
• 30 approaches explored = 7 years (on 1 cpu)
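The figures on this slide can be checked with a few lines of arithmetic. The sketch below only restates the slide's numbers; the constant and function names are illustrative, and the slide evidently rounds 83 days down to 80 and 6.8 years up to 7.

```python
# Back-of-envelope check of the fold-recognition benchmarking cost.
MINUTES_PER_SEARCH = 12   # one sequence search on a 3 GHz cpu (from the slide)
PROTEINS = 100            # benchmark set size (from the slide)
RUNS = 100                # simplex-search runs over parameter space (from the slide)
APPROACHES = 30           # number of approaches explored (from the slide)


def cpu_days(minutes_per_job, jobs):
    """Convert a number of identical jobs into cpu days on one processor."""
    return minutes_per_job * jobs / (60 * 24)


benchmark_days = cpu_days(MINUTES_PER_SEARCH, PROTEINS * RUNS)
total_years = benchmark_days * APPROACHES / 365

print(f"one benchmark: {benchmark_days:.1f} cpu days")
print(f"{APPROACHES} approaches: {total_years:.1f} cpu years")
```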
Types of Structure Prediction
• De novo protein methods
– seek to build three-dimensional protein models "from scratch"
– Example: Rosetta
• Comparative protein modeling
– uses previously solved structures as starting points, or templates
– Example: protein threading
Factors that Make Protein Structure Prediction a Difficult Task
• The number of possible structures that proteins may possess is extremely large, as highlighted by the Levinthal paradox
• The physical basis of protein structural stability is not fully understood.
• The primary sequence may not fully specify the tertiary structure. – chaperones
• Direct simulation of protein folding is not generally tractable for both practical and theoretical reasons.
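The Levinthal paradox in the first bullet is usually illustrated with a back-of-envelope estimate. The chain length (100 residues), the 3 conformations per residue and the sampling rate of 10^13 conformations per second are illustrative assumptions commonly used for this argument, not figures from this lecture.

```latex
N_{\text{conf}} = 3^{100} \approx 5 \times 10^{47},
\qquad
t_{\text{search}} = \frac{5 \times 10^{47}}{10^{13}\ \text{s}^{-1}}
                  = 5 \times 10^{34}\ \text{s}
                  \approx 1.6 \times 10^{27}\ \text{years}
```

An exhaustive search is therefore out of the question even for a small protein, which is why prediction methods rely on templates, fragments or heuristics.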
Homology Modeling
• Homolog: a protein related to another by divergent evolution from a common ancestor
• At 40% amino-acid identity with its homolog
– No large insertions or deletions
– Produces a predicted structure equivalent to that of a medium-resolution experimentally solved structure
• 25% of known protein sequences fall in a safe area, implying they can be modeled reliably
Homology Modeling Defined
• Homology modeling
– Based on the reasonable assumption that two homologous proteins will share very similar structures.
– Given the amino acid sequence of an unknown structure and the solved structure of a homologous protein, each amino acid in the solved structure is mutated computationally into the corresponding amino acid from the unknown structure.
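The substitution step described above can be sketched as follows. This is a minimal illustration that assumes a pairwise alignment of template and target is already available as two equal-length strings with "-" for gaps; the function name `plan_mutations` and the example sequences are hypothetical. It only lists which template residues would be replaced; building side-chain coordinates and modeling insertions or deletions is outside its scope.

```python
def plan_mutations(template_aln, target_aln):
    """List the residue substitutions implied by a template/target alignment.

    Returns (mutations, identity) where mutations is a list of
    (template_position, template_residue, target_residue) with 1-based
    positions in the template sequence, and identity is the percentage of
    identical residues over the aligned (non-gap) columns.
    """
    if len(template_aln) != len(target_aln):
        raise ValueError("aligned sequences must have the same length")

    mutations = []
    aligned = identical = 0
    position = 0  # index of the current residue in the ungapped template
    for t, q in zip(template_aln, target_aln):
        if t != "-":
            position += 1
        if t == "-" or q == "-":
            continue  # insertion or deletion: needs loop modeling, not mutation
        aligned += 1
        if t == q:
            identical += 1
        else:
            mutations.append((position, t, q))

    identity = 100.0 * identical / aligned if aligned else 0.0
    return mutations, identity


mutations, identity = plan_mutations("ACD-EF", "ACNGEF")
print(mutations, identity)
```

The identity value gives a quick check against the 40% guideline quoted on the earlier slide.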
Homology Modeling Limitations
• Cannot study conformational changes
• Cannot find new catalytic/binding sites
• Brainstorm lack of activity vs activity
– Chymotrypsinogen, trypsinogen and plasminogen
– 40% homologous
– 2 active, 1 no activity, cannot explain why
• Large bias towards structure of template
• Models cannot be docked together
Why Homology Modeling?
• Value in structure based drug design
• Find common catalytic sites/molecular recognition sites
• Use as a guide to planning and interpreting experiments
• 70-80% chance that a target protein has a similar fold to a protein whose structure has been solved by X-ray crystallography or NMR spectroscopy
• Sometimes it’s the only option or best guess
Protein Threading
• A target sequence is threaded through the backbone structure of a collection of template proteins (fold library)
• Quantitative measure of how well the sequence fits the fold
• Based on assumptions
– 3-D structures of proteins have characteristics that are semi-quantitatively predictable
– reflect the physical-chemical properties of amino acids
– Limited types of interactions allowed within folding
Fold Recognition Methods
• Bowie, Lüthy and Eisenberg (1991)
• 2 approaches to recognition methods
• Derive a 1-D profile for each structure in the fold library and align the target sequence to these profiles
– Identify amino acids based on core or external positions
– Part of secondary structure
• Consider the full 3-D structure of the protein template
– Modeled as a set of inter-atomic distances
– NP-hard (if interactions of multiple residues are included)
Protein Threading
• The word threading implies that one drags the sequence (ACDEFG...) step by step through each location on each template
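The "dragging" picture can be sketched as an ungapped scan of the sequence along each template. Everything specific here is an assumption for illustration: each template is reduced to a 1-D string of environments ("B" for buried, "E" for exposed), the hydrophobic residue set is a rough choice, and the +1/-1 score is a toy stand-in for a real threading potential. Real threading programs also allow gaps.

```python
# Assumption: a crude hydrophobic set used only for this toy score.
HYDROPHOBIC = set("AVILMFWC")


def position_score(residue, environment):
    """+1 if a hydrophobic residue is buried or a polar one exposed, else -1."""
    return 1 if (residue in HYDROPHOBIC) == (environment == "B") else -1


def thread(sequence, template_env):
    """Slide the sequence along one template without gaps.

    Returns (best_score, best_offset), or None if the template is too short.
    """
    best = None
    for offset in range(len(template_env) - len(sequence) + 1):
        score = sum(
            position_score(res, template_env[offset + i])
            for i, res in enumerate(sequence)
        )
        if best is None or score > best[0]:
            best = (score, offset)
    return best


def recognise_fold(sequence, fold_library):
    """Thread the sequence through every template and return the best hit."""
    hits = {}
    for name, env in fold_library.items():
        hit = thread(sequence, env)
        if hit is not None:
            hits[name] = hit
    return max(hits.items(), key=lambda item: item[1][0])


# Hypothetical two-template fold library.
library = {"foldA": "EEBBEEBEEE", "foldB": "EBEBEBEBEB"}
best_fold = recognise_fold("ACDEFG", library)
print(best_fold)
```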
Generalized Threading Score
• Want to correctly recognize arrangements of residues
• Building a score function
– potentials of mean force
– from an optimization calculation
• $G(r_{AB}) = -kT \ln\left(\rho_{AB} / \rho_{AB}^{\circ}\right)$
– $G$: free energy
– $k$ and $T$: Boltzmann's constant and temperature, respectively
– $\rho_{AB}$: the observed frequency of AB pairs at distance $r$
– $\rho_{AB}^{\circ}$: the frequency of AB pairs at distance $r$ expected by chance
• Z-score $= (E_{\mathrm{nat}} - \langle E_{\mathrm{alt}} \rangle) / \sigma_{E_{\mathrm{alt}}}$
– Energy of the native structure minus the mean energy of all the wrong structures, divided by the standard deviation of those energies
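Both quantities on this slide are short enough to write out. The sketch below uses the inverse-Boltzmann sign convention G = -kT ln(ρ/ρ°), so pairs seen more often than chance get a negative (favourable) energy. The function names, the choice of kcal/mol units and the default temperature of 300 K are assumptions for illustration.

```python
import math
from statistics import mean, pstdev

K_B = 0.0019872  # Boltzmann constant in kcal/(mol*K); unit choice is an assumption


def pair_potential(observed_freq, expected_freq, temperature=300.0):
    """Potential of mean force for an AB pair at one distance bin."""
    return -K_B * temperature * math.log(observed_freq / expected_freq)


def z_score(native_energy, alternative_energies):
    """How many standard deviations the native energy lies from the decoy mean."""
    return (native_energy - mean(alternative_energies)) / pstdev(alternative_energies)


# A pair observed twice as often as expected by chance is favourable.
print(pair_potential(0.02, 0.01))
# A native structure well below the decoy energies gives a strongly negative Z-score.
print(z_score(-10.0, [-2.0, 0.0, 2.0]))
```

With this convention, a more negative Z-score means the native fold is better separated from the wrong structures.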
Scoring Different Folds
• Goodness of fit score
– Based on empirical energy function
– Modify to take into account pairwise interactions and solvation terms
– High score means good fit
– Low score means nothing learned
Some Threading Programs
• 3D-PSSM (ICNET). Based on sequence profiles, solvation potentials and secondary structure.
• TOPITS (PredictProtein server) (EMBL). Based on coincidence of secondary structure and accessibility.
• UCLA-DOE Structure Prediction Server (UCLA). Executes various threading programs and reports a consensus.
• 123D+. Combines substitution matrix, secondary structure prediction, and contact capacity potentials.
• SAM/HMM (UCSC). Based on Markov models of alignments of crystallized proteins.
• FFAS (Burnham Institute). Based on profile-profile matching algorithms of the query sequence with sequences from clustered PDB database.
• PSIPRED-GenThreader (Brunel)
• THREADER2 (Warwick). Based on solvation potentials and contacts obtained from crystallized proteins.
• ProFIT CAME (Salzburg)
Process of 3D Structure Prediction by Threading
• Does this protein have sequence similarity to others with a known structure?
• Structure-related information in the databases
• Results from threading programs
• Predicted folding comparison
• Threading on the structure and mapping of the known data
• A comparison between the threading-predicted structure and the actual one
Protein Threading Based on Multiple Protein Structure Alignment
Tatsuya Akutsu and Kim Lan Sim
Human Genome Center, Institute of Medical Science, University of Tokyo
• NP-hard if interactions between 2 or more amino acids are included
• Determine multiple structural alignments based on pairwise structure alignments
– Center Star Method
Center Star Method
• Let $I_0$ be the maximum number of gap symbols placed before the first residue of $S_0$ in any of the alignments $A(S_0, S_1), \ldots, A(S_0, S_N)$. Let $I_{|S_0|}$ be the maximum number of gaps placed after the last character of $S_0$ in any of the alignments, and let $I_i$ be the maximum number of gaps placed between characters $S_{0,i}$ and $S_{0,i+1}$, where $S_{j,i}$ denotes the $i$-th letter of string $S_j$.
• Create a string $S'_0$ by inserting $I_0$ gaps before $S_0$, $I_{|S_0|}$ gaps after $S_0$, and $I_i$ gaps between $S_{0,i}$ and $S_{0,i+1}$.
• For each $S_j$ ($j > 0$), create a pairwise alignment $A'(S'_0, S_j)$ between $S'_0$ and $S_j$ by inserting gaps into $S_j$ so that deletion of the columns consisting of gaps from $A'(S'_0, S_j)$ results in the same alignment as $A(S_0, S_j)$.
• Simply arrange the $A'(S'_0, S_j)$'s into a single matrix $A$ (note that all $A'(S'_0, S_j)$'s have the same length).
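The merging procedure above can be sketched in a few lines. The sketch assumes the pairwise alignments between the center string and every other string have already been computed and are given as pairs of equal-length gapped strings; the function names and the example strings are hypothetical. Padding gaps are placed after any inserted residues within a slot, which is one of several equivalent choices.

```python
def split_by_center(aligned_center, aligned_other):
    """Split the other row into per-slot inserts and per-residue matches.

    inserts[k] holds the symbols of the other string that sit in gap columns
    of the center before its residue k (k == len(center) is the tail slot);
    matches[k] is the symbol aligned with center residue k.
    """
    inserts, matches = [""], []
    for c, o in zip(aligned_center, aligned_other):
        if c == "-":
            inserts[-1] += o
        else:
            matches.append(o)
            inserts.append("")
    return inserts, matches


def center_star(center, pairwise):
    """Merge pairwise alignments A(S0, Sj) into one multiple alignment.

    `pairwise` is a list of (aligned_center, aligned_other) string pairs.
    Returns the rows of the matrix A, center row first.
    """
    n = len(center)
    parts = []
    for aligned_center, aligned_other in pairwise:
        if aligned_center.replace("-", "") != center:
            raise ValueError("alignment does not contain the center string")
        parts.append(split_by_center(aligned_center, aligned_other))

    # I_k: the widest gap run seen in slot k over all pairwise alignments.
    max_gaps = [
        max((len(inserts[k]) for inserts, _ in parts), default=0)
        for k in range(n + 1)
    ]

    def build(inserts, matches):
        row = []
        for k in range(n):
            row.append(inserts[k].ljust(max_gaps[k], "-"))
            row.append(matches[k])
        row.append(inserts[n].ljust(max_gaps[n], "-"))
        return "".join(row)

    rows = [build([""] * (n + 1), list(center))]
    rows += [build(inserts, matches) for inserts, matches in parts]
    return rows


rows = center_star(
    "ACGT",
    [("AC-GT", "ACTGT"), ("ACGT", "A-GT"), ("-ACGT", "TACG-")],
)
for row in rows:
    print(row)
```

Deleting the all-gap columns from the center row and any other single row recovers the original pairwise alignment, which is the property the third bullet asks for.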
Simple Threading Algorithm
• Apply simple score function based on structure alignment algorithm
– Let $X = x_1 \ldots x_N$ (input amino acid sequence)
– $C_i$ ($i$-th column in $A$)
• Test and analyze results and/or apply constraints
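One way to read this slide is as a global alignment of the input sequence X against the columns C_i of the multiple alignment A. The sketch below is an illustration under stated assumptions, not the authors' score function: the column score (fraction of matching residues, rescaled to the range -1 to 1) and the linear gap penalty are hypothetical choices, and only the optimal score is returned.

```python
def column_score(residue, column):
    """Toy score: +1 if every residue in the column matches, -1 if none do."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    matching = sum(1 for c in residues if c == residue)
    return 2.0 * matching / len(residues) - 1.0


def thread_onto_alignment(x, rows, gap=-1.0):
    """Needleman-Wunsch score of sequence x against the columns of alignment rows."""
    columns = ["".join(col) for col in zip(*rows)]
    n, m = len(x), len(columns)

    # score[i][j]: best score aligning x[:i] with the first j columns.
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + column_score(x[i - 1], columns[j - 1]),
                score[i - 1][j] + gap,   # residue of x left unaligned
                score[i][j - 1] + gap,   # column skipped
            )
    return score[n][m]


print(thread_onto_alignment("ACGT", ["ACGT", "ACGT"]))
print(thread_onto_alignment("AGT", ["ACGT", "ACGT"]))
```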
Protein Threading with Constraints
• Assume part of the input sequence $x_i \ldots x_{i+k}$ must correspond to part of the structure alignment $c_j \ldots c_{j+k}$
• Apply constraints
Prediction Power
• Entered in CASP3 competition
• 17 predictions made
• 3 targets evaluated as similar to correct folds
• Only team to create a nearly correct model for structure T0043
• Best in competition
– 8 evaluated as similar to correct
Next time….
• In depth detail of
– Multiple structural alignment program
• Multiprospector
– Global Optimum Protein Threading with Gapped Alignment
• Quality measures for protein threading models
• Improvements on threading-based models
Gapped Alignment
Trial structures for a local sequence taken from database of segments of known 3D structure
Fragment based method 1 - Predict structure of segment
Fragment based method 2 - Construct trial model from segments
1 - Low resolution energy function used in initial search through conformational space
2 - Side chains represented by single “centroid” pseudoatom
3 - Major contributions from: hydrophobic burial, beta strand pairing, steric overlap, specific residue pair interactions
4 - Models then refined using explicit rotamer based side chain representation and potential from design method
Fragment based method 3 - Identify good trial structures
Fragment-based protein folding
[Figure: Cro repressor (1orc), observed structure]
Computational Requirements
• Methodology performs numerous simulations and looks for clusters
• One simulation takes 3 mins (3 GHz)
• Require 1,000 simulations per protein = 2 days
• Benchmark on 50 proteins = 100 days
3D-GENOMICS - proteome annotation
[Diagram labels: Database sequences, Database structures, Proteome sequences, Functional data, Annotation procedure, MySQL database, WWW, New research]
Types of annotation
• Enzyme ABC, EC 1.2.3.4 - function suggested
• E. coli Protein 325 - homology but no function
• Membrane protein
• No similar sequence - orphan
• Structure
3D-Genomics database - structural and functional annotation
[Figure: stacked bar chart of the fraction of each proteome (% of residues, 0% to 100%) annotated as structure, function, any homology, non globular or orphan. Genomes, ordered by size: M. genitalium, H. pylori J99, A. aeolicus, M. jannaschii, P. horikoshii, H. influenzae, V. cholerae, M. tuberculosis H37Rv, B. subtilis, E. coli K12, S. cerevisiae, D. melanogaster, C. elegans, H. sapiens]
Computational requirements
• Today 800,000 protein sequences.
• Each sequence takes 15 mins to annotate on a 2.5 GHz cpu.
• Time today = 8,000 cpu days = 2.5 months with 100 processor farm.
• Need to update every 6 months.
• Number of sequences will double in 2-3 years and so will keep pace with increase in compute power.
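The cpu-day figure above follows from the per-sequence cost. Written out with the slide's own numbers, taking a month as 30 days for illustration, the totals are rounded on the slide to 8,000 cpu days and 2.5 months:

```latex
800{,}000 \times 15\ \text{min} = 1.2 \times 10^{7}\ \text{min}
  = 2 \times 10^{5}\ \text{h}
  \approx 8{,}300\ \text{cpu days},
\qquad
\frac{8{,}300\ \text{cpu days}}{100\ \text{processors}} \approx 83\ \text{days}
  \approx 2.8\ \text{months}
```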
Modelling protein-protein docking
• Coordinates of mol 1 and coordinates of mol 2
• Rigid body search
• List of possible complexes
• Evaluate association energy
• Flexibility to refine
• List of complexes
• Experimental information
Protein-protein docking
Step 1 - Generating Complexes
Shape complementarity
• Molecules are placed on grids A(i,j,k) and B(l,m,n), with grid values +1 and -15
• C = A(i,j,k) x B(l,m,n)
• Overlap: +1 x -15
• Match: +1 x +1
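The grid product on this slide can be sketched as a correlation summed over grid cells for each trial translation. The assignment of values is an assumption made for illustration: in grid A, surface cells carry +1 and core cells -15, and in grid B every occupied cell carries +1, so a match contributes +1 x +1 and an overlap with the core contributes +1 x -15. The sketch uses direct summation over sparse dictionaries and translations only; docking programs usually evaluate the same correlation with fast Fourier transforms and also scan rotations.

```python
from itertools import product


def correlation(grid_a, grid_b, shift):
    """Sum of A x B over all cells of B translated by `shift`."""
    dx, dy, dz = shift
    return sum(
        value_b * grid_a.get((i + dx, j + dy, k + dz), 0)
        for (i, j, k), value_b in grid_b.items()
    )


def best_translation(grid_a, grid_b, max_shift):
    """Scan all integer translations up to max_shift and return (score, shift)."""
    shifts = product(range(-max_shift, max_shift + 1), repeat=3)
    best = max(shifts, key=lambda s: correlation(grid_a, grid_b, s))
    return correlation(grid_a, grid_b, best), best


# Hypothetical toy molecules: A has two core cells and two surface cells,
# B is a two-cell probe.
grid_a = {
    (0, 0, 0): -15, (0, 1, 0): -15,   # core
    (1, 0, 0): 1, (1, 1, 0): 1,       # surface layer
}
grid_b = {(0, 0, 0): 1, (0, 1, 0): 1}

print(correlation(grid_a, grid_b, (0, 0, 0)))   # B sits on the core of A
print(best_translation(grid_a, grid_b, 2))      # B sits on the surface of A
```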
Electrostatic complementarity
• Charge in molecule 1 = Q(i,j,k); potential outside molecule 2 = V(l,m,n)
• E = Q(i,j,k) x V(l,m,n)
Step 2 - Modelling residue-residue interactions
[Figure: interface residues labelled V, I and E]
Empirical residue pair potentials
• Analyse residues packing across 90 hetero-protein interfaces
• A pair of residues (a, b) pack if they have one atom-atom contact < distance cut-off (4.5 Å)
• $\mathrm{Score}(a,b) = \log_{10} \dfrac{\text{observed no. of } a/b \text{ pairs}}{\text{expected no. of } a/b \text{ pairs}}$
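The log-odds score can be sketched as below. The slide does not say how the expected number of pairs is obtained, so the sketch assumes a simple random-pairing model: the expected count is the total number of contacts multiplied by the frequencies of the two residue types among all contacting residues (doubled for unlike pairs, since order does not matter). The function name and the toy contact list are hypothetical, and pairs that are never observed are simply left out.

```python
import math
from collections import Counter


def pair_scores(contacts):
    """Log-odds score for each residue-type pair seen in interface contacts.

    `contacts` is a list of (a, b) residue-type pairs, one per packed pair
    (at least one atom-atom contact within the distance cut-off).
    """
    observed = Counter(tuple(sorted(pair)) for pair in contacts)
    total_pairs = sum(observed.values())

    residue_counts = Counter(res for pair in contacts for res in pair)
    total_residues = sum(residue_counts.values())

    scores = {}
    for (a, b), count in observed.items():
        freq_a = residue_counts[a] / total_residues
        freq_b = residue_counts[b] / total_residues
        # Assumed null model: residues pair at random according to frequency.
        expected = total_pairs * freq_a * freq_b * (1 if a == b else 2)
        scores[(a, b)] = math.log10(count / expected)
    return scores


toy_contacts = [("V", "I"), ("V", "I"), ("E", "K"), ("E", "K")]
scores = pair_scores(toy_contacts)
print(scores)
```

A positive score marks a pair that packs across interfaces more often than the null model predicts.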
Step 3 - Including information about functional residues
• From literature
[Figure: functional residue labelled E]
Step 4 - Refinement by multicopy
• Search for optimal combination of side-chain rotamers by energy calculation
• + Limited rigid-body shifts
CAPRI - blind test of docking
• Unbound amylase
• Bound Ab - X-ray
• Bound Ab - predicted
• Prediction / Actual: Difference = 0.6 Å
Computational Requirements
• 1 run of procedure takes 2 days on one 3 GHz processor
• Development tested on 30 protein complexes takes 60 days for one parameter set
• Applications
– extension to predict which protein interacts with another requires 1000s of docking simulations
Application area
• Protein structure prediction
– fold recognition
– simulation
• Proteome annotation
• Protein-protein docking
Computing cost
• Modelling algorithm on one protein 10 mins - 2 days on one 3GHz cpu
• But algorithm development requires consideration of several structures (50 -100) with different parameter sets.
• Hence years of cpu required
Structure prediction & sequence space
[Figure: schematic placeholder sequences illustrating sequence space]
Multiple sequence alignments aid comparative protein modeling
• 1 in 3 sequences are recognizably related to at least one protein structure.
• A significant fraction of the remaining 2/3 have solved structural homologues, but they are not recognized through sequence similarity searching techniques.
• Marti-Renom et al. (2000)
• Multiple sequence alignments greatly improve the efficacy and accuracy of almost all phases of comparative modeling.
• Venclovas (2001)
Computational protein design
• Native structure → iterative refinement → new sequence
Large scale sequence generation
“Reverse BLAST” study:
• Total sequences generated: 200,000
• Processors available: 4,000
• Total time of data collection: 80 days
• Total backbone variants: 26,400
• Total structures: 264
“Reverse BLAST”: finding templates for comparative modeling
Larson SM, Garg A, Desjarlais JR, Pande VS. (2003) Proteins: Structure, Function, and Genetics
Experiment: Sequence quality
[Diagram: designed sequences searched with BLAST (E<0.01)]
Results: Sequence quality
[Figure: x-axis: designed sequence profile (ranked by E-value), 0 to 225; left y-axis: E-value of best PDB hit, log scale from 1E-17 to 10; right y-axis: average identity to native sequence (%), 0 to 30]
Method: “Reverse BLAST”
[Diagram: Designed Sequences → BLAST (E<0.01) → Hypothetical Proteins → Structural Templates]
Do the designed sequences help?
[Figure: x-axis: E-value threshold (-log(E)), 2 to 10; left y-axis: hits with sequence alignment : hits without, 0 to 5; right y-axis: total unique hits, 0 to 160; series: correctly identified structural templates, fold-increase in # of templates, fold-increase in # of genes, total hits]
Remote homology detection
Optimizing structural diversity
[Figure: x-axis: RMSD of structural ensemble (Angstroms), 0 to 2; left y-axis: (%), 0 to 80; right y-axis: sequence entropy, 0 to 6; series: sequence entropy, prediction accuracy, prediction coverage, mean pairwise %ID, mean native %ID]