Protein Structure Modeling (2). Prediction
Posted: 19-Dec-2015
Protein Structure Modeling (2)
Prediction
http://www.bmm.icnet.uk/people/rob/CCP11BBS/
Template-Based Prediction
Structure is better conserved than sequence
Structure can accommodate a wide range of mutations.
Physical forces favor certain structures.
The number of folds is limited. Currently known: ~700. Estimated total: 1,000 to 10,000.
[Figure: TIM barrel]
Evolutionary Comparison
• Sequence-sequence comparison: homology modeling (similar sequence – similar structure)
• Sequence-structure comparison: threading / fold recognition (sequences fold into a limited number of folds)
• ~90% of new globular proteins share similar folds with known structures, implying that comparative modeling methods are generally applicable for structure prediction
• Template-based modeling methods currently apply to 60-70% of new proteins, and this share is growing as more structures are solved
• NIH Structural Genomics Initiative plans to experimentally solve ~10,000 “unique” structures and predict the rest using computational methods
Scope of the Problem
Why do we need structural models?
1. only 20% of all proteins have a homologue in PDB
2. for ~ 70% of the proteins a suitable structure from which to build a 3D model is available.
3. predict functions of proteins that share low degrees of sequence similarity
4. identify proteins that may have new folds
How many structures are there?
Molecule type counts by experimental method:
Exp. method                   Proteins, Peptides,   Protein/Nucleic    Nucleic   Carbohydrates   Total
                              and Viruses           Acid Complexes     Acids
X-ray diffraction and other   13259                 636                602       14              14511
NMR                           2174                  82                 424       4               2684
Theoretical modeling          321                   24                 28        0               373
Total                         15754                 742                1054      18              17568
Source: http://www.rcsb.org/pdb/holdings.html
Protein Data Bank (PDB) Status: March 12, 2002
How many folds are there?
Class                                Number of folds   Number of superfamilies   Number of families
All alpha proteins                   138               224                       337
All beta proteins                    93                171                       276
Alpha and beta proteins (a/b)        97                167                       374
Alpha and beta proteins (a+b)        184               263                       391
Multi-domain proteins                28                28                        35
Membrane and cell surface proteins   11                17                        28
Small proteins                       54                77                        116
Total                                605               947                       1557
Source: http://scop.berkeley.edu/count.html
Structural Classification of Proteins (SCOP): Status (1 Mar 2002) based on 13220 PDB entries
Identification of new folds
Source: http://www.rcsb.org/pdb/holdings.html
Old fold vs. new fold
• A chain fold is considered old if it is similar to one of the selected chains according to the following criteria:
  - RMSD < 3.0 Å
  - number of aligned positions >= 70% of the length of this chain
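The old-versus-new test can be written as a small predicate. This is a minimal sketch that assumes both criteria must hold at once (the slide does not say so explicitly); the function name and parameter names are hypothetical.

```python
def is_old_fold(rmsd_angstrom, aligned_positions, chain_length,
                max_rmsd=3.0, min_coverage=0.70):
    """Return True if a chain counts as an old (already known) fold.

    Assumption: both criteria must be satisfied together.
    """
    if chain_length <= 0:
        raise ValueError("chain_length must be positive")
    coverage = aligned_positions / chain_length  # fraction of the chain aligned
    return rmsd_angstrom < max_rmsd and coverage >= min_coverage


# A 200-residue chain compared against one selected chain.
print(is_old_fold(2.1, 160, 200))  # 80% of positions aligned
print(is_old_fold(2.1, 120, 200))  # only 60% of positions aligned
```

A chain that fails this test against every selected chain would be a candidate new fold.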
How many more folds are there?
Estimation:
• number of possible folds ~ 4,000
• database of 930 folds covers 90% of protein families
Source: Govindarajan S., Recabarren R., & Goldstein R.A. 1999. Proteins: Structure, Function, and Genetics 35:408-414
Homology Modeling
• also called “comparative protein modeling”, “modeling by homology”, “knowledge-based modeling”
• the most successful tool for prediction of protein structure from sequence
Homology Modeling
• The target sequence is aligned with the sequence of a known structure (the template); the two usually share a sequence identity of 30% or more.
• The target sequence is then superimposed onto the template, replacing equivalent side-chain atoms where necessary.
• The structure is refined to bring it closer to the actual structure than the template is.
Homology Modeling
• Given a sequence, what is the best way of mounting it onto a known structure?
GHIKLSYTVNEQNLKPERFFYTSAVAIL
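Once an alignment is available, mounting amounts to reading off which target residue sits at each template position. The sketch below illustrates that bookkeeping only; the function name, the gap character "-", and the toy template string are assumptions, and no coordinates are built.

```python
def mount_sequence(aligned_target, aligned_template, template_start=1):
    """Map template residue numbers to target residues from a gapped alignment.

    Returns (mapping, insertions):
      mapping    - {template residue number: target residue}
      insertions - target residues with no template position
                   (these would need loop modeling)
    """
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    mapping, insertions = {}, []
    template_pos = template_start - 1
    for target_res, template_res in zip(aligned_target, aligned_template):
        if template_res != "-":
            template_pos += 1  # advance along the template structure
        if target_res == "-":
            continue  # deletion: this template residue has no counterpart
        if template_res == "-":
            insertions.append(target_res)
        else:
            mapping[template_pos] = target_res
    return mapping, insertions


# First residues of the slide's sequence against a hypothetical template.
mapping, insertions = mount_sequence("GHIK-LSYT", "GH-KALTYT")
print(mapping)
print(insertions)
```

Template positions missing from the mapping, and residues listed as insertions, mark the gaps that loop modeling has to close.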
What is the basis for homology modeling?
• The relative RMSD of the α-carbon coordinates is ~ 1 Å if the protein cores share 50% identity.
• Protein sequences with > 70% similarity allow construction of models with < 3 Å RMSD
• The problem reduces to:
- Loop structure modeling (connections between α-helices and β-strands)
- Side-chain modeling (energy refinement)
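The RMSD figures quoted above are root-mean-square deviations over paired atoms. As a minimal sketch, assuming the two coordinate sets are already superimposed and listed in the same order (no fitting step is performed here), it can be computed as follows; the function name is hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) coordinates in Å.

    Assumption: the structures are already superimposed and atoms are paired
    by index (for example, equivalent alpha-carbons).
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


model = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)]
template = [(0.0, 0.0, 1.0), (3.8, 0.0, 1.0)]
print(rmsd(model, template))
```

In practice an optimal superposition is computed first, since RMSD depends on how the two structures are placed relative to each other.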
Input requirements for Homology Modeling
1. TARGET SEQUENCE (primary protein sequence with unknown structure)
2. TEMPLATE (protein whose 3D structure has already been determined)
3. SEQUENCE ALIGNMENT (using Clustal W) between template and target sequence
Find the appropriate template
Please enter your sequence in FASTA format.
SWISS-MODEL Blast
Find the Appropriate Modelling Template(s)
MTKNVLMLHGLAQSGDYFASKTKGFRAEMEKLGYKLYYPTAPNEFPPADVPDFLGAPGDGENTGVLAWLENDPSTGGYFIPQTTIDYLHNYVLENGPFAGIVGFSQGAGVTDFNGLLGLTTEEQPPLEFFMAVSGFRFQPQQYQEQYDLHPISVPSLHVQGELDTKVQGLYNSCTEDSRTLLMHSGGHFVPNSRGFVRKVAQWLQQLT*
Submit Request Clear Sequence
Source: http://www.expasy.org/swissmod/SM_Blast.html
Choose a template
Template search results
4CD2A
LIGAND INDUCED CONFORMATIONAL CHANGES IN THE CRYSTAL STRUCTURES OF PNEUMOCYSTIS CARINII DIHYDROFOLATE REDUCTASE COMPLEXES WITH FOLATE AND NADP+
MOL_ID: 1; MOLECULE: DIHYDROFOLATE REDUCTASE; CHAIN: A; SYNONYM: PCDHFR; EC: 1.5.1.3; ENGINEERED: YES
MOL_ID: 1; ORGANISM_SCIENTIFIC: PNEUMOCYSTIS CARINII; ORGANISM_COMMON: BACTERIA; EXPRESSION_SYSTEM: ESCHERICHIA COLI; EXPRESSION_SYSTEM_COMMON: BACTERIA; EXPRESSION_SYSTEM_PLASMID: PT7-7; EXPRESSION_SYSTEM_GENE: C-DNA P.CARINII DHFR
V.CODY, N.GALITSKY, D.RAK, J.R.LUFT, W.PANGBORN, S.F.QUEENER
Length = 202
Score = 157 bits (393), Expect = 9e-39
Identities = 82/220 (37%), Positives = 138/220 (62%), Gaps = 22/220 (10%)
Query: 232 RDLTMIVAVSSPNLGIGKKNSMPWHIKQEMAYFANVTSSTESSGQLEEGKSKIMNVVIMG 291
             LT IVA      GIG  NS PW  K E  YF  VTS        E      MNVV MG
Sbjct: 1   KSLTLIVALTT-SYGIGRSNSLPWKLKKEISYFKRVTSFVPTFDSFES-----MNVVLMG 54
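The identity and gap counts in a search result can be recomputed from the aligned rows when choosing among templates. The sketch below does so for the first alignment block shown above (60 columns, not the full 220-column alignment); the function name is hypothetical, and "Positives" is omitted because it requires a substitution matrix.

```python
def alignment_stats(query, subject):
    """Count identities and gapped columns in two aligned, equal-length rows."""
    if len(query) != len(subject):
        raise ValueError("aligned rows must have equal length")
    identities = sum(q == s and q != "-" for q, s in zip(query, subject))
    gaps = sum(q == "-" or s == "-" for q, s in zip(query, subject))
    return identities, gaps, len(query)


query = "RDLTMIVAVSSPNLGIGKKNSMPWHIKQEMAYFANVTSSTESSGQLEEGKSKIMNVVIMG"
sbjct = "KSLTLIVALTT-SYGIGRSNSLPWKLKKEISYFKRVTSFVPTFDSFES-----MNVVLMG"

identities, gaps, length = alignment_stats(query, sbjct)
print(f"Identities = {identities}/{length} ({100 * identities // length}%)")
print(f"Gaps = {gaps}/{length} ({100 * gaps // length}%)")
```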
Mounting the sequence onto the structure
[Figure: the target sequence mounted onto the template structure. Yellow = adrenergic receptor sequence; blue = adrenergic receptor (PDB 1F88).]
[Figure: modeled structure, showing gaps, followed by the corrected model.]
Refinement
• Bond angle energy
• Dihedral angle energy
• van der Waals energy
• Electrostatic interactions
• Hydrogen bonds
• Geometrical constraints
• Packing density
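Two of the terms listed above have widely used textbook forms, sketched here for a single atom pair. The slides do not name a force field, so the 12-6 Lennard-Jones form, the Coulomb form, and every parameter value below are assumptions chosen for illustration.

```python
def lennard_jones(r, epsilon, sigma):
    """12-6 Lennard-Jones van der Waals energy for one atom pair.

    The energy is zero at r = sigma and reaches its minimum, -epsilon,
    at r = 2**(1/6) * sigma.
    """
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def coulomb(r, q1, q2, dielectric=1.0, k=332.06):
    """Electrostatic energy in kcal/mol for charges in units of e and r in Å.

    k is approximately 332.06 kcal*Å/(mol*e^2); the dielectric constant
    is an assumed, adjustable parameter.
    """
    return k * q1 * q2 / (dielectric * r)


# Illustrative (not force-field) parameters for one pair of atoms.
print(lennard_jones(3.8, epsilon=0.2, sigma=3.4))
print(coulomb(4.0, q1=0.5, q2=-0.5, dielectric=4.0))
```

A refinement procedure would sum terms like these over all relevant atom pairs, together with the bonded and restraint terms, and minimize the total.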
Evaluating your model
• A model is considered inaccurate if its atomic coordinates are not within 0.5 Å RMSD of the template control
Threading-Based Protein Structure Prediction
Threading, Fold recognition, Protein fold assignments
Given:
• a database of protein structures / folds summarizing designs found in nature
• individual protein sequence
Goal:
• Find the structural backbone that best fits the protein sequence. This is the inverse of the protein folding problem.
Concept of Threading
structure prediction through recognizing native-like fold
o Thread (align or place) a query protein sequence onto a template structure in an “optimal” way
o Good alignment gives approximate backbone structure
Query sequence MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
Template set
Prediction accuracy: fold recognition / alignment
Why is it called threading?
• threading a specific sequence through all known folds
• for each fold estimate the probability that the sequence can have that fold
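The loop over folds can be sketched as follows. This is a toy illustration, not a real threading program: each fold is reduced to a string of buried (B) or exposed (E) positions, the score is a simple count rather than a probability, placement is gapless, and the hydrophobic residue set and all names are assumptions.

```python
HYDROPHOBIC = set("AVLIMFWC")  # assumed, simplified residue classification


def position_score(residue, environment):
    """+1 if the residue suits the environment ('B' buried, 'E' exposed), else -1."""
    buried = environment == "B"
    return 1 if (residue in HYDROPHOBIC) == buried else -1


def thread_score(sequence, environments):
    """Best score over all gapless placements of the fold along the sequence."""
    n, m = len(sequence), len(environments)
    if m > n:
        return float("-inf")  # fold is longer than the sequence
    return max(
        sum(position_score(sequence[offset + i], environments[i]) for i in range(m))
        for offset in range(n - m + 1)
    )


def rank_folds(sequence, fold_library):
    """Thread the sequence through every fold and sort by descending score."""
    scores = {name: thread_score(sequence, env) for name, env in fold_library.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


library = {"fold_A": "BBEEBB", "fold_B": "EEEEEE"}  # hypothetical fold library
ranking = rank_folds("LVKDIL", library)
print(ranking)
```

The top-ranked fold is the predicted backbone; real methods also allow gaps and use statistically derived scores.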
Fold Recognition and Threading
• Limited number of folds: 800-1000
• Known number of folds ~ 700
• Sequence-fold agreement ?
Application of Threading
• Predict structure
• Identify distant homologues of protein families
• Predict function of protein with low degree of sequence similarity to other proteins
Structure Families
SCOP: http://scop.mrc-lmb.cam.ac.uk/scop/
(domains, good annotation)
CATH: http://www.biochem.ucl.ac.uk/bsm/cath/
CE: http://cl.sdsc.edu/ce.html
Dali Domain Dictionary: http://columba.ebi.ac.uk:8765/holm/ddd2.cgi
FSSP: http://www2.ebi.ac.uk/dali/fssp/
(chains, updated weekly)
HOMSTRAD: http://www-cryst.bioc.cam.ac.uk/~homstrad/
HSSP: http://swift.embl-heidelberg.de/hssp/
Hierarchy of Templates
Homologous family: evolutionarily related, with significant sequence identity -- 1827 in SCOP
Superfamily: different families whose structural and functional features suggest a common evolutionary origin -- 1073 in SCOP (good tradeoff for accuracy/computing)
Fold: different superfamilies having the same major secondary structures in the same arrangement and with the same topological connections (energetics favoring certain packing arrangements) -- 686 out of 39,893 in SCOP
Class: secondary structure composition.
Template and Fold
Secondary structures and their arrangement
Non-redundant representatives through structure-structure comparison
Core of a Template
Core secondary structures: α-helices and β-strands
Representation of folds: Definition of Template
• Residue type / profile
• Secondary structure type
• Solvent accessibility
• Coordinates for Cα / Cβ (pairwise preferences between two residues)
Threading is alignment squared.
• Environmental preferences of amino acids (3D-PSSM):
  - Environment classes (α-helix, β-sheet), solvent accessibility
  - Pair potentials: physical interactions
  - Substitution matrices
• Possible alignments to the template are evaluated.
• Evaluation of each position is dependent on the rest of the alignment.
Scoring Function
…YKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEW…
The scoring function describes how well the sequence fits the template:
How well a sequence residue aligns to a residue on the structure: E_m (mutation term)
How well a residue fits a structural environment: E_s (singleton term)
How preferable it is to put two particular residues nearby: E_p (pairwise term)
Alignment gap penalty: E_g
Total energy: E = E_m + E_s + E_p + E_g
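The four terms can be summed for one candidate alignment as sketched below. The data layout, the function names, and the three toy energy functions are all assumptions made for illustration; real threading programs derive these terms from statistics over known structures.

```python
HYDROPHOBIC = set("AVLIMFWC")  # assumed, simplified residue classification


def e_mut(residue, template_residue):
    """Toy mutation term: reward identical residues."""
    return -1.0 if residue == template_residue else 0.5


def e_single(residue, environment):
    """Toy singleton term: hydrophobic residues prefer buried ('B') positions."""
    return -1.0 if (residue in HYDROPHOBIC) == (environment == "B") else 1.0


def e_pair(res_a, res_b):
    """Toy pairwise term: reward hydrophobic residues placed in contact."""
    return -0.5 if res_a in HYDROPHOBIC and res_b in HYDROPHOBIC else 0.0


def threading_energy(sequence, template, alignment, contacts, gap_penalty=1.0):
    """Return E_m + E_s + E_p + E_g for one sequence-to-template alignment.

    template:  list of (residue, environment), one entry per template position
    alignment: alignment[i] is the sequence index placed at template
               position i, or None for a gap
    contacts:  pairs (i, j) of template positions that are close in space
    """
    E_m = E_s = E_p = E_g = 0.0
    for i, seq_idx in enumerate(alignment):
        if seq_idx is None:
            E_g += gap_penalty
            continue
        template_residue, environment = template[i]
        E_m += e_mut(sequence[seq_idx], template_residue)
        E_s += e_single(sequence[seq_idx], environment)
    for i, j in contacts:
        a, b = alignment[i], alignment[j]
        if a is not None and b is not None:
            # The pair term couples positions, so each position's score
            # depends on what the rest of the alignment placed nearby.
            E_p += e_pair(sequence[a], sequence[b])
    return E_m + E_s + E_p + E_g


template = [("L", "B"), ("K", "E"), ("V", "B")]  # hypothetical 3-position template
print(threading_energy("LDV", template, [0, 1, 2], [(0, 2)]))
print(threading_energy("LDV", template, [0, None, 2], [(0, 2)]))
```

Lower totals indicate a better fit in this sign convention, so a threading search would look for the alignment that minimizes the sum.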
What We Learned…
• Why threading?
• Evolutionary foundation of threading
• Template library and its generation
• Concept of scoring function
CASP (Critical Assessment of Structure Predictions)
• a competition in protein structure prediction, held every two years.
http://predictioncenter.llnl.gov/casp5/Casp5.html
CASP (Critical Assessment of Structure Predictions)
• Targets for:
  - comparative modelling (15)
  - fold recognition (22)
  - ab initio modelling (15)
http://predictioncenter.llnl.gov/casp5/Casp5.html
CASP Experiment
• Experimentalists are asked to provide information about structures expected to be solved soon
• Predictors retrieve the sequence from prediction center (predictioncenter.llnl.gov)
• Deposit predictions throughout the season
• Meeting held to assess results
Prediction Categories
• Comparative Modeling – modeling by homology
• Fold Recognition
  - Advanced Sequence Comparison Methods
  - Threading
• New Fold Methods / “ab initio”
• Categories are separated by distance from any known structure
Expected Performance
[Figure: predicted model compared with the X-ray structure for target t0100]
PROSPECT (threading) prediction in CASP4: 12 out of 19 folds recognized
Conclusions
• When a suitable template structure exists in PDB, using homology modeling on target sequence is best for predicting the structure
• Fold Recognition servers can help find a template when conventional sequence analysis methods fail
• Combining elements from several sources may allow you to construct reasonably accurate models