Protein Structure Modeling (2). Prediction


http://www.bmm.icnet.uk/people/rob/CCP11BBS/

Template-Based Prediction

Structure is better conserved than sequence

Structure can accommodate a wide range of mutations.

Physical forces favor certain structures.

The number of folds is limited. Currently ~700 are known; the total is estimated at 1,000 to 10,000.

[Figure: TIM barrel]

Evolutionary Comparison

• Sequence-sequence comparison: homology modeling (similar sequence – similar structure)

• Sequence-structure comparison: threading / fold recognition (sequences fold into a limited number of folds)

• ~90% of new globular proteins share similar folds with known structures, implying the general applicability of comparative modeling methods for structure prediction

• Template-based modeling methods are generally applicable to structure prediction (currently 60-70% of new proteins, and this number is growing as more structures are solved)

• NIH Structural Genomics Initiative plans to experimentally solve ~10,000 “unique” structures and predict the rest using computational methods

Scope of the Problem

Why do we need structural models?

1. only 20% of all proteins have a homologue in PDB

2. for ~ 70% of the proteins a suitable structure from which to build a 3D model is available.

3. predict functions of proteins that share low degrees of sequence similarity

4. identify proteins that may have new folds

How many structures are there?

Protein Data Bank (PDB) Status: March 12, 2002

Method                               Proteins, Peptides,  Protein/Nucleic  Nucleic  Carbohydrates  Total
                                     and Viruses          Acid Complexes   Acids
Exp.: X-ray diffraction and other    13259                636              602      14             14511
Exp.: NMR                            2174                 82               424      4              2684
Theor.: theoretical modeling         321                  24               28       0              373
Total                                15754                742              1054     18             17568

Source: http://www.rcsb.org/pdb/holdings.html

How many folds are there?

Structural Classification of Proteins (SCOP): Status (1 Mar 2002) based on 13220 PDB entries

Class                               Number of folds  Number of superfamilies  Number of families
All alpha proteins                  138              224                      337
All beta proteins                   93               171                      276
Alpha and beta proteins (a/b)       97               167                      374
Alpha and beta proteins (a+b)       184              263                      391
Multi-domain proteins               28               28                       35
Membrane and cell surface proteins  11               17                       28
Small proteins                      54               77                       116
Total                               605              947                      1557

Source: http://scop.berkeley.edu/count.html

Identification of new folds

Source: http://www.rcsb.org/pdb/holdings.html

Old fold vs. new fold

• The fold of a chain is considered old if it is similar to one of the selected chains according to the following criteria:

• RMSD < 3.0Å

• number of aligned positions >= 70% of the length of this chain.
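The two criteria above can be written as a small check. This is only a sketch: the function name `is_old_fold` and the example numbers are hypothetical, and it assumes the RMSD and the number of aligned positions have already been obtained from a structure comparison program.

```python
def is_old_fold(rmsd_angstrom, aligned_positions, chain_length,
                max_rmsd=3.0, min_coverage=0.70):
    """Return True if a chain matches a selected chain under both criteria."""
    if chain_length <= 0:
        raise ValueError("chain_length must be positive")
    # Fraction of this chain that is covered by the structural alignment.
    coverage = aligned_positions / chain_length
    return rmsd_angstrom < max_rmsd and coverage >= min_coverage


# Hypothetical comparisons of a 200-residue chain against selected chains.
print(is_old_fold(2.1, 150, 200))  # both criteria are met
print(is_old_fold(2.1, 120, 200))  # only 60% of the chain is aligned
print(is_old_fold(3.4, 180, 200))  # RMSD is above the 3.0 Å cutoff
```

A chain that fails this check against every selected chain would be a candidate new fold.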

How many more folds are there?

Estimation:

• number of possible folds ~ 4,000

• database of 930 folds covers 90% of protein families

Source: Govindarajan S., Recabarren R., & Goldstein R.A. 1999. Proteins: Structure, Function, and Genetics 35:408-414

Homology Modeling

• also called “comparative protein modeling”, “modeling by homology”, “knowledge-based modeling”

• the most successful tool for prediction of protein structure from sequence

Homology Modeling

• The target sequence is aligned with the sequence of a known structure; the two usually share a sequence identity of 30% or more.

• The target sequence is then superimposed onto the template, replacing equivalent side-chain atoms where necessary.

• The structure is refined to bring the model closer to the actual structure than the template is.

Homology Modeling

• Given a sequence, what is the best way of mounting it onto a known structure?

GHIKLSYTVNEQNLKPERFFYTSAVAIL

What is the basis for homology modeling?

• The relative RMSD of the α-carbon coordinates is ~1 Å if the protein cores share 50% identity.

• Protein sequences with > 70% similarity allow construction of models with < 3 Å RMSD

• Reduction to:

- Loop structure modeling (connections αα, αβ, βα, ββ)

- Side-chain modeling (energy refinement)
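The RMSD values quoted above can be illustrated with a short sketch. It assumes the two structures are already superimposed and that residues are paired by index; finding the optimal superposition is a separate step not shown here. The function name `ca_rmsd` and the coordinates are hypothetical.

```python
import math


def ca_rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) C-alpha coordinates.

    Assumes the structures are already superimposed and that atom i in
    coords_a corresponds to atom i in coords_b.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        squared += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared / len(coords_a))


# Hypothetical coordinates (Å): every atom of the model is displaced by 1 Å.
model = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
template = [(0.0, 1.0, 0.0), (3.8, 1.0, 0.0), (7.6, 1.0, 0.0)]
print(ca_rmsd(model, template))
```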

Input requirements for Homology Modeling

1. TARGET SEQUENCE (primary protein sequence with unknown structure)

2. TEMPLATE (protein whose 3D structure has already been determined)

3. SEQUENCE ALIGNMENT (using Clustal W) between template and target sequence

Find the appropriate template

Please enter your sequence in FASTA format.

SWISS-MODEL Blast

Find the Appropriate Modelling Template(s)

MTKNVLMLHGLAQSGDYFASKTKGFRAEMEKLGYKLYYPTAPNEFPPADVPDFLGAPGDGENTGVLAWLENDPSTGGYFIPQTTIDYLHNYVLENGPFAGIVGFSQGAGVTDFNGLLGLTTEEQPPLEFFMAVSGFRFQPQQYQEQYDLHPISVPSLHVQGELDTKVQGLYNSCTEDSRTLLMHSGGHFVPNSRGFVRKVAQWLQQLT*


Source: http://www.expasy.org/swissmod/SM_Blast.html
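A raw sequence can be turned into a FASTA record with a few lines of code. This is a minimal sketch: the function name `to_fasta` and the record name `target` are hypothetical, and removing a trailing `*` (as seen at the end of the example sequence) is an assumption about what a submission form would prefer, not a documented requirement.

```python
def to_fasta(name, sequence, width=60):
    """Format a raw one-letter sequence as a FASTA record."""
    # Remove whitespace and a trailing '*' marker (an assumption, see above).
    residues = "".join(sequence.split()).rstrip("*").upper()
    # Wrap the sequence into lines of at most `width` residues.
    lines = [residues[i:i + width] for i in range(0, len(residues), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


# The first residues of the example sequence, with the trailing marker.
record = to_fasta("target", "MTKNVLMLHGLAQSGDYFASKTKGF*")
print(record)
```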

Choose a template

Template search results

4CD2A: LIGAND INDUCED CONFORMATIONAL CHANGES IN THE CRYSTAL STRUCTURES OF PNEUMOCYSTIS CARINII DIHYDROFOLATE REDUCTASE COMPLEXES WITH FOLATE AND NADP+

MOL_ID: 1; MOLECULE: DIHYDROFOLATE REDUCTASE; CHAIN: A; SYNONYM: PCDHFR; EC: 1.5.1.3; ENGINEERED: YES

MOL_ID: 1; ORGANISM_SCIENTIFIC: PNEUMOCYSTIS CARINII; ORGANISM_COMMON: BACTERIA; EXPRESSION_SYSTEM: ESCHERICHIA COLI; EXPRESSION_SYSTEM_COMMON: BACTERIA; EXPRESSION_SYSTEM_PLASMID: PT7-7; EXPRESSION_SYSTEM_GENE: C-DNA P.CARINII DHFR

V.CODY, N.GALITSKY, D.RAK, J.R.LUFT, W.PANGBORN, S.F.QUEENER

Length = 202
Score = 157 bits (393), Expect = 9e-39
Identities = 82/220 (37%), Positives = 138/220 (62%), Gaps = 22/220 (10%)

Query: 232 RDLTMIVAVSSPNLGIGKKNSMPWHIKQEMAYFANVTSSTESSGQLEEGKSKIMNVVIMG 291
             LT IVA      GIG  NS PW  K E  YF  VTS        E      MNVV MG
Sbjct: 1   KSLTLIVALTT-SYGIGRSNSLPWKLKKEISYFKRVTSFVPTFDSFES-----MNVVLMG 54

http://www.expasy.org/swissmod/cgi-bin/#pagetop
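The identity figure in a search result is a count of identical aligned columns. The sketch below recounts it for the 60-column fragment shown above only, so its percentage differs from the 82/220 reported for the full alignment. The function name `count_identities` is hypothetical.

```python
def count_identities(aligned_query, aligned_subject):
    """Return (identical columns, total columns) for two aligned strings.

    Both strings must have the same length; '-' marks a gap and never
    counts as an identity.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    identical = sum(
        1 for q, s in zip(aligned_query, aligned_subject)
        if q == s and q != "-"
    )
    return identical, len(aligned_query)


# The aligned fragment from the template search result above.
query = "RDLTMIVAVSSPNLGIGKKNSMPWHIKQEMAYFANVTSSTESSGQLEEGKSKIMNVVIMG"
sbjct = "KSLTLIVALTT-SYGIGRSNSLPWKLKKEISYFKRVTSFVPTFDSFES-----MNVVLMG"

identical, columns = count_identities(query, sbjct)
print(identical, columns, round(100.0 * identical / columns, 1))
```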

Mounting the sequence onto the structure

[Figure: template, target, and mounted sequence. Yellow = adrenergic receptor sequence; blue = adrenergic receptor (PDB 1F88)]

[Figure: modeled structure with gaps, and the corrected model]

Refinement

• Bond angle energy

• Dihedral angle energy

• van der Waals energy

• Electrostatic interactions

• Hydrogen bonds

• Geometrical constraints

• Packing density
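Two of the terms listed above, van der Waals energy and electrostatic interactions, are often written in the Lennard-Jones 12-6 and Coulomb forms. The sketch below uses those common forms as an assumption; the slides do not say which force field or parameters the refinement uses, and the numbers here are illustrative only.

```python
# Approximate Coulomb constant in kcal*Å/(mol*e^2); force fields use
# slightly different values, so treat this one as illustrative.
COULOMB_CONSTANT = 332.06


def lennard_jones(r, epsilon, sigma):
    """Lennard-Jones 12-6 energy for two atoms separated by r (Å)."""
    ratio6 = (sigma / r) ** 6
    return 4.0 * epsilon * (ratio6 * ratio6 - ratio6)


def coulomb(r, q1, q2, dielectric=1.0):
    """Coulomb energy for two point charges (in units of e) at distance r (Å)."""
    return COULOMB_CONSTANT * q1 * q2 / (dielectric * r)


# Illustrative parameters: epsilon in kcal/mol, sigma and r in Å.
epsilon, sigma = 0.1, 3.4
r_min = 2 ** (1 / 6) * sigma  # separation at the energy minimum
print(lennard_jones(sigma, epsilon, sigma))  # zero crossing at r = sigma
print(lennard_jones(r_min, epsilon, sigma))  # close to -epsilon
print(coulomb(3.0, 1.0, -1.0))               # attractive: opposite charges
```

Refinement programs minimize the sum of such terms over all atoms, together with the bonded and geometric terms in the list.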

Evaluating your model

• A model is considered inaccurate if its atomic coordinates are not within 0.5 Å RMSD of the template (control)

Threading-Based Protein Structure Prediction

Threading, Fold recognition, Protein fold assignments

Given:

• a database of protein structures / folds summarizing designs found in nature

• individual protein sequence

Goal:

• Find the structural backbone that best fits the protein sequence. This is the inverse of the protein folding problem.

Concept of Threading

structure prediction through recognizing native-like fold

o Thread (align or place) a query protein sequence onto a template structure in an “optimal” way

o Good alignment gives approximate backbone structure

Query sequence MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE

Template set

Prediction accuracy: fold recognition / alignment

Why is it called threading ?

• threading a specific sequence through all known folds

• for each fold estimate the probability that the sequence can have that fold
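The idea of threading one sequence through every known fold can be sketched with a toy example. Everything here is an assumption made for illustration: each fold is reduced to a string of buried (`B`) and exposed (`E`) positions, residues are split into hydrophobic and polar by a hand-picked set, placements are gapless, and folds are ranked by a raw score rather than by an estimated probability.

```python
HYDROPHOBIC = set("AVLIMFWC")  # toy classification, an assumption


def fit_score(sequence, environment, offset):
    """+1 for each residue whose type matches its environment, -1 otherwise."""
    score = 0
    for i, residue in enumerate(sequence):
        buried = environment[offset + i] == "B"
        score += 1 if (residue in HYDROPHOBIC) == buried else -1
    return score


def thread(sequence, library):
    """Return (fold_name, best_score, best_offset) tuples, best fold first."""
    results = []
    for name, environment in library.items():
        n_offsets = len(environment) - len(sequence) + 1
        if n_offsets < 1:
            continue  # template too short for a gapless placement
        best_score, best_offset = max(
            (fit_score(sequence, environment, k), k) for k in range(n_offsets)
        )
        results.append((name, best_score, best_offset))
    return sorted(results, key=lambda item: item[1], reverse=True)


# Hypothetical fold library: B = buried position, E = exposed position.
library = {
    "fold_A": "EBBEEBBEEB",
    "fold_B": "BBBBBEEEEE",
    "fold_C": "EEE",
}
ranking = thread("KLLKELLK", library)
print(ranking)
```

Real threading programs allow gaps and use much richer scoring, but the loop over the template library is the same.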

Fold Recognition and Threading

• Limited number of folds: 800-1000

• Known number of folds ~ 700

• Sequence-fold agreement ?

Application of Threading

• Predict structure

• Identify distant homologues of protein families

• Predict function of protein with low degree of sequence similarity to other proteins

Structure Families

SCOP: http://scop.mrc-lmb.cam.ac.uk/scop/

(domains, good annotation)

CATH: http://www.biochem.ucl.ac.uk/bsm/cath/

CE: http://cl.sdsc.edu/ce.html

Dali Domain Dictionary: http://columba.ebi.ac.uk:8765/holm/ddd2.cgi

FSSP: http://www2.ebi.ac.uk/dali/fssp/

(chains, updated weekly)

HOMSTRAD: http://www-cryst.bioc.cam.ac.uk/~homstrad/

HSSP: http://swift.embl-heidelberg.de/hssp/


Hierarchy of Templates

Homologous family: evolutionarily related, with significant sequence identity -- 1827 in SCOP

Superfamily: different families whose structural and functional features suggest a common evolutionary origin -- 1073 in SCOP (good tradeoff for accuracy/computing)

Fold: different superfamilies having the same major secondary structures in the same arrangement and with the same topological connections (energetics favoring certain packing arrangements) -- 686 out of 39,893 in SCOP

Class: secondary structure composition.

Template and Fold

Secondary structures and their arrangement

Non-redundant representatives through structure-structure comparison

Core of a Template

Core secondary structures: α-helices and β-strands

Representation of folds: Definition of Template

• Residue type / profile

• Secondary structure type

• Solvent accessibility

• Coordinates for Cα / Cβ (pairwise preferences between two residues)

Threading is alignment squared.

• Environmental preferences of amino acids: 3D-PSSM

– Environment classes (α-helix, β-sheet), solvent accessibility

– Pair potentials: physical interactions

– Substitution matrices

• Possible alignments to the template are evaluated.

• The evaluation of each position depends on the rest of the alignment.

Scoring Function

…YKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEW…

The scoring function describes how well the sequence fits the template:

How well a sequence residue aligns to a residue on the structure: E_m (mutation term)

How well a residue fits a structural environment: E_s (singleton term)

How preferable it is to put two particular residues nearby: E_p (pairwise term)

Alignment gap penalty: E_g

Total energy: E_m + E_s + E_p + E_g
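The four terms of the scoring function can be added up for a toy alignment as follows. This is a sketch under stated assumptions: the function name `threading_energy`, the template dictionary layout, and all energy values are hypothetical stand-ins (a 0/1 mismatch score instead of a substitution matrix, a hydrophobic/buried match instead of real environment and pair potentials). Lower energy is taken to mean a better fit.

```python
HYDROPHOBIC = set("AVLIMFWC")  # toy classification, an assumption


def threading_energy(sequence, template, alignment, n_gaps, gap_penalty=1.0):
    """Return the four energy terms and their total for one alignment.

    alignment is a list of (sequence_index, template_index) pairs.
    """
    placed = {t_idx: sequence[s_idx] for s_idx, t_idx in alignment}

    # E_m: mutation term, 0 for an identical residue and 1 otherwise.
    e_m = sum(0.0 if res == template["residues"][t] else 1.0
              for t, res in placed.items())

    # E_s: singleton term, rewards hydrophobic-in-buried and polar-in-exposed.
    e_s = sum(
        -1.0 if (res in HYDROPHOBIC) == (template["environment"][t] == "B")
        else 1.0
        for t, res in placed.items()
    )

    # E_p: pairwise term, rewards two hydrophobic residues placed in contact.
    e_p = sum(
        -1.0 for i, j in template["contacts"]
        if placed.get(i) in HYDROPHOBIC and placed.get(j) in HYDROPHOBIC
    )

    # E_g: gap penalty, proportional to the number of gaps in the alignment.
    e_g = gap_penalty * n_gaps

    return {"E_m": e_m, "E_s": e_s, "E_p": e_p, "E_g": e_g,
            "total": e_m + e_s + e_p + e_g}


# Hypothetical three-residue template: B = buried, E = exposed.
template = {"residues": "LRV", "environment": "BEB", "contacts": [(0, 2)]}
terms = threading_energy("LKV", template, [(0, 0), (1, 1), (2, 2)], n_gaps=0)
print(terms)
```

Because E_p depends on which residues end up at both ends of a contact, the score of one position depends on the rest of the alignment, which is what makes threading harder than sequence alignment.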

What We Learned…

• Why threading?

• Evolutionary foundation of threading

• Template library and its generation

• Concept of scoring function

CASP (Critical Assessment of Structure Predictions)

• the biennial competition in protein structure prediction.

http://predictioncenter.llnl.gov/casp5/Casp5.html

CASP (Critical Assessment of Structure Predictions)

• Targets for:

– comparative modelling (15)

– fold recognition (22)

– ab initio modelling (15)

CASP Experiment

• Experimentalists are asked to provide information about structures that are expected to be solved soon

• Predictors retrieve the sequences from the prediction center (predictioncenter.llnl.gov)

• Deposit predictions throughout the season

• Meeting held to assess results

Prediction Categories

• Comparative Modeling – modeling by homology

• Fold Recognition

– Advanced Sequence Comparison Methods

– Threading

• New Fold Methods / “ab initio”

• Categories are separated by distance from any known structure

Expected Performance

[Figure: predicted model compared with the X-ray structure of target t0100]

PROSPECT (threading) prediction in CASP4: 12 out of 19 folds recognized

Conclusions

• When a suitable template structure exists in PDB, using homology modeling on target sequence is best for predicting the structure

• Fold Recognition servers can help find a template when conventional sequence analysis methods fail

• Combining elements from several sources may allow you to construct reasonably accurate models
