Lecture 7. Computing Protein Structures
![Page 1: Lecture 7. Computing Protein Structures](https://reader035.fdocuments.net/reader035/viewer/2022062322/56814644550346895db34ef1/html5/thumbnails/1.jpg)
Lecture 7. Computing Protein Structures
• Current attempts:
  • Threading: RAPTOR
  • Consensus: ACE
  • Fragment assembly
Can we compute the protein structures eventually? Your projects.
CS882, Fall 2006
![Page 2: Lecture 7. Computing Protein Structures](https://reader035.fdocuments.net/reader035/viewer/2022062322/56814644550346895db34ef1/html5/thumbnails/2.jpg)
Homologous proteins have similar structures and functions
• Being homologous means that they have evolved from a common ancestral gene. Hence, at least in the past, they had the same structure and function.
• Caution: old genes can be recruited for new functions. Example: a structural protein in the eye lens is homologous to an ancient glycolytic enzyme.
• Homology search is done with BLAST, or with PatternHunter for more sensitivity. BLAST works when sequence identity is above 30%.
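To make the 30% figure concrete, here is a minimal sketch of how percent identity can be computed from a pairwise alignment. The function name, the `-` gap character, and the choice of denominator (columns where neither sequence has a gap) are assumptions for illustration; tools differ in how they define the denominator.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip gap columns (an assumed convention)
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Hypothetical toy alignment: 9 ungapped columns, 8 of them identical.
example_identity = percent_identity("MTYKLILNGK", "MTYKL-LNGR")
```

Under this convention the toy pair scores 8 out of 9 columns, far above the 30% level mentioned on the slide.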
![Page 3: Lecture 7. Computing Protein Structures](https://reader035.fdocuments.net/reader035/viewer/2022062322/56814644550346895db34ef1/html5/thumbnails/3.jpg)
Conserving core regions
Homologous proteins usually have conserved core regions.
When we model one protein after a similar protein with known structure, the main problem becomes modeling loop regions.
Loop modeling can also rely on databases to some degree.
Side chains: only a few side-chain conformations occur frequently. They are called rotamers, and there is a database of them.
![Page 4: Lecture 7. Computing Protein Structures](https://reader035.fdocuments.net/reader035/viewer/2022062322/56814644550346895db34ef1/html5/thumbnails/4.jpg)
Primary, secondary, and tertiary
There are many secondary structure prediction programs. However, secondary structure predicted on its own, without considering tertiary structure, will never be fully correct.
Most tertiary structure prediction programs today depend on good secondary structure predictions. This is also not good: you cannot get the right tertiary structure from wrong starting information.
The two must be predicted together.
![Page 5: Lecture 7. Computing Protein Structures](https://reader035.fdocuments.net/reader035/viewer/2022062322/56814644550346895db34ef1/html5/thumbnails/5.jpg)
There are not too many candidates!
• There are only about 1000 topologically different domain structures. There is no reason whatsoever that we cannot compute their structures accurately.
• Ab initio method: we have heard about it.
• Another promising method is threading (separate lecture).
• After threading, an important step is “refinement”, perhaps by fragment assembly. This will be a separate topic (Xin Gao).
• Folding membrane proteins is quite a different topic (Richard Jang).
• Now we go to threading.
![Page 6: Lecture 7. Computing Protein Structures](https://reader035.fdocuments.net/reader035/viewer/2022062322/56814644550346895db34ef1/html5/thumbnails/6.jpg)
Protein Threading
• Make a structure prediction by finding an optimal placement (threading) of a protein sequence onto each known structure (structural template).
• “Placement” quality is measured by a statistics-based energy function.
• The best overall “placement” among all templates may give a structure prediction.
Figure labels: target sequence MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE; template library
![Page 7: Lecture 7. Computing Protein Structures](https://reader035.fdocuments.net/reader035/viewer/2022062322/56814644550346895db34ef1/html5/thumbnails/7.jpg)
Threading Example
![Page 8: Lecture 7. Computing Protein Structures](https://reader035.fdocuments.net/reader035/viewer/2022062322/56814644550346895db34ef1/html5/thumbnails/8.jpg)
Introduction to Linear Programming
• Optimize (maximize or minimize) a linear objective function, e.g. 2x+3y+4z.
• The variables satisfy some linear constraints, e.g.
  1. x+y-z >= 1
  2. 2x+y+3z = 3
• Integer program (IP) = linear program (LP) + integral variables.
• An LP can be solved in polynomial time by the interior point method. The simplex method also runs fast in practice. We used an IBM package.
• A polynomial-time algorithm for IP is unlikely: the problem is NP-hard.
• An IP can be relaxed to an LP by solving the non-integral version, followed by branch-and-bound or branch-and-cut (which may cost exponential time).
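As a rough illustration of the slide's toy program, the sketch below enumerates integer points and keeps the best feasible one. It assumes that the objective is maximized and that x, y, z are non-negative (neither is stated on the slide); under non-negativity, constraint 2 forces every variable to be at most 3, which justifies the search bound. Exhaustive enumeration is not how LP or IP solvers work; it only makes the definition concrete.

```python
from itertools import product


def solve_small_ip(bound: int = 3):
    """Enumerate integer points in [0, bound]^3 and keep the best feasible one."""
    best_value, best_point = None, None
    for x, y, z in product(range(bound + 1), repeat=3):
        if x + y - z < 1:            # constraint 1: x + y - z >= 1
            continue
        if 2 * x + y + 3 * z != 3:   # constraint 2: 2x + y + 3z = 3
            continue
        value = 2 * x + 3 * y + 4 * z  # objective from the slide
        if best_value is None or value > best_value:
            best_value, best_point = value, (x, y, z)
    return best_value, best_point


# Only (0, 3, 0) and (1, 1, 0) are feasible; the first gives the larger objective.
example_solution = solve_small_ip()  # (9, (0, 3, 0))
```

Substituting y = 3 - 2x - 3z into the objective gives 9 - 4x - 5z, so the LP relaxation has the same optimum, x = z = 0 and y = 3: for this toy instance the relaxed solution is already integral.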
![Page 9: Lecture 7. Computing Protein Structures](https://reader035.fdocuments.net/reader035/viewer/2022062322/56814644550346895db34ef1/html5/thumbnails/9.jpg)
Why Integer Programming?
• Treat pairwise potentials rigorously
  • critical for fold-level targets
• Existing exact algorithms for pairwise potentials have
  • high memory requirements, or
  • expensive computational time
• Exploit correlations between various kinds of item scores in the energy function
• 99% of real data generate integral solutions directly; no branch-and-bound is needed.
![Page 10: Lecture 7. Computing Protein Structures](https://reader035.fdocuments.net/reader035/viewer/2022062322/56814644550346895db34ef1/html5/thumbnails/10.jpg)
Different approaches
• Approximation algorithms
  • Interaction-frozen algorithm (A. Godzik et al.)
  • Monte Carlo sampling (T. Madej et al.)
  • Double dynamic programming (D. Jones et al.)
  • Recursive dynamic programming (R. Thiele et al.)
• Exact algorithms
  • Branch-and-bound (R.H. Lathrop et al.)
    • Exploits the relationship among various scoring parameters; fast self-threading
  • Divide-and-conquer (Y. Xu et al.)
    • Exploits the topological structure of template contact graphs
![Page 11: Lecture 7. Computing Protein Structures](https://reader035.fdocuments.net/reader035/viewer/2022062322/56814644550346895db34ef1/html5/thumbnails/11.jpg)
Formulating Protein Threading by LP
• Protein threading needs:
  1. Construction of a template library
  2. Design of an energy function
  3. Sequence-structure alignment
  4. Template selection and model construction
![Page 12: Lecture 7. Computing Protein Structures](https://reader035.fdocuments.net/reader035/viewer/2022062322/56814644550346895db34ef1/html5/thumbnails/12.jpg)
Threading Energy Function
E = Ep + Es + Em + Eg + Ess
Minimize E to find a sequence-structure alignment.
• Es (fitness score): how well a residue fits a structural environment
• Ep (pairwise potential): how preferable it is to put two particular residues nearby
• Em (mutation score): sequence similarity between query and template proteins
• Eg (gap score): alignment gap penalty
• Ess: consistency with the secondary structures
![Page 13: Lecture 7. Computing Protein Structures](https://reader035.fdocuments.net/reader035/viewer/2022062322/56814644550346895db34ef1/html5/thumbnails/13.jpg)
Contact Graph
1. Each residue is a vertex.
2. There is one edge between two residues if their spatial distance is within a given cutoff.
3. Cores are the most conserved segments in the template: alpha-helix, beta-sheet.
Figure label: template
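The first two rules can be sketched in a few lines. This is only an illustration: it assumes one representative 3D coordinate per residue and a cutoff of 7.0 Å, both of which are placeholders, since the slide says only "a given cutoff". The function name and the toy coordinates are hypothetical.

```python
from itertools import combinations
from math import dist


def contact_graph(coords, cutoff=7.0):
    """Return the set of edges (i, j), i < j, for residues within `cutoff` of each other."""
    edges = set()
    for i, j in combinations(range(len(coords)), 2):
        if dist(coords[i], coords[j]) <= cutoff:
            edges.add((i, j))
    return edges


# Hypothetical coordinates for four residues (in angstroms).
example_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (3.8, 5.0, 0.0)]
example_edges = contact_graph(example_coords)
# Residues 0 and 2 are 7.6 apart, so every pair except (0, 2) is in contact.
```

Grouping residues into cores (rule 3) would then give the simplified contact graph, with one vertex per core.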
![Page 14: Lecture 7. Computing Protein Structures](https://reader035.fdocuments.net/reader035/viewer/2022062322/56814644550346895db34ef1/html5/thumbnails/14.jpg)
Simplified Contact Graph
![Page 15: Lecture 7. Computing Protein Structures](https://reader035.fdocuments.net/reader035/viewer/2022062322/56814644550346895db34ef1/html5/thumbnails/15.jpg)
Contact Graph and Alignment Diagram
![Page 16: Lecture 7. Computing Protein Structures](https://reader035.fdocuments.net/reader035/viewer/2022062322/56814644550346895db34ef1/html5/thumbnails/16.jpg)
Contact Graph and Alignment Diagram
![Page 17: Lecture 7. Computing Protein Structures](https://reader035.fdocuments.net/reader035/viewer/2022062322/56814644550346895db34ef1/html5/thumbnails/17.jpg)
Variables
• x(i,l) denotes that core i is aligned to sequence position l.
• y(i,l,j,k) denotes that core i is aligned to position l and core j is aligned to position k at the same time.
![Page 18: Lecture 7. Computing Protein Structures](https://reader035.fdocuments.net/reader035/viewer/2022062322/56814644550346895db34ef1/html5/thumbnails/18.jpg)
Formulation 1
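$$
\begin{aligned}
\text{Minimize}\quad & E = \sum_{i,l} a_{i,l}\, x_{i,l} + \sum_{(i,l),(j,k)} b_{(i,l)(j,k)}\, y_{(i,l)(j,k)} \\
\text{s.t.}\quad & x_{i,l} + x_{i+1,k} \le 1 \\
& y_{(i,l)(j,k)} = x_{i,l}\, x_{j,k} \\
& \sum_{l \in D[i]} x_{i,l} = 1 \\
& x_{i,l},\ y_{(i,l)(j,k)} \in \{0,1\}
\end{aligned}
$$

• The objective encodes the scoring system: the coefficients a carry Es, Ess, Em; the coefficients b carry Eg, Ep.
• The constraints encode the interaction structure. The first, applied to positions l and k at which cores i and i+1 would cross, makes sure there are no crossings. The second is quadratic but can be converted to linear: for binary variables, a=bc is equivalent to a≤b, a≤c, a≥b+c-1.
• D[i] is the set of sequence positions that core i may be aligned to.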
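The linearization claim is easy to check exhaustively, since binary variables leave only four cases. The sketch below (function names are illustrative) confirms that for each binary b and c, the only binary a satisfying the three linear inequalities is a = bc.

```python
from itertools import product


def satisfies_linear(a: int, b: int, c: int) -> bool:
    """The three linear inequalities that replace the product constraint a = b*c."""
    return a <= b and a <= c and a >= b + c - 1


def linearization_is_exact() -> bool:
    """Check all binary (b, c): the feasible a values must be exactly [b*c]."""
    for b, c in product((0, 1), repeat=2):
        feasible = [a for a in (0, 1) if satisfies_linear(a, b, c)]
        if feasible != [b * c]:
            return False
    return True
```

Note that the equivalence relies on the variables being binary; for fractional values in the LP relaxation the inequalities only bound a and no longer force a = bc.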
![Page 19: Lecture 7. Computing Protein Structures](https://reader035.fdocuments.net/reader035/viewer/2022062322/56814644550346895db34ef1/html5/thumbnails/19.jpg)
Formulation used in RAPTOR
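$$
\begin{aligned}
\text{Minimize}\quad & E = \sum_{i,l} a_{i,l}\, x_{i,l} + \sum_{(i,l),(j,k)} b_{(i,l)(j,k)}\, y_{(i,l)(j,k)} \\
\text{s.t.}\quad & x_{i,l} = \sum_{k \in R[i,j,l]} y_{(i,l)(j,k)}, \quad l \in D[i] \\
& x_{j,k} = \sum_{l \in R[j,k,i]} y_{(i,l)(j,k)}, \quad k \in D[j] \\
& \sum_{l \in D[i]} x_{i,l} = 1 \\
& x_{i,l},\ y_{(i,l)(j,k)} \in \{0,1\}
\end{aligned}
$$

• The objective encodes the scoring system: the coefficients a carry Es, Ess, Em; the coefficients b carry Eg, Ep.
• The constraints encode the interaction structure. D[i] is the set of positions core i may be aligned to, and R[i,j,l] is the set of positions core j may be aligned to when core i is at position l.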
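To see what the objective and the no-crossing requirement mean on a tiny instance, the following sketch minimizes E by exhaustive enumeration over order-preserving core placements. This is not RAPTOR's method (RAPTOR solves the program by LP relaxation plus branch-and-bound); the function name, the score tables `example_a` and `example_b`, and the core lengths are all made up for illustration.

```python
from itertools import product


def thread_brute_force(core_lengths, seq_len, a, b, contacts):
    """Minimise E = sum_i a[i][l_i] + sum_(i,j) b(i, l_i, j, l_j) by enumeration.

    core_lengths[i]: length of core i
    a[i][l]: singleton score of core i starting at sequence position l
    b(i, l, j, k): pairwise score of cores i and j starting at l and k
    contacts: pairs (i, j), i < j, of cores that interact in the template
    """
    n = len(core_lengths)
    domains = [range(seq_len - core_lengths[i] + 1) for i in range(n)]
    best = None
    for starts in product(*domains):
        # No crossing or overlap: core i+1 must start after core i ends.
        if any(starts[i] + core_lengths[i] > starts[i + 1] for i in range(n - 1)):
            continue
        energy = sum(a[i][starts[i]] for i in range(n))
        energy += sum(b(i, starts[i], j, starts[j]) for i, j in contacts)
        if best is None or energy < best[0]:
            best = (energy, starts)
    return best


# Hypothetical instance: two cores of length 2 on a sequence of length 6.
example_a = [[0, 2, 4, 6, 8], [5, 3, 2, 1, 0]]


def example_b(i, l, j, k):
    """Made-up pairwise score that prefers the two cores to start 3 positions apart."""
    return 2 * abs((k - l) - 3)


example_best = thread_brute_force([2, 2], 6, example_a, example_b, [(0, 1)])
# example_best == (1, (0, 3))
```

Enumeration grows exponentially with the number of cores, which is why the slide's integer program and its LP relaxation matter in practice.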
![Page 20: Lecture 7. Computing Protein Structures](https://reader035.fdocuments.net/reader035/viewer/2022062322/56814644550346895db34ef1/html5/thumbnails/20.jpg)
Solving the Problem Practically
1. More than 99% of threading instances can be solved directly by linear programming; the rest can be solved by branch-and-bound with only a few branch nodes
2. Less memory consumption
3. Less computational time
4. Easy to extend to incorporate other constraints
![Page 21: Lecture 7. Computing Protein Structures](https://reader035.fdocuments.net/reader035/viewer/2022062322/56814644550346895db34ef1/html5/thumbnails/21.jpg)
CPU Time for CAFASP3 targets
![Page 22: Lecture 7. Computing Protein Structures](https://reader035.fdocuments.net/reader035/viewer/2022062322/56814644550346895db34ef1/html5/thumbnails/22.jpg)
Fold Recognition
• Support Vector Machines (SVM) approach
  • Features are extracted from the alignments.
  • A threading pair is treated as a positive pattern only if the two proteins are similar at the fold level or closer.
  • 60,000 threading pairs are used to train the SVM model.
  • The SVM approach recognizes 5% more targets than the traditional z-score.
![Page 23: Lecture 7. Computing Protein Structures](https://reader035.fdocuments.net/reader035/viewer/2022062322/56814644550346895db34ef1/html5/thumbnails/23.jpg)
Part II. Experiments
| Test | Evaluator | Data set | Blindness | Public |
|---|---|---|---|---|
| Lindahl et al. benchmark | us | large | no | no |
| LiveBench | third-party | small | no | yes |
| CASP/CAFASP | third-party | small | yes | yes |
![Page 24: Lecture 7. Computing Protein Structures](https://reader035.fdocuments.net/reader035/viewer/2022062322/56814644550346895db34ef1/html5/thumbnails/24.jpg)
Target Category
CASP5 categories, from easy to hard: CM, CM/FR, FR(H), FR(A), NF/FR, NF

| CAFASP3 category | HM easy (family level) | HM hard (superfamily level) | FR (fold level) |
|---|---|---|---|
| # targets | 20 | 12 | 30 |

Prediction difficulty increases from left (easy) to right (hard).
CM: Comparative Modelling, HM: Homology Modelling, FR: Fold Recognition, NF: New Fold
![Page 25: Lecture 7. Computing Protein Structures](https://reader035.fdocuments.net/reader035/viewer/2022062322/56814644550346895db34ef1/html5/thumbnails/25.jpg)
Lindahl Benchmark Test
| Method | Family Top1 | Family Top5 | Superfamily Top1 | Superfamily Top5 | Fold Top1 | Fold Top5 |
|---|---|---|---|---|---|---|
| RAPTOR | 84.8 | 87.1 | 47.0 | 60.0 | 31.3 | 54.2 |
| FUGUE | 82.2 | 85.8 | 41.9 | 53.2 | 12.5 | 26.8 |
| PSI-BLAST | 71.2 | 72.3 | 27.4 | 27.9 | 4.0 | 4.7 |
| HMMER-PSIBLAST | 67.7 | 73.5 | 20.7 | 31.3 | 4.4 | 14.6 |
| SAMT98-PSIBLAST | 70.1 | 75.4 | 28.3 | 38.9 | 3.4 | 18.7 |
| BLASTLINK | 74.6 | 78.9 | 29.3 | 40.6 | 6.9 | 16.5 |
| SSEARCH | 68.6 | 75.7 | 20.7 | 32.5 | 5.6 | 15.6 |
| THREADER | 49.2 | 58.9 | 10.8 | 24.7 | 14.6 | 37.7 |

976 × 975 threading pairs were tested; the results for the other servers are taken from the paper by Shi et al.
![Page 26: Lecture 7. Computing Protein Structures](https://reader035.fdocuments.net/reader035/viewer/2022062322/56814644550346895db34ef1/html5/thumbnails/26.jpg)
LiveBench Test
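LiveBench 6

| Month | Rank |
|---|---|
| August | 3 |
| September | 4 |
| October | 7 |
| November | 14 |
| December | 9 |
| Total | 6 |
| Easy | 6 |
| Hard | 5 |

LiveBench 7

| Month | Rank |
|---|---|
| Feb | 10 |
| March | 1 |
| April | 3 |
| May | 2 |
| June | 6 |
| Total | 4 |
| Easy | 7 |
| Hard | 3 |

(http://bioinfo.pl/LiveBench)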
![Page 27: Lecture 7. Computing Protein Structures](https://reader035.fdocuments.net/reader035/viewer/2022062322/56814644550346895db34ef1/html5/thumbnails/27.jpg)
CASP5/CAFASP3
• 62 targets
• Time allowed for each target:
  • Individual servers: 48 hours
  • Meta servers: 48 hours
• Predictors: computer programs, no manual intervention (CAFASP3)
• Evaluated by a computer program
• RAPTOR was voted by CASP5 attendees as the most novel approach, at http://forcasp.org
• CAFASP3: The Third Critical Assessment of Fully Automated Structure Prediction
![Page 28: Lecture 7. Computing Protein Structures](https://reader035.fdocuments.net/reader035/viewer/2022062322/56814644550346895db34ef1/html5/thumbnails/28.jpg)
CAFASP3 Evaluation Criteria
• Model
  • Each server can submit 10 models for each target, but only the first submission is considered.
• MaxSub (evaluation program)
  • Superimpose the predicted structure on the experimental structure.
  • Calculate the length of the maximum superimposable subsegment within 5Å RMSD.
  • A prediction is regarded as correct only if the length is above a given value.
![Page 29: Lecture 7. Computing Protein Structures](https://reader035.fdocuments.net/reader035/viewer/2022062322/56814644550346895db34ef1/html5/thumbnails/29.jpg)
CAFASP3 Evaluation Criteria
• Sensitivity (N-1 rule)
  • One miss is allowed for each server, i.e., the first models of N-1 out of N targets are ranked.
• Specificity
  • Rank the first models of all targets according to their zScores.
  • S(M): number of correct models ranked before the M-th false positive
  • Specificity is the average of S(1), S(2), …, S(5).
![Page 30: Lecture 7. Computing Protein Structures](https://reader035.fdocuments.net/reader035/viewer/2022062322/56814644550346895db34ef1/html5/thumbnails/30.jpg)
Specificity Example
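| Predicted model | zScore | Correct? (by MaxSub) | Note |
|---|---|---|---|
| T1 | 9.1 | Yes | |
| T2 | 8.4 | Yes | |
| T3 | 7.8 | No | First false positive |
| T4 | 7.6 | Yes | |
| T5 | 7.5 | No | Second false positive |
| T6 | 7.4 | Yes | |
| … | … | … | |
| T30 | … | … | |

S(1)=2
S(2)=3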
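The counting in this example can be sketched as follows. The function names are illustrative, and the input is just the six rows shown above; the remaining rows are elided on the slide, so no average over S(1) to S(5) is computed for the example.

```python
def correct_before_false_positive(ranked_correct, m):
    """S(M): number of correct models ranked before the M-th false positive.

    `ranked_correct` lists True/False per model, already sorted by decreasing zScore.
    If there are fewer than M false positives, every correct model is counted.
    """
    correct = 0
    false_positives = 0
    for is_correct in ranked_correct:
        if is_correct:
            correct += 1
        else:
            false_positives += 1
            if false_positives == m:
                break
    return correct


def specificity(ranked_correct, max_m=5):
    """Average of S(1), ..., S(max_m), as in the CAFASP3 criterion."""
    total = sum(correct_before_false_positive(ranked_correct, m)
                for m in range(1, max_m + 1))
    return total / max_m


# Rows T1..T6 of the example: (zScore, correct?)
example_rows = [(9.1, True), (8.4, True), (7.8, False),
                (7.6, True), (7.5, False), (7.4, True)]
example_ranked = [ok for _, ok in sorted(example_rows, key=lambda r: r[0], reverse=True)]
example_s1 = correct_before_false_positive(example_ranked, 1)  # 2
example_s2 = correct_before_false_positive(example_ranked, 2)  # 3
```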
![Page 31: Lecture 7. Computing Protein Structures](https://reader035.fdocuments.net/reader035/viewer/2022062322/56814644550346895db34ef1/html5/thumbnails/31.jpg)
Sensitivity on FR targets (1)
| Servers | Sum MaxSub score | # correct |
|---|---|---|
| 3ds5 robetta | 5.17-5.25 | 15-17 |
| pmod 3ds3 pmode3 | 4.21-4.36 | 13-14 |
| RAPTOR | 3.98 | 13 |
| shgu | 3.93 | 13 |
| 3dsn orfeus | 3.64-3.90 | 12-13 |
| pcons3 | 3.75 | 12 |
| fugu3 orf_c | 3.38-3.67 | 11-12 |
| … | … | … |
| pdbblast | 0.00 | 0 |
| … | … | … |
| blast | 0.00 | 0 |

30 FR targets, 54 servers
(http://ww.cs.bgu.ac.il/~dfischer/CAFASP3, released on Dec., 2002.)
![Page 32: Lecture 7. Computing Protein Structures](https://reader035.fdocuments.net/reader035/viewer/2022062322/56814644550346895db34ef1/html5/thumbnails/32.jpg)
Sensitivity on FR targets (2)
| | CM/FR | FR(H) | FR(A) | NF/FR | NF |
|---|---|---|---|---|---|
| # Correct | 6 | 4 | 2 | 1 | 0 |
| # Targets | 7 | 7 | 6 | 5 | 5 |

1. RAPTOR is weak at recognizing FR(A) targets (needs improvement).
2. RAPTOR cannot deal with NF targets at all (normal).
![Page 33: Lecture 7. Computing Protein Structures](https://reader035.fdocuments.net/reader035/viewer/2022062322/56814644550346895db34ef1/html5/thumbnails/33.jpg)
Sensitivity on Hard HM targets
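| Rank | Servers | Score | # Correct |
|---|---|---|---|
| 1 | 3ds5 | 5.13 | 12 |
| 2 | 3ds3 shgu | 4.93-5.02 | 12 |
| 4 | pmod pmod3 | 4.60-4.68 | 12 |
| 6 | orfeus orfb 3dpsm raptor fugu3 pco3 robetta | 4.33-4.43 | 12 |
| 8 | samt02 | 4.18 | 12 |
| … | … | … | … |
| 11 | pdbblast | 4.28 | 12 |
| … | … | … | … |
| | blast | 0.32 | 2 |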
![Page 34: Lecture 7. Computing Protein Structures](https://reader035.fdocuments.net/reader035/viewer/2022062322/56814644550346895db34ef1/html5/thumbnails/34.jpg)
Specificity of Servers
| Rank | Servers | Specificity |
|---|---|---|
| 1 | 3ds5 | 24.8 |
| 2 | pmodel 3dsn 3ds3 pmodel3 | 22.0-22.6 |
| 6 | pcons3 shgu | 21.4-21.6 |
| 8 | inbgu fugu3 | 19.0-19.8 |
| 10 | ffas03 orfeus fugsa | 18.2-18.4 |
| 13 | raptor 3dpsm orf_c | 17.4-17.8 |
| … | … | … |
| | pdbblast | 13.0 |
| | blast | 4.0 |

Out of 33 targets
![Page 35: Lecture 7. Computing Protein Structures](https://reader035.fdocuments.net/reader035/viewer/2022062322/56814644550346895db34ef1/html5/thumbnails/35.jpg)
CAFASP3 Example
• Target ID: T0136_1
• Target size: 144
• Superimposable size within 5Å: 118
• RMSD: 1.9Å
• Red: experimental structure
• Blue/green: RAPTOR model
![Page 36: Lecture 7. Computing Protein Structures](https://reader035.fdocuments.net/reader035/viewer/2022062322/56814644550346895db34ef1/html5/thumbnails/36.jpg)
CASP6, T0199-2, ACE buffalo rank: 9th
From RAPTOR rank 1 model. TM=0.4183 MaxSub=0.2857. Good parts: 116-134, 286-332
Left: predicted structure. Right: experimental structure
![Page 37: Lecture 7. Computing Protein Structures](https://reader035.fdocuments.net/reader035/viewer/2022062322/56814644550346895db34ef1/html5/thumbnails/37.jpg)
CASP6, T0203, ACE buffalo rank: 1st
From RAPTOR 2nd model. TM=0.6041, MaxSub=0.3485. Good parts: 19-57, 89-94, 139-178, 224-239, 312-372
Predicted Experimental
RAPTOR first model ranks 5th
![Page 38: Lecture 7. Computing Protein Structures](https://reader035.fdocuments.net/reader035/viewer/2022062322/56814644550346895db34ef1/html5/thumbnails/38.jpg)
CASP6, T0262-2, ACE buffalo rank: 4th
From Fugue3 6th model. TM=0.4306, MaxSub=0.3459. Good parts: 162-203
Predicted Experimental
Fugue’s top model ranks low
![Page 39: Lecture 7. Computing Protein Structures](https://reader035.fdocuments.net/reader035/viewer/2022062322/56814644550346895db34ef1/html5/thumbnails/39.jpg)
CASP6, T0242, NF, ACE buffalo rank: 1
From RAPTOR rank 5 model. TM score=0.2784, MaxSub score=0.1645
However, RAPTOR top model ranks 44th! Trivial error?
Predicted Experimental
![Page 40: Lecture 7. Computing Protein Structures](https://reader035.fdocuments.net/reader035/viewer/2022062322/56814644550346895db34ef1/html5/thumbnails/40.jpg)
CASP6, T0238, NF, ACE buffalo rank: 1st
From RAPTOR 8th model. TM=0.2748, MaxSub=0.1633. Good part: 188-237. High TM score, low MaxSub
RAPTOR top model ranks 4th
Predicted Experimental
![Page 41: Lecture 7. Computing Protein Structures](https://reader035.fdocuments.net/reader035/viewer/2022062322/56814644550346895db34ef1/html5/thumbnails/41.jpg)
About RAPTOR
• Jinbo Xu’s Ph.D. thesis work.
• The RAPTOR system has benefited significantly from PROSPECT (Ying Xu, Dong Xu, et al.).
• Currently distributed by BSI.
• References:
  • J. Xu, M. Li, D. Kim, Y. Xu, Journal of Bioinformatics and Computational Biology, 1:1 (2003), 95-118.
  • J. Xu, M. Li, PROTEINS: Structure, Function, and Genetics, CASP5 special issue.