
Modeling Three-Dimensional Protein Structures for Amino Acid Sequences of the CASP3 Experiment Using Sequence-Derived Predictions

Daniel Fischer*

Department of Math and Computer Science, Ben Gurion University, Beer-Sheva, Israel

ABSTRACT Homology or comparative modeling is aimed at modeling the three-dimensional structure of a target sequence of unknown structure using the framework of an already known fold. Traditionally, homology modeling has been applied to targets with clear sequence similarity to proteins of known structure. Because methods to identify increasingly distant relationships have been developed, homology models can now be built for a wider range of targets. The first challenge in homology modeling is to obtain an initial, accurate sequence-structure alignment with the most compatible fold. In CASP3, the abilities of fold-recognition methods to fulfill this challenge were evaluated with a number of target sequences of unknown structure. Sequence-structure alignments for 33 of the CASP3 targets using the fold-recognition method SDP were submitted (Fischer and Eisenberg, Protein Sci 1996;5:947–955). After the three-dimensional structures of the sequences were subsequently released, the quality of the predictions was evaluated. Here I describe three of the predictions for targets with little sequence similarity to proteins of known structure that were judged by the assessors to be of higher quality. For two of these predictions, the sequence-structure alignment corresponded perfectly to the structural alignment (zero average shift), and for the third, the average shift was 0.1. This alignment accuracy provides an ideal starting point for homology modeling. Proteins Suppl 1999;3:61–65. © 1999 Wiley-Liss, Inc.

Key words: homology modeling; protein fold assignment; sequence-structure alignment; threading; protein structure prediction; critical assessment of protein structure prediction

INTRODUCTION

In comparative modeling, one seeks to build a computational molecular model of a protein of unknown structure using a related, experimentally determined structure. Thus, the first step is to align the target protein with the best available known structure. In the two previous CASP experiments it was observed that if the initial alignment is in error, then the comparative model is guaranteed to be wrong. The initial alignment is generally computed by using standard sequence-comparison methods. Because the quality of the alignment drops abruptly for lower sequence similarities, homology modeling has traditionally been applied to targets with sequence similarities above the so-called twilight zone. In CASP3, fold-recognition methods were tested for their abilities to produce accurate sequence-structure alignments of target sequences with their proper folds. Predictors were asked to submit models for all targets, including those with little or no sequence similarity to proteins of known structure. Because CASP3 predictions were filed before the experimental structures were released, it was possible to test the methods objectively. Using the previously described SDP fold-recognition method,¹ I submitted sequence-structure alignments for 33 of the CASP3 targets. In this article I briefly describe the SDP method and focus on three of the predictions judged by the assessors to be of the highest quality. A discussion of what went wrong and what was learned is included in the last section of this article.

MATERIALS AND METHODS

For CASP3, I based my predictions on an extension of the fold-recognition method SDP¹ (implemented in the computer server frsvr at http://www.mbi.ucla.edu/people/frsvr/frsvr.html). This method computes sequence-structure compatibility based on sequence-derived predictions and uses the so-called “global-local” dynamic programming algorithm for alignment.¹,² The sequence-structure compatibility is composed of two terms. The first reflects the similarity of the target sequence to the assigned fold, using either a standard 20 × 20 sequence comparison matrix, a sequence profile built from multiply aligned sequences, or other sequence-structure compatibility functions.³ The second term measures the extent of agreement between the secondary structure predicted from the sequence⁴,⁵ and the observed secondary structure of the fold. The contribution of this term is weighted by the per-position reliability given in the secondary structure prediction. Two variants of SDP were used. The first considered the target sequence alone, and compatibility was measured by using the sequence comparison matrix of Gonnet et al.⁶ The second variant used a profile built from a multiple alignment of sequences homologous to the target (SDPMA, for SDP with Multiple Alignment), using PROFILEMAKE.⁷ Homologous sequences were compiled automatically by the frsvr server, by applying a simple BLAST⁸ search on the SWISSPROT⁹ database.

Each alignment produced a “raw” score representing the sequence-structure compatibility. Z-scores were computed from the distribution of the raw scores of the folds in the library. When the highest scoring fold had a z-score above a threshold value, an automatic fold assignment was made on the basis of the z-score alone. Otherwise, to select one of the top ranking folds, human analysis of the automated results and further tests were applied. The latter included a number of consistency checks and trials with sequences homologous to the target.¹⁰

The fold library was built by mid-1998 and contained about 2,000 different folds, representing a minimally redundant set of structures and domains taken from the Protein Data Bank (PDB¹¹). No two entries in the library shared >50% sequence identity.

*Correspondence to: Daniel Fischer, Department of Math and Computer Science, Ben Gurion University, Beer-Sheva 84015, Israel. E-mail: [email protected]

Received 2 February 1999; Accepted 3 May 1999

PROTEINS: Structure, Function, and Genetics Suppl 3:61–65 (1999)

© 1999 WILEY-LISS, INC.
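The two-term compatibility function described in Materials and Methods can be sketched for a single aligned position as follows. This is a minimal illustration under stated assumptions, not the SDP implementation: the names `position_score` and `toy_substitution`, the `ss_weight` parameter, the +1/-1 agreement values, and the scaling of the reliability to the range 0 to 1 are all hypothetical choices made for the example.

```python
from typing import Callable


def position_score(
    target_residue: str,
    fold_residue: str,
    predicted_ss: str,
    observed_ss: str,
    reliability: float,
    substitution: Callable[[str, str], float],
    ss_weight: float = 1.0,
) -> float:
    """Score one target position against one fold position (illustrative only)."""
    # Term 1: sequence similarity, e.g. from a 20 x 20 matrix or a profile.
    sequence_term = substitution(target_residue, fold_residue)
    # Term 2: agreement between predicted and observed secondary structure.
    # The +1/-1 values are an assumption made for this sketch.
    agreement = 1.0 if predicted_ss == observed_ss else -1.0
    # The secondary structure term is scaled by the per-position reliability
    # (assumed here to lie in [0, 1]), so unreliable predictions count less.
    structure_term = ss_weight * reliability * agreement
    return sequence_term + structure_term


def toy_substitution(a: str, b: str) -> float:
    """Hypothetical stand-in for a real matrix such as that of Gonnet et al."""
    return 1.0 if a == b else -0.5


# Same residue pair; a wrong secondary structure prediction costs less
# when the predictor reported low reliability at that position.
print(position_score("A", "A", "h", "b", 0.5, toy_substitution))
print(position_score("A", "A", "h", "b", 0.0, toy_substitution))
```

With a reliability of zero the secondary structure term vanishes and the score reduces to the sequence term alone, which is one way to read the statement that the contribution is weighted by the per-position reliability.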

RESULTS

The 14 targets with predicted alignments that were judged by CASP3 evaluators as being of high quality are listed in Table I (for a full target description and other details, see the assessors’ articles in this issue and also http://PredictionCenter.llnl.gov). For 12 of these, SDP found scores above the empirically set threshold of 7.0.¹² The predictions for these 12 targets involved minimal human intervention and largely corresponded to automatic predictions; no attempt was made to find a structure outside the fold library that might be more closely related to the target sequence.
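The automatic assignment rule (z-scores computed over the fold library, with a threshold of 7.0) can be illustrated with a short sketch. It assumes a plain z-score using the population mean and standard deviation of the raw scores; how SDP actually estimates the score distribution is not specified here, and the names `z_scores` and `assign_fold` are hypothetical.

```python
from statistics import mean, pstdev

Z_THRESHOLD = 7.0  # empirically set confidence threshold quoted in the text


def z_scores(raw_scores: dict[str, float]) -> dict[str, float]:
    """Convert raw compatibility scores (fold -> score) into z-scores."""
    values = list(raw_scores.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("all raw scores are identical; z-scores are undefined")
    return {fold: (score - mu) / sigma for fold, score in raw_scores.items()}


def assign_fold(raw_scores: dict[str, float]) -> tuple[str, float, bool]:
    """Return the top fold, its z-score, and whether it passes automatically."""
    z = z_scores(raw_scores)
    best = max(z, key=z.get)
    # Below the threshold, the text says human analysis and further tests
    # were used to choose among the top-ranking folds.
    return best, z[best], z[best] > Z_THRESHOLD


# Toy library: 99 folds with background scores and one clear hit.
library = {f"fold{i}": 0.0 for i in range(99)}
library["hit"] = 100.0
print(assign_fold(library))
```

In this toy library the single outlier passes the threshold; with only a handful of similar scores, no fold would, and the decision would fall to manual inspection.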

The sequence similarities of these 12 targets with their assigned folds vary from 56% down to 24%. The third and sixth columns of Table I give an indication of the difficulty of the targets based on the SDP scores and on the number of gaps and indels in the alignments. The last two columns of Table I show two of the measures used to assess the quality of the predictions (see footnotes for Table I). The first 7 targets (T0058–T0076) do not present a challenge to fold-recognition methods because the correct fold can easily be identified in a single BLAST iteration, and the resulting alignment largely corresponds to the correct sequence-structure alignment. For most of the remaining targets, at least two PSI-BLAST¹³ iterations are required to identify the correct fold, and the resulting alignments contain a significant number of gaps and indels. Thus, obtaining accurate alignments for these targets is considerably more difficult.

Below I discuss three of the predicted sequence-structure alignments, judged by the CASP3 assessors to be of high quality, that appear at the bottom of the table. Subsequently, a discussion of the less successful predictions is included.

TABLE I. Predicted Sequence-Structure Alignments of CASP3 Targets Judged by the Assessors to be of Higher Quality

Targetᵃ  Foldᵇ  SDP scoreᶜ  Seq. Id. (%)ᵈ  Length/alig.ᵉ  Gaps/indelsᶠ  sf0ᵍ  sf-avʰ
t0058    1akz   74.0        56             229/218        1/1           217   0.0
t0060    1mif   35.3        34             117/114        2/2           110   0.0
t0082    1bol   32.7        32             190/172        8/32          115   0.6
t0047    1mup   27.2        64             162/157        0/0           155   0.0
t0049    3pte   25.6        29             392/334        11/36         192   0.5
t0048    1dcha  11.8        32             118/92         3/6           87    0.2
t0076    2bbma  11.7        37             140/137        3/7           45    0.0
t0070    2por   14.5        18             332/277        6/66          130   1.8
t0068    1rmg   12.9        22             376/323        12/67         150   2.8
t0055    1esl   8.8         27             125/111        5/12          79    0.3
t0057    1gdlo  8.0         24             340/253        12/39         151   0.6
t0064    1r69   7.8         30             63/62          1/2           61    0.0
t0074    3ctn   5.5         20             105/72         2/4           55    0.1
t0063    1ah9   4.2         11             138/41         0/0           38    0.0

ᵃTarget = the CASP3-assigned target number; for the complete protein description, see the accompanying articles in this issue.
ᵇFold = the predicted fold (PDB¹¹ entry).
ᶜSDP score = the larger of the scores obtained by SDP and SDPMA (see Materials and Methods).
ᵈSeq. Id. = the sequence identity percentage in the predicted sequence-structure alignment.
ᵉLength/alig. = the length of the target/the length of the predicted alignment (notice that “alig” is different from the “nres” measure used by CASP3 assessors).
ᶠGaps/indels = the number of gaps and indels (total number of insertions and deletions) in the alignment.
ᵍsf0 = “shift zero”: the number of correctly aligned residues in the most favorable superposition (see the evaluators’ articles in this issue for a more detailed definition of sf0 and of sf-av).
ʰsf-av = average shift in the most favorable superposition (when the filed predictions contained two separate models, one for each structural domain, the value of “sf-av” shown here is the weighted average of sf-av for each domain).

The remaining targets (not listed in the table) with filed predictions correspond to targets that were not evaluated because the structures were not released on time, to targets with novel folds, or to other targets with no sequence similarity to any known fold. For some of the latter, SDP identified the correct fold, but the predicted sequence-structure alignments were of lower quality than those shown in the table (see Discussion). For the others, SDP failed to identify the correct fold.
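To make the sf0 and sf-av columns of Table I more concrete, the sketch below computes a simplified version of both from two alignments. It is an approximation under stated assumptions: each alignment is represented as a mapping from target residue index to fold residue index, the shift of a residue is the absolute difference between its predicted and structural partners, and only residues aligned in both are counted. It does not reproduce the assessors' "most favorable superposition" procedure, so it should not be expected to regenerate the values in the table. The function name `shift_metrics` is hypothetical.

```python
def shift_metrics(
    predicted: dict[int, int], structural: dict[int, int]
) -> tuple[int, float]:
    """Return (sf0, sf_av) for residues aligned in both alignments.

    predicted / structural map target residue index -> fold residue index.
    """
    shared = sorted(predicted.keys() & structural.keys())
    if not shared:
        raise ValueError("no target residue is aligned in both alignments")
    # Shift: how many fold positions the prediction is off by.
    shifts = [abs(predicted[i] - structural[i]) for i in shared]
    sf0 = sum(1 for s in shifts if s == 0)  # residues aligned exactly right
    sf_av = sum(shifts) / len(shifts)       # mean shift over shared residues
    return sf0, sf_av


# Toy example: residue 4 is predicted one fold position too far along,
# and residue 5 is not part of the predicted alignment at all.
predicted = {1: 1, 2: 2, 3: 3, 4: 5}
structural = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5}
print(shift_metrics(predicted, structural))
```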

Figure 1 shows the predicted sequence-structure alignment of target T0064 (SinR protein from Bacillus subtilis) with the structure of the 434 repressor, Protein Data Bank (PDB¹¹) code 1r69. The alignment corresponds to the structural superposition in all but four residues that were not superimposed. SDP tolerates imperfect secondary structure prediction because it weights the contribution of the prediction by the per-position reliabilities given by PHD.⁴ Thus, although one predicted secondary structure segment was incorrect (see Fig. 1), SDP still produced a correct sequence-structure alignment.

Target T0074 represents the sequence of the human EH2 domain of EPS15. SDP found that the most compatible available structure is that of troponin C, PDB code 3ctn, an “EF-hand” fold. The SDP score for this target was only 5.5, below the SDP confidence threshold of 7.0. However, by observing that the top ranks produced by SDP included several EF-hand folds, the predictor gained confidence in this prediction. In addition, the top ranks obtained on different runs using sequences homologous to the target also included several EF-hand folds. CASP3 assessors found that of the 72 aligned residues, 55 corresponded to the structural alignment. The average shift error of this alignment was 0.1.

The last sequence-structure alignment described here is that of target T0063, translation initiation factor 5A from Pyrobaculum aerophilum (see Fig. 2). This is a much more challenging example of fold recognition, because identifying the correct fold (“OB fold”) for this target is considerably more difficult than for the other targets shown in Table I. Indeed, PSI-BLAST finds no match with any of the OB folds in the PDB. Consequently, obtaining a correct alignment for this target is a significant challenge. SDP found that the most compatible fold for this target is the structure of the translational initiation factor if1 from Escherichia coli, PDB code 1ah9, an OB fold. Figure 2 shows that 38 residues in the sequence-structure alignment produced by SDP correspond to the structural superposition. However, a segment of eight residues was shifted by one position. The sequence identity of this alignment is only 11%. Notice that the predicted secondary structure in this case is in good agreement with the observed secondary structure. Such an accurate secondary structure prediction contributed significantly to the correctness of this sequence-structure alignment.

What went wrong? Not all the predicted alignments were of the same quality as the three above. The alignments for targets T0068 and T0070 were of significantly lower accuracy; only about half of the residues in the predicted alignments had a zero shift, and the average shift errors for these targets were 2.8 and 1.8, respectively. Four of the 14 targets in Table I correspond to mostly β-structures (T0068, T0070, T0047, and T0055); interestingly, T0068 and T0070 are the two large β-structures (the lengths of T0068, T0070, T0047, and T0055 are 376, 332, 162, and 125, respectively). The lower performance of SDP on these two targets probably reflects the large number of indels in the sequence-structure alignments. In addition, although SDP is relatively resistant to imperfect secondary structure predictions (see above), its alignment accuracy is likely to drop with poor secondary structure predictions. Indeed, T0070’s predicted secondary structure was only 54% accurate and contained a number of wrongly predicted segments.

A further problem was insufficient time to submit predictions for more targets and to update the fold library before the predictions were made. “Postdiction” runs indicate that for some of the unsubmitted targets, good predictions might have been filed and for some of the incorrect predictions filed, SDP now finds the correct fold using an up-to-date fold library (see http://www.cs.bgu.ac.il/dfischer/cafasp1/cafasp1.html).

Fig. 1. Sequence-structure alignment for target T0064 (SinR protein from Bacillus subtilis). The SDP predicted sequence-structure alignment of T0064 with the structure of the 434 repressor, PDB code 1r69. The SDP method identified this fold with a score of 7.8, above its conservative confidence threshold of 7.0. The top line shows the observed secondary structure of T0064 (unknown when the prediction was made). ‘h’ represents helix, ‘b’ represents strand, and blank represents a loop position. The second line shows the predicted secondary structure as given by PHD⁴; the third to fifth lines correspond to the filed alignment; the fourth line highlights identities in the alignment. The last rows show the correspondence of the predicted alignment with the structural alignment. A vertical line indicates a zero shift in the alignment, and dots indicate positions that were not superimposed. Notice that the last secondary structure segment was wrongly predicted as a β-strand, but nevertheless SDP aligned this segment to the last α-helix of 1r69, producing a correct alignment. A superposition of the structures based on this alignment has a root mean square deviation (RMSD) of 1.5. The predicted alignment has 61 correct positions (sf0), for a mean shift error (sf-av) of 0.0.


DISCUSSION

Additional sequence-structure alignments were filed for targets T0067, T0046, and T0079 (not shown in Table I). These targets lie well below the twilight zone; PSI-BLAST does not identify their corresponding folds. Thus, recognizing the correct folds for these targets poses a significant challenge to fold-recognition methods; obtaining accurate sequence-structure alignments is even more difficult. Nevertheless, SDP predicted the correct folds of these three targets, but the alignments were of lower quality than those listed in Table I; homology models built on such alignments would contain significant errors. Thus, it is clear that there is still room for improving the quality of the alignments produced and, in general, for improving the sensitivity of the method.

Prediction of secondary structure is not yet perfect, but it can be accurate enough to be useful in fold recognition. SDP uses the predicted secondary structure information as part of its sequence-structure compatibility function.¹ When the sequence similarity between the target and the fold is high, the sequence information alone can in many cases produce accurate enough alignments. However, for the more distant targets of CASP3 (where even recognizing the correct fold is not straightforward), it is clear that the predicted secondary structure provides a significant contribution. It may be necessary to conduct a larger number of tests at different levels of sequence similarity to unambiguously determine the relative contribution of the predicted secondary structure toward the quality of the sequence-structure alignments.

Among other things, I learned in CASP3 that the average accuracy of secondary structure prediction has already reached 75% (see other articles in this issue). Thus, incorporating the predicted secondary structure from the newer methods is likely to contribute to a better performance. The genome-sequencing projects are rapidly populating the sequence space; consequently, given a new sequence, it is becoming increasingly likely to find many (distant) homologues in the databases.¹²,¹⁴ The accuracy of secondary structure prediction is thus likely to increase merely due to the availability of more sequences. In addition, the sensitivity of SDP is also likely to improve if the currently used profiles⁷ (derived from homologous sequences compiled from a single BLAST iteration in the SWISSPROT database) are substituted with more information-rich profiles, such as those produced by applying a number of PSI-BLAST¹³ iterations on large databases.
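For readers unfamiliar with sequence profiles, the following sketch shows one generic way to turn a multiple alignment into position-specific log-odds scores. It is not the PROFILEMAKE or PSI-BLAST algorithm: the uniform background frequency, the simple additive pseudocount, the absence of sequence weighting, and the name `build_profile` are all assumptions made for illustration. The point is only that each alignment column yields its own score per amino acid, so a profile built from more (and more diverse) homologues carries more information than a single sequence.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(
    alignment: list[str], pseudocount: float = 1.0
) -> list[dict[str, float]]:
    """Return one dict of log-odds scores (amino acid -> score) per column."""
    if not alignment or len({len(row) for row in alignment}) != 1:
        raise ValueError("alignment rows must be non-empty and of equal length")
    background = 1.0 / len(AMINO_ACIDS)  # uniform background: an assumption
    profile = []
    for column in zip(*alignment):
        # Count only standard residues; gap characters such as '-' are skipped.
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append(
            {
                aa: math.log2(((counts[aa] + pseudocount) / total) / background)
                for aa in AMINO_ACIDS
            }
        )
    return profile


# Toy alignment of three homologous fragments.
toy_profile = build_profile(["AC", "AC", "AD"])
print(round(toy_profile[0]["A"], 2), round(toy_profile[0]["W"], 2))
```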

CONCLUSION

The fraction of sequences from complete genomes for which homology models can currently be built is roughly 15–20%.¹⁵ It has been estimated that at the current rate of structure determination, this fraction is growing at an annual rate of 20%.¹²,¹⁴ Thus, within a few years, three-dimensional homology models will be available for a considerable percentage of the genome sequences. Improving the accuracy of homology-modeling methods and of the methods used to obtain the initial sequence-structure alignments will have a significant impact on structural genomics.
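The arithmetic behind "within a few years" can be made explicit with a small projection. This sketch assumes that the 20% annual rate is a relative, compounding increase of the modelable fraction, which is one possible reading of the estimate and is not stated in the text; the function name `projected_coverage` is hypothetical.

```python
def projected_coverage(
    initial_fraction: float, annual_growth: float, years: int
) -> float:
    """Compound the modelable fraction year by year, capped at 1.0."""
    return min(1.0, initial_fraction * (1.0 + annual_growth) ** years)


# Project both ends of the 15-20% range quoted above over five years.
for start in (0.15, 0.20):
    trajectory = [projected_coverage(start, 0.20, year) for year in range(6)]
    print(start, [round(value, 3) for value in trajectory])
```

Under this reading, five years of growth takes the fraction from 15–20% to roughly 37–50%; if the 20% were instead an absolute yearly increment, coverage would rise considerably faster.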

ACKNOWLEDGMENTS

I thank Dr. David Eisenberg for his support during the development of SDP; the CASP3 organizers and assessors for their hard work; Dr. Burkhard Rost for the availability of the PHD program; and the experimentalists who permitted their sequences to be used as benchmarks in CASP3.

REFERENCES

1. Fischer D, Eisenberg D. Protein fold recognition using sequence-derived predictions. Protein Sci 1996;5:947–955.

2. Fischer D, Elofsson A, Rice DW, Eisenberg D. Assessing the performance of inverted protein folding methods by means of an extensive benchmark. Proceedings 1st Pacific Symposium on Biocomputing 1996;January:300–318. http://www.mbi.ucla.edu/people/fischer/BENCH/benchmark1.html.

3. Elofsson A, Fischer D, Rice DW, Le Grand S, Eisenberg D. A study of combined structure-sequence profiles. Folding Design 1996;1:451–461.

4. Rost B, Sander C. Prediction of protein secondary structure at better than 70% accuracy. J Mol Biol 1993;232:584–599.

5. Rost B. TOPITS: threading one-dimensional predictions into three-dimensional structures. Proceedings Conference Intelligent Systems in Molecular Biology, ISMB-95 1995;314–321.

6. Gonnet GH, Cohen MA, Benner SA. Exhaustive matching of the entire protein sequence database. Science 1992;256:1433–1445.

Fig. 2. Sequence-structure alignment for target T0063 (translation initiation factor 5A from Pyrobaculum aerophilum). The SDP predicted sequence-structure alignment of T0063 with the structure of the translational initiation factor if1 from E. coli, PDB code 1ah9. See symbols in Fig. 1. In this case the predicted secondary structure corresponds well with the observed secondary structure of 1ah9, possibly contributing significantly to the success of this prediction. The predicted alignment has 38 correct positions (sf0) but has a shift (with respect to a structural superposition) of one position in the second strand. A superposition of the structures based on this alignment has an RMSD of 3.1.


7. Genetic Computer Group. 1991.

8. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol 1990;215:403–410.

9. Bairoch A, Boeckmann B. The SWISS-PROT protein sequence data bank. Nucleic Acids Res 1992;20:2019–2022.

10. Rice DW, Fischer D, Weiss R, Eisenberg D. Fold assignments for amino acid sequences of the CASP2 experiment. Proteins Suppl 1997;1:113–122.

11. Bernstein FC, Koetzle TF, Williams GJB, et al. The Protein Data Bank: a computer-based archival file for macromolecular structures. J Mol Biol 1977;112:535–542.

12. Fischer D, Eisenberg D. Assigning folds to the proteins encoded by the genome of Mycoplasma genitalium. Proc Natl Acad Sci USA 1997;94:11929–11934.

13. Altschul SF, Madden TL, Schaffer AA, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997;25:3389–3402.

14. Fischer D, Eisenberg D. Predicting structures for genome sequences. Curr Opin Struct Biol 1999;9:208–211.

15. Sanchez R, Sali A. Large-scale protein structure modeling of the Saccharomyces cerevisiae genome. Proc Natl Acad Sci USA 1998;95:13597–13602.
