Proteh structure prediction, Implications for the...

6
BiewhimW {]997 ~70, 63 i -6~6 O Soci6t6 fram,'aise de biochimie el bioh~gie mol0cukme / |.ib;cxicr. Paris Proteh structure prediction, Implications for the biologist G Del6age. C Blancher, C Geourjon hl.~till~l~' Of Biology amt ('hcmistrv qf Proleilts, 7. pa.~sage d, Veruorx. 69367 Lyon ~'edex 07. Framu (Received 16 July 1997: accepted 18 September i 997 Summary ~ Recent impro~ emenls in the prediction of protein secondary structure are described, particularly Illose methods using the hfformation contained int~ mulliple alignment.,,, in this respecl, the prediction accuracy has been checked and methods that take into account multiple aligm'nenls are 70g correct for a three-stale description of secondary structure. This quality ix obtained by at "lea',e-one out" procedure on a reference database of proteins slmring less than 25',"i. identity. Biological applications such as "protein domain design" and structural phylogeny are given. The biologisl's pohll t~t" view is also considered and joint prediclions arc encouraged in order to derive an amino acid based accuracy. All the tools described in this paper arc available for Ifiologisls on the Web (http/w~,w.ibcp.fr/predict.htmll. protein structure prediction / sequence analysis / protein domain design / bh~computing / S{)PMA Introduction Secondary structure prediction from protein sequences is perhaps the most studied problem in coral:rotational mole- cular biology. In addition to its intrinsic interest to biology, it has also become a starling point for lertiary structare pre- diction. Historically, the numerous methods yet available (more than 20 have been developed) were classified in Ibm" different categories: i) the statistical methods; it) the simi- larity based methods; iii)learning methods; and iv)empiri- cal methods. More recently, the incorl~oration of multil:~le alignment ol' related prolein sequences into predictive methods makes this classification obsolete and today we should consider illelhods thai are based eilher Oll single se- quence or on multiple alignnlenl. In Ihis i~aper, we presenl these predictive methods with their exlensious and lilt' way the biologist (obviously riot I'amiliar with these computing metl~:~ds) should integrate the information of secondary structure into his experiments. in formalion theory to protein sequence and secondary st ruc- ture. In this latter method the predictive parameters b.axe been revisited and the pair information has been added 18 I. However, statistical methods fail to improve despite the in- creasing size of the re l~rence database 19l. About 1() years ago similarity based prediction methods appeared I I O. il I. Briefly, each polypepfide of the query sequence is compared with all peptides with the same length in a reference data- base. if both peptides are similar (by using a secondary structure substitution matrix), the known observed confor- mation is given to the query peptide with the similarity value. The process in allowed |o continue until all l;ep|ides have been compared anti tile s, cores are sttmmed tip. The final predicted stale ix the one with the highe:4 conlbrlna+ ti()nal set,re. This strategy has been used in ,,everal methods with a 17 amino acid Ionp peplide I I 2l ai|d COlllhinetl with an aulonlaled optimization l~l°occdure l l3l. More reccnlly several groul~s have used neural networks 114=161 with slighl, significant improvenlents over previously described methods. Single sequence based methods The first developed methods were statistical methods even if the size and the sampling representativity criteria were obviously not met. The pioneering work of Chou and Fas- man I I-3 ] consisted in the derivation of the frequency with which individual residues or short sequences are found in given contbrmational states (helix, sheet, turn and a-peri- odic). Further improvements in this method have been achieved by distinguishing internal and external [.]-sheets 141, the integration of structural class prediction 151. a profile consideration in the special case of 1~-~-~ proteins 16l. Another approach 171 consisted in the application of Incorporation of multiple alignments A promising starting point for predicting a structure for a given amino acid sequence is Io determine whether that se~ quence is sufficiently similar to any other sequence for which biophysical data (ideally a 3D structure)are avail° able. In this context, several authors have investigated the possibility of improving the prediction accuracy by taking into account the information contained into a multiple alignment of related protein sequences 117-191. This ap- proach has been applied to both statistical and similarity based methods l:or the prediction ot' el-helices II 9l, to sell'-

Transcript of Proteh structure prediction, Implications for the...

Page 1: Proteh structure prediction, Implications for the biologistperso.ibcp.fr/gilbert.deleage/fr/pdf/86.pdf · O Soci6t6 fram,'aise de biochimie el bioh~gie mol0cukme / |.ib;cx icr. Paris

BiewhimW { ] 997 ~ 70, 63 i -6~6 O Soci6t6 fram,'aise de biochimie el bioh~gie mol0cukme / |.ib;cx icr. Paris

Proteh structure prediction, Implications for the biologist

G Del6age. C Blancher, C Geourjon

hl.~till~l~' Of Biology amt ('hcmistrv qf Proleilts, 7. pa.~sage d, Veruorx. 69367 Lyon ~'edex 07. Framu

(Received 16 July 1997: accepted 18 September i 997

Summary ~ Recent impro~ emenls in the prediction of protein secondary structure are described, particularly Illose methods using the hfformation contained int~ mulliple alignment.,,, in this respecl, the prediction accuracy has been checked and methods that take into account multiple aligm'nenls are 70g correct for a three-stale description of secondary structure. This quality ix obtained by at "lea', e-one out" procedure on a reference database of proteins slmring less than 25',"i. identity. Biological applications such as "protein domain design" and structural phylogeny are given. The biologisl's pohll t~t" view is also considered and joint prediclions arc encouraged in order to derive an amino acid based accuracy. All the tools described in this paper arc available for Ifiologisls on the Web (http/w~,w.ibcp.fr/predict.htmll.

protein structure prediction / sequence analysis / protein domain design / bh~computing / S{)PMA

Introduction

Secondary structure prediction from protein sequences is perhaps the most studied problem in coral:rotational mole- cular biology. In addition to its intrinsic interest to biology, it has also become a starling point for lertiary structare pre- diction. Historically, the numerous methods yet available (more than 20 have been developed) were classified in Ibm" different categories: i) the statistical methods; it) the simi- larity based methods; iii)learning methods; and iv)empiri- cal methods. More recently, the incorl~oration of multil:~le alignment ol' related prolein sequences into predictive methods makes this classification obsolete and today we should consider illelhods thai are based eilher Oll single se- quence or on multiple alignnlenl. In Ihis i~aper, we presenl these predictive methods with their exlensious and lilt' way the biologist (obviously riot I'amiliar with these computing metl~:~ds) should integrate the information of secondary structure into his experiments.

in formalion theory to protein sequence and secondary st ruc- ture. In this latter method the predictive parameters b.axe been revisited and the pair information has been added 18 I. However, statistical methods fail to improve despite the in- creasing size of the re l~rence database 19l. About 1() years ago similarity based prediction methods appeared I I O. i l I. Briefly, each polypepfide of the query sequence is compared with all peptides with the same length in a reference data- base. if both peptides are similar (by using a secondary structure substitution matrix), the known observed confor- mation is given to the query peptide with the similarity value. The process in allowed |o continue until all l;ep|ides have been compared anti tile s, cores are sttmmed tip. The final predicted stale ix the one with the highe:4 conlbrlna+ ti()nal set,re. This strategy has been used in ,,everal methods with a 17 amino acid Ionp peplide I I 2l ai|d COlllhinetl with an aulonlaled optimization l~l°occdure l l3l. More reccnlly several groul~s have used neural networks 114=161 with slighl, significant improvenlents over previously described methods.

Single sequence based methods

The first developed methods were statistical methods even if the size and the sampling representativity criteria were obviously not met. The pioneering work of Chou and Fas- man I I-3 ] consisted in the derivation of the frequency with which individual residues or short sequences are found in given contbrmational states (helix, sheet, turn and a-peri- odic). Further improvements in this method have been achieved by distinguishing internal and external [.]-sheets 141, the integration of structural class prediction 151. a profile consideration in the special case of 1~-~-~ proteins 16l. Another approach 171 consisted in the application of

Incorporation of multiple alignments

A promising starting point for predicting a structure for a given amino acid sequence is Io determine whether that se ~ quence is sufficiently similar to any other sequence for which biophysical data (ideally a 3D structure)are avail° able. In this context, several authors have investigated the possibility of improving the prediction accuracy by taking into account the information contained into a multiple alignment of related protein sequences 117-191. This ap- proach has been applied to both statistical and similarity based methods l:or the prediction ot' el-helices II 9l, to sell'-

Page 2: Proteh structure prediction, Implications for the biologistperso.ibcp.fr/gilbert.deleage/fr/pdf/86.pdf · O Soci6t6 fram,'aise de biochimie el bioh~gie mol0cukme / |.ib;cx icr. Paris

~ 2

optimized prediction methods 1181 and to neural network methods I201. One interesting feature in the Rost and San- ~ r method is a quality index given at each amino acid po- sition [20] and that considering only the residues with highest scores may yield as much as more than 86% on about 30% of residues [211, It seems now that no significant improvements in secondary structure prediction will be ob- tained without taking into account the effect of tertiary structure onto local folding.

Comparison of predictive success

The objective comparison of predictive accuracy has been addressed for a long time, it requires at least three problems to be solved: i) the reference database; ii) calculating the prediction accuracy; and iii) the reproducibility of methods developed onto different sets, The first problem to address is the definition of a reference database [221. In this context, the cut-off value of identity between protein sequence is critical and the cut-off yet admitted is 25% [ 16] (a value below which homology modeling is rather h~ardous). An- other problem is to derive the secondary structure from atomic coordinates, It has been shown that using different criteria (geometric or energetic) may result in an overall

agreement as low as 70% 123]. The firsl accuracy index is the global percentage of correctly predicted residues Q% and has been used in an objective comparison of differentt methods 191. This parameter can also be used on a state-by- state basis, but the percentage of correct prediction does not give any information about the correlation between predic- tion and observation, That is why a correlation parameter has been introduced and applied to secondary structure pre- diction [241. This C parameter varies from - I (total anti- correlation) to I (total correlation) and is nil if there is no correlation between prediction and observation. More re- cently, the segment overlap [251 is a measure of how the predicted segments correspond to observed segments, This is the single method that is not based on a residue-by- residue comparison. The last but not least bias that must be avoided is the independency between learning and ~esting sets for methods based on neural networks. Indeed, even if a jackknife procedure could be applied (removing the pro- tein from learning set prior its own prediction), this proce- dure seems to be too time consuming to be extensively used. A comparison of different methods is given in table 1. Methods based on a single sequence have a global success rate of about 60% when checked on the 126 protein chains sharing less than 25% identity I!6]. On the other hand, methods that exploit the information contained into the

Table I. Comparison of secondary structure methods on the Rost and Sander database,

States Co(!f GOR ! 71 Levin I ! I I DPM 151

o~-helix Ca 0,42 0.53 0.45 act 10,92 9,06 !0.95

SOV 0,609 0,601 0,603 Lp 8,2 7,9 6.3 Lo 9, I 9. I 9, I

Q3 (q) 53,5 o2 53.9

SOPMA 118j Pill) ! 161 SOPMA + PHD

0.61 11.61) I).65 7,9 8.5 - 0.737 - 0.706 7,3 9.3 7,5 9,1 9.1 9

{~9.9 72 8(),9

C' ~ cd

8,99 8,34 8.49 SOV 0,578 0,533 0,536 Lp 3,9 3,6 4, I Lo 5,1 5,1 5,1

Q3 (%) 38,8 40,5 42,2

Cc 0,49 0,41 0.53 ¢1c 8,49 8,75 9,5

SOV 0,528 0,592 0,585 Lp 8,2 7,8 7,6 Lo 5.8 5.8 5,8

Q3 (%) 69, I 74,2 72, I

Total Q3% 57.6 63, I 59.9 C 0,562 t),563 0,566

O.M) 0,52 0,61 8 8.1 - 0.621 - 0,74 3.9 5,0 4.0 5.1 5.1 5.3

53.3 66 81.3

0.52 - I).55 8.1 - -

0 . 7 0 2 - 0,642 6.9 - 4.8 5,8 - 6,11

74,6 74 82.8

68.6 70.8 83.7 i76) t},672 - 0,678

The pr~liclive success is calculated on tlae Rost and Sander database 1161 by using the classical index reported 1251, Brielly, Q is the ~ e n t a g e of corr~tly predicted residues, C is the correlation coefficient between prediction and observation iC = -! for negative correla- tion, C = 0 for absence of cor~lation and C = +1 for lx~sitive correlationL and ~ is the standard deviation between predicted secondmy content, Lp is the mean predic,,ed lenglh of predicted segments, Lo is the averaged length tot observed secondary structure and SOV is the segment overlap index,

Page 3: Proteh structure prediction, Implications for the biologistperso.ibcp.fr/gilbert.deleage/fr/pdf/86.pdf · O Soci6t6 fram,'aise de biochimie el bioh~gie mol0cukme / |.ib;cx icr. Paris

683

200 210 220 230 240 250 - - I I I

hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh ' hhhhhhhhhhh '

hhhhhhhhhhhhhhhhhhh~hhhhhb_hhhhh hhhhhhhhhhh ........ ~ ~:: AETDQLEEEKSAL ILAAHRPACKMPEELRFSEE/AAATALDLGAP- - S

arnier h h h h h l 11. FOS_MSVFB AETDQ

~SGL

~ETDKLEDEKSGLQi:EIEPI~ "4 V " ~J nier hhhhhhhhhhhhhhhhhhh ~ 15. FOSB MOUSEAETDQLEEEKAELE: 'EIAE ni or hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh mary cons. Q D A T E £! I A ! A ond. cons. hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh

Primary struct. F¢) SX M SV FR. Prediction method Ga,'nier (parameters: - 100 Alpha Helix=39.3% =38.9% =7.8% =13.9% AA: number Score: Pr~

'L

EEL2FSEEMS2A2SDLT(I<-;LPEAS ? hhhhh?h???? ?? ~

Message box

Fig 1. Tile ,nultiple protein sequence analysis (MPSA) program interface. The alignment of sequences that belong to the Fos family of transcription factors was pc,Tormed from within the MPSA program (version 0. I for Silicon Graphics) by using the CiustalW (1.6) package and secondary st,'uctt, rc predictions come either from local impleme,'ttatiorls of well known methods 17, 8, I !] or from the Network Protein Sequence Al,:dysis (hup://www.ibcp.fr/prcdict.hlml)I ! 8 I.

tuulliplc alignment of related i ,otcins SCtlUCnCC~ such as SOPMA 1181 and PHI) 116] have global success rates of abot!t 70% in the Rost and Sander database, Thus multiple alignment and consensus prediction using profiles signifi~ cantly improve the prediction accuracy of protein secondary structure. What are the future direclions lbr secondary struc° tuft? From the biologist's point of view, it seems more im- portant to know the correctness of prediction on some given stretches rather than to get a global prediction accuracy tbr the whole sequence. That is why some authors have inves- tigated the possibility to give an accuracy index for each residue of a sequence. Very often, this index is based on the difference between the conformational state that has the highest value and the second value 18, 201. Other authors have chosen to investigate the possibility of joint p,'ediction !18, 271 or a combination of several methods based on dif- ferent principles l l3l. This stresses the necessity for the biologis t to compare the results obtained by several methods (especially based on different principles), to make consensus prediction and thus to have the methods available either integrated into a single software or through lnternet.

ln tegra t io , hlto packages anvil Web servers avai|ability

In order Io give |he biologisl acces~ Io diffcrenl secondary slrtlClllre methods coupled with protein sequence analysis methods, a package called ANTHEPROT thai integrates several prediction methods has been being developed for several years 128,291. This program runs on PC computers (DOS, Windows o,' IBM rs6000) and is available for aca ,~ demic users (ftp.ibcp.fr). The program is being adapted tot other Unix machines (SGi, SUN, IBM) and Macintosh. This new program is called MPSA (multiple protein sequence analysis) program. The main window of MPSA showing the combination of multiple alignment with different secondary structure prediction is given in figure I. In order to investi- gate the conformational scores for each state (helix, sheet, turn and coil), a graphical view can be displayed. The graphics are fully interactive with cursors, scrolling, zoom, resizing and selection functions. The main window can be sent to a file in either Postscript Adobe or rich text tbrmat (RTF)-Microsoft formatted files. Great efforts have been made to allow large alignments (200 sequences of 5000

Page 4: Proteh structure prediction, Implications for the biologistperso.ibcp.fr/gilbert.deleage/fr/pdf/86.pdf · O Soci6t6 fram,'aise de biochimie el bioh~gie mol0cukme / |.ib;cx icr. Paris

~ 4

+),s+,+ o++(pt+++~,<++

Fig 2, Not¢in sequence analysis, strategy.

IETU 9 K PHVNVGT I GHVDHGKTTLTAAITTVLAKTYG

IGKY 1 SRP IVl SGPSGTGKSTLLKKLFAEYPDSFGFSVS STT

3 ADK 6 KKS KI I FVVGGPGSGKGTQCEKXVQKYGYTHLS -

5 P21 1 MTEYKLVVVGAGGVGKS~TXQLIQNHFVDEYDPT I

NBD 1 418 QSGQTVALVGNSGCGKSTTV~LMQRLYDPTEG~SVDG

NBD2 10 61 KKGQTLALVGSSGC(~KSTVVQLLERFYDPLAGKVLLDG

Fig 3. ('ompadxtm of prhuary and second+i~+, slnicturc ctm.~crva~ tion ill |lttclcotkle bi|tding region ~motff A) of P+glycoprolcin. The nuclct~Iktc binding region A motif(lAGbx t4)-(3-KqST[)has been searched in N RI+°3D database and ftmnd into four proteins: elon,g- ation factor (| ETU). guanylate kinase (IGKY). adenylate khmse (3ADK } and |)21 ras ~5P21 ) and in hun;arl P-glycoprotcin. The 3D structure wa~ deduced |'t'om their atomic ct)ordinates by using the Kabsch and Sander method 1+22 i. The grayed box entire|rag the last two sequences highlights the two predicted nucleotide binding do- mains (motif A) from the htllna|l Pgp. Open rectangles represent ~-shccts. filled black rectangles rcprcscn| a-helices; dotted rec- tangles rcp|'cscttt l]-Iurns.

elements) to b¢ handled with g(~xt inleractivity. The cumbina- lion of multiple aligrlments with secondary structure predic+ ,ions ix i,tegrated in a more general approach (fig 2). Moreover. a World Wide Web server (llttp:/twwvc.ibcp.fr/ predict.html} called NPSA (network protein seqttenc¢ all~i + lysis) has bee, develol~cd which allotvs tile hioh~gis| |o per° fi~rm secondary structu.~ analysis by several metht~ds. A ¢o,sensu~ prod,clio, i~ generated and sent It) the hio!ogi..| b) + e°mail (due to the CPU time required for predict,tin}. Ik~ay, mo~ than 27 l~} i~lUeSts ....... ha+ vet been pr,~essed by our server and the day rate is about 50 requests per day. In the fittu~, the p~,ssibility of extracting sequences by using the ACNUC 1301 web server (http:flacnuc.univ-lyonl.IW start.html) will ~ added, thus allowing to make predictions onto a Swiss+pro, subset,

Applk'alions to biological projtmls

In this chapter we pl~sent three difl~rcnt examples of sec- ondary structu~ p~diction: some have been performed in the absence of 3D data. others have been combined with 3D structural data,

The secondary structure is more often conserved than the sequence since the secondary structuw code is degenerated starting i~m 20 amino acids to finish with three ur tk}ur different conformational states. That is why some ,steres-

ring infommtion c:m be masked ill the sequence and become visible after secondary structure prediction. A typical example of lifts feature ix Birch with Ii!e Ilu¢l¢otide binding domains of multith'ug resistance proteins (P-glycoprotein). In tile nuclcotide binding domains u[ difl'~reilt proteins, tile mnino acid zunscrvation (fig 3) is very low since only tin'co am/no acids t~ti! of" ?18 arc identical i . to the six seql|en¢cs. These muino acid+,, ct,'respmld to Ihe requisite positions G-x ~4 )+G+K tor mieleotide binding. When the observed sezond- a,'y 51rll¢lure,s (first t'our sequences) are compared wilh pre- dicted st,'UclUres (last two sequences), tile structural conservatio, is obvious. This type of investigation permits to validate the multiple alignment and the location of sec- ondary structure elements and to postulate a typical Ross- ,nan fold for both nucleot ide binding domains of P-glycoprotein.

In the AIDS field, secondary structure prediction coupled with multiple aligmnents has been used to predict the 'most likely' sequence of itlV gpl20 131] and to propose an e×- planation at the secondary structure level for the difference in the dimerization rate of HIV-I and HIV-2 reverse tran- scriptases (HIV-RT) enzymes 132 I. When predicted second- ary structures for HIV-I RT and HIV-2 RT are compared. one can clearly identify tile main difference, located in the 312-34{} region (fig 4). Thus we formulated the hypothesis that regio, 312-34{} could explain the difference in the dim- erization rate of HIV-RT subunits 132]. Then peptides in

Page 5: Proteh structure prediction, Implications for the biologistperso.ibcp.fr/gilbert.deleage/fr/pdf/86.pdf · O Soci6t6 fram,'aise de biochimie el bioh~gie mol0cukme / |.ib;cx icr. Paris

6~5

0 100 200 300 4 0 0 5 0 0

H|V1-RT-p66 : ,~

Ii I

0 100 200 300 ] 400 i 500 ~x H e l i x amino acid number

Difference in secondary structure [---2 13 sheet

Fill 4. Predicted secondary structure of HIV-I and HIV-2 RT. The secondary .~tructure of H|V-I and HIV-2 RT p66 subunits was prediclcd by using tile SOPM nlethod I131. Closed and open boxes correspond to the ~x-hcliccs and [:Lstrantls. respectively.

these regions were synthesized. The synflletic peptide from HIV-2 clearly folds into :m ~-helix whereas the HIV-I pep- |ide is rather unfolded an observed by circular dichroism (Divita, personal commtmicalion). More striking, these syn- thetic peptides have been found to inhibit tile sul~unil dim- erizalion.

Anolher example in given by the design of Ihe I'unclional protein donlain. This i 'Ccelll concepl cOMes with tile possio bil i ly Io delcrnline the lel'liary slruclure of molecules of limited size by NMR. I)ue to the modular organization of proteins, one should be able 1o predict tIle limits of 13iolt~gi~ tally active domains. In our laboratory, this approach has been successl'ully applied to the l'ructose repressor DNA binding domain and to heal shock factor. These domains (nrainly designed l'rom sequence analysis) have kept DNA binding capabilities and, its shown in I'igure 5, their struc- ture has been solved by NMR 1331.

The last application is the use of predicted secondary structure as a starting point lbr tertiary structure prediction or folds recognition. We are addressing this problem by a combination of simulated annealing, database sampling, Monte-Carlo and distance geometry methods, in some par- ticular cases (protein mostly helical), structures that dilTer by less than 3.6 A from actual structure have been obtained and these results have to be generalized to a representative subset o1' the PDB. One of tile most pronlising challenges for tile next decades is probably the understanding of the folding process and the prediction ot' protein tertiary strut'- lures from sequences.

o,

/

/ r

J Fig 5. An example of protein doll'lahi design. The fl'UClOst~ i'¢l'~rc.',.',,or DNA-binding domain has been designed from both sequence ana- lysis and proteolytic experiments 1331.

Page 6: Proteh structure prediction, Implications for the biologistperso.ibcp.fr/gilbert.deleage/fr/pdf/86.pdf · O Soci6t6 fram,'aise de biochimie el bioh~gie mol0cukme / |.ib;cx icr. Paris

Ack~wledgments

The ~,ork presented in this review has been carried out wilh the support of IMABIO program (CNRS) and MENESR granls (ACC- SVI3t .

References

I Chou PY. Fasnmn GD 11974) Prediction of protein ct,nformation. Biochemist~' i 3.2 ! !-245

2 Chfm P'~; Fa:~man GD {It~781 Prediction of secondary structure of W¢tlein.,i from their ~mino acid ~iequence. A~h' Enz3m,o147. 145-147

3 Foeman GD 11989~ Development of the prediction of protein ~truc- tore. In: Predwthm of pn~win .stlTwttu~, and the print'o~ies of protein ~otll~,~uition tFasman G, edt Plenum. New York and London. 193 p

4 Garrali RC, Taylor WR. Thornton JM 11985t The influence of tertiary ,,tructure on ~econdary structure prediction, Accessibility verstL~ preo di¢iahility tbr beta ~tructure. FEBS l.ett 188, 59--62

5 Deleage G. Roux B 119871 An algorithm for protein secondary strut. tun prediction b a ~ d on class prediction. Pt~t ling I. 289-294

6 Taykw WR. Thornton JM t 1983t Prediction of super-secondary ,,true- lUre in proteins. Nutur~' 301. 540-542

7 Gamier J. Osgu tho~ D J. Robson B ! 197,~) Aualy.~i~ of the accuracy and implications of ~imple methods tbr predicting the secondary ~trueture of globular protein~. J Mol Biol t 20, 9 % 1211

g Giant JE Gamier J, Robins g (1987) Ftmher developments of protein ~¢ondary ~|rllClUrg prediction using information theory. New par° am¢ler~ and consideration of t~sidue pair~ J Mol Bhg 198, 425 44.a

~ Kab~ch W. Sander C 119831 Ho~. g t ~ are prediction of protein ,,ec, ond~y ,~tructure ? FEfl.~ In'It 155, 179~ 182

I0 Sweet R 1 I986) Evolutionnar~. similarity among peptitle ,,egment in a h~!~i~ tilt prediction of pmte{n foldhtg. Bi,#l,ulvm,,r.~ 25. 1565-1577

11 l,e%in JM, Rnb.~nll B. Gamier l ! 19861 An algori|hn! l i l r neCOllt!ary .~tru¢lurt~ predi¢!lon t~ased on ,eqlte!l¢¢' :.itaihlrity, FEliX Left 2115, .~1)3-- 3118

12 !.evill JM, (iarnlt2r J t lqt'lt'l) illlprll'.'CUt¢iits il! a scctintlary ~tfucture pr~dif!it!n !ltelht~l hll~e_d on a ~¢lti~ll tile Inca! ,~¢qi!cilco Iliti!iilll~gi¢~ t!lid i!~ !l~t' a~ a t! it i ldi l l~ tunl, ttiochiol tlii,,phv,~ A,'/,i ~155, 2~3 295

I£~ Geuuiioa C, !~qCt!ge G {!t i i !4) SOPM~ l~ nell-llplillli,~ei! pledi¢liotl Illt~lh~,l ii,~' ltttllt, ill ~'¢il l ldiu~ ~tfil~'tttlt, !~f~tlii:liou, t~i~i ! : l i t /7, 154 I f~-I

14 Qhul N Sgiltov¢~ki 1'] l !~}14~lt Prediclillg ti!e ~¢¢ondltry ,~!fll¢lti!oe i l l ~h t htlh!r lt.ftllt!il!~ tl~i!l 7 nTural itc,ilViitl'ks lllod¢l%, ,I ~!ti l I1#ol l till. @J7 7{PJ

!5 ~ht!tl 8 X, l%!i)~h~Iv I t { Wi! l t l DL 11! !2} Hyhr id system itti" I!fol01tt . ~.5, 104% I 1~3

t6 Ro~I B, Sa!~!er C l l ~ ;~ ) P~di~:tio!! of protein secondary strt!clure ~11 ~tter than 71F;~,, accuracy. J Mot Bio1232, 584~59ti

17 Levis JM, Pascarella S, A~os It Gari~iei' l i 1993i Quantification of se¢oltdary strucluit, p~diclion inlprovenlent using multiple aligm ment. Pt~l I~'~Ig 6, 849=854

18 Geouljon C, Driftage (i 11905) SOPMA: significanl improvements in protein secondary structure prediction by consensus prediction fi'om multiple aliginnents, Compia Appl Biosci I !, 68 !-684

19 Donnelly D, Overinglon JP, Bklndell TL ( 1994} The predic|ion and orientatmn of ~ helices from sequence alignme~:ts: then comined use of environment-dependent substituion tables. Fourier translbrm nlethcxis and helix capping rules. Pant Eng 7,645-653

211 Rest B, Sander C 119941 Combining evolutionary information and neural networks to predict protein secondary structure. Prim, ins I% 55-72

21 Yi TM, Lander ES { 1993) Protein structure prediction using nearest- neighbors methods J Mol Biol 232, | ! 17-1129

22 Kabsch W, Sander C ( 1983} Dictionary of protein secondary struc- tures: pattern recognition of htdmgen-honded and geometrical fea- tures. Biopolymers 22. 2577-2637

23 Colloc'h N, Eichcbest C, Thoreau E, Henrissat B, Mouton JP t 1993l Comparison of three algoritim~s for the assignment of secondary structure in proteins: the advantage of a consensus assignment. Pt~t i:ilg 6, 377 ~82

24 Matthews BW t 1~75~ (/~mpa~i,,,~m of the predi¢led alld obsclwed se¢~ ondary structure of T4 phage lyso~ym¢, tliochooa t~,y,hyx Acla .11~5. 442-451

25 Ro.~t B, Sander C { 1994! Redefining the goals of protein secondary structure prediction. J Mol Bioi 235, 13-26

26 Bmu V. Gibrat JE Le,dn JM. Robson B, Garnier J 11988~ Secondary structure prediclion: comhination of three different methods. Peel Eng 2. 185~ 191

27 Dddage G, Roux G 119891 Use of class prediction to improve protein secondary structure prediction: Joiut prediction with methods based olt sequence honlology In: Pro,diction o,f p~lein strlwtm'e ami the !,rim'iph.x ~!flmm.iu con]i~rmali, m (Fasman G, edl Plenum, New York and London, 587 p

28 GetlurjoB C, Deldage G { 1993} Interactive and graphic coupling I~etween multiple alignments, secondary structure prediction and motiflpallern scanning in to protein sequences. ( ) |BIOS 9, 87-9 I

2( j Geourjon ('. Ileldage G t1'1951 AN°|'II|:2P~<OT 2.11: A Ihree dimem sional moduie [nlly coupled wtth protein SetlUettce anal)hi,, method. g A'h*i Gvaphi,',~ 13, 209-212

i~O Pelxi/~re G. Thiot|louse ] { i9961 Onoline tools for netlneuce retrieval a|ld nlulli'val'iate staiisiics iil Itiidecular hiology. ¢'ABIO5 12. {~3h9

~1 Myer,~ 1]o I:antk~r A {1997~ HIV !digitlllel|ls. databases scat°dies anti nq~u~:mle tuediclmn hi. Hw Ihiman Rctioi'ii'li.~t'.~ ~llld AII)S {',mtl,en- dnum II~nn Ahmlt ls Nathl| lal I,aimralory Compendium

12 iIi~dla {k Rillhlgel' K. {]eOu|j~Iu C, I)dOage (|, (hmtiy ItS 119951 !.}i|lleti~almn killelies el I I IV! and i!!V-2 le~erse Irallscriplase: A two step process J Me! Bin! 245. 508-521

33 Pelliu !:~ Geo06o|! C, Mo!ilserret R, llitckll!anll A, Lesage A, Yal!g YS, Bonod-Bidaud N, Cortay JC. N~gre I), Cozzone AJ. I)cl~age G 11997 ) Three-dimensional struclure of Ihe DNA binding donmin of the fructose represser froln Escherh'hia coli by Ill aud I, N NMR. 1 Mol Biol 270, 496-5 I0