GOR Method for Predicting Protein Secondary Structure from Amino Acid Sequence

By JEAN GARNIER, JEAN-FRANÇOIS GIBRAT, and BARRY ROBSON

Introduction

How to extract properties from the amino acid sequence of a protein for understanding its function is one of the most timely and competitive areas of the biological sciences. It is related to a fundamental aspect of biology. A linear and ordered sequence of amino acids, coded and conserved by a linear and ordered sequence of nucleotide bases, codes for all that is characteristic of living organisms: specific and organized interactions in space and time between proteins, lipids, nucleic acids, and the cell metabolites. These characteristics depend on how a protein can fold into a unique active three-dimensional structure. This process is spontaneous under given environmental conditions, even though the living cell can add efficiency and control by use of some protein complexes, called chaperones, to catalyze this process.

Much effort has been devoted to the calculation of the spatial structure of a polypeptide chain from its amino acid sequence alone, with only limited but nevertheless encouraging success¹ when a polypeptide is longer than 10-20 amino acids. However, attempts to reduce the problem to simpler features of the protein fold such as α helix, β strands, and aperiodic or coil structure have yielded interesting results (see Refs. 2 and 3). These results have been an aid for designing new proteins, predicting the effect of point mutations, identifying the protein class (for instance, all-α or all-β proteins), predicting epitopes, etc. It is hoped that this information will be increasingly useful to molecular biologists and protein modelers. Usually the computing time is short, and many programs of secondary structure predictions are available on-line to the biologist.

The GOR method is one of the most popular of the secondary structure prediction schemes. This method is theoretically well founded in a series of earlier papers, and it was the first secondary structure prediction to be implemented as a computer program. The three letters stand for the first letter of the names of the authors of the original publication.⁴ This method remains remarkably popular,⁵ but users have overlooked the fact that improved versions of the method have since been published, and a full description of them can be found in the book edited by Fasman.⁶ The addition of homologous sequence information through multiple alignments has given a significant boost to the accuracy of secondary structure predictions.⁷⁻⁹ In this chapter, after presenting the major principles used by the GOR method, we give some results obtained with an updated version of this method.

¹ Protein Structure Prediction Issue. Proteins 23, 3 (1995).
² J. Garnier and J. M. Levin, CABIOS 7, 133 (1991).
³ B. Rost and C. Sander, Trends Biochem. Sci. 18, 120 (1993).

METHODS IN ENZYMOLOGY, VOL. 266. Copyright © 1996 by Academic Press, Inc. All rights of reproduction in any form reserved.

Principles of Method

In a series of articles, Robson et al.¹⁰,¹¹ used the formalism of information theory and Bayesian statistics to establish the code relating the amino acid sequence and the secondary structures of a protein. This led later to the development of the GOR method.⁴

Information theory was developed in the 1950-1960s,¹²,¹³ and the GOR method made use of an information function described by Fano,¹⁴ $I(S; R)$, which is defined as

$$I(S; R) = \log[P(S \mid R)/P(S)] \tag{1}$$

Originally this formulation was concerned mainly with electronic transmission of information. In the present application, $S$ is one of the three conformations, $R$ is one of the 20 amino acid residues, $P(S \mid R)$ is the conditional probability for observing a conformation $S$ when a residue $R$ is present, and $P(S)$ is the probability of observing $S$. According to the definition of conditional probabilities, $P(S \mid R) = P(S, R)/P(R)$ where $P(S, R)$ is the joint probability of observing the events $S$ and $R$ and $P(R)$ is the probability of observing a residue $R$. It is easy to have an estimation of $I(S; R)$ from a database of known sequences and corresponding observed secondary structures since $P(S, R) = f_{S,R}/N$, $P(R) = f_R/N$, and $P(S) = f_S/N$ with $N$ being the total number of amino acids in the database, $f_{S,R}$ the number of residues $R$ observed in the conformation $S$ in the same database, $f_R$ the total number of residues $R$, and $f_S$ the total number of residues observed in the conformation $S$ in the same database. Then

$$I(S; R) = \log[(f_{S,R}/f_R)/(f_S/N)] \tag{2}$$

This quantity can be obtained easily from the database provided it is large enough.

⁴ J. Garnier, D. Osguthorpe, and B. Robson, J. Mol. Biol. 120, 97 (1978).
⁵ L. B. M. Ellis and R. P. Milius, CABIOS 10, 341 (1994).
⁶ J. Garnier and B. Robson, in "Prediction of Protein Structure and the Principles of Protein Conformation" (G. D. Fasman, ed.), Chap. 10, p. 417. Plenum Press, New York, 1989.
⁷ J. M. Levin, S. Pascarella, P. Argos, and J. Garnier, Protein Eng. 6, 849 (1993).
⁸ B. Rost and C. Sander, J. Mol. Biol. 232, 584 (1993).
⁹ V. di Francesco, P. J. Munson, and J. Garnier, 28th Annual Hawaii International Conference on System Sciences (L. Hunter, ed.), p. 285. IEEE Computer Society Press, Los Alamos, 1995.
¹⁰ B. Robson and R. H. Pain, J. Mol. Biol. 58, 237 (1971).
¹¹ B. Robson, Biochem. J. 141, 853 (1974).
¹² C. E. Shannon and W. Weaver, "The Mathematical Theory of Communication." Univ. of Illinois Press, Urbana, Illinois, 1949.
¹³ L. Brillouin, "Science and Information Theory." Academic Press, New York, 1956.
¹⁴ R. Fano, "Transmission of Information." Wiley, New York, 1961.
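Equation (2) amounts to counting, so it can be sketched in a few lines of Python. The function name `single_residue_information` and the toy observations are assumptions made for illustration; they are not taken from the chapter's database or program.

```python
import math
from collections import Counter

def single_residue_information(observations):
    """Estimate I(S; R) in nats, as in Eq. (2), from (residue, conformation) pairs."""
    observations = list(observations)
    n_total = len(observations)                       # N
    f_r = Counter(r for r, _ in observations)         # f_R
    f_s = Counter(s for _, s in observations)         # f_S
    f_sr = Counter((s, r) for r, s in observations)   # f_{S,R}
    return {
        (s, r): math.log((count / f_r[r]) / (f_s[s] / n_total))
        for (s, r), count in f_sr.items()
    }

# Toy observations, invented for illustration only.
toy = [("A", "H")] * 6 + [("A", "C")] * 2 + [("G", "H")] * 2 + [("G", "C")] * 6
info = single_residue_information(toy)
print(round(info[("H", "A")], 3))  # log(1.5), about 0.405
```

Only (conformation, residue) combinations that actually occur receive a value here; a combination with zero count would need separate treatment, which is one reason the database must be large.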

A more general treatment requires corrections for levels of data (see Robson¹¹ and below). Robson¹¹ introduced the information difference,

$$I(\Delta S; R) = I(S; R) - I(\text{n-}S; R) = \log(f_{S,R}/f_{\text{n-}S,R}) + \log(f_{\text{n-}S}/f_S) \tag{3}$$

where n-S stands for the conformations other than S (non-S); for instance, if S is α helix (H), n-S will be β strand (E) and coil (C) for a three-state prediction. It gives the extra information for S over the two others. It represents a kind of normalization in which the total number of amino acids, $N$, and the number of residues $R$, $f_R$, in the database have disappeared from the equation. In effect the positive hypothesis (S; R) and the complementary negative hypothesis (n-S; R) are treated in concert. This quantity is also called the one-residue information, single-residue information, or self-information.

When Eq. (3) is calculated for the three conformations, the conformation with the highest value is the predicted one, and that value is the propensity for the residue to be in that conformation, usually expressed in centinat units when natural logarithms are used. This underlines one of the differences from the Chou-Fasman propensities, which correspond approximately to the argument of the logarithm in Eq. (2).
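As a minimal sketch of Eq. (3) and of the single-residue decision rule, the following Python computes the information difference in centinats (hundredths of a nat) and picks the conformation with the highest value. The function names and the counts are hypothetical and chosen only to make the arithmetic visible.

```python
import math

def information_difference(f_sr, f_s, residue, state):
    """I(dS; R) in centinats, Eq. (3): log(f_SR / f_nSR) + log(f_nS / f_S)."""
    f_state_res = f_sr[(state, residue)]
    f_other_res = sum(c for (s, r), c in f_sr.items() if r == residue and s != state)
    f_state = f_s[state]
    f_other = sum(c for s, c in f_s.items() if s != state)
    # Zero counts would make the logarithms undefined; this sketch assumes none occur.
    nats = math.log(f_state_res / f_other_res) + math.log(f_other / f_state)
    return 100.0 * nats

def predict_single_residue(f_sr, f_s, residue):
    """Return the conformation with the highest information difference, and all scores."""
    scores = {s: information_difference(f_sr, f_s, residue, s) for s in f_s}
    return max(scores, key=scores.get), scores

# Invented counts for one residue type (A) and for the three conformations.
toy_f_sr = {("H", "A"): 60, ("E", "A"): 15, ("C", "A"): 25}
toy_f_s = {"H": 350, "E": 200, "C": 450}
best_state, scores = predict_single_residue(toy_f_sr, toy_f_s, "A")
print(best_state, {s: round(v, 1) for s, v in scores.items()})
```

Note that neither $N$ nor $f_R$ enters the calculation, in line with the normalization remark in the text.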

Equations (1) to (3) can be extended to a local sequence along the polypeptide chain of $n$ consecutive residues $R$:

$$I(\Delta S_j; R_1, \ldots, R_n) = \log[P(S_j, R_1, \ldots, R_n)/P(\text{n-}S_j, R_1, \ldots, R_n)] + \log[P(\text{n-}S)/P(S)] \tag{4}$$

where $P(S_j, R_1, \ldots, R_n)$ is the joint probability of the conformation $S$ at position $j$ in the sequence and the local sequence $R_1, \ldots, R_n$. One may remark that

$$P(S_j, R_1, \ldots, R_n) + P(\text{n-}S_j, R_1, \ldots, R_n) = 1 \tag{5}$$

and that

$$\frac{P(S_j, R_1, \ldots, R_n)}{P(\text{n-}S_j, R_1, \ldots, R_n)} = \frac{P(S)}{P(\text{n-}S)}\, e^{I(\Delta S_j; R_1, \ldots, R_n)} \tag{6}$$


In predicting a residue to be in one conformation, one can predict either the one having the highest value of the information with Eq. (4) or the highest probability value taken from Eqs. (5) and (6). Probability values have been used for the prediction of Ramachandran zones.¹⁵ They are more precise than confidence scales developed for other methods⁸,¹⁶ and underline the fact that the decision to predict the conformation of the highest probability leaves the possibility that the other conformations have a definite probability to occur which can be close to the highest, and thus should not necessarily be ruled out.
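The step from an information value to a probability via Eqs. (5) and (6) can be sketched as follows. The sketch reads Eq. (5) as a statement about probabilities given the local sequence (an interpretation, not spelled out in the text), so that the odds of Eq. (6) convert to a probability as odds/(1 + odds). The function name and the numbers are illustrative assumptions.

```python
import math

def probability_from_information(info_nats, prior_s):
    """Turn I(dS_j; R_1..R_n), in nats, into a probability for S using Eqs. (5) and (6).

    prior_s is P(S); P(n-S) is taken as 1 - P(S).
    """
    odds = prior_s / (1.0 - prior_s) * math.exp(info_nats)   # Eq. (6)
    return odds / (1.0 + odds)                                # uses Eq. (5)

# With zero information the probability falls back to the prior P(S).
p_prior = probability_from_information(0.0, 0.3)
# An information value of log(3) turns prior odds of 1:3 into even odds.
p_even = probability_from_information(math.log(3.0), 0.25)
print(round(p_prior, 3), round(p_even, 3))
```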

One faces a fundamental problem when calculating information values. One needs to estimate terms such as $P(S_j, R_1, \ldots, R_n)$ involving $n$ residues. It is impossible to evaluate such terms directly from the database (there are $20^n$ possible local sequences, far more than any database contains), so one must resort to various approximations. The different versions of the GOR method correspond to various types of approximations we have tried in an effort to improve the accuracy of the method.

Approximations Involved in GOR Method

The first GOR version,⁴ named GOR I, added to the single-residue information the so-called directional information of eight residues on each side of the residue to be predicted in the sequence. This limit of eight was not arbitrary but was based on studies of information content at increasing separations. To obtain the information measure, one starts by calculating from the database the frequency of each of the 20 amino acid residues at different positions, up to eight residues on the N-terminal and C-terminal side, when the central residue is observed in a given conformation but independently of the nature of that residue. In fact, in this approximation one assumes that there is no correlation between residues occurring at different positions in the window of 17 residues so defined. Then

$$I(\Delta S_j; R_1, \ldots, R_n) \approx I(\Delta S_j; R_j) + \sum_{m,\, m \neq 0} I(\Delta S_j; R_{j+m}) \tag{7}$$

where $j$ stands for any position $j$ in the amino acid sequence of which the conformation ought to be predicted and $m$ is between −8 (N-terminal side) and +8 (C-terminal side). The same version, GOR II, was updated with a new database in 1989.⁶ Both versions predicted four conformations, H, E, C, and T, with T for turns. Subsequent versions predicted only three conformations H, E, and C, although the method has no intrinsic limitation in the number and nature of conformations. The different turn types and the relative difficulty of distinguishing between them using the DSSP program¹⁷ led us to limit the prediction for the time being to three conformations so that helices and strands represent the major architectural structures of the conserved core of homologous proteins.

¹⁵ J. F. Gibrat, B. Robson, and J. Garnier, Biochemistry 30, 1578 (1991).
¹⁶ V. Biou, J. F. Gibrat, J. M. Levin, B. Robson, and J. Garnier, Protein Eng. 2, 185 (1988).
¹⁷ W. Kabsch and C. Sander, Biopolymers 22, 2577 (1983).
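The GOR I approximation of Eq. (7) is a sum of independent contributions over the 17-residue window, which the sketch below makes explicit. The table layout `info[(state, m, residue)]` (with `m = 0` holding the single-residue term), the treatment of missing entries as zero, and the skipping of window positions beyond the chain ends are all assumptions of this illustration, not details given in the text.

```python
def gor1_score(sequence, j, state, info, half_window=8):
    """Sum single-residue and directional information over the window, as in Eq. (7)."""
    total = 0.0
    for m in range(-half_window, half_window + 1):
        pos = j + m
        if 0 <= pos < len(sequence):   # assumption: ignore positions beyond the chain ends
            total += info.get((state, m, sequence[pos]), 0.0)
    return total

def gor1_predict(sequence, info, states=("H", "E", "C")):
    """Predict, for each position, the state with the highest summed information."""
    return "".join(
        max(states, key=lambda s: gor1_score(sequence, j, s, info))
        for j in range(len(sequence))
    )

# Hypothetical information values in centinats, keyed by (state, offset m, residue).
toy_info = {
    ("H", 0, "A"): 50.0, ("H", -1, "L"): 20.0, ("H", 1, "L"): 10.0,
    ("C", 0, "G"): 60.0, ("E", 0, "V"): 40.0,
}
toy_score = gor1_score("LAL", 1, "H", toy_info)   # 20 + 50 + 10
toy_states = gor1_predict("LAL", toy_info)
```

Because each window position contributes independently of the residue at position $j$, the table needs only 17 × 20 entries per conformation, which is what makes this level of approximation workable with a small database.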

The next level of approximation, introduced in the GOR III version,¹⁸ considered the correlation between the type of residues in the window and the type of the residue to be predicted. This version uses the so-called pair information:

$$I(\Delta S_j; R_1, \ldots, R_n) \approx I(\Delta S_j; R_j) + \sum_{m,\, m \neq 0} I(\Delta S_j; R_{j+m} \mid R_j) \tag{8}$$

The second term on the right-hand side of Eq. (8) is a conditional information.¹⁵ It involves the calculation from the database of pair frequencies of residues $R_j$ and $R_{j+m}$ with $R_j$ having the observed conformations $S_j$ and n-$S_j$ at position $j$, with a frequency of $f_{S_j, R_{j+m}, R_j}$ and $f_{\text{n-}S_j, R_{j+m}, R_j}$, respectively, but the conformation of residue $R_{j+m}$ is not taken into consideration. We have

$$I(\Delta S_j; R_{j+m} \mid R_j) = \log(f_{S_j, R_{j+m}, R_j}/f_{\text{n-}S_j, R_{j+m}, R_j}) + \log(f_{\text{n-}S_j, R_j}/f_{S_j, R_j}) \tag{9}$$

When this approximation was used, the database (at that time containing roughly 12,000 residues) was barely large enough to allow an easy calculation of terms for Eq. (9). Each of these terms involves two amino acids and a secondary structure conformation, so there are 20 × 20 × 3 = 1200 entries in the table. The average number of observations per entry was therefore 10. However, the amino acids are not all equiprobable; some like Trp or Met are rarer, and the number of observations for entries involving such amino acids was less than 10. As a consequence, the probabilities estimated using the ratio of such sparse frequencies were unreliable and were responsible for a decrease of the prediction accuracy. To circumvent this problem, we introduced so-called dummy frequencies. Readers interested in the precise definition of these dummy frequencies are referred to Refs. 15 and 17.
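The sparsity arithmetic above, and the dummy-frequency idea that the following paragraph describes in words, can be sketched as below. The linear blend and its `trust` weight are assumptions made only to illustrate "biasing the observed frequencies toward the uncorrelated ones"; the precise definition is the one given in the cited references, and the counts used here are invented.

```python
def expected_pair_count(f_neighbor, f_central_in_state, n_total):
    """Count expected if the neighbor residue and the (central residue, conformation)
    event were uncorrelated: product of the two frequencies divided by N."""
    return f_neighbor * f_central_in_state / n_total

def biased_pair_count(observed, expected, trust):
    """Hypothetical blend: trust=1 keeps the observed pair count, trust=0 uses only
    the uncorrelated expectation (which leads back to the Eq. (7) approximation)."""
    if not 0.0 <= trust <= 1.0:
        raise ValueError("trust must lie in [0, 1]")
    return trust * observed + (1.0 - trust) * expected

# Table size and average occupancy quoted in the text.
entries = 20 * 20 * 3                    # two amino acids and one of three conformations
print(entries, 12_000 / entries)         # 1200 10.0

# Invented example: 200 Met in the database, 60 helical Trp, 12,000 residues in total,
# and 2 observed Met(j-1)/helical-Trp(j) pairs.
expected = expected_pair_count(200, 60, 12_000)
blended = biased_pair_count(2, expected, trust=0.5)
```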

Dummy frequencies amount to the following considerations. Let us assume that we observe in the database two occurrences of a Met at position $j - 1$ when the residue at $j$ is a Trp in helical conformation. We can calculate easily the expected number we would observe if the two events were uncorrelated, namely, this is the frequency of Met at position $j - 1$ multiplied by the frequency of Trp at position $j$ having a helical conformation divided by the total number of residues. This number is relatively reliable since it is calculated using frequencies that involve only one residue (which thus are greater than frequencies for pairs by a factor of 20, on average). Now we can ask the question, How much do we trust the frequencies for pairs we observed in the database? If we trust them 100%, we just use these frequencies in the calculations of information values. If we mistrust them 100%, we can always use the numbers calculated assuming that the events are independent, but then we are back to the approximation of Eq. (7). In fact, empirically, to improve the prediction accuracy we need to consider an intermediate stage in which we bias the observed frequencies toward the calculated (uncorrelated) ones by a given amount.

¹⁸ J. F. Gibrat, J. Garnier, and B. Robson, J. Mol. Biol. 198, 425 (1987).

The database available (see Table I) now contains about 63,000 residues, so the average number of observations for pairs per entry is 50. This is large enough for us to compute terms involving pairs of residues without the need for introducing dummy frequencies. Note, however, that this new database does not allow the calculation of terms involving triplets of amino acids. We thus decided to include more pairs in our description of the window of 17 residues. Instead of considering only the 16 pairs $R_{j+m}, R_j$, with $m$ varying from −8 to +8 and $m \neq 0$, that is, the pair formed by each residue in the window and the central one, we consider all the possible pairs in the window [there are (17 × 16)/2 = 136 such pairs]. We thus have used for the results presented below the following approximation (GOR IV version):

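$$\log\frac{P(S_j, \text{LocSeq})}{P(\text{n-}S_j, \text{LocSeq})} = \frac{2}{17}\sum_{\substack{m,n=-8 \\ n>m}}^{+8} \log\frac{P(S_j, R_{j+m}, R_{j+n})}{P(\text{n-}S_j, R_{j+m}, R_{j+n})} - \frac{15}{17}\sum_{m=-8}^{+8} \log\frac{P(S_j, R_{j+m})}{P(\text{n-}S_j, R_{j+m})} \tag{10}$$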

where LocSeq stands for the local sequence $R_1, \ldots, R_n$ of 17 residues around the residue to be predicted. The values of $P(S_j, \text{LocSeq})$ from Eq. (10) for the three conformations are then used directly for the predictions instead of using probabilities calculated from Eqs. (5) and (6) with the information value calculated from Eq. (9).
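A minimal sketch of the right-hand side of Eq. (10) follows: a weighted sum over all 136 residue pairs in the window minus a weighted sum over the 17 single residues. The callables `pair_log_ratio` and `single_log_ratio` are hypothetical stand-ins for lookups in tables estimated from the database, and restricting the sketch to positions with a complete window is a simplifying assumption.

```python
from itertools import combinations

def gor4_log_odds(sequence, j, state, pair_log_ratio, single_log_ratio, half_window=8):
    """Evaluate the right-hand side of Eq. (10) for position j and conformation `state`.

    pair_log_ratio(state, m, n, res_m, res_n) -> log[P(S, R_j+m, R_j+n) / P(n-S, R_j+m, R_j+n)]
    single_log_ratio(state, m, res_m)         -> log[P(S, R_j+m) / P(n-S, R_j+m)]
    """
    if j - half_window < 0 or j + half_window >= len(sequence):
        raise ValueError("this sketch only handles positions with a full window")
    offsets = range(-half_window, half_window + 1)
    width = 2 * half_window + 1                       # 17 residues
    pair_sum = sum(
        pair_log_ratio(state, m, n, sequence[j + m], sequence[j + n])
        for m, n in combinations(offsets, 2)          # all (17 x 16)/2 pairs with n > m
    )
    single_sum = sum(single_log_ratio(state, m, sequence[j + m]) for m in offsets)
    return (2.0 / width) * pair_sum - ((width - 2.0) / width) * single_sum

# Sanity check of the weights: if every log ratio equals 1, the result is
# (2/17) * 136 - (15/17) * 17 = 16 - 15 = 1.
weight_check = gor4_log_odds("A" * 17, 8, "H", lambda *args: 1.0, lambda *args: 1.0)
```

The weight check shows why the single-residue sum is subtracted: each residue appears in 16 pairs, so without the correction its contribution would be counted many times over.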

Database and Results

We used a database of 267 protein structures having a resolution better than 2.5 Å with an R factor less than 25% and whose length is greater than 50 residues (see Table I). There is no pair of proteins with an identity above 30%. The prediction is carried out using a jackknife: the protein to be predicted is removed from the database, the parameters are estimated using the 266 remaining proteins, and the prediction is done using these parameters. As mentioned above, this database is large enough that we do not need to use dummy frequencies anymore. Moreover, we do not use decision constants to adjust the predicted number of secondary structures to the observed number in the database. In fact, there is no optimization of any sort; we just estimate the probabilities according to Eq. (10) from the frequencies observed in the database.
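The jackknife procedure described above can be sketched as a leave-one-protein-out loop. The `train` and `predict` callables, the dictionary layout of a protein, and the toy stand-ins below are assumptions for illustration; they are not the chapter's program.

```python
def jackknife(proteins, train, predict):
    """Leave-one-protein-out evaluation.

    proteins : list of dicts with at least "id" and "sequence" keys
    train    : callable(list of proteins) -> parameters
    predict  : callable(parameters, sequence) -> predicted states
    """
    results = {}
    for held_out in proteins:
        training_set = [p for p in proteins if p is not held_out]
        params = train(training_set)          # estimated without the held-out protein
        results[held_out["id"]] = predict(params, held_out["sequence"])
    return results

# Toy stand-ins: "training" simply picks the most frequent observed state.
def toy_train(training_set):
    observed = "".join(p["structure"] for p in training_set)
    return max(sorted(set(observed)), key=observed.count)

def toy_predict(state, sequence):
    return state * len(sequence)

toy_proteins = [
    {"id": "p1", "sequence": "AAAAAA", "structure": "HHHHHH"},
    {"id": "p2", "sequence": "GGGG", "structure": "CCCC"},
    {"id": "p3", "sequence": "GG", "structure": "CC"},
]
toy_results = jackknife(toy_proteins, toy_train, toy_predict)
```

With 267 proteins this means 267 separate parameter estimations, each on 266 proteins, so no protein contributes to the parameters used to predict it.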


TABLE I
DATABASE PROTEINSᵃ

1aaj.x  1aak.x  1aap.a  1aba.x  1abk.x  1abm.a  1add.x
1ads.x  1alk.a  1aoz.a  1apa.x  1apm.e  1arb.x  1atr.x
1avh.a  1ayh.x  1bab.a  1bbh.a  1bbp.a  1bet.x  1bge.a
1bll.e  1bmd.a  1bov.a  1bpb.x  1brs.d  1btc.x  1c2r.a
1caj.x  1cau.a  1cau.b  1cde.x  1cdt.a  1cew.i  1cgt.x
1chm.a  1cmb.a  1cob.a  1col.a  1cpc.a  1cpc.b  1cpt.x
1crl.x  1cse.i  1ctf.x  1ctm.x  1cus.x  1ddt.x  1dhr.x
1dog.x  1dsb.a  1eaf.x  1eco.x  1ede.x  1end.x  1epa.a
1fba.a  1fdd.x  1fha.x  1fia.a  1fkb.x  1fna.x  1fnr.x
1fxi.a  1gal.x  1gd1.o  1gdh.a  1gky.x  1glt.x  1gmf.a
1gof.x  1gox.x  1gp1.a  1gpb.x  1gpr.x  1gsr.a  1hbq.x
1hdx.a  1hiv.a  1hlb.x  1hle.a  1hmy.x  1hoe.x  1hpl.a
1hrh.a  1hsl.a  1huw.x  1ifc.x  1ipd.x  1isu.a  1ith.a
1l29.x  1le4.x  1len.a  1lga.a  1lis.x  1lla.x  1lmb.3
1lts.a  1lts.d  1mdc.x  1mgn.x  1min.a  1min.b  1mjc.x
1mpp.x  1mup.x  1nar.x  1nba.a  1ndk.x  1noa.x  1nsb.a
1nxb.x  1ofv.x  1olb.a  1omf.x  1omp.x  1onc.x  1osa.x
1pda.x  1pfk.a  1pgb.x  1pgd.x  1phh.x  1php.x  1pii.x
1plf.a  1poc.x  1poh.x  1pox.a  1ppa.x  1ppf.e  1ppf.i
1ppn.x  1prc.c  1prc.h  1prc.l  1prc.m  1pts.a  1pya.a
1pya.b  1pyd.a  1rcb.x  1rec.x  1rib.a  1rnd.x  1rop.a
1rve.a  1s01.x  1sac.a  1sbp.x  1ses.a  1sgt.x  1sha.a
1shf.a  1sim.x  1slt.b  1snc.x  1spa.x  1stf.i  1tbe.a
1tca.x  1tie.x  1tml.x  1tnd.a  1tpl.a  1trb.x  1trk.a
1tro.a  1ttb.a  1utg.x  1vaa.a  1vaa.b  1vmo.a  1wht.a
1wht.b  1wsy.a  1wsy.b  1yhb.x  1zaa.c  256b.a  2aai.b
2aza.a  2bop.a  2ccy.a  2cdv.x  2chs.a  2cmd.x  2cp4.x
2cpl.x  2cro.x  2ctc.x  2cts.x  2cyp.x  2dnj.a  2er7.e
2hbg.x  2hhm.a  2hip.a  2hpd.a  2ihl.x  2lh2.x  2liv.x
2mhr.x  2mnr.x  2msb.a  2mta.c  2mta.h  2mta.l  2pfl.x
2pia.x  2pol.a  2por.x  2reb.x  2rn2.x  2rsl.a  2sar.a
2sas.x  2scp.a  2sga.x  2sn3.x  2spc.a  2tgi.x  2tmd.a
2tpr.a  2tsc.a  3aah.a  3aah.b  3adk.x  3b5c.x  3cd4.x
3chy.x  3cla.x  3cox.x  3dfr.x  3eca.a  3gap.a  3gbp.x
3ink.c  3rub.l  3rub.s  3sdh.a  3tgl.x  451c.x  4blm.a
4enl.x  4fgf.x  4gcr.x  4ts1.a  4xis.x  5fbp.a  5p21.x
5tim.a  6fab.h  6fab.l  6taa.x  8abp.x  8acn.x  8atc.a
8atc.b  8cat.a  8i1b.x  8rxn.a  8tln.e  9ldt.a  9rnt.x
9wga.a

ᵃ The database was prepared by J. M. Levin and checked for homologous sequences with the help of V. Di Francesco. This database has been modified to restore the total length of the sequences as defined in the SEQRES field of the Protein Data Bank (PDB) file (the DSSP program omits residues whose coordinates are missing in the PDB file, and thus if this occurs in the middle of the polypeptide chain it is split into two or more chains). Residues having no coordinates were assigned the conformation X and were not taken into account for the prediction accuracy although the prediction was done with the whole sequence length. The PDB code is followed by the chain name a, b, c, d, h (heavy), l (light), x (one chain only), e (enzyme), or i (inhibitor).


TABLE II
GLOBAL RESULTS FOR DATABASE PREDICTION

                          Observed
Predicted          H          E          C      Total
H             14,460      3,094      4,790     22,344
E              1,124      4,965      2,089      8,178
C              6,002      5,546     21,496     33,044
Total         21,586     13,605     28,375     63,566
Qprdᵃ (%)       64.7       60.7       65.1
Qobsᵇ (%)       67.0       36.5       75.8

Q3ᶜ = 64.4%

ᵃ Number of correctly predicted residues/number of predicted residues.
ᵇ Number of correctly predicted residues/number of observed residues.
ᶜ Total number of correctly predicted residues/total number of residues.
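The three accuracy measures defined in the footnotes of Table II can be recomputed from the counts in the table. In the sketch below the counts are copied from Table II, read with rows as predicted and columns as observed conformations (a reading that the row and column totals confirm); the function name is an assumption.

```python
# counts[predicted][observed], copied from Table II
counts = {
    "H": {"H": 14460, "E": 3094, "C": 4790},
    "E": {"H": 1124, "E": 4965, "C": 2089},
    "C": {"H": 6002, "E": 5546, "C": 21496},
}
states = ("H", "E", "C")

def accuracy_measures(counts, states):
    """Return (Q3, Qprd per state, Qobs per state), all in percent."""
    total = sum(counts[p][o] for p in states for o in states)
    correct = sum(counts[s][s] for s in states)
    q_prd = {s: 100.0 * counts[s][s] / sum(counts[s][o] for o in states) for s in states}
    q_obs = {s: 100.0 * counts[s][s] / sum(counts[p][s] for p in states) for s in states}
    return 100.0 * correct / total, q_prd, q_obs

q3, q_prd, q_obs = accuracy_measures(counts, states)
print(round(q3, 1))                                # 64.4
print({s: round(v, 1) for s, v in q_prd.items()})  # {'H': 64.7, 'E': 60.7, 'C': 65.1}
print({s: round(v, 1) for s, v in q_obs.items()})  # {'H': 67.0, 'E': 36.5, 'C': 75.8}
```

The low Qobs for E (36.5%) next to its Qprd (60.7%) shows that strands are underpredicted: 8178 residues are predicted as E against 13,605 observed.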

However, this sometimes leads to predictions that are not physically meaningful, for example, helices having only two residues, or mixtures of strand (E) and helix (H) residues. Several attempts have been made to solve that problem (see Rost and Sander, Ref. 8, and Zimmermann, Ref. 19). Here we added a simple filter after the prediction which requires helices to be at least four residues and strands to be at least two residues. For instance, let us assume that we predict two isolated H residues. We then look for all the possibilities of extension of the two H residues (in this case, there are three possibilities: adding two H's before, adding one H before and one H after, and adding two H's after the predicted H's). We then calculate the product of the probabilities of the different secondary structures for the three segments so defined. The segment that is the most probable is selected, leading either to an extension of the helix to four residues or to the suppression of the two isolated residues. Although this filter affects the prediction of particular proteins, on average for the whole database it has no effect on the prediction accuracy; it neither improves nor decreases the percentage of correctly predicted residues.
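The filter just described can be sketched in Python as follows. This is one possible reading of the text, not the code of the GOR program. Three points are assumptions: every window of the minimum length that contains the short run is scored for all three states, a suppressed run is reassigned to coil, and the filter makes a single pass without rechecking neighboring segments. All names are hypothetical.

```python
from math import prod

# Minimum segment lengths stated in the text; coil has no minimum.
MIN_LEN = {"H": 4, "E": 2}

def runs(pred):
    """Yield (start, end, state) for each maximal run; end is exclusive."""
    start = 0
    for i in range(1, len(pred) + 1):
        if i == len(pred) or pred[i] != pred[start]:
            yield start, i, pred[start]
            start = i

def filter_short_segments(pred, probs):
    """pred: string over 'H', 'E', 'C'; probs: one dict of pH, pE, pC per residue."""
    out = list(pred)
    n = len(pred)
    for start, end, state in runs(pred):
        need = MIN_LEN.get(state, 0)
        if end - start >= need:
            continue
        # Every window of the minimum length that contains the short run.
        first = max(0, end - need)
        last = min(start, n - need)
        best = None  # (probability product, window start, state)
        for w in range(first, last + 1):
            for s in "HEC":
                score = prod(p[s] for p in probs[w:w + need])
                if best is None or score > best[0]:
                    best = (score, w, s)
        if best is not None and best[2] == state:
            w = best[1]
            out[w:w + need] = state * need        # extend to the minimum length
        else:
            out[start:end] = "C" * (end - start)  # assumption: suppress to coil
    return "".join(out)

# Made-up probabilities for illustration only.
low = {"H": 0.1, "E": 0.1, "C": 0.8}
mid = {"H": 0.45, "E": 0.05, "C": 0.5}
high = {"H": 0.7, "E": 0.1, "C": 0.2}
print(filter_short_segments("CCHHCCCC", [low, mid, high, high, mid, low, low, low]))
# CHHHHCCC
```

For the two isolated H residues in the usage line, the loop visits exactly the three windows listed in the text (two residues added before, one on each side, two added after).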

The global results for the database are shown in Table II. The percentage of correctly predicted residues is 64.4%, and an individual prediction output of the program is given in Table III with an extra column to compare with

19 K. Zimmermann, Protein Eng. 7, 1197 (1994).


TABLE III
PREDICTION OF EGLIN^a

Seq  Obs  Prd   pH    pE    pC
T    X    C    0.00  0.00  1.00
E    X    C    0.00  0.02  0.98
F    X    C    0.00  0.08  0.92
G    X    C    0.01  0.13  0.87
S    X    C    0.02  0.14  0.84
E    X    C    0.04  0.24  0.72
L    X    C    0.09  0.33  0.59
K    C    C    0.15  0.24  0.61
S    C    C    0.19  0.16  0.65
F    C    C    0.12  0.12  0.77
P    C    C    0.29  0.12  0.59
E    C    C    0.35  0.26  0.39
V    C    C    0.37  0.30  0.34
V    C    C    0.35  0.24  0.41
G    C    C    0.25  0.20  0.55
K    C    C    0.26  0.27  0.47
T    C    C    0.24  0.34  0.42
V    H    C    0.37  0.19  0.44
D    H    H    0.54  0.08  0.38
Q    H    H    0.58  0.11  0.30
A    H    H    0.61  0.11  0.28
R    H    H    0.56  0.19  0.25
E    H    H    0.50  0.27  0.24
Y    H    H    0.46  0.35  0.19
F    H    H    0.34  0.44  0.22
T    H    H    0.29  0.38  0.32
L    H    C    0.20  0.35  0.44
H    H    C    0.09  0.22  0.69
Y    C    C    0.03  0.11  0.86
P    C    C    0.05  0.06  0.89
Q    C    C    0.09  0.15  0.76
Y    C    C    0.08  0.29  0.63
N    E    C    0.07  0.31  0.61
V    E    E    0.06  0.65  0.30
Y    E    E    0.04  0.75  0.21
F    E    E    0.02  0.76  0.22
L    E    C    0.01  0.30  0.69
P    E    C    0.02  0.09  0.88
E    C    C    0.02  0.03  0.95
G    C    C    0.01  0.01  0.98
S    C    C    0.01  0.02  0.97
P    C    C    0.09  0.12  0.79
V    E    C    0.18  0.33  0.49
T    E    H    0.23  0.51  0.26
L    C    H    0.36  0.46  0.18
D    C    H    0.38  0.33  0.29
L    C    H    0.51  0.17  0.32
R    C    C    0.39  0.16  0.46
Y    C    C    0.33  0.22  0.46
N    C    C    0.24  0.18  0.57
R    E    C    0.20  0.31  0.49
V    E    E    0.17  0.57  0.26
R    E    E    0.12  0.71  0.17
V    E    E    0.07  0.80  0.13
F    E    E    0.05  0.71  0.25
Y    E    E    0.03  0.49  0.48
N    E    C    0.01  0.12  0.87
P    C    C    0.01  0.03  0.96
G    C    C    0.01  0.03  0.96
T    C    C    0.02  0.10  0.88
N    C    C    0.03  0.29  0.68
V    E    E    0.04  0.54  0.41
V    E    E    0.05  0.68  0.27
N    C    E    0.02  0.71  0.27
H    C    E    0.01  0.58  0.41
V    C    C    0.01  0.22  0.78
P    C    C    0.01  0.09  0.91
H    E    C    0.00  0.02  0.98
V    E    C    0.00  0.00  1.00
G    C    C    0.00  0.00  1.00

^a Amino acid sequence (Seq) of eglin, a subtilisin inhibitor (1cse), with observed conformations (Obs), predicted conformations with filter (Prd), and the probability values pH, pE, and pC for the predicted α helix (H), β strands (E), and coil (C), respectively. The conformation X corresponds to residues for which the crystallographer gave no coordinates. For some residues, for instance, V-13, although the probability for H is higher, the filter assigned a coil (see text). The accuracy of prediction for the three conformations (Q3) is 73%.
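As a check on the Q3 value quoted in the footnote, the Obs and Prd columns of Table III can be compared residue by residue. The Python sketch below does this with hypothetical helper names; it assumes that the residues marked X, which have no observed conformation, are left out of the count, which is the convention that reproduces the quoted 73%.

```python
def expand(runs):
    """Build a state string from (state, length) pairs."""
    return "".join(state * length for state, length in runs)

# Obs and Prd columns of Table III, run-length encoded (70 residues each).
OBS = expand([("X", 7), ("C", 10), ("H", 11), ("C", 4), ("E", 6), ("C", 4),
              ("E", 2), ("C", 6), ("E", 7), ("C", 4), ("E", 2), ("C", 4),
              ("E", 2), ("C", 1)])
PRD = expand([("C", 18), ("H", 8), ("C", 7), ("E", 3), ("C", 7), ("H", 4),
              ("C", 4), ("E", 5), ("C", 5), ("E", 4), ("C", 5)])

def q3(obs, prd, skip="X"):
    """Fraction of residues with prd == obs, ignoring residues marked skip."""
    pairs = [(o, p) for o, p in zip(obs, prd) if o != skip]
    return sum(o == p for o, p in pairs) / len(pairs)

print(f"{100 * q3(OBS, PRD):.0f}%")  # 73%  (46 of 63 residues)
```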

the observed conformations. Averaged over individual proteins, the accuracy is 64.7% with a standard deviation of 9.3%. Figure 1a shows the number of proteins as a function of the percentage of correctly predicted residues. Figure 1b is just a check that this distribution does not depart significantly from a Gaussian distribution.

As suggested by Levin (Ref. 20), Fig. 2 shows a scatter plot of the percentage of correctly predicted residues as a function of the size of the protein

20 J. M. Levin, to be published.

[Figure 1, two panels. (a) Histogram; x axis: percentage of correctly predicted residues (40 to 100). (b) Normal quantile-quantile plot; x axis: quantiles of standard normal (-3 to 3).]

FIG. 1. (a) Histogram of secondary structure prediction accuracies (Q3) for all the proteins of the database. (b) Normal quantile-quantile plot of the results showing the agreement of the distribution with a normal distribution. The individual values of Q3 for each protein of the database are sorted according to the quantiles of a normal distribution with a mean of 64.4% and a standard deviation of 9.3%. The dashed line is the least-squares fit of the points.
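The construction behind a normal quantile-quantile plot such as Fig. 1b can be sketched in Python with the standard library. The per-protein Q3 values are not listed in this chapter, so the example uses a synthetic sample generated from the mean and standard deviation quoted in the caption. The plotting position (i + 0.5)/n and the function names are assumptions made for illustration.

```python
from statistics import NormalDist

def normal_qq_points(values):
    """Pair each sorted value with the standard normal quantile of its rank."""
    n = len(values)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    quantiles = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(quantiles, sorted(values)))

def least_squares(points):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    return slope, my - slope * mx

# Synthetic, perfectly normal sample (not database values).
synthetic = [NormalDist(64.4, 9.3).inv_cdf((i + 0.5) / 9) for i in range(9)]
slope, intercept = least_squares(normal_qq_points(synthetic))
print(f"slope={slope:.1f} intercept={intercept:.1f}")  # slope=9.3 intercept=64.4
```

For a sample that is close to normal, the points fall near a straight line whose slope estimates the standard deviation and whose intercept estimates the mean.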


[Figure 2. Scatter plot; x axis: percentage of correctly predicted residues (40 to 90).]

FIG. 2. Distribution of the sequence length (number of amino acids) of the database proteins as a function of the percentage of correctly predicted residues.

(number of residues). There seems to be no apparent effect of the length of the protein on the accuracy of prediction, except that the longer the protein, the closer the accuracy comes to the average value. For proteins of fewer than 200 residues the accuracy can lie anywhere between 40 and 90%. A consequence of this observation is that no protein is easier to predict than any other, but rather there are segments of the sequence that are easier to predict; thus the shorter the protein, the more likely its accuracy of prediction will differ from the mean, and conversely for the longest ones.

Table IV shows results for the prediction of secondary structure segments, namely, helices and strands. The percentage of correctly predicted segments is given according to the minimum percentage of overlap that is allowed between the predicted and observed segments. For instance, in the 75% row of Table IV a segment is considered correctly predicted if the predicted segment overlaps with at least 75% of the observed segment (counted as the number of residues).
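The overlap criterion can be sketched in Python as follows. The text does not spell out every detail, so two points are assumptions: the overlap is measured against the single best-matching predicted segment of the same state, and the percentage is taken relative to the number of observed segments. The function names and the two short strings are made up for illustration.

```python
def segments(ss, state):
    """Return (start, end) pairs, end exclusive, for each run of state in ss."""
    found, start = [], None
    for i, s in enumerate(ss + "-"):  # sentinel closes a trailing run
        if s == state and start is None:
            start = i
        elif s != state and start is not None:
            found.append((start, i))
            start = None
    return found

def fraction_recovered(obs, prd, state, min_overlap):
    """Share of observed segments covered to at least min_overlap (0 to 1)
    by a single predicted segment of the same state."""
    observed = segments(obs, state)
    predicted = segments(prd, state)
    hits = 0
    for o_start, o_end in observed:
        best = max((max(0, min(o_end, p_end) - max(o_start, p_start))
                    for p_start, p_end in predicted), default=0)
        if best / (o_end - o_start) >= min_overlap:
            hits += 1
    return hits / len(observed) if observed else 0.0

# Made-up example: an 8-residue helix predicted over 6 residues,
# a 4-residue strand predicted over 2 residues.
obs = "CCHHHHHHHHCCEEEECC"
prd = "CCCHHHHHHCCCCEECCC"
print(fraction_recovered(obs, prd, "H", 0.75))  # 1.0
print(fraction_recovered(obs, prd, "E", 0.75))  # 0.0
print(fraction_recovered(obs, prd, "E", 0.50))  # 1.0
```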

Conclusion

Through the successive incorporation of observed frequencies of single residues, then of pairs of residues, on a local sequence of 17 residues, the accuracy of


TABLE IV
RESULTS FOR SEGMENTS: HELICES AND STRANDS

              Number of segments       Average length
              H       E       Total    H       E
Observed      1989    2587    4576     10.9    5.9
Predicted     2148    2043    4191     10.6    4.1

Overlap       H segments (%)    E segments (%)
75%           51.1              23.7
50%           70.0              42.0
25%           75.7              50.2

the GOR method has been improved from about 55% (GOR I using a jackknife, Ref. 21) up to 64.4%. The increase of the database size from 67 proteins to the present database of 267 proteins and the use of a more detailed description of the local sequence resulted in an improvement of about 1% (GOR III, Q3 = 63.3%; GOR IV, Q3 = 64.4%; the corresponding standard deviations of Q3 are 0.8% and 0.6%, respectively). However, the result of 63.3% for GOR III was reached using dummy frequencies and adding decision constants that we now believe resulted in a slight bias toward the database we were then using. This causes the overall accuracy of the method to be overestimated by a percent or so, as became apparent when parameters derived from the original database were used to predict new sets of proteins. This is the reason why, here, we avoided the use of decision constants (e.g., to adjust the number of predicted secondary structures to what is observed in the database) in order to obtain a more robust estimation of the accuracy of the method. This small increase in the prediction accuracy is consistent with a previous assumption we made (Ref. 15). We estimated that we were able to extract more or less all the information available in the local sequence. Other published methods including neural net methods, using only the protein sequence, are of similar or lower accuracy (see Refs. 2 and 3).

We attributed the limitation in the accuracy of the prediction (Ref. 15) to the lack of long-distance effects. This appears to be confirmed here by the poor quality of the β-strand prediction. Because β sheets require the pairing of residues that may be distant along the sequence, this secondary structure is presumably more dependent on long-range interactions than are α helices or coils. Clearly the method does not fare well with the prediction of β strands. Although Qprd for β strands is only slightly lower than Qprd of the

21 W. Kabsch and C. Sander, FEBS Lett. 155, 179 (1983).


two other secondary structures, there is a chronic underprediction of this structure; Qobs for β strands is thus significantly lower than Qobs for helices and coils (even taking into account the overestimation of the corresponding Qobs values, which is a consequence of the overprediction of helices and coils). This is also noticeable in the fact that, whereas the average lengths of predicted and observed helices correspond closely, the average length of predicted strands is shorter by about one-third compared to the average length of the observed ones.

It is thus very important to consider the possibilities of including long-range interactions in our method (or other methods using only short-range information, that is, local sequence only, for that matter). One way to introduce long-distance effects is to use specific nonlocal pairs to improve β-strand prediction (Ref. 22). In other words, the prediction of β strands could be done using the local window as usual to first select putative β-strand segments and then one could slide a window along the sequence to look whether complementary segments to these putative β-strand segments can be found. Another possibility is to use multiple alignments. This is based on the assumption that corresponding residues in the alignment, provided the alignment is correct, will have the same secondary structure, being at the same location in the fold. The use of multiple alignments has recently been the source of an improvement of the accuracy ranging from 5, for the GOR (Ref. 7) and the quadratic logistic (Ref. 9) methods, up to 10 percentage points for a neural network method (Ref. 8).

The GOR method has the advantage over neural network-based methods or nearest-neighbor methods in that it clearly identifies what is taken into account for the prediction and what is neglected. Moreover, the method provides estimates of probabilities for the three secondary structures at each residue position, which can be useful for further application of the method (Ref. 15). It relies only on observed frequencies in the database; thus, the calculation of the parameters is straightforward and easy to update.

Availability

The corresponding program has been written in the C language and currently runs on a platform with a UNIX operating system (but it will run equally well on other operating systems). It can be obtained by anonymous ftp at NCBI (National Center for Biotechnology Information) using the following procedure: ftp ncbi.nlm.nih.gov, move to the directory gibrat/GOR. It is also available at INRA-Jouy-en-Josas: ftp locus.jouy.inra.fr, move to directory /pub/protein/GOR.

22 T. J. P. Hubbard, 27th Annual Hawaii International Conference on System Sciences (L. Hunter, ed.), IEEE Computer Society Press, Los Alamos, 336 (1994).