RTH · ondary structure of the proteins could not be assigned by the program signment DSSP (Kabsch...

11
ISMB-99, 95-105 Using Sequence Motifs for Enhanced Neural Network Prediction of Protein Distance Constraints Jan Gorodkin 12 , Ole Lund 1 , Claus A. Andersen 1 , and Søren Brunak 1 1 Center for Biological Sequence Analysis, Department of Biotechnology, The Technical University of Denmark, Building 208, DK-2800 Lyngby, Denmark [email protected], [email protected], [email protected], [email protected] Phone: +45 45 25 24 77, Fax: +45 45 93 15 85 2 Department of Genetics and Ecology, The Institute of Biological Sciences University of Aarhus, Building 540, Ny Munkegade, DK-8000 Aarhus C, Denmark Abstract Correlations between sequence separation (in residues) and distance (in Angstrom) of any pair of amino acids in polypep- tide chains are investigated. For each sequence separation we define a distance threshold. For pairs of amino acids where the distance between C α atoms is smaller than the threshold, a characteristic sequence (logo) motif, is found. The mo- tifs change as the sequence separation increases: for small separations they consist of one peak located in between the two residues, then additional peaks at these residues appear, and finally the center peak smears out for very large separa- tions. We also find correlations between the residues in the center of the motif. This and other statistical analyses are used to design neural networks with enhanced performance compared to earlier work. Importantly, the statistical anal- ysis explains why neural networks perform better than sim- ple statistical data-driven approaches such as pair probability density functions. The statistical results also explain charac- teristics of the network performance for increasing sequence separation. The improvement of the new network design is significant in the sequence separation range 10–30 residues. Finally, we find that the performance curve for increasing se- quence separation is directly correlated to the corresponding information content. A WWW server, distanceP, is available at http://www.cbs.dtu.dk/services/distanceP/. Keywords: Distance prediction; sequence motifs; distance constraints; neural network; protein structure. Introduction Much work have over the years been put into approaches which either analyze or predict features of the three- dimensional structure using distributions of distances, cor- related mutations, and more lately neural networks, or com- binations of these e.g. (Tanaka & Scheraga 1976; Miyazawa & Jernigan 1985; Bohr et al. 1990; Sippl 1990; Maiorov & Crippen 1992; Göbel et al. 1994; Mirny & Shaknovich 1996; Thomas, Casari, & Sander 1996; Lund et al. 1997; Olmea & Valencia 1997; Skolnick, Kolinski, & Ortiz 1997; Present address: Structural Bioinformatics Advanced Tech- nologies A/S, Agern Alle 3, DK-2970 Hørsholm, Denmark Copyright c 1999, American Association for Artificial Intelli- gence (www.aaai.org). All rights reserved. Fariselli & Casadio 1999). The ability to adopt structure from sequences depends on constructing an appropriate cost function for the native structure. In search of such a function we here concentrate on finding a method to predict distance constraints that correlate well with the observed distances in proteins. As the neural network approach is the only ap- proach so far which includes sequence context for the con- sidered pair of amino acids, these are expected not only to perform better, but also to capture more features relating dis- tance constraints and sequence composition. The analysis include investigation of the distances be- tween amino acid as well as sequence motifs and corre- lations for separated residues. We construct a prediction scheme which significantly improve on an earlier approach (Lund et al. 1997). For each particular sequence separation, the correspond- ing distance threshold is computed as the average of all physical distances in a large data set between any two amino acids separated by that amount of residues (Lund et al. 1997). Here, we include an analysis of the distance dis- tributions relative to these thresholds and use them to ex- plain qualitative behavior of the neural network prediction scheme, thus extending earlier studies (Reese et al. 1996). For the prediction scheme used here it is essential to relate the distributions to their means. Analysis of the network weight composition reveal intriguing properties of the dis- tance constraints: the sequence motifs can be decomposed into sub-motifs associated with each of the hidden units in the neural network. Further, as the sequence separation increases there is a clear correspondence in the change of the mean value, dis- tance distributions, and the sequence motifs describing the distance constraints of the separated amino acids, respec- tively. The predicted distance constraints may be used as inputs to threading, ab initio, or loop modeling algorithms. Materials and Method Data extraction The data set was extracted from the Brookhaven Protein Data Bank (Bernstein et al. 1977), release 82 containing 5762 proteins. In brief entries were excluded if: (1) the sec-

Transcript of RTH · ondary structure of the proteins could not be assigned by the program signment DSSP (Kabsch...

Page 1: RTH · ondary structure of the proteins could not be assigned by the program signment DSSP (Kabsch & Sander 1983), as the DSSP as-isused to quantify thesecondary structure identity

ISMB-99, 95-105

UsingSequenceMotifs for EnhancedNeural NetworkPrediction of Protein DistanceConstraints

Jan Gorodkin1 � 2, Ole Lund1 �, ClausA. Andersen1, and SørenBrunak1

1Centerfor BiologicalSequenceAnalysis,Departmentof Biotechnology,TheTechnicalUniversityof Denmark,Building 208,DK-2800Lyngby, Denmark

[email protected],[email protected],[email protected],[email protected]:+454525 2477,Fax: +45459315 85

2Departmentof GeneticsandEcology, TheInstituteof BiologicalSciencesUniversityof Aarhus,Building 540,Ny Munkegade,DK-8000AarhusC, Denmark

Abstract

Correlationsbetweensequenceseparation(in residues)anddistance(in Angstrom)of any pairof aminoacidsin polypep-tidechainsareinvestigated.For eachsequenceseparationwedefinea distancethreshold.For pairsof aminoacidswherethedistancebetweenCα atomsis smallerthanthethreshold,a characteristicsequence(logo) motif, is found. The mo-tifs changeas the sequenceseparationincreases:for smallseparationsthey consistof onepeaklocatedin betweenthetwo residues,thenadditionalpeaksat theseresiduesappear,andfinally thecenterpeaksmearsout for very largesepara-tions. We alsofind correlationsbetweenthe residuesin thecenterof the motif. This and other statisticalanalysesareusedto designneuralnetworks with enhancedperformancecomparedto earlier work. Importantly, the statisticalanal-ysis explainswhy neuralnetworks performbetterthansim-plestatisticaldata-drivenapproachessuchaspairprobabilitydensityfunctions.Thestatisticalresultsalsoexplain charac-teristicsof thenetwork performancefor increasingsequenceseparation.The improvementof the new network designissignificantin thesequenceseparationrange10–30residues.Finally, wefind thattheperformancecurve for increasingse-quenceseparationis directly correlatedto thecorrespondinginformationcontent.A WWW server, distanceP, is availableathttp://www.cbs.dtu.dk/services/distanceP/.

Keywords: Distanceprediction;sequencemotifs; distanceconstraints;neuralnetwork; proteinstructure.

Intr oductionMuch work have over the yearsbeenput into approacheswhich either analyze or predict features of the three-dimensionalstructureusingdistributionsof distances,cor-relatedmutations,andmorelatelyneuralnetworks,or com-binationsof thesee.g. (Tanaka& Scheraga1976;Miyazawa& Jernigan1985;Bohr et al. 1990;Sippl 1990;Maiorov& Crippen1992;Göbelet al. 1994;Mirny & Shaknovich1996;Thomas,Casari,& Sander1996;Lund et al. 1997;Olmea& Valencia1997;Skolnick, Kolinski, & Ortiz 1997;

�Presentaddress:StructuralBioinformaticsAdvancedTech-

nologiesA/S, AgernAlle 3, DK-2970Hørsholm,DenmarkCopyright c

�1999, AmericanAssociationfor Artificial Intelli-

gence(www.aaai.org). All rightsreserved.

Fariselli & Casadio1999). The ability to adoptstructurefrom sequencesdependsonconstructinganappropriatecostfunctionfor thenativestructure.In searchof suchafunctionwehereconcentrateon findinga methodto predictdistanceconstraintsthat correlatewell with the observed distancesin proteins.As theneuralnetwork approachis theonly ap-proachsofar which includessequencecontext for thecon-sideredpair of aminoacids,theseareexpectednot only toperformbetter, but alsoto capturemorefeaturesrelatingdis-tanceconstraintsandsequencecomposition.

The analysisinclude investigationof the distancesbe-tween amino acid as well as sequencemotifs and corre-lations for separatedresidues. We constructa predictionschemewhich significantlyimprove on anearlierapproach(Lundetal. 1997).

For eachparticularsequenceseparation,thecorrespond-ing distancethresholdis computedas the averageof allphysicaldistancesin a largedatasetbetweenany two aminoacids separatedby that amountof residues(Lund et al.1997). Here, we include an analysisof the distancedis-tributions relative to thesethresholdsand usethem to ex-plain qualitative behavior of the neuralnetwork predictionscheme,thusextendingearlierstudies(Reeseet al. 1996).For thepredictionschemeusedhereit is essentialto relatethe distributions to their means. Analysis of the networkweight compositionreveal intriguing propertiesof the dis-tanceconstraints:the sequencemotifs canbe decomposedinto sub-motifsassociatedwith eachof the hiddenunits intheneuralnetwork.

Further, as the sequenceseparationincreasesthere is aclearcorrespondencein thechangeof themeanvalue,dis-tancedistributions,andthe sequencemotifs describingthedistanceconstraintsof the separatedamino acids, respec-tively. The predicteddistanceconstraintsmay be usedasinputsto threading,ab initio, or loopmodelingalgorithms.

Materials and MethodData extractionThe data set was extractedfrom the Brookhaven ProteinData Bank (Bernsteinet al. 1977), release82 containing5762proteins.In brief entrieswereexcludedif: (1) thesec-

Page 2: RTH · ondary structure of the proteins could not be assigned by the program signment DSSP (Kabsch & Sander 1983), as the DSSP as-isused to quantify thesecondary structure identity

ondarystructureof theproteinscouldnotbeassignedby theprogram� DSSP(Kabsch& Sander1983),asthe DSSPas-signmentis usedto quantifythesecondarystructureidentityin the pairwisealignments,(2) the proteinshadany physi-cal chainbreaks(definedasneighboringaminoacidsin thesequencehaving Cα-distancesexceeding4 � 0Å) or entrieswherethe DSSPprogramdetectedchainbreaksor incom-pletebackbones,(3) they hadaresolutionvaluegreaterthan2 � 5Å, sinceproteinswith worseresolution,arelessreliableastemplatesfor homologymodelingof theCα trace(unpub-lishedresults).

Individualchainsof entrieswerediscardedif (1) they hada lengthof lessthan30 aminoacids,(2) they hadlessthan50%secondarystructureassignmentasdefinedby thepro-gramDSSP, (3) they hadmorethan85% non-aminoacids(nucleotides)in the sequence,and (4) they hadmore than10%of non-standardaminoacids(B, X, Z) in thesequence.

A representative setwith low pairwisesequencesimilar-ity wasselectedby runningalgorithm#1 of Hobohmet al.(1992) implementedin the programRedHom(Lund et al.1997).

In brief thesequencesweresortedaccordingto resolution(all NMR structureswereassignedresolution100). These-quenceswith thesameresolutionweresortedsothathigherpriority wasgivento longerproteins.

Thesequenceswerealignedutilizing thelocal alignmentprogram,ssearch (Myers& Miller 1988;Pearson1990)us-ing thepam120aminoacidsubstitutionmatrix (Dayhoff &Orcutt1978),with gappenalties� 12, � 4. As a cutoff forsequencesimilarity we appliedthe thresholdT � 290� L,T is the percentageof identity in the alignmentandL thelength of the alignment. Startingfrom the top of the listeachsequencewasusedasa probeto excludeall sequencesimilarproteinsfurtherdown thelist.

By visual inspectionseven proteinswereremoved fromthelist, sincetheir structureis either’sustained’by DNA orpredominantlyburied in the membrane.The resulting744proteinchainsarecomposedof the residueswheretheCα-atompositionis specifiedin thePDB entry.

Tencross-validationsetswereselectedsuchthat they allcontainapproximatelythesamenumberof residues,andallhave thesamelengthdistributionof thechains.All thedataare madepublicly available through the world wide webpagehttp://www.cbs.dtu.dk/services/distanceP/.

Inf ormation Content / Relative entropy measureHere we usethe relative entropy to measurethe informa-tion content(Kullback & Leibler 1951)of alignedregionsbetweensequenceseparatedresidues.Theinformationcon-tent is obtainedby summingfor the respective position inthealignment,I � ∑L

i � 1 Ii , whereIi is the informationcon-tentof positioni in thealignment.Theinformationcontentat eachpositionwill sometimesbedisplayedasa sequencelogo (Schneider& Stephens1990).Theposition-dependentinformationcontentis givenby

Ii � ∑k

qik log2qik

pk � (1)

wherek refersto the symbolsof the alphabetconsidered(hereaminoacids). The observed fraction of symbolk atpositioni is qik, andpk is thebackgroundprobabilityof find-ing symbolk by chancein thesequences.pk will sometimesbe replacedby a position-dependentbackgroundprobabil-ity, thatis theprobabilityof observingletterk at someposi-tion in thealignmentin anotherdatasetonewishesto com-pareto. Symbolsin logosturned180degreesindicatethatqik pk.

Neural networksAs in the previouswork, we apply two-layerfeed-forwardneuralnetworks, trainedby standardback-propagation,seee.g. (Brunak,Engelbrecht,& Knudsen1991;Bishop1996;Baldi & Brunak1998),to predictwhethertwo residuesarebelow or above a given distancethresholdin space. Thesequenceinput is sparselyencoded.In (Lund et al. 1997)the inputswereprocessedastwo windows centeredaroundeachof theseparatedaminoacids.However, hereweextendthatschemeby allowing thewindowsto grow towardseachother, andevenmergeto asinglelargewindow coveringthecompletesequencebetweentheseparatedaminoacids.Eventhoughsucha schemeincreasesthecomputationalrequire-ments,it allow usto searchfor optimalcoveringbetweentheseparatedaminoacids.

As therecanbea largedifferencein thenumberof posi-tive (contact)andnegative (no contact)sequencewindows,for a given separation,we apply the balancedlearningap-proach(Rost& Sander1993). Trainingis doneby a 10 setcross-validationapproach(Bishop1996), and the result isreportedastheaverageperformanceoverthepartitions.Theperformanceon eachpartition is evaluatedby theMathewscorrelationcoefficient (Mathews1975)

C � PtNt � Pf Nf� �Nt � Nf �

�Nt � Pf �

�Pt � Nf �

�Pt � Pf � � (2)

wherePt is thenumberof truepositives(contact,predictedcontact),Nt the numberof true negatives (no contact,nocontactpredicted),Pf thenumberof falsepositives(nocon-tact,contactpredicted),andNf is thenumberof falsenega-tives(contact,nocontactpredicted).

The analysisof the patternsstoredin the weightsof thenetworks is donethroughthe saliencyof the weights,thatis the costof removing a singleweight while keepingtheremainingones. Due to the sparseencodingeachweightconnectedto a hiddenunit correspondsexactly to a particu-lar aminoacidat a givenpositionin thesequencewindowsusedasinputs.We canthenobtaina rankingof symbolsoneachpositionin the input field. To computethe salienciesweusetheapproximationfor two-layerone-outputnetworks(Gorodkinet al. 1997),who showedthat thesalienciesforthe weightsbetweeninput andhiddenlayercanbe writtenas

skji � sj i � w2

j iW2j K � (3)

wherewj i is the weight betweeninput i andhiddenunit j,andWj theweightbetweenhiddenunit j andtheoutput.Kis a constant.Thekth symbolis implicitly givendueto the

Page 3: RTH · ondary structure of the proteins could not be assigned by the program signment DSSP (Kabsch & Sander 1983), as the DSSP as-isused to quantify thesecondary structure identity

0

500

1000

1500

2000

-6 -4 -2 0 2

Nu

mb

er

Physical distance (Angstrom)

Sequence separation 2

0

50

100

150

200

250

-20 -15 -10 -5 0 5 10 15 20 25

Nu

mb

er

Physical distance (Angstrom)

Sequence separation 11

0

20

40

60

80

100

120

-20 -10 0 10 20 30 40

Nu

mb

er

Physical distance (Angstrom)

Sequence separation 16

0

200

400

600

800

1000

-6 -4 -2 0 2 4

Nu

mb

er

Physical distance (Angstrom)

Sequence separation 3

0

50

100

150

200

-20 -10 0 10 20

Nu

mb

er

Physical distance (Angstrom)

Sequence separation 12

0

10

20

30

40

50

60

70

80

90

-20 -10 0 10 20 30 40 50

Nu

mb

er

Physical distance (Angstrom)

Sequence separation 20

0

100

200

300

400

500

600

700

800

-8 -6 -4 -2 0 2 4 6

Nu

mb

er

Physical distance (Angstrom)

Sequence separation 4

0

50

100

150

200

-20 -10 0 10 20 30

Nu

mb

er

Physical distance (Angstrom)

Sequence separation 13

0

10

20

30

40

50

60

70

-40 -20 0 20 40 60 80

Nu

mb

er

Physical distance (Angstrom)

Sequence separation 60

Figure 1: Lengthdistributionsof residuesegmentsfor correspondingsequenceseparations,relative the respective meanvalues.Sequenceseparations2, 3, 4, 11,12,13,16,20,and60 areshown. Theseshow therepresentative shapesof thedistancedistributions.Theverticallinethroughzeroindicatesthedisplacementwith respectto themeandistance.

sparseencoding. If the signsof wj i andWj are opposite,we display the correspondingsymbol upsidedown in theweight-saliency logo.

ResultsWe conductstatisticalanalysisof thedataandthedistanceconstraintsbetweenaminoacids,andsubsequentusethere-sultsto designandexplain thebehavior of a neuralnetworkpredictionschemewith enhancedperformance.

Statistical analysisFor eachsequenceseparation(in residues),we derive themeanof all thedistancesbetweenpairsof Cα atoms.Weusethesemeansasdistanceconstraintthresholds, andin a pre-diction schemewe wish to predictwhetherthedistancebe-tweenany pair of aminoacidsis aboveor below thethresh-old correspondingto a particularseparationin residues.To

analyzewhich pairsareabove andbelow the threshold,itis relevant to compare:(1) thedistribution of distancesbe-tweenaminoacidpairsbelow andabove thethreshold,and(2) thesequencecompositionof segmentswherethepairsofaminoacidsarebelow andabovethethreshold.

Firstweinvestigatethelengthdistributionof thedistancesasfunction of the sequenceseparation.A completeinves-tigation of physicaldistancesfor increasingsequencesep-aration is given by (Reeseet al. 1996). In particular itwas found that the α-helicescauseda distinct peakup tosequenceseparation20, whereasβ-strandsare seenup toseparations5 only. However, when we perform a simpletranslationof thesedistributionsrelative to their respectivemeans,thesameplotsprovideessentiallynew qualitativein-formation,which is anticipatedto bestronglycorrelatedtotheperformanceof a predictor(above/below thethreshold).In particularwefocusonthedistributionsshown in Figure1,

Page 4: RTH · ondary structure of the proteins could not be assigned by the program signment DSSP (Kabsch & Sander 1983), as the DSSP as-isused to quantify thesecondary structure identity

but we alsousethe remainingdistributions to make someimportant� observations.

Wheninspectingthedistancedistributionsrelativeto theirmean,two main observationsaremade.First, the distancedistribution for sequenceseparation3, is the one distancedistributionwherethedatais mostbimodal.Thussequenceseparation3 providesthemostdistinctpartitionof thedatapoints. Hence,in agreementwith the resultsin (Lund etal. 1997)we anticipatethat the bestpredictionof distanceconstraintscanbeobtainedfor sequenceseparation3. Fur-thermore,weobserve thattheα-helix peakshiftsrelative tothe meanwhen going from sequenceseparation11 to 13.Thelengthof thehelicesbecomeslongerthanthemeandis-tances.Thisshift interestinglyinvolvethepeakto beplacedat themeanvalueitself for sequenceseparation12. Duetothis phenomenon,we anticipate,that,for anoptimizedpre-dictor, it canbeslightlyhardertopredictdistanceconstraintsfor separation12 thanfor separations11and13.

Thepeakpresentat themeanvaluefor sequencesepara-tion 12 doesindeedreflectthe lengthof helicesasdemon-stratedclearlyin Figure2. Ratherthanusingthesimplerulethat eachresiduein a helix increasesthe physicaldistancewith 1.5 Angstrom(Branden& Tooze1999),we computedthe actualphysicallengthsfor eachsize helix to obtain amoreaccuratepicture.Thephysicallengthof theα-heliceswascalculatedby findingthehelicalaxisandmeasuringthetranslationper Cα-atom along this axis. The helical axiswasdeterminedby themasscenterof four consecutive Cα-atoms. Helicesof length four are thereforenot included,sinceonly onecenterof masswaspresent.We seethathe-lices at 12 residuescoincidewith sequenceseparation12.Again we usethis as an indication that at sequencesepa-ration12 it maybe slightly harderto predictdistancecon-straintsthanat separation13.

0� 5� 10 15 20 25 30Separation/Length (Residues)

0

5

10

15

20

25

30

35

Ph

ysic

al d

ista

nce

(A

ng

stro

m)

Separation and helix length

Average sequence separation�Average helix length�

Figure 2: Meandistancesfor increasingsequenceseparationandcomputedaveragephysicalhelix lengthsfor increasingnumberofresidues.

As the sequenceseparationincreasesthe distancedistri-bution approachesa universalshape,presumablyindepen-dentof structuralelements,whicharegovernedby morelo-cal distanceconstraints.The traits from the local elementsdo asmentionedby (Reeseet al. 1996)(who merge largeseparations)vanishfor separations20–25. (Here we con-sideredseparationsup to 100residues.)In Figure1 we seethat the transitionfrom bimodal distributions to unimodaldistributionscenteredaroundtheirmeans,indicatesthatpre-diction of distanceconstraintsmustbecomeharderwith in-creasingsequenceseparation.In particularwhentheuniver-salshapehasbeenreached(sequenceseparationlargerthan30),we shouldnot expecta changein thepredictionabilityas the sequenceseparationincreases.The universaldistri-butionhasits modevalueapproximatelyat � 3 � 5 Angstrom,indicatingthatthemostprobablephysicaldistancefor largesequenceseparationscorrespondsto the distancebetweentwo Cα atoms.

Noticethat theuniversalityonly appearswhenthedistri-bution is displacedby its meandistance.This is interestingsincethe meanof the physicaldistancesgrows as the se-quenceseparationincreases.Froma predictabilityperspec-tive, it thereforemakesgoodsenseto usethemeanvalueasa thresholdto decidethedistanceconstraintfor anarbitrarysequenceseparation.

A usefulpredictionschememustrely on the informationavailable in the sequences.To investigateif thereexists adetectablesignalbetweenthe sequenceseparatedresidues,for eachsequenceseparation,we constructedsequencelo-gos(Schneider& Stephens1990)asfollows: Thesequencesegmentsfor which thephysicaldistancebetweenthesepa-ratedaminoacidswasabovethethresholdwereusedto gen-erateaposition-dependentbackgrounddistributionof aminoacids. The segmentswith correspondingphysicaldistancebelow the thresholdwere all alignedand displayedin se-quencelogos using the computedbackgrounddistributionof amino acids. The pure information contentcurves areshown for increasingsequencein Figure3. Thecorrespond-ing sequencelogos aredisplayedin Figure 4. We useda“margin” of 4 residuesto theleft andright of thelogos,e.g.,for sequenceseparation2, thephysicaldistanceis measuredbetweenposition5 and7 in thelogo.

Thechangein thesequencepatternsis consistentwith thechangein thedistributionof physicaldistances.Up to sepa-rations6–7,thedistributionof distances(Figure1) containstwo peaks,with theβ-strandpeakvanishingcompletelyforseparation7–8 residues.For the sameseparations,the se-quencemotif changesfrom containingoneto threepeaks.For largersequenceseparations,themotif consistsof threecharacteristicpeaks,thetwo locatedatpositionsexactlycor-respondingto theseparatedaminoacids.Thethird peakap-pearin the center. This peaktells us that for physicaldis-tanceswithin the threshold,it indeedmatterswhethertheresiduesin the centercanbendor turn the chain. We seethatexactlysuchaminoacidsarelocatedin thecenterpeak:glycine,aparticacid,glutamicacid,andlysine,all thesearemediumsize residues(except glycine), beinghydrophilic,therebyhaving affinity for thesolvent,but not beingon theoutermostsurfaceof theprotein(seee.g., (Creighton1993)

Page 5: RTH · ondary structure of the proteins could not be assigned by the program signment DSSP (Kabsch & Sander 1983), as the DSSP as-isused to quantify thesecondary structure identity

1� 2�

3�

4�

5�

6�

7�

8�

9�

10� 11Position

0.00

0.02

0.04

0.06

0.08

0.10

0.12

Bit

sSeparation: 2!

2�

4�

6�

8�

10� 12�

14�

16�

18�

20� 22�

Position 0.00

0.02

0.04

0.06

Bit

s

Separation: 13"

2�

4�

6�

8�

10� 12�

14�

16�

Position 0.00

0.02

0.04

0.06

Bit

s

Separation: 7!

2�

4�

6�

8�

10� 12�

14�

16�

18�

20� 22�

24�

26�

28�

Position 0.00

0.01

0.02

0.03

Bit

s

Separation: 20"

2�

4�

6�

8�

10� 12�

14�

16�

18�

Position

0.00

0.02

0.04

0.06

Bit

s

Separation: 9!

5�

10� 15�

20� 25�

30� 35�

Position

0.00

0.01

0.02

Bit

s

Separation: 30"

2�

4�

6�

8�

10� 12�

14�

16�

18�

20�Position

0.00

0.02

0.04

0.06

Bit

s

Separation: 12"

5�

10� 15�

20� 25�

30� 35�

40� 45�

50� 55�

60� 65�

Position

0.00

0.01

0.02

Bit

s

Separation: 60"

Figure3: Sequenceinformationprofilesfor increasingsequenceseparation.

for aminoacid properties).As the sequenceseparationin-creasesfrom about20 to 30 the centerpeaksmearsout, inagreementwith thetransitionof thedistancedistributionthatshifts to theuniversaldistribution in this rangeof sequenceseparation.The sequencemotif likewisebecomes“univer-sal”. Only the peakslocatedat the separatedresiduesareleft.

Thecompositionof singlepeakmotifsresembleto alarge

degreethecompositionof thecenterfor motifshaving threepeaks. However, the slightly increasedamountof leucinemoves from the centerof the one peakmotifs to the sep-aratedaminoacid positionsin the threepeakmotifs. Thereversehappensfor glycine. Interestingly, asthe sequenceseparationincreasesvaline(andisoleucine)becomepresentat theoutermostpeaks.In agreementwith thehelix lengthshift from below to above the thresholdat separations12

Page 6: RTH · ondary structure of the proteins could not be assigned by the program signment DSSP (Kabsch & Sander 1983), as the DSSP as-isused to quantify thesecondary structure identity

Separation: 2 Separation: 13

Separation: 7 Separation: 20

Separation: 9 Separation: 30

Separation: 12 Separation: 60

0.00.020.040.060.08

0.10.12

bit

s |1

C

W

H

MYQ# PF

NR

KI

TSDE

V

GL$A2

W

C% H

M

YQF

P& NRI

K' TSD

E

VGL

A

3

C(W

HM)YFQ* PNR+ I T, KSD- VE

G.LA

4

C/WHM

YFQ

PNR

ITSKVGD

ELA

5

WCM0 HYFQ1R2P3 IN

T4 VKG5S6EDL

A7

6

CWHM

YFPQ

ITVR

N

S8K9D:E;

GLA<

7

PWCH=M> YFQ

N

IT?RSVD@KAEB

GLCAD

8

WCHME YFPQF NRG ISTDVE

K

GLHA

9

W

CHIMYPJFQNTSRK DL IEVK

GAML

10 CWHM

Y

FQN PNRO TSDI

E

VK

GLA

11 CWM

HPYQQ PFR NR

TSI

DSEKT VGLUAV |

0.0

0.02

0.04

0.06

bit

s |

1

WCWMXHY Q

YFR

NZ PKI

ETSD[VAG

L\

2

WC]MH_ QYF

PNR

KI

ESDTV

G

AL

3

WCMaHQYFb P

NR

KSDETI

Vc GAL

4

WdCMHe QY

PNF

RKDf SETI

GVAgLh

5

WiCMjHk QY

PNFl RKDESTIm GVnAo

Lp

6

WqCMrHs QY

PFt NRKEDI

TSVu GAvL

7

WCMwHx Q

YFy PRNz I KETS{D| VGL}A~

8

WC� M

H� Q

Y

FPRN� IKTE

S

VD�G�L�A�

9

WC� MHYFQ� RIP�N� VTK

S�ED� LA�G

10 WC� M� H

YFQ� IRP�N� V� TK�E�S�

D� LAG�

11 WC� M� H

YFQ

IRP� V�NT�K

ES�D�

L� A G

12 WC

MHF¡ YQ

IRP¢ VN

T£S¤K¥E¦D§

L¨ AG©

13 WC

MHª F« YQ

IRN

P

V¬ TSKE­D®

L¯ AG°

14 WC

MHYFQ

IRNPT

V±SEKD

LAG

15 W²

C

MHYF³ QR

NP

IST

KED

VLAµG

16 CWMH

Q¶ F·YR

N

PI¹ STK

DEV

Lº AG

17 W

C

MHQY

PF

N» RSKD¼ E½TI¾VGA¿L

18 W

C

HM

QY

PNF

RDKÀ SÁ EÂ TIÃ GVÄAL

19 W

C

HÅMQY

PFÆ NRÇ KDÈ STEÉ I GV

ALÊ

20 WC

MË HÌ QY

PFÍ NR

KTDÎ SIÏ EGV

ALÐ

21 WCMÑ HQY

PF

NRÒ KTÓ DÔ SI

EV

GLÕA

22 WCMÖ H

QYFPNR×IØ TDÙ SÚ KÛ EÜ VGLÝA

|

0.0

0.02

0.04

0.06

bit

s |

1

WCMH

YQÞFß PNRKIà STDE

V

GLáAâ

2

WCMH

YQãFä PNR

KIå TSDEæVç GèLAé

3

WC

HMêYëQìFí PNRKIî STDE

Vï GLðA

4

WC

HMYñQò PF

NRKTó Sô I DE

Võ GöLA÷

5

WCø HMùYQú PFû NRü K

TIý SDþEÿ GVL�A

6

WCHM� YFQ� PNR� ITVK

S�D�

G�E�LA

7

WCHM

YFQ� PIRN

TVS

KDE�

G�LA

8

WCM� H� YFPQ� IR�

VN

TSK�D�EL�G�A

9

WCHM

YFP�QIR�N

VT� SK�D

E�

LG�A�

10 CWM

H� YFPQ

INR

T� VSD�K

E LG!A"

11 WCHM

YPF#QNI$RTSD% VK

E

GL&A

12 W

C

H'MPYFQ

NR

DS( T) I

EK*V+

GA,L-

13 W

C

HMYQ

PF

NR

DTSI.EK

G/VA0L

14 W

CMHYQ

P1FN2RST3 D4 I

KE

GVAL5

15 W

CMHYQ

P6FNR

STD7 I

K

EV

GAL

16 WC

MHYQF

PNR

T8 SI9 D:KEV

GLA|

0.0

0.02

0.04

bit

s |1

WC; M

H

QYFR< NPI

EK=TS> DV

LAG

2

WC? M

H@ QY

FRNPKEA I

DBTSVC ALG

3

WC

M

H

QYF

PNRKEI

TD SDVE AG

L

4

WFCGMHQYHFPNRI KEJ DK I SLTVGM

ALN

5

WOCPMHQY

PQ NF

RKEDSR TI

VS GAL

6

WCTMHQYF

PNURV KW EI

DTX SV

GAYL

7

WCMH

QY

PF

NRKEI

DTSVZ GA

L

8

W[C\M]HQY_FPNR` Ka EI

Db TSVc GA

L

9

WCM

HQY

FPRd NKI

ETeSDV

GAL

10 WCMHQYF

P

Nf Rg IhTEi KSjDk VLA

G

11 WC

MH

YQFPRN

IK

ETlSDVLA

G

12 WCm M

Hn YFQRo PNp IKqTrEsSDVLAGt

13 WCu MH

YFQRPvNIT

KESw VD

LAG

14 WCx My Hz Y{ F

Q| RP} IN~T�KE�

VS�DL� A�G

15 WC

MHYFQRIP

NTK

VESD

LAG

16 WC

MHYFQRIP�NK�T

S� VE�D�

LAG

17 WC� M

HYFQRIP

NKT�S�E

VD

LAG

18 WC� M

HY� F� QRP� IN

K�TSE�D�

VL� AG

19 WC

MHYFQRP

N

IKT�SEDV

LAG

20 WC� MH� Q

YFPR�NIK� ST�EDV� LA�G

21 WC� MHQY

FP� R� NKI

ES TDV

A¡ LG

22 WC¢ MH£ Q¤YFP

N¥ RK¦ IT§ SEDV

G

AL

23 WC

MHQY

F

PNRKDTSI

EV

GAL

24 W©CªM« H

QY¬ P­FNRK® DSET

IV

GAL°

25 W±C² HM

QY

PF

NRKDES³ T´ I

GVAµL

26 WC¶ HM

QYF

PRNKDETSI

V

GA·L

27 WCMH

QY

FPRN

KI

ST

EDV

GLA¹

28 WCºM»H¼ QY

F½ PRN¾ I¿ KTÀ ESÁ DV

GÂLAÃ

29 WCÄMHÅ Q

YFPRN

I

TEKÆ DS

VLG

A

|

0.0

0.02

0.04

0.06

bit

s |

1

WCMÇHÈ Y

QÉFPN

RÊ KIË STÌ EDV

GLÍA

2

WCMÎHÏ YQÐFPNRÑ KÒ I STDÓEVGÔLA

3

WCMHYQ

PF

NRKÕ I

STDÖEVGL×A

4

WC

HMQY

PF

NRKØ SÙ TÚ DIÛEGV

AL

5

WÜCHMYQ

PFÝ NRKTSI

DE

GVAÞL

6

WC

HMß YFQà PNR

ITKáSDVEâ GãLA

7

WCM

HYFQ

PäRINå TVK

SDE

GLæA

8

WCM

HYFQ

PIR

N

Tç VèSKDELGA

9

Wé CM

Hê YFëQPIR

N

VTìSK

DEí

LGA

10 Wî C

HM

YFQ

PIRïNð

VTSK

DE

LGAñ

11 WCMò Hó Y

FôQõ PIN

Rö TVSD

KE÷LøG

ùA

12 WCMú H

YFû PüQNIR

TSDVK

EGLýA

13 WCþMHÿYP� F�QN�RI

ST� D�KE

V� GL�A

14 W

C

HM

PYQF

NR

DS� T� I

K

EGV

AL

15 W�CMHY

Q� PF NR

DSTI

EK

GVA�L

16 W

C�M�HYQ�

PF

N�RTS� D� I

KE�V

G�LA

17 WC

M

HQY

PF

NR

T� SDI

KEV

G�AL

18 WCMH�QYFP� NR

T� SI

D� K�EVG�AL|

0.0

0.04

bit

s |1

C

WMH

QY

FPRNKEI�TS DV

AL!G

2 W

C

M"HQYF

RNP# K$ EI

T

S% DV

AL&G

3 W

C

MH

Q'YF( PNR) K* EI+ DST,VAL-G

4

WCMH

QYF

PNRKEDI

STVG

AL

5

WC.M/H0 QY1F2 P

NRKEDS3 I

TV

GAL

6

WCMH

Q4YF5 P6 N7RKE8 I

D9TSVG: A;L

7

WC

M<H

Q=YFPNR

K> EI

DTS?VGLA

8

CW

M

H

QYF@ P

NRKA EI

DTSV

AGLB

9

CW

MH

QFY

PNRC KI

ETSD DEVLF AG

10 CWMHQYFG PH N

R

KIDSTEVLI AG

11 CWMH

QYFJ PK NLRI KE

TSDV

AGLM

12 WMCH

QYFN PO NRPI

EDK

STV

AGL

13 CWMH

QYFQ P

R

NRI

ED

KSTV

A

GL

14 CWMHQYFRNPEIKSTDV

A

GL

15 CWMHQYRFPNS IT ED

KSTUVA

GL

16 CWMHV YFW QRNP

X I

EST

K

DV

ALYG

17 WC

MHYFQ

PR

N

IKTSEZDV

LAG

18 WC

MH[ QYFP\ RN

I]KET_SDa V

Lb AGc

19 WC

Md HQe YFf Rg PNh IKiDSjTEVk Ll AGm

20 WC

MHn YQ

FPo RpNIqKTESDVLr AG

21 WC

Ms HYQFt Pu RN

IvKSwTEDVLx AG

22 WC

MHQYFPRNy IKzETSDVLAG

23 C

WMH{ QYFP| RN} I~ KT�S�EDVLAG

24 WC

M� H� Q� YF� P� R�NIK�TS� ED

V

L� A�G

25 CWMH

QYFRN

PIKT

SEDV

ALG

26 CWM� HQFYP� R

N

IK�DST�E�V� L� AG�

27 CWMH

Q� FYP� R

NI

K�TE�SDVL� AG

28 CWMH

Q� FYR

N

PI

K�TSEDV

ALG

29 CWMH

Q� YFR

N

PI

K�TSEDVL� AG

30 CWMHQ

FYPRN

KITS� EDV

LAG

31 CWMH� QY

FP

N

RK� I ET  S¡ DV

L¢ AG£32

CWM

H¤ Q¥YF

P¦ R§ N¨ KI©TSª E« D¬VL­ AG®33

WCMH

Q

YF

P° RNK± I DST

EV

AG

L²34

WCMH

QYF³ P

RNK´ EIµTSDV¶ GAL

35 W

CMH

QYF

PNRKEDST

IV

GAL

36 WCM

H· QYF

P¹ Nº RK» ESI

DT¼VGAL½

37 WCMH

Q¾ YF¿ RNPÀ KÁ I EÂTSDÃVGLÄ AÅ

38 CWMH

QYFRNPKEI

TS

DVG

AL

39 CWMH

QYFÆ RNPÇ K

ITS

EDVÈ AG

L

|

0.0

0.02

0.04

0.06

bit

s |

1

WCMHY

F

ÉQÊ P

Ë N

R

K

IÌ DSTEVAGÍL

2

WCMHQYFÎ P

NÏRKÐ I DÑ SÒ TEV

GAÓLÔ

3

WCMHYQ

PF

NÕRKDSÖ I

TEV

G×AL

4

WØCMHQY

PÙFNRKDSTÚ EIÛ GV

AÜL

5

WCHM

QY

PNFÝ

RKDÞ SETIß

GVàAL

6

WCáMâHQYFã Pä NåRKI

DTæSç EVè GéALê

7

WCMHYQ

FPN

R

IKë TSìEDVGLíA

8

WCMîHYQ

FPRN

ITKVS

DE

GLA

9

WCï MHYFQð Pñ RINò Vó TKôSõEö

D

LA÷Gø

10 WC

Mù Hú YFQ

IûRPNü

Vý TKþSÿED

LA�G�

11 WC

MHYFQ

IR�P

VN�

TS�KE�D�

L�AG

12 WC

MHYFQ

IRP�N

VTSK

E�D�

LAG

13 WC

MHYFQ

IPRN� T

VS�KED

LA�G

14 WC� M

HYFQ�P�RN� I� STD� VE�K� L�AG

15 CWMHYFQ

PN�RI� S

T� DKEVGLA

16 W

C

MH

QY

PF

NR

S� DTKI

EV

GAL

17 W�CHM� Q

P YNF!R

DS" KETI

GV#AL$

18 W%CMH

Q&YPF' NR

T( S) DKEI* G+VAL

19 WC

MHQ,YPF

NR

TS- KDI

EG.VLA

20 WCM

HQY

PF

N/RTKS0 I

EDV

GAL

21 WCMHQ1

YF2 P

3 N

R

T

I4 KE5 S6 D7V

G8L9A

|0.0

0.02

0.04

bit

s |

1

CWMH

Q: YF

RNPKEI;T<SD=VL> A?G

2

CWMH

Q@YAFPRB NC KEI

TDSDEVF LG AG

3

CWMH

QYF

PRNKEI

DTSV

LAG

4

CWMHH QY

PFI RNKEI

DTJSVKGL

A

5

CWLMHM QNYO PFP RQ N

KR ES DI

TTSVU

GVLAW

6

CWMH

QYF

PRX NKEI

DTYSZVGLA

7

CW

M[H

Q\YF

PR]NK^ I EDTS_VGLA

8

CWMH

QaYbFc PRN

KEI

DTdSe VGLfA

9

CWMH

Q

YF

Pg RN

IEKT

S

DVG

LhAi

10 CW

MHQY

FP

NR

I KET

SjDk VG

LAl

11 CWMHQY

Fm PR

N

I Kn EoTSpDVq LrGAs

12 CW

MHQYFt PRuN

I KEvTSD

VLAG

13 CWMHQYRFw PN

I Kx D

STy EVz LAG

14 CMWHQYF{ RNPI| E

KT}SDVAG

L

15 WCHMFYNPQREIT~ KVDSG

LA

16 WCHMQYPNFREIKDST� VAG

L

17 WCHMQYFPIR�NE� D

KST� VA� G� L�

18 HWM

� QYFNPIKRDSET� VG

L�A

19 YMQHIKFRNPEST� VDAGL�

20 MHFYN� PQREIKVDS�T�GL�

A

21 WCHMYPFQRIKNET� VD

SG

LA

22 CWMHQYRFPN

I EDKST� VAG

L

23 CMWH

Q�YF� R

�NP

�K

I D

ST� E�VG

LA�

24 CWMH

Q�Y

R

�FP

�N

I

T� E�KS

DVG

L� A

25 CWMHQYPFRN

I E

KT�S

DVA

GL

26 CWMHQYFR�

NPI E

KT SD¡V

LA¢G

27 WCHMQYNPF

I£ REDKST¤VA

G

L

28 YMQHNPFI RKEDSTVAGL

29 I YFRNPQKEDST

¥VAGL

30 CMWHQ

YF¦ R

§NPI

E

DKS

TV

LA©G

31 CWMHQYFRª

NPEIT«KVDS

A

G

L

32 CMWHQYFRNPEIT¬ K­

SDV

L® A

G

33 CWMH

Q¯ YF° R

±NPK²IDST

E

V

AL

G

34 CWMH

Q

Y

RFP³NI

EKTSDV

AG

L

35 CWMHQYFRNPµ I EKT¶SDVLAG

36 CWMH

Q·FY

R¹ NºP» I E

ST

KDV

A

G

L

37 CWMHQYF

R

¼ NPI E½ST¾KDV

LA¿G

38 CWMHQYFÀ RÁ

NP

ITEKSDV

ALÂG

39 CWMÃ

H

Q

Y

FÄ RÅNPI EST

KÆDV

LAG

40 CWMHQYFRNPÇ EIT

È KSDV

ALG

41 CHWMQRYFÉNPEIT

ÊKSDV

LAG

42 CW

MHQYFË R

N

PÌK

IDS

TÍ EÎVLAÏG

43 CWMHQYFR

ÐNPEI

KSTDV

A

G

L

44 CWMHQYFRÑ

NPEIKSTDV

A

G

L

45 CWMHQYPFRNTÒ I KÓ E

SDVA

G

L

46 CWMHQFY

RNPEÔ IT

Õ KV

DS

LAG

47 CW

MHQY

FR

N

PKI ET

V

DS

LAG

48 CW

MHQYF

Ö P× RØN

KIDS

TÙ EÚ VAÛGL

49 CW

MÜ HQYF

Ý RÞN

Pß KIE

TàSD

Vá LAG

50 CWMHQYFRâ

N

PITEKSD

VãG

LäA

51 CW

Må HQFæY

PRN

KIE

TçSèDVLGA

52 CWMé HQF

êY

PRëNK

I

DSTE

V

G

LAì

53 CW

MHQY

FR

N

PITEKSD

VGLAí

54 CW

MHQY

Fî RïN

Pð ITñE

KòSD

Vó GLAô

55 CWMHõ QYFö P÷ Rø

NI

TùE

KúSD

VG

LûA

56 CWMHQY

FR

N

PIK

T

SED

Vü LAG

57 CWMH

QYFPRýNK

I ETþSD

VÿGLA

58 CWM� H� QYF� PRNI

T� E� K�SD�V� LGA�

59 CWMH

QYFPRNI

T EK

SD V

LGA�

60 CWMH

QYFRN

PKITS

E

DV� LAG

61 CWMH

Q YFP� R� NI

EK� ST

� DV�G�LA�

62 C

W�MH

QYF

PR�NKEI

S�TD�VGLA

63 CWMH

Q�YF� PR� NKEI

T� S� D�VGLA

64 CW

MH

Q�Y PF! R" N# K$ ES% I D&TVGLA

65 C

W'M(HQY

PRF

NKEDST

) IV

GAL

66 CWM

H* QYF

PRN

KET+ SI

DV, GLA-

67 CWMH. Q/

YF

PR0N1 KI

T2 S3 E

DV4 GL5A

68 CWMH

Q6YRF

PNKI

E7TSD8VGLA

69 CWMH

Q9YF

PR

N

K: I DS;TEVGAL

|Figure 4: Sequencelogos for increasingsequenceseparation. A margin of 4 residueson both ends of the logo is used. Sym-bols displayedupsidedown indicate that their amountis lower than the backgroundamount. This figure is available in full color athttp://www.cbs.dtu.dk/services/distanceP/.

Page 7: RTH · ondary structure of the proteins could not be assigned by the program signment DSSP (Kabsch & Sander 1983), as the DSSP as-isused to quantify thesecondary structure identity

and13, thecenterpeakshifts from slight over to slight un-derrepresentation< of alanine:helicesarenolongerdominantfor distancesbelow thethreshold.

The smearingout of the centerpeakfor large sequenceseparationsdoesnot indicatelack of bending/turnproper-ties, but reflectsthat suchaminoacidscan be placedin alargeregionin betweenthetwo separatedaminoacids.Whatwe seefor small separations,is that the signal in additionmust be locatedin a very specificposition relative to theseparatedaminoacids.

Neural networks: predictionsand behaviorAsfoundabove,optimizedpredictionof distanceconstraintsfor varyingsequenceseparation,shouldbeperformedby anon-linearpredictor. Herewe useneuralnetworks. Clearly,far mostof the informationis found in thesequencelogos,so we expectonly a few hiddenneurons(units) to be nec-essaryin thedesignof theneuralnetwork architecture.Asearlier, weonly usetwo-layernetworkswith oneoutputunit(above/below threshold).We fix thenumberof hiddenunitsto 5, but thesizeof theinputfield mayvary.

We start out by quantitatively investigating the rela-tion betweenthe sequencemotifs in the logos above, andthe amountof sequencecontext neededin the predictionscheme.First, we choosethe amountof sequencecontext,by startingwith local windows aroundtheseparatedaminoacidsand the extendingthe sequenceregion, r, to be in-cluded.We only includeadditionalresiduesin thedirectiontowardsthecenter. Thenumberof residuesusedin thedirec-tion awayfrom thecenteris fixedto 4,exceptin thecasesofr = 0 andr = 2, wherethatnumberis zeroandtwo respec-tively. At somepoint the two sequenceregionsmay reacheachother. Whetherthey overlap,or just connectdoesnotaffect the performance.For eachr = 0 > 2 > 4 > 6 > 8 > 10> 12> 14,wetrainednetworksfor sequenceseparations2 to 99,result-ing in thetrainingof 8000networks,sinceweused10cross-validationsets.Theperformancesin “fraction correct”and“correlationcoefficient” areshown in Figure5.

Figure5(b) shows the performancein termsof correla-tion coefficient. We seethat eachtime the sequencesepa-rationbecomesso largethat thetwo context regionsdo notconnecttheperformancedropswhencomparedto contextsthatmergeinto a singlewindow. Thisbehavior is generic,ithappensconsistentlyfor increasingsizeof sequencecontextregion, thoughthe phenomenonis asymptoticallydecreas-ing. An effect canbefoundup to a region sizeof about30,seeFigure6. Several featuresappearon the performancecurve,andthey canall beexplainedby thestatisticalobser-vationsmadeearlier. First, thecurvewith nocontext (r = 0)correspondsto thecurve oneobtainsusingprobabilitypairdensityfunctions(Lund et al. 1997). Thebadperformanceof suchapredictoris to beexpecteddueto thelackof motifthatwasnot includedin themodel.Thegreencurve(r = 4)is the curve that correspondsto the neuralnetwork predic-tor in (Lund et al. 1997). As observedthereina significantimprovementis obtainedwhenincluding just a little bit ofsequencecontext. As the size of the context region is in-creased,sois theperformanceup to aboutsequencesepara-tion 30. Thenthe performanceremainsthe sameindepen-

dentof the sequenceseparation.We evennotethe drop inperformancefor sequenceseparation12, aspredictedfromthe statisticalanalysisabove. It is alsocharacteristicthat,asthedistributionof physicaldistancesapproachestheuni-versalshape,andasthe centerpeakof the sequencelogosvanishes,theperformancedropsandbecomesconstantwithapproximately0.15 in correlationcoefficient. The changein predictionperformancetakes placeat sequencesepara-tions for which changesin the distribution of physicaldis-tanceandsequencemotifs change.Due to the not vanish-ing signal in the logos for large separations,it is possibleto predictdistanceconstraintssignificantlybetterthanran-dom,andbetterthanwith a no-context approach.Thecon-clusionis notsurprising:thebestperformingnetwork is thatwhich usesasmuchcontext aspossible.However, for largesequenceseparationsmore than 30–35residues,the sameperformanceis obtainedby just using a small amountofsequencecontext aroundthe separatedamino acids,as in(Lundetal. 1997).We foundthattheperformancebetweentheworstandbestcross-validationsetis about0.1for sepa-rationsup to 12 residues,thenabout0.07for separationsupto 30 residues,andthen0.1–0.2for separationslarger than30. Hence,at sequenceseparation31 theperformancestartto fluctuateindicatingthatthesequencemotif becomeslessdefinedthroughoutthe sequencesin the respective cross-validationsets. Hencewe can usethe networks as an in-dicatorfor whenasequencemotif is well defined.

As an independenttest, we predictedon nine CASP3(http://predictioncenter.llnl.gov/casp3/)targets(andsubmit-ted the predictions). None of thesetargets(T0053T0067T0071T0077T0081T0082T0083T0084T0085)have se-quencesimilarity toany sequencewith knownstructure.Theprevious method(Lund et al. 1997) had on the average64.5%correctpredictionsandan averagecorrelationcoef-ficient of 0.224.Themethodpresentedherehason average70.3%correctpredictionsandanaveragecorrelationcoeffi-cientof 0.249.Theperformanceof theoriginalmethodcor-respondsto theperformancepreviously published(Lund etal. 1997).However, theaveragemeasureisalessconvenientway to reporttheperformance,asthegoodperformanceonsmallseparationsarebiaseddueto themany largeseparationnetworksthathaveessentiallythesameperformance.There-forewehavealsoconstructedperformancecurvessimilar tothe onein Figure5 (not shown). The main featuresof thepredictioncurve waspresent. In Figure8 we show a pre-diction exampleof thedistanceconstraintsfor T0067. Theserver (http://www.cbs.dtu.dk/services/distanceP/)providespredictionsup a sequenceseparationof 100. Notice thatpredictionsupto asequenceseparationof 30clearlycapturethemainpartof thedistanceconstraints.

We alsoinvestigatedthe qualitative relationbetweenthenetwork performanceand information content in the se-quencelogos. Interestingly, andin agreementwith the ob-servationsmadeearlier, we found that the two curveshavethesamequalitativebehavior asthesequenceseparationin-crease(Figure7). Both curvespeakat separation3, bothcurvesdrop at separation12, and both curvesreachestheplateaufor sequenceseparation30. Wenotethattherelativeentropy curve increasesslightly astheseparationincreases.

Page 8: RTH · ondary structure of the proteins could not be assigned by the program signment DSSP (Kabsch & Sander 1983), as the DSSP as-isused to quantify thesecondary structure identity

0?

20 40 60 80 100?

Sequence Separation (residues)@0.45

0.50

0.55

0.60

0.65

0.70

0.75

Fra

ctio

n c

orr

ect

A

(a)

r=0r=2r=4r=6r=8r=10r=12r=14

Mean performances

0B

20B

40B

60B

80B

100B

Sequence separation (residues)C0.00

0.10

0.20

0.30

0.40

0.50

Co

rrel

atio

n c

oef

fici

ent

(b)

r=0r=2r=4r=6r=8r=10r=12r=14

Mean performances

Figure 5: The performanceasfunction of increasingsequencecontext whenpredictingdistanceconstraints.The sequencecontext is thenumberof additionalresiduesr D 0 E 2 E 4 E 6 E 8 E 10E 12E 14 from theaminoacidandtowardtheotheraminoacid.Theperformanceis showedinpercentage(a),andin correlationcoefficient (b). Thisfigureis availablein full color athttp://www.cbs.dtu.dk/services/distanceP/.

Page 9: RTH · ondary structure of the proteins could not be assigned by the program signment DSSP (Kabsch & Sander 1983), as the DSSP as-isused to quantify thesecondary structure identity

This is dueto a well known artifactfor decreasingsamplingsize, andF entropy measures.The correlationbetweenthecurvesalsoindicatesthat themostinformationusedby thepredictorstemfrom themotifs in thesequencelogos.

0G 20 40 60 80 100GSequence Separation (residues)H

0.00

0.10

0.20

0.30

0.40

0.50

Co

rrel

atio

n c

oef

fici

ent

Plain

Profile

Single window curvesI

Figure 6: The performancecurve using a single windows only.We alsoshow the performancewhenincludinghomologuein therespective tests.Thiscompleteaprofile.

0J 20 40 60 80 100JSequence separation (residues)K

Arb

itra

ry s

cale

Relative entropy and NN performance

NN-performanceLRelative entropy

Figure 7: Information contentand network performancefor in-creasingsequenceseparation.

Finally, weconcludeby ananalysisof theneuralnetworkweight compositionwith the aim of revealing significantmotifsusedin thelearningprocess.In general,wefoundthesamemotifsappearingfor theeachof the10cross-validationsetsfor a given sequenceseparation.As describedin theMethodssection,wecomputesaliency logosof thenetworkparameters,that is we display the amino acids for whichthecorrespondingweights,if removed,causethelargestin-creasein network error. We madesaliency logosfor eachofthe5 hiddenneurons.For theshortsequenceseparationsthenetwork stronglypunishhaving aprolinein thecenterof the

CASP3 target T0067

] - ; 0.2]

]0.2 ; 0.3]

]0.3 ; 0.4]

]0.4 ; 0.5]

]0.5 ; 0.6]

]0.6 ; 0.7]

]0.7 ; 0.8]

]0.8 ; 0.9]

]0.9 ; 1.0]

20

40

60

80

100

120

140

160

180

20 40 60 80 100

120

140

160

180

Figure 8: Predictionof distanceconstraintsfor the CASP3tar-get T0067. The uppertriangleis the distanceconstraintsof pub-lished coordinates,where dark points indicate distancesabovethe thresholds(meanvalues)andlight pointsdistancesbelow thethreshold. The lower triangle shows the actual neural networkpredictions. The scaleindicatesthe predictedprobability for se-quenceseparationshaving distancesbelow the thresholds.Thuslight points representprediction for distancesbelow the thresh-olds and dark points prediction for distancesabove the thresh-old. The numbersalong the edgesof the plot show the posi-tion in the sequence. This figure is available in full color athttp://www.cbs.dtu.dk/services/distanceP/.

input field, that is a prolinein betweentheseparatedaminoacids.Thismakessenseasprolineis not locatedwithin sec-ondarystructureelements,and the helix lengthsfor smallseparationsaresmallerthanthe meandistances.(The net-work haslearnt that proline is not presentin helices.) Incontrast,thepresenceof asparticacidcanhavea positiveornegative effect dependingon thepositionin the input field.For larger sequenceseparations,someof the neuronsdis-playathreepeakmotif resemblingwhatwasfoundin these-quencelogos.Thecenterof suchlogoscanbevery glycinerich, but alsoshow astrongpunishmentfor valineor trypto-phan.Sometimestwo neuronscansharethemotif of whatispresentfor oneneuronin anothercross-validationset.Otherdetailscanbefoundaswell. Herewe justoutlinedthemainfeatures: the networks not only look for favorableaminoacids,they alsodetectsthosethatarenot favorableatall. InFigure9 we show an exampleof the motifs for 2 of the 5hiddenneuronsfor a network with sequenceseparation13.

Page 10: RTH · ondary structure of the proteins could not be assigned by the program signment DSSP (Kabsch & Sander 1983), as the DSSP as-isused to quantify thesecondary structure identity

0

1

2

Sal

ien

cy |1

QRLDES

H

YTVAI

NFMM G

CPNWK

X

2

QVRSWE

YDC

NAIT

KO FPMGPHX

3

RTQAV

IS

PF

XDKYNL

WMEQ

HGC

4

RTR Y

SL

KAEIS DT NFUQMVXWV P

HCGW

5

TWX

RYS

KQEI

DCF

V

NMLAX P

HG

6

WVFXYD

SYMT

NKRILQAZ CE

H[ P

G

7

TVDSX

YIF

WMR

N\LEK]AQ

CH_

GP

8 XY

VF` W

NCI

HEA

TLa SDbMRcQdKe

GP

9 IX

YFVW

GLMf N

Dg HhECSiRKAj

TQk

P

10 NFYWIX

VEl GSDHK

MLR

CQm TA

P

11 VYWXFn I

HGR

KoEp SDTq NM

QrLsAt

C

P

12 HVXRWF

GKI

QY

TuESNMvAwLx DC

P

13 EWFVXHyYG

NR

DzMIK{A|L

SQ

T} C

P

14 CHKXEQTYR

VAF

GW

NM

DSI~L

P

15 MFW

HQ

YV

ETCK

XND� GA

SRL� PI

16 MHNFQGEV

SXK

WA

C�RYD�LPTI�

17 WT�HMVYE

KF

QNRAL

XSGPDI� C

18 THDMEGCVF�LKRP

WNA

QXYSI�

|0

1

Sal

ien

cy |

1

QS

AYM

DIL

RVH�T�W� X

GN� E�C

KP

2

T�I

RH

SVQ�M� EFAW� P� K�YC

GDN� X

3

CT

�MAL

EQY

PW� I

XV

RGF�

D� NK�

4

YR

EAI

DHWV

F

XST�QC�M

PKL

NG

5

HX

QM

TY

KEW� ASV

� RL

NDPI

CF�

G

6

FWMAS

XQHR�Y�K VDI

C¡LE¢ TN

PG

7

KQ£ NYDXV¤ I

FMA

ST¥RLWEP¦ HGC

§

8

YWTRSAN

L¨ V

X

IK

MQ©EDªCHG« PF¬

9

R­EP®TKSDXAGLN

YQ

HWC

FIV

10 M° QPRAX

HECGT

IKN

Y±SD²

LFV

W

11 PT³XMA

HRQµG¶SC

E· YN

W¹ VDKº

FLI

12 ST» AHXP

C¼Q½E¾ RLN¿ MKÀ YFWVG

Á

ID

13 SPRY

TQÂ XVCÃKHEÄNÅ IAMÆ FDÇ

WG

L

14 VEKDA

CXH

QNMÈSRLFT

P

IG

WY

15 TDMSÉKPÊHXI

VQË C

WG

FRAEÌ Y

NL

16 YKQDFH

WEPA

X

SI

LC

TMRGNV

17 YL

T

X

NKÍERCÎFMAÏ I

W

QDÐ SPGÑV

18 MAH

RÒ XY

TKQÓLWCEÔ DSFÕV

Ö I×

NPG

19 MØYTQK

AV

XC

HW

DÙ NSGFÚ EÛ I

RPÜ

20 SÝMAT

QH

DY

GKC

XF

EÞWV

LPNRI

21 SHL

QVA

Tß FCIàMá E

YRDWâ KPGXN

22 GSY

DRAH

Mã KF

Pä QåVLEIæ NW

çCè X|0

1

2

3

4

Sal

ien

cy |

1

RLSGNHéTê APKDëCìYQWM

VF

Eí I

X

2

WPîHïDYRAKG

TLF

SNIð XEMñ QV

3

NT

óA

DôHõ R

QöWFK÷ EGSY

IL

PCø XMV

4

NATùHMúRWFKGûYQü EL

PDý SXV

IþC

5

NHT

XRMÿ QAFW� E

SKY

�LGV� DIC� P

6

AMHPRXF�LSGN�YKE� QV

C�T� I

DW

7

T P MQYS�VFLHIXD� WG ARC� KE�N�

8

PTXK�CRALN

VMFH

IYE

Q�DW�SG

9

CXYTQ

R

H

FK

APNE

W� M� LIS

D�VG

10 THX

ACRE

YMP

KQS

FLD�VWN

IG

11 AXT� C�

RH�K�E�QY�PLS

M� FD N!

VW"

IG

12 TX

AH#Q$ERP% CK& MS

FN'WLVD(

YI)G

13 CXTAS

RE*H+ WKQ

FDP, M- LN.

VY

I/G

14 HXS

ARFET

W0QCN1KDPVYLM2

I3G

15 SCQXYD4RE5 T

AWH

N6 VLFK

P7MIG

16 KRXH8 YTNLE9Q: CV

SIFW; DAP

MG

17 LEH<XD= P> QAT?YM@ I

RF

NW

SKCAG

18 HXAQTL

KB EC SMY

RF

GD DPECWNFVI

G

19 MEHXAFP

H STLW

QKDIYNJ RC

GKVIL

20 ASRYHXML

TNKDQE

WF

GVC

PI

21 LKRMEHMVYANN P

STOWDQF

GXIPCQ

22 VPKSTYAHRESWRGL

NTMDIUFQVCX

|Figure9: Threesaliency-weightlogosfrom thetrainedneuralnet-work at sequenceseparations9 and13. For eachpositionof theinput field thesaliency of therespective aminoacidsis displayed.Lettersturnedupsidedown contribute negatively to the weightedsumof thenetwork input. Thetop logo (sequenceseparation9) il-lustratesstrongpunishmentfor prolineandglycineupstreams.Onthemiddlelogo(sequenceseparation13),N is turnedupsidedownon the peaksnearthe edges.On the bottomlogo the I’s closetothe edgescontribute positively, andthe N in the centerpeakalsocontributepositively. The I’s in thecenterpeakcontributesnega-tively (they areupsidedown). This figureis availablein full colorathttp://www.cbs.dtu.dk/services/distanceP/.

Final remarksWestudiedpredictionof distanceconstraintsin proteins.Wefoundthatsequencemotifswererelatedto thedistributionofdistances.Usingthemotifs we showedhow to constructanoptimalneuralnetwork predictorwhich improve an earlierapproach.Thebehavior of thepredictorcouldbecompletelyexplainedfrom astatisticalanalysisof thedata,in particulara drop in performancefor sequenceseparation12 residueswaspredictedfrom thestatisticalanalysis.Wealsocorrectly

predictedseparation3 ashaving optimalperformance.Wefoundthattheinformationcontentof thelogoshasthesamequalitativebehavior asthenetwork performancefor increas-ing sequenceseparation.Finally, theweightcompositionofthe network wasanalyzedandwe found that the sequencemotif appearingwasin agreementwith thesequencelogos.

Theperspectivesaremany: thenetworkperformancemaybe improved further by using threewindows for sequenceseparationsbetween20and30residues.Theperformanceofthenetwork canbeusedasinputsto anew collectionof net-workswhichcancleanup certaintypesof falsepredictions.Combiningthemethodwith asecondarystructurepredictionshouldgive a significantperformanceimprovementon theshortsequenceseparations.The relationbetweeninforma-tion contentandnetwork performancemight be explainedquantitatively throughextensivealgebraicconsiderations.

AcknowledgmentsThis work wassupportedby TheDanishNationalResearchFoundation.OL wassupportedby The Danish1991Phar-macy Foundation.

ReferencesBaldi, P., and Brunak, S. 1998. Bioinformatics— Themachinelearningapproach. CambridgeMass.:MIT Press.Bernstein, F. C.; Koetzle, T. G.; Williams, G. J. B.;Meyer, E. F.; Brice, M. D.; Rogers, J. R.; Kennard,O.; Shimanouchi,T.; and Tasumi, M. 1977. Theprotein data bank: A computerbasedarchival file formacromolecularstructures. J. Mol. Biol. 122:535–542.(http://www.pdb.bnl.gov/).Bishop,C. M. 1996. Neural Networksfor PatternRecog-nition. Oxford: OxfordUniversityPress.Bohr, H.; Bohr, J.; Brunak,S.; Cotterill, R. M. C.; Fred-holm, H.; Lautrup,B.; andPetersen,S. B. 1990. A novelapproachto predictionof the 3-dimensionalstructuresofproteinbackbonesby neuralnetworks.FEBSLett.261:43–46.Branden,C., andTooze,J. 1999. Introductionto ProteinStructure. New York: GarlandPublishingInc.,2ndedition.Brunak,S.; Engelbrecht,J.; andKnudsen,S. 1991. Pre-dictionof humanmRNA donorandacceptorsitesfrom theDNA sequence.J. Mol. Biol. 220:49–65.Creighton,T. E. 1993.Proteins. New York: W. H. FreemanandCompany, 2ndedition.Dayhoff, M. O. ansSchwartz, R. M., and Orcutt, B. C.1978.A modelof evolutionarychangein proteins.AtlasofProteinSequenceandStructure5, Suppl.3:345–352.Fariselli,P., andCasadio,R. 1999.A neuralnetwork basedpredictorof residuecontactsin proteins.Prot. Eng. 12:15–21.Göbel, U.; Sander, C.; Schneider, R.; and Valencia,A.1994. Correlatedmutationsand residuecontactsin pro-teins.Proteins18:309–317.Gorodkin,J.; Hansen,L. K.; Lautrup,B.; andSolla,S. A.1997. Universaldistribution of salienciesfor pruning in

Page 11: RTH · ondary structure of the proteins could not be assigned by the program signment DSSP (Kabsch & Sander 1983), as the DSSP as-isused to quantify thesecondary structure identity

layeredneuralnetworks. Int. J. Neural Systems8:489–498.(http://wwwW .wspc.com.sg/journals/ijns/85_6/gorod.pdf).Hobohm,U.; Scharf,M.; Schneider, R.; and Sander, C.1992. Selectionof a representative setof structuresfromtheBrookhavenProteinDataBank. Prot. Sci.1:409–417.Kabsch,W., andSander, C. 1983. Dictionaryof proteinsecondarystructure: Pattern recognitionand hydrogen-bondedandgeometricalfeatures. Biopolymers 22:2577–2637.Kullback,S.,andLeibler, R. A. 1951.On informationandsufficiency. Ann.Math.Stat.22:79–86.Lund, O.; Frimand, K.; Gorodkin, J.; Bohr, H.; Bohr,J.; Hansen,J.; and Brunak, S. 1997. Protein dis-tanceconstraintspredictedby neuralnetworks andprob-ability density functions. Prot. Eng. 10:1241–1248.(http://www.cbs.dtu.dk/services/CPHmodels).Maiorov, V. N., andCrippen,G.M. 1992.Contactpotentialthat recognizesthecorrectfolding of globular proteins.J.Mol. Biol. 227:876–888.Mathews, B. W. 1975. Comparisonof the predictedandobserved secondarystructureof T4 phagelysozyme.Biochem.Biophys.Acta405:442–451.Mirny, L. A., and Shaknovich, E. I. 1996. How to de-rive a proteinfolding potential?A new approachto anoldproblem.J. Mol. Biol. 264:1164–1179.Miyazawa, S., and Jernigan,R. L. 1985. Estimationofeffective interresiduecontactenergies from protein crys-tal structures: cuasi-chemicalapproximation. Macro-molecules18:534–552.Myers,E. W., andMiller, W. 1988.Optimalalignmentsinlinearspace.CABIOS4:11–7.Olmea,O., and Valencia,A. 1997. Improving contactpredictionsby thecombinationof correlatedmutationsandothersourcesof sequenceinformation. Folding & Design2:S25–S32.Pearson,W. R. 1990. Rapidandsensitive sequencecom-parisonwith FASTPandFASTA. Meth.Enzymol.183:63–98.Reese,M. G.; Lund,O.; Bohr, J.; Bohr, H.; Hansen,J. E.;andBrunak,S. 1996.Distancedistributionsin proteins:Asix parameterrepresentation.Prot. Eng. 9:733–740.Rost,B., andSander, C. 1993. Predictionof proteinsec-ondarystructureatbetterthan70%accuracy. J. Mol. Biol.232:584–599.Schneider, T. D., and Stephens,R. M. 1990. Sequencelogos: a new way to displayconcensussequences.Nucl.AcidsRes.18:6097–6100.Sippl, M. J. 1990. Calculationof conformationalensem-bles from potentialsof meanforce. An approachto theknowledge-basedpredictionof local structuresin globularproteins.J. Mol. Biol. 213:859–83.Skolnick, J.; Kolinski, A.; andOrtiz, A. R. 1997. MON-SSTER:A methodfor folding globular proteinswith asmallnumber-commentof distancerestraints.J. Mol. Biol.265:217–241.

Tanaka,S.,andScheraga,H. A. 1976. Medium-andlongrangeinteractionparametersbetweenaminoacidsfor pre-dicting three-dimensionalstructuresof proteins. Macro-molecules9:945–950.Thomas,D. J.;Casari,C.;andSander, C. 1996.Thepredic-tionof proteincontactsfrommultiplesequencealignments.Prot. Eng. 9:941–948.