RTH · ondary structure of the proteins could not be assigned by the program signment DSSP (Kabsch...
Transcript of RTH · ondary structure of the proteins could not be assigned by the program signment DSSP (Kabsch...
ISMB-99, 95-105
UsingSequenceMotifs for EnhancedNeural NetworkPrediction of Protein DistanceConstraints
Jan Gorodkin1 � 2, Ole Lund1 �, ClausA. Andersen1, and SørenBrunak1
1Centerfor BiologicalSequenceAnalysis,Departmentof Biotechnology,TheTechnicalUniversityof Denmark,Building 208,DK-2800Lyngby, Denmark
[email protected],[email protected],[email protected],[email protected]:+454525 2477,Fax: +45459315 85
2Departmentof GeneticsandEcology, TheInstituteof BiologicalSciencesUniversityof Aarhus,Building 540,Ny Munkegade,DK-8000AarhusC, Denmark
Abstract
Correlationsbetweensequenceseparation(in residues)anddistance(in Angstrom)of any pairof aminoacidsin polypep-tidechainsareinvestigated.For eachsequenceseparationwedefinea distancethreshold.For pairsof aminoacidswherethedistancebetweenCα atomsis smallerthanthethreshold,a characteristicsequence(logo) motif, is found. The mo-tifs changeas the sequenceseparationincreases:for smallseparationsthey consistof onepeaklocatedin betweenthetwo residues,thenadditionalpeaksat theseresiduesappear,andfinally thecenterpeaksmearsout for very largesepara-tions. We alsofind correlationsbetweenthe residuesin thecenterof the motif. This and other statisticalanalysesareusedto designneuralnetworks with enhancedperformancecomparedto earlier work. Importantly, the statisticalanal-ysis explainswhy neuralnetworks performbetterthansim-plestatisticaldata-drivenapproachessuchaspairprobabilitydensityfunctions.Thestatisticalresultsalsoexplain charac-teristicsof thenetwork performancefor increasingsequenceseparation.The improvementof the new network designissignificantin thesequenceseparationrange10–30residues.Finally, wefind thattheperformancecurve for increasingse-quenceseparationis directly correlatedto thecorrespondinginformationcontent.A WWW server, distanceP, is availableathttp://www.cbs.dtu.dk/services/distanceP/.
Keywords: Distanceprediction;sequencemotifs; distanceconstraints;neuralnetwork; proteinstructure.
Intr oductionMuch work have over the yearsbeenput into approacheswhich either analyze or predict features of the three-dimensionalstructureusingdistributionsof distances,cor-relatedmutations,andmorelatelyneuralnetworks,or com-binationsof thesee.g. (Tanaka& Scheraga1976;Miyazawa& Jernigan1985;Bohr et al. 1990;Sippl 1990;Maiorov& Crippen1992;Göbelet al. 1994;Mirny & Shaknovich1996;Thomas,Casari,& Sander1996;Lund et al. 1997;Olmea& Valencia1997;Skolnick, Kolinski, & Ortiz 1997;
�Presentaddress:StructuralBioinformaticsAdvancedTech-
nologiesA/S, AgernAlle 3, DK-2970Hørsholm,DenmarkCopyright c
�1999, AmericanAssociationfor Artificial Intelli-
gence(www.aaai.org). All rightsreserved.
Fariselli & Casadio1999). The ability to adoptstructurefrom sequencesdependsonconstructinganappropriatecostfunctionfor thenativestructure.In searchof suchafunctionwehereconcentrateon findinga methodto predictdistanceconstraintsthat correlatewell with the observed distancesin proteins.As theneuralnetwork approachis theonly ap-proachsofar which includessequencecontext for thecon-sideredpair of aminoacids,theseareexpectednot only toperformbetter, but alsoto capturemorefeaturesrelatingdis-tanceconstraintsandsequencecomposition.
The analysisinclude investigationof the distancesbe-tween amino acid as well as sequencemotifs and corre-lations for separatedresidues. We constructa predictionschemewhich significantlyimprove on anearlierapproach(Lundetal. 1997).
For eachparticularsequenceseparation,thecorrespond-ing distancethresholdis computedas the averageof allphysicaldistancesin a largedatasetbetweenany two aminoacids separatedby that amountof residues(Lund et al.1997). Here, we include an analysisof the distancedis-tributions relative to thesethresholdsand usethem to ex-plain qualitative behavior of the neuralnetwork predictionscheme,thusextendingearlierstudies(Reeseet al. 1996).For thepredictionschemeusedhereit is essentialto relatethe distributions to their means. Analysis of the networkweight compositionreveal intriguing propertiesof the dis-tanceconstraints:the sequencemotifs canbe decomposedinto sub-motifsassociatedwith eachof the hiddenunits intheneuralnetwork.
Further, as the sequenceseparationincreasesthere is aclearcorrespondencein thechangeof themeanvalue,dis-tancedistributions,andthe sequencemotifs describingthedistanceconstraintsof the separatedamino acids, respec-tively. The predicteddistanceconstraintsmay be usedasinputsto threading,ab initio, or loopmodelingalgorithms.
Materials and MethodData extractionThe data set was extractedfrom the Brookhaven ProteinData Bank (Bernsteinet al. 1977), release82 containing5762proteins.In brief entrieswereexcludedif: (1) thesec-
ondarystructureof theproteinscouldnotbeassignedby theprogram� DSSP(Kabsch& Sander1983),asthe DSSPas-signmentis usedto quantifythesecondarystructureidentityin the pairwisealignments,(2) the proteinshadany physi-cal chainbreaks(definedasneighboringaminoacidsin thesequencehaving Cα-distancesexceeding4 � 0Å) or entrieswherethe DSSPprogramdetectedchainbreaksor incom-pletebackbones,(3) they hadaresolutionvaluegreaterthan2 � 5Å, sinceproteinswith worseresolution,arelessreliableastemplatesfor homologymodelingof theCα trace(unpub-lishedresults).
Individualchainsof entrieswerediscardedif (1) they hada lengthof lessthan30 aminoacids,(2) they hadlessthan50%secondarystructureassignmentasdefinedby thepro-gramDSSP, (3) they hadmorethan85% non-aminoacids(nucleotides)in the sequence,and (4) they hadmore than10%of non-standardaminoacids(B, X, Z) in thesequence.
A representative setwith low pairwisesequencesimilar-ity wasselectedby runningalgorithm#1 of Hobohmet al.(1992) implementedin the programRedHom(Lund et al.1997).
In brief thesequencesweresortedaccordingto resolution(all NMR structureswereassignedresolution100). These-quenceswith thesameresolutionweresortedsothathigherpriority wasgivento longerproteins.
Thesequenceswerealignedutilizing thelocal alignmentprogram,ssearch (Myers& Miller 1988;Pearson1990)us-ing thepam120aminoacidsubstitutionmatrix (Dayhoff &Orcutt1978),with gappenalties� 12, � 4. As a cutoff forsequencesimilarity we appliedthe thresholdT � 290� L,T is the percentageof identity in the alignmentandL thelength of the alignment. Startingfrom the top of the listeachsequencewasusedasa probeto excludeall sequencesimilarproteinsfurtherdown thelist.
By visual inspectionseven proteinswereremoved fromthelist, sincetheir structureis either’sustained’by DNA orpredominantlyburied in the membrane.The resulting744proteinchainsarecomposedof the residueswheretheCα-atompositionis specifiedin thePDB entry.
Tencross-validationsetswereselectedsuchthat they allcontainapproximatelythesamenumberof residues,andallhave thesamelengthdistributionof thechains.All thedataare madepublicly available through the world wide webpagehttp://www.cbs.dtu.dk/services/distanceP/.
Inf ormation Content / Relative entropy measureHere we usethe relative entropy to measurethe informa-tion content(Kullback & Leibler 1951)of alignedregionsbetweensequenceseparatedresidues.Theinformationcon-tent is obtainedby summingfor the respective position inthealignment,I � ∑L
i � 1 Ii , whereIi is the informationcon-tentof positioni in thealignment.Theinformationcontentat eachpositionwill sometimesbedisplayedasa sequencelogo (Schneider& Stephens1990).Theposition-dependentinformationcontentis givenby
Ii � ∑k
qik log2qik
pk � (1)
wherek refersto the symbolsof the alphabetconsidered(hereaminoacids). The observed fraction of symbolk atpositioni is qik, andpk is thebackgroundprobabilityof find-ing symbolk by chancein thesequences.pk will sometimesbe replacedby a position-dependentbackgroundprobabil-ity, thatis theprobabilityof observingletterk at someposi-tion in thealignmentin anotherdatasetonewishesto com-pareto. Symbolsin logosturned180degreesindicatethatqik pk.
Neural networksAs in the previouswork, we apply two-layerfeed-forwardneuralnetworks, trainedby standardback-propagation,seee.g. (Brunak,Engelbrecht,& Knudsen1991;Bishop1996;Baldi & Brunak1998),to predictwhethertwo residuesarebelow or above a given distancethresholdin space. Thesequenceinput is sparselyencoded.In (Lund et al. 1997)the inputswereprocessedastwo windows centeredaroundeachof theseparatedaminoacids.However, hereweextendthatschemeby allowing thewindowsto grow towardseachother, andevenmergeto asinglelargewindow coveringthecompletesequencebetweentheseparatedaminoacids.Eventhoughsucha schemeincreasesthecomputationalrequire-ments,it allow usto searchfor optimalcoveringbetweentheseparatedaminoacids.
As therecanbea largedifferencein thenumberof posi-tive (contact)andnegative (no contact)sequencewindows,for a given separation,we apply the balancedlearningap-proach(Rost& Sander1993). Trainingis doneby a 10 setcross-validationapproach(Bishop1996), and the result isreportedastheaverageperformanceoverthepartitions.Theperformanceon eachpartition is evaluatedby theMathewscorrelationcoefficient (Mathews1975)
C � PtNt � Pf Nf� �Nt � Nf �
�Nt � Pf �
�Pt � Nf �
�Pt � Pf � � (2)
wherePt is thenumberof truepositives(contact,predictedcontact),Nt the numberof true negatives (no contact,nocontactpredicted),Pf thenumberof falsepositives(nocon-tact,contactpredicted),andNf is thenumberof falsenega-tives(contact,nocontactpredicted).
The analysisof the patternsstoredin the weightsof thenetworks is donethroughthe saliencyof the weights,thatis the costof removing a singleweight while keepingtheremainingones. Due to the sparseencodingeachweightconnectedto a hiddenunit correspondsexactly to a particu-lar aminoacidat a givenpositionin thesequencewindowsusedasinputs.We canthenobtaina rankingof symbolsoneachpositionin the input field. To computethe salienciesweusetheapproximationfor two-layerone-outputnetworks(Gorodkinet al. 1997),who showedthat thesalienciesforthe weightsbetweeninput andhiddenlayercanbe writtenas
skji � sj i � w2
j iW2j K � (3)
wherewj i is the weight betweeninput i andhiddenunit j,andWj theweightbetweenhiddenunit j andtheoutput.Kis a constant.Thekth symbolis implicitly givendueto the
0
500
1000
1500
2000
-6 -4 -2 0 2
Nu
mb
er
Physical distance (Angstrom)
Sequence separation 2
0
50
100
150
200
250
-20 -15 -10 -5 0 5 10 15 20 25
Nu
mb
er
Physical distance (Angstrom)
Sequence separation 11
0
20
40
60
80
100
120
-20 -10 0 10 20 30 40
Nu
mb
er
Physical distance (Angstrom)
Sequence separation 16
0
200
400
600
800
1000
-6 -4 -2 0 2 4
Nu
mb
er
Physical distance (Angstrom)
Sequence separation 3
0
50
100
150
200
-20 -10 0 10 20
Nu
mb
er
Physical distance (Angstrom)
Sequence separation 12
0
10
20
30
40
50
60
70
80
90
-20 -10 0 10 20 30 40 50
Nu
mb
er
Physical distance (Angstrom)
Sequence separation 20
0
100
200
300
400
500
600
700
800
-8 -6 -4 -2 0 2 4 6
Nu
mb
er
Physical distance (Angstrom)
Sequence separation 4
0
50
100
150
200
-20 -10 0 10 20 30
Nu
mb
er
Physical distance (Angstrom)
Sequence separation 13
0
10
20
30
40
50
60
70
-40 -20 0 20 40 60 80
Nu
mb
er
Physical distance (Angstrom)
Sequence separation 60
Figure 1: Lengthdistributionsof residuesegmentsfor correspondingsequenceseparations,relative the respective meanvalues.Sequenceseparations2, 3, 4, 11,12,13,16,20,and60 areshown. Theseshow therepresentative shapesof thedistancedistributions.Theverticallinethroughzeroindicatesthedisplacementwith respectto themeandistance.
sparseencoding. If the signsof wj i andWj are opposite,we display the correspondingsymbol upsidedown in theweight-saliency logo.
ResultsWe conductstatisticalanalysisof thedataandthedistanceconstraintsbetweenaminoacids,andsubsequentusethere-sultsto designandexplain thebehavior of a neuralnetworkpredictionschemewith enhancedperformance.
Statistical analysisFor eachsequenceseparation(in residues),we derive themeanof all thedistancesbetweenpairsof Cα atoms.Weusethesemeansasdistanceconstraintthresholds, andin a pre-diction schemewe wish to predictwhetherthedistancebe-tweenany pair of aminoacidsis aboveor below thethresh-old correspondingto a particularseparationin residues.To
analyzewhich pairsareabove andbelow the threshold,itis relevant to compare:(1) thedistribution of distancesbe-tweenaminoacidpairsbelow andabove thethreshold,and(2) thesequencecompositionof segmentswherethepairsofaminoacidsarebelow andabovethethreshold.
Firstweinvestigatethelengthdistributionof thedistancesasfunction of the sequenceseparation.A completeinves-tigation of physicaldistancesfor increasingsequencesep-aration is given by (Reeseet al. 1996). In particular itwas found that the α-helicescauseda distinct peakup tosequenceseparation20, whereasβ-strandsare seenup toseparations5 only. However, when we perform a simpletranslationof thesedistributionsrelative to their respectivemeans,thesameplotsprovideessentiallynew qualitativein-formation,which is anticipatedto bestronglycorrelatedtotheperformanceof a predictor(above/below thethreshold).In particularwefocusonthedistributionsshown in Figure1,
but we alsousethe remainingdistributions to make someimportant� observations.
Wheninspectingthedistancedistributionsrelativeto theirmean,two main observationsaremade.First, the distancedistribution for sequenceseparation3, is the one distancedistributionwherethedatais mostbimodal.Thussequenceseparation3 providesthemostdistinctpartitionof thedatapoints. Hence,in agreementwith the resultsin (Lund etal. 1997)we anticipatethat the bestpredictionof distanceconstraintscanbeobtainedfor sequenceseparation3. Fur-thermore,weobserve thattheα-helix peakshiftsrelative tothe meanwhen going from sequenceseparation11 to 13.Thelengthof thehelicesbecomeslongerthanthemeandis-tances.Thisshift interestinglyinvolvethepeakto beplacedat themeanvalueitself for sequenceseparation12. Duetothis phenomenon,we anticipate,that,for anoptimizedpre-dictor, it canbeslightlyhardertopredictdistanceconstraintsfor separation12 thanfor separations11and13.
Thepeakpresentat themeanvaluefor sequencesepara-tion 12 doesindeedreflectthe lengthof helicesasdemon-stratedclearlyin Figure2. Ratherthanusingthesimplerulethat eachresiduein a helix increasesthe physicaldistancewith 1.5 Angstrom(Branden& Tooze1999),we computedthe actualphysicallengthsfor eachsize helix to obtain amoreaccuratepicture.Thephysicallengthof theα-heliceswascalculatedby findingthehelicalaxisandmeasuringthetranslationper Cα-atom along this axis. The helical axiswasdeterminedby themasscenterof four consecutive Cα-atoms. Helicesof length four are thereforenot included,sinceonly onecenterof masswaspresent.We seethathe-lices at 12 residuescoincidewith sequenceseparation12.Again we usethis as an indication that at sequencesepa-ration12 it maybe slightly harderto predictdistancecon-straintsthanat separation13.
0� 5� 10 15 20 25 30Separation/Length (Residues)
0
5
10
15
20
25
30
35
Ph
ysic
al d
ista
nce
(A
ng
stro
m)
Separation and helix length
Average sequence separation�Average helix length�
Figure 2: Meandistancesfor increasingsequenceseparationandcomputedaveragephysicalhelix lengthsfor increasingnumberofresidues.
As the sequenceseparationincreasesthe distancedistri-bution approachesa universalshape,presumablyindepen-dentof structuralelements,whicharegovernedby morelo-cal distanceconstraints.The traits from the local elementsdo asmentionedby (Reeseet al. 1996)(who merge largeseparations)vanishfor separations20–25. (Here we con-sideredseparationsup to 100residues.)In Figure1 we seethat the transitionfrom bimodal distributions to unimodaldistributionscenteredaroundtheirmeans,indicatesthatpre-diction of distanceconstraintsmustbecomeharderwith in-creasingsequenceseparation.In particularwhentheuniver-salshapehasbeenreached(sequenceseparationlargerthan30),we shouldnot expecta changein thepredictionabilityas the sequenceseparationincreases.The universaldistri-butionhasits modevalueapproximatelyat � 3 � 5 Angstrom,indicatingthatthemostprobablephysicaldistancefor largesequenceseparationscorrespondsto the distancebetweentwo Cα atoms.
Noticethat theuniversalityonly appearswhenthedistri-bution is displacedby its meandistance.This is interestingsincethe meanof the physicaldistancesgrows as the se-quenceseparationincreases.Froma predictabilityperspec-tive, it thereforemakesgoodsenseto usethemeanvalueasa thresholdto decidethedistanceconstraintfor anarbitrarysequenceseparation.
A usefulpredictionschememustrely on the informationavailable in the sequences.To investigateif thereexists adetectablesignalbetweenthe sequenceseparatedresidues,for eachsequenceseparation,we constructedsequencelo-gos(Schneider& Stephens1990)asfollows: Thesequencesegmentsfor which thephysicaldistancebetweenthesepa-ratedaminoacidswasabovethethresholdwereusedto gen-erateaposition-dependentbackgrounddistributionof aminoacids. The segmentswith correspondingphysicaldistancebelow the thresholdwere all alignedand displayedin se-quencelogos using the computedbackgrounddistributionof amino acids. The pure information contentcurves areshown for increasingsequencein Figure3. Thecorrespond-ing sequencelogos aredisplayedin Figure 4. We useda“margin” of 4 residuesto theleft andright of thelogos,e.g.,for sequenceseparation2, thephysicaldistanceis measuredbetweenposition5 and7 in thelogo.
Thechangein thesequencepatternsis consistentwith thechangein thedistributionof physicaldistances.Up to sepa-rations6–7,thedistributionof distances(Figure1) containstwo peaks,with theβ-strandpeakvanishingcompletelyforseparation7–8 residues.For the sameseparations,the se-quencemotif changesfrom containingoneto threepeaks.For largersequenceseparations,themotif consistsof threecharacteristicpeaks,thetwo locatedatpositionsexactlycor-respondingto theseparatedaminoacids.Thethird peakap-pearin the center. This peaktells us that for physicaldis-tanceswithin the threshold,it indeedmatterswhethertheresiduesin the centercanbendor turn the chain. We seethatexactlysuchaminoacidsarelocatedin thecenterpeak:glycine,aparticacid,glutamicacid,andlysine,all thesearemediumsize residues(except glycine), beinghydrophilic,therebyhaving affinity for thesolvent,but not beingon theoutermostsurfaceof theprotein(seee.g., (Creighton1993)
1� 2�
3�
4�
5�
6�
7�
8�
9�
10� 11Position
0.00
0.02
0.04
0.06
0.08
0.10
0.12
Bit
sSeparation: 2!
2�
4�
6�
8�
10� 12�
14�
16�
18�
20� 22�
Position 0.00
0.02
0.04
0.06
Bit
s
Separation: 13"
2�
4�
6�
8�
10� 12�
14�
16�
Position 0.00
0.02
0.04
0.06
Bit
s
Separation: 7!
2�
4�
6�
8�
10� 12�
14�
16�
18�
20� 22�
24�
26�
28�
Position 0.00
0.01
0.02
0.03
Bit
s
Separation: 20"
2�
4�
6�
8�
10� 12�
14�
16�
18�
Position
0.00
0.02
0.04
0.06
Bit
s
Separation: 9!
5�
10� 15�
20� 25�
30� 35�
Position
0.00
0.01
0.02
Bit
s
Separation: 30"
2�
4�
6�
8�
10� 12�
14�
16�
18�
20�Position
0.00
0.02
0.04
0.06
Bit
s
Separation: 12"
5�
10� 15�
20� 25�
30� 35�
40� 45�
50� 55�
60� 65�
Position
0.00
0.01
0.02
Bit
s
Separation: 60"
Figure3: Sequenceinformationprofilesfor increasingsequenceseparation.
for aminoacid properties).As the sequenceseparationin-creasesfrom about20 to 30 the centerpeaksmearsout, inagreementwith thetransitionof thedistancedistributionthatshifts to theuniversaldistribution in this rangeof sequenceseparation.The sequencemotif likewisebecomes“univer-sal”. Only the peakslocatedat the separatedresiduesareleft.
Thecompositionof singlepeakmotifsresembleto alarge
degreethecompositionof thecenterfor motifshaving threepeaks. However, the slightly increasedamountof leucinemoves from the centerof the one peakmotifs to the sep-aratedaminoacid positionsin the threepeakmotifs. Thereversehappensfor glycine. Interestingly, asthe sequenceseparationincreasesvaline(andisoleucine)becomepresentat theoutermostpeaks.In agreementwith thehelix lengthshift from below to above the thresholdat separations12
Separation: 2 Separation: 13
Separation: 7 Separation: 20
Separation: 9 Separation: 30
Separation: 12 Separation: 60
0.00.020.040.060.08
0.10.12
bit
s |1
C
W
H
MYQ# PF
NR
KI
TSDE
V
GL$A2
W
C% H
M
YQF
P& NRI
K' TSD
E
VGL
A
3
C(W
HM)YFQ* PNR+ I T, KSD- VE
G.LA
4
C/WHM
YFQ
PNR
ITSKVGD
ELA
5
WCM0 HYFQ1R2P3 IN
T4 VKG5S6EDL
A7
6
CWHM
YFPQ
ITVR
N
S8K9D:E;
GLA<
7
PWCH=M> YFQ
N
IT?RSVD@KAEB
GLCAD
8
WCHME YFPQF NRG ISTDVE
K
GLHA
9
W
CHIMYPJFQNTSRK DL IEVK
GAML
10 CWHM
Y
FQN PNRO TSDI
E
VK
GLA
11 CWM
HPYQQ PFR NR
TSI
DSEKT VGLUAV |
0.0
0.02
0.04
0.06
bit
s |
1
WCWMXHY Q
YFR
NZ PKI
ETSD[VAG
L\
2
WC]MH_ QYF
PNR
KI
ESDTV
G
AL
3
WCMaHQYFb P
NR
KSDETI
Vc GAL
4
WdCMHe QY
PNF
RKDf SETI
GVAgLh
5
WiCMjHk QY
PNFl RKDESTIm GVnAo
Lp
6
WqCMrHs QY
PFt NRKEDI
TSVu GAvL
7
WCMwHx Q
YFy PRNz I KETS{D| VGL}A~
8
WC� M
H� Q
Y
FPRN� IKTE
S
VD�G�L�A�
9
WC� MHYFQ� RIP�N� VTK
S�ED� LA�G
10 WC� M� H
YFQ� IRP�N� V� TK�E�S�
D� LAG�
11 WC� M� H
YFQ
IRP� V�NT�K
ES�D�
L� A G
12 WC
MHF¡ YQ
IRP¢ VN
T£S¤K¥E¦D§
L¨ AG©
13 WC
MHª F« YQ
IRN
P
V¬ TSKED®
L¯ AG°
14 WC
MHYFQ
IRNPT
V±SEKD
LAG
15 W²
C
MHYF³ QR
NP
IST
KED
VLAµG
16 CWMH
Q¶ F·YR
N
PI¹ STK
DEV
Lº AG
17 W
C
MHQY
PF
N» RSKD¼ E½TI¾VGA¿L
18 W
C
HM
QY
PNF
RDKÀ SÁ EÂ TIÃ GVÄAL
19 W
C
HÅMQY
PFÆ NRÇ KDÈ STEÉ I GV
ALÊ
20 WC
MË HÌ QY
PFÍ NR
KTDÎ SIÏ EGV
ALÐ
21 WCMÑ HQY
PF
NRÒ KTÓ DÔ SI
EV
GLÕA
22 WCMÖ H
QYFPNR×IØ TDÙ SÚ KÛ EÜ VGLÝA
|
0.0
0.02
0.04
0.06
bit
s |
1
WCMH
YQÞFß PNRKIà STDE
V
GLáAâ
2
WCMH
YQãFä PNR
KIå TSDEæVç GèLAé
3
WC
HMêYëQìFí PNRKIî STDE
Vï GLðA
4
WC
HMYñQò PF
NRKTó Sô I DE
Võ GöLA÷
5
WCø HMùYQú PFû NRü K
TIý SDþEÿ GVL�A
6
WCHM� YFQ� PNR� ITVK
S�D�
G�E�LA
7
WCHM
YFQ� PIRN
TVS
KDE�
G�LA
8
WCM� H� YFPQ� IR�
VN
TSK�D�EL�G�A
9
WCHM
YFP�QIR�N
VT� SK�D
E�
LG�A�
10 CWM
H� YFPQ
INR
T� VSD�K
E LG!A"
11 WCHM
YPF#QNI$RTSD% VK
E
GL&A
12 W
C
H'MPYFQ
NR
DS( T) I
EK*V+
GA,L-
13 W
C
HMYQ
PF
NR
DTSI.EK
G/VA0L
14 W
CMHYQ
P1FN2RST3 D4 I
KE
GVAL5
15 W
CMHYQ
P6FNR
STD7 I
K
EV
GAL
16 WC
MHYQF
PNR
T8 SI9 D:KEV
GLA|
0.0
0.02
0.04
bit
s |1
WC; M
H
QYFR< NPI
EK=TS> DV
LAG
2
WC? M
H@ QY
FRNPKEA I
DBTSVC ALG
3
WC
M
H
QYF
PNRKEI
TD SDVE AG
L
4
WFCGMHQYHFPNRI KEJ DK I SLTVGM
ALN
5
WOCPMHQY
PQ NF
RKEDSR TI
VS GAL
6
WCTMHQYF
PNURV KW EI
DTX SV
GAYL
7
WCMH
QY
PF
NRKEI
DTSVZ GA
L
8
W[C\M]HQY_FPNR` Ka EI
Db TSVc GA
L
9
WCM
HQY
FPRd NKI
ETeSDV
GAL
10 WCMHQYF
P
Nf Rg IhTEi KSjDk VLA
G
11 WC
MH
YQFPRN
IK
ETlSDVLA
G
12 WCm M
Hn YFQRo PNp IKqTrEsSDVLAGt
13 WCu MH
YFQRPvNIT
KESw VD
LAG
14 WCx My Hz Y{ F
Q| RP} IN~T�KE�
VS�DL� A�G
15 WC
MHYFQRIP
NTK
VESD
LAG
16 WC
MHYFQRIP�NK�T
S� VE�D�
LAG
�
17 WC� M
HYFQRIP
NKT�S�E
VD
LAG
18 WC� M
HY� F� QRP� IN
K�TSE�D�
VL� AG
19 WC
MHYFQRP
N
IKT�SEDV
LAG
20 WC� MH� Q
YFPR�NIK� ST�EDV� LA�G
21 WC� MHQY
FP� R� NKI
ES TDV
A¡ LG
22 WC¢ MH£ Q¤YFP
N¥ RK¦ IT§ SEDV
G
AL
23 WC
MHQY
F
PNRKDTSI
EV
GAL
24 W©CªM« H
QY¬ PFNRK® DSET
IV
GAL°
25 W±C² HM
QY
PF
NRKDES³ T´ I
GVAµL
26 WC¶ HM
QYF
PRNKDETSI
V
GA·L
27 WCMH
QY
FPRN
KI
ST
EDV
GLA¹
28 WCºM»H¼ QY
F½ PRN¾ I¿ KTÀ ESÁ DV
GÂLAÃ
29 WCÄMHÅ Q
YFPRN
I
TEKÆ DS
VLG
A
|
0.0
0.02
0.04
0.06
bit
s |
1
WCMÇHÈ Y
QÉFPN
RÊ KIË STÌ EDV
GLÍA
2
WCMÎHÏ YQÐFPNRÑ KÒ I STDÓEVGÔLA
3
WCMHYQ
PF
NRKÕ I
STDÖEVGL×A
4
WC
HMQY
PF
NRKØ SÙ TÚ DIÛEGV
AL
5
WÜCHMYQ
PFÝ NRKTSI
DE
GVAÞL
6
WC
HMß YFQà PNR
ITKáSDVEâ GãLA
7
WCM
HYFQ
PäRINå TVK
SDE
GLæA
8
WCM
HYFQ
PIR
N
Tç VèSKDELGA
9
Wé CM
Hê YFëQPIR
N
VTìSK
DEí
LGA
10 Wî C
HM
YFQ
PIRïNð
VTSK
DE
LGAñ
11 WCMò Hó Y
FôQõ PIN
Rö TVSD
KE÷LøG
ùA
12 WCMú H
YFû PüQNIR
TSDVK
EGLýA
13 WCþMHÿYP� F�QN�RI
ST� D�KE
V� GL�A
14 W
C
HM
PYQF
NR
DS� T� I
K
EGV
AL
15 W�CMHY
Q� PF NR
DSTI
EK
GVA�L
16 W
C�M�HYQ�
PF
N�RTS� D� I
KE�V
G�LA
17 WC
M
HQY
PF
NR
T� SDI
KEV
G�AL
18 WCMH�QYFP� NR
T� SI
D� K�EVG�AL|
0.0
0.04
bit
s |1
C
WMH
QY
FPRNKEI�TS DV
AL!G
2 W
C
M"HQYF
RNP# K$ EI
T
S% DV
AL&G
3 W
C
MH
Q'YF( PNR) K* EI+ DST,VAL-G
4
WCMH
QYF
PNRKEDI
STVG
AL
5
WC.M/H0 QY1F2 P
NRKEDS3 I
TV
GAL
6
WCMH
Q4YF5 P6 N7RKE8 I
D9TSVG: A;L
7
WC
M<H
Q=YFPNR
K> EI
DTS?VGLA
8
CW
M
H
QYF@ P
NRKA EI
DTSV
AGLB
9
CW
MH
QFY
PNRC KI
ETSD DEVLF AG
10 CWMHQYFG PH N
R
KIDSTEVLI AG
11 CWMH
QYFJ PK NLRI KE
TSDV
AGLM
12 WMCH
QYFN PO NRPI
EDK
STV
AGL
13 CWMH
QYFQ P
R
NRI
ED
KSTV
A
GL
14 CWMHQYFRNPEIKSTDV
A
GL
15 CWMHQYRFPNS IT ED
KSTUVA
GL
16 CWMHV YFW QRNP
X I
EST
K
DV
ALYG
17 WC
MHYFQ
PR
N
IKTSEZDV
LAG
18 WC
MH[ QYFP\ RN
I]KET_SDa V
Lb AGc
19 WC
Md HQe YFf Rg PNh IKiDSjTEVk Ll AGm
20 WC
MHn YQ
FPo RpNIqKTESDVLr AG
21 WC
Ms HYQFt Pu RN
IvKSwTEDVLx AG
22 WC
MHQYFPRNy IKzETSDVLAG
23 C
WMH{ QYFP| RN} I~ KT�S�EDVLAG
24 WC
M� H� Q� YF� P� R�NIK�TS� ED
V
L� A�G
25 CWMH
QYFRN
PIKT
SEDV
ALG
26 CWM� HQFYP� R
N
IK�DST�E�V� L� AG�
27 CWMH
Q� FYP� R
NI
K�TE�SDVL� AG
28 CWMH
Q� FYR
N
PI
K�TSEDV
ALG
29 CWMH
Q� YFR
N
PI
K�TSEDVL� AG
30 CWMHQ
FYPRN
KITS� EDV
LAG
31 CWMH� QY
FP
N
RK� I ET S¡ DV
L¢ AG£32
CWM
H¤ Q¥YF
P¦ R§ N¨ KI©TSª E« D¬VL AG®33
WCMH
Q
YF
P° RNK± I DST
EV
AG
L²34
WCMH
QYF³ P
RNK´ EIµTSDV¶ GAL
35 W
CMH
QYF
PNRKEDST
IV
GAL
36 WCM
H· QYF
P¹ Nº RK» ESI
DT¼VGAL½
37 WCMH
Q¾ YF¿ RNPÀ KÁ I EÂTSDÃVGLÄ AÅ
38 CWMH
QYFRNPKEI
TS
DVG
AL
39 CWMH
QYFÆ RNPÇ K
ITS
EDVÈ AG
L
|
0.0
0.02
0.04
0.06
bit
s |
1
WCMHY
F
ÉQÊ P
Ë N
R
K
IÌ DSTEVAGÍL
2
WCMHQYFÎ P
NÏRKÐ I DÑ SÒ TEV
GAÓLÔ
3
WCMHYQ
PF
NÕRKDSÖ I
TEV
G×AL
4
WØCMHQY
PÙFNRKDSTÚ EIÛ GV
AÜL
5
WCHM
QY
PNFÝ
RKDÞ SETIß
GVàAL
6
WCáMâHQYFã Pä NåRKI
DTæSç EVè GéALê
7
WCMHYQ
FPN
R
IKë TSìEDVGLíA
8
WCMîHYQ
FPRN
ITKVS
DE
GLA
9
WCï MHYFQð Pñ RINò Vó TKôSõEö
D
LA÷Gø
10 WC
Mù Hú YFQ
IûRPNü
Vý TKþSÿED
LA�G�
11 WC
MHYFQ
IR�P
VN�
TS�KE�D�
L�AG
12 WC
MHYFQ
IRP�N
VTSK
E�D�
LAG
13 WC
MHYFQ
IPRN� T
VS�KED
LA�G
14 WC� M
HYFQ�P�RN� I� STD� VE�K� L�AG
15 CWMHYFQ
PN�RI� S
T� DKEVGLA
16 W
C
MH
QY
PF
NR
S� DTKI
EV
GAL
17 W�CHM� Q
P YNF!R
DS" KETI
GV#AL$
18 W%CMH
Q&YPF' NR
T( S) DKEI* G+VAL
19 WC
MHQ,YPF
NR
TS- KDI
EG.VLA
20 WCM
HQY
PF
N/RTKS0 I
EDV
GAL
21 WCMHQ1
YF2 P
3 N
R
T
I4 KE5 S6 D7V
G8L9A
|0.0
0.02
0.04
bit
s |
1
CWMH
Q: YF
RNPKEI;T<SD=VL> A?G
2
CWMH
Q@YAFPRB NC KEI
TDSDEVF LG AG
3
CWMH
QYF
PRNKEI
DTSV
LAG
4
CWMHH QY
PFI RNKEI
DTJSVKGL
A
5
CWLMHM QNYO PFP RQ N
KR ES DI
TTSVU
GVLAW
6
CWMH
QYF
PRX NKEI
DTYSZVGLA
7
CW
M[H
Q\YF
PR]NK^ I EDTS_VGLA
8
CWMH
QaYbFc PRN
KEI
DTdSe VGLfA
9
CWMH
Q
YF
Pg RN
IEKT
S
DVG
LhAi
10 CW
MHQY
FP
NR
I KET
SjDk VG
LAl
11 CWMHQY
Fm PR
N
I Kn EoTSpDVq LrGAs
12 CW
MHQYFt PRuN
I KEvTSD
VLAG
13 CWMHQYRFw PN
I Kx D
STy EVz LAG
14 CMWHQYF{ RNPI| E
KT}SDVAG
L
15 WCHMFYNPQREIT~ KVDSG
LA
16 WCHMQYPNFREIKDST� VAG
L
17 WCHMQYFPIR�NE� D
KST� VA� G� L�
18 HWM
� QYFNPIKRDSET� VG
L�A
19 YMQHIKFRNPEST� VDAGL�
20 MHFYN� PQREIKVDS�T�GL�
A
21 WCHMYPFQRIKNET� VD
SG
LA
22 CWMHQYRFPN
I EDKST� VAG
L
23 CMWH
Q�YF� R
�NP
�K
I D
ST� E�VG
LA�
24 CWMH
Q�Y
R
�FP
�N
I
T� E�KS
DVG
L� A
25 CWMHQYPFRN
I E
KT�S
DVA
GL
26 CWMHQYFR�
NPI E
KT SD¡V
LA¢G
27 WCHMQYNPF
I£ REDKST¤VA
G
L
28 YMQHNPFI RKEDSTVAGL
29 I YFRNPQKEDST
¥VAGL
30 CMWHQ
YF¦ R
§NPI
E
DKS
TV
LA©G
31 CWMHQYFRª
NPEIT«KVDS
A
G
L
32 CMWHQYFRNPEIT¬ K
SDV
L® A
G
33 CWMH
Q¯ YF° R
±NPK²IDST
E
V
AL
G
34 CWMH
Q
Y
RFP³NI
EKTSDV
AG
L
35 CWMHQYFRNPµ I EKT¶SDVLAG
36 CWMH
Q·FY
R¹ NºP» I E
ST
KDV
A
G
L
37 CWMHQYF
R
¼ NPI E½ST¾KDV
LA¿G
38 CWMHQYFÀ RÁ
NP
ITEKSDV
ALÂG
39 CWMÃ
H
Q
Y
FÄ RÅNPI EST
KÆDV
LAG
40 CWMHQYFRNPÇ EIT
È KSDV
ALG
41 CHWMQRYFÉNPEIT
ÊKSDV
LAG
42 CW
MHQYFË R
N
PÌK
IDS
TÍ EÎVLAÏG
43 CWMHQYFR
ÐNPEI
KSTDV
A
G
L
44 CWMHQYFRÑ
NPEIKSTDV
A
G
L
45 CWMHQYPFRNTÒ I KÓ E
SDVA
G
L
46 CWMHQFY
RNPEÔ IT
Õ KV
DS
LAG
47 CW
MHQY
FR
N
PKI ET
V
DS
LAG
48 CW
MHQYF
Ö P× RØN
KIDS
TÙ EÚ VAÛGL
49 CW
MÜ HQYF
Ý RÞN
Pß KIE
TàSD
Vá LAG
50 CWMHQYFRâ
N
PITEKSD
VãG
LäA
51 CW
Må HQFæY
PRN
KIE
TçSèDVLGA
52 CWMé HQF
êY
PRëNK
I
DSTE
V
G
LAì
53 CW
MHQY
FR
N
PITEKSD
VGLAí
54 CW
MHQY
Fî RïN
Pð ITñE
KòSD
Vó GLAô
55 CWMHõ QYFö P÷ Rø
NI
TùE
KúSD
VG
LûA
56 CWMHQY
FR
N
PIK
T
SED
Vü LAG
57 CWMH
QYFPRýNK
I ETþSD
VÿGLA
58 CWM� H� QYF� PRNI
T� E� K�SD�V� LGA�
59 CWMH
QYFPRNI
T EK
SD V
LGA�
60 CWMH
QYFRN
PKITS
E
DV� LAG
61 CWMH
Q YFP� R� NI
EK� ST
� DV�G�LA�
62 C
W�MH
QYF
PR�NKEI
S�TD�VGLA
63 CWMH
Q�YF� PR� NKEI
T� S� D�VGLA
64 CW
MH
Q�Y PF! R" N# K$ ES% I D&TVGLA
65 C
W'M(HQY
PRF
NKEDST
) IV
GAL
66 CWM
H* QYF
PRN
KET+ SI
DV, GLA-
67 CWMH. Q/
YF
PR0N1 KI
T2 S3 E
DV4 GL5A
68 CWMH
Q6YRF
PNKI
E7TSD8VGLA
69 CWMH
Q9YF
PR
N
K: I DS;TEVGAL
|Figure 4: Sequencelogos for increasingsequenceseparation. A margin of 4 residueson both ends of the logo is used. Sym-bols displayedupsidedown indicate that their amountis lower than the backgroundamount. This figure is available in full color athttp://www.cbs.dtu.dk/services/distanceP/.
and13, thecenterpeakshifts from slight over to slight un-derrepresentation< of alanine:helicesarenolongerdominantfor distancesbelow thethreshold.
The smearingout of the centerpeakfor large sequenceseparationsdoesnot indicatelack of bending/turnproper-ties, but reflectsthat suchaminoacidscan be placedin alargeregionin betweenthetwo separatedaminoacids.Whatwe seefor small separations,is that the signal in additionmust be locatedin a very specificposition relative to theseparatedaminoacids.
Neural networks: predictionsand behaviorAsfoundabove,optimizedpredictionof distanceconstraintsfor varyingsequenceseparation,shouldbeperformedby anon-linearpredictor. Herewe useneuralnetworks. Clearly,far mostof the informationis found in thesequencelogos,so we expectonly a few hiddenneurons(units) to be nec-essaryin thedesignof theneuralnetwork architecture.Asearlier, weonly usetwo-layernetworkswith oneoutputunit(above/below threshold).We fix thenumberof hiddenunitsto 5, but thesizeof theinputfield mayvary.
We start out by quantitatively investigating the rela-tion betweenthe sequencemotifs in the logos above, andthe amountof sequencecontext neededin the predictionscheme.First, we choosethe amountof sequencecontext,by startingwith local windows aroundtheseparatedaminoacidsand the extendingthe sequenceregion, r, to be in-cluded.We only includeadditionalresiduesin thedirectiontowardsthecenter. Thenumberof residuesusedin thedirec-tion awayfrom thecenteris fixedto 4,exceptin thecasesofr = 0 andr = 2, wherethatnumberis zeroandtwo respec-tively. At somepoint the two sequenceregionsmay reacheachother. Whetherthey overlap,or just connectdoesnotaffect the performance.For eachr = 0 > 2 > 4 > 6 > 8 > 10> 12> 14,wetrainednetworksfor sequenceseparations2 to 99,result-ing in thetrainingof 8000networks,sinceweused10cross-validationsets.Theperformancesin “fraction correct”and“correlationcoefficient” areshown in Figure5.
Figure5(b) shows the performancein termsof correla-tion coefficient. We seethat eachtime the sequencesepa-rationbecomesso largethat thetwo context regionsdo notconnecttheperformancedropswhencomparedto contextsthatmergeinto a singlewindow. Thisbehavior is generic,ithappensconsistentlyfor increasingsizeof sequencecontextregion, thoughthe phenomenonis asymptoticallydecreas-ing. An effect canbefoundup to a region sizeof about30,seeFigure6. Several featuresappearon the performancecurve,andthey canall beexplainedby thestatisticalobser-vationsmadeearlier. First, thecurvewith nocontext (r = 0)correspondsto thecurve oneobtainsusingprobabilitypairdensityfunctions(Lund et al. 1997). Thebadperformanceof suchapredictoris to beexpecteddueto thelackof motifthatwasnot includedin themodel.Thegreencurve(r = 4)is the curve that correspondsto the neuralnetwork predic-tor in (Lund et al. 1997). As observedthereina significantimprovementis obtainedwhenincluding just a little bit ofsequencecontext. As the size of the context region is in-creased,sois theperformanceup to aboutsequencesepara-tion 30. Thenthe performanceremainsthe sameindepen-
dentof the sequenceseparation.We evennotethe drop inperformancefor sequenceseparation12, aspredictedfromthe statisticalanalysisabove. It is alsocharacteristicthat,asthedistributionof physicaldistancesapproachestheuni-versalshape,andasthe centerpeakof the sequencelogosvanishes,theperformancedropsandbecomesconstantwithapproximately0.15 in correlationcoefficient. The changein predictionperformancetakes placeat sequencesepara-tions for which changesin the distribution of physicaldis-tanceandsequencemotifs change.Due to the not vanish-ing signal in the logos for large separations,it is possibleto predictdistanceconstraintssignificantlybetterthanran-dom,andbetterthanwith a no-context approach.Thecon-clusionis notsurprising:thebestperformingnetwork is thatwhich usesasmuchcontext aspossible.However, for largesequenceseparationsmore than 30–35residues,the sameperformanceis obtainedby just using a small amountofsequencecontext aroundthe separatedamino acids,as in(Lundetal. 1997).We foundthattheperformancebetweentheworstandbestcross-validationsetis about0.1for sepa-rationsup to 12 residues,thenabout0.07for separationsupto 30 residues,andthen0.1–0.2for separationslarger than30. Hence,at sequenceseparation31 theperformancestartto fluctuateindicatingthatthesequencemotif becomeslessdefinedthroughoutthe sequencesin the respective cross-validationsets. Hencewe can usethe networks as an in-dicatorfor whenasequencemotif is well defined.
As an independenttest, we predictedon nine CASP3(http://predictioncenter.llnl.gov/casp3/)targets(andsubmit-ted the predictions). None of thesetargets(T0053T0067T0071T0077T0081T0082T0083T0084T0085)have se-quencesimilarity toany sequencewith knownstructure.Theprevious method(Lund et al. 1997) had on the average64.5%correctpredictionsandan averagecorrelationcoef-ficient of 0.224.Themethodpresentedherehason average70.3%correctpredictionsandanaveragecorrelationcoeffi-cientof 0.249.Theperformanceof theoriginalmethodcor-respondsto theperformancepreviously published(Lund etal. 1997).However, theaveragemeasureisalessconvenientway to reporttheperformance,asthegoodperformanceonsmallseparationsarebiaseddueto themany largeseparationnetworksthathaveessentiallythesameperformance.There-forewehavealsoconstructedperformancecurvessimilar tothe onein Figure5 (not shown). The main featuresof thepredictioncurve waspresent. In Figure8 we show a pre-diction exampleof thedistanceconstraintsfor T0067. Theserver (http://www.cbs.dtu.dk/services/distanceP/)providespredictionsup a sequenceseparationof 100. Notice thatpredictionsupto asequenceseparationof 30clearlycapturethemainpartof thedistanceconstraints.
We alsoinvestigatedthe qualitative relationbetweenthenetwork performanceand information content in the se-quencelogos. Interestingly, andin agreementwith the ob-servationsmadeearlier, we found that the two curveshavethesamequalitativebehavior asthesequenceseparationin-crease(Figure7). Both curvespeakat separation3, bothcurvesdrop at separation12, and both curvesreachestheplateaufor sequenceseparation30. Wenotethattherelativeentropy curve increasesslightly astheseparationincreases.
0?
20 40 60 80 100?
Sequence Separation (residues)@0.45
0.50
0.55
0.60
0.65
0.70
0.75
Fra
ctio
n c
orr
ect
A
(a)
r=0r=2r=4r=6r=8r=10r=12r=14
Mean performances
0B
20B
40B
60B
80B
100B
Sequence separation (residues)C0.00
0.10
0.20
0.30
0.40
0.50
Co
rrel
atio
n c
oef
fici
ent
(b)
r=0r=2r=4r=6r=8r=10r=12r=14
Mean performances
Figure 5: The performanceasfunction of increasingsequencecontext whenpredictingdistanceconstraints.The sequencecontext is thenumberof additionalresiduesr D 0 E 2 E 4 E 6 E 8 E 10E 12E 14 from theaminoacidandtowardtheotheraminoacid.Theperformanceis showedinpercentage(a),andin correlationcoefficient (b). Thisfigureis availablein full color athttp://www.cbs.dtu.dk/services/distanceP/.
This is dueto a well known artifactfor decreasingsamplingsize, andF entropy measures.The correlationbetweenthecurvesalsoindicatesthat themostinformationusedby thepredictorstemfrom themotifs in thesequencelogos.
0G 20 40 60 80 100GSequence Separation (residues)H
0.00
0.10
0.20
0.30
0.40
0.50
Co
rrel
atio
n c
oef
fici
ent
Plain
Profile
Single window curvesI
Figure 6: The performancecurve using a single windows only.We alsoshow the performancewhenincludinghomologuein therespective tests.Thiscompleteaprofile.
0J 20 40 60 80 100JSequence separation (residues)K
Arb
itra
ry s
cale
Relative entropy and NN performance
NN-performanceLRelative entropy
Figure 7: Information contentand network performancefor in-creasingsequenceseparation.
Finally, weconcludeby ananalysisof theneuralnetworkweight compositionwith the aim of revealing significantmotifsusedin thelearningprocess.In general,wefoundthesamemotifsappearingfor theeachof the10cross-validationsetsfor a given sequenceseparation.As describedin theMethodssection,wecomputesaliency logosof thenetworkparameters,that is we display the amino acids for whichthecorrespondingweights,if removed,causethelargestin-creasein network error. We madesaliency logosfor eachofthe5 hiddenneurons.For theshortsequenceseparationsthenetwork stronglypunishhaving aprolinein thecenterof the
CASP3 target T0067
] - ; 0.2]
]0.2 ; 0.3]
]0.3 ; 0.4]
]0.4 ; 0.5]
]0.5 ; 0.6]
]0.6 ; 0.7]
]0.7 ; 0.8]
]0.8 ; 0.9]
]0.9 ; 1.0]
20
40
60
80
100
120
140
160
180
20 40 60 80 100
120
140
160
180
Figure 8: Predictionof distanceconstraintsfor the CASP3tar-get T0067. The uppertriangleis the distanceconstraintsof pub-lished coordinates,where dark points indicate distancesabovethe thresholds(meanvalues)andlight pointsdistancesbelow thethreshold. The lower triangle shows the actual neural networkpredictions. The scaleindicatesthe predictedprobability for se-quenceseparationshaving distancesbelow the thresholds.Thuslight points representprediction for distancesbelow the thresh-olds and dark points prediction for distancesabove the thresh-old. The numbersalong the edgesof the plot show the posi-tion in the sequence. This figure is available in full color athttp://www.cbs.dtu.dk/services/distanceP/.
input field, that is a prolinein betweentheseparatedaminoacids.Thismakessenseasprolineis not locatedwithin sec-ondarystructureelements,and the helix lengthsfor smallseparationsaresmallerthanthe meandistances.(The net-work haslearnt that proline is not presentin helices.) Incontrast,thepresenceof asparticacidcanhavea positiveornegative effect dependingon thepositionin the input field.For larger sequenceseparations,someof the neuronsdis-playathreepeakmotif resemblingwhatwasfoundin these-quencelogos.Thecenterof suchlogoscanbevery glycinerich, but alsoshow astrongpunishmentfor valineor trypto-phan.Sometimestwo neuronscansharethemotif of whatispresentfor oneneuronin anothercross-validationset.Otherdetailscanbefoundaswell. Herewe justoutlinedthemainfeatures: the networks not only look for favorableaminoacids,they alsodetectsthosethatarenot favorableatall. InFigure9 we show an exampleof the motifs for 2 of the 5hiddenneuronsfor a network with sequenceseparation13.
0
1
2
Sal
ien
cy |1
QRLDES
H
YTVAI
NFMM G
CPNWK
X
2
QVRSWE
YDC
NAIT
KO FPMGPHX
3
RTQAV
IS
PF
XDKYNL
WMEQ
HGC
4
RTR Y
SL
KAEIS DT NFUQMVXWV P
HCGW
5
TWX
RYS
KQEI
DCF
V
NMLAX P
HG
6
WVFXYD
SYMT
NKRILQAZ CE
H[ P
G
7
TVDSX
YIF
WMR
N\LEK]AQ
CH_
GP
8 XY
VF` W
NCI
HEA
TLa SDbMRcQdKe
GP
9 IX
YFVW
GLMf N
Dg HhECSiRKAj
TQk
P
10 NFYWIX
VEl GSDHK
MLR
CQm TA
P
11 VYWXFn I
HGR
KoEp SDTq NM
QrLsAt
C
P
12 HVXRWF
GKI
QY
TuESNMvAwLx DC
P
13 EWFVXHyYG
NR
DzMIK{A|L
SQ
T} C
P
14 CHKXEQTYR
VAF
GW
NM
DSI~L
P
15 MFW
HQ
YV
ETCK
XND� GA
SRL� PI
16 MHNFQGEV
SXK
WA
C�RYD�LPTI�
17 WT�HMVYE
KF
QNRAL
XSGPDI� C
18 THDMEGCVF�LKRP
WNA
QXYSI�
|0
1
Sal
ien
cy |
1
QS
AYM
DIL
RVH�T�W� X
GN� E�C
KP
2
T�I
RH
SVQ�M� EFAW� P� K�YC
GDN� X
3
CT
�MAL
EQY
PW� I
XV
RGF�
D� NK�
4
YR
EAI
DHWV
F
XST�QC�M
PKL
NG
5
HX
QM
TY
KEW� ASV
� RL
NDPI
CF�
G
6
FWMAS
XQHR�Y�K VDI
C¡LE¢ TN
PG
7
KQ£ NYDXV¤ I
FMA
ST¥RLWEP¦ HGC
§
8
YWTRSAN
L¨ V
X
IK
MQ©EDªCHG« PF¬
9
REP®TKSDXAGLN
YQ
HWC
FIV
10 M° QPRAX
HECGT
IKN
Y±SD²
LFV
W
11 PT³XMA
HRQµG¶SC
E· YN
W¹ VDKº
FLI
12 ST» AHXP
C¼Q½E¾ RLN¿ MKÀ YFWVG
Á
ID
13 SPRY
TQÂ XVCÃKHEÄNÅ IAMÆ FDÇ
WG
L
14 VEKDA
CXH
QNMÈSRLFT
P
IG
WY
15 TDMSÉKPÊHXI
VQË C
WG
FRAEÌ Y
NL
16 YKQDFH
WEPA
X
SI
LC
TMRGNV
17 YL
T
X
NKÍERCÎFMAÏ I
W
QDÐ SPGÑV
18 MAH
RÒ XY
TKQÓLWCEÔ DSFÕV
Ö I×
NPG
19 MØYTQK
AV
XC
HW
DÙ NSGFÚ EÛ I
RPÜ
20 SÝMAT
QH
DY
GKC
XF
EÞWV
LPNRI
21 SHL
QVA
Tß FCIàMá E
YRDWâ KPGXN
22 GSY
DRAH
Mã KF
Pä QåVLEIæ NW
çCè X|0
1
2
3
4
Sal
ien
cy |
1
RLSGNHéTê APKDëCìYQWM
VF
Eí I
X
2
WPîHïDYRAKG
TLF
SNIð XEMñ QV
Cò
3
NT
óA
DôHõ R
QöWFK÷ EGSY
IL
PCø XMV
4
NATùHMúRWFKGûYQü EL
PDý SXV
IþC
5
NHT
XRMÿ QAFW� E
SKY
�LGV� DIC� P
6
AMHPRXF�LSGN�YKE� QV
C�T� I
DW
7
T P MQYS�VFLHIXD� WG ARC� KE�N�
8
PTXK�CRALN
VMFH
IYE
Q�DW�SG
9
CXYTQ
R
H
FK
APNE
W� M� LIS
D�VG
�
10 THX
ACRE
YMP
KQS
FLD�VWN
IG
11 AXT� C�
RH�K�E�QY�PLS
M� FD N!
VW"
IG
12 TX
AH#Q$ERP% CK& MS
FN'WLVD(
YI)G
13 CXTAS
RE*H+ WKQ
FDP, M- LN.
VY
I/G
14 HXS
ARFET
W0QCN1KDPVYLM2
I3G
15 SCQXYD4RE5 T
AWH
N6 VLFK
P7MIG
16 KRXH8 YTNLE9Q: CV
SIFW; DAP
MG
17 LEH<XD= P> QAT?YM@ I
RF
NW
SKCAG
18 HXAQTL
KB EC SMY
RF
GD DPECWNFVI
G
19 MEHXAFP
H STLW
QKDIYNJ RC
GKVIL
20 ASRYHXML
TNKDQE
WF
GVC
PI
21 LKRMEHMVYANN P
STOWDQF
GXIPCQ
22 VPKSTYAHRESWRGL
NTMDIUFQVCX
|Figure9: Threesaliency-weightlogosfrom thetrainedneuralnet-work at sequenceseparations9 and13. For eachpositionof theinput field thesaliency of therespective aminoacidsis displayed.Lettersturnedupsidedown contribute negatively to the weightedsumof thenetwork input. Thetop logo (sequenceseparation9) il-lustratesstrongpunishmentfor prolineandglycineupstreams.Onthemiddlelogo(sequenceseparation13),N is turnedupsidedownon the peaksnearthe edges.On the bottomlogo the I’s closetothe edgescontribute positively, andthe N in the centerpeakalsocontributepositively. The I’s in thecenterpeakcontributesnega-tively (they areupsidedown). This figureis availablein full colorathttp://www.cbs.dtu.dk/services/distanceP/.
Final remarksWestudiedpredictionof distanceconstraintsin proteins.Wefoundthatsequencemotifswererelatedto thedistributionofdistances.Usingthemotifs we showedhow to constructanoptimalneuralnetwork predictorwhich improve an earlierapproach.Thebehavior of thepredictorcouldbecompletelyexplainedfrom astatisticalanalysisof thedata,in particulara drop in performancefor sequenceseparation12 residueswaspredictedfrom thestatisticalanalysis.Wealsocorrectly
predictedseparation3 ashaving optimalperformance.Wefoundthattheinformationcontentof thelogoshasthesamequalitativebehavior asthenetwork performancefor increas-ing sequenceseparation.Finally, theweightcompositionofthe network wasanalyzedandwe found that the sequencemotif appearingwasin agreementwith thesequencelogos.
Theperspectivesaremany: thenetworkperformancemaybe improved further by using threewindows for sequenceseparationsbetween20and30residues.Theperformanceofthenetwork canbeusedasinputsto anew collectionof net-workswhichcancleanup certaintypesof falsepredictions.Combiningthemethodwith asecondarystructurepredictionshouldgive a significantperformanceimprovementon theshortsequenceseparations.The relationbetweeninforma-tion contentandnetwork performancemight be explainedquantitatively throughextensivealgebraicconsiderations.
AcknowledgmentsThis work wassupportedby TheDanishNationalResearchFoundation.OL wassupportedby The Danish1991Phar-macy Foundation.
ReferencesBaldi, P., and Brunak, S. 1998. Bioinformatics— Themachinelearningapproach. CambridgeMass.:MIT Press.Bernstein, F. C.; Koetzle, T. G.; Williams, G. J. B.;Meyer, E. F.; Brice, M. D.; Rogers, J. R.; Kennard,O.; Shimanouchi,T.; and Tasumi, M. 1977. Theprotein data bank: A computerbasedarchival file formacromolecularstructures. J. Mol. Biol. 122:535–542.(http://www.pdb.bnl.gov/).Bishop,C. M. 1996. Neural Networksfor PatternRecog-nition. Oxford: OxfordUniversityPress.Bohr, H.; Bohr, J.; Brunak,S.; Cotterill, R. M. C.; Fred-holm, H.; Lautrup,B.; andPetersen,S. B. 1990. A novelapproachto predictionof the 3-dimensionalstructuresofproteinbackbonesby neuralnetworks.FEBSLett.261:43–46.Branden,C., andTooze,J. 1999. Introductionto ProteinStructure. New York: GarlandPublishingInc.,2ndedition.Brunak,S.; Engelbrecht,J.; andKnudsen,S. 1991. Pre-dictionof humanmRNA donorandacceptorsitesfrom theDNA sequence.J. Mol. Biol. 220:49–65.Creighton,T. E. 1993.Proteins. New York: W. H. FreemanandCompany, 2ndedition.Dayhoff, M. O. ansSchwartz, R. M., and Orcutt, B. C.1978.A modelof evolutionarychangein proteins.AtlasofProteinSequenceandStructure5, Suppl.3:345–352.Fariselli,P., andCasadio,R. 1999.A neuralnetwork basedpredictorof residuecontactsin proteins.Prot. Eng. 12:15–21.Göbel, U.; Sander, C.; Schneider, R.; and Valencia,A.1994. Correlatedmutationsand residuecontactsin pro-teins.Proteins18:309–317.Gorodkin,J.; Hansen,L. K.; Lautrup,B.; andSolla,S. A.1997. Universaldistribution of salienciesfor pruning in
layeredneuralnetworks. Int. J. Neural Systems8:489–498.(http://wwwW .wspc.com.sg/journals/ijns/85_6/gorod.pdf).Hobohm,U.; Scharf,M.; Schneider, R.; and Sander, C.1992. Selectionof a representative setof structuresfromtheBrookhavenProteinDataBank. Prot. Sci.1:409–417.Kabsch,W., andSander, C. 1983. Dictionaryof proteinsecondarystructure: Pattern recognitionand hydrogen-bondedandgeometricalfeatures. Biopolymers 22:2577–2637.Kullback,S.,andLeibler, R. A. 1951.On informationandsufficiency. Ann.Math.Stat.22:79–86.Lund, O.; Frimand, K.; Gorodkin, J.; Bohr, H.; Bohr,J.; Hansen,J.; and Brunak, S. 1997. Protein dis-tanceconstraintspredictedby neuralnetworks andprob-ability density functions. Prot. Eng. 10:1241–1248.(http://www.cbs.dtu.dk/services/CPHmodels).Maiorov, V. N., andCrippen,G.M. 1992.Contactpotentialthat recognizesthecorrectfolding of globular proteins.J.Mol. Biol. 227:876–888.Mathews, B. W. 1975. Comparisonof the predictedandobserved secondarystructureof T4 phagelysozyme.Biochem.Biophys.Acta405:442–451.Mirny, L. A., and Shaknovich, E. I. 1996. How to de-rive a proteinfolding potential?A new approachto anoldproblem.J. Mol. Biol. 264:1164–1179.Miyazawa, S., and Jernigan,R. L. 1985. Estimationofeffective interresiduecontactenergies from protein crys-tal structures: cuasi-chemicalapproximation. Macro-molecules18:534–552.Myers,E. W., andMiller, W. 1988.Optimalalignmentsinlinearspace.CABIOS4:11–7.Olmea,O., and Valencia,A. 1997. Improving contactpredictionsby thecombinationof correlatedmutationsandothersourcesof sequenceinformation. Folding & Design2:S25–S32.Pearson,W. R. 1990. Rapidandsensitive sequencecom-parisonwith FASTPandFASTA. Meth.Enzymol.183:63–98.Reese,M. G.; Lund,O.; Bohr, J.; Bohr, H.; Hansen,J. E.;andBrunak,S. 1996.Distancedistributionsin proteins:Asix parameterrepresentation.Prot. Eng. 9:733–740.Rost,B., andSander, C. 1993. Predictionof proteinsec-ondarystructureatbetterthan70%accuracy. J. Mol. Biol.232:584–599.Schneider, T. D., and Stephens,R. M. 1990. Sequencelogos: a new way to displayconcensussequences.Nucl.AcidsRes.18:6097–6100.Sippl, M. J. 1990. Calculationof conformationalensem-bles from potentialsof meanforce. An approachto theknowledge-basedpredictionof local structuresin globularproteins.J. Mol. Biol. 213:859–83.Skolnick, J.; Kolinski, A.; andOrtiz, A. R. 1997. MON-SSTER:A methodfor folding globular proteinswith asmallnumber-commentof distancerestraints.J. Mol. Biol.265:217–241.
Tanaka,S.,andScheraga,H. A. 1976. Medium-andlongrangeinteractionparametersbetweenaminoacidsfor pre-dicting three-dimensionalstructuresof proteins. Macro-molecules9:945–950.Thomas,D. J.;Casari,C.;andSander, C. 1996.Thepredic-tionof proteincontactsfrommultiplesequencealignments.Prot. Eng. 9:941–948.