The Statistics of Sequence Similarity Scores
-
Upload
michele-silva -
Category
Documents
-
view
13 -
download
1
description
Transcript of The Statistics of Sequence Similarity Scores
-
10/8/2014 The Statistics of Sequence Similarity Scores
http://www.ncbi.nlm.nih.gov/blast/tutorial/Altschul-1.html 1/10
TheStatisticsofSequenceSimilarityScores
Introduction
Toassesswhetheragivenalignmentconstitutesevidenceforhomology,ithelpstoknowhowstronganalignmentcanbeexpectedfromchancealone.Inthiscontext,"chance"canmeanthecomparisonof(i)realbutnonhomologoussequences(ii)realsequencesthatareshuffledtopreservecompositionalproperties[13]or(iii)sequencesthataregeneratedrandomlybaseduponaDNAorproteinsequencemodel.Analyticstatisticalresultsinvariablyusethelastofthesedefinitionsofchance,whileempiricalresultsbasedonsimulationandcurvefittingmayuseanyofthedefinitions.
Thestatisticsofglobalsequencecomparison
Unfortunately,undereventhesimplestrandommodelsandscoringsystems,verylittleisknownabouttherandomdistributionofoptimalglobalalignmentscores[4].MonteCarloexperimentscanprovideroughdistributionalresultsforsomespecificscoringsystemsandsequencecompositions[5],butthesecannotbegeneralizedeasily.Therefore,oneofthefewmethodsavailableforassessingthestatisticalsignificanceofaparticularglobalalignmentistogeneratemanyrandomsequencepairsoftheappropriatelengthandcomposition,andcalculatetheoptimalalignmentscoreforeach[1,3].Whileitisthenpossibletoexpressthescoreofinterestintermsofstandarddeviationsfromthemean,itisamistaketoassumethattherelevantdistributionisnormalandconvertthisZvalueintoaPvaluethetailbehaviorofglobalalignmentscoresisunknown.Themostonecansayreliablyisthatif100randomalignmentshavescoreinferiortothealignmentofinterest,thePvalueinquestionislikelylessthan0.01.Onefurtherpitfalltoavoidisexaggeratingthesignificanceofaresultfoundamongmultipletests.Whenmanyalignmentshavebeengenerated,e.g.inadatabasesearch,thesignificanceofthebestmustbediscountedaccordingly.AnalignmentwithPvalue0.0001inthecontextofasingletrialmaybeassignedaPvalueofonly0.1ifitwasselectedasthebestamong1000independenttrials.
Thestatisticsoflocalsequencecomparison
Fortunatelystatisticsforthescoresoflocalalignments,unlikethoseofglobalalignments,arewellunderstood.Thisisparticularlytrueforlocalalignmentslackinggaps,whichwewillconsiderfirst.SuchalignmentswerepreciselythosesoughtbytheoriginalBLASTdatabasesearchprograms[6].Alocalalignmentwithoutgapsconsistssimplyofapairofequallengthsegments,onefromeachofthetwosequencesbeingcompared.AmodificationoftheSmithWaterman[7]orSellers[8]algorithmswillfindallsegmentpairswhosescorescannotbeimprovedbyextensionortrimming.ThesearecalledhighscoringsegmentpairsorHSPs.Toanalyzehowhighascoreislikelytoarisebychance,amodelofrandom
-
10/8/2014 The Statistics of Sequence Similarity Scores
http://www.ncbi.nlm.nih.gov/blast/tutorial/Altschul-1.html 2/10
sequencesisneeded.Forproteins,thesimplestmodelchoosestheaminoacidresiduesinasequenceindependently,withspecificbackgroundprobabilitiesforthevariousresidues.Additionally,theexpectedscoreforaligningarandompairofaminoacidisrequiredtobenegative.Werethisnotthecase,longalignmentswouldtendtohavehighscoreindependentlyofwhetherthesegmentsalignedwererelated,andthestatisticaltheorywouldbreakdown.Justasthesumofalargenumberofindependentidenticallydistributed(i.i.d)randomvariablestendstoanormaldistribution,themaximumofalargenumberofi.i.d.randomvariablestendstoanextremevaluedistribution[9].(Wewillelidethemanytechnicalpointsrequiredtomakethisstatementrigorous.)Instudyingoptimallocalsequencealignments,weareessentiallydealingwiththelattercase[10,11].Inthelimitofsufficientlylargesequencelengthsmandn,thestatisticsofHSPscoresarecharacterizedbytwoparameters,Kandlambda.Mostsimply,theexpectednumberofHSPswithscoreatleastSisgivenbytheformula
WecallthistheEvalueforthescoreS.Thisformulamakeseminentlyintuitivesense.DoublingthelengthofeithersequenceshoulddoublethenumberofHSPsattainingagivenscore.Also,foranHSPtoattainthescore2xitmustattainthescorextwiceinarow,sooneexpectsEtodecreaseexponentiallywithscore.TheparametersKandlambdacanbethoughtofsimplyasnaturalscalesforthesearchspacesizeandthescoringsystemrespectively.
Bitscores
Rawscoreshavelittlemeaningwithoutdetailedknowledgeofthescoringsystemused,ormoresimplyitsstatisticalparametersKandlambda.Unlessthescoringsystemisunderstood,citingarawscorealoneislikecitingadistancewithoutspecifyingfeet,meters,orlightyears.Bynormalizingarawscoreusingtheformula
oneattainsa"bitscore"S',whichhasastandardsetofunits.TheEvaluecorrespondingtoagivenbitscoreissimply
-
10/8/2014 The Statistics of Sequence Similarity Scores
http://www.ncbi.nlm.nih.gov/blast/tutorial/Altschul-1.html 3/10
Bitscoressubsumethestatisticalessenceofthescoringsystememployed,sothattocalculatesignificanceoneneedstoknowinadditiononlythesizeofthesearchspace.
Pvalues
ThenumberofrandomHSPswithscore>=SisdescribedbyaPoissondistribution[10,11].ThismeansthattheprobabilityoffindingexactlyaHSPswithscore>=Sisgivenby
whereEistheEvalueofSgivenbyequation(1)above.SpecificallythechanceoffindingzeroHSPswithscore>=SiseE,sotheprobabilityoffindingatleastonesuchHSPis
ThisisthePvalueassociatedwiththescoreS.Forexample,ifoneexpectstofindthreeHSPswithscore>=S,theprobabilityoffindingatleastoneis0.95.TheBLASTprogramsreportEvalueratherthanPvaluesbecauseitiseasiertounderstandthedifferencebetween,forexample,Evalueof5and10thanPvaluesof0.993and0.99995.However,whenE
-
10/8/2014 The Statistics of Sequence Similarity Scores
http://www.ncbi.nlm.nih.gov/blast/tutorial/Altschul-1.html 4/10
multipledistinctdomains.Ifweassumetheapriorichanceofrelatednessisproportionaltosequencelength,thenthepairwiseEvalueinvolvingadatabasesequenceoflengthnshouldbemultipliedbyN/n,whereNisthetotallengthofthedatabaseinresidues.Examiningequation(1),thiscanbeaccomplishedsimplybytreatingthedatabaseasasinglelongsequenceoflengthN.TheBLASTprograms[6,14,15]takethisapproachtocalculatingdatabaseEvalue.NoticethatforDNAsequencecomparisons,thelengthofdatabaserecordsislargelyarbitrary,andthereforethisistheonlyreallytenablemethodforestimatingstatisticalsignificance.
Thestatisticsofgappedalignments
Thestatisticsdevelopedabovehaveasolidtheoreticalfoundationonlyforlocalalignmentsthatarenotpermittedtohavegaps.However,manycomputationalexperiments[1421]andsomeanalyticresults[22]stronglysuggestthatthesametheoryappliesaswelltogappedalignments.Forungappedalignments,thestatisticalparameterscanbecalculated,usinganalyticformulas,fromthesubstitutionscoresandthebackgroundresiduefrequenciesofthesequencesbeingcompared.Forgappedalignments,theseparametersmustbeestimatedfromalargescalecomparisonof"random"sequences.Somedatabasesearchprograms,suchasFASTA[12]orvariousimplementationoftheSmithWatermanalgorithm[7],produceoptimallocalalignmentscoresforthecomparisonofthequerysequencetoeverysequenceinthedatabase.Mostofthesescoresinvolveunrelatedsequences,andthereforecanbeusedtoestimatelambdaandK[17,21].Thisapproachavoidstheartificialityofarandomsequencemodelbyemployingrealsequences,withtheirattendantinternalstructureandcorrelations,butitmustfacetheproblemofexcludingfromtheestimationscoresfrompairsofrelatedsequences.TheBLASTprogramsachievemuchoftheirspeedbyavoidingthecalculationofoptimalalignmentscoresforallbutahandfulofunrelatedsequences.ThemustthereforerelyuponapreestimationoftheparameterslambdaandK,foraselectedsetofsubstitutionmatricesandgapcosts.Thisestimationcouldbedoneusingrealsequences,buthasinsteadrelieduponarandomsequencemodel[14],whichappearstoyieldfairlyaccurateresults[21].
Edgeeffects
Thestatisticsdescribedabovetendtobesomewhatconservativeforshortsequences.Thetheorysupportingthesestatisticsisanasymptoticone,whichassumesanoptimallocalalignmentcanbeginwithanyalignedpairofresidues.However,ahighscoringalignmentmusthavesomelength,andthereforecannotbeginneartotheendofeitheroftwosequencesbeingcompared.This"edgeeffect"maybecorrectedforbycalculatingan"effectivelength"forsequences[14]theBLASTprogramsimplementsuchacorrection.Forsequenceslongerthanabout200residuestheedgeeffectcorrectionisusuallynegligible.
Thechoiceofsubstitutionscores
-
10/8/2014 The Statistics of Sequence Similarity Scores
http://www.ncbi.nlm.nih.gov/blast/tutorial/Altschul-1.html 5/10
Theresultsalocalalignmentprogramproducesdependstronglyuponthescoresituses.Nosinglescoringschemeisbestforallpurposes,andanunderstandingofthebasictheoryoflocalalignmentscorescanimprovethesensitivityofone'ssequenceanalyses.Asbefore,thetheoryisfullydevelopedonlyforscoresusedtofindungappedlocalalignments,sowestartwiththatcase.Alargenumberofdifferentaminoacidsubstitutionscores,baseduponavarietyofrationales,havebeendescribed[2336].Howeverthescoresofanysubstitutionmatrixwithnegativeexpectedscorecanbewrittenuniquelyintheform
wheretheqij,calledtargetfrequencies,arepositivenumbersthatsumto1,thepiarebackgroundfrequenciesforthevariousresidues,andlambdaisapositiveconstant[10,31].Thelambdahereisidenticaltothelambdaofequation(1).Multiplyingallthescoresinasubstitutionmatrixbyapositiveconstantdoesnotchangetheiressence:analignmentthatwasoptimalusingtheoriginalscoresremainsoptimal.Suchmultiplicationalterstheparameterlambdabutnotthetargetfrequenciesqij.Thus,uptoaconstantscalingfactor,everysubstitutionmatrixisuniquelydeterminedbyitstargetfrequencies.Thesefrequencieshaveaspecialsignificance[10,31]:
Agivenclassofalignmentsisbestdistinguishedfromchancebythesubstitutionmatrixwhosetargetfrequenciescharacterizetheclass.
Toelaborate,onemaycharacterizeasetofalignmentsrepresentinghomologousproteinregionsbythefrequencywithwhicheachpossiblepairofresiduesisaligned.Ifvalineinthefirstsequenceandleucineinthesecondappearin1%ofallalignmentpositions,thetargetfrequencyfor(valine,leucine)is0.01.Themostdirectwaytoconstructappropriatesubstitutionmatricesforlocalsequencecomparisonistoestimatetargetandbackgroundfrequencies,andcalculatethecorrespondinglogoddsscoresofformula(6).Thesefrequenciesingeneralcannotbederivedfromfirstprinciples,andtheirestimationrequiresempiricalinput.
ThePAMandBLOSUMaminoacidsubstitutionmatrices
Whileallsubstitutionmatricesareimplicitlyoflogoddsform,thefirstexplicitconstructionusingformula(6)wasbyDayhoffandcoworkers[24,25].Fromastudyofobservedresiduereplacementsincloselyrelatedproteins,they
-
10/8/2014 The Statistics of Sequence Similarity Scores
http://www.ncbi.nlm.nih.gov/blast/tutorial/Altschul-1.html 6/10
constructedthePAM(for"pointacceptedmutation")modelofmolecularevolution.One"PAM"correspondstoanaveragechangein1%ofallaminoacidpositions.After100PAMsofevolution,noteveryresiduewillhavechanged:somewillhavemutatedseveraltimes,perhapsreturningtotheiroriginalstate,andothersnotatall.Thusitispossibletorecognizeashomologousproteinsseparatedbymuchmorethan100PAMs.NotethatthereisnogeneralcorrespondencebetweenPAMdistanceandevolutionarytime,asdifferentproteinfamiliesevolveatdifferentrates.UsingthePAMmodel,thetargetfrequenciesandthecorrespondingsubstitutionmatrixmaybecalculatedforanygivenevolutionarydistance.Whentwosequencesarecompared,itisnotgenerallyknownaprioriwhatevolutionarydistancewillbestcharacterizeanysimilaritytheymayshare.Closelyrelatedsequences,however,arerelativelyeasytofindevenwillnonoptimalmatrices,sothetendencyhasbeentousematricestailoredforfairlydistantsimilarities.Formanyyears,themostwidelyusedmatrixwasPAM250,becauseitwastheonlyoneoriginallypublishedbyDayhoff.Dayhoff'sformalismforcalculatingtargetfrequencieshasbeencriticized[27],andtherehavebeenseveraleffortstoupdatehernumbersusingthevastquantitiesofderivedproteinsequencedatageneratedsinceherwork[33,35].ThesenewerPAMmatricesdonotdiffergreatlyfromtheoriginalones[37].Analternativeapproachtoestimatingtargetfrequencies,andthecorrespondinglogoddsmatrices,hasbeenadvancedbyHenikoff&Henikoff[34].Theyexaminemultiplealignmentsofdistantlyrelatedproteinregionsdirectly,ratherthanextrapolatefromcloselyrelatedsequences.Anadvantageofthisapproachisthatitcleavesclosertoobservationadisadvantageisthatityieldsnoevolutionarymodel.Anumberoftests[13,37]suggestthatthe"BLOSUM"matricesproducedbythismethodgenerallyaresuperiortothePAMmatricesfordetectingbiologicalrelationships.
DNAsubstitutionmatrices
Whilewehavediscussedsubstitutionmatricesonlyinthecontextofproteinsequencecomparison,allthemainissuescarryovertoDNAsequencecomparison.Onewarningisthatwhenthesequencesofinterestcodeforprotein,itisalmostalwaysbettertocomparetheproteintranslationsthantocomparetheDNAsequencesdirectly.Thereasonisthatafteronlyasmallamountofevolutionarychange,theDNAsequences,whencomparedusingsimplenucleotidesubstitutionscores,containlessinformationwithwhichtodeducehomologythandotheencodedproteinsequences[32].Sometimes,however,onemaywishtocomparenoncodingDNAsequences,atwhichpointthesamelogoddsapproachasbeforeapplies.Anevolutionarymodelinwhichallnucleotidesareequallycommonandallsubstitutionmutationsareequallylikelyyieldsdifferentscoresonlyformatchesandmismatches[32].Amorecomplexmodel,inwhichtransitionsaremorelikelythantransversions,yieldsdifferent"mismatch"scoresfortransitionsandtransversions[32].Thebestscorestousewilldependuponwhetheroneisseekingrelativelydivergedor
-
10/8/2014 The Statistics of Sequence Similarity Scores
http://www.ncbi.nlm.nih.gov/blast/tutorial/Altschul-1.html 7/10
closelyrelatedsequences[32].
Gapscores
Ourtheoreticaldevelopmentconcerningtheoptimalityofmatricesconstructedusingequation(6)unfortunatelyisinvalidassoonasgapsandassociatedgapscoresareintroduced,andnomoregeneraltheoryisavailabletotakeitsplace.However,ifthegapscoresemployedaresufficientlylarge,onecanexpectthattheoptimalsubstitutionscoresforagivenapplicationwillnotchangesubstantially.Inpractice,thesamesubstitutionscoreshavebeenappliedfruitfullytolocalalignmentsbothwithandwithoutgaps.Appropriategapscoreshavebeenselectedovertheyearsbytrialanderror[13],andmostalignmentprogramswillhaveadefaultsetofgapscorestogowithadefaultsetofsubstitutionscores.Iftheuserwishestoemployadifferentsetofsubstitutionscores,thereisnoguaranteethatthesamegapscoreswillremainappropriate.Nocleartheoreticalguidancecanbegiven,but"affinegapscores"[3841],withalargepenaltyforopeningagapandamuchsmalleroneforextendingit,havegenerallyprovedamongthemosteffective.
Lowcomplexitysequenceregions
Thereisonefrequentcasewheretherandommodelsandthereforethestatisticsdiscussedherebreakdown.Asmanyasonefourthofallresiduesinproteinsequencesoccurwithinregionswithhighlybiasedaminoacidcomposition.Alignmentsoftworegionswithsimilarlybiasedcompositionmayachieveveryhighscoresthatowevirtuallynothingtoresidueorderbutaredueinsteadtosegmentcomposition.Alignmentsofsuch"lowcomplexity"regionshavelittlemeaninginanycase:sincetheseregionsmostlikelyarisebygeneslippage,theonetooneresiduecorrespondenceimposedbyalignmentisnotvalid.Whileitisworthnotingthattwoproteinscontainsimilarlowcomplexityregions,theyarebestexcludedwhenconstructingalignments[4244].TheBLASTprogramsemploytheSEGalgorithm[43]tofilterlowcomplexityregionsfromproteinsbeforeexecutingadatabasesearch.
References
[1]Fitch,W.M.(1983)"Randomsequences."J.Mol.Biol.163:171176.(PubMed)
[2]Lipman,D.J.,Wilbur,W.J.,SmithT.F.&Waterman,M.S.(1984)"Onthestatisticalsignificanceofnucleicacidsimilarities."Nucl.AcidsRes.12:215226.(PubMed)
[3]Altschul,S.F.&Erickson,B.W.(1985)"Significanceofnucleotidesequencealignments:amethodforrandomsequencepermutationthatpreservesdinucleotideandcodonusage."Mol.Biol.Evol.2:526538.(PubMed)
[4]Deken,J.(1983)"Probabilisticbehavioroflongestcommonsubsequence
-
10/8/2014 The Statistics of Sequence Similarity Scores
http://www.ncbi.nlm.nih.gov/blast/tutorial/Altschul-1.html 8/10
length."In"TimeWarps,StringEditsandMacromolecules:TheTheoryandPracticeofSequenceComparison."D.Sankoff&J.B.Kruskal(eds.),pp.5591,AddisonWesley,Reading,MA.
[5]Reich,J.G.,Drabsch,H.&Daumler,A.(1984)"OnthestatisticalassessmentofsimilaritiesinDNAsequences."Nucl.AcidsRes.12:55295543.(PubMed)
[6]Altschul,S.F.,Gish,W.,Miller,W.,Myers,E.W.&Lipman,D.J.(1990)"Basiclocalalignmentsearchtool."J.Mol.Biol.215:403410.(PubMed)
[7]Smith,T.F.&Waterman,M.S.(1981)"Identificationofcommonmolecularsubsequences."J.Mol.Biol.147:195197.(PubMed)
[8]Sellers,P.H.(1984)"Patternrecognitioningeneticsequencesbymismatchdensity."Bull.Math.Biol.46:501514.
[9]Gumbel,E.J.(1958)"Statisticsofextremes."ColumbiaUniversityPress,NewYork,NY.
[10]Karlin,S.&Altschul,S.F.(1990)"Methodsforassessingthestatisticalsignificanceofmolecularsequencefeaturesbyusinggeneralscoringschemes."Proc.Natl.Acad.Sci.USA87:22642268.(PubMed)
[11]Dembo,A.,Karlin,S.&Zeitouni,O.(1994)"Limitdistributionofmaximalnonalignedtwosequencesegmentalscore."Ann.Prob.22:20222039.
[12]Pearson,W.R.&Lipman,D.J.(1988)Improvedtoolsforbiologicalsequencecomparison."Proc.Natl.Acad.Sci.USA85:24442448.(PubMed)
[13]Pearson,W.R.(1995)"Comparisonofmethodsforsearchingproteinsequencedatabases."Prot.Sci.4:11451160.(PubMed)
[14]Altschul,S.F.&Gish,W.(1996)"Localalignmentstatistics."Meth.Enzymol.266:460480.(PubMed)
[15]Altschul,S.F.,Madden,T.L.,Schffer,A.A.,Zhang,J.,Zhang,Z.,Miller,W.&Lipman,D.J.(1997)"GappedBLASTandPSIBLAST:anewgenerationofproteindatabasesearchprograms."NucleicAcidsRes.25:33893402.(PubMed)
[16]Smith,T.F.,Waterman,M.S.&Burks,C.(1985)"Thestatisticaldistributionofnucleicacidsimilarities."NucleicAcidsRes.13:645656.(PubMed)
[17]Collins,J.F.,Coulson,A.F.W.&Lyall,A.(1988)"Thesignificanceofproteinsequencesimilarities."Comput.Appl.Biosci.4:6771.(PubMed)
[18]Mott,R.(1992)"Maximumlikelihoodestimationofthestatisticaldistributionof
-
10/8/2014 The Statistics of Sequence Similarity Scores
http://www.ncbi.nlm.nih.gov/blast/tutorial/Altschul-1.html 9/10
SmithWatermanlocalsequencesimilarityscores."Bull.Math.Biol.54:5975.
[19]Waterman,M.S.&Vingron,M.(1994)"Rapidandaccurateestimatesofstatisticalsignificanceforsequencedatabasesearches."Proc.Natl.Acad.Sci.USA91:46254628.(PubMed)
[20]Waterman,M.S.&Vingron,M.(1994)"SequencecomparisonsignificanceandPoissonapproximation."Stat.Sci.9:367381.
[21]Pearson,W.R.(1998)"Empiricalstatisticalestimatesforsequencesimilaritysearches."J.Mol.Biol.276:7184.(PubMed)
[22]Arratia,R.&Waterman,M.S.(1994)"Aphasetransitionforthescoreinmatchingrandomsequencesallowingdeletions."Ann.Appl.Prob.4:200225.
[23]McLachlan,A.D.(1971)"Testsforcomparingrelatedaminoacidsequences.Cytochromecandcytochromec551."J.Mol.Biol.61:409424.(PubMed)
[24]Dayhoff,M.O.,Schwartz,R.M.&Orcutt,B.C.(1978)"Amodelofevolutionarychangeinproteins."In"AtlasofProteinSequenceandStructure,"Vol.5,Suppl.3(ed.M.O.Dayhoff),pp.345352.Natl.Biomed.Res.Found.,Washington,DC.
[25]Schwartz,R.M.&Dayhoff,M.O.(1978)"Matricesfordetectingdistantrelationships."In"AtlasofProteinSequenceandStructure,"Vol.5,Suppl.3(ed.M.O.Dayhoff),p.353358.Natl.Biomed.Res.Found.,Washington,DC.
[26]Feng,D.F.,Johnson,M.S.&Doolittle,R.F.(1984)"Aligningaminoacidsequences:comparisonofcommonlyusedmethods."J.Mol.Evol.21:112125.(PubMed)
[27]Wilbur,W.J.(1985)"OnthePAMmatrixmodelofproteinevolution."Mol.Biol.Evol.2:434447.(PubMed)
[28]Taylor,W.R.(1986)"Theclassificationofaminoacidconservation."J.Theor.Biol.119:205218.(PubMed)
[29]Rao,J.K.M.(1987)"Newscoringmatrixforaminoacidresidueexchangesbasedonresiduecharacteristicphysicalparameters."Int.J.PeptideProteinRes.29:276281.
[30]Risler,J.L.,Delorme,M.O.,Delacroix,H.&Henaut,A.(1988)"Aminoacidsubstitutionsinstructurallyrelatedproteins.Apatternrecognitionapproach.Determinationofanewandefficientscoringmatrix."J.Mol.Biol.204:10191029.(PubMed)
[31]Altschul,S.F.(1991)"Aminoacidsubstitutionmatricesfromaninformation
-
10/8/2014 The Statistics of Sequence Similarity Scores
http://www.ncbi.nlm.nih.gov/blast/tutorial/Altschul-1.html 10/10
theoreticperspective."J.Mol.Biol.219:555565.(PubMed)
[32]States,D.J.,Gish,W.&Altschul,S.F.(1991)"Improvedsensitivityofnucleicaciddatabasesearchesusingapplicationspecificscoringmatrices."Methods3:6670.
[33]Gonnet,G.H.,Cohen,M.A.&Benner,S.A.(1992)"Exhaustivematchingoftheentireproteinsequencedatabase."Science256:14431445.(PubMed)
[34]Henikoff,S.&Henikoff,J.G.(1992)"Aminoacidsubstitutionmatricesfromproteinblocks."Proc.Natl.Acad.Sci.USA89:1091510919.(PubMed)
[35]Jones,D.T.,Taylor,W.R.&Thornton,J.M.(1992)"Therapidgenerationofmutationdatamatricesfromproteinsequences."Comput.Appl.Biosci.8:275282.(PubMed)
[36]Overington,J.,Donnelly,D.,JohnsonM.S.,Sali,A.&Blundell,T.L.(1992)"Environmentspecificaminoacidsubstitutiontables:Tertiarytemplatesandpredictionofproteinfolds."Prot.Sci.1:216226.(PubMed)
[37]Henikoff,S.&Henikoff,J.G.(1993)"Performanceevaluationofaminoacidsubstitutionmatrices."Proteins17:4961.(PubMed)
[38]Gotoh,O.(1982)"Animprovedalgorithmformatchingbiologicalsequences."J.Mol.Biol.162:705708.(PubMed)
[39]Fitch,W.M.&Smith,T.F.(1983)"Optimalsequencealignments."Proc.Natl.Acad.Sci.USA80:13821386.
[40]Altschul,S.F.&Erickson,B.W.(1986)"Optimalsequencealignmentusingaffinegapcosts."Bull.Math.Biol.48:603616.(PubMed)
[41]Myers,E.W.&Miller,W.(1988)"Optimalalignmentsinlinearspace."Comput.Appl.Biosci.4:1117.(PubMed)
[42]Claverie,J.M.&States,D.J.(1993)"Informationenhancementmethodsforlargescalesequenceanalysis."Comput.Chem.17:191201.
[43]Wootton,J.C.&Federhen,S.(1993)"Statisticsoflocalcomplexityinaminoacidsequencesandsequencedatabases."Comput.Chem.17:149163.
[44]Altschul,S.F.,Boguski,M.S.,Gish,W.&Wootton,J.C.(1994)"Issuesinsearchingmolecularsequencedatabases."NatureGenet.6:119129.(PubMed)