The Statistics of Sequence Similarity Scores

10
10/8/2014 The Statistics of Sequence Similarity Scores http://www.ncbi.nlm.nih.gov/blast/tutorial/Altschul-1.html 1/10 The Statistics of Sequence Similarity Scores Introduction To assess whether a given alignment constitutes evidence for homology, it helps to know how strong an alignment can be expected from chance alone. In this context, "chance" can mean the comparison of (i) real but nonhomologous sequences; (ii) real sequences that are shuffled to preserve compositional properties [13] ; or (iii) sequences that are generated randomly based upon a DNA or protein sequence model. Analytic statistical results invariably use the last of these definitions of chance, while empirical results based on simulation and curvefitting may use any of the definitions. The statistics of global sequence comparison Unfortunately, under even the simplest random models and scoring systems, very little is known about the random distribution of optimal global alignment scores [4] . Monte Carlo experiments can provide rough distributional results for some specific scoring systems and sequence compositions [5] , but these can not be generalized easily. Therefore, one of the few methods available for assessing the statistical significance of a particular global alignment is to generate many random sequence pairs of the appropriate length and composition, and calculate the optimal alignment score for each [1,3] . While it is then possible to express the score of interest in terms of standard deviations from the mean, it is a mistake to assume that the relevant distribution is normal and convert this Zvalue into a P value; the tail behavior of global alignment scores is unknown. The most one can say reliably is that if 100 random alignments have score inferior to the alignment of interest, the Pvalue in question is likely less than 0.01. One further pitfall to avoid is exaggerating the significance of a result found among multiple tests. When many alignments have been generated, e.g. in a database search, the significance of the best must be discounted accordingly. An alignment with P value 0.0001 in the context of a single trial may be assigned a Pvalue of only 0.1 if it was selected as the best among 1000 independent trials. The statistics of local sequence comparison Fortunately statistics for the scores of local alignments, unlike those of global alignments, are well understood. This is particularly true for local alignments lacking gaps, which we will consider first. Such alignments were precisely those sought by the original BLAST database search programs [6] . A local alignment without gaps consists simply of a pair of equal length segments, one from each of the two sequences being compared. A modification of the SmithWaterman [7] or Sellers [8] algorithms will find all segment pairs whose scores can not be improved by extension or trimming. These are called highscoring segment pairs or HSPs. To analyze how high a score is likely to arise by chance, a model of random

description

To assess whether a given alignment constitutes evidence for homology, ithelps to know how strong an alignment can be expected from chance alone. Inthis context, "chance" can mean the comparison of (i) real but non­homologoussequences; (ii) real sequences that are shuffled to preserve compositionalproperties [1­3]; or (iii) sequences that are generated randomly based upon aDNA or protein sequence model. Analytic statistical results invariably use the lastof these definitions of chance, while empirical results based on simulation andcurve ­fitting may use any of the definitions.

Transcript of The Statistics of Sequence Similarity Scores

  • 10/8/2014 The Statistics of Sequence Similarity Scores

    http://www.ncbi.nlm.nih.gov/blast/tutorial/Altschul-1.html 1/10

    TheStatisticsofSequenceSimilarityScores

    Introduction

    Toassesswhetheragivenalignmentconstitutesevidenceforhomology,ithelpstoknowhowstronganalignmentcanbeexpectedfromchancealone.Inthiscontext,"chance"canmeanthecomparisonof(i)realbutnonhomologoussequences(ii)realsequencesthatareshuffledtopreservecompositionalproperties[13]or(iii)sequencesthataregeneratedrandomlybaseduponaDNAorproteinsequencemodel.Analyticstatisticalresultsinvariablyusethelastofthesedefinitionsofchance,whileempiricalresultsbasedonsimulationandcurvefittingmayuseanyofthedefinitions.

    Thestatisticsofglobalsequencecomparison

    Unfortunately,undereventhesimplestrandommodelsandscoringsystems,verylittleisknownabouttherandomdistributionofoptimalglobalalignmentscores[4].MonteCarloexperimentscanprovideroughdistributionalresultsforsomespecificscoringsystemsandsequencecompositions[5],butthesecannotbegeneralizedeasily.Therefore,oneofthefewmethodsavailableforassessingthestatisticalsignificanceofaparticularglobalalignmentistogeneratemanyrandomsequencepairsoftheappropriatelengthandcomposition,andcalculatetheoptimalalignmentscoreforeach[1,3].Whileitisthenpossibletoexpressthescoreofinterestintermsofstandarddeviationsfromthemean,itisamistaketoassumethattherelevantdistributionisnormalandconvertthisZvalueintoaPvaluethetailbehaviorofglobalalignmentscoresisunknown.Themostonecansayreliablyisthatif100randomalignmentshavescoreinferiortothealignmentofinterest,thePvalueinquestionislikelylessthan0.01.Onefurtherpitfalltoavoidisexaggeratingthesignificanceofaresultfoundamongmultipletests.Whenmanyalignmentshavebeengenerated,e.g.inadatabasesearch,thesignificanceofthebestmustbediscountedaccordingly.AnalignmentwithPvalue0.0001inthecontextofasingletrialmaybeassignedaPvalueofonly0.1ifitwasselectedasthebestamong1000independenttrials.

    Thestatisticsoflocalsequencecomparison

    Fortunatelystatisticsforthescoresoflocalalignments,unlikethoseofglobalalignments,arewellunderstood.Thisisparticularlytrueforlocalalignmentslackinggaps,whichwewillconsiderfirst.SuchalignmentswerepreciselythosesoughtbytheoriginalBLASTdatabasesearchprograms[6].Alocalalignmentwithoutgapsconsistssimplyofapairofequallengthsegments,onefromeachofthetwosequencesbeingcompared.AmodificationoftheSmithWaterman[7]orSellers[8]algorithmswillfindallsegmentpairswhosescorescannotbeimprovedbyextensionortrimming.ThesearecalledhighscoringsegmentpairsorHSPs.Toanalyzehowhighascoreislikelytoarisebychance,amodelofrandom

  • 10/8/2014 The Statistics of Sequence Similarity Scores

    http://www.ncbi.nlm.nih.gov/blast/tutorial/Altschul-1.html 2/10

    sequencesisneeded.Forproteins,thesimplestmodelchoosestheaminoacidresiduesinasequenceindependently,withspecificbackgroundprobabilitiesforthevariousresidues.Additionally,theexpectedscoreforaligningarandompairofaminoacidisrequiredtobenegative.Werethisnotthecase,longalignmentswouldtendtohavehighscoreindependentlyofwhetherthesegmentsalignedwererelated,andthestatisticaltheorywouldbreakdown.Justasthesumofalargenumberofindependentidenticallydistributed(i.i.d)randomvariablestendstoanormaldistribution,themaximumofalargenumberofi.i.d.randomvariablestendstoanextremevaluedistribution[9].(Wewillelidethemanytechnicalpointsrequiredtomakethisstatementrigorous.)Instudyingoptimallocalsequencealignments,weareessentiallydealingwiththelattercase[10,11].Inthelimitofsufficientlylargesequencelengthsmandn,thestatisticsofHSPscoresarecharacterizedbytwoparameters,Kandlambda.Mostsimply,theexpectednumberofHSPswithscoreatleastSisgivenbytheformula

    WecallthistheEvalueforthescoreS.Thisformulamakeseminentlyintuitivesense.DoublingthelengthofeithersequenceshoulddoublethenumberofHSPsattainingagivenscore.Also,foranHSPtoattainthescore2xitmustattainthescorextwiceinarow,sooneexpectsEtodecreaseexponentiallywithscore.TheparametersKandlambdacanbethoughtofsimplyasnaturalscalesforthesearchspacesizeandthescoringsystemrespectively.

    Bitscores

    Rawscoreshavelittlemeaningwithoutdetailedknowledgeofthescoringsystemused,ormoresimplyitsstatisticalparametersKandlambda.Unlessthescoringsystemisunderstood,citingarawscorealoneislikecitingadistancewithoutspecifyingfeet,meters,orlightyears.Bynormalizingarawscoreusingtheformula

    oneattainsa"bitscore"S',whichhasastandardsetofunits.TheEvaluecorrespondingtoagivenbitscoreissimply

  • 10/8/2014 The Statistics of Sequence Similarity Scores

    http://www.ncbi.nlm.nih.gov/blast/tutorial/Altschul-1.html 3/10

    Bitscoressubsumethestatisticalessenceofthescoringsystememployed,sothattocalculatesignificanceoneneedstoknowinadditiononlythesizeofthesearchspace.

    Pvalues

    ThenumberofrandomHSPswithscore>=SisdescribedbyaPoissondistribution[10,11].ThismeansthattheprobabilityoffindingexactlyaHSPswithscore>=Sisgivenby

    whereEistheEvalueofSgivenbyequation(1)above.SpecificallythechanceoffindingzeroHSPswithscore>=SiseE,sotheprobabilityoffindingatleastonesuchHSPis

    ThisisthePvalueassociatedwiththescoreS.Forexample,ifoneexpectstofindthreeHSPswithscore>=S,theprobabilityoffindingatleastoneis0.95.TheBLASTprogramsreportEvalueratherthanPvaluesbecauseitiseasiertounderstandthedifferencebetween,forexample,Evalueof5and10thanPvaluesof0.993and0.99995.However,whenE

  • 10/8/2014 The Statistics of Sequence Similarity Scores

    http://www.ncbi.nlm.nih.gov/blast/tutorial/Altschul-1.html 4/10

    multipledistinctdomains.Ifweassumetheapriorichanceofrelatednessisproportionaltosequencelength,thenthepairwiseEvalueinvolvingadatabasesequenceoflengthnshouldbemultipliedbyN/n,whereNisthetotallengthofthedatabaseinresidues.Examiningequation(1),thiscanbeaccomplishedsimplybytreatingthedatabaseasasinglelongsequenceoflengthN.TheBLASTprograms[6,14,15]takethisapproachtocalculatingdatabaseEvalue.NoticethatforDNAsequencecomparisons,thelengthofdatabaserecordsislargelyarbitrary,andthereforethisistheonlyreallytenablemethodforestimatingstatisticalsignificance.

    Thestatisticsofgappedalignments

    Thestatisticsdevelopedabovehaveasolidtheoreticalfoundationonlyforlocalalignmentsthatarenotpermittedtohavegaps.However,manycomputationalexperiments[1421]andsomeanalyticresults[22]stronglysuggestthatthesametheoryappliesaswelltogappedalignments.Forungappedalignments,thestatisticalparameterscanbecalculated,usinganalyticformulas,fromthesubstitutionscoresandthebackgroundresiduefrequenciesofthesequencesbeingcompared.Forgappedalignments,theseparametersmustbeestimatedfromalargescalecomparisonof"random"sequences.Somedatabasesearchprograms,suchasFASTA[12]orvariousimplementationoftheSmithWatermanalgorithm[7],produceoptimallocalalignmentscoresforthecomparisonofthequerysequencetoeverysequenceinthedatabase.Mostofthesescoresinvolveunrelatedsequences,andthereforecanbeusedtoestimatelambdaandK[17,21].Thisapproachavoidstheartificialityofarandomsequencemodelbyemployingrealsequences,withtheirattendantinternalstructureandcorrelations,butitmustfacetheproblemofexcludingfromtheestimationscoresfrompairsofrelatedsequences.TheBLASTprogramsachievemuchoftheirspeedbyavoidingthecalculationofoptimalalignmentscoresforallbutahandfulofunrelatedsequences.ThemustthereforerelyuponapreestimationoftheparameterslambdaandK,foraselectedsetofsubstitutionmatricesandgapcosts.Thisestimationcouldbedoneusingrealsequences,buthasinsteadrelieduponarandomsequencemodel[14],whichappearstoyieldfairlyaccurateresults[21].

    Edgeeffects

    Thestatisticsdescribedabovetendtobesomewhatconservativeforshortsequences.Thetheorysupportingthesestatisticsisanasymptoticone,whichassumesanoptimallocalalignmentcanbeginwithanyalignedpairofresidues.However,ahighscoringalignmentmusthavesomelength,andthereforecannotbeginneartotheendofeitheroftwosequencesbeingcompared.This"edgeeffect"maybecorrectedforbycalculatingan"effectivelength"forsequences[14]theBLASTprogramsimplementsuchacorrection.Forsequenceslongerthanabout200residuestheedgeeffectcorrectionisusuallynegligible.

    Thechoiceofsubstitutionscores

  • 10/8/2014 The Statistics of Sequence Similarity Scores

    http://www.ncbi.nlm.nih.gov/blast/tutorial/Altschul-1.html 5/10

    Theresultsalocalalignmentprogramproducesdependstronglyuponthescoresituses.Nosinglescoringschemeisbestforallpurposes,andanunderstandingofthebasictheoryoflocalalignmentscorescanimprovethesensitivityofone'ssequenceanalyses.Asbefore,thetheoryisfullydevelopedonlyforscoresusedtofindungappedlocalalignments,sowestartwiththatcase.Alargenumberofdifferentaminoacidsubstitutionscores,baseduponavarietyofrationales,havebeendescribed[2336].Howeverthescoresofanysubstitutionmatrixwithnegativeexpectedscorecanbewrittenuniquelyintheform

    wheretheqij,calledtargetfrequencies,arepositivenumbersthatsumto1,thepiarebackgroundfrequenciesforthevariousresidues,andlambdaisapositiveconstant[10,31].Thelambdahereisidenticaltothelambdaofequation(1).Multiplyingallthescoresinasubstitutionmatrixbyapositiveconstantdoesnotchangetheiressence:analignmentthatwasoptimalusingtheoriginalscoresremainsoptimal.Suchmultiplicationalterstheparameterlambdabutnotthetargetfrequenciesqij.Thus,uptoaconstantscalingfactor,everysubstitutionmatrixisuniquelydeterminedbyitstargetfrequencies.Thesefrequencieshaveaspecialsignificance[10,31]:

    Agivenclassofalignmentsisbestdistinguishedfromchancebythesubstitutionmatrixwhosetargetfrequenciescharacterizetheclass.

    Toelaborate,onemaycharacterizeasetofalignmentsrepresentinghomologousproteinregionsbythefrequencywithwhicheachpossiblepairofresiduesisaligned.Ifvalineinthefirstsequenceandleucineinthesecondappearin1%ofallalignmentpositions,thetargetfrequencyfor(valine,leucine)is0.01.Themostdirectwaytoconstructappropriatesubstitutionmatricesforlocalsequencecomparisonistoestimatetargetandbackgroundfrequencies,andcalculatethecorrespondinglogoddsscoresofformula(6).Thesefrequenciesingeneralcannotbederivedfromfirstprinciples,andtheirestimationrequiresempiricalinput.

    ThePAMandBLOSUMaminoacidsubstitutionmatrices

    Whileallsubstitutionmatricesareimplicitlyoflogoddsform,thefirstexplicitconstructionusingformula(6)wasbyDayhoffandcoworkers[24,25].Fromastudyofobservedresiduereplacementsincloselyrelatedproteins,they

  • 10/8/2014 The Statistics of Sequence Similarity Scores

    http://www.ncbi.nlm.nih.gov/blast/tutorial/Altschul-1.html 6/10

    constructedthePAM(for"pointacceptedmutation")modelofmolecularevolution.One"PAM"correspondstoanaveragechangein1%ofallaminoacidpositions.After100PAMsofevolution,noteveryresiduewillhavechanged:somewillhavemutatedseveraltimes,perhapsreturningtotheiroriginalstate,andothersnotatall.Thusitispossibletorecognizeashomologousproteinsseparatedbymuchmorethan100PAMs.NotethatthereisnogeneralcorrespondencebetweenPAMdistanceandevolutionarytime,asdifferentproteinfamiliesevolveatdifferentrates.UsingthePAMmodel,thetargetfrequenciesandthecorrespondingsubstitutionmatrixmaybecalculatedforanygivenevolutionarydistance.Whentwosequencesarecompared,itisnotgenerallyknownaprioriwhatevolutionarydistancewillbestcharacterizeanysimilaritytheymayshare.Closelyrelatedsequences,however,arerelativelyeasytofindevenwillnonoptimalmatrices,sothetendencyhasbeentousematricestailoredforfairlydistantsimilarities.Formanyyears,themostwidelyusedmatrixwasPAM250,becauseitwastheonlyoneoriginallypublishedbyDayhoff.Dayhoff'sformalismforcalculatingtargetfrequencieshasbeencriticized[27],andtherehavebeenseveraleffortstoupdatehernumbersusingthevastquantitiesofderivedproteinsequencedatageneratedsinceherwork[33,35].ThesenewerPAMmatricesdonotdiffergreatlyfromtheoriginalones[37].Analternativeapproachtoestimatingtargetfrequencies,andthecorrespondinglogoddsmatrices,hasbeenadvancedbyHenikoff&Henikoff[34].Theyexaminemultiplealignmentsofdistantlyrelatedproteinregionsdirectly,ratherthanextrapolatefromcloselyrelatedsequences.Anadvantageofthisapproachisthatitcleavesclosertoobservationadisadvantageisthatityieldsnoevolutionarymodel.Anumberoftests[13,37]suggestthatthe"BLOSUM"matricesproducedbythismethodgenerallyaresuperiortothePAMmatricesfordetectingbiologicalrelationships.

    DNAsubstitutionmatrices

    Whilewehavediscussedsubstitutionmatricesonlyinthecontextofproteinsequencecomparison,allthemainissuescarryovertoDNAsequencecomparison.Onewarningisthatwhenthesequencesofinterestcodeforprotein,itisalmostalwaysbettertocomparetheproteintranslationsthantocomparetheDNAsequencesdirectly.Thereasonisthatafteronlyasmallamountofevolutionarychange,theDNAsequences,whencomparedusingsimplenucleotidesubstitutionscores,containlessinformationwithwhichtodeducehomologythandotheencodedproteinsequences[32].Sometimes,however,onemaywishtocomparenoncodingDNAsequences,atwhichpointthesamelogoddsapproachasbeforeapplies.Anevolutionarymodelinwhichallnucleotidesareequallycommonandallsubstitutionmutationsareequallylikelyyieldsdifferentscoresonlyformatchesandmismatches[32].Amorecomplexmodel,inwhichtransitionsaremorelikelythantransversions,yieldsdifferent"mismatch"scoresfortransitionsandtransversions[32].Thebestscorestousewilldependuponwhetheroneisseekingrelativelydivergedor

  • 10/8/2014 The Statistics of Sequence Similarity Scores

    http://www.ncbi.nlm.nih.gov/blast/tutorial/Altschul-1.html 7/10

    closelyrelatedsequences[32].

    Gapscores

    Ourtheoreticaldevelopmentconcerningtheoptimalityofmatricesconstructedusingequation(6)unfortunatelyisinvalidassoonasgapsandassociatedgapscoresareintroduced,andnomoregeneraltheoryisavailabletotakeitsplace.However,ifthegapscoresemployedaresufficientlylarge,onecanexpectthattheoptimalsubstitutionscoresforagivenapplicationwillnotchangesubstantially.Inpractice,thesamesubstitutionscoreshavebeenappliedfruitfullytolocalalignmentsbothwithandwithoutgaps.Appropriategapscoreshavebeenselectedovertheyearsbytrialanderror[13],andmostalignmentprogramswillhaveadefaultsetofgapscorestogowithadefaultsetofsubstitutionscores.Iftheuserwishestoemployadifferentsetofsubstitutionscores,thereisnoguaranteethatthesamegapscoreswillremainappropriate.Nocleartheoreticalguidancecanbegiven,but"affinegapscores"[3841],withalargepenaltyforopeningagapandamuchsmalleroneforextendingit,havegenerallyprovedamongthemosteffective.

    Lowcomplexitysequenceregions

    Thereisonefrequentcasewheretherandommodelsandthereforethestatisticsdiscussedherebreakdown.Asmanyasonefourthofallresiduesinproteinsequencesoccurwithinregionswithhighlybiasedaminoacidcomposition.Alignmentsoftworegionswithsimilarlybiasedcompositionmayachieveveryhighscoresthatowevirtuallynothingtoresidueorderbutaredueinsteadtosegmentcomposition.Alignmentsofsuch"lowcomplexity"regionshavelittlemeaninginanycase:sincetheseregionsmostlikelyarisebygeneslippage,theonetooneresiduecorrespondenceimposedbyalignmentisnotvalid.Whileitisworthnotingthattwoproteinscontainsimilarlowcomplexityregions,theyarebestexcludedwhenconstructingalignments[4244].TheBLASTprogramsemploytheSEGalgorithm[43]tofilterlowcomplexityregionsfromproteinsbeforeexecutingadatabasesearch.

    References

    [1]Fitch,W.M.(1983)"Randomsequences."J.Mol.Biol.163:171176.(PubMed)

    [2]Lipman,D.J.,Wilbur,W.J.,SmithT.F.&Waterman,M.S.(1984)"Onthestatisticalsignificanceofnucleicacidsimilarities."Nucl.AcidsRes.12:215226.(PubMed)

    [3]Altschul,S.F.&Erickson,B.W.(1985)"Significanceofnucleotidesequencealignments:amethodforrandomsequencepermutationthatpreservesdinucleotideandcodonusage."Mol.Biol.Evol.2:526538.(PubMed)

    [4]Deken,J.(1983)"Probabilisticbehavioroflongestcommonsubsequence

  • 10/8/2014 The Statistics of Sequence Similarity Scores

    http://www.ncbi.nlm.nih.gov/blast/tutorial/Altschul-1.html 8/10

    length."In"TimeWarps,StringEditsandMacromolecules:TheTheoryandPracticeofSequenceComparison."D.Sankoff&J.B.Kruskal(eds.),pp.5591,AddisonWesley,Reading,MA.

    [5]Reich,J.G.,Drabsch,H.&Daumler,A.(1984)"OnthestatisticalassessmentofsimilaritiesinDNAsequences."Nucl.AcidsRes.12:55295543.(PubMed)

    [6]Altschul,S.F.,Gish,W.,Miller,W.,Myers,E.W.&Lipman,D.J.(1990)"Basiclocalalignmentsearchtool."J.Mol.Biol.215:403410.(PubMed)

    [7]Smith,T.F.&Waterman,M.S.(1981)"Identificationofcommonmolecularsubsequences."J.Mol.Biol.147:195197.(PubMed)

    [8]Sellers,P.H.(1984)"Patternrecognitioningeneticsequencesbymismatchdensity."Bull.Math.Biol.46:501514.

    [9]Gumbel,E.J.(1958)"Statisticsofextremes."ColumbiaUniversityPress,NewYork,NY.

    [10]Karlin,S.&Altschul,S.F.(1990)"Methodsforassessingthestatisticalsignificanceofmolecularsequencefeaturesbyusinggeneralscoringschemes."Proc.Natl.Acad.Sci.USA87:22642268.(PubMed)

    [11]Dembo,A.,Karlin,S.&Zeitouni,O.(1994)"Limitdistributionofmaximalnonalignedtwosequencesegmentalscore."Ann.Prob.22:20222039.

    [12]Pearson,W.R.&Lipman,D.J.(1988)Improvedtoolsforbiologicalsequencecomparison."Proc.Natl.Acad.Sci.USA85:24442448.(PubMed)

    [13]Pearson,W.R.(1995)"Comparisonofmethodsforsearchingproteinsequencedatabases."Prot.Sci.4:11451160.(PubMed)

    [14]Altschul,S.F.&Gish,W.(1996)"Localalignmentstatistics."Meth.Enzymol.266:460480.(PubMed)

    [15]Altschul,S.F.,Madden,T.L.,Schffer,A.A.,Zhang,J.,Zhang,Z.,Miller,W.&Lipman,D.J.(1997)"GappedBLASTandPSIBLAST:anewgenerationofproteindatabasesearchprograms."NucleicAcidsRes.25:33893402.(PubMed)

    [16]Smith,T.F.,Waterman,M.S.&Burks,C.(1985)"Thestatisticaldistributionofnucleicacidsimilarities."NucleicAcidsRes.13:645656.(PubMed)

    [17]Collins,J.F.,Coulson,A.F.W.&Lyall,A.(1988)"Thesignificanceofproteinsequencesimilarities."Comput.Appl.Biosci.4:6771.(PubMed)

    [18]Mott,R.(1992)"Maximumlikelihoodestimationofthestatisticaldistributionof

  • 10/8/2014 The Statistics of Sequence Similarity Scores

    http://www.ncbi.nlm.nih.gov/blast/tutorial/Altschul-1.html 9/10

    SmithWatermanlocalsequencesimilarityscores."Bull.Math.Biol.54:5975.

    [19]Waterman,M.S.&Vingron,M.(1994)"Rapidandaccurateestimatesofstatisticalsignificanceforsequencedatabasesearches."Proc.Natl.Acad.Sci.USA91:46254628.(PubMed)

    [20]Waterman,M.S.&Vingron,M.(1994)"SequencecomparisonsignificanceandPoissonapproximation."Stat.Sci.9:367381.

    [21]Pearson,W.R.(1998)"Empiricalstatisticalestimatesforsequencesimilaritysearches."J.Mol.Biol.276:7184.(PubMed)

    [22]Arratia,R.&Waterman,M.S.(1994)"Aphasetransitionforthescoreinmatchingrandomsequencesallowingdeletions."Ann.Appl.Prob.4:200225.

    [23]McLachlan,A.D.(1971)"Testsforcomparingrelatedaminoacidsequences.Cytochromecandcytochromec551."J.Mol.Biol.61:409424.(PubMed)

    [24]Dayhoff,M.O.,Schwartz,R.M.&Orcutt,B.C.(1978)"Amodelofevolutionarychangeinproteins."In"AtlasofProteinSequenceandStructure,"Vol.5,Suppl.3(ed.M.O.Dayhoff),pp.345352.Natl.Biomed.Res.Found.,Washington,DC.

    [25]Schwartz,R.M.&Dayhoff,M.O.(1978)"Matricesfordetectingdistantrelationships."In"AtlasofProteinSequenceandStructure,"Vol.5,Suppl.3(ed.M.O.Dayhoff),p.353358.Natl.Biomed.Res.Found.,Washington,DC.

    [26]Feng,D.F.,Johnson,M.S.&Doolittle,R.F.(1984)"Aligningaminoacidsequences:comparisonofcommonlyusedmethods."J.Mol.Evol.21:112125.(PubMed)

    [27]Wilbur,W.J.(1985)"OnthePAMmatrixmodelofproteinevolution."Mol.Biol.Evol.2:434447.(PubMed)

    [28]Taylor,W.R.(1986)"Theclassificationofaminoacidconservation."J.Theor.Biol.119:205218.(PubMed)

    [29]Rao,J.K.M.(1987)"Newscoringmatrixforaminoacidresidueexchangesbasedonresiduecharacteristicphysicalparameters."Int.J.PeptideProteinRes.29:276281.

    [30]Risler,J.L.,Delorme,M.O.,Delacroix,H.&Henaut,A.(1988)"Aminoacidsubstitutionsinstructurallyrelatedproteins.Apatternrecognitionapproach.Determinationofanewandefficientscoringmatrix."J.Mol.Biol.204:10191029.(PubMed)

    [31]Altschul,S.F.(1991)"Aminoacidsubstitutionmatricesfromaninformation

  • 10/8/2014 The Statistics of Sequence Similarity Scores

    http://www.ncbi.nlm.nih.gov/blast/tutorial/Altschul-1.html 10/10

    theoreticperspective."J.Mol.Biol.219:555565.(PubMed)

    [32]States,D.J.,Gish,W.&Altschul,S.F.(1991)"Improvedsensitivityofnucleicaciddatabasesearchesusingapplicationspecificscoringmatrices."Methods3:6670.

    [33]Gonnet,G.H.,Cohen,M.A.&Benner,S.A.(1992)"Exhaustivematchingoftheentireproteinsequencedatabase."Science256:14431445.(PubMed)

    [34]Henikoff,S.&Henikoff,J.G.(1992)"Aminoacidsubstitutionmatricesfromproteinblocks."Proc.Natl.Acad.Sci.USA89:1091510919.(PubMed)

    [35]Jones,D.T.,Taylor,W.R.&Thornton,J.M.(1992)"Therapidgenerationofmutationdatamatricesfromproteinsequences."Comput.Appl.Biosci.8:275282.(PubMed)

    [36]Overington,J.,Donnelly,D.,JohnsonM.S.,Sali,A.&Blundell,T.L.(1992)"Environmentspecificaminoacidsubstitutiontables:Tertiarytemplatesandpredictionofproteinfolds."Prot.Sci.1:216226.(PubMed)

    [37]Henikoff,S.&Henikoff,J.G.(1993)"Performanceevaluationofaminoacidsubstitutionmatrices."Proteins17:4961.(PubMed)

    [38]Gotoh,O.(1982)"Animprovedalgorithmformatchingbiologicalsequences."J.Mol.Biol.162:705708.(PubMed)

    [39]Fitch,W.M.&Smith,T.F.(1983)"Optimalsequencealignments."Proc.Natl.Acad.Sci.USA80:13821386.

    [40]Altschul,S.F.&Erickson,B.W.(1986)"Optimalsequencealignmentusingaffinegapcosts."Bull.Math.Biol.48:603616.(PubMed)

    [41]Myers,E.W.&Miller,W.(1988)"Optimalalignmentsinlinearspace."Comput.Appl.Biosci.4:1117.(PubMed)

    [42]Claverie,J.M.&States,D.J.(1993)"Informationenhancementmethodsforlargescalesequenceanalysis."Comput.Chem.17:191201.

    [43]Wootton,J.C.&Federhen,S.(1993)"Statisticsoflocalcomplexityinaminoacidsequencesandsequencedatabases."Comput.Chem.17:149163.

    [44]Altschul,S.F.,Boguski,M.S.,Gish,W.&Wootton,J.C.(1994)"Issuesinsearchingmolecularsequencedatabases."NatureGenet.6:119129.(PubMed)