The Statistics of Sequence Similarity Scores

10/8/2014 The Statistics of Sequence Similarity Scores

http://www.ncbi.nlm.nih.gov/blast/tutorial/Altschul-1.html 1/10

TheStatisticsofSequenceSimilarityScores

Introduction

Toassesswhetheragivenalignmentconstitutesevidenceforhomology,ithelpstoknowhowstronganalignmentcanbeexpectedfromchancealone.Inthiscontext,"chance"canmeanthecomparisonof(i)realbutnonhomologoussequences(ii)realsequencesthatareshuffledtopreservecompositionalproperties[13]or(iii)sequencesthataregeneratedrandomlybaseduponaDNAorproteinsequencemodel.Analyticstatisticalresultsinvariablyusethelastofthesedefinitionsofchance,whileempiricalresultsbasedonsimulationandcurvefittingmayuseanyofthedefinitions.

Thestatisticsofglobalsequencecomparison

Unfortunately,undereventhesimplestrandommodelsandscoringsystems,verylittleisknownabouttherandomdistributionofoptimalglobalalignmentscores[4].MonteCarloexperimentscanprovideroughdistributionalresultsforsomespecificscoringsystemsandsequencecompositions[5],butthesecannotbegeneralizedeasily.Therefore,oneofthefewmethodsavailableforassessingthestatisticalsignificanceofaparticularglobalalignmentistogeneratemanyrandomsequencepairsoftheappropriatelengthandcomposition,andcalculatetheoptimalalignmentscoreforeach[1,3].Whileitisthenpossibletoexpressthescoreofinterestintermsofstandarddeviationsfromthemean,itisamistaketoassumethattherelevantdistributionisnormalandconvertthisZvalueintoaPvaluethetailbehaviorofglobalalignmentscoresisunknown.Themostonecansayreliablyisthatif100randomalignmentshavescoreinferiortothealignmentofinterest,thePvalueinquestionislikelylessthan0.01.Onefurtherpitfalltoavoidisexaggeratingthesignificanceofaresultfoundamongmultipletests.Whenmanyalignmentshavebeengenerated,e.g.inadatabasesearch,thesignificanceofthebestmustbediscountedaccordingly.AnalignmentwithPvalue0.0001inthecontextofasingletrialmaybeassignedaPvalueofonly0.1ifitwasselectedasthebestamong1000independenttrials.

Thestatisticsoflocalsequencecomparison

Fortunatelystatisticsforthescoresoflocalalignments,unlikethoseofglobalalignments,arewellunderstood.Thisisparticularlytrueforlocalalignmentslackinggaps,whichwewillconsiderfirst.SuchalignmentswerepreciselythosesoughtbytheoriginalBLASTdatabasesearchprograms[6].Alocalalignmentwithoutgapsconsistssimplyofapairofequallengthsegments,onefromeachofthetwosequencesbeingcompared.AmodificationoftheSmithWaterman[7]orSellers[8]algorithmswillfindallsegmentpairswhosescorescannotbeimprovedbyextensionortrimming.ThesearecalledhighscoringsegmentpairsorHSPs.Toanalyzehowhighascoreislikelytoarisebychance,amodelofrandom



sequencesisneeded.Forproteins,thesimplestmodelchoosestheaminoacidresiduesinasequenceindependently,withspecificbackgroundprobabilitiesforthevariousresidues.Additionally,theexpectedscoreforaligningarandompairofaminoacidisrequiredtobenegative.Werethisnotthecase,longalignmentswouldtendtohavehighscoreindependentlyofwhetherthesegmentsalignedwererelated,andthestatisticaltheorywouldbreakdown.Justasthesumofalargenumberofindependentidenticallydistributed(i.i.d)randomvariablestendstoanormaldistribution,themaximumofalargenumberofi.i.d.randomvariablestendstoanextremevaluedistribution[9].(Wewillelidethemanytechnicalpointsrequiredtomakethisstatementrigorous.)Instudyingoptimallocalsequencealignments,weareessentiallydealingwiththelattercase[10,11].Inthelimitofsufficientlylargesequencelengthsmandn,thestatisticsofHSPscoresarecharacterizedbytwoparameters,Kandlambda.Mostsimply,theexpectednumberofHSPswithscoreatleastSisgivenbytheformula

WecallthistheEvalueforthescoreS.Thisformulamakeseminentlyintuitivesense.DoublingthelengthofeithersequenceshoulddoublethenumberofHSPsattainingagivenscore.Also,foranHSPtoattainthescore2xitmustattainthescorextwiceinarow,sooneexpectsEtodecreaseexponentiallywithscore.TheparametersKandlambdacanbethoughtofsimplyasnaturalscalesforthesearchspacesizeandthescoringsystemrespectively.

Bitscores

Rawscoreshavelittlemeaningwithoutdetailedknowledgeofthescoringsystemused,ormoresimplyitsstatisticalparametersKandlambda.Unlessthescoringsystemisunderstood,citingarawscorealoneislikecitingadistancewithoutspecifyingfeet,meters,orlightyears.Bynormalizingarawscoreusingtheformula

oneattainsa"bitscore"S',whichhasastandardsetofunits.TheEvaluecorrespondingtoagivenbitscoreissimply



Bitscoressubsumethestatisticalessenceofthescoringsystememployed,sothattocalculatesignificanceoneneedstoknowinadditiononlythesizeofthesearchspace.

Pvalues

ThenumberofrandomHSPswithscore>=SisdescribedbyaPoissondistribution[10,11].ThismeansthattheprobabilityoffindingexactlyaHSPswithscore>=Sisgivenby

whereEistheEvalueofSgivenbyequation(1)above.SpecificallythechanceoffindingzeroHSPswithscore>=SiseE,sotheprobabilityoffindingatleastonesuchHSPis

ThisisthePvalueassociatedwiththescoreS.Forexample,ifoneexpectstofindthreeHSPswithscore>=S,theprobabilityoffindingatleastoneis0.95.TheBLASTprogramsreportEvalueratherthanPvaluesbecauseitiseasiertounderstandthedifferencebetween,forexample,Evalueof5and10thanPvaluesof0.993and0.99995.However,whenE



multipledistinctdomains.Ifweassumetheapriorichanceofrelatednessisproportionaltosequencelength,thenthepairwiseEvalueinvolvingadatabasesequenceoflengthnshouldbemultipliedbyN/n,whereNisthetotallengthofthedatabaseinresidues.Examiningequation(1),thiscanbeaccomplishedsimplybytreatingthedatabaseasasinglelongsequenceoflengthN.TheBLASTprograms[6,14,15]takethisapproachtocalculatingdatabaseEvalue.NoticethatforDNAsequencecomparisons,thelengthofdatabaserecordsislargelyarbitrary,andthereforethisistheonlyreallytenablemethodforestimatingstatisticalsignificance.

Thestatisticsofgappedalignments

Thestatisticsdevelopedabovehaveasolidtheoreticalfoundationonlyforlocalalignmentsthatarenotpermittedtohavegaps.However,manycomputationalexperiments[1421]andsomeanalyticresults[22]stronglysuggestthatthesametheoryappliesaswelltogappedalignments.Forungappedalignments,thestatisticalparameterscanbecalculated,usinganalyticformulas,fromthesubstitutionscoresandthebackgroundresiduefrequenciesofthesequencesbeingcompared.Forgappedalignments,theseparametersmustbeestimatedfromalargescalecomparisonof"random"sequences.Somedatabasesearchprograms,suchasFASTA[12]orvariousimplementationoftheSmithWatermanalgorithm[7],produceoptimallocalalignmentscoresforthecomparisonofthequerysequencetoeverysequenceinthedatabase.Mostofthesescoresinvolveunrelatedsequences,andthereforecanbeusedtoestimatelambdaandK[17,21].Thisapproachavoidstheartificialityofarandomsequencemodelbyemployingrealsequences,withtheirattendantinternalstructureandcorrelations,butitmustfacetheproblemofexcludingfromtheestimationscoresfrompairsofrelatedsequences.TheBLASTprogramsachievemuchoftheirspeedbyavoidingthecalculationofoptimalalignmentscoresforallbutahandfulofunrelatedsequences.ThemustthereforerelyuponapreestimationoftheparameterslambdaandK,foraselectedsetofsubstitutionmatricesandgapcosts.Thisestimationcouldbedoneusingrealsequences,buthasinsteadrelieduponarandomsequencemodel[14],whichappearstoyieldfairlyaccurateresults[21].

Edgeeffects

Thestatisticsdescribedabovetendtobesomewhatconservativeforshortsequences.Thetheorysupportingthesestatisticsisanasymptoticone,whichassumesanoptimallocalalignmentcanbeginwithanyalignedpairofresidues.However,ahighscoringalignmentmusthavesomelength,andthereforecannotbeginneartotheendofeitheroftwosequencesbeingcompared.This"edgeeffect"maybecorrectedforbycalculatingan"effectivelength"forsequences[14]theBLASTprogramsimplementsuchacorrection.Forsequenceslongerthanabout200residuestheedgeeffectcorrectionisusuallynegligible.

Thechoiceofsubstitutionscores



Theresultsalocalalignmentprogramproducesdependstronglyuponthescoresituses.Nosinglescoringschemeisbestforallpurposes,andanunderstandingofthebasictheoryoflocalalignmentscorescanimprovethesensitivityofone'ssequenceanalyses.Asbefore,thetheoryisfullydevelopedonlyforscoresusedtofindungappedlocalalignments,sowestartwiththatcase.Alargenumberofdifferentaminoacidsubstitutionscores,baseduponavarietyofrationales,havebeendescribed[2336].Howeverthescoresofanysubstitutionmatrixwithnegativeexpectedscorecanbewrittenuniquelyintheform

wheretheqij,calledtargetfrequencies,arepositivenumbersthatsumto1,thepiarebackgroundfrequenciesforthevariousresidues,andlambdaisapositiveconstant[10,31].Thelambdahereisidenticaltothelambdaofequation(1).Multiplyingallthescoresinasubstitutionmatrixbyapositiveconstantdoesnotchangetheiressence:analignmentthatwasoptimalusingtheoriginalscoresremainsoptimal.Suchmultiplicationalterstheparameterlambdabutnotthetargetfrequenciesqij.Thus,uptoaconstantscalingfactor,everysubstitutionmatrixisuniquelydeterminedbyitstargetfrequencies.Thesefrequencieshaveaspecialsignificance[10,31]:

Agivenclassofalignmentsisbestdistinguishedfromchancebythesubstitutionmatrixwhosetargetfrequenciescharacterizetheclass.

Toelaborate,onemaycharacterizeasetofalignmentsrepresentinghomologousproteinregionsbythefrequencywithwhicheachpossiblepairofresiduesisaligned.Ifvalineinthefirstsequenceandleucineinthesecondappearin1%ofallalignmentpositions,thetargetfrequencyfor(valine,leucine)is0.01.Themostdirectwaytoconstructappropriatesubstitutionmatricesforlocalsequencecomparisonistoestimatetargetandbackgroundfrequencies,andcalculatethecorrespondinglogoddsscoresofformula(6).Thesefrequenciesingeneralcannotbederivedfromfirstprinciples,andtheirestimationrequiresempiricalinput.

ThePAMandBLOSUMaminoacidsubstitutionmatrices

Whileallsubstitutionmatricesareimplicitlyoflogoddsform,thefirstexplicitconstructionusingformula(6)wasbyDayhoffandcoworkers[24,25].Fromastudyofobservedresiduereplacementsincloselyrelatedproteins,they



constructedthePAM(for"pointacceptedmutation")modelofmolecularevolution.One"PAM"correspondstoanaveragechangein1%ofallaminoacidpositions.After100PAMsofevolution,noteveryresiduewillhavechanged:somewillhavemutatedseveraltimes,perhapsreturningtotheiroriginalstate,andothersnotatall.Thusitispossibletorecognizeashomologousproteinsseparatedbymuchmorethan100PAMs.NotethatthereisnogeneralcorrespondencebetweenPAMdistanceandevolutionarytime,asdifferentproteinfamiliesevolveatdifferentrates.UsingthePAMmodel,thetargetfrequenciesandthecorrespondingsubstitutionmatrixmaybecalculatedforanygivenevolutionarydistance.Whentwosequencesarecompared,itisnotgenerallyknownaprioriwhatevolutionarydistancewillbestcharacterizeanysimilaritytheymayshare.Closelyrelatedsequences,however,arerelativelyeasytofindevenwillnonoptimalmatrices,sothetendencyhasbeentousematricestailoredforfairlydistantsimilarities.Formanyyears,themostwidelyusedmatrixwasPAM250,becauseitwastheonlyoneoriginallypublishedbyDayhoff.Dayhoff'sformalismforcalculatingtargetfrequencieshasbeencriticized[27],andtherehavebeenseveraleffortstoupdatehernumbersusingthevastquantitiesofderivedproteinsequencedatageneratedsinceherwork[33,35].ThesenewerPAMmatricesdonotdiffergreatlyfromtheoriginalones[37].Analternativeapproachtoestimatingtargetfrequencies,andthecorrespondinglogoddsmatrices,hasbeenadvancedbyHenikoff&Henikoff[34].Theyexaminemultiplealignmentsofdistantlyrelatedproteinregionsdirectly,ratherthanextrapolatefromcloselyrelatedsequences.Anadvantageofthisapproachisthatitcleavesclosertoobservationadisadvantageisthatityieldsnoevolutionarymodel.Anumberoftests[13,37]suggestthatthe"BLOSUM"matricesproducedbythismethodgenerallyaresuperiortothePAMmatricesfordetectingbiologicalrelationships.

DNAsubstitutionmatrices

Whilewehavediscussedsubstitutionmatricesonlyinthecontextofproteinsequencecomparison,allthemainissuescarryovertoDNAsequencecomparison.Onewarningisthatwhenthesequencesofinterestcodeforprotein,itisalmostalwaysbettertocomparetheproteintranslationsthantocomparetheDNAsequencesdirectly.Thereasonisthatafteronlyasmallamountofevolutionarychange,theDNAsequences,whencomparedusingsimplenucleotidesubstitutionscores,containlessinformationwithwhichtodeducehomologythandotheencodedproteinsequences[32].Sometimes,however,onemaywishtocomparenoncodingDNAsequences,atwhichpointthesamelogoddsapproachasbeforeapplies.Anevolutionarymodelinwhichallnucleotidesareequallycommonandallsubstitutionmutationsareequallylikelyyieldsdifferentscoresonlyformatchesandmismatches[32].Amorecomplexmodel,inwhichtransitionsaremorelikelythantransversions,yieldsdifferent"mismatch"scoresfortransitionsandtransversions[32].Thebestscorestousewilldependuponwhetheroneisseekingrelativelydivergedor



closelyrelatedsequences[32].

Gapscores

Ourtheoreticaldevelopmentconcerningtheoptimalityofmatricesconstructedusingequation(6)unfortunatelyisinvalidassoonasgapsandassociatedgapscoresareintroduced,andnomoregeneraltheoryisavailabletotakeitsplace.However,ifthegapscoresemployedaresufficientlylarge,onecanexpectthattheoptimalsubstitutionscoresforagivenapplicationwillnotchangesubstantially.Inpractice,thesamesubstitutionscoreshavebeenappliedfruitfullytolocalalignmentsbothwithandwithoutgaps.Appropriategapscoreshavebeenselectedovertheyearsbytrialanderror[13],andmostalignmentprogramswillhaveadefaultsetofgapscorestogowithadefaultsetofsubstitutionscores.Iftheuserwishestoemployadifferentsetofsubstitutionscores,thereisnoguaranteethatthesamegapscoreswillremainappropriate.Nocleartheoreticalguidancecanbegiven,but"affinegapscores"[3841],withalargepenaltyforopeningagapandamuchsmalleroneforextendingit,havegenerallyprovedamongthemosteffective.

Lowcomplexitysequenceregions

Thereisonefrequentcasewheretherandommodelsandthereforethestatisticsdiscussedherebreakdown.Asmanyasonefourthofallresiduesinproteinsequencesoccurwithinregionswithhighlybiasedaminoacidcomposition.Alignmentsoftworegionswithsimilarlybiasedcompositionmayachieveveryhighscoresthatowevirtuallynothingtoresidueorderbutaredueinsteadtosegmentcomposition.Alignmentsofsuch"lowcomplexity"regionshavelittlemeaninginanycase:sincetheseregionsmostlikelyarisebygeneslippage,theonetooneresiduecorrespondenceimposedbyalignmentisnotvalid.Whileitisworthnotingthattwoproteinscontainsimilarlowcomplexityregions,theyarebestexcludedwhenconstructingalignments[4244].TheBLASTprogramsemploytheSEGalgorithm[43]tofilterlowcomplexityregionsfromproteinsbeforeexecutingadatabasesearch.

References

[1]Fitch,W.M.(1983)"Randomsequences."J.Mol.Biol.163:171176.(PubMed)

[2]Lipman,D.J.,Wilbur,W.J.,SmithT.F.&Waterman,M.S.(1984)"Onthestatisticalsignificanceofnucleicacidsimilarities."Nucl.AcidsRes.12:215226.(PubMed)

[3]Altschul,S.F.&Erickson,B.W.(1985)"Significanceofnucleotidesequencealignments:amethodforrandomsequencepermutationthatpreservesdinucleotideandcodonusage."Mol.Biol.Evol.2:526538.(PubMed)

[4]Deken,J.(1983)"Probabilisticbehavioroflongestcommonsubsequence



length."In"TimeWarps,StringEditsandMacromolecules:TheTheoryandPracticeofSequenceComparison."D.Sankoff&J.B.Kruskal(eds.),pp.5591,AddisonWesley,Reading,MA.

[5]Reich,J.G.,Drabsch,H.&Daumler,A.(1984)"OnthestatisticalassessmentofsimilaritiesinDNAsequences."Nucl.AcidsRes.12:55295543.(PubMed)

[6]Altschul,S.F.,Gish,W.,Miller,W.,Myers,E.W.&Lipman,D.J.(1990)"Basiclocalalignmentsearchtool."J.Mol.Biol.215:403410.(PubMed)

[7]Smith,T.F.&Waterman,M.S.(1981)"Identificationofcommonmolecularsubsequences."J.Mol.Biol.147:195197.(PubMed)

[8]Sellers,P.H.(1984)"Patternrecognitioningeneticsequencesbymismatchdensity."Bull.Math.Biol.46:501514.

[9]Gumbel,E.J.(1958)"Statisticsofextremes."ColumbiaUniversityPress,NewYork,NY.

[10]Karlin,S.&Altschul,S.F.(1990)"Methodsforassessingthestatisticalsignificanceofmolecularsequencefeaturesbyusinggeneralscoringschemes."Proc.Natl.Acad.Sci.USA87:22642268.(PubMed)

[11]Dembo,A.,Karlin,S.&Zeitouni,O.(1994)"Limitdistributionofmaximalnonalignedtwosequencesegmentalscore."Ann.Prob.22:20222039.

[12]Pearson,W.R.&Lipman,D.J.(1988)Improvedtoolsforbiologicalsequencecomparison."Proc.Natl.Acad.Sci.USA85:24442448.(PubMed)

[13]Pearson,W.R.(1995)"Comparisonofmethodsforsearchingproteinsequencedatabases."Prot.Sci.4:11451160.(PubMed)

[14]Altschul,S.F.&Gish,W.(1996)"Localalignmentstatistics."Meth.Enzymol.266:460480.(PubMed)

[15]Altschul,S.F.,Madden,T.L.,Schffer,A.A.,Zhang,J.,Zhang,Z.,Miller,W.&Lipman,D.J.(1997)"GappedBLASTandPSIBLAST:anewgenerationofproteindatabasesearchprograms."NucleicAcidsRes.25:33893402.(PubMed)

[16]Smith,T.F.,Waterman,M.S.&Burks,C.(1985)"Thestatisticaldistributionofnucleicacidsimilarities."NucleicAcidsRes.13:645656.(PubMed)

[17]Collins,J.F.,Coulson,A.F.W.&Lyall,A.(1988)"Thesignificanceofproteinsequencesimilarities."Comput.Appl.Biosci.4:6771.(PubMed)

[18]Mott,R.(1992)"Maximumlikelihoodestimationofthestatisticaldistributionof



SmithWatermanlocalsequencesimilarityscores."Bull.Math.Biol.54:5975.

[19]Waterman,M.S.&Vingron,M.(1994)"Rapidandaccurateestimatesofstatisticalsignificanceforsequencedatabasesearches."Proc.Natl.Acad.Sci.USA91:46254628.(PubMed)

[20]Waterman,M.S.&Vingron,M.(1994)"SequencecomparisonsignificanceandPoissonapproximation."Stat.Sci.9:367381.

[21]Pearson,W.R.(1998)"Empiricalstatisticalestimatesforsequencesimilaritysearches."J.Mol.Biol.276:7184.(PubMed)

[22]Arratia,R.&Waterman,M.S.(1994)"Aphasetransitionforthescoreinmatchingrandomsequencesallowingdeletions."Ann.Appl.Prob.4:200225.

[23]McLachlan,A.D.(1971)"Testsforcomparingrelatedaminoacidsequences.Cytochromecandcytochromec551."J.Mol.Biol.61:409424.(PubMed)

[24]Dayhoff,M.O.,Schwartz,R.M.&Orcutt,B.C.(1978)"Amodelofevolutionarychangeinproteins."In"AtlasofProteinSequenceandStructure,"Vol.5,Suppl.3(ed.M.O.Dayhoff),pp.345352.Natl.Biomed.Res.Found.,Washington,DC.

[25]Schwartz,R.M.&Dayhoff,M.O.(1978)"Matricesfordetectingdistantrelationships."In"AtlasofProteinSequenceandStructure,"Vol.5,Suppl.3(ed.M.O.Dayhoff),p.353358.Natl.Biomed.Res.Found.,Washington,DC.

[26]Feng,D.F.,Johnson,M.S.&Doolittle,R.F.(1984)"Aligningaminoacidsequences:comparisonofcommonlyusedmethods."J.Mol.Evol.21:112125.(PubMed)

[27]Wilbur,W.J.(1985)"OnthePAMmatrixmodelofproteinevolution."Mol.Biol.Evol.2:434447.(PubMed)

[28]Taylor,W.R.(1986)"Theclassificationofaminoacidconservation."J.Theor.Biol.119:205218.(PubMed)

[29]Rao,J.K.M.(1987)"Newscoringmatrixforaminoacidresidueexchangesbasedonresiduecharacteristicphysicalparameters."Int.J.PeptideProteinRes.29:276281.

[30]Risler,J.L.,Delorme,M.O.,Delacroix,H.&Henaut,A.(1988)"Aminoacidsubstitutionsinstructurallyrelatedproteins.Apatternrecognitionapproach.Determinationofanewandefficientscoringmatrix."J.Mol.Biol.204:10191029.(PubMed)

[31]Altschul,S.F.(1991)"Aminoacidsubstitutionmatricesfromaninformation



theoreticperspective."J.Mol.Biol.219:555565.(PubMed)

[32]States,D.J.,Gish,W.&Altschul,S.F.(1991)"Improvedsensitivityofnucleicaciddatabasesearchesusingapplicationspecificscoringmatrices."Methods3:6670.

[33]Gonnet,G.H.,Cohen,M.A.&Benner,S.A.(1992)"Exhaustivematchingoftheentireproteinsequencedatabase."Science256:14431445.(PubMed)

[34]Henikoff,S.&Henikoff,J.G.(1992)"Aminoacidsubstitutionmatricesfromproteinblocks."Proc.Natl.Acad.Sci.USA89:1091510919.(PubMed)

[35]Jones,D.T.,Taylor,W.R.&Thornton,J.M.(1992)"Therapidgenerationofmutationdatamatricesfromproteinsequences."Comput.Appl.Biosci.8:275282.(PubMed)

[36]Overington,J.,Donnelly,D.,JohnsonM.S.,Sali,A.&Blundell,T.L.(1992)"Environmentspecificaminoacidsubstitutiontables:Tertiarytemplatesandpredictionofproteinfolds."Prot.Sci.1:216226.(PubMed)

[37]Henikoff,S.&Henikoff,J.G.(1993)"Performanceevaluationofaminoacidsubstitutionmatrices."Proteins17:4961.(PubMed)

[38]Gotoh,O.(1982)"Animprovedalgorithmformatchingbiologicalsequences."J.Mol.Biol.162:705708.(PubMed)

[39]Fitch,W.M.&Smith,T.F.(1983)"Optimalsequencealignments."Proc.Natl.Acad.Sci.USA80:13821386.

[40]Altschul,S.F.&Erickson,B.W.(1986)"Optimalsequencealignmentusingaffinegapcosts."Bull.Math.Biol.48:603616.(PubMed)

[41]Myers,E.W.&Miller,W.(1988)"Optimalalignmentsinlinearspace."Comput.Appl.Biosci.4:1117.(PubMed)

[42]Claverie,J.M.&States,D.J.(1993)"Informationenhancementmethodsforlargescalesequenceanalysis."Comput.Chem.17:191201.

[43]Wootton,J.C.&Federhen,S.(1993)"Statisticsoflocalcomplexityinaminoacidsequencesandsequencedatabases."Comput.Chem.17:149163.

[44]Altschul,S.F.,Boguski,M.S.,Gish,W.&Wootton,J.C.(1994)"Issuesinsearchingmolecularsequencedatabases."NatureGenet.6:119129.(PubMed)

The Statistics of Sequence Similarity Scores

Documents

Transcript of The Statistics of Sequence Similarity Scores