Reliable Phylogenetic Tree - University of Virginia

GeneralizedNeighbor-Joining:

MoreReliablePhylogeneticTreeReconstruction�

William R. Pearson�, GabrielRobins,andTongtongZhang

Departmentof ComputerScience,Universityof Virginia

�Departmentof Biochemistry, Universityof Virginia

Charlottesville,VA 22908

Key Words: phylogeneticreconstruction,neighbor-joining, least-squares,

minimum-evolution,solutionspacesampling

�The corresponding author is Professor William Pearson, [email protected],

804-924-2818,FAX: 804-924-5069

1

Abstract

Wehavedevelopedaphylogenetictreereconstructionmethodthatdetectsand

reportsmultiple, topologicallydistant,low costsolutions.Ourmethodis a

generalizationof theNeighbor-Joining(NJ)methodof Nei andSaitou,and

affordsamorethoroughsamplingof thesolutionspaceby keepingtrackof

multiplepartialsolutionsduringits execution.Thescopeof thesolutionspace

samplingis controlledby apairof user-specifiedparameters- thetotalnumberof

alternatesolutionsandthenumberof alternatesolutionsthatarerandomly

selected- effectinga smoothtradeoff betweenrun timeandsolutionqualityand

diversity. Thismethodcandiscover topologicallydistinctlow costsolutions.In

testsonbiologicalandsyntheticdatasetsusingeithertheleast-squaresdistanceor

minimum-evolutioncriterion,themethodconsistentlyperformedaswell as,or

betterthan,eithertheNeighbor-Joiningheuristicor thePHYLIP implementation

of theFitch-Margoliashdistancemeasure.In addition,themethodidentified

alternative treetopologieswith costswithin 1 or 2%of thebest,but with

topologicaldistances9 or morepartitionsfrom thebestsolution(16 taxa);with

32 taxa,topologieswereobtained17 (least-squares)- 22 (minimum-evolution)

partitionsfrom thebesttopologywhen200partialsolutionswereretained.Thus,

themethodcanfind lowercosttreetopologiesandnear-besttreetopologiesthat

aresignificantlydifferentfrom thebesttopology.

2

1 Introduction

Reconstructionof ancestralrelationshipsfrom contemporarydatais widely usedto provide

bothevolutionaryandfunctionalinsightsinto biologicalsystems.Theexplosive increasein

availableDNA sequencedatahasincreasedinterestin phylogeneticanalysisof multi-gene

anddomain-swappedproteinfamilies.Threegeneralclassesof phylogeneticreconstruction

methodsarecommonlyusedfor analysisof sequencedatasets:parsimony methods(Swofford

etal., 1996),distancebasedmethods(FitchandMargoliash,1967),andmaximumlikelihood

methods(Felsenstein,1982;Felsenstein,1988).Parsimony- anddistance-basedmethodsare

mostoftenused,largelybecausethey arefastercomputationallyandallow a largernumberof

potentialphylogenetictreesto beevaluated.

Reconstructionof anevolutionaryhistoryfor asetof contemporarytaxabasedon their

pairwisedistanceis computationallyintractable(i.e.,NP-complete)for variousoptimality

criteria(FouldsandGraham,1982;Day, 1987),includingtheleast-squarescriterion1 andthe

minimum-evolutioncriterion2. Variousheuristicshavebeenproposedto searchfor solutions

of desiredquality (Felsenstein,1988;BandeltandDress,1992;Swofford etal., 1996),and

1Accordingto theleast-squarescriterion,thebestphylogeny for theinput distancematrix

is theonethatminimizes �� where�� is thedistancebetweentaxa � and �in theinput distancematrix,and � �� is thesumof all thebranchlengthsalongtheuniquepath

connectingtaxa � and � in thepostulatedphylogeny.

2By the minimum-evolution criterion, thebestphylogeny for the input distancematrix is

onethatminimizesthesumof all edgelengths,whereedgelengthsareassignedto minimize

theleast-squaresdeviation.

3

themajorityof thesemethodsaregreedy, whichalwaysemploy movesthatAre “locally best”

andmaynot necessarilyleadto globaloptima(Swofford etal., 1996).Amongthegreedy

approaches,theNeighbor-Joiningmethod(SaitouandNei, 1987;StudierandKeppler, 1988)

is widely usedby molecularbiologistsdueto its efficiency andsimplicity.

Greedymethodsareefficientbecausethey exploreonly asmallportionof thesolutionspace3.

However, greedymethodscanfail to find thebestoverallsolutionif they become“trapped”in

localoptima.In addition,becauseonly asmallfractionof thesolutionspaceis examined,a

greedyheuristictypically will not report(or detect)alternativesolutionswith distinct

topologiesthatmayfit thedatanearlyaswell, or evenequallywell. Neglectingsuch

alternativesolutionscanproducemisleadinginferencesregardingtheevolutionaryhistory.

For instance,Wilsonetal. concludedthatall humansoriginatedfrom Africa, becausetheir

tree-building methodfailedto discoveralternative,near-optimal,treesthatwereconsistent

with adifferentgeographicalhistory(Wilsonetal., 1989;Maddison,1991).

To improvethereliability of phylogenetictreereconstructions,weproposeaschemewhich

samplesthesolutionspacemoreextensively by repeatedlyusingtheNeighbor-Joining

algorithm(SaitouandNei, 1987).Insteadof trackingonly asingle,locally-besttreeas

Neighbor-Joiningdoes,ourschememaintainsmultiplepartialsolutionsasit progresses.The

methodexploresall possibletreesderivablefrom thesetof currentpartialsolutionsin a

singleneighbor-joining step,andthenselectsasubsetof thesepartialsolutionsto passon to

thenext iteration.Thisapproachis competitivewith Neighbor-Joiningin recoveringdistinct

3A solution spaceis the set of all possiblephylogeniesspanningthe given taxa. Taxa

correspondto leavesin a treethatspansthem.

4

low-costtopologies,while still beingcomputationallyefficient.

2 Methods

2.1 The Neighbor-Joining Method

TheNeighbor-Joiningmethod(NJ)wasinitially proposedby SaitouandNei (1987),andlater

modifiedby StudierandKepler(1988).Neighbor-Joiningseeksto build a treewhich

minimizesthesumof all edgelengths,i.e., it adoptstheminimum-evolution(ME) criterion.

A numberof studieshavecorroboratedNJ’sperformancein reconstructingcorrect

evolutionarytrees(SaitouandImanishi,1989;KuhnerandFelsenstein,1994;Huelsenbeck,

1995).For smallnumbersof taxa,NJsolutionsarelikely to beidenticalto theoptimalME

tree(SaitouandImanishi,1989).

Neighbor-Joiningbeginswith astartree,theniteratively findstheclosestneighboringpair

(i.e. thepair thatinducesa treeof minimumsumof edgelengths)amongall possiblepairsof

nodes(bothinternalandexternal).Theclosestpair is thenclusteredinto anew internalnode,

andthedistancesof thisnodeto therestof thenodesin thetreearecomputedandusedin

lateriterations.Thealgorithmterminateswhen �"! internalnodeshavebeeninsertedinto

thetree(i.e.,whenthestartreeis fully resolvedinto abinarytree).TheNeighbor-Joining

heuristicis illustratedin Figure1B.

Fig. 1 goesnearhere.

AlthoughtheNeighbor-Joiningmethodrunsquickly, it returnsonly thesinglebestsolution

5

foundby its greedysearchstrategy. Thissolutioncanbefurtherimprovedwith

post-processingby rearrangingbranchesandswappingsubtrees(Rzhetsky andNei, 1992;

Swofford, 1996),but suchimprovedsolutionstendto remaintopologicallysimilar to the

originalstarting-pointsolutions.To increaseourconfidencein thesolutionreliability, it is

naturalto askif thereareothersolutions,with differenttopologies,thatareequally

well-supportedby thedistancematrixdata.

Solutionspacescanexhibit many alternatelocaloptima(Penny etal., 1995).For instance,

amongall #%$ possibletreesof $ taxa, ! of them( & � in Figures1Cand & � in Figure1B) fit the

inputmatrix (Figure1A) best.However, thesetwo treeshaveverydifferenttopologies;they

sharenocommoninternaledges.Indeed,accordingto thepartitiondistancemetric(see

Section2.4), & � and & � arethemostdissimilarpossible.

2.2 The Generalized Neighbor-Joining Method

OurGeneralizedNeighbor-Joining(GNJ)methodsamplesthesolutionspaceextendedlyby

keepingtrackof multiplepartialsolutionsasit progresses(thenumberof partialsolutions'is aninputparameter).Unlike theNeighbor-Joiningmethod,which followsonly asinglepath

towardsasolution,GNJperformsamorethoroughsearchof thesolutionspaceby tracking

andexploringmany potentially-goodpaths.Thatis, GNJretainspromisingpartialsolutions,

whichmaynotbelocally-optimal,but whichhave thepotentialfor substantiallygreatercost

savingsin subsequentsteps.An executionexampleof GNJon thematrixof Figure1A is

shown in Figure2.


6

TheGeneralizedNeighbor-Joining(GNJ)algorithmcanselectin severalwaysthe ' partial

solutionsthatarepassedon to thenext iteration.A simplestrategy wouldsave the ' best

(i.e., leastcost)partialsolutions;alternatively, partialsolutionscanbechosenat random.

Selectingthebestsolutionstendsto improvesolutionquality, while selectingalternates

randomlytendsto increasesolutiondiversity. We implementahybridschemethatbalances

thesetwo extremes:thetop (*)+' least-costsolutionsareselected,alongwith anadditional

,.- ' � ( “topologicallydiverse”solutions4. If therearemorethan ( least-costpairsata

giveniteration,GNJwill select( of themarbitrarily, whichmakestheGNJmethod

non-deterministic.

To achieve topologicaldiversity, ateachiteration,afterselectingthebest( partialsolutions,

theremainingpartialsolutionsarepartitionedinto / groupsaccordingto their topological

distancesfrom abestpartialsolution(partialsolutionswithin thesamegroupareequidistant

from thebestpartialsolution).We thenobtainanadditional,

“topologicallydiverse”partial

solutionsfor thenext iterationby selectingthetop 021 354 solutionsfrom eachgroup5. Thus,at

thelaststeptowardssolvinga #76 -taxaproblem,alternatesolutionscanbeasmany as #98partitionsaway from thebestcurrentsolution.In thiscase,if

,.- $;: , at least <>=�@? - 8 best

solutionsat topologicaldistances# and ! aresaved,andat leastthe A bestsolutionsat

topologicaldistances8 through #78 aresaved.For 8 ! taxaand,.- #9:B: , at least 8 solutions

will besavedateachtopologicaldistance.Becausethemaximumtopologicaldistance

increaseslinearlywith thenumberof taxa,theabovestrategy ensuresthatthenumberof

4Theparameternames( and,

aremnemonicfor qualityanddiversity, respectively.

5If / doesnotdivide,

, weselectoneadditionalsolutionfrom eachof the, � �C0 1 3D4FE9/ �

groupscorrespondingto thetopologicaldistancesfarthestfrom thebestpartialsolution.

7

topologicallydistinctgoodsolutionscanremainrelatively constantby increasing,

linearly,

ratherthanexponentially, with thenumberof taxa.

A similar ideais employedin thestepwiseminimum-evolutiontreebuilding method(Kumar,

1996).At eachiteration,for agivenpartialtree,thismethodfirst identifiestheleadingnode

(i.e., thenodemostlikely to bejoinedto anothernode),andformsthesetof next-step

Neighbor-Joiningtreesby clusteringeachnodewith theleadingnode.Thisstrategy restricts

thesolutionspacesomewhat,but it requiresexponentialtime to run,whichmakesit practical

only for smalldatasets.Moreover, it doesnotexplicitly consideralternatesolutionsat

differenttopologicaldistances(seebelow), soit is lesslikely to identify topologicallydistinct

alternatives.

Differentcombinationsof ( and,

( ' - (HG , ) enableasmoothtradeoff betweenquality

vs. diversity. As ( increaseswith respectto,

(for afixed ' ), lower-costsolutionsare

favoredoveroneswith diversetopologies,while for smallervaluesof ( , thesolutionspace

explorationbecomesmorebroad,andtopologicallydifferentlocaloptimaaremorelikely to

emerge.We notethatif ' - ( - # (andthus,I- ' � ( - : ), thesinglesolutionreturned

by ourGeneralizedNeighbor-Joiningapproachis identicalto thesolutionproducedby the

originalNeighbor-Joiningmethod(SaitouandNei, 1987;StudierandKeppler, 1988).Here

only thebest-costpartialsolutionis passedto eachsubsequentiteration,whichis exactlywhat

is doneby Neighbor-Joining.Thus,GNJdirectlygeneralizestheNeighbor-Joiningmethod.

Additionalstrategiesfor expandingthesearchof phylogenetictree-spacemightbe

considered.TheGNJapproachcanbeabstractlydividedinto two phases:(1) a tree

generationcomponentwhichproducesmultiplepartialsolutions,and(2) apartialsolution

8

evaluationfunctionwhich favorscertainpreferredpartialsolutionsoverothers.Theoverall

run timeperiterationof thecombinedmethodis asymptoticallynogreaterthantheslowestof

thesetwo components.

Thealgorithmdescribedin Section2.2utilizestheNeighbor-Joiningmethodasthepartial

treegenerationmechanismin phase�J# � , while usingtheminimum-evolutioncriterion

(implicit in theNeighbor-Joiningmethod)in filtering candidatepartialsolutionsin phase� !� .However, anycombinationof existingalgorithmsor heuristicsfor treegenerationandtree

evaluationcanbeincorporatedinto thisgeneraltemplate.

For example,wecanevaluatepartialtreesateachstepusingtheleast-squaresdeviation

optimalitycriterion.An alternativeschemefor treegenerationmightallow arbitrarypartitions

at intermediatesteps(i.e., “join” anynumberof taxaratherthanexactly two). In thiscase,a

numberof existingefficientpartitioningheuristics(Alpert andKahng,1995)canbereadily

appliedto generatemorepromisinganddiversepartialsolutions.Likewise,themethodfor

selectingtopologicallydiversepartialsolutionsmightselectmoresolutionsfrom moredistant

topologies,ratherthanuniformly samplingthetopologicaldistancesasis donein this

implementation.

TheGNJprogramis written in the‘C’ programminglanguageandis availablefrom

ftp://ftp.virginia.edu/pub/fasta/GNJ. To maketheGNJresultsmoreusable

in practice,weoutputthetreesobtainedby GNJin acomputer-readableformatthatcanbe

readilyprocessedby otherprograms(e.g.,theconsense programin thePHYLIP package).

Moreover, wesummarizetheleafpartitionsfoundamongtheGNJsolutionsbelow a

thresholdcost,andrankthemby decreasingfrequencies.

9

2.3 Datasets

WetestedtheGNJheuristicin theUNIX environment.Two typesof distancematriceswere

usedto evaluatethealgorithm:

(1) Distancematriceswereconstructedfor nucleotidesequencesgeneratedby randomly

mutatingan“ancestral”sequencealongamodelevolutionarytreeusingthetreeDNA

program(Felsenstein,personalcommunication)with theKimura two-parametermodelfor

mutationrates(Kimura,1980).Threetypesof topologieswereusedfor themodeltrees:

topologiesof minimumdiameter(whichwereferto asType‘A’ ), topologiesof maximum

diameter(type‘B’ ), andamixtureof both(type‘C’ ). Here,thediameterof a topologyis

definedasthemaximumnumberof edgesconnectingany two leafnodeswithin thetopology.

Therefore,topologiesof type‘A’ aremost“branchy”(i.e., they resembleacompletebinary

tree),while topologiesof type‘B’ aremore“stringy”. type‘A’ treeswerethemost

challenging,andareusedfor mostof thefigures.

Divergenceratesrangingfrom :LKM:B:B$ (internalbranches)to :LKN$;: (leafor externalbranches)

wereusedto producethesyntheticdata.Two differenttype‘A’ andtype‘B’ datasetswere

examined.type‘A1’ and‘B1’ datasetsuseddivergenceratesOP:QKR: ! (32 taxa)– OP:QKR:�$(8-taxa)for internaledgesand OP:QKSA for externaledges(thus,theratioof externalto internal

branchratesvariedfrom #9: for T taxato 8�$ for 32 taxa).Type‘A2’ and‘B2’ treesusedrates

of OP:QKM:;:�$ for thecentral(internal)edgesand OP:QKM$;: for theexternal(leaf)edges

(external/internalratiosof 100).

(2) Severalbiologicaldatasetswereexamined,includingimmunologicaldatafrom U frog

species(SaitouandNei, 1987),datafrom #78 viral env V3 fragmentsandgag P17(Leitner

10

etal., 1996),and AV alignedTCP-1chaperonin6;: family members(JonSundBlandfort,

personalcommunication).For DNA sequences,thedistancematriceswerecomputedwith the

dnadist progamin thePHYLIP package(Felsenstein,1993),usingtheKimura2-parameter

model(Kimura,1980).For proteinsequences,thedistancematriceswerecomputedwith the

protdist programin thePHYLIP package(Felsenstein,1993),usingtheDayhoff PAM

matrixmodel(Dayhoff, 1978).Weobtain 8;: biologicaldatasetsof T , #76 , or 8 ! taxaby

randomlysamplingtheoriginaldatasets.Resultson thedifferentbiologicaldatasetswere

similar; only resultson thechaperonindistances(referredto asdataset‘R1’) arereported.

2.4 Algorithms Compared

Weevaluatedthedatasetsusingthreealgorithms:(1) NJ: TheNeighbor-Joiningmethod

(SaitouandNei, 1987;StudierandKeppler, 1988),asimplementedin thePHYLIP package

(Felsenstein,1993);(2) FM: TheFitch-Margoliashmethodfor fitting topologiesto distance

matriceswith respectto theleast-squarescriterion(FitchandMargoliash,1967),as

implementedin thePHYLIP package(Felsenstein,1993);and(3) GNJ: theGeneralized

Neighbor-Joiningmethod,describedin thispaper.

In addition,weexaminedeverypossibletreetopologyfor syntheticandbiologicaldataover8

taxa.Thisexhaustivemethodis guaranteedto returnaglobaloptimum(i.e. thelowest-cost

topology).Becauseof thesheersizeof thesolutionspace,theoptimalmethodis feasibleonly

for datasetscontainingfewer thantentaxa.

Thesolutionsfrom thedifferentalgorithmswereevaluatedusingeithertheleast-squaresor

theminimum-evolutioncriterion.Least-squarestreecostis computedby assigning

11

non-negativeedgelengthsin a way thatminimizestheleast-squaresdeviation.

Minimum-evolutiontreecostis computedasthesumof suchedge-lengthsin a tree.

To improvefurtherthesolutionquality, wealsoappliedapost-processingoptimizationstep

which rearrangessubtreesasfollows. Givena topology, wecomputethecostof all thetrees

resultingfrom swapping/exchangingsubtreesaroundeachof theinternaledgesof the

topology. Then,thelowest-costtreeis chosenasthenew currenttree,andits neighborhoodis

investigatedin turn. We iteratethisprocessuntil no furtherimprovementcanbeobtained.

Topologicaldistancesin thispaperarebasedon thepartitionmetric(RobinsonandFoulds,

1981;Penny andHendy, 1985;SteelandHendy, 1993),whichmeasuresthenumberof edges

commonto agivenpairof binarytrees.Eachinternaledgenaturallypartitionsthesetof leaf

nodesinto two subsets.Two treesspanningthesamesetof leaveshaveacommonedgeif

removing thisedgeinducesthesametwo subsetsof leafnodes.Thus,thepartitiondistance

betweenany two treesis definedasthenumberof edgesin onetreefor which thereis no

correspondingequivalentedgein theothertree.Sinceeachbinarytreeof leaveshas � 8internaledges,distancesunderthepartitionmetriccanberepresentedasintegersbetween:and � 8 .

3 Results

LikeNeighbor-Joining,GeneralizedNeighbor-Joining(GNJ)seeksto identify phylogenetic

treetopologiesandbranchlengthsthatbestfit distancedata.GNJimproveson

Neighbor-Joiningby identifyingnear-optimaltopologiesthataresignificantlydifferentfrom

12

thebestsolutionfoundin thesearch(therearetypically many near-optimalsolutionsthat

differ only slightly from thebestsolution;weseektopologically-distantalternatives).In the

resultsbelow, wefirst show thatthedatasetsthatweexaminecontaintopologicallydistinct,

low-costsolutions.We thendemonstratethattheGNJalgorithmcanfind theselow-cost

alternativesolutions,by examiningtwo measuresof success:(1) thenumberof alternative

treesfoundby GNJwith anear-optimalcost;and(2) themaximaltopological(partition)

distancebetweenthenear-optimalsolutionsandtheoptimalsolutionfound.6 In bothtests,we

seekthelargestnumberof solutionswith costnearestto optimal,but with topological

distancethatis far away.

3.1 Comparison of GNJ with exhaustive 8-taxa searches

To judgehow effectively theGNJapproachfindsalternative topologically-distinctsolutions,

wefirst characterizedtheactualnumberanddiversityof near-optimalsolutionsby

enumeratingall 10,395differenttreesfor datasetswith 8 taxaandcalculatedthecostfor each

treetopology(Fig. 3). Tree-costswereoptimizedusingeithertheminimum-evolution

criterionor theleast-squarescriterion.Becausethedifferentcostcriteriamayhavedifferent

distributionsof costs,weplot thenumberof treesobtainedasa functionof thefractionalcost

range: �XW Y � W Z �N�[��\ �XW Z^]JY � W Z �N�� , whereW Y is theleast-squaresor minimum-evolutioncostof

aspecifictreetopology, W2Z �N� is theminimum(andfor exhaustivesearches,optimal)cost

underthatcriterion,and W Z^]�Y is thecostof theworsttopology. For the8 taxadata,W2Z �M� and

6In the caseof morethan T taxa,wherean exhaustive searchfor the optimal solutionis

computationallyinfeasible,wecompareto thebestsolutionfoundinsteadof to optimal.

13

W2Z_]�Y areknown becausethecostof everypossibletopologyhasbeencalculated.7

For thesyntheticdata,it is possibleto askhow oftenthelow costtreesfoundby theGNJ

algorithmwereconsistentwith theoriginal treethatwasusedto producethedistancedata.

However, thelowestcostleast-squaresor minimum-evolutiontreewasoftendifferentfrom

theoriginal tree.Treesfrom type‘A1’ and‘B1’ dataareusedfor mostof thefiguresbecause

thedifferencebetweentheoriginal treecostandthebesttreecostwastypically between:and :LK# with themedianbetween:QKR:Q# and :LKM:B8 of thecostrange.Treesfrom type‘A1’ and

‘B1’ syntheticdatabehavedverysimilarly to treesfrom thebiologicaldatasets.For the‘A2’

and‘B2’ datasets,themedianoriginal treecostwas :QKSA – :LKNV of thecostrange.Thus,because

of thehighexternal/internalrateratio, thebesttreefrequentlyhadacostsubstantiallylower

thantheoriginal treeandthesedatasetshavea largenumberof distinctlocalminima,which

arenotseenwith thebiologicaldatasetsor with thetype‘A1’ and‘B1’ trees.


Fig. 3 showshow thenumberof treesandthetopologicaldistancebetweenthealternative

solutionsincreasesover thefractionalcostrange.Theresultsfrom threedifferentdatasetsare

shown usingeithertheminimum-evolutionor theleast-squarescostcriterion.In theseplots,

morechallengingdatasetshavea largernumberof near-optimaltreesandgreatertopological

7For larger datasets,W2Z �N� is approximatedfrom the minimum cost obtainedfor all the

tree-searcheson the dataset,and W Z^]�Y is approximatedfrom the maximumcostobtainedby

sampling #9:B: treesrandomly. Thus for the 16 and 32 taxa datasets,W2Z �N� may not be the

optimalminimumcostand W2Z_]�Y maynotbethehighest(worst)cost,but theseapproximations

shoulddiffer only slightly from thetruevalues.

14

distanceat lower fractionalcost.In general,therearemorenear-optimaltreeswith the

least-squarescriterionthanwith theminimum-evolutioncriterionandthosetreestendto be

moretopologicallydistinct(Fig. 3). For example,with thebiologicaldata(Fig. 3C), there

were #�A`KM6 treesonaveragewith least-squarescostwithin :QKR:Q# of optimal,but only 2.6trees

)+:QKR: ! whentheminimum-evolutioncostis calculated.Furthermore,whenthecostis less

than :QKR:Q# , themaximumtopologicaldistancefor near-optimaltreesis greaterfor the

least-squarestreesthanfor theminimum-evolutiontrees.

The“branchy” type‘A1’ syntheticdatasettendsto producea largernumberof near-optimal,

topologicallydistanttreesthanthetype‘B1’ (Fig. 3) datasets.Whenthetype‘A2’, ‘B2’ and

‘C2’ datasetswereexamined(datanot shown), type‘A2’ datasetswerethemostchallenging,

and,for treeswith cost )a:QKM:L# , thenumberof treesandtopologicaldistancebetweenthetrees

wasabouttwiceashigh for type‘A2’ comparedto type‘A1’. Thebiologicaldatasetappears

morechallengingthanthetype‘A1’, ‘B1’ and‘B2’ syntheticdatasets,but lesschallenging

thanthetype‘A2’ dataset(Fig. 3 anddatanotshown). We focusourattentionon thenumber

anddiversityof treeswith cost-range:QKR:Q# –:QKR:�$ bothbecausethesecost-rangesareintuitively

close—between1%and5%of thebestcostfound—andbecause,for thetype‘A1’ and‘B1’

syntheticdata,:QKR:Q# –:QKR:�$ spanstherangeof costdifferencesbetweentheoriginal treesused

to generatethedistancedataandthebesttreesfoundfor thedata.

Ideally, theGNJalgorithmwouldfind eachof thenear-optimalsolutionsthatcanbefound

whenevery tree-topologyis examined.Thus,weusethenumberof solutions,theiraverage

cost,andtheirdiversityto gaugetheeffectivenessof GNJ(Fig. 4 – 6), andcompareGNJwith

anexhaustivesearch(Fig. 3). Weseekcombinationsof ‘Q’ and‘D’ thatapproachthe

distributionof solutionsseenin theexhaustivesearch.Fig. 4A shows thattheGNJalgorithm

15

effectively identifiesvirtually all sub-optimalsolutionswith costs)+:LKM:�$ on thesynthetic

dataset,aslongas (cbH: . (Results,notshown, usingtheminimum-evolutioncostcriterion

areindistinguishable)Only when ( - : , ,.- $;: is thenumberof near-optimalsolutions

differentfrom thenumberfoundin theexhaustivesearch;thatis, someof thelowest-cost

alternativesolutionsaremissed.Thecurvesin Fig. 4B, D reporttheaveragemaximum

topologicaldistance;i.e. themaximumtopologicaldistanceamongall thetreeswith costless

thantheordinateis determinedfor eachof the30datasets,andthe 8B: maximumdistancesare

averaged.Again,when (*bH: thealternatesolutionsfoundby theGNJalgorithmareas

diverseasthosefoundby theexhaustivesearch,for costswithin #7: % of optimal. (Wealso

examinedthemaximumtopologicaldistancesfor thedatain Fig. 4, andfoundthatthey were

verysimilar to theexhaustivesearchif (*bd: , datanotshown.)


Thebiological‘R1’ datasetis morechallengingin someways—thereis a largernumberof

alternatesolutionswith low cost(Fig. 4C)andthelow-costsolutionsappearmore

topologicallydiverse(Fig. 4D). For thebiologicaldataset,GNJbeginsto misssolutionswith

costsbH:QKM:;:�$ thatarefoundby theexhaustivesearch.At a fractionalcostof :LKM:Q# , #B# of #%$solutionsarefoundby GNJwith (*e ! $ , and #9T outof 8Q# arefoundat fractionalcost :QKR: ! .As with thesyntheticdataset(Fig. 4B), when ( - : , someof thebestnear-optimalsolutions

aremissed.Theresultsin Fig. 4 suggestthatfor small(8 taxa)problems,theGNJalgorithm

identifiesalternate,near-optimal,topologically-distantsolutionsveryeffectively.

16

3.2 GNJ performance with 16 and 32 taxa

For largerdatasets,it is not computationallyfeasibleto examinethesolutionspace

exhaustively, sowecannotdirectlycomparetheGNJresultsto theoptimalsolution.

(Likewise,wecannotguaranteethatthelowest-costsolutionis optimal,but it is likely to be

nearoptimal.)Nonetheless,wecanstill evaluatehow theGNJalgorithmbenefitsfrom saving

multiple ' - (HG , solutionsby examininga rangeof (gf , pairs(Figs.5–6).Whentype

‘A1’ syntheticdatasetswith 16 taxaaresearched,thelargestnumberof low-costsolutionsare

againfoundwhen (*bd: , andthemosttopologicallydiversesolutionsarefoundwhen

, bH: . (Fig. 5 showstheresultsusingtheminimum-evolutioncostcriterion;resultsusingthe

least-squarescriterion,notshown, aresimilar.) For thesedatasetswith ' - #7:B: , thetrade-off

betweenquality ( anddiversity,

is clear-cut. Below :LKM:Q# thereis little differencein

diversityas ( and,

change;above :QKR: ! , , eh$;: givesthebestresults.On thebiological

dataset(Fig. 5C),searcheswith (*e�#7:B: find almosttwiceasmany ( 6 ! –V;T ) solutionswith

fractionalcost )+:QKR:Q# assearcheswith,.- UB: or #7:B: ( 8B8 –8�V solutions).Thedifferencein

performancewith respectto ( and,

increasesathigherfractionalcosts.However, while

reducing,

increasesthenumberof low-costsolutionsfound,it alsodecreasesthediversity

of thesolutionset.For thesedata,( -h,.- $;: seemsto bethebestcompromise.


Whensearchesareperformedon32-taxadata(Fig. 6), theimportanceof,

in improving the

diversityof thesolutionsis moreapparent.As before,solutionswith ( -h,I- ! $;: appearto

provideagoodbalancebetweenfinding thelargestnumberof low-costsolutionsandfinding

themostdiversesolutions.Wenotethatas ( increasesfrom ! $�: to $;:B: , thereis little change

17

in thenumberof treeswith fractionalcost )+:QKR: ! on thesynthetictype‘A1’ dataset(Fig. 6A),

andthatthemaximumtopologicaldistanceamongthosesolutionsincreasesvery little as,

increasesfrom ! $;: to $;:B: . Thus,for thissyntheticdata,although' - $�:B: retainsonly a tiny

fractionof up to #7:Ci = possible32-taxatreetopologies,thedatain Fig. 6A andFig. 10suggest

thatmostof thelowest-costsolutions,andmany of thetopologicallydiversesolutions,are

found.


3.3 Comparison with other methods

Thusfar, our resultssuggestthatGNJcanidentify alternative,near-optimalsolutionswhen 'rangesfrom $�: ( T -taxa)to ! :B: ( 8 ! -taxa).In thissection,wecompareGNJwith different

' - (dG , to two popularphylogenetictreereconstructionmethodsfor distancedata,the

Neighbor-Joiningmethod(SaitouandNei, 1987)andtheFitch-Margoliashalgorithm(Fitch

andMargoliash,1967)asimplementedin thePHYLIP package(Felsenstein,1993).As

before,weconsiderbothsyntheticandbiologicaldatasetswith differentnumbersof taxa,and

wecomparetwo costcriteria: theminimum-evolutioncriterionusedfor Neighbor-Joining

searches,andtheleast-squarescriterionusedby Fitch-Margoliash.In thesetests,weagain

considertwo measuresof success:quality (cost)anddiversity. Weevaluatethequalityof the

solutionsin two ways:(a) thefractionof thetime(for the30 testdatasets)thatanear-optimal

solutionis found;and(b) theaveragecostof thebestsolutionsfound.To evaluatediversity,

for eachdistancematrix,wefirst computethemaximumtopologicaldistancebetweenpairs

of near-optimalGNJsolutions.Diversityis thenmeasuredby computing(a) themaximum,as

18

well as(b) theaverageof thesedistances,over 8B: datasets.


For T -taxatype‘A1’ data,GNJfindssolutionsof veryhighquality thatareasdiverseasthe

exhaustivesearchwhen 'jbH$ and (*b ! (Fig. 7). When (*b ! , asolutionwithin acost

rangeof :LKM:Q# of optimalis found100%of thetime. For ( - : , GNJfindsa solutionwithin

:QKR:Q# of optimallessthan20%of thetimewhen,.- $ , andlessthan40%of thetimewhen

, bd$ . On thesamedatasets,Neighbor-Joiningfindsa kH:QKM:L# minimum-evolutionsolutution

andFitch-Margoliashfindsa kH:QKR:Q# least-squaressolutionmorethan95%of thetime. The

averagecostdatain Fig. 7A shows thatthebestsolutionsfoundby Neighbor-Joiningand

Fitch-Margoliasharetypically within :QKR:Q# of thecostrange,but thosefoundby GNJ

( 'je ! : and (*e ! ) areoptimal.Thus,GNJconsistentlyfindssolutionswith costlower than

eitherNeighbor-Joiningor Fitch-Margoliash.Moreover, comparisonof boththelargest

maximumtopologicaldistanceandtheaveragemaximumtopologicaldistance(Fig. 7B)

shows thatwhentheoptimalsolutionwasfoundby GNJ,thediversityof solutionsfound

(with costskl:QKM:L# of optimal)is aslargefor theGNJsolutionsetasfor thosefoundby the

exhaustivesearch.GNJperformedaswell astheexhaustiveseachon themuchmore

challengingtype‘A2’ dataaswell (notshown).


Resultsfor #76 -taxaareshown in Figs.8 and9. On thesynthetictype‘A1’ dataset,GNJfound

asolutionwithin :QKR:Q# of thebestcost100%of thetime when (*bd: . For thisdata,

Neighbor-Joiningfounda kH:QKR:Q# costsolutiononly 80%of thetimeusingthe

minimum-evolutioncriterion,while Fitch-Margoliashalwaysfounda kd:QKR:Q# costsolution.

19

Onceagain,GNJfoundsolutionswith loweraveragecost.For type‘A2’ data(notshown),

Neighbor-JoiningandFitch-Margoliashfound kd:QKR:Q# solutionsonly 8B: – $B$ % of thetime for

theleast-squarescriterion,and ! $ – 8�V % of thetime for theminimum-evolutioncriterion

while GNJfoundthebestminimum-evolutionsolution100%of thetimewhen (*ea$ . GNJ

foundthebestleast-squaressolutionmorethan80%of thetimeon type‘A2’ datawith

(*ea$ . As ' increasedfrom ! : to ! :;: , thecostof thebestsolutionsconsistentlyimproved

with GNJ.While wecannotcomparetheGNJdiversityto thediversitythatwouldbefound

by anexhaustivesearch,increasing' from ! : to ! :B: improvestheaveragemaximum

diversity, andasbefore,( -h,seemsto provide low-costsolutionswith highdiversity.


When #76 -taxabiologicaldataareexamined,theNeighbor-JoiningandFitch-Margoliash

algorithmsperformquitewell (Fig. 9). However, evenwith thisdata,theaveragecostof the

bestsolutionsfoundimprovesfrom about #7:mi to #7:m[n whenGNJis usedand (*e ! $ .

Fig. 10goesnearhere.

Neighbor-Joining,Fitch-Margoliash,andGNJall performwell on32-taxatype‘A1’ (Fig. 10)

andbiologicaldata(not shown) usingacostthresholdof :QKM: ! or :QKR:�$ (notshown). However,

it is surprisinghow diversetheGNJsolutionsarewhensolutionswith costswithin 2%of the

bestcostareincluded;GNJfoundalternatelow costsolutionsthatsharefewer thanhalf of

theinternaledges(two treesshareaninternaledgeif theedgeinducesthesameleaf

bi-partitionsin bothtrees).

Comparisonof thecostandthediversityof GNJsolutionswith ' - ! :B: and ' - $;:B:suggeststhat, ' - $;:B: , which increasestherun time ! KN$ -fold, is probablyunnecessary, since

20

neitherthequalityof thesolutionsnor thediversityincreasessignificantlywith thehigher ' .

Again,using ( -a,providesa goodbalanceof qualityanddiversity.

3.4 Post-processing

Rzhetsky andNei haveobservedthatfor smalldatasets,NJsolutionsarelikely to be

topologicallycloseto theoptimalsolution(Rzhetsky andNei, 1992).Weexaminedhow

post-processing(describedin Section2.4)affectsthenumberanddiversityof thelow-cost

solutions,andhow post-processingmight improveNeighbor-Joining,Fitch-Margoliash,and

GNJ-basedinitial solutions.Thepost-processingalgorithmexaminesall thetreesthatcanbe

formedby swapping(exchanging)subtreesaroundeachof theinternaledgesin thetree,thus

consideringall thealternative treesthatarewithin onepartitiondistancefrom theinitial tree.

If a topologyis foundwith a lowercost(least-squaresor minimum-evolution),theprocessis

repeated,until no topologicalneighboris foundwith a lowercost.If theGNJalgorithmfinds

alternatesolutionsthatareondifferentsidesof asingleshallow costbasin,post-processing

shouldreducethenumberanddiversityof low-costalternatetrees.Thisseemsto bethecase

for thebiological‘R1’ data(Fig. 11A) andthesynthetictype‘A1’ data(similar to the

biological‘R1’ data,notshown). Alternatively, if GNJactuallyfindsdistinctlocalminima

(with respectto cost),thenumberof treesmaydecreasedramaticallybut thetopological

distancebetweenalternatesolutionsshouldremainsubstantial.Multiple distinctlocalminima

arefoundwith thesynthetictype‘A2’ data.


Theresultsof post-processingon the #96 -taxadatasetssuggestthatGNJis capableof

21

identifyingalternate,topologically-distinctlocalminimawhenthey exist (Fig. 11). As

expected,thenumberof distinctsolutionsdropsdramatically(becauseof convergence)when

theGNJsolutionsarepost-processed.For thebiological‘R1’ data(Fig. 11A,C),thedropis

morethan30-fold,asit is with thesynthetictype‘A1’ data(notshown). However, for the

synthetictype‘A2’ data,which is derivedfrom treesin which thecostfor theoriginal treeis

frequentlymid-waybetweenthebestandworstcosts,thedropis only 2–3-foldandthe

averagemaximumtopologicaldistancesdropsonly about! : %. Thusfor thisverydifficult

dataset,many of thealternatesolutionsfoundby GNJcannotbereachedby local

branch-swappingfrom thebestsolution,anddistinctlocalminimahavebeenfound.Fig. 12

comparestheperformanceof Neighbor-Joining,Fitch-Margoliash,andGNJ,eachfollowed

by post-processing,on #96 -taxatype‘A2’ data.Post-processingimprovestheperformanceof

Neighbor-JoiningandFitch-Margoliashin findinga solutionwith cost kH:QKR:Q# from about

8B: –$;: % successto $�: –V;: % success;GNJis #7:B: % successfulwith everycombinationof ' ,

( , and,

. Again,GNJfindslower-costsolutionsthatNeighbor-JoiningandFitch-Margoliash

fail to find, evenafterexhaustivepost-processing.For thesedata,Neighbor-Joiningand

Fitch-Margoliashappearto sometimesfind localminima(with respectto cost),while GNJ

findsmoreglobalminima.


Theaveragemaximumtopologicaldiversityon thedifficult type‘A2’ datadecreasesonly

slightly with post-processingandthemaximumtopologicaldiversityis ashighafter

post-processingasbefore.This result—topologicallydiversesolutionsdespiteadramatic

decreasein thenumberof low-costsolutions—impliesthatGNJhasfoundalternatelocal

minimathatcannotbereachedby localbranchswappingfrom thelowest-costsolution.Since

22

ourpost-processingstrategy beginsfrom the ' solutionsfoundby conventionalGNJ,GNJ

withoutpost-processingcandetecttopologicallydistinctalternative localminima.For the

synthetictype‘A2’ datasets,low-costsolutionsthataretopologicallydistinctlocalminima

appearoften.For example,for ( - $;: and,I- $;: , theleast-squaressolutionsdiffer by 11

(outof amaximum #78 possible)branchswaps(partitions)arefoundin at leastonedataset,

anddiffer by ! KRU swapsonaverage(minimum-evolutionsolutionsdiffer by asmuchas9

partitions,and4.3onaverage).Thissuggeststhattopologicallydistinctsolutionshavebeen

foundby GNJon ! $ –$;: % of the 8B: syntheticdatasets.

Theresultswith thebiologicalandsynthetictype‘A1’ datasetscontraststarklyto the

diversityfoundwith thesynthetictype‘A2’ data.With theleast-squarescriterionand

post-processing,theaveragemaximumtopologicaldiversityis about:QKN$ andthemaximum

diversityis A , implying thatdistinctsolutionsarefoundin only about10%of thedatasets.

With theminimum-evolutioncriterion,themaximumdiversityis thesamebut theaverage

maximumis #BKo# ; again,topologicallydiversesolutionsmaybefoundfor 25%of these

#76 -taxadata.For thesynthetictype‘A1’ data,theaveragediversitydropsfrom aboutA to #;K#(least-squares)or from V to ! (minimum-evolution).

Althoughpost-processingcanimprovethequalityof GNJsolutionswithoutsignificantly

reducingtheirdiversity, thetimerequiredto post-process' alternativesolutionscanbe

prohibitivewhenthenumberof taxa(andthusthenumberof branchswapsthatmustbe

tested)is large( bP#76 ). However, comparisonof Fig. 12A andthenon-post-processeddata

(not shown) suggeststhatpost-processingdoesnot improvethesolutionqualitysignificantly

when ( and, ea$;: , andthustheextracomputationis unnecessary.

23

3.5 Run Time

GNJusescomputationtimeroughlyproportionalto thenumberof partialsolutions

maintainedduringexecution( ' ), andcubicin thenumberof taxaanalyzed.Averagerun

timesof Neighbor-Joining,Fitch-MargoliashandGNJfor variousinputsizesareshown in

Table1. GNJis considerablyslower thanNeighbor-Joining(which is oneof thefastesttree

constructionalgorithmsavailable,becauseit doesnotevaluateany alternative trees),and 8( ' - ! :;: ) to T -fold ( ' - $�:B: ) slower thanFitch-Margoliashfor the32-taxadatasets.

During its execution,GNJkeepstrackof ' partialsolutions.At eachiteration,asthenext

pairof taxais removedfrom the“star” tree,GNJexploresall thecandidatesolutionsderivable

from thecurrent' partialsolutionsvia asingleNeighbor-Joiningstep.Sinceeachpartialtree

inducespq�X � � candidatetreesby groupingoneof the pr�s � � possiblenodepairsin thetree,

thecostof all theresultingcandidatetreesrequirespq�X � � evaluationtime. Therefore,each

GNJiterationrequirespr�X'tE7 � � time to examinethecostof all pr�@'uE7 � � candidatetrees.

At eachiteration,GNJmustalsoselect' candidatetreesto passon to thenext iteration.In

thisversion,theselectionprocessrequiresall pq�@'uE7 � � candidatetreesto besortedby cost.

Currently, thetimerequiredby eachiterationof GNJis dominatedby thesortingtimewhich

is pr�X'uE7 � E7vwBxy�X'zE7 � �{� - pq�@'uE7 � E�XvwBx5'|G}v~wBxD �{� . Weanticipatethattheamountof

datato besortedcanbereduced,andthatin futureversions,theGNJcostcalculationwill

dominatetherun time. SinceGNJhasa totalof � 8 iterations,theoverall run time for GNJ

is pr�X'uE7 ? E�sv~wBxD'.G"vwBxD �J� .

24

4 Discussion

TheGeneralizedNeighbor-Joiningalgorithmis explicitly designedto explorebroadly

phylogenetictreesolutionspacesandseeklow-costsolutionsthataretopologicallydistant.

To achieve thisgoal,GNJmaintainsmultiplepartialsolutionsateachiteration,and

incorporatesbothquality (treecost)anddiversity(topologicaldistance)in selectingthesetof

partialsolutionsthatwill bepassedon to thenext iteration.Thesolutionspacesamplingis

controlledby theparameters' , ( , and,

, whichspecifythenumberof partialsolutionsto

retainandthebalancebetweenquality ( ( ) anddiversity(,

) in selectingalternatesolutions.

For datasetsof smallsize(i.e. )�#7: -taxa),GNJcanperformbetterthanNeighbor-Joiningand

Fitch-Margoliashalgorithmsby maintaining' - ! : –$;: partialsolutions.For example,for

biologicaldatasetsover U -taxa,all T treesof costcloseto theNJcost(undertheleast-squares

criterion)areobtainedby GNJwhile maintainingonly ' - $;: partialsolutions.For

syntheticdatasetsof T -taxa,GNJfindstheoptimalsolutionwhenever 'je ! : and (*e ! .

For datasetswith #96 or 8 ! -taxa,boththetopologicaldiversityandthequalityof GNJ

solutionsimprovesas ' increases.For thedatasetsthatweexamined,low-costsolutions

wereefficiently foundwith '�e ! : for T -taxaand '�eh$;: for #96 -taxa.Increasing' did

little to improveeitherthequalityor thediversityof thesolutions.8 ! -taxaproblemsarefar

morechallenging(andmorecommonin themolecularbiology literature).While '�e ! :B:( 8 ! -taxa)waseffective in finding low-costsolutions,increasing' improvedthequality, and

to a lesserextentthediversity, of the 8 ! -taxasolutions.

Webelieve thatthepost-processingresultsshow thatGNJis capableof identifying low-cost,

25

topologically-distinctsolutionsthatcannotbefoundsimplyby successively examiningevery

topologynearto individual low-costtrees.“Falling into localminima” is aninherentflaw of

any phylogeneticsearchmethodthatexaminesonly asmallportionof thesolutionspace.The

post-processingresultsfor thebiological‘R1’ datasuggestthatthisdataprobablyhassingle,

verybroadlocalminimumwith many differentlow-costtopologiesbut very few, if any,

alternatesolutionsthatcannotbefoundby post-processing(localbranchswapping).In

contrast,thesynthetictype‘A2’ datadoesappearto haveseveraldistinctlocalminima,which

werefoundby GNJ.While it is reasuringto learnthatGNJis capableof findingalternative

localminimawhenthey exist, moreextensivesimulationswill berequiredto characterizethe

conditionsunderwhich largenumbersof distinctlocalminimaoccur. While post-processing

maynotbenecessaryto find highqualitysolutions,thedecreasein diversitywith

post-processingshouldimproveourconfidencethatadatasetdoesnothavemany

topologicallydistinctlow-costsolutions.

Our resultssuggestthatGNJperformsbestwhen ( -�,I-.�� , andthat ' - ! :B: providesan

excellentbalancebetweencomputationtimeandsolutionquality/diversityfor up to 8 ! -taxa.

For morethan $;: -taxa, ' - $;:B: or #7:;:B: mayprovidebettersolutions;however, thiswill

dependgreatlyon thestructureof thephylogenetictreesolutionspace.For largenumbersof

taxa,onecanjudgewhethera larger ' is likely to providenovel solutionsby performing

searcheswith #7:B: and ! :B: . If ' - ! :;: doesnotfind any low-costtreesthatweremissedwith

' - #7:B: , it is unlikely that ' - $;:B: (or more)will uncoveradditionalnovel treeseither.

GNJis considerablyslower thantraditionalNeighbor-Joining,andfor largeproblems

( eh8 ! -taxaand '�e ! :B: ), it is muchslower thanFitch-Margoliashaswell. However, aGNJ

run is moreaccuratelycomparedto multipleFitch-Margoliashsearcheswherethetaxaare

26

successively addedin differentorder, aprocessthatcaneasilyincreasetheamountof time

requiredby ! : to $;: -fold. BecauseGNJexplicitly seeksout topologicallydiversesolutions,

webelieve thatit is morelikely to identify distinctalternativesthanadditional

Fitch-Margoliashtrials.

Thispaperconsidersthegeneralizationof theneighbor-joining partitioningstrategy to which

thedistancecostmeasuresseemideallysuited.However, themethodof retainingmany

partialsolutionsduringa partitioningstrategy canbeappliedto maximumparsimony

methods,andperhapsto maximumlikelihood-basedapproachesaswell. Wearecurrently

developingabroadergeneralizationof theapproachthatcanbeappliedto character-based,

ratherthandistance-based,costcriteria.

Acknowledgments

Wewish to thankDr. DouglasTaylor for hiscommentsonourdraft. This researchwas

supportedby agrantfrom theNationalLibrary of Medicine(LM04961).GabrielRobins

receivedadditionalsupportfrom anNSFYoungInvestigatorAwardandaPackard

FoundationFellowship.

27

References

Alpert, C. andKahng,A. B. 1995.Recentdirectionsin netlistpartitioning:asurvey.

Integration: TheVLSIJournal, 19, 1–81.

Bandelt,H. J.andDress,A. 1992.Split decomposition:anew andusefulapproachto

phylogeneticanalysisof distancedata.Mol. PhylogenetEvol.1, 242–252.

Day, W. 1987.Computationalcomplexity of inferringphylogeniesfrom dissimilarity

matrices.Bull. Math.Biol. 49(4), 461–467.

Dayhoff, M. O. 1978.Supplement3. Atlasof ProteinSequenceandStructure.Natl. Biomed.

Res.Found.,Washington,D.C.

Felsenstein,J.1982.Numericalmethodsfor inferringevolutionarytrees.Quar. Rev. Biol. 57

(1), 379–404.

Felsenstein,J.1988.Phylogeniesfrom molecularsequences:inferenceandreliability. Annu.

Rev. Genet.22, 521–565.

Felsenstein,J.1993.PHYLIP:PhylogenyInferencePackage, version3.5c. Universityof

Washington.

Fitch,W. M. andMargoliash,E. 1967.Constructionof phylogenetictrees.Science, 155,

279–284.

Foulds,L. R. andGraham,R. L. 1982.Thesteinerproblemin phylogeny is NP-complete.

AdvancesAppl.Math.3, 43–49.

Huelsenbeck,J.P. 1995.Theperformanceof phylogeneticmethodsin thefour-taxoncase.

Syst.Biol. 44, 17–48.

28

Kumar, S.1996.A stepwisealgorithmfor findingminimumevolutionarytrees.Mol. Biol.

Evol.13, 584-583.

Kimura,M. 1980.A simplemethodfor estimatingevolutionaryratesof basesubstitutions

throughcomparativestudiesof nucleotidesequences.J. Mol. Evol.16, 111–120.

Kuhner, M. K. andFelsenstein,J.1994.A simulationcomparisonof phylogeny algorithms

underequalandunequalevolutionaryrates.Mol. Biol. Evol.11, 459–468.

Leitner, T., Escanilla,D., Franzen,C.,Uhlen,M. andAlbert, J.1996.Accuratereconstruction

of aknown HIV-1 transmissionhistoryby phylogenetictreeanalysis.Proc.Natl. Acad.

Sci.USA, 93, 10864–10869.

Maddison,D. R. 1991.African origin of humanmitochondriaDNA reexamined.Syst.Zool.

40, 355–363.

Penny, D. andHendy, M. D. 1985.Theuseof treecomparisonmetrics.Syst.Zool.34, 75–82.

Penny, D., Stee,M. A., Waddell,P. J.andHendy, M. D. 1995.ImprovedAnalysesof Human

mtDNA SequencesSupportaRecentAfrican Origin for Homosapiens.Mol. Biol. Evol.

12, 863–882.

Robinson,D. F. andFoulds,L. R. 1981.Comparisonof phylogenetictrees.Math.Biosci.53,

131–147.

Rzhetsky, A. andNei, M. 1992.A simplemethodfor estimatingandtesting

minimum-evolutiontrees.Mol. Biol. Evol.9, 945–967.

Saitou,N. andImanishi,T. 1989.Relativeefficienciesof theFitch-Margoliash,

maximum-parsimony, maximum-likelihood,minimum-evolution,andneighbor-joining

methodsof phylogenetictreeconstructionin obtainingthecorrecttree.Mol. Biol. Evol.

6, 514–525.

29

Saitou,N. andNei, M. 1987.Theneighbor-joining method:anew methodfor reconstructing

phylogenetictrees.Mol. Biol. Evol.4 (4), 406–425.

Steel,M. A. andHendy, M. D. 1993.Distributionsof treecomparisonmetrics– somenew

results.Syst.Biol. 42, 126–141.

Studier, J.andKeppler, K. 1988.A noteon theneighbor-joining algorithmof SaitouandNei.

Mol. Biol. Evol.5, 729–731.

Swofford, D. 1996.PAUP: PhylogeneticAnalysisUsingParsimony(andOtherMethods),

version4.0(testversion). SinauerAssociates,Inc.,Sunderland,MA.

Swofford, D. L., Olsen,G. J.,Waddell,P. J.,andHillis, D. M. 1996.MolecularSystematics,

D.M. Hillis, C. Moritz andB.K. Mable, eds.PhylogeneticInference,pp.407–514.

SinauerAssociates,Inc.,Sunderland,MA.

Wilson,A. C. E.,Zimmer, E. A., Prager, E. M. andKocher, T. D. 1989.TheHierarchyof

Life, RestrictionMappingin theMolecularSystematicsof Mammals:A Retrospective

Salute,pp.407–419.Elsevier Press,Amsterdam.

30

Table1:

Execution times

Method 8 taxa 16 taxa 32 taxa

NJ kH:QKR:Q# 0.01 0.09

FM 0.08 2.4 31.5

GNJ ' - ! : 0.08 0.8 9.8

' - $;: 0.2 2.1 25.1

' - #7:B: 0.5 4.4 52.1

' - ! :B: 1.1 8.8 103.7

' - $;:B: 3.1 24.2 262.7

Run time (in CPU secondson a #76BV -Mhz UltraSparc)for Neighbor-Joining (NJ), Fitch-

Margoliash(FM),andGNJ,averagedover #7:B: type ‘A’ input datasetsof varioussizes,using

theleast-squarescriterionand ( -h,.

31

Figure Legends

Fig. 1 TheNeighbor-Joiningheuristic(A) An inputmatrixof size $ . (B) A Neighbor-Joining

solutionstrategy. First, thestartreetopologycenteredatnode 6 is formed;next, theclosest

neighborpair �#;f ![� is “joined” into adistinctinternalnode V ; finally, thisnew internalnode

V , togetherwith oneof theleafnodes8 , arejoinedinto anew internalnodeT to form & � . (C)

An equally-goodsolution & � but with averydifferenttopology, whichwasnot foundby NJ.

Fig. 2 TheGeneralizedNeighbor-JoiningMethodTheGNJheuristicfor thedataof Figure1A

is shown. Throughoutthesearch,8 partialsolutionsarekept( ' - 8 ). At eachiteration,all

possibleneighboringtaxapairsareexamined;8 areselectedto passto thenext iteration.The

dashedlinesrepresenttheneighboringpairsthatareeliminatedduringthecurrentiteration.

While NJ follows thepathon theleft, andthusonly findsthesingletree & � , theGNJscheme

recoversbothequally-goodsolutions& � and & � (seeFigure1).

Fig. 3 Distributionof treecostsanddiversity Thenumberof trees(left ordinate,� , � ) and

maximumtopological(partition)distanceaveragedover30datasets(right ordinate,� , � ) are

plottedasa functionof thefractionalcostrange.Distributionsfor theleast-squares(LS, � ,

� ) andtheminimum-evolution(ME, � , � ) criteriaareshown. Distributionsweredetermined

for 8 taxafrom 30synthetictype‘A1’ datasets,30synthetictype‘B1’ datasets,and30

datasetsfrom biologicaldataset‘R1’. Thefiguresshow theresultsdeterminedafteran

exhaustivesearchof all 10,395treetopologiesfor 8 taxa.

Fig. 4 GNJsolutions—8taxaThedistributionof solutionsfoundby GNJon30 type‘A1’

synthetic8 taxadatasets(A, B) or 30 ‘R1’ biological8 taxadatasets(C, D) areshown.

Searchesweredonewith ' - (dG ,I- $;: . PanelsA, C show theaveragenumberof

differenttreeswith costswithin thefractionalleast-squarescostshown. PanelsB, D show the

averageof themaximumtopologicaldistanceof thesolutionswithin thefractionalcostrange.

32

Fig. 5 GNJsolutions—16taxaThedistributionof solutionsfoundby GNJon30 type‘A1’

synthetic#76 -taxadatasets(A, B) or 30 ‘R1’ biological #76 -taxadatasets(C, D) areshown.

Searchesweredonewith ' - (dG ,I- #7:B: . PanelsA, C show theaveragenumberof

differenttreeswith costswithin thefractionalminimum-evolutioncostshown. PanelsB, D

show theaverageof themaximumtopologicaldistanceof thesolutionswithin thefractional

costrange.

Fig. 6 GNJsolutions—32taxaThedistributionsof thenumberof trees(A) andthemaximum

topologicaldistanceaveragedover30datasets,(B) areshown for theleast-squarescost

criterionon32 taxasynthetictype‘A1’ data.Searchesweredonewith ' - (dG ,|- $;:B: .Panels(C) and(D) show treenumbersandtopologicaldistancefor thebiological‘R1’ data.

Fig. 7 GNJperformance—8taxaThequalityanddiversityof GNJsolutionswith different

valuesof ( and,

arecomparedto theoptimalsolutionset(opt), andNeighbor-JoiningNJ

andFitch-MargoliashFM solutionsfor synthetictype‘A1’ data.(A) Thefractionof thetime

asolutionwasfoundwith acost kd:QKR:Q# of optimal(squares,left axis)usingeitherthe

least-squares(filled symbols)or theminimum-evolution(opensymbols)criterion.Theright

axis(circles)reportsthecostof thebestsolutionfound,averagedover the30datasets.(B)

Thediversityof the ' - (dG , solutionswith cost kl:QKR:Q# is shown asthelargest

topologicaldiversityfoundamongall 8B: datasets(squares)andthemaximumtopological

diversityaveragedover the 8B: datasets.Closedsymbolsreportdiversityfor least-squares

solutions;opensymbolsreportminimum-evolutiondiversity.

Fig. 8 GNJperformance—16taxaResultsfor 16 taxatype‘A1’ syntheticdataareplottedas

in Fig. 7, exceptthatboththefractionof solutionsfoundandthetopologicaldistanceplot

usesacostthresholdof :QKR:Q# .

33

Fig. 9 GNJperformance—biological dataResultsfor 16 taxafrom biologicaldataset‘R1’ are

plottedasin Fig. 8.

Fig. 10GNJperformance—32taxaResultsfor 32 taxatype‘A1’ syntheticdataareplottedas

in Fig. 8, exceptthatboththefractionof solutionsfoundandthetopologicaldistanceplot

usesacostthresholdof :QKR: ! .

Fig. 11Numberanddiversityof post-processedsolutionsResultsfor #96 -taxabiologicaldata

(B, D) and #76 -taxatype‘A2’ data(C, D) areplottedasin Fig. 5. Resultsshown arefor

' - #7:B: , ( -h,.- $;: , with eithernon-post-processed(squares)or post-processed(circles)

solutions.Filled symbolsreportthedistributionof least-squaressolutionswith fractional

cost;opensymbolsplot minimum-evolutioncosts.

Fig. 12Post-processedNeighbor-Joining, Fitch-Margoliash,andGNJperformance

Post-processedresultsfor #96 -taxatype‘A2’ dataareplottedasin Fig. 8.

34

Figure1: TheNeighbor-JoiningHeuristic

S 1 S 2 S 3 S 4 S 5

S 1S 2S 3S 4S 5

1

5

2

4

3

T2

1

2

4

5

6

3

1

2

3

4

57 6

0 3 3 4 3

(a)

1

2 57 6

3 4

8

T1

(c)

(b)

3 0 3 3 43 3 0 3 34 3 3 0 33 4 3 3 0

35

Figure2: GeneralizedNeighbor-Joining

1 2

3

45

1

532

4 4

3

1

2 5

1

2 5

3

1

1

25

4

3

25

31

5

43

2 4

1

�~�� ~�� ~��

36

Figure3:

� � � � ��

� � � � � ��

��

� � � ��

��

� � � � ��

�

��

1x100�

1x101

1x102�

1x103�

1x104�

1x105�

0.001 0.01 0.1 1cost

� � ��

� ��

� � � ��

��

� ��

� ��

� �

� � ��

��

0

1

2

3

4

5

0.001 0.01 0.1 1

topo

logi

cal d

ista

nce

cost

A. type 'A' data B. type 'B' data C. biological data

� � � ��

��

� � � � ��

� ��

��

� � � ��

�

��

0.001 0.01 0.1 1cost�

number (LS)�number (ME)

�distance (LS)�distance (ME)

num

ber

of tr

ees

37

Figure4:

��

��

��

��

��

��

��

��

0.1

1

10

100

0.001 0.01 0.1 1

num

ber

of s

olut

ions

least-squares cost

��

��

��

��

��

��

��

��

��

��0

1

2

3

4

5

0.001 0.01 0.1 1

topo

logi

cal d

ista

nce

least-squares cost

��

��

��

��

��

��

0.1

1

10

100

0.001 0.01 0.1 1

num

ber

of s

olut

ions

least-squares cost

�Q=50/0�45/10�25/25�5/45�0/D=50�opt

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��0

1

2

3

4

5

0.001 0.01 0.1 1

topo

logi

cal d

ista

nce

least-squares cost

A. B.

C. D.

38

Figure5:

��

��

��

��

��

0.1

1

10

100

500

0.001 0.01 0.1 1

num

ber

of tr

ees

minimum evolution cost

�Q=100/0�90/10

�50/50�10/90�0/D=100

��

��

��

��

��0

5

10

15

0.001 0.01 0.1 1

topo

logi

cal d

ista

nce


��

��

��

��

��

��

��

��0.1

1

10

100

500

0.001 0.01 0.1 1

num

ber

of tr

ees


��

��

��

��

��

��

��

��

0

5

10

15

0.001 0.01 0.1 1

topo

logi

cal d

ista

nce


A. B.

C. D.

39

Figure6:

��

��

��

��

��

1

10

100

1000

0.001 0.01 0.1 1

num

ber

of tr

ees

least-squares cost

��

��

��

��

��

��

��

��

0

5

10

15

20

25

0.001 0.01 0.1 1

topo

logi

cal d

ista

nce

least-squares cost

�Q=500/0�450/50�250/250�50/450�0/D=500

A. B.

��

1

10

100

1000

0.001 0.01 0.1 1

num

ber

of tr

ees


��

��

��

��

��0

5

10

15

20

25

30

0.001 0.01 0.1 1

aver

age

topo

logi

cal d

ista

nce


C. D.

40

Figure7:

� � � � � � � �

�

� � � �

�

� � � �

�

� � � �

�

� � ��

�

� � � �

�

� � � �

�

� � � �

�

¡

¡¡

¡ ¡ ¡ ¡ ¡

¡

¡ ¡ ¡ ¡

¡

¡ ¡ ¡ ¡

¡

¡ ¡ ¡ ¡

¡

opt

NJFMQ=5

/0 4/1

3/2

2/3

1/4

0/D=5

Q=20/

018

/210

/102/

18

0/D=2

0

Q=50/

045

/525

/255/

45

0/D=5

0

Q=100

/0

90/1

050

/5010

/90

0/D=1

000.0

0.2

0.4

0.6

0.8

1.0

0.00001

0.0001

0.001

0.01

0.1

1

frac

tion

< 0

.01

aver

age

cost

�LS fract�ME fract

LS cost¡ME cost

�

� �

� � � � � � � � � � � � � � � � � � � � �

�

� �

� � � � �

�

� � � �

�

� � � � � � � � � �

¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡op

tNJ

FMQ=5

/0 4/1

3/2

2/3

1/4

0/D=5

Q=20/

018

/210

/102/

18

0/D=2

0

Q=50/

045

/525

/255/

45

0/D=5

0

Q=100

/0

90/1

050

/5010

/90

0/D=1

000.0

0.5

1.0

1.5

2.0

0.0

0.5

1.0

1.5

2.0

topo

logi

cal d

ista

nce

�LS max�ME max

LS ave¡ME ave

A. quality

B. diversity

41

Figure8:

� � � � � � � � � � � � � � � � � � � � � ��

��

��

��

�

¡ ¡

¡ ¡ ¡¡¡

¡ ¡ ¡ ¡

¡

¡ ¡ ¡ ¡

¡

¡ ¡ ¡ ¡

¡

NJFM

Q=20/

018

/210

/102/

18

0/D=2

0

Q=50/

045

/525

/255/

45

0/D=5

0

Q=100

/0

90/1

050

/5010

/90

0/D=1

00

Q=200

/0

180/

20

100/

100

20/1

80

0/D=2

000.0

0.2

0.4

0.6

0.8

1.0

0.00001

0.0001

0.001

0.01

0.1

1

frac

tion

< 0

.01

aver

age

cost


LS cost¡ME cost

� �

� � � ��

� � � � � �

� �

� � � � � � � � � � � � � � � � � � � �

¡ ¡¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡

NJFM

Q=20/

018

/210

/102/

18

0/D=2

0

Q=50/

045

/525

/255/

45

0/D=5

0

Q=100

/0

90/1

050

/5010

/90

0/D=1

00

Q=200

/0

180/

20

100/

100

20/1

80

0/D=2

000.0

2.0

4.0

6.0

8.0

10.0

0.0

2.0

4.0

6.0

8.0

10.0

topo

logi

cal d

ista

nce

�LS max ME max

�LS ave¡ME ave

A. quality

B. diversity

42

Figure9:

� � � � � � � � � � � � � � � � � � � � � ��

¡ ¡¡ ¡ ¡

¡¡

¡ ¡

¡

¡¡

¡

¡ ¡ ¡

¡

¡ ¡ ¡ ¡

¡

NJFM

Q=20/

018

/210

/102/

18

0/D=2

0

Q=50/

045

/525

/255/

45

0/D=5

0

Q=100

/0

90/1

050

/5010

/90

0/D=1

00

Q=200

/0

180/

20

100/

100

20/1

80

0/D=2

000.00

0.20

0.40

0.60

0.80

1.00

0.0000001

0.000001

0.00001

0.0001

0.001

0.01

frac

tion

<0.

01

aver

age

cost


LS cost¡ME cost

� �

� ��

� ��

� � � � ��

� �

� ��

¡ ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡

NJFM

Q=20/

018

/210

/102/

18

0/D=2

0

Q=50/

045

/525

/255/

45

0/D=5

0

Q=100

/0

90/1

050

/5010

/90

0/D=1

00

Q=200

/0

180/

20

100/

100

20/1

80

0/D=2

000.0

2.0

4.0

6.0

8.0

10.0

12.0

0.0

2.0

4.0

6.0

8.0

10.0

12.0

topo

logi

cal d

ista

nce

�LS max�ME max

LS ave¡ME ave

A. quality

B. diversity

43

Figure10:

� � � � � � � � � � � � � � � � � � � � � ��

¡ ¡¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡

¡ ¡ ¡ ¡ ¡

NJFM

Q=50/

045

/525

/255/

45

0/D=5

0

Q=100

/0

90/1

050

/5010

/90

0/D=1

00

Q=200

/0

180/

20

100/

100

20/1

80

0/D=2

00

Q=500

/0

450/

50

250/

250

50/4

50

0/D=5

000.00

0.25

0.50

0.75

1.00

0.0001

0.001

0.01

0.1

1

frac

tion

< 0

.02

aver

age

cost


LS cost¡ME cost

� �

� �

� ��

� ��

��

� � � � �

� �

� ��

� � � � � � � ��

¡ ¡¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡

NJFM

Q=50/

045

/525

/255/

45

0/D=5

0

Q=100

/0

90/1

050

/5010

/90

0/D=1

00

Q=200

/0

180/

20

100/

100

20/1

80

0/D=2

00

Q=500

/0

450/

50

250/

250

50/4

50

0/D=5

000

5

10

15

20

25

30

0

5

10

15

20

25

30

topo

logi

cal d

ista

nce

�LS max�ME max

LS ave¡ME ave

A. quality

B. diversity

44

Figure11:

��

��

��

��

��

��

0.1

1

10

100

0.001 0.01 0.1 1

num

ber

of tr

ees

cost

�LS NP�LS PP�ME NP�ME PP

��

��

��

��

��

��

��

0

5

10

15

0.001 0.01 0.1 1

aver

age

topo

logi

cal d

ista

nce

cost

��

��

��

� ��

0.1

1

10

100

0.001 0.01 0.1 1

num

ber

of tr

ees

cost

��

�� 0

5

10

15

0.001 0.01 0.1 1av

erag

e to

polo

gica

l dis

tanc

ecost

A. B.

C. D.

45

Figure12:

� �

� � � � ��

�

�

� � � � � � � � � � � � � � � � � � � �

¡¡

¡ ¡¡¡¡

¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡NJ

FM

Q=20/

018

/210

/102/

18

0/D=2

0

Q=50/

045

/525

/255/

45

0/D=5

0

Q=100

/0

90/1

050

/5010

/90

0/D=1

00

Q=200

/0

180/

20

100/

100

20/1

80

0/D=2

000.0

0.2

0.4

0.6

0.8

1.0

0.00001

0.0001

0.001

0.01

0.1

1

frac

tion

<0.

01

aver

age

cost

� �

� ��

� �

� �

� � �

�

� � ��

� � � �

� �

� � � � � � � � � � � � � � � � � � � �

¡ ¡

¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡

NJFM

Q=20/

018

/210

/102/

18

0/D=2

0

Q=50/

045

/525

/255/

45

0/D=5

0

Q=100

/0

90/1

050

/5010

/90

0/D=1

00

Q=200

/0

180/

20

100/

100

20/1

80

0/D=2

000.0

2.0

4.0

6.0

8.0

10.0

12.0

0.0

2.0

4.0

6.0

8.0

10.0

12.0

topo

logi

cal d

ista

nce

A. quality

B. diversity

46

Reliable Phylogenetic Tree - University of Virginia

Documents

Transcript of Reliable Phylogenetic Tree - University of Virginia