Reliable Phylogenetic Tree - University of Virginia
Transcript of Reliable Phylogenetic Tree - University of Virginia
GeneralizedNeighbor-Joining:
MoreReliablePhylogeneticTreeReconstruction�
William R. Pearson�, GabrielRobins,andTongtongZhang
Departmentof ComputerScience,Universityof Virginia
�Departmentof Biochemistry, Universityof Virginia
Charlottesville,VA 22908
Key Words: phylogeneticreconstruction,neighbor-joining, least-squares,
minimum-evolution,solutionspacesampling
�The corresponding author is Professor William Pearson, [email protected],
804-924-2818,FAX: 804-924-5069
1
Abstract
Wehavedevelopedaphylogenetictreereconstructionmethodthatdetectsand
reportsmultiple, topologicallydistant,low costsolutions.Ourmethodis a
generalizationof theNeighbor-Joining(NJ)methodof Nei andSaitou,and
affordsamorethoroughsamplingof thesolutionspaceby keepingtrackof
multiplepartialsolutionsduringits execution.Thescopeof thesolutionspace
samplingis controlledby apairof user-specifiedparameters- thetotalnumberof
alternatesolutionsandthenumberof alternatesolutionsthatarerandomly
selected- effectinga smoothtradeoff betweenrun timeandsolutionqualityand
diversity. Thismethodcandiscover topologicallydistinctlow costsolutions.In
testsonbiologicalandsyntheticdatasetsusingeithertheleast-squaresdistanceor
minimum-evolutioncriterion,themethodconsistentlyperformedaswell as,or
betterthan,eithertheNeighbor-Joiningheuristicor thePHYLIP implementation
of theFitch-Margoliashdistancemeasure.In addition,themethodidentified
alternative treetopologieswith costswithin 1 or 2%of thebest,but with
topologicaldistances9 or morepartitionsfrom thebestsolution(16 taxa);with
32 taxa,topologieswereobtained17 (least-squares)- 22 (minimum-evolution)
partitionsfrom thebesttopologywhen200partialsolutionswereretained.Thus,
themethodcanfind lowercosttreetopologiesandnear-besttreetopologiesthat
aresignificantlydifferentfrom thebesttopology.
2
1 Introduction
Reconstructionof ancestralrelationshipsfrom contemporarydatais widely usedto provide
bothevolutionaryandfunctionalinsightsinto biologicalsystems.Theexplosive increasein
availableDNA sequencedatahasincreasedinterestin phylogeneticanalysisof multi-gene
anddomain-swappedproteinfamilies.Threegeneralclassesof phylogeneticreconstruction
methodsarecommonlyusedfor analysisof sequencedatasets:parsimony methods(Swofford
etal., 1996),distancebasedmethods(FitchandMargoliash,1967),andmaximumlikelihood
methods(Felsenstein,1982;Felsenstein,1988).Parsimony- anddistance-basedmethodsare
mostoftenused,largelybecausethey arefastercomputationallyandallow a largernumberof
potentialphylogenetictreesto beevaluated.
Reconstructionof anevolutionaryhistoryfor asetof contemporarytaxabasedon their
pairwisedistanceis computationallyintractable(i.e.,NP-complete)for variousoptimality
criteria(FouldsandGraham,1982;Day, 1987),includingtheleast-squarescriterion1 andthe
minimum-evolutioncriterion2. Variousheuristicshavebeenproposedto searchfor solutions
of desiredquality (Felsenstein,1988;BandeltandDress,1992;Swofford etal., 1996),and
1Accordingto theleast-squarescriterion,thebestphylogeny for theinput distancematrix
is theonethatminimizes ��������� ������ ��������������� where����� is thedistancebetweentaxa � and �in theinput distancematrix,and � ��� is thesumof all thebranchlengthsalongtheuniquepath
connectingtaxa � and � in thepostulatedphylogeny.
2By the minimum-evolution criterion, thebestphylogeny for the input distancematrix is
onethatminimizesthesumof all edgelengths,whereedgelengthsareassignedto minimize
theleast-squaresdeviation.
3
themajorityof thesemethodsaregreedy, whichalwaysemploy movesthatAre “locally best”
andmaynot necessarilyleadto globaloptima(Swofford etal., 1996).Amongthegreedy
approaches,theNeighbor-Joiningmethod(SaitouandNei, 1987;StudierandKeppler, 1988)
is widely usedby molecularbiologistsdueto its efficiency andsimplicity.
Greedymethodsareefficientbecausethey exploreonly asmallportionof thesolutionspace3.
However, greedymethodscanfail to find thebestoverallsolutionif they become“trapped”in
localoptima.In addition,becauseonly asmallfractionof thesolutionspaceis examined,a
greedyheuristictypically will not report(or detect)alternativesolutionswith distinct
topologiesthatmayfit thedatanearlyaswell, or evenequallywell. Neglectingsuch
alternativesolutionscanproducemisleadinginferencesregardingtheevolutionaryhistory.
For instance,Wilsonetal. concludedthatall humansoriginatedfrom Africa, becausetheir
tree-building methodfailedto discoveralternative,near-optimal,treesthatwereconsistent
with adifferentgeographicalhistory(Wilsonetal., 1989;Maddison,1991).
To improvethereliability of phylogenetictreereconstructions,weproposeaschemewhich
samplesthesolutionspacemoreextensively by repeatedlyusingtheNeighbor-Joining
algorithm(SaitouandNei, 1987).Insteadof trackingonly asingle,locally-besttreeas
Neighbor-Joiningdoes,ourschememaintainsmultiplepartialsolutionsasit progresses.The
methodexploresall possibletreesderivablefrom thesetof currentpartialsolutionsin a
singleneighbor-joining step,andthenselectsasubsetof thesepartialsolutionsto passon to
thenext iteration.Thisapproachis competitivewith Neighbor-Joiningin recoveringdistinct
3A solution spaceis the set of all possiblephylogeniesspanningthe given taxa. Taxa
correspondto leavesin a treethatspansthem.
4
low-costtopologies,while still beingcomputationallyefficient.
2 Methods
2.1 The Neighbor-Joining Method
TheNeighbor-Joiningmethod(NJ)wasinitially proposedby SaitouandNei (1987),andlater
modifiedby StudierandKepler(1988).Neighbor-Joiningseeksto build a treewhich
minimizesthesumof all edgelengths,i.e., it adoptstheminimum-evolution(ME) criterion.
A numberof studieshavecorroboratedNJ’sperformancein reconstructingcorrect
evolutionarytrees(SaitouandImanishi,1989;KuhnerandFelsenstein,1994;Huelsenbeck,
1995).For smallnumbersof taxa,NJsolutionsarelikely to beidenticalto theoptimalME
tree(SaitouandImanishi,1989).
Neighbor-Joiningbeginswith astartree,theniteratively findstheclosestneighboringpair
(i.e. thepair thatinducesa treeof minimumsumof edgelengths)amongall possiblepairsof
nodes(bothinternalandexternal).Theclosestpair is thenclusteredinto anew internalnode,
andthedistancesof thisnodeto therestof thenodesin thetreearecomputedandusedin
lateriterations.Thealgorithmterminateswhen �"! internalnodeshavebeeninsertedinto
thetree(i.e.,whenthestartreeis fully resolvedinto abinarytree).TheNeighbor-Joining
heuristicis illustratedin Figure1B.
Fig. 1 goesnearhere.
AlthoughtheNeighbor-Joiningmethodrunsquickly, it returnsonly thesinglebestsolution
5
foundby its greedysearchstrategy. Thissolutioncanbefurtherimprovedwith
post-processingby rearrangingbranchesandswappingsubtrees(Rzhetsky andNei, 1992;
Swofford, 1996),but suchimprovedsolutionstendto remaintopologicallysimilar to the
originalstarting-pointsolutions.To increaseourconfidencein thesolutionreliability, it is
naturalto askif thereareothersolutions,with differenttopologies,thatareequally
well-supportedby thedistancematrixdata.
Solutionspacescanexhibit many alternatelocaloptima(Penny etal., 1995).For instance,
amongall #%$ possibletreesof $ taxa, ! of them( & � in Figures1Cand & � in Figure1B) fit the
inputmatrix (Figure1A) best.However, thesetwo treeshaveverydifferenttopologies;they
sharenocommoninternaledges.Indeed,accordingto thepartitiondistancemetric(see
Section2.4), & � and & � arethemostdissimilarpossible.
2.2 The Generalized Neighbor-Joining Method
OurGeneralizedNeighbor-Joining(GNJ)methodsamplesthesolutionspaceextendedlyby
keepingtrackof multiplepartialsolutionsasit progresses(thenumberof partialsolutions'is aninputparameter).Unlike theNeighbor-Joiningmethod,which followsonly asinglepath
towardsasolution,GNJperformsamorethoroughsearchof thesolutionspaceby tracking
andexploringmany potentially-goodpaths.Thatis, GNJretainspromisingpartialsolutions,
whichmaynotbelocally-optimal,but whichhave thepotentialfor substantiallygreatercost
savingsin subsequentsteps.An executionexampleof GNJon thematrixof Figure1A is
shown in Figure2.
Fig. 2 goesnearhere.
6
TheGeneralizedNeighbor-Joining(GNJ)algorithmcanselectin severalwaysthe ' partial
solutionsthatarepassedon to thenext iteration.A simplestrategy wouldsave the ' best
(i.e., leastcost)partialsolutions;alternatively, partialsolutionscanbechosenat random.
Selectingthebestsolutionstendsto improvesolutionquality, while selectingalternates
randomlytendsto increasesolutiondiversity. We implementahybridschemethatbalances
thesetwo extremes:thetop (*)+' least-costsolutionsareselected,alongwith anadditional
,.- ' � ( “topologicallydiverse”solutions4. If therearemorethan ( least-costpairsata
giveniteration,GNJwill select( of themarbitrarily, whichmakestheGNJmethod
non-deterministic.
To achieve topologicaldiversity, ateachiteration,afterselectingthebest( partialsolutions,
theremainingpartialsolutionsarepartitionedinto / groupsaccordingto their topological
distancesfrom abestpartialsolution(partialsolutionswithin thesamegroupareequidistant
from thebestpartialsolution).We thenobtainanadditional,
“topologicallydiverse”partial
solutionsfor thenext iterationby selectingthetop 021 354 solutionsfrom eachgroup5. Thus,at
thelaststeptowardssolvinga #76 -taxaproblem,alternatesolutionscanbeasmany as #98partitionsaway from thebestcurrentsolution.In thiscase,if
,.- $;: , at least <>=�@? - 8 best
solutionsat topologicaldistances# and ! aresaved,andat leastthe A bestsolutionsat
topologicaldistances8 through #78 aresaved.For 8 ! taxaand,.- #9:B: , at least 8 solutions
will besavedateachtopologicaldistance.Becausethemaximumtopologicaldistance
increaseslinearlywith thenumberof taxa,theabovestrategy ensuresthatthenumberof
4Theparameternames( and,
aremnemonicfor qualityanddiversity, respectively.
5If / doesnotdivide,
, weselectoneadditionalsolutionfrom eachof the, � �C0 1 3D4FE9/ �
groupscorrespondingto thetopologicaldistancesfarthestfrom thebestpartialsolution.
7
topologicallydistinctgoodsolutionscanremainrelatively constantby increasing,
linearly,
ratherthanexponentially, with thenumberof taxa.
A similar ideais employedin thestepwiseminimum-evolutiontreebuilding method(Kumar,
1996).At eachiteration,for agivenpartialtree,thismethodfirst identifiestheleadingnode
(i.e., thenodemostlikely to bejoinedto anothernode),andformsthesetof next-step
Neighbor-Joiningtreesby clusteringeachnodewith theleadingnode.Thisstrategy restricts
thesolutionspacesomewhat,but it requiresexponentialtime to run,whichmakesit practical
only for smalldatasets.Moreover, it doesnotexplicitly consideralternatesolutionsat
differenttopologicaldistances(seebelow), soit is lesslikely to identify topologicallydistinct
alternatives.
Differentcombinationsof ( and,
( ' - (HG , ) enableasmoothtradeoff betweenquality
vs. diversity. As ( increaseswith respectto,
(for afixed ' ), lower-costsolutionsare
favoredoveroneswith diversetopologies,while for smallervaluesof ( , thesolutionspace
explorationbecomesmorebroad,andtopologicallydifferentlocaloptimaaremorelikely to
emerge.We notethatif ' - ( - # (andthus,I- ' � ( - : ), thesinglesolutionreturned
by ourGeneralizedNeighbor-Joiningapproachis identicalto thesolutionproducedby the
originalNeighbor-Joiningmethod(SaitouandNei, 1987;StudierandKeppler, 1988).Here
only thebest-costpartialsolutionis passedto eachsubsequentiteration,whichis exactlywhat
is doneby Neighbor-Joining.Thus,GNJdirectlygeneralizestheNeighbor-Joiningmethod.
Additionalstrategiesfor expandingthesearchof phylogenetictree-spacemightbe
considered.TheGNJapproachcanbeabstractlydividedinto two phases:(1) a tree
generationcomponentwhichproducesmultiplepartialsolutions,and(2) apartialsolution
8
evaluationfunctionwhich favorscertainpreferredpartialsolutionsoverothers.Theoverall
run timeperiterationof thecombinedmethodis asymptoticallynogreaterthantheslowestof
thesetwo components.
Thealgorithmdescribedin Section2.2utilizestheNeighbor-Joiningmethodasthepartial
treegenerationmechanismin phase�J# � , while usingtheminimum-evolutioncriterion
(implicit in theNeighbor-Joiningmethod)in filtering candidatepartialsolutionsin phase� !� .However, anycombinationof existingalgorithmsor heuristicsfor treegenerationandtree
evaluationcanbeincorporatedinto thisgeneraltemplate.
For example,wecanevaluatepartialtreesateachstepusingtheleast-squaresdeviation
optimalitycriterion.An alternativeschemefor treegenerationmightallow arbitrarypartitions
at intermediatesteps(i.e., “join” anynumberof taxaratherthanexactly two). In thiscase,a
numberof existingefficientpartitioningheuristics(Alpert andKahng,1995)canbereadily
appliedto generatemorepromisinganddiversepartialsolutions.Likewise,themethodfor
selectingtopologicallydiversepartialsolutionsmightselectmoresolutionsfrom moredistant
topologies,ratherthanuniformly samplingthetopologicaldistancesasis donein this
implementation.
TheGNJprogramis written in the‘C’ programminglanguageandis availablefrom
ftp://ftp.virginia.edu/pub/fasta/GNJ. To maketheGNJresultsmoreusable
in practice,weoutputthetreesobtainedby GNJin acomputer-readableformatthatcanbe
readilyprocessedby otherprograms(e.g.,theconsense programin thePHYLIP package).
Moreover, wesummarizetheleafpartitionsfoundamongtheGNJsolutionsbelow a
thresholdcost,andrankthemby decreasingfrequencies.
9
2.3 Datasets
WetestedtheGNJheuristicin theUNIX environment.Two typesof distancematriceswere
usedto evaluatethealgorithm:
(1) Distancematriceswereconstructedfor nucleotidesequencesgeneratedby randomly
mutatingan“ancestral”sequencealongamodelevolutionarytreeusingthetreeDNA
program(Felsenstein,personalcommunication)with theKimura two-parametermodelfor
mutationrates(Kimura,1980).Threetypesof topologieswereusedfor themodeltrees:
topologiesof minimumdiameter(whichwereferto asType‘A’ ), topologiesof maximum
diameter(type‘B’ ), andamixtureof both(type‘C’ ). Here,thediameterof a topologyis
definedasthemaximumnumberof edgesconnectingany two leafnodeswithin thetopology.
Therefore,topologiesof type‘A’ aremost“branchy”(i.e., they resembleacompletebinary
tree),while topologiesof type‘B’ aremore“stringy”. type‘A’ treeswerethemost
challenging,andareusedfor mostof thefigures.
Divergenceratesrangingfrom :LKM:B:B$ (internalbranches)to :LKN$;: (leafor externalbranches)
wereusedto producethesyntheticdata.Two differenttype‘A’ andtype‘B’ datasetswere
examined.type‘A1’ and‘B1’ datasetsuseddivergenceratesOP:QKR: ! (32 taxa)– OP:QKR:�$(8-taxa)for internaledgesand OP:QKSA for externaledges(thus,theratioof externalto internal
branchratesvariedfrom #9: for T taxato 8�$ for 32 taxa).Type‘A2’ and‘B2’ treesusedrates
of OP:QKM:;:�$ for thecentral(internal)edgesand OP:QKM$;: for theexternal(leaf)edges
(external/internalratiosof 100).
(2) Severalbiologicaldatasetswereexamined,includingimmunologicaldatafrom U frog
species(SaitouandNei, 1987),datafrom #78 viral env V3 fragmentsandgag P17(Leitner
10
etal., 1996),and AV alignedTCP-1chaperonin6;: family members(JonSundBlandfort,
personalcommunication).For DNA sequences,thedistancematriceswerecomputedwith the
dnadist progamin thePHYLIP package(Felsenstein,1993),usingtheKimura2-parameter
model(Kimura,1980).For proteinsequences,thedistancematriceswerecomputedwith the
protdist programin thePHYLIP package(Felsenstein,1993),usingtheDayhoff PAM
matrixmodel(Dayhoff, 1978).Weobtain 8;: biologicaldatasetsof T , #76 , or 8 ! taxaby
randomlysamplingtheoriginaldatasets.Resultson thedifferentbiologicaldatasetswere
similar; only resultson thechaperonindistances(referredto asdataset‘R1’) arereported.
2.4 Algorithms Compared
Weevaluatedthedatasetsusingthreealgorithms:(1) NJ: TheNeighbor-Joiningmethod
(SaitouandNei, 1987;StudierandKeppler, 1988),asimplementedin thePHYLIP package
(Felsenstein,1993);(2) FM: TheFitch-Margoliashmethodfor fitting topologiesto distance
matriceswith respectto theleast-squarescriterion(FitchandMargoliash,1967),as
implementedin thePHYLIP package(Felsenstein,1993);and(3) GNJ: theGeneralized
Neighbor-Joiningmethod,describedin thispaper.
In addition,weexaminedeverypossibletreetopologyfor syntheticandbiologicaldataover8
taxa.Thisexhaustivemethodis guaranteedto returnaglobaloptimum(i.e. thelowest-cost
topology).Becauseof thesheersizeof thesolutionspace,theoptimalmethodis feasibleonly
for datasetscontainingfewer thantentaxa.
Thesolutionsfrom thedifferentalgorithmswereevaluatedusingeithertheleast-squaresor
theminimum-evolutioncriterion.Least-squarestreecostis computedby assigning
11
non-negativeedgelengthsin a way thatminimizestheleast-squaresdeviation.
Minimum-evolutiontreecostis computedasthesumof suchedge-lengthsin a tree.
To improvefurtherthesolutionquality, wealsoappliedapost-processingoptimizationstep
which rearrangessubtreesasfollows. Givena topology, wecomputethecostof all thetrees
resultingfrom swapping/exchangingsubtreesaroundeachof theinternaledgesof the
topology. Then,thelowest-costtreeis chosenasthenew currenttree,andits neighborhoodis
investigatedin turn. We iteratethisprocessuntil no furtherimprovementcanbeobtained.
Topologicaldistancesin thispaperarebasedon thepartitionmetric(RobinsonandFoulds,
1981;Penny andHendy, 1985;SteelandHendy, 1993),whichmeasuresthenumberof edges
commonto agivenpairof binarytrees.Eachinternaledgenaturallypartitionsthesetof leaf
nodesinto two subsets.Two treesspanningthesamesetof leaveshaveacommonedgeif
removing thisedgeinducesthesametwo subsetsof leafnodes.Thus,thepartitiondistance
betweenany two treesis definedasthenumberof edgesin onetreefor which thereis no
correspondingequivalentedgein theothertree.Sinceeachbinarytreeof leaveshas � 8internaledges,distancesunderthepartitionmetriccanberepresentedasintegersbetween:and � 8 .
3 Results
LikeNeighbor-Joining,GeneralizedNeighbor-Joining(GNJ)seeksto identify phylogenetic
treetopologiesandbranchlengthsthatbestfit distancedata.GNJimproveson
Neighbor-Joiningby identifyingnear-optimaltopologiesthataresignificantlydifferentfrom
12
thebestsolutionfoundin thesearch(therearetypically many near-optimalsolutionsthat
differ only slightly from thebestsolution;weseektopologically-distantalternatives).In the
resultsbelow, wefirst show thatthedatasetsthatweexaminecontaintopologicallydistinct,
low-costsolutions.We thendemonstratethattheGNJalgorithmcanfind theselow-cost
alternativesolutions,by examiningtwo measuresof success:(1) thenumberof alternative
treesfoundby GNJwith anear-optimalcost;and(2) themaximaltopological(partition)
distancebetweenthenear-optimalsolutionsandtheoptimalsolutionfound.6 In bothtests,we
seekthelargestnumberof solutionswith costnearestto optimal,but with topological
distancethatis far away.
3.1 Comparison of GNJ with exhaustive 8-taxa searches
To judgehow effectively theGNJapproachfindsalternative topologically-distinctsolutions,
wefirst characterizedtheactualnumberanddiversityof near-optimalsolutionsby
enumeratingall 10,395differenttreesfor datasetswith 8 taxaandcalculatedthecostfor each
treetopology(Fig. 3). Tree-costswereoptimizedusingeithertheminimum-evolution
criterionor theleast-squarescriterion.Becausethedifferentcostcriteriamayhavedifferent
distributionsof costs,weplot thenumberof treesobtainedasa functionof thefractionalcost
range: �XW Y � W Z �N�[��\ �XW Z^]JY � W Z �N�� , whereW Y is theleast-squaresor minimum-evolutioncostof
aspecifictreetopology, W2Z �N� is theminimum(andfor exhaustivesearches,optimal)cost
underthatcriterion,and W Z^]�Y is thecostof theworsttopology. For the8 taxadata,W2Z �M� and
6In the caseof morethan T taxa,wherean exhaustive searchfor the optimal solutionis
computationallyinfeasible,wecompareto thebestsolutionfoundinsteadof to optimal.
13
W2Z_]�Y areknown becausethecostof everypossibletopologyhasbeencalculated.7
For thesyntheticdata,it is possibleto askhow oftenthelow costtreesfoundby theGNJ
algorithmwereconsistentwith theoriginal treethatwasusedto producethedistancedata.
However, thelowestcostleast-squaresor minimum-evolutiontreewasoftendifferentfrom
theoriginal tree.Treesfrom type‘A1’ and‘B1’ dataareusedfor mostof thefiguresbecause
thedifferencebetweentheoriginal treecostandthebesttreecostwastypically between:and :LK# with themedianbetween:QKR:Q# and :LKM:B8 of thecostrange.Treesfrom type‘A1’ and
‘B1’ syntheticdatabehavedverysimilarly to treesfrom thebiologicaldatasets.For the‘A2’
and‘B2’ datasets,themedianoriginal treecostwas :QKSA – :LKNV of thecostrange.Thus,because
of thehighexternal/internalrateratio, thebesttreefrequentlyhadacostsubstantiallylower
thantheoriginal treeandthesedatasetshavea largenumberof distinctlocalminima,which
arenotseenwith thebiologicaldatasetsor with thetype‘A1’ and‘B1’ trees.
Fig. 3 goesnearhere.
Fig. 3 showshow thenumberof treesandthetopologicaldistancebetweenthealternative
solutionsincreasesover thefractionalcostrange.Theresultsfrom threedifferentdatasetsare
shown usingeithertheminimum-evolutionor theleast-squarescostcriterion.In theseplots,
morechallengingdatasetshavea largernumberof near-optimaltreesandgreatertopological
7For larger datasets,W2Z �N� is approximatedfrom the minimum cost obtainedfor all the
tree-searcheson the dataset,and W Z^]�Y is approximatedfrom the maximumcostobtainedby
sampling #9:B: treesrandomly. Thus for the 16 and 32 taxa datasets,W2Z �N� may not be the
optimalminimumcostand W2Z_]�Y maynotbethehighest(worst)cost,but theseapproximations
shoulddiffer only slightly from thetruevalues.
14
distanceat lower fractionalcost.In general,therearemorenear-optimaltreeswith the
least-squarescriterionthanwith theminimum-evolutioncriterionandthosetreestendto be
moretopologicallydistinct(Fig. 3). For example,with thebiologicaldata(Fig. 3C), there
were #�A`KM6 treesonaveragewith least-squarescostwithin :QKR:Q# of optimal,but only 2.6trees
)+:QKR: ! whentheminimum-evolutioncostis calculated.Furthermore,whenthecostis less
than :QKR:Q# , themaximumtopologicaldistancefor near-optimaltreesis greaterfor the
least-squarestreesthanfor theminimum-evolutiontrees.
The“branchy” type‘A1’ syntheticdatasettendsto producea largernumberof near-optimal,
topologicallydistanttreesthanthetype‘B1’ (Fig. 3) datasets.Whenthetype‘A2’, ‘B2’ and
‘C2’ datasetswereexamined(datanot shown), type‘A2’ datasetswerethemostchallenging,
and,for treeswith cost )a:QKM:L# , thenumberof treesandtopologicaldistancebetweenthetrees
wasabouttwiceashigh for type‘A2’ comparedto type‘A1’. Thebiologicaldatasetappears
morechallengingthanthetype‘A1’, ‘B1’ and‘B2’ syntheticdatasets,but lesschallenging
thanthetype‘A2’ dataset(Fig. 3 anddatanotshown). We focusourattentionon thenumber
anddiversityof treeswith cost-range:QKR:Q# –:QKR:�$ bothbecausethesecost-rangesareintuitively
close—between1%and5%of thebestcostfound—andbecause,for thetype‘A1’ and‘B1’
syntheticdata,:QKR:Q# –:QKR:�$ spanstherangeof costdifferencesbetweentheoriginal treesused
to generatethedistancedataandthebesttreesfoundfor thedata.
Ideally, theGNJalgorithmwouldfind eachof thenear-optimalsolutionsthatcanbefound
whenevery tree-topologyis examined.Thus,weusethenumberof solutions,theiraverage
cost,andtheirdiversityto gaugetheeffectivenessof GNJ(Fig. 4 – 6), andcompareGNJwith
anexhaustivesearch(Fig. 3). Weseekcombinationsof ‘Q’ and‘D’ thatapproachthe
distributionof solutionsseenin theexhaustivesearch.Fig. 4A shows thattheGNJalgorithm
15
effectively identifiesvirtually all sub-optimalsolutionswith costs)+:LKM:�$ on thesynthetic
dataset,aslongas (cbH: . (Results,notshown, usingtheminimum-evolutioncostcriterion
areindistinguishable)Only when ( - : , ,.- $;: is thenumberof near-optimalsolutions
differentfrom thenumberfoundin theexhaustivesearch;thatis, someof thelowest-cost
alternativesolutionsaremissed.Thecurvesin Fig. 4B, D reporttheaveragemaximum
topologicaldistance;i.e. themaximumtopologicaldistanceamongall thetreeswith costless
thantheordinateis determinedfor eachof the30datasets,andthe 8B: maximumdistancesare
averaged.Again,when (*bH: thealternatesolutionsfoundby theGNJalgorithmareas
diverseasthosefoundby theexhaustivesearch,for costswithin #7: % of optimal. (Wealso
examinedthemaximumtopologicaldistancesfor thedatain Fig. 4, andfoundthatthey were
verysimilar to theexhaustivesearchif (*bd: , datanotshown.)
Fig. 4 goesnearhere.
Thebiological‘R1’ datasetis morechallengingin someways—thereis a largernumberof
alternatesolutionswith low cost(Fig. 4C)andthelow-costsolutionsappearmore
topologicallydiverse(Fig. 4D). For thebiologicaldataset,GNJbeginsto misssolutionswith
costsbH:QKM:;:�$ thatarefoundby theexhaustivesearch.At a fractionalcostof :LKM:Q# , #B# of #%$solutionsarefoundby GNJwith (*e ! $ , and #9T outof 8Q# arefoundat fractionalcost :QKR: ! .As with thesyntheticdataset(Fig. 4B), when ( - : , someof thebestnear-optimalsolutions
aremissed.Theresultsin Fig. 4 suggestthatfor small(8 taxa)problems,theGNJalgorithm
identifiesalternate,near-optimal,topologically-distantsolutionsveryeffectively.
16
3.2 GNJ performance with 16 and 32 taxa
For largerdatasets,it is not computationallyfeasibleto examinethesolutionspace
exhaustively, sowecannotdirectlycomparetheGNJresultsto theoptimalsolution.
(Likewise,wecannotguaranteethatthelowest-costsolutionis optimal,but it is likely to be
nearoptimal.)Nonetheless,wecanstill evaluatehow theGNJalgorithmbenefitsfrom saving
multiple ' - (HG , solutionsby examininga rangeof (gf , pairs(Figs.5–6).Whentype
‘A1’ syntheticdatasetswith 16 taxaaresearched,thelargestnumberof low-costsolutionsare
againfoundwhen (*bd: , andthemosttopologicallydiversesolutionsarefoundwhen
, bH: . (Fig. 5 showstheresultsusingtheminimum-evolutioncostcriterion;resultsusingthe
least-squarescriterion,notshown, aresimilar.) For thesedatasetswith ' - #7:B: , thetrade-off
betweenquality ( anddiversity,
is clear-cut. Below :LKM:Q# thereis little differencein
diversityas ( and,
change;above :QKR: ! , , eh$;: givesthebestresults.On thebiological
dataset(Fig. 5C),searcheswith (*e�#7:B: find almosttwiceasmany ( 6 ! –V;T ) solutionswith
fractionalcost )+:QKR:Q# assearcheswith,.- UB: or #7:B: ( 8B8 –8�V solutions).Thedifferencein
performancewith respectto ( and,
increasesathigherfractionalcosts.However, while
reducing,
increasesthenumberof low-costsolutionsfound,it alsodecreasesthediversity
of thesolutionset.For thesedata,( -h,.- $;: seemsto bethebestcompromise.
Fig. 5 goesnearhere.
Whensearchesareperformedon32-taxadata(Fig. 6), theimportanceof,
in improving the
diversityof thesolutionsis moreapparent.As before,solutionswith ( -h,I- ! $;: appearto
provideagoodbalancebetweenfinding thelargestnumberof low-costsolutionsandfinding
themostdiversesolutions.Wenotethatas ( increasesfrom ! $�: to $;:B: , thereis little change
17
in thenumberof treeswith fractionalcost )+:QKR: ! on thesynthetictype‘A1’ dataset(Fig. 6A),
andthatthemaximumtopologicaldistanceamongthosesolutionsincreasesvery little as,
increasesfrom ! $;: to $;:B: . Thus,for thissyntheticdata,although' - $�:B: retainsonly a tiny
fractionof up to #7:Ci = possible32-taxatreetopologies,thedatain Fig. 6A andFig. 10suggest
thatmostof thelowest-costsolutions,andmany of thetopologicallydiversesolutions,are
found.
Fig. 6 goesnearhere.
3.3 Comparison with other methods
Thusfar, our resultssuggestthatGNJcanidentify alternative,near-optimalsolutionswhen 'rangesfrom $�: ( T -taxa)to ! :B: ( 8 ! -taxa).In thissection,wecompareGNJwith different
' - (dG , to two popularphylogenetictreereconstructionmethodsfor distancedata,the
Neighbor-Joiningmethod(SaitouandNei, 1987)andtheFitch-Margoliashalgorithm(Fitch
andMargoliash,1967)asimplementedin thePHYLIP package(Felsenstein,1993).As
before,weconsiderbothsyntheticandbiologicaldatasetswith differentnumbersof taxa,and
wecomparetwo costcriteria: theminimum-evolutioncriterionusedfor Neighbor-Joining
searches,andtheleast-squarescriterionusedby Fitch-Margoliash.In thesetests,weagain
considertwo measuresof success:quality (cost)anddiversity. Weevaluatethequalityof the
solutionsin two ways:(a) thefractionof thetime(for the30 testdatasets)thatanear-optimal
solutionis found;and(b) theaveragecostof thebestsolutionsfound.To evaluatediversity,
for eachdistancematrix,wefirst computethemaximumtopologicaldistancebetweenpairs
of near-optimalGNJsolutions.Diversityis thenmeasuredby computing(a) themaximum,as
18
well as(b) theaverageof thesedistances,over 8B: datasets.
Fig. 7 goesnearhere.
For T -taxatype‘A1’ data,GNJfindssolutionsof veryhighquality thatareasdiverseasthe
exhaustivesearchwhen 'jbH$ and (*b ! (Fig. 7). When (*b ! , asolutionwithin acost
rangeof :LKM:Q# of optimalis found100%of thetime. For ( - : , GNJfindsa solutionwithin
:QKR:Q# of optimallessthan20%of thetimewhen,.- $ , andlessthan40%of thetimewhen
, bd$ . On thesamedatasets,Neighbor-Joiningfindsa kH:QKM:L# minimum-evolutionsolutution
andFitch-Margoliashfindsa kH:QKR:Q# least-squaressolutionmorethan95%of thetime. The
averagecostdatain Fig. 7A shows thatthebestsolutionsfoundby Neighbor-Joiningand
Fitch-Margoliasharetypically within :QKR:Q# of thecostrange,but thosefoundby GNJ
( 'je ! : and (*e ! ) areoptimal.Thus,GNJconsistentlyfindssolutionswith costlower than
eitherNeighbor-Joiningor Fitch-Margoliash.Moreover, comparisonof boththelargest
maximumtopologicaldistanceandtheaveragemaximumtopologicaldistance(Fig. 7B)
shows thatwhentheoptimalsolutionwasfoundby GNJ,thediversityof solutionsfound
(with costskl:QKM:L# of optimal)is aslargefor theGNJsolutionsetasfor thosefoundby the
exhaustivesearch.GNJperformedaswell astheexhaustiveseachon themuchmore
challengingtype‘A2’ dataaswell (notshown).
Fig. 8 goesnearhere.
Resultsfor #76 -taxaareshown in Figs.8 and9. On thesynthetictype‘A1’ dataset,GNJfound
asolutionwithin :QKR:Q# of thebestcost100%of thetime when (*bd: . For thisdata,
Neighbor-Joiningfounda kH:QKR:Q# costsolutiononly 80%of thetimeusingthe
minimum-evolutioncriterion,while Fitch-Margoliashalwaysfounda kd:QKR:Q# costsolution.
19
Onceagain,GNJfoundsolutionswith loweraveragecost.For type‘A2’ data(notshown),
Neighbor-JoiningandFitch-Margoliashfound kd:QKR:Q# solutionsonly 8B: – $B$ % of thetime for
theleast-squarescriterion,and ! $ – 8�V % of thetime for theminimum-evolutioncriterion
while GNJfoundthebestminimum-evolutionsolution100%of thetimewhen (*ea$ . GNJ
foundthebestleast-squaressolutionmorethan80%of thetimeon type‘A2’ datawith
(*ea$ . As ' increasedfrom ! : to ! :;: , thecostof thebestsolutionsconsistentlyimproved
with GNJ.While wecannotcomparetheGNJdiversityto thediversitythatwouldbefound
by anexhaustivesearch,increasing' from ! : to ! :B: improvestheaveragemaximum
diversity, andasbefore,( -h,seemsto provide low-costsolutionswith highdiversity.
Fig. 9 goesnearhere.
When #76 -taxabiologicaldataareexamined,theNeighbor-JoiningandFitch-Margoliash
algorithmsperformquitewell (Fig. 9). However, evenwith thisdata,theaveragecostof the
bestsolutionsfoundimprovesfrom about #7:mi to #7:m[n whenGNJis usedand (*e ! $ .
Fig. 10goesnearhere.
Neighbor-Joining,Fitch-Margoliash,andGNJall performwell on32-taxatype‘A1’ (Fig. 10)
andbiologicaldata(not shown) usingacostthresholdof :QKM: ! or :QKR:�$ (notshown). However,
it is surprisinghow diversetheGNJsolutionsarewhensolutionswith costswithin 2%of the
bestcostareincluded;GNJfoundalternatelow costsolutionsthatsharefewer thanhalf of
theinternaledges(two treesshareaninternaledgeif theedgeinducesthesameleaf
bi-partitionsin bothtrees).
Comparisonof thecostandthediversityof GNJsolutionswith ' - ! :B: and ' - $;:B:suggeststhat, ' - $;:B: , which increasestherun time ! KN$ -fold, is probablyunnecessary, since
20
neitherthequalityof thesolutionsnor thediversityincreasessignificantlywith thehigher ' .
Again,using ( -a,providesa goodbalanceof qualityanddiversity.
3.4 Post-processing
Rzhetsky andNei haveobservedthatfor smalldatasets,NJsolutionsarelikely to be
topologicallycloseto theoptimalsolution(Rzhetsky andNei, 1992).Weexaminedhow
post-processing(describedin Section2.4)affectsthenumberanddiversityof thelow-cost
solutions,andhow post-processingmight improveNeighbor-Joining,Fitch-Margoliash,and
GNJ-basedinitial solutions.Thepost-processingalgorithmexaminesall thetreesthatcanbe
formedby swapping(exchanging)subtreesaroundeachof theinternaledgesin thetree,thus
consideringall thealternative treesthatarewithin onepartitiondistancefrom theinitial tree.
If a topologyis foundwith a lowercost(least-squaresor minimum-evolution),theprocessis
repeated,until no topologicalneighboris foundwith a lowercost.If theGNJalgorithmfinds
alternatesolutionsthatareondifferentsidesof asingleshallow costbasin,post-processing
shouldreducethenumberanddiversityof low-costalternatetrees.Thisseemsto bethecase
for thebiological‘R1’ data(Fig. 11A) andthesynthetictype‘A1’ data(similar to the
biological‘R1’ data,notshown). Alternatively, if GNJactuallyfindsdistinctlocalminima
(with respectto cost),thenumberof treesmaydecreasedramaticallybut thetopological
distancebetweenalternatesolutionsshouldremainsubstantial.Multiple distinctlocalminima
arefoundwith thesynthetictype‘A2’ data.
Fig. 11goesnearhere.
Theresultsof post-processingon the #96 -taxadatasetssuggestthatGNJis capableof
21
identifyingalternate,topologically-distinctlocalminimawhenthey exist (Fig. 11). As
expected,thenumberof distinctsolutionsdropsdramatically(becauseof convergence)when
theGNJsolutionsarepost-processed.For thebiological‘R1’ data(Fig. 11A,C),thedropis
morethan30-fold,asit is with thesynthetictype‘A1’ data(notshown). However, for the
synthetictype‘A2’ data,which is derivedfrom treesin which thecostfor theoriginal treeis
frequentlymid-waybetweenthebestandworstcosts,thedropis only 2–3-foldandthe
averagemaximumtopologicaldistancesdropsonly about! : %. Thusfor thisverydifficult
dataset,many of thealternatesolutionsfoundby GNJcannotbereachedby local
branch-swappingfrom thebestsolution,anddistinctlocalminimahavebeenfound.Fig. 12
comparestheperformanceof Neighbor-Joining,Fitch-Margoliash,andGNJ,eachfollowed
by post-processing,on #96 -taxatype‘A2’ data.Post-processingimprovestheperformanceof
Neighbor-JoiningandFitch-Margoliashin findinga solutionwith cost kH:QKR:Q# from about
8B: –$;: % successto $�: –V;: % success;GNJis #7:B: % successfulwith everycombinationof ' ,
( , and,
. Again,GNJfindslower-costsolutionsthatNeighbor-JoiningandFitch-Margoliash
fail to find, evenafterexhaustivepost-processing.For thesedata,Neighbor-Joiningand
Fitch-Margoliashappearto sometimesfind localminima(with respectto cost),while GNJ
findsmoreglobalminima.
Fig. 12goesnearhere.
Theaveragemaximumtopologicaldiversityon thedifficult type‘A2’ datadecreasesonly
slightly with post-processingandthemaximumtopologicaldiversityis ashighafter
post-processingasbefore.This result—topologicallydiversesolutionsdespiteadramatic
decreasein thenumberof low-costsolutions—impliesthatGNJhasfoundalternatelocal
minimathatcannotbereachedby localbranchswappingfrom thelowest-costsolution.Since
22
ourpost-processingstrategy beginsfrom the ' solutionsfoundby conventionalGNJ,GNJ
withoutpost-processingcandetecttopologicallydistinctalternative localminima.For the
synthetictype‘A2’ datasets,low-costsolutionsthataretopologicallydistinctlocalminima
appearoften.For example,for ( - $;: and,I- $;: , theleast-squaressolutionsdiffer by 11
(outof amaximum #78 possible)branchswaps(partitions)arefoundin at leastonedataset,
anddiffer by ! KRU swapsonaverage(minimum-evolutionsolutionsdiffer by asmuchas9
partitions,and4.3onaverage).Thissuggeststhattopologicallydistinctsolutionshavebeen
foundby GNJon ! $ –$;: % of the 8B: syntheticdatasets.
Theresultswith thebiologicalandsynthetictype‘A1’ datasetscontraststarklyto the
diversityfoundwith thesynthetictype‘A2’ data.With theleast-squarescriterionand
post-processing,theaveragemaximumtopologicaldiversityis about:QKN$ andthemaximum
diversityis A , implying thatdistinctsolutionsarefoundin only about10%of thedatasets.
With theminimum-evolutioncriterion,themaximumdiversityis thesamebut theaverage
maximumis #BKo# ; again,topologicallydiversesolutionsmaybefoundfor 25%of these
#76 -taxadata.For thesynthetictype‘A1’ data,theaveragediversitydropsfrom aboutA to #;K#(least-squares)or from V to ! (minimum-evolution).
Althoughpost-processingcanimprovethequalityof GNJsolutionswithoutsignificantly
reducingtheirdiversity, thetimerequiredto post-process' alternativesolutionscanbe
prohibitivewhenthenumberof taxa(andthusthenumberof branchswapsthatmustbe
tested)is large( bP#76 ). However, comparisonof Fig. 12A andthenon-post-processeddata
(not shown) suggeststhatpost-processingdoesnot improvethesolutionqualitysignificantly
when ( and, ea$;: , andthustheextracomputationis unnecessary.
23
3.5 Run Time
GNJusescomputationtimeroughlyproportionalto thenumberof partialsolutions
maintainedduringexecution( ' ), andcubicin thenumberof taxaanalyzed.Averagerun
timesof Neighbor-Joining,Fitch-MargoliashandGNJfor variousinputsizesareshown in
Table1. GNJis considerablyslower thanNeighbor-Joining(which is oneof thefastesttree
constructionalgorithmsavailable,becauseit doesnotevaluateany alternative trees),and 8( ' - ! :;: ) to T -fold ( ' - $�:B: ) slower thanFitch-Margoliashfor the32-taxadatasets.
During its execution,GNJkeepstrackof ' partialsolutions.At eachiteration,asthenext
pairof taxais removedfrom the“star” tree,GNJexploresall thecandidatesolutionsderivable
from thecurrent' partialsolutionsvia asingleNeighbor-Joiningstep.Sinceeachpartialtree
inducespq�X � � candidatetreesby groupingoneof the pr�s � � possiblenodepairsin thetree,
thecostof all theresultingcandidatetreesrequirespq�X � � evaluationtime. Therefore,each
GNJiterationrequirespr�X'tE7 � � time to examinethecostof all pr�@'uE7 � � candidatetrees.
At eachiteration,GNJmustalsoselect' candidatetreesto passon to thenext iteration.In
thisversion,theselectionprocessrequiresall pq�@'uE7 � � candidatetreesto besortedby cost.
Currently, thetimerequiredby eachiterationof GNJis dominatedby thesortingtimewhich
is pr�X'uE7 � E7vwBxy�X'zE7 � �{� - pq�@'uE7 � E�XvwBx5'|G}v~wBxD �{� . Weanticipatethattheamountof
datato besortedcanbereduced,andthatin futureversions,theGNJcostcalculationwill
dominatetherun time. SinceGNJhasa totalof � 8 iterations,theoverall run time for GNJ
is pr�X'uE7 ? E�sv~wBxD'.G"vwBxD �J� .
24
4 Discussion
TheGeneralizedNeighbor-Joiningalgorithmis explicitly designedto explorebroadly
phylogenetictreesolutionspacesandseeklow-costsolutionsthataretopologicallydistant.
To achieve thisgoal,GNJmaintainsmultiplepartialsolutionsateachiteration,and
incorporatesbothquality (treecost)anddiversity(topologicaldistance)in selectingthesetof
partialsolutionsthatwill bepassedon to thenext iteration.Thesolutionspacesamplingis
controlledby theparameters' , ( , and,
, whichspecifythenumberof partialsolutionsto
retainandthebalancebetweenquality ( ( ) anddiversity(,
) in selectingalternatesolutions.
For datasetsof smallsize(i.e. )�#7: -taxa),GNJcanperformbetterthanNeighbor-Joiningand
Fitch-Margoliashalgorithmsby maintaining' - ! : –$;: partialsolutions.For example,for
biologicaldatasetsover U -taxa,all T treesof costcloseto theNJcost(undertheleast-squares
criterion)areobtainedby GNJwhile maintainingonly ' - $;: partialsolutions.For
syntheticdatasetsof T -taxa,GNJfindstheoptimalsolutionwhenever 'je ! : and (*e ! .
For datasetswith #96 or 8 ! -taxa,boththetopologicaldiversityandthequalityof GNJ
solutionsimprovesas ' increases.For thedatasetsthatweexamined,low-costsolutions
wereefficiently foundwith '�e ! : for T -taxaand '�eh$;: for #96 -taxa.Increasing' did
little to improveeitherthequalityor thediversityof thesolutions.8 ! -taxaproblemsarefar
morechallenging(andmorecommonin themolecularbiology literature).While '�e ! :B:( 8 ! -taxa)waseffective in finding low-costsolutions,increasing' improvedthequality, and
to a lesserextentthediversity, of the 8 ! -taxasolutions.
Webelieve thatthepost-processingresultsshow thatGNJis capableof identifying low-cost,
25
topologically-distinctsolutionsthatcannotbefoundsimplyby successively examiningevery
topologynearto individual low-costtrees.“Falling into localminima” is aninherentflaw of
any phylogeneticsearchmethodthatexaminesonly asmallportionof thesolutionspace.The
post-processingresultsfor thebiological‘R1’ datasuggestthatthisdataprobablyhassingle,
verybroadlocalminimumwith many differentlow-costtopologiesbut very few, if any,
alternatesolutionsthatcannotbefoundby post-processing(localbranchswapping).In
contrast,thesynthetictype‘A2’ datadoesappearto haveseveraldistinctlocalminima,which
werefoundby GNJ.While it is reasuringto learnthatGNJis capableof findingalternative
localminimawhenthey exist, moreextensivesimulationswill berequiredto characterizethe
conditionsunderwhich largenumbersof distinctlocalminimaoccur. While post-processing
maynotbenecessaryto find highqualitysolutions,thedecreasein diversitywith
post-processingshouldimproveourconfidencethatadatasetdoesnothavemany
topologicallydistinctlow-costsolutions.
Our resultssuggestthatGNJperformsbestwhen ( -�,I-.�� , andthat ' - ! :B: providesan
excellentbalancebetweencomputationtimeandsolutionquality/diversityfor up to 8 ! -taxa.
For morethan $;: -taxa, ' - $;:B: or #7:;:B: mayprovidebettersolutions;however, thiswill
dependgreatlyon thestructureof thephylogenetictreesolutionspace.For largenumbersof
taxa,onecanjudgewhethera larger ' is likely to providenovel solutionsby performing
searcheswith #7:B: and ! :B: . If ' - ! :;: doesnotfind any low-costtreesthatweremissedwith
' - #7:B: , it is unlikely that ' - $;:B: (or more)will uncoveradditionalnovel treeseither.
GNJis considerablyslower thantraditionalNeighbor-Joining,andfor largeproblems
( eh8 ! -taxaand '�e ! :B: ), it is muchslower thanFitch-Margoliashaswell. However, aGNJ
run is moreaccuratelycomparedto multipleFitch-Margoliashsearcheswherethetaxaare
26
successively addedin differentorder, aprocessthatcaneasilyincreasetheamountof time
requiredby ! : to $;: -fold. BecauseGNJexplicitly seeksout topologicallydiversesolutions,
webelieve thatit is morelikely to identify distinctalternativesthanadditional
Fitch-Margoliashtrials.
Thispaperconsidersthegeneralizationof theneighbor-joining partitioningstrategy to which
thedistancecostmeasuresseemideallysuited.However, themethodof retainingmany
partialsolutionsduringa partitioningstrategy canbeappliedto maximumparsimony
methods,andperhapsto maximumlikelihood-basedapproachesaswell. Wearecurrently
developingabroadergeneralizationof theapproachthatcanbeappliedto character-based,
ratherthandistance-based,costcriteria.
Acknowledgments
Wewish to thankDr. DouglasTaylor for hiscommentsonourdraft. This researchwas
supportedby agrantfrom theNationalLibrary of Medicine(LM04961).GabrielRobins
receivedadditionalsupportfrom anNSFYoungInvestigatorAwardandaPackard
FoundationFellowship.
27
References
Alpert, C. andKahng,A. B. 1995.Recentdirectionsin netlistpartitioning:asurvey.
Integration: TheVLSIJournal, 19, 1–81.
Bandelt,H. J.andDress,A. 1992.Split decomposition:anew andusefulapproachto
phylogeneticanalysisof distancedata.Mol. PhylogenetEvol.1, 242–252.
Day, W. 1987.Computationalcomplexity of inferringphylogeniesfrom dissimilarity
matrices.Bull. Math.Biol. 49(4), 461–467.
Dayhoff, M. O. 1978.Supplement3. Atlasof ProteinSequenceandStructure.Natl. Biomed.
Res.Found.,Washington,D.C.
Felsenstein,J.1982.Numericalmethodsfor inferringevolutionarytrees.Quar. Rev. Biol. 57
(1), 379–404.
Felsenstein,J.1988.Phylogeniesfrom molecularsequences:inferenceandreliability. Annu.
Rev. Genet.22, 521–565.
Felsenstein,J.1993.PHYLIP:PhylogenyInferencePackage, version3.5c. Universityof
Washington.
Fitch,W. M. andMargoliash,E. 1967.Constructionof phylogenetictrees.Science, 155,
279–284.
Foulds,L. R. andGraham,R. L. 1982.Thesteinerproblemin phylogeny is NP-complete.
AdvancesAppl.Math.3, 43–49.
Huelsenbeck,J.P. 1995.Theperformanceof phylogeneticmethodsin thefour-taxoncase.
Syst.Biol. 44, 17–48.
28
Kumar, S.1996.A stepwisealgorithmfor findingminimumevolutionarytrees.Mol. Biol.
Evol.13, 584-583.
Kimura,M. 1980.A simplemethodfor estimatingevolutionaryratesof basesubstitutions
throughcomparativestudiesof nucleotidesequences.J. Mol. Evol.16, 111–120.
Kuhner, M. K. andFelsenstein,J.1994.A simulationcomparisonof phylogeny algorithms
underequalandunequalevolutionaryrates.Mol. Biol. Evol.11, 459–468.
Leitner, T., Escanilla,D., Franzen,C.,Uhlen,M. andAlbert, J.1996.Accuratereconstruction
of aknown HIV-1 transmissionhistoryby phylogenetictreeanalysis.Proc.Natl. Acad.
Sci.USA, 93, 10864–10869.
Maddison,D. R. 1991.African origin of humanmitochondriaDNA reexamined.Syst.Zool.
40, 355–363.
Penny, D. andHendy, M. D. 1985.Theuseof treecomparisonmetrics.Syst.Zool.34, 75–82.
Penny, D., Stee,M. A., Waddell,P. J.andHendy, M. D. 1995.ImprovedAnalysesof Human
mtDNA SequencesSupportaRecentAfrican Origin for Homosapiens.Mol. Biol. Evol.
12, 863–882.
Robinson,D. F. andFoulds,L. R. 1981.Comparisonof phylogenetictrees.Math.Biosci.53,
131–147.
Rzhetsky, A. andNei, M. 1992.A simplemethodfor estimatingandtesting
minimum-evolutiontrees.Mol. Biol. Evol.9, 945–967.
Saitou,N. andImanishi,T. 1989.Relativeefficienciesof theFitch-Margoliash,
maximum-parsimony, maximum-likelihood,minimum-evolution,andneighbor-joining
methodsof phylogenetictreeconstructionin obtainingthecorrecttree.Mol. Biol. Evol.
6, 514–525.
29
Saitou,N. andNei, M. 1987.Theneighbor-joining method:anew methodfor reconstructing
phylogenetictrees.Mol. Biol. Evol.4 (4), 406–425.
Steel,M. A. andHendy, M. D. 1993.Distributionsof treecomparisonmetrics– somenew
results.Syst.Biol. 42, 126–141.
Studier, J.andKeppler, K. 1988.A noteon theneighbor-joining algorithmof SaitouandNei.
Mol. Biol. Evol.5, 729–731.
Swofford, D. 1996.PAUP: PhylogeneticAnalysisUsingParsimony(andOtherMethods),
version4.0(testversion). SinauerAssociates,Inc.,Sunderland,MA.
Swofford, D. L., Olsen,G. J.,Waddell,P. J.,andHillis, D. M. 1996.MolecularSystematics,
D.M. Hillis, C. Moritz andB.K. Mable, eds.PhylogeneticInference,pp.407–514.
SinauerAssociates,Inc.,Sunderland,MA.
Wilson,A. C. E.,Zimmer, E. A., Prager, E. M. andKocher, T. D. 1989.TheHierarchyof
Life, RestrictionMappingin theMolecularSystematicsof Mammals:A Retrospective
Salute,pp.407–419.Elsevier Press,Amsterdam.
30
Table1:
Execution times
Method 8 taxa 16 taxa 32 taxa
NJ kH:QKR:Q# 0.01 0.09
FM 0.08 2.4 31.5
GNJ ' - ! : 0.08 0.8 9.8
' - $;: 0.2 2.1 25.1
' - #7:B: 0.5 4.4 52.1
' - ! :B: 1.1 8.8 103.7
' - $;:B: 3.1 24.2 262.7
Run time (in CPU secondson a #76BV -Mhz UltraSparc)for Neighbor-Joining (NJ), Fitch-
Margoliash(FM),andGNJ,averagedover #7:B: type ‘A’ input datasetsof varioussizes,using
theleast-squarescriterionand ( -h,.
31
Figure Legends
Fig. 1 TheNeighbor-Joiningheuristic(A) An inputmatrixof size $ . (B) A Neighbor-Joining
solutionstrategy. First, thestartreetopologycenteredatnode 6 is formed;next, theclosest
neighborpair �#;f ![� is “joined” into adistinctinternalnode V ; finally, thisnew internalnode
V , togetherwith oneof theleafnodes8 , arejoinedinto anew internalnodeT to form & � . (C)
An equally-goodsolution & � but with averydifferenttopology, whichwasnot foundby NJ.
Fig. 2 TheGeneralizedNeighbor-JoiningMethodTheGNJheuristicfor thedataof Figure1A
is shown. Throughoutthesearch,8 partialsolutionsarekept( ' - 8 ). At eachiteration,all
possibleneighboringtaxapairsareexamined;8 areselectedto passto thenext iteration.The
dashedlinesrepresenttheneighboringpairsthatareeliminatedduringthecurrentiteration.
While NJ follows thepathon theleft, andthusonly findsthesingletree & � , theGNJscheme
recoversbothequally-goodsolutions& � and & � (seeFigure1).
Fig. 3 Distributionof treecostsanddiversity Thenumberof trees(left ordinate,� , � ) and
maximumtopological(partition)distanceaveragedover30datasets(right ordinate,� , � ) are
plottedasa functionof thefractionalcostrange.Distributionsfor theleast-squares(LS, � ,
� ) andtheminimum-evolution(ME, � , � ) criteriaareshown. Distributionsweredetermined
for 8 taxafrom 30synthetictype‘A1’ datasets,30synthetictype‘B1’ datasets,and30
datasetsfrom biologicaldataset‘R1’. Thefiguresshow theresultsdeterminedafteran
exhaustivesearchof all 10,395treetopologiesfor 8 taxa.
Fig. 4 GNJsolutions—8taxaThedistributionof solutionsfoundby GNJon30 type‘A1’
synthetic8 taxadatasets(A, B) or 30 ‘R1’ biological8 taxadatasets(C, D) areshown.
Searchesweredonewith ' - (dG ,I- $;: . PanelsA, C show theaveragenumberof
differenttreeswith costswithin thefractionalleast-squarescostshown. PanelsB, D show the
averageof themaximumtopologicaldistanceof thesolutionswithin thefractionalcostrange.
32
Fig. 5 GNJsolutions—16taxaThedistributionof solutionsfoundby GNJon30 type‘A1’
synthetic#76 -taxadatasets(A, B) or 30 ‘R1’ biological #76 -taxadatasets(C, D) areshown.
Searchesweredonewith ' - (dG ,I- #7:B: . PanelsA, C show theaveragenumberof
differenttreeswith costswithin thefractionalminimum-evolutioncostshown. PanelsB, D
show theaverageof themaximumtopologicaldistanceof thesolutionswithin thefractional
costrange.
Fig. 6 GNJsolutions—32taxaThedistributionsof thenumberof trees(A) andthemaximum
topologicaldistanceaveragedover30datasets,(B) areshown for theleast-squarescost
criterionon32 taxasynthetictype‘A1’ data.Searchesweredonewith ' - (dG ,|- $;:B: .Panels(C) and(D) show treenumbersandtopologicaldistancefor thebiological‘R1’ data.
Fig. 7 GNJperformance—8taxaThequalityanddiversityof GNJsolutionswith different
valuesof ( and,
arecomparedto theoptimalsolutionset(opt), andNeighbor-JoiningNJ
andFitch-MargoliashFM solutionsfor synthetictype‘A1’ data.(A) Thefractionof thetime
asolutionwasfoundwith acost kd:QKR:Q# of optimal(squares,left axis)usingeitherthe
least-squares(filled symbols)or theminimum-evolution(opensymbols)criterion.Theright
axis(circles)reportsthecostof thebestsolutionfound,averagedover the30datasets.(B)
Thediversityof the ' - (dG , solutionswith cost kl:QKR:Q# is shown asthelargest
topologicaldiversityfoundamongall 8B: datasets(squares)andthemaximumtopological
diversityaveragedover the 8B: datasets.Closedsymbolsreportdiversityfor least-squares
solutions;opensymbolsreportminimum-evolutiondiversity.
Fig. 8 GNJperformance—16taxaResultsfor 16 taxatype‘A1’ syntheticdataareplottedas
in Fig. 7, exceptthatboththefractionof solutionsfoundandthetopologicaldistanceplot
usesacostthresholdof :QKR:Q# .
33
Fig. 9 GNJperformance—biological dataResultsfor 16 taxafrom biologicaldataset‘R1’ are
plottedasin Fig. 8.
Fig. 10GNJperformance—32taxaResultsfor 32 taxatype‘A1’ syntheticdataareplottedas
in Fig. 8, exceptthatboththefractionof solutionsfoundandthetopologicaldistanceplot
usesacostthresholdof :QKR: ! .
Fig. 11Numberanddiversityof post-processedsolutionsResultsfor #96 -taxabiologicaldata
(B, D) and #76 -taxatype‘A2’ data(C, D) areplottedasin Fig. 5. Resultsshown arefor
' - #7:B: , ( -h,.- $;: , with eithernon-post-processed(squares)or post-processed(circles)
solutions.Filled symbolsreportthedistributionof least-squaressolutionswith fractional
cost;opensymbolsplot minimum-evolutioncosts.
Fig. 12Post-processedNeighbor-Joining, Fitch-Margoliash,andGNJperformance
Post-processedresultsfor #96 -taxatype‘A2’ dataareplottedasin Fig. 8.
34
Figure1: TheNeighbor-JoiningHeuristic
S 1 S 2 S 3 S 4 S 5
S 1S 2S 3S 4S 5
1
5
2
4
3
T2
1
2
4
5
6
3
1
2
3
4
57 6
0 3 3 4 3
(a)
1
2 57 6
3 4
8
T1
(c)
(b)
3 0 3 3 43 3 0 3 34 3 3 0 33 4 3 3 0
35
Figure2: GeneralizedNeighbor-Joining
1 2
3
45
1
532
4 4
3
1
2 5
1
2 5
3
1
1
25
4
3
25
31
5
43
2 4
1
�~������� �~������� �~�������
36
Figure3:
� � � � ������
� � � � � ���
��
� � � ���
��� �
� � � � ��
�
�� �
1x100�
1x101
1x102�
1x103�
1x104�
1x105�
0.001 0.01 0.1 1cost
� � �� �
� ����
� � � ��� �
���
� �� �
� �� �
� �
� � ���� �
�� �
0
1
2
3
4
5
0.001 0.01 0.1 1
topo
logi
cal d
ista
nce
cost
A. type 'A' data B. type 'B' data C. biological data
� � � �� �
����
� � � � ������
� �� �
���� � �
� � � ��
�
�� � �
0.001 0.01 0.1 1cost�
number (LS)�number (ME)
�distance (LS)�distance (ME)
num
ber
of tr
ees
37
Figure4:
����������
����������
����������
����������
�����
�����
��
������
0.1
1
10
100
0.001 0.01 0.1 1
num
ber
of s
olut
ions
least-squares cost
����������
����������
��������
��
��������
��
��������
��
��������
��0
1
2
3
4
5
0.001 0.01 0.1 1
topo
logi
cal d
ista
nce
least-squares cost
����������
����������
����������
����������
����������
������
0.1
1
10
100
0.001 0.01 0.1 1
num
ber
of s
olut
ions
least-squares cost
�Q=50/0�45/10�25/25�5/45�0/D=50�opt
���
��
�����
���
��
�����
���
��
�����
���
��
�����
���
��
�����
���
��
�����0
1
2
3
4
5
0.001 0.01 0.1 1
topo
logi
cal d
ista
nce
least-squares cost
A. B.
C. D.
38
Figure5:
����������
����������
����������
����������
����������
0.1
1
10
100
500
0.001 0.01 0.1 1
num
ber
of tr
ees
minimum evolution cost
�Q=100/0�90/10
�50/50�10/90�0/D=100
����������
����������
����������
����������
����������0
5
10
15
0.001 0.01 0.1 1
topo
logi
cal d
ista
nce
minimum evolution cost
������
����
������
����
������
����
����������
����������0.1
1
10
100
500
0.001 0.01 0.1 1
num
ber
of tr
ees
minimum evolution cost
����������
����������
����
������
����
������
����
������
0
5
10
15
0.001 0.01 0.1 1
topo
logi
cal d
ista
nce
minimum evolution cost
A. B.
C. D.
39
Figure6:
����������
����������
����������
����������
����������
1
10
100
1000
0.001 0.01 0.1 1
num
ber
of tr
ees
least-squares cost
����������
����������
�����
�����
�����
�����
�����
�����
0
5
10
15
20
25
0.001 0.01 0.1 1
topo
logi
cal d
ista
nce
least-squares cost
�Q=500/0�450/50�250/250�50/450�0/D=500
A. B.
��������������������������������������������������
1
10
100
1000
0.001 0.01 0.1 1
num
ber
of tr
ees
minimum evolution cost
����������
����������
����������
����������
����������0
5
10
15
20
25
30
0.001 0.01 0.1 1
aver
age
topo
logi
cal d
ista
nce
minimum evolution cost
C. D.
40
Figure7:
� � � � � � � �
�
� � � �
�
� � � �
�
� � � �
�
� � �� � � � �
�
� � � �
�
� � � �
�
� � � �
�
¡
¡¡
¡ ¡ ¡ ¡ ¡
¡
¡ ¡ ¡ ¡
¡
¡ ¡ ¡ ¡
¡
¡ ¡ ¡ ¡
¡
opt
NJFMQ=5
/0 4/1
3/2
2/3
1/4
0/D=5
Q=20/
018
/210
/102/
18
0/D=2
0
Q=50/
045
/525
/255/
45
0/D=5
0
Q=100
/0
90/1
050
/5010
/90
0/D=1
000.0
0.2
0.4
0.6
0.8
1.0
0.00001
0.0001
0.001
0.01
0.1
1
frac
tion
< 0
.01
aver
age
cost
�LS fract�ME fract
LS cost¡ME cost
�
� �
� � � � � � � � � � � � � � � � � � � � �
�
� �
� � � � �
�
� � � �
�
� � � � � � � � � �
¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡op
tNJ
FMQ=5
/0 4/1
3/2
2/3
1/4
0/D=5
Q=20/
018
/210
/102/
18
0/D=2
0
Q=50/
045
/525
/255/
45
0/D=5
0
Q=100
/0
90/1
050
/5010
/90
0/D=1
000.0
0.5
1.0
1.5
2.0
0.0
0.5
1.0
1.5
2.0
topo
logi
cal d
ista
nce
�LS max�ME max
LS ave¡ME ave
A. quality
B. diversity
41
Figure8:
� � � � � � � � � � � � � � � � � � � � � �� � � � � �
�� � � �
�� � � �
�� � � �
�
¡ ¡
¡ ¡ ¡¡¡
¡ ¡ ¡ ¡
¡
¡ ¡ ¡ ¡
¡
¡ ¡ ¡ ¡
¡
NJFM
Q=20/
018
/210
/102/
18
0/D=2
0
Q=50/
045
/525
/255/
45
0/D=5
0
Q=100
/0
90/1
050
/5010
/90
0/D=1
00
Q=200
/0
180/
20
100/
100
20/1
80
0/D=2
000.0
0.2
0.4
0.6
0.8
1.0
0.00001
0.0001
0.001
0.01
0.1
1
frac
tion
< 0
.01
aver
age
cost
�LS fract�ME fract
LS cost¡ME cost
� �
� � � �� � � � � � � � � �
� � � � � �
� �
� � � � � � � � � � � � � � � � � � � �
¡ ¡¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡
NJFM
Q=20/
018
/210
/102/
18
0/D=2
0
Q=50/
045
/525
/255/
45
0/D=5
0
Q=100
/0
90/1
050
/5010
/90
0/D=1
00
Q=200
/0
180/
20
100/
100
20/1
80
0/D=2
000.0
2.0
4.0
6.0
8.0
10.0
0.0
2.0
4.0
6.0
8.0
10.0
topo
logi
cal d
ista
nce
�LS max ME max
�LS ave¡ME ave
A. quality
B. diversity
42
Figure9:
� � � � � � � � � � � � � � � � � � � � � �� � � � � � � � � � � � � � � � � � � � � �
¡ ¡¡ ¡ ¡
¡¡
¡ ¡
¡
¡¡
¡
¡ ¡ ¡
¡
¡ ¡ ¡ ¡
¡
NJFM
Q=20/
018
/210
/102/
18
0/D=2
0
Q=50/
045
/525
/255/
45
0/D=5
0
Q=100
/0
90/1
050
/5010
/90
0/D=1
00
Q=200
/0
180/
20
100/
100
20/1
80
0/D=2
000.00
0.20
0.40
0.60
0.80
1.00
0.0000001
0.000001
0.00001
0.0001
0.001
0.01
frac
tion
<0.
01
aver
age
cost
�LS fract�ME fract
LS cost¡ME cost
� �
� �� � �
� �� � � �
� � � � �� � � �
� �
� �� � � � � � � � � � � � � � � � � �
¡ ¡
¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡
NJFM
Q=20/
018
/210
/102/
18
0/D=2
0
Q=50/
045
/525
/255/
45
0/D=5
0
Q=100
/0
90/1
050
/5010
/90
0/D=1
00
Q=200
/0
180/
20
100/
100
20/1
80
0/D=2
000.0
2.0
4.0
6.0
8.0
10.0
12.0
0.0
2.0
4.0
6.0
8.0
10.0
12.0
topo
logi
cal d
ista
nce
�LS max�ME max
LS ave¡ME ave
A. quality
B. diversity
43
Figure10:
� � � � � � � � � � � � � � � � � � � � � �� � � � � � � � � � � � � � � � � � � � � �
¡ ¡¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡
¡ ¡ ¡ ¡ ¡
NJFM
Q=50/
045
/525
/255/
45
0/D=5
0
Q=100
/0
90/1
050
/5010
/90
0/D=1
00
Q=200
/0
180/
20
100/
100
20/1
80
0/D=2
00
Q=500
/0
450/
50
250/
250
50/4
50
0/D=5
000.00
0.25
0.50
0.75
1.00
0.0001
0.001
0.01
0.1
1
frac
tion
< 0
.02
aver
age
cost
�LS fract�ME fract
LS cost¡ME cost
� �
� �
� ��
� �� � �
�� � � �
� � � � �
� �
� �� � �
� � � � � � � �� � � � � � �
¡ ¡¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡
NJFM
Q=50/
045
/525
/255/
45
0/D=5
0
Q=100
/0
90/1
050
/5010
/90
0/D=1
00
Q=200
/0
180/
20
100/
100
20/1
80
0/D=2
00
Q=500
/0
450/
50
250/
250
50/4
50
0/D=5
000
5
10
15
20
25
30
0
5
10
15
20
25
30
topo
logi
cal d
ista
nce
�LS max�ME max
LS ave¡ME ave
A. quality
B. diversity
44
Figure11:
�������
���
����������
�������
���
����������
0.1
1
10
100
0.001 0.01 0.1 1
num
ber
of tr
ees
cost
�LS NP�LS PP�ME NP�ME PP
�����
�����
����������
������
����
������
����
0
5
10
15
0.001 0.01 0.1 1
aver
age
topo
logi
cal d
ista
nce
cost
����������
����������
���������
� ����������
0.1
1
10
100
0.001 0.01 0.1 1
num
ber
of tr
ees
cost
��������������������
���������� ����������0
5
10
15
0.001 0.01 0.1 1av
erag
e to
polo
gica
l dis
tanc
ecost
A. B.
C. D.
45
Figure12:
� �
� � � � �� � � � � � � � � � � � � � �
�
�
� � � � � � � � � � � � � � � � � � � �
¡¡
¡ ¡¡¡¡
¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡NJ
FM
Q=20/
018
/210
/102/
18
0/D=2
0
Q=50/
045
/525
/255/
45
0/D=5
0
Q=100
/0
90/1
050
/5010
/90
0/D=1
00
Q=200
/0
180/
20
100/
100
20/1
80
0/D=2
000.0
0.2
0.4
0.6
0.8
1.0
0.00001
0.0001
0.001
0.01
0.1
1
frac
tion
<0.
01
aver
age
cost
� �
� ��
� �
� �
� � �
�
� � �� �
� � � �
� �
� � � � � � � � � � � � � � � � � � � �
¡ ¡
¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡
NJFM
Q=20/
018
/210
/102/
18
0/D=2
0
Q=50/
045
/525
/255/
45
0/D=5
0
Q=100
/0
90/1
050
/5010
/90
0/D=1
00
Q=200
/0
180/
20
100/
100
20/1
80
0/D=2
000.0
2.0
4.0
6.0
8.0
10.0
12.0
0.0
2.0
4.0
6.0
8.0
10.0
12.0
topo
logi
cal d
ista
nce
A. quality
B. diversity
46