Phylogenomic Supertrees. ORP Bininda-Emond

Phylogenomic supertrees: the end of the road or the light at the end of the tunnel?

Olaf R. P. Bininda-Emonds
Friedrich-Schiller-Universität Jena

Outline

• what are supertrees?
• “traditional” supertrees
• the threat from phylogenomics
• supertrees in the future
• a paradigm shift
• deconstructing divide-and-conquer
• challenges for the future

What is a supertree?

• results from the combination of many smaller, overlapping trees to form a single larger one
• allows inferences of relationships that cannot be made from any single source tree
• as old as systematics itself?
• “vertical” (taxonomic) substitution
• still in use, e.g., Tree of Life, larger supertrees

Formal supertree construction

[Figure: three source trees on overlapping taxon sets (E F G H J K L; A B C K L; C D E H I K) combined into one supertree on taxa A to L]

• Agreement: consensus-like techniques
• Optimization: coding technique plus optimization criterion
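The "coding technique" half of the optimization route can be illustrated with matrix representation coding, the scheme behind the MRP supertrees shown later in the talk. The sketch below is a minimal illustration under stated assumptions: the two source trees and the helper names `clades` and `mrp_matrix` are hypothetical, and trees are written as nested Python tuples. Each informative clade of each source tree becomes one binary character: 1 for members, 0 for non-members that are in that source tree, and ? for taxa the source tree does not contain.

```python
def clades(tree):
    """Return (leaf_set, informative_clades) for a rooted nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found += child_clades
        if len(child_leaves) > 1:      # single leaves carry no grouping information
            found.append(child_leaves)
    return leaves, found               # the root clade itself is never added


def mrp_matrix(source_trees):
    """Code every informative clade of every source tree as one matrix column."""
    all_taxa = sorted(set().union(*(clades(t)[0] for t in source_trees)))
    rows = {taxon: "" for taxon in all_taxa}
    for tree in source_trees:
        leaves, informative = clades(tree)
        for clade in informative:
            for taxon in all_taxa:
                if taxon not in leaves:
                    rows[taxon] += "?"   # taxon absent from this source tree
                elif taxon in clade:
                    rows[taxon] += "1"
                else:
                    rows[taxon] += "0"
    return rows


# Two hypothetical, overlapping source trees (they share taxa C and K).
tree_1 = ((("A", "B"), "C"), "K")
tree_2 = (("C", "D"), ("E", "K"))
matrix = mrp_matrix([tree_1, tree_2])
for taxon, row in matrix.items():
    print(taxon, row)
```

The resulting matrix would then be handed to an optimization criterion (parsimony in the case of MRP); that search step is not shown here.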

“Traditional” supertrees

A supertree of extant mammals

• 4510 of the 4554 species listed in Wilson and Reeder (1993)
• from Bininda-Emonds et al. (2007)

[Figure: mammal supertree with a "You are here" marker and labelled clades Monotremata, Marsupialia, Afrotheria, Xenarthra, Laurasiatheria, Euarchontoglires]

A supertree of extant birds

• 5985 extant species (Davis and Page, semi-publ. data)
• phylogeny from Johnson (2001)





Criticisms of supertrees

• one step removed from the real data
• loss of information reduces accuracy
• prevents “signal enhancement”
• potential for data duplication
• can produce unsupported clades
• invalid as phylogenetic hypotheses
• summary statement (i.e., consensus)
• cannot interpret supertree biologically
• not necessary due to the molecular revolution (stop-gap method)

• “Not many people build them [supertrees], and my sense is that their lifetime is limited: as gene sequence data becomes increasingly easy to acquire, supertrees will lose their value.”
• Anonymous review of proposed supertree book (2001)

MRP supertree of extant Carnivora

• all 271 extant species
• 274 source trees from 177 literature sources
• 13 nested supertrees
• from Bininda-Emonds et al. (1999)

Carnivora sequences in GenBank

• as of January 1, 1996: 677 sequences, 48 species, 12 new species / yr
• as of March 12, 2004: 1 984 623 sequences, 197 species, 13.1 new species / yr
• from Bininda-Emonds (2005)

[Figure: Number (log scale, 1 to 10 000 000) against Year (1990 to 2005)]

Distribution of GenBank data

Of the 1 984 623 sequences:
• 1 976 358 are for domestic dog (99.6%)
• 4365 are for domestic cat (0.2%)
• 3900 are for the remaining 195 species (or 20.0 sequences / species)
• but:
  • 191 of the 219 Martes americana sequences are cyt b
  • 225 of the 302 Phoca vitulina sequences are tRNA-Pro
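The percentages on this slide follow directly from the raw counts. As a quick check, a short sketch that recomputes them (all numbers are taken from the slide; only the variable names are introduced here):

```python
total = 1_984_623                       # Carnivora sequences as of March 12, 2004
dog, cat, rest = 1_976_358, 4_365, 3_900
other_species = 195

# The three groups account for every sequence.
assert dog + cat + rest == total

dog_share = 100 * dog / total           # percent of all sequences from domestic dog
cat_share = 100 * cat / total           # percent from domestic cat
per_species = rest / other_species      # mean sequences per remaining species

# Within-species skew toward a single marker.
martes_cytb_share = 100 * 191 / 219     # Martes americana, cyt b
phoca_trna_share = 100 * 225 / 302      # Phoca vitulina, tRNA-Pro

print(f"{dog_share:.1f}% dog, {cat_share:.1f}% cat, "
      f"{per_species:.1f} sequences / species")
```

Even the average of 20 sequences per remaining species overstates coverage, since for some species most of those sequences are the same marker.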

The molecular revolution

• molecular databases are currently highly incomplete and data are not randomly distributed
• 33+ genome projects for mammals
• ESTs: lots of bps, but comparatively few species

[Figure: “data availability matrix” for green plants, Genes against Species (from Sanderson and Driskell, 2003)]

A paradigm shift

• traditional, literature-based supertree construction probably ultimately endangered
• but more so for some groups than for others
• any future role in phylogenetics likely as an analytical tool
  • traditional supertrees for mixed data analyses
  • divide-and-conquer for homogeneous data analyses

Partitioned analyses

• utility of pure sequence-based analyses for large, taxonomically broad studies questioned increasingly
  • alignment problems and loss of data
  • saturation / signal dropout
• increasing trend for mixed analyses using data that require different models and assumptions:
  • e.g., morphology, DNA sequence data, AA alignments, RCGs, gene order, gene content, …
• mixed-data analyses might benefit from a “traditional” supertree approach
  • i.e., supertree represents end result of analysis

[Figure: conventional analysis followed by supertree construction]

Analyzing DNA supermatrices

• partitioned approach incorporating supertrees needed around turn of century
• less need today through advances in hardware (clusters and parallel computing) and software (faster algorithms and “tricks”)
• ever larger phylogenetic problems now increasingly feasible (esp. in a likelihood framework), with bootstraps and mixed model analyses




Archimedean phylogenetics

“Give me a cluster large enough and a data set on which to work, and I shall derive the phylogeny.”


[Figure: divide-and-conquer pipeline, with a Stage / Speed / Accuracy table covering four stages: Divide; Subtree optimization (conventional analysis); Supertree construction (BUILD, MR / O); Global optimization (conventional analysis)]

• adapted from Roshan et al. (2004)

Sampling schemes

• subsample data (4, 8, 16, …, 1024, 2048 taxa)
• simulate data (K2P, ti:tv = 2.0, [?] = 0.5, [?] = 0.1, 2000 bp)
• phylogenetic analysis (NJ, weighted MP, ML, or ML-DCM3)
• compare to pruned model tree

• “clade sampling”
• “random sampling”
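The difference between the two sampling schemes can be made concrete with a small sketch. It rests on assumptions that are not stated on the slide: the model tree is taken to be fully balanced with leaves numbered left to right, "clade sampling" is read as taking every taxon of one clade of the requested size, and "random sampling" as drawing taxa uniformly without regard to the tree. The function names are hypothetical.

```python
import random


def random_sample(n_taxa, k, rng):
    """Draw k taxa uniformly at random, ignoring tree structure."""
    return sorted(rng.sample(range(n_taxa), k))


def clade_sample(n_taxa, k, rng):
    """Return all taxa of one randomly chosen clade of size k.

    Assumes a fully balanced tree with leaves numbered left to right,
    so every aligned block of k = 2^j consecutive leaves is a clade.
    """
    start = rng.randrange(n_taxa // k) * k
    return list(range(start, start + k))


rng = random.Random(0)                      # fixed seed for repeatability
clade_taxa = clade_sample(4096, 16, rng)    # 16 closely related taxa
random_taxa = random_sample(4096, 16, rng)  # 16 taxa scattered over the tree
```

Under this reading, a clade sample keeps short internal branches between close relatives, while a random sample prunes the model tree down to a sparse backbone with long branches.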



Divide step

• investigated chiefly by Daniel Huson, Tandy Warnow, Usman Roshan and colleagues
  • developed disk-covering methods (DCMs)
  • fastest current implementation is Recursive-Iterative-DCM3 (Rec-I-DCM3)
• sampling strategy for divide step crucial
  • Roshan et al. (2004) noted that performance gain dependent on quality of initial decomposition
  • due to effects on analysis times of subtree optimization step

Scaling of accuracy

[Figure: average similarity to model tree (1 – dS, 0.750 to 1.000) against size of subsampled tree (1 to 10000, log scale) for NJ, MP, ML and ML-DCM3, each under random and clade sampling]

• from Bininda-Emonds and Stamatakis (2006)
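Accuracy in these plots is reported as similarity to the model tree, 1 – dS. The slides do not define dS; the sketch below assumes it is a normalized symmetric-difference (partition) distance, and computes it on rooted trees written as nested tuples. The function names and the example trees are hypothetical.

```python
def clade_set(tree):
    """Collect the non-trivial clades of a rooted nested-tuple tree."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    root = walk(tree)
    found.discard(root)          # the root clade is shared by all trees
    return found


def similarity(tree_a, tree_b):
    """Return 1 - dS, with dS taken as the normalized symmetric difference."""
    a, b = clade_set(tree_a), clade_set(tree_b)
    if not a and not b:
        return 1.0
    d_s = len(a ^ b) / (len(a) + len(b))   # clades found in only one tree
    return 1.0 - d_s


model = ((("A", "B"), "C"), ("D", "E"))
estimate = ((("A", "C"), "B"), ("D", "E"))   # one misplaced taxon
score = similarity(model, estimate)
```

Here the two trees disagree on one clade each out of six in total, so the similarity is 1 – 2/6.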


Accuracy and sampling strategy

[Figure: ratio of average similarity (clade / random sampling, 0.95 to 1.15) against tree size (1 to 10000, log scale) for MP, NJ, ML and ML-DCM3]


Scaling of analysis time

[Figure: average analysis time (seconds, 0.01 to 100000, log scale) against tree size (1 to 10000, log scale); surviving legend entries: ML-DCM3 (random), ML-DCM3 (clade)]

Analysis time and sampling strategy

[Figure: ratio of average analysis time (clade / random sampling, 0.0 to 1.5) against tree size (1 to 10000, log scale) for MP, NJ, ML and ML-DCM3]



Supertree step

• two main alternative strategies: BUILD-based vs. matrix representation / optimization based
• problem:
  • BUILD is fast, but shows poor accuracy
  • MR / O shows good accuracy, but is deadly slow
• can we devise a supertree method that combines speed and accuracy?
  • BUILD shows more promise: MR / O will always be slow
  • NB: accuracy ≠ resolution!

Problems with BUILD

• lots of BUILD-derived algorithms:
  • BUILD, MinCutSupertree, BUILD-with-Distances, AncestralBUILD, MultiLevelSupertree, PhySIC, …
• MinCut the most widely known and basis for many other methods
  • tends to approximate Adams consensus (at least empirically)
  • tends to favour larger source trees (= size bias)
  • tends to spit out single conflicting taxa at each step, yielding very unbalanced, comb-like trees
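For orientation, the core of plain BUILD fits in a few lines. This is a minimal sketch of the basic algorithm on rooted triples, not of MinCutSupertree or any of the other derivatives listed above; the input encoding (`((a, b), c)` meaning a and b are closer to each other than either is to c) and the example triples are assumptions made for illustration. Taxa are linked whenever a triple groups them, the connected components become the subtrees below the root, and the procedure recurses. If everything falls into a single component the triples conflict and plain BUILD returns nothing, which is the point at which MinCut-style methods cut the graph instead.

```python
def build(taxa, triples):
    """Return a nested-tuple tree consistent with all triples, or None."""
    taxa = sorted(taxa)
    if len(taxa) == 1:
        return taxa[0]
    if len(taxa) == 2:
        return tuple(taxa)

    parent = {t: t for t in taxa}          # union-find over the current taxa

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path compression
            t = parent[t]
        return t

    # Keep only triples whose three taxa are all still in play.
    relevant = [((a, b), c) for (a, b), c in triples
                if a in parent and b in parent and c in parent]
    for (a, b), _ in relevant:
        parent[find(a)] = find(b)          # a and b must sit on the same side

    groups = {}
    for t in taxa:
        groups.setdefault(find(t), []).append(t)
    if len(groups) == 1:
        return None                        # conflict: no split is possible

    subtrees = []
    for members in groups.values():
        subtree = build(members, relevant)
        if subtree is None:
            return None
        subtrees.append(subtree)
    return tuple(subtrees)


compatible = [(("A", "B"), "C"), (("B", "C"), "D")]
conflicting = [(("A", "B"), "C"), (("A", "C"), "B")]
tree = build("ABCD", compatible)       # ((('A', 'B'), 'C'), 'D')
no_tree = build("ABC", conflicting)    # None
```

The all-or-nothing behaviour on conflict is why the basic algorithm is fast but of limited use on real source trees, which nearly always disagree somewhere.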

Does divide-and-conquer work?

• it should / could:
  • tremendous speed gain from analyzing many, smaller problems: $\sum_{1}^{n} \mathrm{time}_{x} \ll \mathrm{time}_{nx}$
  • accuracy ~flat with respect to problem size
• e.g., can run ~250 000 MP analyses of 16 clade-sampled taxa (≈ 4 000 000 taxa in total) in the time taken to analyze 4096 taxa simultaneously

[Figure: number of analyses (1 to 10 000 000, log scale) against tree size (1 to 10000, log scale), with a reference line at 4096 taxa]
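The size of the potential saving can be checked with simple arithmetic. The sketch below uses the approximate 250 000 figure from this slide together with the MP running time for the full 4096-taxon data set reported in the results table later in the talk (69 392 seconds); the per-analysis time is therefore an implied estimate, not a measured value, and the count of 256 subproblems assumes a non-overlapping decomposition, which a real divide step would not use.

```python
full_size = 4096            # taxa in the full data set
sub_size = 16               # taxa per clade-sampled subproblem
n_analyses = 250_000        # approximate count quoted on the slide
mp_full_seconds = 69_392    # MP on the full data set (results table in this talk)

taxa_processed = n_analyses * sub_size             # taxa handled across all small runs
subproblems_needed = full_size // sub_size         # assumes no overlap between subsets
seconds_per_subanalysis = mp_full_seconds / n_analyses
naive_divide_cost = subproblems_needed * seconds_per_subanalysis

print(f"{taxa_processed:,} taxa processed in total")
print(f"{subproblems_needed} subproblems, about {naive_divide_cost:.0f} s in total")
```

On these numbers the subtree analyses themselves would cost on the order of a minute, which leaves a large time budget for overlap, backbone trees and the later stages.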

• but these potential savings aren’t realized in full empirically …

Analyses of full 4096-taxon data set

Method                          Accuracy (1 – dS)   Time taken (seconds)
NJ                              0.857               193
MP                              0.917               69 392
ML-DCM3                         0.921               195 371
ML (“standard hill climbing”)   0.923               303 450

• ML-DCM3 vs. ML (“standard hill climbing”): 1.55x
• from Bininda-Emonds and Stamatakis (2006)

Analyses of full data set

Method                          Accuracy (1 – dS)   Time taken (seconds)
NJ                              0.857               193
MP                              0.917               69 392
ML (“fast hill climbing”)       0.912               38 737
ML-DCM3                         0.921               195 371
ML (“standard hill climbing”)   0.923               303 450

• ML (“fast hill climbing”) vs. ML-DCM3: 5.04x
• from Bininda-Emonds and Stamatakis (2006)
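The two speed factors quoted with these tables (1.55x and 5.04x) are ratios of the running times listed. A short check, using only the times from the tables (the dictionary keys are shorthand introduced here):

```python
seconds = {
    "NJ": 193,
    "MP": 69_392,
    "ML fast": 38_737,         # ML, "fast hill climbing"
    "ML-DCM3": 195_371,
    "ML standard": 303_450,    # ML, "standard hill climbing"
}

# How much longer standard ML takes than ML-DCM3.
standard_over_dcm3 = seconds["ML standard"] / seconds["ML-DCM3"]

# How much longer ML-DCM3 takes than fast hill-climbing ML.
dcm3_over_fast = seconds["ML-DCM3"] / seconds["ML fast"]

print(f"{standard_over_dcm3:.2f}x, {dcm3_over_fast:.2f}x")
```

So the divide-and-conquer run beats the standard search by about half again, but is itself about five times slower than the faster conventional search, for an accuracy difference in the third decimal place.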

What’s the problem?

• bottleneck remains terminal global optimization step
  • any excessive branch swapping will slow it down
  • but branch swapping crucial for accuracy
• therefore, key is to provide as accurate a starting tree as possible
  • DCM3 method seems to be providing only a slightly better tree than NJ (PHYML) or greedy MP (RAxML)

Possible solutions: input

• improve accuracy of supertree by any or all of:
  • increasing coverage by analyzing more subtrees with more overlap
  • including several larger backbone trees
  • deriving support values for subtrees (e.g., fast bootstrapping) to enable weighted supertree analysis
• time is available for these steps
  • also lend themselves to parallelization

Possible solutions: analysis

• optimize global optimization step using constraints
  • minimize amount of intensive branch-swapping and tree surfing
  • idea in DCM-based methods (“refinement” of SCM supertree)
• supertree serves as starting tree and constraint tree
  • crucial that supertree is accurate (NB accuracy ≠ resolution!)
  • can also judge node support empirically and constrain only well-supported nodes

What’s the answer?

• increasing technological sophistication will keep increasing range of conventional analyses
  • analyses of ≤10000 taxa now feasible and usually without parallelization
• but, does a divide-and-conquer + supertree framework have a role beyond this?
  • theoretically yes, but only by solving a number of challenges

Challenges for the future

• divide (+ subtree optimization) steps
  • find subtree size(s) or combinations thereof that maximize speed and especially accuracy
  • find optimal sampling scheme: clade and backbone vs clade-like sampling
  • do general rules-of-thumb exist or do parameters need to be empirically determined on a case-by-case basis?
  • alternatives to disk-covering methods?
• supertree step
  • can we find a method that is fast like BUILD and accurate like MR / O methods? PhySIC???
• global-optimization step
  • have to weigh costs (no error correction) vs benefits (speed!) of searching under constraints

Bicliques

[Figure: bipartite graph linking Taxa A to G with Genes 1 to 8]

        Genes
Taxa    1 2 3 4 5 6 7 8
A       + – – – – – – –
B       + + – – – – – –
C       – + + + + – – –
D       – + + + + + – –
E       – + + + + – – –
F       – – – – – + + –
G       – – – – – – + +

maximal biclique = K4,3
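The K4,3 biclique can be recovered by brute force on a matrix this small. The sketch below encodes the taxon-by-gene matrix as read from this slide (the reading of the extracted matrix is itself an assumption) and searches every subset of taxa for the one whose shared genes give the most filled cells. The function name is hypothetical, and exhaustive search is only workable for toy examples; it is not a method proposed in the talk.

```python
from itertools import combinations

# Genes with sequence data for each taxon, as read from the slide's matrix.
availability = {
    "A": {1},
    "B": {1, 2},
    "C": {2, 3, 4, 5},
    "D": {2, 3, 4, 5, 6},
    "E": {2, 3, 4, 5},
    "F": {6, 7},
    "G": {7, 8},
}


def maximum_biclique(avail):
    """Return (cells, taxa, genes) for the taxon subset with the most shared cells."""
    best = (0, (), frozenset())
    taxa = sorted(avail)
    for size in range(1, len(taxa) + 1):
        for group in combinations(taxa, size):
            shared = frozenset.intersection(*(frozenset(avail[t]) for t in group))
            cells = len(group) * len(shared)   # complete block: every taxon has every gene
            if cells > best[0]:
                best = (cells, group, shared)
    return best


cells, taxa, genes = maximum_biclique(availability)
print(taxa, sorted(genes), cells)
```

The block found is three taxa by four genes with no missing data, which is the kind of complete submatrix that can be analysed conventionally as one subproblem.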

Extending bicliques

• quasi-bicliques
  • allow a certain proportion of missing edges
• as input for a supertree analysis
  • essentially build bicliques of bicliques
  • bicliques that overlap for at least two taxa, but no sequences

[Figure: bipartite graph linking Taxa A to G with Genes 1 to 8]
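Both extensions reduce to simple tests on a taxon-by-gene availability table. The sketch below is an illustration only: the availability data are the matrix as read from the previous slide, the missing-data threshold is expressed as a proportion of cells (an assumption, since the slide does not say how the proportion is defined), and both function names are hypothetical.

```python
availability = {
    "A": {1}, "B": {1, 2}, "C": {2, 3, 4, 5}, "D": {2, 3, 4, 5, 6},
    "E": {2, 3, 4, 5}, "F": {6, 7}, "G": {7, 8},
}


def is_quasi_biclique(avail, taxa, genes, max_missing):
    """True if at most a proportion max_missing of taxon-gene cells is empty."""
    taxa, genes = list(taxa), list(genes)
    cells = len(taxa) * len(genes)
    missing = sum(1 for t in taxa for g in genes if g not in avail[t])
    return cells > 0 and missing / cells <= max_missing


def usable_for_supertree(block_a, block_b):
    """True if two (taxa, genes) blocks share at least two taxa but no genes."""
    taxa_a, genes_a = block_a
    taxa_b, genes_b = block_b
    return len(set(taxa_a) & set(taxa_b)) >= 2 and not set(genes_a) & set(genes_b)


# Adding gene 6 to the C/D/E block leaves 2 of 15 cells empty.
relaxed = is_quasi_biclique(availability, ["C", "D", "E"], [2, 3, 4, 5, 6], 0.2)
strict = is_quasi_biclique(availability, ["C", "D", "E"], [2, 3, 4, 5, 6], 0.1)

# Hypothetical pair of blocks: shared taxa C and D, disjoint genes.
linked = usable_for_supertree((["B", "C", "D"], [2]), (["C", "D", "E"], [3, 4, 5]))
```

Requiring at least two shared taxa but no shared sequences means the trees from the two blocks overlap taxonomically while drawing on independent data, which sidesteps the data-duplication criticism raised earlier in the talk.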

