Phylogenomic Supertrees. ORP Bininda-Emond
Phylogenomic supertrees: the end of the road or the light at the end of the tunnel?
Olaf R. P. Bininda-Emonds
Friedrich-Schiller-Universität Jena
Outline
• what are supertrees?
• "traditional" supertrees
• the threat from phylogenomics
• supertrees in the future
• a paradigm shift
• deconstructing divide-and-conquer
• challenges for the future
What is a supertree?
• results from the combination of many smaller, overlapping trees to form a single larger one
• allows inferences of relationships that cannot be made from any single source tree
• as old as systematics itself?
• "vertical" (taxonomic) substitution
• still in use, e.g., Tree of Life, larger supertrees
Formal supertree construction
[Figure: overlapping source trees on subsets of taxa A to L are combined into one supertree on all twelve taxa]
• Agreement: consensus-like techniques
• Optimization: coding technique, optimization criterion
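As an illustration of the "coding technique" idea, here is a minimal sketch of matrix representation coding, the scheme behind the MRP supertrees mentioned later in the talk. Each clade of each source tree becomes one binary character: 1 for taxa inside the clade, 0 for other taxa in that tree, ? for taxa the tree does not contain. The two source trees and the function name `mrp_matrix` are hypothetical, chosen only for the example; the optimization step (e.g., a parsimony search on the resulting matrix) is not shown.

```python
def mrp_matrix(source_trees):
    """Code rooted source trees as a binary character matrix.

    Each source tree is a (taxa, clades) pair: the set of taxa it contains
    and a list of its clades, each clade being a set of taxon names.
    """
    all_taxa = sorted(set().union(*(taxa for taxa, _ in source_trees)))
    rows = {t: [] for t in all_taxa}
    for taxa, clades in source_trees:
        for clade in clades:
            for t in all_taxa:
                if t not in taxa:
                    rows[t].append("?")   # taxon missing from this source tree
                elif t in clade:
                    rows[t].append("1")   # taxon is a member of the clade
                else:
                    rows[t].append("0")   # taxon is in the tree, outside the clade
    return {t: "".join(chars) for t, chars in rows.items()}


# Hypothetical source trees (((A,B),C),K) and (((C,D),E),K), overlapping in C and K.
tree1 = ({"A", "B", "C", "K"}, [{"A", "B"}, {"A", "B", "C"}])
tree2 = ({"C", "D", "E", "K"}, [{"C", "D"}, {"C", "D", "E"}])
matrix = mrp_matrix([tree1, tree2])
for taxon, row in matrix.items():
    print(taxon, row)
```

Taxon K is in both trees but in none of the coded clades, so it is scored all-zero and acts as the outgroup in this toy case.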
"Traditional" supertrees

A supertree of extant mammals
• 4510 of the 4554 species listed in Wilson and Reeder (1993)
• from Bininda-Emonds et al. (2007)
[Figure: mammal supertree with a "You are here" marker; labelled clades: Monotremata, Marsupialia, Afrotheria, Xenarthra, Laurasiatheria, Euarchontoglires]
A supertree of extant birds
• 5985 extant species (Davis and Page, semi-publ. data)
• phylogeny from Johnson (2001)
Criticisms of supertrees
• one step removed from the real data
• loss of information reduces accuracy
• prevents "signal enhancement"
• potential for data duplication
• can produce unsupported clades
• invalid as phylogenetic hypotheses
• summary statement (i.e., consensus)
• cannot interpret supertree biologically
• not necessary due to the molecular revolution (stop-gap method)
• "Not many people build them [supertrees], and my sense is that their lifetime is limited: as gene sequence data becomes increasingly easy to acquire, supertrees will lose their value."
• Anonymous review of proposed supertree book (2001)
MRP supertree of extant Carnivora
• all 271 extant species
• 274 source trees from 177 literature sources
• 13 nested supertrees
• from Bininda-Emonds et al. (1999)
Carnivora sequences in GenBank
• as of January 1, 1996: 677 sequences, 48 species, 12 new species / yr
• as of March 12, 2004: 1 984 623 sequences, 197 species, 13.1 new species / yr
• from Bininda-Emonds (2005)
[Figure: number (log scale, 1 to 10 000 000) plotted against year (1990 to 2005), shown once for each date]
Distribution of GenBank data
Of the 1 984 623 sequences:
• 1 976 358 are for domestic dog (99.6%)
• 4365 are for domestic cat (0.2%)
• 3900 are for the remaining 195 species (or 20.0 sequences / species)
• but:
• 191 of the 219 Martes americana sequences are cyt b
• 225 of the 302 Phoca vitulina sequences are tRNA-Pro
[Figure: schematic with axes Genes and Species]
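The breakdown on this slide can be checked with a few lines of arithmetic. The sketch below uses only the counts quoted on the slide; the variable names are my own.

```python
# Counts as quoted on the slide (GenBank, Carnivora, March 12, 2004).
total = 1_984_623
dog = 1_976_358
cat = 4_365
rest = 3_900
n_rest_species = 195

# The three groups should account for every sequence.
assert dog + cat + rest == total

dog_share = 100 * dog / total          # percentage of sequences from domestic dog
cat_share = 100 * cat / total          # percentage of sequences from domestic cat
per_species = rest / n_rest_species    # mean sequences per remaining species

print(f"dog {dog_share:.1f}%, cat {cat_share:.1f}%, "
      f"{per_species:.1f} sequences per remaining species")
```

The mean of 20 sequences per species still overstates coverage, since (as the Martes and Phoca examples show) most sequences for a species can come from a single marker.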
The molecular revolution
• molecular databases are currently highly incomplete and data are not randomly distributed
• 33+ genome projects for mammals
• ESTs: lots of bps, but comparatively few species
• "data availability matrix" for green plants (from Sanderson and Driskell, 2003)
A paradigm shift
• traditional, literature-based supertree construction probably ultimately endangered
• but more so for some groups than for others
• any future role in phylogenetics likely as an analytical tool
• traditional: mixed data analyses
• divide-and-conquer: homogeneous data analyses
Partitioned analyses
• utility of pure sequence-based analyses for large, taxonomically broad studies questioned increasingly
• alignment problems, leading to loss of data
• saturation / signal dropout
• increasing trend for mixed analyses using data that require different models and assumptions:
• e.g., morphology, DNA sequence data, AA alignments, RCGs, gene order, gene content, …
• mixed-data analyses might benefit from a "traditional" supertree approach
• i.e., supertree represents end result of analysis
[Figure: conventional analysis followed by supertree construction]
Analyzing DNA supermatrices
• partitioned approach incorporating supertrees needed around turn of century
• less need today through advances in hardware (clusters and parallel computing) and software (faster algorithms and "tricks")
• ever larger phylogenetic problems now increasingly feasible (esp. in a likelihood framework), with bootstraps and mixed model analyses
[Figure: conventional analysis followed by supertree construction, contrasted with a single conventional analysis]
Archimedean phylogenetics
"Give me a cluster large enough and a data set on which to work, and I shall derive the phylogeny."
[Figure, adapted from Roshan et al. (2004): subtree optimization (conventional analysis), then supertree construction, then global optimization (conventional analysis)]
[Table: Speed and Accuracy for each stage: Divide, Subtree optimization, Supertree construction (BUILD; MR / O), Global optimization. Cell ratings are not recoverable from the transcript apart from one "n/a".]
Simulation protocol
• subsample data (4, 8, 16, …, 1024, 2048 taxa)
• simulate data (K2P, ti:tv = 2.0, [symbol lost] = 0.5, [symbol lost] = 0.1, 2000 bp)
• phylogenetic analysis (NJ, weighted MP, ML, or ML-DCM3)
• compare to pruned model tree

Sampling schemes
• "clade sampling"
• "random sampling"
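A rough sketch of the two sampling schemes, under loudly stated assumptions: the elided series of subsample sizes is taken to be a doubling series, the model tree is taken to be a perfectly balanced 4096-taxon tree with leaves numbered left to right (so every aligned block of 2^k consecutive leaves is a clade), and "clade sampling" is read as drawing one whole clade of the requested size. None of these details is confirmed by the slides, and the function names are hypothetical.

```python
import random

N_TAXA = 4096  # size of the full data set quoted later in the talk

# Assumed doubling series for the elided "4, 8, 16, ..., 1024, 2048".
sizes = [2 ** k for k in range(2, 12)]


def random_sample(size, rng):
    """Random sampling: draw `size` leaves uniformly from the whole tree."""
    return sorted(rng.sample(range(N_TAXA), size))


def clade_sample(size, rng):
    """Clade sampling (assumed reading): draw one complete clade of `size` leaves.

    In a perfectly balanced tree with leaves labelled 0..N-1 from left to
    right, each aligned block of `size` consecutive leaves is a clade.
    """
    start = rng.randrange(N_TAXA // size) * size
    return list(range(start, start + size))


rng = random.Random(0)
for size in sizes[:3]:
    print(size, random_sample(size, rng)[:4], clade_sample(size, rng)[:4])
```

Under this reading, a clade sample contains close relatives only, whereas a random sample of the same size spans the whole tree and tends to include long branches.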
Divide step
• investigated chiefly by Daniel Huson, Tandy Warnow, Usman Roshan and colleagues
• developed disk-covering methods (DCMs)
• fastest current implementation is Recursive-Iterative-DCM3 (Rec-I-DCM3)
• sampling strategy for divide step crucial
• Roshan et al. (2004) noted that performance gain dependent on quality of initial decomposition
• due to effects on analysis times of subtree optimization step
Scaling of accuracy
[Figure: average similarity to model tree (1 – d_S, axis from 0.750 to 1.000) against size of subsampled tree (log scale, 1 to 10 000); series: NJ, MP, ML and ML-DCM3, each under random and clade sampling]
• from Bininda-Emonds and Stamatakis (2006)
Accuracy and sampling strategy
[Figure: ratio of average similarity (clade / random sampling, axis from 0.95 to 1.15) against size of subsampled tree (log scale, 1 to 10 000); series: MP, NJ, ML, ML-DCM3]
• from Bininda-Emonds and Stamatakis (2006)
Scaling of analysis time
[Figure: average analysis time in seconds (log scale, 0.01 to 100 000) against size of subsampled tree (log scale, 1 to 10 000); series: NJ, MP, ML and ML-DCM3, each under random and clade sampling]
• from Bininda-Emonds and Stamatakis (2006)
Analysis time and sampling strategy
[Figure: ratio of average analysis time (clade / random sampling, axis from 0.0 to 1.5) against size of subsampled tree (log scale, 1 to 10 000); series: MP, NJ, ML, ML-DCM3]
• from Bininda-Emonds and Stamatakis (2006)
Supertree step
• two main alternative strategies: BUILD-based vs. matrix representation / optimization based
• problem:
• BUILD is fast, but shows poor accuracy
• MR / O shows good accuracy, but is deadly slow
• can we devise a supertree method that combines speed and accuracy?
• BUILD shows more promise, because MR / O will always be slow
• NB: accuracy ≠ resolution!
Problems with BUILD
• lots of BUILD-derived algorithms:
• BUILD, MinCutSupertree, BUILD-with-Distances, AncestralBUILD, MultiLevelSupertree, PhySIC, …
• MinCut the most widely known and basis for many other methods
• tends to approximate Adams consensus (at least empirically)
• tends to favour larger source trees (= size bias)
• tends to spit out single conflicting taxa at each step, yielding very unbalanced, comb-like trees
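For orientation, here is a minimal sketch of the basic BUILD idea (Aho et al., 1981) on rooted triplets; it is not any of the derived methods listed above. A triplet (a, b, c) states that a and b are closer to each other than either is to c. The algorithm links a and b for every triplet that applies to the current leaf set, splits the leaves into connected components, and recurses; if everything ends up in one component, the triplets conflict and plain BUILD gives up. The function name, the triplet encoding, and the nested-tuple output are illustrative choices.

```python
def build(leaves, triplets):
    """Return a nested-tuple tree consistent with all triplets, or None on conflict."""
    leaves = set(leaves)
    if len(leaves) == 1:
        return next(iter(leaves))
    if len(leaves) == 2:
        return tuple(sorted(leaves))

    # Union-find over the current leaf set.
    parent = {x: x for x in leaves}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Only triplets whose three taxa are all present apply at this level.
    relevant = [t for t in triplets if set(t) <= leaves]
    for a, b, _ in relevant:
        parent[find(a)] = find(b)          # a and b must fall in the same subtree

    components = {}
    for x in leaves:
        components.setdefault(find(x), set()).add(x)
    if len(components) == 1:
        return None                        # no split possible: triplets conflict

    subtrees = []
    for comp in sorted(components.values(), key=min):
        sub = build(comp, relevant)
        if sub is None:
            return None
        subtrees.append(sub)
    return tuple(subtrees)


compatible = build("ABCD", [("A", "B", "C"), ("A", "B", "D"), ("C", "D", "A")])
conflicting = build("ABC", [("A", "B", "C"), ("A", "C", "B")])
print(compatible)   # (('A', 'B'), ('C', 'D'))
print(conflicting)  # None
```

MinCutSupertree and its relatives differ chiefly in what happens at the conflict case: rather than returning nothing, they cut edges from the graph until it falls apart, which is where the behaviour described on this slide comes from.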
Does divide-and-conquer work?
• it should / could:
• tremendous speed gain to analyzing many, smaller problems: $\mathrm{time}_x \ll \frac{1}{n}\,\mathrm{time}_{nx}$
• accuracy ~flat with respect to problem size
[Figure: number of analyses (log scale, 1 to 10 000 000) against size of subsampled tree (log scale, 1 to 10 000), with a reference marked "= 4096 taxa"; series: NJ, MP and ML, each under random and clade sampling]
• from Bininda-Emonds and Stamatakis (2006)
• e.g., can run ~250 000 MP analyses of 16 clade-sampled taxa (≈ 4 000 000 taxa in total) in the time taken to analyze 4096 taxa simultaneously
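The example can be unpacked with simple arithmetic. The sketch below uses only the figures on the slide; the comparison with a minimal non-overlapping cover of the 4096 taxa is my own framing, added to show how much slack the speed gain leaves.

```python
n_analyses = 250_000   # approximate number of MP analyses quoted on the slide
subtree_size = 16      # taxa per clade-sampled subproblem
full_size = 4096       # taxa in the full data set

# Total taxa processed across all small analyses.
total_taxa = n_analyses * subtree_size

# Fewest 16-taxon subproblems that could cover 4096 taxa without any overlap.
min_subproblems = full_size // subtree_size

# How many times over that minimal cover could be analyzed in the same time.
spare_factor = n_analyses / min_subproblems

print(total_taxa, min_subproblems, round(spare_factor))
```

In other words, the time budget of one simultaneous analysis would pay for the minimal cover nearly a thousand times over, which is the headroom available for extra overlap between subproblems.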
Does divide-and-conquer work?
• it should / could:
• tremendous speed gain to analyzing many, smaller problems: $\mathrm{time}_x \ll \frac{1}{n}\,\mathrm{time}_{nx}$
• accuracy ~flat with respect to problem size
• but these potential savings aren't realized in full empirically …
Analyses of full 4096-taxon data set

Method                          Accuracy (1 – d_S)   Time taken (seconds)
NJ                              0.857                193
MP                              0.917                69 392
ML-DCM3                         0.921                195 371
ML ("standard hill climbing")   0.923                303 450

Annotation: 1.55x (the ML "standard hill climbing" time relative to the ML-DCM3 time)
• from Bininda-Emonds and Stamatakis (2006)
Analyses of full data set

Method                          Accuracy (1 – d_S)   Time taken (seconds)
NJ                              0.857                193
MP                              0.917                69 392
ML ("fast hill climbing")       0.912                38 737
ML-DCM3                         0.921                195 371
ML ("standard hill climbing")   0.923                303 450

Annotation: 5.04x (the ML-DCM3 time relative to the ML "fast hill climbing" time)
• from Bininda-Emonds and Stamatakis (2006)
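The two multipliers shown on these slides appear to be ratios of the tabulated run times. The sketch below recomputes them from the table; which pair of rows each multiplier refers to is an inference from the arithmetic, not something the slides state.

```python
# Run times in seconds, as tabulated for the full 4096-taxon data set.
times = {
    "NJ": 193,
    "MP": 69_392,
    "ML (fast hill climbing)": 38_737,
    "ML-DCM3": 195_371,
    "ML (standard hill climbing)": 303_450,
}

# Standard hill-climbing ML relative to the divide-and-conquer ML-DCM3 run.
ratio_standard_vs_dcm3 = times["ML (standard hill climbing)"] / times["ML-DCM3"]

# ML-DCM3 relative to the fast hill-climbing ML search.
ratio_dcm3_vs_fast = times["ML-DCM3"] / times["ML (fast hill climbing)"]

print(f"{ratio_standard_vs_dcm3:.2f}x, {ratio_dcm3_vs_fast:.2f}x")
```

Read this way, ML-DCM3 beats the older search by about half again, but is roughly five times slower than the fast hill-climbing search, at a similar accuracy (0.921 vs. 0.912).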
What's the problem?
• bottleneck remains terminal global optimization step
• any excessive branch swapping will slow it down
• but branch swapping crucial for accuracy
• therefore, key is to provide as accurate a starting tree as possible
• DCM3 method seems to be providing only a slightly better tree than NJ (PHYML) or greedy MP (RAxML)
Possible solutions: input
• improve accuracy of supertree by any or all of:
• increasing coverage by analyzing more subtrees with more overlap
• including several larger backbone trees
• deriving support values for subtrees (e.g., fast bootstrapping) to enable weighted supertree analysis
• time is available for these steps
• also lend themselves to parallelization
Possible solutions: analysis
• optimize global optimization step using constraints
• minimize amount of intensive branch-swapping and tree surfing
• idea in DCM-based methods ("refinement" of SCM supertree)
• supertree serves as starting tree and constraint tree
• crucial that supertree is accurate (NB accuracy ≠ resolution!)
• can also judge node support empirically and constrain only well-supported nodes
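The last point can be sketched as a simple filter: keep a supertree clade as a search constraint only if its support reaches some threshold. The clades, the support values, the 0.7 cutoff, and the function name are all hypothetical; the slides do not say how support would be measured or what cutoff to use.

```python
def well_supported_constraints(clade_support, threshold=0.7):
    """Return the clades whose support is at least `threshold`.

    `clade_support` maps each supertree clade (a frozenset of taxon names)
    to a support value between 0 and 1.
    """
    return [clade for clade, support in clade_support.items()
            if support >= threshold]


# Hypothetical support values for three supertree clades.
clade_support = {
    frozenset({"A", "B"}): 0.95,
    frozenset({"A", "B", "C"}): 0.40,
    frozenset({"D", "E"}): 0.80,
}
constraints = well_supported_constraints(clade_support)
print(sorted(sorted(c) for c in constraints))
```

Weakly supported nodes are left unconstrained, so the global search remains free to rearrange them, which restores some of the error correction that a fully constrained search gives up.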
What's the answer?
• increasing technological sophistication will keep increasing range of conventional analyses
• analyses of ≤10000 taxa now feasible and usually without parallelization
• but, does a divide-and-conquer + supertree framework have a role beyond this?
• theoretically yes, but only by solving a number of challenges
Challenges for the future
• divide (+ subtree optimization) steps
• find subtree size(s) or combinations thereof that maximize speed and especially accuracy
• find optimal sampling scheme: clade and backbone vs clade-like sampling
• do general rules-of-thumb exist or do parameters need to be empirically determined on a case-by-case basis?
• alternatives to disk-covering methods?
• supertree step
• can we find a method that is fast like BUILD and accurate like MR / O methods? PhySIC???
• global-optimization step
• have to weigh costs (no error correction) vs benefits (speed!) of searching under constraints
Bicliques
[Figure: bipartite graph linking taxa (A to G) to genes (1 to 8)]

Data availability matrix (+ = present, – = absent):

Taxa   Genes
       1 2 3 4 5 6 7 8
A      + – – – – – – –
B      + + – – – – – –
C      – + + + + – – –
D      – + + + + + – –
E      – + + + + – – –
F      – – – – – + + –
G      – – – – – – + +

• maximal biclique = K_{4,3}
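To make the biclique idea concrete, the sketch below searches the slide's taxa-by-genes matrix by brute force for the complete block (every taxon has every gene) with the most filled cells. The matrix is my reading of the garbled transcript, and measuring size by the number of cells, as well as the function name, are assumptions; exhaustive search over gene subsets is only workable for toy cases like this one.

```python
from itertools import combinations

# Genes available for each taxon, as read from the slide's matrix (an assumption).
availability = {
    "A": {1},
    "B": {1, 2},
    "C": {2, 3, 4, 5},
    "D": {2, 3, 4, 5, 6},
    "E": {2, 3, 4, 5},
    "F": {6, 7},
    "G": {7, 8},
}


def largest_biclique(avail):
    """Return (cells, taxa, genes) for the complete block with the most cells."""
    genes = sorted(set().union(*avail.values()))
    best = (0, (), ())
    for k in range(1, len(genes) + 1):
        for gene_set in combinations(genes, k):
            wanted = set(gene_set)
            # Taxa that have every gene in this subset.
            taxa = tuple(sorted(t for t, have in avail.items() if wanted <= have))
            cells = len(taxa) * k
            if cells > best[0]:
                best = (cells, taxa, gene_set)
    return best


cells, taxa, genes = largest_biclique(availability)
print(cells, taxa, genes)  # 12 ('C', 'D', 'E') (2, 3, 4, 5)
```

The result, four genes shared by three taxa, matches the K_{4,3} quoted on the slide.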
Extending bicliques
• quasi-bicliques
• allow a certain proportion of missing edges
• as input for a supertree analysis
• essentially build bicliques of bicliques
• bicliques that overlap for at least two taxa, but no sequences
[Figure: bipartite graph linking taxa (A to G) to genes (1 to 8)]