SuperTriplets: a triplet-based supertree approach to phylogenomics
Vincent Ranwez, Alexis Criscuolo and Emmanuel J.P. Douzery
SuperTriplets: ISMB 2010
Introduction: inferring phylogeny (1 gene)
Introduction: inferring phylogeny (3 genes)
[Figure: sequence alignments for Gene 1, Gene 2 and Gene 3, with two ways of combining them: SuperTree and SuperMatrix.]
Introduction: inferring phylogeny (more data)
[Figure: the same scheme with many more genes (Gene 2 … Gene 1000) and additional data (SNP / Morpho / biblio), combined through SuperTree or SuperMatrix.]
Supertree overview: MRP
MRP [Baum 1992, Ragan 1992]
• 1 binary sequence per taxon
• 1 site per clade (1 = in the clade; 0 = outside; ? = missing)
[Figure: trees on taxa A to F (leaf orders ABCDEF, CDEABF, CDEFBA) and an example MRP matrix of 0/1/? characters.]
[Goloboff and Pol, 2002]: the MRP tree can contain a relation contradicted by all source trees
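The encoding rule above (1 = in the clade, 0 = outside, ? = missing) can be illustrated with a minimal Python sketch. It assumes trees are written as nested tuples of taxon names; the helper names `clades` and `mrp_matrix` and the two toy trees are hypothetical and are not taken from the slides.

```python
def clades(tree, found):
    """Return the leaf set of a nested-tuple tree; append each internal node's leaf set to `found`."""
    if isinstance(tree, str):
        return frozenset([tree])
    here = frozenset().union(*[clades(child, found) for child in tree])
    found.append(here)
    return here


def mrp_matrix(forest):
    """Map each taxon to its MRP sequence: one character per non-root clade of each source tree."""
    decomposed = []
    for tree in forest:
        found = []
        taxa = clades(tree, found)
        # The root clade contains every taxon of the tree, so it carries no information.
        decomposed.append((taxa, [c for c in found if c != taxa]))
    all_taxa = sorted(frozenset().union(*[taxa for taxa, _ in decomposed]))
    rows = {taxon: "" for taxon in all_taxa}
    for taxa, informative in decomposed:
        for clade in informative:
            for taxon in all_taxa:
                if taxon not in taxa:
                    rows[taxon] += "?"   # taxon absent from this source tree
                elif taxon in clade:
                    rows[taxon] += "1"   # taxon inside the clade
                else:
                    rows[taxon] += "0"   # taxon in the tree but outside the clade
    return rows


# Toy forest (hypothetical): ((A,B),C) and ((B,D),A)
forest = [(("A", "B"), "C"), (("B", "D"), "A")]
matrix = mrp_matrix(forest)
for taxon, sequence in matrix.items():
    print(taxon, sequence)
```

The sketch only builds the matrix; the parsimony search that turns it into a supertree is left to a dedicated program.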
Supertree overview: intuitive approach
The supertree problem (intuitive formulation)
• Input: a collection of overlapping trees (a forest)
• Output: the tree that best represents this collection
• A major question is: how to define "best represents"?
Visualizing supertree candidates within the tree space
Median supertree
• Intuitive solution
• Generalization of the consensus tree
• Good theoretical properties [Steel and Rodrigo, 2008]
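Supertree overview: median tree
[Figure: the distance d between two trees, computed from their decompositions after restricting the initial trees to shared taxa.]
Tree decomposition as:
• split set
• quartet set
• triplet set
Initial trees, then tree restriction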
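As an illustration of the triplet decomposition and of tree restriction, here is a small Python sketch. The nested-tuple tree format and the function names are hypothetical, and the distance used (size of the symmetric difference of the two triplet sets on the shared taxa) is one common choice assumed here, not necessarily the exact formula drawn on the slide.

```python
from itertools import combinations


def leaf_sets(tree, found):
    """Return the leaf set of a nested-tuple tree; append each internal node's leaf set to `found`."""
    if isinstance(tree, str):
        return frozenset([tree])
    here = frozenset().union(*[leaf_sets(child, found) for child in tree])
    found.append(here)
    return here


def triplets(tree):
    """Rooted triplets displayed by the tree, as (x, y, z) meaning xy|z with x < y."""
    found = []
    taxa = leaf_sets(tree, found)
    return {(x, y, z)
            for clade in found
            for x, y in combinations(sorted(clade), 2)
            for z in taxa - clade}


def triplet_distance(t1, t2):
    """Symmetric-difference triplet distance after restricting both trees to their shared taxa."""
    common = leaf_sets(t1, []) & leaf_sets(t2, [])
    # Keeping only triplets on shared taxa is equivalent to restricting the trees first.
    a = {t for t in triplets(t1) if set(t) <= common}
    b = {t for t in triplets(t2) if set(t) <= common}
    return len(a ^ b)


t1 = ((("A", "B"), "C"), "D")
t2 = ((("A", "C"), "B"), "E")
print(sorted(triplets(t1)))
print(triplet_distance(t1, t2))
```

On the shared taxa A, B, C the first tree displays AB|C and the second AC|B, so the two restricted triplet sets differ by two elements.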
Supertree overview: MRP and median tree
[Figure: input forest of three trees, T1 on taxa A to E and T2, T3 on taxa A, B, C, F, G, H, encoded both as a standard MRP matrix and as a triplet matrix representation (Triplet MR).]
Triplet MR: one binary character per triplet, written over taxa ABCDEFGH followed by the root (rooting: the root state is 0)
AB|C   110?????0
AB|D   11?0????0
…
GH|F   ?????0110
…
FH|G   ?????1010
…
Supertree overview: MRP and median tree
The parsimony value is related to the triplet distance:
• 1 parsimony step for triplets within the supertree
• 2 parsimony steps for the others
• parsimony score = nbSites + (triplet distance)/2
The MRP approach is ill-suited to triplet encoding:
• for 100 taxa, 97% of « ? »
• for 1000 taxa, 99.7% of « ? »
• unnecessarily huge matrices
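The triplet encoding and the quoted proportions of « ? » can be checked with a short Python sketch. The function names are hypothetical; the encoding follows the example characters of the previous slide (1 for the two grouped taxa, 0 for the third taxon and for the root, ? elsewhere), and the missing fraction is computed over taxa only, ignoring the root, which is an assumption that reproduces the figures on the slide.

```python
def encode_triplet(x, y, z, taxa):
    """Character for the triplet xy|z over the ordered taxa, followed by the root state 0."""
    states = {x: "1", y: "1", z: "0"}
    return "".join(states.get(taxon, "?") for taxon in taxa) + "0"


def missing_fraction(n_taxa):
    """Fraction of '?' in a triplet character: only 3 of the n taxa are informative."""
    return 1 - 3 / n_taxa


taxa = "ABCDEFGH"
print(encode_triplet("A", "B", "C", taxa))
print(encode_triplet("G", "H", "F", taxa))
print(missing_fraction(100), missing_fraction(1000))
```

Each triplet occurrence costs one full matrix column while carrying only three informative states, which is why the matrix grows large while staying almost empty.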
Supertriplets: a few notations
Given a forest F of input trees:
• N+(xy|z): number of occurrences of xy|z in F
• N-(xy|z) = N+(xz|y) + N+(yz|x) (alternative resolutions in F)
Once these counts are stored, the input trees are no longer needed (little impact of forest size)
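A minimal Python sketch of these counts follows, assuming nested-tuple trees and hypothetical helper names (`triplets`, `count_triplets`, `n_minus`); the three-tree forest is a toy example, not data from the talk.

```python
from collections import Counter
from itertools import combinations


def leaf_sets(tree, found):
    """Return the leaf set of a nested-tuple tree; append each internal node's leaf set to `found`."""
    if isinstance(tree, str):
        return frozenset([tree])
    here = frozenset().union(*[leaf_sets(child, found) for child in tree])
    found.append(here)
    return here


def triplets(tree):
    """Rooted triplets displayed by the tree, as (x, y, z) meaning xy|z with x < y."""
    found = []
    taxa = leaf_sets(tree, found)
    return {(x, y, z)
            for clade in found
            for x, y in combinations(sorted(clade), 2)
            for z in taxa - clade}


def key(x, y, z):
    """Canonical key for xy|z: grouped taxa in sorted order, then the third taxon."""
    return (min(x, y), max(x, y), z)


def count_triplets(forest):
    """N+ for every triplet: the number of trees of the forest displaying it."""
    n_plus = Counter()
    for tree in forest:
        n_plus.update(triplets(tree))
    return n_plus


def n_minus(n_plus, x, y, z):
    """N-(xy|z) = N+(xz|y) + N+(yz|x)."""
    return n_plus[key(x, z, y)] + n_plus[key(y, z, x)]


forest = [(("homo", "pan"), "mus"), (("homo", "pan"), "mus"), (("homo", "mus"), "pan")]
n_plus = count_triplets(forest)
print(n_plus[key("homo", "pan", "mus")], n_minus(n_plus, "homo", "pan", "mus"))
```

After `count_triplets` has run, the later steps only need `n_plus`, which is the sense in which the input trees are no longer needed.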
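Searching for the (asymmetric) triplet median tree T:
median: the tree T minimizing
$$d_3(T,F) = \sum_{T_i \in F} d_3(T,T_i)$$
$$d_3(T,F) = \sum_{xy|z \,\in\, \mathrm{triplets}(T)} \bigl( 2\,N^-(xy|z) + N(x|y|z) \bigr) \;+\; \sum_{x|y|z \,\in\, \mathrm{triplets}(T)} \bigl( N^+(xy|z) + N^-(xy|z) \bigr)$$
where N(x|y|z) is the number of trees of F that leave x, y, z unresolved.
Asymmetric version: the N(x|y|z) term is dropped, which matches the update d3(T',F) = d3(T,F) - ( ∑ N+(AB|X) - ∑ N-(AB|X) ) used in the agglomerative process.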
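To make the criterion concrete, the following Python sketch evaluates the asymmetric distance from the N+ counts alone. It assumes the reading in which a triplet resolved in T costs 2 N- and a triplet left unresolved in T costs N+ + N- (trees of F that leave the three taxa unresolved are ignored); the function name, the triplet keys and the toy counts are hypothetical.

```python
from itertools import combinations


def asymmetric_d3(tree_triplets, taxa, n_plus):
    """Asymmetric triplet distance between a tree (given by its triplet set) and a forest.

    tree_triplets: set of (x, y, z) keys meaning xy|z with x < y
    n_plus: dict mapping such keys to their number of occurrences in the forest
    """
    total = 0
    for a, b, c in combinations(sorted(taxa), 3):
        resolutions = [(a, b, c), (a, c, b), (b, c, a)]  # ab|c, ac|b, bc|a
        occurrences = sum(n_plus.get(r, 0) for r in resolutions)
        chosen = [r for r in resolutions if r in tree_triplets]
        if chosen:
            # Resolved in T: each conflicting occurrence counts twice (2 * N-).
            total += 2 * (occurrences - n_plus.get(chosen[0], 0))
        else:
            # Unresolved in T: each resolved occurrence counts once (N+ + N-).
            total += occurrences
    return total


taxa = {"homo", "pan", "mus"}
n_plus = {("homo", "pan", "mus"): 2, ("homo", "mus", "pan"): 1}
print(asymmetric_d3({("homo", "pan", "mus")}, taxa, n_plus))  # T displays homo pan|mus
print(asymmetric_d3(set(), taxa, n_plus))                     # T is a star tree
print(asymmetric_d3({("homo", "mus", "pan")}, taxa, n_plus))  # T displays homo mus|pan
```

With these toy counts the majority resolution gives the smallest distance, the star tree comes next, and the minority resolution is the worst.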
Supertriplets: general overview
Triplet decomposition: N+ and N- for every triplet, e.g. N+(homo pan|mus), N-(homo pan|mus), N+(pan bos|mus), N-(pan bos|mus), N+(homo pan|bos), N-(homo pan|bos), N+(mus pan|bos), N-(mus pan|bos), … in O(n^3 |F|)
First sketch: NJ-like strategy, in O(n^3), + consistency
Improvement: NNI local search, O(n^3) to test all branches once
Branch support and collapse, in O(n^3)
Supertriplets: agglomerative process
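Example on taxa A, B, C, D, E:
T0: star tree (A,B,C,D,E)
C1={A}, C2={B} gives T1 = ((A,B),C,D,E); new triplets AB|C, AB|D, AB|E
C1={D}, C2={E} gives T2 = ((A,B),C,(D,E)); new triplets DE|A, DE|B, DE|C
C1={A,B}, C2={C} gives T3 = (((A,B),C),(D,E)); new triplets AC|D, BC|D, AC|E, BC|E
Triplets(T3) = { AB|C, AB|D, AB|E, DE|A, DE|B, DE|C, AC|D, BC|D, AC|E, BC|E }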
Supertriplets: agglomerative process
Agglomeration of (C_A, C_B):
• transforms T into T'
• resolves some new triplets AB|X with A ∈ C_A, B ∈ C_B, X ∉ C_A ∪ C_B
d3(T',F) = d3(T,F) - ( ∑ N+(AB|X) - ∑ N-(AB|X) )
We select the pair maximizing Score(C_A, C_B) = ( ∑ N+(AB|X) - ∑ N-(AB|X) ) / ( ∑ N+(AB|X) + ∑ N-(AB|X) )
The whole process is O(n^3). When C_A and C_B are agglomerated:
• Score(C_D, C_E) is unchanged
• Score(C_{AB}, C_D) is easily derived from Score(C_A, C_D) and Score(C_B, C_D)
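The selection score can be sketched directly from its definition. The Python below is a naive illustration that recomputes every sum from scratch (it does not implement the incremental update that makes the whole process O(n^3)); the function names, the triplet keys, the toy counts and the convention of returning 0.0 when no triplet is informative are all assumptions.

```python
from itertools import combinations


def key(x, y, z):
    """Canonical key for xy|z: grouped taxa in sorted order, then the third taxon."""
    return (min(x, y), max(x, y), z)


def score(ca, cb, taxa, n_plus):
    """(sum N+ - sum N-) / (sum N+ + sum N-) over triplets AB|X with A in ca, B in cb, X outside both."""
    plus = minus = 0
    for a in ca:
        for b in cb:
            for x in taxa - ca - cb:
                plus += n_plus.get(key(a, b, x), 0)
                # N-(ab|x) = N+(ax|b) + N+(bx|a)
                minus += n_plus.get(key(a, x, b), 0) + n_plus.get(key(b, x, a), 0)
    if plus + minus == 0:
        return 0.0  # assumption: no informative triplet means a neutral score
    return (plus - minus) / (plus + minus)


taxa = frozenset("ABC")
n_plus = {("A", "B", "C"): 3, ("A", "C", "B"): 1}
clusters = [frozenset("A"), frozenset("B"), frozenset("C")]
best = max(combinations(clusters, 2), key=lambda pair: score(pair[0], pair[1], taxa, n_plus))
print(sorted(best[0]), sorted(best[1]))
```

Here AB|C occurs three times and AC|B once, so the pair ({A}, {B}) gets the highest score and would be agglomerated first.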
Supertriplets: NNI optimisation
2 possible NNI per edge
The variation d3(T',F) - d3(T,F) depends on only a few triplets
All these variations are initially evaluated in O(n^3)
Once an NNI is done, few NNIs have to be re-evaluated (4 adjacent edges)
NNI optimisation is therefore very fast
[Figure: a tree T and its NNI neighbour T', with the affected triplets highlighted.]
Supertriplets: edge supports
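Local support: ∑ N+ / [ ∑ N+ + ∑ N- ], summed over the triplets resolved by the edge. If < 0.5, collapsing the edge improves d3(T,F)
Global support: also takes into account the N+ and N- of triplets that impact two edges
Final edge support: min(local, global)
[Figure: tree T with the triplet classes used for each support.]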
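The link between a local support below 0.5 and an improved distance can be checked numerically. This Python sketch assumes the asymmetric cost used earlier (2 N- for a resolved triplet, N+ + N- for an unresolved one) and takes as input the (N+, N-) pairs of the triplets resolved by one edge; the function names, the toy counts and the 0.0 returned when there is no evidence are hypothetical choices. Global support is not covered.

```python
def local_support(counts):
    """Local support of an edge from the (N+, N-) pairs of the triplets it resolves."""
    plus = sum(p for p, _ in counts)
    minus = sum(m for _, m in counts)
    if plus + minus == 0:
        return 0.0  # assumption: an edge with no evidence gets no support
    return plus / (plus + minus)


def collapse_change(counts):
    """Change in d3 when the edge is collapsed: each triplet goes from 2*N- to N+ + N-."""
    return sum(p - m for p, m in counts)


weak_edge = [(2, 5), (1, 2)]     # mostly contradicted by the forest
strong_edge = [(4, 1)]           # mostly supported by the forest
print(local_support(weak_edge), collapse_change(weak_edge))
print(local_support(strong_edge), collapse_change(strong_edge))
```

The change is ∑ N+ - ∑ N-, which is negative exactly when the local support is below 0.5, so collapsing such an edge lowers the distance.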
Supertriplets: simulation protocol
Are they similar? Triplet/split measure
[Eulenstein et al. 2004] [Criscuolo et al. 2006]
Supertriplets: simulation results
[Figure: simulation results for the split and triplet measures; plot annotations: "less resolved, very few errors", "contain errors", "lack of resolution", "perfect".]
Supertriplets: phylogenomic case study
Supertree of 33 mammals
• Species: complete genomes (EnsEMBL v54)
• Sequences: orthologous CDS (OrthoMaM v5)
• Gene trees: 13 000 ML trees (inferred using PAUP)
Output supertree
• Computed in 30 s
• Congruent with [Prasad et al. 2008]
Conclusion & prospects
(Asymmetric) median supertree
• Easy to understand
• Makes tree weighting natural
MRP, triplets and median supertree
• Understanding the criteria optimized by MRP
• Design a dedicated algorithm to optimize it: http://www.supertriplets.univ-montp2.fr/
Supertrees & supermatrix are complementary
• 1 000 vertebrate genome project
• Divide and conquer approach: i) trees based on multiple CDSs (supermatrix); ii) assembling those trees (supertree)
Supertriplets: http://www.supertriplets.univ-montp2.fr/
Supertree overview: asymmetric median tree
[Figure: two forests F1 and F2 of trees on taxa A to E and a reference tree REF; for each forest, the distances to two candidate trees are compared, one being 3 times the other.]