Lecture 7 Difficult problems….and solutions Platypus (Ornithorhynchus anatinus)
Lecture 7
Difficult problems….and solutions
Platypus (Ornithorhynchus anatinus)
Non-homogeneous evolution
Taxon1 ACGTAAGTCATCGTAGC
Taxon2 ATGGAAATTATCGCGGT
Taxon3 ACATAAATCATCGTAGA
Taxon4 ACGCAAGTCATCGAAGT
[Figure: trees relating Taxon1 to Taxon4]
Assuming equal substitution rates across sites
Allowing some sites to be invariant – reveals more parallel evolution among the variant sites
Mutations at some sites are lethal, so they are invariant
Rates can also differ among the variable sites due to fitness effects, differential mutability and codon bias, again leading homogeneous models to underestimate parallel change
Such rate variation can often be accommodated by assuming a gamma distribution of rates across sites in the likelihood (or distance) model
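As a rough illustration of how a gamma distribution of rates changes a distance estimate, the sketch below compares the Jukes-Cantor distance under equal rates with its gamma-corrected form. The Jukes-Cantor model, the value of p and the shape values alpha are assumptions chosen for the example; the slide does not specify a substitution model.

```python
import math

def jc_distance(p):
    """Jukes-Cantor distance assuming every site evolves at the same rate."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def jc_gamma_distance(p, alpha):
    """Jukes-Cantor distance with gamma-distributed rates across sites.

    alpha is the gamma shape parameter; a small alpha means strong rate variation.
    """
    return 0.75 * alpha * ((1.0 - 4.0 * p / 3.0) ** (-1.0 / alpha) - 1.0)

p = 0.30  # hypothetical observed proportion of differing sites
print("equal rates:", round(jc_distance(p), 4))
for alpha in (0.5, 1.0, 5.0):  # hypothetical shape values
    print("alpha =", alpha, ":", round(jc_gamma_distance(p, alpha), 4))
```

For the same observed difference, a smaller alpha gives a larger corrected distance, which is the sense in which equal-rates models underestimate hidden change.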
Non-homogeneous data partitions
[Figure: trees for Partition 1 and Partition 2, and the tree reconstructed under a single likelihood model]
Kolaczkowski and Thornton (Nature, 2004)
Rifleman    GTAACACTAGCC
Broadbill   GTCACACTAGCC
Flycatcher  GTTACATTAGCC
Lyrebird    GTTACTTTAGCA
Indigobird  GTAACCCTAGCC
Zebra Finch GTAACCTTAGCA
Rook        GTAACTCTAGCA
Codon pos.  123123123123
Variable sites were shown in red; most change is at 3rd codon positions
[Image: Rifleman]
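The concentration of change at 3rd positions can be checked directly on the seven bird sequences. This is a minimal sketch that assumes the alignment starts at codon position 1, as the "Codon pos." row indicates; the function name is illustrative.

```python
alignment = {
    "Rifleman":    "GTAACACTAGCC",
    "Broadbill":   "GTCACACTAGCC",
    "Flycatcher":  "GTTACATTAGCC",
    "Lyrebird":    "GTTACTTTAGCA",
    "Indigobird":  "GTAACCCTAGCC",
    "Zebra Finch": "GTAACCTTAGCA",
    "Rook":        "GTAACTCTAGCA",
}

def variable_sites_by_codon_position(seqs):
    """Count alignment columns with more than one state, per codon position."""
    counts = {1: 0, 2: 0, 3: 0}
    for index, column in enumerate(zip(*seqs)):
        if len(set(column)) > 1:
            counts[index % 3 + 1] += 1  # assumes column 0 is codon position 1
    return counts

print(variable_sites_by_codon_position(alignment.values()))
```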
[Figure: two trees relating reptiles, monotremes, marsupials and placentals, one labelled Marsupionta and one labelled Theria]
Competing hypotheses for the interrelations of the mammalian sub-classes
Janke et al. (PNAS, 1997)
ML analysis of complete mitochondrial genome protein-coding sequences
Marsupionta
[Figure: proportion of constant sites plotted against purine base frequency]
Model                     df    AIC
TN93+I+Γ (concatenated)   40    162260.5
TN93+I+Γ (partitioned)    480   158054.3
Grouping of protein-coding and RNA-coding genes based on observed constant-site proportions and purine base frequency: RNA loops; RNA stems; COI; NADH6; ATPase8, NADH2, NADH4L; ATPase6, NADH1, NADH3, NADH4, NADH5; COII, COIII, Cytb.
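The model comparison can be worked through from the table values alone. The sketch assumes the "df" column is the number of free parameters k, so that AIC = 2k - 2 lnL; the log-likelihoods it recovers are back-calculated under that assumption and are not reported on the slide.

```python
def log_likelihood_from_aic(aic, k):
    """Invert AIC = 2k - 2 lnL to recover lnL (k = number of free parameters)."""
    return (2 * k - aic) / 2.0

models = {
    "concatenated": {"k": 40, "aic": 162260.5},
    "partitioned": {"k": 480, "aic": 158054.3},
}

# A lower AIC is preferred; the difference measures how strongly.
delta_aic = models["concatenated"]["aic"] - models["partitioned"]["aic"]
print("AIC improvement from partitioning:", round(delta_aic, 1))
for name, m in models.items():
    print(name, "implied lnL:", log_likelihood_from_aic(m["aic"], m["k"]))
```

The partitioned model has twelve times as many parameters, yet its AIC is still lower, so the extra parameters are paying for themselves under this criterion.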
Partitioned ML: Theria is favoured
KH-test p-value - Phillips et al. (MPE, 2003)
[Figure: tree of Marsupials, Monotremes, Placentals and Reptiles, with Theria labelled]
Compositional heterogeneity
Stationarity: A standard assumption of most phylogeny reconstruction methods is that underlying substitution processes are the same across the tree
When violated, biases arise that provide signals in the data that can overwhelm the “true” phylogenetic signal
Shifting substitution processes (e.g. A→G being favoured in some branches but G→A in others) can result in signals for relationships that arise from similar DNA or protein sequence composition rather than shared ancestry.
[Figure: NJ tree of Elephant, Platypus, Opossum, Bandicoot, Aardvark, Rook, Hippopotamus, Rhea, Vidua, Wallaroo, Brushtail Possum, Fin Whale, Mole, Armadillo, Green Turtle, Painted Turtle and Ostrich, with node support values]
Extreme example: NJ tree - mt 3rd codon positions, transitions only
Branch thickness proportional to T:C ratio
Composition χ² test (stochastic test)
Taxon         A        C        G        T
-----------------------------------------------
Rifleman      165      154      82       95
Broadbill     203      142      48       103
Flycatcher    195      115      60       126
Lyrebird      138      142      127      89
Indigobird    137      144      128      87
Zebra Finch   141      143      124      88
Rook          145      144      118      89
Expected      160.57   140.57   98.14    96.71
$\chi^2 = \sum (\mathrm{Exp} - \mathrm{Obs})^2 / \mathrm{Exp} = 119.211273$
df = (n-1)(t-1) = 18, P < 0.0001
Tells only of the presence of a bias and is unreliable when most of the variation occurs among a small number of character states
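The test statistic can be reproduced from the base counts in the table. This sketch uses the usual contingency-table expectation (row total × column total / grand total), which reduces to the column mean here because every taxon has the same number of sites; the function name is illustrative.

```python
counts = {
    "Rifleman":    (165, 154, 82, 95),
    "Broadbill":   (203, 142, 48, 103),
    "Flycatcher":  (195, 115, 60, 126),
    "Lyrebird":    (138, 142, 127, 89),
    "Indigobird":  (137, 144, 128, 87),
    "Zebra Finch": (141, 143, 124, 88),
    "Rook":        (145, 144, 118, 89),
}

def composition_chi_square(table):
    """Chi-square statistic and degrees of freedom for a taxon-by-base count table."""
    rows = list(table.values())
    col_totals = [sum(col) for col in zip(*rows)]
    grand_total = sum(col_totals)
    chi2 = 0.0
    for row in rows:
        row_total = sum(row)
        for obs, col_total in zip(row, col_totals):
            exp = row_total * col_total / grand_total
            chi2 += (exp - obs) ** 2 / exp
    df = (len(col_totals) - 1) * (len(rows) - 1)
    return chi2, df

chi2, df = composition_chi_square(counts)
print(round(chi2, 4), df)
```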
Relative compositional variability (magnitude metric)
Allows the magnitude of compositional heterogeneity to be compared between sequences or coding regimes (for the same taxa)
$\mathrm{RCV} = \sum_{i=1}^{n} \left( |A_i - A^*| + |T_i - T^*| + |C_i - C^*| + |G_i - G^*| \right) / (n \cdot t)$
Where Ai is the observed frequency of adenine for taxon i, A* is the average frequency of adenine across all taxa, n is the number of taxa and t is the number of sites
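A minimal sketch of the RCV calculation, applied to the seven-bird base counts from the composition test. It treats A_i, T_i, C_i and G_i as counts (so that dividing by t normalises them) and assumes every taxon has the same number of sites; both are assumptions of the example.

```python
counts = {
    "Rifleman":    (165, 154, 82, 95),
    "Broadbill":   (203, 142, 48, 103),
    "Flycatcher":  (195, 115, 60, 126),
    "Lyrebird":    (138, 142, 127, 89),
    "Indigobird":  (137, 144, 128, 87),
    "Zebra Finch": (141, 143, 124, 88),
    "Rook":        (145, 144, 118, 89),
}

def rcv(table):
    """Relative compositional variability for a taxon-by-base count table."""
    rows = list(table.values())
    n = len(rows)        # number of taxa
    t = sum(rows[0])     # number of sites (assumed equal for all taxa)
    means = [sum(col) / n for col in zip(*rows)]
    total_deviation = sum(
        abs(obs - mean) for row in rows for obs, mean in zip(row, means)
    )
    return total_deviation / (n * t)

print(round(rcv(counts), 4))
```

Running the same function on RY-coded counts for the same taxa would give the kind of before-and-after comparison the metric is designed for.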
Accounting for compositional heterogeneity
1. LogDet distances - recover additive distances between sequences when base composition varies
For each pair of DNA sequences x and y, a 4 × 4 matrix counts each possible pairing of states at aligned sites

                          Euglena (y)
                      A      C      G      T
Olithodiscus (x)  A   224    5      24     8
                  C   3      149    1      16
                  G   24     5      230    4
                  T   5      19     8      175

Dividing each count by the total number of sites (900) gives

Fxy =
  0.249  0.006  0.027  0.009
  0.003  0.166  0.001  0.018
  0.027  0.006  0.256  0.004
  0.006  0.021  0.009  0.194

$D_{xy} = -\ln[\det F_{xy}] = 6.216$
Rates-across-sites LogDet has yet to be developed, so this method is often inconsistent due to poor branch-length estimation
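The LogDet value for the Olithodiscus and Euglena pair can be recomputed from the count matrix. This is a sketch in plain Python, with the determinant obtained by Gaussian elimination; it follows the slide's formula D = -ln[det F] without any further scaling, and the helper names are illustrative.

```python
import math

pair_counts = [
    [224, 5, 24, 8],
    [3, 149, 1, 16],
    [24, 5, 230, 4],
    [5, 19, 8, 175],
]

def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [[float(x) for x in row] for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det

def logdet_distance(counts):
    """-ln(det F), where F holds the pair counts as proportions of all sites."""
    total = sum(sum(row) for row in counts)
    freqs = [[x / total for x in row] for row in counts]
    return -math.log(determinant(freqs))

print(round(logdet_distance(pair_counts), 3))
```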
[Figure: trees of Euglena, Liverwort, Chlamydomonas, Rice, Tobacco, Anacystis, Chlorella and Olithodiscus under (a) Jukes-Cantor distances and (b) LogDet distances; legend: chlorophyll a/b, chlorophyll a/c, phycobilin, uncertain]
Lockhart et al. (MBE, 1994)
2. Non-homogeneous base composition maximum likelihood
Galtier and Gouy (MBE, 1998)
[Figure: rooted tree annotated with root G+C% ω, root location Φ, branch lengths λ1 to λ7 and equilibrium G+C% θ1 to θ7]
Parameters            symbol   number
root G+C%             ω        1
branch-length         λ        2n-3
root location         Φ        1
Ts/Tv ratio           κ        1
equilibrium G+C%      θ        2n-2
Limitations
1. restricted to GC vs. AT bias
2. computer time intensive
3. Character state re-coding
• Often much of the compositional heterogeneity arises within specific classes of character state
e.g. purine and pyrimidine transitions
These can be re-coded: RY-coding involves A,G → R and C,T → Y
• Similarly, amino acids can be lumped into functionally similar groups, e.g. valine, leucine and isoleucine as a single category of mid-sized aliphatic amino acids.
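RY-coding is simple enough to sketch in a few lines. The example recodes two of the bird sequences shown earlier in the lecture (Broadbill and Flycatcher), which differ only by C/T transitions; the function name is illustrative, and any character other than A, C, G or T is passed through unchanged.

```python
RY = {"A": "R", "G": "R", "C": "Y", "T": "Y"}

def ry_code(seq):
    """Recode purines (A, G) as R and pyrimidines (C, T) as Y."""
    return "".join(RY.get(base, base) for base in seq.upper())

broadbill = "GTCACACTAGCC"
flycatcher = "GTTACATTAGCC"
print(ry_code(broadbill))
print(ry_code(flycatcher))
```

After recoding, the two sequences are identical: transitions, and any compositional bias they carry, no longer contribute, and only transversions remain informative.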
Nardi et al. (Science, 2003) found Hexapoda to be paraphyletic
Delsuc et al. (Science, 2003)
1st and 3rd codon positions RY-coded
RCVnt = 0.1064 RCVry = 0.0413
[Figure: tree with Hexapoda labelled]
Mistaking precision for accuracy
106 nuclear genes: Different methods provide conflicting Yeast topologies, each with 100% bootstrap support
The results underline the importance of understanding how non-phylogenetic signals will bias inference under the model used
Phillips et al. (MBE, 2004)
Not enough phylogenetic signal to resolve the tree
Branch-length too short Ans. Increase gene sequencing
Signal erosion with time Ans. Use high-value (often slower evolving) characters
Long unbroken branches make for “noisier” data Ans. Increase taxon sampling
Stemminess (Fiala and Sokal: Evol., 1985) on uncorrected distance trees indicates the relative extent of phylogenetic signal erosion among alternative sequences (or coding regimes) for the same taxa
$\mathrm{Stemminess} = \sum \text{internal branch lengths} / \text{total tree length}$
Greater phylogenetic signal retention for slower evolving genes results in higher stemminess
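A small sketch of the stemminess calculation, taking it as the share of total tree length that lies on internal branches (the reading under which slower-evolving genes score higher). The branch lengths below are hypothetical and serve only to show the arithmetic.

```python
def stemminess(internal_lengths, external_lengths):
    """Fraction of total tree length contributed by internal branches."""
    internal = sum(internal_lengths)
    total = internal + sum(external_lengths)
    return internal / total

# Hypothetical uncorrected-distance trees for the same four taxa.
saturated = stemminess([0.01, 0.01], [0.10, 0.12, 0.09, 0.11])
conserved = stemminess([0.05, 0.04], [0.03, 0.04, 0.03, 0.02])
print(round(saturated, 3), round(conserved, 3))
```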
[Figure: uncorrected distance trees for Tigercat, Dunnart, Wombat, Brushtail, Wallaroo, Monodelphis, Opossum, Spiny Bandicoot and Northern Brown Bandicoot]
12 mitochondrial protein-coding genes: Stemminess = 0.086
5 nuclear protein-coding genes: Stemminess = 0.440
Saturation – the problem of multiple changes at the same sites
• Theory, simulations, and practical experience all indicate that the sequences must eventually lose information about events that were long ago.
• Part of the problem with using DNA sequence alignments to infer deep events is that the state space is small {A,C,G,T}
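The effect of a four-state space can be illustrated with the Jukes-Cantor expectation for the observed proportion of differing sites. The model choice is an assumption made for the example; the point is only that observed difference levels off while the true number of substitutions keeps growing.

```python
import math

def expected_p_distance(d):
    """Expected proportion of differing sites under Jukes-Cantor
    after d substitutions per site."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

for d in (0.1, 0.5, 1.0, 2.0, 5.0):
    print(d, round(expected_p_distance(d), 4))
```

With four states, two unrelated sequences still match at about a quarter of sites, so the observed difference approaches 0.75 and carries almost no information about how much more change has occurred.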
Other sorts of characters
• In an idealised situation where each site had an infinite state space there would be no parallel changes or reversals and our character matrices would be homoplasy free.
• Obviously it is interesting to try and find characters that are closer to this ideal than DNA sequences.
SINEs and LINEs
• SINEs (and LINEs) are Short (or Long) interspersed nuclear elements.
• Retrotransposed DNA elements that are copied into the genome.
• Low expectations for the same retrotransposon sequence to insert in exactly the same position independently (low homoplasy markers)
Taxon1 ATGCT-------//-------GTCTAGT
Taxon2 AGGCTGTTATGT//TCTCTAGGTCAAGT
Taxon3 ATGCTGCTATGT//TCTCTAGGTCTATT
Taxon4 ATACT-------//-------GTATAGT
Insertion event 1 into chromosome A
The SINE/LINE is copied from locus 1 on chromosome A to locus 2 on chromosome B
Locus 2 sequence
Taxon3 (present at loci 1 and 2)
Taxon2 (present at loci 1 and 2)
Taxon4 (only present at locus 1)
Taxon1 (not present at locus 1 or locus 2)
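The presence and absence pattern listed above can be coded as binary characters. This is a minimal sketch (the dictionary layout and function name are illustrative) that recovers the group of taxa sharing each insertion and checks that the two groups are nested, as expected for homoplasy-free markers on a single tree.

```python
presence = {
    "Taxon1": {"locus1": 0, "locus2": 0},
    "Taxon2": {"locus1": 1, "locus2": 1},
    "Taxon3": {"locus1": 1, "locus2": 1},
    "Taxon4": {"locus1": 1, "locus2": 0},
}

def carriers(matrix, locus):
    """Set of taxa that carry the insertion at the given locus."""
    return {taxon for taxon, loci in matrix.items() if loci[locus] == 1}

group1 = carriers(presence, "locus1")
group2 = carriers(presence, "locus2")

# Two insertions fit one tree if their carrier sets are nested or disjoint.
compatible = group1 <= group2 or group2 <= group1 or not (group1 & group2)
print(sorted(group1), sorted(group2), compatible)
```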
Competing hypothesis for the position of the whales
SINEs and LINEs provide homoplasy free support for the position of the whales as sister group to the hippos.
Genome-order based phylogeny
Large state-space
• DNA sequences: 4 states per site
• Signed circular genomes with n genes: $2^{n-1}(n-1)!$ states, 1 site
• Circular genomes (1 site)
– with 37 genes: $2.56 \times 10^{52}$ states
– with 120 genes: $3.70 \times 10^{232}$ states
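The state-space figures follow directly from the formula and can be checked with exact integer arithmetic. The function name is illustrative.

```python
import math

def signed_circular_genome_count(n):
    """Number of distinct signed circular gene orders for n genes: 2^(n-1) * (n-1)!"""
    return 2 ** (n - 1) * math.factorial(n - 1)

for n in (37, 120):
    print(n, f"{signed_circular_genome_count(n):.2e}")
```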
Reference sequence
Inversion (of orange and blue)
Transposition (of grey)
Indicates sequence read direction
Inverted transposition (of grey)
Genome rearrangements
Breakpoint Distance
• Breakpoint distance=5
1 2 3 4 5 6 7 8 9 10
1 –3 –2 4 5 9 6 7 8 10
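A sketch of the breakpoint count for the example above. It assumes the genome is treated as linear, the reference is the identity order 1..n, and both ends are framed with 0 and n+1; an adjacency (a, b) is conserved when b - a = 1 on the signed values, so the inverted block -3 -2 still counts as the intact pair 2 3.

```python
def breakpoint_distance(genome):
    """Count breakpoints of a signed linear gene order relative to 1..n."""
    n = len(genome)
    framed = [0] + list(genome) + [n + 1]
    return sum(1 for a, b in zip(framed, framed[1:]) if b - a != 1)

rearranged = [1, -3, -2, 4, 5, 9, 6, 7, 8, 10]
print(breakpoint_distance(rearranged))
```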
Minimum Inversion Distance
1 2 3 4 5 6 7 8 9 10
1 2 3 –8 –7 –6 –5 –4 9 10
1 8 –3 –2 –7 –6 –5 –4 9 10
1 8 –3 7 2 –6 –5 –4 9 10
• Inversion distance=3
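The three inversions in this example can be replayed in code. The sketch only confirms that three inversions connect the two gene orders; it does not prove that no shorter series exists, which is what a minimum inversion distance algorithm would establish. The slice indices are zero-based and chosen to match the slide's steps.

```python
def invert(genome, start, end):
    """Reverse genome[start:end] and flip the sign of every gene in it."""
    segment = [-gene for gene in reversed(genome[start:end])]
    return genome[:start] + segment + genome[end:]

reference = list(range(1, 11))
step1 = invert(reference, 3, 8)  # inverts 4 5 6 7 8
step2 = invert(step1, 1, 4)      # inverts 2 3 -8
step3 = invert(step2, 3, 5)      # inverts -2 -7
for step in (reference, step1, step2, step3):
    print(step)
```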
Distance-based methods
Tandy Warnow, UT-Austin
Maximum Parsimony on Rearranged Genomes (MPRG)
• The leaves are rearranged genomes.
• Find the tree that minimizes the total number of rearrangement events
[Figure: tree with leaves A, B, C and D, internal nodes E and F, and edge lengths 3, 6, 2, 3 and 4]
Total length = 18
Tandy Warnow, UT-Austin
Mitochondrial genome rearrangement maximum parsimony
Fritzsch et al. (J.Theor. Biol., 2006)
Data choice and analytical methods are in their infancy
Note non-monophyly of Nematoda and Mollusca; Well resolved sequence and morphology clades
?
An additional possibility is that there are multiple signals: 1. Biases in the data (e.g. compositional heterogeneity), 2. genes have different histories (e.g. lineage sorting or hybridization)
If a gene has a long coalescent time, then its relationships among taxa may differ from the species tree
Gene tree
Species tree
A B C D
Molecular dating
e.g. Zuckerkandl and Pauling (J. Theor. Biol., 1965)
The molecular clock
[Figure: genetic change plotted against time since divergence]
[Figure: genetic divergence plotted against time since divergence, observed and corrected for saturation, for Human–Chimpanzee, Human–Mouse and Human–Bird]
Is the data clock-like?
Can the deviation from an ultrametric tree be explained by the stochastic nature of substitution (sampling error), or do substitution rates differ across the tree?
Relative rates tests
H0: Two sister taxa are evolving at the same rate (by comparison with an outgroup)
Hebsgaard et al. (TIM, 2005)
Molecular clock likelihood ratio test
H0: That a clock model explains the data as well as a non-clock model
1. Optimise the likelihood of the (unrooted) tree under a non-clock model (lnLn)
2. Optimise the likelihood of the (rooted) tree under a clock model (lnLc)
3. Calculate the test statistic = 2(lnLn − lnLc)
4. This is compared to a χ² distribution critical value (where the degrees of freedom are the difference in the number of free parameters being estimated between the two models = n − 2 for n taxa)
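The four steps can be sketched as follows. The log-likelihoods and taxon number are hypothetical, the statistic is taken as twice the non-clock log-likelihood minus the clock log-likelihood (so it is non-negative), and the χ² tail probability is computed with a plain series for the incomplete gamma function so that the example needs only the standard library.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable (incomplete-gamma series)."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = math.exp(-z + a * math.log(z) - math.lgamma(a + 1.0))
    total, k = term, 0
    while term > 1e-16 * total and k < 10000:
        k += 1
        term *= z / (a + k)
        total += term
    return max(0.0, 1.0 - total)

def clock_lrt(lnL_nonclock, lnL_clock, n_taxa):
    """Likelihood ratio test of a strict clock against a non-clock model."""
    stat = 2.0 * (lnL_nonclock - lnL_clock)
    df = n_taxa - 2  # (2n - 3) branch lengths versus (n - 1) node heights
    return stat, df, chi2_sf(stat, df)

# Hypothetical values for a 12-taxon data set.
stat, df, p_value = clock_lrt(-10234.7, -10250.1, 12)
print(round(stat, 2), df, p_value < 0.05)
```

A small p-value means the clock model fits significantly worse, so the data are not clock-like.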
Linearized trees: Takezaki et al. (MBE, 1995)
Prune the taxa that are the most non-clock-like until the molecular clock likelihood ratio test is passed
Concerns: 1. removing any branches reduces the power of the test (so increases the probability of passing) and 2. remaining branches may hide complementary rate shifts that cancel out
Relaxing the molecular clock
1. Local clocks
[Figure: tree with branch rates r1, r2 and r3 assigned by clade]
Relies on the identification of rate classes with respect to clades
2. Autocorrelated rate evolution
[Figure: tree with branch rates r1 to r10]
Each rate ri is a function of the rate of its parent branch. Many different models of rate change have been applied including: quadratic, lognormal, exponential, gamma, Ornstein-Uhlenbeck
3. Uncorrelated rate evolution
[Figure: tree with branch rates r1 to r10]
Method of Drummond et al. (PLoS Biol., 2006)
Rates ri do not depend on the rate of their parent branch, but are drawn from a lognormal or exponential distribution that maximises the posterior probability of the tree
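The difference between the two rate models can be sketched by simulating branch rates on a toy tree. This illustrates only the prior on rates, not the inference procedure of Drummond et al.; the tree, the sigma values and the function names are all hypothetical.

```python
import random

# Hypothetical rooted tree: branch -> parent branch (None marks branches at the root).
# Parents are listed before their children.
PARENT = {"r1": None, "r2": None, "r3": "r1", "r4": "r1", "r5": "r2", "r6": "r2"}

def uncorrelated_rates(tree, sigma, rng):
    """Draw every branch rate independently from one lognormal with mean 1."""
    mu = -0.5 * sigma ** 2  # makes the lognormal mean equal to 1
    return {branch: rng.lognormvariate(mu, sigma) for branch in tree}

def autocorrelated_rates(tree, sigma, rng, root_rate=1.0):
    """Draw each branch rate from a lognormal centred, in log space, on its parent's rate."""
    rates = {}
    for branch, parent in tree.items():
        base = root_rate if parent is None else rates[parent]
        rates[branch] = base * rng.lognormvariate(0.0, sigma)
    return rates

rng = random.Random(1)
print(uncorrelated_rates(PARENT, 0.5, rng))
print(autocorrelated_rates(PARENT, 0.5, rng))
```

Under the autocorrelated model a fast parent branch tends to have fast daughters; under the uncorrelated model neighbouring branches carry no information about each other.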
Performance of correlated rates methods on trees simulated under uncorrelated rates among branches
Calibrating molecular clocks
Biogeographical divergences
e.g. New Zealand split from Gondwana about 80 million years ago and so did some of New Zealand’s endemic fauna
Fossils that post-date divergences
[Figure: dated tree of Penguins, Albatross and Ducks, with a 61 Ma fossil calibration and a 90 Ma estimate]
Slack et al. (MBE, 2006)
[Figure: calibration priors plotted against time: point calibration; calibration bounds (upper and lower) with a flat prior; normal prior]
Using a lognormal (19Ma-25Ma upper 95%, mean=21Ma) calibration for cats/hyaenas
Barnett et al. (Curr. Biol., 2005)
[Figure axis: millions of years ago, from 25 to 0]