Lecture 7 Difficult problems….and solutions Platypus (Ornithorhynchus anatinus)

Post on 31-Mar-2015


Lecture 7

Difficult problems….and solutions

Platypus (Ornithorhynchus anatinus)

Non-homogenous evolution

Taxon1 ACGTAAGTCATCGTAGC
Taxon2 ATGGAAATTATCGCGGT
Taxon3 ACATAAATCATCGTAGA
Taxon4 ACGCAAGTCATCGAAGT


Assuming equal substitution rates across sites

Allowing some sites to be invariant – reveals more parallel evolution among the variant sites

Mutations at some sites are lethal, so they are invariant

Rates can also differ among the variable sites because of fitness effects, differential mutability and codon bias, again leading homogeneous models to underestimate parallel change

Such rate variation can often be accommodated by assuming a gamma distribution of rates across sites in the likelihood (or distance) model
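As a minimal sketch of why ignoring rate variation underestimates change, the snippet below compares the Jukes-Cantor distance under equal rates with its gamma-corrected form. The Jukes-Cantor model, the observed proportion of differences and the shape parameter alpha are all assumptions chosen for illustration; they are not taken from the lecture.

```python
import math

def jc_distance(p):
    """Jukes-Cantor distance assuming equal rates across sites."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def jc_gamma_distance(p, alpha):
    """Jukes-Cantor distance with gamma-distributed rates (shape alpha)."""
    return 0.75 * alpha * ((1.0 - 4.0 * p / 3.0) ** (-1.0 / alpha) - 1.0)

p_observed = 0.30  # hypothetical proportion of differing sites
equal_rates = jc_distance(p_observed)
gamma_rates = jc_gamma_distance(p_observed, alpha=0.5)  # hypothetical shape

# The gamma-corrected estimate is larger: the equal-rates model misses
# multiple changes concentrated at the fast-evolving sites.
print(f"equal rates: {equal_rates:.3f}  gamma (alpha=0.5): {gamma_rates:.3f}")
```

As alpha grows large the rates become nearly equal and the two estimates converge.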

Non-homogenous data partitions

[Figure panels: Partition 1; Partition 2; Reconstructed under a single likelihood model]

Kolaczkowski and Thornton (Nature, 2004)

Rifleman     GTAACACTAGCC
Broadbill    GTCACACTAGCC
Flycatcher   GTTACATTAGCC
Lyrebird     GTTACTTTAGCA
Indigobird   GTAACCCTAGCC
ZebraFinch   GTAACCTTAGCA
Rook         GTAACTCTAGCA
Codon pos.   123123123123

Red for variable sites, most change at 3rd positions
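Since the slide colouring is lost in this transcript, the variable sites can be recovered directly from the alignment. The following is a small sketch that counts variable columns by codon position; the function name is a hypothetical one introduced here.

```python
alignment = {
    "Rifleman":   "GTAACACTAGCC",
    "Broadbill":  "GTCACACTAGCC",
    "Flycatcher": "GTTACATTAGCC",
    "Lyrebird":   "GTTACTTTAGCA",
    "Indigobird": "GTAACCCTAGCC",
    "ZebraFinch": "GTAACCTTAGCA",
    "Rook":       "GTAACTCTAGCA",
}

def variable_sites_by_codon_position(seqs):
    """Count alignment columns with more than one state, per codon position."""
    counts = {1: 0, 2: 0, 3: 0}
    for site, column in enumerate(zip(*seqs)):
        if len(set(column)) > 1:
            counts[site % 3 + 1] += 1  # sites are read as codons from position 1
    return counts

counts = variable_sites_by_codon_position(list(alignment.values()))
print(counts)  # {1: 1, 2: 0, 3: 3}
```

Three of the four variable sites fall at 3rd codon positions, matching the statement above.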

[Figure: alternative trees for reptiles, monotremes, marsupials and placentals: Marsupionta versus Theria]

Competing hypotheses for the interrelations of the mammalian sub-classes

Janke et al. (PNAS, 1997)

ML analysis of complete mitochondrial genome protein-coding sequences

Marsupionta

[Figure: proportion of constant sites plotted against purine base frequency]

Model                      df    AIC
TN93+I+Γ (concatenated)    40    162260.5
TN93+I+Γ (partitioned)     480   158054.3
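A short worked comparison of the two AIC scores, assuming the df column gives the number of free parameters k and that AIC = 2k - 2lnL. The log-likelihoods are back-calculated under that assumption and are not reported in the source.

```python
def log_likelihood_from_aic(aic, k):
    """Invert AIC = 2k - 2lnL to recover lnL (assumes k = free parameters)."""
    return (2 * k - aic) / 2.0

# (df, AIC) pairs as listed above
models = {
    "concatenated": (40, 162260.5),
    "partitioned": (480, 158054.3),
}

# Lower AIC is preferred; the difference is large despite 440 extra parameters.
delta_aic = models["concatenated"][1] - models["partitioned"][1]
print(f"AIC improvement from partitioning: {delta_aic:.1f}")

for name, (k, aic) in models.items():
    print(name, f"implied lnL = {log_likelihood_from_aic(aic, k):.2f}")
```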

Grouping of protein-coding and RNA-coding genes based on observed constant site proportions and purine base frequency. Groups: RNA loops; RNA stems; COI; NADH6; ATPase8, NADH2, NADH4L; ATPase6, NADH1, NADH3, NADH4, NADH5; COII, COIII, Cytb.

Partitioned ML: Theria is favoured

KH-test p-value - Phillips et al. (MPE, 2003)

[Figure: tree of marsupials, monotremes, placentals and reptiles, with the Theria grouping labelled]

Compositional heterogeneity

Stationarity: A standard assumption of most phylogeny reconstruction methods is that underlying substitution processes are the same across the tree

When violated, biases arise that provide signals in the data that can overwhelm the “true” phylogenetic signal

Shifting substitution processes (e.g. A→G being favoured in some branches but G→A in others) can result in signals for relationships arising due to similar DNA or protein sequence composition, rather than shared ancestry.

[Figure: NJ tree of Elephant, Platypus, Opossum, Bandicoot, Aardvark, Rook, Hippopotamus, Rhea, Vidua, Wallaroo, Brushtail Possum, Fin Whale, Mole, Armadillo, Green Turtle, Painted Turtle and Ostrich; node values 61, 53, 52 and 68]

Extreme example: NJ tree - mt 3rd codon positions, transitions only

Branch thickness proportional to T:C ratio

Composition χ² test (stochastic test)

Taxon            A        C        G        T
-----------------------------------------------
Rifleman        165      154       82       95
Broadbill       203      142       48      103
Flycatcher      195      115       60      126
Lyrebird        138      142      127       89
Indigobird      137      144      128       87
Zebra Finch     141      143      124       88
Rook            145      144      118       89
Expected     160.57   140.57    98.14    96.71

$$\chi^2 = \sum \frac{(\mathrm{Exp} - \mathrm{Obs})^2}{\mathrm{Exp}} = 119.211273$$

df = (n-1)(t-1) = 18 (4 character states, 7 taxa), P < 0.0001

Tells only of the presence of a bias and is unreliable when most of the variation occurs among a small number of character states
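The χ² statistic above can be reproduced from the base counts. This is a sketch in plain Python; expected counts are taken as row total × column total / grand total, which equals the column mean here because every taxon has the same number of sites.

```python
counts = {
    "Rifleman":    (165, 154, 82, 95),
    "Broadbill":   (203, 142, 48, 103),
    "Flycatcher":  (195, 115, 60, 126),
    "Lyrebird":    (138, 142, 127, 89),
    "Indigobird":  (137, 144, 128, 87),
    "Zebra Finch": (141, 143, 124, 88),
    "Rook":        (145, 144, 118, 89),
}

def composition_chi_square(table):
    """Chi-square test of base-composition homogeneity across taxa."""
    rows = list(table.values())
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(c) for c in zip(*rows)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for r, row_total in zip(rows, row_totals):
        for obs, col_total in zip(r, col_totals):
            exp = row_total * col_total / grand_total
            chi2 += (exp - obs) ** 2 / exp
    df = (len(col_totals) - 1) * (len(rows) - 1)
    return chi2, df

chi2, df = composition_chi_square(counts)
print(f"chi-square = {chi2:.4f}, df = {df}")
```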

Relative compositional variability (magnitude metric)

Allows the magnitude of compositional heterogeneity to be compared between sequences or coding regimes (for the same taxa)

$$\mathrm{RCV} = \sum_{i=1}^{n} \frac{|A_i - A^*| + |T_i - T^*| + |C_i - C^*| + |G_i - G^*|}{n \cdot t}$$

Where A_i is the observed frequency of adenine for taxon i, A* is the average frequency of adenine across all taxa, n is the number of taxa and t is the number of sites
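A minimal sketch of the RCV calculation, applied to the seven-bird base counts from the composition test. It assumes the per-taxon "frequencies" are raw base counts (so that dividing by the number of sites t is meaningful) and that t = 496, the row total of that table.

```python
bird_counts = {
    # taxon: (A, C, G, T) counts
    "Rifleman":    (165, 154, 82, 95),
    "Broadbill":   (203, 142, 48, 103),
    "Flycatcher":  (195, 115, 60, 126),
    "Lyrebird":    (138, 142, 127, 89),
    "Indigobird":  (137, 144, 128, 87),
    "Zebra Finch": (141, 143, 124, 88),
    "Rook":        (145, 144, 118, 89),
}

def relative_compositional_variability(table, n_sites):
    """Sum of absolute deviations from mean base counts, scaled by n * t."""
    rows = list(table.values())
    n_taxa = len(rows)
    means = [sum(col) / n_taxa for col in zip(*rows)]
    total_deviation = sum(abs(x - m) for r in rows for x, m in zip(r, means))
    return total_deviation / (n_taxa * n_sites)

rcv_value = relative_compositional_variability(bird_counts, n_sites=496)
print(f"RCV = {rcv_value:.4f}")
```

Because the result is scaled by both n and t, values can be compared between data sets or coding regimes of different length for the same taxa.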

Accounting for compositional heterogeneity

1. LogDet distances - recover additive distances between sequences when base composition varies

For each pair of DNA sequences x and y, a 4 × 4 matrix with each possible pair of sites

Site-pair counts for Olithodiscus (x) against Euglena (y), states in the order A, C, G, T:

        A     C     G     T
A     224     5    24     8
C       3   149     1    16
G      24     5   230     4
T       5    19     8   175

Fxy (each count divided by the total of 900 sites) =

        A      C      G      T
A   0.249  0.006  0.027  0.009
C   0.003  0.166  0.001  0.018
G   0.027  0.006  0.256  0.004
T   0.006  0.021  0.009  0.194

Dxy = -ln[det Fxy] = 6.216
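The LogDet value can be checked from the count matrix. The sketch below uses a small hand-written determinant (Gaussian elimination with partial pivoting) so that it needs only the standard library; the function names are hypothetical.

```python
import math

pair_counts = [
    [224, 5, 24, 8],
    [3, 149, 1, 16],
    [24, 5, 230, 4],
    [5, 19, 8, 175],
]

def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [[float(x) for x in row] for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det

def logdet_distance(counts):
    """Dxy = -ln det(Fxy), with Fxy the matrix of site-pair proportions."""
    total = sum(sum(row) for row in counts)
    f = [[x / total for x in row] for row in counts]
    return -math.log(determinant(f))

d_xy = logdet_distance(pair_counts)
print(f"Dxy = {d_xy:.3f}")
```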

Rates-across-sites LogDet has yet to be developed, so this method is often inconsistent due to poor branch-length estimation

[Figure: trees of Euglena, Liverwort, Chlamydomonas, Rice, Tobacco, Anacystis, Chlorella and Olithodiscus from (a) Jukes-Cantor distances and (b) LogDet distances; taxa keyed by pigment type: chlorophyll a/b, chlorophyll a/c, phycobilin, uncertain]

Lockhart et al. (MBE, 1994)

2. Non-homogenous base composition Maximum likelihood

Galtier and Gouy (MBE, 1998)

[Figure: rooted tree annotated with root G+C% ω, root location Φ, branch lengths λ1 to λ7 and equilibrium G+C% θ1 to θ7]

Parameter            Symbol   Number
root G+C%            ω        1
branch-length        λ        2n-3
root location        Φ        1
Ts/Tv ratio          κ        1
equilibrium G+C%     θ        2n-2

Limitations
1. restricted to GC vs. AT bias
2. computer time intensive
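To make the cost of the model concrete, the parameter counts in the table can be summed for a given number of taxa n. The helper name below is hypothetical; the per-parameter counts are those listed above.

```python
def nonhomogeneous_parameter_count(n_taxa):
    """Free parameters of the non-homogeneous G+C model for n taxa."""
    counts = {
        "root G+C% (omega)": 1,
        "branch lengths (lambda)": 2 * n_taxa - 3,
        "root location (phi)": 1,
        "Ts/Tv ratio (kappa)": 1,
        "equilibrium G+C% (theta)": 2 * n_taxa - 2,
    }
    return counts, sum(counts.values())

for n in (4, 8, 20):
    _, total = nonhomogeneous_parameter_count(n)
    print(f"n = {n}: {total} free parameters")  # total equals 4n - 2
```

The total grows as 4n - 2, roughly double the number of branch lengths alone, which is consistent with the method being computer time intensive.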

3. Character state re-coding

• Often much of the compositional heterogeneity arises within specific classes of character state

e.g. Purine and Pyrimidine transitions

These can be re-coded: RY-coding involves A,G → R and C,T → Y

• Similarly, amino acids can be lumped into functionally similar groups, e.g. valine, leucine and isoleucine as a single category of mid-sized aliphatic amino acids.
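RY-coding is simple to apply in practice. A minimal sketch follows (the function name is hypothetical); amino acid grouping would work the same way with a larger translation table.

```python
# Purines (A, G) become R; pyrimidines (C, T) become Y.
RY_CODE = str.maketrans({"A": "R", "G": "R", "C": "Y", "T": "Y"})

def ry_recode(sequence):
    """Recode a DNA sequence so that only transversions remain visible."""
    return sequence.upper().translate(RY_CODE)

print(ry_recode("GTAACACTAGCC"))  # RYRRYRYYRRYY
```

After recoding, an A/G or C/T difference between two taxa disappears, so compositional bias that acts through transitions no longer contributes to the signal.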

Nardi et al. (Science, 2003) found Hexapoda to be paraphyletic

Delsuc et al. (Science, 2003)

1st and 3rd codon positions RY-coded

RCVnt = 0.1064 RCVry = 0.0413

[Figure: tree with the Hexapoda clade labelled]

Mistaking precision for accuracy

106 nuclear genes: Different methods provide conflicting Yeast topologies, each with 100% bootstrap support

The results underline the importance of understanding how non-phylogenetic signals will bias inference under the model used

Phillips et al. (MBE, 2004)

Not enough phylogenetic signal to resolve the tree

Branch-length too short Ans. Increase gene sequencing

Signal erosion with time Ans. Use high-value (often slower evolving) characters

Long unbroken branches make for “noisier” data Ans. Increase taxon sampling

Stemminess (Fiala and Sokal: Evol., 1985) on uncorrected distance trees indicates the relative extent of phylogenetic signal erosion among alternative sequences (or coding regimes) for the same taxa

$$\mathrm{Stemminess} = \frac{\sum \text{internal branch-lengths}}{\text{total tree-length}}$$

Greater phylogenetic signal retention for slower evolving genes results in higher stemminess
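A minimal sketch of the calculation, taking stemminess as the share of total tree length carried by internal branches. That reading is an assumption, chosen because it is the one under which better signal retention gives a higher value. The branch lengths are hypothetical.

```python
def stemminess(internal_lengths, external_lengths):
    """Proportion of total tree length contributed by internal branches."""
    internal = sum(internal_lengths)
    total = internal + sum(external_lengths)
    return internal / total

# Hypothetical uncorrected-distance trees for the same four taxa
saturated = stemminess([0.01], [0.20, 0.22, 0.19, 0.21])   # short stem, long tips
conserved = stemminess([0.05], [0.03, 0.04, 0.03, 0.04])   # relatively long stem

print(f"saturated data: {saturated:.3f}, conserved data: {conserved:.3f}")
```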

[Figure: uncorrected distance trees for Tigercat, Dunnart, Wombat, Brushtail, Wallaroo, Monodelphis, Opossum, Spiny Bandicoot and Northern Brown Bandicoot]

12 mitochondrial protein-coding genes: Stemminess = 0.086

5 nuclear protein-coding genes: Stemminess = 0.440

Saturation – the problem of multiple changes at the same sites

• Theory, simulations, and practical experience all indicate that the sequences must eventually lose information about events that were long ago.

• Part of the problem with using DNA sequence alignments to infer deep events is that the state space is small {A,C,G,T}
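The effect of a four-state space can be illustrated with the expected proportion of differing sites as a function of the true number of substitutions per site. The Jukes-Cantor model is assumed here purely for illustration.

```python
import math

def expected_p_distance(d):
    """Expected observed difference after d substitutions per site (Jukes-Cantor)."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

for d in (0.1, 0.5, 1.0, 2.0, 5.0):
    print(f"true distance {d:>4}: observed difference {expected_p_distance(d):.3f}")
```

The observed difference flattens towards 0.75, the value expected between two random sequences, so deep divergences of very different ages become nearly indistinguishable.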

Other sorts of characters

• In an idealised situation where each site had an infinite state space there would be no parallel changes or reversals and our character matrices would be homoplasy free.

• Obviously it is interesting to try and find characters that are closer to this ideal than DNA sequences.

SINEs and LINEs

• SINEs (and LINEs) are Short (or Long) interspersed nuclear elements.

• Retrotransposed DNA elements that are copied into the genome.

• Low expectations for the same retrotransposon sequence to insert in exactly the same position independently (low homoplasy markers)

Taxon1 ATGCT-------//-------GTCTAGT
Taxon2 AGGCTGTTATGT//TCTCTAGGTCAAGT
Taxon3 ATGCTGCTATGT//TCTCTAGGTCTATT
Taxon4 ATACT-------//-------GTATAGT

Insertion event 1 into chromosome A

The SINE/LINE is copied from locus 1 on chromosome A to locus 2 on chromosome B

Locus 2 sequence

Taxon3 (present at locus 1 and locus 2)

Taxon2 (present at locus 1 and locus 2)

Taxon4 (only present at locus 1)

Taxon1 (not present at locus 1 or locus 2)
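Insertions like this are usually scored as presence/absence characters. The sketch below scores the locus 2 alignment from this section, treating a taxon as carrying the element when it has no gaps in the columns where other taxa are gapped. This is a simplifying assumption for illustration, not a general insertion caller.

```python
locus2 = {
    "Taxon1": "ATGCT-------//-------GTCTAGT",
    "Taxon2": "AGGCTGTTATGT//TCTCTAGGTCAAGT",
    "Taxon3": "ATGCTGCTATGT//TCTCTAGGTCTATT",
    "Taxon4": "ATACT-------//-------GTATAGT",
}

def score_insertion(seqs):
    """Return 1 for taxa carrying the inserted element, 0 for taxa lacking it."""
    gap_columns = [i for i, col in enumerate(zip(*seqs.values())) if "-" in col]
    return {
        name: int(all(seq[i] != "-" for i in gap_columns))
        for name, seq in seqs.items()
    }

presence = score_insertion(locus2)
print(presence)  # {'Taxon1': 0, 'Taxon2': 1, 'Taxon3': 1, 'Taxon4': 0}
```

The shared presence in Taxon2 and Taxon3 is the single character supporting their grouping.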

Competing hypothesis for the position of the whales

SINEs and LINEs provide homoplasy free support for the position of the whales as sister group to the hippos.

Genome-order based phylogeny

Large state-space

• DNA sequences: 4 states per site

• Signed circular genomes with n genes: $2^{n-1}(n-1)!$ states, 1 site

• Circular genomes (1 site)

– with 37 genes: $2.56 \times 10^{52}$ states

– with 120 genes: $3.70 \times 10^{232}$ states
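The state counts for signed circular genomes follow from 2^(n-1)(n-1)!, and can be checked with exact integer arithmetic. The function name is hypothetical.

```python
import math

def signed_circular_genome_states(n_genes):
    """Number of distinct signed circular gene orders: 2^(n-1) * (n-1)!"""
    return 2 ** (n_genes - 1) * math.factorial(n_genes - 1)

for n in (37, 120):
    states = signed_circular_genome_states(n)
    print(f"{n} genes: {float(states):.2e} states")
```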

Genome rearrangements

[Figure panels: Reference sequence; Inversion (of orange and blue); Transposition (of grey); Inverted transposition (of grey). Arrows indicate sequence read direction]

Breakpoint Distance

• Breakpoint distance=5

1 2 3 4 5 6 7 8 9 10

1 –3 –2 4 5 9 6 7 8 10
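A sketch of the breakpoint count for the two gene orders above. The genomes are treated as linear, and an adjacency (a, b) is taken as conserved when the other genome contains either (a, b) or its reverse complement (-b, -a).

```python
def breakpoint_distance(reference, genome):
    """Count adjacencies of the reference that are broken in the other genome."""
    adjacencies = set()
    for a, b in zip(genome, genome[1:]):
        adjacencies.add((a, b))
        adjacencies.add((-b, -a))  # same adjacency read on the other strand
    return sum((a, b) not in adjacencies for a, b in zip(reference, reference[1:]))

reference = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
rearranged = [1, -3, -2, 4, 5, 9, 6, 7, 8, 10]

distance = breakpoint_distance(reference, rearranged)
print(distance)  # 5
```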

Minimum Inversion Distance

1 2 3 4 5 6 7 8 9 10

1 2 3 –8 –7 –6 –5 –4 9 10

1 8 –3 –2 –7 –6 –5 –4 9 10

1 8 –3 7 2 –6 –5 –4 9 10

• Inversion distance=3
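The three inversions in the series above can be replayed step by step. This sketch only applies the given inversions; it does not search for the minimum number, which is the hard part of the problem. The index pairs are zero-based, end-exclusive slices chosen to match the example.

```python
def invert(genome, start, end):
    """Reverse genome[start:end] and flip the sign of each gene in it."""
    segment = [-gene for gene in reversed(genome[start:end])]
    return genome[:start] + segment + genome[end:]

genome = list(range(1, 11))
history = [genome]
for start, end in [(3, 8), (1, 4), (3, 5)]:
    genome = invert(genome, start, end)
    history.append(genome)

for step in history:
    print(" ".join(str(gene) for gene in step))
```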

Distance-based methods

Tandy Warnow, UT-Austin

Maximum Parsimony on Rearranged Genomes (MPRG)

• The leaves are rearranged genomes.

• Find the tree that minimizes the total number of rearrangement events

[Figure: tree with leaves A, B, C and D and internal nodes E and F; edge lengths 3, 6, 2, 3 and 4]

Total length = 18


Mitochondrial genome rearrangement maximum parsimony

Fritzsch et al. (J.Theor. Biol., 2006)

Data choice and analytical methods are in their infancy

Note non-monophyly of Nematoda and Mollusca; Well resolved sequence and morphology clades


An additional possibility is that there are multiple signals: 1. biases in the data (e.g. compositional heterogeneity); 2. genes with different histories (e.g. lineage sorting or hybridization)

If a gene has a long coalescent time, then its relationships among taxa may differ from the species tree

[Figure: gene tree drawn within the species tree for taxa A, B, C and D]

Molecular dating

e.g. Zuckerkandl and Pauling (J. Theor. Biol., 1965)

The molecular clock

[Figure: genetic change plotted against time since divergence]

[Figure: genetic divergence plotted against time since divergence, observed and corrected for saturation, for Human-Chimpanzee, Human-Mouse and Human-Bird comparisons]

Is the data clock-like?

Can the deviation from an ultrametric tree be explained by the stochastic nature of substitution (sampling error), or do substitution rates differ across the tree?

Relative rates tests

H0: Two sister taxa are evolving at the same rate (by comparison with an outgroup)

Hebsgaard et al. (TIM, 2005)

Molecular clock likelihood ratio test

H0: That a clock model explains the data as well as a non-clock model

1. Optimise the likelihood of the (unrooted) tree under a non-clock model (lnLn)

2. Optimise the likelihood of the (rooted) tree under a clock model (lnLc)

3. Calculate the test statistic = 2(lnLn minus lnLc)

4. This is compared to a χ² distribution critical value (where the degrees of freedom are the difference in the number of free parameters being estimated between the two models = n − 2, with n the number of taxa)
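The four steps can be sketched as follows. The statistic is written as 2(lnLn − lnLc) so that it is non-negative, since the non-clock model can never fit worse. The log-likelihoods and taxon number are hypothetical, and the χ² tail probability is computed with a small standard-library helper rather than a statistics package.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    half_x = x / 2.0
    if df % 2:  # odd df: start from df = 1
        tail, k = math.erfc(math.sqrt(half_x)), 0.5
    else:       # even df: start from df = 2
        tail, k = math.exp(-half_x), 1.0
    while k < df / 2.0:  # step up two degrees of freedom at a time
        tail += half_x ** k * math.exp(-half_x) / math.gamma(k + 1.0)
        k += 1.0
    return tail

def clock_lrt(lnl_clock, lnl_nonclock, n_taxa):
    """Likelihood ratio test of a strict clock against a non-clock model."""
    statistic = 2.0 * (lnl_nonclock - lnl_clock)
    df = n_taxa - 2
    return statistic, df, chi2_sf(statistic, df)

# Hypothetical optimised log-likelihoods for a 9-taxon data set
statistic, df, p_value = clock_lrt(lnl_clock=-5241.9, lnl_nonclock=-5230.4, n_taxa=9)
print(f"statistic = {statistic:.2f}, df = {df}, p = {p_value:.4f}")
```

A p-value below the chosen threshold rejects the clock for these data.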

Linearized trees: Takezaki et al. (MBE, 1995)

Prune the taxa that are the most non-clock-like until the molecular clock likelihood ratio test is passed

Concerns: 1. removing any branches reduces the power of the test (so increases the probability of passing) and 2. remaining branches may hide complementary rate shifts that cancel out

Relaxing the molecular clock

1. Local clocks

[Figure: tree with rate classes r1, r2 and r3 assigned to clades]

Relies on the identification of rate classes with respect to clades

2. Autocorrelated rate evolution

[Figure: tree with branch rates r1 to r10]

Each rate ri is a function of the rate of its parent branch. Many different models of rate change have been applied including: quadratic, lognormal, exponential, gamma, Ornstein-Uhlenbeck

3. Uncorrelated rate evolution

[Figure: tree with branch rates r1 to r10]

Method of Drummond et al. (PLoS Biol., 2006)

Rates ri do not depend on the rate of their parent branch, but are drawn from a lognormal or exponential distribution that maximises the posterior probability of the tree
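The difference between autocorrelated and uncorrelated rates can be illustrated by simulating branch rates on a small tree. This is a sketch only: the tree, the lognormal parameterisation and all parameter values are hypothetical, and it simulates rates rather than estimating them as the published methods do.

```python
import math
import random

# Hypothetical tree: each branch maps to its parent branch (None at the root).
# Parents are listed before their children.
PARENT = {"r1": None, "r2": None, "r3": "r1", "r4": "r1", "r5": "r2", "r6": "r2"}

def autocorrelated_rates(parent, root_rate, sigma, rng):
    """Each rate is lognormal, centred (on the log scale) on its parent's rate."""
    rates = {}
    for branch, par in parent.items():
        base = root_rate if par is None else rates[par]
        rates[branch] = rng.lognormvariate(math.log(base), sigma)
    return rates

def uncorrelated_rates(parent, mean_log_rate, sigma, rng):
    """Each rate is an independent draw from one shared lognormal distribution."""
    return {branch: rng.lognormvariate(mean_log_rate, sigma) for branch in parent}

rng = random.Random(1)
correlated = autocorrelated_rates(PARENT, root_rate=1.0, sigma=0.3, rng=rng)
independent = uncorrelated_rates(PARENT, mean_log_rate=0.0, sigma=0.3, rng=rng)

for branch in PARENT:
    print(branch, f"{correlated[branch]:.3f}", f"{independent[branch]:.3f}")
```

Under the autocorrelated model, a fast parent branch tends to have fast daughter branches; under the uncorrelated model, neighbouring branches carry no information about each other.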

Performance of correlated rates methods on trees simulated under uncorrelated rates among branches

Calibrating molecular clocks

Biogeographical divergences

e.g. New Zealand split from Gondwana about 80 million years ago and so did some of New Zealand’s endemic fauna

Fossils that post-date divergences

[Figure: dated tree of Penguins, Albatross and Ducks, with a 61 Ma calibration and a 90 Ma estimate]

Slack et al. (MBE, 2006)

[Figure: calibration priors along a time axis: point calibration; calibration bounds (upper, lower) with a flat prior; normal prior]

Using a lognormal (19Ma-25Ma upper 95%, mean=21Ma) calibration for cats/hyaenas

Barnett et al. (Curr. Biol., 2005)

[Figure: time axis from 25 to 0 millions of years ago]