Week 10: Coalescents, Consensus trees, etc.
Genome 570
March, 2016
Week 10: Coalescents, Consensus trees, etc. – p.1/87
Cann, Stoneking, and Wilson
Becky Cann Mark Stoneking the late Allan Wilson
Cann, R. L., M. Stoneking, and A. C. Wilson. 1987. Mitochondrial DNAand human evolution. Nature 325:a 31-36.
Week 10: Coalescents, Consensus trees, etc. – p.2/87
Mitochondrial Eve
Week 10: Coalescents, Consensus trees, etc. – p.3/87
Gene copies in a population of 10 individuals
Time
A random−mating population
Week 10: Coalescents, Consensus trees, etc. – p.4/87
Going back one generation
Time
A random−mating population
Week 10: Coalescents, Consensus trees, etc. – p.5/87
... and one more
Time
A random−mating population
Week 10: Coalescents, Consensus trees, etc. – p.6/87
... and one more
Time
A random−mating population
Week 10: Coalescents, Consensus trees, etc. – p.7/87
... and one more
Time
A random−mating population
Week 10: Coalescents, Consensus trees, etc. – p.8/87
... and one more
Time
A random−mating population
Week 10: Coalescents, Consensus trees, etc. – p.9/87
... and one more
Time
A random−mating population
Week 10: Coalescents, Consensus trees, etc. – p.10/87
... and one more
Time
A random−mating population
Week 10: Coalescents, Consensus trees, etc. – p.11/87
... and one more
Time
A random−mating population
Week 10: Coalescents, Consensus trees, etc. – p.12/87
... and one more
Time
A random−mating population
Week 10: Coalescents, Consensus trees, etc. – p.13/87
... and one more
Time
A random−mating population
Week 10: Coalescents, Consensus trees, etc. – p.14/87
... and one more
Time
A random−mating population
Week 10: Coalescents, Consensus trees, etc. – p.15/87
The genealogy of gene copies is a tree
Time
Genealogy of gene copies, after reordering the copies
Week 10: Coalescents, Consensus trees, etc. – p.16/87
Ancestry of a sample of 3 copies
Time
Genealogy of a small sample of genes from the population
Week 10: Coalescents, Consensus trees, etc. – p.17/87
Here is that tree of 3 copies in the pedigree
Time
Week 10: Coalescents, Consensus trees, etc. – p.18/87
Kingman’s coalescent
Random collision of lineages as go back in time (sans recombination)
Collision is faster the smaller the effective population size
u9
u7
u5
u3
u8
u6
u4
u2
Average time for n
Average time for
copies to coalesce to
4N
k(k−1) k−1 =
In a diploid population of
effective population size N,
copies to coalesce
= 4N (1 − 1
n
( generations
k
Average time for
two copies to coalesce
= 2N generations
What’s misleading about this diagram: the lineages that coalesce arerandom pairs, not necessarily ones that are next to each other in a linearorder.
Week 10: Coalescents, Consensus trees, etc. – p.19/87
The Wright-Fisher model
This is the canonical model of genetic drift in populations. It was invented
in 1930 and 1932 by Sewall Wright and R. A. Fisher.
In this model the next generation is produced by doing this:
Choose two individuals with replacement (including the possibility thatthey are the same individual) to be parents,
Each produces one gamete, these become a diploid individual,
Repeat these steps until N diploid individuals have been produced.
The effect of this is to have each locus in an individual in the nextgeneration consist of two genes sampled from the parents’ generation atrandom, with replacement.
Week 10: Coalescents, Consensus trees, etc. – p.20/87
The coalescent – a derivation
The probability that k lineages becomes k − 1 one generation earlier
turns out to be (as each lineage “chooses” its ancestor independently):
k(k − 1)/2 × Prob (First two have same parent, rest are different)
(since there are(k2
)= k(k − 1)/2 different pairs of copies)
We add up terms, all the same, for the k(k − 1)/2 pairs that couldcoalesce; the sum is:
k(k − 1)/2 × 1 × 12N
×(1 − 1
2N
)
×(1 − 2
2N
)× · · · ×
(1 − k−2
2N
)
so that the total probability that a pair coalesces is
= k(k − 1)/4N + O(1/N2)
Week 10: Coalescents, Consensus trees, etc. – p.21/87
Can probabilities of two or more lineages coalescing
Note that the total probability that some combination of lineagescoalesces is
1 − Prob (Probability all genes have separate ancestors)
= 1 −
[
1 ×
(
1 −1
2N
) (
1 −2
2N
)
. . .
(
1 −k − 1
2N
)]
= 1 −
[
1 −1 + 2 + 3 + · · · + (k − 1)
2N+ O(1/N2)
]
and since1 + 2 + 3 + . . . + (n − 1) = n(n − 1)/2
the quantity
= 1 −[
1 − k(k − 1)/4N + O(1/N2)]≃ k(k − 1)/4N + O(1/N2)
Week 10: Coalescents, Consensus trees, etc. – p.22/87
Can calculate how many coalescences are of pairs
This shows, since the terms of order 1/N are the same, that the eventsinvolving 3 or more lineages simultaneously coalescing are in the terms of
order 1/N2 and thus become unimportant if N is large.
Here are the probabilities of 0, 1, or more coalescences with 10 lineages
in populations of different sizes:
N 0 1 > 1
100 0.79560747 0.18744678 0.016945751000 0.97771632 0.02209806 0.0001856210000 0.99775217 0.00224595 0.00000187
Note that increasing the population size by a factor of 10 reduces the
coalescent rate for pairs by about 10-fold, but reduces the rate for triples(or more) by about 100-fold.
Week 10: Coalescents, Consensus trees, etc. – p.23/87
The coalescent
To simulate a random genealogy, do the following:
1. Start with k lineages
2. Draw an exponential time interval with mean 4N/(k(k − 1))generations.
3. Combine two randomly chosen lineages.
4. Decrease k by 1.
5. If k = 1, then stop
6. Otherwise go back to step 2.
Week 10: Coalescents, Consensus trees, etc. – p.24/87
An accurate analogy: Bugs In A Box
There is a box ...
Week 10: Coalescents, Consensus trees, etc. – p.25/87
An accurate analogy: Bugs In A Box
with bugs that are ...
Week 10: Coalescents, Consensus trees, etc. – p.26/87
An accurate analogy: Bugs In A Box
hyperactive, ...
Week 10: Coalescents, Consensus trees, etc. – p.27/87
An accurate analogy: Bugs In A Box
indiscriminate, ...
Week 10: Coalescents, Consensus trees, etc. – p.28/87
An accurate analogy: Bugs In A Box
voracious ...
Week 10: Coalescents, Consensus trees, etc. – p.29/87
An accurate analogy: Bugs In A Box
(eats other bug) ...
Gulp!
Week 10: Coalescents, Consensus trees, etc. – p.30/87
An accurate analogy: Bugs In A Box
and insatiable.
Week 10: Coalescents, Consensus trees, etc. – p.31/87
Random coalescent trees with 16 lineages
O C S M L P K E J I T R H Q F B N D G A M J B F G C E R A S Q K N L H T I P D O B G T M L Q D O F K P E A I J S C H R N F R N L M D H B T C Q S O G P I A K J E
I Q C A J L S G P F O D H B M E T R K N R C L D K H O Q F M B G S I T P A J E N N M P R H L E S O F B G J D C I T K Q A N H M C R P G L T E D S O I K J Q F A B
Week 10: Coalescents, Consensus trees, etc. – p.32/87
Coalescence is faster in small populations
Change of population size and coalescents
Ne
time
the changes in population size will produce waves of coalescence
time
Coalescence events
time
the tree
The parameters of the growth curve for Ne can be inferred by
likelihood methods as they affect the prior probabilities of those trees
that fit the data.
Week 10: Coalescents, Consensus trees, etc. – p.33/87
Migration can be taken into account
Time
population #1 population #2Week 10: Coalescents, Consensus trees, etc. – p.34/87
Recombination creates loops
Recomb.
Different markers have slightly different coalescent trees
Week 10: Coalescents, Consensus trees, etc. – p.35/87
If we have a sample of 50 copies
50−gene sample in a coalescent tree
Week 10: Coalescents, Consensus trees, etc. – p.36/87
The first 10 account for most of the branch length
10 genes sampled randomly out of a
50−gene sample in a coalescent tree
Week 10: Coalescents, Consensus trees, etc. – p.37/87
... and when we add the other 40 they add less length
10 genes sampled randomly out of a
50−gene sample in a coalescent tree
(purple lines are the 10−gene tree)Week 10: Coalescents, Consensus trees, etc. – p.38/87
We want to be able to analyze human evolution
Africa
Europe Asia
"Out of Africa" hypothesis
(vertical scale is not time or evolutionary change)
Week 10: Coalescents, Consensus trees, etc. – p.39/87
coalescent and “gene trees” versus species trees
The species tree
Week 10: Coalescents, Consensus trees, etc. – p.40/87
coalescent and “gene trees” versus species trees
Consistency of gene tree with species tree
Week 10: Coalescents, Consensus trees, etc. – p.41/87
coalescent and “gene trees” versus species trees
Consistency of gene tree with species tree
Week 10: Coalescents, Consensus trees, etc. – p.42/87
coalescent and “gene trees” versus species trees
Consistency of gene tree with species tree
Week 10: Coalescents, Consensus trees, etc. – p.43/87
coalescent and “gene trees” versus species trees
Consistency of gene tree with species tree
Week 10: Coalescents, Consensus trees, etc. – p.44/87
coalescent and “gene trees” versus species trees
Consistency of gene tree with species tree
coalescence time
Week 10: Coalescents, Consensus trees, etc. – p.45/87
If the branch is more than Ne generations long ...
t1
t2
N1
N2
N4
N3
N5
Gene tree and Species tree
Week 10: Coalescents, Consensus trees, etc. – p.46/87
If the branch is more than Ne generations long ...
t1
t2
N1
N2
N4
N3
N5
Gene tree and Species tree
Week 10: Coalescents, Consensus trees, etc. – p.47/87
If the branch is more than Ne generations long ...
t1
t2
N1
N2
N4
N3
N5
Gene tree and Species tree
Week 10: Coalescents, Consensus trees, etc. – p.48/87
How do we compute a likelihood for a population sample?
CAGTTTTAGCGTCC
CAGTTTTAGCGTCC
CAGTTTTAGCGTCC
CAGTTTTAGCGTCC
CAGTTTTAGCGTCC
CAGTTTTAGCGTCC
CAGTTTTAGCGTCC
CAGTTTTAGCGTCC
CAGTTTTAGCGTCC
CAGTTTTAGCGTCC
CAGTTTTAGCGTCC
CAGTTTCAGCGTCC
CAGTTTCAGCGTCC
CAGTTTCAGCGTCC
CAGTTTCAGCGTCC
CAGTTTCAGCGTCC
CAGTTTCAGCGTCC
CAGTTTCAGCGTCC
CAGTTTCAGCGTCC
CAGTTTTGGCGTCC
CAGTTTTGGCGTCCCAGTTTTGGCGTCC
CAGTTTTGGCGTCC
CAGTTTTGGCGTCC
CAGTTTCAGCGTAC
CAGTTTCAGCGTAC
CAGTTTCAGCGTAC
, CAGTTTCAGCGTCC CAGTTTCAGCGTCC ), ... L = Prob ( = ??
Week 10: Coalescents, Consensus trees, etc. – p.49/87
If we have a tree for the sample sequences, we can
CAGTTTTAGCGTCC
CAGTTTTAGCGTCC
CAGTTTTAGCGTCC
CAGTTTTAGCGTCC
CAGTTTTAGCGTCC
CAGTTTTAGCGTCC
CAGTTTTAGCGTCC
CAGTTTTAGCGTCC
CAGTTTTAGCGTCC
CAGTTTCAGCGTCC
CAGTTTCAGCGTCC
CAGTTTCAGCGTCC
CAGTTTCAGCGTCC
CAGTTTTGGCGTCCCAGTTTTGGCGTCC
CAGTTTTGGCGTCC
CAGTTTTGGCGTCC
CAGTTTCAGCGTAC
CAGTTTCAGCGTAC
CAGTTTCAGCGTAC
CAGTTTCAGCGTCC
, CAGTTTCAGCGTCC CAGTTTCAGCGTCCProb( | Genealogy)
so we can compute
but how to computer the overall likelihood from this?
, ...
CAGTTTCAGCGTCC
CAGTTTTAGCGTCCCAGTTTTAGCGTCC
CAGTTTCAGCGTCCCAGTTTTGGCGTCC
CAGTTTCAGCGTCC
Week 10: Coalescents, Consensus trees, etc. – p.50/87
The basic equation for coalescent likelihoods
In the case of a single population with parameters
Ne effective population sizeµ mutation rate per site
and assuming G′ stands for a coalescent genealogy and D for the
sequences,
L = Prob (D | Ne, µ)
=∑
G′
Prob (G′ | Ne) Prob (D | G′, µ)
︸ ︷︷ ︸ ︸ ︷︷ ︸
Kingman′s prior likelihood of tree
Week 10: Coalescents, Consensus trees, etc. – p.51/87
Rescaling the branch lengths
Rescaling branch lengths of G′ so that branches are given in expected
mutations per site, G = µG′ , we get (if we let Θ = 4Neµ )
L =∑
G
Prob (G | Θ) Prob (D | G)
as the fundamental equation. For more complex population scenarios onesimply replaces Θ with a vector of parameters.
Week 10: Coalescents, Consensus trees, etc. – p.52/87
The variability comes from two sources
Ne
Ne
can reduce variability by looking at
(i) more gene copies, or
(2) Randomness of coalescence of lineages
affected by the
can reduce variance of
branch by examining more sites
number of mutations per site per
mutation rate
(1) Randomness of mutation
affected by effective population size
coalescence times allow estimation of
µ
(ii) more loci
Week 10: Coalescents, Consensus trees, etc. – p.53/87
Computing the likelihood: averaging over coalescents
t
t
Lik
elih
oo
d o
f t
Lik
elih
oo
d o
f
The product of the prior on t,
times the likelihood of that t from the data,
when integrated over all possible t’s, gives the
likelihood for the underlying parameter
The likelihood calculation in a sample of two gene copies
t
1Θ
Θ
Prio
r P
rob
of
t
Θ1
Θ
Θ
Week 10: Coalescents, Consensus trees, etc. – p.54/87
Computing the likelihood: averaging over coalescents
t
t
Lik
elih
oo
d o
f t
Lik
elih
oo
d o
f
The product of the prior on t,
times the likelihood of that t from the data,
when integrated over all possible t’s, gives the
likelihood for the underlying parameter
The likelihood calculation in a sample of two gene copies
t
2ΘΘ
Prio
r P
rob
of
t
2Θ
Θ
Θ
Week 10: Coalescents, Consensus trees, etc. – p.55/87
Computing the likelihood: averaging over coalescents
t
t
Lik
elih
oo
d o
f t
Lik
elih
oo
d o
f
The product of the prior on t,
times the likelihood of that t from the data,
when integrated over all possible t’s, gives the
likelihood for the underlying parameter
The likelihood calculation in a sample of two gene copies
t
3Θ
Θ
Prio
r P
rob
of
t
3Θ
Θ
Θ
Week 10: Coalescents, Consensus trees, etc. – p.56/87
Computing the likelihood: averaging over coalescents
t
t
Lik
elih
oo
d o
f t
Lik
elih
oo
d o
f
The product of the prior on t,
times the likelihood of that t from the data,
when integrated over all possible t’s, gives the
likelihood for the underlying parameter
The likelihood calculation in a sample of two gene copies
t
1Θ
2Θ
3Θ
Θ
Prio
r P
rob
of
t
2Θ
3Θ
Θ1
Θ
Θ
Week 10: Coalescents, Consensus trees, etc. – p.57/87
Labelled histories
Labelled Histories (Edwards, 1970; Harding, 1971)
Trees that differ in the time−ordering of their nodes
A B C D
A B C D
These two are the same:
A B C D
A B C D
These two are different:
Week 10: Coalescents, Consensus trees, etc. – p.58/87
Sampling approaches to coalescent likelihood
Bob Griffiths Simon Tavaré Mary Kuhner and Jon Yamato
Week 10: Coalescents, Consensus trees, etc. – p.59/87
Monte Carlo integration
To get the area under a curve, we can either evaluate the function (f(x)) ata series of grid points and add up heights × widths:
or we can sample at random the same number of points, add up height ×width:
Week 10: Coalescents, Consensus trees, etc. – p.60/87
Importance sampling
Week 10: Coalescents, Consensus trees, etc. – p.61/87
Importance sampling
The function we integrate
We sample from this density
f(x)
g(x)
Week 10: Coalescents, Consensus trees, etc. – p.62/87
The math of importance sampling
∫f(x) dx =
∫ f(x)g(x) g(x) dx
= Eg
[f(x)g(x)
]
which is the expectation for points sampled from g(x) of the ratio f(x)g(x) .
This is approximated by sampling a lot (n) of points from g(x) and thecomputing the average:
L =1
n
n∑
i=1
f(xi)
g(xi)
Week 10: Coalescents, Consensus trees, etc. – p.63/87
Rearrangement to sample points in tree space
A conditional coalescent rearrangement strategy
Week 10: Coalescents, Consensus trees, etc. – p.64/87
Dissolving a branch and regrowing it backwards
First pick a random node (interior or tip) and remove its subtree
Week 10: Coalescents, Consensus trees, etc. – p.65/87
We allow it coalesce with the other branches
Then allow this node to re−coalesce with the tree
Week 10: Coalescents, Consensus trees, etc. – p.66/87
and this gives another coalescent
The resulting tree proposed by this process
Week 10: Coalescents, Consensus trees, etc. – p.67/87
The resulting likelihood ratio is
L(Θ)
L(Θ0)=
1
n
n∑
i=1
Prob (Gi|Θ)
Prob (Gi|Θ0)
(“Wait a second – where in this expression is the data?”) It’s in thesampling that gives you the Gi: the data biases those samples in thecorrect way.
Week 10: Coalescents, Consensus trees, etc. – p.68/87
An example of an MCMC likelihood curve
0
−10
−20
−30
−40
−50
−60
−70
−80
0.001 0.002 0.005 0.01 0.02 0.05 0.1
Θ
ln L
0.00650776
Results of analysing a data set with 50 sequences of 500 bases
which was simulated with a true value of Θ = 0.01
Week 10: Coalescents, Consensus trees, etc. – p.69/87
Major MCMC likelihood or Bayesian programs
LAMARC by Mary Kuhner and Jon Yamato and others. Likelihoodinference with multiple populations, recombination, migration,population growth. No historical branching events or serialsampling, yet.
BEAST by Andrew Rambaut, Alexei Drummond and others.Bayesian inference with multiple populations related by a tree.Support for serial sampling (no migration or recombination yet).
genetree by Bob Griffiths and Melanie Bahlo. Likelihood inference ofmigration rates and changes in population size. No recombination orhistorical branching events.
migrate by Peter Beerli. Likelihood inference with multiplepopulations and migration rates. No recombination or historicalbranching events yet.
IM and IMa by Rasmus Nielsen and Jody Hey. Two or morepopulations allowing both historical splitting and migration after that.No recombination yet.
Week 10: Coalescents, Consensus trees, etc. – p.70/87
Trees we will use for consensus trees
DA C B E D FG A CG F B E D A CG F B E
Week 10: Coalescents, Consensus trees, etc. – p.71/87
Trees we will use for consensus trees
A C B E D FG A CG F B E D A CG F B E D
(for unrooted trees we would use partitions induced by branches insteadof clades)
Week 10: Coalescents, Consensus trees, etc. – p.72/87
Trees we will use for consensus trees
A C B E D FG A CG F B E D A CG F B E D
Week 10: Coalescents, Consensus trees, etc. – p.73/87
Trees we will use for consensus trees
A C B E D FG A CG F B E D A CG F B E D
(Do we count this one if the trees are considered rooted? unrooted?)
Week 10: Coalescents, Consensus trees, etc. – p.74/87
Trees we will use for consensus trees
A C B E D FG A CG F B E D A CG F B E D
Here is a clade that is found on only two of the trees, so it is not includedin the Strict Consensus Tree.
Week 10: Coalescents, Consensus trees, etc. – p.75/87
Their strict consensus tree
A CG F B E D
Week 10: Coalescents, Consensus trees, etc. – p.76/87
A distressing case for the strict consensus tree
A B C D E F G B C D E F G A
Only one species moves ...
Week 10: Coalescents, Consensus trees, etc. – p.77/87
A distressing case for the strict consensus tree
A B C D E F G
... but the strict consensus tree becomes totally unresolved.
Week 10: Coalescents, Consensus trees, etc. – p.78/87
Majority-rule consensus tree
A CG F B E D
100
100
67
67100
Week 10: Coalescents, Consensus trees, etc. – p.79/87
The Adams consensus tree
For rooted trees, Adams (1972, 1986) suggested:
1. Take all rooted triples on each tree.
2. Retain those that are not contradicted, where lack of resolution doesnot count as contradiction.
3. Construct a tree of these.
Week 10: Coalescents, Consensus trees, etc. – p.80/87
Two of the possible triples to examine
DA C B E D FG A CG F B E D A CG F B E
The green triple shows the same rooted topology on all three trees. The
red triple is contradicted and does not get used in the Adams ConsensusTree.
Week 10: Coalescents, Consensus trees, etc. – p.81/87
The Adams consensus tree
A CG F B E D
Week 10: Coalescents, Consensus trees, etc. – p.82/87
Steel, Böcker, and Dress’s shocking disproof
Steel, M., S. Böcker, and A. W. M. Dress. 2000. Simple but fundamentallimits for supertree and consensus tree methods. Systematic Biology 49(2):363-368.
They put forward three minimal requirements for an unrooted Adams-likeconsensus tree based on observations of quartets, rather than triples.Note that a quartet, like a triple, has three possible topologies, but
unrooted ones: ((A,B),(C,D)) and ((A,C),(B,D)) and ((A,D),(B,C)).
The result shouldn’t be altered by relabelling all the species in aconsistent way.
The result should not depend on the order in which the trees areinput.
If a quartet appears in all trees, it should appear in the consensus.
Alas, they then show there is no consensus tree method for unrootedtrees that can satisfy all of these!
Week 10: Coalescents, Consensus trees, etc. – p.83/87
A consensus subtree
F C A B G DFCA BDE F AB D E
Week 10: Coalescents, Consensus trees, etc. – p.84/87
A consensus subtree
F C A B G DFCA BDE F AB D E
Week 10: Coalescents, Consensus trees, etc. – p.85/87
A consensus subtree
F C A B G DFCA BDE F AB D E
B DFA
Week 10: Coalescents, Consensus trees, etc. – p.86/87
A supertree
F C A B G D
F C A BA B G DC A B D
Construct a tree with all tips, for which each of the smaller trees is asubtree. What to do if there is conflict? There are various suggestions.
Week 10: Coalescents, Consensus trees, etc. – p.87/87
Top Related