Statistics 877 Statistical Methods for Molecular Biology ...kendzior/STAT877/SLIDES/ane1.pdf ·...
Statistics 877
Statistical Methods for Molecular Biology
Phylogenetics
Cécile Ané
Departments of Botany and of Statistics
University of Wisconsin - Madison
Spring 2017 - Feb. 9
Outline
1 Trait evolution on a phylogeny
2 Molecular evolution on a tree: Maximum Likelihood Estimation
Outline
1 Trait evolution on a phylogeny
- Examples
- Non-independence problem
- Brownian motion model
Trait evolution on a phylogeny
Sometimes we don’t care about the tree, but we need it.
- In antelopes, are brain size and group size correlated?
- In plants, did the capacity to shed leaves evolve before or after the move to freezing winters?
- In flowers, is the color red correlated with pollination by hummingbirds?
Is red correlated with hummingbirds?
Pollinators in Iochroma: hummingbirds, ants, butterflies, flies.
Pollinator importance = visitation rate × number of ovules potentially pollinated per visit
Predictors: flower color, reward, corolla length, etc.
Smith, Ané & Baum (2008) Evolution
Problem: units are not independent
[Figure: phylogeny of four species with tip data (X1, Y1) to (X4, Y4) and branch lengths 0.8, 0.4, 0.6, 0.2. X = hummingbird importance, Y = brightness.]
A sample of species does not form a “random” sample because of a lack of independence.
Y = Xβ + ε
but the ε’s are not iid: identically distributed (id) perhaps, but not independent (i).
Using the tree to parametrize the non-independence
for quantitative traits:
- Brownian motion model: neutral evolution
- Ornstein-Uhlenbeck model: natural selection towards a selection optimum µ
[Figure: trait value against time (0 to 1), two panels: BM and OU (α = 2).]
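The difference between the two models can be sketched by simulating a single lineage. This is a minimal Euler-Maruyama sketch, not the code behind the slides: the function name `simulate_path`, the starting value, σ, µ and the step size are all assumptions chosen for illustration. Setting α = 0 removes the pull towards µ and gives Brownian motion.

```python
import numpy as np

def simulate_path(n_steps, dt, x0, sigma, alpha=0.0, mu=0.0, rng=None):
    """Euler-Maruyama simulation of dX = alpha * (mu - X) dt + sigma dW.

    alpha = 0 gives Brownian motion; alpha > 0 gives an Ornstein-Uhlenbeck
    process pulled towards the optimum mu.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.empty(n_steps + 1)
    x[0] = x0
    for k in range(n_steps):
        drift = alpha * (mu - x[k]) * dt
        x[k + 1] = x[k] + drift + sigma * np.sqrt(dt) * rng.standard_normal()
    return x

# Illustrative values only (assumed, not taken from the slides).
rng = np.random.default_rng(877)
bm = simulate_path(1000, 0.001, x0=9.0, sigma=1.0, rng=rng)
ou = simulate_path(1000, 0.001, x0=9.0, sigma=1.0, alpha=2.0, mu=9.0, rng=rng)
```

Under BM the variance of the trait grows linearly with time, while under OU it levels off, which is why OU is read as selection towards an optimum.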
Brownian motion model for the residuals
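$$B_t - B_s \sim \mathcal{N}\big(0, \sigma^2 (t - s)\big)$$

$$\varepsilon \sim \mathcal{N}(0, V_\theta)$$

$$V_{\theta,ij} = \sigma^2 \times \text{time shared between tips } i \text{ and } j$$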
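[Figure: two trees on tips Y1 to Y4: one with shared branches (branch lengths 0.6, 0.4, 0.8, 0.2), the other with no shared history between tips.]

Tree with shared branches:

$$V_\theta = \sigma^2 \begin{pmatrix} 1 & .8 & 0 & 0 \\ .8 & 1 & 0 & 0 \\ 0 & 0 & 1 & .4 \\ 0 & 0 & .4 & 1 \end{pmatrix}$$

Tree with no shared history:

$$V_\theta = \sigma^2 \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$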
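The rule "covariance = σ² × shared time" can be checked on a small tree. The sketch below is only an illustration: the parent-map encoding, the node names `a` and `b`, and the assumption that Y1 pairs with Y2 and Y3 with Y4 are all hypothetical choices made so that the result matches the first matrix.

```python
import numpy as np

# Hypothetical encoding: node -> (parent, length of the branch above the node).
tree = {
    "root": (None, 0.0),
    "a": ("root", 0.8), "b": ("root", 0.4),
    "Y1": ("a", 0.2), "Y2": ("a", 0.2),
    "Y3": ("b", 0.6), "Y4": ("b", 0.6),
}

def path_to_root(node, tree):
    """Nodes from `node` up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = tree[node][0]
    return path

def bm_covariance(tips, tree, sigma2=1.0):
    """V[i, j] = sigma2 * total length of the branches shared by tips i and j."""
    n = len(tips)
    V = np.zeros((n, n))
    for i, ti in enumerate(tips):
        anc_i = set(path_to_root(ti, tree))
        for j, tj in enumerate(tips):
            shared = [v for v in path_to_root(tj, tree) if v in anc_i]
            V[i, j] = sigma2 * sum(tree[v][1] for v in shared)
    return V

V = bm_covariance(["Y1", "Y2", "Y3", "Y4"], tree)
```

The diagonal entries are the root-to-tip distances (here 1), and two tips that split at the root get covariance 0.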
Likelihood for Brownian motion model
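$$Y = X\beta + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, V_\theta)$$

Likelihood:

$$-2 \log f(Y \mid \theta) = n \log(2\pi) + \log |V_\theta| + (Y - X\beta)' \, V_\theta^{-1} (Y - X\beta)$$

Looks simple, but tricky for a tree with 1000s of species. What is the size of $V_\theta$?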
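A direct evaluation of this formula can be sketched as follows. This is a dense-matrix sketch under stated assumptions: the function name `neg2_loglik` is hypothetical, β is replaced by its generalized least squares estimate, and the data values are made up. It uses a Cholesky factor instead of an explicit inverse.

```python
import numpy as np

def neg2_loglik(Y, X, V):
    """-2 log-likelihood of Y ~ N(X beta, V), evaluated at the GLS estimate of beta."""
    n = len(Y)
    L = np.linalg.cholesky(V)          # V = L L'
    Xw = np.linalg.solve(L, X)         # whitened design, L^{-1} X
    Yw = np.linalg.solve(L, Y)         # whitened response, L^{-1} Y
    beta, *_ = np.linalg.lstsq(Xw, Yw, rcond=None)   # GLS = OLS on whitened data
    resid = Yw - Xw @ beta
    logdet = 2.0 * np.sum(np.log(np.diag(L)))        # log |V|
    return n * np.log(2 * np.pi) + logdet + resid @ resid, beta

# Made-up data for four species, with sigma^2 = 1.
V = np.array([[1, .8, 0, 0], [.8, 1, 0, 0], [0, 0, 1, .4], [0, 0, .4, 1]])
X = np.column_stack([np.ones(4), [0.1, 0.3, 0.6, 0.9]])
Y = np.array([1.0, 1.2, 2.1, 2.4])
value, beta_hat = neg2_loglik(Y, X, V)
```

The Cholesky step costs on the order of n³ operations for an n × n matrix, which is the reason dense calculations become a bottleneck on large trees.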
Areas for current research
- Linear time algorithm: for BM model, but also phylogenetic logistic or Poisson regression (Ho & Ané 2014, Goolsby, Bruggeman & Ané 2017)
- Asymptotic theory (Ané 2008, Ho & Ané 2013, Ané, Ho & Roch 2017)
- Model selection: need new tools (Khabbazian, Kriebel, Rohe & Ané 2016)
Outline
2 Molecular evolution on a tree: Maximum Likelihood Estimation
- Maximum Likelihood Estimation for 2 species
- Likelihood Calculations on Trees
- Search for Maximum Likelihood
Distance Between Pairs of Taxa
In a two-taxon tree, the distance between two taxa can be estimated under any model by maximum likelihood.
If the distance is $t$ and at site $j$ one species has base A and the other has base C, the contribution to the likelihood at this site $j$ is

$$L_j(t) = \pi_A P_{AC}(t) = \pi_C P_{CA}(t)$$

for a time-reversible model.
The overall likelihood is

$$L(t) = \prod_j L_j(t)$$

and the log-likelihood is

$$\ell(t) = \sum_j \log L_j(t) = \sum_j \Big( \log \pi_{x[j]} + \log P_{x[j]\,y[j]}(t) \Big)$$
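As a concrete instance of this log-likelihood, here is a minimal sketch under the Jukes-Cantor (JC69) model, where π = 1/4 for every base and P(t) has a simple closed form. The model choice, the function names and the two short sequences are assumptions for illustration; t is measured in expected substitutions per site.

```python
import math

def jc69_prob(same, t):
    """JC69 transition probability over distance t, for equal or different bases."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if same else 0.25 - 0.25 * e

def pair_loglik(x, y, t):
    """l(t) = sum over sites j of log pi_{x[j]} + log P_{x[j] y[j]}(t), with pi = 1/4."""
    return sum(math.log(0.25) + math.log(jc69_prob(a == b, t)) for a, b in zip(x, y))

# Made-up sequences that differ at one site out of ten.
x = "ACGTACGTAC"
y = "ACGTACGTCC"
loglik_at_01 = pair_loglik(x, y, 0.1)
```

Evaluating `pair_loglik` over a grid of t values shows a single peak, whose location is the ML distance.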
Distance Between Pairs of Taxa
- Models with free π: it is common to estimate π with observed base frequencies.
- Other parameters usually estimated by maximum likelihood.
- The simplest models have closed form solutions, others require numerical optimization.
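JC69 is one model with a closed-form solution: if a proportion p < 3/4 of sites differ, the ML distance is −(3/4) log(1 − 4p/3). The sketch below compares that formula with a generic numerical maximization. The golden-section search is a stand-in chosen for brevity, not the optimizer used by any particular program, and the counts (90 identical sites, 10 different) are made up.

```python
import math

def jc69_distance(p):
    """Closed-form ML distance under JC69 from the proportion p of differing sites."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def jc69_loglik(t, n_same, n_diff):
    """Two-sequence JC69 log-likelihood from the counts of equal and different sites."""
    e = math.exp(-4.0 * t / 3.0)
    return (n_same * math.log(0.25 * (0.25 + 0.75 * e))
            + n_diff * math.log(0.25 * (0.25 - 0.25 * e)))

def golden_section_max(f, lo, hi, tol=1e-10):
    """Maximize a unimodal function f on [lo, hi] by golden-section search."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c, d = b - g * (b - a), a + g * (b - a)
    while b - a > tol:
        if f(c) > f(d):          # maximum lies in [a, d]
            b, d = d, c
            c = b - g * (b - a)
        else:                    # maximum lies in [c, b]
            a, c = c, d
            d = a + g * (b - a)
    return 0.5 * (a + b)

closed_form = jc69_distance(10 / 100)
numerical = golden_section_max(lambda t: jc69_loglik(t, 90, 10), 1e-6, 5.0)
```

For models without such a formula, only the numerical route is available.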
Notation for the Alignment
- Alignment of $m$ taxa and $n$ sites: $mn$ nucleotide bases.
- Observed base for the $i$-th taxon and the $j$-th site: $x_{ij}$.
Likelihood of a Tree
$P(t)$: 4 × 4 transition probability matrix over an edge of length $t$.

likelihood of the tree:

$$\prod_{\text{site } j} \; \sum \; \pi_{y_{1,j}} \prod_{\text{edge } e} P_{y_{p(e),j},\, y_{c(e),j}}(t_e)$$

- $p(e)$: parent node of edge $e$
- $c(e)$: child node of edge $e$
- $t_e$: length of $e$
- nodes 1 (root) to $m-2$: internal nodes; nodes $m-1$ to $2m-2$: leaves, with observed $x_{ij} = y_{m-2+i,j}$
- sum: over the $4^{m-2}$ possible allocations of values at internal nodes, $y_{1,j}, \dots, y_{m-2,j} \in \{A, C, G, T\}$
Likelihood of a Tree
- a naive calculation would not be tractable for large trees
- for $m = 10$: 8 internal nodes, $4^8 = 65{,}536$ allocations, or terms in the sum
- With a time-reversible model, the location of a root (where the CTMC begins at stationarity) does not affect the likelihood calculation.
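Both points can be seen on the smallest interesting case, a 4-taxon tree with two internal nodes u and v, so 4² = 16 terms per site. The sketch below is an illustration under assumptions: it uses an F81-type model with made-up base frequencies and an unnormalized rate, and the names `f81` and `site_lik` are hypothetical. The sum is computed once with the root at u and once with the root at v.

```python
import itertools
import math

BASES = "ACGT"
PI = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}   # made-up stationary frequencies

def f81(a, b, t):
    """F81-type transition probability P_ab(t): time-reversible, stationary at PI."""
    e = math.exp(-t)
    return (1.0 - e) * PI[b] + (e if a == b else 0.0)

def site_lik(x, t, root):
    """Naive site likelihood on the tree ((1,2)u,(3,4)v): sum over all (yu, yv)."""
    x1, x2, x3, x4 = x
    t1, t2, t3, t4, tuv = t
    total = 0.0
    for yu, yv in itertools.product(BASES, repeat=2):
        leaves = (f81(yu, x1, t1) * f81(yu, x2, t2)
                  * f81(yv, x3, t3) * f81(yv, x4, t4))
        if root == "u":
            total += PI[yu] * f81(yu, yv, tuv) * leaves
        else:
            total += PI[yv] * f81(yv, yu, tuv) * leaves
    return total

branch_lengths = (0.1, 0.2, 0.3, 0.4, 0.5)
lik_u = site_lik("ACGT", branch_lengths, root="u")
lik_v = site_lik("ACGT", branch_lengths, root="v")
```

The two values agree because reversibility gives π_a P_ab(t) = π_b P_ba(t) on the edge between u and v. With more taxa the same enumeration needs 4^(m−2) terms per site.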
Felsenstein’s Pruning Algorithm
- dynamic programming, or belief propagation algorithm
- partial calculations, for each site and node: probability of the data in the subtree rooted at that node, given each possible starting base at that node.
- leaves first, root last, then a weighted average (weights π) over the starting base at the root, for the full tree.
- time complexity: linear in the number of taxa (and in the number of sites), not exponential in the number of taxa.
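The recursion can be sketched in a few lines. This is a minimal sketch under assumptions, not an optimized implementation: it uses JC69 (π = 1/4), a hypothetical nested encoding where a leaf is its sequence string and an internal node is a list of (child, branch length) pairs, and made-up sequences.

```python
import math

BASES = "ACGT"

def jc69(t):
    """4 x 4 JC69 transition matrix over an edge of length t, as nested lists."""
    e = math.exp(-4.0 * t / 3.0)
    return [[0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e for b in range(4)]
            for a in range(4)]

def partial(node, site):
    """Partial likelihoods at `node`: P(observed bases below node | base at node).

    A leaf is a string (its sequence); an internal node is a list of
    (child, branch_length) pairs.
    """
    if isinstance(node, str):
        return [1.0 if BASES[a] == node[site] else 0.0 for a in range(4)]
    out = [1.0] * 4
    for child, t in node:
        P, below = jc69(t), partial(child, site)
        for a in range(4):
            out[a] *= sum(P[a][b] * below[b] for b in range(4))
    return out

def log_likelihood(root, n_sites):
    """Leaves first, root last, then average the root partials with weights 1/4."""
    return sum(math.log(sum(0.25 * p for p in partial(root, s)))
               for s in range(n_sites))

# Made-up 4-taxon example with two sites.
cherry = [("AC", 0.1), ("AC", 0.2)]
root = [(cherry, 0.05), ("AT", 0.3), ("GT", 0.3)]
ll = log_likelihood(root, 2)
```

Each node is visited once per site, and each edge costs 16 multiplications, so the work grows linearly with the number of nodes instead of as 4^(m−2).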
Maximum Likelihood Estimation for one Tree
- For a single tree topology, the ML estimation requires optimization of branch lengths and of any parameters in the substitution model.
- Numerical optimization methods are required even for simple models and small trees.
Tree Search
naive search for ML tree: optimize branch lengths etc. by ML for each possible topology, then pick the best.
this exhaustive search is not feasible for 12+ taxa
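The reason is the number of topologies: there are (2m − 5)!! = 3 × 5 × ... × (2m − 5) unrooted binary topologies on m labeled taxa. A short sketch of the count (the function name is hypothetical):

```python
def num_unrooted_topologies(m):
    """Number of unrooted binary tree topologies on m >= 3 labeled taxa: (2m - 5)!!."""
    count = 1
    for k in range(3, 2 * m - 4, 2):   # odd factors 3, 5, ..., 2m - 5
        count *= k
    return count

for m in (4, 10, 12, 20):
    print(m, num_unrooted_topologies(m))
```

Every one of these topologies would need its own numerical optimization of branch lengths.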
heuristic search algorithms:
- define a neighborhood structure for possible topologies
- go through neighbors
- jump to the first "better" neighbor with a higher likelihood
- stop when no "better" neighbors.
e.g. RAxML and GARLI are two modern programs
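The heuristic steps above can be sketched as a generic first-improvement search. This is not how RAxML or GARLI are implemented: the function `hill_climb` and its arguments are hypothetical, and the example uses integers as a toy stand-in for topologies, with a made-up score that has two peaks.

```python
def hill_climb(start, neighbors, score):
    """First-improvement local search.

    Move to the first neighbor with a higher score; stop when no neighbor
    improves. The result is a local optimum, not necessarily the global one.
    """
    current, current_score = start, score(start)
    improved = True
    while improved:
        improved = False
        for cand in neighbors(current):
            s = score(cand)
            if s > current_score:
                current, current_score = cand, s
                improved = True
                break
    return current, current_score

# Toy stand-in: states are integers, neighbors differ by 1, two peaks (at 3 and 12).
def toy_score(k):
    return -(k - 3) ** 2 if k < 8 else 5 - (k - 12) ** 2

def toy_neighbors(k):
    return [k - 1, k + 1]

from_0 = hill_climb(0, toy_neighbors, toy_score)     # stops at the lower peak
from_10 = hill_climb(10, toy_neighbors, toy_score)   # reaches the higher peak
```

The two runs end at different peaks, which is why such searches are usually restarted from several starting trees. In a phylogenetic setting, `score` would be the maximized likelihood of a topology and `neighbors` a set of topology rearrangements.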