Post on 11-Jan-2016
Towards optimal distance functions for stochastic substitution models
Ilan Gronau, Shlomo Moran, Irad Yavneh
Technion, Israel
2
Preview: The Phylogenetic Reconstruction Problem
3
AATCCTG
ATAGCTGAATGGGC
GAACGTA
AAACCGA
ACGGTCA
ACGGATA
ACGGGTA
ACCCGTG
ACCGTTG
TCTGGTA
TCTGGGA
TCCGGAA AGCCGTG
GGGGATT
AAAGTCA
AAAGGCG AAACACAAAAGCTG
Evolution is modeled by a Tree
(All our sequences are DNA sequences, consisting of {A,G,C,T})
4
AATCCTG
ATAGCTGAATGGGC
GAACGTA
AAACCGAACCGTTGTCTGGGA
TCCGGAA AGCCGTG
GGGGATT
Phylogenetic Reconstruction
5
B : AATCCTG
C : ATAGCTG
A : AATGGGC
D : GAACGTA
E : AAACCGA
J : ACCGTTG
G : TCTGGGA
H : TCCGGAA
I : AGCCGTG
F : GGGGATT
Goal: reconstruct the ‘true’ tree as accurately as possible
[Figure: the sequences at the leaves A-J are passed to a reconstruction procedure, which outputs a tree over the leaves A-J; the 'true' tree is rooted.]
Phylogenetic Reconstruction
7
Road Map
• Distance based reconstruction algorithms
• The Kimura 2 Parameter (K2P) Model
• Performance of distance methods in the K2P model
• Substitution models and substitution rate functions
• Properties of SR functions
• Unified Substitution Models
• Optimizing Distances in the K2P model
• Simulation results
8
Distance Based Phylogenetic Reconstruction: Exact vs. Noisy distances
[Figure: an edge-weighted 'true' tree over the species A-G, and the tree obtained from it by reconstruction.]
Exact (additive) distances between species: $D = \{d(u,v)\}_{u,v \in S}$
Distance estimation using finite sampling
Estimated distances: $\hat{D} = \{\hat{d}(u,v)\}_{u,v \in S}$
Challenge: minimize the effect of the noise introduced by the sampling
9
Road Map, current section: The Kimura 2 Parameter (K2P) Model
10
The Kimura 2 Parameter (K2P) model [Kimura80]: each edge (u,v) corresponds to a “Rate Matrix”
Transitions: A ↔ G and C ↔ T (within {A,G} and within {C,T})
Transversions: between {A,G} and {C,T}
Transitions/transversions ratio: $R = \alpha / (2\beta)$
K2P generic rate matrix:
     A   G   C   T
A    -   α   β   β
G    α   -   β   β
C    β   β   -   α
T    β   β   α   -
11
K2P standard distance: Δtotal = Total substitution rate
The total substitution rate of a K2P rate matrix $R_{uv}$ is
$\Delta_{total}(R_{uv}) = \alpha + 2\beta = \frac{1}{4}\,(\text{sum of off-diagonal entries of } R_{uv})$
This is the expected number of mutations per site. It is an additive distance: on a path u - v - w whose edges have total rates α + 2β and α' + 2β', the total rate between u and w is (α+α') + 2(β+β').
12
Estimation of Δtotal(Ruv) = dK2P(u,v) is a noisy stochastic process
u A A C A … G T C T T C G A G G C C C
v A G C A … G C C T A T G C G A C C T
K2P total rate “distance correction” procedure: $\hat{d}_{K2P}(u,v) = \hat{\alpha} + 2\hat{\beta}$
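A minimal sketch of such a distance correction, assuming it is the standard Kimura formula computed from the observed fractions of transitions and transversions (the function name `k2p_distance` and the handling of saturated pairs are illustrative choices, not taken from the slides):

```python
import math

PURINES = {"A", "G"}  # {A,G}; the other class is {C,T}

def k2p_distance(seq_u, seq_v):
    """Estimate alpha_hat + 2*beta_hat from two aligned, gap-free DNA sequences."""
    if len(seq_u) != len(seq_v) or not seq_u:
        raise ValueError("sequences must be aligned and non-empty")
    n = len(seq_u)
    transitions = transversions = 0
    for a, b in zip(seq_u, seq_v):
        if a == b:
            continue
        if (a in PURINES) == (b in PURINES):
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # between {A,G} and {C,T}
    p, q = transitions / n, transversions / n
    lam = 1.0 - 2.0 * q           # estimates e^{-4 beta}
    mu = 1.0 - 2.0 * p - q        # estimates e^{-2 alpha - 2 beta}
    if lam <= 0.0 or mu <= 0.0:
        return math.inf           # assumption: report saturated pairs as infinite
    return -(math.log(lam) + 2.0 * math.log(mu)) / 4.0
```

For example, `k2p_distance("AAAAAAAAAA", "GAAAAAAAAA")` corrects a single observed transition in ten sites.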
13
Road Map, current section: Performance of distance methods in the K2P model
14
Check performance of K2P “standard” distances in resolving quartet-splits
There are 3 possible quartet topologies: AB|CD, AC|BD and AD|BC.
• Distance methods reconstruct the true split by the 4-point condition. For the true split AB|CD, whose separating edge has weight $w_{sep}$:
$d_{K2P}(A,B) + d_{K2P}(C,D) + 2w_{sep} = d_{K2P}(A,C) + d_{K2P}(B,D) = d_{K2P}(A,D) + d_{K2P}(B,C)$
The 4-point condition for noisy distances is:
$\hat{d}_{K2P}(A,B) + \hat{d}_{K2P}(C,D) < \min\{\hat{d}_{K2P}(A,C) + \hat{d}_{K2P}(B,D),\ \hat{d}_{K2P}(A,D) + \hat{d}_{K2P}(B,C)\}$
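The 4-point condition can be sketched as a small selection rule: compute the three pairwise sums and pick the split with the smallest one. The function name `resolve_split` and the dictionary layout of the distances are assumptions made for this illustration.

```python
def resolve_split(d):
    """Pick the quartet split whose pair of within-side distances has the smallest sum.

    d maps frozenset({x, y}) to the (possibly noisy) distance between leaves x and y,
    for the four leaves 'A', 'B', 'C', 'D'.
    """
    def dist(x, y):
        return d[frozenset((x, y))]

    sums = {
        "AB|CD": dist("A", "B") + dist("C", "D"),
        "AC|BD": dist("A", "C") + dist("B", "D"),
        "AD|BC": dist("A", "D") + dist("B", "C"),
    }
    # Ties are broken by dictionary order; with noisy real-valued estimates they are rare.
    return min(sums, key=sums.get)
```

With exact additive distances the smallest sum is smaller than the other two by exactly twice the weight of the separating edge.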
15
We evaluate the accuracy of the K2P distance estimation
by Split Resolution Test:
[Figure: the template quartet with leaves A, B, C, D. The root lies on the middle edge; each of the two edges leaving the root has length t, each of the four leaf edges has length 10t, and every edge carries a K2P rate matrix.]
t is “evolutionary time”
The diameter of the quartet is 22t
16
Phase A: simulate evolution
[Figure: sequences are evolved from the root along the edges of the template quartet, yielding one sequence at each of the leaves A, B, C, D.]
17
Phase B: reconstruct the split by the 4p condition
[Figure: the simulated sequences at the leaves A, B, C, D]
Estimate distances between sequences: $\hat{D}(i,j) = \hat{d}_{K2P}(s_i, s_j)$
Apply the 4p condition.
Was the correct split found?
Repeat this process 10,000 times, count number of failures
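The two phases can be sketched as a small simulation. This is an illustration under stated assumptions, not the authors' code: the sequence length `n_sites`, the default number of trials, the choice of A, B as the leaves on one side of the root, and counting ties or saturated estimates as failures are all assumptions.

```python
import math
import random

BASES = "AGCT"
TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}
TRANSVERSIONS = {"A": "CT", "G": "CT", "C": "AG", "T": "AG"}

def evolve(seq, alpha, beta, rng):
    """Evolve seq along one edge; alpha, beta are the K2P rates multiplied by the edge time."""
    lam = math.exp(-4.0 * beta)
    mu = math.exp(-2.0 * (alpha + beta))
    p_ts = (1.0 + lam - 2.0 * mu) / 4.0   # probability of the transition
    p_tv = (1.0 - lam) / 4.0              # probability of each of the two transversions
    out = []
    for x in seq:
        r = rng.random()
        if r < p_ts:
            out.append(TRANSITION[x])
        elif r < p_ts + 2.0 * p_tv:
            out.append(rng.choice(TRANSVERSIONS[x]))
        else:
            out.append(x)
    return out

def k2p(u, v):
    """Standard K2P total-rate estimate from two aligned sequences."""
    n = len(u)
    ts = sum(1 for a, b in zip(u, v) if a != b and TRANSITION[a] == b)
    tv = sum(1 for a, b in zip(u, v) if a != b and TRANSITION[a] != b)
    lam, mu = 1.0 - 2.0 * tv / n, 1.0 - 2.0 * ts / n - tv / n
    if lam <= 0.0 or mu <= 0.0:
        return math.inf
    return -(math.log(lam) + 2.0 * math.log(mu)) / 4.0

def failure_fraction(diameter, ratio_r=10.0, n_sites=500, trials=200, seed=0):
    """Fraction of trials in which the 4p condition does not recover AB|CD."""
    rng = random.Random(seed)
    t = diameter / 22.0                    # total rate of an edge of length t
    beta = t / (2.0 * ratio_r + 2.0)       # from alpha + 2*beta = t and alpha = 2*R*beta
    alpha = 2.0 * ratio_r * beta
    failures = 0
    for _ in range(trials):
        root = [rng.choice(BASES) for _ in range(n_sites)]
        left = evolve(root, alpha, beta, rng)
        right = evolve(root, alpha, beta, rng)
        a, b = (evolve(left, 10 * alpha, 10 * beta, rng) for _ in range(2))
        c, d = (evolve(right, 10 * alpha, 10 * beta, rng) for _ in range(2))
        true_sum = k2p(a, b) + k2p(c, d)
        others = min(k2p(a, c) + k2p(b, d), k2p(a, d) + k2p(b, c))
        if not true_sum < others:
            failures += 1
    return failures / trials
```

Calling `failure_fraction(diameter)` over a range of diameters gives one point of a failure curve per diameter; the slides use 10,000 repetitions per point.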
18
The split resolution test was applied to the model quartet with various diameters.
For each diameter, mark the fraction (percentage) of the simulations in which the 4p condition failed (next slide).
[Figure: copies of the template quartet scaled to a range of diameters]
19
Performance of K2P distances in resolving quartets, small diameters: 0.01-0.2
[Plot: performance of the K2P standard distance method in resolving quartets on the template quartet, R=10. x-axis: quartet diameter (total rate between furthest leaves), 0 to 0.2; y-axis: fraction of failures out of 10000 experiments.]
20
Performance for larger diameters
[Plot: performance of the K2P standard distance method in resolving quartets, for quartet ratio 0.1, R=10. x-axis: quartet diameter (= mutation rate between furthest leaves), 0.2 to 2; y-axis: fraction of failures out of 10000 simulations. The rise at large diameters is labeled “site saturation”.]
21
When β < α, we can postpone the “site saturation” effect. For this, use another distance function for the same model, Δtv, which counts only transversions:
[Figure: {A,G} is collapsed to the state {0} and {C,T} to the state {1}; transitions (rate α) stay within a state, transversions (rate β) move between the two states.]
This is actually the CFN model [Cavender78, Farris73, Neyman71]
22
Apply the same split resolution test on the transversions only distance:
u A A C A … G T C T T C G A G G C C C
v A G C A … G C C T A T G C G A C C T
Transversions only “distance correction” procedure: $\hat{d}_{tr}(u,v) = \hat{\beta}$
23
Transversions only performs better on large rates and worse on small rates
[Plot: performance of distance methods in resolving quartets, R=10. x-axis: quartet diameter, 0 to 2; y-axis: fraction of failures out of 10000 experiments. Curves: transversions only; total K2P rate.]
Conclusion: Distance based reconstruction methods should be adaptive:
Find a distance function d which is good for the input, and use it to compute the estimated distance matrix $\hat{D}(u,v) = \hat{d}(u,v)$.
We take a small step in this direction:
Input: An alignment of the sequences at u, v.
Output: a (near-)optimal distance function, which minimizes the expected noise in the estimation procedure.
25
Example: An adaptive distance method (max-optimal)
based on this talk:
[Plot: performance of distance methods in resolving quartets, R=10. x-axis: quartet diameter, 0 to 2; y-axis: fraction of failures out of 10000 experiments. Curves: max-optimal; standard K2P; transversions only.]
26
Road Map, current section: Substitution models and Substitution Rate functions
27
Steps in finding optimal distance functions:
1. Define substitution model.
2. Characterize the available distance functions.
3. Select a function which is optimal for the input sequences (optimal: least sensitive to stochastic noise).
28
From Rate matrices to Substitution matrices
u A A C A … G T C T T C G A G G C C C
v A G C A … G C C T A T G C G A C C T
Evolution of a finite sequence by unknown model parameters α, β
Rate matrices imply stochastic substitution matrices: the K2P rate matrix $R_{uv}$ (transition rate α, transversion rate β) implies a stochastic substitution matrix $P_{uv}$:
     A            G            T            C
A    1-Pα-2Pβ     Pα           Pβ           Pβ
G    Pα           1-Pα-2Pβ     Pβ           Pβ
T    Pβ           Pβ           1-Pα-2Pβ     Pα
C    Pβ           Pβ           Pα           1-Pα-2Pβ
29
A substitution model M: A set of stochastic substitution matrices, closed under matrix product:
P,Q ∈ M ⇒ PQ ∈ M
Motivation for the definition: on a path u - v - w, $P_{uw} = P_{uv} P_{vw}$.
Also required: P > 0 and 0 < det(P) < 1 for all P ∈ M
30
Model tree over M = <Tree Topology> + <DNA distribution at the root> + <M-substitution matrices at the edges>
[Figure: a tree with root r; the root carries a uniform DNA distribution and every edge carries a substitution matrix, e.g. $P_{rv}$ on the edge (r,v).]
31
Distances for a given model are defined by
Substitution Rate functions:
[Figure: a path u - v - w with substitution matrices $P_{uv}$ and $P_{vw}$ on its edges]
Δ: M → ℝ is an SR function for M iff for all P,Q in M:
1. Δ(PQ) = Δ(P) + Δ(Q) (additivity)
2. Δ(P) > 0 (positivity)
32
Road Map, current section: Properties of SR functions
33
1st question: Given a model M, what are its additive functions?
SR functions are additive functions which are strictly positive
34
Example 1: The logdet function [Lake94, Steel93] is an SR function for the most general model, Muniv :
Muniv= {P: P is a stochastic 4╳4 matrix, 0<det(P)<1}.
The function $\Delta_{logdet}(P) = \ln(\det(P))$ is an additive function for $M_{univ}$.
The function $d_{logdet}(u,v) = -\ln(\det(P_{uv}))$ is an SR function for $M_{univ}$.
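A small numerical check of the two SR properties for the negated logdet, as a sketch: the helper `random_substitution_matrix` is hypothetical and simply produces a positive row-stochastic matrix close to the identity, so that 0 < det(P) < 1.

```python
import numpy as np

def logdet_distance(P):
    """-ln(det(P)); positive whenever 0 < det(P) < 1."""
    return -np.log(np.linalg.det(P))

def random_substitution_matrix(rng, noise=0.05):
    """Hypothetical helper: a positive row-stochastic 4x4 matrix near the identity."""
    M = np.eye(4) + noise * rng.random((4, 4))
    return M / M.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
P = random_substitution_matrix(rng)
Q = random_substitution_matrix(rng)

# Additivity follows from det(PQ) = det(P) det(Q).
additive = np.isclose(logdet_distance(P @ Q), logdet_distance(P) + logdet_distance(Q))
positive = logdet_distance(P) > 0 and logdet_distance(Q) > 0
```

Both `additive` and `positive` should hold for any pair of matrices from the model, since they only rely on the multiplicativity of the determinant and on 0 < det(P) < 1.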
35
Example 2: The log eigenvalue function
Assume a model M with the following property:
There is a vector $v \in \mathbb{R}^4$ which is an eigenvector of each P ∈ M, i.e., $Pv = \lambda_v(P)\,v$.
The function $\Delta_v(P) = \ln(|\lambda_v(P)|)$ is an additive function for M. [e.g. Gu&Li98]
36
Both “logdet” and the “log eigenvalue” functions are special cases of a general technique:
Generalized logdet which is given below:
Definition: Let P be a 4 by 4 matrix. A subspace H of $\mathbb{R}^4$ is P-invariant if $PH \subseteq H$.
If H is P-invariant, then P defines a linear transformation on H; $\det(P|_H)$ is the determinant of this linear transformation.
Lemma GLD (Generalized LogDet): If H is P-invariant for all P ∈ M, then $\Delta_H(P) = \ln(|\det(P|_H)|)$ is an additive function for M.
37
Linearity of additive functions:
1. If Δ1 and Δ2 are additive functions for M, so is c1 Δ1 + c2 Δ2
The set of additive functions for M forms a vector space, to be denoted ADM.
Dimension(ADM) is the dimension of this vector space. Large dimension implies more “independent” distance functions.
If dimension(ADM) = 1, then M admits a single distance function (up to multiplication by a scalar). Selecting the best SR function in such a model is trivial. Thus, the adaptive approach is useful only when dimension(ADM) > 1.
38
Road Map, current section: Unified Substitution Models: models for which the adaptive approach is potentially useful
39
Unified Substitution Models:
Def: A model M is unified if there is a matrix U s.t. for each P∈M it holds that:
U-1 PU =
1   ?       ?       ?
0   λ1(P)   ?       ?
0   0       λ2(P)   ?
0   0       0       λ3(P)
Using Lemma GLD, we have:
Thm: if M is unified, then for each 3 constants c1, c2, c3, the function $\Delta(P) = \sum_{i=1}^{3} c_i \ln(|\lambda_i(P)|)$ is an additive function for M.
40
Strongly Unified Substitution Models
Def: A model M is strongly unified if there is a matrix U s.t. for each P∈M it holds that:
U-1 PU =
1   0       0       0
0   λ1(P)   0       0
0   0       λ2(P)   0
0   0       0       λ3(P)
Thm: if M is strongly unified, then all the additive functions of M are of the form $\Delta(P) = \sum_{i=1}^{3} c_i \ln(\lambda_i(P))$
41
A simple strongly unified model: The Jukes Cantor model [1969]
MJC = { P_p : 0 < p < 0.25 }, where P_p =
     A      G      T      C
A    1-3p   p      p      p
G    p      1-3p   p      p
T    p      p      1-3p   p
C    p      p      p      1-3p
MJC is strongly unified by a matrix U [entries of U garbled in the source].
For all P ∈ MJC, U-1 PU =
1   0    0    0
0   λp   0    0
0   0    λp   0
0   0    0    λp
where λp = 1 - 4p.
Claim: dimension(ADMJC) = 1
Hence the adaptive approach is irrelevant to this model.
42
Another model M for which dimension(ADM)=1
Recall: Muniv consists of all DNA transition matrices.
Claim 2: dimension(ADMuniv) = 1
This means that all the additive functions of Muniv are
proportional to logdet.
Hence the adaptive approach is irrelevant also to this model.
Luckily, for “intermediate” unified models the space of additive functions has dimension > 1, hence the adaptive approach is useful for them. Next we return to the Kimura 2 parameter model.
43
Back to K2P: For every K2P Substitution Matrix P:
P =
     A            G            T            C
A    1-Pα-2Pβ     Pα           Pβ           Pβ
G    Pα           1-Pα-2Pβ     Pβ           Pβ
T    Pβ           Pβ           1-Pα-2Pβ     Pα
C    Pβ           Pβ           Pα           1-Pα-2Pβ
U-1 PU =
1 0 0 0
0 λP 0 0
0 0 μP 0
0 0 0 μP
(U is the U of the JC model)
Where:
λP = 1 - 4Pβ = e^{-4β}
μP = 1 - 2Pβ - 2Pα = e^{-2α-2β}
0 < λP < 1, 0 < μP < 1
Conclusion: dimension(ADMK2P) = 2.
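The eigenvalue structure above can be checked numerically. In this sketch the function name `k2p_substitution_matrix` is illustrative, α and β are taken to already include the edge time, and the substitution probabilities are obtained by solving the two eigenvalue relations for Pα and Pβ.

```python
import numpy as np

def k2p_substitution_matrix(alpha, beta):
    """Build a K2P substitution matrix (nucleotide order A, G, T, C) and its eigenvalues."""
    lam = np.exp(-4.0 * beta)                 # lambda_P = 1 - 4*P_beta
    mu = np.exp(-2.0 * alpha - 2.0 * beta)    # mu_P = 1 - 2*P_beta - 2*P_alpha
    p_b = (1.0 - lam) / 4.0                   # probability of each transversion
    p_a = (1.0 + lam - 2.0 * mu) / 4.0        # probability of the transition
    d = 1.0 - p_a - 2.0 * p_b                 # probability of no change
    P = np.array([[d,   p_a, p_b, p_b],
                  [p_a, d,   p_b, p_b],
                  [p_b, p_b, d,   p_a],
                  [p_b, p_b, p_a, d]])
    return P, lam, mu

P, lam, mu = k2p_substitution_matrix(alpha=0.3, beta=0.1)
eigenvalues = np.linalg.eigvalsh(P)   # P is symmetric; values come back in ascending order
```

The computed spectrum should consist of 1, λP and μP, with μP appearing twice.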
44
The functions Δλ(P) = -ln(λP), Δμ(P) = -ln(μP) form a basis of ADK2P.
Each positive function of the form $\Delta(P) = -c_1 \ln(\lambda_P) - c_2 \ln(\mu_P)$ is an SR function for the K2P model.
The standard “total rate” distance is:
ΔK2P(P) = -(ln(λP) + 2ln(μP))/4 = -Δlogdet(P)/4.
The “transversion only” distance is:
Δtr(P) = -ln(λP)/4.
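As a quick illustration, the two named distances are the members of this family with (c1, c2) = (1/4, 1/2) and (1/4, 0). The function names below are hypothetical.

```python
import math

def sr_function(lam, mu, c1, c2):
    """Member of the K2P family: -c1*ln(lambda_P) - c2*ln(mu_P)."""
    return -c1 * math.log(lam) - c2 * math.log(mu)

def total_rate(lam, mu):
    """Standard K2P distance: -(ln(lambda_P) + 2 ln(mu_P)) / 4."""
    return sr_function(lam, mu, 0.25, 0.5)

def transversions_only(lam, mu):
    """Transversions only distance: -ln(lambda_P) / 4."""
    return sr_function(lam, mu, 0.25, 0.0)

# Exact eigenvalues for alpha = 0.3, beta = 0.1 (edge time absorbed into the rates).
alpha, beta = 0.3, 0.1
lam, mu = math.exp(-4 * beta), math.exp(-2 * alpha - 2 * beta)
```

On exact eigenvalues, `total_rate` recovers α + 2β and `transversions_only` recovers β; additivity holds because the eigenvalues of a product PQ are the products of the corresponding eigenvalues.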
46
Road Map, current section: Optimizing Distances in the K2P model
47
K2P distance estimation: where the noise comes from
u A A C A … G T C T T C G A G G C C C
v A G C A … G C C T A T G C G A C C T
Compute $\hat{P}_{uv}$, an estimation of $P_{uv}$ (inherent noise).
Compute $\hat{\lambda}(\hat{P}_{uv}), \hat{\mu}(\hat{P}_{uv})$, estimations of $\lambda(P_{uv}), \mu(P_{uv})$ (implied noise propagation).
Compute $\hat{\Delta}(\hat{P}_{uv}) = -c_1 \ln(\hat{\lambda}) - c_2 \ln(\hat{\mu})$, an estimation of $\Delta(P_{uv}) = -c_1 \ln(\lambda) - c_2 \ln(\mu)$ (“user controlled” noise propagation).
48
Selection of c1, c2
Given $\hat{P}_{uv}$, we look for c1, c2 such that $\hat{d}(u,v) = \hat{\Delta}(\hat{P}_{uv}) = -c_1 \ln(\hat{\lambda}) - c_2 \ln(\hat{\mu})$ has a small expected relative error.
Estimated distance = True distance + Expected error
49
Expected Relative Error = Expected error / True distance
50
Minimizing the expected relative error
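Let $d = d(u,v) = \Delta(P_{uv})$ be the exact distance;
$\hat{d} = \hat{d}(u,v) = \hat{\Delta}(\hat{P}_{uv})$ is the estimated (stochastic) distance.
We would like to minimize the "Normalized Mean Square Error":
$NMSE(\hat{d}) = E\left[\left(\frac{\hat{d} - d}{d}\right)^2\right]$
In the plots we use $NRMSE = \sqrt{E\left[\left(\frac{\hat{d} - d}{d}\right)^2\right]}$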
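In a simulation, the expectation can be replaced by an average over repeated estimates of the same true distance. A minimal sketch (the function name `nrmse` is illustrative):

```python
import math

def nrmse(estimates, true_distance):
    """Empirical normalized root mean square error of repeated distance estimates."""
    values = list(estimates)
    if not values or true_distance == 0:
        raise ValueError("need at least one estimate and a non-zero true distance")
    mean_sq = sum(((e - true_distance) / true_distance) ** 2 for e in values) / len(values)
    return math.sqrt(mean_sq)
```

For instance, estimates of 0.9 and 1.1 for a true distance of 1.0 give an NRMSE of about 0.1.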
51
A basic property of Normalized Mean Square Error:
The NMSE of a distance function $\hat{\Delta}(\hat{P}_{uv}) = -c_1 \ln(\hat{\lambda}) - c_2 \ln(\hat{\mu})$ depends only on the ratio $c = c_1 / c_2$.
This means that equivalent SR functions have the same NMSE.
52
A Proper Disclosure on our optimal functions:
Since ln(·) is non-linear, we only find the c which minimizes the NMSE of a linear approximation of $\hat{d}$ (using the "delta method"), i.e. of the 1st term in the Taylor expansion of $(\hat{d} - d)/d$,
and the optimal c for a K2P matrix is given by a closed-form expression $c_{opt}$ [formula garbled in the source].
Hence, our approximation is imprecise when some of the (true) eigenvalues are very small.
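The closed form on the slide did not survive extraction, so the following is an independent sketch of a delta-method computation, not a transcription of the authors' formula. It assumes that sites are independent, that each site shows a transition, a transversion or no change (a multinomial model), and that the distance is normalized to $-c \ln(\hat{\lambda}) - \ln(\hat{\mu})$; all function names are hypothetical.

```python
import math

def delta_method_terms(alpha, beta):
    """Per-site variance terms of the linearized estimator, plus L = -ln(lam), M = -ln(mu)."""
    lam = math.exp(-4.0 * beta)
    mu = math.exp(-2.0 * alpha - 2.0 * beta)
    a = (1.0 - lam ** 2) / lam ** 2               # n * Var(lam_hat) / lam^2
    b = ((1.0 + lam) / 2.0 - mu ** 2) / mu ** 2   # n * Var(mu_hat) / mu^2
    g = (1.0 - lam) / lam                         # n * Cov(lam_hat, mu_hat) / (lam * mu)
    return a, b, g, -math.log(lam), -math.log(mu)

def approx_nmse(c, alpha, beta, n_sites):
    """First-order (delta method) NMSE of -c*ln(lam_hat) - ln(mu_hat)."""
    a, b, g, L, M = delta_method_terms(alpha, beta)
    return (a * c * c + 2.0 * g * c + b) / (n_sites * (c * L + M) ** 2)

def c_opt(alpha, beta):
    """Stationary point of approx_nmse in c (obtained by setting its derivative to zero)."""
    a, b, g, L, M = delta_method_terms(alpha, beta)
    return (L * b - g * M) / (a * M - g * L)
```

Under these assumptions the minimizer tends to 1/2 (the total rate function) as the rates go to zero, and its normalized weight c/(1+c) approaches 1 (transversions only) as the rate grows with α = 20β, which is the qualitative behavior described in the slides.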
53
Relation between c and SR functions:
Function name          Function              c      c/(1+c)
Total rate (logdet)    -ln(λP) - 2ln(μP)     1/2    1/3
Transversions only     -ln(λP)               ∞      1
As $c_{opt}/(1+c_{opt})$ grows from 1/3 to 1, the optimal rate function is gradually changed from total rate to transversions only.
54
Optimal values of copt/(1+copt) for ti/tv ratio = 10
[Plot: x-axis: total substitution rate, 0 to 3; y-axis: C1/(C1+C2), 0 to 1; curve for α=20β.]
As the rate grows, the relative weight of the “transversion” coefficient increases
55
Optimal values of c1/(c1+c2) for various transition/transversion ratios
[Plot: x-axis: total substitution rate, 0 to 3; y-axis: C1/(C1+C2), 0 to 1; curves for α=β, α=2β, α=4β, α=20β and α=200β; an annotation marks the regime α>>β, rate>2.]
56
Expected Relative error of various distance functions: theoretical prediction
[Plot, R = 2: x-axis: total substitution rate, 0 to 2.5; y-axis: predicted NRMSE, 0 to 0.6; curves: total rate, transversions, optimal.]
57
Road Map, current section: Simulation results
58
Expected Relative error of various distance functions: simulations
[Plot, R = 2: x-axis: total substitution rate, 0 to 2; y-axis: NRMSE, 0 to 0.5. Simulated curves for the actually used SR functions: standard formula (C = 0.5), 'transversions only' (C = ∞), optimal. Predicted error curves: standard formula, 'transversions only', optimal SR function. An annotation marks the “small eigenvalue distortion”.]
59
Back to the K2P quartet resolution
A heuristic distance method (max-optimal) based on this talk:
Select a distance function which is optimal w.r.t. the largest of the six observed distances of the quartet (i.e., the largest copt).
Recall the performance of the two known distance functions on the “template quartet”:
[Plot: performance of distance methods in resolving quartets, R=10. x-axis: quartet diameter, 0 to 2; y-axis: fraction of failures out of 10000 experiments.]
60
When α≠β, the suggested heuristic performs better than both known methods.
[Plot: performance of distance methods in resolving quartets, R=10. x-axis: quartet diameter, 0 to 2; y-axis: fraction of failures out of 10000 experiments. Curves: max-optimal; standard K2P; transversions only.]
61
Summary
• Adaptive approach to distance based reconstruction: adjust the distance function to the input sequences.
• Distance functions for stochastic evolutionary models are defined by SR functions.
• SR functions can be constructed by Generalized Logdet.
• When the dimension of the space of SR functions is greater than 1, the adaptive approach is applicable.
• The adaptive approach is applicable to non-trivial unified models.
• Most common models are unified.
• An analysis of the simplest non-trivial unified model, K2P, shows a significant improvement in accuracy from the adaptive approach.
62
Further Research
• Prove/Disprove: For any substitution model M, all the additive functions of M are GLD functions.
• In the K2P model: define & find optimal SR functions for: two distances, quartets, general trees.
• Find optimal SR functions for non-homogeneous model trees.
• Find optimal SR functions for variable rates across sites.
• Find optimal SR functions for more general evolutionary models (Tamura Nei) (analytic/heuristic methods).
• Empirical/analytical study of “plugging” adaptive distances into common reconstruction algorithms (e.g. NJ).
• Study the improvement in performance on real biological data.
• Devise algorithms which use distance-vectors.
Further research questions
• We have infinitely many additive distance functions for the K2P model.
• Which one should we use for reconstructing the tree?
• If we have the exact substitution matrices for all pairs of taxa, then all functions are equally good.
• But we have only finite sequences, whose alignments provide only estimations of the true substitution matrices.
66
Distances are defined by Substitution Rate functions
For each tree path u - v - w it holds that D(u,v) + D(v,w) = D(u,w).
[Figure: a path u - v - w with D(u,v) and D(v,w) on its edges, and D(u,w) = D(u,v) + D(v,w).]
67
Part 3.1: from Substitution models to Additive distances
68
Aligned sequences → joint distribution matrices
The aligned sequences provide, for each pair of DNA letters, say A and G, how many times A was mutated to G. This defines a joint distribution matrix F:
F =
     A      G      T      C
A    0.2    0.05   0.01   0.02
G    0.02   0.25   0.01   0.01
T    0.02   0.01   0.16   0.02
C    0.01   0.01   0.01   0.2
For example, A is aligned with G in 5% of the pairs.
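Computing such a joint distribution matrix from an alignment amounts to counting aligned letter pairs. A minimal sketch, with a hypothetical function name and a nested-dictionary layout chosen for readability:

```python
from collections import Counter

def joint_distribution(seq_u, seq_v, alphabet="AGTC"):
    """F[x][y] = fraction of alignment columns with letter x at u and letter y at v."""
    if len(seq_u) != len(seq_v) or not seq_u:
        raise ValueError("sequences must be aligned and non-empty")
    counts = Counter(zip(seq_u, seq_v))
    n = len(seq_u)
    return {x: {y: counts[(x, y)] / n for y in alphabet} for x in alphabet}

F = joint_distribution("AAGT", "AGGT")
```

Here `F["A"]["G"]` is the fraction of columns in which A at u is aligned with G at v.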
69
Joint distribution matrices are converted to distances by substitution models. These models describe how DNA sequences are transformed during evolution. The tool used for this is called “Markov processes”; we sketch it in the following. Additional reading is recommended…
70
species C1 C2 C3 C4 … Cm
u A A C A … G T C T T C G A G G C C C
v A G C A … G C C T A T G C G A C C T
Different biological models impose restrictions on the substitution matrices.
Our model is the Kimura 2 Parameter (K2P) model.
K2P distinguishes between two mutation types:
Transitions {A↔G, C↔T}
and
Transversions [{A,G}↔{C,T}]
71
K2P rate matrices have the following shape
     A   G   T   C
A    -   α   β   β
G    α   -   β   β
T    β   β   -   α
C    β   β   α   -
All transitions have rate α
All transversions have rate β
72
Part 3.2: Distance functions for K2P
(Linear Algebra in the service of Biology)
73
Let P,Q be two matrices in K2P. Then:
U-1 P U = diag(1, λP, μP, μP)
U-1 Q U = diag(1, λQ, μQ, μQ)
U-1 PQ U = diag(1, λP λQ, μP μQ, μP μQ)
74
U-1 PQ U = diag(1, λ1(P), λ2(P), λ3(P))
75
U-1 P U = diag(1, λp, λp, λp)
76
[Figure: a path u - v - w with the sequences ACGGTCA, ACGGATA, GGGGATT at its vertices and substitution matrices $P_{uv}$, $P_{vw}$ on its edges]
The joint distribution of each pair of vertices provides an approximation of the substitution matrices.
The common theme of all projects: Start with input sequences for two or more taxa. Find a distance function which minimizes the inaccuracy (noise) introduced by the sampling process.
79
     A   G   C   T
A    -   α   β   β
G    α   -   β   β
C    β   β   -   α
T    β   β   α   -
80
     A    G    C    T
A    -    α'   β'   β'
G    α'   -    β'   β'
C    β'   β'   -    α'
T    β'   β'   α'   -
81
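K2P Model tree = <Tree Topology> + <Uniform Distribution of DNA at all vertices> + <K2P instantaneous rate matrices at edges>
[Figure: a tree with root r; each nucleotide has probability 25% at every vertex (example sequence: ACGGATA), and each edge (u,v) carries a rate matrix $R_{uv}$.]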
82
     A            G            T            C
A    1-Pα-2Pβ     Pα           Pβ           Pβ
G    Pα           1-Pα-2Pβ     Pβ           Pβ
T    Pβ           Pβ           1-Pα-2Pβ     Pα
C    Pβ           Pβ           Pα           1-Pα-2Pβ
83
     A      G      T      C
A    1-3p   p      p      p
G    p      1-3p   p      p
T    p      p      1-3p   p
C    p      p      p      1-3p
84
[Matrix U of the JC model; entries garbled in the source]
85
K2P Model tree = <Tree Topology> + <Uniform Distribution of DNA at all vertices> + <K2P instantaneous rate matrices at edges>
Uniform distribution: A, G, C, T each have probability 0.25.
86
87
Given sequences at two adjacent vertices, we define the edge length in two steps:
[Figure: an edge (u,v) with the sequences …TCTGGGA… and …GGGGATT… at its endpoints]
First, align the sequences:
vertices C1 C2 C3 C4 … Cm
u A A C A … G T C T T C G A G G C C C
v A G C A … G C C T A T G C G A C C T
88
Natural evolutionary distance: Total substitution rate
[Figure: a path u - v - w; one edge carries a K2P rate matrix with rates α, β and the other a K2P rate matrix with rates α', β'.]
Each edge is associated with a time t and a K2P rate matrix S. The total substitution rate along an edge of length t is t(α + 2β). Total substitution rate between species = sum of the rates over the path connecting them.
Total substitution rates are exact distances, which we try to reconstruct from observing the joint distribution of sequences at u and v.
89
How do we estimate DK2P(u,v)?
vertices C1 C2 C3 C4 … Cm
u A A C A … G T C T T C G A G G C C C
v A G C A … G C C T A T G C G A C C T
Our input is a pair of aligned sequences at u and v. They can be used to estimate the probability that a nucleotide X in u will be replaced by a nucleotide Y in v.
90
vertices C1 C2 C3 C4 … Cm
u A A C A … G T C T T C G A G G C C C
v A G C A … G C C T A T G C G A C C T
First step in distance estimation: estimate Puv from the joint distributions (Maximum Likelihood):
     A            G            T            C
A    1-Pα-2Pβ     Pα           Pβ           Pβ
G    Pα           1-Pα-2Pβ     Pβ           Pβ
T    Pβ           Pβ           1-Pα-2Pβ     Pα
C    Pβ           Pβ           Pα           1-Pα-2Pβ
91
92
Substitution matrix is estimated by the observed difference between the sequences.
[Figure: pairs of sequences (e.g. ACCGTTG, TCTGGGA, ACGGGTA, ACCCGTG, TCTGGTA) with the number of observed differences between them]
• Errors in distance estimations are amplified when:
  • The rate is small: the signal is too weak (in extreme cases, there are no substitutions whatsoever)
  • The rate is large: recent substitutions overwrite older ones.
93
94
How reliable
Consider “balanced” quartets. Define the “quartet ratio” to be the ratio between the middle edge and two external edges.
95
The rate matrix S implies a stochastic substitution matrix Puv :
[Figure: an edge (u,v) with rate matrix $S_{uv}$ and substitution matrix $P_{uv}$]
$P_{uv} = \exp(S_{uv})$
Puv =
     A            G            T            C
A    1-Pα-2Pβ     Pα           Pβ           Pβ
G    Pα           1-Pα-2Pβ     Pβ           Pβ
T    Pβ           Pβ           1-Pα-2Pβ     Pα
C    Pβ           Pβ           Pα           1-Pα-2Pβ
Puv defines the joint distribution of the sequences at u,v.
97
Performance of the standard distance method in reconstructing the split from estimated distances
[Figure: the three quartet topologies AB|CD, AC|BD, AD|BC; the true quartet has a separating edge of weight $w_{sep}$ and diameter diam.]
$d_T(A,B) + d_T(C,D) + 2w_{sep} = d_T(A,C) + d_T(B,D) = d_T(A,D) + d_T(B,C)$
• Distance based 4-point method (FPM):
$\hat{d}(A,B) + \hat{d}(C,D) < \min\{\hat{d}(A,C) + \hat{d}(B,D),\ \hat{d}(A,D) + \hat{d}(B,C)\}$
Reconstruction will fail if [condition in terms of $\frac{1}{2} w_{sep}$; garbled in the source].
98
[Figure: the template quartet with leaves A, B, C, D; edges of length t at the root and 10t to the leaves.]
99
Minimizing the expected relative error
Since ln(·) is non-linear, we only find the c which minimizes the NMSE of a linear approximation of $\hat{d}$ (using the "delta method"):
$\frac{E\left[\left((c\ln(\hat{\lambda}) + \ln(\hat{\mu})) - (c\ln(\lambda) + \ln(\mu))\right)^2\right]}{(c\ln(\lambda) + \ln(\mu))^2}$
and the optimal c is given by a closed-form expression $c_{opt}$ [formula garbled in the source].
Distance based methods: The general scheme
- Compute distances between all taxon-pairs (distance matrix D)
- Find a tree (edge-weighted) best-describing the distances
[Figure: a distance matrix D and an edge-weighted tree; one step of the scheme is labeled “This Talk”.]
101
Adaptive distance based algorithm for the K2P model
Find constants {c1, c2} s.t. the SR function $\Delta(P) = -c_1 \ln(\lambda_P) - c_2 \ln(\mu_P)$ is best for the input.
Compute the distance matrix $D(i,j) = \Delta(s_i, s_j)$.
Distance based methods: An adaptive scheme
- Find a good distance function: a distance function d which is good for the input (This work)
- Compute distances between all taxon-pairs: $D(i,j) = d(s_i, s_j)$
- Find a tree (edge-weighted) best-describing the distances
Promotion: Make distance based methods adaptive
$D(i,j) = d(s_i, s_j)$
106
Summary of previous slides:
SR functions for K2P are of the form: $\Delta(P) = -c_1 \ln(\lambda_P) - c_2 \ln(\mu_P)$.
$\frac{c_1}{c_1 + c_2}$ gives the weight the function puts on the transversions.
Next we show how this weight is affected by the total substitution rate and the transition/transversion ratio.