Chapter 11 Phylogenetic Tree Construction Methods and Programs
Phylogenetic Tree Construction
(Lecture for CS397-CXZ Algorithms in Bioinformatics)
April 2, 2004
ChengXiang Zhai
Department of Computer Science
University of Illinois, Urbana-Champaign
Introduction
• Phylogenetic tree: a tree with sequences as leaves that reflects their evolutionary relationships
• Formal properties
– Binary
– Rooted or unrooted
– Edge length reflects the amount of evolutionary divergence
• Construction methods (all related to clustering)
– Similarity/distance based (bottom-up construction)
– Maximum parsimony (search for the right tree)
– Probabilistic models (modeling a tree)
Similarity-based Methods
• Unweighted Pair Group Method using Arithmetic Averages (UPGMA)
– Essentially average-link clustering
– Node height: height(Ck) = ½ dij, where dij is the distance between the two children Ci and Cj of Ck
• Desirable properties of a tree
– Molecular clock (edge lengths): all leaves below a node are at the same distance from it (the tree shows time)
– Additivity: edge lengths are additive if the distance between any pair of leaves is the sum of the lengths of the edges on the path connecting them (the tree shows "changes")
• UPGMA guarantees the molecular-clock property but not necessarily additivity.
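The UPGMA procedure can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `upgma`, the nested-tuple tree representation, and the four-sequence distance matrix are assumptions made for the example.

```python
from itertools import combinations

def upgma(names, dist):
    """Average-link clustering (UPGMA).

    names: list of leaf labels.
    dist:  dict mapping frozenset({a, b}) -> distance between a and b.
    Returns (root, heights): the tree as nested tuples and a dict
    mapping every cluster to its node height.
    """
    sizes = {name: 1 for name in names}      # cluster -> number of leaves
    d = dict(dist)                           # working copy of the distances
    heights = {name: 0.0 for name in names}  # leaves sit at height 0
    while len(sizes) > 1:
        # Merge the two closest clusters.
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        new = (a, b)
        heights[new] = d[frozenset((a, b))] / 2  # node height = 1/2 d_ij
        na, nb = sizes.pop(a), sizes.pop(b)
        # Distance to every other cluster is the size-weighted average.
        for c in sizes:
            d[frozenset((new, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)
        sizes[new] = na + nb
    (root,) = sizes
    return root, heights

# Illustrative input distances (an assumption for this example).
raw = {("A", "B"): 8, ("A", "C"): 7, ("A", "D"): 12,
       ("B", "C"): 9, ("B", "D"): 14, ("C", "D"): 11}
dist = {frozenset(pair): float(v) for pair, v in raw.items()}
root, heights = upgma(["A", "B", "C", "D"], dist)
print(root, heights[root])
```

On this input the closest pair is A and C (distance 7), so UPGMA merges them first. Every leaf ends up at the same distance from the root, which is the molecular-clock property; nothing forces the path lengths in the tree to reproduce the input distances.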
Neighbor-Joining
• Adjust the distances
– Dij = dij – (ri + rj), where ri = \frac{1}{|L|-2} \sum_{k \in L} d_{ik} is the average distance from i to all other nodes (normalized by |L| – 2, with L the current set of nodes)
– The pair with minimum Dij is guaranteed to be a pair of neighbors (when the distances are additive)
• Alternative cluster distance function
– Suppose i and j are a pair of neighbors; replace them with a new node k
– Define dkm = ½ (dim + djm – dij) for any other node m
– This guarantees additivity
• Finally, the edge lengths for joining k to i and j are dik = ½ (dij + ri – rj) and djk = dij – dik.
• Used in ClustalW
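The neighbor-joining steps listed above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `neighbor_joining`, the internal node labels `N0`, `N1`, ..., and the edge-list output format are hypothetical choices, and ties in the adjusted distance are broken by taking the first minimal pair.

```python
def neighbor_joining(names, dist):
    """Neighbor-joining on a symmetric distance matrix.

    names: list of leaf labels.
    dist:  dict of dicts, dist[i][j] = distance between i and j.
    Returns a list of (node, neighbor, edge_length) for the unrooted tree.
    """
    d = {i: dict(row) for i, row in dist.items()}  # working copy
    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # r_i = sum of distances from i to the other nodes, divided by |L| - 2.
        r = {i: sum(d[i][k] for k in nodes if k != i) / (n - 2) for i in nodes}
        # Pick the pair minimizing the adjusted distance D_ij = d_ij - (r_i + r_j).
        pairs = [(a, b) for idx, a in enumerate(nodes) for b in nodes[idx + 1:]]
        i, j = min(pairs, key=lambda p: d[p[0]][p[1]] - (r[p[0]] + r[p[1]]))
        k = f"N{next_id}"
        next_id += 1
        # Edge lengths from the new node k to i and j.
        d_ik = 0.5 * (d[i][j] + r[i] - r[j])
        d_jk = d[i][j] - d_ik
        edges += [(i, k, d_ik), (j, k, d_jk)]
        # Distances from k to every remaining node m.
        d[k] = {}
        for m in nodes:
            if m not in (i, j):
                d[k][m] = d[m][k] = 0.5 * (d[i][m] + d[j][m] - d[i][j])
        nodes = [m for m in nodes if m not in (i, j)] + [k]
    a, b = nodes
    edges.append((a, b, d[a][b]))  # last edge joins the two remaining nodes
    return edges

names = ["A", "B", "C", "D"]
dist = {
    "A": {"B": 8, "C": 7, "D": 12},
    "B": {"A": 8, "C": 9, "D": 14},
    "C": {"A": 7, "B": 9, "D": 11},
    "D": {"A": 12, "B": 14, "C": 11},
}
edges = neighbor_joining(names, dist)
for node, neighbor, length in edges:
    print(node, neighbor, length)
```

Because the input distances here are additive, summing the returned edge lengths along the path between any two leaves gives back the input distance for that pair.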
Neighbor-Joining: Example
[Figure: original (true) tree on leaves A, B, C, D; edge labels 2, 3, 5, 3, 1, 6]

Original distance matrix, with r for each sequence:
Sequence    A    B    C    D       r
A           0    8    7   12    13.5
B           8    0    9   14    15.5
C           7    9    0   11    13.5
D          12   14   11    0    18.5

Adjusted distance matrix, e.g., DAB = 8 – (13.5 + 15.5) = –21:
Sequence    A    B    C    D
A           0  -21  -20  -20
B         -21    0  -20  -20
C         -20  -20    0  -21
D         -20  -20  -21    0
Neighbor-Joining: Example (cont.)
A and B are joined to a new node F:
– dAF = (8 – (15.5 – 13.5))/2 = 3
– dBF = (8 + (15.5 – 13.5))/2 = 5
– dFC = (dAC + dBC – dAB)/2 = 4

Intermediate distance matrix, with r for each node:
Node    F    C    D     r
F       0    4    9    13
C       4    0   11    15
D       9   11    0    20

Adjusted distance matrix, e.g., DFC = 4 – (13 + 15) = –24:
Node    F    C    D
F       0  -24  -24
C     -24    0  -24
D     -24  -24    0

[Figure: intermediate and final trees on A, B, C, D with internal node F; edge labels 4, 9, 11, 5, 3 and 3, 8, 1, with the root marked (label 6)]
Maximum Parsimony
• Maximum parsimony principle: the best phylogenetic tree is the one that requires the fewest changes in the genetic code to explain the observed sequences.
Alignment of four sequences (10 sites):
Site    1  2  3  4  5  6  7  8  9  10
Seq 1   A  G  G  G  T  A  A  C  T  G
Seq 2   A  C  G  A  T  T  A  T  T  A
Seq 3   A  T  A  A  T  T  G  T  C  T
Seq 4   A  A  T  G  T  T  G  T  C  G

The three possible unrooted trees for four sequences:
– Tree 1: ((1,2),(3,4))
– Tree 2: ((1,4),(2,3))
– Tree 3: ((1,3),(2,4))

Minimum number of changes required at each site:
Site    1  2  3  4  5  6  7  8  9  10   Total
Tree 1  0  3  2  2  0  1  1  1  1  2    13
Tree 2  0  3  2  1  0  1  2  1  2  2    14
Tree 3  0  3  2  2  0  1  2  1  2  2    15

• Site 2 (G, C, T, A): all four bases differ, so every tree needs 3 changes.
• Site 4 (G, A, A, G): the tree that groups sequences 1 and 4 needs 1 change; the other two trees need 2.
• Informative site = discriminative site: a site is informative only if it favors some trees over others. Here sites 4, 7, and 9 are informative.
• Tree 1 requires the fewest changes overall, so it is the most parsimonious tree.
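The per-site change counts can be computed mechanically. The sketch below uses Fitch's small-parsimony algorithm, which is a swapped-in technique (the slides do not name a counting method); the function names `fitch_site` and `parsimony_scores` and the nested-tuple tree encoding are assumptions for illustration. Each unrooted four-leaf tree is rooted on its internal edge, which does not change the count.

```python
def fitch_site(tree, states):
    """Fitch's algorithm for one site.

    tree:   a leaf label, or a pair (left_subtree, right_subtree).
    states: dict mapping leaf label -> base at this site.
    Returns (candidate_state_set, minimum_number_of_changes).
    """
    if not isinstance(tree, tuple):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        # Children agree on at least one state: no extra change needed.
        return common, left_cost + right_cost
    # Children disagree: one change is required on this node's branches.
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_scores(tree, seqs):
    """Minimum number of changes at each site of an alignment."""
    n_sites = len(next(iter(seqs.values())))
    return [
        fitch_site(tree, {name: s[u] for name, s in seqs.items()})[1]
        for u in range(n_sites)
    ]

seqs = {
    1: "AGGGTAACTG",
    2: "ACGATTATTA",
    3: "ATAATTGTCT",
    4: "AATGTTGTCG",
}
trees = {
    "((1,2),(3,4))": ((1, 2), (3, 4)),
    "((1,4),(2,3))": ((1, 4), (2, 3)),
    "((1,3),(2,4))": ((1, 3), (2, 4)),
}
scores = {name: parsimony_scores(t, seqs) for name, t in trees.items()}
for name, per_site in scores.items():
    print(name, per_site, sum(per_site))
```

Sites where all three trees get the same count do not affect the ranking; only the informative sites decide which tree wins, and here the tree grouping sequences 1 and 2 has the lowest total.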
Probabilistic Approaches
• Basic idea:
– Tree= Generative probabilistic model, e.g., an n-leaf tree defines a model p(X1, …,Xn)
– Data: sequences {s1, …, sn}
– Choose the tree according to
  • Maximum likelihood: p(Data|Tree)
  • Maximum a posteriori (Bayesian): p(Tree|Data)
• Model evolution more directly
• Computationally expensive
Detailed View of Probabilistic Models
[Figure: tree with root x5; x5 has children x4 (edge length t4) and x3 (edge length t3); x4 has children x1 (edge length t1) and x2 (edge length t2)]
The tree in the figure defines the following probabilistic model:
$p(x^1, \ldots, x^5 \mid T, t) = p(x^1 \mid x^4, t_1)\, p(x^2 \mid x^4, t_2)\, p(x^3 \mid x^5, t_3)\, p(x^4 \mid x^5, t_4)\, p(x^5)$
Basic evolution model: p(x|y,t) = probability of x arising from an ancestral sequence y over an edge of length t
Decompose the sequence ("independence assumption"):
$p(x \mid y, t) = \prod_u p(x_u \mid y_u, t)$
Decompose the time ("Markovian assumption"):
$p(a \mid c, s + t) = \sum_b p(b \mid c, s)\, p(a \mid b, t)$
"Primitive evolution model": p(a|b,t)
- Nucleotides: Jukes-Cantor model
- Amino acids: PAM
The Jukes-Cantor model
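Rate matrix R and substitution matrix S(t), with rows and columns ordered A, C, G, T:
$R = \begin{pmatrix} -3\alpha & \alpha & \alpha & \alpha \\ \alpha & -3\alpha & \alpha & \alpha \\ \alpha & \alpha & -3\alpha & \alpha \\ \alpha & \alpha & \alpha & -3\alpha \end{pmatrix}, \quad S(t) = \begin{pmatrix} r_t & s_t & s_t & s_t \\ s_t & r_t & s_t & s_t \\ s_t & s_t & r_t & s_t \\ s_t & s_t & s_t & r_t \end{pmatrix}$
Short time $\varepsilon$: $S(\varepsilon) \approx I + R\varepsilon$
$S(t + \varepsilon) = S(t)\, S(\varepsilon) \approx S(t)(I + R\varepsilon)$
$S'(t) = S(t)\, R$
$r' = -3\alpha r + 3\alpha s, \quad s' = -\alpha s + \alpha r$
Solutions: $r_t = (1 + 3e^{-4\alpha t})/4$, $s_t = (1 - e^{-4\alpha t})/4$.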
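A short numerical check of the Jukes-Cantor solution can make its properties concrete. This is a sketch under assumptions: the rate parameter is written `alpha` and defaults to 1, and the helper names `jukes_cantor` and `matmul` are hypothetical. It builds S(t) from the closed-form solution and checks that each row is a probability distribution and that S(s)S(t) = S(s + t), which is the Markovian (multiplicative) property.

```python
import math

BASES = "ACGT"

def jukes_cantor(t, alpha=1.0):
    """4x4 Jukes-Cantor substitution matrix S(t), rows/columns ordered A, C, G, T."""
    e = math.exp(-4.0 * alpha * t)
    r = (1.0 + 3.0 * e) / 4.0  # probability of ending in the same base
    s = (1.0 - e) / 4.0        # probability of ending in one specific other base
    return [[r if a == b else s for b in BASES] for a in BASES]

def matmul(x, y):
    """Plain 4x4 matrix product."""
    return [[sum(x[i][k] * y[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

S1 = jukes_cantor(0.3)
S2 = jukes_cantor(0.5)
S12 = jukes_cantor(0.8)
product_matrix = matmul(S1, S2)

# Each row sums to 1, and composing two edges equals one edge of total length.
print([round(sum(row), 12) for row in S1])
print(max(abs(product_matrix[i][j] - S12[i][j])
          for i in range(4) for j in range(4)))
```

As t grows, both r and s approach 1/4, so a long edge carries no information about the ancestral base.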
Computing the Likelihood
[Figure: tree with root x5, internal node x4, leaves x1, x2, x3, and edge lengths t1, …, t4]
With parents known:
$p(x^1, \ldots, x^5 \mid T, t) = p(x^1 \mid x^4, t_1)\, p(x^2 \mid x^4, t_2)\, p(x^3 \mid x^5, t_3)\, p(x^4 \mid x^5, t_4)\, p(x^5)$
But we don't know the parents…
Handling the Hidden Nodes
• We must sum up over all the hidden ancestral nodes
• Felsenstein’s algorithm for likelihood: Compute the sum in a bottom up fashion
– Start from leaves
– Compute the parent node based on children nodes
$p(x^1_u, x^2_u \mid T, t_1, t_2) = \sum_a q_a\, p(x^1_u \mid a, t_1)\, p(x^2_u \mid a, t_2)$
(q_a: probability of residue a at the root)
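The bottom-up sum can be sketched for a single site on a five-node tree (root x5 with children x4 and x3; x4 with children x1 and x2). This is an illustration under assumptions: the Jukes-Cantor model with rate `alpha = 1`, a uniform root distribution q_a = 1/4, hypothetical edge lengths, and hypothetical function names. A brute-force sum over the two hidden nodes is included only as a cross-check.

```python
import math
from itertools import product

BASES = "ACGT"

def jc_prob(a, b, t, alpha=1.0):
    """Jukes-Cantor probability of base a arising from base b over an edge of length t."""
    e = math.exp(-4.0 * alpha * t)
    return (1.0 + 3.0 * e) / 4.0 if a == b else (1.0 - e) / 4.0

def conditional(node, site):
    """P(observed leaves below node | node has base a), for each base a.

    node: a leaf name, or a tuple (left, t_left, right, t_right).
    site: dict mapping leaf name -> observed base.
    """
    if isinstance(node, str):
        return {a: 1.0 if a == site[node] else 0.0 for a in BASES}
    left, t_left, right, t_right = node
    lc, rc = conditional(left, site), conditional(right, site)
    # Sum over the children's bases, one child at a time.
    return {
        a: sum(jc_prob(b, a, t_left) * lc[b] for b in BASES)
        * sum(jc_prob(c, a, t_right) * rc[c] for c in BASES)
        for a in BASES
    }

def site_likelihood(tree, site, prior=0.25):
    """Likelihood of one site: sum over root bases of q_a * conditional."""
    return sum(prior * p for p in conditional(tree, site).values())

def brute_force(site, t1, t2, t3, t4, prior=0.25):
    """Explicit sum over the hidden bases at x5 and x4 (cross-check only)."""
    total = 0.0
    for a5, a4 in product(BASES, repeat=2):
        total += (prior * jc_prob(a4, a5, t4) * jc_prob(site["x3"], a5, t3)
                  * jc_prob(site["x1"], a4, t1) * jc_prob(site["x2"], a4, t2))
    return total

T = (0.1, 0.2, 0.3, 0.15)               # hypothetical t1, t2, t3, t4
x4 = ("x1", T[0], "x2", T[1])
tree = (x4, T[3], "x3", T[2])           # root x5
site = {"x1": "A", "x2": "A", "x3": "G"}
lik = site_likelihood(tree, site)
print(lik, brute_force(site, *T))
```

The recursion visits each node once, so its cost grows linearly with the number of nodes, whereas the explicit sum grows exponentially with the number of hidden nodes. Under the independence assumption, the likelihood of a whole alignment is the product of the per-site values.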
Maximizing the Likelihood
• Easy for small number of sequences
• Generally complex for large number of sequences
• Many solutions:
– EM
– Gradient descent
– Sampling
• Metropolis sampling
– Accept a new tree if P(new-tree) >= P(old-tree)
– Accept a new tree with prob. P(new-tree)/P(old-tree) if P(new-tree) < P(old-tree)
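The Metropolis acceptance rule above can be sketched as a single function. This is a minimal illustration: the name `metropolis_accept` is hypothetical, the probabilities are passed in as plain positive numbers, and how candidate trees are proposed is left out entirely.

```python
import random

def metropolis_accept(p_new, p_old, rng=random):
    """Decide whether to move from the old tree to the new tree.

    p_new, p_old: (possibly unnormalized) probabilities of the two trees;
    p_old is assumed to be positive.
    rng: any object with a random() method returning a float in [0, 1).
    """
    if p_new >= p_old:
        return True                       # never reject an improvement
    return rng.random() < p_new / p_old   # accept a worse tree with prob. p_new/p_old

# Empirical acceptance rate for a tree four times less probable than the current one.
rng = random.Random(0)
trials = 10000
accepted = sum(metropolis_accept(0.25, 1.0, rng) for _ in range(trials))
print(accepted / trials)
```

Only the ratio of the two probabilities matters, so the normalizing constant of the distribution over trees never has to be computed. In practice the comparison is usually done on log-probabilities to avoid underflow.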
More realistic evolutionary models
• Allowing different rates at different sites
– Using a prior (e.g., gamma) to regularize the different rates
– Hidden Markov models
• Evolutionary models with gaps
– Tree HMMs