A Fundamental Decomposition Theory for Phylogenetic Networks and Incompatible Characters
![Page 1: A Fundamental Decomposition Theory for Phylogenetic Networks and Incompatible Characters](https://reader033.fdocuments.net/reader033/viewer/2022051420/56815dfd550346895dcc3be5/html5/thumbnails/1.jpg)
D. Gusfield, V. Bansal (Recomb 2005)
A Fundamental Decomposition Theory for Phylogenetic
Networks and Incompatible Characters
![Page 2: A Fundamental Decomposition Theory for Phylogenetic Networks and Incompatible Characters](https://reader033.fdocuments.net/reader033/viewer/2022051420/56815dfd550346895dcc3be5/html5/thumbnails/2.jpg)
Alternative Title
The Continuing Role of Incompatibility Graphs in the Study of Phylogenetic Networks
![Page 3: A Fundamental Decomposition Theory for Phylogenetic Networks and Incompatible Characters](https://reader033.fdocuments.net/reader033/viewer/2022051420/56815dfd550346895dcc3be5/html5/thumbnails/3.jpg)
Genealogical or Phylogenetic Networks
• The major biological motivation comes from genetics and attempts to reconstruct the history of recombination in populations.
• The results also have phylogenetic applications, for example in hybrid speciation and lateral gene transfer.
![Page 4: A Fundamental Decomposition Theory for Phylogenetic Networks and Incompatible Characters](https://reader033.fdocuments.net/reader033/viewer/2022051420/56815dfd550346895dcc3be5/html5/thumbnails/4.jpg)
Reconstructing the Evolution of Binary Bio-Sequences (SNPs)
• Perfect Phylogeny (tree) model
• Phylogenetic Networks (DAG) with recombination (ARG)
• Blobbed Trees
• Incompatibility Graph and its Connected Components
• Prior uses of Connected Components
• Decomposition Theorem and Proof Sketch
• Optimality Conjecture and Progress
![Page 5: A Fundamental Decomposition Theory for Phylogenetic Networks and Incompatible Characters](https://reader033.fdocuments.net/reader033/viewer/2022051420/56815dfd550346895dcc3be5/html5/thumbnails/5.jpg)
The Perfect Phylogeny Model for binary sequences
[Figure: a perfect phylogeny rooted at the ancestral sequence 00000. Site mutations (sites 1 to 5) label the edges, one mutation per site, and the extant sequences are at the leaves.]
The tree derives the set M: 10100, 10000, 01011, 01010, 00010.
![Page 6: A Fundamental Decomposition Theory for Phylogenetic Networks and Incompatible Characters](https://reader033.fdocuments.net/reader033/viewer/2022051420/56815dfd550346895dcc3be5/html5/thumbnails/6.jpg)
When can a set of sequences be derived on a perfect phylogeny?
Classic NASC (necessary and sufficient condition): Arrange the sequences in a matrix. Then (with no duplicate columns), the sequences can be generated on a unique perfect phylogeny if and only if no two columns (sites) contain all four pairs: 0,0 and 0,1 and 1,0 and 1,1.
This is the 4-Gamete Test.
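The 4-gamete test is easy to state in code. The following is a minimal sketch in Python, assuming sequences are given as equal-length strings of '0' and '1'; the function names are illustrative and do not come from the talk.

```python
from itertools import combinations


def fails_four_gamete_test(rows, i, j):
    """Return True if columns i and j (0-based) contain all four pairs 00, 01, 10, 11."""
    pairs = {(row[i], row[j]) for row in rows}
    return len(pairs) == 4


def has_perfect_phylogeny(rows):
    """Return True if no pair of columns fails the 4-gamete test."""
    m = len(rows[0])
    return not any(
        fails_four_gamete_test(rows, i, j) for i, j in combinations(range(m), 2)
    )


# The five sequences derived by the perfect phylogeny in the talk.
M = ["10100", "10000", "01011", "01010", "00010"]
print(has_perfect_phylogeny(M))              # True
print(has_perfect_phylogeny(M + ["10101"]))  # False: sites 4 and 5 show all four pairs
```

The check compares every pair of columns, so it takes O(nm^2) time for n sequences and m sites.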
![Page 7: A Fundamental Decomposition Theory for Phylogenetic Networks and Incompatible Characters](https://reader033.fdocuments.net/reader033/viewer/2022051420/56815dfd550346895dcc3be5/html5/thumbnails/7.jpg)
A richer model
[Figure: the perfect phylogeny for sites 1 to 5, with ancestral sequence 00000.]
Sequences: 10100, 10000, 01011, 01010, 00010, and 10101 (added).
Pair 4, 5 fails the four-gamete test. The sites 4, 5 are "incompatible".
Real sequence histories often involve recombination.
![Page 8: A Fundamental Decomposition Theory for Phylogenetic Networks and Incompatible Characters](https://reader033.fdocuments.net/reader033/viewer/2022051420/56815dfd550346895dcc3be5/html5/thumbnails/8.jpg)
Sequence Recombination
Single crossover recombination, called "crossing over" in genetics.
P = 10100, S = 01011, recombinant = 10101.
A recombination of P and S at recombination point 5: the first 4 sites come from P (Prefix) and the sites from 5 onward come from S (Suffix).
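A single crossover can be sketched as a string operation. This is a minimal illustration in Python, assuming 1-based recombination points as on the slide; the function name `single_crossover` is hypothetical.

```python
def single_crossover(prefix_parent, suffix_parent, point):
    """Recombine two equal-length sequences at a 1-based recombination point.

    Sites before `point` come from prefix_parent (P); sites from `point`
    onward come from suffix_parent (S).
    """
    if len(prefix_parent) != len(suffix_parent):
        raise ValueError("parents must have the same number of sites")
    if not 1 <= point <= len(prefix_parent):
        raise ValueError("recombination point must be a valid site index")
    return prefix_parent[:point - 1] + suffix_parent[point - 1:]


# The example from the slide: P = 10100, S = 01011, recombination point 5.
print(single_crossover("10100", "01011", 5))  # 10101
```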
![Page 9: A Fundamental Decomposition Theory for Phylogenetic Networks and Incompatible Characters](https://reader033.fdocuments.net/reader033/viewer/2022051420/56815dfd550346895dcc3be5/html5/thumbnails/9.jpg)
Network with Recombination
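[Figure: the previous tree, rooted at 00000 with sites 1 to 5 on its edges, extended by one recombination node with parents P and S and recombination point 5, producing the new sequence 10101.]
Sequences: 10100, 10000, 01011, 01010, 00010, and 10101 (new).
The previous tree with one recombination event now derives all the sequences.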
![Page 10: A Fundamental Decomposition Theory for Phylogenetic Networks and Incompatible Characters](https://reader033.fdocuments.net/reader033/viewer/2022051420/56815dfd550346895dcc3be5/html5/thumbnails/10.jpg)
Multiple Crossover Recombination
4-crossovers
2-crossovers = "gene conversion"
![Page 11: A Fundamental Decomposition Theory for Phylogenetic Networks and Incompatible Characters](https://reader033.fdocuments.net/reader033/viewer/2022051420/56815dfd550346895dcc3be5/html5/thumbnails/11.jpg)
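A Phylogenetic Network
[Figure: a phylogenetic network rooted at 00000, with site mutations 1 to 5 on edges and recombination nodes whose incoming edges are labeled P and S. Leaf sequences: a: 00010, b: 10010, c: 00100, d: 10100, e: 01100, f: 01101, g: 00101.]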
![Page 12: A Fundamental Decomposition Theory for Phylogenetic Networks and Incompatible Characters](https://reader033.fdocuments.net/reader033/viewer/2022051420/56815dfd550346895dcc3be5/html5/thumbnails/12.jpg)
Minimizing Recombinations
• Any set M of sequences can be generated by a phylogenetic network with enough recombinations, and one mutation per site. This is not interesting or useful.
• However, the number of (observable) recombinations is small in realistic sets of sequences. "Observable" depends on n and m relative to the number of recombinations.
• Problem: Given a set of sequences M, find a phylogenetic network generating M, minimizing the number of recombinations. NP-hard (Wang et al 2000, Semple et al 2004)
![Page 13: A Fundamental Decomposition Theory for Phylogenetic Networks and Incompatible Characters](https://reader033.fdocuments.net/reader033/viewer/2022051420/56815dfd550346895dcc3be5/html5/thumbnails/13.jpg)
Decomposition can help
First we introduce the viewpoint needed.
![Page 14: A Fundamental Decomposition Theory for Phylogenetic Networks and Incompatible Characters](https://reader033.fdocuments.net/reader033/viewer/2022051420/56815dfd550346895dcc3be5/html5/thumbnails/14.jpg)
Blobs in Networks
• In a phylogenetic network with a recombination node x, if we trace two paths backwards from x, then the paths will eventually meet.
• The cycle specified by those two paths is called a "recombination cycle".
• In a phylogenetic network, a maximal set of (edge) intersecting cycles is called a blob.
![Page 15: A Fundamental Decomposition Theory for Phylogenetic Networks and Incompatible Characters](https://reader033.fdocuments.net/reader033/viewer/2022051420/56815dfd550346895dcc3be5/html5/thumbnails/15.jpg)
A Phylogenetic Network with one Blob
[Figure: a phylogenetic network rooted at 00000 whose recombination cycles form a single blob. Leaf sequences: a: 00010, b: 10010, c: 00100, d: 10100, e: 01100, f: 01101, g: 00101.]
![Page 16: A Fundamental Decomposition Theory for Phylogenetic Networks and Incompatible Characters](https://reader033.fdocuments.net/reader033/viewer/2022051420/56815dfd550346895dcc3be5/html5/thumbnails/16.jpg)
Blobbed-trees
• Contracting each blob to a single node results in a directed, rooted tree, otherwise one of the “blobs” was not maximal.
• So every phylogenetic network can be viewed as a directed tree of blobs - a blobbed-tree.
The blobs are the non-tree-like parts of the network.
![Page 17: A Fundamental Decomposition Theory for Phylogenetic Networks and Incompatible Characters](https://reader033.fdocuments.net/reader033/viewer/2022051420/56815dfd550346895dcc3be5/html5/thumbnails/17.jpg)
[Figure: a blobbed-tree, with the annotation "Ugly tangled network inside the blob."]
Every network is a tree of blobs. How do the tree parts and the blobs relate? How can we exploit this relationship?
![Page 18: A Fundamental Decomposition Theory for Phylogenetic Networks and Incompatible Characters](https://reader033.fdocuments.net/reader033/viewer/2022051420/56815dfd550346895dcc3be5/html5/thumbnails/18.jpg)
Incompatibility Graph G(M)
M (rows a to g, sites 1 to 5):
a: 00010
b: 10010
c: 00100
d: 10100
e: 01100
f: 01101
g: 00101
[Figure: G(M) drawn on nodes 1 to 5, with sites 1, 3, 4 grouped together and sites 2, 5 grouped together.]
Two nodes are connected iff the pair of sites is incompatible, i.e., fails the 4-gamete test.
G(M) has two connected components.
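The incompatibility graph and its connected components can be computed directly from M. Below is a minimal sketch in Python, assuming rows are strings of '0' and '1' and sites are numbered from 1; the function names are illustrative, not taken from the talk or its software.

```python
from itertools import combinations


def incompatibility_graph(rows):
    """Adjacency sets of G(M): sites i and j are adjacent iff they fail the 4-gamete test."""
    m = len(rows[0])
    adj = {site: set() for site in range(1, m + 1)}
    for i, j in combinations(range(m), 2):
        if len({(row[i], row[j]) for row in rows}) == 4:
            adj[i + 1].add(j + 1)
            adj[j + 1].add(i + 1)
    return adj


def connected_components(adj):
    """Connected components of an undirected graph, each as a sorted list of nodes."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:
            v = stack.pop()
            component.add(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        components.append(sorted(component))
    return components


# The matrix M from the slide (rows a to g).
M = ["00010", "10010", "00100", "10100", "01100", "01101", "00101"]
G = incompatibility_graph(M)
print(connected_components(G))  # [[1, 3, 4], [2, 5]]
```

For this M the incompatible pairs are (1, 3), (1, 4) and (2, 5), which gives the two components shown on the slide.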
![Page 19: A Fundamental Decomposition Theory for Phylogenetic Networks and Incompatible Characters](https://reader033.fdocuments.net/reader033/viewer/2022051420/56815dfd550346895dcc3be5/html5/thumbnails/19.jpg)
The connected components of G(M) are very informative
• The number of non-trivial connected components is a lower-bound on the number of recombinations needed in any network (Bafna, Bansal; Gusfield, Hickerson).
• When each blob is a single-cycle (galled-tree case) all the incompatible sites in a blob must come from a single connected component C, and that blob must contain all the sites from C. Compatible sites need not be inside any blob. (Gusfield et al 2003-5)
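The first bullet's lower bound is a simple count once the components are known. A minimal sketch in Python, assuming the components are supplied as lists of site numbers and that "non-trivial" means a component with more than one site; the function name is hypothetical.

```python
def component_lower_bound(components):
    """Number of non-trivial connected components (more than one site) of G(M)."""
    return sum(1 for component in components if len(component) > 1)


# Components of G(M) for the running example: {1, 3, 4} and {2, 5}.
print(component_lower_bound([[1, 3, 4], [2, 5]]))  # 2
# A component with a single site is a compatible site and does not count.
print(component_lower_bound([[1, 3, 4], [2], [5]]))  # 1
```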
![Page 20: A Fundamental Decomposition Theory for Phylogenetic Networks and Incompatible Characters](https://reader033.fdocuments.net/reader033/viewer/2022051420/56815dfd550346895dcc3be5/html5/thumbnails/20.jpg)
Galled-Tree Structure
So when each blob contains only a single cycle, there is a one-one correspondence between the blobs and the non-trivial connected components of the incompatibility graph. This is the central fact used in polynomial-time solutions to the (NP-hard) recombination minimization problem, when a galled-tree for M exists.
Motivating Question: To what extent does this clean one-one structure carry over to general phylogenetic networks? How do we exploit the general structure?
![Page 21: A Fundamental Decomposition Theory for Phylogenetic Networks and Incompatible Characters](https://reader033.fdocuments.net/reader033/viewer/2022051420/56815dfd550346895dcc3be5/html5/thumbnails/21.jpg)
The Decomposition Theorem (Recomb 2005)
For any set of sequences M, there is a blobbed-tree T(M) that derives M, where each blob contains all and only the sites in one non-trivial connected component of G(M). The compatible sites can always be put on edges outside of any blob.
A blobbed-tree with this structure is called fully-decomposed.
![Page 22: A Fundamental Decomposition Theory for Phylogenetic Networks and Incompatible Characters](https://reader033.fdocuments.net/reader033/viewer/2022051420/56815dfd550346895dcc3be5/html5/thumbnails/22.jpg)
General Structure
So, for any set of sequences M, there is a phylogenetic network where there is a one-one correspondence between the blobs and the non-trivial connected components of G(M).
Moreover, the tree part of T(M) is unique. And it is easy to find the tree part.
![Page 23: A Fundamental Decomposition Theory for Phylogenetic Networks and Incompatible Characters](https://reader033.fdocuments.net/reader033/viewer/2022051420/56815dfd550346895dcc3be5/html5/thumbnails/23.jpg)
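[Figure: a fully-decomposed network with leaves a: 00010, b: 10010, d: 10100, c: 00100, e: 01100, f: 01101, g: 00101. Recombination edges are labeled p and s. Inset: the incompatibility graph, with sites 1, 3, 4 in one component and sites 2, 5 in the other.]
A fully-decomposed network for the sequences generated by the prior network.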
![Page 24: A Fundamental Decomposition Theory for Phylogenetic Networks and Incompatible Characters](https://reader033.fdocuments.net/reader033/viewer/2022051420/56815dfd550346895dcc3be5/html5/thumbnails/24.jpg)
Proof Ideas
Let C be a connected component of G(M). Define M[C] as the sequences in M restricted to the sites in C.
![Page 25: A Fundamental Decomposition Theory for Phylogenetic Networks and Incompatible Characters](https://reader033.fdocuments.net/reader033/viewer/2022051420/56815dfd550346895dcc3be5/html5/thumbnails/25.jpg)
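M (rows a to g, sites 1 to 5):
a: 00010
b: 10010
c: 00100
d: 10100
e: 01100
f: 01101
g: 00101
The connected components of G(M) are C1 = {1, 3, 4} and C2 = {2, 5}.
M[C1] (sites 1, 3, 4):
a: 001
b: 101
c: 010
d: 110
e: 010
f: 010
g: 010
M[C2] (sites 2, 5):
a: 00
b: 00
c: 00
d: 00
e: 10
f: 11
g: 01
[Figure labels: B1, B2.]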
![Page 26: A Fundamental Decomposition Theory for Phylogenetic Networks and Incompatible Characters](https://reader033.fdocuments.net/reader033/viewer/2022051420/56815dfd550346895dcc3be5/html5/thumbnails/26.jpg)
Faux Proof
Pick one site from each connected component C in G(M) to "represent" C. No pair of those sites is incompatible, so by the NASC for a perfect phylogeny, there will be a perfect phylogeny T for the sites. Expand each node to a network generating the sequences in M[C].
Incorrect, because the structure of T can be wrong. We need to use information about all the sites in each C.
![Page 27: A Fundamental Decomposition Theory for Phylogenetic Networks and Incompatible Characters](https://reader033.fdocuments.net/reader033/viewer/2022051420/56815dfd550346895dcc3be5/html5/thumbnails/27.jpg)
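Now for each connected component C in G(M), call each distinct sequence in M[C] a supercharacter, and let W be the indicator matrix for the supercharacters. So W indicates which rows of M contain which particular supercharacters.
Supercharacters 1 to 4 are the distinct sequences of M[C1] (sites 1, 3, 4), and supercharacters 5 to 8 are the distinct sequences of M[C2] (sites 2, 5):
a: M[C1] = 001 (supercharacter 1), M[C2] = 00 (supercharacter 5)
b: M[C1] = 101 (supercharacter 2), M[C2] = 00 (supercharacter 5)
c: M[C1] = 010 (supercharacter 3), M[C2] = 00 (supercharacter 5)
d: M[C1] = 110 (supercharacter 4), M[C2] = 00 (supercharacter 5)
e: M[C1] = 010 (supercharacter 3), M[C2] = 10 (supercharacter 6)
f: M[C1] = 010 (supercharacter 3), M[C2] = 11 (supercharacter 7)
g: M[C1] = 010 (supercharacter 3), M[C2] = 01 (supercharacter 8)
W (columns are supercharacters 1 to 8):
a: 1 0 0 0 1 0 0 0
b: 0 1 0 0 1 0 0 0
c: 0 0 1 0 1 0 0 0
d: 0 0 0 1 1 0 0 0
e: 0 0 1 0 0 1 0 0
f: 0 0 1 0 0 0 1 0
g: 0 0 1 0 0 0 0 1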
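The construction of W can be sketched in a few lines. The following Python is an illustration only, assuming the connected components are already known and given as lists of 1-based site numbers; the function names and the numbering of supercharacters by first appearance are assumptions made to match the slide.

```python
from itertools import combinations


def restrict(rows, component):
    """M[C]: every row of M restricted to the 1-based sites in `component`."""
    return ["".join(row[site - 1] for site in component) for row in rows]


def supercharacter_matrix(rows, components):
    """Return (labels, W).

    labels[k] is (component index, restricted sequence) for supercharacter k + 1,
    numbered in order of first appearance. W[r][k] is 1 iff row r contains
    supercharacter k + 1.
    """
    restricted = [restrict(rows, component) for component in components]
    labels = []
    for c, sub in enumerate(restricted):
        for seq in sub:
            if (c, seq) not in labels:
                labels.append((c, seq))
    W = [
        [1 if restricted[c][r] == seq else 0 for c, seq in labels]
        for r in range(len(rows))
    ]
    return labels, W


def columns_pairwise_compatible(W):
    """True if no pair of columns of W contains all four pairs (4-gamete test)."""
    m = len(W[0])
    return all(
        len({(row[i], row[j]) for row in W}) < 4
        for i, j in combinations(range(m), 2)
    )


M = ["00010", "10010", "00100", "10100", "01100", "01101", "00101"]
labels, W = supercharacter_matrix(M, [[1, 3, 4], [2, 5]])
for name, row in zip("abcdefg", W):
    print(name, row)
print(columns_pairwise_compatible(W))  # True
```

On this example the rows of W reproduce the indicator matrix above, and no pair of supercharacter columns fails the 4-gamete test.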
![Page 28: A Fundamental Decomposition Theory for Phylogenetic Networks and Incompatible Characters](https://reader033.fdocuments.net/reader033/viewer/2022051420/56815dfd550346895dcc3be5/html5/thumbnails/28.jpg)
Proof Ideas
Lemma: No pair of supercharacters are incompatible.
So by the NASC for a Perfect Phylogeny, there is a unique perfect phylogeny T for W.
![Page 29: A Fundamental Decomposition Theory for Phylogenetic Networks and Incompatible Characters](https://reader033.fdocuments.net/reader033/viewer/2022051420/56815dfd550346895dcc3be5/html5/thumbnails/29.jpg)
Proof Ideas
For each connected component C of G(M), all supercharacters that originate from C label edges in T that are incident with one single node v[C] in T. So, if we expand each node v[C] to be a network that generates the supercharacters from C (the sequences in M[C]), and connect each network correctly to the edges in T, the resulting network is a fully-decomposed blobbed-tree that generates M.
![Page 30: A Fundamental Decomposition Theory for Phylogenetic Networks and Incompatible Characters](https://reader033.fdocuments.net/reader033/viewer/2022051420/56815dfd550346895dcc3be5/html5/thumbnails/30.jpg)
Algorithmically, T is easy to find and is the tree resulting from contracting each blob in the fully-decomposed blobbed-tree T(M) for M. T can be constructed from M in O(nm^2) time.
![Page 31: A Fundamental Decomposition Theory for Phylogenetic Networks and Incompatible Characters](https://reader033.fdocuments.net/reader033/viewer/2022051420/56815dfd550346895dcc3be5/html5/thumbnails/31.jpg)
Broader Biological Applications
Our major interest is in recombination, but the proof of the decomposition theorem does not explicitly use recombination. So it holds for whatever biological phenomena caused the incompatibility of sites. For example, back or recurrent mutation, gene-conversion, lateral gene transfer etc.
![Page 32: A Fundamental Decomposition Theory for Phylogenetic Networks and Incompatible Characters](https://reader033.fdocuments.net/reader033/viewer/2022051420/56815dfd550346895dcc3be5/html5/thumbnails/32.jpg)
What is the most tree-like network?
• Simple definition: The "treeness" of a network is the number of edges in the tree after contracting each blob to a single node.
• Simple fact: In any phylogenetic network N for M, all sites from a single non-trivial connected component must be together in a single blob of N.
• Hence, under this simple definition, a fully-decomposed blobbed tree is the most tree-like network for M.
![Page 33: A Fundamental Decomposition Theory for Phylogenetic Networks and Incompatible Characters](https://reader033.fdocuments.net/reader033/viewer/2022051420/56815dfd550346895dcc3be5/html5/thumbnails/33.jpg)
The supercharacters from M play the role in phylogenetic networks that normal binary characters play in perfect phylogeny trees.
So supercharacters are the fundamental characters of phylogenetic networks.
![Page 34: A Fundamental Decomposition Theory for Phylogenetic Networks and Incompatible Characters](https://reader033.fdocuments.net/reader033/viewer/2022051420/56815dfd550346895dcc3be5/html5/thumbnails/34.jpg)
The main open question
The Decomposition Theorem says there is always a fully-decomposed blobbed-tree for any M, but
Is there always a fully-decomposed blobbed-tree that minimizes the number of recombinations over all possible phylogenetic networks for M?
![Page 35: A Fundamental Decomposition Theory for Phylogenetic Networks and Incompatible Characters](https://reader033.fdocuments.net/reader033/viewer/2022051420/56815dfd550346895dcc3be5/html5/thumbnails/35.jpg)
We conjecture the answer is yes.
If true, then we can decompose the problem of minimizing the total number of recombinations into separate problems on each connected component. We can also compute lower bounds on the needed number of recombinations in each component separately, and add those bounds to get a valid overall lower bound for M. This way of combining lower bounds is known to be correct for certain lower bounds (Bafna, Bansal 2004).
![Page 36: A Fundamental Decomposition Theory for Phylogenetic Networks and Incompatible Characters](https://reader033.fdocuments.net/reader033/viewer/2022051420/56815dfd550346895dcc3be5/html5/thumbnails/36.jpg)
Progress on Proving the Conjecture
Definition: If N is a phylogenetic network for M, and a node v in N is labeled with a sequence in M, then v is said to be visible in N.
Theorem: If every node in N is visible, then there is a fully-decomposed network for M where the number of recombinations is at most the number in N.
Corollary: The conjecture is true for any M where the Haplotype or History lower bound (S. Myers) on the number of recombinations needed to generate M is tight.
![Page 37: A Fundamental Decomposition Theory for Phylogenetic Networks and Incompatible Characters](https://reader033.fdocuments.net/reader033/viewer/2022051420/56815dfd550346895dcc3be5/html5/thumbnails/37.jpg)
Papers and Software
wwwcsif.cs.ucdavis.edu/~gusfield/