CS 394C: Computational Biology Algorithms
-
Upload
hoyt-ramirez -
Category
Documents
-
view
33 -
download
3
description
Transcript of CS 394C: Computational Biology Algorithms
![Page 1: CS 394C: Computational Biology Algorithms](https://reader035.fdocuments.net/reader035/viewer/2022062422/56813345550346895d9a3eee/html5/thumbnails/1.jpg)
CS 394C: Computational Biology Algorithms
Tandy WarnowDepartment of Computer Sciences
University of Texas at Austin
![Page 2: CS 394C: Computational Biology Algorithms](https://reader035.fdocuments.net/reader035/viewer/2022062422/56813345550346895d9a3eee/html5/thumbnails/2.jpg)
DNA Sequence Evolution
AAGACTT
TGGACTTAAGGCCT
-3 mil yrs
-2 mil yrs
-1 mil yrs
today
AGGGCAT TAGCCCT AGCACTT
AAGGCCT TGGACTT
TAGCCCA TAGACTT AGCGCTTAGCACAAAGGGCAT
AGGGCAT TAGCCCT AGCACTT
AAGACTT
TGGACTTAAGGCCT
AGGGCAT TAGCCCT AGCACTT
AAGGCCT TGGACTT
AGCGCTTAGCACAATAGACTTTAGCCCAAGGGCAT
![Page 3: CS 394C: Computational Biology Algorithms](https://reader035.fdocuments.net/reader035/viewer/2022062422/56813345550346895d9a3eee/html5/thumbnails/3.jpg)
Molecular Systematics
TAGCCCA TAGACTT TGCACAA TGCGCTTAGGGCAT
U V W X Y
U
V W
X
Y
![Page 4: CS 394C: Computational Biology Algorithms](https://reader035.fdocuments.net/reader035/viewer/2022062422/56813345550346895d9a3eee/html5/thumbnails/4.jpg)
Phylogeny estimation methods
• Distance-based (Neighbor joining, NQM, and others): mostly statistically consistent and polynomial time
• Maximum parsimony and maximum compatibility: NP-hard and not statistically consistent
• Maximum likelihood: NP-hard and usually statistically consistent (if solved exactly)
• Bayesian Methods: statistically consistent if run long enough
![Page 5: CS 394C: Computational Biology Algorithms](https://reader035.fdocuments.net/reader035/viewer/2022062422/56813345550346895d9a3eee/html5/thumbnails/5.jpg)
Distance-based methods
• Theorem: Let (T,) be a Cavender-Farris model tree, with additive matrix [(i,j)]. Let >0 be given. The sequence length that suffices for accuracy with probability at least 1- of NJ (neighbor joining) and the Naïve Quartet Method is
O(log n e(O(max (i,j))).
![Page 6: CS 394C: Computational Biology Algorithms](https://reader035.fdocuments.net/reader035/viewer/2022062422/56813345550346895d9a3eee/html5/thumbnails/6.jpg)
Neighbor joining (although statistically consistent) has poor performance on large diameter trees
[Nakhleh et al. ISMB 2001]
Simulation study based upon fixed edge lengths, K2P model of evolution, sequence lengths fixed to 1000 nucleotides.
Error rates reflect proportion of incorrect edges in inferred trees.
NJ
0 400 800 16001200No. Taxa
0
0.2
0.4
0.6
0.8
Err
or R
ate
![Page 7: CS 394C: Computational Biology Algorithms](https://reader035.fdocuments.net/reader035/viewer/2022062422/56813345550346895d9a3eee/html5/thumbnails/7.jpg)
Maximum Parsimony
• Input: Set S of n aligned sequences of length k
• Output: A phylogenetic tree T– leaf-labeled by sequences in S– additional sequences of length k labeling the
internal nodes of T
such that is minimized. ∑∈ )(),(
),(TEji
jiH
![Page 8: CS 394C: Computational Biology Algorithms](https://reader035.fdocuments.net/reader035/viewer/2022062422/56813345550346895d9a3eee/html5/thumbnails/8.jpg)
Maximum parsimony (example)
• Input: Four sequences– ACT– ACA– GTT– GTA
• Question: which of the three trees has the best MP scores?
![Page 9: CS 394C: Computational Biology Algorithms](https://reader035.fdocuments.net/reader035/viewer/2022062422/56813345550346895d9a3eee/html5/thumbnails/9.jpg)
Maximum Parsimony
ACT
GTT ACA
GTA ACA ACT
GTAGTT
ACT
ACA
GTT
GTA
![Page 10: CS 394C: Computational Biology Algorithms](https://reader035.fdocuments.net/reader035/viewer/2022062422/56813345550346895d9a3eee/html5/thumbnails/10.jpg)
Maximum Parsimony
ACT
GTT
GTT GTA
ACA
GTA
12
2
MP score = 5
ACA ACT
GTAGTT
ACA ACT
3 1 3
MP score = 7
ACT
ACA
GTT
GTAACA GTA
1 2 1
MP score = 4
Optimal MP tree
![Page 11: CS 394C: Computational Biology Algorithms](https://reader035.fdocuments.net/reader035/viewer/2022062422/56813345550346895d9a3eee/html5/thumbnails/11.jpg)
Maximum Parsimony
ACT
ACA
GTT
GTAACA GTA
1 2 1
MP score = 4
Finding the optimal MP tree is NP-hard
Optimal labeling can be computed in polynomial time using Dynamic Programming
![Page 12: CS 394C: Computational Biology Algorithms](https://reader035.fdocuments.net/reader035/viewer/2022062422/56813345550346895d9a3eee/html5/thumbnails/12.jpg)
Solving NP-hard problems exactly is … unlikely
• Number of (unrooted) binary trees on n leaves is (2n-5)!!
• If each tree on 1000 taxa could be analyzed in 0.001 seconds, we would find the best tree in
2890 millennia
#leaves #trees
4 3
5 15
6 105
7 945
8 10395
9 135135
10 2027025
20 2.2 x 1020
100 4.5 x 10190
1000 2.7 x 102900
![Page 13: CS 394C: Computational Biology Algorithms](https://reader035.fdocuments.net/reader035/viewer/2022062422/56813345550346895d9a3eee/html5/thumbnails/13.jpg)
1. Hill-climbing heuristics (which can get stuck in local optima)2. Randomized algorithms for getting out of local optima3. Approximation algorithms for MP (based upon Steiner Tree approximation
algorithms) -- however, the approx. ratio that is needed is probably 1.01 or smaller!
Approaches for “solving” MP and ML(and other NP-hard problems in phylogeny)
Phylogenetic trees
Cost
Global optimum
Local optimum
![Page 14: CS 394C: Computational Biology Algorithms](https://reader035.fdocuments.net/reader035/viewer/2022062422/56813345550346895d9a3eee/html5/thumbnails/14.jpg)
Problems with techniques for MP and ML
0
0.02
0.04
0.06
0.08
0.1
0.12
0.14
0.16
0.18
0.2
0 4 8 12 16 20 24
Hours
Average MP score above
optimal, shown as a percentage of
the optimal
Shown here is the performance of a TNT heuristic maximum parsimony analysis on a real dataset of almost 14,000 sequences. (“Optimal” here means best score to date, using any method for any amount of time.) Acceptable error is below 0.01%.
Performance of TNT with time
![Page 15: CS 394C: Computational Biology Algorithms](https://reader035.fdocuments.net/reader035/viewer/2022062422/56813345550346895d9a3eee/html5/thumbnails/15.jpg)
MP and Cavender-Farris
• Consider a tree (AB,CD) with two very long branches leading to A and C, and all other branches very short.
• MP will be statistically inconsistent (and “positively misleading”) on this tree.
![Page 16: CS 394C: Computational Biology Algorithms](https://reader035.fdocuments.net/reader035/viewer/2022062422/56813345550346895d9a3eee/html5/thumbnails/16.jpg)
Problems with existing phylogeny reconstruction methods
• Polynomial time methods (generally based upon distances) have poor accuracy with large diameter datasets.
• Heuristics for NP-hard optimization problems take too long (months to reach acceptable local optima).
![Page 17: CS 394C: Computational Biology Algorithms](https://reader035.fdocuments.net/reader035/viewer/2022062422/56813345550346895d9a3eee/html5/thumbnails/17.jpg)
Warnow et al.: Meta-algorithms for phylogenetics
• Basic technique: determine the conditions under which a phylogeny reconstruction method does well (or poorly), and design a divide-and-conquer strategy (specific to the method) to improve its performance
• Warnow et al. developed a class of divide-and-conquer methods, collectively called DCMs (Disk-Covering Methods). These are based upon chordal graph theory to give fast decompositions and provable performance guarantees.
![Page 18: CS 394C: Computational Biology Algorithms](https://reader035.fdocuments.net/reader035/viewer/2022062422/56813345550346895d9a3eee/html5/thumbnails/18.jpg)
Disk-Covering Method (DCM)
![Page 19: CS 394C: Computational Biology Algorithms](https://reader035.fdocuments.net/reader035/viewer/2022062422/56813345550346895d9a3eee/html5/thumbnails/19.jpg)
Improving phylogeny reconstruction methods using DCMs
• Improving the theoretical convergence rate and performance of polynomial time distance-based methods using DCM1
• Speeding up heuristics for NP-hard optimization problems (Maximum Parsimony and Maximum Likelihood) using Rec-I-DCM3
![Page 20: CS 394C: Computational Biology Algorithms](https://reader035.fdocuments.net/reader035/viewer/2022062422/56813345550346895d9a3eee/html5/thumbnails/20.jpg)
DCM1 Warnow, St. John, and Moret, SODA 2001
• A two-phase procedure which reduces the sequence length requirement of methods. The DCM phase produces a collection of trees, and the SQS phase picks the “best” tree.
• The “base method” is applied to subsets of the original dataset. When the base method is NJ, you get DCM1-NJ.
DCM SQSExponentiallyconvergingmethod
Absolute fast convergingmethod
![Page 21: CS 394C: Computational Biology Algorithms](https://reader035.fdocuments.net/reader035/viewer/2022062422/56813345550346895d9a3eee/html5/thumbnails/21.jpg)
DCM1-boosting distance-based methods[Nakhleh et al. ISMB 2001]
•Theorem: DCM1-NJ converges to the true tree from polynomial length sequences
NJ
DCM1-NJ
0 400 800 16001200No. Taxa
0
0.2
0.4
0.6
0.8
Err
or R
ate
![Page 22: CS 394C: Computational Biology Algorithms](https://reader035.fdocuments.net/reader035/viewer/2022062422/56813345550346895d9a3eee/html5/thumbnails/22.jpg)
Rec-I-DCM3 significantly improves performance (Roshan et al. CSB 2004)
Comparison of TNT to Rec-I-DCM3(TNT) on one large dataset.Similar improvements obtained for RAxML (maximum likelihood).
0
0.02
0.04
0.06
0.08
0.1
0.12
0.14
0.16
0.18
0.2
0 4 8 12 16 20 24
Hours
Average MP score above
optimal, shown as a percentage of
the optimal
Current best techniques
DCM boosted version of best techniques
![Page 23: CS 394C: Computational Biology Algorithms](https://reader035.fdocuments.net/reader035/viewer/2022062422/56813345550346895d9a3eee/html5/thumbnails/23.jpg)
Summary (so far)
• Optimization problems in biology are almost all NP-hard, and heuristics may run for months before finding local optima.
• The challenge here is to find better heuristics, since exact solutions are very unlikely to ever be achievable on large datasets.
![Page 24: CS 394C: Computational Biology Algorithms](https://reader035.fdocuments.net/reader035/viewer/2022062422/56813345550346895d9a3eee/html5/thumbnails/24.jpg)
Summary
• NP-hard optimization problems abound in phylogeny reconstruction, and in computational biology in general, and need very accurate solutions
• Many real problems have beautiful and natural combinatorial and graph-theoretic formulations