Phylogeny Estimation: Why It Is "Hard", and How to Design Methods with Good Performance Tandy Warnow Department of Computer Sciences University of Texas at Austin


The real title:

Phylogeny Estimation:

Why it is “Hard”

but not

how to design methods with good performance -

talk to me separately about this, no time in this lecture!

This talk

• Intro to phylogenetic estimation (using some terms to be defined later: polynomial time and NP-hard)

• Computational problems and what it means to solve them exactly

• Computational problems, and what it means to “solve them” heuristically

Phylogeny (evolutionary tree)

[Figure: evolutionary tree relating Orangutan, Gorilla, Chimpanzee, and Human. From the Tree of Life Website, University of Arizona.]

Evolutionary History

From AToL website

Helps us
– predict gene function
– develop drugs and vaccines
– understand disease spread
– understand human origins

Tree of Life

Phylogenetics: estimating evolutionary histories

DNA Sequence Evolution

[Figure: a tree of DNA sequences evolving along a time axis marked -3 mil yrs, -2 mil yrs, -1 mil yrs, today. Sequences shown, from the root down: AAGACTT; TGGACTT, AAGGCCT; AGGGCAT, TAGCCCT, AGCACTT; and at the leaves TAGCCCA, TAGACTT, AGCGCTT, AGCACAA, AGGGCAT.]

Phylogeny Problem

[Figure: five sequences (TAGCCCA, TAGACTT, TGCACAA, TGCGCTT, AGGGCAT) labeled U, V, W, X, Y, and a tree whose leaves are U, V, W, X, Y.]

How can we infer evolution?

Two types of phylogenetic reconstruction methods

1. Heuristics for hard optimization problems (Maximum Parsimony and Maximum Likelihood)

[Figure: cost plotted over the space of phylogenetic trees, showing a local optimum and the global optimum.]

2. Polynomial time distance-based methods: UPGMA, Neighbor Joining, FastME, Weighbor, etc.

Maximum Parsimony

• Input: Set S of n aligned sequences of length k

• Output:

– A phylogenetic tree T leaf-labeled by sequences in S

– additional sequences of length k labeling the internal nodes of T

such that the total number of changes is minimized

Maximum parsimony (example)

• Input: Four sequences
– ACT
– ACA
– GTT
– GTA

• Question: which of the three trees has the best MP score?

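Maximum Parsimony

[Figure: the three possible unrooted trees on the four sequences, written here by which leaves are paired on each side of the middle edge:
(ACT, GTT) | (ACA, GTA)
(ACT, GTA) | (ACA, GTT)
(ACT, ACA) | (GTT, GTA)]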

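Maximum Parsimony

[Figure: the three trees with internal nodes labeled and the number of changes marked on each edge.]

• Tree (ACT, GTT) | (ACA, GTA), internal nodes GTT and GTA: changes 2 + 1 + 2, MP score = 5

• Tree (ACT, GTA) | (ACA, GTT), internal nodes ACT and ACA: changes 3 + 1 + 3 = 7 for the labeling shown; labeling both internal nodes ACT needs only 6 changes, so MP score = 6

• Tree (ACT, ACA) | (GTT, GTA), internal nodes ACA and GTA: changes 1 + 2 + 1, MP score = 4

Optimal MP tree: (ACT, ACA) | (GTT, GTA)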

Maximum Parsimony: computational complexity

[Figure: the optimal tree (ACT, ACA) | (GTT, GTA) with internal nodes ACA and GTA; changes 1, 2, 1; MP score = 4.]

But how do we find the best tree?

Optimal labeling can be computed in linear time O(nk)
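
The slides do not say which method computes the optimal labeling; one standard choice is Fitch's algorithm, which is assumed here. The Python sketch below scores a fixed tree written as nested pairs of leaf sequences (where the root is placed does not change the parsimony score). The name `fitch_score` and the nested-tuple encoding are illustrative, not taken from the talk.

```python
def fitch_score(tree):
    """Minimum number of changes needed on a fixed binary tree.

    `tree` is either a leaf (a string) or a pair (left, right) of subtrees.
    All leaf strings are assumed to be aligned and of equal length.
    """
    changes = 0

    def state_sets(node):
        nonlocal changes
        if isinstance(node, str):              # leaf: one known state per site
            return [{ch} for ch in node]
        left, right = (state_sets(child) for child in node)
        merged = []
        for a, b in zip(left, right):
            if a & b:                          # children can agree: no change
                merged.append(a & b)
            else:                              # children must differ: one change
                merged.append(a | b)
                changes += 1
        return merged

    state_sets(tree)
    return changes


# The three unrooted trees on ACT, ACA, GTT, GTA, each rooted on its middle edge.
trees = [
    (("ACT", "GTT"), ("ACA", "GTA")),
    (("ACT", "GTA"), ("ACA", "GTT")),
    (("ACT", "ACA"), ("GTT", "GTA")),
]
scores = [fitch_score(t) for t in trees]
print(scores)
```

Each node is visited once and does constant work per site, which is where the O(nk) bound comes from. The sketch reports the minimum over all internal labelings, so its value for a tree can be lower than the length of one particular hand-chosen labeling.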

Exhaustive Search

For every tree on the set of sequences, do:

• Score each tree (compute optimal sequences for each internal node, and record the score)

• Keep track of the tree with the best score

How expensive is this?
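
As a rough illustration of the loop above, the following Python sketch enumerates every unrooted binary tree on a small set of sequences and keeps the best-scoring one. It assumes Fitch's algorithm for the scoring step and builds trees by attaching one leaf at a time to every edge; the function names (`attach`, `all_trees`, `exhaustive_mp`) are hypothetical, and the approach is only feasible for a handful of sequences.

```python
def fitch_score(tree):
    """Parsimony score of a fixed tree (leaf = string, internal node = pair)."""
    changes = 0

    def state_sets(node):
        nonlocal changes
        if isinstance(node, str):
            return [{ch} for ch in node]
        left, right = (state_sets(child) for child in node)
        merged = []
        for a, b in zip(left, right):
            if a & b:
                merged.append(a & b)
            else:
                merged.append(a | b)
                changes += 1
        return merged

    state_sets(tree)
    return changes


def attach(tree, leaf):
    """Yield every tree made by attaching `leaf` to one edge of `tree`."""
    yield (tree, leaf)                         # attach above the current root
    if isinstance(tree, tuple):
        left, right = tree
        for new_left in attach(left, leaf):
            yield (new_left, right)
        for new_right in attach(right, leaf):
            yield (left, new_right)


def all_trees(leaves):
    """All unrooted binary trees on `leaves`, each rooted next to leaves[0]."""
    partial = [leaves[1]]
    for leaf in leaves[2:]:
        partial = [t for p in partial for t in attach(p, leaf)]
    return [(leaves[0], p) for p in partial]


def exhaustive_mp(sequences):
    """Score every tree and keep the one with the lowest score."""
    return min(((fitch_score(t), t) for t in all_trees(sequences)),
               key=lambda pair: pair[0])


best_score, best_tree = exhaustive_mp(["ACT", "ACA", "GTT", "GTA"])
print(best_score, best_tree)
```

The list returned by `all_trees` is what makes this expensive: its length grows as (2n-5)!!, so the sketch stops being usable well before n reaches 20.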


Don’t try “exhaustive search”

• Number of (unrooted) binary trees on n leaves is (2n-5)!! = (2n-5) x (2n-7) x (2n-9) x … x 3

• If each tree on 1000 taxa could be analyzed in 0.001 seconds, we would find the best tree in about 6 x 10^2846 millennia

#leaves    #trees
4          3
5          15
6          105
7          945
8          10395
9          135135
10         2027025
20         2.2 x 10^20
100        1.7 x 10^182
1000       1.9 x 10^2860
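
The counts can be checked directly, since Python integers have unlimited precision. The sketch below evaluates (2n-5)!! and, assuming the slide's rate of 0.001 seconds per tree (1000 trees per second) and a 365.25-day year, estimates the time for an exhaustive search. The function names are illustrative.

```python
def num_unrooted_trees(n):
    """(2n-5)!! = 3 x 5 x ... x (2n-5): unrooted binary trees on n >= 3 leaves."""
    count = 1
    for odd in range(3, 2 * n - 4, 2):
        count *= odd
    return count


# 1000 years x 365.25 days x 86400 seconds, kept as an exact integer.
SECONDS_PER_MILLENNIUM = 1000 * 36525 * 864


def millennia_needed(n, trees_per_second=1000):
    """Whole millennia needed to score every tree at the given rate."""
    return num_unrooted_trees(n) // (trees_per_second * SECONDS_PER_MILLENNIUM)


for n in (4, 5, 6, 7, 8, 9, 10):
    print(n, num_unrooted_trees(n))

# For the huge cases, report only the order of magnitude (number of digits - 1).
for n in (20, 100, 1000):
    print(n, "about 10^%d trees" % (len(str(num_unrooted_trees(n))) - 1))
print("1000 taxa: about 10^%d millennia" % (len(str(millennia_needed(1000))) - 1))
```

Integer division is used instead of floating point because the counts for 1000 taxa are far beyond the range of a 64-bit float.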

Maximum Parsimony: computational complexity

[Figure: the optimal tree (ACT, ACA) | (GTT, GTA) with internal nodes ACA and GTA; changes 1, 2, 1; MP score = 4.]

Finding the optimal MP tree is NP-hard

Optimal labeling can be computed in linear time O(nk)

NP-hard(ness)

• What does this mean?

• What are the consequences of a problem being NP-hard?

• What kind of methods are used to “solve” NP-hard problems?

• How should you interpret the output of a software program, when the problem is NP-hard?

“Real” problem: your brother’s birthday party

• Your brother is turning 10 and you need to arrange his birthday party

• He wants all his friends to come

• But some of them hate each other

Your objective: have as few parties as you can, but invite everyone to at least one party (while not having people who hate each other at the same party)

Your brother’s party

• Friends: Sally, Alice, Henry, Tommy, Jimmy, and Ben

• Sally and Alice hate each other, also Henry and Sally, Henry and Tommy, Alice and Jimmy, Ben and Sally, and Ben and Henry.

Graph representation of your brother’s friends

Graph has vertices and edges

• Vertices = your brother’s friends

• Edges between vertices indicate they hate each other


Coloring vertices to assign friends to parties

• Given graph G with vertices and edges

• Assign colors to the vertices so that no edge connects vertices of the same color, using a minimum number of colors

• Vertices = your brother’s friends

• Edges between vertices indicate they hate each other

• Colors = parties

Assigning friends to parties: graph coloring!

• Friends: Sally, Alice, Henry, Tommy, Jimmy, and Ben

• We can’t do this with two parties. Why?

• What about three?

Your brother’s parties

Solution: three parties!

• Sally, Tommy, and Jimmy

• Henry and Alice

• Ben
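
The party problem is small enough to check by brute force. The Python sketch below encodes the friends and the "hate" pairs listed earlier, verifies the three-party solution above, and searches for the smallest number of parties that works. The helper names `is_legal` and `min_parties` are made up for this illustration.

```python
from itertools import product

friends = ["Sally", "Alice", "Henry", "Tommy", "Jimmy", "Ben"]
hates = [("Sally", "Alice"), ("Henry", "Sally"), ("Henry", "Tommy"),
         ("Alice", "Jimmy"), ("Ben", "Sally"), ("Ben", "Henry")]


def is_legal(party_of):
    """True if no two friends who hate each other share a party."""
    return all(party_of[a] != party_of[b] for a, b in hates)


def min_parties():
    """Try 1, 2, 3, ... parties; return the first count with a legal assignment."""
    for k in range(1, len(friends) + 1):
        for assignment in product(range(k), repeat=len(friends)):
            party_of = dict(zip(friends, assignment))
            if is_legal(party_of):
                return k, party_of


slide_solution = {"Sally": 0, "Tommy": 0, "Jimmy": 0,
                  "Henry": 1, "Alice": 1, "Ben": 2}
print(is_legal(slide_solution))
print(min_parties()[0])
```

Two parties fail because Sally, Henry, and Ben all hate one another, so each of the three needs a party of their own.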

What is the minimum number of colors that a graph needs?

Remember: no edge between vertices of the same color!

A graph that needs 3 colors

2-colored graph

A computational problem

• 2-colorability:

• Given graph G, determine if we can assign colors red and blue to the vertices of G so that no edge connects vertices of the same color.

Can we 2-color this graph?

Can we 2-color these graphs?

Solving 2-colorability

• 2-colorability: Given graph G, determine if we can assign colors red and blue to the vertices of G so that no edge connects vertices of the same color.

• Greedy Algorithm. Start with one vertex and make it red, and then make all its neighbors blue, and keep going. If you succeed in coloring the graph without making two nodes of the same color adjacent, the graph can be 2-colored.

• Running time: O(n+m) time, where n is the number of vertices and m is the number of edges.
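
One way to make the greedy algorithm concrete is a breadth-first traversal, sketched below in Python. This is an assumed implementation rather than code from the talk; it also restarts in every connected component, a detail the slide leaves implicit. The function name `two_color` and the edge-list input format are illustrative.

```python
from collections import deque


def two_color(vertices, edges):
    """Return a red/blue coloring as a dict, or None if the graph is not 2-colorable."""
    neighbors = {v: [] for v in vertices}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)

    color = {}
    for start in vertices:                     # restart in every connected component
        if start in color:
            continue
        color[start] = "red"
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for w in neighbors[v]:
                if w not in color:
                    color[w] = "blue" if color[v] == "red" else "red"
                    queue.append(w)
                elif color[w] == color[v]:     # an edge joins two same-colored vertices
                    return None
    return color


square = two_color("abcd", [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")])
triangle = two_color("abc", [("a", "b"), ("b", "c"), ("c", "a")])
print(square)
print(triangle)
```

Every vertex enters the queue once and every edge is examined twice, which matches the O(n+m) running time stated above.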


What about this?

• 3-colorability: Given graph G, determine if we can assign red, blue, and green to the vertices in G so that no edge connects vertices of the same color.

A 3-colored graph

Can you 3-color these graphs?

How about this graph?

Testing 3-colorability

• 3-colorability: Given graph G, determine if we can assign red, blue, and green to the vertices in G so that no edge connects vertices of the same color.

The “greedy algorithm” will work correctly in some, but not all cases.

Exhaustive search for 3-colorability

• Look at all possible vertex colorings

• See if any is “legal” (no edge between vertices of the same color)

Problem: there are 3^n vertex colorings of a graph on n vertices

Question to students: how many vertex colorings are there for a graph with 10 vertices? 20 vertices? 100 vertices?
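
For the question above: 3^10 = 59,049, 3^20 = 3,486,784,401, and 3^100 is about 5 x 10^47. The Python sketch below prints these counts and runs the exhaustive search on two tiny graphs; `three_color` is an illustrative name, and the search is only practical for very small n.

```python
from itertools import product


def three_color(vertices, edges):
    """Exhaustive search: return the first legal 3-coloring found, or None."""
    for colors in product(("red", "blue", "green"), repeat=len(vertices)):
        coloring = dict(zip(vertices, colors))
        if all(coloring[a] != coloring[b] for a, b in edges):
            return coloring
    return None


for n in (10, 20, 100):
    print(n, 3 ** n)

# A triangle can be 3-colored; the complete graph on 4 vertices cannot.
triangle = [("a", "b"), ("b", "c"), ("c", "a")]
k4 = triangle + [("a", "d"), ("b", "d"), ("c", "d")]
print(three_color("abc", triangle))
print(three_color("abcd", k4))
```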


What about this?

• 3-colorability: Given graph G, determine if we can assign red, blue, and green to the vertices in G so that no edge connects vertices of the same color.

• This problem is NP-hard.

• What does this mean?

• Some decision problems can be solved in polynomial time:

– Can graph G be 2-colored?

– Does graph G have a 3-clique (three vertices that are all adjacent)?

• Some decision problems do not seem to be solvable in polynomial time:

– Can graph G be 3-colored?

– What is the size of the largest clique in the graph G?

P vs. NP, continued

• The “big” question in theoretical computer science is:

– Is it possible to solve an NP-hard problem in polynomial time?

• If the answer is “yes” for even one NP-hard problem, then every problem in NP can be solved in polynomial time, so P=NP. This is generally not believed.

Minimum coloring

• Since 3-colorability is NP-hard, finding the minimum number of colors for a graph is NP-hard.

• That means the problem will be very hard on some graphs -- even if others can be easy.

• So if your brother has a lot of friends, arranging the minimum number of parties could take you a very very very very very long time.

• So forget solving this problem exactly!

Solving NP-hard optimization problems (like min coloring)

Options:

– Solve the problem exactly (but use lots of time on some inputs)

– Use heuristics which may not solve the problem exactly (and which might be computationally expensive, anyway)

Phylogeny estimation is NP-hard, so

• Most methods that are used for maximum parsimony (or maximum likelihood) are heuristics that are not guaranteed to solve the problems exactly.

• Even the best methods can take a very long time (months or more) on some inputs, without being guaranteed to solve their problems well.

• You do not know how poor the solution is.

Hill-climbing for phylogeny estimation

Start with some tree and score it

Repeat
    Change the tree slightly, and see if the new tree has a better score.
until no neighbor of your best tree has a better score (i.e., stop at a local optimum)

Return the best tree you found
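
The hill-climbing loop can be sketched generically in Python. Real phylogenetic software moves between trees using rearrangement operations that are not shown here; as a stand-in, the example uses a made-up one-dimensional cost landscape, so `cost` and `neighbors` are purely hypothetical. Lower scores are better, as with parsimony.

```python
def hill_climb(start, neighbors, score):
    """Move to a better-scoring neighbor until none exists (lower is better)."""
    best = start
    improved = True
    while improved:
        improved = False
        for candidate in neighbors(best):
            if score(candidate) < score(best):
                best = candidate
                improved = True
                break
    return best


# Toy stand-in for tree space: positions 0..9 with a made-up cost at each one.
cost = [5, 3, 4, 6, 7, 5, 2, 1, 0, 2]


def neighbors(i):
    """Positions one step to the left or right of i."""
    return [j for j in (i - 1, i + 1) if 0 <= j < len(cost)]


print(hill_climb(0, neighbors, lambda i: cost[i]))   # stops at 1, a local optimum
print(hill_climb(5, neighbors, lambda i: cost[i]))   # reaches 8, the global optimum
```

Which optimum the search reaches depends entirely on the starting point, and the search itself gives no indication of whether it stopped at a local or the global optimum.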

“Solving” NP-hard phylogenetic estimation problems

Exploring “tree space”

[Figure: cost plotted over the space of phylogenetic trees, showing a local optimum and the global optimum.]

Two problems:
1. Getting “stuck” in local optima
2. Taking too long to get to good solutions

Problems with current techniques for MP

[Figure: Performance of TNT with time. x-axis: Hours (0 to 24). y-axis: Average MP score above optimal, shown as a percentage of the optimal (0 to 0.2).]

Shown here is the performance of a heuristic maximum parsimony analysis on a real dataset of almost 14,000 sequences. (“Optimal” here means best score to date, using any method for any amount of time.) Acceptable error is below 0.01%.

Observations

• The best MP heuristics cannot get acceptably good solutions within 24 hours on most of these large datasets.

• Datasets of these sizes may need months (or years) of further analysis to reach reasonable solutions.

• Apparent convergence can be misleading.

If a problem is NP-hard

• Some inputs you can solve correctly and quickly, using simple algorithms.

• Some inputs you can solve correctly but it will take a long time.

• Some algorithms will give incorrect answers on some inputs.

• You may not know if your answer is correct or not.

Lessons

• Optimization problems in biology are almost all NP-hard, and heuristics may run for months before finding local optima.

• Therefore we still need better heuristics.

• Biologists should be cautious in believing that the trees found are actually “optimal”.

Reconstructing the “Tree” of Life

Handling large datasets: millions of species

The “Tree of Life” is not really a tree: reticulate evolution

Phylolab, U. Texas

Please visit us at http://www.cs.utexas.edu/users/phylo/

Acknowledgements

• Funding: NSF and the David and Lucile Packard Foundation

• Collaborators and students: Bernard Moret, Luay Nakhleh, Usman Roshan, and Tiffani Williams

General comments for NP-hard optimization problems

• Getting exact solutions may not be possible for some problems on some inputs, without spending a great deal of time.

• You may not know when you have an optimal solution, if you use a heuristic.

• Sometimes exact solutions may not be necessary, and approximate solutions may suffice. But, how good an approximation do you need?