Dynamic Programming: Edit Distance Bioinformatics: Issues...

84
CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall 2007 · Lecture 10 - 1 - Bioinformatics: Issues and Algorithms CSE 308-408 Fall 2007 Lecture 10 Dynamic Programming: Edit Distance

Transcript of Dynamic Programming: Edit Distance Bioinformatics: Issues...

Page 1: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 1 -

Bioinformatics:Issues and Algorithms

CSE 308-408 • Fall 2007 • Lecture 10

Dynamic Programming:Edit Distance

Page 2: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 2 -

Outline

• Setting the Stage

• DNA Sequence Comparison: First Successes

• The Change Problem

• The Manhattan Tourist Problem

• The Longest Common Subsequence Problem

• Sequence Alignment

• The Edit Distance Problem

http://www.bioalgorithms.info

Page 3: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 3 -

Sequence comparison and alignment

Question: how can we measure sequence similarity?

Consider:

AGTAGCATC

AGTGCACC

AGTAGCATC

GACACGATTversus

That the two DNA fragments on the left somehow seem more similar than the two on the right could be significant.

Page 4: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 4 -

Motivation

Why is this important?

• Given a new DNA sequence, one of the first things abiologist will want to do is search databases of known sequences to see if anyone has recorded something similar.(As we've seen, genetic sequences are long and the databases are enormous, so efficiency will be an issue.)

• Sequence similarity can provide clues about function.

• Many other problems from computational biology incorporate some notion of sequence similarity as a basic premise.

• Similarity can provide clues about evolutionary relationships.

Page 5: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 5 -

Sequence concepts

A sequence is a linear string of symbols over a finite alphabet.

Some basic sequence concepts:

(Note: string and sequence are often used synonymously.)

number of symbols in s (written |s|).length

sequence of length 0 (written ε).empty string

sequence that can be obtained from s by removing some symbols (ACG is a subsequence of TATCTG).

subsequence

if t is a subsequence of s, then s is a supersequence of t.

supersequence

Page 6: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 6 -

Sequence concepts

More basic sequence concepts:

sequence of consecutive symbols appearing in s (ACG is not a substring of TATCTG, but TCT is).

substring

(Observation: every substring is a subsequence of the string in question, but not vice versa.)

if t is a substring of s, then s is a superstring of t.

superstring

set of consecutive indices of a string, e.g., [2..4] refers to 2nd through 4th symbols of string s, or s[2]s[3]s[4].

interval

Page 7: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 7 -

Sequence concepts

And lastly:

substring of s of the form s[1..j] where 0 ≤ j ≤ |s|(when j = 0, prefix is the empty string).

prefix

substring of s of the form s[i..|s|] where 1 ≤ i ≤ |s| + 1(when i = |s| + 1, suffix is the empty string).

suffix

AT is a prefix (and substring and subsequence) of ATCCAG.

AG is a suffix (and substring and subsequence) of ATCCAG.

Page 8: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 8 -

Sequences

The concept of a sequence is extremely broad. In this course, we are concerned with genetic sequences. However, there are other important kinds of sequence data:

• ASCII text,• speech,• handwriting (pen-strokes).

Likewise, the same algorithmic techniques turn up again and again, often under different names:

• approximate string matching,• edit (or evolutionary) distance,• dynamic time warping.

Page 9: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 9 -

Sequence comparison: a false start

A

C

C

G

G

T

T

G

G

C

C

A

A

C

A

G

G

T

A

G

G

C

C

An obvious idea that comes to mind is to line up each symbol and count the number that don't match:

As we know, this is Hamming distance and forms the basis for most error correcting codes, as well as the motif-finding problem we saw earlier.

= 2

But it doesn't work for the kinds of sequences we care about:

Just one missing symbol at the start of the second sequence leads to a large distance.

= 6?

Page 10: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 10 -

Sequence manipulation at the genetic level

http://www.accessexcellence.org/AB/GG/nhgri_PDFs/insertion.pdf

Genomes aren't static ...

... sequence comparison must account for this.http://www.accessexcellence.org/AB/GG/nhgri_PDFs/deletion.pdf

Page 11: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 11 -

The human factor

A hard-to-read gel (arrow marks location where bands of similar

intensity appear in two different lanes):http://hshgp.genome.washington.edu/teacher_resources/99-studentDNASequencingModule.pdf

"...the error rate is generally less than 1% over the first 650 bases and then rises significantly over the remaining sequence.”http://genome.med.harvard.edu/dnaseq.html

In addition, errors can arise during the sequencing process:

TAATGGCG?

TAATGCGG?

TAATTGCGG?

Page 12: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 12 -

Sequence alignment

As we have seen, the two sequences we wish to compare may have different lengths. As a result, we need to allow for deletions and insertions.

The notion of an alignment helps us visualize this:

Alignment permits us to incorporate “spaces” (represented by a dash) in one or both sequences to make them the same length.

A

C

C

G

G

T

T

G

G

C

C?

A

C

C

G

G

T

T

G

G

C

C

-

Page 13: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 13 -

Sequence alignment

No – this one has 5! How can we find the best alignment?

A

C

C

G

G

T

T

C

G

C

C

- T

-

G

G

C

C

-

T

-

G

C

C

T

T

G

G

C

-

This alignment has 6 mismatches. Is it the best possible?

A

C

C

G

G

T

T

C

G

C

C

- T G

G

C

C T G

C

-

T

T

G

G

C

-C

- -

Page 14: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 14 -

DNA sequence comparison

First successes:

• Finding sequence similarities with genes of known function is a common approach to infer function of a newly-sequenced gene.

• In 1984, Russell Doolittle and colleagues found similarities between cancer-causing gene and normal growth factor (PDGF) gene.

http://www.bioalgorithms.info

Page 15: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 15 -

Cystic Fibrosis

• Cystic fibrosis (CF) is chronic and often fatal genetic disease of body's mucus glands (abnormally high level of mucus). Primarily affects respiratory systems in children.

• Mucus is slimy material that coats many epithelial surfaces and is secreted into fluids such as saliva.

Inheritance of Cystic Fibrosis:• In early 1980's, biologists hypothesized that CF is an

autosomal recessive disorder caused by mutations in a gene that remained unknown until 1989.

• Heterozygous carriers are asymptomatic.• Must be homozygously recessive in this gene in order to be

diagnosed with CF.http://www.bioalgorithms.info

Page 16: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 16 -

Cystic Fibrosis: finding the gene

http://www.bioalgorithms.info

Page 17: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 17 -

Cystic Fibrosis

• ATP binding proteins are present on cell membrane and act as transport channel.

• In 1989, biologists found similarity between cystic fibrosis gene and ATP binding proteins.

• A plausible function for cystic fibrosis gene, given that CF involves sweet secretion with abnormally high sodium level.

Finding Similarities between Cystic Fibrosis Gene and ATP binding proteins:

http://www.bioalgorithms.info

Page 18: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 18 -

Cystic Fibrosis: mutation analysis

If a high percentage of cystic fibrosis (CF) patients have a certain mutation in the gene, while unaffected patients don’t, that could be an indicator of a mutation that is related to CF.

Indeed, a certain mutation was found in 70% of CF patients; this is convincing evidence of a predominant genetic diagnostics marker for CF.

http://www.bioalgorithms.info

Page 19: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 19 -

Cystic Fibrosis and the CFTR protein

• CFTR (Cystic Fibrosis Transmembrane conductance Regulator) protein acts in cell membrane of epithelial cells that secrete mucus.

• These cells line airways of nose, lungs, stomach wall, etc.

• CFTR protein (1480 amino acids) regulates a chloride ion channel: adjusts “wateriness” of fluids secreted by cell.

• CF sufferers are missing a single amino acid in their CFTR.• Mucus ends up being too thick, affecting many organs.http://www.bioalgorithms.info

Page 20: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 20 -

Bring in the Bioinformaticians

• Gene similarities between two genes with known and unknown function alert biologists to some possibilities.

• Computing a similarity score between two genes tells how likely it is that they have similar functions.

• Dynamic programming is a technique for revealing similarities between genes.

• The Change Problem is a good problem to introduce idea of dynamic programming.

http://www.bioalgorithms.info

Page 21: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 21 -

The Change Problem

The Change Problem.Convert some amount of money M into given denominations, using fewest possible number of coins.Input: An amount of money M, and an array of d denominations, c = (c1, c2, …, cd), with (c1 > c2 > … > cd).Output: A list of d integers, i1, i2, …, id, such that

c1i1 + c2i2 + … + cdid = Mand i1 + i2 + … + id is minimal.

You hand cashier $20 for a $19.23 bill. Your change could be:

etc.

3 quarters + 2 pennies 77 pennies

Page 22: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 22 -

The Change Problem: an example

Given denominations 1, 3, and 5, what is minimum number of coins needed to make change for a given value?

1 2 3 4 5 6 7 8 9 10

1 1 1

Value

Min # of coins

1 2 3 4 5 6 7 8 9 10

1 1 1

Value

Min # of coins

Only one coin needed to make change for values 1, 3, and 5.

http://www.bioalgorithms.info

Page 23: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 23 -

The Change Problem: an example

Given denominations 1, 3, and 5, what is minimum number of coins needed to make change for a given value?

However, two coins needed to make change for values 2, 4, 6, 8, and 10.

1 2 3 4 5 6 7 8 9 10

1 2 1 2 1 2 2 2

Value

Min # of coins

http://www.bioalgorithms.info

Page 24: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 24 -

The Change Problem: an example

Given denominations 1, 3, and 5, what is minimum number of coins needed to make change for a given value?

Lastly, three coins needed to make change for values 7 and 9.

1 2 3 4 5 6 7 8 9 10

1 2 1 2 1 2 3 2 3 2

Value

Min # of coins

http://www.bioalgorithms.info

Page 25: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 25 -

The Change Problem: recurrence

minNumCoins(M) =

minNumCoins(M-1) + 1

minNumCoins(M-3) + 1

minNumCoins(M-5) + 1

min ofminNumCoins(M) =

minNumCoins(M-1) + 1

minNumCoins(M-3) + 1

minNumCoins(M-5) + 1

min of

This example is expressed by the following recurrence relation:

minNumCoins(M) =

minNumCoins(M-c1) + 1

minNumCoins(M-c2) + 1

minNumCoins(M-cd) + 1

min ofminNumCoins(M) =

minNumCoins(M-c1) + 1

minNumCoins(M-c2) + 1

minNumCoins(M-cd) + 1

min of

Given denominations c1, c2, …, cd, recurrence relation is:

http://www.bioalgorithms.info

Page 26: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 26 -

The Change Problem: a recursive algorithm

RecursiveChange(M, c, d)if M = 0

return 0bestNumCoins ← infinityfor i ← 1 to d

if M ≥ ci

numCoins ← RecursiveChange(M – ci, c, d)if numCoins + 1 < bestNumCoins

bestNumCoins ← numCoins + 1return bestNumCoins

http://www.bioalgorithms.info

Page 27: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 27 -

The Change Problem: a recursive algorithm

The RecursiveChange algorithm works, but it is not efficient:

• E.g., if M = 77 andc = (1,3,7), then optimal combination of coins for 70 cents is computed on 9 different occasions!

74

77

76 70

75 73 69 73 71 67 69 67 63

74 72 68

72 70 66

68 66 62

72 70 66

70 68 64

66 64 60

68 66 62

66 64 60

62 60 56

. . . . . .70 70 70 7070

• It recalculates optimal coin combination for a given amount of money repeatedly.

http://www.bioalgorithms.info

Page 28: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 28 -

The Change Problem: we can do better

Rather than re-compute a given value more than once:

• Save results of each computation for 0 to M.

• This way, we can do a reference call to find an already computed value, instead of re-computing each time.

• Running time is M x d, where M is the value of money and d is the number of denominations.

http://www.bioalgorithms.info

Page 29: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 29 -

The Change Problem: Dynamic Programming

DPChange(M, c, d)bestNumCoins0 ← 0

for m ← 1 to MbestNumCoinsm ← infinity

for i ← 1 to dif m ≥ ci

if bestNumCoinsm – ci + 1 < bestNumCoinsm

bestNumCoinsm ← bestNumCoinsm – ci + 1

return bestNumCoinsM

http://www.bioalgorithms.info

Page 30: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 30 -

DPChange example

00

0 10 1

0 1 20 1 2

0 1 2 30 1 2 1

0 1 2 3 40 1 2 1 2

0 1 2 3 4 50 1 2 1 2 3

0 1 2 3 4 5 60 1 2 1 2 3 2

0 1 2 3 4 5 6 70 1 2 1 2 3 2 1

0 1 2 3 4 5 6 7 80 1 2 1 2 3 2 1 2

0 1 2 3 4 5 6 7 8 90 1 2 1 2 3 2 1 2 3

c = (1,3,7)M = 9

http

://w

ww

.bio

algo

rithm

s.in

fo

Page 31: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 31 -

The Manhattan Tourist Problem (MTP)

Imagine seeking a path (from source to sink) to travel (only eastward and southward) with the most number of attractions (*) in the Manhattan grid.

Sink*

*

*

**

**

* *

*

*

Source

*

http://www.bioalgorithms.info

Page 32: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 32 -

Sink*

*

*

**

**

* *

*

*

Source

*

The Manhattan Tourist Problem (MTP)

Imagine seeking a path (from source to sink) to travel (only eastward and southward) with the most number of attractions (*) in the Manhattan grid.

http://www.bioalgorithms.info

Page 33: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 33 -

The Manhattan Tourist Problem (MTP)

The Manhattan Tourist Problem.Find the longest path in a weighted grid.

Input: A weighted grid G with two distinct vertices, one labeled “source” and the other labeled “sink.”Output: A longest path in G from “source” to “sink.”

http://www.bioalgorithms.info

Page 34: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 34 -

MTP: an example

3 2 4

0 7 3

3 3 0

1 3 2

4

4

5

6

4

6

5

5

8

2

2

5

0 1 2 3

0

1

2

3

j coordinatei c

oord

inat

e

13

source

sink

4

3 2 4 0

1 0 2 4 3

3

1

1

2

2

2

4 19

95

15

23

0

20

3

4 http

://w

ww

.bio

algo

rithm

s.in

fo

Page 35: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 35 -

MTP: greedy is not optimal

1 2 5

2 1 5

2 3 4

0 0 0

5

3

0

3

5

0

10

3

5

5

1

2promising start, but leads to bad choices!

source

sink18

22

http://www.bioalgorithms.info

Page 36: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 36 -

MTP: a simple recursive program

MT(n, m)if n = 0 and m = 0

return 0if n > 0

x ← MT(n-1, m) + length of edge from (n-1, m) to (n, m)else

x ← 0if m > 0

y ← MT(n, m-1) + length of edge from (n, m-1) to (n, m)else

y ← 0return max(x, y) What's wrong with

this approach?

http://www.bioalgorithms.info

Page 37: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 37 -

MTP: Dynamic Programming

• Calculate optimal path score for each vertex in graph.

• Each vertex’s score is maximum of prior vertices' scores plus weight of respective edges in between.

1

5

0 1

0

1

i

source

1

5S1,0 = 5

S0,1 = 1

j

http://www.bioalgorithms.info

Page 38: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 38 -

1 2

5

3

0 1 2

0

1

2

source

1 3

5

8

4

S2,0 = 8

i

S1,1 = 4

S0,2 = 33

-5

j

MTP: Dynamic Programming (continued)

http://www.bioalgorithms.info

Page 39: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 39 -

MTP: Dynamic Programming (continued)

1 2

5

3

0 1 2 3

0

1

2

3

i

source

1 3

5

8

8

4

0

5

8

103

5

-59

131-5

S3,0 = 8

S2,1 = 9

S1,2 = 13

S3,0 = 8

j

http

://w

ww

.bio

algo

rithm

s.in

fo

Page 40: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 40 -

MTP: Dynamic Programming (continued)

1 2 5

-5 1 -5

-5 3

0

5

3

0

3

5

0

10

-3

-5

0 1 2 3

0

1

2

3

i

source

1 3 8

5

8

8

4

9

13 8

9

12

S3,1 = 9

S2,2 = 12

S1,3 = 8

j

Greedyalgorithmfails!

Page 41: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 41 -

1 2 5

-5 1 -5

-5 3 3

0 0

5

3

0

3

5

0

10

-3

-5

-5

2

0 1 2 3

0

1

2

3

i

source

1 3 8

5

8

8

4

9

13 8

12

9

15

9

j

0

1

16 S3,3 = 16

MTP: Dynamic Programming (continued)

Done!

Page 42: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 42 -

MTP: recurrence

Computing score for a point (i, j) by recurrence relation:

si, j = max si-1, j + weight of the edge between (i-1, j) and (i, j)

si, j-1 + weight of the edge between (i, j-1) and (i, j)si, j = max si-1, j + weight of the edge between (i-1, j) and (i, j)

si, j-1 + weight of the edge between (i, j-1) and (i, j)

The running time is n x m for a n-by-m grid, where n is the number of rows and m is the number of columns.

http://www.bioalgorithms.info

Page 43: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 43 -

But Manhattan is not a perfect grid ...

B

A3

A1

A2

B

A3

A1

A2

What about diagonals?

Score at point B is given by:

sB = max of

sA1 + weight of the edge (A1, B)

sA2 + weight of the edge (A2, B)

sA3 + weight of the edge (A3, B)

sB = max of

sA1 + weight of the edge (A1, B)

sA2 + weight of the edge (A2, B)

sA3 + weight of the edge (A3, B)

http://www.bioalgorithms.info

Page 44: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 44 -

But Manhattan is not a perfect grid ...

Computing score sx for point x is given by recurrence relation:

Predecessors(x) = set of vertices that have edges leading to x

Running time for a graph G(V, E) is O(E) since each edge is evaluated once

Note: V is set of all vertices and E is set of all edges.

sx = maxof

sy + weight of edge (y, x) where

y ∈ Predecessors(x)sx = max

ofsy + weight of edge (y, x) where

y ∈ Predecessors(x)

http://www.bioalgorithms.info

Page 45: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 45 -

Alignment: two-row representation

Alignment : 2 * k matrix ( k > m, n )

T - C G A T-T G C - A -A

letters of v

letters of wTT

A T C T G A TT G C A T A

v :w :

m = 7 n = 6

4 matches 2 insertions 3 deletions

Given 2 DNA sequences v and w:

http://www.bioalgorithms.info

A-

Page 46: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 46 -

Aligning DNA sequences

v = ATCTGATGw = TGCATAC

n = 8m = 7

CATACGTGTAGTCTAV

W

match

insertiondeletion

mismatch

indels

4132

matches mismatches deletions insertions

http://www.bioalgorithms.info

Page 47: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 47 -

The Longest Common Subsequence Problem (LCS)

The Longest Common Subsequence Problem.Find the longest common subsequence of two sequences.

Input: Two sequences:

v = v1 v2…vm and w = w1 w2…wn

Output: A sequence of positions in v : 1 ≤ i1 < i2 < … < it ≤ m

and a sequence of positions in w : 1 ≤ j1 < j2 < … < jt ≤ n

such that symbol ik of v matches jk of w and t is maximal.

Page 48: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 48 -

LCS example

A T - C T G A T C- T G C T - A - C

elements of v

elements of w-A

1

2

0

1

2

2

3

3

4

3

5

4

5

5

6

6

6

7

7

8

j coords:

i coords:

Matches shown in redpositions in v:positions in w:

2 < 3 < 4 < 6 < 8

1 < 3 < 5 < 6 < 7

Every common subsequence is a path in 2-D grid.

0

0

(0,0)→ (1,0)→ (2,1)→ (2,2)→ (3,3)→ (3,4)→ (4,5)→ (5,5)→ (6,6)→ (7,6)→ (8,7)

http://www.bioalgorithms.info

Page 49: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 49 -

LCS Dynamic Programming

Find LCS of two strings.

Input: a weighted graph G with two distinct vertices, one labeled “source” and one labeled “sink.”Output: a longest path in G from “source” to “sink.”

Solve using an MTP graph with diagonals replaced by +1 edges.

http://www.bioalgorithms.info

Page 50: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 50 -

LCS Problem as Manhattan Tourist Problem

T

G

C

A

T

A

C

1

2

3

4

5

6

7

0i

A T C T G A T C0 1 2 3 4 5 6 7 8

j

http://www.bioalgorithms.info

Page 51: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 51 -

MTP Graph for LCS Problem

T

G

C

A

T

A

C

1

2

3

4

5

6

7

0i

A T C T G A T C0 1 2 3 4 5 6 7 8

j

http://www.bioalgorithms.info

Page 52: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 52 -

MTP Graph for LCS Problem

T

G

C

A

T

A

C

1

2

3

4

5

6

7

0i

A T C T G A T C0 1 2 3 4 5 6 7 8

j

Diagonal edge adds extraelement to subsequence.

LCS Problem: find path with maximum diagonal edges.

Every path is common subsequence.

http://www.bioalgorithms.info

Page 53: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 53 -

Computing LCS

si, j = max

si-1, j

si, j-1

si-1, j-1 + 1 if vi = wj

Let vi = prefix of v of length i : v1 … vi

and wj = prefix of w of length j : w1 … wj

The length of LCS(vi, wj) is computed by:

i,j

i-1,j

i,j -1

i-1,j -1

1 00

http://www.bioalgorithms.info

Page 54: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 54 -

LCS Alignments

4

3

2

1

0

43210W A T C G

ATGT

V

0 1 2 2 3 4V = A T - G T | | |W = A T C G – 0 1 2 3 4 4

Every path in grid corresponds to an alignment:

http://www.bioalgorithms.info

Page 55: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 55 -

Edit distance

In 1966, Levenshtein introduced edit distance between two strings as minimum number of elementary operations (insertions, deletions, and substitutions) needed to transform one string into the other.

d(v, w) = MIN number of elementary operations neededto transform v → w.

Keep in mind that LCS computation uses a MAX.

Page 56: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 56 -

The Edit Distance Problem

The Edit Distance Problem.Given two sequences, find the optimal series of deletions, insertions, and substitutions to transform one into the other.Input: Two sequences:

v = v1 v2…vm and w = w1 w2…wn

Output: An optimal series of basic editing operations:

e1, e2, ..., et

such then, when applied to one of the sequences, say v, it is transformed into the other sequence, w.

Here “optimal” can mean any of a number of things, including “fewest” or “lowest- / highest-cost.”

Page 57: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 57 -

Edit distance vs. Hamming distance

v = ATATATATw = TATATATA

Hamming distance always compares i-th letter of v with i-th letter of w

Computing Hammingdistance is trivial task.

Hamming distance: d(v, w) = 8

w = TATATATAJust one shift

Make it all line up

v = -ATATATAT

Edit distance may compare i-th letter of v with j-th letter of w

Computing edit distance is non-trivial task.

Edit distance: d(v, w) = 2

http://www.bioalgorithms.info

Page 58: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 58 -

Edit distance vs. Hamming distance

v = ATATATATw = TATATATA

Hamming distance always compares i-th letter of v with i-th letter of w

Hamming distance: d(v, w) = 8

w = TATATATA-Just one shift

Make it all line up

v = -ATATATAT

Edit distance may compare i-th letter of v with j-th letter of w

One insertion and one deletion.

Edit distance: d(v, w) = 2

How to find which j goes with which i ???

http://www.bioalgorithms.info

Page 59: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 59 -

Edit distance example

TGCATAT ATCCGAT in 5 steps:

TGCATAT (delete last T)TGCATA (delete last A)TGCAT (insert A at front)ATGCAT (substitute C for 3rd G)ATCCAT (insert G before last A) ATCCGAT (Done)

http://www.bioalgorithms.info

Page 60: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 60 -

TGCATAT ATCCGAT in 5 steps:

TGCATAT (delete last T)TGCATA (delete last A)TGCAT (insert A at front)ATGCAT (substitute C for 3rd G)ATCCAT (insert G before last A) ATCCGAT (Done)

What is the edit distance? 5?

Edit distance example

http://www.bioalgorithms.info

Page 61: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 61 -

Edit distance example

TGCATAT ATCCGAT in 4 steps:

TGCATAT (insert A at front)ATGCATAT (delete 8th T)ATGCATA (substitute G for 5th A)ATGCGTA (substitute C for 3rd G)ATCCGAT (Done)

http://www.bioalgorithms.info

Page 62: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 62 -

TGCATAT ATCCGAT in 4 steps:

TGCATAT (insert A at front)ATGCATAT (delete 8th T)ATGCATA (substitute G for 5th A)ATGCGTA (substitute C for 3rd G)ATCCGAT (Done)

Can it be done in 3 steps???

Edit distance example

http://www.bioalgorithms.info

Page 63: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 63 -

Alignment grid

Every alignment path is from source to sink.

Page 64: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 64 -

Alignment as a path in edit graph

0 1 2 2 3 4 5 6 7 70 1 2 2 3 4 5 6 7 7 A T _ G T T A T _A T _ G T T A T _ A T C G T _ A _ CA T C G T _ A _ C0 1 2 3 4 5 5 6 6 70 1 2 3 4 5 5 6 6 7

(0,0), (1,1), (2,2), (0,0), (1,1), (2,2), (2,3), (3,4), (4,5), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (5,5), (6,6), (7,6), (7,7)(7,7)

Corresponding path:

http://www.bioalgorithms.info

Page 65: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 65 -

Alignment as a path in edit graph

and represent indels in v and w with score 0.

represent matches with score 1.

The score of the alignment path is 5.

http://www.bioalgorithms.info

Page 66: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 66 -

Alignment as a path in edit graph

Every path inedit graphorresponds toan alignment:

http://www.bioalgorithms.info

Page 67: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 67 -

Alignment as a path in edit graph

http://www.bioalgorithms.info

Old AlignmentOld Alignment 0122345677v = AT_GTTAT_w = ATCGT_A_C 0123455667

New AlignmentNew Alignment 0122345677v = AT_GTTAT_w = ATCG_TA_C 0123445667

Page 68: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 68 -

Alignment as a path in edit graph

http://www.bioalgorithms.info

0122345677v = AT_GTTAT_w = ATCGT_A_C 0123455667

(0,0), (1,1), (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (7,7)

Page 69: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 69 -

Sequence comparison: the basic algorithm

Fortunately we don't have to. The optimal alignment can be found using dynamic programming, which we've just seen.

We can't afford to enumerate all possible alignments looking for the best one – that would be an exponential search.

Dynamic programming is based on premise of computing solutions to smaller subproblems first and then using these to solve successively larger problems until we have our answer.

(Dynamic programming was invented by Richard Bellman in the 1950's. Its application to sequence comparison came later, in the 1970's.)http://fens.sabanciuniv.edu/msie/operations_research_50_years/anniversary/or50/1526-5463-2002-50-01-0048.pdf

Page 70: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 70 -

Sequence alignment ...

First some ground-rules.

-

-

Not legal:

A

-

Legal:

deletion

-

T

Legal:

insertion

C

C

Legal:

match

G

C

Legal:

mismatch

http://www.bioalgorithms.info

Page 71: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 71 -

Sequence comparison: the basic algorithm

So, assuming we've already computed solutions for all shorter prefixes, we can compute the alignment for v[1..i] and w[1..j].

Given two sequences v and w, consider what's required to compute optimal alignment for prefixes v[1..i] and w[1..j]. Based on our rules for alignments, there are three possible cases:

v[i]

w[j]

optimal alignmentfor v[1..i-1]

and w[1..j-1]

III

or

substitution or match

-

w[j]

optimal alignmentfor v[1..i]

and w[1..j-1]

II

insertion

v[i]

-

optimal alignmentfor v[1..i-1]and w[1..j]

I

deletion

Page 72: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 72 -

Sequence comparison: the basic algorithm

Conceptually, this might look something like this:

optimal alignment at v[1..i] and w[1..j]

optimal alignment at v[1..i] and w[1..j-1]

+cost of inserting w[j]

optimal alignment at v[1..i-1] and w[1..j]

+cost of deleting v[i]

optimal alignment atv[1..i-1] and w[1..j-1]

+cost of matching v[i] and w[j]

= max

Here we assume that deletions, insertions, and mismatcheshave negative costs, while matches have positive costs.

Page 73: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 73 -

Sequence comparison: the basic algorithm

This computation can be viewed as building a 2-D matrix:

s t r i n g wε

s t r

i n

g

vε 0

where indel is cost of a deletion or insertion, and sub is cost of a match/mismatch involving symbols v[i] and w[j].

d[i,j] = max

d[i-1,j] + indel

d[i,j-1] + indel

d[i-1,j-1] + sub(v[i],w[j])

cost of inserting w

cost

of d

elet

ing

v

Page 74: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 74 -

Sequence comparison: the basic algorithm

Stated more generally, say that our two sequences are:v[1]v[2]v[3]...v[m] w[1]w[2]w[3]...w[n]

Where cdel, cins, and csub are the costs of a deletion, an insertion, and a substitution, respectively.

d[i,j] = max

d[i-1,j] + cdel(v[i])

d[i,j-1] + cins(w[j])

d[i-1,j-1] + csub(v[i], w[j])

1 ≤ i ≤ m, 1 ≤ j ≤ n

And:

recu

rrenc

e

d[i,0] = d[i-1,0] + cdel(v[i])d[0,j] = d[0,j-1] + cins(w[j])

Then:1 ≤ i ≤ m1 ≤ j ≤ n

initi

al

cond

ition

sd[0,0] = 0

Page 75: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 75 -

Sequence comparison: the basic algorithm

Example: say that cdel = -1, cins = -1,csub = -1 if mismatch and +1 if match

ACGT

-1-2-3-4

0-10-1-2

-10-1-1-2

-2-1-1-20

-3C A T

Page 76: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 76 -

Algorithm for computing edit distance

EditDistance1(v, w)for i ← 1 to n

di,0 ← di-1,0 + cdel(vi)for j ← 1 to m

d0,j ← d0,j-1 + cins(wj)for i ← 1 to n

for j ← 1 to mdi-1,j + cdel(vi)

di,j ← max di,j-1 + cins(wj)di-1,j-1 + csub(vi, wj)

return (dn,m)http://www.bioalgorithms.info

Page 77: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 77 -

Sequence comparison: some observations

Computation can progress in a number of ways:

or or

For sequences of length m and n,

v[1]v[2]v[3]...v[m] w[1]w[2]w[3]...w[n]

What is the computation time?

How much memory (storage) is required?

O(mn)

O(mn)

Page 78: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 78 -

Sequence comparison: getting an alignment

ACGT

C A T

We started with the notion of alignment. How do we get this?

-3 -1 -1 -2-2

-2-1-1

-2

-4

0

-1

-1

-2

-1

0

-30-1 0

By keeping track of optimal decisions made during algorithm,

... and then tracing back optimal path.

G

A

C

C

T

T

A

-(May not be unique).

Page 79: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 79 -

Maintaining the traceback

EditDistance1(v, w)...for i ← 1 to n

for j ← 1 to mdi-1,j + cdel(vi)

di,j ← max di,j-1 + cins(wj)di-1,j-1 + csub(vi, wj)

if di,j = di-1,j + cdel(vi)bi,j ← if di,j = di,j-1 + cins(wj)

if di,j = di-1,j-1 + csub(vi, wj)return (dn,m, b)http://www.bioalgorithms.info

Page 80: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 80 -

More examples

Comparing ACGT vs. CAT:

Page 81: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 81 -

More examples

Comparing ACGTAT vs. AGTTTG:

Page 82: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 82 -

More examples

Comparing ACGTGCGCTGCTG vs. CGTCCTGCCTGC:

A

C

C

G

G

T

T

C

G

C

C

- T G

G

C

C T G

C

-

T

T

G

G

C

-C

- -

Recall the trace we saw earlier:

Page 83: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 83 -

Optimality of the algorithm

How do we know this agorithm finds an optimal alignment?

We won't dig into finer details now, but basically it hinges on:

• Bellman's principle of optimality for dynamic programming.• The cost functions behaving properly (i.e., they must satisfy

the triangle inequality).

This could be an interesting topic for a final paper ...

Page 84: Dynamic Programming: Edit Distance Bioinformatics: Issues ...lopresti/Courses/2007-08/CSE308-408...Edit Distance CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 84 -

Wrap-up

Remember:• Come to class having done the readings.• Check Blackboard regularly for updates.

Readings for next time:• Continue reading IBA Chapter 6 (sequence comparison).