Handbook of Neuroprosthetic Methods - W. Finn, P. LoPresti (CRC, 2003) WW
Dynamic Programming: Edit Distance Bioinformatics: Issues...
Transcript of Dynamic Programming: Edit Distance Bioinformatics: Issues...
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 1 -
Bioinformatics:Issues and Algorithms
CSE 308-408 • Fall 2007 • Lecture 10
Dynamic Programming:Edit Distance
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 2 -
Outline
• Setting the Stage
• DNA Sequence Comparison: First Successes
• The Change Problem
• The Manhattan Tourist Problem
• The Longest Common Subsequence Problem
• Sequence Alignment
• The Edit Distance Problem
http://www.bioalgorithms.info
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 3 -
Sequence comparison and alignment
Question: how can we measure sequence similarity?
Consider:
AGTAGCATC
AGTGCACC
AGTAGCATC
GACACGATTversus
That the two DNA fragments on the left somehow seem more similar than the two on the right could be significant.
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 4 -
Motivation
Why is this important?
• Given a new DNA sequence, one of the first things abiologist will want to do is search databases of known sequences to see if anyone has recorded something similar.(As we've seen, genetic sequences are long and the databases are enormous, so efficiency will be an issue.)
• Sequence similarity can provide clues about function.
• Many other problems from computational biology incorporate some notion of sequence similarity as a basic premise.
• Similarity can provide clues about evolutionary relationships.
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 5 -
Sequence concepts
A sequence is a linear string of symbols over a finite alphabet.
Some basic sequence concepts:
(Note: string and sequence are often used synonymously.)
number of symbols in s (written |s|).length
sequence of length 0 (written ε).empty string
sequence that can be obtained from s by removing some symbols (ACG is a subsequence of TATCTG).
subsequence
if t is a subsequence of s, then s is a supersequence of t.
supersequence
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 6 -
Sequence concepts
More basic sequence concepts:
sequence of consecutive symbols appearing in s (ACG is not a substring of TATCTG, but TCT is).
substring
(Observation: every substring is a subsequence of the string in question, but not vice versa.)
if t is a substring of s, then s is a superstring of t.
superstring
set of consecutive indices of a string, e.g., [2..4] refers to 2nd through 4th symbols of string s, or s[2]s[3]s[4].
interval
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 7 -
Sequence concepts
And lastly:
substring of s of the form s[1..j] where 0 ≤ j ≤ |s|(when j = 0, prefix is the empty string).
prefix
substring of s of the form s[i..|s|] where 1 ≤ i ≤ |s| + 1(when i = |s| + 1, suffix is the empty string).
suffix
AT is a prefix (and substring and subsequence) of ATCCAG.
AG is a suffix (and substring and subsequence) of ATCCAG.
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 8 -
Sequences
The concept of a sequence is extremely broad. In this course, we are concerned with genetic sequences. However, there are other important kinds of sequence data:
• ASCII text,• speech,• handwriting (pen-strokes).
Likewise, the same algorithmic techniques turn up again and again, often under different names:
• approximate string matching,• edit (or evolutionary) distance,• dynamic time warping.
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 9 -
Sequence comparison: a false start
A
C
C
G
G
T
T
G
G
C
C
A
A
C
A
G
G
T
A
G
G
C
C
An obvious idea that comes to mind is to line up each symbol and count the number that don't match:
As we know, this is Hamming distance and forms the basis for most error correcting codes, as well as the motif-finding problem we saw earlier.
= 2
But it doesn't work for the kinds of sequences we care about:
Just one missing symbol at the start of the second sequence leads to a large distance.
= 6?
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 10 -
Sequence manipulation at the genetic level
http://www.accessexcellence.org/AB/GG/nhgri_PDFs/insertion.pdf
Genomes aren't static ...
... sequence comparison must account for this.http://www.accessexcellence.org/AB/GG/nhgri_PDFs/deletion.pdf
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 11 -
The human factor
A hard-to-read gel (arrow marks location where bands of similar
intensity appear in two different lanes):http://hshgp.genome.washington.edu/teacher_resources/99-studentDNASequencingModule.pdf
"...the error rate is generally less than 1% over the first 650 bases and then rises significantly over the remaining sequence.”http://genome.med.harvard.edu/dnaseq.html
In addition, errors can arise during the sequencing process:
TAATGGCG?
TAATGCGG?
TAATTGCGG?
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 12 -
Sequence alignment
As we have seen, the two sequences we wish to compare may have different lengths. As a result, we need to allow for deletions and insertions.
The notion of an alignment helps us visualize this:
Alignment permits us to incorporate “spaces” (represented by a dash) in one or both sequences to make them the same length.
A
C
C
G
G
T
T
G
G
C
C?
A
C
C
G
G
T
T
G
G
C
C
-
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 13 -
Sequence alignment
No – this one has 5! How can we find the best alignment?
A
C
C
G
G
T
T
C
G
C
C
- T
-
G
G
C
C
-
T
-
G
C
C
T
T
G
G
C
-
This alignment has 6 mismatches. Is it the best possible?
A
C
C
G
G
T
T
C
G
C
C
- T G
G
C
C T G
C
-
T
T
G
G
C
-C
- -
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 14 -
DNA sequence comparison
First successes:
• Finding sequence similarities with genes of known function is a common approach to infer function of a newly-sequenced gene.
• In 1984, Russell Doolittle and colleagues found similarities between cancer-causing gene and normal growth factor (PDGF) gene.
http://www.bioalgorithms.info
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 15 -
Cystic Fibrosis
• Cystic fibrosis (CF) is chronic and often fatal genetic disease of body's mucus glands (abnormally high level of mucus). Primarily affects respiratory systems in children.
• Mucus is slimy material that coats many epithelial surfaces and is secreted into fluids such as saliva.
Inheritance of Cystic Fibrosis:• In early 1980's, biologists hypothesized that CF is an
autosomal recessive disorder caused by mutations in a gene that remained unknown until 1989.
• Heterozygous carriers are asymptomatic.• Must be homozygously recessive in this gene in order to be
diagnosed with CF.http://www.bioalgorithms.info
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 16 -
Cystic Fibrosis: finding the gene
http://www.bioalgorithms.info
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 17 -
Cystic Fibrosis
• ATP binding proteins are present on cell membrane and act as transport channel.
• In 1989, biologists found similarity between cystic fibrosis gene and ATP binding proteins.
• A plausible function for cystic fibrosis gene, given that CF involves sweet secretion with abnormally high sodium level.
Finding Similarities between Cystic Fibrosis Gene and ATP binding proteins:
http://www.bioalgorithms.info
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 18 -
Cystic Fibrosis: mutation analysis
If a high percentage of cystic fibrosis (CF) patients have a certain mutation in the gene, while unaffected patients don’t, that could be an indicator of a mutation that is related to CF.
Indeed, a certain mutation was found in 70% of CF patients; this is convincing evidence of a predominant genetic diagnostics marker for CF.
http://www.bioalgorithms.info
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 19 -
Cystic Fibrosis and the CFTR protein
• CFTR (Cystic Fibrosis Transmembrane conductance Regulator) protein acts in cell membrane of epithelial cells that secrete mucus.
• These cells line airways of nose, lungs, stomach wall, etc.
• CFTR protein (1480 amino acids) regulates a chloride ion channel: adjusts “wateriness” of fluids secreted by cell.
• CF sufferers are missing a single amino acid in their CFTR.• Mucus ends up being too thick, affecting many organs.http://www.bioalgorithms.info
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 20 -
Bring in the Bioinformaticians
• Gene similarities between two genes with known and unknown function alert biologists to some possibilities.
• Computing a similarity score between two genes tells how likely it is that they have similar functions.
• Dynamic programming is a technique for revealing similarities between genes.
• The Change Problem is a good problem to introduce idea of dynamic programming.
http://www.bioalgorithms.info
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 21 -
The Change Problem
The Change Problem.Convert some amount of money M into given denominations, using fewest possible number of coins.Input: An amount of money M, and an array of d denominations, c = (c1, c2, …, cd), with (c1 > c2 > … > cd).Output: A list of d integers, i1, i2, …, id, such that
c1i1 + c2i2 + … + cdid = Mand i1 + i2 + … + id is minimal.
You hand cashier $20 for a $19.23 bill. Your change could be:
etc.
3 quarters + 2 pennies 77 pennies
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 22 -
The Change Problem: an example
Given denominations 1, 3, and 5, what is minimum number of coins needed to make change for a given value?
1 2 3 4 5 6 7 8 9 10
1 1 1
Value
Min # of coins
1 2 3 4 5 6 7 8 9 10
1 1 1
Value
Min # of coins
Only one coin needed to make change for values 1, 3, and 5.
http://www.bioalgorithms.info
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 23 -
The Change Problem: an example
Given denominations 1, 3, and 5, what is minimum number of coins needed to make change for a given value?
However, two coins needed to make change for values 2, 4, 6, 8, and 10.
1 2 3 4 5 6 7 8 9 10
1 2 1 2 1 2 2 2
Value
Min # of coins
http://www.bioalgorithms.info
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 24 -
The Change Problem: an example
Given denominations 1, 3, and 5, what is minimum number of coins needed to make change for a given value?
Lastly, three coins needed to make change for values 7 and 9.
1 2 3 4 5 6 7 8 9 10
1 2 1 2 1 2 3 2 3 2
Value
Min # of coins
http://www.bioalgorithms.info
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 25 -
The Change Problem: recurrence
minNumCoins(M) =
minNumCoins(M-1) + 1
minNumCoins(M-3) + 1
minNumCoins(M-5) + 1
min ofminNumCoins(M) =
minNumCoins(M-1) + 1
minNumCoins(M-3) + 1
minNumCoins(M-5) + 1
min of
This example is expressed by the following recurrence relation:
minNumCoins(M) =
minNumCoins(M-c1) + 1
minNumCoins(M-c2) + 1
…
minNumCoins(M-cd) + 1
min ofminNumCoins(M) =
minNumCoins(M-c1) + 1
minNumCoins(M-c2) + 1
…
minNumCoins(M-cd) + 1
min of
Given denominations c1, c2, …, cd, recurrence relation is:
http://www.bioalgorithms.info
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 26 -
The Change Problem: a recursive algorithm
RecursiveChange(M, c, d)if M = 0
return 0bestNumCoins ← infinityfor i ← 1 to d
if M ≥ ci
numCoins ← RecursiveChange(M – ci, c, d)if numCoins + 1 < bestNumCoins
bestNumCoins ← numCoins + 1return bestNumCoins
http://www.bioalgorithms.info
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 27 -
The Change Problem: a recursive algorithm
The RecursiveChange algorithm works, but it is not efficient:
• E.g., if M = 77 andc = (1,3,7), then optimal combination of coins for 70 cents is computed on 9 different occasions!
74
77
76 70
75 73 69 73 71 67 69 67 63
74 72 68
72 70 66
68 66 62
72 70 66
70 68 64
66 64 60
68 66 62
66 64 60
62 60 56
. . . . . .70 70 70 7070
• It recalculates optimal coin combination for a given amount of money repeatedly.
http://www.bioalgorithms.info
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 28 -
The Change Problem: we can do better
Rather than re-compute a given value more than once:
• Save results of each computation for 0 to M.
• This way, we can do a reference call to find an already computed value, instead of re-computing each time.
• Running time is M x d, where M is the value of money and d is the number of denominations.
http://www.bioalgorithms.info
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 29 -
The Change Problem: Dynamic Programming
DPChange(M, c, d)bestNumCoins0 ← 0
for m ← 1 to MbestNumCoinsm ← infinity
for i ← 1 to dif m ≥ ci
if bestNumCoinsm – ci + 1 < bestNumCoinsm
bestNumCoinsm ← bestNumCoinsm – ci + 1
return bestNumCoinsM
http://www.bioalgorithms.info
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 30 -
DPChange example
00
0 10 1
0 1 20 1 2
0 1 2 30 1 2 1
0 1 2 3 40 1 2 1 2
0 1 2 3 4 50 1 2 1 2 3
0 1 2 3 4 5 60 1 2 1 2 3 2
0 1 2 3 4 5 6 70 1 2 1 2 3 2 1
0 1 2 3 4 5 6 7 80 1 2 1 2 3 2 1 2
0 1 2 3 4 5 6 7 8 90 1 2 1 2 3 2 1 2 3
c = (1,3,7)M = 9
http
://w
ww
.bio
algo
rithm
s.in
fo
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 31 -
The Manhattan Tourist Problem (MTP)
Imagine seeking a path (from source to sink) to travel (only eastward and southward) with the most number of attractions (*) in the Manhattan grid.
Sink*
*
*
**
**
* *
*
*
Source
*
http://www.bioalgorithms.info
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 32 -
Sink*
*
*
**
**
* *
*
*
Source
*
The Manhattan Tourist Problem (MTP)
Imagine seeking a path (from source to sink) to travel (only eastward and southward) with the most number of attractions (*) in the Manhattan grid.
http://www.bioalgorithms.info
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 33 -
The Manhattan Tourist Problem (MTP)
The Manhattan Tourist Problem.Find the longest path in a weighted grid.
Input: A weighted grid G with two distinct vertices, one labeled “source” and the other labeled “sink.”Output: A longest path in G from “source” to “sink.”
http://www.bioalgorithms.info
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 34 -
MTP: an example
3 2 4
0 7 3
3 3 0
1 3 2
4
4
5
6
4
6
5
5
8
2
2
5
0 1 2 3
0
1
2
3
j coordinatei c
oord
inat
e
13
source
sink
4
3 2 4 0
1 0 2 4 3
3
1
1
2
2
2
4 19
95
15
23
0
20
3
4 http
://w
ww
.bio
algo
rithm
s.in
fo
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 35 -
MTP: greedy is not optimal
1 2 5
2 1 5
2 3 4
0 0 0
5
3
0
3
5
0
10
3
5
5
1
2promising start, but leads to bad choices!
source
sink18
22
http://www.bioalgorithms.info
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 36 -
MTP: a simple recursive program
MT(n, m)if n = 0 and m = 0
return 0if n > 0
x ← MT(n-1, m) + length of edge from (n-1, m) to (n, m)else
x ← 0if m > 0
y ← MT(n, m-1) + length of edge from (n, m-1) to (n, m)else
y ← 0return max(x, y) What's wrong with
this approach?
http://www.bioalgorithms.info
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 37 -
MTP: Dynamic Programming
• Calculate optimal path score for each vertex in graph.
• Each vertex’s score is maximum of prior vertices' scores plus weight of respective edges in between.
1
5
0 1
0
1
i
source
1
5S1,0 = 5
S0,1 = 1
j
http://www.bioalgorithms.info
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 38 -
1 2
5
3
0 1 2
0
1
2
source
1 3
5
8
4
S2,0 = 8
i
S1,1 = 4
S0,2 = 33
-5
j
MTP: Dynamic Programming (continued)
http://www.bioalgorithms.info
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 39 -
MTP: Dynamic Programming (continued)
1 2
5
3
0 1 2 3
0
1
2
3
i
source
1 3
5
8
8
4
0
5
8
103
5
-59
131-5
S3,0 = 8
S2,1 = 9
S1,2 = 13
S3,0 = 8
j
http
://w
ww
.bio
algo
rithm
s.in
fo
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 40 -
MTP: Dynamic Programming (continued)
1 2 5
-5 1 -5
-5 3
0
5
3
0
3
5
0
10
-3
-5
0 1 2 3
0
1
2
3
i
source
1 3 8
5
8
8
4
9
13 8
9
12
S3,1 = 9
S2,2 = 12
S1,3 = 8
j
Greedyalgorithmfails!
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 41 -
1 2 5
-5 1 -5
-5 3 3
0 0
5
3
0
3
5
0
10
-3
-5
-5
2
0 1 2 3
0
1
2
3
i
source
1 3 8
5
8
8
4
9
13 8
12
9
15
9
j
0
1
16 S3,3 = 16
MTP: Dynamic Programming (continued)
Done!
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 42 -
MTP: recurrence
Computing score for a point (i, j) by recurrence relation:
si, j = max si-1, j + weight of the edge between (i-1, j) and (i, j)
si, j-1 + weight of the edge between (i, j-1) and (i, j)si, j = max si-1, j + weight of the edge between (i-1, j) and (i, j)
si, j-1 + weight of the edge between (i, j-1) and (i, j)
The running time is n x m for a n-by-m grid, where n is the number of rows and m is the number of columns.
http://www.bioalgorithms.info
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 43 -
But Manhattan is not a perfect grid ...
B
A3
A1
A2
B
A3
A1
A2
What about diagonals?
Score at point B is given by:
sB = max of
sA1 + weight of the edge (A1, B)
sA2 + weight of the edge (A2, B)
sA3 + weight of the edge (A3, B)
sB = max of
sA1 + weight of the edge (A1, B)
sA2 + weight of the edge (A2, B)
sA3 + weight of the edge (A3, B)
http://www.bioalgorithms.info
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 44 -
But Manhattan is not a perfect grid ...
Computing score sx for point x is given by recurrence relation:
Predecessors(x) = set of vertices that have edges leading to x
Running time for a graph G(V, E) is O(E) since each edge is evaluated once
Note: V is set of all vertices and E is set of all edges.
sx = maxof
sy + weight of edge (y, x) where
y ∈ Predecessors(x)sx = max
ofsy + weight of edge (y, x) where
y ∈ Predecessors(x)
http://www.bioalgorithms.info
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 45 -
Alignment: two-row representation
Alignment : 2 * k matrix ( k > m, n )
T - C G A T-T G C - A -A
letters of v
letters of wTT
A T C T G A TT G C A T A
v :w :
m = 7 n = 6
4 matches 2 insertions 3 deletions
Given 2 DNA sequences v and w:
http://www.bioalgorithms.info
A-
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 46 -
Aligning DNA sequences
v = ATCTGATGw = TGCATAC
n = 8m = 7
CATACGTGTAGTCTAV
W
match
insertiondeletion
mismatch
indels
4132
matches mismatches deletions insertions
http://www.bioalgorithms.info
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 47 -
The Longest Common Subsequence Problem (LCS)
The Longest Common Subsequence Problem.Find the longest common subsequence of two sequences.
Input: Two sequences:
v = v1 v2…vm and w = w1 w2…wn
Output: A sequence of positions in v : 1 ≤ i1 < i2 < … < it ≤ m
and a sequence of positions in w : 1 ≤ j1 < j2 < … < jt ≤ n
such that symbol ik of v matches jk of w and t is maximal.
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 48 -
LCS example
A T - C T G A T C- T G C T - A - C
elements of v
elements of w-A
1
2
0
1
2
2
3
3
4
3
5
4
5
5
6
6
6
7
7
8
j coords:
i coords:
Matches shown in redpositions in v:positions in w:
2 < 3 < 4 < 6 < 8
1 < 3 < 5 < 6 < 7
Every common subsequence is a path in 2-D grid.
0
0
(0,0)→ (1,0)→ (2,1)→ (2,2)→ (3,3)→ (3,4)→ (4,5)→ (5,5)→ (6,6)→ (7,6)→ (8,7)
http://www.bioalgorithms.info
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 49 -
LCS Dynamic Programming
Find LCS of two strings.
Input: a weighted graph G with two distinct vertices, one labeled “source” and one labeled “sink.”Output: a longest path in G from “source” to “sink.”
Solve using an MTP graph with diagonals replaced by +1 edges.
http://www.bioalgorithms.info
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 50 -
LCS Problem as Manhattan Tourist Problem
T
G
C
A
T
A
C
1
2
3
4
5
6
7
0i
A T C T G A T C0 1 2 3 4 5 6 7 8
j
http://www.bioalgorithms.info
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 51 -
MTP Graph for LCS Problem
T
G
C
A
T
A
C
1
2
3
4
5
6
7
0i
A T C T G A T C0 1 2 3 4 5 6 7 8
j
http://www.bioalgorithms.info
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 52 -
MTP Graph for LCS Problem
T
G
C
A
T
A
C
1
2
3
4
5
6
7
0i
A T C T G A T C0 1 2 3 4 5 6 7 8
j
Diagonal edge adds extraelement to subsequence.
LCS Problem: find path with maximum diagonal edges.
Every path is common subsequence.
http://www.bioalgorithms.info
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 53 -
Computing LCS
si, j = max
si-1, j
si, j-1
si-1, j-1 + 1 if vi = wj
Let vi = prefix of v of length i : v1 … vi
and wj = prefix of w of length j : w1 … wj
The length of LCS(vi, wj) is computed by:
i,j
i-1,j
i,j -1
i-1,j -1
1 00
http://www.bioalgorithms.info
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 54 -
LCS Alignments
4
3
2
1
0
43210W A T C G
ATGT
V
0 1 2 2 3 4V = A T - G T | | |W = A T C G – 0 1 2 3 4 4
Every path in grid corresponds to an alignment:
http://www.bioalgorithms.info
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 55 -
Edit distance
In 1966, Levenshtein introduced edit distance between two strings as minimum number of elementary operations (insertions, deletions, and substitutions) needed to transform one string into the other.
d(v, w) = MIN number of elementary operations neededto transform v → w.
Keep in mind that LCS computation uses a MAX.
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 56 -
The Edit Distance Problem
The Edit Distance Problem.Given two sequences, find the optimal series of deletions, insertions, and substitutions to transform one into the other.Input: Two sequences:
v = v1 v2…vm and w = w1 w2…wn
Output: An optimal series of basic editing operations:
e1, e2, ..., et
such then, when applied to one of the sequences, say v, it is transformed into the other sequence, w.
Here “optimal” can mean any of a number of things, including “fewest” or “lowest- / highest-cost.”
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 57 -
Edit distance vs. Hamming distance
v = ATATATATw = TATATATA
Hamming distance always compares i-th letter of v with i-th letter of w
Computing Hammingdistance is trivial task.
Hamming distance: d(v, w) = 8
w = TATATATAJust one shift
Make it all line up
v = -ATATATAT
Edit distance may compare i-th letter of v with j-th letter of w
Computing edit distance is non-trivial task.
Edit distance: d(v, w) = 2
http://www.bioalgorithms.info
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 58 -
Edit distance vs. Hamming distance
v = ATATATATw = TATATATA
Hamming distance always compares i-th letter of v with i-th letter of w
Hamming distance: d(v, w) = 8
w = TATATATA-Just one shift
Make it all line up
v = -ATATATAT
Edit distance may compare i-th letter of v with j-th letter of w
One insertion and one deletion.
Edit distance: d(v, w) = 2
How to find which j goes with which i ???
http://www.bioalgorithms.info
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 59 -
Edit distance example
TGCATAT ATCCGAT in 5 steps:
TGCATAT (delete last T)TGCATA (delete last A)TGCAT (insert A at front)ATGCAT (substitute C for 3rd G)ATCCAT (insert G before last A) ATCCGAT (Done)
http://www.bioalgorithms.info
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 60 -
TGCATAT ATCCGAT in 5 steps:
TGCATAT (delete last T)TGCATA (delete last A)TGCAT (insert A at front)ATGCAT (substitute C for 3rd G)ATCCAT (insert G before last A) ATCCGAT (Done)
What is the edit distance? 5?
Edit distance example
http://www.bioalgorithms.info
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 61 -
Edit distance example
TGCATAT ATCCGAT in 4 steps:
TGCATAT (insert A at front)ATGCATAT (delete 8th T)ATGCATA (substitute G for 5th A)ATGCGTA (substitute C for 3rd G)ATCCGAT (Done)
http://www.bioalgorithms.info
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 62 -
TGCATAT ATCCGAT in 4 steps:
TGCATAT (insert A at front)ATGCATAT (delete 8th T)ATGCATA (substitute G for 5th A)ATGCGTA (substitute C for 3rd G)ATCCGAT (Done)
Can it be done in 3 steps???
Edit distance example
http://www.bioalgorithms.info
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 63 -
Alignment grid
Every alignment path is from source to sink.
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 64 -
Alignment as a path in edit graph
0 1 2 2 3 4 5 6 7 70 1 2 2 3 4 5 6 7 7 A T _ G T T A T _A T _ G T T A T _ A T C G T _ A _ CA T C G T _ A _ C0 1 2 3 4 5 5 6 6 70 1 2 3 4 5 5 6 6 7
(0,0), (1,1), (2,2), (0,0), (1,1), (2,2), (2,3), (3,4), (4,5), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (5,5), (6,6), (7,6), (7,7)(7,7)
Corresponding path:
http://www.bioalgorithms.info
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 65 -
Alignment as a path in edit graph
and represent indels in v and w with score 0.
represent matches with score 1.
The score of the alignment path is 5.
http://www.bioalgorithms.info
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 66 -
Alignment as a path in edit graph
Every path inedit graphorresponds toan alignment:
http://www.bioalgorithms.info
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 67 -
Alignment as a path in edit graph
http://www.bioalgorithms.info
Old AlignmentOld Alignment 0122345677v = AT_GTTAT_w = ATCGT_A_C 0123455667
New AlignmentNew Alignment 0122345677v = AT_GTTAT_w = ATCG_TA_C 0123445667
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 68 -
Alignment as a path in edit graph
http://www.bioalgorithms.info
0122345677v = AT_GTTAT_w = ATCGT_A_C 0123455667
(0,0), (1,1), (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (7,7)
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 69 -
Sequence comparison: the basic algorithm
Fortunately we don't have to. The optimal alignment can be found using dynamic programming, which we've just seen.
We can't afford to enumerate all possible alignments looking for the best one – that would be an exponential search.
Dynamic programming is based on premise of computing solutions to smaller subproblems first and then using these to solve successively larger problems until we have our answer.
(Dynamic programming was invented by Richard Bellman in the 1950's. Its application to sequence comparison came later, in the 1970's.)http://fens.sabanciuniv.edu/msie/operations_research_50_years/anniversary/or50/1526-5463-2002-50-01-0048.pdf
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 70 -
Sequence alignment ...
First some ground-rules.
-
-
Not legal:
A
-
Legal:
deletion
-
T
Legal:
insertion
C
C
Legal:
match
G
C
Legal:
mismatch
http://www.bioalgorithms.info
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 71 -
Sequence comparison: the basic algorithm
So, assuming we've already computed solutions for all shorter prefixes, we can compute the alignment for v[1..i] and w[1..j].
Given two sequences v and w, consider what's required to compute optimal alignment for prefixes v[1..i] and w[1..j]. Based on our rules for alignments, there are three possible cases:
v[i]
w[j]
optimal alignmentfor v[1..i-1]
and w[1..j-1]
III
or
substitution or match
-
w[j]
optimal alignmentfor v[1..i]
and w[1..j-1]
II
insertion
v[i]
-
optimal alignmentfor v[1..i-1]and w[1..j]
I
deletion
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 72 -
Sequence comparison: the basic algorithm
Conceptually, this might look something like this:
optimal alignment at v[1..i] and w[1..j]
optimal alignment at v[1..i] and w[1..j-1]
+cost of inserting w[j]
optimal alignment at v[1..i-1] and w[1..j]
+cost of deleting v[i]
optimal alignment atv[1..i-1] and w[1..j-1]
+cost of matching v[i] and w[j]
= max
Here we assume that deletions, insertions, and mismatcheshave negative costs, while matches have positive costs.
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 73 -
Sequence comparison: the basic algorithm
This computation can be viewed as building a 2-D matrix:
s t r i n g wε
s t r
i n
g
vε 0
where indel is cost of a deletion or insertion, and sub is cost of a match/mismatch involving symbols v[i] and w[j].
d[i,j] = max
d[i-1,j] + indel
d[i,j-1] + indel
d[i-1,j-1] + sub(v[i],w[j])
cost of inserting w
cost
of d
elet
ing
v
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 74 -
Sequence comparison: the basic algorithm
Stated more generally, say that our two sequences are:v[1]v[2]v[3]...v[m] w[1]w[2]w[3]...w[n]
Where cdel, cins, and csub are the costs of a deletion, an insertion, and a substitution, respectively.
d[i,j] = max
d[i-1,j] + cdel(v[i])
d[i,j-1] + cins(w[j])
d[i-1,j-1] + csub(v[i], w[j])
1 ≤ i ≤ m, 1 ≤ j ≤ n
And:
recu
rrenc
e
d[i,0] = d[i-1,0] + cdel(v[i])d[0,j] = d[0,j-1] + cins(w[j])
Then:1 ≤ i ≤ m1 ≤ j ≤ n
initi
al
cond
ition
sd[0,0] = 0
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 75 -
Sequence comparison: the basic algorithm
Example: say that cdel = -1, cins = -1,csub = -1 if mismatch and +1 if match
ACGT
-1-2-3-4
0-10-1-2
-10-1-1-2
-2-1-1-20
-3C A T
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 76 -
Algorithm for computing edit distance
EditDistance1(v, w)for i ← 1 to n
di,0 ← di-1,0 + cdel(vi)for j ← 1 to m
d0,j ← d0,j-1 + cins(wj)for i ← 1 to n
for j ← 1 to mdi-1,j + cdel(vi)
di,j ← max di,j-1 + cins(wj)di-1,j-1 + csub(vi, wj)
return (dn,m)http://www.bioalgorithms.info
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 77 -
Sequence comparison: some observations
Computation can progress in a number of ways:
or or
For sequences of length m and n,
v[1]v[2]v[3]...v[m] w[1]w[2]w[3]...w[n]
What is the computation time?
How much memory (storage) is required?
O(mn)
O(mn)
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 78 -
Sequence comparison: getting an alignment
ACGT
C A T
We started with the notion of alignment. How do we get this?
-3 -1 -1 -2-2
-2-1-1
-2
-4
0
-1
-1
-2
-1
0
-30-1 0
By keeping track of optimal decisions made during algorithm,
... and then tracing back optimal path.
G
A
C
C
T
T
A
-(May not be unique).
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 79 -
Maintaining the traceback
EditDistance1(v, w)...for i ← 1 to n
for j ← 1 to mdi-1,j + cdel(vi)
di,j ← max di,j-1 + cins(wj)di-1,j-1 + csub(vi, wj)
if di,j = di-1,j + cdel(vi)bi,j ← if di,j = di,j-1 + cins(wj)
if di,j = di-1,j-1 + csub(vi, wj)return (dn,m, b)http://www.bioalgorithms.info
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 80 -
More examples
Comparing ACGT vs. CAT:
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 81 -
More examples
Comparing ACGTAT vs. AGTTTG:
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 82 -
More examples
Comparing ACGTGCGCTGCTG vs. CGTCCTGCCTGC:
A
C
C
G
G
T
T
C
G
C
C
- T G
G
C
C T G
C
-
T
T
G
G
C
-C
- -
Recall the trace we saw earlier:
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 83 -
Optimality of the algorithm
How do we know this agorithm finds an optimal alignment?
We won't dig into finer details now, but basically it hinges on:
• Bellman's principle of optimality for dynamic programming.• The cost functions behaving properly (i.e., they must satisfy
the triangle inequality).
This could be an interesting topic for a final paper ...
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 10 - 84 -
Wrap-up
Remember:• Come to class having done the readings.• Check Blackboard regularly for updates.
Readings for next time:• Continue reading IBA Chapter 6 (sequence comparison).