CS 5263 Bioinformatics
Lecture 5: Affine Gap Penalties
Last lecture
• Local Sequence Alignment
• Bounded Dynamic Programming
• Linear Space Sequence Alignment
The Smith-Waterman algorithm
Initialization: F(0, j) = F(i, 0) = 0
Iteration:
\[
F(i, j) = \max
\begin{cases}
0 \\
F(i-1, j) - d \\
F(i, j-1) - d \\
F(i-1, j-1) + s(x_i, y_j)
\end{cases}
\]
The Smith-Waterman algorithm
Termination:
1. If we want the best local alignment: F_OPT = max_{i,j} F(i, j)
2. If we want all local alignments scoring > t: for all i, j with F(i, j) > t, trace back
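The recurrence and the first termination rule can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference code: the function name and the default scores (match = 1, mismatch = -1, d = 2) are assumptions chosen for the example, and d is a positive penalty that is subtracted, as in the recurrence above.

```python
def smith_waterman_score(x, y, match=1, mismatch=-1, d=2):
    """Best local alignment score with a linear gap penalty d (d > 0).

    The scoring values are illustrative assumptions, not values from the slides.
    """
    M, N = len(x), len(y)
    # F[i][j]: best score of a local alignment ending at x[i-1], y[j-1].
    # Row 0 and column 0 stay at 0 (the initialization step).
    F = [[0] * (N + 1) for _ in range(M + 1)]
    best = 0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(0,
                          F[i - 1][j] - d,
                          F[i][j - 1] - d,
                          F[i - 1][j - 1] + s)
            # Termination rule 1: track the maximum over all cells.
            best = max(best, F[i][j])
    return best
```

For example, `smith_waterman_score("TTACGTT", "GGACGGG")` picks out the shared ACG and ignores the mismatching flanks.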
Bounded Dynamic Programming
• O(kM) time
• O(kN) memory
[Figure: dynamic programming matrix with x1 … xM along one axis and y1 … yN along the other; the band width k is marked]
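A minimal sketch of the banded idea, assuming global alignment with a linear gap penalty: only cells with |i - j| ≤ k are computed and stored, so both time and memory grow with k times the sequence length. The function name, the dictionary storage and the default scores are assumptions made for this example.

```python
import math

def banded_global_score(x, y, k, match=1, mismatch=-1, d=2):
    """Global alignment score restricted to the band |i - j| <= k.

    Scoring values are illustrative assumptions; d > 0 is subtracted per gap.
    """
    M, N = len(x), len(y)
    if abs(M - N) > k:
        raise ValueError("band too narrow to reach cell (M, N)")
    NEG = -math.inf
    F = {(0, 0): 0}  # only cells inside the band are ever stored
    for i in range(M + 1):
        for j in range(max(0, i - k), min(N, i + k) + 1):
            if i == 0 and j == 0:
                continue
            s = match if i > 0 and j > 0 and x[i - 1] == y[j - 1] else mismatch
            # Cells outside the band (or the matrix) count as -infinity.
            F[(i, j)] = max(F.get((i - 1, j - 1), NEG) + s,
                            F.get((i - 1, j), NEG) - d,
                            F.get((i, j - 1), NEG) - d)
    return F[(M, N)]
```

The result equals the unrestricted optimum only when an optimal path stays inside the band, which is the assumption bounded dynamic programming relies on.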
Linear-space alignment
[Figure: alignment matrix split at M/2, with labels k* and N-k*]
• O(M+N) memory
• 2MN time
Homework Problem 5 hints
Dot matrix for visualizing seq similarities
• Seq1: x[1..m]
• Seq2: y[1..n]
[Figure: dot-matrix plots of Sequence 1 (vertical axis) against Sequence 2 (horizontal axis), axis ticks from 50 to 300, for the rules below]
• A(i, j) = 1 if δ(xi, yj) = 1
• A(i, j) = 1 if Σ_{k=1..10} δ(x_{i+k}, y_{j+k}) > 7
• A(i, j) = 1 if Σ_{k=1..20} δ(x_{i+k}, y_{j+k}) > 15
Here δ(a, b) is taken to be 1 when a = b and 0 otherwise.
A dot matrix does not do any alignment (global or local). It helps to detect strongly conserved regions.
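The windowed dot-matrix rule can be sketched as follows. This is an illustration under stated assumptions: the function name and parameters are hypothetical, identity is used as the match test, and indices are 0-based, so the window starting at (i, j) covers x[i..i+window-1] and y[j..j+window-1] rather than the 1-based offsets written on the slide.

```python
def dot_matrix(x, y, window=1, threshold=0):
    """A[i][j] = 1 if more than `threshold` of the `window` aligned
    positions starting at x[i], y[j] are identical, else 0."""
    rows = max(len(x) - window + 1, 0)
    cols = max(len(y) - window + 1, 0)
    A = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            # Count identical characters along the diagonal window.
            matches = sum(x[i + k] == y[j + k] for k in range(window))
            if matches > threshold:
                A[i][j] = 1
    return A
```

With `window=10, threshold=7` this corresponds to the second rule above and `window=20, threshold=15` to the third; larger windows suppress isolated chance matches, so diagonal runs stand out.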
Today
• How to model gaps more accurately?
• Statistics of alignments
– Where does s(xi, yj) come from?
– Are two aligned sequences actually related? (not today)
What’s a better alignment?
GACGCCGAACG
|||||   |||
GACGC---ACG
Score = 8 x m – 3 x d
GACGCCGAACG
|||| | | ||
GACG-C-A-CG
Score = 8 x m – 3 x d
However, gaps usually occur in bunches.
- During evolution, chunks of DNA may be lost entirely
- Aligning genomic sequences vs. cDNAs (reverse complementary to mRNAs)
Model gaps more accurately
• Current model:
– Gap of length n incurs penalty n x d
• General:
– Convex function
– E.g. γ(n) = c * sqrt(n)
[Figure: gap penalty γ(n) plotted against gap length n for the two models]
General gap dynamic programming
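Initialization: same
Iteration (γ(n) is the penalty for a gap of length n):
\[
F(i, j) = \max
\begin{cases}
F(i-1, j-1) + s(x_i, y_j) \\
\max_{k=0 \ldots i-1} F(k, j) - \gamma(i-k) \\
\max_{k=0 \ldots j-1} F(i, k) - \gamma(j-k)
\end{cases}
\]
Termination: same
Running time: O((M+N)MN) (cubic)
Space: O(NM) (linear-space algorithm not applicable)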
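The cubic cost comes from the two inner maxima, which scan a whole column and a whole row for every cell. A minimal sketch, assuming global alignment with boundary values F(i, 0) = -γ(i) and F(0, j) = -γ(j) (the slide only says "same"), a positive penalty function passed in as `gamma`, and illustrative match/mismatch scores:

```python
import math

def general_gap_score(x, y, gamma, match=2, mismatch=-2):
    """Global alignment score with an arbitrary gap penalty gamma(n) > 0.

    Boundary conditions and scoring values are assumptions for this sketch.
    """
    M, N = len(x), len(y)
    F = [[-math.inf] * (N + 1) for _ in range(M + 1)]
    F[0][0] = 0
    for i in range(1, M + 1):
        F[i][0] = -gamma(i)
    for j in range(1, N + 1):
        F[0][j] = -gamma(j)
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            best = F[i - 1][j - 1] + s
            # A gap of length i - k ending at row i (O(M) work per cell).
            for k in range(i):
                best = max(best, F[k][j] - gamma(i - k))
            # A gap of length j - k ending at column j (O(N) work per cell).
            for k in range(j):
                best = max(best, F[i][k] - gamma(j - k))
            F[i][j] = best
    return F[M][N]
```

Any penalty shape can be plugged in, for example `gamma=lambda n: 3 * math.sqrt(n)` for the square-root model or `gamma=lambda n: 5 + (n - 1)` for an affine one.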
Compromise: affine gaps
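γ(n) = d + (n – 1)e, where d is the gap open penalty and e is the gap extension penalty
[Figure: γ(n) plotted against n, with the open cost d and the extension cost e marked]
Match: 2
Gap open: -5
Gap extension: -1
GACGCCGAACG
|||||   |||
GACGC---ACG
Score: 8x2-5-2 = 9
GACGCCGAACG
|||| | | ||
GACG-C-A-CG
Score: 8x2-3x5 = 1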
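The two scores can be checked mechanically. The sketch below scores an already-gapped alignment under the affine model, using the sign convention of this slide (d and e are negative and are added). The function name is hypothetical, and the mismatch score of -2 is an assumption, since neither alignment contains a mismatch.

```python
def affine_alignment_score(a, b, match=2, mismatch=-2, d=-5, e=-1):
    """Score two equal-length gapped strings ('-' marks a gap).

    The first column of a gap run costs d, each further column costs e.
    """
    assert len(a) == len(b)
    score = 0
    prev = None  # row holding the gap in the previous column, if any
    for ca, cb in zip(a, b):
        if ca == "-":
            cur = "a"
        elif cb == "-":
            cur = "b"
        else:
            cur = None
        if cur is None:
            score += match if ca == cb else mismatch
        else:
            # Extending the same gap costs e; starting a new one costs d.
            score += e if cur == prev else d
        prev = cur
    return score

print(affine_alignment_score("GACGCCGAACG", "GACGC---ACG"))  # 9
print(affine_alignment_score("GACGCCGAACG", "GACG-C-A-CG"))  # 1
```

The single three-column gap costs -5 - 1 - 1 = -7, while three separate gaps cost 3 x (-5) = -15, which is why the first alignment wins.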
We want to find the optimal alignment with affine gap penalty in
• O(MN) time
• O(MN) or better O(M+N) memory
Allowing affine gap penalties
• Still three cases
– xi aligned with yj
– xi aligns to a gap
• Are we continuing a gap in y? (if no, start is more expensive)
– yj aligns to a gap
• Are we continuing a gap in x? (if no, start is more expensive)
• We can use a finite state machine to represent the three cases as three states
– The machine has two heads, reading the chars on the two strings separately
– At every step, each head reads 0 or 1 char from its sequence
– Depending on what it reads, the machine goes to a different state and produces a different score
Finite State Machine
F: have just read 1 char from each seq (xi aligned to yj )
Ix: have read 0 char from x. (yj aligned to a gap)
Iy: have read 0 char from y (xi aligned to a gap)
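[Figure: FSM with states F, Ix and Iy; every transition is labelled Input / Output. The labels are first shown as "? / ?" and then filled in as listed here. F is the start state.]
Into F (from F, Ix or Iy): (xi, yj) / s(xi, yj)
F to Iy: (xi, -) / d
Iy to Iy: (xi, -) / e
F to Ix: (-, yj) / d
Ix to Ix: (-, yj) / e
Current state    Input       Output        Next state
F                (xi, yj)    s(xi, yj)     F
F                (-, yj)     d             Ix
F                (xi, -)     d             Iy
Ix               (-, yj)     e             Ix
…                …           …             …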
Example: x = AAC, y = ACT
State path F-F-F-F:
AAC
|||
ACT
State path F-Iy-F-F-Ix:
AAC-
 ||
-ACT
State path F-F-Iy-F-Ix:
AAC-
| |
A-CT
Given a pair of sequences, an alignment (not necessarily optimal) corresponds to a state path in the FSM.
Optimal alignment: find a state path to read the two sequences such that the total output score is the highest
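The correspondence between alignments and state paths is mechanical, as the following sketch shows. It assumes the state names used in this lecture (F for an aligned pair, Ix when a char of y is aligned to a gap, Iy when a char of x is aligned to a gap) and a path that begins in the start state F; the function name is hypothetical.

```python
def state_path(ax, ay):
    """Return the FSM state path for a gapped alignment of x (ax) and y (ay)."""
    path = ["F"]  # start state
    for cx, cy in zip(ax, ay):
        if cx == "-":
            path.append("Ix")   # read 0 chars from x: y char against a gap
        elif cy == "-":
            path.append("Iy")   # read 0 chars from y: x char against a gap
        else:
            path.append("F")    # read one char from each sequence
    return "-".join(path)

print(state_path("AAC", "ACT"))    # F-F-F-F
print(state_path("AAC-", "-ACT"))  # F-Iy-F-F-Ix
print(state_path("AAC-", "A-CT"))  # F-F-Iy-F-Ix
```

Summing the output of each transition along such a path gives the alignment score, so searching for the best path is the same problem as searching for the best alignment.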
Dynamic programming
• We encode this information in three different matrices
• For each element (i,j) we use three variables
– F(i,j): best alignment (score) of x1..xi & y1..yj if xi aligns to yj
– Ix(i,j): best alignment of x1..xi & y1..yj if yj aligns to gap
– Iy(i,j): best alignment of x1..xi & y1..yj if xi aligns to gap
[Figure: the last alignment column in each case: xi over yj for F(i, j), a gap over yj for Ix(i, j), xi over a gap for Iy(i, j)]
Reading the incoming transitions of each FSM state gives one recurrence per matrix:
\[
F(i, j) = s(x_i, y_j) + \max
\begin{cases}
F(i-1, j-1) & \text{continuing alignment} \\
I_x(i-1, j-1) & \text{closing gaps in } x \\
I_y(i-1, j-1) & \text{closing gaps in } y
\end{cases}
\]
\[
I_x(i, j) = \max
\begin{cases}
F(i, j-1) + d & \text{opening a gap in } x \\
I_x(i, j-1) + e & \text{gap extension in } x
\end{cases}
\]
\[
I_y(i, j) = \max
\begin{cases}
F(i-1, j) + d & \text{opening a gap in } y \\
I_y(i-1, j) + e & \text{gap extension in } y
\end{cases}
\]
Data dependency
[Figure: F(i, j) depends on F, Ix and Iy at (i-1, j-1)]
[Figure: dependencies of Ix(i, j) and Iy(i, j) on F and on their own previous cells]
• If we stack all three matrices
– No cyclic dependency
– Therefore, we can fill in all three matrices in order
Algorithm
• for i = 1:M
– for j = 1:N
• Fill in F(i, j), Ix(i, j), Iy(i, j)
– end
• end
• Optimal score = max (F(M, N), Ix(M, N), Iy(M, N))
• Time: O(MN)
• Space: O(MN), or O(N) when combined with the linear-space algorithm
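The loop above can be sketched directly in Python. This is a minimal O(MN)-space illustration, not a definitive implementation: the function name is hypothetical, d and e are negative and added (as in this lecture), and the boundary values Ix(0, j) = d + (j-1)e, Iy(i, 0) = d + (i-1)e with every other boundary cell at -∞ are taken from the exercise that follows.

```python
import math

def affine_matrices(x, y, match=2, mismatch=-2, d=-5, e=-1):
    """Fill F, Ix, Iy for global alignment with affine gaps.

    Returns (F, Ix, Iy, optimal_score).
    """
    NEG = -math.inf
    M, N = len(x), len(y)
    F = [[NEG] * (N + 1) for _ in range(M + 1)]
    Ix = [[NEG] * (N + 1) for _ in range(M + 1)]
    Iy = [[NEG] * (N + 1) for _ in range(M + 1)]
    F[0][0] = 0
    for j in range(1, N + 1):
        Ix[0][j] = d + (j - 1) * e   # y1..yj all aligned to gaps
    for i in range(1, M + 1):
        Iy[i][0] = d + (i - 1) * e   # x1..xi all aligned to gaps
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = s + max(F[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1])
            Ix[i][j] = max(F[i][j - 1] + d, Ix[i][j - 1] + e)
            Iy[i][j] = max(F[i - 1][j] + d, Iy[i - 1][j] + e)
    return F, Ix, Iy, max(F[M][N], Ix[M][N], Iy[M][N])
```

Each cell does a constant amount of work, which is where the O(MN) time bound comes from; the inner order of the three assignments does not matter because none of them reads another matrix at the same (i, j).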
Exercise
• x = GCAC
• y = GCC
• m = 2
• s = -2
• d = -5
• e = -1
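Initial matrices (rows: x = G, C, A, C; columns: y = G, C, C; -∞ marks a cell that cannot be reached)
F: aligned on both
           G     C     C
      0   -∞    -∞    -∞
G    -∞
C    -∞
A    -∞
C    -∞
Iy: insertion on y
           G     C     C
     -∞   -∞    -∞    -∞
G    -5
C    -6
A    -7
C    -8
Ix: insertion on x
           G     C     C
     -∞   -5    -6    -7
G    -∞
C    -∞
A    -∞
C    -∞
Dependencies: F(i, j) takes s(xi, yj) plus the best of F(i-1, j-1), Ix(i-1, j-1), Iy(i-1, j-1); Ix(i, j) takes the better of F(i, j-1) + d and Ix(i, j-1) + e; Iy(i, j) takes the better of F(i-1, j) + d and Iy(i-1, j) + e.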
Filling row 1 (x1 = G), with m = 2, s = -2, d = -5, e = -1:
• F(1, 1) = s(G, G) + max(F(0, 0), Ix(0, 0), Iy(0, 0)) = 2 + 0 = 2
• F(1, 2) = s(G, C) + max(F(0, 1), Ix(0, 1), Iy(0, 1)) = -2 + (-5) = -7
• F(1, 3) = s(G, C) + Ix(0, 2) = -2 + (-6) = -8
• Ix(1, 2) = max(F(1, 1) + d, Ix(1, 1) + e) = 2 - 5 = -3
• Ix(1, 3) = max(F(1, 2) + d, Ix(1, 2) + e) = max(-12, -4) = -4
• Iy(1, j) = -∞ for j = 1..3, because F(0, j) and Iy(0, j) are both -∞
Filling row 2 (x2 = C):
• F(2, 1) = s(C, G) + Iy(1, 0) = -2 + (-5) = -7
• F(2, 2) = s(C, C) + F(1, 1) = 2 + 2 = 4
• F(2, 3) = s(C, C) + max(F(1, 2), Ix(1, 2), Iy(1, 2)) = 2 + (-3) = -1
• Ix(2, 2) = F(2, 1) + d = -12; Ix(2, 3) = max(F(2, 2) + d, Ix(2, 2) + e) = max(-1, -13) = -1
• Iy(2, 1) = F(1, 1) + d = -3; Iy(2, 2) = F(1, 2) + d = -12; Iy(2, 3) = F(1, 3) + d = -13
Rows 3 (x3 = A) and 4 (x4 = C) are filled the same way. Two cells where the gap-extension term wins:
• Iy(3, 1) = max(F(2, 1) + d, Iy(2, 1) + e) = max(-12, -4) = -4
• F(4, 2) = s(C, C) + max(F(3, 1), Ix(3, 1), Iy(3, 1)) = 2 + (-4) = -2
Completed matrices
F
           G     C     C
      0   -∞    -∞    -∞
G    -∞    2    -7    -8
C    -∞   -7     4    -1
A    -∞   -8    -5     2
C    -∞   -9    -2     1
Iy
           G     C     C
     -∞   -∞    -∞    -∞
G    -5   -∞    -∞    -∞
C    -6   -3   -12   -13
A    -7   -4    -1    -6
C    -8   -5    -2    -3
Ix
           G     C     C
     -∞   -5    -6    -7
G    -∞   -∞    -3    -4
C    -∞   -∞   -12    -1
A    -∞   -∞   -13   -10
C    -∞   -∞   -14    -7
Optimal score = max(F(4, 3), Ix(4, 3), Iy(4, 3)) = max(1, -7, -3) = 1
Traceback: F(4, 3) came from Iy(3, 2), which was opened from F(2, 2), which came from F(1, 1) and then F(0, 0).
x = GCAC
    || |
y = GC-C
Check: 2 + 2 - 5 + 2 = 1
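The exercise can be reproduced end to end, including the traceback, with the sketch below. It is an illustration under assumptions rather than the course's code: the function name is hypothetical, ties are broken arbitrarily (so another equally good alignment could be reported for other inputs), and the boundary values follow the initial matrices of the exercise.

```python
import math

def affine_align(x, y, match=2, mismatch=-2, d=-5, e=-1):
    """Global alignment with affine gaps; returns (score, aligned_x, aligned_y)."""
    NEG = -math.inf
    M, N = len(x), len(y)
    states = ("F", "Ix", "Iy")
    S = {k: [[NEG] * (N + 1) for _ in range(M + 1)] for k in states}
    S["F"][0][0] = 0
    for j in range(1, N + 1):
        S["Ix"][0][j] = d + (j - 1) * e
    for i in range(1, M + 1):
        S["Iy"][i][0] = d + (i - 1) * e
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            S["F"][i][j] = s + max(S[k][i - 1][j - 1] for k in states)
            S["Ix"][i][j] = max(S["F"][i][j - 1] + d, S["Ix"][i][j - 1] + e)
            S["Iy"][i][j] = max(S["F"][i - 1][j] + d, S["Iy"][i - 1][j] + e)

    # Traceback: start in the best of the three states at (M, N).
    i, j = M, N
    state = max(states, key=lambda k: S[k][i][j])
    score = S[state][i][j]
    ax, ay = [], []
    while i > 0 or j > 0:
        if state == "F":            # xi aligned with yj
            ax.append(x[i - 1]); ay.append(y[j - 1])
            i -= 1; j -= 1
            state = max(states, key=lambda k: S[k][i][j])
        elif state == "Ix":         # yj aligned to a gap
            opened = S["Ix"][i][j] == S["F"][i][j - 1] + d
            ax.append("-"); ay.append(y[j - 1])
            j -= 1
            state = "F" if opened else "Ix"
        else:                       # Iy: xi aligned to a gap
            opened = S["Iy"][i][j] == S["F"][i - 1][j] + d
            ax.append(x[i - 1]); ay.append("-")
            i -= 1
            state = "F" if opened else "Iy"
    return score, "".join(reversed(ax)), "".join(reversed(ay))

print(affine_align("GCAC", "GCC"))  # (1, 'GCAC', 'GC-C')
```

The traceback has to remember which matrix it is in, not just which cell, because the same (i, j) holds three different partial alignments.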