A Sub-quadratic Sequence Alignment Algorithm
Global alignment
[Figure: alignment graph for S = aacgacga, T = ctacgaga]
Complexity: O(n^2)
V(i,j) = max { V(i-1,j-1) + δ(S[i], T[j]), V(i-1,j) + δ(S[i], -), V(i,j-1) + δ(-, T[j]) }
where δ is the scoring function.
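The recurrence above can be sketched directly as a quadratic dynamic program. This is a minimal illustration, not the slides' own code: the name `delta` and its scores (match +1, mismatch -1, gap -1) are assumptions chosen for the example.

```python
def delta(x, y, match=1, mismatch=-1, gap=-1):
    """Assumed scoring function: '-' stands for a gap."""
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def global_alignment_score(s, t):
    """Fill V row by row; V[i][j] is the best score of s[:i] against t[:j]."""
    n, m = len(s), len(t)
    v = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # first column: only gaps in t
        v[i][0] = v[i - 1][0] + delta(s[i - 1], "-")
    for j in range(1, m + 1):          # first row: only gaps in s
        v[0][j] = v[0][j - 1] + delta("-", t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            v[i][j] = max(
                v[i - 1][j - 1] + delta(s[i - 1], t[j - 1]),
                v[i - 1][j] + delta(s[i - 1], "-"),
                v[i][j - 1] + delta("-", t[j - 1]),
            )
    return v[n][m]


print(global_alignment_score("ag", "acg"))  # prints 1
```

Every cell is visited once, which is where the O(n^2) cost comes from.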
FOUR RUSSIAN ALGORITHM
UNRESTRICTED SCORING FUNCTION
Main idea: Compress the sequences
• S = aacgacga
• T = ctacgaga
[Figure: LZ-78 trie for T (nodes 0 to 5) and LZ-78 trie for S (nodes 0 to 4)]
LZ-78: Divide the sequence into distinct words
S:  1: a   2: ac   3: g   4: acg   (remainder: a)
T:  1: c   2: t   3: a   4: cg   5: ag   (remainder: a)
The number of distinct words: O(n / log n)
[Figure: alignment graph divided into blocks, one per pair of words of S and T; blocks are labeled by word indices, e.g. 3/4, 3/2, 5/4, 5/2]
Main idea
[Figure: trie for T and trie for S]
• Compute the alignment score in each block
• Propagate the scores between the adjacent blocks
Main idea
• Compress the sequence into words
• Pre-compute the score for each block
• Do alignment between blocks
• Note:
  – Replace normal characters by words
  – Operate on blocks
COMPRESS THE SEQUENCE: LZ-78
LZ-78
• Theorem (Lempel and Ziv):
  – Constant alphabet sequence S
  – The maximal number of distinct phrases in S is O(n/log n).
• Tighter upper bound: O(hn/log n)
  – h is the entropy factor
  – a real number, 0 < h ≤ 1
  – Entropy is small ⇒ sequence is repetitive
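A minimal sketch of the LZ-78 parse, assuming the trie is stored as a dictionary from (parent node, character) to child node. The function name `lz78_parse` and this representation are illustrative choices, not taken from the slides.

```python
def lz78_parse(seq):
    """Split seq into LZ-78 words: each new word is an earlier word plus one character."""
    trie = {}            # (parent node id, character) -> child node id; the root is 0
    words = []
    node, start = 0, 0
    for pos, ch in enumerate(seq):
        key = (node, ch)
        if key in trie:
            node = trie[key]                 # keep walking down an existing word
        else:
            trie[key] = len(trie) + 1        # new trie node = new distinct word
            words.append(seq[start:pos + 1])
            node, start = 0, pos + 1
    if start < len(seq):
        words.append(seq[start:])            # leftover that repeats an earlier word
    return words, trie


print(lz78_parse("aacgacga")[0])  # prints ['a', 'ac', 'g', 'acg', 'a']
print(lz78_parse("ctacgaga")[0])  # prints ['c', 't', 'a', 'cg', 'ag', 'a']
```

The trie ends up with one node per distinct word: 4 for S and 5 for T in this example.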
COMPUTE THE ALIGNMENT SCORE IN EACH BLOCK
Compute the alignment score in each block
• Given
  – Input border: I
  – Block
• Compute
  – Output border: O
[Figure: block G with input border I and output border O]
Matrices
• I[i]: the input border value
• DIST[i,j]: weight of the optimal path
  – From entry i of the input border
  – To entry j of the output border
• OUT[i,j]: merges the information from input row I and DIST
  – OUT[i,j] = I[i] + DIST[i,j]
• O[j] = max{OUT[i,j] over all entries i of the input border}
DIST and OUT matrix example
Block – sub-sequences “acg”, “ag”

I (input borders)
I0=1  I1=2  I2=3  I3=2  I4=1  I5=3

DIST matrix
       0    1    2    3    4    5
I0     0   -1   -2   -3    △    △
I1    -1   -1   -2   -1   -3    △
I2    -2    0    0    1   -1   -3
I3     △   -2   -2    0   -2   -2
I4     △    △   -2    0   -1   -1
I5     △    △    △   -2   -1    0

OUT matrix
       0    1    2    3    4    5
I0     1    0   -1   -2   -∞   -∞
I1     1    1    0    1   -1   -∞
I2     1    3    3    4    2    0
I3   -12    0    0    2    0    0
I4   -13  -13   -1    1    0    0
I5   -14  -14  -14    1    2    3

O (max of each column of OUT)
O0  O1  O2  O3  O4  O5
 1   3   3   4   2   3
• For each block, given two sub-sequences S1, S2
  – Compute (from scratch) DIST in O(n*m) time
  – Given I and DIST, compute OUT in O(n*m) time
  – Given OUT[i,j], compute O in O(m*n) time
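The three steps above can be sketched as follows for the example block. Several things here are assumptions inferred from the example's values rather than stated in the slides: the scoring (match +1, mismatch -1, gap -1), the orientation ("ag" along the rows, "acg" along the columns), and the border numbering (input border up the left edge and along the top, output border along the bottom and up the right edge). The code is a plain, unoptimized illustration, and it represents the △ entries as -∞.

```python
NEG_INF = float("-inf")
GAP = -1  # assumed gap score


def score(x, y):
    """Assumed substitution score: +1 for a match, -1 for a mismatch."""
    return 1 if x == y else -1


def borders(p, q):
    """Input and output border cells of a (p+1) x (q+1) block grid."""
    inp = [(r, 0) for r in range(p, -1, -1)] + [(0, c) for c in range(1, q + 1)]
    out = [(p, c) for c in range(q + 1)] + [(r, q) for r in range(p - 1, -1, -1)]
    return inp, out


def dist_from_scratch(a, b):
    """DIST[i][j]: best path weight from input entry i to output entry j."""
    p, q = len(a), len(b)
    inp, out = borders(p, q)
    dist = []
    for r0, c0 in inp:                       # one small DP per input entry
        v = [[NEG_INF] * (q + 1) for _ in range(p + 1)]
        v[r0][c0] = 0
        for r in range(r0, p + 1):
            for c in range(c0, q + 1):
                if r > r0 and c > c0:
                    v[r][c] = max(v[r][c], v[r - 1][c - 1] + score(a[r - 1], b[c - 1]))
                if r > r0:
                    v[r][c] = max(v[r][c], v[r - 1][c] + GAP)
                if c > c0:
                    v[r][c] = max(v[r][c], v[r][c - 1] + GAP)
        dist.append([v[r][c] for r, c in out])
    return dist


def out_matrix(i_border, dist):
    """OUT[i][j] = I[i] + DIST[i][j]."""
    return [[i_val + d for d in row] for i_val, row in zip(i_border, dist)]


def output_border(out):
    """O[j] is the maximum of column j of OUT."""
    return [max(col) for col in zip(*out)]


I = [1, 2, 3, 2, 1, 3]
DIST = dist_from_scratch("ag", "acg")
O = output_border(out_matrix(I, DIST))
print(O)  # prints [1, 3, 3, 4, 2, 3]
```

Under these assumptions the computed DIST agrees with the example's DIST matrix entry by entry, and O agrees with the example's output border.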
Review
• Compress the sequence
• Pre-compute DIST[i,j] for each block
• Compute border values of each block
• Remaining questions
  – How to compute DIST[i,j] efficiently?
  – How to compute O[j] from I[i] and DIST[i,j] efficiently?
COMPUTE O[J] EFFICIENTLY
Compute O[j] efficiently
• For each block of two sub-sequences S1, S2
• Given
  – I[i]
  – DIST[i,j]
• Compute
  – O[j]
Compute O without explicit OUT
[Figure: block G with only its DIST matrix and input border I; no OUT matrix is built]
Block – sub-sequences “acg”, “ag”
I (input borders): I0=1  I1=2  I2=3  I3=2  I4=1  I5=3
O0  O1  O2  O3  O4  O5
 1   3   3   4   2   3
SMAWK
• Given DIST[i,j] and I[i], we can compute O[j] in O(n+m)
  – Without creating OUT[i,j]
• How? Why?
Why?
• Aggarwal, Park and Schmidt observed that DIST and OUT matrices are Monge arrays.
• Definition: a matrix M[0…m,0…n] is totally monotone if either condition 1 or 2 below holds for all a,b=0…m; c,d=0…n:
  1. Convex condition: M[a,c] ≥ M[b,c] ⇒ M[a,d] ≥ M[b,d] for all a<b and c<d.
  2. Concave condition: M[a,c] ≤ M[b,c] ⇒ M[a,d] ≤ M[b,d] for all a<b and c<d.
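As an illustration of the definition, the concave condition can be checked by brute force over all row pairs a<b and column pairs c<d. The function name is hypothetical, and the sample matrix is the OUT matrix of the “acg”/“ag” example with its upper right corner set to -∞.

```python
from itertools import combinations

NEG_INF = float("-inf")


def is_concave_totally_monotone(m):
    """True if M[a][c] <= M[b][c] implies M[a][d] <= M[b][d] for all a<b, c<d."""
    for a, b in combinations(range(len(m)), 2):          # a < b
        for c, d in combinations(range(len(m[0])), 2):   # c < d
            if m[a][c] <= m[b][c] and not m[a][d] <= m[b][d]:
                return False
    return True


OUT = [
    [1, 0, -1, -2, NEG_INF, NEG_INF],
    [1, 1, 0, 1, -1, NEG_INF],
    [1, 3, 3, 4, 2, 0],
    [-12, 0, 0, 2, 0, 0],
    [-13, -13, -1, 1, 0, 0],
    [-14, -14, -14, 1, 2, 3],
]

print(is_concave_totally_monotone(OUT))               # prints True
print(is_concave_totally_monotone([[0, 1], [1, 0]]))  # prints False
```

The check is quartic in the matrix size, so it is only meant for inspecting small examples.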
How?
• Aggarwal et al. gave a recursive algorithm, called SMAWK, which can find all row and column maxima of a totally monotone matrix by querying only O(n) elements of the matrix.
• Why is DIST[i,j] totally monotone?
The concave condition
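If the path b-c is better than a-c, then b-d is better than a-d.
[Figure: input border entries a < b and output border entries c < d, with the paths a-c, b-c, a-d, b-d]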
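The concave condition implies that the row holding a column's maximum never moves to a smaller index as the column index grows (taking the largest such row on ties). The sketch below uses that fact with a simple divide and conquer over the columns. It is not SMAWK: it queries roughly (rows + columns) · log(columns) entries instead of a linear number, but like SMAWK it evaluates OUT entries on demand and never stores the OUT matrix. The lower-left padding values are copied from the example's OUT matrix; all function and variable names are illustrative.

```python
NEG_INF = float("-inf")

I = [1, 2, 3, 2, 1, 3]
# DIST of the example block; None marks the undefined (triangle) entries.
DIST = [
    [0, -1, -2, -3, None, None],
    [-1, -1, -2, -1, -3, None],
    [-2, 0, 0, 1, -1, -3],
    [None, -2, -2, 0, -2, -2],
    [None, None, -2, 0, -1, -1],
    [None, None, None, -2, -1, 0],
]
LOWER_LEFT = {3: -12, 4: -13, 5: -14}  # padding per row, taken from the example


def out(i, j):
    """OUT[i][j] computed on demand from I and DIST."""
    if DIST[i][j] is None:
        return NEG_INF if j > i else LOWER_LEFT[i]
    return I[i] + DIST[i][j]


def column_maxima(value, n_rows, n_cols):
    """Column maxima of a concave totally monotone matrix given as a function."""
    best = [None] * n_cols

    def solve(c_lo, c_hi, r_lo, r_hi):
        if c_lo > c_hi:
            return
        mid = (c_lo + c_hi) // 2
        best_val, arg = value(r_lo, mid), r_lo
        for r in range(r_lo + 1, r_hi + 1):
            val = value(r, mid)
            if val >= best_val:              # ">=" keeps the largest row on ties
                best_val, arg = val, r
        best[mid] = best_val
        solve(c_lo, mid - 1, r_lo, arg)      # maxima to the left lie in rows <= arg
        solve(mid + 1, c_hi, arg, r_hi)      # maxima to the right lie in rows >= arg

    solve(0, n_cols - 1, 0, n_rows - 1)
    return best


print(column_maxima(out, 6, 6))  # prints [1, 3, 3, 4, 2, 3]
```

The padding matters: the narrowing of the row range is only valid because the padded OUT matrix stays totally monotone.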
Other problem
• Rectangle problem of DIST
• Set upper right corner of OUT to -∞
• Set lower left corner of OUT to -(n+i-1)*k
• Preserve the totally monotone property of OUT
[Figure: the DIST matrix of the example block, with △ entries in its upper right and lower left corners]
COMPUTE DIST[I,J] EFFICIENTLY
Compute DIST[i,j] for block (5/4)
[Figure: trie for T, trie for S, and the DIST matrix of the block with input borders I0 to I5; output entry O3 is marked]
• Only column m in DIST[i,j] is new
• DIST block can be updated in O(m+n)
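A sketch of why one column suffices. The new column holds the best path weights from every input border entry to the bottom-right corner of the block, and each of those paths enters the corner from one of three neighbours, which are themselves the corners of the three prefix blocks. The code below walks character prefixes of the two words directly instead of following the LZ-78 tries, and it assumes the same scoring and border numbering as inferred from the “acg”/“ag” example (match +1, mismatch -1, gap -1); the names are illustrative.

```python
from functools import lru_cache

NEG_INF = float("-inf")
GAP = -1  # assumed gap score


def score(x, y):
    """Assumed substitution score: +1 for a match, -1 for a mismatch."""
    return 1 if x == y else -1


def input_border(p, q):
    """Input border cells: up the left edge, then along the top edge."""
    return [(r, 0) for r in range(p, -1, -1)] + [(0, c) for c in range(1, q + 1)]


def corner_column(a, b):
    """New DIST column of block (a, b), listed in input border order."""

    @lru_cache(maxsize=None)
    def col(p, q):
        # Corner column of the prefix block (a[:p], b[:q]), keyed by start cell.
        diag = col(p - 1, q - 1) if p > 0 and q > 0 else {}
        up = col(p - 1, q) if p > 0 else {}
        left = col(p, q - 1) if q > 0 else {}
        column = {}
        for s in input_border(p, q):         # O(p + q) work given the neighbours
            if s == (p, q):                  # the corner itself is a border entry
                column[s] = 0
                continue
            best = NEG_INF
            if s in diag:
                best = max(best, diag[s] + score(a[p - 1], b[q - 1]))
            if s in up:
                best = max(best, up[s] + GAP)
            if s in left:
                best = max(best, left[s] + GAP)
            column[s] = best
        return column

    p, q = len(a), len(b)
    full = col(p, q)
    return [full[s] for s in input_border(p, q)]


print(corner_column("ag", "acg"))  # prints [-3, -1, 1, 0, 0, -2]
```

The printed column matches column 3 of the example's DIST matrix (entries I0 to I5).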
MAINTAINING DIRECT ACCESS TO DIST TABLE
[Figure: alignment graph for S = aacgacga, T = ctacgaga with the trie for T and the trie for S; DIST columns are attached to the blocks, e.g. (-3, -1, 1, 0, 0, -2)]
Complexity
• Assume |S| = |T| = n
• Number of words in S, T = O(hn/log n)
• Number of blocks in alignment graph: O(h^2 n^2/(log n)^2)
• For each block
  – Update new DIST block: O(t), where t = size of the border
  – Create direct access table: O(t)
• Propagating I/O across blocks
  – SMAWK: O(t)
• Sum of the sizes of all borders is O(hn^2/log n)
• Total complexity: O(hn^2/log n)
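One way to see the bound on the sum of the border sizes. The symbols are introduced here for illustration: s_1…s_p and t_1…t_q are the words of S and T, and the block for s_i and t_j is assumed to have a border of |s_i| + |t_j| + 1 entries, as in the “acg”/“ag” example where 3 + 2 + 1 = 6.

```latex
\sum_{i=1}^{p}\sum_{j=1}^{q}\bigl(|s_i| + |t_j| + 1\bigr)
  = q\sum_{i=1}^{p}|s_i| \;+\; p\sum_{j=1}^{q}|t_j| \;+\; pq
  = (p+q)\,n + pq
  = O\!\left(\frac{h n^{2}}{\log n}\right),
\qquad\text{since } p,\,q = O\!\left(\frac{h n}{\log n}\right).
```

The pq term is O(h^2 n^2/(log n)^2), which is within the stated bound because h ≤ 1. Each block costs O(t) for its border size t, so the total work follows the same sum.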
Other extensions
• Trace
• Reducing the space complexity for discrete scoring
• Local alignment
References
• Crochemore, M.; Landau, G. M. & Ziv-Ukelson, M. A sub-quadratic sequence alignment algorithm for unrestricted cost matrices. ACM-SIAM, 2002, 679-688
• Some pictures from 葉恆青