A Sub-quadratic Sequence Alignment Algorithm

Global alignment

[Figure: alignment graph for S = aacgacga, T = ctacgaga]

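Complexity: $O(n^2)$

$$V(i,j) = \max \begin{cases} V(i-1,j-1) + \delta(S[i], T[j]) \\ V(i-1,j) + \delta(S[i], -) \\ V(i,j-1) + \delta(-, T[j]) \end{cases}$$

Here $\delta(x, y)$ is the score of aligning x with y, and "-" denotes a gap.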
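A minimal Python sketch of the quadratic recurrence above. The function name `global_alignment_score` and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration; the slides do not fix a scoring function.

```python
def global_alignment_score(s, t, match=1, mismatch=-1, gap=-1):
    """Fill the (len(s)+1) x (len(t)+1) table V and return V[len(s)][len(t)]."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    # First column and first row: a prefix aligned against gaps only.
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + gap
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            delta = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(
                V[i - 1][j - 1] + delta,  # align S[i] with T[j]
                V[i - 1][j] + gap,        # align S[i] with a gap
                V[i][j - 1] + gap,        # align a gap with T[j]
            )
    return V[n][m]


print(global_alignment_score("aacgacga", "ctacgaga"))
```

Every cell is visited once, which is where the quadratic cost comes from.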

FOUR RUSSIANS ALGORITHM

UNRESTRICTED SCORING FUNCTION

Main idea: Compress the sequences

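• S = aacgacga
• T = ctacgaga

LZ-78: Divide the sequence into distinct words

S words: 1 = a, 2 = ac, 3 = g, 4 = acg, followed by a trailing a
T words: 1 = c, 2 = t, 3 = a, 4 = cg, 5 = ag, followed by a trailing a

[Figure: trie for S (nodes 0 to 4) and trie for T (nodes 0 to 5)]

The number of distinct words: $O(n/\log n)$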

[Figure: alignment graph divided into blocks by the words of S and T, with blocks labeled 3/4, 3/2, 5/4 and 5/2]

Main idea

[Figure: trie for S and trie for T]

• Compute the alignment score in each block
• Propagate the scores between the adjacent blocks

Main idea

• Compress the sequence into words
• Pre-compute the score for each block
• Do alignment between blocks

• Note:
– Replace normal characters by words
– Operate on blocks

COMPRESS THE SEQUENCE: LZ-78

LZ-78

• Theorem (Lempel and Ziv):
– Constant alphabet sequence S
– The maximal number of distinct phrases in S is $O(n/\log n)$.

• Tighter upper bound: $O(hn/\log n)$
– h is the entropy factor, a real number, $0 < h \le 1$
– Entropy is small when the sequence is repetitive
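The LZ-78 parse used on these slides can be sketched as below. This is an illustrative version, not the authors' code: the function name `lz78_parse` and the nested-dict trie are assumptions. Each new word is the shortest prefix of the remaining input that is not yet in the trie, and the last word may repeat an earlier one.

```python
def lz78_parse(seq):
    """Split seq into LZ-78 words, growing a trie of the words seen so far."""
    trie = {}                  # root node: maps a character to a child node
    words = []
    node, start = trie, 0
    for pos, ch in enumerate(seq):
        if ch in node:
            node = node[ch]    # still inside a known word: keep descending
        else:
            node[ch] = {}      # new trie node, so a new distinct word ends here
            words.append(seq[start:pos + 1])
            node, start = trie, pos + 1
    if start < len(seq):
        words.append(seq[start:])  # leftover suffix that matches a known word
    return words


print(lz78_parse("aacgacga"))  # ['a', 'ac', 'g', 'acg', 'a']
print(lz78_parse("ctacgaga"))  # ['c', 't', 'a', 'cg', 'ag', 'a']
```

The two outputs match the word lists given for S and T on the slides.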

COMPUTE THE ALIGNMENT SCORE IN EACH BLOCK

Compute the alignment score in each block

• Given
– Input border: I
– Block

• Compute
– Output border: O

[Figure: block G with input border I and output border O]

Matrices

• I[i]: the input border value
• DIST[i,j]: weight of the optimal path
– From entry i of the input border
– To entry j of its output border

• OUT[i,j]: merges the information from input row I and DIST
– OUT[i,j] = I[i] + DIST[i,j]

• O[j] = max{OUT[i,j] for i=1..n}
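The definitions above translate directly into a quadratic computation. The sketch below uses the I and DIST values of the example block in these slides. Representing the undefined (△) entries as negative infinity and the name `output_border` are assumptions made for illustration.

```python
NEG_INF = float("-inf")
U = NEG_INF  # stands in for the undefined (△) entries of DIST

I = [1, 2, 3, 2, 1, 3]
DIST = [
    [0, -1, -2, -3, U, U],
    [-1, -1, -2, -1, -3, U],
    [-2, 0, 0, 1, -1, -3],
    [U, -2, -2, 0, -2, -2],
    [U, U, -2, 0, -1, -1],
    [U, U, U, -2, -1, 0],
]


def output_border(I, DIST):
    """Build OUT[i][j] = I[i] + DIST[i][j], then take the maximum of each column."""
    OUT = [[I[i] + d for d in row] for i, row in enumerate(DIST)]
    return [max(col) for col in zip(*OUT)]


print(output_border(I, DIST))  # [1, 3, 3, 4, 2, 3]
```

This version touches every entry of OUT, so it costs time proportional to the square of the border size.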

DIST and OUT matrix example

Block: sub-sequences "acg", "ag"

I (input borders): I0=1, I1=2, I2=3, I3=2, I4=1, I5=3

DIST matrix (△ = undefined entry):

       0    1    2    3    4    5
I0     0   -1   -2   -3    △    △
I1    -1   -1   -2   -1   -3    △
I2    -2    0    0    1   -1   -3
I3     △   -2   -2    0   -2   -2
I4     △    △   -2    0   -1   -1
I5     △    △    △   -2   -1    0

OUT matrix (OUT[i,j] = I[i] + DIST[i,j]):

       0    1    2    3    4    5
I0     1    0   -1   -2   -∞   -∞
I1     1    1    0    1   -1   -∞
I2     1    3    3    4    2    0
I3   -12    0    0    2    0    0
I4   -13  -13   -1    1    0    0
I5   -14  -14  -14    1    2    3

O (maximum of each OUT column):

      O0   O1   O2   O3   O4   O5
       1    3    3    4    2    3

• For each block, given two sub-sequences S1, S2
• Compute (from scratch) DIST in O(n*m) time
• Given I and DIST, compute OUT in O(n*m) time
• Given OUT[i,j], compute O in O(m*n) time
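To make the definition of DIST concrete, here is a naive from-scratch sketch for one block. Several things are assumptions inferred from the example table rather than stated on the slides: the scoring (match +1, mismatch -1, gap -1), the border numbering (input border runs up the left column and then along the top row, output border runs along the bottom row and then up the right column), and the name `block_dist`. It runs one small DP per input border entry, so it does not try to match the time bound quoted above.

```python
NEG_INF = float("-inf")
GAP = -1


def score(x, y):
    return 1 if x == y else -1


def block_dist(v, h):
    """DIST[i][j]: best path weight from input border entry i to output entry j.

    v labels the vertical edges (top to bottom), h the horizontal edges
    (left to right). Unreachable pairs get NEG_INF (the △ entries).
    """
    R, C = len(v), len(h)
    inputs = [(R - i, 0) for i in range(R + 1)] + [(0, c) for c in range(1, C + 1)]
    outputs = [(R, c) for c in range(C + 1)] + [(R - k, C) for k in range(1, R + 1)]
    dist = []
    for r0, c0 in inputs:
        best = [[NEG_INF] * (C + 1) for _ in range(R + 1)]
        best[r0][c0] = 0
        for r in range(r0, R + 1):
            for c in range(c0, C + 1):
                if r > r0:  # vertical edge: v[r-1] against a gap
                    best[r][c] = max(best[r][c], best[r - 1][c] + GAP)
                if c > c0:  # horizontal edge: h[c-1] against a gap
                    best[r][c] = max(best[r][c], best[r][c - 1] + GAP)
                if r > r0 and c > c0:  # diagonal edge: v[r-1] against h[c-1]
                    best[r][c] = max(
                        best[r][c], best[r - 1][c - 1] + score(v[r - 1], h[c - 1])
                    )
        dist.append([best[r][c] for r, c in outputs])
    return dist


for row in block_dist("ag", "acg"):
    print(row)
```

With "ag" on the vertical side and "acg" on the horizontal side, the rows printed agree with the DIST table of the example, reading △ as negative infinity.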

Review
• Compress the sequence
• Pre-compute DIST[i,j] for each block
• Compute border values of each block

• Remaining questions
– How to compute DIST[i,j] efficiently?
– How to compute O[j] from I[i] and DIST[i,j] efficiently?

COMPUTE O[J] EFFICIENTLY

Compute O[j] efficiently

• For each block of two sub-sequences S1, S2
• Given
– I[i]
– DIST[i,j]

• Compute
– O[j]

Compute O without explicit OUT

[Figure: the example block with its DIST matrix and input borders I0=1, I1=2, I2=3, I3=2, I4=1, I5=3; the output border O0..O5 = 1 3 3 4 2 3 is obtained without building an OUT matrix]

SMAWK

• Given DIST[i,j], I[i] we can compute O[j] in O(n+m)
– Without creating OUT[i,j]

• How? Why?

Why?

• Aggarwal, Park and Schmidt observed that DIST and OUT matrices are Monge arrays.

• Definition: a matrix M[0…m,0…n] is totally monotone if either condition 1 or 2 below holds for all a,b=0…m; c,d=0…n:
1. Convex condition: $M[a,c] \ge M[b,c] \Rightarrow M[a,d] \ge M[b,d]$ for all a<b and c<d.
2. Concave condition: $M[a,c] \le M[b,c] \Rightarrow M[a,d] \le M[b,d]$ for all a<b and c<d.
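A brute-force check of the concave condition can help when experimenting with small matrices. The sketch below reads the condition as "M[a][c] <= M[b][c] implies M[a][d] <= M[b][d] for all a < b and c < d". The function name `is_concave_totally_monotone` and the two toy matrices are illustrative assumptions, not taken from the slides.

```python
from itertools import combinations


def is_concave_totally_monotone(M):
    """Return True if M[a][c] <= M[b][c] implies M[a][d] <= M[b][d]
    for every pair of rows a < b and every pair of columns c < d."""
    rows, cols = range(len(M)), range(len(M[0]))
    for a, b in combinations(rows, 2):      # yields pairs with a < b
        for c, d in combinations(cols, 2):  # yields pairs with c < d
            if M[a][c] <= M[b][c] and not M[a][d] <= M[b][d]:
                return False
    return True


good = [[3, 2, 1],
        [2, 2, 2],
        [1, 2, 3]]
bad = [[1, 2],
       [2, 1]]

print(is_concave_totally_monotone(good))  # True
print(is_concave_totally_monotone(bad))   # False
```

The check is exhaustive over all row and column pairs, so it is only meant for small test matrices.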

How?

• Aggarwal et al. gave a recursive algorithm, called SMAWK, which can find all row and column maxima of a totally monotone matrix by querying only O(n) elements of the matrix.

• Why is DIST[i,j] totally monotone?

The concave condition

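If the path from b to c is better than the path from a to c, then the path from b to d is better than the path from a to d.

[Figure: input border entries a, b and output border entries c, d]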

Other problem

• Rectangle problem of DIST: its corner entries (△) are undefined

• Set upper right corner of OUT to -∞
• Set lower left corner of OUT to -(n+i-1)*k
• Preserve the totally monotone property of OUT

DIST matrix:

       0    1    2    3    4    5
I0     0   -1   -2   -3    △    △
I1    -1   -1   -2   -1   -3    △
I2    -2    0    0    1   -1   -3
I3     △   -2   -2    0   -2   -2
I4     △    △   -2    0   -1   -1
I5     △    △    △   -2   -1    0
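The sketch below computes O from I and DIST without storing OUT: entries are evaluated on demand with the corner padding described above. It is not SMAWK. It swaps in a simpler divide-and-conquer search that relies on the same concave condition (the bottom-most row holding a column maximum never moves up as the column index grows), and it queries more entries than SMAWK would. The names `out_entry` and `column_maxima` and the constant 9 in the lower-left padding are assumptions; the constant is chosen only so that the padded values are -12, -13, -14 as in the example.

```python
NEG_INF = float("-inf")
U = None  # undefined (△) DIST entry

I = [1, 2, 3, 2, 1, 3]
DIST = [
    [0, -1, -2, -3, U, U],
    [-1, -1, -2, -1, -3, U],
    [-2, 0, 0, 1, -1, -3],
    [U, -2, -2, 0, -2, -2],
    [U, U, -2, 0, -1, -1],
    [U, U, U, -2, -1, 0],
]


def out_entry(i, j):
    """OUT[i][j] computed on demand, with padding for undefined DIST entries."""
    d = DIST[i][j]
    if d is not None:
        return I[i] + d
    first_defined = next(c for c, v in enumerate(DIST[i]) if v is not None)
    if j < first_defined:
        return -(9 + i)  # lower-left corner; 9 is a hypothetical constant
    return NEG_INF       # upper-right corner


def column_maxima(n_rows, n_cols, entry):
    """Column maxima of a matrix satisfying the concave condition."""
    best_row = [0] * n_cols

    def solve(c_lo, c_hi, r_lo, r_hi):
        if c_lo > c_hi:
            return
        mid = (c_lo + c_hi) // 2
        best = r_lo
        for r in range(r_lo, r_hi + 1):
            if entry(r, mid) >= entry(best, mid):  # >= keeps the bottom-most row
                best = r
        best_row[mid] = best
        solve(c_lo, mid - 1, r_lo, best)  # columns to the left: rows up to best
        solve(mid + 1, c_hi, best, r_hi)  # columns to the right: rows from best

    solve(0, n_cols - 1, 0, n_rows - 1)
    return [entry(best_row[c], c) for c in range(n_cols)]


print(column_maxima(6, 6, out_entry))  # [1, 3, 3, 4, 2, 3]
```

If the lower-left corner were padded with negative infinity instead, rows I3 and I4 would tie in column 0 while I3 beats I4 in column 1, which violates the concave condition. That is why the padding values have to decrease from row to row.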

COMPUTE DIST[I,J] EFFICIENTLY

Compute DIST[i,j] for block (5/4)

[Figure: tries for S and T; block (5/4) with input border entries I0 to I5 and output entry O3; DIST matrix of the block with I0 = 1, I1 = 2, I2 = 3, I3 = 2, I4 = 1, I5 = 3]

• Only column m in DIST[i,j] is new

• DIST block can be updated in O(m+n)

MAINTAINING DIRECT ACCESS TO DIST TABLE

[Figure: trie for S, trie for T and the block grid for S = aacgacga, T = ctacgaga, annotated with DIST column values]

Complexity

• Assume |S| = |T| = n
• Number of words in S, T = $O(hn/\log n)$
• Number of blocks in alignment graph: $O(h^2 n^2/(\log n)^2)$
• For each block
– Update new DIST block: O(t), where t is the size of the border
– Create direct access table: O(t)

• Propagating I/O across blocks
– SMAWK: O(t)

• Sum of the sizes of all borders is $O(hn^2/\log n)$
• Total complexity: $O(hn^2/\log n)$
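The bound on the sum of the border sizes follows from a short count. The symbols below are introduced only for this sketch: p and q are the numbers of words in S and T, s_a and t_b are the individual words, and the border of block (a, b) is taken to have |s_a| + |t_b| + 1 entries.

```latex
\sum_{a=1}^{p} \sum_{b=1}^{q} \bigl( |s_a| + |t_b| + 1 \bigr)
  = q \sum_{a=1}^{p} |s_a| + p \sum_{b=1}^{q} |t_b| + pq
  = qn + pn + pq .

\text{With } p, q = O\!\left(\frac{hn}{\log n}\right):\quad
  qn + pn = O\!\left(\frac{hn^2}{\log n}\right), \qquad
  pq = O\!\left(\frac{h^2 n^2}{(\log n)^2}\right) \subseteq O\!\left(\frac{hn^2}{\log n}\right)
  \ \text{since } h \le 1 .
```

Since the work per block is linear in its border size t, the total running time has the same order as this sum.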

Other extensions

• Trace
• Reducing the space complexity for discrete scoring
• Local alignment

References

• Crochemore, M.; Landau, G. M. & Ziv-Ukelson, M. A sub-quadratic sequence alignment algorithm for unrestricted cost matrices. ACM-SIAM, 2002, 679-688

• Some pictures from 葉恆青
