20121020 semi local-string_comparison_tiskin

270
Semi-local string comparison: Algorithmic techniques and applications Alexander Tiskin Department of Computer Science University of Warwick http://go.warwick.ac.uk/alextiskin Alexander Tiskin (Warwick) Semi-local string comparison 1 / 132

description

 

Transcript of 20121020 semi local-string_comparison_tiskin

Page 1: 20121020 semi local-string_comparison_tiskin

Semi-local string comparison:Algorithmic techniques and applications

Alexander Tiskin

Department of Computer ScienceUniversity of Warwick

http://go.warwick.ac.uk/alextiskin

Alexander Tiskin (Warwick) Semi-local string comparison 1 / 132

Page 2: 20121020 semi local-string_comparison_tiskin

1 Introduction

2 Matrix distance multiplication

3 Semi-local string comparison

4 The seaweed method

5 Periodic string comparison

6 The transposition networkmethod

7 Sparse string comparison

8 Compressed string comparison

9 Beyond semi-locality

10 Conclusions and future work

Alexander Tiskin (Warwick) Semi-local string comparison 2 / 132

Page 3: 20121020 semi local-string_comparison_tiskin

1 Introduction

2 Matrix distance multiplication

3 Semi-local string comparison

4 The seaweed method

5 Periodic string comparison

6 The transposition networkmethod

7 Sparse string comparison

8 Compressed string comparison

9 Beyond semi-locality

10 Conclusions and future work

Alexander Tiskin (Warwick) Semi-local string comparison 3 / 132

Page 4: 20121020 semi local-string_comparison_tiskin

Introduction

String matching: finding an exact pattern in a string

String comparison: finding similar patterns in two strings

Applications: computational biology, image recognition, . . .

Standard types of string comparison:

global: whole string vs whole string

local: substrings vs substrings

Main focus of this work:

semi-local: whole string vs substrings; prefixes vs suffixes

Closely related to approximate string matching (no relation toapproximation algorithms!)

Main tool: implicit unit-Monge matrices (a.k.a. seaweed matrices)

Alexander Tiskin (Warwick) Semi-local string comparison 4 / 132

Page 5: 20121020 semi local-string_comparison_tiskin

Introduction

String matching: finding an exact pattern in a string

String comparison: finding similar patterns in two strings

Applications: computational biology, image recognition, . . .

Standard types of string comparison:

global: whole string vs whole string

local: substrings vs substrings

Main focus of this work:

semi-local: whole string vs substrings; prefixes vs suffixes

Closely related to approximate string matching (no relation toapproximation algorithms!)

Main tool: implicit unit-Monge matrices (a.k.a. seaweed matrices)

Alexander Tiskin (Warwick) Semi-local string comparison 4 / 132

Page 6: 20121020 semi local-string_comparison_tiskin

IntroductionTerminology and notation

x− = x − 12 x+ = x + 1

2

Integers: {. . .− 2,−1, 0, 1, 2, . . .}Half-integers:

{. . .− 3

2 ,−12 ,

12 ,

32 ,

52 , . . .

}={. . . (−2)+, (−1)+, 0+, 1+, 2+

}(i , j)� (i ′, j ′) iff i < i ′ and j < j ′ (i , j) ≷ (i ′, j ′) iff i > i ′ and j < j ′

A permutation matrix is a 0/1 matrix with exactly one nonzero per rowand per column0 1 0

1 0 00 0 1

Alexander Tiskin (Warwick) Semi-local string comparison 5 / 132

Page 7: 20121020 semi local-string_comparison_tiskin

IntroductionTerminology and notation

Given matrix D, its distribution matrix is made up of ≷-dominance sums:

Given matrix E , its density matrix is made up of quadrangle differences:E�(ı, ) = E (ı−, +)− E (ı−, −)− E (ı+, +) + E (ı+, −)

where DΣ, E over integers; D, E� over half-integers0 1 01 0 00 0 1

Σ

=

0 1 2 30 1 1 20 0 0 10 0 0 0

0 1 2 30 1 1 20 0 0 10 0 0 0

=

0 1 01 0 00 0 1

(DΣ)� = D for all D

Matrix E is simple, if (E�)Σ = E ; equivalently, if it has all zeros in the leftcolumn and bottom row

Alexander Tiskin (Warwick) Semi-local string comparison 6 / 132

Page 8: 20121020 semi local-string_comparison_tiskin

IntroductionTerminology and notation

Given matrix D, its distribution matrix is made up of ≷-dominance sums:

Given matrix E , its density matrix is made up of quadrangle differences:E�(ı, ) = E (ı−, +)− E (ı−, −)− E (ı+, +) + E (ı+, −)

where DΣ, E over integers; D, E� over half-integers

0 1 01 0 00 0 1

Σ

=

0 1 2 30 1 1 20 0 0 10 0 0 0

0 1 2 30 1 1 20 0 0 10 0 0 0

=

0 1 01 0 00 0 1

(DΣ)� = D for all D

Matrix E is simple, if (E�)Σ = E ; equivalently, if it has all zeros in the leftcolumn and bottom row

Alexander Tiskin (Warwick) Semi-local string comparison 6 / 132

Page 9: 20121020 semi local-string_comparison_tiskin

IntroductionTerminology and notation

Given matrix D, its distribution matrix is made up of ≷-dominance sums:

Given matrix E , its density matrix is made up of quadrangle differences:E�(ı, ) = E (ı−, +)− E (ı−, −)− E (ı+, +) + E (ı+, −)

where DΣ, E over integers; D, E� over half-integers0 1 01 0 00 0 1

Σ

=

0 1 2 30 1 1 20 0 0 10 0 0 0

0 1 2 30 1 1 20 0 0 10 0 0 0

=

0 1 01 0 00 0 1

(DΣ)� = D for all D

Matrix E is simple, if (E�)Σ = E ; equivalently, if it has all zeros in the leftcolumn and bottom row

Alexander Tiskin (Warwick) Semi-local string comparison 6 / 132

Page 10: 20121020 semi local-string_comparison_tiskin

IntroductionTerminology and notation

Given matrix D, its distribution matrix is made up of ≷-dominance sums:

Given matrix E , its density matrix is made up of quadrangle differences:E�(ı, ) = E (ı−, +)− E (ı−, −)− E (ı+, +) + E (ı+, −)

where DΣ, E over integers; D, E� over half-integers0 1 01 0 00 0 1

Σ

=

0 1 2 30 1 1 20 0 0 10 0 0 0

0 1 2 30 1 1 20 0 0 10 0 0 0

=

0 1 01 0 00 0 1

(DΣ)� = D for all D

Matrix E is simple, if (E�)Σ = E ; equivalently, if it has all zeros in the leftcolumn and bottom row

Alexander Tiskin (Warwick) Semi-local string comparison 6 / 132

Page 11: 20121020 semi local-string_comparison_tiskin

IntroductionTerminology and notation

Matrix E is Monge, if E� is nonnegative

Intuition: boundary-to-boundary distances in a (weighted) planar graph

Matrix E is unit-Monge, if E� is a permutation matrix

Intuition: boundary-to-boundary distances in a grid-like graph

Simple unit-Monge matrix: PΣ, where P is a permutation matrix

Seaweed matrix: P used as an implicit representation of PΣ0 1 01 0 00 0 1

Σ

=

0 1 2 30 1 1 20 0 0 10 0 0 0

Alexander Tiskin (Warwick) Semi-local string comparison 7 / 132

Page 12: 20121020 semi local-string_comparison_tiskin

IntroductionTerminology and notation

Matrix E is Monge, if E� is nonnegative

Intuition: boundary-to-boundary distances in a (weighted) planar graph

Matrix E is unit-Monge, if E� is a permutation matrix

Intuition: boundary-to-boundary distances in a grid-like graph

Simple unit-Monge matrix: PΣ, where P is a permutation matrix

Seaweed matrix: P used as an implicit representation of PΣ0 1 01 0 00 0 1

Σ

=

0 1 2 30 1 1 20 0 0 10 0 0 0

Alexander Tiskin (Warwick) Semi-local string comparison 7 / 132

Page 13: 20121020 semi local-string_comparison_tiskin

IntroductionImplicit unit-Monge matrices

Efficient PΣ queries: range tree on nonzeros of P [Bentley: 1980]

binary search tree by i-coordinate

under every node, binary search tree by j-coordinate

••••−→ •

•••−→ •

•••

••••−→ •

•••−→ •

•••

••••−→ •

•••−→ •

•••

Alexander Tiskin (Warwick) Semi-local string comparison 8 / 132

Page 14: 20121020 semi local-string_comparison_tiskin

IntroductionImplicit unit-Monge matrices

Efficient PΣ queries: (contd.)

Every node of the range tree represents a canonical range (rectangularregion), and stores its nonzero count

Overall, ≤ n log n canonical ranges are non-empty

A PΣ query is equivalent to ≷-dominance counting: how many nonzerosare ≷-dominated by query point?

Answer: sum up nonzero counts in ≤ log2 n disjoint canonical ranges

Total size O(n log n), query time O(log2 n)

There are asymptotically more efficient (but less practical) data structures

Total size O(n), query time O( log n

log log n

)[JaJa+: 2004]

[Chan, Patrascu: 2010]

Alexander Tiskin (Warwick) Semi-local string comparison 9 / 132

Page 15: 20121020 semi local-string_comparison_tiskin

IntroductionImplicit unit-Monge matrices

Efficient PΣ queries: (contd.)

Every node of the range tree represents a canonical range (rectangularregion), and stores its nonzero count

Overall, ≤ n log n canonical ranges are non-empty

A PΣ query is equivalent to ≷-dominance counting: how many nonzerosare ≷-dominated by query point?

Answer: sum up nonzero counts in ≤ log2 n disjoint canonical ranges

Total size O(n log n), query time O(log2 n)

There are asymptotically more efficient (but less practical) data structures

Total size O(n), query time O( log n

log log n

)[JaJa+: 2004]

[Chan, Patrascu: 2010]

Alexander Tiskin (Warwick) Semi-local string comparison 9 / 132

Page 16: 20121020 semi local-string_comparison_tiskin

1 Introduction

2 Matrix distance multiplication

3 Semi-local string comparison

4 The seaweed method

5 Periodic string comparison

6 The transposition networkmethod

7 Sparse string comparison

8 Compressed string comparison

9 Beyond semi-locality

10 Conclusions and future work

Alexander Tiskin (Warwick) Semi-local string comparison 10 / 132

Page 17: 20121020 semi local-string_comparison_tiskin

Matrix distance multiplicationSeaweed braids

Distance algebra (a.k.a (min,+) or tropical algebra):

addition ⊕ given by min

multiplication � given by +

Matrix �-multiplication

A� B = C C (i , k) =⊕

j

(A(i , j)� B(j , k)

)= minj

(A(i , j) + B(j , k)

)Matrix classes closed under �-multiplication (for given n):

general numerical (integer, real) matrices

Monge matrices

simple unit-Monge matrices (!)

Alexander Tiskin (Warwick) Semi-local string comparison 11 / 132

Page 18: 20121020 semi local-string_comparison_tiskin

Matrix distance multiplicationSeaweed braids

Recall that simple unit-Monge matrices are represented implicitly bypermutation (seaweed) matrices

Define PA � PB = PC as PΣA � PΣ

B = PΣC

The seaweed monoid Tn:

simple unit-Monge matrices under �equivalently, permutation (seaweed) matrices under �

Also known as the 0-Hecke monoid of the symmetric group H0(Sn)

Alexander Tiskin (Warwick) Semi-local string comparison 12 / 132

Page 19: 20121020 semi local-string_comparison_tiskin

Matrix distance multiplicationSeaweed braids

PA � PB = PC can be seen as combing of seaweed braids

•••

••

•PA

•••

••

•PB

••

••

••PC

PA

PB

PC

Alexander Tiskin (Warwick) Semi-local string comparison 13 / 132

Page 20: 20121020 semi local-string_comparison_tiskin

Matrix distance multiplicationSeaweed braids

PA � PB = PC can be seen as combing of seaweed braids

•••

••

•PA

•••

••

•PB

••

••

••PC

PA

PB

PC

Alexander Tiskin (Warwick) Semi-local string comparison 13 / 132

Page 21: 20121020 semi local-string_comparison_tiskin

Matrix distance multiplicationSeaweed braids

PA � PB = PC can be seen as combing of seaweed braids

•••

••

•PA

•••

••

•PB

••

••

••PC

PA

PB

PC

Alexander Tiskin (Warwick) Semi-local string comparison 13 / 132

Page 22: 20121020 semi local-string_comparison_tiskin

Matrix distance multiplicationSeaweed braids

PA � PB = PC can be seen as combing of seaweed braids

•••

••

•PA

•••

••

•PB

••

••

••PC

PA

PB

PC

Alexander Tiskin (Warwick) Semi-local string comparison 13 / 132

Page 23: 20121020 semi local-string_comparison_tiskin

Matrix distance multiplicationSeaweed braids

The seaweed monoid Tn:

n! elements (permutations of size n)

n − 1 generators g1, g2, . . . , gn−1 (elementary crossings)

Idempotence:g2i = gi for all i =

Far commutativity:gigj = gjgi j − i > 1 · · · = · · ·

Braid relations:gigjgi = gjgigj j − i = 1 =

Alexander Tiskin (Warwick) Semi-local string comparison 14 / 132

Page 24: 20121020 semi local-string_comparison_tiskin

Matrix distance multiplicationSeaweed braids

Identity: 1� x = x

1 =

• · · ·· • · ·· · • ·· · · •

=

Zero: 0� x = 0

0 =

· · · •· · • ·· • · ·• · · ·

=

Alexander Tiskin (Warwick) Semi-local string comparison 15 / 132

Page 25: 20121020 semi local-string_comparison_tiskin

Matrix distance multiplicationSeaweed braids

Related structures:

positive braids: far comm; braid relations

braids: gig−1i = 1; far comm; braid relations

Coxeter’s presentation of Sn: g2i = 1; far comm; braid relations

locally free idempotent monoid: idem; far comm [Vershik+: 2000]

Generalisations:

general 0-Hecke monoids [Fomin, Greene: 1998; Buch+: 2008]

Coxeter monoids [Tsaranov: 1990; Richardson, Springer: 1990]

J -trivial monoids [Denton+: 2011]

Alexander Tiskin (Warwick) Semi-local string comparison 16 / 132

Page 26: 20121020 semi local-string_comparison_tiskin

Matrix distance multiplicationSeaweed braids

Computation in the seaweed monoid: a confluent rewriting system can beobtained by software (Semigroupe, GAP)

T3: 1, a = g1, b = g2; ab, ba, aba = 0

aa→ a bb → b bab → 0 aba→ 0

T4: 1, a = g1, b = g2, c = g3; ab, ac, ba, bc, cb, aba, abc, acb, bac,bcb, cba, abac , abcb, acba, bacb, bcba, abacb, abcba, bacba, abacba = 0

aa→ abb → b

ca→ accc → c

bab → abacbc → bcb

cbac → bcbaabacba→ 0

Easy to use, but not an efficient algorithm

Alexander Tiskin (Warwick) Semi-local string comparison 17 / 132

Page 27: 20121020 semi local-string_comparison_tiskin

Matrix distance multiplicationSeaweed braids

Computation in the seaweed monoid: a confluent rewriting system can beobtained by software (Semigroupe, GAP)

T3: 1, a = g1, b = g2; ab, ba, aba = 0

aa→ a bb → b bab → 0 aba→ 0

T4: 1, a = g1, b = g2, c = g3; ab, ac, ba, bc, cb, aba, abc, acb, bac,bcb, cba, abac , abcb, acba, bacb, bcba, abacb, abcba, bacba, abacba = 0

aa→ abb → b

ca→ accc → c

bab → abacbc → bcb

cbac → bcbaabacba→ 0

Easy to use, but not an efficient algorithm

Alexander Tiskin (Warwick) Semi-local string comparison 17 / 132

Page 28: 20121020 semi local-string_comparison_tiskin

Matrix distance multiplicationSeaweed braids

Computation in the seaweed monoid: a confluent rewriting system can beobtained by software (Semigroupe, GAP)

T3: 1, a = g1, b = g2; ab, ba, aba = 0

aa→ a bb → b bab → 0 aba→ 0

T4: 1, a = g1, b = g2, c = g3; ab, ac, ba, bc, cb, aba, abc, acb, bac,bcb, cba, abac , abcb, acba, bacb, bcba, abacb, abcba, bacba, abacba = 0

aa→ abb → b

ca→ accc → c

bab → abacbc → bcb

cbac → bcbaabacba→ 0

Easy to use, but not an efficient algorithm

Alexander Tiskin (Warwick) Semi-local string comparison 17 / 132

Page 29: 20121020 semi local-string_comparison_tiskin

Matrix distance multiplicationSeaweed braids

Computation in the seaweed monoid: a confluent rewriting system can beobtained by software (Semigroupe, GAP)

T3: 1, a = g1, b = g2; ab, ba, aba = 0

aa→ a bb → b bab → 0 aba→ 0

T4: 1, a = g1, b = g2, c = g3; ab, ac, ba, bc, cb, aba, abc, acb, bac,bcb, cba, abac , abcb, acba, bacb, bcba, abacb, abcba, bacba, abacba = 0

aa→ abb → b

ca→ accc → c

bab → abacbc → bcb

cbac → bcbaabacba→ 0

Easy to use, but not an efficient algorithm

Alexander Tiskin (Warwick) Semi-local string comparison 17 / 132

Page 30: 20121020 semi local-string_comparison_tiskin

Matrix distance multiplicationSeaweed matrix multiplication

The implicit unit-Monge matrix �-multiplication problem

Given permutation matrices PA, PB , compute PC , such thatPΣA � PΣ

B = PΣC (equivalently, PA � PB = PC )

Matrix �-multiplication: running time

type time

general O(n3) standard

O(n3(log log n)3

log2 n

)[Chan: 2007]

Monge O(n2) via [Aggarwal+: 1987]

implicit unit-Monge O(n1.5) [T: 2006]O(n log n) [T: 2010]

Alexander Tiskin (Warwick) Semi-local string comparison 18 / 132

Page 31: 20121020 semi local-string_comparison_tiskin

Matrix distance multiplicationSeaweed matrix multiplication

The implicit unit-Monge matrix �-multiplication problem

Given permutation matrices PA, PB , compute PC , such thatPΣA � PΣ

B = PΣC (equivalently, PA � PB = PC )

Matrix �-multiplication: running time

type time

general O(n3) standard

O(n3(log log n)3

log2 n

)[Chan: 2007]

Monge O(n2) via [Aggarwal+: 1987]

implicit unit-Monge O(n1.5) [T: 2006]O(n log n) [T: 2010]

Alexander Tiskin (Warwick) Semi-local string comparison 18 / 132

Page 32: 20121020 semi local-string_comparison_tiskin

Matrix distance multiplicationSeaweed matrix multiplication

PA

•••• •

•••• •

•• ••••••••

PB• • ••• ••• • •• •• • •••• • •

PC

?

Alexander Tiskin (Warwick) Semi-local string comparison 19 / 132

Page 33: 20121020 semi local-string_comparison_tiskin

Matrix distance multiplicationSeaweed matrix multiplication

PA,lo , PA,hi

•••• •

•••• •

•• ••••••••

PB,lo , PB,hi

• • ••• ••• • •• •• • •••• • •

•••• •

•••• •

• • ••••

• •••PC ,lo + PC ,hi

•••• •

•••• •

• • ••••

• •••

Alexander Tiskin (Warwick) Semi-local string comparison 20 / 132

Page 34: 20121020 semi local-string_comparison_tiskin

Matrix distance multiplicationSeaweed matrix multiplication

PA,lo , PA,hi

•••• •

•••• •

•• ••••••••

PB,lo , PB,hi

• • ••• ••• • •• •• • •••• • •

•••• •

•••• •

• • ••••

• •••PC ,lo + PC ,hi

•••• •

•••• •

• • ••••

• •••

Alexander Tiskin (Warwick) Semi-local string comparison 20 / 132

Page 35: 20121020 semi local-string_comparison_tiskin

Matrix distance multiplicationSeaweed matrix multiplication

PA,lo , PA,hi

•••• •

•••• •

•• ••••••••

PB,lo , PB,hi

• • ••• ••• • •• •• • •••• • •

•••• •

•••• •

• • ••••

• •••

PC ,lo + PC ,hi

•••• •

•••• •

• • ••••

• •••

Alexander Tiskin (Warwick) Semi-local string comparison 20 / 132

Page 36: 20121020 semi local-string_comparison_tiskin

Matrix distance multiplicationSeaweed matrix multiplication

PA,lo , PA,hi

•••• •

•••• •

•• ••••••••

PB,lo , PB,hi

• • ••• ••• • •• •• • •••• • •

•••• •

•••• •

• • ••••

• •••

PC ,lo + PC ,hi

•••• •

•••• •

• • ••••

• •••

Alexander Tiskin (Warwick) Semi-local string comparison 20 / 132

Page 37: 20121020 semi local-string_comparison_tiskin

Matrix distance multiplicationSeaweed matrix multiplication

PA,lo , PA,hi

•••• •

•••• •

•• ••••••••

PB,lo , PB,hi

• • ••• ••• • •• •• • •••• • •

PC ,lo + PC ,hi

•••• •

•••• •

• • ••••

• •••

PC

••••

••••

•••

•••

• •••

Alexander Tiskin (Warwick) Semi-local string comparison 21 / 132

Page 38: 20121020 semi local-string_comparison_tiskin

Matrix distance multiplicationSeaweed matrix multiplication

PA,lo , PA,hi

•••• •

•••• •

•• ••••••••

PB,lo , PB,hi

• • ••• ••• • •• •• • •••• • •

PC ,lo + PC ,hi

•••• •

•••• •

• • ••••

• •••

PC

••••

••••

•••

•••

• •••

Alexander Tiskin (Warwick) Semi-local string comparison 21 / 132

Page 39: 20121020 semi local-string_comparison_tiskin

Matrix distance multiplicationSeaweed matrix multiplication

PA

•••• •

•••• •

•• ••••••••

PB• • ••• ••• • •• •• • •••• • •

PC

••••

••••

•••

•••

• •••

Alexander Tiskin (Warwick) Semi-local string comparison 22 / 132

Page 40: 20121020 semi local-string_comparison_tiskin

Matrix distance multiplicationSeaweed matrix multiplication

Implicit unit-Monge matrix �-multiplication: the algorithm

PΣC (i , k) = minj

(PΣA (i , j) + PΣ

B (j , k))

Divide-and-conquer on the range of j

Divide PA horizontally, PB vertically: two subproblems of effective size n/2

PΣA,lo � PΣ

B,lo = PΣC ,lo PΣ

A,hi � PΣB,hi = PΣ

C ,hi

Conquer:

�-low nonzeros of PC ,lo and �-high nonzeros of PC ,hi appear in PC

The remaining nonzeros of PC ,lo and PC ,hi are “wrong”, and need to becorrected to obtain the remaining nonzeros of PC

Correction can be done in time O(n) using the unit-Monge property

Overall time O(n log n)

Alexander Tiskin (Warwick) Semi-local string comparison 23 / 132

Page 41: 20121020 semi local-string_comparison_tiskin

Matrix distance multiplicationBruhat order

Comparing permutations by the “degree of sortedness”

Bruhat order

Permutation A is lower (“more sorted”) than permutation B in the Bruhatorder (A � B), if B can be transformed to A by successive pairwise sortingbetween arbitrary pairs of elements.

Permutation matrices: PA � PB , if PB can be transformed to PA bysuccessive submatrix substitution: ( 0 1

1 0 ) ( 1 00 1 )

Alexander Tiskin (Warwick) Semi-local string comparison 24 / 132

Page 42: 20121020 semi local-string_comparison_tiskin

Matrix distance multiplicationBruhat order

Bruhat comparability: running time

O(n2) folkloreO(n log n) [T: NEW]

PA � PB iff PΣA ≤ PΣ

B elementwise, time O(n2) folklore

PA � PB iff PRA � PB = IdR , time O(n log n) [T: NEW]

where PR denotes clockwise rotation of matrix P

Alexander Tiskin (Warwick) Semi-local string comparison 25 / 132

Page 43: 20121020 semi local-string_comparison_tiskin

1 Introduction

2 Matrix distance multiplication

3 Semi-local string comparison

4 The seaweed method

5 Periodic string comparison

6 The transposition networkmethod

7 Sparse string comparison

8 Compressed string comparison

9 Beyond semi-locality

10 Conclusions and future work

Alexander Tiskin (Warwick) Semi-local string comparison 26 / 132

Page 44: 20121020 semi local-string_comparison_tiskin

Semi-local string comparisonSemi-local LCS and edit distance

Consider strings (= sequences) over an alphabet of size σ

Distinguish contiguous substrings and not necessarily contiguoussubsequences

Special cases of substring: prefix, suffix

Notation: strings a, b of length m, n respectively

Assume where necessary: m ≤ n; m, n reasonably close

The longest common subsequence (LCS) score:

length of longest string that is a subsequence of both a and b

equivalently, alignment score, where score(match) = 1 andscore(mismatch) = 0

In biological terms, “loss-free alignment” (unlike “lossy” BLAST)

Alexander Tiskin (Warwick) Semi-local string comparison 27 / 132

Page 45: 20121020 semi local-string_comparison_tiskin

Semi-local string comparisonSemi-local LCS and edit distance

Consider strings (= sequences) over an alphabet of size σ

Distinguish contiguous substrings and not necessarily contiguoussubsequences

Special cases of substring: prefix, suffix

Notation: strings a, b of length m, n respectively

Assume where necessary: m ≤ n; m, n reasonably close

The longest common subsequence (LCS) score:

length of longest string that is a subsequence of both a and b

equivalently, alignment score, where score(match) = 1 andscore(mismatch) = 0

In biological terms, “loss-free alignment” (unlike “lossy” BLAST)

Alexander Tiskin (Warwick) Semi-local string comparison 27 / 132

Page 46: 20121020 semi local-string_comparison_tiskin

Semi-local string comparisonSemi-local LCS and edit distance

The LCS problem

Give the LCS score for a vs b

LCS: running time

O(mn) [Wagner, Fischer: 1974]O(

mnlog2 n

)σ = O(1) [Masek, Paterson: 1980]

[Crochemore+: 2003]

O(mn(log log n)2

log2 n

)[Paterson, Dancık: 1994]

[Bille, Farach-Colton: 2008]

Running time varies depending on the RAM model version

We assume word-RAM with word size log n (where it matters)

Alexander Tiskin (Warwick) Semi-local string comparison 28 / 132

Page 47: 20121020 semi local-string_comparison_tiskin

Semi-local string comparisonSemi-local LCS and edit distance

The LCS problem

Give the LCS score for a vs b

LCS: running time

O(mn) [Wagner, Fischer: 1974]O(

mnlog2 n

)σ = O(1) [Masek, Paterson: 1980]

[Crochemore+: 2003]

O(mn(log log n)2

log2 n

)[Paterson, Dancık: 1994]

[Bille, Farach-Colton: 2008]

Running time varies depending on the RAM model version

We assume word-RAM with word size log n (where it matters)

Alexander Tiskin (Warwick) Semi-local string comparison 28 / 132

Page 48: 20121020 semi local-string_comparison_tiskin

Semi-local string comparisonSemi-local LCS and edit distance

LCS on the alignment graph (directed, acyclic)

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A blue = 0red = 1

score(“BAABCBCA”, “BAABCABCABACA”) = len(“BAABCBCA”) = 8

LCS = highest-score path from top-left to bottom-right

Alexander Tiskin (Warwick) Semi-local string comparison 29 / 132

Page 49: 20121020 semi local-string_comparison_tiskin

Semi-local string comparisonSemi-local LCS and edit distance

LCS: dynamic programming [WF: 1974]

Sweep cells in any �-compatible order

Cell update: time O(1)

Overall time O(mn)

Alexander Tiskin (Warwick) Semi-local string comparison 30 / 132

Page 50: 20121020 semi local-string_comparison_tiskin

Semi-local string comparisonSemi-local LCS and edit distance

‘Begin at the beginning,’ the King saidgravely, ‘and go on till you come to the end:then stop.’

L. Carroll, Alice in Wonderland(The standard approach in dynamicprogramming)

Alexander Tiskin (Warwick) Semi-local string comparison 31 / 132

Page 51: 20121020 semi local-string_comparison_tiskin

Semi-local string comparisonSemi-local LCS and edit distance

Sometimes dynamic programming can be run from both ends for extraflexibility

Is there a better, fully flexible alternative (e.g. for comparing compressedstrings, comparing strings dynamically or in parallel, etc.)?

Alexander Tiskin (Warwick) Semi-local string comparison 32 / 132

Page 52: 20121020 semi local-string_comparison_tiskin

Semi-local string comparisonSemi-local LCS and edit distance

Sometimes dynamic programming can be run from both ends for extraflexibility

Is there a better, fully flexible alternative (e.g. for comparing compressedstrings, comparing strings dynamically or in parallel, etc.)?

Alexander Tiskin (Warwick) Semi-local string comparison 32 / 132

Page 53: 20121020 semi local-string_comparison_tiskin

Semi-local string comparisonSemi-local LCS and edit distance

LCS: micro-block dynamic programming [MP: 1980; BF: 2008]

Sweep cells in micro-blocks, in any �-compatible order

Micro-block size:

t = O(log n) when σ = O(1)

t = O( log n

log log n

)otherwise

Micro-block interface:

O(t) characters, each O(log σ) bits, can be reduced to O(log t) bits

O(t) small integers, each O(1) bits

Micro-block update: time O(1), by precomputing all possible interfaces

Overall time O(

mnlog2 n

)when σ = O(1), O

(mn(log log n)2

log2 n

)otherwise

Alexander Tiskin (Warwick) Semi-local string comparison 33 / 132

Page 54: 20121020 semi local-string_comparison_tiskin

Semi-local string comparisonSemi-local LCS and edit distance

The semi-local LCS problem

Give the (implicit) matrix of O((m + n)2

)LCS scores:

string-substring LCS: string a vs every substring of b

prefix-suffix LCS: every prefix of a vs every suffix of b

suffix-prefix LCS: every suffix of a vs every prefix of b

substring-string LCS: every substring of a vs string b

Cf.: dynamic programming gives prefix-prefix LCS

Alexander Tiskin (Warwick) Semi-local string comparison 34 / 132

Page 55: 20121020 semi local-string_comparison_tiskin

Semi-local string comparisonSemi-local LCS and edit distance

The semi-local LCS problem

Give the (implicit) matrix of O((m + n)2

)LCS scores:

string-substring LCS: string a vs every substring of b

prefix-suffix LCS: every prefix of a vs every suffix of b

suffix-prefix LCS: every suffix of a vs every prefix of b

substring-string LCS: every substring of a vs string b

Cf.: dynamic programming gives prefix-prefix LCS

Alexander Tiskin (Warwick) Semi-local string comparison 34 / 132

Page 56: 20121020 semi local-string_comparison_tiskin

Semi-local string comparisonSemi-local LCS and edit distance

Semi-local LCS on the alignment graph

B

A

A

B

C

B

C

A

B A A B C A B C A B A C AC A B C A B A blue = 0red = 1

score(“BAABCBCA”, “CABCABA”) = len(“ABCBA”) = 5

String-substring LCS: all highest-score top-to-bottom paths

Semi-local LCS: all highest-score boundary-to-boundary paths

Alexander Tiskin (Warwick) Semi-local string comparison 35 / 132

Page 57: 20121020 semi local-string_comparison_tiskin

Semi-local string comparisonScore matrices and seaweed matrices

The score matrix H

0 1 2 3 4 5 6 6 7 8 8 8 8 8

-1 0 1 2 3 4 5 5 6 7 7 7 7 7

-2 -1 0 1 2 3 4 4 5 6 6 6 6 7

-3 -2 -1 0 1 2 3 3 4 5 5 6 6 7

-4 -3 -2 -1 0 1 2 2 3 4 4 5 5 6

-5 -4 -3 -2 -1 0 1 2 3 4 4 5 5 6

-6 -5 -4 -3 -2 -1 0 1 2 3 3 4 4 5

-7 -6 -5 -4 -3 -2 -1 0 1 2 2 3 3 4

-8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 3 4

-9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4

-10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3

-11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2

-12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1

-13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0

5

a = “BAABCBCA”

b = “BAABCABCABACA”

H(i , j) = score(a, b〈i : j〉)H(4, 11) = 5

H(i , j) = j − i if i > j

Alexander Tiskin (Warwick) Semi-local string comparison 36 / 132

Page 58: 20121020 semi local-string_comparison_tiskin

Semi-local string comparisonScore matrices and seaweed matrices

Semi-local LCS: output representation and running time

size query time

O(n2) O(1) trivial

O(m1/2n) O(log n) string-substring [Alves+: 2003]O(n) O(n) string-substring [Alves+: 2005]O(n log n) O(log2 n) [T: 2006]

. . . or any 2D orthogonal range counting data structure

running time

O(mn2) naiveO(mn) string-substring [Schmidt: 1998; Alves+: 2005]O(mn) [T: 2006]O(

mnlog0.5 n

)[T: 2006]

O(mn(log log n)2

log2 n

)[T: 2007]

Alexander Tiskin (Warwick) Semi-local string comparison 37 / 132

Page 59: 20121020 semi local-string_comparison_tiskin

Semi-local string comparisonScore matrices and seaweed matrices

The score matrix H and the seaweed matrix P

H(i , j): the number of matched characters for a vs substring b〈i : j〉j − i − H(i , j): the number of unmatched characters

Properties of matrix j − i − H(i , j):

simple unit-Monge

therefore, = PΣ, where P = −H� is a permutation matrix

P is the seaweed matrix, giving an implicit representation of H

Range tree for P: memory O(n log n), query time O(log2 n)

Alexander Tiskin (Warwick) Semi-local string comparison 38 / 132

Page 60: 20121020 semi local-string_comparison_tiskin

Semi-local string comparisonScore matrices and seaweed matrices

The score matrix H and the seaweed matrix P

0 1 2 3 4 5 6 6 7 8 8 8 8 8

-1 0 1 2 3 4 5 5 6 7 7 7 7 7

-2 -1 0 1 2 3 4 4 5 6 6 6 6 7

-3 -2 -1 0 1 2 3 3 4 5 5 6 6 7

-4 -3 -2 -1 0 1 2 2 3 4 4 5 5 6

-5 -4 -3 -2 -1 0 1 2 3 4 4 5 5 6

-6 -5 -4 -3 -2 -1 0 1 2 3 3 4 4 5

-7 -6 -5 -4 -3 -2 -1 0 1 2 2 3 3 4

-8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 3 4

-9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4

-10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3

-11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2

-12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1

-13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0

5

a = “BAABCBCA”

b = “BAABCABCABACA”

H(i , j) = score(a, b〈i : j〉)H(4, 11) = 5

H(i , j) = j − i if i > j

blue: difference in H is 0

red : difference in H is 1

green: P(i , j) = 1

H(i , j) = j − i − PΣ(i , j)

Alexander Tiskin (Warwick) Semi-local string comparison 39 / 132

Page 61: 20121020 semi local-string_comparison_tiskin

Semi-local string comparisonScore matrices and seaweed matrices

The score matrix H and the seaweed matrix P

0 1 2 3 4 5 6 6 7 8 8 8 8 8

-1 0 1 2 3 4 5 5 6 7 7 7 7 7

-2 -1 0 1 2 3 4 4 5 6 6 6 6 7

-3 -2 -1 0 1 2 3 3 4 5 5 6 6 7

-4 -3 -2 -1 0 1 2 2 3 4 4 5 5 6

-5 -4 -3 -2 -1 0 1 2 3 4 4 5 5 6

-6 -5 -4 -3 -2 -1 0 1 2 3 3 4 4 5

-7 -6 -5 -4 -3 -2 -1 0 1 2 2 3 3 4

-8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 3 4

-9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4

-10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3

-11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2

-12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1

-13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0

5

a = “BAABCBCA”

b = “BAABCABCABACA”

H(i , j) = score(a, b〈i : j〉)H(4, 11) = 5

H(i , j) = j − i if i > j

blue: difference in H is 0

red : difference in H is 1

green: P(i , j) = 1

H(i , j) = j − i − PΣ(i , j)

Alexander Tiskin (Warwick) Semi-local string comparison 39 / 132

Page 62: 20121020 semi local-string_comparison_tiskin

Semi-local string comparisonScore matrices and seaweed matrices

The score matrix H and the seaweed matrix P

0 1 2 3 4 5 6 6 7 8 8 8 8 8

-1 0 1 2 3 4 5 5 6 7 7 7 7 7

-2 -1 0 1 2 3 4 4 5 6 6 6 6 7

-3 -2 -1 0 1 2 3 3 4 5 5 6 6 7

-4 -3 -2 -1 0 1 2 2 3 4 4 5 5 6

-5 -4 -3 -2 -1 0 1 2 3 4 4 5 5 6

-6 -5 -4 -3 -2 -1 0 1 2 3 3 4 4 5

-7 -6 -5 -4 -3 -2 -1 0 1 2 2 3 3 4

-8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 3 4

-9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4

-10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3

-11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2

-12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1

-13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0

5

a = “BAABCBCA”

b = “BAABCABCABACA”

H(i , j) = score(a, b〈i : j〉)H(4, 11) = 5

H(i , j) = j − i if i > j

blue: difference in H is 0

red : difference in H is 1

green: P(i , j) = 1

H(i , j) = j − i − PΣ(i , j)

Alexander Tiskin (Warwick) Semi-local string comparison 39 / 132

Page 63: 20121020 semi local-string_comparison_tiskin

Semi-local string comparisonScore matrices and seaweed matrices

The score matrix H and the seaweed matrix P

0 1 2 3 4 5 6 6 7 8 8 8 8 8

-1 0 1 2 3 4 5 5 6 7 7 7 7 7

-2 -1 0 1 2 3 4 4 5 6 6 6 6 7

-3 -2 -1 0 1 2 3 3 4 5 5 6 6 7

-4 -3 -2 -1 0 1 2 2 3 4 4 5 5 6

-5 -4 -3 -2 -1 0 1 2 3 4 4 5 5 6

-6 -5 -4 -3 -2 -1 0 1 2 3 3 4 4 5

-7 -6 -5 -4 -3 -2 -1 0 1 2 2 3 3 4

-8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 3 4

-9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4

-10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3

-11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2

-12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1

-13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0

5

a = “BAABCBCA”

b = “BAABCABCABACA”

H(4, 11) =11− 4− PΣ(i , j) =11− 4− 2 = 5

Alexander Tiskin (Warwick) Semi-local string comparison 40 / 132

Page 64: 20121020 semi local-string_comparison_tiskin

Semi-local string comparisonScore matrices and seaweed matrices

The seaweed braid in the alignment graph

B

A

A

B

C

B

C

A

B A A B C A B C A B A C AC A B C A B A a = “BAABCBCA”

b = “BAABCABCABACA”

H(4, 11) =11− 4− PΣ(i , j) =11− 4− 2 = 5

P(i , j) = 1 corresponds to seaweed top i bottom j

Also define top right, left right, left bottom seaweeds

Gives bijection between top-left and bottom-right graph boundaries

Alexander Tiskin (Warwick) Semi-local string comparison 41 / 132

Page 65: 20121020 semi local-string_comparison_tiskin

Semi-local string comparisonScore matrices and seaweed matrices

The seaweed braid in the alignment graph

B

A

A

B

C

B

C

A

B A A B C A B C A B A C AC A B C A B A a = “BAABCBCA”

b = “BAABCABCABACA”

H(4, 11) =11− 4− PΣ(i , j) =11− 4− 2 = 5

P(i , j) = 1 corresponds to seaweed top i bottom j

Also define top right, left right, left bottom seaweeds

Gives bijection between top-left and bottom-right graph boundaries

Alexander Tiskin (Warwick) Semi-local string comparison 41 / 132

Page 66: 20121020 semi local-string_comparison_tiskin

Semi-local string comparisonScore matrices and seaweed matrices

Seaweed braid: a highly symmetric object (element of the 0-Hecke monoidof the symmetric group)

Can be built recursively by assembling subbraids from separate parts

Highly flexible: local alignment, compression, parallel computation. . .

Alexander Tiskin (Warwick) Semi-local string comparison 42 / 132

Page 67: 20121020 semi local-string_comparison_tiskin

Semi-local string comparisonWeighted alignment

The LCS problem is a special case of the weighted alignment scoreproblem with weighted matches (wM), mismatches (wX ) and gaps (wG)

LCS score: wM = 1, wX = wG = 0

Levenshtein score: wM = 2, wX = 1, wG = 0

Alignment score is rational, if wM , wX , wG are rational numbers

Equivalent to LCS score on blown-up strings

Edit distance: minimum cost to transform a into b by weighted characteredits (insertion, deletion, substitution)

Corresponds to weighted alignment score with wM = 0, insertion/deletionweight −wG , substitution weight −wX

Alexander Tiskin (Warwick) Semi-local string comparison 43 / 132

Page 68: 20121020 semi local-string_comparison_tiskin

Semi-local string comparisonWeighted alignment

The LCS problem is a special case of the weighted alignment scoreproblem with weighted matches (wM), mismatches (wX ) and gaps (wG)

LCS score: wM = 1, wX = wG = 0

Levenshtein score: wM = 2, wX = 1, wG = 0

Alignment score is rational, if wM , wX , wG are rational numbers

Equivalent to LCS score on blown-up strings

Edit distance: minimum cost to transform a into b by weighted characteredits (insertion, deletion, substitution)

Corresponds to weighted alignment score with wM = 0, insertion/deletionweight −wG , substitution weight −wX

Alexander Tiskin (Warwick) Semi-local string comparison 43 / 132

Page 69: 20121020 semi local-string_comparison_tiskin

Semi-local string comparisonWeighted alignment

The LCS problem is a special case of the weighted alignment scoreproblem with weighted matches (wM), mismatches (wX ) and gaps (wG)

LCS score: wM = 1, wX = wG = 0

Levenshtein score: wM = 2, wX = 1, wG = 0

Alignment score is rational, if wM , wX , wG are rational numbers

Equivalent to LCS score on blown-up strings

Edit distance: minimum cost to transform a into b by weighted characteredits (insertion, deletion, substitution)

Corresponds to weighted alignment score with wM = 0, insertion/deletionweight −wG , substitution weight −wX

Alexander Tiskin (Warwick) Semi-local string comparison 43 / 132

Page 70: 20121020 semi local-string_comparison_tiskin

Semi-local string comparisonWeighted alignment

Weighted alignment graph

B

A

A

B

C

B

C

A

B A A B C A B C A B A C AC A B C A B A blue = 0red (solid) = 2

red (dotted) = 1

Levenshtein score(“BAABCBCA”, “CABCABA”) = 11

Alexander Tiskin (Warwick) Semi-local string comparison 44 / 132

Page 71: 20121020 semi local-string_comparison_tiskin

Semi-local string comparisonWeighted alignment

Alignment graph for blown-up strings

$B

$A

$A

$B

$C

$B

$C

$A

$B $A $A $B $C $A $B $C $A $B $A $C $A$C$A$B$C$A$B$A blue = 0red = 0.5 or 1

Levenshtein score(“BAABCBCA”, “CABCABA”) = 2 · 5.5

Alexander Tiskin (Warwick) Semi-local string comparison 45 / 132

Page 72: 20121020 semi local-string_comparison_tiskin

Semi-local string comparisonWeighted alignment

Rational-weighted semi-local alignment reduced to semi-local LCS

$B

$A

$A

$B

$C

$B

$C

$A

$B $A $A $B $C $A $B $C $A $B $A $C $A$C$A$B$C$A$B$A

Let wM = 1, wX = µν , wG = 0

Increase × ν2 in complexity (can be reduced to ν)

Alexander Tiskin (Warwick) Semi-local string comparison 46 / 132

Page 73: 20121020 semi local-string_comparison_tiskin

Alexander Tiskin (Warwick) Semi-local string comparison 47 / 132

Page 74: 20121020 semi local-string_comparison_tiskin

1 Introduction

2 Matrix distance multiplication

3 Semi-local string comparison

4 The seaweed method

5 Periodic string comparison

6 The transposition networkmethod

7 Sparse string comparison

8 Compressed string comparison

9 Beyond semi-locality

10 Conclusions and future work

Alexander Tiskin (Warwick) Semi-local string comparison 48 / 132

Page 75: 20121020 semi local-string_comparison_tiskin

The seaweed methodSeaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 49 / 132

Page 76: 20121020 semi local-string_comparison_tiskin

The seaweed methodSeaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132

Page 77: 20121020 semi local-string_comparison_tiskin

The seaweed methodSeaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132

Page 78: 20121020 semi local-string_comparison_tiskin

The seaweed methodSeaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132

Page 79: 20121020 semi local-string_comparison_tiskin

The seaweed methodSeaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132

Page 80: 20121020 semi local-string_comparison_tiskin

The seaweed methodSeaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132

Page 81: 20121020 semi local-string_comparison_tiskin

The seaweed methodSeaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132

Page 82: 20121020 semi local-string_comparison_tiskin

The seaweed methodSeaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132

Page 83: 20121020 semi local-string_comparison_tiskin

The seaweed methodSeaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132

Page 84: 20121020 semi local-string_comparison_tiskin

The seaweed methodSeaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132

Page 85: 20121020 semi local-string_comparison_tiskin

The seaweed methodSeaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132

Page 86: 20121020 semi local-string_comparison_tiskin

The seaweed methodSeaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132

Page 87: 20121020 semi local-string_comparison_tiskin

The seaweed methodSeaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132

Page 88: 20121020 semi local-string_comparison_tiskin

The seaweed methodSeaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132

Page 89: 20121020 semi local-string_comparison_tiskin

The seaweed methodSeaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132

Page 90: 20121020 semi local-string_comparison_tiskin

The seaweed methodSeaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132

Page 91: 20121020 semi local-string_comparison_tiskin

The seaweed methodSeaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132

Page 92: 20121020 semi local-string_comparison_tiskin

The seaweed methodSeaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132

Page 93: 20121020 semi local-string_comparison_tiskin

The seaweed methodSeaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132

Page 94: 20121020 semi local-string_comparison_tiskin

The seaweed methodSeaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132

Page 95: 20121020 semi local-string_comparison_tiskin

The seaweed methodSeaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132

Page 96: 20121020 semi local-string_comparison_tiskin

The seaweed methodSeaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132

Page 97: 20121020 semi local-string_comparison_tiskin

The seaweed methodSeaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132

Page 98: 20121020 semi local-string_comparison_tiskin

The seaweed methodSeaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132

Page 99: 20121020 semi local-string_comparison_tiskin

The seaweed methodSeaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132

Page 100: 20121020 semi local-string_comparison_tiskin

The seaweed methodSeaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132

Page 101: 20121020 semi local-string_comparison_tiskin

The seaweed methodSeaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132

Page 102: 20121020 semi local-string_comparison_tiskin

The seaweed methodSeaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132

Page 103: 20121020 semi local-string_comparison_tiskin

The seaweed methodSeaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132

Page 104: 20121020 semi local-string_comparison_tiskin

The seaweed methodSeaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132

Page 105: 20121020 semi local-string_comparison_tiskin

The seaweed methodSeaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132

Page 106: 20121020 semi local-string_comparison_tiskin

The seaweed methodSeaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132

Page 107: 20121020 semi local-string_comparison_tiskin

The seaweed methodSeaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132

Page 108: 20121020 semi local-string_comparison_tiskin

The seaweed methodSeaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132

Page 109: 20121020 semi local-string_comparison_tiskin

The seaweed methodSeaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132

Page 110: 20121020 semi local-string_comparison_tiskin

The seaweed methodSeaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132

Page 111: 20121020 semi local-string_comparison_tiskin

The seaweed methodSeaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132

Page 112: 20121020 semi local-string_comparison_tiskin

The seaweed methodSeaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 50 / 132

Page 113: 20121020 semi local-string_comparison_tiskin

The seaweed methodSeaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 51 / 132

Page 114: 20121020 semi local-string_comparison_tiskin

The seaweed methodSeaweed combing

Semi-local LCS: seaweed combing [T: 2006]

Initialise seaweed braid: crossings in all mismatch cells

Sweep cells in any �-compatible order

Match cell: two seaweeds uncrossed; skip

Mismatch cell: two seaweeds cross

if the same seaweeds crossed before, uncross them

otherwise skip, keep seaweeds crossed

Cell update: time O(1)

Overall time O(mn)

Alexander Tiskin (Warwick) Semi-local string comparison 52 / 132

Page 115: 20121020 semi local-string_comparison_tiskin

The seaweed methodMicro-block seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 53 / 132

Page 116: 20121020 semi local-string_comparison_tiskin

The seaweed methodMicro-block seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 54 / 132

Page 117: 20121020 semi local-string_comparison_tiskin

The seaweed methodMicro-block seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 54 / 132

Page 118: 20121020 semi local-string_comparison_tiskin

The seaweed methodMicro-block seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 54 / 132

Page 119: 20121020 semi local-string_comparison_tiskin

The seaweed methodMicro-block seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 54 / 132

Page 120: 20121020 semi local-string_comparison_tiskin

The seaweed methodMicro-block seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 54 / 132

Page 121: 20121020 semi local-string_comparison_tiskin

The seaweed methodMicro-block seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 54 / 132

Page 122: 20121020 semi local-string_comparison_tiskin

The seaweed methodMicro-block seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 54 / 132

Page 123: 20121020 semi local-string_comparison_tiskin

The seaweed methodMicro-block seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 54 / 132

Page 124: 20121020 semi local-string_comparison_tiskin

The seaweed methodMicro-block seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 54 / 132

Page 125: 20121020 semi local-string_comparison_tiskin

The seaweed methodMicro-block seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 54 / 132

Page 126: 20121020 semi local-string_comparison_tiskin

The seaweed methodMicro-block seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 54 / 132

Page 127: 20121020 semi local-string_comparison_tiskin

The seaweed methodMicro-block seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 54 / 132

Page 128: 20121020 semi local-string_comparison_tiskin

The seaweed methodMicro-block seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 54 / 132

Page 129: 20121020 semi local-string_comparison_tiskin

The seaweed methodMicro-block seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 55 / 132

Page 130: 20121020 semi local-string_comparison_tiskin

The seaweed methodMicro-block seaweed combing

Semi-local LCS: micro-block seaweed combing [T: 2007]

Initialise seaweed braid: crossings in all mismatch cells

Sweep cells in micro-blocks, in any �-compatible order

Micro-block size: t = O( log n

log log n

)Micro-block interface:

O(t) characters, each O(log σ) bits, can be reduced to O(log t) bits

O(t) integers, each O(log n) bits, can be reduced to O(log t) bits

Micro-block update: time O(1), by precomputing all possible interfaces

Overall time O(mn(log log n)2

log2 n

)

Alexander Tiskin (Warwick) Semi-local string comparison 56 / 132

Page 131: 20121020 semi local-string_comparison_tiskin

The seaweed methodCyclic LCS

The cyclic LCS problem

Give the maximum LCS score for a vs all cyclic rotations of b

Cyclic LCS: running time

O(mn2

log n

)naive

O(mn logm) [Maes: 1990]O(mn) [Bunke, Buhler: 1993; Landau+: 1998; Schmidt: 1998]

O(mn(log log n)2

log2 n

)[T: 2007]

Alexander Tiskin (Warwick) Semi-local string comparison 57 / 132

Page 132: 20121020 semi local-string_comparison_tiskin

The seaweed methodCyclic LCS

The cyclic LCS problem

Give the maximum LCS score for a vs all cyclic rotations of b

Cyclic LCS: running time

O(mn2

log n

)naive

O(mn logm) [Maes: 1990]O(mn) [Bunke, Buhler: 1993; Landau+: 1998; Schmidt: 1998]

O(mn(log log n)2

log2 n

)[T: 2007]

Alexander Tiskin (Warwick) Semi-local string comparison 57 / 132

Page 133: 20121020 semi local-string_comparison_tiskin

The seaweed methodCyclic LCS

Cyclic LCS: the algorithm

Micro-block seaweed combing on a vs bb, time O(mn(log log n)2

log2 n

)Make n string-substring LCS queries, time negligible

Alexander Tiskin (Warwick) Semi-local string comparison 58 / 132

Page 134: 20121020 semi local-string_comparison_tiskin

The seaweed methodLongest repeating subsequence

The longest repeating subsequence problem

Find the longest subsequence of a that is a square (a repetition of twoidentical strings)

Longest repeating subsequence: running time

O(m3) naiveO(m2) [Kosowski: 2004]

O(m2(log log m)2

log2 m

)[T: 2007]

Alexander Tiskin (Warwick) Semi-local string comparison 59 / 132

Page 135: 20121020 semi local-string_comparison_tiskin

The seaweed methodLongest repeating subsequence

The longest repeating subsequence problem

Find the longest subsequence of a that is a square (a repetition of twoidentical strings)

Longest repeating subsequence: running time

O(m3) naiveO(m2) [Kosowski: 2004]

O(m2(log log m)2

log2 m

)[T: 2007]

Alexander Tiskin (Warwick) Semi-local string comparison 59 / 132

Page 136: 20121020 semi local-string_comparison_tiskin

The seaweed methodLongest repeating subsequence

Longest repeating subsequence: the algorithm

Micro-block seaweed combing on a vs a, time O(m2(log log m)2

log2 m

)Make m − 1 suffix-prefix LCS queries, time negligible

Alexander Tiskin (Warwick) Semi-local string comparison 60 / 132

Page 137: 20121020 semi local-string_comparison_tiskin

The seaweed methodApproximate matching

The approximate pattern matching problem

Give the substring closest to a by alignment score, starting at eachposition in b

Assume rational alignment score

Approximate pattern matching: running time

O(mn) [Sellers: 1980]O(

mnlog n

)σ = O(1) via [Masek, Paterson: 1980]

O(mn(log log n)2

log2 n

)via [Bille, Farach-Colton: 2008]

Alexander Tiskin (Warwick) Semi-local string comparison 61 / 132

Page 138: 20121020 semi local-string_comparison_tiskin

The seaweed methodApproximate matching

Approximate pattern matching: the algorithm

Micro-block seaweed combing on a vs b (with blow-up), time

O(mn(log log n)2

log2 n

)The implicit semi-local edit score matrix:

an anti-Monge matrix

approximate pattern matching ∼ row minima

Row minima in O(n) element queries [Aggarwal+: 1987]

Each query in time O(log2 n) using the range tree representation,combined query time negligible

Overall running time O(mn(log log n)2

log2 n

), same as [Bille, Farach-Colton: 2008]

Alexander Tiskin (Warwick) Semi-local string comparison 62 / 132

Page 139: 20121020 semi local-string_comparison_tiskin

Alexander Tiskin (Warwick) Semi-local string comparison 63 / 132

Page 140: 20121020 semi local-string_comparison_tiskin

1 Introduction

2 Matrix distance multiplication

3 Semi-local string comparison

4 The seaweed method

5 Periodic string comparison

6 The transposition networkmethod

7 Sparse string comparison

8 Compressed string comparison

9 Beyond semi-locality

10 Conclusions and future work

Alexander Tiskin (Warwick) Semi-local string comparison 64 / 132

Page 141: 20121020 semi local-string_comparison_tiskin

Periodic string comparisonWraparound seaweed combing

The periodic string-substring LCS problem

Give (implicit) LCS scores for a vs each substring of b = . . . uuu . . . = u±∞

Let u be of length p

May assume that every character of a occurs in u

Only substrings of b of length at most mp (otherwise LCS score is m)

Alexander Tiskin (Warwick) Semi-local string comparison 65 / 132

Page 142: 20121020 semi local-string_comparison_tiskin

Periodic string comparisonWraparound seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 66 / 132

Page 143: 20121020 semi local-string_comparison_tiskin

Periodic string comparisonWraparound seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132

Page 144: 20121020 semi local-string_comparison_tiskin

Periodic string comparisonWraparound seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132

Page 145: 20121020 semi local-string_comparison_tiskin

Periodic string comparisonWraparound seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132

Page 146: 20121020 semi local-string_comparison_tiskin

Periodic string comparisonWraparound seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132

Page 147: 20121020 semi local-string_comparison_tiskin

Periodic string comparisonWraparound seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132

Page 148: 20121020 semi local-string_comparison_tiskin

Periodic string comparisonWraparound seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132

Page 149: 20121020 semi local-string_comparison_tiskin

Periodic string comparisonWraparound seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132

Page 150: 20121020 semi local-string_comparison_tiskin

Periodic string comparisonWraparound seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132

Page 151: 20121020 semi local-string_comparison_tiskin

Periodic string comparisonWraparound seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132

Page 152: 20121020 semi local-string_comparison_tiskin

Periodic string comparisonWraparound seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132

Page 153: 20121020 semi local-string_comparison_tiskin

Periodic string comparisonWraparound seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132

Page 154: 20121020 semi local-string_comparison_tiskin

Periodic string comparisonWraparound seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132

Page 155: 20121020 semi local-string_comparison_tiskin

Periodic string comparisonWraparound seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132

Page 156: 20121020 semi local-string_comparison_tiskin

Periodic string comparisonWraparound seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132

Page 157: 20121020 semi local-string_comparison_tiskin

Periodic string comparisonWraparound seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132

Page 158: 20121020 semi local-string_comparison_tiskin

Periodic string comparisonWraparound seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132

Page 159: 20121020 semi local-string_comparison_tiskin

Periodic string comparisonWraparound seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132

Page 160: 20121020 semi local-string_comparison_tiskin

Periodic string comparisonWraparound seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132

Page 161: 20121020 semi local-string_comparison_tiskin

Periodic string comparisonWraparound seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132

Page 162: 20121020 semi local-string_comparison_tiskin

Periodic string comparisonWraparound seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132

Page 163: 20121020 semi local-string_comparison_tiskin

Periodic string comparisonWraparound seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132

Page 164: 20121020 semi local-string_comparison_tiskin

Periodic string comparisonWraparound seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132

Page 165: 20121020 semi local-string_comparison_tiskin

Periodic string comparisonWraparound seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132

Page 166: 20121020 semi local-string_comparison_tiskin

Periodic string comparisonWraparound seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132

Page 167: 20121020 semi local-string_comparison_tiskin

Periodic string comparisonWraparound seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132

Page 168: 20121020 semi local-string_comparison_tiskin

Periodic string comparisonWraparound seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132

Page 169: 20121020 semi local-string_comparison_tiskin

Periodic string comparisonWraparound seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132

Page 170: 20121020 semi local-string_comparison_tiskin

Periodic string comparisonWraparound seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132

Page 171: 20121020 semi local-string_comparison_tiskin

Periodic string comparisonWraparound seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132

Page 172: 20121020 semi local-string_comparison_tiskin

Periodic string comparisonWraparound seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132

Page 173: 20121020 semi local-string_comparison_tiskin

Periodic string comparisonWraparound seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132

Page 174: 20121020 semi local-string_comparison_tiskin

Periodic string comparisonWraparound seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132

Page 175: 20121020 semi local-string_comparison_tiskin

Periodic string comparisonWraparound seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132

Page 176: 20121020 semi local-string_comparison_tiskin

Periodic string comparisonWraparound seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132

Page 177: 20121020 semi local-string_comparison_tiskin

Periodic string comparisonWraparound seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132

Page 178: 20121020 semi local-string_comparison_tiskin

Periodic string comparisonWraparound seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132

Page 179: 20121020 semi local-string_comparison_tiskin

Periodic string comparisonWraparound seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132

Page 180: 20121020 semi local-string_comparison_tiskin

Periodic string comparisonWraparound seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132

Page 181: 20121020 semi local-string_comparison_tiskin

Periodic string comparisonWraparound seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132

Page 182: 20121020 semi local-string_comparison_tiskin

Periodic string comparisonWraparound seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 67 / 132

Page 183: 20121020 semi local-string_comparison_tiskin

Periodic string comparisonWraparound seaweed combing

B

A

A

B

C

B

C

A

B A A B C A B C A B A C A

Alexander Tiskin (Warwick) Semi-local string comparison 68 / 132

Page 184: 20121020 semi local-string_comparison_tiskin

Periodic string comparisonWraparound seaweed combing

Periodic string-substring LCS: Wraparound seaweed combing

Initialise seaweed braid: crossings in all mismatch cells

Sweep cells row-by-row: each row starts at match cell, wraps at boundary

Match cell: two seaweeds uncrossed; skip

Mismatch cell: two seaweeds cross

if the same seaweeds crossed before (with wrapping), uncross them

otherwise skip, keep seaweeds crossed

Cell update: time O(1)

Overall time O(mn)

String-substring LCS score: count seaweeds with multiplicities

Alexander Tiskin (Warwick) Semi-local string comparison 69 / 132

Page 185: 20121020 semi local-string_comparison_tiskin

Periodic string comparisonWraparound seaweed combing

The tandem LCS problem

Give LCS score for a vs b = uk

We have n = kp; may assume k ≤ m

Tandem LCS: running time

O(mkp) naiveO(m(k + p)

)[Landau, Ziv-Ukelson: 2001]

O(mp) [T: 2009]

Direct application of wraparound seaweed combing

Alexander Tiskin (Warwick) Semi-local string comparison 70 / 132

Page 186: 20121020 semi local-string_comparison_tiskin

Periodic string comparisonWraparound seaweed combing

The tandem alignment problem

Give the substring closest to a by alignment score among certainsubstrings of b = u±∞:

global: substrings uk of length kp across all k

cyclic: substrings of length kp across all k

local: substrings of any length

Tandem alignment: running time

O(m2p) all naive

O(mp) global [Myers, Miller: 1989]

O(mp log p) cyclic [Benson: 2005]O(mp) cyclic [T: 2009]

O(mp) local [Myers, Miller: 1989]

Alexander Tiskin (Warwick) Semi-local string comparison 71 / 132

Page 187: 20121020 semi local-string_comparison_tiskin

Periodic string comparisonWraparound seaweed combing

Cyclic tandem alignment: the algorithm

Periodic seaweed combing for a vs b (with blow-up), time O(mp)

For each k ∈ [1 : m]:

solve tandem LCS (under given alignment score) for a vs uk

obtain scores for a vs p successive substrings of b of length kp by LCSbatch query: time O(1) per substring

Running time O(mp)

Alexander Tiskin (Warwick) Semi-local string comparison 72 / 132

Page 188: 20121020 semi local-string_comparison_tiskin

Alexander Tiskin (Warwick) Semi-local string comparison 73 / 132

Page 189: 20121020 semi local-string_comparison_tiskin

1 Introduction

2 Matrix distance multiplication

3 Semi-local string comparison

4 The seaweed method

5 Periodic string comparison

6 The transposition networkmethod

7 Sparse string comparison

8 Compressed string comparison

9 Beyond semi-locality

10 Conclusions and future work

Alexander Tiskin (Warwick) Semi-local string comparison 74 / 132

Page 190: 20121020 semi local-string_comparison_tiskin

The transposition network methodTransposition networks

Comparison network: a circuit of comparators

A comparator sorts two inputs and outputs them in prescribed order

Comparison networks traditionally used for non-branching merging/sorting

Classical comparison networks

# comparators

merging O(n log n) [Batcher: 1968]

sorting O(n log2 n) [Batcher: 1968]O(n log n) [Ajtai+: 1983]

Comparison networks are visualised by wire diagrams

Transposition network: all comparisons are between adjacent wires

Alexander Tiskin (Warwick) Semi-local string comparison 75 / 132

Page 191: 20121020 semi local-string_comparison_tiskin

The transposition network methodTransposition networks

Comparison network: a circuit of comparators

A comparator sorts two inputs and outputs them in prescribed order

Comparison networks traditionally used for non-branching merging/sorting

Classical comparison networks

# comparators

merging O(n log n) [Batcher: 1968]

sorting O(n log2 n) [Batcher: 1968]O(n log n) [Ajtai+: 1983]

Comparison networks are visualised by wire diagrams

Transposition network: all comparisons are between adjacent wires

Alexander Tiskin (Warwick) Semi-local string comparison 75 / 132

Page 192: 20121020 semi local-string_comparison_tiskin

The transposition network methodTransposition networks

Seaweed combing as a transposition network

A

C

B

C

A B C A

+7

+1

+5

+5

+3

+7

+1

−5

−1

−3

−3

+3

−5

−1

−7

−7

Character mismatches correspond to comparators

Inputs anti-sorted (sorted in reverse); each value traces a seaweed

Alexander Tiskin (Warwick) Semi-local string comparison 76 / 132

Page 193: 20121020 semi local-string_comparison_tiskin

The transposition network methodTransposition networks

Global LCS: transposition network with binary input

A

C

B

C

A B C A

1

1

1

1

1

1

1

0

0

0

0

1

0

0

0

0

Inputs still anti-sorted, but may not be distinct

Comparison between equal values is indeterminate

Alexander Tiskin (Warwick) Semi-local string comparison 77 / 132

Page 194: 20121020 semi local-string_comparison_tiskin

The transposition network methodParameterised string comparison

Parameterised string comparison

String comparison sensitive e.g. to

low similarity: small λ = LCS(a, b)

high similarity: small κ = distLCS(a, b) = m + n − 2λ

Can also use weighted alignment score or edit distance

Assume m = n, therefore κ = 2(n − λ)

Alexander Tiskin (Warwick) Semi-local string comparison 78 / 132

Page 195: 20121020 semi local-string_comparison_tiskin

The transposition network methodParameterised string comparison

Low-similarity comparison: small λ

sparse set of matches, may need to look at them all

preprocess matches for fast searching, time O(n log σ)

High-similarity comparison: small κ

set of matches may be dense, but only need to look at small subset

no need to preprocess, linear search is OK

Flexible comparison: sensitive to both high and low similarity, e.g. by bothcomparison types running alongside each other

Alexander Tiskin (Warwick) Semi-local string comparison 79 / 132

Page 196: 20121020 semi local-string_comparison_tiskin

The transposition network methodParameterised string comparison

Parameterised string comparison: running time

Low-similarity, after preprocessing in O(n log σ)

O(nλ) [Hirschberg: 1977][Apostolico, Guerra: 1985]

[Apostolico+: 1992]

High-similarity, no preprocessing

O(n · κ) [Ukkonen: 1985][Myers: 1986]

Flexible

O(λ · κ · log n) no preproc [Myers: 1986; Wu+: 1990]O(λ · κ) after preproc [Rick: 1995]

Alexander Tiskin (Warwick) Semi-local string comparison 80 / 132

Page 197: 20121020 semi local-string_comparison_tiskin

The transposition network methodParameterised string comparison

Parameterised string comparison: the waterfall algorithm

Low-similarity: O(n · λ)0 0 0 0 0 0 0 0

0 0 1 1 0 0 1 0

1

1

1

1

1

1

1

1

0

1

0

1

1

0

1

1

High-similarity: O(n · κ)0 0 0 0 0 0 0 0

1 1 1 0 1 1 0 0

1

1

1

1

1

1

1

1

0

0

0

0

1

1

1

0

Trace 0s through network in contiguous blocks and gaps

Alexander Tiskin (Warwick) Semi-local string comparison 81 / 132

Page 198: 20121020 semi local-string_comparison_tiskin

The transposition network methodDynamic string comparison

The dynamic LCS problem

Maintain current LCS score under updates to one or both input strings

Both input strings are streams, updated on-line:

appending characters at left or right

deleting characters at left or right

Assume for simplicity m ≈ n, i.e. m = Θ(n)

Goal: linear time per update

O(n) per update of a (n = |b|)O(m) per update of b (m = |a|)

Alexander Tiskin (Warwick) Semi-local string comparison 82 / 132

Page 199: 20121020 semi local-string_comparison_tiskin

The transposition network methodDynamic string comparison

Dynamic LCS in linear time: update models

left right

– app+del standard DP [Wagner, Fischer: 1974]app app a fixed [Landau+: 1998], [Kim, Park: 2004]app app [Ishida+: 2005]app+del app+del [T: NEW]

Main idea:

for append only, maintain seaweed matrix Pa,b

for append+delete, maintain partial seaweed layout by tracing atransposition network

Alexander Tiskin (Warwick) Semi-local string comparison 83 / 132

Page 200: 20121020 semi local-string_comparison_tiskin

The transposition network methodBit-parallel string comparison

Bit-parallel string comparison

String comparison using standard instructions on words of size w

Bit-parallel string comparison: running time

O(mn/w) [Allison, Dix: 1986; Myers: 1999; Crochemore+: 2001]

Alexander Tiskin (Warwick) Semi-local string comparison 84 / 132

Page 201: 20121020 semi local-string_comparison_tiskin

The transposition network methodBit-parallel string comparison

Bit-parallel string comparison: binary transposition network

In every cell: input bits s, c ; output bits s ′, c ′; match/mismatch flag µ

µ

s

c

s′

c ′

s 0 1 0 1 0 1 0 1c 0 0 1 1 0 0 1 1µ 0 0 0 0 1 1 1 1

s′ 0 1 1 1 0 0 1 1c ′ 0 0 0 1 0 1 0 1

+

∧µ

s

c

s′

c ′

s 0 1 0 1 0 1 0 1c 0 0 1 1 0 0 1 1µ 0 0 0 0 1 1 1 1

s′ 0 1 1 0 0 0 1 1c ′ 0 0 0 1 0 1 0 1

2c + s ← (s + (s ∧ µ) + c) ∨ (s ∧ ¬µ)

S ← (S + (S ∧M)) ∨ (S ∧ ¬M), where S , M are words of bits s, µ

Alexander Tiskin (Warwick) Semi-local string comparison 85 / 132

Page 202: 20121020 semi local-string_comparison_tiskin

The transposition network methodBit-parallel string comparison

Bit-parallel string comparison: binary transposition network

In every cell: input bits s, c ; output bits s ′, c ′; match/mismatch flag µ

µ

s

c

s′

c ′

s 0 1 0 1 0 1 0 1c 0 0 1 1 0 0 1 1µ 0 0 0 0 1 1 1 1

s′ 0 1 1 1 0 0 1 1c ′ 0 0 0 1 0 1 0 1

+

∧µ

s

c

s′

c ′

s 0 1 0 1 0 1 0 1c 0 0 1 1 0 0 1 1µ 0 0 0 0 1 1 1 1

s′ 0 1 1 0 0 0 1 1c ′ 0 0 0 1 0 1 0 1

2c + s ← (s + (s ∧ µ) + c) ∨ (s ∧ ¬µ)

S ← (S + (S ∧M)) ∨ (S ∧ ¬M), where S , M are words of bits s, µ

Alexander Tiskin (Warwick) Semi-local string comparison 85 / 132

Page 203: 20121020 semi local-string_comparison_tiskin

Alexander Tiskin (Warwick) Semi-local string comparison 86 / 132

Page 204: 20121020 semi local-string_comparison_tiskin

1 Introduction

2 Matrix distance multiplication

3 Semi-local string comparison

4 The seaweed method

5 Periodic string comparison

6 The transposition networkmethod

7 Sparse string comparison

8 Compressed string comparison

9 Beyond semi-locality

10 Conclusions and future work

Alexander Tiskin (Warwick) Semi-local string comparison 87 / 132

Page 205: 20121020 semi local-string_comparison_tiskin

Sparse string comparisonSemi-local LCS between permutations

The LCS problem on permutation strings

Give LCS score for a vs bIn each of a, b all characters distinct: total m = n matchesEquivalent to

longest increasing subsequence (LIS) in a string

maximum clique in a permutation graph

maximum planar matching in an embedded bipartite graph

LCS on permutation strings: running time

O(n log n) implicit in [Erdos, Szekeres: 1935][Robinson: 1938; Knuth: 1970; Dijkstra: 1980]

O(n log log n) unit-RAM [Chang, Wang: 1992][Bespamyatnikh, Segal: 2000]

Alexander Tiskin (Warwick) Semi-local string comparison 88 / 132

Page 206: 20121020 semi local-string_comparison_tiskin

Sparse string comparisonSemi-local LCS between permutations

The semi-local LCS problem on permutation strings

Give semi-local LCS scores of a vs bIn each of a, b all characters distinct: total m = n matchesEquivalent to

longest increasing subsequence (LIS) in every substring of a string

Semi-local LCS on permutation strings: running time

O(n2 log n) naiveO(n2) restricted [Albert+: 2003; Chen+: 2005]O(n1.5 log n) randomised, restricted [Albert+: 2007]O(n1.5) [T: 2006]O(n log2 n) [T: NEW]

Alexander Tiskin (Warwick) Semi-local string comparison 89 / 132

Page 207: 20121020 semi local-string_comparison_tiskin

Sparse string comparisonSemi-local LCS between permutations

C

F

A

E

D

H

G

B

D E H C B A F G

Alexander Tiskin (Warwick) Semi-local string comparison 90 / 132

Page 208: 20121020 semi local-string_comparison_tiskin

Sparse string comparisonSemi-local LCS between permutations

C

F

A

E

D

H

G

B

D E H C B A F G

Alexander Tiskin (Warwick) Semi-local string comparison 90 / 132

Page 209: 20121020 semi local-string_comparison_tiskin

Sparse string comparisonSemi-local LCS between permutations

C

F

A

E

D

H

G

B

D E H C B A F G

Alexander Tiskin (Warwick) Semi-local string comparison 91 / 132

Page 210: 20121020 semi local-string_comparison_tiskin

Sparse string comparisonSemi-local LCS between permutations

C

F

A

E

D

H

G

B

D E H C B A F G

Alexander Tiskin (Warwick) Semi-local string comparison 91 / 132

Page 211: 20121020 semi local-string_comparison_tiskin

Sparse string comparisonSemi-local LCS between permutations

C

F

A

E

D

H

G

B

D E H C B A F G

Alexander Tiskin (Warwick) Semi-local string comparison 92 / 132

Page 212: 20121020 semi local-string_comparison_tiskin

Sparse string comparisonSemi-local LCS between permutations

C

F

A

E

D

H

G

B

D E H C B A F G

Alexander Tiskin (Warwick) Semi-local string comparison 92 / 132

Page 213: 20121020 semi local-string_comparison_tiskin

Sparse string comparisonSemi-local LCS between permutations

Semi-local LCS on permutation strings: the algorithm

Divide-and-conquer on the alignment graph

Divide graph (say) horizontally; two subproblems of effective size n/2

Conquer: seaweed matrix �-multiplication, time O(n log n)

Overall time O(n log2 n)

Alexander Tiskin (Warwick) Semi-local string comparison 93 / 132

Page 214: 20121020 semi local-string_comparison_tiskin

Sparse string comparisonLongest piecewise monotone subsequences

A k-increasing sequence: a concatenation of k increasing sequences

A k-modal sequence: a concatenation of k alternating increasing anddecreasing sequences

The longest k-increasing (k-modal) subsequence problem

Give the longest k-increasing (k-modal) subsequence of string b

Longest k-increasing (k-modal) subsequence: running time

O(nk log n) k-modal [Demange+: 2007]O(nk log n) via [Hunt, Szymanski: 1977]O(n log2 n) [T: NEW]

Main idea: LCS for idk (respectively, (id id)k/2) vs b

Alexander Tiskin (Warwick) Semi-local string comparison 94 / 132

Page 215: 20121020 semi local-string_comparison_tiskin

Sparse string comparisonLongest piecewise monotone subsequences

Longest k-increasing subsequence: algorithm A

Sparse LCS for idk vs b: time O(nk log n)

Longest k-increasing subsequence: algorithm B

Compute seaweed matrix for id vs b: time O(n log2 n)

Extract three-way seaweed submatrix

Compute three-way seaweed submatrix for idk vs b by log k instances ofseaweed matrix �-square/multiply: time log k · O(n log n) = O(n log2 n)

Query the LCS score for idk vs b: time negligible

Overall time O(n log2 n)

Algorithm B faster than Algorithm A for k ≥ log n

Alexander Tiskin (Warwick) Semi-local string comparison 95 / 132

Page 216: 20121020 semi local-string_comparison_tiskin

Sparse string comparisonLongest piecewise monotone subsequences

Longest k-increasing subsequence: algorithm A

Sparse LCS for idk vs b: time O(nk log n)

Longest k-increasing subsequence: algorithm B

Compute seaweed matrix for id vs b: time O(n log2 n)

Extract three-way seaweed submatrix

Compute three-way seaweed submatrix for idk vs b by log k instances ofseaweed matrix �-square/multiply: time log k · O(n log n) = O(n log2 n)

Query the LCS score for idk vs b: time negligible

Overall time O(n log2 n)

Algorithm B faster than Algorithm A for k ≥ log n

Alexander Tiskin (Warwick) Semi-local string comparison 95 / 132

Page 217: 20121020 semi local-string_comparison_tiskin

Sparse string comparisonMaximum clique in a circle graph

The maximum clique problem in a circle graph

Given a circle with n chords, find the maximum-size subset of pairwiseintersecting chords

S

Standard reduction to an interval model: cut the circle and lay it out onthe line; chords become intervals (here drawn as square diagonals)

Chords intersect iff intervals overlap, i.e. intersect without containment

Alexander Tiskin (Warwick) Semi-local string comparison 96 / 132

Page 218: 20121020 semi local-string_comparison_tiskin

Sparse string comparisonMaximum clique in a circle graph

The maximum clique problem in a circle graph

Given a circle with n chords, find the maximum-size subset of pairwiseintersecting chords

S

Standard reduction to an interval model: cut the circle and lay it out onthe line; chords become intervals (here drawn as square diagonals)

Chords intersect iff intervals overlap, i.e. intersect without containment

Alexander Tiskin (Warwick) Semi-local string comparison 96 / 132

Page 219: 20121020 semi local-string_comparison_tiskin

Sparse string comparisonMaximum clique in a circle graph

Maximum clique in a circle graph: running time

exp(n) naiveO(n3) [Gavril: 1973]O(n2) [Rotem, Urrutia: 1981; Hsu: 1985]

[Masuda+: 1990; Apostolico+: 1992]O(n1.5) [T: 2006]O(n log2 n) [T: NEW]

Alexander Tiskin (Warwick) Semi-local string comparison 97 / 132

Page 220: 20121020 semi local-string_comparison_tiskin

Sparse string comparisonMaximum clique in a circle graph

Alexander Tiskin (Warwick) Semi-local string comparison 98 / 132

Page 221: 20121020 semi local-string_comparison_tiskin

Sparse string comparisonMaximum clique in a circle graph

Alexander Tiskin (Warwick) Semi-local string comparison 98 / 132

Page 222: 20121020 semi local-string_comparison_tiskin

Sparse string comparisonMaximum clique in a circle graph

Alexander Tiskin (Warwick) Semi-local string comparison 98 / 132

Page 223: 20121020 semi local-string_comparison_tiskin

Sparse string comparisonMaximum clique in a circle graph

Alexander Tiskin (Warwick) Semi-local string comparison 98 / 132

Page 224: 20121020 semi local-string_comparison_tiskin

Sparse string comparisonMaximum clique in a circle graph

Alexander Tiskin (Warwick) Semi-local string comparison 98 / 132

Page 225: 20121020 semi local-string_comparison_tiskin

Sparse string comparisonMaximum clique in a circle graph

Alexander Tiskin (Warwick) Semi-local string comparison 98 / 132

Page 226: 20121020 semi local-string_comparison_tiskin

Sparse string comparisonMaximum clique in a circle graph

Alexander Tiskin (Warwick) Semi-local string comparison 98 / 132

Page 227: 20121020 semi local-string_comparison_tiskin

Sparse string comparisonMaximum clique in a circle graph

Alexander Tiskin (Warwick) Semi-local string comparison 98 / 132

Page 228: 20121020 semi local-string_comparison_tiskin

Sparse string comparisonMaximum clique in a circle graph

Alexander Tiskin (Warwick) Semi-local string comparison 98 / 132

Page 229: 20121020 semi local-string_comparison_tiskin

Sparse string comparisonMaximum clique in a circle graph

Alexander Tiskin (Warwick) Semi-local string comparison 98 / 132

Page 230: 20121020 semi local-string_comparison_tiskin

Sparse string comparisonMaximum clique in a circle graph

Alexander Tiskin (Warwick) Semi-local string comparison 98 / 132

Page 231: 20121020 semi local-string_comparison_tiskin

Sparse string comparisonMaximum clique in a circle graph

Maximum clique in a circle graph: the algorithm

Helly property: if any set of intervals intersect pairwise, then they allintersect at a common point

Compute seaweed matrix, build range tree: time O(n log2 n)

Run through all 2n + 1 possible common intersection points

For each point, find a maximum subset of covering overlapping segmentsby a prefix-suffix LCS query: time (2n + 1) · O(log2 n) = O(n log2 n)

Overall time O(n log2 n) + O(n log2 n) = O(n log2 n)

Alexander Tiskin (Warwick) Semi-local string comparison 99 / 132

Page 232: 20121020 semi local-string_comparison_tiskin

Sparse string comparisonMaximum clique in a circle graph

Parameterised maximum clique in a circle graph

The maximum clique problem in a circle graph, sensitive e.g. to

the number e of edges

the size l of maximum clique

the thickness d of interval model (the maximum number of intervalscovering a point)

We have l ≤ d ≤ n, e ≤ n2

Parameterised maximum clique in a circle graph: running time

O(n log n + e) [Apostolico+: 1992]

O(n log n + nl log(n/l)) [Apostolico+: 1992]

O(n log n + n log2 d) NEW

Alexander Tiskin (Warwick) Semi-local string comparison 100 / 132

Page 233: 20121020 semi local-string_comparison_tiskin

Sparse string comparisonMaximum clique in a circle graph

Parameterised maximum clique in a circle graph: the algorithm

For each diagonal block of size d , compute seaweed matrix, build rangetree: time n/d · O(d log2 d) = O(n log2 d)

Extend each diagonal block to a quadrant: time O(n log2 d)

Run through all 2n + 1 possible common intersection points

For each point, find a maximum subset of covering overlapping segmentsby a prefix-suffix LCS query: time O(n log2 d)

Overall time O(n log2 d)

Alexander Tiskin (Warwick) Semi-local string comparison 101 / 132

Page 234: 20121020 semi local-string_comparison_tiskin

Alexander Tiskin (Warwick) Semi-local string comparison 102 / 132

Page 235: 20121020 semi local-string_comparison_tiskin

1 Introduction

2 Matrix distance multiplication

3 Semi-local string comparison

4 The seaweed method

5 Periodic string comparison

6 The transposition networkmethod

7 Sparse string comparison

8 Compressed string comparison

9 Beyond semi-locality

10 Conclusions and future work

Alexander Tiskin (Warwick) Semi-local string comparison 103 / 132

Page 236: 20121020 semi local-string_comparison_tiskin

Compressed string comparisonGrammar compression

Notation: pattern p of length m; text t of length n

A GC-string (grammar-compressed string) t is a straight-line program(context-free grammar) generating t = tn by n assignments of the form

tk = α, where α is an alphabet character

tk = ti tj , where i , j < k

In general, n = O(2n)

Example: Fibonacci string “ABAABABAABAAB”

t1 = ‘B’ t2 = ‘A’

t3 = t2t1 t4 = t3t2 t5 = t4t3 t6 = t5t4 t7 = t6t5

Alexander Tiskin (Warwick) Semi-local string comparison 104 / 132

Page 237: 20121020 semi local-string_comparison_tiskin

Compressed string comparisonGrammar compression

Grammar-compression covers various compression types, e.g. LZ78, LZW(not LZ77 directly)

Simplifying assumption: arithmetic up to n runs in O(1)

This assumption can be removed by careful index remapping

Alexander Tiskin (Warwick) Semi-local string comparison 105 / 132

Page 238: 20121020 semi local-string_comparison_tiskin

Compressed string comparisonExtended substring-string LCS on GC-strings

LCS: running time (r = m + n, r = m + n)

p t

plain plain O(mn) [Wagner, Fischer: 1974]O(

mnlog2 m

)[Masek, Paterson: 1980]

[Crochemore+: 2003]

plain GC O(m3n + . . .) gen. CFG [Myers: 1995]O(m1.5n) ext subs-s [T: 2008]O(m logm · n) ext subs-s [T: 2010]

GC GC NP-hard [Lifshits: 2005]O(r1.2r1.4) R weights [Hermelin+: 2009]O(r log r · r) [T: 2010]O(r log(r/r) · r

)[Hermelin+: 2010]

O(r log1/2(r/r) · r

)[Gawrychowski: NEW]

Alexander Tiskin (Warwick) Semi-local string comparison 106 / 132

Page 239: 20121020 semi local-string_comparison_tiskin

Compressed string comparisonExtended substring-string LCS on GC-strings

Extended substring-string LCS (plain pattern, GC text): the algorithm

For every k, compute by recursion the appropriate part of seaweed matrixPp,tk , using matrix �-multiplication: time O(m logm · n)

Overall time O(m logm · n)

Alexander Tiskin (Warwick) Semi-local string comparison 107 / 132

Page 240: 20121020 semi local-string_comparison_tiskin

Compressed string comparisonSubsequence recognition on GC-strings

The global subsequence recognition problem

Does pattern p appear in text t as a subsequence?

Global subsequence recognition: running time

p t

plain plain O(n) greedy

plain GC O(mn) greedy

GC GC NP-hard [Lifshits: 2005]

Alexander Tiskin (Warwick) Semi-local string comparison 108 / 132

Page 241: 20121020 semi local-string_comparison_tiskin

Compressed string comparisonSubsequence recognition on GC-strings

The local subsequence recognition problem

Find all minimally matching substrings of t with respect to p

Substring of t is matching, if p is a subsequence of t

Matching substring of t is minimally matching, if none of its propersubstrings are matching

Alexander Tiskin (Warwick) Semi-local string comparison 109 / 132

Page 242: 20121020 semi local-string_comparison_tiskin

Compressed string comparisonSubsequence recognition on GC-strings

Local subsequence recognition: running time ( + output)

p t

plain plain O(mn) [Mannila+: 1995]O(

mnlog m

)[Das+: 1997]

O(cm + n) [Boasson+: 2001]O(m + nσ) [Tronicek: 2001]

plain GC O(m2 logmn) [Cegielski+: 2006]O(m1.5n) [T: 2008]O(m logm · n) [T: NEW]

GC GC NP-hard [Lifshits: 2005]

Alexander Tiskin (Warwick) Semi-local string comparison 110 / 132

Page 243: 20121020 semi local-string_comparison_tiskin

Compressed string comparisonSubsequence recognition on GC-strings

ı0+

Oı1+

Oı2+

O

M0+

M1+

M2+

0H

n′

HnH

b〈i : j〉 matching iff box [i : j ] not pierced left-to-right

≷-maximal seaweeds: �-chain(ı0+ , 0+

)�(ı1+ , 1+

)� · · · �

(ıs− , s−

)b〈i : j〉 minimally matching iff (i , j) is in the interleaved �-chain(ı−1+ ,

+1−

)�(ı−2+ ,

+2−

)� · · · �

(ı−(s−1)+ ,

+(s−1)−

)Alexander Tiskin (Warwick) Semi-local string comparison 111 / 132

Page 244: 20121020 semi local-string_comparison_tiskin

Compressed string comparisonSubsequence recognition on GC-strings

×

×

×

×

0+ 1+ 2+n′ n m+n

−m

0

n′

ı0+

ı1+

ı2+

Alexander Tiskin (Warwick) Semi-local string comparison 112 / 132

Page 245: 20121020 semi local-string_comparison_tiskin

Compressed string comparisonSubsequence recognition on GC-strings

Local subsequence recognition (plain pattern, GC text): the algorithm

For every k, compute by recursion the appropriate part of seaweed matrixPp,tk , using matrix �-multiplication: time O(m logm · n)

Given an assignment t = t ′t ′′, count by recursion

minimally matching substrings in t ′

minimally matching substrings in t ′′

Then, find �-chain of ≷-maximal seaweeds in time n · O(m) = O(mn)

The interleaved �-chain defines minimally matching substrings in toverlapping both t ′ and t ′′

Overall time O(m logm · n) + O(mn) = O(m logm · n)

Alexander Tiskin (Warwick) Semi-local string comparison 113 / 132

Page 246: 20121020 semi local-string_comparison_tiskin

Compressed string comparisonSubsequence recognition on GC-strings

Local subsequence recognition (plain pattern, GC text): the algorithm

For every k, compute by recursion the appropriate part of seaweed matrixPp,tk , using matrix �-multiplication: time O(m logm · n)

Given an assignment t = t ′t ′′, count by recursion

minimally matching substrings in t ′

minimally matching substrings in t ′′

Then, find �-chain of ≷-maximal seaweeds in time n · O(m) = O(mn)

The interleaved �-chain defines minimally matching substrings in toverlapping both t ′ and t ′′

Overall time O(m logm · n) + O(mn) = O(m logm · n)

Alexander Tiskin (Warwick) Semi-local string comparison 113 / 132

Page 247: 20121020 semi local-string_comparison_tiskin

Compressed string comparisonSubsequence recognition on GC-strings

Local subsequence recognition (plain pattern, GC text): the algorithm

For every k, compute by recursion the appropriate part of seaweed matrixPp,tk , using matrix �-multiplication: time O(m logm · n)

Given an assignment t = t ′t ′′, count by recursion

minimally matching substrings in t ′

minimally matching substrings in t ′′

Then, find �-chain of ≷-maximal seaweeds in time n · O(m) = O(mn)

The interleaved �-chain defines minimally matching substrings in toverlapping both t ′ and t ′′

Overall time O(m logm · n) + O(mn) = O(m logm · n)

Alexander Tiskin (Warwick) Semi-local string comparison 113 / 132

Page 248: 20121020 semi local-string_comparison_tiskin

Compressed string comparisonThreshold approximate matching

The threshold approximate matching problem

Find all matching substrings of t with respect to p, according to athreshold k

Substring of t is matching, if the edit distance for p vs t is at most k

Alexander Tiskin (Warwick) Semi-local string comparison 114 / 132

Page 249: 20121020 semi local-string_comparison_tiskin

Compressed string comparisonThreshold approximate matching

Threshold approximate matching: running time ( + output)

p t

plain plain O(mn) [Sellers: 1980]O(mk) [Landau, Vishkin: 1989]

O(m + n + nk4

m ) [Cole, Hariharan: 2002]

plain GC O(mnk2) [Karkkainen+: 2003]O(mnk + n log n) [LV: 1989] via [Bille+: 2010]O(mn + nk4 + n log n) [CH: 2002] via [Bille+: 2010]O(m logm · n) [T: NEW]

GC GC NP-hard [Lifshits: 2005]

(Also many specialised variants for LZ compression)

Alexander Tiskin (Warwick) Semi-local string comparison 115 / 132

Page 250: 20121020 semi local-string_comparison_tiskin

Compressed string comparisonThreshold approximate matching

Threshold approx matching (plain pattern, GC text): the algorithm

Algorithm structure similar to local subsequence recognition by matrix�-multiplication and seaweed �-chains

Extra ingredients:

the blow-up technique: reduction of edit distances to LCS scores

implicit matrix searching, replaces �-chain interleaving

Monge row minima: “SMAWK” O(m) [Aggarwal+: 1987]

Implicit unit-Monge row minima O(m log logm) [T: NEW]O(m) [Gawrychowski: NEW]

replaces �-chain interleaving

Overall time O(m logm · n) + O(mn) = O(m logm · n)

Alexander Tiskin (Warwick) Semi-local string comparison 116 / 132

Page 251: 20121020 semi local-string_comparison_tiskin

Compressed string comparisonThreshold approximate matching

ı0+

Oı1+

Oı2+

Oı3+

Oı4+

O

M0+

M1+

M2+

M3+

M4+

0H

n′

HnH

Blow up: weighted alignment on strings p, t of size m, n equivalent toLCS on strings p, t of size m = νm, n = νn

Alexander Tiskin (Warwick) Semi-local string comparison 117 / 132

Page 252: 20121020 semi local-string_comparison_tiskin

Compressed string comparisonThreshold approximate matching

0+ 1+ 2+ 3+ 4+n′ n m+n

ı0+

ı1+

ı2+

ı3+

ı4+

−m

0

n′

×

×

×

×

×

×× × × ×× ×

× × ×× ×

× × ×× ×

× × ×× ×

× × ×× ×

× × ×× ×

Alexander Tiskin (Warwick) Semi-local string comparison 118 / 132

Page 253: 20121020 semi local-string_comparison_tiskin

1 Introduction

2 Matrix distance multiplication

3 Semi-local string comparison

4 The seaweed method

5 Periodic string comparison

6 The transposition networkmethod

7 Sparse string comparison

8 Compressed string comparison

9 Beyond semi-locality

10 Conclusions and future work

Alexander Tiskin (Warwick) Semi-local string comparison 119 / 132

Page 254: 20121020 semi local-string_comparison_tiskin

Beyond semi-localityWindow-local LCS and alignment plots

Given window length w

The window-local LCS problem

Give semi-local LCS score for every w -substring of a vs whole b

Window-local LCS: running time

O(mn3w) naiveO(mnw) multiple seaweedO(mn) [Krusche, T: 2010]

Alexander Tiskin (Warwick) Semi-local string comparison 120 / 132

Page 255: 20121020 semi local-string_comparison_tiskin

Beyond semi-localityWindow-local LCS and alignment plots

Given window length w

The window-local LCS problem

Give semi-local LCS score for every w -substring of a vs whole b

Window-local LCS: running time

O(mn3w) naiveO(mnw) multiple seaweedO(mn) [Krusche, T: 2010]

Alexander Tiskin (Warwick) Semi-local string comparison 120 / 132

Page 256: 20121020 semi local-string_comparison_tiskin

Beyond semi-localityWindow-local LCS and alignment plots

Quasi-local LCS: the algorithm

Compute seaweed matrices for canonical substrings of a against b

Build seaweed matrices for windows of a against b recursively

Both stages use matrix �-multiplication, overall time O(mn)

Alexander Tiskin (Warwick) Semi-local string comparison 121 / 132

Page 257: 20121020 semi local-string_comparison_tiskin

Beyond semi-localityWindow-local LCS and alignment plots

The window-window LCS problem

Give semi-local LCS score for every w -substring of a vs every w -substring b

Provides an alignment plot: loss-free local comparison of genomesequences, important for studying conservation in the genome

Window-window LCS: running time

O(mnw2) naiveO(mnw) multiple seaweedO(mn) [Krusche, T: 2010]

Alexander Tiskin (Warwick) Semi-local string comparison 122 / 132

Page 258: 20121020 semi local-string_comparison_tiskin

Beyond semi-localityWindow-local LCS and alignment plots

The window-window LCS problem

Give semi-local LCS score for every w -substring of a vs every w -substring b

Provides an alignment plot: loss-free local comparison of genomesequences, important for studying conservation in the genome

Window-window LCS: running time

O(mnw2) naiveO(mnw) multiple seaweedO(mn) [Krusche, T: 2010]

Alexander Tiskin (Warwick) Semi-local string comparison 122 / 132

Page 259: 20121020 semi local-string_comparison_tiskin

Beyond semi-localityWindow-local LCS and alignment plots

An implementation of alignment plots [Krusche: 2010]

Based on the O(mnw) multiple seaweed combing, using some ideas fromthe optimal O(mn) algorithm

Resulting theoretical complexity O(mnw0.5)

Alignment weights (match 1, mismatch 0, gap −0.5): ×4 slowdown

C++, Intel assembly (x86, x86 64, MMX/SSE2 data parallelism)

SMP parallelism (currently two processors)

Alexander Tiskin (Warwick) Semi-local string comparison 123 / 132

Page 260: 20121020 semi local-string_comparison_tiskin

Beyond semi-localityWindow-local LCS and alignment plots

Krusche’s implementation used at Warwick Systems Biology Centre tostudy weak conservation in plant genomes (BLAST too inaccurate)

Speedup on one processor:

×10 over heavily optimised, bit-parallel naive algorithm

×7 over biologists’ heuristics

Speedup on two processors: extra ×2, near-perfect parallelism

Alexander Tiskin (Warwick) Semi-local string comparison 124 / 132

Page 261: 20121020 semi local-string_comparison_tiskin

Beyond semi-localityWindow-local LCS and alignment plots

Publications:

[Picot+: 2010] Evolutionary Analysis of Regulatory Sequences (EARS) inPlants. Plant Journal.

[Baxter+: accepted] Conserved Noncoding Sequences Highlight SharedComponents of Regulatory Networks in Dicotyledonous Plants. Plant Cell.

Web service: http://wsbc.warwick.ac.uk/ears

Future work: genomic repeats; genome-scale comparison

Alexander Tiskin (Warwick) Semi-local string comparison 125 / 132

Page 262: 20121020 semi local-string_comparison_tiskin

Beyond semi-localityQuasi-local LCS and sparse spliced alignment

Given m (possibly overlapping) prescribed substrings in a

The quasi-local LCS problem

Give semi-local LCS score for every prescribed substring of a vs whole b

Quasi-local LCS: running time

O(m2n) multiple seaweedO(mn log2 m) [T: unpublished]

Alexander Tiskin (Warwick) Semi-local string comparison 126 / 132

Page 263: 20121020 semi local-string_comparison_tiskin

Beyond semi-localityQuasi-local LCS and sparse spliced alignment

Given m (possibly overlapping) prescribed substrings in a

The quasi-local LCS problem

Give semi-local LCS score for every prescribed substring of a vs whole b

Quasi-local LCS: running time

O(m2n) multiple seaweedO(mn log2 m) [T: unpublished]

Alexander Tiskin (Warwick) Semi-local string comparison 126 / 132

Page 264: 20121020 semi local-string_comparison_tiskin

Beyond semi-localityQuasi-local LCS and sparse spliced alignment

Window-local LCS: the algorithm

Compute seaweed matrices for canonical substrings of a against b

Build seaweed matrices for prescribed substrings of a against b recursively

Both stages use matrix �-multiplication, time O(mn log2 m)

Alexander Tiskin (Warwick) Semi-local string comparison 127 / 132

Page 265: 20121020 semi local-string_comparison_tiskin

Beyond semi-localityQuasi-local LCS and sparse spliced alignment

The sparse spliced alignment problem

Give the chain of non-overlapping prescribed substrings in a, closest to bby alignment score

Describes gene assembly: unknown gene from candidate exons, given aknown similar gene

Assume m = n; let s = total size of prescribed substrings

Sparse spliced alignment: running time

O(ns) = O(n3) [Gelfand+: 1996]O(n2.5) [Kent+: 2006]O(n2 log2 n) [T: unpublished]O(n2 log n) [Sakai: 2009]

Alexander Tiskin (Warwick) Semi-local string comparison 128 / 132

Page 266: 20121020 semi local-string_comparison_tiskin

Beyond semi-localityQuasi-local LCS and sparse spliced alignment

The sparse spliced alignment problem

Give the chain of non-overlapping prescribed substrings in a, closest to bby alignment score

Describes gene assembly: unknown gene from candidate exons, given aknown similar gene

Assume m = n; let s = total size of prescribed substrings

Sparse spliced alignment: running time

O(ns) = O(n3) [Gelfand+: 1996]O(n2.5) [Kent+: 2006]O(n2 log2 n) [T: unpublished]O(n2 log n) [Sakai: 2009]

Alexander Tiskin (Warwick) Semi-local string comparison 128 / 132

Page 267: 20121020 semi local-string_comparison_tiskin

1 Introduction

2 Matrix distance multiplication

3 Semi-local string comparison

4 The seaweed method

5 Periodic string comparison

6 The transposition networkmethod

7 Sparse string comparison

8 Compressed string comparison

9 Beyond semi-locality

10 Conclusions and future work

Alexander Tiskin (Warwick) Semi-local string comparison 129 / 132

Page 268: 20121020 semi local-string_comparison_tiskin

Conclusions and future work

Implicit unit-Monge matrices:

the seaweed monoid

distance multiplication in time O(n log n)

next: lower bound?

Semi-local LCS problem:

representation by implicit unit-Monge matrices

generalisation to rational alignment scores

next: real alignment scores?

Seaweed combing and micro-block speedup:

a simple algorithm for semi-local LCS

semi-local LCS in time o(mn)

improvements on related problems

Alexander Tiskin (Warwick) Semi-local string comparison 130 / 132

Page 269: 20121020 semi local-string_comparison_tiskin

Conclusions and future work

Transposition networks:

simple interpretation of high-similarity and dissimilarity LCS

dynamic LCS

simple interpretation of bit-parallel LCS

Sparse string comparison:

fast LCS on permutation strings

fast max-clique in a circle graph

next: lower bound? or even faster max-clique

Sparse string comparison:

semi-local LCS on permutations in time O(n log2 n)

maximum clique in a circle graph in time O(n log2 n)

improvements on related problems

Alexander Tiskin (Warwick) Semi-local string comparison 131 / 132

Page 270: 20121020 semi local-string_comparison_tiskin

Conclusions and future work

Compressed string comparison:

three-way semi-local LCS on GC text against plain pattern

subsequence recognition in GC-strings

next: approximate matching in GC-strings

Beyond semi-locality:

quasi-local string comparison

sparse spliced alignment

next: full locality

Alexander Tiskin (Warwick) Semi-local string comparison 132 / 132