20121020 semi local-string_comparison_tiskin
Semi-local string comparison: Algorithmic techniques and applications
Alexander Tiskin
Department of Computer Science, University of Warwick
http://go.warwick.ac.uk/alextiskin
Alexander Tiskin (Warwick) Semi-local string comparison 1 / 132
1 Introduction
2 Matrix distance multiplication
3 Semi-local string comparison
4 The seaweed method
5 Periodic string comparison
6 The transposition network method
7 Sparse string comparison
8 Compressed string comparison
9 Beyond semi-locality
10 Conclusions and future work
Introduction
String matching: finding an exact pattern in a string
String comparison: finding similar patterns in two strings
Applications: computational biology, image recognition, . . .
Standard types of string comparison:
global: whole string vs whole string
local: substrings vs substrings
Main focus of this work:
semi-local: whole string vs substrings; prefixes vs suffixes
Closely related to approximate string matching (no relation to approximation algorithms!)
Main tool: implicit unit-Monge matrices (a.k.a. seaweed matrices)
Introduction: Terminology and notation
$x^- = x - \tfrac{1}{2}$, $x^+ = x + \tfrac{1}{2}$
Integers: $\{\ldots, -2, -1, 0, 1, 2, \ldots\}$
Half-integers: $\{\ldots, -\tfrac{3}{2}, -\tfrac{1}{2}, \tfrac{1}{2}, \tfrac{3}{2}, \tfrac{5}{2}, \ldots\} = \{\ldots, (-2)^+, (-1)^+, 0^+, 1^+, 2^+, \ldots\}$
$(i, j) \ll (i', j')$ iff $i < i'$ and $j < j'$
$(i, j) \gtrless (i', j')$ iff $i > i'$ and $j < j'$
A permutation matrix is a 0/1 matrix with exactly one nonzero per row and per column, e.g.
$\begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}$
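Introduction: Terminology and notation
Given matrix $D$, its distribution matrix is made up of $\gtrless$-dominance sums:
$D^\Sigma(i, j) = \sum_{\hat\imath > i,\ \hat\jmath < j} D(\hat\imath, \hat\jmath)$
Given matrix $E$, its density matrix is made up of quadrangle differences:
$E^\square(\hat\imath, \hat\jmath) = E(\hat\imath^-, \hat\jmath^+) - E(\hat\imath^-, \hat\jmath^-) - E(\hat\imath^+, \hat\jmath^+) + E(\hat\imath^+, \hat\jmath^-)$
where $D^\Sigma$, $E$ are over integers; $D$, $E^\square$ are over half-integers
$\begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}^\Sigma = \begin{pmatrix} 0 & 1 & 2 & 3 \\ 0 & 1 & 1 & 2 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{pmatrix}$
$\begin{pmatrix} 0 & 1 & 2 & 3 \\ 0 & 1 & 1 & 2 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{pmatrix}^\square = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}$
$(D^\Sigma)^\square = D$ for all $D$
Matrix $E$ is simple if $(E^\square)^\Sigma = E$; equivalently, if it has all zeros in the left column and bottom row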
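The two operators can be tried out on the 3 x 3 example with a minimal Python sketch. The function names `distribution` and `density` are illustrative, not taken from the slides. The sketch assumes that entry `D[r][c]` of a 0-based list of lists sits at the half-integer position (r + 1/2, c + 1/2), so that a distribution entry at integer point (i, j) sums the rows r >= i and the columns c < j.

```python
def distribution(D):
    """Distribution matrix: S[i][j] = sum of D[r][c] over rows r >= i and columns c < j."""
    m, n = len(D), len(D[0])
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):      # fill rows bottom-up
        for j in range(1, n + 1):       # fill columns left to right
            # inclusion-exclusion over the newly covered cell D[i][j - 1]
            S[i][j] = S[i + 1][j] + S[i][j - 1] - S[i + 1][j - 1] + D[i][j - 1]
    return S


def density(E):
    """Density matrix: quadrangle differences of E around each half-integer point."""
    m, n = len(E) - 1, len(E[0]) - 1
    return [[E[i][j + 1] - E[i][j] - E[i + 1][j + 1] + E[i + 1][j]
             for j in range(n)] for i in range(m)]


D = [[0, 1, 0],
     [1, 0, 0],
     [0, 0, 1]]
S = distribution(D)
print(S)             # the 4 x 4 distribution matrix of the example
print(density(S))    # recovers D
```

The prefix-sum recurrence is only one convenient way to evaluate the dominance sums; any method that computes the same sums would do.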
Introduction: Terminology and notation
Matrix $E$ is Monge if $E^\square$ is nonnegative
Intuition: boundary-to-boundary distances in a (weighted) planar graph
Matrix $E$ is unit-Monge if $E^\square$ is a permutation matrix
Intuition: boundary-to-boundary distances in a grid-like graph
Simple unit-Monge matrix: $P^\Sigma$, where $P$ is a permutation matrix
Seaweed matrix: $P$ used as an implicit representation of $P^\Sigma$
$\begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}^\Sigma = \begin{pmatrix} 0 & 1 & 2 & 3 \\ 0 & 1 & 1 & 2 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{pmatrix}$
Introduction: Implicit unit-Monge matrices
Efficient $P^\Sigma$ queries: range tree on nonzeros of $P$ [Bentley: 1980]
binary search tree by i-coordinate
under every node, binary search tree by j-coordinate
[Figure: recursive partitioning of the nonzeros of $P$ in the range tree]
Introduction: Implicit unit-Monge matrices
Efficient $P^\Sigma$ queries (contd.):
Every node of the range tree represents a canonical range (rectangular region), and stores its nonzero count
Overall, $\le n \log n$ canonical ranges are non-empty
A $P^\Sigma$ query is equivalent to $\gtrless$-dominance counting: how many nonzeros are $\gtrless$-dominated by the query point?
Answer: sum up nonzero counts in $\le \log^2 n$ disjoint canonical ranges
Total size $O(n \log n)$, query time $O(\log^2 n)$
There are asymptotically more efficient (but less practical) data structures:
total size $O(n)$, query time $O\bigl(\frac{\log n}{\log \log n}\bigr)$ [JaJa+: 2004] [Chan, Patrascu: 2010]
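A dominance-counting query with these bounds can be sketched in Python as follows. This is a simplified variant (a segment tree over rows with a sorted column list in every node, sometimes called a merge-sort tree) rather than the exact range tree of the slides. The class name `DominanceCounter` and the encoding `perm[r] = c` for a nonzero in row r, column c are assumptions made for the illustration.

```python
from bisect import bisect_left


class DominanceCounter:
    """Answers P^Sigma(i, j) = number of nonzeros in rows >= i and columns < j."""

    def __init__(self, perm):
        self.n = len(perm)
        self.size = 1
        while self.size < max(self.n, 1):
            self.size *= 2
        # leaf size + r holds the column of row r; inner nodes hold sorted merges
        self.tree = [[] for _ in range(2 * self.size)]
        for r, c in enumerate(perm):
            self.tree[self.size + r] = [c]
        for v in range(self.size - 1, 0, -1):
            self.tree[v] = sorted(self.tree[2 * v] + self.tree[2 * v + 1])

    def query(self, i, j):
        # decompose the row interval [i, n) into O(log n) canonical nodes,
        # then binary-search each node's sorted columns: O(log^2 n) in total
        lo, hi = i + self.size, self.n + self.size
        total = 0
        while lo < hi:
            if lo & 1:
                total += bisect_left(self.tree[lo], j)
                lo += 1
            if hi & 1:
                hi -= 1
                total += bisect_left(self.tree[hi], j)
            lo //= 2
            hi //= 2
        return total


counter = DominanceCounter([1, 0, 2])   # the 3 x 3 example permutation matrix
print(counter.query(1, 2))              # entry (1, 2) of its distribution matrix
```

The structure stores O(n log n) column values in total, in line with the size bound quoted above.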
Matrix distance multiplication: Seaweed braids
Distance algebra (a.k.a. (min, +) or tropical algebra):
addition $\oplus$ given by min
multiplication $\odot$ given by +
Matrix $\odot$-multiplication:
$A \odot B = C$, where $C(i, k) = \bigoplus_j \bigl(A(i, j) \odot B(j, k)\bigr) = \min_j \bigl(A(i, j) + B(j, k)\bigr)$
Matrix classes closed under $\odot$-multiplication (for given $n$):
general numerical (integer, real) matrices
Monge matrices
simple unit-Monge matrices (!)
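The definition of the distance product translates directly into a cubic-time Python sketch. The names `minplus` and `is_monge` are illustrative. The Monge test follows the convention of the earlier slides (all quadrangle differences nonnegative), and the sample matrix is the distribution matrix of the 3 x 3 example permutation.

```python
def minplus(A, B):
    """Distance product: C[i][k] = min over j of A[i][j] + B[j][k]."""
    inner = len(B)
    return [[min(A[i][j] + B[j][k] for j in range(inner))
             for k in range(len(B[0]))]
            for i in range(len(A))]


def is_monge(E):
    """True if every quadrangle difference of E is nonnegative."""
    return all(E[i][j + 1] - E[i][j] - E[i + 1][j + 1] + E[i + 1][j] >= 0
               for i in range(len(E) - 1) for j in range(len(E[0]) - 1))


A = [[0, 1, 2, 3],
     [0, 1, 1, 2],
     [0, 0, 0, 1],
     [0, 0, 0, 0]]
C = minplus(A, A)
print(C, is_monge(C))
```

For this particular matrix the product equals the matrix itself, which is the matrix form of the idempotence of an elementary crossing.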
Matrix distance multiplication: Seaweed braids
Recall that simple unit-Monge matrices are represented implicitly by permutation (seaweed) matrices
Define $P_A \boxdot P_B = P_C$ as $P_A^\Sigma \odot P_B^\Sigma = P_C^\Sigma$
The seaweed monoid $T_n$:
simple unit-Monge matrices under $\odot$
equivalently, permutation (seaweed) matrices under $\boxdot$
Also known as the 0-Hecke monoid of the symmetric group, $H_0(S_n)$
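Matrix distance multiplication: Seaweed braids
$P_A \boxdot P_B = P_C$ can be seen as combing of seaweed braids
[Figure: permutation matrices $P_A$, $P_B$, $P_C$ and the corresponding seaweed braids; the braid of $P_A$ stacked on the braid of $P_B$ is combed into the braid of $P_C$]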
Matrix distance multiplication: Seaweed braids
The seaweed monoid $T_n$:
$n!$ elements (permutations of size $n$)
$n - 1$ generators $g_1, g_2, \ldots, g_{n-1}$ (elementary crossings)
Idempotence: $g_i^2 = g_i$ for all $i$
Far commutativity: $g_i g_j = g_j g_i$ for $j - i > 1$
Braid relations: $g_i g_j g_i = g_j g_i g_j$ for $j - i = 1$
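The three relations can be checked by brute force on a small model of the monoid. The Python sketch below uses the standard action of the 0-Hecke generators on permutations in one-line notation: generator i crosses the entries at positions i and i + 1 unless they are already crossed. This model, the 0-based generator indices and all function names are assumptions of the illustration, not the matrix-based definition used on the slides.

```python
from itertools import permutations


def apply_generator(w, i):
    """Apply elementary crossing i (0-based) to w: cross positions i, i + 1 if uncrossed."""
    w = list(w)
    if w[i] < w[i + 1]:
        w[i], w[i + 1] = w[i + 1], w[i]
    return tuple(w)


def apply_word(w, word):
    """Apply a sequence of generator indices, left to right."""
    for i in word:
        w = apply_generator(w, i)
    return w


def relations_hold(n):
    """Check idempotence, far commutativity and braid relations on every permutation."""
    for w in permutations(range(n)):
        for i in range(n - 1):
            if apply_word(w, [i, i]) != apply_word(w, [i]):
                return False
            for j in range(i + 1, n - 1):
                if j - i > 1 and apply_word(w, [i, j]) != apply_word(w, [j, i]):
                    return False
                if j - i == 1 and apply_word(w, [i, j, i]) != apply_word(w, [j, i, j]):
                    return False
    return True


def reachable(n):
    """All permutations obtainable from the identity by applying generators."""
    start = tuple(range(n))
    seen, frontier = {start}, [start]
    while frontier:
        w = frontier.pop()
        for i in range(n - 1):
            v = apply_generator(w, i)
            if v not in seen:
                seen.add(v)
                frontier.append(v)
    return seen


print(relations_hold(4), len(reachable(4)))
```

In this model the fully reversed permutation is left unchanged by every generator, which matches the role of the zero element.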
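Matrix distance multiplication: Seaweed braids
Identity: $1 \boxdot x = x$
$1 = \begin{pmatrix} \bullet & \cdot & \cdot & \cdot \\ \cdot & \bullet & \cdot & \cdot \\ \cdot & \cdot & \bullet & \cdot \\ \cdot & \cdot & \cdot & \bullet \end{pmatrix}$
Zero: $0 \boxdot x = 0$
$0 = \begin{pmatrix} \cdot & \cdot & \cdot & \bullet \\ \cdot & \cdot & \bullet & \cdot \\ \cdot & \bullet & \cdot & \cdot \\ \bullet & \cdot & \cdot & \cdot \end{pmatrix}$
[Figure: the seaweed braids corresponding to the identity and zero elements]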
Matrix distance multiplication: Seaweed braids
Related structures:
positive braids: far comm; braid relations
braids: $g_i g_i^{-1} = 1$; far comm; braid relations
Coxeter's presentation of $S_n$: $g_i^2 = 1$; far comm; braid relations
locally free idempotent monoid: idem; far comm [Vershik+: 2000]
Generalisations:
general 0-Hecke monoids [Fomin, Greene: 1998; Buch+: 2008]
Coxeter monoids [Tsaranov: 1990; Richardson, Springer: 1990]
J-trivial monoids [Denton+: 2011]
Matrix distance multiplication: Seaweed braids
Computation in the seaweed monoid: a confluent rewriting system can be obtained by software (Semigroupe, GAP)
$T_3$: 1, a = $g_1$, b = $g_2$; ab, ba, aba = 0
aa → a, bb → b, bab → 0, aba → 0
$T_4$: 1, a = $g_1$, b = $g_2$, c = $g_3$; ab, ac, ba, bc, cb, aba, abc, acb, bac, bcb, cba, abac, abcb, acba, bacb, bcba, abacb, abcba, bacba, abacba = 0
aa → a, bb → b, ca → ac, cc → c
bab → aba, cbc → bcb, cbac → bcba, abacba → 0
Easy to use, but not an efficient algorithm
Matrix distance multiplication: Seaweed matrix multiplication
The implicit unit-Monge matrix $\boxdot$-multiplication problem
Given permutation matrices $P_A$, $P_B$, compute $P_C$ such that $P_A^\Sigma \odot P_B^\Sigma = P_C^\Sigma$ (equivalently, $P_A \boxdot P_B = P_C$)
Matrix $\odot$-multiplication: running time
type                   time                                               reference
general                $O(n^3)$                                           standard
general                $O\bigl(\frac{n^3 (\log \log n)^3}{\log^2 n}\bigr)$   [Chan: 2007]
Monge                  $O(n^2)$                                           via [Aggarwal+: 1987]
implicit unit-Monge    $O(n^{1.5})$                                       [T: 2006]
implicit unit-Monge    $O(n \log n)$                                      [T: 2010]
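As a reference point for the problem statement, the product can be computed naively by expanding both permutations into explicit distribution matrices, taking the distance product, and reading the result back through the density operator. The Python sketch below does exactly that in cubic time. It is not one of the fast algorithms in the table, and the encoding `perm[r] = c` and the function names are assumptions of the illustration.

```python
def distribution(perm):
    """Distribution matrix of the permutation matrix with a nonzero at (r, perm[r])."""
    n = len(perm)
    S = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(1, n + 1):
            S[i][j] = (S[i + 1][j] + S[i][j - 1] - S[i + 1][j - 1]
                       + (1 if perm[i] == j - 1 else 0))
    return S


def seaweed_multiply_naive(pa, pb):
    """Implicit product of two permutations via explicit matrices, O(n^3) time."""
    n = len(pa)
    A, B = distribution(pa), distribution(pb)
    C = [[min(A[i][j] + B[j][k] for j in range(n + 1)) for k in range(n + 1)]
         for i in range(n + 1)]
    pc = [None] * n
    for i in range(n):
        for k in range(n):
            # quadrangle difference of C at the half-integer point (i + 1/2, k + 1/2)
            if C[i][k + 1] - C[i][k] - C[i + 1][k + 1] + C[i + 1][k] == 1:
                pc[i] = k
    return pc


print(seaweed_multiply_naive([1, 0, 2], [0, 2, 1]))
print(seaweed_multiply_naive([1, 0, 2], [1, 0, 2]))   # idempotence of a crossing
```

Such a slow version is mainly useful as a test oracle for a faster implementation.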
Matrix distance multiplication: Seaweed matrix multiplication
[Figure sequence: permutation matrices $P_A$ and $P_B$ with unknown product $P_C$; their split into $P_{A,lo}$, $P_{A,hi}$ and $P_{B,lo}$, $P_{B,hi}$; the superimposed subproblem results $P_{C,lo} + P_{C,hi}$; and the corrected product $P_C$]
Matrix distance multiplication: Seaweed matrix multiplication
Implicit unit-Monge matrix $\boxdot$-multiplication: the algorithm
$P_C^\Sigma(i, k) = \min_j \bigl(P_A^\Sigma(i, j) + P_B^\Sigma(j, k)\bigr)$
Divide-and-conquer on the range of $j$
Divide $P_A$ horizontally, $P_B$ vertically: two subproblems of effective size $n/2$
$P_{A,lo}^\Sigma \odot P_{B,lo}^\Sigma = P_{C,lo}^\Sigma$, $P_{A,hi}^\Sigma \odot P_{B,hi}^\Sigma = P_{C,hi}^\Sigma$
Conquer:
$\ll$-low nonzeros of $P_{C,lo}$ and $\ll$-high nonzeros of $P_{C,hi}$ appear in $P_C$
The remaining nonzeros of $P_{C,lo}$ and $P_{C,hi}$ are “wrong”, and need to be corrected to obtain the remaining nonzeros of $P_C$
Correction can be done in time $O(n)$ using the unit-Monge property
Overall time $O(n \log n)$
Matrix distance multiplication: Bruhat order
Comparing permutations by the “degree of sortedness”
Bruhat order
Permutation A is lower (“more sorted”) than permutation B in the Bruhat order ($A \preceq B$), if B can be transformed to A by successive pairwise sorting between arbitrary pairs of elements.
Permutation matrices: $P_A \preceq P_B$, if $P_B$ can be transformed to $P_A$ by successive submatrix substitution: $\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \rightsquigarrow \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$
Matrix distance multiplication: Bruhat order
Bruhat comparability: running time
$O(n^2)$ folklore
$O(n \log n)$ [T: NEW]
$P_A \preceq P_B$ iff $P_A^\Sigma \le P_B^\Sigma$ elementwise, time $O(n^2)$ folklore
$P_A \preceq P_B$ iff $P_A^R \boxdot P_B = \mathrm{Id}^R$, time $O(n \log n)$ [T: NEW]
where $P^R$ denotes clockwise rotation of matrix $P$
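The folklore quadratic-time test amounts to an elementwise comparison of two distribution matrices. A minimal Python sketch follows, assuming permutations are given as lists with `perm[r] = c` for a nonzero in row r, column c; the function names are illustrative. The $O(n \log n)$ test via seaweed multiplication is not attempted here.

```python
def distribution(perm):
    """Distribution matrix of the permutation matrix with a nonzero at (r, perm[r])."""
    n = len(perm)
    S = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(1, n + 1):
            S[i][j] = (S[i + 1][j] + S[i][j - 1] - S[i + 1][j - 1]
                       + (1 if perm[i] == j - 1 else 0))
    return S


def bruhat_leq(pa, pb):
    """True if pa is lower than or equal to pb in the Bruhat order (O(n^2) test)."""
    A, B = distribution(pa), distribution(pb)
    return all(x <= y for row_a, row_b in zip(A, B) for x, y in zip(row_a, row_b))


print(bruhat_leq([0, 1, 2], [1, 0, 2]))   # the identity is the lowest element
print(bruhat_leq([1, 0, 2], [0, 2, 1]))   # two different crossings
print(bruhat_leq([0, 2, 1], [1, 0, 2]))
```

The last two calls compare two distinct elementary crossings in both directions; both return False, since neither can be obtained from the other by sorting.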
Semi-local string comparison: Semi-local LCS and edit distance
Consider strings (= sequences) over an alphabet of size σ
Distinguish contiguous substrings and not necessarily contiguous subsequences
Special cases of substring: prefix, suffix
Notation: strings a, b of length m, n respectively
Assume where necessary: m ≤ n; m, n reasonably close
The longest common subsequence (LCS) score:
length of longest string that is a subsequence of both a and b
equivalently, alignment score, where score(match) = 1 and score(mismatch) = 0
In biological terms, “loss-free alignment” (unlike “lossy” BLAST)
Semi-local string comparison: Semi-local LCS and edit distance
The LCS problem
Give the LCS score for a vs b
LCS: running time
$O(mn)$                                                 [Wagner, Fischer: 1974]
$O\bigl(\frac{mn}{\log^2 n}\bigr)$, $\sigma = O(1)$       [Masek, Paterson: 1980] [Crochemore+: 2003]
$O\bigl(\frac{mn (\log \log n)^2}{\log^2 n}\bigr)$        [Paterson, Dančík: 1994] [Bille, Farach-Colton: 2008]
Running time varies depending on the RAM model version
We assume word-RAM with word size $\log n$ (where it matters)
Semi-local string comparison: Semi-local LCS and edit distance
LCS on the alignment graph (directed, acyclic)
[Figure: alignment graph with rows labelled by the characters of “BAABCBCA” and columns by the characters of “BAABCABCABACA”; blue edges have score 0, red edges have score 1]
score(“BAABCBCA”, “BAABCABCABACA”) = len(“BAABCBCA”) = 8
LCS = highest-score path from top-left to bottom-right
Semi-local string comparison: Semi-local LCS and edit distance
LCS: dynamic programming [WF: 1974]
Sweep cells in any $\ll$-compatible order
Cell update: time $O(1)$
Overall time $O(mn)$
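The cell sweep can be sketched in Python with a row-by-row order, which is one of the admissible sweep orders. The function name `lcs_score` is illustrative, and the two sample pairs are the ones used on the alignment graph slides.

```python
def lcs_score(a, b):
    """LCS score of strings a and b by row-by-row dynamic programming, O(mn) time."""
    prev = [0] * (len(b) + 1)              # scores of the previous row of cells
    for ch in a:
        cur = [0]
        for j, bj in enumerate(b, 1):
            if ch == bj:
                cur.append(prev[j - 1] + 1)           # match: extend the diagonal
            else:
                cur.append(max(prev[j], cur[j - 1]))  # skip a character of a or b
        prev = cur
    return prev[-1]


print(lcs_score("BAABCBCA", "BAABCABCABACA"))
print(lcs_score("BAABCBCA", "CABCABA"))
```

Only two rows are kept at a time, so the memory use is linear in the length of b.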
Semi-local string comparison: Semi-local LCS and edit distance
‘Begin at the beginning,’ the King said gravely, ‘and go on till you come to the end: then stop.’
L. Carroll, Alice in Wonderland
(The standard approach in dynamic programming)
Semi-local string comparison: Semi-local LCS and edit distance
Sometimes dynamic programming can be run from both ends for extra flexibility
Is there a better, fully flexible alternative (e.g. for comparing compressed strings, comparing strings dynamically or in parallel, etc.)?
Semi-local string comparison: Semi-local LCS and edit distance
LCS: micro-block dynamic programming [MP: 1980; BF: 2008]
Sweep cells in micro-blocks, in any $\ll$-compatible order
Micro-block size:
$t = O(\log n)$ when $\sigma = O(1)$
$t = O\bigl(\frac{\log n}{\log \log n}\bigr)$ otherwise
Micro-block interface:
$O(t)$ characters, each $O(\log \sigma)$ bits, can be reduced to $O(\log t)$ bits
$O(t)$ small integers, each $O(1)$ bits
Micro-block update: time $O(1)$, by precomputing all possible interfaces
Overall time $O\bigl(\frac{mn}{\log^2 n}\bigr)$ when $\sigma = O(1)$, $O\bigl(\frac{mn (\log \log n)^2}{\log^2 n}\bigr)$ otherwise
Semi-local string comparison: Semi-local LCS and edit distance
The semi-local LCS problem
Give the (implicit) matrix of $O\bigl((m + n)^2\bigr)$ LCS scores:
string-substring LCS: string a vs every substring of b
prefix-suffix LCS: every prefix of a vs every suffix of b
suffix-prefix LCS: every suffix of a vs every prefix of b
substring-string LCS: every substring of a vs string b
Cf.: dynamic programming gives prefix-prefix LCS
Semi-local string comparison: Semi-local LCS and edit distance
Semi-local LCS on the alignment graph
[Figure: alignment graph of “BAABCBCA” (rows) vs “BAABCABCABACA” (columns), with the substring “CABCABA” marked; blue edges have score 0, red edges have score 1]
score(“BAABCBCA”, “CABCABA”) = len(“ABCBA”) = 5
String-substring LCS: all highest-score top-to-bottom paths
Semi-local LCS: all highest-score boundary-to-boundary paths
Semi-local string comparison: Score matrices and seaweed matrices
The score matrix H
0 1 2 3 4 5 6 6 7 8 8 8 8 8
-1 0 1 2 3 4 5 5 6 7 7 7 7 7
-2 -1 0 1 2 3 4 4 5 6 6 6 6 7
-3 -2 -1 0 1 2 3 3 4 5 5 6 6 7
-4 -3 -2 -1 0 1 2 2 3 4 4 5 5 6
-5 -4 -3 -2 -1 0 1 2 3 4 4 5 5 6
-6 -5 -4 -3 -2 -1 0 1 2 3 3 4 4 5
-7 -6 -5 -4 -3 -2 -1 0 1 2 2 3 3 4
-8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 3 4
-9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4
-10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3
-11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2
-12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1
-13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0
a = “BAABCBCA”
b = “BAABCABCABACA”
$H(i, j) = \mathrm{score}(a, b\langle i : j \rangle)$; for example, $H(4, 11) = 5$
$H(i, j) = j - i$ if $i > j$
Semi-local string comparison: Score matrices and seaweed matrices
Semi-local LCS: output representation and running time
size             query time       reference
$O(n^2)$         $O(1)$           trivial
$O(m^{1/2} n)$   $O(\log n)$      string-substring [Alves+: 2003]
$O(n)$           $O(n)$           string-substring [Alves+: 2005]
$O(n \log n)$    $O(\log^2 n)$    [T: 2006]
. . . or any 2D orthogonal range counting data structure
running time                                          reference
$O(mn^2)$                                             naive
$O(mn)$                                               string-substring [Schmidt: 1998; Alves+: 2005]
$O(mn)$                                               [T: 2006]
$O\bigl(\frac{mn}{\log^{0.5} n}\bigr)$                  [T: 2006]
$O\bigl(\frac{mn (\log \log n)^2}{\log^2 n}\bigr)$      [T: 2007]
Semi-local string comparison: Score matrices and seaweed matrices
The score matrix H and the seaweed matrix P
$H(i, j)$: the number of matched characters for a vs substring $b\langle i : j \rangle$
$j - i - H(i, j)$: the number of unmatched characters
Properties of matrix $j - i - H(i, j)$:
simple unit-Monge
therefore, $= P^\Sigma$, where $P = -H^\square$ is a permutation matrix
P is the seaweed matrix, giving an implicit representation of H
Range tree for P: memory $O(n \log n)$, query time $O(\log^2 n)$
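The relation between H and P can be observed on the running example with a naive Python sketch. It builds H by one dynamic-programming run per starting position (the $O(mn^2)$ "naive" method, not a seaweed algorithm) and then takes quadrangle differences of $j - i - H(i, j)$. The function names are illustrative. The sketch only looks at the index window $0 \le i, j \le n$, so it is assumed to see only part of the full seaweed matrix: at most one nonzero per row and per column.

```python
def lcs_last_row(a, b):
    """Entry j of the result is the LCS score of a against the prefix b[:j]."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, bj in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ch == bj else max(prev[j], cur[j - 1]))
        prev = cur
    return prev


def score_matrix(a, b):
    """H[i][j] = score(a, b[i:j]) for i <= j, and j - i for i > j."""
    n = len(b)
    H = [[j - i for j in range(n + 1)] for i in range(n + 1)]
    for i in range(n + 1):
        row = lcs_last_row(a, b[i:])
        for j in range(i, n + 1):
            H[i][j] = row[j - i]
    return H


def seaweed_window(H):
    """Quadrangle differences of j - i - H(i, j) inside the window."""
    n = len(H) - 1
    E = [[j - i - H[i][j] for j in range(n + 1)] for i in range(n + 1)]
    return [[E[i][j + 1] - E[i][j] - E[i + 1][j + 1] + E[i + 1][j]
             for j in range(n)] for i in range(n)]


a, b = "BAABCBCA", "BAABCABCABACA"
H = score_matrix(a, b)
P = seaweed_window(H)
print(H[4][11])
print([(i, j) for i, row in enumerate(P) for j, v in enumerate(row) if v])
```

Counting the printed nonzeros with row index at least 4 and column index below 11 gives the dominance sum that is subtracted from 11 - 4 to obtain H(4, 11).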
Semi-local string comparison: Score matrices and seaweed matrices
The score matrix H and the seaweed matrix P
[Figure: the score matrix H for a = “BAABCBCA”, b = “BAABCABCABACA”, overlaid with the nonzeros of P]
blue: difference in H is 0
red: difference in H is 1
green: P(i, j) = 1
$H(i, j) = j - i - P^\Sigma(i, j)$
Semi-local string comparison: Score matrices and seaweed matrices
The score matrix H and the seaweed matrix P
[Figure: the score matrix H for a = “BAABCBCA”, b = “BAABCABCABACA”, with the nonzeros of P marked]
$H(4, 11) = 11 - 4 - P^\Sigma(4, 11) = 11 - 4 - 2 = 5$
Semi-local string comparison: Score matrices and seaweed matrices
The seaweed braid in the alignment graph
[Figure: seaweed braid drawn over the alignment graph of a (rows) vs b (columns), with the substring “CABCABA” marked]
a = “BAABCBCA”
b = “BAABCABCABACA”
$H(4, 11) = 11 - 4 - P^\Sigma(4, 11) = 11 - 4 - 2 = 5$
$P(i, j) = 1$ corresponds to seaweed top $i$ ⇝ bottom $j$
Also define top ⇝ right, left ⇝ right, left ⇝ bottom seaweeds
Gives bijection between top-left and bottom-right graph boundaries
Semi-local string comparison: Score matrices and seaweed matrices
Seaweed braid: a highly symmetric object (element of the 0-Hecke monoid of the symmetric group)
Can be built recursively by assembling subbraids from separate parts
Highly flexible: local alignment, compression, parallel computation. . .
Semi-local string comparison: Weighted alignment
The LCS problem is a special case of the weighted alignment score problem with weighted matches ($w_M$), mismatches ($w_X$) and gaps ($w_G$)
LCS score: $w_M = 1$, $w_X = w_G = 0$
Levenshtein score: $w_M = 2$, $w_X = 1$, $w_G = 0$
Alignment score is rational, if $w_M$, $w_X$, $w_G$ are rational numbers
Equivalent to LCS score on blown-up strings
Edit distance: minimum cost to transform a into b by weighted character edits (insertion, deletion, substitution)
Corresponds to weighted alignment score with $w_M = 0$, insertion/deletion weight $-w_G$, substitution weight $-w_X$
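The weighted score itself can be computed by the same quadratic dynamic programming as the LCS score, with the three weights as parameters. The Python sketch below is a plain global alignment, not the blow-up reduction to LCS. The function name and argument names are illustrative, and the gap weight is assumed to be charged once per unaligned character.

```python
def alignment_score(a, b, w_match, w_mismatch, w_gap):
    """Maximum weighted global alignment score of a and b, O(mn) time."""
    prev = [j * w_gap for j in range(len(b) + 1)]    # empty prefix of a vs b[:j]
    for i, ch in enumerate(a, 1):
        cur = [i * w_gap]                             # a[:i] vs empty prefix of b
        for j, bj in enumerate(b, 1):
            pair = w_match if ch == bj else w_mismatch
            cur.append(max(prev[j - 1] + pair,        # align the two characters
                           prev[j] + w_gap,           # leave a's character unaligned
                           cur[j - 1] + w_gap))       # leave b's character unaligned
        prev = cur
    return prev[-1]


a, b = "BAABCBCA", "CABCABA"
print(alignment_score(a, b, 1, 0, 0))    # LCS score
print(alignment_score(a, b, 2, 1, 0))    # Levenshtein score
print(-alignment_score("kitten", "sitting", 0, -1, -1))   # unit-cost edit distance
```

The last call illustrates the correspondence stated above: with match weight 0 and negated edit costs as weights, the edit distance is the negated alignment score.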
Weighted alignment graph
[Figure: weighted alignment graph for “BAABCBCA” (vertical) vs “BAABCABCABACA” (horizontal), with the substring “CABCABA” highlighted; blue = 0, red (solid) = 2, red (dotted) = 1]
Levenshtein score(“BAABCBCA”, “CABCABA”) = 11
Alignment graph for blown-up strings
[Figure: alignment graph for the blown-up strings “$B$A$A$B$C$B$C$A” (vertical) vs “$B$A$A$B$C$A$B$C$A$B$A$C$A” (horizontal), with the substring “$C$A$B$C$A$B$A” highlighted; blue = 0, red = 0.5 or 1]
Levenshtein score(“BAABCBCA”, “CABCABA”) = 2 · 5.5
Rational-weighted semi-local alignment reduced to semi-local LCS
[Figure: alignment graph for the blown-up strings “$B$A$A$B$C$B$C$A” (vertical) vs “$B$A$A$B$C$A$B$C$A$B$A$C$A” (horizontal)]
Let wM = 1, wX = µ/ν, wG = 0
Increase × ν^2 in complexity (can be reduced to ν)
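The reduction can be checked on the running example with a small sketch. It assumes the blow-up replaces every character γ by µ copies of a filler character followed by ν − µ copies of γ, with the filler `$` not occurring in the input; for µ = 1, ν = 2 this gives the “$B$A…” strings shown in the figure. The helper names `blow_up` and `lcs_length` are illustrative.

```python
def lcs_length(x, y):
    """Plain LCS length by dynamic programming."""
    prev = [0] * (len(y) + 1)
    for cx in x:
        cur = [0]
        for j, cy in enumerate(y, 1):
            cur.append(prev[j - 1] + 1 if cx == cy else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def blow_up(s, mu, nu, filler="$"):
    """Replace each character c by filler*mu + c*(nu - mu) (assumed blow-up rule)."""
    return "".join(filler * mu + c * (nu - mu) for c in s)


a, b = "BAABCBCA", "CABCABA"
# wM = 1, wX = 1/2, wG = 0 corresponds to mu = 1, nu = 2.
blown_lcs = lcs_length(blow_up(a, 1, 2), blow_up(b, 1, 2))
print(blown_lcs)        # LCS score of the blown-up strings
print(blown_lcs / 2)    # normalised alignment score (divide by nu)
```

The LCS score of the blown-up strings is ν times the normalised alignment score, which is where the extra factor in running time comes from: both strings grow by a factor of ν.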
The seaweed method: Seaweed combing
[Figure: step-by-step animation of seaweed combing on the alignment graph of “BAABCBCA” (vertical) vs “BAABCABCABACA” (horizontal)]
Semi-local LCS: seaweed combing [T: 2006]
Initialise seaweed braid: crossings in all mismatch cells
Sweep cells in any ≪-compatible order
Match cell: two seaweeds uncrossed; skip
Mismatch cell: two seaweeds cross
if the same seaweeds crossed before, uncross them
otherwise skip, keep seaweeds crossed
Cell update: time O(1)
Overall time O(mn)
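The combing rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's code: cells are swept in row-major order (taken here as one valid sweep order), seaweeds entering on the left are labelled 0..m−1 from bottom to top and seaweeds entering at the top are labelled m..m+n−1 from left to right, so that "crossed before" reduces to a label comparison. The query counts seaweeds naively instead of using a range-counting structure.

```python
def seaweed_comb(a, b):
    """Return (h, v): labels of seaweeds leaving on the right (per row)
    and at the bottom (per column) after combing a vs b."""
    m, n = len(a), len(b)
    h = list(range(m - 1, -1, -1))   # h[i]: seaweed travelling along row i
    v = list(range(m, m + n))        # v[j]: seaweed travelling down column j
    for i in range(m):
        for j in range(n):
            # Match cell, or the two seaweeds have already crossed
            # (the horizontal one carries the larger label): they do not cross,
            # so the horizontal seaweed turns down and the vertical one turns right.
            if a[i] == b[j] or h[i] > v[j]:
                h[i], v[j] = v[j], h[i]
            # Otherwise the seaweeds cross: both continue straight, nothing to do.
    return h, v


def string_substring_lcs(v, m, i, j):
    """LCS(a, b[i:j]) from the bottom labels v: substring length minus the number
    of seaweeds that enter at the top in column >= i and leave at the bottom in column < j."""
    through = sum(1 for c in range(i, j) if v[c] >= m + i)
    return (j - i) - through


a, b = "BAABCBCA", "BAABCABCABACA"
h, v = seaweed_comb(a, b)
print(string_substring_lcs(v, len(a), 0, len(b)))   # LCS of a vs the whole of b
print(string_substring_lcs(v, len(a), 3, 10))       # LCS of a vs b[3:10] = "BCABCAB"
```

A single O(mn) combing pass therefore answers every string-substring query; only the counting step differs between queries.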
The seaweed method: Micro-block seaweed combing
[Figure: step-by-step animation of micro-block seaweed combing on the alignment graph of “BAABCBCA” (vertical) vs “BAABCABCABACA” (horizontal)]
Semi-local LCS: micro-block seaweed combing [T: 2007]
Initialise seaweed braid: crossings in all mismatch cells
Sweep cells in micro-blocks, in any ≪-compatible order
Micro-block size: t = O(log n / log log n)
Micro-block interface:
O(t) characters, each O(log σ) bits, can be reduced to O(log t) bits
O(t) integers, each O(log n) bits, can be reduced to O(log t) bits
Micro-block update: time O(1), by precomputing all possible interfaces
Overall time O(mn (log log n)^2 / log^2 n)
The seaweed method: Cyclic LCS
The cyclic LCS problem
Give the maximum LCS score for a vs all cyclic rotations of b
Cyclic LCS: running time
O(mn^2 / log n) naive
O(mn log m) [Maes: 1990]
O(mn) [Bunke, Buhler: 1993; Landau+: 1998; Schmidt: 1998]
O(mn (log log n)^2 / log^2 n) [T: 2007]
Cyclic LCS: the algorithm
Micro-block seaweed combing on a vs bb, time O(mn (log log n)^2 / log^2 n)
Make n string-substring LCS queries, time negligible
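The two steps can be sketched with plain cell-by-cell combing in place of the micro-block version, so the sketch runs in O(mn) for the combing plus O(n^2) for the naive queries. The labelling convention (left seaweeds 0..m−1 bottom to top, top seaweeds m..m+2n−1 left to right) and the names `comb_bottom` and `cyclic_lcs` are assumptions made for this illustration.

```python
def comb_bottom(a, b):
    """Comb a vs b cell by cell; return the labels of seaweeds leaving at the bottom."""
    m, n = len(a), len(b)
    h = list(range(m - 1, -1, -1))
    v = list(range(m, m + n))
    for i in range(m):
        for j in range(n):
            if a[i] == b[j] or h[i] > v[j]:
                h[i], v[j] = v[j], h[i]
    return v


def cyclic_lcs(a, b):
    """Maximum LCS score of a vs all cyclic rotations of b."""
    m, n = len(a), len(b)
    v = comb_bottom(a, b + b)          # one combing pass on a vs bb
    best = 0
    for r in range(n):                 # rotation r of b is the window (bb)[r:r+n]
        through = sum(1 for c in range(r, r + n) if v[c] >= m + r)
        best = max(best, n - through)  # string-substring LCS query
    return best


print(cyclic_lcs("AB", "BA"))
print(cyclic_lcs("BAABCBCA", "CABCABA"))
```

Each rotation of b is a length-n window of bb, which is why a single semi-local computation on a vs bb covers all n rotations.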
The seaweed method: Longest repeating subsequence
The longest repeating subsequence problem
Find the longest subsequence of a that is a square (a repetition of two identical strings)
Longest repeating subsequence: running time
O(m^3) naive
O(m^2) [Kosowski: 2004]
O(m^2 (log log m)^2 / log^2 m) [T: 2007]
Longest repeating subsequence: the algorithm
Micro-block seaweed combing on a vs a, time O(m^2 (log log m)^2 / log^2 m)
Make m − 1 suffix-prefix LCS queries, time negligible
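For comparison, the naive cubic baseline can be sketched directly: a square subsequence splits a at some position k into a prefix and a suffix that share one half each, so it suffices to take the best LCS over the m − 1 split points. This is the naive method, not the seaweed-based one, and the function names are illustrative.

```python
def lcs_length(x, y):
    """Plain LCS length by dynamic programming."""
    prev = [0] * (len(y) + 1)
    for cx in x:
        cur = [0]
        for j, cy in enumerate(y, 1):
            cur.append(prev[j - 1] + 1 if cx == cy else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def longest_square_subsequence_length(a):
    """Length of the longest subsequence of a of the form ww, in O(m^3) time."""
    best = 0
    for k in range(1, len(a)):                 # m - 1 split points
        best = max(best, lcs_length(a[:k], a[k:]))
    return 2 * best


print(longest_square_subsequence_length("abab"))
print(longest_square_subsequence_length("BAABCBCA"))
```

The seaweed approach replaces the m − 1 independent LCS computations by one semi-local computation on a vs a followed by m − 1 cheap queries.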
The seaweed method: Approximate matching
The approximate pattern matching problem
Give the substring closest to a by alignment score, starting at each position in b
Assume rational alignment score
Approximate pattern matching: running time
O(mn) [Sellers: 1980]
O(mn / log n), σ = O(1), via [Masek, Paterson: 1980]
O(mn (log log n)^2 / log^2 n) via [Bille, Farach-Colton: 2008]
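The O(mn) baseline can be sketched with the classical column-by-column dynamic programme usually attributed to Sellers. The sketch makes two simplifying assumptions that differ from the slide's formulation: it uses unit-cost edit distance rather than a general alignment score, and it reports the best substring ending at each position of the text (the start-position variant follows by reversing both strings). The function name is illustrative.

```python
def best_match_ending_at(pattern, text):
    """out[j] = min over i <= j of edit_distance(pattern, text[i:j]), for j = 0..len(text)."""
    m = len(pattern)
    col = list(range(m + 1))           # pattern prefixes vs the empty substring
    out = [col[m]]
    for ch in text:
        new = [0] * (m + 1)            # new[0] = 0: a match may start anywhere for free
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            new[i] = min(col[i - 1] + cost,   # substitute or match
                         col[i] + 1,          # extra text character
                         new[i - 1] + 1)      # missing pattern character
        col = new
        out.append(col[m])
    return out


print(best_match_ending_at("abc", "xxabcxx"))
print(best_match_ending_at("BAABCBCA", "BAABCABCABACA"))
```

The only change from the global edit-distance recurrence is the zero in the first row, which lets an occurrence begin at any text position.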
Approximate pattern matching: the algorithm
Micro-block seaweed combing on a vs b (with blow-up), time O(mn (log log n)^2 / log^2 n)
The implicit semi-local edit score matrix:
an anti-Monge matrix
approximate pattern matching ∼ row minima
Row minima in O(n) element queries [Aggarwal+: 1987]
Each query in time O(log^2 n) using the range tree representation, combined query time negligible
Overall running time O(mn (log log n)^2 / log^2 n), same as [Bille, Farach-Colton: 2008]
Periodic string comparison: Wraparound seaweed combing
The periodic string-substring LCS problem
Give (implicit) LCS scores for a vs each substring of b = . . . uuu . . . = u^{±∞}
Let u be of length p
May assume that every character of a occurs in u
Only substrings of b of length at most mp (otherwise LCS score is m)
[Figure: step-by-step animation of wraparound seaweed combing on the alignment graph of “BAABCBCA” (vertical) vs “BAABCABCABACA” (horizontal)]
Periodic string-substring LCS: wraparound seaweed combing
Initialise seaweed braid: crossings in all mismatch cells
Sweep cells row-by-row: each row starts at match cell, wraps at boundary
Match cell: two seaweeds uncrossed; skip
Mismatch cell: two seaweeds cross
if the same seaweeds crossed before (with wrapping), uncross them
otherwise skip, keep seaweeds crossed
Cell update: time O(1)
Overall time O(mn)
String-substring LCS score: count seaweeds with multiplicities
The tandem LCS problem
Give LCS score for a vs b = u^k
We have n = kp; may assume k ≤ m
Tandem LCS: running time
O(mkp) naive
O(m(k + p)) [Landau, Ziv-Ukelson: 2001]
O(mp) [T: 2009]
Direct application of wraparound seaweed combing
The tandem alignment problem
Give the substring closest to a by alignment score among certain substrings of b = u^{±∞}:
global: substrings u^k of length kp across all k
cyclic: substrings of length kp across all k
local: substrings of any length
Tandem alignment: running time
O(m^2 p) all naive
O(mp) global [Myers, Miller: 1989]
O(mp log p) cyclic [Benson: 2005]
O(mp) cyclic [T: 2009]
O(mp) local [Myers, Miller: 1989]
Cyclic tandem alignment: the algorithm
Periodic seaweed combing for a vs b (with blow-up), time O(mp)
For each k ∈ [1 : m]:
solve tandem LCS (under given alignment score) for a vs u^k
obtain scores for a vs p successive substrings of b of length kp by LCS batch query: time O(1) per substring
Running time O(mp)
The transposition network method: Transposition networks
Comparison network: a circuit of comparators
A comparator sorts two inputs and outputs them in prescribed order
Comparison networks traditionally used for non-branching merging/sorting
Classical comparison networks (# comparators)
merging: O(n log n) [Batcher: 1968]
sorting: O(n log^2 n) [Batcher: 1968]; O(n log n) [Ajtai+: 1983]
Comparison networks are visualised by wire diagrams
Transposition network: all comparisons are between adjacent wires
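A familiar example of a transposition network is odd-even transposition sort, where every comparator joins two adjacent wires. It is not one of the networks cited above and is used here only to illustrate the definition; the function name and the returned comparator count are illustrative.

```python
def odd_even_transposition_sort(values):
    """Run the odd-even transposition network on the given wire values.

    Returns the sorted wires and the number of comparators in the network.
    The comparator positions are fixed in advance: only the swaps depend on the data.
    """
    wires = list(values)
    n = len(wires)
    comparators = 0
    for rnd in range(n):                         # n rounds suffice for n wires
        for i in range(rnd % 2, n - 1, 2):       # alternate even and odd adjacent pairs
            comparators += 1
            if wires[i] > wires[i + 1]:          # comparator on adjacent wires i, i+1
                wires[i], wires[i + 1] = wires[i + 1], wires[i]
    return wires, comparators


# Anti-sorted input, as in the seaweed interpretation of a transposition network.
print(odd_even_transposition_sort([7, 5, 3, 1, -1, -3, -5, -7]))
```

Because the comparator layout does not depend on the data, the same network can be evaluated without branching, which is the property the talk exploits.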
Seaweed combing as a transposition network
[Figure: wire diagram of the transposition network for “ACBC” vs “ABCA”, with wires carrying the values ±1, ±3, ±5, ±7]
Character mismatches correspond to comparators
Inputs anti-sorted (sorted in reverse); each value traces a seaweed
Global LCS: transposition network with binary input
[Figure: wire diagram of the transposition network for “ACBC” vs “ABCA”, with wires carrying the binary values 0 and 1]
Inputs still anti-sorted, but may not be distinct
Comparison between equal values is indeterminate
The transposition network method: Parameterised string comparison
String comparison sensitive e.g. to
low similarity: small λ = LCS(a, b)
high similarity: small κ = dist_LCS(a, b) = m + n − 2λ
Can also use weighted alignment score or edit distance
Assume m = n, therefore κ = 2(n − λ)
Low-similarity comparison: small λ
sparse set of matches, may need to look at them all
preprocess matches for fast searching, time O(n log σ)
High-similarity comparison: small κ
set of matches may be dense, but only need to look at small subset
no need to preprocess, linear search is OK
Flexible comparison: sensitive to both high and low similarity, e.g. by both comparison types running alongside each other
Parameterised string comparison: running time
Low-similarity, after preprocessing in O(n log σ)
O(nλ) [Hirschberg: 1977] [Apostolico, Guerra: 1985] [Apostolico+: 1992]
High-similarity, no preprocessing
O(n · κ) [Ukkonen: 1985] [Myers: 1986]
Flexible
O(λ · κ · log n) no preproc [Myers: 1986; Wu+: 1990]
O(λ · κ) after preproc [Rick: 1995]
Parameterised string comparison: the waterfall algorithm
Low-similarity: O(n · λ)
High-similarity: O(n · κ)
[Figure: two binary transposition networks, one for the low-similarity case and one for the high-similarity case, with 0/1 values on the wires]
Trace 0s through network in contiguous blocks and gaps
The transposition network method: Dynamic string comparison
The dynamic LCS problem
Maintain current LCS score under updates to one or both input strings
Both input strings are streams, updated on-line:
appending characters at left or right
deleting characters at left or right
Assume for simplicity m ≈ n, i.e. m = Θ(n)
Goal: linear time per update
O(n) per update of a (n = |b|)
O(m) per update of b (m = |a|)
Dynamic LCS in linear time: update models
left       right
–          app+del    standard DP [Wagner, Fischer: 1974]
app        app        a fixed [Landau+: 1998], [Kim, Park: 2004]
app        app        [Ishida+: 2005]
app+del    app+del    [T: NEW]
Main idea:
for append only, maintain seaweed matrix P_{a,b}
for append+delete, maintain partial seaweed layout by tracing a transposition network
The transposition network method: Bit-parallel string comparison
String comparison using standard instructions on words of size w
Bit-parallel string comparison: running time
O(mn/w) [Allison, Dix: 1986; Myers: 1999; Crochemore+: 2001]
Bit-parallel string comparison: binary transposition network
In every cell: input bits s, c; output bits s′, c′; match/mismatch flag µ
[Figure: cell with inputs s, c, flag µ and outputs s′, c′]
s   0 1 0 1 0 1 0 1
c   0 0 1 1 0 0 1 1
µ   0 0 0 0 1 1 1 1
s′  0 1 1 1 0 0 1 1
c′  0 0 0 1 0 1 0 1
[Figure: the cell drawn with an adder (+) and an AND gate (∧)]
s   0 1 0 1 0 1 0 1
c   0 0 1 1 0 0 1 1
µ   0 0 0 0 1 1 1 1
s′  0 1 1 0 0 0 1 1
c′  0 0 0 1 0 1 0 1
2c + s ← (s + (s ∧ µ) + c) ∨ (s ∧ ¬µ)
S ← (S + (S ∧ M)) ∨ (S ∧ ¬M), where S, M are words of bits s, µ
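The word-level update can be tried out with Python integers standing in for machine words. This is a minimal sketch under stated assumptions: bit i of S corresponds to position i of a, S starts as all ones, M is the match mask of the current character of b against a, and the LCS score is read off as the number of zero bits in the final S. The function name is illustrative.

```python
def bit_parallel_lcs(a, b):
    """LCS length of a and b via S <- (S + (S & M)) | (S & ~M)."""
    m = len(a)
    full = (1 << m) - 1                  # m low-order ones: the "word" for string a
    masks = {}
    for i, ch in enumerate(a):           # masks[ch]: positions of a holding ch
        masks[ch] = masks.get(ch, 0) | (1 << i)
    s = full
    for ch in b:
        mk = masks.get(ch, 0)
        # One addition propagates carries through runs of ones, which plays the
        # role of the carry bit c passed between neighbouring cells.
        s = ((s + (s & mk)) | (s & ~mk)) & full
    return m - bin(s).count("1")         # number of zero bits in S


print(bit_parallel_lcs("BAABCBCA", "BAABCABCABACA"))
print(bit_parallel_lcs("AB", "BA"))
```

With genuine w-bit words the string a is split into ⌈m/w⌉ words and the carry is passed between them, giving the O(mn/w) bound quoted above.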
Sparse string comparison: Semi-local LCS between permutations
The LCS problem on permutation strings
Give LCS score for a vs b
In each of a, b all characters distinct: total m = n matches
Equivalent to
longest increasing subsequence (LIS) in a string
maximum clique in a permutation graph
maximum planar matching in an embedded bipartite graph
LCS on permutation strings: running time
O(n log n) implicit in [Erdos, Szekeres: 1935] [Robinson: 1938; Knuth: 1970; Dijkstra: 1980]
O(n log log n) unit-RAM [Chang, Wang: 1992] [Bespamyatnikh, Segal: 2000]
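The equivalence with LIS gives a short O(n log n) method: relabel each character of b by its position in a and find a longest increasing subsequence of the resulting numbers. The sketch below uses the standard patience-sorting technique with binary search; it is a textbook method rather than any specific algorithm from the cited papers, and the function name is illustrative.

```python
from bisect import bisect_left


def permutation_lcs(a, b):
    """LCS length of two strings with all characters distinct within each string."""
    pos = {ch: i for i, ch in enumerate(a)}      # position of each character in a
    tails = []   # tails[k]: smallest possible last position of an increasing run of length k+1
    for ch in b:
        if ch not in pos:                        # character absent from a: cannot match
            continue
        p = pos[ch]
        k = bisect_left(tails, p)
        if k == len(tails):
            tails.append(p)                      # extends the longest run found so far
        else:
            tails[k] = p                         # improves the tail of a shorter run
    return len(tails)


print(permutation_lcs("CFAEDHGB", "DEHCBAFG"))
```

A common subsequence of a and b is exactly a set of characters whose positions increase in both strings, which is why the LIS of the relabelled sequence equals the LCS score.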
The semi-local LCS problem on permutation strings
Give semi-local LCS scores of a vs b
In each of a, b all characters distinct: total m = n matches
Equivalent to
longest increasing subsequence (LIS) in every substring of a string
Semi-local LCS on permutation strings: running time
O(n^2 log n) naive
O(n^2) restricted [Albert+: 2003; Chen+: 2005]
O(n^1.5 log n) randomised, restricted [Albert+: 2007]
O(n^1.5) [T: 2006]
O(n log^2 n) [T: NEW]
[Figure: alignment graph for the permutation strings “CFAEDHGB” (vertical) vs “DEHCBAFG” (horizontal), shown in several animation steps]
Semi-local LCS on permutation strings: the algorithm
Divide-and-conquer on the alignment graph
Divide graph (say) horizontally; two subproblems of effective size n/2
Conquer: seaweed matrix ⊡-multiplication, time O(n log n)
Overall time O(n log^2 n)
Sparse string comparison: Longest piecewise monotone subsequences
A k-increasing sequence: a concatenation of k increasing sequences
A k-modal sequence: a concatenation of k alternating increasing and decreasing sequences
The longest k-increasing (k-modal) subsequence problem
Give the longest k-increasing (k-modal) subsequence of string b
Longest k-increasing (k-modal) subsequence: running time
O(nk log n) k-modal [Demange+: 2007]
O(nk log n) via [Hunt, Szymanski: 1977]
O(n log^2 n) [T: NEW]
Main idea: LCS for id^k (respectively, (id · reversed id)^{k/2}) vs b, where id is the identity permutation string
Alexander Tiskin (Warwick) Semi-local string comparison 94 / 132
Sparse string comparison: Longest piecewise monotone subsequences
Longest k-increasing subsequence: algorithm A
Sparse LCS for id^k vs b: time O(nk log n)
Longest k-increasing subsequence: algorithm B
Compute seaweed matrix for id vs b: time O(n log^2 n)
Extract three-way seaweed submatrix
Compute three-way seaweed submatrix for id^k vs b by log k instances of seaweed matrix ⊡-square/multiply: time log k · O(n log n) = O(n log^2 n)
Query the LCS score for id^k vs b: time negligible
Overall time O(n log^2 n)
Algorithm B is faster than Algorithm A for k ≥ log n
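The reduction behind both algorithms can be checked on small inputs with a plain dynamic-programming LCS. This is only a sketch of the reduction, assuming b has distinct characters. It uses the textbook quadratic LCS rather than the sparse Hunt-Szymanski method of algorithm A or the seaweed method of algorithm B, and the function names are illustrative.

```python
def lcs_length(a, b):
    """Textbook LCS length by dynamic programming, O(len(a) * len(b)) time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def longest_k_increasing(b, k):
    """Length of the longest k-increasing subsequence of b (characters assumed distinct).

    A common subsequence of id^k and b is exactly a subsequence of b that splits
    into at most k increasing runs, where id is the sorted version of b.
    """
    ident = sorted(b)
    return lcs_length(ident * k, b)

for k in (1, 2, 3, 4):
    print(k, longest_k_increasing("DEHCBAFG", k))
```

For k = 1 this is the ordinary LIS; the answer grows with k until the whole string fits into k increasing runs.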
Sparse string comparison: Maximum clique in a circle graph
The maximum clique problem in a circle graph
Given a circle with n chords, find the maximum-size subset of pairwise intersecting chords
Standard reduction to an interval model: cut the circle and lay it out on the line; chords become intervals (here drawn as square diagonals)
Chords intersect iff intervals overlap, i.e. intersect without containment
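The interval model can be made concrete with a brute-force search, corresponding to the exponential naive method. This is a sketch for tiny inputs only, assuming intervals are given as (left, right) pairs with all endpoints distinct; the names `overlap` and `max_clique_bruteforce` are illustrative.

```python
from itertools import combinations

def overlap(u, v):
    """True iff intervals u and v intersect without containment (their chords cross)."""
    (a, b), (c, d) = sorted([u, v])  # now a < c, since endpoints are distinct
    return c < b < d

def max_clique_bruteforce(intervals):
    """Largest set of pairwise overlapping intervals, by exhaustive search."""
    for size in range(len(intervals), 0, -1):
        for subset in combinations(intervals, size):
            if all(overlap(u, v) for u, v in combinations(subset, 2)):
                return list(subset)
    return []

chords = [(0, 3), (1, 5), (2, 6), (4, 7)]
print(max_clique_bruteforce(chords))
```

In this example (0, 3) and (4, 7) are disjoint, so no clique can contain all four chords.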
Sparse string comparison: Maximum clique in a circle graph
Maximum clique in a circle graph: running time
exp(n)        naive
O(n^3)        [Gavril: 1973]
O(n^2)        [Rotem, Urrutia: 1981; Hsu: 1985; Masuda+: 1990; Apostolico+: 1992]
O(n^1.5)      [T: 2006]
O(n log^2 n)  [T: NEW]
Sparse string comparison: Maximum clique in a circle graph
Maximum clique in a circle graph: the algorithm
Helly property: if the intervals in a set intersect pairwise, then they all intersect at a common point
Compute seaweed matrix, build range tree: time O(n log^2 n)
Run through all 2n + 1 possible common intersection points
For each point, find a maximum subset of covering overlapping segments by a prefix-suffix LCS query: time (2n + 1) · O(log^2 n) = O(n log^2 n)
Overall time O(n log^2 n) + O(n log^2 n) = O(n log^2 n)
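The structure of this algorithm can be illustrated without seaweed matrices. In the sketch below, each per-point query is answered by an explicit LIS computation instead of a prefix-suffix LCS query on a range tree, so the running time is O(n^2 log n) rather than O(n log^2 n). It assumes intervals are (left, right) pairs with distinct endpoints; the function name is illustrative. Among intervals covering a common point, two overlap without containment exactly when their left and right endpoints are ordered the same way, so the largest clique through that point is an increasing subsequence of right endpoints.

```python
from bisect import bisect_left

def max_clique_size(intervals):
    """Maximum number of pairwise overlapping (non-nested) intervals."""
    endpoints = sorted(x for interval in intervals for x in interval)
    best = 0
    # By the Helly property, a clique shares a point lying in some gap between endpoints
    for lo, hi in zip(endpoints, endpoints[1:]):
        covering = sorted((l, r) for l, r in intervals if l <= lo and hi <= r)
        tails = []  # patience sorting over right endpoints, in order of left endpoint
        for _, r in covering:
            pos = bisect_left(tails, r)
            if pos == len(tails):
                tails.append(r)
            else:
                tails[pos] = r
        best = max(best, len(tails))
    return best

print(max_clique_size([(0, 3), (1, 5), (2, 6), (4, 7)]))
```

Nested intervals such as (0, 9), (1, 8), (2, 7) all cover a common point but give decreasing right endpoints, so they contribute a clique of size 1 only.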
Sparse string comparison: Maximum clique in a circle graph
Parameterised maximum clique in a circle graph
The maximum clique problem in a circle graph, sensitive e.g. to
the number e of edges
the size l of maximum clique
the thickness d of interval model (the maximum number of intervals covering a point)
We have l ≤ d ≤ n, e ≤ n^2
Parameterised maximum clique in a circle graph: running time
O(n log n + e)            [Apostolico+: 1992]
O(n log n + nl log(n/l))  [Apostolico+: 1992]
O(n log n + n log^2 d)    NEW
Sparse string comparison: Maximum clique in a circle graph
Parameterised maximum clique in a circle graph: the algorithm
For each diagonal block of size d, compute seaweed matrix, build range tree: time n/d · O(d log^2 d) = O(n log^2 d)
Extend each diagonal block to a quadrant: time O(n log^2 d)
Run through all 2n + 1 possible common intersection points
For each point, find a maximum subset of covering overlapping segments by a prefix-suffix LCS query: time O(n log^2 d)
Overall time O(n log^2 d)
Compressed string comparison: Grammar compression
Notation: pattern p of length m; text t of length n
A GC-string (grammar-compressed string) t is a straight-line program (context-free grammar) generating t = t_n̄ by n̄ assignments of the form
t_k = α, where α is an alphabet character
t_k = t_i t_j, where i, j < k
In general, n = O(2^n̄)
Example: Fibonacci string “ABAABABAABAAB”
t_1 = ‘B’   t_2 = ‘A’
t_3 = t_2 t_1   t_4 = t_3 t_2   t_5 = t_4 t_3   t_6 = t_5 t_4   t_7 = t_6 t_5
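The Fibonacci example can be written down directly as a straight-line program. The dictionary encoding below (rule number mapped to either a character or a pair of earlier rule numbers) and the function names are assumptions made for this sketch. `expand` decompresses a rule, while `lengths` computes the length of every rule's expansion without decompressing, using one addition per rule.

```python
# Straight-line program for the Fibonacci string from the example above
FIB = {1: "B", 2: "A", 3: (2, 1), 4: (3, 2), 5: (4, 3), 6: (5, 4), 7: (6, 5)}

def expand(slp, k):
    """Decompress rule k; the output can be exponentially long in the number of rules."""
    rule = slp[k]
    if isinstance(rule, str):
        return rule
    i, j = rule
    return expand(slp, i) + expand(slp, j)

def lengths(slp):
    """Length of the expansion of every rule, computed without decompressing."""
    out = {}
    for k in sorted(slp):  # rules only refer to smaller rule numbers
        rule = slp[k]
        out[k] = 1 if isinstance(rule, str) else out[rule[0]] + out[rule[1]]
    return out

print(expand(FIB, 7))
print(lengths(FIB)[7])
```

Computing on the rules rather than on the expansion, as `lengths` does, is the pattern that the compressed-string algorithms in this part follow.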
Compressed string comparison: Grammar compression
Grammar compression covers various compression types, e.g. LZ78, LZW (not LZ77 directly)
Simplifying assumption: arithmetic up to n runs in O(1)
This assumption can be removed by careful index remapping
Compressed string comparison: Extended substring-string LCS on GC-strings
LCS: running time (r = m + n, r̄ = m̄ + n̄)
p      t      running time
plain  plain  O(mn)                        [Wagner, Fischer: 1974]
              O(mn / log^2 m)              [Masek, Paterson: 1980; Crochemore+: 2003]
plain  GC     O(m^3 n̄ + . . .)  gen. CFG    [Myers: 1995]
              O(m^1.5 n̄)  ext subs-s       [T: 2008]
              O(m log m · n̄)  ext subs-s   [T: 2010]
GC     GC     NP-hard                      [Lifshits: 2005]
              O(r^1.2 r̄^1.4)  R weights    [Hermelin+: 2009]
              O(r log r · r̄)               [T: 2010]
              O(r log(r/r̄) · r̄)            [Hermelin+: 2010]
              O(r log^(1/2)(r/r̄) · r̄)      [Gawrychowski: NEW]
Compressed string comparison: Extended substring-string LCS on GC-strings
Extended substring-string LCS (plain pattern, GC text): the algorithm
For every k, compute by recursion the appropriate part of seaweed matrix P_{p,t_k}, using matrix ⊡-multiplication: time O(m log m · n̄)
Overall time O(m log m · n̄)
Compressed string comparison: Subsequence recognition on GC-strings
The global subsequence recognition problem
Does pattern p appear in text t as a subsequence?
Global subsequence recognition: running time
p      t      running time
plain  plain  O(n)     greedy
plain  GC     O(m n̄)   greedy
GC     GC     NP-hard  [Lifshits: 2005]
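The greedy method for a plain pattern against a GC text can be sketched as follows. For every grammar rule and every pattern position, the code records how far a greedy left-to-right match advances through the pattern while reading that rule's expansion; a concatenation rule composes the two tables of its parts. This costs O(m) per rule, i.e. time proportional to m times the number of rules. The dictionary encoding of the grammar and the function name are assumptions of this sketch, and the largest rule number is taken to be the start symbol.

```python
FIB = {1: "B", 2: "A", 3: (2, 1), 4: (3, 2), 5: (4, 3), 6: (5, 4), 7: (6, 5)}

def is_subsequence_gc(p, slp):
    """Decide whether plain pattern p is a subsequence of the text generated by slp."""
    m = len(p)
    # advance[k][i] = pattern position reached after greedily matching the
    # expansion of rule k, starting with i pattern characters already matched
    advance = {}
    for k in sorted(slp):  # rules only refer to smaller rule numbers
        rule = slp[k]
        if isinstance(rule, str):
            advance[k] = [i + 1 if i < m and p[i] == rule else i for i in range(m + 1)]
        else:
            left, right = advance[rule[0]], advance[rule[1]]
            advance[k] = [right[left[i]] for i in range(m + 1)]
    return advance[max(slp)][0] == m

print(is_subsequence_gc("BBB", FIB))     # the text contains five B's
print(is_subsequence_gc("BBBBBB", FIB))  # but not six
```

The text is never decompressed, so the cost does not depend on its uncompressed length.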
Compressed string comparison: Subsequence recognition on GC-strings
The local subsequence recognition problem
Find all minimally matching substrings of t with respect to p
A substring of t is matching if p is a subsequence of it
A matching substring of t is minimally matching if none of its proper substrings are matching
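The definition can be tested on plain strings with a simple quadratic-time sketch, assuming a non-empty pattern. For each start position i, a greedy scan finds the shortest matching substring starting at i; it is minimally matching exactly when starting one position later no longer yields the same end. The function names and the half-open (start, end) output convention are illustrative choices.

```python
def greedy_end(p, t, start):
    """Smallest j such that p is a subsequence of t[start:j], or None if there is none.

    Assumes p is non-empty.
    """
    i = 0
    for j in range(start, len(t)):
        if t[j] == p[i]:
            i += 1
            if i == len(p):
                return j + 1
    return None

def minimally_matching(p, t):
    """All minimally matching substrings of t, as half-open index pairs (i, j)."""
    result = []
    for i in range(len(t)):
        j = greedy_end(p, t, i)
        if j is None:
            break  # no later start can match either
        if greedy_end(p, t, i + 1) != j:  # dropping t[i] destroys the match
            result.append((i, j))
    return result

print(minimally_matching("AB", "AABAB"))
```

In the example, t[0:3] = "AAB" is matching but not minimally matching, since its proper substring t[1:3] = "AB" already matches.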
Compressed string comparison: Subsequence recognition on GC-strings
Local subsequence recognition: running time (+ output)
p      t      running time
plain  plain  O(mn)              [Mannila+: 1995]
              O(mn / log m)      [Das+: 1997]
              O(cm + n)          [Boasson+: 2001]
              O(m + nσ)          [Tronicek: 2001]
plain  GC     O(m^2 log m · n̄)   [Cegielski+: 2006]
              O(m^1.5 n̄)         [T: 2008]
              O(m log m · n̄)     [T: NEW]
GC     GC     NP-hard            [Lifshits: 2005]
Compressed string comparison: Subsequence recognition on GC-strings
[Figure: alignment grid with boundary positions 0, n′, n and three seaweeds labelled by their endpoints]
b〈i : j〉 is matching iff box [i : j] is not pierced left-to-right
≷-maximal seaweeds form a ≪-chain: $(\hat\imath_{0^+}, \hat\jmath_{0^+}) \ll (\hat\imath_{1^+}, \hat\jmath_{1^+}) \ll \cdots \ll (\hat\imath_{s^-}, \hat\jmath_{s^-})$
b〈i : j〉 is minimally matching iff (i, j) is in the interleaved ≪-chain: $(\hat\imath^-_{1^+}, \hat\jmath^+_{1^-}) \ll (\hat\imath^-_{2^+}, \hat\jmath^+_{2^-}) \ll \cdots \ll (\hat\imath^-_{(s-1)^+}, \hat\jmath^+_{(s-1)^-})$
Compressed string comparison: Subsequence recognition on GC-strings
[Figure: seaweed matrix plot with axis marks −m, 0, n′ and 0⁺, 1⁺, 2⁺, n′, n, m+n; nonzeros shown as • and ×]
Compressed string comparison: Subsequence recognition on GC-strings
Local subsequence recognition (plain pattern, GC text): the algorithm
For every k, compute by recursion the appropriate part of seaweed matrix P_{p,t_k}, using matrix ⊡-multiplication: time O(m log m · n̄)
Given an assignment t = t′t″, count by recursion
minimally matching substrings in t′
minimally matching substrings in t″
Then, find the ≪-chain of ≷-maximal seaweeds in time n̄ · O(m) = O(m n̄)
The interleaved ≪-chain defines minimally matching substrings in t overlapping both t′ and t″
Overall time O(m log m · n̄) + O(m n̄) = O(m log m · n̄)
Compressed string comparison: Threshold approximate matching
The threshold approximate matching problem
Find all matching substrings of t with respect to p, according to a threshold k
A substring of t is matching if the edit distance between p and that substring is at most k
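For plain strings, the problem can be solved by the classical O(mn) column-by-column dynamic programme usually attributed to Sellers. The sketch below is that textbook method, not the seaweed-based algorithm. It assumes unit costs for insertion, deletion and substitution, and reports only the end positions of matching substrings; the function name is illustrative.

```python
def approximate_match_ends(p, t, k):
    """End positions j (1-based) such that some substring of t ending at j
    is within edit distance k of p."""
    m = len(p)
    col = list(range(m + 1))  # distances from prefixes of p to the empty text
    ends = []
    for j, c in enumerate(t, 1):
        new = [0]  # a match may start at any text position at no cost
        for i in range(1, m + 1):
            new.append(min(col[i - 1] + (p[i - 1] != c),  # match or substitute
                           col[i] + 1,                    # extra text character
                           new[i - 1] + 1))               # extra pattern character
        col = new
        if col[m] <= k:
            ends.append(j)
    return ends

print(approximate_match_ends("abc", "xabcx", 0))
print(approximate_match_ends("abc", "xabcx", 1))
```

With threshold 0 only the exact occurrence is reported; with threshold 1 the neighbouring end positions also qualify, via "ab" and "abcx".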
Compressed string comparison: Threshold approximate matching
Threshold approximate matching: running time (+ output)
p      t      running time
plain  plain  O(mn)                       [Sellers: 1980]
              O(mk)                       [Landau, Vishkin: 1989]
              O(m + n + nk^4/m)           [Cole, Hariharan: 2002]
plain  GC     O(m n̄ k^2)                  [Kärkkäinen+: 2003]
              O(m n̄ k + n̄ log n)          [LV: 1989] via [Bille+: 2010]
              O(m n̄ + n̄ k^4 + n̄ log n)    [CH: 2002] via [Bille+: 2010]
              O(m log m · n̄)              [T: NEW]
GC     GC     NP-hard                     [Lifshits: 2005]
(Also many specialised variants for LZ compression)
Compressed string comparison: Threshold approximate matching
Threshold approx matching (plain pattern, GC text): the algorithm
Algorithm structure similar to local subsequence recognition by matrix ⊡-multiplication and seaweed ≪-chains
Extra ingredients:
the blow-up technique: reduction of edit distances to LCS scores
implicit matrix searching, replaces ≪-chain interleaving
Monge row minima: “SMAWK” O(m) [Aggarwal+: 1987]
Implicit unit-Monge row minima: O(m log log m) [T: NEW]; O(m) [Gawrychowski: NEW]
Overall time O(m log m · n̄) + O(m n̄) = O(m log m · n̄)
Compressed string comparison: Threshold approximate matching
[Figure: alignment grid with boundary positions 0, n′, n and five seaweeds labelled by their endpoints]
Blow-up: weighted alignment on strings p, t of sizes m, n is equivalent to LCS on blown-up strings of sizes νm, νn
Compressed string comparison: Threshold approximate matching
[Figure: seaweed matrix plot with axis marks −m, 0, n′ and 0⁺, 1⁺, 2⁺, 3⁺, 4⁺, n′, n, m+n; nonzeros shown as • and ×]
Beyond semi-locality: Window-local LCS and alignment plots
Given window length w
The window-local LCS problem
Give semi-local LCS score for every w-substring of a vs whole b
Window-local LCS: running time
O(mn^3 w)  naive
O(mnw)     multiple seaweed
O(mn)      [Krusche, T: 2010]
Beyond semi-locality: Window-local LCS and alignment plots
Window-local LCS: the algorithm
Compute seaweed matrices for canonical substrings of a against b
Build seaweed matrices for windows of a against b recursively
Both stages use matrix ⊡-multiplication, overall time O(mn)
Beyond semi-locality: Window-local LCS and alignment plots
The window-window LCS problem
Give semi-local LCS score for every w-substring of a vs every w-substring of b
Provides an alignment plot: loss-free local comparison of genome sequences, important for studying conservation in the genome
Window-window LCS: running time
O(mnw^2)  naive
O(mnw)    multiple seaweed
O(mn)     [Krusche, T: 2010]
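The naive O(mnw^2) method is easy to state in code: run a separate quadratic LCS for every pair of windows. The sketch below does exactly that, with plain LCS scores (match 1, mismatch and gap 0); the function names are illustrative and not taken from any implementation mentioned in the slides.

```python
def lcs_length(a, b):
    """Textbook LCS length by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def alignment_plot(a, b, w):
    """plot[i][j] = LCS score of window a[i:i+w] against window b[j:j+w]."""
    return [[lcs_length(a[i:i + w], b[j:j + w])
             for j in range(len(b) - w + 1)]
            for i in range(len(a) - w + 1)]

for row in alignment_plot("ABCD", "BCDA", 2):
    print(row)
```

High values along a diagonal of the plot indicate a pair of similar regions; here the diagonal shifted by one position reflects that "BCD" occurs in both strings.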
Beyond semi-locality: Window-local LCS and alignment plots
An implementation of alignment plots [Krusche: 2010]
Based on the O(mnw) multiple seaweed combing, using some ideas from the optimal O(mn) algorithm
Resulting theoretical complexity O(mnw^0.5)
Alignment weights (match 1, mismatch 0, gap −0.5): ×4 slowdown
C++, Intel assembly (x86, x86_64, MMX/SSE2 data parallelism)
SMP parallelism (currently two processors)
Beyond semi-locality: Window-local LCS and alignment plots
Krusche’s implementation used at Warwick Systems Biology Centre to study weak conservation in plant genomes (BLAST too inaccurate)
Speedup on one processor:
×10 over heavily optimised, bit-parallel naive algorithm
×7 over biologists’ heuristics
Speedup on two processors: extra ×2, near-perfect parallelism
Beyond semi-locality: Window-local LCS and alignment plots
Publications:
[Picot+: 2010] Evolutionary Analysis of Regulatory Sequences (EARS) in Plants. Plant Journal.
[Baxter+: accepted] Conserved Noncoding Sequences Highlight Shared Components of Regulatory Networks in Dicotyledonous Plants. Plant Cell.
Web service: http://wsbc.warwick.ac.uk/ears
Future work: genomic repeats; genome-scale comparison
Beyond semi-locality: Quasi-local LCS and sparse spliced alignment
Given m (possibly overlapping) prescribed substrings in a
The quasi-local LCS problem
Give semi-local LCS score for every prescribed substring of a vs whole b
Quasi-local LCS: running time
O(m^2 n)       multiple seaweed
O(mn log^2 m)  [T: unpublished]
Beyond semi-locality: Quasi-local LCS and sparse spliced alignment
Quasi-local LCS: the algorithm
Compute seaweed matrices for canonical substrings of a against b
Build seaweed matrices for prescribed substrings of a against b recursively
Both stages use matrix ⊡-multiplication, time O(mn log^2 m)
Beyond semi-locality: Quasi-local LCS and sparse spliced alignment
The sparse spliced alignment problem
Give the chain of non-overlapping prescribed substrings in a, closest to b by alignment score
Describes gene assembly: unknown gene from candidate exons, given a known similar gene
Assume m = n; let s = total size of prescribed substrings
Sparse spliced alignment: running time
O(ns) = O(n^3)  [Gelfand+: 1996]
O(n^2.5)        [Kent+: 2006]
O(n^2 log^2 n)  [T: unpublished]
O(n^2 log n)    [Sakai: 2009]
Conclusions and future work
Implicit unit-Monge matrices:
the seaweed monoid
distance multiplication in time O(n log n)
next: lower bound?
Semi-local LCS problem:
representation by implicit unit-Monge matrices
generalisation to rational alignment scores
next: real alignment scores?
Seaweed combing and micro-block speedup:
a simple algorithm for semi-local LCS
semi-local LCS in time o(mn)
improvements on related problems
Conclusions and future work
Transposition networks:
simple interpretation of high-similarity and dissimilarity LCS
dynamic LCS
simple interpretation of bit-parallel LCS
Sparse string comparison:
fast LCS on permutation strings
fast max-clique in a circle graph
next: lower bound? or even faster max-clique
Sparse string comparison:
semi-local LCS on permutations in time O(n log^2 n)
maximum clique in a circle graph in time O(n log^2 n)
improvements on related problems
Conclusions and future work
Compressed string comparison:
three-way semi-local LCS on GC text against plain pattern
subsequence recognition in GC-strings
next: approximate matching in GC-strings
Beyond semi-locality:
quasi-local string comparison
sparse spliced alignment
next: full locality