Longest Common Substring

33
Computing Longest Common Substrings Using Suffix Arrays Maxim A. Babenko, Tatiana A. Starikovskaya Moscow State University Computer Science in Russia, 2008 Maxim A. Babenko, Tatiana A. Starikovskaya (MSU) Computing LCS Using Suffix Arrays CSR 2008 1 / 22

Transcript of Longest Common Substring

Computing Longest Common SubstringsUsing Suffix Arrays

Maxim A. Babenko, Tatiana A. Starikovskaya

Moscow State University

Computer Science in Russia, 2008

Maxim A. Babenko, Tatiana A. Starikovskaya (MSU)Computing LCS Using Suffix Arrays CSR 2008 1 / 22

Outline

1 Problem Definition

2 Suffix Arrays3 The Algorithm4 Conclusions

Maxim A. Babenko, Tatiana A. Starikovskaya (MSU)Computing LCS Using Suffix Arrays CSR 2008 2 / 22

Outline

1 Problem Definition2 Suffix Arrays

3 The Algorithm4 Conclusions

Maxim A. Babenko, Tatiana A. Starikovskaya (MSU)Computing LCS Using Suffix Arrays CSR 2008 2 / 22

Outline

1 Problem Definition2 Suffix Arrays3 The Algorithm

4 Conclusions

Maxim A. Babenko, Tatiana A. Starikovskaya (MSU)Computing LCS Using Suffix Arrays CSR 2008 2 / 22

Outline

1 Problem Definition2 Suffix Arrays3 The Algorithm4 Conclusions

Maxim A. Babenko, Tatiana A. Starikovskaya (MSU)Computing LCS Using Suffix Arrays CSR 2008 2 / 22

Part 1Problem Definition

Maxim A. Babenko, Tatiana A. Starikovskaya (MSU)Computing LCS Using Suffix Arrays CSR 2008 3 / 22

LCS

Problem (LCS, Longest Common Substring): Given a collectionof N strings A = {α1, . . . , αN} and an integer K (2 ≤ K ≤ N) findthe longest string β that is a substring of at least K strings in A.

Tools: Suffix ArraysTime and Space: Linear and alphabet-independentModel of Computation: RAM

Maxim A. Babenko, Tatiana A. Starikovskaya (MSU)Computing LCS Using Suffix Arrays CSR 2008 4 / 22

LCS

Problem (LCS, Longest Common Substring): Given a collectionof N strings A = {α1, . . . , αN} and an integer K (2 ≤ K ≤ N) findthe longest string β that is a substring of at least K strings in A.Tools: Suffix ArraysTime and Space: Linear and alphabet-independentModel of Computation: RAM

Maxim A. Babenko, Tatiana A. Starikovskaya (MSU)Computing LCS Using Suffix Arrays CSR 2008 4 / 22

Part 2Suffix Arrays

Maxim A. Babenko, Tatiana A. Starikovskaya (MSU)Computing LCS Using Suffix Arrays CSR 2008 5 / 22

Useful Definitions

Definition (Suffix): Let ω = ω1ω2 . . . ωn be an arbitrary string oflength n. For each i (1 ≤ i ≤ n)

ω[i ..] = ωiωi+1 . . . ωn

is a suffix of ω.Definition (Lexicographic order): Suppose we have some orderon letters of the alphabet Σ. This order can be extended in astandard way to strings over Σ: α < β iff either α is proper prefix ofβ or α[1] = β[1], . . . , α[i] = β[i], α[i + 1] < β[i + 1].

Maxim A. Babenko, Tatiana A. Starikovskaya (MSU)Computing LCS Using Suffix Arrays CSR 2008 6 / 22

Suffix Arrays

Definition (Suffix Array): Let ω be an arbitrary string of length n.Consider its non-empty suffixes

ω[1..], ω[2..], . . . , ω[n..].

and order them lexicographically. Let SA(i) denote the startingposition of the suffix appearing on the i-th place (1 ≤ i ≤ n):

ω[SA(1)..] < ω[SA(2)..] < . . . < ω[SA(n)..].

Maxim A. Babenko, Tatiana A. Starikovskaya (MSU)Computing LCS Using Suffix Arrays CSR 2008 7 / 22

An Example of Suffix Array

suffixes SA sorted suffixes1 mississippi 11 i2 ississippi 8 ippi3 ssissippi 5 issippi4 sissippi 2 ississippi5 issippi 1 mississippi6 ssippi 10 pi7 sippi 9 ppi8 ippi 7 sippi9 ppi 4 sissippi

10 pi 6 ssippi11 i 3 ssissippi

Figure: String mississippi, its suffixes, and the corresponding suffix array.

Maxim A. Babenko, Tatiana A. Starikovskaya (MSU)Computing LCS Using Suffix Arrays CSR 2008 8 / 22

Why Suffix Arrays?

A simple data structure containing all the necessaryinformation.

Many nice and simple efficient construction algoritms (e.g.Kärkäinen, Sanders [2003]) with alphabet-independenttime and space complexity.

Maxim A. Babenko, Tatiana A. Starikovskaya (MSU)Computing LCS Using Suffix Arrays CSR 2008 9 / 22

Why Suffix Arrays?

A simple data structure containing all the necessaryinformation.Many nice and simple efficient construction algoritms (e.g.Kärkäinen, Sanders [2003]) with alphabet-independenttime and space complexity.

Maxim A. Babenko, Tatiana A. Starikovskaya (MSU)Computing LCS Using Suffix Arrays CSR 2008 9 / 22

Part 3The Algorithm

Maxim A. Babenko, Tatiana A. Starikovskaya (MSU)Computing LCS Using Suffix Arrays CSR 2008 10 / 22

Our Main Result

TheoremLet the total length of strings α1, . . . , αN be equal to L.Then the answer to the LCS problem can becomputed in O(L) time and in O(L) space.

Maxim A. Babenko, Tatiana A. Starikovskaya (MSU)Computing LCS Using Suffix Arrays CSR 2008 11 / 22

LCS Example

Consider the following example with N = 3, K = 2:

α1 = abbα2 = cbα3 = abc

Clearly, the answer is ab.

Maxim A. Babenko, Tatiana A. Starikovskaya (MSU)Computing LCS Using Suffix Arrays CSR 2008 12 / 22

Observation

The longest common substring for K strings of our set is thelongest common prefix of some suffixes of these strings.

We calculate the longest common prefix of every K suffixes ofdifferent strings and take the longest one; the latter is the answerto the LCS problem.

Maxim A. Babenko, Tatiana A. Starikovskaya (MSU)Computing LCS Using Suffix Arrays CSR 2008 13 / 22

Observation

The longest common substring for K strings of our set is thelongest common prefix of some suffixes of these strings.We calculate the longest common prefix of every K suffixes ofdifferent strings and take the longest one; the latter is the answerto the LCS problem.

Maxim A. Babenko, Tatiana A. Starikovskaya (MSU)Computing LCS Using Suffix Arrays CSR 2008 13 / 22

Preprocessing: Step 1

Combine the strings in A as follows:

α = α1$1α2$2 . . . αN$N .

Here $i are special symbols (sentinels) that are different andlexicographically less than other symbols of the initial alphabet Σ

Example: α = abb$1cb$2abc$3

Maxim A. Babenko, Tatiana A. Starikovskaya (MSU)Computing LCS Using Suffix Arrays CSR 2008 14 / 22

Preprocessing: Step 1

Combine the strings in A as follows:

α = α1$1α2$2 . . . αN$N .

Here $i are special symbols (sentinels) that are different andlexicographically less than other symbols of the initial alphabet Σ

Example: α = abb$1cb$2abc$3

Maxim A. Babenko, Tatiana A. Starikovskaya (MSU)Computing LCS Using Suffix Arrays CSR 2008 14 / 22

Preprocessing: Step 2

Definition (Longest Common Prefixes (LCP) array): The arraycontaining lengths of the longest common prefixes for every pair ofconsecutive suffixes (w.r.t. lexicographical order).LCP array can be easily constructed in linear time and space.

We construct the suffix array and the LCP array for α.

Maxim A. Babenko, Tatiana A. Starikovskaya (MSU)Computing LCS Using Suffix Arrays CSR 2008 15 / 22

Preprocessing: Step 2

Definition (Longest Common Prefixes (LCP) array): The arraycontaining lengths of the longest common prefixes for every pair ofconsecutive suffixes (w.r.t. lexicographical order).LCP array can be easily constructed in linear time and space.We construct the suffix array and the LCP array for α.

Maxim A. Babenko, Tatiana A. Starikovskaya (MSU)Computing LCS Using Suffix Arrays CSR 2008 15 / 22

Step 2. Example of SA and LCP

String: abb$1cb$2abc$3SA: 4 7 11 1 8 3 6 2 9 10 5

LCP: 0 0 0 2 0 1 1 1 0 1

suffixes SA sorted suffixes LCP1 abb$1cb$2abc$3 4 $1cb$2abc$3 02 bb$1cb$2abc$3 7 $2abc$3 03 b$1cb$2abc$3 11 $3 04 $1cb$2abc$3 1 abb$1cb$2abc$3 25 cb$2abc$3 8 abc$3 06 b$2abc$3 3 b$1cb$2abc$3 17 $2abc$3 6 b$2abc$3 18 abc$3 2 bb$1cb$2abc$3 19 bc$3 9 bc$3 0

10 c$3 10 c$3 111 $3 5 cb$2abc$3

Maxim A. Babenko, Tatiana A. Starikovskaya (MSU)Computing LCS Using Suffix Arrays CSR 2008 16 / 22

Further Ideas

The longest prefix of suffixes of K different strings in A is thelongest common prefix of suffixes of K different colors in α.

Consider K suffixes at positions i1, . . . , iK and assume thatSA[i1] < SA[i2] < . . . < SA[iK ]. The length of the longest commonprefix of these K suffixes is equal to the minimum ofLCP[i1], . . . ,LCP[iK − 1].Example:

SA: 4 7 11 1 8 3 6 2 9 10 5

LCP: 0 0 0 2 0 1 1 1 0 1

Suffixes: abb$1cb$2abc$3

abc$3

Maxim A. Babenko, Tatiana A. Starikovskaya (MSU)Computing LCS Using Suffix Arrays CSR 2008 17 / 22

Further Ideas

The longest prefix of suffixes of K different strings in A is thelongest common prefix of suffixes of K different colors in α.Consider K suffixes at positions i1, . . . , iK and assume thatSA[i1] < SA[i2] < . . . < SA[iK ]. The length of the longest commonprefix of these K suffixes is equal to the minimum ofLCP[i1], . . . ,LCP[iK − 1].

Example:SA: 4 7 11 1 8 3 6 2 9 10 5

LCP: 0 0 0 2 0 1 1 1 0 1

Suffixes: abb$1cb$2abc$3

abc$3

Maxim A. Babenko, Tatiana A. Starikovskaya (MSU)Computing LCS Using Suffix Arrays CSR 2008 17 / 22

Further Ideas

The longest prefix of suffixes of K different strings in A is thelongest common prefix of suffixes of K different colors in α.Consider K suffixes at positions i1, . . . , iK and assume thatSA[i1] < SA[i2] < . . . < SA[iK ]. The length of the longest commonprefix of these K suffixes is equal to the minimum ofLCP[i1], . . . ,LCP[iK − 1].Example:

SA: 4 7 11 1 8 3 6 2 9 10 5

LCP: 0 0 0 2 0 1 1 1 0 1

Suffixes: abb$1cb$2abc$3

abc$3

Maxim A. Babenko, Tatiana A. Starikovskaya (MSU)Computing LCS Using Suffix Arrays CSR 2008 17 / 22

Extensions

TheoremProblem: Given a collection of N strings A = {α1, . . . , αN}, foreach K (2 ≤ K ≤ N) find the longest string β that is a substring ofat least K strings in A.

Let the total length of strings α1, . . . , αN be equal to L. Then theanswer to the above problem can be computed in O(L · log∗ L)time and in O(L) space.

Maxim A. Babenko, Tatiana A. Starikovskaya (MSU)Computing LCS Using Suffix Arrays CSR 2008 18 / 22

Extensions

TheoremProblem: Given a collection of N strings A = {α1, . . . , αN}, foreach K (2 ≤ K ≤ N) find the longest string β that is a substring ofat least K strings in A.Let the total length of strings α1, . . . , αN be equal to L. Then theanswer to the above problem can be computed in O(L · log∗ L)time and in O(L) space.

Maxim A. Babenko, Tatiana A. Starikovskaya (MSU)Computing LCS Using Suffix Arrays CSR 2008 18 / 22

Part 4Conclusions

Maxim A. Babenko, Tatiana A. Starikovskaya (MSU)Computing LCS Using Suffix Arrays CSR 2008 19 / 22

Open Problem

How to compute an inexact longest commonsubstring?

Maxim A. Babenko, Tatiana A. Starikovskaya (MSU)Computing LCS Using Suffix Arrays CSR 2008 20 / 22

Acknowledgements

The authors are thankful to the students of Department ofMathematical Logic and Theory of Algorithms and to Maxim Ushakovand Victor Khimenko (Google Moscow) for many helpful discussions.

Maxim A. Babenko, Tatiana A. Starikovskaya (MSU)Computing LCS Using Suffix Arrays CSR 2008 21 / 22

The End :-)

Thank you for your attention.Questions are welcome!

Maxim A. Babenko, Tatiana A. Starikovskaya (MSU)Computing LCS Using Suffix Arrays CSR 2008 22 / 22