Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

87
M a r t i n F a r a c h - C o l t o n R u t g e r s U n i v e r s i t y U S A

Transcript of Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Page 1: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Suffix Sorting

&

Related

Algoritmics

Martin Farach-ColtonRutgers

UniversityUSA

Page 2: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

What is Suffix Sorting?

✢ Given a string, give the sorted order of its suffixes.

✢ Why would we want that?

✤ More on this later.

Page 3: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Example: Mississippi$

8 2 5 11

1 9 10

6 3 7 4 12

8 5 11

1 9 10

6 3 7 4 12

2

Mississippi$

sissippi$

sippi$

i$pi$ppi$

ssissippi$ississippi$

issippi$ssippi$

ippi$

$

Mississippi$

sissippi$sippi$

i$

pi$ppi$

ssissippi$

ississippi$issippi$

ssippi$

ippi$

$

{

{

Page 4: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

How fast can we sort?

✢ Sorting suffixes of a string can be no faster than sorting the characters of a string.

✢ Can we match the sorting lower bound?

Page 5: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Suffix Sorting✢ We are sorting strings, so Radix Sort

is natural.

✤ This yields an algorithm with time O(n2) to O(n2logn), depending on assumptions on sortability of characters.

✥ O(n2logn) in comparison model.

✥ O(n2) for small integers. In between in general word model.

✢ We can do better by combining Merge Sort with Radix Sort.

Page 6: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Building Blocks

✢ Range Reduction

✢ Radix Step

✢ Chunking

Page 7: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Range Reduction

✢ Observation: If we apply a monotone function to the characters, the sorted order doesn’t change.

Page 8: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Example: Mississippi$Mississippi$

sissippi$

sippi$

i$pi$ppi$

ssissippi$ississippi$

issippi$ssippi$

ippi$

$

214414413315144144133154414413315

413315441331514413315

13315

414413315

3315315155

214414413315

14414413315

4414413315

413315

4413315

1441331513315

414413315

3315315

15

5

8 2 5 11

1 9 10

6 3 7 4 12

8 5 11

1 9 10

6 3 7 4 12

2

Mississippi$

sissippi$sippi$

i$

pi$ppi$

ssissippi$

ississippi$issippi$

ssippi$

ippi$

$

i 1

M 2

p 3

s 4

$ 5

Page 9: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Range Reduction

✢ Our only range reduction operation will be:

✤ Replace every character by rank in sorted order of characters.

✤ After RR, length n string will be in [n]n

i 1

M 2

p 3

s 4

$ 5

Page 10: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

RR helps running time

✢ Radix Sort on raw input might take O(n2logn) time.

✢ RR takes at most O(nlogn) time.

✢ Radix Sort on small-integer inputs takes O(n2) time.

✢ Total time is O(n2).

Page 11: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Radix Step✢ Recall that Radix Sort proceeds in

steps:

✤ Lexicographically sort the last i characters of each string.

✤ Stably sort by preceding character. Now strings are lexicographically sorted by last i+1 characters.

sorte

d sorte

d

Page 12: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Using Radix Step

✢ If we have recursively sorted some subset of suffixes of a string:

✤ 1 step of radix will sort the preceding suffixes.

✢ Ex: If you have sorted suffixes at odd positions, then you get suffixes at even positions.

Page 13: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Example: 214414413315

214414413315

4414413315

413315

14413315

3315

15

214414413315

4414413315413315

14413315

3315

15

214414413315144144133154414413315

413315441331514413315

13315

414413315

3315315155

25 1111

11 99 37 73

Odd Suffixes

Page 14: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Example: 214414413315

214414413315

4414413315413315

14413315

3315

15

Even Suffixes

5144144133154413315

414413315

13315

315

5

14414413315

4413315414413315

13315

315

88 10

610

6 4 12

4 12

52

Page 15: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Radix Step

✢ We normally think of Radix Sort as sorting one set of strings.

✢ But with suffixes (which are interrelated), we can use a Radix Step to get sorted orders of one set from another.

Page 16: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Where are we now?✢ Step 1: Recursively sort odd

suffixes.

✤ How? And how is it recursive? A recursive step must sort every suffix! We’ll get to that.

✢ Step 2: Get even suffixes in linear time.

✤ By Radix Step.

✢ Step 3: Merge!

Page 17: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Merging is tricky...

✢ F ‘97 gave first linear time solution.

✢ This yields an optimal suffix sorting routine.

✢ It’s a fun algorithm, but highly unintuitive.

✤ Or so I’ve been told!

Page 18: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Chunking✢ Let’s solve the recursion problem

first.

✢ Given two integers i and j, let 〈 i,j 〉 be their bit concatenation.

✤ If i,j∈[n], then 〈 i,j 〉∈ [n2].

✢ Given a string S = (s1,s2,...,sn)

✢ Let S’ = ( 〈 s1,s2 〉 , 〈 s3,s4 〉 ,...,

〈 sn-1,sn〉 )

Page 19: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Chunking + Recursion: I

✢ Observation: The order of the odd suffixes of S = (s1,s2,...,sn) is the same as the order of all suffixes of S’ = ( 〈 s1,s2 〉 , 〈 s3,s4 〉 ,..., 〈 sn-

1,sn〉 )

✤ Since bit concatenation preserves lexicographic ordering.

Page 20: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Example: 214414413315

8 2 5 11

1 93 6 5 4 21

214414413315

4414413315413315

14413315

3315

15〈 21 〉〈 44 〉〈 14 〉〈 41 〉〈 33 〉〈 15 〉

〈 44 〉〈 14 〉〈 41 〉〈 33 〉〈 15 〉〈 14 〉〈 41 〉〈 33 〉〈 15 〉

〈 41 〉〈 33 〉〈 15 〉〈 33 〉〈 15 〉

〈 15 〉

25 1111

11 99 37 73

17 36 12 33 27 13

36 12 33 27 13

12 33 27 13

33 27 1327 13

1317 36 12 33 27 1336 12 33 27 1312 33 27 1333 27 1327 1313

base 8

Page 21: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Chunking + Recursion: II

✢ Chunking+Range Reduction = Recursion

✤ Input is in [n]n.

✤ Chunked Input is in [n2]n/2.

✤ Range Reduced Chunking is in [n/2]n/2.

✤ So now problem instance is half the size and we can recurse.

Page 22: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Example: 214414413315

8 2 5 11

1 93 6 5 4 21

〈 21 〉〈 44 〉〈 14 〉〈 41 〉〈 33 〉〈 15 〉

36154261542

542

242

1542 361542

61542542

2

42

1542

Page 23: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Recall Basic Operations

✢ Range Reduction

✢ Radix Step

✢ Chunking

✢ How we are ready for the whole algorithm.

Page 24: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Suffix Sorting

✢ Step 1: Chunk + Range Reduction.

✤ Recurse on new string.

✤ Get sorted order of odd suffixes.

✢ Step 2: Radix Step.

✤ Get sorted order of even suffixes.

✢ Step 3: Merge!

✤ We still don’t know how to do this.

Page 25: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

The Trouble with Merging

✢ Let’s start merging. We need to see which is smaller: the smallest odd suffix s2i-1 or the smallest even suffix s2j.

✤ We can compare the first character.

✤ We can then compare s2i with s2j

✥ But this is another odd/even comparison.

Page 26: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

The difference between 3 and 2

✢ It’s possible to merge the lists.

✢ But Kärkkäinen & Sanders showed the cute way to merge.

✢ The modified the recursion to make the merge easy.

Page 27: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Mod 3 Recursion✢ Given a string S = (s1,s2,...,sn)

✢ Let S1 = ( 〈 s1,s2,s3 〉 , 〈 s4,s5,s6 〉 ,...,

〈 sn-2,sn-1,sn〉 )

✢ Let S2 = ( 〈 s2,s3,s4 〉 , 〈 s5,s6,s7 〉 ,...,

〈 sn-4,sn-3,sn-2 〉 )

✢ Let O12 be order of suffix congruent to 1 and 2 mod 3.

✤ You get this recursively from sorting the suffixes of S1S2

Page 28: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Radix Step x 2

✢ We have O12 from the recursion.

✢ One Radix Step gives us O01

✢ Another Radix Step gives us O02

✢ Each pair of suffix is now compared in one list.

✢ Each suffix appears in two lists.

Page 29: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Merging... at last!

✢ To merge O12, O01, and O12 note that:

✤ The smallest suffix is the first on two lists.

✤ Pop the smallest.

✤ Now the next smallest is the first on two lists.

✤ Etc.

Page 30: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Total time✢ T(n) to sort suffix of strings in [n]n

✢ T(n) = recursion + 2*radix + merging

✢ T(n) = T(2n/3) + O(n) + O(n)

✢ T(n) = O(n)

✢ So the initial Range Reduction step into the integer alphabet is the bottleneck.

✤ So this algorithm is optimal for any alphabet.

Page 31: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Why did we want to sort suffixes anyway?

✢ It’s easy to go from Suffix Sorting to Suffix Arrays...

Page 32: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Suffix Arrays

✢ A suffix array is:

✤ The sorted order of suffixes

✤ Their pairwise adjacent lcp’s.

✢ They are handy as space efficient indexes.

✢ How do we compute the lcp’s from the suffix order?

Page 33: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Example: Mississippi$

? ?

?

? ? ? ?? ? ??

Mississippi$☟

8 2 5 11

1 9 10

6 3 7 4 12

8 5 11

1 9 10

6 3 7 4 12

2

Page 34: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Example: Mississippi$

? ?

0

? ? ? ?? ? ??

Mississippi$

8 2 5 11

1 9 10

6 3 7 4 12

8 5 11

1 9 10

6 3 7 4 12

2

Page 35: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Example: Mississippi$

? ?

0

? ? ? ?? ? ?

?

Mississippi$

8 2 5 11

1 9 10

6 3 7 4 12

8 5 11

1 9 10

6 3 7 4 12

2

Page 36: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Example: Mississippi$

? ?

0

? ? ? ?? ? ?

4

Mississippi$

8 2 5 11

1 9 10

6 3 7 4 12

8 5 11

1 9 10

6 3 7 4 12

2

☺☺☺☺

☺☺☺☺☹

Page 37: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Example: Mississippi$

? ?

0

? ? ? ?

?

? ?

4

Mississippi$

8 2 5 11

1 9 10

6 3 7 4 12

8 5 11

1 9 10

6 3 7 4 12

2

☺☺

☺☺☹

Page 38: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Example: Mississippi$

? ?

0

? ? ? ?

3

? ?

4

Mississippi$

8 2 5 11

1 9 10

6 3 7 4 12

8 5 11

1 9 10

6 3 7 4 12

2

☺☺

☺☺☹

Page 39: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.
Page 40: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

What’s a suffix tree?

✢ Compacted trie of all suffixes of a string.

✢ What’s it good for?

✤ Too many things to enumerate...

✤ See the stringology literature of the last 30 years!

Page 41: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Example: Mississippi$

Page 42: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Example: Mississippi$

Page 43: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Example: Mississippi$

Page 44: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Example: Mississippi$

Page 45: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Example: Mississippi$

Page 46: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Example: Mississippi$

Page 47: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Example: Mississippi$

Page 48: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Example: Mississippi$

Page 49: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Example: Mississippi$

Page 50: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Example: Mississippi$

Page 51: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Example: Mississippi$

Page 52: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Example: Mississippi$

Page 53: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Brute-force Algorithm

✢ We insert one suffix at a time.

✢ Each suffix insertion takes O(n) time.

✢ Total time is O(n2)

Page 54: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

How low can we go?✢ There is a leaf for each suffix

✤ That’s why we put a $ at end.

✢ Each internal node is branching

✤ Because we have a compacted trie.

✢ Each edge has a constant-size label.

✤ Just a pointer to string + length.

✢ So suffix tree has size O(n).

Page 55: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Weiner’s Algorithm

✢ Weiner showed a suffix-tree construction in O(n) time for binary alphabet.

✢ Still adds suffixes one at a time.

✤ But gets a speed-up on insertions by exploiting Suffix Links.

Page 56: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Defs: LCP

✢ Let li be the leaf for suffix S[1,i].

✢ Let LCP(i,j) be the length of the longest common prefix of S[1,i] and S[1,j].

✤ Ex: LCP(2,5) = 4 for Mississippi.

Page 57: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Suffix Links

✢ Claim: If some node in a suffix tree has string aα, for a a character and α a string, then some node in the suffix tree has string α.

Page 58: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Example: Mississippi$issi

issi

Page 59: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Example: Mississippi$

issi

issi

Page 60: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Example: Mississippi$

Page 61: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Example: Mississippi$

Page 62: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Suffix Links: Proof

✢ How do we know a suffix link always exists?

✤ Maybe the link points into the middle of a edge...

Page 63: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Suffix Links: Proof Cartoon

a αi i+1

j j+1

Page 64: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Adding Suffix Links

✢ Adding each suffix link is just a least common ancestors computation

✤ This takes O(n) preprocessing + O(1) per lca.

✤ Adding all suffix links is O(n) time.

Page 65: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Suffix Links Uses

✢ The first use of suffix links was to speed up suffix tree construction.

✤ If you keep suffix links on the partially constructed tree, you don’t have to start every insertion from the root.

✤ Yields a linear time construction!

Page 66: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

So we’re done!

✢ The data structure has linear size.

✤ So linear lower bound.

✢ The algorithm runs in linear time.

✤ So linear upper bound.

✢ Declare victory and go home!

Page 67: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Alphabets everywhere.

✢ This algorithm runs in linear time for binary alphabet.

✢ For an alphabet of size s, it runs in O(n log s).

Page 68: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Lower Bounds

✢ Element uniqueness reduces to suffix tree construction, so in the algebraic decision tree model, the lower bound is Omega(n log n).

✢ If we require edges to be sorted, then sorting is lower bound.

Page 69: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Open Problem

✢ Is there an interesting lower bound for suffix trees where each node may have an arbitrary order of children?

✢ Such a tree is no good as an index, but just fine for LCA applications of suffix trees.

Page 70: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Back to Suffix Links

✢ Fun fact: The suffix links form a sl-tree. The length of the string at a node is the depth in the sl-tree.

Page 71: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Example: Mississippi$

2

111

3

4

Page 72: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Stripping a Suffix Tree: I

✢ If we are given a suffix tree with no edge labels, we can reconstruct the labels in linear time:

✤ Compute Suffix Links (by LCAs).

✤ Compute Depth in SL-tree.

✤ Each node selects a leaf and computes offset within its descendant suffix.

Page 73: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Example: Mississippi$

Page 74: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Example: Mississippi$

2

111

3

4

Page 75: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Example: Mississippi$

2

111

3

4

Page 76: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Suffix Arrays

✢ A suffix array is:

✤ The sorted order of suffixes

✤ Their pairwise adjacent lcp’s.

Page 77: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Example: Mississippi$

8 2 11

1 9 10

6 3 7 4 12

51 1 0 0 1 0 13 2 04

Page 78: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Suffix Arrays

✢ They are much more space efficient than suffix trees.

✢ You can still use them as indexes.

✢ You can build them as fast as suffix trees...

✤ By dfs of suffix tree. How about w/o suffix trees?

Page 79: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Building suffix trees from suffix arrays

✢ You can build the suffix tree left to right by keeping a stack of the right-most path.

✢ This gives you the shape of the tree.

✢ Then add link labels.

✢ Total construction is linear for any alphabet.

Page 80: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Example: Mississippi$

8 2 5 11

1 9 10

6 3 7 4 12

1 1 0 0 1 0 13 2 048 2 1

11 9 1

06 3 7 4 1

25

1 1 0 0 1 0 13 2 04

Page 81: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Example: Mississippi$

8 2 5 11

1 9 10

6 3 7 4 12

1 1 0 0 1 0 13 2 04

1

☟8 2 1

11 9 1

06 3 7 4 1

25

1 1 0 0 1 0 13 2 04

Page 82: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Example: Mississippi$

8 2 5 11

1 9 10

6 3 7 4 12

1 1 0 0 1 0 13 2 04

1

4

☟8 2 1

11 9 1

06 3 7 4 1

25

1 1 0 0 1 0 13 2 04

Page 83: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Example: Mississippi$

8 2 5 11

1 9 10

6 3 7 4 12

1 1 0 0 1 0 13 2 04

1

4

☟8 2 1

11 9 1

06 3 7 4 1

25

1 1 0 0 1 0 13 2 04

Page 84: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Example: Mississippi$

8 2 5 11

1 9 10

6 3 7 4 12

1 1 0 0 1 0 13 2 04

1

4etc.

☟8 2 1

11 9 1

06 3 7 4 1

25

1 1 0 0 1 0 13 2 04

Page 85: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

Even Less Information!

✢ You don’t even need LCP information to build the suffix tree.

✢ You can compute LCP array from suffix order array in linear time.

✢ Big Conclusion: Given sorted order of suffixes, you can build the suffix array in linear time for any alphabet.

Page 86: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

How do we sort suffixes?

✢ Mergesort!

✤ Careful choice of recursion.

✤ Careful merging.

Page 87: Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.

How do we sort suffixes?

✢ Key Operations:

✤ Range Reduction.

✤ Radix Sorting.

✤ Clumping.