Suffix Arrays

"A new method for on-line string searches" by U. Manber and G. Myers (1993)
"Simple linear work suffix array construction" by J. Karkkainen and P. Sanders (2002)

Seminar in advanced topics in data structures
Presented by Kfir Amitai


Contents

- Introduction
- Searching a suffix array
- Building in O(n log n) - 1993
  - Sorting
  - LCP information building
  - Some observations about linear time
- Building in O(n) - 2002
- Results

Introduction

- Until now we have studied suffix trees.
- The main problem with suffix trees is the large constant factor in their linear space complexity.
- Suffix arrays are a much simpler data structure.
- Suffix arrays allow us to find all occurrences of a string of length P in a string of length N in O(P + log N) time, using a kind of binary search.

Introduction – What is a suffix array?

A suffix array is the sorted array of the suffixes of a string S, represented by an array of pointers to the suffixes of S.

S = “nahariya”

S0 = nahariya
S1 = ahariya
S2 = hariya
S3 = ariya
S4 = riya
S5 = iya
S6 = ya
S7 = a

The sorted suffixes are represented by an array of integers, POS.

k   POS[k]   Sorted suffix
0   7        S7 = a
1   1        S1 = ahariya
2   3        S3 = ariya
3   2        S2 = hariya
4   5        S5 = iya
5   0        S0 = nahariya
6   4        S4 = riya
7   6        S6 = ya
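The definition can be tried out directly in Python. This is only a naive sketch that sorts whole suffixes with the built-in sort; the function name `suffix_array` is an assumption for illustration and is not taken from the slides or the papers.

```python
def suffix_array(text):
    """Return POS: the start positions of the suffixes of text, in sorted order.

    Naive sketch: the sort compares whole suffixes, so it is much slower
    than the O(N log N) and O(N) constructions discussed later.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


pos = suffix_array("nahariya")
print(pos)  # [7, 1, 3, 2, 5, 0, 4, 6]
```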

Some definitions and observations

From here on the text is called A, and $A_i$ is its suffix starting at position i.

- $POS[k] = i$ if and only if $S_i$ is the k-th smallest suffix in the set $\{S_0, S_1, S_2, \dots, S_{N-1}\}$.
- For strings u and v and a prefix length p, we denote by $<_p$ the lexicographic order on the first p characters: $v <_p u \iff v_0 v_1 \dots v_{p-1} < u_0 u_1 \dots u_{p-1}$.
- Note that for every choice of p < N: $A_{POS[0]} \le_p A_{POS[1]} \le_p A_{POS[2]} \le_p \dots \le_p A_{POS[N-1]}$.
- Let $|W| = P$. Note that W is a substring of A if and only if there is an i such that $W =_P A_{POS[i]}$.


The binary search

We define a search interval for a pattern W with $|W| = P$:

$L_W = \min\{k \mid W \le_P A_{POS[k]} \text{ or } k = N\}$
$R_W = \max\{k \mid W \ge_P A_{POS[k]} \text{ or } k = -1\}$

- W matches $a_i a_{i+1} \dots a_{i+P-1}$ if and only if $i = POS[k]$ for some $k \in [L_W, R_W]$.
- If $L_W > R_W$, then W is not a substring of A.
- Otherwise there are $R_W - L_W + 1$ matches: $A_{POS[L_W]}, \dots, A_{POS[R_W]}$.

[Figure: the POS array. Left of $L_W$ we have $W >_P A_{POS[k]}$; right of $R_W$ we have $W <_P A_{POS[k]}$.]

Search example

A = assassin, P = |W|

k   Suffix at POS[k]
0   assassin
1   assin
2   in
3   n
4   sassin
5   sin
6   ssassin
7   ssin

W      LW   RW   # matches
s      4    7    4
as     0    1    2
assa   0    0    1
ast    2    1    0

If W is found, there are (RW - LW + 1) matches.

Naïve search – O(P log N)

We iterate over the POS array with an ordinary binary search. There are O(log N) iterations, each costing O(P).

- Initialize: L = 0, R = N-1.
- Step: set M = (L+R)/2, then set the new L, R bounds according to a comparison of W with $A_{POS[M]}$.
- Stop once we have reached $L_W = \min\{k \mid W \le_P A_{POS[k]} \text{ or } k = N\}$ and $R_W = \max\{k \mid W \ge_P A_{POS[k]} \text{ or } k = -1\}$.

[Figure: the POS entries at L, M and R point to the suffixes abcde..., abcdf... and abd...; W = “abcx”.]
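A minimal Python sketch of the naïve search follows. It runs two ordinary binary searches, one for LW and one for RW, and compares only the first P characters of each suffix. The helper names `suffix_array` and `search_interval` are assumptions made for this illustration.

```python
def suffix_array(text):
    # Naive helper (an assumption for this sketch): sort all suffixes directly.
    return sorted(range(len(text)), key=lambda i: text[i:])


def search_interval(text, pos, w):
    """Return (LW, RW) for pattern w, comparing only the first P = len(w) characters."""
    p = len(w)

    def prefix(k):
        # First P characters of the k-th smallest suffix (shorter near the end of text).
        return text[pos[k]:pos[k] + p]

    # LW: smallest k with w <=_P suffix k (N if there is none).
    lo, hi = 0, len(pos)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < w:
            lo = mid + 1
        else:
            hi = mid
    lw = lo

    # RW: largest k with w >=_P suffix k (-1 if there is none).
    lo, hi = 0, len(pos)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= w:
            lo = mid + 1
        else:
            hi = mid
    rw = lo - 1
    return lw, rw


text = "assassin"
pos = suffix_array(text)
for w in ["s", "as", "assa", "ast"]:
    lw, rw = search_interval(text, pos, w)
    print(w, lw, rw, max(0, rw - lw + 1))  # pattern, LW, RW, number of matches
```

Each call to `prefix` costs O(P) and each loop runs O(log N) times, which gives the O(P log N) bound.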

Stop to think…

What can we do better?

Let’s do it better…

- What we haven’t used is the fact that we are searching among suffixes of the same string.
- Let’s assume we have information on the longest common prefixes (lcp’s) of pairs of suffixes.
- For each iteration we define:
  l = lcp($A_{POS[L]}$, W)
  r = lcp(W, $A_{POS[R]}$)
  Llcp[M] = lcp($A_{POS[L]}$, $A_{POS[M]}$)
  Rlcp[M] = lcp($A_{POS[M]}$, $A_{POS[R]}$)
- An important point: we don’t need more than 2N lcp values, because each search midpoint M has a well-defined L and R.

Search in O(P + log N) using lcp’s

- Let’s look for W = “nahx”.
- If l ≥ r we compare l with Llcp[M]; if l < r we compare r with Rlcp[M].
- I will show the case l ≥ r; the other case is symmetric.
- Case 1: l < Llcp[M]

[Figure: the POS entries at L, M and R point to nahde..., nahdf… and nazf…; l = 3, r = 2, Llcp[M] = 4.]

Search in O(P + log N) using lcp’s

Case 1: l < Llcp[M] (W = “nahx”)

- We know that W > $A_{POS[L]}$. Because $A_{POS[L]}$ and $A_{POS[M]}$ share a prefix longer than l, it follows that W > $A_{POS[M]}$ as well.
- We need to move L to M.
- l is unchanged (again, because their lcp is longer).
- We did it with no string comparison, only integer comparisons.

[Figure: the POS entries at L, M and R point to nahde..., nahdf… and naz…; l = 3, r = 2, Llcp[M] = 4.]

Search in O(P + log N) using lcp’s

Case 2: l > Llcp[M] (W = “nahx”)

- W and $A_{POS[L]}$ have more in common (a longer lcp) than $A_{POS[L]}$ and $A_{POS[M]}$.
- Therefore, because we know that $A_{POS[L]}$ < $A_{POS[M]}$, we get W < $A_{POS[M]}$.
- We need to move R to M.
- Now we assign r ← Llcp[M].
- Again, no string comparison operations.

[Figure: POS entries nahde... (at L), nai..., nak..., naxf… and naz… (at R); l = 3, r = 2, Llcp[M] = 2.]

Search in O(P + log N) using lcp’s

Case 3: l = Llcp[M] (W = “nahx”)

- This is the only case in which we have to compare strings.
- Our lcp information alone cannot tell us whether to go left or right.
- What we do know is that the first l characters of W and $A_{POS[M]}$ are identical.
- We compare the (l+1)-st character, the (l+2)-nd, and so on, until we find the first j such that $W \ne_{l+j} A_{POS[M]}$.
- The (l+j)-th character determines whether we go left or right. Either way, we know the new value of l or r.

[Figure: the POS entries at L, M and R point to nahde..., nahp… and naz…; l = 3, r = 2, Llcp[M] = 3.]

Search in O(P + log N) using lcp’s

Time complexity

- Counted in an amortized manner, the number of single-character comparisons in one step equals (max(l, r) after the step) - (max(l, r) before the step) + 1.
- max(l, r) never decreases and never exceeds P, so over all steps the first two terms add up to at most P.
- Together with one extra comparison per step and O(log N) steps, we get O(P + log N).
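The three cases can be put together in a short Python sketch that computes LW. It is an illustration under assumptions, not the authors' code: the names `lcp`, `build_llcp_rlcp` and `find_lw` are hypothetical, and Llcp/Rlcp are filled in here by brute force (how to obtain them during sorting is a separate topic). RW can be found symmetrically.

```python
def lcp(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def build_llcp_rlcp(text, pos):
    """Brute-force Llcp/Rlcp for every midpoint M of the binary search."""
    n = len(pos)
    llcp, rlcp = [0] * n, [0] * n
    stack = [(0, n - 1)]
    while stack:
        lo, hi = stack.pop()
        if hi - lo <= 1:
            continue
        mid = (lo + hi) // 2
        llcp[mid] = lcp(text[pos[lo]:], text[pos[mid]:])
        rlcp[mid] = lcp(text[pos[mid]:], text[pos[hi]:])
        stack += [(lo, mid), (mid, hi)]
    return llcp, rlcp


def find_lw(text, pos, llcp, rlcp, w):
    """Smallest k with w <=_P suffix pos[k], or N if there is none."""
    n, p = len(pos), len(w)
    if n == 0:
        return 0

    def extend(k, m):
        # w and suffix pos[k] agree on m characters; compare further characters.
        i = pos[k] + m
        while m < p and i < n and w[m] == text[i]:
            m, i = m + 1, i + 1
        return m

    def le_at(k, m):
        # Decide w <=_P suffix pos[k], given that their lcp (capped at P) is m.
        i = pos[k] + m
        return m == p or (i < n and w[m] <= text[i])

    l = extend(0, 0)
    if le_at(0, l):
        return 0
    r = extend(n - 1, 0)
    if not le_at(n - 1, r):
        return n
    lo, hi = 0, n - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if l >= r:
            # Cases 1 and 3 go through extend(); case 2 needs integers only.
            m = extend(mid, l) if llcp[mid] >= l else llcp[mid]
        else:
            m = extend(mid, r) if rlcp[mid] >= r else rlcp[mid]
        if le_at(mid, m):
            hi, r = mid, m
        else:
            lo, l = mid, m
    return hi


text = "assassin"
pos = sorted(range(len(text)), key=lambda i: text[i:])
llcp, rlcp = build_llcp_rlcp(text, pos)
for w in ["s", "as", "assa", "ast"]:
    print(w, find_lw(text, pos, llcp, rlcp, w))  # LW values 4, 0, 0, 2
```

In case 1 the call to `extend` stops immediately, so only case 3 actually advances through the characters of W.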

Search in O(P + log N) using lcp’s

Space complexity

- The implementation uses three N-sized arrays of integers: POS, Llcp and Rlcp.
- We didn’t show Rlcp in the example; it is used in the same way in the cases where r > l.
- Now we move on to see how to prepare these 3 arrays while sorting.


Sorting the suffixes

- We will see a variation of radix sort.
- We sort in O(log N) stages, called stages 1, 2, 4, 8, … Stage number H = 2^i is called the H-stage.
- In stage H the suffixes are sorted into buckets, called H-buckets, according to their first H characters. (The next stage is 2H.)
- If $A_i$ and $A_j$ are in the same H-bucket, we sort them by their next H symbols in stage 2H.

The general idea

If $A_i$ and $A_j$ are in the same H-bucket, we sort them by their next H symbols in stage 2H. But their next H symbols are the first H symbols of $A_{i+H}$ and $A_{j+H}$, which are already sorted in stage H.

[Figure, H = 2: $A_i$ = abef… and $A_j$ = abcd… are both in the first bucket (ab…). The second bucket holds bb…, the third bucket holds cd… ($A_{j+H}$) and the fourth bucket holds ef… ($A_{i+H}$).]

The sorting algorithm

- The first stage involves only a bucket sort on the first character.
- Assume the suffixes are now ordered in $\le_H$ order. We go over the semi-sorted suffix array: for each $A_i$, move $A_{i-H}$ to the next available place in its H-bucket.
- The suffixes are now sorted according to $\le_{2H}$ order.
- Go over the array again and decide which suffixes open a new 2H-bucket, using lcp knowledge (described later).
- In this way, POS gets more and more sorted until every suffix is in a bucket of its own.
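The doubling idea can be sketched in a few lines of Python. Note the substitution: instead of the in-place bucket moves described on the slide, this sketch re-sorts by the pair (rank of the first H characters, rank of the next H characters) with Python's comparison sort, so it costs O(N log² N) rather than O(N log N). The function name `suffix_array_doubling` is an assumption for illustration.

```python
def suffix_array_doubling(text):
    """Suffix array by prefix doubling (simplified variant, not the bucket version)."""
    n = len(text)
    if n == 0:
        return []
    rank = [ord(c) for c in text]  # rank[i]: bucket of suffix i by its first H characters
    pos = list(range(n))
    h = 1
    while True:
        def key(i):
            # The next H symbols of suffix i are the first H symbols of suffix i+H.
            return (rank[i], rank[i + h] if i + h < n else -1)

        pos.sort(key=key)  # now sorted by the first 2H characters
        new_rank = [0] * n
        for k in range(1, n):
            # A suffix opens a new 2H-bucket when its key differs from its neighbour's.
            opens_bucket = key(pos[k]) != key(pos[k - 1])
            new_rank[pos[k]] = new_rank[pos[k - 1]] + (1 if opens_bucket else 0)
        rank = new_rank
        if rank[pos[-1]] == n - 1:  # every suffix is in a bucket of its own
            return pos
        h *= 2


print(suffix_array_doubling("assassin"))  # [0, 3, 6, 7, 2, 5, 1, 4]
```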

An example of A = “assassin”

Position:  0 1 2 3 4 5 6 7
A =        a s s a s s i n

Stage H = 1. After the bucket sort on the first character the array is (buckets separated by |):

assin assassin | in | n | sin ssin sassin ssassin

We go over the array from left to right; each Ai sets Ai-1 in the next available place of its bucket:

Ai              Ai-1            Array after the step
A3 (assin)      A2 (sassin)     assin assassin | in | n | sassin sin ssin ssassin
A0 (assassin)   none            not possible because i = 0
A6 (in)         A5 (sin)        unchanged
A7 (n)          A6 (in)         unchanged (already the first in its bucket)
A2 (sassin)     A1 (ssassin)    assin assassin | in | n | sassin sin ssassin ssin
A5 (sin)        A4 (ssin)       unchanged
A1 (ssassin)    A0 (assassin)   assassin assin | in | n | sassin sin ssassin ssin
A4 (ssin)       A3 (assin)      unchanged

Go over the array to get the new 2-buckets. For example, lcp(sassin, sin) = 1 + lcp(assin, in) = 1 + 0 = 1, so “sin” opens a new 2-bucket.

An example (continued)

Stage H = 2. The array is now in 2-order (2-buckets separated by |):

assassin assin | in | n | sassin | sin | ssassin ssin

Each Ai sets Ai-2 in the next available place of its bucket:

Ai              Ai-2            Note
A0 (assassin)   none            not possible because i = 0
A3 (assin)      A1 (ssassin)
A6 (in)         A4 (ssin)
A7 (n)          A5 (sin)
A2 (sassin)     A0 (assassin)   Ai-2 is already the first in its bucket
A5 (sin)        A3 (assin)
A1 (ssassin)    none            not possible because i - 2 < 0
A4 (ssin)       A2 (sassin)

The order of the array does not change during this stage.

Go over the array to get the new 4-buckets:

- lcp(assassin, assin) = 2 + lcp(sassin, sin) = 2 + 1 = 3, so “assin” opens a new 4-bucket.
- lcp(ssassin, ssin) = 2 + lcp(assin, in) = 2 + 0 = 2, so “ssin” opens a new 4-bucket.

Stage H = 4. Every suffix is in a bucket of its own: we are done.

Complexity analysis

- The first stage (bucket sort) is O(N).
- There are O(log N) stages, each taking O(N):
  - one traversal for the sorting;
  - one traversal to determine the new buckets.
- Total time complexity is O(N log N).
- Space complexity:
  - We hold 3 integer arrays: POS; PRM, which is the inverse of POS (PRM[POS[i]] = i); and another array that tells us which suffix was moved last in every bucket.
  - We hold 2 Boolean arrays that tell us where each bucket begins, for this stage and for the previous stage.
  - Altogether: O(N).
- We still have to show how we obtained the lcp information.


Take a break


lcp information – general idea

- We used the lcp information to determine where to split buckets for the next iteration.
- That is why we are only interested in pairs of suffixes $A_p$, $A_q$ that are in the same H-bucket but will not be in the same 2H-bucket.
- We would also like to compute this information while constructing the array.

lcp information – general idea

- Let’s see what we know about such $A_p$ and $A_q$:
  $H \le lcp(A_p, A_q) < 2H$
  $lcp(A_p, A_q) = H + lcp(A_{p+H}, A_{q+H})$
  $lcp(A_{p+H}, A_{q+H}) < H$
  That is why $A_{p+H}$ and $A_{q+H}$ are in different H-buckets.
- Throughout the algorithm we keep track of the lcp values between neighbours in adjacent buckets.
- What about suffixes that are not in adjacent buckets?

lcp information – general idea

Let’s notice something: if $A_{POS[i]} < A_{POS[j]}$ then

$lcp(A_{POS[i]}, A_{POS[j]}) = \min_{k \in [i, j-1]} lcp(A_{POS[k]}, A_{POS[k+1]})$

That means that their lcp is the minimum over all the adjacent pairs between them.

[Figure, H = 2: the array assassin, assin, in, n, sassin, sin, ssassin, ssin with $A_p$, $A_q$, $A_{p+H}$ and $A_{q+H}$ marked. The adjacent lcp values between $A_{p+H}$ and $A_{q+H}$ are 1, 1 and 2.]

$lcp(A_{p+H}, A_{q+H}) = \min\{1, 1, 2\} = 1$
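The minimum property is easy to check on the running example. The sketch below computes the adjacent lcp values by brute force and then answers any pair from them; the names `lcp`, `hgt` and `lcp_by_min` are assumptions for illustration, and `hgt[k]` here is 0-based (the lcp of the k-th and (k+1)-th smallest suffixes).

```python
def lcp(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n


text = "assassin"
pos = sorted(range(len(text)), key=lambda i: text[i:])
suffixes = [text[i:] for i in pos]

# hgt[k] = lcp of the k-th and (k+1)-th smallest suffixes.
hgt = [lcp(suffixes[k], suffixes[k + 1]) for k in range(len(suffixes) - 1)]


def lcp_by_min(i, j):
    """lcp of the i-th and j-th smallest suffixes (i < j), from adjacent values only."""
    return min(hgt[i:j])


print(hgt)               # [3, 0, 0, 0, 1, 1, 2]
print(lcp_by_min(4, 7))  # lcp(sassin, ssin) = min{1, 1, 2} = 1
```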

lcp information – general idea

So, let’s conclude:

- We don’t need to hold the lcp of every pair; we can obtain it from the minimum over all adjacent pairs between the two suffixes.
- We hold an array Hgt[N-1] for that purpose.
- We use interval trees: balanced trees that can hold this information for us. Their space complexity is O(N).
- The leaves hold the lcp of adjacent pairs, and each internal node holds the minimum of its children.
- We can obtain the lcp of any pair in O(log N).
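One way to picture such a tree is an array-based minimum tree with point updates and range-minimum queries, sketched below. This layout is an assumption chosen for brevity and is not necessarily the structure used in the paper; the class name `MinTree` is hypothetical.

```python
class MinTree:
    """Leaves hold adjacent lcp values; each internal node holds the minimum of its children."""

    def __init__(self, values):
        self.n = len(values)
        # tree[n:] are the leaves, tree[1:n] the internal nodes, tree[0] is unused.
        self.tree = [0] * self.n + list(values)
        for i in range(self.n - 1, 0, -1):
            self.tree[i] = min(self.tree[2 * i], self.tree[2 * i + 1])

    def update(self, i, value):
        """Set leaf i and repair the O(log N) nodes above it."""
        i += self.n
        self.tree[i] = value
        i //= 2
        while i >= 1:
            self.tree[i] = min(self.tree[2 * i], self.tree[2 * i + 1])
            i //= 2

    def query(self, lo, hi):
        """Minimum of the leaves lo, lo+1, ..., hi-1 in O(log N)."""
        result = float("inf")
        lo += self.n
        hi += self.n
        while lo < hi:
            if lo & 1:
                result = min(result, self.tree[lo])
                lo += 1
            if hi & 1:
                hi -= 1
                result = min(result, self.tree[hi])
            lo //= 2
            hi //= 2
        return result


tree = MinTree([3, 0, 0, 0, 1, 1, 2])  # adjacent lcp values for "assassin"
print(tree.query(4, 7))  # lcp of the 4th and 7th smallest suffixes: 1
```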

lcp information – data structures

- During the construction stage we build an array called Hgt[N-1], where Hgt[i] = lcp($A_{POS[i-1]}$, $A_{POS[i]}$), initialized so that Hgt[i] = N+1 for every i.
- In stage H = 1: Hgt[i] = 0 for every $A_{POS[i]}$ that is first in its bucket.
- When moving from stage H to stage 2H we update Hgt[i] for every $A_{POS[i]}$ that is the first in a newly created 2H-bucket.

lcp information – Hgt[] example

Hgt[] for A = assassin at each stage (9 = N+1 marks an entry that has not been set yet):

Stage   Hgt[]
H = 1   9 0 0 0 9 9 9
H = 2   9 0 0 0 1 1 9
H = 4   3 0 0 0 1 1 2

At H = 1 the array order is POS = 3 0 6 7 5 4 2 1 (assin, assassin, in, n, sin, ssin, sassin, ssassin).

Entries set when moving from H = 1 to H = 2:

- lcp(ssin, sin) = 1 + lcp(sin, in) = 1 + min{lcp(in, n), lcp(sin, n)} = 1 + 0 = 1
- lcp(sin, sassin) = 1 + lcp(in, assin) = 1 + min{lcp(assin, assassin), lcp(assassin, in)} = 1 + 0 = 1

Entries set when moving from H = 2 to H = 4:

- lcp(assassin, assin) = 2 + lcp(sin, sassin) = 2 + 1 = 3
- lcp(ssin, ssassin) = 2 + lcp(in, assin) = 2 + 0 = 2

At H = 4 the array order is POS = 0 3 6 7 2 5 1 4 (assassin, assin, in, n, sassin, sin, ssassin, ssin).

lcp information – the interval tree

Hgt[] cells will become the leaves of T.

[Figure: the interval tree T with leaves (0,1), (1,2), (2,3), (3,4), (4,5), (5,6), (6,7) holding the Hgt values; each internal node holds the minimum of its children.]

Complexity analysis

- This is an amortized analysis.
- Each time a suffix opens a new bucket we change Hgt[i] for its leaf.
- That change requires O(log N) updates in the interval tree.
- There are O(N) leaves opening a new bucket.
- Total time complexity: O(N log N) for all operations altogether.


Equally likely strings

- When we sorted POS, we said we stop when all suffixes are in their own buckets. So the number of stages depends not on N but on r, the length of the longest repeated substring: about log r stages are enough.
- If we assume all strings are equally likely, the expected length of the longest repeated substring is $2\log_{|\Sigma|} N$.
- That means we can limit the number of stages to $\log(2\log_{|\Sigma|} N)$, which is O(log log N).
- The expected total complexity of the sort is therefore O(N log log N).
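To get a feeling for the numbers, the following sketch evaluates the expected stage count from the formula above and prints it next to the worst-case log N. The 26-letter alphabet and the function name `expected_stages` are assumptions chosen for illustration.

```python
import math


def expected_stages(n, sigma):
    """Doubling stages needed to cover the expected longest repeat, 2 * log_sigma(n)."""
    expected_repeat = 2 * math.log(n, sigma)
    return math.ceil(math.log2(expected_repeat))


for n in (10**3, 10**5, 10**7):
    # columns: N, expected number of stages, worst-case number of stages
    print(n, expected_stages(n, 26), math.ceil(math.log2(n)))
```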


Skew algorithm

- Step 1: $SA^{\ne 0}$ = sort the suffixes starting at positions i ≠ 0 mod 3 (recursively).
- Step 2: $SA^{=0}$ = sort the suffixes starting at positions i = 0 mod 3.
- Step 3: SA = merge $SA^{=0}$ and $SA^{\ne 0}$.

Position:  0 1 2 3 4 5 6 7 8 9 10
S =        m i s s i s s i p p i

Skew algorithm step 1: sort the first 3 symbols of the suffixes starting at positions i ≠ 0 mod 3

Position:  0 1 2 3 4 5 6 7 8 9 10 11 12
s =        m i s s i s s i p p i

- Radix sort: replace the triple at each position i ≠ 0 mod 3 by its rank. This gives s12 = 3 3 2 1 5 5 4.
- Recursive call on s12: SA12 = 3 2 1 0 6 5 4.
- That means sorting “3321”, “321”, “21”, “1”, “554”, “54”, “4”.

Skew algorithm step 2: sort the suffixes starting at positions i = 0 mod 3

- This step takes time linear in the size of the string we are dealing with (it may be reached inside a recursive call).
- The suffixes $S_i$ with i mod 3 = 0 are sorted by sorting the pairs $(s[i], S_{i+1})$.
- We want to compare $(s[j], S_{j+1})$ and $(s[i], S_{i+1})$, where i mod 3 = j mod 3 = 0.
- The relative order of $S_{j+1}$ and $S_{i+1}$ can be determined from their positions in SA12 in constant time if we prepare the inverse array $SA^{inv}_{12}$ (the same as we did in the O(N log N) algorithm).

Skew algorithm step 3: merging the results of SA0 and SA12 to get SA

To compare a suffix $S_j$ with j mod 3 = 0 with a suffix $S_i$ with i mod 3 ≠ 0, we distinguish two cases:

- If i mod 3 = 1, we write $S_i$ as $(s[i], S_{i+1})$ and $S_j$ as $(s[j], S_{j+1})$. Since (i+1) mod 3 = 2 and (j+1) mod 3 = 1, the order of $S_{i+1}$ and $S_{j+1}$ is known from $SA^{inv}_{12}$.
- If i mod 3 = 2, we compare the triples $(s[i], s[i+1], S_{i+2})$ and $(s[j], s[j+1], S_{j+2})$, where $S_{i+2}$ and $S_{j+2}$ are compared by their lexicographic names in $SA^{inv}_{12}$.

Example of comparing by pairs (s = mississippi, positions 0 to 10): S3 = (‘s’, ’issippi’) and S6 = (‘s’, ’ippi’). Since $SA^{inv}_{12}$(‘issippi’) > $SA^{inv}_{12}$(‘ippi’), we get S3 > S6.

Skew algorithm step 3: merging the results of SA0 and SA12 to get SA

Position:  0 1 2 3 4 5 6 7 8 9 10 11 12
s =        m i s s i s s i p p i

- SA12 = 3 2 1 0 6 5 4 (as indices into s12)
- Mapped back to positions in s, SA12 is actually 10 7 4 1 8 5 2.
- From step 2 we get SA0 = 0 9 6 3.
- Merging them gives SA = 10 7 4 1 0 9 8 6 3 5 2.
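The three steps can be sketched compactly in Python. This is a simplified illustration, not the linear-time code from the paper: it uses Python's comparison sort where the paper uses radix sort, so it does not achieve O(n). The function name `skew` is an assumption, and the input is assumed to be a sequence of positive integers so that 0 can serve as padding.

```python
def skew(s):
    """Suffix array of a sequence of positive integers (0 is reserved as padding)."""
    n = len(s)
    if n < 2:
        return list(range(n))
    s = list(s) + [0, 0, 0]

    # Step 1: sort the suffixes at positions i mod 3 != 0.
    # If n mod 3 == 1, a dummy position n is added so that the last
    # mod-1 triple ends in padding.
    top = n + 1 if n % 3 == 1 else n
    p12 = [i for i in range(top) if i % 3 == 1] + [i for i in range(top) if i % 3 == 2]
    triples = [tuple(s[i:i + 3]) for i in p12]
    name = {t: k + 1 for k, t in enumerate(sorted(set(triples)))}
    s12 = [name[t] for t in triples]
    if len(name) == len(s12):
        order = sorted(range(len(s12)), key=s12.__getitem__)  # names already unique
    else:
        order = skew(s12)  # recursive call on about 2n/3 names
    sa12 = [p12[k] for k in order]
    rank = [0] * (n + 3)  # inverse of sa12; 0 for positions past the end
    for r, i in enumerate(sa12, start=1):
        rank[i] = r

    # Step 2: sort the suffixes at positions i mod 3 == 0 by the pair (s[i], rank of S_{i+1}).
    sa0 = sorted(range(0, n, 3), key=lambda i: (s[i], rank[i + 1]))

    # Step 3: merge, comparing pairs or triples depending on i mod 3.
    sa12 = [i for i in sa12 if i < n]  # drop the dummy position, if any

    def smaller(i, j):  # i from sa12, j from sa0
        if i % 3 == 1:
            return (s[i], rank[i + 1]) < (s[j], rank[j + 1])
        return (s[i], s[i + 1], rank[i + 2]) < (s[j], s[j + 1], rank[j + 2])

    merged, a, b = [], 0, 0
    while a < len(sa12) and b < len(sa0):
        if smaller(sa12[a], sa0[b]):
            merged.append(sa12[a])
            a += 1
        else:
            merged.append(sa0[b])
            b += 1
    return merged + sa12[a:] + sa0[b:]


print(skew([ord(c) for c in "mississippi"]))  # [10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]
```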

Time complexity analysis

- Step 1: O(n) + T(2n/3)
- Step 2: O(n)
- Step 3: O(n)
- Solving this recurrence: T(n) = O(n) + T(2n/3) = O(n)
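The recurrence can be unrolled into a geometric series. Assuming the non-recursive work of one call on an input of size n is at most cn for some constant c (the constant c is introduced here only for the derivation):

```latex
T(n) \le cn + T\!\left(\tfrac{2}{3}n\right)
     \le cn + c\,\tfrac{2}{3}n + c\left(\tfrac{2}{3}\right)^{2} n + \dots
     \le cn \sum_{k \ge 0} \left(\tfrac{2}{3}\right)^{k}
     = 3cn = O(n)
```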


Some results from the first paper

Empirical results for texts of length 100,000.

THE END

FIBA Europe League – group B

    Team                      GP   W/L   F/A       Pts
1.  Strauss Iscar Nahariya    4    4/0   360/297   8
2.  JDA Dijon                 4    3/1   347/311   7
3.  Besiktas Istanbul         4    3/1   339/320   7
4.  Ionikos Amaliada          4    3/1   344/336   7
5.  Azovmash Mariupol         4    1/3   329/344   5
6.  RBC Verviers-Pepinster    4    1/3   284/307   5
7.  BS|Energy Braunschweig    4    1/3   261/317   5
8.  Ural Great Perm           4    0/4   269/301   4