Counting Suffix Arrays and Strings

18 May 2006

Klaus-Bernd Schürmann, Jens Stoye

Technische Fakultät, Universität Bielefeld, Germany




Dagstuhl, May 2006 - Jens Stoye Slide 2

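Suffix Array Data Structure

Text to be indexed:

| Position | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Character | T | C | T | T | C | T | C | T | T | C | T | C | `$` |

Suffix array: the lexicographically sorted list of all suffixes, each identified by its start position:

| Start position | Suffix |
|---|---|
| 13 | `$` |
| 12 | `C$` |
| 10 | `CTC$` |
| 5 | `CTCTTCTC$` |
| 7 | `CTTCTC$` |
| 2 | `CTTCTCTTCTC$` |
| 11 | `TC$` |
| 9 | `TCTC$` |
| 4 | `TCTCTTCTC$` |
| 6 | `TCTTCTC$` |
| 1 | `TCTTCTCTTCTC$` |
| 8 | `TTCTC$` |
| 3 | `TTCTCTTCTC$` |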
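As a minimal sketch of this definition, the suffix array of the example text can be computed by sorting the start positions by the suffixes that begin there. The function name `suffix_array` is an illustrative choice, and the naive sort is only meant for small inputs; it relies on `$` comparing smaller than `C` and `T`, which holds for ASCII.

```python
def suffix_array(text):
    """Return the suffix array of `text` as 1-based start positions."""
    n = len(text)
    # Naive construction: sort positions by the suffix starting there.
    return sorted(range(1, n + 1), key=lambda i: text[i - 1:])


text = "TCTTCTCTTCTC$"
sa = suffix_array(text)
for pos in sa:
    print(pos, "-", text[pos - 1:])
```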

Overview

1. Classify strings sharing same suffix array

2. Counting strings sharing same suffix array

3. Counting suffix arrays, yielding a lower bound for suffix array compression

4. Summation identities

1. Classify Strings for Suffix Array

Let $t$ be a string of length $n$, $P$ a permutation of $\{1,\dots,n\}$, and $R$ the inverse of $P$, taking $R[n+1] = 0$ (the empty suffix sorts first).

Theorem:

$P$ is the suffix array of $t$ if and only if for all $i \in \{1,\dots,n-1\}$

a) $t[P[i]] \le t[P[i+1]]$, and

b) $t[P[i]] = t[P[i+1]] \Rightarrow R[P[i]+1] < R[P[i+1]+1]$,

which, given a), is the same as

b) $R[P[i]+1] > R[P[i+1]+1] \Rightarrow t[P[i]] < t[P[i+1]]$.
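The two conditions of the theorem translate directly into a linear-time check. The sketch below assumes 1-based permutation values and the convention R[n+1] = 0, which the slide does not spell out; `satisfies_theorem` is a hypothetical helper name. For t = AABCB it tests every permutation of {1,...,5} and keeps those that pass.

```python
from itertools import permutations


def satisfies_theorem(t, P):
    """Check conditions a) and b) for string t and permutation P (1-based values)."""
    n = len(t)
    R = [0] * (n + 2)                # R[n + 1] stays 0 (assumed convention)
    for rank, pos in enumerate(P, start=1):
        R[pos] = rank                # R is the inverse of P
    for i in range(n - 1):           # adjacent pairs P[i], P[i+1]
        p, q = P[i], P[i + 1]
        if t[p - 1] > t[q - 1]:      # violates a)
            return False
        if t[p - 1] == t[q - 1] and R[p + 1] >= R[q + 1]:  # violates b)
            return False
    return True


t = "AABCB"
matches = [list(P) for P in permutations(range(1, 6)) if satisfies_theorem(t, P)]
print(matches)
```

Only one permutation should survive, since a string has exactly one suffix array.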

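1. Classify Strings for Suffix Array

Text to be indexed: t = AABCB (positions 1 to 5)

| i | P[i] | t[P[i]] | Suffix starting at P[i] |
|---|---|---|---|
| 1 | 1 | A | AABCB |
| 2 | 2 | A | ABCB |
| 3 | 5 | B | B |
| 4 | 3 | B | BCB |
| 5 | 4 | C | CB |

a) $t[P[i]] \le t[P[i+1]]$, and

b) $R[P[i]+1] > R[P[i+1]+1] \Rightarrow t[P[i]] < t[P[i+1]]$

R+-descent: a position $i$ with $R[P[i]+1] > R[P[i+1]+1]$, that is, a position where the premise of b) holds. With $R[n+1] = 0$, the example has R+-descents at $i = 2$ and $i = 4$.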
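To make the R+-descents of the example concrete, the following sketch computes the sequence R[P[i]+1] for each i and lists the positions where it drops. The names `r_plus` and `r_plus_descents` are illustrative, and R[n+1] = 0 is an assumed convention for the suffix that ends the text.

```python
def r_plus(P):
    """Return [R[P[i] + 1] for each i], with R the inverse of P and R[n + 1] = 0."""
    n = len(P)
    R = [0] * (n + 2)
    for rank, pos in enumerate(P, start=1):
        R[pos] = rank
    return [R[pos + 1] for pos in P]


def r_plus_descents(P):
    """Return the 1-based positions i where R[P[i] + 1] > R[P[i + 1] + 1]."""
    rp = r_plus(P)
    return [i + 1 for i in range(len(P) - 1) if rp[i] > rp[i + 1]]


P = [1, 2, 5, 3, 4]
print(r_plus(P))           # [2, 4, 0, 5, 3]
print(r_plus_descents(P))  # [2, 4]
```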

1. Classify Strings for Suffix Array

Equivalences between strings

The strings t = AABCB, t2 = AACDC, and t3 = ABDED share the same suffix array P:

| i | P[i] | Suffix of t | Suffix of t2 | Suffix of t3 |
|---|---|---|---|---|
| 1 | 1 | AABCB | AACDC | ABDED |
| 2 | 2 | ABCB | ACDC | BDED |
| 3 | 5 | B | C | D |
| 4 | 3 | BCB | CDC | DED |
| 5 | 4 | CB | DC | ED |

t2 is order-equivalent to t; t3 is order-distinct from t.
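One way to test order-equivalence is to replace every character by its rank among the distinct characters of the string and compare the resulting patterns. This reading of the definition is an assumption (the slide only labels the examples), and `order_pattern` is a hypothetical helper.

```python
def order_pattern(s):
    """Map each character to its rank among the distinct characters of s."""
    ranks = {c: r for r, c in enumerate(sorted(set(s)))}
    return [ranks[c] for c in s]


def order_equivalent(s, u):
    """Assumed definition: same length and same rank pattern."""
    return len(s) == len(u) and order_pattern(s) == order_pattern(u)


print(order_pattern("AABCB"))              # [0, 0, 1, 2, 1]
print(order_equivalent("AABCB", "AACDC"))  # True
print(order_equivalent("AABCB", "ABDED"))  # False
```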

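2. Counting Strings for Suffix Array

Text to be indexed: t = AABCB, together with t2 = AACDC and t3 = ABDED, which share its suffix array P.

Base string: t, whose characters in suffix array order are t[P[i]] = AABBC.

Non-decreasing sequences: adding a non-decreasing sequence of offsets to the base string, in suffix array order, yields the other strings for P.

| i | P[i] | t[P[i]] | Offset for t2 | t2[P[i]] | Offset for t3 | t3[P[i]] |
|---|---|---|---|---|---|---|
| 1 | 1 | A | +0 | A | +0 | A |
| 2 | 2 | A | +0 | A | +1 | B |
| 3 | 5 | B | +1 | C | +2 | D |
| 4 | 3 | B | +1 | C | +2 | D |
| 5 | 4 | C | +1 | D | +2 | E |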
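The construction can be sketched in code: the base string raises the character by one at every R+-descent, and any other string for P adds a non-decreasing offset sequence on top. The helper names, the letters `ABCDE`, and the convention R[n+1] = 0 are assumptions made for illustration.

```python
def base_offsets(P):
    """Character ranks of the base string in suffix array order:
    the running count of R+-descents seen so far."""
    n = len(P)
    R = [0] * (n + 2)                      # R[n + 1] = 0 (assumed convention)
    for rank, pos in enumerate(P, start=1):
        R[pos] = rank
    offsets = [0]
    for i in range(n - 1):
        step = 1 if R[P[i] + 1] > R[P[i + 1] + 1] else 0
        offsets.append(offsets[-1] + step)
    return offsets


def string_for(P, extra, letters="ABCDE"):
    """Add the non-decreasing sequence `extra` to the base string and
    write the resulting characters back to their text positions."""
    t = [None] * len(P)
    for pos, base, e in zip(P, base_offsets(P), extra):
        t[pos - 1] = letters[base + e]
    return "".join(t)


P = [1, 2, 5, 3, 4]
print(string_for(P, [0, 0, 0, 0, 0]))  # base string t
print(string_for(P, [0, 0, 1, 1, 1]))  # t2
print(string_for(P, [0, 1, 2, 2, 2]))  # t3
```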

2. Counting Strings for Suffix Array

Suffix array P of length n with d R+-descents.

Number of strings over an alphabet of size a for P = number of non-decreasing sequences of length n over a-d elements:

$$\binom{n+a-d-1}{a-d-1}$$
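For the running example (P = 1 2 5 3 4, n = 5, d = 2) the count can be cross-checked by brute force: enumerate all strings over an alphabet of size a and keep those whose suffix array is P. This is only a sanity check for small parameters, with `count_strings` as a hypothetical helper name.

```python
from itertools import product
from math import comb


def suffix_array(s):
    """Naive suffix array with 1-based start positions."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def count_strings(P, a):
    """Count strings over the first `a` capital letters whose suffix array is P."""
    letters = "ABCDEFGHIJ"[:a]
    return sum(1 for chars in product(letters, repeat=len(P))
               if suffix_array("".join(chars)) == P)


P, n, d = [1, 2, 5, 3, 4], 5, 2
for a in range(d + 1, 7):
    print(a, count_strings(P, a), comb(n + a - d - 1, a - d - 1))
```

The loop starts at a = d + 1 because fewer characters cannot realise d R+-descents.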

2. Counting Strings for Suffix Array

Suffix array P of length n with d R+-descents.

Number of strings composed of exactly k distinct characters for P is

$$\binom{n-d-1}{k-d-1}$$
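The same brute-force idea checks this count when every one of the k characters must occur. The sketch again uses the example P = 1 2 5 3 4 with n = 5 and d = 2; `count_exact` is an illustrative name.

```python
from itertools import product
from math import comb


def suffix_array(s):
    """Naive suffix array with 1-based start positions."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def count_exact(P, k):
    """Count strings over a k-letter alphabet that use all k letters
    and have suffix array P."""
    letters = "ABCDEFGH"[:k]
    return sum(1 for chars in product(letters, repeat=len(P))
               if len(set(chars)) == k and suffix_array("".join(chars)) == P)


P, n, d = [1, 2, 5, 3, 4], 5, 2
for k in range(d + 1, n + 1):
    print(k, count_exact(P, k), comb(n - d - 1, k - d - 1))
```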

2. Counting Strings for Suffix Array

Number of strings over an alphabet of size 20 for suffix arrays of length n with 10 R+-descents:

| n | Strings composed of up to 20 characters | Strings composed of all 20 characters |
|---|---|---|
| 5 | 2,002 | 0 |
| 10 | 92,378 | 0 |
| 15 | 1,307,504 | 0 |
| 20 | 10,015,005 | 1 |
| 25 | 52,451,256 | 2,002 |
| 30 | 211,915,132 | 92,378 |
| 35 | 708,930,508 | 1,307,504 |
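Both columns of this table follow from the two binomial formulas with a = k = 20 and d = 10. A short sketch that recomputes them (the guard for n - d - 1 < 0 is needed only because `math.comb` rejects negative arguments):

```python
from math import comb

a, d = 20, 10
rows = []
for n in range(5, 40, 5):
    up_to_20 = comb(n + a - d - 1, a - d - 1)
    # No string can use all 20 characters when n - d - 1 is negative.
    all_20 = comb(n - d - 1, a - d - 1) if n - d - 1 >= 0 else 0
    rows.append((n, up_to_20, all_20))
    print(n, up_to_20, all_20)
```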

2. Counting Strings for Suffix Array

Suffix array P of length n with d R+-descents.

Number of order-distinct strings over an alphabet of size a is

$$\sum_{k=d+1}^{a} \binom{n-d-1}{k-d-1}$$

Number of order-distinct strings where all k distinct characters must appear is

$$\binom{n-d-1}{k-d-1}$$
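A brute-force count of order-distinct strings can be compared with the sum for the example P = 1 2 5 3 4 (n = 5, d = 2). The sketch treats two strings as order-equivalent when their rank patterns agree, which is an assumed reading of the definition; the helper names are illustrative.

```python
from itertools import product
from math import comb


def suffix_array(s):
    """Naive suffix array with 1-based start positions."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def order_pattern(s):
    """Rank pattern of s: each character replaced by its rank in s."""
    ranks = {c: r for r, c in enumerate(sorted(set(s)))}
    return tuple(ranks[c] for c in s)


def count_order_distinct(P, a):
    """Number of distinct rank patterns among strings over `a` letters
    whose suffix array is P."""
    letters = "ABCDEFGH"[:a]
    return len({order_pattern(chars)
                for chars in product(letters, repeat=len(P))
                if suffix_array("".join(chars)) == P})


P, n, d = [1, 2, 5, 3, 4], 5, 2
for a in range(d + 1, 6):
    formula = sum(comb(n - d - 1, k - d - 1) for k in range(d + 1, a + 1))
    print(a, count_order_distinct(P, a), formula)
```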

3. Counting Suffix Arrays

Definition:

Let P be a permutation of $\{1,\dots,n\}$. Position $i \in \{1,\dots,n-1\}$ is a permutation descent if $P[i] > P[i+1]$.

Definition:

The Eulerian number $\left\langle {n \atop d} \right\rangle$ gives the number of permutations of $\{1,\dots,n\}$ with exactly d permutation descents.

3. Counting Suffix Arrays

Well-known fact:

Recursive enumeration of Eulerian numbers

a) $\left\langle {n \atop 0} \right\rangle = 1$,

b) $\left\langle {n \atop d} \right\rangle = 0$ for $n \le d$, and

c) $\left\langle {n \atop d} \right\rangle = (d+1) \left\langle {n-1 \atop d} \right\rangle + (n-d) \left\langle {n-1 \atop d-1} \right\rangle$
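The recursion can be turned into a small memoised function. The sketch assumes n ≥ 1, so that cases a) and b) never overlap; `eulerian` is an illustrative name.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def eulerian(n, d):
    """Eulerian number <n, d> for n >= 1, following cases a) to c)."""
    if d < 0 or d >= n:      # case b); d < 0 is only a safety guard
        return 0
    if d == 0:               # case a)
        return 1
    # case c)
    return (d + 1) * eulerian(n - 1, d) + (n - d) * eulerian(n - 1, d - 1)


for n in range(1, 7):
    print(n, [eulerian(n, d) for d in range(n)])
```

Each row should sum to n!, since every permutation has some number of descents.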

3. Counting Suffix Arrays

Definition: Let A(n,d) be the number of permutations of length n with d R+-descents.

Observation:

a) A(n,0) = 1

b) A(n,d) = 0 for $n \le d$

c) see the next slides

3. Counting Suffix Arrays

Text to be indexed: t = AABCB

| i | P_t[i] | t[P_t[i]] | Suffix |
|---|---|---|---|
| 1 | 1 | A | AABCB |
| 2 | 2 | A | ABCB |
| 3 | 5 | B | B |
| 4 | 3 | B | BCB |
| 5 | 4 | C | CB |

Prepending A gives At = AAABCB (positions 1 to 6):

| i | P_At[i] | At[P_At[i]] | Suffix |
|---|---|---|---|
| 1 | 1 | A | AAABCB |
| 2 | 2 | A | AABCB |
| 3 | 3 | A | ABCB |
| 4 | 6 | B | B |
| 5 | 4 | B | BCB |
| 6 | 5 | C | CB |

(d+1) possible positions for the new suffix without an additional R+-descent

3. Counting Suffix Arrays

Text to be indexed: t = AABCB

| i | P_t[i] | t[P_t[i]] | Suffix |
|---|---|---|---|
| 1 | 1 | A | AABCB |
| 2 | 2 | A | ABCB |
| 3 | 5 | B | B |
| 4 | 3 | B | BCB |
| 5 | 4 | C | CB |

Prepending B gives Bt = BAABCB (positions 1 to 6):

| i | P_Bt[i] | Bt[P_Bt[i]] | Suffix |
|---|---|---|---|
| 1 | 2 | A | AABCB |
| 2 | 3 | A | ABCB |
| 3 | 6 | B | B |
| 4 | 1 | B | BAABCB |
| 5 | 4 | B | BCB |
| 6 | 5 | C | CB |

(d+1) possible positions for the new suffix without an additional R+-descent
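The insertion argument can be tried out on the example: shift every entry of P by one (the old suffixes now start one position later) and insert the new suffix, start position 1, at each of the six possible ranks. The sketch counts the R+-descents of each resulting permutation, assuming R[n+1] = 0; the helper name is illustrative.

```python
def r_plus_descent_count(P):
    """Number of positions i with R[P[i] + 1] > R[P[i + 1] + 1], R[n + 1] = 0."""
    n = len(P)
    R = [0] * (n + 2)
    for rank, pos in enumerate(P, start=1):
        R[pos] = rank
    rp = [R[pos + 1] for pos in P]
    return sum(1 for i in range(n - 1) if rp[i] > rp[i + 1])


P = [1, 2, 5, 3, 4]                  # suffix array of t = AABCB, d = 2
shifted = [p + 1 for p in P]         # old suffixes after prepending one character
counts = []
for r in range(len(P) + 1):          # rank of the new, longest suffix
    Q = shifted[:r] + [1] + shifted[r:]
    counts.append(r_plus_descent_count(Q))
print(counts)  # [2, 3, 3, 2, 2, 3]
```

Three of the six ranks keep d = 2 (the first and fourth correspond to At and Bt), which matches d + 1; the remaining three add one R+-descent.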

3. Counting Suffix Arrays

Together:

a) A(n,0) = 1,

b) A(n,d) = 0 for $n \le d$, and

c) A(n,d) = (d+1) A(n-1,d) + (n-d) A(n-1,d-1)

Theorem: The number A(n,d) of permutations of length n with d R+-descents is the Eulerian number $\left\langle {n \atop d} \right\rangle$.
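For small n the theorem can be checked exhaustively: count the R+-descents of every permutation and compare the tallies with the Eulerian numbers from the recursion. This is a sketch for illustration, assuming R[n+1] = 0 and n ≥ 1; all function names are hypothetical.

```python
from collections import Counter
from functools import lru_cache
from itertools import permutations


def r_plus_descent_count(P):
    """Number of positions i with R[P[i] + 1] > R[P[i + 1] + 1], R[n + 1] = 0."""
    n = len(P)
    R = [0] * (n + 2)
    for rank, pos in enumerate(P, start=1):
        R[pos] = rank
    rp = [R[pos + 1] for pos in P]
    return sum(1 for i in range(n - 1) if rp[i] > rp[i + 1])


@lru_cache(maxsize=None)
def eulerian(n, d):
    """Eulerian number <n, d> for n >= 1 via the recursion."""
    if d < 0 or d >= n:
        return 0
    if d == 0:
        return 1
    return (d + 1) * eulerian(n - 1, d) + (n - d) * eulerian(n - 1, d - 1)


def tally(n):
    """A(n, d) for all d, by enumerating every permutation of {1,...,n}."""
    return Counter(r_plus_descent_count(P) for P in permutations(range(1, n + 1)))


for n in range(1, 7):
    counts = tally(n)
    print(n, [counts[d] for d in range(n)], [eulerian(n, d) for d in range(n)])
```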

3. Counting Suffix Arrays

The number of distinct suffix arrays of length n for strings over an alphabet of size a:

$$\sum_{d=0}^{a-1} \left\langle {n \atop d} \right\rangle$$

Lower bound for compressibility of suffix arrays in the Kolmogorov sense:

$$\log \sum_{d=0}^{a-1} \left\langle {n \atop d} \right\rangle$$
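A sketch of both quantities follows. It cross-checks the sum against an exhaustive enumeration for tiny parameters and reports the bound in bits, taking the logarithm to base 2 (an assumption, since the slide leaves the base open). The function names are illustrative.

```python
from functools import lru_cache
from itertools import product
from math import log2


@lru_cache(maxsize=None)
def eulerian(n, d):
    """Eulerian number <n, d> for n >= 1 via the recursion."""
    if d < 0 or d >= n:
        return 0
    if d == 0:
        return 1
    return (d + 1) * eulerian(n - 1, d) + (n - d) * eulerian(n - 1, d - 1)


def count_suffix_arrays(n, a):
    """Sum of Eulerian numbers <n, d> for d = 0, ..., a - 1."""
    return sum(eulerian(n, d) for d in range(a))


def brute_force_count(n, a):
    """Distinct suffix arrays over all a**n strings (tiny n and a only)."""
    letters = "ABCDEFGH"[:a]
    return len({tuple(sorted(range(n), key=lambda i: s[i:]))
                for s in map("".join, product(letters, repeat=n))})


print(count_suffix_arrays(4, 2), brute_force_count(4, 2))
print(count_suffix_arrays(5, 3), brute_force_count(5, 3))
print("lower bound in bits for n=6, a=4:", log2(count_suffix_arrays(6, 4)))
```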

3. Counting Suffix Arrays

Number of distinct suffix arrays of length n for strings over an alphabet of size 20:

| n | String count ($20^n$) | Suffix array count |
|---|---|---|
| 4 | 160,000 | 24 |
| 6 | $6.4 \times 10^{7}$ | 720 |
| 8 | $2.6 \times 10^{10}$ | 40,320 |
| 10 | $1.0 \times 10^{13}$ | $3.6 \times 10^{6}$ |
| 12 | $4.1 \times 10^{15}$ | $4.8 \times 10^{8}$ |
| 14 | $1.6 \times 10^{18}$ | $8.7 \times 10^{10}$ |
| 16 | $6.6 \times 10^{20}$ | $2.1 \times 10^{13}$ |
| 18 | $2.6 \times 10^{23}$ | $6.4 \times 10^{15}$ |

3. Counting Suffix Arrays

Number of distinct suffix arrays of length n for strings over an alphabet of size 4:

| n | String count ($4^n$) | Suffix array count |
|---|---|---|
| 4 | 256 | 24 |
| 6 | 4,096 | 662 |
| 8 | 65,536 | 20,160 |
| 10 | 1,048,576 | 504,046 |
| 12 | 16,777,216 | 10,670,040 |
| 14 | 268,435,456 | 202,964,470 |
| 16 | 4,294,967,296 | 3,614,083,520 |
| 18 | 68,719,476,736 | 61,786,015,150 |
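The suffix array counts for the alphabet of size 4 can be recomputed by summing the Eulerian numbers with 0 to 3 descents. A minimal sketch, with `table_row` as a hypothetical helper:

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def eulerian(n, d):
    """Eulerian number <n, d> for n >= 1 via the recursion."""
    if d < 0 or d >= n:
        return 0
    if d == 0:
        return 1
    return (d + 1) * eulerian(n - 1, d) + (n - d) * eulerian(n - 1, d - 1)


def table_row(n, a):
    """Return (n, number of strings, number of distinct suffix arrays)."""
    return n, a ** n, sum(eulerian(n, d) for d in range(a))


for n in range(4, 20, 2):
    print(*table_row(n, 4))
```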

4. Summation Identities

Worpitzky's identity, by summing up the number of strings of length n for each suffix array:

$$a^n = \sum_{d=0}^{a-1} \left\langle {n \atop d} \right\rangle \binom{n+a-d-1}{a-d-1} = \sum_{i} \left\langle {n \atop i} \right\rangle \binom{a+i}{n}$$

Summation rule for Eulerian numbers to generate the Stirling numbers of the second kind:

$$k! \left\{ {n \atop k} \right\} = \sum_{d=0}^{k-1} \left\langle {n \atop d} \right\rangle \binom{n-d-1}{k-d-1} = \sum_{i} \left\langle {n \atop i} \right\rangle \binom{i}{n-k}$$
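Both identities can be verified numerically for small parameters. The sketch below uses the standard recurrence for Stirling numbers of the second kind and restricts k to 1 ≤ k ≤ n so that every binomial argument stays non-negative; all names are illustrative.

```python
from functools import lru_cache
from math import comb, factorial


@lru_cache(maxsize=None)
def eulerian(n, d):
    """Eulerian number <n, d> for n >= 1 via the recursion."""
    if d < 0 or d >= n:
        return 0
    if d == 0:
        return 1
    return (d + 1) * eulerian(n - 1, d) + (n - d) * eulerian(n - 1, d - 1)


@lru_cache(maxsize=None)
def stirling2(n, k):
    """Stirling number of the second kind via S(n,k) = k*S(n-1,k) + S(n-1,k-1)."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)


def worpitzky(n, a):
    """Both right-hand sides of Worpitzky's identity for a**n."""
    # Terms with d >= n vanish, so stop early to keep comb arguments valid.
    first = sum(eulerian(n, d) * comb(n + a - d - 1, a - d - 1)
                for d in range(min(a, n)))
    second = sum(eulerian(n, i) * comb(a + i, n) for i in range(n))
    return first, second


def stirling_sums(n, k):
    """Both right-hand sides of the identity for k! * S(n, k), 1 <= k <= n."""
    first = sum(eulerian(n, d) * comb(n - d - 1, k - d - 1) for d in range(k))
    second = sum(eulerian(n, i) * comb(i, n - k) for i in range(n))
    return first, second


print(worpitzky(5, 4), 4 ** 5)
print(stirling_sums(5, 3), factorial(3) * stirling2(5, 3))
```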

Summary

Constructive proofs to count strings sharing the same suffix array

Constructive proof to count distinct suffix arrays yielding lower bound for suffix array compression

Constructive proofs for Worpitzky's identity and for the summation rule for Eulerian numbers that yields the Stirling numbers of the second kind

Outlook

Efficient enumeration algorithm for suffix arrays

Compressed suffix arrays for fast querying in bioinformatics applications

Average case analysis under non-uniform model


Thank you for your attention!