Suffix Trees, Suffix Arrays and Suffix Trays Richard Cole Tsvi Kopelowitz Moshe Lewenstein.


Page 1: Suffix Trees, Suffix Arrays and Suffix Trays Richard Cole Tsvi Kopelowitz Moshe Lewenstein.

Suffix Trees, Suffix Arrays and Suffix Trays

Richard Cole

Tsvi Kopelowitz

Moshe Lewenstein

Page 2: Suffix Trees, Suffix Arrays and Suffix Trays Richard Cole Tsvi Kopelowitz Moshe Lewenstein.

Indexing problem

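Input: text T = t1,…,tn (preprocessed into a data structure, DS)

Queries: pattern P = p1,…,pm; report the occurrences of P in T (using the DS)

[Figure: text T with positions 5, 14 and 30 marked]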

Page 3: Suffix Trees, Suffix Arrays and Suffix Trays Richard Cole Tsvi Kopelowitz Moshe Lewenstein.

Suffix Property

P appears at location i of T iff

P is a prefix of the suffix Ti

[Figure: text T with positions 5, 14 and 30 marked, and the suffix T14 that starts at position 14]
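The suffix property can be checked directly. The following is a minimal sketch, not an index: it scans every suffix, and the function name `occurrences` is only illustrative.

```python
def occurrences(text: str, pattern: str) -> list[int]:
    """Return the 1-based positions i such that pattern is a prefix of the suffix T_i."""
    return [i + 1 for i in range(len(text)) if text[i:].startswith(pattern)]


positions = occurrences("mississippi", "ssi")
print(positions)
```

Every structure on the following slides is a way to avoid this linear scan over all suffixes.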

Page 4: Suffix Trees, Suffix Arrays and Suffix Trays Richard Cole Tsvi Kopelowitz Moshe Lewenstein.

Suffix Tree

A suffix tree for string S is a compressed trie of all suffixes of S.

Example: s = abab$

Suffixes: { $, b$, ab$, bab$, abab$ }

[Figure: suffix tree of abab$ with edge labels ab, b, ab$ and $]

Page 5: Suffix Trees, Suffix Arrays and Suffix Trays Richard Cole Tsvi Kopelowitz Moshe Lewenstein.

Suffix Tree

The size of the suffix tree of S is O(|S|).

Example: s = abab$

Suffixes: { $, b$, ab$, bab$, abab$ }

[Figure: suffix tree of abab$ with edge labels ab, b, ab$ and $, and leaves labeled 0 to 4]

Page 6: Suffix Trees, Suffix Arrays and Suffix Trays Richard Cole Tsvi Kopelowitz Moshe Lewenstein.

Suffix Tree

The size of the suffix tree of S is O(|S|).

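Example: s = abab$

Suffixes: { $, b$, ab$, bab$, abab$ }

[Figure: the same suffix tree with each edge label written as a pair of indices into s: [2,3] for ab, [1,1] for b, [2,4] for ab$, [4,4] for $; leaves labeled 0 to 4]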
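The index-pair representation can be sketched as follows. This is a naive quadratic-time construction for illustration only, not one of the linear-time algorithms; the names `Node`, `build_suffix_tree` and `edges` are hypothetical, and the string is assumed to end with a unique terminator. An edge may point at a different occurrence of its label than the one drawn on the slide.

```python
class Node:
    def __init__(self):
        self.children = {}    # first character -> (start, end, child)
        self.suffix = None    # start index of the suffix, set on leaves only


def build_suffix_tree(s: str) -> Node:
    """Naive quadratic construction; s must end with a unique terminator."""
    root, n = Node(), len(s)
    for i in range(n):
        node, j = root, i                 # j = next unmatched position of suffix i
        while True:
            c = s[j]
            if c not in node.children:    # no edge starts with c: hang a new leaf
                leaf = Node()
                leaf.suffix = i
                node.children[c] = (j, n - 1, leaf)
                break
            start, end, child = node.children[c]
            k = start
            while k <= end and s[k] == s[j]:
                k += 1
                j += 1
            if k > end:                   # whole edge matched: descend
                node = child
                continue
            mid = Node()                  # mismatch inside the edge: split it at k
            mid.children[s[k]] = (k, end, child)
            node.children[c] = (start, k - 1, mid)
            node = mid                    # next iteration adds the new leaf below mid
    return root


def edges(node: Node, s: str):
    """Yield ((start, end), label) for every edge below node."""
    for start, end, child in node.children.values():
        yield (start, end), s[start:end + 1]
        yield from edges(child, s)


s = "abab$"
tree = build_suffix_tree(s)
edge_list = list(edges(tree, s))
```

Each edge costs two integers regardless of the length of its label, which is what keeps the whole tree at O(|S|) space.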

Page 7: Suffix Trees, Suffix Arrays and Suffix Trays Richard Cole Tsvi Kopelowitz Moshe Lewenstein.

Indexing and Suffix Trees

Navigate from root. (Use suffix property).

P = ssi

Time: O(|P| + occ)

Page 8: Suffix Trees, Suffix Arrays and Suffix Trays Richard Cole Tsvi Kopelowitz Moshe Lewenstein.

Indexing and Suffix Trees

Navigate from root. (Use suffix property).

P = ssi

Time: O(|P| log|Σ| + occ)

Page 9: Suffix Trees, Suffix Arrays and Suffix Trays Richard Cole Tsvi Kopelowitz Moshe Lewenstein.

Suffix Trees

Weiner 1973 (linear time construction!)

McCreight 1975 (space efficient)

Ukkonen 1995 (online)

Farach 1997 (poly range alphabets)

Page 10: Suffix Trees, Suffix Arrays and Suffix Trays Richard Cole Tsvi Kopelowitz Moshe Lewenstein.

Suffix Array POS

All suffixes:

S1 mississippi
S2 ississippi
S3 ssissippi
S4 sissippi
S5 issippi
S6 ssippi
S7 sippi
S8 ippi
S9 ppi
S10 pi
S11 i

Sorted suffixes, with POS giving the starting position of each:

POS  Suffix
11   S11 i
8    S8 ippi
5    S5 issippi
2    S2 ississippi
1    S1 mississippi
10   S10 pi
9    S9 ppi
7    S7 sippi
4    S4 sissippi
6    S6 ssippi
3    S3 ssissippi
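The POS column can be reproduced with a direct sort of the suffixes. This is only a sketch that relies on Python's built-in `sorted`; it is far slower than the construction algorithms cited on a later slide, and the name `suffix_array` is illustrative.

```python
def suffix_array(s: str) -> list[int]:
    """1-based starting positions of the suffixes of s in lexicographic order."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


pos = suffix_array("mississippi")
print(pos)
```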

Page 11: Suffix Trees, Suffix Arrays and Suffix Trays Richard Cole Tsvi Kopelowitz Moshe Lewenstein.

Suffix Array

S = m i s s i s s i p p i

SA(S) = 11 8 5 2 1 10 9 7 4 6 3

P = pi

Page 15: Suffix Trees, Suffix Arrays and Suffix Trays Richard Cole Tsvi Kopelowitz Moshe Lewenstein.

Suffix Array

S = m i s s i s s i p p i

SA(S) = 11 8 5 2 1 10 9 7 4 6 3

P = pi

Time: O(|P|*log |S|)
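The search can be sketched as two binary searches that delimit the block of suffixes starting with P. Each step compares P with at most |P| characters of one suffix, which gives the O(|P| log |S|) bound. The function name `sa_search` is an assumption for illustration, and the suffix array is built here by plain sorting.

```python
def sa_search(s: str, sa: list[int], p: str) -> list[int]:
    """Return the 1-based positions of p in s, using binary search on the suffix array sa."""
    m = len(p)

    def prefix(rank: int) -> str:
        start = sa[rank] - 1              # sa holds 1-based positions
        return s[start:start + m]         # first m characters of that suffix

    lo, hi = 0, len(sa)
    while lo < hi:                        # first suffix whose prefix is >= p
        mid = (lo + hi) // 2
        if prefix(mid) < p:
            lo = mid + 1
        else:
            hi = mid
    left = lo

    hi = len(sa)
    while lo < hi:                        # first suffix whose prefix is > p
        mid = (lo + hi) // 2
        if prefix(mid) <= p:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[left:lo])


S = "mississippi"
SA = sorted(range(1, len(S) + 1), key=lambda i: S[i - 1:])
hits = sa_search(S, SA, "pi")
print(hits)
```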

Page 16: Suffix Trees, Suffix Arrays and Suffix Trays Richard Cole Tsvi Kopelowitz Moshe Lewenstein.

Suffix Array

Introduced:

Manber and Myers (1993).

Gonnet, Baeza-Yates, Snider (1992) (PAT arrays).

Manber and Myers (1993):

Time - O(|P| + log |S|)

Page 17: Suffix Trees, Suffix Arrays and Suffix Trays Richard Cole Tsvi Kopelowitz Moshe Lewenstein.

Suffix Array Construction

Manber and Myers (1993) - O(n log n).

Kärkkäinen-Sanders (2003) - O(n) (poly range)

Two other papers as well.

Page 18: Suffix Trees, Suffix Arrays and Suffix Trays Richard Cole Tsvi Kopelowitz Moshe Lewenstein.

End of Story?

No. Lots of questions.

1. Construction Time of Suffix Trees.

2. Query Time.

3. Compressed Indexing Structures.

4. Indexing with Errors.

5. Real-Time S.T. construction.

Page 19: Suffix Trees, Suffix Arrays and Suffix Trays Richard Cole Tsvi Kopelowitz Moshe Lewenstein.

Query Time for Large Alphabets

Suffix Trees: O(|P|*log|Σ|) (deterministic)

Suffix Arrays: O(|P| + log |T|)

Suffix Trays: O(|P|+log|Σ|) for alphabets {1,…,|Σ|}

Page 20: Suffix Trees, Suffix Arrays and Suffix Trays Richard Cole Tsvi Kopelowitz Moshe Lewenstein.

Query Time for Large Alphabets

Actually it is easy to answer queries in O(|P|) time.

Create at every node of the suffix tree an array of length |Σ|.

Then navigation at every node is O(1).

However, the time and space of suffix tree construction then become O(n|Σ|).
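The trade-off can be illustrated with two hypothetical node layouts. This is a sketch assuming the alphabet {1,…,|Σ|}; the class names `ArrayNode` and `SortedNode` are not from the slides.

```python
import bisect


class ArrayNode:
    """Node with one child slot per alphabet symbol 1..sigma."""

    def __init__(self, sigma: int):
        self.slots = [None] * sigma       # O(|Σ|) space even if most slots stay empty

    def add(self, c: int, child) -> None:
        self.slots[c - 1] = child

    def get(self, c: int):
        return self.slots[c - 1]          # O(1) lookup


class SortedNode:
    """Node that stores only its existing children, sorted by symbol."""

    def __init__(self):
        self.symbols = []
        self.children = []

    def add(self, c: int, child) -> None:
        i = bisect.bisect_left(self.symbols, c)
        self.symbols.insert(i, c)
        self.children.insert(i, child)

    def get(self, c: int):
        i = bisect.bisect_left(self.symbols, c)   # O(log |Σ|) lookup
        if i < len(self.symbols) and self.symbols[i] == c:
            return self.children[i]
        return None


sigma = 1000
a, b = ArrayNode(sigma), SortedNode()
for c in (7, 400, 999):
    a.add(c, f"child-{c}")
    b.add(c, f"child-{c}")
```

Both nodes answer the same lookups, but the array node reserves 1000 slots for 3 children, which is where the O(n|Σ|) total comes from.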

Page 21: Suffix Trees, Suffix Arrays and Suffix Trays Richard Cole Tsvi Kopelowitz Moshe Lewenstein.

Query Time for Large Alphabets

Suffix Trees: O(|P|*log|Σ|) (deterministic)

Suffix Arrays: O(|P| + log |S|)

Suffix Trays: O(|P|+log|Σ|) for alphabets {1,…,|Σ|}

Page 22: Suffix Trees, Suffix Arrays and Suffix Trays Richard Cole Tsvi Kopelowitz Moshe Lewenstein.

Suffix Tree – Suffix Array connection

When the children of every node are kept in sorted order, the left-to-right ordering of the suffixes (leaves) in the suffix tree is exactly the suffix array.
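The connection can be checked on a small example. The sketch below uses an uncompressed suffix trie rather than a suffix tree, to keep the code short; the leaf order is the same. It orders children by character code, so $ sorts before the letters, which may differ from the child order used in the slide drawings. The name `leaf_order` is illustrative.

```python
def leaf_order(s: str) -> list[int]:
    """Build an uncompressed suffix trie of s and list its leaves left to right.

    Assumes s ends with a unique terminator, so every suffix ends at a leaf.
    """
    root = {"children": {}, "start": None}
    for i in range(len(s)):
        node = root
        for c in s[i:]:
            node = node["children"].setdefault(c, {"children": {}, "start": None})
        node["start"] = i + 1             # 1-based start of the suffix ending here

    order = []
    stack = [root]
    while stack:                          # iterative depth-first traversal
        node = stack.pop()
        if node["start"] is not None:
            order.append(node["start"])
        # push children in reverse so the smallest character is visited first
        for c in sorted(node["children"], reverse=True):
            stack.append(node["children"][c])
    return order


text = "mississippi$"
leaves = leaf_order(text)
by_sorting = sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])
print(leaves == by_sorting)
```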

Page 23: Suffix Trees, Suffix Arrays and Suffix Trays Richard Cole Tsvi Kopelowitz Moshe Lewenstein.

Suffix Array POS

All suffixes:

S1 mississippi$
S2 ississippi$
S3 ssissippi$
S4 sissippi$
S5 issippi$
S6 ssippi$
S7 sippi$
S8 ippi$
S9 ppi$
S10 pi$
S11 i$
S12 $

Sorted suffixes, with POS giving the starting position of each:

POS  Suffix
8    S8 ippi$
5    S5 issippi$
2    S2 ississippi$
11   S11 i$
1    S1 mississippi$
9    S9 ppi$
10   S10 pi$
6    S6 ssippi$
3    S3 ssissippi$
7    S7 sippi$
4    S4 sissippi$
12   S12 $

Page 24: Suffix Trees, Suffix Arrays and Suffix Trays Richard Cole Tsvi Kopelowitz Moshe Lewenstein.

Example: Mississippi$

SA(mississippi$) = 8 5 2 11 1 9 10 6 3 7 4 12

Page 25: Suffix Trees, Suffix Arrays and Suffix Trays Richard Cole Tsvi Kopelowitz Moshe Lewenstein.

Suffix Tree – Suffix Array connection

We utilize this connection as follows:

Every node in the suffix tree corresponds to an interval in suffix array.

Page 26: Suffix Trees, Suffix Arrays and Suffix Trays Richard Cole Tsvi Kopelowitz Moshe Lewenstein.

Example: Mississippi$

SA(mississippi$) = 8 5 2 11 1 9 10 6 3 7 4 12

Page 27: Suffix Trees, Suffix Arrays and Suffix Trays Richard Cole Tsvi Kopelowitz Moshe Lewenstein.

Suffix Tree – Suffix Array connection

Moreover,

Time to search in suffix array on interval I is:

O(|P| + log |I|).

Page 28: Suffix Trees, Suffix Arrays and Suffix Trays Richard Cole Tsvi Kopelowitz Moshe Lewenstein.

Suffix Tree – Suffix Array connection

Definition: a |Σ|-leaf is a node that

(1) has at least |Σ| leaves in its subtree, and

(2) has no child with at least |Σ| leaves in its subtree.

The number of leaves in the subtree of a |Σ|-leaf is O(|Σ|^2).

Why?

It has at most |Σ| children, each with fewer than |Σ| leaves in its subtree.
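The definition can be turned into a single bottom-up pass. The sketch below works on a toy tree given as nested lists, where a list is an internal node and anything else is a leaf. This representation and the name `sigma_leaves` are assumptions for illustration, and |Σ| is assumed to be at least 2.

```python
def sigma_leaves(node, sigma: int, found: list[int]) -> int:
    """Return the number of leaves under node.

    Appends to found the leaf count of every node that satisfies the
    |Σ|-leaf definition: at least sigma leaves, and no child with sigma or more.
    """
    if not isinstance(node, list):
        return 1                          # a plain leaf
    counts = [sigma_leaves(child, sigma, found) for child in node]
    total = sum(counts)
    if total >= sigma and all(c < sigma for c in counts):
        found.append(total)
    return total


# Toy tree with 9 leaves and at most 3 children per node, so sigma = 3.
toy = [[1, 2], [3, [4, 5]], [[6, 7, 8], [9]]]
found = []
n_leaves = sigma_leaves(toy, 3, found)
print(n_leaves, found)
```

In this toy tree two nodes qualify, each with 3 leaves below it, which is within the |Σ|^2 = 9 bound.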

Page 29: Suffix Trees, Suffix Arrays and Suffix Trays Richard Cole Tsvi Kopelowitz Moshe Lewenstein.

Suffix Tree – Suffix Array connection

The number of leaves in the subtree of a |Σ|-leaf is O(|Σ|^2).

The time to search in the suffix array interval of a |Σ|-leaf is therefore:

O(|P| + log |Σ|^2) = O(|P| + log |Σ|).

Page 30: Suffix Trees, Suffix Arrays and Suffix Trays Richard Cole Tsvi Kopelowitz Moshe Lewenstein.

Example: Mississippi$

SA(mississippi$) = 8 5 2 11 1 9 10 6 3 7 4 12

Page 31: Suffix Trees, Suffix Arrays and Suffix Trays Richard Cole Tsvi Kopelowitz Moshe Lewenstein.

Suffix Tray

Idea Outline:

Navigate in suffix tree till a |Σ|-leaf is hit and then move to suffix array (time in SA - O(|P| + log |Σ|))

Problem:

Navigation in the suffix tree takes O(|P| log |Σ|) time.

We promised O(|P| + log |Σ|) .

Page 32: Suffix Trees, Suffix Arrays and Suffix Trays Richard Cole Tsvi Kopelowitz Moshe Lewenstein.

Suffix Tray

Recall idea:

Create at every node of the suffix tree an array of length |Σ|.

Then navigation at every node is O(1).

Too expensive overall: O(n|Σ|).

But OK for O(n/|Σ|) nodes.

Page 33: Suffix Trees, Suffix Arrays and Suffix Trays Richard Cole Tsvi Kopelowitz Moshe Lewenstein.

Suffix Tray

Idea: truncate the suffix tree at the |Σ|-leaves to obtain a Σ-tree.

Would be nice: size of Σ-tree = O(n/|Σ|).

However, this is not the case.

[Figure: example tree with edges labeled a and $; legend: nodes with < |Σ| leaves, |Σ|-leaves, the rest]

Page 34: Suffix Trees, Suffix Arrays and Suffix Trays Richard Cole Tsvi Kopelowitz Moshe Lewenstein.

[Figure: suffix tree of S = ababababa$; legend: nodes with < |Σ| leaves, |Σ|-leaves, the rest]

Page 35: Suffix Trees, Suffix Arrays and Suffix Trays Richard Cole Tsvi Kopelowitz Moshe Lewenstein.

Suffix Tray

Alternative idea: extend the definition of the Σ-tree: remove from the suffix tree every node with fewer than |Σ| leaves in its subtree.

Nodes in the Σ-tree:

1. Σ-leaf

2. Branching-Σ-node: node with at least 2 children in the Σ-tree

3. Others: nodes with only one child in the Σ-tree.

Page 36: Suffix Trees, Suffix Arrays and Suffix Trays Richard Cole Tsvi Kopelowitz Moshe Lewenstein.

Suffix Tray - Example

[Figure: suffix tree of the running example with edges labeled ab, ba and $; legend: nodes with < |Σ| leaves, |Σ|-leaves, branching |Σ|-nodes, others]

Page 37: Suffix Trees, Suffix Arrays and Suffix Trays Richard Cole Tsvi Kopelowitz Moshe Lewenstein.

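Suffix Tray

Observation:

Number of Σ-leaves = O(n/|Σ|), since their subtrees are disjoint and each contains at least |Σ| of the n leaves.

Hence, number of branching-Σ-nodes = O(n/|Σ|).

So we can afford to store a Σ-table for navigation at each of these nodes.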

Page 38: Suffix Trees, Suffix Arrays and Suffix Trays Richard Cole Tsvi Kopelowitz Moshe Lewenstein.

Suffix Tray – What is Left?

[Figure: the same tree as in the previous example; legend: nodes with < |Σ| leaves, |Σ|-leaves, branching |Σ|-nodes, others]

Page 39: Suffix Trees, Suffix Arrays and Suffix Trays Richard Cole Tsvi Kopelowitz Moshe Lewenstein.

Suffix Tray

Nodes in Σ-tree with only one child.

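[Figure: a Σ-tree node with only one Σ-tree child, with edge labels ab, b, c, d and e, drawn above the suffix array 8 5 2 11 1 9 10 6 3 7 4 12]

Interval size is less than |Σ|^2.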

Page 40: Suffix Trees, Suffix Arrays and Suffix Trays Richard Cole Tsvi Kopelowitz Moshe Lewenstein.

Suffix Tray

Size of suffix Tray: O(n)

Navigation:

1. Σ-leaf: jump to the suffix array.

2. Branching-Σ-node: look at the Σ-array.

3. Others: look at one character to the Σ-tree child.

Time: O(|P| + log|Σ|)
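The bound can be accounted for term by term. This is a sketch that assumes each Σ-tree node visited costs O(1) beyond matching the pattern characters on its edge, and that the final suffix array interval has fewer than |Σ|^2 entries. The symbol T_query is introduced here only for illustration.

```latex
\[
T_{\text{query}}
  = \underbrace{O(|P|)}_{\text{walk in the } \Sigma\text{-tree}}
  + \underbrace{O(|P| + \log |\Sigma|^{2})}_{\text{search in a suffix array interval}}
  = O(|P| + \log |\Sigma|).
\]
```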

Page 41: Suffix Trees, Suffix Arrays and Suffix Trays Richard Cole Tsvi Kopelowitz Moshe Lewenstein.

End of Story?

No. Lots of questions.

1. Construction Time of Suffix Trees.

2. Query Time.

3. Compressed Indexing Structures.

4. Indexing with Errors.

5. Real-Time S.T. construction.