Introduction

n – length of text, m – length of search pattern string

In general, suffix tree construction takes O(n) time and O(n) space, and searching takes O(m) time.

Although the space requirement is O(n), the hidden constant is usually large.

Introduction (Cont.)

The motivation is to develop a space-efficient data structure with a minimal constant factor on n.

We present a suffix tree representation that uses n + O(n/lg n) words, or equivalently n lg n + O(n) bits, and supports string searching in O(m) time.

Previous Representations

Previous representations either had a larger lower-order term in space and relied on some expectation assumption, or required more time for searching.

Below are some approaches:

1. Keep an alphabet-size vector for each node, so the constant in the space bound is at least |Σ|.

2. Keep a pair <start, end> for each compressed node (or equivalently a pair <start, length>).

3. Save only the length, called the “skip value”.

Using Skip Values for Search

Skip value – the length of the compressed string at a node.

At a compressed node, skip as many characters as the skip value specifies before comparing with the pattern.

Search until the pattern is exhausted or the current character of the pattern has no match at the current node. In the first case, any leaf of the subtree rooted at that node gives a possible starting position of the pattern in the text.

Start at the position given by any of these leaves and confirm that the pattern actually occurs there in the text.
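The search procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tree is built naively by grouping suffixes (not in O(n) time), nodes are plain dictionaries, and the names build, leaves and search are hypothetical helpers rather than anything from the slides. Each internal node stores only its skip value, so the final comparison against the text is what rejects false matches.

```python
def build(text, suffixes, parent_depth=0):
    """Naively build a Patricia-style node for the given suffix start positions.

    Assumes text ends with a unique terminator such as '#'.
    """
    if len(suffixes) == 1:
        return {"leaf": suffixes[0]}
    depth = parent_depth
    # Extend the common prefix until the suffixes disagree.
    while len({text[s + depth] for s in suffixes}) == 1:
        depth += 1
    groups = {}
    for s in suffixes:
        groups.setdefault(text[s + depth], []).append(s)
    return {
        "skip": depth - parent_depth,  # length of the compressed string at this node
        "children": {c: build(text, g, depth) for c, g in groups.items()},
    }


def leaves(node):
    """All suffix positions stored below node."""
    if "leaf" in node:
        return [node["leaf"]]
    return [p for child in node["children"].values() for p in leaves(child)]


def search(text, root, pattern):
    """Return the sorted start positions of pattern in text (empty list if none)."""
    node, i = root, 0  # i = number of pattern characters consumed or skipped
    while i < len(pattern) and "leaf" not in node:
        node = node["children"].get(pattern[i])
        if node is None:
            return []  # no branch for the current pattern character
        # Only the branching character was compared; the rest is skipped.
        i += node.get("skip", len(pattern))
    candidates = leaves(node)
    # Skipped characters were never checked, so confirm against the text once.
    if text.startswith(pattern, candidates[0]):
        return sorted(candidates)
    return []


text = "bababa#"
root = build(text, list(range(len(text))))
print(search(text, root, "aba"))  # occurrences of the slides' example pattern
print(search(text, root, "abb"))  # follows the same path but fails verification
```

A single verification is enough because all leaves below the final node share a prefix at least as long as the pattern.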

Suffix Trees (Patricias)

[Figure: suffix tree (Patricia) for the input text, with a skip value at each internal node and a suffix position at each leaf]

Input text: bababa#

Example

[Figure: suffix tree (Patricia) with skip values for the input text, used to trace the search for the pattern]

Input text: bababa#

Pattern: aba

Previous Representations (Cont.)

In a compressed suffix tree there are n+1 leaf nodes and at most n internal nodes, for a total of at most 2n+1 nodes.

Storage requirement:

1. For the tree

2. For skip values

3. For position indices at the leaves

The representation using skip values requires (2n+1) + n + (n+1) words, about 4n words. Each word takes lg n bits, so the total space required is about 4n lg n + O(n) bits.

A suffix array uses 2n words and has O(m + lg n) search time. A more compact representation uses 1.25n words, but its search time is only an expected bound.

Binary tree ↔ rooted ordered tree

Isomorphism between binary trees and rooted ordered trees.

The ordered tree has a root that does not correspond to any node in the binary tree.

The left child of a binary tree node corresponds to the leftmost child of the corresponding node in the ordered tree.

The right child of a binary tree node corresponds to the next sibling to the right in the ordered tree.

Binary tree representation using the parenthesis sequence

The given binary tree on 10 nodes

[Figure: binary tree with nodes numbered 1 to 10]

Binary tree representation using the parenthesis sequence

Equivalent rooted ordered tree

The parenthesis representation:

0 1 2 2 3 4 4 3 5 5 1 6 6 7 8 9 9 8 10 10 7 0

( ( ( ) ( ( ) ) ( ) ) ( ) ( ( ( ) ) ( ) ) )

[Figure: the equivalent rooted ordered tree, with root 0 and nodes 1 to 10]

Parentheses tree representation

A general rooted ordered tree on n nodes can be represented by 2n parentheses.
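As an illustration, the sketch below turns a binary tree into the parenthesis sequence of its equivalent ordered tree. The dictionary BINARY is an assumption: it is the shape of the 10-node example as read off the labelled parenthesis sequence (left child = leftmost child, right child = next sibling), and sibling_chain and to_parentheses are hypothetical helper names.

```python
# node -> (left child, right child); shape reconstructed from the labelled sequence
BINARY = {
    1: (2, 6), 2: (None, 3), 3: (4, 5), 4: (None, None), 5: (None, None),
    6: (None, 7), 7: (8, None), 8: (9, 10), 9: (None, None), 10: (None, None),
}


def sibling_chain(binary, node):
    """node followed by its right child, that child's right child, and so on."""
    chain = []
    while node is not None:
        chain.append(node)
        node = binary[node][1]
    return chain


def to_parentheses(binary, root):
    """Parenthesis sequence of the ordered tree equivalent to the binary tree."""
    def emit(node):
        # Ordered-tree children: the left child and its chain of right children.
        kids = sibling_chain(binary, binary[node][0])
        return "(" + "".join(emit(k) for k in kids) + ")"

    # The outermost pair is the extra root that matches no binary tree node.
    return "(" + "".join(emit(k) for k in sibling_chain(binary, root)) + ")"


sequence = to_parentheses(BINARY, 1)
print(sequence, len(sequence))
```

The ordered tree has one node more than the binary tree (the extra root), so a binary tree on 10 nodes yields 22 parentheses.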

Use a 2n+o(n)-bit encoding of an n-node binary tree that supports, in constant time:

1. move to left/right child

2. move to parent

3. get the size of subtree

Succinct Suffix Tree Representation

Convert each symbol of the alphabet to binary (0, 1), so that the suffix tree becomes a binary tree. Support the following additional operations in constant time:

leafrank(x): return the number of leaves to the left of node x (in the preorder numbering)

leafselect(j): return the jth leaf in the left-to-right ordering of the leaves

leafsize(x): return the number of leaves in the subtree rooted at node x

leftmost(x): return the leftmost leaf in the subtree rooted at node x

rightmost(x): return the rightmost leaf in the subtree rooted at node x

Example

[Figure: example tree with nodes numbered 1 to 6, shown once for each operation]

leafrank(1) = 2

leafselect(3) = 6

leafsize(1) = 3

Succinct Suffix Tree Representation (Cont.)

Important navigation operations:

rank(i): the number of 1’s up to and including position i

select(i): the position of the ith 1

rank_p(i): the number of occurrences of pattern p up to and including position i

select_p(i): the position of the ith occurrence of p in the given binary string

THEOREM 1

Given a binary string of length n and a binary pattern p of length at most ε lg n, where ε is any constant less than ½, both rank_p(i) and select_p(i) can be supported in constant time using o(n) bits, in addition to the space required for the given binary string.

Intuition

Divide the string into blocks of size lg² n and keep the rank information for the first element of every block.

Divide each block further into small blocks. For the smallest blocks, keep a precomputed table of answers that takes o(n) bits.
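A minimal sketch of this two-level idea for plain rank (counting 1’s) follows. It is an assumption-laden illustration, not the construction from the theorem: the block sizes are chosen as roughly lg² n and (lg n)/2, the bits are held in a Python list, and the final in-block count is computed by summing a short slice where a real structure would use the precomputed table. The class name RankDirectory is hypothetical.

```python
import math


class RankDirectory:
    """Two-level rank directory over a list of 0/1 values (illustrative sketch)."""

    def __init__(self, bits):
        self.bits = bits
        lg = max(1, int(math.log2(max(len(bits), 2))))
        self.small = max(1, lg // 2)     # small block: about (lg n)/2 bits
        self.big = self.small * 2 * lg   # big block: about lg^2 n bits
        self.big_counts = []    # number of 1s before each big block
        self.small_counts = []  # number of 1s from the big block start to each small block
        total = base = 0
        for start in range(0, len(bits), self.small):
            if start % self.big == 0:
                self.big_counts.append(total)
                base = total
            self.small_counts.append(total - base)
            total += sum(bits[start:start + self.small])

    def rank(self, i):
        """Number of 1s in bits[0..i], inclusive (0-based position)."""
        start = (i // self.small) * self.small
        inside = sum(self.bits[start:i + 1])  # stands in for the table lookup
        return self.big_counts[i // self.big] + self.small_counts[i // self.small] + inside


bits = [int(c) for c in "1011001110001111010100110010111000110101" * 3]
directory = RankDirectory(bits)
print(directory.rank(9))
```

Counts inside a big block are stored relative to the block start, which is what keeps the second-level entries small.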

THEOREM 2

A static binary tree on n nodes can be represented using 2n+o(n) bits such that, given a node x, in addition to finding its parent, left child, right child, and the size of the subtree rooted at node x, we can support leafrank(x), leafselect(j), leafsize(x), leftmost(x), and rightmost(x) operations in constant time.

Proof

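Convert the binary tree into a rooted ordered tree.

Leaves in the binary tree correspond to rightmost leaves in the general tree, that is, leaves that are the last child of their parent.

Rightmost leaves in the general tree correspond to occurrences of the pattern “())” in the parenthesis string.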

Proof (Cont.)

[Figure: the 10-node binary tree and its equivalent rooted ordered tree with root 0]

0 1 2 2 3 4 4 3 5 5 1 6 6 7 8 9 9 8 10 10 7 0

( ( ( ) ( ( ) ) ( ) ) ( ) ( ( ( ) ) ( ) ) )

Proof (Cont.)

Since rank_p(x) counts the occurrences of the pattern from the left end of the string up to x, the number of occurrences of p is the number of leaves to the left of node x.

leafrank(x) = rank_p(x), p = “())”

Proof (Cont.)

leafselect(j) = select_p(j)

When p = “())”, the operation select_p(j) returns the jth leaf from the left.

leftmost(x) = select_p(rank_p(x)+1)

rightmost(x) = select_p(rank_p(close(parent(x))-1))

leafsize(x) = rank_p(f(x)) - rank_p(x)

Note that f(x) is the closing parenthesis of the parent of node x.
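The sketch below checks these identities on the 10-node example by brute force. Several things are assumptions made for illustration: positions are 0-based, a node is identified by the position of its opening parenthesis, rank_p counts occurrences that lie entirely within the prefix ending at the given position, and every operation scans the string instead of running in constant time. Under this counting convention rightmost uses close(parent(x)) directly; a convention that counts occurrences by their start position shifts that argument by one.

```python
S = "((()(())())()((())()))"  # parenthesis sequence of the 10-node example
P = "())"                     # pattern marking a leaf of the binary tree


def rank_p(i):
    """Occurrences of P lying entirely within S[0..i] (inclusive)."""
    return sum(1 for k in range(i - len(P) + 2) if S.startswith(P, k))


def select_p(j):
    """Start position of the j-th occurrence of P, counting from 1."""
    k = -1
    for _ in range(j):
        k = S.find(P, k + 1)
        if k == -1:
            raise ValueError("fewer than j occurrences")
    return k


def close_of_parent(x):
    """Position of the ')' closing the parent of the node opened at x."""
    depth = 0
    for k in range(x, len(S)):
        depth += 1 if S[k] == "(" else -1
        if depth < 0:
            return k
    raise ValueError("node has no parent")


def leafrank(x):
    return rank_p(x)


def leafselect(j):
    return select_p(j)


def leftmost(x):
    return select_p(rank_p(x) + 1)


def rightmost(x):
    return select_p(rank_p(close_of_parent(x)))


def leafsize(x):
    return rank_p(close_of_parent(x)) - rank_p(x)


# Opening positions in S: node 1 -> 1, node 3 -> 4, node 4 -> 5, node 5 -> 8,
# node 7 -> 13, node 9 -> 15.
print(leafsize(4), leftmost(4), rightmost(4))
```

Node 3 of the binary tree has the two leaves 4 and 5 below it, which is what the printed values reflect.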

Representing Suffix Tree

The binary encoding of the suffix tree yields a binary tree with 2n+1 nodes.

Use the succinct representation of a binary tree, which takes 2N+o(N) bits for N nodes.

The tree structure of our suffix tree therefore takes 4n+o(n) bits. The third component (the position indices at the leaves) takes n lg n bits. The second component (the skip values) is not stored. Total space needed: 4n + n lg n + o(n) bits

= n lg n + O(n) bits = n + O(n/lg n) words.
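To get a feel for the constants, the following sketch compares the two bit counts for one sample size. The choice n = 2^20, the rounding of lg n to a whole number of bits, and dropping the o(n) term are all assumptions made for illustration; the function names are hypothetical.

```python
import math


def pointer_based_bits(n):
    """About 4n words of lg n bits each: tree, skip values, leaf positions."""
    words = (2 * n + 1) + n + (n + 1)
    return words * math.ceil(math.log2(n))


def succinct_bits(n):
    """n lg n bits for the leaf positions plus 4n bits for the tree (o(n) ignored)."""
    return n * math.ceil(math.log2(n)) + 4 * n


n = 2 ** 20
ratio = pointer_based_bits(n) / succinct_bits(n)
print(round(ratio, 2))
```

For this n the skip-value representation needs roughly three and a third times as many bits.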

Skip values storage trick

Skip values need not be stored. They can be computed on the fly when needed.

To find the skip value, go to the leftmost and rightmost leaves and compare the text at their positions until the first disagreement. Suppose the first k characters are the same and they occupy l bits.

Then find how many leading bits of the two differing characters are the same. Suppose this is j bits.

The skip value is l+j bits.
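A small sketch of this computation is given below. The fixed-width binary codes in CODE, the width of 2 bits, and the function name common_prefix_bits are assumptions for illustration. The function returns l+j for two suffixes compared from the given positions, so the caller is assumed to pass positions (or subtract the bits) that account for the part of the path already matched.

```python
CODE = {"#": 0b00, "a": 0b01, "b": 0b10}  # assumed fixed-width binary codes
WIDTH = 2                                  # bits per character


def common_prefix_bits(text, a, b):
    """Number of leading bits shared by the suffixes at positions a and b (a != b).

    Assumes text ends with a unique terminator, so the comparison always stops.
    """
    k = 0
    while text[a + k] == text[b + k]:
        k += 1  # k matching characters ...
    l = k * WIDTH  # ... occupying l bits
    x, y = CODE[text[a + k]], CODE[text[b + k]]
    # j = number of equal leading bits of the two differing characters
    j = WIDTH - (x ^ y).bit_length()
    return l + j


text = "bababa#"
print(common_prefix_bits(text, 1, 3))  # suffixes "ababa#" and "aba#"
print(common_prefix_bits(text, 5, 6))  # suffixes "a#" and "#"
```

In the second call no whole character matches, but the codes of a and # share their leading bit, so the result is a single bit.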

Searching

Perform the search as before. If the search stops at a leaf, first find the leafrank of that leaf and then find the suffix index from the array of pointers.

If the end of the pattern is reached at an internal node, then any leaf in the subtree represents a possible matching suffix. Such a leaf can be found with leftmost(x) or rightmost(x) in constant time.

Searching (Cont.)

With an alphabet of size |Σ|, the time to find one skip value is O(lg|Σ| + skip value).

The sum of the skip values is at most m, so the total time spent finding skip values is O(m lg|Σ|).

Searching (Cont.)

Once we confirm that the pattern occurs in the text, which takes O(m) time, the number of occurrences is the leafsize of the node where the search ended.

Theorem 3:

A suffix tree for a text of length n can be represented using nlgn+O(n) bits such that, given a pattern of size m, the number of occurrences of the pattern in the string can be found in O(mlg|Σ|) time. Finding the positions of all the occurrences of the pattern requires O(m+s) time, where s is the number of occurrences of the pattern in the text.