Suffix Trees
• Suffix trees
• Linearized suffix trees
• Virtual suffix trees
• Suffix arrays
• Enhanced suffix arrays
• Suffix cactus, suffix vectors, …
Suffix Trees
• String … any sequence of characters.
• Substring of string S … string composed of characters i through j, i <= j, of S. S = cater => ate is a substring. car is not a substring. The empty string is a substring of S.
Subsequence
• Subsequence of string S … string composed of characters i1 < i2 < … < ik of S. S = cater => ate is a subsequence. car is a subsequence. The empty string is a subsequence.
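The two definitions can be checked directly on the cater example. This is a small sketch; the function names is_substring and is_subsequence are made up for illustration and do not come from the slides.

```python
def is_substring(p: str, s: str) -> bool:
    """True if p occurs in s as one contiguous block (the empty string always does)."""
    return any(s[i:i + len(p)] == p for i in range(len(s) - len(p) + 1))


def is_subsequence(p: str, s: str) -> bool:
    """True if the characters of p appear in s in order, not necessarily adjacent."""
    it = iter(s)
    # 'ch in it' consumes the iterator up to the first match, which enforces the order.
    return all(ch in it for ch in p)


print(is_substring("ate", "cater"), is_substring("car", "cater"))
print(is_subsequence("ate", "cater"), is_subsequence("car", "cater"))
```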
String/Pattern Matching
• You are given a source string S.
• Answer queries of the form: is the string pi a substring of S?
• Knuth-Morris-Pratt (KMP) string matching. O(|S| + |pi|) time per query. O(n|S| + Σi |pi|) time for n queries.
• Suffix tree solution. O(|S| + Σi |pi|) time for n queries.
String/Pattern Matching
• KMP preprocesses the query string pi, whereas the suffix tree method preprocesses the source string S.
• An application of string matching. Genome project. Databank of strings (gene sequences). Character set is ATGC. Determine if a “new” sequence is a substring of a databank sequence.
Definition Of Suffix Tree
• Compressed trie with edge information.
• Keys are the nonempty suffixes of a given string S.
• Nonempty suffixes of S = sleeper are: sleeper, leeper, eeper, eper, per, er, and r.
String Matching & Suffixes
• pi is a substring of S iff pi is a prefix of some suffix of S.
• Nonempty suffixes of S = sleeper are: sleeper, leeper, eeper, eper, per, er, and r.
• Which of these are substrings of S? leep, eepe, pe, leap, peel
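The prefix-of-a-suffix test can be tried on the five candidates above. This is a brute-force sketch for checking the answers only; the helper names are illustrative, and it takes O(|S| |pi|) time per query rather than using any tree.

```python
def suffixes(s: str) -> list[str]:
    """All nonempty suffixes of s, longest first."""
    return [s[i:] for i in range(len(s))]


def is_substring_via_suffixes(p: str, s: str) -> bool:
    """p is a substring of s iff p is a prefix of some suffix of s."""
    return any(suf.startswith(p) for suf in suffixes(s))


for p in ["leep", "eepe", "pe", "leap", "peel"]:
    print(p, is_substring_via_suffixes(p, "sleeper"))
```

Under this check leep, eepe, and pe are substrings of sleeper, while leap and peel are not.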
Last Character Of S Repeats
• When the last character of S appears more than once in S, S has at least one suffix that is a proper prefix of another suffix.
• S = creeper. Suffixes: creeper, reeper, eeper, eper, per, er, r. Here r is a proper prefix of reeper.
• When the last character of S appears more than once in S, use an end-of-string character # to overcome this problem.
• S = creeper#. Suffixes: creeper#, reeper#, eeper#, eper#, per#, er#, r#, #.
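The effect of the end-of-string character can be seen by listing every pair of suffixes in which one is a proper prefix of the other. A quadratic sketch, with a made-up function name, assuming # does not occur elsewhere in S:

```python
def prefix_suffix_pairs(s: str) -> list[tuple[str, str]]:
    """Pairs (a, b) of suffixes of s where a is a proper prefix of b."""
    sufs = [s[i:] for i in range(len(s))]
    return [(a, b) for a in sufs for b in sufs if a != b and b.startswith(a)]


print(prefix_suffix_pairs("creeper"))
print(prefix_suffix_pairs("creeper#"))
```

Without the terminator the only offending pair is r and reeper; with it the list is empty, so every suffix ends at its own element node.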
Suffix Tree For S = abbbabbbb#
[Figure: suffix tree for S = abbbabbbb#, whose characters are numbered 1 through 10. Edges are labeled with substrings of S, and the element nodes carry the suffix start positions 1 through 10.]
Suffix Tree Construction
• See Web write up for algorithm.
• Time complexity, with |S| = n and alphabet size = r:
• O(nr) using array nodes. This is O(n) for r a constant (or r <= c).
• O(n) expected time using a hash table.
• O(n) time algorithm for large r in reference cited in Web write up.
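The construction algorithm itself is in the Web write up and is not reproduced here. As a rough stand-in, the following sketch builds an uncompressed suffix trie by inserting every suffix, using a Python dict (a hash table) for the children of each node. This is an assumption-laden illustration: it takes O(n^2) time and space, not the O(n) bounds quoted above, and the function names are hypothetical.

```python
def build_suffix_trie(s: str) -> dict:
    """Naive, uncompressed suffix trie: one dict per node, keyed by character."""
    root: dict = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            # Hash-table lookup of the child; create it if it is missing.
            node = node.setdefault(ch, {})
    return root


def contains(root: dict, p: str) -> bool:
    """Walk down from the root one character of p at a time: O(|p|) expected time."""
    node = root
    for ch in p:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("abbbabbbb#")
print(contains(trie, "babb"), contains(trie, "abbba"), contains(trie, "baba"))
```

Once the trie exists, each query costs time proportional to the pattern length only, which is the property the suffix tree provides in linear space.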
Suffix Array
• Array that contains the start position of suffixes in lexicographic order.
• S = abbbabbbb#. Assume # < a < b.
• # < abbbabbbb# < abbbb# < b# < babbbb# < bb# < bbabbbb# < bbb# < bbbabbbb# < bbbb#
• SA = [10, 1, 5, 9, 4, 8, 3, 7, 2, 6]
• LCP = length of longest common prefix between adjacent entries of SA. LCP = [0, 4, 0, 1, 1, 2, 2, 3, 3, -]
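The SA and LCP values for abbbabbbb# can be reproduced with a naive sort of the suffixes. This sketch is not a linear-time construction; it relies on Python's default string ordering, in which # < a < b happens to hold, and it uses 1-based positions to match the slide. The function names are illustrative.

```python
def suffix_array(s: str) -> list[int]:
    """1-based start positions of the suffixes of s, in lexicographic order (naive sort)."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def lcp_array(s: str, sa: list[int]) -> list[int]:
    """Longest common prefix length of each pair of adjacent suffixes in sa."""
    def lcp(a: str, b: str) -> int:
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        return n

    return [lcp(s[i - 1:], s[j - 1:]) for i, j in zip(sa, sa[1:])]


s = "abbbabbbb#"
sa = suffix_array(s)
print(sa)
print(lcp_array(s, sa))
```

The LCP list returned here has one entry fewer than SA, since the last suffix has no successor (shown as "-" above).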
Suffix Array
• Less space than suffix tree.
• Linear time construction.
• Can be used to solve several of the problems solved by a suffix tree with same asymptotic complexity.
• Substring matching: binary search for p using SA. O(|p| log |S|).
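A minimal sketch of the binary search, assuming a 0-based suffix array built by naive sorting (the construction is not the linear-time one). Each step compares p with the first |p| characters of one suffix, giving O(|p| log |S|) per query. The name sa_contains is made up for this example.

```python
def sa_contains(s: str, sa: list[int], p: str) -> bool:
    """Binary search the suffix array for a suffix that has p as a prefix."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        # Compare p with the first |p| characters of the middle suffix.
        if s[sa[mid]:sa[mid] + len(p)] < p:
            lo = mid + 1
        else:
            hi = mid
    # lo is the first suffix whose |p|-character prefix is >= p.
    return lo < len(sa) and s[sa[lo]:sa[lo] + len(p)] == p


s = "abbbabbbb#"
sa = sorted(range(len(s)), key=lambda i: s[i:])  # 0-based, naive construction
print([sa_contains(s, sa, p) for p in ["babb", "abbba", "baba"]])
```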
O(|pi|) Time Substring Matching
• Example queries against S = abbbabbbb#: babb, abbba, baba.
[Figure: suffix tree for S = abbbabbbb#.]
Find All Occurrences Of pi
• Search suffix tree for pi.
• Suppose the search for pi is successful.
• When search terminates at an element node, pi appears exactly once in the source string S.
Search Terminates At Element Node
[Figure: suffix tree for S = abbbabbbb#, annotated with abbbb#.]
Search Terminates At Branch Node
• When the search for pi terminates at a branch node, each element node in the subtree rooted at this branch node gives a different occurrence of pi.
[Figure: suffix tree for S = abbbabbbb#, annotated with ab.]
Find All Occurrences Of pi
• To find all occurrences of pi in time linear in the length of pi and linear in the number of occurrences of pi, augment the suffix tree:
• Link all element nodes into a chain in inorder.
• Each branch node keeps a pointer to the leftmost and rightmost element node in its subtree.
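The augmentation can be sketched on a naive, uncompressed suffix trie. The chain is a Python list of suffix start positions in left-to-right order, and every node records the indices of its leftmost and rightmost element node in that chain. The class and function names (Node, build, occurrences) are hypothetical, the trie takes O(n^2) space, and the recursive traversal is only suitable for short strings.

```python
class Node:
    def __init__(self) -> None:
        self.children: dict[str, "Node"] = {}
        self.start = None  # 1-based suffix start position; set on element nodes only
        self.first = 0     # chain index of the leftmost element node in this subtree
        self.last = -1     # chain index of the rightmost element node in this subtree


def build(s: str) -> tuple[Node, list[int]]:
    """Build the trie for s (which must end in a unique terminator) and the element chain."""
    root = Node()
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.children.setdefault(ch, Node())
        node.start = i + 1
    chain: list[int] = []

    def visit(node: Node) -> None:
        node.first = len(chain)
        if node.start is not None:
            chain.append(node.start)
        for ch in sorted(node.children):  # left-to-right order of the children
            visit(node.children[ch])
        node.last = len(chain) - 1

    visit(root)
    return root, chain


def occurrences(root: Node, chain: list[int], p: str) -> list[int]:
    """O(|p|) to locate the node, then O(number of occurrences) to read the chain slice."""
    node = root
    for ch in p:
        node = node.children.get(ch)
        if node is None:
            return []
    return chain[node.first:node.last + 1]


root, chain = build("abbbabbbb#")
print(occurrences(root, chain, "ab"), occurrences(root, chain, "abbbb#"))
```

With children visited in sorted order, the chain for abbbabbbb# coincides with the suffix array of the string.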
Augmented Suffix Tree
[Figure: augmented suffix tree for S = abbbabbbb#, annotated with b.]
Longest Repeating Substring
• Find longest substring of S that occurs at least m times in S, where m > 1.
• Label branch nodes with number of element nodes in subtree.
• Find branch node with label >= m and max char# field.
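The labeling idea can be tried on a naive suffix trie. In this sketch every trie node (not just the branch nodes) carries a count of the suffixes passing through it, which equals the number of element nodes below it, and the depth of a node plays the role of the char# field. The function name is illustrative and the running time is O(n^2), not the bound of a real suffix tree.

```python
def longest_repeated(s: str, m: int) -> str:
    """Longest substring of s that occurs at least m times ('' if there is none)."""
    root = [{}, 0]  # each node is [children, count of suffixes through this node]
    best = ""
    for i in range(len(s)):
        node = root
        for depth, ch in enumerate(s[i:], start=1):
            node = node[0].setdefault(ch, [{}, 0])
            node[1] += 1
            # A node at this depth spells s[i:i+depth]; keep the deepest one with count >= m.
            if node[1] >= m and depth > len(best):
                best = s[i:i + depth]
    return best


print(longest_repeated("abbbabbbb#", 2), longest_repeated("abbbabbbb#", 5))
```

For S = abbbabbbb# this gives abbb for m = 2 and bb for m = 5.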
[Figure: suffix tree for S = abbbabbbb#, with branch nodes labeled by the number of element nodes in their subtrees (2, 3, 5, 7, 10) and examples marked for m = 2 and m = 5.]
Longest Common Substring
• Given two strings S and T.
• Find the longest common substring.
• S = carport, T = airports. Longest common substring = rport. Longest common subsequence = arport.
• Longest common subsequence may be found in O(|S|*|T|) time using dynamic programming.
• Longest common substring may be found in O(|S|+|T|) time using a suffix tree.
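For comparison, here is the textbook dynamic program for the longest common subsequence, which fills an (|S|+1) by (|T|+1) table and then walks back through it. The function name is an assumption of this sketch.

```python
def lcs_subsequence(s: str, t: str) -> str:
    """One longest common subsequence of s and t, in O(|s| * |t|) time."""
    # dp[i][j] = length of a longest common subsequence of s[:i] and t[:j]
    dp = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i, a in enumerate(s, start=1):
        for j, b in enumerate(t, start=1):
            if a == b:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # Walk back from the bottom-right corner to recover the characters.
    out = []
    i, j = len(s), len(t)
    while i > 0 and j > 0:
        if s[i - 1] == t[j - 1]:
            out.append(s[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(lcs_subsequence("carport", "airports"))
```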
Longest Common Substring
• Let $ be a new symbol.
• Construct the suffix tree for the string U = S$T#.
• U = carport$airports#
• No repeating substring includes $.
• Find longest repeating substring that is both to left and right of $.
• Find branch node that has max char# and has at least one element node in its subtree that represents a suffix that begins in S as well as at least one that begins in T.
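The U = S$T# idea can be sketched with a naive suffix trie in place of a real suffix tree. Each node records whether some suffix beginning in S and some suffix beginning in T pass through it, and the deepest node with both marks spells the answer. This assumes $ and # occur in neither S nor T, takes O(|U|^2) time rather than O(|S|+|T|), and uses a made-up function name.

```python
def longest_common_substring(s: str, t: str) -> str:
    """Longest string that is a substring of both s and t ('' if there is none)."""
    u = s + "$" + t + "#"
    root = [{}, False, False]  # each node is [children, reached from S, reached from T]
    best = ""
    for i in range(len(u)):
        from_s = i < len(s)               # this suffix of U begins inside S
        from_t = len(s) < i < len(u) - 1  # this suffix of U begins inside T
        node = root
        for depth, ch in enumerate(u[i:], start=1):
            node = node[0].setdefault(ch, [{}, False, False])
            node[1] = node[1] or from_s
            node[2] = node[2] or from_t
            # A path containing $ can only start in S, so it never gets both marks.
            if node[1] and node[2] and depth > len(best):
                best = u[i:i + depth]
    return best


print(longest_common_substring("carport", "airports"))
```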