The MaxSuffix-Matching Algorithm
W. Rytter, On maximal suffixes and constant-space versions of KMP algorithm, LATIN 2002: Theoretical Informatics, 5th Latin American Symposium, Cancun, Mexico, April 3-6, 2002, Proceedings.
Advisor: Prof. R. C. T. Lee
Reporter: L. Y. Huang
2
Maximal Suffix
• A maximal suffix of a string is the suffix that is lexicographically largest among all suffixes of the string.
• The maximal suffix of a string w is denoted by MaxSuf(w).
• Ex: Consider the string w = abaaba.
– The set of its suffixes: {a, ba, aba, aaba, baaba, abaaba}
– The sorted suffixes: {a, aaba, aba, abaaba, ba, baaba}
• Thus MaxSuf(w) = baaba.
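The definition can be checked directly with a brute-force sketch. The function name `max_suffix` is an assumption made for illustration. It compares all suffixes, so it takes quadratic time and is not the constant-space method discussed later.

```python
def max_suffix(w: str) -> str:
    """Return MaxSuf(w) by comparing all non-empty suffixes (brute force)."""
    if not w:
        raise ValueError("w must be non-empty")
    # Python compares strings lexicographically, so max() picks MaxSuf(w).
    return max(w[k:] for k in range(len(w)))


w = "abaaba"
print(sorted(w[k:] for k in range(len(w))))  # the sorted suffixes
print(max_suffix(w))                         # baaba
```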
3
Self-Maximal String
• A string w is said to be self-maximal if MaxSuf(w) = w.
• Ex: Consider the strings w = abaaba and x = baaba.
– MaxSuf(w) = baaba.
– MaxSuf(x) = baaba.
• Hence x is a self-maximal string but w is not.
4
Important Properties of Self-Maximal Strings
• By definition, we have the following observation about self-maximal strings:
• For a self-maximal string P, suppose a prefix P1 P2 … Pi of P is equal to a substring Pk Pk+1 … Pk+i-1 of P. Then Pi+1 ≥ Pk+i.
[Figure: P contains the prefix u followed by x and a later occurrence of u followed by y; if x ≠ y, then x > y.]
5
• Example: TCATBTCATA is a self-maximal string.
• But TBATATBATB is not a self-maximal string, because the B that follows the second occurrence of TBAT is lexicographically larger than the A that follows the prefix TBAT.
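Both examples can be checked with a small sketch. The names `is_self_maximal` and `first_violation` are assumptions for illustration, and `first_violation` only looks at non-empty prefixes. Positions in the returned pair are 1-based, as on the slides.

```python
def is_self_maximal(w: str) -> bool:
    """True if w is larger than each of its proper suffixes, i.e. MaxSuf(w) = w."""
    return all(w > w[k:] for k in range(1, len(w)))


def first_violation(p: str):
    """Return (i, k), 1-based, such that the prefix p[1..i] also occurs at
    position k but is followed by a smaller character there than at k + i;
    return None if no such pair exists."""
    n = len(p)
    for i in range(1, n):                  # prefix length
        for k in range(2, n - i + 1):      # 1-based start of the other occurrence
            if p[k - 1:k - 1 + i] == p[:i] and p[i] < p[k + i - 1]:
                return i, k
    return None


print(is_self_maximal("TCATBTCATA"), first_violation("TCATBTCATA"))
print(is_self_maximal("TBATATBATB"), first_violation("TBATATBATB"))
```

For TBATATBATB the reported pair is the prefix TBAT (i = 4) and its second occurrence at position 6.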
6
The Period of a String
• A period of a string w is an integer p, 0 < p ≤ |w|, such that w[i] = w[i + p] for all i ∈ {1, …, |w| − p}.
• Ex:
– bbabbabbabba → 3 and 6 are both periods.
– abcdefg → period = word length = 7
– abcdeab → period = 5
• We define period(w) as the smallest period of w.
• If w = bbabbabbabba, period(w) is 3.
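The definition translates directly into a brute-force check. This is a sketch for illustration only; the names `periods` and `period` are assumptions, and the code uses 0-based indices where the slides use 1-based ones.

```python
def periods(w: str) -> list[int]:
    """All p with 0 < p <= |w| and w[i] == w[i + p] wherever both sides exist."""
    n = len(w)
    return [p for p in range(1, n + 1)
            if all(w[i] == w[i + p] for i in range(n - p))]


def period(w: str) -> int:
    """The smallest period of a non-empty string w."""
    return periods(w)[0]


print(periods("bbabbabbabba"))  # [3, 6, 9, 12]
print(period("abcdefg"), period("abcdeab"))  # 7 5
```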
7
• Given a string P, we are actually interested in the period of every prefix.
i            1 2 3 4 5 6 7 8 9
P            a b c a a b c a b
period       1 2 3 3 4 4 4 4 7
prefix       0 0 0 1 1 2 3 4 2
i-prefix(i)  1 2 3 3 4 4 4 4 7
Note that period(i) = i − prefix(i), where prefix is the prefix function of the MP algorithm. This is the number of positions by which the pattern can be shifted. (The index starts from 1 in this case.)
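The table can be reproduced with the standard prefix (failure) function. The sketch below is one common way to compute it; the function names are assumptions, and the lists are 0-based, so entry i − 1 holds the value for position i of the table.

```python
def prefix_function(p: str) -> list[int]:
    """pre[i] = length of the longest proper prefix of p[:i + 1]
    that is also a suffix of p[:i + 1]."""
    pre = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = pre[k - 1]          # follow a back pointer
        if p[i] == p[k]:
            k += 1
        pre[i] = k
    return pre


def prefix_periods(p: str) -> list[int]:
    """Period of every prefix: (length of prefix) - (its prefix-function value)."""
    return [i + 1 - b for i, b in enumerate(prefix_function(p))]


print(prefix_function("abcaabcab"))  # [0, 0, 0, 1, 1, 2, 3, 4, 2]
print(prefix_periods("abcaabcab"))   # [1, 2, 3, 3, 4, 4, 4, 4, 7]
```

The `while` loop is where the stored back pointers are followed.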
8
Why are we interested in the period function?
• If the period function carries the same information as the prefix function of the MP algorithm, why are we interested in it?
• To calculate the prefix function, we must store pointers that refer back to characters far behind the current position.
• In the following, we shall introduce a naïve period function which needs no such pointers.
9
Naive-Period Function
• Function Naive-Period can be used to compute the period of a string if this string is self-maximal.
• For a general string, the Naive-Period function will not work. This is why it is applied only to self-maximal strings.
10
Algorithm of Naive-Period Function
Function Naive-Period (j);
{ computes the period of self-maximal pat }
  period(1) := 1;
  for i := 2 to j do
    if pat[i] ≠ pat[i − period(i − 1)] then period(i) := i
    else period(i) := period(i − 1);
  return period;
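A minimal Python sketch of Naive-Period follows. The function name and the returned list are assumptions for illustration; the list is kept only so that every prefix period can be inspected, while the computation itself needs just the previous value.

```python
def naive_period(pat: str) -> list[int]:
    """Periods of all prefixes of pat, computed as in Naive-Period.
    The result is only guaranteed to be correct when pat is self-maximal."""
    n = len(pat)
    if n == 0:
        return []
    period = [0] * (n + 1)          # 1-based as on the slide; period[0] is unused
    period[1] = 1
    for i in range(2, n + 1):
        # pat[i - 1] is the slide's pat[i] (Python strings are 0-based).
        if pat[i - 1] != pat[i - 1 - period[i - 1]]:
            period[i] = i
        else:
            period[i] = period[i - 1]
    return period[1:]


print(naive_period("bbabbabbab"))  # [1, 1, 3, 3, 3, 3, 3, 3, 3, 3]
print(naive_period("abcaabcab"))   # not self-maximal: differs from the true periods
```

For the non-self-maximal string abcaabcab the result disagrees with the true prefix periods 1 2 3 3 4 4 4 4 7 from position 5 on.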
11
An Example of Naive-Period Function
• Consider a string w = bbabbabbab.
– w is a self-maximal string and period(w) = 3.
• The table is filled from left to right, one position i at a time. At each step w[i] is compared with w[i − period(i − 1)].
i                 1  2  3  4  5  6  7  8  9 10
w                 b  b  a  b  b  a  b  b  a  b
i - period(i-1)   0  1  2  1  2  3  4  5  6  7
period(i)         1  1  3  3  3  3  3  3  3  3
• The period changes only at i = 3, where w[3] = a differs from w[2] = b.
20
Why does the naïve period function work for self-maximal strings?
• Given any pattern P, let k be the length of the longest proper suffix of P[1, i-1] that is equal to a prefix P[1, k] of P[1, i-1].
• Let k’ be the length of the longest proper suffix of P[1, i] that is equal to a prefix P[1, k’] of P[1, i].
• For any i, we consider the following possibilities:
[Figure: P[1, i-1] with its prefix and suffix of length k marked, and P[1, i] with its prefix and suffix of length k’ marked.]
21
1. k ≠ 0 and P[k + 1] = P[i] : Period(i) = Period(i - 1)
2. k ≠ 0, P[k + 1] ≠ P[i] and k’ ≠ 0 : Period(i) = i – k’
3. k ≠ 0, P[k + 1] ≠ P[i] and k’ = 0 : Period(i) = i
4. k = 0 and k’ ≠ 0 : Period(i) = i – k’
5. k = 0 and k’ = 0 : Period(i) = i
22
1. k ≠ 0 and P[k + 1] = P[i] : Period(i) = Period(i - 1)
i       1 2 3 4 5 6 7 8
P       a b c a a b c a
period  1 2 3 3 4 4 4 4
For i = 8, the substring “abc” of length 3 (k = 3) is the longest proper suffix of P(1, 7) that is equal to a prefix of P(1, 7), and P(8) = P(4).
Hence period(8) = period(7) = 4.
23
2. k ≠ 0, P[k + 1] ≠ P[i] and k’ ≠ 0 : Period(i) = i – k’
i       1 2 3 4 5 6 7 8 9
P       a b c a a b c a b
period  1 2 3 3 4 4 4 4 7
For i = 9, the substring “abca” of length 4 (k = 4) is the longest proper suffix of P(1, 8) that is equal to a prefix of P(1, 8), and P(9) ≠ P(5).
The longest proper suffix of P(1, 9) that is equal to a prefix of P(1, 9) is P(1, 2) = ab, of length 2 (k’ = 2).
Hence period(9) = i - |P(1, 2)| = 9 - 2 = 7.
24
3. k ≠ 0, P[k + 1] ≠ P[i] and k’ = 0 : Period(i) = i
i       1 2 3 4 5 6 7 8 9
P       a b c c a b c c b
period  1 2 3 4 4 4 4 4 9
For i = 9, the substring “abcc” of length 4 (k = 4) is the longest proper suffix of P(1, 8) that is equal to a prefix of P(1, 8), and P(9) ≠ P(5).
No non-empty proper suffix of P(1, 9) is equal to a prefix of P(1, 9) (k’ = 0).
Hence period(9) = i = 9.
25
4. k = 0 and k’ ≠ 0 : Period(i) = i – k’
i       1 2 3 4 5 6 7 8 9
P       a b c c b b c c a
period  1 2 3 4 5 6 7 8 8
For i = 9, no non-empty proper suffix of P(1, 8) is equal to a prefix of P(1, 8) (k = 0).
The substring “a” of length 1 (k’ = 1) is a suffix of P(1, 9) that is equal to a prefix of P(1, 9), namely P(1, 1) = a.
Hence period(9) = i - |P(1, 1)| = 9 - 1 = 8.
26
5. k = 0 and k’ = 0 : Period(i) = i
i       1 2 3 4 5 6 7 8 9
P       a b c c b b c c b
period  1 2 3 4 5 6 7 8 9
For i = 9, no non-empty proper suffix of P(1, 8) is equal to a prefix of P(1, 8) (k = 0).
No non-empty proper suffix of P(1, 9) is equal to a prefix of P(1, 9) (k’ = 0).
Hence period(9) = i = 9.
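The five cases and their examples can be checked mechanically. The sketch below is an illustration under assumed names (`border_lengths`, `classify`); it computes k and k’ with the usual prefix-function recurrence and returns the number of the case that applies at position i (1-based, i ≥ 2).

```python
def border_lengths(p: str) -> list[int]:
    """b[m] = length of the longest proper suffix of p[:m] equal to a prefix of it."""
    b = [0] * (len(p) + 1)
    k = 0
    for m in range(2, len(p) + 1):
        while k > 0 and p[m - 1] != p[k]:
            k = b[k]
        if p[m - 1] == p[k]:
            k += 1
        b[m] = k
    return b


def classify(p: str, i: int) -> int:
    """Return which of the cases 1..5 holds for position i (2 <= i <= len(p))."""
    b = border_lengths(p)
    k, k_prime = b[i - 1], b[i]
    if k != 0 and p[k] == p[i - 1]:    # slide notation: P[k + 1] = P[i]
        return 1
    if k != 0:
        return 2 if k_prime != 0 else 3
    return 4 if k_prime != 0 else 5


examples = [("abcaabca", 8), ("abcaabcab", 9), ("abccabccb", 9),
            ("abccbbcca", 9), ("abccbbccb", 9)]
for pattern, i in examples:
    print(pattern, i, "case", classify(pattern, i))
```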
27
In conditions 2 and 4 we have k’ ≠ 0, so some non-empty proper suffix of P(1, i) is equal to a prefix of P(1, i), and Period(i) = i – k’.
But condition 2 (k ≠ 0, P[k + 1] ≠ P[i] and k’ ≠ 0) cannot occur in a self-maximal string, and condition 4 (k = 0 and k’ ≠ 0) can occur only with k’ = 1, where Period(i) = i – 1 = Period(i - 1). Why?
28
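2. k ≠ 0, P[k + 1] ≠ P[i] and k’ ≠ 0
• Suppose that P is self-maximal. Let u = P(1, k) be the longest proper suffix of P(1, i - 1) that is equal to a prefix of P(1, i - 1), and let j = k + 1, x = P[j] and y = P[i].
[Figure: P with u as a prefix followed by x at position j, and u as a suffix of P(1, i - 1) followed by y at position i.]
• The prefix u is followed by x and its later occurrence is followed by y, so x ≥ y by the property of self-maximal strings. Since P[i] = y ≠ P[j] = x holds, x > y.
• Since k’ ≠ 0, there is a string v·y which is the longest proper suffix of P(1, i) equal to a prefix of P(1, i). Here v is both a prefix and a suffix of P(1, i - 1), and v is shorter than u because x ≠ y.
• Since k ≠ 0 and v is shorter than u, v is also a suffix of the prefix u, and there it is followed by x.
[Figure: P begins with v·y; the prefix u ends with v and is followed by x; P(1, i) ends with v·y.]
• Since P is a self-maximal string and its prefix v is followed by y, we may conclude that y ≥ x. Contradiction! Hence k ≠ 0, P[k + 1] ≠ P[i] and k’ ≠ 0 cannot hold for self-maximal strings.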
30
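For condition 4 (k = 0 and k’ ≠ 0), suppose k’ ≥ 2. Removing the last character of the suffix of length k’ would give a non-empty proper suffix of P(1, i - 1) equal to a prefix, contradicting k = 0. So k’ = 1 and Period(i) = i – 1 = Period(i - 1).
Thus we have the following:
For self-maximal strings, Period(i) = Period(i - 1) or Period(i) = i.
The comparison of pat[i] with pat[i − period(i − 1)] tells the two cases apart. That is, the naïve period function works for self-maximal strings.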
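The statement can be sanity-checked exhaustively on short strings. The sketch below is not a proof; the alphabet "abc", the length bound 7 and all function names are assumptions chosen to keep the search small.

```python
from itertools import product


def is_self_maximal(w: str) -> bool:
    return all(w > w[k:] for k in range(1, len(w)))


def border_lengths(p: str) -> list[int]:
    """b[m] = length of the longest proper suffix of p[:m] equal to a prefix of it."""
    b = [0] * (len(p) + 1)
    k = 0
    for m in range(2, len(p) + 1):
        while k > 0 and p[m - 1] != p[k]:
            k = b[k]
        if p[m - 1] == p[k]:
            k += 1
        b[m] = k
    return b


def find_counterexample(alphabet: str = "abc", max_len: int = 7):
    """Look for a self-maximal w and a position i where Period(i) is
    neither Period(i - 1) nor i; return (w, i), or None if there is none."""
    for n in range(1, max_len + 1):
        for letters in product(alphabet, repeat=n):
            w = "".join(letters)
            if not is_self_maximal(w):
                continue
            b = border_lengths(w)
            for i in range(2, n + 1):
                per_prev, per = (i - 1) - b[i - 1], i - b[i]
                if per != per_prev and per != i:
                    return w, i
    return None


print(find_counterexample())  # None
```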
31
• What is the advantage of the naïve-period function?
• It runs in linear time, and it needs only the previous value period(i - 1) instead of the stored pointers to earlier characters that the prefix function of the MP algorithm requires.
32
• For a pattern which is not self-maximal, we use the following algorithm, called the MaxSuffix-Matching Algorithm.
33
MaxSuffix-Matching Algorithm
• First, we decompose the pattern string P into u · v, where v = MaxSuf(P) and u is the remaining prefix of P.
• Note that v occurs only once in the string P, and this is a very important property.
• Property 1: No non-empty suffix of u is equal to a prefix of v; otherwise P would have a suffix larger than v.
• Example: P = dababdadad, MaxSuf(P) = dadad, P = u·v = dabab · dadad
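The decomposition and both properties can be checked with a brute-force sketch. All names here (`decompose`, `occurrences`, `property_1_holds`) are assumptions for illustration, and the decomposition is found by comparing all suffixes rather than by a constant-space method.

```python
def decompose(p: str) -> tuple[str, str]:
    """Split p into (u, v) with v = MaxSuf(p), found by brute force."""
    start = max(range(len(p)), key=lambda s: p[s:])
    return p[:start], p[start:]


def occurrences(v: str, p: str) -> int:
    """Number of (possibly overlapping) occurrences of v in p."""
    return sum(p.startswith(v, s) for s in range(len(p)))


def property_1_holds(u: str, v: str) -> bool:
    """True if no non-empty suffix of u is a prefix of v."""
    return not any(v.startswith(u[s:]) for s in range(len(u)))


u, v = decompose("dababdadad")
print(u, v)                           # dabab dadad
print(occurrences(v, "dababdadad"))   # 1
print(property_1_holds(u, v))         # True
```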
34
MaxSuffix-Matching Algorithm
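• If v is found in T, we next test, in a naive way, whether the part u of P occurs immediately to the left of v.
• Assume i is the position of an occurrence of v in T, and let prev denote the position of the previous occurrence of v in T. Since v occurs only once in P, an occurrence of P that ends with the v at i cannot contain the v at prev.
[Figure: Text with two occurrences of v, the earlier one at position prev and the later one at position i.]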
Maxsuffix-Matching Algorithm
Algorithm Maxsuffix-Matching
  i := 0; j := 0; period := 1; prev := 0;
  while i ≤ n - |v| do begin
    while j < |v| and v[j + 1] = T[i + j + 1] do begin
      j := j + 1;
      { Naive-Period function }
      if j > period and v[j] ≠ v[j - period] then period := j
    end;
    { MATCH OF v }
    if j = |v| then begin
      { test u by using any algorithm }
      if i − prev > |u| and u = T[i − |u| + 1 … i] then report match at i − |u|;
      prev := i
    end;
    i := i + period;
    if j ≥ 2 · period then j := j − period
    else begin j := 0; period := 1 end
  end;
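The pseudocode above can be sketched in Python as follows. This is an illustration under stated assumptions, not the paper's implementation: the name `maxsuffix_match` is made up, indices are 0-based, the decomposition u·v is found by brute force, matches are collected in a list, and `prev` starts as `None` (instead of 0) so that an occurrence at the very beginning of the text is also tested.

```python
def maxsuffix_match(text: str, pattern: str) -> list[int]:
    """Return the 0-based start positions of all occurrences of pattern in text."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    # Decompose pattern = u . v with v = MaxSuf(pattern) (brute force, assumption).
    start = max(range(len(pattern)), key=lambda s: pattern[s:])
    u, v = pattern[:start], pattern[start:]
    n, m = len(text), len(v)
    matches = []
    i = 0          # current shift of v against the text
    j = 0          # number of characters of v matched at this shift
    period = 1     # period of v[:j], maintained as in Naive-Period
    prev = None    # shift of the previous occurrence of v, if any
    while i <= n - m:
        while j < m and v[j] == text[i + j]:
            j += 1
            # Naive-Period step: slide's v[j] and v[j - period], 0-based here.
            if j > period and v[j - 1] != v[j - 1 - period]:
                period = j
        if j == m:  # match of v at shift i
            far_enough = prev is None or i - prev > len(u)
            # Test u naively, and only if there is room for it before v.
            if far_enough and i >= len(u) and text[i - len(u):i] == u:
                matches.append(i - len(u))
            prev = i
        i += period
        if j >= 2 * period:
            j -= period
        else:
            j, period = 0, 1
    return matches


print(maxsuffix_match("adadaddadabababadada", "abababadada"))  # [9]
```

The reported position 9 is 0-based; it corresponds to position 10 when the text is numbered from 1.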
36
Example
• Text = adadaddadabababadada
• P = u·v = abababa · dada
• v occurs three times in the text, at positions 2, 7 and 17.
i      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
Text   a  d  a  d  a  d  d  a  d  a  b  a  b  a  b  a  d  a  d  a
v         d  a  d  a
v                        d  a  d  a
v                                                      d  a  d  a
• Case 1 (first occurrence, i = 2): if i < |u|, there is no occurrence of u·v, because fewer than |u| characters precede v.
• Case 2 (second occurrence, i = 7, |u| = 7, prev = 2): if i – prev < |u|, then there is no occurrence of u·v at position i - |u|. This is because the maximal suffix v of P starts at only one position of P.
• So we only need to check whether u occurs immediately to the left of the third v in this example (i = 17, prev = 7). Here T[10 … 16] = abababa = u, so P occurs at position 10.
39
Time Complexity and Space Complexity
• Hence, the MaxSuffix-Matching Algorithm finds all occurrences of a pattern in linear time, using only O(1) extra space for the variables i, j, period and prev.
42
~Thank You~