Presented By Dr. Shazzad Hosain Asst. Prof. EECS, NSU


Page 1: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Presented By Dr. Shazzad Hosain

Asst. Prof. EECS, NSU

Exact String Matching Algorithms

Page 2: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Classical Comparison Based Methods

• Boyer-Moore Algorithm
• Knuth-Morris-Pratt Algorithm (KMP Algorithm)

Page 3: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Boyer-Moore Algorithm

• Basic ideas:
– Previously discussed ideas for naïve matching
1. Successively align P and T to check for a match.
2. Shift P to the right on match failure.
– New concepts with respect to the naïve algorithm
1. Scan from right to left
2. Special bad character rule
3. Suffix shift rule

Page 4: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Right-to-left Scan

• How can we check for a match of pattern P at location i in target T?

• Naïve algorithm scanned left-to-right, i.e., T[i+k]&P[1+k], k = 0 to length(P)-1

Example: P = adab, T = abaracadabara
a b a r a c a d a b a r a
a d a b
^ 1 a == a
  ^ 2 d != b

Page 5: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Right-to-left Scan

• Alternative, scan right-to-left, i.e., T[i+k]&P[1+k], k = length(P)-1 down-to 0

Example: P = adab, T = abaracadabara
a b a r a c a d a b a r a
a d a b
      ^ 1 b != r

Page 6: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Right-to-left Scan

• Why is scanning right-to-left a good idea?
• Answer: by itself, it isn't any better than left-to-right.
– A naïve approach with right-to-left scanning is also Θ(nm).
– Larger shifts, supported by a clever bad character rule and a suffix shift rule, make it better.
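The point above can be seen in a minimal Python sketch of naïve matching with a right-to-left scan. The function name and the 1-based reporting are illustrative choices, not part of the slides. P is still shifted by exactly one position after every attempt, so the worst case remains Θ(nm).

```python
def naive_right_to_left(pattern, text):
    """Return 1-based start positions of pattern in text.

    Each alignment is checked from the right end of the pattern to the left,
    but the pattern always moves by one position, as in the naive method.
    """
    n, m = len(pattern), len(text)
    hits = []
    for start in range(m - n + 1):
        k = n - 1
        # Compare P[k] with T[start + k], moving k from right to left.
        while k >= 0 and pattern[k] == text[start + k]:
            k -= 1
        if k < 0:                 # every character matched
            hits.append(start + 1)
    return hits


print(naive_right_to_left("adab", "abaracadabara"))
```

The only change relative to the left-to-right version is the direction of the inner loop; the larger shifts come from the rules introduced on the following slides.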

Page 7: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Bad Character Rule

• Idea: the mismatched character indicates a safe minimum shift.

Example: P = adacara, T = abaracadabara
a b a r a c a d a b a r a
a d a c a r a
            ^ 1 a == a
          ^ 2 r != c

Here the bad character is c. Perhaps we should shift to align this character with its rightmost occurrence in P?

Page 8: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Bad Character Rule

Shift two positions to align the rightmost occurrence of the mismatched character c in P.

a b a r a c a d a b a r a
a d a c a r a
    a d a c a r a

Now, start matching again from right to left.

Page 9: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Bad Character Rule

Second Example: P = adacara, T = abaxaradabara
a b a x a r a d a b a r a
a d a c a r a
            ^ 1 a == a
          ^ 2 r == r
        ^ 3 a == a
      ^ 4 c != x

Here the bad character is x. The minimum shift should align this character with its occurrence in P.

But x doesn't occur in P!

Page 10: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Bad Character Rule

Second Example: P = adacara, T = abaxaradabara

Since x doesn't occur in P, we can shift past it.

a b a x a r a d a b a r a
a d a c a r a
        a d a c a r a

Now, start matching again from right to left.

Page 11: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU


Concept: Bad Character Rule

• The idea of the bad character rule is to shift P by more than one character when possible.

• But if the rightmost position of the mismatched character in P is greater than the mismatch position, the rule gives no useful shift and P moves only one position.

• Unfortunately, this is often the case.

   12345678901234567
T: spbctbsatpqsctbpq
P:   tpabsat
P:    tpabsat

Page 12: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Bad Character Rule

• We will define a bad character rule that uses the concept of the rightmost occurrence of each letter.

• Let R(x) be the rightmost position of the letter x in P for each letter x in our alphabet.

• If x doesn’t occur in P, define R(x) to be 0.

    1234567
P = adacara

x:    a  b  c  d  ...  z
R(x): 7  0  4  2  ...  0

Page 13: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU


Concept: Bad Character Rule

   12345678901234567
T: spbctbsabpqsctbpq
P:   tpabsab
P:     tpabsab   (after shifting)

R(t) = 1, R(s) = 5.
i: the position of the mismatch in P. i = 3
k: the counterpart in T. k = 5. T[k] = t

• The bad character rule says P should be shifted right by max{1, i - R(T[k])}, i.e., if the right-most occurrence of character T[k] in P is in position j (j < i), then P[j] should be below T[k] after the shifting. Here the shift is 3 - 1 = 2.

• Otherwise, we will shift P one position, i.e., when R(T[k]) >= i, 1 >= i - R(T[k]).

• Obviously this rule is not very useful when R(T[k]) >= i, which is usually the case for DNA sequences.
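The rule on this slide can be sketched in a few lines of Python. The function names are illustrative assumptions, and positions are 1-based to match the slides. R is built as a dictionary, and characters absent from P are treated as R(x) = 0.

```python
def rightmost_positions(pattern):
    """R(x): 1-based rightmost position of each character x in the pattern."""
    # Later positions overwrite earlier ones, so each key ends at its rightmost index.
    return {ch: pos for pos, ch in enumerate(pattern, start=1)}


def bad_character_shift(R, i, text_char):
    """Shift max{1, i - R(T[k])} for a mismatch at pattern position i (1-based)."""
    return max(1, i - R.get(text_char, 0))


R = rightmost_positions("tpabsab")
print(R["t"], R["s"])                    # R(t) and R(s) from the slide
print(bad_character_shift(R, 3, "t"))    # mismatch at i = 3 against T[k] = t
```

When R(T[k]) >= i the function falls back to a shift of 1, which is exactly the weak case the slide points out.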

Page 14: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Extended Bad Character Rule

Extended Bad Character Rule: If P[i] mismatches T[k], shift P along T so that the closest occurrence of the letter T[k] in P to the left of i in P is aligned with T[k].

Example: P = aracara, T = abararadabara
a b a r a r a d a b a r a
a r a c a r a
            ^ 1 a == a
          ^ 2 r == r
        ^ 3 a == a
      ^ 4 c != r

The rightmost occurrence of r in P is at position 6. Notice that i - R(T[k]) < 0, i.e., 4 – 6 < 0.

The rightmost occurrence of r to the left of i in P is at position 2. Notice that 4 – 2 > 0, i.e., this gives us a positive shift.

Page 15: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Extended Bad Character Rule

The amount of shift is i – j, where:
– i is the index of the mismatch in P.
– j is the rightmost occurrence of T[k] to the left of i in P.

Example: P = aracara, T = abataradabara
a b a t a r a d a b a r a
a r a c a r a
            ^ 1 a == a
          ^ 2 r == r
        ^ 3 a == a
      ^ 4 c != t

There is no occurrence of t in P, thus j = 0. Notice that i – j = 4, i.e., this gives us a positive shift past the point of mismatch.

Page 16: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Extended Bad Character Rule

• How do we implement this rule?
• We preprocess P (from right to left), recording the position of each occurrence of the letters.
• For each character x in Σ, the alphabet, create a list of its occurrences in P. If x doesn't occur in P, then it has an empty list.

Page 17: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Extended Bad Character Rule

Example: Σ = {a, b, c, d, r, t}, P = abataradabara
• a_list = <13, 11, 9, 7, 5, 3, 1> since 'a' occurs at these positions in P
• b_list = <10, 2>
• c_list = Ø
• d_list = <8>
• r_list = <12, 6>
• t_list = <4>
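The lists above, and the shift i – j they support, can be sketched in Python as follows. The function names are illustrative assumptions, and positions are 1-based as on the slides. Scanning P from right to left yields each list in decreasing order, so the first entry smaller than i is the closest occurrence to the left of the mismatch.

```python
from collections import defaultdict


def occurrence_lists(pattern):
    """Map each character to its 1-based positions in the pattern, largest first."""
    lists = defaultdict(list)
    for pos in range(len(pattern), 0, -1):      # scan P from right to left
        lists[pattern[pos - 1]].append(pos)
    return lists


def extended_bad_character_shift(lists, i, text_char):
    """Return i - j, where j is the closest occurrence of text_char left of i (j = 0 if none)."""
    for j in lists.get(text_char, []):
        if j < i:
            return i - j
    return i


lists = occurrence_lists("abataradabara")
print(lists["a"], lists["b"], lists.get("c", []))

aracara = occurrence_lists("aracara")
print(extended_bad_character_shift(aracara, 4, "r"))   # mismatch c != r at i = 4
print(extended_bad_character_shift(aracara, 4, "t"))   # t does not occur in P
```

Walking a list costs at most as many steps as characters already compared in that phase, which is why this lookup does not change the overall running time.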

Page 18: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Suffix Shift Rule

• Recall that we investigated finding prefixes before.
• Since we are matching P to T from right-to-left, we will instead need to use suffixes.

Page 19: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU


Suffix Shift Rule

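• t is a suffix of P that matches a substring t of T, and x ≠ y (x is the character of T to the left of t, y is the character of P to the left of t).
• t' is the right-most copy of t in P such that t' is not a suffix of P and z ≠ y (z is the character of P to the left of t').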

Page 20: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Suffix Shift Rule

• Consider the partial right-to-left matching of P to T below.

• This partial match involves α, a suffix of P.

.....................................adbadbaddog

............................................axbadbaddog.....

P

T

Page 21: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Suffix Shift Rule

• This partial match ends where the first mismatch occurs, where x is aligned with d.

.....................................adbadbaddog

............................................axbadbaddog.....

P

T

Page 22: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Suffix Shift Rule

We want to find a right-most copy α´ of this substring α in P such that:

1. α´ is not a suffix of P and
2. The character to the left of α´ is not the same as the character to the left of α

.........gbadbaddoghorseadbadbaddog

............................................axbadbaddog.....

P

T

Page 23: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Suffix Shift Rule

1. If α´ exists, shift P to the right such that α´ is now aligned with the substring in T that was previously aligned with α.

.........gbadbaddogcatdbadbaddog

.......................................xbadbaddog.....

P

T

.........gbadbaddogcatdbadbaddog

P after shifting ’

Page 24: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Suffix Shift Rule

2. If α´ doesn't exist, shift P right by the least amount such that a prefix of P is aligned with a suffix of α in T.

dogcatratdbadbaddog

.......................................xbadbaddog.....

P

T

P after shifting dogcatratdbadbaddog

Page 25: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Suffix Shift Rule

3. If α´ doesn't exist, and there is no prefix of P that matches a suffix of α in T, shift P right by n positions.

batcatratdbadbaddog

.......................................xbadbaddog.....

P

T

P after shifting batcatratdbadbaddog

Page 26: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Preprocessing for the good suffix rule

• Let L(i) denote the largest position less than n s.t. P[i..n] matches a suffix of P[1..L(i)].

• If there is no such position, then L(i) = 0.

• Example 1: If i = 17 then L(i) = 9 for P = batcatdogdbadbaddog (P[17..19] = dog also ends at position 9).

• Example 2: If i = 16 then L(i) = 0 for the same P (ddog has no other copy in P).

Page 27: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Suffix Shift Rule

• Let L´(i) denote the largest position less than n s.t. P[i..n] matches a suffix of P[1..L´(i)] and s.t. the character preceding the suffix is not equal to P(i-1).

• If there is no such position, then L´(i) = 0.

• Example 1: If i = 20 then L(i) = 12 and L´(i) = 6 for P = slydogsaddogdbadbaddog.

Page 28: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Suffix Shift Rule

• Example 2: If i = 19 then L(i) = 12 and L´(i) = 0

slydogsaddogdbadbaddog P

19 L(19)

Page 29: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Suffix Shift Rule

• Notice that L(i) indicates the right-most copy of P[i..n] that is not a suffix of P.

• In contrast, L´(i) indicates the right-most copy of P[i..n] that is not a suffix of P and whose preceding character doesn’t match P(i-1).

• The relation between L´(i) and L(i) is analogous to the relation between α´ and α.

slydogsaddogdbadbaddog P

20 L(20) L’(20)

Page 30: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Suffix Shift Rule

• Q: What is the point?
• A: If P(i - 1) causes the mismatch and L´(i) > 0, then we can shift P right by n - L´(i) positions. Example:

.........gbadbaddogcatdbadbaddog

.......................................xbadbaddog.....

P

T

.........gbadbaddogcatdbadbaddog

P after shifting ’

Page 31: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Suffix Shift Rule

• If L(i) and L´(i) are different, then obviously shifting by n - L´(i) positions is a greater shift than n - L(i).

• Example:

slybaddogbadbaddogcatdbadbaddog

.......................................xbaxbaddog.....

P

T

slybaddogbadbaddogcatdbadbaddog

P after shifting ’

L(i) L’(i)

Page 32: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Suffix Shift Rule

• Let Nj(P) denote the length of the longest suffix of P[1..j] that is also a suffix of P.

• Example 1: N6(P) = 3 and N12(P) = 5 for P = slydogsaddogdbadbaddog.

• Example 2: N3(P) = 2, N9(P) = 3, N15(P) = 5, N19(P) = 0 for P = hogslydogsaddogdbadbaddog.

Page 33: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Suffix Shift Rule

• Q: How are the concepts of Ni and Zi related?

• Recall that Zi = length of a maximal substring starting at position i, which is a prefix of P.

• In contrast, Ni = length of a maximal substring ending at position i, which is a suffix of P.

• In the case of Boyer-Moore, we are naturally interested in suffixes since we are scanning right-to-left.

Page 34: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Suffix Shift Rule

• Let Pr denote the mirror image of P, then the relationship can be expressed as N_j(P) = Z_{n-j+1}(Pr).

• In words, the length of the substring matching a suffix at position j in P is equal to the length of the corresponding substring matching a prefix in the reverse of P.

• Q: Why must this be true?
• A: Because they are the same substring, except that one is the reverse of the other.

Page 35: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Suffix Shift Rule

• Since N_j(P) = Z_{n-j+1}(Pr), we can use the Z algorithm to compute N in O(n).

• Q: How do we do this?
• A: We create Pr, the reverse of P, and process it with the Z algorithm.
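A minimal Python sketch of this step follows. It uses a common 0-indexed formulation of the linear-time Z algorithm, which may differ in detail from the version presented earlier in the course. The function names are illustrative, and Z at the first position (hence N at the last position) is left at 0 by convention.

```python
def z_array(s):
    """z[i] = length of the longest substring starting at i that is a prefix of s (z[0] = 0)."""
    n = len(s)
    z = [0] * n
    left = right = 0                      # current Z-box is s[left:right]
    for i in range(1, n):
        if i < right:                     # reuse information from inside the Z-box
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1                     # extend by explicit comparison
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def n_values(pattern):
    """N_j(P) for j = 1..n, via N_j(P) = Z_{n-j+1}(Pr): reverse, run Z, reverse again."""
    return z_array(pattern[::-1])[::-1]


print(n_values("qcabdabdab"))
```

Index j - 1 of the returned list holds N_j, so for P = slydogsaddogdbadbaddog the entries at indices 5 and 11 are N6 and N12.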

Page 36: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU


Concept: Suffix Shift Rule

N is the reverse of Z!

P: the pattern

Pr: the string obtained by reversing P

Then N_j(P) = Z_{n-j+1}(Pr)

     1 2 3 4 5 6 7 8 9 0
P:   q c a b d a b d a b
Pr:  b a d b a d b a c q
Nj:  0 0 0 2 0 0 5 0 0 0
Zi:  0 0 0 5 0 0 2 0 0 0

Page 37: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU


Concept: Suffix Shift Rule

For pattern P,

Nj (for j=1,…,n) can be calculated in O(n) using the Z algorithm.

Why do we need to define Nj ?

To use the strong good suffix rule, we need to find out L’(i) for every i=1,…,n.

We can get L’(i) from Nj !

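[Figure: T contains x followed by t; P contains y followed by t, where t = P[i..n], and an earlier copy t' preceded by z and ending at position L'(i).]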

Page 38: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Suffix Shift Rule

• We can then find L´(i) and L(i) values from N values in linear time with the following:

For i = 1 to n { L´(i) = 0; }
For j = 1 to n – 1 {
    i = n - Nj(P) + 1;
    L´(i) = j;
}

// L values (if desired) can be obtained
L(2) = L´(2);
For i = 3 to n { L(i) = max(L(i - 1), L´(i)); }

Page 39: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Suffix Shift Rule

• Example: P = asdbasasas, n = 10
• Values of Nj(P) for j = 1..9: 0, 2, 0, 0, 0, 2, 0, 4, 0
• Computed values i: 11, 9, 11, 11, 11, 9, 11, 7, 11
• Values of L´(1..9): 0, 0, 0, 0, 0, 0, 8, 0, 6

For i = 1 to n { L´(i) = 0; }
For j = 1 to n – 1 {
    i = n - Nj(P) + 1;
    L´(i) = j;
}
L(2) = L´(2);
For i = 3 to n { L(i) = max(L(i - 1), L´(i)); }
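The loop for L´ can be transcribed into Python almost literally. This is a sketch under two assumptions: the N values are supplied as a list whose index j - 1 holds N_j, and an extra slot at position n + 1 absorbs the assignments made when N_j = 0. The function name is illustrative.

```python
def big_l_prime(n_vals):
    """L'(i) for i = 1..n, computed from N_1..N_n with the loop shown on the slide."""
    n = len(n_vals)
    l_prime = [0] * (n + 2)              # 1-based; slot n + 1 catches the N_j = 0 case
    for j in range(1, n):                # j = 1 .. n - 1
        i = n - n_vals[j - 1] + 1
        l_prime[i] = j                   # later (larger) j overwrites earlier ones
    return l_prime[1:n + 1]


# N values for P = asdbasasas as listed on the slide (N_10 is not used by the loop).
print(big_l_prime([0, 2, 0, 0, 0, 2, 0, 4, 0, 0]))
```

Because j increases, each L´(i) ends up holding the largest qualifying j, which is what the definition asks for.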

Page 40: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Suffix Shift Rule

• Let l´(i) denote the length of the largest suffix of P[i..n] that is also a prefix of P. Let l´(i) = 0 if no such suffix exists.

Example: P = asasbsasas
l´(1) = 4, l´(2) = 4, l´(3) = 4, l´(4) = 4, l´(5) = 4, l´(6) = 4, l´(7) = 4, l´(8) = 2, l´(9) = 2, l´(10) = 0

Page 41: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Concept: Suffix Shift Rule

• Thm: l´(i) = largest j <= n – i + 1 s.t. Nj(P) = j.
• Q: How can we compute l´(i) values in linear time?
• A: This is problem #9 in Chapter 2. This would make an interesting homework problem.

Page 42: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Boyer-Moore Algorithm

Preprocessing:
Compute L´(i) and l´(i) for each position i in P.
Compute R(x), the right-most occurrence of x in P, for each character x in Σ.

Search:
k = n;
While k <= m {
    i = n; h = k;
    While i > 0 and P(i) = T(h) {
        i = i – 1; h = h – 1;
    }
    if i = 0 {
        report occurrence of P in T at position k.
        k = k + n - l´(2);
    }
    else Shift P (increase k) by the max amount indicated by the extended bad character rule and the good suffix rule.
}

Page 43: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Boyer-Moore Algorithm

Example: P = golgol
Preprocessing: Compute L´(i) and l´(i) for each position i in P.

Notice that first we need Nj(P) values in order to compute L´(i) and l´(i) for each position i in P.

For i = 1 to n { L´(i) = 0; }
For j = 1 to n – 1 {
    i = n - Nj(P) + 1;
    L´(i) = j;
}

Page 44: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Boyer-Moore Algorithm

Example: P = golgol
Recall that Nj(P) is the length of the longest suffix of P[1..j] that is also a suffix of P.

N1(P) = 0, there is no suffix of P that ends with g

N2(P) = 0, there is no suffix of P that ends with o

N3(P) = 3, there is a suffix of P that ends with l

N4(P) = 0, there is no suffix of P that ends with g

N5(P) = 0, there is no suffix of P that ends with o

N1(P) = N2(P) = N4(P) = N5(P) = 0 and N3(P) = 3

Page 45: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Boyer-Moore Algorithm

Preprocessing: P = golgol, n = 6

N1(P) = N2(P) = N4(P) = N5(P) = 0 and N3(P) = 3
Compute L´(i) and l´(i) for each position i in P

For i = 1 to n { L´(i) = 0; }
For j = 1 to n – 1 {
    i = n - Nj(P) + 1;
    L´(i) = j;
}

j = 1: i = 7, therefore L´(7) = 1
j = 2: i = 7, therefore L´(7) = 2
j = 3: i = 4, therefore L´(4) = 3
j = 4: i = 7, therefore L´(7) = 4
j = 5: i = 7, therefore L´(7) = 5

Position 7 = n + 1 lies outside P, so those assignments do not affect any L´ value of P.

L´(1) = L´(2) = L´(3) = L´(5) = L´(6) = 0 and L´(4) = 3

Page 46: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Boyer-Moore Algorithm

Preprocessing: P = golgol, n = 6
N1(P) = N2(P) = N4(P) = N5(P) = 0 and N3(P) = 3
L´(1) = L´(2) = L´(3) = L´(5) = 0 and L´(4) = 3

Compute l´(i) for each position i in P.
Recall that l´(i) is the length of the longest suffix of P[i..n] that is also a prefix of P.

l´(1) = 6 since golgol itself is the longest suffix of P[1..n] that is a prefix of P.
l´(2) = 3 since gol is the longest suffix of P[2..n] that is a prefix of P.
l´(3) = 3 since gol is the longest suffix of P[3..n] that is a prefix of P.
l´(4) = 3 since gol is the longest suffix of P[4..n] that is a prefix of P.
l´(5) = 0 since there is no suffix of P[5..n] that is a prefix of P.
l´(6) = 0 since there is no suffix of P[6..n] that is a prefix of P.

l´(1) = 6, l´(2) = l´(3) = l´(4) = 3 and l´(5) = l´(6) = 0

Page 47: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Boyer-Moore Algorithm

Preprocessing: P = golgol, n = 6
N1(P) = N2(P) = N4(P) = N5(P) = 0 and N3(P) = 3
L´(1) = L´(2) = L´(3) = L´(5) = 0 and L´(4) = 3
l´(1) = 6, l´(2) = l´(3) = l´(4) = 3 and l´(5) = l´(6) = 0

Compute the list R(x), the right-most occurrences of x in P, for each character x in Σ = {g, o, l}

R(g) = <4, 1>

R(o) = <5, 2>

R(l) = <6, 3>

Page 48: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Boyer-Moore Algorithm

Preprocessing: P = golgol, n = 6, T = lolgolgol, m = 9
L´(1) = L´(2) = L´(3) = L´(5) = 0 and L´(4) = 3
l´(1) = 6, l´(2) = l´(3) = l´(4) = 3 and l´(5) = l´(6) = 0
R(g) = <4, 1>, R(o) = <5, 2>, R(l) = <6, 3>

Search:
k = n;
While k <= m {
    i = n; h = k;
    While i > 0 and P(i) = T(h) {
        i = i – 1; h = h – 1;
    }
    if i = 0 {
        report occurrence of P in T at position k.
        k = k + n - l´(2);
    }
    else Shift P (increase k) by the max amount indicated by the extended bad character rule and the good suffix rule.
}

Page 49: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Search

lolgolgol
golgol
     ^ i = 6, h = 6
    ^ i = 5, h = 5
   ^ i = 4, h = 4
  ^ i = 3, h = 3
 ^ i = 2, h = 2
^ i = 1, h = 1, P(1) != T(1)

Bad Character Rule: there is no occurrence of l, the mismatched character in T, to the left of P(1). This suggests shifting only 1 place.

Good Suffix Rule: Since L´(2) = 0 and l´(2) = 3, shift P by n - l´(2) places, i.e., 6 - 3 = 3 places. Thus k = k + 3 = 9.

But i = 1!

k = 6;
While k <= 9 {
    i = 6; h = k;
    While i > 0 and P(i) = T(h) {
        i = i – 1; h = h – 1;
    }
    if i = 0 {
        report occurrence of P in T at position k.
        k = k + 6 - l´(2);
    }
    else Shift P (increase k) by the max amount indicated by the extended bad character rule and the good suffix rule.
}

Page 50: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Search

lolgolgol
   golgol
        ^ i = 6, h = 9
       ^ i = 5, h = 8
      ^ i = 4, h = 7
     ^ i = 3, h = 6
    ^ i = 2, h = 5
   ^ i = 1, h = 4
  ^ i = 0, h = 3

i = 0, report occurrence of P in T starting at position 4 (ending at k = 9),
k = k + 6 - l´(2) = 9 + 6 - 3 = 12

lolgolgol
      golgol
k = 12 > m = 9, we are done!

k = 6;
While k <= 9 {
    i = 6; h = k;
    While i > 0 and P(i) = T(h) {
        i = i – 1; h = h – 1;
    }
    if i = 0 {
        report occurrence of P in T at position k.
        k = k + 6 - l´(2);
    }
    else Shift P (increase k) by the max amount indicated by the extended bad character rule and the good suffix rule.
}

Page 51: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Homework 1: Due Next Week
• Implement the Boyer-Moore Algorithm

Page 52: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Break

Page 53: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

KMP Algorithm

• Preliminaries:
– KMP can be easily explained in terms of finite state machines.
– KMP has an easily proved linear bound.
– KMP is usually not the method of choice.

Page 54: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

KMP Algorithm

• Recall that the naïve approach to string matching is Θ(mn).

• How can we reduce this complexity?
– Avoid redundant comparisons
– Use larger shifts
• Boyer-Moore good suffix rule
• Boyer-Moore extended bad character rule

Page 55: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

KMP Algorithm

• KMP finds larger shifts by recognizing patterns in P.
– Let spi(P) denote the length of the longest proper suffix of P[1..i] that matches a prefix of P.
– By definition sp1 = 0 for any string.
– Q: Why does this make sense?
– A: The proper suffix must be the empty string.

Page 56: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

KMP Algorithm

• Example: P = abcaeabcabd
– P[1..2] = ab hence sp2 = ?

– sp2 = 0

– P[1..3] = abc hence sp3 = ?

– sp3 = 0

– P[1..4] = abca hence sp4 = ?

– sp4 = 1

– P[1..5] = abcae hence sp5 = ?

– sp5 = 0

– P[1..6] = abcaea hence sp6 = ?

– sp6 = 1

Page 57: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

KMP Algorithm

• Example Continued
– P[1..7] = abcaeab hence sp7 = ?

– sp7 = 2

– P[1..8] = abcaeabc hence sp8 = ?

– sp8 = 3

– P[1..9] = abcaeabca hence sp9 = ?

– sp9 = 4

– P[1..10] = abcaeabcab hence sp10 = ?

– sp10 = 2

– P[1..11] = abcaeabcabd hence sp11 = ?

– sp11 = 0
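The sp values in this example can be checked with a short Python sketch that applies the definition directly. It is deliberately simple and not linear time; the linear-time preprocessing is the subject of later slides. The function name is an illustrative assumption.

```python
def sp_by_definition(pattern):
    """sp_i for i = 1..n: longest proper suffix of P[1..i] that matches a prefix of P."""
    values = []
    for i in range(1, len(pattern) + 1):
        prefix = pattern[:i]
        best = 0
        for length in range(1, i):           # proper suffix: strictly shorter than P[1..i]
            if prefix[-length:] == pattern[:length]:
                best = length                # keep the longest match seen so far
        values.append(best)
    return values


print(sp_by_definition("abcaeabcabd"))
```

For i = 1 the inner loop never runs, which reflects the remark that the only proper suffix is the empty string.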

Page 58: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

KMP Algorithm

• Like the α/α´ concept for Boyer-Moore, there is an analogous spi/sp´i concept.
• Let sp´i(P) denote the length of the longest proper suffix of P[1..i] that matches a prefix of P, with the added condition that characters P(i + 1) and P(sp´i + 1) are unequal.

• Example: P = abcdabce, sp´7 = 3

Obviously sp´i(P) <= spi(P), since the latter is less restrictive.

Page 59: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

KMP Algorithm

• KMP Shift Rule:
1. Mismatch case:
• Let position i+1 in P and position k in T be the first mismatch in a left-to-right scan.
• Shift P to the right, aligning P[1..sp´i] with T[k-sp´i..k-1].
2. Match case:
• If no mismatch is found, an occurrence of P has been found.
• Shift P by n – sp´n spaces to continue searching for other occurrences.

Page 60: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

KMP Algorithm

• Observations:
– The prefix P[1..sp´i] of the shifted P is shifted to match the corresponding substring in T.
– Subsequent character matching proceeds from position sp´i + 1.
– Unlike Boyer-Moore, the matched substring is not compared again.
– The shift rule based on sp´i guarantees that the exact same mismatch won't occur at sp´i + 1 but doesn't guarantee that P(sp´i + 1) = T(k).

Page 61: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

KMP Algorithm

• Example: P = abcxabcde
– If a mismatch occurs at position 8, P will be shifted 4 positions to the right.
– Q: Where did the 4 position shift come from?
– A: The number of positions is given by i - sp´i; in this example i = 7, sp´7 = 3, 7 – 3 = 4.
– Notice that we know the amount of shift without knowing anything about T other than that there was a mismatch at position 8.

Page 62: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

KMP Algorithm

• Example Continued: P = abcxabcde
– After the shift, P[1..3] lines up with T[k-3..k-1].
– Since it is known that P[1..3] must match T[k-3..k-1], no comparison is needed.
– The scan continues from P(4) & T(k).

• Advantages of KMP Shift Rule
1. P is often shifted by more than 1 character, (i - sp´i).
2. The left-most sp´i characters in the shifted P are known to match the corresponding characters in T.

Page 63: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

KMP Algorithm

Full Example: T = xyabcxabcxadcdqfeg, P = abcxabcde
Assume that we have already shifted past the first two positions in T.

xyabcxabcxadcdqfeg
  abcxabcde
         ^ comparisons 1 to 7 match; comparison 8: d != x, shift 4 places
      abcxabcde
         ^ 1 start again from position 4

Page 64: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Preprocessing for KMP

Approach: show how to derive sp´ values from Z values.

Definition: Position j > 1 maps to i if i = j + Zj(P) – 1
– Recall that Zj(P) denotes the length of the Z-box starting at position j.
– This says that j maps to i if i is the right end of a Z-box starting at j.

Page 65: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Preprocessing for KMP

Theorem. For any i > 1, sp´i(P) = Zj = i – j + 1,
where j > 1 is the smallest position that maps to i. If there is no such j then sp´i(P) = 0.

Similarly for sp: For any i > 1, spi(P) = i – j + 1,
where j, i ≥ j > 1, is the smallest position that maps to i or beyond. If there is no such j then spi(P) = 0.

Definition: Position j > 1 maps to i if i = j + Zj(P) – 1

Page 66: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Preprocessing for KMP

Given the theorem from the preceding slide, the sp´i and spi values can be computed in linear time using Zi values:

For i = 1 to n { sp´i = 0; }
For j = n downto 2 {
    i = j + Zj(P) – 1;
    sp´i = Zj;
}

spn(P) = sp´n(P);
For i = n - 1 downto 2 {
    spi(P) = max[spi+1(P) - 1, sp´i(P)];
}
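The two loops can be sketched in Python as below. One labeled assumption: the Z values come from a simple quadratic helper (`z_naive`, a stand-in used only to keep the example short), whereas the linear bound on the slide relies on the real Z algorithm. Function names are illustrative and arrays are 1-based internally to mirror the pseudocode.

```python
def z_naive(p):
    """Z_j for j = 2..n by direct comparison (quadratic stand-in); index 0 and 1 stay 0."""
    n = len(p)
    z = [0] * (n + 1)
    for j in range(2, n + 1):
        # Compare P(1 + z) with P(j + z) while the characters agree.
        while j + z[j] <= n and p[z[j]] == p[j - 1 + z[j]]:
            z[j] += 1
    return z


def sp_from_z(p):
    """Return (sp', sp) as lists for i = 1..n, following the loops on the slide."""
    n = len(p)
    z = z_naive(p)
    sp_prime = [0] * (n + 1)
    for j in range(n, 1, -1):            # j = n downto 2; the smallest j writes last
        i = j + z[j] - 1
        sp_prime[i] = z[j]
    sp = [0] * (n + 1)
    sp[n] = sp_prime[n]
    for i in range(n - 1, 1, -1):        # i = n - 1 downto 2
        sp[i] = max(sp[i + 1] - 1, sp_prime[i])
    return sp_prime[1:], sp[1:]


print(sp_from_z("abcxabcde"))
```

Processing j from n down to 2 matters: when several positions map to the same i, the last write comes from the smallest j, as the theorem requires.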


Page 67: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Preprocessing for KMP

Defn. Failure function F´(i) = sp´i-1 + 1, 1 ≤ i ≤ n + 1, sp´0 = 0
(similarly F(i) = spi-1 + 1, 1 ≤ i ≤ n + 1, sp0 = 0)

xyabcxabcxadcdqfeg
  abcxabcde
         ^ comparisons 1 to 7 match; comparison 8: d != x, shift 4 places

Shifting is only conceptual and P is never explicitly shifted. Only the pointer c into T and the pointer i into P are updated.

Two special cases:
1. Mismatch at position 1, then F´(1) = 1
2. Match found, then P shifts by n - sp´n places, which is F´(n+1) = sp´n + 1

Page 68: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Preprocessing for KMP

Defn. Failure function F´(i) = sp´i-1 + 1, 1 ≤ i ≤ n + 1, sp´0 = 0
(similarly F(i) = spi-1 + 1, 1 ≤ i ≤ n + 1, sp0 = 0)

• Idea:
– We maintain a pointer i in P and c in T.
– After a mismatch at P(i+1) with T(c), shift P to align P(sp´i + 1) with T(c), i.e., i = sp´i + 1.
– Special case 1: i = 1, set i = F´(1) = 1 & c = c + 1
– Special case 2: we find P in T, shift n - sp´n spaces, i.e., i = F´(n + 1) = sp´n + 1.

Page 69: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Full KMP Algorithm

Preprocess P to find F´(k) = sp´k-1 + 1 for k from 1 to n + 1
c = 1; p = 1;
While c + (n – p) ≤ m {
    While P(p) = T(c) and p ≤ n {
        p = p + 1; c = c + 1;
    }
    If (p = n + 1) then report an occurrence of P at position c – n of T.
    if (p = 1) then c = c + 1;
    p = F´(p);
}

T = xyabcxabcxadcdqfeg
P = abcxabcde
|T| = m, |P| = n
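The loop above can be written in Python as in the sketch below. It keeps the 1-based pointers c and p from the pseudocode. One deliberate simplification, stated as an assumption: the failure function here is F (built from the plain sp values with the standard prefix-function recurrence) rather than F´, which gives the same occurrences with possibly a few more comparisons. The function name is illustrative.

```python
def kmp_search(pattern, text):
    """Return 1-based start positions of pattern in text, following the slide's loop."""
    n, m = len(pattern), len(text)
    if n == 0:
        return []
    # sp[i - 1] = sp_i, computed with the standard prefix-function recurrence.
    sp = [0] * n
    k = 0
    for i in range(1, n):
        while k > 0 and pattern[i] != pattern[k]:
            k = sp[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        sp[i] = k
    # F(p) = sp_{p-1} + 1 for p = 1..n+1, with sp_0 = 0 (index 0 unused).
    F = [0, 1] + [sp[p - 2] + 1 for p in range(2, n + 2)]

    hits = []
    c = p = 1
    while c + (n - p) <= m:
        while p <= n and pattern[p - 1] == text[c - 1]:
            p += 1
            c += 1
        if p == n + 1:
            hits.append(c - n)           # occurrence starts at position c - n
        if p == 1:
            c += 1                       # mismatch at P(1): move on in T
        p = F[p]
    return hits


print(kmp_search("abcxabcde", "xyabcxabcxabcdefeg"))
```

The pointer c never moves backwards, which is the source of the linear bound. Swapping in F´ would only change which value of p is resumed after a mismatch.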

Page 70: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Full KMP Algorithm

xyabcxabcxabcdefegabcxabcde

^ 1 a!=x

p != n+1

p = 1! c = 2

p = F’(1) = 1

c = 1; p = 1;
While c + (n – p) ≤ m {
    While P(p) = T(c) and p ≤ n {
        p = p + 1; c = c + 1;
    }
    If (p = n + 1) then report an occurrence of P at position c – n of T.
    if (p = 1) then c = c + 1;
    p = F´(p);
}

Page 71: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Full KMP Algorithm

xyabcxabcxabcdefeg

abcxabcde

^ 1 a!=y

p != n+1

p = 1! c = 3

p = F’(1) = 1

c = 1; p = 1;
While c + (n – p) ≤ m {
    While P(p) = T(c) and p ≤ n {
        p = p + 1; c = c + 1;
    }
    If (p = n + 1) then report an occurrence of P at position c – n of T.
    if (p = 1) then c = c + 1;
    p = F´(p);
}

abcxabcde

Page 72: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Full KMP Algorithm

xyabcxabcxabcdefeg

p != n+1

p = 8! don’t change c

p = F´(8) = 4

abcxabcde abcxabcde

^ 1 ^ 2 ^ 3 ^ 4 ^ 5 ^ 6 ^ 7 ^ 8 d!=x

c = 1; p = 1;
While c + (n – p) ≤ m {
    While P(p) = T(c) and p ≤ n {
        p = p + 1; c = c + 1;
    }
    If (p = n + 1) then report an occurrence of P at position c – n of T.
    if (p = 1) then c = c + 1;
    p = F´(p);
}

Page 73: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

p = 4, c = 10

^ 4

Full KMP Algorithm

xyabcxabcxabcdefeg

p = n+1 !

abcxabcde

^ 5 ^ 6 ^ 7 ^ 8

abcxabcde abcxabcde abcxabcde

c = 1; p = 1;
While c + (n – p) ≤ m {
    While P(p) = T(c) and p ≤ n {
        p = p + 1; c = c + 1;
    }
    If (p = n + 1) then report an occurrence of P at position c – n of T.
    if (p = 1) then c = c + 1;
    p = F´(p);
}

^ 9

Page 74: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Real-Time KMP

• Q: What is meant by real-time algorithms?
• A: Typically these are algorithms that are meant to interact synchronously in the real world.
– This implies a known fixed turn-around time for processing a task.
– Many embedded scheduling systems are examples involving real-time algorithms.
– For KMP this means that we require a constant time for processing all strings of length n.

Page 75: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Real-Time KMP

• Q: Why is KMP not real-time?
• A: For any mismatched character in T, we may try matching it several times.
– Recall that sp´i only guarantees that P(i + 1) and P(sp´i + 1) differ.
– There is NO guarantee that P(sp´i + 1) and T(k) match.

• We need to ensure that a mismatch at T(k) does NOT entail additional matches at T(k).

• This means that we have to compute sp´i values with respect to all characters in Σ since any could appear in T.

Page 76: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Real-Time KMP

• Define: sp´(i,x)(P) to be the length of the longest proper suffix of P[1..i] that matches a prefix of P, with the added condition that character P(sp´i + 1) is x.

• This will tell us exactly what shift to use for each possible mismatch.

• A mismatched character T(k) will never be involved in subsequent comparisons.

Page 77: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Real-Time KMP

• Q: How do we know that the mismatched character T(k) will never be involved in subsequent comparisons?

• A: Because the shift will shift P so that either the matching character aligns with T(k) or P will be shifted past T(k).

• This results in a real-time version of KMP.
• Let's consider how we can find the sp´(i,x)(P) values in linear time.

Page 78: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Real-Time KMP

Thm. For P[i + 1] ≠ x, sp´(i,x)(P) = i - j + 1
– Here j is the smallest position such that j maps to i and P(Zj + 1) = x.
– If there is no such j then sp´(i,x)(P) = 0

For i = 1 to n { sp´(i,x) = 0 for every character x; }
For j = n downto 2 {
    i = j + Zj(P) – 1;
    x = P(Zj + 1);
    sp´(i,x) = Zj;
}

Page 79: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Real-Time KMP

• Notice how this works:
– Starting from the right
• Find i, the right end of the Z box associated with j.
• Find x, the character immediately following the prefix corresponding to this Z box.
• Set sp´(i,x) = Zj, the length of this Z box.

For i = 1 to n { sp´(i,x) = 0 for every character x; }
For j = n downto 2 {
    i = j + Zj(P) – 1;
    x = P(Zj + 1);
    sp´(i,x) = Zj;
}

Page 80: Presented By Dr.  Shazzad Hosain Asst. Prof. EECS, NSU

Reference

• Chapter 1, 2: Exact Matching: Fundamental Preprocessing and First Algorithms