Preprocessing Application: String Matching By Rong Ge COSC3100.


Page 1: Preprocessing Application: String Matching By Rong Ge COSC3100.

Preprocessing

Application: String Matching

By Rong Ge

COSC3100

Page 2: Preprocessing Application: String Matching By Rong Ge COSC3100.

A. Levitin “Introduction to the Design & Analysis of Algorithms,” 3rd ed., Ch. 7 ©2012 Pearson Education, Inc. Upper Saddle River, NJ. All Rights Reserved.

Space-for-time tradeoffs

Two varieties of space-for-time algorithms:

• input enhancement: preprocess the input (or part of it) to store some info to be used later in solving the problem
  – counting sorts (last lecture)
  – string searching algorithms (today)

• prestructuring: preprocess the input to make accessing its elements easier
  – hashing

Page 3: Preprocessing Application: String Matching By Rong Ge COSC3100.

Review: String searching by brute force

String matching problem: given
• a pattern P: a string of m characters to search for, and
• a text T: a (long) string of n >= m characters to search in,
return whether P occurs in T and, if so, the position of P in T.

Brute-force algorithm revisited:

Step 1 Align pattern P at the beginning of text T

Step 2 Scan from left to right, comparing each character of P to the corresponding character in T, until either all characters of P are found to match (successful search) or a mismatch is detected

Step 3 While a mismatch is detected and T is not yet exhausted, realign P one position to the right and repeat Step 2

# comparisons: O(mn) in the worst case

More efficient algorithms with O(n) time?
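The three brute-force steps can be sketched in Python as follows. This is a minimal illustration, not code from the slides; the function name `brute_force_search` and the choice to return a comparison count alongside the position are assumptions made for this example.

```python
def brute_force_search(text, pattern):
    """Return (index of the first match or -1, number of character comparisons)."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for i in range(n - m + 1):           # Steps 1 and 3: each alignment of P in T
        j = 0
        while j < m:                     # Step 2: scan P from left to right
            comparisons += 1
            if text[i + j] != pattern[j]:
                break                    # mismatch: move on to the next alignment
            j += 1
        if j == m:                       # all m characters matched
            return i, comparisons
    return -1, comparisons


position, count = brute_force_search("NOBODY_NOTICED_HIM", "NOT")
print(position, count)
```

Counting comparisons makes it easy to compare this baseline against the preprocessing-based algorithms on the same text and pattern.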

Page 4: Preprocessing Application: String Matching By Rong Ge COSC3100.


String searching by preprocessing

Several string searching algorithms are based on the input-enhancement idea of preprocessing the pattern P

Horspool’s algorithm
1. preprocess P to build one shift table tb
2. align P against the beginning of T
3. repeat until a matching substring is found or T ends:
   1. compare the corresponding characters of P and T from right to left
   2. if a mismatch occurs, shift P to the right by tb(c), where c is the character of T aligned with the rightmost character of P in the current alignment

The Boyer-Moore algorithm improves on Horspool’s algorithm by using two tables: a bad-symbol table and a good-suffix table

Page 5: Preprocessing Application: String Matching By Rong Ge COSC3100.


Horspool’s Algorithm

Two key points:

• preprocesses P to generate a shift table that determines how much to shift P when a mismatch occurs

• always shifts by the shift table’s entry for c, where c is the character of T aligned with the last character of P

Page 6: Preprocessing Application: String Matching By Rong Ge COSC3100.

How far to shift for each case?

Focus on the rightmost character of T in the alignment:

Case I: character C != B (mismatch), and C does not occur in P

  .....C......................   Text (C not in pattern)
  BAOBAB                         Pattern

Case II: the character mismatches B, but occurs in P

  Character O != B (mismatch), and O occurs in P once

  .....O......................   (O occurs once in pattern)
  BAOBAB

  Character A != B (mismatch), and A occurs in P more than once

  .....A......................   (A occurs twice in pattern)
  BAOBAB

Case III: character R = R (match), but R doesn’t appear among the first m-1 characters of P

  ...MER......................
  LEADER

Case IV: character B = B (match), and B occurs among the first m-1 characters of P

  .....B......................
  BAOBAB

Your answers?

Page 7: Preprocessing Application: String Matching By Rong Ge COSC3100.


Build a shift table for a given pattern P

Shift table:
• A 1-D array indicating the shift size for each character c if it is the rightmost character of T in the alignment
• array size: number of characters in the alphabet
  – E.g.: 26 if all characters in T are uppercase English letters

The array entry for a character c is computed as follows
• Cases I & III: P’s length m
• Cases II & IV: distance from c’s rightmost occurrence among P’s first m-1 characters to P’s right end

Horspool's alg. populates the shift table in two steps:

1. Initialize each array entry as P’s length

2. Scan the first m-1 characters of P from left to right and update the array entries
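The two population steps can be sketched in Python as below. This is an illustrative sketch rather than the slides' code: the name `build_shift_table` and the use of a dictionary keyed by character (instead of a fixed-size array) are assumptions made for readability.

```python
def build_shift_table(pattern, alphabet):
    """Build Horspool's shift table as a dict mapping character -> shift size."""
    m = len(pattern)
    # Step 1: every entry starts as the pattern length m (Cases I and III).
    table = {c: m for c in alphabet}
    # Step 2: scan the first m-1 characters left to right (Cases II and IV).
    # A later, more rightward occurrence overwrites an earlier one, so each
    # entry ends up holding the distance from the rightmost occurrence.
    for i in range(m - 1):
        table[pattern[i]] = m - 1 - i
    return table


table = build_shift_table("BAOBAB", "ABCDEFGHIJKLMNOPQRSTUVWXYZ")
print(table["A"], table["B"], table["O"], table["C"])  # 1 2 3 6
```

Scanning left to right is what makes the overwrite rule work: the last write for a character is always its rightmost occurrence among the first m-1 positions.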

Page 8: Preprocessing Application: String Matching By Rong Ge COSC3100.


Build shift table

Example: Pattern: BAOBAB
• Alphabet: uppercase English letters

Preprocessing
• Initialize all array entries as P’s length, which is 6 for BAOBAB
  - Taking care of cases I and III
  - Shift table is indexed by the characters of the text and pattern alphabet

  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
  6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6

• Scan P from left to right for the first m-1 characters and update the table (skip the last character of P)
  0. BAOBAB, P[0] = B: distance = 5 (B’s entry)
  1. BAOBAB, P[1] = A: distance = 4 (A’s entry)
  2. BAOBAB, P[2] = O: distance = 3 (O’s entry)
  3. BAOBAB, P[3] = B: distance = 2 (B’s entry)
  4. BAOBAB, P[4] = A: distance = 1 (A’s entry)
  - Taking care of cases II & IV

Resulting shift table

  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
  1 2 6 6 6 6 6 6 6 6 6 6 6 6 3 6 6 6 6 6 6 6 6 6 6 6

Page 9: Preprocessing Application: String Matching By Rong Ge COSC3100.


Example of Horspool’s alg. application

Alg: keep aligning P against T and comparing from right to left, shifting if mismatched.

Example:

  BARD LOVED BANANAS          (Text)
  BAOBAB                      (shift t(L)=6)
        BAOBAB                (shift t(B)=2)
          BAOBAB              (shift t(N)=6)
                BAOBAB        (T out of bound, unsuccessful search)

# total comparisons: 4
  1st alignment: 1
  2nd alignment: 2
  3rd alignment: 1
  4th alignment: 0

How many comparisons with brute force alg?

Shift table (_ denotes the space character):

  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z _
  1 2 6 6 6 6 6 6 6 6 6 6 6 6 3 6 6 6 6 6 6 6 6 6 6 6 6
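The trace above can be reproduced with a short Python sketch of Horspool's algorithm. The function name `horspool_search`, the returned comparison count, and writing spaces as `_` (as in the shift table) are assumptions for this illustration; characters missing from the dictionary are treated as having the default shift m.

```python
def horspool_search(text, pattern):
    """Return (index of the first match or -1, number of character comparisons)."""
    n, m = len(text), len(pattern)
    # Shift table: only characters among the first m-1 of P are stored;
    # every other character implicitly gets the default shift m.
    shift = {}
    for j in range(m - 1):
        shift[pattern[j]] = m - 1 - j

    comparisons = 0
    i = m - 1                              # text index aligned with P's last character
    while i <= n - 1:
        k = 0                              # number of matched characters so far
        while k < m:                       # compare from right to left
            comparisons += 1
            if pattern[m - 1 - k] != text[i - k]:
                break
            k += 1
        if k == m:
            return i - m + 1, comparisons
        i += shift.get(text[i], m)         # shift by the entry for text[i]
    return -1, comparisons


print(horspool_search("BARD_LOVED_BANANAS", "BAOBAB"))  # (-1, 4)
```

Note that the shift always uses the text character under the pattern's last position, regardless of where the mismatch happened.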

Page 10: Preprocessing Application: String Matching By Rong Ge COSC3100.


Boyer-Moore algorithm

Based on the same two ideas:
• comparing P to T from right to left for each alignment
• precomputing shift sizes in two tables
  – bad-symbol table t1 indicates how much to shift based on the text’s character causing a mismatch
    (t1 is built the same way as the shift table in Horspool’s alg.)
  – good-suffix table d2 indicates how much to shift based on the matched part (suffix) of the pattern

Page 11: Preprocessing Application: String Matching By Rong Ge COSC3100.


Scenarios in string matching

SI: the rightmost character of P doesn’t match; the BM algorithm acts as Horspool’s. Shift P to the right by t1(c)

  [Figure: text and pattern aligned; the rightmost pattern character x doesn’t match the text character c]

SII: k characters are matched, and then a mismatch occurs, 0 < k < m

  [Figure: text and pattern aligned; the last k > 0 characters match, then the pattern character x mismatches the text character c]

How much do we shift the pattern to the right?

Page 12: Preprocessing Application: String Matching By Rong Ge COSC3100.

Prefix and suffix of strings

Prefix of a string P: a substring of P that occurs at the beginning of P.

E.g. P: BANANA

  Prefix  B  BA  BAN  BANA  BANAN  BANANA
  Length  1  2   3    4     5      6

Suffix of a string P: a substring of P that occurs at the end of P.

E.g. P: BANANA

  Suffix  A  NA  ANA  NANA  ANANA  BANANA
  Length  1  2   3    4     5      6
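For readers who prefer code, the two definitions map directly onto Python slicing. The helper names `prefixes` and `suffixes` are hypothetical and introduced only for this sketch.

```python
def prefixes(s):
    """All non-empty prefixes of s, ordered by increasing length."""
    return [s[:k] for k in range(1, len(s) + 1)]


def suffixes(s):
    """All non-empty suffixes of s, ordered by increasing length."""
    return [s[len(s) - k:] for k in range(1, len(s) + 1)]


print(prefixes("BANANA"))  # ['B', 'BA', 'BAN', 'BANA', 'BANAN', 'BANANA']
print(suffixes("BANANA"))  # ['A', 'NA', 'ANA', 'NANA', 'ANANA', 'BANANA']
```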

Page 13: Preprocessing Application: String Matching By Rong Ge COSC3100.


Good-suffix table d2 for pattern

Applied after a suffix of P matches T for an alignment, and then a mismatch occurs
• k is the length of the matched suffix of P, 0 < k < m

Good-suffix table d2:
• a 1-D array of size m, but the entry indexed by 0 is not used
• each entry d2(k) indicates the shift size for the good (matched) suffix of length k

Page 14: Preprocessing Application: String Matching By Rong Ge COSC3100.


Build good-suffix table d2 for the pattern

For each entry indexed by k, set d2(k) using three cases in order; k is in [1, m-1] and corresponds to the suffix of P with size k

1. the distance between the matched suffix of size k and its rightmost other occurrence in the pattern that is not preceded by the same character as the suffix
   E.g.: P: CABABA d2(k=1) = 4 // Suffix with size 1: A; the rightmost A that is not preceded by B is the second character in P.

2. the distance between the longest part of the k-character suffix and the corresponding prefix of P, if there is no occurrence in case 1
   E.g.: P: ANAN d2(k=3) = 2 // Suffix with size 3: NAN; the longest part of NAN that matches a prefix of P is AN

3. m, the length of P, if there is no occurrence in case 2
   E.g.: P: ANAN d2(k=1) = 4 // Suffix with size 1: N; the only other N is also preceded by A, and no shorter part of the suffix is left to match a prefix
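The three cases can be checked with a direct, definition-following Python sketch. It is deliberately a brute-force construction (roughly cubic in m), not an efficient or canonical one; the name `good_suffix_table` and the use of `None` for the unused entry 0 are assumptions for this illustration.

```python
def good_suffix_table(pattern):
    """Return d2 as a list of length m; d2[0] is unused (None)."""
    m = len(pattern)
    d2 = [None] * m
    for k in range(1, m):
        suffix = pattern[m - k:]
        preceding = pattern[m - k - 1]     # character just before the suffix
        shift = None
        # Case 1: rightmost other occurrence of the suffix that is not
        # preceded by the same character as the suffix itself.
        for start in range(m - k - 1, -1, -1):
            if pattern[start:start + k] == suffix and (
                start == 0 or pattern[start - 1] != preceding
            ):
                shift = (m - k) - start
                break
        # Case 2: longest proper part of the suffix that matches a prefix of P.
        if shift is None:
            for length in range(k - 1, 0, -1):
                if pattern[:length] == pattern[m - length:]:
                    shift = m - length
                    break
        # Case 3: nothing reusable, shift by the whole pattern length.
        if shift is None:
            shift = m
        d2[k] = shift
    return d2


print(good_suffix_table("ANAN"))       # [None, 4, 2, 2]
print(good_suffix_table("CABABA")[1])  # 4
```

Trying the cases strictly in order mirrors the slide: a later case is consulted only when the earlier one finds nothing.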

Page 15: Preprocessing Application: String Matching By Rong Ge COSC3100.

Exercise

Build the good suffix table for pattern WOWWOW


Page 16: Preprocessing Application: String Matching By Rong Ge COSC3100.

Exercise

Build the good suffix table for pattern WOWWOW

  k  pattern  d2  case
  1  WOWWOW   2   Case 1
  2  WOWWOW   5   Case 2: longest part of OW with a matched prefix is W
  3  WOWWOW   3   Case 1
  4  WOWWOW   3   Case 2: longest part of WWOW with a matched prefix is WOW
  5  WOWWOW   3   Case 2: longest part of OWWOW with a matched prefix is WOW

Page 17: Preprocessing Application: String Matching By Rong Ge COSC3100.


BM alg. for scenario II in string matching

After successfully matching 0 < k < m characters, the algorithm shifts the pattern to the right by

  d = max{d1, d2}

where
  d1 = max{t1(c) - k, 1} is the bad-symbol shift
  d2 = d2(k) is the good-suffix shift
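As a quick arithmetic check of the formula, here is a small Python sketch. The helper name `bm_shift` and its parameter names are hypothetical. The subtraction of k reflects that t1(c) is measured from the pattern's last position, while the mismatching text character c sits k positions to its left; the inner max with 1 guards against a zero or negative shift.

```python
def bm_shift(t1_c, d2_k, k):
    """Shift size after k > 0 matched characters.

    t1_c: bad-symbol table entry for the mismatching text character c
    d2_k: good-suffix table entry for the matched suffix of length k
    """
    d1 = max(t1_c - k, 1)      # bad-symbol shift, never less than 1
    return max(d1, d2_k)       # take the larger of the two shifts


# Values for pattern BAOBAB: t1(_) = 6, d2(2) = 5, d2(1) = 2.
print(bm_shift(6, 5, 2))  # d1 = 4, d2 = 5 -> 5
print(bm_shift(6, 2, 1))  # d1 = 5, d2 = 2 -> 5
print(bm_shift(1, 2, 1))  # d1 = max(0, 1) = 1, d2 = 2 -> 2
```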

Page 18: Preprocessing Application: String Matching By Rong Ge COSC3100.


Boyer-Moore Algorithm (cont.)

Step 1 Fill in the bad-symbol shift table
Step 2 Fill in the good-suffix shift table
Step 3 Align the pattern against the beginning of the text
Step 4 Repeat until a matching substring is found or the text ends:
  Compare the corresponding characters right to left.
  • If no characters match, retrieve entry t1(c) from the bad-symbol table for the text’s character c causing the mismatch and shift the pattern to the right by t1(c).
  • If 0 < k < m characters are matched, retrieve entry t1(c) from the bad-symbol table for the text’s character c causing the mismatch and entry d2(k) from the good-suffix table, and shift the pattern to the right by
      d = max{d1, d2}, where d1 = max{t1(c) - k, 1} and d2 = d2(k).
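Steps 1 to 4 can be put together in a compact Python sketch. It is an illustration under stated assumptions, not the slides' implementation: the function names are hypothetical, the good-suffix table is built by brute force straight from its three-case definition, spaces are written as `_`, and the function returns a comparison count in addition to the match position.

```python
def bad_symbol_table(pattern):
    """Step 1: same construction as Horspool's shift table (default shift is m)."""
    m = len(pattern)
    return {pattern[j]: m - 1 - j for j in range(m - 1)}


def good_suffix_table(pattern):
    """Step 2: brute-force d2 from the three-case definition; d2[0] is unused."""
    m = len(pattern)
    d2 = [None] * m
    for k in range(1, m):
        suffix, preceding = pattern[m - k:], pattern[m - k - 1]
        shift = m                                        # case 3 default
        for length in range(1, k):                       # case 2: longest matching part
            if pattern[:length] == pattern[m - length:]:
                shift = m - length
        for start in range(m - k):                       # case 1 overrides case 2
            if pattern[start:start + k] == suffix and (
                start == 0 or pattern[start - 1] != preceding
            ):
                shift = (m - k) - start                  # last hit is the rightmost
        d2[k] = shift
    return d2


def boyer_moore_search(text, pattern):
    """Return (index of the first match or -1, number of character comparisons)."""
    n, m = len(text), len(pattern)
    t1, d2 = bad_symbol_table(pattern), good_suffix_table(pattern)
    comparisons = 0
    i = m - 1                                            # Step 3: initial alignment
    while i <= n - 1:                                    # Step 4
        k = 0
        while k < m:                                     # compare right to left
            comparisons += 1
            if pattern[m - 1 - k] != text[i - k]:
                break
            k += 1
        if k == m:
            return i - m + 1, comparisons
        d1 = max(t1.get(text[i - k], m) - k, 1)          # bad-symbol shift
        i += d1 if k == 0 else max(d1, d2[k])
    return -1, comparisons


print(boyer_moore_search("BESS_NNEW_ABOUT_BAOBABS", "BAOBAB"))  # (16, 12)
```

When k = 0 the expression for d1 reduces to t1(c), so the first bullet of Step 4 is covered by the same line of code.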

Page 19: Preprocessing Application: String Matching By Rong Ge COSC3100.


Example of Boyer-Moore alg. application

Table t1 (_ denotes the space character):

  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z _
  1 2 6 6 6 6 6 6 6 6 6 6 6 6 3 6 6 6 6 6 6 6 6 6 6 6 6

Table d2:

  k  pattern  d2
  1  BAOBAB   2
  2  BAOBAB   5
  3  BAOBAB   5
  4  BAOBAB   5
  5  BAOBAB   5

  B E S S _ N N E W _ A B O U T _ B A O B A B S
  B A O B A B                                       0 matches: shift d1 = t1(N) = 6
              B A O B A B                           2 matches (k=2): d1 = t1(_)-2 = 4, shift d2(2) = 5
                        B A O B A B                 1 match (k=1): shift d1 = t1(_)-1 = 5, d2(1) = 2
                                  B A O B A B       (success)

# comparisons: 12 with BM, 13 with Horspool
# alignments: 4 with BM, 5 with Horspool

Page 20: Preprocessing Application: String Matching By Rong Ge COSC3100.

Summary

Preprocessing - preprocess the input (or its part) to store some info to be used later in solving the problem

With preprocessing, Horspool’s and Boyer-Moore string matching algorithms are fast
• Comparison proceeds from right to left on P


Page 21: Preprocessing Application: String Matching By Rong Ge COSC3100.

Finishing Counting Sort

Alg. CountingSort(A[0,…,n-1], lb, ub)
Input: an array A of integers between lb and ub (lb <= ub)
Output: Array B[0,…,n-1] of A’s elements sorted in nondescending order
  for j ← 0 to ub-lb do            //init freq to 0
    C[j] ← 0
  for i ← 0 to n-1 do              //count freq
    C[A[i]-lb] ← C[A[i]-lb] + 1
  for j ← 1 to ub-lb do            //C[j]-1 = last pos. of value lb+j in B
    C[j] ← C[j-1] + C[j]
  for i ← n-1 downto 0 do          //copy from A to B
    j ← A[i] - lb
    B[C[j]-1] ← A[i]
    C[j] ← C[j] - 1
  return B

E.g. A = {3, 2, 4, 1, 4, 1}
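The pseudocode translates almost line for line into Python. The sketch below is an illustration only; the function name `counting_sort` is an assumption, and the bounds lb = 1 and ub = 4 used for the slide's example array are chosen here because the slide does not state them.

```python
def counting_sort(a, lb, ub):
    """Sort integers in the range [lb, ub] and return a new list."""
    counts = [0] * (ub - lb + 1)           # init freq to 0
    for x in a:                            # count freq
        counts[x - lb] += 1
    for j in range(1, ub - lb + 1):        # counts[j] = number of elements <= lb + j
        counts[j] += counts[j - 1]
    b = [None] * len(a)
    for x in reversed(a):                  # right-to-left pass, as in the pseudocode
        j = x - lb
        b[counts[j] - 1] = x               # place x at the last free slot for its value
        counts[j] -= 1
    return b


print(counting_sort([3, 2, 4, 1, 4, 1], 1, 4))  # [1, 1, 2, 3, 4, 4]
```

Walking A from right to left while filling each value's slots from the back keeps equal elements in their original relative order.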
