Preprocessing
Application: String Matching
By Rong Ge
COSC3100
A. Levitin “Introduction to the Design & Analysis of Algorithms,” 3rd ed., Ch. 7 ©2012 Pearson Education, Inc. Upper Saddle River, NJ. All Rights Reserved.
Space-for-time tradeoffs
Two varieties of space-for-time algorithms:
• input enhancement: preprocess the input (or part of it) to store some information to be used later in solving the problem
  – counting sorts (last lecture)
  – string searching algorithms (today)
• prestructuring: preprocess the input to make accessing its elements easier
  – hashing
Review: String searching by brute force
String matching problem: given
• a pattern P: a string of m characters to search for
• a text T: a (long) string of n >= m characters to search in
return whether P occurs in T and, if so, the position of P in T
Brute-force algorithm revisited
Step 1 Align pattern P at the beginning of text T
Step 2 Scan from left to right, comparing each character of P to the corresponding character in T, until either all characters of P are found to match (successful search) or a mismatch is detected
Step 3 While a mismatch is detected and T is not yet exhausted, realign P one position to the right and repeat Step 2
# comparisons: O(mn) in the worst case
Are there more efficient algorithms, with O(n) time?
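The three brute-force steps above can be sketched as follows. This is a minimal illustration, not code from the slides; the function name `brute_force_match` and the choice to return the comparison count alongside the position are assumptions made for this example.

```python
def brute_force_match(text, pattern):
    """Return (position, comparisons); position is -1 if pattern is absent.

    Hypothetical helper for illustration: tries every alignment of
    pattern against text and compares characters left to right.
    """
    n, m = len(text), len(pattern)
    comparisons = 0
    for i in range(n - m + 1):          # Steps 1 and 3: each alignment in turn
        j = 0
        while j < m:                    # Step 2: scan left to right
            comparisons += 1
            if text[i + j] != pattern[j]:
                break                   # mismatch: move to the next alignment
            j += 1
        if j == m:                      # all m characters matched
            return i, comparisons
    return -1, comparisons


if __name__ == "__main__":
    print(brute_force_match("ABCABD", "ABD"))  # (3, 8)
```

Counting comparisons this way makes it easy to compare against the preprocessing-based algorithms on the same inputs.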
String searching by preprocessing
Several string searching algorithms are based on the input-enhancement idea of preprocessing the pattern P
Horspool’s algorithm
1. preprocess P to build one shift table tb
2. align P against the beginning of T
3. repeat until a matching substring is found or T ends:
   1. compare the corresponding characters of P and T from right to left
   2. if a mismatch occurs, shift P to the right by tb(c), where c is the character of T aligned with the last character of P
Boyer-Moore algorithm improves Horspool’s algorithm by using two tables: a bad-symbol table and a good-suffix table
Horspool’s Algorithm
Two key points:
• preprocesses P to generate a shift table that determines how far to shift P when a mismatch occurs
• always shifts by the shift table’s entry for c, the character of T aligned with the last character of P
How far to shift for each case?
Focus on the rightmost character of T in the alignment:
Case I: character C != B (mismatch), and C does not occur in P
.....C...................... Text (C not in pattern)
BAOBAB                       Pattern
Case II: character O != B (mismatch), but O occurs in P once
.....O...................... (O occurs once in pattern)
BAOBAB
Character A != B (mismatch), and A occurs in P more than once
.....A...................... (A occurs twice in pattern)
BAOBAB
Case III: character R = R (match), but R doesn’t appear among the first m-1 characters of P
...MER......................
LEADER
Case IV: character B = B (match), and B occurs among the first m-1 characters of P
.....B......................
BAOBAB
Your answers?
Build a shift table for a given pattern P
Shift table:
• A 1-D array giving the shift size for each character c when c is the rightmost character of T in the alignment
• Array size: number of characters in the alphabet
  – E.g.: 26 if all characters in T are uppercase English letters
The array entry for a character c is computed as follows:
• Cases I & III: P’s length m
• Cases II & IV: distance from c’s rightmost occurrence among P’s first m-1 characters to P’s right end
Horspool’s alg. populates the shift table in two steps:
1. Initialize each array entry to P’s length
2. Scan the first m-1 characters of P from left to right and update the array entries
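The two-step construction above can be sketched as follows. This is an illustrative sketch only: the name `shift_table`, the use of a dictionary instead of a fixed array, and the default alphabet of uppercase English letters are assumptions made for this example.

```python
import string


def shift_table(pattern, alphabet=string.ascii_uppercase):
    """Build a Horspool shift table as a dict {character: shift size}.

    Hypothetical helper for illustration; `alphabet` is assumed to hold
    every character that may appear in the text.
    """
    m = len(pattern)
    # Step 1: every entry starts at the pattern length (cases I and III).
    table = {c: m for c in alphabet}
    # Step 2: scan the first m-1 characters left to right; later
    # occurrences overwrite earlier ones, so the rightmost one wins
    # (cases II and IV).
    for i in range(m - 1):
        table[pattern[i]] = m - 1 - i
    return table


if __name__ == "__main__":
    t = shift_table("BAOBAB")
    print(t["A"], t["B"], t["O"], t["C"])  # 1 2 3 6
```

Scanning left to right is what makes the rightmost occurrence decide the entry, since each update replaces the larger distance recorded earlier.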
Build shift table
Example: Pattern: BAOBAB
• Alphabet: uppercase English letters
Preprocessing
• Initialize all array entries to P’s length, which is 6 for BAOBAB
  - Takes care of cases I and III
  - The shift table is indexed by the characters of the text and pattern alphabet
  Initial shift table
  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
  6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
• Scan P from left to right over the first m-1 characters and update the table (brackets mark the character being scanned)
  0. [B]AOBAB: distance = 5 (B’s entry)
  1. B[A]OBAB: distance = 4 (A’s entry)
  2. BA[O]BAB: distance = 3 (O’s entry)
  3. BAO[B]AB: distance = 2 (B’s entry)
  4. BAOB[A]B: distance = 1 (A’s entry)
  Skip the last character of P
  - Takes care of cases II & IV
Resulting shift table
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
1 2 6 6 6 6 6 6 6 6 6 6 6 6 3 6 6 6 6 6 6 6 6 6 6 6
Example of Horspool’s alg. application
Alg: keep aligning P against T and comparing from right to left, shifting if mismatched.
Example:
BARD LOVED BANANAS      (Text)
BAOBAB                  (shift t(L)=6)
      BAOBAB            (shift t(B)=2)
        BAOBAB          (shift t(N)=6)
              BAOBAB    (T out of bounds, unsuccessful search)
# total comparisons: 4
1st alignment: 1, 2nd alignment: 2, 3rd alignment: 1, 4th alignment: 0
How many comparisons with the brute force alg?
Shift table (_ denotes the space character):
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z _
1 2 6 6 6 6 6 6 6 6 6 6 6 6 3 6 6 6 6 6 6 6 6 6 6 6 6
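The search traced above can be sketched as follows. This is a minimal illustration under stated assumptions: the names `horspool_shift_table` and `horspool_search` are hypothetical, the table is a dictionary whose missing characters default to the pattern length, and the comparison count is returned only so the trace can be checked.

```python
def horspool_shift_table(pattern):
    """Shift sizes for characters among the first m-1 characters of pattern.

    Hypothetical helper: characters absent from the dict shift by m.
    """
    m = len(pattern)
    return {pattern[i]: m - 1 - i for i in range(m - 1)}


def horspool_search(text, pattern):
    """Return (position, comparisons); position is -1 if there is no match."""
    m, n = len(pattern), len(text)
    table = horspool_shift_table(pattern)
    comparisons = 0
    i = m - 1                       # text index aligned with pattern's last char
    while i < n:
        k = 0                       # number of characters matched so far
        while k < m:                # compare right to left
            comparisons += 1
            if pattern[m - 1 - k] != text[i - k]:
                break
            k += 1
        if k == m:
            return i - m + 1, comparisons
        # Shift by the entry for the text character aligned with
        # the last character of the pattern.
        i += table.get(text[i], m)
    return -1, comparisons


if __name__ == "__main__":
    print(horspool_search("BARD LOVED BANANAS", "BAOBAB"))  # (-1, 4)
```

Note that the shift always uses `text[i]`, the character under the pattern's last position, regardless of where the mismatch occurred.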
Boyer-Moore algorithm
Based on the same two ideas:
• comparing P to T from right to left for each alignment
• precomputing shift sizes, here in two tables
  – bad-symbol table t1 indicates how much to shift based on the text’s character causing a mismatch
    t1 is built the same way as the shift table in Horspool’s alg.
  – good-suffix table d2 indicates how much to shift based on the matched part (suffix) of the pattern
Scenarios in string matching
SI: the rightmost character of P doesn’t match; the BM algorithm acts as Horspool’s. Shift P to the right by t1(c)
[Figure: text character c is aligned with the rightmost pattern character x, and c != x]
SII: k characters are matched, and then a mismatch occurs, 0 < k < m
[Figure: the last k characters of the pattern match the text (k > 0 matches); the text character c to their left mismatches the pattern character x aligned with it]
How much do we shift the pattern to the right?
Prefix and suffix of strings
Prefix of a string P: a substring of P that occurs at the beginning of P.
E.g. P: BANANA
Prefixes: B, BA, BAN, BANA, BANAN, BANANA
Suffix of a string P: a substring of P that occurs at the end of P.
E.g. P: BANANA
Suffixes: A, NA, ANA, NANA, ANANA, BANANA
Prefix  B  BA  BAN  BANA  BANAN  BANANA
Length  1  2   3    4     5      6
Suffix  A  NA  ANA  NANA  ANANA  BANANA
Length  1  2   3    4     5      6
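As a quick check of the definitions, the prefixes and suffixes of a string can be listed by length with slicing. The helper names `prefixes` and `suffixes` are made up for this sketch.

```python
def prefixes(p):
    """All non-empty prefixes of p, ordered by length 1..len(p)."""
    return [p[:k] for k in range(1, len(p) + 1)]


def suffixes(p):
    """All non-empty suffixes of p, ordered by length 1..len(p)."""
    return [p[-k:] for k in range(1, len(p) + 1)]


if __name__ == "__main__":
    print(prefixes("BANANA"))  # ['B', 'BA', 'BAN', 'BANA', 'BANAN', 'BANANA']
    print(suffixes("BANANA"))  # ['A', 'NA', 'ANA', 'NANA', 'ANANA', 'BANANA']
```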
Good-suffix table d2 for pattern
• applied after a suffix of P matches T for an alignment, and then a mismatch occurs
• k is the length of the matched suffix of P, 0 < k < m
Good-suffix table d2:
• a 1-D array of size m; the entry indexed by 0 is not used
• each entry d2(k) indicates the shift size for the good (matched) suffix of length k
Build good-suffix table d2 for the pattern
For each entry indexed by k, where k is in [1, m-1] and corresponds to the suffix of P of size k, set d2(k) by trying three cases in order:
1. the distance between the matched suffix of size k and its rightmost other occurrence in the pattern that is not preceded by the same character as the suffix
   E.g.: P: CABABA d2(k=1) = 4 // Suffix of size 1: A; the rightmost other A that is not preceded by B is the second character in P.
2. if there is no occurrence in case 1: the distance between the longest part of the k-character suffix that matches a prefix of P and that prefix (i.e., m minus the length of that prefix)
   E.g.: P: ANAN d2(k=3) = 2 // Suffix of size 3: NAN; the longest part of NAN that matches a prefix of P is AN
3. if there is no match in case 2: m, the length of P
   E.g.: P: ANAN d2(k=1) = 4 // Suffix of size 1: N
Exercise
Build the good suffix table for pattern WOWWOW
k  pattern  d2  case
1  WOWWOW   2   Case 1
2  WOWWOW   5   Case 2: longest part of OW with a matched prefix is W
3  WOWWOW   3   Case 1
4  WOWWOW   3   Case 2: longest part of WWOW with a matched prefix is WOW
5  WOWWOW   3   Case 2: longest part of OWWOW with a matched prefix is WOW
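The three cases for d2(k) can be checked mechanically with a direct, unoptimized sketch. The function name `good_suffix_table` is hypothetical, and this straightforward version simply tries the cases in order rather than using the faster constructions found in production implementations.

```python
def good_suffix_table(pattern):
    """Return a list d2 where d2[k] is the good-suffix shift for 1 <= k < m.

    Hypothetical helper for illustration; d2[0] is unused and set to None.
    """
    m = len(pattern)
    d2 = [None] * m
    for k in range(1, m):
        start = m - k                   # where the k-character suffix begins
        suffix = pattern[start:]
        prev = pattern[start - 1]       # character preceding the suffix
        shift = None
        # Case 1: rightmost other occurrence of the suffix that is not
        # preceded by the same character (or that starts the pattern).
        for i in range(start - 1, -1, -1):
            if pattern[i:i + k] == suffix and (i == 0 or pattern[i - 1] != prev):
                shift = start - i
                break
        # Case 2: longest shorter suffix that is also a prefix of the pattern.
        if shift is None:
            for length in range(k - 1, 0, -1):
                if pattern[:length] == pattern[m - length:]:
                    shift = m - length
                    break
        # Case 3: nothing reusable, shift by the whole pattern length.
        d2[k] = m if shift is None else shift
    return d2


if __name__ == "__main__":
    print(good_suffix_table("WOWWOW")[1:])  # [2, 5, 3, 3, 3]
```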
BM alg. for scenario II in string matching
After successfully matching 0 < k < m characters, the algorithm shifts the pattern to the right by
d = max {d1, d2(k)}
where d1 = max{t1(c) - k, 1} is the bad-symbol shift
and d2(k) is the good-suffix shift
Boyer-Moore Algorithm (cont.)
Step 1 Fill in the bad-symbol shift table
Step 2 Fill in the good-suffix shift table
Step 3 Align the pattern against the beginning of the text
Step 4 Repeat until a matching substring is found or the text ends: compare the corresponding characters right to left.
• If no characters match (k = 0), retrieve entry t1(c) from the bad-symbol table for the text’s character c causing the mismatch and shift the pattern to the right by t1(c).
• If 0 < k < m characters are matched, retrieve entry t1(c) from the bad-symbol table for the text’s character c causing the mismatch and entry d2(k) from the good-suffix table, and shift the pattern to the right by
  d = max {d1, d2(k)}, where d1 = max{t1(c) - k, 1}.
Example of Boyer-Moore alg. application
Table t1 (_ denotes the space character):
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z _
1 2 6 6 6 6 6 6 6 6 6 6 6 6 3 6 6 6 6 6 6 6 6 6 6 6 6
Table d2:
k  pattern  d2
1  BAOBAB   2
2  BAOBAB   5
3  BAOBAB   5
4  BAOBAB   5
5  BAOBAB   5
B E S S _ N N E W _ A B O U T _ B A O B A B S
B A O B A B
0 matches: shift d1 = t1(N) = 6
            B A O B A B
2 matches: k=2, d1 = t1(_)-2 = 4, d2(2) = 5, shift max{4, 5} = 5
                      B A O B A B
1 match: k=1, d1 = t1(_)-1 = 5, d2(1) = 2, shift max{5, 2} = 5
                                B A O B A B    (success)
# comparisons: 12 with BM, 13 with Horspool
# alignments: 4 with BM, 5 with Horspool
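Steps 1 to 4 of the Boyer-Moore algorithm can be sketched as follows and checked against the trace above. This is an illustrative sketch, not a tuned implementation: the names `bad_symbol_table`, `good_suffix_table`, and `boyer_moore_search` are assumptions, the good-suffix table is built by a direct search over the three cases, and the comparison count is returned only for checking.

```python
def bad_symbol_table(pattern):
    """t1 as a dict; characters missing from it shift by len(pattern)."""
    m = len(pattern)
    return {pattern[i]: m - 1 - i for i in range(m - 1)}


def good_suffix_table(pattern):
    """d2 as a list indexed by k (index 0 unused), built case by case."""
    m = len(pattern)
    d2 = [None] * m
    for k in range(1, m):
        start, shift = m - k, None
        suffix, prev = pattern[start:], pattern[start - 1]
        for i in range(start - 1, -1, -1):          # case 1
            if pattern[i:i + k] == suffix and (i == 0 or pattern[i - 1] != prev):
                shift = start - i
                break
        if shift is None:                           # case 2
            for length in range(k - 1, 0, -1):
                if pattern[:length] == pattern[m - length:]:
                    shift = m - length
                    break
        d2[k] = m if shift is None else shift       # case 3 falls back to m
    return d2


def boyer_moore_search(text, pattern):
    """Return (position, comparisons); position is -1 if there is no match."""
    m, n = len(pattern), len(text)
    t1 = bad_symbol_table(pattern)                  # Step 1
    d2 = good_suffix_table(pattern)                 # Step 2
    comparisons = 0
    i = m - 1                                       # Step 3: first alignment
    while i < n:                                    # Step 4
        k = 0
        while k < m:                                # compare right to left
            comparisons += 1
            if pattern[m - 1 - k] != text[i - k]:
                break
            k += 1
        if k == m:
            return i - m + 1, comparisons
        c = text[i - k]                             # character causing the mismatch
        d1 = max(t1.get(c, m) - k, 1)               # bad-symbol shift
        i += d1 if k == 0 else max(d1, d2[k])
    return -1, comparisons


if __name__ == "__main__":
    print(boyer_moore_search("BESS_NNEW_ABOUT_BAOBABS", "BAOBAB"))  # (16, 12)
```

Unlike the Horspool sketch idea of always consulting the character under the pattern's last position, the bad-symbol lookup here uses the text character where the mismatch actually happened, which is why k is subtracted from t1(c).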
Summary
Preprocessing: preprocess the input (or part of it) to store some information to be used later in solving the problem
With preprocessing, Horspool’s and BM string matching algorithms are fast
• Comparison goes from right to left on P
• Precomputed shift tables let a mismatch move P by more than one position
Finishing Counting Sort
Alg. CountingSort(A[0,…,n-1], lb, ub)
Input: an array A of integers between lb and ub (lb <= ub)
Output: Array B[0,…,n-1] of A’s elements sorted in nondecreasing order
for j ← 0 to ub-lb do            //init freq to 0
    C[j] ← 0
for i ← 0 to n-1 do              //count freq
    C[A[i]-lb] ← C[A[i]-lb] + 1
for j ← 1 to ub-lb do            //calc. last pos. of lb+j-1
    C[j] ← C[j-1] + C[j]
for i ← n-1 downto 0 do          //copy from A to B
    j ← A[i] – lb
    B[C[j]–1] ← A[i]
    C[j] ← C[j] – 1
E.g. A = {3, 2, 4, 1, 4, 1}
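The counting sort pseudocode can be turned into runnable form as in the sketch below. The function name `counting_sort` and the decision to return a new list are choices made for this example; the logic follows the four loops of the pseudocode.

```python
def counting_sort(a, lb, ub):
    """Sort a list of integers known to lie in [lb, ub]; returns a new list."""
    counts = [0] * (ub - lb + 1)        # init frequencies to 0
    for x in a:                         # count frequencies
        counts[x - lb] += 1
    for j in range(1, len(counts)):     # running totals: elements <= lb + j
        counts[j] += counts[j - 1]
    b = [None] * len(a)
    for x in reversed(a):               # right-to-left pass keeps the sort stable
        j = x - lb
        counts[j] -= 1                  # same slot as B[C[j]-1] in the pseudocode
        b[counts[j]] = x
    return b


if __name__ == "__main__":
    print(counting_sort([3, 2, 4, 1, 4, 1], 1, 4))  # [1, 1, 2, 3, 4, 4]
```

The extra array of ub - lb + 1 counters is the space paid for sorting without any key comparisons.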