Post on 03-Jan-2016
UMass Lowell Computer Science 91.503
Analysis of Algorithms Prof. Karen Daniels
Fall, 2008
Tuesday, 12/2/08
String Matching Algorithms
Chapter 32
Chapter Dependencies
Ch 32 String Matching
Automata
You're responsible for material in Sections 32.1-32.4 of this chapter.
String Matching Algorithms
Motivation & Basics
String Matching Problem
source: 91.503 textbook Cormen et al.
Motivations: text editing, pattern matching in DNA sequences
Text: array $T[1..n]$
Pattern: array $P[1..m]$, with $m \le n$
Array Element: character from finite alphabet $\Sigma$
Pattern $P$ occurs with shift $s$ in $T$ if $P[1..m] = T[s+1..s+m]$, where $0 \le s \le n-m$
32.1
String Matching Algorithms
Naive Algorithm: worst-case running time in $O((n-m+1)m)$
Rabin-Karp: worst-case running time in $O((n-m+1)m)$; better than this on average and in practice
Finite Automaton-Based: worst-case running time in $O(n + m|\Sigma|)$
Knuth-Morris-Pratt: worst-case running time in $O(n + m)$
Notation & Terminology
$\Sigma^*$ = set of all finite-length strings formed using characters from alphabet $\Sigma$
Empty string: $\varepsilon$
$|x|$ = length of string $x$
$w$ is a prefix of $x$: $w \sqsubset x$ (example: ab $\sqsubset$ abcca)
$w$ is a suffix of $x$: $w \sqsupset x$ (example: cca $\sqsupset$ abcca)
prefix, suffix are transitive
Overlapping Suffix Lemma
If $x$ and $y$ are both suffixes of $z$ ($x \sqsupset z$ and $y \sqsupset z$), then: $|x| \le |y|$ implies $x \sqsupset y$; $|x| \ge |y|$ implies $y \sqsupset x$; $|x| = |y|$ implies $x = y$.
32.1, 32.3
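The prefix and suffix relations can be checked directly with Python's built-in string methods. The sketch below is only an illustration: the helper names `is_prefix` and `is_suffix` are made up here, and the last check exercises one case of the overlapping-suffix property (when two strings are both suffixes of the same string, the shorter one is a suffix of the longer one).

```python
def is_prefix(w: str, x: str) -> bool:
    """Return True if w is a prefix of x."""
    return x.startswith(w)


def is_suffix(w: str, x: str) -> bool:
    """Return True if w is a suffix of x."""
    return x.endswith(w)


# The two examples from the notation slide.
print(is_prefix("ab", "abcca"))   # True
print(is_suffix("cca", "abcca"))  # True

# x and y are both suffixes of z and |x| <= |y|,
# so x is also a suffix of y.
z, x, y = "abcca", "ca", "cca"
print(is_suffix(x, z) and is_suffix(y, z) and is_suffix(x, y))  # True
```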
String Matching Algorithms
Naive Algorithm
Naive String Matching
worst-case running time is in $\Theta((n-m+1)m)$
32.4
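A minimal sketch of the naive approach, assuming 0-based Python strings, so that a returned shift `s` corresponds to the slide's $P[1..m] = T[s+1..s+m]$. The function name and the sample strings are illustrative choices, not taken from the slides.

```python
def naive_string_matcher(text: str, pattern: str) -> list:
    """Return every shift s at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    # Try all n - m + 1 candidate shifts; each slice comparison
    # costs up to m character comparisons.
    for s in range(n - m + 1):
        if text[s:s + m] == pattern:
            shifts.append(s)
    return shifts


print(naive_string_matcher("abcabaabcabac", "abaa"))  # [3]
```

The two nested levels of work (shifts, then characters per shift) are where the $(n-m+1)m$ bound comes from.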
String Matching Algorithms
Rabin-Karp
Rabin-Karp Algorithm
Assume each character is a digit in radix-$d$ notation (e.g. $d = 10$)
$p$ = decimal value of pattern
$t_s$ = decimal value of substring $T[s+1..s+m]$ for $s = 0, 1, \ldots, n-m$
Strategy:
compute $p$ in $O(m)$ time (which is in $O(n)$)
compute all $t_s$ values in a total of $O(n)$ time
find all valid shifts $s$ in $O(n)$ time by comparing $p$ with each $t_s$
Compute $p$ in $O(m)$ time using Horner's rule:
$p = P[m] + d(P[m-1] + d(P[m-2] + \cdots + d(P[2] + dP[1])))$
Compute $t_0$ similarly from $T[1..m]$ in $O(m)$ time
Compute the remaining $t_s$ values in $O(n-m)$ time:
$t_{s+1} = d(t_s - d^{m-1}T[s+1]) + T[s+m+1]$
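Horner's rule and the window update can be sketched as follows, assuming decimal digit strings and exact (unreduced) integer arithmetic. The function names are hypothetical, and indices are 0-based, so the slide's $T[s+1]$ is `text[s]` and $T[s+m+1]$ is `text[s + m]`.

```python
def horner_value(digits: str, d: int = 10) -> int:
    """Evaluate a digit string in radix d using Horner's rule."""
    value = 0
    for ch in digits:
        value = d * value + int(ch)
    return value


def window_values(text: str, m: int, d: int = 10) -> list:
    """Return t_0, ..., t_{n-m}: the value of every length-m window."""
    n = len(text)
    if m < 1 or m > n:
        return []
    high = d ** (m - 1)            # weight of the high-order digit
    t = horner_value(text[:m], d)  # t_0 computed in O(m) time
    values = [t]
    for s in range(n - m):
        # Drop the high-order digit, shift left, append the new digit.
        t = d * (t - high * int(text[s])) + int(text[s + m])
        values.append(t)
    return values


print(horner_value("31415"))      # 31415
print(window_values("31415", 3))  # [314, 141, 415]
```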
Rabin-Karp Algorithm
$p$ and $t_s$ may be too large to work with conveniently, so compute them mod a modulus $q$
32.5
Rabin-Karp Algorithm (continued)
Example: $p = 31415$
Spurious hit: a window with $t_s \equiv p \pmod{q}$ that does not actually match the pattern
$t_{s+1} = d(t_s - d^{m-1}T[s+1]) + T[s+m+1]$, with the arithmetic done mod $q$ and $d^{m-1} \bmod q$ precomputed
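A small numeric check of the modular version for $p = 31415$. The modulus $q = 13$, the non-matching window 67399, and the follow-on window 14152 are assumptions chosen for illustration; only the pattern value comes from the slide.

```python
d, q, m = 10, 13, 5        # q = 13 is an assumed modulus

p = 31415 % q              # pattern value reduced mod q
h = d ** (m - 1) % q       # d^(m-1) mod q, precomputed once
print(p, h)                # 7 3

# A different window with the same residue: a spurious hit.
print(67399 % q)           # 7

# Slide the window from 31415 to 14152: drop the leading 3, append 2.
t_next = (d * (p - 3 * h) + 2) % q
print(t_next, 14152 % q)   # 8 8
```

Python's `%` returns a non-negative result for a positive modulus, so the intermediate negative value needs no extra correction here.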
Rabin-Karp Algorithm (continued)
worst-case running time is in $\Theta((n-m+1)m)$
Annotations on the textbook pseudocode:
d is radix; q is modulus
high-order digit position for m-digit window
Preprocessing
Matching loop invariant: whenever line 10 is executed, $t_s = T[s+1..s+m] \bmod q$
Try all possible shifts
rule out spurious hit
Running-time labels on the pseudocode: $\Theta(m)$ (which is in $O(n)$), $\Theta(m)$, $\Theta(m)$, $\Theta((n-m+1)m)$
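The annotated steps (preprocessing, trying all shifts, ruling out spurious hits) can be sketched as below. This is a simplified illustration rather than the textbook's pseudocode: it assumes decimal digit strings, 0-based shifts, and default parameters `d = 10` and `q = 13` chosen only for the example.

```python
def rabin_karp_matcher(text: str, pattern: str, d: int = 10, q: int = 13) -> list:
    """Return all valid shifts of a digit pattern in a digit text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)  # high-order digit weight, reduced mod q
    p = 0
    t = 0
    # Preprocessing: Horner's rule mod q for the pattern and first window.
    for i in range(m):
        p = (d * p + int(pattern[i])) % q
        t = (d * t + int(text[i])) % q
    shifts = []
    for s in range(n - m + 1):  # try all possible shifts
        # Equal residues are only a candidate; the explicit comparison
        # rules out spurious hits.
        if p == t and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:
            # Rolling update of the window value, kept reduced mod q.
            t = (d * (t - int(text[s]) * h) + int(text[s + m])) % q
    return shifts


print(rabin_karp_matcher("2359023141526739921", "31415"))  # [6]
```

In this sample text the window 67399 has the same residue as the pattern mod 13, so it reaches the explicit comparison and is rejected there.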
Rabin-Karp Algorithm (continued)
average-case running time is in $\Theta(n+m)$
Assume reducing mod $q$ is like a random mapping from $\Sigma^*$ to $\mathbb{Z}_q$ ($\Sigma^*$ = set of all finite-length strings formed from $\Sigma$)
Estimate: (chance that $t_s \equiv p \pmod{q}$) = $1/q$, so # spurious hits is in $O(n/q)$
Expected matching time = $O(n) + O(m(v + n/q))$ ($v$ = # valid shifts)
$O(n)$ term: preprocessing + $t_s$ updates
$O(m(v + n/q))$ term: explicit matching comparisons
If $v$ is in $O(1)$ and $q \ge m$, the expected matching time is in $O(n + m)$
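The simplification of the expected matching time can be written out step by step. This is a sketch under the slide's assumptions $v \in O(1)$ and $q \ge m$ (so that $n/q \le n/m$):

```latex
\begin{align*}
O(n) + O\!\left(m\left(v + \frac{n}{q}\right)\right)
  &= O(n) + O\!\left(m\left(1 + \frac{n}{m}\right)\right) \\
  &= O(n) + O(m + n) \\
  &= O(n + m)
\end{align*}
```

Since $m \le n$, this is also $O(n)$.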
String Matching Algorithms
Finite Automata
Strategy: Build automaton for pattern, then examine each text character once.
worst-case running time is in $\Theta(n)$ + automaton creation time
32.6
String-Matching Automaton
Pattern = P = ababaca
Automaton accepts strings ending in P
32.7
String-Matching Automaton
Suffix Function for P:
$\sigma(x)$ = length of longest prefix of $P$ that is a suffix of $x$
$\sigma(x) = \max\{k : P_k \sqsupset x\}$
Automaton's operational invariant at each step: keeps track of longest pattern prefix that is a suffix of what has been read so far
32.3, 32.4
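The suffix function can be computed directly from its definition by trying prefix lengths from longest to shortest. The sketch below is a brute-force illustration with a hypothetical function name, not an efficient method.

```python
def suffix_function(pattern: str, x: str) -> int:
    """Length of the longest prefix of pattern that is a suffix of x."""
    for k in range(min(len(pattern), len(x)), -1, -1):
        if x.endswith(pattern[:k]):
            return k
    return 0  # not reached: the empty prefix (k = 0) always matches


print(suffix_function("ab", "ccaca"))  # 1
print(suffix_function("ab", "ccab"))   # 2
print(suffix_function("ab", ""))       # 0
```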
String-Matching Automaton
Simulate behavior of string-matching automaton that finds occurrences of pattern $P$ of length $m$ in $T[1..n]$
worst-case running time of matching is in $\Theta(n)$, assuming automaton has already been created
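A minimal sketch of the simulation, assuming the transition function is supplied as a dictionary keyed by (state, character). The table `delta_ab` is hand-written here for the tiny pattern "ab" over the alphabet {a, b}; both the representation and the example are illustrative assumptions.

```python
def finite_automaton_matcher(text: str, delta: dict, m: int) -> list:
    """Run the automaton over text; report shifts where state m is reached."""
    q = 0  # current state = number of pattern characters matched
    shifts = []
    for i, ch in enumerate(text):
        q = delta[(q, ch)]  # exactly one transition per text character
        if q == m:
            shifts.append(i - m + 1)
    return shifts


# Hand-built transition table for pattern "ab" (states 0, 1, 2).
delta_ab = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 0,
}

print(finite_automaton_matcher("abaabab", delta_ab, 2))  # [0, 3, 5]
```

Each text character costs one dictionary lookup, which is why matching is linear in $n$ once the table exists.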
String-Matching Automaton (continued)
Correctness of matching procedure...
Textbook references on this slide: 32.4, 32.3
32.3: $\sigma(xa) = \sigma(P_{\sigma(x)}a)$ (to be proved next…)
String-Matching Automaton (continued)
Correctness of matching procedure...
Textbook references on this slide: 32.2, 32.8
String-Matching Automaton (continued)
Correctness of matching procedure...
Textbook references on this slide: 32.3, 32.9, 32.2, 32.1
String-Matching Automaton (continued)
Correctness of matching procedure...
Textbook references on this slide: 32.4, 32.3
32.3: $\sigma(xa) = \sigma(P_{\sigma(x)}a)$
String-Matching Automaton (continued)
worst-case running time of automaton creation is in $O(m^3|\Sigma|)$; can be improved to $O(m|\Sigma|)$
worst-case running time of entire string-matching strategy is in $O(m^3|\Sigma|) + \Theta(n)$
(automaton creation time + pattern matching time)
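The cubic creation time comes from a direct, definition-based construction: for every state and every alphabet character, search downward for the longest pattern prefix that is a suffix of what would have been read. The sketch below assumes a dictionary representation and uses an illustrative three-letter alphabet; the function name is hypothetical.

```python
def compute_transition_function(pattern: str, alphabet: str) -> dict:
    """Build delta[(q, a)] for all states q = 0..m and characters a."""
    m = len(pattern)
    delta = {}
    for q in range(m + 1):
        for a in alphabet:
            read = pattern[:q] + a  # P_q followed by a
            k = min(m, q + 1)
            # Shrink k until P_k is a suffix of P_q a; k = 0 always works.
            while not read.endswith(pattern[:k]):
                k -= 1
            delta[(q, a)] = k
    return delta


delta = compute_transition_function("ababaca", "abc")
print(delta[(5, "b")])  # 4
print(delta[(5, "c")])  # 6
print(delta[(7, "a")])  # 1
```

The three nested costs (states, candidate values of `k`, and the suffix comparison) each contribute a factor of up to about $m$, and the loop over characters contributes $|\Sigma|$.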
String Matching Algorithms
Knuth-Morris-Pratt
Knuth-Morris-Pratt Overview
Achieve $\Theta(n+m)$ time by shortening automaton preprocessing time below $O(m|\Sigma|)$
Approach:
don't precompute automaton's transition function
calculate enough transition data "on-the-fly"
obtain data via "alphabet-independent" pattern preprocessing
pattern preprocessing compares pattern against shifts of itself
Knuth-Morris-Pratt Algorithm
determine how pattern matches against itself
32.10
Knuth-Morris-Pratt Algorithm
Prefix function $\pi$ shows how pattern matches against itself
$\pi(q)$ is length of longest prefix of $P$ that is a proper suffix of $P_q$
$\pi(q) = \max\{k : k < q \text{ and } P_k \sqsupset P_q\}$
Equivalently, what is largest $k < q$ such that $P_k \sqsupset P_q$?
Example:
32.5
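A sketch of computing the prefix function by matching the pattern against shifts of itself. It assumes a list `pi` indexed from 1 to $m$ (entry 0 is unused) so that `pi[q]` lines up with $\pi(q)$; the example pattern is the one used earlier for the automaton.

```python
def compute_prefix_function(pattern: str) -> list:
    """Return pi with pi[q] = pi(q) for q = 1..m (pi[0] is unused)."""
    m = len(pattern)
    pi = [0] * (m + 1)
    k = 0  # length of the currently matched prefix
    for q in range(2, m + 1):
        # Fall back to shorter prefixes while the next character differs.
        while k > 0 and pattern[k] != pattern[q - 1]:
            k = pi[k]
        if pattern[k] == pattern[q - 1]:
            k += 1
        pi[q] = k
    return pi


print(compute_prefix_function("ababaca")[1:])  # [0, 0, 1, 2, 3, 0, 1]
```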
Knuth-Morris-Pratt Algorithm
Annotations on the textbook pseudocode:
prefix function computation: $\Theta(m)$ (which is in $O(n)$), using amortized analysis
# characters matched
scan text left-to-right
next character does not match
next character matches
Is all of P matched?
Look for next match
matching loop: $\Theta(n)$, using amortized analysis
total: $\Theta(m+n)$
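The annotated steps map onto code roughly as follows. This is a self-contained sketch, assuming 0-based shifts and an inlined prefix-function computation; it is not a transcription of the textbook pseudocode, and the sample text is made up for the example.

```python
def kmp_matcher(text: str, pattern: str) -> list:
    """Return all shifts at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0:
        return []
    # Prefix function: pi[q] for q = 1..m (pi[0] unused).
    pi = [0] * (m + 1)
    k = 0
    for j in range(2, m + 1):
        while k > 0 and pattern[k] != pattern[j - 1]:
            k = pi[k]
        if pattern[k] == pattern[j - 1]:
            k += 1
        pi[j] = k
    shifts = []
    q = 0  # number of characters matched
    for i in range(n):  # scan text left-to-right
        while q > 0 and pattern[q] != text[i]:
            q = pi[q]   # next character does not match
        if pattern[q] == text[i]:
            q += 1      # next character matches
        if q == m:      # is all of P matched?
            shifts.append(i - m + 1)
            q = pi[q]   # look for next match
    return shifts


print(kmp_matcher("abababacaba", "ababaca"))  # [2]
```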
Knuth-Morris-Pratt Algorithm
Amortized Analysis
Potential Method: potential is $k$, where $k$ = current state of algorithm
running time is in $\Theta(m)$ (which is in $O(n)$)
initial potential value
potential decreases
Potential is never negative since $\pi(k) \ge 0$ for all $k$
potential increases by $\le 1$ in each execution of for loop body
amortized cost of loop body is in $O(1)$
$\Theta(m)$ loop iterations
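The potential argument can be written as a short calculation. This is a sketch with assumed notation: $\Phi_i$ is the value of $k$ after the $i$-th execution of the for loop body, $w_i$ is the number of times the inner fallback step runs in that execution, and costs are counted in unit steps.

```latex
\begin{align*}
\Phi_0 &= 0, \qquad \Phi_i \ge 0 \\
c_i &= 1 + w_i \\
\Phi_i - \Phi_{i-1} &\le 1 - w_i \\
\hat{c}_i &= c_i + \Phi_i - \Phi_{i-1} \le 2 \\
\sum_{i=1}^{m-1} c_i &= \sum_{i=1}^{m-1} \hat{c}_i - \Phi_{m-1} + \Phi_0 \le 2(m-1)
\end{align*}
```

Each fallback step lowers $k$ by at least 1 and each loop body raises it by at most 1, so the total number of fallback steps cannot exceed the total number of increases.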
Knuth-Morris-Pratt Algorithm
Correctness...
Knuth-Morris-Pratt Algorithm
Correctness...
Textbook references on this slide: 32.5, 32.1, 32.6
Knuth-Morris-Pratt Algorithm
Correctness...
Textbook references on this slide: 32.11, 32.5
Knuth-Morris-Pratt Algorithm
Correctness...
Textbook references on this slide: 32.6, 32.5, 32.7