UMass Lowell Computer Science 91.503 Analysis of Algorithms Prof. Karen Daniels Fall, 2008

Post on 03-Jan-2016



Tuesday, 12/2/08

String Matching Algorithms
Chapter 32

Chapter Dependencies

Ch 32 String Matching

Automata

You're responsible for material in Sections 32.1-32.4 of this chapter.

String Matching Algorithms

Motivation & Basics

String Matching Problem

source: 91.503 textbook Cormen et al.

Motivations: text-editing, pattern matching in DNA sequences

Text: array T[1..n]
Pattern: array P[1..m], with $m \le n$

Array Element: character from finite alphabet Σ

Pattern P occurs with shift s in T if P[1..m] = T[s+1..s+m], where $0 \le s \le n - m$

32.1

String Matching Algorithms

Naive Algorithm: worst-case running time in O((n-m+1)m)

Rabin-Karp: worst-case running time in O((n-m+1)m); better than this on average and in practice

Finite Automaton-Based: worst-case running time in O(n + m|Σ|)

Knuth-Morris-Pratt: worst-case running time in O(n + m)

Notation & Terminology

Σ* = set of all finite-length strings formed using characters from alphabet Σ

Empty string: ε
|x| = length of string x
w is a prefix of x: w ⊏ x (example: ab ⊏ abcca)
w is a suffix of x: w ⊐ x (example: cca ⊐ abcca)
prefix, suffix are transitive
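The prefix and suffix relations above can be checked mechanically. The following is a minimal sketch; the helper names `is_prefix` and `is_suffix` are illustrative and do not come from the slides or the textbook.

```python
def is_prefix(w: str, x: str) -> bool:
    """Return True when w is a prefix of x (w ⊏ x)."""
    return x.startswith(w)


def is_suffix(w: str, x: str) -> bool:
    """Return True when w is a suffix of x (w ⊐ x)."""
    return x.endswith(w)


# The two examples from the slide.
print(is_prefix("ab", "abcca"))   # True
print(is_suffix("cca", "abcca"))  # True

# The empty string is a prefix and a suffix of every string.
print(is_prefix("", "abcca"), is_suffix("", "abcca"))  # True True
```

Transitivity follows directly: if `is_prefix(u, w)` and `is_prefix(w, x)` both hold, then `is_prefix(u, x)` holds as well, and likewise for suffixes.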

Overlapping Suffix Lemma

32.1

32.3 32.1

String Matching Algorithms

Naive Algorithm

Naive String Matching

worst-case running time is in Θ((n-m+1)m)
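A minimal sketch of the naive approach, assuming 0-based Python strings so that shift s compares `text[s:s+m]` with the pattern (this corresponds to T[s+1..s+m] in the 1-based slide notation). The function name is illustrative.

```python
def naive_string_matcher(text: str, pattern: str) -> list[int]:
    """Return every shift s at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    # Try all n - m + 1 candidate shifts; each comparison costs up to m steps.
    for s in range(n - m + 1):
        if text[s:s + m] == pattern:
            shifts.append(s)
    return shifts


print(naive_string_matcher("acaabc", "aab"))  # [2]
print(naive_string_matcher("aaaa", "aa"))     # [0, 1, 2]
```

The second call illustrates the worst case: with a text and pattern made of one repeated character, every shift forces a full m-character comparison.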

32.4

String Matching Algorithms

Rabin-Karp

Rabin-Karp Algorithm

Assume each character is a digit in radix-d notation (e.g. d = 10)

p = decimal value of pattern
t_s = decimal value of substring T[s+1..s+m] for s = 0, 1, ..., n-m

Strategy:
compute p in O(m) time (which is in O(n))
compute all t_i values in a total of O(n) time
find all valid shifts s in O(n) time by comparing p with each t_s

Compute p in O(m) time using Horner's rule:
$p = P[m] + d(P[m-1] + d(P[m-2] + \cdots + d(P[2] + d\,P[1]) \cdots))$

Compute t_0 similarly from T[1..m] in O(m) time

Compute remaining t_i's in O(n-m) time:
$t_{s+1} = d(t_s - d^{m-1}\,T[s+1]) + T[s+m+1]$
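The Horner evaluation and the constant-time window update can be sketched as follows. This is an illustrative sketch, not the textbook pseudocode: it assumes the text consists of decimal digit characters, uses exact Python integers (no modulus yet), and uses 0-based indexing, so `text[s]` plays the role of T[s+1]. The names `horner_value` and `window_values` are hypothetical.

```python
def horner_value(digits: str, d: int = 10) -> int:
    """Evaluate a digit string in radix d with Horner's rule in O(m) time."""
    value = 0
    for ch in digits:
        value = d * value + int(ch)
    return value


def window_values(text: str, m: int, d: int = 10) -> list[int]:
    """Return t_0, t_1, ..., t_{n-m} for all length-m windows of text (needs m <= n)."""
    n = len(text)
    high = d ** (m - 1)             # weight of the high-order digit
    t = horner_value(text[:m], d)   # t_0 from the first m digits
    values = [t]
    for s in range(n - m):
        # Drop the high-order digit, shift left one position, append the next digit.
        t = d * (t - high * int(text[s])) + int(text[s + m])
        values.append(t)
    return values


print(horner_value("31415"))       # 31415
print(window_values("314152", 5))  # [31415, 14152]
```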

Rabin-Karp Algorithm

p, t_s may be large, so use mod

32.5

Rabin-Karp Algorithm (continued)

[Figure: Rabin-Karp example with p = 31415; annotations mark a spurious hit]

$t_{s+1} = d(t_s - d^{m-1}\,T[s+1]) + T[s+m+1]$

High-order factor used with the modulus: $d^{m-1} \bmod q$


Rabin-Karp Algorithm (continued)

worst-case running time is in Θ((n-m+1)m)

Annotations on the matcher pseudocode:

d is radix, q is modulus

high-order digit position for m-digit window

Preprocessing

Matching loop invariant: when line 10 executed, t_s = T[s+1..s+m] mod q

Try all possible shifts

rule out spurious hit

Running-time annotations: Θ(m), which is in O(n); Θ(m); Θ(m); Θ((n-m+1)m)
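A minimal sketch of the modular version described by these annotations, assuming a text of decimal digit characters, radix d = 10 and a small prime modulus q = 13 chosen only for illustration. It follows the structure of the textbook matcher (preprocessing, then one pass over all shifts) but is not a transcription of it; indexing is 0-based.

```python
def rabin_karp_matcher(text: str, pattern: str, d: int = 10, q: int = 13) -> list[int]:
    """Return all valid shifts, using fingerprints mod q to filter candidates."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)  # high-order digit weight, d^(m-1) mod q
    p = 0                 # fingerprint of the pattern
    t = 0                 # fingerprint of the current text window
    # Preprocessing: Horner's rule mod q on the pattern and the first window.
    for i in range(m):
        p = (d * p + int(pattern[i])) % q
        t = (d * t + int(text[i])) % q
    shifts = []
    # Matching: try all possible shifts.
    for s in range(n - m + 1):
        # Equal fingerprints may be a spurious hit, so compare explicitly.
        if p == t and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:
            # Roll the window: remove text[s], append text[s + m].
            t = (d * (t - int(text[s]) * h) + int(text[s + m])) % q
    return shifts


print(rabin_karp_matcher("2359023141526739921", "31415"))  # [6]
```

In this example the window 67399 has the same value mod 13 as 31415, so it produces a spurious hit that the explicit comparison rules out.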

Rabin-Karp Algorithm (continued)

average-case running time is in Θ(n+m)

[Pseudocode annotations repeated from the previous slide]

Assume reducing mod q is like a random mapping from Σ* to Z_q (Σ* = set of all finite-length strings formed from Σ)

Estimate (chance that t_s ≡ p mod q) = 1/q, so # spurious hits is in O(n/q)

Expected matching time = O(n) + O(m(v + n/q)) (v = # valid shifts)
O(n) term: preprocessing + t_s updates
O(m(v + n/q)) term: explicit matching comparisons

If v is in O(1) and q >= m, the expected matching time is in O(n)
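As a worked check of the expected-time expression above, the following derivation is a sketch that assumes v is in O(1) and q >= m, as stated on the slide, together with m <= n from the problem setup.

```latex
\begin{aligned}
O(n) + O\!\left(m\left(v + \tfrac{n}{q}\right)\right)
  &= O(n) + O(mv) + O\!\left(\tfrac{mn}{q}\right) \\
  &= O(n) + O(m) + O(n) && \text{since } v = O(1) \text{ and } q \ge m \\
  &= O(n) && \text{since } m \le n
\end{aligned}
```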

String Matching Algorithms

Finite Automata

Strategy: Build automaton for pattern, then examine each text character once.

worst-case running time is in Θ(n) + automaton creation time

32.6

String-Matching Automaton

Pattern = P = ababaca

Automaton accepts strings ending in P

32.7

String-Matching Automaton

Suffix Function for P:

σ(x) = length of longest prefix of P that is a suffix of x

$\sigma(x) = \max\{k : P_k \sqsupset x\}$

Automaton's operational invariant at each step: keeps track of longest pattern prefix that is a suffix of what has been read so far
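The suffix function can be evaluated directly from its definition. The sketch below is deliberately brute force (it tries candidate lengths from longest to shortest) and is meant only to make the definition concrete; the function name is illustrative.

```python
def suffix_function(pattern: str, x: str) -> int:
    """Return the length of the longest prefix of pattern that is a suffix of x."""
    # Try the longest feasible prefix first and shrink until one matches.
    for k in range(min(len(pattern), len(x)), -1, -1):
        if x.endswith(pattern[:k]):
            return k
    return 0  # unreachable: k = 0 (the empty prefix) always matches


# With P = "ab":
print(suffix_function("ab", ""))       # 0
print(suffix_function("ab", "ccaca"))  # 1
print(suffix_function("ab", "ccab"))   # 2
```

The automaton's invariant says that after reading a text prefix x, its state equals `suffix_function(P, x)`, so the state reaches m exactly when x ends in P.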

32.3

32.4

String-Matching Automaton

Simulate behavior of string-matching automaton that finds occurrences of pattern P of length m in T[1..n]

worst-case running time of matching is in Θ(n), assuming automaton has already been created...
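A minimal sketch of the simulation step, assuming the transition function is already available as a Python dict keyed by (state, character). The dict representation, the fallback to state 0 for characters outside the table, and the hand-built table for the pattern "ab" are all assumptions made for this illustration.

```python
def finite_automaton_matcher(text: str, delta: dict, m: int) -> list[int]:
    """Run the automaton over text once; report 0-based shifts where state m is reached."""
    q = 0
    shifts = []
    for i, ch in enumerate(text):
        # Characters missing from the table cannot extend any prefix, so go to state 0.
        q = delta.get((q, ch), 0)
        if q == m:
            shifts.append(i - m + 1)
    return shifts


# Hand-built transition table for the pattern "ab" over the alphabet {a, b}.
delta_ab = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 0,
}

print(finite_automaton_matcher("abab", delta_ab, 2))   # [0, 2]
print(finite_automaton_matcher("aabxb", delta_ab, 2))  # [1]
```

Each text character triggers exactly one table lookup, which is where the linear matching time comes from.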

String-Matching Automaton (continued)

Correctness of matching procedure...

32.4

32.3

32.3 $\sigma(xa) = \sigma(P_{\sigma(x)}\,a)$ to be proved next…

String-Matching Automaton (continued)

Correctness of matching procedure...

32.2

32.8 32.2

32.8

String-Matching Automaton (continued)

Correctness of matching procedure...

32.3

32.9 32.3

32.9

32.2

32.1

String-Matching Automaton (continued)

Correctness of matching procedure...

32.4

32.3

32.3 $\sigma(xa) = \sigma(P_{\sigma(x)}\,a)$

String-Matching Automaton (continued)

worst-case running time of automaton creation is in O(m^3 |Σ|)

can be improved to: O(m |Σ|)

worst-case running time of entire string-matching strategy is in O(m |Σ|) + Θ(n)
(automaton creation time + pattern matching time)
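A sketch of the straightforward construction that matches the cubic bound: for each state q and character a, shrink k until the prefix P_k is a suffix of P_q a. The dict-of-pairs representation and the function name are assumptions of this illustration, and the alphabet is passed in explicitly.

```python
def compute_transition_function(pattern: str, alphabet: str) -> dict:
    """Build delta[(q, a)] = sigma(P_q a) for every state q and character a."""
    m = len(pattern)
    delta = {}
    for q in range(m + 1):
        for a in alphabet:
            candidate = pattern[:q] + a  # P_q followed by a
            k = min(m, q + 1)            # the new state can be at most q + 1
            # Each suffix test costs O(m); k = 0 always succeeds, so the loop ends.
            while not candidate.endswith(pattern[:k]):
                k -= 1
            delta[(q, a)] = k
    return delta


delta = compute_transition_function("ababaca", "abc")
print(delta[(5, "b")])  # 4
print(delta[(5, "c")])  # 6
print(delta[(7, "b")])  # 2
```

There are (m+1)|Σ| table entries, up to m+1 candidate values of k per entry, and each suffix test takes up to m character comparisons, which gives the O(m^3 |Σ|) total.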

String Matching Algorithms

Knuth-Morris-Pratt

Knuth-Morris-Pratt Overview

Achieve Θ(n+m) time by shortening automaton preprocessing time below O(m |Σ|)

Approach:
don't precompute automaton's transition function
calculate enough transition data "on-the-fly"
obtain data via "alphabet-independent" pattern preprocessing
pattern preprocessing compares pattern against shifts of itself

Knuth-Morris-Pratt Algorithm

determine how pattern matches against itself

32.10

Knuth-Morris-Pratt Algorithm

Prefix function π shows how pattern matches against itself

π(q) is length of longest prefix of P that is a proper suffix of P_q

$\pi(q) = \max\{k : k < q \text{ and } P_k \sqsupset P_q\}$

Equivalently, what is largest k < q such that P_k ⊐ P_q?
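A minimal sketch of computing the prefix function, assuming a 0-based Python string for the pattern and a result list indexed 1..m to stay close to the slide notation (index 0 is unused). The function name follows the textbook convention, but the code is an illustration rather than a transcription.

```python
def compute_prefix_function(pattern: str) -> list[int]:
    """Return pi, where pi[q] is the longest prefix of P that is a proper suffix of P_q."""
    m = len(pattern)
    pi = [0] * (m + 1)  # pi[1..m]; pi[0] is an unused placeholder
    k = 0               # number of pattern characters currently matched
    for q in range(2, m + 1):
        # Fall back through shorter borders until the next character fits.
        while k > 0 and pattern[k] != pattern[q - 1]:
            k = pi[k]
        if pattern[k] == pattern[q - 1]:
            k += 1
        pi[q] = k
    return pi


print(compute_prefix_function("ababaca")[1:])  # [0, 0, 1, 2, 3, 0, 1]
```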

Example:

32.5

Knuth-Morris-Pratt Algorithm

Annotations on the matcher pseudocode:

# characters matched

scan text left-to-right

next character does not match

next character matches

Is all of P matched?

Look for next match

Running times: prefix-function computation is in Θ(m), which is in O(n), using amortized analysis; matching loop is in Θ(n), using amortized analysis; total is in Θ(m+n)
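The annotated steps above can be sketched as follows. This is a self-contained illustration with 0-based strings and reported shifts; it bundles its own prefix-function helper so the block runs on its own, and the names are illustrative.

```python
def compute_prefix_function(pattern: str) -> list[int]:
    """pi[q] = length of the longest prefix of P that is a proper suffix of P_q."""
    m = len(pattern)
    pi = [0] * (m + 1)
    k = 0
    for q in range(2, m + 1):
        while k > 0 and pattern[k] != pattern[q - 1]:
            k = pi[k]
        if pattern[k] == pattern[q - 1]:
            k += 1
        pi[q] = k
    return pi


def kmp_matcher(text: str, pattern: str) -> list[int]:
    """Return all shifts at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0:
        return []
    pi = compute_prefix_function(pattern)
    q = 0                                   # number of characters matched
    shifts = []
    for i in range(n):                      # scan text left-to-right
        while q > 0 and pattern[q] != text[i]:
            q = pi[q]                       # next character does not match
        if pattern[q] == text[i]:
            q += 1                          # next character matches
        if q == m:                          # is all of P matched?
            shifts.append(i - m + 1)
            q = pi[q]                       # look for next match
    return shifts


print(kmp_matcher("abababacaba", "ababaca"))  # [2]
print(kmp_matcher("aaaa", "aa"))              # [0, 1, 2]
```

Note that the text index `i` never moves backward; only the matched-length counter `q` falls back, using values of the prefix function.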

Knuth-Morris-Pratt Algorithm

Amortized Analysis

Potential Method: the potential is k, where k = current state of algorithm

Annotations on the prefix-function pseudocode:

initial potential value

potential decreases

Potential is never negative since π(k) >= 0 for all k

potential increases by <= 1 in each execution of for loop body

amortized cost of loop body is in O(1)

Θ(m) loop iterations

total running time is in Θ(m), which is in O(n)
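The potential argument can be checked empirically. The sketch below instruments a prefix-function computation with a counter for inner while-loop iterations; the counter and the function name are additions made for this illustration. Because k starts at 0, grows by at most 1 per for-loop iteration, and shrinks by at least 1 per while-loop iteration without going negative, the total number of while-loop iterations cannot exceed m - 1.

```python
def prefix_function_with_count(pattern: str) -> tuple[list[int], int]:
    """Compute pi and count how often the inner while loop body runs."""
    m = len(pattern)
    pi = [0] * (m + 1)
    k = 0                  # the potential: current matched length
    while_iterations = 0
    for q in range(2, m + 1):
        while k > 0 and pattern[k] != pattern[q - 1]:
            k = pi[k]      # potential decreases
            while_iterations += 1
        if pattern[k] == pattern[q - 1]:
            k += 1         # potential increases by at most 1 per for-loop iteration
        pi[q] = k
    return pi, while_iterations


for p in ["ababaca", "aaaab", "abcabcabd", "aabaabaaa"]:
    pi, count = prefix_function_with_count(p)
    print(p, count, count <= len(p) - 1)
```

Every printed line should end in `True`; for "ababaca" the while-loop body runs twice, both times while processing the character c.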

Knuth-Morris-Pratt Algorithm

Correctness...

32.5

32.1

32.6

32.6

Knuth-Morris-Pratt Algorithm

Correctness...

32.11 32.5

Knuth-Morris-Pratt Algorithm

Correctness...

32.6

32.5

32.5

32.7

32.6