Efficient String Matching : An Aid to Bibliographic Search
Alfred V. Aho and Margaret J. Corasick
Bell Laboratories
Virus Definition
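Each virus has its own characteristic signature.
Example in ClamAV:
_0017_0001_000=21b8004233c999cd218bd6b90300b440cd218b4c198b541bb80157cd21b43ecd2132ed
_0017_0001_000 is the virus index; the part after "=" is the signature written in hexadecimal, e.g. Hex(21)=Dec(33)=‘!’.
A virus is detected by matching its signature against the scanned data.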
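As a rough illustration of signature matching, the following Python sketch decodes the hexadecimal signature shown above into bytes and looks for it in a buffer. The function name `contains_signature` and the sample data are made up for this example; a real scanner matches many signatures at once, which is the problem the rest of these slides address.

```python
# The signature body from the slide, written as hexadecimal text.
SIGNATURE_HEX = "21b8004233c999cd218bd6b90300b440cd218b4c198b541bb80157cd21b43ecd2132ed"

# Two hex digits encode one byte, so the 70 digits become 35 bytes.
signature = bytes.fromhex(SIGNATURE_HEX)

def contains_signature(data: bytes, sig: bytes = signature) -> bool:
    """Return True when the signature occurs anywhere in data (hypothetical helper)."""
    return sig in data

# Made-up sample data: the signature embedded between harmless bytes.
sample = b"harmless prefix" + signature + b"harmless suffix"

print(signature[0], chr(signature[0]))    # first byte: Hex(21) = Dec(33) = '!'
print(contains_signature(sample))
print(contains_signature(b"clean file"))
```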
Regular Expression
Use regular expressions (RE) to describe the signature.
? matches any one character:
W32.Hybris.C (Clam)=4000?????????????83??????75f2e9????ffff00000000
* matches any number of characters (including none):
Oror-fam (Clam)=495243*56697275*53455859330f5455*4b617a61*536e617073686f
{n1-n2} means there are between n1 and n2 characters between two parts:
Worm.Bagle.AG-empty (Clam)=6e74656e742d547970653a206170706c69636174696f6e2f6f637465742d73747265616d3b{40-130}2d2d2d2d2d2d2d2d
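To make the three wildcards concrete, here is a minimal Python sketch that translates such a signature into a regular expression over the hexadecimal text of the data. It rests on assumptions that the slides do not spell out: `?` is taken to match one hex digit, while `*` and `{n1-n2}` are taken to count whole bytes. The helper names are hypothetical, and this is not how ClamAV itself implements matching.

```python
import re

HEX = "[0-9a-f]"

def signature_to_regex(sig):
    """Translate a signature with ?, * and {n1-n2} into a regex over hex text."""
    parts = []
    # Tokens: a {n1-n2} range, a '*', a '?', or a single hex digit.
    for token in re.findall(r"\{\d+-\d+\}|\*|\?|[0-9a-fA-F]", sig):
        if token == "?":
            parts.append(HEX)                       # assumed: one hex digit
        elif token == "*":
            parts.append("(?:" + HEX + "{2})*")     # assumed: any number of bytes
        elif token.startswith("{"):
            low, high = token[1:-1].split("-")
            parts.append("(?:" + HEX + "{2}){" + low + "," + high + "}")
        else:
            parts.append(token.lower())
    return re.compile("".join(parts))

def find_signature(data, sig):
    """Return the byte offset of the first match, or -1 if there is none."""
    pattern = signature_to_regex(sig)
    hex_text = data.hex()
    # Only try even positions so that a match starts on a byte boundary.
    for pos in range(0, len(hex_text), 2):
        if pattern.match(hex_text, pos):
            return pos // 2
    return -1

print(find_signature(b"xxAB..CDyy", "4142*4344"))   # "AB", any bytes, "CD"
print(find_signature(b"AB.D", "41?2{1-2}44"))       # one or two bytes before "D"
print(find_signature(b"AB...D", "41?2{1-2}44"))     # three bytes in between
```

Trying every start position is simple but slow; it is only meant to show what the wildcards express.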
Introduction
Locate all occurrences of any of a finite number of keywords in a string of text.
The algorithm consists of two parts:
1. constructing a finite state pattern matching machine from the keywords;
2. using the pattern matching machine to process the text string in a single pass.
Pattern Matching Machine(1)
Our problem is to locate and identify all substrings of x which are keywords in K.
K : K={y1,y2,…,yk} is a finite set of strings which we shall call keywords.
x : x is an arbitrary string which we shall call the text string.
The behavior of the pattern matching machine is dictated by three functions: a goto function g, a failure function f, and an output function output.
Pattern Matching Machine(2)
g (s,a) = s’ or fail : maps a pair consisting of a state and an input symbol into a state or the message fail.
f (s) = s’ : maps a state into a state, and is consulted whenever the goto function reports fail.
output (s) = keywords : associates a set of keywords (possibly empty) with every state.
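One possible way to picture the three functions is as lookup tables. The Python sketch below stores only a fragment of the machine for the keywords {he, she, his, hers}, using the values quoted in the example slides (g(4,e)=5, f(5)=2, g(2,r)=8, output(5)). The names `FAIL`, `goto_table`, `failure_table` and `output_table` are illustrative choices, not part of the original presentation.

```python
FAIL = None  # stands for the message "fail"

# Fragment of the machine for the keywords {he, she, his, hers}.
goto_table = {(4, "e"): 5, (2, "r"): 8}
failure_table = {5: 2}
output_table = {5: {"she", "he"}}

def g(state, symbol):
    """Goto function: a state, or FAIL when no transition is defined."""
    return goto_table.get((state, symbol), FAIL)

def f(state):
    """Failure function: the state consulted after g reports fail."""
    return failure_table[state]

def output(state):
    """Output function: the (possibly empty) set of keywords of a state."""
    return output_table.get(state, set())

print(g(4, "e"))           # goto transition from state 4 on e
print(g(5, "r"))           # no goto transition from state 5 on r
print(f(5))                # so the failure function is consulted
print(sorted(output(5)))
print(output(8))           # empty set: no keyword ends in state 8
```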
Pattern Matching Machine Example with keywords {he,she,his,hers}
The start state is state 0.
Let s be the current state and a the current symbol of the input string x.
Operating cycle:
1. If g(s,a)=s’, the machine makes a goto transition: it enters state s’ and the next symbol of x becomes the current input symbol.
2. If g(s,a)=fail, the machine makes a failure transition: if f(s)=s’, it repeats the cycle with s’ as the current state and a as the current input symbol.
Example
Text: u s h e r s
State: 0 0 3 4 5 8 9
Between states 5 and 8 the machine passes through state 2 by a failure transition on the symbol r.
In state 4, since g(4,e)=5, the machine enters state 5, finds the keywords “she” and “he” ending at position four of the text string, and emits output(5).
Example Cont’d
In state 5 on input symbol r, the machine makes two state transitions in its operating cycle.
Since g(5,r)=fail, M enters state 2=f(5) . Then since g(2,r)=8, M enters state 8 and advances to the next input symbol.
No output is generated in this operating cycle.
Algorithm 1. Pattern matching machine.
Input. A text string x = a1 a2 … an where each ai is an input symbol and a pattern matching machine M with goto function g, failure function f, and output function output, as described above.
Output. Locations at which keywords occur in x.
Method.
    begin
        state ← 0
        for i ← 1 until n do
            begin
                while g(state, ai) = fail do state ← f(state)
                state ← g(state, ai)
                if output(state) ≠ empty then
                    begin
                        print i
                        print output(state)
                    end
            end
    end
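Algorithm 1 can be sketched in Python as follows. The tables are written out by hand for the example keywords {he, she, his, hers}, using the state numbering of the slides. The dictionary layout, the name `match` and the use of `None` for fail are choices made for this sketch only.

```python
FAIL = None

# Goto edges of the example machine; state 0 loops on every other symbol.
GOTO = {
    (0, "h"): 1, (0, "s"): 3,
    (1, "e"): 2, (1, "i"): 6,
    (2, "r"): 8,
    (3, "h"): 4,
    (4, "e"): 5,
    (6, "s"): 7,
    (8, "s"): 9,
}
FAILURE = {1: 0, 2: 0, 3: 0, 4: 1, 5: 2, 6: 0, 7: 3, 8: 0, 9: 3}
OUTPUT = {2: ["he"], 5: ["she", "he"], 7: ["his"], 9: ["hers"]}

def g(state, symbol):
    """Goto function with the loop on state 0 for symbols without an edge."""
    if (state, symbol) in GOTO:
        return GOTO[(state, symbol)]
    return 0 if state == 0 else FAIL

def match(text):
    """Algorithm 1: return (position, keywords) pairs; positions are 1-based."""
    found = []
    state = 0
    for i, symbol in enumerate(text, start=1):
        while g(state, symbol) is FAIL:
            state = FAILURE[state]          # failure transition
        state = g(state, symbol)            # goto transition
        if state in OUTPUT:
            found.append((i, OUTPUT[state]))
    return found

print(match("ushers"))   # [(4, ['she', 'he']), (6, ['hers'])]
```

On the text "ushers" this reports "she" and "he" ending at position 4 and "hers" ending at position 6, in line with the worked example.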
Construction of the functions
The construction has two parts:
First : determine the states and the goto function.
Second : compute the failure function.
The computation of the output function begins in the first part and is completed in the second.
Construction of Goto function
Construct a goto graph like the one on the next page.
For each keyword, add new vertices and edges to the graph, starting at the start state.
Add new edges only when necessary.
Add a loop from state 0 to state 0 on all input symbols other than those that begin a keyword.
Construction of Goto function with keywords {he,she,his,hers}
Algorithm 2. Construction of the goto function.
Input. Set of keywords K = {y1, y2, …, yk}.
Output. Goto function g and a partially computed output function output.
Method. We assume output(s) is empty when state s is first created, and g(s, a) = fail if a is undefined or if g(s, a) has not yet been defined. The procedure enter(y) inserts into the goto graph a path that spells out y.
    begin
        newstate ← 0
        for i ← 1 until k do enter(yi)
        for all a such that g(0, a) = fail do g(0, a) ← 0
    end
    procedure enter(a1 a2 … am):
    begin
        state ← 0; j ← 1
        while g(state, aj) ≠ fail do
            begin
                state ← g(state, aj)
                j ← j + 1
            end
        for p ← j until m do
            begin
                newstate ← newstate + 1
                g(state, ap) ← newstate
                state ← newstate
            end
        output(state) ← { a1 a2 … am }
    end
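A minimal Python sketch of Algorithm 2 and its enter procedure is given below. It assumes the goto function is stored as a dictionary in which a missing key means fail; the loop on state 0 is left to the lookup code rather than stored for every symbol. The name `build_goto` and the explicit bound check on j are additions for this sketch.

```python
def build_goto(keywords):
    """Algorithm 2: return (goto, output) for the given keywords."""
    goto = {}       # (state, symbol) -> state; a missing key means fail
    output = {}     # state -> list of keywords (partially computed)
    newstate = 0

    def enter(keyword):
        nonlocal newstate
        state, j = 0, 0
        # Follow the path that already exists for a prefix of the keyword.
        while j < len(keyword) and (state, keyword[j]) in goto:
            state = goto[(state, keyword[j])]
            j += 1
        # Create a new state for each remaining symbol.
        for symbol in keyword[j:]:
            newstate += 1
            goto[(state, symbol)] = newstate
            state = newstate
        output.setdefault(state, []).append(keyword)

    for keyword in keywords:
        enter(keyword)
    return goto, output

goto, output = build_goto(["he", "she", "his", "hers"])
print(sorted(goto.items()))
print(output)   # {2: ['he'], 5: ['she'], 7: ['his'], 9: ['hers']}
```

Entering the keywords in the order he, she, his, hers yields the state numbers used throughout the slides. At this stage output(5) contains only "she"; "he" is added later, when the failure function is computed.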
Construction of Failure function
Depth of s : the length of the shortest path from the start state to state s.
The failure function for the states of depth d can be determined from the failure function values of the states of depth less than d.
Make f(s)=0 for all states s of depth 1.
Construction of Failure function Cont’d
To compute the failure function for the states of depth d, consider each state r of depth d-1:
1. If g(r,a)=fail for all a, do nothing.
2. Otherwise, for each a such that g(r,a)=s, do the following:
   a. Set state=f(r).
   b. Execute state ← f(state) zero or more times, until a value for state is obtained such that g(state,a)≠fail.
   c. Set f(s)=g(state,a).
Algorithm 3. Construction of the failure function.
Input. Goto function g and output function output from Algorithm 2.
Output. Failure function f and output function output.
Method.
    begin
        queue ← empty
        for each a such that g(0, a) = s ≠ 0 do
            begin
                queue ← queue ∪ {s}
                f(s) ← 0
            end
        while queue ≠ empty do
            begin
                let r be the next state in queue
                queue ← queue - {r}
                for each a such that g(r, a) = s ≠ fail do
                    begin
                        queue ← queue ∪ {s}
                        state ← f(r)
                        while g(state, a) = fail do state ← f(state)
                        f(s) ← g(state, a)
                        output(s) ← output(s) ∪ output(f(s))
                    end
            end
    end
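The breadth-first computation of Algorithm 3 might look as follows in Python. To keep the sketch self-contained it first builds the goto function with a compact version of Algorithm 2. The names `build_machine` and `children`, and the use of `None` for fail, are assumptions of this sketch.

```python
from collections import deque

def build_machine(keywords):
    # Compact Algorithm 2: goto function and partial output function.
    goto, output, newstate = {}, {0: []}, 0
    for keyword in keywords:
        state = 0
        for symbol in keyword:
            if (state, symbol) not in goto:
                newstate += 1
                goto[(state, symbol)] = newstate
                output[newstate] = []
            state = goto[(state, symbol)]
        output[state].append(keyword)

    def g(state, symbol):
        if (state, symbol) in goto:
            return goto[(state, symbol)]
        return 0 if state == 0 else None    # None plays the role of fail

    # children[r] lists the (symbol, s) pairs with g(r, symbol) = s.
    children = {s: [] for s in output}
    for (r, symbol), s in goto.items():
        children[r].append((symbol, s))

    # Algorithm 3: failure function, computed in order of increasing depth.
    failure = {}
    queue = deque()
    for _, s in children[0]:
        failure[s] = 0                      # states of depth 1
        queue.append(s)
    while queue:
        r = queue.popleft()
        for symbol, s in children[r]:
            queue.append(s)
            state = failure[r]
            while g(state, symbol) is None:
                state = failure[state]
            failure[s] = g(state, symbol)
            # Merge the outputs of f(s) into the outputs of s.
            output[s] = output[s] + output[failure[s]]
    return g, failure, output

g, failure, output = build_machine(["he", "she", "his", "hers"])
print(sorted(failure.items()))
print(output[5])   # ['she', 'he']
```

Copying lists when merging outputs keeps the sketch short, but it does not run in constant time per merge.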
About construction
When we determine f(s)=s’, we merge the outputs of state s’ into the outputs of state s.
The failure function is not optimal: if the keyword “his” were not present, the machine could go directly from state 4 to state 0 on an input symbol other than e, skipping an unnecessary intermediate transition to state 1.
To avoid such unnecessary transitions, we can use a deterministic finite automaton, which is discussed later.
Properties of Algorithms 1,2,3
Lemma 1: Suppose that in the goto graph state s is represented by the string u and state t is represented by the string v. Then f(s)=t iff v is the longest proper suffix of u that is also a prefix of some keyword.
Proof : Suppose u=a1a2…aj, and a1a2…aj-1 represents state r. Let r1,r2,…,rn be the sequence of states such that:
1. r1=f(r) ; 2. ri+1=f(ri) for 1 ≤ i < n ; 3. g(ri,aj)=fail for 1 ≤ i < n ; 4. g(rn,aj)=t
Suppose vi represents state ri. Then v1 is the longest proper suffix of a1a2…aj-1 that is a prefix of some keyword; v2 is the longest proper suffix of v1 that is a prefix of some keyword, and so on.
Thus vn is the longest proper suffix of a1a2…aj-1 such that vnaj is a prefix of some keyword.
Properties of Algorithms 1,2,3
Lemma 2 : The set output(s) contains y if and only if y is a keyword that is a suffix of the string representing state s.
Proof : Consider a string y in output(s), and let u be the string representing state s.
If y is added to output(s) by Algorithm 2, then y=u and y is a keyword.
If y is added to output(s) by Algorithm 3, then y is in output(f(s)).
Conversely, if y is a keyword that is a proper suffix of u, then from the inductive hypothesis and Lemma 1 we know output(f(s)) contains y.
Properties of Algorithms 1,2,3
Lemma 3 : After the jth operating cycle, Algorithm 1 will be in state s iff s is represented by the longest suffix of a1a2…aj that is a prefix of some keyword.
Proof : Similar to Lemma 1.
THEOREM 1 : Algorithms 2 and 3 produce valid goto, failure, and output functions.
Proof : By Lemmas 2 and 3.
Time Complexity of Algorithms 1, 2, and 3
THEOREM 2 : Using the goto, failure and output functions created by Algorithms 2 and 3, Algorithm 1 makes fewer than 2n state transitions in processing a text string of length n.
Proof :
From a state s of depth d, Algorithm 1 makes at most d failure transitions in one operating cycle.
The total number of failure transitions must therefore be at least one less than the total number of goto transitions.
In processing an input of length n, Algorithm 1 makes exactly n goto transitions.
Therefore the total number of state transitions is less than 2n.
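The bound of Theorem 2 can be checked empirically. The Python sketch below builds the goto and failure functions in compact form and counts the two kinds of transitions separately while scanning a text. The helper names `build` and `count_transitions` are hypothetical, and a handful of sample texts illustrates the bound without proving it.

```python
from collections import deque

def build(keywords):
    """Compact versions of Algorithms 2 and 3 (goto and failure functions)."""
    goto, children, count = {}, {0: []}, 0
    for keyword in keywords:
        state = 0
        for a in keyword:
            if (state, a) not in goto:
                count += 1
                goto[(state, a)] = count
                children[state].append((a, count))
                children[count] = []
            state = goto[(state, a)]

    def g(state, a):
        if (state, a) in goto:
            return goto[(state, a)]
        return 0 if state == 0 else None     # None stands for fail

    failure = {s: 0 for _, s in children[0]}
    queue = deque(failure)
    while queue:
        r = queue.popleft()
        for a, s in children[r]:
            queue.append(s)
            state = failure[r]
            while g(state, a) is None:
                state = failure[state]
            failure[s] = g(state, a)
    return g, failure

def count_transitions(text, g, failure):
    """Run Algorithm 1 and return (goto transitions, failure transitions)."""
    goto_moves = failure_moves = 0
    state = 0
    for symbol in text:
        while g(state, symbol) is None:
            state = failure[state]
            failure_moves += 1
        state = g(state, symbol)
        goto_moves += 1
    return goto_moves, failure_moves

g, failure = build(["he", "she", "his", "hers"])
for text in ["ushers", "hhhhh", "shishershe"]:
    goto_moves, failure_moves = count_transitions(text, g, failure)
    print(text, goto_moves, failure_moves, goto_moves + failure_moves < 2 * len(text))
```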
Time Complexity of Algorithms 1, 2, and 3
THEOREM 3 : Algorithm 2 requires time linearly proportional to the sum of the lengths of the keywords.
Proof : Straightforward.
THEOREM 4 : Algorithm 3 can be implemented to run in time proportional to the sum of the lengths of the keywords.
Proof :
The total number of executions of state ← f(state) is bounded by the sum of the lengths of the keywords.
Using linked lists to represent the output set of a state, we can execute the statement output(s) ← output(s) ∪ output(f(s)) in constant time.
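One way to read the linked-list remark is that output(s) need not be copied: it can share the list of output(f(s)) as its tail. The Python sketch below illustrates this under the assumption that the keywords are distinct, so that each state has at most one keyword of its own before merging. The tables are hard-coded for the example {he, she, his, hers}, and the names `link_outputs` and `keywords_at` are made up for this sketch.

```python
def link_outputs(own_keyword, failure, bfs_order):
    """Build each output set as a linked list that shares the list of f(s)."""
    head = {0: None}                 # None is the empty list
    for s in bfs_order:              # f(s) is always handled before s
        inherited = head[failure[s]]
        if s in own_keyword:
            # One new cell in front of the shared tail: constant time.
            head[s] = (own_keyword[s], inherited)
        else:
            head[s] = inherited
    return head

def keywords_at(head, state):
    """Walk the linked list of a state and collect its keywords."""
    found = []
    node = head[state]
    while node is not None:
        keyword, node = node
        found.append(keyword)
    return found

# Example machine: the keyword ending at each state, and the failure function.
OWN = {2: "he", 5: "she", 7: "his", 9: "hers"}
FAILURE = {1: 0, 2: 0, 3: 0, 4: 1, 5: 2, 6: 0, 7: 3, 8: 0, 9: 3}
BFS_ORDER = [1, 3, 2, 6, 4, 8, 7, 5, 9]    # states in order of depth

head = link_outputs(OWN, FAILURE, BFS_ORDER)
print(keywords_at(head, 5))   # ['she', 'he']
print(keywords_at(head, 9))   # ['hers']
```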
Eliminating Failure Transitions
In Algorithm 1, the goto function g can be replaced by the next move function δ of a deterministic finite automaton, where δ(s, a) is a state for each state s and input symbol a.
By using the next move function δ, we can dispense with all failure transitions, and make exactly one state transition per input character.
Algorithm 4. Construction of a deterministic finite automaton.
Input. Goto function g from Algorithm 2 and failure function f from Algorithm 3.
Output. Next move function δ.
Method.
    begin
        queue ← empty
        for each symbol a do
            begin
                δ(0, a) ← g(0, a)
                if g(0, a) ≠ 0 then queue ← queue ∪ {g(0, a)}
            end
        while queue ≠ empty do
            begin
                let r be the next state in queue
                queue ← queue - {r}
                for each symbol a do
                    if g(r, a) = s ≠ fail then
                        begin
                            queue ← queue ∪ {s}
                            δ(r, a) ← s
                        end
                    else δ(r, a) ← δ(f(r), a)
            end
    end
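A Python sketch of the construction of δ follows. It deviates from Algorithm 4 in two labeled ways: the failure function is computed on the fly as f(s) = δ(f(r), a) instead of being taken from Algorithm 3, and δ is tabulated only for symbols that occur in some keyword, with every other symbol treated as leading to state 0. The names `build_dfa` and `run` are hypothetical.

```python
from collections import deque

def build_dfa(keywords):
    # Compact Algorithm 2: the goto function as a dictionary.
    goto, newstate = {}, 0
    for keyword in keywords:
        state = 0
        for symbol in keyword:
            if (state, symbol) not in goto:
                newstate += 1
                goto[(state, symbol)] = newstate
            state = goto[(state, symbol)]
    alphabet = sorted({symbol for _, symbol in goto})

    failure, delta, queue = {}, {}, deque()
    for a in alphabet:
        s = goto.get((0, a), 0)          # g(0, a), including the loop on 0
        delta[(0, a)] = s
        if s != 0:
            failure[s] = 0
            queue.append(s)
    while queue:
        r = queue.popleft()
        for a in alphabet:
            s = goto.get((r, a))
            if s is not None:
                queue.append(s)
                delta[(r, a)] = s
                # Shortcut of this sketch: f(s) equals delta(f(r), a).
                failure[s] = delta[(failure[r], a)]
            else:
                delta[(r, a)] = delta[(failure[r], a)]
    return delta

def run(delta, text):
    """Make exactly one transition per input symbol; return the visited states."""
    state, states = 0, []
    for symbol in text:
        # A symbol that occurs in no keyword always leads back to state 0.
        state = delta.get((state, symbol), 0)
        states.append(state)
    return states

delta = build_dfa(["he", "she", "his", "hers"])
print(delta[(5, "r")])        # 8: no detour through state 2
print(run(delta, "ushers"))   # [0, 3, 4, 5, 8, 9]
```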
Fig. 3. Next move function.
("." stands for any input symbol not listed for that state.)

state(s)     input symbol → next state
0            h → 1, s → 3, . → 0
1            e → 2, i → 6, h → 1, s → 3, . → 0
9, 7, 3      h → 4, s → 3, . → 0
5, 2         r → 8, h → 1, s → 3, . → 0
6            s → 7, h → 1, . → 0
4            e → 5, i → 6, h → 1, s → 3, . → 0
8            s → 9, h → 1, . → 0
Conclusion
The approach is attractive when the number of keywords is large, since all keywords can be matched simultaneously in one pass.
Using the next move function can potentially reduce the number of state transitions by 50%, but it requires more memory.
The machine tends to spend most of its time in state 0, from which there are no failure transitions, so the actual saving is usually smaller.
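[Figure: goto graph for the keywords {he, she, his, hers}. Path 0 -h-> 1 -e-> 2 -r-> 8 -s-> 9; branch 1 -i-> 6 -s-> 7; path 0 -s-> 3 -h-> 4 -e-> 5; loop on state 0 for all input symbols other than h and s.]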