Multiple Pattern Matching Algorithms on Collage System

Post on 14-Jan-2016


T. Kida, T. Matsumoto, M. Takeda, A. Shinohara, and S. Arikawa


Department of Informatics, Kyushu University

Contents

• Compressed pattern matching

• Collage system

• Extension of Output function for multiple patterns

• Multi-pattern version of BM type algorithm

• Parallel complexity

• Conclusion

Compressed Pattern Matching

[Figure: two routes for searching a compressed text. Conventional route: compressed text → decompress → original text → pattern matching machine. Direct route: compressed text → compressed pattern matching machine.]

Works on This Study

Compression method: compressed pattern matching algorithms

• Run-length: Eilam-Tzoreff & Vishkin (1988)
• Run-length (two dim): Amir et al. (1992, 1997); Amir & Benson (1992)
• LZ77 family: Farach & Thorup (1995); Gąsieniec et al. (1996); Klein & Shapira (2000)
• LZ78 family: Amir et al. (1996); Kida et al. (1998, 1999); Navarro & Tarhio (2000); Kärkkäinen et al. (2000)
• LZ family: Navarro et al. (1999)
• Straight-line programs: Karpinski et al. (1997); Miyazaki et al. (1997); Hirao et al. (2000)
• Huffman: Fukamachi et al. (1998); Klein & Shapira (2001); Miyazaki et al. (1998)
• Finite state encoding: Takeda (1997)
• Word based encoding: Moura et al. (1998)
• Pattern substitution: Manber (1994); Shibata et al. (1998)
• Antidictionary based: Shibata et al. (1999)

Works on This Study

• Previous: a separate algorithm for each compression method (an algorithm for the word-based method, an algorithm for LZ78, an algorithm for LZ77).
• This study: an algorithm for texts represented by a collage system, which covers the word-based method, LZ78, and LZ77.
• Collage system: "A unifying framework for compressed pattern matching", T. Kida et al. (1999), SPIRE1999.


Collage System

D :
X1 = a ;
X2 = b ;
X3 = X1 ・ X2 ;     (Concatenation)
X4 = X2 ・ X1 ;     (Concatenation)
X5 = ( X3 )^3 ;     (Repetition)
X6 = [ 3 ]X5 ;      (Truncation)

S : X3 X6 X4 X5 X2 X3 X1 X5 X4 X2

Text represented by 〈 D, S 〉: abbabbaabababbabaabababbab
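The example above can be checked by expanding the tokens naively. The following is a minimal sketch, not the authors' implementation: the tuple encoding of D and the operation names (`char`, `cat`, `rep`, `ptrunc`, `strunc`) are hypothetical, and it assumes that prefix (suffix) truncation by j removes the first (last) j characters. Naive expansion takes time proportional to the uncompressed text, which is exactly what compressed pattern matching tries to avoid; the sketch only illustrates the semantics.

```python
def expand(D, S):
    """Return the text represented by the collage system <D, S>."""
    u = {}  # token name -> string it represents (X.u)
    for name, expr in D:  # D must be listed in definition order
        op = expr[0]
        if op == "char":      # primitive assignment: X = a
            u[name] = expr[1]
        elif op == "cat":     # concatenation: X = Y . Z
            u[name] = u[expr[1]] + u[expr[2]]
        elif op == "rep":     # repetition: X = (Y)^j
            u[name] = u[expr[1]] * expr[2]
        elif op == "ptrunc":  # prefix truncation (assumed: drop first j chars)
            u[name] = u[expr[1]][expr[2]:]
        elif op == "strunc":  # suffix truncation (assumed: drop last j chars)
            s = u[expr[1]]
            u[name] = s[:len(s) - expr[2]]
        else:
            raise ValueError("unknown operation: " + op)
    return "".join(u[x] for x in S)


# The dictionary D and sequence S from the slide.
D = [
    ("X1", ("char", "a")),
    ("X2", ("char", "b")),
    ("X3", ("cat", "X1", "X2")),
    ("X4", ("cat", "X2", "X1")),
    ("X5", ("rep", "X3", 3)),
    ("X6", ("ptrunc", "X5", 3)),
]
S = ["X3", "X6", "X4", "X5", "X2", "X3", "X1", "X5", "X4", "X2"]

text = expand(D, S)
print(text)  # abbabbaabababbabaabababbab
```

For X6 both readings of truncation (remove three characters, or keep three characters) give the same string bab, so the example does not depend on that assumption.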

Notations and Definitions

• Collage system is a pair 〈 D, S 〉
• D : a set of assignments of tokens
– X1 = expr1 ; X2 = expr2 ; ・・・ ; Xn = exprn ; where each exprk is any of the form
  • a for a ∈ Σ ∪ {ε} (primitive assignment),
  • Xi ・ Xj for i, j < k (concatenation),
  • ( Xi )^j for i < k and an integer j (j times repetition),
  • [ j ]Xi for i < k and an integer j (prefix truncation),
  • Xi[ j ] for i < k and an integer j (suffix truncation),
– ||D|| = n : the number of tokens defined in D
– X.u : the string represented by a token X
• S : a sequence of tokens defined in D
– Xi1 Xi2 ・・・ Xil (each Xij is a token defined in D)
– |S| = l : the number of tokens in S

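Height of D

D :
X1 = a ;
X2 = b ;
X3 = X1 ・ X2 ;
X4 = X2 ・ X1 ;
X5 = ( X3 )^3 ;
X6 = [ 3 ]X5 ;
X7 = X6 ・ X4 ;

[Figure: the tree of X7. X7 has children X6 and X4; X6 has child X5; X5 has child X3; X3 has children X1 and X2; X4 has children X2 and X1.]

height(X7) = 4

height(D) = max{ height(X) | X ∈ F(D) }, where F(D) is the set of tokens defined in D.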
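The height can be computed in one pass over D. This is a minimal sketch under two assumptions: the tuple encoding of D is hypothetical, and a primitive token is given height 0, which is the choice that reproduces height(X7) = 4 for the example.

```python
def heights(D):
    """Map each token of D to its height (primitive tokens have height 0)."""
    h = {}
    for name, expr in D:  # D must be listed in definition order
        op = expr[0]
        if op == "char":
            h[name] = 0
        elif op == "cat":
            h[name] = 1 + max(h[expr[1]], h[expr[2]])
        else:  # repetition or truncation: a single operand token
            h[name] = 1 + h[expr[1]]
    return h


D = [
    ("X1", ("char", "a")),
    ("X2", ("char", "b")),
    ("X3", ("cat", "X1", "X2")),
    ("X4", ("cat", "X2", "X1")),
    ("X5", ("rep", "X3", 3)),
    ("X6", ("ptrunc", "X5", 3)),
    ("X7", ("cat", "X6", "X4")),
]

h = heights(D)
height_D = max(h.values())  # height(D) = max over all tokens in F(D)
print(h["X7"], height_D)    # 4 4
```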

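Example of Collage System (LZSS [gzip])

D :
$X_1 = a_1$ ; $X_2 = a_2$ ; ・・・ ; $X_q = a_q$ ;
$X_{q+1} = \left(\left({}^{[i_1]}X_{l(1)} \, X_{l(1)+1} \cdots X_{r(1)}\right)^{m_1}\right)^{[j_1]} b_1$ ;
$X_{q+2} = \left(\left({}^{[i_2]}X_{l(2)} \, X_{l(2)+1} \cdots X_{r(2)}\right)^{m_2}\right)^{[j_2]} b_2$ ;
・・・
$X_{q+n} = \left(\left({}^{[i_n]}X_{l(n)} \, X_{l(n)+1} \cdots X_{r(n)}\right)^{m_n}\right)^{[j_n]} b_n$ ;

S : $X_{q+1} X_{q+2} \cdots X_{q+n}$

where $\Sigma = \{a_1, \ldots, a_q\}$, $b_j \in \Sigma$, and $0 \le i_k, j_k, m_k$.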

Pattern Matching on Collage System

[Figure: the KMP automaton for π = ababb (states 0 to 5, with goto and failure functions) run over the original text abababba, which is represented by S : Xi1 Xi2 Xi3 Xi4. The state sequence over the characters is 0 1 2 3 4 3 4 5 1; the states reached after each token are 1, 2, 4, 1.]

Jump( 4 , Xi4 ) = 1
Output( 4 , Xi4 ) = {3}
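The two values on this slide can be reproduced with a small deterministic KMP automaton. This is a sketch for illustration only: it expands a token into its string and steps through it character by character, whereas the algorithm in the talk evaluates Jump and Output on tokens without expanding them. The function names are hypothetical, and Xi4.u = abba is inferred from the text and the state sequence rather than stated on the slide.

```python
def kmp_automaton(pattern, alphabet):
    """Build the KMP automaton as a table: delta[q][c] is the next state."""
    m = len(pattern)
    delta = [dict() for _ in range(m + 1)]
    for c in alphabet:
        delta[0][c] = 1 if pattern[0] == c else 0
    x = 0  # state the automaton would be in after dropping the first character
    for q in range(1, m + 1):
        for c in alphabet:
            delta[q][c] = delta[x][c]     # mismatch: behave like state x
        if q < m:
            delta[q][pattern[q]] = q + 1  # match: advance along the pattern
            x = delta[x][pattern[q]]
    return delta


def jump(delta, q, u):
    """State reached from q after reading the string u."""
    for c in u:
        q = delta[q][c]
    return q


def output(delta, q, u, m):
    """1-indexed positions in u at which an occurrence of the pattern ends."""
    hits = set()
    for pos, c in enumerate(u, start=1):
        q = delta[q][c]
        if q == m:
            hits.add(pos)
    return hits


pattern = "ababb"
delta = kmp_automaton(pattern, "ab")

# Character-level trace over the original text abababba.
states = [0]
for c in "abababba":
    states.append(delta[states[-1]][c])
print(states)                                    # [0, 1, 2, 3, 4, 3, 4, 5, 1]

print(jump(delta, 4, "abba"))                    # 1
print(output(delta, 4, "abba", len(pattern)))    # {3}
```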

Pattern Matching on Collage System

• With truncation (LZ77, LZSS):
– $O((\|D\| + |S|) \cdot \mathrm{height}(D) + |\pi|^2 + r)$ time
– $O(\|D\| + |\pi|^2)$ space
• No truncation (LZ78, LZW, Sequitur, BPE):
– $O(\|D\| + |S| + |\pi|^2 + r)$ time

r is the number of pattern occurrences.

Extension of Output function for multiple patterns

Basic Idea

• Simulate the move of the Aho-Corasick (AC) pattern matching machine

[Figure: the AC machine for Π = {aba, ababb, abca, bb}, with states 0 to 9, the goto function, the failure function, and the outputs {aba}, {ababb, bb}, {abca}, and {bb}.]

$\mathrm{Jump}(q, X) = \delta_{AC}(q, X.u)$

$\mathrm{Output}(q, X) = \{ \langle |v|, p \rangle \mid v \text{ is a non-empty prefix of } X.u \text{ and } p \in o(\delta_{AC}(q, v)) \}$

($\delta_{AC}$ is the transition function of the AC machine.)

($o$ is the output function of the AC machine.)
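The definitions of Jump and Output can be tried out on the pattern set from the figure. The sketch below builds a textbook Aho-Corasick machine and evaluates both functions on an expanded string; it is an illustration of the definitions, not the token-level procedure of the talk, and the names `build_ac`, `step`, `jump`, and `output` are hypothetical.

```python
from collections import deque


def build_ac(patterns):
    """Build goto, failure, and output tables of the Aho-Corasick machine."""
    goto, fail, out = [{}], [0], [set()]
    for p in patterns:                     # insert every pattern into the trie
        s = 0
        for c in p:
            if c not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[s][c] = len(goto) - 1
            s = goto[s][c]
        out[s].add(p)
    queue = deque(goto[0].values())        # depth-1 states fail to the root
    while queue:                           # breadth-first failure computation
        s = queue.popleft()
        for c, t in goto[s].items():
            f = fail[s]
            while f and c not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(c, 0)
            out[t] |= out[fail[t]]         # inherit outputs along failure link
            queue.append(t)
    return goto, fail, out


def step(ac, q, c):
    """One transition of the machine from state q on character c."""
    goto, fail, _ = ac
    while q and c not in goto[q]:
        q = fail[q]
    return goto[q].get(c, 0)


def jump(ac, q, u):
    """Jump(q, X) evaluated on the expanded string u = X.u."""
    for c in u:
        q = step(ac, q, c)
    return q


def output(ac, q, u):
    """Output(q, X) on u = X.u: pairs (|v|, pattern) over prefixes v of u."""
    res = set()
    for i, c in enumerate(u, start=1):
        q = step(ac, q, c)
        res |= {(i, p) for p in ac[2][q]}
    return res


ac = build_ac(["aba", "ababb", "abca", "bb"])
print(len(ac[0]))                          # 10 states, numbered 0 to 9
print(sorted(output(ac, 0, "ababb")))      # [(3, 'aba'), (5, 'ababb'), (5, 'bb')]
```

Because Jump composes, a text split into tokens can be processed one token at a time: jump(ac, jump(ac, 0, "ab"), "abb") reaches the same state as jump(ac, 0, "ababb").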

Enumeration of Output(q, X)

• Enumerate Output(q, X) → Enumerate Occ(Π, X.u)
• Enumerate for each case of X, e.g. enumerate Occ*(Π, Y.u Z.u) for X = Y Z

[Figure: the boundary between Y.u and Z.u, shown for the single pattern case and for the multiple pattern case, with the label "Period ?".]

Enumeration of Occ*(Π, xy)

• $O(m^2)$ time and space preprocessing

[Figure: a two-dimensional table for Π = {abcabc, cabb, abca} (patterns 1, 2, 3). One axis lists prefixes of the patterns (px), the other lists suffixes of the patterns (py). The entry for a pair (px, py) holds pattern numbers such as 1 or 3, or nil.]

m is the total length of the patterns in Π.
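As a point of comparison for the table-based method, the set Occ* can be enumerated by brute force. The sketch assumes that Occ*(Π, xy) denotes the occurrences of patterns in xy that begin inside x and end inside y; that reading is an assumption, since the transcript does not define Occ*. The function name is hypothetical, and the scan costs time proportional to the pattern lengths for every call, which the $O(m^2)$ preprocessing on the slide is designed to avoid.

```python
def occ_star(patterns, x, y):
    """Occurrences in x + y that start inside x and end inside y.

    Returns pairs (end position in x + y, 1-indexed; pattern).
    """
    text = x + y
    hits = set()
    for p in patterns:
        # A start index s crosses the boundary iff s < len(x) < s + len(p).
        first = max(0, len(x) - len(p) + 1)
        for s in range(first, len(x)):
            if text.startswith(p, s):
                hits.add((s + len(p), p))
    return hits


patterns = ["abcabc", "cabb", "abca"]

# abca also occurs in "xxabca" + "bcy" but lies entirely inside x,
# so only abcabc is reported as crossing the boundary.
print(occ_star(patterns, "xxabca", "bcy"))  # {(8, 'abcabc')}
print(occ_star(patterns, "ca", "bb"))       # {(4, 'cabb')}
```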

Enumeration of Occ(Π, (Y.u)^k)

• Reduce to the single pattern case
– If Y.u Y.u is a substring of a pattern in Π,
  • Add a list of the patterns that occur in X.u with covering (Y.u)^2.
• The number of substrings that are squares is O(m), which gives $O(m^2)$ space.

[Figure: a generalized suffix trie (GST) of the patterns for X = Y^k. The node for (Y.u)^2 carries the list {π1, π3, π6}: (Y.u)^2 is a substring of π1, and |Y.u| is a period of π1 (same for π3, π6).]

m is the total length of the patterns in Π.

Our Results

Theorem. The multiple pattern matching problem for a text represented by a collage system 〈 D, S 〉 can be solved in $O((\|D\| + |S|) \cdot \mathrm{height}(D) + m^2 + r)$ time, using $O(\|D\| + m^2)$ space.

If D contains no truncation operation, it can be solved in $O(\|D\| + |S| + m^2 + r)$ time.

m is the total length of the patterns in Π; r is the number of pattern occurrences.

Multi-pattern version of BM type algorithm

Boyer-Moore type algorithm

• "A Boyer-Moore type algorithm for compressed pattern matching", CPM2000
– Y. Shibata, T. Matsumoto, M. Takeda, A. Shinohara, and S. Arikawa
– $O((\|D\| + |S|) \cdot \mathrm{height}(D) + |\pi| \cdot |S| + |\pi|^2 + r)$ time
– $O(\|D\| + |\pi|^2)$ space
– If no truncation, $O(\|D\| + |S| + |\pi| \cdot |S| + |\pi|^2 + r)$ time

Theorem. The BM type algorithm for multiple pattern searching on collage system runs in
– $O((\|D\| + |S|) \cdot \mathrm{height}(D) + m \cdot |S| + m^2 + r)$ time
– $O(\|D\| + m^2)$ space
– If no truncation, $O(\|D\| + |S| + m \cdot |S| + m^2 + r)$ time

r is the number of pattern occurrences; m is the total length of the patterns in Π.

Boyer-Moore Type Algorithm

[Figure: the token sequence S = ・・・ Xi1 Xi2 Xi3 Xi4 Xi5 Xi6 Xi7 ・・・ drawn above the original text ・・・CTTAATTAAGCCTGCTAAGCAT・・・, marking the pattern occurrences and the shift.]

1. Enumerate Occ(Π, S[i].u)
2. Enumerate Occ*(Π, qS[i].u)
3. Calculate the maximal safe shift Δ
  • Calculate Shift(lpps(S[i+1].u), S[i])
  • Calculate the smallest k s.t. $\mathrm{Shift}(\mathrm{lpps}(S[i+1].u), S[i]) \le \sum_{j=0}^{k} |S[i+j].u| - |\mathrm{lpps}(S[i].u)|$
4. i := i + k

Slide annotations on these steps: "Same way as AC type" and "O(m)".

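Calculate Shift(lpps(S[i+1].u), S[i])

For a single pattern π:

$\mathrm{rightmost\_occ}_\pi(w) = \min\{\, l > 0 \mid \pi[|\pi| - l - |w| + 1 : |\pi| - l] = w, \text{ or } \pi[1 : |\pi| - l] \text{ is a suffix of } w \,\}$

For a set Π of patterns:

$\mathrm{rightmost\_occ}_\Pi(w) = \min\{\, \mathrm{rightmost\_occ}_\pi(w) \mid \pi \in \Pi \,\}$

[Figure: the two cases for a shift l against the text. Either w occurs again inside the pattern, ending l characters before its end, or a prefix of the pattern is a suffix of w.]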
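The definition of rightmost_occ translates almost directly into code. The sketch below is a quadratic-time illustration, not the preprocessing used in the talk. It assumes 1-indexed inclusive substrings π[i : j] in the formula, with the occurrence of w ending l characters before the end of π, and maps them to 0-indexed Python slices; the function names are hypothetical.

```python
def rightmost_occ(pattern, w):
    """Smallest shift l > 0 allowed by the definition, for one pattern."""
    m = len(pattern)
    for l in range(1, m + 1):
        end = m - l                  # the candidate alignment ends here
        start = end - len(w)
        # Case 1: w occurs in the pattern, ending l characters before its end.
        if start >= 0 and pattern[start:end] == w:
            return l
        # Case 2: the prefix of length m - l is a suffix of w.
        if w.endswith(pattern[:end]):
            return l
    return m  # not reached for a non-empty pattern: l = m satisfies case 2


def rightmost_occ_set(patterns, w):
    """Minimum of rightmost_occ over all patterns of the set."""
    return min(rightmost_occ(p, w) for p in patterns)


print(rightmost_occ("ababb", "ab"))              # 1: "ab" ends one before the end
print(rightmost_occ("ababb", "xa"))              # 4: prefix "a" is a suffix of "xa"
print(rightmost_occ("ababb", "bb"))              # 5: only the empty prefix fits
print(rightmost_occ_set(["ababb", "bb"], "bb"))  # 1: prefix "b" of "bb" fits
```

Taking the minimum over the set is the conservative choice: a shift is safe only if it is safe for every pattern.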

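Calculate Shift(lpps(S[i+1].u), S[i])

$\mathrm{Shift}(\mathrm{lpps}(S[i+1].u), X) = \mathrm{rightmost\_occ}_\Pi(X.u \cdot \mathrm{lpps}(S[i+1].u))$

$O(\|D\| \cdot \mathrm{height}(D) + m^2)$ time and $O(\|D\| + m^2)$ space

[Figure: an example shift around the token S[i]; the labels "Shift" and "= 3" survive from the slide.]

$\mathrm{Shift}(\mathrm{lpps}(S[i+1].u), S[i]) \le \sum_{j=0}^{k} |S[i+j].u| - |\mathrm{lpps}(S[i].u)|$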

Experimental Result

• Machine: AlphaStation XP1000 (Alpha21264: 667MHz), Tru64 UNIX V4.0F
• Text: Medline (English text), 60.3 Mbyte
• A single pattern was inputted.

[Figure: CPU time (seconds, axis from 0.0 to 0.8) against pattern length (5 to 30) for four methods:
– search for uncompressed texts with the KMP method,
– search for uncompressed texts with Agrep,
– search for texts compressed by BPE with the AC type algorithm,
– search for texts compressed by BPE with the BM type algorithm.]

* Agrep is a search tool developed by Wu and Manber.
* BPE: Byte Pair Encoding

Parallel complexity of compressed pattern matching

Problem to consider

• Instance: A regular collage system 〈 D, S 〉 (one that contains no truncation and repetition) and a set Π = {π1, π2, ..., πs} of patterns.
• Question: Is there any pattern πj that occurs in the text T represented by 〈 D, S 〉 ?

This problem is in LogCFL, so it can be efficiently parallelized: LogCFL ⊆ NC^2.

* LogCFL is the class of problems logspace-reducible to a context-free language.

The space of pushdown store is not bounded

Nondeterministic Turing machine

Idea of the Proof

• Using the lemma of I. Sudborough
– LogCFL = AuxPDA( log n, n^O(1) ): using a log n space worktape in n^O(1) time

* AuxPDA is an auxiliary pushdown automaton.

Show that such an AuxPDA M accepts an input string if and only if there is some pattern that occurs in the text represented by 〈 D, S 〉.

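AuxPDA M

[Figure: the AuxPDA M reads the input tape ¢ π1 # π2 # ・・・ # πs & Xi1 Xi2 ・・・ Xin $ and uses a pushdown store. To decide Occ(πj, Xik.u) for a token Xik, it tests conditions of the form Xik.u[ l ] = πj[ t ].]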

Conclusion

• Collage system is a formal system
– Texts compressed by various compression methods can be expressed by collage system.
• Two types of algorithm for multiple pattern matching on collage system
– AC type
  • $O((\|D\| + |S|) \cdot \mathrm{height}(D) + m^2 + r)$ time
  • $O(\|D\| + m^2)$ space
– BM type
  • O() time and O() space
• Compressed pattern matching can be efficiently parallelized in principle.
– For regular collage systems
– Not yet for general collage systems