Hannu Peltola Jorma Tarhio

Aalto University

Finland

Variations of Forward-SBNDM

Aug. 29, 2011

Aims

Tuning algorithms for exact string matching.

Studying the effect of simultaneous 2-byte read.

SBNDM: Simple Backward Nondeterministic DAWG Matching

SBNDM [18] is a simplification of BNDM [17]. Both are bit-parallel algorithms.

Text T = t1...tn, pattern P = p1...pm.

At each alignment window of P in T, scan T from right to left until the suffix of the window is not a factor of P or an occurrence of P is found.

Shift of SBNDM

No factor: m

P found: 1

Else: next alignment starts at the last factor

SBNDM, example

P = banana, T = antanabadbanana...

alignment (window in brackets): [antana]badbanana

factors read from right to left: a, na, ana

not a factor: tana

next alignment: ant[anabad]banana

not a factor: d

next alignment: antanabad[banana]
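
The window scan and the three shift rules can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `sbndm`, the 0-based indexing, and the optional `trace` list are assumptions made for the example, and Python integers stand in for machine words, so the pattern length is not bounded by a word size here.

```python
def sbndm(pattern, text, trace=None):
    """Return the 0-based start positions of all occurrences of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or n < m:
        return []
    # Occurrence vectors: bit (m-1-i) of B[c] is set when pattern[i] == c.
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << (m - 1 - i))
    hits = []
    start = 0                            # left end of the current window
    while start <= n - m:
        if trace is not None:
            trace.append(start)          # record the alignment (illustration only)
        j = start + m - 1                # begin at the right end of the window
        D = B.get(text[j], 0)
        while D != 0 and j > start:      # extend the suffix leftwards while it is a factor
            j -= 1
            D = (D << 1) & B.get(text[j], 0)
        if D != 0:
            hits.append(start)           # the whole window was read: P found, shift 1
            start += 1
        else:
            start = j + 1                # text[j] broke the factor; restart at the last factor
    return hits


if __name__ == "__main__":
    alignments = []
    print(sbndm("banana", "antanabadbanana", alignments), alignments)
```

For P = banana and T = antanabadbanana the sketch visits the alignments starting at positions 0, 3, and 9 and reports the occurrence at position 9, in line with the example above. When the rightmost character of the window is already not a factor, `j + 1` equals `start + m`, which is the "No factor: m" case.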

SBNDMq

SBNDMq [6] is a tuned version of SBNDM.

Processing of an alignment starts with checking a q-gram.

Let q = 4. Consider an alignment at antana. Instead of testing the four suffixes a, na, ana, tana, only tana is tested.

Testing is done in a fast loop.
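
As a rough sketch of the q-gram test, the state vector of a whole q-gram can be formed in one expression by AND-ing shifted occurrence vectors, with the leftmost character unshifted. The helper names `occurrence_vectors` and `qgram_state` are hypothetical, and the bit layout (bit m-1-i for pattern position i) is the same assumption as in the usual BNDM description.

```python
def occurrence_vectors(pattern):
    """Bit (m-1-i) of B[c] is set when pattern[i] == c."""
    m = len(pattern)
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << (m - 1 - i))
    return B


def qgram_state(B, qgram):
    """State vector after reading the q-gram (q >= 1); zero means it is not a factor."""
    D = -1                                # all ones, narrowed by the first AND
    for k, c in enumerate(qgram):         # k = 0 is the leftmost character
        D &= B.get(c, 0) << k
    return D


if __name__ == "__main__":
    B = occurrence_vectors("banana")
    for gram in ("tana", "nana", "anan"):
        print(gram, format(qgram_state(B, gram), "06b"))
```

With P = banana the 4-gram tana gives a zero state vector, so the alignment at antana is rejected after a single test, while nana and anan give nonzero vectors.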

Forward-SBNDM

Forward-SBNDM (FSB for short) by Faro & Lecroq [7] is a lookahead version of SBNDM2.

Both FSB and SBNDM2 read a 2-gram x1x2 before a factor test.

x1x2 is matched with the end of P in SBNDM2.

Only x1 is matched with the end of P in FSB, and x2 is a lookahead character following the current alignment.

FSB is faster than SBNDM2 for large alphabets.

Generalization of FSB: FSB(q,f)

FSB(q,f) (= Forward-SBNDM(q,f)) is SBNDMq with f lookahead characters, f = 0, 1, ..., q-1.

FSB(2,1) = FSB and FSB(q,0) = SBNDMq.

Motivation: SBNDMq works well on modern processors also for q>2.

FSB(q,f)

Let UV be a q-gram, where |V| = f.

After reading UV there are 3 alternatives:

i. If U is a suffix of P, reading continues leftwards.

ii. Else if UV is a factor of P, reading continues leftwards.

iii. Else the state vector is zero and P is shifted m-q+f+1 positions (f positions more than in SBNDMq).

Occurrence vectors in FSB(q,2)

Example: P = banana

SBNDMq: B[n] = 00001010

FSB(q,2): B[n] = 00101011, B[a] = 01010111, B[x] = 00000011

The two rightmost bits of each FSB(q,2) vector are the extra bits.
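
The vectors on this slide can be reproduced with a short Python sketch. It assumes that the FSB(q,f) vector of a character is its SBNDMq vector shifted left by f positions with the f lowest bits set for every character, which is what the values above suggest; the function name `fsb_vectors` and the explicit `alphabet` argument are hypothetical.

```python
def fsb_vectors(pattern, f, alphabet):
    """Occurrence vectors with f extra low bits set for every character."""
    m = len(pattern)
    extra = (1 << f) - 1                       # the f extra bits
    B = {c: extra for c in alphabet}           # characters outside P keep only the extra bits
    for i, c in enumerate(pattern):
        B[c] = B.get(c, extra) | (1 << (m - 1 - i + f))
    return B


if __name__ == "__main__":
    for f in (0, 2):
        B = fsb_vectors("banana", f, "abnx")
        print(f, {c: format(v, "08b") for c, v in sorted(B.items())})
```

With f = 0 the function returns the plain SBNDMq vectors, so the two rows of the example come from the same routine.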

State vectors in FSB(q,2) for q=4

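4-gram nanx, occurrence vectors of its characters from right to left:

x 00000011
n 00101011
a 01010111
n 00101011

Resulting state vector: 00001000

4-gram   State vector   Conclusion
nanx     00001000       na is a suffix of P
xana     00000000       not a factor
anan     01000000       factor of P

Note that nanx itself is not a factor of P.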
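
The state vector of a 4-gram can be checked with a small Python sketch. It assumes the FSB(q,2) vectors for P = banana shown earlier and the usual update D = (D << 1) & B[c] when reading leftwards, written here as one AND over shifted vectors; `fsb_vectors` and `state_vector` are hypothetical helper names.

```python
def fsb_vectors(pattern, f):
    """Return (B, extra): vectors shifted left by f, with the f extra low bits set."""
    m = len(pattern)
    extra = (1 << f) - 1
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, extra) | (1 << (m - 1 - i + f))
    return B, extra


def state_vector(B, extra, qgram):
    """AND of the shifted vectors (q >= 1); the leftmost character is unshifted."""
    D = -1
    for k, c in enumerate(qgram):
        D &= B.get(c, extra) << k          # unknown characters carry only the extra bits
    return D


if __name__ == "__main__":
    B, extra = fsb_vectors("banana", 2)
    for gram in ("nanx", "xana"):
        print(gram, format(state_vector(B, extra, gram), "08b"))
```

For nanx the result is 00001000: the 4-gram is accepted because na is a suffix of P and nx can be lookahead characters, although nanx is not a factor of P. For xana the result is zero.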

Benefits / drawbacks of lookahead characters and extra bits

Benefits

• Longer shifts → more speed

• Combined suffix/factor test

Drawback

• More q-grams accepted → less speed

Greedy skip loop for SBNDM2 (GSB2 = Greedy-SBNDM2)

Factor tests of two 2-grams are done in one round.

Let B2[x,y] denote the combined occurrence vector of characters x and y. B2[x,y] = B[x] & (B[y]<<1)

next:
    D ← B2[t_i, t_{i+1}]
    if D = 0 then
        if B2[t_{i+m-1}, t_{i+m}] = 0 then
            i ← i + 2*m - 2
            goto next
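
The greedy skip loop can be sketched in Python as below. The slide only shows the case where both 2-grams fail, so the rest of the loop (continuing leftwards as in SBNDM2 when a 2-gram is a factor) is an assumption of this sketch, as are the function name `gsb2`, the 0-based window end `e`, and computing B2 on the fly instead of from a table.

```python
def gsb2(pattern, text):
    """Sketch of SBNDM2 with a greedy skip loop; returns 0-based match positions."""
    m, n = len(pattern), len(text)
    if m < 2:
        raise ValueError("this sketch needs a pattern of length >= 2")
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << (m - 1 - i))

    def B2(x, y):
        # Combined occurrence vector of the 2-gram xy.
        return B.get(x, 0) & (B.get(y, 0) << 1)

    hits = []
    e = m - 1                               # index of the last character of the window
    while e < n:
        D = B2(text[e - 1], text[e])
        if D == 0:
            e += m - 1                      # first 2-gram is not a factor
            if e >= n:
                break
            D = B2(text[e - 1], text[e])
            if D == 0:
                e += m - 1                  # second one fails too: 2*m - 2 in total
                continue
        s = e - m + 1                       # left end of the window
        j = e - 1                           # leftmost character read so far
        while D != 0 and j > s:
            j -= 1
            D = (D << 1) & B.get(text[j], 0)
        if D != 0:
            hits.append(s)                  # occurrence found, shift by 1
            e += 1
        else:
            e = j + m                       # next window starts at the last factor
    return hits


if __name__ == "__main__":
    print(gsb2("banana", "antanabadbanana"))
```

Testing the second 2-gram before returning to the top of the loop is what makes the skip greedy: two failed factor tests cost one loop round.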

2-byte read

Read two characters (= 2 bytes = 16 bits) in one instruction (in a skip loop).

Well suited to q-gram algorithms with even q.

For the experiments we made two versions of the algorithms:

• Standard (1-byte read)

• b-version using 2-byte read

2-byte read (cont.)

Advantage: part of the computation can be moved to the preprocessing phase.

• Example: B2[x,y] = B[x] & (B[y]<<1)

The speed-up factor can even exceed 2.

Drawback: extra 0.1 ms for preprocessing.
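
The idea of moving B2 into preprocessing can be illustrated with a table of 2^16 entries indexed by the 16-bit value of two adjacent bytes. The Python sketch below only shows the table layout and the lookup; it says nothing about speed, since the gain comes from a single 16-bit load in compiled code. The little-endian byte order and the names `build_b2_table` and `read2` are assumptions of the example.

```python
import struct


def build_b2_table(pattern):
    """Return (B, table) with table[x | (y << 8)] == B[x] & (B[y] << 1) for bytes x, y."""
    m = len(pattern)
    B = [0] * 256
    for i, c in enumerate(pattern):          # iterating over bytes yields integers
        B[c] |= 1 << (m - 1 - i)
    table = [0] * 65536
    for y in range(256):
        shifted = B[y] << 1
        if shifted == 0:
            continue                          # every entry with this second byte stays zero
        for x in range(256):
            table[x | (y << 8)] = B[x] & shifted
    return B, table


def read2(text, i):
    """Read text[i] and text[i+1] as one little-endian 16-bit value."""
    return struct.unpack_from("<H", text, i)[0]


if __name__ == "__main__":
    B, table = build_b2_table(b"banana")
    text = b"antanabadbanana"
    print([format(table[read2(text, i)], "06b") for i in range(len(text) - 1)])
```

Each lookup replaces two table reads, a shift, and an AND with one indexed read, at the price of filling 65536 entries during preprocessing.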

4-byte read?

Many border crossings occur => slowdown

Tables with 2^32 entries are too big in practice

Experimental results/KJV Bible

In the recent comparison by S. Faro and T. Lecroq, "The Exact String Matching Problem: a Comprehensive Experimental Evaluation" (2010), the algorithms EBOM and Hash3 were the fastest on the Bible text for m = 4, ..., 20.

m        4       8       16
Hash3    14.6    5.42    2.79
EBOM     6.53    3.87    2.91

KJV: EBOM & Hash3 (on ThinkPad X61s)

[Figure: speed in GB/s (axis 0 to 4) against pattern length m (4 to 20) for EBOM and Hash3.]

KJV: EBOMb & Hash3b (with 2-byte read) added

[Figure: speed in GB/s (axis 0 to 4) against pattern length m (4 to 20) for EBOM, EBOMb, Hash3, and Hash3b.]

KJV: SBNDM2b = FSB(2,0)b added

[Figure: speed in GB/s (axis 0 to 4) against pattern length m (4 to 20) for EBOM, EBOMb, Hash3, Hash3b, and FSB(2,0)b.]

KJV: GSB2b added

[Figure: speed in GB/s (axis 0 to 4) against pattern length m (4 to 20) for EBOM, EBOMb, Hash3, Hash3b, FSB(2,0)b, and GSB2b.]

KJV: FSB(4,i)b added, i = 0,1,2

[Figure: speed in GB/s (axis 0 to 4) against pattern length m (4 to 20) for EBOM, EBOMb, Hash3, Hash3b, FSB(2,0)b, GSB2b, FSB(4,0)b, FSB(4,1)b, and FSB(4,2)b.]

KJV: Speed-up factors of 2-byte read

Algorithm   Speed-up factor
GSB2        1.32
FSB(2,0)    1.34
FSB(2,1)    1.24
FSB(4,0)    1.72
FSB(4,1)    2.15
FSB(4,2)    2.03
Hash3       1.05
EBOM        1.17

Other experiments

DNA and binary data were also tested.

• The gain from lookahead characters or the greedy loop was smaller than with the Bible data.

The gain from 2-byte read was smaller with 64-bit code than with 32-bit code.

Conclusions

Two new algorithms were presented:

• FSB(q,f)

• GSB2

The new algorithms are faster than earlier algorithms on English data:

• GSB2 for m = 4, …, 8

• FSB(q,f) for m = 8, …, 20

2-byte read makes most string algorithms faster.

Web site for practical speed comparison

cse.aalto.fi/stringmatching