Efficient algorithms for (δ,γ,α)-matching
Szymon Grabowski, Computer Engineering Dept., Tech. Univ. of Łódź, Poland
Kimmo Fredriksson, Dept. of Computer Science, Univ. of Joensuu, Finland
PSC, Prague, August 2006
Problem setting
String matching in its classic form: given text T = t0t1 ... tn–1 and pattern P = p0p1 ... pm–1 over a finite alphabet Σ of size σ, report all occurrences of P in T.
Such a simple problem variant (exact matching) is not very useful for many applications.
For example, it is of little use in music information retrieval (MIR) and molecular biology.
Several approximate matching models have thus been developed...
K.Fredriksson & Sz. Grabowski, Efficient agorithms for (δ,γ,α)-matching
Models & applications – music information retrieval
We allow classes of characters: the classes are continuous intervals (of equal width, 2δ+1, for all pattern positions).
This corresponds to handling small distortions of the melody (singer / whistler unskilled or under the influence...).
A limit γ (< mδ) on the sum of the individual errors.
Gaps also allowed – this is to skip ornamentation (esp. in classical music). We assume all gaps are in [0, α] range.
Transposition invariance – the key of the melody can be arbitrary, i.e. everything can be shifted up or down by a fixed value. (Future work, hopefully...)
Problem we consider here
(δ,γ,α)-matching
Two symbols a, b ∈ Σ δ-match (we write a =δ b) iff |a – b| ≤ δ.
We say that a pattern P (δ,γ,α)-matches the text substring t_{i_0} t_{i_1} ... t_{i_{m–1}} if p_j =δ t_{i_j} for j ∈ {0 ... m–1}, where 0 < i_{j+1} – i_j ≤ α+1, and
$\sum_{j=0}^{m-1} |p_j - t_{i_j}| \le \gamma$.
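To make the definition concrete, here is a minimal reference checker that simply tries every admissible alignment. It is a sketch only: symbols are assumed to be integers (e.g. MIDI pitches), the function name is made up for illustration, and the exhaustive search is not one of the algorithms discussed in the talk.

```python
def delta_gamma_alpha_match_ends(text, pattern, delta, gamma, alpha):
    """Return the sorted end positions j at which `pattern`
    (delta, gamma, alpha)-matches some subsequence of `text` ending at text[j].

    Exhaustive search over all alignments: exponential in the worst case,
    meant only as an executable restatement of the definition.
    """
    n, m = len(text), len(pattern)
    ends = set()

    def extend(i, pos, err):
        # pattern[0..i] is matched, pattern[i] is aligned with text[pos],
        # and err is the error sum accumulated so far.
        if i == m - 1:
            ends.add(pos)
            return
        # The next aligned position may skip 0..alpha text symbols.
        for nxt in range(pos + 1, min(n, pos + alpha + 2)):
            d = abs(pattern[i + 1] - text[nxt])
            if d <= delta and err + d <= gamma:
                extend(i + 1, nxt, err + d)

    for start in range(n):
        d = abs(pattern[0] - text[start])
        if d <= delta and d <= gamma:
            extend(0, start, d)
    return sorted(ends)
```

For example, with text [60, 62, 63, 65, 70] and pattern [60, 63, 66], δ = 1, γ = 1 and α = 1 give a single match ending at position 3 (alignment 60, 63, 65, skipping the 62).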
Previous work on similar models
(δ,α)-matching:
Crochemore et al., 2002: O(mn) time (worst, avg, and best case).
Cantone et al., 2005a: also O(mn) in every case, but finds not only the end positions of the occurrences but also all the matching sequences.
Cantone et al., 2005b: O(n) on avg (for constant α) while retaining O(mn) in the worst case.
Navarro & Raffinot, 2003; Cantone et al., 2005b: nondeterministic finite automaton with O(n mα / w) worst case time.
Along these lines: Fredriksson & Grabowski, 2006: more compact automaton with O(n m log(α) / w) worst case time.
Fredriksson & Grabowski, 2006: bit-parallel alg with O(nδ + (n / w) m) worst case time.
Surprisingly little work specifically on the (δ,γ,α)-matching problem...
Crochemore et al., 2002: dynamic programming alg, runs in O(mn) worst-case time. Uses a min-queue.
Of course, a brute-force DP alg is also possible: O(mnα) time, but it may be faster in practice than the more sophisticated alg above (as α is usually small).
Our contributions
We improve the basic dynamic programming based algorithm to run in O(nα δ/σ) average time.
We propose a simple sparse DP alg with O(n) avg time and O(min(mn, |M|α)) worst-case time,
where M = { (i,j) | pi =δ tj }.
We develop a bit-parallel algorithm that runs in O(nδ + mn log γ / w) worst case time.
Its avg time complexity is close to O(n log γ α (δ/σ) / w + n), assuming small α.
Basic dynamic programming
Let us have matrix D, with each cell (i, j) corresponding to the search state of pattern prefix p0 ... pi in text T.
More precisely, a γ-bounded value of Di,j will denote that p0 ... pi matches T at the end position j.
Brute-force computation in O(mn α) time and O(n) space (enough to store only the curr and prev row).
We can also proceed column-wise: same time but O(αm) space instead.
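A minimal sketch of the brute-force row-wise computation, keeping only the previous and current row. Assumptions made here for illustration: symbols are integers, cells that do not represent a match hold infinity, and the function name is hypothetical.

```python
INF = float("inf")

def dp_match_ends(text, pattern, delta, gamma, alpha):
    """Brute-force DP: O(m n alpha) time, O(n) space (two rows).

    D[i][j] is the smallest error sum of pattern[0..i] matched with its last
    symbol aligned to text[j], or INF if there is no such match within gamma.
    """
    n, m = len(text), len(pattern)

    def cost(i, j):
        d = abs(pattern[i] - text[j])
        return d if d <= delta else INF

    prev = [cost(0, j) for j in range(n)]          # row 0
    for i in range(1, m):
        curr = [INF] * n
        for j in range(i, n):
            c = cost(i, j)
            if c == INF:
                continue
            # Best predecessor among the alpha+1 cells to the left in row i-1.
            best = min(prev[max(0, j - alpha - 1):j], default=INF)
            if best + c <= gamma:
                curr[j] = best + c
        prev = curr
    return [j for j in range(n) if prev[j] <= gamma]
```

The slice over `prev` is the O(α) factor that the later slides try to avoid.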
Cut-off trick for improving the avg time (Ukkonen, 1985; Cantone et al., 2005)
Usually, calculating all the matrix cells is overkill.
Observation: if Di...m–1,j–α...j > γ then Di+1...m–1,j+1 > γ.
Read: it’s not so easy to get out of a ‘dead zone’.
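One way to turn the observation into code is sketched below: proceed column-wise, remember for each of the last α+1 columns the highest row that still holds a value ≤ γ, and never compute more than one row above the highest of those. The bookkeeping (the `tops` deque, the returned cell counter) is an assumption of this sketch, not necessarily how the authors implemented DP-CO.

```python
from collections import deque

INF = float("inf")

def dp_cutoff_match_ends(text, pattern, delta, gamma, alpha):
    """Column-wise DP with cut-off. Returns (end positions, cells evaluated)."""
    n, m = len(text), len(pattern)
    window = deque(maxlen=alpha + 1)  # the last alpha+1 columns of D
    tops = deque(maxlen=alpha + 1)    # per column: highest row with D <= gamma, else -1
    ends, cells = [], 0
    for j in range(n):
        # A cell in row i needs a live cell in row i-1 within the previous
        # alpha+1 columns, so rows above `limit` cannot be <= gamma.
        limit = min(m - 1, max(tops, default=-1) + 1)
        col = [INF] * (limit + 1)
        top = -1
        for i in range(limit + 1):
            cells += 1
            d = abs(pattern[i] - text[j])
            if d > delta:
                continue
            if i == 0:
                best = 0
            else:
                best = min((c[i - 1] for c in window if i - 1 < len(c)),
                           default=INF)
            if best + d <= gamma:
                col[i] = best + d
                top = i
        window.append(col)
        tops.append(top)
        if top == m - 1:
            ends.append(j)
    return ends, cells
```

On a text that never δ-matches the first pattern symbol, only row 0 is ever touched, so the cell count drops from mn to n.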
DP-CO, cont’d
The avg time is O(n (αδ/σ)2). (Pessimistic analysis, we weren’t able to take the gamma restriction
into account.)
The worst case remains O(mnα), but as in (Crochemore et al., 2002) it can be improved to O(mn). The difference is that we handle m queues, as we proceed column-wise.
Simple algorithm (ingenious name, eh?)
In a few words: naïve brute force DP algorithm but applied only locally.
We work on lists Li, corresponding to individual rows.
We start with L0 = { j | tj =δ p0 } (obtained in O(n) time).
For i = 1 ... m–1: Li = { j | tj =δ pi AND Di–1,j’ + |pi – tj| ≤ γ AND 0 < j – j’ ≤ α+1 for some j’ ∈ Li–1 }
We put each j only once into Li (if there are many j’ that can cause it, we choose the one that minimizes the new Di,j).
Obtaining list Li takes O(α|Li–1|) time.
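The list construction above can be sketched as follows. Each list is stored as a dictionary from text position to the best error sum seen so far; that representation, like the function name, is an assumption of this sketch (the slides only say "lists"). This is the O(|M|α) worst-case version, without the min-queue refinement.

```python
def simple_sparse_match_ends(text, pattern, delta, gamma, alpha):
    """Sparse DP: process only the positions that are still alive in each row."""
    n, m = len(text), len(pattern)

    # L0: positions that delta-match pattern[0] (and stay within gamma).
    current = {}
    for j in range(n):
        d = abs(pattern[0] - text[j])
        if d <= delta and d <= gamma:
            current[j] = d

    for i in range(1, m):
        nxt = {}
        for jp, err in current.items():
            # Each live cell (i-1, jp) can only feed the next alpha+1 positions.
            for j in range(jp + 1, min(n, jp + alpha + 2)):
                d = abs(pattern[i] - text[j])
                if d <= delta and err + d <= gamma:
                    # Keep each j once, with the smallest error sum.
                    if err + d < nxt.get(j, gamma + 1):
                        nxt[j] = err + d
        current = nxt
        if not current:        # nothing left to extend
            break
    return sorted(current)
```

Building each new list costs O(α) per entry of the previous list, which is where the O(α|Li–1|) bound comes from.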
Simple algorithm, cont’d
Complexity
All lists have length |M| in total in the worst case, which implies O(|M|α) worst case time.
But: (i) on average this is much better, (ii) we can somewhat improve the worst case.
Simple algorithm, cont’d
Average case analysis
The length of list L0 is O(n δ/σ) on avg.
Hence L1 is computed in O(n α δ/σ) avg time.
But its avg length is only O(n (δ/σ) (α δ/σ)).
...
In general, computing Li takes O(n (α δ/σ)^i) avg time.
The total time is the sum of m such components.
Note that α, δ, σ are fixed for a given problem instance. In other words, α δ/σ can be considered a constant.
If the constant α (2δ+1)/σ is less than 1, we have a geometric series with O(n) sum.
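Spelled out, the geometric-series argument reads as follows. The symbol c is introduced here only for brevity and stands for the constant from the line above; the i = 0 term covers building L0.

```latex
\sum_{i=0}^{m-1} O\!\left(n\,c^{\,i}\right)
  \;\le\; O(n)\sum_{i=0}^{\infty} c^{\,i}
  \;=\; O\!\left(\frac{n}{1-c}\right)
  \;=\; O(n),
  \qquad c = \frac{\alpha(2\delta+1)}{\sigma} < 1 .
```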
Simple algorithm, cont’d
Improving the worst case
Idea: avoid brute-force handling of overlapping windows of α+1 size.
We make use of a min-queue (Gajewska & Tarjan, 1986), similarly to the concept from (Crochemore et al., 2002).
The queue always keeps up to α+1 integers, namely the error sums corresponding to the sliding window area in the previous row.
For each processed cell, 0 or 1 values are inserted at the front of the queue (O(1) time) and from 0 to α+1 values are deleted from the tail. But we can’t remove more than we’ve inserted, hence O(1) amortized cost per cell.
This improves the worst-case time complexity to O(min(mn, |M|α)).
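A min-queue for a sliding window can be sketched with a monotone deque. This is the textbook formulation, shown only to illustrate the amortized O(1) argument; the class and method names are made up, and the front/tail roles are mirrored relative to the description above (new values enter at the back here).

```python
from collections import deque

class MinQueue:
    """Sliding-window minimum with O(1) amortized cost per operation."""

    def __init__(self):
        # (index, value) pairs; values increase from front to back,
        # so the front always holds the current minimum.
        self._items = deque()

    def push(self, index, value):
        # Entries that are not smaller than the new value can never be
        # the minimum again, so drop them. Each entry is dropped at most once.
        while self._items and self._items[-1][1] >= value:
            self._items.pop()
        self._items.append((index, value))

    def expire(self, min_index):
        # Forget entries that have slid out of the window.
        while self._items and self._items[0][0] < min_index:
            self._items.popleft()

    def minimum(self, default=float("inf")):
        return self._items[0][1] if self._items else default


def sliding_window_minima(values, width):
    """Minimum of each window values[j-width+1 .. j], for every j."""
    queue, out = MinQueue(), []
    for j, v in enumerate(values):
        queue.push(j, v)
        queue.expire(j - width + 1)
        out.append(queue.minimum())
    return out
```

In the DP, `values` would be the error sums of the previous row and `width` would be α+1.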
Bit-parallelism technique (in stringology)
Baeza–Yates (1989) noticed that CPU registers are usually longer than 1 bit...
And he made use of this fact.
In O(1) time we can perform operations like logical and (&), or (|), shifts (<<, >>) etc. on a whole machine word (usu. 32 or 64 bits).
Nowadays, bit-parallelism is a very popular technique in string matching algorithms, in theory and in practice.
Also useful for many approximate matching variants.
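As a familiar illustration of the idea, here is the classic Shift-And scheme for exact matching: one bit per pattern position, and every text symbol updates all positions with a shift and an AND. It is included only as an example of bit-parallelism, not as part of the (δ,γ,α) algorithms. Python integers are unbounded, so the sketch ignores the machine-word limit m ≤ w that a C implementation would face.

```python
def shift_and_search(text, pattern):
    """Return the end positions of exact occurrences of `pattern` in `text`."""
    m = len(pattern)
    # masks[ch] has bit i set iff pattern[i] == ch.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)

    accept = 1 << (m - 1)   # bit of the last pattern position
    state = 0               # bit i set iff pattern[0..i] matches a suffix of the text read so far
    ends = []
    for j, ch in enumerate(text):
        # Advance every partial match by one symbol and start a new one.
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            ends.append(j)
    return ends
```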
Bit-parallel dynamic programming
Modified DP alg: let the cells of D be chunks of O(log γ) bits. We’ll be able to compute O(w / log γ) cells in parallel.
More precisely, each cell will use l + 1 bits, where l = ⌈log2(2γ+1)⌉.
Error sum zero will be encoded as 2^(l–1) – (γ+1); γ+1 (the lowest ‘illegal’ value) will thus be 2^(l–1)
(old trick, e.g., Fredriksson & Navarro, 2004; Crochemore et al., 2005).
This representation can solve 3 issues:
(i) checking in parallel if some counters exceed γ,
(ii) parallel handling of counter overflows,
(iii) computing pairwise minima over two sets of counters in parallel.
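The biased encoding can be illustrated with a small sketch covering issues (i) and (ii). Assumptions of this sketch: each counter lives in a field of b bits whose top bit is the "exceeds γ" flag, with b the smallest width such that 2^(b–1) ≥ γ+1; the spare extra bit per cell mentioned above is left out; all function names are hypothetical. Python integers stand in for machine words.

```python
def field_bits(gamma):
    """Smallest b with 2**(b-1) >= gamma + 1; the top bit is the overflow flag."""
    return gamma.bit_length() + 1

def pack_counters(sums, gamma):
    """Pack error sums, encoding 0 as 2**(b-1) - (gamma+1)."""
    b = field_bits(gamma)
    zero = (1 << (b - 1)) - (gamma + 1)
    word = 0
    for k, s in enumerate(sums):
        word |= (zero + min(s, gamma + 1)) << (k * b)
    return word

def pack_errors(errors, gamma):
    """Pack plain (unbiased) per-cell errors, each clamped to gamma + 1."""
    b = field_bits(gamma)
    word = 0
    for k, e in enumerate(errors):
        word |= min(e, gamma + 1) << (k * b)
    return word

def flag_mask(count, gamma):
    """Word with only the top (flag) bit of every field set."""
    b = field_bits(gamma)
    return sum(1 << (k * b + b - 1) for k in range(count))

def add_errors(counters, errors, flags):
    """Add errors to all counters at once, without carries between fields.

    Flags that are already set are masked out before the addition and
    restored afterwards, so an overflowed counter stays overflowed.
    """
    overflowed = counters & flags
    return ((counters & ~flags) + errors) | overflowed

def exceeds_gamma(counters, count, gamma):
    """Per field: is the flag bit set, i.e. is the error sum above gamma?"""
    b = field_bits(gamma)
    return [bool((counters >> (k * b + b - 1)) & 1) for k in range(count)]
```

With γ = 4 the fields are 4 bits wide and zero is stored as 3, so a counter reaches 8 (flag bit set) exactly when its error sum reaches 5 = γ+1. Testing all counters at once is then a single AND with the flag mask.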
BP-DP, cont’d
[Figure: tiling the DP matrix with C × 1 vectors, where C = w / (l+1) (here C = 8). The dark gray cell of the current tile depends on the light gray cells of the two tiles in the previous row (α = 4).]
We are in row i. Thanks to preprocessing, we know the delta-errors between all chars in the current tile (C cells) and P[i].
Problem: how to calculate the new values of Di,*?
BP-DP, cont’d
Solution #1. Naïve shifts (chunk by chunk) and minimizations, with an O(α) factor.
Solution #2. Similar, but with a halving technique: first shift by α / 2 counter positions, then by α / 4 etc., performing the minimization at each step. It yields an O(log α) time factor.
Solution #3. Use a precomputed function. This is what we choose, as it gives O(1) time for an O(w)-bit chunk (in practice some w’, e.g. w’ = w / 4).
Pre-empting the computation in the BP-DP search
The cut-off trick can again be used, with some modification, since now we calculate C cells in parallel. (Read: the picture on the cut-off slide (slide 9) will be less jagged and the trick is somewhat less efficient here.)
Avg search time is (upper bound estimation, maybe not tight): O((n / C) α δ/σ + n).
How to find minima in parallel for the O(w / log γ)-sized chunks
Precomputing as usual (ugly...) or an old trick (Paul & Simon, 1980)
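A common subtract-and-mask formulation of the field-wise minimum is sketched below. Whether it coincides in detail with the variant of the cited trick that the authors had in mind is an assumption. It requires the top bit of every field to be clear in both operands; `b` is the field width, and the helper names are made up.

```python
def pack_fields(values, b):
    """Pack small non-negative integers into consecutive b-bit fields."""
    word = 0
    for k, v in enumerate(values):
        word |= v << (k * b)
    return word

def unpack_fields(word, count, b):
    return [(word >> (k * b)) & ((1 << b) - 1) for k in range(count)]

def parallel_min(x, y, count, b):
    """Field-wise minimum of two packed words.

    Precondition: every field value is below 2**(b-1), i.e. top bits are clear.
    """
    high = sum(1 << (k * b + b - 1) for k in range(count))
    # Setting the top bit of each x field guarantees x_k + 2**(b-1) > y_k,
    # so the subtraction never borrows across fields; the top bit of the
    # difference survives exactly where x_k >= y_k.
    ge = ((x | high) - y) & high
    # Spread each surviving top bit over its whole field.
    mask = (ge - (ge >> (b - 1))) | ge
    # Take y where x_k >= y_k, and x elsewhere.
    return (y & mask) | (x & ~mask)
```

The whole minimum costs a constant number of word operations, regardless of how many counters fit into the word.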
Preprocessing in BP-DP
Preprocessing is simple. We build a helper bit-matrix V such that Vi,j = |pi – tj| if pi =δ tj, and γ+1 otherwise.
Note that the number of rows in V can be reduced to the # of unique symbols in P (why store identical rows?), which is σP. We call this terse representation V’.
First we fill V’ with γ+1 values in O((n / C) σP) time. Then we scan T and set 0..δ in at most 2δ+1 rows of V’ (those that δ-match the current char from T). Worst case time of the latter phase: O(nδ). Less on avg.
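The two preprocessing phases can be sketched as follows, with one row per distinct pattern symbol. For readability the rows are plain Python lists rather than packed bit-vectors, so the n / C factor of the fill phase is not reproduced; the function name and the clamping of stored values to γ+1 are assumptions of this sketch.

```python
def build_v_table(text, pattern, delta, gamma):
    """Build V': for every distinct pattern symbol c, a row with
    |c - t_j| where c delta-matches t_j, and gamma + 1 elsewhere."""
    symbols = set(pattern)
    # Phase 1: fill every row with the 'illegal' value gamma + 1.
    table = {c: [gamma + 1] * len(text) for c in symbols}
    # Phase 2: scan the text once; each symbol touches at most 2*delta + 1 rows.
    for j, t in enumerate(text):
        for c in range(t - delta, t + delta + 1):
            if c in symbols:
                # Clamp so that a stored value never exceeds gamma + 1.
                table[c][j] = min(abs(c - t), gamma + 1)
    return table
```

For text [60, 62, 63] and pattern [60, 63, 60] with δ = 1 and γ = 1, the table has two rows: [0, 2, 2] for symbol 60 and [2, 1, 0] for symbol 63.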
Lazy preprocessing
Note that in the previous scheme (with cut-off) the avg search time may be even O(n), but the preprocessing is typically superlinear (even if not by much).
To avoid costly preprocessing in the case when search will be fast (i.e. the cut-off trick works efficiently), we can interweave the preprocessing and search phases.
This leads to O((n / C) α δ/σ + n) avg preprocessing time (pessimistic analysis), i.e. it matches the avg search time.
Multiple patterns
The bit-par alg has relatively high preprocessing cost: O(nδ + σP n / (w / log γ)) in the worst case.
If, however, we are about to search for r patterns, the search time is multiplied by r, but the good news is that the preprocessing cost increases much more mildly: to O(nδ + σP n / (w / log γ) + rm), where σP is now the # of distinct symbols in the whole pattern set.
Practical (well-known) trick for r patterns if r is small compared to σ / δ: superimpose the patterns (then verify).
Test methodology
All algorithms implemented in C, compiled with icc 9.0.
Test machine: P4 2.4 GHz, 512 MB, running GNU/Linux 2.4.20.
Avg times reported over 100 trials (randomly extracted patterns).
Text files:
1. Concatenation of 7543 music pieces (MIDI, stripped of anything except pitch values), totalling 1.8 MB. Alphabet: [0..127] range, but far from random: only 55 values actually occur, and the 6 most frequent symbols alone cover ~50% of the whole text.
2. Uniformly random data in 0..127 range.
Compared algorithms
BP Cut-off: bit-parallel dynamic programming with cut-off (without the lazy preprocessing).
BP Filter: the (δ,α)-matching version of BP Cut-off (Fredriksson & Grabowski, 2006)
used as a filter, and DP-CO used for verifications.
DP Cut-off: dynamic programming with cut-off.
Simple: simple sparse DP (in the O(|M|α) worst case time version).
Experimental results, MIDI δ = 1, γ = 4, α = 1
Experimental results, MIDI δ = 4, γ = 16, α = 2
Experimental results, random δ = 4, γ = 16, α = 2
Conclusions
Bit-parallelism works well also for the (δ,γ,α) search problem...
...But it works even better if regions of text where matches cannot be extended are quickly discarded.
Still, BP-DP for (δ,γ,α) disappoints compared to BP-DP for (δ,α) used as a filter.
(Problem: the γ counters need many bits...)
Consistently best alg in the tests was a simple heuristic (called Simple alg). Fortunately, it doesn’t have competitive
worst-case time.
Future plans
Research on extended models: most importantly with transposition invariance.
Some purely theoretical variants (e.g., better complexity for large alpha).
Injecting compression to represent bit vectors more succinctly and thus speed up the search?
Can we replace the log γ factor in the bit-par alg with log δ?
(Hint: in each step we increase the counters by at most δ only.)