Filter Algorithms for Approximate String Matching Stefan Burkhardt.
![Page 1: Filter Algorithms for Approximate String Matching Stefan Burkhardt.](https://reader035.fdocuments.net/reader035/viewer/2022062322/56649ea35503460f94ba6f43/html5/thumbnails/1.jpg)
Filter Algorithms for Approximate String Matching
Stefan Burkhardt
![Page 2: Filter Algorithms for Approximate String Matching Stefan Burkhardt.](https://reader035.fdocuments.net/reader035/viewer/2022062322/56649ea35503460f94ba6f43/html5/thumbnails/2.jpg)
Outline
• Motivation
• Filter Algorithms
• Gapped q-grams
• Experimental Analysis
![Page 3: Filter Algorithms for Approximate String Matching Stefan Burkhardt.](https://reader035.fdocuments.net/reader035/viewer/2022062322/56649ea35503460f94ba6f43/html5/thumbnails/3.jpg)
Motivation
• Computational Biology: EST clustering, assembly, genome comparison (e.g. Human/Mouse)
• Information Retrieval: phonebooks, dictionaries, search engines
• Many more...
Problems and Motivation: Why? / Approximate String Matching / Edit and Hamming Distance
![Page 4: Filter Algorithms for Approximate String Matching Stefan Burkhardt.](https://reader035.fdocuments.net/reader035/viewer/2022062322/56649ea35503460f94ba6f43/html5/thumbnails/4.jpg)
The global approximate string matching problem
Given a pattern P, a target S, an error level k and a string distance d(x,y):
Find all substrings y of S with $d(P, y) \le k$.
Example: P = GAT, S = ACTGATAACGTTAGCCATGG
![Page 5: Filter Algorithms for Approximate String Matching Stefan Burkhardt.](https://reader035.fdocuments.net/reader035/viewer/2022062322/56649ea35503460f94ba6f43/html5/thumbnails/5.jpg)
The global approximate string matching problem
• d(x,y) = Hamming distance: the k-mismatches problem
• d(x,y) = edit distance: the k-differences problem
Example: P = GAT, S = ACTGATAACGTTAGCCATGG
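The k-mismatches problem can be illustrated with a naive scan over the slide's example strings. This is only a minimal sketch of the problem statement, not one of the filter algorithms discussed later; the function names `hamming` and `k_mismatches` are made up for this illustration.

```python
def hamming(x: str, y: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(a != b for a, b in zip(x, y))


def k_mismatches(pattern: str, target: str, k: int) -> list[int]:
    """Start positions i where target[i:i+|P|] is within Hamming distance k of P."""
    m = len(pattern)
    return [i for i in range(len(target) - m + 1)
            if hamming(pattern, target[i:i + m]) <= k]


P = "GAT"
S = "ACTGATAACGTTAGCCATGG"
print(k_mismatches(P, S, 0))  # exact occurrences
print(k_mismatches(P, S, 1))  # occurrences with at most one mismatch
```

The scan compares P against every window of S, which is exactly the work a filter tries to avoid for most of the target.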
![Page 6: Filter Algorithms for Approximate String Matching Stefan Burkhardt.](https://reader035.fdocuments.net/reader035/viewer/2022062322/56649ea35503460f94ba6f43/html5/thumbnails/6.jpg)
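Filter Algorithms: How? / BLAST / The q-gram Lemma and QUASAR
Diagram: P and S pass through two phases.
• Filtration phase: the filter algorithm applies a filter criterion to P and S and outputs potential matches.
• Verification phase: an exact algorithm examines the potential matches and separates true matches from false matches.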
![Page 7: Filter Algorithms for Approximate String Matching Stefan Burkhardt.](https://reader035.fdocuments.net/reader035/viewer/2022062322/56649ea35503460f94ba6f43/html5/thumbnails/7.jpg)
BLAST (Altschul, Karlin, et al.):
• A sequential scan of S locates all q-grams that match a q-gram of P.
• Hits are extended iteratively, with a cutoff, to find good matches.
Problems for high similarity:
• the sequential scan is quite time consuming
• single q-grams are unspecific
![Page 8: Filter Algorithms for Approximate String Matching Stefan Burkhardt.](https://reader035.fdocuments.net/reader035/viewer/2022062322/56649ea35503460f94ba6f43/html5/thumbnails/8.jpg)
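Indexed Filter Algorithm
Diagram: as before, but with a preprocessing step.
• Preprocessing: an index is built in advance.
• Filtration phase: the indexed filter algorithm uses the index to find potential matches.
• Verification phase: an exact algorithm examines the potential matches and separates true matches from false matches.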
![Page 9: Filter Algorithms for Approximate String Matching Stefan Burkhardt.](https://reader035.fdocuments.net/reader035/viewer/2022062322/56649ea35503460f94ba6f43/html5/thumbnails/9.jpg)
Indexed Filter Algorithm
Pro:
• potentially faster evaluation of the filter criterion
Con:
• preprocessing time
• extra space required
• only good for some filter criteria
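The preprocessing idea can be sketched with a q-gram index that maps every q-gram of the target to its start positions. A plain Python dictionary stands in here for the lookup table and suffix array that the slides mention for QUASAR; that substitution, and the names `build_qgram_index` and `qgram_hits`, are assumptions made for illustration.

```python
from collections import defaultdict


def build_qgram_index(target: str, q: int) -> dict[str, list[int]]:
    """Preprocessing: map each q-gram of the target to its start positions."""
    index = defaultdict(list)
    for j in range(len(target) - q + 1):
        index[target[j:j + q]].append(j)
    return dict(index)


def qgram_hits(pattern: str, index: dict[str, list[int]], q: int) -> list[tuple[int, int]]:
    """All pairs (i, j) such that pattern[i:i+q] equals target[j:j+q]."""
    return [(i, j)
            for i in range(len(pattern) - q + 1)
            for j in index.get(pattern[i:i + q], [])]


index = build_qgram_index("ACTGATAACGTTAGCCATGG", 3)
print(qgram_hits("GAT", index, 3))
```

Once the index exists, finding the hits for a pattern costs time proportional to the pattern length plus the number of hits, instead of a scan of the whole target. The price is the preprocessing time and extra space listed under "Con".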
![Page 10: Filter Algorithms for Approximate String Matching Stefan Burkhardt.](https://reader035.fdocuments.net/reader035/viewer/2022062322/56649ea35503460f94ba6f43/html5/thumbnails/10.jpg)
QUASAR (Burkhardt, Rivals et al. 99):
• Filter criterion: q-gram Lemma (Jokinen, Ukkonen 91)
• Index structure: lookup table (Jokinen, Ukkonen 91) with suffix array (Manber, Myers 90)
• Match detection: overlapping rectangles in the DP matrix
![Page 11: Filter Algorithms for Approximate String Matching Stefan Burkhardt.](https://reader035.fdocuments.net/reader035/viewer/2022062322/56649ea35503460f94ba6f43/html5/thumbnails/11.jpg)
The q-gram Lemma (Jokinen, Ukkonen, 1991)
Example: |P| = 8, q = 3; total number of q-grams: |P| - q + 1 = 6.
P = TCGATTAC has the q-grams TCG, CGA, GAT, ATT, TTA, TAC.
Each error can "destroy" up to q matching q-grams, so k errors lose at most kq q-grams. With one error (TCGAATAC) the q-grams GAT, ATT and TTA are lost, while TCG, CGA and TAC remain.
For a pattern P, a substring y of S and a value k, matches between P and y with at most k errors share at least
$t = |P| - q + 1 - kq$
substrings of length q (q-grams).
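The lemma can be checked on the slide's example with a few lines of Python. This is a small sketch; the helper names `qgrams`, `qgram_threshold` and `shared_qgrams` are chosen for this illustration, and shared q-grams are counted as a multiset intersection.

```python
from collections import Counter


def qgrams(s: str, q: int) -> list[str]:
    """All overlapping substrings of length q."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def qgram_threshold(m: int, q: int, k: int) -> int:
    """Lower bound t = |P| - q + 1 - k*q from the q-gram lemma."""
    return m - q + 1 - k * q


def shared_qgrams(x: str, y: str, q: int) -> int:
    """Number of q-grams common to x and y (multiset intersection)."""
    common = Counter(qgrams(x, q)) & Counter(qgrams(y, q))
    return sum(common.values())


P = "TCGATTAC"
y = "TCGAATAC"  # P with one substitution
print(qgram_threshold(len(P), 3, 1), shared_qgrams(P, y, 3))
```

For this pair the bound is tight: one error in the middle of the pattern destroys exactly q = 3 of the 6 q-grams.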
![Page 12: Filter Algorithms for Approximate String Matching Stefan Burkhardt.](https://reader035.fdocuments.net/reader035/viewer/2022062322/56649ea35503460f94ba6f43/html5/thumbnails/12.jpg)
Match Detection (Jokinen, Ukkonen 91):
• overlapping rectangles of width 2|P| in the DP matrix
• a rectangle with at least t hits is a potential match
Figure: q-gram hits between P and S in the DP matrix. For t = 3, the rectangles with 3 hits are potential matches; those with 1 or 2 hits are not.
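The rectangle counting can be sketched as follows. The sketch assumes Hamming distance (so a match has length exactly |P|), lays rectangles of width 2|P| over S at multiples of |P| so that neighbours overlap by |P|, and uses a plain scan of S instead of an index. The function name `candidate_blocks` and this particular layout are assumptions for illustration, not the QUASAR implementation.

```python
from collections import Counter, defaultdict


def candidate_blocks(pattern: str, target: str, q: int, k: int) -> list[tuple[int, int]]:
    """Return the (start, end) ranges of S whose rectangle holds at least t hits."""
    m = len(pattern)
    t = m - q + 1 - k * q  # threshold from the q-gram lemma
    if t <= 0:
        raise ValueError("threshold is not positive: the filter cannot exclude anything")
    # How often each q-gram occurs in P; every occurrence is one hit in the DP matrix.
    in_pattern = Counter(pattern[i:i + q] for i in range(m - q + 1))
    hits = defaultdict(int)
    for j in range(len(target) - q + 1):
        count = in_pattern.get(target[j:j + q], 0)
        if count:
            b = j // m
            hits[b] += count          # rectangle covering [b*m, b*m + 2m)
            if b > 0:
                hits[b - 1] += count  # overlapping rectangle to the left
    return sorted((b * m, min(len(target), b * m + 2 * m))
                  for b, c in hits.items() if c >= t)


S = "G" * 10 + "TCGAATAC" + "G" * 10
print(candidate_blocks("TCGATTAC", S, q=3, k=1))
```

Every window of length |P| lies completely inside at least one rectangle, so a window that shares t q-grams with P always produces a rectangle with at least t hits. The reported ranges still have to be verified by an exact algorithm.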
![Page 13: Filter Algorithms for Approximate String Matching Stefan Burkhardt.](https://reader035.fdocuments.net/reader035/viewer/2022062322/56649ea35503460f94ba6f43/html5/thumbnails/13.jpg)
QUASAR (Burkhardt, Rivals et al. 1999):
• wider rectangles are efficient in practice (2048 for QUASAR)
![Page 14: Filter Algorithms for Approximate String Matching Stefan Burkhardt.](https://reader035.fdocuments.net/reader035/viewer/2022062322/56649ea35503460f94ba6f43/html5/thumbnails/14.jpg)
QUASAR (Burkhardt, Rivals et al. 1999):
• BLAST is used for the verification of the potential matches
• wider rectangles serve as match regions
• the index is a combination of lookup table and suffix array
• used for EST clustering at the DKFZ in Heidelberg
• searches for EST clustering run about 30 times faster than with BLAST
![Page 15: Filter Algorithms for Approximate String Matching Stefan Burkhardt.](https://reader035.fdocuments.net/reader035/viewer/2022062322/56649ea35503460f94ba6f43/html5/thumbnails/15.jpg)
Gapped q-grams
• A new (old?) idea
• Hamming Distance
• Finding good shapes
![Page 16: Filter Algorithms for Approximate String Matching Stefan Burkhardt.](https://reader035.fdocuments.net/reader035/viewer/2022062322/56649ea35503460f94ba6f43/html5/thumbnails/16.jpg)
General idea:
• use gapped q-grams
• call the arrangement of gaps the shape
Gapped 3-shape: ##.# (# = match, . = don't care)
Example: TCGATTAC has the gapped q-grams TC.A, CG.T, GA.T, AT.A, TT.C.
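Extracting gapped q-grams for a given shape takes only a few lines. A minimal sketch, assuming the shape is written as a string of `#` (match) and `.` (don't care) as on the slide; the function name `gapped_qgrams` is made up for this example.

```python
def gapped_qgrams(text: str, shape: str) -> list[str]:
    """Slide the shape over the text; keep characters under '#', mask the rest with '.'."""
    span = len(shape)
    result = []
    for start in range(len(text) - span + 1):
        window = text[start:start + span]
        result.append("".join(ch if s == "#" else "."
                              for ch, s in zip(window, shape)))
    return result


print(gapped_qgrams("TCGATTAC", "##.#"))
```

A shape with q match positions and span s yields |P| - s + 1 gapped q-grams, here 8 - 4 + 1 = 5, compared with |P| - q + 1 for the contiguous shape.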
![Page 17: Filter Algorithms for Approximate String Matching Stefan Burkhardt.](https://reader035.fdocuments.net/reader035/viewer/2022062322/56649ea35503460f94ba6f43/html5/thumbnails/17.jpg)
Previous work:
• Califano, Rigoutsos (1993)
• Pevzner, Waterman (1995)
• Lehtinen, Sutinen, Tarhio (1996)
• limited attention paid to the choice of shapes
• no exact threshold given for the general case
Recently:
• Buhler (2001): Multiple Shapes
• Ma, Tromp, Li (2002): Pattern Hunter
• threshold t = 1
![Page 18: Filter Algorithms for Approximate String Matching Stefan Burkhardt.](https://reader035.fdocuments.net/reader035/viewer/2022062322/56649ea35503460f94ba6f43/html5/thumbnails/18.jpg)
The Threshold t
Definition: t is the number of remaining q-grams in a worst-case placement of k errors.
In the examples, O marks a matching position and X an error.
Classic 3-shape ###, k = 3:
OOXOOXOOXOO
OOX OXO XOO OOX OXO XOO OOX OXO XOO
t = 0: no filter!
Gapped 3-shape ##.#, k = 3:
OOOXXOOXOOO
OO.X OO.X OX.O XX.O XO.X OO.O OX.O XO.O
t = 1
![Page 19: Filter Algorithms for Approximate String Matching Stefan Burkhardt.](https://reader035.fdocuments.net/reader035/viewer/2022062322/56649ea35503460f94ba6f43/html5/thumbnails/19.jpg)
The Threshold t
• gapped shapes can have higher (!) thresholds t than ungapped shapes
• there is no simple formula for t; we used a DP-based approach to compute t
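For small patterns the threshold can be checked by brute force straight from its definition. Note that this is not the DP-based approach mentioned on the slide: the sketch below simply tries every placement of k errors, which is exponential in k, and the function name `threshold` is an assumption.

```python
from itertools import combinations


def threshold(shape: str, m: int, k: int) -> int:
    """Minimum number of intact q-grams over all placements of k errors in a pattern of length m."""
    offsets = [i for i, c in enumerate(shape) if c == "#"]
    starts = range(m - len(shape) + 1)
    best = len(starts)
    for errors in combinations(range(m), k):
        bad = set(errors)
        # A q-gram survives if none of its match positions is an error.
        intact = sum(all(s + o not in bad for o in offsets) for s in starts)
        best = min(best, intact)
    return best


print(threshold("###", 11, 3), threshold("##.#", 11, 3))
```

For a pattern of length 11 and k = 3 this reproduces the slide's values: the contiguous shape has no surviving q-gram in the worst case, while the gapped shape always keeps at least one.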
![Page 20: Filter Algorithms for Approximate String Matching Stefan Burkhardt.](https://reader035.fdocuments.net/reader035/viewer/2022062322/56649ea35503460f94ba6f43/html5/thumbnails/20.jpg)
Finding good shapes
Figure: trade-off diagram. Along the horizontal axis q goes from low to high, while the number of q-gram hits and the filtration time go from high to low. The vertical axis shows the number of potential matches, and with it the verification time, from low to high. A trade-off line separates good filters from bad filters.
![Page 21: Filter Algorithms for Approximate String Matching Stefan Burkhardt.](https://reader035.fdocuments.net/reader035/viewer/2022062322/56649ea35503460f94ba6f43/html5/thumbnails/21.jpg)
Finding good shapes
Figure: the same trade-off diagram (q from low to high, number of potential matches from low to high, trade-off line between good and bad filters). The number of q-gram hits is labelled $|S| \cdot \frac{1}{|\Sigma|^q}$; the corresponding expression for the number of potential matches is marked "?".
![Page 22: Filter Algorithms for Approximate String Matching Stefan Burkhardt.](https://reader035.fdocuments.net/reader035/viewer/2022062322/56649ea35503460f94ba6f43/html5/thumbnails/22.jpg)
Finding good shapes
For |P| = 13, k = 3 and q = 3 the shapes ##.# and ### both have a threshold of t = 2. However, the gapped shape returns fewer potential matches.
Reason: two overlapping occurrences of ##.# cover 5 characters, two overlapping occurrences of ### only 4.
##.#      ###
 ##.#      ###
-----     ----
  5         4
A random match requires 5 matching characters instead of only 4 for the ungapped q-gram. This makes random matches less likely.
![Page 23: Filter Algorithms for Approximate String Matching Stefan Burkhardt.](https://reader035.fdocuments.net/reader035/viewer/2022062322/56649ea35503460f94ba6f43/html5/thumbnails/23.jpg)
Finding good shapes
We define the minimum coverage $c_m$ as the minimum number of matching characters for any distinct arrangement of t matching shapes in P and S.
Example: for t = 2 and the shape ##.# the minimum coverage is 5.
CGACGATTGAT
   ##.#
    ##.#
   -----
ACTCGATTAGA
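The minimum coverage can also be computed by brute force from the definition above. A small sketch, assuming that an "arrangement" means a choice of t distinct start positions of the shape inside a pattern of length m; the function name `minimum_coverage` is made up for this illustration, and the enumeration is only practical for small t.

```python
from itertools import combinations


def minimum_coverage(shape: str, t: int, m: int) -> int:
    """Fewest characters covered by t distinct placements of the shape in a pattern of length m."""
    offsets = [i for i, c in enumerate(shape) if c == "#"]
    starts = range(m - len(shape) + 1)
    # Raises ValueError if fewer than t placements fit into the pattern.
    return min(len({s + o for s in chosen for o in offsets})
               for chosen in combinations(starts, t))


print(minimum_coverage("##.#", 2, 13), minimum_coverage("###", 2, 13))
```

With t = 2 the gapped shape needs 5 matching characters and the contiguous shape only 4, which is the difference used in the argument about random matches.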
![Page 24: Filter Algorithms for Approximate String Matching Stefan Burkhardt.](https://reader035.fdocuments.net/reader035/viewer/2022062322/56649ea35503460f94ba6f43/html5/thumbnails/24.jpg)
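Finding good shapes
Figure: the trade-off diagram with both axes labelled. Horizontal axis: q from low to high, with the number of q-gram hits given as $|S| \cdot \frac{1}{|\Sigma|^q}$. Vertical axis: minimum coverage $c_m$, with the number of potential matches given as $|S| \cdot \frac{1}{|\Sigma|^{c_m}}$. The trade-off line separates good filters from bad filters.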
![Page 25: Filter Algorithms for Approximate String Matching Stefan Burkhardt.](https://reader035.fdocuments.net/reader035/viewer/2022062322/56649ea35503460f94ba6f43/html5/thumbnails/25.jpg)
Finding good shapes
• compute t and minimum coverage for all shapes with |P| = 50 and k = 3, 4, 5, 6
Figure: number of shapes (0 to 600) with a given minimum coverage (8 to 22) for k = 5 and q = 8, with one curve for each threshold t = 1 to t = 5. Markers: contiguous, median, best.
![Page 26: Filter Algorithms for Approximate String Matching Stefan Burkhardt.](https://reader035.fdocuments.net/reader035/viewer/2022062322/56649ea35503460f94ba6f43/html5/thumbnails/26.jpg)
Experimental Analysis
• Speed and Filtration Efficiency
• The Heuristic Zone
![Page 27: Filter Algorithms for Approximate String Matching Stefan Burkhardt.](https://reader035.fdocuments.net/reader035/viewer/2022062322/56649ea35503460f94ba6f43/html5/thumbnails/27.jpg)
Speed and Filtration Efficiency
Figure: minimum coverage (8 to 24) for q = 6 to 12, comparing gapped (Hamming) and contiguous shapes, together with the number of hits ($2^{22}$ down to $2^{12}$) and the number of matches ($2^{16}$ down to $2^{-8}$). Parameters: k = 5, |P| = 50, |S| = 50 Mbps.
![Page 28: Filter Algorithms for Approximate String Matching Stefan Burkhardt.](https://reader035.fdocuments.net/reader035/viewer/2022062322/56649ea35503460f94ba6f43/html5/thumbnails/28.jpg)
From Hits to Matches: Describing Filter Properties
Filters usually have 3 "recognition zones" depending on k:
1. Guarantee zone (finds all approximate matches)
2. Heuristic zone (finds some of the approximate matches)
3. Negative zone (guaranteed not to find matches)
Figure: recognition rate (0% to 100%) against the number of errors (0 to |P|).
![Page 32: Filter Algorithms for Approximate String Matching Stefan Burkhardt.](https://reader035.fdocuments.net/reader035/viewer/2022062322/56649ea35503460f94ba6f43/html5/thumbnails/32.jpg)
The Heuristic Zone
Figure: recognition rate (0% to 100%) against the number of errors, with the error axis marked at 0, k, $|P| - c_m$ and |P|. The heuristic zone lies between k and $|P| - c_m$.
Problem: behaviour in the heuristic zone is hard to predict.
![Page 33: Filter Algorithms for Approximate String Matching Stefan Burkhardt.](https://reader035.fdocuments.net/reader035/viewer/2022062322/56649ea35503460f94ba6f43/html5/thumbnails/33.jpg)
A simple idea: sampling!
For a value i:
1. Generate s sample strings with i random errors each
2. Run a filter algorithm on these samples
3. Record how many strings were recognized (in percent)
This allows an experimental evaluation of the heuristic zone.
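The three sampling steps above can be sketched as a small simulation. Several simplifying assumptions are made here that are not in the slides: errors are mismatches at distinct random positions (Hamming distance), a q-gram counts as a hit only if all of its match positions are error-free, and chance hits elsewhere in the text are ignored. The function name `recognition_rate` and its parameters are hypothetical.

```python
import random


def recognition_rate(shape: str, m: int, t: int, errors: int,
                     samples: int = 1000, seed: int = 0) -> float:
    """Percentage of samples with `errors` random mismatches that keep at least t intact q-grams."""
    rng = random.Random(seed)
    offsets = [i for i, c in enumerate(shape) if c == "#"]
    starts = range(m - len(shape) + 1)
    recognized = 0
    for _ in range(samples):
        bad = set(rng.sample(range(m), errors))  # step 1: place the errors
        intact = sum(all(s + o not in bad for o in offsets) for s in starts)
        if intact >= t:                          # step 2: apply the filter criterion
            recognized += 1
    return 100.0 * recognized / samples          # step 3: recognition rate in percent


# Contiguous shape with q = 11, |P| = 50 and k = 3, so t = 50 - 11 + 1 - 3 * 11 = 7.
for i in (3, 6, 9, 12):
    print(i, recognition_rate("#" * 11, 50, 7, i))
```

Up to k errors the rate is 100% by the q-gram lemma. Beyond k the printed values estimate the behaviour in the heuristic zone for this simplified error model.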
![Page 34: Filter Algorithms for Approximate String Matching Stefan Burkhardt.](https://reader035.fdocuments.net/reader035/viewer/2022062322/56649ea35503460f94ba6f43/html5/thumbnails/34.jpg)
The Heuristic Zone
Figure: recognition rate (0% to 100%) against the number of errors (0 to 30) for contiguous filters with k = 3, q = 11 and k = 4, q = 9. |P| = 50, 1000 samples for each error level.
![Page 35: Filter Algorithms for Approximate String Matching Stefan Burkhardt.](https://reader035.fdocuments.net/reader035/viewer/2022062322/56649ea35503460f94ba6f43/html5/thumbnails/35.jpg)
The Heuristic Zone
Figure: recognition rate (0% to 100%) against the number of errors (0 to 30). Contiguous filters: k = 3, q = 11 and k = 4, q = 9. Gapped filters (edit distance): k = 3, q = 11; k = 4, q = 11; k = 5, q = 10. |P| = 50, 1000 samples for each error level.
![Page 36: Filter Algorithms for Approximate String Matching Stefan Burkhardt.](https://reader035.fdocuments.net/reader035/viewer/2022062322/56649ea35503460f94ba6f43/html5/thumbnails/36.jpg)
The Heuristic Zone
Figure: recognition rate (0% to 100%) against the number of errors (0 to 30), now including BLAST. Contiguous filters: k = 3, q = 11 and k = 4, q = 9. Gapped filters (edit distance): k = 3, q = 11; k = 4, q = 11; k = 5, q = 10. Further curve labels: k = 3, q = 11 and k = 4, q = 10. |P| = 50, 1000 samples for each error level.
![Page 38: Filter Algorithms for Approximate String Matching Stefan Burkhardt.](https://reader035.fdocuments.net/reader035/viewer/2022062322/56649ea35503460f94ba6f43/html5/thumbnails/38.jpg)
The Heuristic Zone
Figure: enlarged view of the same comparison, with recognition rate from 50% to 100% and errors from 0 to 15, for BLAST, the gapped (edit distance) filters and the contiguous filters. Curve labels: k = 4, q = 9; k = 3, q = 11; k = 3, q = 11; k = 4, q = 11; k = 5, q = 10; k = 3, q = 11; k = 4, q = 10. |P| = 50, 1000 samples for each error level.
![Page 39: Filter Algorithms for Approximate String Matching Stefan Burkhardt.](https://reader035.fdocuments.net/reader035/viewer/2022062322/56649ea35503460f94ba6f43/html5/thumbnails/39.jpg)
Conclusion - Future Work
Our Work:
• Significant sensitivity improvement over existing filters
• Required modifications easy to implement
• Methods for describing filter properties
Future Work:
• Combination of "orthogonal" shapes into one filter
• Use of word neighborhoods
• Database of filter properties for good shapes