Efficient Merging and Filtering Algorithms
for Approximate String Searches
Jiaheng Lu,
University of California, Irvine
Joint work with Chen Li, Yiming Lu
Example: a movie database
Star             Title            Year   Genre
Keanu Reeves     The Matrix       1999   Sci-Fi
Samuel Jackson   Iron Man         2008   Sci-Fi
Schwarzenegger   The Terminator   1984   Sci-Fi
Samuel Jackson   The Man          2006   Crime
Find movies starring "Schwarrzenger" (the query itself is misspelled).
Data may not be clean
Data integration and cleaning:
Relation R (Star)   Relation S (Star)
Keanu Reeves        Keanu Reeves
Samuel Jackson      Samuel L. Jackson
Schwarzenegger      Schwarzenegger
Problem definition: approximate string searches
Query q: Schwarrzenger
Collection of strings s (Star): Keanu Reeves, Samuel Jackson, Schwarzenger, …
Search output: strings s that satisfy Sim(q,s) ≤ δ
Sim functions: edit distance, Jaccard coefficient, and cosine similarity
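To make the problem statement concrete, here is a minimal sketch of the brute-force baseline with edit distance as the Sim function: compare the query with every string and keep those within δ. The function names and δ = 1 are illustrative assumptions, not taken from the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one dynamic-programming row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def naive_search(query, strings, delta):
    """Scan the whole collection; this is the cost the index-based methods avoid."""
    return [s for s in strings if edit_distance(query, s) <= delta]


stars = ["Keanu Reeves", "Samuel Jackson", "Schwarzenger"]
print(naive_search("Schwarrzenger", stars, 1))  # ['Schwarzenger']
```

Every query touches every string here, which is what motivates the gram-based inverted lists.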
Outline
Problem motivation
Preliminaries
  Grams
  Inverted lists
Merge algorithms
Filtering techniques
Conclusion
String grams
q-grams: the substrings of length q of a string
For example, the 2-grams of "universal":
(un), (ni), (iv), (ve), (er), (rs), (sa), (al)
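The gram decomposition can be sketched in a few lines. This version assumes plain sliding-window grams with no padding characters at the ends, which matches the "universal" example; `qgrams` is an illustrative name.

```python
def qgrams(s: str, q: int = 2) -> list[str]:
    """Return the overlapping substrings of length q, left to right."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("universal"))
# ['un', 'ni', 'iv', 've', 'er', 'rs', 'sa', 'al']
```

A string of length n has n - q + 1 such grams, a count that the merge threshold relies on later.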
Inverted lists
Convert strings to gram inverted lists
id   string
0    rich
1    stick
2    stich
3    stuck
4    static
2-gram   inverted list (string ids)
at       4
ch       0, 2
ck       1, 3
ic       0, 1, 2, 4
ri       0
st       1, 2, 3, 4
ta       4
ti       1, 2, 4
tu       3
uc       3
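A minimal sketch of building these inverted lists, assuming string ids are simply positions in the input list and that each id appears at most once per list. The helper names are illustrative.

```python
from collections import defaultdict


def qgrams(s, q=2):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_inverted_lists(strings, q=2):
    """Map each gram to the ascending list of ids of strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for gram in set(qgrams(s, q)):  # set: record each id once per gram
            index[gram].append(sid)     # ids arrive in ascending order
    return dict(index)


strings = ["rich", "stick", "stich", "stuck", "static"]
index = build_inverted_lists(strings)
print(index["ic"])  # [0, 1, 2, 4]
print(index["st"])  # [1, 2, 3, 4]
```

Because ids are appended in increasing order, every list comes out sorted, which the merge algorithms depend on.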
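Main example
Query: stick, with 2-grams (st, ti, ic, ck)
Condition: ed(s,q) ≤ 1, so candidates need count ≥ 2
Data:
id   string
0    rich
1    stick
2    stich
3    stuck
4    static
Inverted lists of the data grams:
ck   1, 3
ic   0, 1, 2, 4
st   1, 2, 3, 4
ta   4
ti   1, 2, 4
…
Merge the lists of the query grams to obtain the candidates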
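The link between ed(s,q) ≤ 1 and count ≥ 2 can be sketched as follows. The slide only states the threshold; the formula used here is the standard count-filter bound (one edit operation destroys at most q grams), and all names are illustrative assumptions.

```python
from collections import Counter, defaultdict


def qgrams(s, q=2):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def count_threshold(query, q, k):
    """Lower bound on shared grams: each of k edits destroys at most q grams."""
    return (len(query) - q + 1) - k * q


strings = ["rich", "stick", "stich", "stuck", "static"]
query, q, k = "stick", 2, 1

# Build the inverted lists (ids are positions in `strings`).
index = defaultdict(list)
for sid, s in enumerate(strings):
    for gram in set(qgrams(s, q)):
        index[gram].append(sid)

T = count_threshold(query, q, k)  # (5 - 2 + 1) - 1 * 2 = 2
counts = Counter()
for gram in qgrams(query, q):
    counts.update(index.get(gram, []))  # one occurrence per matching list

candidates = sorted(sid for sid, c in counts.items() if c >= T)
print(T, candidates)  # 2 [1, 2, 3, 4]
```

Candidates are only a superset of the answers: each one still has to be verified with the actual edit distance.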
Problem definition:
Find elements whose number of occurrences ≥ T
Merge: every inverted list is sorted in ascending order
Example, T = 4
List 1: 1, 3, 5, 10, 13
List 2: 10, 13, 15
List 3: 5, 7, 13
List 4: 13
List 5: 15
Result: 13
Contributions
Three new merge algorithms
New finding: wisely using filters
Outline
Problem motivation
Preliminaries
Merge algorithms
  Two previous algorithms
  Our proposed three algorithms
Filtering techniques
Conclusion
Five Merge Algorithms
Previous: HeapMerger [Sarawagi, SIGMOD 2004], MergeOpt [Sarawagi, SIGMOD 2004]
New: ScanCount, MergeSkip, DivideSkip
Heap-based Algorithm
Min-heap
Count the number of occurrences of each element using a heap
Push to heap ……
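A minimal sketch of the heap-based merge, assuming each list is sorted ascending with no duplicate ids. The five lists are the T = 4 example as read from the slides; `heap_merge` is an illustrative name, not the authors' code.

```python
import heapq


def heap_merge(lists, T):
    """Return ids that occur in at least T of the sorted lists."""
    # Heap entries are (id, list index, position in that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap:
        current = heap[0][0]
        count = 0
        # Pop every copy of the smallest id and advance those lists by one.
        while heap and heap[0][0] == current:
            _, i, pos = heapq.heappop(heap)
            count += 1
            if pos + 1 < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        if count >= T:
            result.append(current)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(heap_merge(lists, 4))  # [13]
```

Every element of every list passes through the heap once, which is the cost the later algorithms try to reduce.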
MergeOpt Algorithm
Long lists: the T-1 longest lists, probed with binary search
Short lists: the remaining lists, merged
Example of MergeOpt [Sarawagi et al 2004]
Count threshold T = 4
Long lists (3): [1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13]
Short lists (2): [13], [15]
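The long/short split can be sketched as below. It assumes sorted lists without duplicates and T ≥ 1, and it merges the short lists with a plain counter rather than a heap for brevity; `merge_opt` is an illustrative name, not the original implementation.

```python
from bisect import bisect_left
from collections import Counter


def merge_opt(lists, T):
    """T-1 longest lists are probed by binary search; the rest are merged."""
    by_len = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_len[:T - 1], by_len[T - 1:]
    # An id with T occurrences must appear in at least one short list,
    # because only T-1 lists are long.
    counts = Counter(x for lst in short_lists for x in lst)
    result = []
    for x in sorted(counts):
        total = counts[x]
        for lst in long_lists:
            j = bisect_left(lst, x)
            if j < len(lst) and lst[j] == x:
                total += 1
        if total >= T:
            result.append(x)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_opt(lists, 4))  # [13]
```

Only ids 13 and 15 from the short lists are looked up, so most elements of the long lists are never touched.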
Can we run faster?
ScanCount Example
Count threshold T = 4
Lists: [1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]
Keep one counter per string id (1, 2, 3, …, 13, 14, 15), all initially 0
Scan each list and increment the counter of every id it contains by 1
# of occurrences after the scan: id 13: 4, id 14: 0, id 15: 2
Result: 13
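ScanCount as described above fits in a few lines. The sketch assumes ids are small non-negative integers so they can index an array directly; `num_strings` is an illustrative parameter for the array size.

```python
def scan_count(lists, T, num_strings):
    """Count occurrences in an array indexed by string id; no heap needed."""
    counts = [0] * num_strings
    result = []
    for lst in lists:
        for sid in lst:
            counts[sid] += 1
            if counts[sid] == T:  # report each id once, when it reaches T
                result.append(sid)
    return sorted(result)


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(scan_count(lists, 4, num_strings=16))  # [13]
```

The approach reads every element but does constant work per element, trading skipped reads for a very cheap inner loop.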
MergeSkip algorithm
Min-heap ……
Pop T-1 elements from the heap
Jump: each popped list moves ahead to its smallest element greater than or equal to the new top of the heap
Example of MergeSkip
Count threshold T = 4
[Figure: inverted lists with a min-heap over their current elements; the popped lists jump ahead to later elements]
Skip is safe
Min-heap ……
Each skipped element occurs at most T-1 times (# of occurrences ≤ T-1), so it cannot be a result
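The pop-and-jump idea can be sketched as follows, assuming sorted lists without duplicates. This is a reading of the slides rather than the authors' code, and the example lists are the earlier T = 4 lists, not the ones in the MergeSkip figure.

```python
import heapq
from bisect import bisect_left


def merge_skip(lists, T):
    """Heap merge that jumps over ids that cannot reach T occurrences."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap:
        t = heap[0][0]
        popped = []
        while heap and heap[0][0] == t:
            popped.append(heapq.heappop(heap))
        if len(popped) >= T:
            result.append(t)
            for _, i, pos in popped:  # advance each popped list by one
                if pos + 1 < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
            continue
        # Too few copies of t: pop until T-1 lists are out of the heap.
        while heap and len(popped) < T - 1:
            popped.append(heapq.heappop(heap))
        if not heap:
            break  # fewer than T lists are still active
        new_top = heap[0][0]
        # Ids below new_top live only in the T-1 popped lists, so skip them.
        for _, i, pos in popped:
            j = bisect_left(lists[i], new_top, pos)
            if j < len(lists[i]):
                heapq.heappush(heap, (lists[i][j], i, j))
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_skip(lists, 4))  # [13]
```

On this input the first round pops 1, 5, and 10, then all three lists jump straight to 13, so ids 3 and 7 are never read.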
DivideSkip Algorithm
Long lists: binary search
Short lists: MergeSkip
How many lists are treated as long lists?
Short lists: merge
Long lists: lookup
Deciding the value of L (the number of long lists)
A good balance in the tradeoff:
L = T / (μ log M + 1)
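Putting the pieces together, DivideSkip can be sketched as MergeSkip on the short lists with threshold T - L, followed by binary-search lookups in the L long lists. The slide does not define M, μ, or the log base; this sketch assumes M is the length of the longest list, μ is a tuning coefficient (default 1.0 here), and the logarithm is natural. It also caps L at T-1 so the short-list threshold stays positive.

```python
import heapq
import math
from bisect import bisect_left


def merge_skip_counts(lists, T):
    """MergeSkip variant returning (id, count) for ids with count >= T."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    out = []
    while heap:
        t = heap[0][0]
        popped = []
        while heap and heap[0][0] == t:
            popped.append(heapq.heappop(heap))
        if len(popped) >= T:
            out.append((t, len(popped)))
            for _, i, pos in popped:
                if pos + 1 < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
            continue
        while heap and len(popped) < T - 1:
            popped.append(heapq.heappop(heap))
        if not heap:
            break
        new_top = heap[0][0]
        for _, i, pos in popped:
            j = bisect_left(lists[i], new_top, pos)
            if j < len(lists[i]):
                heapq.heappush(heap, (lists[i][j], i, j))
    return out


def divide_skip(lists, T, mu=1.0):
    lists = [lst for lst in lists if lst]
    if len(lists) < T:
        return []
    by_len = sorted(lists, key=len, reverse=True)
    M = len(by_len[0])  # assumption: M = length of the longest list
    L = min(int(T / (mu * math.log(M) + 1)), T - 1)
    long_lists, short_lists = by_len[:L], by_len[L:]
    result = []
    for sid, count in merge_skip_counts(short_lists, T - L):
        for lst in long_lists:  # look the candidate up in each long list
            j = bisect_left(lst, sid)
            if j < len(lst) and lst[j] == sid:
                count += 1
        if count >= T:
            result.append(sid)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(divide_skip(lists, 4))  # [13]
```

With these assumptions the example yields L = 1, so only the longest list is treated as long and the other four are merged with threshold 3.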
Experimental data sets
DBLP data
IMDB data
Google Web corpus
Performance (DBLP)
DivideSkip is the best one
# of accessed elements (DBLP)
DivideSkip is the best one
Outline
Problem motivation
Preliminaries
Merge algorithms
Filtering techniques
  Length, positional filters
  Filter tree
Conclusion and future work
Length Filtering
Ed(s,t) ≤ 2
s: length 19
t: length 10
The lengths differ by 9 > 2, so the pair is pruned by length only!
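The length filter is a one-line check, sketched here with made-up placeholder strings of lengths 19 and 10 (the slide's actual strings did not survive in the transcript).

```python
def passes_length_filter(s: str, t: str, k: int) -> bool:
    """Each edit changes the length by at most 1, so ed(s,t) <= k
    is impossible when the lengths differ by more than k."""
    return abs(len(s) - len(t)) <= k


s = "x" * 19  # placeholder string of length 19
t = "y" * 10  # placeholder string of length 10
print(passes_length_filter(s, t, 2))  # False
```

The check never computes a distance, which is why it is so cheap compared with merging or verification.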
Positional Filtering
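Ed(s,t) ≤ 2
s: gram (ab, 1)
t: gram (ab, 12)
The gram "ab" is at position 1 in s and position 12 in t; the positions differ by more than 2, so these two grams cannot be matched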
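A minimal sketch of positional grams and the positional check. Positions are assumed to be 1-based to match the (ab, 1) notation, and the function names are illustrative.

```python
def positional_qgrams(s, q=2):
    """Pair each gram with its 1-based start position."""
    return [(s[i:i + q], i + 1) for i in range(len(s) - q + 1)]


def grams_may_match(gram_s, gram_t, k):
    """Two grams can correspond under ed <= k only if they are equal
    and their positions differ by at most k."""
    (g1, p1), (g2, p2) = gram_s, gram_t
    return g1 == g2 and abs(p1 - p2) <= k


print(positional_qgrams("stick"))
# [('st', 1), ('ti', 2), ('ic', 3), ('ck', 4)]
print(grams_may_match(("ab", 1), ("ab", 12), 2))  # False
```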
Filter tree
root
Length level: 1, 2, 3, …, n
Gram level: aa, ab, …, zy, zz
Position level: 1, 2, …, m
Inverted list (leaf): 5, 12, 17, 28, 44
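One simple way to picture the filter tree is a nested dictionary keyed by length, then gram, then position, with an inverted list at each leaf. This is only a sketch under that assumption; the names, the 1-based positions, and the query routine are illustrative rather than the authors' data structure.

```python
from collections import defaultdict


def build_filter_tree(strings, q=2):
    """tree[length][gram][position] -> ascending list of string ids."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for sid, s in enumerate(strings):
        for i in range(len(s) - q + 1):
            tree[len(s)][s[i:i + q]][i + 1].append(sid)
    return tree


def lists_for_query(tree, query, k, q=2):
    """Collect the leaf lists that survive the length and positional filters."""
    found = []
    for length in range(len(query) - k, len(query) + k + 1):
        if length not in tree:
            continue
        for i in range(len(query) - q + 1):
            gram, pos = query[i:i + q], i + 1
            if gram not in tree[length]:
                continue
            for p in range(pos - k, pos + k + 1):
                if p in tree[length][gram]:
                    found.append((length, gram, p, tree[length][gram][p]))
    return found


strings = ["rich", "stick", "stich", "stuck", "static"]
tree = build_filter_tree(strings)
print(tree[5]["st"][1])  # [1, 2, 3]
for leaf in lists_for_query(tree, "stick", k=1):
    print(leaf)
```

Note how the single "st" list from the plain index is now split into one leaf per length and position, so a query has many small lists to merge instead of a few long ones.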
Surprising experimental results (DBLP)
             No filter (ms)   Length (ms)   Length+Pos (ms)
DivideSkip   2.23             0.76          1.96
Why does adding the positional filter increase the running time?
Filters fragment inverted lists
Applying filters splits each inverted list into smaller lists, so one merge becomes several merges
Cost:
(1) Tree traversal
(2) More merging
Saving:
Reduced total list size
Conclusion
Three new merge algorithms: we run faster
Interesting finding: do not abuse filters!
Related work
Approximate string matching [Navarro 2001]
Variable-length grams [Li et al 2007]
Fuzzy lookup in
References
1. [Arasu 2006] A. Arasu, V. Ganti, and R. Kaushik, "Efficient Exact Set-Similarity Joins," in VLDB 2006.
2. [Chaudhuri 2003] S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani, "Robust and Efficient Fuzzy Match for Online Data Cleaning," in SIGMOD 2003.
3. [Gravano 2001] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava, "Approximate String Joins in a Database (Almost) for Free," in VLDB 2001.
4. [Li 2007] C. Li, B. Wang, and X. Yang, "VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams," in VLDB 2007.
5. [Navarro 2001] G. Navarro, "A Guided Tour to Approximate String Matching," in ACM Computing Surveys, 2001.
6. [Sarawagi 2004] S. Sarawagi and A. Kirpal, "Efficient Set Joins on Similarity Predicates," in ACM SIGMOD 2004.