The Flamingo Software Package on Approximate String Queries
![Page 1: The Flamingo Software Package on Approximate String Queries](https://reader035.fdocuments.net/reader035/viewer/2022070500/5681684d550346895dde4bef/html5/thumbnails/1.jpg)
The Flamingo Software Package on Approximate String Queries
Chen Li, UC Irvine and Bimaple
http://flamingo.ics.uci.edu/
![Page 2: The Flamingo Software Package on Approximate String Queries](https://reader035.fdocuments.net/reader035/viewer/2022070500/5681684d550346895dde4bef/html5/thumbnails/2.jpg)
Personal Journey: 2001 …
![Page 3: The Flamingo Software Package on Approximate String Queries](https://reader035.fdocuments.net/reader035/viewer/2022070500/5681684d550346895dde4bef/html5/thumbnails/3.jpg)
Data Integration Problems?
Talking to medical doctors…
![Page 4: The Flamingo Software Package on Approximate String Queries](https://reader035.fdocuments.net/reader035/viewer/2022070500/5681684d550346895dde4bef/html5/thumbnails/4.jpg)
Example

Table R

| Name | SSN | Addr |
|---|---|---|
| Jack Lemmon | 430-871-8294 | Maple St |
| Harrison Ford | 292-918-2913 | Culver Blvd |
| Tom Hanks | 234-762-1234 | Main St |
| … | … | … |

Table S

| Name | SSN | Addr |
|---|---|---|
| Ton Hanks | 234-162-1234 | Main Street |
| Kevin Spacey | 928-184-2813 | Frost Blvd |
| Jack Lemon | 430-817-8294 | Maple Street |
| … | … | … |

Find records from different datasets that could be the same entity
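
A minimal sketch of this matching task, assuming edit distance on the Name column is the similarity function and a threshold of 1 (both are illustrative choices, not taken from the slides). The helper names `edit_distance` and `match_names` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]


# Name columns of Table R and Table S from the example.
table_r = ["Jack Lemmon", "Harrison Ford", "Tom Hanks"]
table_s = ["Ton Hanks", "Kevin Spacey", "Jack Lemon"]


def match_names(r_names, s_names, max_dist=1):
    """Return all (r, s) pairs whose names are within max_dist edits."""
    return [(r, s) for r in r_names for s in s_names
            if edit_distance(r, s) <= max_dist]


matches = match_names(table_r, table_s)
print(matches)
```

With this threshold the sketch pairs "Jack Lemmon" with "Jack Lemon" and "Tom Hanks" with "Ton Hanks"; a real system would also compare the SSN and Addr columns.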
![Page 5: The Flamingo Software Package on Approximate String Queries](https://reader035.fdocuments.net/reader035/viewer/2022070500/5681684d550346895dde4bef/html5/thumbnails/5.jpg)
Another Example
P. Bernstein, D. Chiu: Using Semi-Joins to Solve Relational Queries. JACM 28(1): 25-40 (1981)
Philip A. Bernstein, Dah-Ming W. Chiu, Using Semi-Joins to Solve Relational Queries, Journal of the ACM (JACM), v.28 n.1, p.25-40, Jan. 1981
![Page 6: The Flamingo Software Package on Approximate String Queries](https://reader035.fdocuments.net/reader035/viewer/2022070500/5681684d550346895dde4bef/html5/thumbnails/6.jpg)
Challenges
How to define good similarity functions?
- Many functions proposed (edit distance, cosine similarity, …)
- Domain knowledge is critical
  - Names: “Wall Street Journal” and “LA Times”
  - Address: “Main Street” versus “Main St”
How to do matching efficiently?
![Page 7: The Flamingo Software Package on Approximate String Queries](https://reader035.fdocuments.net/reader035/viewer/2022070500/5681684d550346895dde4bef/html5/thumbnails/7.jpg)
Nested-loop?
- Not desirable for large data sets
- 5 hours for 30K strings! (in 2002)
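
A back-of-the-envelope sketch of why the nested loop is slow, assuming the 30K strings were self-joined (every unordered pair compared once) and that all five hours went into pair comparisons. Both are assumptions; the slide gives only the totals.

```python
def nested_loop_cost(num_strings, total_seconds):
    """Number of unordered pairs and the implied time per comparison."""
    pairs = num_strings * (num_strings - 1) // 2
    micros_per_pair = total_seconds / pairs * 1e6
    return pairs, micros_per_pair


pairs, micros_per_pair = nested_loop_cost(30_000, 5 * 3600)
print(pairs)                      # number of string pairs compared
print(round(micros_per_pair, 1))  # microseconds per comparison
```

Under these assumptions there are roughly 450 million pairs at about 40 microseconds each, and the pair count grows quadratically with the data size.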
![Page 8: The Flamingo Software Package on Approximate String Queries](https://reader035.fdocuments.net/reader035/viewer/2022070500/5681684d550346895dde4bef/html5/thumbnails/8.jpg)
Our first attempt (DASFAA 2003)
- Map strings into a high-dimensional Euclidean space
- Do a similarity join in the Euclidean space
(Figure: strings in a metric space mapped into a Euclidean space.)
![Page 9: The Flamingo Software Package on Approximate String Queries](https://reader035.fdocuments.net/reader035/viewer/2022070500/5681684d550346895dde4bef/html5/thumbnails/9.jpg)
Can it preserve distances?
Use data set 1 (54K names) as an example
- k=2, d=20
- Use k’=5.2 to differentiate similar and dissimilar pairs.
![Page 10: The Flamingo Software Package on Approximate String Queries](https://reader035.fdocuments.net/reader035/viewer/2022070500/5681684d550346895dde4bef/html5/thumbnails/10.jpg)
2nd Problem: Selectivity Estimation
A bag of strings
Input: fuzzy string predicate P(q, δ)
star SIMILARTO ’Schwarrzenger’
Output: # of strings s that satisfy dist(s,q) <= δ
![Page 11: The Flamingo Software Package on Approximate String Queries](https://reader035.fdocuments.net/reader035/viewer/2022070500/5681684d550346895dde4bef/html5/thumbnails/11.jpg)
SEPIA: Intuition (VLDB 2005)
(Figure: a cluster with pivot p and string s, a query string q, and vectors v1 and v2; a probability histogram over ed(p,s) = 1 to 4 with labeled values 10%, 28%, 44%, and 100%.)
![Page 12: The Flamingo Software Package on Approximate String Queries](https://reader035.fdocuments.net/reader035/viewer/2022070500/5681684d550346895dde4bef/html5/thumbnails/12.jpg)
Story of “1-1-10-10”
- 1M strings in 1ms
- 10M strings in 10ms
![Page 13: The Flamingo Software Package on Approximate String Queries](https://reader035.fdocuments.net/reader035/viewer/2022070500/5681684d550346895dde4bef/html5/thumbnails/13.jpg)
String Grams
q-grams
For example: 2-grams of “universal”
(un),(ni),(iv),(ve),(er),(rs),(sa),(al)
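
The gram extraction on this slide can be sketched as a sliding window of width q. The function name `qgrams` is illustrative, and the sketch does not pad the string with special start or end characters.

```python
def qgrams(s, q=2):
    """Return the overlapping substrings of length q, in order of position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("universal"))
```

A string of length n yields n - q + 1 grams, so “universal” (9 characters) yields eight 2-grams.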
![Page 14: The Flamingo Software Package on Approximate String Queries](https://reader035.fdocuments.net/reader035/viewer/2022070500/5681684d550346895dde4bef/html5/thumbnails/14.jpg)
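Inverted lists
Convert strings to gram inverted lists

| id | string |
|---|---|
| 0 | rich |
| 1 | stick |
| 2 | stich |
| 3 | stuck |
| 4 | static |

| 2-gram | ids |
|---|---|
| at | 4 |
| ch | 0, 2 |
| ck | 1, 3 |
| ic | 0, 1, 2, 4 |
| ri | 0 |
| st | 1, 2, 3, 4 |
| ta | 4 |
| ti | 1, 2, 4 |
| tu | 3 |
| uc | 3 |
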
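
A minimal sketch of building the gram inverted lists for the five strings on this slide. It assumes each string id is stored at most once per gram and that positions are not kept; `build_index` is an illustrative name, not a Flamingo API.

```python
from collections import defaultdict


def build_index(strings, q=2):
    """Map each q-gram to the ascending list of ids of strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        grams = {s[i:i + q] for i in range(len(s) - q + 1)}
        for gram in grams:
            index[gram].append(sid)  # ids arrive in ascending order
    return dict(index)


strings = ["rich", "stick", "stich", "stuck", "static"]
index = build_index(strings)
for gram in sorted(index):
    print(gram, index[gram])
```

Because strings are processed in id order, every list comes out sorted, which the list-merging step relies on.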
![Page 15: The Flamingo Software Package on Approximate String Queries](https://reader035.fdocuments.net/reader035/viewer/2022070500/5681684d550346895dde4bef/html5/thumbnails/15.jpg)
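Main Example
Query: stick (st, ti, ic, ck), ed(s,q) ≤ 1, count ≥ 2

| id | string |
|---|---|
| 0 | rich |
| 1 | stick |
| 2 | stich |
| 3 | stuck |
| 4 | static |

Data grams and their inverted lists:

| gram | ids |
|---|---|
| ck | 1, 3 |
| ic | 0, 1, 2, 4 |
| st | 1, 2, 3, 4 |
| ta | 4 |
| ti | 1, 2, 4 |
| … | … |

Merge the lists of the query grams to get the candidates.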
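
A sketch of the count filter in this example. It assumes the standard q-gram bound, under which a string within edit distance k of the query shares at least (|query| - q + 1) - k·q grams with it; for “stick”, q = 2, and k = 1 this gives the slide's count ≥ 2. The function name and the hard-coded `INDEX` are illustrative.

```python
from collections import Counter

# Inverted lists for the query grams, as shown on the slide.
INDEX = {
    "ck": [1, 3],
    "ic": [0, 1, 2, 4],
    "st": [1, 2, 3, 4],
    "ta": [4],
    "ti": [1, 2, 4],
}


def count_filter_candidates(index, query, q=2, k=1):
    """Return (candidate ids, threshold) for an edit-distance-k query."""
    grams = [query[i:i + q] for i in range(len(query) - q + 1)]
    threshold = len(grams) - k * q
    if threshold <= 0:
        raise ValueError("count filter cannot prune for this query")
    counts = Counter()
    for gram in grams:
        for sid in index.get(gram, []):
            counts[sid] += 1
    candidates = sorted(sid for sid, c in counts.items() if c >= threshold)
    return candidates, threshold


candidates, threshold = count_filter_candidates(INDEX, "stick")
print(threshold, candidates)
```

The filter is only a necessary condition: “static” (id 4) shares three grams with the query yet is three edits away, so candidates still need to be verified with the real edit distance.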
![Page 16: The Flamingo Software Package on Approximate String Queries](https://reader035.fdocuments.net/reader035/viewer/2022070500/5681684d550346895dde4bef/html5/thumbnails/16.jpg)
Problem definition:
Find elements whose occurrences ≥ T
(Figure: inverted lists, each in ascending order, are merged.)
![Page 17: The Flamingo Software Package on Approximate String Queries](https://reader035.fdocuments.net/reader035/viewer/2022070500/5681684d550346895dde4bef/html5/thumbnails/17.jpg)
Example T = 4
Lists:
- 1, 3, 5, 10, 13
- 10, 13, 15
- 5, 7, 13
- 13, 15
Result: 13
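
One way to solve this example is a heap-based merge over the list frontiers. This is a sketch in the spirit of heap merging, not the exact published algorithm, and it assumes every list is sorted ascending with no repeated ids.

```python
import heapq


def heap_merge(lists, t):
    """Return the elements that occur in at least t of the sorted lists."""
    # Each heap entry is (value, list index, position within that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap:
        value = heap[0][0]
        count = 0
        # Pop every frontier equal to the current minimum and advance it.
        while heap and heap[0][0] == value:
            _, li, pos = heapq.heappop(heap)
            count += 1
            if pos + 1 < len(lists[li]):
                heapq.heappush(heap, (lists[li][pos + 1], li, pos + 1))
        if count >= t:
            result.append(value)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13, 15]]
print(heap_merge(lists, 4))
```

Only 13 appears in all four lists, matching the slide's result.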
![Page 18: The Flamingo Software Package on Approximate String Queries](https://reader035.fdocuments.net/reader035/viewer/2022070500/5681684d550346895dde4bef/html5/thumbnails/18.jpg)
Five Merge Algorithms (ICDE 2008)
Previous:
- HeapMerger [Sarawagi, SIGMOD 2004]
- MergeOpt [Sarawagi, SIGMOD 2004]
New:
- ScanCount
- MergeSkip
- DivideSkip
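
As an illustration of the simplest of the new algorithms, here is a sketch of the ScanCount idea: keep one counter per string id, scan every list once, and report ids whose counter reaches T. The signature and the `num_ids` parameter are assumptions for this sketch, not the package's interface.

```python
def scan_count(lists, t, num_ids):
    """Return ids that occur in at least t lists, using one counter per id."""
    counts = [0] * num_ids
    result = []
    for lst in lists:
        for sid in lst:
            counts[sid] += 1
            if counts[sid] == t:  # report each id once, when it first reaches t
                result.append(sid)
    return sorted(result)


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13, 15]]
print(scan_count(lists, 4, num_ids=16))
```

This trades memory (a counter array as large as the id space) for a single pass over the lists with no heap operations.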
![Page 19: The Flamingo Software Package on Approximate String Queries](https://reader035.fdocuments.net/reader035/viewer/2022070500/5681684d550346895dde4bef/html5/thumbnails/19.jpg)
Story of “1-1-10-10”
- 1M strings in 1ms
- 10M strings in 10ms
Next: VGRAM
![Page 20: The Flamingo Software Package on Approximate String Queries](https://reader035.fdocuments.net/reader035/viewer/2022070500/5681684d550346895dde4bef/html5/thumbnails/20.jpg)
Observation 1: dilemma of choosing “q”
Increasing “q” causes:
- Longer grams
- Shorter lists
- Smaller # of common grams of similar strings
(Figure: the 2-gram inverted lists for the strings rich, stick, stich, stuck, static.)
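
The trade-off can be seen on the five example strings. For each q, the sketch below reports the longest inverted list, the number of grams that “stick” and “stich” share, and the count-filter threshold for a length-5 query at edit distance 1. The threshold uses the standard (length - q + 1) - k·q bound, which is an assumption here rather than something stated on the slide.

```python
STRINGS = ["rich", "stick", "stich", "stuck", "static"]


def qgrams(s, q):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def longest_list(strings, q):
    """Length of the longest inverted list when indexing with q-grams."""
    lists = {}
    for sid, s in enumerate(strings):
        for gram in set(qgrams(s, q)):
            lists.setdefault(gram, []).append(sid)
    return max(len(ids) for ids in lists.values())


def common_grams(a, b, q):
    """Number of distinct q-grams shared by two strings."""
    return len(set(qgrams(a, q)) & set(qgrams(b, q)))


def count_threshold(length, q, k):
    """Lower bound on shared grams for a query of this length within k edits."""
    return (length - q + 1) - k * q


summary = {
    q: (longest_list(STRINGS, q),
        common_grams("stick", "stich", q),
        count_threshold(5, q, 1))
    for q in (2, 3, 4)
}
for q, row in summary.items():
    print(q, row)
```

Lists get shorter as q grows, but the two similar strings share fewer grams and the threshold drops to zero or below, at which point the count filter can no longer prune anything.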
![Page 21: The Flamingo Software Package on Approximate String Queries](https://reader035.fdocuments.net/reader035/viewer/2022070500/5681684d550346895dde4bef/html5/thumbnails/21.jpg)
Observation 2: skew distributions of gram frequencies
- DBLP: 276,699 article titles
- Popular 5-grams: ation (>114K times), tions, ystem, catio
![Page 22: The Flamingo Software Package on Approximate String Queries](https://reader035.fdocuments.net/reader035/viewer/2022070500/5681684d550346895dde4bef/html5/thumbnails/22.jpg)
VGRAM: Main idea
Grams with variable lengths (between qmin and qmax)
- zebra: ze(123)
- corrasion: co(5213), cor(859), corr(171)
Advantages
- Reduce index size
- Reduce running time
- Adoptable by many algorithms
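
A simplified sketch of how a string might be split into variable-length grams once a gram dictionary exists. At each position it takes the longest dictionary gram (up to qmax), falls back to the qmin-gram, and skips any gram fully covered by one already emitted. The dictionary below is hypothetical, and the actual VGRAM procedure (including how the dictionary is built) has details this sketch omits.

```python
def vgram_decompose(s, dictionary, qmin, qmax):
    """Return (position, gram) pairs of variable-length grams for s."""
    grams = []
    covered_end = 0  # end offset of the last emitted gram
    for p in range(len(s) - qmin + 1):
        gram = s[p:p + qmin]  # fallback when no longer dictionary gram matches
        for length in range(min(qmax, len(s) - p), qmin, -1):
            if s[p:p + length] in dictionary:
                gram = s[p:p + length]
                break
        end = p + len(gram)
        if end > covered_end:  # not subsumed by an earlier gram
            grams.append((p, gram))
            covered_end = end
    return grams


# Hypothetical gram dictionary, for illustration only.
dictionary = {"ni", "ivr", "sal", "uni", "vers"}
print(vgram_decompose("universal", dictionary, qmin=2, qmax=4))
```

Here “universal” is covered by four grams instead of the eight fixed 2-grams, which is where the smaller index comes from.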
![Page 23: The Flamingo Software Package on Approximate String Queries](https://reader035.fdocuments.net/reader035/viewer/2022070500/5681684d550346895dde4bef/html5/thumbnails/23.jpg)
Challenges
- Generating variable-length grams?
- Constructing a high-quality gram dictionary?
- Relationship between string similarity and their gram-set similarity?
- Adopting VGRAM in existing algorithms?
![Page 24: The Flamingo Software Package on Approximate String Queries](https://reader035.fdocuments.net/reader035/viewer/2022070500/5681684d550346895dde4bef/html5/thumbnails/24.jpg)
Story of “1-1-10-10”
- 1M strings in 1ms
- 10M strings in 10ms
- Challenge: large index size
![Page 25: The Flamingo Software Package on Approximate String Queries](https://reader035.fdocuments.net/reader035/viewer/2022070500/5681684d550346895dde4bef/html5/thumbnails/25.jpg)
Contributions (ICDE 2009)
Proposed two lossy compression techniques
- Answer queries exactly
- Index fits into a space budget
- Queries faster on the compressed indexes
- Flexibility to choose space / time tradeoff
- Existing list-merging algorithms: re-use + compression-specific optimizations
![Page 26: The Flamingo Software Package on Approximate String Queries](https://reader035.fdocuments.net/reader035/viewer/2022070500/5681684d550346895dde4bef/html5/thumbnails/26.jpg)
Intuition of compression techniques
Find elements whose occurrences ≥ T
(Figure: inverted lists, each in ascending order, are merged.)
![Page 27: The Flamingo Software Package on Approximate String Queries](https://reader035.fdocuments.net/reader035/viewer/2022070500/5681684d550346895dde4bef/html5/thumbnails/27.jpg)
Content of Flamingo Package
- List mergers
- SEPIA
- Stringmap
- Location-based fuzzy search
- PartEnum (fuzzy join)
- Fuzzy join using MapReduce
- …
![Page 28: The Flamingo Software Package on Approximate String Queries](https://reader035.fdocuments.net/reader035/viewer/2022070500/5681684d550346895dde4bef/html5/thumbnails/28.jpg)
Development of Flamingo
- C++
- Contributors: 9 people (different times)
- Four releases
- Well received by various communities
![Page 29: The Flamingo Software Package on Approximate String Queries](https://reader035.fdocuments.net/reader035/viewer/2022070500/5681684d550346895dde4bef/html5/thumbnails/29.jpg)
Making an impact?
![Page 30: The Flamingo Software Package on Approximate String Queries](https://reader035.fdocuments.net/reader035/viewer/2022070500/5681684d550346895dde4bef/html5/thumbnails/30.jpg)
UCI People Search
![Page 31: The Flamingo Software Package on Approximate String Queries](https://reader035.fdocuments.net/reader035/viewer/2022070500/5681684d550346895dde4bef/html5/thumbnails/31.jpg)
PSearch
![Page 32: The Flamingo Software Package on Approximate String Queries](https://reader035.fdocuments.net/reader035/viewer/2022070500/5681684d550346895dde4bef/html5/thumbnails/32.jpg)
Other systems built
- iPubmed: http://ipubmed.ics.uci.edu
- Location-based instant search
- …
- Started a company: Bimaple
![Page 33: The Flamingo Software Package on Approximate String Queries](https://reader035.fdocuments.net/reader035/viewer/2022070500/5681684d550346895dde4bef/html5/thumbnails/33.jpg)
Lessons learned
Hands-on experiences …
![Page 34: The Flamingo Software Package on Approximate String Queries](https://reader035.fdocuments.net/reader035/viewer/2022070500/5681684d550346895dde4bef/html5/thumbnails/34.jpg)
Lessons learned
Research management
- Software development: code sharing
- Tools: svn, wiki, etc.
- Team environment
- Research continuity
![Page 35: The Flamingo Software Package on Approximate String Queries](https://reader035.fdocuments.net/reader035/viewer/2022070500/5681684d550346895dde4bef/html5/thumbnails/35.jpg)
Lessons learned
- Impact
- Outreach activities
![Page 36: The Flamingo Software Package on Approximate String Queries](https://reader035.fdocuments.net/reader035/viewer/2022070500/5681684d550346895dde4bef/html5/thumbnails/36.jpg)
Thank you!
http://flamingo.ics.uci.edu/