Approximate String Joins in a Database (Almost) for Free L. GravanoP.G. IpeirotisH.V. Jagadish...
-
Upload
allison-stevenson -
Category
Documents
-
view
215 -
download
0
description
Transcript of Approximate String Joins in a Database (Almost) for Free L. GravanoP.G. IpeirotisH.V. Jagadish...
Approximate String Joins in a Database (Almost) for Free
L. Gravano P.G. Ipeirotis H.V. JagadishColumbia Univ. Columbia Univ. Univ. of Michigan
N. Koudas S. Muthukrishnan D. SrivastavaAT&T Labs AT&T Labs AT&T Labs
Scenario of Application
The work was done before 2001. The related research has been put in practice
05/03/23 Columbia UniversityComputer Science Dept.
2
05/03/23 Columbia UniversityComputer Science Dept.
3
The Need for String Joins
Service A
Jenny Stamatopoulou
John Paul McDougal
Aldridge Rodriguez
Panos Ipeirotis
John Smith
…
…
Service B
Panos Ipirotis
Jonh Smith
…
Jenny Stamatopulou
John P. McDougal
…
Al Dridge Rodriguez
Substantial amounts of data in existing DBMSs are strings Often, there is a need to correlate data stored in different tables
Example: Find common customers across various customer databases
05/03/23 Columbia UniversityComputer Science Dept.
4
Problems with Exact String Joins
Service A
Jenny Stamatopoulou
John Paul McDougal
Aldridge Rodriguez
Panos Ipeirotis
John Smith
…
…
Service B
Panos Ipirotis
Jonh Smith
…
Jenny Stamatopulou
John P. McDougal
…
Al Dridge Rodriguez
Typing mistakes (e.g., John vs. Jonh) No standard way of recording string data Standard equijoins do not “forgive such mistakes”
⋈= ∅
05/03/23 Columbia UniversityComputer Science Dept.
5
Approximate String Joins
Service A
Jenny Stamatopoulou
John Paul McDougal
Aldridge Rodriguez
Panos Ipeirotis
John Smith
…
…
Service B
Panos Ipirotis
Jonh Smith
…
Jenny Stamatopulou
John P. McDougal
…
Al Dridge Rodriguez
We want to join tuples with “similar” string fields Similarity measure: Edit Distance Each Insertion, Deletion, Replacement increases distance by one (minimal!)
K=1
K=2
K=1
K=3
K=1
05/03/23 Columbia UniversityComputer Science Dept.
6
Our Focus: Approximate String Joins over Relational DBMSs
Join two tables on string attributes and keep all pairs of strings with Edit Distance ≤ K
Solve the problem in a database-friendly way(if possible with an existing "vanilla" RDBMS)
05/03/23 Columbia UniversityComputer Science Dept.
7
Related Work
Data Cleaning Hernandez and Stolfo, DMKD Journal, 2 (1), 1998 Monge and Elkan, SIGMOD DMKD Workshop, 1997 ...
Approximate String Matching Baeza-Yates and Gonnet, SPIRE 1999 Sutinen and Tarhio, ESA’95, CPM’96 Smith and Waterman, Journal of Molecular Biology 147, 1981 Ukkonen, TCS 92(1), 1992 Ullman, Computer Journal, 1977 …
05/03/23 Columbia UniversityComputer Science Dept.
8
Current Approaches for Processing Approximate String Joins
No native support for approximate joins in RDBMSs
Two existing (straightforward) solutions: Join data outside of DBMS Join data via user-defined functions (UDFs) inside the DBMS
05/03/23 Columbia UniversityComputer Science Dept.
9
Approximate String Joins outside of a DBMS
1. Export data2. Join outside of DBMS3. Import the result
Main advantage: We can exploit any state-of-the-art string-matching algorithm, without restrictions from DBMS functionality
Disadvantages: Substantial amounts of data to be exported/imported Cannot be easily integrated with further processing steps in the
DBMS
05/03/23 Columbia UniversityComputer Science Dept.
10
Approximate String Joins with UDFs
1. Write a UDF to check if two strings match within distance K2. Write an SQL statement that applies the UDF to the string pairs
SELECT R.stringAttr, S.stringAttrFROM R, SWHERE edit_distance(R.stringAttr, S.stringAttr, K)
Main advantage: Ease of implementation
Main disadvantage: UDF applied to entire cross-product of relations
Edit distance Algorithms
Applied in UDF method: WHERE edit_distance(R.stringAttr, S.stringAttr, K)
Input: two strings s and t with reasonable lengths m and n.
Output: their edit_distance. Dynamic Programming Algorithm: Levenshtein distance
05/03/23 Columbia UniversityComputer Science Dept.
11
Levenshtein distance Algorithm
int LevenshteinDistance(char s[1..m], char t[1..n]) declare int d[0..m, 0..n] //(m+1)*(n+1) Matrix //initialization of the Matrix
for i from 0 to m d[i, 0] := i for j from 0 to n d[0, j] := j
05/03/23 Columbia UniversityComputer Science Dept.
12
Matrix for Levenshtein distance (intialization) K I T T E N
0 1 2 3 4 5 6
S 1
I 2
T 3
T 4
I 5
N 6
G 7
05/03/23 Columbia UniversityComputer Science Dept.
13
Levenshtein distance Algorithm (con’d)
//after the initialization for j from 1 to n { if s[i] = t[j] then cost := 0 else cost := 1 d[i, j] := minimum( d[i-1, j] + 1, // deletion d[i, j-1] + 1, // insertion d[i-1, j-1] + cost // substitution ) } 05/03/23 Columbia University
Computer Science Dept.14
Matrix for Levenshtein distance
05/03/23 Columbia UniversityComputer Science Dept.
15
K I T T E N0 1 2 3 4 5 6
S 1 1 2 3 4 5 6
I 2 2 1 2 3 4 5
T 3 3 2 1 2 3 4
T 4 4 3 2 1 2 3
I 5 5 4 3 2 2 3
N 6 6 5 4 3 3 2
G 7 7 6 5 4 4 3
The “right-most and lowest” element d[m, n] is the edit-distance.
Levenshtein distance Algorithm
Correctness: Invariant: Given the distance of two substrings s[1…i]
and t[1…j], we can determine s[1…i+1] and t[1…j+1] by adding d[i, j].
The initialization step meets the definition of edit distance well, and then we induce it according to the Invariant and prove it true.
O(mn) time, which could be improved (not the heart of the matter..)
05/03/23 Columbia UniversityComputer Science Dept.
16
05/03/23 Columbia UniversityComputer Science Dept.
17
Our Approach: Approximate String Joins over an Unmodified RDBMS
1. Preprocess data and generate auxiliary tables2. Perform join exploiting standard RDBMS capabilities
Advantages No modification of underlying RDBMS needed. Can leverage the RDBMS query optimizer. Much more efficient than the approach based on naive UDFs
05/03/23 Columbia UniversityComputer Science Dept.
18
Our Approach: Intuition and Roadmap
Intuition: Similar strings have many common substrings Use exact joins to perform approximate joins (current
DBMSs are good for exact joins) A good candidate set can be verified for false positives[Ukkonen 1992, Sutinen and Tarhio 1996, Ullman 1977]
Roadmap: Break strings into substrings of length q (q-grams) Perform an exact join on the q-grams Find candidate string pairs based on the results Check only candidate pairs with a UDF to obtain final answer
05/03/23 Columbia UniversityComputer Science Dept.
19
What is a “Q-gram”?
Q-gram: A sequence of q characters of the original string
Example for q=3vacations
{##v, #va, vac, aca, cat, ati, tio, ion, ons, ns$, s$$}
String with length L → L + q - 1 q-grams
Similar strings have a many common q-grams
05/03/23 Columbia UniversityComputer Science Dept.
20
Q-grams and Edit Distance Operations
With no edits: L + q - 1 common q-grams
Replacement: (L + q – 1) - q common q-gramsVacations: {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns#, s$$}Vacalions: {##v, #va, vac, aca, cal, ali, lio, ion, ons, ns#, s$$}
Insertion: (Lmax + q – 1) - q common q-gramsVacations: {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns#, s$$}Vacatlions: {##v, #va, vac, aca, cat, atl, tli, lio, ion, ons, ns#, s$$}
Deletion: (Lmax + q – 1) - q common q-gramsVacations: {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns#, s$$}Vacaions: {##v, #va, vac, aca, cai, aio, ion, ons, ns#, s$$}
05/03/23 Columbia UniversityComputer Science Dept.
21
“Q-gram Distance” and Edit Distance
In the “q-gram space” the distance for a pair of strings (S1,S2) is defined as: |Maximum number of q-grams| - |Common q-grams|
Each Edit Distance operation affects a q-gram in the following ways: It destroys it, or It leaves it intact, or It shifts it by one position
Each Edit Distance operation destroys at most q q-grams
For Edit Distance = K, the “q-gram distance” is at most Kq
05/03/23 Columbia UniversityComputer Science Dept.
22
Number of Common Q-grams and Edit Distance
For Edit Distance = K, there could be at most K replacements, insertions, deletions
Two strings S1 and S2 with Edit Distance ≤ K have at least [max(S1.len, S2.len) + q - 1] – Kq q-grams in common
Useful filter: eliminate all string pairs without "enough" common q-grams (no false dismissals)
05/03/23 Columbia UniversityComputer Science Dept.
23
Using a DBMS for Q-gram Joins
If we have the q-grams in the DBMS, we can perform this counting efficiently.
Create auxiliary tables with tuples of the form:<sid, strlen, qgram>
and join these tables
A GROUP BY – HAVING COUNT clause can perform the counting / filtering
05/03/23 Columbia UniversityComputer Science Dept.
24
Eliminating Candidate Pairs: COUNT FILTERING
SQL for this filter: (parts omitted for clarity)
SELECT R.sid, S.sidFROM R, SWHERE R.qgram=S.qgramGROUP BY R.sid, S.sidHAVING COUNT(*) >= (max(R.strlen, S.strlen) + q - 1) – K*q
The result is the pair of strings with sufficiently enough common q-grams to ensure that we will not have false negatives.
05/03/23 Columbia UniversityComputer Science Dept.
25
Eliminating Candidate Pairs Further: LENGTH FILTERING
Strings with length difference larger than K cannot be within Edit Distance K
SELECT R.sid, S.sidFROM R, SWHERE R.qgram=S.qgram AND abs(R.strlen - S.strlen)<=KGROUP BY R.sid, S.sidHAVING COUNT(*) >= (max(R.strlen, S.strlen) + q – 1) – K*q
We refer to this filter as LENGTH FILTERING
05/03/23 Columbia UniversityComputer Science Dept.
26
Exploiting Q-gram Positions for Filtering
Consider strings aabbzzaacczz and aacczzaabbzz Strings are at edit distance 4 Strings have identical q-grams for q=3
Problem: Matching q-grams that are at different positions in both strings Either q-grams do not "originate" from same q-gram, or Too many edit operations "caused" spurious q-grams
at various parts of strings to match
05/03/23 Columbia UniversityComputer Science Dept.
27
POSITION FILTERING - Filtering using positions
Keep the position of the q-grams <sid, strlen, pos, qgram> Do not match q-grams that are more than K positions away
SELECT R.sid, S.sidFROM R, SWHERE R.qgram=S.qgram
AND abs(R.strlen - S.strlen)<=KAND abs(R.pos - S.pos)<=K
GROUP BY R.sid, S.sidHAVING COUNT(*) >= (max(R.strlen, S.strlen) + q – 1) – K*q
We refer to this filter as POSITION FILTERING
05/03/23 Columbia UniversityComputer Science Dept.
28
The Actual, Complete SQL Statement
SELECT R1.string, S1.string, R1.sid, S1.sidFROM R1, S1, R, S, WHERE R1.sid=R.sid
AND S1.sid=S.sidAND R.qgram=S.qgram AND abs(strlen(R1.string)–strlen(S1.string))<=KAND abs(R.pos - S.pos)<=K
GROUP BY R1.sid, S1.sid, R1.string, S1.stringHAVING COUNT(*) >=
(max(strlen(R1.string),strlen(S1.string))+ q-1)–K*q
05/03/23 Columbia UniversityComputer Science Dept.
29
Experimental Results: Data
Three sets of customer data from AT&T Worldnet Service 3 customer data sets from AT&T Worldnet:
(a) set1 with about 40K strings (b) set2 and (c) set3 with about 30K strings each
0
2000
4000
6000
8000
10000
12000
1 6 11 16 21 26 31
String Length
Num
ber o
f Str
ings
(a)
0
200
400
600
800
1000
1200
1 9 17 25 33 41 49 57 65
String Length
Num
ber o
f Stri
ngs
(b)
0
100
200
300
400
500
600
1 7 13 19 25 31 37 43 49 55 61 67
String Length
Num
ber o
f Stri
ngs
(c)
05/03/23 Columbia UniversityComputer Science Dept.
30
DBMS Setup
Used Oracle 8i (supports UDFs), on Sun 20 Enterprise Server
Materialized the q-gram tables with entries <sid, qgram, pos> (less then 2 minutes per table)
Tested configurations with and without indexes on the auxiliary q-gram tables (less than 5 minutes to generate each index)
The generation time for the auxiliary q-gram tables and indexes is small: Even on-the-fly materialization is feasible
05/03/23 Columbia UniversityComputer Science Dept.
31
Query Plans Generated by RDBMS
Naive approach with UDFs: nested-loops joins (prohibitively slow even for small data sets)
Q-gram approach: usually sort-merge joins
In our prototype implementation, sort-merge joins is the fastest as well
05/03/23 Columbia UniversityComputer Science Dept.
32
Naïve UDFs vs. Filtering
10
100
1000
10000Q1 (UDF only) Q4 (Filtering)
Q1 (UDF only) 1954 2028 2044Q4 (Filtering) 48 68 91
k=1 k=2 k=3
For a subset of set1, our technique was 20 to 30 times faster than the naïve use of UDFs
05/03/23 Columbia UniversityComputer Science Dept.
33
Effect of Filters (Candidate Set)CP=Cross Product, L=Length Filtering, LP=Length and Position Filtering, LC=Length and Count Filtering, LPC=Length, Position, and Count Filtering, Real=Number of Real matches
1.E+05
1.E+06
1.E+07
1.E+08
1.E+09
1.E+10
k=1 k=2 k=3
Cand
idat
e Se
t Siz
e
CP L LP LC LPC Real
(a)
1.E+05
1.E+06
1.E+07
1.E+08
1.E+09
1.E+10
k=1 k=2 k=3
Cand
idat
e Se
t Siz
e
CP L LP LC LPC Real
(b)
1.E+05
1.E+06
1.E+07
1.E+08
1.E+09
1.E+10
k=1 k=2 k=3
Cand
idat
e Se
t Siz
e
CP L LP LC LPC Real
(c)
LENGTH FILTERING: 40-70% reduction for set1 (small length deviation)
90-98% reductions for set2, set3 (big length deviation)
+COUNT FILTERING: > 99% reduction
POSITION FILTERING: ~ 50% reduction (additionally)
05/03/23 Columbia UniversityComputer Science Dept.
34
Effect of Filters (Candidate Set) – Best Q
1.E+05
1.E+06
1.E+07
set1 set2 set3
Cand
idat
e Se
t Siz
e
q=1 q=2 q=3 q=4 q=5
(a)
1.E+06
1.E+07
1.E+08
set1 set2 set3
Cand
idat
e Se
t Siz
e
q=1 q=2 q=3 q=4 q=5
(b)
For the given data sets q=2 worked best q=2 is close to the theoretical approximations as well q=2 is small enough to avoid, as much as possible, the space
overhead for the auxiliary tables
05/03/23 Columbia UniversityComputer Science Dept.
35
Effect of Filters (Q-gram Join Size)
COUNT FILTERING is applied last (in HAVING clause)
For efficiency, it is important to have a small join size for the q-gram tables
LENGTH FILTERING cuts the size by a factor of 2 to 10
+POSITION FILTERING cuts the size by a factor of ~100
05/03/23 Columbia UniversityComputer Science Dept.
36
Extensions: Substring Joins
Substring approximate joins: Length filtering is not applicable Substring matches have fewer common q-grams (no “overflow”
q-grams) Position filtering is not directly applicable (it needs tuning
depending on the substring location)
We also propose a fast in-memory filter (see paper) The result is not trivial! Exploiting position of q-grams extensively to find possible
alignments
05/03/23 Columbia UniversityComputer Science Dept.
37
Extensions: Block Moves
Match “AT&T Corp” with “Corp AT&T” as a unit cost operation.
We can allow block moves: Length filtering is still applicable The threshold for count filtering is different Position filtering is not applicable (the q-gram may move far
away)
05/03/23 Columbia UniversityComputer Science Dept.
38
Conclusions
We introduced a technique for mapping approximate string joins into a “vanilla” SQL expression
Our technique does not require modifying the underlying RDBMS
Our technique exploits the RDBMS's query optimizer
Our technique significantly outperforms existing approaches
Many opportunities for improvements and future work!