Approximate String Joins in a Database (Almost) for Free L. GravanoP.G. IpeirotisH.V. Jagadish...

Approximate String Joins in a Database (Almost) for Free

L. Gravano P.G. Ipeirotis H.V. JagadishColumbia Univ. Columbia Univ. Univ. of Michigan

N. Koudas S. Muthukrishnan D. SrivastavaAT&T Labs AT&T Labs AT&T Labs

Scenario of Application

The work was done before 2001. The related research has been put in practice

05/03/23 Columbia UniversityComputer Science Dept.

2


3

The Need for String Joins

Service A

Jenny Stamatopoulou

John Paul McDougal

Aldridge Rodriguez

Panos Ipeirotis

John Smith

…

…

Service B

Panos Ipirotis

Jonh Smith

…

Jenny Stamatopulou

John P. McDougal

…

Al Dridge Rodriguez

Substantial amounts of data in existing DBMSs are strings Often, there is a need to correlate data stored in different tables

Example: Find common customers across various customer databases


4

Problems with Exact String Joins

Service A

Jenny Stamatopoulou

John Paul McDougal

Aldridge Rodriguez

Panos Ipeirotis

John Smith

…

…

Service B

Panos Ipirotis

Jonh Smith

…

Jenny Stamatopulou

John P. McDougal

…

Al Dridge Rodriguez

Typing mistakes (e.g., John vs. Jonh) No standard way of recording string data Standard equijoins do not “forgive such mistakes”

⋈= ∅


5

Approximate String Joins

Service A

Jenny Stamatopoulou

John Paul McDougal

Aldridge Rodriguez

Panos Ipeirotis

John Smith

…

…

Service B

Panos Ipirotis

Jonh Smith

…

Jenny Stamatopulou

John P. McDougal

…

Al Dridge Rodriguez

We want to join tuples with “similar” string fields Similarity measure: Edit Distance Each Insertion, Deletion, Replacement increases distance by one (minimal!)

K=1

K=2

K=1

K=3

K=1


6

Our Focus: Approximate String Joins over Relational DBMSs

Join two tables on string attributes and keep all pairs of strings with Edit Distance ≤ K

Solve the problem in a database-friendly way(if possible with an existing "vanilla" RDBMS)


7

Related Work

Data Cleaning Hernandez and Stolfo, DMKD Journal, 2 (1), 1998 Monge and Elkan, SIGMOD DMKD Workshop, 1997 ...

Approximate String Matching Baeza-Yates and Gonnet, SPIRE 1999 Sutinen and Tarhio, ESA’95, CPM’96 Smith and Waterman, Journal of Molecular Biology 147, 1981 Ukkonen, TCS 92(1), 1992 Ullman, Computer Journal, 1977 …


8

Current Approaches for Processing Approximate String Joins

No native support for approximate joins in RDBMSs

Two existing (straightforward) solutions: Join data outside of DBMS Join data via user-defined functions (UDFs) inside the DBMS


9

Approximate String Joins outside of a DBMS

1. Export data2. Join outside of DBMS3. Import the result

Main advantage: We can exploit any state-of-the-art string-matching algorithm, without restrictions from DBMS functionality

Disadvantages: Substantial amounts of data to be exported/imported Cannot be easily integrated with further processing steps in the

DBMS


10

Approximate String Joins with UDFs

1. Write a UDF to check if two strings match within distance K2. Write an SQL statement that applies the UDF to the string pairs

SELECT R.stringAttr, S.stringAttrFROM R, SWHERE edit_distance(R.stringAttr, S.stringAttr, K)

Main advantage: Ease of implementation

Main disadvantage: UDF applied to entire cross-product of relations

Edit distance Algorithms

Applied in UDF method: WHERE edit_distance(R.stringAttr, S.stringAttr, K)

Input: two strings s and t with reasonable lengths m and n.

Output: their edit_distance. Dynamic Programming Algorithm: Levenshtein distance


11

Levenshtein distance Algorithm

int LevenshteinDistance(char s[1..m], char t[1..n]) declare int d[0..m, 0..n] //(m+1)*(n+1) Matrix //initialization of the Matrix

for i from 0 to m d[i, 0] := i for j from 0 to n d[0, j] := j


12

Matrix for Levenshtein distance (intialization) K I T T E N

0 1 2 3 4 5 6

S 1

I 2

T 3

T 4

I 5

N 6

G 7


13

Levenshtein distance Algorithm (con’d)

//after the initialization for j from 1 to n { if s[i] = t[j] then cost := 0 else cost := 1 d[i, j] := minimum( d[i-1, j] + 1, // deletion d[i, j-1] + 1, // insertion d[i-1, j-1] + cost // substitution ) } 05/03/23 Columbia University

Computer Science Dept.14

Matrix for Levenshtein distance


15

K I T T E N0 1 2 3 4 5 6

S 1 1 2 3 4 5 6

I 2 2 1 2 3 4 5

T 3 3 2 1 2 3 4

T 4 4 3 2 1 2 3

I 5 5 4 3 2 2 3

N 6 6 5 4 3 3 2

G 7 7 6 5 4 4 3

The “right-most and lowest” element d[m, n] is the edit-distance.

Levenshtein distance Algorithm

Correctness: Invariant: Given the distance of two substrings s[1…i]

and t[1…j], we can determine s[1…i+1] and t[1…j+1] by adding d[i, j].

The initialization step meets the definition of edit distance well, and then we induce it according to the Invariant and prove it true.

O(mn) time, which could be improved (not the heart of the matter..)


16


17

Our Approach: Approximate String Joins over an Unmodified RDBMS

1. Preprocess data and generate auxiliary tables2. Perform join exploiting standard RDBMS capabilities

Advantages No modification of underlying RDBMS needed. Can leverage the RDBMS query optimizer. Much more efficient than the approach based on naive UDFs


18

Our Approach: Intuition and Roadmap

Intuition: Similar strings have many common substrings Use exact joins to perform approximate joins (current

DBMSs are good for exact joins) A good candidate set can be verified for false positives[Ukkonen 1992, Sutinen and Tarhio 1996, Ullman 1977]

Roadmap: Break strings into substrings of length q (q-grams) Perform an exact join on the q-grams Find candidate string pairs based on the results Check only candidate pairs with a UDF to obtain final answer


19

What is a “Q-gram”?

Q-gram: A sequence of q characters of the original string

Example for q=3vacations

{##v, #va, vac, aca, cat, ati, tio, ion, ons, ns$, s$$}

String with length L → L + q - 1 q-grams

Similar strings have a many common q-grams


20

Q-grams and Edit Distance Operations

With no edits: L + q - 1 common q-grams

Replacement: (L + q – 1) - q common q-gramsVacations: {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns#, s$$}Vacalions: {##v, #va, vac, aca, cal, ali, lio, ion, ons, ns#, s$$}

Insertion: (Lmax + q – 1) - q common q-gramsVacations: {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns#, s$$}Vacatlions: {##v, #va, vac, aca, cat, atl, tli, lio, ion, ons, ns#, s$$}

Deletion: (Lmax + q – 1) - q common q-gramsVacations: {##v, #va, vac, aca, cat, ati, tio, ion, ons, ns#, s$$}Vacaions: {##v, #va, vac, aca, cai, aio, ion, ons, ns#, s$$}


21

“Q-gram Distance” and Edit Distance

In the “q-gram space” the distance for a pair of strings (S1,S2) is defined as: |Maximum number of q-grams| - |Common q-grams|

Each Edit Distance operation affects a q-gram in the following ways: It destroys it, or It leaves it intact, or It shifts it by one position

Each Edit Distance operation destroys at most q q-grams

For Edit Distance = K, the “q-gram distance” is at most Kq


22

Number of Common Q-grams and Edit Distance

For Edit Distance = K, there could be at most K replacements, insertions, deletions

Two strings S1 and S2 with Edit Distance ≤ K have at least [max(S1.len, S2.len) + q - 1] – Kq q-grams in common

Useful filter: eliminate all string pairs without "enough" common q-grams (no false dismissals)


23

Using a DBMS for Q-gram Joins

If we have the q-grams in the DBMS, we can perform this counting efficiently.

Create auxiliary tables with tuples of the form:<sid, strlen, qgram>

and join these tables

A GROUP BY – HAVING COUNT clause can perform the counting / filtering


24

Eliminating Candidate Pairs: COUNT FILTERING

SQL for this filter: (parts omitted for clarity)

SELECT R.sid, S.sidFROM R, SWHERE R.qgram=S.qgramGROUP BY R.sid, S.sidHAVING COUNT(*) >= (max(R.strlen, S.strlen) + q - 1) – K*q

The result is the pair of strings with sufficiently enough common q-grams to ensure that we will not have false negatives.


25

Eliminating Candidate Pairs Further: LENGTH FILTERING

Strings with length difference larger than K cannot be within Edit Distance K

SELECT R.sid, S.sidFROM R, SWHERE R.qgram=S.qgram AND abs(R.strlen - S.strlen)<=KGROUP BY R.sid, S.sidHAVING COUNT(*) >= (max(R.strlen, S.strlen) + q – 1) – K*q

We refer to this filter as LENGTH FILTERING


26

Exploiting Q-gram Positions for Filtering

Consider strings aabbzzaacczz and aacczzaabbzz Strings are at edit distance 4 Strings have identical q-grams for q=3

Problem: Matching q-grams that are at different positions in both strings Either q-grams do not "originate" from same q-gram, or Too many edit operations "caused" spurious q-grams

at various parts of strings to match


27

POSITION FILTERING - Filtering using positions

Keep the position of the q-grams <sid, strlen, pos, qgram> Do not match q-grams that are more than K positions away

SELECT R.sid, S.sidFROM R, SWHERE R.qgram=S.qgram

AND abs(R.strlen - S.strlen)<=KAND abs(R.pos - S.pos)<=K

GROUP BY R.sid, S.sidHAVING COUNT(*) >= (max(R.strlen, S.strlen) + q – 1) – K*q

We refer to this filter as POSITION FILTERING


28

The Actual, Complete SQL Statement

SELECT R1.string, S1.string, R1.sid, S1.sidFROM R1, S1, R, S, WHERE R1.sid=R.sid

AND S1.sid=S.sidAND R.qgram=S.qgram AND abs(strlen(R1.string)–strlen(S1.string))<=KAND abs(R.pos - S.pos)<=K

GROUP BY R1.sid, S1.sid, R1.string, S1.stringHAVING COUNT(*) >=

(max(strlen(R1.string),strlen(S1.string))+ q-1)–K*q


29

Experimental Results: Data

Three sets of customer data from AT&T Worldnet Service 3 customer data sets from AT&T Worldnet:

(a) set1 with about 40K strings (b) set2 and (c) set3 with about 30K strings each

0

2000

4000

6000

8000

10000

12000

1 6 11 16 21 26 31

String Length

Num

ber o

f Str

ings

(a)

0

200

400

600

800

1000

1200

1 9 17 25 33 41 49 57 65

String Length

Num

ber o

f Stri

ngs

(b)

0

100

200

300

400

500

600

1 7 13 19 25 31 37 43 49 55 61 67

String Length

Num

ber o

f Stri

ngs

(c)


30

DBMS Setup

Used Oracle 8i (supports UDFs), on Sun 20 Enterprise Server

Materialized the q-gram tables with entries <sid, qgram, pos> (less then 2 minutes per table)

Tested configurations with and without indexes on the auxiliary q-gram tables (less than 5 minutes to generate each index)

The generation time for the auxiliary q-gram tables and indexes is small: Even on-the-fly materialization is feasible


31

Query Plans Generated by RDBMS

Naive approach with UDFs: nested-loops joins (prohibitively slow even for small data sets)

Q-gram approach: usually sort-merge joins

In our prototype implementation, sort-merge joins is the fastest as well


32

Naïve UDFs vs. Filtering

10

100

1000

10000Q1 (UDF only) Q4 (Filtering)

Q1 (UDF only) 1954 2028 2044Q4 (Filtering) 48 68 91

k=1 k=2 k=3

For a subset of set1, our technique was 20 to 30 times faster than the naïve use of UDFs


33

Effect of Filters (Candidate Set)CP=Cross Product, L=Length Filtering, LP=Length and Position Filtering, LC=Length and Count Filtering, LPC=Length, Position, and Count Filtering, Real=Number of Real matches

1.E+05

1.E+06

1.E+07

1.E+08

1.E+09

1.E+10

k=1 k=2 k=3

Cand

idat

e Se

t Siz

e

CP L LP LC LPC Real

(a)

1.E+05

1.E+06

1.E+07

1.E+08

1.E+09

1.E+10

k=1 k=2 k=3

Cand

idat

e Se

t Siz

e

CP L LP LC LPC Real

(b)

1.E+05

1.E+06

1.E+07

1.E+08

1.E+09

1.E+10

k=1 k=2 k=3

Cand

idat

e Se

t Siz

e

CP L LP LC LPC Real

(c)

LENGTH FILTERING: 40-70% reduction for set1 (small length deviation)

90-98% reductions for set2, set3 (big length deviation)

+COUNT FILTERING: > 99% reduction

POSITION FILTERING: ~ 50% reduction (additionally)


34

Effect of Filters (Candidate Set) – Best Q

1.E+05

1.E+06

1.E+07

set1 set2 set3

Cand

idat

e Se

t Siz

e

q=1 q=2 q=3 q=4 q=5

(a)

1.E+06

1.E+07

1.E+08

set1 set2 set3

Cand

idat

e Se

t Siz

e

q=1 q=2 q=3 q=4 q=5

(b)

For the given data sets q=2 worked best q=2 is close to the theoretical approximations as well q=2 is small enough to avoid, as much as possible, the space

overhead for the auxiliary tables


35

Effect of Filters (Q-gram Join Size)

COUNT FILTERING is applied last (in HAVING clause)

For efficiency, it is important to have a small join size for the q-gram tables

LENGTH FILTERING cuts the size by a factor of 2 to 10

+POSITION FILTERING cuts the size by a factor of ~100


36

Extensions: Substring Joins

Substring approximate joins: Length filtering is not applicable Substring matches have fewer common q-grams (no “overflow”

q-grams) Position filtering is not directly applicable (it needs tuning

depending on the substring location)

We also propose a fast in-memory filter (see paper) The result is not trivial! Exploiting position of q-grams extensively to find possible

alignments


37

Extensions: Block Moves

Match “AT&T Corp” with “Corp AT&T” as a unit cost operation.

We can allow block moves: Length filtering is still applicable The threshold for count filtering is different Position filtering is not applicable (the q-gram may move far

away)


38

Conclusions

We introduced a technique for mapping approximate string joins into a “vanilla” SQL expression

Our technique does not require modifying the underlying RDBMS

Our technique exploits the RDBMS's query optimizer

Our technique significantly outperforms existing approaches

Many opportunities for improvements and future work!

Approximate String Joins in a Database (Almost) for Free L. GravanoP.G. IpeirotisH.V. Jagadish...

Documents

Transcript of Approximate String Joins in a Database (Almost) for Free L. GravanoP.G. IpeirotisH.V. Jagadish...