1
Record Linkage: A 10-Year Retrospective
Chen Li and Sharad Mehrotra
UC Irvine
2
Efficient Record Linkage in Large Data Sets
Liang Jin, Chen Li, Sharad Mehrotra
University of California, Irvine
DASFAA, Kyoto, Japan, March 2003
How was the paper written?
Two faculty working in different areas, plus a 1st-year PhD student
5
Chen’s Story: 2001 …
6
Data Integration Problems?
Talking to medical doctors…
Example
Table R
Name            SSN            Addr
Jack Lemmon     430-871-8294   Maple St
Harrison Ford   292-918-2913   Culver Blvd
Tom Hanks       234-762-1234   Main St
…               …              …

Table S
Name            SSN            Addr
Ton Hanks       234-162-1234   Main Street
Kevin Spacey    928-184-2813   Frost Blvd
Jack Lemon      430-817-8294   Maple Street
…               …              …
Q: Find records from different datasets that could be the same entity
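One way to make "could be the same entity" concrete is to measure how many single-character edits separate two values. The sketch below computes the Levenshtein edit distance on values taken from Tables R and S. The function name `edit_distance` is an assumption made for illustration, and edit distance is only one possible similarity function.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning s into t."""
    prev = list(range(len(t) + 1))          # distances from "" to prefixes of t
    for i, cs in enumerate(s, start=1):
        curr = [i]                          # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


# Values from Table R compared with their likely counterparts in Table S.
print(edit_distance("Tom Hanks", "Ton Hanks"))
print(edit_distance("Jack Lemmon", "Jack Lemon"))
print(edit_distance("430-871-8294", "430-817-8294"))
```

Under this measure a transposed digit pair such as 71 versus 17 costs two edits, which is one reason the choice of similarity function and threshold matters.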
7
Sharad’s research
8
Liang’s story
1st-year PhD student at UC Irvine
9
Challenges
How to define good similarity functions?
How to do matching efficiently?
10
11
Nested-loop?
Not desirable for large data sets
5 hours for 30K strings!
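To see why the nested-loop approach scales poorly, here is a minimal sketch of it, assuming edit distance as the similarity function and a hypothetical threshold `k`. Every string in one set is compared with every string in the other, so two sets of 30,000 strings would need on the order of 30,000 × 30,000 = 9 × 10^8 distance computations. The names `edit_distance` and `nested_loop_join` are illustrative, not taken from the paper.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def nested_loop_join(r, s, k):
    """Return all pairs (a, b) with edit_distance(a, b) <= k.

    Performs len(r) * len(s) distance computations.
    """
    return [(a, b) for a in r for b in s if edit_distance(a, b) <= k]


# Name values from the example tables.
names_r = ["Jack Lemmon", "Harrison Ford", "Tom Hanks"]
names_s = ["Ton Hanks", "Kevin Spacey", "Jack Lemon"]
print(nested_loop_join(names_r, names_s, k=1))
```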
12
Our 2-step approach
Step 1: map strings (in a metric space) to objects in a Euclidean space
Step 2: do a similarity join in the Euclidean space
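The two steps can be sketched as follows. This is not the mapping used in the paper: it swaps in a simple pivot-based embedding, where coordinate i of a string is its edit distance to pivot string i. The pivots, the function names, and the plain loop standing in for a real Euclidean similarity join are all assumptions. For this particular embedding the triangle inequality bounds each coordinate difference by the edit distance, so any pair within edit distance k lies within Euclidean distance k·sqrt(d), and filtering at that radius misses no true match.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def embed(s, pivots):
    """Step 1 (assumed mapping): coordinate i is the distance to pivot i."""
    return tuple(edit_distance(s, p) for p in pivots)


def squared_euclidean(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q))


def two_step_join(r, s, pivots, k):
    """Step 2: join in the Euclidean space, then verify candidates."""
    d = len(pivots)
    radius_sq = k * k * d                   # (k * sqrt(d)) ** 2, kept as an integer
    r_points = [(a, embed(a, pivots)) for a in r]
    s_points = [(b, embed(b, pivots)) for b in s]
    result = []
    # A plain double loop stands in for a real Euclidean similarity join.
    for a, pa in r_points:
        for b, pb in s_points:
            if squared_euclidean(pa, pb) <= radius_sq:   # cheap filter
                if edit_distance(a, b) <= k:             # exact verification
                    result.append((a, b))
    return result


names_r = ["Jack Lemmon", "Harrison Ford", "Tom Hanks"]
names_s = ["Ton Hanks", "Kevin Spacey", "Jack Lemon"]
pivots = ["Tom Hanks", "Kevin Spacey"]      # hypothetical pivot choice
print(two_step_join(names_r, names_s, pivots, k=1))
```

The benefit appears only when the double loop is replaced by an existing multidimensional join algorithm, so that most pairs are never examined at all.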
13
Advantages
Applicable to many metric similarity functions
- E.g.: Edit distance
Open to existing algorithms
- Mapping techniques
- Join techniques
14
Step 1
Map strings into a high-dimensional Euclidean space
[Figure: Metric Space mapped to Euclidean Space]
15
Can it preserve distances?
Use data set 1 (54K names) as an example
k=2, d=20
- Use k’=5.2 to differentiate similar and dissimilar pairs.
16
Multi-attribute linkage
Example: title + name + year
Different attributes have different similarity functions and thresholds
Consider merge rules in disjunctive format:
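The rule shown on the slide is not preserved in this transcript. As a purely hypothetical illustration of a disjunctive merge rule, the sketch below treats a rule as a list of alternatives, each of which is a conjunction of per-attribute conditions with its own distance function and threshold. The thresholds, the `RULES` structure, and the function names are assumptions, not the authors' rule.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def year_gap(x: int, y: int) -> int:
    return abs(x - y)


# Hypothetical rule: (title AND name are close) OR (title AND year are close).
RULES = [
    {"title": (edit_distance, 4), "name": (edit_distance, 3)},
    {"title": (edit_distance, 1), "year": (year_gap, 1)},
]


def records_match(a: dict, b: dict, rules=RULES) -> bool:
    """True if at least one alternative has all its conditions satisfied."""
    return any(
        all(dist(a[attr], b[attr]) <= limit
            for attr, (dist, limit) in alternative.items())
        for alternative in rules
    )


# Made-up records for illustration only.
rec_a = {"title": "Record Linkage", "name": "Tom Hanks", "year": 2003}
rec_b = {"title": "Record Linkege", "name": "Ton Hanks", "year": 2003}
rec_c = {"title": "Fuzzy Search", "name": "Kevin Spacey", "year": 2008}
print(records_match(rec_a, rec_b), records_match(rec_a, rec_c))
```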
17
Secret of the paper …
18
19
Work since then …
Chen: efficiency
Sharad: quality
20
Chen’s Work on Efficiency
Gram-based algorithms
- Indexing
- Selection algorithms
- Join algorithms
- Variable-length grams
- Selectivity estimation
Trie-based algorithms
- Instant search
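As a small illustration of the gram-based idea, the sketch below applies the well-known count filter: two strings within edit distance k must share at least max(|s|, |t|) - q + 1 - k·q of their q-grams (counted as multisets, without padding). Pairs that fail the test can be discarded without computing an edit distance. This is a generic textbook filter, not a reproduction of any specific algorithm from the work listed above, and the function names are assumptions.

```python
from collections import Counter


def qgrams(s: str, q: int = 2) -> Counter:
    """Multiset of all substrings of length q (empty if len(s) < q)."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def passes_count_filter(s: str, t: str, k: int, q: int = 2) -> bool:
    """Necessary condition for edit_distance(s, t) <= k.

    True means the pair is still a candidate and needs verification;
    False means the pair cannot be within distance k.
    """
    common = sum((qgrams(s, q) & qgrams(t, q)).values())  # multiset intersection
    needed = max(len(s), len(t)) - q + 1 - k * q
    return common >= needed


print(passes_count_filter("Tom Hanks", "Ton Hanks", k=1))
print(passes_count_filter("Tom Hanks", "Kevin Spacey", k=1))
```

In an index-based setting the shared-gram counts would come from merging inverted lists of grams rather than from pairwise comparison.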
22
Follow-up work in the community
Significant amount of work on approximate string queries
- Selection
- Join
23
Make an impact?
24
UCI People Search
25
Psearch (2008): 2 stories
26
Fuzzy search
Research commercialization
28
Lesson learned: Hands-on experiences important!
29