Record Linkage: A 10-Year Retrospective, Chen Li and Sharad Mehrotra, UC Irvine

Post on 30-Dec-2015


Record Linkage: A 10-Year Retrospective

Chen Li and Sharad Mehrotra

UC Irvine


Efficient Record Linkage in Large Data Sets

Liang Jin, Chen Li, Sharad Mehrotra

University of California, Irvine

DASFAA, Kyoto, Japan, March 2003

How was the paper written?

Two faculty working in different areas, plus a first-year PhD student


Chen’s Story: 2001 …


Data Integration Problems?

Talking to medical doctors…

Example

Table R

Name           SSN           Addr
Jack Lemmon    430-871-8294  Maple St
Harrison Ford  292-918-2913  Culver Blvd
Tom Hanks      234-762-1234  Main St
…              …             …

Table S

Name          SSN           Addr
Ton Hanks     234-162-1234  Main Street
Kevin Spacey  928-184-2813  Frost Blvd
Jack Lemon    430-817-8294  Maple Street
…             …             …

Q: Find records from different datasets that could be the same entity


Sharad’s research


Liang’s story

1st-year PhD student at UC Irvine


Challenges

How to define good similarity functions?

How to do matching efficiently?


Nested-loop?

Not desirable for large data sets: 5 hours for 30K strings!
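The nested-loop baseline can be sketched as follows. This is a minimal illustration, assuming edit distance as the similarity function and a threshold of 1; the function names are hypothetical, and the inputs are the Name columns of Tables R and S. Every string in R is compared with every string in S, which is why the cost grows with the product of the two table sizes.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def nested_loop_join(r, s, k):
    """Return all (x, y) pairs with edit distance at most k; O(|r| * |s|) comparisons."""
    return [(x, y) for x in r for y in s if edit_distance(x, y) <= k]


r_names = ["Jack Lemmon", "Harrison Ford", "Tom Hanks"]
s_names = ["Ton Hanks", "Kevin Spacey", "Jack Lemon"]

print(nested_loop_join(r_names, s_names, k=1))
# [('Jack Lemmon', 'Jack Lemon'), ('Tom Hanks', 'Ton Hanks')]
```

The same function applies to other string attributes. The two SSNs for Jack Lemmon differ by a swapped pair of digits, for instance, which plain edit distance counts as two operations.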


Our 2-step approach

Step 1: map strings (in a metric space) to objects in a Euclidean space

Step 2: do a similarity join in the Euclidean space
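A rough sketch of the two steps, under stated assumptions: the slides do not say which mapping is used, so a FastMap-style pivot projection stands in for Step 1, and a brute-force comparison of the mapped points stands in for the Euclidean similarity join of Step 2 (a real system would use a spatial index or a dedicated join algorithm). The names `fastmap`, `two_step_join`, `k_prime` and `dims` are hypothetical. Because the mapping does not preserve distances exactly, the Euclidean threshold `k_prime` is chosen larger than the edit-distance threshold `k`, and surviving candidates are verified with the original distance.

```python
import math


def edit_distance(a, b):
    """Levenshtein distance (the metric in the original string space)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def fastmap(strings, dims):
    """Step 1 (assumed technique): give each distinct string `dims` coordinates."""
    coords = {s: [] for s in strings}

    def residual_sq(s, t):
        # Squared distance not yet explained by the axes built so far.
        d2 = edit_distance(s, t) ** 2
        d2 -= sum((x - y) ** 2 for x, y in zip(coords[s], coords[t]))
        return max(d2, 0.0)  # clamp: edit distance is not truly Euclidean

    for _ in range(dims):
        # Heuristic pivot choice: hop to the farthest string twice.
        a = strings[0]
        b = max(strings, key=lambda s: residual_sq(a, s))
        a = max(strings, key=lambda s: residual_sq(b, s))
        dab_sq = residual_sq(a, b)
        if dab_sq == 0:
            axis = {s: 0.0 for s in strings}
        else:
            # Project every string onto the line through the two pivots.
            axis = {s: (residual_sq(a, s) + dab_sq - residual_sq(b, s))
                       / (2 * math.sqrt(dab_sq)) for s in strings}
        for s in strings:
            coords[s].append(axis[s])
    return coords


def two_step_join(r, s, k, k_prime, dims):
    """Step 2: filter pairs by Euclidean distance, then verify with edit distance."""
    coords = fastmap(sorted(set(r) | set(s)), dims)
    return [(x, y) for x in r for y in s
            if math.dist(coords[x], coords[y]) <= k_prime
            and edit_distance(x, y) <= k]


r_names = ["Jack Lemmon", "Harrison Ford", "Tom Hanks"]
s_names = ["Ton Hanks", "Kevin Spacey", "Jack Lemon"]
print(two_step_join(r_names, s_names, k=1, k_prime=3.0, dims=3))
```

The verification step means the sketch never reports a pair whose edit distance exceeds `k`, but a `k_prime` that is too small can drop true matches, so it trades recall for speed.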


Advantages

Applicable to many metric similarity functions
- E.g.: Edit distance

Open to existing algorithms
- Mapping techniques
- Join techniques


Step 1

Map strings into a high-dimensional Euclidean space

Metric Space → Euclidean Space


Can it preserve distances?

Use data set 1 (54K names) as an example

k=2, d=20
- Use k’=5.2 to differentiate similar and dissimilar pairs.


Multi-attribute linkage

Example: title + name + year

Different attributes have different similarity functions and thresholds

Consider merge rules in disjunctive format:
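As an illustration of a disjunctive merge rule, the sketch below treats two records as the same entity when at least one conjunction of per-attribute conditions holds. Everything specific here is an assumption made for the example and not taken from the slides: the two rules, the thresholds, the choice of token-set Jaccard for titles and `difflib`'s ratio for names, and the sample records.

```python
from difflib import SequenceMatcher


def title_sim(a, b):
    """Jaccard similarity over lowercased word sets (assumed title measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


def name_sim(a, b):
    """Character-level similarity ratio in [0, 1] (assumed name measure)."""
    return SequenceMatcher(None, a, b).ratio()


def year_close(a, b, max_gap):
    return abs(a - b) <= max_gap


# Hypothetical rules: (title AND year) OR (name AND year).
RULES = [
    lambda r, s: title_sim(r["title"], s["title"]) >= 0.8
                 and year_close(r["year"], s["year"], 0),
    lambda r, s: name_sim(r["name"], s["name"]) >= 0.85
                 and year_close(r["year"], s["year"], 1),
]


def same_entity(r, s, rules=RULES):
    """Disjunction over rules; each rule is a conjunction of attribute tests."""
    return any(rule(r, s) for rule in rules)


# Made-up records for illustration only.
a = {"title": "Efficient Record Linkage in Large Data Sets", "name": "L. Jin", "year": 2003}
b = {"title": "efficient record linkage in large data sets", "name": "Liang Jin", "year": 2003}
c = {"title": "Record linkage", "name": "Jack Lemmon", "year": 2003}
d = {"title": "Linking records at scale", "name": "Jack Lemon", "year": 2004}

print(same_entity(a, b))  # True, via the title rule
print(same_entity(c, d))  # True, via the name rule
print(same_entity(a, d))  # False, no rule fires
```

Keeping each rule as a separate conjunction makes it possible to evaluate the cheapest or most selective attribute first and stop as soon as one rule succeeds.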


Secret of the paper …


Work since then …

Chen: efficiency

Sharad: quality


Chen’s Work on Efficiency

Gram-based algorithms
- Indexing
- Selection algorithms
- Join algorithms
- Variable-length grams
- Selectivity estimation

Trie-based algorithms
- Instant search
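To make the gram-based idea concrete, here is a minimal sketch of an approximate string selection using an inverted index on q-grams with a length filter and a count filter. It is not the Flamingo package's API; the function names, the choice of q = 2 and the sample strings are assumptions for illustration. The count filter relies on a standard bound: one edit operation can destroy at most q grams, so two strings within edit distance k must share at least max(|s|, |t|) - q + 1 - k*q grams.

```python
from collections import defaultdict


def grams(s, q=2):
    """All overlapping substrings of length q."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_index(strings, q=2):
    """Inverted index: gram -> ids of the strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for g in grams(s, q):
            index[g].append(sid)
    return index


def candidates(query, strings, index, k, q=2):
    """Ids that pass the length and count filters for edit-distance threshold k."""
    counts = defaultdict(int)
    for g in grams(query, q):
        for sid in index.get(g, ()):
            counts[sid] += 1
    result = []
    for sid, s in enumerate(strings):
        if abs(len(s) - len(query)) > k:  # length filter
            continue
        needed = max(len(s), len(query)) - q + 1 - k * q
        # A non-positive bound cannot prune, so the string stays a candidate.
        if needed <= 0 or counts[sid] >= needed:
            result.append(sid)
    return result


names = ["Jack Lemmon", "Harrison Ford", "Tom Hanks", "Maple St"]
index = build_index(names)
print(candidates("Ton Hanks", names, index, k=1))  # [2]
```

The filters only produce candidates; each one still has to be verified with the actual edit distance. The final loop scans every string to keep the sketch short, whereas an indexed implementation would merge the posting lists of the query's grams directly.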

The Flamingo Package

http://flamingo.ics.uci.edu/


Follow-up work in the community

Significant amount of work on approximate string queries
- Selection
- Join


Make an impact?


UCI People Search


Psearch (2008): 2 stories


Fuzzy search

www.omniplaces.com

Location-based search


Research commercialization


Lesson learned: Hands-on experiences important!
