Linking data without common identifiers


Description:

A presentation of an open source tool for deduplicating and finding links between data sets that do not share common identifiers.

Transcript of Linking data without common identifiers

Page 1: Linking data without common identifiers

1

Linking data without common identifiers

JavaZone 2012, 2012-09-12
Lars Marius Garshol, <[email protected]>
http://twitter.com/larsga

Page 2: Linking data without common identifiers

2

Identity resolution

Entity resolution

Record linkage

Master data management

Deduplication

Merge/purge

Data matching

Page 3: Linking data without common identifiers

3

The problem

• How to tell if two different records represent the same real-world entity?

DBPEDIA
Id: http://dbpedia.org/resource/Samoa
Name: Samoa
Founding date: 1962-01-01
Capital: Apia
Currency: Tala
Area: 2831
Leader name: Tuilaepa Aiono Sailele Malielegaoi

MONDIAL
Id: 17019
Name: Western Samoa
Independence: 01 01 1962
Capital: Apia, Samoa
Population: 214384
Area: 2860
GDP: 415

Page 4: Linking data without common identifiers

4

Linking across datasets

• Independent applications, for example
  – even when unique identifiers exist, they tend to be unreliable, or missing

[Diagram: record A ?= record B]

Page 5: Linking data without common identifiers

5

Within datasets

• Duplicate records in the same application
  – in different tables (poor modelling)
  – in the same table (poor UI, poor logic, sloppy entry)

[Diagram: record A ?= record B]

Page 6: Linking data without common identifiers

6

In data interchange

• Receiving data from third parties
  – do we have this record already?

[Diagram: incoming XML record A ?= existing record B]

Page 7: Linking data without common identifiers

7

A difficult problem

• It requires O(n²) comparisons for n records
  – a million comparisons for 1,000 records, 100 million for 10,000, ...
• Exact string comparison is not enough
  – must handle misspellings, name variants, etc.
• Interpreting the data can be difficult even for a human being
  – is the address different because there are two different people, or because the person moved?
• ...

Page 8: Linking data without common identifiers

8

Record linkage

• Statisticians very often must connect data sets from different sources
• They call it “record linkage”
  – term coined in 1946 [1]
  – mathematical foundations laid in 1959 [2]
  – formalized in 1969 as the “Fellegi-Sunter” model [3]
• A whole subject area has been developed, with well-known techniques, methods, and tools
  – these can of course be applied outside of statistics

[1] http://ajph.aphapublications.org/cgi/reprint/36/12/1412
[2] http://www.sciencemag.org/content/130/3381/954.citation
[3] http://www.jstor.org/pss/2286061

Page 9: Linking data without common identifiers

9

The basic idea

Field       Record 1    Record 2      Probability
Name        acme inc    acme inc      0.9
Assoc no    177477707                 0.5
Zip code    9161        9161          0.6
Country     norway      norway        0.51
Address 1   mb 113      mailbox 113   0.49
Address 2                             0.5

Combined probability: 0.931

Page 10: Linking data without common identifiers

10

Standard algorithm

• n² comparisons for n records is unacceptable
  – must reduce the number of direct comparisons
• Solution (sketched in code below)
  – produce a key from field values
  – sort records by the key
  – for each record, compare with the n nearest neighbours
  – sometimes several different keys are produced, to increase the chances of finding matches
• Downsides
  – requires coming up with a key
  – difficult to apply incrementally
  – sorting is expensive
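To make the key approach concrete, here is a minimal sketch of the sorted-neighbourhood idea. The key function, field names, and window size are made up for illustration; this is not any particular tool's implementation.

import java.util.*;

// Sketch of the sorted-neighbourhood ("key") approach described above.
public class SortedNeighbourhood {

    record Rec(String id, String name, String zip) {}

    // Hypothetical blocking key: first three letters of the normalized name
    // plus the first two digits of the zip code.
    static String key(Rec r) {
        String n = r.name().toLowerCase().replaceAll("[^a-z]", "");
        return n.substring(0, Math.min(3, n.length()))
             + r.zip().substring(0, Math.min(2, r.zip().length()));
    }

    public static void main(String[] args) {
        List<Rec> records = List.of(
            new Rec("1", "Acme Inc", "9161"),
            new Rec("2", "ACME Incorporated", "9161"),
            new Rec("3", "Bolt AS", "0556"));

        // Sort by the key, then compare each record only with its nearest neighbours.
        List<Rec> sorted = new ArrayList<>(records);
        sorted.sort(Comparator.comparing(SortedNeighbourhood::key));

        int window = 2;
        for (int i = 0; i < sorted.size(); i++)
            for (int j = i + 1; j < Math.min(i + 1 + window, sorted.size()); j++)
                System.out.println("compare " + sorted.get(i).id() + " with " + sorted.get(j).id());
    }
}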

Page 11: Linking data without common identifiers

11

String comparison

• Measure “distance” between strings
  – must handle spelling errors and name variants
  – must estimate the probability that values belong to the same entity despite differences
• Examples
  – John Smith ≈ Jonh Smith
  – J. Random Hacker ≈ James Random Hacker
• Many well-known algorithms have been developed
  – no one best choice, unfortunately
  – some are eager to merge, others less so
• Many are computationally expensive
  – O(n²), where n is the length of the string, is common
  – this gets very costly when the number of record pairs to compare is already high

Page 12: Linking data without common identifiers

12

Existing tools

• Commercial tools
  – big, sophisticated, and expensive
  – seem to follow the same principles
  – presumably also effective
• Open source tools
  – generally made by and for statisticians
  – nice user interfaces and rich configurability
  – advanced maths
  – architecture often not as flexible as it could be

Page 13: Linking data without common identifiers

13

DUplicate KillEr

Duke

Page 14: Linking data without common identifiers

14

Context

• Doing a project for Hafslund where we integrate data from many sources

• Entities are duplicated both inside systems and across systems

[Diagram: ERP (suppliers, customers, companies), CRM (customers), and Billing (customers) – the same customers appear in several systems]

Page 15: Linking data without common identifiers

15

Requirements

• Must be flexible and configurable
  – no way to know in advance exactly what data we will need to deduplicate
• Must scale to large data sets
  – CRM alone has 1.4 million customer records
  – that’s 2 trillion comparisons with the naïve approach
• Must have an API
  – project uses the SDshare protocol everywhere
  – must therefore be able to build connectors
• Must be able to work incrementally
  – process data as it changes and update conclusions on the fly

Page 16: Linking data without common identifiers

16

Reviewed existing tools...

• ...but didn’t find anything suitable for our needs

• Therefore...

Page 17: Linking data without common identifiers

17

Duke

• Java deduplication engine
  – released as open source
  – http://code.google.com/p/duke/
• Does not use the key approach
  – instead indexes data with Lucene
  – does Lucene searches to find potential matches
• Still a work in progress, but
  – high performance,
  – several data sources and comparators,
  – being used for real in real projects,
  – flexible architecture

Page 18: Linking data without common identifiers

18

How it works

[Diagram: each DataSource feeds an Extract → Map → Clean step; cleaned records are written to a Lucene index, and matching then proceeds as Index → Search → Compare → Link]

Page 19: Linking data without common identifiers

19

Cleaning of data

• Used to normalize data from the sources
  – necessary to get rid of differences that don’t matter
• Makes comparing much more effective
  – pluggable in Duke (a sketch follows below)
  – utilities and components already provided

Côte D’Ivoire → cote d’ivoire
Main St 113 → 113 main street
Dec 25 1973 → 1973-12-25
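As a rough illustration of what a pluggable cleaner looks like, here is a sketch in the spirit of the LowerCaseNormalizeCleaner referenced in the configuration later in the talk. The interface name and signature are assumptions for illustration and may not match Duke's actual API.

import java.text.Normalizer;

// Assumed single-method cleaner interface; Duke's real cleaner API may differ.
interface Cleaner {
    String clean(String value);
}

// Example cleaner: trim, lower-case, collapse whitespace, and strip accents,
// so that "Côte D'Ivoire" becomes "cote d'ivoire" as in the example above.
class LowerCaseNormalizeCleaner implements Cleaner {
    public String clean(String value) {
        if (value == null)
            return null;
        String s = value.trim().toLowerCase().replaceAll("\\s+", " ");
        // decompose accented characters and drop the combining marks
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
    }
}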

Page 20: Linking data without common identifiers

20

Components

Data sources
• CSV
• JDBC
• SPARQL
• NTriples
• <plug in your own>

Comparators
• ExactComparator
• NumericComparator
• SoundexComparator
• TokenizedComparator
• Levenshtein
• JaroWinkler
• Dice coefficient
• <plug in your own>

Page 21: Linking data without common identifiers

21

Features

• XML syntax for configuration
• Fairly complete command-line tool
  – with debugging support
• API for embedding
• Pluggable data sources, comparators, and cleaners
• Framework for composing cleaners
• Incremental processing
• High performance

Page 22: Linking data without common identifiers

22

Two modes

Deduplication – finding duplicates within a single data set

Record linkage – linking records across two data sets

Page 23: Linking data without common identifiers

23

Why probabilities?

• Some tools use rules instead
  – if field1 and field2 match, then ...
  – if field1 and (field3 or field4) match, then ...
• This is easy to understand, but
  – there are many, many combinations
  – it doesn’t take inexact matching into account
  – adding another field becomes a real pain

Page 24: Linking data without common identifiers

24

Probabilities weigh all the evidence

Field       Record 1    Record 2      Probability   Accumulated
Name        acme inc    acme inc      0.9           0.9
Assoc no    177477707                 0.5           0.9
Zip code    9161        9161          0.6           0.931
Country     norway      norway        0.51          0.934
Address 1   mb 113      mailbox 113   0.3           0.857
Address 2                             0.5           0.857

Probabilities are combined with Bayes’s Theorem.
Probabilities are reduced if values don’t match exactly.
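To make the combination rule concrete: a pairwise Bayesian combination, p' = p·q / (p·q + (1 − p)(1 − q)), reproduces the accumulated column of the table above up to rounding. A minimal sketch, not Duke's source code:

// Sketch of pairwise Bayesian combination of per-field probabilities.
public class CombineProbabilities {

    static double combine(double acc, double p) {
        return (acc * p) / (acc * p + (1.0 - acc) * (1.0 - p));
    }

    public static void main(String[] args) {
        // Per-field probabilities from the table above.
        double[] fieldProbs = {0.9, 0.5, 0.6, 0.51, 0.3, 0.5};
        double acc = fieldProbs[0];
        System.out.printf("after field 1: %.3f%n", acc);
        for (int i = 1; i < fieldProbs.length; i++) {
            acc = combine(acc, fieldProbs[i]);
            System.out.printf("after field %d: %.3f%n", i + 1, acc);
        }
        // Prints 0.900, 0.900, 0.931, 0.934, 0.858, 0.858 -- the accumulated
        // column above, up to rounding on the slide.
    }
}

Note how a probability of 0.5 is neutral (it leaves the accumulated value unchanged), while values below 0.5 pull the total down.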

Page 25: Linking data without common identifiers

25

A real-world example

Linking Mondial and DBpedia

http://code.google.com/p/duke/wiki/LinkingCountries

Page 26: Linking data without common identifiers

26

Finding properties to match

• Need properties providing identity evidence
• Matching on Name, Area, and Capital (shown in bold on the original slide)
• Extracted the data to CSV for ease of use

DBPEDIA
Id: http://dbpedia.org/resource/Samoa
Name: Samoa
Founding date: 1962-01-01
Capital: Apia
Currency: Tala
Area: 2831
Leader name: Tuilaepa Aiono Sailele Malielegaoi

MONDIAL
Id: 17019
Name: Western Samoa
Independence: 01 01 1962
Capital: Apia, Samoa
Population: 214384
Area: 2860
GDP: 415

Page 27: Linking data without common identifiers

27

Configuration – data sources

<group>
  <csv>
    <param name="input-file" value="dbpedia.csv"/>
    <param name="header-line" value="false"/>
    <column name="1" property="ID"/>
    <column name="2" cleaner="no.priv...CountryNameCleaner" property="NAME"/>
    <column name="3" property="AREA"/>
    <column name="4" cleaner="no.priv...CapitalCleaner" property="CAPITAL"/>
  </csv>
</group>

<group>
  <csv>
    <param name="input-file" value="mondial.csv"/>
    <column name="id" property="ID"/>
    <column name="country" cleaner="no.priv...examples.CountryNameCleaner" property="NAME"/>
    <column name="capital" cleaner="no.priv...LowerCaseNormalizeCleaner" property="CAPITAL"/>
    <column name="area" property="AREA"/>
  </csv>
</group>

Using groups tells Duke that we are linking across two data sets, not deduplicating by comparing all records against all others.

Page 28: Linking data without common identifiers

28

Configuration – matching

<schema>
  <threshold>0.65</threshold>

  <property type="id">
    <name>ID</name>
  </property>
  <property>
    <name>NAME</name>
    <comparator>no.priv.garshol.duke.Levenshtein</comparator>
    <low>0.3</low>
    <high>0.88</high>
  </property>
  <property>
    <name>AREA</name>
    <comparator>AreaComparator</comparator>
    <low>0.2</low>
    <high>0.6</high>
  </property>
  <property>
    <name>CAPITAL</name>
    <comparator>no.priv.garshol.duke.Levenshtein</comparator>
    <low>0.4</low>
    <high>0.88</high>
  </property>
</schema>

<object class="no.priv.garshol.duke.NumericComparator" name="AreaComparator">
  <param name="min-ratio" value="0.7"/>
</object>

Duke analyzes this setup and decides that only NAME and CAPITAL need to be searched on in Lucene.

Page 29: Linking data without common identifiers

29

Result

• Correct links found: 206 / 217 (94.9%)
• Wrong links found: 0 / 12 (0.0%)
• Unknown links found: 0
• Percent of links correct 100.0%, wrong 0.0%, unknown 0.0%
• Records with no link: 25
• Precision 100.0%, recall 94.9%, F-number 0.974
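For reference, the F-number is the harmonic mean of precision and recall:

$$F = \frac{2PR}{P + R} = \frac{2 \cdot 1.000 \cdot 0.949}{1.000 + 0.949} \approx 0.974$$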

Page 30: Linking data without common identifiers

30

Examples

Field        DBpedia         Mondial
Name         albania         albania
Area         28748           28750
Capital      tirana          tirane
Probability: 0.980

Field        DBpedia         Mondial
Name         côte d'ivoire   cote divoire
Area         322460          322460
Capital      yamoussoukro    yamoussoukro
Probability: 0.975

Field        DBpedia         Mondial
Name         samoa           western samoa
Area         2831            2860
Capital      apia            apia
Probability: 0.824

Field        DBpedia         Mondial
Name         kazakhstan      kazakstan
Area         2724900         2717300
Capital      astana          almaty
Probability: 0.838

Field        DBpedia         Mondial
Name         grande comore   comoros
Area         1148            2170
Capital      moroni          moroní
Probability: 0.440

Field        DBpedia         Mondial
Name         serbia          serbia and mont
Area         102350          88361
Capital      sarajevo        sarajevo
Probability: 0.440

Page 31: Linking data without common identifiers

31

Western Samoa or American Samoa?

Field        DBpedia   Mondial          Probability
Name         samoa     western samoa    0.3
Area         2831      2860             0.6
Capital      apia      apia             0.88
Combined probability: 0.824

Field        DBpedia   Mondial          Probability
Name         samoa     american samoa   0.3
Area         2831      199              0.4
Capital      apia      pago pago        0.4
Combined probability: 0.067

Page 32: Linking data without common identifiers

32

An example of failure

• Duke doesn’t find this match
  – no tokens matching exactly
  – the Lucene search finds nothing
• Detailed comparison gives the correct result
  – the problem is the Lucene search
• Lucene does have Levenshtein search, but
  – in Lucene 3.x it’s very slow
  – therefore not enabled now
  – thinking of adding an option to enable it where needed
  – Lucene 4.x will fix the performance problem

Field        DBpedia      Mondial
Name         kazakhstan   kazakstan
Area         2724900      2717300
Capital      astana       almaty
Probability: 0.838

Page 33: Linking data without common identifiers

33

The effects of value matching

Case                          Precision   Recall   F
Optimal                       100%        94.9%    97.4%
Cleaning, exact compare       99.3%       73.3%    84.4%
No cleaning, good compare     100%        93.4%    96.6%
No cleaning, exact compare    99.3%       70.9%    82.8%
Cleaning, JaroWinkler         97.6%       96.7%    97.2%
Cleaning, Soundex             95.9%       97.2%    96.6%

Page 34: Linking data without common identifiers

34

Duke in real life

Usage at Hafslund

Page 35: Linking data without common identifiers

35

The big picture

[Diagram: ERP, CRM, Billing, and Public 360 feed Virtuoso (RDF) and a search engine over SDshare; the source data contains duplicates. Duke consumes SDshare feeds, records its conclusions in a LinkDB (where they can be overridden manually), and publishes an SDshare feed containing owl:sameAs and haf:possiblySameAs links.]

Page 36: Linking data without common identifiers

36

Experiences so far

• Incremental processing works fine
  – links added and retracted as data changes
• Performance not an issue at all
  – despite there being ~1.4 million records
  – requires a little bit of tuning, however
• Matching works well, but not perfectly
  – data are very noisy and messy
  – biggest issue is clusters caused by generic values
  – also, matching is hard

Page 37: Linking data without common identifiers

37

Overview of algorithms

Approximate string comparison

Page 38: Linking data without common identifiers

38

Levenshtein

• Also known as edit distance
• Measures the number of edit operations necessary to turn s1 into s2 (a sketch follows below)
• Edit operations are
  – insert a character
  – remove a character
  – substitute a character
• Complexity
  – O(n²) – naïve algorithm
  – O(n + d²) – where d = distance (with optimizations)
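As a reference point, the naïve O(n²) dynamic-programming algorithm looks roughly like this. A sketch for illustration, not Duke's implementation:

// Classic dynamic-programming edit distance.
public class Levenshtein {

    static int distance(String s1, String s2) {
        int[][] d = new int[s1.length() + 1][s2.length() + 1];
        for (int i = 0; i <= s1.length(); i++) d[i][0] = i;   // deletions
        for (int j = 0; j <= s2.length(); j++) d[0][j] = j;   // insertions

        for (int i = 1; i <= s1.length(); i++)
            for (int j = 1; j <= s2.length(); j++) {
                int subst = s1.charAt(i - 1) == s2.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,      // remove
                                            d[i][j - 1] + 1),     // insert
                                   d[i - 1][j - 1] + subst);      // substitute
            }
        return d[s1.length()][s2.length()];
    }

    public static void main(String[] args) {
        // prints 3, matching the worked example on the next slide
        System.out.println(distance("Levenshtein", "Löwenstein"));
    }
}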

Page 39: Linking data without common identifiers

39

Levenshtein example

• Levenshtein → Löwenstein
  – Levenstein (remove ‘h’)
  – Lövenstein (substitute ‘ö’ for ‘e’)
  – Löwenstein (substitute ‘w’ for ‘v’)
• Edit distance = 3

Page 40: Linking data without common identifiers

40

Weighted Levenshtein

• Not all edit operations are equal
  – substituting “i” for “e” is a smaller edit than substituting “o” for “k”
• Weighted Levenshtein evaluates each edit operation as a number 0.0–1.0
• Weights depend on context
  – language, for example
• Slower, because some Levenshtein optimizations do not apply with weights

Page 41: Linking data without common identifiers

41

Jaro-Winkler

• Developed at the US Bureau of the Census
• For name comparisons
  – not well suited to long strings
  – best if given name and surname are separated
• Exists in a few variants
  – originally proposed by Jaro
  – then modified by Winkler
  – a few different versions of the modifications exist, etc.

Page 42: Linking data without common identifiers

42

Jaro-Winkler definition

• Formula (shown below):
  – m = number of matching characters
  – t = number of transposed characters
• A character from string s1 matches s2 if the same character is found in s2 less than half the length of the string away
• Levenshtein ~ Löwenstein = 0.8
• Axel ~ Aksel = 0.783
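The formula itself was an image on the slide; the standard Jaro similarity, with m and t as defined above, is

$$sim_J = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right)$$

(with sim_J = 0 when m = 0, and t conventionally counted as half the number of matching characters that are out of order). For Axel ~ Aksel: m = 3, t = 0, so sim_J = (3/4 + 3/5 + 3/3)/3 ≈ 0.783, matching the value above.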

Page 43: Linking data without common identifiers

43

Jaro-Winkler variant
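The variant was shown as an image on the slide; presumably it is the standard Winkler modification, which boosts the score for strings sharing a common prefix:

$$sim_{JW} = sim_J + \ell \cdot p \cdot (1 - sim_J)$$

where ℓ is the length of the common prefix (capped at 4) and p is a scaling factor, conventionally 0.1.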

Page 44: Linking data without common identifiers

44

Soundex

• A coarse scheme for matching names by sound
  – produces a key from the name
  – names match if the key is the same
• In common use in many places
  – NAV’s person register uses it for search
  – built into many databases
  – ...

Page 45: Linking data without common identifiers

45

Soundex definition
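The definition on the slide was shown as an image. As a rough substitute, here is a simplified sketch of the classic Soundex coding (the full rules have a few extra cases, e.g. for ‘h’/‘w’ between letters with the same code, and Duke's SoundexComparator may differ in details):

// Simplified classic Soundex: keep the first letter, encode the rest as
// digits, skip vowels and repeated codes, and pad/truncate to four characters.
public class Soundex {

    static char code(char c) {
        switch (Character.toLowerCase(c)) {
            case 'b': case 'f': case 'p': case 'v': return '1';
            case 'c': case 'g': case 'j': case 'k':
            case 'q': case 's': case 'x': case 'z': return '2';
            case 'd': case 't': return '3';
            case 'l': return '4';
            case 'm': case 'n': return '5';
            case 'r': return '6';
            default: return '0';   // vowels, h, w, y and anything else
        }
    }

    static String soundex(String name) {
        if (name.isEmpty())
            return "0000";
        StringBuilder key = new StringBuilder();
        key.append(Character.toUpperCase(name.charAt(0)));
        char prev = code(name.charAt(0));
        for (int i = 1; i < name.length() && key.length() < 4; i++) {
            char c = code(name.charAt(i));
            if (c != '0' && c != prev)   // skip non-coded letters and runs of the same code
                key.append(c);
            prev = c;
        }
        while (key.length() < 4)
            key.append('0');
        return key.toString();
    }

    public static void main(String[] args) {
        System.out.println(soundex("Axel"));   // A240
        System.out.println(soundex("Aksel"));  // A240
    }
}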

Page 46: Linking data without common identifiers

46

Examples

• soundex(“Axel”) = ‘A240’
• soundex(“Aksel”) = ‘A240’
• soundex(“Levenshtein”) = ‘L523’
• soundex(“Löwenstein”) = ‘L152’

Page 47: Linking data without common identifiers

47

Metaphone

• Developed by Lawrence Philips
• Similar to Soundex, but much more complex
  – both more accurate and more sensitive
• Developed further into Double Metaphone
• Metaphone 3.0 also exists, but is only available commercially

Page 48: Linking data without common identifiers

48

Metaphone examples

• metaphone(“Axel”) = ‘AKSL’
• metaphone(“Aksel”) = ‘AKSL’
• metaphone(“Levenshtein”) = ‘LFNX’
• metaphone(“Löwenstein”) = ‘LWNS’

Page 49: Linking data without common identifiers

49

Norphone

• I’m working on a Norwegian alternative to Metaphone
  – designed to handle phonetic rules in Norwegian names
  – for example, “ch” = “k”
• Work in progress
  – http://code.google.com/p/duke/source/browse/src/main/java/no/priv/garshol/duke/comparators/NorphoneComparator.java

Page 50: Linking data without common identifiers

50

Dice coefficient

• A similarity measure for sets
  – the set can be the tokens in a string
  – or the characters in a string
• Formula (shown below)
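The formula was an image on the slide; the standard Dice coefficient for two sets A and B is

$$Dice(A, B) = \frac{2\,|A \cap B|}{|A| + |B|}$$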

Page 51: Linking data without common identifiers

51

TFIDF

• Compares strings as sets of tokens
  – à la the Dice coefficient
• However, takes the frequency of tokens in the corpus into account (see the weighting below)
  – this matches how we evaluate matches mentally
• Has done well in evaluations
  – however, can be difficult to evaluate
  – results will change as the corpus changes
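In the usual formulation (the exact variant a given comparator uses may differ), each token t is weighted by its frequency in the string and its rarity in a corpus of N records,

$$w_t = tf_t \cdot \log\frac{N}{df_t}$$

where df_t is the number of records containing t; strings are then compared as weighted token vectors, for example with cosine similarity, so that rare tokens contribute more to a match than common ones.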

Page 52: Linking data without common identifiers

52

More comparators

• Smith-Waterman
  – originated in DNA sequencing
• Q-grams distance
  – breaks the string into the set of overlapping pieces of q characters (sketched below)
  – then does a set-similarity comparison
• Monge-Elkan
  – similar to Smith-Waterman, but with affine gap distances
  – has done very well in evaluations
  – costly to evaluate
• Many, many more
  – ...
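As a concrete illustration of the q-gram idea combined with the Dice coefficient from earlier (a sketch, not any particular library's implementation):

import java.util.HashSet;
import java.util.Set;

// Split each string into overlapping q-grams, then compare the two sets
// of q-grams with the Dice coefficient.
public class QGrams {

    static Set<String> qgrams(String s, int q) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + q <= s.length(); i++)
            grams.add(s.substring(i, i + q));
        return grams;
    }

    static double diceSimilarity(String s1, String s2, int q) {
        Set<String> a = qgrams(s1, q);
        Set<String> b = qgrams(s2, q);
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return 2.0 * common.size() / (a.size() + b.size());
    }

    public static void main(String[] args) {
        // ≈ 0.82 (14/17) for this near-match, using bigrams
        System.out.println(diceSimilarity("kazakhstan", "kazakstan", 2));
    }
}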

Page 53: Linking data without common identifiers

53

Performance

Page 54: Linking data without common identifiers

54

Why performance matters

• Performance really is a major issue
  – O(n²) behaviour
  – slow comparators
• An example
  – linking 840,000 authors from the Deichmann library with 200,000 person records from DBpedia
  – with the naïve in-memory backend this takes ~4 secs per record in the second group
  – that’s 9.5 days to do the whole job...

Page 55: Linking data without common identifiers

55

Bigger data

So... 9.5 days for ~400,000 records

Time to process increases with the square of the record count

The owners of the much larger data set pictured on the slide (the Mexican case mentioned later) can expect to spend 585 years...

Page 56: Linking data without common identifiers

56

Obvious solution #1

• Where is the program spending its time?
  – mostly in Levenshtein string comparisons
  – these actually take a while
• Solution (sketched in code below)
  – switch from an n × n array of arrays to a single flat n² array (30% off)
  – break off the comparison if we’re going to go over the difference limit anyway (another 30% off)
• Overall speed improvement ~50%
  – so, from 9.5 days to 4.25 days
  – or, from 585 years to 293 years
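Roughly what those two tweaks look like on top of the naïve algorithm from earlier. An illustrative sketch, not Duke's actual code:

// Bounded edit distance: a single flat array instead of an n x n array of
// arrays, plus an early break once every value in the current row exceeds
// the maximum distance we care about.
public class BoundedLevenshtein {

    // Returns the edit distance, or maxDist + 1 if it is known to exceed maxDist.
    static int distance(String s1, String s2, int maxDist) {
        int cols = s2.length() + 1;
        int[] d = new int[(s1.length() + 1) * cols];   // single flat array
        for (int j = 0; j < cols; j++) d[j] = j;

        for (int i = 1; i <= s1.length(); i++) {
            d[i * cols] = i;
            int rowMin = i;
            for (int j = 1; j <= s2.length(); j++) {
                int subst = s1.charAt(i - 1) == s2.charAt(j - 1) ? 0 : 1;
                int v = Math.min(Math.min(d[(i - 1) * cols + j] + 1,
                                          d[i * cols + j - 1] + 1),
                                 d[(i - 1) * cols + j - 1] + subst);
                d[i * cols + j] = v;
                rowMin = Math.min(rowMin, v);
            }
            if (rowMin > maxDist)       // the distance can only grow from here
                return maxDist + 1;     // so break off the comparison early
        }
        return d[s1.length() * cols + s2.length()];
    }
}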

Page 57: Linking data without common identifiers

57

Obvious solution #2

• Parallel computing
  – indexing takes just a few minutes (for the small case)
  – record comparisons are independent of each other
  – so, spin up different computers and let them work in parallel
• Does it work?
  – well, let’s say we use 20 nodes
  – enough to cost a bit, and give us some admin work
  – not too many to cause communication issues
  – expect a speed-up by a factor of 10
• That gives us
  – from 4.5 days to 10 hours
  – from 293 years to 29 years

(Didn’t actually do this...)

Page 58: Linking data without common identifiers

58

Obvious solution #3

• Improve the algorithm!
  – that is, use the Lucene backend
  – now we only compare records matched by searches
  – dramatically cuts the number of record pairs
• But does it work?
  – the Deichmann case now runs in 2 hours (single machine)
  – performance still not linear
  – Mexican runtime therefore probably still too long (hence their email)

Page 59: Linking data without common identifiers

59

Final solution

• Searches can still return huge numbers of potential candidates
  – most relevant ones at the top
  – so, we can configure cutoffs (by relevance or number)
• Setting
  – min-relevance: 0.99
  – max-search-hits: 10
  – runs Deichmann in 3 minutes (98.2% accuracy)
  – performance now close to linear
  – should work for the Mexicans, too
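In the XML configuration, these cutoffs would presumably be expressed as parameters in the same style as the earlier examples; the exact element they belong under is an assumption here, so check the Duke documentation:

<param name="min-relevance" value="0.99"/>
<param name="max-search-hits" value="10"/>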

Page 60: Linking data without common identifiers

60

Next step

• We can do some more tweaking with Lucene
  – can use boosting to improve relevance
  – then can cut more records without losing recall
  – there are limits, however
• So the next step may well be parallel computation
  – should be fun

Page 61: Linking data without common identifiers

61

Slides will appear on http://slideshare.net/larsga

Comments/questions?