Method and system for processing and linking data records



Page 1: Method and system for processing and linking data records

US007912842B1

(12) United States Patent
     Bayliss

(10) Patent No.: US 7,912,842 B1
(45) Date of Patent: Mar. 22, 2011

(54) METHOD AND SYSTEM FOR PROCESSING AND LINKING DATA RECORDS

(75) Inventor: David Bayliss, Delray Beach, FL (US)

(73) Assignee: LexisNexis Risk Data Management Inc., Boca Raton, FL (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by [illegible] days.

(21) Appl. No.: 10/357,418

(22) Filed: Feb. 4, 2003

(51) Int. Cl.
     G06F 7/00 (2006.01)
     G06F 17/30 (2006.01)

(52) U.S. Cl.: 707/749

(58) Field of Classification Search: 707/100, 707/102, 103 R, 103 Z; 706/45, 46, 48, 52
     See application file for complete search history.

(56) References Cited

U.S. PATENT DOCUMENTS

4,543,630 A 9/1985 Neches
4,860,201 A 8/1989 Stolfo et al.
4,870,568 A 9/1989 Kahle et al.
4,925,311 A 5/1990 Neches et al.
5,006,978 A 4/1991 Neches
5,276,899 A 1/1994 Neches
5,303,383 A 4/1994 Neches et al.
5,423,037 A 6/1995 Hvasshovd
5,471,622 A 11/1995 Eadline
5,495,606 A 2/1996 Borden et al.
5,551,027 A 8/1996 Choy et al.
5,555,404 A 9/1996 Torbjørnsen et al.
5,655,080 A 8/1997 Dias et al.
5,715,469 A * 2/1998 Arning 715/533
5,732,400 A 3/1998 Mandler et al.
5,745,746 A 4/1998 Jhingran et al.
5,878,408 A 3/1999 Van Huben et al.
5,884,299 A 3/1999 Ramesh et al.
5,897,638 A 4/1999 Lasser et al.
5,983,228 A 11/1999 Kobayashi et al.
6,006,249 A 12/1999 Leong
6,026,394 A 2/2000 Tsuchida et al.
6,026,398 A * 2/2000 Brown et al. 707/5
6,081,801 A 6/2000 Cochrane et al.
6,266,804 B1 7/2001 Isman
6,658,412 B1 * 12/2003 Jenkins et al. 707/5
(Continued)

OTHER PUBLICATIONS

Henniger, “An Evolutionary Approach to Constructing Effective Software Reuse Repositories”, ACM Transactions of Software Engineering and Methodology, vol. 6, No. 2, Apr. 1997, pp. 111-140.*
(Continued)

Primary Examiner: Kavita Padmanabhan

(74) Attorney, Agent, or Firm: Hunton & Williams, LLP

(57) ABSTRACT

Various exemplary systems and methods for linking entity references and identifying associations are presented. In particular, a method is provided for linking a plurality of entity references to at least one entity. The method comprises the steps of evaluating a probability of a match between a first entity reference and a second entity reference based at least in part on a statistical significance of one or more field values being common to both the first entity reference and the second entity reference, wherein field value statistical significance is inversely related to a number of field value occurrences occurring in some or all of the plurality of entity references, and linking the first entity reference with the second entity reference when the probability is greater than or equal to a match threshold.

36 Claims, 31 Drawing Sheets

[Representative drawing: Compare Entity References, yielding an indication of a link, using Probability-Based Matching (Content Weighting, Field Weighting) and Context (Familial Relationships, Nicknames/Synonyms)]

Page 2: Method and system for processing and linking data records

US 7,912,842 B1 Page 2

U.S. PATENT DOCUMENTS

2002/0073099 A1 * 6/2002 Gilbert et al. 707/104.1
2004/0064447 A1 * 4/2004 Simske et al. 707/5

OTHER PUBLICATIONS

Eike Schallehn et al., “Advanced Grouping and Aggregation for Data Integration,” Department of Computer Science, Paper ID: 222, pp. 1-16.

Vincent Coppola, “Killer APP,” Men’s Journal, vol. 12, No. 3, Apr. 2003, pp. 86-90.

Eike Schallehn et al., “Extensible and Similarity-based Grouping for Data Integration,” Department of Computer Science, pp. 1-17, 2002.

Rohit Ananthakrishna et al., “Eliminating Fuzzy Duplicates in Data Warehouses,” 12 pages, 2002.

Peter Christen et al., “Parallel Computing Techniques for High-Performance Probabilistic Record Linkage,” Data Mining Group, Australian National University, Epidemiology and Surveillance Branch, Project web page: http://datamining.anu.edu.au/linkage.html, 2002, pp. 1-11.

Peter Christen et al., “Parallel Techniques for High-Performance Record Linkage (Data Matching),” Data Mining Group, Australian National University, Epidemiology and Surveillance Branch, Project web page: http://datamining.anu.edu.au/linkage.html, 2002, pp. 1-27.

Peter Christen et al., “High-Performance Computing Techniques for Record Linkage,” Data Mining Group, Australian National University, Epidemiology and Surveillance Branch, Project web page: http://datamining.anu.edu.au/linkage.html, 2002, pp. 1-14.

William E. Winkler, “Matching and Record Linkage,” U.S. Bureau of the Census, pp. 1-38.

Peter Christen et al., “High-Performance Computing Techniques for Record Linkage,” ANU Data Mining Group, Australian National University, Epidemiology and Surveillance Branch, Project web page: http://datamining.anu.edu.au/linkage.html, pp. 1-11.

William E. Winkler, “The State of Record Linkage and Current Research Problems,” U.S. Bureau of the Census, 15 pages.

William E. Winkler, “Advanced Methods For Record Linkage,” Bureau of the Census, pp. 1-21.

William E. Winkler, “Frequency-Based Matching in Fellegi-Sunter Model of Record Linkage,” Bureau of the Census Statistical Research Division, Oct. 4, 2000, 14 pages.

William E. Winkler, “State of Statistical Data Editing and Current Research Problems,” Bureau of the Census Statistical Research Division, 10 pages.

The First Open ETL/EAI Software for the Real-Time Enterprise, Sunopsis, A New Generation ETL Tool, “Sunopsis™ v3 expedites integration between heterogeneous systems for Data Warehouse, Data Mining, Business Intelligence, and OLAP projects,” <www.suopsis.com>, 6 pages.

Alan Dumas, “The ETL Market and Sunopsis™ v3 Business Intelligence, Data Warehouse & Datamart Projects,” 2002, Sunopsis, pp. 1-7.

Teradata Warehouse Solutions, “Teradata Database Technical Overview,” 2002, pp. 1-7.

WhiteCross White Paper, May 25, 2000, “wx/des-Technical Information,” pp. 1-36.

Teradata Alliance Solutions, “Teradata and Ab Initio,” pp. 1-2, 2001.

Peter Christen et al., The Australian National University, “Febrl - Freely extensible biomedical record linkage,” Oct. 2002, pp. 1-67.

William E. Winkler, “Using the EM Algorithm for Weight Computation in the Fellegi-Sunter Model of Record Linkage,” Bureau of the Census Statistical Research Division, Oct. 4, 2000, 12 pages.

William E. Winkler et al., “An Application of the Fellegi-Sunter Model of Record Linkage to the 1990 U.S. Decennial Census,” U.S. Bureau of the Census, pp. 1-22.

William E. Winkler, “Improved Decision Rules in the Fellegi-Sunter Model of Record Linkage,” Bureau of the Census, pp. 1-13.

Fritz Scheuren et al., “Recursive Merging and Analysis of Administrative Lists and Data,” U.S. Bureau of the Census, 9 pages.

William E. Winkler, “Record Linkage Software and Methods for Merging Administrative Lists,” U.S. Bureau of the Census, Jul. 7, 2001, 11 pages.

Enterprises, Publishing and Broadcasting Limited, Acxiom-Abilitec, pp. 44-45.

TransUnion, Credit Reporting System, Oct. 9, 2002, 4 pages, <http://www.transunion.com/content/page.jsp?id=/transunion/general/data/business/BusCre...>.

TransUnion, ID Verification & Fraud Detection, Account Acquisition, Account Management, Collection & Location Services, Employment Screening, Risk Management, Automotive, Banking Savings & Loan, Credit Card Providers, Credit Unions, Energy & Utilities, Healthcare, Insurance, Investment, Real Estate, Telecommunications, Oct. 9, 2002, 46 pages, <http://www.transunion.com>.

White Paper an Introduction to OLAP Multidimensional Terminology and Technology, 20 pages.

* cited by examiner

Page 3: Method and system for processing and linking data records

U.S. Patent Mar. 22, 2011 Sheet 1 of 31 US 7,912,842 B1

Fig. 1A

Page 4: Method and system for processing and linking data records

U.S. Patent Mar. 22, 2011 Sheet 2 of 31 US 7,912,842 B1

Fig. 1B (reference numerals 140, 144, 150)

Page 5: Method and system for processing and linking data records

U.S. Patent Mar. 22, 2011 Sheet 3 of 31 US 7,912,842 B1

[Flowchart]
Incoming Data -->
Prepare Raw Data (Preparation Phase)
Translate Data to Entity References (Link Phase)
Determine Inter-Relationships Between Entities (Association Phase)
Perform One or More Queries Using Master File
Repeat for Iteration N (208)

Page 6: Method and system for processing and linking data records

U.S. Patent Mar. 22, 2011 Sheet 4 of 31 US 7,912,842 B1

[Flowchart: Preparation Phase (202)]
Incoming Data -->
Format Raw Data into Entity References
Join Entity References (Master File)
Remove Duplicate Entity References
Fill In Null Field Values
Remove Junk Field Values/Entries
Repeat for Iteration N

Page 7: Method and system for processing and linking data records

U.S. Patent Mar. 22, 2011 Sheet 5 of 31 US 7,912,842 B1

Fig. 4

[Flowchart: Link Phase]
Incoming Data -->
Select Relevant Fields
Measure Field Variance and Reset DIDs If Necessary
Fill In Null Field Values
Generate Ghost Entity References
Link Entity References
Transition Links
Append/Modify DIDs in Master File
Repeat for Iteration N

Page 8: Method and system for processing and linking data records

U.S. Patent Mar. 22, 2011 Sheet 6 of 31 US 7,912,842 B1

Fig. 5

[Block diagram]
Inputs: Entity Reference A, Entity Reference B
Compare Entity References
Output: Indication of a Link Between Entity References
Probability-Based Matching: Content Weighting, Field Weighting
Context: Ethnicity, Familial Relationships, Nicknames/Synonyms, Location

Page 9: Method and system for processing and linking data records

U.S. Patent Mar. 22, 2011 Sheet 7 of 31 US 7,912,842 B1

[Flowchart]

For each particular field entry f_n, determine total number (Count) of same field entries in master file:

$\mathrm{Count} = \sum_{i} \left[\text{if } (f_i = f_n) \text{ then } 1, \text{ else } 0\right]$

--> Count Table

For each particular field entry f_n, determine context weight w_{c,i}:

$w_{c,i} = \dfrac{1}{\mathrm{Count} + \mathrm{Cautiousness}}$

--> Context Weight Table

Calculate probability (P) of match between Entity References using context weight(s)

Assign DIDs to Entity References based on probability (P)
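
The counting and weighting steps on this sheet could be sketched in Python as follows. This is only an illustration under stated assumptions: the weight is taken to be 1 / (Count + Cautiousness), the `CAUTIOUSNESS` value of 1.0 is invented, the record layout and all function names are hypothetical, and `shared_value_score` is a stand-in for the probability calculation, which the drawing does not spell out.

```python
from collections import Counter

CAUTIOUSNESS = 1.0  # hypothetical constant; the drawing gives no value


def count_table(master_file, field):
    """Count how often each non-null value of `field` occurs in the master file."""
    return Counter(rec[field] for rec in master_file if rec.get(field) is not None)


def weight_table(counts, cautiousness=CAUTIOUSNESS):
    """Weight each value inversely to its count: 1 / (Count + Cautiousness)."""
    return {value: 1.0 / (n + cautiousness) for value, n in counts.items()}


def shared_value_score(a, b, weights_by_field):
    """Assumed stand-in for P: sum the weights of field values common to a and b."""
    score = 0.0
    for field, weights in weights_by_field.items():
        value = a.get(field)
        if value is not None and value == b.get(field):
            score += weights.get(value, 0.0)
    return score


master = [
    {"rid": 1, "last": "Smith", "city": "Boca Raton"},
    {"rid": 2, "last": "Smith", "city": "Delray Beach"},
    {"rid": 3, "last": "Bayliss", "city": "Delray Beach"},
    {"rid": 4, "last": "Smith", "city": "Boca Raton"},
]
weights = {f: weight_table(count_table(master, f)) for f in ("last", "city")}
```

With this toy data a shared rare surname ("Bayliss", one occurrence) carries twice the weight of a shared common one ("Smith", three occurrences), which is the inverse relationship the abstract describes.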

Page 10: Method and system for processing and linking data records

U.S. Patent Mar. 22, 2011 Sheet 8 of 31 US 7,912,842 B1

Fig. 7

[Flowchart]
Select subset N of Entity Reference fields
Next Entity Ref. A and Entity Ref. B
For each field (X) of the subset: Compare A.f_X with B.f_X (708)
No Match: continue with the next pair
Match: Add (A,B) to Match Table
--> Match Table
Common DID transition using Match Table
Adjust DID of Affected Entity References in Master File
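
A minimal Python sketch of the pairwise comparison that fills the Match Table in Fig. 7 is shown below. It assumes that a pair is added only when every selected field agrees, that null values never count as agreement, and that pairs are recorded by DID with the lower DID on the left; the function name and record layout are hypothetical.

```python
from itertools import combinations


def build_match_table(references, fields):
    """Return sorted (left DID, right DID) pairs that agree on every field in `fields`."""
    matches = set()
    for a, b in combinations(references, 2):
        if a["did"] == b["did"]:
            continue  # already linked to the same entity
        # Assumption: a null value on either side is treated as "no match".
        if all(a[f] is not None and a[f] == b[f] for f in fields):
            matches.add((min(a["did"], b["did"]), max(a["did"], b["did"])))
    return sorted(matches)


refs = [
    {"did": 1, "first": "David", "last": "Bayliss", "dob": "1970"},
    {"did": 2, "first": "David", "last": "Bayliss", "dob": "1970"},
    {"did": 3, "first": "David", "last": "Bayliss", "dob": None},
    {"did": 4, "first": "Ann", "last": "Smith", "dob": "1982"},
]
match_table = build_match_table(refs, ("first", "last", "dob"))
```

Comparing every pair is quadratic in the number of entity references; the sketch ignores that cost and is meant only to show the data flowing into the Match Table.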

Page 11: Method and system for processing and linking data records

U.S. Patent Mar. 22, 2011 Sheet 9 of 31 US 7,912,842 B1

Fig. 8 (reference numerals 802, 804, 806, 808)

Page 12: Method and system for processing and linking data records

U.S. Patent Mar. 22, 2011 Sheet 10 of 31 US 7,912,842 B1

Page 13: Method and system for processing and linking data records

U.S. Patent Mar. 22, 2011 Sheet 11 of 31 US 7,912,842 B1

Fig. 10

[Flowchart]
Match Table
Inner Join of Match Table with itself by left DID (1002)
--> Expanded Match Table (1022)
Inner Join of Expanded Match Table with itself from right DID to left DID (1004)
--> Transitive Closure Table (1024)
Transition DIDs to lowest possible DID value (1006)
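
The join-based DID transition of Fig. 10 could be sketched as below. The figure shows two joins; this Python illustration instead stores each pair in both directions and repeats the right-DID-to-left-DID join until no new pairs appear, which is an assumption made so that chains of any length are closed. Function names are hypothetical.

```python
def transitive_closure(match_table):
    """Close a table of (left DID, right DID) pairs under chaining via repeated self-joins."""
    # Store each pair in both directions so left and right are interchangeable.
    table = set(match_table) | {(b, a) for a, b in match_table}
    while True:
        by_left = {}
        for left, right in table:
            by_left.setdefault(left, set()).add(right)
        # Join right DID to left DID: rows (a, b) and (b, c) yield (a, c).
        joined = {(a, c) for a, b in table for c in by_left.get(b, ()) if a != c}
        if joined <= table:
            return table
        table |= joined


def lowest_did_map(closure):
    """Map every DID to the lowest DID it is linked to (itself if already lowest)."""
    lowest = {}
    for left, right in closure:
        lowest[left] = min(lowest.get(left, left), right)
    return lowest


closure = transitive_closure([(1, 2), (2, 3), (5, 6)])
new_dids = lowest_did_map(closure)
```

Here DIDs 1, 2 and 3 collapse to 1, and DIDs 5 and 6 collapse to 5, matching the "lowest possible DID value" step.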

Page 14: Method and system for processing and linking data records

U.S. Patent Mar. 22, 2011 Sheet 12 of 31 US 7,912,842 B1

Fig. 11

[Flowchart]
Select subset N of data fields (1102)
For each field (X) of the subset, generate Field Unique Value Table
Cross-Produce Field Unique Value Tables to generate Ghost Table
--> Ghost Table (1128)
Update Master File to Include Ghost Entity References (1108)
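
The cross-product step of Fig. 11 maps naturally onto `itertools.product`. The Python sketch below assumes the input references all belong to one entity, skips null values when collecting unique values, and keeps only combinations not already present; those choices, the function name, and the sample names are illustrative assumptions rather than details taken from the drawing.

```python
from itertools import product


def ghost_references(references, fields):
    """Cross the unique non-null values of each field; return combinations not yet present."""
    unique_values = {
        f: sorted({r[f] for r in references if r[f] is not None}) for f in fields
    }
    existing = {tuple(r[f] for f in fields) for r in references}
    ghosts = []
    for combo in product(*(unique_values[f] for f in fields)):
        if combo not in existing:
            ghosts.append(dict(zip(fields, combo)))
    return ghosts


entity_refs = [
    {"first": "David", "last": "Bayliss"},
    {"first": "Dave", "last": "Bayliss"},
    {"first": "David", "last": "Baylis"},
]
ghosts = ghost_references(entity_refs, ("first", "last"))
```

The single ghost produced here ("Dave" with "Baylis") is a combination never seen in the data but implied by the values already attributed to the entity.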

Page 15: Method and system for processing and linking data records

U.S. Patent Mar. 22, 2011 Sheet 13 of 31 US 7,912,842 B1

Page 16: Method and system for processing and linking data records

U.S. Patent Mar. 22, 2011 Sheet 14 of 31 US 7,912,842 B1

Fig. 13

[Flowchart 1300]
Measure variance along each 'axis'
Variance > Threshold? (1304)
Yes: Reset DID of Each Entity Reference to its RID
Mark Entity References as having been 'Broken'
Mark Entity References as suspect (1314)
End (1312)
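
A rough Python sketch of the variance check and DID reset on this sheet follows. The drawing does not say how variance is measured, so the sketch assumes numeric fields and population variance (`statistics.pvariance`); the field name, threshold, and `broken` flag are hypothetical, and the separate "suspect" marking is left out because its branch condition is not legible.

```python
from statistics import pvariance


def reset_if_inconsistent(references, numeric_fields, threshold):
    """Break an entity apart when any numeric field varies more than `threshold`."""
    for field in numeric_fields:
        values = [r[field] for r in references if r[field] is not None]
        if len(values) > 1 and pvariance(values) > threshold:
            for r in references:
                r["did"] = r["rid"]  # reset the DID to the reference's own RID
                r["broken"] = True   # remember that this link was undone
            return True
    return False


linked = [
    {"rid": 1, "did": 7, "birth_year": 1960},
    {"rid": 2, "did": 7, "birth_year": 1961},
    {"rid": 3, "did": 7, "birth_year": 1995},
]
was_reset = reset_if_inconsistent(linked, ("birth_year",), threshold=25)
```

Resetting every reference to its own RID discards the whole grouping, so later link iterations can rebuild it from scratch.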

Page 17: Method and system for processing and linking data records

U.S. Patent Mar. 22, 2011 Sheet 15 of 31 US 7,912,842 B1

Fig. 14

[Flowchart: Association Phase]
Incoming Data -->
Determine Degree of Commonality (Association) Between Entities (1402)
Mark Highly Associated Entities as Related (1404)
Generate Ghost Entity References from Relations (1406)
Transitive Closure For Additional Associations Between Entities (1408)
Repeat for Iteration N (1410)

Page 18: Method and system for processing and linking data records

U.S. Patent Mar. 22, 2011 Sheet 16 of 31 US 7,912,842 B1

Fig. 15

[Flowchart]
Select subset N of Entity Reference fields
Next Entity Ref. A and Entity Ref. B (1504)
For each field (X) of the subset (1506): Compare A.f_X with B.f_X
No Match: continue with the next field or pair
Match: Increase score of (C,D) pair in Score Table
--> Score Table (1522)
score(C,D) >= Threshold? (1512)
No: Entity C not Associated with Entity D, Entity D not Associated with Entity C
Yes: Mark Entity C as Associate of Entity D, Entity D as Associate of Entity C
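
The scoring loop of Fig. 15 might look like the following in Python. The sketch assumes that C and D are the DIDs of entity references A and B, that each matching field adds exactly 1 to the pair's score (the drawing does not give the increment), and that nulls never match; all names and sample values are hypothetical.

```python
from collections import Counter
from itertools import combinations


def association_scores(references, fields):
    """Score each pair of distinct entities by how many field values their references share."""
    scores = Counter()
    for a, b in combinations(references, 2):
        c, d = sorted((a["did"], b["did"]))
        if c == d:
            continue  # same entity: nothing to associate
        for f in fields:
            if a[f] is not None and a[f] == b[f]:
                scores[(c, d)] += 1  # assumed increment of 1 per matching field
    return scores


def associates(scores, threshold):
    """Return the entity pairs whose score reaches the threshold."""
    return {pair for pair, s in scores.items() if s >= threshold}


people = [
    {"did": 1, "address": "12 Elm St", "last": "Smith", "phone": "555-0100"},
    {"did": 2, "address": "12 Elm St", "last": "Smith", "phone": "555-0199"},
    {"did": 3, "address": "9 Oak Ave", "last": "Smith", "phone": None},
]
scores = association_scores(people, ("address", "last", "phone"))
related = associates(scores, threshold=2)
```

Entities 1 and 2 share an address and a surname and so clear the threshold of 2, while entity 3 shares only the surname and stays unassociated.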

Page 19: Method and system for processing and linking data records

U.S. Patent Mar. 22, 2011 Sheet 17 of 31 US 7,912,842 B1

Fig. 16

[Flowchart]
Relatives File (1620)
Filter (1602)
Duplicate Records (1604)
Inner Join by left DID (1606)
Set weight, separation, and dedup values (1608)

Page 20: Method and system for processing and linking data records

U.S. Patent Mar. 22, 2011 Sheet 18 of 31 US 7,912,842 B1

Fig. 17

[Flowchart]
Match Table (1730)
Filter (1702)
Duplicate Records (1704)
--> Duplicate Match Table (1722)
Inner Join duplicate match table with master file (1706)
--> Outlier Reference Table (1724)
Score DIDs using grading criteria (1708) <-- Grading Criteria
Sum DID scores (1710)
--> DID Score Table (1726)
Filter DID Score Table (1712)
Obtain entity references of selected DIDs from Outlier Reference Table (1714)
