Overview of Adaptive Blocking for DDL Research Lab

Bilenko, Kamath, Mooney. “Adaptive Blocking: Learning to Scale Up Record Linkage.” Proceedings of the 6th IEEE International Conference on Data Mining. Hong Kong, December 2006

Transcript of Overview of Adaptive Blocking for DDL Research Lab

Page 1: Overview of Adaptive Blocking for DDL Research Lab

Adaptive Blocking - key points

• Reduce computation time

• Apply across domains

• Maximize recall

• Limit false positives

• Disjunctive / DNF blocking

• Approx. Red-Blue Set Cover

• Increased reduction ratio

• Increased recall

Bilenko, Kamath, Mooney. “Adaptive Blocking: Learning to Scale Up Record Linkage.” Proceedings of the 6th IEEE International Conference on Data Mining. Hong Kong, December 2006
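The slide names reduction ratio and recall without defining them. A minimal sketch, assuming the definitions commonly used in record linkage (reduction ratio = share of all possible pairs that blocking discards; recall = share of true matching pairs that blocking keeps). The function names and toy data are illustrative and not taken from the slides.

```python
def reduction_ratio(n_records, n_candidate_pairs):
    """1 - (candidate pairs / all possible pairs) within one dataset."""
    total_pairs = n_records * (n_records - 1) // 2
    return 1.0 - n_candidate_pairs / total_pairs


def recall(candidate_pairs, true_pairs):
    """Fraction of true matching pairs that survive blocking."""
    return len(candidate_pairs & true_pairs) / len(true_pairs)


# Toy data (assumed): 5 records, so 10 possible pairs in total.
candidates = {(0, 1), (2, 3), (0, 2)}   # pairs kept by some blocking function
truth = {(0, 1), (2, 3), (1, 4)}        # pairs that really match

rr = reduction_ratio(5, len(candidates))
rc = recall(candidates, truth)
print(f"reduction ratio = {rr:.2f}, recall = {rc:.2f}")
```

The two measures pull against each other: keeping fewer candidate pairs raises the reduction ratio but risks dropping true matches, which lowers recall.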

Page 2: Overview of Adaptive Blocking for DDL Research Lab

Blocking Predicates

• Index function: generates keys based on field values (e.g. first three letters of name)

• Equality function: returns True for a pair of records if the sets of index keys generated for the two records share at least one key

• Covered pairs: the record pairs for which a given predicate's equality function returns True

• Blocking function: a set of blocking predicates combined into an aggregate index function and an aggregate equality function
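The definitions above can be sketched in a few lines. This is an illustrative sketch only: the helper names (`prefix_index`, `token_index`, `make_predicate`, `covered_pairs`) and the toy records are assumptions, not code from the slides or the paper.

```python
def prefix_index(value, n=3):
    """Index function: key set for a field value (here a single n-character prefix)."""
    value = value.strip().lower()
    return {value[:n]} if value else set()


def token_index(value):
    """Index function: one key per whitespace-separated token."""
    return set(value.lower().split())


def make_predicate(field, index_fn):
    """Blocking predicate: an index function on one field plus key-overlap equality."""
    def predicate(rec_a, rec_b):
        # Equality function: True if the two key sets share at least one key.
        return bool(index_fn(rec_a[field]) & index_fn(rec_b[field]))
    return predicate


def covered_pairs(predicate, records):
    """All record-id pairs for which the predicate returns True."""
    ids = sorted(records)
    return {(a, b) for i, a in enumerate(ids) for b in ids[i + 1:]
            if predicate(records[a], records[b])}


records = {
    1: {"name": "Jonathan Smith", "city": "Austin"},
    2: {"name": "Jon Smith", "city": "Austin"},
    3: {"name": "Mary Jones", "city": "Boston"},
}
same_prefix = make_predicate("name", prefix_index)
shared_token = make_predicate("name", token_index)

print(covered_pairs(same_prefix, records))
print(covered_pairs(shared_token, records))
```

The brute-force loop in `covered_pairs` is for illustration only; in practice the index keys would be used to build an inverted index so that only records sharing a key are ever compared.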

Page 3: Overview of Adaptive Blocking for DDL Research Lab

Goal: find the optimal blocking function, i.e. the one that minimizes false positives while still finding most true positives (within some error tolerance)
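One way to write this objective down, as a sketch under assumed notation: B is the set of true matching pairs in the training data, R the set of non-matching pairs, f a blocking function that returns 1 when a pair is covered, and ε the number of true pairs that may be left uncovered. None of these symbols appear on the slide.

```latex
f^{*} \;=\; \operatorname*{arg\,min}_{f}\; \bigl|\{(x,y)\in R : f(x,y)=1\}\bigr|
\quad\text{subject to}\quad
\bigl|\{(x,y)\in B : f(x,y)=0\}\bigr| \;\le\; \varepsilon
```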

Page 4: Overview of Adaptive Blocking for DDL Research Lab

Disjunctive and DNF blocking

• Disjunctive: select pairs covered by at least one blocking predicate

• Disjunctive Normal Form: select pairs covered by at least one conjunction of blocking predicates
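The difference between the two forms can be shown with a small sketch. The predicates and records below are made up for illustration; only the any/all structure reflects the definitions above.

```python
def disjunctive(predicates):
    """Pair is selected if at least one predicate covers it."""
    def block(a, b):
        return any(p(a, b) for p in predicates)
    return block


def dnf(conjunctions):
    """Pair is selected if at least one conjunction has all its predicates covering it."""
    def block(a, b):
        return any(all(p(a, b) for p in conj) for conj in conjunctions)
    return block


# Hypothetical predicates on toy records.
def same_zip(a, b):
    return a["zip"] == b["zip"]

def same_initial(a, b):
    return a["name"][:1] == b["name"][:1]

def same_year(a, b):
    return a["year"] == b["year"]


r1 = {"name": "Smith", "zip": "78701", "year": 1990}
r2 = {"name": "Smyth", "zip": "78701", "year": 1991}
r3 = {"name": "Jones", "zip": "78701", "year": 1985}

either = disjunctive([same_zip, same_year])
zip_and_initial_or_year = dnf([[same_zip, same_initial], [same_year]])

print(either(r1, r2), either(r1, r3))
print(zip_and_initial_or_year(r1, r2), zip_and_initial_or_year(r1, r3))
```

In this toy case the disjunction selects both pairs because they share a zip code, while the DNF form rejects the (r1, r3) pair because no single conjunction is fully satisfied.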

Page 5: Overview of Adaptive Blocking for DDL Research Lab

[Figure: disjunctive red-blue set cover]

[Figure: DNF red-blue set cover]

Page 6: Overview of Adaptive Blocking for DDL Research Lab

Disjunctive Blocking

• Remove predicates covering too many false pairs

• Remove false pairs covered by too many predicates

• Predicate cost = number of false pairs the predicate covers

• Weighted set cover: greedily select the predicate giving the best improvement; check whether the number of uncovered true pairs is within the threshold; repeat
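The steps above can be sketched as a greedy weighted set cover. This is a simplified illustration, not the authors' implementation: the function name, the three threshold parameters, the "+ 1" in the gain ratio (added to avoid division by zero) and the toy data are all assumptions.

```python
def learn_disjunctive(cover, true_pairs, max_false, max_preds_per_false, max_uncovered):
    """cover maps each predicate name to the set of record pairs it covers."""
    # Step 1: drop predicates that cover too many false pairs.
    false_cov = {p: pairs - true_pairs for p, pairs in cover.items()}
    kept = sorted(p for p in cover if len(false_cov[p]) <= max_false)

    # Step 2: ignore false pairs that are covered by too many predicates.
    counts = {}
    for p in kept:
        for pair in false_cov[p]:
            counts[pair] = counts.get(pair, 0) + 1
    ignored = {pair for pair, c in counts.items() if c > max_preds_per_false}

    # Step 3: predicate cost = number of remaining false pairs it covers.
    cost = {p: len(false_cov[p] - ignored) for p in kept}

    # Step 4: greedy selection until few enough true pairs remain uncovered.
    uncovered = set(true_pairs)
    chosen = []
    while len(uncovered) > max_uncovered:
        candidates = [p for p in kept if p not in chosen and cover[p] & uncovered]
        if not candidates:
            break  # no remaining predicate can cover another true pair
        # Assumed gain: newly covered true pairs per unit of (cost + 1).
        best = max(candidates, key=lambda p: len(cover[p] & uncovered) / (cost[p] + 1))
        chosen.append(best)
        uncovered -= cover[best]
    return chosen, uncovered


true_pairs = {(1, 2), (3, 4), (5, 6)}
cover = {
    "same_zip": {(1, 2), (3, 4), (1, 3), (2, 4), (1, 5), (2, 6)},  # 4 false pairs
    "name_prefix": {(1, 2), (5, 6)},                               # 0 false pairs
    "shared_token": {(3, 4), (1, 6)},                              # 1 false pair
}
chosen, missed = learn_disjunctive(cover, true_pairs, max_false=3,
                                   max_preds_per_false=2, max_uncovered=0)
print(chosen, missed)
```

Here `same_zip` is discarded in step 1 for covering four false pairs, and the greedy loop then picks `name_prefix` before `shared_token` because it covers more true pairs at lower cost.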

Page 7: Overview of Adaptive Blocking for DDL Research Lab

DNF Blocking

• Remove predicates covering too many pairs

• Construct predicate conjunctions, length <= k-1

• Add to the predicate set the conjunctions that maximize the marginal true/false pair ratio

• Apply Disjunctive Blocking with resulting predicates
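A rough sketch of the conjunction-building step, under several assumptions: a conjunction's coverage is taken to be the intersection of its predicates' covered pairs, `max_len` is a hypothetical length limit standing in for the bound on the slide, and the ratio uses "+ 1" in the denominator to avoid division by zero. The function names and toy data are illustrative only.

```python
def ratio(pairs, true_pairs):
    """True-to-false ratio of a covered pair set (+ 1 avoids division by zero)."""
    n_true = len(pairs & true_pairs)
    return n_true / (len(pairs) - n_true + 1)


def build_conjunctions(cover, true_pairs, max_len):
    """Greedily grow a conjunction from each base predicate while the ratio improves."""
    result = {}
    for start in sorted(cover):
        names, pairs = (start,), cover[start]
        while len(names) < max_len:
            best = None
            for other in sorted(cover):
                if other in names:
                    continue
                joint = pairs & cover[other]  # pairs covered by the longer conjunction
                improves = ratio(joint, true_pairs) > ratio(pairs, true_pairs)
                if improves and (best is None or
                                 ratio(joint, true_pairs) > ratio(best[1], true_pairs)):
                    best = (other, joint)
            if best is None:
                break  # no extension improves the ratio
            names, pairs = tuple(sorted(names + (best[0],))), best[1]
            result[names] = pairs
    return result


true_pairs = {(1, 2), (3, 4)}
cover = {
    "same_zip": {(1, 2), (3, 4), (1, 3), (2, 4)},
    "same_year": {(1, 2), (3, 4), (1, 4), (2, 3)},
    "same_country": {(1, 2), (3, 4), (1, 3), (1, 4), (2, 3), (2, 4), (1, 5)},
}

# Step 1: drop predicates covering too many pairs (threshold is an assumption).
max_pairs = 4
filtered = {p: s for p, s in cover.items() if len(s) <= max_pairs}

# Steps 2-3: build conjunctions and add them to the candidate set.
conjunctions = build_conjunctions(filtered, true_pairs, max_len=2)
candidates = {(p,): s for p, s in filtered.items()}
candidates.update(conjunctions)
print(sorted(candidates))
```

The resulting `candidates` mapping (single predicates plus conjunctions, each with its covered pairs) is what would then be handed to a disjunctive blocking learner in the final step.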