Overview of Adaptive Blocking for DDL Research Lab
Uploaded by dan-chudnov
Adaptive Blocking - key points
• Reduce computation time
• Apply across domains
• Maximize recall
• Limit false positives
• Disjunctive / DNF blocking
• Approx. Red-Blue Set Cover
• Increased reduction ratio
• Increased recall
Bilenko, Kamath, Mooney. “Adaptive Blocking: Learning to Scale Up Record Linkage.” Proceedings of the 6th IEEE International Conference on Data Mining. Hong Kong, December 2006
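The two measures named in the key points, reduction ratio and recall, can be made concrete with a small sketch. This uses the common definitions (reduction ratio as the share of all possible pairs that blocking avoids, recall as the share of true matching pairs that survive blocking). The function names and toy data are illustrative assumptions, not taken from the slides.

```python
from itertools import combinations


def all_pairs(record_ids):
    """Every unordered pair of record ids (the unblocked comparison space)."""
    return {frozenset(p) for p in combinations(record_ids, 2)}


def reduction_ratio(candidate_pairs, record_ids):
    """Fraction of the full comparison space that blocking avoids."""
    total = len(all_pairs(record_ids))
    return 1.0 - len(candidate_pairs) / total


def recall(candidate_pairs, true_pairs):
    """Fraction of true matching pairs that survive blocking."""
    return len(candidate_pairs & true_pairs) / len(true_pairs)


# Toy data (hypothetical): four records, two true matches, two candidates.
record_ids = ["a", "b", "c", "d"]
true_pairs = {frozenset(("a", "b")), frozenset(("c", "d"))}
candidate_pairs = {frozenset(("a", "b")), frozenset(("a", "c"))}

rr = reduction_ratio(candidate_pairs, record_ids)
rc = recall(candidate_pairs, true_pairs)
```

Here four records give six possible pairs, so keeping two candidates avoids two thirds of the comparisons, while only one of the two true matches is retained.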
Blocking Predicates
• Index function: generates keys based on field values (e.g. first three letters of name)
• Equality function: returns True for a record pair when the two records share at least one index key
• Covered pairs: record pairs for which a given predicate's equality function returns True
• Blocking function: a set of blocking predicates with aggregate index and equality functions
Goal: find the optimal blocking function, one that minimizes false positives while still finding most true positives (within some error)
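A minimal sketch of how these pieces might fit together, assuming records are plain dictionaries keyed by an id. The names `prefix3_index`, `token_index`, `equal`, and `covered_pairs` are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def prefix3_index(record, field):
    """Index function: key set holding the first three letters of a field."""
    value = record.get(field, "").strip().lower()
    return {value[:3]} if value else set()


def token_index(record, field):
    """Index function: one key per whitespace-separated token."""
    return set(record.get(field, "").lower().split())


def equal(keys_a, keys_b):
    """Equality function: True if the two key sets share any key."""
    return bool(keys_a & keys_b)


def covered_pairs(records, index_fn, field):
    """Pairs of record ids for which the predicate returns True."""
    keys = {rid: index_fn(rec, field) for rid, rec in records.items()}
    return {
        frozenset((a, b))
        for a, b in combinations(sorted(records), 2)
        if equal(keys[a], keys[b])
    }


# Toy records (hypothetical data).
records = {
    1: {"name": "Jonathan Smith"},
    2: {"name": "Jon Smith"},
    3: {"name": "Mary Jones"},
}

by_prefix = covered_pairs(records, prefix3_index, "name")
by_token = covered_pairs(records, token_index, "name")
```

Both predicates cover only the pair (1, 2) here: the prefix predicate because both names start with "jon", the token predicate because both contain "smith".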
Disjunctive and DNF blocking
• Disjunctive: select pairs covered by at least one blocking predicate
• Disjunctive Normal Form: select pairs covered by at least one conjunction of blocking predicates
Disjunctive Red-blue set cover
DNF Red-blue set cover
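In set terms, a disjunction of predicates covers the union of their covered pairs, and a conjunction covers the intersection. The sketch below assumes each predicate's covered pairs are already available as a Python set. The predicate names and pairs are made up for illustration.

```python
def disjunctive_cover(predicate_covers):
    """Pairs covered by at least one predicate (union of covers)."""
    return set().union(*predicate_covers)


def conjunction_cover(predicate_covers):
    """Pairs covered by every predicate in a conjunction (intersection)."""
    covers = list(predicate_covers)
    return set.intersection(*covers) if covers else set()


def dnf_cover(conjunctions):
    """Pairs covered by at least one conjunction of predicates."""
    return set().union(*(conjunction_cover(c) for c in conjunctions))


# Hypothetical covered-pair sets for three predicates.
same_prefix = {("a", "b"), ("a", "c")}
same_zip = {("a", "b"), ("c", "d")}
same_city = {("c", "d"), ("a", "d")}

disjunctive = disjunctive_cover([same_prefix, same_zip])
dnf = dnf_cover([[same_prefix, same_zip], [same_city]])
```

The DNF example reads as "(same_prefix AND same_zip) OR same_city": the conjunction narrows the first two predicates to the single pair they agree on before the union is taken.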
Disjunctive Blocking
• Remove predicates covering too many false pairs
• Remove false pairs covered by too many predicates
• Predicate cost = number of false pairs covered
• Weighted set cover: greedily select predicates based on improvement; check the uncovered-pairs threshold; repeat
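The steps above can be sketched as a simplified greedy loop. This is an illustration under stated assumptions, not the exact algorithm from the cited paper: the parameter names (`max_false`, `max_degree`, `epsilon`) are hypothetical, and the selection score (newly covered true pairs divided by cost plus one) is an assumed stand-in for "improvement".

```python
def greedy_disjunctive_blocking(covers, true_pairs, false_pairs,
                                max_false, max_degree, epsilon):
    """Pick predicates greedily until at most `epsilon` true pairs are uncovered."""
    # Step 1: drop predicates that cover too many false pairs.
    kept = {p: c for p, c in covers.items()
            if len(c & false_pairs) <= max_false}

    # Step 2: ignore false pairs covered by too many remaining predicates.
    degree = {}
    for c in kept.values():
        for pair in c & false_pairs:
            degree[pair] = degree.get(pair, 0) + 1
    counted_false = {pair for pair, d in degree.items() if d <= max_degree}

    # Step 3: a predicate's cost is the number of counted false pairs it covers.
    cost = {p: len(c & counted_false) for p, c in kept.items()}

    chosen, uncovered = [], set(true_pairs)
    while len(uncovered) > epsilon:
        best, best_score = None, 0.0
        for p, c in kept.items():
            if p in chosen:
                continue
            gain = len(c & uncovered)
            score = gain / (cost[p] + 1)  # +1 avoids division by zero (assumption)
            if score > best_score:
                best, best_score = p, score
        if best is None:  # no remaining predicate covers a new true pair
            break
        chosen.append(best)
        uncovered -= kept[best]
    return chosen, uncovered


# Hypothetical training pairs and predicate covers.
true_pairs = {("a", "b"), ("c", "d"), ("e", "f")}
false_pairs = {("a", "c"), ("a", "d"), ("b", "e"), ("b", "f")}
covers = {
    "same_prefix": {("a", "b"), ("c", "d"), ("a", "c")},
    "same_zip": {("e", "f")},
    "same_state": true_pairs | false_pairs,
}

chosen, uncovered = greedy_disjunctive_blocking(
    covers, true_pairs, false_pairs, max_false=2, max_degree=2, epsilon=0)
```

In this toy run `same_state` is discarded up front because it covers all four false pairs, and the remaining two predicates together cover every true pair.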
DNF Blocking
• Remove predicates covering too many pairs
• Construct predicate conjunctions, length <= k-1
• Add conjunctions maximizing marginal true/false ratio to set
• Apply Disjunctive Blocking with resulting predicates
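A rough sketch of the conjunction-building step, under simplifying assumptions: each conjunction is scored on its own by true pairs over false pairs plus one (not a true marginal ratio), the length bound is passed in as a hypothetical `max_len` parameter, and only the `keep` best conjunctions are retained. The expanded predicate set it returns is what would then be handed to the disjunctive selection step.

```python
from itertools import combinations


def build_conjunctions(covers, true_pairs, false_pairs, max_len, keep):
    """Return base predicates plus the `keep` best-scoring conjunctions."""
    scored = []
    for size in range(2, max_len + 1):
        for names in combinations(sorted(covers), size):
            # A conjunction covers only pairs covered by all its predicates.
            cover = set.intersection(*(covers[n] for n in names))
            n_true = len(cover & true_pairs)
            n_false = len(cover & false_pairs)
            if n_true == 0:
                continue  # a conjunction covering no true pair cannot help
            ratio = n_true / (n_false + 1)  # +1 smoothing is an assumption
            scored.append((ratio, names, cover))
    scored.sort(key=lambda item: item[0], reverse=True)

    expanded = dict(covers)
    for _, names, cover in scored[:keep]:
        expanded[" AND ".join(names)] = cover
    return expanded


# Hypothetical training pairs and predicate covers.
true_pairs = {("a", "b"), ("c", "d")}
false_pairs = {("a", "c"), ("b", "d")}
covers = {
    "same_city": {("a", "b"), ("c", "d"), ("a", "c")},
    "same_prefix": {("a", "b"), ("c", "d"), ("b", "d")},
    "same_zip": {("a", "b"), ("a", "c")},
}

expanded = build_conjunctions(covers, true_pairs, false_pairs, max_len=2, keep=1)
```

In the toy data each base predicate covers one false pair, but "same_city AND same_prefix" covers both true pairs and no false pairs, so it is the conjunction that gets added.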