Real-World Data Is Dirty
Data Cleansing and the Merge/Purge Problem
Hernandez & Stolfo: Columbia University - 1998
Class Presentation by Haiguang Li, 01. Dec 2011
TOPICS
- Introduction
- A Basic Data Cleansing Solution
- Test & Real World Results
- Incremental Merge Purge w/ New Data
- Conclusion
- Recap
Introduction
The problem:
- Some corporations acquire large amounts of information every month
- The data is stored in many large databases (DB)
- These databases may be heterogeneous
  - Variations in schema
- The data may be represented differently across the various datasets
- Data in these DB may simply be inaccurate
Requirement of the analysis
- The data mining needs to be done
  - Quickly
  - Efficiently
  - Accurately
Examples of real-world applications
- Credit card companies
  - Assess risk of potential new customers
  - Find false identities
- Match disparate records concerning a customer
  - Mass Marketing companies
  - Government agencies
A Basic Data Cleansing Solution
Duplicate Elimination
- Sorted-Neighborhood Method (SNM)
- This is done in three phases
  - Create a key for each record
  - Sort records on this key
  - Merge/Purge records
SNM: Create key
- Compute a key for each record by extracting relevant fields or portions of fields
- Example:
First  Last    Address           ID        Key
Sal    Stolfo  123 First Street  45678987  STLSAL123FRST456
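The slide does not spell out how the key is assembled, so the following is only a sketch of one recipe that reproduces the example key: three consonants of the last name, three letters of the first name, the house number, four consonants of the street name, and the first three digits of the ID. The function names `consonants` and `make_key` and the exact field portions are assumptions inferred from the example, not the authors' definition.

```python
VOWELS = set("AEIOU")


def consonants(word, n):
    """Return the first n consonants of word, upper-cased."""
    letters = [c for c in word.upper() if c.isalpha() and c not in VOWELS]
    return "".join(letters[:n])


def make_key(first, last, address, id_number):
    """Build a sort key from portions of each field (assumed recipe)."""
    parts = address.split()
    house_number = parts[0] if parts and parts[0].isdigit() else ""
    street_words = [p for p in parts if not p.isdigit()]
    street = street_words[0] if street_words else ""
    return (
        consonants(last, 3)        # STL from "Stolfo"
        + first.upper()[:3]        # SAL from "Sal"
        + house_number             # 123
        + consonants(street, 4)    # FRST from "First"
        + id_number[:3]            # 456
    )


key = make_key("Sal", "Stolfo", "123 First Street", "45678987")
print(key)
```

Because only portions of each field are used, records with small spelling differences can end up with the same or a nearby key, which is what the sort phase relies on.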
SNM: Sort Data
- Sort the records in the data list using the key in step 1
- This can be very time consuming
  - O(N log N) for a good algorithm, O(N^2) for a bad algorithm
SNM: Merge records
- Move a fixed size window through the sequential list of records.
- This limits the comparisons to the records in the window
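The sort and window phases can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `sorted_neighborhood`, the `key` and `is_match` callables, and the toy matching rule are all assumptions made for the example.

```python
def sorted_neighborhood(records, key, is_match, w):
    """Return pairs of records that match within a sliding window of size w."""
    ordered = sorted(records, key=key)          # sort phase
    pairs = []
    for i, current in enumerate(ordered):       # merge phase: slide the window
        # Compare the record entering the window with the w-1 records before it.
        for previous in ordered[max(0, i - w + 1):i]:
            if is_match(previous, current):
                pairs.append((previous, current))
    return pairs


# Toy usage: records are plain strings, the key is the string itself,
# and two records "match" when they share their first three letters.
names = ["stolfo", "smith", "stolpho", "stiles", "smyth"]
pairs = sorted_neighborhood(
    names,
    key=lambda r: r,
    is_match=lambda a, b: a[:3] == b[:3],
    w=2,
)
print(pairs)
```

Records that sort far apart are never compared, which is the source of both the speed-up and the missed matches discussed on the following slides.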
SNM: Considerations
- What is the optimal window size while
  - Maximizing accuracy
  - Minimizing computational cost
- Execution time for large DB will be bound by
  - Disk I/O
  - Number of passes over the data set
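To get a feel for the computational side of the window-size trade-off, the sketch below counts record comparisons for a window of size w and for an exhaustive all-pairs comparison. It assumes each record is compared with the w-1 records preceding it in sorted order; the function names are hypothetical, and the count ignores sorting and disk I/O.

```python
def window_comparisons(n, w):
    """Comparisons made when each record is compared with the w-1 before it."""
    return sum(min(i, w - 1) for i in range(n))


def all_pairs_comparisons(n):
    """Comparisons needed to test every pair of n records."""
    return n * (n - 1) // 2


n = 1_000_000
print(window_comparisons(n, 10))     # grows linearly in n for a fixed w
print(all_pairs_comparisons(n))      # grows quadratically in n
```

For n at least w the window count equals (w-1)(n - w/2), so doubling the window roughly doubles the comparison cost of a pass.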
Selection of Keys
- The effectiveness of the SNM highly depends on the key selected to sort the records
- A key is defined to be a sequence of a subset of attributes
- Keys must provide sufficient discriminating power
Example of Records and Keys
First  Last     Address            ID        Key
Sal    Stolfo   123 First Street   45678987  STLSAL123FRST456
Sal    Stolfo   123 First Street   45678987  STLSAL123FRST456
Sal    Stolpho  123 First Street   45678987  STLSAL123FRST456
Sal    Stiles   123 Forest Street  45654321  STLSAL123FRST456
Equational Theory
- The comparison during the merge phase is an inferential process
- Compares much more information than simply the key
- The more information there is, the better inferences can be made
Equational Theory - Example
- Two names are spelled nearly identically and have the same address
  - It may be inferred that they are the same person
- Two social security numbers are the same but the names and addresses are totally different
  - Could be the same person who moved
  - Could be two different people and there is an error in the social security number
A simplified rule in English
Given two records, r1 and r2
IF the last name of r1 equals the last name of r2,
AND the first names differ slightly,
AND the address of r1 equals the address of r2
THEN r1 is equivalent to r2
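The rule above translates almost directly into code. In this sketch, records are assumed to be dictionaries with `first`, `last`, and `address` fields, and "differ slightly" is approximated with the similarity ratio from Python's standard `difflib` module and an arbitrary threshold of 0.75; both choices are illustrative assumptions rather than part of the original rule.

```python
from difflib import SequenceMatcher


def names_differ_slightly(a, b, threshold=0.75):
    """True when two names are equal or nearly equal (assumed threshold)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def equivalent(r1, r2):
    """Simplified rule: same last name, similar first name, same address."""
    return (
        r1["last"] == r2["last"]
        and names_differ_slightly(r1["first"], r2["first"])
        and r1["address"] == r2["address"]
    )


a = {"first": "Kathi", "last": "Kason", "address": "48 North St."}
b = {"first": "Kathy", "last": "Kason", "address": "48 North St."}
c = {"first": "Kathy", "last": "Smith", "address": "48 North St."}
print(equivalent(a, b), equivalent(b, c))
```

A real equational theory would contain many such rules, each looking at a different combination of fields.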
The distance function
- A “distance function” is used to compare pieces of data (usually text)
- Apply “distance function” to data that “differ slightly”
  - Select a threshold to capture obvious typographical errors.
  - Impacts number of successful matches and number of false positives
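The slides do not say which distance function is used. As one common choice, the sketch below uses the Levenshtein edit distance (the minimum number of single-character insertions, deletions, and substitutions) together with a threshold; the threshold of 2 and the function names are assumptions for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def differ_slightly(a, b, threshold=2):
    """True when the edit distance is within the (assumed) threshold."""
    return edit_distance(a.lower(), b.lower()) <= threshold


print(edit_distance("Kathi", "Kathy"))
print(edit_distance("Stolfo", "Stolpho"))
print(differ_slightly("Boardman", "Brown"))
```

Raising the threshold catches more typographical variants but also lets more unrelated strings through, which is the trade-off between successful matches and false positives noted above.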
Examples of matched records
SSN        Name (First, Initial, Last)  Address
334600443  Lisa Boardman                144 Wars St.
334600443  Lisa Brown                   144 Ward St.

525520001  Ramon Bonilla                38 Ward St.
525520001  Raymond Bonilla              38 Ward St.

0          Diana D. Ambrosion           40 Brik Church Av.
0          Diana A. Dambrosion          40 Brick Church Av.

789912345  Kathi Kason                  48 North St.
879912345  Kathy Kason                  48 North St.

879912345  Kathy Kason                  48 North St.
879912345  Kathy Smith                  48 North St.
Building an equational theory
- The process of creating a good equational theory is similar to the process of creating a good knowledge-base for an expert system
- In complex problems, an expert’s assistance is needed to write the equational theory
Transitive Closure
- In general, no single pass (i.e. no single key) will be sufficient to catch all matching records
- An attribute that appears first in the key has higher discriminating power than those appearing after it
  - If an employee has two records in a DB with SSN 193456782 and 913456782, it’s unlikely they will fall under the same window
Transitive Closure
- To increase the number of similar records merged
  - Widen the scanning window size, w
  - Execute several independent runs of the SNM
    - Use a different key each time
    - Use a relatively small window
    - Call this the Multi-Pass approach
Transitive Closure
- Each independent run of the Multi-Pass approach will produce a set of pairs of records
- Although one field in a record may be in error, another field may not
- Transitive closure can be applied to those pairs to be merged
Multi-pass Matches
Pass 1 (Lastname discriminates)
  KSNKAT48NRTH789 (Kathi Kason 789912345)
  KSNKAT48NRTH879 (Kathy Kason 879912345)
Pass 2 (Firstname discriminates)
  KATKSN48NRTH789 (Kathi Kason 789912345)
  KATKSN48NRTH879 (Kathy Kason 879912345)
Pass 3 (Address discriminates)
  48NRTH879KSNKAT (Kathy Kason 879912345)
  48NRTH879SMTKAT (Kathy Smith 879912345)
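The three passes differ only in the order in which the same field fragments are concatenated. The sketch below rebuilds the keys shown above under an assumed fragment recipe (three consonants of the last name, three letters of the first name, house number plus four consonants of the street name, first three SSN digits); the recipe and the names `key_parts`, `pass_key`, and `PASSES` are inferred from the examples and are not taken from the source.

```python
VOWELS = set("AEIOU")


def consonants(word, n):
    """Return the first n consonants of word, upper-cased."""
    letters = [c for c in word.upper() if c.isalpha() and c not in VOWELS]
    return "".join(letters[:n])


def key_parts(rec):
    """Extract the key fragments of one record (assumed recipe)."""
    number, street = rec["address"].split()[:2]
    return {
        "last": consonants(rec["last"], 3),
        "first": rec["first"].upper()[:3],
        "addr": number + consonants(street, 4),
        "ssn": rec["ssn"][:3],
    }


# Each pass puts a different fragment first, so a different field discriminates.
PASSES = {
    "last name": ("last", "first", "addr", "ssn"),
    "first name": ("first", "last", "addr", "ssn"),
    "address": ("addr", "ssn", "last", "first"),
}


def pass_key(rec, order):
    """Concatenate the record's key fragments in the given order."""
    parts = key_parts(rec)
    return "".join(parts[name] for name in order)


kathi = {"first": "Kathi", "last": "Kason", "address": "48 North St.", "ssn": "789912345"}
smith = {"first": "Kathy", "last": "Smith", "address": "48 North St.", "ssn": "879912345"}
for name, order in PASSES.items():
    print(name, pass_key(kathi, order), pass_key(smith, order))
```

Sorting on the address-first key places the Kason and Smith records next to each other even though their last names differ, which is why that pair only shows up in the third pass.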
Transitive Equality Example
IF A implies B
AND B implies C
THEN A implies C
From example:
789912345 Kathi Kason 48 North St. (A)
879912345 Kathy Kason 48 North St. (B)
879912345 Kathy Smith 48 North St. (C)
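Computing the transitive closure over the matched pairs from all passes amounts to finding connected groups of records. One standard way to do this is a union-find (disjoint-set) structure, sketched below; the slides do not state which algorithm was used, so this is an illustrative choice, and the record labels are placeholders.

```python
def transitive_closure(pairs):
    """Group records into clusters given pairwise matches (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a         # merge the two groups

    clusters = {}
    for x in list(parent):
        clusters.setdefault(find(x), set()).add(x)
    return list(clusters.values())


# A matched B in one pass and B matched C in another, so A, B, C form one cluster.
pairs = [("A", "B"), ("B", "C"), ("D", "E")]
clusters = transitive_closure(pairs)
print(clusters)
```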
Test Results
Test Environment
- Test data was created by a database generator
  - Names are randomly chosen from a list of 63000 real names
- The database generator provides a large number of parameters: size of the DB, percentage of duplicates, amount of error…
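The slides give only the generator's parameters, not its design. Purely as an illustration of how such parameters could interact, the sketch below builds a toy database of names, adds a given fraction of duplicates, and corrupts each duplicate's characters with a given probability. Everything here (the function name, the per-character error model, the tiny name list) is hypothetical.

```python
import random


def generate_database(names, size, duplicate_rate, error_rate, seed=0):
    """Return (identity, name) records: `size` originals plus noisy duplicates."""
    rng = random.Random(seed)
    records = [(i, rng.choice(names)) for i in range(size)]
    originals = list(records)
    for _ in range(int(size * duplicate_rate)):
        identity, name = rng.choice(originals)
        chars = list(name)
        for pos in range(len(chars)):
            if rng.random() < error_rate:        # inject a typo at this position
                chars[pos] = rng.choice("abcdefghijklmnopqrstuvwxyz")
        records.append((identity, "".join(chars)))
    rng.shuffle(records)
    return records


db = generate_database(["stolfo", "kason", "bonilla"], size=100,
                       duplicate_rate=0.5, error_rate=0.1)
print(len(db))
```

Keeping the true identity alongside each record is what makes it possible to score a merge/purge run against known duplicates.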
[Figure: Correct Duplicate Detection]
[Figure: Time for each run]
[Figure: Accuracy for each run]
Real-World Test
- Data was obtained from the Office of Children Administrative Research (OCAR) of the Department of Social and Health Services (State of Washington)
- OCAR’s goals
  - How long do children stay in foster care?
  - How many different homes do children typically stay in?
OCAR’s Database
- Most of OCAR’s data is stored in one relation
- The DB contains 6,000,000 total records
- The DB grows by about 50,000 records per month
Typical Problems in the DB
- Names are frequently misspelled
- SSN or birthdays are either missing or clearly wrong
- Case number often changes when the child’s family moves to another part of the state
- Some records use service provider names instead of the child’s
- No reliable unique identifier
OCAR Equational Theory
- Keys for the independent runs
  - Last Name, First Name, SSN, Case Number
  - First Name, Last Name, SSN, Case Number
  - Case Number, First Name, Last Name, SSN
[Figure: OCAR Results]
Incremental Merge/Purge w/ New Data
Incremental Merge/Purge
- Lists are concatenated for first time processing
- Concatenating new data before reapplying the merge/purge process may be very expensive in both time and space
- An incremental merge/purge approach is needed: Prime Representatives method
Prime-Representative: Definition
- A set of records extracted from each cluster of records used to represent the information in the cluster
- The “Cluster Centroid” or base element of equivalence class
Prime-Representative creation
- Initially, no PR exists
- The first merge/purge run creates clusters of similar records
- Correct selection of PR from each cluster impacts accuracy of results
- For some clusters, selecting no PR at all can be the best choice
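One plausible reading of the prime-representative idea is that new records are compared only against each cluster's representatives rather than against every stored record. The sketch below shows that reading in a deliberately simplified form: it scans clusters linearly instead of running a sorted-neighborhood pass, and the names `incremental_merge`, `is_match`, and `choose_pr` are hypothetical.

```python
def incremental_merge(clusters, new_records, is_match, choose_pr):
    """Fold new records into existing clusters by comparing only against PRs."""
    clusters = [list(c) for c in clusters]          # work on a copy
    for record in new_records:
        for cluster in clusters:
            if any(is_match(record, pr) for pr in choose_pr(cluster)):
                cluster.append(record)              # joins an existing cluster
                break
        else:
            clusters.append([record])               # no PR matched: new cluster
    return clusters


# Toy usage: records match on last name; the PR is the first record of a cluster.
existing = [["kathy kason", "kathi kason"], ["lisa brown"]]
incoming = ["kathie kason", "ramon bonilla"]
merged = incremental_merge(
    existing,
    incoming,
    is_match=lambda a, b: a.split()[-1] == b.split()[-1],
    choose_pr=lambda cluster: cluster[:1],
)
print(merged)
```

The cost of each increment then depends on the number of representatives rather than on the full size of the database.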
3 Strategies for Choosing PR
- Random Sample
  - Select a sample of records at random from each cluster
- N-Latest
  - Most recent elements entered in DB
- Syntactic
  - Choose the largest or more complete record
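The three strategies can be sketched as small selection functions. The sketch assumes that a cluster is a list of dictionaries kept in insertion order and that "largest or more complete" means the record with the most non-empty fields, with ties broken by total text length; these details and the function names are assumptions, since the slides do not define them.

```python
import random


def random_sample(cluster, k=1, seed=0):
    """Random Sample: pick k records at random from the cluster."""
    rng = random.Random(seed)
    return rng.sample(cluster, min(k, len(cluster)))


def n_latest(cluster, n=1):
    """N-Latest: the n most recently entered records (insertion order assumed)."""
    return cluster[-n:] if n > 0 else []


def syntactic(cluster):
    """Syntactic: the most complete record (assumed completeness measure)."""
    def completeness(record):
        values = [str(v) for v in record.values() if v not in (None, "")]
        return (len(values), sum(len(v) for v in values))
    return [max(cluster, key=completeness)]


cluster = [
    {"first": "Kathi", "last": "Kason", "ssn": ""},
    {"first": "Kathy", "last": "Kason", "ssn": "879912345"},
    {"first": "K.", "last": "Kason", "ssn": None},
]
print(n_latest(cluster, 2))
print(syntactic(cluster))
```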
Important Assumption
- No data previously used to select each cluster’s PR will be deleted
  - Deleted records could require restructuring of clusters (expensive)
- No changes in the rule-set will occur after the first increment of data is processed
  - Substantial rule change could invalidate clusters.
Results
- Cumulative running time for the Incremental Merge/Purge algorithm is higher than the classic algorithm
- PR selection methodology could improve cumulative running time
  - Total running time of the Incremental Merge/Purge algorithm is always smaller
Conclusion
Cleansing of Data
- Sorted-Neighborhood Method is expensive due to
  - the sorting phase
  - the need for large windows for high accuracy
- Multiple passes with small windows followed by transitive closure improve accuracy, and improve performance for a given level of accuracy
  - increasing number of successful matches
  - decreasing number of false positives
Questions 1?
- 2 major reasons merging large databases becomes a difficult problem:
  - The databases are heterogeneous
  - The identifiers or strings differ in how they are represented within each DB
Questions 2?
- The 3 steps in SNM are:
  - Creation of key(s)
  - Sorting records on this key
  - Merge/Purge records
Questions 3?
- 3 strategies for selecting a PR:
  - Random Sample
  - N-Latest
  - Syntactic
The End
Thanks very much!