Achieving Data Quality with AJAX
H. Galhardas, GTI 2007/08
(first version of AJAX designed and developed at INRIA Rocquencourt, France)
Existing technology
• Ad-hoc programs written in a programming language like C or Java, or using an RDBMS proprietary language
– Programs difficult to optimize and maintain
• RDBMS mechanisms for guaranteeing integrity constraints
– Do not address important data instance problems
• Data transformation scripts using an ETL (Extraction-Transformation-Loading) or data quality tool
Problems of data quality solutions (1)
The semantics of some data transformations is defined in terms of their implementation algorithms
[Figure: App. Domain 1, App. Domain 2, App. Domain 3, ... and Data cleaning transformations]
Problems of data quality solutions (2)
There is a lack of interactive facilities to tune a data cleaning application program
[Figure: Dirty Data goes through the Cleaning process, which produces Clean data and Rejected data]
Motivating example (1)
DirtyData(paper:String)
Data Cleaning & Transformation
Events(eventKey, name)
Publications(pubKey, title, eventKey, url, volume, number, pages, city, month, year)
Authors(authorKey, name)
PubsAuthors(pubKey, authorKey)
Motivating example (2)
DirtyData
[1] Dallan Quass, Ashish Gupta, Inderpal Singh Mumick, and Jennifer Widom. Making Views Self-Maintainable for Data Warehousing. In Proceedings of the Conference on Parallel and Distributed Information Systems. Miami Beach, Florida, USA, 1996
[2] D. Quass, A. Gupta, I. Mumick, J. Widom, Making views self-maintianable for data warehousing, PDIS’95
Data Cleaning & Transformation
Events
PDIS | Conference on Parallel and Distributed Information Systems
Publications
QGMW96 | Making Views Self-Maintainable for Data Warehousing | PDIS | null | null | null | null | Miami Beach | Florida, USA | 1996
Authors
DQua | Dallan Quass
AGup | Ashish Gupta
JWid | Jennifer Widom
...
PubsAuthors
QGMW96 | DQua
QGMW96 | AGup
...
Modeling a data quality process
A data quality process is modeled by a directed acyclic graph of data transformations
[Figure: data-flow graph with the nodes DirtyData, DirtyAuthors, DirtyTitles, ..., DirtyEvents, Cities, Tags and Authors, connected by the transformations Extraction, Standardization, Formatting and Duplicate Elimination]
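As a rough illustration of what modeling the process as a directed acyclic graph buys, the sketch below derives an execution order for the four transformations named in the figure. The dependency edges are an assumption made for the example (the transcript only preserves the node names), and `execution_order` is a hypothetical helper, not part of AJAX.

```python
from graphlib import TopologicalSorter, CycleError

# Each transformation maps to the set of transformations it depends on.
# These edges are assumed for illustration; the slide only names the nodes.
dependencies = {
    "Extraction": set(),
    "Standardization": {"Extraction"},
    "Formatting": {"Standardization"},
    "Duplicate Elimination": {"Formatting"},
}


def execution_order(deps):
    """Return one valid execution order; raises CycleError if deps is not acyclic."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Because the graph must be acyclic, a specification with a circular dependency is rejected (here with `CycleError`) before any transformation runs.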
AJAX features
• An extensible data quality framework
– Logical operators as extensions of relational algebra
– Physical execution algorithms
• A declarative language for logical operators
– SQL extension
• A debugger facility for tuning a data cleaning program application
– Based on a mechanism of exceptions
Logical level: parametric operators
• View: arbitrary SQL query
• Map: iterator-based one-to-many mapping with arbitrary user-defined functions
• Match: iterator-based approximate join
• Cluster: uses an arbitrary clustering function
• Merge: extends SQL group-by with user-defined aggregate functions
• Apply: executes an arbitrary user-defined algorithm
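To make two of these operators concrete, here is a minimal Python sketch of Map (one input row may yield zero, one or many output rows) and Merge (group-by with a user-defined aggregate). The function names `map_op`, `merge_op` and `split_authors`, and the sample rows, are hypothetical and only illustrate the idea; they are not the AJAX implementation.

```python
from collections import defaultdict


def map_op(rows, fn):
    """Map: apply fn to each input row; fn returns a list of zero or more output rows."""
    out = []
    for row in rows:
        out.extend(fn(row))
    return out


def merge_op(rows, key, aggregate):
    """Merge: group rows by key(row), then reduce each group with a user-defined aggregate."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    return {k: aggregate(group) for k, group in groups.items()}


def split_authors(author_list):
    """Hypothetical user-defined function: one author string becomes many author rows."""
    return [name.strip() for name in author_list.split(",") if name.strip()]


authors = map_op(["D. Quass, A. Gupta", "J. Widom"], split_authors)

# Hypothetical (eventKey, pubKey) rows; the aggregate counts publications per event.
pubs = [("PDIS", "p1"), ("PDIS", "p2"), ("VLDB", "p3")]
pubs_per_event = merge_op(pubs, key=lambda row: row[0], aggregate=len)
print(authors, pubs_per_event)
```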
Logical level
[Figure: the graph of data transformations from DirtyData to Authors (Extraction, Standardization, Formatting, Duplicate Elimination, with DirtyAuthors, DirtyTitles, ..., Cities and Tags) annotated with logical operators: Map, Map, Map, Match, Cluster, Merge]
Physical level
[Figure: the same graph annotated with physical algorithms: SQL Scan, Java Scan, Java Scan, Java Scan, NL, TC]
Match
• Input: 2 relations
• Finds data records that correspond to the same real object
• Calls distance functions for comparing field values and computing the distance between input tuples
• Output: 1 relation containing matching tuples and possibly 1 or 2 relations containing non-matching tuples
Example
[Figure: Duplicate Elimination expanded into the Match, Cluster and Merge operators, taking DirtyAuthors to Authors via MatchAuthors]
Example
CREATE MATCH MatchDirtyAuthors
FROM DirtyAuthors da1, DirtyAuthors da2
LET distance = editDistance(da1.name, da2.name)
WHERE distance < maxDist
INTO MatchAuthors
Input:
DirtyAuthors(authorKey, name)
861 | johann christoph freytag
822 | jc freytag
819 | j freytag
814 | j-c freytag
Output:
MatchAuthors(authorKey1, authorKey2, name1, name2)
861 | 822 | johann christoph freytag | jc freytag
822 | 814 | jc freytag | j-c freytag
...
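The remaining two steps of duplicate elimination can be sketched as follows: Cluster groups the keys connected by matching pairs, and Merge reduces each cluster to one record. The sketch assumes clustering by transitive closure (implemented with union-find) and a merge policy that keeps the longest name; both choices are illustrative assumptions. Only the two MatchAuthors rows visible above are used, so key 819 stays in a cluster of its own here.

```python
def transitive_closure_clusters(keys, pairs):
    """Two keys share a cluster if a chain of matching pairs links them."""
    parent = {k: k for k in keys}

    def find(k):
        # Follow parent links to the cluster root, shortening the path on the way.
        while parent[k] != k:
            parent[k] = parent[parent[k]]
            k = parent[k]
        return k

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for k in keys:
        clusters.setdefault(find(k), []).append(k)
    return sorted(sorted(members) for members in clusters.values())


def merge_clusters(clusters, names):
    """One output name per cluster; keeping the longest name is a hypothetical policy."""
    return [max((names[k] for k in members), key=len) for members in clusters]


names = {
    861: "johann christoph freytag",
    822: "jc freytag",
    819: "j freytag",
    814: "j-c freytag",
}
match_pairs = [(861, 822), (822, 814)]  # the two rows shown in MatchAuthors

clusters = transitive_closure_clusters(names.keys(), match_pairs)
authors = merge_clusters(clusters, names)
print(clusters, authors)
```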
Implementation of the match operator
s1 ∈ S1, s2 ∈ S2
(s1, s2) is a match if
editDistance(s1, s2) < maxDist
Nested loop
[Figure: each tuple of S1 is compared with each tuple of S2 using editDistance]
• Very expensive evaluation when handling large amounts of data
Need alternative execution algorithms for the same logical specification
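A minimal sketch of the nested-loop evaluation, assuming editDistance is the Levenshtein distance (the slides do not define it). The names `edit_distance` and `match_nested_loop`, the sample tuples, and the choice of max_dist = 2 are all illustrative. The comparison counter shows why the cost grows with |S1| × |S2|.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def match_nested_loop(s1, s2, max_dist):
    """Compare every (key, name) tuple of s1 with every tuple of s2."""
    matches = []
    comparisons = 0
    for key1, name1 in s1:
        for key2, name2 in s2:
            comparisons += 1
            distance = edit_distance(name1, name2)
            if distance < max_dist:
                matches.append((key1, key2, distance))
    return matches, comparisons


# Hypothetical sample data.
s1 = [(1, "John Smith")]
s2 = [(2, "John Smit"), (3, "Jogn Smith"), (4, "John Smithe"), (5, "Jane Doe")]
matches, comparisons = match_nested_loop(s1, s2, max_dist=2)
print(matches, comparisons)
```

Every pair is compared, including ("John Smith", "Jane Doe"), which cannot match; avoiding such calls is the purpose of the alternative algorithms.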
A database solution
CREATE TABLE MatchAuthors AS
SELECT authorKey1, authorKey2, distance
FROM (SELECT a1.authorKey authorKey1, a2.authorKey authorKey2,
             editDistance(a1.name, a2.name) distance
      FROM DirtyAuthors a1, DirtyAuthors a2)
WHERE distance < maxDist;
No optimization supported for a Cartesian product with external function calls
Window scanning
[Figure: a window of n tuples sliding over relation S]
May lose some matches
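A hedged sketch of window scanning, assuming that S is first sorted on the compared field and that n is the window size (the transcript keeps only the labels S and n). The helper `window_pairs` is hypothetical and returns the candidate pairs on which the distance function would then be called.

```python
def window_pairs(records, n, sort_key=lambda r: r):
    """Pair each record only with the next n - 1 records in sorted order."""
    ordered = sorted(records, key=sort_key)
    pairs = []
    for i, left in enumerate(ordered):
        for right in ordered[i + 1 : i + n]:
            pairs.append((left, right))
    return pairs


names = ["johann christoph freytag", "jc freytag", "j freytag", "j-c freytag"]
pairs = window_pairs(names, n=2)
print(pairs)
```

With n = 2 only 3 of the 6 possible pairs are compared. The pair ("j freytag", "jc freytag") is never generated because "j-c freytag" sorts between the two names, which is one way a match can be lost.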
String distance filtering
maxDist = 1
String | Length
John Smith | length
John Smit | length - 1
Jogn Smith | length
John Smithe | length + 1
[Figure: editDistance between strings of S1 and S2 with the lengths above]
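The filter relies on a property of the Levenshtein distance: two strings whose lengths differ by d are at least d edits apart. A pair can therefore be discarded on length alone, before the expensive distance call. The sketch below is an illustration under that assumption; `length_filtered_pairs` and the extra sample strings are hypothetical.

```python
from collections import defaultdict


def length_filtered_pairs(s1, s2, max_dist):
    """Keep only the pairs whose length difference is smaller than max_dist."""
    by_length = defaultdict(list)
    for b in s2:
        by_length[len(b)].append(b)

    pairs = []
    for a in s1:
        # distance < max_dist requires |len(a) - len(b)| < max_dist
        for length in range(len(a) - max_dist + 1, len(a) + max_dist):
            for b in by_length.get(length, []):
                pairs.append((a, b))
    return pairs


s1 = ["John Smith"]
s2 = ["John Smit", "Jogn Smith", "John Smithe", "J. Smith", "Jonathan Smithers"]
candidates = length_filtered_pairs(s1, s2, max_dist=2)
print(candidates)
```

The demo uses max_dist = 2 so that, under the strict comparison distance < maxDist, strings of length - 1, length and length + 1 survive, as in the figure. Only the surviving pairs need a call to editDistance.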
Annotation-based optimization
• The user specifies types of optimization
• The system suggests which algorithm to use
Ex:
CREATE MATCHING MatchDirtyAuthors
FROM DirtyAuthors da1, DirtyAuthors da2
LET dist = editDistance(da1.name, da2.name)
WHERE dist < maxDist
% distance-filtering: map= length; dist = abs %
INTO MatchAuthors
Declarative specification
DEFINE FUNCTIONS AS
Choose.uniqueString(OBJECT[]) RETURN STRING THROWS CiteSeerException
Generate.generateId(INTEGER) RETURN STRING
Normal.removeCitationTags(STRING) RETURN STRING (600)
DEFINE ALGORITHMS AS
TransitiveClosure
SourceClustering(STRING)
DEFINE INPUT DATA FLOWS AS
TABLE DirtyData(paper STRING (400));
TABLE City(city STRING (80), citysyn STRING (80)) KEY city, citysyn;
DEFINE TRANSFORMATIONS AS
CREATE MAPPING mapKeDiDa
FROM DirtyData Dd
LET keyKdd = generateId(1)
{SELECT keyKdd AS paperKey, Dd.paper AS paper
KEY paperKey
CONSTRAINT NOT NULL mapKeDiDa.paper}
[Figure: graph of data transformations shown next to the declarative specification]
Management of exceptions
• Problem: to mark tuples not handled by the cleaning criteria of an operator
• Solution: to specify the generation of exception tuples within a logical operator
– exceptions are thrown by external functions
– output constraints are violated
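The two sources of exception tuples can be sketched as follows: an input tuple is set aside either when the external function raises an exception or when its output violates a constraint. The names `map_with_exceptions` and `extract_year`, the year-range constraint and the sample strings are hypothetical; this is an illustration, not the AJAX mechanism itself.

```python
def map_with_exceptions(rows, fn, constraint):
    """Apply fn to each row; unhandled rows go to `exceptions` instead of `output`."""
    output, exceptions = [], []
    for row in rows:
        try:
            result = fn(row)
        except Exception as exc:
            # Case 1: the external function threw an exception.
            exceptions.append((row, f"function raised {type(exc).__name__}"))
            continue
        if constraint(result):
            output.append(result)
        else:
            # Case 2: the output constraint is violated.
            exceptions.append((row, "output constraint violated"))
    return output, exceptions


def extract_year(text):
    """Hypothetical external function: read the last four characters as a year."""
    return int(text[-4:])


rows = ["PDIS 1996", "PDIS'95", "VLDB 0001"]
output, exceptions = map_with_exceptions(
    rows, extract_year, constraint=lambda year: 1900 <= year <= 2100
)
print(output, exceptions)
```

The exception tuples keep the original input row, so a user can inspect them, correct the data or refine the cleaning criteria, and run the operator again.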
Debugger facility
• Supports the (backward and forward) data derivation of tuples with respect to an operator, to debug exceptions
• Supports interactive data modification and, in the future, the incremental execution of logical operators
Debugging exceptions
Architecture
References
• Helena Galhardas, Daniela Florescu, Dennis Shasha, Eric Simon, Cristian-Augustin Saita: “Declarative Data Cleaning: Language, Model, and Algorithms”. VLDB 2001: 371-380