Post on 22-Feb-2016
description
Floris Geerts (University of Antwerp)Giansalvatore Mecca, Donatello Santoro (Università della
Basilicata)Paolo Papotti (Qatar Computing Research Institute)
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
Overview 2
‣ Motivations and Goals
‣ Semantics
‣ Experimental Results
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
Overview 3
‣ Motivations and Goals
‣ Semantics
‣ Experimental Results
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
Data Cleaning as Data Repairing
•Data Cleaning is a very general term•standardization, entity resolution,
data fusion…•We consider one of its facets: data
repairing
4
Dirty Table R
Database Table R
ConstraintsC over R
Data Repairing
Cleaned Table R’
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
5A Motivating ExampleCustomers
SSN Name Phone PhConf City CC#
t1 111 M. White 408-3334 0.8 NY 112321t2 222 L. Lennon 122-1876 0.9 SF 781658t3 222 L. Lennon 000-0000 0.0 SF 784659
Master DataSSN Name Phone Street City
tm 222 F. Lennon 122-1876 Sky Dr. SF
fd1. Cust: SSN, Name → Phone
Treatments
SSN Salary Insurance Treat. Date
t4 111 10k Abx Dental 10/01/11t5 111 25k Abx Cholest. 08/12/12t6 222 30k Med Eye Surg. 06/10/12er6. IF Cust.SSN = MD.SSN,
Cust.Phone = MD.Phone → TAKE Name, Street, City from MD
fd2. Cust: SSN, Name → CC#fd3. Treat: SSN → Salarycfd4. Treat: Insur[‘Abx’] → Tr[‘Dental’]cfd5. IF Treat:Insur[‘Abx’] THEN Cust: City[‘SF’]
?
INTERACTION!
PREFERRED VALUES
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
Language of
Constraints
Choose Preferred
ValuesRepair
Selection
6Many TechniquesFD - ID
Cond FDCond ID
Matching DEdit Rules
CurrencyConfidenceMaster Data Cost-Minimality
Certain FixesSampling
6
How to put everything together?
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
Language of
Constraints
Choose Preferred
ValuesRepair
Selection
7Many TechniquesFD - ID
Cond FDCond ID
Matching DEdit Rules
FD - ID
CurrencyConfidenceMaster Data Cost-Minimality
Certain FixesSampling
Problem 1:missing semantics
andmissing repair
algorithm
7
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
Language of
Constraints
Choose Preferred
ValuesRepair
Selection
8Many TechniquesFD - ID
Cond FDCond ID
Matching DEdit Rules
FD - ID
CurrencyConfidenceMaster Data Cost-Minimality
Certain FixesSampling
Problem 2:missing formalism
to handle user-specified
preference rules
8
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
Language of
Constraints
Choose Preferred
ValuesRepair
Selection
9Many TechniquesFD - ID
Cond FDCond ID
Matching DEdit Rules
FD - ID
CurrencyConfidenceMaster Data Cost-Minimality
Certain FixesSampling
Problem 3:no DBMS-based
scalable implementations
9
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
10System Comparison
Depend. Language Repair Strategy Value Preference Solution Selection
FDs CFDs ERs Tgds RHS LHS Conf. Curr. MD Cost Cert. Card.
Sampl
[1] ✓ ✗ ✗ ✓ ✓ ✗ ✓ ✗ ✗ ✓ ✗ ✗ ✗[2] ✓ ✓ ✗ ✗ ✓ ✓ ✓ ✓ ✗ ✓ ✗ ✗ ✗[3] ✓ ✓ ✗ ✗ ✓ ✓ ✗ ✗ ✗ ✓ ✗ ✗ ✗[4] ✗ ✗ ✓ ✗ ✓ ✗ ✗ ✗ ✓ ✗ ✓ ✗ ✗[5] ✓ ✗ ✗ ✗ ✓ ✓ ✗ ✗ ✗ ✗ ✗ ✓ ✓
[1] Bohannon SIGMOD ’05 [2] Cong VLDB ‘07 [3] Kolahi ICDT ’09 [4] Fan VLDB ’10 [5] Beskales VLDB ’10
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
11Our Goal
Depend. Language Repair Strategy Value Preference Solution Selection
FDs CFDs ERs Tgds RHS LHS Conf. Curr. MD Cost Cert. Card.
Sampl
[1] ✓ ✗ ✗ ✓ ✓ ✗ ✓ ✗ ✗ ✓ ✗ ✗ ✗[2] ✓ ✓ ✗ ✗ ✓ ✓ ✓ ✓ ✗ ✓ ✗ ✗ ✗[3] ✓ ✓ ✗ ✗ ✓ ✓ ✗ ✗ ✗ ✓ ✗ ✗ ✗[4] ✗ ✗ ✓ ✗ ✓ ✗ ✗ ✗ ✓ ✗ ✓ ✗ ✗[5] ✓ ✗ ✗ ✗ ✓ ✓ ✗ ✗ ✗ ✗ ✗ ✓ ✓
GOAL[1] Bohannon SIGMOD ’05 [2] Cong VLDB ‘07 [3] Kolahi ICDT ’09 [4] Fan VLDB ’10 [5] Beskales
VLDB ’10
✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
Overview 12
‣ Motivations and Goals
‣ Semantics
‣ Experimental Results
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
13Cleaning EGDs
Customers
SSN Name Phone PhConf City CC#
t1 111 M. White 408-3334 0.8 NY 112321t2 222 L. Lennon 122-1876 0.9 SF 781658t3 222 L. Lennon 000-0000 0.0 SF 784659
Master DataSSN Name Phone Street City
tm 222 F. Lennon 122-1876 Sky Dr. SF
e1. Cust(ssn, n, ph, c , cc ) , Cust(ssn, n, ph’, c’, cc’) → ph = ph’e2. Cust(ssn, n, ph, c , cc ) , Cust(ssn, n, ph’, c’, cc’) → cc = cc’e3. Treat(ssn, sal, i, t, d) , Treat(ssn, sal’, i’, t’, d’) → sal = sal’e4. Treat(ssn, s, ins, t, d) , ins = ‘Abx’ → tr = ‘Dental’
Treatments
SSN Salary Insurance Treat. Date
t4 111 10k Abx Dental 10/01/11t5 111 25k Abx Cholest. 08/12/12t6 222 30k Med Eye Surg. 06/10/12
e5. Cust(ssn, n, ph, c , cc ) , Treat(ssn, s, ins, t, d) , ins = ‘Abx’ → c = ‘SF’e6. Cust(ssn, n, ph, c , cc ) , Master(ssn, n’, ph, s, c’) , → n = n’, c = c’
Σ: Cleaning EGDs
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
Semantics 14
‣ Cell Groups
‣ Upgrades
‣ LLUNs
‣ Partial Order
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
The Partial Order• Standard preference rules
• NULLS CONST
• source values are preferred to target values
• Ordering attributeΠ(Cust.Phone) =
{dom(Cust.PhConf), ≥}Π(Treat.Salary) =
{dom(Treat.Date), ≥time}
•No orderΠ(Cust.CC#) = ∅
15
Customers
SSN Name Phone PhConf City CC#
t1 111 M. White 408-3334 0.8 NY 112321t2 222 L. Lennon 122-1876 0.9 SF 781658t3 222 L. Lennon 000-0000 0.0 SF 784659
Master DataSSN Name Phone Street City
tm 222 F. Lennon 122-1876 Sky Dr. SF
Treatments
SSN Salary Insurance Treat. Date
t4 111 10k Abx Dental 10/01/11t5 111 25k Abx Cholest. 08/12/12t6 222 30k Med Eye Surg. 06/10/12
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
SSN Name Phone PhConf
t1 222 L. Lennon 123 0.9
t2 222 F. Lennon 000 0.0
SSN Name Phone PhConf
t1 222 L. Lennon 123 0.9
t2 222 F. Lennon 123 0.0
Cell Groups
• We model partial repair by a Cell-Group
• set of target cells that must be changed together• we do not disrupt this equality in the following
• Carry also lineage information
16
g1 = <123 → {t1.phone, t2.phone}>occurrencesvalue
Master Data
Name Phone
tm L. Lennon 999
e2. Cust(s, n, ph) , Master(n, ph’) → ph = ph’
222 L. Lennon 999 0.9
222 F. Lennon 999 0.0
g2 = <999→ {t1.phone, t2.phone} by {tm.ph}>
we change the value of all occurrences
justifications
e1. Cust(ssn, n, ph, c , cc ) , Cust(ssn, n’, ph’, c’, cc’) → ph = ph’
Partial orderover cell groups
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
Upgrades 17
SSN Name Phonet2 222 L. Lennon 123t3 222 F. Lennon 123
<123→ {t2.ph, t3.ph} >
SSN Name Phonet2 222 L. Lennon 999t3 222 F. Lennon 999
<999→ {t2.ph, t3.ph} by {tm.ph}>
SSN Name Phonet2 222 L. Lennon 123t3 222 F. Lennon 000
Starting database J
upgrade Repair 1upgrade Repair N
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
LLUNs•There are cases in which we don’t have any
clear strategy to remove a violation•LLUNs•a new class of symbols•placeholders used to mark conflicts•not only an unknown value but rather a
“hypervalue”• the opposite of a NULL, since it upgrades
constants
18
SSN Name Phone CC#t2 222 L. Lennon 122-1876 5555t3 222 L. Lennon 000-0000 6666e3. Cust(ssn, n, ph, cc ) , Cust(ssn, n, ph’, cc’) → cc = cc’
L0
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
Scenarios and Solutions19
•S: source schema, T: target schema, Σ: set of cleaning EGDs
•Π: the partial order specification•Π is the way to specify when a value is
preferrable to another•Solution: Given C, an instance I of S, and an
instance J of T compute an instance J’ such that:• it is a repair, i.e., “I and J’ satisfy Σ”•and “J’ is an upgrade of J according to Π”
Cleaning Scenario
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
The Chase Algorithm•To compute solutions:•Chase algorithm for chasing egds
20
SSN Name Phonet2 L1 L. Lennon 123t3 222 F. Lennon 000
SSN Name Phonet2 222 L. Lennon 123t3 L2 F. Lennon 000
<L1→ {t2.ssn} >
Repair 2 (backward)
<L2→ {t3.ssn} >
Repair 3 (backward)
e1. Cust(ssn, n, ph, c , cc ) , Cust(ssn, n’, ph’, c’, cc’) → ph = ph’
SSN Name Phonet2 222 L. Lennon 123t3 222 F. Lennon 123
<123→ {t2.ph, t3.ph} >
Repair 1 (forward)
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
A Few Results•Given a cleaning scenario C, and
instances <I, J>•C always has a solution for <I, J>•The chase always terminates (it never
fails)•The chase computes all minimal
solutions•The number of minimal solutions is
exponential in the size of J
21
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
Overview 22
‣ Motivations and Goals
‣ Semantics
‣ Experimental Results
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
Chase Tree 23
J
R1
e0, b1
R2 R3
e0, b2e0, f
R4
e1, b1
R5 R6
e1, b2e1, f
R10e1, b1 R11 R12
e1, b2e1, f
R13
e0, b1
R14 R15
e0, b2e0, f
the e0-e1 sequence the e1-e0 sequence
Our goal: to make this scalable ! Different orders of application
give different results
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
Scalability Techniques•Chase implementation based on
equivalence classes•Delta Databases•a representation system for chase
trees•Cost managers•pluggable strategies to prune the
chase tree
24
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
25Scalability
LLUNATIC-FR-S10LLUNATIC-FR-S1 LLUNATIC-FR-S1-FOLLUNATIC-FR-S50
CUSTOMERSHOSPITAL
CUSTOMERS
100K 250K 400K 550K 700K 850K 1M
LLUNATICis the first scalable
DBMS-based chase algorithm for
data repairing
sec.
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
26Quality of Repairs
5k 10k 25k
LLUNATIC-FR-S10LLUNATIC-FR-S1 LLUNATIC-FR-S1-FO
SAMPLING-500 VERTEX COVER MIN. COST
HOSPITAL
The LLUNATIC Data-Cleaning Framework – F. Geerts, G. Mecca, P. Papotti, D. Santoro August, 28 2013
That’s all Folks!