BigDansing presentation slides for SIGMOD 2015
![Page 1: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/1.jpg)
BigDansing: A BigData Cleansing System
Zuhair Khayyat, Ihab F. Ilyas, Alekh Jindal, Samuel Madden,
Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Si Yin
![Page 2: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/2.jpg)
Problem of Dirty Data
● “duplicate and dirty data costs the healthcare industry over $300 billion every year”
– Joe Fusaro (RingLead)
● “inaccurate data has a direct impact ... the average company losing 12% of its revenue”
– Ben Davis (Econsultancy)
![Page 3: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/3.jpg)
The Process of Data Cleansing
![Page 4: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/4.jpg)
![Page 5: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/5.jpg)
![Page 6: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/6.jpg)
![Page 7: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/7.jpg)
![Page 8: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/8.jpg)
![Page 9: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/9.jpg)
The Process of Data Cleansing
● One approach: Violation Detection using declarative rules
● Suggested Repairs
● Apply the Repairs
● Clean Dataset
● Side effect: new Violations
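The detect, repair, re-detect cycle on this slide can be sketched in a few lines of Python. This is only an illustration, not the system's implementation: the rule format (pairs of attribute names read as functional dependencies), the majority-vote repair, and the names `detect_and_suggest` and `cleanse` are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def detect_and_suggest(rows, lhs, rhs):
    """For an FD lhs -> rhs, return (row_index, attribute, suggested_value).

    Assumption: the suggested repair is the majority rhs value of each lhs group.
    """
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[row[lhs]].append(i)
    suggestions = []
    for members in groups.values():
        majority, _ = Counter(rows[i][rhs] for i in members).most_common(1)[0]
        suggestions += [(i, rhs, majority) for i in members if rows[i][rhs] != majority]
    return suggestions

def cleanse(rows, rules, max_rounds=10):
    """Detect violations, apply the suggested repairs, and repeat until none remain."""
    for round_no in range(max_rounds):
        suggestions = [s for lhs, rhs in rules for s in detect_and_suggest(rows, lhs, rhs)]
        if not suggestions:
            return rows, round_no
        for i, attr, value in suggestions:
            rows[i][attr] = value  # applying a repair may create new violations
    return rows, max_rounds

rows = [
    {"zipcode": "10001", "city": "New York", "state": "NY"},
    {"zipcode": "10001", "city": "New York", "state": "NY"},
    {"zipcode": "10001", "city": "Newark", "state": "NJ"},
]
rules = [("zipcode", "city"), ("city", "state")]
clean, rounds = cleanse(rows, rules)
```

In this toy input, repairing the city of the third row makes it disagree with the other rows on state, so a second round is needed. That is the "side effect: new violations" case.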
![Page 10: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/10.jpg)
![Page 11: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/11.jpg)
![Page 12: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/12.jpg)
Related work
• Functional dependencies (FDs, CFDs)
• Inclusion dependencies (INDs, CINDs)
• Denial constraints (DCs)
• Matching dependencies (MDs)
• Entity resolution rules (ERs)
Limited quality rules support*
NADEEF**
• Easy-to-use
• Extensible
• Efficient
• Scalability
* On approximating optimum repairs for functional dependency violations, ICDT 2009
* Holistic data cleaning: Putting violations into context, ICDE 2013
* The LLUNATIC data-cleaning framework, VLDB 2013
** SIGMOD 2013
![Page 13: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/13.jpg)
Data Cleansing is a Big Data problem
Dirty data
![Page 14: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/14.jpg)
![Page 15: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/15.jpg)
![Page 16: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/16.jpg)
![Page 17: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/17.jpg)
BigData Cleansing Requirements
● Fast
● Scalable
● Portable
![Page 18: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/18.jpg)
![Page 19: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/19.jpg)
![Page 20: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/20.jpg)
Challenges of BigData Cleansing
● Abstraction vs. Scalability
● Ease-of-use vs. Efficiency
● Quality Rules
● Inequalities
![Page 21: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/21.jpg)
BigDansing
(Figure: BigDansing architecture diagram; the numbers (1) to (7) label its steps.)
• Inputs: Declarative rule + UDFs (operators: Scope, Block, Iterate, Detect, GenFix), Dirty Dataset
• Rule Engine: Rule Parser, Logical Layer (logical plans), Physical Layer (physical plans), Execution Layer (execution plans)
• Data Processing Framework
• Repair Algorithm: violations & possible repairs, value updates, Apply Updates
![Page 22: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/22.jpg)
BigDansing: Abstraction
(Figure: BigDansing architecture diagram)
![Page 23: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/23.jpg)
BigDansing: Abstraction
Declarative Rules: FD, CFD, DC, ...
Logical Operators
• Scope
• Block
• Iterate
• Detect
• GenFix
![Page 24: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/24.jpg)
BigDansing: Abstraction
FD: zipcode -> city
Logical Operators
• Scope
• Block
• Iterate
• Detect
• GenFix
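One possible reading of the five logical operators for the rule FD: zipcode -> city is sketched below in Python. The slide gives only the operator names, so the bodies are assumptions about what each one does. The function names, the record layout, and the form of the generated fix are illustrative and are not the system's API.

```python
from collections import defaultdict
from itertools import combinations

def scope(rows):
    """Scope: keep only the attributes the rule mentions (plus a tuple id)."""
    return [{"id": r["id"], "zipcode": r["zipcode"], "city": r["city"]} for r in rows]

def block(rows):
    """Block: group tuples by zipcode; a violation needs two tuples in one group."""
    blocks = defaultdict(list)
    for r in rows:
        blocks[r["zipcode"]].append(r)
    return list(blocks.values())

def iterate(blk):
    """Iterate: enumerate the candidate tuple pairs inside one block."""
    return combinations(blk, 2)

def detect(pair):
    """Detect: a pair with the same zipcode violates the FD if the cities differ."""
    t1, t2 = pair
    return t1["city"] != t2["city"]

def gen_fix(pair):
    """GenFix: propose a possible repair, here 'these two cells should be equal'."""
    t1, t2 = pair
    return (t1["id"], "city"), (t2["id"], "city")

rows = [
    {"id": 1, "name": "Ann", "zipcode": "10001", "city": "New York"},
    {"id": 2, "name": "Bob", "zipcode": "10001", "city": "NYC"},
    {"id": 3, "name": "Eve", "zipcode": "60601", "city": "Chicago"},
]
violations = [p for blk in block(scope(rows)) for p in iterate(blk) if detect(p)]
fixes = [gen_fix(p) for p in violations]
```

Each function touches only one tuple, one block, or one pair. That is what would let a data processing framework run the steps in parallel.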
![Page 25: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/25.jpg)
![Page 26: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/26.jpg)
![Page 27: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/27.jpg)
![Page 28: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/28.jpg)
![Page 29: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/29.jpg)
BigDansing: Abstraction
Logical Operators
• Scope
• Block
• Iterate
• Detect
• GenFix
Easy to use and enables scalability!
![Page 30: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/30.jpg)
![Page 31: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/31.jpg)
BigDansing: Optimizations
(Figure: BigDansing architecture diagram)
• Shared Scans
• Fast Inequality Joins
• Shared Execution
![Page 32: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/32.jpg)
Optimizations: Fast Inequality Joins
DC: ∀ t1, t2 ∈ D, ¬(t1.Salary > t2.Salary ∧ t1.Rate < t2.Rate)
![Page 33: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/33.jpg)
![Page 34: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/34.jpg)
![Page 35: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/35.jpg)
![Page 36: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/36.jpg)
Optimizations: Fast Inequality Joins
DC: ∀ t1, t2 ∈ D, ¬(t1.Salary > t2.Salary ∧ t1.Rate < t2.Rate)
• Partitioning (divide): partition 1, partition 2, …, partition n, on rate
• Sorting (prepare): sort each partition on rate & salary
• Pruning (reduce): compare partitions by their min-max values
• Joining (execute): join the remaining partitions (partition 2, partition 3, …)
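The four steps can be illustrated with a small single-machine Python sketch. It is a simplification under stated assumptions, not the system's join algorithm. It assumes range partitioning on rate, uses per-partition min-max summaries to skip partition pairs that cannot contain a violation, and falls back to a plain nested loop inside each surviving pair. All function names are hypothetical.

```python
from itertools import product

def violates(t1, t2):
    # The DC forbids a pair with t1.Salary > t2.Salary and t1.Rate < t2.Rate.
    return t1["salary"] > t2["salary"] and t1["rate"] < t2["rate"]

def partition_on_rate(rows, n):
    """Partitioning + sorting: at most n chunks, ordered on rate, then salary."""
    ordered = sorted(rows, key=lambda r: (r["rate"], r["salary"]))
    size = max(1, -(-len(ordered) // n))  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

def min_max(part):
    """Min-max summary of one partition, used by the pruning step."""
    return {
        "min_rate": min(r["rate"] for r in part),
        "max_rate": max(r["rate"] for r in part),
        "min_salary": min(r["salary"] for r in part),
        "max_salary": max(r["salary"] for r in part),
    }

def may_join(s1, s2):
    """Pruning: could some t1 in the first and t2 in the second partition violate the DC?"""
    return s1["max_salary"] > s2["min_salary"] and s1["min_rate"] < s2["max_rate"]

def find_violations(rows, n=2):
    parts = partition_on_rate(rows, n)
    stats = [min_max(p) for p in parts]
    found = []
    for (p1, s1), (p2, s2) in product(zip(parts, stats), repeat=2):
        if may_join(s1, s2):  # joining: only the surviving partition pairs
            found += [(t1["id"], t2["id"]) for t1 in p1 for t2 in p2 if violates(t1, t2)]
    return sorted(found)

rows = [
    {"id": 1, "salary": 50, "rate": 10},
    {"id": 2, "salary": 60, "rate": 12},
    {"id": 3, "salary": 90, "rate": 11},
    {"id": 4, "salary": 80, "rate": 20},
]
result = find_violations(rows)
```

The pruning test is safe because any violating pair forces both inequalities to hold on the summaries. A partition pair that fails the test cannot contain a violation.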
![Page 37: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/37.jpg)
BigDansing: Scalable Repair
(Figure: BigDansing architecture diagram)
![Page 38: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/38.jpg)
![Page 39: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/39.jpg)
![Page 40: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/40.jpg)
![Page 41: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/41.jpg)
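Scalable Repair
Distributed Equivalence Class
FD: A → B

|    | A  | B  |
|----|----|----|
| t1 | a1 | b1 |
| t2 | a1 | b2 |
| t3 | a1 | b1 |
| t4 | a2 | b4 |
| t5 | a2 | b4 |
| t6 | a2 | b3 |

• EQ1: t1, t2, t3; repair b2 → b1
• EQ2: t4, t5, t6; repair b3 → b4

EQ algorithm as a word count problem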
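The word-count analogy can be made concrete with a short Python sketch over the six tuples on this slide. It assumes the repair picks the most frequent B value within each equivalence class. That choice is consistent with the repairs shown (b2 → b1, b3 → b4), but the slide does not state it. The name `repair_fd` is hypothetical.

```python
from collections import Counter

def repair_fd(tuples):
    """tuples maps tuple id -> (A value, B value) for the FD A -> B."""
    # "Map" emits ((A, B), 1) per tuple; "reduce" sums per key, as in word count.
    counts = Counter(tuples.values())
    # Within each equivalence class (same A), keep the most frequent B (assumption).
    best = {}
    for (a, b), n in counts.items():
        if a not in best or n > best[a][1]:
            best[a] = (b, n)
    # Emit one value update (old B, new B) per tuple that disagrees with its class.
    return {tid: (b, best[a][0]) for tid, (a, b) in tuples.items() if b != best[a][0]}

data = {
    "t1": ("a1", "b1"), "t2": ("a1", "b2"), "t3": ("a1", "b1"),
    "t4": ("a2", "b4"), "t5": ("a2", "b4"), "t6": ("a2", "b3"),
}
updates = repair_fd(data)
```

Because the counting step is a per-key sum, it could be distributed the same way a word count is.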
![Page 42: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/42.jpg)
![Page 43: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/43.jpg)
![Page 44: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/44.jpg)
![Page 45: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/45.jpg)
Data Repair as a Black box
• data cells
• data errors
• centralized data repair algorithm
• big connected components?
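The slide gives only labels, so the following Python sketch rests on an assumption about how they fit together. It treats each data error as a link between the data cells it involves, and groups linked cells into connected components, so that a centralized repair algorithm could be run on each component separately. The union-find code and the name `connected_components` are illustrative only.

```python
def connected_components(violations):
    """Group cells that are linked, directly or transitively, by violations."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for cells in violations:
        first = find(cells[0])
        for cell in cells[1:]:
            parent[find(cell)] = first  # merge this cell's component into the first
    groups = {}
    for cell in list(parent):
        groups.setdefault(find(cell), set()).add(cell)
    return list(groups.values())

# Each violation lists the cells (tuple id, attribute) it involves.
violations = [
    (("t1", "B"), ("t2", "B")),
    (("t2", "B"), ("t3", "B")),
    (("t4", "B"), ("t6", "B")),
]
components = connected_components(violations)
```

A very large component would still end up on a single machine. That is presumably the concern behind "big connected components?".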
![Page 46: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/46.jpg)
BigDansing: Portability
(Figure: BigDansing architecture diagram)
![Page 47: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/47.jpg)
![Page 48: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/48.jpg)
![Page 49: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/49.jpg)
Centralized Experiment
DC: ∀ t1, t2 ∈ D, ¬(t1.Salary > t2.Salary ∧ t1.Rate < t2.Rate)
Several orders of magnitude faster!
![Page 50: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/50.jpg)
![Page 51: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/51.jpg)
Parallel Experiment
DC: ∀ t1, t2 ∈ D, ¬(t1.Salary > t2.Salary ∧ t1.Rate < t2.Rate)
Only BigDansing finished!
![Page 52: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/52.jpg)
![Page 53: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/53.jpg)
![Page 54: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/54.jpg)
![Page 55: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/55.jpg)
![Page 56: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/56.jpg)
Summary
(Figure: BigDansing architecture diagram)
• Ease-of-Use
• Scalability
• Portability
![Page 57: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/57.jpg)
Experiments – Parallel FD
● TPCH dataset:
● FD: custkey → custAddress
![Page 58: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/58.jpg)
Experiments – Scalability
● TPCH Dataset:
● FD: custkey → custAddress
● Dataset: 500M rows
![Page 59: BigDansing presentation slides for SIGMOD 2015](https://reader034.fdocuments.net/reader034/viewer/2022051017/55c82c0dbb61ebda138b4598/html5/thumbnails/59.jpg)
Repair Quality for FDs and DC
● Φ6: FD: Zipcode → State
● Φ7: FD: PhoneNumber → Zipcode
● Φ8: FD: ProviderID → City, PhoneNumber
● ΦD: DC: ∀ t1, t2 ∈ D, ¬(t1.Salary > t2.Salary ∧ t1.Rate < t2.Rate)
(Figure: repair quality for Φ6, Φ6 & Φ7, Φ6 - Φ8, and ΦD)