SampleClean: Bringing Data Cleaning into the BDAS Stack
SampleClean: Bringing Data Cleaning into the BDAS Stack
Sanjay Krishnan and Daniel Haas
In Collaboration With: Juan Sanchez, Wenbo Tao, Jiannan Wang, Tim Kraska, Michael Franklin, Tova Milo, Ken Goldberg
Who publishes more?
2
Microsoft Academic Search

Paper Id   Affiliation
16         Computer Science Division--University of California Berkeley CA
101        University of California at Berkeley
102        Department of Physics Stanford University California
116        Lawrence Berkeley National Labs<ref>California</ref>
3
Microsoft Academic Search

University of California at Berkeley
Computer Science Division
University of California at Berkeley
Computer Science Division
University of California at Berkeley
Department of Physics Stanford University California
5
• Data cleaning in BDAS.
  – Problem 1. Scale
  – Problem 2. Latency
• Sampling to cope with scale.
• Asynchrony to cope with latency.

Enter SampleClean
6
Now it’s your turn!

Be the crowd and help us decide
7
Dirty Data is Ubiquitous
8
Example: Missing, incomplete, inconsistent data
Data Cleaning is Hard
12
Costly
Domain-specific
Time consuming
A New Data Cleaning Architecture
14
[Diagram labels: Data, Data Cleaning, Analytics]
Can it Scale?
[Figure labels: Crowd, Machine Learning, Regex, Time]
People are slow and expensive
15
Insight 1: Asynchrony Hides Latency
16
Insight 2: Sampling Hides Scale
[Plot: Query Error versus Time; labels: Data Error, BlinkDB, SampleClean]
19
SampleClean Data Flow
[Diagram labels: Dirty Data, Dirty Sample, Data Cleaning, Clean Sample, Query]
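The data flow on this slide can be sketched in a few lines. This is only an illustration under stated assumptions, not SampleClean's implementation: the records, the alias table inside `clean_record`, the sample size, and all function names are hypothetical.

```python
import random

# Hypothetical dirty data: one institution appears under two spellings.
dirty_data = (
    ["University of California at Berkeley"] * 600
    + ["UC Berkeley"] * 300
    + ["Stanford University"] * 100
)

def draw_sample(records, k, seed=0):
    """Dirty Data -> Dirty Sample: uniform random sample without replacement."""
    rng = random.Random(seed)
    return rng.sample(records, k)

def clean_record(affiliation):
    """Dirty Sample -> Clean Sample: stand-in for a real cleaning operation."""
    aliases = {"UC Berkeley": "University of California at Berkeley"}
    return aliases.get(affiliation, affiliation)

def estimate_fraction(sample, value):
    """Query: estimate the fraction of records equal to `value`."""
    return sum(1 for r in sample if r == value) / len(sample)

dirty_sample = draw_sample(dirty_data, k=200)
clean_sample = [clean_record(r) for r in dirty_sample]

target = "University of California at Berkeley"
dirty_estimate = estimate_fraction(dirty_sample, target)
clean_estimate = estimate_fraction(clean_sample, target)
```

Only the 200 sampled records are cleaned, yet the query on the clean sample accounts for both spellings, while the query on the dirty sample misses the records spelled "UC Berkeley".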
20
SampleClean Data Flow
[Diagram labels: Sampling, Asynchrony, Data Cleaning, Clean Sample, Query]
21
The SampleClean Architecture
[Diagram labels: Data Cleaning Library; Approximate Query Processing; Asynchronous Pipelines; Dirty Sample; Clean Sample; Declare Cleaning Operations; Issue Queries, Get Results]
22
Approximate Query Processing
• Estimate early results and bound with error bars

SampleClean: Fast and Accurate Query Processing on Dirty Data. SIGMOD 2014
BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data. EuroSys 2013
[Plot: Query Error versus Time]
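One common way to attach error bars to an early result is a normal-approximation confidence interval on a sample mean. The sketch below assumes that approach; the slides do not say which estimator SampleClean or BlinkDB uses, and the synthetic population and the name `estimate_mean` are made up for illustration.

```python
import math
import random
import statistics

def estimate_mean(sample, z=1.96):
    """Return (estimate, half_width) for the population mean.

    Uses the normal approximation; z=1.96 gives roughly a 95% interval.
    """
    n = len(sample)
    mean = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)
    return mean, z * std_err

rng = random.Random(42)
# Synthetic population, used here only to have something to sample from.
population = [rng.gauss(50.0, 10.0) for _ in range(100_000)]

# A larger sample costs more to process but yields tighter error bars.
small = estimate_mean(rng.sample(population, 100))
large = estimate_mean(rng.sample(population, 10_000))
```

Each result is a pair of estimate and half-width, so a caller can report "estimate ± half-width" early and tighten it as more of the data is processed.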
24
Crowds and Machines Work Together
• Extensible library of data cleaning tools
• Tools are:
  – Automated
  – Human-powered
  – Hybrid
[Figure labels: Crowd, Machine Learning, Regex, Time]
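As a rough illustration of how automated, human-powered, and hybrid tools might sit side by side in one library, the sketch below uses hypothetical names throughout (`regex_cleaner`, `hybrid_match`, `TOOLS`, and the `ask_human` callback standing in for a crowd task). The thresholds are arbitrary, and none of this is taken from the SampleClean code base.

```python
import re
from difflib import SequenceMatcher

def regex_cleaner(text):
    """Automated tool: strip markup-like debris and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def similarity(a, b):
    """Automated tool: case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def hybrid_match(a, b, ask_human, low=0.3, high=0.9):
    """Hybrid tool: decide automatically when confident, else ask a person."""
    score = similarity(a, b)
    if score >= high:
        return True
    if score <= low:
        return False
    return ask_human(a, b)  # human-powered step, e.g. a crowd task

# A simple registry, so new tools can be added without changing callers.
TOOLS = {"regex": regex_cleaner, "similarity": similarity, "hybrid": hybrid_match}

cleaned = regex_cleaner("Lawrence Berkeley National Labs<ref>California</ref>")
```

The hybrid tool spends human effort only on pairs the automated score cannot settle, which is one way to limit how often the slow, expensive option is used.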
26
Active Learning and Crowds
• Choose informative training points

Not Informative
Are these the same?
Stanford Department of IEOR
UC Berkeley Stats
○ Yes  ○ No

Informative
Are these the same?
Department of Mathematics Stanford University
University of California Berkeley Department of Mathematics
○ Yes  ○ No
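Choosing informative points is often done by uncertainty sampling: ask the crowd about the pairs the current model is least sure of. The sketch below assumes that strategy and uses word-set Jaccard overlap as a crude stand-in for a model's match probability. Both choices are illustrative assumptions, as are the function names.

```python
def token_jaccard(a, b):
    """Stand-in for a model's match probability: Jaccard overlap of word sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def most_informative(pairs, score, k=1):
    """Uncertainty sampling: pick the k pairs whose score is closest to 0.5."""
    return sorted(pairs, key=lambda p: abs(score(*p) - 0.5))[:k]

# The two candidate questions from the slide.
pairs = [
    ("Stanford Department of IEOR", "UC Berkeley Stats"),
    ("Department of Mathematics Stanford University",
     "University of California Berkeley Department of Mathematics"),
]

to_ask = most_informative(pairs, token_jaccard, k=1)
```

With this scoring, the first pair shares no words (score 0), so the answer is already clear, while the second pair scores 4/7 and is the one selected for a crowd worker.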
27
Putting it all together: Asynchronous Pipelines
• Users group data cleaning operations into pipelines
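A minimal sketch of what grouping operations into a pipeline could look like, written with Python's `asyncio`. The `Pipeline` class, the two stages, and the `on_update` callback are all hypothetical. The slides do not describe the actual programming interface, which they list as an open research question.

```python
import asyncio

class Pipeline:
    """Hypothetical pipeline: an ordered group of cleaning operations."""

    def __init__(self, *operations):
        self.operations = operations

    async def run(self, sample, on_update):
        # Apply each operation in turn and publish the intermediate sample,
        # so a query need not wait for the slowest (e.g. crowd) stage.
        for op in self.operations:
            sample = await op(sample)
            on_update(sample)
        return sample

async def trim(sample):
    """Fast automated stage: strip surrounding whitespace."""
    return [s.strip() for s in sample]

async def crowd_dedup(sample):
    """Slow stage; the sleep stands in for waiting on crowd answers."""
    await asyncio.sleep(0.01)
    aliases = {"UC Berkeley": "University of California at Berkeley"}
    return [aliases.get(s, s) for s in sample]

def count_distinct(sample):
    """Query: number of distinct affiliations in the sample."""
    return len(set(sample))

snapshots = []  # query result after each stage completes
pipeline = Pipeline(trim, crowd_dedup)
dirty_sample = [" UC Berkeley", "University of California at Berkeley ", "UC Berkeley"]
final = asyncio.run(
    pipeline.run(dirty_sample, lambda s: snapshots.append(count_distinct(s)))
)
```

Here the query is re-evaluated after every stage, so an early answer (2 distinct affiliations) is available before the slow stage refines it (1).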
30
Great, Now What?
• Prototype implementation complete
• Significant research challenges remain:
  – Crowd worker performance and quality
  – Pipeline semantics and optimization
  – Programming model and interface
• Open source release targeted for next year
32
Summary
• Data Cleaning is slow, costly, and domain-specific
• SampleClean brings data cleaning into the BDAS stack
• SampleClean uses asynchrony to hide latency, and sampling to hide scale
• SampleClean combines Algorithms, Machines, and People, all in one system
33
Asynchrony in Spark
• The Spark abstraction: blocking BSP
• So how do we achieve asynchrony?
  – Multithreaded master
  – Intermediate results materialized in Hive
  – Standalone Finagle HTTP server for crowd work
34
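The idea of a multithreaded master that materializes intermediate results can be illustrated without Spark. In the sketch below, a thread pool plays the master, `blocking_job` stands in for a blocking job, and an in-memory `ResultStore` stands in for the Hive tables. All of these names and the delays are assumptions made for the example, and the Finagle crowd server is not modeled.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

class ResultStore:
    """In-memory stand-in for materialized intermediate results."""

    def __init__(self):
        self._lock = threading.Lock()
        self._tables = {}

    def write(self, name, rows):
        with self._lock:
            self._tables[name] = list(rows)

    def read(self, name):
        with self._lock:
            return list(self._tables.get(name, []))

def blocking_job(store, name, rows, delay):
    """Stand-in for a blocking job: it returns only once it has finished."""
    time.sleep(delay)
    store.write(name, rows)
    return name

store = ResultStore()
with ThreadPoolExecutor(max_workers=2) as master:
    # The master launches both jobs without waiting for either one.
    slow = master.submit(blocking_job, store, "crowd_cleaned", ["a"], 0.05)
    fast = master.submit(blocking_job, store, "regex_cleaned", ["a", "a "], 0.0)
    fast.result()
    # The fast job's output can be read without waiting for the slow job.
    early_view = store.read("regex_cleaned")
    slow.result()

late_view = store.read("crowd_cleaned")
```

Each job still blocks its own thread, but because jobs run on separate threads and write their output to a shared store, a reader is never forced to wait for the slowest one.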