SampleClean: Bringing Data Cleaning into the BDAS Stack

Sanjay Krishnan and Daniel Haas. In collaboration with: Juan Sanchez, Wenbo Tao, Jiannan Wang, Tim Kraska, Michael Franklin, Tova Milo, Ken Goldberg

"SampleClean: Bringing Data Cleaning into the BDAS Stack" presentation at AMPCamp 5 by Sanjay Krishnan and Daniel Haas

Transcript of SampleClean: Bringing Data Cleaning into the BDAS Stack

Page 1: SampleClean: Bringing Data Cleaning into the BDAS Stack

SampleClean: Bringing Data Cleaning into the BDAS Stack

Sanjay Krishnan and Daniel Haas

In Collaboration With: Juan Sanchez, Wenbo Tao, Jiannan Wang, Tim Kraska, Michael Franklin, Tova Milo, Ken Goldberg

Page 2: SampleClean: Bringing Data Cleaning into the BDAS Stack

Who publishes more?

Page 3: SampleClean: Bringing Data Cleaning into the BDAS Stack

Microsoft Academic Search

Paper Id    Affiliation
16          Computer Science Division--University of California Berkeley CA
101         University of California at Berkeley
102         Department of Physics Stanford University California
116         Lawrence Berkeley National Labs <ref>California</ref>

Page 5: SampleClean: Bringing Data Cleaning into the BDAS Stack

Microsoft Academic Search

University of California at Berkeley
Computer Science Division
University of California at Berkeley
Computer Science Division
University of California at Berkeley

Department of Physics Stanford University California

Page 6: SampleClean: Bringing Data Cleaning into the BDAS Stack

Enter SampleClean

•  Data cleaning in BDAS.
   – Problem 1. Scale
   – Problem 2. Latency

•  Sampling to cope with scale.
•  Asynchrony to cope with latency.

Page 7: SampleClean: Bringing Data Cleaning into the BDAS Stack

Now it's your turn!

Be the crowd and help us decide

Page 8: SampleClean: Bringing Data Cleaning into the BDAS Stack

Dirty Data is Ubiquitous

Example: Missing, incomplete, inconsistent data

Page 12: SampleClean: Bringing Data Cleaning into the BDAS Stack

Data Cleaning is Hard

Time consuming

Costly

Domain-specific

Page 13: SampleClean: Bringing Data Cleaning into the BDAS Stack

A New Data Cleaning Architecture

[Diagram labels: Data, Data Cleaning, Analytics]

Page 15: SampleClean: Bringing Data Cleaning into the BDAS Stack

Can it Scale?

[Figure labels: Crowd, Machine Learning, Regex; axis: Time]

People are slow and expensive

Page 16: SampleClean: Bringing Data Cleaning into the BDAS Stack

Insight 1: Asynchrony Hides Latency

Page 19: SampleClean: Bringing Data Cleaning into the BDAS Stack

Insight 2: Sampling Hides Scale

[Figure: axes Query Error, Data Error, and Time; systems shown: BlinkDB, SampleClean]

Page 20: SampleClean: Bringing Data Cleaning into the BDAS Stack

SampleClean Data Flow

[Diagram: Dirty Data -> Dirty Sample -> Clean Sample -> Query. Stage labels: Sampling, Data Cleaning, Asynchrony]
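
The flow on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the `clean_affiliation` rule, and the sample size are hypothetical and are not taken from SampleClean itself.

```python
import random

# Hypothetical dirty data: (paper_id, affiliation) records like those on the earlier slides.
dirty_data = [
    (16, "Computer Science Division--University of California Berkeley CA"),
    (101, "University of California at Berkeley"),
    (102, "Department of Physics Stanford University California"),
    (116, "Lawrence Berkeley National Labs <ref>California</ref>"),
] * 250  # 1000 records in total


def clean_affiliation(raw):
    """Hypothetical cleaning rule: map a raw string to a canonical institution."""
    text = raw.lower()
    if "stanford" in text:
        return "Stanford University"
    if "lawrence berkeley" in text:
        return "Lawrence Berkeley National Laboratory"
    if "berkeley" in text:
        return "UC Berkeley"
    return raw


def estimate_count(data, target, sample_size, seed=0):
    """Sample the dirty data, clean only the sample, and scale the count up."""
    rng = random.Random(seed)
    dirty_sample = rng.sample(data, sample_size)                 # sampling
    clean_sample = [clean_affiliation(a) for _, a in dirty_sample]  # data cleaning
    hits = sum(1 for a in clean_sample if a == target)           # query on the clean sample
    return hits * len(data) / sample_size


estimate = estimate_count(dirty_data, "UC Berkeley", sample_size=200)
print(f"Estimated UC Berkeley papers: {estimate:.0f} "
      "(the same rule applied to all 1000 records gives 500)")
```

Only the 200 sampled records are cleaned here, which is the point of the design: cleaning cost depends on the sample size rather than on the size of the dirty data.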

Page 22: SampleClean: Bringing Data Cleaning into the BDAS Stack

The SampleClean Architecture

[Diagram: Dirty Sample -> Clean Sample. Components: Data Cleaning Library, Asynchronous Pipelines, Approximate Query Processing. User actions: Declare Cleaning Operations; Issue Queries, Get Results]

Page 24: SampleClean: Bringing Data Cleaning into the BDAS Stack

Approximate Query Processing

•  Estimate early results and bound with error bars

[Figure: Query Error vs. Time]

SampleClean: Fast and Accurate Query Processing on Dirty Data. SIGMOD 2014
BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data. EuroSys 2013
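
As a rough illustration of "estimate early results and bound with error bars", the sketch below computes a sample mean with a normal-approximation 95% confidence interval at increasing sample sizes. The synthetic data and the choice of a CLT-based interval are assumptions made for this example; the cited papers describe the estimators actually used.

```python
import math
import random
import statistics


def mean_with_error_bar(sample, z=1.96):
    """Return (estimate, half_width) for the mean using a normal approximation."""
    n = len(sample)
    estimate = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)
    return estimate, z * std_err


rng = random.Random(42)
# Synthetic stand-in for one numeric column of a large table.
population = [rng.expovariate(1 / 20) for _ in range(100_000)]

for n in (100, 1_000, 10_000):
    sample = rng.sample(population, n)
    est, half_width = mean_with_error_bar(sample)
    print(f"n={n:>6}: mean ~ {est:.2f} +/- {half_width:.2f}")
```

The half-width shrinks roughly in proportion to 1/sqrt(n), so a small sample already gives a usable answer and more data tightens it over time.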

Page 26: SampleClean: Bringing Data Cleaning into the BDAS Stack

Crowds and Machines Work Together

•  Extensible library of data cleaning tools
•  Tools are:
   – Automated
   – Human-powered
   – Hybrid

[Figure labels: Crowd, Machine Learning, Regex; axis: Time]
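
One way to picture a hybrid tool is an automated rule that handles the easy records and defers the rest to people. The sketch below is an illustration only: the regex rules, the canonical names, and the `crowd_queue` list standing in for a crowd task queue are hypothetical and do not reflect the SampleClean library's interface.

```python
import re

# Hypothetical automated rules: regex pattern -> canonical affiliation.
RULES = [
    (re.compile(r"\buniversity of california\b.*\bberkeley\b", re.I), "UC Berkeley"),
    (re.compile(r"\bstanford\b", re.I), "Stanford University"),
]


def hybrid_clean(records):
    """Apply regex rules first; queue anything unmatched for human review."""
    cleaned, crowd_queue = {}, []
    for paper_id, raw in records:
        # Strip markup debris such as <ref>...</ref> tags before matching.
        text = re.sub(r"<[^>]+>", " ", raw)
        for pattern, canonical in RULES:
            if pattern.search(text):
                cleaned[paper_id] = canonical
                break
        else:
            crowd_queue.append((paper_id, raw))  # no rule fired: ask the crowd
    return cleaned, crowd_queue


records = [
    (16, "Computer Science Division--University of California Berkeley CA"),
    (101, "University of California at Berkeley"),
    (102, "Department of Physics Stanford University California"),
    (116, "Lawrence Berkeley National Labs <ref>California</ref>"),
]
cleaned, crowd_queue = hybrid_clean(records)
print(cleaned)
print(crowd_queue)
```

With these rules, records 16, 101, and 102 are resolved automatically and only record 116 is left for human review, so crowd effort is spent on the cases the rules cannot decide.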

Page 27: SampleClean: Bringing Data Cleaning into the BDAS Stack

Active Learning and Crowds

•  Choose informative training points

Not Informative
Are these the same?
Stanford Department of IEOR
UC Berkeley Stats
( ) Yes  ( ) No

Informative
Are these the same?
Department of Mathematics Stanford University
University of California Berkeley Department of Mathematics
( ) Yes  ( ) No
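
A common way to pick informative pairs is uncertainty sampling: ask the crowd about the pairs the current model is least sure of. The sketch below uses token-level Jaccard similarity as a stand-in for a learned match probability, which is an assumption for illustration and not necessarily how SampleClean scores pairs.

```python
def jaccard(a, b):
    """Token-set Jaccard similarity, used here as a crude match score in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)


def most_informative(pairs, k=1):
    """Return the k pairs whose score is closest to 0.5 (the most uncertain ones)."""
    return sorted(pairs, key=lambda p: abs(jaccard(*p) - 0.5))[:k]


pairs = [
    ("Stanford Department of IEOR", "UC Berkeley Stats"),
    ("Department of Mathematics Stanford University",
     "University of California Berkeley Department of Mathematics"),
]

for a, b in pairs:
    print(f"{jaccard(a, b):.2f}  {a!r} vs {b!r}")
print("Ask the crowd about:", most_informative(pairs))
```

The first pair shares no tokens, so its score is 0 and a label would add little. The second pair scores about 0.57 even though the two departments belong to different universities, which places it near the decision boundary and makes a human label worth paying for.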

Page 30: SampleClean: Bringing Data Cleaning into the BDAS Stack

Putting it all together: Asynchronous Pipelines

•  Users group data cleaning operations into pipelines
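
To make the idea concrete, here is a minimal sketch of grouping cleaning operations into a pipeline that runs in the background while the caller keeps working. The `Pipeline` class, its `run_async` method, and the two operations are hypothetical names invented for this example; they are not the SampleClean programming interface.

```python
import re
from concurrent.futures import ThreadPoolExecutor


class Pipeline:
    """Hypothetical pipeline: an ordered list of record -> record cleaning steps."""

    def __init__(self, *steps):
        self.steps = steps

    def run(self, records):
        for step in self.steps:
            records = [step(r) for r in records]
        return records

    def run_async(self, executor, records):
        # Returns a Future immediately; cleaning proceeds on a worker thread.
        return executor.submit(self.run, records)


def strip_markup(text):
    """Replace tags such as <ref> and </ref> with spaces."""
    return re.sub(r"<[^>]+>", " ", text)


def normalize_whitespace(text):
    """Collapse runs of whitespace and trim the ends."""
    return " ".join(text.split())


pipeline = Pipeline(strip_markup, normalize_whitespace)
dirty = ["Lawrence Berkeley National Labs <ref>California</ref>",
         "University  of California   at Berkeley"]

with ThreadPoolExecutor(max_workers=2) as executor:
    future = pipeline.run_async(executor, dirty)
    # The caller is free to do other work here while cleaning runs.
    clean = future.result()

print(clean)
```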

Page 32: SampleClean: Bringing Data Cleaning into the BDAS Stack

Great, Now What?

•  Prototype implementation complete
•  Significant research challenges remain:
   – Crowd worker performance and quality
   – Pipeline semantics and optimization
   – Programming model and interface
•  Open source release targeted for next year

Page 33: SampleClean: Bringing Data Cleaning into the BDAS Stack

Summary

•  Data Cleaning is slow, costly, and domain-specific
•  SampleClean brings data cleaning into the BDAS stack
•  SampleClean uses asynchrony to hide latency, and sampling to hide scale
•  SampleClean combines Algorithms, Machines, and People, all in one system

Page 34: SampleClean: Bringing Data Cleaning into the BDAS Stack

Asynchrony in Spark

•  The Spark abstraction: blocking BSP
•  So how do we achieve asynchrony?
   – Multithreaded master
   – Intermediate results materialized in Hive
   – Standalone Finagle HTTP server for crowd work
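
The bullets above suggest a pattern in which slow crowd work runs on its own thread and publishes intermediate results that queries can read at any time. The sketch below imitates that pattern with plain Python threads. The in-memory dictionary stands in for tables materialized in Hive, and `crowd_worker` stands in for answers arriving from the crowd server; both are assumptions for illustration, and none of this is Spark, Hive, or Finagle code.

```python
import threading
import time

materialized = {}          # stand-in for intermediate results stored in Hive
lock = threading.Lock()


def crowd_worker(tasks, delay=0.01):
    """Simulate slow crowd answers arriving one at a time."""
    for record_id, answer in tasks:
        time.sleep(delay)                  # people are slow
        with lock:
            materialized[record_id] = answer


def query_progress(total):
    """Read whatever has been cleaned so far, without waiting for the crowd."""
    with lock:
        done = len(materialized)
    return done / total


tasks = [(i, f"cleaned-{i}") for i in range(20)]
worker = threading.Thread(target=crowd_worker, args=(tasks,))
worker.start()

early = query_progress(len(tasks))   # answered immediately, possibly from partial results
worker.join()
final = query_progress(len(tasks))   # all crowd answers have been materialized

print(f"early progress: {early:.0%}, final progress: {final:.0%}")
```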