Synthetic Data Generation for Realistic Analytics Examples and Testing

Ronald J. Nowling
Red Hat, Inc.
[email protected]
http://rnowling.github.io/


Who Am I?

•  Software Engineer at Red Hat
•  Data Science Team, Emerging Technologies
   – Evaluate open-source Big Data space
   – Ensure software works for Red Hat customers
   – Promote data science internally through consulting projects
•  Apache BigTop PMC


Synthetic Data

•  No licensing, privacy, or intellectual property concerns
•  Scalable: Laptops to Clusters!
•  More reliable than external data sets
•  Enable more realistic example applications
•  Enable more comprehensive testing than wordcount and TeraSort


Data Transformation and Summarization Pipeline

[Diagram, built up over several slides. Stages shown: Raw Daily Page Views, Transform Raw Text, Parse, Clean & Validate (three parallel copies of each), Accounts, Summarize (three copies), Aggregate. Outputs shown: DailyActivity, CumulativeActivity.]


Synthetic Data

•  Sensitive Data
   – Real data on cluster for scalability testing and validation
   – Synthetic data for local development and testing
•  Needed smaller data sets for checking calculations
   – Checking total aggregation results requires re-running the old pipeline
   – Extra burden on operations team
   – Delay for development team


Validation with Synthetic Data

[Diagram, built up over several slides. Components shown: DataGenerator, Accounts, Raw Daily Page Views, Expected Daily Activity, Expected Cumulative Activity, Transformation and Summarization Pipeline, Daily Activity, Cumulative Activity, ValidationScript.]
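
The validation setup in the diagram can be illustrated with a small sketch. This is not the code from the talk: the function names (`generate`, `pipeline`, `validate`), the tab-separated record layout, and the per-account view counts are all assumptions chosen for illustration. The idea it shows is that a seeded generator emits both the raw inputs and the expected outputs, so a validation script only has to compare what the pipeline produced against what the generator predicted.

```python
import random
from collections import Counter


def generate(n_accounts, n_days, seed=0):
    """Hypothetical DataGenerator: returns inputs plus the expected outputs."""
    rng = random.Random(seed)  # seeded, so every run is reproducible
    accounts = [f"acct{i}" for i in range(n_accounts)]
    raw_lines = []              # stand-in for Raw Daily Page Views
    expected_daily = Counter()  # (day, account) -> page views
    for day in range(n_days):
        for account in accounts:
            views = rng.randint(0, 5)
            for _ in range(views):
                raw_lines.append(f"{day}\t{account}\t/page/{rng.randint(1, 9)}")
            if views:
                expected_daily[(day, account)] = views
    expected_cumulative = Counter()
    for (_, account), views in expected_daily.items():
        expected_cumulative[account] += views
    return accounts, raw_lines, expected_daily, expected_cumulative


def pipeline(accounts, raw_lines):
    """Toy pipeline under test: parse, validate, summarize, aggregate."""
    known = set(accounts)
    daily = Counter()
    for line in raw_lines:
        day, account, _page = line.split("\t")   # parse
        if account in known:                     # clean & validate
            daily[(int(day), account)] += 1      # summarize per day
    cumulative = Counter()
    for (_, account), views in daily.items():    # aggregate over days
        cumulative[account] += views
    return daily, cumulative


def validate(n_accounts=20, n_days=7, seed=42):
    """Hypothetical ValidationScript: compare pipeline output to expectations."""
    accounts, raw, exp_daily, exp_cum = generate(n_accounts, n_days, seed)
    daily, cumulative = pipeline(accounts, raw)
    return daily == exp_daily and cumulative == exp_cum
```

Because the data set size is a parameter, the same check can run on a laptop with a handful of accounts or on a cluster with many more.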


Issues Tackled

•  Error in account validation introduced while refactoring code
•  Usage of the correct join types
•  Validation of date-time operations
•  Correct Output Formats
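
As one illustration of the join-type issue, the sketch below contrasts an inner join with a left join on a tiny synthetic data set. The account names, the helper functions, and the assumption that accounts without page views should still appear in the output (with a count of 0) are all hypothetical; the talk does not say which join the real pipeline needed. The point is that synthetic data can deliberately include the edge case, here an account with no activity, that exposes the wrong join.

```python
def inner_join(accounts, activity):
    """Keeps only accounts that have at least one activity row."""
    return {a: activity[a] for a in accounts if a in activity}


def left_join(accounts, activity):
    """Keeps every account; accounts without activity get 0 views."""
    return {a: activity.get(a, 0) for a in accounts}


accounts = ["alice", "bob", "carol"]   # hypothetical synthetic accounts
activity = {"alice": 3, "carol": 1}    # bob deliberately has no page views
expected = {"alice": 3, "bob": 0, "carol": 1}

inner_result = inner_join(accounts, activity)  # silently drops bob
left_result = left_join(accounts, activity)    # matches the expected output
```

Comparing each result against `expected` flags the inner join immediately, whereas a real data set may contain too few inactive accounts for the discrepancy to be noticed.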


Apache BigTop BigPetStore Blueprints

•  Problem domain: Transactions for a fictional chain of pet stores

•  BigPetStore Data Generator simulates customer purchasing behavior to generate realistic transaction data

•  Blueprints for big data ecosystem
   – Hadoop: MapReduce / Pig / Hive / Mahout
   – Spark
   – Flink (in progress)
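
To give a feel for what "simulating purchasing behavior" can mean, here is a deliberately simple toy. It is not the BigPetStore Data Generator and does not reproduce its model: the product list, the mean restocking intervals, and the exponential waiting times are all made-up assumptions. Each simulated customer repurchases each product at random intervals, and the seeded run yields a reproducible list of transactions.

```python
import random

# Hypothetical products with made-up mean days between repurchases.
PRODUCTS = {"dog food": 7, "cat litter": 14, "fish flakes": 30}


def simulate_customer(customer_id, n_days, rng):
    """Return sorted (day, customer_id, product) rows for one customer."""
    transactions = []
    for product, mean_interval in PRODUCTS.items():
        # Exponential waiting times: expovariate takes the rate, 1 / mean.
        day = rng.expovariate(1.0 / mean_interval)
        while day < n_days:
            transactions.append((int(day), customer_id, product))
            day += rng.expovariate(1.0 / mean_interval)
    return sorted(transactions)


def simulate(n_customers, n_days, seed=0):
    """Simulate all customers with one seeded random stream."""
    rng = random.Random(seed)
    rows = []
    for customer_id in range(n_customers):
        rows.extend(simulate_customer(customer_id, n_days, rng))
    return rows
```

Calling `simulate(5, 60, seed=1)` twice returns identical rows, which is what makes generated data usable as a test fixture.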


BigPetStore

[Diagram, built up over several slides. Components shown: HCFS, Core (RDDs), Spark SQL, MLlib.]


Team Cluster

•  ~10 nodes
•  40 cores, 400 GB RAM per node


Potential Issues

•  Infrastructure
•  Storage
•  Software Installation
•  Software Upgrades
•  Spark Configuration Tuning
•  User Management


Real Stories

•  Creating a new user
   – User Gluster permissions incorrect
•  Cluster upgrade
   – Spark upgrade didn’t take because of issue with Ansible role configuration
   – Wiped out our spark.conf: master / mesos settings wrong
•  Gluster mount points disappeared on reboot
   – Not set in fstab
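
The last story suggests a cheap pre-flight check that could run before a blueprint job. The sketch below is an assumption, not something shown in the talk: it parses fstab-formatted text and reports which required mount points are not declared. The sample entries, the server name `server1`, the volume `gv0`, and the path `/mnt/gluster` are hypothetical.

```python
def mounts_in_fstab(fstab_text):
    """Return the set of mount points declared in fstab-formatted text."""
    mount_points = set()
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) >= 2:
            mount_points.add(fields[1])  # second field is the mount point
    return mount_points


def missing_mounts(fstab_text, required):
    """Return the required mount points that fstab does not declare."""
    return sorted(set(required) - mounts_in_fstab(fstab_text))


# Hypothetical fstab contents used only for illustration.
SAMPLE = """
# device            mount point   type       options           dump pass
/dev/sda1           /             ext4       defaults          0 1
server1:/gv0        /mnt/gluster  glusterfs  defaults,_netdev  0 0
"""
```

On a real node the text would come from reading `/etc/fstab`; an empty result from `missing_mounts` means every required mount point is declared and should come back after a reboot.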


k8petstore

[Diagram, built up over several slides. Components shown: Users, Public IP Proxy, Web Application, Redis Master, three Redis Slaves, three BPS DataGenerators.]


Use Cases

•  Configuration
•  Scalability
•  Fault Tolerance


k8petstore

•  OpenContrail networking solution demo [1]
•  Kubernetes JuJu Charm documentation example [2]
•  Kubernetes v1.0 launch talk at OSCON [3]

[1] - https://pedrormarques.wordpress.com/2015/04/24/kubernetes-and-opencontrail/
[2] - http://kubernetes.io/v1.0/docs/getting-started-guides/juju.html
[3] - http://www.oscon.com/open-source-2015/public/schedule/detail/45281


APACHE BIGTOP DATA GENERATORS


BigPetStore


BigTop Weatherman


BigTop Bazaar


Vision

•  Encourage synthetic data generation for testing and realistic examples

•  Serve as a resource for the larger Apache and open source communities

•  Emphasis on
   – Flexibility
   – Scalability
   – Realism

•  We look forward to collaborating and getting folks involved!


Conclusion

•  Synthetic data generators and blueprints are useful!

•  Case studies:
   – Data Processing Pipelines
   – Cluster Deployment
   – Kubernetes

•  BigPetStore and BigTop Data Generators efforts in Apache BigTop

•  Open invitation to get involved and collaborate


Resources

http://bigtop.apache.org/

http://github.com/apache/bigtop

http://rnowling.github.io/


QUESTIONS
