Mining Public Datasets - Schedschd.ws/hosted_files/apachebigdata2016/12/Mining Public...

Mining Public Datasets using Apache Zeppelin (incubating), Apache Spark and Juju, by Alexander Bezzubov, NFLabs, for ApacheCon ’16 NA


Page 1:

Mining Public Datasets using Apache Zeppelin (incubating), Apache Spark and Juju

by Alexander Bezzubov, NFLabs, for ApacheCon ’16 NA

Page 2:

Software Engineer at NFLabs, Seoul, South Korea

Co-organizer of SeoulTech Society

Committer and PPMC member of Apache Zeppelin (Incubating)

Graduated in Mathematics from St. Petersburg State University, Russia

@seoul_engineer

github.com/bzz

Alexander Bezzubov

Page 3:

PUBLIC DATASETS: Number, Size & Growth

• Web crawls
• Structured data (RDF, microformats, tables)
• Hacker News, Reddit, Twitter, StackOverflow, Wikipedia
• Reviews (movies, restaurants, beer, wine)
• Emails (Enron, ASF public ML archives)
• Census data (US, UK, UN, Japan, etc.)
• Transportation (taxi, flights, bicycles)
• Climate
• Genome

Page 6:

PUBLIC DATASETS: Number, Size & Growth

• Web crawls
• Structured data (RDF, microformats, tables)
• Hacker News, Reddit, Twitter, StackOverflow, Wikipedia
• Reviews (movies, restaurants, beer, wine)
• Emails (Enron, ASF public ML archives)
• Census data (US, UK, UN, Japan, etc.)
• Transportation (taxi, flights, bicycles)
• Genome

order of TBs

• AWS Public Datasets: https://aws.amazon.com/public-data-sets/
• Yahoo Webscope: https://webscope.sandbox.yahoo.com/
• Stanford Network Analysis Project: http://snap.stanford.edu/data/
• Physics research: http://opendata.cern.ch

order of PBs

Page 7:

PUBLIC DATA = OPPORTUNITY

Page 8:

I. Tools

II. Data

Page 9:

TOOL TO PURSUE THE OPPORTUNITY:

Overview of the Big Data ecosystem

Page 11:

TOOL TO PURSUE THE OPPORTUNITY:

Today's choice: Zeppelin, Spark, Juju

• Apache Spark: Scala, Python, R
• Apache Zeppelin: modern web GUI, plays nicely with Spark, Flink, Elasticsearch, etc.
• Warcbase: Spark library for saved crawl data (WARC)
• Juju: scales, integration with Spark, Zeppelin, AWS, GCE

Page 12:

APACHE ZEPPELIN: Overview

Page 13:

Zeppelin: Brief history

• 12.2012: Commercial app using AMP Lab Shark 0.5
• 08.2013: NFLabs internal project, Hive/Shark
• 10.2013: Prototype, Hive/Shark
• 12.2014: Enters ASF Incubation
• 01.2016: 3 major releases
• 05.2016: Graduation vote passed

http://zeppelin.incubator.apache.org

Page 15:

Interactive Visualization

Page 16:

APACHE SPARK

From Berkeley AMPLab, since 2010

An Apache project since 2014

1000+ contributors

REPL + Java, Scala, Python, R APIs

http://spark.apache.org

Page 17:

JUJU

Service modelling at scale

Deployment and configuration automation
+ Integration with Spark, Zeppelin, Ganglia, etc.
+ AWS, GCE, Azure, LXC, etc.

https://jujucharms.com/

Page 18:

JUJU

$ apt-get install juju-core juju-quickstart
# or
$ brew install juju juju-quickstart
$ juju generate-config  # LXC, AWS, GCE, Azure, VMWare, OpenStack

$ juju bootstrap
$ juju quickstart apache-hadoop-spark-zeppelin
$ juju expose spark zeppelin
$ juju add-unit -n4 slave

http://bigdata.juju.solutions/getstarted

Page 19:

JUJU

http://bigdata.juju.solutions/getstarted

7 node cluster designed to scale out

Page 20:

APPROACH: local, small cluster, big cluster

• Local: prototype on your laptop (1 core)
• Small cluster: estimate the cost on AWS spot instances (10s of PCs)
• Big cluster: scale out with deployment automation (1000 instances)

Page 21:

I. Tools

II. Data

Page 22:

DATA: GitHub

• 300 GB compressed
• Collaboration between Google and GitHub engineers
• Events on PRs, repos, issues, comments, etc. in JSON

http://githubarchive.org

Page 23:

http://www.commitlogsfromlastnight.com/

Page 24:

http://sideeffect.kr/popularconvention/

Page 25:

https://www.gitlive.net/

Page 26:

http://zoom.it/kCsU

Page 27:

DATA PRODUCT: Get notified when a project goes open source

Page 28:

DATA PRODUCT: Exploration

Page 29:

DATA PRODUCT: Sketch

We are going to build a Notebook that sends you a digest email:

Page 30:

DATA PRODUCT: pieces (flow-chart)

We are going to build a Notebook that:
• Downloads the latest data from GitHub Archive
• Reads and explores the dataset
• Imports and filters the PublicEvent records
• Joins the logs with more data from GitHub API calls
• Shows an HTML template to visualise the list
• Sends email notifications
• Does all of the above automatically, once a day
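The filtering step above can be sketched in plain Python. This is a minimal illustration, not the notebook's actual code (which the slides do not show). It assumes the archive files hold one JSON event per line, and that each event carries a `type` field and a `repo.name` field. The helper names `public_events`, `repo_names` and `read_archive` are hypothetical.

```python
import gzip
import json
from typing import Iterable, Iterator, List


def read_archive(path: str) -> Iterator[str]:
    """Yield text lines from a gzip-compressed, line-delimited JSON file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from handle


def public_events(lines: Iterable[str]) -> Iterator[dict]:
    """Keep only events whose type is PublicEvent (a repository made public)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            # Skip malformed records instead of failing the whole batch.
            continue
        if isinstance(event, dict) and event.get("type") == "PublicEvent":
            yield event


def repo_names(events: Iterable[dict]) -> List[str]:
    """Collect distinct repository names, preserving first-seen order."""
    names: List[str] = []
    for event in events:
        name = (event.get("repo") or {}).get("name")
        if name and name not in names:
            names.append(name)
    return names
```

Given a downloaded hourly file, `repo_names(public_events(read_archive(path)))` would return the repositories that went public in that hour. At full dataset scale the same filter would more likely be expressed as a Spark job.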

Page 31:

DATA PRODUCT: Full impl

Page 32:

I. Tools

II. Data

Page 33:

DATA: Common Crawl

https://commoncrawl.org

Nonprofit, by Factual

On AWS S3 in WARC and WAT formats

Since 2013, monthly: ~150 TB compressed, 2+ billion URLs

Page 34:

URL Index by Ilya Kreymer of @webrecorder_io http://index.commoncrawl.org/
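The index can also be queried over HTTP. The sketch below is hedged: it assumes the index server exposes a CDX-style endpoint of the form `/<collection>-index?url=...&output=json` and answers with one JSON object per line. The collection id `CC-MAIN-2016-07` and the record fields `url`, `filename`, `offset` and `length` are assumptions to verify against the index page itself. The helper names are hypothetical.

```python
import json
from typing import List
from urllib.parse import urlencode

INDEX_HOST = "http://index.commoncrawl.org"


def index_query_url(collection: str, url_pattern: str, limit: int = 10) -> str:
    """Build a query URL for one crawl collection (assumed CDX-style API)."""
    params = urlencode({"url": url_pattern, "output": "json", "limit": limit})
    return f"{INDEX_HOST}/{collection}-index?{params}"


def parse_index_response(body: str) -> List[dict]:
    """Parse a line-delimited JSON response into a list of records."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]
```

Each record is expected to point at a WARC file plus a byte offset and length, so a single captured page can be fetched with an HTTP range request without downloading the whole ~1 GB file.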

Page 36:

https://about.commonsearch.org

Page 37:

DATA: CommonCrawl - Data Product

Objective: estimate % of pages/domains that use Google Analytics/Facebook

Measuring the impact of Google Analytics

Existing research from 2013
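One way to make the objective concrete is a simple string heuristic over page HTML. The slides do not show the detection rule that was actually used, so the pattern below (references to the `ga.js`, `analytics.js` or `urchin.js` scripts on `google-analytics.com`) is an assumption, and the function names are hypothetical. The sketch reports both a page-level and a domain-level share.

```python
import re
from typing import Dict, Iterable, Tuple
from urllib.parse import urlparse

# Assumed heuristic: a page "uses Google Analytics" if it references one of
# the well-known tracking scripts hosted on google-analytics.com.
GA_PATTERN = re.compile(
    r"google-analytics\.com/(?:ga|analytics|urchin)\.js", re.IGNORECASE
)


def uses_google_analytics(html: str) -> bool:
    """Return True if the HTML references a Google Analytics script."""
    return GA_PATTERN.search(html) is not None


def ga_share(pages: Iterable[Tuple[str, str]]) -> Tuple[float, float]:
    """Return (share of pages, share of domains) with a GA reference.

    `pages` is an iterable of (url, html) pairs. A domain counts as using
    GA if at least one of its pages does.
    """
    total_pages = 0
    ga_pages = 0
    domains: Dict[str, bool] = {}
    for url, html in pages:
        hit = uses_google_analytics(html)
        total_pages += 1
        ga_pages += int(hit)
        host = urlparse(url).hostname or ""
        domains[host] = domains.get(host, False) or hit
    if total_pages == 0:
        return 0.0, 0.0
    return ga_pages / total_pages, sum(domains.values()) / len(domains)
```

A string match like this misses trackers loaded indirectly (for example through a tag manager), so it gives a lower bound rather than an exact figure.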

Page 38:

DATA: CommonCrawl - Data Product

Measuring the impact of Google Analytics

Copy to HDFS vs read from S3

Verify using grep:

hadoop jar hadoop-examples.jar grep /grep-data/ \
    /grep-output/ '[Bb]ig [Dd]ata is ([a-zA-Z]{5,})'
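For a quick local check of what the regular expression in the grep verification matches, the same pattern can be run with Python's `re` module on a small sample. This is a rough sketch, not a replacement for the Hadoop job. It counts the captured word after "Big Data is" and lower-cases it, which is a choice made here for illustration and not something the slides specify.

```python
import re
from collections import Counter
from typing import Iterable

# Same pattern as in the hadoop grep example: the group captures a word of
# at least five letters that follows "Big Data is".
PATTERN = re.compile(r"[Bb]ig [Dd]ata is ([a-zA-Z]{5,})")


def count_completions(lines: Iterable[str]) -> Counter:
    """Count the captured words, case-insensitively, across all lines."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(m.group(1).lower() for m in PATTERN.finditer(line))
    return counts
```

Running it on a few hand-picked lines before launching the cluster job is a cheap way to confirm the pattern behaves as intended.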

Page 39:

DATA: CommonCrawl - Data Product

Feb 2016 Crawl:
- 48 TB compressed
- 100 segments (dir on S3)
- 30,000 files, ~1 GB each
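Numbers like these make a back-of-the-envelope estimate possible before scaling out. The sketch below is only an illustration. The file count and file size come from the slide, while the per-instance read throughput, the instance count and the hourly price are hypothetical inputs that would have to be measured on a small cluster first. It also assumes the job is purely IO-bound and that instances are billed per started hour.

```python
import math
from typing import Dict


def estimate_scan(
    files: int,
    gb_per_file: float,
    instances: int,
    mb_per_sec_per_instance: float,
    usd_per_instance_hour: float,
) -> Dict[str, float]:
    """Estimate wall-clock hours and cost of one full pass over the files.

    Assumes throughput scales linearly with the number of instances and
    that every instance is billed for each started hour.
    """
    total_gb = files * gb_per_file
    cluster_mb_per_sec = instances * mb_per_sec_per_instance
    hours = total_gb * 1024 / cluster_mb_per_sec / 3600
    billed_hours = math.ceil(hours)
    cost = billed_hours * instances * usd_per_instance_hour
    return {"total_gb": total_gb, "hours": hours, "cost_usd": cost}
```

For example, `estimate_scan(30000, 1.0, 100, 50.0, 0.10)` evaluates the slide's 30,000 files of ~1 GB against 100 instances with made-up throughput and price figures. Rerunning it with measured values is what turns the prototype into a cost estimate.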

Page 40:

DATA: CommonCrawl - Data Product

AWS optimisations:
- pick spot instance prices
- pick instance type (net throughput)
- use Juju instead of EMR (2x $$ savings!)

Spark optimisations:
- IO-bound, so increase spark.executor.cores and spark.executor.memory

Page 41:

DATA: CommonCrawl - Data Product

Page 42:

Zeppelin Viewer

Community service for sharing example notebooks http://zeppelinhub.com/viewer

Page 43:

TAKEAWAY

There are plenty of free tools out there

To crunch the data for fun and profit

They are easy (not simple) to learn and generic enough

Page 44:

Questions?

@seoul_engineer

Alexander Bezzubov

github.com/bzz

Page 45:

Thank you

Alexander Bezzubov, NFLabs, Seoul (we are hiring!)