Mining public datasets using opensource tools: Zeppelin, Spark and Juju

38
Mining Public Datasets Using Open Source Tools At scale & on a budget by Alexander

Transcript of Mining public datasets using opensource tools: Zeppelin, Spark and Juju

Page 1: Mining public datasets using opensource tools: Zeppelin, Spark and Juju

Mining Public Datasets Using Open Source Tools

At scale & on a budget by Alexander

Page 2: Mining public datasets using opensource tools: Zeppelin, Spark and Juju

Software Engineer at NFLabs, Seoul, South Korea

Co-organizer of SeoulTech Society

Committer and PPMC member of Apache Zeppelin (Incubating)

@seoul_engineer

github.com/bzz

Alexander

Page 3: Mining public datasets using opensource tools: Zeppelin, Spark and Juju

IS DATA IMPORTANT?

Content Streaming Services

Taxi

Housing

Web Search

Kuaidi

OpenHouse

IoT

Page 4: Mining public datasets using opensource tools: Zeppelin, Spark and Juju

CONTEXT

Size of even Public Data is huge and growing

There could be more research, applications and data products build using that data

Quality and number of free tools available to public to crunch that data is constantly improving

Cloud brings affordable computations @ scale

Page 5: Mining public datasets using opensource tools: Zeppelin, Spark and Juju

PUBLIC DATA = OPPORTUNITY

Page 6: Mining public datasets using opensource tools: Zeppelin, Spark and Juju

AGENDA

Datasets

Tools

Scale and Budget

Page 7: Mining public datasets using opensource tools: Zeppelin, Spark and Juju

Datasets

Tools

Scale and Budget

Page 8: Mining public datasets using opensource tools: Zeppelin, Spark and Juju

• Internet archives• Web applications logs (wikipedia activity, github activity)• Genome• AdClicks• Webserver access logs • Network traffic• Scientific datasets• Images, Songs (Million Song Dataset)• Reviews, • Social media• Flight timetables• Taxis• n-gram language model

DATASETS

Page 9: Mining public datasets using opensource tools: Zeppelin, Spark and Juju

DATASETS

• 300Gb compressed• Collaboration google and github engineers• Events on PR, repo, issues, comments in JSON

http://githubarchive.org

Page 10: Mining public datasets using opensource tools: Zeppelin, Spark and Juju

http://www.commitlogsfromlastnight.com/

Page 11: Mining public datasets using opensource tools: Zeppelin, Spark and Juju

http://sideeffect.kr/popularconvention/

Page 12: Mining public datasets using opensource tools: Zeppelin, Spark and Juju

https://www.gitlive.net/

Page 13: Mining public datasets using opensource tools: Zeppelin, Spark and Juju

http://zoom.it/kCsU

Page 14: Mining public datasets using opensource tools: Zeppelin, Spark and Juju

Common Crawl

https://commoncrawl.org

Nonprofit, by Factual

On AWS S3 in WARC, WAT, formatssince 2013, monthly: ~150Tb compressed, 2+bln ulrs

Page 15: Mining public datasets using opensource tools: Zeppelin, Spark and Juju

URL Index by Ilya Kreymer of @webrecorder_io http://index.commoncrawl.org/

Page 16: Mining public datasets using opensource tools: Zeppelin, Spark and Juju
Page 17: Mining public datasets using opensource tools: Zeppelin, Spark and Juju

https://about.commonsearch.org

Page 18: Mining public datasets using opensource tools: Zeppelin, Spark and Juju

Datasets

Tools

Scale and Budget

Page 19: Mining public datasets using opensource tools: Zeppelin, Spark and Juju

TOOLS OVERVIEW

Generic: Grep, Python, Ruby, JVM - all good, but hard to scale beyond single machine or data format

Hight-performance: MPI, Hadoop, HPC - awesome but complex, not easy, problem specific, not very accessible

New, scalable: Spark, Flink, Zeppelin - easy (not simple) and robust

Page 20: Mining public datasets using opensource tools: Zeppelin, Spark and Juju

Apache Software Foundation

http://www.apache.org/foundation/

Page 21: Mining public datasets using opensource tools: Zeppelin, Spark and Juju

Apache Software Foundation

1999 - 21 founders of 1 project

2016 - 9 Board of Directors 600 Foundation members2000+ committers of 171 projects (+55 incubating)

Keywords: meritocracy, community over code, consensus

Provide: infrastructure, legal support, way of building software

http://www.apache.org/foundation/

Page 22: Mining public datasets using opensource tools: Zeppelin, Spark and Juju

Apache Spark Scala, Python, R

Apache Zeppelin Modern Web GUI, plays nicely with Spark, Flink, Elasticsearch, etc. Easy to set up.

Warcbase Spark library for saved crawl data (WARC)

Juju Scales, integration with Spark, Zeppelin, AWS, Ganglia

NEW, SCALABLE TOOLS

Page 23: Mining public datasets using opensource tools: Zeppelin, Spark and Juju

APACHE SPARK

From Berkeley AMP Labs, since 2010

Founded Databricks since 2013, joined Apache since 2014

1000+ contributors

REPL + Java, Scala, Python, R APIs

http://spark.apache.org

Page 24: Mining public datasets using opensource tools: Zeppelin, Spark and Juju

APACHE SPARK

Has much more: GraphX, MLlib, SQLhttps://spark.apache.org/examples.html

http://spark.apache.org

Parallel collections API (similar to FlumeJava, Crunch, Cascading)

Page 25: Mining public datasets using opensource tools: Zeppelin, Spark and Juju

• Notebook style GUI on top of backend processing system

• Plays nicely with all the eco-system Spark, Flink, SQL, Elasticsearch, etc.

• Easy to set up

APACHE ZEPPELIN (INCUBATING)

http://zeppelin.incubator.apache.org

Page 26: Mining public datasets using opensource tools: Zeppelin, Spark and Juju
Page 27: Mining public datasets using opensource tools: Zeppelin, Spark and Juju

Enters ASF Incubation12.201408.2013 NFLabs Internal project Hive/Shark

12.2012 Commercial App using AMP Lab Shark 0.510.2013 Prototype Hive/Shark

APACHE ZEPPELIN PROJECT TIMELINE

01.2016 3 major releases

http://zeppelin.incubator.apache.org

Page 28: Mining public datasets using opensource tools: Zeppelin, Spark and Juju

Interactive Visualization

APACHE ZEPPELIN

Page 29: Mining public datasets using opensource tools: Zeppelin, Spark and Juju

Pluggable Interpreters

APACHE ZEPPELIN

Page 30: Mining public datasets using opensource tools: Zeppelin, Spark and Juju

Spark library for WARC (Web ARChive) data processing * text analysis * site link structure

WARCBASE

https://github.com/lintool/warcbase

http://lintool.github.io/warcbase-docs

Page 31: Mining public datasets using opensource tools: Zeppelin, Spark and Juju

Service modeling at scale

Deployment\configuration automation + Integration with Spark, Zeppelin, Ganglia, etc + AWS, GCE, Azure, LXC, etc

JUJU

https://jujucharms.com/

Page 32: Mining public datasets using opensource tools: Zeppelin, Spark and Juju

$ apt-get install juju-core juju-quickstart # or $ brew install juju juju-quickstart $ juju generate-config #LXC, AWS, GCE, Azure, VMWare, OpenStack

$ juju bootstrap $ juju quickstart apache-hadoop-spark-zeppelin $ juju expose spark zeppelin $ juju add-unit -n4 slave

JUJU

http://bigdata.juju.solutions/getstarted

Page 33: Mining public datasets using opensource tools: Zeppelin, Spark and Juju

JUJU

http://bigdata.juju.solutions/getstarted

7 node cluster designed to scale out

Page 34: Mining public datasets using opensource tools: Zeppelin, Spark and Juju

Datasets

Tools

Scale and Budget

Page 35: Mining public datasets using opensource tools: Zeppelin, Spark and Juju

1 core

10s PC

1000 instances

APPROACH: SCALE AND BUDGET

Prototype

Estimate the cost

Scale out

Your laptop

AWS spot instances

Deployment automation

Page 36: Mining public datasets using opensource tools: Zeppelin, Spark and Juju

TAKEAWAY

There are plenty of free tools out there

To crunch the data for fun and profit

They are easy (not simple) to learn and generic enough

Page 37: Mining public datasets using opensource tools: Zeppelin, Spark and Juju

Thank you

Page 38: Mining public datasets using opensource tools: Zeppelin, Spark and Juju

Questions?

@seoul_engineer

Alexander

github.com/bzz