Berkeley Data Analytics Stack (BDAS) ampcamp.berkeley.edu/wp-content/uploads/2014/02/Ion-StrataSC14...

Berkeley Data Analytics Stack (BDAS) UC BERKELEY Ion Stoica UC Berkeley / Databricks / Conviva


Page 1:

Berkeley Data Analytics Stack (BDAS)

UC BERKELEY

Ion Stoica UC Berkeley / Databricks / Conviva

Page 2:

Data is Everywhere
Easier and cheaper than ever to collect
Data grows faster than Moore’s law

[Chart: Moore's Law vs. Overall Data, 2010-2015 (IDC report*)]

Page 3:

The New Gold Rush
Everyone wants to extract value from data
» Big companies & startups alike

Huge potential
» Already demonstrated by Google, Facebook, …

But, untapped by most companies
» “We have lots of data but no one is looking at it!”

Page 4:

Extracting Value from Data Hard
Data is massive, unstructured, and dirty
Questions are complex
Processing, analysis tools still in their “infancy”

Need tools that are
» Faster
» More sophisticated
» Easier to use

Page 5:

Turning Data into Value
Insights, diagnosis, e.g.,
» Why is user engagement dropping?
» Why is the system slow?
» Detect spam, DDoS attacks

Decisions, e.g.,
» Decide what feature to add to a product
» Personalized medical treatment
» Decide when to change an aircraft engine part
» Decide what ads to show

Data only as useful as the decisions it enables

Page 6:

What do We Need?
Interactive queries: enable faster decisions
» E.g., identify why a site is slow and fix it

Queries on streaming data: enable decisions on real-time data
» E.g., fraud detection, detect DDoS attacks

Sophisticated data processing: enable “better” decisions
» E.g., anomaly detection, trend analysis

Page 7:

Our Goal

[Diagram labels: Batch, Interactive, Streaming; Single Stack!]

Support batch, streaming, and interactive computations… in a unified framework

Easy to develop sophisticated algorithms (e.g., graph, ML algos)

Page 8:

The Berkeley AMPLab
January 2011 – 2017
» 8 faculty
» > 40 students and postdocs
» Team of 3 software engineers

Organized for collaboration

3 day retreats (twice a year)

Algorithms, Machines, People

220 campers (100+ companies)

AMPCamp3 (August, 2013)

Page 9:

The Berkeley AMPLab
Governmental and industrial funding: [sponsor logos not reproduced]

Goal: Next generation of open source data analytics stack for industry & academia: Berkeley Data Analytics Stack (BDAS)

Page 10:

BDAS Stack (Feb, 2013)

[Stack diagram. Layers: Data Processing Layer; Resource Management Layer; Storage Layer (HDFS, S3, …). Components: Mesos. Legend: Releases, 3rd party, Research Projects.]

Enable multiple frameworks to share same cluster resources (e.g., Hadoop, Storm, Spark) (2009)
Scale to thousands of servers (e.g., Twitter)
Third party schedulers, e.g., Chronos, Aurora

Page 11:

BDAS Stack (Feb, 2013)

[Stack diagram. Layers: Data Processing Layer; Resource Management Layer; Storage Layer (HDFS, S3, …). Components: Mesos, Hadoop Yarn, Spark. Legend: Releases, 3rd party, Research Projects.]

Distributed Execution Engine (2009)
» Fault-tolerant, in-memory storage
» Powerful APIs (Scala, Python, Java)

Fast: up to 100x faster than HadoopMR

Easy to use: 2-5x less code than HadoopMR

General: support interactive & iterative apps

Page 12:

BDAS Stack (Feb, 2013)

[Stack diagram. Layers: Resource Management Layer; Storage Layer (HDFS, S3, …). Components: Mesos, Hadoop Yarn, Spark, Shark SQL. Legend: Releases, 3rd party, Research Projects.]

Hive over Spark: full support for HQL and UDFs (2010)

Up to 100x faster when input is in memory

Up to 5-10x faster when input is on disk

Page 13:

BDAS Stack (Feb, 2013)

[Stack diagram. Layers: Resource Management Layer; Storage Layer (HDFS, S3, …). Components: Mesos, Hadoop Yarn, Spark, Shark SQL, Spark Streaming. Legend: Releases, 3rd party, Research Projects.]

Large scale streaming engine (2011)

Implemented as a sequence of micro-batch (< 1s) jobs
» Fault tolerant
» Handle stragglers
» Ensure exactly-once semantics
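The micro-batch idea can be illustrated with a minimal sketch in plain Scala. It does not use the Spark Streaming API; `Event`, `countWords`, and `runMicroBatches` are hypothetical names introduced here. The stream is cut into fixed-length intervals and the same batch job runs on each interval. Fault tolerance, straggler handling, and exactly-once delivery are not modeled.

```scala
// Plain-Scala illustration of micro-batching; no Spark dependency.
object MicroBatchSketch {
  // An event is a line of text stamped with its arrival time in milliseconds.
  final case class Event(timeMs: Long, line: String)

  // The batch job: an ordinary word count over a finite collection of lines.
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }

  // Cut the stream into intervals of batchMs and run the batch job on each one.
  // Returns (batch index, word counts) pairs ordered by batch index.
  def runMicroBatches(events: Seq[Event], batchMs: Long): Seq[(Long, Map[String, Int])] =
    events
      .groupBy(_.timeMs / batchMs)
      .toSeq
      .sortBy(_._1)
      .map { case (batch, es) => batch -> countWords(es.map(_.line)) }

  def main(args: Array[String]): Unit = {
    val events = Seq(
      Event(100L, "spark shark"),
      Event(400L, "spark"),
      Event(1200L, "mesos spark")
    )
    // One-second batches: the first two events share a batch, the third does not.
    runMicroBatches(events, batchMs = 1000L).foreach { case (batch, counts) =>
      println(s"batch $batch: $counts")
    }
  }
}
```

Because each interval is processed by an ordinary batch function, the per-interval logic can be reused unchanged for historical data.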

Page 14:

BDAS Stack (Feb, 2013)

[Stack diagram. Layers: Storage Layer (HDFS, S3, …). Components: Mesos, Hadoop Yarn, Spark, Shark SQL, Spark Streaming, BlinkDB. Legend: Releases, 3rd party, Research Projects.]

Approximate query processing

Trade between query latency/cost and accuracy using sampling

Provide error bounds & confidence intervals for arbitrary queries
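The trade-off between sample size and accuracy can be sketched in plain Scala. This is not BlinkDB's implementation or API. `approxAvg` and `Approx` are hypothetical names, and the 95% interval uses a textbook normal approximation (z = 1.96), which is an assumption of this sketch.

```scala
import scala.util.Random

object ApproxQuerySketch {
  // Estimate plus a symmetric error bound: the interval is estimate +/- halfWidth.
  final case class Approx(estimate: Double, halfWidth: Double, sampleSize: Int)

  // Approximate AVG(values) from a uniform random sample of sampleSize rows.
  // The bound is a 95% confidence interval under the normal approximation.
  def approxAvg(values: IndexedSeq[Double], sampleSize: Int, seed: Long): Approx = {
    require(values.nonEmpty && sampleSize >= 2, "need data and at least two samples")
    val rng = new Random(seed)
    // Sample with replacement so the draws are independent.
    val sample = IndexedSeq.fill(sampleSize)(values(rng.nextInt(values.length)))
    val mean = sample.sum / sampleSize
    // Unbiased sample variance (divides by n - 1).
    val variance = sample.map(x => (x - mean) * (x - mean)).sum / (sampleSize - 1)
    Approx(mean, 1.96 * math.sqrt(variance / sampleSize), sampleSize)
  }

  def main(args: Array[String]): Unit = {
    val latencies = IndexedSeq.tabulate(100000)(i => (i % 200).toDouble)
    // A larger sample costs more to scan; the error bound shrinks roughly as 1/sqrt(n).
    for (n <- Seq(100, 10000)) {
      val a = approxAvg(latencies, n, seed = 42L)
      println(f"n=$n%6d  avg ~ ${a.estimate}%.2f +/- ${a.halfWidth}%.2f")
    }
  }
}
```

The caller picks the sample size (latency/cost) and reads the half-width (accuracy) from the result.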

Page 15:

BDAS Stack (Feb, 2013)

[Stack diagram. Layers: Storage Layer (HDFS, S3, …). Components: Mesos, Hadoop Yarn, Spark, Shark SQL, Spark Streaming, BlinkDB, Machine Learning. Legend: Releases, 3rd party, Research Projects.]

Make ML accessible to non-experts
Declarative API: allow users to say what they want
Automatically pick best algorithm for given data, time
Allow developers to easily add new algorithms
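A minimal plain-Scala sketch of "automatically pick the best algorithm": train every candidate and keep the one with the lowest held-out error. This is not the MLbase API. `fitBest`, `Learner`, and the two toy learners are hypothetical names introduced for illustration, and the time budget mentioned on the slide is not modeled.

```scala
object ModelSelectionSketch {
  type Point = (Double, Double)        // (feature x, label y)
  type Model = Double => Double        // trained predictor
  type Learner = Seq[Point] => Model   // training procedure

  // Candidate 1: always predict the mean training label.
  val meanLearner: Learner = train => {
    val m = train.map(_._2).sum / train.size
    (_: Double) => m
  }

  // Candidate 2: least-squares line y = a * x + b.
  val lineLearner: Learner = train => {
    val n = train.size.toDouble
    val mx = train.map(_._1).sum / n
    val my = train.map(_._2).sum / n
    val sxx = train.map { case (x, _) => (x - mx) * (x - mx) }.sum
    val sxy = train.map { case (x, y) => (x - mx) * (y - my) }.sum
    val a = if (sxx == 0.0) 0.0 else sxy / sxx
    val b = my - a * mx
    (x: Double) => a * x + b
  }

  // New algorithms are added by appending to this list.
  val defaultCandidates: Seq[(String, Learner)] =
    Seq("mean" -> meanLearner, "line" -> lineLearner)

  // Mean squared error of a model on a data set.
  def mse(model: Model, data: Seq[Point]): Double =
    data.map { case (x, y) => val e = model(x) - y; e * e }.sum / data.size

  // The caller supplies data only; every candidate is trained and the one with
  // the lowest held-out error is returned. Ties go to the earlier candidate.
  def fitBest(train: Seq[Point], holdout: Seq[Point],
              candidates: Seq[(String, Learner)] = defaultCandidates): (String, Model) = {
    val scored = candidates.map { case (name, learn) =>
      val model = learn(train)
      (name, model, mse(model, holdout))
    }
    val (name, model, _) = scored.reduce((a, b) => if (b._3 < a._3) b else a)
    (name, model)
  }

  def main(args: Array[String]): Unit = {
    val data = (0 to 19).map(i => (i.toDouble, 2.0 * i + 1.0))
    val (name, model) = fitBest(data.take(15), data.drop(15))
    println(s"selected: $name, prediction at x = 30: ${model(30.0)}")
  }
}
```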

Page 16:

BDAS Stack (Feb, 2013)

[Stack diagram. Layers: Storage Layer (HDFS, S3, …). Components: Mesos, Hadoop Yarn, Spark, Shark SQL, Spark Streaming, BlinkDB, Machine Learning, Tachyon. Legend: Releases, 3rd party, Research Projects.]

In-memory, fault-tolerant storage system

Flexible API, including HDFS API

Allow multiple frameworks (including Hadoop) to share in-memory data

Page 17:

BDAS Stack (Feb, 2014)

[Stack diagram. Layers: Storage Layer (HDFS, S3, …). Components: Mesos, Hadoop Yarn, Spark, Shark SQL, Spark Streaming, BlinkDB, Tachyon, Machine Learning, MLlib, MLbase, GraphX, SparkR. Legend: Releases, 3rd party, Research Projects.]

Page 18:

BDAS Stack (Feb, 2014)

[Stack diagram. Layers: Resource Management Layer; Storage Layer (HDFS, S3, …). Components: Mesos, Hadoop Yarn, Spark, Spark Streaming, Shark SQL, BlinkDB, GraphX, MLlib, MLbase, Tachyon, SparkR. Legend: GA, Alpha Releases, 3rd party.]

Page 19:

BDAS Stack (Feb, 2014)

[Stack diagram. Layers: Resource Management Layer; Storage Layer (HDFS, S3, …). Components: Mesos, Hadoop Yarn, Spark, Spark Streaming, Shark SQL, BlinkDB, GraphX, MLlib, MLbase, Tachyon, SparkR. Legend: Apache License, BSD License, 3rd party.]

Page 20:

Unification: One Size Fits Many!
Using Spark & Tachyon BDAS unifies
» Batch
» Streaming
» Interactive
» Iterative (e.g., graph and ML algorithms)

[Diagram labels: HadoopMR, Impala/Drill, Storm, HDFS, GraphLab, …, Spark (+Tachyon), Spark Streaming, Shark+BlinkDB, GraphX, MLbase, GraphR, HDFS]

Page 21:

Unification Examples

Real-time and historical data analysis
Streaming and machine-learning
Graph processing and ETLs

Page 22:

Unify Real-time and Historical Analytics
Today: separate stacks
» Historical analysis (Hadoop, Hive)
» Streaming (Storm)
» Interactive queries (Impala)

Disadvantages:
» Hard to maintain and operate
» Hard to integrate: cannot support interactive ad-hoc queries on streaming data

[Diagram labels: HadoopMR, Impala/Drill, Storm, HDFS]

Page 23:

Unify Real-time and Historical Analytics
Spark (+ Streaming, Shark): single stack
» Easier to build and maintain
» Cheaper to operate
» Interactive queries on streaming data: faster decisions
» Simplify development

[Diagram labels: Spark (+Tachyon), Spark Streaming, Shark+BlinkDB, HadoopMR, Impala/Drill, Storm, HDFS]

Page 24:

Unify Real-time and Historical Analytics
Batch and streaming code is virtually the same
» Easy to develop and maintain consistency

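// count words from a file (batch); sc is the SparkContext, e.g. as provided by spark-shell
val file = sc.textFile("hdfs://.../pagecounts-*.gz")
val words = file.flatMap(line => line.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
// RDDs have no print(); show the first 10 results, as DStream.print() does per batch
wordCounts.take(10).foreach(println)

// count words from a network stream, every 10s (streaming)
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._ // pair operations such as reduceByKey on older Spark releases
// args(0) is the cluster URL; further constructor arguments are elided on the slide
val ssc = new StreamingContext(args(0), "NetCount", Seconds(10))
val lines = ssc.socketTextStream("localhost", 3456)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination() // keep the driver alive while batches are processed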

Page 25:

Unify Streaming and ML
Today: ML done mostly off-line
Spark (+ Streaming, MLbase): Real-time diagnosis & decisions
» Fraud detection
» Early notification of service degradation and failures

[Diagram labels: HadoopMR, Mahout / R, Spark (+Tachyon), Spark Streaming, MLbase, HDFS]

Page 26:

Unify Graph Processing and ETL
Today: Graph-parallel systems (Pregel, GraphLab)
» Fast and scalable, but…
» … inefficient for graph creation, post-processing

Spark (+ GraphX): unifies graph processing & ETL
» Faster to get social network insights

[Diagram labels: Spark, GraphX, HadoopMR, GraphLab / Giraph, HDFS]

Page 27:

Unify Graph Processing and ETL

[Diagram labels: GraphLab, Hadoop, Graph Algorithms, Graph Creation (Hadoop), Post Proc.; GraphX, Graph Creation (Spark), Post Proc. (Spark)]
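The three stages named in the diagram (graph creation, graph algorithm, post-processing) can be sketched as one pipeline in a single program. The sketch uses plain Scala collections rather than the GraphX API. The input format ("follower,followee" lines) and the function names are assumptions made for illustration, and in-degree stands in for a real graph algorithm.

```scala
object GraphPipelineSketch {
  // Graph creation (ETL): parse raw "follower,followee" lines into directed
  // edges, skipping malformed rows.
  def parseEdges(lines: Seq[String]): Seq[(String, String)] =
    lines.map(_.split(",").map(_.trim)).collect {
      case Array(src, dst) if src.nonEmpty && dst.nonEmpty => (src, dst)
    }

  // Graph algorithm: in-degree of every vertex with at least one incoming edge.
  def inDegrees(edges: Seq[(String, String)]): Map[String, Int] =
    edges.groupBy(_._2).map { case (vertex, incoming) => vertex -> incoming.size }

  // Post-processing: the k most-followed vertices, ties broken by name.
  def topK(degrees: Map[String, Int], k: Int): Seq[(String, Int)] =
    degrees.toSeq.sortBy { case (vertex, degree) => (-degree, vertex) }.take(k)

  def main(args: Array[String]): Unit = {
    val raw = Seq("a,b", "c,b", "b,a", "oops", "d, a", "e,c")
    // All three stages share one process and one data model, so nothing is
    // exported and re-imported between systems.
    topK(inDegrees(parseEdges(raw)), k = 2).foreach(println)
  }
}
```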

Page 28:

Not Only General, but Fast!

[Chart "Interactive (SQL)": Response Time (s), axis 0-45; series: Hive, Impala (disk), Impala (mem), Shark (disk), Shark (mem)]

[Chart "Streaming": Throughput (MB/s/node), axis 0-35; series: Storm, Spark]

[Chart "Batch (ML, Spark)": Time per Iteration (s), axis 0-140; series: Hadoop, Spark]

Page 29:

Gaining Rapid Traction
1,500+ Spark meetup users
30+ companies and 100+ users contributing code

[Chart: Attendees, axis 0-500; events: AMP Camp 1 (Aug 2012), AMP Camp 2 (Aug 2013), Spark Summit (Dec 2013)]

Page 30:

Gaining Rapid Traction

Page 31:

Cloudera Partnership
Integrate Spark (including Spark Streaming, MLlib) with Cloudera Manager
Spark will become part of CDH

Enterprise class support and professional services available for Spark

Page 32:

Summary
BDAS: address next Big Data challenges

Unify batch, interactive, and streaming computations

Easy to develop sophisticated applications
» Support graph & ML algorithms, approximate queries

Witnessed significant adoption » 30+ companies, 100+ individuals contributing code

Exciting ongoing work » MLbase, GraphX, BlinkDB, …

[Diagram labels: Batch, Interactive, Streaming; Spark]

Page 33:

What’s Next?
Rest of morning:
» Tachyon
» GraphX
» BlinkDB
» MLlib

Afternoon: hands-on tutorial on Spark, Shark/BlinkDB, GraphX, MLlib

[Stack diagram. Layers: Resource Management Layer; Storage Layer (HDFS, S3, …). Components: Mesos, Hadoop Yarn, Spark, Spark Streaming, Shark SQL, BlinkDB, GraphX, MLlib, MLbase, Tachyon, SparkR.]