Intro to AMPLab and Berkeley Data Analytics Stack

Ion Stoica, UC Berkeley


Transcript of Intro to AMPLab and Berkeley Data Analytics Stack

Page 2: Intro to  AMPLab  and  Berkeley Data Analytics Stack

A Brief History…

AMPCamp 1 (August, 2012)
» 150 campers (3,000+ online)

AMPCamp 2 (February, 2013)
» Full-Day Strata Tutorial
» Sold-out hands-on tutorial

AMPCamp 3 (Today and Tomorrow)
» 250 campers (sold-out)

Page 3: Intro to  AMPLab  and  Berkeley Data Analytics Stack

What is Big Data Used For?

Reports, e.g.,
» Track business processes, transactions

Diagnosis, e.g.,
» Why is user engagement dropping?
» Why is the system slow?
» Detect spam, worms, viruses, DDoS attacks

Decisions, e.g.,
» Personalized medical treatment
» Decide what feature to add to a product
» Decide what ads to show

Data is only as useful as the decisions it enables

Page 4: Intro to  AMPLab  and  Berkeley Data Analytics Stack

Data Processing Goals

Low latency (interactive) queries on historical data: enable faster decisions
» E.g., identify why a site is slow and fix it

Low latency queries on live data (streaming): enable decisions on real-time data
» E.g., detect & block worms in real-time (a worm may infect 1 million hosts in 1.3 sec)

Sophisticated data processing: enable “better” decisions
» E.g., anomaly detection, trend analysis

Page 5: Intro to  AMPLab  and  Berkeley Data Analytics Stack

Our Goal

Single Stack! (Batch, Interactive, Streaming)

Support batch, streaming, and interactive computations… and make it easy to compose them

Easy to develop sophisticated algorithms (e.g., graph, ML algos)

Page 6: Intro to  AMPLab  and  Berkeley Data Analytics Stack

The Need for Unification (1/2)

Today’s state-of-the-art analytics stack

[Diagram: Logs → Demux → three separate stacks]
» Batch stack (e.g., Hadoop)
» Streaming stack (e.g., Storm)
» Interactive queries (e.g., HBase, Impala, SQL)
Outputs: real-time analytics, ad-hoc queries on historical data, interactive queries on historical data

Challenges:
» Need to maintain three separate stacks
  • Expensive and complex
  • Hard to compute consistent metrics across stacks
» Hard and slow to share data across stacks

Page 7: Intro to  AMPLab  and  Berkeley Data Analytics Stack

The Need for Unification (2/2)

Make real-time decisions
» Detect DDoS, fraud, etc.

E.g., what’s needed to detect a DDoS attack?
1. Detect attack pattern in real time → streaming
2. Is traffic surge expected? → interactive queries
3. Making queries fast → pre-computation (batch)

And need to implement complex algos (e.g., ML)!
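
The three DDoS-detection steps can be sketched in plain Python to show why they want to share data. This is an illustration only, not code from the talk: the function names (`precompute_baseline`, `expected_rate`, `detect_surge`), the per-hour baseline, and the 3x threshold are all assumptions.

```python
from collections import defaultdict

def precompute_baseline(history):
    """Batch step: average requests/sec for each hour of the day.

    `history` is a list of (hour_of_day, requests_per_second) pairs.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for hour, rate in history:
        totals[hour] += rate
        counts[hour] += 1
    return {hour: totals[hour] / counts[hour] for hour in totals}

def expected_rate(baseline, hour):
    """Interactive step: look up the expected rate for a given hour."""
    return baseline.get(hour)

def detect_surge(baseline, live_events, threshold=3.0):
    """Streaming step: flag live observations far above the baseline.

    The threshold of 3x is an arbitrary, hypothetical choice.
    """
    alerts = []
    for hour, rate in live_events:
        expected = expected_rate(baseline, hour)
        if expected is not None and rate > threshold * expected:
            alerts.append((hour, rate, expected))
    return alerts

history = [(9, 100.0), (9, 120.0), (21, 400.0), (21, 440.0)]
baseline = precompute_baseline(history)
live = [(9, 130.0), (9, 900.0), (21, 450.0)]
alerts = detect_surge(baseline, live)
print(alerts)
```

All three functions read the same `baseline` structure. With three separate stacks, that shared state would have to be copied between systems, which is the cost the previous slide describes.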

Page 8: Intro to  AMPLab  and  Berkeley Data Analytics Stack

The Berkeley AMPLab

January 2011 – 2017
» 8 faculty
» > 40 students
» Team of 3 software engineers

Organized for collaboration
» 3-day retreats (twice a year)
» AMPCamp 1 (August, 2012): 150 campers (3,000 online)

AMP: Algorithms, Machines, People

Page 9: Intro to  AMPLab  and  Berkeley Data Analytics Stack

The Berkeley AMPLab

Governmental and industrial funding

Goal: Next generation of open source data analytics stack for industry & academia: Berkeley Data Analytics Stack (BDAS)

Page 10: Intro to  AMPLab  and  Berkeley Data Analytics Stack

Data Processing Stack

Data Processing Layer

Resource Management Layer

Storage Layer

Page 11: Intro to  AMPLab  and  Berkeley Data Analytics Stack

Hadoop Stack

Data Processing Layer: Hadoop MR, Hive, Pig, HBase, Storm, …
Resource Management Layer: Hadoop Yarn
Storage Layer: HDFS, S3, …

Page 12: Intro to  AMPLab  and  Berkeley Data Analytics Stack

BDAS Stack

Data Processing Layer: Spark, Spark Streaming, Shark SQL, BlinkDB, GraphX, MLlib, MLbase
Resource Management Layer: Mesos
Storage Layer: HDFS, S3, …, Tachyon

Page 13: Intro to  AMPLab  and  Berkeley Data Analytics Stack

How do BDAS & Hadoop fit together?

[Diagram: BDAS components (Spark, Spark Streaming, Shark SQL, BlinkDB, GraphX, MLlib, MLbase) and Hadoop components (Hadoop MR, Hive, Pig, HBase, Storm) side by side; resource management: Mesos, Hadoop Yarn; storage: HDFS, S3, …, Tachyon]

Page 14: Intro to  AMPLab  and  Berkeley Data Analytics Stack

Apache Mesos

Enable multiple frameworks to share same cluster resources (e.g., Hadoop, Storm, Spark)

Twitter’s large scale deployment
» 6,000+ servers
» 500+ engineers running jobs on Mesos

Third party Mesos schedulers
» AirBnB’s Chronos
» Twitter’s Aurora

Mesosphere: startup to commercialize Mesos


Page 15: Intro to  AMPLab  and  Berkeley Data Analytics Stack

Apache Spark

Distributed Execution Engine
» Fault-tolerant, efficient in-memory storage (RDDs)
» Powerful programming model and APIs (Scala, Python, Java)

Fast: up to 100x faster than Hadoop
Easy to use: 5-10x less code than Hadoop
General: support interactive & iterative apps
Two major releases since last AMPCamp
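
One way to picture "fault-tolerant, efficient in-memory storage" is a dataset that keeps the recipe for each partition, so a lost in-memory copy can be recomputed instead of being restored from a replica. The sketch below is a toy model in plain Python, not Spark's API: `MiniRDD` and all of its methods are hypothetical names introduced for illustration.

```python
class MiniRDD:
    """Toy stand-in for an RDD: cached partitions plus the recipe to rebuild them."""

    def __init__(self, compute_partition, num_partitions):
        self._compute = compute_partition   # lineage: how to build partition i
        self.num_partitions = num_partitions
        self._cache = {}                    # in-memory copies of partitions

    @classmethod
    def from_lists(cls, partitions):
        data = [list(p) for p in partitions]
        return cls(lambda i: list(data[i]), len(data))

    def map(self, fn):
        parent = self
        return MiniRDD(lambda i: [fn(x) for x in parent.partition(i)],
                       self.num_partitions)

    def filter(self, pred):
        parent = self
        return MiniRDD(lambda i: [x for x in parent.partition(i) if pred(x)],
                       self.num_partitions)

    def partition(self, i):
        # Serve from memory when possible; otherwise rebuild from the parent.
        if i not in self._cache:
            self._cache[i] = self._compute(i)
        return self._cache[i]

    def lose_partition(self, i):
        # Simulate a machine failure that drops one cached partition.
        self._cache.pop(i, None)

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


base = MiniRDD.from_lists([[1, 2, 3], [4, 5, 6]])
evens_squared = base.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
before = evens_squared.collect()
evens_squared.lose_partition(1)     # drop the cached copy
after = evens_squared.collect()     # partition 1 is recomputed on demand
print(before, after)
```

Only the lost partition is rebuilt. The other partition is still served from the in-memory cache.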


Page 16: Intro to  AMPLab  and  Berkeley Data Analytics Stack

Spark Streaming

Large scale streaming computation
Implement streaming as a sequence of <1s jobs
» Fault tolerant
» Handle stragglers
» Ensure exactly-once semantics

Integrated with Spark: unifies batch, interactive, and streaming computations
Alpha release (Spring, 2013)
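
The idea of treating a stream as a sequence of short jobs can be sketched without any Spark dependency. In the plain-Python illustration below, `micro_batches`, `word_count`, and the `batch_interval` parameter are hypothetical names, and in-order arrival of events is an assumption. It is not the Spark Streaming API.

```python
from collections import Counter

def micro_batches(events, batch_interval=1.0):
    """Group (timestamp, value) events into consecutive fixed-length batches.

    Assumes events arrive in timestamp order; empty intervals are skipped.
    """
    batch, batch_index = [], None
    for timestamp, value in events:
        index = int(timestamp // batch_interval)
        if batch_index is not None and index != batch_index:
            yield batch_index, batch
            batch = []
        batch_index = index
        batch.append(value)
    if batch:
        yield batch_index, batch

def word_count(batch):
    """An ordinary batch job, reused unchanged on every micro-batch."""
    return Counter(batch)

events = [(0.1, "a"), (0.4, "b"), (0.9, "a"), (1.2, "b"), (2.7, "a"), (2.8, "a")]
results = [(i, dict(word_count(b))) for i, b in micro_batches(events, 1.0)]
print(results)
```

Because each interval is processed by the same kind of job as a batch workload, the batch code path (`word_count` here) does not need a separate streaming implementation.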


Page 17: Intro to  AMPLab  and  Berkeley Data Analytics Stack

Shark

Hive over Spark: full support for HQL and UDFs
Up to 100x faster when input is in memory
Up to 5-10x faster when input is on disk
Running on hundreds of nodes at Yahoo!
Two major releases along with Spark


Page 18: Intro to  AMPLab  and  Berkeley Data Analytics Stack

Performance and Generality (Unified Computation Models)

[Figure: three bar charts]
» Interactive (SQL, Shark): response time (s) for Hive, Impala, Shark
» Streaming (Spark Streaming): throughput (MB/s/node) for Storm, Spark Streaming
» Batch (ML, Spark): time per iteration (s) for Hadoop, Spark

Page 19: Intro to  AMPLab  and  Berkeley Data Analytics Stack

Unified Programming Models

Unified system for SQL, graph processing, machine learning

All share the same set of workers and caches

Page 20: Intro to  AMPLab  and  Berkeley Data Analytics Stack

Gaining Rapid Traction

Sold out AMPCamp and Strata tutorials
1,000+ Spark meetup users
20+ companies contributing code


Page 21: Intro to  AMPLab  and  Berkeley Data Analytics Stack

Gaining Rapid Traction

Page 22: Intro to  AMPLab  and  Berkeley Data Analytics Stack

BlinkDB

Trade between query performance and accuracy using sampling

Why?
» In-memory processing doesn’t guarantee interactive processing
  • E.g., ~10’s sec just to scan 512 GB RAM!
  • Gap between memory capacity and transfer rate increasing

[Figure: server with 512 GB RAM and 16 cores, 40-60 GB/s transfer rate; memory capacity doubles every 18 months, transfer rate doubles every 36 months]
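
The "~10’s sec" figure follows from the numbers on the slide. The short calculation below is a sketch: the function names are hypothetical, and it assumes the 18-month doubling refers to memory capacity and the 36-month doubling to transfer rate, which is how the figure labels are read here.

```python
def scan_seconds(memory_gb, bandwidth_gb_per_s):
    """Time to read all of memory once at a given transfer rate."""
    return memory_gb / bandwidth_gb_per_s

slow = scan_seconds(512, 40)   # 12.8 s at 40 GB/s
fast = scan_seconds(512, 60)   # about 8.5 s at 60 GB/s

def gap_growth(months, capacity_doubling=18, bandwidth_doubling=36):
    """Factor by which full-scan time grows after `months`,
    if capacity and transfer rate keep doubling at the assumed periods."""
    return 2 ** (months / capacity_doubling) / 2 ** (months / bandwidth_doubling)

print(f"scan time: {fast:.1f}-{slow:.1f} s")
print(f"scan time growth after 36 months: {gap_growth(36):.1f}x")
```

Under these assumptions a full scan takes roughly 8.5 to 12.8 seconds today and doubles every three years, which is why scanning less data (sampling) matters more over time.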

Page 23: Intro to  AMPLab  and  Berkeley Data Analytics Stack

Key Insights

Input often noisy: exact computations do not guarantee exact answers
Error often acceptable if small and bounded
Main challenge: estimate errors for arbitrary computations
Alpha release (August, 2013)
» Allow users to build uniform and stratified samples
» Provide error bounds for simple aggregate queries
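
For a simple aggregate such as an average, an error bound on a uniform sample can be sketched with a textbook normal approximation. This is not BlinkDB's implementation: `approximate_mean`, the synthetic data, and the 95% level (z = 1.96) are assumptions made for illustration, and the finite-population correction is ignored.

```python
import math
import random
import statistics

def approximate_mean(values, sample_size, seed=0, z=1.96):
    """Estimate the mean from a uniform random sample.

    Returns (estimate, margin), where estimate +/- margin is an approximate
    95% confidence interval under a normal approximation.
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    estimate = statistics.fmean(sample)
    margin = z * statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, margin

# Synthetic data standing in for a large table column.
data_rng = random.Random(42)
data = [data_rng.gauss(100.0, 15.0) for _ in range(100_000)]

estimate, margin = approximate_mean(data, sample_size=1_000)
exact = statistics.fmean(data)
print(f"approx = {estimate:.2f} +/- {margin:.2f}, exact = {exact:.2f}")
```

The sample here is 1% of the data, and the margin shrinks with the square root of the sample size. That relationship is the trade between performance and accuracy that sampling exposes.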


Page 24: Intro to  AMPLab  and  Berkeley Data Analytics Stack

Example: Video Quality Diagnosis

» Latency: 772.34 sec (17TB input)
» Latency: 1.78 sec (1.7GB input)

Top 10 worst performers identical!
440x faster!

Page 25: Intro to  AMPLab  and  Berkeley Data Analytics Stack

GraphX

Combine data-parallel and graph-parallel computations
Provide powerful abstractions:
» PowerGraph, Pregel implemented in less than 20 LOC!

Leverage Spark’s fault tolerance
Alpha release: expected this fall
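
To give a feel for how small a Pregel-style abstraction can be, here is a minimal vertex-centric loop in plain Python. It is a sketch of the general model (vertices exchange messages in supersteps until no messages remain), not GraphX's API or its implementation. The function name `pregel` and its parameters are hypothetical.

```python
def pregel(vertices, edges, initial_message, vertex_program, send_message, merge):
    """Minimal Pregel-style loop.

    vertices: dict of vertex_id -> state
    edges: list of (src, dst) pairs
    """
    state = {v: vertex_program(v, s, initial_message) for v, s in vertices.items()}
    active = set(state)
    while active:
        inbox = {}
        for src, dst in edges:
            if src in active:
                msg = send_message(state[src], state[dst])
                if msg is not None:
                    inbox[dst] = merge(inbox[dst], msg) if dst in inbox else msg
        for v, msg in inbox.items():
            state[v] = vertex_program(v, state[v], msg)
        active = set(inbox)   # only vertices that received a message stay active
    return state

# Connected components by propagating the smallest vertex id.
undirected = [(1, 2), (2, 3), (4, 5)]
edges = undirected + [(b, a) for a, b in undirected]
labels = pregel(
    vertices={v: v for v in (1, 2, 3, 4, 5)},
    edges=edges,
    initial_message=float("inf"),
    vertex_program=lambda v, state, msg: min(state, msg),
    send_message=lambda src, dst: src if src < dst else None,
    merge=min,
)
print(labels)
```

Messages are computed from the previous superstep's state and applied together afterwards, and a vertex only sends when it would improve its neighbor, so the loop terminates once labels stop changing.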


Page 26: Intro to  AMPLab  and  Berkeley Data Analytics Stack

MLlib and MLbase

MLlib: high quality library for ML algorithms
» Will be released with Spark 0.8 (September, 2013)

MLbase: make ML accessible to non-experts
» Declarative interface: allow users to say what they want
  • E.g., classify(data)
» Automatically pick best algorithm for given data, time
» Allow developers to easily add and test new algorithms
» Alpha release of MLI, first component of MLbase, in September, 2013
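
The declarative idea behind `classify(data)` can be sketched as a function that tries several candidate learners and keeps the one that scores best on held-out data. Only the name `classify` comes from the slide. The two toy learners, the even/odd holdout split, and the return value are assumptions for illustration and do not reflect MLbase's actual interface or selection strategy.

```python
from collections import Counter

def train_majority(rows):
    """Baseline: always predict the most common training label."""
    label = Counter(y for _, y in rows).most_common(1)[0][0]
    return lambda x: label

def train_nearest_centroid(rows):
    """Predict the label whose mean feature vector is closest."""
    sums, counts = {}, Counter()
    for x, y in rows:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, xi in enumerate(x):
            acc[i] += xi
        counts[y] += 1
    centroids = {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

    def predict(x):
        return min(centroids,
                   key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))
    return predict

def accuracy(model, rows):
    return sum(model(x) == y for x, y in rows) / len(rows)

def classify(data, candidates=(train_majority, train_nearest_centroid)):
    """Caller states the task; the function picks the algorithm."""
    train, holdout = data[::2], data[1::2]
    scored = [(accuracy(trainer(train), holdout), trainer) for trainer in candidates]
    best_score, best_trainer = max(scored, key=lambda t: t[0])
    return best_trainer(data), best_trainer.__name__, best_score

data = [
    ((0.0, 1.0), "low"), ((1.0, 0.0), "low"),
    ((10.0, 9.0), "high"), ((9.0, 10.0), "high"),
    ((0.5, 0.5), "low"), ((1.0, 1.0), "low"),
    ((9.5, 9.5), "high"), ((10.0, 10.0), "high"),
]
model, chosen, score = classify(data)
print(chosen, score, model((2.0, 1.0)), model((8.0, 9.0)))
```

New algorithms are added by appending another trainer to `candidates`, which mirrors the slide's point that developers should be able to add and test algorithms easily.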


Page 27: Intro to  AMPLab  and  Berkeley Data Analytics Stack

Tachyon

In-memory, fault-tolerant storage system
Flexible API, including HDFS API
Allow multiple frameworks (including Hadoop) to share in-memory data
Alpha release (June, 2013)


Page 28: Intro to  AMPLab  and  Berkeley Data Analytics Stack

Compatibility to Existing Ecosystem

Data Processing Layer: Spark, Spark Streaming, Shark SQL, BlinkDB, GraphX, MLlib, MLbase
» Spark Streaming: accept inputs from Kafka, Flume, Twitter, TCP Sockets, …
» Shark: Hive API
» GraphX: GraphLab API

Resource Management Layer: Mesos
» Support Hadoop, Storm, MPI

Storage Layer: HDFS, S3, …, Tachyon
» Tachyon: HDFS API

Page 29: Intro to  AMPLab  and  Berkeley Data Analytics Stack

Summary

BDAS: address next Big Data challenges
Unify batch, interactive, and streaming computations
Easy to develop sophisticated applications
» Support graph & ML algorithms, approximate queries

Witnessed significant adoption
» 20+ companies, 70+ individuals contributing code

Exciting ongoing work
» MLbase, GraphX, BlinkDB, …

[Diagram: Spark unifying batch, interactive, and streaming]

Page 30: Intro to  AMPLab  and  Berkeley Data Analytics Stack

AMPCamp Schedule (Today)

Rest of this session: AMPCamp Curriculum, Mesos
10:45-12:45pm: Spark, Shark, and Spark Streaming
12:45-2pm: Lunch
2-4:30pm: Hands-on exercises (Spark, Shark, Spark Streaming)
5-6:30pm: User presentations (Conviva, Ooyala, Yahoo!)
6:30-8:30pm: Reception

Page 31: Intro to  AMPLab  and  Berkeley Data Analytics Stack

AMPCamp Schedule (Tomorrow)

9-10:15am: BlinkDB, MLbase
10:45-12:45pm: Hands-on exercises (BlinkDB, MLbase, Mesos)
12:45-2:15pm: Lunch
2:15-3:15pm: Introduction to Tachyon and GraphX
3:15-3:30pm: Wrap Up and Concluding Remarks