Conquering Big Data with BDAS (Berkeley Data...

65
Conquering Big Data with BDAS (Berkeley Data Analytics) UC BERKELEY Ion Stoica UC Berkeley / Databricks / Conviva

Transcript of Conquering Big Data with BDAS (Berkeley Data...

Page 1: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

Conquering Big Data with BDAS (Berkeley Data Analytics)

UC  BERKELEY  

Ion Stoica UC Berkeley / Databricks / Conviva

Page 2: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

Extracting Value from Big Data Insights, diagnosis, e.g., » Why is user engagement dropping? » Why is the system slow? » Detect spam, DDoS attacks

Decisions, e.g., » Decide what features to add to a product » Personalized medical treatment » Decide when to change an aircraft engine part » Decide what ads to show

Data only as useful as the decisions it enables

Page 3: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

What do We Need? Interactive queries: enable faster decisions » E.g., identify why a site is slow and fix it

Queries on streaming data: enable decisions on real-time data » E.g., fraud detection, detect DDoS attacks

Sophisticated data processing: enable “better” decisions » E.g., anomaly detection, trend analysis

Page 4: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

Our Goal

Batch

Interactive Streaming

Single Framework! "

Support batch, streaming, and interactive computations… … in a unified framework

Easy to develop sophisticated algorithms (e.g., graph, ML algos)

Page 5: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

The Berkeley AMPLab January 2011 – 2017 » 8 faculty » > 40 students » 3 software engineer team

Organized for collaboration

3 day retreats (twice a year)

lgorithms    

achines     eople    

220 campers (100+ companies)

AMPCamp3 (August, 2013)

Page 6: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

The Berkeley AMPLab Governmental and industrial funding:

Goal: Next generation of open source data analytics stack for industry & academia: Berkeley Data Analytics Stack (BDAS)

Page 7: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

Data Processing Stack

Data Processing Layer

Resource Management Layer

Storage Layer

Page 8: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

Hadoop Stack

Data Processing Layer

Resource Management Layer

Storage Layer

… Hadoop MR

Hive Pig Impala Storm

Hadoop Yarn

HDFS, S3, …

Page 9: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

BDAS Stack

Data Processing Layer

Resource Management Layer

Storage Layer

Mesos

Spark

Spark Streaming Shark SQL

BlinkDB GraphX

MLlib

MLBase

HDFS, S3, …

Tachyon

Page 10: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

Mesos

HDFS, S3, …

Tachyon

Spark

Spark Streaming Shark SQL

BlinkDB GraphX

MLlib

MLBase

How do BDAS & Hadoop fit together?

Hadoop Yarn

HDFS, S3, …

Page 11: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

Mesos

HDFS, S3, …

Tachyon

How do BDAS & Hadoop fit together?

Hadoop Yarn

HDFS, S3, …

Spark

Spark Streaming Shark SQL

BlinkDB GraphX

MLlib

MLBase Spark Straming Shark

SQL

Graph X ML

library

BlinkDB MLbase

Spark Hadoop MR

Hive Pig Impala Storm

Page 12: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

Mesos

HDFS, S3, …

Tachyon

How do BDAS & Hadoop fit together?

Hadoop Yarn

HDFS, S3, …

Spark Straming Shark

SQL

Graph X ML

library

BlinkDB MLbase

Spark Hadoop MR

Hive Pig Impala Storm

Page 13: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

Apache Mesos Enable multiple frameworks to share same cluster resources (e.g., Hadoop, Storm, Spark) Twitter’s large scale deployment » 10,000+ servers, » 500+ engineers running jobs on Mesos

Third party Mesos schedulers » AirBnB’s Chronos » Twitter’s Aurora

Mesospehere: startup to commercialize Mesos

Mesos Spark

Spark Stream. Shark

BlinkDB GraphX

MLlib MLBase

HDFS, S3, …

Tachyon

Page 14: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

Apache Spark Distributed Execution Engine » Fault-tolerant, efficient in-memory storage » Low-latency large-scale task scheduler » Powerful prog. model and APIs: Python, Java, Scala

Fast: up to 100x faster than Hadoop MR » Can run sub-second jobs on hundreds of nodes

Easy to use: 2-5x less code than Hadoop MR General: support interactive & iterative apps

Mesos Spark

Spark Stream. Shark

BlinkDB GraphX

MLlib MLBase

HDFS, S3, …

Tachyon

Page 15: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

Fault Tolerance Need to achieve » High throughput reads and writes » Efficient memory usage

Replication » Writes bottlenecked by network » Inefficient: store multiple replicas

Persist update logs » Big data processing can generate massive logs

Our solution: Resilient Distributed Datasets (RDDs) » Partitioned collection of immutable records » Use lineage to reconstruct lost data

Mesos Spark

Spark Stream. Shark

BlinkDB GraphX

MLlib MLBase

HDFS, S3, …

Tachyon

Page 16: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

RDD Example Two-partition RDD A={A1, A2} stored on disk

1)  filter and cache à RDD B 2)  joinà RDD C 3)  aggregate à RDD D

A1

A2

RDD

A

agg. D

filter

filter B2

B1 join

join C2

C1

Mesos Spark

Spark Stream. Shark

BlinkDB GraphX

MLlib MLBase

HDFS, S3, …

Tachyon

Page 17: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

RDD Example C1 lost due to node failure before reduce finishes

A1

A2

RDD

A

agg. D

filter

filter B2

B1 join

join C2

C1

Mesos Spark

Spark Stream. Shark

BlinkDB GraphX

MLlib MLBase

HDFS, S3, …

Tachyon

Page 18: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

RDD Example C1 lost due to node failure before reduce finishes Reconstruct C1, eventually, on different node

A1

A2

RDD

A

agg. D

filter

filter B2

B1

join C2

agg. D

join C1

Mesos Spark

Spark Stream. Shark

BlinkDB GraphX

MLlib MLBase

HDFS, S3, …

Tachyon

Page 19: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

Spark Streaming Existing solutions: record-by-record processing Low latency

Hard to » Provide fault tolerance » Mitigate straggler

Mesos Spark

Spark Stream. Shark

BlinkDB GraphX

MLlib MLBase

HDFS, S3, …

Tachyon

node 1

node 3

input records

node 2

input records

filter

filter

agg.

Page 20: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

Spark Streaming Implemented as sequence of micro-jobs (<1s) » Fault tolerant » Mitigate stragglers » Ensure exactly one semantics

Mesos Spark

Spark Stream. Shark

BlinkDB GraphX

MLlib MLBase

HDFS, S3, …

Tachyon

Spark

batches of X seconds

live data stream

processed results

Spark Streaming

Spark & SparkStreaming: batch, interactive, and streaming computations

Page 21: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

Shark

Hive over Spark: full support for HQL and UDFs Up to 100x when input is in memory Up to 5-10x when input is on disk

Running on hundreds of nodes at Yahoo!

Mesos Spark

Spark Stream. Shark

BlinkDB GraphX

MLlib MLBase

HDFS, S3, …

Tachyon

Page 22: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

Not Only General, but Fast

Hive

Impa

la (d

isk)

Impa

la (m

em)

Shar

k (d

isk)

Shar

k (m

em)

0

5

10

15

20

25

30

35

40

45

Resp

onse

Tim

e (s)

Interactive (SQL)

Stor

m

Spar

k

0

5

10

15

20

25

30

35

Thro

ughp

ut (M

B/s/

node

)

Streaming

Hado

op

Spar

k

0

20

40

60

80

100

120

140

Tim

e pe

r Ite

ratio

n (s)

Batch (ML, Spark)

Page 23: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

Spark Distribution Includes » Spark (core) » Spark Streaming » GraphX (alpha release) » MLlib

In the future: » Shark » Tachyon

Mesos

BlinkDB MLlib

HDFS, S3, …

Tachyon

Spark

Spark Stream. Shark

GraphX MLBase

Page 24: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

Explosive Growth 2,500+ Spark meetup users 180+ contributors from 30+ companies 1st Spark Summit » 450+ attendees » 140+ companies

2nd Spark Summit » June 30 – July 2

Giraph! Storm!Tez!

0!

20!

40!

60!

80!

100!

120!

140!

Contributors in past year

Mesos

BlinkDB MLlib

HDFS, S3, …

Tachyon

Spark

Spark Stream. Shark

GraphX MLBase

Page 25: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

Explosive Growth Databricks: founded in 2013 to commercialize Spark Platform Included in all major Hadoop Distributions » Cloudera » MapR » Hortonworks (technical preview)

Enterprise support: Cloudera, MapR, Datastax Spark and Shark available on Amazon’s EMR

Mesos

BlinkDB MLlib

HDFS, S3, …

Tachyon

Spark

Spark Stream. Shark

GraphX MLBase

Page 26: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

BlinkDB Trade between query performance and accuracy using sampling Why? » In-memory processing doesn’t

guarantee interactive processing •  E.g., ~10’s sec just to scan 512 GB

RAM! •  Gap between memory capacity and

transfer rate increasing

Mesos Spark

Spark Stream. Shark

BlinkDB GraphX

MLlib MLBase

HDFS, S3, …

Tachyon

512GB

16 cores

40-60GB/s

doubles every 18 months

doubles every 36 months

Page 27: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

Key Insight

•  Input often noisy: exact computations do not guarantee exact answers

•  Error often acceptable if small and bounded

Computations  don’t  always  need  exact  answers  

Best scale ± 200g error

Speedometers ± 2.5 % error (edmunds.com)

OmniPod Insulin Pump ± 0.96 % error (www.ncbi.nlm.nih.gov/pubmed/22226273)

Page 28: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

Approach: Sampling Compute results on samples instead of full data » Typically, error depends on sample size (n) not on

original data size, i.e., error α

Can trade between answer’s latency and accuracy and cost

1/ n

Page 29: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

BlinkDB Interface

SELECT avg(sessionTime) FROM Table WHERE city=‘San Francisco’ AND ‘dt=2012-9-2’ WITHIN 1 SECONDS 234.23  ±  15.32  

Page 30: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

BlinkDB Interface

SELECT avg(sessionTime) FROM Table WHERE city=‘San Francisco’ AND ‘dt=2012-9-2’ WITHIN 2 SECONDS

239.46  ±  4.96  SELECT  avg(sessionTime)    

FROM  Table    WHERE  city=‘San  Francisco’  AND  ‘dt=2012-­‐9-­‐2’  ERROR  0.1  CONFIDENCE  95.0%  

234.23  ±  15.32  

Page 31: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

Quick Results Dataset » 365 mil rows, 204GB on disk » 600+ GB in memory (deserialized format)

Query: query computing 95-th percentile 25 EC2 instances with » 4 cores » 15GB RAM

Page 32: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

Query Response Time

0 100 200 300 400 500 600 700 800 900

1000

1 10-1 10-2 10-3 10-4 10-5

Fraction of full data

seconds

Query Response Time

103

1020

18 13 10 8

10x as response time is dominated by I/O

Page 33: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

Query Response Time

0 100 200 300 400 500 600 700 800 900

1000

1 10-1 10-2 10-3 10-4 10-5

Fraction of full data

seconds

Query Response Time

103

1020

18 13 10 8

Error Bars (0.02%)

(0.07%) (1.1%) (3.4%) (11%)

Page 34: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

BlinkDB Overview

TABLE

Original Data

Offline-sampling: Optimal set of samples across different dimensions (columns or sets of columns) to support ad-hoc exploratory queries

Sam

plin

g M

odul

e

Page 35: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

BlinkDB Overview

TABLE

Sam

pling

Mod

ule

In-Memory Samples

On-Disk Samples

Original Data

Sample Placement: Samples striped over 100s or 1,000s of machines both on disks and in-memory.

Page 36: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

BlinkDB Overview

SELECT foo (*) FROM TABLE

WITHIN 2

Query Plan

HiveQL/SQL Query

Sample Selection

TABLE

Sam

pling

Mod

ule

In-Memory Samples

On-Disk Samples

Original Data

Page 37: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

BlinkDB Overview

SELECT foo (*) FROM TABLE

WITHIN 2

Query Plan

HiveQL/SQL Query

Sample Selection

TABLE

Sam

pling

Mod

ule

In-Memory Samples

On-Disk Samples

Original Data

Online sample selection to pick best sample(s) based on query latency and accuracy requirements

Page 38: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

BlinkDB Overview

TABLE

Sam

pling

Mod

ule

In-Memory Samples

On-Disk Samples

Original Data

Hadoop/Spark

SELECT

foo (*) FROM TABLE

WITHIN 2

New Query Plan

HiveQL/SQL Query

Sample Selection

Error Bars & Confidence Intervals

Result 182.23 ± 5.56

(95% confidence)

Parallel query execution on multiple samples striped across multiple machines.

Page 39: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

BlinkDB Challenges Which set of samples to build given a storage budget? Which sample to run the query on?

How to accurately estimate the error?

Page 40: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

BlinkDB Challenges Which set of samples to build given a storage budget? Which sample to run the query on?

How to accurately estimate the error?

Page 41: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

How to Accurately Estimate Error?

Close formulas for limited number of operators » E.g., count, mean, percentiles

What about user defined functions (UDFs)? » Use bootsrap technique

Page 42: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

Bootstrap Quantify accuracy of a sample estimator, f()

f (X)

S

random  sample  

Distribution  X

|S| = N f (S)

can’t  compute  f (X) as  we  don’t  have  X

what  is  f(S)’s  error?

use  f(S1), …, f(SK) to  compute  • estimator:  mean(f(Si) • error,  e.g.:  stdev(f(Si))

S1

Sk

…  

f (S1)

f (Sk)

…  

|Si| = N

sampling  with    

replacement  

Page 43: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

Bootstrap for BlinkDB Quantify accuracy of a query on a sample table

Q(T) Q(T) takes  too  long!

Q(S) what  is  Q(S)’s  error?

sample  

|S| = N S

T Original    Table  

use  Q(S1), …, Q(SK) to  compute  estimator  quality  assessment,  e.g.,  stdev(Q(Si))

Q (S1)

Q (Sk)

…  

|Si| = N

sampling  with    

replacement  

S1

Sk

…  

Page 44: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

How Do You Know Error Estimation is Correct? Assumption: f() is Hadamard differentiable » How do you know an UDF is Hadamard

differentiable? » Sufficient, not necessary condition

Only approximations of true error distribution (true for closed formula as well)

Previous work doesn’t address error estimation correctness

Page 45: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

How Bad it Is? Workloads » Conviva: 268 real-world 113 had

custom User-Defined Functions » Facebook

Closed Forms/Bootstrap fails for » 3 in 10 Conviva Queries » 4 in 10 Facebook Queries

Need runtime diagnosis!

Page 46: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

Error Diagnosis Compare bootstrapping with ground truth for small samples

Check whether error improves as sample size increases

Page 47: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

T Original Table

Ground Truth (Approximation)

Page 48: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

T Original Table

S1

Sp

…  

Q (S1)

Ground Truth (Approximation)

ξ = stdev(Q(Sj ))

Q (Sp) Estimator quality assessment

Page 49: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

T Original Table

S1

Sp

…  

S11

S1k

…  

Bootstrap on Individual Samples

Q (S11)

Q (S1k)

…  

Sp1

Spk

…  

Q (Sp1)

Q (Spk)

…  

Ground Truth and Bootstrap

ξ *1 = stdev(Q(S1 j ))

…  

ξ *p = stdev(Q(Spj ))

Page 50: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

Ground Truth vs. Bootstrap Re

lative

Erro

r

Sample Size

Bootstrap Ground Truth

n1 n2 n3

…ξ *i1 = stdev(Q(Si1 j )) ξ *ip = stdev(Q(Sipj ))ξi =mean(ξij )

Page 51: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

Ground Truth vs. Bootstrap

Sample Size

Bootstrap Ground Truth

n1 n2 n3

Relat

ive E

rror

…ξ *i1 = stdev(Q(Si1 j )) ξ *ip = stdev(Q(Sipj ))ξi =mean(ξij )

Page 52: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

Ground Truth vs. Bootstrap

Sample Size

Bootstrap Ground Truth

n1 n2 n3

Relat

ive E

rror

…ξi =mean(ξij ) ξ *i1 = stdev(Q(Si1 j )) ξ *ip = stdev(Q(Sipj ))

Page 53: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

How Well Does it Work in Practice?

Evaluated on Conviva Query Workload Diagnostic predicted that 207 (77%) queries can be approximated » False Negatives: 18 » False Positives: 3 (conditional UDFs)

Page 54: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

Overhead Boostrap and Diagnostic overheads can be very large » Diagnostics requires to run 30,000 small queries!

Optimization » Pushdown filter » One pass execution

Can reduce overhead by orders of magnitude

Page 55: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

Query + Bootstrap + Diagnosis"

0 100 200 300 400 500 600 700 800 900

1000

Query Response Time Bootstrap overhead Diagnosis overhead

1 10-1 10-2 10-3 10-4 10-5

seconds 1020

Fraction of full data

426 436 442 467

723 Bootsrap + Diagnosis 53x

more than query itself!

Page 56: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

Optimization: Filter Pushdown Perform filtering before resampling » Can dramatically reduce I/O

…  

Sample N elements  

resample N  

filter

resample N  

n  filter n  

resample resample …  n   n  

n  n  

Sample

filter

N  

Assume n << N  

Page 57: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

Optimization: One Pass Exec.

For each resample add new column » Specify how many times each row has been selected » Query generate results for each resamples in one pass

resample resample …  

Sample

filter

resample [k]

Sample

filter

[k]

Poissoned resampling: construct all k resamples

in one pass!

Compute all k results in one pass!

Page 58: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

0

50

100

150

200

250

300

Query + Bootstrap + Diagnosis"(with Filter Pushdown and Single Pass)

Fraction of full data 1 10-1 10-2 10-3 10-4 10-5

sec.

Query Response Time Bootstrap overhead

1020 Diagnosis overhead

120

26 19 15 13

Less than 70% overhead (80x reduction)

Page 59: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

BlinkDB is open-sourced and released at http://blinkdb.org

Open Source

Page 60: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

Used regularly by 100+ engineers at Facebook Inc.

Open Source

60  

Page 61: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

Lesson Learned Focus on novel usage scenarios

Be paranoid about simplicity » Very hard to build real complex systems in academia

Basic functionality first, performance opt. next

Page 62: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

Example: Mesos Focus on novel usage scenarios » Let multiple frameworks share same cluster

Be paranoid about simplicity » Just enforce allocation policy across frameworks,

e.g., fair sharing » Let frameworks decide which slots they accept,

which tasks to run, and when to run them » First release: 10K code

Basic functionality first, performance opt. next » First, support arbitrary scheduling, but inneficient » Latter added filters to improve performance

Page 63: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

Example: Spark Focus on novel usage scenarios » Interactive queries and iterative (ML) algorithms

Be paranoid about simplicity » Immutable data; avoid complex consistency

protocols » First release: 2K code

Basic functionality first, performance opt. next » First, no automatic check-pointing » Latter to add automatic checkpoint

Page 64: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

Example: BlinkDB Focus on novel usage scenarios » Approximate computations; error diagnosis

Be paranoid about simplicity » Use bootstrap as generic technique; no support for

close formula

Basic functionality first, performance opt. next » First, straightforward error estimation, very expensive » Latter, optimizations to reduce overhead » First, manual sample generation » Later, automatic sample generation (to do)

Page 65: Conquering Big Data with BDAS (Berkeley Data Analytics)datasys.cs.iit.edu/events/CCGrid2014/CCGrid-May25-Stoica.pdf · Interactive queries: enable faster decisions » E.g., identify

Summary BDAS: address next Big Data challenges

Unify batch, interactive, and streaming

Enable users to trade between » Response time, accuracy, and cost

Explosive adoption » 30+ companies, 180+ individual contributors (Spark) » Spark platform included in all major Hadoop distros » Enterprise grade support and professional services for

both Mesos and Spark platform

Batch

Interactive Streaming

Spark