Turning Data into Value Ion Stoica CEO, Databricks (also, UC Berkeley and Conviva) UC BERKELEY.

22
Turning Data into Value Ion Stoica CEO, Databricks (also, UC Berkeley and Conviva) UC BERKELEY

Transcript of Turning Data into Value Ion Stoica CEO, Databricks (also, UC Berkeley and Conviva) UC BERKELEY.

Page 1: Turning Data into Value Ion Stoica CEO, Databricks (also, UC Berkeley and Conviva) UC BERKELEY.

Turning Data into Value

Ion StoicaCEO, Databricks

(also, UC Berkeley and Conviva)

UC BERKELEY

Page 2: Turning Data into Value Ion Stoica CEO, Databricks (also, UC Berkeley and Conviva) UC BERKELEY.

Data is Everywhere

Easier and cheaper than ever to collect

Data grows faster than Moore’s law

2010 2011 2012 2013 2014 20150

2

4

6

8

10

12

14

Moore's Law

(IDC report*)

Page 3: Turning Data into Value Ion Stoica CEO, Databricks (also, UC Berkeley and Conviva) UC BERKELEY.

The New Gold Rush

Everyone wants to extract value from data

»Big companies & startups alike

Huge potential»Already demonstrated by Google, Facebook,

But, untapped by most companies»“We have lots of data but no one is looking at

it!”

Page 4: Turning Data into Value Ion Stoica CEO, Databricks (also, UC Berkeley and Conviva) UC BERKELEY.

Extracting Value from Data HardData is massive, unstructured, and dirty

Question are complex

Processing, analysis tools still in their “infancy”

Need tools that are»Faster»More sophisticated»Easier to use

Page 5: Turning Data into Value Ion Stoica CEO, Databricks (also, UC Berkeley and Conviva) UC BERKELEY.

Turning Data into ValueInsights, diagnosis, e.g.,

»Why is user engagement dropping?»Why is the system slow?»Detect spam, DDoS attacks

Decisions, e.g.,»Decide what feature to add to a product»Personalized medical treatment»Decide when to change an aircraft engine

part»Decide what ads to show Data only as useful as the decisions it

enables

Page 6: Turning Data into Value Ion Stoica CEO, Databricks (also, UC Berkeley and Conviva) UC BERKELEY.

What do We Need?Interactive queries: enable faster decisions

»E.g., identify why a site is slow and fix it

Queries on streaming data: enable decisions on real-time data

»E.g., fraud detection, detect DDoS attacks

Sophisticated data processing: enable “better” decisions

»E.g., anomaly detection, trend analysis

Page 7: Turning Data into Value Ion Stoica CEO, Databricks (also, UC Berkeley and Conviva) UC BERKELEY.

Our Goal

Batch

Interactive

Streaming

SingleFramework!

Support batch, streaming, and interactive computations…

… in a unified framework

Easy to develop sophisticated algorithms (e.g., graph, ML algos)

Page 8: Turning Data into Value Ion Stoica CEO, Databricks (also, UC Berkeley and Conviva) UC BERKELEY.

The Need For Unification Today’s state-of-art analytics stack

Data (e.g., logs) Ad-Hoc queries

on historical data

Challenge 1: need to maintain three stacks

• Expensive and complex• Hard to compute consistent metrics across

stacks

Interactive querieson historical data

StreamingReal-Time Analytics

Batch

Interactive queries

Page 9: Turning Data into Value Ion Stoica CEO, Databricks (also, UC Berkeley and Conviva) UC BERKELEY.

The Need For Unification Today’s state-of-art analytics stack

Data (e.g., logs)

Batch

Streaming

Ad-Hoc querieson historical data

Real-Time Analytics

Interactive querieson historical data

Challenge 2: hard/slow to share data, e.g.,

»Hard to perform interactive queries on streamed data

Interactive queries

Page 10: Turning Data into Value Ion Stoica CEO, Databricks (also, UC Berkeley and Conviva) UC BERKELEY.

SparkUnifies batch, streaming, interactive comp.

Easy to build sophisticated applications»Support iterative, graph-parallel algorithms»Powerful APIs in Scala, Python, Java

Spark

SparkStreamin

g Shark SQL

BlinkDBGraphX MLlib

Streaming

Batch, Interactiv

e

Batch, Interactiv

e

Interactive

Data-parallel,Iterative

Sophisticated algos.

Page 11: Turning Data into Value Ion Stoica CEO, Databricks (also, UC Berkeley and Conviva) UC BERKELEY.

An Analogy

First cellularphones

Unified device

(smartphone)

Specializeddevices

z

Better Phone

Better GPS

Better Games

Page 12: Turning Data into Value Ion Stoica CEO, Databricks (also, UC Berkeley and Conviva) UC BERKELEY.

An Analogy

Batch processing

Unifiedsystem

Specializedsystems

Page 13: Turning Data into Value Ion Stoica CEO, Databricks (also, UC Berkeley and Conviva) UC BERKELEY.

Turning Data into Value, ExamplesUnify real-time and historical data analysis

»Easier to build and maintain»Cheaper to operate»Easier to get insights, faster decisions

Unify streaming and machine-learning»Faster diagnosis, decisions (e.g., better ad

targeting)

Unify graph processing and ETLs»Faster to get social network insights (e.g.,

improve user experience)

Page 14: Turning Data into Value Ion Stoica CEO, Databricks (also, UC Berkeley and Conviva) UC BERKELEY.

Unify Real-time & Historical Analysis

Single implementation (stack) providing»Streaming»Batch (pre-computing results)»Interactive computations/queries

Spark

SparkStreamin

g Shark SQL

BlinkDB

Page 15: Turning Data into Value Ion Stoica CEO, Databricks (also, UC Berkeley and Conviva) UC BERKELEY.

Unify Real-time & Historical Analysis

Batch and streaming codes virtually the same

»Easy to develop and maintain consistency// count words from a file (batch)val file = sc.textFile("hdfs://.../pagecounts-*.gz")val words = file.flatMap(line => line.split(" "))val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)wordCounts.print()

// count words from a network stream, every 10s (streaming)val ssc = new StreamingContext(args(0), "NetCount", Seconds(10), ..)val lines = ssc.socketTextStream("localhost”, 3456)val words = lines.flatMap(_.split(" "))val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)wordCounts.print()ssc.start()

Page 16: Turning Data into Value Ion Stoica CEO, Databricks (also, UC Berkeley and Conviva) UC BERKELEY.

Unify Streaming and ML

Sophisticated, real-time diagnosis & decisions, e.g.,

»Fraud detection»Detect denial of service attacks»Early notification of service degradation and

failures

Spark

SparkStreamin

gMLlib

Page 17: Turning Data into Value Ion Stoica CEO, Databricks (also, UC Berkeley and Conviva) UC BERKELEY.

Unify Graph Processing and ETLGraph-parallel systems (e.g., Pregel, GraphLab)

»Fast and scalable, but…»… inefficient for graph creation, post-

processing

GraphX: unifies graph processing and ETL

Spark

GraphX

Page 18: Turning Data into Value Ion Stoica CEO, Databricks (also, UC Berkeley and Conviva) UC BERKELEY.

Unify Graph Processing and ETL

Graph

Lab

Hadoop Graph Algorithms

Graph Creation (Hadoop)

PostProc.

GraphXGraph Creation

(Spark) Post Proc.(Spark)

Page 19: Turning Data into Value Ion Stoica CEO, Databricks (also, UC Berkeley and Conviva) UC BERKELEY.

“Crossing the Chasm”

Page 20: Turning Data into Value Ion Stoica CEO, Databricks (also, UC Berkeley and Conviva) UC BERKELEY.

Cloudera Partnership

Integrate Spark with Cloudera Manager

Spark will become part of CDH

Enterprise class support and professional services available for Spark

Page 21: Turning Data into Value Ion Stoica CEO, Databricks (also, UC Berkeley and Conviva) UC BERKELEY.

We are Committed to…

… open source»We believe that any successful analytics

stack will be open source

… improve integration with Hadoop »Enable every Hadoop user take advantage of

Spark

… work with partners to make Apache Spark successful for enterprise customers

Page 22: Turning Data into Value Ion Stoica CEO, Databricks (also, UC Berkeley and Conviva) UC BERKELEY.

SummaryEveryone collects but few extract value from data

Unification of comp. and prog. models key to

»Efficiently analyze data»Make sophisticated, real-time decisions

Spark is unique in unifying»batch, interactive, streaming computation

models»data-parallel and graph-parallel prog. models

Many use cases in the rest of the program!

Batch

Interactive

Streaming

Spark