UC Berkeley Above the Clouds A Berkeley View of Cloud Computing 1 UC Berkeley RAD Lab.
Turning Data into Value Ion Stoica CEO, Databricks (also, UC Berkeley and Conviva) UC BERKELEY.
-
Upload
kimberly-taylor -
Category
Documents
-
view
226 -
download
1
Transcript of Turning Data into Value Ion Stoica CEO, Databricks (also, UC Berkeley and Conviva) UC BERKELEY.
Turning Data into Value
Ion StoicaCEO, Databricks
(also, UC Berkeley and Conviva)
UC BERKELEY
Data is Everywhere
Easier and cheaper than ever to collect
Data grows faster than Moore’s law
2010 2011 2012 2013 2014 20150
2
4
6
8
10
12
14
Moore's Law
(IDC report*)
The New Gold Rush
Everyone wants to extract value from data
»Big companies & startups alike
Huge potential»Already demonstrated by Google, Facebook,
…
But, untapped by most companies»“We have lots of data but no one is looking at
it!”
Extracting Value from Data HardData is massive, unstructured, and dirty
Question are complex
Processing, analysis tools still in their “infancy”
Need tools that are»Faster»More sophisticated»Easier to use
Turning Data into ValueInsights, diagnosis, e.g.,
»Why is user engagement dropping?»Why is the system slow?»Detect spam, DDoS attacks
Decisions, e.g.,»Decide what feature to add to a product»Personalized medical treatment»Decide when to change an aircraft engine
part»Decide what ads to show Data only as useful as the decisions it
enables
What do We Need?Interactive queries: enable faster decisions
»E.g., identify why a site is slow and fix it
Queries on streaming data: enable decisions on real-time data
»E.g., fraud detection, detect DDoS attacks
Sophisticated data processing: enable “better” decisions
»E.g., anomaly detection, trend analysis
Our Goal
Batch
Interactive
Streaming
SingleFramework!
Support batch, streaming, and interactive computations…
… in a unified framework
Easy to develop sophisticated algorithms (e.g., graph, ML algos)
The Need For Unification Today’s state-of-art analytics stack
Data (e.g., logs) Ad-Hoc queries
on historical data
Challenge 1: need to maintain three stacks
• Expensive and complex• Hard to compute consistent metrics across
stacks
Interactive querieson historical data
StreamingReal-Time Analytics
Batch
Interactive queries
The Need For Unification Today’s state-of-art analytics stack
Data (e.g., logs)
Batch
Streaming
Ad-Hoc querieson historical data
Real-Time Analytics
Interactive querieson historical data
Challenge 2: hard/slow to share data, e.g.,
»Hard to perform interactive queries on streamed data
Interactive queries
SparkUnifies batch, streaming, interactive comp.
Easy to build sophisticated applications»Support iterative, graph-parallel algorithms»Powerful APIs in Scala, Python, Java
Spark
SparkStreamin
g Shark SQL
BlinkDBGraphX MLlib
Streaming
Batch, Interactiv
e
Batch, Interactiv
e
Interactive
Data-parallel,Iterative
Sophisticated algos.
An Analogy
First cellularphones
Unified device
(smartphone)
Specializeddevices
z
Better Phone
Better GPS
Better Games
An Analogy
Batch processing
Unifiedsystem
Specializedsystems
Turning Data into Value, ExamplesUnify real-time and historical data analysis
»Easier to build and maintain»Cheaper to operate»Easier to get insights, faster decisions
Unify streaming and machine-learning»Faster diagnosis, decisions (e.g., better ad
targeting)
Unify graph processing and ETLs»Faster to get social network insights (e.g.,
improve user experience)
Unify Real-time & Historical Analysis
Single implementation (stack) providing»Streaming»Batch (pre-computing results)»Interactive computations/queries
Spark
SparkStreamin
g Shark SQL
BlinkDB
Unify Real-time & Historical Analysis
Batch and streaming codes virtually the same
»Easy to develop and maintain consistency// count words from a file (batch)val file = sc.textFile("hdfs://.../pagecounts-*.gz")val words = file.flatMap(line => line.split(" "))val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)wordCounts.print()
// count words from a network stream, every 10s (streaming)val ssc = new StreamingContext(args(0), "NetCount", Seconds(10), ..)val lines = ssc.socketTextStream("localhost”, 3456)val words = lines.flatMap(_.split(" "))val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)wordCounts.print()ssc.start()
Unify Streaming and ML
Sophisticated, real-time diagnosis & decisions, e.g.,
»Fraud detection»Detect denial of service attacks»Early notification of service degradation and
failures
Spark
SparkStreamin
gMLlib
Unify Graph Processing and ETLGraph-parallel systems (e.g., Pregel, GraphLab)
»Fast and scalable, but…»… inefficient for graph creation, post-
processing
GraphX: unifies graph processing and ETL
Spark
GraphX
Unify Graph Processing and ETL
Graph
Lab
Hadoop Graph Algorithms
Graph Creation (Hadoop)
PostProc.
GraphXGraph Creation
(Spark) Post Proc.(Spark)
“Crossing the Chasm”
Cloudera Partnership
Integrate Spark with Cloudera Manager
Spark will become part of CDH
Enterprise class support and professional services available for Spark
We are Committed to…
… open source»We believe that any successful analytics
stack will be open source
… improve integration with Hadoop »Enable every Hadoop user take advantage of
Spark
… work with partners to make Apache Spark successful for enterprise customers
SummaryEveryone collects but few extract value from data
Unification of comp. and prog. models key to
»Efficiently analyze data»Make sophisticated, real-time decisions
Spark is unique in unifying»batch, interactive, streaming computation
models»data-parallel and graph-parallel prog. models
Many use cases in the rest of the program!
Batch
Interactive
Streaming
Spark