Berkeley Data Analytics Stack Prof. Harold Liu 15 December 2014.
Berkeley Data Analytics Stack
Prof. Harold Liu
15 December 2014
Data Processing Goals
• Low latency (interactive) queries on historical data: enable faster decisions
  – E.g., identify why a site is slow and fix it
• Low latency queries on live data (streaming): enable decisions on real-time data
  – E.g., detect & block worms in real-time (a worm may infect 1 million hosts in 1.3 sec)
• Sophisticated data processing: enable “better” decisions
  – E.g., anomaly detection, trend analysis
Today’s Open Analytics Stack…
• …mostly focused on large on-disk datasets: great for batch but slow

[Stack diagram: Application / Data Processing / Storage / Infrastructure]
Goals
• One stack to rule them all: batch, interactive, and streaming
• Easy to combine batch, streaming, and interactive computations
• Easy to develop sophisticated algorithms
• Compatible with existing open source ecosystem (Hadoop/HDFS)
Support Interactive and Streaming Comp.
• Aggressive use of memory
• Why?
1. Memory transfer rates >> disk or SSDs
2. Many datasets already fit into memory
• Inputs of over 90% of jobs in Facebook, Yahoo!, and Bing clusters fit into memory
• e.g., 1TB = 1 billion records @ 1KB each
3. Memory density (still) grows with Moore’s law
• RAM/SSD hybrid memories on the horizon

High-end datacenter node:
• 16 cores
• 128-512GB RAM (40-60GB/s)
• 1-4TB SSD (1-4GB/s, x4 disks)
• 10-30TB HDD (0.2-1GB/s, x10 disks)
• 10Gbps network
Support Interactive and Streaming Comp.
• Increase parallelism
• Why?
  – Reduce work per node to improve latency
• Techniques:
  – Low-latency parallel schedulers that achieve high locality
  – Optimized parallel communication patterns (e.g., shuffle, broadcast)
  – Efficient recovery from failures and straggler mitigation

[Diagram: the same job finishes in time Tnew < T when run with more parallelism]
Support Interactive and Streaming Comp.
• Trade between result accuracy and response times
• Why?
  – In-memory processing does not guarantee interactive query processing
    • e.g., ~10’s sec just to scan 512GB RAM!
    • Gap between memory capacity and transfer rate increasing: on a 16-core node with 128-512GB RAM at 40-60GB/s, capacity doubles every 18 months while transfer rate doubles every 36 months
• Challenges:
  – accurately estimate error and running time for…
  – … arbitrary computations
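The “~10’s sec just to scan RAM” claim follows directly from the node specs quoted above; a quick back-of-the-envelope check (512GB and the 40-60GB/s bandwidth range come from the slide; taking the midpoint is my choice):

```python
# Time to scan a full 512GB of RAM at the slide's quoted memory bandwidth.
ram_gb = 512
bandwidth_gb_per_s = 50  # midpoint of the 40-60GB/s range
scan_seconds = ram_gb / bandwidth_gb_per_s
print(f"{scan_seconds:.1f} s")  # ~10 s just to read memory once
```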
Berkeley Data Analytics Stack (BDAS)
[Stack diagram: Application / Data Processing / Data Management / Resource Management / Infrastructure]

• Resource Management: share infrastructure across frameworks (multi-programming for datacenters)
• Data Management: efficient data sharing across frameworks
• Data Processing: in-memory processing; trade between time, quality, and cost
• Application: new apps such as AMP-Genomics, Carat, …
Berkeley AMPLab
• “Launched” January 2011: 6-year plan
  – 8 CS faculty
  – ~40 students
  – 3 software engineers
• Organized for collaboration
• Funding:
  – XData, CISE Expedition Grant
  – Industrial, founding sponsors
  – 18 other sponsors, including …
• Goal: next generation of analytics data stack for industry & research:
  – Berkeley Data Analytics Stack (BDAS)
  – Release as open source
Berkeley Data Analytics Stack (BDAS)
• Existing stack components:
  – Data Processing: Hadoop, HIVE, Pig, HBase, Storm, MPI, …
  – Data Management: HDFS
  – Resource Management: (none yet)
Mesos
• Management platform that allows multiple frameworks to share a cluster
• Compatible with existing open analytics stack
• Deployed in production at Twitter on 3,500+ servers
Spark
• In-memory framework for interactive and iterative computations
  – Resilient Distributed Dataset (RDD): a fault-tolerant, in-memory storage abstraction
• Scala interface, Java and Python APIs
Spark Streaming [Alpha Release]
• Large scale streaming computation
• Ensures exactly-once semantics
• Integrated with Spark: unifies batch, interactive, and streaming computations!
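Spark Streaming’s core idea is to chop a live stream into small batches and run the same batch operators on each one. A minimal plain-Python sketch of that discretized-stream idea (the batch size and the word-counting operator are illustrative choices, not Spark’s API):

```python
from collections import Counter

def micro_batches(stream, batch_size):
    """Chop an unbounded event stream into small fixed-size batches."""
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch

def count_words(batch):
    """The same 'batch' operator is reused on every micro-batch."""
    return Counter(word for line in batch for word in line.split())

stream = ["error disk full", "ok", "error timeout", "ok ok"]
for counts in (count_words(b) for b in micro_batches(stream, batch_size=2)):
    print(dict(counts))
```

Because each micro-batch goes through the same operator as a batch job would, batch and streaming code can share one execution engine, which is the unification the slide claims.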
Shark (Spark SQL)
• HIVE over Spark: SQL-like interface (supports Hive 0.9)
  – up to 100x faster for in-memory data, and 5-10x for disk
  – in tests on a cluster of hundreds of nodes
Tachyon
• High-throughput, fault-tolerant in-memory storage
• Interface compatible with HDFS
• Support for Spark and Hadoop
BlinkDB
• Large scale approximate query engine
• Allows users to specify error or time bounds
• Preliminary prototype being tested at Facebook
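The accuracy/latency trade behind BlinkDB comes from answering a query on a sample and attaching an error bound. A toy plain-Python illustration of the underlying statistics (the population, sample size, and 1.96-sigma confidence formula are textbook choices, not BlinkDB’s actual sampling planner):

```python
import random, statistics, math

random.seed(0)
# Stand-in "table" of 200,000 numeric values.
population = [random.gauss(100.0, 15.0) for _ in range(200_000)]

# Answer AVG(population) from a small sample, with a ~95% confidence bound.
sample = random.sample(population, 5_000)
estimate = statistics.fmean(sample)
stderr = statistics.stdev(sample) / math.sqrt(len(sample))
print(f"estimate = {estimate:.2f} +/- {1.96 * stderr:.2f}")

true_mean = statistics.fmean(population)
print(f"true mean = {true_mean:.2f}")  # falls inside the bound with high probability
```

Scanning 5,000 rows instead of 200,000 is the speedup; the reported `+/-` bound is what lets the user decide whether the approximation is good enough.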
SparkGraph
• GraphLab API and toolkits on top of Spark
• Fault tolerance by leveraging Spark
MLbase / MLlib
• Declarative approach to ML
• Develop scalable ML algorithms
• Make ML accessible to non-experts
Compatible with Open Source Ecosystem
• Support existing interfaces whenever possible
• GraphLab API
• Hive interface and shell
• HDFS API
• Compatibility layer for Hadoop, Storm, MPI, etc. to run over Mesos
Compatible with Open Source Ecosystem
• Use existing interfaces whenever possible
Support HDFS API, S3 API, and Hive metadata
Support Hive API
Accept inputs from Kafka, Flume, Twitter, TCP Sockets, …
Summary
• Support interactive and streaming computations
  – In-memory, fault-tolerant storage abstraction, low-latency scheduling, …
• Easy to combine batch, streaming, and interactive computations
  – Spark execution engine supports all comp. models
• Easy to develop sophisticated algorithms
  – Scala interface, APIs for Java, Python, Hive QL, …
  – New frameworks targeted to graph-based and ML algorithms
• Compatible with existing open source ecosystem
• Open source (Apache/BSD) and fully committed to releasing high-quality software
  – Three-person software engineering team led by Matt Massie (creator of Ganglia, 5th Cloudera engineer)
Spark
In-Memory Cluster Computing for Iterative and Interactive Applications
UC Berkeley
Background
• Commodity clusters have become an important computing platform for a variety of applications
– In industry: search, machine translation, ad targeting, …
– In research: bioinformatics, NLP, climate simulation, …
• High-level cluster programming models like MapReduce power many of these apps
• Theme of this work: provide similarly powerful abstractions for a broader class of applications
Motivation
Current popular programming models for clusters transform data flowing from stable storage to stable storage, e.g., MapReduce:

[Data-flow diagram: Input -> Map tasks -> Reduce tasks -> Output]
Motivation
• Acyclic data flow is a powerful abstraction, but is not efficient for applications that repeatedly reuse a working set of data:
– Iterative algorithms (many in machine learning)
– Interactive data mining tools (R, Excel, Python)
• Spark makes working sets a first-class concept to efficiently support these apps
Spark Goal
• Provide distributed memory abstractions for clusters to support apps with working sets
• Retain the attractive properties of MapReduce:
– Fault tolerance (for crashes & stragglers)
– Data locality
– Scalability
Solution: augment data flow model with “resilient distributed datasets” (RDDs)
Programming Model
• Resilient distributed datasets (RDDs)
  – Immutable collections partitioned across the cluster that can be rebuilt if a partition is lost
  – Created by transforming data in stable storage using data flow operators (map, filter, group-by, …)
  – Can be cached across parallel operations
• Parallel operations on RDDs
  – Reduce, collect, count, save, …
• Restricted shared variables
  – Accumulators, broadcast variables
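The programming model boils down to lazy, immutable transformations plus eager actions. A plain-Python sketch of that split (the `MiniRDD` class and its methods are illustrative stand-ins, not Spark’s API):

```python
class MiniRDD:
    """Toy stand-in for an RDD: an immutable, lazily evaluated collection."""
    def __init__(self, compute):
        self._compute = compute   # recipe to (re)build the data
        self._cache = None

    # --- transformations: define a new MiniRDD, nothing runs yet ---
    def map(self, f):
        return MiniRDD(lambda: [f(x) for x in self._materialize()])

    def filter(self, pred):
        return MiniRDD(lambda: [x for x in self._materialize() if pred(x)])

    def cache(self):
        self._cache = self._materialize()
        return self

    # --- actions: force evaluation and return a result to the driver ---
    def collect(self):
        return list(self._materialize())

    def count(self):
        return len(self._materialize())

    def _materialize(self):
        return self._cache if self._cache is not None else self._compute()

lines = MiniRDD(lambda: ["ERROR disk", "INFO ok", "ERROR net"])
errors = lines.filter(lambda l: l.startswith("ERROR")).cache()
print(errors.count())                                   # 2
print(errors.filter(lambda l: "net" in l).count())      # 1
```

Only the actions (`collect`, `count`) trigger work; everything else just builds up a recipe, which is what lets Spark schedule and cache efficiently.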
Example: Log Mining
• Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()
cachedMsgs.filter(_.contains("foo")).count
cachedMsgs.filter(_.contains("bar")).count
. . .

[Diagram: the driver ships tasks to workers; each worker reads its HDFS block (base RDD), builds the transformed RDD, caches the messages in memory (cached RDD), and returns results for each parallel operation]
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
RDDs in More Detail
• An RDD is an immutable, partitioned, logical collection of records
– Need not be materialized, but rather contains information to rebuild a dataset from stable storage
• Partitioning can be based on a key in each record (using hash or range partitioning)
• Built using bulk transformations on other RDDs
• Can be cached for future reuse
RDD Operations
Transformations (define a new RDD):
  map, filter, sample, union, groupByKey, reduceByKey, join, cache, …

Parallel operations / actions (return a result to the driver):
  reduce, collect, count, save, lookup(key), …
RDD Fault Tolerance
• RDDs maintain lineage information that can be used to reconstruct lost partitions
• e.g.:

cachedMsgs = textFile(...).filter(_.contains("error"))
                          .map(_.split('\t')(2))
                          .cache()
[Lineage graph: HdfsRDD (path: hdfs://…) -> FilteredRDD (func: contains(...)) -> MappedRDD (func: split(…)) -> CachedRDD]
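A toy plain-Python illustration of lineage-based recovery (the `LineageRDD` class is illustrative, not Spark’s implementation): each dataset keeps the recipe that produced it, so lost partitions can be recomputed from the parent instead of being replicated.

```python
class LineageRDD:
    """Each RDD remembers its parent and the transformation that made it."""
    def __init__(self, parent=None, transform=None, source=None):
        self.parent, self.transform, self.source = parent, transform, source
        self.partitions = None  # None = not materialized (or lost)

    def map(self, f):
        return LineageRDD(parent=self, transform=lambda part: [f(x) for x in part])

    def compute(self):
        if self.partitions is None:  # lost or never built: rebuild from lineage
            if self.source is not None:
                self.partitions = [list(p) for p in self.source]
            else:
                self.partitions = [self.transform(p) for p in self.parent.compute()]
        return self.partitions

base = LineageRDD(source=[[1, 2], [3, 4]])   # two partitions in "stable storage"
doubled = base.map(lambda x: x * 2)
print(doubled.compute())        # [[2, 4], [6, 8]]

doubled.partitions = None       # simulate losing the cached partitions
print(doubled.compute())        # rebuilt from lineage: [[2, 4], [6, 8]]
```

Because the recipe is tiny compared to the data, this makes fault tolerance nearly free until a failure actually happens.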
Example 1: Logistic Regression
• Goal: find best line separating two sets of points
[Plot: two classes of points (+ and –) in the plane; a random initial line iteratively converges to the target separating line]
Logistic Regression Code
val data = spark.textFile(...).map(readPoint).cache()

var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)
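The same gradient-descent loop in self-contained plain Python (the 1-D toy dataset and the fixed initial weight are illustrative choices; the slide’s Scala version runs the gradient map/reduce across the cluster and keeps `data` cached between iterations):

```python
import math

# Toy dataset of (x, y) pairs: x > 0 labeled +1, x < 0 labeled -1.
data = [(1.0, 1), (2.0, 1), (0.5, 1), (-1.0, -1), (-2.0, -1), (-0.5, -1)]

w = 0.1  # fixed initial weight for reproducibility (slide uses random init)
for _ in range(100):
    # Same update as the Scala code: gradient of the logistic loss.
    gradient = sum((1 / (1 + math.exp(-y * w * x)) - 1) * y * x for x, y in data)
    w -= gradient

# w has grown positive, so sign(w * x) matches every label.
print(all(w * x * y > 0 for x, y in data))  # True
```

The point of the Spark version is that `data` stays in memory across the 100 iterations, so only the cheap gradient step is repeated.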
Logistic Regression Performance
• Hadoop: 127 s / iteration
• Spark: first iteration 174 s, further iterations 6 s
Example 2: MapReduce
• MapReduce data flow can be expressed using RDD transformations
res = data.flatMap(rec => myMapFunc(rec))
          .groupByKey()
          .map((key, vals) => myReduceFunc(key, vals))

Or with combiners:

res = data.flatMap(rec => myMapFunc(rec))
          .reduceByKey(myCombiner)
          .map((key, val) => myReduceFunc(key, val))
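A plain-Python word count showing the same flatMap / groupByKey / reduce shape (the function names mirror the slide’s `myMapFunc` / `myReduceFunc`; the single-machine helpers are illustrative stand-ins for Spark’s distributed operators):

```python
from collections import defaultdict

def flat_map(records, f):
    """Apply f to each record and flatten the resulting (key, value) pairs."""
    return [kv for rec in records for kv in f(rec)]

def group_by_key(pairs):
    """Collect all values that share a key, like a shuffle."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups.items()

# myMapFunc / myReduceFunc from the slide, instantiated for word count.
def my_map_func(line):
    return [(word, 1) for word in line.split()]

def my_reduce_func(key, vals):
    return (key, sum(vals))

data = ["to be or not", "to be"]
res = [my_reduce_func(k, vs) for k, vs in group_by_key(flat_map(data, my_map_func))]
print(sorted(res))  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

Swapping `group_by_key` + full reduce for an incremental `reduceByKey` is exactly the combiner variant on the slide: values are merged as they arrive instead of being buffered per key.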
Other Spark Applications
• Twitter spam classification (Justin Ma)
• EM alg. for traffic prediction (Mobile Millennium)
• K-means clustering
• Alternating Least Squares matrix factorization
• In-memory OLAP aggregation on Hive data
• SQL on Spark (future work)
Conclusion
• By making distributed datasets a first-class primitive, Spark provides a simple, efficient programming model for stateful data analytics
• RDDs provide:
  – Lineage info for fault recovery and debugging
  – Adjustable in-memory caching
  – Locality-aware parallel operations
• We plan to make Spark the basis of a suite of batch and interactive data analysis tools