Transcript of Berkeley Data Analytics Stack, Prof. Harold Liu, 15 December 2014 (48 pages).

Page 1

Berkeley Data Analytics Stack

Prof. Harold Liu

15 December 2014

Page 2: Berkeley Data Analytics Stack Prof. Harold Liu 15 December 2014.

Data Processing Goals

• Low latency (interactive) queries on historical data: enable faster decisions– E.g., identify why a site is slow and fix it

• Low latency queries on live data (streaming): enable decisions on real-time data– E.g., detect & block worms in real-time (a wo

rm may infect 1mil hosts in 1.3sec)• Sophisticated data processing: enable “bette

r” decisions– E.g., anomaly detection, trend analysis

Page 3

Today's Open Analytics Stack…

• …mostly focused on large on-disk datasets: great for batch but slow

[Stack diagram: Application / Data Processing / Storage / Infrastructure]

Page 4

Goals

One stack to rule them all! (batch, interactive, streaming)

• Easy to combine batch, streaming, and interactive computations
• Easy to develop sophisticated algorithms
• Compatible with existing open source ecosystem (Hadoop/HDFS)

Page 5

Support Interactive and Streaming Comp.

• Aggressive use of memory
• Why?
  1. Memory transfer rates >> disk or SSDs
  2. Many datasets already fit into memory
     – Inputs of over 90% of jobs in Facebook, Yahoo!, and Bing clusters fit into memory
     – e.g., 1TB = 1 billion records @ 1KB each
  3. Memory density (still) grows with Moore's law
     – RAM/SSD hybrid memories on the horizon

[Figure: high-end datacenter node: 16 cores; 128-512GB RAM at 40-60GB/s; 1-4TB SSD (x4 disks) at 1-4GB/s; 10-30TB disk (x10 disks) at 0.2-1GB/s; 10Gbps network]

Page 6

Support Interactive and Streaming Comp.

• Increase parallelism
• Why?
  – Reduce work per node to improve latency
• Techniques:
  – Low-latency parallel scheduler that achieves high locality
  – Optimized parallel communication patterns (e.g., shuffle, broadcast)
  – Efficient recovery from failures and straggler mitigation

[Figure: adding nodes shrinks per-query response time from T to Tnew (< T)]

Page 7

Support Interactive and Streaming Comp.

• Trade between result accuracy and response times
• Why?
  – In-memory processing does not guarantee interactive query processing
    • e.g., ~10's of seconds just to scan 512GB of RAM!
    • Gap between memory capacity and transfer rate is increasing
• Challenges:
  – accurately estimate error and running time for arbitrary computations

[Figure: 16 cores; 128-512GB RAM at 40-60GB/s; memory capacity doubles every 18 months, transfer rate every 36 months]

Page 8

Berkeley Data Analytics Stack (BDAS)

[Stack diagram: Application / Data Processing / Data Management / Resource Management / Infrastructure]

• Resource Management: share infrastructure across frameworks (multi-programming for datacenters)
• Data Management: efficient data sharing across frameworks
• Data Processing: in-memory processing; trade between time, quality, and cost
• Application: new apps such as AMP-Genomics, Carat, …

Page 9

Berkeley AMPLab

• "Launched" January 2011: 6-year plan
  – 8 CS faculty
  – ~40 students
  – 3 software engineers
• Organized for collaboration

Page 10

Berkeley AMPLab

• Funding:
  – XData, CISE Expedition Grant
  – Industrial, founding sponsors
  – 18 other sponsors
• Goal: next generation of analytics data stack for industry & research:
  – Berkeley Data Analytics Stack (BDAS)
  – Release as open source

Page 11

Berkeley Data Analytics Stack (BDAS)

• Existing stack components…

[Diagram of the existing stack. Data Processing layer: Hadoop, HIVE, Pig, HBase, Storm, MPI, …; Data Management layer: HDFS; Resource Management layer]

Page 12

Mesos

• Management platform that allows multiple frameworks to share a cluster
• Compatible with existing open analytics stack
• Deployed in production at Twitter on 3,500+ servers

[Diagram: Mesos fills the Resource Management layer beneath Hadoop, HIVE, Pig, HBase, Storm, MPI, and HDFS]

Page 13

Spark

• In-memory framework for interactive and iterative computations
  – Resilient Distributed Dataset (RDD): fault-tolerant, in-memory storage abstraction
• Scala interface, Java and Python APIs

[Diagram: Spark joins Hadoop, HIVE, Pig, and Storm in the Data Processing layer, over Mesos and HDFS]

Page 14

Spark Streaming [Alpha Release]

• Large-scale streaming computation
• Ensures exactly-once semantics
• Integrated with Spark: unifies batch, interactive, and streaming computations!

[Diagram: Spark Streaming sits on top of Spark in the Data Processing layer]

Page 15

Shark (Spark SQL)

• HIVE over Spark: SQL-like interface (supports Hive 0.9)
  – up to 100x faster for in-memory data, and 5-10x for on-disk data
• In tests on a hundreds-node cluster at …

[Diagram: Shark joins Spark and Spark Streaming in the Data Processing layer]

Page 16

Tachyon

• High-throughput, fault-tolerant in-memory storage
• Interface compatible with HDFS
• Support for Spark and Hadoop

[Diagram: Tachyon joins HDFS in the Data Management layer]

Page 17

BlinkDB

• Large-scale approximate query engine
• Allows users to specify error or time bounds
• Preliminary prototype being tested at Facebook

[Diagram: BlinkDB joins the Data Processing layer, on top of Shark and Spark]
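The error/time trade-off rests on sampling: scanning a small random sample bounds running time, and the sample's statistics bound the error. A sketch of that idea in plain Scala (an illustration of sampling-based approximation, not BlinkDB's actual API; the table and sample size are made up):

```scala
import scala.util.Random

// Hypothetical numeric column: 100,000 random draws with mean ~50.
val rng = new Random(42)
val table = Seq.fill(100000)(rng.nextGaussian() * 10 + 50)

// Scan only a 1,000-row sample (the "time bound"), not the full table.
val sample = table.take(1000)
val estimate = sample.sum / sample.size
val exact = table.sum / table.size

// Standard error of the mean is ~10 / sqrt(1000) ~ 0.32, so the estimate
// lands very close to the exact answer with overwhelming probability.
assert(math.abs(estimate - exact) < 2.0)
```

The engine's job is then to pick the sample size that meets a user-specified error bound (or fits the time bound), which is easy for means and counts but hard for arbitrary computations, as the earlier slide noted.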

Page 18

SparkGraph

• GraphLab API and toolkits on top of Spark
• Fault tolerance by leveraging Spark

[Diagram: SparkGraph joins the Data Processing layer]

Page 19

MLlib

• Declarative approach to ML
• Develop scalable ML algorithms
• Make ML accessible to non-experts

[Diagram: MLbase joins the Data Processing layer]

Page 20

Compatible with Open Source Ecosystem

• Support existing interfaces whenever possible:
  – GraphLab API
  – Hive interface and shell
  – HDFS API
  – Compatibility layer for Hadoop, Storm, MPI, etc. to run over Mesos

Page 21

Compatible with Open Source Ecosystem

• Use existing interfaces whenever possible:
  – Support HDFS API, S3 API, and Hive metadata
  – Support Hive API
  – Accept inputs from Kafka, Flume, Twitter, TCP sockets, …

Page 22

Summary

• Support interactive and streaming computations
  – In-memory, fault-tolerant storage abstraction, low-latency scheduling, …
• Easy to combine batch, streaming, and interactive computations
  – Spark execution engine supports all computation models
• Easy to develop sophisticated algorithms
  – Scala interface, APIs for Java, Python, Hive QL, …
  – New frameworks targeted at graph-based and ML algorithms
• Compatible with existing open source ecosystem
• Open source (Apache/BSD) and fully committed to releasing high-quality software
  – Three-person software engineering team led by Matt Massie (creator of Ganglia, 5th Cloudera engineer)

[Figure: Spark underpins batch, interactive, and streaming]

Page 23

Spark: In-Memory Cluster Computing for Iterative and Interactive Applications

UC Berkeley

Page 24

Background

• Commodity clusters have become an important computing platform for a variety of applications

– In industry: search, machine translation, ad targeting, …

– In research: bioinformatics, NLP, climate simulation, …

• High-level cluster programming models like MapReduce power many of these apps

• Theme of this work: provide similarly powerful abstractions for a broader class of applications

Page 25

Motivation

Current popular programming models for clusters transform data flowing from stable storage to stable storage, e.g., MapReduce:

[Diagram: Input fans out to Map tasks, which shuffle to Reduce tasks, which write the Output]

Page 26

Motivation

• Acyclic data flow is a powerful abstraction, but is not efficient for applications that repeatedly reuse a working set of data:

– Iterative algorithms (many in machine learning)

– Interactive data mining tools (R, Excel, Python)

• Spark makes working sets a first-class concept to efficiently support these apps

Page 27

Spark Goal

• Provide distributed memory abstractions for clusters to support apps with working sets

• Retain the attractive properties of MapReduce:

– Fault tolerance (for crashes & stragglers)

– Data locality

– Scalability

Solution: augment data flow model with “resilient distributed datasets” (RDDs)

Page 28

Programming Model

• Resilient distributed datasets (RDDs)
  – Immutable collections partitioned across a cluster that can be rebuilt if a partition is lost
  – Created by transforming data in stable storage using data flow operators (map, filter, group-by, …)
  – Can be cached across parallel operations
• Parallel operations on RDDs
  – Reduce, collect, count, save, …
• Restricted shared variables
  – Accumulators, broadcast variables

Page 29

Example: Log Mining

• Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count
cachedMsgs.filter(_.contains("bar")).count
. . .

[Diagram: the driver ships tasks to workers; each worker reads one HDFS block (Block 1-3), caches its partition of messages (Cache 1-3), and returns results. Labels: lines is the base RDD; errors and messages are transformed RDDs; cachedMsgs is a cached RDD; each filter(...).count is a parallel operation]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
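The same pipeline can be mimicked locally with plain Scala collections, which makes its shape easy to see (a sketch with a made-up three-line log; no Spark or HDFS involved):

```scala
// Tiny stand-in for the HDFS file: tab-separated log lines.
val lines = Seq(
  "INFO\tserver\tstarted",
  "ERROR\tphp\tfoo failed",
  "ERROR\tmysql\tbar timeout"
)
val errors     = lines.filter(_.startsWith("ERROR"))
val messages   = errors.map(_.split('\t')(2)) // keep the message field
val cachedMsgs = messages                     // locally there is nothing to cache

assert(cachedMsgs.count(_.contains("foo")) == 1)
assert(cachedMsgs.count(_.contains("bar")) == 1)
```

What Spark adds on top of this shape is the distribution: each worker applies the same filter/map to its own block, and cache() pins the derived partitions in cluster memory for the follow-up queries.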

Page 30

RDDs in More Detail

• An RDD is an immutable, partitioned, logical collection of records
  – Need not be materialized, but rather contains information to rebuild a dataset from stable storage
• Partitioning can be based on a key in each record (using hash or range partitioning)
• Built using bulk transformations on other RDDs
• Can be cached for future reuse

Page 32

RDD Operations

Transformations (define a new RDD): map, filter, sample, union, groupByKey, reduceByKey, join, cache, …

Parallel operations / actions (return a result to the driver): reduce, collect, count, save, lookupKey, …
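The split matters because transformations are lazy while actions force evaluation. A toy sketch of that contract in plain Scala (MiniRDD is a made-up name for illustration, not Spark's implementation):

```scala
// A dataset is just a recipe: transformations compose recipes;
// actions run the recipe and return a value to the driver.
final case class MiniRDD[A](compute: () => Seq[A]) {
  def map[B](f: A => B): MiniRDD[B]       = MiniRDD(() => compute().map(f))
  def filter(p: A => Boolean): MiniRDD[A] = MiniRDD(() => compute().filter(p))
  def count(): Int      = compute().size  // action
  def collect(): Seq[A] = compute()       // action
}

val nums  = MiniRDD(() => Seq(1, 2, 3, 4, 5))
val evens = nums.filter(_ % 2 == 0).map(_ * 10) // nothing computed yet
assert(evens.collect() == Seq(20, 40))
assert(evens.count() == 2)
```

Laziness lets the engine see the whole chain before running anything, so it can plan partitioning and locality instead of materializing every intermediate result.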

Page 38

RDD Fault Tolerance

• RDDs maintain lineage information that can be used to reconstruct lost partitions
• e.g.:

cachedMsgs = textFile(...).filter(_.contains("error"))
                          .map(_.split('\t')(2))
                          .cache()

[Lineage chain: HdfsRDD (path: hdfs://…) → FilteredRDD (func: contains(...)) → MappedRDD (func: split(…)) → CachedRDD]
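A minimal sketch of the idea in plain Scala: every derived dataset keeps a reference to its parent and the function that produced it, so a lost cached copy can be recomputed from stable storage (illustrative types, not Spark's internals):

```scala
// Each node in the lineage chain knows how to recompute itself.
sealed trait Lineage[A] { def recompute(): Seq[A] }
final case class Source[A](data: Seq[A]) extends Lineage[A] {
  def recompute(): Seq[A] = data                      // stable storage
}
final case class Mapped[A, B](parent: Lineage[A], f: A => B) extends Lineage[B] {
  def recompute(): Seq[B] = parent.recompute().map(f) // replay the step
}

val base   = Source(Seq("ERROR\tdisk\tfull", "ERROR\tnet\tdown"))
val fields = Mapped[String, String](base, _.split('\t')(2))
var cached: Option[Seq[String]] = Some(fields.recompute())
cached = None                       // simulate losing the cached partition
val recovered = fields.recompute()  // rebuild it from lineage
assert(recovered == Seq("full", "down"))
```

Because only the lost partition's recipe is replayed, recovery costs work proportional to the loss rather than requiring replication of the whole dataset.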

Page 39

Example 1: Logistic Regression

• Goal: find best line separating two sets of points

[Figure: scatter of + and - points, with a random initial line converging to the target separator]

Page 40

Logistic Regression Code

val data = spark.textFile(...).map(readPoint).cache()

var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)
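Stripped of Spark, the same loop runs on a local collection. A dependency-free sketch with one-dimensional points (the data, iteration count, and Point class are made up for illustration, so "w dot p.x" reduces to w * p.x):

```scala
import scala.math.exp

case class Point(x: Double, y: Double) // y is the label, +1.0 or -1.0

val data = Seq(Point(2.0, 1.0), Point(3.0, 1.0), Point(-2.0, -1.0), Point(-3.0, -1.0))
var w = 0.0
for (_ <- 1 to 100) {
  val gradient = data.map { p =>
    (1.0 / (1.0 + exp(-p.y * (w * p.x))) - 1.0) * p.y * p.x
  }.reduce(_ + _)
  w -= gradient
}
// Positive-x points are labeled +1, so the learned weight must be positive.
assert(w > 0)
```

The Spark version is this exact loop with one change: data.map(...).reduce(_ + _) runs across the cluster each iteration, and cache() keeps the points in memory between iterations, which is where the speedup on the next slide comes from.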

Page 41

Logistic Regression Performance

[Chart: Hadoop takes 127 s per iteration; Spark takes 174 s for the first iteration and 6 s for further iterations]

Page 42

Example 2: MapReduce

• MapReduce data flow can be expressed using RDD transformations

res = data.flatMap(rec => myMapFunc(rec))
          .groupByKey()
          .map((key, vals) => myReduceFunc(key, vals))

Or with combiners:

res = data.flatMap(rec => myMapFunc(rec))
          .reduceByKey(myCombiner)
          .map((key, val) => myReduceFunc(key, val))
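Word count, the canonical MapReduce job, has exactly this shape even on local Scala collections, with groupBy standing in for groupByKey (a local sketch, not Spark code):

```scala
val data = Seq("to be or not to be")
val counts = data
  .flatMap(rec => rec.split(" ").map(word => (word, 1)))     // "map" phase
  .groupBy { case (word, _) => word }                        // shuffle by key
  .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // "reduce" phase

assert(counts("to") == 2)
assert(counts("be") == 2)
assert(counts("or") == 1)
```

The combiner variant replaces the groupBy-then-sum with reduceByKey(_ + _), which sums partial counts before the shuffle and moves far less data.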

Page 43

Example 3

[Pages 44-45: figure slides, no recoverable text]

Page 46

Other Spark Applications

• Twitter spam classification (Justin Ma)
• EM algorithm for traffic prediction (Mobile Millennium)
• K-means clustering
• Alternating least squares matrix factorization
• In-memory OLAP aggregation on Hive data
• SQL on Spark (future work)

Page 48

Conclusion

• By making distributed datasets a first-class primitive, Spark provides a simple, efficient programming model for stateful data analytics
• RDDs provide:
  – Lineage info for fault recovery and debugging
  – Adjustable in-memory caching
  – Locality-aware parallel operations

• We plan to make Spark the basis of a suite of batch and interactive data analysis tools