Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data

Matei Zaharia, in collaboration with Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Cliff Engle, Michael Franklin, Haoyuan Li, Antonio Lupher, Justin Ma, Murphy McCauley, Scott Shenker, Ion Stoica, Reynold Xin (UC Berkeley, spark-project.org)

Description

Spark is an open source cluster computing framework that can outperform Hadoop by 30x through a combination of in-memory computation and a richer execution engine. Shark is a port of Apache Hive onto Spark, which provides a similar speedup for SQL queries, allowing interactive exploration of data in existing Hive warehouses. This talk will cover how both Spark and Shark are being used at various companies to accelerate big data analytics, the architecture of the systems, and where they are heading. We will also discuss the next major feature we are developing, Spark Streaming, which adds support for low-latency stream processing to Spark, giving users a unified interface for batch and real-time analytics.

Transcript of Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data

Page 1: Spark and Shark: Lightning-Fast Analytics over Hadoop and Hive Data

Matei Zaharia, in collaboration with Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Cliff Engle, Michael Franklin, Haoyuan Li, Antonio Lupher, Justin Ma, Murphy McCauley, Scott Shenker, Ion Stoica, Reynold Xin

UC Berkeley, spark-project.org

Spark and Shark: High-Speed In-Memory Analytics over Hadoop and Hive Data

Page 2

What is Spark?

Not a modified version of Hadoop

Separate, fast, MapReduce-like engine
» In-memory data storage for very fast iterative queries
» General execution graphs and powerful optimizations
» Up to 40x faster than Hadoop

Compatible with Hadoop’s storage APIs
» Can read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.

Page 3

What is Shark?

Port of Apache Hive to run on Spark

Compatible with existing Hive data, metastores, and queries (HiveQL, UDFs, etc.)

Similar speedups of up to 40x

Page 4

Project History

Spark project started in 2009, open sourced 2010

Shark started summer 2011, alpha April 2012

In use at Berkeley, Princeton, Klout, Foursquare, Conviva, Quantifind, Yahoo! Research & others

250+ member meetup, 500+ watchers on GitHub

Page 5

This Talk

Spark programming model

User applications

Shark overview

Demo

Next major addition: Streaming Spark

Page 6

Why a New Programming Model?

MapReduce greatly simplified big data analysis

But as soon as it got popular, users wanted more:

» More complex, multi-stage applications (graph algorithms, machine learning)
» More interactive ad-hoc queries
» More real-time online processing

All three of these apps require fast data sharing across parallel jobs

Page 7

Data Sharing in MapReduce

[Figure: each iteration reads its input from HDFS and writes its output back to HDFS (HDFS read → iter. 1 → HDFS write → HDFS read → iter. 2 → …), and each ad-hoc query (query 1, 2, 3) re-reads the input from HDFS to produce its result.]

Slow due to replication, serialization, and disk IO

Page 8

Data Sharing in Spark

[Figure: the input is loaded into distributed memory by one-time processing; iterations (iter. 1, iter. 2, …) and queries (query 1, 2, 3) then share data through memory instead of HDFS.]

10-100× faster than network and disk

Page 9

Spark Programming Model

Key idea: resilient distributed datasets (RDDs)

» Distributed collections of objects that can be cached in memory across cluster nodes
» Manipulated through various parallel operators
» Automatically rebuilt on failure

Interface
» Clean language-integrated API in Scala
» Can be used interactively from the Scala console

Page 10

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")              // Base RDD
errors = lines.filter(_.startsWith("ERROR"))      // Transformed RDDs
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count        // Action
cachedMsgs.filter(_.contains("bar")).count
. . .

[Figure: the driver ships tasks to workers; each worker reads its HDFS block (Block 1-3), builds its partition of the cached messages (Cache 1-3), and sends results back.]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data); scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data)

Page 11

Fault Tolerance

RDDs track the series of transformations used to build them (their lineage) to recompute lost data

E.g.:

messages = textFile(...).filter(_.contains("error"))
                        .map(_.split('\t')(2))

Lineage: HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…))
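The lineage idea can be sketched outside Spark with plain Scala collections standing in for partitions: each dataset records only its parent and the function that derived it, so a lost result can always be replayed from the source. The classes and names below are hypothetical illustrations, not Spark's actual RDD internals.

```scala
// Toy sketch of lineage-based recovery: each node stores how it was
// derived from its parent, so lost data can be recomputed on demand.
// (Hypothetical classes for illustration; not Spark's real implementation.)
sealed trait ToyRDD[T] {
  def compute(): Seq[T] // replays the lineage from the source data
}
case class Source[T](data: Seq[T]) extends ToyRDD[T] {
  def compute(): Seq[T] = data
}
case class Filtered[A](parent: ToyRDD[A], p: A => Boolean) extends ToyRDD[A] {
  def compute(): Seq[A] = parent.compute().filter(p)
}
case class Mapped[A, B](parent: ToyRDD[A], f: A => B) extends ToyRDD[B] {
  def compute(): Seq[B] = parent.compute().map(f)
}

object LineageDemo extends App {
  val lines  = Source(Seq("ERROR\tdisk\tfull", "INFO\tok\tfine", "ERROR\tnet\tdown"))
  val errors = Filtered[String](lines, _.startsWith("ERROR"))
  val fields = Mapped[String, String](errors, _.split('\t')(2))
  // Even if a cached copy of `fields` were lost, its lineage suffices:
  println(fields.compute()) // prints List(full, down)
}
```

Because each node keeps the transformation rather than the data itself, recovery costs only recomputation of the lost partitions, with no replication needed.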

Page 12

Example: Logistic Regression

// Load data in memory once
val data = spark.textFile(...).map(readPoint).cache()

// Initial parameter vector
var w = Vector.random(D)

// Repeated MapReduce steps to do gradient descent
for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)

Page 13

Logistic Regression Performance

[Figure: running time (s) vs number of iterations (1-30). Hadoop: 127 s per iteration. Spark: 174 s for the first iteration, 6 s for further iterations.]

Page 14

Supported Operators

map

filter

groupBy

sort

join

leftOuterJoin

rightOuterJoin

reduce

count

reduceByKey

groupByKey

first

union

cross

sample

cogroup

take

partitionBy

pipe

save

...
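Most of these operators have direct analogues on ordinary Scala collections, which is a convenient way to prototype the logic of a job locally. The snippet below illustrates only the semantics with local stand-ins; it is not Spark's API, and carries none of its distribution or fault tolerance.

```scala
// Local stand-ins for a few Spark operators, using plain Scala
// collections (semantics only; no distribution or fault tolerance).
object OperatorDemo extends App {
  val pairs = Seq(("a", 1), ("b", 2), ("a", 3))

  // reduceByKey(_ + _): combine all values that share a key
  val reduced = pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }
  assert(reduced == Map("a" -> 4, "b" -> 2))

  // join: pair up records from two datasets with matching keys
  val other = Seq(("a", "x"), ("b", "y"))
  val joined = for ((k, v) <- pairs; (k2, w) <- other if k == k2) yield (k, (v, w))
  assert(joined == Seq(("a", (1, "x")), ("b", (2, "y")), ("a", (3, "x"))))

  // take: grab a small preview of a dataset
  assert(pairs.take(2) == Seq(("a", 1), ("b", 2)))

  println("all operator sketches agree")
}
```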

Page 15

Other Engine Features

General graphs of operators (for efficiency)

[Figure: a DAG of RDDs A-G linked by groupBy, join, union, and map, split into Stages 1-3; stages whose inputs are already cached can be skipped.]

Page 16

Other Engine Features

Controllable data partitioning to minimize communication

[Figure: PageRank iteration time (s) for Hadoop, basic Spark, and Spark with controlled partitioning; the labeled Spark bars are 72 s and 23 s.]

Page 17

User Applications

Page 18

Spark Users

Page 19

Applications

In-memory analytics & anomaly detection (Conviva)

Interactive queries on data streams (Quantifind)

Exploratory log analysis (Foursquare)

Traffic estimation w/ GPS data (Mobile Millennium)

Twitter spam classification (Monarch)

. . .

Page 20

Conviva GeoReport

Group aggregations on many keys w/ same filter

40× gain over Hive from avoiding repeated reading, deserialization and filtering

[Figure: completion time (hours): Hive ≈ 20, Spark ≈ 0.5.]

Page 21

Quantifind Feed Analysis

Load data feeds, extract entities, and compute in-memory tables every few minutes

Let users drill down interactively from AJAX app

[Figure: Data Feeds → Parsed Documents → Extracted Entities → In-Memory Time Series, all inside Spark, with a Web App issuing queries against the in-memory tables.]

Page 22

Mobile Millennium Project

Estimate city traffic from crowdsourced GPS data

Iterative EM algorithm scaling to 160 nodes

Credit: Tim Hunter, with support of the Mobile Millennium team; P.I. Alex Bayen; traffic.berkeley.edu

Page 23

Shark: Hive on Spark

Page 24

Motivation

Hive is great, but Hadoop’s execution engine makes even the smallest queries take minutes

Scala is good for programmers, but many data users only know SQL

Can we extend Hive to run on Spark?

Page 25

Hive Architecture

[Figure: clients (CLI, JDBC) talk to a Driver containing a SQL Parser, Query Optimizer, and Physical Plan Execution; queries execute as MapReduce jobs over HDFS, with table metadata kept in the Metastore.]

Page 26

Shark Architecture

[Figure: the same structure as Hive (clients via CLI/JDBC, a Driver with SQL Parser, Query Optimizer, and Physical Plan Execution, plus a Cache Manager), but queries execute on Spark instead of MapReduce.]

[Engle et al, SIGMOD 2012]

Page 27

Efficient In-Memory Storage

Simply caching Hive records as Java objects is inefficient due to high per-object overhead

Instead, Shark employs column-oriented storage using arrays of primitive types

[Figure: Row Storage holds records (1, john, 4.1), (2, mike, 3.5), (3, sally, 6.4); Column Storage holds parallel arrays (1, 2, 3), (john, mike, sally), (4.1, 3.5, 6.4).]

Page 28


Benefit: similarly compact size to serialized data, but >5x faster to access
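The layout difference can be sketched in plain Scala: instead of one JVM object per record, each column lives in its own array, so a scan touches a dense run of unboxed primitives. This is only an illustration of the idea, not Shark's actual column format.

```scala
// Row layout: one JVM object per record (boxed fields, per-object headers).
case class Row(id: Int, name: String, score: Double)

object ColumnDemo extends App {
  val rows = Seq(Row(1, "john", 4.1), Row(2, "mike", 3.5), Row(3, "sally", 6.4))

  // Column layout: one array per column, no per-record objects.
  val ids    = rows.map(_.id).toArray     // Array[Int]    (primitive ints)
  val names  = rows.map(_.name).toArray   // Array[String]
  val scores = rows.map(_.score).toArray  // Array[Double] (primitive doubles)

  // A scan over one column reads a dense primitive array:
  var sum = 0.0
  var i = 0
  while (i < scores.length) { sum += scores(i); i += 1 }
  println(sum)
}
```

Queries that touch only a few columns also avoid reading the others entirely, which is where the compactness and speed of the columnar form come from.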

Page 29

Using Shark

CREATE TABLE mydata_cached AS SELECT …

Run standard HiveQL on it, including UDFs
» A few esoteric features are not yet supported

Can also call from Scala to mix with Spark

Early alpha release at shark.cs.berkeley.edu

Page 30

Benchmark Query 1

SELECT * FROM grep WHERE field LIKE '%XYZ%';

Page 31

Benchmark Query 2

SELECT sourceIP, AVG(pageRank), SUM(adRevenue) AS earnings
FROM rankings AS R, userVisits AS V ON R.pageURL = V.destURL
WHERE V.visitDate BETWEEN '1999-01-01' AND '2000-01-01'
GROUP BY V.sourceIP
ORDER BY earnings DESC
LIMIT 1;

Page 32

Demo

Page 33

What’s Next

Page 34

Streaming Spark

Many key big data apps must run in real time
» Live event reporting, click analysis, spam filtering, …

Event-passing systems (e.g. Storm) are low-level
» Users must worry about FT, state, consistency
» Programming model is different from batch, so must write each app twice

Can we give streaming a Spark-like interface?

Page 35

Our Idea

Run streaming computations as a series of very short (<1 second) batch jobs
» “Discretized stream processing”

Keep state in memory as RDDs (automatically recover from any failure)

Provide a functional API similar to Spark

Page 36

Spark Streaming API

Functional operators on discretized streams (D-streams)

New “stateful” operators for windowing:

pageViews = readStream("...", "1s")
ones = pageViews.map(ev => (ev.url, 1))
counts = ones.runningReduce(_ + _)

[Figure: at t = 1, t = 2, …, each D-stream (pageViews, ones, counts) is a sequence of RDDs, one per interval, connected by map and reduce transformations across their partitions.]

Sliding window reduce with “add” and “subtract” functions:

sliding = ones.reduceByWindow("5s", _ + _, _ - _)
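The add/subtract trick behind the windowed reduce can be shown with plain Scala: rather than re-summing the whole window each interval, the newest count is added and the expired one subtracted. This is a sketch of the incremental idea over local sequences, not the Spark Streaming implementation.

```scala
// Incremental sliding-window sum over per-interval counts.
// Window of 5 intervals; each step adds the newest count and
// subtracts the one that just fell out of the window.
object WindowDemo extends App {
  val perInterval = Seq(3, 1, 4, 1, 5, 9, 2, 6) // counts per 1s interval
  val window = 5

  // Naive: recompute each window sum from scratch (O(window) per step).
  val naive = perInterval.sliding(window).map(_.sum).toSeq

  // Incremental: carry the previous sum forward (O(1) per step).
  var sum = perInterval.take(window).sum
  val incremental = Seq(sum) ++ (window until perInterval.length).map { i =>
    sum = sum + perInterval(i) - perInterval(i - window)
    sum
  }

  assert(naive == incremental) // both methods agree
  println(incremental)
}
```

The subtract function is what makes the window slide cheaply: without it, every interval would pay the full cost of re-aggregating the window.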

Page 37

Streaming + Batch + Ad-Hoc

Combining D-streams with historical data:

pageViews.join(historicCounts).map(...)

Interactive ad-hoc queries on stream state:

counts.slice("21:00", "21:05").topK(10)

Page 38

How Fast Can It Go?

Maximum throughput possible with 1s or 2s latency

[Figure: cluster throughput (GB/s) vs number of nodes (up to 100) for TopKWords and Grep, at 1 sec and 2 sec latency.]

Can process 4 GB/s (42M records/s) of data on 100 nodes at sub-second latency

Recovers from failures within 1 sec

Page 39

Performance vs Storm

Storm limited to 10,000 records/s/node

Also tried S4: 7000 records/s/node

[Figure: Grep and TopK throughput (MB/s/node) vs record size (100, 1,000, 100,000 bytes) for Spark and Storm.]

Page 40

Streaming Roadmap

Alpha release expected in August

Spark engine changes already in “dev” branch

Page 41

Conclusion

Spark & Shark speed up your interactive, complex, and (soon) streaming analytics on Hadoop data

Download and docs: www.spark-project.org

» Easy local mode and deploy scripts for EC2

User meetup: meetup.com/spark-users

Training camp at Berkeley in August

[email protected] / @matei_zaharia

Page 42

Behavior with Not Enough RAM

[Figure: iteration time (s) vs % of working set in memory: cache disabled 68.8, 25% 58.1, 50% 40.7, 75% 29.7, fully cached 11.5.]

Page 43

Software Stack

[Figure: Spark runs Shark (Hive on Spark), Bagel (Pregel on Spark), Streaming Spark, and other applications, and deploys in local mode or on EC2, Apache Mesos, or YARN.]