Spark: High-Speed Analytics for Big Data


Transcript of Spark: High-Speed Analytics for Big Data

Page 1: Spark: High-Speed Analytics for Big Data

Patrick Wendell, Databricks

Spark.incubator.apache.org


Page 2: What is Spark?

What is Spark?

Fast and expressive distributed runtime compatible with Apache Hadoop. Improves efficiency through:

» General execution graphs
» In-memory storage

Improves usability through:

» Rich APIs in Scala, Java, Python
» Interactive shell

Up to 10× faster on disk, 100× in memory

2-5× less code

Page 3: Project History

Project History

Spark started in 2009, open sourced in 2010. Today: 1000+ meetup members and code contributed from 24 companies.

Page 4: Today’s Talk

Today’s Talk

Spark
Spark Streaming (real-time)
Shark (SQL)
GraphX (graph)
MLlib (machine learning)
…

Page 5: Why a New Programming Model?

Why a New Programming Model?

MapReduce greatly simplified big data analysis. But as soon as it got popular, users wanted more:

» More complex, multi-pass analytics (e.g. ML, graph)
» More interactive ad-hoc queries
» More real-time stream processing

All 3 need faster data sharing across parallel jobs

Page 6: Data Sharing in MapReduce

Data Sharing in MapReduce

[Diagram: each iteration (iter. 1, iter. 2, ...) reads its input from HDFS and writes its output back to HDFS; each ad-hoc query (query 1, query 2, query 3, ...) re-reads the input from HDFS to produce its result.]

Slow due to replication, serialization, and disk I/O

Page 7: Data Sharing in Spark

Data Sharing in Spark

[Diagram: one-time processing loads the input into distributed memory; iterations (iter. 1, iter. 2, ...) and queries (query 1, query 2, query 3, ...) then share data through memory.]

10-100× faster than network and disk

Page 8: Spark Programming Model

Spark Programming Model

Key idea: resilient distributed datasets (RDDs)

» Distributed collections of objects that can be cached in memory across the cluster
» Manipulated through parallel operators
» Automatically recomputed on failure

Programming interface:

» Functional APIs in Scala, Java, Python
» Interactive use from Scala & Python shells
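The cache-or-recompute behavior described above can be sketched in a few lines of plain Python. This is a toy, single-process illustration of the RDD idea, not Spark's implementation; every class and method name below is invented for the example.

```python
from functools import reduce

class MiniRDD:
    """Toy stand-in for an RDD: an immutable collection that records
    its lineage (a function that rebuilds it), so a lost in-memory
    copy can be recomputed instead of replicated."""

    def __init__(self, compute):
        self._compute = compute   # lineage: how to rebuild this dataset
        self._cached = None       # optional in-memory copy

    @staticmethod
    def parallelize(data):
        items = list(data)
        return MiniRDD(lambda: list(items))

    def map(self, f):
        return MiniRDD(lambda: [f(x) for x in self.collect()])

    def filter(self, pred):
        return MiniRDD(lambda: [x for x in self.collect() if pred(x)])

    def cache(self):
        self._cached = self._compute()
        return self

    def lose_cache(self):
        # Simulate a failure that drops the in-memory copy.
        self._cached = None
        return self

    def collect(self):
        # Serve from memory if cached; otherwise recompute from lineage.
        return list(self._cached) if self._cached is not None else self._compute()

    def reduce(self, f):
        return reduce(f, self.collect())

squares = MiniRDD.parallelize(range(5)).map(lambda x: x * x).cache()
total = squares.reduce(lambda a, b: a + b)                   # served from cache
recovered = squares.lose_cache().reduce(lambda a, b: a + b)  # rebuilt from lineage
```

Both reductions return 30; the second is answered by replaying the lineage after the cached copy is "lost", which is the automatic-recomputation property the bullets describe.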

Page 9: Example: Logistic Regression

Example: Logistic Regression

Goal: find the best line separating two sets of points

[Diagram: scattered + and – points; a random initial line is iteratively adjusted toward the target line separating the two sets.]

Page 10: Example: Logistic Regression

Example: Logistic Regression

val data = spark.textFile(...).map(readPoint).cache()

var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)
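For readers without a cluster at hand, here is a pure-Python, single-machine sketch of the same loop. The dataset, the dimension D, the iteration count, and the small step size (added here for numerical stability; the Scala snippet subtracts the raw gradient) are illustrative choices, not part of the original code.

```python
import math
import random

random.seed(0)
D = 2
ITERATIONS = 50
ALPHA = 0.01  # step size, added for stability in this sketch

# Two labeled clusters: y = +1 near (2, 2), y = -1 near (-2, -2).
points = ([([random.gauss(2, 1), random.gauss(2, 1)], 1.0) for _ in range(50)]
          + [([random.gauss(-2, 1), random.gauss(-2, 1)], -1.0) for _ in range(50)])

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def sigmoid(z):
    # Overflow-safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

w = [random.uniform(-1, 1) for _ in range(D)]

for _ in range(ITERATIONS):
    # Same per-point term as the Scala code:
    # (1 / (1 + exp(-y * (w . x))) - 1) * y * x
    gradient = [0.0] * D
    for x, y in points:
        scale = (sigmoid(y * dot(w, x)) - 1.0) * y
        for j in range(D):
            gradient[j] += scale * x[j]
    w = [wj - ALPHA * gj for wj, gj in zip(w, gradient)]

# The learned w should separate the two clusters.
correct = sum(1 for x, y in points
              if (1.0 if dot(w, x) > 0 else -1.0) == y)
```

In Spark, the `data.map(...).reduce(_ + _)` gradient step runs in parallel over cached partitions, which is why only the first iteration pays the load cost.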

Page 11: Logistic Regression Performance

Logistic Regression Performance

[Chart: running time (s) vs. number of iterations (1-30) for Hadoop and Spark. Hadoop: 110 s per iteration. Spark: first iteration 80 s, further iterations 1 s.]

Page 12: Some Operators

Some Operators

map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...
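To make a few of these concrete, here is what map, filter, reduceByKey, groupByKey, and reduce compute, simulated on plain Python lists. This shows the semantics only; it is not Spark's API.

```python
from collections import defaultdict
from functools import reduce

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("c", 5)]

# map and filter: element-wise transform and predicate
doubled = [(k, v * 2) for k, v in pairs]
evens = [(k, v) for k, v in pairs if v % 2 == 0]

def reduce_by_key(kv, f):
    """reduceByKey: merge the values of each key with a function."""
    acc = {}
    for k, v in kv:
        acc[k] = f(acc[k], v) if k in acc else v
    return sorted(acc.items())

def group_by_key(kv):
    """groupByKey: collect all values of each key into a list."""
    acc = defaultdict(list)
    for k, v in kv:
        acc[k].append(v)
    return sorted(acc.items())

sums = reduce_by_key(pairs, lambda a, b: a + b)
groups = group_by_key(pairs)
total = reduce(lambda a, b: a + b, [v for _, v in pairs])
# sums   == [("a", 4), ("b", 6), ("c", 5)]
# groups == [("a", [1, 3]), ("b", [2, 4]), ("c", [5])]
# total  == 15
```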

Page 13: Execution Engine

Execution Engine

» General task graphs
» Automatically pipelines functions
» Data locality aware
» Partitioning aware, to avoid shuffles
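The "automatically pipelines functions" point can be sketched with generators: chained narrow transformations are fused into a single pass over the data, with no intermediate collection materialized between them. This is an illustration of the idea only, not Spark's scheduler; the helper names are invented.

```python
def _map(stream, f):
    for x in stream:
        yield f(x)

def _filter(stream, pred):
    for x in stream:
        if pred(x):
            yield x

def pipelined(source, ops):
    """Fuse a chain of map/filter steps into one pass: each element
    flows through every op before the next element is read, and no
    intermediate list is ever built."""
    stream = iter(source)
    for kind, f in ops:
        stream = _map(stream, f) if kind == "map" else _filter(stream, f)
    return stream

result = list(pipelined(
    range(10),
    [("map", lambda x: x * x),          # square every element
     ("filter", lambda x: x % 2 == 0)]  # keep the even squares
))
# result == [0, 4, 16, 36, 64]
```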

[Diagram: a task graph over RDDs A-F with map, join, filter, and groupBy edges, split into Stage 1, Stage 2, and Stage 3; cached partitions are marked.]

Page 14: In Python and Java…

In Python and Java…

# Python:
lines = sc.textFile(...)
lines.filter(lambda x: "ERROR" in x).count()

// Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) { return s.contains("error"); }
}).count();

Page 15: There was Spark… and it was good

There was Spark… and it was good


Page 16: Generality of RDDs

Generality of RDDs

Spark: RDDs, transformations, and actions

» Spark Streaming (real-time): DStreams, streams of RDDs
» Shark (SQL): RDD-based tables
» MLlib (machine learning): RDD-based matrices
» GraphX (graph): RDD-based graphs

Page 17: Spark Streaming

Spark Streaming

Page 18: Spark Streaming: Motivation

Spark Streaming: Motivation

Many important apps must process large data streams at second-scale latencies:

» Site statistics, intrusion detection, online ML

To build and scale these apps, users want:

» Integration with the offline analytical stack
» Fault tolerance, for both crashes and stragglers
» Efficiency: low cost beyond base processing

Page 19: Traditional Streaming Systems

Traditional Streaming Systems

Separate codebase/API from the offline analytics stack. Continuous operator model:

» Each node has mutable state
» For each record, update state & send new records

[Diagram: input records pushed through node 1, node 2, and node 3, each holding mutable state.]

Page 20: Challenges with ‘record-at-a-time’

Challenges with ‘record-at-a-time’ for large datasets:

• Fault recovery is tricky and often not implemented
• Unclear how to deal with stragglers or slow nodes
• Difficult to reconcile results with the offline stack

Page 21: Observation

Observation

A functional runtime like Spark can provide fault tolerance efficiently:

» Divide the job into deterministic tasks
» Rerun failed/slow tasks in parallel on other nodes

Idea: run streaming computations as a series of small, deterministic batch jobs

» Same recovery schemes at a much smaller timescale
» To keep latency low, store state in RDDs
» Get “exactly once” semantics and recoverable state
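The micro-batch idea above can be sketched in a few lines: chop an incoming stream into small batches, run a deterministic batch job on each, and carry the state forward as a new immutable value each time. This is a toy single-process illustration, not Spark Streaming's API; the function names and the page-view data are invented.

```python
from collections import Counter
from itertools import islice

def micro_batches(events, batch_size):
    """Chop a stream of events into small fixed-size batches."""
    it = iter(events)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

def process_stream(events, batch_size=3):
    """Run a deterministic batch job per micro-batch; the running
    state is replaced by a new immutable value after each batch,
    mirroring the RDD-held state in the slides."""
    state = Counter()
    snapshots = []
    for batch in micro_batches(events, batch_size):
        batch_counts = Counter(batch)   # deterministic batch job
        state = state + batch_counts    # new state; the old one is untouched
        snapshots.append(dict(state))
    return snapshots

page_views = ["/home", "/about", "/home", "/home", "/docs", "/about"]
snapshots = process_stream(page_views, batch_size=3)
# snapshots[0] == {"/home": 2, "/about": 1}
# snapshots[1] == {"/home": 3, "/about": 2, "/docs": 1}
```

Because each batch job is deterministic and the pre-batch state is immutable, rerunning a failed batch reproduces exactly the same snapshot, which is the recovery property the slide claims.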

Page 22: Discretized Stream Processing

Discretized Stream Processing

[Diagram: at t = 1, t = 2, ... each batch of input is pulled in as an immutable dataset (stored reliably); a batch operation produces an immutable dataset (output or state), stored in memory as an RDD. Two input streams shown.]

Page 23: Programming Interface

Programming Interface

Simple functional API:

views = readStream("http:...", "1s")
ones = views.map(ev => (ev.url, 1))
counts = ones.runningReduce(_ + _)

Interoperates with RDDs:

// Join stream with static RDD
counts.join(historicCounts).map(...)

// Ad-hoc queries on stream state
counts.slice("21:00", "21:05").topK(10)

[Diagram: at t = 1, t = 2, ... views is mapped to ones and reduced to counts; RDDs and their partitions are marked.]

Page 24: Inherited “for free” from Spark

Inherited “for free” from Spark:

» RDD data model and API
» Data partitioning and shuffles
» Task scheduling
» Monitoring/instrumentation
» Scheduling and resource allocation

Page 25: Generality of RDDs

Generality of RDDs

[Stack diagram repeated from Page 16: Spark core (RDDs, transformations, and actions) underlying Spark Streaming, Shark, GraphX, and MLlib.]

Page 26: Shark

Shark

Hive-compatible (HiveQL, UDFs, metadata):

» Works in existing Hive warehouses without changing queries or data!

Augments Hive:

» In-memory tables and columnar memory store

Fast execution engine:

» Uses Spark as the underlying execution engine
» Low-latency, interactive queries
» Scales out and tolerates worker failures

First release: November 2012

Page 27: Generality of RDDs

Generality of RDDs

[Stack diagram repeated from Page 16: Spark core (RDDs, transformations, and actions) underlying Spark Streaming, Shark, GraphX, and MLlib.]

Page 28: MLlib

MLlib

Provides high-quality, optimized ML implementations on top of Spark

Page 29: Generality of RDDs

Generality of RDDs

[Stack diagram repeated from Page 16: Spark core (RDDs, transformations, and actions) underlying Spark Streaming, Shark, GraphX, and MLlib.]

Page 30: GraphX (alpha)

GraphX (alpha)

https://github.com/amplab/graphx

Covers the “full lifecycle” of graph processing, from ETL -> graph creation -> algorithms -> value extraction
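To make the "algorithms" step concrete, here is a tiny pure-Python PageRank over an edge list, the kind of computation GraphX runs on RDD-based graphs. The toy graph, function name, and parameter choices are invented for illustration.

```python
def pagerank(edges, num_iters=20, d=0.85):
    """Toy PageRank over a directed edge list [(src, dst), ...]."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [dst for src, dst in edges if src == n] for n in nodes}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(num_iters):
        contrib = {n: 0.0 for n in nodes}
        for n in nodes:
            if out_links[n]:
                share = rank[n] / len(out_links[n])
                for dst in out_links[n]:
                    contrib[dst] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for m in nodes:
                    contrib[m] += rank[n] / len(nodes)
        rank = {n: (1 - d) / len(nodes) + d * contrib[n] for n in nodes}
    return rank

# A links to B and C; both link back to A, so A accumulates the most rank.
ranks = pagerank([("A", "B"), ("A", "C"), ("B", "A"), ("C", "A")])
```

Iterative algorithms like this are exactly the multi-pass workloads that benefit from keeping the graph cached in memory between passes.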

Page 31: Benefits of Unification: Code Size

Benefits of Unification: Code Size

[Bar chart: non-test, non-example source lines for Hadoop MapReduce, Impala (SQL), Storm (Streaming), Giraph (Graph), and Spark; y-axis 0-120,000 lines. Spark's bar is by far the smallest.]

Page 32: Benefits of Unification: Code Size

Benefits of Unification: Code Size

[Same bar chart as Page 31, with Shark's lines stacked on Spark's bar.]

Page 33: Benefits of Unification: Code Size

Benefits of Unification: Code Size

[Same bar chart as Page 31, with Shark and Streaming stacked on Spark's bar.]

Page 34: Benefits of Unification: Code Size

Benefits of Unification: Code Size

[Same bar chart as Page 31, with Shark, Streaming, and GraphX stacked on Spark's bar.]

Page 35: Performance

Performance

[Three charts:
SQL[1]: response time (s), y-axis 0-25, for Impala (disk), Impala (mem), Redshift, Shark (disk), Shark (mem).
Streaming[2]: throughput (MB/s/node), y-axis 0-35, for Storm and Spark.
Graph[3]: response time (min), y-axis 0-30, for Hadoop, Giraph, and GraphX.]

[1] https://amplab.cs.berkeley.edu/benchmark/
[2] Discretized Streams: Fault-Tolerant Streaming Computation at Scale. SOSP 2013.
[3] https://amplab.cs.berkeley.edu/publication/graphx-grades/

Page 36: Benefits for Users

Benefits for Users

High-performance data sharing:

» Data sharing is the bottleneck in many environments
» RDDs provide in-place sharing through memory

Applications can compose models:

» Run a SQL query and then PageRank the results
» ETL your data and then run graph/ML on it

Benefit from investment in shared functionality:

» E.g. reusable components (shell) and performance optimizations

Page 37: Getting Started

Getting Started

Visit spark.incubator.apache.org for videos, tutorials, and hands-on exercises. Easy to run in local mode, on private clusters, or on EC2. Spark Summit on Dec 2-3 (spark-summit.org).

Online training camp: ampcamp.berkeley.edu

Page 38: Conclusion

Conclusion

Big data analytics is evolving to include:

» More complex analytics (e.g. machine learning)
» More interactive ad-hoc queries
» More real-time stream processing

Spark is a platform that unifies these models, enabling sophisticated apps.

More info: spark-project.org

Page 39: Backup Slides

Backup Slides

Page 40: Behavior with Not Enough RAM

Behavior with Not Enough RAM

[Chart: iteration time (s) vs. % of working set in memory. Cache disabled: 68.8 s; 25%: 58.1 s; 50%: 40.7 s; 75%: 29.7 s; fully cached: 11.5 s.]