Overview of Spark project


Transcript of Overview of Spark project

Page 1: Overview of Spark project

Overview of Spark project
Presented by Yin Zhu ([email protected])
Materials from:
• http://spark-project.org/documentation/
• Hadoop in Practice by A. Holmes
• My demo code: https://bitbucket.org/blackswift/spark-example
25 March 2013

Page 2: Overview of Spark project

Outline
Review of MapReduce (15’)

Going Through Spark (NSDI’12) Slides (30’)

Demo (15’)

Page 3: Overview of Spark project

Review of MapReduce
Pagerank implemented in Scala and Hadoop
Why Pagerank?
> More complicated than the “hello world” WordCount
> Widely used in search engines, with very large-scale input (the whole web graph!)
> Iterative algorithm (typical of many numerical algorithms for data mining)

Pages 4-10: Overview of Spark project

[Figures: a worked PageRank example on a small link graph. Sample update shown on slide 5: new score = 0.15 + 0.5*0.85 = 0.575]
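For reference, a restatement of the damped update behind that number (my own note, not from the slides), with damping factor 0.85:

  // Sketch (not from the slides): the damped PageRank update used in the worked example.
  val d = 0.85
  val incomingContribution = 0.5                      // sum over in-links of rank / outdegree
  val newScore = (1 - d) + d * incomingContribution   // 0.15 + 0.85 * 0.5 = 0.575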

A functional implementation

// Plain Scala (no Spark); assumes type UrlId = Int with page ids 0 until n.
def pagerank(links: Array[(UrlId, Array[UrlId])], numIters: Int): Map[UrlId, Double] = {
  val n = links.size
  var ranks = Array.tabulate(n)(i => (i, 1.0)).toMap     // init: each node has rank 1.0
  for (iter <- 1 to numIters) {
    val contrib = links
      .flatMap { case (out_url, in_urls) =>
        val score = ranks(out_url) / in_urls.size        // the score each linked url receives
        in_urls.map(in_url => (in_url, score))
      }
      .groupBy(url_score => url_score._1)                // group the (url, score) pairs by url
      .map { case (url, scores) =>                       // sum the scores for each unique url
        (url, scores.foldLeft(0.0)((sum, url_score) => sum + url_score._2))
      }
    ranks = ranks.map { case (url, rank) =>
      (url, if (contrib.contains(url)) 0.85 * contrib(url) + 0.15 else rank)
    }
  }
  ranks
}

Map: the flatMap over links that emits (url, score) contributions
Reduce: the groupBy and per-url sum of those contributions
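As a quick sanity check, a hypothetical call on a tiny three-page graph (the graph and the type alias are my own, not from the slides):

  // Hypothetical usage of the pagerank function above; UrlId is assumed to be Int.
  type UrlId = Int
  val links: Array[(UrlId, Array[UrlId])] = Array(
    0 -> Array(1, 2),   // page 0 links to pages 1 and 2
    1 -> Array(2),      // page 1 links to page 2
    2 -> Array(0)       // page 2 links back to page 0
  )
  val ranks = pagerank(links, numIters = 20)
  ranks.toSeq.sortBy(-_._2).foreach { case (url, rank) => println(url + " -> " + rank) }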

Page 12: Overview of Spark project

Hadoop/MapReduce implementation
Fault tolerance: the result after each iteration is saved to disk.
Speed/disk I/O: disk I/O is proportional to the number of iterations; ideally the link graph should be loaded only once.

Page 13: Overview of Spark project

Resilient Distributed Datasets
A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica

UC Berkeley

Page 14: Overview of Spark project

Motivation
MapReduce greatly simplified “big data” analysis on large, unreliable clusters. But as soon as it got popular, users wanted more:
» More complex, multi-stage applications (e.g. iterative machine learning & graph processing)
» More interactive ad-hoc queries

Response: specialized frameworks for some of these apps (e.g. Pregel for graph processing)

Page 15: Overview of Spark project

Motivation
Complex apps and interactive queries both need one thing that MapReduce lacks: efficient primitives for data sharing.
In MapReduce, the only way to share data across jobs is stable storage, which is slow!

Page 16: Overview of Spark project

Examples

[Figure: two data-sharing patterns in MapReduce. Left: an iterative job performs an HDFS read and an HDFS write around every iteration. Right: queries 1-3 each re-read the same input from HDFS to produce results 1-3.]

Slow due to replication and disk I/O, but necessary for fault tolerance

Page 17: Overview of Spark project

Goal: In-Memory Data Sharing

[Figure: the same two patterns with data kept in memory: iterations pass data directly in memory, and one-time processing of the input feeds queries 1-3 from memory.]

10-100× faster than network/disk, but how to get FT?

Page 18: Overview of Spark project

Challenge
How to design a distributed memory abstraction that is both fault-tolerant and efficient?

Page 19: Overview of Spark project

Solution: Resilient Distributed Datasets (RDDs)
Restricted form of distributed shared memory
» Immutable, partitioned collections of records
» Can only be built through coarse-grained deterministic transformations (map, filter, join, …)
Efficient fault recovery using lineage
» Log one operation to apply to many elements
» Recompute lost partitions on failure
» No cost if nothing fails

Page 20: Overview of Spark project

Three core concepts of RDD
Transformations: define a new RDD from an existing RDD
Cache and Partitioner: put the dataset into memory or other persistent media, and control how its sub-datasets are laid out
Actions: carry out the actual computation
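To make the three concepts concrete, a minimal sketch (assuming a SparkContext named sc and a placeholder HDFS path; not taken from the slides): the transformations only define RDDs, cache() requests in-memory persistence, and nothing runs until the count action.

  val lines  = sc.textFile("hdfs://namenode:9000/data.txt")   // base RDD (placeholder path)
  val nums   = lines.map(_.toInt)                             // transformation: lazily defines a new RDD
  val evens  = nums.filter(_ % 2 == 0)                        // another lazy transformation
  evens.cache()                                               // keep this RDD in memory once computed
  val howMany = evens.count                                   // action: triggers the actual computation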

Page 21: Overview of Spark project

RDD and its lazy transformations
RDD[T]: a sequence of objects of type T
Transformations are lazy:

https://github.com/mesos/spark/blob/master/core/src/main/scala/spark/PairRDDFunctions.scala

Page 22: Overview of Spark project

  /** Return an RDD of grouped items. */
  def groupBy[K: ClassManifest](f: T => K, p: Partitioner): RDD[(K, Seq[T])] = {
    val cleanF = sc.clean(f)
    this.map(t => (cleanF(t), t)).groupByKey(p)
  }

  /**
   * Group the values for each key in the RDD into a single sequence. Allows controlling the
   * partitioning of the resulting key-value pair RDD by passing a Partitioner.
   */
  def groupByKey(partitioner: Partitioner): RDD[(K, Seq[V])] = {
    def createCombiner(v: V) = ArrayBuffer(v)
    def mergeValue(buf: ArrayBuffer[V], v: V) = buf += v
    def mergeCombiners(b1: ArrayBuffer[V], b2: ArrayBuffer[V]) = b1 ++= b2
    val bufs = combineByKey[ArrayBuffer[V]](
      createCombiner _, mergeValue _, mergeCombiners _, partitioner)
    bufs.asInstanceOf[RDD[(K, Seq[V])]]
  }
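A small usage sketch of the two methods above, using their default-partitioner overloads (assumes an existing SparkContext named sc; my example, not from the slides):

  import spark.SparkContext._   // implicit conversion that adds key-value operations such as groupByKey

  val nums = sc.parallelize(1 to 10)
  val byParity = nums.groupBy(n => n % 2)                        // RDD[(Int, Seq[Int])]

  val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
  val grouped = pairs.groupByKey()                               // RDD[(String, Seq[Int])]
  grouped.collect.foreach(println)                               // e.g. (a,ArrayBuffer(1, 2)), (b,ArrayBuffer(3))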

Page 23: Overview of Spark project

Actions: carry out the actual computation

https://github.com/mesos/spark/blob/master/core/src/main/scala/spark/SparkContext.scala

runJob: every action eventually calls SparkContext.runJob, which launches the computation on the cluster.
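For intuition, a paraphrased sketch (not verbatim Spark source) of how an action such as count bottoms out in runJob, which runs a function over each partition's iterator and collects the per-partition results on the driver:

  // Inside RDD[T] (paraphrased): count each partition locally, then sum on the driver.
  def count(): Long =
    sc.runJob(this, (iter: Iterator[T]) => {
      var result = 0L
      while (iter.hasNext) { result += 1L; iter.next() }
      result
    }).sum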

Page 24: Overview of Spark project

Example

lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
messages = errors.map(_.split('\t')(2))

messages.persist()   // or messages.cache()

messages.filter(_.contains("foo")).count
messages.filter(_.contains("bar")).count
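A self-contained version of the example as I would run it against a local master; the constructor arguments and the HDFS path are placeholders, not from the slides:

  import spark.SparkContext

  object LogMining {
    def main(args: Array[String]) {
      val sc = new SparkContext("local[2]", "LogMining")     // placeholder master and job name
      val lines = sc.textFile("hdfs://namenode:9000/logs")   // placeholder path
      val errors = lines.filter(_.startsWith("ERROR"))
      val messages = errors.map(_.split('\t')(2))
      messages.cache()                                       // keep the parsed messages in memory

      println(messages.filter(_.contains("foo")).count)
      println(messages.filter(_.contains("bar")).count)
    }
  }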

Page 25: Overview of Spark project

Task Scheduler for actions
Dryad-like DAGs
Pipelines functions within a stage
Locality & data reuse aware
Partitioning-aware to avoid shuffles

[Figure: an example DAG of RDDs A-G connected by groupBy, map, union, and join, grouped into Stages 1-3; shaded boxes mark cached data partitions.]

Page 26: Overview of Spark project

RDD Recovery

[Figure: the in-memory data-sharing diagrams again (one-time processing of the input feeding queries 1-3, and an iterative job passing data between iterations), used to illustrate recovering lost RDD partitions.]

Page 27: Overview of Spark project

Fault Recovery
RDDs track the graph of transformations that built them (their lineage) to rebuild lost data. E.g.:

messages = textFile(...).filter(_.contains("error"))
                        .map(_.split('\t')(2))

[Figure: lineage chain HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…))]
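To illustrate the idea (a toy of my own, not Spark's actual classes): each derived dataset only needs to remember its parent and the function that produced it, so a lost partition can be recomputed from the parent instead of being replicated.

  // Toy lineage: recompute a partition on demand from its parent; nothing is replicated.
  trait ToyRDD[T] {
    def computePartition(split: Int): Seq[T]
  }

  class SourcePartitions[T](data: Array[Seq[T]]) extends ToyRDD[T] {
    def computePartition(split: Int): Seq[T] = data(split)
  }

  class Mapped[A, B](parent: ToyRDD[A], f: A => B) extends ToyRDD[B] {
    // If this partition is lost, calling computePartition again re-derives it from the parent.
    def computePartition(split: Int): Seq[B] = parent.computePartition(split).map(f)
  }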

Page 28: Overview of Spark project

Fault Recovery Results

[Figure: iteration time (s) over 10 iterations: 119, 57, 56, 58, 58, 81, 57, 59, 57, 59. The failure happens during the 81 s iteration.]

Page 29: Overview of Spark project

Generality of RDDs
Despite their restrictions, RDDs can express surprisingly many parallel algorithms
» These naturally apply the same operation to many items
Unify many current programming models
» Data flow models: MapReduce, Dryad, SQL, …
» Specialized models for iterative apps: BSP (Pregel), iterative MapReduce (Haloop), bulk incremental, …
Support new apps that these models don’t

Page 30: Overview of Spark project

Tradeoff Space

[Figure: granularity of updates (fine to coarse) plotted against write throughput (low to high, from network bandwidth to memory bandwidth). Fine-grained, lower-throughput corner: K-V stores, databases, RAMCloud, best for transactional workloads. Coarse-grained, high-throughput corner: HDFS and RDDs, best for batch workloads.]

Page 31: Overview of Spark project

Outline
Spark programming interface
Implementation
Demo
How people are using Spark

Page 32: Overview of Spark project

Spark Programming Interface
DryadLINQ-like API in the Scala language
Usable interactively from the Scala interpreter
Provides:
» Resilient distributed datasets (RDDs)
» Operations on RDDs: transformations (build new RDDs), actions (compute and output results)
» Control of each RDD’s partitioning (layout across nodes) and persistence (storage in RAM, on disk, etc.)

Page 33: Overview of Spark project

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")           // base RDD
errors = lines.filter(_.startsWith("ERROR"))   // transformed RDD
messages = errors.map(_.split('\t')(2))
messages.persist()

messages.filter(_.contains("foo")).count       // action
messages.filter(_.contains("bar")).count

[Figure: the master driver sends tasks to three workers, each holding one block of the log (Block 1-3) and caching its partition of messages (Msgs. 1-3); results flow back to the master.]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)

Page 34: Overview of Spark project

Example: PageRank
1. Start each page with a rank of 1
2. On each iteration, update each page’s rank to Σ_{i ∈ neighbors} rank_i / |neighbors_i|

links = // RDD of (url, neighbors) pairs
ranks = // RDD of (url, rank) pairs
for (i <- 1 to ITERATIONS) {
  ranks = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }.reduceByKey(_ + _)
}
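A hedged sketch of how the two RDDs above might be built before the loop (the input format, path, and names are my assumptions, not from the slide): each line lists a URL followed by its neighbors, links is cached because it is reused every iteration, and ranks is a var so the loop can reassign it.

  import spark.SparkContext._   // key-value RDD operations (join, mapValues, reduceByKey)

  // Assumed input: one line per page, "url neighbour1 neighbour2 ..."
  val links = sc.textFile("hdfs://namenode:9000/links")     // placeholder path
    .map { line =>
      val parts = line.split("\\s+")
      (parts.head, parts.tail.toSeq)                        // (url, neighbors)
    }
    .cache()                                                // reused on every iteration
  var ranks = links.mapValues(_ => 1.0)                     // (url, rank), starting at 1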

Page 35: Overview of Spark project

Optimizing Placement

links & ranks repeatedly joined
Can co-partition them (e.g. hash both on URL) to avoid shuffles
Can also use app knowledge, e.g., hash on DNS name

links = links.partitionBy(new URLPartitioner())

[Figure: lineage of the PageRank job: Links (url, neighbors) and Ranks0 (url, rank) are joined, contributions (Contribs0, Contribs2, …) are reduced into Ranks1, Ranks2, … across iterations.]
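The slide references a URLPartitioner but does not show it; a minimal sketch of what it could look like (my assumption, not the deck's code), hashing on the host part of the URL so pages from the same site land in the same partition:

  import java.net.URL
  import spark.Partitioner

  class URLPartitioner(val numPartitions: Int) extends Partitioner {
    def getPartition(key: Any): Int = {
      val host = new URL(key.toString).getHost               // partition by DNS name
      ((host.hashCode % numPartitions) + numPartitions) % numPartitions
    }
    override def equals(other: Any): Boolean = other match {
      case p: URLPartitioner => p.numPartitions == numPartitions
      case _                 => false
    }
  }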

Page 36: Overview of Spark project

PageRank Performance

[Figure: time per iteration (s) for Hadoop, Basic Spark (72 s), and Spark + Controlled Partitioning (23 s).]

Page 37: Overview of Spark project

Implementation
Runs on Mesos [NSDI ’11] to share clusters w/ Hadoop
Can read from any Hadoop input source (HDFS, S3, …)

[Figure: Spark, Hadoop, and MPI frameworks running side by side on Mesos across cluster nodes.]

No changes to Scala language or compiler
» Reflection + bytecode analysis to correctly ship code

www.spark-project.org

Page 38: Overview of Spark project

Behavior with Insufficient RAM

Percent of working set in memory:   0%     25%    50%    75%    100%
Iteration time (s):                 68.8   58.1   40.7   29.7   11.5

Page 39: Overview of Spark project

Scalability

Iteration time (s) vs. number of machines:

Logistic Regression      25     50     100
  Hadoop                 184    111    76
  HadoopBinMem           116    80     62
  Spark                  15     6      3

K-Means                  25     50     100
  Hadoop                 274    157    106
  HadoopBinMem           197    121    87
  Spark                  143    61     33

Page 40: Overview of Spark project

Breaking Down the Speedup

Iteration time (s)       Text input   Binary input
  In-mem HDFS            15.4         8.4
  In-mem local file      13.1         6.9
  Spark RDD              2.9          2.9

Page 41: Overview of Spark project

Spark Operations

Transformations (define a new RDD):
map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues

Actions (return a result to driver program):
collect, reduce, count, save, lookupKey
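A short word-count-style sketch chaining several of the operations listed above (assumes a SparkContext named sc, the spark.SparkContext._ import, and a placeholder input path; my example, not from the slides):

  val counts = sc.textFile("hdfs://namenode:9000/input")    // placeholder path
    .flatMap(_.split(" "))                                  // transformation
    .map(word => (word, 1))                                 // transformation
    .reduceByKey(_ + _)                                     // transformation (shuffle)
    .sortByKey()                                            // transformation

  println(counts.count)                                     // action: triggers the whole pipeline
  counts.collect.foreach(println)                           // action: bring results to the driver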

Page 42: Overview of Spark project

Demo

Page 43: Overview of Spark project

Open Source Community
15 contributors, 5+ companies using Spark, 3+ application projects at Berkeley
User applications:
» Data mining 40x faster than Hadoop (Conviva)
» Exploratory log analysis (Foursquare)
» Traffic prediction via EM (Mobile Millennium)
» Twitter spam classification (Monarch)
» DNA sequence analysis (SNAP)
» . . .

Page 44: Overview of Spark project

Related Work
RAMCloud, Piccolo, GraphLab, parallel DBs
» Fine-grained writes requiring replication for resilience
Pregel, iterative MapReduce
» Specialized models; can’t run arbitrary / ad-hoc queries
DryadLINQ, FlumeJava
» Language-integrated “distributed dataset” API, but cannot share datasets efficiently across queries
Nectar [OSDI ’10]
» Automatic expression caching, but over distributed FS
PacMan [NSDI ’12]
» Memory cache for HDFS, but writes still go to network/disk

Page 45: Overview of Spark project

Conclusion
RDDs offer a simple and efficient programming model for a broad range of applications
Leverage the coarse-grained nature of many parallel algorithms for low-overhead recovery
Try it out at www.spark-project.org