Spark: Fast, Interactive, Language-Integrated Cluster Computing


Transcript of Spark Fast, Interactive, Language-Integrated Cluster Computing.

Page 1: Spark Fast, Interactive, Language-Integrated Cluster Computing.

Spark: Fast, Interactive, Language-Integrated Cluster Computing

Page 2: Spark Fast, Interactive, Language-Integrated Cluster Computing.

Project Goals

Extend the MapReduce model to better support two common classes of analytics apps:
>> Iterative algorithms (machine learning, graphs)
>> Interactive data mining

Enhance programmability:
>> Integrate into the Scala programming language
>> Allow interactive use from the Scala interpreter

Page 3: Spark Fast, Interactive, Language-Integrated Cluster Computing.

Background

Most current cluster programming models are based on directed acyclic data flow from stable storage to stable storage.

Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures.

Page 4: Spark Fast, Interactive, Language-Integrated Cluster Computing.

Problem

Acyclic data flow is inefficient for applications that repeatedly reuse a working set of data:
>> Iterative algorithms (machine learning, graphs)
>> Interactive data mining tools (R, Excel, Python)

With current frameworks, apps reload data from stable storage on each query

Page 5: Spark Fast, Interactive, Language-Integrated Cluster Computing.

Solution: Resilient Distributed Datasets (RDDs)

Allow apps to keep working sets in memory for efficient reuse

Retain the attractive properties of MapReduce:
>> Fault tolerance, data locality, scalability

Support a wide range of applications

Page 6: Spark Fast, Interactive, Language-Integrated Cluster Computing.

About Scala

High-level language for the JVM:
>> Object-oriented + functional programming (FP)

Statically typed:
>> Comparable in speed to Java
>> No need to write out types in most cases, thanks to type inference

Interoperates with Java:
>> Can use any Java class, inherit from it, etc.
>> Can also call Scala code from Java
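
For illustration, a minimal sketch (not from the slides) of these three points:

val xs = List(1, 2, 3)              // type inference: inferred as List[Int], no annotation needed
val doubled = xs.map(x => x * 2)    // functional style: pass a function to a higher-order method
val buf = new java.util.ArrayList[String]()  // Java interop: use any Java class directly
buf.add("spark")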

Page 7: Spark Fast, Interactive, Language-Integrated Cluster Computing.

Quick Tour

Page 8: Spark Fast, Interactive, Language-Integrated Cluster Computing.

Quick Tour

Page 9: Spark Fast, Interactive, Language-Integrated Cluster Computing.

All of these operations leave the list unchanged (Scala's List is immutable)
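
As a sketch of what such interpreter-session operations might look like (illustrative, since the original code slides are not in the transcript):

val list = List(1, 2, 3)
list.map(_ * 2)          // List(2, 4, 6)
list.filter(_ % 2 == 1)  // List(1, 3)
list.reduce(_ + _)       // 6
list                     // still List(1, 2, 3): every call returned a new value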

Page 10: Spark Fast, Interactive, Language-Integrated Cluster Computing.
Page 11: Spark Fast, Interactive, Language-Integrated Cluster Computing.
Page 12: Spark Fast, Interactive, Language-Integrated Cluster Computing.

Spark Overview

Concept: resilient distributed datasets (RDDs)
>> Immutable collections of objects spread across a cluster
>> Built through parallel transformations (map, filter, etc.)
>> Automatically rebuilt on failure
>> Controllable persistence (e.g. caching in RAM) for reuse
>> Shared variables that can be used in parallel operations

Goal: work with distributed collections as you would with local ones
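
A hypothetical end-to-end sketch of this model, written spark-shell style (sc is the SparkContext the shell predefines):

val nums = sc.parallelize(1 to 1000000)       // immutable distributed collection
val evens = nums.filter(_ % 2 == 0).cache()   // parallel transformation + persistence in RAM
evens.count()                                  // first action computes and caches the data
evens.map(_ * 2).reduce(_ + _)                 // later operations reuse the in-memory copy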

Page 13: Spark Fast, Interactive, Language-Integrated Cluster Computing.

Spark framework
>> Spark + Hive
>> Spark + Pregel

Page 14: Spark Fast, Interactive, Language-Integrated Cluster Computing.

Run Spark

Spark runs as a library in your program (one instance per app)

Runs tasks locally or on Mesos

>> new SparkContext(masterUrl, jobname, [sparkhome], [jars])

>> MASTER=local[n] ./spark-shell
>> MASTER=HOST:PORT ./spark-shell
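
For example, the constructor form above might be used like this (illustrative values; in early releases the class lived in package spark, in later ones in org.apache.spark):

val sc = new SparkContext(
  "local[4]",                  // masterUrl: run locally with 4 threads
  "MyApp",                     // jobname shown in logs and the UI
  "/path/to/spark",            // [sparkhome], optional, illustrative
  List("target/my-app.jar"))   // [jars] shipped to workers, illustrative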

Page 15: Spark Fast, Interactive, Language-Integrated Cluster Computing.

RDD Abstraction

An RDD is a read-only, partitioned collection of records. It can only be created from:
(1) Data in stable storage
(2) Other RDDs (via transformations, which form its lineage)

An RDD has enough information about how it was derived from other datasets (its lineage) to rebuild it.

Users can control two aspects of RDDs:
1) Persistence (e.g. keep in RAM for reuse)
2) Partitioning (hash or range partitioning on <k, v> records)
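
A hedged sketch of both controls, spark-shell style (class and package names as in later org.apache.spark releases):

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
// 1) Persistence: keep the dataset in RAM for reuse
val cached = pairs.cache()   // equivalent to persist(StorageLevel.MEMORY_ONLY)
// 2) Partitioning: hash-partition by key, e.g. into 8 partitions
val hashed = cached.partitionBy(new org.apache.spark.HashPartitioner(8))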

Page 16: Spark Fast, Interactive, Language-Integrated Cluster Computing.

RDD Types: Parallelized Collections

Created by calling SparkContext’s parallelize method on an existing Scala collection (a Seq object)

Once created, the distributed dataset can be operated on in parallel
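
For instance (spark-shell style, sc predefined):

val data = Seq(1, 2, 3, 4, 5)          // an existing Scala collection (a Seq)
val distData = sc.parallelize(data)    // now a distributed dataset
distData.map(_ * 2).reduce(_ + _)      // evaluated in parallel; result: 30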

Page 17: Spark Fast, Interactive, Language-Integrated Cluster Computing.

RDD Types: Hadoop Datasets

Spark supports text files, SequenceFiles, and any other Hadoop InputFormat

val distFiles = sc.textFile(URI)

For other Hadoop InputFormats: val distFile = sc.hadoopRDD(URI)

The URI can be a local path or an hdfs://, s3n://, or kfs:// path
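
Illustrative spark-shell style examples (the paths and host names are made up):

val localFile = sc.textFile("/data/input.txt")                      // local path
val hdfsFile  = sc.textFile("hdfs://namenode:9000/data/input.txt")  // HDFS path
val seqData   = sc.sequenceFile[String, Int]("hdfs://namenode:9000/data/counts")  // SequenceFile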

Page 18: Spark Fast, Interactive, Language-Integrated Cluster Computing.

RDD Operations

Transformations >> create a new dataset from an existing one

Actions >> Return a value to the driver program

Transformations are lazy: they do not compute their results right away. Spark just remembers the transformations applied to each dataset (its lineage), and only computes them when an action requires a result.
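
A small sketch of this laziness (spark-shell style):

val nums = sc.parallelize(1 to 10)
val squares = nums.map(x => x * x)      // transformation: nothing runs yet, only lineage is recorded
val evens = squares.filter(_ % 2 == 0)  // still nothing computed
evens.count()                           // action: the whole chain executes now; result: 5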

Page 19: Spark Fast, Interactive, Language-Integrated Cluster Computing.

Transformations (operation and its meaning):

map(func): Return a new distributed dataset formed by passing each element of the source through the function func

filter(func): Return a new dataset formed by selecting those elements of the source on which func returns true

union(otherDataset): Return a new dataset that contains the union of the elements in the source dataset and the argument

…
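
Illustrative uses of these transformations (spark-shell style; collect() is only added to show the results):

val a = sc.parallelize(Seq(1, 2, 3))
val b = sc.parallelize(Seq(3, 4, 5))
a.map(_ * 10).collect()             // Array(10, 20, 30)
a.filter(_ > 1).collect()           // Array(2, 3)
a.union(b).collect()                // Array(1, 2, 3, 3, 4, 5)
a.flatMap(x => Seq(x, x)).collect() // Array(1, 1, 2, 2, 3, 3)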

Page 20: Spark Fast, Interactive, Language-Integrated Cluster Computing.

Actions (operation and its meaning):

reduce(func): Aggregate the elements of the dataset using the function func

collect(): Return all the elements of the dataset as an array at the driver program

count(): Return the number of elements in the dataset

first(): Return the first element of the dataset

saveAsTextFile(path): Write the elements of the dataset as a text file (or set of text files) in a given directory in the local file system, HDFS, or any other Hadoop-supported file system

…
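
Illustrative uses of these actions (spark-shell style; the output path is made up):

val data = sc.parallelize(Seq(5, 3, 1, 4, 2))
data.reduce(_ + _)                                  // 15
data.collect()                                      // Array(5, 3, 1, 4, 2)
data.count()                                        // 5
data.first()                                        // 5
data.saveAsTextFile("hdfs://namenode:9000/output")  // writes one text file per partition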

Page 21: Spark Fast, Interactive, Language-Integrated Cluster Computing.

Transformations & Actions

Page 22: Spark Fast, Interactive, Language-Integrated Cluster Computing.

Representing RDDs

Challenge: choosing a representation for RDDs that can track lineage across transformations

Each RDD includes:
1) A set of partitions (atomic pieces of the dataset)
2) A set of dependencies on parent RDDs
3) A function for computing the dataset based on its parents
4) Metadata about its partitioning scheme
5) Data placement information

Page 23: Spark Fast, Interactive, Language-Integrated Cluster Computing.

Interface used to represent RDDs (operation and its meaning):

partitions(): Return a list of Partition objects

preferredLocations(p): List nodes where partition p can be accessed faster due to data locality

dependencies(): Return a list of dependencies on parent RDDs

iterator(p, parentIters): Compute the elements of partition p given iterators for its parent partitions

partitioner(): Return metadata specifying whether the RDD is hash- or range-partitioned
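
A simplified Scala sketch of that interface (not Spark's actual source, just a restatement of the table above):

trait Partition
trait Dependency
trait Partitioner

trait RDD[T] {
  def partitions(): Seq[Partition]                    // atomic pieces of the dataset
  def preferredLocations(p: Partition): Seq[String]   // nodes where p is local
  def dependencies(): Seq[Dependency]                 // links to parent RDDs (lineage)
  def iterator(p: Partition, parentIters: Seq[Iterator[_]]): Iterator[T]  // compute p from its parents
  def partitioner(): Option[Partitioner]              // Some(...) if hash/range partitioned
}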

Page 24: Spark Fast, Interactive, Language-Integrated Cluster Computing.

RDD Dependencies

Each box is an RDD, with partitions shown as shaded rectangles

Page 25: Spark Fast, Interactive, Language-Integrated Cluster Computing.

RDD Fault Tolerance

An RDD is a read-only, partitioned collection of records. It can only be created from:
(1) Data in stable storage
(2) Other RDDs

An RDD has enough information about how it was derived from other datasets (its lineage) to recreate it.
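
As a sketch, the lineage of a small word-count job looks like this (spark-shell style; the path is illustrative, and toDebugString is available in later Spark releases):

val counts = sc.textFile("hdfs://namenode:9000/data/input.txt")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
println(counts.toDebugString)  // prints the chain textFile -> flatMap -> map -> reduceByKey
// If a partition of counts is lost, Spark reruns just that chain for the lost partition.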

Page 26: Spark Fast, Interactive, Language-Integrated Cluster Computing.
Page 27: Spark Fast, Interactive, Language-Integrated Cluster Computing.
Page 28: Spark Fast, Interactive, Language-Integrated Cluster Computing.

PageRank

Page 29: Spark Fast, Interactive, Language-Integrated Cluster Computing.
Page 30: Spark Fast, Interactive, Language-Integrated Cluster Computing.

Algorithm

1. Start each page at a rank of 1
2. On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors
3. Set each page’s rank to 0.15 + 0.85 × contribs

[Diagram: four linked pages, each starting at a rank of 1.0]
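
For example, a page whose incoming contributions sum to 2 gets a new rank of 0.15 + 0.85 × 2 = 1.85, while a page receiving a single contribution of 0.5 gets 0.15 + 0.85 × 0.5 ≈ 0.58; these are the values that appear in the diagrams on the following pages.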

Page 31: Spark Fast, Interactive, Language-Integrated Cluster Computing.

[Diagram: first iteration; each page, starting at rank 1.0, sends contributions of 0.5 or 1.0 along its outgoing links]

Page 32: Spark Fast, Interactive, Language-Integrated Cluster Computing.

[Diagram: ranks after the first iteration: 1.85, 1.0, 0.58, 0.58]

Page 33: Spark Fast, Interactive, Language-Integrated Cluster Computing.

[Diagram: second iteration; the pages, now at ranks 1.85, 1.0, 0.58 and 0.58, again divide their ranks among their neighbors, sending contributions of 1.85, 0.5, 0.29 and 0.58]

Page 34: Spark Fast, Interactive, Language-Integrated Cluster Computing.

[Diagram: ranks after the second iteration: 1.72, 1.31, 0.58, 0.39; further iterations continue in the same way]

Page 35: Spark Fast, Interactive, Language-Integrated Cluster Computing.

Final state:

[Diagram: ranks converge to 1.44, 1.37, 0.73, 0.46]

Page 36: Spark Fast, Interactive, Language-Integrated Cluster Computing.
Page 37: Spark Fast, Interactive, Language-Integrated Cluster Computing.

Python Implementation

links = # RDD of (url, neighbors) pairs
ranks = # RDD of (url, rank) pairs

for i in range(NUM_ITERATIONS):
    def compute_contribs(pair):
        [url, [links, rank]] = pair  # split key-value pair
        return [(dest, rank / len(links)) for dest in links]

    contribs = links.join(ranks).flatMap(compute_contribs)
    ranks = contribs.reduceByKey(lambda x, y: x + y) \
                    .mapValues(lambda x: 0.15 + 0.85 * x)

ranks.saveAsTextFile(...)
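
A hedged Scala counterpart of the Python code above (not taken from the slides; the toy link data and output path are made up):

val NUM_ITERATIONS = 10
val links = sc.parallelize(Seq(
  ("a", Seq("b", "c")), ("b", Seq("c")), ("c", Seq("a"))))  // (url, neighbors) pairs
var ranks = links.mapValues(_ => 1.0)                        // start every page at rank 1

for (i <- 1 to NUM_ITERATIONS) {
  val contribs = links.join(ranks).flatMap {
    case (url, (neighbors, rank)) => neighbors.map(dest => (dest, rank / neighbors.size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}

ranks.saveAsTextFile("hdfs://namenode:9000/pagerank-output")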

Page 38: Spark Fast, Interactive, Language-Integrated Cluster Computing.

PageRank Performance

[Chart: iteration time (s) vs. number of machines (30 and 60), Hadoop vs. Spark. Hadoop: 171 s on 30 machines, 80 s on 60 machines; Spark: 23 s on 30 machines, 14 s on 60 machines]

Page 39: Spark Fast, Interactive, Language-Integrated Cluster Computing.

Other Iterative Algorithms

[Chart: time per iteration (s), Hadoop vs. Spark]

Logistic Regression: Hadoop 110 s, Spark 0.96 s

K-Means Clustering: Hadoop 155 s, Spark 4.1 s