
Spark and Resilient Distributed Datasets

Amir H. Payberah, [email protected]

Amirkabir University of Technology (Tehran Polytechnic)


Motivation

I MapReduce greatly simplified big data analysis on large, unreliable clusters.

I But as soon as it got popular, users wanted more:
• Iterative jobs, e.g., machine learning algorithms
• Interactive analytics


Motivation

I Both iterative and interactive queries need one thing that MapReduce lacks:

Efficient primitives for data sharing.

I In MapReduce, the only way to share data across jobs is stable storage, which is slow.

I Replication also makes the system slow, but it is necessary for fault tolerance.


Proposed Solution

In-Memory Data Processing and Sharing.


Challenge

How to design a distributed memory abstraction that is both fault tolerant and efficient?

Solution

Resilient Distributed Datasets (RDD)


Resilient Distributed Datasets (RDD) (1/2)

I A distributed memory abstraction.

I Immutable collections of objects spread across a cluster.


Resilient Distributed Datasets (RDD) (2/2)

I An RDD is divided into a number of partitions, which are atomic pieces of information.

I Partitions of an RDD can be stored on different nodes of a cluster.
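I A minimal sketch, assuming the standard parallelize and partitions API: the second argument of parallelize sets how many partitions the RDD is split into.

// split the collection into 4 partitions spread over the cluster
val data = sc.parallelize(1 to 10, 4)
data.partitions.size // 4
data.collect()       // Array(1, 2, ..., 10), gathered back from all partitions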


Programming Model


Spark Programming Model (1/2)

I Spark programming model is based on parallelizable operators.

I Parallelizable operators are higher-order functions that execute user-defined functions in parallel.
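I A hedged sketch (the function name below is made up for illustration): a user-defined function is passed to a higher-order operator such as filter, which applies it to every element in parallel.

// a user-defined function (hypothetical example)...
def isShort(word: String): Boolean = word.length < 5

// ...executed in parallel by the higher-order operator filter
val words = sc.parallelize(Seq("Spark", "RDD", "cluster"))
val short = words.filter(isShort) // {RDD}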


Spark Programming Model (2/2)

I A data flow is composed of any number of data sources, operators, and data sinks by connecting their inputs and outputs.

I Job description based on directed acyclic graphs (DAG).


Higher-Order Functions (1/3)

I Higher-order functions: RDD operators.

I There are two types of RDD operators: transformations and actions.


Higher-Order Functions (2/3)

I Transformations: lazy operators that create new RDDs.

I Actions: launch a computation and return a value to the program or write data to external storage.
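I A small sketch of the difference: the transformation below only defines a new RDD; nothing is computed until the action runs.

val nums = sc.parallelize(Array(1, 2, 3))
val squares = nums.map(x => x * x) // transformation: lazy, nothing is executed yet
squares.count()                    // action: triggers the computation and returns 3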


RDD Transformations - Map

I All pairs are independently processed.

// passing each element through a function.

val nums = sc.parallelize(Array(1, 2, 3))

val squares = nums.map(x => x * x) // {1, 4, 9}

// selecting those elements that func returns true.

val even = squares.filter(x => x % 2 == 0) // {4}

// mapping each element to zero or more others.

nums.flatMap(x => Range(0, x, 1)) // {0, 0, 1, 0, 1, 2}


RDD Transformations - Reduce

I Pairs with identical key are grouped.

I Groups are independently processed.

val pets = sc.parallelize(Seq(("cat", 1), ("dog", 1), ("cat", 2)))

pets.reduceByKey((x, y) => x + y)

// {(cat, 3), (dog, 1)}

pets.groupByKey()

// {(cat, (1, 2)), (dog, (1))}


RDD Transformations - Join

I Performs an equi-join on the key.

I Join candidates are independently processed.

val visits = sc.parallelize(Seq(("index.html", "1.2.3.4"),

("about.html", "3.4.5.6"),

("index.html", "1.3.3.1")))

val pageNames = sc.parallelize(Seq(("index.html", "Home"),

("about.html", "About")))

visits.join(pageNames)

// ("index.html", ("1.2.3.4", "Home"))

// ("index.html", ("1.3.3.1", "Home"))

// ("about.html", ("3.4.5.6", "About"))


RDD Transformations - CoGroup

I Groups each input on key.

I Groups with identical keys are processed together.

val visits = sc.parallelize(Seq(("index.html", "1.2.3.4"),

("about.html", "3.4.5.6"),

("index.html", "1.3.3.1")))

val pageNames = sc.parallelize(Seq(("index.html", "Home"),

("about.html", "About")))

visits.cogroup(pageNames)

// ("index.html", (("1.2.3.4", "1.3.3.1"), ("Home")))

// ("about.html", (("3.4.5.6"), ("About")))


RDD Transformations - Union and Sample

I Union: merges two RDDs and returns a single RDD using bag semantics, i.e., duplicates are not removed.

I Sample: similar to mapping, except that the RDD stores a random number generator seed for each partition to deterministically sample parent records (both are sketched below).
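I A minimal sketch of both operators, assuming the usual sample(withReplacement, fraction, seed) signature.

val a = sc.parallelize(Seq(1, 2, 3))
val b = sc.parallelize(Seq(3, 4))

a.union(b).collect() // Array(1, 2, 3, 3, 4): bag semantics, the duplicate 3 is kept

// deterministic random subset of roughly 50% of the elements, reproducible via the seed
a.sample(false, 0.5, 42).collect()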


Basic RDD Actions (1/2)

I Return all the elements of the RDD as an array.

val nums = sc.parallelize(Array(1, 2, 3))

nums.collect() // Array(1, 2, 3)

I Return an array with the first n elements of the RDD.

nums.take(2) // Array(1, 2)

I Return the number of elements in the RDD.

nums.count() // 3


Basic RDD Actions (2/2)

I Aggregate the elements of the RDD using the given function.

nums.reduce((x, y) => x + y)

or

nums.reduce(_ + _) // 6

I Write the elements of the RDD as a text file.

nums.saveAsTextFile("hdfs://file.txt")


SparkContext

I Main entry point to Spark functionality.

I Available in shell as variable sc.

I In standalone programs, you should make your own.

import org.apache.spark.SparkContext

import org.apache.spark.SparkContext._

val sc = new SparkContext(master, appName, [sparkHome], [jars])

The master parameter can be local, local[k], spark://host:port, or mesos://host:port.


Creating RDDs

I Turn a collection into an RDD.

val a = sc.parallelize(Array(1, 2, 3))

I Load text file from local FS, HDFS, or S3.

val a = sc.textFile("file.txt")

val b = sc.textFile("directory/*.txt")

val c = sc.textFile("hdfs://namenode:9000/path/file")


Example (1/2)

I Count the lines containing SICS.

val file = sc.textFile("hdfs://...")

val sics = file.filter(_.contains("SICS"))

val cachedSics = sics.cache()

val ones = cachedSics.map(_ => 1)

val count = ones.reduce(_+_)

Here filter and map are transformations; reduce is an action.


Example (2/2)

I Count the lines containing SICS.

val file = sc.textFile("hdfs://...")

val count = file.filter(_.contains("SICS")).count()

Here filter is a transformation and count is an action.


Example - Standalone Application (1/2)

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object WordCount {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "SICS", "127.0.0.1",
      List("target/scala-2.10/sics-count_2.10-1.0.jar"))
    val file = sc.textFile("...").cache()
    val count = file.filter(_.contains("SICS")).count()
  }
}


Example - Standalone Application (2/2)

I sics.sbt:

name := "SICS Count"

version := "1.0"

scalaVersion := "2.10.3"

libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.0-incubating"

resolvers += "Akka Repository" at "http://repo.akka.io/releases/"


Shared Variables (1/2)

I When Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to each task.

I Sometimes, a variable needs to be shared across tasks, or between tasks and the driver program.


Shared Variables (2/2)

I No updates to the variables are propagated back to the driver program.

I General read-write shared variables across tasks would be inefficient.
• For example, to give every node a copy of a large input dataset.

I Two types of shared variables: broadcast variables and accumulators.


Shared Variables: Broadcast Variables

I A read-only variable cached on each machine rather than shipping a copy of it with tasks.

I The broadcast values are not shipped to the nodes more than once.

scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))

broadcastVar: spark.Broadcast[Array[Int]] = spark.Broadcast(b5c40191-...)

scala> broadcastVar.value

res0: Array[Int] = Array(1, 2, 3)


Shared Variables: Accumulators

I They can only be added to.

I They can be used to implement counters or sums.

I Tasks running on the cluster can then add to it using the += operator.

scala> val accum = sc.accumulator(0)

accum: spark.Accumulator[Int] = 0

scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)

...

scala> accum.value

res2: Int = 10


Execution Engine (Spark)


Spark

I Spark provides a programming interface in Scala.

I Each RDD is represented as an object in Spark.


Spark Programming Interface

I A Spark application consists of a driver program that runs the user's main function and executes various parallel operations on a cluster.


Lineage

I Lineage: the transformations used to build an RDD.

I RDDs are stored as a chain of objects capturing the lineage of each RDD.

val file = sc.textFile("hdfs://...")

val sics = file.filter(_.contains("SICS"))

val cachedSics = sics.cache()

val ones = cachedSics.map(_ => 1)

val count = ones.reduce(_+_)
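I A hedged illustration: the lineage chain of an RDD can be printed with toDebugString (the exact output format varies across Spark versions).

println(ones.toDebugString)
// e.g. MappedRDD <- FilteredRDD <- MappedRDD <- HadoopRDD,
// i.e., the chain of transformations from the HDFS file up to ones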


RDD Dependencies (1/3)

I Two types of dependencies between RDDs: Narrow and Wide.


RDD Dependencies: Narrow (2/3)

I Narrow: each partition of a parent RDD is used by at most one partition of the child RDD.

I Narrow dependencies allow pipelined execution on one cluster node, e.g., a map followed by a filter.


RDD Dependencies: Wide (3/3)

I Wide: each partition of a parent RDD is used by multiple partitions of the child RDDs.
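I A small sketch contrasting the two; OneToOneDependency and ShuffleDependency are the dependency classes Spark typically reports here.

val words = sc.parallelize(Seq("a", "b", "a"))
val pairs = words.map(w => (w, 1))    // narrow: pipelined with its parent
val counts = pairs.reduceByKey(_ + _) // wide: needs a shuffle across partitions

pairs.dependencies  // e.g. List(OneToOneDependency)
counts.dependencies // e.g. List(ShuffleDependency)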


Job Scheduling (1/2)

I When a user runs an action on an RDD, the scheduler builds a DAG of stages from the RDD lineage graph.

I A stage contains as many pipelined transformations with narrow dependencies as possible.

I The boundary of a stage:
• Shuffles for wide dependencies.
• Already computed partitions.


Job Scheduling (2/2)

I The scheduler launches tasks to compute missing partitions from each stage until it computes the target RDD.

I Tasks are assigned to machines based on data locality.

• If a task needs a partition that is available in the memory of a node, the task is sent to that node.


RDD Fault Tolerance (1/3)

I RDDs maintain lineage information that can be used to reconstruct lost partitions.

I Logging lineage rather than the actual data.

I No replication.

I Recompute only the lost partitions of an RDD.


RDD Fault Tolerance (2/3)

I The intermediate records of wide dependencies are materialized on the nodes holding the parent partitions, to simplify fault recovery.

I If a task fails, it is re-run on another node, as long as its stage's parents are available.

I If some stages become unavailable, the tasks are submitted to compute the missing partitions in parallel.


RDD Fault Tolerance (3/3)

I Recovery may be time-consuming for RDDs with long lineage chains and wide dependencies.

I It can be helpful to checkpoint some RDDs to stable storage.

I The decision about which data to checkpoint is left to users.
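I A minimal sketch of checkpointing; the HDFS directory below is hypothetical.

sc.setCheckpointDir("hdfs://namenode:9000/checkpoints") // hypothetical directory

val counts = sc.textFile("hdfs://...")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.checkpoint() // mark the RDD to be saved to stable storage
counts.count()      // the checkpoint is materialized when an action runs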


Memory Management (1/2)

I If there is not enough space in memory for a newly computed RDD partition, a partition from the least recently used RDD is evicted.

I Spark provides three options for storage of persistent RDDs:
1. In-memory storage as deserialized Java objects.
2. In-memory storage as serialized Java objects.
3. On-disk storage.


Memory Management (2/2)

I When an RDD is persisted, each node stores any partitions of the RDD that it computes in memory.

I This allows future actions to be much faster.

I Persisting an RDD using persist() or cache() methods.

I Different storage levels: MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.
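I A minimal sketch; StorageLevel lives in org.apache.spark.storage, and cache() is shorthand for persist with MEMORY_ONLY.

import org.apache.spark.storage.StorageLevel

val file = sc.textFile("hdfs://...")
// keep partitions in memory and spill to disk when they do not fit
file.persist(StorageLevel.MEMORY_AND_DISK)

file.filter(_.contains("SICS")).count() // first action computes and stores the partitions
file.count()                            // later actions reuse the persisted partitions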


RDD Applications

I Applications suitable for RDDs:
• Batch applications that apply the same operation to all elements of a dataset.

I Applications not suitable for RDDs:
• Applications that make asynchronous fine-grained updates to shared state, e.g., a storage system for a web application.


Summary

I RDD: a distributed memory abstraction that is both fault tolerant and efficient.

I Two types of operations: Transformations and Actions.

I RDD fault tolerance: Lineage


Questions?
