Transcript of Lec21: Spark Analytics

Page 1

Analytics in Spark

Yanlei Diao

Slides Courtesy of Ion Stoica, Matei Zaharia, Brooke Wenig, Tim Hunter

Page 2

2009 State-of-the-Art in Big Data: Apache Hadoop

• Open Source: HDFS, HBase, MapReduce, Hive
• Large-scale, flexible data processing engine
• Batch computation (e.g., tens of minutes to hours)

New requirements emerging
• Iterative computations, e.g., machine learning
• Interactive computations, e.g., ad-hoc analytics


Page 3

Prior MapReduce Model

[Diagram: Input → three Map tasks → two Reduce tasks → Output.]

The MapReduce model for clusters transforms data flowing from stable storage to stable storage, e.g., Hadoop.

Page 4

How to Support ML Apps Better?

Iterative and interactive applications

Spark core (RDD API)

Page 5

How to Support Iterative Apps

Iterative and interactive applications

Spark core (RDD API)

share data between stages via memory

Page 6

How to Support Interactive Apps

Spark core (RDD API)

Cache data in memory

Iterative and interactive applications

Page 7

Motivation for Spark

Acyclic data flow (MapReduce) is a powerful abstraction, but

not efficient for apps that repeatedly reuse a working set of data:
• iterative algorithms (many in machine learning)
• interactive data mining tools (R, Excel, Python)

Spark makes working sets a first-class concept to efficiently support these apps

Page 8

Spark - A Short History

Started at UC Berkeley in 2009

Open Source: 2010

Apache Project: 2013

Today: (arguably) most active big data project

Page 9

2. Technical Summary of Spark

Page 10

What is Apache Spark?

1) Parallel execution engine for big data

• Implements the BSP (Bulk Synchronous Parallel) model

2) Data abstraction: Resilient Distributed Datasets (RDDs)
• Sets of objects partitioned & distributed across a cluster

• Stored in RAM or on Disk

3) Automatic recovery based on lineage of bulk transformations

Page 11

1) The Bulk Synchronous Parallel (BSP) Model

In this classic model for designing parallel algorithms,

computation proceeds in a series of supersteps:
• Concurrent computation: parallel processes perform local computation
• Communication: processes exchange data
• Barrier synchronization: when a process reaches the barrier, it waits until all other processes have reached the same barrier
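As a hedged illustration (not from the slides), a tiny two-stage PySpark job maps onto BSP directly: the map tasks form one superstep, the shuffle behind reduceByKey is the communication plus barrier, and the reduce tasks form the next superstep. All names and data below are illustrative.

from pyspark import SparkContext

sc = SparkContext("local[2]", "bsp-sketch")     # hypothetical 2-thread local run
words = sc.parallelize(["a", "b", "a", "c"])
pairs = words.map(lambda w: (w, 1))             # superstep 1: local computation
counts = pairs.reduceByKey(lambda x, y: x + y)  # shuffle = communication + barrier; superstep 2 aggregates
print(sorted(counts.collect()))                 # [('a', 2), ('b', 1), ('c', 1)]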

Page 12

Spark, as a BSP System

[Diagram: two stages (supersteps), each a row of parallel tasks (processors) reading an RDD, connected by a shuffle.]

Page 13

Spark, as a BSP System

[Same diagram, annotated:]
• All tasks in the same stage run the same operation, with single-threaded, deterministic execution
• The datasets (RDDs) are immutable
• The barrier is implicit in a data dependency, such as grouping data by key

Page 14

2) Programming Model

Resilient distributed datasets (RDDs)

• Immutable collections partitioned across cluster that can be rebuilt if a partition is lost

• Partitioning can be based on a key in each record (using hash or range partitioning)

• Created by transforming data in stable storage using data flow operators (map, filter, group-by, …)

• Can be cached across parallel operations

Restricted shared variables
• Accumulators, broadcast variables
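A minimal sketch of the two restricted shared variables (the dictionary contents and data here are made up): a broadcast variable ships a read-only value to every worker once, while an accumulator lets tasks add to a counter the driver can read.

from pyspark import SparkContext

sc = SparkContext("local", "shared-vars")
severity = sc.broadcast({"ERROR": 2, "WARN": 1})  # read-only on workers
unknown = sc.accumulator(0)                       # workers may only add to it

def score(level):
    if level not in severity.value:
        unknown.add(1)   # incremented on workers, read back on the driver
        return 0
    return severity.value[level]

total = sc.parallelize(["ERROR", "WARN", "INFO"]).map(score).sum()
print(total, unknown.value)  # 3 1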

Page 15

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count
cachedMsgs.filter(_.contains("bar")).count

. . .
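The example above is Scala; a PySpark equivalent might look like this (a sketch assuming the shell variable sc; the elided HDFS path and the tab-separated log format are carried over from the slide):

lines = sc.textFile("hdfs://...")
errors = lines.filter(lambda l: l.startswith("ERROR"))
messages = errors.map(lambda l: l.split("\t")[2])
cachedMsgs = messages.cache()

cachedMsgs.filter(lambda m: "foo" in m).count()
cachedMsgs.filter(lambda m: "bar" in m).count()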

Page 16

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")          // Base RDD
errors = lines.filter(_.startsWith("ERROR"))  // Transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()                 // Cached RDD

cachedMsgs.filter(_.contains("foo")).count    // Parallel operation
cachedMsgs.filter(_.contains("bar")).count
. . .

[Diagram: the driver ships tasks to three workers; each worker reads one HDFS block (Block 1-3), builds an in-memory cache (Cache 1-3), and returns results to the driver.]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)

Page 17

RDD Operations

Transformations (define a new RDD)

map, filter, sample, union, groupByKey, reduceByKey, join, cache, …

Actions (return a result to the driver)

reduce, collect, count, save, lookup(key), …

http://spark.apache.org/docs/latest/programming-guide.html

Page 18

Transformations (define a new RDD)

map(func): Return a new distributed dataset formed by passing each element of the source through a function func.

filter(func): Return a new dataset formed by selecting those elements of the source on which func returns true.

flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).

mapPartitions(func): Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.

sample(withReplacement, fraction, seed): Sample a fraction fraction of the data, with or without replacement, using a given random number generator seed.

union(otherDataset): Return a new dataset that contains the union of the elements in the source dataset and the argument.

intersection(otherDataset): Return a new RDD that is the intersection of elements in the source dataset and the argument.

distinct(): Return a new dataset that contains the distinct elements of the source dataset.

groupByKey(): When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. Note: to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance (see the sketch after this list).

reduceByKey(func): When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V.

sortByKey([ascending]): When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order.

join(otherDataset): When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key.

cogroup(otherDataset): When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples.
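To make the groupByKey note above concrete, here is a hedged PySpark sketch of a per-key average using aggregateByKey, which combines values map-side instead of shuffling every value (the data is made up; sc is assumed from the shell):

ages = sc.parallelize([("eng", 25), ("eng", 35), ("sales", 40)])
sums = ages.aggregateByKey(
    (0, 0),                                   # zero value: (sum, count)
    lambda acc, v: (acc[0] + v, acc[1] + 1),  # fold one value into the local accumulator
    lambda a, b: (a[0] + b[0], a[1] + b[1]))  # merge accumulators across partitions
avgs = sums.mapValues(lambda s: s[0] / s[1])
print(avgs.collect())  # e.g., [('eng', 30.0), ('sales', 40.0)]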

Page 19

Actions (return a result to the driver)

count(): Return the number of elements in the dataset.

collect(): Return all the elements of the dataset as an array at the driver program.

reduce(func): Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.

take(n): Return an array with the first n elements of the dataset.

takeSample(withReplacement, num, [seed]): Return an array with a random sample of num elements of the dataset.

saveAsTextFile(path): Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS, or any other Hadoop-supported file system. Spark calls toString on each element to convert it to a line of text in the file.

saveAsObjectFile(path): Write the elements of the dataset in a simple format using Java serialization, which can then be loaded using SparkContext.objectFile().

lookup(key): Return the values for the given key (on hash/range-partitioned RDDs).

Page 20

RDD Operations

Transformations

map(f: T ⇒ U): RDD[T] ⇒ RDD[U]
filter(f: T ⇒ Bool): RDD[T] ⇒ RDD[T]
flatMap(f: T ⇒ Seq[U]): RDD[T] ⇒ RDD[U]
sample(fraction: Float): RDD[T] ⇒ RDD[T] (deterministic sampling)
groupByKey(): RDD[(K, V)] ⇒ RDD[(K, Seq[V])]
reduceByKey(f: (V, V) ⇒ V): RDD[(K, V)] ⇒ RDD[(K, V)]
union(): (RDD[T], RDD[T]) ⇒ RDD[T]
join(): (RDD[(K, V)], RDD[(K, W)]) ⇒ RDD[(K, (V, W))]
cogroup(): (RDD[(K, V)], RDD[(K, W)]) ⇒ RDD[(K, (Seq[V], Seq[W]))]
crossProduct(): (RDD[T], RDD[U]) ⇒ RDD[(T, U)]
mapValues(f: V ⇒ W): RDD[(K, V)] ⇒ RDD[(K, W)] (preserves partitioning)
sort(c: Comparator[K]): RDD[(K, V)] ⇒ RDD[(K, V)]
partitionBy(p: Partitioner[K]): RDD[(K, V)] ⇒ RDD[(K, V)]

Actions

count(): RDD[T] ⇒ Long
collect(): RDD[T] ⇒ Seq[T]
reduce(f: (T, T) ⇒ T): RDD[T] ⇒ T
lookup(k: K): RDD[(K, V)] ⇒ Seq[V] (on hash/range-partitioned RDDs)
save(path: String): outputs the RDD to a storage system, e.g., HDFS

Table 2: Transformations and actions available on RDDs in Spark. Seq[T] denotes a sequence of elements of type T.

Example (logistic regression): the algorithm searches for a hyperplane w that best separates two sets of points (e.g., spam and non-spam emails). It uses gradient descent: it starts w at a random value, and on each iteration it sums a function of w over the data to move w in a direction that improves it.

val points = spark.textFile(...)
                  .map(parsePoint).persist()
var w = // random initial vector
for (i <- 1 to ITERATIONS) {
  val gradient = points.map { p =>
    p.x * (1/(1+exp(-p.y*(w dot p.x)))-1) * p.y
  }.reduce((a,b) => a+b)
  w -= gradient
}

We start by defining a persistent RDD called points as the result of a map transformation on a text file that parses each line of text into a Point object. We then repeatedly run map and reduce on points to compute the gradient at each step by summing a function of the current w. Keeping points in memory across iterations can yield a 20× speedup, as we show in Section 6.1.
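A hedged PySpark translation of the same loop, made self-contained with random stand-in data (Point, D, ITERATIONS, and the generated points are all assumptions, replacing the elided textFile(...).map(parsePoint) input; sc is assumed from the shell):

import numpy as np
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"])  # x: feature vector, y: label in {-1, +1}
D, ITERATIONS = 4, 10                    # illustrative sizes

# Stand-in for textFile(...).map(parsePoint): 100 random labeled points
pts = [Point(np.random.rand(D), np.random.choice([-1.0, 1.0])) for _ in range(100)]
points = sc.parallelize(pts).cache()

w = np.random.rand(D)  # random initial vector
for i in range(ITERATIONS):
    gradient = points.map(
        lambda p: p.x * (1 / (1 + np.exp(-p.y * w.dot(p.x))) - 1) * p.y
    ).reduce(lambda a, b: a + b)
    w -= gradient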

3.2.2 PageRank

A more complex pattern of data sharing occurs in PageRank [6]. The algorithm iteratively updates a rank for each document by adding up contributions from documents that link to it. On each iteration, each document sends a contribution of r/n to its neighbors, where r is its rank and n is its number of neighbors. It then updates its rank to a/N + (1 − a)·Σcᵢ, where the sum is over the contributions cᵢ it received and N is the total number of documents. We can write PageRank in Spark as follows:

[Figure 3: Lineage graph for datasets in PageRank. The input file is mapped to the static links dataset; each iteration i joins links with ranks_i, flatMaps to contribs_i, then reduces and maps to produce ranks_{i+1}.]

// Load graph as an RDD of (URL, outlinks) pairs
val links = spark.textFile(...).map(...).persist()
var ranks = // RDD of (URL, rank) pairs
for (i <- 1 to ITERATIONS) {
  // Build an RDD of (targetURL, float) pairs
  // with the contributions sent by each page
  val contribs = links.join(ranks).flatMap {
    (url, (links, rank)) =>
      links.map(dest => (dest, rank/links.size))
  }
  // Sum contributions by URL and get new ranks
  ranks = contribs.reduceByKey((x,y) => x+y)
                  .mapValues(sum => a/N + (1-a)*sum)
}

This program leads to the RDD lineage graph in Figure 3. On each iteration, we create a new ranks dataset based on the contribs and ranks from the previous iteration and the static links dataset.⁶ One interesting feature of this graph is that it grows longer with the number of iterations.

⁶ Note that although RDDs are immutable, the variables ranks and contribs in the program point to different RDDs on each iteration.

Page 21

Multi-language Programming Interface

• Standalone programs can be written in any of Scala, Java, or Python, but the interactive console supports only Python & Scala

• Python developers: can stay with Python for both
• Java developers: consider using Scala for the console (to learn the API)

Performance: Java / Scala will be faster (statically typed), but Python can do well for numerical work with NumPy

Page 22

Learning Spark

Easiest way: the Spark interpreter (spark-shell or pyspark)
• Special Scala and Python consoles for cluster use

Runs in local mode on 1 thread by default, but you can control this with the MASTER environment variable:

MASTER=local ./spark-shell              # local, 1 thread
MASTER=local[2] ./spark-shell           # local, 2 threads
MASTER=spark://host:port ./spark-shell  # Spark standalone cluster

Page 23

First Stop: SparkContext

Main entry point to Spark functionality
Created for you in Spark shells as the variable sc
In standalone programs, you'd make your own (see the sketch below)

http://spark.apache.org/docs/latest/programming-guide.html
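A minimal standalone sketch (the app name and master setting are placeholders, not from the slides):

from pyspark import SparkContext

sc = SparkContext(master="local[2]", appName="MyApp")
print(sc.parallelize(range(10)).sum())  # 45
sc.stop()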

Page 24

Creating RDDs

# Turn a local collection into an RDD
sc.parallelize([1, 2, 3])

# Load text file from local FS, HDFS, or S3
sc.textFile("file.txt")
sc.textFile("directory/*.txt")
sc.textFile("hdfs://namenode:9000/path/file")

# Use any existing Hadoop InputFormat
sc.hadoopFile(keyClass, valClass, inputFmt, conf)

Spark can read/write to any storage system / format that has a plugin for Hadoop!
• Examples: HDFS, S3, HBase, Cassandra, Avro, SequenceFile
• Reuses Hadoop's InputFormat and OutputFormat APIs

APIs like SparkContext.textFile support filesystems, while SparkContext.hadoopRDD allows passing any Hadoop JobConf to configure an input source.

Page 25

Basic Transformations (Python)

nums = sc.parallelize([1, 2, 3])

# Pass each element through a function
squares = nums.map(lambda x: x*x)  # => {1, 4, 9}

# Keep elements passing a predicate
even = squares.filter(lambda x: x % 2 == 0)  # => {4}

# Map each element to zero or more others
nums.flatMap(lambda x: range(0, x))  # => {0, 0, 1, 0, 1, 2}

(range(0, x) yields the sequence of numbers 0, 1, …, x-1)

Page 26

Basic Actions (Python)

nums = sc.parallelize([1, 2, 3])

# Retrieve RDD contents as a local collection
nums.collect()  # => [1, 2, 3]

# Return first K elements
nums.take(2)  # => [1, 2]

# Count number of elements
nums.count()  # => 3

# Merge elements with an associative function
nums.reduce(lambda x, y: x + y)  # => 6

# Write elements to a text file
nums.saveAsTextFile("hdfs://file.txt")


Page 27

Working with Key-Value Pairs

Spark’s “distributed reduce” transformations act on RDDs of key-value pairs.

Python:
pair = (a, b)
pair[0]  # => a
pair[1]  # => b

Scala:
val pair = (a, b)
pair._1  // => a
pair._2  // => b

Java:
Tuple2 pair = new Tuple2(a, b);  // class scala.Tuple2
pair._1  // => a
pair._2  // => b


Page 28

Some Key-Value Operations (Python)

pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])

pets.reduceByKey(lambda x, y: x + y)
# => {(cat, 3), (dog, 1)}

pets.groupByKey()
# => {(cat, Seq(1, 2)), (dog, Seq(1))}

pets.sortByKey()
# => {(cat, 1), (cat, 2), (dog, 1)}

reduceByKey also automatically implements combiners on the map side

Page 29

Example: Word Count (Python)

lines = sc.textFile("hamlet.txt")
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda x, y: x + y)

[Dataflow across two partitions:]
"to be or"  → "to", "be", "or"  → (to, 1), (be, 1), (or, 1)  → (be, 2), (not, 1)
"not to be" → "not", "to", "be" → (not, 1), (to, 1), (be, 1) → (or, 1), (to, 2)


Page 30

Multiple Datasets

visits = sc.parallelize([("index.html", "1.2.3.4"),
                         ("about.html", "3.4.5.6"),
                         ("index.html", "1.3.3.1")])

pageNames = sc.parallelize([("index.html", "Home"),
                            ("about.html", "About")])

visits.join(pageNames)
# ("index.html", ("1.2.3.4", "Home"))
# ("index.html", ("1.3.3.1", "Home"))
# ("about.html", ("3.4.5.6", "About"))

visits.cogroup(pageNames)
# ("index.html", (Seq("1.2.3.4", "1.3.3.1"), Seq("Home")))
# ("about.html", (Seq("3.4.5.6"), Seq("About")))


Page 31

Controlling the Level of Parallelism (Python)

All the pair RDD operations take an optional second parameter for the number of tasks:

words.reduceByKey(lambda x, y: x + y, 5)
words.groupByKey(5)
visits.join(pageViews, 5)

Page 32

3) RDD Fault Tolerance

RDDs maintain lineage (like logical logging in ARIES) that can be used to reconstruct lost partitions.

Ex:
cachedMsgs = textFile(...)
               .filter(_.contains("error"))
               .map(_.split('\t')(2))
               .cache()

HdfsRDD (path: hdfs://…) → FilteredRDD (func: contains(...)) → MappedRDD (func: split(…)) → CachedRDD
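One way to see the recorded lineage is RDD.toDebugString; a small PySpark sketch (the path is hypothetical, sc assumed from the shell):

cachedMsgs = sc.textFile("hdfs://...") \
    .filter(lambda l: "error" in l) \
    .map(lambda l: l.split("\t")[2]) \
    .cache()
print(cachedMsgs.toDebugString().decode())
# prints the chain of parent RDDs Spark would replay to rebuild a lost partition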

Page 33

Benefits of RDD Model

• Consistency is easy due to immutability (no updates)
• Fault tolerance is inexpensive (log lineage rather than replicating/checkpointing data)
• Locality-aware scheduling of tasks on partitions
• Despite being restricted, the model seems applicable to a broad variety of applications

Page 34

3. Apache Spark’s Path to Unification

Page 35

Apache Spark’s Path to Unification

Unified engine across data workloads and data sources

SQL, Streaming, ML, Graph, Batch, …

Page 36

2009: State-of-the-art in Big Data Apache Hadoop

• Large-scale, flexible data processing engine
• Batch computation (e.g., tens of minutes to hours)
• Open Source

New requirements emerging
• Iterative computations, e.g., machine learning
• Interactive computations, e.g., ad-hoc analytics


Page 37

The Path to Unification

Iterative and interactive applications

Spark core (RDD API)

Page 38

The Path to Unification

Iterative and interactive applications

Spark core (RDD API)

share data between stages via memory

Page 39

The Path to Unification

Iterative and interactive applications

Spark core (RDD API)

Cache data in memory

Page 40

The Path to Unification

Iterative, interactive, and batch applications

Spark core (RDD API)

Shark (Hive SQL over Spark)

Share the same computation engine

HQL

Page 41

The Path to Unification

Iterative, interactive, batch, and streaming applications

Spark core (RDD API)

Shark (Hive over Spark)

HQL

Spark Streaming

Share the same computation engine and a similar API

Page 42

The Path to Unification

Iterative, interactive, batch, and streaming applications

Spark core (RDD API)

Shark (Hive over Spark)

HQL

Spark Streaming

Java and Python language bindings

Page 43

The Path to Unification

Iterative, interactive, batch, and streaming applications

Spark core (RDD API)

Shark (Hive over Spark)

HQL

Spark Streaming, MLlib, GraphX

ML and graph libraries, sharing the same execution engine

Page 44

The Path to Unification

Iterative, interactive, batch, and streaming applications

Spark core (RDD API)

Spark SQL (Catalyst)

SQL

Spark Streaming, MLlib, GraphX

New SQL engine and query optimizer (Catalyst)

Page 45

The Path to Unification

Iterative, interactive, batch, and streaming applications

Spark core (RDD API)

Spark SQL (Catalyst)

SQL

Spark Streaming, MLlib, GraphX, SparkR

Page 46

The Path to Unification

Iterative, interactive, batch, and streaming applications

Spark core (DataFrames API, Catalyst, RDD API)

Spark SQL

SQL

Structured Streaming, MLlib, GraphFrames, SparkR

Share not only the execution engine but also the query optimizer

Unify DataFrame and SQL across all libraries

Page 47

DataFrame API

A DataFrame is logically equivalent to a relational table.

Operators are mostly relational, with additional ones for statistical analysis, e.g., quantile, std, skew.

Popularized by R and Python/pandas, languages of choice for Data Scientists

Page 48

RDD

pdata.map(lambda x: (x.dept, [x.age, 1])) \
     .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
     .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
     .collect()

DataFrame

data.groupBy("dept").avg("age")
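A self-contained sketch of the DataFrame version (the column names match the RDD example; the rows are invented for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("df-avg").getOrCreate()
data = spark.createDataFrame(
    [("eng", 25), ("eng", 35), ("sales", 40)],
    ["dept", "age"])
data.groupBy("dept").avg("age").show()  # one row per dept with avg(age)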

Page 49

DataFrames, a Unifying Abstraction

Make DataFrame declarative, unify DataFrame and SQL

DataFrame and SQL share the same
• query optimizer, and
• execution engine

Tightly integrated with the rest of Spark
• ML library takes DataFrames as input & output
• Easily convert RDDs ↔ DataFrames (see the sketch below)
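The RDD ↔ DataFrame conversion is two calls in PySpark (a sketch, reusing data and spark from the example on the previous page):

rdd = data.rdd                    # DataFrame -> RDD of Row objects
df2 = spark.createDataFrame(rdd)  # RDD -> DataFrame (schema carried over)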

[Diagram: Python, Java/Scala, and R DataFrames all compile to the same logical plan, which feeds a shared execution layer.]

Every optimization automatically applies to SQL and to Scala, Python, and R DataFrames.

Page 50

Today’s Apache Spark

Iterative, interactive, batch, and streaming applications

Spark core (DataFrames API, Catalyst, RDD API)

Spark SQL

SQL

Structured Streaming, MLlib, GraphFrames, SparkR