Spark - Texas A&M Universitycourses.cse.tamu.edu/chiache/csce678/s19/slides/spark.pdf · SparkSQL &...

25
Spark CSCE 678

Transcript of Spark - Texas A&M Universitycourses.cse.tamu.edu/chiache/csce678/s19/slides/spark.pdf · SparkSQL &...

Page 1: Spark - Texas A&M Universitycourses.cse.tamu.edu/chiache/csce678/s19/slides/spark.pdf · SparkSQL & Dataframe Catalyst Optimizer Spark Streaming Mllib (Machine learning) GraphX (Graph

Spark

CSCE 678

Page 2: Spark - Texas A&M Universitycourses.cse.tamu.edu/chiache/csce678/s19/slides/spark.pdf · SparkSQL & Dataframe Catalyst Optimizer Spark Streaming Mllib (Machine learning) GraphX (Graph

Why Not MapReduce?

• “MapReduce largely simplified big-data analysis on

large, unrealistic clusters” – by Matei Zahari

• Users want more complex, multi-stage applications

(e.g., iterative ML, graph processing)

• More interactive, real-time jobs than batched jobs

2

Page 3: Spark - Texas A&M Universitycourses.cse.tamu.edu/chiache/csce678/s19/slides/spark.pdf · SparkSQL & Dataframe Catalyst Optimizer Spark Streaming Mllib (Machine learning) GraphX (Graph

Why Not MapReduce?

3

Input

Iter. 1 Iter. 2 …

HDFSRead

HDFSWrite

HDFSWrite

HDFSRead

Input

Query 1

Query 2

Query 3

Query 4

Result 1

Result 2

Result 3

Result 4

HDFS:simplified fault tolerance,but too slow due toreplication and disk I/O.

HDFSRead

Page 4: Spark - Texas A&M Universitycourses.cse.tamu.edu/chiache/csce678/s19/slides/spark.pdf · SparkSQL & Dataframe Catalyst Optimizer Spark Streaming Mllib (Machine learning) GraphX (Graph

Goal: In-Memory Data Sharing

4

Input

Iter. 1 Iter. 2 …

Input

Query 1

Query 2

Query 3

Query 4

Result 1

Result 2

Result 3

Result 4

10-100x faster than network/disk

Page 5: Spark - Texas A&M Universitycourses.cse.tamu.edu/chiache/csce678/s19/slides/spark.pdf · SparkSQL & Dataframe Catalyst Optimizer Spark Streaming Mllib (Machine learning) GraphX (Graph

Goal: In-Memory Data Sharing

5

Input

Iter. 1 Iter. 2 …

Input

Query 1

Query 2

Query 3

Query 4

Result 1

Result 2

Result 3

Result 4

How to achieve Fault tolerance: What if DRAM crashes?

Page 6: Spark - Texas A&M Universitycourses.cse.tamu.edu/chiache/csce678/s19/slides/spark.pdf · SparkSQL & Dataframe Catalyst Optimizer Spark Streaming Mllib (Machine learning) GraphX (Graph

Resilient Distributed Dataset (RDD)

• Immutable collection of objects

• Statically typed

• Distributed across clusters

6

val lines = spark.textFile(“hdfs://log.txt")val errors = lines.filter(_.startsWith("ERROR"))val msgs = errors.map(_.split(‘\t’)(2))

msgs.saveAsTextFile(“msgs.txt") Kick off computation(lazy evaluation)

lines

errors

msgs

filter(_.startswith(“ERROR”))

map(_.split(“\t”)(2))

Page 7: Spark - Texas A&M Universitycourses.cse.tamu.edu/chiache/csce678/s19/slides/spark.pdf · SparkSQL & Dataframe Catalyst Optimizer Spark Streaming Mllib (Machine learning) GraphX (Graph

Spark Programming Interface

• Scala language: lambda-like language on Java

• Run interactively on Scala interpreter

• Interface for specifying dataflow between RDDs:

• Transformations (filter, map, reduce, etc)

• Actions (count, persist, save, etc)

• User-custom controls for partitioning

7

Page 8: Spark - Texas A&M Universitycourses.cse.tamu.edu/chiache/csce678/s19/slides/spark.pdf · SparkSQL & Dataframe Catalyst Optimizer Spark Streaming Mllib (Machine learning) GraphX (Graph

Fault Tolerance

• RDDs track lineage to rebuild lost data

8

file.map(lambda rec: (rec.type, 1)).reduceByKey(lambda x, y: x + y) .filter(lambda (type, count): count > 10)

Page 9: Spark - Texas A&M Universitycourses.cse.tamu.edu/chiache/csce678/s19/slides/spark.pdf · SparkSQL & Dataframe Catalyst Optimizer Spark Streaming Mllib (Machine learning) GraphX (Graph

Fault Tolerance

• RDDs track lineage to rebuild lost data

9

file.map(lambda rec: (rec.type, 1)).reduceByKey(lambda x, y: x + y) .filter(lambda (type, count): count > 10)

Page 10: Spark - Texas A&M Universitycourses.cse.tamu.edu/chiache/csce678/s19/slides/spark.pdf · SparkSQL & Dataframe Catalyst Optimizer Spark Streaming Mllib (Machine learning) GraphX (Graph

Partitioning

• RDDs know their partitioning functions

10

file.map(lambda rec: (rec.type, 1)).reduceByKey(lambda x, y: x + y) .filter(lambda (type, count): count > 10)

Hash-partitioned Same

Page 11: Spark - Texas A&M Universitycourses.cse.tamu.edu/chiache/csce678/s19/slides/spark.pdf · SparkSQL & Dataframe Catalyst Optimizer Spark Streaming Mllib (Machine learning) GraphX (Graph

Example: PageRank

1. Start each page with rank = 1

2. For iteration = 1, …, N, update rank(page) =

11

𝑖 ∈Neighbors

rank𝑖/ Neighbors𝑖

links = // RDD of (url, neighbors) pairsranks = // RDD of (url, rank) pairsfor (i <- 1 to ITERATIONS) {

ranks = links.join(ranks).flatMap {(url, (links, rank)) =>

links.map(dest => (dest, rank/links.size))}.reduceByKey(_ + _)

}

Page 12: Spark - Texas A&M Universitycourses.cse.tamu.edu/chiache/csce678/s19/slides/spark.pdf · SparkSQL & Dataframe Catalyst Optimizer Spark Streaming Mllib (Machine learning) GraphX (Graph

Example: PageRank

• Repeatedly join links & ranks

• Co-partition them (e.g., hash

on url) to avoid shuffling

• Specify partitioner:

12

links = links.partitionBy(new URLPartitioner())

Page 13: Spark - Texas A&M Universitycourses.cse.tamu.edu/chiache/csce678/s19/slides/spark.pdf · SparkSQL & Dataframe Catalyst Optimizer Spark Streaming Mllib (Machine learning) GraphX (Graph

Spark Operations

• Transformations (define a new RDD):

• Actions (return a result to the driver program)

13

mapfiltersample

groupByKeyreduceByKeysortByKey

aggregateByByte

flatMapunionjoin

intersectioncogroupcross

distinct

cartesianpipe

coalescemapPartition

mapPartitionWithIndexrepartition

repartitionAndSortWithinPartitions

reducecollectcountfirst

takeSampletakeOrderedcountByKeyforeach

saveAsTextFilesaveAsSequenceFilesaveAsObjectFile

Page 14: Spark - Texas A&M Universitycourses.cse.tamu.edu/chiache/csce678/s19/slides/spark.pdf · SparkSQL & Dataframe Catalyst Optimizer Spark Streaming Mllib (Machine learning) GraphX (Graph

How General is Spark?

• RDDs can express many parallel algorithms

• Dataflow models: Hadoop, SQL, Dryad, …

• Specialized models: graph processing, machine

learning, iterative MapReduce (HaLoop), …

14

Page 15: Spark - Texas A&M Universitycourses.cse.tamu.edu/chiache/csce678/s19/slides/spark.pdf · SparkSQL & Dataframe Catalyst Optimizer Spark Streaming Mllib (Machine learning) GraphX (Graph

Spark Stack

15

Apache Spark Core

RDD

Yarn

HDFS

Mesos

Cassandra HBase S3

SparkSQL&

Dataframe

Catalyst Optimizer

SparkStreaming

Mllib(Machine learning)

GraphX(Graph

processing)

SparkR(R on Spark)

Page 16: Spark - Texas A&M Universitycourses.cse.tamu.edu/chiache/csce678/s19/slides/spark.pdf · SparkSQL & Dataframe Catalyst Optimizer Spark Streaming Mllib (Machine learning) GraphX (Graph

Spark SQL

• Relational DBMS on Spark

• Using RDDs to define and partition database states

• Optimized with DBMS techniques

• Caching

• Query optimizers

16

Page 17: Spark - Texas A&M Universitycourses.cse.tamu.edu/chiache/csce678/s19/slides/spark.pdf · SparkSQL & Dataframe Catalyst Optimizer Spark Streaming Mllib (Machine learning) GraphX (Graph

DataFrame API

• Spark SQL uses DataFrame to express structured dataset with a fixed schema

17

Values

Values

Values

RDDsUserID

Structured RDDs

Name Age

UserID Name Age

UserID Name Age

UserID Name Age

UserID Name Age

UserID Name Age

UserID Name Age

UserID Name Age

UserID Name Age

Page 18: Spark - Texas A&M Universitycourses.cse.tamu.edu/chiache/csce678/s19/slides/spark.pdf · SparkSQL & Dataframe Catalyst Optimizer Spark Streaming Mllib (Machine learning) GraphX (Graph

DataFrame API

18

cities = spark.read.csv(‘cities’, header=True, inferschema=True)

cities.show()+---------+----------+-------+| city |population| area|+---------+----------+-------+|Vancouver| 2463431|2878.52|+---------+----------+-------+

Page 19: Spark - Texas A&M Universitycourses.cse.tamu.edu/chiache/csce678/s19/slides/spark.pdf · SparkSQL & Dataframe Catalyst Optimizer Spark Streaming Mllib (Machine learning) GraphX (Graph

DataFrame API

19

cities = spark.read.csv(‘cities’, header=True, inferschema=True)

cities.show()+---------+----------+-------+| city |population| area|+---------+----------+-------+|Vancouver| 2463431|2878.52|+---------+----------+-------+

cities.printSchema()root|-- city: string (nullable = true)|-- population: integer (nullable = true)|-- area: double (nullable = true)

Page 20: Spark - Texas A&M Universitycourses.cse.tamu.edu/chiache/csce678/s19/slides/spark.pdf · SparkSQL & Dataframe Catalyst Optimizer Spark Streaming Mllib (Machine learning) GraphX (Graph

DataFrame API

20

cities = spark.read.csv(‘cities’, header=True, inferschema=True)

small_cities = cities.where(cities[‘area’] < 5000)density = small_cities(cities[‘population’] / cities[‘area’])

density.show()+-------------------++(population / area)|+-------------------+| 855.797710|+-------------------+

RDDRDD

Lazy evaluation:operations are processed whenoutputting results

Page 21: Spark - Texas A&M Universitycourses.cse.tamu.edu/chiache/csce678/s19/slides/spark.pdf · SparkSQL & Dataframe Catalyst Optimizer Spark Streaming Mllib (Machine learning) GraphX (Graph

Limitations

• DataFrame API provides many built-in functions

• Math: abs(), cos(), avg(), ceil(), …

• Data processing: asc(), base64(), …

• Aggregation: array(), array_contains(), …

• String: concat(), substring(), …

• You can’t directly use Scala/Python functions

because they are not DataFrame API

• What to do? Write User Define Functions (UDFs)

21

Page 22: Spark - Texas A&M Universitycourses.cse.tamu.edu/chiache/csce678/s19/slides/spark.pdf · SparkSQL & Dataframe Catalyst Optimizer Spark Streaming Mllib (Machine learning) GraphX (Graph

User-Defined Function (UDF)

22

@functions.udf(returnType=types.DoubleType())def my_logic(a, b):

return a + 2 * b * math.log2(a)

res = df.select(my_logic(df[‘a’], df[‘b’]),alias(‘res’))

UDF

Including UDF in the expression

Page 23: Spark - Texas A&M Universitycourses.cse.tamu.edu/chiache/csce678/s19/slides/spark.pdf · SparkSQL & Dataframe Catalyst Optimizer Spark Streaming Mllib (Machine learning) GraphX (Graph

User-Defined Function (UDF)

23

@functions.udf(returnType=types.DoubleType())def my_logic(a, b):

return a + 2 * b * math.log2(a)

res = df.select(my_logic(df[‘a’], df[‘b’]),alias(‘res’))

UDF

Including UDF in the expression

UDFs are much slower than DataFrame API

Time

Spark DataFrame API 10 s

Python UDF 437 s

Page 24: Spark - Texas A&M Universitycourses.cse.tamu.edu/chiache/csce678/s19/slides/spark.pdf · SparkSQL & Dataframe Catalyst Optimizer Spark Streaming Mllib (Machine learning) GraphX (Graph

Query Optimization

• RDBMS usually has a query optimizer to find a

more ideal version of a query

• Example: (X + 1) + (Y + 2) ➔ X + Y + 3

• Spark SQL: Catalyst optimizer

• Transforming a query into equivalent logics

• Selecting the ideal physical plan based on cost model

24

Page 25: Spark - Texas A&M Universitycourses.cse.tamu.edu/chiache/csce678/s19/slides/spark.pdf · SparkSQL & Dataframe Catalyst Optimizer Spark Streaming Mllib (Machine learning) GraphX (Graph

References

• Spark class hosted by Stanford ICME:

http://stanford.edu/~rezab/sparkclass/

• RDD Talk in NSDI (by Zaharia):

https://www.usenix.org/sites/default/files/confere

nce/protected-files/nsdi_zaharia.pdf

25