Bigdata processing with Spark - part II

SIKS Big Data Course, Part Two. Prof.dr.ir. Arjen P. de Vries, [email protected]. Enschede, December 7, 2016

Transcript of Bigdata processing with Spark - part II

Page 1: Bigdata processing with Spark - part II

SIKS Big Data Course, Part Two. Prof.dr.ir. Arjen P. de Vries, [email protected]. Enschede, December 7, 2016

Page 2: Bigdata processing with Spark - part II
Page 3: Bigdata processing with Spark - part II

Recap Spark

Data sharing is crucial for:
- Interactive analysis
- Iterative machine learning algorithms

Spark RDDs
- Distributed collections, cached in memory across cluster nodes

Keep track of lineage
- To ensure fault tolerance
- To optimize processing based on knowledge of the data partitioning
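A minimal sketch of that data-sharing idea (the file name and parsing are illustrative): cache a parsed data set once and reuse it across several queries, instead of re-reading it from disk each time.

val events = sc.textFile("events.txt")
  .map(_.split("\t"))
  .cache()                                         // keep the partitions in memory across the cluster

println(events.count())                            // first action materializes and caches the RDD
println(events.filter(_(0) == "ERROR").count())    // later queries reuse the cached partitions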

Page 4: Bigdata processing with Spark - part II

RDDs in More Detail

RDDs additionally provide:
- Control over partitioning, which can be used to optimize data placement across queries
  - usually more efficient than the sort-based approach of MapReduce
- Control over persistence (e.g. store on disk vs in RAM)
- Fine-grained reads (treat RDD as a big table)

Slide by Matei Zaharia, creator of Spark, http://spark-project.org

Page 5: Bigdata processing with Spark - part II

Scheduling Process

Example job: rdd1.join(rdd2).groupBy(…).filter(…)

[Figure: the scheduling pipeline]
- RDD Objects: build the operator DAG
- DAGScheduler: split the graph into stages of tasks and submit each stage as ready; agnostic to operators; "stage failed" events come back here
- TaskScheduler: launch tasks (as a TaskSet) via the cluster manager and retry failed or straggling tasks; doesn't know about stages
- Worker: execute tasks; store and serve blocks (Block manager, threads)

Page 6: Bigdata processing with Spark - part II

RDD API Example

// Read input file
val input = sc.textFile("input.txt")

val tokenized = input
  .map(line => line.split(" "))
  .filter(words => words.size > 0)   // remove empty lines

val counts = tokenized               // frequency of log levels
  .map(words => (words(0), 1))
  .reduceByKey((a, b) => a + b, 2)   // 2 = number of reduce partitions

Page 7: Bigdata processing with Spark - part II

RDD API Example

// Read input file
val input = sc.textFile("input.txt")

val tokenized = input
  .map(line => line.split(" "))
  .filter(words => words.size > 0)   // remove empty lines

val counts = tokenized               // frequency of log levels
  .map(words => (words(0), 1))
  .reduceByKey((a, b) => a + b)

Page 8: Bigdata processing with Spark - part II

Transformations

sc.textFile().map().filter().map().reduceByKey()


Page 9: Bigdata processing with Spark - part II

DAG View of RDDs

textFile() → map() → filter() → map() → reduceByKey()

[Figure: the lineage graph for input, tokenized and counts]
- Hadoop RDD (partitions 1-3): input
- Mapped RDD (partitions 1-3)
- Filtered RDD (partitions 1-3): tokenized
- Mapped RDD (partitions 1-3)
- Shuffle RDD (partitions 1-2): counts

Page 10: Bigdata processing with Spark - part II

Transformations build up a DAG, but don’t “do anything”
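A minimal sketch of that laziness, using the same word-count pipeline as the earlier slides: each transformation returns immediately and only records lineage; the action at the end triggers the actual work.

val input     = sc.textFile("input.txt")          // nothing is read yet
val tokenized = input.map(_.split(" "))
                     .filter(_.nonEmpty)          // still nothing has executed
val counts    = tokenized.map(words => (words(0), 1))
                         .reduceByKey(_ + _)      // still only a lineage graph

counts.collect()                                  // the action runs the whole pipeline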


Page 11: Bigdata processing with Spark - part II

How runJob Works

runJob(counts) needs to compute the RDD's parents, its parents' parents, and so on, all the way back to an RDD with no dependencies (e.g. a HadoopRDD).

[Figure: the lineage graph of input, tokenized and counts again (Hadoop RDD, Mapped RDD, Filtered RDD, Mapped RDD, each with partitions 1-3), traversed backwards from runJob(counts)]

Page 12: Bigdata processing with Spark - part II

Physical Optimizations

1. Certain types of transformations can be pipelined.

2. If dependent RDDs have already been cached (or persisted in a shuffle), the graph can be truncated.

Pipelining and truncation produce a set of stages, where each stage is composed of tasks.

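A minimal sketch of the truncation idea: persisting an intermediate RDD lets later jobs start from the cached data instead of recomputing the whole lineage.

val tokenized = sc.textFile("input.txt")
  .map(_.split(" "))
  .filter(_.nonEmpty)
  .persist()                                 // cached once the first job computes it

tokenized.count()                            // job 1: runs the full lineage and fills the cache
tokenized.map(words => (words(0), 1))
         .reduceByKey(_ + _)
         .count()                            // job 2: the graph is truncated at the cached RDD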

Page 13: Bigdata processing with Spark - part II

Scheduler Optimizations
- Pipelines narrow ops. within a stage
- Picks join algorithms based on partitioning (minimize shuffles)
- Reuses previously cached data

[Figure: an example DAG of RDDs A-G connected by groupBy, map, union and join, split into Stages 1-3; tasks for previously computed partitions are skipped]
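A minimal sketch of the partitioning-aware join (the RDD names users and orders are illustrative): hash-partitioning both inputs the same way up front lets the scheduler join them without an extra shuffle.

import org.apache.spark.HashPartitioner

val part    = new HashPartitioner(8)
val usersP  = users.partitionBy(part).persist()    // users:  RDD[(Int, String)]
val ordersP = orders.partitionBy(part).persist()   // orders: RDD[(Int, Double)]

val joined  = usersP.join(ordersP)                 // co-partitioned inputs: no re-shuffle needed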

Page 14: Bigdata processing with Spark - part II

Task Details

Stage boundaries are only at input RDDs or "shuffle" operations. So, each task looks like this:

[Figure: a task fetches map outputs and/or reads external storage, applies a pipeline of functions f1, f2, …, and writes to a map output file or returns results to the master]

Page 15: Bigdata processing with Spark - part II

How runJob Works

runJob(counts) needs to compute the RDD's parents, its parents' parents, and so on, all the way back to an RDD with no dependencies (e.g. a HadoopRDD).

[Figure: the full lineage graph of input, tokenized and counts, now including the Shuffle RDD (partitions 1-2), traversed backwards from runJob(counts)]

Page 16: Bigdata processing with Spark - part II

How runJob Works

runJob(counts) needs to compute the RDD's parents, its parents' parents, and so on, all the way back to an RDD with no dependencies (e.g. a HadoopRDD).

[Figure: the same lineage graph as on the previous slide, traversed backwards from runJob(counts)]

Page 17: Bigdata processing with Spark - part II

Stage Graph

[Figure: Stage 1 (Task 1, Task 2, Task 3: input read, shuffle write) followed by Stage 2 (Task 1, Task 2: shuffle read)]

Each Stage 1 task will:
1. Read Hadoop input
2. Perform maps and filters
3. Write partial sums

Each Stage 2 task will:
1. Read partial sums
2. Invoke the user function passed to runJob

Page 18: Bigdata processing with Spark - part II

Physical Execution Model

Distinguish between:
- Jobs: the complete work to be done
- Stages: bundles of work that can execute together
- Tasks: the unit of work, corresponding to one RDD partition

Defining stages and tasks should not require deep knowledge of what these actually do
- The goal of Spark is to be extensible, letting users define new RDD operators
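A minimal sketch for inspecting this (assuming the SparkContext sc and the word-count pipeline from the earlier slides): toDebugString prints the lineage, with shuffle boundaries visible, and each action submits one job whose stages and tasks appear in the Spark UI.

val counts = sc.textFile("input.txt")
  .map(_.split(" "))
  .filter(_.nonEmpty)
  .map(words => (words(0), 1))
  .reduceByKey(_ + _)

println(counts.toDebugString)   // the lineage of the RDD, shuffle boundaries included
counts.count()                  // submits one job; inspect its stages and tasks in the UI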

Page 19: Bigdata processing with Spark - part II

RDD Interface
- Set of partitions ("splits")
- List of dependencies on parent RDDs
- Function to compute a partition given its parents
- Optional preferred locations
- Optional partitioning info (Partitioner)

Captures all current Spark operations!
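A simplified sketch of this interface (SketchRDD is an illustrative stand-in, not Spark's exact signatures, which live on org.apache.spark.rdd.RDD):

import org.apache.spark.Partitioner

trait SketchRDD[T] {
  def partitions: Seq[Int]                                    // set of partitions ("splits")
  def dependencies: Seq[SketchRDD[_]]                         // parent RDDs this one depends on
  def compute(partition: Int): Iterator[T]                    // compute one partition from its parents
  def preferredLocations(partition: Int): Seq[String] = Nil   // optional locality hints
  def partitioner: Option[Partitioner] = None                 // optional partitioning info
}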

Page 20: Bigdata processing with Spark - part II

Example: HadoopRDD
- partitions = one per HDFS block
- dependencies = none
- compute(partition) = read the corresponding block
- preferredLocations(part) = HDFS block location
- partitioner = none

Page 21: Bigdata processing with Spark - part II

Example: FilteredRDD
- partitions = same as parent RDD
- dependencies = "one-to-one" on parent
- compute(partition) = compute parent and filter it
- preferredLocations(part) = none (ask parent)
- partitioner = none
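A sketch of how such a filtered RDD could fill in the SketchRDD interface from the slide above (illustrative only, not Spark's real FilteredRDD class):

class SketchFilteredRDD[T](parent: SketchRDD[T], pred: T => Boolean) extends SketchRDD[T] {
  def partitions: Seq[Int] = parent.partitions          // same as parent RDD
  def dependencies: Seq[SketchRDD[_]] = Seq(parent)     // "one-to-one" on parent
  def compute(partition: Int): Iterator[T] =
    parent.compute(partition).filter(pred)              // compute parent, then filter it
  // preferredLocations: none (defer to parent); partitioner: none (inherited defaults)
}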

Page 22: Bigdata processing with Spark - part II

Example: JoinedRDD
- partitions = one per reduce task
- dependencies = "shuffle" on each parent
- compute(partition) = read and join shuffled data
- preferredLocations(part) = none
- partitioner = HashPartitioner(numTasks)

Spark will now know this data is hashed!

Page 23: Bigdata processing with Spark - part II

Dependency Types

"Narrow" deps (each partition of the parent is used by at most one partition of the child):
- map, filter
- union
- join with inputs co-partitioned

"Wide" (shuffle) deps:
- groupByKey
- join with inputs not co-partitioned

Page 24: Bigdata processing with Spark - part II

Improving Efficiency

Basic principle: avoid shuffling!

Page 25: Bigdata processing with Spark - part II

Filter Input Early
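A minimal sketch (logs is an illustrative RDD of log lines): filtering before the wide operation means the shuffle only moves the data you actually need.

val errors  = logs.filter(_.contains("ERROR"))            // shrink the data before the shuffle
val perHost = errors.map(line => (line.split(" ")(0), 1))
                    .reduceByKey(_ + _)                   // only error records cross the network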

Page 26: Bigdata processing with Spark - part II

Avoid groupByKey on Pair RDDs

With groupByKey, all key-value pairs are shuffled across the network to a reducer, where the values are collected together: a "wide" (shuffle) dependency.
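A minimal sketch (pairs is an illustrative RDD[(String, Int)]): both lines compute per-key sums, but reduceByKey pre-aggregates within each partition, so far fewer pairs are shuffled.

val sumsSlow = pairs.groupByKey().mapValues(_.sum)   // ships every (key, value) pair to a reducer
val sumsFast = pairs.reduceByKey(_ + _)              // map-side combine, then shuffle partial sums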

Page 27: Bigdata processing with Spark - part II

aggregateByKey

Three inputs:
- Zero element
- Merging function within a partition
- Merging function across partitions

// kv: a pair RDD with String values; count the values per key
val initialCount = 0
val addToCounts = (n: Int, v: String) => n + 1
val sumPartitionCounts = (p1: Int, p2: Int) => p1 + p2

val countByKey = kv.aggregateByKey(initialCount)(addToCounts, sumPartitionCounts)

Combiners!

Page 28: Bigdata processing with Spark - part II

combineByKey

// input: a pair RDD with Int values; compute the average value per key
val result = input.combineByKey(
  (v) => (v, 1),
  (acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1),
  (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)
).map {
  case (key, value) => (key, value._1 / value._2.toFloat)
}

result.collectAsMap().map(println(_))

Page 29: Bigdata processing with Spark - part II

Control the Degree of Parallelism

Repartition
- Concentrate effort - increase the use of nodes

Coalesce
- Reduce the number of tasks
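A minimal sketch (the RDD and path names are illustrative): repartition spreads work over more partitions via a full shuffle, while coalesce shrinks the partition count without one.

val spread = records.repartition(200)              // more tasks, better use of the cluster
val result = spread.filter(_.contains("ERROR"))
                   .coalesce(8)                    // fewer tasks and output files, no full shuffle
result.saveAsTextFile("errors-out")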

Page 30: Bigdata processing with Spark - part II

Broadcast Values

In case of a join with a small RHS or LHS, broadcast the small set to every node in the cluster.

Page 31: Bigdata processing with Spark - part II
Page 32: Bigdata processing with Spark - part II
Page 33: Bigdata processing with Spark - part II
Page 34: Bigdata processing with Spark - part II

Broadcast Variables
- Create with SparkContext.broadcast(initVal)
- Access with .value inside tasks
- Immutable!
  - If you modify the broadcast value after creation, that change is local to the node
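A minimal sketch tying the two broadcast slides together (the lookup table and largeRdd are illustrative): broadcast the small side once and join against it inside a map, avoiding a shuffle.

val countries = Map(1 -> "NL", 2 -> "DE")                      // small side, fits in memory
val bc = sc.broadcast(countries)                               // shipped once to every node

val joined = largeRdd.flatMap { case (id, payload) =>          // largeRdd: RDD[(Int, String)]
  bc.value.get(id).map(country => (id, (payload, country)))    // local lookup, no shuffle
}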

Page 35: Bigdata processing with Spark - part II

Maintaining Partitioning
- mapValues instead of map
- flatMapValues instead of flatMap
  - Good for tokenization!
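A minimal sketch (events is an illustrative pair RDD): the *Values variants keep the parent's partitioner, so a later reduceByKey on the same keys avoids another shuffle, whereas a plain map would discard it.

import org.apache.spark.HashPartitioner

val byKey   = events.partitionBy(new HashPartitioner(16))   // events: RDD[(String, String)]
val lengths = byKey.mapValues(_.length)                     // partitioner preserved
val tokens  = byKey.flatMapValues(_.split(" "))             // tokenization, partitioner preserved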

Page 36: Bigdata processing with Spark - part II
Page 37: Bigdata processing with Spark - part II

The best trick of all, however…

Page 38: Bigdata processing with Spark - part II

Use Higher Level APIs!

DataFrame APIs for core processing
- Work across Scala, Java, Python and R

Spark ML for machine learning

Spark SQL for structured query processing


Page 39: Bigdata processing with Spark - part II

Higher-Level Libraries

[Figure: libraries built on the Spark core]
- Spark Streaming: real-time
- Spark SQL: structured data
- MLlib: machine learning
- GraphX: graph processing

Page 40: Bigdata processing with Spark - part II

Combining Processing Types

// Load data using SQL
points = ctx.sql("select latitude, longitude from tweets")

// Train a machine learning model
model = KMeans.train(points, 10)

// Apply it to a stream
sc.twitterStream(...)
  .map(t => (model.predict(t.location), 1))
  .reduceByWindow("5s", (a, b) => a + b)

Page 41: Bigdata processing with Spark - part II

Performance of Composition

Separate computing frameworks:
[Figure: each step reads its input from HDFS and writes its output back: HDFS read, HDFS write, HDFS read, HDFS write, …]

Spark:
[Figure: intermediate data stays in memory; HDFS is read and written only at the ends of the pipeline]

Page 42: Bigdata processing with Spark - part II

Encode Domain Knowledge

In essence, these libraries are nothing more than pre-cooked code that still operates over the abstraction of RDDs.

They focus on optimizations that require domain knowledge.

Page 43: Bigdata processing with Spark - part II

Spark MLlib

Page 44: Bigdata processing with Spark - part II

Data Sets

Page 45: Bigdata processing with Spark - part II

Challenge: Data Representation

Java objects are often many times larger than the data they hold:

class User(name: String, friends: Array[Int])
User("Bobby", Array(1, 2))

[Figure: the in-memory layout: a User object whose fields are pointers to a String (itself wrapping a char[] holding "Bobby") and to an int[] holding the friend ids, each with its own object header]

Page 46: Bigdata processing with Spark - part II

DataFrames / Spark SQL

Efficient library for working with structured data
» Two interfaces: SQL for data analysts and external apps, DataFrames for complex programs
» Optimized computation and storage underneath

Spark SQL added in 2014, DataFrames in 2015

Page 47: Bigdata processing with Spark - part II

Spark SQL Architecture

[Figure: SQL and DataFrame queries become a Logical Plan; the Optimizer, consulting the Catalog, turns it into a Physical Plan; the Code Generator then produces code that executes as RDDs over the Data Source API]

Page 48: Bigdata processing with Spark - part II

DataFrame API

DataFrames hold rows with a known schema and offer relational operations through a DSL:

c = HiveContext()
users = c.sql("select * from users")

ma_users = users[users.state == "MA"]   # expression AST

ma_users.count()
ma_users.groupBy("name").avg("age")
ma_users.map(lambda row: row.user.toUpper())

Page 49: Bigdata processing with Spark - part II

What DataFrames Enable

1. Compact binary representation

• Columnar, compressed cache; rows for processing

2. Optimization across operators (join reordering, predicate pushdown, etc)

3. Runtime code generation
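A minimal Scala sketch for seeing these optimizations at work (sqlContext and the users table are assumed, as in the surrounding slides): explain prints the optimized logical and physical plans, including pushed-down predicates.

import org.apache.spark.sql.functions.col

val ma = sqlContext.table("users")
  .filter(col("state") === "MA")      // becomes a Catalyst expression, eligible for pushdown
  .select("name", "age")

ma.explain(true)                      // prints the parsed, analyzed, optimized and physical plans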

Page 50: Bigdata processing with Spark - part II

Performance

Page 51: Bigdata processing with Spark - part II

Performance

Page 52: Bigdata processing with Spark - part II

Data Sources

Uniform way to access structured data
» Apps can migrate across Hive, Cassandra, JSON, …
» Rich semantics allows query pushdown into data sources

[Figure: Spark SQL mediating between a DataFrame expression (users[users.age > 20]) and the query it sends to the data source (select * from users …)]

Page 53: Bigdata processing with Spark - part II

Examples

JSON (tweets.json):
{ "text": "hi", "user": { "name": "bob", "id": 15 }}
select user.id, text from tweets

JDBC:
select age from users where lang = "en"

Together:
select t.text, u.age
from tweets t, users u
where t.user.id = u.id
and u.lang = "en"

Spark SQL reads the JSON tweets and pushes "select id, age from users where lang = 'en'" down into the JDBC source.

Page 54: Bigdata processing with Spark - part II

Thanks

Matei Zaharia, MIT (https://cs.stanford.edu/~matei/)
Patrick Wendell, Databricks
http://spark-project.org