Scala meetup - Intro to spark

Transcript
Page 1: Scala meetup - Intro to spark

Introduction to Apache Spark

Brendan Dillon, Javier Arrieta

Page 2: Scala meetup - Intro to spark

The Stack

[Diagram: Your Applications sit on top of Spark SQL, MLlib, GraphX and Spark Streaming, which run on Spark Core, which in turn runs on Mesos, YARN or the Standalone cluster manager]

sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19").map(t => "Name: " + t(0)).collect().foreach(println)

Page 3: Scala meetup - Intro to spark

Distributed Execution

[Diagram: the Driver, which holds the Spark Context, sends Tasks to Executors running on Worker Nodes; each Executor runs multiple Tasks]

Page 4: Scala meetup - Intro to spark

Resilient Distributed Datasets (RDD)

• Immutable: never modified – just transformed to new RDDs

• Distributed: split into multiple partitions and spread across multiple servers in a cluster

• Resilient: can be re-computed if they get destroyed

• Created by:
– Loading external data
– Distributing a collection of objects in the driver program
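
For illustration (not from the slides), a minimal sketch of both creation routes; the application name and HDFS path are hypothetical:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("rdd-creation").setMaster("local[*]")
val sc = new SparkContext(conf)

// Created by loading external data (hypothetical path)
val lines = sc.textFile("hdfs:///data/access.log")

// Created by distributing a collection of objects in the driver program
val nums = sc.parallelize(1 to 1000)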

Page 5: Scala meetup - Intro to spark

RDD Implementation

• Array of partitions
• List of dependencies on parent RDDs
• Function to compute a partition given its parents
– Returns Iterator over a partition
• Preferred locations: list of strings for each partition (Nil by default)
• Partitioner (None by default)
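
As a rough illustration of the five pieces listed above, a simplified sketch (not the real org.apache.spark.rdd.RDD signatures):

import org.apache.spark.{Dependency, Partition, Partitioner, TaskContext}

// Simplified shape of the RDD abstraction described on this slide
abstract class SketchRDD[T] {
  def partitions: Array[Partition]                                  // array of partitions
  def dependencies: Seq[Dependency[_]]                              // dependencies on parent RDDs
  def compute(split: Partition, context: TaskContext): Iterator[T]  // iterator over one partition
  def preferredLocations(split: Partition): Seq[String] = Nil       // Nil by default
  def partitioner: Option[Partitioner] = None                       // None by default
}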

Page 6: Scala meetup - Intro to spark

Persistence / Caching

• By default RDDs (and all of their dependencies) are recomputed every time an action is called on them!

• Need to explicitly tell Spark when to persist
• Options:
– Default: stored in heap as unserialized objects (pickled objects for Python)
– Memory only: serialized or not
– Memory and disk: spills to disk, option to serialize in memory
– Disk only
• Tachyon: off-heap distributed caching
– Aims to make Spark more resilient
– Avoids GC overheads
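
A minimal sketch, assuming an existing SparkContext sc and a hypothetical input path:

import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("hdfs:///data/input.txt")
val words = lines.flatMap(_.split("\\s+"))

// Without this, `words` (and its parent RDDs) would be recomputed for every action below
words.persist(StorageLevel.MEMORY_AND_DISK_SER)

val total    = words.count()             // first action materialises and caches the partitions
val distinct = words.distinct().count()  // reuses the cached data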

Page 7: Scala meetup - Intro to spark

Dependency Types: Narrow

E.g. map, filter; union; join with co-partitioned input

Each partition of parent is used by at most one partition of the child

Page 8: Scala meetup - Intro to spark

Dependency Types: Wide

E.g. groupByKey; join with non-co-partitioned inputs

Each partition of the parent is used by more than one partition of the child

Page 9: Scala meetup - Intro to spark

Transformations

• Return a new RDD
• Lazy evaluation
• Single RDD transformations: map, flatMap, filter, distinct
• Pair RDDs: keyBy, reduceByKey, groupByKey, combineByKey, mapValues, flatMapValues, sortByKey

• Two RDD transformations: union, intersection, subtract, cartesian

• Two pair RDDs: join, rightOuterJoin, leftOuterJoin, cogroup

Page 10: Scala meetup - Intro to spark

Actions

• Force evaluation of the transformations and return a value to the driver program or write to external storage

• Actions on RDDs:
– reduce, fold, aggregate
– foreach(func), collect
– count, countByValue
– top(num)
– take(num), takeOrdered(num)(ordering)
• Actions on pair RDDs:
– countByKey
– collectAsMap
– lookup(key)
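
A minimal sketch, assuming an existing SparkContext sc, showing transformations staying lazy until an action forces evaluation:

val nums    = sc.parallelize(1 to 10)
val evens   = nums.filter(_ % 2 == 0)  // transformation: nothing runs yet
val doubled = evens.map(_ * 2)         // still lazy

val sum    = doubled.reduce(_ + _)     // action: triggers evaluation, returns 60 to the driver
val asList = doubled.collect()         // action: Array(4, 8, 12, 16, 20)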

Page 11: Scala meetup - Intro to spark

Single RDD Transformations

Page 12: Scala meetup - Intro to spark

map and flatMap

• map takes a function that transforms each element of a collection: map(f: T => U)

• RDD[T] => RDD[U]
• flatMap takes a function that transforms a single element of a collection into a sequence of elements: flatMap(f: T => Seq[U])
• Flattens out the output into a single sequence
• RDD[T] => RDD[U]
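
A minimal sketch, assuming an existing SparkContext sc:

val lines = sc.parallelize(Seq("hello world", "hello spark"))

val lengths = lines.map(_.length)          // RDD[String] => RDD[Int]: 11, 11
val words   = lines.flatMap(_.split(" "))  // RDD[String] => RDD[String]: hello, world, hello, spark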

Page 13: Scala meetup - Intro to spark

filter, distinct

• filter takes a (predicate) function that returns true if an element should be in the output collection: filter(f: T => Boolean)
• distinct removes duplicates from the RDD
• Both filter and distinct transform from RDD[T] => RDD[T]
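
A minimal sketch, assuming an existing SparkContext sc:

val nums = sc.parallelize(Seq(1, 2, 2, 3, 4, 4, 5))

val odds   = nums.filter(_ % 2 != 0)  // 1, 3, 5
val unique = nums.distinct()          // 1, 2, 3, 4, 5 (order not guaranteed)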

Page 14: Scala meetup - Intro to spark

Actions

Page 15: Scala meetup - Intro to spark

reduce, fold & aggregate

• reduce takes a function that combines elements of a collection pairwise: reduce(f: (T, T) => T)
• fold is like reduce except it takes a zero value i.e. fold(zero: T)(f: (T, T) => T)
• reduce and fold: RDD[T] => T
• aggregate is the most general form
• aggregate(zero: U)(seqOp: (U, T) => U, combOp: (U, U) => U)
• aggregate: RDD[T] => U
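
A minimal sketch, assuming an existing SparkContext sc, using aggregate to compute a sum and a count in one pass:

val nums = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0))

val (sum, count) = nums.aggregate((0.0, 0))(
  (acc, x) => (acc._1 + x, acc._2 + 1),   // seqOp: fold one value into the accumulator
  (a, b)   => (a._1 + b._1, a._2 + b._2)  // combOp: merge per-partition accumulators
)
val mean = sum / count                    // 2.5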

Page 16: Scala meetup - Intro to spark

Pair RDD Transformations

Page 17: Scala meetup - Intro to spark

keyBy, reduceByKey

• keyBy creates tuples of the elements in an RDD by applying a function: keyBy(f: T => K)

• RDD[ T ] => RDD[ (K, T) ]
• reduceByKey takes a function that takes two values and returns a single value: reduceByKey(f: (V, V) => V)

• RDD[ (K, V) ] => RDD[ (K, V) ]
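
A minimal sketch, assuming an existing SparkContext sc:

val words = sc.parallelize(Seq("spark", "scala", "streaming", "sql"))

val byFirstLetter = words.keyBy(_.head)  // RDD[(Char, String)]
val counts = byFirstLetter
  .mapValues(_ => 1)
  .reduceByKey(_ + _)                    // RDD[(Char, Int)]: ('s', 4)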

Page 18: Scala meetup - Intro to spark

groupByKey

• Takes a collection of key-value pairs and no parameters

• Returns a sequence of values associated with each key

• RDD[ ( K, V ) ] => RDD[ ( K, Iterable[V] ) ]
• Results must fit in memory
• Can be slow – use aggregateByKey or reduceByKey where possible
• Ordering of values not guaranteed and can vary on every evaluation
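
A minimal sketch, assuming an existing SparkContext sc, contrasting groupByKey with the usually preferable reduceByKey:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

val grouped = pairs.groupByKey()        // RDD[(String, Iterable[Int])]: ("a", [1, 3]), ("b", [2])
val summed  = pairs.reduceByKey(_ + _)  // combines map-side: ("a", 4), ("b", 2)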

Page 19: Scala meetup - Intro to spark

combineByKey

• def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, partitioner: Partitioner, mapSideCombine: Boolean = true, serializer: Serializer = null)
• RDD[ (K, V) ] => RDD[ (K, C) ]
• createCombiner called per partition when a new key is found
• mergeValue combines a new value with an existing accumulator
• mergeCombiners merges results from different partitions
• Sometimes map-side combine not useful e.g. groupByKey
• groupByKey, aggregateByKey and reduceByKey all implemented using combineByKey
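
A minimal sketch, assuming an existing SparkContext sc, computing a per-key average with the three-function overload of combineByKey:

val scores = sc.parallelize(Seq(("maths", 80), ("maths", 90), ("physics", 70)))

val averages = scores.combineByKey(
  (v: Int)                       => (v, 1),                     // createCombiner: first value for a key in a partition
  (acc: (Int, Int), v: Int)      => (acc._1 + v, acc._2 + 1),   // mergeValue: add a value to the accumulator
  (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)  // mergeCombiners: merge across partitions
).mapValues { case (sum, count) => sum.toDouble / count }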

Page 20: Scala meetup - Intro to spark

map vs mapValues

• map takes a function that transforms each element of a collection: map(f: T => U)

• RDD[T] => RDD[U]
• When T is a tuple we may want to act only on the values – not the keys
• mapValues takes a function that maps the values in the input to the values in the output: mapValues(f: V => W)
• Where RDD[ (K, V) ] => RDD[ (K, W) ]
• NB: use mapValues when you can: avoids a reshuffle when data is partitioned by key
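
A minimal sketch, assuming an existing SparkContext sc, showing that map discards the partitioner while mapValues keeps it:

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2))).partitionBy(new HashPartitioner(2))

val viaMap       = pairs.map { case (k, v) => (k, v * 10) }  // partitioner is lost: a later keyed op reshuffles
val viaMapValues = pairs.mapValues(_ * 10)                   // partitioner is preserved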

Page 21: Scala meetup - Intro to spark

Two RDD Transformations

Page 22: Scala meetup - Intro to spark

Pseudo-set: union, intersection, subtract, cartesian

• rdd.union(otherRdd): RDD containing elements from both

• rdd.intersection(otherRdd): RDD containing only elements found in both

• rdd.subtract(otherRdd): remove content of one from the other e.g. removing training data

• rdd.cartesian(otherRdd): Cartesian product of two RDDs e.g. similarity of pairs: RDD[T] and RDD[U] => RDD[ (T, U) ]
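
A minimal sketch, assuming an existing SparkContext sc:

val a = sc.parallelize(Seq(1, 2, 3))
val b = sc.parallelize(Seq(3, 4))

a.union(b).collect()         // 1, 2, 3, 3, 4 (duplicates kept)
a.intersection(b).collect()  // 3
a.subtract(b).collect()      // 1, 2
a.cartesian(b).collect()     // (1,3), (1,4), (2,3), (2,4), (3,3), (3,4)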

Page 23: Scala meetup - Intro to spark

Two Pair RDD Transformations

Page 24: Scala meetup - Intro to spark

join, rightOuterJoin, leftOuterJoin, cogroup

• Join: RDD[ ( K, V) ] and RDD[ (K, W) ] => RDD[ ( K, (V,W) ) ]

• Cogroup: RDD[ ( K, V) ] and RDD[ (K, W) ] => RDD[ ( K, ( Seq[V], Seq[W] ) ) ]

• rightOuterJoin and leftOuterJoin keep all keys from the right / left RDD respectively (missing values become None)
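
A minimal sketch, assuming an existing SparkContext sc; the names and values are made up:

val ages  = sc.parallelize(Seq(("alice", 30), ("bob", 25)))
val towns = sc.parallelize(Seq(("alice", "Dublin"), ("carol", "Cork")))

ages.join(towns).collect()           // ("alice", (30, "Dublin"))
ages.leftOuterJoin(towns).collect()  // ("alice", (30, Some("Dublin"))), ("bob", (25, None))
ages.cogroup(towns).collect()        // ("alice", ([30], ["Dublin"])), ("bob", ([25], [])), ("carol", ([], ["Cork"]))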

Page 25: Scala meetup - Intro to spark

Partition-specific Transformations and Actions

Page 26: Scala meetup - Intro to spark

mapPartitions, mapPartitionsWithIndex, and foreachPartition

• Same as map and foreach except they operate on a per partition basis

• Useful when you have setup code (DB connection, RNG etc.) that you don’t want to run for every element – only once per partition
• You can set preservesPartitioning when you are not altering the keys used for partitioning, to avoid unnecessary shuffling
– As with mapValues in the last slide
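
A minimal sketch, assuming an existing SparkContext sc; DbConnection and createDbConnection are hypothetical stand-ins for real setup code:

// Hypothetical connection type standing in for a real DB client
class DbConnection { def lookup(id: Int): String = s"row-$id"; def close(): Unit = () }
def createDbConnection() = new DbConnection

val ids = sc.parallelize(1 to 1000)

val enriched = ids.mapPartitions { iter =>
  val conn = createDbConnection()                          // expensive setup, once per partition
  val out  = iter.map(id => (id, conn.lookup(id))).toList  // materialise before closing the connection
  conn.close()
  out.iterator
}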

Page 27: Scala meetup - Intro to spark

Data Frames

Page 28: Scala meetup - Intro to spark

Data Frames & Catalyst Optimizer

Page 29: Scala meetup - Intro to spark

DataFrame creation and operations

val sc: SparkContext // An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Create the DataFrame
val df = sqlContext.jsonFile("examples/src/main/resources/people.json")

// Show the content of the DataFrame
df.show()

// Print the schema in a tree format
df.printSchema()

// Select only the "name" column
df.select("name")

// Select everybody, but increment the age by 1
df.select("name", df("age") + 1)

// Select people older than 21
df.filter(df("age") > 21)

// Count people by age
df.groupBy("age").count()

Page 30: Scala meetup - Intro to spark

Spark Streaming

Page 31: Scala meetup - Intro to spark

Introduction

Page 32: Scala meetup - Intro to spark

Alternatives

Comparison of Spark Streaming with Apache Samza, Apache Storm / Trident and Apache Flink:

• Programming model: micro-batch (Spark Streaming); one at a time (Samza, Storm); micro-batch (Trident); distributed stream dataflow (Flink)
• Stream primitive: DStream (Spark Streaming); Stream (Samza); Tuple (Storm); Tuple, Batch, Partition (Trident); Data Stream (Flink)
• Stream source: ReceiverInputDStream (Spark Streaming); Container (Samza); Spouts (Storm); Trident Spouts (Trident)
• Computation: maps/windows/operations on DStreams (Spark Streaming); StreamTask, window, join (Samza); filters, functions, aggregations, joins (Storm / Trident); maps/windows/operations on Data Streams (Flink)
• Resource management: YARN/Mesos (Spark Streaming); YARN (Samza); YARN/Mesos (Storm / Trident); YARN (Flink)
• Resilience: requires WAL to DFS, HDFS/S3 (Spark Streaming); checkpointing to Kafka (Samza); Nimbus reassigns and failed batches are replayed (Storm / Trident); lightweight distributed snapshots (Flink)

Page 33: Scala meetup - Intro to spark

Why Spark Streaming

• Scala collections programming model: map, flatMap, window, reduce (fold)
• Share code between batch and streaming – both share the same programming model (although with different semantics)
• Micro-batches allow aggregation over the batches: improved throughput at a latency cost
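
A minimal sketch, assuming an existing SparkContext sc and a hypothetical socket source, of how the familiar collections-style operations look on a DStream:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc   = new StreamingContext(sc, Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)  // hypothetical source

// Same shape as a batch word count: flatMap / map / reduceByKey
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()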

Page 34: Scala meetup - Intro to spark

Spark Streaming Execution

[Diagram: the Driver, which holds the Spark Context, sends Tasks to Executors on three Worker Nodes; a Streaming Producer sends data to one of the tasks]

Page 35: Scala meetup - Intro to spark

Example overview

Page 36: Scala meetup - Intro to spark

Code

val metaStream = stream.map { case (k, v) =>
  (k, DocumentMetadata.fromMutable(recordDecoder.decode(v).asInstanceOf[GenericRecord]))
}

private val pdfFiles = metaStream
  .filter(_._2.contentType == "application/pdf")
  .map { case (k, meta) => (meta, fetchFileFromMessage(k, meta)) }

val pdfDocs = pdfFiles.map { case (meta, file) => (meta, TextExtractor.parseFile(file)) }

val texts = pdfDocs.map { case (meta, doc) => (meta, TextExtractor.extractText(doc)) }.cache()

val wordStream = texts.map { case (meta, text) =>
  (meta, text.split("""[\ \n\r\t\u00a0]+""")
    .toList
    .map(_.replaceAll("""[,;\.]$""", "").trim.toLowerCase())
    .filter(_.length > 1))
}

texts.foreachRDD(rdd => rdd.foreach { case (meta, text) => indexText(meta.id, text) })

val wordCountStream = wordStream.flatMap(_._2).map(word => (word, 1)).reduceByKey(_ + _)
val totalWordCountStream = wordStream.map(_._2.size)
val totalWords = totalWordCountStream.reduce(_ + _)

val sortedWordCount = wordCountStream.transform(rdd => rdd.sortBy(_._2, ascending = false))

sortedWordCount.foreachRDD(rdd => println(rdd.toDebugString))
sortedWordCount.print(30)
totalWords.print()

Page 37: Scala meetup - Intro to spark

Q & A