Scalable Stream Processing Spark Streaming and Flink Stream


Transcript of Scalable Stream Processing Spark Streaming and Flink Stream

Page 1: Scalable Stream Processing Spark Streaming and Flink Stream

Scalable Stream Processing
Spark Streaming and Flink Stream

Amir H. Payberah
[email protected]

KTH Royal Institute of Technology

2016/09/26

Page 2: Scalable Stream Processing Spark Streaming and Flink Stream

Spark Streaming

Page 3: Scalable Stream Processing Spark Streaming and Flink Stream

Existing Streaming Systems (1/2)

I Record-at-a-time processing model:

• Each node has mutable state.

• For each record, it updates its state and sends new records.

• State is lost if the node dies.

Page 4: Scalable Stream Processing Spark Streaming and Flink Stream

Existing Streaming Systems (2/2)

I Fault tolerance via replication or upstream backup.

• Replication: fast recovery, but 2x hardware cost.
• Upstream backup: only need one standby, but slow to recover.

Page 6: Scalable Stream Processing Spark Streaming and Flink Stream

Observation

I Batch processing models for clusters provide fault tolerance efficiently.

I Divide the job into deterministic tasks.

I Rerun failed/slow tasks in parallel on other nodes.

Page 7: Scalable Stream Processing Spark Streaming and Flink Stream

Core Idea

I Run a streaming computation as a series of very small and deterministic batch jobs.

Page 8: Scalable Stream Processing Spark Streaming and Flink Stream

Challenges

I Latency (interval granularity)
• Traditional batch systems replicate state to on-disk storage, which is slow.

I Recovering quickly from faults and stragglers

Page 9: Scalable Stream Processing Spark Streaming and Flink Stream

Proposed Solution

I Latency (interval granularity)
• Resilient Distributed Datasets (RDDs)
• Keep data in memory
• No replication

I Recovering quickly from faults and stragglers
• Storing the lineage graph
• Using the determinism of D-Streams
• Parallel recovery of a lost node's state

Page 10: Scalable Stream Processing Spark Streaming and Flink Stream

Programming Model

Page 11: Scalable Stream Processing Spark Streaming and Flink Stream

Spark Streaming

I Run a streaming computation as a series of very small, deterministic batch jobs.

• Chop up the live stream into batches of X seconds.

• Spark treats each batch of data as an RDD and processes it using RDD operations.

• Finally, the processed results of the RDD operations are returned in batches.

• This model is called Discretized Stream Processing (DStream).

Page 16: Scalable Stream Processing Spark Streaming and Flink Stream

DStream

I DStream: sequence of RDDs representing a stream of data.

I Any operation applied on a DStream translates to operations on the underlying RDDs.

Page 18: Scalable Stream Processing Spark Streaming and Flink Stream

StreamingContext

I StreamingContext: the main entry point of all Spark Streaming functionality.

I To initialize a Spark Streaming program, a StreamingContext object has to be created.

val conf = new SparkConf().setAppName(appName).setMaster(master)

val ssc = new StreamingContext(conf, Seconds(1))

Page 19: Scalable Stream Processing Spark Streaming and Flink Stream

Sources of Streaming

I Two categories of streaming sources.

I Basic sources, directly available in the StreamingContext API, e.g., file systems, socket connections, ...

I Advanced sources, e.g., Kafka, Flume, Kinesis, Twitter, ...

ssc.socketTextStream("localhost", 9999)

TwitterUtils.createStream(ssc, None)
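Advanced sources are not part of Spark's core, so the matching connector artifact has to be linked into the application. A hypothetical sbt line for the Kafka source, with artifact and version assumed for a Spark 2.0 / Scala 2.11 build:

libraryDependencies += "org.apache.spark" % "spark-streaming-kafka-0-8_2.11" % "2.0.0"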

Page 20: Scalable Stream Processing Spark Streaming and Flink Stream

DStream Transformations

I Transformations: modify data from one DStream into a new DStream.

I Standard RDD operations, e.g., map, join, ...

I DStream operations, e.g., window operations

Page 21: Scalable Stream Processing Spark Streaming and Flink Stream

DStream Transformation Example

val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")

val ssc = new StreamingContext(conf, Seconds(1))

val lines = ssc.socketTextStream("localhost", 9999)

val words = lines.flatMap(_.split(" "))

val pairs = words.map(word => (word, 1))

val wordCounts = pairs.reduceByKey(_ + _)

wordCounts.print()
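The above only defines the pipeline; no data is processed until the streaming context is started. The standard lifecycle calls complete the example:

ssc.start() // start the computation

ssc.awaitTermination() // wait for the computation to terminate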

Page 22: Scalable Stream Processing Spark Streaming and Flink Stream

Window Operations

I Apply transformations over a sliding window of data: window length and slide interval.

val ssc = new StreamingContext(conf, Seconds(1))

val lines = ssc.socketTextStream(IP, Port)

val words = lines.flatMap(_.split(" "))

val pairs = words.map(word => (word, 1))

val windowedWordCounts = pairs.reduceByKeyAndWindow(_ + _,

Seconds(30), Seconds(10))
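reduceByKeyAndWindow also has an incremental overload that takes an inverse reduce function: each window is then computed by adding the data that slides in and subtracting the data that slides out, instead of re-reducing the whole window. A sketch of the same count (this variant requires a checkpoint directory to be set):

val windowedWordCounts = pairs.reduceByKeyAndWindow(_ + _, _ - _, Seconds(30), Seconds(10))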

Page 23: Scalable Stream Processing Spark Streaming and Flink Stream

MapWithState Operation

I Maintains state while continuously updating it with new information.

I It requires a checkpoint directory to be set.

I A newer alternative to updateStateByKey.

val ssc = new StreamingContext(conf, Seconds(1))

ssc.checkpoint(".")

val lines = ssc.socketTextStream(IP, Port)

val words = lines.flatMap(_.split(" "))

val pairs = words.map(word => (word, 1))

// define the state-mapping function first, then use it below
val mappingFunc = (word: String, one: Option[Int], state: State[Int]) => {
  val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(sum)
  (word, sum)
}

val stateWordCount = pairs.mapWithState(
  StateSpec.function(mappingFunc))
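StateSpec can carry additional configuration; for example, the API allows a timeout for idle keys (the duration below is illustrative; on timeout the function receives one = None for that key one last time):

val stateWordCountWithTimeout = pairs.mapWithState(
  StateSpec.function(mappingFunc).timeout(Seconds(60)))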

Page 24: Scalable Stream Processing Spark Streaming and Flink Stream

Transform Operation

I Allows arbitrary RDD-to-RDD functions to be applied on a DStream.

I Apply any RDD operation that is not exposed in the DStream API, e.g., joining every RDD in a DStream with another RDD.

// RDD containing spam information

val spamInfoRDD = ssc.sparkContext.newAPIHadoopRDD(...)

val cleanedDStream = wordCounts.transform(rdd => {

// join data stream with spam information to do data cleaning

rdd.join(spamInfoRDD).filter(...)

...

})

Page 25: Scalable Stream Processing Spark Streaming and Flink Stream

Output Operations

I Push out a DStream's data to external systems, e.g., a database or a file system.

I foreachRDD: the most generic output operator.
• Applies a function to each RDD generated from the stream.
• The function is executed in the driver process; operations on the RDD inside it run on the workers.

dstream.foreachRDD { rdd =>

rdd.foreachPartition { partitionOfRecords =>

val connection = createNewConnection()

partitionOfRecords.foreach(record => connection.send(record))

connection.close()

}

}
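Opening a fresh connection for every partition of every batch can still be expensive. The Spark programming guide suggests reusing connections across batches through a static pool; ConnectionPool below is a hypothetical helper, not a Spark API:

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // ConnectionPool: hypothetical static, lazily initialized pool of connections
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection) // return to the pool for future reuse
  }
}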

Page 26: Scalable Stream Processing Spark Streaming and Flink Stream

Spark Streaming and DataFrame

val words: DStream[String] = ...

words.foreachRDD { rdd =>

// Get the singleton instance of SQLContext

val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)

import sqlContext.implicits._

// Convert RDD[String] to DataFrame

val wordsDataFrame = rdd.toDF("word")

// Register as table

wordsDataFrame.registerTempTable("words")

// Do word count on DataFrame using SQL and print it

val wordCountsDataFrame =

sqlContext.sql("select word, count(*) as total from words group by word")

wordCountsDataFrame.show()

}

Page 27: Scalable Stream Processing Spark Streaming and Flink Stream

Implementation

Page 28: Scalable Stream Processing Spark Streaming and Flink Stream

System Architecture

I Spark Streaming components:

• Master: tracks the DStream lineage graph and schedules tasks to compute new RDD partitions.

• Workers: receive data, store the partitions of input and computed RDDs, and execute tasks.

• Client library: used to send data into the system.

Page 29: Scalable Stream Processing Spark Streaming and Flink Stream

Application Execution (1/2)

I The system loads streams:
• by receiving records directly from clients,
• or by loading data periodically from external storage, e.g., HDFS.

I All data is managed by a block store on each worker, with a tracker on the master to let nodes find the locations of blocks.

• Each block is given a unique ID, and any node that has that ID can serve it.

• The block store keeps new blocks in memory but drops them in an LRU fashion.

Page 31: Scalable Stream Processing Spark Streaming and Flink Stream

Application Execution (2/2)

I To decide when to start processing a new interval:
• The nodes have their clocks synchronized via NTP.
• Each node sends the master a list of block IDs it received in each interval when it ends.

I The master starts each task whenever its parents are finished.

Page 32: Scalable Stream Processing Spark Streaming and Flink Stream

Fault Tolerance

I Spark remembers the sequence of operations that creates each RDD from the original fault-tolerant input data (the lineage graph).

I Batches of input data are replicated in the memory of multiple worker nodes.

I Data lost due to a worker failure can be recomputed from the input data.

Page 33: Scalable Stream Processing Spark Streaming and Flink Stream

Parallel Recovery

I When a node fails, the RDD partitions on the node and its running tasks are recomputed in parallel on other nodes.

I The system periodically checkpoints some of the RDDs, by asynchronously replicating them to other worker nodes.

I When a node fails, the system detects all missing RDD partitions and launches tasks to recompute them from the last checkpoint.

I Many tasks can be launched at the same time to compute different RDD partitions.

Page 34: Scalable Stream Processing Spark Streaming and Flink Stream

Master Recovery

I To tolerate failures of the Spark master:
• Write the state of the computation reliably when starting each timestep.
• Have workers connect to a new master and report their RDD partitions to it when the old master fails.

I Operations are deterministic, therefore there is no problem if a given RDD is computed twice.

Page 35: Scalable Stream Processing Spark Streaming and Flink Stream

Structured Streaming

Page 36: Scalable Stream Processing Spark Streaming and Flink Stream

Motivation

I Continuous applications: end-to-end applications that react to data in real-time.

• Updating data that will be served in real-time
• Extract, transform and load (ETL)
• Creating a real-time version of an existing batch job
• Online machine learning

Page 37: Scalable Stream Processing Spark Streaming and Flink Stream

Structured Streaming

I Structured Streaming is a new high-level API to support continuous applications.

I A higher-level API than Spark Streaming.

I Built on the Spark SQL engine.

I Performs database-like query optimizations.

Page 38: Scalable Stream Processing Spark Streaming and Flink Stream

Programming Model (1/2)

I Treat a live data stream as a table that is being continuously appended to.

I Users can express their streaming computation as a standard batch-like query, as if on a static table.

I Spark runs it as an incremental query on the unbounded input table.

Page 39: Scalable Stream Processing Spark Streaming and Flink Stream

Programming Model (2/2)

I A query on the input will generate the Result Table.

I Every trigger interval (e.g., every 1 second), new rows get appended to the Input Table, which eventually updates the Result Table.

I Whenever the Result Table gets updated, we can write the changed result rows to an external sink.

Page 40: Scalable Stream Processing Spark Streaming and Flink Stream

Example

val spark: SparkSession = ...

val lines = spark.readStream.format("socket").option("host", "localhost")

.option("port", 9999).load()

val words = lines.as[String].flatMap(_.split(" "))

val wordCounts = words.groupBy("value").count()

val query = wordCounts.writeStream.outputMode("complete")

.format("console").start()

query.awaitTermination()

Page 41: Scalable Stream Processing Spark Streaming and Flink Stream

Creating Streaming DataFrames and Datasets

I Created through the DataStreamReader returned by SparkSession.readStream.

val spark: SparkSession = ...

// Read text from socket

val socketDF = spark.readStream.format("socket")

.option("host", "localhost").option("port", 9999).load()

// Read all the csv files written atomically in a directory

val userSchema = new StructType().add("name", "string").add("age", "integer")

val csvDF = spark.readStream.option("sep", ";")

.schema(userSchema).csv("/path/to/directory")

Page 42: Scalable Stream Processing Spark Streaming and Flink Stream

Basic Operations

I Most of the common operations on DataFrames/Datasets are supported for streaming.

// `type` is backticked because it is a reserved word in Scala;
// DateTime stands for a timestamp type, e.g. org.joda.time.DateTime
case class DeviceData(device: String, `type`: String, signal: Double, time: DateTime)

// streaming DataFrame with schema

// { device: string, type: string, signal: double, time: string }

val df: DataFrame = ...

val ds: Dataset[DeviceData] = df.as[DeviceData]

// Selection and projection

df.select("device").where("signal > 10") // using untyped APIs

ds.filter(_.signal > 10).map(_.device) // using typed APIs

// Aggregation

df.groupBy("type") // using untyped API

ds.groupByKey(_.`type`) // using typed API

Page 43: Scalable Stream Processing Spark Streaming and Flink Stream

Window Operation (1/2)

I Aggregations over a sliding event-time window.

I Event-time is the time embedded in the data itself, not the time Spark receives it.

// count words within 10 minute windows, updating every 5 minutes.

// streaming DataFrame of schema {timestamp: Timestamp, word: String}

val words = ...

val windowedCounts = words.groupBy(

window($"timestamp", "10 minutes", "5 minutes"),

$"word"

).count()

Page 44: Scalable Stream Processing Spark Streaming and Flink Stream

Window Operation (2/2)

I Late data: records that arrive late simply update the aggregate of the older window they belong to in the Result Table.

Page 45: Scalable Stream Processing Spark Streaming and Flink Stream

Flink Stream

Page 46: Scalable Stream Processing Spark Streaming and Flink Stream

Batch Processing vs. Stream Processing (1/2)

I Batch processing is just a special case of stream processing.

Page 47: Scalable Stream Processing Spark Streaming and Flink Stream

Batch Processing vs. Stream Processing (2/2)

I Batched/Stateless: scheduled in batches
• Short-lived tasks (Hadoop, Spark)
• Distributed streaming over batches (Spark Streaming)

I Dataflow/Stateful: continuous/scheduled once (Storm, Samza, Naiad, Flink)
• Long-lived task execution
• State is kept inside tasks

Page 48: Scalable Stream Processing Spark Streaming and Flink Stream

Native vs. Non-Native Streaming

Page 49: Scalable Stream Processing Spark Streaming and Flink Stream

Lambda Architecture

Page 50: Scalable Stream Processing Spark Streaming and Flink Stream

Flink

I Distributed data flow processing system

I Unified real-time stream and batch processing

Page 51: Scalable Stream Processing Spark Streaming and Flink Stream

Programming Model

Page 52: Scalable Stream Processing Spark Streaming and Flink Stream

Programming Model

I Data stream
• An unbounded, partitioned, immutable sequence of events.

I Stream operators
• Stream transformations that generate new output data streams from input ones.

Page 53: Scalable Stream Processing Spark Streaming and Flink Stream

Flink Stream API (1/2)

I Transformations:

• Basic transformations: Map, Reduce, Filter, Aggregations

• Binary stream transformations: CoMap, CoReduce

• Windowing semantics: policy-based flexible windowing (Time, Count, Delta, ...)

• Temporal binary stream operators: Joins, Crosses

• Native support for iterations

Page 54: Scalable Stream Processing Spark Streaming and Flink Stream

Flink Stream API (2/2)

I Data stream sources
• File system
• Message queue connectors
• Arbitrary source functionality

I Data stream outputs

Page 55: Scalable Stream Processing Spark Streaming and Flink Stream

Word Count in Flink - Batch and Stream

I Batch (DataSet API)

case class Word (word: String, frequency: Int)

val env = ExecutionEnvironment.getExecutionEnvironment // batch execution environment (assumed setup)
val lines: DataSet[String] = env.readTextFile(...)

lines.flatMap {line => line.split(" ").map(word => Word(word, 1))}

.groupBy("word").sum("frequency").print()

I Streaming (DataStream API)

case class Word (word: String, frequency: Int)

val env = StreamExecutionEnvironment.getExecutionEnvironment // streaming execution environment (assumed setup)
val lines: DataStream[String] = env.fromSocketStream(...)

lines.flatMap {line => line.split(" ").map(word => Word(word, 1))}

.keyBy("word").window(Time.of(5, SECONDS))

.every(Time.of(1, SECONDS)).sum("frequency").print()
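In the DataStream API the program above is only a dataflow definition; it starts running when the environment is executed:

env.execute("Streaming WordCount")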

Page 56: Scalable Stream Processing Spark Streaming and Flink Stream

Windowing Semantics

I Trigger and eviction policies
• window(eviction, trigger)
• window(eviction).every(trigger)

I Built-in policies:
• Time: Time.of(length, TimeUnit/custom timestamp)
• Count: Count.of(windowSize)
• Delta: Delta.of(threshold, distance function, start value)

I Window transformations:
• Reduce, mapWindow
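Since trigger and eviction are independent, the policies can be mixed. A minimal sketch in the same pre-1.0 API used on these slides, over a hypothetical stream events, evicting by time but triggering by count:

// keep the last 10 seconds of elements, emit a result every 100 arrivals
val mixedWindow = events.window(Time.of(10, SECONDS)).every(Count.of(100))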

Page 57: Scalable Stream Processing Spark Streaming and Flink Stream

Example 1 - Reading From Multiple Inputs (1/2)

Page 58: Scalable Stream Processing Spark Streaming and Flink Stream

Example 1 - Reading From Multiple Inputs (2/2)

val env = StreamExecutionEnvironment.getExecutionEnvironment

// StockPrice is assumed to be: case class StockPrice(symbol: String, price: Double)
// Read from a socket stream and map it to StockPrice objects

val socketStockStream = env.socketTextStream("localhost", 9999).map(x => {

val split = x.split(",")

StockPrice(split(0), split(1).toDouble)

})

//Generate other stock streams

val SPX_Stream = env.addSource(generateStock("SPX")(10) _)

val FTSE_Stream = env.addSource(generateStock("FTSE")(20) _)

val DJI_Stream = env.addSource(generateStock("DJI")(30) _)

val BUX_Stream = env.addSource(generateStock("BUX")(40) _)

//Merge all stock streams together

val stockStream = socketStockStream.merge(SPX_Stream, FTSE_Stream,

DJI_Stream, BUX_Stream)

Page 59: Scalable Stream Processing Spark Streaming and Flink Stream

Example 2 - Window Aggregations (1/2)

Page 60: Scalable Stream Processing Spark Streaming and Flink Stream

Example 2 - Window Aggregations (2/2)

//Define the desired time window

val windowedStream = stockStream

.window(Time.of(10, SECONDS)).every(Time.of(5, SECONDS))

//Compute some simple statistics on a rolling window

val lowest = windowedStream.minBy("price")

val maxByStock = windowedStream.groupBy("symbol").maxBy("price")

val rollingMean = windowedStream.groupBy("symbol").mapWindow(mean _)

//Compute the mean of a window

def mean(ts: Iterable[StockPrice], out: Collector[StockPrice]) = {

if (ts.nonEmpty) {

out.collect(StockPrice(ts.head.symbol,

ts.foldLeft(0: Double)(_ + _.price) / ts.size))

}

}

Page 61: Scalable Stream Processing Spark Streaming and Flink Stream

Example 3 - Data-Driven Windows (1/2)

Page 62: Scalable Stream Processing Spark Streaming and Flink Stream

Example 3 - Data-Driven Windows (2/2)

case class Count(symbol: String, count: Int)

val defaultPrice = StockPrice("", 1000)

//Use delta policy to create price change warnings

val priceWarnings = stockStream.groupBy("symbol")

.window(Delta.of(0.05, priceChange, defaultPrice))

.mapWindow(sendWarning _)

//Count the number of warnings every half a minute

val warningsPerStock = priceWarnings.map(Count(_, 1))

.groupBy("symbol")

.window(Time.of(30, SECONDS))

.sum("count")

def priceChange(p1: StockPrice, p2: StockPrice): Double = {

Math.abs(p1.price / p2.price - 1)

}

def sendWarning(ts: Iterable[StockPrice], out: Collector[String]) = {

if (ts.nonEmpty) out.collect(ts.head.symbol)

}

Page 63: Scalable Stream Processing Spark Streaming and Flink Stream

Implementation

Page 64: Scalable Stream Processing Spark Streaming and Flink Stream

Flink Architecture

I Master (JobManager): schedules tasks, coordinates checkpoints, coordinates recovery on failures, etc.

I Workers (TaskManagers): JVM processes that execute tasks of a dataflow, and buffer and exchange the data streams.

• Workers use task slots to control the number of tasks they accept.
• Each task slot represents a fixed subset of the worker's resources.
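The slot count is a per-worker setting; in Flink's standard flink-conf.yaml it is configured as follows (the value 4 is just an example):

taskmanager.numberOfTaskSlots: 4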

Page 65: Scalable Stream Processing Spark Streaming and Flink Stream

Application Execution

I Jobs are expressed as data flows.

I Job graphs are transformed into an execution graph.

I The execution graph contains the information needed to schedule and execute a job.

Page 66: Scalable Stream Processing Spark Streaming and Flink Stream

Fault Tolerance (1/3)

I Fault tolerance in Spark
• RDD re-computation

I Fault tolerance in Storm
• Tracks records with unique IDs.
• Operators send acks when a record has been processed.
• Records are dropped from the backup when they have been fully acknowledged.

I Fault tolerance in Flink
• More coarse-grained approach than Storm.
• Based on consistent global snapshots (inspired by Chandy-Lamport).
• Low runtime overhead, stateful exactly-once semantics.

Page 67: Scalable Stream Processing Spark Streaming and Flink Stream

Fault Tolerance (2/3)

I Acknowledges sequences of records instead of individual records.

I Periodically, the data sources inject checkpoint barriers into the data stream.

I The barriers flow through the data stream, and trigger operators to emit all records that depend only on records before the barrier.

I Once all sinks have received the barriers, Flink knows that all records before the barriers will never be needed again.
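Barrier injection is enabled by turning on checkpointing on the execution environment; a minimal sketch (the interval is chosen arbitrarily):

// emit a checkpoint barrier from the sources every 5 seconds
env.enableCheckpointing(5000)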

Page 68: Scalable Stream Processing Spark Streaming and Flink Stream

Fault Tolerance (3/3)

I Asynchronous barrier snapshotting for globally consistent checkpoints.

I Checkpointing and recovery.

Page 69: Scalable Stream Processing Spark Streaming and Flink Stream

Summary

Page 70: Scalable Stream Processing Spark Streaming and Flink Stream

Summary

I Spark
• Mini-batch processing
• DStream: sequence of RDDs
• RDD and window operations
• Structured Streaming

I Flink
• Unified batch and stream
• Native streaming: data flow and pipelining
• Different windowing semantics
• Job graphs and execution graphs
• Asynchronous barriers

Page 71: Scalable Stream Processing Spark Streaming and Flink Stream

Questions?

Acknowledgements

Some slides and pictures were derived from Gyula Fóra's slides.
