Spark streaming

Noam Shaish

Spark Streaming: scale, fault tolerance, high throughput

Agenda

❖ Overview

❖ Architecture

❖ Fault-tolerance

❖ Why Spark streaming? We have Storm

❖ Demo

Overview

❖ Spark Streaming is an extension of the core Spark API. It enables scalable, high-throughput, fault-tolerant stream processing of live data streams.

❖ Connectors for most common data sources, such as Kafka, Flume, Twitter, ZeroMQ, Kinesis, and TCP sockets.

❖ Spark Streaming differs from most online processing solutions by adopting a mini-batch approach instead of processing the data stream record by record.

❖ Based on the Discretized Streams paper:

Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, Ion Stoica. Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing. Berkeley EECS (2012-12-14).

http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf

Overview

Spark Streaming runs a streaming computation as a series of very small, deterministic batch jobs.

[Diagram: live data stream → Spark Streaming → batches of X milliseconds → Spark → processed results]

❖ Chops the live stream into batches of X milliseconds

❖ Spark treats each batch of data as an RDD

❖ Processed results of the RDD operations are returned in batches
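The three steps above can be modelled without Spark at all. The following is a minimal plain-Scala sketch, not Spark's implementation: the `Event` type, the `chop` and `process` helpers, and the upper-casing "job" are all hypothetical names chosen for illustration.

```scala
// Plain-Scala model of micro-batching; no Spark dependency.
object MicroBatchSketch {
  // Hypothetical event type: arrival time in milliseconds plus a payload.
  final case class Event(timeMs: Long, payload: String)

  // Chop a stream of events into consecutive batches of `batchMs` milliseconds.
  // Batch k holds the events with k * batchMs <= timeMs < (k + 1) * batchMs.
  // Intervals that received no events produce no batch in this simplified model.
  def chop(events: Seq[Event], batchMs: Long): Seq[Seq[Event]] =
    events
      .groupBy(e => e.timeMs / batchMs)
      .toSeq
      .sortBy { case (batchIndex, _) => batchIndex }
      .map { case (_, batch) => batch }

  // Run the same deterministic job on every batch; results come back per batch.
  def process(batches: Seq[Seq[Event]]): Seq[Seq[String]] =
    batches.map(batch => batch.map(e => e.payload.toUpperCase))
}
```

With a 500 ms batch size, events arriving at 0, 400, 500 and 1200 ms fall into three batches, and the job output keeps that batch structure.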

DStream, not just RDD

Transformations

• map() • flatMap() • filter() • count() • repartition() • union() • reduce() • countByValue() • reduceByKey() • join() • cogroup() • transform() • updateStateByKey()

Output Operations

• print() • foreachRDD() • saveAsObjectFiles() • saveAsTextFiles() • saveAsHadoopFiles() • saveToCassandra()*

* Provided by the DataStax Cassandra connector

Window Operations

• window() • countByWindow() • reduceByWindow() • reduceByKeyAndWindow() • countByValueAndWindow()

Example 1 - DStream to RDD

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)

[Diagram: the Twitter Streaming API feeds the tweets DStream, a sequence of batches @ t, t + 1, t + 2, t + 3; each batch is stored in memory as an RDD (immutable, distributed)]

Example 1 - DStream to RDD relation

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))

[Diagram: flatMap is applied to each batch (t, t + 1, t + 2, t + 3) of the tweets DStream, creating new RDDs for each batch; together they form a new DStream, hashTags: [#hobbitch, #bilboleggins, …]]

Example 1 - DStream to RDD

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveToCassandra("keyspace", "tableName")

[Diagram: flatMap turns each batch of the tweets DStream into a batch of the hashTags DStream ([#hobbitch, #bilboleggins, …]); every batch is saved to Cassandra]

Example 2 - DStream to RDD relation

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.countByValue()

[Diagram: each batch of the tweets DStream goes through flatMap to produce hashTags, then through map and reduceByKey to produce tagCounts: [(#hobbitch, 10), (#bilboleggins, 34), …]]
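The diagram describes countByValue() as a map followed by a reduceByKey, applied to every batch. A small plain-Scala sketch of that decomposition follows; it runs on local collections rather than on Spark, and the `reduceByKey` and `countByValue` helpers here are local stand-ins written for illustration, not the Spark methods of the same name.

```scala
// Plain-Scala model of countByValue as map + reduceByKey; no Spark dependency.
object CountByValueSketch {
  // Local stand-in for reduceByKey: merge all values that share a key.
  def reduceByKey[K, V](pairs: Seq[(K, V)])(merge: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (key, group) => key -> group.map(_._2).reduce(merge) }

  // countByValue on one batch: map every element to (element, 1), then sum per key.
  def countByValue[T](batch: Seq[T]): Map[T, Long] =
    reduceByKey(batch.map(tag => (tag, 1L)))(_ + _)

  // A DStream is modelled as a sequence of batches; the operation runs on each batch.
  def countByValuePerBatch[T](batches: Seq[Seq[T]]): Seq[Map[T, Long]] =
    batches.map(batch => countByValue(batch))
}
```

Each batch yields its own count map, which mirrors how the per-batch RDDs of tagCounts are independent of one another.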

Example 3 - Count the hash tags over last 10 minutes

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()

Sliding window operation: window(window length, sliding interval)

Example 3 - Count the hash tags over last 10 minutes

val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()

[Diagram: a sliding window moves over the hashTags batches t-1, t, t+1, t+2, t+3; tagCounts is a count over all data in the window]

Example 4 - Count hash tags over last 10 minutes smartly

val tagCounts = hashTags.countByValueAndWindow(Minutes(10), Seconds(1))

[Diagram: as the sliding window moves over the hashTags batches t-1, t, t+1, t+2, t+3, the count of the new batch entering the window is added (+) and the count of the batch leaving the window is subtracted (-)]

A generalization of the smart window reduce exists: reduceByKeyAndWindow(reduce, inverseReduce, window, interval)
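To make the add/subtract idea concrete, here is a minimal plain-Scala sketch that compares a naive recount of every window with the incremental version. It is a local model only: the window length is measured in batches, and `countBatch`, `add`, `subtract`, `naive` and `incremental` are hypothetical helper names, with `add` and `subtract` playing the roles of reduce and inverseReduce.

```scala
// Plain-Scala model of an incremental sliding-window count; no Spark dependency.
object SlidingWindowSketch {
  type Counts = Map[String, Long]

  // Count the tags of a single batch.
  def countBatch(batch: Seq[String]): Counts =
    batch.groupBy(identity).map { case (tag, occurrences) => tag -> occurrences.size.toLong }

  // "reduce": add the counts of the batch entering the window.
  def add(a: Counts, b: Counts): Counts =
    b.foldLeft(a) { case (acc, (tag, n)) => acc.updated(tag, acc.getOrElse(tag, 0L) + n) }

  // "inverseReduce": subtract the counts of the batch leaving the window, dropping zeros.
  def subtract(a: Counts, b: Counts): Counts =
    b.foldLeft(a) { case (acc, (tag, n)) =>
      val remaining = acc.getOrElse(tag, 0L) - n
      if (remaining > 0) acc.updated(tag, remaining) else acc - tag
    }

  // Naive version: recount all data in the window at every step.
  def naive(batches: Seq[Seq[String]], windowLen: Int): Seq[Counts] =
    batches.indices.map { i =>
      val from = math.max(0, i - windowLen + 1)
      countBatch(batches.slice(from, i + 1).flatten)
    }

  // Incremental version: previous window + new batch - batch that fell out.
  def incremental(batches: Seq[Seq[String]], windowLen: Int): Seq[Counts] = {
    val perBatch = batches.map(countBatch)
    val states = perBatch.indices.scanLeft(Map.empty[String, Long]) { (window, i) =>
      val withNew = add(window, perBatch(i))
      if (i >= windowLen) subtract(withNew, perBatch(i - windowLen)) else withNew
    }
    states.tail // drop the initial empty state
  }
}
```

Both versions give the same counts; the incremental one touches only two batches per step instead of the whole window, which matters when the window (10 minutes) is much longer than the slide interval (1 second).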

Architecture

❖ Receivers divide the data into mini-batches

❖ The batch size is defined in milliseconds (best practice is greater than 500 milliseconds)

[Diagram: input streams → receivers (Spark Streaming) → batches of input RDDs → Spark engine → batches of output RDDs]

Fault-tolerance

❖ RDDs of received data are not generated from a fault-tolerant source

❖ Received data is replicated among worker nodes (default replication factor of 2)

❖ Stateful jobs should use checkpoints

❖ Journaling, as in a database, can be activated (write ahead log)

[Diagram: flatMap turns the tweets RDD into the hashTags RDD; input data is replicated in memory, and lost partitions are recomputed on other workers]

Fault-tolerance

❖ Two kinds of data to recover in the event of failure:

• Data received and replicated - This data survives the failure of a single worker node, since a copy of it exists on one of the other nodes.

• Data received but buffered for replication - As this data is not replicated, the only way to recover it is to get it from the source again.

Fault-tolerance

❖ Two receiver semantics:

• Reliable receiver - Acknowledges only after the received data is replicated. If the receiver fails, buffered data is not acknowledged to the source. When the receiver is restarted, the source resends that data, so no data is lost due to the failure.

• Unreliable receiver - Such receivers can lose data when they fail due to worker or driver failures.
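The reliable-receiver contract (replicate first, acknowledge second) can be illustrated with a small plain-Scala model. This is a sketch under stated assumptions and not Spark's Receiver API: the `Source` and `Receiver` classes and the `failBeforeReplication` flag are hypothetical, and the source is assumed to keep every record until it is acknowledged.

```scala
// Plain-Scala model of "acknowledge only after replication"; no Spark dependency.
object ReliableReceiverSketch {
  // Hypothetical source that keeps every record until it has been acknowledged.
  final class Source(records: Seq[String]) {
    private var acked = 0
    def unacknowledged: Seq[String] = records.drop(acked)
    def ack(count: Int): Unit = acked += count
  }

  // Hypothetical receiver. `replicated` stands for data already copied to other
  // nodes, so it is assumed to survive a failure of the receiver itself.
  final class Receiver(source: Source) {
    var replicated: Vector[String] = Vector.empty

    // Receive up to `n` records. If `failBeforeReplication` is true, the receiver
    // dies while the records are only buffered: nothing is replicated or acked.
    def receive(n: Int, failBeforeReplication: Boolean): Unit = {
      val buffered = source.unacknowledged.take(n)
      if (!failBeforeReplication) {
        replicated = replicated ++ buffered // replicate first
        source.ack(buffered.size)           // acknowledge only after replication
      }
    }
  }
}
```

Because the failed call never acknowledges its buffered records, the next call receives them again from the source, which is why nothing is lost in this model.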

Fault-tolerance

Data loss by deployment scenario and failure type:

❖ Without write ahead log:

• Receiver failure - Buffered data lost with unreliable receivers; zero data lost with reliable receivers and files.

• Driver failure - Buffered data lost with unreliable receivers; past data lost with all receivers; zero data lost with files.

❖ With write ahead log:

• Receiver failure - Zero data lost with receivers and files.

• Driver failure - Zero data lost with receivers and files.

Why Spark streaming? We have Storm

One model to rule them all

❖ Same model for offline AND online processing

❖ Common code base for offline AND online processing

❖ Fewer bugs due to duplication

❖ Fewer bugs caused by framework differences

❖ Increased developer productivity

One stack to rule them all

❖ Explore data interactively using the Spark shell to identify problems

❖ Use the same code in a standalone Spark application to identify problems in the production environment

❖ Use similar code in Spark Streaming to monitor problems online

$ ./spark-shell
scala> val file = sc.hadoopFile("smallLogs")
...
scala> val filtered = file.filter(_.contains("ERROR"))
...
scala> val ...

object ProcessProductionData {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(...)
    val file = sc.hadoopFile("productionLogs")
    val filtered = file.filter(_.contains("ERROR"))
    val mapped = filtered.map(...)
    ...
  }
}

object ProcessLiveStream {
  def main(args: Array[String]): Unit = {
    val sc = new StreamingContext(...)
    val stream = sc.kafkaStream(...)
    val filtered = stream.filter(_.contains("ERROR"))
    val mapped = filtered.map(...)
    ...
  }
}
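The point of the three snippets is that the filtering logic is identical in the shell, the batch job and the streaming job. A minimal plain-Scala sketch of that reuse is shown below; it uses local collections in place of RDDs and DStreams, and `errorMessages`, `processBatch` and `processStream` are hypothetical names introduced only for this illustration.

```scala
// Plain-Scala model of one code base for offline and online processing.
object SharedPipelineSketch {
  // The business logic is written once.
  def errorMessages(lines: Seq[String]): Seq[String] =
    lines.filter(_.contains("ERROR")).map(_.trim)

  // Offline: apply the logic to a whole log file at once.
  def processBatch(logFile: Seq[String]): Seq[String] =
    errorMessages(logFile)

  // Online: apply the very same logic to each micro-batch as it arrives.
  def processStream(microBatches: Iterator[Seq[String]]): Iterator[Seq[String]] =
    microBatches.map(errorMessages)
}
```

A fix to `errorMessages` reaches both paths at once, which is the "fewer bugs due to duplication" argument in miniature.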

Performance

❖ Higher throughput than Storm

• Spark Streaming: 670k records/second/node

• Storm: 115k records/second/node

[Charts: Grep and WordCount throughput per node (MB/s) by record size (100 and 1000 bytes), Spark vs Storm]

Tested on 100 EC2 instances with 4 cores each. Comparison taken from the Hadoop Summit 2013 presentation by Tathagata Das and Reynold Xin.

Community

Monitoring

In addition, the StreamingListener interface provides additional information at various levels (application, job, task, etc.)

Language

vs

Utilization

❖ Spark 1.2 introduces dynamic cluster resource allocation

❖ Jobs can request more resources and release resources they no longer need

❖ Available only on YARN

Demo

https://github.com/NoamShaish/spark-streaming-workshop.git
