Building Robust, Adaptive Streaming Apps with Spark Streaming
Spark streaming
Noam Shaish
Spark Streaming: scale, fault tolerance, high throughput
Agenda
❖ Overview
❖ Architecture
❖ Fault-tolerance
❖ Why Spark streaming? We have Storm
❖ Demo
Overview
❖ Spark Streaming is an extension of the core Spark API. It enables scalable, high-throughput, fault-tolerant processing of live data streams.
❖ Connectors exist for most common data sources, such as Kafka, Flume, Twitter, ZeroMQ, Kinesis, and TCP sockets.
❖ Spark Streaming differs from most online processing solutions by adopting a mini-batch approach instead of processing the data stream record by record.
❖ Based on the Discretized Streams paper:
Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing. Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, Ion Stoica. Berkeley EECS (2012-12-14). www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf
Overview
Spark Streaming runs a streaming computation as a series of very small, deterministic batch jobs.
[Diagram: live data stream → Spark Streaming → batches of X milliseconds → Spark → processed results]
❖ Chops the live stream into batches of X milliseconds
❖ Spark treats each batch of data as an RDD
❖ Processed results of the RDD operations are returned in batches
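The mini-batch idea above can be sketched in plain Scala, without Spark. This is only an illustration under stated assumptions: `Record`, `toBatches`, and `processBatch` are hypothetical names, not Spark APIs, and unlike Spark this sketch emits nothing for intervals in which no data arrived.

```scala
object MicroBatchSketch {
  // Hypothetical record type: an arrival time in milliseconds plus a payload.
  final case class Record(timestampMs: Long, value: String)

  // Assign every record to the batch interval that contains its timestamp.
  // Returns (batchStartMs, values) pairs ordered by batch start time.
  def toBatches(records: Seq[Record], batchMs: Long): Seq[(Long, Seq[String])] =
    records
      .groupBy(r => (r.timestampMs / batchMs) * batchMs)
      .toSeq
      .sortBy(_._1)
      .map { case (start, rs) => (start, rs.sortBy(_.timestampMs).map(_.value)) }

  // Each batch is handled by an ordinary, deterministic batch function.
  def processBatch(values: Seq[String]): Int = values.count(_.contains("ERROR"))
}
```

For example, with a batch size of 500 ms, records arriving at 10, 480, 510, 990, and 1200 ms fall into the batches starting at 0, 500, and 1000 ms, and `processBatch` is applied to each batch independently.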
DStream, not just RDD
Transformations
• map() • flatMap() • filter() • count() • repartition() • union() • reduce() • countByValue() • reduceByKey() • join() • cogroup() • transform() • updateStateByKey()
Output Operations
• print() • foreachRDD() • saveAsObjectFiles() • saveAsTextFiles() • saveAsHadoopFiles() • *saveToCassandra()
* DataStax Cassandra connector
Window Operations
• window() • countByWindow() • reduceByWindow() • reduceByKeyAndWindow() • countByValueAndWindow()
Example 1 - DStream to RDD
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
[Diagram: Twitter Streaming API → tweets DStream, made of batch @ t, batch @ t+1, batch @ t+2, batch @ t+3; each batch is stored in memory as an RDD (immutable, distributed)]
Example 1 - DStream to RDD relation
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
[Diagram: flatMap is applied to each batch of the tweets DStream (batch @ t … batch @ t+3), creating new RDDs for each batch; together they form the new hashTags DStream, e.g. [#hobbitch, #bilboleggins, …]]
Example 1 - DStream to RDD
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveToCassandra("keyspace", "tableName")
[Diagram: tweets DStream → flatMap per batch → hashTags DStream, e.g. [#hobbitch, #bilboleggins, …] → save per batch; every batch is saved to Cassandra]
Example 2 - DStream to RDD relation
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.countByValue()
[Diagram: for each batch of the tweets DStream, flatMap produces hashTags, then map and reduceByKey produce tagCounts, e.g. [(#hobbitch, 10), (#bilboleggins, 34), …]]
Example 3 - Count the hash tags over last 10 minutes
val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()
window() is a sliding window operation: Minutes(10) is the window length and Seconds(1) is the sliding interval.
Example 3 - Count the hash tags over last 10 minutes
val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()
[Diagram: a sliding window moves over the batches t-1 … t+3 of the hashTags DStream; the count is computed over all data in the window]
Example 4 - Count hash tags over last 10 minutes smartly
val tagCounts = hashTags.countByValueAndWindow(Minutes(10), Seconds(1))
[Diagram: as the sliding window moves over the batches t-1 … t+3 of the hashTags DStream, the count of the new batch entering the window is added (+) and the count of the batch leaving the window is subtracted (-)]
A generalization of the smart window reduce exists: reduceByKeyAndWindow(reduce, inverseReduce, window, interval)
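The difference between recounting the whole window and updating it incrementally can be sketched in plain Scala, without Spark. This is a simulation under stated assumptions, not how Spark implements it: `countBatch`, `add`, `subtract`, `naive`, and `incremental` are hypothetical helpers, and the window is measured in batches rather than in time. Here `add` plays the role of the reduce function and `subtract` the role of the inverse reduce function.

```scala
object IncrementalWindowSketch {
  type Counts = Map[String, Long]

  // Count the values of a single batch.
  def countBatch(batch: Seq[String]): Counts =
    batch.groupBy(identity).map { case (k, v) => k -> v.size.toLong }

  // "reduce": merge the counts of a batch entering the window.
  def add(a: Counts, b: Counts): Counts =
    b.foldLeft(a) { case (acc, (k, n)) => acc.updated(k, acc.getOrElse(k, 0L) + n) }

  // "inverse reduce": remove the counts of a batch leaving the window,
  // dropping keys whose count falls to zero.
  def subtract(a: Counts, b: Counts): Counts =
    b.foldLeft(a) { case (acc, (k, n)) =>
      val remaining = acc.getOrElse(k, 0L) - n
      if (remaining > 0) acc.updated(k, remaining) else acc - k
    }

  // Naive approach: recount every batch in the window at each step.
  def naive(batches: Seq[Seq[String]], windowBatches: Int): Seq[Counts] =
    batches.indices.map { i =>
      batches
        .slice(math.max(0, i - windowBatches + 1), i + 1)
        .map(countBatch)
        .foldLeft(Map.empty[String, Long])(add)
    }

  // Incremental approach: previous window + new batch - batch that slid out.
  def incremental(batches: Seq[Seq[String]], windowBatches: Int): Seq[Counts] = {
    val perBatch = batches.map(countBatch)
    perBatch.indices.scanLeft(Map.empty[String, Long]) { (window, i) =>
      val added = add(window, perBatch(i))
      if (i >= windowBatches) subtract(added, perBatch(i - windowBatches)) else added
    }.tail
  }
}
```

Both functions return the same counts for every window; the incremental version touches only two batches per step, which matters when the window spans many batches.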
Architecture
❖ Receivers divide the incoming data into mini-batches
❖ The batch size is defined in milliseconds (best practice is greater than 500 milliseconds)
[Diagram: input streams → Receivers → batches of input RDDs → Spark Engine → batches of output RDDs; Receivers and Spark Engine together make up Spark Streaming]
Fault-tolerance
❖ Input RDDs are not generated from a fault-tolerant source
❖ Received data is replicated among worker nodes (default replication factor of 2)
❖ In stateful jobs, checkpoints should be used
❖ Journaling, as in a database, can be activated
[Diagram: tweets RDD → flatMap → hashTags RDD; input data is replicated in memory, and lost partitions are recomputed on other workers]
Fault-tolerance
❖ Two kinds of data to recover in the event of failure:
• Data received and replicated - This data survives the failure of a single worker node, since a copy of it exists on one of the other nodes.
• Data received but buffered for replication - As this data is not yet replicated, the only way to recover it is to get it from the source again.
Fault-tolerance
❖ Two receiver semantics:
• Reliable receiver - Acknowledges the source only after the received data has been replicated. If the receiver fails, buffered data is not acknowledged to the source. When the receiver is restarted, the source resends that data, so no data is lost due to the failure.
• Unreliable receiver - Such receivers can lose data when they fail due to worker or driver failures.
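The reliable-receiver behaviour can be sketched as a small plain-Scala simulation. It is an illustration under stated assumptions only: `Source`, `ReplicatedStore`, and `receive` are hypothetical names, not Spark's receiver API, and the failure is modelled by a simple flag.

```scala
object ReliableReceiverSketch {
  // Hypothetical source that keeps every record until it is acknowledged.
  final class Source(records: Seq[String]) {
    private var unacked: Vector[String] = records.toVector
    def pending: Vector[String] = unacked
    def ack(delivered: Seq[String]): Unit =
      unacked = unacked.filterNot(delivered.contains)
  }

  // Hypothetical stand-in for data replicated across worker nodes.
  final class ReplicatedStore {
    private var blocks: Vector[String] = Vector.empty
    def replicate(block: Seq[String]): Unit = blocks = blocks ++ block
    def contents: Vector[String] = blocks
  }

  // One receive attempt: buffer, replicate, and only then acknowledge.
  // If the receiver fails before replication, the buffer is lost and
  // nothing is acknowledged, so the source still holds the records.
  def receive(source: Source, store: ReplicatedStore, failBeforeReplication: Boolean): Unit = {
    val buffered = source.pending
    if (!failBeforeReplication) {
      store.replicate(buffered)
      source.ack(buffered)
    }
  }
}
```

A failed attempt followed by a restart delivers every record once in this sketch: the first call leaves the source's pending records untouched, and the second call replicates and acknowledges them.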
Fault-tolerance
Deployment scenario | Receiver failure | Driver failure
Without write-ahead log | Buffered data lost with unreliable receivers; zero data loss with reliable receivers and files | Buffered data lost with unreliable receivers; past data lost with all receivers; zero data loss with files
With write-ahead log | Zero data loss with receivers and files | Zero data loss with receivers and files
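Turning on the write-ahead log is a configuration change. The fragment below is a minimal sketch, assuming Spark 1.2 or later on the classpath and a master supplied at submit time; the object name, the application name, and the one-second batch interval are illustrative choices, not taken from the slides.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WriteAheadLogSketch {
  def createContext(checkpointDir: String): StreamingContext = {
    val conf = new SparkConf()
      .setAppName("wal-sketch") // hypothetical application name
      // Journal received data to fault-tolerant storage before it is processed.
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")
    val ssc = new StreamingContext(conf, Seconds(1)) // illustrative batch interval
    // The log is written under the checkpoint directory, which should live on
    // a fault-tolerant file system such as HDFS.
    ssc.checkpoint(checkpointDir)
    ssc
  }
}
```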
Why Spark streaming? We have Storm
One model to rule them all
❖ Same model for offline AND online processing
❖ Common code base for offline AND online processing
❖ Fewer bugs due to duplication
❖ Fewer bugs caused by differences between frameworks
❖ Increased developer productivity
One stack to rule them all
❖ Explore data interactively using the Spark shell to identify a problem
❖ Use the same code in standalone Spark to identify the problem in the production environment
❖ Use similar code in Spark Streaming to monitor the problem online
$ ./spark-shell
scala> val file = sc.hadoopFile("smallLogs")
...
scala> val filtered = file.filter(_.contains("ERROR"))
...
scala> va...

object ProcessProductionData {
  def main(args: Array[String]) {
    val sc = new SparkContext(...)
    val file = sc.hadoopFile("productionLogs")
    val filtered = file.filter(_.contains("ERROR"))
    val mapped = filtered.map(...)
    ...
  }
}

object ProcessLiveStream {
  def main(args: Array[String]) {
    val sc = new StreamingContext(...)
    val stream = sc.kafkaStream(...)
    val filtered = stream.filter(_.contains("ERROR"))
    val mapped = filtered.map(...)
    ...
  }
}
Performance
❖ Higher throughput than Storm
• Spark Streaming: 670k records/second/node
• Storm: 115k records/second/node
[Charts: Grep and WordCount, throughput per node (MB/s) versus record size (100 and 1000 bytes), Spark compared with Storm]
Tested with 100 EC2 instances with 4 cores each. Comparison taken from the Hadoop Summit 2013 presentation by Tathagata Das and Reynold Xin.
Community
Monitoring
In addition, the StreamingListener interface provides additional information at various levels (application, job, task, etc.)
Language
vs
Utilization
❖ Spark 1.2 introduces dynamic cluster resource allocation
❖ Jobs can request more resources and release resources
❖ Available only on YARN
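Dynamic allocation is switched on through configuration properties. The fragment below is a sketch, assuming Spark 1.2 or later on YARN with the external shuffle service set up on the cluster; the object name, the application name, and the executor bounds are illustrative values, not recommendations from the slides.

```scala
import org.apache.spark.SparkConf

object DynamicAllocationSketch {
  def build(): SparkConf = new SparkConf()
    .setAppName("dynamic-allocation-sketch") // hypothetical application name
    // Let the application request and release executors at runtime.
    .set("spark.dynamicAllocation.enabled", "true")
    // The external shuffle service keeps shuffle files available after an
    // executor has been released.
    .set("spark.shuffle.service.enabled", "true")
    // Illustrative lower and upper bounds on the number of executors.
    .set("spark.dynamicAllocation.minExecutors", "2")
    .set("spark.dynamicAllocation.maxExecutors", "10")
}
```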
Demo
https://github.com/NoamShaish/spark-streaming-workshop.git