Spark streaming

Transcript of Spark streaming

  1. Noam Shaish. Spark Streaming: Scale, Fault Tolerance, High Throughput
  2. Agenda: Overview, Architecture, Fault-tolerance, Why Spark Streaming? We have Storm, Demo
  3. Overview
     Spark Streaming is an extension of the core Spark API. It enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
     Connections exist for most common data sources, such as Kafka, Flume, Twitter, ZeroMQ, Kinesis, TCP sockets, etc.
     Spark Streaming differs from most online processing solutions by espousing a mini-batch approach instead of a record-at-a-time data stream.
     Based on the Discretized Streams paper: Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing. Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, Ion Stoica. Berkeley EECS (2012-12-14). www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf
  4. Overview
     Spark Streaming runs the streaming computation as a series of very small, deterministic batch jobs: it chops the live data stream into batches of X milliseconds, Spark treats each batch of data as an RDD, and the processed results of the RDD operations are returned in batches. [diagram: live data stream -> Spark Streaming -> batches of X milliseconds -> Spark -> processed results] (A minimal configuration sketch follows below.)
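     Not on the slide, but for reference, a minimal sketch of how this batching is set up in code; the application name, local master, socket source, and 1-second batch interval are all illustrative assumptions rather than anything the deck prescribes:

        import org.apache.spark.SparkConf
        import org.apache.spark.streaming.{Seconds, StreamingContext}

        // The batch interval passed to StreamingContext (1 second here) is the
        // "X milliseconds" above: every second of input becomes one RDD.
        val conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]")
        val ssc  = new StreamingContext(conf, Seconds(1))

        // A TCP text source; each 1-second slice of lines becomes one batch (RDD).
        val lines = ssc.socketTextStream("localhost", 9999)
        lines.count().print()    // print the size of each batch as it is processed

        ssc.start()
        ssc.awaitTermination()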
  5. DStream, not just RDD
     Transformations: map(), flatMap(), filter(), count(), repartition(), union(), reduce(), countByValue(), reduceByKey(), join(), cogroup(), transform(), updateStateByKey()
     Output operations: print(), foreachRDD(), saveAsObjectFiles(), saveAsTextFiles(), saveAsHadoopFiles(), saveToCassandra()*
     Window operations: window(), countByWindow(), reduceByWindow(), reduceByKeyAndWindow(), countByValueAndWindow()
     * Datastax Cassandra connector
     (A sketch of the stateful updateStateByKey() follows below.)
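     A hedged sketch of updateStateByKey(), one of the stateful transformations listed above, reusing the ssc and lines values from the previous sketch; the word-count logic and checkpoint path are assumptions made for illustration:

        // Needed for pair-DStream operations on older Spark versions.
        import org.apache.spark.streaming.StreamingContext._

        // updateStateByKey keeps running state across batches and requires checkpointing.
        ssc.checkpoint("/tmp/streaming-checkpoint")

        val words = lines.flatMap(_.split(" "))
        val pairs = words.map(word => (word, 1))

        // For each key, add the counts seen in the new batch to the previous total.
        val runningCounts = pairs.updateStateByKey[Int] {
          (newValues: Seq[Int], state: Option[Int]) => Some(newValues.sum + state.getOrElse(0))
        }
        runningCounts.print()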
  6. Example 1 - DStream to RDD
     val tweets = ssc.twitterStream(...)
     Twitter Streaming API. The tweets DStream is a sequence of batches (batch @ t, t+1, t+2, t+3), each stored in memory as an RDD (immutable, distributed).
  7. Example 1 - DStream to RDD relation
     val tweets = ssc.twitterStream(...)
     val hashTags = tweets.flatMap(status => getTags(status))
     flatMap is applied to each batch of the tweets DStream, producing a new RDD for each batch; together these form a new DStream, hashTags [#hobbitch, #bilboleggins, ...].
  8. Example 1 - DStream to RDD
     val tweets = ssc.twitterStream(...)
     val hashTags = tweets.flatMap(status => getTags(status))
     hashTags.saveToCassandra(keyspace, tableName)
     Every batch of the hashTags DStream is saved to Cassandra.
  9. Example 2 - DStream to RDD relation
     val tweets = ssc.twitterStream(...)
     val hashTags = tweets.flatMap(status => getTags(status))
     val tagCounts = hashTags.countByValue()
     Per batch: flatMap, then map, then reduceByKey, producing tagCounts [(#hobbitch, 10), (#bilboleggins, 34), ...].
  10. Example 3 - Count the hash tags over the last 10 minutes
     val tweets = ssc.twitterStream(...)
     val hashTags = tweets.flatMap(status => getTags(status))
     val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()
     window() is a sliding window operation: the first argument is the window length, the second is the sliding interval.
  11. Example 3 - Count the hash tags over the last 10 minutes
     val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()
     The sliding window covers several batches (t-1 through t+3 in the diagram); the count is recomputed over all the data in the window each time it slides.
  12. Example 4 - Count hash tags over the last 10 minutes smartly
     val tagCounts = hashTags.countByValueAndWindow(Minutes(10), Seconds(1))
     Instead of recomputing the whole window, add the count of the new batch entering the window and subtract the count of the batch leaving it. A generalization of this smart window reduce exists: reduceByKeyAndWindow(reduce, inverseReduce, window, interval). (A sketch follows below.)
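     A hedged sketch of that generalized form, written with reduceByKeyAndWindow and an inverse function; mapping each tag to (tag, 1) is an assumption used for illustration:

        import org.apache.spark.streaming.{Minutes, Seconds}

        // Incremental window count: add the batch entering the window and subtract
        // the batch leaving it, instead of recomputing the whole 10-minute window.
        val tagPairs  = hashTags.map(tag => (tag, 1))
        val tagCounts = tagPairs.reduceByKeyAndWindow(
          (a: Int, b: Int) => a + b,   // reduce: merge counts from the new batch
          (a: Int, b: Int) => a - b,   // inverse reduce: remove counts of the old batch
          Minutes(10),                 // window length
          Seconds(1)                   // sliding interval
        )

     Note that the inverse-reduce form requires checkpointing to be enabled on the StreamingContext.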
  13. Architecture
     Receivers divide the incoming data into mini-batches. The size of the batches is defined in milliseconds (best practice is greater than 500 milliseconds). [diagram: input streams -> receivers -> batches of input RDDs -> Spark engine -> batches of output RDDs] (A receiver sketch follows below.)
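     For reference, a hedged sketch of what a receiver looks like from the API side; DummyReceiver, its 100 ms sleep, and the record format are all made up for illustration:

        import org.apache.spark.storage.StorageLevel
        import org.apache.spark.streaming.receiver.Receiver

        // A receiver hands records to Spark with store(); Spark groups them into the
        // mini-batches shown in the diagram. MEMORY_AND_DISK_2 keeps two replicas,
        // matching the default replication factor mentioned on the next slide.
        class DummyReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
          override def onStart(): Unit = {
            new Thread("dummy-receiver") {
              override def run(): Unit = {
                while (!isStopped()) {
                  store("record generated at " + System.currentTimeMillis())
                  Thread.sleep(100)
                }
              }
            }.start()
          }
          override def onStop(): Unit = ()   // the loop above exits once isStopped() is true
        }

        // val stream = ssc.receiverStream(new DummyReceiver)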
  14. Fault-tolerance
     RDDs are not generated from a fault-tolerant source, so received data is replicated among worker nodes (default replication factor of 2). In stateful jobs, checkpoints should be used. Journaling, as in a database, can be activated. [diagram: input data replicated in memory; tweets RDD -> flatMap -> hashTags RDD; lost partitions are recomputed on other workers]
  15. Fault-tolerance
     Two kinds of data to recover in the event of failure:
     Data received and replicated: this data survives the failure of a single worker node, since a copy of it exists on one of the other nodes.
     Data received but buffered for replication: as this data is not replicated, the only way to recover it is to get it from the source again.
  16. Fault-tolerance
     Two receiver semantics:
     Reliable receiver: acknowledges only after the received data has been replicated. If it fails, buffered data does not get acknowledged to the source. If the receiver is restarted, the source will resend the data, and therefore no data is lost due to the failure.
     Unreliable receiver: such receivers can lose data when they fail due to worker or driver failures.
  17. Fault-tolerance
     Deployment scenario vs. data loss:
     Without write-ahead log:
       Receiver failure: buffered data lost with unreliable receivers; zero data loss with reliable receivers and files.
       Driver failure: buffered data lost with unreliable receivers; past data lost with all receivers; zero data loss with files.
     With write-ahead log:
       Receiver failure: zero data loss with receivers and files.
       Driver failure: zero data loss with receivers and files.
     (A configuration sketch follows below.)
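     A minimal sketch of the settings behind this table, assuming Spark 1.2 or later; the application name, batch interval, and checkpoint path are placeholders:

        import org.apache.spark.SparkConf
        import org.apache.spark.streaming.{Seconds, StreamingContext}

        val conf = new SparkConf()
          .setAppName("FaultTolerantStreaming")
          // Write received data to a write-ahead log before processing, so buffered
          // data survives driver failures (the "with write-ahead log" rows above).
          .set("spark.streaming.receiver.writeAheadLog.enable", "true")

        val ssc = new StreamingContext(conf, Seconds(1))
        // Checkpointing is required for stateful operations and for driver recovery;
        // the write-ahead log is stored under the same directory.
        ssc.checkpoint("hdfs:///checkpoints/streaming-app")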
  18. Why Spark Streaming? We have Storm
  19. One model to rule them all
     The same model for offline AND online processing. A common code base for offline AND online processing. Fewer bugs due to duplication. Fewer bugs from framework differences. Increased developer productivity.
  20. One stack to rule them all
     Explore data interactively using the Spark shell to identify the problem. Use the same code in a standalone Spark job to identify the problem in the production environment. Use similar code in Spark Streaming to monitor the problem online.
     $ ./spark-shell
     scala> val file = sc.hadoopFile("smallLogs")
     scala> val filtered = file.filter(_.contains("ERROR"))
     scala> ...
     object ProcessProductionData {
       def main(args: Array[String]) {
         val sc = new SparkContext(...)
         val file = sc.hadoopFile("productionLogs")
         val filtered = file.filter(_.contains("ERROR"))
         val mapped = filtered.map(...)
         ...
       }
     }
     object ProcessLiveStream {
       def main(args: Array[String]) {
         val ssc = new StreamingContext(...)
         val stream = ssc.kafkaStream(...)
         val filtered = stream.filter(_.contains("ERROR"))
         val mapped = filtered.map(...)
         ...
       }
     }
  21. Performance
     Higher throughput than Storm. Spark Streaming: 670k records/second/node; Storm: 115k records/second/node. [charts: Grep and WordCount throughput per node (MB/s) for record sizes of 100 and 1000 bytes, Spark vs. Storm] Tested with 100 EC2 instances with 4 cores each. Comparison taken from Tathagata Das and Reynold Xin's Hadoop Summit 2013 presentation.
  22. Community
  23. Community
  24. Community
  25. Monitoring
     In addition, the StreamingListener interface provides further information at various levels (application, job, task, etc.). (A listener sketch follows below.)
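     A hedged sketch of plugging in such a listener; BatchMetricsListener is a made-up name, and the printed fields are just the standard batch-level metrics:

        import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

        // Logs per-batch scheduling delay and processing time whenever a batch finishes.
        class BatchMetricsListener extends StreamingListener {
          override def onBatchCompleted(event: StreamingListenerBatchCompleted): Unit = {
            val info = event.batchInfo
            println(s"batch ${info.batchTime}: " +
              s"scheduling delay ${info.schedulingDelay.getOrElse(-1L)} ms, " +
              s"processing time ${info.processingDelay.getOrElse(-1L)} ms")
          }
        }

        // ssc.addStreamingListener(new BatchMetricsListener)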
  26. Language vs
  27. Utilization
     Spark 1.2 introduces dynamic cluster resource allocation. Jobs can request more resources and release resources. Available only on YARN. (A configuration sketch follows below.)
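     A minimal sketch of the configuration involved, assuming a YARN deployment; the executor bounds are illustrative values, not recommendations:

        import org.apache.spark.SparkConf

        val conf = new SparkConf()
          .setAppName("ElasticSparkJob")
          // Dynamic allocation lets the job grow and shrink its executor pool.
          .set("spark.dynamicAllocation.enabled", "true")
          .set("spark.dynamicAllocation.minExecutors", "2")
          .set("spark.dynamicAllocation.maxExecutors", "20")
          // The external shuffle service is required so executors can be released
          // without losing shuffle data; in Spark 1.2 this is only available on YARN.
          .set("spark.shuffle.service.enabled", "true")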
  28. Demo
     https://github.com/NoamShaish/spark-streaming-workshop.git