Spark streaming


Page 1: Spark streaming

Noam Shaish

Spark Streaming: Scale, Fault tolerance, High throughput

Page 2: Spark streaming

Agenda

❖ Overview  

❖ Architecture  

❖ Fault-tolerance  

❖ Why Spark Streaming? We have Storm  

❖ Demo

Page 3: Spark streaming

Overview

❖ Spark Streaming is an extension of the core Spark API. It enables scalable, high-throughput, fault-tolerant stream processing of live data streams.

❖ Connections for most of the common data sources, such as Kafka, Flume, Twitter, ZeroMQ, Kinesis, TCP, etc.

❖ Spark Streaming differs from most online processing solutions by espousing a mini-batch approach instead of a continuous data stream.

❖ Based on the Discretized Streams paper:

Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, Ion Stoica. "Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing." Berkeley EECS (2012-12-14). www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf
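
As a minimal sketch of what this looks like in code (not part of the original slides), assuming Spark Streaming 1.x, a 1-second batch interval, and a TCP socket source on localhost:9999 as placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MinimalStreamingApp {
  def main(args: Array[String]): Unit = {
    // the batch interval defines how the live stream is chopped into mini batches
    val conf = new SparkConf().setAppName("MinimalStreamingApp")
    val ssc = new StreamingContext(conf, Seconds(1))

    // read lines of text from a TCP socket; host and port are placeholders
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.print()

    ssc.start()
    ssc.awaitTermination()
  }
}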

Page 4: Spark streaming

Overview

Spark Streaming runs streaming computation as a series of very small, deterministic batch jobs.

[Diagram: live data stream → Spark Streaming → batches of X milliseconds → Spark → processed results]

❖ Chops the live stream into batches of X milliseconds  

❖ Spark treats each batch of data as an RDD  

❖ Processed results of the RDD operations are returned in batches
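
A small sketch (not in the slides) that makes the batch-as-RDD point concrete: foreachRDD() exposes the ordinary RDD behind every batch, so core Spark operations apply per batch. lines is assumed to be any DStream[String], e.g. the socket stream from the earlier sketch.

// each mini batch arrives as a regular RDD, together with its batch time
lines.foreachRDD { (rdd, time) =>
  val nonEmpty = rdd.filter(_.nonEmpty).count()
  println(s"Batch at $time contains $nonEmpty non-empty lines")
}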

Page 5: Spark streaming

DStream, not just RDD

* DataStax Cassandra connector

Transformations: map(), flatMap(), filter(), count(), repartition(), union(), reduce(), countByValue(), reduceByKey(), join(), cogroup(), transform(), updateStateByKey()

Output operations: print(), foreachRDD(), saveAsObjectFiles(), saveAsTextFiles(), saveAsHadoopFiles(), *saveToCassandra()

Window operations: window(), countByWindow(), reduceByWindow(), reduceByKeyAndWindow(), countByValueAndWindow()
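
A hedged sketch (not from the slides) of two of the listed operations, transform() and updateStateByKey(), assuming a DStream[String] named lines and a checkpoint directory already set on the StreamingContext (stateful operations require one):

// per-batch (word, 1) pairs
val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

// updateStateByKey keeps a running count per word across batches
val updateFunc = (newValues: Seq[Int], state: Option[Int]) =>
  Some(newValues.sum + state.getOrElse(0))
val runningCounts = pairs.updateStateByKey[Int](updateFunc)

// transform() gives direct access to the underlying RDD of each batch
val sorted = runningCounts.transform(rdd => rdd.sortBy(_._2, ascending = false))
sorted.print()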

Page 6: Spark streaming

Example 1 - DStream to RDD

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)

Twitter Streaming API

[Diagram: the tweets DStream is split into batch @ t, batch @ t+1, batch @ t+2, batch @ t+3; each batch is stored in memory as an RDD (immutable, distributed)]

Page 7: Spark streaming

Example 1 - DStream to RDD relation

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)

val hashTags = tweets.flatMap(status => getTags(status))

[Diagram: flatMap applied to each batch of the tweets DStream produces a new RDD for each batch, which together form a new DStream: the hashTags DStream, e.g. [#hobbitch, #bilboleggins, …]]

Page 8: Spark streaming

Example 1 - DStream to RDD

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)

val hashTags = tweets.flatMap(status => getTags(status))

hashTags.saveToCassandra("keyspace", "tableName")

[Diagram: flatMap turns each batch of the tweets DStream into a batch of the hashTags DStream; a save operation writes every batch to Cassandra]
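
saveToCassandra() comes from the DataStax connector noted on the operations slide. As a hedged, connector-free alternative, the generic foreachRDD() output operation can write each batch to any sink; the output path prefix below is a placeholder:

hashTags.foreachRDD { (rdd, time) =>
  // skip empty batches, then write one directory per batch, keyed by batch time
  if (rdd.take(1).nonEmpty) {
    rdd.saveAsTextFile(s"/data/hashtags/batch-${time.milliseconds}")
  }
}

DStream also offers the built-in saveAsTextFiles() output operation; foreachRDD() is shown here because the same pattern generalizes to arbitrary sinks.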

Page 9: Spark streaming

Example 2 - DStream to RDD relation

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)

val hashTags = tweets.flatMap(status => getTags(status))

val tagCounts = hashTags.countByValue()

[Diagram: countByValue() runs on each batch of the hashTags DStream as a map followed by a reduceByKey, yielding a DStream of counts, e.g. [(#hobbitch, 10), (#bilboleggins, 34), …]]

Page 10: Spark streaming

Example 3 - Count the hash tags over last 10 minutes

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)

val hashTags = tweets.flatMap(status => getTags(status))

val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()

Sliding window operation: window length of Minutes(10), sliding interval of Seconds(1)

Page 11: Spark streaming

Example 3 - Count the hash tags over last 10 minutes

val tagCounts = hashTags.window(Minutes(10), Seconds(1)).countByValue()

[Diagram: a sliding window over the hashTags DStream covering batches t-1 through t+3; the count is computed over all data in the window]

Page 12: Spark streaming

Example 4 - Count hash tags over last 10 minutes smartly

val tagCounts = hashTags.countByValueAndWindow(Minutes(10), Seconds(1))

[Diagram: the same sliding window computed incrementally: the count of the new batch entering the window is added (+), and the count of the batch leaving the window is subtracted (-)]

A generalization of this smart window reduce exists: reduceByKeyAndWindow(reduce, inverseReduce, window, interval)
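
A hedged sketch of that generalization, assuming hashTags is a DStream[String], the usual Seconds/Minutes imports, and checkpointing enabled (the inverse-reduce form requires it):

val tagPairs = hashTags.map(tag => (tag, 1))

// incremental window count: fold in counts entering the window, subtract counts leaving it
val windowedCounts = tagPairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,   // reduce
  (a: Int, b: Int) => a - b,   // inverse reduce
  Minutes(10),                 // window length
  Seconds(1)                   // sliding interval
)
windowedCounts.print()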

Page 13: Spark streaming

Architecture

❖ Receivers divide data into mini-batches  

❖ The batch size can be defined in milliseconds (best practice is greater than 500 milliseconds)

[Diagram: input streams → receivers → batches of input RDDs → Spark engine → batches of output RDDs]

Page 14: Spark streaming

Fault-tolerance

❖ RDDs are not generated from a fault-tolerant source  

❖ Data is replicated among worker nodes (default replication factor of 2)  

❖ In stateful jobs, checkpoints should be used (see the sketch below)  

❖ Journaling, as in a database, can be activated

[Diagram: the input data of the tweets RDD is replicated in memory; if a worker is lost, the flatMap to the hashTags RDD recomputes lost partitions on other workers]
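
A sketch (not from the slides) of the knobs behind these bullets: a checkpoint directory for stateful jobs and a replicated storage level for received data. ssc is an existing StreamingContext; the HDFS path, host, and port are placeholders.

import org.apache.spark.storage.StorageLevel

// checkpoint directory on a fault-tolerant file system, needed by stateful operations
ssc.checkpoint("hdfs:///checkpoints/streaming-app")

// the *_2 storage levels keep two copies of each received block on different workers
val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER_2)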

Page 15: Spark streaming

Fault-tolerance

❖ Two kinds of data to recover in the event of failure:

• Data received and replicated - This data survives failure of a single worker node, since a copy of it exists on one of the other nodes.

• Data received but buffered for replication - As this is not replicated, the only way to recover that data is to get it from the source again.

Page 16: Spark streaming

Fault-tolerance

❖ Two receiver semantics:

• Reliable receiver - Acknowledges only after received data is replicated. If it fails, buffered data does not get acknowledged to the source. If the receiver is restarted, the source will resend the data, and therefore no data will be lost due to the failure.

• Unreliable receiver - Such receivers can lose data when they fail due to worker or driver failures.

Page 17: Spark streaming

Fault-tolerance

Deployment scenario vs. failure type:

Without write-ahead log:
• Receiver failure: buffered data lost with unreliable receivers; zero data loss with reliable receivers and files
• Driver failure: buffered data lost with unreliable receivers; past data lost with all receivers; zero data loss with files

With write-ahead log:
• Receiver failure: zero data loss with receivers and files
• Driver failure: zero data loss with receivers and files
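
A hedged sketch of enabling the write-ahead log column above (available since Spark 1.2): the configuration key is spark.streaming.receiver.writeAheadLog.enable, and the checkpoint path is a placeholder.

val conf = new SparkConf()
  .setAppName("ReliableStreamingApp")
  // persist received blocks to a write-ahead log before they are acknowledged
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(1))
// the log lives under the checkpoint directory, so it must be on fault-tolerant storage
ssc.checkpoint("hdfs:///checkpoints/reliable-app")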

Page 18: Spark streaming

Why Spark streaming? We have Storm

Page 19: Spark streaming

One model to rule them all

❖ Same model for offline AND online processing  

❖ Common code base for offline AND online processing  

❖ Fewer bugs due to code duplication  

❖ Fewer bugs caused by differences between frameworks  

❖ Increased developer productivity

Page 20: Spark streaming

One stack to rule them all

❖ Explore data interactively using the Spark shell to identify the problem  

❖ Use the same code in Spark standalone to identify the problem in the production environment  

❖ Use similar code in Spark Streaming to monitor the problem online

$ ./spark-shell
scala> val file = sc.hadoopFile("smallLogs")
...
scala> val filtered = file.filter(_.contains("ERROR"))
...
scala> val mapped = filtered.map(...)
...

object ProcessProductionData {
  def main(args: Array[String]) {
    val sc = new SparkContext(...)
    val file = sc.hadoopFile("productionLogs")
    val filtered = file.filter(_.contains("ERROR"))
    val mapped = filtered.map(...)
    ...
  }
}

object ProcessLiveStream {
  def main(args: Array[String]) {
    val sc = new StreamingContext(...)
    val stream = sc.kafkaStream(...)
    val filtered = stream.filter(_.contains("ERROR"))
    val mapped = filtered.map(...)
    ...
  }
}

Page 21: Spark streaming

Performance

❖ Higher throughput than Storm  

• Spark Streaming: 670k records/second/node  

• Storm: 115k records/second/node

[Charts: throughput per node (MB/s) vs. record size (100 and 1000 bytes) for Grep and WordCount, Spark vs. Storm]

Tested with 100 EC2 instances with 4 cores each. Comparison taken from Tathagata Das and Reynold Xin's Hadoop Summit 2013 presentation.

Page 22: Spark streaming

Community

Page 23: Spark streaming

Community

Page 24: Spark streaming

Community

Page 25: Spark streaming

Monitoring

In addition, the StreamingListener interface provides additional information at various levels (application, job, task, etc.).
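
A hedged sketch of hooking into that interface (the trait is org.apache.spark.streaming.scheduler.StreamingListener); ssc is an existing StreamingContext, and the listener below only logs per-batch delays:

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// logs scheduling delay and processing time for every completed batch
class BatchDelayListener extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    println(s"Batch ${info.batchTime}: " +
      s"scheduling delay ${info.schedulingDelay.getOrElse(-1L)} ms, " +
      s"processing time ${info.processingDelay.getOrElse(-1L)} ms")
  }
}

ssc.addStreamingListener(new BatchDelayListener)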

Page 26: Spark streaming

Language


Page 27: Spark streaming

Utilization

❖ Spark 1.2 introduces dynamic cluster resource allocation  

❖ Jobs can request more resources and release resources  

❖ Available only on YARN
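
A hedged sketch of turning this on, assuming a YARN deployment with the external shuffle service running on the node managers; the executor bounds are placeholders:

val conf = new SparkConf()
  .setAppName("ElasticStreamingApp")
  // let YARN grow and shrink the executor pool with load
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.maxExecutors", "20")
  // dynamic allocation requires the external shuffle service on each node
  .set("spark.shuffle.service.enabled", "true")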

Page 28: Spark streaming

Demo

https://github.com/NoamShaish/spark-streaming-workshop.git