[Big Data Spain] Apache Spark Streaming + Kafka 0.10: an Integration Story

download [Big Data Spain] Apache Spark Streaming + Kafka 0.10:  an Integration Story

of 74

  • date post

    21-Jan-2018
  • Category

    Technology

  • view

    116
  • download

    4

Embed Size (px)

Transcript of [Big Data Spain] Apache Spark Streaming + Kafka 0.10: an Integration Story

  1. 1. Apache Spark Streaming + Kafka 0.10: An Integration Story Joan Viladrosa, Billy Mobile
  2. 2. About me Degree In Computer Science Advanced Programming Techniques & System Interfaces and Integration Co-Founder, Educabits Educational Big data solutions using AWS cloud Big Data Developer, Trovit Hadoop and MapReduce Framework SEM keywords optimization Big Data Architect & Tech Lead BillyMobile Full architecture with Hadoop: Kafka, Storm, Hive, HBase, Spark, Druid, Joan Viladrosa Riera @joanvr joanviladrosa joan.viladrosa@billymob.com
  3. 3. Apache Kafka
  4. 4. What is Apache Kafka? - Publish - Subscribe Message System
  5. 5. What is Apache Kafka? What makes it great? - Publish - Subscribe Message System - Fast - Scalable - Durable - Fault-tolerant
  6. 6. What is Apache Kafka Producer Producer Producer Producer Kafka Consumer Consumer Consumer Consumer As a central point
  7. 7. What is Apache Kafka A lot of different connectors Apache Storm Apache Spark My Java App Logger Kafka Apache Storm Apache Spark My Java App Monitoring Tool
  8. 8. Kafka Terminology Topic: A feed of messages Producer: Processes that publish messages to a topic Consumer: Processes that subscribe to topics and process the feed of published messages Broker: Each server of a kafka cluster that holds, receives and sends the actual data
  9. 9. Kafka Topic Partitions 0 1 2 3 4 5 6Partition 0 Partition 1 Partition 2 0 1 2 3 4 5 6 0 1 2 3 4 5 6 7 8 9 7 8 Topic: Old New writes
  10. 10. Kafka Topic Partitions 0 1 2 3 4 5 6Partition 0 7 8 9 Old New 1 0 1 1 1 2 1 3 1 4 1 5 Producer writes Consumer A (offset=6) Consumer B (offset=12) reads reads
  11. 11. Kafka Topic Partitions 0 1 2 3 4 5 6P0 P1 P2 0 1 2 3 4 5 6 0 1 2 3 4 5 6 7 8 9 7 8 0 1 2 3 4 5 6P3 P4 P5 0 1 2 3 4 5 6 0 1 2 3 4 5 6 7 8 9 7 8 0 1 2 3 4 5 6P6 P7 P8 0 1 2 3 4 5 6 0 1 2 3 4 5 6 7 8 9 7 8 Broker 1 Broker 2 Broker 3 Consumers & Producers
  12. 12. Kafka Topic Partitions 0 1 2 3 4 5 6P0 P1 P2 0 1 2 3 4 5 6 0 1 2 3 4 5 6 7 8 9 7 8 0 1 2 3 4 5 6P3 P4 P5 0 1 2 3 4 5 6 0 1 2 3 4 5 6 7 8 9 7 8 0 1 2 3 4 5 6P6 P7 P8 0 1 2 3 4 5 6 0 1 2 3 4 5 6 7 8 9 7 8 Broker 1 Broker 2 Broker 3 Consumers & Producers More Storage More Parallelism
  13. 13. Kafka Semantics In short: consumer delivery semantics are up to you, not Kafka - Kafka doesnt store the state of the consumers* - It just sends you what you ask for (topic, partition, offset, length) - You have to take care of your state
  14. 14. Apache Kafka Timeline may-2016nov-2015nov-2013nov-2012 New Producer New Consumer Security Kafka Streams Apache Incubator Project 0.7 0.8 0.9 0.10
  15. 15. Apache Spark Streaming
  16. 16. What is Apache Spark Streaming? - Process streams of data - Micro-batching approach
  17. 17. What is Apache Spark Streaming? What makes it great? - Process streams of data - Micro-batching approach - Same API as Spark - Same integrations as Spark - Same guarantees & semantics as Spark
  18. 18. What is Apache Spark Streaming Relying on the same Spark Engine: same syntax as batch jobs https://spark.apache.org/docs/latest/streaming-programming-guide.html
  19. 19. How does it work? - Discretized Streams https://spark.apache.org/docs/latest/streaming-programming-guide.html
  20. 20. How does it work? - Discretized Streams https://spark.apache.org/docs/latest/streaming-programming-guide.html
  21. 21. How does it work? https://databricks.com/blog/2015/07/30/diving-into-apache-spark-streamings-execution-model.html
  22. 22. How does it work? https://databricks.com/blog/2015/07/30/diving-into-apache-spark-streamings-execution-model.html
  23. 23. Spark Streaming Semantics Side effects As in Spark: - Not guarantee exactly-once semantics for output actions - Any side-effecting output operations may be repeated - Because of node failure, process failure, etc. So, be careful when outputting to external sources
  24. 24. Spark Streaming Kafka Integration
  25. 25. Spark Streaming Kafka Integration Timeline dec-2016jul-2016jan-2016sep-2015jun-2015mar-2015dec-2014sep-2014 Fault Tolerant WAL + Python API Direct Streams + Python API Improved Streaming UI Metadata in UI (offsets) + Graduated Direct Receivers Native Kafka 0.10 (experimental) 1.1 1.2 1.3 1.4 1.5 1.6 2.0 2.1
  26. 26. Kafka Receiver ( Spark 1.1) Executor Driver Launch jobs on data Continuously receive data using High Level API Update offsets in ZooKeeper Receiver
  27. 27. Executor HDFS WAL Kafka Receiver with WAL (Spark 1.2) Driver Launch jobs on data Continuously receive data using High Level API Update offsets in ZooKeeper Receiver
  28. 28. Kafka Receiver with WAL (Spark 1.2) Application Driver Executor Spark Context Jobs Computation checkpointed Receiver Input stream Block metadata Block metadata written to log Block data written both memory + log Streaming Context
  29. 29. Kafka Receiver with WAL (Spark 1.2) Restarted Driver Restarted Executor Restarted Spark Context Relaunch Jobs Restart computation from info in checkpoints Restarted Receiver Resend unacked data Recover Block metadata from log Recover Block data from log Restarted Streaming Context
  30. 30. Executor HDFS WAL Kafka Receiver with WAL (Spark 1.2) Driver Launch jobs on data Continuously receive data using High Level API Update offsets in ZooKeeper Receiver
  31. 31. Direct Kafka Integration w/o Receiver or WAL (Spark 1.3) Executor Driver
  32. 32. Direct Kafka Integration w/o Receiver or WAL (Spark 1.3) Executor Driver 1. Query latest offsets and decide offset ranges for batch
  33. 33. Direct Kafka Integration w/o Receiver or WAL (Spark 1.3) Executor 1. Query latest offsets and decide offset ranges for batch 2. Launch jobs using offset ranges Driver topic1, p1, (2000, 2100) topic1, p2, (2010, 2110) topic1, p3, (2002, 2102)
  34. 34. Direct Kafka Integration w/o Receiver or WAL (Spark 1.3) Executor 1. Query latest offsets and decide offset ranges for batch 2. Launch jobs using offset ranges Driver topic1, p1, (2000, 2100) topic1, p2, (2010, 2110) topic1, p3, (2002, 2102) 3. Reads data using offset ranges in jobs using Simple API
  35. 35. Direct Kafka Integration w/o Receiver or WAL (Spark 1.3) Executor 1. Query latest offsets and decide offset ranges for batch 2. Launch jobs using offset ranges Driver topic1, p1, (2000, 2100) topic1, p3, (2002, 2102) 3. Reads data using offset ranges in jobs using Simple API topic1, p2, (2010, 2110)
  36. 36. Direct Kafka Integration w/o Receiver or WAL (Spark 1.3) Executor 1. Query latest offsets and decide offset ranges for batch 2. Launch jobs using offset ranges Driver topic1, p1, (2000, 2100) topic1, p3, (2002, 2102) 3. Reads data using offset ranges in jobs using Simple API topic1, p2, (2010, 2110)
  37. 37. Direct Kafka Integration w/o Receiver or WAL (Spark 1.3) Executor Driver 2. Launch jobs using offset ranges 3. Reads data using offset ranges in jobs using Simple API 1. Query latest offsets and decide offset ranges for batch
  38. 38. Direct Kafka API benefits - No WALs or Receivers - Allows end-to-end exactly-once semantics pipelines * * updates to downstream systems should be idempotent or transactional - More fault-tolerant - More efficient - Easier to use.
  39. 39. Spark Streaming UI improvements (Spark 1.4)
  40. 40. Kafka Metadata (offsets) in UI (Spark 1.5)
  41. 41. What about Spark 2.0+ and new Kafka Integration? This is why we are here, right?
  42. 42. Spark 2.0+ new Kafka Integration spark-streaming-kafka-0-8 spark-streaming-kafka-0-10 Broker Version 0.8.2.1 or higher 0.10.0 or higher Api Stability Stable Experimental Language Support Scala, Java, Python Scala, Java Receiver DStream Yes No Direct DStream Yes Yes SSL / TLS Support No Yes Offset Commit Api No Yes Dynamic Topic Subscription No Yes
  43. 43. Whats really New with this New Kafka Integration? - New Consumer API * Instead of Simple API - Location Strategies - Consumer Strategies - SSL / TLS - No Python API :(
  44. 44. Location Strategies - New consumer API will pre-fetch messages into buffers - So, keep cached consumers into executors - Its better to schedule partitions on the host with already appropriate consumers
  45. 45. Location Strategies - PreferConsistent Distribute partitions evenly across available executors - PreferBrokers If your executors are on the same hosts as your Kafka brokers - PreferFixed Specify an explicit mapping of partitions to hosts
  46. 46. Consumer Strategies - New consumer API has a number of different ways to specify topics, some of which require considerable post-object-instantiation setup. - ConsumerStrategies provides an abstraction that allows Spark to obtain properly configured consumers even after restart from checkpoint.
  47. 47. Consumer Strategies - Subscribe subscribe to a fixed collection of topics - SubscribePattern use a regex to specify topics of interest - Assign specify a fixed collection of partitions Overloaded constructors to specify the starting offset for a particular partition. ConsumerStrategy is a public class that you can extend.
  48. 48. SSL/TTL encryption - New consumer API supports SSL - Only applies to communication between Spark and Kafka brokers - Still responsible for separately securing Spark inter-node communication
  49. 49. How to use New Kafka Integration on Spark 2.0+ Scala Example Code Basic Usage val kafkaParams = Map[String, Object]( "bootstrap.servers" -> "broker01:9092,broker02:9092", "key.deserializer" -> classOf[StringDeserializer], "value.deserializer" -> classOf[StringDeserializer], "group.id" -> "stream_group_id", "auto.offset.reset" -> "latest", "enable.auto.commit" -> (false: java.lang.Boolean) ) val topics = Ar