[Spark Summit EU 2017] Apache spark streaming + kafka 0.10 an integration story

Joan Viladrosa, Billy Mobile - Apache Spark Streaming + Kafka 0.10: An Integration Story #EUstr5

Transcript of [Spark Summit EU 2017] Apache spark streaming + kafka 0.10 an integration story

Page 1: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Joan Viladrosa, Billy Mobile

Apache Spark Streaming + Kafka 0.10: An Integration Story

#EUstr5

Page 2: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

About me: Joan Viladrosa Riera

@joanvr / joanviladrosa

[email protected]

2#EUstr5

Degree in Computer Science: Advanced Programming Techniques & System Interfaces and Integration

Co-Founder, Educabits: Educational big data solutions using AWS cloud

Big Data Developer, Trovit: Hadoop and MapReduce framework, SEM keywords optimization

Big Data Architect & Tech Lead, Billy Mobile: Full architecture with Hadoop: Kafka, Storm, Hive, HBase, Spark, Druid, …

Page 3: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Apache Kafka

#EUstr5

Page 4: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

What is Apache Kafka?

- Publish-Subscribe Message System

4#EUstr5

Page 5: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

What is Apache Kafka?

- Publish-Subscribe Message System

What makes it great?

- Fast
- Scalable
- Durable
- Fault-tolerant

5#EUstr5

Page 6: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

What is Apache Kafka? As a central point

[Diagram: many Producers publish into Kafka; many Consumers read from Kafka]

6#EUstr5

Page 7: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

What is Apache Kafka? A lot of different connectors

[Diagram: Apache Storm, Apache Spark, a Java app and a logger produce into Kafka; Apache Storm, Apache Spark, a Java app and a monitoring tool consume from it]

7#EUstr5

Page 8: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Kafka Terminology

Topic: A feed of messages

Producer: Processes that publish messages to a topic

Consumer: Processes that subscribe to topics and process the feed of published messages

Broker: Each server of a Kafka cluster that holds, receives, and sends the actual data

8#EUstr5

Page 9: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Kafka Topic Partitions

[Diagram: a topic with three partitions (Partition 0, 1, 2); each partition is an append-only log of messages indexed by offset (0, 1, 2, …); writes always go to the new end, older messages sit at the old end]

9#EUstr5

Page 10: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Kafka Topic Partitions

[Diagram: Partition 0 as a log of offsets 0 to 15; the Producer writes (appends) at the new end, while Consumer A reads at offset=6 and Consumer B reads at offset=12]

10#EUstr5

Page 11: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Kafka Topic Partitions

[Diagram: partitions P0 to P8 spread across Broker 1, Broker 2 and Broker 3; consumers and producers talk to all brokers]

11#EUstr5

Page 12: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Kafka Topic Partitions

[Diagram: the same nine partitions across three brokers with consumers and producers attached; more brokers means more storage, more partitions means more parallelism]

12#EUstr5

Page 13: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Kafka Semantics

In short: consumer delivery semantics are up to you, not Kafka

- Kafka doesn’t store the state of the consumers*

- It just sends you what you ask for (topic, partition, offset, length)

- You have to take care of your state yourself (see the minimal consumer sketch below)

13#EUstr5
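To make the "you own your state" point concrete, here is a minimal sketch using the plain Kafka new-consumer API from Scala. This example is not from the talk; the broker address, topic, group-less assignment and starting offset are assumptions.

import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import scala.collection.JavaConverters._

object ManualOffsetConsumer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker01:9092")   // assumed broker address
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("enable.auto.commit", "false")           // Kafka will not track our progress

    val consumer = new KafkaConsumer[String, String](props)
    val tp = new TopicPartition("topicA", 0)            // assumed topic/partition
    consumer.assign(java.util.Arrays.asList(tp))
    consumer.seek(tp, 42L)                              // start where *we* say, e.g. offset 42

    val records = consumer.poll(1000L)                  // ask for data from that position
    for (r <- records.asScala) {
      println(s"${r.topic}-${r.partition} @ ${r.offset}: ${r.value}")
      // persisting r.offset + 1 somewhere durable is *your* job, not Kafka's
    }
    consumer.close()
  }
}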

Page 14: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Apache Kafka Timeline

- 0.7 (nov-2012): Apache Incubator project
- 0.8 (nov-2013): New Producer
- 0.9 (nov-2015): New Consumer, Security
- 0.10 (may-2016): Kafka Streams

14#EUstr5

Page 15: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Apache Spark Streaming

#EUstr5

Page 16: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

What is Apache Spark Streaming?

- Process streams of data
- Micro-batching approach

16#EUstr5

Page 17: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

What is Apache Spark Streaming? What makes it great?

- Process streams of data
- Micro-batching approach
- Same API as Spark
- Same integrations as Spark
- Same guarantees & semantics as Spark

17#EUstr5

Page 18: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

What is Apache Spark Streaming? Relying on the same Spark engine: “same syntax” as batch jobs

https://spark.apache.org/docs/latest/streaming-programming-guide.html 18
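To illustrate the "same syntax as batch jobs" point, here is a minimal word-count sketch in the style of the programming guide linked above. It is not from the talk; the socket host/port and the 5-second batch interval are assumptions.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingWordCount")
    val ssc = new StreamingContext(conf, Seconds(5))      // 5-second micro-batches

    // Same "syntax" as a batch job: flatMap / map / reduceByKey, but on a DStream
    val lines  = ssc.socketTextStream("localhost", 9999)  // assumed source host/port
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}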

Page 19: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

How does it work? Discretized Streams

https://spark.apache.org/docs/latest/streaming-programming-guide.html 19

Page 20: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

How does it work? Discretized Streams

https://spark.apache.org/docs/latest/streaming-programming-guide.html 20

Page 21: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

How does it work?

21https://databricks.com/blog/2015/07/30/diving-into-apache-spark-streamings-execution-model.html

Page 22: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

How does it work?

22https://databricks.com/blog/2015/07/30/diving-into-apache-spark-streamings-execution-model.html

Page 23: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Spark Streaming Semantics

Side effects

As in Spark:
- No guaranteed exactly-once semantics for output actions
- Any side-effecting output operations may be repeated
- Because of node failure, process failure, etc.

So, be careful when outputting to external sources

23#EUstr5

Page 24: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Spark Streaming Kafka Integration

#EUstr5

Page 25: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Spark Streaming Kafka Integration Timeline

- Spark 1.2 (dec-2014): Fault-tolerant WAL + Python API
- Spark 1.3 (mar-2015): Direct Streams + Python API
- Spark 1.4 (jun-2015): Improved Streaming UI
- Spark 1.5 (sep-2015): Metadata in UI (offsets) + graduated Direct Receivers
- Spark 2.0 (jul-2016): Native Kafka 0.10 (experimental)

(Spark releases on the timeline: 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 2.0, 2.1; sep-2014 through dec-2016)

25#EUstr5

Page 26: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Kafka Receiver (≤ Spark 1.1)

[Diagram: the Driver launches jobs on the data; a Receiver running in the Executor continuously receives data using the Kafka High Level API and updates offsets in ZooKeeper]

26#EUstr5

Page 27: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Kafka Receiver with WAL (Spark 1.2)

[Diagram: as before, the Receiver in the Executor continuously receives data using the High Level API and updates offsets in ZooKeeper, but it now also writes the received data to a WAL on HDFS before the Driver launches jobs on it]

27#EUstr5

Page 28: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Kafka Receiver with WAL (Spark 1.2)

[Diagram: in the Application Driver, the Streaming Context receives block metadata from the Receiver and writes it to the log, while the Spark Context's computation is checkpointed and jobs are launched on the Executor; in the Executor, the Receiver consumes the input stream and writes block data both to memory and to the log]

28#EUstr5

Page 29: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Kafka Receiver with WAL (Spark 1.2)

[Diagram, recovery: the restarted Driver brings up a restarted Streaming Context and Spark Context, recovers block metadata from the log, restarts the computation from the info in the checkpoints and relaunches jobs; in the restarted Executor, the restarted Receiver recovers block data from the log and unacked data is resent]

29#EUstr5

Page 30: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Kafka Receiver with WAL (Spark 1.2)

[Diagram: the same receiver-with-WAL architecture as on page 27: Receiver in the Executor using the High Level API, offsets updated in ZooKeeper, WAL on HDFS, Driver launching jobs on the data]

30#EUstr5

Page 31: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Direct Kafka Integration w/o Receivers or WALs (Spark 1.3)

[Diagram: just a Driver and Executors; no Receiver, no WAL]

31#EUstr5

Page 32: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Direct Kafka Integration w/o Receivers or WALs (Spark 1.3)

[Diagram: 1. The Driver queries the latest offsets in Kafka and decides the offset ranges for the next batch]

32#EUstr5

Page 33: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Direct Kafka Integration w/o Receivers or WALs (Spark 1.3)

[Diagram: 1. The Driver queries the latest offsets and decides the offset ranges for the batch, e.g. topic1, p1, (2000, 2100); topic1, p2, (2010, 2110); topic1, p3, (2002, 2102). 2. It launches jobs on the Executors using those offset ranges]

33#EUstr5

Page 34: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Direct Kafka Integration w/o Receivers or WALs (Spark 1.3)

[Diagram: 1. The Driver queries the latest offsets and decides offset ranges for the batch (topic1, p1, (2000, 2100); topic1, p2, (2010, 2110); topic1, p3, (2002, 2102)). 2. It launches jobs using those offset ranges. 3. The jobs on the Executors read the data for their offset ranges directly from Kafka using the Simple API]

34#EUstr5

Page 35: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Direct Kafka Integration w/o Receivers or WALs (Spark 1.3)

[Diagram: the same three steps shown end to end: query the latest offsets and decide ranges, launch jobs using the ranges, read the data for those ranges with the Simple API]

35#EUstr5

Page 36: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Direct Kafka Integration w/o Receivers or WALs (Spark 1.3)

[Diagram: the cycle repeats for the next batch with new offset ranges]

36#EUstr5

Page 37: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Direct Kafka Integration w/o Receivers or WALs (Spark 1.3)

[Diagram: steady state: for every batch the Driver decides offset ranges, launches jobs, and the Executors read those ranges directly from Kafka using the Simple API]

37#EUstr5

Page 38: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Direct Kafka API benefits

- No WALs or Receivers
- Allows end-to-end exactly-once semantics pipelines*
  (* updates to downstream systems should be idempotent or transactional)
- More fault-tolerant
- More efficient
- Easier to use

38#EUstr5

Page 39: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Spark Streaming UI improvements (Spark 1.4) 39

Page 40: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Kafka Metadata (offsets) in UI (Spark 1.5) 40

Page 41: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

What about Spark 2.0+ and new Kafka Integration?

This is why we are here, right?

41#EUstr5

Page 42: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Spark 2.0+ new Kafka Integration

                             spark-streaming-kafka-0-8   spark-streaming-kafka-0-10
Broker Version               0.8.2.1 or higher           0.10.0 or higher
API Stability                Stable                      Experimental
Language Support             Scala, Java, Python         Scala, Java
Receiver DStream             Yes                         No
Direct DStream               Yes                         Yes
SSL / TLS Support            No                          Yes
Offset Commit API            No                          Yes
Dynamic Topic Subscription   No                          Yes

42#EUstr5

Page 43: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

What’s really New with this New Kafka Integration?

- New Consumer API (instead of the Simple API)

- Location Strategies
- Consumer Strategies
- SSL / TLS

- No Python API :(

43#EUstr5

Page 44: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Location Strategies

- The new consumer API will pre-fetch messages into buffers
- So, keep cached consumers on the executors
- It’s better to schedule partitions on the hosts that already have the appropriate cached consumers

44#EUstr5

Page 45: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Location Strategies

- PreferConsistent: distribute partitions evenly across available executors

- PreferBrokers: use if your executors are on the same hosts as your Kafka brokers

- PreferFixed: specify an explicit mapping of partitions to hosts

(A usage sketch follows after this slide.)

45#EUstr5
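As a complement to the slide above, a minimal sketch of how the three location strategies are constructed with the Scala helpers in spark-streaming-kafka-0-10. This is not from the talk; the topic and host names are assumptions.

import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.LocationStrategies.{PreferBrokers, PreferConsistent, PreferFixed}

// Default choice: spread partitions evenly over the available executors
val consistent = PreferConsistent

// Only makes sense if executors run on the same hosts as the Kafka brokers
val onBrokers = PreferBrokers

// Explicit partition -> host mapping (host names here are assumptions)
val fixed = PreferFixed(Map(
  new TopicPartition("topicA", 0) -> "worker-01.example.com",
  new TopicPartition("topicA", 1) -> "worker-02.example.com"))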

Page 46: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Consumer Strategies

- The new consumer API has a number of different ways to specify topics, some of which require considerable post-object-instantiation setup.

- ConsumerStrategies provides an abstraction that allows Spark to obtain properly configured consumers even after a restart from checkpoint.

46#EUstr5

Page 47: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Consumer Strategies

- Subscribe: subscribe to a fixed collection of topics
- SubscribePattern: use a regex to specify the topics of interest
- Assign: specify a fixed collection of partitions

● Overloaded constructors let you specify the starting offset for a particular partition.

● ConsumerStrategy is a public class that you can extend.

(A sketch of the three strategies follows after this slide.)

47#EUstr5
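A minimal sketch of the three consumer strategies named above, using the Scala helpers in spark-streaming-kafka-0-10. It is not from the talk; the broker address, group id, topics, partitions and the starting offset are assumptions.

import java.util.regex.Pattern
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.ConsumerStrategies.{Assign, Subscribe, SubscribePattern}

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker01:9092",            // assumed broker address
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "stream_group_id")

// Subscribe: a fixed collection of topics
val byTopics = Subscribe[String, String](Array("topicA", "topicB"), kafkaParams)

// SubscribePattern: a regex over topic names (also picks up new matching topics)
val byPattern = SubscribePattern[String, String](Pattern.compile("topic.*"), kafkaParams)

// Assign: explicit partitions, here with a starting offset for one of them
val parts    = Array(new TopicPartition("topicA", 0), new TopicPartition("topicA", 1))
val offsets  = Map(new TopicPartition("topicA", 0) -> 1000L)   // assumed starting offset
val byAssign = Assign[String, String](parts, kafkaParams, offsets)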

Page 48: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

SSL/TLS encryption

- The new consumer API supports SSL
- It only applies to communication between Spark and the Kafka brokers
- You are still responsible for separately securing Spark inter-node communication

48#EUstr5

Page 49: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

How to use New Kafka Integration on Spark 2.0+: Scala Example Code

Basic usage

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker01:9092,broker02:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "stream_group_id",
  "auto.offset.reset"  -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean))

val topics = Array("topicA", "topicB")

val stream = KafkaUtils.createDirectStream[String, String](
  streamingContext,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams))

stream.map(record => (record.key, record.value))

49#EUstr5

Page 50: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

How to use New Kafka Integration on Spark 2.0+: Scala Example Code

Getting metadata

import org.apache.spark.TaskContext
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  rdd.foreachPartition { iter =>
    val osr: OffsetRange = offsetRanges(TaskContext.get.partitionId)

    // get any needed data from the offset range
    val topic = osr.topic
    val kafkaPartitionId = osr.partition
    val begin = osr.fromOffset
    val end = osr.untilOffset
  }
}

50#EUstr5

Page 51: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Kafka or Spark RDD partitions?

[Diagram: a Kafka topic with partitions 1-4 on the left and a Spark RDD with partitions 1-4 on the right]

51

Page 52: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Kafka or Spark RDD partitions?

[Diagram: the same picture, highlighting that the Kafka partitions map 1:1 to the RDD partitions of a direct stream, which is why offsetRanges can be indexed by TaskContext.get.partitionId in the previous code]

52

Page 53: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

How to use New Kafka Integration on Spark 2.0+: Scala Example Code

Getting metadata (same code as on page 50)

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  rdd.foreachPartition { iter =>
    val osr: OffsetRange = offsetRanges(TaskContext.get.partitionId)

    // get any needed data from the offset range
    val topic = osr.topic
    val kafkaPartitionId = osr.partition
    val begin = osr.fromOffset
    val end = osr.untilOffset
  }
}

53#EUstr5

Page 54: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

How to use New Kafka Integration on Spark 2.0+: Scala Example Code

Store offsets in Kafka itself: Commit API

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // DO YOUR STUFF with DATA

  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}

54#EUstr5

Page 55: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Kafka + Spark Semantics

- At most once
- At least once
- Exactly once

55#EUstr5

Page 56: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Kafka + Spark Semantics

At most once

- We don’t want duplicates
- Not worth the hassle of ensuring that messages don’t get lost
- Example: sending statistics over UDP

1. Set spark.task.maxFailures to 1
2. Make sure spark.speculation is false (the default)
3. Set the Kafka param auto.offset.reset to “latest” (“largest” in the old 0.8 consumer)
4. Set the Kafka param enable.auto.commit to true

(A configuration sketch follows after this slide.)

56#EUstr5
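A configuration sketch matching the four at-most-once steps above. It is not from the talk; the app name, broker address and group id are assumptions.

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf

// Spark side: fail fast, no speculative re-execution
val conf = new SparkConf()
  .setAppName("AtMostOnceJob")                 // assumed app name
  .set("spark.task.maxFailures", "1")
  .set("spark.speculation", "false")           // the default, made explicit

// Kafka side: start from the latest offsets and let Kafka auto-commit them
val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker01:9092",     // assumed broker address
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "at_most_once_group",
  "auto.offset.reset"  -> "latest",            // "largest" in the old 0.8 consumer
  "enable.auto.commit" -> (true: java.lang.Boolean))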

Page 57: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Kafka + Spark Semantics

At most once

- This will mean you lose messages on restart
- At least they shouldn’t get replayed
- Test this carefully if it’s actually important to you that a message never gets repeated, because it’s not a common use case

57#EUstr5

Page 58: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Kafka + Spark Semantics

At least once

- We don’t want to lose any record
- We don’t care about duplicates
- Example: sending internal alerts on relatively rare occurrences in the stream

1. Set spark.task.maxFailures > 1000
2. Set the Kafka param auto.offset.reset to “earliest” (“smallest” in the old 0.8 consumer)
3. Set the Kafka param enable.auto.commit to false

58#EUstr5

Page 59: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Kafka + Spark Semantics

At least once

- Don’t be silly! Do NOT replay your whole log on every restart…
- Manually commit the offsets when you are 100% sure the records are processed
- If this is “too hard”, you’d better have a relatively short retention log
- Or be REALLY ok with duplicates. For example, you are outputting to an external system that handles duplicates for you (HBase)

59#EUstr5

Page 60: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Kafka + Spark Semantics

Exactly once

- We don’t want to lose any record
- We don’t want duplicates either
- Example: storing the stream in a data warehouse

1. We need some kind of idempotent writes, or whole-or-nothing writes (transactions)
2. Only store offsets EXACTLY after writing the data
3. Same parameters as at least once

(A transactional-write sketch follows after this slide.)

60#EUstr5
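A sketch of the exactly-once recipe above using JDBC transactions. It assumes the 'stream' from the basic-usage slide and a JDBC-accessible warehouse; the connection URL and table/column names are hypothetical, not from the talk.

import java.sql.DriverManager
import org.apache.spark.TaskContext
import org.apache.spark.streaming.kafka010.HasOffsetRanges

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  rdd.foreachPartition { iter =>
    val osr  = offsetRanges(TaskContext.get.partitionId)
    // One connection per partition; URL is an assumption
    val conn = DriverManager.getConnection("jdbc:postgresql://dbhost/dwh")
    conn.setAutoCommit(false)
    try {
      val insert = conn.prepareStatement("INSERT INTO events (payload) VALUES (?)")
      iter.foreach { record =>
        insert.setString(1, record.value)
        insert.executeUpdate()
      }

      // Store the offsets EXACTLY after writing the data, in the SAME transaction
      val saveOffsets = conn.prepareStatement(
        "UPDATE stream_offsets SET until_offset = ? WHERE topic = ? AND kafka_partition = ?")
      saveOffsets.setLong(1, osr.untilOffset)
      saveOffsets.setString(2, osr.topic)
      saveOffsets.setInt(3, osr.partition)
      saveOffsets.executeUpdate()

      conn.commit() // data and offsets become visible together, or not at all
    } finally {
      conn.close()
    }
  }
}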

Page 61: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Kafka + Spark Semantics

Exactly once

- Probably the hardest to get right
- There is still a small chance of failure if your app fails just between writing the data and committing the offsets… (but it is REALLY small)

61#EUstr5

Page 62: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Apache Kafka + Apache Spark

at Billy Mobile

62

15B records monthly

35TB weekly retention log

6K events/second

x4 growth/year

Page 63: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Our use cases: ETL to Data Warehouse

- Input events from Kafka
- Enrich the events with some external data sources
- Finally, store them in Hive

We do NOT want duplicates
We do NOT want to lose events

63

Page 64: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Our use cases: ETL to Data Warehouse

- Hive is not transactional
- No idempotent writes either
- Writing files to HDFS is “atomic” (whole or nothing)
- A 1:1 relation from each partition-batch to a file in HDFS
- Store the current state of the batch in ZK
- Store the offsets of the last finished batch in ZK

64

Page 65: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Our use cases: Anomalies detector

- Input events from Kafka
- Periodically load a batch-computed model
- Detect when an offer stops converting (or converts too much)
- We do not care about losing some events (on restart)
- We always need to process the “real-time” stream

65

Page 66: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Our use cases: Anomalies detector

- It’s useless to detect anomalies on a lagged stream!
- Actually, it could be very bad
- Always restart the stream at the latest offsets
- Restart with a “fresh” state

66

Page 67: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Our use cases: Store to Entity Cache

- Input events from Kafka
- Almost no processing
- Store them in HBase (which has idempotent writes)
- We do not care about duplicates
- We can NOT lose a single event

67

Page 68: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Our use cases: Store to Entity Cache

- Since HBase has idempotent writes, we can write events multiple times without hassle
- But we do NOT start from the earliest offsets…
- That would be 7 days of redundant writes…!!!
- We store the offsets of the last finished batch
- But obviously we might re-write some events on restart or failure

68

Page 69: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Lessons Learned

- Do NOT use checkpointing
- It is not recoverable across code upgrades
- Do your own checkpointing
- Track offsets yourself
- In general, it is more reliable: HDFS, ZK, RDBMS…
- Memory usually is an issue
- You don’t want to waste it
- Adjust batchDuration
- Adjust maxRatePerPartition

(A tuning sketch follows after this slide.)

69
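A small tuning sketch for the last two points. It is not from the talk; the app name and the concrete numbers are placeholders, not recommendations.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("TunedStreamingJob")                            // assumed app name
  // cap how many records/second are read from EACH Kafka partition per batch
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")

// batchDuration is the micro-batch interval; bigger batches need more memory
val ssc = new StreamingContext(conf, Seconds(10))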

Page 70: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Further Improvements

- Dynamic Allocation: spark.dynamicAllocation.enabled vs spark.streaming.dynamicAllocation.enabled (https://issues.apache.org/jira/browse/SPARK-12133), but no reference in the docs…

- Graceful shutdown

- Structured Streaming

70

Page 71: [Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story

Thank you very much! Questions?

@joanvr / joanviladrosa

[email protected]