How we built an event-time merge of two kafka-streams with spark-streaming

Posted: 14-Feb-2017 · Category: Software

How We Built an Event-Time Merge of Two Kafka Streams with Spark Streaming
Ralf Sigmund, Sebastian Schröder
Otto GmbH & Co. KG

Agenda
- Who are we?
- Our Setup
- The Problem
- Our Approach
- Requirements
- Advanced Requirements
- Lessons learned

Who are we?
Ralf: Technical Designer, Team Tracking; Cycling, Surfing; Twitter: @sistar_hh
Sebastian: Developer, Team Tracking; Taekwondo, Cycling; Twitter: @Sebasti0n

Who are we working for?

- 2nd largest e-commerce company in Germany
- more than 1 million visitors every day
- our blog: https://dev.otto.de
- our jobs: www.otto.de/jobs

Our Setup
[architecture diagram: otto.de behind a Varnish Reverse Proxy; Verticals 0..n, each running Services 0..n, send tracking data to the Tracking-Server]

The Problem
[same architecture diagram]
The Tracking-Server has a limit on the size of messages, but the services need to send large tracking messages.

Our Approach
[architecture diagram as before, extended with a Spark Service between the services and the Tracking-Server]

Requirements

Messages with the same key are merged
[diagram: two streams of messages (t: event time, id: key); messages with the same id, e.g. id: 2 at t: 0 and t: 3, are merged into a single output message]

Messages have to be timed out
[diagram: the same two streams; a message times out 3 seconds after max(business event timestamp)]

Messages from different sources will arrive out of sequence. For a given key there will not always be a message with the same id in the second source, so we cannot wait for the arrival of the complementary message. A business clock calculated from the event timestamps allows timeout logic. Metrics are used to detect the rate of merged vs. unmerged messages.
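The merge-plus-timeout idea can be sketched in plain Python (this is illustrative only, not the actual Spark code; all names and the 3-second constant mirror the slides but the structure is an assumption):

```python
# Pure-Python sketch of merging by id with a business-clock timeout.
TIMEOUT_SECONDS = 3

def merge_step(state, batch):
    """Merge one micro-batch into state and emit timed-out messages.

    state -- dict id -> {"parts": [...], "event_time": max event time seen}
    batch -- list of (event_time, msg_id, payload) tuples
    Returns (remaining_state, emitted): emitted holds entries whose event
    time is older than the business clock by at least the timeout.
    """
    for event_time, msg_id, payload in batch:
        entry = state.setdefault(msg_id, {"parts": [], "event_time": 0})
        entry["parts"].append(payload)
        entry["event_time"] = max(entry["event_time"], event_time)

    # business clock: maximum event timestamp observed so far
    business_clock = max(e["event_time"] for e in state.values())

    emitted = {
        msg_id: entry for msg_id, entry in state.items()
        if business_clock - entry["event_time"] >= TIMEOUT_SECONDS
    }
    remaining = {k: v for k, v in state.items() if k not in emitted}
    return remaining, emitted
```

Note how an entry is only emitted once the business clock, driven by later messages, has moved far enough past its own event time.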

Event streams might get stuck
[diagram: each topic has its own clock (clock A, clock B); the global clock is min(clock A, clock B)]

There might be no messages
[diagram: a heartbeat message (t: 5) advances a topic's clock even when no real messages arrive; the global clock min(clock A, clock B) moves to 4, triggering the 3-second timeout]

There might be no messages at all. Still, we need to advance the global business clock in order to time out messages correctly.
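The clock rule from the slides, sketched as a small helper (illustrative names; heartbeats advance a topic's clock exactly like real messages):

```python
# Each topic's clock is the max event time it has seen; the global business
# clock is the minimum across topics, so one stuck or silent topic cannot
# cause messages to be timed out prematurely.

def advance_clock(topic_clocks, topic, event_time):
    """Advance one topic's clock; return the new global business clock."""
    topic_clocks[topic] = max(topic_clocks.get(topic, 0), event_time)
    return min(topic_clocks.values())
```

A heartbeat is just a message with an event time and no payload: feeding it through `advance_clock` unblocks the global clock when one topic has no traffic.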

Requirements Summary
- Messages with the same key are merged
- Messages are timed out by the combined event time of both source topics
- Timed-out messages are sorted by event time

The combined event time defines our business clock.

Solution

Merge messages with the same Id
- updateStateByKey, which is defined on pairwise RDDs
- It applies a function to all new messages and the existing state for a given key
- Returns a StateDStream with the resulting messages

UpdateStateByKey
[diagram: a DStream[InputMessage] and the history RDD[MergedMessage] feed the update function; the result is a DStream[MergedMessage]]
UpdateFunction: (Seq[InputMessage], Option[MergedMessage]) => Option[MergedMessage]

What you get as a developer is a call to your update function with the sequence of new messages for a given key and that key's function result from the previous micro-batch. You do not get access to information for other keys.
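The per-key contract can be simulated in plain Python (a sketch of updateStateByKey's semantics, not Spark itself; the driver loop and names are assumptions):

```python
# Minimal simulation of updateStateByKey: per key, the update function is
# called with the new messages for that key and the previous state;
# returning None removes the key from state.

def update_state_by_key(prev_state, batch, update_fn):
    """prev_state: dict key -> state; batch: list of (key, message)."""
    new_messages = {}
    for key, msg in batch:
        new_messages.setdefault(key, []).append(msg)
    next_state = {}
    for key in set(prev_state) | set(new_messages):
        result = update_fn(new_messages.get(key, []), prev_state.get(key))
        if result is not None:      # None drops the key from state
            next_state[key] = result
    return next_state

# Update function mirroring the slide's signature
# (Seq[InputMessage], Option[MergedMessage]) => Option[MergedMessage]:
# here it simply concatenates new parts onto the merged message.
def merge_update(new_msgs, merged):
    parts = (merged or []) + new_msgs
    return parts or None
```

The key point the speakers make is visible here: `update_fn` sees exactly one key's messages and state, nothing else.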

Merge messages with the same Id


Flag And Remove Timed Out Msgs
[diagram: merged messages (MM) pass through two filters, Filter (timed out) and Filter !(timed out); newly timed-out messages are flagged (MM → MMTO) and sent downstream, the rest remain in the history RDD[MergedMessage]; the update function needs a global business clock]

Recap how updateStateByKey really works: the state RDD of the next micro-batch is exactly the same as the current result RDD. In order to remove data from the history, you have to remove it from the output at the same time, which you do by returning None as the result of the update. The pattern:
- we first flag a message as timed out during a micro-batch
- only messages flagged as timed out are advanced downstream
- in the following micro-batch, messages that were timed out before are discarded
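The flag-then-remove pattern as a sketch of one key's update function (plain Python, illustrative names; the state layout is an assumption):

```python
# A message is first flagged as timed out, which lets it flow downstream in
# this micro-batch (state == output); on the next call the flagged message
# is dropped from state by returning None.

def timeout_update(new_msgs, state, business_clock, timeout=3):
    """Update function for one key; returns the next state or None."""
    if state is not None and state.get("timed_out"):
        return None                  # emitted last batch; drop from history
    state = state or {"parts": [], "event_time": 0, "timed_out": False}
    for event_time, payload in new_msgs:
        state["parts"].append(payload)
        state["event_time"] = max(state["event_time"], event_time)
    if business_clock - state["event_time"] >= timeout:
        state["timed_out"] = True    # flag: advance downstream this batch
    return state
```

The two-phase dance exists because state and output are the same RDD: you cannot emit a message and delete it in the same micro-batch.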

Messages are timed out by event time
We need access to all messages from the current micro-batch and the history (state) => custom StateDStream. Compute the maximum event time of both topics and supply it to the updateStateByKey function.

The important thing is that we can only update the state within updateStateByKey, and thus need the business clock in there.

Messages are timed out by event time

This looks very similar to before but is quite different: it allows us to access all messages from the current micro-batch and the state.


Advanced Requirements

In addition to the core business requirements, we have to cover several operational situations (especially downtimes and catching up with the sources):
- the number of messages might differ by orders of magnitude between sources
- at what position do we start consuming after deployments or system downtimes?

Effect of different request rates in source topics
[chart: message counts (0 to 10K) over time for topic A and topic B; one topic catches up with 5K messages per micro-batch while the other topic's data is read too early]

Handle different request rates in topics
- Stop reading from one topic if the event time of the other topic is too far behind
- Store the latest event time of each topic and partition on the driver
- Use a custom KafkaDStream with a clamp method which uses the event time
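The clamp idea can be sketched as a tiny throttling rule (plain Python; `max_lag` and the batch-size numbers are made-up parameters, not values from the talk):

```python
# Before each micro-batch, compare the two topics' latest event times and
# clamp reads from the topic that has run too far ahead to zero, so the
# lagging topic can catch up.

def clamp_batch_sizes(event_time_a, event_time_b, batch_size, max_lag=60):
    """Return (messages to read from topic A, messages to read from topic B)."""
    read_a = 0 if event_time_a - event_time_b > max_lag else batch_size
    read_b = 0 if event_time_b - event_time_a > max_lag else batch_size
    return read_a, read_b
```

Because the clamp is driven by event time rather than offsets, it works even when one topic carries orders of magnitude more messages than the other.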


Guarantee at least once semantics
[diagram: the current state holds messages with their source offsets (o) and topics (t: a / t: b); every message written to the sink topic carries the minimum offsets of both sources, e.g. a-offset: 1, b-offset: 5; on restart/recovery, the Spark merge service retrieves the last message from the sink topic]

Attach the minimum offset in state from both source topics to each timed-out message. Read the last message from the destination topic on startup to determine where to start reading in the source topics.
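The recovery logic, sketched in plain Python (the message layout and function names are assumptions for illustration, not the actual implementation):

```python
# Every message written to the sink topic is stamped with the minimum
# unprocessed offset of each source topic; on startup, the last sink
# message tells us where to resume reading.

def attach_offsets(message, state_offsets_a, state_offsets_b):
    """Stamp the outgoing message with the minimum offsets still in state."""
    message["a-offset"] = min(state_offsets_a)
    message["b-offset"] = min(state_offsets_b)
    return message

def resume_positions(last_sink_message):
    """On restart, derive where to resume in each source topic."""
    if last_sink_message is None:
        return 0, 0                  # empty sink topic: start from the beginning
    return last_sink_message["a-offset"], last_sink_message["b-offset"]
```

Resuming from the minimum offset still held in state may re-read messages that were already merged, which is exactly the at-least-once (rather than exactly-once) guarantee the slide names.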


Everything put together

Lessons learned
- Excellent performance and scalability
- Extensible via custom RDDs and an extension to DirectKafka
- No event-time windows in Spark Streaming (see: https://github.com/apache/spark/pull/2633)
- Checkpointing cannot be used when deploying new artifact versions
- The driver/executor model is powerful, but also complex

It is not easy to know beforehand where code is being executed.

THANK YOU.
Twitter: @sistar_hh, @Sebasti0n
Blog: dev.otto.de
Jobs: otto.de/jobs

- What runtime problems did you run into?
- What other solutions did you consider, and what were the reasons against those?
- How do you handle at-least-once semantics in the downstream service?
- Are you running it in production?