Stream data from Apache Kafka for processing with Apache Apex

Low-latency ingestion and analytics with Apache Kafka and Apache Apex
Thomas Weise, Architect DataTorrent, PPMC member Apache Apex
March 28th 2016
  • date post

    08-Jan-2017
  • Category

    Technology



Apache Apex Features
  • In-memory Stream Processing
  • Scale out, Distributed, Parallel, High Throughput
  • Windowing (temporal boundary)
  • Reliability, Fault Tolerance
  • Operability
  • YARN native
  • Compute Locality
  • Dynamic updates


Partitioning & Scaling built-in
  • Operators can be dynamically scaled
      • Throughput, latency or any custom logic
  • Streams can be split in flexible ways
      • Tuple hashcode, tuple field or custom logic
  • Parallel partitioning for parallel pipelines
  • MxN partitioning for generic pipelines
  • Unifier concept for merging results from partitions
      • Helps in handling skew imbalance

Advanced Windowing support
  • Application window configurable per operator
  • Sliding window and tumbling window support
  • Checkpoint window control for fault recovery
  • Windowing does not introduce artificial latency

Stateful fault tolerance out of the box
  • Operators recover automatically from a precise point before failure
  • At least once
  • At most once
  • Exactly once at window boundaries
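The splitting of a stream by tuple hashcode and the merging of partition results by a unifier can be illustrated with a minimal plain-Java sketch. It does not use the Apex API: the class `PartitionSketch` and its methods are hypothetical names chosen for illustration, and a word count stands in for the operator logic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java illustration only; none of these names are Apex APIs. */
class PartitionSketch {

  /** Route a tuple to one of numPartitions partitions by its hash code. */
  static int partitionFor(Object tuple, int numPartitions) {
    // floorMod keeps the index non-negative even for negative hash codes
    return Math.floorMod(tuple.hashCode(), numPartitions);
  }

  /** Each partition counts only the words routed to it. */
  static List<Map<String, Integer>> countInPartitions(List<String> words, int numPartitions) {
    List<Map<String, Integer>> partials = new ArrayList<>();
    for (int i = 0; i < numPartitions; i++) {
      partials.add(new HashMap<>());
    }
    for (String word : words) {
      partials.get(partitionFor(word, numPartitions)).merge(word, 1, Integer::sum);
    }
    return partials;
  }

  /** The "unifier" step: merge the partial results into one result. */
  static Map<String, Integer> unify(List<Map<String, Integer>> partials) {
    Map<String, Integer> merged = new HashMap<>();
    for (Map<String, Integer> partial : partials) {
      partial.forEach((word, count) -> merged.merge(word, count, Integer::sum));
    }
    return merged;
  }

  public static void main(String[] args) {
    List<String> words = Arrays.asList("a", "b", "a", "c", "b", "a");
    System.out.println(unify(countInPartitions(words, 3)));
  }
}
```

Because the same tuple always hashes to the same partition, each partition sees a disjoint set of keys here; the merge step would still be correct if a custom split sent the same key to several partitions.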


Apex Platform Overview

Apache Apex Malhar Library

Apache Kafka

A high-throughput distributed messaging system.
Fast, Scalable, Durable, Distributed

Kafka is a natural fit to deliver events into Apex for low-latency processing.


Kafka Integration - Consumer

  • Low-latency, high throughput ingest
  • Scales with Kafka topics
      • Auto-partitioning
      • Flexible and customizable partition mapping
  • Fault-tolerance (in 0.8 based on SimpleConsumer)
      • Metadata monitoring/failover to new broker
      • Offset checkpointing
      • Idempotency
      • External offset storage
  • Support for multiple clusters
  • Built for better resource utilization
  • Bandwidth control
      • Bytes per second
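The slides do not spell out how offset checkpointing and idempotency fit together. One common approach, assumed here, is to record the offset range consumed in each streaming window, so that a replayed window re-emits exactly the same tuples. In this hypothetical sketch a `List` stands in for one Kafka partition and a `Map` stands in for external offset storage; no Kafka or Apex API is used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; the log is a List standing in for one Kafka partition. */
class IdempotentConsumerSketch {
  private final List<String> log;
  // Stand-in for external offset storage: windowId -> {startOffset, endOffset}
  private final Map<Long, long[]> windowOffsets = new HashMap<>();
  private long nextOffset = 0;

  IdempotentConsumerSketch(List<String> log) {
    this.log = log;
  }

  /** Emit the tuples of one window; a replayed window re-emits its recorded range. */
  List<String> emitWindow(long windowId, int maxTuples) {
    long[] range = windowOffsets.get(windowId);
    if (range == null) {
      // First pass over this window: read up to maxTuples messages and record the range.
      long end = Math.min(nextOffset + maxTuples, log.size());
      range = new long[] {nextOffset, end};
      windowOffsets.put(windowId, range);
    }
    nextOffset = range[1];
    return new ArrayList<>(log.subList((int) range[0], (int) range[1]));
  }

  /** Simulate recovery: the read position is reset, the recorded ranges survive. */
  void restoreFromCheckpoint(long checkpointedOffset) {
    nextOffset = checkpointedOffset;
  }

  public static void main(String[] args) {
    IdempotentConsumerSketch consumer =
        new IdempotentConsumerSketch(Arrays.asList("a", "b", "c", "d", "e", "f"));
    System.out.println(consumer.emitWindow(1, 2));
    System.out.println(consumer.emitWindow(2, 3));
    consumer.restoreFromCheckpoint(0); // failure, restart from the beginning
    System.out.println(consumer.emitWindow(1, 5)); // same tuples as the first pass
  }
}
```

The replayed window ignores `maxTuples` on purpose: the content of a window is fixed by what was emitted the first time, not by how fast messages arrive during recovery.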


Kafka Integration - Producer

  • Output operator is a Kafka producer
  • Exactly once strategy
      • On failure, data already sent to the message queue should not be re-sent
      • Sends a monotonically increasing key along with the data
      • On recovery, the operator asks the message queue for the last sent message
      • Gets the recovery key from that message
      • Ignores all replayed data with a key that is less than or equal to the recovered key
      • If the key is not monotonically increasing, data can be sorted on the key at the end of the window and then sent to the message queue
  • Implemented in operator AbstractExactlyOnceKafkaOutputOperator in the apache/incubator-apex-malhar GitHub repository
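The recovery steps listed above can be sketched in plain Java. This is not the code of AbstractExactlyOnceKafkaOutputOperator: the class `ExactlyOnceOutputSketch`, its `Message` type, and the in-memory list standing in for the message queue are assumptions made for illustration, and keys are assumed to be non-negative.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the key-based strategy; the queue is an in-memory list, not Kafka. */
class ExactlyOnceOutputSketch {

  /** A message that carries a monotonically increasing key along with the data. */
  static final class Message {
    final long key;
    final String data;

    Message(long key, String data) {
      this.key = key;
      this.data = data;
    }
  }

  private final List<Message> queue;  // stand-in for the message queue
  private long recoveredKey = -1;     // highest key known to be in the queue already

  ExactlyOnceOutputSketch(List<Message> queue) {
    this.queue = queue;
  }

  /** On recovery, ask the queue for the last sent message and take its key. */
  void recover() {
    recoveredKey = queue.isEmpty() ? -1 : queue.get(queue.size() - 1).key;
  }

  /** Send a tuple unless it is a replay of data sent before the failure. */
  boolean send(long key, String data) {
    if (key <= recoveredKey) {
      return false; // replayed data, already in the queue
    }
    queue.add(new Message(key, data));
    return true;
  }

  public static void main(String[] args) {
    List<Message> queue = new ArrayList<>();
    ExactlyOnceOutputSketch first = new ExactlyOnceOutputSketch(queue);
    first.send(1, "a");
    first.send(2, "b");
    first.send(3, "c");

    // Failure: a new operator instance recovers and receives replayed tuples 2 and 3.
    ExactlyOnceOutputSketch restarted = new ExactlyOnceOutputSketch(queue);
    restarted.recover();
    restarted.send(2, "b");
    restarted.send(3, "c");
    restarted.send(4, "d");
    System.out.println("messages in queue: " + queue.size());
  }
}
```

The filter only works because keys increase monotonically; that is why the slide suggests sorting on the key at the end of the window when they do not.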


Apex Application Specification

Logical and Physical Plan

Partitioning
[Diagrams: NxM partitions with unifier; logical DAG; physical diagram with operator 1 with 3 partitions; physical DAG with (1a, 1b, 1c) and (2a, 2b): no bottleneck; physical DAG with (1a, 1b, 1c) and (2a, 2b): bottleneck on intermediate unifier]

Advanced Partitioning
[Diagrams: parallel partition (logical DAG, physical DAG, physical DAG with parallel partition); cascading unifiers (logical plan; execution plan for N = 4, M = 1; execution plan for N = 4, M = 1, K = 2 with cascading unifiers)]

Dynamic Scaling
  • Partitioning change while application is running
      • Change number of partitions at runtime based on stats
      • Determine initial number of partitions dynamically
      • Kafka operators scale according to number of Kafka partitions
      • Supports re-distribution of state when the number of partitions changes
      • API for custom scaling or partitioning

Unifiers not shown

Fault Tolerance
  • Operator state is checkpointed to persistent store
      • Automatically performed by engine, no additional coding needed
      • Asynchronous and distributed
  • In case of failure operators are restarted from checkpoint state
  • Automatic detection and recovery of failed containers
      • Heartbeat mechanism
      • YARN process status notification
  • Buffering to enable replay of data from recovered point
      • Fast, incremental recovery, spike handling
  • Application master state checkpointed
      • Snapshot of physical (and logical) plan
      • Execution layer change log

Streaming Windows
  • Application window
  • Sliding window and tumbling window
  • Checkpoint window
  • No artificial latency

Checkpointing Operator State
  • Save state of operator so that it can be recovered on failure
  • Pluggable storage handler
  • Default implementation
      • Serialization with Kryo
      • All non-transient fields serialized
      • Serialized state written to HDFS
      • Writes asynchronous, non-blocking
  • Possible to implement custom handlers for alternative approach to extract state or different storage backend (such as IMDG)
  • For operators that rely on previous state for computation
      • Operators can be marked @Stateless to skip checkpointing
  • Checkpoint frequency tunable (by default 30s)
      • Based on streaming windows for consistent state
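The rule that all non-transient fields are serialized can be seen in a small sketch. Note the substitution: the default implementation named above uses Kryo and writes to HDFS, while this example uses built-in Java serialization and an in-memory byte array so that it runs without dependencies. `CheckpointSketch` and `CountingOperator` are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

/** Illustration with java.io serialization (not Kryo) and a byte array (not HDFS). */
class CheckpointSketch {

  /** Hypothetical operator: count is state, lastSeen is scratch data. */
  static class CountingOperator implements Serializable {
    private static final long serialVersionUID = 1L;
    long count;                // non-transient: part of the checkpoint
    transient String lastSeen; // transient: skipped by the checkpoint

    void process(String tuple) {
      count++;
      lastSeen = tuple;
    }
  }

  /** Save the operator state as bytes. */
  static byte[] checkpoint(CountingOperator operator) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(operator);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  /** Rebuild an operator from a saved checkpoint. */
  static CountingOperator restore(byte[] saved) {
    try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(saved))) {
      return (CountingOperator) in.readObject();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } catch (ClassNotFoundException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    CountingOperator operator = new CountingOperator();
    operator.process("a");
    operator.process("b");
    byte[] saved = checkpoint(operator);
    operator.process("c"); // processed after the checkpoint, lost on failure

    CountingOperator restored = restore(saved);
    System.out.println("count=" + restored.count + ", lastSeen=" + restored.lastSeen);
  }
}
```

The tuple processed after the checkpoint is not part of the restored state, which is why recovery also needs replay of data from the recovered point.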

Processing Guarantees
  • At-least-once
      • On recovery data will be replayed from a previous checkpoint
      • No messages lost
      • Default, suitable for most applications
      • Can be used to ensure data is written once to store
      • Transactions with meta information, Rewinding output, Feedback from external entity, Idempotent operations
  • At-most-once
      • On recovery the latest data is made available to operator
      • Useful in use cases where some data loss is acceptable and latest data is sufficient
  • Exactly-once
      • At-least-once + idempotency + transactional mechanisms (operator logic) to achieve end-to-end exactly once behavior
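How at-least-once replay combined with an idempotent write gives an exactly-once result in the store can be shown with a short sketch. The class `IdempotentSinkSketch` is hypothetical, a `Map` stands in for the external store, and the example assumes that a replayed window carries the same data as the first pass.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sink: one total per window, written with an idempotent put. */
class IdempotentSinkSketch {
  private final Map<Long, Long> store = new HashMap<>(); // stand-in for an external store

  /** Writing the same window again overwrites the entry instead of adding to it. */
  void writeWindowTotal(long windowId, long total) {
    store.put(windowId, total);
  }

  /** Sum over all windows, as a downstream query would see it. */
  long grandTotal() {
    long sum = 0;
    for (long total : store.values()) {
      sum += total;
    }
    return sum;
  }

  public static void main(String[] args) {
    IdempotentSinkSketch sink = new IdempotentSinkSketch();
    sink.writeWindowTotal(1, 10);
    sink.writeWindowTotal(2, 5);
    // Failure after window 2; at-least-once recovery replays window 2.
    sink.writeWindowTotal(2, 5);
    sink.writeWindowTotal(3, 7);
    System.out.println("grand total: " + sink.grandTotal());
  }
}
```

A sink that appended each write instead would count window 2 twice, which is the duplicate that at-least-once alone allows.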

Idempotency with Kafka Consumer

Use Case Ad Tech

Customer:
  • Leading digital automation software company for publishers
  • Helps publishers monetize their digital assets
  • Enables publishers to make smarter inventory decisions and improve revenue

Features:
  • Reporting of critical metrics from auctions and client logs
  • Revenue, impression, and click information
  • Aggregate counters and reporting on top N metrics
  • Low latency querying using pub-sub model


Use Case Ad Tech
[Architecture diagram. Labels: User Browser, AdServer, REST proxy, CDN (Caching of logs), Kafka Cluster, Auction Logs, Client logs, Kafka Input (Auction logs), Kafka Input (Client logs), Kafka Messages, ETL, Decompress & Flatten, Filter, Filtered Events, Dimensions Aggregator, Aggregates, Dimensions Store, Middleware, Query from MW, Query, Query Results]


Use Case Ad Tech
  • 15+ billion impressions per day
  • Average data inflow of 200K events/sec
  • 64 Kafka Input operators reading from 6 geographically distributed DCs
  • 32 instances of in-memory distributed store
  • 64 aggregators
  • ~150 container processes, 30+ nodes
  • 1.2 TB memory footprint @ peak load


Resources
  • Exactly-once processing: https://www.datatorrent.com/blog/end-to-end-exactly-once-with-apache-apex/
  • Examples with Kafka and Files: https://github.com/tweise/apex-samples/tree/master/exactly-once
  • Learn more: http://apex.incubator.apache.org/docs.html
  • Subscribe: http://apex.incubator.apache.org/community.html
  • Download: http://apex.incubator.apache.org/downloads.html
  • Apex website: http://apex.incubator.apache.org/
  • Follow @ApacheApex: https://twitter.com/apacheapex
  • Meetups: http://www.meetup.com/topics/apache-apex

Q&A