Apache Kafka

A high-throughput distributed messaging system

Johan Lundahl

Agenda
• Kafka overview
– Main concepts and comparisons to other messaging systems
• Features, strengths and tradeoffs
• Message format and broker concepts
– Partitioning, Keyed messages, Replication
• Producer / Consumer APIs
• Operation considerations
• Kafka ecosystem
If time permits:
• Kafka as a real-time processing backbone
• Brief intro to Storm
• Kafka-Storm wordcount demo

What is Apache Kafka?
• Distributed, high-throughput, pub-sub messaging system
– Fast, Scalable, Durable
• Main use cases:
– log aggregation, real-time processing, monitoring, queueing
• Originally developed by LinkedIn
• Implemented in Scala/Java
• Top level Apache project since 2012: http://kafka.apache.org/

Comparison to other messaging systems
– Traditional: JMS, xxxMQ/AMQP
– New gen: Kestrel, Scribe, Flume, Kafka

Message queues: low throughput, low latency
Log aggregators: high throughput, high latency

[Figure: positions ActiveMQ, RabbitMQ, Kestrel, Scribe, Flume and Hedwig relative to these two categories.]

Kafka concepts

[Figure: Producers, a Broker and Consumers exchanging Topic1, Topic2 and Topic3. Labelled systems: Frontend, Service, Batch jobs, Monitoring, Stream processing, Batch processing, Data warehouse.]

Distributed model

[Figure: three Producers, three Brokers, a Topic1 consumer group and a Topic2 consumer group, plus Zookeeper. Labels: Partitioned data publication, Ordered subscription, Intra cluster replication, Producer persistence (KAFKA-156).]

Performance factors
• Broker doesn’t track consumer state
• Everything is distributed
• Zero-copy (sendfile) reads/writes
• Usage of page cache backed by sequential disk allocation
• Like a distributed commit log
• Low overhead protocol
• Message batching (Producer & Consumer)
• Compression (End to end)
• Configurable ack levels

From: http://queue.acm.org/detail.cfm?id=1563874
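A minimal sketch of why batching and end-to-end compression help, assuming JSON plus gzip as the batch encoding (my choice for illustration, not Kafka's wire format). The producer compresses a whole batch once, the compressed bytes are passed along unchanged, and only the consumer unpacks them.

```python
import gzip
import json

def make_batch(messages):
    """Producer side: serialize many messages and compress them as one unit."""
    raw = json.dumps(messages).encode("utf-8")
    return gzip.compress(raw)

def read_batch(batch):
    """Consumer side: decompress and deserialize the batch."""
    return json.loads(gzip.decompress(batch).decode("utf-8"))

# 500 similar log lines, as in a log aggregation use case (made-up data).
messages = [f"GET /index.html status=200 host=web{i % 3}" for i in range(500)]

batch = make_batch(messages)
raw_size = sum(len(m.encode("utf-8")) for m in messages)
one_by_one = sum(len(gzip.compress(m.encode("utf-8"))) for m in messages)

print("uncompressed bytes:      ", raw_size)
print("compressed one by one:   ", one_by_one)
print("compressed as one batch: ", len(batch))
```

Compressing each short message separately pays the gzip header cost every time; compressing the batch lets repeated substrings across messages be shared.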

Kafka features and strengths
• Simple model, focused on high throughput and durability
• O(1) time persistence on disk
• Horizontally scalable by design (broker and consumers)
• Push - pull => consumer burst tolerance
• Replay messages
• Multiple independent subscribers per topic
• Configurable batching, compression, serialization
• Online upgrades

Tradeoffs
• Not optimized for millisecond latencies
• Have not beaten CAP
• Simple messaging system, no processing
• Zookeeper becomes a bottleneck when using too many topics/partitions (>>10000)
• Not designed for very large payloads (full HD movie etc.)
• Helps to know your data in advance

Message/Log Format

Message: Length | Version | Checksum | Payload
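The four fields can be sketched as a byte layout. The slide names only the fields; the widths used here (4-byte length, 1-byte version, 4-byte CRC32 checksum, big-endian) and the version value are assumptions for illustration.

```python
import struct
import zlib

VERSION = 1  # hypothetical version number

def encode_message(payload: bytes, version: int = VERSION) -> bytes:
    """Frame a payload as: length | version | checksum | payload."""
    checksum = zlib.crc32(payload) & 0xFFFFFFFF
    body = struct.pack(">BI", version, checksum) + payload
    # The length prefix counts everything that follows it.
    return struct.pack(">I", len(body)) + body

def decode_message(data: bytes) -> bytes:
    """Parse one framed message and verify its checksum."""
    (length,) = struct.unpack_from(">I", data, 0)
    version, checksum = struct.unpack_from(">BI", data, 4)
    payload = data[9:4 + length]  # 4 bytes length + 5 bytes version/checksum
    if zlib.crc32(payload) & 0xFFFFFFFF != checksum:
        raise ValueError("checksum mismatch")
    return payload

frame = encode_message(b"hello")
print(len(frame), decode_message(frame))
```

The length prefix lets a reader skip from message to message in a log file without parsing payloads, and the checksum detects corruption on disk or in transit.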

Log based queue (Simplified model)

[Figure: Producer1 and Producer2 append to the topic logs on a Broker: Topic1 (Message1 to Message10) and Topic2 (Message1 to Message7). Consumer1, Consumer2 and Consumer3 read from the logs; ConsumerGroup1 is marked.]

• Batching
• Compression
• Serialization

Producer API used directly by application or through one of the contributed implementations, e.g. log4j/logback appender
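The simplified log model can be sketched in a few lines. `TopicLog` and `Consumer` are hypothetical names, not Kafka classes; the point is that the log only appends, and each consumer owns its own read offset, so consumers are independent and can replay.

```python
class TopicLog:
    """Append-only list of messages; a message's index is its offset."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        self._messages.append(message)
        return len(self._messages) - 1  # offset of the new message

    def read(self, offset, max_messages=100):
        return self._messages[offset:offset + max_messages]


class Consumer:
    """Keeps its own position; the log knows nothing about consumers."""

    def __init__(self, log, offset=0):
        self.log = log
        self.offset = offset

    def poll(self, max_messages=100):
        batch = self.log.read(self.offset, max_messages)
        self.offset += len(batch)
        return batch


log = TopicLog()
for i in range(1, 8):
    log.append(f"Message{i}")

fast = Consumer(log)
slow = Consumer(log)
first = fast.poll()                  # reads everything available
partial = slow.poll(max_messages=3)  # reads at its own pace
slow.offset = 0                      # replay: just move the offset back
replayed = slow.poll()
print(first, partial, replayed, sep="\n")
```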

Partitioning

[Figure: a Producer writes to the Partitions of Topic1 and Topic2 on a Broker; consumers in Group1, Group2 and Group3 read the partitions. One consumer is labelled "No partition for this guy".]
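A sketch of why one consumer ends up with no partition: within a group each partition is read by exactly one consumer, so a group with more consumers than partitions leaves some idle. The round-robin strategy and the consumer names below are simplifications of mine, not necessarily the assignment strategy Kafka's consumer uses.

```python
def assign_partitions(partitions, consumers):
    """Give each partition to exactly one consumer of the group."""
    assignment = {consumer: [] for consumer in consumers}
    ordered = sorted(consumers)
    if not ordered:
        return assignment
    for i, partition in enumerate(sorted(partitions)):
        assignment[ordered[i % len(ordered)]].append(partition)
    return assignment


# Three partitions, four consumers in the same group (hypothetical names).
group = assign_partitions([0, 1, 2], ["c1", "c2", "c3", "c4"])
print(group)

# Fewer consumers than partitions: some consumers read several partitions.
small_group = assign_partitions([0, 1, 2], ["c1", "c2"])
print(small_group)
```

The number of partitions is therefore the upper bound on consumer parallelism within one group.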

Keyed messages

[Figure: a Producer distributes the messages of Topic1 over three brokers.]
BrokerId=1: Message1, Message5, Message9, Message13, Message17
BrokerId=2: Message2, Message4, Message6, Message8, Message10, Message12, Message14, Message16, Message18
BrokerId=3: Message3, Message7, Message11, Message15

hash(key) % #partitions
#partitions=3
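The formula above as a runnable sketch. CRC32 stands in for the hash function here (an assumption; the partitioner actually configured in a producer may hash differently), and the session ids are made up.

```python
import zlib
from collections import defaultdict

def partition_for(key: str, num_partitions: int) -> int:
    """hash(key) % #partitions, with CRC32 as a stable stand-in hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


NUM_PARTITIONS = 3
events = [("session-a", "login"), ("session-b", "login"),
          ("session-a", "click"), ("session-c", "login"),
          ("session-a", "logout")]

partitions = defaultdict(list)
for key, value in events:
    partitions[partition_for(key, NUM_PARTITIONS)].append((key, value))

for partition_id in sorted(partitions):
    print(partition_id, partitions[partition_id])
```

All messages with the same key land in the same partition, so their relative order is preserved for whoever consumes that partition. Changing the number of partitions changes the mapping.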

Intra cluster replication

[Figure: Topic1 with Replication factor = 3. The leader on Broker1 and the follower on Broker2 hold Message1 to Message10; the follower on Broker3 holds Message1 to Message9, with Message10 still arriving. Labels: Producer, ack, InSyncReplicas.]

3 commit modes:

Commit mode        Latency        Durability
Fire & Forget      “none”         Weak
Leader ack         1 roundtrip    Medium
Full replication   2 roundtrips   Strong

Follower fails:
• Follower dropped from ISR
• When follower comes online again: fetch data from leader, then ISR gets updated

Leader fails:
• Detected via Zookeeper from ISR
• New leader gets elected
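The three commit modes can be sketched as a toy simulation. `Partition`, `send` and the mode names are hypothetical; in Kafka 0.8 the corresponding producer setting is, to my knowledge, `request.required.acks` with the values 0, 1 and -1.

```python
class Partition:
    """One leader log plus follower logs that copy from it."""

    def __init__(self, follower_count=2):
        self.leader_log = []
        self.follower_logs = [[] for _ in range(follower_count)]

    def replicate(self):
        # Each follower fetches whatever it is missing from the leader.
        for log in self.follower_logs:
            log.extend(self.leader_log[len(log):])


# Roundtrips the producer waits for before considering the send done.
ROUNDTRIPS = {"fire_and_forget": 0, "leader_ack": 1, "full_replication": 2}

def send(partition, message, mode):
    if mode not in ROUNDTRIPS:
        raise ValueError(f"unknown commit mode: {mode}")
    partition.leader_log.append(message)
    if mode == "full_replication":
        partition.replicate()  # ack only after the followers have the message
    return ROUNDTRIPS[mode]


topic1 = Partition(follower_count=2)
waited_leader = send(topic1, "Message1", "leader_ack")
followers_after_leader_ack = [list(log) for log in topic1.follower_logs]
waited_full = send(topic1, "Message2", "full_replication")
print(waited_leader, followers_after_leader_ack)
print(waited_full, topic1.follower_logs)
```

After a leader ack the message exists only on the leader, so losing the leader at that moment loses the message; that is the "Medium" durability in the table.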

Producer API

…or for log aggregation:

Configuration parameters:
• ProducerType (sync/async)
• CompressionCodec (none/snappy/gzip)
• BatchSize
• EnqueueSize/Time
• Encoder/Serializer
• Partitioner
• #Retries
• MaxMessageSize
• …
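A sketch of how these parameters might look as producer properties. The property names are the ones I recall from the Kafka 0.8 producer configuration and should be checked against the documentation of the version in use; the broker hosts and all values are made up. MaxMessageSize is left out because I am not sure of its property name.

```python
# Slide parameter -> assumed Kafka 0.8 producer property (values illustrative).
producer_settings = {
    "metadata.broker.list": "broker1:9092,broker2:9092",      # hypothetical hosts
    "producer.type": "async",                                 # ProducerType
    "compression.codec": "snappy",                            # CompressionCodec
    "batch.num.messages": "200",                              # BatchSize
    "queue.buffering.max.messages": "10000",                  # EnqueueSize
    "queue.buffering.max.ms": "5000",                         # EnqueueTime
    "serializer.class": "kafka.serializer.StringEncoder",     # Encoder/Serializer
    "partitioner.class": "kafka.producer.DefaultPartitioner", # Partitioner
    "message.send.max.retries": "3",                          # #Retries
}

def to_properties(settings):
    """Render the settings in Java .properties syntax, one key per line."""
    return "\n".join(f"{key}={value}" for key, value in sorted(settings.items()))

properties_text = to_properties(producer_settings)
print(properties_text)
```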

Consumer API(s)
• High-level (consumer group, auto-commit)
• Low-level (simple consumer, manual commit)
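What a manual commit buys can be sketched without any Kafka API. `consume` is a hypothetical function: it commits an offset only after a message has been processed, so a crash between processing and commit causes a re-delivery after restart (at-least-once), never a loss.

```python
def consume(log, committed, handler, crash_before_commit_at=None):
    """Process messages starting at `committed`; return the new committed offset."""
    position = committed
    for message in log[committed:]:
        handler(message)
        if position == crash_before_commit_at:
            return committed   # simulated crash: work done, offset not saved
        position += 1
        committed = position   # manual commit after successful processing
    return committed


log = ["m0", "m1", "m2"]
seen = []

# First run crashes after handling m1 but before committing it.
offset = consume(log, 0, seen.append, crash_before_commit_at=1)
offset_after_crash = offset

# Restart from the last committed offset: m1 is processed a second time.
offset = consume(log, offset, seen.append)
print(offset_after_crash, offset, seen)
```

Handlers therefore need to tolerate duplicates. An auto-commit consumer is simpler to use but gives less control over exactly when an offset counts as done.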

Broker Protips
• Reasonable number of partitions – will affect performance
• Reasonable number of topics – will affect performance
• Performance decrease with larger Zookeeper ensembles
• Disk flush rate settings
• message.max.bytes – max accept size, should be smaller than the heap
• socket.request.max.bytes – max fetch size, should be smaller than the heap
• log.retention.bytes – don’t want to run out of disk space…
• Keep Zookeeper logs under control for same reason as above
• Kafka brokers have been tested on Linux and Solaris

Operating Kafka
• Zookeeper usage
– Producer load balancing
– Broker ISR
– Consumer tracking
• Monitoring
– JMX
– Audit trail/console in the making

Distribution Tools:
• Controlled shutdown tool
• Preferred replica leader election tool
• List topic tool
• Create topic tool
• Add partition tool
• Reassign partitions tool
• MirrorMaker (multi-datacenter replication)

Ecosystem

Producers:
• Java (in standard dist)
• Scala (in standard dist)
• Log4j (in standard dist)
• Logback: logback-kafka
• Udp-kafka-bridge
• Python: kafka-python
• Python: pykafka
• Python: samsa
• Python: pykafkap
• Python: brod
• Go: Sarama
• Go: kafka.go
• C: librdkafka
• C/C++: libkafka
• Clojure: clj-kafka
• Clojure: kafka-clj
• Ruby: Poseidon
• Ruby: kafka-rb
• Ruby: em-kafka
• PHP: kafka-php(1)
• PHP: kafka-php(2)
• PHP: log4php
• Node.js: Prozess
• Node.js: node-kafka
• Node.js: franz-kafka
• Erlang: erlkafka

Consumers:
• Java (in standard dist)
• Scala (in standard dist)
• Python: kafka-python
• Python: samsa
• Python: brod
• Go: Sarama
• Go: nuance
• Go: kafka.go
• C/C++: libkafka
• Clojure: clj-kafka
• Clojure: kafka-clj
• Ruby: Poseidon
• Ruby: kafka-rb
• Ruby: Kafkaesque
• Jruby::Kafka
• PHP: kafka-php(1)
• PHP: kafka-php(2)
• Node.js: Prozess
• Node.js: node-kafka
• Node.js: franz-kafka
• Erlang: erlkafka
• Erlang: kafka-erlang

Common integration points:

Stream Processing
• Storm - A stream-processing framework.
• Samza - A YARN-based stream processing framework.

Hadoop Integration
• Camus - LinkedIn's Kafka=>HDFS pipeline. This one is used for all data at LinkedIn, and works great.
• Kafka Hadoop Loader - A different take on Hadoop loading functionality from what is included in the main distribution.

AWS Integration
• Automated AWS deployment
• Kafka->S3 Mirroring

Logging
• klogd - A python syslog publisher
• klogd2 - A java syslog publisher
• Tail2Kafka - A simple log tailing utility
• Fluentd plugin - Integration with Fluentd
• Flume Kafka Plugin - Integration with Flume
• Remote log viewer
• LogStash integration - Integration with LogStash and Fluentd
• Official logstash integration

Metrics
• Mozilla Metrics Service - A Kafka and Protocol Buffers based metrics and logging system
• Ganglia Integration

Packing and Deployment
• RPM packaging
• Debian packaging: https://github.com/tomdz/kafka-deb-packaging
• Puppet integration
• Dropwizard packaging

Misc.
• Kafka Mirror - An alternative to the built-in mirroring tool
• Ruby Demo App
• Apache Camel Integration
• Infobright integration

What’s in the future?
• Topic and transient consumer garbage collection (KAFKA-560/KAFKA-559)
• Producer side persistence (KAFKA-156/KAFKA-789)
• Exact mirroring (KAFKA-658)
• Quotas (KAFKA-656)
• YARN integration (KAFKA-949)
• RESTful proxy (KAFKA-639)
• New build system? (KAFKA-855)
• More tooling (Console, Audit trail) (KAFKA-266/KAFKA-260)
• Client API rewrite (Proposal)
• Application level security (Proposal)

Stream processing
Kafka as a processing pipeline backbone

[Figure: a Producer, Kafka topic1 and Kafka topic2, the processing steps Process1 and Process2, and the downstream systems System1 and System2.]

What is Storm?
• Distributed real-time computation system with design goals:
– Guaranteed processing
– No orphaned tasks
– Horizontally scalable
– Fault tolerant
– Fast
• Use cases: Stream processing, DRPC, Continuous computation
• 4 basic concepts: streams, spouts, bolts, topologies
• In Apache incubator
• Implemented in Clojure

Streams: an [infinite] sequence (of tuples)

(t4,s2,e2) (t3,s3) (t2,s1,e2) (t1,s1,e1)

(timestamp,sessionid,exception stacktrace)

Spouts: a source of streams

(t4,s2,e2) (t3,s3) (t2,s1,e2) (t1,s1,e1)

Connects to queues, logs, API calls, event data.

Some features, like transactional topologies (which give exactly-once messaging semantics), are only possible using the Kafka-TransactionalSpout-consumer

Bolts

(t2,s1,h2)(t1,s1,h1)
(t3,s3)(t4,s2,e2)(t5,s4)

• Filters
• Transformations
• Apply functions
• Aggregations
• Access DB, APIs etc.
• Emitting new streams
• Trident = a high level abstraction on top of Storm

Topologies

[Figure: a network of spouts and bolts; tuple streams such as (t1,s1,h1), (t2,s1,h2), (t3,s3), (t4,s2,e2), (t5,s4), (t6,s6), (t7,s7) and (t8,s8) flow along its edges.]
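A word count topology can be sketched with plain Python generators. This is a single-process illustration of how a spout and bolts chain together, not Storm's API; the function names and the input sentences are made up, and in a Kafka-Storm setup the spout would read from a Kafka topic instead of a list.

```python
from collections import Counter

def sentence_spout(lines):
    """Spout: the source of the stream, emitting one tuple per sentence."""
    for line in lines:
        yield (line,)

def split_bolt(stream):
    """Bolt: transform each sentence tuple into one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.lower().split():
            yield (word,)

def count_bolt(stream):
    """Bolt: aggregate, emitting the running count for each word seen."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])

def run_topology(lines):
    """Topology: wire spout -> split -> count and keep the latest counts."""
    final = {}
    for word, count in count_bolt(split_bolt(sentence_spout(lines))):
        final[word] = count
    return final

result = run_topology(["kafka feeds storm", "storm counts words", "kafka kafka"])
print(result)
```

In a real cluster each bolt runs as many parallel tasks, and tuples for the same word must be routed to the same counting task, much like keyed messages are routed to one partition.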

Storm cluster

[Figure: a Topology is deployed to Nimbus; five Supervisors and Zookeeper make up the rest of the cluster. Mesos/YARN is also shown.]

Compare with Hadoop:
• Nimbus (JobTracker)
• Supervisors (TaskTrackers)

Apache Kafka:
• Papers and presentations
• Main project page
• Small Mediawiki case study

Storm:
• Introductory article
• Realtime discussing blog post
• Kafka+Storm for realtime
• BigData Trifecta blog post: Kafka+Storm+Cassandra
• IBM developer article
• Kafka+Storm@Twitter
• BigData Quadfecta blog post
