Post on 25-Feb-2016
description
Apache KafkaA high-throughput distributed messaging system
Johan Lundahl
Agenda• Kafka overview
– Main concepts and comparisons to other messaging systems• Features, strengths and tradeoffs• Message format and broker concepts
– Partitioning, Keyed messages, Replication• Producer / Consumer APIs• Operation considerations• Kafka ecosystemIf time permits:• Kafka as a real-time processing backbone • Brief intro to Storm• Kafka-Storm wordcount demo
2
What is Apache Kafka?• Distributed, high-throughput, pub-sub messaging system
– Fast, Scalable, Durable
• Main use cases: – log aggregation, real-time processing, monitoring, queueing
• Originally developed by LinkedIn• Implemented in Scala/Java• Top level Apache project since 2012: http://kafka.apache.org/
4
Comparison to other messaging systems– Traditional: JMS, xxxMQ/AMQP– New gen: Kestrel, Scribe, Flume, Kafka
Kafka
Message queuesLow throughput, low latency
JMS
ActiveMQ
Qpid
RabbitMQ
Log aggregatorsHigh throughput, high latency
Kestrel
Scribe
Flume Hedwig
Batch jobs
5
Frontend ServiceFrontend
Monitoring Stream processing
Batch processing
Data warehouse
Kafka
Producers
Broker
Consumers
Topic1Topic1
Topic2Topic3
Topic1
Topic3Topic2 Topic3
Topic3
Topic2
Topic1
Push
Pull
Kafka concepts
Distributed model
6
Producer Producer Producer
Broker Broker Broker
Topic1 consumer group Topic2 consumer group
Partitioned Data Publication
Ordered subscription
Intra cluster replication
Producer persistence
KAFKA-156
Zookeeper
Agenda• Kafka overview
– Main concepts and comparisons to other messaging systems• Features, strengths and tradeoffs• Message format and broker concepts
– Partitioning, Keyed messages, Replication• Producer / Consumer APIs• Operation considerations• Kafka ecosystemIf time permits:• Kafka as a real-time processing backbone • Brief intro to Storm• Kafka-Storm wordcount demo
7
Performance factors• Broker doesn’t track consumer state• Everything is distributed• Zero-copy (sendfile) reads/writes• Usage of page cache backed by sequential
disk allocation• Like a distributed commit log
• Low overhead protocol• Message batching (Producer & Consumer)• Compression (End to end)• Configurable ack levels
8
From: http://queue.acm.org/detail.cfm?id=1563874
Kafka features and strengths
• Simple model, focused on high throughput and durability• O(1) time persistence on disk• Horizontally scalable by design (broker and consumers)• Push - pull => consumer burst tolerance• Replay messages• Multiple independent subscribes per topic• Configurable batching, compression, serialization• Online upgrades
9
Tradeoffs• Not optimized for millisecond latencies• Have not beaten CAP• Simple messaging system, no processing• Zookeeper becomes a bottleneck when using too many topics/partitions
(>>10000)• Not designed for very large payloads (full HD movie etc.)• Helps to know your data in advance
10
Agenda• Kafka overview
– Main concepts and comparisons to other messaging systems• Features, strengths and tradeoffs• Message format and broker concepts
– Partitioning, Keyed messages, Replication• Producer / Consumer APIs• Operation considerations• Kafka ecosystemIf time permits:• Kafka as a real-time processing backbone • Brief intro to Storm• Kafka-Storm wordcount demo
11
Message/Log Format
LengthVersion
ChecksumPayload
Message
Log based queue (Simplified model)
Message1
Topic2
Message2
Message3
Message4
Message5
Message6
Message7
Producer1 Consumer2
Producer2
Consumer1Message1
Message2
Message3
Message4
Message5
Message6
Message7
Message8
Message9
Message10
Topic1
Broker
Consumer3
Consumer3
Consumer3
ConsumerGroup1 • Batching• Compression• Serialization
Producer API used directly by application or through one of the contributed implementations, e.g. log4j/logback appender
Broker
Producer
Producer
Producer
Producer
Producer
Topic1
Topic2
Partitions
Partitioning
Consumer
Consumer
Consumer
ConsumerGroup2
Consumer
Consumer
Group1
Group3
ConsumerNo partition for this guy
Keyed messages
Producer
Message1
Message5
Message9
Message13
Message17
Topic1
BrokerId=1
Message2
Message4
Message6
Message8
Message10
Message12
Message14
Message16
Message18
Topic1
BrokerId=2
Message3
Message7
Message11
Message15
Topic1
BrokerId=3
hash(key) % #partitions#partitions=3
Intra cluster replication
Message1
Message2
Message3
Message4
Message5
Message6
Message7
Message8
Message9
Message10
Topic1 leader
Broker1
Message1
Message2
Message3
Message4
Message5
Message6
Message7
Message8
Message9
Message10
Topic1 follower
Broker2
Message1
Message2
Message3
Message4
Message5
Message6
Message7
Message8
Message9
Topic1 follower
Broker3
Producer ackackackack
Replication factor = 3
Message10
InSyncReplicas
Commit mode Latency Durability
Fire & Forget “none” Weak
Leader ack 1 roundtrip Medium
Full replication 2 roundtrips Strong
Follower fails:• Follower dropped from ISR • When follower comes online again: fetch
data from leader, then ISR gets updated
Leader fails:• Detected via Zookeeper from ISR• New leader gets elected
3 commit modes:
Agenda• Kafka overview
– Main concepts and comparisons to other messaging systems• Features, strengths and tradeoffs• Message format and broker concepts
– Partitioning, Keyed messages, Replication• Producer / Consumer APIs• Operation considerations• Kafka ecosystemIf time permits:• Kafka as a real-time processing backbone • Brief intro to Storm• Kafka-Storm wordcount demo
17
Producer API
…or for log aggregation:
Configuration parameters:ProducerType (sync/async)CompressionCodec (none/snappy/gzip)BatchSizeEnqueueSize/TimeEncoder/SerializerPartitioner#RetriesMaxMessageSize…
Consumer API(s)• High-level (consumer group, auto-commit)• Low-level (simple consumer, manual commit)
Agenda• Kafka overview
– Main concepts and comparisons to other messaging systems• Features, strengths and tradeoffs• Message format and broker concepts
– Partitioning, Keyed messages, Replication• Producer / Consumer APIs• Operation considerations• Kafka ecosystemIf time permits:• Kafka as a real-time processing backbone • Brief intro to Storm• Kafka-Storm wordcount demo
20
Broker Protips
• Reasonable number of partitions – will affect performance• Reasonable number of topics – will affect performance• Performance decrease with larger Zookeeper ensembles• Disk flush rate settings• message.max.bytes – max accept size, should be smaller than the heap• socket.request.max.bytes – max fetch size, should be smaller than the heap• log.retention.bytes – don’t want to run out of disk space…• Keep Zookeeper logs under control for same reason as above• Kafka brokers have been tested on Linux and Solaris
Operating Kafka
• Zookeeper usage– Producer loadbalancing– Broker ISR– Consumer tracking
• Monitoring– JMX– Audit trail/console in the making
Distribution Tools:• Controlled shutdown tool• Preferred replica leader election tool• List topic tool• Create topic tool• Add partition tool• Reassign partitions tool• MirrorMaker
Multi-datacenter replication
23
Agenda• Kafka overview
– Main concepts and comparisons to other messaging systems• Features, strengths and tradeoffs• Message format and broker concepts
– Partitioning, Keyed messages, Replication• Producer / Consumer APIs• Operation considerations• Kafka ecosystemIf time permits:• Kafka as a real-time processing backbone • Brief intro to Storm• Kafka-Storm wordcount demo
24
Ecosystem
Producers:• Java (in standard dist)• Scala (in standard dist)• Log4j (in standard dist)• Logback: logback-kafka• Udp-kafka-bridge• Python: kafka-python• Python: pykafka• Python: samsa• Python: pykafkap• Python: brod• Go: Sarama• Go: kafka.go• C: librdkafka• C/C++: libkafka• Clojure: clj-kafka• Clojure: kafka-clj• Ruby: Poseidon• Ruby: kafka-rb• Ruby: em-kafka• PHP: kafka-php(1)• PHP: kafka-php(2)• PHP: log4php• Node.js: Prozess• Node.js: node-kafka• Node.js: franz-kafka• Erlang: erlkafka
Consumers:• Java (in standard dist)• Scala (in standard dist)• Python: kafka-python• Python: samsa• Python: brod• Go: Sarama• Go: nuance• Go: kafka.go• C/C++: libkafka• Clojure: clj-kafka• Clojure: kafka-clj• Ruby: Poseidon• Ruby: kafka-rb• Ruby: Kafkaesque• Jruby::Kafka• PHP: kafka-php(1)• PHP: kafka-php(2)• Node.js: Prozess• Node.js: node-kafka• Node.js: franz-kafka• Erlang: erlkafka• Erlang: kafka-erlang
Common integration points:Stream ProcessingStorm - A stream-processing framework.Samza - A YARN-based stream processing framework.Hadoop IntegrationCamus - LinkedIn's Kafka=>HDFS pipeline. This one is used for all data at LinkedIn, and works great.Kafka Hadoop Loader A different take on Hadoop loading functionality from what is included in the main distribution.AWS IntegrationAutomated AWS deploymentKafka->S3 MirroringLoggingklogd - A python syslog publisherklogd2 - A java syslog publisherTail2Kafka - A simple log tailing utilityFluentd plugin - Integration with FluentdFlume Kafka Plugin - Integration with FlumeRemote log viewerLogStash integration - Integration with LogStash and FluentdOfficial logstash integrationMetricsMozilla Metrics Service - A Kafka and Protocol Buffers based metrics and logging systemGanglia IntegrationPacking and DeploymentRPM packagingDebian packaginghttps://github.com/tomdz/kafka-deb-packagingPuppet integrationDropwizard packagingMisc.Kafka Mirror - An alternative to the built-in mirroring toolRuby Demo App Apache Camel IntegrationInfobright integration
What’s in the future?• Topic and transient consumer garbage collection (KAFKA-560/KAFKA-559)• Producer side persistence (KAFKA-156/KAFKA-789)• Exact mirroring (KAFKA-658)• Quotas (KAFKA-656)• YARN integration (KAFKA-949)• RESTful proxy (KAFKA-639)• New build system? (KAFKA-855)• More tooling (Console, Audit trail) (KAFKA-266/KAFKA-260)• Client API rewrite (Proposal)• Application level security (Proposal)
Agenda• Kafka overview
– Main concepts and comparisons to other messaging systems• Features, strengths and tradeoffs• Message format and broker concepts
– Partitioning, Keyed messages, Replication• Producer / Consumer APIs• Operation considerations• Kafka ecosystemIf time permits:• Kafka as a real-time processing backbone • Brief intro to Storm• Kafka-Storm wordcount demo
27
Stream processingKafka as a processing pipeline backbone
Producer
Producer
Producer
Kafka topic1
Kafka topic2
Process1
Process1
Process1
Process2
Process2
Process2
System1 System2
29
What is Storm?
• Distributed real-time computation system with design goals:– Guaranteed processing– No orphaned tasks– Horizontally scalable– Fault tolerant– Fast
• Use cases: Stream processing, DRPC, Continuous computation• 4 basic concepts: streams, spouts, bolts, topologies• In Apache incubator• Implemented in Clojure
30
Streams
(t4,s2,e2) (t3,s3) (t2,s1,e2) (t1,s1,e1)
(timestamp,sessionid,exception stacktrace)
Spoutsa source of streams
(t4,s2,e2) (t3,s3) (t2,s1,e2) (t1,s1,e1)
Connects to queues, logs, API calls, event data.
Some features like transactional topologies (which gives exactly-once messaging semantics) is only possible using the Kafka-TransactionalSpout-consumer
an [infinite] sequence (of tuples)
31
Bolts
(t2,s1,h2)(t1,s1,h1)
(t3,s3)(t4,s2,e2)(t5,s4)
• Filters• Transformations• Apply functions• Aggregations• Access DB, APIs etc.• Emitting new streams• Trident = a high level abstraction on top of Storm
32
Topologies
(t2,s1,h2)(t1,s1,h1)
(t3,s3)
(t4,s2,e2)(t5,s4)
(t6,s6)
(t7,s7)
(t8,s8)
33
Storm cluster
Nimbus
Supervisor SupervisorSupervisorSupervisorSupervisor
Zookeeper
Topology
Deploy
(JobTracker)
Compare with Hadoop:
(TaskTrackers)
Mesos/YARN
Links
34
Apache Kafka:Papers and presentationsMain project pageSmall Mediawiki case studyStorm:Introductory articleRealtime discussing blog postKafka+Storm for realtimeBigData Trifecta blog post: Kafka+Storm+CassandraIBM developer articleKafka+Storm@TwitterBigData Quadfecta blog post