Apache Kafka

A distributed publish-subscribe messaging system

Neha Narkhede, LinkedIn, nnarkhede@linkedin.com, 11/11/11

Outline

• Introduction to pub-sub
• Kafka at LinkedIn
• Hadoop and Kafka
• Design
• Performance

What is pub-sub?

[Diagram: producers call publish(topic, msg) on the publish-subscribe system, which holds Topic 1, Topic 2, and Topic 3; consumers subscribe to topics.]


Motivation

Activity tracking

Operational metrics

Distributed

Persistent

High throughput

Kafka at LinkedIn

[Diagram: frontends and services, three Kafka brokers, and downstream systems: real-time monitoring, Hadoop DWH, security systems, news feed.]


Hadoop Data Load for Kafka

[Diagram: a live data center (frontends, Kafka, real-time consumers) and an offline data center (Kafka, Dev Hadoop, PROD Hadoop).]

Multi DC data deployments

[Diagram: live data centers and offline data centers, with Kafka, real-time consumers, Hadoop clusters, and a DWH.]

Volume

20B events/day

3 terabytes/day

150K events/sec

Message queues (ActiveMQ, TIBCO)

• Low throughput
• Secondary indexes
• Tuned for low latency

Log aggregators (Flume, Scribe)

• Focus on HDFS
• Push model
• No rewindable consumption

What Kafka offers

Very high performance

Elastically scalable

Low operational overhead

Durable, highly available (coming soon)

Architecture

[Diagram: producers, four brokers, and consumers.]


Efficiency #1: simple storage

• Each topic has an ever-growing log
• A log == a list of files
• A message is addressed by a log offset
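The storage model on this slide can be sketched roughly as follows. This is a minimal in-memory illustration, not Kafka's implementation: the `SegmentedLog` class, the `segment_bytes` roll-over limit, the 4-byte length prefix, and the use of byte positions as offsets are all assumptions made for the example.

```python
import bisect


class SegmentedLog:
    """Toy append-only log: a list of segments, addressed by byte offset."""

    def __init__(self, segment_bytes=64):
        self.segment_bytes = segment_bytes   # hypothetical roll-over size
        self.base_offsets = [0]              # first log offset held by each segment
        self.segments = [bytearray()]        # stand-ins for files on disk
        self.next_offset = 0                 # offset the next message will get

    def append(self, payload: bytes) -> int:
        """Append one message and return the log offset that addresses it."""
        record = len(payload).to_bytes(4, "big") + payload
        current = self.segments[-1]
        if current and len(current) + len(record) > self.segment_bytes:
            # Roll to a new segment "file" that starts at the current offset.
            self.base_offsets.append(self.next_offset)
            self.segments.append(bytearray())
        offset = self.next_offset
        self.segments[-1] += record
        self.next_offset += len(record)
        return offset

    def read(self, offset: int) -> bytes:
        """Locate the segment that contains `offset`, then read one message."""
        i = bisect.bisect_right(self.base_offsets, offset) - 1
        pos = offset - self.base_offsets[i]
        seg = self.segments[i]
        size = int.from_bytes(seg[pos:pos + 4], "big")
        return bytes(seg[pos + 4:pos + 4 + size])


log = SegmentedLog(segment_bytes=32)
offsets = [log.append(m) for m in (b"page_view", b"search", b"profile_click")]
print(offsets, log.read(offsets[1]))
```

Because every segment is named by its base offset, finding a message is a binary search over segment starts plus one positioned read; no per-message index is needed.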

Efficiency #2: careful transfer

Batch send and receive

No message caching in JVM

Rely on file system buffering

Zero-copy transfer: file -> socket
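As a rough illustration of the file -> socket path, the sketch below uses Python's `socket.sendfile`, which relies on `os.sendfile` where the platform provides it and otherwise falls back to ordinary sends. Kafka itself runs on the JVM, so this only shows the idea; the helper name `send_log_chunk` and the temporary file standing in for a log segment are assumptions for the example.

```python
import os
import socket
import tempfile


def send_log_chunk(sock, path, offset, count):
    """Send `count` bytes of the file at `path`, starting at `offset`.

    The bytes are handed from the file to the socket without being
    assembled into an application-level message buffer first.
    """
    with open(path, "rb") as f:
        return sock.sendfile(f, offset=offset, count=count)


# A temporary file stands in for a log segment on the broker's disk.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"msg-0|msg-1|msg-2|")
    segment_path = tmp.name

broker_side, consumer_side = socket.socketpair()
sent = send_log_chunk(broker_side, segment_path, offset=6, count=12)

received = b""
while len(received) < sent:
    received += consumer_side.recv(64)

broker_side.close()
consumer_side.close()
os.remove(segment_path)
print(sent, received)
```

Serving a fetch this way lets the operating system's page cache do the buffering, which is consistent with the "no message caching in JVM" and "rely on file system buffering" points above.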

Multi subscribers

1 file system operation per request

Consumption is cheap

SLA based message retention

Rewindable consumption
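One way to picture these points: the broker keeps messages for a fixed retention window whether or not anyone has read them, and each consumer tracks its own position. The sketch below is a toy model under those assumptions; `RetainedLog`, `Consumer`, `retention_secs`, and the use of message indices as offsets are hypothetical names and simplifications, not Kafka APIs.

```python
class RetainedLog:
    """Toy broker-side log that keeps messages until a retention limit."""

    def __init__(self, retention_secs):
        self.retention_secs = retention_secs   # hypothetical SLA, in seconds
        self.entries = []                      # (offset, timestamp, message)
        self.next_offset = 0

    def append(self, message, now):
        self.entries.append((self.next_offset, now, message))
        self.next_offset += 1

    def expire(self, now):
        """Drop messages older than the retention window, consumed or not."""
        self.entries = [e for e in self.entries
                        if now - e[1] <= self.retention_secs]

    def fetch(self, offset, max_messages=10):
        """Stateless read: the caller says where to start."""
        hits = [(o, m) for o, _, m in self.entries if o >= offset]
        return hits[:max_messages]


class Consumer:
    """Each consumer owns its position; the log keeps no per-consumer state."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self):
        batch = self.log.fetch(self.offset)
        if batch:
            self.offset = batch[-1][0] + 1
        return [m for _, m in batch]

    def rewind(self, offset):
        self.offset = offset


log = RetainedLog(retention_secs=60)
for t, msg in enumerate(["a", "b", "c"]):
    log.append(msg, now=t)

realtime, batch_job = Consumer(log), Consumer(log)
first_pass = realtime.poll()     # reads a, b, c
realtime.rewind(1)
second_pass = realtime.poll()    # re-reads b, c
other = batch_job.poll()         # unaffected by the other consumer
log.expire(now=61)               # "a" (written at t=0) is past the window
after_expiry = log.fetch(0)
```

Since reading never mutates the log, adding another subscriber costs the broker only another fetch, and rewinding is just a matter of the consumer resetting its own offset.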

Guarantees

Data integrity checks

At least once delivery

In order delivery, per partition
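At-least-once delivery can be illustrated with a consumer that records its offset only after a message has been processed. The sketch below is a simulation under that assumption; `consume_at_least_once` and the `crash_before_commit_at` knob are hypothetical and exist only to show where a duplicate can come from.

```python
def consume_at_least_once(messages, committed_offset, handler,
                          crash_before_commit_at=None):
    """Process messages from `committed_offset`; commit only after handling.

    Returns the new committed offset. `crash_before_commit_at` simulates a
    consumer dying after processing a message but before committing it.
    """
    offset = committed_offset
    while offset < len(messages):
        handler(messages[offset])          # 1. process the message
        if offset == crash_before_commit_at:
            return committed_offset        # crash: progress was not recorded
        offset += 1
        committed_offset = offset          # 2. then record progress
    return committed_offset


messages = ["m0", "m1", "m2"]   # one partition, in log order
seen = []

# First run dies after handling m1 but before committing it.
committed = consume_at_least_once(messages, 0, seen.append,
                                  crash_before_commit_at=1)
# Restart from the last committed offset: m1 is delivered again.
committed = consume_at_least_once(messages, committed, seen.append)
print(seen, committed)
```

No message is skipped and order within the partition is preserved, but `m1` is handled twice, so handlers under this guarantee need to tolerate duplicates.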

Automatic load balancing

[Diagram: producers, two brokers, and consumers.]

Auditing

# events published = # events consumed
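A check of this kind could be sketched by counting events per topic and time window on both sides and reporting any window where the counts differ. The function name `audit_counts`, the `(topic, timestamp)` event shape, and the 10-minute bucket are assumptions for illustration; the slides do not describe how the audit is implemented.

```python
from collections import Counter


def audit_counts(published, consumed, bucket_secs=600):
    """Compare per-topic, per-time-bucket counts; return buckets that disagree.

    Each event is a (topic, timestamp_in_seconds) pair. The result maps
    (topic, bucket) to (published_count, consumed_count).
    """
    def tally(events):
        counts = Counter()
        for topic, ts in events:
            counts[(topic, ts // bucket_secs)] += 1
        return counts

    pub, con = tally(published), tally(consumed)
    return {key: (pub[key], con[key])
            for key in pub.keys() | con.keys()
            if pub[key] != con[key]}


published = [("page_view", 5), ("page_view", 30), ("search", 610)]
consumed = [("page_view", 5), ("search", 610)]
mismatches = audit_counts(published, consumed)
print(mismatches)
```

An empty result means the two sides agree for every window; a non-empty one points at the topic and window where events were lost or duplicated.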


Performance

2 Linux boxes

• 16 cores, 2.0 GHz

• 6 × 7200 rpm SATA drives, RAID 10

• 24GB memory

• 1Gb network link

200 byte messages

Producer batch size 200 messages

Basic performance metrics

• Producer batch size = 40K
• Consumer batch size = 1MB
• 100 topics, broker flush interval = 100K

– Producer throughput = 90 MB/sec
– Consumer throughput = 60 MB/sec
– Consumer latency = 220 ms
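For a sense of scale, the figures above can be converted into message rates using the 200-byte message size from the test setup. This back-of-the-envelope calculation assumes 1 MB = 1,000,000 bytes and ignores any per-message framing overhead; both are assumptions, not statements from the slides.

```python
MESSAGE_BYTES = 200      # message size from the test setup
BATCH_MESSAGES = 200     # producer batch size, in messages
MB = 1_000_000           # assumption: decimal megabytes

batch_bytes = MESSAGE_BYTES * BATCH_MESSAGES
producer_msgs_per_sec = 90 * MB // MESSAGE_BYTES
consumer_msgs_per_sec = 60 * MB // MESSAGE_BYTES

print(batch_bytes)             # 40000, matching the 40K producer batch size
print(producer_msgs_per_sec)   # 450000
print(consumer_msgs_per_sec)   # 300000
```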

Latency vs throughput

[Chart: x-axis is producer throughput in MB/sec (100 topics, 1 producer, 1 broker).]

Scalability

[Chart: results for 1, 2, 3, and 4 brokers (10 topics, broker flush interval 100K).]

Throughput vs Unconsumed data

[Chart: throughput against unconsumed data in GB (1 topic, broker flush interval 10K).]

State of the system

• 4 clusters per colo, 4 servers each
• 850 socket connections per server
• 20 TB
• 430 topics
• Batched frontend to offline datacenter latency => 6-10 secs
• Frontend to Hadoop latency => 5 min

State of the system

Successfully deployed in production at LinkedIn and other startups

Apache incubator inclusion

0.7 Release

• Compression

• Cluster mirroring

Replication

Some project ideas

Security

Long poll

More compression codecs

Locality of consumption

• Jay Kreps
• Jun Rao
• Neha Narkhede
• Joel Koshy
• Chris Burroughs

THANK YOU

http://incubator.apache.org/kafka/index.html

kafka-users@incubator.apache.org

http://www.linkedin.com/in/nehanarkhede

@nehanarkhede

#kafka
