Apache Kafka

Post on 15-Jan-2016

Apache Kafka: A distributed publish-subscribe messaging system

Neha Narkhede, LinkedIn (nnarkhede@linkedin.com), 11/11/11

Outline

Introduction to pub-sub

Kafka at LinkedIn

Hadoop and Kafka

Design

Performance

What is pub-sub?

[Diagram: Producers call publish(topic, msg) on the publish-subscribe system, which holds Topic 1, Topic 2, and Topic 3. Consumers subscribe to topics and receive each msg.]
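The publish/subscribe interaction on this slide can be sketched as a tiny in-memory hub. This is only an illustration of the pattern: the `Broker` class and its method names are hypothetical and are not Kafka's API.

```python
from collections import defaultdict


class Broker:
    """Toy in-memory pub-sub hub (hypothetical, not Kafka's API)."""

    def __init__(self):
        # topic name -> list of subscriber callbacks
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        """Register a callback that receives every message published to topic."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, msg):
        """Deliver msg to each subscriber of topic and return the delivery count."""
        callbacks = self._subscribers[topic]
        for callback in callbacks:
            callback(msg)
        return len(callbacks)
```

A producer only names a topic; it never needs to know which consumers, if any, are subscribed.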

Motivation

Activity tracking

Operational metrics

Kafka

Distributed

Persistent

High throughput

Kafka at LinkedIn

[Diagram: Frontend, Frontend, Service, and Service publish to Kafka (Broker, Broker, Broker). Consumers shown: Real time monitoring, Hadoop, DWH, Security systems, News feed.]

Hadoop Data Load for Kafka

[Diagram: Live data center with Frontend, Frontend, Real time consumers, and Kafka. Offline data center with Kafka, Dev Hadoop, and PROD Hadoop.]

Multi DC data deployments

[Diagram: Live data centers, each with Kafka and Real time consumers. Offline data centers with Kafka, Hadoop clusters, and the DWH.]

Volume

20B events/day

3 terabytes/day

150K events/sec

Message queues (ActiveMQ, TIBCO)

• Low throughput

• Secondary indexes

• Tuned for low latency

Log aggregators (Flume, Scribe)

• Focus on HDFS

• Push model

• No rewindable consumption

KAFKA

What Kafka offers

Very high performance

Elastically scalable

Low operational overhead

Durable, highly available (coming soon)

Architecture

[Diagram: Producers, a row of four Brokers, Consumers, and ZK (ZooKeeper).]

Efficiency #1: simple storage

Each topic has an ever-growing log

A log == a list of files

A message is addressed by a log offset
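To make offset addressing concrete, here is a minimal sketch of looking up a message position in a log stored as a list of files. It assumes each file is identified by the log offset of its first byte. That layout, and the `SegmentedLog` name, are assumptions for illustration rather than something stated on the slide.

```python
import bisect


class SegmentedLog:
    """Toy index over a log stored as a list of files (hypothetical layout)."""

    def __init__(self, segment_base_offsets):
        # Assumption: each file is identified by the log offset of its first byte.
        self._bases = sorted(segment_base_offsets)

    def locate(self, offset):
        """Return (segment_base_offset, position_within_segment) for a log offset."""
        if not self._bases or offset < self._bases[0]:
            raise KeyError(f"offset {offset} is before the start of the log")
        # Pick the rightmost segment whose base offset is <= the requested offset.
        index = bisect.bisect_right(self._bases, offset) - 1
        base = self._bases[index]
        return base, offset - base
```

With this layout no per-message index is needed: a binary search over the file list plus a subtraction finds the read position.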

Efficiency #2: careful transfer

Batch send and receive

No message caching in JVM

Rely on file system buffering

Zero-copy transfer: file -> socket
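The file -> socket transfer can be sketched with Python's `socket.sendfile`, which uses the operating system's `sendfile` call where available and otherwise falls back to ordinary sends. The `send_log_range` and `demo` helpers are hypothetical names for this illustration and do not come from Kafka's code.

```python
import os
import socket
import tempfile


def send_log_range(sock, log_path, offset, length):
    """Send length bytes of the log file, starting at offset, to the socket."""
    with open(log_path, "rb") as log_file:
        # Where os.sendfile is available, the kernel moves the bytes from the
        # file to the socket without copying them through this process.
        return sock.sendfile(log_file, offset=offset, count=length)


def demo():
    """Write three 6-byte messages to a file and send the second over a socket pair."""
    with tempfile.NamedTemporaryFile(delete=False) as log_file:
        log_file.write(b"msg-0;msg-1;msg-2;")
        path = log_file.name
    sender, receiver = socket.socketpair()
    try:
        sent = send_log_range(sender, path, offset=6, length=6)
        data = receiver.recv(64)
    finally:
        sender.close()
        receiver.close()
        os.remove(path)
    return sent, data
```

Because the broker never materializes messages as objects in the JVM, serving a fetch request amounts to handing a byte range of a file to the network stack.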

Multi subscribers

1 file system operation per request

Consumption is cheap

SLA based message retention

Rewindable consumption
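Rewindable consumption follows from addressing messages by offset: if each consumer keeps its own position in a retained log, re-reading is just resetting that position. The sketch below assumes that model and, for brevity, uses message indexes rather than byte offsets. `RewindableConsumer` is a hypothetical name.

```python
class RewindableConsumer:
    """Toy consumer that owns its read position in a shared, append-only log."""

    def __init__(self, log):
        self._log = log   # shared list of messages; never modified by the consumer
        self.offset = 0   # index of the next message to read

    def poll(self, max_messages=10):
        """Return up to max_messages starting at the current offset and advance."""
        batch = self._log[self.offset:self.offset + max_messages]
        self.offset += len(batch)
        return batch

    def rewind(self, offset):
        """Move the read position back (or forward) within the retained log."""
        if not 0 <= offset <= len(self._log):
            raise ValueError("offset is outside the retained log")
        self.offset = offset
```

Reading does not change the log, so several subscribers can consume the same data independently, each at its own pace.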

Guarantees

Data integrity checks

At least once delivery

In order delivery, per partition
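One common way to obtain at-least-once delivery is to record the consumed offset only after a message has been handled, so a crash in between causes a re-delivery rather than a loss. The sketch below illustrates that idea under stated assumptions; the function name and the `crash_after` parameter are hypothetical, and it is not a description of Kafka's consumer implementation.

```python
def consume_at_least_once(log, committed_offset, handler, crash_after=None):
    """Handle messages in order from committed_offset; commit after each one.

    crash_after simulates a failure after handling the message at that relative
    index but before committing it. Returns the last committed offset.
    """
    offset = committed_offset
    for index, message in enumerate(log[committed_offset:]):
        handler(message)
        if crash_after is not None and index == crash_after:
            return offset  # handled but not committed: will be re-delivered
        offset += 1        # commit only after the handler has finished
    return offset
```

After a simulated crash, restarting from the returned offset handles the uncommitted message a second time, so handlers should tolerate duplicates.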

Automatic load balancing

[Diagram: two Producers, two Brokers, and two Consumers.]

Auditing

# events published = # events consumed
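The audit check can be sketched as a per-topic comparison of counts. The input shape (one topic name per event) and the `audit` function are assumptions made for this example.

```python
from collections import Counter


def audit(published_topics, consumed_topics):
    """Return {topic: published - consumed} for every topic whose counts differ."""
    published = Counter(published_topics)
    consumed = Counter(consumed_topics)
    topics = sorted(set(published) | set(consumed))
    return {
        topic: published[topic] - consumed[topic]
        for topic in topics
        if published[topic] != consumed[topic]
    }
```

An empty result means the audit passes. A positive value points to lost events, and a negative value points to duplicates.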

Performance

2 Linux boxes

• 16 × 2.0 GHz cores

• 6 × 7200 rpm SATA drives in RAID 10

• 24 GB memory

• 1 Gb network link

200-byte messages

Producer batch size: 200 messages

Basic performance metrics

• Producer batch size = 40K

• Consumer batch size = 1MB

• 100 topics, broker flush interval = 100K

– Producer throughput = 90 MB/sec

– Consumer throughput = 60 MB/sec

– Consumer latency = 220 ms
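A quick arithmetic check ties these numbers together. It assumes that the 40K producer batch size is the 200-message batch of 200-byte messages from the test setup, that 1 MB is 10^6 bytes, and that protocol overhead is ignored. All three are assumptions, not statements from the slides.

```python
MESSAGE_BYTES = 200        # from the test setup
MESSAGES_PER_BATCH = 200   # from the test setup


def batch_bytes(message_bytes=MESSAGE_BYTES, messages_per_batch=MESSAGES_PER_BATCH):
    """Payload bytes in one producer batch."""
    return message_bytes * messages_per_batch


def messages_per_second(throughput_mb_per_sec, message_bytes=MESSAGE_BYTES,
                        bytes_per_mb=1_000_000):
    """Message rate implied by a payload throughput (assumes 1 MB = 10**6 bytes)."""
    return throughput_mb_per_sec * bytes_per_mb // message_bytes
```

Under these assumptions a batch is 40,000 bytes, 90 MB/sec of producer throughput is about 450,000 messages per second, and 60 MB/sec of consumer throughput is about 300,000.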

Latency vs throughput

[Chart: consumer latency in ms (y-axis, 0 to 250) against producer throughput in MB/sec (x-axis, 0 to 100). 100 topics, 1 producer, 1 broker.]

Scalability

[Chart: throughput in MB/s (y-axis, 0 to 400) by number of brokers. 10 topics, broker flush interval 100K.]

• 1 broker: 101

• 2 brokers: 190

• 3 brokers: 293

• 4 brokers: 381
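One way to read the scalability figures is as scaling efficiency: measured throughput divided by what perfectly linear scaling from the single-broker result would give. The helper below is a hypothetical calculation over the four values on the chart (101, 190, 293, and 381 MB/s).

```python
# Throughput in MB/s by broker count, as read from the chart.
THROUGHPUT_MB_PER_SEC = {1: 101, 2: 190, 3: 293, 4: 381}


def scaling_efficiency(measurements):
    """Return {brokers: measured / (brokers * single_broker_throughput)}."""
    base = measurements[1]
    return {
        brokers: round(throughput / (brokers * base), 3)
        for brokers, throughput in measurements.items()
    }
```

For these measurements the ratio stays between roughly 0.94 and 0.97, which is close to linear scaling up to four brokers.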

Throughput vs Unconsumed data

[Chart: throughput in msg/s (y-axis, 0 to 200,000) against unconsumed data in GB (x-axis, from about 10 to about 1,008). 1 topic, broker flush interval 10K.]

State of the system

• 4 clusters per colo, 4 servers each

• 850 socket connections per server

• 20 TB

• 430 topics

• Batched frontend to offline datacenter latency => 6-10 secs

• Frontend to Hadoop latency => 5 min

State of the system

Successfully deployed in production at LinkedIn and other startups

Apache incubator inclusion

0.7 Release

• Compression

• Cluster mirroring

Replication

Some project ideas

Security

Long poll

More compression codecs

Locality of consumption

Team

• Jay Kreps

• Jun Rao

• Neha Narkhede

• Joel Koshy

• Chris Burroughs

THANK YOU

http://incubator.apache.org/kafka/index.html

kafka-users@incubator.apache.org

http://www.linkedin.com/in/nehanarkhede

@nehanarkhede

#kafka

API