Apache kafka

36
Apache Kafka A high throughput distributed messaging system

Transcript of Apache kafka

Page 1: Apache kafka

Apache KafkaA high throughput distributed messaging system

Page 2: Apache kafka

What is Kakfa?

Kafka is a distributed publish-subscribe messaging system rethought as a distributed commit log.

It’s designed to be Fast Scalable Durable

When used in the right way and for the right use case, Kafka has unique attributes that make it a highly attractive option for data integration.

Page 3: Apache kafka

Publish subscribe messaging system Kafka maintains feeds of messages in categories called topics Producers are processes that publish messages to one or more topics Consumers are processes that subscribe to topics and process the feed

of published messages

Subscriber

Subscriber

Subscriber

Publisher

Message

Message

MessageMessage Topic

Page 4: Apache kafka

Kafka cluster Since Kafka is distributed in nature, Kafka is run as a cluster. A cluster is typically comprised multiple servers; each of which is called a

broker. Communication between the clients and the servers takes place over TCP protocol

Kafka clusterConsumer

Consumer

Consumer

Producer

Producer

Producer

Broker 1Topic 1 Topic 2

Broker 2Topic 1 Topic 2

Zookeeper

Page 5: Apache kafka

Deep dive into high level abstractions

Page 6: Apache kafka

Topic To balance load, a topic is

divided into multiple partitions and replicated across brokers.

Partitions are ordered, immutable sequences of messages that’s continually appended i.e. a commit log.

The messages in the partitions are each assigned a sequential id number called the offset that uniquely identifies each message within the partition.

Page 7: Apache kafka

Partitions allow a topic’s log to scale beyond a size that will fit on a single server (i.e. a broker) and act as the unit of parallelism

The partitions of a topic are distributed over the brokers in the Kafka cluster where each broker handles data and requests for a share of the partitions.

For fault tolerance, each partition is replicated across a configurable number of brokers.

Distribution and partitions

Page 8: Apache kafka

Distribution and fault tolerance

Each partition has one server which acts as the "leader" and zero or more servers which act as "followers".

The leader handles all read and write requests for the partition while the followers passively replicate the leader.

If the leader fails, one of the followers will automatically become the new leader.

Each server acts as a leader for some of its partitions and a follower for others so load is well balanced within the cluster.

Page 9: Apache kafka

Retention

The Kafka cluster retains all published messages—whether or not they have been consumed—for a configurable period of time; after which it will be discarded to free up space.

Metadata retained on a per-consumer basis is the position of the consumer in the log, called the offset; which is controlled by consumer.

Normally a consumer will advance its offset linearly as it reads messages, but it can consume messages in any order it likes.

Kafka consumers can come and go without much impact on the cluster or on other consumers.

Page 10: Apache kafka

Producers Producers publish data to the topics by assigning messages to a

partition within the topic either in a round-robin fashion or according to some semantic partition function (say based on some key in the message).

Page 11: Apache kafka

Consumers Kafka offers a single consumer abstraction called consumer group that

generalises both queue and topic. Consumers label themselves with a consumer group name. Each message published to a topic is delivered to one consumer instance

within each subscribing consumer group. If all the consumer instances have the same consumer group, then

this works just like a traditional queue balancing load over the consumers.

If all the consumer instances have different consumer groups, then this works like publish-subscribe and all messages are broadcast to all consumers.

Page 12: Apache kafka

Consumer groups Topics have a small number of consumer groups, one for each logical

subscriber. Each group is composed of many consumer instances for scalability and fault

tolerance.

Page 13: Apache kafka

Ordering guarantees Kafka assigns partitions in a topic to consumers in a consumer group

so, each partition is consumed by exactly one consumer in the group. Limitation: there cannot be more consumer instances in a consumer group

than partitions. Provides a total order over messages within a partition, not between different

partitions in a topic.

Page 14: Apache kafka

Comaprison

Kafka JMS message broker; Rabbit MQA fire hose of events arriving at rate of approximately 100k+/sec

Messages arriving at a rate of 20k+/sec

‘At least once‘ processed as data is read with an offset within a partition.

Exactly once processed by consumers

Producer-centric. Doesn't have message acknowledgements as consumers track messages consumed.

Broker-centric. Uses the broker itself to maintain state of what's consumed (via message acknowledgements)

Supports both online and batch consumers that may be online or offline. It also supports producer message batching - it's designed for holding and distributing large volumes of messages at a very low latency.

Consumers are mostly online, and any messages "in wait" (persistent or not) are held opaquely.

Page 15: Apache kafka

Comaprison

Kafka JMS message broker; Rabbit MQProvides a rudimentary routing. It uses topic for exchanges.

Provides rich routing capabilities with Advanced Message Queuing Protocol’s (AMQP) exchange, binding and queuing model.

Makes distributed cluster explicit, by forcing the producer to know it is partitioning a topic's messages across several nodes.

Makes the distributed cluster transparent, as if it were a virtual broker

Preserves ordered delivery within a partition

Almost always unordered delivery. AMQP model says "one producer channel, one exchange, one queue, one consumer channel" is required for in-order delivery

Page 16: Apache kafka

Throttling is un-necessary

The whole job of Kafka is to provide a "shock absorber" between the flood of events and those who want to consume them in their own way.

Page 17: Apache kafka

Performance benchmark

500,000 messages published per second 22,000 messages consumed per second on a 2-node cluster with 6-disk RAID 10. See

research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf

Page 18: Apache kafka

Key benefits

Horizontally scalable It’s a distributed system can be elastically and transparently expanded

with no downtime High throughput

High throughput is provided for both publishing and subscribing, due to disk structures that provide constant performance even with many terabytes of stored messages

Reliable delivery Persists messages on disk, and provides intra-cluster replication Supports large number of subscribers and automatically balances consumers

in case of failure.

Page 19: Apache kafka

Use cases

Common use cases include1. Stream processing, Event sourcing or a replacement for a more traditional

message broker2. Website activity tracking - original use case for Kafka3. Metrics collection and monitoring - centralized feeds of operational data4. Log aggregation

Page 20: Apache kafka

Getting practical

Page 21: Apache kafka

Download and extract Kafka

Download the archive from kafka.apache.org/downloads.html and extract it

Page 22: Apache kafka

Kafka uses ZooKeeper for cluster coordination Kafka uses ZooKeeper; which enables

highly reliable distributed coordination so, one needs to first start a ZooKeeper server.

Kafka bundles a single-node ZooKeeper instance. Single node zookeeper cluster does NOT run a leader and a follower.

Typically exchanged metadata include Kafka broker addresses Consumed messages offset

Page 23: Apache kafka

Common Challenges faced by distributed system Outages Co-ordination of tasks Reduction of operational complexity Consistency and ordering guarantees

Zookeeper to rescue

Page 24: Apache kafka

Apache Zookeeper: Definition

Centralised service for Maintaining configuration information Naming Distributed synchronisation and providing group services.

Page 25: Apache kafka

Apache Zookeeper: Features

Distributed consistent data store which favours consistency over everything else.

High availability - Tolerates a minority of an ensemble members being unavailable and continues to function correctly. In an ensemble of n members where n is an odd number, the loss of (n-1)/2 members can

be tolerated. High performance - All the data is stored in memory and benchmarked at 50k

ops/sec but the numbers really depend on your servers and network Tuned for read heavy write light work load. Reads are served from the node to which a

client a connected. Provides strictly ordered access for data.

Atomic write guarantees in the order they're sent to zookeeper. Writes are acknowledged and changes are also seen in the order they occurred.

Page 26: Apache kafka

Apache Zookeeper: Operation basics A cluster is an ensemble with a leaders and several followers Read requests are serviced from each server using its local replica but write

request are forwarded to a leader. When the leader receives a write request, it calculates what the state of the system is when the write is to be applied and transforms this into a transaction that captures this new state.

When zookeeper starts, it goes through a loading algorithm where by one node of the cluster is elected to act as the leader. At any given point in time, only one node acts as a leader.

Page 27: Apache kafka

Apache Zookeeper: Operation basics Clients create a state full session (i.e. with heartbeats through an open socket)

when they connect to a node of an ensemble. Number of open sockets available on a zookeeper node will limit the number of clients that connect to it.

When a cluster member dies, clients notice a disconnect event and thus reconnect themselves to another member of the quorum.

Session (i.e. state of the client connected to node of an ensemble) stay alive when the client goes down, as the session events go through the leader and gets replicated in a cluster onto another node.

When the leader goes down, remaining members of the cluster will re-elect a new leader using a atomic broadcast consensus algorithm. Cluster remains unavailable only when it re-elects a new leader.

Page 28: Apache kafka

1. Start ZooKeeper

Page 29: Apache kafka

2. Set-up a cluster with 3 brokers

Page 30: Apache kafka

2. Adjust broker configuration files

Page 31: Apache kafka

3. Start kafka server 1

Page 32: Apache kafka

3. Start kafka server 2

Page 33: Apache kafka

3. Start kafka server 3

Page 34: Apache kafka

3. Servers 1, 2 and 3 are started and running

Page 35: Apache kafka

4. Create a kakfa topic, list topics and describe one

5. Start a producer and publish some messages

6. Start a consumer and process messages

Page 36: Apache kafka

Log directories for each of the broker instances