Introduction to Kafka

Post on 25-Jan-2017

BY DUCAS FRANCIS

The problem

[Diagram: Web, Mobile, API, Job; Security System, Real-time Monitoring, Logging System, Other services]

It’s simple enough at first…

Then it gets a little busy…

And ends up a mess.

The solution

[Diagram: Web, Mobile, API, Job; Pub/Sub; Security System, Real-time Monitoring, Logging System, Other services]

Decouple data pipelines using a pub/sub system

Producers → Brokers → Consumers

Apache Kafka

A UNIFIED, HIGH-THROUGHPUT, LOW-LATENCY PLATFORM FOR HANDLING REAL-TIME DATA FEEDS

A brief history lesson

- Originally developed at LinkedIn in 2011
- Graduated from the Apache Incubator in 2012
- Engineers from LinkedIn formed Confluent in 2014
- Up to version 0.9.4, with 0.10 on the horizon

Motivation

- Unified platform for all real-time data feeds
- High throughput for high volume streams
- Support periodic data loads from offline systems
- Low latency for traditional messaging
- Support partitioned, distributed, real-time processing
- Guarantee fault-tolerance

Common use cases

- Messaging
- Website activity tracking
- Metrics
- Log aggregation
- Stream processing
- Event sourcing
- Commit log

Benefits of Kafka

- High throughput
- Low latency
- Load balancing
- Fault tolerant
- Guaranteed delivery
- Secure

Performance comparison

Batch performance comparison

Some terminology

- Topic – feed of messages
- Producer – publishes messages to a topic
- Consumer – subscribes to topics and processes the feed of messages
- Broker – server instance that acts in a cluster

@apachekafka powers @ microsoft…

Libraries

- Python – kafka-python / pykafka
- Go – sarama / go_kafka_client / …
- C/C++ – librdkafka / libkafka / …
- .NET – kafka-net (x2) / rdkafka-dotnet / CSharpClient-for-Kafka
- Node.js – kafka-node / sutoiku/node-kafka / ...
- HTTP – kafka-pixy / kafka-rest
- etc.

Architecture

[Diagram: Producer, Producer; Broker, Broker, Broker (Cluster); Consumer, Consumer; Zookeeper; x3]

Show me the Kafka!!!

VAGRANT TO THE RESCUE

Anatomy of a topic

- Topics are broken into partitions
- Messages are assigned a sequential ID called an offset
- Data is retained for a configurable period of time
- Number of partitions can be increased after creation, but not decreased
- Partitions are assigned to brokers

Each partition is an ordered, immutable sequence of messages that is continually appended to…a commit log.
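To make the commit-log idea concrete, here is a minimal in-memory sketch of a single partition. The `Partition` class and its method names are hypothetical and only illustrate offsets; a real broker stores the log in segment files on disk.

```python
class Partition:
    """Toy in-memory stand-in for one topic partition (an append-only log)."""

    def __init__(self):
        self._log = []  # the index of a message in this list is its offset

    def append(self, message):
        """Append a message to the end of the log and return its offset."""
        self._log.append(message)
        return len(self._log) - 1

    def read(self, offset, max_messages=10):
        """Return up to max_messages messages starting at the given offset."""
        return self._log[offset:offset + max_messages]


partition = Partition()
offsets = [partition.append(message) for message in ("a", "b", "c")]
tail = partition.read(1)  # reading does not remove anything from the log
```

Existing entries are never modified: writers only add to the end, and readers only say which offset they want to start from.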

Broker

- Kafka service running as part of a cluster
- Receives messages from producers and serves them to consumers
- Coordinated using Zookeeper
  - Need an odd number of Zookeeper nodes for quorum
- Stores messages on the file system
- Replicates messages to/from other brokers
- Answers metadata requests about brokers and topics/partitions
- As of 0.9.0 – coordinates consumers

Replication

- Partitions on a topic should be replicated
- Each partition has 1 leader and 0 or more followers
- An In-Sync Replica (ISR) is one that’s communicating with Zookeeper and not too far behind the leader
- Replication factor can be increased after creation, not decreased

./kafka-topics --create --replication-factor --partitions

./kafka-topics --describe

Producers

- Publishes messages to a topic
- Distributes messages across partitions
  - Round-robin
  - Key hashing
- Send synchronously or asynchronously to the broker that is the leader for the partition
  - ACKS = 0 (none), 1 (leader), -1 (all ISRs)
  - Synchronous is obviously slower, but more durable
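The two distribution strategies can be sketched in a few lines. This is an illustration only: the `Partitioner` class is hypothetical, and CRC32 stands in for the hash function (the Java client's default partitioner uses murmur2).

```python
import itertools
import zlib


class Partitioner:
    """Toy partitioner: round-robin without a key, hashing with a key."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._round_robin = itertools.cycle(range(num_partitions))

    def partition_for(self, key=None):
        if key is None:
            # No key: spread messages evenly over all partitions.
            return next(self._round_robin)
        # Keyed: the same key always maps to the same partition.
        return zlib.crc32(key) % self.num_partitions


partitioner = Partitioner(num_partitions=4)
unkeyed = [partitioner.partition_for() for _ in range(5)]
keyed_first = partitioner.partition_for(b"user-42")
keyed_second = partitioner.partition_for(b"user-42")
```

Key hashing is what keeps all messages for one key in order, because they all land in the same partition.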

Testing... Testing… 1 2 3

LET’S SEE HOW FAST WE CAN PUSH

Consumers

- Read messages from a topic
- Multiple consumers can read from the same topic
- Manage their offsets
- Messages stay on Kafka after they are consumed

Testing... Testing… 1 2 3

LET’S SEE HOW FAST WE CAN RECEIVE

It’s fast! But why…?

- Efficient protocol based on message set
  - Batching messages to reduce network latency and small I/O operations
  - Append/chunk messages to increase consumer throughput
- Optimised OS operations
  - pagecache
  - sendfile()
- Broker services consumers from cache where possible
- End-to-end batch compression
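A rough way to see why batching and batch compression help is to compare compressing messages one at a time with compressing them as a single batch. The sketch below assumes JSON and zlib purely for illustration; Kafka's wire format and codecs are different, and `encode_batch` / `decode_batch` are hypothetical helpers.

```python
import json
import zlib


def encode_batch(messages):
    """Serialize a list of messages and compress them as one unit."""
    return zlib.compress(json.dumps(messages).encode("utf-8"))


def decode_batch(payload):
    """Reverse encode_batch: decompress, then parse the message list."""
    return json.loads(zlib.decompress(payload).decode("utf-8"))


messages = ['{"event": "page_view", "user_id": %d}' % i for i in range(100)]

# One compressed payload for the whole batch.
batch_size = len(encode_batch(messages))

# Total size when every message is compressed separately.
individual_size = sum(len(zlib.compress(m.encode("utf-8"))) for m in messages)
```

Because the messages share most of their structure, the batch compresses far better than the individual messages do, and it also travels as one payload instead of a hundred.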

Load balanced consumers

- Distribute load across instances in a group by allocating partitions
- Handle failure by rebalancing partitions to other instances
- Commit their offsets to Kafka

[Diagram: Cluster with Broker 1 and Broker 2 hosting partitions P0, P1, P2, P3; Consumer Group 1 (C0, C1); Consumer Group 2 (C2, C3, C4, C6)]
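Partition allocation and rebalancing within a group can be sketched as follows. The `assign_partitions` function is a hypothetical round-robin simplification; Kafka's real assignment strategies and rebalance protocol are more involved.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over the live consumers of one group, round-robin."""
    if not consumers:
        raise ValueError("a group needs at least one live consumer")
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(sorted(partitions)):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = ["P0", "P1", "P2", "P3"]

# Two healthy consumers share the four partitions.
group_1 = assign_partitions(partitions, ["C0", "C1"])

# C1 fails: rebalancing hands its partitions to the remaining instance.
after_failure = assign_partitions(partitions, ["C0"])
```

Each partition has exactly one owner per group, which is why adding more consumers than partitions to a group leaves some of them idle.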

Consumer groups and offsets

[Diagram: Cluster with Broker 1 and Broker 2 hosting partitions P0, P1, P2, P3; Consumer Group 1 (C0, C1); partition P3 with offsets 0 to 10, marked with the C0 read, C0 commit, C1 read and C1 commit positions]
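The difference between the read position and the committed position can be shown with a small simulation. `OffsetStore` is a hypothetical stand-in for the offsets Kafka keeps per group and partition, and the group and partition names are illustrative.

```python
class OffsetStore:
    """Toy stand-in for committed offsets, keyed by group and partition."""

    def __init__(self):
        self._committed = {}  # (group, partition) -> next offset to read

    def commit(self, group, partition, next_offset):
        self._committed[(group, partition)] = next_offset

    def committed(self, group, partition):
        # In this sketch a group with no commit yet starts at offset 0.
        return self._committed.get((group, partition), 0)


log_p3 = [f"m{i}" for i in range(11)]  # offsets 0..10
store = OffsetStore()

# A consumer reads four messages and commits its new position.
position = store.committed("group-1", "P3")
first_batch = log_p3[position:position + 4]
position += len(first_batch)
store.commit("group-1", "P3", position)

# It reads three more messages but stops before committing.
uncommitted = log_p3[position:position + 3]

# A restarted consumer resumes from the last committed offset,
# so the uncommitted messages are read again.
resumed_from = store.committed("group-1", "P3")
redelivered = log_p3[resumed_from:resumed_from + 3]
```

Anything read after the last commit is delivered again on restart, so consumers that commit after processing need to tolerate duplicates.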

Guarantees

Messages sent by a producer to a particular topic’s partition will be appended in the order they are sent

A consumer instance sees messages in the order they are stored in the log

For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any messages committed to the log
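The N-1 guarantee follows from each partition having N replicas on different brokers: as long as one in-sync replica survives, it can take over as leader. A simplified sketch, where `elect_leader` and the broker IDs are hypothetical:

```python
def elect_leader(replicas, in_sync, failed):
    """Return the first replica that is in sync and still alive, or None."""
    for broker in replicas:
        if broker in in_sync and broker not in failed:
            return broker
    return None


replicas = [1, 2, 3]  # replication factor N = 3
in_sync = {1, 2, 3}

leader_healthy = elect_leader(replicas, in_sync, failed=set())
leader_two_down = elect_leader(replicas, in_sync, failed={1, 2})  # N-1 failures
leader_all_down = elect_leader(replicas, in_sync, failed={1, 2, 3})
```

With two of three brokers down the partition still has a leader holding every committed message; only the loss of all N replicas leaves it unavailable.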

Ordered delivery

Messages are guaranteed to be delivered in order by partition, NOT topic

P0: M1 M3 M5

P1: M2 M4 M6

- M1 before M3 before M5 – YES
- M1 before M2 – NO
- M2 before M4 before M6 – YES
- M2 before M3 – NO
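The per-partition guarantee can be checked with a small simulation that interleaves reads from the two partitions in an arbitrary order. The function names are hypothetical, and the random interleaving stands in for whatever order the partitions happen to be fetched in.

```python
import random

partitions = {"P0": ["M1", "M3", "M5"], "P1": ["M2", "M4", "M6"]}


def consume_interleaved(partitions, seed):
    """Read every message, picking the next partition to read from at random."""
    rng = random.Random(seed)
    cursors = {name: 0 for name in partitions}
    delivered = []
    while True:
        pending = [n for n in partitions if cursors[n] < len(partitions[n])]
        if not pending:
            return delivered
        name = rng.choice(pending)
        delivered.append((name, partitions[name][cursors[name]]))
        cursors[name] += 1


def messages_from(delivered, name):
    """Filter the delivery sequence down to a single partition."""
    return [message for partition, message in delivered if partition == name]


delivered = consume_interleaved(partitions, seed=7)
```

Whatever the interleaving, `messages_from(delivered, "P0")` is always M1, M3, M5, while the relative order of M1 and M2 depends on which partition was read first.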

Enough ALT… now .NET

USING RDKAFKA-DOTNET

FIN. THANK YOU

Log compaction

- Keep the most recent payload for a key
- Use cases
  - Database change subscription
  - Event sourcing
  - Journaling for HA
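Keeping only the most recent payload per key can be sketched as a single pass over the log. The `compact` function and the sample records are hypothetical; the real log cleaner works on segment files in the background.

```python
def compact(log):
    """Keep only the latest record for each key; survivors keep their offsets."""
    latest_offset = {}
    for offset, (key, _value) in enumerate(log):
        latest_offset[key] = offset  # later records overwrite earlier ones
    return [
        (offset, key, value)
        for offset, (key, value) in enumerate(log)
        if latest_offset[key] == offset
    ]


log = [
    ("user-1", "v1"),
    ("user-2", "v1"),
    ("user-1", "v2"),
    ("user-3", "v1"),
    ("user-2", "v2"),
]
compacted = compact(log)
```

A consumer that replays the compacted log from the start ends up with the same final state per key as one that read every update.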
