Apache Kafka Topic Design

August 2018

Today’s Agenda

● Key concepts: brokers, producers, consumers, topics, partitions, keys
● Topic design
● Partition design
● Key design
● Design process

Introducing Instaclustr

Brokers, Producers and Consumers

Producers

● Applications using a Kafka producer library
● Generate and send messages to the brokers

Brokers

● The “Kafka Cluster”
● Receive and store messages from producers
● Make messages sequentially available to consumers
● Replicate messages for HA

Consumers

● Applications using a Kafka consumer library to read messages
● Can be grouped to spread work across multiple consumer instances
● Messages can be consumed multiple times
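
The roles above can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions: `ToyBroker`, `send` and `poll` are hypothetical names, not part of any Kafka client library, and the model covers a single partition with no replication. It shows messages being stored in order and read independently by each consumer group.

```python
from typing import Dict, List


class ToyBroker:
    """In-memory stand-in for one partition of one topic (not a real Kafka API)."""

    def __init__(self) -> None:
        self.log: List[str] = []           # messages in arrival order
        self.offsets: Dict[str, int] = {}  # next offset to read, per consumer group

    def send(self, message: str) -> int:
        """Producer side: append a message and return its offset."""
        self.log.append(message)
        return len(self.log) - 1

    def poll(self, group: str) -> List[str]:
        """Consumer side: return unread messages for this group, in order."""
        start = self.offsets.get(group, 0)
        batch = self.log[start:]
        self.offsets[group] = len(self.log)
        return batch
```

Because each group tracks its own offset, two hypothetical groups such as `"billing"` and `"audit"` would each receive every message once, which is the sense in which messages can be consumed multiple times.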

Topics and Partitions

[Diagram: partitions 1-4 of a topic distributed across brokers, each partition with one master copy and replica copies. Legend: Broker, Topic, Partition - Master, Partition - Replica]

● Topic
○ Logical grouping of data
○ Settings such as replication, number of partitions, log retention, compaction, etc. are controllable at topic level
● Partition
○ Subset of messages in a topic that:
■ Have a single master broker
■ Guarantee ordered delivery within that subset
■ Are read by exactly one assigned consumer within each consumer group
○ Number of partitions is set on topic creation
○ Messages are mapped to partition by key
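
A minimal sketch of key-to-partition mapping follows. It assumes a hash-modulo scheme and uses CRC32 as a stand-in hash (Kafka's Java client uses murmur2). `partition_for` and `route` are hypothetical helper names introduced only for illustration.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple


def partition_for(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition index (stand-in hash, not Kafka's murmur2)."""
    return zlib.crc32(key) % num_partitions


def route(messages: List[Tuple[bytes, str]], num_partitions: int) -> Dict[int, List[str]]:
    """Append each (key, value) message to its partition, preserving arrival order."""
    partitions: Dict[int, List[str]] = defaultdict(list)
    for key, value in messages:
        partitions[partition_for(key, num_partitions)].append(value)
    return dict(partitions)
```

Since the same key always hashes to the same partition, all messages for that key land in one ordered log, which is what gives per-key ordering.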

Topic Design

● Minimum number of topics is implied by the number of distinct retention, replication, etc. settings you require
○ -> you probably don’t want to mix message types with different scalability or latency requirements
● Maximum number of topics is largely limited by imagination
● In between is a set of design trade-offs:

Fewer Topics | More Topics
Consumers may have to filter messages | Consumers can read only from topics they care about
Less processing overhead of managing masters and consumers | Slower restarts, other processing overheads
Less configuration to manage | More flexibility in configuration

● In general, pick the minimum number of topics that allows for required replication, retention, etc. settings, separates message types with different scale or latency profiles, and does not result in consumers reading large numbers of messages they do not need
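
One way to estimate the minimum number of topics is to group message types by the topic-level settings they need. The sketch below assumes three settings and hypothetical message type names. `TopicSettings` and `group_by_settings` are illustrative, not part of Kafka.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class TopicSettings(NamedTuple):
    """Hypothetical subset of topic-level settings used for grouping."""
    retention_hours: int
    replication_factor: int
    compacted: bool


def group_by_settings(message_types: Dict[str, TopicSettings]) -> Dict[TopicSettings, List[str]]:
    """Group message types that can share a topic because their settings match."""
    groups: Dict[TopicSettings, List[str]] = defaultdict(list)
    for name, settings in message_types.items():
        groups[settings].append(name)
    return dict(groups)


# Hypothetical message types for illustration only.
example = {
    "orders": TopicSettings(168, 3, False),
    "payments": TopicSettings(168, 3, False),
    "customer-profile": TopicSettings(24, 3, True),
}
minimum_topics = len(group_by_settings(example))  # lower bound on topic count
```

The result is only a lower bound: message types with matching settings may still deserve separate topics if their scale or latency profiles differ.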

Partition Design

● Partitions are the fundamental enabler of scale in Kafka
○ you can’t have more master brokers for a topic than partitions
○ you can’t have more than one consumer (in a consumer group) reading from a partition
● Too many partitions per broker can lead to long failover/restart times and higher replication latency
○ Number of partitions can be increased over time, but this may be a complex operation, since adding partitions changes the key-to-partition mapping
● “Just right” number of partitions is therefore the greater of:
○ total target throughput divided by max throughput per broker, or
○ total target throughput divided by max throughput per consumer
○ !! assumes that data is equally distributed to partitions !!
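
The "greater of" rule can be written compactly. The symbols here are assumptions introduced for illustration: $T$ is total target throughput, $B$ is max throughput per broker, $C$ is max throughput per consumer, and $P$ is the partition count. The numbers in the worked line are hypothetical.

```latex
\[
P \;=\; \max\!\left(\frac{T}{B},\ \frac{T}{C}\right) \;=\; \frac{T}{\min(B,\,C)}
\]
% Hypothetical figures: T = 100 MB/s, B = 20 MB/s, C = 10 MB/s
\[
P \;=\; \max\!\left(\frac{100}{20},\ \frac{100}{10}\right) \;=\; \max(5,\ 10) \;=\; 10
\]
```

In this example the consumer is the slower side, so it sets the partition count.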

Key Design

● Messages may optionally have a key
● Where a key is defined, it is used to map the message to a particular partition -> messages with the same key will have guaranteed ordered delivery (no key = round robin partition assignment)
● Keys are also vital when compaction is used: only the most recent value for a key is retained
● Keys are required for Kafka Streams and some Kafka Connect functions
● Choice of key will be significantly driven by functional design
● However, poor keys can lead to performance issues as partitions receive uneven load
● The number of potential key values (cardinality) will be determined by your problem domain, but in general more is good
● If using the default partitioner, aim for at least 10x as many potential key values as partitions
● Ideally, message volume is roughly equal per key value
○ If large deviation but high number of key values, then may be OK
○ If one key value >> average volume, then likely an issue -> split to separate topic or use bucketing
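
The effect of a hot key, and of bucketing it, can be sketched as follows. This is an illustration under assumptions: CRC32 stands in for the real partitioner hash (Kafka's Java client uses murmur2), the `key#bucket` naming is one hypothetical convention, and all function names are made up for this example.

```python
import zlib
from collections import Counter
from typing import Dict


def partition_for(key: str, num_partitions: int) -> int:
    """Stand-in partitioner: CRC32 modulo partition count (not Kafka's murmur2)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_load(volume_by_key: Dict[str, int], num_partitions: int) -> Counter:
    """Sum message volume per partition for a key -> volume mapping."""
    load: Counter = Counter()
    for key, volume in volume_by_key.items():
        load[partition_for(key, num_partitions)] += volume
    return load


def bucket_hot_key(volume_by_key: Dict[str, int], hot_key: str, buckets: int) -> Dict[str, int]:
    """Split one hot key into sub-keys 'key#0', 'key#1', ... sharing its volume."""
    result = {k: v for k, v in volume_by_key.items() if k != hot_key}
    share, remainder = divmod(volume_by_key[hot_key], buckets)
    for b in range(buckets):
        # Spread any remainder over the first few buckets.
        result[f"{hot_key}#{b}"] = share + (1 if b < remainder else 0)
    return result
```

Comparing `partition_load` before and after `bucket_hot_key` shows how much of the hot key's volume moves off a single partition. The trade-off is that ordering is then only guaranteed within each sub-key, not across the original key.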

Key distribution

Design Process

● What topics do I need?
○ Are there distinct streams of message types that require different processing?
○ Are there any different requirements for message retention or resiliency?
○ Would splitting by topic help to reduce consumer load?

For each topic:

● Do I care about ordering?
○ At what level (key) is ordering important?
○ Are there sufficient keys to distribute across Kafka partitions?
○ Is the message distribution per key relatively consistent?
● How many partitions?
○ What is the maximum expected throughput per broker and per consumer?
○ Partitions = total target throughput / min(broker throughput, consumer throughput) * buffer factor
○ Buffer factor depends on how evenly your keys distribute data to partitions
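
The partition formula can be sketched as a small helper. Rounding up to a whole number of partitions is an assumption added here, and the function and parameter names are hypothetical. Throughput values can be in any unit as long as all three use the same one.

```python
import math


def partitions_needed(
    total_throughput: float,
    broker_throughput: float,
    consumer_throughput: float,
    buffer_factor: float = 1.0,
) -> int:
    """Partitions = total / min(broker, consumer) * buffer factor, rounded up."""
    if min(total_throughput, broker_throughput, consumer_throughput) <= 0:
        raise ValueError("throughput values must be positive")
    if buffer_factor < 1.0:
        raise ValueError("buffer_factor should be at least 1.0")
    bottleneck = min(broker_throughput, consumer_throughput)
    return math.ceil(total_throughput / bottleneck * buffer_factor)
```

For example, with a hypothetical target of 100 MB/s, 20 MB/s per broker, 10 MB/s per consumer and a buffer factor of 1.5, `partitions_needed(100, 20, 10, 1.5)` gives 15. A larger buffer factor is appropriate when keys distribute data unevenly across partitions.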

The open source-as-a-service company, delivering reliability at scale.

Questions?
