Apache Kafka Topic Design



  • Topic Design Apache Kafka

    August 2018

  • 2

    Today’s Agenda

    ● Key concepts: brokers, producers, consumers, topics, partitions, keys
    ● Topic design
    ● Partition design
    ● Key design
    ● Design Process

  • Introducing Instaclustr

  • 4

    Brokers, Producers and Consumers

    Producers
    ● Applications using a Kafka producer library
    ● Generate and send messages to the brokers

    Brokers
    ● The “Kafka Cluster”
    ● Receives and stores messages from producers
    ● Makes messages sequentially available to consumers
    ● Replicates messages for HA

    Consumers
    ● Applications using a Kafka consumer library to read messages
    ● Can be grouped to spread work across multiple consumer instances
    ● Messages can be consumed multiple times

  • 5

    Topics and Partitions



    (Diagram: partitions of a topic distributed across brokers, showing “Partition - Master” and “Partition - Replica” placement)

    ● Topic
      ○ Logical grouping of data
      ○ Settings such as replication, number of partitions, log retention, compaction, etc. are controllable at the topic level
    ● Partition
      ○ Subset of messages in a topic that:
        ■ Have a single master broker
        ■ Guarantee ordered delivery within that subset
        ■ Within consumer groups, 1 consumer is assigned to read from each partition
      ○ Number of partitions is set on topic creation
      ○ Messages are mapped to partitions by key
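    The key-to-partition mapping above can be sketched in a few lines of Python. Kafka’s default partitioner actually uses a murmur2 hash of the key bytes; the CRC32-based model below (and the function name `partition_for`) is an illustrative simplification, not a Kafka API, showing only that the same key always lands on the same partition.

    ```python
    import zlib

    def partition_for(key: bytes, num_partitions: int) -> int:
        """Map a message key to a partition.

        Simplified model of Kafka's default partitioner: Kafka hashes the
        key with murmur2; CRC32 is used here purely for illustration.
        """
        return zlib.crc32(key) % num_partitions

    # Messages with the same key always map to the same partition,
    # which is what gives Kafka its per-key ordering guarantee.
    assert partition_for(b"customer-42", 6) == partition_for(b"customer-42", 6)
    ```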

  • 6

    Topic Design

    ● Minimum number of topics is implied by the number of distinct retention, replication, etc. settings you require
      ○ → you probably don’t want to mix message types with different scalability or latency requirements
    ● Maximum number of topics is largely limited by imagination
    ● In between is a set of design trade-offs:

      Fewer Topics
      ● Consumers may have to filter messages
      ● Less processing overhead of managing masters and consumers
      ● Less configuration to manage

      More Topics
      ● Consumers can read only from topics they care about
      ● Slower restarts, other processing overheads
      ● More flexibility in configuration

    ● In general, pick the minimum number of topics that allows for the required replication, retention, etc. settings, separates message types with different scale or latency profiles, and does not result in consumers reading excessive numbers of extra messages

  • 7

    Partition Design

    ● Partitions are the fundamental enabler of scale in Kafka
      ○ you can’t have more master brokers for a topic than partitions
      ○ you can’t have more than one consumer (in a consumer group) reading from a partition

    ● Too many partitions per broker can lead to long failover/restart times and higher replication latency
      ○ the number of partitions can be increased over time, but that may be a complex operation

    ● The “just right” number of partitions is therefore the greater of:
      ○ total target throughput divided by max throughput per broker, or
      ○ total target throughput divided by max throughput per consumer
      ○ !! assumes that data is equally distributed to partitions !!
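    The “greater of” sizing rule can be written as a small calculation. The throughput figures in the example are made-up assumptions (in MB/s), and `min_partitions` is an illustrative helper, not a Kafka API:

    ```python
    import math

    def min_partitions(target_tp: float, broker_tp: float, consumer_tp: float) -> int:
        """'Just right' partition count: the greater of target/broker-throughput
        and target/consumer-throughput, rounded up.

        Assumes data is equally distributed across partitions.
        """
        return math.ceil(max(target_tp / broker_tp, target_tp / consumer_tp))

    # e.g. 100 MB/s target; brokers handle 20 MB/s each, consumers 5 MB/s each.
    # Consumers are the bottleneck, so you need at least 100 / 5 = 20 partitions.
    print(min_partitions(100, 20, 5))  # -> 20
    ```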

  • 8

    Key Design

    ● Messages may optionally have a key
    ● Where a key is defined, it is used to map the message to a particular partition → messages with the same key have guaranteed ordered delivery (no key = round-robin partition assignment)
    ● Keys are also vital when compaction is used: only the most recent value for a key is retained
    ● Keys are required for Kafka Streams and some Kafka Connect functions

    ● Choice of key will be significantly driven by functional design
    ● However, poor keys can lead to performance issues as partitions receive uneven load

    ● The number of potential key values (cardinality) will be determined by your problem domain, but in general, more is better
    ● If using the default partitioner, you want at least 10x as many potential key values as partitions
    ● Ideally, message volume is roughly equal per key value
      ○ If there is large deviation but a high number of key values, it may be OK
      ○ If one key value >> average volume, it is likely an issue → split it to a separate topic or use bucketing
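    Spotting a hot key, and the bucketing mitigation, can be sketched as below. The helper names (`hot_keys`, `bucketed_key`), the 10x-the-median threshold, and the sample volumes are all illustrative assumptions; note that bucketing trades away ordering across a key’s buckets.

    ```python
    import random
    from statistics import median

    def hot_keys(volumes: dict, factor: float = 10.0) -> list:
        """Return keys whose message volume exceeds `factor` x the median
        per-key volume -- a rough proxy for 'one key >> average volume'."""
        med = median(volumes.values())
        return [k for k, v in volumes.items() if v > factor * med]

    def bucketed_key(key: str, volumes: dict, buckets: int = 8,
                     factor: float = 10.0) -> str:
        """Append a random bucket suffix to hot keys so their load spreads
        across several partitions; non-hot keys are left unchanged."""
        if key in hot_keys(volumes, factor):
            return f"{key}-{random.randrange(buckets)}"
        return key

    volumes = {"a": 100, "b": 120, "c": 90, "hot": 50_000}
    print(hot_keys(volumes))  # -> ['hot']
    ```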

  • 9

    Key distribution

  • 10

    Design Process

    ● What topics do I need?
      ○ Are there distinct streams of message types that require different processing?
      ○ Are there different requirements for message retention or resiliency?
      ○ Would splitting by topic help to reduce consumer load?

    For each topic:

    ● Do I care about ordering?
      ○ At what level (key) is ordering important?
      ○ Are there sufficient keys to distribute across Kafka partitions?
      ○ Is the message distribution per key relatively consistent?

    ● How many partitions?
      ○ What is the max expected throughput per broker and per consumer?
      ○ Partitions = total target throughput / min (broker throughput, consumer throughput) * buffer factor
      ○ Buffer factor depends on how evenly your keys distribute data to partitions
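    The final formula can be computed directly. The throughput figures and the buffer factor in the example are assumed values for illustration, and `design_partitions` is a hypothetical helper, not a Kafka API:

    ```python
    import math

    def design_partitions(target_tp: float, broker_tp: float,
                          consumer_tp: float, buffer_factor: float = 1.5) -> int:
        """Partitions = total target throughput / min(broker, consumer
        throughput) * buffer factor.

        The buffer factor pads for keys that don't distribute data to
        partitions perfectly evenly.
        """
        return math.ceil(target_tp / min(broker_tp, consumer_tp) * buffer_factor)

    # 100 MB/s target, 20 MB/s per broker, 5 MB/s per consumer, 1.5x buffer:
    # 100 / 5 * 1.5 = 30 partitions.
    print(design_partitions(100, 20, 5, buffer_factor=1.5))  # -> 30
    ```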

  • The open source-as-a-service company, delivering reliability at scale.