Posted on 16-Apr-2017
101* ways to configure Kafka - badly
Audun Fauchald Strand, Lead Developer Infrastructure
@audunstrand. Bio: gof, mq, ejb, mda, wli, bpel, eda, soa, ws*, esb, ddd
Henning Spjelkavik, Architect
@spjelkavik. Bio: Skiinfo (Vail Resorts), FINN.no. Enjoys reading jstacks
agenda
introduction to kafka
kafka @ finn.no
101* mistakes
questions
“From a certain point onward
there is no longer any turning
back. That is the point that
must be reached.”
― Franz Kafka, The Trial
Top 5
1. no consideration of data on the inside vs outside
2. schema not externally defined
3. same config for every client/topic
4. 128 partitions as default config
5. running on 8 overloaded nodes
FINN.no
2nd largest website in Norway
classified ads (eBay and Zillow in one)
60 million pageviews a day
80 microservices
130 developers
1000 deploys to production a week
6 minutes from commit to deploy (median)
#kafkasummit @spjelkavik @audunstrand
FINN.no is a part of Schibsted Media Group: 6,800 people in 30 countries
kafka @ finn.no
architecture
use cases
tools
in the beginning ...
Architecture governance board decided to use RabbitMQ as message queue.
Kafka was installed for a proof of concept after developers spotted it in January 2013.
2013 - POC
“High” volume
Stream of classified ads
Ad matching
Ad indexed
[Diagram: 8 shared nodes (mod01-mod08), each running both ZooKeeper and Kafka, split across dc 1 and dc 2]
Version 0.8.1
4 partitions
common client java library
thrift
2014 - Adoption and complaining
low volume / high reliability
Ad Insert
Product Orchestration
Payment
Build Pipeline
click streams
[Diagram: same 8 shared nodes (mod01-mod08), each running both ZooKeeper and Kafka, split across dc 1 and dc 2]
Version 0.8.1
4 partitions
experimenting with configuration
common java library
tooling
alerting
2015 - Migration and consolidation
“reliable messaging”
asynchronous communication between services
store and forward
zipkin
slack notifications
Version 0.8.2
5-20 partitions
multiple configurations
[Diagram: 5 broker nodes (broker01-broker05), each running ZooKeeper and Kafka, across dc 1 and dc 2]
tooling
Grafana dashboard visualizing JMX stats
kafka-manager
kafkacat
2016 - Confluent
[Diagram: 5 Kafka brokers (broker01-broker05) and a separate 5-node ZooKeeper ensemble (zk01-zk05)]
platform
schema registry
data replication
kafka connect
kafka streams
101* mistakes
“God gives the nuts, but he does not crack them.” ― Franz Kafka
Pattern Language
why is it a mistake
what is the consequence
what is the correct solution
what has finn.no done
Top 5
1. no consideration of data on the inside vs outside
2. schema not externally defined
3. same config for every client/topic
4. 128 partitions as default config
5. running on 8 overloaded nodes
mistake: no consideration of data on the inside vs outside
https://flic.kr/p/6MjhUR
why is it a mistake
everything published on Kafka (0.8.2) is visible to any client that can access the cluster
what is the consequence
direct reads across services/domains are quite normal in legacy and/or enterprise systems
coupling makes it hard to make changes
unknown and unwanted coupling has a cost
Kafka had no security per topic - you must add that yourself
what is the correct solution
Consider what is data on the inside, versus data on the outside
Convention for what is private data and what is public data
If you want to change your internal representation often, map it before publishing it publicly (Anti corruption layer)
what has finn.no done
Decided on a naming convention (e.g. Public.xyzzy) for public topics
Communicates the intention (contract)
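Such a convention is easy to check mechanically. A minimal sketch (the helper name is ours; only the `Public.` prefix comes from the convention above):

```python
def is_public_topic(name: str) -> bool:
    """Return True if a topic is public under the naming convention.

    Topics prefixed with 'Public.' carry the contract that outside
    consumers may read them; everything else is data on the inside.
    """
    return name.startswith("Public.")


# the convention in use:
assert is_public_topic("Public.xyzzy")        # data on the outside: safe to consume
assert not is_public_topic("adinsert.events") # data on the inside: private to the service
```

A check like this can run at topic-creation time or in a build step, so the contract is enforced rather than merely documented.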
mistake: schema not externally defined
why is it a mistake
data and code need separate versioning strategies
version should be part of the data
defining schema in a java library makes it more difficult to access data from non-jvm languages
very little discoverability of data, people chose other means to get their data
difficult to create tools
what is the consequence
development speed outside the JVM has been slow
change of data needs coordinated deployment
no process for data versioning, like backwards compatibility checks
difficult to create tooling that needs to know data format, like data lake and database sinks
what is the correct solution
the confluent.io platform has a separate schema registry
apache avro
multiple compatibility settings and evolutions strategies
connect
Take complexity out of the applications
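As an illustration of an externally defined schema, a small Avro record could look like this (the record and field names are invented for the example); the schema registry stores it centrally and checks compatibility when it evolves:

```json
{
  "type": "record",
  "name": "AdPublished",
  "namespace": "no.finn.example",
  "fields": [
    {"name": "adId",  "type": "long"},
    {"name": "title", "type": "string"},
    {"name": "price", "type": ["null", "long"], "default": null}
  ]
}
```

Because the optional field has a default, consumers on the previous schema version can still read new records - the kind of backwards-compatibility guarantee that is hard to get when the schema lives inside a java library.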
what has finn.no done
still using the java library, with schemas in builders
confluent platform 2.0 is planned for the next step, not (just) kafka 0.9
mistake: running mixed load with a single, default configuration
https://flic.kr/p/qbarDR
why is it a mistake
Historically - One Big Database with Expensive License
Database world - OLTP and OLAP
Changed with Open Source software and Cloud
Tried to simplify the developer's day with a single config
Kafka supports both very high throughput and high reliability
what is the consequence
Trade-off between throughput and degree of reliability
With a single configuration - the last commit wins
Either high throughput, and risk of loss - or potentially too slow
what is the correct solution
Understand your use cases and their needs!
Use proper per-topic configuration
Consider splitting / isolation
what has finn.no done
Defaults that are quite reliable
Exposing configuration variables in the client
Ask the questions:
● at least once delivery
● ordering - if you partition, what must have strict ordering
● 99% delivery - is that good enough?
● what level of throughput is needed
Configuration
Configuration for production
● Partitions
● Replicas (default.replication.factor)
● Minimum ISR (min.insync.replicas)
● Wait for acknowledge when producing messages (request.required.acks, block.on.buffer.full)
● Retries
● Leader election
Configuration for consumers
● Number of threads
● When to commit (auto.commit.enable vs consumer.commitOffsets)
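To make the per-topic trade-off concrete, here is a sketch of a reliability-oriented producer profile using 0.8-era producer property names; the values are illustrative, not FINN.no's actual configuration:

```properties
# Reliable profile (e.g. payments):
request.required.acks=-1      # wait for all in-sync replicas to acknowledge
block.on.buffer.full=true     # block the producer rather than drop messages
message.send.max.retries=3    # retry transient broker failures

# A throughput-oriented profile (e.g. click streams) would instead use
# request.required.acks=1 and tolerate occasional loss.
```

The point is that these live per client/topic, not in one shared config where the last commit wins.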
Gwen Shapira recommends...
● acks = all
● block.on.buffer.full = true
● retries = MAX_INT
● max.in.flight.requests.per.connection = 1
● Producer.close()
● replication-factor >= 3
● min.insync.replicas = 2
● unclean.leader.election.enable = false
● auto.commit.enable = false
● commit after processing
● monitor!
mistake: default configuration of 128 partitions for each topic
https://flic.kr/p/6KxPgZ
why is it a mistake
partitions are Kafka's way of scaling consumers; 128 partitions can handle 128 consumer processes
in 0.8, clusters could not reduce the number of partitions without deleting data
highest number of consumers today is 20
what is the consequence
our 0.8 cluster was configured with 128 partitions as default, for all topics
many partitions and many topics create many datapoints that must be coordinated
zookeeper must coordinate all this
rebalance must balance all clients on all partitions
zookeeper and kafka went down (may 2015)
users could not create ads for two days
what is the correct solution
small number of partitions as default
increase number of partitions for selected topics
understand your use case (throughput target)
reduce length of transactions on consumer side
Max partitions per broker: ~1500 advised in our case - we had 38k
http://www.confluent.io/blog/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/
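The arithmetic behind that advice is simple; a sketch (the topic count of 100 is hypothetical, while the ~1500 advised and 38k observed figures come from the talk above):

```python
def partitions_per_broker(topics: int, partitions_per_topic: int,
                          replication_factor: int, brokers: int) -> int:
    """Partition replicas each broker hosts, assuming an even spread."""
    return (topics * partitions_per_topic * replication_factor) // brokers


# 100 topics (hypothetical) at the old 128-partition default,
# replication factor 3, spread over 8 brokers:
assert partitions_per_broker(100, 128, 3, 8) == 4800  # well above the ~1500 advised
# The same topics with 5 partitions as the default:
assert partitions_per_broker(100, 5, 3, 8) == 187
```

A high default multiplies with every new topic, which is why the default matters more than any single topic's setting.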
what has finn.no done
5 partitions as default
2 heavy-traffic topics have more than 5 partitions
mistake: deploying a proof-of-concept hack in production; i.e. why we had 8 zk nodes
https://flic.kr/p/6eoSgT
why is it a mistake
Kafka was set up by Ops for a proof of concept - not for hardened production use
By coincidence we had 8 nodes for kafka, and the same 8 nodes for zookeeper
Zookeeper depends on a majority quorum and low latency between nodes
The 8 nodes were NOT dedicated - in fact - they were overloaded already
what is the consequence
Zookeeper recommends 3 nodes for normal usage, 5 for high load; any more is questionable
More nodes mean longer time to reach consensus and more communication
If we get a split between data centers, there will be 4 in each
You should not run Zk between data centers, due to latency and outage possibilities
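The majority-quorum point can be made concrete with a minimal sketch:

```python
def has_quorum(alive: int, ensemble_size: int) -> bool:
    """A ZooKeeper ensemble makes progress only with a strict majority."""
    return alive > ensemble_size // 2


# 8-node ensemble split 4/4 between data centers:
# neither side has a majority, so neither side can elect a leader.
assert not has_quorum(4, 8)
# 5-node ensemble with 3 nodes in the surviving data center: still has quorum.
assert has_quorum(3, 5)
# 3-node ensemble tolerates one failure.
assert has_quorum(2, 3)
```

This is why an even ensemble size buys nothing: 8 nodes tolerate the same 3 failures as 7, while an even split across data centers halts both halves.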
what is the correct solution
Have an odd number of Zookeeper nodes - preferably 3, at most 5
Don’t cross data centers
Check the documentation before deploying serious production load
Don’t run a sensitive service (Zookeeper) on a server with 50 JVM-based services, 300% overcommitted on RAM
Watch GC times
what has finn.no done
[Diagram: 5 broker nodes (broker01-broker05), each running ZooKeeper and Kafka, across dc 1 and dc 2]
Version 0.8.2
5-20 partitions
multiple configurations
“They say ignorance is bliss.... they're wrong ” ― Franz Kafka
References / Further reading
Designing Data-Intensive Applications, Martin Kleppmann
Data on the inside - data on the outside, Pat Helland
I Heart Logs, Jay Kreps
The Confluent Blog, http://confluent.io/
Kafka - The definitive guide
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+papers+and+presentations
http://www.finn.no/apply-here
http://www.schibsted.com/en/Career/
“It's only because of their stupidity that they're able to be so sure of themselves.” ― Franz Kafka, The Trial
Audun Fauchald Strand
@audunstrand
Henning Spjelkavik
@spjelkavik
http://www.finn.no/apply-here
http://www.schibsted.com/en/Career/
Q?
Runner up
Using pre-1.0 software
Have control of topic creation
Kafka is storage - treat it like one also ops-wise
Client side rebalancing, misunderstood
Committing on all consumer threads, believing that you only committed on one