Streaming in Practice - Putting Apache Kafka in Production


Transcript of Streaming in Practice - Putting Apache Kafka in Production

Page 1: Streaming in Practice - Putting Apache Kafka in Production

1

Streaming in Practice: Putting Apache Kafka in Production

Roger Hoover, Engineer, Confluent

Page 2: Streaming in Practice - Putting Apache Kafka in Production

2

Apache Kafka: Online Talk Series

Part 1: September 27 | Part 2: October 6 | Part 3: October 27 | Part 4: November 17 | Part 5: December 1 | Part 6: December 15

• Introduction To Streaming Data and Stream Processing with Apache Kafka
• Deep Dive into Apache Kafka
• Demystifying Stream Processing with Apache Kafka
• Data Integration with Apache Kafka
• A Practical Guide to Selecting a Stream Processing Technology
• Streaming in Practice: Putting Apache Kafka in Production

https://www.confluent.io/apache-kafka-talk-series/

Page 3: Streaming in Practice - Putting Apache Kafka in Production

3

Agenda
• Kafka Basics
• Tuning Kafka For Your Application
• Data Balancing
• Spanning Multiple Datacenters

Page 4: Streaming in Practice - Putting Apache Kafka in Production

4

Agenda
• Kafka Basics
• Tuning Kafka For Your Application
• Data Balancing
• Spanning Multiple Datacenters

Page 5: Streaming in Practice - Putting Apache Kafka in Production

5

Page 6: Streaming in Practice - Putting Apache Kafka in Production

6

Architecture

[Diagram: a Kafka cluster of brokers 1 through n hosting topic partitions, with multiple producers writing to it and multiple consumers reading from it, coordinated by a ZooKeeper cluster running on servers 1-3.]

Page 7: Streaming in Practice - Putting Apache Kafka in Production

7

Operations
• Simple Deployment
• Rolling Upgrades
• Good metrics for component monitoring

Page 8: Streaming in Practice - Putting Apache Kafka in Production

8

Agenda
• Kafka Basics
• Tuning Kafka For Your Application
• Data Balancing
• Spanning Multiple Datacenters

Page 9: Streaming in Practice - Putting Apache Kafka in Production

9

Two Example Apps
• User activity tracking
  • Collect page view events while users are browsing our web and mobile storefronts
  • Persist the data to HDFS for subsequent use in a recommendation engine
• Inventory adjustments
  • Track sales, maintain inventory, and re-order on demand

Page 10: Streaming in Practice - Putting Apache Kafka in Production

10

Application Priorities
• User activity tracking
  • High throughput (100x the sales stream)
  • Availability is most important
  • Low retention required: 3 days
• Inventory adjustments
  • Relatively low throughput
  • Durability is most important
  • Long retention required: 6 months

Page 11: Streaming in Practice - Putting Apache Kafka in Production

11

Knobs
- Partition count
- Replication factor
- Retention
- Batching + compression
- Producer send acknowledgements
- Minimum ISRs
- Unclean Leader Election

Page 12: Streaming in Practice - Putting Apache Kafka in Production

12

Partition Count
- Partitions are the unit of consumer parallelism
- Over-partition your topics (especially keyed topics)
- Easy to add consumers, but hard to add partitions for keyed topics
- Kafka can support on the order of tens of thousands of partitions
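
To make the consumer-parallelism point concrete, here is a minimal Java consumer sketch. The broker address, topic name, and group id are illustrative, not from the talk. Every worker started with the same group.id joins one consumer group, and each partition is assigned to at most one member of that group, so the partition count is the ceiling on how many workers can read in parallel.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PageViewWorker {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka1:9092");      // illustrative broker address
        props.put("group.id", "page-view-loader");           // all workers share this group id
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("page-views"));
            while (true) {
                // Partitions are divided among the live members of the group, so running
                // more copies of this worker (up to the partition count) adds parallelism.
                ConsumerRecords<String, String> records = consumer.poll(1000);
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                                      record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```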

Page 13: Streaming in Practice - Putting Apache Kafka in Production

13

Partition Count
- High Throughput (User activity tracking)
  - Large number of partitions (~100)
- Fewer Resources (Inventory adjustments)
  - Smaller number of partitions (< 50)

Page 14: Streaming in Practice - Putting Apache Kafka in Production

14

Replication Factor
- More replicas require more storage, disk I/O, and network bandwidth
- More replicas can tolerate more failures

[Diagram: four brokers, each storing logs for a mix of topic1 and topic2 partitions; each partition's replicas are placed on several different brokers.]

Page 15: Streaming in Practice - Putting Apache Kafka in Production

15

Replication Factor
- Lower cost (User activity tracking)
  - replication.factor = 2
- High Fault Tolerance (Inventory adjustments)
  - replication.factor = 3
- Defaults to 1

Page 16: Streaming in Practice - Putting Apache Kafka in Production

16

Retention
- Retention time can be set per topic
- Longer retention times require more storage (imagine that!)
- Longer retention allows consumers to rewind further back in time
  - Part of the consumer's SLA!

Page 17: Streaming in Practice - Putting Apache Kafka in Production

17

Retention
- Less Storage (User activity tracking)
  - log.retention.hours=72 (3 days)
- Longer Time Travel (Inventory adjustments)
  - log.retention.hours=4380 (6 months)
- Default is 7 days
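
As a sketch of how the partition-count, replication-factor, and retention knobs come together for the activity-tracking topic, the snippet below creates it programmatically. It uses the Java AdminClient API, which arrived in releases newer than the 0.10.x versions discussed in this talk (with 0.10.x you would use the kafka-topics.sh tool instead); the topic name, counts, and broker address are illustrative.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreatePageViewsTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka1:9092");   // illustrative broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // High-throughput, availability-oriented topic: many partitions, modest
            // replication factor, and only 3 days of retention (72 h = 259,200,000 ms).
            NewTopic pageViews = new NewTopic("page-views", 100, (short) 2)
                    .configs(Collections.singletonMap("retention.ms", "259200000"));
            admin.createTopics(Collections.singletonList(pageViews)).all().get();
        }
    }
}
```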

Page 18: Streaming in Practice - Putting Apache Kafka in Production

18

Side-note: Time Travel
- Kafka 0.10.1 supports rewinding by time
  - E.g. "Rewind to 10 minutes ago"
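
A minimal sketch of time-based rewind with the consumer API added in 0.10.1: offsetsForTimes() maps a timestamp to the earliest offset whose message timestamp is at or after it, and seek() repositions the consumer there. The broker address, topic, and partition are illustrative.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;

public class RewindTenMinutes {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka1:9092");   // illustrative
        props.put("group.id", "rewind-demo");
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("page-views", 0);
            consumer.assign(Collections.singletonList(tp));

            // "Rewind to 10 minutes ago": find the first offset whose timestamp is
            // at or after (now - 10 minutes), then seek to it.
            long tenMinutesAgo = System.currentTimeMillis() - 10 * 60 * 1000L;
            Map<TopicPartition, OffsetAndTimestamp> offsets =
                    consumer.offsetsForTimes(Collections.singletonMap(tp, tenMinutesAgo));
            OffsetAndTimestamp target = offsets.get(tp);
            if (target != null) {                 // null if no message is that recent
                consumer.seek(tp, target.offset());
            }
            consumer.poll(1000);                  // resumes reading from the rewound position
        }
    }
}
```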

Page 19: Streaming in Practice - Putting Apache Kafka in Production

19

Batching & Compression
- Producer: batch.size, linger.ms, compression.type
- Consumer: fetch.min.bytes, fetch.wait.max.ms

[Diagram: the producer groups send() calls into compressed batches and flushes them asynchronously to the broker; the consumer's poll() fetches the compressed batches.]

Page 20: Streaming in Practice - Putting Apache Kafka in Production

20

Batching & Compression
- High throughput (User activity tracking)
  - Producer: compression.type=lz4, batch.size (256KB), linger.ms (~10ms) or flush manually
  - Consumer: fetch.min.bytes (256KB), fetch.wait.max.ms (~10ms)
- Low latency (Inventory adjustments)
  - Producer: linger.ms=0
  - Consumer: fetch.min.bytes=1
- Defaults
  - compression.type = none
  - linger.ms = 0 (i.e. send immediately)
  - fetch.min.bytes = 1 (i.e. receive immediately)
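
The throughput-oriented settings above translate directly into client properties; a sketch follows. The values are the ones suggested on the slide, the broker address and group id are illustrative, and the new-consumer spelling fetch.max.wait.ms is assumed for the fetch-wait setting.

```java
import java.util.Properties;

public class ThroughputTunedConfigs {
    // Producer tuned for throughput: compress batches and allow a short linger so
    // batches have time to fill before they are sent.
    static Properties producerProps() {
        Properties p = new Properties();
        p.put("bootstrap.servers", "kafka1:9092");     // illustrative
        p.put("compression.type", "lz4");
        p.put("batch.size", "262144");                 // 256 KB per-partition batch
        p.put("linger.ms", "10");                      // wait up to 10 ms to fill a batch
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        return p;
    }

    // Consumer tuned for throughput: let the broker accumulate at least 256 KB
    // (or 10 ms worth) of data before answering a fetch request.
    static Properties consumerProps() {
        Properties c = new Properties();
        c.put("bootstrap.servers", "kafka1:9092");     // illustrative
        c.put("group.id", "page-view-loader");
        c.put("fetch.min.bytes", "262144");
        c.put("fetch.max.wait.ms", "10");
        c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        return c;
    }
}
```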

Page 21: Streaming in Practice - Putting Apache Kafka in Production

21

Producer Acknowledgements on Send

[Diagram: the producer sends to the partition leader on broker 1 (step 1); the followers on brokers 2 and 3 replicate the write (step 2); the write is committed (step 3) and the leader acknowledges the producer (step 4). Consumers read only committed messages.]

When producer receives ack    | Latency                  | Durability on failures
acks=0 (no ack)               | no network delay         | some data loss
acks=1 (wait for leader)      | one network round trip   | a little data loss
acks=all (wait for committed) | two network round trips  | no data loss


Page 22: Streaming in Practice - Putting Apache Kafka in Production

22

Producer Acknowledgements on Send
- Throughput++ (User activity tracking)
  - acks = 1
- Durability++ (Inventory adjustments)
  - acks = all
- Default
  - acks = 1
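
A sketch of the durability-oriented producer side for the inventory stream: acks=all makes the broker acknowledge only after the write is committed to the in-sync replicas, and the send callback surfaces any failure so the application can retry or alert instead of silently dropping the adjustment. The topic name, key, value, and broker address are illustrative.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class InventoryProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka1:9092");   // illustrative
        props.put("acks", "all");                        // wait until the write is committed
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> adjustment =
                    new ProducerRecord<>("inventory-adjustments", "sku-123", "-2");
            producer.send(adjustment, (metadata, exception) -> {
                if (exception != null) {
                    // The write was not acknowledged as committed; handle it explicitly.
                    exception.printStackTrace();
                } else {
                    System.out.printf("committed to partition %d at offset %d%n",
                                      metadata.partition(), metadata.offset());
                }
            });
            producer.flush();
        }
    }
}
```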

Page 23: Streaming in Practice - Putting Apache Kafka in Production

23

In-Sync Replicas (ISRs)

[Diagram: for topic1-part1, the leader on broker 1 and the followers on brokers 2 and 3 all contain messages m1 and m2; all three replicas are in the ISR, and m2 is the last committed message.]

In-sync: the replica has read up to the leader's log end within replica.lag.time.max.ms

Page 24: Streaming in Practice - Putting Apache Kafka in Production

24

Minimum In-Sync Replicas

[Diagram: the leader for topic1-part1 on broker 1 has accumulated messages m1-m5 while the followers on brokers 2 and 3 still only have m1 and m2; the ISR has shrunk and the last committed message trails the leader's log end.]

- Topic config to tell Kafka how to handle writes during severe outages (rare)
- The leader will reject writes if the ISR count is too small (e.g. topic1: min.insync.replicas=2)

Page 25: Streaming in Practice - Putting Apache Kafka in Production

25

Minimum In-Sync Replicas
- Availability++ (User activity tracking)
  - min.insync.replicas = 1
- Durability++ (Inventory adjustments)
  - min.insync.replicas = 2
- Defaults to 1
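
min.insync.replicas is a per-topic override, so it can also be raised on an existing topic. Below is a sketch using the Java AdminClient (again a newer API than the 0.10.x tooling discussed; with 0.10.x you would set the topic override with the command-line tools instead); the topic name and broker address are illustrative. With acks=all, producers then receive a "not enough replicas" error instead of a successful ack whenever fewer than this many replicas are in sync.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class RaiseMinIsr {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka1:9092");   // illustrative

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic =
                    new ConfigResource(ConfigResource.Type.TOPIC, "inventory-adjustments");
            Config update = new Config(Collections.singletonList(
                    new ConfigEntry("min.insync.replicas", "2")));
            // Require at least 2 in-sync replicas before the leader accepts acks=all writes.
            admin.alterConfigs(Collections.singletonMap(topic, update)).all().get();
        }
    }
}
```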

Page 26: Streaming in Practice - Putting Apache Kafka in Production

26

Unclean Leader Election
- Topic config to tell Kafka how to handle topic leadership during severe outages (rare)
- Allows automatic recovery in exchange for losing data

[Diagram: the leader for topic1-part1 on broker 1 fails while holding messages that the other replicas have not yet copied; if an out-of-sync follower is elected as the new leader, the uncopied messages are lost.]

Page 27: Streaming in Practice - Putting Apache Kafka in Production

27

Unclean Leader Election
- Availability++ (User activity tracking)
  - unclean.leader.election.enable = true
- Durability++ (Inventory adjustments)
  - unclean.leader.election.enable = false
- Defaults to true

Page 28: Streaming in Practice - Putting Apache Kafka in Production

28

Mission Critical Data
- Producer acknowledgements
  - acks = all
- Replication factor
  - replication.factor = 3
- Minimum ISRs
  - min.insync.replicas = 2
- Unclean Leader Election
  - unclean.leader.election.enable = false
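
Pulling the recipe together for the inventory topic, here is a sketch that creates it with the durability settings above, plus the six-month retention from earlier. As before, the AdminClient API assumes a newer client and broker than the 0.10.x tooling discussed, and the topic name, partition count, and broker address are illustrative; producers writing to this topic would use acks=all as shown earlier.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateInventoryTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka1:9092");            // illustrative

        Map<String, String> configs = new HashMap<>();
        configs.put("min.insync.replicas", "2");                   // reject writes if ISR < 2
        configs.put("unclean.leader.election.enable", "false");    // never elect an out-of-sync leader
        configs.put("retention.ms", "15768000000");                 // ~6 months (4380 hours)

        try (AdminClient admin = AdminClient.create(props)) {
            // Durability first: 3 replicas, modest partition count for the low-volume stream.
            NewTopic inventory =
                    new NewTopic("inventory-adjustments", 12, (short) 3).configs(configs);
            admin.createTopics(Collections.singletonList(inventory)).all().get();
        }
    }
}
```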

Page 29: Streaming in Practice - Putting Apache Kafka in Production

29

Agenda
• Kafka Basics
• Tuning Kafka For Your Application
• Data Balancing
• Spanning Multiple Datacenters

Page 30: Streaming in Practice - Putting Apache Kafka in Production

30

Replica Placement
• Partitions are replicated
• Replicas are spread evenly across the cluster
  • Only when the topic is created or modified

[Diagram: initial placement spreads the replicas of the topic1 and topic2 partitions evenly across the logs of brokers 1-4.]

Page 31: Streaming in Practice - Putting Apache Kafka in Production

31

Replica Placement
• Over time, broker load and storage become unbalanced
• Initial replica placement does not account for topic throughput or retention
• Adding or removing brokers also leaves the cluster unbalanced

[Diagram: a newly added broker 5 holds no replicas, while brokers 1-4 still carry all of the topic1 and topic2 partition replicas.]

Page 32: Streaming in Practice - Putting Apache Kafka in Production

32

Replica Reassignment
• Create a plan to rebalance replicas
• Upload the new assignment to the cluster
• Kafka migrates replicas without disruption

[Diagram: before reassignment, brokers 1-4 hold all of the topic1 and topic2 replicas and broker 5 is empty; after reassignment, the replicas are spread across all five brokers.]

Page 33: Streaming in Practice - Putting Apache Kafka in Production

33

Data Balancing: Tricky Parts
• Creating a good plan
  • Balance broker disk space
  • Balance broker load
  • Minimize data movement
  • Preserve rack placement
• Movement of replicas can overload I/O and bandwidth resources
  • Use the replication quota feature in 0.10.1

Page 34: Streaming in Practice - Putting Apache Kafka in Production

34

Data Balancing: Solutions
• DIY
  • kafka-reassign-partitions.sh script in Apache Kafka (see the sketch below)
• Confluent Enterprise Auto Data Balancing
  • Optimizes storage utilization
  • Rack awareness and minimal data movement
  • Leverages replication quotas during rebalance
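
For the DIY route, kafka-reassign-partitions.sh takes a JSON file describing the desired replica assignment and the cluster migrates the data to match it; a minimal sketch is below, with topic names, partition numbers, and broker ids purely illustrative. In 0.10.1 and later, the migration traffic can be throttled using the replication quota feature mentioned above so the rebalance does not saturate disk I/O or the network.

```json
{
  "version": 1,
  "partitions": [
    { "topic": "topic1", "partition": 1, "replicas": [2, 5] },
    { "topic": "topic2", "partition": 2, "replicas": [4, 5] }
  ]
}
```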

Page 35: Streaming in Practice - Putting Apache Kafka in Production

35

Agenda
• Kafka Basics
• Tuning Kafka For Your Application
• Data Balancing
• Spanning Multiple Datacenters

Page 36: Streaming in Practice - Putting Apache Kafka in Production

36

Use cases
• Disaster recovery
• Replicate data out to geo-localized data centers
• Aggregate data from other data centers for analysis
• Part of a hybrid cloud or cloud migration strategy

Page 37: Streaming in Practice - Putting Apache Kafka in Production

37

Multi-DC: Two Approaches
• Stretched cluster
• Mirroring across clusters

Page 38: Streaming in Practice - Putting Apache Kafka in Production

38

Stretched Cluster
• Low-latency links between 3 DCs, typically AZs in a single AWS region
• Applications in all 3 DCs share the same cluster and handle failures automatically
• Relies on intra-cluster replication to copy data across DCs (replication.factor >= 3)
  • Use rack awareness in Kafka 0.10; manual partition placement otherwise

[Diagram: a single Kafka cluster stretched across AZ 1, AZ 2, and AZ 3 within one AWS region, with producers and consumers running in each AZ.]

Page 39: Streaming in Practice - Putting Apache Kafka in Production

39

Mirroring Across Clusters
• Separate Kafka clusters in each DC; a mirroring process copies data between them
• Several variations of this pattern exist; some require manual intervention on failover and recovery

Page 40: Streaming in Practice - Putting Apache Kafka in Production

40

How to Mirror Across Clusters
• MirrorMaker tool in Apache Kafka
  • Manual topic creation
  • Manual sync of topic configuration
• Confluent Enterprise Multi-DC
  • Dynamic topic creation at the destination
  • Automatic sync of topic configurations (including access controls)
  • Can be configured and managed from the Control Center UI
  • Leverages the Connect API

Page 41: Streaming in Practice - Putting Apache Kafka in Production

41

More Information: Tuning Tradeoffs
• Apache Kafka and Confluent documentation
• "When it Absolutely, Positively, Has to be There: Reliability Guarantees in Kafka"
  • Gwen Shapira and Jeff Holoman: https://www.confluent.io/kafka-summit-2016-ops-when-it-absolutely-positively-has-to-be-there/
• Kafka: The Definitive Guide (Neha Narkhede, Gwen Shapira, Todd Palino), Chapter 6: Reliability Guarantees
• Confluent Operations Training

Page 42: Streaming in Practice - Putting Apache Kafka in Production

42

More Information: Multi-DC
• "Building Large Scale Stream Infrastructures Across Multiple Data Centers with Apache Kafka", Jun Rao
  • Video: https://www.youtube.com/watch?v=XcvHmqmh16g
  • Slides: http://www.slideshare.net/HadoopSummit/building-largescale-stream-infrastructures-across-multiple-data-centers-with-apache-kafka
• Confluent Enterprise Multi-DC: https://www.confluent.io/product/multi-datacenter/

Page 43: Streaming in Practice - Putting Apache Kafka in Production

43

More Information: Metadata Management
• "Yes, Virginia, You Really Do Need a Schema Registry"
  • Gwen Shapira: https://www.confluent.io/blog/schema-registry-kafka-stream-processing-yes-virginia-you-really-need-one/

Page 44: Streaming in Practice - Putting Apache Kafka in Production

44

Thank you!

www.kafka-summit.org
• May 8, 2017: New York City, Hilton Midtown
• August 28, 2017: San Francisco, Hilton Union Square