Streaming in Practice - Putting Apache Kafka in Production


1

Streaming in Practice: Putting Apache Kafka in Production

Roger Hoover, Engineer, Confluent

2

Apache Kafka: Online Talk Series
Part 1: September 27, Part 2: October 6, Part 3: October 27, Part 4: November 17, Part 5: December 1, Part 6: December 15
• Introduction To Streaming Data and Stream Processing with Apache Kafka
• Deep Dive into Apache Kafka
• Demystifying Stream Processing with Apache Kafka
• Data Integration with Apache Kafka
• A Practical Guide to Selecting a Stream Processing Technology
• Streaming in Practice: Putting Apache Kafka in Production

https://www.confluent.io/apache-kafka-talk-series/

3

Agenda
• Kafka Basics
• Tuning Kafka For Your Application
• Data Balancing
• Spanning Multiple Datacenters

4

Agenda
• Kafka Basics
• Tuning Kafka For Your Application
• Data Balancing
• Spanning Multiple Datacenters

5

6

Architecture
[Diagram: producers and consumers connect to a Kafka cluster of brokers (broker 1, broker 2, … broker n) hosting topic partitions, coordinated by a ZooKeeper cluster running on servers 1–3.]

7

Operations
• Simple deployment
• Rolling upgrades
• Good metrics for component monitoring

8

Agenda
• Kafka Basics
• Tuning Kafka For Your Application
• Data Balancing
• Spanning Multiple Datacenters

9

Two Example Apps
• User activity tracking
  • Collect page view events while users are browsing our web and mobile storefronts
  • Persist the data to HDFS for subsequent use in a recommendation engine
• Inventory adjustments
  • Track sales, maintain inventory, and re-order on demand

10

Application Priorities
• User activity tracking
  • High throughput (100x the sales stream)
  • Availability is most important
  • Low retention required – 3 days
• Inventory adjustments
  • Relatively low throughput
  • Durability is most important
  • Long retention required – 6 months

11

Knobs
- Partition count
- Replication factor
- Retention
- Batching + compression
- Producer send acknowledgements
- Minimum ISRs
- Unclean leader election

12

Partition Count
- Partitions are the unit of consumer parallelism
- Over-partition your topics (especially keyed topics)
- It is easy to add consumers, but hard to add partitions to keyed topics
- Kafka can support on the order of tens of thousands of partitions

13

Partition Count
- High throughput (User activity tracking)
  - Large number of partitions (~100)
- Fewer resources (Inventory adjustments)
  - Smaller number of partitions (< 50)
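As a sketch of how these partition counts might be applied, the snippet below creates the two example topics with the Java AdminClient. The AdminClient API shipped after the 0.10.x releases this talk targets (0.11+); on older clusters the kafka-topics.sh script does the same job. The topic names, exact partition counts, replication factors, and broker address are illustrative assumptions, not values from the talk.

```java
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateExampleTopics {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");   // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // High-throughput activity stream: many partitions for consumer parallelism
            NewTopic pageViews = new NewTopic("page-views", 100, (short) 2);          // hypothetical names/values
            // Lower-throughput, durable stream: fewer partitions, higher replication
            NewTopic inventory = new NewTopic("inventory-adjustments", 25, (short) 3);
            admin.createTopics(Arrays.asList(pageViews, inventory)).all().get();      // block until created
        }
    }
}
```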

14

Replication Factor
- More replicas require more storage, disk I/O, and network bandwidth
- More replicas can tolerate more failures
[Diagram: four brokers, with replicas of topic1-part1, topic1-part2, topic2-part1, and topic2-part2 spread across their logs.]

15

Replication Factor
- Lower cost (User activity tracking)
  - replication.factor = 2
- High fault tolerance (Inventory adjustments)
  - replication.factor = 3
- Defaults to 1

16

Retention
- Retention time can be set per topic
- Longer retention times require more storage (imagine that!)
- Longer retention allows consumers to rewind further back in time
  - Part of the consumer's SLA!

17

Retention
- Less storage (User activity tracking)
  - log.retention.hours=72 (3 days)
- Longer time travel (Inventory adjustments)
  - log.retention.hours=4380 (6 months)
- Default is 7 days
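To make the per-topic setting concrete, here is a minimal sketch that applies the 6-month retention to the inventory topic as a topic-level override. Note that the topic-level property is retention.ms, while log.retention.hours is the broker-wide default. The sketch uses the AdminClient alterConfigs call from 0.11+ clients (on a 0.10.x cluster the kafka-configs.sh tool plays the same role); the topic name and broker address are assumptions.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class SetTopicRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");   // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // Per-topic override: 6 months (4380 hours) expressed in milliseconds
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "inventory-adjustments");
            Config retention = new Config(Collections.singleton(
                new ConfigEntry("retention.ms", String.valueOf(4380L * 60 * 60 * 1000))));
            admin.alterConfigs(Collections.singletonMap(topic, retention)).all().get();
        }
    }
}
```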

18

Side-note: Time Travel
- Kafka 0.10.1 supports rewinding by time
  - E.g. "Rewind to 10 minutes ago"
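A minimal sketch of this rewind using the consumer API that shipped with 0.10.1: offsetsForTimes looks up the earliest offset at or after a timestamp, and seek repositions the consumer there. The topic, partition, group id, and broker address are placeholder assumptions.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;

public class RewindTenMinutes {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");   // assumed broker address
        props.put("group.id", "rewind-demo");              // hypothetical consumer group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("page-views", 0);   // hypothetical topic/partition
            consumer.assign(Collections.singletonList(tp));

            // Ask the broker for the first offset whose timestamp is >= "10 minutes ago"
            long tenMinutesAgo = System.currentTimeMillis() - 10 * 60 * 1000L;
            Map<TopicPartition, Long> query = new HashMap<>();
            query.put(tp, tenMinutesAgo);
            Map<TopicPartition, OffsetAndTimestamp> result = consumer.offsetsForTimes(query);

            OffsetAndTimestamp oat = result.get(tp);
            if (oat != null) {
                consumer.seek(tp, oat.offset());   // rewind; the next poll() re-reads from here
            }
        }
    }
}
```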

19

Batching & Compression
- Producer: batch.size, linger.ms, compression.type
- Consumer: fetch.min.bytes, fetch.wait.max.ms
[Diagram: the producer groups send() calls into compressed batches and flushes them asynchronously to the broker; the consumer fetches the same compressed batches via poll().]

20

Batching & Compression
- High throughput (User activity tracking)
  - Producer: compression.type=lz4, batch.size (256KB), linger.ms (~10ms) or flush manually
  - Consumer: fetch.min.bytes (256KB), fetch.wait.max.ms (~10ms)
- Low latency (Inventory adjustments)
  - Producer: linger.ms=0
  - Consumer: fetch.min.bytes=1
- Defaults
  - compression.type = none
  - linger.ms = 0 (i.e. send immediately)
  - fetch.min.bytes = 1 (i.e. receive immediately)
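The settings above map directly onto client configuration. Below is a rough sketch of a throughput-oriented producer and consumer; note that the new Java consumer spells the fetch wait setting fetch.max.wait.ms. The broker address, group id, serializers, and exact sizes are illustrative assumptions.

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;

public class ThroughputTuning {
    public static void main(String[] args) {
        // Producer tuned for throughput: compress and batch before sending
        Properties p = new Properties();
        p.put("bootstrap.servers", "broker1:9092");   // assumed broker address
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("compression.type", "lz4");
        p.put("batch.size", 262144);        // 256 KB batches
        p.put("linger.ms", 10);             // wait up to ~10 ms to fill a batch
        KafkaProducer<String, String> producer = new KafkaProducer<>(p);

        // Consumer tuned for throughput: let the broker accumulate data before responding
        Properties c = new Properties();
        c.put("bootstrap.servers", "broker1:9092");
        c.put("group.id", "activity-tracking");       // hypothetical consumer group
        c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        c.put("fetch.min.bytes", 262144);   // wait for at least 256 KB per fetch...
        c.put("fetch.max.wait.ms", 10);     // ...or ~10 ms, whichever comes first
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c);

        producer.close();
        consumer.close();
    }
}
```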

21

Producer Acknowledgements on Send
[Diagram: (1) the producer sends a record to the leader replica on broker 1; (2) the follower replicas on brokers 2 and 3 fetch it; (3) the leader commits the record; (4) the leader acknowledges the producer.]
When the producer receives the ack determines latency and durability on failures:
- acks=0 (no ack): no network delay, but some data loss on failures
- acks=1 (wait for leader): 1 network roundtrip, a little data loss possible on failures
- acks=all (wait for committed): 2 network roundtrips, no data loss on failures

22

Producer Acknowledgements on Send
- Throughput++ (User activity tracking)
  - acks = 1
- Durability++ (Inventory adjustments)
  - acks = all
- Default
  - acks = 1
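As a sketch of how the acks setting is used in practice, the producer below sends with acks=all and a callback, so a failed or unacknowledged write surfaces as an exception instead of being silently dropped. The topic, key, value, and broker address are assumptions.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AcksAllExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");   // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "all");   // wait until the write is committed to the in-sync replicas

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record =
                new ProducerRecord<>("inventory-adjustments", "sku-123", "-2");   // hypothetical topic/key/value
            // The callback fires once the broker has (or has not) acknowledged the write,
            // so durability problems show up here rather than going unnoticed.
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace();   // e.g. retry, alert, or divert to a dead-letter store
                } else {
                    System.out.printf("committed to %s-%d at offset %d%n",
                        metadata.topic(), metadata.partition(), metadata.offset());
                }
            });
        }   // close() flushes any in-flight sends
    }
}
```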

23

In-Sync Replicas (ISRs)
[Diagram: the producer writes messages m1 and m2 to the leader for topic1-part1 on broker 1, and the followers on brokers 2 and 3 replicate them; all three replicas are in the ISR and the last committed message is m2.]
- In-sync: the replica has read up to the leader's log end within replica.lag.time.max.ms

24

Minimum In-Sync Replicas
[Diagram: the leader for topic1-part1 on broker 1 has messages m1–m5, but the followers have only replicated m1 and m2, so the ISR has shrunk and the last committed message is still m2.]
- Topic config to tell Kafka how to handle writes during severe outages (rare)
- The leader will reject writes if the ISR count is too small, e.g. topic1: min.insync.replicas=2

25

Minimum In-Sync Replicas
- Availability++ (User activity tracking)
  - min.insync.replicas = 1
- Durability++ (Inventory adjustments)
  - min.insync.replicas = 2
- Defaults to 1

26

Unclean Leader Election
- Topic config to tell Kafka how to handle topic leadership during severe outages (rare)
- Allows automatic recovery in exchange for losing data
[Diagram: the leader for topic1-part1 on broker 1 fails while holding messages the followers have not yet replicated; if an out-of-sync replica is elected as the new leader, the unreplicated messages (e.g. m5) are lost.]

27

Unclean Leader Election
- Availability++ (User activity tracking)
  - unclean.leader.election.enable = true
- Durability++ (Inventory adjustments)
  - unclean.leader.election.enable = false
- Defaults to true

28

Mission Critical Data
- Producer acknowledgements
  - acks = all
- Replication factor
  - replication.factor = 3
- Minimum ISRs
  - min.insync.replicas = 2
- Unclean leader election
  - unclean.leader.election.enable = false
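Putting these settings together, here is a hedged sketch of creating a "mission critical" topic: replication factor 3 at creation time, with min.insync.replicas=2 and unclean leader election disabled as topic-level overrides. It again uses the AdminClient from 0.11+ clients; the topic name, partition count, and broker address are assumptions. Producers writing to this topic would set acks=all, as in the earlier example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class MissionCriticalTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");   // assumed broker address

        Map<String, String> configs = new HashMap<>();
        configs.put("min.insync.replicas", "2");                  // leader rejects writes below 2 in-sync replicas
        configs.put("unclean.leader.election.enable", "false");   // never elect an out-of-sync leader

        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic topic = new NewTopic("inventory-adjustments", 25, (short) 3)   // replication factor 3
                .configs(configs);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```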

29

Agenda
• Kafka Basics
• Tuning Kafka For Your Application
• Data Balancing
• Spanning Multiple Datacenters

30

Replica Placement
• Partitions are replicated
• Replicas are spread evenly across the cluster
• Placement happens only when the topic is created or modified
[Diagram: four brokers with replicas of topic1-part1, topic1-part2, topic2-part1, and topic2-part2 spread evenly across their logs.]

31

Replica Placement
• Over time, broker load and storage become unbalanced
• Initial replica placement does not account for topic throughput or retention
• Adding or removing brokers
[Diagram: a fifth broker has been added to the cluster but holds no replicas, while brokers 1–4 still carry all of the partitions.]

32

Replica Reassignment
• Create a plan to rebalance replicas
• Upload the new assignment to the cluster
• Kafka migrates replicas without disruption
[Diagrams, before and after: before the reassignment, brokers 1–4 hold all of the replicas and broker 5 is empty; afterwards, the replicas are spread across all five brokers.]

33

Data Balancing: Tricky Parts
• Creating a good plan
  • Balance broker disk space
  • Balance broker load
  • Minimize data movement
  • Preserve rack placement
• Movement of replicas can overload I/O and bandwidth resources
  • Use the replication quota feature in 0.10.1

34

Data Balancing: Solutions
• DIY
  • kafka-reassign-partitions.sh script in Apache Kafka
• Confluent Enterprise Auto Data Balancing
  • Optimizes storage utilization
  • Rack awareness and minimal data movement
  • Leverages replication quotas during rebalance

35

Agenda
• Kafka Basics
• Tuning Kafka For Your Application
• Data Balancing
• Spanning Multiple Datacenters

36

Use Cases
• Disaster recovery
• Replicate data out to geo-localized data centers
• Aggregate data from other data centers for analysis
• Part of a hybrid cloud or cloud migration strategy

37

Multi-DC: Two Approaches
• Stretched cluster
• Mirroring across clusters

38

Stretched Cluster
• Low-latency links between 3 DCs; typically AZs in a single AWS region
• Applications in all 3 DCs share the same cluster and handle failures automatically
• Relies on intra-cluster replication to copy data across DCs (replication.factor >= 3)
• Use rack awareness in Kafka 0.10; manual partition placement otherwise
[Diagram: a single Kafka cluster stretched across three availability zones in one AWS region, with producers and consumers in each AZ.]

39

Mirroring Across Clusters
• Separate Kafka clusters in each DC; a mirroring process copies data between them
• Several variations of this pattern; some require manual intervention on failover and recovery

40

How to Mirror Across Clusters
• MirrorMaker tool in Apache Kafka
  • Manual topic creation
  • Manual sync of topic configuration
• Confluent Enterprise Multi-DC
  • Dynamic topic creation at the destination
  • Automatic sync of topic configurations (including access controls)
  • Can be configured and managed from the Control Center UI
  • Leverages the Connect API

41

More Information: Tuning Tradeoffs
• Apache Kafka and Confluent documentation
• "When it Absolutely, Positively, Has to be There: Reliability Guarantees in Kafka" – Gwen Shapira and Jeff Holoman
  • https://www.confluent.io/kafka-summit-2016-ops-when-it-absolutely-positively-has-to-be-there/
• Kafka: The Definitive Guide – Neha Narkhede, Gwen Shapira, Todd Palino
  • Chapter 6: Reliability Guarantees
• Confluent Operations Training

42

More Information: Multi-DC
• "Building Large Scale Stream Infrastructures Across Multiple Data Centers with Apache Kafka" – Jun Rao
  • Video: https://www.youtube.com/watch?v=XcvHmqmh16g
  • Slides: http://www.slideshare.net/HadoopSummit/building-largescale-stream-infrastructures-across-multiple-data-centers-with-apache-kafka
• Confluent Enterprise Multi-DC – https://www.confluent.io/product/multi-datacenter/

43

More Information: Metadata Management
• "Yes, Virginia, You Really Do Need a Schema Registry" – Gwen Shapira
  • https://www.confluent.io/blog/schema-registry-kafka-stream-processing-yes-virginia-you-really-need-one/

44

Thank you!
www.kafka-summit.org
• May 8, 2017 – New York City, Hilton Midtown
• August 28, 2017 – San Francisco, Hilton Union Square