Streaming in Practice - Putting Apache Kafka in Production
Transcript of Streaming in Practice - Putting Apache Kafka in Production
1
Streaming in Practice: Putting Apache Kafka in Production
Roger Hoover, Engineer, Confluent
2
Apache Kafka: Online Talk Series
Part 1: September 27 · Part 2: October 6 · Part 3: October 27 · Part 4: November 17 · Part 5: December 1 · Part 6: December 15
• Introduction To Streaming Data and Stream Processing with Apache Kafka
• Deep Dive into Apache Kafka
• Demystifying Stream Processing with Apache Kafka
• Data Integration with Apache Kafka
• A Practical Guide to Selecting a Stream Processing Technology
• Streaming in Practice: Putting Apache Kafka in Production
https://www.confluent.io/apache-kafka-talk-series/
3
Agenda
• Kafka Basics
• Tuning Kafka For Your Application
• Data Balancing
• Spanning Multiple Datacenters
4
Agenda
• Kafka Basics
• Tuning Kafka For Your Application
• Data Balancing
• Spanning Multiple Datacenters
5
6
Architecture
[Diagram: producers write to and consumers read from a Kafka cluster of brokers (broker 1 … broker n), each holding topic partitions; a ZooKeeper cluster running on servers 1–3 coordinates the brokers.]
7
Operations
• Simple deployment
• Rolling upgrades
• Good metrics for component monitoring
8
Agenda
• Kafka Basics
• Tuning Kafka For Your Application
• Data Balancing
• Spanning Multiple Datacenters
9
Two Example Apps
• User activity tracking
  • Collect page view events while users are browsing our web and mobile storefronts
  • Persist the data to HDFS for subsequent use in a recommendation engine
• Inventory adjustments
  • Track sales, maintain inventory, and re-order on demand
10
Application Priorities
• User activity tracking
  • High throughput (100x the sales stream)
  • Availability is most important
  • Low retention required: 3 days
• Inventory adjustments
  • Relatively low throughput
  • Durability is most important
  • Long retention required: 6 months
11
Knobs
- Partition count
- Replication factor
- Retention
- Batching + compression
- Producer send acknowledgements
- Minimum ISRs
- Unclean Leader Election
12
Partition Count
- Partitions are the unit of consumer parallelism
- Over-partition your topics (especially keyed topics)
- Easy to add consumers but hard to add partitions for keyed topics
- Kafka can support on the order of tens of thousands of partitions
13
Partition Count
- High throughput (User activity tracking)
  - Large number of partitions (~100)
- Fewer resources (Inventory adjustments)
  - Smaller number of partitions (< 50)
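As a rough illustration of these choices, the sketch below creates the two example topics with the Java AdminClient. The AdminClient arrived in Kafka 0.11, after this talk; at the time, kafka-topics.sh --create --partitions served the same purpose. Topic names, the broker address, and the exact counts are illustrative, not from the talk (the replication factors anticipate the next slides).

    import java.util.Arrays;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateExampleTopics {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092");   // hypothetical broker address
            try (AdminClient admin = AdminClient.create(props)) {
                // High-throughput, keyed activity topic: over-partition so consumers can be added later
                NewTopic pageViews = new NewTopic("page-views", 100, (short) 2);
                // Lower-throughput inventory topic: fewer partitions
                NewTopic inventory = new NewTopic("inventory-adjustments", 24, (short) 3);
                admin.createTopics(Arrays.asList(pageViews, inventory)).all().get();
            }
        }
    }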
14
Replication Factor
- More replicas require more storage, disk I/O, and network bandwidth
- More replicas can tolerate more failures
[Diagram: replicas of topic1 and topic2 partitions spread across brokers 1–4.]
15
Replication Factor
- Lower cost (User activity tracking)
  - replication.factor = 2
- High fault tolerance (Inventory adjustments)
  - replication.factor = 3
- Defaults to 1
16
Retention
- Retention time can be set per topic
- Longer retention times require more storage (imagine that!)
- Longer retention allows consumers to rewind further back in time
- Part of the consumer’s SLA!
17
Retention
- Less storage (User activity tracking)
  - log.retention.hours=72 (3 days)
- Longer time travel (Inventory adjustments)
  - log.retention.hours=4380 (6 months)
- Default is 7 days
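log.retention.hours above is the broker-wide default; per-topic retention is a topic-level config (retention.ms). Below is a minimal sketch of overriding it on an existing topic with the AdminClient's incrementalAlterConfigs, an API added in Kafka 2.3, well after this talk; the topic name and value are illustrative, and `admin` is the AdminClient from the earlier sketch.

    import java.util.Collections;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    // Inside a method that can throw Exception, with `admin` in scope:
    ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "inventory-adjustments");
    AlterConfigOp setRetention = new AlterConfigOp(
        new ConfigEntry("retention.ms", String.valueOf(180L * 24 * 60 * 60 * 1000)),  // ~6 months
        AlterConfigOp.OpType.SET);
    admin.incrementalAlterConfigs(
        Collections.singletonMap(topic, Collections.singletonList(setRetention))).all().get();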
18
Side-note: Time Travel
- Kafka 0.10.1 supports rewinding by time
  - E.g. “Rewind to 10 minutes ago”
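A minimal sketch of that rewind, using the consumer offsetsForTimes API that shipped with this feature in 0.10.1; the topic, partition, and the assumption that `consumer` is already assigned to the partition are illustrative.

    import java.util.Collections;
    import java.util.Map;
    import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
    import org.apache.kafka.common.TopicPartition;

    // `consumer` is a KafkaConsumer already assigned to the partition (construction omitted)
    TopicPartition tp = new TopicPartition("page-views", 0);
    long tenMinutesAgo = System.currentTimeMillis() - 10 * 60 * 1000;
    Map<TopicPartition, OffsetAndTimestamp> offsets =
        consumer.offsetsForTimes(Collections.singletonMap(tp, tenMinutesAgo));
    OffsetAndTimestamp found = offsets.get(tp);
    if (found != null) {
        consumer.seek(tp, found.offset());   // the next poll() starts roughly 10 minutes back
    }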
19
Batching & Compression
- Producer: batch.size, linger.ms, compression.type
- Consumer: fetch.min.bytes, fetch.max.wait.ms
[Diagram: the producer buffers send() calls into compressed batches that are flushed asynchronously to the broker; the consumer fetches whole compressed batches with poll().]
20
Batching & Compression
- High throughput (User activity tracking)
  - Producer: compression.type=lz4, batch.size (256KB), linger.ms (~10ms), or flush manually
  - Consumer: fetch.min.bytes (256KB), fetch.max.wait.ms (~10ms)
- Low latency (Inventory adjustments)
  - Producer: linger.ms=0
  - Consumer: fetch.min.bytes=1
- Defaults
  - compression.type = none
  - linger.ms = 0 (i.e. send immediately)
  - fetch.min.bytes = 1 (i.e. receive immediately)
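A minimal sketch of the high-throughput settings as Java client configuration; the broker address, group id, serializers, and exact sizes are illustrative. For the low-latency case, the defaults (linger.ms=0, fetch.min.bytes=1) already apply.

    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.kafka.common.serialization.StringSerializer;

    // Inside a method. Producer tuned for throughput: compress and batch aggressively.
    Properties p = new Properties();
    p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
    p.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
    p.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
    p.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");
    p.put(ProducerConfig.BATCH_SIZE_CONFIG, 256 * 1024);   // 256 KB batches
    p.put(ProducerConfig.LINGER_MS_CONFIG, 10);            // wait up to ~10 ms to fill a batch
    KafkaProducer<String, String> producer = new KafkaProducer<>(p);

    // Consumer tuned for throughput: wait for larger fetches.
    Properties c = new Properties();
    c.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
    c.put(ConsumerConfig.GROUP_ID_CONFIG, "activity-hdfs-loader");
    c.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
    c.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
    c.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 256 * 1024);
    c.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 10);
    KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c);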
21
Producer Acknowledgements on Send
[Diagram: the producer sends to the partition leader on broker 1; the followers on brokers 2 and 3 replicate the message; the leader commits it and sends the acknowledgement back to the producer.]
When the producer receives the ack determines latency and durability on failures:
- acks=0 (no ack): no network delay; some data loss on failures
- acks=1 (wait for leader): 1 network roundtrip; a little data loss on failures
- acks=all (wait for committed): 2 network roundtrips; no data loss on failures
22
Producer Acknowledgements on Send
- Throughput++ (User activity tracking)
  - acks = 1
- Durability++ (Inventory adjustments)
  - acks = all
- Default
  - acks = 1
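A minimal sketch of the durability-oriented producer: acks=all plus a retry setting and an error-logging send callback, which are illustrative additions rather than settings from the talk; the broker address, topic, and record are also hypothetical.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    // Inside a method
    Properties p = new Properties();
    p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
    p.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
    p.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
    p.put(ProducerConfig.ACKS_CONFIG, "all");                  // wait until the write is committed to the ISRs
    p.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);   // keep retrying transient failures

    KafkaProducer<String, String> producer = new KafkaProducer<>(p);
    producer.send(new ProducerRecord<>("inventory-adjustments", "sku-123", "-2"),
        (metadata, exception) -> {
            if (exception != null) {
                // surface failed sends instead of silently dropping them
                System.err.println("send failed: " + exception);
            }
        });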
23
In-Sync Replicas (ISRs)
[Diagram: the producer writes messages m1 and m2 to the partition leader on broker 1; the followers on brokers 2 and 3 replicate them; all three replicas are in the ISR and m2 is the last committed message.]
- In-sync: a replica is reading from the leader’s log end within replica.lag.time.max.ms
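To see which replicas are currently in the ISR for each partition, one option is the AdminClient's describeTopics call, a sketch only (the AdminClient postdates this talk; kafka-topics.sh --describe shows the same information). `admin` and the topic name are from the earlier sketches.

    import java.util.Collections;
    import org.apache.kafka.clients.admin.TopicDescription;
    import org.apache.kafka.common.TopicPartitionInfo;

    // Inside a method that can throw Exception, with `admin` in scope:
    TopicDescription desc = admin
        .describeTopics(Collections.singletonList("inventory-adjustments"))
        .all().get().get("inventory-adjustments");
    for (TopicPartitionInfo pi : desc.partitions()) {
        System.out.printf("partition %d leader=%s isr=%s%n", pi.partition(), pi.leader().id(), pi.isr());
    }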
24
Minimum In-Sync Replicas
[Diagram: the leader on broker 1 has run ahead (m1–m5) while one follower has fallen behind and dropped out of the ISR.]
- Topic config to tell Kafka how to handle writes during severe outages (rare)
- Leader will reject writes if the ISR count is too small (e.g. topic1: min.insync.replicas=2)
25
Minimum In-Sync Replicas
- Availability++ (User activity tracking)
  - min.insync.replicas = 1
- Durability++ (Inventory adjustments)
  - min.insync.replicas = 2
- Defaults to 1
26
Unclean Leader Election
- Topic config to tell Kafka how to handle topic leadership during severe outages (rare)
- Allows automatic recovery in exchange for losing data
[Diagram: the leader on broker 1 fails; a follower that is missing the most recent messages is elected the new leader, so those messages are lost.]
27
Unclean Leader Election
- Availability++ (User activity tracking)
  - unclean.leader.election.enable = true
- Durability++ (Inventory adjustments)
  - unclean.leader.election.enable = false
- Defaults to true
28
Mission Critical Data
- Producer acknowledgements
  - acks=all
- Replication factor
  - replication.factor = 3
- Minimum ISRs
  - min.insync.replicas = 2
- Unclean Leader Election
  - unclean.leader.election.enable = false
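Pulling these together: the inventory topic from the earlier sketch, created with its durability-related configs attached (Java AdminClient, 0.11+; kafka-topics.sh --create --config is the equivalent at the time of this talk). The topic name and partition count are illustrative; the matching producer setting, acks=all, is shown in the earlier producer sketch.

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.kafka.clients.admin.NewTopic;

    // Inside a method that can throw Exception, with `admin` in scope:
    Map<String, String> configs = new HashMap<>();
    configs.put("min.insync.replicas", "2");                 // reject writes when fewer than 2 replicas are in sync
    configs.put("unclean.leader.election.enable", "false");  // never elect an out-of-sync replica as leader

    NewTopic critical = new NewTopic("inventory-adjustments", 24, (short) 3)  // replication.factor = 3
        .configs(configs);
    admin.createTopics(Collections.singletonList(critical)).all().get();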
29
Agenda
• Kafka Basics
• Tuning Kafka For Your Application
• Data Balancing
• Spanning Multiple Datacenters
30
Replica Placement
• Partitions are replicated
• Replicas are spread evenly across the cluster
• Only when the topic is created or modified
[Diagram: replicas of topic1 and topic2 partitions spread evenly across brokers 1–4.]
31
Replica Placement
• Over time, broker load and storage become unbalanced
• Initial replica placement does not account for topic throughput or retention
• Adding or removing brokers
[Diagram: the existing replicas stay on brokers 1–4 while a newly added broker 5 sits empty.]
32
Replica Reassignment
• Create a plan to rebalance replicas
• Upload the new assignment to the cluster
• Kafka migrates replicas without disruption
[Diagram, before and after: replicas originally packed onto brokers 1–4 are redistributed so broker 5 takes over some of the topic1 and topic2 partition replicas.]
33
Data Balancing: Tricky Parts
• Creating a good plan
  • Balance broker disk space
  • Balance broker load
  • Minimize data movement
  • Preserve rack placement
• Movement of replicas can overload I/O and bandwidth resources
  • Use the replication quota feature in 0.10.1
34
Data Balancing: Solutions
• DIY
  • kafka-reassign-partitions.sh script in Apache Kafka
• Confluent Enterprise Auto Data Balancing
  • Optimizes storage utilization
  • Rack awareness and minimal data movement
  • Leverages replication quotas during rebalance
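The DIY route builds a JSON plan of target replica lists and feeds it to kafka-reassign-partitions.sh. Later Kafka releases (2.4+) also expose the same operation through the Java AdminClient; below is a minimal sketch of that variant, with hypothetical topic and broker ids, and `admin` as in the earlier sketches.

    import java.util.Arrays;
    import java.util.Collections;
    import java.util.Optional;
    import org.apache.kafka.clients.admin.NewPartitionReassignment;
    import org.apache.kafka.common.TopicPartition;

    // Inside a method that can throw Exception, with `admin` in scope:
    // move topic1, partition 1 so its replicas live on brokers 2 and 5 (e.g. onto a newly added broker)
    admin.alterPartitionReassignments(Collections.singletonMap(
            new TopicPartition("topic1", 1),
            Optional.of(new NewPartitionReassignment(Arrays.asList(2, 5)))))
        .all().get();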
35
Agenda
• Kafka Basics
• Tuning Kafka For Your Application
• Data Balancing
• Spanning Multiple Datacenters
36
Use Cases
• Disaster recovery
• Replicate data out to geo-localized data centers
• Aggregate data from other data centers for analysis
• Part of a hybrid cloud or cloud migration strategy
37
Multi-DC: Two Approaches
• Stretched cluster
• Mirroring across clusters
38
Stretched Cluster
• Low-latency links between 3 DCs, typically AZs in a single AWS region
• Applications in all 3 DCs share the same cluster and handle failures automatically
• Relies on intra-cluster replication to copy data across DCs (replication.factor >= 3)
• Use rack awareness in Kafka 0.10 (see the broker config sketch below); manual partition placement otherwise
[Diagram: a single Kafka cluster stretched across AZ 1, AZ 2, and AZ 3 within an AWS region, with producers and consumers in each AZ.]
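Rack awareness in Kafka 0.10 keys off a per-broker setting; a minimal server.properties sketch, with an illustrative zone id:

    # server.properties on each broker, set to the broker's AZ
    broker.rack=us-east-1a

With broker.rack set on every broker, replicas of each partition are spread across the racks/AZs when topics are created or reassigned.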
39
Mirroring Across Clusters
• Separate Kafka clusters in each DC; a mirroring process copies data between them
• Several variations of this pattern; some require manual intervention on failover and recovery
40
How to Mirror Across Clusters
• MirrorMaker tool in Apache Kafka
  • Manual topic creation
  • Manual sync of topic configuration
• Confluent Enterprise Multi-DC
  • Dynamic topic creation at the destination
  • Automatic sync of topic configurations (including access controls)
  • Can be configured and managed from the Control Center UI
  • Leverages the Connect API
41
More Information: Tuning Tradeoffs
• Apache Kafka and Confluent documentation
• When it Absolutely, Positively, Has to be There: Reliability Guarantees in Kafka
  • Gwen Shapira and Jeff Holoman – https://www.confluent.io/kafka-summit-2016-ops-when-it-absolutely-positively-has-to-be-there/
• Kafka: The Definitive Guide – Neha Narkhede, Gwen Shapira, Todd Palino
  • Chapter 6: Reliability Guarantees
• Confluent Operations Training
42
More Information: Multi-DC
• Building Large Scale Stream Infrastructures Across Multiple Data Centers with Apache Kafka – Jun Rao
  • Video: https://www.youtube.com/watch?v=XcvHmqmh16g
  • Slides: http://www.slideshare.net/HadoopSummit/building-largescale-stream-infrastructures-across-multiple-data-centers-with-apache-kafka
• Confluent Enterprise Multi-DC – https://www.confluent.io/product/multi-datacenter/
43
More Information: Metadata Management
• Yes, Virginia, You Really Do Need a Schema Registry
  • Gwen Shapira – https://www.confluent.io/blog/schema-registry-kafka-stream-processing-yes-virginia-you-really-need-one/
44
Thank you!
www.kafka-summit.org
• May 8, 2017 – New York City, Hilton Midtown
• August 28, 2017 – San Francisco, Hilton Union Square