instaclustr.com | Twitter: @instaclustr
Lessons Learned from Building an Apache Kafka Managed Service
Introduction
● Over 20 million node-hours of experience managing Cassandra, Spark and Elassandra
● Our platform provides automated provisioning, monitoring and management
● Available on AWS, GCP, Azure and IBM Cloud
● Managed Apache Kafka released May 21st
Agenda
● Context - our offering and development process
● Hardware choice and benchmarking
● Topic and user management
● Broker security configuration
● Monitoring
● Backup and Restore
Instaclustr Managed Kafka - Key Features
● Preview Release available:
  ○ Open source Apache Kafka and Zookeeper provisioned in AWS, GCP and Azure
  ○ Broker monitoring
  ○ Instaclustr monitoring and provisioning API support
  ○ Private network clusters (AWS only)
  ○ Run in your cloud provider account or ours
  ○ Topic management via a custom CLI tool
Instaclustr Managed Kafka - Key Features
● For GA (end June):
  ○ SOC2 compliant
  ○ User & credential management
  ○ More cluster config options
  ○ Topic-level and synthetic transaction monitoring
  ○ Infrastructure config tuning
Instaclustr Managed Kafka - Development Process
● First customer requests 2016
● Internal infrastructure deployment and usage of Kafka mid 2017
● Managed service platform development commenced November 2017
● Early access program with 4 customers commenced December 2017
● Public preview release 21 May 2018
● GA expected 25 June 2018
Hardware Choice and Benchmarking - GP2 vs ST1
● Disk Type
  ○ AWS benchmark - r4.large with 500GB of disk
    ■ 1 x 500GB ST1 volume
    ■ 10 x 50GB GP2 volumes in RAID0 configuration
  ○ Avg 10% improved throughput with ST1 vs GP2 EBS
  ○ ST1 is 45% of the cost of GP2
  ○ Non-RAIDed mount simplifies re-sizing EBS volumes

Type  Writes (m/s)  Reads (m/s)  Mixed (m/s)
ST1   223,851       149,506      W: 171,305 / R: 49,898
GP2   203,409       127,127      W: 162,966 / R: 44,869
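The throughput and cost claims can be checked against the table with a few lines of arithmetic. This is a minimal sketch: the figures are copied from the table, the units are taken as given (m/s), and the function names and the "throughput per unit of cost" ratio are illustrative additions, not part of the original benchmark.

```python
# Figures copied from the ST1 vs GP2 table above (units as given: m/s).
RESULTS = {
    "ST1": {"writes": 223_851, "reads": 149_506,
            "mixed_writes": 171_305, "mixed_reads": 49_898},
    "GP2": {"writes": 203_409, "reads": 127_127,
            "mixed_writes": 162_966, "mixed_reads": 44_869},
}
ST1_COST_RATIO = 0.45  # "ST1 is 45% of the cost of GP2"


def percent_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


def st1_gains(results: dict = RESULTS) -> dict:
    """Per-workload throughput gain of ST1 over GP2, in percent."""
    return {
        workload: percent_gain(results["ST1"][workload], results["GP2"][workload])
        for workload in results["GP2"]
    }


def throughput_per_cost(gain_percent: float, cost_ratio: float = ST1_COST_RATIO) -> float:
    """How much more throughput ST1 delivers per unit of spend than GP2."""
    return (1.0 + gain_percent / 100.0) / cost_ratio


if __name__ == "__main__":
    gains = st1_gains()
    for workload, gain in gains.items():
        print(f"{workload}: ST1 is {gain:.1f}% faster than GP2")
    mean_gain = sum(gains.values()) / len(gains)
    print(f"mean gain: {mean_gain:.1f}%")
    print(f"throughput per unit cost vs GP2: {throughput_per_cost(mean_gain):.2f}x")
```

The simple mean over the four columns lands close to the roughly 10% quoted on the slide; how the original average was computed is not stated.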
[Charts: ST1 and GP2 benchmark results]
Provider Comparison
Hardware Choice and Benchmarking - SSL vs non-SSL
● Encryption enabled on broker-to-broker and client-to-broker
  ○ AWS benchmark - r4.large with 1500GB ST1 disk
  ○ 512 byte messages
  ○ ~30% decrease in throughput with Broker and Client SSL enabled
● Follow-up benchmarks on OpenJDK 8 vs. 9, based on KAFKA-2561
  ○ 50% increased throughput in writes
  ○ 80% increased throughput in reads
Hardware Choice and Benchmarking - Number of Topics
● Possible urban myth that increasing topics reduces performance
● However, more topics = more partitions
● Significantly slows recovery time from node failure
[Charts: benchmark results for 10, 100, 1000 and 5000 topics]
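The "more topics = more partitions" point is easy to quantify. The sketch below counts how many partition replicas each broker hosts, which is the amount of state that has to be re-replicated or have leadership moved when a node fails. The partitions-per-topic, replication factor and broker count are hypothetical values chosen for illustration; the benchmark's actual settings are not given here.

```python
def replicas_per_broker(topics: int, partitions_per_topic: int,
                        replication_factor: int, brokers: int) -> float:
    """Average number of partition replicas hosted on each broker,
    assuming replicas are spread evenly across the cluster."""
    total_replicas = topics * partitions_per_topic * replication_factor
    return total_replicas / brokers


if __name__ == "__main__":
    # Hypothetical cluster shape: 3 partitions per topic, RF 3, 3 brokers.
    for topics in (10, 100, 1000, 5000):
        per_broker = replicas_per_broker(topics, 3, 3, 3)
        print(f"{topics:>5} topics -> {per_broker:,.0f} replicas per broker")
```

Under these assumptions the per-broker replica count grows linearly with the number of topics, which is consistent with recovery time, rather than steady-state throughput, being what suffers.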
Hardware Choice and Benchmarking - Colocated Zookeeper
● Often recommended to host Zookeeper separately from Kafka
● However, recent changes have significantly reduced load on Zookeeper from Kafka
  ○ Consumer offsets are no longer stored in Zookeeper
● Our benchmarking showed no measurable difference in performance, at least for smaller clusters
Hardware Choice and Benchmarking - Colocated Zookeeper
[Charts: Consumer Rate - Separate; Consumer Rate - Colocated]
● 6 node cluster with broker restart
  ○ Similar results with dedicated Zookeeper disk vs. shared
Topic and User Configuration Management
● Kafka utilities require direct access to Zookeeper
● Zookeeper does not have a robust external security model
● We felt that providing access to Zookeeper was a risk
● Solutions
  ○ Developed a command line tool that uses the Kafka API for topic configuration: https://github.com/instaclustr/ic-kafka-tools
    ■ Future: Console UI support?
    ■ Value topic configuration versioning and management
  ○ Adding user management to Instaclustr Console
    ■ Additional authentication required
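One way to picture versioned topic configuration is as a desired-state file that is diffed against what the cluster reports before any changes are sent through the Kafka API. The sketch below covers only that diffing step. It is not how ic-kafka-tools works; the function name and data shapes are hypothetical, and only `retention.ms` and `cleanup.policy` are real Kafka topic settings.

```python
def plan_topic_changes(current: dict, desired: dict) -> dict:
    """Compare desired topic configs (e.g. loaded from a versioned file)
    with the configs the cluster currently reports.

    Both arguments map topic name -> {config key: value}.
    """
    to_create = {name: cfg for name, cfg in desired.items() if name not in current}

    to_alter = {}
    for name, cfg in desired.items():
        if name in current:
            changed = {k: v for k, v in cfg.items() if current[name].get(k) != v}
            if changed:
                to_alter[name] = changed

    # Topics that exist but are not described in the desired state are only
    # reported, never deleted automatically.
    unmanaged = sorted(set(current) - set(desired))
    return {"create": to_create, "alter": to_alter, "unmanaged": unmanaged}


if __name__ == "__main__":
    current = {"orders": {"retention.ms": "604800000"}, "scratch": {}}
    desired = {
        "orders": {"retention.ms": "86400000"},
        "audit": {"cleanup.policy": "compact"},
    }
    print(plan_topic_changes(current, desired))
```

The resulting plan could then be applied with an admin client that talks to the brokers, so no Zookeeper access is required.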
Broker Security Configuration
● Using SCRAM (Salted Challenge Response Authentication Mechanism) authentication
  ○ Used for client->broker
  ○ Broker->broker uses SASL plaintext
● Using SASL plaintext authentication
  ○ Used for broker->broker
  ○ Were planning on integrating SCRAM authentication, but dynamic configuration still requires broker restart
  ○ Instead planning on short-lived signed broker keys, as dynamic configuration does not require restart
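For readers unfamiliar with SCRAM, the core idea is that the server stores only derived keys, and the client proves knowledge of the password without sending it. The following is a minimal sketch of the key derivation and proof check described in RFC 5802, using SHA-256. It omits password normalization (SASLprep), nonces and the message exchange; `auth_message` stands in for the concatenated handshake messages, and the 4096 iteration count is illustrative. It is not Kafka's implementation.

```python
import hashlib
import hmac


def _client_key(password: str, salt: bytes, iterations: int, hash_name: str) -> bytes:
    # SaltedPassword = Hi(password, salt, i), which is PBKDF2 with HMAC.
    salted = hashlib.pbkdf2_hmac(hash_name, password.encode("utf-8"), salt, iterations)
    return hmac.new(salted, b"Client Key", hash_name).digest()


def scram_credentials(password: str, salt: bytes, iterations: int = 4096,
                      hash_name: str = "sha256") -> dict:
    """What the server stores: derived keys, never the password itself."""
    salted = hashlib.pbkdf2_hmac(hash_name, password.encode("utf-8"), salt, iterations)
    client_key = hmac.new(salted, b"Client Key", hash_name).digest()
    return {
        "salt": salt,
        "iterations": iterations,
        "stored_key": hashlib.new(hash_name, client_key).digest(),
        "server_key": hmac.new(salted, b"Server Key", hash_name).digest(),
    }


def client_proof(password: str, salt: bytes, iterations: int, auth_message: bytes,
                 hash_name: str = "sha256") -> bytes:
    """ClientProof = ClientKey XOR HMAC(StoredKey, AuthMessage)."""
    client_key = _client_key(password, salt, iterations, hash_name)
    stored_key = hashlib.new(hash_name, client_key).digest()
    signature = hmac.new(stored_key, auth_message, hash_name).digest()
    return bytes(a ^ b for a, b in zip(client_key, signature))


def server_verifies(stored_key: bytes, auth_message: bytes, proof: bytes,
                    hash_name: str = "sha256") -> bool:
    """Recover ClientKey from the proof and check that H(ClientKey) == StoredKey."""
    signature = hmac.new(stored_key, auth_message, hash_name).digest()
    recovered = bytes(a ^ b for a, b in zip(proof, signature))
    return hmac.compare_digest(hashlib.new(hash_name, recovered).digest(), stored_key)
```

Because the server only holds the salt, iteration count and derived keys, credentials can be added or rotated as stored data rather than as plaintext passwords in broker configuration.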
Broker Security Configuration
● Access to managed clusters
  ○ Public IPs and whitelisting in firewall (security group or equivalent)
  ○ Private IPs with VPC Peering (or equivalent in other cloud providers)
  ○ Private Network Clusters where nodes are not allocated public IPs and a gateway box is used for admin access
  ○ Don’t expose Zookeeper through firewall due to weak security model
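The whitelisting rule can be expressed as a small check: a connection is admitted only if it targets the Kafka port and comes from an allowed network, and the Zookeeper port is never opened. This sketch assumes the default ports (9092 for Kafka, 2181 for Zookeeper) and uses documentation-range example addresses; a real deployment would encode the same rule in a security group or equivalent.

```python
import ipaddress

KAFKA_PORT = 9092      # Kafka's default listener port (assumed for this sketch)
ZOOKEEPER_PORT = 2181  # Zookeeper's default client port (assumed for this sketch)


def is_allowed(source_ip: str, port: int, allowed_cidrs: list) -> bool:
    """Return True if a connection should pass the firewall."""
    if port == ZOOKEEPER_PORT:
        return False  # Zookeeper is never exposed, whatever the source
    if port != KAFKA_PORT:
        return False  # only the broker port is opened to clients
    address = ipaddress.ip_address(source_ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)


if __name__ == "__main__":
    allowed = ["203.0.113.0/24", "198.51.100.7/32"]
    print(is_allowed("203.0.113.10", KAFKA_PORT, allowed))
    print(is_allowed("203.0.113.10", ZOOKEEPER_PORT, allowed))
    print(is_allowed("192.0.2.1", KAFKA_PORT, allowed))
```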
Monitoring
● Metrics exposed via JMX
  ○ Custom collection agent -> RabbitMQ (planned to migrate to Kafka) -> Riemann -> Cassandra+Spark -> Console, APIs, Grafana
● Exposing broker-level and per-topic metrics
● Alerting
  ○ Basics: service state, disk usage / free space, server still exists
  ○ Kafka metrics: offline partitions, active controllers != 1, under-replicated partitions
    ■ Active controller alert is very sensitive; we are re-assessing alert thresholds
  ○ Synthetic transactions: publish and consume message to controlled topic, measure success and latency
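The three Kafka alert conditions above can be sketched as a single cluster-level check over per-broker metric samples. The dictionary keys and function name are hypothetical; the comments name the standard Kafka JMX MBeans each key would be read from. This is not the collection agent described in the talk.

```python
# Hypothetical metric keys, with the Kafka JMX MBean each would be read from.
OFFLINE = "offline_partitions"           # kafka.controller:type=KafkaController,name=OfflinePartitionsCount
CONTROLLER = "active_controller_count"   # kafka.controller:type=KafkaController,name=ActiveControllerCount
UNDER_REPLICATED = "under_replicated"    # kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions


def evaluate_cluster(metrics_by_broker: dict) -> list:
    """Return a list of alert messages for one sample of cluster metrics.

    `metrics_by_broker` maps broker id -> {metric key: value}.
    """
    offline = sum(m[OFFLINE] for m in metrics_by_broker.values())
    controllers = sum(m[CONTROLLER] for m in metrics_by_broker.values())
    under_replicated = sum(m[UNDER_REPLICATED] for m in metrics_by_broker.values())

    alerts = []
    if offline > 0:
        alerts.append(f"{offline} offline partition(s)")
    # Exactly one broker in the cluster should report itself as controller.
    if controllers != 1:
        alerts.append(f"active controllers = {controllers}, expected 1")
    if under_replicated > 0:
        alerts.append(f"{under_replicated} under-replicated partition(s)")
    return alerts


if __name__ == "__main__":
    sample = {
        1: {OFFLINE: 0, CONTROLLER: 1, UNDER_REPLICATED: 0},
        2: {OFFLINE: 0, CONTROLLER: 0, UNDER_REPLICATED: 4},
        3: {OFFLINE: 0, CONTROLLER: 0, UNDER_REPLICATED: 0},
    }
    print(evaluate_cluster(sample))
```

The controller sum can briefly read 0 while a new controller is being elected, which is one plausible reason the active controller alert is sensitive; requiring several consecutive failing samples before alerting is a common mitigation.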
Monitoring
● Central Logging
  ○ Fleet logs transferred via Kafka to an Elassandra cluster
  ○ 1,700 nodes submit via Journalbeat -> Kafka -> Logstash -> Elassandra
  ○ Kafka experience in this project has been very positive
● Only issue
  ○ Auto offset commit failed for group logstash: Commit offsets failed with retriable exception. You should retry committing offsets.
  ○ We weren’t monitoring consumer lag closely enough
  ○ Increased consumer session and request timeouts
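Consumer lag is the gap between the newest offset in each partition and the offset the consumer group has committed. A minimal sketch of that calculation follows; the function names, the input shapes and the alert threshold are hypothetical, and partitions with no committed offset are treated as fully unread, which is an assumption.

```python
def consumer_lag(end_offsets: dict, committed_offsets: dict) -> dict:
    """Per-partition lag = log end offset - committed offset.

    Both arguments map (topic, partition) -> offset. A partition with no
    committed offset is treated as if nothing had been consumed yet.
    """
    lag = {}
    for partition, end in end_offsets.items():
        committed = committed_offsets.get(partition, 0)
        lag[partition] = max(end - committed, 0)
    return lag


def lagging_partitions(lag: dict, threshold: int) -> list:
    """Partitions whose lag exceeds the (hypothetical) alert threshold."""
    return sorted(p for p, value in lag.items() if value > threshold)


if __name__ == "__main__":
    end = {("logs", 0): 1_000, ("logs", 1): 5_000, ("logs", 2): 300}
    committed = {("logs", 0): 990, ("logs", 1): 1_200}
    lag = consumer_lag(end, committed)
    print(lag)
    print(lagging_partitions(lag, threshold=500))
```

The timeouts mentioned above presumably correspond to the Kafka consumer settings `session.timeout.ms` and `request.timeout.ms`; the values used are not given in the talk.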
Backup and Restore
● Internet wisdom = Kafka backups are not a thing
  ○ Rely on replication within cluster or MirrorMaker replication to another cluster
● Cassandra experience says backups are valuable
  ○ Hardware failure is not an issue, but corruption due to app bugs or user error can occur and be spread by replication
● Future
  ○ Working on regular automated backup and restore of topic and security configuration
  ○ Consider using Kafka Connect to write important messages to offline backup
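To give a feel for the Kafka Connect option, the sketch below builds the JSON body that would be posted to a Connect worker's REST API (`POST /connectors`) to copy selected topics to a file. It uses the FileStreamSink connector that ships with Kafka as a demonstration; a real offline backup would more likely use a sink for object storage. The connector name, topic names and file path are hypothetical.

```python
import json


def file_sink_connector(name: str, topics: list, path: str) -> dict:
    """Build a Kafka Connect connector definition that writes the given
    topics to a local file using the bundled FileStreamSink connector."""
    return {
        "name": name,
        "config": {
            "connector.class": "org.apache.kafka.connect.file.FileStreamSinkConnector",
            "tasks.max": "1",
            "topics": ",".join(topics),
            "file": path,
        },
    }


if __name__ == "__main__":
    # Hypothetical names and path, for illustration only.
    definition = file_sink_connector(
        "important-messages-backup", ["orders", "payments"], "/backup/important.txt"
    )
    print(json.dumps(definition, indent=2))
```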
Thanks for listening!
● Currently in Preview
● Would love any feedback, suggestions or just telling us what we missed
● 14-day free trial option (no CC needed) - console.instaclustr.com