Salesforce enabling real time scenarios at scale using kafka
Enabling Real-time Scenarios At Scale Using Kafka
Nishant Gupta, Engineering@Salesforce
Who am I?
Nishant Gupta - Sr. Director, Engineering
1.5 years at Salesforce (15+ years in software development)
Focus: Distributed systems, data pipelines, data analytics
linkedin.com/in/nigupta
@ngupta77
medium.com/@ngupta
What are we talking about today?
An overview of the real-time data analysis platform at Salesforce, zooming in on how Kafka fits into it.
Let’s start from the beginning...
We wanted to understand system health (at host level) across global data centers in real time, i.e. generate system health insights from host-level metric events.
Thus, the genesis of project Ajna.
Ajna (or the third eye chakra) is the sixth primary chakra in the body, according to Hindu tradition. … it is believed to reveal insights…
More here - https://en.wikipedia.org/wiki/Ajna
What we achieved
A mechanism for system health monitoring for all clusters across all Salesforce global data centers.
Health monitoring and alerting for new clusters is config driven with zero code change
From event to insight - E2E latency is under 10 sec (including network delays)
Kafka is the key technology that enabled it.
Vision
Build a multi-tenant platform to enable real-time scenarios at scale
Architecture
[Architecture diagram. Production cluster: host with collection agent, cluster-level Kafka (data ingest), local subscriber. MirrorMaker. DMZ: aggregate Kafka. Downstream: stream processing, text indexing, MR jobs, machine learning, raw store, etc.]
Let’s talk numbers
●# of clusters per production cluster: 1
●# of aggregate clusters in DMZ: 1
●# of topics: Data specific. Ranges from 1 to 100s.
●# of partitions: Data specific. Ranges from 1 to 16
●# of replications: 3 (across all data, all clusters)
●Data retention: Data specific. Ranges up to 4 days
●Version of Kafka: modified 0.8.2 and modified 0.9. Moving towards vanilla 0.9
●SSL enabled: Yes.
●Topic level auth: Enabled with modified 0.9 → enable across the board
●AuthN (where enabled) : Kerberos based
MirrorMaker
●Not all data is mirrored. Selected topics are whitelisted.
●Max message size - 2MB (working to reduce this)
●batch.num.messages - 500
●queue.buffering.max.ms - 5000ms
●fetch.message.max.bytes - 2MB
●Partition strategy - round robin
●Garbage collection - G1GC with max heap size of 28GB
●We have modified 0.8.2 MM to enable SSL
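The settings listed above could be written out as MirrorMaker producer and consumer properties roughly as follows. This is only a sketch, not the Salesforce configuration: the key names are assumed to be the Kafka 0.8.x old-client names for these settings, 2MB is assumed to mean 2 MiB, the JVM flags are the standard HotSpot options for G1 and maximum heap size, and the helper `render_properties` is hypothetical.

```python
# Sketch: render the MirrorMaker settings from the slide as .properties text.
# Key names are an assumed mapping onto Kafka 0.8.x old-client configs.

TWO_MB = 2 * 1024 * 1024  # the slide's 2MB limit, assumed to mean 2 MiB

producer_settings = {
    "batch.num.messages": 500,
    "queue.buffering.max.ms": 5000,
}

consumer_settings = {
    "fetch.message.max.bytes": TWO_MB,
}

# Standard HotSpot flags for G1GC with a 28GB maximum heap.
jvm_flags = ["-XX:+UseG1GC", "-Xmx28g"]


def render_properties(settings):
    """Return Java-style .properties text, one key=value per line."""
    return "\n".join(f"{key}={value}" for key, value in sorted(settings.items()))


producer_text = render_properties(producer_settings)
consumer_text = render_properties(consumer_settings)
print(producer_text)
print(consumer_text)
print(" ".join(jvm_flags))
```

Keeping the values in one place like this makes it easier to check that the consumer fetch size is at least as large as the maximum message size, which matters when mirroring large messages.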
Scale of operation
30+ Kafka clusters
~300 MBPS Aggregate throughput of the system
100s of Billions Avg events/day
>30 sec P95 for latency
Scenarios we enable
●SR visibility into system health
●Public visibility into transactions and performance - http://trust.salesforce.com/trust/performance
●App log analysis - application performance, business insights etc.
●Network monitoring
●Security and Compliance monitoring
●App’s event based communication and notification scenarios
●… more
Operational challenges
●Large number of clusters to manage
●Non-homogeneous hardware for brokers. Capacity planning is hard
●0.8.x MM is buggy. Data loss is possible.
●No built-in support for QoS
●Operations on the box are manual - need to log into every box to get info
●Lack of traceability of data across the system. We operate in a Kafka → MM → Kafka configuration
●No self onboarding for customers is possible.
●Management of multiple Kafka clusters is manual / time consuming
Learnings
●Aggregate clusters rather than too many small clusters.
●Use homogeneous hardware as much as possible.
●MM on 0.8.X does not split load evenly. Bugs in MM 0.8.x around data loss. Upgrade ASAP!
●Use SSDs where possible! It will help in increasing throughput
●Consider using a dedicated cluster for latency sensitive scenarios
●Consider making the default number of partitions a multiple of the number of brokers, so that disk consumption is uniform.
●Monitor everything!
●Don’t use Ajna to monitor and alert on Ajna! Use out-of-band monitoring for the system.
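The partition-count learning can be illustrated with a little arithmetic. The sketch below assumes a simple round-robin placement of partitions onto brokers and ignores replicas and per-partition size differences, so it is a simplification of what Kafka actually does. The function names and the broker count of 5 are hypothetical.

```python
import math
from collections import Counter


def default_partitions(num_brokers, min_partitions):
    """Smallest multiple of num_brokers that is >= min_partitions."""
    return num_brokers * max(1, math.ceil(min_partitions / num_brokers))


def partitions_per_broker(num_partitions, num_brokers):
    """Count partitions per broker under a simple round-robin placement."""
    return Counter(p % num_brokers for p in range(num_partitions))


brokers = 5  # hypothetical cluster size
uneven = partitions_per_broker(16, brokers)
even = partitions_per_broker(default_partitions(brokers, 16), brokers)
print(sorted(uneven.values()))  # one broker carries an extra partition
print(sorted(even.values()))    # every broker carries the same number
```

With 16 partitions on 5 brokers one broker holds more partitions than the rest, while rounding up to 20 gives each broker the same share.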
Monitoring Ajna
Service level checks
● SLA alert: Latency increases over x min for a given topic
● SLA alert: Throughput decreases under x bytes/sec for a given topic
Host checks
● CollectD: host level metrics → time series DB → alerts. E.g. disk capacity
● Jmxtrans-agent: JMX beans → time series DB → alerts. E.g. Kafka’s BytesOneMinuteRatePerSec
● Nagios: process checks → alerts. E.g. number of processes for Kafka user
Alerts
● Argus → email notifications → PagerDuty integration.
● Nagios
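The two service level checks can be sketched as simple threshold rules. This is not the Argus alert definition. It reads "latency increases over x min" as "latency stays above a threshold for x consecutive one-minute samples", which is an assumption, and all function names and numbers are hypothetical.

```python
def latency_breach(latencies_sec, threshold_sec, window_min):
    """True if the last window_min one-minute samples all exceed the threshold.

    latencies_sec: per-topic latency samples, one per minute, oldest first.
    """
    recent = latencies_sec[-window_min:]
    return len(recent) == window_min and all(v > threshold_sec for v in recent)


def throughput_breach(bytes_per_sec, floor_bytes_per_sec):
    """True if the topic's throughput has dropped below the agreed floor."""
    return bytes_per_sec < floor_bytes_per_sec


sustained = latency_breach([5, 12, 15, 20], threshold_sec=10, window_min=3)
transient = latency_breach([5, 12, 8, 20], threshold_sec=10, window_min=3)
too_slow = throughput_breach(900, floor_bytes_per_sec=1000)
print(sustained, transient, too_slow)
```

Requiring the whole window to breach avoids paging on a single slow minute, at the cost of detecting a real regression a few minutes later.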
Tools we use
●Ajna auditor - timestamps every event; calculates per topic latency, data loss, throughput etc.
●Argus - HBase based time series monitoring and alerting platform (https://github.com/SalesforceEng/Argus)
●DCT (Dashboard creation tool) - for Argus and Graphite
●Funnel - HTTP endpoint for ingestion of metrics data
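As an illustration of the kind of bookkeeping an auditor like this might do, the sketch below compares events stamped at the source with the times they were seen downstream and reports per-topic loss, latency and delivered bytes. It is not the Ajna auditor. The record layout, topic names and the `audit` function are all hypothetical.

```python
from collections import defaultdict


def audit(sent, received_at):
    """Summarise delivery per topic.

    sent: iterable of (topic, event_id, sent_ts_sec, size_bytes)
    received_at: dict mapping event_id -> received_ts_sec
    """
    stats = defaultdict(lambda: {"lost": 0, "bytes": 0, "latencies": []})
    for topic, event_id, sent_ts, size in sent:
        entry = stats[topic]
        if event_id in received_at:
            # Latency is the gap between the source and downstream timestamps.
            entry["latencies"].append(received_at[event_id] - sent_ts)
            entry["bytes"] += size
        else:
            entry["lost"] += 1
    report = {}
    for topic, entry in stats.items():
        lat = entry["latencies"]
        report[topic] = {
            "lost": entry["lost"],
            "max_latency_sec": max(lat) if lat else None,
            "delivered_bytes": entry["bytes"],
        }
    return report


sent = [
    ("host-metrics", "a", 100.0, 200),
    ("host-metrics", "b", 101.0, 300),
    ("app-logs", "c", 100.0, 1000),
]
received_at = {"a": 102.5, "c": 104.0}
report = audit(sent, received_at)
print(report)
```

Dividing `delivered_bytes` by the length of the audit window would give a per-topic throughput figure.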
The road ahead
10X scale
Management API
Multi-tenancy
AuthZ/N
Throttling
Self-serve UI
Cluster Management
Ajna health monitoring using Prometheus or similar
Questions?