Salesforce enabling real time scenarios at scale using kafka

Enabling Real-time Scenarios At Scale Using Kafka

Nishant Gupta, Engineering@Salesforce

Who am I?

Nishant Gupta - Sr. Director, Engineering

1.5 years at Salesforce (15+ years in software development)

Focus: Distributed systems, data pipelines, data analytics

linkedin.com/in/nigupta

@ngupta77

medium.com/@ngupta

What are we talking about today?

An overview of the real-time data analysis platform at Salesforce, zooming in on how Kafka fits into it.

Let’s start from the beginning...

We wanted to understand system health (at host level) across global data centers in real time, i.e. generate system health insights from host-level metric events.

Thus, the genesis of project Ajna.

Ajna - (or the third eye chakra) it is the sixth primary chakra in the body, according to Hindu tradition. … it is believed to reveal insights…

More here - https://en.wikipedia.org/wiki/Ajna

What did we achieve?

A mechanism for system health monitoring for all clusters across all Salesforce global data centers.

Health monitoring and alerting for new clusters is config driven with zero code change

From event to insight - E2E latency is under 10 sec (including network delays)

Kafka is the key technology that enabled it.

Vision

Build a multi-tenant platform to enable real time scenarios at scale

Architecture

[Architecture diagram. Labels: collection agent (host); cluster-level Kafka (data ingest) and local subscriber (production cluster); MirrorMaker; aggregate Kafka (DMZ); downstream consumers: stream processing, text indexing, MR jobs, machine learning, raw store, etc.]

Let’s talk numbers

●# of Kafka clusters per production cluster: 1

●# of aggregate clusters in DMZ: 1

●# of topics: Data specific. Ranges from 1 to 100s.

●# of partitions: Data specific. Ranges from 1 to 16

●Replication factor: 3 (across all data, all clusters)

●Data retention: Data specific. Ranges up to 4 days

●Version of Kafka: modified 0.8.2 and modified 0.9. Moving towards vanilla 0.9

●SSL enabled: Yes.

●Topic level auth: Enabled with modified 0.9 → to be enabled across the board

●AuthN (where enabled): Kerberos based

MirrorMaker

●Not all data is mirrored. Selected topics are whitelisted.

●Max message size - 2MB (working to reduce this)

●batch.num.messages - 500

●queue.buffering.max.ms - 5000 ms

●fetch.message.max.bytes - 2MB

●Partition strategy - round robin

●Garbage collection - G1GC with max heap size of 28GB

●We have modified 0.8.2 MM to enable SSL
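
Two of the MirrorMaker points above, selective whitelisting and round-robin partitioning, can be sketched in a few lines. This is only an illustration, not the Ajna implementation: the topic names, the helper names and the comma-separated whitelist syntax are assumptions made for the example.

```python
import itertools
import re


def make_whitelist(spec):
    """Compile a comma-separated list of topic patterns into one regex.

    Assumption: the whitelist is written as comma-separated regexes;
    the real MirrorMaker option syntax may differ.
    """
    parts = [part.strip() for part in spec.split(",") if part.strip()]
    return re.compile("|".join(parts))


def mirrored_topics(topics, whitelist):
    """Return only the topics whose full name matches the whitelist."""
    return [t for t in topics if whitelist.fullmatch(t)]


def round_robin(num_partitions):
    """Yield partition ids 0..n-1 in an endless cycle."""
    return itertools.cycle(range(num_partitions))


# Hypothetical topic names, used only for illustration.
topics = ["host.metrics", "app.logs", "scratch.debug"]
whitelist = make_whitelist(r"host\..*, app\.logs")
selected = mirrored_topics(topics, whitelist)

# Spread six messages over four partitions.
partitioner = round_robin(4)
assignments = [next(partitioner) for _ in range(6)]
```

Round robin spreads load evenly across partitions but gives up per-key ordering, which is presumably acceptable for metric and log events.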

Scale of operation

30+ Kafka clusters

~300 MBPS Aggregate throughput of the system

100s of Billions Avg events/day

>30 sec P95 for latency

Scenarios we enable

●SR visibility into system health

●Public visibility into transactions and performance - http://trust.salesforce.com/trust/performance

●App log analysis - application performance, business insights etc.

●Network monitoring

●Security and Compliance monitoring

●App’s event based communication and notification scenarios

●… more

Operational challenges

●Large number of clusters to manage

●Non-homogeneous hardware for brokers. Capacity planning is hard

●0.8.x MM is buggy. Data loss is possible.

●No built-in support for QoS

●Operations on the box are manual - need to log into every box to get info

●Lack of traceability of data across the system. We operate in a Kafka → MM → Kafka configuration

●Self-onboarding for customers is not possible.

●Management of multiple Kafka clusters is manual / time consuming

Learnings

●Aggregate clusters rather than too many small clusters.

●Use homogeneous hardware as much as possible.

●MM on 0.8.X does not split load evenly. Bugs in MM 0.8.x around data loss. Upgrade ASAP!

●Use SSDs where possible! It will help in increasing throughput

●Consider using a dedicated cluster for latency sensitive scenarios

●Consider making the default number of partitions a multiple of the number of brokers. That way disk consumption is uniform.

●Monitor everything!

●Don’t use Ajna to monitor and alert on Ajna! Use out-of-band monitoring for the system.
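
The learning above about partition counts can be checked with a toy model. The sketch below assumes a simplified placement rule (replica r of partition p goes to broker (p + r) mod N); Kafka's real assignment adds a random starting offset, so treat this as an illustration of the arithmetic rather than the actual algorithm. The function name is hypothetical.

```python
def replica_counts(num_partitions, num_brokers, replication_factor=3):
    """Count how many partition replicas land on each broker.

    Simplified placement rule (an assumption for this sketch):
    replica r of partition p is stored on broker (p + r) % num_brokers.
    """
    counts = [0] * num_brokers
    for p in range(num_partitions):
        for r in range(replication_factor):
            counts[(p + r) % num_brokers] += 1
    return counts


# 8 partitions on 4 brokers: partition count is a multiple of broker count.
even = replica_counts(8, 4)    # [6, 6, 6, 6]

# 6 partitions on 4 brokers: not a multiple, so brokers hold unequal shares.
uneven = replica_counts(6, 4)  # [4, 5, 5, 4]
```

With equally sized partitions, the replica count per broker is a rough proxy for disk consumption, which is why the multiple-of-brokers rule keeps it uniform.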

Monitoring Ajna

Service level checks

● SLA alert: Latency increases over x min for a given topic

● SLA alert: Throughput decreases under x bytes/sec for a given topic
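
A minimal sketch of the two service level checks above, assuming each topic has a latency ceiling and a throughput floor. The class name, function name and threshold values are hypothetical, and the sketch reads "latency increases over x" as "measured latency exceeds a threshold", which is one possible interpretation.

```python
from dataclasses import dataclass


@dataclass
class TopicSla:
    max_latency_sec: float    # alert when latency rises above this
    min_bytes_per_sec: float  # alert when throughput drops below this


def sla_alerts(topic, latency_sec, bytes_per_sec, sla):
    """Return a list of alert messages for one topic (empty if healthy)."""
    alerts = []
    if latency_sec > sla.max_latency_sec:
        alerts.append(
            f"{topic}: latency {latency_sec:.1f}s exceeds {sla.max_latency_sec:.1f}s"
        )
    if bytes_per_sec < sla.min_bytes_per_sec:
        alerts.append(
            f"{topic}: throughput {bytes_per_sec:.0f} B/s is below "
            f"{sla.min_bytes_per_sec:.0f} B/s"
        )
    return alerts


# Hypothetical thresholds and measurements, for illustration only.
sla = TopicSla(max_latency_sec=30.0, min_bytes_per_sec=1000.0)
healthy = sla_alerts("host.metrics", latency_sec=8.0, bytes_per_sec=5000.0, sla=sla)
degraded = sla_alerts("host.metrics", latency_sec=45.0, bytes_per_sec=200.0, sla=sla)
```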

Host checks

● CollectD: host level metrics → time series DB → alerts. E.g. disk capacity

● Jmxtrans-agent: JMX beans → time series DB → alerts. E.g. Kafka’s BytesOneMinuteRatePerSec

● Nagios: process checks → alerts. E.g. number of processes for Kafka user

Alerts

● Argus → email notifications → PagerDuty integration.

● Nagios

Tools we use

●Ajna auditor - timestamps every event; calculates per topic latency, data loss, throughput etc.

●Argus - HBase based time series monitoring and alerting platform (https://github.com/SalesforceEng/Argus)

●DCT (Dashboard creation tool) - for Argus and Graphite

●Funnel - HTTP endpoint for ingestion of metrics data
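
The kind of calculation the Ajna auditor is described as doing (per-topic latency, data loss, throughput from timestamped events) might look roughly like the sketch below. The record layout, field names and the fixed measurement window are assumptions for illustration; the actual auditor is not described in enough detail to reproduce.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class AuditRecord:
    topic: str
    size_bytes: int
    produced_at: float            # seconds, stamped at the source
    consumed_at: Optional[float]  # None if never seen downstream


def audit(records, window_sec):
    """Summarize latency, loss and throughput per topic over one window."""
    by_topic = defaultdict(list)
    for record in records:
        by_topic[record.topic].append(record)

    report = {}
    for topic, recs in by_topic.items():
        delivered = [r for r in recs if r.consumed_at is not None]
        latencies = [r.consumed_at - r.produced_at for r in delivered]
        report[topic] = {
            "max_latency_sec": max(latencies) if latencies else None,
            "lost": len(recs) - len(delivered),
            "throughput_bytes_per_sec": sum(r.size_bytes for r in delivered) / window_sec,
        }
    return report


# Three hypothetical events; the last one never arrived downstream.
records = [
    AuditRecord("host.metrics", 200, produced_at=0.0, consumed_at=2.0),
    AuditRecord("host.metrics", 300, produced_at=1.0, consumed_at=5.0),
    AuditRecord("host.metrics", 100, produced_at=2.0, consumed_at=None),
]
report = audit(records, window_sec=10)
```

A production auditor would more likely report percentiles (such as the P95 quoted earlier) than a maximum; the maximum is used here only to keep the sketch short.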

The road ahead

10X scale

Management API

Multi-tenancy

AuthZ/N

Throttling

Self-serve UI

Cluster Management

Ajna health monitoring using Prometheus or similar

https://prometheus.io/
