Salesforce: enabling real-time scenarios at scale using Kafka

Enabling Real-time Scenarios At Scale Using Kafka
Nishant Gupta, Engineering@Salesforce

Who am I?

Nishant Gupta - Sr. Director, Engineering

1.5 years at Salesforce (15+ years in software development)

Focus: Distributed systems, data pipelines, data analytics

linkedin.com/in/nigupta

@ngupta77

medium.com/@ngupta

What are we talking about today?

An overview of the real-time data analysis platform at Salesforce, zooming in on how Kafka fits into it.

Let’s start from the beginning...

We wanted to understand system health (at host level) across global data centers in real time, i.e. to generate system health insights from host-level metric events.

Thus, the genesis of project Ajna.

Ajna (or the third eye chakra) is the sixth primary chakra in the body, according to Hindu tradition. … it is believed to reveal insights…

More here - https://en.wikipedia.org/wiki/Ajna

What did we achieve?

A mechanism for system health monitoring for all clusters across all Salesforce global data centers.

Health monitoring and alerting for new clusters is config-driven, with zero code changes.

From event to insight, E2E latency is under 10 sec (including network delays).

Kafka is the key technology that enabled it.

Vision

Build a multi-tenant platform to enable real-time scenarios at scale

Architecture

[Architecture diagram. Labels shown: Host with collection agent; Production Cluster with cluster-level Kafka (data ingest) and local subscriber; MirrorMaker; DMZ with aggregate Kafka; stream processing, text indexing, MR jobs, machine learning, raw store, etc.]

Let’s talk numbers

● # of Kafka clusters per production cluster: 1

● # of aggregate clusters in DMZ: 1

● # of topics: data specific, ranging from 1 to 100s

● # of partitions: data specific, ranging from 1 to 16

● Replication factor: 3 (across all data, all clusters)

● Data retention: data specific, ranging up to 4 days

● Version of Kafka: modified 0.8.2 and modified 0.9. Moving towards vanilla 0.9

● SSL enabled: yes

● Topic-level auth: enabled with modified 0.9; the plan is to enable it across the board

● AuthN (where enabled): Kerberos based
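
As a rough illustration of how the per-topic numbers above translate into topic settings, the sketch below builds a `kafka-topics.sh --create` command line of the kind used with Kafka 0.8.2/0.9. The helper name, topic name, and ZooKeeper address are hypothetical and not from the talk; the partition count, replication factor, and retention are the upper values quoted above.

```python
def topic_create_args(topic, partitions, retention_days,
                      replication_factor=3, zookeeper="zk.example.com:2181"):
    """Build an argument list for kafka-topics.sh (hypothetical helper, not part of Kafka)."""
    # Kafka's topic-level retention.ms setting is expressed in milliseconds.
    retention_ms = retention_days * 24 * 60 * 60 * 1000
    return [
        "kafka-topics.sh", "--create",
        "--zookeeper", zookeeper,            # assumed address
        "--topic", topic,                    # assumed topic name
        "--partitions", str(partitions),
        "--replication-factor", str(replication_factor),
        "--config", "retention.ms=%d" % retention_ms,
    ]


if __name__ == "__main__":
    print(" ".join(topic_create_args("host-metrics", partitions=16, retention_days=4)))
```

Keeping the settings in a small function like this is one way to make topic creation repeatable across many clusters.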

MirrorMaker

● Not all data is mirrored. Selected topics are whitelisted.

● Max message size - 2MB (working to reduce this)

● batch.num.messages - 500

● queue.buffering.max.ms - 5000 ms

● fetch.message.max.bytes - 2MB

● Partition strategy - round robin

● Garbage collection - G1GC with max heap size of 28GB

● We have modified 0.8.2 MM to enable SSL
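
The whitelisting and round-robin bullets above can be sketched in a few lines. This is not MirrorMaker's code. It is a minimal Python model that assumes a regex-style whitelist and a single cycling partition counter shared by all topics. The class name and topic names are hypothetical.

```python
import itertools
import re


class MirrorSketch:
    """Toy model of selective mirroring with round-robin partitioning."""

    def __init__(self, whitelist_pattern, num_target_partitions):
        self._whitelist = re.compile(whitelist_pattern)
        # One shared counter for all topics: a simplification for illustration.
        self._next_partition = itertools.cycle(range(num_target_partitions))

    def should_mirror(self, topic):
        # Only topics whose full name matches the whitelist are mirrored.
        return self._whitelist.fullmatch(topic) is not None

    def route(self, topic):
        """Return the target partition for one message, or None if not mirrored."""
        if not self.should_mirror(topic):
            return None
        return next(self._next_partition)


if __name__ == "__main__":
    mm = MirrorSketch(r"metrics\..*|applogs", num_target_partitions=4)
    for topic in ["metrics.cpu", "debug.tmp", "applogs", "metrics.disk"]:
        print(topic, "->", mm.route(topic))
```

Round robin spreads messages evenly over target partitions but gives up per-key ordering, which is usually acceptable for metric and log events.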

Scale of operation

30+ Kafka clusters

~300 MBPS aggregate throughput of the system

100s of billions of events/day on average

>30 sec P95 for latency

Scenarios we enable

● SR visibility into system health

● Public visibility into transactions and performance - http://trust.salesforce.com/trust/performance

● App log analysis - application performance, business insights etc.

● Network monitoring

● Security and compliance monitoring

● Event-based communication and notification scenarios for apps

● … more

Operational challenges

● Large number of clusters to manage

● Non-homogeneous hardware for brokers. Capacity planning is hard.

● 0.8.x MM is buggy. Data loss is possible.

● No built-in support for QoS

● Operations on the box are manual - need to log into every box to get info

● Lack of traceability of data across the system. We operate in a Kafka → MM → Kafka configuration.

● No self-service onboarding for customers

● Management of multiple Kafka clusters is manual and time consuming

Learnings

● Aggregate clusters rather than running too many small clusters.

● Use homogeneous hardware as much as possible.

● MM on 0.8.x does not split load evenly, and it has bugs around data loss. Upgrade ASAP!

● Use SSDs where possible! They help increase throughput.

● Consider using a dedicated cluster for latency-sensitive scenarios.

● Consider making the default number of partitions a multiple of the number of brokers. That way disk consumption is uniform.

● Monitor everything!

● Don’t use Ajna to monitor and alert on Ajna! Use out-of-band monitoring for the system.
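
The partition-count guideline in the list above can be checked with simple arithmetic. The sketch below assumes partitions are placed on brokers round-robin, that each partition holds roughly the same amount of data, and it ignores replicas. These are simplifying assumptions, and the function names are hypothetical.

```python
def partitions_per_broker(num_partitions, num_brokers):
    """Count partitions per broker under round-robin placement (replicas ignored)."""
    counts = [0] * num_brokers
    for partition in range(num_partitions):
        counts[partition % num_brokers] += 1
    return counts


def is_uniform(num_partitions, num_brokers):
    """True when every broker ends up with the same number of partitions."""
    return len(set(partitions_per_broker(num_partitions, num_brokers))) == 1


if __name__ == "__main__":
    print(partitions_per_broker(12, 4), is_uniform(12, 4))  # 12 is a multiple of 4
    print(partitions_per_broker(10, 4), is_uniform(10, 4))  # 10 is not
```

With 10 partitions on 4 brokers, two brokers carry one partition more than the others, so their disks fill faster.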

Monitoring Ajna

Service level checks

● SLA alert: Latency increases over x min for a given topic

● SLA alert: Throughput decreases under x bytes/sec for a given topic

Host checks

● CollectD: host level metrics → time series DB → alerts. E.g. disk capacity

● Jmxtrans-agent: JMX beans → time series DB → alerts. E.g. Kafka’s BytesOneMinuteRatePerSec

● Nagios: process checks → alerts. E.g. number of processes for Kafka user

Alerts

● Argus → email notifications → PagerDuty integration

● Nagios
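
The two service level checks listed above could be expressed roughly as follows. This is a sketch under assumptions: the latency alert is read here as "latency stayed above a threshold for x consecutive minutes", samples arrive once per minute, and all names and thresholds are hypothetical rather than taken from Ajna or Argus.

```python
def latency_alert(latencies_sec, threshold_sec, window_min):
    """Fire when the last `window_min` per-minute latency samples all exceed the threshold.

    latencies_sec: per-minute latency readings for one topic, oldest first.
    """
    if window_min <= 0:
        raise ValueError("window_min must be positive")
    recent = latencies_sec[-window_min:]
    # Require a full window so a short history cannot trigger the alert.
    return len(recent) == window_min and all(l > threshold_sec for l in recent)


def throughput_alert(bytes_per_sec, floor_bytes_per_sec):
    """Fire when a topic's throughput drops below the configured floor."""
    return bytes_per_sec < floor_bytes_per_sec


if __name__ == "__main__":
    print(latency_alert([5, 40, 45, 50], threshold_sec=30, window_min=3))
    print(throughput_alert(900, floor_bytes_per_sec=1000))
```

Requiring the whole window to breach the threshold avoids paging on a single slow minute.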

Tools we use

● Ajna auditor - timestamps every event; calculates per-topic latency, data loss, throughput etc.

● Argus - HBase-based time series monitoring and alerting platform (https://github.com/SalesforceEng/Argus)

● DCT (Dashboard creation tool) - for Argus and Graphite

● Funnel - HTTP endpoint for ingestion of metrics data
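
The talk does not describe how the Ajna auditor works internally. As a minimal sketch of the kind of per-topic calculation it mentions (latency, data loss, throughput), the code below assumes each event carries a send timestamp, a receive timestamp that is None when the event never arrived downstream, and a size. All field and function names are hypothetical.

```python
from collections import defaultdict


def audit(events, window_sec):
    """Compute per-topic max latency, loss fraction, and throughput.

    events: iterable of dicts with keys topic, sent_ts, received_ts, size_bytes.
    window_sec: length of the observation window in seconds.
    """
    by_topic = defaultdict(list)
    for event in events:
        by_topic[event["topic"]].append(event)

    report = {}
    for topic, topic_events in by_topic.items():
        delivered = [e for e in topic_events if e["received_ts"] is not None]
        latencies = [e["received_ts"] - e["sent_ts"] for e in delivered]
        report[topic] = {
            "max_latency_sec": max(latencies) if latencies else None,
            "loss_fraction": 1 - len(delivered) / len(topic_events),
            "throughput_bytes_per_sec": sum(e["size_bytes"] for e in delivered) / window_sec,
        }
    return report


if __name__ == "__main__":
    sample = [
        {"topic": "a", "sent_ts": 0.0, "received_ts": 2.0, "size_bytes": 100},
        {"topic": "a", "sent_ts": 1.0, "received_ts": None, "size_bytes": 50},
    ]
    print(audit(sample, window_sec=10))
```

Stamping events at the source and comparing at the destination is what makes loss and latency measurable across a Kafka → MM → Kafka path.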

The road ahead

10X scale

Management API

Multi-tenancy

AuthZ/N

Throttling

Self-serve UI

Cluster Management

Ajna health monitoring using Prometheus or similar

Questions?