Uber Real Time Data Analytics Bansal.pdf · Sr. Software Engineer, Streaming Team @ Uber Streaming...

Real Time Data Analytics @ Uber Ankur Bansal November 14, 2016


Page 1

Real Time Data Analytics @ Uber Ankur Bansal November 14, 2016

Page 2

About Me

● Sr. Software Engineer, Streaming Team @ Uber
  ○ Streaming team supports the platform for real-time data analytics: Kafka, Samza, Flink, Pinot, and plenty more
  ○ Focused on scaling Kafka at Uber’s pace
● Staff Software Engineer @ eBay
  ○ Built & scaled eBay’s cloud using OpenStack
● Apache Kylin: Committer, Emeritus PMC

Page 3

Agenda

● Real-time Use Cases
● Kafka Infrastructure Deep Dive
● Our own Development:
  ○ Rest Proxy & Clients
  ○ Local Agent
  ○ uReplicator (MirrorMaker)
  ○ Chaperone (Auditing)
● Operations/Tooling

Page 4

Important Use Cases

Page 5

Real-time Price Surging

[Figure: labels in the diagram: Rider eyeballs, Open car information, Kafka, Stream Processing, Surge Multipliers.]

Page 6

Real-time Machine Learning - UberEats ETD

Page 7
Page 8

● Fraud detection
● Share my ETA

And many more ...

Page 9

Apache Kafka is Uber’s Lifeline

Page 10

Kafka ecosystem @ Uber

[Figure: labels in the diagram: Data Producers, Data Consumers, Real-time Pipeline, Batch Pipeline; Rider App, Driver App, Mobile App, API / Services, Dispatch (GPS logs), Mapping & Logistic; Real-time, Fast Analytics; Alerts, Dashboards; Debugging; Applications, Data Science; Analytics, Reporting; Ad-hoc exploration.]

Page 11

Kafka cluster stats

● 100s of billions of messages/day
● 100s of TB/day
● Multiple data centers

Page 12

Kafka Infrastructure Deep Dive

Page 13

Requirements

● Scale to 100s Billions/day → 1 Trillion/day
● High Throughput (Scale: 100s TB → PB)
● Low Latency for most use cases (<5ms)
● Reliability - 99.99% (#Msgs Available / #Msgs Produced)
● Multi-Language Support
● Tens of thousands of simultaneous clients.
● Reliable data replication across DC
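
To get a feel for these targets, the daily figures can be converted into average per-second rates and a loss budget. The sketch below is plain arithmetic, not anything from the talk: the helper names are hypothetical, and real traffic would peak well above the daily average.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_second(per_day: float) -> float:
    """Average per-second rate for a given per-day volume."""
    return per_day / SECONDS_PER_DAY


def reliability(available: int, produced: int) -> float:
    """Reliability as defined on the slide: #msgs available / #msgs produced."""
    return available / produced


def loss_budget(produced: int, lost_per_10k: int = 1) -> int:
    """Messages that may go missing while still meeting the target.

    99.99% reliability corresponds to at most 1 lost message per 10,000.
    Integer arithmetic avoids floating-point rounding on large counts.
    """
    return produced * lost_per_10k // 10_000


if __name__ == "__main__":
    trillion = 10**12
    print(f"1 trillion msgs/day is about {per_second(trillion):,.0f} msgs/s on average")
    print(f"loss budget at 99.99%: {loss_budget(trillion):,} msgs/day")
```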

Page 14

Kafka Pipeline

[Figure: Kafka pipeline across DataCenter-I, DataCenter-II and DataCenter-III. Components shown: Applications [ProxyClient], Kafka REST Proxy, Local Agent, Regional Kafka, Secondary Kafka, uReplicator, Aggregate Kafka.]

Page 15

Kafka Pipeline: Data Flow

[Figure: data flow with numbered steps 1 to 8 through Application Process, ProxyClient, Kafka Proxy Server, Regional Kafka, uReplicator and Aggregate Kafka.]

Page 16

Kafka Clusters

[Figure: Kafka pipeline across DataCenter-I, DataCenter-II and DataCenter-III. Components shown: Applications [ProxyClient], Kafka REST Proxy, Local Agent, Regional Kafka, Secondary Kafka, uReplicator, Aggregate Kafka.]

Page 17

Kafka Clusters

● Use case based clusters
  ○ Data (async, reliable)
  ○ Logging (High throughput)
  ○ Time Sensitive (Low Latency e.g. Surge, Push notifications)
  ○ High Value Data (At-least once, Sync e.g. Payments)
● Secondary cluster as fallback
● Aggregate clusters for all data topics.
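
One way to picture use-case-based tuning is a producer profile per cluster type. The sketch below is illustrative only: the keys `acks`, `linger.ms`, `batch.size` and `compression.type` are standard Kafka producer settings, but the profile names and every value are assumptions chosen to mirror the slide's labels, not Uber's actual configuration.

```python
# Hypothetical producer profiles, one per use case named on the slide.
# Keys are standard Kafka producer config names; all values are illustrative.
CLUSTER_PROFILES = {
    # async, reliable
    "data": {"acks": "1", "linger.ms": 50, "batch.size": 262144},
    # high throughput: big batches, compression
    "logging": {"acks": "1", "linger.ms": 100, "batch.size": 1048576,
                "compression.type": "snappy"},
    # low latency: send immediately, small batches
    "time_sensitive": {"acks": "1", "linger.ms": 0, "batch.size": 16384},
    # at-least-once: wait for all in-sync replicas
    "high_value": {"acks": "all", "linger.ms": 5, "batch.size": 65536},
}


def profile_for(use_case: str) -> dict:
    """Return a copy of the producer settings for a use case."""
    try:
        return dict(CLUSTER_PROFILES[use_case])
    except KeyError:
        raise ValueError(f"unknown use case: {use_case!r}") from None
```

The point of the split is that the settings pull in opposite directions: a long linger and large batches help throughput but add latency, while `acks=all` adds durability at the cost of both.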

Page 18

Kafka Clusters

● Scale to 100s Billions/day → 1 Trillion/day
● High Throughput (Scale: 100s TB → PB)
● Low Latency for most use cases (<5ms)
● Reliability - 99.99% (#Msgs Available / #Msgs Produced)
● Multi-Language Support
● Tens of thousands of simultaneous clients.
● Reliable data replication across DC

Page 19

Kafka Rest Proxy

[Figure: Kafka pipeline across DataCenter-I, DataCenter-II and DataCenter-III. Components shown: Applications [ProxyClient], Kafka REST Proxy, Local Agent, Regional Kafka, Secondary Kafka, uReplicator, Aggregate Kafka.]

Page 20

Why Kafka Rest Proxy ?

● Simplified Client API
● Multi-lang support (Java, NodeJs, Python, Golang)
● Decouple client from Kafka broker
  ○ Thin clients = operational ease
  ○ Fewer connections to Kafka brokers
  ○ Future Kafka upgrade
● Enhanced Reliability
  ○ Primary & Secondary Kafka Clusters
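
The primary/secondary idea can be sketched as "try each cluster in order until one accepts the message". This is a minimal illustration under assumptions, not the proxy's actual code: `produce_with_fallback`, `ProduceError` and the sender callables are hypothetical stand-ins for real Kafka producers.

```python
from typing import Callable, Sequence


class ProduceError(Exception):
    """Raised when no cluster accepts the message."""


def produce_with_fallback(
    message: bytes, senders: Sequence[Callable[[bytes], None]]
) -> int:
    """Try each sender in order (primary first, then secondary).

    Returns the index of the sender that accepted the message.
    """
    errors = []
    for index, send in enumerate(senders):
        try:
            send(message)
            return index
        except Exception as exc:  # a real proxy would catch narrower errors
            errors.append(exc)
    raise ProduceError(f"all {len(senders)} clusters failed: {errors}")


if __name__ == "__main__":
    def primary_down(_msg: bytes) -> None:
        raise ConnectionError("primary unavailable")

    secondary_log = []
    used = produce_with_fallback(b"trip-event", [primary_down, secondary_log.append])
    print("delivered via cluster index", used)
```

Because the fallback lives in the proxy, thin clients in every language get it without carrying the logic themselves.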

Page 21

Kafka Rest Proxy: Internals

Page 22

Kafka Rest Proxy: Internals

Page 23

Kafka Rest Proxy: Internals

● Based on Confluent’s open sourced Rest Proxy
● Performance enhancements
  ○ Simple HTTP servlets on Jetty instead of Jersey
  ○ Optimized for binary payloads.
  ○ Performance increase from 7K* to 45-50K QPS/box
● Caching of topic metadata.
● Reliability improvements*
  ○ Support for Fallback cluster
  ○ Support for multiple Producers (SLA based segregation)
● Plan to contribute back to community

*Based on benchmarking & analysis done in Jun 2015
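
The slide only says that topic metadata is cached. One common shape for such a cache is a time-to-live lookup, sketched below. Everything here is an assumption for illustration: the class name, the `fetch` callback standing in for a metadata request to the brokers, and the 60-second default TTL.

```python
import time
from typing import Callable, Dict, Tuple


class TopicMetadataCache:
    """Caches per-topic metadata for a fixed time-to-live."""

    def __init__(
        self,
        fetch: Callable[[str], dict],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch          # hypothetical call that asks the cluster
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so tests can control time
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def get(self, topic: str) -> dict:
        now = self._clock()
        cached = self._entries.get(topic)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]         # fresh entry: no round trip
        metadata = self._fetch(topic)
        self._entries[topic] = (now, metadata)
        return metadata
```

With a cache like this, most produce requests skip the metadata round trip, which is one plausible contributor to a higher QPS per box.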

Page 24

Rest Proxy: performance (1 box)

[Figure: end-to-end latency (ms) versus message rate (K/second) at a single node.]

Page 25

Kafka Clusters + Rest Proxy

● Scale to 100s Billions/day → 1 Trillion/day
● High Throughput (Scale: 100s TB → PB)
● Low Latency for most use cases (<5ms)
● Reliability - 99.99% (#Msgs Available / #Msgs Produced)
● Multi-Language Support
● Tens of thousands of simultaneous clients.
● Reliable data replication across DC

Page 26

Kafka Clients

[Figure: Kafka pipeline across DataCenter-I, DataCenter-II and DataCenter-III. Components shown: Applications [ProxyClient], Kafka REST Proxy, Local Agent, Regional Kafka, Secondary Kafka, uReplicator, Aggregate Kafka.]

Page 27

Client Libraries

● Support for multiple clusters.
● High Throughput
  ○ Non-blocking, async, batching
  ○ <1ms produce latency for clients
  ○ Handles Throttling/BackOff signals from Rest Proxy
● Topic Discovery
  ○ Discovers the Kafka cluster a topic belongs to
  ○ Able to multiplex to different Kafka clusters
● Integration with Local Agent for critical data
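
The "non-blocking, async, batching" and back-off points can be sketched as a client that only enqueues on `produce` and sends batches separately. This is a simplified, single-threaded illustration under assumptions: `BatchingClient` is a hypothetical name, and `send_batch` stands in for an HTTP call to the Rest Proxy that returns `False` when the proxy signals throttling.

```python
from typing import Callable, List


class BatchingClient:
    """Buffers messages and sends them in batches, backing off when throttled."""

    def __init__(
        self,
        send_batch: Callable[[List[bytes]], bool],
        batch_size: int = 100,
        base_backoff: float = 0.1,
        max_backoff: float = 5.0,
    ) -> None:
        self._send_batch = send_batch
        self._batch_size = batch_size
        self._base = base_backoff
        self._max = max_backoff
        self._buffer: List[bytes] = []
        self._throttled = 0  # consecutive throttle signals seen

    @property
    def pending(self) -> int:
        return len(self._buffer)

    def produce(self, message: bytes) -> None:
        """Enqueue only; the caller never waits on the network."""
        self._buffer.append(message)

    def flush_once(self) -> float:
        """Send one batch; return how many seconds to wait before the next flush."""
        if not self._buffer:
            return 0.0
        batch = self._buffer[: self._batch_size]
        if self._send_batch(batch):
            del self._buffer[: len(batch)]
            self._throttled = 0
            return 0.0
        # Throttled: keep the batch and back off exponentially, up to a cap.
        self._throttled += 1
        return min(self._max, self._base * 2 ** (self._throttled - 1))
```

In a real client `flush_once` would run on a background thread or event loop, which is what keeps `produce` latency low for the application.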

Page 28

Client Libraries

Add Figure

What if there is a network glitch / outage?

Page 29

Client Libraries

Add Figure

Page 30

Kafka Clusters + Rest Proxy + Clients

● Scale to 100s Billions/day → 1 Trillion/day
● High Throughput (Scale: 100s TB → PB)
● Low Latency for most use cases (<5ms)
● Reliability - 99.99% (#Msgs Available / #Msgs Produced)
● Multi-Language Support
● Tens of thousands of simultaneous clients.
● Reliable data replication across DC

Page 31

Local Agent

[Figure: Kafka pipeline across DataCenter-I, DataCenter-II and DataCenter-III. Components shown: Applications [ProxyClient], Kafka REST Proxy, Local Agent, Regional Kafka, Secondary Kafka, uReplicator, Aggregate Kafka.]

Page 32

Local Agent

● Local spooling in case of downstream outage/backpressure
● Backfills at a controlled rate to avoid hammering infrastructure recovering from an outage
● Implementation:
  ○ Reuses code from rest-proxy and Kafka’s log module.
  ○ Appends all topics to the same file for high throughput.
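
Spooling and controlled backfill can be sketched as a single append-only log shared by all topics, drained at a bounded rate. This is a toy model under assumptions: `LocalSpool` is a hypothetical name, an in-memory deque stands in for the on-disk file, and `max_per_tick` stands in for whatever rate limit the real agent applies.

```python
from collections import deque
from typing import Callable, Deque, Tuple


class LocalSpool:
    """Single shared log for all topics, drained at a bounded rate."""

    def __init__(self) -> None:
        # One log for every topic, mirroring "appends all topics to same file".
        self._log: Deque[Tuple[str, bytes]] = deque()

    def __len__(self) -> int:
        return len(self._log)

    def append(self, topic: str, payload: bytes) -> None:
        """Spool a record while the downstream is unavailable."""
        self._log.append((topic, payload))

    def backfill(self, send: Callable[[str, bytes], None], max_per_tick: int) -> int:
        """Replay at most `max_per_tick` records in arrival order.

        A record is removed only after `send` returns, so a failed send
        leaves it at the head of the log for the next tick.
        """
        sent = 0
        while self._log and sent < max_per_tick:
            topic, payload = self._log[0]
            send(topic, payload)
            self._log.popleft()
            sent += 1
        return sent
```

Calling `backfill` once per timer tick caps the replay rate, so a recovering cluster sees a steady trickle rather than the whole backlog at once.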

Page 33

Local Agent Architecture

Add Figure

Page 34

Local Agent in Action

Add Figure

Page 35

Kafka Clusters + Rest Proxy + Clients + Local Agent

● Scale to 100s Billions/day → 1 Trillion/day
● High Throughput (Scale: 100s TB → PB)
● Low Latency for most use cases (<5ms)
● Reliability - 99.99% (#Msgs Available / #Msgs Produced)
● Multi-Language Support
● Tens of thousands of simultaneous clients.
● Reliable data replication across DC

Page 36

uReplicator

[Figure: Kafka pipeline across DataCenter-I, DataCenter-II and DataCenter-III. Components shown: Applications [ProxyClient], Kafka REST Proxy, Local Agent, Regional Kafka, Secondary Kafka, uReplicator, Aggregate Kafka.]

Page 37

Multi-DC data flow

[Figure: labels in the diagram: Traffic from DC1, Traffic from DC2, Traffic from DC3; App box, Dispatch, Mobile, API; http calls; MirrorMaker; Kafka8 Aggregation Cluster.]

Page 38


MirrorMaker: existing problems

● New topic added
● New partitions added
● MirrorMaker bounced
● New MirrorMaker added

Page 39

uReplicator: In-house solution

[Figure: Zookeeper and a Helix MM Controller, with three MM workers (MM worker1, MM worker2, MM worker3). Each worker shows a Helix Agent and Thread 1 through Thread N, each labeled Topic-partition.]

Page 40

uReplicator

[Figure: Zookeeper and a Helix MM Controller, with three MM workers (MM worker1, MM worker2, MM worker3). Each worker shows a Helix Agent and Thread 1 through Thread N, each labeled Topic-partition.]

Page 41

Kafka Clusters + Rest Proxy + Clients + Local Agent

● Scale to 100s Billions/day → 1 Trillion/day
● High Throughput (Scale: 100s TB → PB)
● Low Latency for most use cases (<5ms)
● Reliability - 99.99% (#Msgs Available / #Msgs Produced)
● Multi-Language Support
● Tens of thousands of simultaneous clients.
● Reliable data replication across DC

Page 42

uReplicator

● Running in production for 1+ year
● Open sourced: https://github.com/uber/uReplicator
● Blog: https://eng.uber.com/ureplicator/

Page 43

Chaperone - E2E Auditing

Page 44

Chaperone Architecture

Page 45

Chaperone: Track counts

Page 46

Chaperone: Track Latency

Page 47

Chaperone

● Running in production for 1+ year
● Planning to open source in ~2 weeks

Page 48

At-least Once Kafka

Page 49

Why do we need it?

[Figure: data flow with numbered steps 1 to 8 through Application Process, ProxyClient, Kafka Proxy Server, Regional Kafka, uReplicator and Aggregate Kafka.]

● Most of the infrastructure is tuned for high throughput
  ○ Batching at each stage
  ○ Ack before produce (ack’ed != committed)
● Single node failure in any stage leads to data loss
● Need a reliable pipeline for High Value Data e.g. Payments

Page 50

How did we achieve it?

● Brokers:
  ○ min.insync.replicas=2, can only tolerate one node failure
  ○ unclean.leader.election.enable=false, need to wait until the old leader comes back
● Rest Proxy:
  ○ Partition Failover
● Improved Operations:
  ○ Replication throttling, to reduce impact of node bootstrap
  ○ Prevent catching-up nodes from joining the ISR
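
The "can only tolerate one node failure" remark follows from how `min.insync.replicas` interacts with `acks=all`: a write is accepted only while the in-sync replica set is at least that large. The sketch below models just that rule. The function names are hypothetical, and the replication factor of 3 used in the example is an assumption (the slides do not state it).

```python
def can_accept_write(in_sync_replicas: int, min_insync_replicas: int = 2) -> bool:
    """With acks=all, a write is accepted only if the ISR is large enough."""
    return in_sync_replicas >= min_insync_replicas


def tolerated_failures(replication_factor: int, min_insync_replicas: int = 2) -> int:
    """Replicas that can be down while writes are still accepted."""
    return max(0, replication_factor - min_insync_replicas)


if __name__ == "__main__":
    rf = 3  # assumed replication factor, for illustration only
    print("failures tolerated:", tolerated_failures(rf))
    for alive in (3, 2, 1):
        print(f"{alive} in-sync replicas -> write accepted: {can_accept_write(alive)}")
```

Note that `min.insync.replicas` only takes effect for producers using `acks=all`; with `acks=1` the leader acknowledges on its own.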

Page 51

Operations/Tooling

Page 52

Partition Rebalancing

Add Figure

Page 53

Partition Rebalancing

● Calculates partition imbalance and inter-broker dependency.

● Generates & Executes Rebalance Plan.

● Rebalance plans are incremental, can be stopped and resumed.

● Currently on-demand, Automated in the future.
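
A minimal sketch of the first three bullets, assuming imbalance is measured simply as the spread in partition counts per broker. The function names are hypothetical, and the sketch deliberately ignores the "inter-broker dependency" part as well as replica placement, so it is far simpler than a production tool would need to be.

```python
from typing import Dict, List, Tuple


def imbalance(assignment: Dict[str, List[str]]) -> int:
    """Spread between the most and least loaded brokers, in partitions."""
    counts = [len(partitions) for partitions in assignment.values()]
    return max(counts) - min(counts)


def rebalance_plan(assignment: Dict[str, List[str]]) -> List[Tuple[str, str, str]]:
    """Return an ordered list of moves (partition, from_broker, to_broker).

    Each move is independent, so execution can stop after any step and
    resume later by recomputing the plan from the current assignment.
    """
    state = {broker: list(parts) for broker, parts in assignment.items()}
    plan: List[Tuple[str, str, str]] = []
    while True:
        # Ties are broken by broker name to keep the plan deterministic.
        most = max(state, key=lambda b: (len(state[b]), b))
        least = min(state, key=lambda b: (len(state[b]), b))
        if len(state[most]) - len(state[least]) <= 1:
            return plan
        partition = state[most].pop()
        state[least].append(partition)
        plan.append((partition, most, least))
```

For example, three brokers holding 4, 1 and 1 partitions would need two moves to reach 2, 2 and 2.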

Page 54

XFS vs EXT4

Add Figure

Page 55

Summary: Scale

● Kafka Brokers:
  ○ Multiple Clusters per DC
  ○ Use case based tuning
● Rest Proxy to reduce connections and better batching
● Rest Proxy & Clients
  ○ Batch everywhere, Async produce
  ○ Replace Jersey with Jetty
● XFS

Page 56

Summary: Reliability

● Local Agent
● Secondary Clusters
● Multi Producer support in Rest Proxy
● uReplicator
● Auditing via Chaperone

Page 57

Future Work

● Open source contribution
  ○ Chaperone
  ○ Toolkit
● Data Lineage
● Active Active Kafka
● Chargeback
● Exactly once mirroring via uReplicator

Page 59

Extra Slides

Page 60

Kafka Durability (acks=1)

[Figure: replica logs by message id.]
Broker 1: 100 101 102 103
Broker 2: 100 101
Broker 3: 100 101
Labels: Leader, Committed, Producer, Acked

Page 61

Kafka Durability (acks=1)

[Figure: replica logs by message id.]
Broker 1: 100 101 102 103
Broker 2: 100 101
Broker 3: 100 101
Labels: Leader, Committed, Producer, Failed, Acked

Page 62

Kafka Durability (acks=1)

[Figure: replica logs by message id.]
Broker 1: 100 101 102 103
Broker 2: 100 101
Broker 3: 100 101
Labels: Leader, Committed, Producer

Page 63

Kafka Durability (acks=1)

[Figure: replica logs by message id.]
Broker 1: 100 101 102 103
Broker 2: 100 101 104 105 106
Broker 3: 100 101 104 105
Labels: Leader, Committed, Producer, Old HW

Page 64

Kafka Durability (acks=1)

[Figure: replica logs by message id, with two entries crossed out (X).]
Broker 1: 100 101 102 103
Broker 2: 100 101 104 105 106
Broker 3: 100 101 104 105
Labels: Leader, Committed, Producer, Old HW

Page 65

Kafka Durability (acks=1)

[Figure: replica logs by message id.]
Broker 1: 100 101 104 105 106
Broker 2: 100 101 104 105 106
Broker 3: 100 101 105 106
Labels: Leader, Committed, Producer, data loss!!
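
The sequence on the preceding slides can be replayed as a small simulation: with acks=1 the leader acknowledges messages 102 and 103 before the followers have copied them, the leader fails, and a follower that only has 100 and 101 takes over. This is a toy model of that scenario with hypothetical names, not Kafka's replication protocol.

```python
from typing import Dict, List, Tuple


def simulate_acks1_leader_failure(
    leader_log: List[int], follower_logs: Dict[str, List[int]], new_leader: str
) -> Tuple[List[int], List[int]]:
    """Return (committed, lost) message ids after the leader fails.

    acks=1: every message written to the leader counts as acked.
    Committed: messages present on the leader and on every follower.
    Lost: acked messages that the newly elected leader never received.
    """
    acked = list(leader_log)
    committed = [
        m for m in leader_log
        if all(m in log for log in follower_logs.values())
    ]
    surviving = follower_logs[new_leader]
    lost = [m for m in acked if m not in surviving]
    return committed, lost


if __name__ == "__main__":
    committed, lost = simulate_acks1_leader_failure(
        leader_log=[100, 101, 102, 103],
        follower_logs={"broker2": [100, 101], "broker3": [100, 101]},
        new_leader="broker2",
    )
    print("committed:", committed)
    print("acked but lost:", lost)
```

Messages 102 and 103 were acknowledged to the producer but never committed, which is the gap that `acks=all` together with `min.insync.replicas` is meant to close.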

Page 66

Distributed Messaging system

* Supported in Kafka 0.8+

● High throughput
● Low latency
● Scalable
● Centralized
● Real-time

Page 67

What is Kafka?

● Distributed
● Partitioned
● Replicated
● Commit Log

[Figure: Broker 1, Broker 2 and Broker 3, with ZooKeeper.]

Page 68

What is Kafka?

● Distributed
● Partitioned
● Replicated
● Commit Log

[Figure: brokers and partitions, with ZooKeeper.]
Broker 1: Partition 0
Broker 2: Partition 1
Broker 3: Partition 2

Page 69

What is Kafka?

● Distributed
● Partitioned
● Replicated
● Commit Log

[Figure: brokers and replicated partitions, with ZooKeeper.]
Broker 1: Partition 0, Partition 2
Broker 2: Partition 1, Partition 0
Broker 3: Partition 2, Partition 1

Page 70

What is Kafka?

● Distributed
● Partitioned
● Replicated
● Commit Log

[Figure: brokers and replicated partitions, with ZooKeeper. Each partition is shown as a log with offsets 0 1 2 3.]
Broker 1: Partition 0, Partition 2
Broker 2: Partition 1, Partition 0
Broker 3: Partition 2, Partition 1

Page 71

Kafka Concepts