Introduction to Apache Kafka

INTRO TO KAFKA Jim Plush, Director of Cloud Engineering, CrowdStrike.com Twitter: @jimplush

Transcript of Introduction to Apache Kafka

Page 1: Introduction to Apache Kafka

INTRO TO KAFKA

Jim Plush, Director of Cloud Engineering, CrowdStrike.com

Twitter: @jimplush

Page 2: Introduction to Apache Kafka

ABOUT ME

Jim Plush, Director of Cloud Engineering @ CrowdStrike.com

Architect of distributed cloud services for catching bad guys

Previously Director of Engineering at gravity.com

personalization service, ingesting clickstream from Yahoo!, New York Times, WSJ, etc…

wrote most of the ETL workflow

Page 3: Introduction to Apache Kafka

ABOUT CROWDSTRIKE

“Big Data” Security Company

Near term focus on targeted, state sponsored attacks and attribution

Single customer can generate 2.2TB of machine data per day we process in our cloud

Horizontally scalable, distributed infrastructure

Uses goodies like Kafka, Cassandra, Elastic Search, Hadoop, Scala, Go

Page 4: Introduction to Apache Kafka

“Some people, when confronted with a problem, think ‘I know, I'll use a message queue.’ Now they have two problems.”

–Said everyone, always

Page 5: Introduction to Apache Kafka

APACHE KAFKA

It’s not so much a queue as an activity stream system

Gains stability and speed at the cost of consumer complexity

It’s scalable by nature

Supports data replication

You can rewind time

It’s fast!

Persistent messaging with O(1) disk structures that provide constant time performance even with many TB of stored messages.
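The "rewind time" and constant-time points above can be illustrated with a toy append-only log. This is a minimal in-memory sketch, not Kafka's on-disk format; the class name `PartitionLog` and its methods are hypothetical and chosen only for illustration.

```python
class PartitionLog:
    """Toy in-memory stand-in for one partition's append-only log."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Append a message and return the offset assigned to it."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset, max_messages=10):
        """Return up to max_messages starting at offset.

        Reading never removes anything, so any consumer can re-read
        from an earlier offset at any time.
        """
        return self._messages[offset:offset + max_messages]


log = PartitionLog()
for event in ["click:1", "click:2", "click:3"]:
    log.append(event)

first_pass = log.read(0)   # consume everything from the start
tail = log.read(2)         # seek straight to offset 2
replay = log.read(0)       # "rewind": read again from offset 0
```

Because a read is addressed by offset rather than by searching message contents, locating the starting point does not get slower as the log grows, which is the intuition behind the constant-time claim.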

Page 6: Introduction to Apache Kafka

APACHE KAFKA - CONS

Consumer Complexity

Replication is not “rack aware”

Lack of tooling/monitoring

Still a pre-1.0 release

Operationally, it’s more manual than desired

Requires ZooKeeper

Page 7: Introduction to Apache Kafka

BASIC CONCEPTS

Topics - logical namespace for data (clickstream, app logs)

Partition - physical separation of data to allow for horizontal scalability

Consumer Groups/Offsets - where your consumer group last checkpointed in the stream

Replica - allows partitions to be replicated across nodes for availability; only one replica is the active leader
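To make the consumer group and offset concept concrete, here is a minimal sketch in which each group keeps its own checkpoint per topic and partition. `OffsetStore` and `consume` are hypothetical names invented for this example; they do not correspond to any Kafka client API, and a plain Python list stands in for the partition.

```python
class OffsetStore:
    """Hypothetical checkpoint store keyed by (group, topic, partition)."""

    def __init__(self):
        self._committed = {}

    def commit(self, group, topic, partition, next_offset):
        self._committed[(group, topic, partition)] = next_offset

    def fetch(self, group, topic, partition):
        # A group that has never committed starts at offset 0.
        return self._committed.get((group, topic, partition), 0)


def consume(group, topic, partition, log, store, batch_size):
    """Read one batch from the group's last checkpoint, then advance it."""
    start = store.fetch(group, topic, partition)
    batch = log[start:start + batch_size]
    store.commit(group, topic, partition, start + len(batch))
    return batch


clickstream_p0 = ["a", "b", "c", "d"]   # one partition of a topic
store = OffsetStore()

etl_first = consume("etl", "clickstream", 0, clickstream_p0, store, 3)
etl_second = consume("etl", "clickstream", 0, clickstream_p0, store, 3)
audit_first = consume("audit", "clickstream", 0, clickstream_p0, store, 2)
```

The two groups read the same data independently: the `audit` group starts from the beginning even though the `etl` group has already reached the end.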

Page 8: Introduction to Apache Kafka
Page 9: Introduction to Apache Kafka
Page 10: Introduction to Apache Kafka
Page 11: Introduction to Apache Kafka

USE CASES

First point of data ingestion; provides back pressure for downstream systems

Provide a data firehose for clients (with seeks)

Friendly to Blue/Green deployment architectures

Mirroring test data easily

Data Center log aggregation

Page 12: Introduction to Apache Kafka

Seamless Integration with Storm

Page 13: Introduction to Apache Kafka

Data Center Aggregation

Page 14: Introduction to Apache Kafka

Serving a Firehose

[Diagram: Producer; Data Stream; API Server; Customer A; Customer B]

Page 15: Introduction to Apache Kafka

Data Affinity w/ Key Partitioning

[Diagram: Producer; Data Stream P0 (UserIds 0-100); Data Stream P1 (UserIds 101-200); Consumer A; Consumer B]
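Key partitioning can be sketched in two ways: by ID range, as in the UserIds example on this slide, or by hashing the key. Both functions below are illustrative assumptions and are not Kafka's built-in partitioner; the names `range_partition` and `hash_partition` and the use of CRC32 are choices made only for this sketch.

```python
import zlib
from bisect import bisect_left


def range_partition(user_id, upper_bounds=(100, 200)):
    """Map a user ID to a partition by range: 0-100 -> 0, 101-200 -> 1."""
    partition = bisect_left(upper_bounds, user_id)
    if partition == len(upper_bounds):
        raise ValueError(f"user_id {user_id} is above the last range")
    return partition


def hash_partition(key, num_partitions):
    """Map a string key to a partition with a stable hash (CRC32 here)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


p_low = range_partition(42)     # falls in the 0-100 range
p_high = range_partition(150)   # falls in the 101-200 range
p_again = hash_partition("user-42", 2)
```

Either way, every message for a given user lands on the same partition, so the consumer that owns that partition sees all of that user's data.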

Page 16: Introduction to Apache Kafka

Blue/Green Deployment

[Diagram: Producer; ActiveTopic; InactiveTopic; Blue Consumer; ZooKeeper; Controller]

Page 17: Introduction to Apache Kafka

Blue/Green Deployment

[Diagram: Producer; ActiveTopic; InactiveTopic; Blue Consumer; Green Consumer; ZooKeeper; Controller]

Page 18: Introduction to Apache Kafka

Blue/Green Deployment

[Diagram: Producer; ActiveTopic; InactiveTopic; Blue Consumer; Green Consumer; ZooKeeper; Controller; User: 555]

Page 19: Introduction to Apache Kafka

Blue/Green Deployment

[Diagram: Producer; ActiveTopic; InactiveTopic; Blue Consumer; Green Consumer; ZooKeeper; Controller; User: 555]

Page 20: Introduction to Apache Kafka

SCALING OUT

1 partition = 1 consumer

1 partition needs to fit on a single machine

Partitions = the scalability of your system from the producer and consumer side

For high scale apps you will probably start out with 100 partitions
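The "1 partition = 1 consumer" rule can be sketched as a simple assignment within one consumer group. The round-robin scheme and the name `assign_partitions` are assumptions made for illustration; the deck does not say how the actual assignment is computed.

```python
def assign_partitions(num_partitions, consumers):
    """Give each partition to exactly one consumer, round-robin."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


one_consumer = assign_partitions(3, ["A"])
three_consumers = assign_partitions(3, ["A", "B", "C"])
four_consumers = assign_partitions(3, ["A", "B", "C", "D"])
```

With three partitions, a fourth consumer receives nothing, which is why the partition count caps how far the consumer side can scale out.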

Page 21: Introduction to Apache Kafka

[Diagram: Producer; partitions P0, P1, P2; Consumer A]

Page 22: Introduction to Apache Kafka

[Diagram: Producer; partitions P0, P1, P2; Consumer A; Consumer B; Consumer C]

Page 23: Introduction to Apache Kafka

MONITORING

http://quantifind.com/KafkaOffsetMonitor/
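A common quantity to watch when monitoring offsets is consumer lag: how far a group's committed offset trails the end of each partition. The sketch below assumes lag is computed as log end offset minus committed offset; the function name and the sample numbers are hypothetical and are not taken from the tool linked above.

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Return per-partition lag: messages written but not yet checkpointed."""
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in log_end_offsets.items()
    }


# Hypothetical sample values for a two-partition topic.
lag = consumer_lag(
    log_end_offsets={0: 100, 1: 250},
    committed_offsets={0: 100, 1: 200},
)
```

A lag that keeps growing suggests that consumers are falling behind producers.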

Page 24: Introduction to Apache Kafka
Page 25: Introduction to Apache Kafka
Page 26: Introduction to Apache Kafka

ZOOKEEPER

http://techblog.netflix.com/2012/04/introducing-exhibitor-supervisor-system.html

Page 27: Introduction to Apache Kafka

WE’RE [email protected]

@jimplush

crowdstrike.com/about-us/careers

Page 28: Introduction to Apache Kafka

[Diagram: Producer A; Producer B; ClickStream topic with Partition 1 and Partition 2; Consumer A; ZooKeeper; Partition Offsets; Commit Offset]