Apache Kafka · cs237/project2020/Kafka.pdf · 2020-05-11 · Publish/subscribe messaging pattern...

Apache Kafka Yinhao He Jiaqi Xiao Ananth Gottumukkala


Apache Kafka

Yinhao He, Jiaqi Xiao, Ananth Gottumukkala


Publish/subscribe messaging pattern

● Publisher: classifies messages without knowing whether any subscribers exist

● Subscriber: subscribes to messages without knowing whether any publishers exist

● Broker: decouples publishers from subscribers

(Similar to a bulletin board)
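The decoupling described above can be sketched with a tiny in-memory broker. This is only an illustration of the pattern, not Kafka's API: the `Broker` class, its `subscribe`/`publish` methods, and the topic names are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class Broker:
    """Hypothetical in-memory broker that routes messages by topic name."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[str], None]) -> None:
        # The subscriber registers interest in a topic, not in a publisher.
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, message: str) -> int:
        # The publisher only names a topic; it never sees the subscriber list.
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(message)
        return len(callbacks)  # number of deliveries made


broker = Broker()
inbox_a: List[str] = []
inbox_b: List[str] = []
broker.subscribe("orders", inbox_a.append)
broker.subscribe("orders", inbox_b.append)

delivered = broker.publish("orders", "order-1 created")
undelivered = broker.publish("payments", "no one is listening")
```

Publishing to a topic with no subscribers is not an error here, which mirrors the bulletin-board idea: the publisher posts regardless of who, if anyone, is reading.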


What is Kafka?

● Open source publish/subscribe messaging system

● Distributed event log (persistent on disk)

● Hybrid between a messaging system and a database

● High throughput platform

● Real-time data streams

● Originally developed at LinkedIn; also used by Twitter and Netflix


Kafka structure


Message

● Single unit of data (byte array)

● Batch

  ○ Collection of messages produced for the same topic and partition

  ○ Trade-off between latency and throughput

  ○ Can be compressed

● Additional structure

  ○ E.g. JSON, XML, Avro, or Protobuf

● Message ordering is not guaranteed across multiple partitions
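A minimal sketch of why ordering holds within a partition but not across partitions. It assumes messages carry a key and that a partition is chosen by hashing that key. The CRC32 hash, the partition count, and the `choose_partition`/`send` helpers are illustrative stand-ins, not Kafka's actual partitioner.

```python
import zlib
from typing import Dict, List, Tuple

NUM_PARTITIONS = 3  # assumed partition count for this example


def choose_partition(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Stand-in hash (CRC32): deterministic, so the same key always maps
    # to the same partition. Not Kafka's real partitioning function.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Each partition is modeled as an append-only list of (key, value) messages.
partitions: Dict[int, List[Tuple[str, bytes]]] = {p: [] for p in range(NUM_PARTITIONS)}


def send(key: str, value: bytes) -> int:
    partition = choose_partition(key)
    partitions[partition].append((key, value))
    return partition


for i in range(3):
    send("user-a", f"a{i}".encode())
    send("user-b", f"b{i}".encode())

# All messages for one key sit in one partition, in the order they were sent.
a_partition = choose_partition("user-a")
a_values = [value for key, value in partitions[a_partition] if key == "user-a"]
```

Nothing in this model records how messages in different partitions relate to each other in time, which is why a consumer reading several partitions cannot reconstruct a global order.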


Producer & Consumer

Producer

● Creates new messages and sends them to a specific topic

Consumer

● Reads messages

  ○ In order (within each partition)

● Offset

  ○ Created when a message is written to Kafka

  ○ The consumer remembers what offset it is at in each partition

  ○ Committed offsets are stored in ZooKeeper (older clients) or in Kafka itself

Consumer Group

● Each partition is consumed by only one member of a consumer group
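The consumer-group rule and per-partition offsets can be sketched as follows. The round-robin `assign_partitions` function and the `Consumer` class are hypothetical simplifications; Kafka's real assignment strategies and client API differ.

```python
from typing import Dict, List


def assign_partitions(partitions: List[int], members: List[str]) -> Dict[str, List[int]]:
    """Round-robin assignment: every partition goes to exactly one member."""
    assignment: Dict[str, List[int]] = {member: [] for member in members}
    for index, partition in enumerate(sorted(partitions)):
        owner = members[index % len(members)]
        assignment[owner].append(partition)
    return assignment


class Consumer:
    """Hypothetical consumer that remembers its next offset per partition."""

    def __init__(self) -> None:
        self.offsets: Dict[int, int] = {}

    def poll(self, partition: int, log: List[str], max_records: int = 2) -> List[str]:
        start = self.offsets.get(partition, 0)
        records = log[start:start + max_records]
        # Advance the remembered position past what was just read.
        self.offsets[partition] = start + len(records)
        return records


assignment = assign_partitions([0, 1, 2, 3], ["c1", "c2", "c3"])

logs = {0: ["m0", "m1", "m2"], 3: ["x0"]}
consumer = Consumer()  # plays the role of member "c1", which owns partitions 0 and 3
first = consumer.poll(0, logs[0])
second = consumer.poll(0, logs[0])
other = consumer.poll(3, logs[3])
```

Because offsets are kept per partition, reading partition 3 does not disturb the consumer's position in partition 0.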


Broker

● A Kafka cluster consists of multiple servers called brokers

● The controller broker is responsible for administrative operations

  ○ Assigns partitions to brokers

  ○ Monitors for broker failures

● Provides redundancy of messages in the partition (replication)

  ○ Protects against broker failure
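A rough sketch of the two controller duties above: placing partition replicas on brokers, and choosing a new leader when a broker fails. The round-robin placement and the "first live replica wins" election are simplifying assumptions, and `assign_replicas` and `elect_leader` are hypothetical helpers rather than Kafka's actual algorithms.

```python
from typing import Dict, List, Optional, Set


def assign_replicas(num_partitions: int, brokers: List[int], replication_factor: int) -> Dict[int, List[int]]:
    """Place each partition's replicas on consecutive brokers (round-robin)."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    return {
        p: [brokers[(p + r) % len(brokers)] for r in range(replication_factor)]
        for p in range(num_partitions)
    }


def elect_leader(replicas: List[int], live_brokers: Set[int]) -> Optional[int]:
    """Pick the first replica that is still alive, or None if all are down."""
    for broker in replicas:
        if broker in live_brokers:
            return broker
    return None


replicas = assign_replicas(num_partitions=3, brokers=[1, 2, 3], replication_factor=2)
leaders_before = {p: elect_leader(r, {1, 2, 3}) for p, r in replicas.items()}
# Simulate broker 1 failing: partitions it led fall back to another replica.
leaders_after = {p: elect_leader(r, {2, 3}) for p, r in replicas.items()}
```

With a replication factor of 2, every partition still has a live leader after a single broker failure, which is the redundancy the slide refers to.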


Retention

● Provides durable storage of messages for a configured period

● Time: messages are kept for a set length of time

● Size: messages are kept until the log reaches a set size

● Individual topics can also configure their own retention settings
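The time and size limits can be sketched as a function that trims a log. This is a simplification under stated assumptions: real Kafka applies retention to whole log segments rather than to individual records, and the `Record` type, `apply_retention` function, and parameter names here are hypothetical.

```python
from typing import List, NamedTuple, Optional


class Record(NamedTuple):
    timestamp: float  # seconds since some reference point
    payload: bytes


def apply_retention(
    log: List[Record],
    now: float,
    max_age_seconds: Optional[float] = None,
    max_bytes: Optional[int] = None,
) -> List[Record]:
    """Return the records that survive the time and size limits."""
    kept = list(log)
    if max_age_seconds is not None:
        # Time-based: drop records older than the allowed age.
        kept = [r for r in kept if now - r.timestamp <= max_age_seconds]
    if max_bytes is not None:
        # Size-based: drop the oldest records until the log fits.
        while kept and sum(len(r.payload) for r in kept) > max_bytes:
            kept.pop(0)
    return kept


log = [Record(0.0, b"aaaa"), Record(50.0, b"bbbb"), Record(90.0, b"cccc")]
by_time = apply_retention(log, now=100.0, max_age_seconds=60.0)
by_size = apply_retention(log, now=100.0, max_bytes=5)
```

Per-topic settings would amount to calling the same function with different limits for each topic's log.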


Reliability Guarantees

● Guarantees the order of messages within a partition

● Committed messages won't be lost as long as at least one replica remains alive and the retention policy holds

● Consumers can only read committed messages

● At-least-once message delivery semantics
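At-least-once delivery follows from committing an offset only after the message has been processed. The sketch below simulates a crash between processing and commit; the `consume_at_least_once` function and its `crash_before_commit_at` parameter are hypothetical and exist only to make the effect visible.

```python
from typing import Callable, List


def consume_at_least_once(
    log: List[str],
    committed_offset: int,
    process: Callable[[str], None],
    crash_before_commit_at: int = -1,
) -> int:
    """Process records from committed_offset, committing only after processing.

    Returns the committed offset when the run ends, either normally or
    through the simulated crash.
    """
    offset = committed_offset
    while offset < len(log):
        process(log[offset])
        if offset == crash_before_commit_at:
            # Simulated crash: the record was processed but never committed.
            return committed_offset
        offset += 1
        committed_offset = offset  # commit after successful processing
    return committed_offset


log = ["m0", "m1", "m2"]
seen: List[str] = []

# First run crashes after processing m1 but before committing it.
committed = consume_at_least_once(log, 0, seen.append, crash_before_commit_at=1)
# Restart from the last committed offset: m1 is delivered a second time.
committed = consume_at_least_once(log, committed, seen.append)
```

No message is skipped, but `m1` is processed twice, so processing logic has to tolerate duplicates under this guarantee.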


Advantages of Kafka

Deals with Integration Complexity

High Throughput and Fairly Low Latency

Handles Big Data

Many Configuration Options

Data Retention

Multiple Producers/Consumers


Disadvantages of Kafka

Steep Learning Curve

Latency Not Low Enough for Some Use Cases

Susceptible to Data Loss

● Split-Brain

● Partition Leader Failover


Kafka vs JMS/ActiveMQ

Kafka                                    | JMS/ActiveMQ
Real-Time Data Stream                    | Traditional Messaging
Consumers Pull Messages from Brokers     | Messages Pushed to Consumers
Implements Backpressure                  | Hard to Achieve Backpressure
Data Retention to Disk                   | No Data Retention
Guarantees Message Ordering in Partition | No Ordering Guarantees
Can Rewind and Re-consume Data           | Consumer Does Not Track Offset
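The pull, backpressure, and rewind rows of the comparison can be illustrated with a small reader over a retained log. The `PullConsumer` class and its `poll`/`seek` methods are hypothetical names chosen for this sketch; it models the idea only and is not a Kafka or JMS client.

```python
from typing import List


class PullConsumer:
    """Hypothetical pull-based reader over a retained, append-only log."""

    def __init__(self, log: List[str]) -> None:
        self._log = log
        self.position = 0

    def poll(self, max_records: int) -> List[str]:
        # The consumer decides how much to take, so a slow consumer simply
        # polls less often or asks for fewer records (backpressure).
        records = self._log[self.position:self.position + max_records]
        self.position += len(records)
        return records

    def seek(self, offset: int) -> None:
        # Rewinding works because the log retains records after they are read.
        if not 0 <= offset <= len(self._log):
            raise ValueError("offset out of range")
        self.position = offset


consumer = PullConsumer(["m0", "m1", "m2", "m3"])
first = consumer.poll(max_records=3)
consumer.seek(1)
replayed = consumer.poll(max_records=2)
```

In a push model the broker would drive delivery and discard messages once acknowledged, so neither the bounded `poll` nor the `seek` would have an equivalent.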


Kafka vs Kinesis

Kafka | Kinesis
Requires setting up your own cluster, nodes, replicas, partitions, etc. | AWS manages infrastructure, configuration, etc.
Flexible configuration, but producers (amount of data sent to the broker) and consumers (number of replicas, number of consumers per partition/topic) need tuning | Configuration is less flexible, but AWS ensures availability/durability for up to 7 days; throughput is set by the number of shards
Higher maintenance/risk management cost | Pay-as-you-go, priced per shard


Thank you