Publish-subscribe Message Framework with Apache Kafka and Kinesis

  • Publish-subscribe Message Framework with Apache Kafka and Kinesis

    Presented by:-

    Prince Kumar

    Aashish Ranjan

    Raghavendra Tiwari

    Joel Joy


    Introduction of Messaging Framework

    Pub-sub model

    Publish subscribe Messaging Framework architecture

    Introduction of zookeeper

    Zookeeper core concepts

    Zookeeper architecture

    Introduction of Kafka

    Kafka elementary concepts

    Kafka architecture

    How Kafka works in the messaging Framework

    Introduction of Firehose and Kinesis Analytics

    Key concepts of Kinesis

    Uses of Amazon Kinesis Analytics

    Windowing concepts

  • Publish-subscribe Message Framework with Apache Kafka and Kinesis

  • Let's first discuss the Publish-Subscribe messaging framework.

  • The Publish/Subscribe pattern, also known as Pub/Sub, is an architectural design pattern

    that provides a framework for exchanging messages between publishers and subscribers. This pattern

    involves the publisher and the subscriber relying on a message broker that relays messages from the publisher to the subscribers. The host (publisher)

    publishes messages (events) to a channel that subscribers can then sign up to.
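The relationship above can be sketched as a minimal in-memory broker (a toy illustration only; the `Broker` class and its method names are hypothetical, not a real messaging API):

```python
from collections import defaultdict

class Broker:
    """Toy in-memory broker: relays messages from publishers to subscribers."""
    def __init__(self):
        self._subscribers = defaultdict(list)   # topic -> subscriber callbacks

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # The publisher addresses only the topic, never individual subscribers.
        for callback in self._subscribers[topic]:
            callback(message)

broker = Broker()
received = []
broker.subscribe("orders", received.append)   # subscriber registers with a topic
broker.publish("orders", {"id": 1, "item": "book"})
```

The point of the sketch is the decoupling: the publisher and subscriber never reference each other, only the shared topic name.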

  • Four core concepts make up the pub/sub model:

    • Topic – an intermediary channel that maintains a list of subscribers and relays messages received from publishers to them

    • Message – serialized data sent to a topic by a publisher, which has no knowledge of the subscribers

    • Publisher – the application that publishes a message to a topic

    • Subscriber – an application that registers itself with the desired topic in order to receive the appropriate messages

  • In a topic-based system, messages are published to "topics", or named logical channels. Subscribers receive all messages published to the topics to which they subscribe, and the publisher is responsible for defining the topics to which subscribers can subscribe.

    The process of selecting messages for reception and processing is called filtering. There are two common forms of filtering: topic-based and content-based.

    In a content-based system, messages are delivered to a subscriber only if the attributes or content of those messages match constraints defined by the subscriber. The subscriber is responsible for classifying the messages.
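Content-based filtering can be sketched as a subscriber-side predicate (illustrative Python; the function names are made up for the example):

```python
def content_subscriber(predicate, handler):
    """Wrap a handler so a message is delivered only when its attributes
    satisfy the subscriber-defined constraint (content-based filtering)."""
    def receive(message):
        if predicate(message):
            handler(message)
    return receive

matched = []
receive = content_subscriber(lambda m: m.get("level") == "ERROR", matched.append)
receive({"level": "INFO", "text": "all good"})     # filtered out
receive({"level": "ERROR", "text": "disk full"})   # delivered
```

Here the constraint lives with the subscriber, which is exactly the contrast with topic-based filtering, where the publisher's choice of topic does the selection.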

  • Publish-Subscribe Message Framework architecture


  • Let's discuss ZooKeeper.

    It is an essential part of Kafka, so we should understand it clearly.

  • Zookeeper properties

  • Producer

  • • A small or medium-sized piece of data

    • For Kafka, an array of bytes

  • Batches of messages


  • Ways of sending messages

  • Send the message and forget it

  • Send the message and wait for acknowledgment

  • Send the message and use a callback function for acknowledgment

  • The partition number is computed as the hash of the message key modulo the total number of partitions. So the partition for a given key will change as we increase the number of partitions.
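A quick sketch of why this happens (note: Kafka's default partitioner actually uses the murmur2 hash; MD5 is used here only as a deterministic stand-in):

```python
import hashlib

def partition_for(key, num_partitions):
    """Default-partitioner sketch: partition = hash(key) mod partition count.
    MD5 is a deterministic stand-in for Kafka's murmur2 hash."""
    digest = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
    return digest % num_partitions

# The same key can land on a different partition once the count changes,
# which is why increasing partitions breaks per-key ordering guarantees.
before = partition_for("user-42", 4)
after = partition_for("user-42", 6)
```

This is the practical reason to size partition counts generously up front when per-key ordering matters.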

  • Automatic Partitioning

    We don't know which partition goes to which consumer

    Uses a round-robin strategy

  • Custom Partitioner

    • Allows deciding the partition number for different kinds of messages

    • Allows processing messages coming from a certain set of partitions

  • retries

    Controls the number of retries and the time between two retries

  • .per. connection

  • Broker

  • Broker Configuration

  • 4. zookeeper.connect

  • 1. Maintain a list of offsets that are processed and ready to be committed. 2. Commit the offsets when partitions are going away, which is notified by the rebalance listener callback onPartitionsRevoked.
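The two steps above can be sketched as follows (illustrative Python; a real consumer would hook this into a Kafka client's rebalance listener, e.g. onPartitionsRevoked in the Java API):

```python
class OffsetTracker:
    """Toy offset manager: track processed offsets per partition and
    commit them when those partitions are revoked in a rebalance."""
    def __init__(self, commit_fn):
        self._pending = {}          # partition -> highest processed offset
        self._commit = commit_fn    # stand-in for the consumer's commit call

    def mark_processed(self, partition, offset):
        self._pending[partition] = offset

    def on_partitions_revoked(self, partitions):
        # Flush offsets only for the partitions that are going away.
        going = {p: self._pending.pop(p) for p in partitions if p in self._pending}
        if going:
            self._commit(going)

committed = []
tracker = OffsetTracker(committed.append)
tracker.mark_processed(0, 10)
tracker.mark_processed(1, 7)
tracker.on_partitions_revoked([0])   # commits {0: 10}, keeps partition 1 pending
```

Committing at revocation time ensures the consumer that inherits the partition resumes from the last processed offset rather than reprocessing or skipping records.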

  • Consumer Group




  • Consumer Configuration

  • • The default iteration time is less than 3 seconds. • Set heartbeat.interval if we can't poll every 3 seconds. • Otherwise, rebalancing occurs.


  • © Stephane Maarek

    ● Amazon Kinesis is a fully managed, scalable, cloud-based service.

    ● It allows real-time processing of large amounts of streaming data.

    ● It takes in any amount of data from several sources, scaling up and down as needed; consuming applications can run on EC2 instances.

    Key capabilities:

    ● Kinesis Firehose – to easily load streaming data into AWS.

    ● Kinesis Analytics – to easily process and analyze streaming data with standard SQL.

    ● Kinesis Streams – to build custom applications that process and analyze data.

    Kinesis Overview

  • [Diagram: click streams, IoT devices, and metrics & logs feed into Amazon Kinesis (Kinesis Streams, Kinesis Firehose, Kinesis Analytics), which delivers to destinations such as an Amazon S3 bucket and Amazon Redshift.]

  • • Collect and process large streams of data records in real time

    • Common use cases: accelerated log and feed data intake, real-time data analytics, and complex stream processing. Follows simple pay-as-you-go pricing.

    • Benefits of Kinesis Data Streams:

    - Data streams ensure durability with low latency.
    - Elasticity, i.e., easy scaling of a stream up or down.
    - Multiple simultaneous consumers of the same data.
    - Fault-tolerant consumption.

    Kinesis Data Streams Overview

  • Terminology and concepts overview:

    Kinesis Data Stream: a set of shards; unlimited streams in an AWS account.

    Shard:
    - A uniquely identified set of data records in a stream.
    - Unlimited number of shards in a stream.
    - Max ingest capacity: 1 MB/s (including partition keys) or 1,000 records/s.
    - Max read rate: 2 MB/s per shard.
    - Max read transactions: 5/s per shard.
    - Each shard has a hash-key range.

    The account is charged on a per-shard basis.

    Data record: the unit of data stored. Composed of a sequence number, a partition key, and a data blob. The data blob can be any type of data: a segment from a log file, geographic/location data, website clickstream data, and so on. Max size: 1 MB.

    Partition key:
    - Unicode string, maximum length 256 characters.
    - Hashed to a 128-bit integer value (using the MD5 hash function).
    - ExplicitHashKey can be used to explicitly determine the shard a record is sent to.
    - Maps associated data records to shards using hash-key ranges.
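The MD5-based mapping of partition keys to shard hash-key ranges can be sketched like this (the helper names and two-shard layout are illustrative, not the Kinesis API):

```python
import hashlib

MAX_HASH = 2**128 - 1  # Kinesis hash keys span the 128-bit space

def hash_key(partition_key):
    """Hash a partition key to a 128-bit integer, as Kinesis does with MD5."""
    return int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)

def shard_for(partition_key, shards):
    """Pick the shard whose hash-key range contains the key's hash value."""
    value = hash_key(partition_key)
    for shard_id, (start, end) in shards.items():
        if start <= value <= end:
            return shard_id

# Two shards covering the full hash-key space (illustrative ranges).
shards = {
    "shard-0": (0, MAX_HASH // 2),
    "shard-1": (MAX_HASH // 2 + 1, MAX_HASH),
}
target = shard_for("user-42", shards)
```

Because the hash is deterministic, all records with the same partition key land on the same shard, which is what preserves per-key ordering within a shard.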

    Sequence number: a unique number per partition key, assigned to each data record by KDS after it is written to the stream. Increases over time.

  • Total capacity of Stream : sum of capacities of shards.

    The number of shards must be specified before creating a data stream; dynamic scaling is possible afterwards.

    Factors for the initial size: average data record size, data read/write rate, number of consumers, and incoming/outgoing read/write bandwidth.

    Producer: ● Puts data records into Amazon Kinesis Data Streams.

    ● To put data, specify the stream name, a partition key, and the data blob.

    ● The number of partition keys should be much larger than the number of shards.

    Kinesis Data Stream applications/consumers:

    ● Read and process data records from the stream.

    ● Run on a fleet of EC2 instances.

    ● Multiple applications can consume one stream independently and concurrently.

    Server-side encryption: data streams can automatically encrypt sensitive data as a producer enters it into the stream, using an AWS KMS master key.

  • Retention period: the length of time data records are accessible after they are added to the stream. Default: 24 hours after creation. Max: 168 hours. Additional charges apply for a retention period greater than 24 hours.


    ● Tag: key-value pairs used to define a stream. ● Categorize streams by purpose, owner, environment, etc. ● Custom sets of categories for specific needs.

    E.g. Project: Project Name; Owner: Name; Purpose: Load Testing; Environment: Production

  • Resharding:

    ● Adjust the number of shards in a stream to adapt to changes in the data flow rate. ● Performed by an administrative application distinct from the producers and consumers.

    ● Types of resharding: splitting and merging


    ● Data records with the same partition key have the same hash value. ● Selectively split the "hot" shards into two child shards. ● While splitting, a value in the hash-key range is specified:

    - hash-key values higher than that value (inclusive) are distributed to one child shard - hash-key values lower than that value are distributed to the other child shard.

    ● Splitting causes: increased stream capacity, increased cost, and wasted resources if the child shards are not used.
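The split-point arithmetic above can be sketched as follows (illustrative only; real resharding goes through the Kinesis SplitShard API):

```python
MAX_HASH = 2**128 - 1  # full Kinesis hash-key space

def split_shard(parent_range, split_point):
    """Split a parent shard's hash-key range at split_point: values at or
    above the split point go to one child, lower values to the other."""
    start, end = parent_range
    if not start < split_point <= end:
        raise ValueError("split point must fall inside the parent range")
    return (start, split_point - 1), (split_point, end)

# Split a "hot" shard covering the whole space down the middle.
lower_child, upper_child = split_shard((0, MAX_HASH), MAX_HASH // 2)
# The two children are contiguous and together cover the parent range.
```

Choosing the split point lets an operator move a disproportionate share of hot keys into their own child shard instead of always halving the range.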

  • Shard states: Open : Data records can be bot