Publish-subscribe Message Framework with Apache Kafka and Kinesis
Introduction to messaging frameworks
Publish-subscribe messaging framework architecture
Introduction to ZooKeeper
ZooKeeper core concepts
Introduction to Kafka
Kafka elementary concepts
How Kafka works in the messaging framework
Introduction to Kinesis Firehose and Kinesis Analytics
Key concepts of Kinesis
Uses of Amazon Kinesis Analytics
Let's first discuss the publish-subscribe messaging framework.
The Publish/Subscribe pattern, also known as Pub/Sub, is an architectural design pattern that provides a framework for exchanging messages between publishers and subscribers. In this pattern, publishers and subscribers rely on a message broker that relays messages from the publisher to the subscribers. The publisher publishes messages (events) to a channel that subscribers can then subscribe to.
Four core concepts make up the pub/sub model:
• Topic – an intermediary channel that maintains a list of subscribers and relays to them the messages received from publishers
• Message – serialized data sent to a topic by a publisher, which has no knowledge of the subscribers
• Publisher – the application that publishes a message to a topic
• Subscriber – an application that registers itself with the desired topic in order to receive the appropriate messages
The process of selecting messages for reception and processing is called filtering. There are two common forms of filtering: topic-based and content-based.
In a topic-based system, messages are published to "topics", or named logical channels. Subscribers in a topic-based system receive all messages published to the topics to which they subscribe. The publisher is responsible for defining the topics to which subscribers can subscribe.
In a content-based system, messages are delivered to a subscriber only if the attributes or content of those messages match constraints defined by the subscriber. The subscriber is responsible for classifying the messages.
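Both filtering forms can be sketched with a tiny in-memory broker (the `Broker` class below is hypothetical, for illustration only): a plain subscription is topic-based, while passing a predicate makes it content-based.

```python
from collections import defaultdict

class Broker:
    """Minimal in-memory pub/sub broker sketch."""

    def __init__(self):
        self._subs = defaultdict(list)  # topic -> [(callback, predicate)]

    def subscribe(self, topic, callback, predicate=None):
        # predicate=None -> plain topic-based subscription;
        # otherwise the subscriber's constraint is checked per message.
        self._subs[topic].append((callback, predicate))

    def publish(self, topic, message):
        for callback, predicate in self._subs[topic]:
            if predicate is None or predicate(message):
                callback(message)

broker = Broker()
received = []
# Topic-based: gets every message on the "orders" topic.
broker.subscribe("orders", received.append)
# Content-based: only messages whose content matches the constraint.
broker.subscribe("orders", lambda m: received.append(("big", m)),
                 predicate=lambda m: m["amount"] > 100)

broker.publish("orders", {"amount": 50})
broker.publish("orders", {"amount": 500})
```

After the two publishes, the topic-based subscriber has seen both messages, while the content-based one has only seen the second.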
Publish-Subscribe Message Framework architecture
Let's discuss ZooKeeper. It is an essential part of Kafka, so we should understand it clearly.
• A message is a small or medium-sized piece of data
• For Kafka, a message is simply an array of bytes
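Since Kafka only sees an array of bytes, the application chooses the serialization on the way in and the deserialization on the way out. A minimal sketch, assuming JSON over UTF-8 (one common choice, not the only one):

```python
import json

def serialize(event: dict) -> bytes:
    # Application-level encoding: Kafka itself treats the result as opaque bytes.
    return json.dumps(event).encode("utf-8")

def deserialize(raw: bytes) -> dict:
    # The consumer must apply the inverse of whatever the producer used.
    return json.loads(raw.decode("utf-8"))

raw = serialize({"user": "alice", "action": "login"})
assert isinstance(raw, bytes)
assert deserialize(raw) == {"user": "alice", "action": "login"}
```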
Ways of sending messages:
• Fire and forget – send the message and do not wait for acknowledgment
• Synchronous send – send the message and wait for acknowledgment
• Asynchronous send – send the message and register a callback function for acknowledgment
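The three patterns can be sketched with a plain `ThreadPoolExecutor` standing in for the producer (this is not the Kafka API itself; in the real client, `send()` likewise returns a future the caller can ignore, block on, or attach a callback to):

```python
from concurrent.futures import ThreadPoolExecutor

def deliver(message):
    # Stand-in for the broker acknowledging a message.
    return f"ack:{message}"

executor = ThreadPoolExecutor(max_workers=1)

# 1. Fire and forget: ignore the future (fastest, but failures go unnoticed).
executor.submit(deliver, "m1")

# 2. Synchronous send: block until the acknowledgment arrives.
ack = executor.submit(deliver, "m2").result()
assert ack == "ack:m2"

# 3. Asynchronous send: attach a callback that runs on acknowledgment.
acks = []
executor.submit(deliver, "m3").add_done_callback(lambda f: acks.append(f.result()))
executor.shutdown(wait=True)
assert acks == ["ack:m3"]
```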
Since the partition number is computed as the hash of the key modulo the total number of partitions, the partition for a given key will change if we increase the number of partitions.
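A minimal sketch of this effect, using CRC32 as a stand-in for Kafka's actual murmur2 hash:

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    # hash(key) mod number-of-partitions; CRC32 is only an illustrative hash here.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# The same key can land on a different partition once the count changes,
# which is why adding partitions breaks per-key ordering guarantees.
moved = [k for k in (f"user-{i}" for i in range(10))
         if partition_for(k, 4) != partition_for(k, 6)]
print(moved)  # keys whose partition assignment changed when going from 4 to 6
```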
By default we don't know which partition goes to which consumer; a round-robin strategy is used.
• A custom partitioner allows us to decide the partition number for different kinds of messages
• Manual assignment allows a consumer to process messages coming from a certain set of partitions
Producer configuration also controls the number of retries (retries) and the time between two retries (retry.backoff.ms).
max.in.flight.requests.per.connection controls how many unacknowledged requests the producer will send on a single connection; setting it to 1 preserves message ordering when retries are enabled.
1. Maintain a list of offsets that are processed and ready to be committed. 2. Commit those offsets when partitions are going away, which is notified via the rebalance listener callback onPartitionsRevoked.
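The two steps above can be sketched with in-memory stand-ins (the `OffsetTracker` class and `commit_fn` below are hypothetical; a real consumer would call its commit method inside `onPartitionsRevoked`):

```python
class OffsetTracker:
    def __init__(self, commit_fn):
        self._pending = {}        # partition -> next offset to commit
        self._commit = commit_fn  # stands in for consumer.commit(offsets)

    def processed(self, partition, offset):
        # Step 1: remember offsets that are processed and ready to be committed.
        # Kafka commits the *next* offset to read, hence offset + 1.
        self._pending[partition] = offset + 1

    def on_partitions_revoked(self, revoked):
        # Step 2: commit before ownership moves, mirroring onPartitionsRevoked.
        to_commit = {p: o for p, o in self._pending.items() if p in revoked}
        if to_commit:
            self._commit(to_commit)
        for p in to_commit:
            del self._pending[p]

committed = {}
tracker = OffsetTracker(committed.update)
tracker.processed(0, 41)
tracker.processed(1, 7)
tracker.on_partitions_revoked({0})   # rebalance takes partition 0 away
assert committed == {0: 42}          # partition 0 committed, partition 1 still pending
```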
• The default heartbeat interval is 3 seconds.
• Tune heartbeat.interval.ms and session.timeout.ms if we can't poll every 3 seconds.
• Otherwise rebalancing occurs.
© Stephane Maarek
● Amazon Kinesis is a fully managed, scalable, cloud-based service.
● It allows real-time processing of large amounts of streaming data.
● It takes in any amount of data from several sources, scaling up and down as needed, and feeds applications that can be run on EC2 instances.
● Kinesis Firehose – to easily load streaming data into AWS.
● Kinesis Analytics – to easily process and analyze streaming data with standard SQL.
● Kinesis Streams – to build custom applications that process and analyze data.
(Diagram: Amazon Kinesis Streams, Amazon Kinesis Firehose, Amazon Kinesis Analytics, metrics & logs, Amazon S3 bucket)
Kinesis Data Streams overview:
• Collect and process large streams of data records in real time
• Common use cases: accelerated log and feed data intake, real-time data analytics, complex stream processing
• Simple pay-as-you-go pricing
• Benefits of Kinesis Data Streams:
- Data streams ensure durability with low latency
- Elasticity, i.e. easy scaling of a stream up or down
- Multiple simultaneous consumers of the same data
- Fault-tolerant consumption
Terminology and concepts overview:
Kinesis Data Stream: a set of shards; unlimited streams per AWS account.
Shard: a uniquely identified set of data records in a stream.
- Unlimited number of shards in a stream
- Max ingest capacity: 1 MB/s (including partition keys) or 1,000 records/s
- Max read rate: 2 MB/s per shard
- Max read transactions: 5/s per shard
- Each shard covers a hash-key range
The account is charged on a per-shard basis.
Data record: the unit of data stored, composed of a sequence number, a partition key, and a data blob. The data blob can be any type of data: a segment from a log file, geographic/location data, website clickstream data, and so on. Max size: 1 MB.
Partition key:
- A Unicode string, maximum length 256 characters
- Hashed to a 128-bit integer value (using the MD5 hash function)
- Maps associated data records to shards using hash-key ranges
- An ExplicitHashKey can be supplied to explicitly determine the shard a record is sent to
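As a sketch of the key-to-shard mapping: MD5 yields a 128-bit integer, and each shard owns a contiguous slice of the [0, 2^128) hash-key space (the even range split below is an assumption that holds for a freshly created stream):

```python
import hashlib

def hash_key(partition_key: str) -> int:
    # MD5 digest is 16 bytes = 128 bits, interpreted as an unsigned integer.
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, byteorder="big")

def shard_for(partition_key: str, num_shards: int) -> int:
    # Assume evenly sized hash-key ranges across the shards.
    range_size = 2**128 // num_shards
    return min(hash_key(partition_key) // range_size, num_shards - 1)

hk = hash_key("user-42")
assert 0 <= hk < 2**128
assert 0 <= shard_for("user-42", 4) < 4
```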
Sequence number: a number unique per partition key, assigned to each data record by Kinesis Data Streams after it is written to the stream; it increases over time.
Total capacity of a stream: the sum of the capacities of its shards.
The number of shards must be specified before creating a data stream; dynamic scaling is possible afterwards.
Factors for the initial size: average data record size, data read/write rate, number of consumers, and incoming write/outgoing read bandwidth.
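These factors combine into a rough initial estimate; a sketch, assuming the rule of thumb from the Kinesis documentation (number of shards = max of incoming write bandwidth in KiB/s over 1024 and outgoing read bandwidth in KiB/s over 2048, where outgoing bandwidth is incoming bandwidth times the number of consumers):

```python
import math

def initial_shards(avg_record_kib: float, records_per_sec: int, consumers: int) -> int:
    incoming_kib = avg_record_kib * records_per_sec    # write bandwidth, KiB/s
    outgoing_kib = incoming_kib * consumers            # read bandwidth, KiB/s
    # 1 shard ingests up to 1 MiB/s (1024 KiB) and serves up to 2 MiB/s (2048 KiB).
    return max(1, math.ceil(max(incoming_kib / 1024, outgoing_kib / 2048)))

# e.g. 4 KiB records at 500 records/s with 2 consumers:
print(initial_shards(4, 500, 2))  # 2
```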
Producer:
● Puts data records into Amazon Kinesis Data Streams.
● To put data, specify the stream name, a partition key, and the data blob.
● The number of partition keys should be much larger than the number of shards.
Kinesis Data Stream applications/consumers:
● Read and process data records from the stream.
● Run on a fleet of EC2 instances.
● Multiple applications can consume one stream, each independently and concurrently.
Server-side encryption: Kinesis Data Streams can automatically encrypt sensitive data as the producer enters it into the stream, using an AWS KMS master key.
Retention period: the length of time that data records are accessible after they are added to the stream. Default: 24 hours after creation. Maximum: 168 hours. Additional charges apply for a retention period greater than 24 hours.
● Tags: key-value pairs used to describe a stream. ● Categorize streams by purpose, owner, environment, etc. ● Define a custom set of categories for specific needs.
E.g. Project: project name; Owner: name; Purpose: load testing; Environment: production
● Resharding: adjust the number of shards in a stream to adapt to changes in the data flow rate. ● It is performed by an administrative application, separate from the producers and consumers.
● Types of resharding: splitting and merging.
● Data records with the same partition key have the same hash value. ● Selectively split the "hot" shards into two "child" shards. ● While splitting, a value in the hash-key range is specified:
- hash-key values equal to or higher than that value are distributed to one child shard
- hash-key values lower than that value are distributed to the other child shard
● Splitting causes: increased stream capacity, increased cost, and wasted resources if the child shards are not fully used.
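The split rule above can be sketched as a minimal model of the SplitShard operation, using inclusive hash-key ranges:

```python
def split_shard(start: int, end: int, split_point: int):
    # Keys at or above the split point go to one child shard,
    # keys below it go to the other (both ranges inclusive).
    assert start < split_point <= end
    child_low = (start, split_point - 1)
    child_high = (split_point, end)
    return child_low, child_high

# Split a parent shard covering the full 128-bit hash-key space at the midpoint.
parent = (0, 2**128 - 1)
low, high = split_shard(*parent, split_point=2**127)
assert low == (0, 2**127 - 1)
assert high == (2**127, 2**128 - 1)
```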
Shard states: Open: data records can be both added to and retrieved from the shard.