Building & Operating High-Fidelity Data Streams
Transcript of Building & Operating High-Fidelity Data Streams
![Page 1: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/1.jpg)
Building & Operating High-Fidelity Data Streams
![Page 2: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/2.jpg)
Why Do Streams Matter?
![Page 3: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/3.jpg)
Why Do Streams Matter?In our world today, machine intelligence & personalization drive engaging experiences online
![Page 4: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/4.jpg)
Why Do Streams Matter?In our world today, machine intelligence & personalization drive engaging experiences online
![Page 5: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/5.jpg)
Why Do Streams Matter?In our world today, machine intelligence & personalization drive engaging experiences online
![Page 6: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/6.jpg)
Why Do Streams Matter?
Disparate data is constantly being connected to drive predictions that keep us engaged!
In our world today, machine intelligence & personalization drive engaging experiences online
![Page 7: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/7.jpg)
While it may seem that some magical SQL join is powering these connections….
Why Do Streams Matter?
![Page 8: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/8.jpg)
While it may seem that some magical SQL join is powering these connections….
Why Do Streams Matter?
The reality is that data growth has made it impractical to store all of this data in a single DB
![Page 9: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/9.jpg)
Why Do Streams Matter?
![Page 10: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/10.jpg)
Why Do Streams Matter?
How do companies manage the complexity below?
![Page 11: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/11.jpg)
Why Do Streams Matter?
A key piece to the puzzle is data movement, which usually comes in 2 forms:
How do companies manage the complexity below?
![Page 12: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/12.jpg)
Why Do Streams Matter?
A key piece to the puzzle is data movement, which usually comes in 2 forms:
Batch Processing
Stream Processing
How do companies manage the complexity below?
![Page 13: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/13.jpg)
Why Are Streams Hard?
![Page 14: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/14.jpg)
Why Are Streams Hard?
![Page 15: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/15.jpg)
Why Are Streams Hard?The answer lies in the image below : complexity, lots of moving parts
![Page 16: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/16.jpg)
Why Are Streams Hard?
In streaming architectures, any gaps in non-functional requirements can be unforgiving
![Page 17: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/17.jpg)
Why Are Streams Hard?
In streaming architectures, any gaps in non-functional requirements can be unforgiving
You end up spending a lot of your time fighting fires & keeping systems up
![Page 18: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/18.jpg)
Why Are Streams Hard?
In streaming architectures, any gaps in non-functional requirements can be unforgiving
You end up spending a lot of your time fighting fires & keeping systems up
If you don’t build your systems with the -ilities as first class citizens, you pay an operational tax
![Page 19: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/19.jpg)
Why Are Streams Hard?In streaming architectures, implementation gaps in non-functional requirements can be unforgiving
You end up spending a lot of your time fighting fires & keeping systems up
If you don’t build your systems with the -ilities as first class citizens, you pay an operational tax
… and this translates to unhappy customers and burnt-out team members!
![Page 20: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/20.jpg)
Why Are Streams Hard?
Data Infrastructure is an iceberg
Your customers may only see 10% of your effort — those that manifest in features
The remaining 90% of your work goes unnoticed because it relates to keeping the lights on
![Page 21: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/21.jpg)
Why Are Streams Hard?
Data Infrastructure is an iceberg
Your customers may only see 10% of your effort — those that manifest in features
The remaining 90% of your work goes unnoticed because it relates to keeping the lights on
In this talk, we will build high-fidelity streams-as-a-service from the ground up!
![Page 22: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/22.jpg)
Start Simple
![Page 23: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/23.jpg)
Start Simple
Goal : Build a system that can deliver messages from source S to destination D
S D
![Page 24: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/24.jpg)
Start Simple
Goal : Build a system that can deliver messages from source S to destination D
S D
But first, let’s decouple S and D by putting messaging infrastructure between them
ES D
Events topic
![Page 25: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/25.jpg)
Start Simple
Make a few more implementation decisions about this system
ES D
![Page 26: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/26.jpg)
Start Simple
Make a few more implementation decisions about this system
ES D
Run our system on a cloud platform (e.g. AWS)
![Page 27: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/27.jpg)
Start Simple
Make a few more implementation decisions about this system
Operate at low scale
ES D
Run our system on a cloud platform (e.g. AWS)
![Page 28: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/28.jpg)
Start Simple
Make a few more implementation decisions about this system
Operate at low scale
Kafka with a single partition
ES D
Run our system on a cloud platform (e.g. AWS)
![Page 29: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/29.jpg)
Start Simple
Make a few more implementation decisions about this system
Operate at low scale
Kafka with a single partition
Kafka across 3 brokers split across AZs with RF=3 (min in-sync replicas =2)
ES D
Run our system on a cloud platform (e.g. AWS)
![Page 30: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/30.jpg)
Start Simple
Make a few more implementation decisions about this system
Operate at low scale
Kafka with a single partition
Kafka across 3 brokers split across AZs with RF=3 (min in-sync replicas =2)
Run S & D on single, separate EC2 Instances
ES D
Run our system on a cloud platform (e.g. AWS)
![Page 31: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/31.jpg)
Start Simple
To make things a bit more interesting, let’s provide our stream as a service
We define our system boundary using a blue box as shown below!
ES D
![Page 32: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/32.jpg)
Reliability(Is This System Reliable?)
![Page 33: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/33.jpg)
Reliability
Goal : Build a system that can deliver messages reliably from S to D
ES D
![Page 34: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/34.jpg)
Reliability
Goal : Build a system that can deliver messages reliably from S to D
ES D
Concrete Goal : 0 message loss
![Page 35: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/35.jpg)
Reliability
Goal : Build a system that can deliver messages reliably from S to D
ES D
Concrete Goal : 0 message loss
Once S has ACKd a message to a remote sender, D must deliver that message to a remote receiver
![Page 36: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/36.jpg)
Reliability
How do we build reliability into our system?
ES D
![Page 37: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/37.jpg)
Reliability
Let’s first generalize our system!
`A B Cm1 m1
![Page 38: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/38.jpg)
Reliability
In order to make this system reliable
`A B Cm1 m1
![Page 39: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/39.jpg)
Reliability
`A B Cm1 m1
Treat the messaging system like a chain — it’s only as strong as its weakest link
In order to make this system reliable
![Page 40: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/40.jpg)
Reliability
`A B Cm1 m1
Treat the messaging system like a chain — it’s only as strong as its weakest link
Insight : If each process/link is transactional in nature, the chain will be transactional!
In order to make this system reliable
![Page 41: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/41.jpg)
Reliability
`A B Cm1 m1
Treat the messaging system like a chain — it’s only as strong as its weakest link
In order to make this system reliable
Insight : If each process/link is transactional in nature, the chain will be transactional!
Transactionality = At least once delivery
![Page 42: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/42.jpg)
Reliability
`A B Cm1 m1
Treat the messaging system like a chain — it’s only as strong as its weakest link
How do we make each link transactional?
In order to make this system reliable
Insight : If each process/link is transactional in nature, the chain will be transactional!
Transactionality = At least once delivery
![Page 43: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/43.jpg)
Reliability
Let’s first break this chain into its component processing links
Bm1 m1
`Am1 m1
` C m1m1
![Page 44: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/44.jpg)
Reliability
Let’s first break this chain into its component processing links
Bm1 m1
`Am1 m1
` C m1m1
A is an ingest node
![Page 45: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/45.jpg)
Reliability
Let’s first break this chain into its component processing links
Bm1 m1
`Am1 m1
` C m1m1
B is an internal node
![Page 46: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/46.jpg)
Reliability
Let’s first break this chain into its component processing links
Bm1 m1
`Am1 m1
` C m1m1
C is an expel node
![Page 47: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/47.jpg)
Reliability
But, how do we handle edge nodes A & C?
Bm1 m1
`Am1 m1
` C m1m1
What does A need to do?
• Receive a Request (e.g. REST)
• Do some processing
• Reliably send data to Kafka • kProducer.send(topic, message) • kProducer.flush() • Producer Config
• acks = all
• Send HTTP Response to caller
![Page 48: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/48.jpg)
Reliability
But, how do we handle edge nodes A & C?
Bm1 m1
`Am1 m1
` C m1m1
What does C need to do?
• Read data (a batch) from Kafka
• Do some processing
• Reliably send data out
• ACK / NACK Kafka • Consumer Config
• enable.auto.commit = false • ACK moves the read checkpoint
forward • NACK forces a reread of the same data
![Page 49: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/49.jpg)
Reliability
But, how do we handle edge nodes A & C?
Bm1 m1
`Am1 m1
` C m1m1
B is a combination of A and C
![Page 50: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/50.jpg)
Reliability
But, how do we handle edge nodes A & C?
Bm1 m1
`Am1 m1
` C m1m1
B is a combination of A and C
B needs to act like a reliable Kafka Producer
![Page 51: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/51.jpg)
Reliability
But, how do we handle edge nodes A & C?
Bm1 m1
`Am1 m1
` C m1m1
B is a combination of A and C
B needs to act like a reliable Kafka Producer
B needs to act like a reliable Kafka Consumer
![Page 52: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/52.jpg)
`A B Cm1 m1
Reliability
How reliable is our system now?
![Page 53: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/53.jpg)
Reliability
How reliable is our system now?
What happens if a process crashes?
`A B Cm1 m1
![Page 54: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/54.jpg)
Reliability
How reliable is our system now?
What happens if a process crashes?
If A crashes, we will have a complete outage at ingestion!
`A B Cm1 m1
![Page 55: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/55.jpg)
Reliability
How reliable is our system now?
If C crashes, we will stop delivering messages to external consumers!
What happens if a process crashes?
If A crashes, we will have a complete outage at ingestion!
`A B Cm1 m1
![Page 56: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/56.jpg)
Reliability
`A B Cm1 m1
Solution : Place each service in an autoscaling group of size T
`A B Cm1 m1
T-1 concurrent failures
![Page 57: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/57.jpg)
Reliability
`A B Cm1 m1
Solution : Place each service in an autoscaling group of size T
`A B Cm1 m1
T-1 concurrent failures
For now, we appear to have a pretty reliable data stream
![Page 58: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/58.jpg)
But how do we measure its reliability?
![Page 59: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/59.jpg)
Observability(A story about Lag & Loss Metrics)
(This brings us to …)
![Page 60: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/60.jpg)
Lag : What is it?
![Page 61: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/61.jpg)
Lag : What is it?Lag is simply a measure of message delay in a system
![Page 62: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/62.jpg)
Lag : What is it?Lag is simply a measure of message delay in a system
The longer a message takes to transit a system, the greater its lag
![Page 63: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/63.jpg)
Lag : What is it?Lag is simply a measure of message delay in a system
The longer a message takes to transit a system, the greater its lag
The greater the lag, the greater the impact to the business
![Page 64: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/64.jpg)
Lag : What is it?Lag is simply a measure of message delay in a system
The longer a message takes to transit a system, the greater its lag
The greater the lag, the greater the impact to the business
Hence, our goal is to minimize lag in order to deliver insights as quickly as possible
![Page 65: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/65.jpg)
Lag : How do we compute it?
![Page 66: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/66.jpg)
Lag : How do we compute it?eventTime : the creation time of an event message
Lag can be calculated for any message m1 at any node N in the system as
lag(m1, N) = current_time(m1, N) - eventTime(m1)
`A B Cm1 m1
T0eventTime:
![Page 67: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/67.jpg)
Lag : How do we compute it?
Lag-in @
A = T1 - T0 (e.g 1 ms)
B = T3 - T0 (e.g 5 ms)
C = T5 - T0 (e.g 10 ms)
`A B C
Arrival Lag (Lag-in): time message arrives - eventTime
T1 T3 T5T0eventTime:
m1 m1
![Page 68: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/68.jpg)
`A B C
Arrival Lag (Lag-in): time message arrives - eventTime
T1 T3 T5T0eventTime:
m1 m1
Lag : How do we compute it?
Lag-in @
A = T1 - T0 (e.g 1 ms)
B = T3 - T0 (e.g 5 ms)
C = T5 - T0 (e.g 10 ms)
Cumulative Lag
![Page 69: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/69.jpg)
Lag : How do we compute it?
Lag-in @
A = T1 - T0 (e.g 1 ms)
B = T3 - T0 (e.g 5 ms)
C = T5 - T0 (e.g 10 ms)
Lag-out @
A = T2 - T0 (e.g 3 ms)
B = T4 - T0 (e.g 8 ms)
C = T6 - T0 (e.g 12 ms)`A B C
Arrival Lag (Lag-in): time message arrives - eventTime
T1 T3 T5
Departure Lag (Lag-out): time message leaves - eventTime
T2 T4 T6
T0eventTime:
m1
![Page 70: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/70.jpg)
Lag : How do we compute it?
Lag-in @
A = T1 - T0 (e.g 1 ms)
B = T3 - T0 (e.g 5 ms)
C = T5 - T0 (e.g 10 ms)
Lag-out @
A = T2 - T0 (e.g 3 ms)
B = T4 - T0 (e.g 8 ms)
C = T6 - T0 (e.g 12 ms)`A B C
Arrival Lag (Lag-in): time message arrives - eventTime
T1 T3 T5
Departure Lag (Lag-out): time message leaves - eventTime
T2 T4
T0eventTime:
m1
E2E Lag
E2E Lag is the total time a message spent in the system
T6
![Page 71: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/71.jpg)
Lag : How do we compute it?While it is interesting to know the lag for a particular message m1, it is of little use since we typically deal with millions of messages
![Page 72: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/72.jpg)
Lag : How do we compute it?While it is interesting to know the lag for a particular message m1, it is of little use since we typically deal with millions of messages
Instead, we prefer statistics (e.g. P95) to capture population behavior
![Page 73: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/73.jpg)
Lag : How do we compute it?Some useful Lag statistics are:
E2E Lag (p95) : 95th percentile time of messages spent in the system
Lag_[in|out](N, p95): P95 Lag_in or Lag_out at any Node N
![Page 74: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/74.jpg)
Lag : How do we compute it?Some useful Lag statistics are:
E2E Lag (p95) : 95th percentile time of messages spent in the system
Lag_[in|out](N, p95): P95 Lag_in or Lag_out at any Node N
Process_Duration(N, p95) : Lag_out(N, p95) - Lag_in(N, p95)
![Page 75: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/75.jpg)
m1 m1
Process_Duration Graphs show you the contribution to overall Lag from each hop
Lag : How do we compute it?
![Page 76: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/76.jpg)
Loss : What is it?
![Page 77: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/77.jpg)
Loss : What is it?Loss is simply a measure of messages lost while transiting the system
![Page 78: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/78.jpg)
Loss : What is it?Loss is simply a measure of messages lost while transiting the system
Messages can be lost for various reasons, most of which we can mitigate!
![Page 79: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/79.jpg)
Loss : What is it?Loss is simply a measure of messages lost while transiting the system
Messages can be lost for various reasons, most of which we can mitigate!
The greater the loss, the lower the data quality
![Page 80: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/80.jpg)
Loss : What is it?Loss is simply a measure of messages lost while transiting the system
Messages can be lost for various reasons, most of which we can mitigate!
The greater the loss, the lower the data quality
Hence, our goal is to minimize loss in order to deliver high quality insights
![Page 81: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/81.jpg)
Loss : How do we compute it?
![Page 82: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/82.jpg)
Loss : How do we compute it?
Concepts : Loss
Loss can be computed as the set difference of messages between any 2 points in the system
m1m1m1m1m1m1m1m1m1m1m1m1
![Page 83: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/83.jpg)
Loss : How do we compute it?
Message Id E2E Loss E2E Loss %
m1 1 1 1 1
m2 1 1 1 1
m3 1 0 0 0
… … … … …
m10 1 1 0 0
Count 10 9 7 5
Per Node Loss(N) 0 1 2 2 5 50%
m1m1m1m1m1m1m1m1m1m1m1m1
![Page 84: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/84.jpg)
Loss : How do we compute it?
In a streaming data system, messages never stop flowing. So, how do we know when to count?
m1m1m1m1m1m1m1m1m1m1m1m1
m1m1m1m1m1m11m1m1m1m1m1m21
m1m1m1m1m1m31
![Page 85: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/85.jpg)
Loss : How do we compute it?
In a streaming data system, messages never stop flowing. So, how do we know when to count?
Solution
Allocate messages to 1-minute wide time buckets using message eventTime
m1m1m1m1m1m1m1m1m1m1m1m1
m1m1m1m1m1m11m1m1m1m1m1m21
m1m1m1m1m1m31
![Page 86: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/86.jpg)
Loss : How do we compute it?
Message Id E2E Loss E2E Loss %
m1 1 1 1 1
m2 1 1 1 1
m3 1 0 0 0
… … … … …
m10 1 1 0 0
Count 10 9 7 5
Per Node Loss(N) 0 1 2 2 5 50%
m1m1m1m1m1m1m1m1m1m1m1m1
@12:34p
![Page 87: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/87.jpg)
Loss : How do we compute it?
In a streaming data system, messages never stop flowing. So, how do we know when to count?
Solution
Allocate messages to 1-minute wide time buckets using message eventTime
Wait a few minutes for messages to transit, then compute loss
Raise alarms if loss occurs over a configured threshold (e.g. > 1%)
m1m1m1m1m1m1m1m1m1m1m1m1
m1m1m1m1m1m11m1m1m1m1m1m21
m1m1m1m1m1m31
![Page 88: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/88.jpg)
We now have a way to measure the reliability (via Loss metrics) and latency (via Lag metrics) of our system.
Loss : How do we compute it?
But wait…
![Page 89: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/89.jpg)
Performance(have we tuned our system for performance yet??)
![Page 90: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/90.jpg)
Performance
Goal : Build a system that can deliver messages reliably from S to D with low latency
SSSSSSSD
SSS…SSS…
To understand streaming system performance, let’s understand the components of E2E Lag
![Page 91: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/91.jpg)
Performance
Ingest Time
SSSSSSSD
SSS…S SS…
Ingest Time : Time from Last_Byte_In_of_Request to First_Byte_Out_of_Response
![Page 92: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/92.jpg)
Performance
Ingest Time
SSSSSSSD
SSS…S SS…
• This time includes overhead of reliably sending messages to Kafka
Ingest Time : Time from Last_Byte_In_of_Request to First_Byte_Out_of_Response
![Page 93: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/93.jpg)
Performance
Ingest Time Expel Time
SSSSSSSD
SSS…S SS…
Expel Time : Time to process and egest a message at D.
![Page 94: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/94.jpg)
Performance
E2E Lag
Ingest Time Expel TimeTransit Time
SSSSSSSD
SSS…S SS…
E2E Lag : Total time messages spend in the system from message ingest to expel!
![Page 95: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/95.jpg)
Performance
Ingest Time Expel TimeTransit Time
SSSSSSSD
SSS…S SS…
Transit Time : Rest of the time spent in the data pipe (i.e. internal nodes)
![Page 96: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/96.jpg)
Performance Penalties(Trading of Latency for Reliability)
![Page 97: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/97.jpg)
Performance : Penalties
In order to have stream reliability, we must sacrifice latency!
How can we handle our performance penalties?
![Page 98: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/98.jpg)
PerformanceChallenge 1 : Ingest Penalty
In the name of reliability, S needs to call kProducer.flush() on every inbound API request
S also needs to wait for 3 ACKS from Kafka before sending its API response
E2E Lag
Ingest Time Expel TimeTransit Time
SSSSSSSD
SSS…SSS…
![Page 99: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/99.jpg)
PerformanceChallenge 1 : Ingest Penalty
Approach : Amortization
Support Batch APIs (i.e. multiple messages per web request) to amortize the ingest penalty
E2E Lag
Ingest Time Expel TimeTransit Time
SSSSSSSD
SSS…SSS…
![Page 100: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/100.jpg)
Performance
E2E Lag
Ingest Time Expel TimeTransit Time
Challenge 2 : Expel Penalty
Observations
Kafka is very fast — many orders of magnitude faster than HTTP RTTs
The majority of the expel time is the HTTP RTT
SSSSSSSD
SSS…SSS…
![Page 101: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/101.jpg)
Performance
E2E Lag
Ingest Time Expel TimeTransit Time
Challenge 2 : Expel Penalty
Approach : Amortization
In each D node, add batch + parallelism
SSSSSSSD
SSS…SSS…
![Page 102: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/102.jpg)
PerformanceChallenge 3 : Retry Penalty (@ D)
Concepts
In order to run a zero-loss pipeline, we need to retry messages @ D that will succeed given enough attempts
![Page 103: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/103.jpg)
PerformanceChallenge 3 : Retry Penalty (@ D)
Concepts
In order to run a zero-loss pipeline, we need to retry messages @ D that will succeed given enough attempts
We call these Recoverable Failures
![Page 104: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/104.jpg)
PerformanceChallenge 3 : Retry Penalty (@ D)
Concepts
In order to run a zero-loss pipeline, we need to retry messages @ D that will succeed given enough attempts
We call these Recoverable Failures
In contrast, we should never retry a message that has 0 chance of success!
We call these Non-Recoverable Failures
![Page 105: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/105.jpg)
PerformanceChallenge 3 : Retry Penalty (@ D)
Concepts
In order to run a zero-loss pipeline, we need to retry messages @ D that will succeed given enough attempts
We call these Recoverable Failures
In contrast, we should never retry a message that has 0 chance of success!
We call these Non-Recoverable Failures
E.g. Any 4xx HTTP response code, except for 429 (Too Many Requests)
![Page 106: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/106.jpg)
PerformanceChallenge 3 : Retry Penalty
Approach
We pay a latency penalty on retry, so we need to smart about
What we retry — Don’t retry any non-recoverable failures
How we retry
![Page 107: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/107.jpg)
PerformanceChallenge 3 : Retry Penalty
Approach
We pay a latency penalty on retry, so we need to smart about
What we retry — Don’t retry any non-recoverable failures
How we retry — One Idea : Tiered Retries
![Page 108: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/108.jpg)
Performance - Tiered Retries
Local Retries
Try to send message a configurable number of times @ D
Global Retries
![Page 109: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/109.jpg)
Performance - Tiered Retries
Local Retries
Try to send message a configurable number of times @ D
If we exhaust local retries, D transfers the message to a Global Retrier
Global Retries
![Page 110: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/110.jpg)
Performance - Tiered Retries
Local Retries
Try to send message a configurable number of times @ D
If we exhaust local retries, D transfers the message to a Global Retrier
Global Retries
The Global Retrier than retries the message over a longer span of time
![Page 111: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/111.jpg)
`
ESSSS
SSSDRO
RISSSgr
Performance - 2 Tiered Retries
RI : Retry_In RO : Retry_Out
![Page 112: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/112.jpg)
At this point, we have a system that works well at low scale
Performance
![Page 113: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/113.jpg)
Scalability
![Page 114: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/114.jpg)
ScalabilityFirst, Let’s dispel a myth!
![Page 115: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/115.jpg)
ScalabilityFirst, Let’s dispel a myth!
Each system is traffic-rated
The traffic rating comes from running load tests
There is no such thing as a system that can handle infinite scale
![Page 116: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/116.jpg)
ScalabilityFirst, Let’s dispel a myth!
Each system is traffic-rated
The traffic rating comes from running load tests
There is no such thing as a system that can handle infinite scale
We only achieve higher scale by iteratively running load tests & removing bottlenecks
![Page 117: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/117.jpg)
Scalability - AutoscalingAutoscaling Goals (for data streams):
Goal 1: Automatically scale out to maintain low latency (e.g. E2E Lag)
Goal 2: Automatically scale in to minimize cost
![Page 118: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/118.jpg)
Scalability - AutoscalingAutoscaling Goals (for data streams):
Goal 1: Automatically scale out to maintain low latency (e.g. E2E Lag)
Goal 2: Automatically scale in to minimize cost
![Page 119: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/119.jpg)
Scalability - AutoscalingAutoscaling Goals (for data streams):
Goal 1: Automatically scale out to maintain low latency (e.g. E2E Lag)
Goal 2: Automatically scale in to minimize cost
Autoscaling Considerations
What can autoscale? What can’t autoscale?
![Page 120: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/120.jpg)
Scalability - AutoscalingAutoscaling Goals (for data streams):
Goal 1: Automatically scale out to maintain low latency (e.g. E2E Lag)
Goal 2: Automatically scale in to minimize cost
Autoscaling Considerations
What can autoscale? What can’t autoscale?
![Page 121: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/121.jpg)
Scalability - Autoscaling EC2
The most important part of autoscaling is picking the right metric to trigger autoscaling actions
![Page 122: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/122.jpg)
Scalability - Autoscaling EC2Pick a metric that
Preserves low latency
Goes up as traffic increases
Goes down as the microservice scales out
![Page 123: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/123.jpg)
Scalability - Autoscaling EC2Pick a metric that
Preserves low latency
Goes up as traffic increases
Goes down as the microservice scales out
E.g.
Average CPU
![Page 124: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/124.jpg)
Scalability - Autoscaling EC2Pick a metric that
Preserves low latency
Goes up as traffic increases
Goes down as the microservice scales out
E.g.
Average CPU
What to be wary of
Any locks/code synchronization & IO Waits
Otherwise … As traffic increases, CPU will plateau, auto-scale-out will stop, and latency (i.e. E2E Lag) will increase
![Page 125: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/125.jpg)
What Next?
We now have a system with the Non-functional Requirements (NFRs) that we desire!
![Page 126: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/126.jpg)
What Next?What if we want to handle
• Different types of messages
• More complex processing ( i.e. more processing stages)
• More complex stream topologies (e.g. 1-1, 1-many, many-many)
![Page 127: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/127.jpg)
What Next?
It will take a lot of work to rebuild our data pipe for each variation of customers’ needs!
What we need to do is build a more generic Streams-as-a-Service (STaaS) platform!
![Page 128: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/128.jpg)
Building StaaS
Firstly, let’s make our pipeline a bit more realistic by adding more processing stages
SSSSSSSD
SSS…SSS…
Ingest Normalize Enrich Route Transform Transmit
![Page 129: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/129.jpg)
Building StaaSAnd by handling more complex topologies (e.g. many-to-many)
Normalize Enrich Route Transform
Transmit (T1)
Transmit (T2)
Transmit (T3)
Transmit (T4)
Transmit (T5)
Ingest (n1)
Ingest (n2)
Ingest (n3)
![Page 130: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/130.jpg)
Building StaaS
This our data plane — it send messages from multiple sources to multiple destinations
Normalize Enrich Route Transform
Transmit (T1)
Transmit (T2)
Transmit (T3)
Transmit (T4)
Transmit (T5)
Ingest (n1)
Ingest (n2)
Ingest (n3)
![Page 131: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/131.jpg)
Building StaaSBut, we also want to allow users the ability to define their own data pipes in this data plane
![Page 132: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/132.jpg)
Building StaaSHence, we need a management plane to capture the intent of the users
Admin FE
Admin BE
![Page 133: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/133.jpg)
Data Plane
Building StaaSWe now have 2 planes: a Management Plane & a Data Plane
Admin FE
Admin BE
Management Plane
![Page 134: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/134.jpg)
Building StaaSHence, we need at least 2 planes : Management & Data
Data Plane
Admin FE
Admin BE
Management Plane
![Page 135: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/135.jpg)
Building StaaS
Data Plane
Admin FE
Admin BE
Management Plane
Control Plane
P
We also need a Provisioner(P)
![Page 136: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/136.jpg)
Building StaaS
Data Plane
Admin FE
Admin BE
Management Plane
Control Plane
P
We also need a Deployer(D)
D
![Page 137: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/137.jpg)
Building StaaS
Data Plane
Admin FE
Admin BE
Management Plane
Control Plane
O
A
Finally, we can add systems to promote health and stability: Observer(O) & Autoscaler (A)
P
D
![Page 138: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/138.jpg)
Building StaaS
Data Plane
Admin FE
Admin BE
Management Plane
Control Plane
O
A
Together these 4 services form the Control Plane
P
D
![Page 139: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/139.jpg)
Building StaaS
The Control Plane Topology is a diamond-cross
Control Plane
O
D
A
P
![Page 140: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/140.jpg)
Building StaaS
• The observer(O) is the source of truth for system health
• It is aware of D, P, and A activity & may quiet alarms during certain actions
Control Plane
O
D
A
P
![Page 141: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/141.jpg)
Building StaaS
• The observer(O) is the source of truth for system health
• It is aware of D, P, and A activity & may quiet alarms during certain actions
Control Plane
O
D
A
P
• It can collect and monitor more complex health metrics than lag and loss. For example, in ML pipelines, it can track scoring skew
![Page 142: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/142.jpg)
Building StaaS
• The observer(O) is the source of truth for system health
• It is aware of D, P, and A activity & may quiet alarms during certain actions
Control Plane
O
D
A
P
• The system can also detect common causes of non-recoverable failures & alert customers
• It can collect and monitor more complex health metrics than lag and loss. For example, in ML pipelines, it can track scoring skew
![Page 143: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/143.jpg)
Building StaaS• The deployer(D) deploys new code to the data plane
• It can however not deploy if the system is unstable or autoscaling
• It can also automatically roll back if the system becomes unstable due to deployment
Control Plane
O
D
A
P
![Page 144: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/144.jpg)
Building StaaS• The provisioner(P) deploys customer data pipes to the
system.
• It can pause if the system is unstable or autoscaling
Control Plane
O
D
A
P
![Page 145: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/145.jpg)
Building StaaS• The provisioner(P) deploys customer data pipes to the
system.
• It can pause if the system is unstable or autoscaling
Control Plane
O
D
A
P
• The provisioner(P) can also control things like phased traffic ramp ups for new deployed pipelines
![Page 146: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/146.jpg)
Conclusion
![Page 147: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/147.jpg)
Conclusion
• We have built a Streams-as-a-Service system with many NFRs as first class citizens
• While we’ve covered many key elements, a few areas will be covered in future talks (e.g. Isolation, Containerization, Caching)
• Should you have questions, join me for Q&A and follow for more on (@r39132)
![Page 148: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/148.jpg)
Thank You for your Time
![Page 149: Building & Operating High-Fidelity Data Streams](https://reader036.fdocuments.net/reader036/viewer/2022081419/62a0bdf96e20a03dc41e9099/html5/thumbnails/149.jpg)
And thanks to the many people who help build these systems with me..
• Vincent Chen • Anisha Nainani • Pramod Garre • Harsh Bhimani • Nirmalya Ghosh • Yash Shah • Aastha Sinha • Esther Kent • Dheeraj Rampali
• Deepak Chandramouli • Prasanna Krishna • Sandhu Santhakumar • Maneesh CM & team at Active
Lobby • Shiju & the team at Xminds • Bob Carlson • Tony Gentille • Josh Evans