ScalableData Ingestion for Stream Processingand...
Transcript of ScalableData Ingestion for Stream Processingand...
![Page 1: ScalableData Ingestion for Stream Processingand Beyondicl.utk.edu/jlesc9/files/STA1.2/jlesc9_antoniu.pdfHybrid analytics architecture DATA from the Real World DATA Hypothetical World](https://reader034.fdocuments.net/reader034/viewer/2022042307/5ed36ae4e450f87a2052e6a8/html5/thumbnails/1.jpg)
Scalable Data Ingestion for Stream Processing and Beyond
Gabriel AntoniuJoint work with Ovidiu Marcu, Alexandru Costan, Maria S. Pérez
9th JLESC workshop, UTK, Knoxville, April 15, 2019
![Page 2: ScalableData Ingestion for Stream Processingand Beyondicl.utk.edu/jlesc9/files/STA1.2/jlesc9_antoniu.pdfHybrid analytics architecture DATA from the Real World DATA Hypothetical World](https://reader034.fdocuments.net/reader034/viewer/2022042307/5ed36ae4e450f87a2052e6a8/html5/thumbnails/2.jpg)
Velocity
Data in motion
FluidDynamic
Volume
Data at rest
StationaryStatic
2
From Big Data to Fast Data
![Page 3: ScalableData Ingestion for Stream Processingand Beyondicl.utk.edu/jlesc9/files/STA1.2/jlesc9_antoniu.pdfHybrid analytics architecture DATA from the Real World DATA Hypothetical World](https://reader034.fdocuments.net/reader034/viewer/2022042307/5ed36ae4e450f87a2052e6a8/html5/thumbnails/3.jpg)
Correctness Exact results Approximate results
Latency High-latency Low-latency
Cost Stateless Stateful
Batch Streaming
3
![Page 4: ScalableData Ingestion for Stream Processingand Beyondicl.utk.edu/jlesc9/files/STA1.2/jlesc9_antoniu.pdfHybrid analytics architecture DATA from the Real World DATA Hypothetical World](https://reader034.fdocuments.net/reader034/viewer/2022042307/5ed36ae4e450f87a2052e6a8/html5/thumbnails/4.jpg)
State of the art until recently:Lambda Architectures
Historical events
Real-time events
Exact historical
model
Approximate real-time
model
Periodic queries
Continuous queries
Batch processing
Stream processing
Results &
Actions
What?
Why?
4
![Page 5: ScalableData Ingestion for Stream Processingand Beyondicl.utk.edu/jlesc9/files/STA1.2/jlesc9_antoniu.pdfHybrid analytics architecture DATA from the Real World DATA Hypothetical World](https://reader034.fdocuments.net/reader034/viewer/2022042307/5ed36ae4e450f87a2052e6a8/html5/thumbnails/5.jpg)
The streaming pipeline: latency happens
Unified batch and stream processing
Ingest delay (write latency)
Throughput(read latency)
Network delay or unavailable Backlog
Poor storage design
Starved resources
Hardware failure
DATATRANSFER
Edge Cloud
5
![Page 6: ScalableData Ingestion for Stream Processingand Beyondicl.utk.edu/jlesc9/files/STA1.2/jlesc9_antoniu.pdfHybrid analytics architecture DATA from the Real World DATA Hypothetical World](https://reader034.fdocuments.net/reader034/viewer/2022042307/5ed36ae4e450f87a2052e6a8/html5/thumbnails/6.jpg)
What is ingestion ?
•Collect data from various sources → producers
•Deliver them for processing / storage → consumers
•Optionally: buffer, log, pre-process
Ingestion determines the processing performance6
![Page 7: ScalableData Ingestion for Stream Processingand Beyondicl.utk.edu/jlesc9/files/STA1.2/jlesc9_antoniu.pdfHybrid analytics architecture DATA from the Real World DATA Hypothetical World](https://reader034.fdocuments.net/reader034/viewer/2022042307/5ed36ae4e450f87a2052e6a8/html5/thumbnails/7.jpg)
State of the art: Apache Kafka
Limitations• Scalability• Data duplication
400 nodes, peak 3.2M events/s50 nodes, average 200K events/s
7
![Page 8: ScalableData Ingestion for Stream Processingand Beyondicl.utk.edu/jlesc9/files/STA1.2/jlesc9_antoniu.pdfHybrid analytics architecture DATA from the Real World DATA Hypothetical World](https://reader034.fdocuments.net/reader034/viewer/2022042307/5ed36ae4e450f87a2052e6a8/html5/thumbnails/8.jpg)
The KerA approach to ingestion • Scalability → Dynamic partitioning• Enables seamless elasticity
• Data duplication → Unified ingestion and storage • Support for both
• Streams (unbounded data)
• Objects (bounded data)
8
![Page 9: ScalableData Ingestion for Stream Processingand Beyondicl.utk.edu/jlesc9/files/STA1.2/jlesc9_antoniu.pdfHybrid analytics architecture DATA from the Real World DATA Hypothetical World](https://reader034.fdocuments.net/reader034/viewer/2022042307/5ed36ae4e450f87a2052e6a8/html5/thumbnails/9.jpg)
Zoom: scalability
Each partition is statically associated with one consumer: limited scalability
Producers Brokers Consumers
Partitions
Partitions
Partitions
9
![Page 10: ScalableData Ingestion for Stream Processingand Beyondicl.utk.edu/jlesc9/files/STA1.2/jlesc9_antoniu.pdfHybrid analytics architecture DATA from the Real World DATA Hypothetical World](https://reader034.fdocuments.net/reader034/viewer/2022042307/5ed36ae4e450f87a2052e6a8/html5/thumbnails/10.jpg)
KerA: dynamic partitioning
10
• Streamlets: logical stream containers; #streamlets > #brokers• Groups: created and processed dynamically; maximum #active groups per broker
Streamlets Streamlets
Streamlets Streamlets
Streamlets Streamlets
GroupsGroups
GroupsGroups
Groups
![Page 11: ScalableData Ingestion for Stream Processingand Beyondicl.utk.edu/jlesc9/files/STA1.2/jlesc9_antoniu.pdfHybrid analytics architecture DATA from the Real World DATA Hypothetical World](https://reader034.fdocuments.net/reader034/viewer/2022042307/5ed36ae4e450f87a2052e6a8/html5/thumbnails/11.jpg)
Increased network and storage overheads 11
Zoom: data duplication
![Page 12: ScalableData Ingestion for Stream Processingand Beyondicl.utk.edu/jlesc9/files/STA1.2/jlesc9_antoniu.pdfHybrid analytics architecture DATA from the Real World DATA Hypothetical World](https://reader034.fdocuments.net/reader034/viewer/2022042307/5ed36ae4e450f87a2052e6a8/html5/thumbnails/12.jpg)
KerA: unified ingestion and storage
KerA
INGESTIONBrokers
Acquire Push/Pulldata access
Streams
Objects
Common data model for streams and objects
STORAGEBackups
Move less data, process them faster
12
![Page 13: ScalableData Ingestion for Stream Processingand Beyondicl.utk.edu/jlesc9/files/STA1.2/jlesc9_antoniu.pdfHybrid analytics architecture DATA from the Real World DATA Hypothetical World](https://reader034.fdocuments.net/reader034/viewer/2022042307/5ed36ae4e450f87a2052e6a8/html5/thumbnails/13.jpg)
Evaluating scalability
1.0*1053.0*1055.0*1057.0*1059.0*1051.1*1061.3*1061.5*1061.7*1061.9*1062.1*106
4 8 16 32
Aver
age
Aggr
egat
ed
Clie
nt T
hrou
ghpu
t (re
cord
s/s)
Clients Number
KeraProdKeraConsKafkaProdKafkaCons
1.0*1053.0*1055.0*1057.0*1059.0*1051.1*1061.3*1061.5*1061.7*1061.9*1062.1*106
4 8 12 16
Ave
rage
Agg
rega
ted
Clie
nt T
hrou
ghpu
t (re
cord
s/s)
Nodes Number
KeraProdKeraConsKafkaProdKafkaCons
64 clients, 32 partitions, 1MB request size, 100B records
# Brokers
2x better throughputwith 75% less resources
Vertical Horizontal
8x10x
# Clients
4 brokers, 32 partitions, 128KB request size, 100B records
21*105
19*105
17*105
15*105
13*105
11*105
9*105
7*105
5*105
3*105
1*105
Thro
ughp
ut
21*105
19*105
17*105
15*105
13*105
11*105
9*105
7*105
5*105
3*105
1*105
13
![Page 14: ScalableData Ingestion for Stream Processingand Beyondicl.utk.edu/jlesc9/files/STA1.2/jlesc9_antoniu.pdfHybrid analytics architecture DATA from the Real World DATA Hypothetical World](https://reader034.fdocuments.net/reader034/viewer/2022042307/5ed36ae4e450f87a2052e6a8/html5/thumbnails/14.jpg)
14
Our vision: hybrid analytics architecture
Present data
Stream processing
Past data
Historicalmodel
Real-timemodel
ComputationalModel
Batch processing
FuturemodelSimulation
Proactive control
Continuous update
In situ processing
In transit processing
HybridAnalytics
![Page 15: ScalableData Ingestion for Stream Processingand Beyondicl.utk.edu/jlesc9/files/STA1.2/jlesc9_antoniu.pdfHybrid analytics architecture DATA from the Real World DATA Hypothetical World](https://reader034.fdocuments.net/reader034/viewer/2022042307/5ed36ae4e450f87a2052e6a8/html5/thumbnails/15.jpg)
Hybrid analytics: processing architecture
DATAfrom the
Real World
DATAfrom the
Hypothetical World …
Simulation (e.g., digital twin)
Computation
In situ pre-processingof simulation data
Sensor
In situ streampre-processingof sensor data
…
Hybrid (stream + batch) in transit processing
(data in-motion + data at-rest)
Historicaldata
BetterDecisionLearning
15
Data processing
![Page 16: ScalableData Ingestion for Stream Processingand Beyondicl.utk.edu/jlesc9/files/STA1.2/jlesc9_antoniu.pdfHybrid analytics architecture DATA from the Real World DATA Hypothetical World](https://reader034.fdocuments.net/reader034/viewer/2022042307/5ed36ae4e450f87a2052e6a8/html5/thumbnails/16.jpg)
16
Hybrid analytics architecture
DATAfrom the
Real World
DATAfrom the
Hypothetical World …
Simulation (e.g., digital twin)
Computation
In situ pre-processingof simulation data
Sensor
In situ streampre-processingof sensor data
…
BetterDecisionLearning
Hybrid (stream + batch) in transit processing
(data in-motion + data at-rest)
Historicaldata
Postdoc (ANR OverFlow project)• Investigating Edge vs. Cloud
computing trade-offs for stream processing
• Methodology for benchmarking Edge processing frameworks
Ph.D. (to hire)• Uniform Cloud and Edge stream
processing for Fast Data analytics
Pedro Silva
![Page 17: ScalableData Ingestion for Stream Processingand Beyondicl.utk.edu/jlesc9/files/STA1.2/jlesc9_antoniu.pdfHybrid analytics architecture DATA from the Real World DATA Hypothetical World](https://reader034.fdocuments.net/reader034/viewer/2022042307/5ed36ae4e450f87a2052e6a8/html5/thumbnails/17.jpg)
Hybrid analytics architecture
DATAfrom the
Real World
DATAfrom the
Hypothetical World …
Simulation (e.g., digital twin)
Computation
In situ pre-processingof simulation data
Sensor
In situ streampre-processingof sensor data
…
Historicaldata
BetterDecisionLearning
Hybrid (stream + batch) in transit processing
(data in-motion + data at-rest) Research Engineer (Inria ADT project)• Enable support for in situ Big Data
analytics• Elastic allocation of dedicated resources
(cores/nodes)Ovidiu Marcu
![Page 18: ScalableData Ingestion for Stream Processingand Beyondicl.utk.edu/jlesc9/files/STA1.2/jlesc9_antoniu.pdfHybrid analytics architecture DATA from the Real World DATA Hypothetical World](https://reader034.fdocuments.net/reader034/viewer/2022042307/5ed36ae4e450f87a2052e6a8/html5/thumbnails/18.jpg)
Hybrid analytics architecture
DATAfrom the
Real World
DATAfrom the
Hypothetical World …
Simulation (e.g., digital twin)
Computation
In situ pre-processingof simulation data
Sensor
In situ streampre-processingof sensor data
…
BetterDecisionLearning
Hybrid (stream + batch) in transit processing
(data in-motion + data at-rest)
KerA+++seamless integration with in situ/in transit
+large state management
Historicaldata
Ovidiu Marcu
Startup (ZettaFlow)• Low and consistent latency
(lightweight offset indexing, independent memory management)
• Model applications not partitioning/stream storage
Ph.D. (Inria IPL project)• HPC – Big Data processing
convergence• Bridge in situ/in transit and
stream/batch processing
H2020 project in preparation
![Page 19: ScalableData Ingestion for Stream Processingand Beyondicl.utk.edu/jlesc9/files/STA1.2/jlesc9_antoniu.pdfHybrid analytics architecture DATA from the Real World DATA Hypothetical World](https://reader034.fdocuments.net/reader034/viewer/2022042307/5ed36ae4e450f87a2052e6a8/html5/thumbnails/19.jpg)
Thank You!