Apache Gearpump next-gen streaming engine
-
Upload
tianlun-zhang -
Category
Software
-
view
1.848 -
download
0
Transcript of Apache Gearpump next-gen streaming engine
![Page 1: Apache Gearpump next-gen streaming engine](https://reader031.fdocuments.net/reader031/viewer/2022030207/58a8b3ea1a28abbd6b8b53c3/html5/thumbnails/1.jpg)
Apache Gearpump next-gen streaming engineManu Zhang, [email protected] Zhong, [email protected]
![Page 2: Apache Gearpump next-gen streaming engine](https://reader031.fdocuments.net/reader031/viewer/2022030207/58a8b3ea1a28abbd6b8b53c3/html5/thumbnails/2.jpg)
What is Gearpump ?
● An Akka[2] based real-time streaming engine ● An Apache Incubator[1] project since Mar.8th, 2016● A super simple pump that consists of only two gears
but very powerful at streaming water
![Page 3: Apache Gearpump next-gen streaming engine](https://reader031.fdocuments.net/reader031/viewer/2022030207/58a8b3ea1a28abbd6b8b53c3/html5/thumbnails/3.jpg)
Gearpump Architecture
● Actor concurrency● Message passing
communication● error handling and
isolation with supervision hierarchy
● Master HA with Akka Cluster
![Page 4: Apache Gearpump next-gen streaming engine](https://reader031.fdocuments.net/reader031/viewer/2022030207/58a8b3ea1a28abbd6b8b53c3/html5/thumbnails/4.jpg)
Why Gearpump ?
![Page 5: Apache Gearpump next-gen streaming engine](https://reader031.fdocuments.net/reader031/viewer/2022030207/58a8b3ea1a28abbd6b8b53c3/html5/thumbnails/5.jpg)
Stream processing is hard but no system meets all requirements
● Infinite data● Out-of-order data● Low latency assurance (e.g real-time recommendation)● Correctness requirement (e.g. charge advertisers for ads) ● Cheap to update user logics
![Page 6: Apache Gearpump next-gen streaming engine](https://reader031.fdocuments.net/reader031/viewer/2022030207/58a8b3ea1a28abbd6b8b53c3/html5/thumbnails/6.jpg)
How Gearpump solves the hard parts
● User interface● Flow control● Out-of-order processing● Exactly once ● Dynamic DAG
![Page 7: Apache Gearpump next-gen streaming engine](https://reader031.fdocuments.net/reader031/viewer/2022030207/58a8b3ea1a28abbd6b8b53c3/html5/thumbnails/7.jpg)
How Gearpump solves the hard parts
● User Interface● Flow control● Out-of-order processing● Exactly Once ● Dynamic DAG
![Page 8: Apache Gearpump next-gen streaming engine](https://reader031.fdocuments.net/reader031/viewer/2022030207/58a8b3ea1a28abbd6b8b53c3/html5/thumbnails/8.jpg)
User Interface - DAG API
Graph( KafkaSource ~ ShufflePartitioner ~> WindowCounter, WindowCounter ~ HashPartitioner ~> ModelCalculator, WindowCounter ~ HashPartitioner ~> AnomalyDetector, ModelCalculator ~> HashPartitioner ~> AnomalyDetector, AnomalyDetector ~> ShufflePartitioner ~> KafkaSink)
![Page 9: Apache Gearpump next-gen streaming engine](https://reader031.fdocuments.net/reader031/viewer/2022030207/58a8b3ea1a28abbd6b8b53c3/html5/thumbnails/9.jpg)
partitioner
User Interface - Partitioner
Graph( KafkaSource ~ ShufflePartitioner ~> WindowCounter, WindowCounter ~ HashPartitioner ~> ModelCalculator, WindowCounter ~ HashPartitioner ~> AnomalyDetector, ModelCalculator ~ HashPartitioner ~> AnomalyDetector, AnomalyDetector ~ ShufflePartitioner ~> KafkaSink)
WindowCounter(parallelism = 3)
ModelCalculator(parallelism = 2)
Task
Task
Task
Task
Task
![Page 10: Apache Gearpump next-gen streaming engine](https://reader031.fdocuments.net/reader031/viewer/2022030207/58a8b3ea1a28abbd6b8b53c3/html5/thumbnails/10.jpg)
User Interface - Compose DAG
![Page 11: Apache Gearpump next-gen streaming engine](https://reader031.fdocuments.net/reader031/viewer/2022030207/58a8b3ea1a28abbd6b8b53c3/html5/thumbnails/11.jpg)
How Gearpump solves the hard parts
● User interface● Flow control● Out-of-order processing● Exactly once ● Dynamic DAG
![Page 12: Apache Gearpump next-gen streaming engine](https://reader031.fdocuments.net/reader031/viewer/2022030207/58a8b3ea1a28abbd6b8b53c3/html5/thumbnails/12.jpg)
Flow Control
1. Words
3. W
indow
Coun
ts
2. Window Counts
4. Models
5. Anomaly words
KafkaSource WindowCounter
ModelCalculator
AnomalyDetector
KafkaSink
![Page 13: Apache Gearpump next-gen streaming engine](https://reader031.fdocuments.net/reader031/viewer/2022030207/58a8b3ea1a28abbd6b8b53c3/html5/thumbnails/13.jpg)
Without Flow Control - messages queue up
KafkaSource WindowCounter ModelCalculator
fast fast very slow
![Page 14: Apache Gearpump next-gen streaming engine](https://reader031.fdocuments.net/reader031/viewer/2022030207/58a8b3ea1a28abbd6b8b53c3/html5/thumbnails/14.jpg)
Without Flow Control - OOM
KafkaSource WindowCounter ModelCalculator
fast fast very slow
![Page 15: Apache Gearpump next-gen streaming engine](https://reader031.fdocuments.net/reader031/viewer/2022030207/58a8b3ea1a28abbd6b8b53c3/html5/thumbnails/15.jpg)
With Flow Control - Backpressure
KafkaSource WindowCounter ModelCalculator
fast slow down very slowbackpressure
![Page 16: Apache Gearpump next-gen streaming engine](https://reader031.fdocuments.net/reader031/viewer/2022030207/58a8b3ea1a28abbd6b8b53c3/html5/thumbnails/16.jpg)
With Flow Control - Backpressure
KafkaSource WindowCounter ModelCalculator
Slow down Slow down Very Slowbackpressurebackpressure
pull slower
![Page 17: Apache Gearpump next-gen streaming engine](https://reader031.fdocuments.net/reader031/viewer/2022030207/58a8b3ea1a28abbd6b8b53c3/html5/thumbnails/17.jpg)
How Gearpump solves the hard parts
● User Interface● Flow control● Out-of-order processing● Exactly Once ● Dynamic DAG
![Page 18: Apache Gearpump next-gen streaming engine](https://reader031.fdocuments.net/reader031/viewer/2022030207/58a8b3ea1a28abbd6b8b53c3/html5/thumbnails/18.jpg)
Out-of-order data
Event time321
Proc
essi
ng t
ime
● Event time - when data generated● Processing time - when data processed
1
2
3
4
56
![Page 19: Apache Gearpump next-gen streaming engine](https://reader031.fdocuments.net/reader031/viewer/2022030207/58a8b3ea1a28abbd6b8b53c3/html5/thumbnails/19.jpg)
Window Count
1. Words
3. W
indow
Coun
ts
2. Window Counts
4. Models
5. Anomaly words
KafkaSource WindowCounter
ModelCalculator
AnomalyDetector
KafkaSink
![Page 20: Apache Gearpump next-gen streaming engine](https://reader031.fdocuments.net/reader031/viewer/2022030207/58a8b3ea1a28abbd6b8b53c3/html5/thumbnails/20.jpg)
When can window counts be emitted ?
window messages count
[0, 2) 1
[2, 4)2
[4, 6)2
(“gearpump”, 1)
WindowCounter In-memory Table
● No window can be emitted since message as early as time 1 has not arrived
(“gearpump”, 1)
(“gearpump”, 3)
(“gearpump”, 5)
(“gearpump”, 4)
(“gearpump”, 2)
WindowCounter
![Page 21: Apache Gearpump next-gen streaming engine](https://reader031.fdocuments.net/reader031/viewer/2022030207/58a8b3ea1a28abbd6b8b53c3/html5/thumbnails/21.jpg)
On Watermark[4]
Event time321
Proc
essi
ng t
ime
1
2
3
4
56 watermark
● No timestamp earlier than watermark will be seen
![Page 22: Apache Gearpump next-gen streaming engine](https://reader031.fdocuments.net/reader031/viewer/2022030207/58a8b3ea1a28abbd6b8b53c3/html5/thumbnails/22.jpg)
Out-of-order processing with watermark
window messages count
[0, 2) 1
[2, 4)2
[4, 6)2
(“gearpump”, 1)
WindowCounter In-memory Table
(“gearpump”, 1)
(“gearpump”, 3)
(“gearpump”, 5)
(“gearpump”, 4)
(“gearpump”, 2)
WindowCounter
watermark = 0,No window can be emitted
![Page 23: Apache Gearpump next-gen streaming engine](https://reader031.fdocuments.net/reader031/viewer/2022030207/58a8b3ea1a28abbd6b8b53c3/html5/thumbnails/23.jpg)
Out-of-order processing with watermark
window messages count
[0, 2) 1
[2, 4)2
[4, 6)2
(“gearpump”, [
0, 2), 1
)
WindowCounter In-memory Table
(“gearpump”, 1)
(“gearpump”, 3)
(“gearpump”, 5)
(“gearpump”, 4)
(“gearpump”, 2)
WindowCounter
watermark = 2,Window [0, 2) can be emitted
(“gearpump”, [0, 2), 1)
![Page 24: Apache Gearpump next-gen streaming engine](https://reader031.fdocuments.net/reader031/viewer/2022030207/58a8b3ea1a28abbd6b8b53c3/html5/thumbnails/24.jpg)
How to get watermark ?
1. Words
3. W
indow
Coun
ts
2. Window Counts
4. Models
5. Anomaly words
KafkaSource WindowCounter
ModelCalculator
AnomalyDetector
KafkaSink
![Page 25: Apache Gearpump next-gen streaming engine](https://reader031.fdocuments.net/reader031/viewer/2022030207/58a8b3ea1a28abbd6b8b53c3/html5/thumbnails/25.jpg)
From upstream
1. Words
3. W
indow
Coun
ts
2. Window Counts
4. Models
5. Anomaly words
KafkaSource WindowCounter
ModelCalculator
AnomalyDetector
KafkaSink
watermark
![Page 26: Apache Gearpump next-gen streaming engine](https://reader031.fdocuments.net/reader031/viewer/2022030207/58a8b3ea1a28abbd6b8b53c3/html5/thumbnails/26.jpg)
More on Watermark
● Source watermark defined by user● Usually heuristic based ● Users decide whether to drop data arriving after
watermark
![Page 27: Apache Gearpump next-gen streaming engine](https://reader031.fdocuments.net/reader031/viewer/2022030207/58a8b3ea1a28abbd6b8b53c3/html5/thumbnails/27.jpg)
How Gearpump solves the hard parts
● User Interface● Flow control● Out-of-order processing● Exactly once ● Dynamic DAG
![Page 28: Apache Gearpump next-gen streaming engine](https://reader031.fdocuments.net/reader031/viewer/2022030207/58a8b3ea1a28abbd6b8b53c3/html5/thumbnails/28.jpg)
Exactly Once - no Lost or Duplicated States
● Lost states can cause anomaly words skipped
● Duplicated states can cause false alarm
States
4. Models
5. Anomaly words
KafkaSource WindowCounter
ModelCalculator
AnomalyDetector
KafkaSink
kafka_offset window_counts
models
![Page 29: Apache Gearpump next-gen streaming engine](https://reader031.fdocuments.net/reader031/viewer/2022030207/58a8b3ea1a28abbd6b8b53c3/html5/thumbnails/29.jpg)
Exactly Once with asynchronous checkpointing
(2, kafka_offset)
KafkaSource WindowCounter ModelCalculator
Watermark = 2 Watermark = 0 Watermark = 0
(2, kafka_offset)
![Page 30: Apache Gearpump next-gen streaming engine](https://reader031.fdocuments.net/reader031/viewer/2022030207/58a8b3ea1a28abbd6b8b53c3/html5/thumbnails/30.jpg)
(2, kafka_offset)(2, window_counts)
Exactly Once with asynchronous checkpointing
KafkaSource WindowCounter ModelCalculator
Watermark = 2 Watermark = 2 Watermark = 0
(2, window_counts)
![Page 31: Apache Gearpump next-gen streaming engine](https://reader031.fdocuments.net/reader031/viewer/2022030207/58a8b3ea1a28abbd6b8b53c3/html5/thumbnails/31.jpg)
(2, kafka_offset)(2, window_counts)
(2, models)
Exactly Once with asynchronous checkpointing
KafkaSource WindowCounter ModelCalculator
Watermark = 2 Watermark = 2 Watermark = 2
(2, models)
![Page 32: Apache Gearpump next-gen streaming engine](https://reader031.fdocuments.net/reader031/viewer/2022030207/58a8b3ea1a28abbd6b8b53c3/html5/thumbnails/32.jpg)
(2, kafka_offset)(2, window_counts)
(2, models)
Exactly Once with asynchronous checkpointing
KafkaSource WindowCounter ModelCalculator
Watermark = 2 Watermark = 2 Watermark = 2
(2, models)
![Page 33: Apache Gearpump next-gen streaming engine](https://reader031.fdocuments.net/reader031/viewer/2022030207/58a8b3ea1a28abbd6b8b53c3/html5/thumbnails/33.jpg)
(2, kafka_offset)(2, window_counts)
(2, models)
Crash
KafkaSource WindowCounter ModelCalculator
Watermark = 2 Watermark = 2 Watermark = 2
(2, models)
![Page 34: Apache Gearpump next-gen streaming engine](https://reader031.fdocuments.net/reader031/viewer/2022030207/58a8b3ea1a28abbd6b8b53c3/html5/thumbnails/34.jpg)
(2, kafka_offset)(2, window_counts)
(2, models)
Recover to latest checkpoint at 2
KafkaSource WindowCounter ModelCalculator
modelswindow_countskafka_offset
Replay from kafka
Get state at 2
![Page 35: Apache Gearpump next-gen streaming engine](https://reader031.fdocuments.net/reader031/viewer/2022030207/58a8b3ea1a28abbd6b8b53c3/html5/thumbnails/35.jpg)
More on Exactly Once
● User extends PersistentState interface and system checkpoints state for users automatically
● Watermark is persisted on master
![Page 36: Apache Gearpump next-gen streaming engine](https://reader031.fdocuments.net/reader031/viewer/2022030207/58a8b3ea1a28abbd6b8b53c3/html5/thumbnails/36.jpg)
How Gearpump solves the hard parts
● User Interface● Flow control● Out-of-order processing● Exactly Once ● Dynamic DAG
![Page 37: Apache Gearpump next-gen streaming engine](https://reader031.fdocuments.net/reader031/viewer/2022030207/58a8b3ea1a28abbd6b8b53c3/html5/thumbnails/37.jpg)
Can we update model algorithm in a cheap way ?
1. Words
3. W
indow
Coun
ts
2. Window Counts
4. Models
5. Anomaly words
KafkaSource WindowCounter
ModelCalculator
AnomalyDetector
KafkaSinkThe model algorithm doesn’t perform well
![Page 38: Apache Gearpump next-gen streaming engine](https://reader031.fdocuments.net/reader031/viewer/2022030207/58a8b3ea1a28abbd6b8b53c3/html5/thumbnails/38.jpg)
Dynamic DAG - modify application at runtime
![Page 39: Apache Gearpump next-gen streaming engine](https://reader031.fdocuments.net/reader031/viewer/2022030207/58a8b3ea1a28abbd6b8b53c3/html5/thumbnails/39.jpg)
Performance
![Page 40: Apache Gearpump next-gen streaming engine](https://reader031.fdocuments.net/reader031/viewer/2022030207/58a8b3ea1a28abbd6b8b53c3/html5/thumbnails/40.jpg)
Yahoo! Streaming Benchmark[5]
![Page 41: Apache Gearpump next-gen streaming engine](https://reader031.fdocuments.net/reader031/viewer/2022030207/58a8b3ea1a28abbd6b8b53c3/html5/thumbnails/41.jpg)
Yahoo! Streaming Benchmark
● 4 node cluster
● Each node has
32-core, 64 GB
memory
● 10GB Ethernet
![Page 42: Apache Gearpump next-gen streaming engine](https://reader031.fdocuments.net/reader031/viewer/2022030207/58a8b3ea1a28abbd6b8b53c3/html5/thumbnails/42.jpg)
Advanced features
![Page 43: Apache Gearpump next-gen streaming engine](https://reader031.fdocuments.net/reader031/viewer/2022030207/58a8b3ea1a28abbd6b8b53c3/html5/thumbnails/43.jpg)
DAG Visualization
● Watermark● Node size reflects throughput● Edge width represents flow rate
![Page 44: Apache Gearpump next-gen streaming engine](https://reader031.fdocuments.net/reader031/viewer/2022030207/58a8b3ea1a28abbd6b8b53c3/html5/thumbnails/44.jpg)
DAG Visualization
● Data skew analysis
![Page 45: Apache Gearpump next-gen streaming engine](https://reader031.fdocuments.net/reader031/viewer/2022030207/58a8b3ea1a28abbd6b8b53c3/html5/thumbnails/45.jpg)
Apache Beam[6] Gearpump Runner
Beam Model: Fn Runners
Apache Flink
Apache Spark
Beam Model: Pipeline Construction
OtherLanguagesBeam Java
Beam Python
Execution Execution
Cloud Dataflow
Execution
Apache Gearpump
![Page 46: Apache Gearpump next-gen streaming engine](https://reader031.fdocuments.net/reader031/viewer/2022030207/58a8b3ea1a28abbd6b8b53c3/html5/thumbnails/46.jpg)
Experimental features
● Web UI Authorization / OAuth2 Authentication● CGroup Resource Isolation● Binary Storm compatibility● Akka Streams integration
![Page 47: Apache Gearpump next-gen streaming engine](https://reader031.fdocuments.net/reader031/viewer/2022030207/58a8b3ea1a28abbd6b8b53c3/html5/thumbnails/47.jpg)
Summary
● Gearpump is good at streaming infinite out-of-order data and guarantees correctness
● Gearpump helps users to easily program streaming applications, get runtime information and update dynamically
![Page 48: Apache Gearpump next-gen streaming engine](https://reader031.fdocuments.net/reader031/viewer/2022030207/58a8b3ea1a28abbd6b8b53c3/html5/thumbnails/48.jpg)
References
1. gearpump.apache.org2. akka.io3. http://www.slideshare.net/SeanZhong/strata-singapore-gearpumpreal-ti
me-dagprocessing-with-akka-at-scale4. https://www.cs.cmu.edu/~pavlo/courses/fall2013/static/papers/p734-akid
au.pdf5. https://yahooeng.tumblr.com/post/135321837876/benchmarking-streami
ng-computation-engines-at6. Apache Beam [project overview]
![Page 49: Apache Gearpump next-gen streaming engine](https://reader031.fdocuments.net/reader031/viewer/2022030207/58a8b3ea1a28abbd6b8b53c3/html5/thumbnails/49.jpg)
Backup Slides
![Page 50: Apache Gearpump next-gen streaming engine](https://reader031.fdocuments.net/reader031/viewer/2022030207/58a8b3ea1a28abbd6b8b53c3/html5/thumbnails/50.jpg)
What is Akka
● Message driven● Shared nothing● Location transparent● Supervision based
failure management● High performance
Actor
ActorActor
Actor Actor
spawn