Scalable Stream Processing: Surveying Storm, Samza, Spark & Flink
Wolfram [email protected]
September 12, techcamp 2018, Hamburg
@baqendcom

About Me: Wolfram Wingerath
Distributed Systems Engineer
PhD Thesis & Research:
• Real-Time Databases
• Stream Processing
• NoSQL & Cloud Databases
• …
Practice: Backend-as-a-Service
• Web Caching
• Real-Time Database
• …
www.baqend.com

Outline
• Introduction: Big Data in Motion
  • Big Picture:
    • A Typical Data Pipeline
    • Processing Frameworks
  • Processing Models:
    • Batch Processing
    • Stream Processing
• System Survey: Big Data + Low Latency
• Wrap-Up: Summary & Discussion
• Future Directions: Real-Time Databases

IN PRACTICE
Scalable Data Processing

A Data Processing Pipeline
Stages: Persistence/Streaming, Processing (today’s topic!), Serving, Application

INTRODUCTION
Batch vs. Stream Processing

Big Data Processing Frameworks: What are your options?
(Figure: frameworks placed between low latency and high throughput; legible labels include Amazon Elastic MapReduce and Google Dataflow.)
What to use when?

Batch Processing: „Volume“
Pipeline: Persistence (e.g. HDFS), Batch (e.g. MapReduce), Serving (e.g. HBase), Application
• Cost-effective & efficient
• Easy to reason about: operating on complete data
But:
• High latency: jobs run periodically (e.g. at night)

Stream Processing: „Velocity“
Pipeline: Streaming (e.g. Kafka, Redis), Real-Time (e.g. Storm), Serving, Application
• Low end-to-end latency
• Challenges:
  • Long-running jobs: no downtime allowed
  • Asynchronism: data may arrive delayed or out-of-order
  • Incomplete input: algorithms operate on partial data
  • More: fault-tolerance, state management, guarantees, …

Typical Stream Operators: Examples
• Filter & Transform: Filter, Map
• Group: GroupByKey
• Aggregates: SUM(), COUNT()
• Windows: Tumbling, Sliding
Sources:
https://www.infoq.com/presentations/stream-processors-databases
https://www.infoq.com/presentations/stream-processing-apache-flink
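To make the difference between tumbling and sliding windows concrete, here is a minimal Python sketch. It is not taken from any of the surveyed systems: the function names, the integer timestamps, and the assumption that time starts at 0 are all illustrative. A tumbling window assigns every event to exactly one window, while a sliding window may assign it to several overlapping ones.

```python
from collections import defaultdict


def tumbling_window(ts, size):
    """Start of the single tumbling window [start, start + size) containing ts."""
    return ts - ts % size


def sliding_windows(ts, size, slide):
    """Starts of all sliding windows [start, start + size) containing ts.

    Window starts are multiples of `slide`; negative starts are skipped
    (assumption: time begins at 0).
    """
    start = ts - ts % slide
    starts = []
    while start > ts - size and start >= 0:
        starts.append(start)
        start -= slide
    return sorted(starts)


def count_per_window(events, assign):
    """GroupByKey + COUNT(): number of events per (key, window start)."""
    counts = defaultdict(int)
    for key, ts in events:
        for start in assign(ts):
            counts[(key, start)] += 1
    return dict(counts)


# Hypothetical input: (key, timestamp) pairs.
events = [("a", 1), ("a", 4), ("b", 7), ("a", 12)]
tumbling = count_per_window(events, lambda ts: [tumbling_window(ts, 10)])
sliding = count_per_window(events, lambda ts: sliding_windows(ts, 10, 5))
```

With a window size of 10 and a slide of 5, the event at timestamp 7 is counted in the windows starting at 0 and at 5, which is why sliding windows produce more (overlapping) results than tumbling ones.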

Typical Use Case: Example from Yahoo!
• Input: read ad tracking data from Kafka
• Filter: discard useless data
• Project: extract relevant fields
• Group: by ad campaign
• Window: ad views per 10-min window
Source: https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
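The five steps of this use case can be sketched as one small Python function. This is only an illustration of the operator chain, not the benchmark code: the event fields (`type`, `ad_id`, `ts`), the `AD_TO_CAMPAIGN` lookup table, and the in-memory list standing in for a Kafka topic are all assumptions.

```python
from collections import Counter

WINDOW_SECONDS = 10 * 60  # 10-minute tumbling windows, as in the use case

# Hypothetical lookup table mapping ads to campaigns.
AD_TO_CAMPAIGN = {"ad1": "campaignA", "ad2": "campaignA", "ad3": "campaignB"}


def views_per_campaign(events):
    """Count ad views per (campaign, window start)."""
    counts = Counter()
    for event in events:                        # Input: stand-in for a Kafka consumer
        if event["type"] != "view":             # Filter: discard irrelevant events
            continue
        ad_id, ts = event["ad_id"], event["ts"]  # Project: keep only the needed fields
        campaign = AD_TO_CAMPAIGN.get(ad_id)     # Group: key by ad campaign
        if campaign is None:
            continue                             # unknown ad, nothing to count
        window = ts - ts % WINDOW_SECONDS        # Window: 10-minute bucket
        counts[(campaign, window)] += 1
    return dict(counts)


# Hypothetical events; timestamps are seconds since some epoch.
sample = [
    {"type": "view", "ad_id": "ad1", "ts": 30},
    {"type": "click", "ad_id": "ad1", "ts": 40},
    {"type": "view", "ad_id": "ad2", "ts": 599},
    {"type": "view", "ad_id": "ad3", "ts": 600},
    {"type": "view", "ad_id": "adX", "ts": 700},
]
result = views_per_campaign(sample)
```

A real stream processor would run each step as a separate, parallel operator and emit window results continuously instead of returning them at the end.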

Wrap-up: Data Processing
• Processing frameworks abstract from scaling issues
Batch processing:
• easy to reason about
• extremely efficient
• huge input-output latency
Stream processing:
• quick results
• purely incremental
• potentially complex to handle

Outline
• Introduction: Big Data in Motion
• System Survey: Big Data + Low Latency
  • Processing Model Overview
  • Storm/Trident
  • Samza
  • Spark Streaming
  • Flink
• Wrap-Up: Summary & Discussion
• Future Directions: Real-Time Databases

SURVEY
Popular Stream Processing Systems

Processing Models: Batch vs. Micro-Batch vs. Stream
(Figure: spectrum from low latency to high throughput: stream, micro-batch, batch)

Storm: „Hadoop of real-time“
Overview
◦ First production-ready, well-adopted stream processor
◦ Compatible: native Java API, Thrift, distributed RPC
◦ Low-level: no primitives for joins or aggregations
◦ Native stream processor: latency < 50 ms feasible
◦ Big users: Twitter, Yahoo!, Spotify, Baidu, Alibaba, …
History
◦ 2010: developed at BackType (acquired by Twitter)
◦ 2011: open-sourced
◦ 2014: Apache top-level project

Dataflow (Storm)
Directed Acyclic Graphs (DAG) (slide callout: “Cycles!”):
• Spouts: pull data into topology
• Bolts: do processing, emit data
• Asynchronous
• Lineage can be tracked for each tuple
→ At-least-once has 2x messaging overhead
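The per-tuple lineage tracking and its messaging cost can be illustrated with an XOR checksum, the idea Storm's documentation describes for its acker tasks. The sketch below is a simplified, single-process model: the `Acker` class, its method names, and the small integer ids are assumptions made for illustration (Storm uses random 64-bit ids and distributes this work).

```python
class Acker:
    """Keeps one XOR checksum per spout tuple (simplified, single-process model)."""

    def __init__(self):
        self.pending = {}  # root id -> XOR of all tuple ids that are still unacked

    def emit_root(self, root_id, tuple_id):
        """A spout emits a new tuple; its id starts the checksum."""
        self.pending[root_id] = tuple_id

    def ack(self, root_id, acked_id, emitted_ids=()):
        """A bolt acks `acked_id` and reports the tuples it emitted while processing it.

        Every id is XORed in once when emitted and once when acked, so the
        checksum returns to 0 exactly when the whole tuple tree is processed.
        """
        value = self.pending[root_id] ^ acked_id
        for new_id in emitted_ids:
            value ^= new_id
        if value == 0:
            del self.pending[root_id]
            return True   # tree complete: the spout can forget the message
        self.pending[root_id] = value
        return False      # still waiting; a timeout would trigger a replay


acker = Acker()
acker.emit_root("msg-1", 0x1)
steps = [
    acker.ack("msg-1", 0x1, emitted_ids=[0x2, 0x4]),  # first bolt fans out
    acker.ack("msg-1", 0x2),
    acker.ack("msg-1", 0x4),
]
```

Each processed tuple causes one extra ack message, which is one way to read the 2x messaging overhead noted above.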

State Management (Storm): Recover State on Failure
• In-memory or Redis-backed reliable state
• Synchronous state communication on the critical path
→ infeasible for large state

Back Pressure (Storm): Throttling Ingestion on Overload
Approach: monitoring bolts’ inbound buffer
1. Exceeding high watermark → throttle!
2. Falling below low watermark → full power!
Overload cycle shown in the figure:
1. too many tuples
2. tuples time out and fail
3. tuples get replayed
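The two-watermark rule is a form of hysteresis: between the low and the high watermark the previous decision is kept, so the system does not flap between throttled and unthrottled. A minimal Python sketch, assuming a hypothetical `BackPressureMonitor` class that is simply told the current buffer fill level:

```python
class BackPressureMonitor:
    """Decides whether ingestion should be throttled, based on two watermarks."""

    def __init__(self, low, high):
        if low >= high:
            raise ValueError("low watermark must be below high watermark")
        self.low = low
        self.high = high
        self.throttled = False

    def observe(self, queue_size):
        """Update the throttle flag from the current buffer fill level and return it."""
        if queue_size > self.high:
            self.throttled = True    # exceeding high watermark: throttle
        elif queue_size < self.low:
            self.throttled = False   # falling below low watermark: full power
        return self.throttled        # in between: keep the previous decision


monitor = BackPressureMonitor(low=2, high=5)
decisions = [monitor.observe(n) for n in [1, 4, 6, 4, 2, 1, 4]]
```

Note that the same fill level (4) yields different decisions depending on whether the buffer is filling up or draining.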

Trident: Stateful Stream Joining on Storm
Overview:
◦ Abstraction layer on top of Storm
◦ Released in 2012 (Storm 0.8.0)
◦ Micro-batching
◦ New features:
  • High-level API: aggregations & joins
  • Strong ordering
  • Stateful exactly-once processing
  → Performance penalty

Trident: Partitioned Micro-Batching
(Figure: a stream split into 3 partitions and 3 batches. Illustration taken from: “Storm applied”, Sean T. Allen et al.)
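Micro-batching combined with strong ordering is what makes stateful exactly-once processing tractable: if every batch carries a transaction id and state updates are applied in batch order, a replayed batch can be recognized and skipped. The sketch below follows the idea of Trident's transactional state in simplified form. The class and method names are hypothetical, and it assumes that a replayed batch has exactly the same content as the original.

```python
from collections import Counter


class TransactionalCounter:
    """Stores each count together with the id of the last batch applied to it."""

    def __init__(self):
        self.store = {}  # key -> (count, txid of the last applied batch)

    def apply_batch(self, txid, batch):
        """Apply one micro-batch of keys; batches must arrive in txid order."""
        for key, delta in Counter(batch).items():
            count, last_txid = self.store.get(key, (0, None))
            if last_txid == txid:
                continue  # this batch was already applied to this key: skip the replay
            self.store[key] = (count + delta, txid)


state = TransactionalCounter()
state.apply_batch(1, ["a", "b", "a"])
state.apply_batch(1, ["a", "b", "a"])  # replay after a simulated failure
state.apply_batch(2, ["a"])
```

Storing the id next to the value costs extra space and forces batches to be committed in order, which hints at where the performance penalty of this guarantee comes from.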

Samza: Real-Time on Top of Kafka
Overview
◦ Co-developed with Kafka → Kappa Architecture
◦ Simple: only single-step jobs
◦ Local state
◦ Native stream processor: low latency
◦ Users: LinkedIn, Uber, Netflix, TripAdvisor, Optimizely, …
History
◦ Developed at LinkedIn
◦ 2013: open-sourced (Apache Incubator)
◦ 2015: Apache top-level project
Illustration taken from: Jay Kreps, Questioning the Lambda Architecture (2014), https://www.oreilly.com/ideas/questioning-the-lambda-architecture (2017-03-02)

Dataflow (Samza): Simple By Design
• Job: processing step (≈ Storm bolt)
  → Robust
  → But: often several jobs
• Task: job instance (parallelism)
• Message: single data item
• Output persisted in Kafka
  → Easy data sharing
  → Buffering (no back pressure!)
  → But: increased latency
• Ordering within partitions
• Task = Kafka partitions: not elastic on purpose
Source: Martin Kleppmann, Turning the database inside-out with Apache Samza (2015), https://www.confluent.io/blog/turning-the-database-inside-out-with-apache-samza/ (2017-02-23)

Samza: Local State
(Figure: Remote State vs. Local State)
Advantages of local state:
• Buffering
  → No back pressure
  → At-least-once delivery
  → Simple recovery
• Fast lookups
Illustrations taken from: Jay Kreps, Why local state is a fundamental primitive in stream processing (2014), https://www.oreilly.com/ideas/why-local-state-is-a-fundamental-primitive-in-stream-processing (2017-02-26)
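Local state stays recoverable when every write is also appended to a durable, replayable changelog (in Samza's case a Kafka topic). The following Python sketch models that idea with a plain list as the log. The `LocalStore` class and its methods are illustrative assumptions, not Samza's API.

```python
class LocalStore:
    """Key-value state kept next to the task, mirrored to a replayable changelog."""

    def __init__(self, changelog):
        self.changelog = changelog  # stand-in for a durable log such as a Kafka topic
        self.data = {}

    def put(self, key, value):
        self.changelog.append((key, value))  # log the write first ...
        self.data[key] = value               # ... then update the local copy

    def get(self, key):
        return self.data.get(key)            # fast lookup, no network round trip

    @classmethod
    def restore(cls, changelog):
        """Rebuild the state on a fresh machine by replaying the changelog."""
        store = cls(changelog)
        for key, value in changelog:
            store.data[key] = value          # later entries overwrite earlier ones
        return store


log = []
store = LocalStore(log)
store.put("user1", "Alice")
store.put("user1", "Alicia")
store.put("user2", "Bob")
recovered = LocalStore.restore(log)  # simulated task restart
```

Lookups never leave the process, and recovery is a sequential replay of the log, which matches the "fast lookups" and "simple recovery" advantages listed above.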

Dataflow Example (Samza): Enriching a Clickstream
The enriched clickstream is available to every team within the organization.
Illustration taken from: Jay Kreps, Why local state is a fundamental primitive in stream processing (2014), https://www.oreilly.com/ideas/why-local-state-is-a-fundamental-primitive-in-stream-processing (2017-02-26)

State Management (Samza): Straightforward Recovery
Illustration taken from: Navina Ramesh, Apache Samza, LinkedIn’s Framework for Stream Processing (2015), https://thenewstack.io/apache-samza-linkedins-framework-for-stream-processing (2017-02-26)

Spark: „MapReduce successor“
Components: Core, SQL, MLlib, GraphX, Streaming
Overview
◦ High-level API: immutable collections (RDDs)
◦ Community: 1000+ contributors in 2015
◦ Big users: Amazon, eBay, Yahoo!, IBM, Baidu, …
History
◦ 2009: developed at UC Berkeley
◦ 2010: open-sourced
◦ 2014: Apache top-level project

Spark Streaming
Overview
◦ High-level API: DStreams (~ Java 8 Streams)
◦ Micro-batching: seconds of latency
◦ Rich features: stateful, exactly-once, elastic
History
◦ 2011: start of development
◦ 2013: Spark Streaming becomes part of Spark Core

Spark Streaming Core Abstraction: DStream
Resilient Distributed Dataset (RDD)
◦ Immutable collection & deterministic operations
◦ Lineage tracking:
  → state can be reproduced
  → periodic checkpoints reduce recovery time
DStream: Discretized RDD
◦ RDDs are processed in order: no ordering within RDD
◦ RDD scheduling ~50 ms → latency >100 ms
Illustration taken from: http://spark.apache.org/docs/latest/streaming-programming-guide.html#overview (2017-02-26)

Example: Counting Page Views
Zaharia, Matei, et al. "Discretized streams: Fault-tolerant streaming computation at scale." Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles. ACM, 2013.
pageViews = readStream("http://...", "1s")
ones = pageViews.map(event => (event.url, 1))
counts = ones.runningReduce((a, b) => a + b)
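The page-view pseudocode can be mimicked in plain Python to show what a running reduce over micro-batches computes. This sketch does not use Spark: the `running_counts` function and the list of lists standing in for one-second batches are assumptions for illustration only.

```python
from collections import Counter


def running_counts(batches):
    """Yield the cumulative page-view count per URL after each micro-batch."""
    state = Counter()
    for batch in batches:                              # one batch per interval, e.g. 1 s
        ones = [(event["url"], 1) for event in batch]  # map: event -> (url, 1)
        for url, one in ones:                          # running reduce with a + b
            state[url] += one
        yield dict(state)                              # snapshot after this interval


# Hypothetical input: three intervals, the last one empty.
batches = [
    [{"url": "/a"}, {"url": "/b"}],
    [{"url": "/a"}],
    [],
]
snapshots = list(running_counts(batches))
```

Results only change at batch boundaries, which is why the batch interval sets a lower bound on the latency of a micro-batching system.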

Flink
Overview
◦ Native stream processor: latency <100 ms feasible
◦ Abstract API for stream and batch processing, stateful, exactly-once delivery
◦ Many libraries: Table and SQL, CEP, Machine Learning, Gelly, …
◦ Users: Alibaba, Ericsson, Otto Group, ResearchGate, Zalando, …
History
◦ 2010: start as Stratosphere at TU Berlin, HU Berlin, and HPI Potsdam
◦ 2014: Apache Incubator, project renamed to Flink
◦ 2015: Apache top-level project

Architecture (Flink): Streaming + Batch
Source: https://www.infoq.com/presentations/stream-processing-apache-flink

Managed State (Flink): Streaming + Batch
• Automatic backups of local state
• Stored in RocksDB, savepoints written to HDFS
Source: https://www.infoq.com/presentations/stream-processing-apache-flink

Highlight: Fault Tolerance with Distributed Snapshots (Flink)
• Ordering within stream partitions
• Periodic checkpoints
• Recovery:
  1. reset state to checkpoint
  2. replay data from there
→ Exactly-once
Illustration taken from: https://ci.apache.org/projects/flink/flink-docs-release-1.2/internals/stream_checkpointing.html (2017-02-26)
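The two recovery steps can be demonstrated with a single counting operator that snapshots its state together with its input position. This is a deliberately reduced model: the `CountingOperator` class is hypothetical, and it leaves out the barrier mechanism a distributed snapshot needs to keep several operators consistent.

```python
class CountingOperator:
    """Counts keys from a replayable log and periodically checkpoints (state, offset)."""

    def __init__(self, checkpoint_every):
        self.checkpoint_every = checkpoint_every
        self.counts = {}
        self.offset = 0             # next input position to read
        self.checkpoint = ({}, 0)   # last durable snapshot: (counts, offset)

    def run(self, log, stop_at):
        """Process records up to (not including) position `stop_at`."""
        while self.offset < stop_at:
            key = log[self.offset]
            self.counts[key] = self.counts.get(key, 0) + 1
            self.offset += 1
            if self.offset % self.checkpoint_every == 0:
                self.checkpoint = (dict(self.counts), self.offset)

    def recover(self):
        """Step 1: reset state and input position to the last checkpoint."""
        counts, offset = self.checkpoint
        self.counts, self.offset = dict(counts), offset


log = ["a", "b", "a", "c", "a"]
op = CountingOperator(checkpoint_every=2)
op.run(log, stop_at=5)                   # checkpoints are taken at offsets 2 and 4
op.recover()                             # simulated crash: work after offset 4 is lost
state_after_recovery = dict(op.counts)
op.run(log, stop_at=5)                   # step 2: replay the data from the checkpoint
```

Because state and input position are restored together, the replayed record is counted once in the final state even though it was processed twice.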

Outline
• Introduction: Big Data in Motion
• System Survey: Big Data + Low Latency
• Wrap-Up: Summary & Discussion
  • Discussion:
    • Comparison Matrix
    • Other Systems
  • Takeaway
• Future Directions: Real-Time Databases

WRAP UP
Side-by-side comparison

Comparison
| | Storm | Trident | Samza | Spark Streaming | Flink (streaming) |
|---|---|---|---|---|---|
| Strictest Guarantee | at-least-once | exactly-once | at-least-once | exactly-once | exactly-once |
| Achievable Latency | ≪100 ms | <100 ms | <100 ms | <1 second | <100 ms |
| State Management | (small state) | (small state) | | | |
| Processing Model | one-at-a-time | micro-batch | one-at-a-time | micro-batch | one-at-a-time |
| Backpressure | | | no (buffering) | | |
| Ordering | | between batches | within partitions | between batches | within partitions |
| Elasticity | | | | | |
(Empty cells held icons on the slide that did not survive extraction.)

Performance: Yahoo! Benchmark
Based on real use case:
◦ Filter and count ad impressions
◦ 10-minute windows
“Storm […] and Flink […] show sub-second latencies at relatively high throughputs with Storm having the lowest 99th percentile latency. Spark streaming […] supports high throughputs, but at a relatively higher latency.”
From https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at

Other Systems
Heron, Apex, Dataflow, Beam, Kafka Streams, IBM InfoSphere Streams
And even more: Kinesis, Gearpump, MillWheel, Muppet, S4, Photon, …

Outline
• Introduction: Big Data in Motion
• System Survey: Big Data + Low Latency
• Wrap-Up: Summary & Discussion
• Future Directions: Real-Time Databases
  • Why Push-Based Database Queries?
  • Where Do Real-Time Databases Fit in?
  • Comparison Matrix:
    • Meteor
    • RethinkDB
    • Parse
    • Firebase
    • Baqend

REAL-TIME DBS
Combining databases with streaming

Traditional Databases: No Request? No Data!
(Figure: a client queries for circular shapes and has to keep asking “What’s the current state?”)
Query maintenance: periodic polling
→ Inefficient
→ Slow
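Why periodic polling is inefficient and slow becomes visible in a small poll-and-diff sketch: the client must re-run the full query on every tick and compare the result with the previous one, even when nothing has changed, and a change is only noticed at the next tick. The in-memory `shapes` dictionary and the function names below are assumptions for illustration.

```python
def diff(old, new):
    """Compare two query results (dicts of id -> record) and report the changes."""
    added = {k: new[k] for k in new.keys() - old.keys()}
    removed = {k: old[k] for k in old.keys() - new.keys()}
    changed = {k: new[k] for k in new.keys() & old.keys() if new[k] != old[k]}
    return added, removed, changed


def poll(run_query, last_result):
    """One polling round: re-execute the query and diff against the last result."""
    current = run_query()
    return current, diff(last_result, current)


# Hypothetical database content and query ("circular shapes").
shapes = {1: {"shape": "circle"}, 2: {"shape": "square"}}


def circular_shapes():
    return {k: dict(v) for k, v in shapes.items() if v["shape"] == "circle"}


result, first_changes = poll(circular_shapes, {})
shapes[3] = {"shape": "circle"}   # a write that matches the query
del shapes[1]                     # a delete that affects the query result
result, second_changes = poll(circular_shapes, result)
```

A push-based real-time database moves this matching work to the server and notifies the client only when the result actually changes.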

Quick Comparison: DBMS vs. RT DB vs. DSMS vs. Stream Processing
Spectrum from pull-based to push-based:
• Database Management: static collections
• Real-Time Databases: evolving collections
• Data Stream Management: persistent/ephemeral streams
• Stream Processing: ephemeral streams

Real-Time Databases: In a Nutshell
Systems compared: Meteor (Poll-and-Diff, Oplog Tailing), RethinkDB, Parse, Firebase, Baqend
Criteria:
• Scales with write TP
• Scales with no. of queries (one cell annotated “? (100k connections)”)
• Composite queries (AND/OR) (one cell annotated “AND in Firestore”)
• Sorted queries (one cell annotated “single attribute”)
• Limit
• Offset (one cell annotated “value-based”)
(The per-system marks of this matrix did not survive extraction.)

TAKEAWAY
Trade-Offs in Stream Processing

Summary
• Stream Processors: (figure: systems placed between latency and throughput)
• Real-Time Databases integrate Storage & Streaming
Learn more: slides.baqend.com

Who We Are
Wolfram Wingerath
[email protected] Systems Engineer
@baqendcom
Our Product
Speed Kit (test.speed-kit.com):
• Accelerates Any Website
• Pluggable
• Easy Setup
Our Services
• Web & Data Management Workshops
• Performance Auditing
• Implementation Services