Introduction to Stateful Stream Processing with Apache Flink.
Stateful Stream Processing with Apache Flink®
Kostas Kloudas (@kkloudas)
HUG @ Warsaw, July 3, 2017
Original creators of Apache Flink®
Providers of the dA Platform
3 questions and some history
▪ What is stateful stream processing?
▪ Why care about it?
▪ How does Flink do it?
▪ The evolution of Flink.
Stateful Stream Processing
Continuous Processing for Continuously Arriving Data
The ol’ traditional batch way
[Figure: timeline t of hourly batch files (2017-06-13 10:00pm, 2017-06-13 11:00pm, 2017-06-14 00:00am, 2017-06-14 01:00am, ...), each processed by a batch job]
● Continuously ingesting data
● Time-bounded batch files
● Periodic batch jobs
The ol’ traditional batch way
[Figure: the same timeline of hourly batch files, with intermediate state carried between batch jobs]
● Compute a counter: #(A) per hour / 2 min
● What if:
  • interval crosses batch boundaries? → carry intermediate results to next batch
  • events out of order? → ???
▪ So, for a simple counting program:
  • Custom logic for handling state
  • Custom logic for handling time
  • Custom logic for fault tolerance
Difficult and has nothing to do with your program.
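To make the "custom logic for handling state" concrete, here is a minimal sketch of a hand-rolled counter whose windows may cross batch boundaries. The class `BatchWindowCounter`, its method names, and the 90-minute window are assumptions chosen for illustration; the sketch also assumes in-order, non-negative timestamps, so it does nothing about out-of-order events or fault tolerance.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of hand-rolled window counting across batch files. */
class BatchWindowCounter {
    private final long windowSizeMinutes;
    // Intermediate state carried from one batch run to the next.
    private long openWindowStart = -1;
    private long openWindowCount = 0;

    BatchWindowCounter(long windowSizeMinutes) {
        this.windowSizeMinutes = windowSizeMinutes;
    }

    /**
     * Counts the events of one batch file. Timestamps are non-negative minutes
     * and must arrive in order. Returns every window that closed during this
     * batch as {windowStart, count}.
     */
    List<long[]> processBatch(long[] timestamps) {
        List<long[]> closed = new ArrayList<>();
        for (long ts : timestamps) {
            long start = ts - (ts % windowSizeMinutes);
            if (start != openWindowStart) {
                // A new window begins: emit the previous one, if any.
                if (openWindowStart >= 0) {
                    closed.add(new long[] {openWindowStart, openWindowCount});
                }
                openWindowStart = start;
                openWindowCount = 0;
            }
            openWindowCount++;
        }
        // The still-open window stays in the fields above for the next batch.
        return closed;
    }

    public static void main(String[] args) {
        BatchWindowCounter counter = new BatchWindowCounter(90);
        counter.processBatch(new long[] {10, 50}); // first hourly file
        for (long[] w : counter.processBatch(new long[] {70, 100, 110})) {
            System.out.println("window starting at " + w[0] + " has count " + w[1]);
        }
    }
}
```

Even this small version has to keep the open window alive between job runs, which is exactly the kind of bookkeeping that is unrelated to the counting itself.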
Why should we care?
▪ ...this is just for continuous data, right?
Most datasets are continuously arriving streams.
Stream Processing
Computation over an endless stream of data
[Figure: an endless stream of events flowing through a single "Your Code" operator]
Distributed Stream Processing
[Figure: input streams partitioned across several parallel "Your Code" instances]
● Partitions input by some key
● Distributes computation across multiple instances
● Each instance is responsible for some keys
Stateful Stream Processing
[Figure: parallel "Your Code" instances, each updating local variables/structures]
var x = …
if (condition(x)) {
  …
}
● Embedded local state
● State co-partitioned with the input stream by key
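The idea of embedded local state that is co-partitioned with the input by key can be sketched in plain Java. This is an illustration only, not how Flink implements keyed state: the class `KeyedCounterSketch`, the hash-modulo routing, and the per-key counter are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of key-partitioned, embedded local state. */
class KeyedCounterSketch {
    /** One parallel instance of "your code" with its own local state. */
    static class Instance {
        final Map<String, Long> countsByKey = new HashMap<>();

        void process(String key) {
            countsByKey.merge(key, 1L, Long::sum);
        }
    }

    private final List<Instance> instances = new ArrayList<>();

    KeyedCounterSketch(int parallelism) {
        for (int i = 0; i < parallelism; i++) {
            instances.add(new Instance());
        }
    }

    /** The same key is always routed to the same instance. */
    int instanceFor(String key) {
        return Math.floorMod(key.hashCode(), instances.size());
    }

    void send(String key) {
        instances.get(instanceFor(key)).process(key);
    }

    long countFor(String key) {
        return instances.get(instanceFor(key)).countsByKey.getOrDefault(key, 0L);
    }

    /** Number of instances whose local state contains the key. */
    int instancesHolding(String key) {
        int n = 0;
        for (Instance instance : instances) {
            if (instance.countsByKey.containsKey(key)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        KeyedCounterSketch job = new KeyedCounterSketch(3);
        for (String key : new String[] {"a", "b", "a", "c", "a"}) {
            job.send(key);
        }
        System.out.println("count(a) = " + job.countFor("a"));
    }
}
```

Because routing and state lookup use the same key, an instance never needs to read another instance's state, which is what makes the state purely local.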
A practical stream processor
State:
● Fault-tolerance
● Scalability
● Efficiency
Time:
● Event-time (out-of-order events)
● Allows you to work in event-time (e.g. timers)
Stateful Stream Processor that handles Large Distributed State and Time / Order / Completeness consistently, robustly, and efficiently
● Stateful stream processing as a new paradigm to continuously process continuously arriving data
● Produce accurate results
● Real-time is only a natural consequence of the model
A practical stream processor
This is where Flink shines...
▪ Supports out-of-order streams
▪ Manages state transparently
  • exactly-once processing
▪ Offers high throughput and low latency
▪ Scales to large deployments
  • https://data-artisans.com/blog/blink-flink-alibaba-search
  • https://data-artisans.com/blog/rbea-scalable-real-time-analytics-at-king
Apache Flink®
About time ...
[Figure: event streams flowing into parallel "Your Code" instances]
When are my results complete?
Processing Time drawbacks:
  • Incorrect results
  • Irreproducible results
Event Time: Watermarks
● Special markers, called Watermarks
● Flow with elements
● A watermark of timestamp t means that no records with timestamp < t should be expected
Event Time
Documentation: https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/event_time.html
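The watermark rule can be sketched as a small event-time window counter: a window [start, end) is treated as complete once a watermark with timestamp >= end arrives. The class `WatermarkWindowSketch` and its methods are hypothetical and do not use Flink's API. The sketch also makes no attempt to handle late events, so an event older than the last watermark would simply reopen its window here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of watermark-driven event-time windows. */
class WatermarkWindowSketch {
    private final long windowSize;
    // Open windows, ordered by window start: start -> event count.
    private final TreeMap<Long, Long> openWindows = new TreeMap<>();
    private final List<String> results = new ArrayList<>();

    WatermarkWindowSketch(long windowSize) {
        this.windowSize = windowSize;
    }

    /** Events may arrive out of order; they are assigned by their own timestamp. */
    void onEvent(long timestamp) {
        long start = timestamp - Math.floorMod(timestamp, windowSize);
        openWindows.merge(start, 1L, Long::sum);
    }

    /** A watermark t promises that no records with timestamp < t are expected. */
    void onWatermark(long watermark) {
        while (!openWindows.isEmpty()
                && openWindows.firstKey() + windowSize <= watermark) {
            Map.Entry<Long, Long> done = openWindows.pollFirstEntry();
            long start = done.getKey();
            results.add("[" + start + "," + (start + windowSize) + ")=" + done.getValue());
        }
    }

    List<String> results() {
        return results;
    }

    public static void main(String[] args) {
        WatermarkWindowSketch windows = new WatermarkWindowSketch(10);
        windows.onEvent(3);
        windows.onEvent(12);
        windows.onEvent(7); // out of order, still lands in [0,10)
        windows.onWatermark(10);
        System.out.println(windows.results());
    }
}
```

Results depend only on event timestamps and watermarks, not on arrival time, which is what avoids the irreproducible results of processing time.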
Fault tolerance
▪ How to ensure exactly-once semantics?
Fault tolerance simple case
[Figure: an event log, which persists events (temporarily), feeding a single process that keeps its state in main memory]
● Periodically take a snapshot of the memory
● Recovery: restore snapshot and replay events since snapshot
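The single-process case can be sketched as follows. The class `SnapshotReplaySketch`, the in-memory list standing in for the event log, and the per-event counter are assumptions for illustration; a real log and snapshot store would be durable.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of snapshot-and-replay recovery in a single process. */
class SnapshotReplaySketch {
    // Event log: persists events (temporarily) so they can be replayed.
    private final List<String> eventLog = new ArrayList<>();
    // State in main memory: number of occurrences per event.
    private Map<String, Long> memory = new HashMap<>();
    // Log position of the next event to process.
    private int position = 0;
    // Latest snapshot: a copy of the memory plus the log position it reflects.
    private Map<String, Long> snapshotMemory = new HashMap<>();
    private int snapshotPosition = 0;

    void append(String event) {
        eventLog.add(event);
    }

    void processAvailable() {
        while (position < eventLog.size()) {
            memory.merge(eventLog.get(position), 1L, Long::sum);
            position++;
        }
    }

    void takeSnapshot() {
        snapshotMemory = new HashMap<>(memory);
        snapshotPosition = position;
    }

    /** Simulates a failure: everything in main memory is lost. */
    void crash() {
        memory = new HashMap<>();
        position = 0;
    }

    /** Restores the snapshot, then replays the events logged since it was taken. */
    void recover() {
        memory = new HashMap<>(snapshotMemory);
        position = snapshotPosition;
        processAvailable();
    }

    long count(String event) {
        return memory.getOrDefault(event, 0L);
    }

    public static void main(String[] args) {
        SnapshotReplaySketch process = new SnapshotReplaySketch();
        process.append("a");
        process.append("b");
        process.processAvailable();
        process.takeSnapshot();
        process.append("a");
        process.processAvailable();
        process.crash();
        process.recover();
        System.out.println("count(a) after recovery = " + process.count("a"));
    }
}
```

Storing the log position together with the state copy is what keeps each event reflected in the state exactly once after recovery: events before the position come from the snapshot, events after it come from the replay.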
Fault tolerance distributed
▪ How to create consistent snapshots of distributed state?
▪ How to do it efficiently?
Distributed Snapshots
Coordination via markers, injected into the streams
[Figure: a source feeding a stateful operation that keeps a state index (Hash Table or RocksDB)]
● Events flow without replication or synchronous writes
● Trigger checkpoint; inject checkpoint barrier
● Take state snapshot; trigger state copy-on-write
● Durably persist snapshots asynchronously (to DFS); processing pipeline continues
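The barrier steps can be sketched for one stateful operation reading one stream. Several simplifications here are assumptions of the sketch, not Flink's implementation: barriers are encoded as strings with a hypothetical `BARRIER:` prefix, an eager copy of the map stands in for copy-on-write, and a concurrent in-memory map stands in for the DFS.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of barrier-triggered, asynchronously persisted snapshots. */
class BarrierSnapshotSketch {
    static final String BARRIER_PREFIX = "BARRIER:";

    // Operator state: number of occurrences per event.
    private final Map<String, Long> state = new HashMap<>();
    // Stand-in for the distributed file system: checkpoint id -> state copy.
    private final Map<Long, Map<String, Long>> durableStore = new ConcurrentHashMap<>();
    private final List<CompletableFuture<Void>> pendingWrites = new ArrayList<>();

    /** Handles one stream element, which is either an event or a barrier. */
    void onElement(String element) {
        if (element.startsWith(BARRIER_PREFIX)) {
            long checkpointId = Long.parseLong(element.substring(BARRIER_PREFIX.length()));
            // Snapshot the state as of the barrier (eager copy instead of copy-on-write).
            Map<String, Long> copy = new HashMap<>(state);
            // Persist in the background; the caller can keep processing events.
            pendingWrites.add(
                    CompletableFuture.runAsync(() -> durableStore.put(checkpointId, copy)));
        } else {
            state.merge(element, 1L, Long::sum);
        }
    }

    /** Waits for outstanding writes, then returns the persisted checkpoint (or null). */
    Map<String, Long> completedCheckpoint(long checkpointId) {
        for (CompletableFuture<Void> write : pendingWrites) {
            write.join();
        }
        return durableStore.get(checkpointId);
    }

    long currentCount(String event) {
        return state.getOrDefault(event, 0L);
    }

    public static void main(String[] args) {
        BarrierSnapshotSketch operation = new BarrierSnapshotSketch();
        for (String element : new String[] {"a", "a", "BARRIER:1", "a", "b"}) {
            operation.onElement(element);
        }
        System.out.println("checkpoint 1 = " + operation.completedCheckpoint(1));
        System.out.println("live count(a) = " + operation.currentCount("a"));
    }
}
```

The checkpoint contains exactly the effects of the events that preceded the barrier, while events after the barrier update only the live state.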
Fault tolerance
● Consistent snapshotting:
[Figure: Checkpoint. Parallel "Your Code" instances, each with embedded State, write their checkpointed state to a File System]
[Figure: Restore. The instances read their checkpointed state back from the File System]
● Recover all embedded state
● Reset position in input stream
Documentation: https://ci.apache.org/projects/flink/flink-docs-release-1.3/internals/stream_checkpointing.html
State Management: misc.
▪ Savepoints
▪ Rescaling
▪ Queryable State
Apache Flink Ecosystem
[Figure: integration with external systems; recoverable labels: POSIX, Java/Scala Collections]
Apache Flink Stack
● DataStream API: Stream Processing
● DataSet API: Batch Processing
● Runtime: Distributed Streaming Data Flow
● Libraries
Streaming and batch as first class citizens.
Levels of abstraction
● Process Function (events, state, time): low-level (stateful stream processing)
● DataStream API (streams, windows): stream processing & analytics
● Table API (dynamic tables): declarative DSL
● Stream SQL: high-level language
API and Execution
// Source
DataStream<String> lines = env.addSource(new FlinkKafkaConsumer010<>(…));
// Transformation
DataStream<Event> events = lines.map(line -> parse(line));
// Transformation: keyed 5-second windows, aggregated by a window function
DataStream<Statistic> stats = events.keyBy("id").timeWindow(Time.seconds(5)).apply(new MyAggregationFunction());
// Sink
stats.addSink(new BucketingSink<>(path));
[Figure: streaming dataflow. Source [1], Source [2] → map() [1], map() [2] → keyBy()/window()/apply() [1], keyBy()/window()/apply() [2] → Sink [1]]
Evolution of Flink
Programming APIs
Large State Handling
Conclusion
TL;DR
▪ Stateful stream processing as a paradigm for continuous data processing
▪ Flink is a sophisticated and tested stateful stream processor
▪ Efficiency, management, and operational issues for state are taken very seriously
Thank you!
@kkloudas @ApacheFlink @dataArtisans
Stream Processing and Apache Flink®'s approach to it
@StephanEwen, Apache Flink PMC, CTO @ data Artisans
Flink Forward is coming back to Berlin, September 11-13, 2017: BERLIN.FLINK-FORWARD.ORG
We are hiring! data-artisans.com/careers