Introduction to Stateful Stream Processing with Apache Flink.

Post on 22-Jan-2018

Kostas Kloudas (@kkloudas)

HUG @ Warsaw, July 3, 2017

Stateful Stream Processing with Apache Flink®

Original creators of Apache Flink®

Providers of the dA Platform

3 questions and some history

▪ What is stateful stream processing?

▪ Why care about it?

▪ How does Flink do it?

▪ The evolution of Flink.

Stateful Stream Processing

Continuous Processing for Continuously Arriving Data

The ol’ traditional batch way

[Figure: a timeline t of hourly batch files (..., 2017-06-13 10:00pm, 2017-06-13 11:00pm, 2017-06-14 00:00am, 2017-06-14 01:00am), each processed by a batch job]

● Continuously ingesting data

● Time-bounded batch files

● Periodic batch jobs

[Figure: the same timeline of hourly batch files, with intermediate state handed from one batch job to the next]

● Compute a counter: #(A) per hour / 2 min

● What if:

● the interval crosses batch boundaries? → carry intermediate results to the next batch

● events arrive out of order? → ???
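The "carry intermediate results to the next batch" case can be sketched in plain Java. This is only an illustration under stated assumptions: the class `BatchCarryOverSketch`, its method names, and the 90-minute interval are hypothetical, timestamps are minutes since some start, and events are assumed to arrive in order (the out-of-order case is exactly what this kind of code does not handle).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: count "A" events per fixed interval when the input
// arrives as separate batch files whose boundaries do not match the interval.
final class BatchCarryOverSketch {
    private final long intervalMinutes;
    // Intermediate state that must survive from one batch job to the next.
    private long openIntervalStart = -1; // -1 means no interval is open yet
    private long openCount = 0;
    private final List<String> results = new ArrayList<>();

    BatchCarryOverSketch(long intervalMinutes) {
        this.intervalMinutes = intervalMinutes;
    }

    // Processes one batch file: timestamps (in minutes) of "A" events,
    // assumed non-negative and in order.
    void processBatch(long[] timestamps) {
        for (long ts : timestamps) {
            long start = ts - (ts % intervalMinutes);
            if (start != openIntervalStart) {
                closeOpenInterval();
                openIntervalStart = start;
            }
            openCount++;
        }
        // The open interval is not closed here: it may continue in the next batch.
    }

    private void closeOpenInterval() {
        if (openCount > 0) {
            results.add("[" + openIntervalStart + ","
                    + (openIntervalStart + intervalMinutes) + ") count=" + openCount);
        }
        openCount = 0;
    }

    List<String> results() {
        return results;
    }

    long pendingCount() {
        return openCount;
    }

    public static void main(String[] args) {
        BatchCarryOverSketch job = new BatchCarryOverSketch(90);
        job.processBatch(new long[] {10, 50});       // first hourly file
        job.processBatch(new long[] {70, 100, 110}); // second hourly file
        System.out.println(job.results() + " pending=" + job.pendingCount());
    }
}
```

The fields `openIntervalStart` and `openCount` are the hand-written intermediate state: they have to be stored somewhere between batch jobs, and nothing here protects them against failures.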

▪ So, for a simple counting program:

• Custom logic for handling state

• Custom logic for handling time

• Custom logic for fault tolerance

Difficult and has nothing to do with your program.

Why should we care?

▪ ...this is just for continuous data, right?

Most datasets are continuously arriving streams.

Stream Processing

Computation over an endless stream of data

[Figure: an endless stream of records flowing through "Your Code"]

Distributed Stream Processing

[Figure: several input streams, each flowing into a parallel instance of "Your Code"]

● Partitions input by some key

● Distributes computation across multiple instances

● Each instance is responsible for some keys
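A minimal sketch of key-based partitioning, assuming a plain hash-modulo scheme. The class `KeyPartitioningSketch` and its methods are hypothetical, and this is a simplification for illustration, not the assignment scheme Flink itself uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: route each record to one of N parallel instances by key.
final class KeyPartitioningSketch {

    // Maps a key to an instance index in [0, parallelism).
    // floorMod keeps the index non-negative even for negative hash codes.
    static int instanceFor(String key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    // Splits the input into one list per instance; equal keys always
    // end up in the same list.
    static List<List<String>> partition(List<String> keys, int parallelism) {
        List<List<String>> instances = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            instances.add(new ArrayList<>());
        }
        for (String key : keys) {
            instances.get(instanceFor(key, parallelism)).add(key);
        }
        return instances;
    }

    public static void main(String[] args) {
        System.out.println(partition(Arrays.asList("A", "B", "A", "C", "B", "A"), 3));
    }
}
```

Because the mapping depends only on the key, every record for a given key reaches the same instance, which is what makes per-key local state possible.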

Stateful Stream Processing

[Figure: parallel instances of "Your Code", each updating local variables/structures]

var x = …

if (condition(x)) {…

}

● Embedded local state

● State co-partitioned with the input stream by key
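A sketch of embedded local state that is co-partitioned with the input, assuming the same hash-modulo routing as a stand-in for real key partitioning. `KeyedStateSketch`, `Instance`, and the per-key counter are hypothetical names chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: each parallel instance keeps local state only for
// the keys that are routed to it.
final class KeyedStateSketch {

    // One parallel instance of "Your Code" with its embedded local state.
    static final class Instance {
        private final Map<String, Long> countPerKey = new HashMap<>();

        // Updates the local state for this key and returns the new count.
        long process(String key) {
            return countPerKey.merge(key, 1L, Long::sum);
        }

        Map<String, Long> state() {
            return countPerKey;
        }
    }

    private final Instance[] instances;

    KeyedStateSketch(int parallelism) {
        instances = new Instance[parallelism];
        for (int i = 0; i < parallelism; i++) {
            instances[i] = new Instance();
        }
    }

    // The record and the state for its key live on the same instance.
    Instance instanceFor(String key) {
        return instances[Math.floorMod(key.hashCode(), instances.length)];
    }

    long process(String key) {
        return instanceFor(key).process(key);
    }

    public static void main(String[] args) {
        KeyedStateSketch job = new KeyedStateSketch(2);
        for (String key : new String[] {"A", "B", "A"}) {
            System.out.println(key + " -> " + job.process(key));
        }
    }
}
```

No instance ever needs to look at another instance's map, so updates are plain local memory operations.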

A practical stream processor

state

● Fault-tolerance

● Scalability

● Efficiency

time

● Event-time (out-of-order events)

● Allows you to work in event-time (e.g. timers)

Stateful Stream Processor that handles Large Distributed State and Time / Order / Completeness consistently, robustly, and efficiently

● Stateful stream processing as a new paradigm to continuously process continuously arriving data

● Produce accurate results

● Real-time is only a natural consequence of the model

A practical stream processor

This is where Flink shines...

▪ Supports out-of-order streams

▪ Manages state transparently

• exactly-once processing

▪ Offers high throughput and low latency

▪ Scales to large deployments

• https://data-artisans.com/blog/blink-flink-alibaba-search

• https://data-artisans.com/blog/rbea-scalable-real-time-analytics-at-king

Apache Flink®

About time ...

[Figure: streams flowing through parallel instances of "Your Code"]

When are my results complete?

Processing Time drawbacks:

• Incorrect results

• Irreproducible results

Event Time: Watermarks

● Special markers, called Watermarks

● Flow with elements

● A watermark of timestamp t means that no records with timestamp < t should be expected
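The watermark rule above can be sketched as a small event-time window counter. This is an illustration only: `WatermarkWindowSketch`, `onRecord`, and `onWatermark` are hypothetical names, not Flink API, and the sketch follows the definition on the slide (a window [start, end) is treated as complete once a watermark t >= end arrives).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: count records per event-time window and emit a
// window only when a watermark says no earlier records are expected.
final class WatermarkWindowSketch {
    private final long windowSize;
    // Open windows, keyed by window start, holding the count so far.
    private final TreeMap<Long, Long> openWindows = new TreeMap<>();
    private long currentWatermark = Long.MIN_VALUE;

    WatermarkWindowSketch(long windowSize) {
        this.windowSize = windowSize;
    }

    // Adds a record by its event timestamp, in whatever order it arrives.
    // Returns false if the record is late (its window was already emitted).
    boolean onRecord(long timestamp) {
        long start = timestamp - Math.floorMod(timestamp, windowSize);
        if (start + windowSize <= currentWatermark) {
            return false;
        }
        openWindows.merge(start, 1L, Long::sum);
        return true;
    }

    // Advances event time and returns the results of all windows that
    // are now complete.
    List<String> onWatermark(long watermark) {
        currentWatermark = Math.max(currentWatermark, watermark);
        List<String> results = new ArrayList<>();
        while (!openWindows.isEmpty()
                && openWindows.firstKey() + windowSize <= currentWatermark) {
            Map.Entry<Long, Long> done = openWindows.pollFirstEntry();
            results.add("[" + done.getKey() + "," + (done.getKey() + windowSize)
                    + ") count=" + done.getValue());
        }
        return results;
    }

    public static void main(String[] args) {
        WatermarkWindowSketch counter = new WatermarkWindowSketch(10);
        counter.onRecord(3);
        counter.onRecord(12);
        counter.onRecord(7); // out of order, still counted in [0,10)
        System.out.println(counter.onWatermark(10));
    }
}
```

Results depend only on the timestamps carried by the records and the watermarks, not on when the records happen to be processed, which is what makes them reproducible.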

Event Time documentation: https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/event_time.html

Fault tolerance

▪ How to ensure exactly-once semantics?

Fault tolerance simple case

[Figure: an event log feeding a single process that keeps its state in main memory]

● The event log persists events (temporarily)

● Periodically take a snapshot of the memory

● Recovery: restore the snapshot and replay events since the snapshot
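The single-process case can be sketched as follows, assuming the event log is an in-memory list and the state is a per-key counter. `SnapshotReplaySketch` and its methods are hypothetical names used only for this illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a single process that snapshots its memory together
// with its position in the event log, and recovers by restore plus replay.
final class SnapshotReplaySketch {

    // A snapshot holds a copy of the state and the log position it reflects.
    static final class Snapshot {
        final Map<String, Long> state;
        final int offset;

        Snapshot(Map<String, Long> state, int offset) {
            this.state = state;
            this.offset = offset;
        }
    }

    private Map<String, Long> counts = new HashMap<>();
    private int offset = 0; // next position to read from the event log

    // Processes events from the current position up to (not including) end.
    void processUpTo(List<String> log, int end) {
        while (offset < end) {
            counts.merge(log.get(offset), 1L, Long::sum);
            offset++;
        }
    }

    Snapshot snapshot() {
        return new Snapshot(new HashMap<>(counts), offset);
    }

    // Recovery: restore the state and reset the position in the log.
    void restore(Snapshot snapshot) {
        counts = new HashMap<>(snapshot.state);
        offset = snapshot.offset;
    }

    Map<String, Long> counts() {
        return counts;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList("A", "B", "A", "C", "A");
        SnapshotReplaySketch process = new SnapshotReplaySketch();
        process.processUpTo(log, 2);
        Snapshot snapshot = process.snapshot();
        process.processUpTo(log, 4); // work that is lost in the failure
        process.restore(snapshot);   // simulated failure and recovery
        process.processUpTo(log, 5); // replay events since the snapshot
        System.out.println(process.counts());
    }
}
```

Because state and log position are captured together, each event affects the recovered state exactly once, even though some events are processed twice.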

Fault tolerance distributed

▪ How to create consistent snapshots of distributed state?

▪ How to do it efficiently?

Distributed Snapshots

Coordination via markers, injected into the streams

[Figure: a source feeding a stateful operation that keeps a state index (hash table or RocksDB)]

● Events flow without replication or synchronous writes

● Trigger checkpoint: inject checkpoint barrier

● Take state snapshot: trigger state copy-on-write

● Durably persist snapshots to the DFS asynchronously; the processing pipeline continues
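The marker idea can be sketched with one source and one stateful operator. All names here (`BarrierSnapshotSketch`, `Element`, `Source`, `CountingOperator`) are hypothetical, and the sketch simplifies heavily: the state copy is taken synchronously with a plain map copy instead of copy-on-write, nothing is written to a DFS, and there is only a single input channel.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a checkpoint barrier flows with the records; each
// stage snapshots its own state at the moment the barrier passes.
final class BarrierSnapshotSketch {

    // A stream element is either a record or a checkpoint barrier.
    static final class Element {
        final String key;        // null for a barrier
        final long checkpointId; // only meaningful for a barrier

        private Element(String key, long checkpointId) {
            this.key = key;
            this.checkpointId = checkpointId;
        }

        static Element record(String key) {
            return new Element(key, -1);
        }

        static Element barrier(long checkpointId) {
            return new Element(null, checkpointId);
        }

        boolean isBarrier() {
            return key == null;
        }
    }

    // The source injects the barrier and remembers its read position.
    static final class Source {
        final Map<Long, Integer> checkpointedOffsets = new HashMap<>();

        List<Element> emit(List<String> log, int barrierBefore, long checkpointId) {
            List<Element> out = new ArrayList<>();
            for (int i = 0; i < log.size(); i++) {
                if (i == barrierBefore) {
                    checkpointedOffsets.put(checkpointId, i);
                    out.add(Element.barrier(checkpointId));
                }
                out.add(Element.record(log.get(i)));
            }
            return out;
        }
    }

    // The stateful operator copies its state when the barrier arrives,
    // then simply keeps processing.
    static final class CountingOperator {
        final Map<String, Long> counts = new HashMap<>();
        final Map<Long, Map<String, Long>> snapshots = new HashMap<>();

        void process(Element element) {
            if (element.isBarrier()) {
                snapshots.put(element.checkpointId, new HashMap<>(counts));
            } else {
                counts.merge(element.key, 1L, Long::sum);
            }
        }
    }

    public static void main(String[] args) {
        Source source = new Source();
        CountingOperator operator = new CountingOperator();
        for (Element e : source.emit(Arrays.asList("A", "B", "A", "A"), 2, 1L)) {
            operator.process(e);
        }
        System.out.println("offset=" + source.checkpointedOffsets.get(1L)
                + " snapshot=" + operator.snapshots.get(1L)
                + " live=" + operator.counts);
    }
}
```

The source offset and the operator snapshot for the same checkpoint id describe the same point in the stream: the snapshot contains exactly the records emitted before the barrier.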

Fault tolerance

[Figure: parallel instances of "Your Code", each with its own State; on a checkpoint, every instance's state is written to the file system as checkpointed state; on a restore, it is read back]

● Consistent snapshotting: File System Checkpoint

File System Restore:

● Recover all embedded state

● Reset position in input stream

Documentation: https://ci.apache.org/projects/flink/flink-docs-release-1.3/internals/stream_checkpointing.html

State Management: misc.

▪ Savepoints

▪ Rescaling

▪ Queryable State

Apache Flink Ecosystem

[Figure: integrations; surviving labels: POSIX, Java/Scala Collections]

Apache Flink Stack

● Libraries

● DataStream API: Stream Processing

● DataSet API: Batch Processing

● Runtime: Distributed Streaming Data Flow

Streaming and batch as first class citizens.

Levels of abstraction

● Process Function (events, state, time): low-level (stateful stream processing)

● DataStream API (streams, windows): stream processing & analytics

● Table API (dynamic tables): declarative DSL

● Stream SQL: high-level language

API and Execution

Source:

DataStream<String> lines = env.addSource(new FlinkKafkaConsumer010(…));

Transformation:

DataStream<Event> events = lines.map(line -> parse(line));

Transformation:

DataStream<Statistic> stats = events.keyBy("id").timeWindow(Time.seconds(5)).apply(new MyAggregationFunction());

Sink:

stats.addSink(new BucketingSink(path));

[Figure: the streaming dataflow as executed in parallel: Source [1], Source [2] → map() [1], map() [2] → keyBy()/window()/apply() [1], keyBy()/window()/apply() [2] → Sink [1]]
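What the keyBy / 5-second window / aggregation step computes can be sketched without Flink. This is an illustration under assumptions: `TumblingWindowSketch`, `Event`, and `sumPerKeyAndWindow` are hypothetical names, windows are assumed to be aligned to timestamp 0, and a simple sum stands in for whatever `MyAggregationFunction` does.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: group events by key and by 5-second tumbling window,
// then aggregate each group.
final class TumblingWindowSketch {

    static final class Event {
        final String id;
        final long timestampMillis;
        final long value;

        Event(String id, long timestampMillis, long value) {
            this.id = id;
            this.timestampMillis = timestampMillis;
            this.value = value;
        }
    }

    // Start of the tumbling window that contains the timestamp.
    static long windowStart(long timestampMillis, long windowSizeMillis) {
        return timestampMillis - Math.floorMod(timestampMillis, windowSizeMillis);
    }

    // Sums values per (id, window); result keys look like "id@windowStart".
    static Map<String, Long> sumPerKeyAndWindow(Event[] events, long windowSizeMillis) {
        Map<String, Long> sums = new TreeMap<>();
        for (Event e : events) {
            String bucket = e.id + "@" + windowStart(e.timestampMillis, windowSizeMillis);
            sums.merge(bucket, e.value, Long::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        Event[] events = {
            new Event("a", 1000, 1),
            new Event("a", 4999, 2),
            new Event("a", 5000, 4),
            new Event("b", 1200, 8),
        };
        System.out.println(sumPerKeyAndWindow(events, 5000));
    }
}
```

In the real dataflow the grouping by key is what allows the window operator to run as several parallel instances, each holding the window state for its own keys.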

Evolution of Flink

Programming APIs

Large State Handling

Conclusion

TL;DR

▪ Stateful stream processing as a paradigm for continuous data processing

▪ Flink is a sophisticated and tested stateful stream processor

▪ Efficiency, management, and operational issues for state are taken very seriously

Thank you!

@kkloudas

@ApacheFlink

@dataArtisans

Stream Processing and Apache Flink®'s approach to it

@StephanEwen, Apache Flink PMC, CTO @ data Artisans

Flink Forward is coming back to Berlin, September 11-13, 2017: berlin.flink-forward.org

We are hiring! data-artisans.com/careers