Introduction to Stateful Stream Processing with Apache Flink.

1 Kostas Kloudas @kkloudas HUG @ Warsaw JULY 3, 2017 Stateful Stream Processing with Apache Flink®

Transcript of Introduction to Stateful Stream Processing with Apache Flink.

Page 1: Introduction to Stateful Stream Processing with Apache Flink.

1

Kostas Kloudas @kkloudas

HUG @ Warsaw, July 3, 2017

Stateful Stream Processing with Apache Flink®

Page 2: Introduction to Stateful Stream Processing with Apache Flink.

2

Original creators of Apache Flink®

Providers of the dA Platform

Page 3: Introduction to Stateful Stream Processing with Apache Flink.

3 questions and some history

▪ What is stateful stream processing?

▪ Why care about it?

▪ How does Flink do it?

▪ The evolution of Flink.

3

Page 4: Introduction to Stateful Stream Processing with Apache Flink.

Stateful Stream Processing

4

Page 5: Introduction to Stateful Stream Processing with Apache Flink.

Stateful Stream Processing

5

Continuous Processing for Continuously Arriving Data

Page 6: Introduction to Stateful Stream Processing with Apache Flink.

6

The ol’ traditional batch way

[Figure: timeline t of hourly batch files (..., 2017-06-13 10:00pm, 2017-06-13 11:00pm, 2017-06-14 00:00am, 2017-06-14 01:00am), each processed by a batch job]

● Continuously ingesting data

● Time-bounded batch files

● Periodic batch jobs

Page 7: Introduction to Stateful Stream Processing with Apache Flink.

7

The ol’ traditional batch way

[Figure: the same timeline of hourly batch files (..., 2017-06-13 10:00pm, 2017-06-13 11:00pm, 2017-06-14 00:00am, 2017-06-14 01:00am), with intermediate state passed between consecutive batch jobs]

● Compute a counter: #(A) per hour / 2 min

● What if:

● the interval crosses batch boundaries? → carry intermediate results to the next batch

● events arrive out of order? → ???

Page 9: Introduction to Stateful Stream Processing with Apache Flink.

The ol’ traditional batch way

▪ So, for a simple counting program:

• Custom logic for handling state

• Custom logic for handling time

• Custom logic for fault tolerance

Difficult and has nothing to do with your program.

Page 11: Introduction to Stateful Stream Processing with Apache Flink.

Why should we care?

▪ ...this is just for continuous data, right?

Most datasets are continuously arriving streams.

Page 12: Introduction to Stateful Stream Processing with Apache Flink.

Stream Processing

12

Computation over an endless stream of data

[Figure: an endless stream of events flowing through "Your Code"]

Page 13: Introduction to Stateful Stream Processing with Apache Flink.

Distributed Stream Processing

13

[Figure: several input streams, each consumed by its own parallel instance of "Your Code"]

● Partitions input by some key

● Distributes computation across multiple instances

● Each instance is responsible for some keys
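As a rough illustration of the three points on this slide, the following plain-Java sketch routes records to parallel instances by key. The class name `KeyPartitioner`, the method names, and the hash-modulo rule are assumptions made for the example; this is not Flink's actual key assignment scheme.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class KeyPartitioner {
    // Hypothetical routing rule: hash the key, take it modulo the number of instances.
    static int instanceFor(String key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    // Splits a stream of keyed records (here just the keys) across the instances.
    static List<List<String>> partition(List<String> keys, int parallelism) {
        List<List<String>> instances = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            instances.add(new ArrayList<>());
        }
        for (String key : keys) {
            instances.get(instanceFor(key, parallelism)).add(key);
        }
        return instances;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("A", "B", "A", "C", "B", "A");
        // Every occurrence of a key lands on the same instance.
        System.out.println(partition(input, 3));
    }
}
```

Because the routing depends only on the key, all records for a given key are seen by one instance, which is what later allows state to be kept locally per key.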

Page 15: Introduction to Stateful Stream Processing with Apache Flink.

Stateful Stream Processing

15

[Figure: parallel instances of "Your Code", each updating its local variables/structures]

var x = …

if (condition(x)) {
  …
}

● Embedded local state

● State co-partitioned with the input stream by key
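A minimal sketch of "embedded local state", in plain Java rather than the Flink API: one instance keeps a counter per key in a local map and reacts when a condition on that state holds. `KeyedCounter`, `onEvent`, and the threshold condition are hypothetical names chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class KeyedCounter {
    // Embedded local state: one counter per key owned by this instance.
    private final Map<String, Long> countPerKey = new HashMap<>();
    private final long threshold;

    KeyedCounter(long threshold) {
        this.threshold = threshold;
    }

    // Updates the state for the event's key and reports whether the
    // count for that key has just reached the threshold.
    boolean onEvent(String key) {
        long updated = countPerKey.merge(key, 1L, Long::sum);
        return updated == threshold;
    }

    long countFor(String key) {
        return countPerKey.getOrDefault(key, 0L);
    }

    public static void main(String[] args) {
        KeyedCounter counter = new KeyedCounter(2);
        for (String key : new String[] {"A", "B", "A"}) {
            System.out.println(key + " reached threshold: " + counter.onEvent(key));
        }
    }
}
```

The map only ever holds keys routed to this instance, so no coordination with other instances is needed to update it.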

Page 16: Introduction to Stateful Stream Processing with Apache Flink.

A practical stream processor

16

state

● Fault-tolerance

● Scalability

● Efficiency

time

● Event-time (out-of-order events)

● Allows you to work in event-time (e.g. timers)

Page 17: Introduction to Stateful Stream Processing with Apache Flink.

17

A practical stream processor

Stateful Stream Processor that handles Large Distributed State and Time / Order / Completeness consistently, robustly, and efficiently

● Stateful stream processing as a new paradigm to continuously process continuously arriving data

● Produce accurate results

● Real-time is only a natural consequence of the model

Page 18: Introduction to Stateful Stream Processing with Apache Flink.

This is where Flink shines...

▪ Supports out-of-order streams

▪ Manages state transparently

• exactly-once processing

▪ Offers high throughput and low latency

▪ Scales to large deployments

• https://data-artisans.com/blog/blink-flink-alibaba-search

• https://data-artisans.com/blog/rbea-scalable-real-time-analytics-at-king

18

Page 19: Introduction to Stateful Stream Processing with Apache Flink.

Apache Flink®

19

Page 21: Introduction to Stateful Stream Processing with Apache Flink.

21

About time ...

[Figure: parallel instances of "Your Code" consuming input streams]

When are my results complete?

Processing Time drawbacks:

• Incorrect results

• Irreproducible results

Page 22: Introduction to Stateful Stream Processing with Apache Flink.

About time ...

22

Page 23: Introduction to Stateful Stream Processing with Apache Flink.

Event Time: Watermarks

23

● Special markers, called Watermarks

● Flow with elements

● A watermark of timestamp t means that no records with timestamp < t should be expected
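The watermark rule can be sketched in plain Java as follows. This is an illustrative model, not Flink's window operator: `EventTimeCounter`, its methods, and the result strings are assumptions, and late events (records arriving after a watermark has passed their window) are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class EventTimeCounter {
    private final long windowSize;
    // Window start -> events counted so far; the TreeMap keeps windows ordered by start.
    private final TreeMap<Long, Integer> openWindows = new TreeMap<>();

    EventTimeCounter(long windowSize) {
        this.windowSize = windowSize;
    }

    // Events may arrive in any order; each is assigned to a window by its own timestamp.
    void onEvent(long timestamp) {
        long windowStart = timestamp - Math.floorMod(timestamp, windowSize);
        openWindows.merge(windowStart, 1, Integer::sum);
    }

    // A watermark t promises that no record with timestamp < t is still expected,
    // so every window that ends at or before t is complete and can be emitted.
    List<String> onWatermark(long watermark) {
        List<String> results = new ArrayList<>();
        while (!openWindows.isEmpty() && openWindows.firstKey() + windowSize <= watermark) {
            Map.Entry<Long, Integer> done = openWindows.pollFirstEntry();
            long start = done.getKey();
            results.add("[" + start + "," + (start + windowSize) + ") -> " + done.getValue());
        }
        return results;
    }

    public static void main(String[] args) {
        EventTimeCounter counter = new EventTimeCounter(10);
        counter.onEvent(3);
        counter.onEvent(12);
        counter.onEvent(7); // out of order: arrives after the event with timestamp 12
        System.out.println(counter.onWatermark(10));
        System.out.println(counter.onWatermark(20));
    }
}
```

Results are emitted when a watermark declares a window complete, not when the wall clock advances, which is why the out-of-order event with timestamp 7 is still counted in the first window.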

Page 24: Introduction to Stateful Stream Processing with Apache Flink.

Event Time: Watermarks

24

Page 25: Introduction to Stateful Stream Processing with Apache Flink.

25

Event Time

Documentation: https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/event_time.html

Page 26: Introduction to Stateful Stream Processing with Apache Flink.

Fault tolerance

▪ How to ensure exactly-once semantics?

26

Page 27: Introduction to Stateful Stream Processing with Apache Flink.

Fault tolerance simple case

[Figure: an event log feeding a single process that keeps its state in main memory]

periodically take a Snapshot of the memory

Page 28: Introduction to Stateful Stream Processing with Apache Flink.

28

Fault tolerance simple case

[Figure: an event log, which persists events (temporarily), feeding a single process that keeps its state in main memory]

Recovery: restore snapshot and replay events since snapshot
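The single-process case can be sketched in plain Java: a snapshot stores the in-memory state together with the position in the event log, and recovery restores both and replays the log from that position. `CountingProcess`, `ProcessSnapshot`, and the list standing in for the event log are hypothetical names for this illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CountingProcess {
    final Map<String, Integer> counts = new HashMap<>(); // state held in main memory
    int offset = 0; // position in the event log of the next event to read

    // Reads events from the log, starting at the current offset, up to (excluding) end.
    void processUntil(List<String> log, int end) {
        while (offset < end) {
            counts.merge(log.get(offset), 1, Integer::sum);
            offset++;
        }
    }

    // A snapshot is a copy of the state plus the log position it corresponds to.
    ProcessSnapshot snapshot() {
        return new ProcessSnapshot(new HashMap<>(counts), offset);
    }

    // Recovery: restore the snapshot; the caller then replays events since the snapshot.
    static CountingProcess restore(ProcessSnapshot snapshot) {
        CountingProcess process = new CountingProcess();
        process.counts.putAll(snapshot.counts);
        process.offset = snapshot.offset;
        return process;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList("A", "B", "A", "C", "A");
        CountingProcess process = new CountingProcess();
        process.processUntil(log, 2);
        ProcessSnapshot snapshot = process.snapshot();
        process.processUntil(log, 4); // work done after the snapshot is lost in a crash
        CountingProcess recovered = CountingProcess.restore(snapshot);
        recovered.processUntil(log, log.size());
        System.out.println(recovered.counts);
    }
}

class ProcessSnapshot {
    final Map<String, Integer> counts;
    final int offset;

    ProcessSnapshot(Map<String, Integer> counts, int offset) {
        this.counts = counts;
        this.offset = offset;
    }
}
```

Because state and log position are captured together, each event affects the recovered state exactly once, even though some events are read twice.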

Page 29: Introduction to Stateful Stream Processing with Apache Flink.

Fault tolerance distributed

▪ How to create consistent snapshots of distributed state?

▪ How to do it efficiently?

29

Page 30: Introduction to Stateful Stream Processing with Apache Flink.

Distributed Snapshots

30

Coordination via markers, injected into the streams

Page 31: Introduction to Stateful Stream Processing with Apache Flink.

31

Distributed Snapshots

[Figure: a source feeding a stateful operation, which keeps a state index (Hash Table or RocksDB)]

Events flow without replication or synchronous writes

Page 32: Introduction to Stateful Stream Processing with Apache Flink.

32

Distributed Snapshots

[Figure: a source feeding a stateful operation]

Trigger checkpoint: inject checkpoint barrier

Page 33: Introduction to Stateful Stream Processing with Apache Flink.

Distributed Snapshots

[Figure: a source feeding a stateful operation]

Take state snapshot: trigger state copy-on-write

Page 34: Introduction to Stateful Stream Processing with Apache Flink.

Distributed Snapshots

[Figure: a source feeding a stateful operation, with snapshots written to a DFS]

Durably persist snapshots asynchronously

Processing pipeline continues
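The barrier idea from the last few slides can be modelled, very roughly, in plain Java. This sketch assumes a single input channel and copies the state eagerly when the barrier arrives; barrier alignment across several inputs, copy-on-write, and asynchronous persistence to a DFS are all left out. `BarrierSnapshotSketch` and the `BARRIER` marker value are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BarrierSnapshotSketch {
    // Hypothetical marker standing in for a checkpoint barrier injected at the source.
    static final String BARRIER = "<barrier>";

    // Runs a counting operator over the stream, mutating the given state,
    // and returns one state snapshot per barrier encountered.
    static List<Map<String, Integer>> run(List<String> stream, Map<String, Integer> state) {
        List<Map<String, Integer>> snapshots = new ArrayList<>();
        for (String element : stream) {
            if (element.equals(BARRIER)) {
                // The snapshot reflects exactly the events that preceded the barrier.
                snapshots.add(new HashMap<>(state));
            } else {
                // Regular events keep flowing and updating the live state.
                state.merge(element, 1, Integer::sum);
            }
        }
        return snapshots;
    }

    public static void main(String[] args) {
        Map<String, Integer> state = new HashMap<>();
        List<String> stream = Arrays.asList("A", "B", BARRIER, "A");
        System.out.println("snapshots: " + run(stream, state));
        System.out.println("live state: " + state);
    }
}
```

The barrier travels in-band with the events, so the operator needs no external coordination to know which events belong to the snapshot and which come after it.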

Page 35: Introduction to Stateful Stream Processing with Apache Flink.

35

Fault tolerance

● Consistent snapshotting:

[Figure: parallel instances of "Your Code", each with its own State]

Page 36: Introduction to Stateful Stream Processing with Apache Flink.

36

Fault tolerance

● Consistent snapshotting:

[Figure: parallel instances of "Your Code", each with its own State; a checkpoint writes the checkpointed state of the instances to the file system]

Page 37: Introduction to Stateful Stream Processing with Apache Flink.

37

Fault tolerance

[Figure: parallel instances of "Your Code", each with its own State; a restore loads the checkpointed state of the instances from the file system]

● Recover all embedded state

● Reset position in input stream

Page 38: Introduction to Stateful Stream Processing with Apache Flink.

38

Fault tolerance

Documentation: https://ci.apache.org/projects/flink/flink-docs-release-1.3/internals/stream_checkpointing.html

Page 39: Introduction to Stateful Stream Processing with Apache Flink.

State Management: misc.

39

▪ Savepoints

▪ Rescaling

▪ Queryable State

Page 40: Introduction to Stateful Stream Processing with Apache Flink.

Apache Flink Ecosystem

40

[Figure: integrations in the Flink ecosystem, including POSIX and Java/Scala Collections]

Page 41: Introduction to Stateful Stream Processing with Apache Flink.

Apache Flink Stack

41

[Figure: the Flink stack. Libraries sit on top of the DataStream API (Stream Processing) and the DataSet API (Batch Processing), which both run on the Runtime (Distributed Streaming Data Flow)]

Streaming and batch as first class citizens.

Page 42: Introduction to Stateful Stream Processing with Apache Flink.

Levels of abstraction

42

Process Function (events, state, time): low-level (stateful stream processing)

DataStream API (streams, windows): stream processing & analytics

Table API (dynamic tables): declarative DSL

Stream SQL: high-level language

Page 43: Introduction to Stateful Stream Processing with Apache Flink.

API and Execution

43

DataStream<String> lines = env.addSource(new FlinkKafkaConsumer010(…));   // Source

DataStream<Event> events = lines.map(line -> parse(line));   // Transformation

DataStream<Statistic> stats = events.keyBy("id").timeWindow(Time.seconds(5)).apply(new MyAggregationFunction());   // Transformation

stats.addSink(new BucketingSink(path));   // Sink

[Figure: the program as a streaming dataflow. Logical view: Source, map(), keyBy()/window()/apply(), Sink. Parallel view: Source[1] and Source[2], map()[1] and map()[2], keyBy()/window()/apply()[1] and [2], Sink[1]]

Page 44: Introduction to Stateful Stream Processing with Apache Flink.

Evolution of Flink

44

Page 45: Introduction to Stateful Stream Processing with Apache Flink.

Programming APIs

45

Page 46: Introduction to Stateful Stream Processing with Apache Flink.

Large State Handling

46

Page 47: Introduction to Stateful Stream Processing with Apache Flink.

Conclusion

47

Page 48: Introduction to Stateful Stream Processing with Apache Flink.

TL;DR

▪ Stateful stream processing as a paradigm for continuous data processing

▪ Flink is a sophisticated and tested stateful stream processor

▪ Efficiency, management, and operational issues for state are taken very seriously

48

Page 49: Introduction to Stateful Stream Processing with Apache Flink.

49

Thank you!

@kkloudas @ApacheFlink @dataArtisans

Page 50: Introduction to Stateful Stream Processing with Apache Flink.

50

Stream Processing and Apache Flink®'s approach to it

@StephanEwen Apache Flink PMC

CTO @ data Artisans

FLINK FORWARD IS COMING BACK TO BERLIN, SEPTEMBER 11-13, 2017

BERLIN.FLINK-FORWARD.ORG

Page 51: Introduction to Stateful Stream Processing with Apache Flink.

We are hiring! data-artisans.com/careers