Graduating Flink Streaming - Chicago meetup


Flink 0.10: Graduating streaming

Márton Balassi, [email protected] / @MartonBalassi

Hungarian Academy of Sciences

Streaming in Flink 0.10

• Operational readiness

High Availability

Monitoring

Integration with other systems

• First-class support for event-time

• Hardened statefulness support

• Redefined API

Streaming in Flink 0.10

• Some breaking changes

GroupBy -> KeyBy

Windowing API completely changed

Renaming of DataStream and related classes

Internal rewrite

The goal is to harden for 1.0

API in one take

Windowing

• Why put your data into windows?

• This is why: streaming data never stops

[Diagram: a tweet ("Just saw #Trump on #CNN, super cool. :D") flows into a 5-minute window that counts hashtags, producing counts such as Trump: 2394, Cheese: 12984, Money: 42.]


What I didn’t mention

• Tweets have a timestamp: their event time

• Tweets from across the globe arrive with delay

=> Tweets with different timestamps arrive out-of-order

[Diagram: the 5-minute hashtag-count window is formed based on the processing time of the machine, while tweets (e.g. "12:34 (13.10.2015): Just saw #Trump on #CNN, super cool. :D") arrive with 3 minutes of slack.]

Processing Time != Event Time


Why do people use this?

• easy to implement

• low latency

• this is what systems give you (Spark Streaming, Apex, Samza, Storm)*

*not Google Cloud Dataflow


Let's look at a more complex example.


[Diagram: a 5-minute window correlating Tweets and News. The tweets still arrive with 3 minutes of slack, while news items (e.g. "12:33 (13.10.2015): Donald Trump speaks at Cheese conference.") arrive with 8 minutes of slack. Processing Time != Event Time for either stream => a mismatch in the timespace continuum.]


Use cases

• out-of-order elements

• sources with delay

• recovery/fault-tolerance

• “catching up” with a stream

Who does it?

• Google Cloud Dataflow

• Apache Flink


How can we do this?


We need a Global Clock that runs on event time instead of processing time.


[Diagram: a source emits records into a window operator; each element carries a timestamp, the operators track the current event time, and a watermark flowing with the stream advances that event-time clock.]
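As a rough sketch (not taken from the slides), a source can drive this event-time clock itself by attaching timestamps to its records and emitting watermarks. The class name, the fixed 3-minute slack and the generated data below are made up for illustration, and the exact event-time source hooks differ between Flink releases:

import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.streaming.api.watermark.Watermark;

// Hypothetical source that emits timestamped records plus watermarks.
public class TimestampedNumbers implements SourceFunction<Long> {

    private static final long MAX_DELAY_MS = 3 * 60 * 1000; // assumed 3 min slack

    private volatile boolean isRunning = true;

    @Override
    public void run(SourceContext<Long> ctx) throws Exception {
        long value = 0;
        while (isRunning) {
            // Pretend the record was created MAX_DELAY_MS ago.
            long eventTime = System.currentTimeMillis() - MAX_DELAY_MS;
            ctx.collectWithTimestamp(value++, eventTime);
            // The watermark promises that no element older than eventTime will follow,
            // so downstream event-time windows ending at or before it may fire.
            ctx.emitWatermark(new Watermark(eventTime));
            Thread.sleep(100);
        }
    }

    @Override
    public void cancel() {
        isRunning = false;
    }
}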


Now, show me the API!


Processing Time

StreamExecutionEnvironment env =
    StreamExecutionEnvironment.getExecutionEnvironment();

env.setStreamTimeCharacteristic(ProcessingTime);

DataStream<Tweet> text = env.addSource(new TwitterSrc());

DataStream<Tuple2<String, Integer>> counts = text
    .flatMap(new ExtractHashtags())
    .keyBy("name")
    .timeWindow(Time.of(5, MINUTES))
    .apply(new HashtagCounter());


Event Time

StreamExecutionEnvironment env =
    StreamExecutionEnvironment.getExecutionEnvironment();

env.setStreamTimeCharacteristic(EventTime);

DataStream<Tweet> text = env.addSource(new TwitterSrc());

text = text.assignTimestamps(new MyTimestampExtractor());

DataStream<Tuple2<String, Integer>> counts = text
    .flatMap(new ExtractHashtags())
    .keyBy("name")
    .timeWindow(Time.of(5, MINUTES))
    .apply(new HashtagCounter());
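ExtractHashtags and HashtagCounter are user-defined functions that the slides do not show. Below is a minimal sketch of the hashtag extraction, assuming a Tweet type with a getText() accessor; HashtagCounter would then be a window function summing these counts per key. (With Tuple2 output one would typically key on the tuple position, e.g. keyBy(0); the keyBy("name") above presumes a POJO-style element.)

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

// Hypothetical sketch: emit a (hashtag, 1) pair for every hashtag in a tweet.
public class ExtractHashtags implements FlatMapFunction<Tweet, Tuple2<String, Integer>> {

    private static final Pattern HASHTAG = Pattern.compile("#(\\w+)");

    @Override
    public void flatMap(Tweet tweet, Collector<Tuple2<String, Integer>> out) {
        Matcher matcher = HASHTAG.matcher(tweet.getText()); // getText() is assumed
        while (matcher.find()) {
            out.collect(new Tuple2<>(matcher.group(1), 1));
        }
    }
}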

Fault tolerance in streaming

Fault-tolerance in streaming systems is inherently harder than in batch

• Can’t just restart computation

• State is a problem

• Fast recovery is crucial

• Streaming topologies run 24/7 for long periods

Fault-tolerance is a complex issue

• No single point of failure is allowed

• Guaranteeing input processing

• Consistent operator state

• Fast recovery

• At-least-once vs Exactly-once semantics

High Availability

Consistency - Flink distributed snapshots

Based on consistent global snapshots

Algorithm designed for stateful dataflows (minimal runtime overhead)

Exactly-once semantics
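On the API side this mostly amounts to switching checkpointing on. A minimal sketch, with an arbitrarily chosen 5-second interval:

StreamExecutionEnvironment env =
    StreamExecutionEnvironment.getExecutionEnvironment();

// Draw a consistent distributed snapshot of all operator state every 5 seconds.
// Combined with a replayable source this yields exactly-once state updates.
env.enableCheckpointing(5000);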

Stateful streaming applications

ETL style operations

Filter incoming data, Log analysis

High throughput, connectors, at-least-once processing

Window aggregations

Trending tweets, User sessions, Stream joins

Window abstractions

[Diagram: several Input streams feed into a Process/Enrich operator.]

Stateful streaming applications

Machine learning

Fitting trends to the evolving stream, Stream clustering

Model state, cyclic flows

Pattern recognition

Fraud detection, Triggering signals based on activity

Exactly-once processing

Statefulness in 0.9.1

Stateful dataflow operators (conceptually similar to Samza)

Two state access patterns

Local (Task) state

Partitioned (Key) state

Proper API integration

Java: OperatorState interface

Scala: mapWithState, flatMapWithState…

Exactly-once semantics by checkpointing

Stateful API

// Keep a running count per word; the state is an Option[Int] per key.
words.keyBy(x => x).mapWithState {
  (word, count: Option[Int]) => {
    val newCount = count.getOrElse(0) + 1
    val output = (word, newCount)
    // Return the output record and the updated state for this key.
    (output, Some(newCount))
  }
}

Local state example (Java)

public class MySource extends RichParallelSourceFunction {
    // Omitted details

    private OperatorState<Long> offset;

    @Override
    public void run(SourceContext ctx) {
        Object checkpointLock = ctx.getCheckpointLock();
        isRunning = true;
        while (isRunning) {
            // Update state and emit records under the checkpoint lock,
            // so snapshots always see a consistent offset.
            synchronized (checkpointLock) {
                offset.update(offset.value() + 1);
                //ctx.collect(next);
            }
        }
    }
}

Statefulness in 0.10

Internal operators are checkpointed

Aggregations

Window operators

KeyValue state

Easing common access patterns

Flexible state backend interface

Removes non-partitioned operator state

Improved monitoring

Batch and streaming

Integration (not complete)

Summary - Streaming in Flink 0.10

• Operational readiness

High Availability

Monitoring

Integration with other systems

• First-class support for event-time

• Hardened statefulness support

• Redefined API

Thanks for the slides

• Material borrowed from:

flink.apache.org

Stephan Ewen

Aljoscha Krettek

Gyula Fóra