Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)


Transcript of Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)

Page 1: Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)

Ufuk Celebi

Hadoop Summit Dublin April 13, 2016

Unified Stream & Batch Processing

with Apache Flink

Page 2: Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)

What is Apache Flink?

2

Apache Flink is an open source stream processing framework.

• Event Time Handling • State & Fault Tolerance • Low Latency • High Throughput

Developed at the Apache Software Foundation.

Page 3: Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)

Recent History

3

Timeline: April '14 – Project Incubation; December '14 – Top Level Project; March '16 – Release 1.0. Releases along the way: v0.5, v0.6, v0.7, v0.8, v0.10.

Page 4: Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)

Flink Stack

4

DataStream API Stream Processing

DataSet API Batch Processing

Runtime Distributed Streaming Data Flow

Libraries

Streaming and batch as first class citizens.

Page 5: Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)

Counting

5

Seemingly simple application: Count visitors, ad impressions, etc.

But generalizes well to other problems.

Page 6: Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)

Batch Processing

6

Diagram: All Input → Batch Job (Hadoop, Spark, Flink) → All Output.

Page 7: Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)

Batch Processing

7

DataSet<ColorEvent> counts = env
    .readFile("MM-dd.csv")
    .groupBy("color")
    .count();

Page 8: Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)

Continuous Counting

8

Diagram: continuous ingestion → periodic files → periodic batch jobs (Job 1, Job 2, Job 3), each covering a 1h slice of the stream.

Page 9: Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)

Many Moving Parts

9

Diagram: data loading into HDFS (e.g. Flume), a periodic job scheduler (e.g. Oozie), a batch processor (e.g. Hadoop, Spark, Flink) running hourly batch jobs, and a serving layer.

Page 10: Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)

High Latency

10

Latency from event to serving layer is usually in the range of hours.

Diagram: an hourly batch job, scheduled every X hours, feeding the serving layer.

Page 11: Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)

Implicit Treatment of Time

11

Time is treated outside of your application.

Diagram: a sequence of hourly batch jobs, each feeding the serving layer.

Page 12: Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)

Implicit Treatment of Time

12

DataSet<ColorEvent> counts = env
    .readFile("MM-dd.csv")
    .groupBy("color")
    .count();

Time is implicit in input file

Page 13: Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)

Streaming over Batch

13

Diagram: continuously produced data → periodic files → periodically executed batch job → serving layer. Files are finite streams.

Page 14: Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)

Streaming

14

Until now, stream processors were less mature than their batch counterparts. This led to:

• in-house solutions, • abuse of batch processors, • Lambda architectures

This is no longer needed with new generation stream processors like Flink.

Page 15: Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)

Streaming All the Way

15

Diagram: message queue (e.g. Apache Kafka) for durability and replay → streaming job on a stream processor (e.g. Apache Flink) for consistent processing → serving layer.

Page 16: Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)

Building Blocks of Flink

16

Explicit Handling of Time

State & Fault Tolerance

Performance

Page 17: Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)

Windowing

17

Aggregates on streams are scoped by windows.

Time-driven (e.g. last X minutes) or data-driven (e.g. last X records).
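For illustration, a sketch of how the two flavors look in the DataStream API; events, ColorEvent, and CountPerWindow are placeholders in the spirit of the later slides, not code from this talk:

// Time-driven: one aggregate per key over the last 5 minutes.
DataStream<ColorEvent> perTime = events
    .keyBy("color")
    .timeWindow(Time.minutes(5))
    .apply(new CountPerWindow());

// Data-driven: one aggregate per key over the last 100 records.
DataStream<ColorEvent> perCount = events
    .keyBy("color")
    .countWindow(100)
    .apply(new CountPerWindow());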

Page 18: Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)

Tumbling Windows (No Overlap)

18


e.g. “Count over the last 5 minutes”,

“Average over the last 100 records”

Page 19: Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)

Sliding Windows (with Overlap)

19


e.g. “Count over the last 5 minutes, updated each minute.”,

“Average over the last 100 elements, updated every 10 elements”
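A sketch of how sliding windows look in the API compared to the tumbling windows of the previous slide; only the extra slide argument changes (events and CountPerWindow are again assumed placeholders):

// Sliding time window: count over the last 5 minutes, updated each minute.
events.keyBy("color")
      .timeWindow(Time.minutes(5), Time.minutes(1))
      .apply(new CountPerWindow());

// Sliding count window: aggregate over the last 100 elements, updated every 10 elements.
events.keyBy("color")
      .countWindow(100, 10)
      .apply(new CountPerWindow());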

Page 20: Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)

Explicit Handling of Time

20

DataStream<ColorEvent> counts = env
    .addSource(new KafkaConsumer(…))
    .keyBy("color")
    .timeWindow(Time.minutes(60))
    .apply(new CountPerWindow());

Time is explicit in your program

Page 21: Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)

Session Windows

21

Diagram: events on a timeline, split into sessions by gaps of inactivity.

Sessions close after a period of inactivity.

e.g. “Count activity from login until time-out or logout.”

Page 22: Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)

Session Windows

22

DataStream<ColorEvent> counts = env
    .addSource(new KafkaConsumer(…))
    .keyBy("color")
    .window(EventTimeSessionWindows.withGap(Time.minutes(10)))
    .apply(new CountPerWindow());

Page 23: Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)

Notions of Time

23

Event Time (e.g. 12:23 am): time when the event happened.

Processing Time (e.g. 1:37 pm): time measured by the system clock.

Page 24: Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)

Out of Order Events

24

Diagram: Star Wars episodes ordered by release year (Processing Time: 1977, 1980, 1983, 1999, 2002, 2005, 2015 for Episodes IV, V, VI, I, II, III, VII) versus story order (Event Time) – events arrive out of order.

Page 25: Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)

Out of Order Events

25

Diagram: a 1st and a 2nd burst of events; event-time windows group events by when they happened, processing-time windows by when they arrive.

Page 26: Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)

Notions of Time

26

env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

DataStream<ColorEvent> counts = env
    …
    .timeWindow(Time.minutes(60))
    .apply(new CountPerWindow());
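Event time also requires telling Flink where each event's timestamp comes from. A minimal sketch using a periodic watermark assigner; the getTimestamp() accessor and the one-second out-of-order bound are assumptions, not from the slides:

DataStream<ColorEvent> withTimestamps = events
    .assignTimestampsAndWatermarks(new AssignerWithPeriodicWatermarks<ColorEvent>() {

        private long maxTimestampSeen = Long.MIN_VALUE + 1_000L;

        @Override
        public long extractTimestamp(ColorEvent event, long previousElementTimestamp) {
            // Use the timestamp embedded in the event (assumed accessor).
            maxTimestampSeen = Math.max(maxTimestampSeen, event.getTimestamp());
            return event.getTimestamp();
        }

        @Override
        public Watermark getCurrentWatermark() {
            // Declare the stream complete up to 1 second behind the latest event seen.
            return new Watermark(maxTimestampSeen - 1_000L);
        }
    });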

Page 27: Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)

Explicit Handling of Time

27

1. Expressive windowing 2. Accurate results for out of order data 3. Deterministic results

Page 28: Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)

Stateful Streaming

28

Diagram: stateless stream processing (an operator transforms each event independently) vs. stateful stream processing (the operator keeps state across events).

Page 29: Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)

Processing Semantics

29

At-least once: may over-count after failure.

Exactly once: correct counts after failures.

End-to-end exactly once: correct counts in the external system (e.g. DB, file system) after failure.

Page 30: Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)

Processing Semantics

30

• Flink guarantees exactly once (can be configured for at-least once if desired, as sketched below)

• End-to-end exactly once with specific sources and sinks (e.g. Kafka -> Flink -> HDFS)

• Internally, Flink periodically takes consistent snapshots of the state without ever stopping computation
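A minimal sketch of how this is enabled in a program; the 5-second interval is an arbitrary choice, not from the talk:

// Periodically snapshot all operator state (every 5 seconds) without stopping the job.
env.enableCheckpointing(5000);

// Exactly-once is the default; at-least-once can be chosen for even lower latency:
// env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.AT_LEAST_ONCE);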

Page 31: Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)

Yahoo! Benchmark

31

• Storm 0.10, Spark Streaming 1.5, and Flink 0.10 benchmark by the Storm team at Yahoo!

• Focus on measuring end-to-end latency at low throughputs (~ 200k events/sec)

• First benchmark modelled after a real application

https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at

Page 32: Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)

Yahoo! Benchmark

32

• Count ad impressions grouped by campaign • Compute aggregates over last 10 seconds • Make aggregates available for queries in Redis
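A rough sketch of what such a pipeline can look like in the DataStream API; AdEvent, the campaignId key, CountPerWindow, and the RedisSink are placeholders, not the benchmark's actual code:

// Read ad impressions from Kafka, count per campaign over 10-second windows, write to Redis.
DataStream<AdEvent> impressions = env.addSource(new KafkaConsumer(…));

impressions
    .keyBy("campaignId")
    .timeWindow(Time.seconds(10))
    .apply(new CountPerWindow())
    .addSink(new RedisSink(…));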

Page 33: Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)

Latency (Lower is Better)

33

Chart: 99th percentile latency (sec) vs. throughput (1,000 events/sec, 60–180). Storm 0.10 and Flink 0.10 stay at low latencies; Spark Streaming 1.5 latency increases with throughput.

Page 34: Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)

Extending the Benchmark

34

• Great starting point, but the benchmark stops at low write throughput and the programs are not fault-tolerant

• Extend benchmark to high volumes and Flink’s built-in fault-tolerant state

http://data-artisans.com/extending-the-yahoo-streaming-benchmark/

Page 35: Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)

Extending the Benchmark

35

Use Flink’s internal state
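A minimal sketch of counting with Flink's built-in keyed state; ColorEvent, its getColor() accessor, and the two-argument state descriptor (as in more recent Flink versions) are assumptions, not the extended benchmark's actual code:

public class StatefulCount extends RichFlatMapFunction<ColorEvent, Tuple2<String, Long>> {

    private transient ValueState<Long> count;

    @Override
    public void open(Configuration parameters) {
        // One counter per key (per color), checkpointed and restored by Flink.
        count = getRuntimeContext().getState(new ValueStateDescriptor<>("count", Long.class));
    }

    @Override
    public void flatMap(ColorEvent event, Collector<Tuple2<String, Long>> out) throws Exception {
        Long current = count.value();
        long updated = (current == null ? 0L : current) + 1;
        count.update(updated);
        out.collect(Tuple2.of(event.getColor(), updated));
    }
}

// Usage: events.keyBy("color").flatMap(new StatefulCount());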

Page 36: Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)

Throughput (Higher is Better)

36

Chart: maximum throughput (events/sec, axis up to 15,000,000) for Flink w/o Kafka, Flink w/ Kafka, and Storm w/ Kafka. Flink with Kafka is limited by the bandwidth between the Kafka and Flink clusters.

Page 37: Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)

Summary

37

• Stream processing is gaining momentum and is the right paradigm for continuous data applications

• Choice of framework is crucial – even seemingly simple applications become complex at scale and in production

• Flink offers a unique combination of efficiency, consistency, and event time handling

Page 38: Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)

Libraries

38

DataStream API Stream Processing

DataSet API Batch Processing

Runtime Distributed Streaming Data Flow

Libraries: Complex Event Processing (CEP), ML, Graphs

Page 39: Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)

39

Pattern<MonitoringEvent, ?> warningPattern = Pattern.<MonitoringEvent>begin("First Event")
    .subtype(TemperatureEvent.class)
    .where(evt -> evt.getTemperature() >= THRESHOLD)
    .next("Second Event")
    .subtype(TemperatureEvent.class)
    .where(evt -> evt.getTemperature() >= THRESHOLD)
    .within(Time.seconds(10));

Complex Event Processing (CEP)
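For context, a pattern like this is applied to a stream via the CEP library. A hedged sketch; the monitoringStream, the rackId key, and the TemperatureWarning type are assumptions, and the select-function signature follows the early flink-cep API:

// Apply the pattern to a keyed stream of monitoring events.
PatternStream<MonitoringEvent> patternStream =
    CEP.pattern(monitoringStream.keyBy("rackId"), warningPattern);

// Turn each matched pattern into a warning record.
DataStream<TemperatureWarning> warnings = patternStream.select(
    new PatternSelectFunction<MonitoringEvent, TemperatureWarning>() {
        @Override
        public TemperatureWarning select(Map<String, MonitoringEvent> pattern) {
            TemperatureEvent first = (TemperatureEvent) pattern.get("First Event");
            return new TemperatureWarning(first.getTemperature());
        }
    });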

Page 40: Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)

Upcoming Features

40

• SQL: ongoing work in collaboration with Apache Calcite

• Dynamic Scaling: adapt resources to stream volume, scale up for historical stream processing

• Queryable State: query the state inside the stream processor

Page 41: Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)

SQL

41

SELECT STREAM * FROM Orders WHERE units > 3;

 rowtime  | productId | orderId | units
----------+-----------+---------+-------
 10:17:00 |        30 |       5 |     4
 10:18:07 |        30 |       8 |    20
 11:02:00 |        10 |       9 |     6
 11:09:30 |        40 |      11 |    12
 11:24:11 |        10 |      12 |     4
 …        | …         | …       | …

Page 42: Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)

Dynamic Scaling

42

Key-value states have to be redistributed when rescaling a Flink job. Distributing the key-value states coherently with the job's new partitioning will lead to a consistent state.

Requirements

In order to efficiently distribute key-value states across the cluster, they should be grouped into key groups. Each key group represents a subset of the key space and is checkpointed as an independent unit. The key groups can then be re-assigned to different tasks if the degree of parallelism (DOP) changes. In order to decide which key belongs to which key group, the user has to specify the maximum number of key groups. This number limits the maximum parallelism of an operator, because each operator has to receive at least one key group. The current implementation of Flink can be thought of as having only a single implicit key group per operator.

In order to mitigate the problem of having a fixed number of key groups, it is conceivable to provide a tool with which one can change the number of key groups for a checkpointed program. This would of course most likely be a very costly operation.
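As a hedged illustration of the key-group idea (consistent assignment of keys to a fixed number of groups, and of contiguous group ranges to operator instances), a sketch of the principle rather than Flink's exact implementation:

// A key always maps to the same key group, independent of the current parallelism.
static int keyGroupFor(Object key, int maxKeyGroups) {
    return (key.hashCode() & 0x7fffffff) % maxKeyGroups;
}

// Contiguous ranges of key groups are assigned to operator instances,
// so rescaling moves whole groups but never splits one.
static int operatorIndexFor(int keyGroup, int maxKeyGroups, int parallelism) {
    return keyGroup * parallelism / maxKeyGroups;
}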

Page 43: Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)

Queryable State

43

Query Flink directly

Page 44: Unified Stream & Batch Processing with Apache Flink (Hadoop Summit Dublin 2016)

Join the Community

44

Read http://flink.apache.org/blog http://data-artisans.com/blog

Follow @ApacheFlink @dataArtisans

Subscribe (news | user | dev)@flink.apache.org