Flink meetup

Post on 20-Mar-2017



Get your hands on implementing a Flink app: A tutorial

Christos Hadjinikolis & Satyasheel | DataReply.uk


Tutorial Overview:

What is Apache Flink?
Why Flink?
Processing both bounded and unbounded data
Anatomy of a Flink app
Windowing in Flink
Event time & processing time in Flink

2/22/17


What is Apache Flink?

“A distributed data processing platform…”


Flink is a distributed stream- and batch-data processing platform.

Stream processing… the real-time processing of data continuously, concurrently, and record by record, where data is not static.

Batch processing… the execution of a series of programs, each on a set or "batch" of static inputs, rather than on a single input.


…distributed processing dataset types

Unbounded: infinite datasets that are appended to continuously:
End users interacting with mobile or web applications
Physical sensors providing measurements
Financial markets
Machine log data
Surveillance camera frames


…distributed processing dataset types

Bounded: finite, unchanging datasets:

Pictures
Documents
Database tables


Why Flink?

“The world is turning more and more towards stream processing…”


Opt for Flink because it:

Provides accurate results
Is stateful and fault-tolerant, and can seamlessly recover from failures
Performs at large scale


…exactly-once semantics

Stateful… apps can maintain summaries of the data processed so far.

Checkpointing… a mechanism that ensures that, in the event of failure, no duplicate re-computation of an event takes place.
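The checkpointing idea can be illustrated with a toy Python sketch. This is not Flink's actual barrier-based mechanism; it only shows the principle that the processing position is persisted together with the state, so a restart resumes from the last checkpoint instead of re-computing from scratch.

```python
# A toy sketch (NOT Flink's real mechanism) of the checkpointing idea:
# periodically persist the stream offset together with the running state,
# so a restart resumes from the last checkpoint rather than from zero.

def process(stream, checkpoint, every=3):
    offset, total = checkpoint["offset"], checkpoint["state"]
    for i in range(offset, len(stream)):
        total += stream[i]                       # the app's running state
        if (i + 1) % every == 0:                 # periodic checkpoint
            checkpoint["offset"], checkpoint["state"] = i + 1, total
    return total

stream = [1, 2, 3, 4, 5, 6, 7]
ckpt = {"offset": 0, "state": 0}
print(process(stream, ckpt))   # 28; last checkpoint taken at offset 6
# Simulate a crash and restart: resume from the checkpoint, so only the
# elements after offset 6 are re-read -- no duplicate re-computation.
print(process(stream, ckpt))   # 28 again
```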


…event time semantics

…event-time-based windowing: event time makes it easy to compute accurate results over streams where events arrive out of order or delayed.


…flexible windowing: windows can be customized with flexible triggering conditions to support sophisticated streaming patterns based on:

Time;
Count, and;
Sessions.


… lightweight fault tolerance

Recovers from failures with zero data loss, while the trade-off between reliability and latency is negligible.


… lightweight fault tolerance

Savepoints provide a state-versioning mechanism: applications can be updated and can reprocess historic data with no lost state.


…Scalable

Designed to run on large-scale clusters with many thousands of nodes.


So, in summary… Flink is an open-source stream-processing framework which:

Eliminates the "performance vs. reliability" trade-off, and;
Performs consistently in both categories.


Processing both bounded & unbounded data!

“Unbounding the boundaries…”


…the streaming model & bounded datasets

DataStream API → unbounded data
DataSet API → bounded data

A bounded dataset is handled inside Flink as a "finite stream", with only a few minor differences from how Flink manages unbounded datasets.


Anatomy of a Flink App

“Let’s get this started…”


…Flink programs transform collections of data. Each program consists of the same basic parts:

Obtain an execution environment
Load/create the initial data
Specify transformations on this data
Specify where to put the results of your computations
Trigger the program execution


Create execution environment → Load streaming data → Trigger transformations → Specify dumping location → Execute


…Lazy evaluation

When the program's main method is executed, each operation is created and added to the program's plan; execution is explicitly triggered by an execute() call. This helps with constructing an optimised dataflow as a holistically planned unit.
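Lazy evaluation can be sketched in plain Python (this is an illustration of the idea, not the Flink API): transformations only record operations into a plan, and nothing runs until execute() is called.

```python
# A toy illustration of lazy evaluation (plain Python, NOT the Flink API):
# map/filter only append operations to a plan; nothing runs until execute()
# is called, mirroring how Flink first builds an optimised dataflow plan.

class ToyPipeline:
    def __init__(self, source):
        self.source = source
        self.plan = []          # operations recorded, not yet executed

    def map(self, fn):
        self.plan.append(("map", fn))
        return self

    def filter(self, pred):
        self.plan.append(("filter", pred))
        return self

    def execute(self):
        # Only now is the recorded plan actually run over the data.
        data = self.source
        for kind, fn in self.plan:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data

pipeline = ToyPipeline([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
print(len(pipeline.plan))    # 2 -- the plan exists, but nothing has run yet
print(pipeline.execute())    # [20, 30, 40]
```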


Let's take 15 mins…


Windowing in Flink

“…a simple word count app.”


…so what is a window? A window is a way to get a snapshot of the streaming data. A snapshot can be based on time or other variables; one can define the window based on the number of records or other stream-specific variables.


…enough with theory! Give us some code!

A streaming word count example with no windowing
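What such an example computes can be sketched in plain Python (this is the semantics, not the actual Flink program): with no windowing, every incoming record simply updates a running, ever-growing count per word.

```python
# A minimal Python sketch (NOT the actual Flink program) of a streaming word
# count with no windowing: each record updates a running count per word,
# and every update would be emitted downstream immediately.

from collections import defaultdict

def streaming_word_count(lines):
    counts = defaultdict(int)
    for line in lines:                      # each line arrives as a new event
        for word in line.lower().split():
            counts[word] += 1
            print(word, counts[word])       # emit the updated running count
    return dict(counts)

totals = streaming_word_count(["to be or not", "to be"])
# 'to' and 'be' each end with a running count of 2; the counts never reset
```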


…updating states

Flink automatically updates its state without the user explicitly doing so. To better appreciate this, it is worth contrasting Flink with Spark:

Spark relies on micro-batches: one has to define the batch size, either in terms of time or of size.
Flink does not require defining a batch size: it can process each and every new event individually (it is true stream processing!).


Let's see an example


Windowing in Flink

“Don't waste a minute not being happy. If one window closes, run to the next window - or break down a door. …”


…so why use windowing at all?

Aggregation on a DataStream is different from aggregation on a DataSet: one cannot count all the records of an infinite stream. DataStream aggregation only makes sense on a windowed stream.


…what types of windowing can you use?

Tumbling windows: aligned, fixed-length, non-overlapping.
Sliding windows: aligned, fixed-length, overlapping.
Session windows: non-aligned, variable-length.
Count windows: a fixed number of records/events, non-overlapping.
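The difference between tumbling and sliding windows comes down to simple arithmetic over timestamps, which a small Python sketch (not Flink code) makes concrete: a tumbling window assigns each timestamp to exactly one window, while a sliding window can assign it to several overlapping ones.

```python
# A small Python sketch (NOT Flink code) of time-based window assignment.

def tumbling_window(ts, size):
    # Exactly one window: the aligned interval [start, start + size) covering ts.
    start = ts - (ts % size)
    return (start, start + size)

def sliding_windows(ts, size, slide):
    # All windows of length `size`, starting every `slide` units, that cover ts.
    last_start = ts - (ts % slide)
    starts = range(last_start, ts - size, -slide)
    return [(s, s + size) for s in starts if s <= ts < s + size]

print(tumbling_window(7, 5))        # (5, 10): one window only
print(sliding_windows(7, 10, 5))    # [(5, 15), (0, 10)]: two overlapping windows
```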


…anatomy of the window API

3 window functions:

Window assigner: responsible for assigning a given element to a window. Depending on the definition of the window, one element can belong to one or more windows at a time.
Trigger: defines the condition for triggering window evaluation. This function controls when a given window created by the window assigner is evaluated.
Evictor: an optional function which defines preprocessing before the window operation fires.


…understanding count window

Window assigner (user-defined, for count-based windows):

There is no start or end to the window, so the window is not time-based. For these windows we use the GlobalWindows window assigner: for a given key, all key-values are filled into the same window.

keyValue.window(GlobalWindows.create())

The window API allows us to add the window assigner to the window. Every window assigner has a default trigger; for global windows that trigger is NeverTrigger, which never fires, so this window assigner has to be used with a custom trigger.


…understanding count window

Count trigger:

Once we have the window assigner, we have to define when the window needs to be triggered, for example:

trigger(CountTrigger.of(2))

This results in the window being evaluated every two records.

Evictor:

In addition, an evictor can be used for further preprocessing before a window operation fires, e.g. to remove every 3rd element of a window.

Some default evictors: CountEvictor, DeltaEvictor, TimeEvictor
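What GlobalWindows plus CountTrigger.of(2) computes can be sketched in plain Python (this mimics the fire-and-purge behaviour for illustration; it is not Flink code): per key, elements accumulate in one global window, and every second element fires an evaluation over the buffered pair.

```python
# A Python sketch (NOT Flink code) of GlobalWindows + a count trigger:
# one unbounded window per key; every `count` elements, the trigger fires
# an evaluation (here: a sum) and purges the buffer.

from collections import defaultdict

def count_windows(events, count=2):
    buffers = defaultdict(list)   # one "global window" buffer per key
    fired = []
    for key, value in events:
        buffers[key].append(value)
        if len(buffers[key]) == count:       # the count trigger fires
            fired.append((key, sum(buffers[key])))
            buffers[key].clear()             # fire-and-purge behaviour
    return fired

events = [("a", 1), ("b", 10), ("a", 2), ("a", 3), ("b", 20)]
print(count_windows(events))   # [('a', 3), ('b', 30)]; 'a's third element still buffered
```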


Let's take 15 mins…


Timing in Flink

“The two most powerful warriors are patience and time.”


…the time concept in streaming

A streaming application is an always-running application:

…we need to take snapshots of the stream at various points;
…these points can be defined using a time component;
…we can group and correlate different events happening in the stream.

Some constructs, like windows, heavily use the time component. Most streaming frameworks support a single meaning of time, which is mostly tied to processing time.


…time in Flink

When we say "the last t seconds", what do we mean exactly? Well, in Flink it is one of three things:

Processing time: "…the records that arrived in the last t seconds for processing."
Event time: "…all the records generated in those last t seconds at the source."
Ingestion time: the time when events are ingested into the system; it sits between event time and processing time.
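The difference matters in practice, as a small Python sketch (not Flink code, with made-up timestamps) shows: the same out-of-order events land in different 10-second tumbling windows depending on which notion of time is used to assign them.

```python
# A Python sketch (NOT Flink code) contrasting processing time and event time:
# the same events fall into different 10-second tumbling windows depending on
# whether we window by arrival time or by the time they were generated.

def assign(events, time_field, size=10):
    windows = {}
    for event in events:
        ts = event[time_field]
        start = ts - (ts % size)                 # tumbling-window start
        windows.setdefault(start, []).append(event["id"])
    return windows

# event 2 was generated at t=9 but arrived late, at t=12 (hypothetical data)
events = [
    {"id": 1, "event_time": 3,  "arrival_time": 4},
    {"id": 2, "event_time": 9,  "arrival_time": 12},
    {"id": 3, "event_time": 14, "arrival_time": 15},
]

print(assign(events, "arrival_time"))  # processing time: {0: [1], 10: [2, 3]}
print(assign(events, "event_time"))    # event time:      {0: [1, 2], 10: [3]}
```

Under event time the late record 2 is correctly grouped with record 1; under processing time it is misattributed to the later window.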



Thanks for your attention!