Post on 10-Apr-2017
Introduction to stream processing with Apache Flink
Seif Haridi, KTH/SICS
Stream processing
Data Science Summit 2015
Why streaming
[Figure: timeline of data availability (2000, 2008, 2015): Data Warehouse, Batch, Streaming]
- Which data?
- When?
- Who?
Data Science Summit 2015 S. Haridi
3 Parts of a Streaming Infrastructure
[Figure: three stages: Gathering (sensors, transaction logs, server logs, …), Broker, Analysis]
Example: Bouygues Telecom
• Network and subscriber data gathered
• Added to Broker in raw format
• Transformed and analyzed by streaming engine
• Stored back for further processing
http://data-artisans.com/flink-at-bouygues.html
What is Apache Flink?
1 year of Flink - code: April 2014 to April 2015
What is Apache Flink
Distributed Data Flow Processing System
▪ Focused on large-scale data analytics
▪ Unified real-time stream and batch processing
▪ Expressive and rich APIs in Java / Scala (+ Python)
▪ Robust and fast execution backend
[Figure: example dataflow graph with operators Source, Map, Filter, Join, Reduce, Iterate, Sink]
Flink Stack
[Figure: Flink stack]
Libraries and front ends: Gelly, Table, ML, SAMOA, Dataflow, Hadoop M/R, Storm, Zeppelin
APIs: DataSet (Java/Scala), DataStream (Java/Scala)
Runtime: Streaming dataflow runtime
Deployment: Local, Cluster, Yarn, Tez, Embedded
Stream Processing with Flink
What is Flink Streaming
• Native, low-latency stream processor
• Expressive functional API
• Flexible operator state, iterations, windows
• Exactly-once processing semantics
Native vs non-native streaming
Non-native streaming (a stream discretizer issues a sequence of jobs):
    while (true) {
      // get next few records
      // issue batch computation
    }

Native streaming (long-standing operators):
    while (true) {
      // process next record
    }
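The two loops above can be made concrete with a minimal plain-Scala sketch. It does not use Flink; the object name `StreamingModels` and both function names are hypothetical, and a running sum stands in for whatever computation the job performs.

```scala
object StreamingModels {
  // Non-native (micro-batch): discretize the stream into small batches
  // and issue one batch computation per group of records.
  def microBatch(records: Iterator[Int], batchSize: Int): List[Int] =
    records.grouped(batchSize).map(batch => batch.sum).toList

  // Native: a long-standing operator keeps its state across records
  // and emits an updated result after every single record.
  def recordAtATime(records: Iterator[Int]): List[Int] = {
    var runningSum = 0 // operator state, updated once per record
    records.map { r => runningSum += r; runningSum }.toList
  }
}
```

With the input 1, 2, 3, 4, 5 and a batch size of 2, `microBatch` yields one result per batch (3, 7, 5), while `recordAtATime` yields one result per record (1, 3, 6, 10, 15). The difference in result granularity is what drives the latency gap between the two models.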
Stream processing in Flink
• Continuous Streaming model
• Low processing latency
• O(1) state updates per operator
• Exactly once semantics for state operators
DataStream API
Overview of the API
Windowing Semantics
• Trigger and Eviction policies
• window(<eviction>).every(<trigger>)
• Built-in policies:
  – Time: Time.of(length, TimeUnit/Custom timestamp)
    – window(Time.of(20, SECONDS))
  – Count: Count.of(windowSize)
    – window(Count.of(20)).every(Count.of(10))
  – Delta: Delta.of(Threshold, Distance function, Start value)
    – window(Delta.of(0.1, priceDistanceFun, initPrice))
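How an eviction policy and a trigger policy interact can be sketched in plain Scala for the count-based case. This is one plausible reading of `window(Count.of(n)).every(Count.of(m))`, not Flink's implementation; `CountWindowSketch` and `slidingCountWindows` are hypothetical names.

```scala
import scala.collection.mutable

object CountWindowSketch {
  // Keep at most `windowSize` elements (eviction) and emit a snapshot
  // of the buffer after every `slide` new elements (trigger).
  def slidingCountWindows[A](input: Seq[A], windowSize: Int, slide: Int): List[List[A]] = {
    val buffer = mutable.Queue.empty[A]
    val out = List.newBuilder[List[A]]
    var sinceTrigger = 0
    for (elem <- input) {
      buffer.enqueue(elem)
      if (buffer.size > windowSize) buffer.dequeue() // eviction policy
      sinceTrigger += 1
      if (sinceTrigger == slide) {                   // trigger policy
        out += buffer.toList
        sinceTrigger = 0
      }
    }
    out.result()
  }
}
```

For the elements 1 to 6 with a window of 4 and a trigger every 2 elements, this emits (1, 2), then (1, 2, 3, 4), then (3, 4, 5, 6). The first window is partial because fewer than 4 elements have arrived when the trigger first fires.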
Word count in Batch and Streaming
case class Word (word: String, frequency: Int)

DataSet API (batch):
    val lines: DataSet[String] = env.readTextFile(...)
    lines.flatMap {line => line.split(" ")
        .map(word => Word(word,1))}
      .groupBy("word").sum("frequency")
      .print()

DataStream API (streaming):
    val lines: DataStream[String] = env.fromSocketStream(...)
    lines.flatMap {line => line.split(" ")
        .map(word => Word(word,1))}
      .keyBy("word")
      .window(Time.of(5,SECONDS)).every(Time.of(1,SECONDS))
      .sum("frequency")
      .print()
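The batch and streaming variants differ only in which records contribute to a count. A minimal plain-Scala sketch without Flink can show that: `WordCountSketch`, `batchCount`, and `windowCount` are hypothetical names, and the per-line timestamps in seconds are an assumption added for illustration.

```scala
object WordCountSketch {
  case class Word(word: String, frequency: Int)

  // Batch: one count per word over the whole bounded input.
  def batchCount(lines: Seq[String]): Map[String, Int] =
    lines.flatMap(_.split(" ")).filter(_.nonEmpty)
      .map(w => Word(w, 1))
      .groupBy(_.word)
      .map { case (w, ws) => w -> ws.map(_.frequency).sum }

  // Streaming: the same count, restricted to lines whose timestamp falls
  // in the last `sizeSec` seconds before `now` (one window evaluation).
  def windowCount(lines: Seq[(Long, String)], now: Long, sizeSec: Long): Map[String, Int] =
    batchCount(lines.collect {
      case (ts, line) if ts > now - sizeSec && ts <= now => line
    })
}
```

Evaluating `windowCount` once per second with `sizeSec = 5` would mimic a 5-second window that slides every second.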
Flexible windows
More at: http://flink.apache.org/news/2015/02/09/streaming-example.html
• Stream of stocks
• Trigger warning if price fluctuates by 5%
• Count the number of warnings per stock in 30 second (tumbling) window
• Do it continuously
[Figure: stream types along the pipeline: Data Stream, Keyed Stream, Windowed Stream, Keyed Stream, Windowed Stream]
[Figure: pipeline: Stock Stream, keyBy symbol, Delta 5% of price, Warning, keyBy symbol, Count over 30 sec window, Sum]
Flexible windows
Use delta policy to create change warnings:
    case class Count(symbol: String, count: Int)
    val defaultPrice = StockPrice("", 1000)
    val priceWarnings = stockStream.keyBy("symbol")
      .window(Delta.of(0.05, priceChange, defaultPrice))
      .mapWindow(sendWarning _)

Count number of warnings per stock every half a minute:
    val warningPerStock = priceWarnings.flatten()
      .map(Count(_, 1))
      .keyBy("symbol")
      .window(Time.of(30, SECONDS))
      .sum("count")
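The delta policy can be illustrated with a small plain-Scala sketch for a single stock. It assumes the policy closes a window whenever the distance between the current element and the last trigger point exceeds the threshold; Flink's exact semantics may differ. `DeltaWindowSketch`, `deltaWindows`, and this particular `priceChange` definition are hypothetical.

```scala
object DeltaWindowSketch {
  case class StockPrice(symbol: String, price: Double)

  // Relative price change between a reference and a new observation.
  def priceChange(reference: StockPrice, current: StockPrice): Double =
    math.abs(current.price - reference.price) / reference.price

  // Close the current window each time the price moves more than
  // `threshold` away from the reference, then reset the reference.
  def deltaWindows(prices: Seq[StockPrice], threshold: Double,
                   start: StockPrice): List[List[StockPrice]] = {
    var reference = start
    var current = List.empty[StockPrice]
    val out = List.newBuilder[List[StockPrice]]
    for (p <- prices) {
      current = current :+ p
      if (priceChange(reference, p) > threshold) {
        out += current   // this window would produce one warning
        current = Nil
        reference = p
      }
    }
    out.result()
  }
}
```

For prices 100, 102, 106, 107, 120 with a start price of 100 and a 5% threshold, the sketch closes two windows: one at 106 and one at 120. Each closed window corresponds to one warning that the downstream 30-second count would tally.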
Iterative stream processing
Motivation
• Many applications require cyclic streams
• Machine learning applications (parallel model training, evaluation)

Iterations in Flink Streaming
• Native support for cyclic dataflows
• Integrated with functional API
• High performance and expressivity

[Figure: cyclic dataflow with Input, Train, and Evaluate stages]
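A cyclic dataflow can be pictured as an operator whose output is split into a feedback edge and a forward edge. The following plain-Scala sketch models that with a work queue; it is not Flink's iteration API, and `IterationSketch` and `iterate` are hypothetical names.

```scala
import scala.collection.mutable

object IterationSketch {
  // `step` returns Left(x) to send x back into the loop (feedback edge)
  // or Right(y) to emit y downstream (forward edge).
  def iterate[A, B](input: Seq[A])(step: A => Either[A, B]): List[B] = {
    val queue = mutable.Queue.empty[A]
    queue ++= input
    val out = List.newBuilder[B]
    while (queue.nonEmpty) {
      step(queue.dequeue()) match {
        case Left(again) => queue.enqueue(again) // re-enters the loop
        case Right(done) => out += done          // leaves the loop
      }
    }
    out.result()
  }
}
```

For example, `IterationSketch.iterate[Int, Int](Seq(10, 3, 1))(n => if (n > 1) Left(n / 2) else Right(n))` keeps halving each value until it reaches 1. In a training loop, the feedback edge would carry model updates instead.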
Fault tolerance
Exactly-once processing for operator state
• Based on consistent global snapshots
• Low runtime overhead, stateful exactly-once semantics
Checkpointing / Recovery
Detailed algorithm: Lightweight Asynchronous Snapshots for Distributed Dataflows
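The idea of recovering from a snapshot can be shown with a deliberately simplified, single-operator sketch in plain Scala. It is not the distributed snapshot algorithm referenced above: it only pairs the operator state with the input offset that state reflects, so that replay after a failure counts each record once. All names (`CheckpointSketch`, `Checkpoint`, `run`) are hypothetical.

```scala
object CheckpointSketch {
  // A checkpoint pairs the operator state (a running sum) with the
  // input position that the state reflects.
  case class Checkpoint(offset: Int, sum: Long)

  // Process records from `from.offset` up to (excluding) `until`,
  // taking a checkpoint after every `interval` records.
  // Returns the final state and the most recent checkpoint.
  def run(input: Vector[Int], from: Checkpoint, until: Int,
          interval: Int): (Long, Checkpoint) = {
    var sum = from.sum
    var last = from
    for (i <- from.offset until until) {
      sum += input(i)
      if ((i + 1) % interval == 0) last = Checkpoint(i + 1, sum)
    }
    (sum, last)
  }
}
```

With the input 1 to 10 and a checkpoint every 3 records, a run that fails after 8 records leaves `Checkpoint(6, 21)` as its latest snapshot. Restarting from that checkpoint and processing the remaining records gives a final sum of 55, the same as a failure-free run.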
Fault tolerance
• Check-pointing and recovery of operator state is very fast
  • Data processing does not block
• Executions based on CPU/operator time are not idempotent
• Other execution modes are based on timestamps of input streams (Event/Ingress time)
  • Allows idempotent executions
  • End-to-End exactly-once semantics
  • In Flink version 0.10
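Why timestamp-based execution is idempotent while operator-time execution is not can be seen in a small plain-Scala sketch. It assumes tumbling windows and a per-window sum; `EventTimeSketch`, `Event`, and both functions are hypothetical, and the arrival times are invented for illustration.

```scala
object EventTimeSketch {
  case class Event(timestampMs: Long, value: Int)

  private def windowStart(timeMs: Long, windowMs: Long): Long =
    timeMs / windowMs * windowMs

  // Event time: each event is assigned to a window by its own timestamp,
  // so the result does not depend on when or in what order it arrives.
  def eventTimeSums(events: Seq[Event], windowMs: Long): Map[Long, Int] =
    events.groupBy(e => windowStart(e.timestampMs, windowMs))
      .map { case (start, es) => start -> es.map(_.value).sum }

  // Operator time: each event is assigned by its arrival time, so a
  // replay after recovery can place it in a different window.
  def processingTimeSums(arrivals: Seq[(Long, Event)], windowMs: Long): Map[Long, Int] =
    arrivals.groupBy { case (arrivalMs, _) => windowStart(arrivalMs, windowMs) }
      .map { case (start, as) => start -> as.map(_._2.value).sum }
}
```

Running `eventTimeSums` on the same events in any order gives the same map, whereas `processingTimeSums` gives different maps for a live run and a later replay of the same events.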
Streaming in Apache Flink
• True streaming over stateful distributed dataflow engine
• Expressive Streaming API in Java/Scala
  • Flexible window semantics
  • Iterative computation
• Low streaming latency, exactly-once semantics depending on execution mode, and low overhead for recovery
Special Thanks to
Gyula Fora, SICS
Paris Carbone, KTH
Kostas Tzoumas, Data Artisans
Stephan Ewen, Data Artisans
Volker Markl, TU-Berlin