Streaming in the Wild with Apache Flink

Kostas Tzoumas @kostas_tzoumas Hadoop Summit San Jose June 6, 2016 Streaming in the Wild with Apache Flink TM

Transcript of Streaming in the Wild with Apache Flink

Page 1: Streaming in the Wild with Apache Flink

Kostas Tzoumas @kostas_tzoumas

Hadoop Summit San Jose, June 6, 2016

Streaming in the Wild with Apache Flink™

Page 2: Streaming in the Wild with Apache Flink

Streaming technology is enabling the obvious: continuous processing on data that is continuously produced

Hint: you are already doing streaming

Page 3: Streaming in the Wild with Apache Flink

Why embrace streaming?

Monitor your business and react in real time

Implement robust continuous applications

Adopt a decentralized architecture

Consolidate analytics infrastructure

Page 4: Streaming in the Wild with Apache Flink

React in real time


Page 5: Streaming in the Wild with Apache Flink

Streaming versus real-time

Streaming != Real-time

E.g., streaming that is not real time: continuous applications with large windows

E.g., real-time that is not streaming: very fast data warehousing queries

However: streaming applications can be fast

[Figure labels: Streaming; Real time]

Page 6: Streaming in the Wild with Apache Flink

How real-time is Flink?

Yahoo! benchmark*

data Artisans benchmarks**

* https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at

** http://data-artisans.com/extending-the-yahoo-streaming-benchmark/ and http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/

Page 7: Streaming in the Wild with Apache Flink

When and why does this matter?

Immediate reaction to life

• E.g., generate alerts on anomaly/pattern/special event

Avoid unnecessary tradeoffs

• Even if application is not latency-critical

• With Flink you do not pay a price for latency!

Page 8: Streaming in the Wild with Apache Flink

Bouygues Telecom – LUX


One of the largest telcos in France. System (among others) used for real time diagnostics and alarming.

Read more: http://data-artisans.com/flink-at-bouygues-html/

Page 9: Streaming in the Wild with Apache Flink

Robust continuous applications


Page 10: Streaming in the Wild with Apache Flink

Continuous application

A production data application that needs to be live 24/7, feeding other systems (perhaps customer-facing)

Needs to be efficient, consistent, correct, and manageable

Stream processing is a great way to implement continuous applications robustly

Page 11: Streaming in the Wild with Apache Flink

Continuous apps with “batch”

[Figure labels: file 1, file 2, file 3 along a time axis; Job 1, Job 2, Job 3; Scheduler; Serve & store]

Page 12: Streaming in the Wild with Apache Flink

Continuous apps with “lambda”

[Figure labels: file 1, file 2; Job 1, Job 2; Scheduler; Streaming job; Serve & store]

Page 13: Streaming in the Wild with Apache Flink

Problems with batch and λ

Way too many moving parts (and code duplication)

Implicit treatment of time

Out of order event handling

Implicit batch boundaries
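The effect of implicit batch boundaries can be illustrated with a small Python sketch. It assumes hypothetical hourly files of event timestamps, one of which contains an event that arrived late; none of the names or numbers come from the talk, and this is plain Python rather than Flink code.

```python
from collections import Counter

HOUR_MS = 3_600_000

# Hypothetical hourly files: each holds the event timestamps (ms) that ARRIVED in that hour.
# The event stamped 3_599_000 happened in hour 0 but arrived late, so it landed in "file 2".
files = {
    "file 1": [10_000, 1_800_000],
    "file 2": [3_599_000, 3_700_000],
}

# Batch view: one job per file, so the file boundary silently defines the hour.
counts_per_file = {name: len(timestamps) for name, timestamps in files.items()}

# Event-time view: the hour is derived from each event's own timestamp.
counts_per_hour = Counter(
    ts // HOUR_MS for timestamps in files.values() for ts in timestamps
)

print(counts_per_file)
print(dict(counts_per_hour))
```

The per-file counts attribute the late event to the wrong hour, while the event-time counts do not depend on where the file boundary happened to fall.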

Page 14: Streaming in the Wild with Apache Flink

Continuous apps with streaming

[Figure labels: Streaming job; Serve & store]

Page 15: Streaming in the Wild with Apache Flink

Extending the Yahoo! benchmark

Work of Jamie Grier, inspired by a real continuous application at Twitter

http://data-artisans.com/extending-the-yahoo-streaming-benchmark/

Page 16: Streaming in the Wild with Apache Flink

What is the use case?

Counting!

• Tweet impressions or ad views

Most analytics is continuous counting and aggregations grouped by dimensions

• E.g., anomaly detection

Page 17: Streaming in the Wild with Apache Flink

Requirements

Performance: millions of events/sec, millions of keys

Correctness: counts correlated with timestamps

Consistency: counts should be correct under failures

Manageability: ability to pause & restart, reprocess, change code, etc.
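As a rough illustration of "counts correlated with timestamps", the following Python sketch counts events per key and per tumbling window, using the timestamp carried by each event. The one-minute window size, the `ad-*` keys, and the function names are assumptions made for the example; it does not use the Flink API.

```python
from collections import defaultdict

WINDOW_MS = 60_000  # hypothetical 1-minute tumbling window


def window_start(timestamp_ms, size_ms=WINDOW_MS):
    """Return the start of the tumbling window that contains timestamp_ms."""
    return timestamp_ms - (timestamp_ms % size_ms)


def count_by_key_and_window(events, size_ms=WINDOW_MS):
    """Count events per (key, window start), based on each event's own timestamp."""
    counts = defaultdict(int)
    for key, timestamp_ms in events:
        counts[(key, window_start(timestamp_ms, size_ms))] += 1
    return dict(counts)


# Hypothetical (ad_id, event_time_ms) pairs; the last event arrives out of order.
events = [("ad-1", 5_000), ("ad-2", 20_000), ("ad-1", 61_000), ("ad-1", 59_000)]
counts = count_by_key_and_window(events)
print(counts)
```

Because the window is derived from the event timestamp, the out-of-order event is still counted in the first minute. A real stream processor additionally has to decide when a window is complete, which this sketch leaves out.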

Page 18: Streaming in the Wild with Apache Flink

Before Flink

Performance: 1000s of cores needed to sustain workload

Correctness: time handled in application code (or not)

Consistency: approximate results during the day, exact results once a day (lambda)

Manageability: acceptable

Page 19: Streaming in the Wild with Apache Flink

After Flink

Performance: 10s of cores needed to sustain workload

Correctness: time handled by framework

Consistency: correct results on demand

Manageability: acceptable

Page 20: Streaming in the Wild with Apache Flink

Results (yet to be beaten!)

Same program as Yahoo! benchmark

30x over Storm, plus consistent results

Page 21: Streaming in the Wild with Apache Flink

Manageability

Flink savepoints (Flink 1.0): consistent snapshots of stateful applications

• Planned downtime for code upgrades, maintenance, migration, debugging, etc.

Monitoring (Flink 1.1)

Dynamic scaling (Flink 1.2+)
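The idea behind a consistent snapshot of a stateful application can be sketched in a few lines of Python. This is a toy model, not Flink's savepoint mechanism: the `CountingJob` class, its methods, and the list-based source are all hypothetical. The point it tries to show is that the snapshot captures the application state together with the input position, so a restarted job continues without losing or double-counting records.

```python
import copy


class CountingJob:
    """Toy stateful job: counts keys read from a replayable, list-based source."""

    def __init__(self, source):
        self.source = source  # replayable input (stands in for a log)
        self.offset = 0       # position of the next record to read
        self.counts = {}      # application state

    def run(self, max_records):
        """Process up to max_records records, starting at the current offset."""
        end = min(self.offset + max_records, len(self.source))
        for key in self.source[self.offset:end]:
            self.counts[key] = self.counts.get(key, 0) + 1
        self.offset = end

    def savepoint(self):
        """Capture state and input position together."""
        return {"offset": self.offset, "counts": copy.deepcopy(self.counts)}

    @classmethod
    def restore(cls, source, snapshot):
        """Create a job that resumes from a previously taken snapshot."""
        job = cls(source)
        job.offset = snapshot["offset"]
        job.counts = copy.deepcopy(snapshot["counts"])
        return job


source = ["a", "b", "a", "c", "a", "b"]

job = CountingJob(source)
job.run(max_records=4)
snapshot = job.savepoint()  # taken before planned downtime

upgraded = CountingJob.restore(source, snapshot)  # e.g. after a code upgrade
upgraded.run(max_records=len(source))

uninterrupted = CountingJob(source)
uninterrupted.run(max_records=len(source))

print(upgraded.counts == uninterrupted.counts)
```

The restored job ends with the same counts as a job that never stopped, which is the property that makes planned downtime safe in this model.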

Page 22: Streaming in the Wild with Apache Flink


Decentralized architecture

Page 23: Streaming in the Wild with Apache Flink

Streaming and microservices

[Figure labels: App (x3); local state (x2); Archive]

A decentralized architecture favors a streaming-based data infrastructure with local application state
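A minimal Python sketch of this idea follows, assuming a shared list that stands in for the event stream and two hypothetical services that each derive their own local state from it. The `App` class, the event fields, and the update functions are invented for illustration.

```python
class App:
    """Toy service that builds its own local state from a shared event stream."""

    def __init__(self, name, update):
        self.name = name
        self.update = update  # function (state, event) -> None
        self.state = {}       # local state, owned by this app only
        self.position = 0     # how far this app has read in the stream

    def poll(self, stream):
        """Apply all events this app has not seen yet to its local state."""
        for event in stream[self.position:]:
            self.update(self.state, event)
        self.position = len(stream)


def count_orders(state, event):
    if event["type"] == "order":
        state["orders"] = state.get("orders", 0) + 1


def track_revenue(state, event):
    if event["type"] == "order":
        state["revenue"] = state.get("revenue", 0) + event["amount"]


# Hypothetical shared stream of events.
stream = [
    {"type": "order", "amount": 30},
    {"type": "click"},
    {"type": "order", "amount": 12},
]

orders_app = App("orders", count_orders)
revenue_app = App("revenue", track_revenue)
for app in (orders_app, revenue_app):
    app.poll(stream)

print(orders_app.state, revenue_app.state)
```

Neither app reads the other's state; both depend only on the stream, so each can be deployed, restarted, or rebuilt from the stream independently.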

Page 24: Streaming in the Wild with Apache Flink

Zalando


Slides at http://www.slideshare.net/ZalandoTech/flink-in-zalandos-world-of-microservices-62376341

Page 25: Streaming in the Wild with Apache Flink

Zalando

Transitioning from monolithic architecture to microservices

Page 26: Streaming in the Wild with Apache Flink

New BI stack


Page 27: Streaming in the Wild with Apache Flink

Flink @ Zalando (present & future)

Business process monitoring

• Check if Zalando platform works

• Order & delivery velocities

• SLAs of related events

Continuous ETL

• Transformation, combination, pre-aggregation

• Data cleansing and validation

Complex Event Processing

Sales monitoring

Page 28: Streaming in the Wild with Apache Flink

Consolidate analytics


Page 29: Streaming in the Wild with Apache Flink

Stream Processing as a Service

How do we make stream processing more accessible to the data analyst?

More familiar interfaces

• Flink 1.1 includes the first version of SQL for static data sets and data streams

Easier deployment

Page 30: Streaming in the Wild with Apache Flink

King.com


Page 31: Streaming in the Wild with Apache Flink

King.com - RBEA

RBEA – a platform designed to make stream processing available inside King.com

Data scientists submit scripts in Groovy

Flink backend executes these scripts

https://techblog.king.com/rbea-scalable-real-time-analytics-king/

Page 32: Streaming in the Wild with Apache Flink

Netflix

Netflix plans to offer Stream Processing as a Service internally in the company

Currently testing Flink and Apache Beam

http://www.slideshare.net/mdaxini/netflix-keystone-streaming-data-pipeline-scale-in-the-clouddbtb2016-62076009

Page 33: Streaming in the Wild with Apache Flink

Closing


Page 34: Streaming in the Wild with Apache Flink

Disclaimer

A lot of this presentation is based on the work of very talented engineers building data products with Flink

Bouygues Telecom: Amine Abdessemed, ...

Zalando: Mihail Vieru, Javier Lopez

King.com: Gyula Fora, Mattias Andersson, ...

Netflix: Monal Daxini, ...

Page 35: Streaming in the Wild with Apache Flink

More Flink tales at Hadoop Summit

Xiaowei Jiang: Blink - Improved Runtime for Flink and its Application in Alibaba Search. Wednesday, June 29, 2016, 2:10PM - 2:50PM, 210C

Stephan Ewen: Turning the Stream Processor into a Database: Building Online Applications on Streams. Thursday, June 30, 2016, 12:20PM - 1:00PM, 212

Page 36: Streaming in the Wild with Apache Flink

Flink Forward 2016, Berlin

Submission deadline: June 30, 2016 (watch website)

Early bird deadline: July 15, 2016

www.flink-forward.org

Page 37: Streaming in the Wild with Apache Flink

We are hiring!

data-artisans.com/careers

Page 38: Streaming in the Wild with Apache Flink

Appendix

Page 39: Streaming in the Wild with Apache Flink

Batch < Streaming

In principle, batch is a special case of streaming (global window)

In practice, batch processors can be more efficient than stream processors on batch workloads

Flink is a very efficient batch processor (DataSet code path)
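The "batch is a special case of streaming" point can be illustrated with a short Python sketch, assuming a hypothetical `windowed_sum` helper in which a missing window size means a single global window. This is plain Python, not Flink's DataSet or DataStream API.

```python
def windowed_sum(events, window_size=None):
    """Sum values per window; window_size=None puts everything in one global window."""
    sums = {}
    for timestamp, value in events:
        window = 0 if window_size is None else timestamp // window_size
        sums[window] = sums.get(window, 0) + value
    return sums


# Hypothetical (timestamp, value) pairs.
events = [(1, 10), (12, 5), (25, 7)]

streaming_result = windowed_sum(events, window_size=10)  # one sum per window
global_result = windowed_sum(events)                     # single global window
batch_result = sum(value for _, value in events)         # ordinary batch aggregate

print(streaming_result, global_result, batch_result)
```

With a single global window, the windowed computation yields the same number as the ordinary batch aggregate over the complete input.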