Debunking Common Myths in Stream Processing


Transcript of Debunking Common Myths in Stream Processing

Page 1: Debunking Common Myths in Stream Processing

1

Kostas Tzoumas @kostas_tzoumas

Big Data Ldn, November 4, 2016

Stream Processing with Apache Flink®

Page 2: Debunking Common Myths in Stream Processing

2

Kostas Tzoumas @kostas_tzoumas

Big Data Ldn, November 4, 2016

Debunking Some Common Myths in Stream Processing

Page 3: Debunking Common Myths in Stream Processing

3

Original creators of Apache Flink®

Providers of the dA Platform, a supported Flink distribution

Page 4: Debunking Common Myths in Stream Processing

Outline
• What is data streaming
• Myth 1: The throughput/latency tradeoff
• Myth 2: Exactly once not possible
• Myth 3: Streaming is for (near) real-time
• Myth 4: Streaming is hard

4

Page 5: Debunking Common Myths in Stream Processing

The streaming architecture

5

Page 6: Debunking Common Myths in Stream Processing

6

Reconsideration of data architecture

• Better app isolation
• More real-time reaction to events
• Robust continuous applications
• Process both real-time and historical data

Page 7: Debunking Common Myths in Stream Processing

7

[Diagram: several applications, each with its own app state, connected through an event log and exposed via a query service]

Page 8: Debunking Common Myths in Stream Processing

What is (distributed) streaming?
Computations on never-ending “streams” of data records (“events”)
Stream processor distributes the computation in a cluster

8

[Diagram: parallel instances of “Your code” running across the cluster]
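As a rough illustration (not from the talk), here is a minimal Flink DataStream job in Java; the host, port, and parallelism are assumptions, and the point is simply that the same user code is deployed as parallel instances across the cluster:

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class DistributedStreamingSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(4);  // "your code" runs as 4 parallel instances

        // A never-ending stream of records ("events"); here, text lines from a socket
        DataStream<String> events = env.socketTextStream("localhost", 9999);

        events
            .map(new MapFunction<String, String>() {
                @Override
                public String map(String line) {
                    // A stateless transformation, executed on every parallel instance
                    return line.toUpperCase();
                }
            })
            .print();

        env.execute("distributed streaming sketch");
    }
}
```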

Page 9: Debunking Common Myths in Stream Processing

What is stateful streaming?
Computation and state
• E.g., counters, windows of past events, state machines, trained ML models
The result depends on the history of the stream
A stateful stream processor gives you the tools to manage state
• Recover, roll back, version, upgrade, etc.

9

[Diagram: “Your code” paired with locally managed state]
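As a hedged sketch (not from the talk), the following Flink Java snippet keeps a per-key counter in Flink-managed state; the class name, state name, and types are illustrative. Applied to a keyed stream (e.g., events.keyBy(...).flatMap(new CountPerKey())), the state is checkpointed and recovered automatically after failures:

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Counts events per key; the current count lives in Flink-managed, checkpointed state
public class CountPerKey extends RichFlatMapFunction<String, Long> {

    private transient ValueState<Long> count;

    @Override
    public void open(Configuration parameters) {
        count = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Long.class, 0L));
    }

    @Override
    public void flatMap(String event, Collector<Long> out) throws Exception {
        long updated = count.value() + 1;  // the result depends on the stream's history
        count.update(updated);
        out.collect(updated);
    }
}
```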

Page 10: Debunking Common Myths in Stream Processing

What is event-time streaming?
Data records associated with timestamps (time series data)
Processing depends on timestamps
Event-time stream processor gives you the tools to reason about time
• E.g., handle streams that are out of order
• Core feature is watermarks – a clock to measure event time
10

[Diagram: out-of-order events (t3, t1, t2, t4) flowing into “Your code” with state, grouped into event-time ranges t1–t2 and t3–t4]
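A hedged sketch (not from the talk) of event-time processing in Flink's Java DataStream API: timestamps are extracted from the records, and watermarks act as the event-time clock, so out-of-order events still land in the right one-minute windows. The sample events, the 10-second out-of-orderness bound, and the aggregation are assumptions for illustration:

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

public class EventTimeSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        // Events modeled as (key, event timestamp in milliseconds); note they arrive out of order
        DataStream<Tuple2<String, Long>> events = env.fromElements(
                Tuple2.of("a", 3_000L), Tuple2.of("a", 1_000L), Tuple2.of("b", 2_000L));

        events
            // Watermarks are the event-time clock: allow events up to 10 seconds out of order
            .assignTimestampsAndWatermarks(
                new BoundedOutOfOrdernessTimestampExtractor<Tuple2<String, Long>>(Time.seconds(10)) {
                    @Override
                    public long extractTimestamp(Tuple2<String, Long> event) {
                        return event.f1;  // the timestamp carried in the record, not the arrival time
                    }
                })
            .keyBy(0)                     // key by the first tuple field
            .timeWindow(Time.minutes(1))  // windows close when the watermark passes their end
            .max(1)                       // latest event timestamp per key and window
            .print();

        env.execute("event-time sketch");
    }
}
```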

Page 11: Debunking Common Myths in Stream Processing

What is streaming?
Continuous processing on data that is continuously generated
I.e., pretty much all “big” data
It’s all about state and time
11

Page 12: Debunking Common Myths in Stream Processing

Debunking some common stream processing myths

12

Page 13: Debunking Common Myths in Stream Processing

Myth 1: Throughput/latency tradeoff
Myth 1: you need to choose between high throughput and low latency
Physical limits
• In reality, the network determines both the achievable throughput and latency
• A well-engineered system achieves these limits

13

Page 14: Debunking Common Myths in Stream Processing

Flink performance
10s of millions of events per second on 10s of nodes
Scales to 1000s of nodes with latency in the single-digit milliseconds

14

Page 15: Debunking Common Myths in Stream Processing

15

Myth 2: Exactly once not possible
Exactly once: under failures, the system computes the result as if there was no failure
In contrast to:
• At most once: no guarantees
• At least once: duplicates possible
Exactly once state versus exactly once delivery
Myth 2: Exactly once state not possible/too costly

Page 16: Debunking Common Myths in Stream Processing

Transactions
“Exactly once” is transactions: either all actions succeed or none succeed

Transactions are possible

Transactions are useful

Let’s not start eventual consistency all over again…

16

Page 17: Debunking Common Myths in Stream Processing

Flink checkpoints
Periodic, asynchronous, consistent snapshots of application state

Provide exactly-once state guarantees under failures

17
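A minimal sketch (not from the talk) of enabling Flink's periodic checkpoints in Java; the 60-second interval, the trivial pipeline, and the HDFS checkpoint path are assumptions for illustration:

```java
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointingSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Snapshot all operator state asynchronously every 60 seconds
        env.enableCheckpointing(60_000);
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);

        // Where the snapshots are stored; the path is illustrative
        env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"));

        // A trivial pipeline so the job can run; on failure, state is restored
        // from the latest completed checkpoint
        env.socketTextStream("localhost", 9999).print();

        env.execute("checkpointed job");
    }
}
```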

Page 18: Debunking Common Myths in Stream Processing

End-to-end exactly once
Checkpoints double as transaction coordination mechanism

Source and sink operators can take part in checkpoints

Exactly once internally, "effectively once" end to end: e.g., Flink + Cassandra with idempotent updates

18

transactional sinks
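As a hedged illustration (not from the talk, and not the bundled Flink Cassandra connector), the sink below shows the “effectively once” pattern the slide mentions: Cassandra INSERTs are upserts on the primary key, so records replayed after a failure overwrite the same row instead of duplicating it. The keyspace, table, and record type are assumptions:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

// Writes (id, value) records to Cassandra idempotently
public class IdempotentCassandraSink extends RichSinkFunction<Tuple2<String, Long>> {

    private transient Cluster cluster;
    private transient Session session;

    @Override
    public void open(Configuration parameters) {
        cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        session = cluster.connect("analytics");  // keyspace name is illustrative
    }

    @Override
    public void invoke(Tuple2<String, Long> record) {
        // Deterministic primary key => replays overwrite the same row (idempotent)
        session.execute("INSERT INTO counts (id, value) VALUES (?, ?)", record.f0, record.f1);
    }

    @Override
    public void close() {
        if (session != null) session.close();
        if (cluster != null) cluster.close();
    }
}
```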

Page 19: Debunking Common Myths in Stream Processing

State management
Checkpoints triple as state versioning mechanism (savepoints)

Go back and forth in time while maintaining state consistency

Ease code upgrades (Flink or app), maintenance, migration, debugging, what-if simulations, and A/B tests

19

Page 20: Debunking Common Myths in Stream Processing

Myth 3: Streaming and real time
Myth 3: streaming and real-time are synonymous
Streaming is a new model
• Essentially, state and time
• Low latency/real time is the icing on the cake
20

Page 21: Debunking Common Myths in Stream Processing

Low latency and high latency streams

21

[Diagram: hourly event timestamps from around 2016-3-11 10:00pm through 2016-3-12 3:00am, consumed either directly as a low-latency stream, or grouped into hourly partitions and processed as a batch (bounded stream), i.e. a high-latency stream]

Page 22: Debunking Common Myths in Stream Processing

Robust continuous applications

22

Page 23: Debunking Common Myths in Stream Processing

Accurate computation
Batch processing is not an accurate computation model for continuous data
• Misses the right concepts and primitives
• Time handling, state across batch boundaries
Stateful stream processing is a better model
• Real-time/low-latency is the icing on the cake

23

Page 24: Debunking Common Myths in Stream Processing

Myth 4: How hard is streaming?
Myth 4: streaming is too hard to learn

You are already doing streaming, just in an ad hoc way

Most data is unbounded, and the code changes slower than the data
• This is a streaming problem

24

Page 25: Debunking Common Myths in Stream Processing

It's about your data and code
What's the form of your data?
• Unbounded (e.g., clicks, sensors, logs), or
• Bounded (e.g., ???*)
What changes more often?
• My code changes faster than my data
• My data changes faster than my code

25

* Please help me find a great example of naturally bounded data

Page 26: Debunking Common Myths in Stream Processing

It's about your data and code
If your data changes faster than your code, you have a streaming problem
• You may be solving it with hourly batch jobs, depending on someone else to create the hourly batches

• You are probably living with inaccurate results without knowing it

26

Page 27: Debunking Common Myths in Stream Processing

It's about your data and code
If your code changes faster than your data, you have an exploration problem
• Using notebooks or other tools for quick data exploration is a good idea
• Once your code stabilizes, you will have a streaming problem, so you might as well think of it as such from the beginning
27

Page 28: Debunking Common Myths in Stream Processing

Flink in the real world

28

Page 29: Debunking Common Myths in Stream Processing

29

Flink community
> 240 contributors, 95 contributors in Flink 1.1

42 meetups around the world with > 15,000 members

2x-3x growth in 2015, similar in 2016

Page 30: Debunking Common Myths in Stream Processing

Powered by Flink

30

Zalando, one of the largest ecommerce companies in Europe, uses Flink for real-time business process monitoring.

King, the creators of Candy Crush Saga, uses Flink to provide data science teams with real-time analytics.

Bouygues Telecom uses Flink for real-time event processing over billions of Kafka messages per day.

Alibaba, the world's largest retailer, built a Flink-based system (Blink) to optimize search rankings in real time.

See more at flink.apache.org/poweredby.html

Page 31: Debunking Common Myths in Stream Processing

30 Flink applications in production for more than one year. 10 billion events (2TB) processed daily

Complex jobs of > 30 operators running 24/7, processing 30 billion events daily, maintaining state of 100s of GB with exactly-once guarantees

Largest job has > 20 operators, runs on > 5000 vCores in a 1000-node cluster, processes millions of events per second

31

Page 32: Debunking Common Myths in Stream Processing

32

Page 33: Debunking Common Myths in Stream Processing

Flink Forward 2016

Page 34: Debunking Common Myths in Stream Processing

Current work in Flink

34

Page 35: Debunking Common Myths in Stream Processing

Ongoing Flink development

35

[Diagram: mind map of ongoing development areas, including:]
• Connectors
• Session windows
• (Stream) SQL
• Library enhancements
• Metric system
• Operations
• Ecosystem
• Application features
• Metrics & visualization
• Dynamic scaling
• Savepoint compatibility
• Checkpoints to savepoints
• More connectors
• Stream SQL
• Windows
• Large state maintenance
• Fine-grained recovery
• Side in-/outputs
• Window DSL
• Broader audience
• Security
• Mesos & others
• Dynamic resource management
• Authentication
• Queryable State

Page 36: Debunking Common Myths in Stream Processing

A longer-term vision for Flink

36

Page 37: Debunking Common Myths in Stream Processing

37

Streaming use cases
Application → Technology
• (Near) real-time apps → Low-latency streaming
• Continuous apps → High-latency streaming
• Analytics on historical data → Batch as special case of streaming
• Request/response apps → Large queryable state

Page 38: Debunking Common Myths in Stream Processing

Request/response applications
Queryable state: query Flink state directly instead of pushing results into a database
Large state support and a query API are coming in Flink

38


Page 39: Debunking Common Myths in Stream Processing

In summary
The need for streaming comes from a rethinking of data infra architecture
• Stream processing then just becomes natural
Debunking 4 common myths
• Myth 1: The throughput/latency tradeoff
• Myth 2: Exactly once not possible
• Myth 3: Streaming is for (near) real-time
• Myth 4: Streaming is hard

39

Page 40: Debunking Common Myths in Stream Processing

40

Thank you! @kostas_tzoumas @ApacheFlink @dataArtisans

Page 41: Debunking Common Myths in Stream Processing

41

We are hiring!

data-artisans.com/careers