Apache Flink – Distributed Stream Processing

26
Apache Flink Distributed Data Processing By Kevin Jacobs - 2016/08/12

Transcript of Apache Flink – Distributed Stream Processing

Page 1: Apache Flink – Distributed Stream Processing

Apache FlinkDistributed Data ProcessingBy Kevin Jacobs - 2016/08/12

Page 2: Apache Flink – Distributed Stream Processing

Who am I?

● Kevin Jacobs● Bachelor Computer Science● Currently: Master Data Science

2

Page 3: Apache Flink – Distributed Stream Processing

Distributed Data Processing

● Significant growth of data● Single computing node not sufficient● Scalable solutions● Processing of data streams

3

Page 4: Apache Flink – Distributed Stream Processing

Streams & Batches

Streams

● Infinite amount of data

Batches

● Finite amount of data

4

Page 5: Apache Flink – Distributed Stream Processing

Data Processing Frameworks

● Apache Spark: ”Lightning-fast cluster computing”● Apache Flink: ”Scalable Batch and Stream Data Processing”

May 2014v1.0.0

December 2011v1.0.0

March 2016v1.0.0

5

Page 6: Apache Flink – Distributed Stream Processing

Flink versus Spark

● Based on data streams● Definition of windows (time,

counting, ...)

● Based on data batches● Batches based on time● Large community

○ More available libraries

6

Page 7: Apache Flink – Distributed Stream Processing

Functionality - Comparison

7

● Apache Spark○ Limited support for windows

● Functionality of Apache Sparkis a subset of functionality ofApache Flink

Page 8: Apache Flink – Distributed Stream Processing

Functionality - What is missing?

8

● Metadata driven triggers○ Useful for filtering duplicates

Page 9: Apache Flink – Distributed Stream Processing

My Project

● Is Apache Flink better than Apache Spark?○ Stream Enrichment○ Duplicate Filtering

9

Page 10: Apache Flink – Distributed Stream Processing

Stream Enrichment

Stream A + Stream B = Stream C Stateless Stateful

10

...(401, “example-a.org”, “CH”)(405, “example-b.org”, “NL”)(406, “example-c.org”, “DE”)(408, “example-d.org”, “CH”)...

...(“CH”, 1)(“NL”, 2)(“DE”, 3)...

...(401, “example-a.org”, “CH”, 1)(405, “example-b.org”, “NL”, 2)(406, “example-c.org”, “DE”, 3)(408, “example-d.org”, “CH”, 1)...

Page 11: Apache Flink – Distributed Stream Processing

Filtering duplicates

Stream A → Stream B Stateful

11

...(401, “example-a.org”, “CH”)(405, “example-b.org”, “NL”)(418, “example-a.org”, “CH”)(406, “example-c.org”, “DE”)...

...(401, “example-a.org”, “CH”)(405, “example-b.org”, “NL”)(406, “example-c.org”, “DE”)...

Page 12: Apache Flink – Distributed Stream Processing

Stateful Stream Processing

Stream A [Stateless] + Stream B [Stateful] = Stream C

Stream A [Stateful] → Stream B

● Distributed state:○ Distributed memory

○ Iterate over elements in state

● Implement the current Spark code in Flink by using distributed memory

12

Page 13: Apache Flink – Distributed Stream Processing

Issues - Distributed Memory

What if…

● The memory is full?○ MemoryStateBackend○ FsStateBackend○ ...

13

Page 14: Apache Flink – Distributed Stream Processing

Approach

Is Flink better than Spark?

● Compare Batch API of Flink with Batch API of Spark○ Benchmark a sorting algorithm on a fixed-size file

● Compare Stream API of Flink with Stream API of Spark○ Benchmark the latency in streaming jobs

14

Page 15: Apache Flink – Distributed Stream Processing

Resources

● Cluster consisting of 11 nodes○ Memory: 77GB○ 44 CPUs (4 per node)○ 2.5TB HDFS storage

15

Page 16: Apache Flink – Distributed Stream Processing

Issues - Cluster

● Unpredictable resource usage of the cluster and many other processes running

○ Fixing a small amount of memory for the jobs to prevent memory errors■ 10 executors, 1GB each

○ Run the experiments a few times to filter out the noise

16

Page 17: Apache Flink – Distributed Stream Processing

TeraSort

● Task: Sort a fixed-size file (100GB)

● Input: HDFS

● Output: HDFS

17

Page 18: Apache Flink – Distributed Stream Processing

TeraSort - Results (1/3)

18

Page 19: Apache Flink – Distributed Stream Processing

TeraSort - Results (2/3)

19

Page 20: Apache Flink – Distributed Stream Processing

TeraSort - Results (3/3)

● Execution time:Apache Flink ≈ ½ Apache Spark

● Apache Spark has higher standarddeviation for larger jobs

20

Page 21: Apache Flink – Distributed Stream Processing

Measuring Latency - Setup (1/2)

● Generate random strings of bits of a fixed length and append the creation time to the bitstring:

○ Example with size = 2■ 10 1470935218352753297■ 00 1470935218353602960■ 11 1470935218354432744■ 00 1470935218355357746■ 01 1470935218356186778■ 10 1470935218357042980

■ 00 1470935218357868268

21

Page 22: Apache Flink – Distributed Stream Processing

Measuring Latency - Setup (2/2)

● Perform stateful operations which keeps unique bit strings into memory○ 2b unique bit strings for bit strings of length b

● Output time - Creation time ≈ Latency

22

Page 23: Apache Flink – Distributed Stream Processing

Measuring Latency - Result

23

● There clearly is a winner here!

Page 24: Apache Flink – Distributed Stream Processing

Conclusion (1/2)

● Functionality of Apache Spark is a subset of the functionality of Apache Flink

● Apache Flink not yet capable of mixing data streams and data batches○ But planned for a future version!

● Tuning is done automatically in Apache Flink○ In Apache Spark you need to optimize the parameters yourself

24

Page 25: Apache Flink – Distributed Stream Processing

Conclusion (2/2)

● Performance boost○ Apache Flink is faster than Apache Spark in terms of latency and batch processing (at

least in the presented use cases)

● Apache Beam combining the good things of both Apache Spark and Apache Flink (in development)

● Use Apache Flink :-)!

25

Page 26: Apache Flink – Distributed Stream Processing

Apache Flink

Questions? ???

26

[email protected]

www.data-blogger.com