Till Rohrmann – Fault Tolerance and Job Recovery in Apache Flink

Fault Tolerance and Job Recovery in Apache Flink™

Till Rohrmann trohrmann@apache.org @stsffap

Better be safe than sorry

• Failures will happen
• EMC estimated $1.7 billion costs due to data loss and system downtime
• Recovery will save you time and costs
• Switch between algorithms
• Live upgrade of your system

Fault Tolerance

Fault tolerance guarantees

• At most once
  • No guarantees at all
• At least once
  • Sufficient for many applications
• Exactly once

Flink provides all guarantees
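The difference between the three guarantees can be made concrete with a toy model. The sketch below is plain Java, not Flink code, and every name in it (`GuaranteeDemo`, `countWithFailure`, the parameters) is made up for illustration. It assumes a replayable source, a single counting operator, one checkpoint and one failure. At most once is modelled as "state lost, no replay", at least once as "replay from the checkpoint offset while the state is not rolled back" (for example because it lives in an external store), and exactly once as "replay from the checkpoint offset with the state restored from the snapshot".

```java
/** Toy model of delivery guarantees for a counting operator; not Flink code. */
class GuaranteeDemo {
    enum Guarantee { AT_MOST_ONCE, AT_LEAST_ONCE, EXACTLY_ONCE }

    /**
     * Counts the records of a replayable source with one simulated failure.
     *
     * @param totalRecords number of records in the stream
     * @param checkpointAt records processed when the checkpoint was taken
     * @param failAt       records processed when the failure happens
     */
    static long countWithFailure(int totalRecords, int checkpointAt, int failAt,
                                 Guarantee guarantee) {
        if (checkpointAt < 0 || checkpointAt > failAt || failAt > totalRecords) {
            throw new IllegalArgumentException(
                "need 0 <= checkpointAt <= failAt <= totalRecords");
        }
        long count = 0;
        long snapshot = 0;
        // First attempt: process records until the failure.
        for (int offset = 0; offset <= failAt; offset++) {
            if (offset == checkpointAt) {
                snapshot = count; // state captured by the checkpoint
            }
            if (offset == failAt) {
                break; // crash before this record is processed
            }
            count++;
        }
        int resumeOffset;
        switch (guarantee) {
            case AT_MOST_ONCE:   // state is lost and nothing is replayed
                count = 0;
                resumeOffset = failAt;
                break;
            case AT_LEAST_ONCE:  // records are replayed, state is not rolled back
                resumeOffset = checkpointAt;
                break;
            default:             // EXACTLY_ONCE: replay and restore the snapshot
                count = snapshot;
                resumeOffset = checkpointAt;
                break;
        }
        // Second attempt: process the rest of the stream.
        for (int offset = resumeOffset; offset < totalRecords; offset++) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        for (Guarantee g : Guarantee.values()) {
            System.out.println(g + ": " + countWithFailure(10, 4, 7, g));
        }
    }
}
```

With 10 records, a checkpoint after 4 and a failure after 7, the model undercounts for at most once, overcounts for at least once, and returns exactly 10 only when both the source position and the operator state are rolled back together.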

Checkpoints

• Consistent snapshots of distributed data stream and operator state

Barriers

• Markers for checkpoints
• Injected in the data flow
• Alignment for multi-input operators
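Barrier alignment is easier to follow in code. The following is a minimal plain-Java simulation, not Flink's implementation: `BarrierAlignmentDemo`, `run` and the `BARRIER` sentinel are hypothetical names, records are plain integers, the operator state is a running sum, and every input channel is assumed to carry the same number of barriers. Once a channel delivers a barrier it is blocked until the barrier has arrived on all other channels; only then is the state snapshotted and the channels released.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simulates barrier alignment for a multi-input summing operator; not Flink code. */
class BarrierAlignmentDemo {
    /** Sentinel value standing in for a checkpoint barrier. */
    static final int BARRIER = Integer.MIN_VALUE;

    /** Returns the state (running sum) snapshotted at each completed checkpoint. */
    static List<Long> run(int[][] channels) {
        int n = channels.length;
        int[] position = new int[n];        // next unread event per channel
        boolean[] blocked = new boolean[n]; // channels waiting for alignment
        int barriersSeen = 0;
        long sum = 0;
        List<Long> snapshots = new ArrayList<>();

        boolean progress = true;
        while (progress) {
            progress = false;
            // Read round-robin from every channel that is neither blocked nor exhausted.
            for (int c = 0; c < n; c++) {
                if (blocked[c] || position[c] >= channels[c].length) {
                    continue;
                }
                int event = channels[c][position[c]++];
                progress = true;
                if (event == BARRIER) {
                    blocked[c] = true; // hold back this channel's later records
                    barriersSeen++;
                    if (barriersSeen == n) {
                        snapshots.add(sum);          // all barriers arrived: snapshot
                        Arrays.fill(blocked, false); // release all channels
                        barriersSeen = 0;
                    }
                } else {
                    sum += event;
                }
            }
        }
        return snapshots;
    }

    public static void main(String[] args) {
        int[][] channels = {{1, 2, 4, BARRIER}, {3, BARRIER, 20}};
        System.out.println(run(channels));
    }
}
```

In the `main` example the second channel delivers its barrier early. Its later record, 20, is held back until the first channel's barrier arrives, so the snapshot contains only the pre-barrier records of both inputs.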

Operator State

• Stateless operators
• System state
• User-defined state

ds.filter(_ != 0)

ds.keyBy(0).window(TumblingTimeWindows.of(5, TimeUnit.SECONDS))

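import org.apache.flink.api.common.functions.RichReduceFunction;
import org.apache.flink.api.common.state.OperatorState;
import org.apache.flink.configuration.Configuration;

public class CounterSum extends RichReduceFunction<Long> {
    // User-defined state: counts how often reduce() has been called.
    private OperatorState<Long> counter;

    @Override
    public Long reduce(Long v1, Long v2) throws Exception {
        counter.update(counter.value() + 1);
        return v1 + v2;
    }

    @Override
    public void open(Configuration config) throws Exception {
        // Arguments as on the slide: state name, default value, boolean flag.
        counter = getRuntimeContext().getOperatorState("counter", 0L, false);
    }
}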

Advantages

• Separation of app logic from recovery
  • Checkpointing interval is just a config parameter
• High throughput
  • Controllable checkpointing overhead
• Low impact on latency

Cluster High Availability

Without high availability

[Diagram: JobManager, TaskManager]

With high availability

[Diagram: JobManager, stand-by JobManager, TaskManager, Apache ZooKeeper™]

KEEP GOING

Persisting jobs

[Diagram: Client, JobManager, TaskManagers, Apache ZooKeeper™]

1. Submit job
2. Persist execution graph
3. Write handle to ZooKeeper
4. Deploy tasks
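The four steps above can be sketched as a small in-memory simulation. This is not Flink's JobManager code: `JobPersistenceDemo`, its nested `JobManager`, the `storage://` handle format and the two maps standing in for durable storage and ZooKeeper are all assumptions made for illustration. The sketch shows why a stand-by JobManager can take over: the execution graph sits in durable storage, and ZooKeeper holds only the handle needed to find it.

```java
import java.util.HashMap;
import java.util.Map;

/** In-memory sketch of job persistence and recovery; not Flink code. */
class JobPersistenceDemo {

    static class JobManager {
        private final Map<String, String> durableStorage; // handle -> execution graph
        private final Map<String, String> zooKeeper;      // job id -> handle
        final Map<String, String> runningJobs = new HashMap<>(); // job id -> graph

        JobManager(Map<String, String> durableStorage, Map<String, String> zooKeeper) {
            this.durableStorage = durableStorage;
            this.zooKeeper = zooKeeper;
        }

        /** Step 1: the client submits a job to this JobManager. */
        void submit(String jobId, String executionGraph) {
            // Step 2: persist the execution graph (hypothetical handle format).
            String handle = "storage://jobs/" + jobId;
            durableStorage.put(handle, executionGraph);
            // Step 3: write only the small handle to ZooKeeper.
            zooKeeper.put(jobId, handle);
            // Step 4: deploy the tasks.
            deploy(jobId, executionGraph);
        }

        /** What a stand-by JobManager might do after taking over. */
        void recoverAll() {
            for (Map.Entry<String, String> entry : zooKeeper.entrySet()) {
                String graph = durableStorage.get(entry.getValue());
                deploy(entry.getKey(), graph);
            }
        }

        private void deploy(String jobId, String executionGraph) {
            runningJobs.put(jobId, executionGraph); // stand-in for task deployment
        }
    }

    public static void main(String[] args) {
        Map<String, String> storage = new HashMap<>();
        Map<String, String> zk = new HashMap<>();

        JobManager leader = new JobManager(storage, zk);
        leader.submit("job-1", "source -> map -> sink");

        // The leader fails; a stand-by JobManager recovers the job from ZooKeeper.
        JobManager standBy = new JobManager(storage, zk);
        standBy.recoverAll();
        System.out.println(standBy.runningJobs);
    }
}
```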

Handling checkpoints

[Diagram: Client, JobManager, TaskManagers, Apache ZooKeeper™]

1. Take snapshots
2. Persist snapshots
3. Send handles to JM
4. Create global checkpoint
5. Persist global checkpoint
6. Write handle to ZooKeeper
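The six checkpoint steps can be sketched in the same style. Again this is a plain-Java simulation under stated assumptions, not Flink's checkpoint coordinator: the class name, the method names, the `storage://` handles and the `latest-checkpoint` key are hypothetical, and state is represented as a string. A global checkpoint is created only once every expected task has sent the handle of its persisted snapshot.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** In-memory sketch of checkpoint handling on the JobManager; not Flink code. */
class CheckpointCoordinatorDemo {
    private final Set<String> expectedTasks;
    private final Map<String, String> durableStorage; // handle -> persisted data
    private final Map<String, String> zooKeeper;      // key -> handle
    // Handles received so far, per checkpoint id (task name -> snapshot handle).
    private final Map<Long, Map<String, String>> pending = new HashMap<>();

    CheckpointCoordinatorDemo(Set<String> expectedTasks,
                              Map<String, String> durableStorage,
                              Map<String, String> zooKeeper) {
        this.expectedTasks = expectedTasks;
        this.durableStorage = durableStorage;
        this.zooKeeper = zooKeeper;
    }

    /** TaskManager side, steps 1 to 3. Returns true if the checkpoint completed. */
    boolean taskSnapshot(long checkpointId, String task, String state) {
        // Steps 1 and 2: take the snapshot and persist it (hypothetical handle format).
        String handle = "storage://cp-" + checkpointId + "/" + task;
        durableStorage.put(handle, state);
        // Step 3: send the handle to the JobManager.
        return acknowledge(checkpointId, task, handle);
    }

    /** JobManager side, steps 4 to 6. Returns true if the checkpoint completed. */
    boolean acknowledge(long checkpointId, String task, String handle) {
        Map<String, String> handles =
            pending.computeIfAbsent(checkpointId, id -> new TreeMap<>());
        handles.put(task, handle);
        if (!handles.keySet().containsAll(expectedTasks)) {
            return false; // still waiting for other tasks
        }
        // Step 4: create the global checkpoint from all task handles.
        String globalCheckpoint = handles.toString();
        // Step 5: persist the global checkpoint.
        String globalHandle = "storage://cp-" + checkpointId + "/global";
        durableStorage.put(globalHandle, globalCheckpoint);
        // Step 6: write its handle to ZooKeeper.
        zooKeeper.put("latest-checkpoint", globalHandle);
        pending.remove(checkpointId);
        return true;
    }
}
```

A stand-by JobManager in this model would read `latest-checkpoint` from the ZooKeeper map, load the global checkpoint from storage, and hand each task its snapshot handle.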

Conclusion

TL;DL

• Job recovery mechanism with low latency and high throughput
• Exactly-once processing semantics
• No single point of failure

Flink will always keep processing your data

flink.apache.org @ApacheFlink
