Apache Apex Fault Tolerance and Processing Semantics


Transcript of Apache Apex Fault Tolerance and Processing Semantics

Page 1: Apache Apex Fault Tolerance and Processing Semantics

Apache Apex (incubating) Fault Tolerance and Processing Semantics

Thomas Weise, Architect & Co-founder, PPMC member
Pramod Immaneni, Architect, PPMC member

March 24th 2016

Page 2: Apache Apex Fault Tolerance and Processing Semantics

Apache Apex Features
• In-memory Stream Processing
• Partitioning and Scaling out
• Windowing (temporal boundary)
• Reliability
  ᵒ Stateful
  ᵒ Automatic Recovery
  ᵒ Processing Guarantees
• Operability
• Compute Locality
• Dynamic updates


Page 3: Apache Apex Fault Tolerance and Processing Semantics


Apex Platform Overview

Page 4: Apache Apex Fault Tolerance and Processing Semantics


Native Hadoop Integration

• YARN is the resource manager

• HDFS is used for storing any persistent state

Page 5: Apache Apex Fault Tolerance and Processing Semantics


Streaming Windows
• Application window: sliding window and tumbling window
• Checkpoint window
• No artificial latency
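To make the window boundaries concrete, here is a minimal operator sketch, assuming the Apex operator API (BaseOperator, DefaultInputPort/DefaultOutputPort); package names are given as the author recalls them and should be checked against the Apex release in use, and the running-sum logic is purely illustrative.

```java
// Minimal sketch: the engine calls beginWindow/endWindow around the tuples of
// each streaming window; the operator just accumulates a running sum.
import com.datatorrent.api.DefaultInputPort;
import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.common.util.BaseOperator;

public class WindowedSum extends BaseOperator
{
  private long sum; // carried across windows; checkpointed by the engine

  public final transient DefaultOutputPort<Long> output = new DefaultOutputPort<>();

  public final transient DefaultInputPort<Long> input = new DefaultInputPort<Long>()
  {
    @Override
    public void process(Long tuple)
    {
      sum += tuple;
    }
  };

  @Override
  public void beginWindow(long windowId)
  {
    // window boundary: all tuples until endWindow belong to this window id
  }

  @Override
  public void endWindow()
  {
    output.emit(sum); // emit the running sum at each window boundary
  }
}
```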

Page 6: Apache Apex Fault Tolerance and Processing Semantics


Fault Tolerance
• Operator state is checkpointed to persistent store
  ᵒ Automatically performed by engine, no additional coding needed
  ᵒ Asynchronous and distributed
  ᵒ In case of failure operators are restarted from checkpoint state
• Automatic detection and recovery of failed containers
  ᵒ Heartbeat mechanism
  ᵒ YARN process status notification
• Buffering to enable replay of data from recovered point
  ᵒ Fast, incremental recovery, spike handling
• Application master state checkpointed
  ᵒ Snapshot of physical (and logical) plan
  ᵒ Execution layer change log

Page 7: Apache Apex Fault Tolerance and Processing Semantics


Checkpointing Operator State
• Save state of operator so that it can be recovered on failure
• Pluggable storage handler
• Default implementation
  ᵒ Serialization with Kryo
  ᵒ All non-transient fields serialized
  ᵒ Serialized state written to HDFS
  ᵒ Writes asynchronous, non-blocking
• Possible to implement custom handlers for an alternative approach to extracting state or a different storage backend (such as an IMDG)
• For operators that do not rely on previous state for computation
  ᵒ Operators can be marked @Stateless to skip checkpointing (see the sketch below)
• Checkpoint frequency tunable (30 s by default)
  ᵒ Based on streaming windows for consistent state
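A minimal sketch of the knobs above: an operator with no state to restore can be annotated @Stateless so the engine skips checkpointing it. The annotation and package names (com.datatorrent.api.annotation.Stateless, BaseOperator) follow the Apex API as the author recalls them; the uppercasing logic is illustrative.

```java
// Sketch of a pure transformation with no checkpointed state.
import com.datatorrent.api.DefaultInputPort;
import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.api.annotation.Stateless;
import com.datatorrent.common.util.BaseOperator;

@Stateless // nothing to restore, so the engine can skip checkpointing this operator
public class UpperCaseOperator extends BaseOperator
{
  public final transient DefaultOutputPort<String> output = new DefaultOutputPort<>();

  public final transient DefaultInputPort<String> input = new DefaultInputPort<String>()
  {
    @Override
    public void process(String tuple)
    {
      output.emit(tuple.toUpperCase()); // stateless: no dependence on previous tuples
    }
  };
}
```

For stateful operators, the checkpoint interval can be tuned per operator, e.g. dag.setAttribute(op, Context.OperatorContext.CHECKPOINT_WINDOW_COUNT, 60); with the default 500 ms streaming window this corresponds to the 30 s default mentioned above (attribute name as the author recalls the engine API).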

Page 8: Apache Apex Fault Tolerance and Processing Semantics

Buffer Server
• In-memory PubSub
• Stores results emitted by operator until committed
• Handles backpressure / spillover to local disk
• Ordering, idempotency

[Diagram: Operator 1 in Container 1 on Node 1 and Operator 2 in Container 2 on Node 2, each container running a Buffer Server that downstream operators subscribe to.]

Page 9: Apache Apex Fault Tolerance and Processing Semantics


Application Master State
• Snapshot state on plan change
  ᵒ Serialize physical plan (includes logical plan)
  ᵒ Infrequent, expensive operation
• WAL (write-ahead log) for state changes
  ᵒ Execution layer changes
  ᵒ Container, operator state, property changes
• Containers locate master through DFS
  ᵒ AM can fail and restart, other containers need to find it
  ᵒ Work preserving restart
• Recovery
  ᵒ YARN restarts application master
  ᵒ Apex restores state from snapshot and replays log

Page 10: Apache Apex Fault Tolerance and Processing Semantics

Failure Scenarios
• Container process fails
  ᵒ NM detects
  ᵒ In case of AM (Apex Application Master), YARN launches replacement container (for attempt count < max)
• Node Manager process fails
  ᵒ RM detects NM failure and notifies AM
• Machine fails
  ᵒ RM detects NM/AM failure and recovers or notifies AM
• RM fails - RM HA option
• Entire YARN cluster down – stateful restart of Apex application

Page 11: Apache Apex Fault Tolerance and Processing Semantics

Failure Scenarios

[Diagram: a YARN cluster with the Resource Manager, Node Managers (NM), and the Apex AM; containers 1-3 and the Apex AM itself are relaunched on available nodes after a failure.]

Page 12: Apache Apex Fault Tolerance and Processing Semantics

Failure Scenarios

[Diagram: a sum operator consuming the stream … EW2, 1, 3, BW2, EW1, 4, 2, 1, BW1 (BW/EW mark begin/end window boundaries); the running sum is 0, reaches 7 at the end of window 1, and 10 part way through window 2; after a failure the operator is restored to the checkpointed sum of 7 and the buffered stream is replayed.]

Page 13: Apache Apex Fault Tolerance and Processing Semantics


Processing Guarantees
At-least-once
• On recovery data will be replayed from a previous checkpoint
  ᵒ No messages lost
  ᵒ Default, suitable for most applications (per-operator mode selection is sketched below)
• Can be used to ensure data is written once to store
  ᵒ Transactions with meta information, rewinding output, feedback from external entity, idempotent operations
At-most-once
• On recovery the latest data is made available to operator
  ᵒ Useful in use cases where some data loss is acceptable and latest data is sufficient
Exactly-once
ᵒ At-least-once + idempotency + transactional mechanisms (operator logic) to achieve end-to-end exactly-once behavior
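As a hedged sketch of how an application selects between these guarantees per operator: the engine exposes a PROCESSING_MODE operator attribute (Context.OperatorContext.PROCESSING_MODE with Operator.ProcessingMode values), as the author recalls the apex-core API; WindowedSum is the hypothetical operator from the Streaming Windows sketch, and the chosen mode here is only an example.

```java
// Sketch: choosing a processing guarantee per operator when building the DAG.
import org.apache.hadoop.conf.Configuration;

import com.datatorrent.api.Context.OperatorContext;
import com.datatorrent.api.DAG;
import com.datatorrent.api.Operator.ProcessingMode;
import com.datatorrent.api.StreamingApplication;

public class GuaranteesApp implements StreamingApplication
{
  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    // WindowedSum is the hypothetical operator from the earlier sketch.
    WindowedSum sum = dag.addOperator("sum", new WindowedSum());
    // AT_LEAST_ONCE is the default; AT_MOST_ONCE trades completeness for
    // freshness where some data loss is acceptable.
    dag.setAttribute(sum, OperatorContext.PROCESSING_MODE, ProcessingMode.AT_MOST_ONCE);
  }
}
```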

Page 14: Apache Apex Fault Tolerance and Processing Semantics

End-to-End Exactly Once
• Becomes important when writing to external systems
• Data should not be duplicated or lost in the external system even in case of application failures
• Common external systems
  ᵒ Databases
  ᵒ Files
  ᵒ Message queues
• Platform support for at-least-once is a must so that no data is lost
• Data duplication must still be avoided when data is replayed from checkpoint
  ᵒ Operators implement the logic dependent on the external system
• Aid of platform features such as stateful checkpointing and windowing
• Three different mechanisms with implementations explained in the next slides

Page 15: Apache Apex Fault Tolerance and Processing Semantics

Files
• Streaming data is being written to a file on a continuous basis
• Failure at a random point results in a file with an unknown amount of data
• Operator works with platform to ensure exactly once
  ᵒ Platform responsibility
    • Restores state and restarts operator from an earlier checkpoint
    • Replays data from the exact point after checkpoint
  ᵒ Operator responsibility
    • Ensures replayed data doesn't get duplicated in the file
    • Accomplished by keeping track of the file offset as state
  ᵒ Details in next slide
• Implemented in operator AbstractFileOutputOperator in the apache/incubator-apex-malhar GitHub repository
• Example application AtomicFileOutputApp

Page 16: Apache Apex Fault Tolerance and Processing Semantics

Exactly Once Strategy

[Diagram: file contents with the checkpointed offset marking the end of committed data; bytes written after that offset are truncated on recovery.]

• Operator saves file offset during checkpoint
• File contents are flushed before checkpoint to ensure there is no pending data in buffer
• On recovery platform restores the file offset value from checkpoint
• Operator truncates the file to the offset
• Starts writing data again
• Ensures no data is duplicated or lost (a minimal sketch follows below)
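A minimal sketch of the truncate-to-offset idea described above, using plain java.io for illustration; the real AbstractFileOutputOperator in Malhar writes through the Hadoop FileSystem API, and the class, field, and method names here are illustrative assumptions.

```java
// Sketch: track the committed file offset as checkpointed state and truncate
// the file back to it on recovery so replayed data is not duplicated.
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;

public class OffsetTrackingFileWriter
{
  private long committedOffset;            // checkpointed operator state
  private transient long currentOffset;    // bytes written so far in this run
  private transient FileOutputStream out;  // reopened after recovery

  // Called on recovery: discard bytes written after the last checkpoint.
  public void recover(String path) throws IOException
  {
    try (RandomAccessFile file = new RandomAccessFile(path, "rw")) {
      file.setLength(committedOffset);     // truncate to the checkpointed offset
    }
    currentOffset = committedOffset;
    out = new FileOutputStream(path, true); // reopen in append mode
  }

  public void write(byte[] data) throws IOException
  {
    out.write(data);
    currentOffset += data.length;
  }

  // Flush before checkpoint so the saved offset matches the file length.
  public void beforeCheckpoint() throws IOException
  {
    out.flush();
    out.getFD().sync();
    committedOffset = currentOffset;
  }
}
```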

Page 17: Apache Apex Fault Tolerance and Processing Semantics

Transactional Databases
• Use of streaming windows
• For exactly once in failure scenarios
  ᵒ Operator uses transactions
  ᵒ Stores window id in a separate table in the database
  ᵒ Details in next slide
• Implemented in operator AbstractJdbcTransactionableOutputOperator in the apache/incubator-apex-malhar GitHub repository
• Example application streaming data in from Kafka and writing to a JDBC database

Page 18: Apache Apex Fault Tolerance and Processing Semantics

Exactly Once Strategy

[Diagram: a Data Table holding the tuples of each window (d11 d12 d13 for window wn, d21 d22 d23 for window wn+1) and a Meta Table holding one (op-id, window id) row; the window id advances from wn to wn+1 in the same transaction as that window's data, with the checkpoint falling between the two windows.]

• Data in a window is written out in a single transaction
• Window id is also written to a meta table as part of the same transaction
• Operator reads the window id from meta table on recovery
• Ignores replayed data for windows at or below the recovered window id and writes new data
• Partial window data before failure will not appear in data table as transaction was not committed
• Assumes idempotency for replay (a minimal JDBC sketch follows below)
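A hedged sketch of the per-window transaction described above, written with plain JDBC rather than the Malhar operator itself; the table and column names (data_table, dt_meta, op_id, win_id) are illustrative assumptions, and the meta table is assumed to be initialized with one row per operator.

```java
// Sketch: write a window's tuples and its window id in one transaction, and
// skip replayed windows that are already recorded in the meta table.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.List;

public class WindowedJdbcWriter
{
  private final Connection conn;
  private final String operatorId;
  private long committedWindowId = -1;

  public WindowedJdbcWriter(String jdbcUrl, String operatorId) throws SQLException
  {
    this.conn = DriverManager.getConnection(jdbcUrl);
    this.conn.setAutoCommit(false);
    this.operatorId = operatorId;
    // On recovery, read the last committed window id for this operator.
    try (PreparedStatement ps = conn.prepareStatement(
        "SELECT win_id FROM dt_meta WHERE op_id = ?")) {
      ps.setString(1, operatorId);
      try (ResultSet rs = ps.executeQuery()) {
        if (rs.next()) {
          committedWindowId = rs.getLong(1);
        }
      }
    }
  }

  // Write one window's tuples and its window id atomically.
  public void writeWindow(long windowId, List<String> tuples) throws SQLException
  {
    if (windowId <= committedWindowId) {
      return; // replayed window already committed: skip to avoid duplicates
    }
    try (PreparedStatement data = conn.prepareStatement(
             "INSERT INTO data_table (value) VALUES (?)");
         PreparedStatement meta = conn.prepareStatement(
             "UPDATE dt_meta SET win_id = ? WHERE op_id = ?")) {
      for (String t : tuples) {
        data.setString(1, t);
        data.addBatch();
      }
      data.executeBatch();
      meta.setLong(1, windowId);
      meta.setString(2, operatorId);
      meta.executeUpdate();
      conn.commit();         // data and window id become visible atomically
      committedWindowId = windowId;
    } catch (SQLException e) {
      conn.rollback();       // a partial window is discarded on failure
      throw e;
    }
  }
}
```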

Page 19: Apache Apex Fault Tolerance and Processing Semantics

Stateful Message Queue
• Data is being sent to a stateful message queue like Apache Kafka
• On failure, data already sent to the message queue should not be re-sent
• Exactly once strategy
  ᵒ Send a key along with the data that is monotonically increasing
  ᵒ On recovery the operator asks the message queue for the last sent message
    • Gets the recovery key from that message
  ᵒ Ignore all replayed data with a key that is less than or equal to the recovered key
  ᵒ If the key is not monotonically increasing, data can be sorted on the key at the end of the window and then sent to the message queue
• Implemented in operator AbstractExactlyOnceKafkaOutputOperator in the apache/incubator-apex-malhar GitHub repository (a sketch of the recovery-key lookup follows below)
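A hedged sketch of the recovery-key lookup described above, using the plain Kafka clients API (kafka-clients 2.x style, not the Malhar operator itself): read the last record of the output partition to find the highest key already written, then drop replayed tuples at or below it. Single-partition handling, the serializer choices, and the class/method names are assumptions for illustration.

```java
// Sketch: find the last key written to Kafka and filter replayed tuples.
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class KafkaRecoveryKey
{
  // Returns the key of the last message in partition 0, or -1 if the topic is empty.
  public static long lastWrittenKey(String bootstrap, String topic)
  {
    Properties props = new Properties();
    props.put("bootstrap.servers", bootstrap);
    props.put("key.deserializer", "org.apache.kafka.common.serialization.LongDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("group.id", "recovery-probe");
    try (KafkaConsumer<Long, String> consumer = new KafkaConsumer<>(props)) {
      TopicPartition tp = new TopicPartition(topic, 0);
      consumer.assign(Collections.singletonList(tp));
      long end = consumer.endOffsets(Collections.singletonList(tp)).get(tp);
      if (end == 0) {
        return -1L;
      }
      consumer.seek(tp, end - 1);               // position on the last record
      ConsumerRecords<Long, String> records = consumer.poll(Duration.ofSeconds(5));
      long key = -1L;
      for (ConsumerRecord<Long, String> r : records) {
        key = r.key();                          // recovery key from the last message
      }
      return key;
    }
  }

  // Replayed tuples with a key at or below the recovered key are ignored.
  public static boolean shouldSend(long tupleKey, long recoveredKey)
  {
    return tupleKey > recoveredKey;
  }
}
```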

Page 20: Apache Apex Fault Tolerance and Processing Semantics


Resources
• Subscribe - http://apex.incubator.apache.org/community.html
• Download - http://apex.incubator.apache.org/downloads.html
• Apex website - http://apex.incubator.apache.org/
• Twitter - @ApacheApex; Follow - https://twitter.com/apacheapex
• Facebook - https://www.facebook.com/ApacheApex/
• Meetup - http://www.meetup.com/topics/apache-apex

Page 21: Apache Apex Fault Tolerance and Processing Semantics

Q&A
