1 Seneca: Remote Mirroring Done Write Minwen Ji, Alistair Veitch and John Wilkes HP Labs December 2,...

Page 1: 1 Seneca: Remote Mirroring Done Write Minwen Ji, Alistair Veitch and John Wilkes HP Labs December 2, 2015.

Seneca: Remote Mirroring Done Write

Minwen Ji, Alistair Veitch and John Wilkes

HP Labs

April 21, 2023

Motivations: Reliability and Availability

• 2 out of 5 enterprises that experience a disaster go out of business within 5 years [Gartner Report]

• Outages cost >$250K/hour (25% of respondents) or >$5M/hour (4% of respondents) [Eagle Rock Online Survey]

Our Contributions

• A taxonomy of the design choices for remote mirroring

• An asynchronous protocol that is designed to handle many kinds and sequences of failures

• Checking the correctness of the protocol using I/O automata-based simulation

Remote Mirroring Overview

Competing goals: high performance, low cost, and low data loss

[Figure: architecture diagram. Labels: App (several at each site), Mirroring Module (one at each site), Local, Remote, Wide Area Network, Primary, Secondary, Primary Log, Secondary Log.]

Design Choices

• Synchronous vs. asynchronous

– Propagate update to mirror before vs. after write request returned to application

• Divergence: zero bounded, op/byte/time bounded, resource bounded, unbounded

– Amount of data allowed to be out-of-sync between mirrors

• As-is logging vs. write coalescing

– Store all versions vs. a subset of versions of overwritten data in log
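
The difference between as-is logging and write coalescing can be illustrated with a small sketch. This is not Seneca's implementation; the class names `AsIsLog` and `CoalescingLog`, the `replay` helper, and the block/data values are all hypothetical, and a mirror is modeled as a plain dictionary from block number to data.

```python
from collections import OrderedDict


class AsIsLog:
    """Keeps every write in arrival order (all versions of overwritten data)."""

    def __init__(self):
        self.entries = []

    def append(self, block, data):
        self.entries.append((block, data))

    def size(self):
        return len(self.entries)


class CoalescingLog:
    """Keeps only the latest write per block (a subset of the versions)."""

    def __init__(self):
        self.entries = OrderedDict()

    def append(self, block, data):
        # An overwrite replaces the earlier version instead of adding an entry.
        self.entries[block] = data

    def size(self):
        return len(self.entries)


def replay(log_entries, mirror):
    """Apply logged (block, data) pairs to a mirror, modeled as a dict."""
    for block, data in log_entries:
        mirror[block] = data
    return mirror


# Hypothetical workload: block 7 is overwritten twice, block 3 once.
writes = [(7, "a"), (3, "b"), (7, "c"), (7, "d"), (3, "e")]
as_is, coalesced = AsIsLog(), CoalescingLog()
for block, data in writes:
    as_is.append(block, data)
    coalesced.append(block, data)

mirror_a = replay(as_is.entries, {})
mirror_b = replay(coalesced.entries.items(), {})
```

Both logs bring the mirror to the same final contents, but the coalesced log holds two entries instead of five, at the price of not retaining the intermediate versions.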

Seneca’s Choices

• Synchronous vs. asynchronous

– Low data loss vs. smooth traffic and high performance

• Divergence: zero bounded, op/byte/time bounded, resource bounded, unbounded

– Low data loss vs. smooth traffic and high availability

• As-is logging vs. write coalescing

– Little secondary log space vs. low primary log space and low traffic
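
One way to picture the divergence bounds listed above is as limits on the updates that have been acknowledged to the application but not yet propagated. The sketch below is an illustration under stated assumptions, not Seneca's code: `DivergenceBound`, `PrimaryLog`, the field names, and the "stalled"/"logged" results are invented here, and draining the log stands in for real propagation to the mirror. Resource-bounded divergence (for example, a limit set by log space) is not modeled separately.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class DivergenceBound:
    """Illustrative limits; None leaves that dimension unbounded."""
    max_ops: Optional[int] = None
    max_bytes: Optional[int] = None
    max_age_s: Optional[float] = None


@dataclass
class PrimaryLog:
    bound: DivergenceBound
    pending: List[Tuple[float, int]] = field(default_factory=list)  # (time, bytes)

    def exceeds(self, now: float, nbytes: int) -> bool:
        """Would accepting this write push divergence past the bound?"""
        ops = len(self.pending) + 1
        total = sum(n for _, n in self.pending) + nbytes
        age = now - self.pending[0][0] if self.pending else 0.0
        b = self.bound
        return ((b.max_ops is not None and ops > b.max_ops)
                or (b.max_bytes is not None and total > b.max_bytes)
                or (b.max_age_s is not None and age > b.max_age_s))

    def write(self, now: float, nbytes: int) -> str:
        if self.exceeds(now, nbytes):
            # Stand-in for waiting until the mirror has everything, including
            # this write, before acknowledging the application.
            self.pending.clear()
            return "stalled"
        self.pending.append((now, nbytes))
        return "logged"


def run(bound, writes):
    log = PrimaryLog(bound)
    return [log.write(t, n) for t, n in writes]


# Hypothetical workload: four 10-byte writes, one per second.
writes = [(0.0, 10), (1.0, 10), (2.0, 10), (3.0, 10)]
zero = run(DivergenceBound(max_ops=0), writes)
byte_bounded = run(DivergenceBound(max_bytes=25), writes)
time_bounded = run(DivergenceBound(max_age_s=1.5), writes)
unbounded = run(DivergenceBound(), writes)
```

With a zero bound every write stalls, which amounts to synchronous mirroring; with no bound no write ever stalls; the byte and time bounds sit in between and stall only the write that would exceed the limit.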

A Taxonomy

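[Figure: chart placing remote mirroring systems in the design space. Axes: Divergence Bound and Propagation (1 Sync, 2 Async, 3 Async-Coalesce, 4 Async-Bitmap). Systems shown: Veritas, IBM-PPRC, IBM-XRC, EMC, NetApps, HP-XP, HP-SV3000, Seneca. Axis annotations: Avail +, Cost –, Loss +; Perf +, Cost –, Loss +.]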

Evaluation of Seneca’s Choices

Metrics: impact of asynchronous propagation and write coalescing on WAN traffic and log space

Trace       Capacity   Length      Mean write rate
Cello2002   1.44 TB    24 hours    0.78 MB/s
SAP         4 TB       15 mins     1.95 MB/s
RDW         500 GB     1.4 hours   0.34 MB/s
OpenMail    640 GB     1 hour      1.70 MB/s

Simulation Results

• Mean traffic: 5-40% reduction with write coalescing allowed within 30 sec intervals

• 95th percentile usage: reduced from 93% of 4 T3 lines to 85% of 3 T3 lines

• Log space: 100 GB log will cover a network outage for 14-81 hours

[Figures: Mean Traffic vs. Coalescing Interval; Traffic CDF with Coalescing On/Off; Log Size vs. Coalescing Interval]
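
The log-space and bandwidth figures can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is an estimate, not the simulation behind the slides: it assumes the log fills at each trace's mean write rate with no coalescing, takes 1 GB as 1000 MB, and uses the nominal T3 line rate of 44.736 Mbit/s. The helper name `outage_hours` is made up for this example.

```python
T3_MBIT_PER_S = 44.736  # nominal T3 (DS3) line rate


def outage_hours(log_gb, write_mb_per_s):
    """Hours until a log of log_gb fills at a constant write rate (1 GB = 1000 MB)."""
    return log_gb * 1000 / write_mb_per_s / 3600


# Mean write rates (MB/s) from the trace table.
rates = {"Cello2002": 0.78, "SAP": 1.95, "RDW": 0.34, "OpenMail": 1.70}
hours = {name: outage_hours(100, rate) for name, rate in rates.items()}

# 95th percentile usage before and after coalescing, in Mbit/s.
before = 0.93 * 4 * T3_MBIT_PER_S
after = 0.85 * 3 * T3_MBIT_PER_S
reduction = 1 - after / before
```

Under these assumptions a 100 GB log lasts from about 14.2 hours (SAP) to about 81.7 hours (RDW), in line with the 14-81 hour range quoted above, and the 95th percentile usage drops from roughly 166 Mbit/s to roughly 114 Mbit/s, a reduction of about 31%.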

How To Get Things Right

• Hard problems:

– Rolling disasters

• Primary fails => secondary inconsistent => system inaccessible

– Failover dilemmas

• Primary fails before propagation

• Secondary takes over and continues to update

• Old primary returns

• Our approach:

– Finite state machines
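
To show how a finite state machine can make the failover dilemma explicit, here is a toy transition table. It is not Seneca's state machine: the roles, the event names, and the rule that a recovered site must reconcile before rejoining are assumptions chosen for illustration.

```python
from enum import Enum, auto


class Role(Enum):
    PRIMARY = auto()
    SECONDARY = auto()
    DOWN = auto()
    RECONCILING = auto()


# (current role, event) -> next role; any pair not listed is rejected.
TRANSITIONS = {
    (Role.PRIMARY, "fail"): Role.DOWN,
    (Role.SECONDARY, "fail"): Role.DOWN,
    (Role.SECONDARY, "peer_failed"): Role.PRIMARY,   # failover
    (Role.DOWN, "recover"): Role.RECONCILING,        # never straight back to PRIMARY
    (Role.RECONCILING, "caught_up"): Role.SECONDARY,
}


def step(role, event):
    """Return the next role, or raise if the event is illegal in this role."""
    try:
        return TRANSITIONS[(role, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {role.name} on {event!r}") from None


def run(role, events):
    for event in events:
        role = step(role, event)
    return role


# The failover dilemma: the old primary fails, the secondary takes over,
# and the old primary later returns.
local = run(Role.PRIMARY, ["fail", "recover", "caught_up"])
remote = run(Role.SECONDARY, ["peer_failed"])
```

In this sketch the returning site ends up as a secondary and the site that took over stays primary, so the two never both accept writes. Listing only the legal transitions means any unanticipated sequence of failures fails loudly rather than silently.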

[Figure: state diagrams. Panels: Local Seneca State, Remote Seneca State, Primary State, Secondary State.]

Checking Correctness

• Simulation

– Started with Input/Output Automata (a model checking language)

– Constrained random walks in the state space

– Implemented in C

• Correctness criteria

– Coverage, safety and liveness

• Latest results

– Detected and fixed many non-trivial implementation bugs in a relatively short time

– Average failure injections before a bug is detected: 16435

– Mean Time Between Failures for the protocol proper: 4100 years

– The latest bug took 1.77M writes, 75.9K failures, 22.4K recoveries and 6.6M internal events to detect
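
The idea of a constrained random walk with failure injection can be sketched on a toy model. The authors' checker was implemented in C and exercised the real protocol; the Python below is only an illustration, and the state fields, action names, `max_lag` constraint, and safety invariant are all assumptions. Liveness is not checked here.

```python
import random

INITIAL = {"written": 0, "applied": 0, "link_up": True}


def enabled(state, max_lag):
    """Actions allowed in this state; max_lag constrains the walk."""
    lag = state["written"] - state["applied"]
    actions = ["link_fail"] if state["link_up"] else ["link_recover"]
    if lag < max_lag:
        actions.append("write")
    if state["link_up"] and lag > 0:
        actions.append("propagate")
    return actions


def apply(state, action):
    """Return the successor state; the input state is left unchanged."""
    nxt = dict(state)
    if action == "write":
        nxt["written"] += 1
    elif action == "propagate":
        nxt["applied"] += 1
    elif action == "link_fail":
        nxt["link_up"] = False
    elif action == "link_recover":
        nxt["link_up"] = True
    return nxt


def safe(state):
    """Safety: the mirror never holds an update the primary has not written."""
    return 0 <= state["applied"] <= state["written"]


def random_walk(steps, seed, max_lag=3):
    rng = random.Random(seed)  # seeded so a failing run can be replayed
    state = dict(INITIAL)
    covered = set()            # coverage: distinct (lag, link_up) pairs visited
    failures = 0
    for i in range(steps):
        action = rng.choice(enabled(state, max_lag))
        if action == "link_fail":
            failures += 1
        state = apply(state, action)
        covered.add((state["written"] - state["applied"], state["link_up"]))
        if not safe(state):
            return {"violation_at": i, "covered": covered, "failures": failures}
    return {"violation_at": None, "covered": covered, "failures": failures}


result = random_walk(steps=5000, seed=1)
```

Each run reports whether the invariant was ever violated, how many failures were injected, and which abstract states were reached, which loosely corresponds to the safety and coverage criteria above. A fixed seed makes any detected violation reproducible.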

Summary

• A taxonomy of the design space for remote mirroring

• Evaluation of Seneca’s design choices

• A finite state machine description of the Seneca remote mirroring protocol

• Checking the correctness of Seneca with simulations