Strong and Efficient Consistency with Consistency-Aware Durability
Aishwarya Ganesan, Ram Alagappan, Andrea Arpaci-Dusseau, and Remzi Arpaci-Dusseau

  • Strong and Efficient Consistency with Consistency-Aware Durability

    Aishwarya Ganesan, Ram Alagappan, Andrea Arpaci-Dusseau, and Remzi Arpaci-Dusseau

  • Distributed Storage Systems


  • Consistency Models in Distributed Systems

    What does a read see given a previous set of reads and writes?

    Spectrum from strong to weak: linearizability, sequential, causal+, causal, PRAM, monotonic reads, eventual

    Well studied and understood!

  • Durability Models

    Unlike consistency models, durability models have received scant attention!

    A durability model defines how writes are replicated and persisted

    The durability model influences consistency

    It also determines performance

    Despite this importance, often overlooked!

  • Two Widely Used Durability Models

    Immediate durability

    client sends a write to the nodes (node-1, node-2, node-3)

    the write is replicated and persisted (fsync on each node) before the client receives the ack

    enables strong consistency, but too slow!

    Eventual durability

    client sends a write and receives the ack right away

    the nodes lazily replicate and persist (fsync) the write afterwards

    fast, but enables only weak consistency due to data loss upon failures!
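The difference between the two models can be made concrete with a small in-memory simulation. This is only a sketch under stated assumptions: the `Node` class, the function names, and the choice to persist on all three nodes (rather than a majority) are illustrative and do not come from the talk.

```python
class Node:
    """A toy replica: `memory` is volatile, `disk` survives a crash."""

    def __init__(self, name):
        self.name = name
        self.memory = []
        self.disk = []

    def buffer(self, value):
        self.memory.append(value)

    def fsync(self):
        # Persist everything buffered so far.
        self.disk.extend(self.memory)
        self.memory.clear()

    def crash(self):
        # A crash loses whatever was only in memory.
        self.memory.clear()

    def contents(self):
        return self.disk + self.memory


def immediate_write(nodes, value):
    """Replicate and persist on every node, then acknowledge."""
    for node in nodes:
        node.buffer(value)
        node.fsync()
    return "ack"


def eventual_write(nodes, value):
    """Buffer on the first node only and acknowledge right away."""
    nodes[0].buffer(value)
    return "ack"


def background_sync(nodes):
    """Lazy replication and persistence, run some time after the ack."""
    primary = nodes[0]
    pending = list(primary.memory)
    for node in nodes[1:]:
        for value in pending:
            node.buffer(value)
        node.fsync()
    primary.fsync()


def make_cluster():
    return [Node("node-1"), Node("node-2"), Node("node-3")]
```

With immediate durability an acknowledged write survives a crash of every node. With eventual durability an acknowledged write is lost if the first node crashes before `background_sync` runs.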

  • Is it possible for a durability layer to enable both strong consistency and high performance?

  • CAD: Consistency-aware Durability

    Design the durability layer by taking the consistency model into account

    Intuition: what a read sees is important for most consistency models

    Key idea: CAD shifts the point of durability from writes to reads; data is replicated and persisted before a read is served

    durability of writes is delayed → high performance

    data durable before it is read → strong consistency even under failures

    some writes may be lost if failures arise before they are read; still useful for the many systems that use eventual durability

    We show the efficacy of CAD by providing cross-client monotonic reads, a new and strong consistency property

  • Results

    ORCA: CAD and cross-client monotonic reads for leader-based systems, implemented in ZooKeeper

    Compared to strongly consistent ZooKeeper, ORCA is 1.6–3.3x faster by using CAD

    higher read throughput by allowing reads at many nodes

    reduces latency in geo-distributed settings by 14x

    Compared to weakly consistent ZooKeeper, ORCA provides similar throughput and latency, but with stronger guarantees

    Experimentally show ORCA’s guarantees under failures and their usefulness for apps

  • Outline

    Introduction

    Motivation

    CAD and cross-client monotonic reads

    ORCA design

    Results

    Summary and conclusion


  • Consistency Models and Guarantees

    Example: I’m bored at FAST and want to go home!

    Linearizability

    latest data: no staleness

    in-order reads across clients

    Weaker models

    stale reads

    out-of-order reads across clients, even with monotonic reads and causal consistency

  • Realizing Strong Consistency

    Linearizability requires immediate durability: must synchronously replicate and persist data on a majority to tolerate failures

    [Figure: leader S1 and followers S2, S3, each holding a0; update a1 is replicated and persisted. Legend: on disk / in memory; durable = on disk on a majority]

    Poor performance due to synchronous operations: 10x slower within data center

  • Realizing Weaker Models

    Weaker models only require eventual durability: data buffered on one node, replication and persistence in background

    [Figure: servers S1, S2, S3 each holding a0; update a1 is buffered on one server and is absent in later frames; app session-1 and app session-2 read from the servers]

    out-of-order across clients: valid under causal and monotonic reads, but confusing semantics

    Many deployments prefer eventual durability for performance; in fact, it is the default (e.g., MongoDB, Redis)

    Thus settle for weak consistency
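One reading of the figure on this slide is the following sequence of events, sketched here as a few lines of Python. The per-server logs, the crash of S1, and the choice of which server each session reads from are assumptions made for illustration; the slide only names the servers, the values a0 and a1, and the two sessions.

```python
def run_scenario():
    # Each server's log, oldest first. All three start with a0.
    logs = {"S1": ["a0"], "S2": ["a0"], "S3": ["a0"]}

    # A write of a1 is acknowledged after being buffered on S1 only.
    logs["S1"].append("a1")

    # Session 1 reads from S1 and sees the new value.
    seen_by_session_1 = logs["S1"][-1]

    # S1 crashes before a1 is replicated or persisted, so a1 is lost.
    logs["S1"].pop()

    # Session 2 starts later and reads from S2.
    seen_by_session_2 = logs["S2"][-1]
    return seen_by_session_1, seen_by_session_2


print(run_scenario())
```

Each session on its own never goes backwards, so per-session monotonic reads hold, yet the later session observes an older state than the earlier one did.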

  • Immediate durability enables strong consistency but is slow

    Eventual durability is fast but enables only weaker consistency


  • Consistency-aware Durability

    Most consistency models care about what reads see

    Key idea: CAD shifts the point of durability from writes to reads

    delay durability of writes: a client write to S1, S2, S3 is acknowledged before it is durable → good performance

    make data durable before serving reads → prevents out-of-order data across failures → strong consistency

    CAD does not always incur overheads on reads: reads do not immediately follow writes, which is natural in many workloads

    common case: data already durable well before applications access it
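The key idea can be sketched as a toy store in which writes are acknowledged from memory and a read forces durability only when the item it touches is not yet durable. The class `CadStore`, its fields, and the three-replica layout are hypothetical names chosen for this illustration, not the authors' implementation.

```python
class CadStore:
    """Toy key-value store where durability is enforced on reads."""

    def __init__(self, replicas=3):
        self.memory = {}  # volatile buffer of acknowledged writes
        self.disks = [dict() for _ in range(replicas)]  # persistent state
        self.sync_count = 0  # synchronous durability rounds forced by reads

    def write(self, key, value):
        # Acknowledge immediately; durability is delayed.
        self.memory[key] = value
        return "ack"

    def _make_durable(self):
        # Stand-in for replicating and persisting all buffered writes.
        for disk in self.disks:
            disk.update(self.memory)
        self.memory.clear()

    def background_flush(self):
        # The common case: data becomes durable before anyone reads it.
        self._make_durable()

    def read(self, key):
        if key in self.memory:
            # Not yet durable: pay the cost now, before serving.
            self.sync_count += 1
            self._make_durable()
        return self.disks[0].get(key)

    def crash(self):
        # Only non-durable (and therefore never-read) writes are lost.
        self.memory.clear()
```

A write that was never read can disappear in a crash, but any value a read has returned is already on every replica's disk in this sketch, so a later read cannot observe an older state.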

  • Cross-client Monotonic Reads upon CAD

    A read from a client is guaranteed to return at least the latest state returned to a previous read from any client

    [Figure: states a0, a1, a2; client-1 reads a2, then client-2 also reads a2]

    Holds even in the presence of failures and across client sessions

    No existing model provides this guarantee except linearizability, but not with high performance

    CAD enables this property with high performance

    Like many weaker models, it does not prevent staleness

    However, it avoids out-of-order data, which is useful in many app scenarios (e.g., location sharing, Twitter timelines)
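The property can be stated as a check over a history of completed reads. The sketch below assumes states are numbered (a0 = 0, a1 = 1, a2 = 2) and that the history is listed in the order the reads completed; both function names are made up for this example.

```python
def cross_client_monotonic(reads):
    """reads: list of (client, version) in the order the reads completed."""
    latest = None
    for _client, version in reads:
        if latest is not None and version < latest:
            return False  # some client saw a newer state earlier
        latest = version if latest is None else max(latest, version)
    return True


def per_client_monotonic(reads):
    """Ordinary monotonic reads: each client is checked in isolation."""
    latest = {}
    for client, version in reads:
        if client in latest and version < latest[client]:
            return False
        latest[client] = max(latest.get(client, version), version)
    return True


out_of_order = [("client-1", 2), ("client-2", 1)]
in_order = [("client-1", 2), ("client-2", 2)]
stale_but_ordered = [("client-1", 1), ("client-2", 1)]
```

The `out_of_order` history passes the per-client check but fails the cross-client one. The `stale_but_ordered` history passes both, matching the slide's point that staleness is allowed while out-of-order reads are not.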


  • ORCA

    Implementation of consistency-aware durability and cross-client monotonic reads in leader-based majority systems

    Leader-based systems (e.g., MongoDB, ZooKeeper)

    leader: a dedicated node; others are followers

    writes flow through the leader, which establishes a single order

    Majority

    data is safe when persisted on a majority of nodes (e.g., 3 out of 5 servers)
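The majority rule is simple arithmetic, shown here as a short sketch. The function names are illustrative only.

```python
def majority(cluster_size):
    """Smallest number of nodes that forms a majority."""
    return cluster_size // 2 + 1


def is_safe(persisted_on, cluster_size):
    """Data is safe once it is persisted on a majority of nodes."""
    return persisted_on >= majority(cluster_size)


def tolerated_failures(cluster_size):
    """Nodes that can fail while a majority still survives."""
    return cluster_size - majority(cluster_size)
```

For the slide's example of 5 servers, `majority(5)` is 3, so data persisted on 3 servers is safe and up to 2 servers can fail.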

  • ORCA Write Path

    Same as an eventually durable system

    replication and persistence in background

    [Figure: leader S1 and followers S2, S3, each holding a; update b is first buffered on the leader, then replicated and persisted in the background. Legend: on disk / in memory; durable = on disk on a majority]

  • ORCA Read Path

    Durable-index: index of the latest durable item in the system

    Update-index of item i: index of the last update that modified i

    Durability check: i is durable if update-index of i ≤ durable-index of system

    Read of a: durable-index: 1, a’s update-index: 1 → a is durable; serve read immediately

    Read of b: durable-index: 1, b’s update-index: 2 → b is not durable; make b durable before serving

    [Figure: leader S1 and followers S2, S3, each holding a; b is in memory on the leader, then replicated and persisted on S2 and S3. Legend: on disk / in memory; durable = on disk on a majority]

  • ORCA Read Path

    19

    durable-index:1

    a’s update-index:1

    a is durable

    serve read immediately

    durable-index:1

    b’s update-index: 2

    b is not durable

    make b durable before serving

    on disk

    in memory

    durable = on disk on

    a majority

    Durable-index – index of the latest durable item in the systemUpdate-index of item i -- index of the last update that modified iDurability check – i durable if update-index of i ≤ durable-index of system

    a

    a

    a

    bleader – S1

    S2

    S3

    a

    a

    a

    b

    b

    b

    b

    b

    b

    a

    a

    a

    b

    b

    b

    b

    b

    b

  • ORCA Read Path

    19

    durable-index:1

    a’s update-index:1

    a is durable

    serve read immediately

    durable-index:1

    b’s update-index: 2

    b is not durable

    make b durable before serving

    on disk

    in memory

    durable = on disk on

    a majority

    Durable-index – index of the latest durable item in the systemUpdate-index of item i -- index of the last update that modified iDurability check – i durable if update-index of i ≤ durable-index of system

    a

    a

    a

    bleader – S1

    S2

    S3

    a

    a

    a

    b

    b

    b

    b

    b

    b

    a

    a

    a

    b

    b

    b

    b

    b

    b

  • ORCA Read Path

    19

    durable-index:1

    a’s update-index:1

    a is durable

    serve read immediately

    durable-index:1

    b’s update-index: 2

    b is not durable

    make b durable before serving

    on disk

    in memory

    durable = on disk on

    a majority

    Durable-index – index of the latest durable item in the systemUpdate-index of item i -- index of the last update that modified iDurability check – i durable if update-index of i ≤ durable-index of system

    a

    a

    a

    bleader – S1

    S2

    S3

    a

    a

    a

    b

    b

    b

    b

    b

    b

    a

    a

    a

    b

    b

    b

    b

    b

    b

    leader – S2

  • ORCA Read Path

    19

    Durable-index: index of the latest durable item in the system

    Update-index of item i: index of the last update that modified i

    Durability check: i is durable if update-index of i ≤ durable-index of system

    durable = on disk on a majority

    Read of a: durable-index is 1 and a’s update-index is 1, so a is durable; serve read immediately

    Read of b: durable-index is 1 and b’s update-index is 2, so b is not durable; make b durable before serving

    [Figure: servers S1 (leader), S2, and S3 holding items a and b, with a legend distinguishing on-disk from in-memory copies; the last frame labels S2 as leader]
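The durability check on this slide can be sketched in a few lines. This is a minimal illustration rather than ORCA's actual code: the `Leader` class, its method names, and the way `make_durable` simply advances the durable-index (standing in for replication and persistence on a majority) are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Leader:
    """Toy leader that tracks a durable-index and per-item update-indexes."""
    durable_index: int = 0                              # index of the latest durable update
    last_index: int = 0                                 # index of the latest update received
    update_index: dict = field(default_factory=dict)    # item -> index of its last update
    values: dict = field(default_factory=dict)          # item -> latest value

    def write(self, item, value):
        # The write is only buffered here; it is not durable yet.
        self.last_index += 1
        self.update_index[item] = self.last_index
        self.values[item] = value

    def make_durable(self, up_to):
        # Hypothetical stand-in for "replicate and persist on a majority".
        self.durable_index = max(self.durable_index, up_to)

    def is_durable(self, item):
        # Durability check from the slide: update-index of i <= durable-index.
        return self.update_index[item] <= self.durable_index

    def read(self, item):
        if self.is_durable(item):
            return self.values[item], "served immediately"
        # Non-durable item: enforce durability before exposing it to a reader.
        self.make_durable(self.update_index[item])
        return self.values[item], "made durable before serving"


leader = Leader()
leader.write("a", "A")      # a's update-index: 1
leader.make_durable(1)      # durable-index: 1
leader.write("b", "B")      # b's update-index: 2, not durable yet
print(leader.read("a"))
print(leader.read("b"))
print(leader.read("b"))
```

The second read of b takes the fast path because the first read already advanced the durable-index past b's update-index.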

  • Cross-Client Monotonic Reads in ORCA

    If reads are restricted to the leader, CAD provides cross-client monotonic reads, but this is not scalable

    Allow reads at followers: lagging followers could cause out-of-order states, so CAD alone is not sufficient

    20

    [Figure: leader S1 and servers S2, S3, S4, and S5 holding items a1, a2, and b across successive frames]

    Additional mechanisms: active sets (a lease-based mechanism), not covered in this talk
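The problem with lagging followers can be illustrated with a small check for cross-client monotonic reads. This is a sketch under assumed state, not the scenario from the figure: the `applied` map (S1 to S3 at update 2, S4 and S5 still at update 1) and the helper names are hypothetical, and the active-set mechanism itself is not modeled.

```python
def first_violation(observed):
    """Return the position of the first read that goes backwards, or None.

    `observed` is the sequence of update-indexes seen by successive reads in
    real-time order, regardless of which client issued each read.
    """
    highest = -1
    for pos, idx in enumerate(observed):
        if idx < highest:
            return pos
        highest = idx
    return None


# Hypothetical cluster state: index of the latest update applied at each server.
applied = {"S1": 2, "S2": 2, "S3": 2, "S4": 1, "S5": 1}


def read_at(server):
    # A read returns whatever state the contacted server has applied.
    return applied[server]


# Two clients reading only at the leader never observe an older state.
leader_only = [read_at("S1"), read_at("S1")]

# One client reads at an up-to-date follower, then another client reads at a
# lagging follower and observes an older state.
with_followers = [read_at("S2"), read_at("S4")]

print(first_violation(leader_only))
print(first_violation(with_followers))
```

The second sequence is flagged even though the update is durable on a majority (S1, S2, S3), which is why durability alone does not rule out this anomaly.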

  • Outline

    Introduction

    Motivation

    CAD and cross-client monotonic reads

    ORCA design

    Results

    Summary and conclusion

    21

  • Evaluation

    Implemented in ZooKeeper

    Evaluate different durability models in isolation: compare CAD against immediate and eventual durability

    Evaluate overall system performance: compare ORCA against strongly and weakly consistent ZooKeeper

    22
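The three durability models being compared differ in when a write becomes durable relative to when it is acknowledged or read. The toy model below is one way to picture that difference; the `Store` class, its `mode` strings, and the single-process "flush" that stands in for replication and persistence on a majority are assumptions for illustration, not the ZooKeeper implementation.

```python
class Store:
    """Toy store that counts synchronous durability operations on the critical path."""

    def __init__(self, mode):
        if mode not in ("immediate", "eventual", "cad"):
            raise ValueError(mode)
        self.mode = mode
        self.buffer = []            # written but not yet durable
        self.durable = []           # stand-in for "on disk on a majority"
        self.critical_syncs = 0     # flushes a client had to wait for

    def _flush(self, on_critical_path):
        if self.buffer:
            self.durable.extend(self.buffer)
            self.buffer.clear()
            if on_critical_path:
                self.critical_syncs += 1

    def write(self, item):
        self.buffer.append(item)
        if self.mode == "immediate":
            self._flush(on_critical_path=True)   # durable before acknowledging
        return "ack"

    def read(self, item):
        if self.mode == "cad" and item in self.buffer:
            self._flush(on_critical_path=True)   # make durable before serving
        return item in self.buffer or item in self.durable

    def background_flush(self):
        self._flush(on_critical_path=False)      # lazy durability, off the critical path


for mode in ("immediate", "eventual", "cad"):
    store = Store(mode)
    store.write("x")
    store.write("y")
    store.read("x")
    store.background_flush()
    print(mode, store.critical_syncs)
```

In this model immediate durability pays on every write, eventual durability never pays on the critical path but can serve non-durable data, and CAD pays only when a read touches data that is not yet durable.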

  • CAD Durability Layer Performance

    23

    YCSB-A: 50% W, 50% R

    [Figure: Write Latency Distribution; CDF (0 to 100) against latency (us, 0 to 1500) for immediate, eventual, and cad]

    CAD writes faster than immediate durability

    CAD matches performance of eventual

    [Figure: Read Latency Distribution; CDF (0 to 100) against latency (us, 0 to 1500) for immediate, eventual, and cad; annotations: "reads queued behind writes" and "5%"]

    Most reads in CAD fast

    Only 5% slow due to synchronous ops

    CAD performs similar to eventual and is faster than immediate
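For readers less familiar with latency CDFs, the sketch below shows how such a curve and a "percent of slow requests" figure can be derived from raw samples. The function names and the 500 us threshold are assumptions, and the sample latencies are synthetic values invented purely for illustration; they are not measurements from the talk.

```python
def empirical_cdf(samples):
    """Return (latency, cumulative percent) points for a list of latency samples."""
    ordered = sorted(samples)
    n = len(ordered)
    return [(x, 100.0 * (i + 1) / n) for i, x in enumerate(ordered)]


def percent_slower_than(samples, threshold_us):
    """Percentage of samples whose latency exceeds the threshold."""
    slow = sum(1 for s in samples if s > threshold_us)
    return 100.0 * slow / len(samples)


# Synthetic read latencies in microseconds (made up): 19 fast reads, 1 slow read.
reads_us = [100] * 19 + [1200]

print(percent_slower_than(reads_us, 500))
print(empirical_cdf(reads_us)[-2:])
```

A long flat tail at the top of the CDF, as in this synthetic data, corresponds to a small fraction of requests that wait on a synchronous operation.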

  • ORCA System Performance

    24

    Strong-ZK: uses immediate durability, reads only at leader

    Weak-ZK: uses eventual durability, reads at many nodes

    ORCA: uses CAD, reads at many nodes

    [Figure: throughput (KOps/s, axis 0 to 50) of strong-ZK, weak-ZK, and orca on workloads A (50% R, 50% W), B (95% R, 5% W), D (95% R, 5% W), and F (66.7% R, 33.3% W); bar labels as extracted, in workload order: 3.78, 2.09, 2.09, 3.44 and 3.28, 1.97, 1.75, 3.04]

    Strong-ZK performs poorly due to immediate durability and leader-restricted reads

    Weak-ZK performs well due to eventual durability and scalable reads

    ORCA adds little overhead compared to weak-ZK: only reads that access non-durable data

  • More experiments in the paper…

    Evaluation: correctness testing using a cluster crash-testing framework, geo-replicated setting, micro-benchmarks

    Application case studies: location-tracking, social-media timeline

    25

  • Summary and Conclusions

    Surprisingly, durability models are overlooked

    Immediate durability enables strong consistency but is slow

    Eventual durability is fast but only enables weak consistency

    CAD – consistency-aware durability, a new way of thinking about durability

    enables both strong consistency and high performance

    CAD is useful for many deployments that currently adopt eventual durability

    Consistency and performance are seemingly at odds; by carefully examining the underlying layer, we can achieve both

    Thank you!

    26