RICON keynote: outwards from the middle of the maze

Posted on 14-Jun-2015


slides from my RICON keynote

Transcript of RICON keynote: outwards from the middle of the maze

Outwards from the middle of the maze

Peter Alvaro, UC Berkeley

Outline

1.  Mourning the death of transactions
2.  What is so hard about distributed systems?
3.  Distributed consistency: managing asynchrony
4.  Fault-tolerance: progress despite failures

The transaction concept

DEBIT_CREDIT:
    BEGIN_TRANSACTION;
    GET MESSAGE;
    EXTRACT ACCOUNT_NUMBER, DELTA, TELLER, BRANCH FROM MESSAGE;
    FIND ACCOUNT(ACCOUNT_NUMBER) IN DATA BASE;
    IF NOT_FOUND | ACCOUNT_BALANCE + DELTA < 0 THEN
        PUT NEGATIVE RESPONSE;
    ELSE DO;
        ACCOUNT_BALANCE = ACCOUNT_BALANCE + DELTA;
        POST HISTORY RECORD ON ACCOUNT (DELTA);
        CASH_DRAWER(TELLER) = CASH_DRAWER(TELLER) + DELTA;
        BRANCH_BALANCE(BRANCH) = BRANCH_BALANCE(BRANCH) + DELTA;
        PUT MESSAGE ('NEW BALANCE =' ACCOUNT_BALANCE);
        END;
    COMMIT;


The “top-down” ethos


Transactions: a holistic contract

[Diagram: the Application issues Writes and Reads against an Opaque store through the Transactions interface. Application-level invariant: Assert: balance > 0]

Incidental complexities

•  The “Internet.” Searching it.
•  Cross-datacenter replication schemes
•  CAP Theorem
•  Dynamo & MapReduce
•  “Cloud”

Fundamental complexity

“[…] distributed systems require that the programmer be aware of latency, have a different model of memory access, and take into account issues of concurrency and partial failure.”

Jim Waldo et al., A Note on Distributed Computing (1994)

A holistic contract… stretched to the limit

[Diagram: the Application / Opaque store / Transactions picture again, with the Write/Read contract stretched]

Are you blithely asserting that transactions aren’t webscale?

Some people just want to see the world burn. Those same people want to see the world use inconsistent databases.

- Emin Gun Sirer

Alternative to top-down design?

The “bottom-up,” systems tradition: Simple, reusable components first. Semantics later.

Alternative: the “bottom-up,” systems ethos

The “bottom-up” ethos


“‘Tis a fine barn, but sure ‘tis no castle, English”

The “bottom-up” ethos

Simple, reusable components first. Semantics later. This is how we live now. Question: Do we ever get those application-level guarantees back?

Low-level contracts

[Diagram: the Application issues Writes and Reads against a distributed store (KVS)]

Assert: balance > 0

R1(X=1)  R2(X=1)  W1(X=2)  W2(X=0)
W1(X=1)  W1(Y=2)  R2(Y=2)  R2(X=0)

causal? PRAM? delta? fork/join? red/blue? release?

When do contracts compose?

[Diagram: the Application over a composed Distributed service. Assert: balance > 0]

“Ew, did I get mongo in my riak?”

Composition is the last hard problem

Composing modules is hard enough. We must learn how to compose guarantees.

Outline

1.  Mourning the death of transactions
2.  What is so hard about distributed systems?
3.  Distributed consistency: managing asynchrony
4.  Fault-tolerance: progress despite failures

Why distributed systems are hard²

Asynchrony Partial Failure

Fundamental Uncertainty

Asynchrony isn’t that hard

Amelioration: logical timestamps, deterministic interleaving

Partial failure isn’t that hard

Amelioration: replication, replay

(asynchrony × partial failure) = hard²

Logical timestamps / deterministic interleaving
Replication / replay

Tackling one clown at a time

Poor strategy for programming distributed systems.
Winning strategy for analyzing distributed programs.

 

Outline

1.  Mourning the death of transactions
2.  What is so hard about distributed systems?
3.  Distributed consistency: managing asynchrony
4.  Fault-tolerance: progress despite failures

Distributed consistency

Today: A quick summary of some great work.

Consider a (distributed) graph

[Figure: a graph of nodes T1–T14]

Partitioned, for scalability

[Figure: the graph partitioned across machines]

Replicated, for availability

[Figure: each partition of the graph replicated]

Deadlock detection

Task: Identify strongly-connected components

Waits-for graph

[Figure: waits-for graph over transactions T1–T14]

Garbage collection

Task: Identify nodes not reachable from Root.

Refers-to graph

[Figure: refers-to graph over nodes T1–T14 with a distinguished Root node]

Correctness

Deadlock detection
•  Safety: no false positives
•  Liveness: identify all deadlocks

Garbage collection
•  Safety: never GC live memory!
•  Liveness: GC all orphaned memory

[Figures: the waits-for and refers-to graphs from before]

Consistency at the extremes

[Spectrum: Storage – Object – Flow – Language – Application]

Linearizable key-value store?    Custom solutions?

Efficient · Correct

Object-level consistency

Capture semantics of data structures that
•  allow greater concurrency
•  maintain guarantees (e.g., convergence)

[Spectrum: Storage – Object – Flow – Language – Application]

Object-level consistency

[Diagram: Insert and Read operations against a convergent data structure (e.g., a Set CRDT)]

Commutativity, associativity, and idempotence make the object tolerant to reordering, batching, and retry/duplication.
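A minimal sketch (not part of the original slides) of such an object in Python: a grow-only set whose merge is set union, hence commutative, associative, and idempotent.

# Illustrative sketch (not from the talk): a grow-only set (G-Set) CRDT.
# Merge is set union: commutative, associative, and idempotent, so replicas
# converge regardless of message reordering, batching, or retry/duplication.

class GSet:
    def __init__(self, elems=()):
        self.elems = frozenset(elems)

    def insert(self, x):
        return GSet(self.elems | {x})

    def merge(self, other):
        return GSet(self.elems | other.elems)

    def read(self):
        return set(self.elems)

# Two replicas see the same inserts in different orders, with a duplicate:
a = GSet().insert(1).insert(2).insert(2)
b = GSet().insert(2).insert(1)
assert a.merge(b).read() == b.merge(a).read() == {1, 2}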

Object-level composition?

[Diagram: the Application built on convergent data structures]

Assert: graph replicas converge
GC — Assert: no live nodes are reclaimed
?   ?

Flow-level consistency

[Spectrum: Storage – Object – Flow – Language – Application]

Capture semantics of data in motion
•  Asynchronous dataflow model
•  Component properties → system-wide guarantees

[Dataflow diagram: Transaction manager, Graph store, Transitive closure, Deadlock detector]

Flow-level consistency

Order-insensitivity (confluence):

output set = f(input set)

A confluent component produces the same output set for a given input set, no matter the order or batching in which the inputs arrive: {  } = {  }
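A small illustration (not part of the original slides): a set-in, set-out operator such as transitive closure is confluent, because any arrival order of the same inputs yields the same output set.

# Illustrative sketch (not from the talk): confluence as order-insensitivity.
# A confluent operator maps an input *set* to an output *set*, so any arrival
# order (or batching) of the same inputs yields the same result.

def transitive_closure(edges):
    """Pure set-in / set-out operator: reachability over a set of edges."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

stream1 = [("T1", "T2"), ("T2", "T3"), ("T3", "T1")]
stream2 = [("T3", "T1"), ("T1", "T2"), ("T2", "T3")]   # same edges, other order
assert transitive_closure(set(stream1)) == transitive_closure(set(stream2))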

Confluence is compositional

output set = f ∘ g(input set)

The composition of confluent components is itself confluent.
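Continuing the sketch above (not part of the original slides): composing two set-in, set-out operators preserves order-insensitivity end to end.

# Illustrative sketch (not from the talk): composing confluent operators.

def compose(f, g):
    return lambda xs: f(g(xs))

g = lambda nums: {n * n for n in nums}             # square every element
f = lambda nums: {n for n in nums if n % 2 == 0}   # keep the even squares

pipeline = compose(f, g)
stream1, stream2 = [1, 2, 3, 4], [4, 3, 2, 1]       # same inputs, other order
assert pipeline(set(stream1)) == pipeline(set(stream2)) == {4, 16}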

Graph queries as dataflow

[Dataflow diagram: Transaction manager, Graph store, Transitive closure, Deadlock detector — confluent, confluent, confluent]

[Dataflow diagram: Memory allocator, Graph store, Transitive closure, Garbage collector — confluent, confluent, not confluent]

Coordination: what is that?

Coordinate here: at the non-confluent garbage collector in the second dataflow.

Strategy 1: Establish a total order
Strategy 2: Establish a producer–consumer barrier
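A rough sketch (not part of the original slides) of Strategy 2: the non-confluent garbage collector waits for a barrier — a seal promising that no more edges will arrive — before it reclaims anything.

# Illustrative sketch (not from the talk): a producer–consumer barrier.
# The garbage collector is non-confluent (a later edge could invalidate its
# answer), so it only runs once the producer seals its input.

def reachable(root, edges):
    seen, frontier = {root}, [root]
    while frontier:
        n = frontier.pop()
        for (a, b) in edges:
            if a == n and b not in seen:
                seen.add(b)
                frontier.append(b)
    return seen

class GarbageCollector:
    def __init__(self, root):
        self.root, self.edges, self.nodes, self.sealed = root, set(), set(), False

    def add_edge(self, a, b):
        assert not self.sealed, "input already sealed"
        self.edges.add((a, b))
        self.nodes.update((a, b))

    def seal(self):           # the barrier: producers promise no more input
        self.sealed = True

    def collect(self):
        if not self.sealed:
            return None       # refuse to act on incomplete input
        return self.nodes - reachable(self.root, self.edges)

gc = GarbageCollector("Root")
gc.add_edge("Root", "T1"); gc.add_edge("T1", "T2"); gc.add_edge("T3", "T4")
gc.seal()
print(gc.collect())           # {'T3', 'T4'}: unreachable from Root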

Fundamental costs: FT via replication

[Diagram: the all-confluent deadlock-detection dataflow, replicated — (mostly) free!]

[Diagram: the garbage-collection dataflow, replicated — the non-confluent garbage collector needs Paxos (global synchronization!) or a Barrier]

The first principle of successful scalability is to batter the consistency mechanisms down to a minimum. – James Hamilton  

Language-level consistency

DSLs for distributed programming?
•  Capture consistency concerns in the type system

[Spectrum: Storage – Object – Flow – Language – Application]

Language-level consistency

CALM Theorem: monotonic → confluent

A conservative, syntactic test for confluence

Language-level consistency

Deadlock detector: monotonic
Garbage collector: nonmonotonic
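A small illustration (not part of the original slides) of why the two differ: a monotone query's output only grows as input arrives, so results can be emitted early; a non-monotone query like "not reachable" can be retracted by later input.

# Illustrative sketch (not from the talk): monotonic vs. non-monotonic queries.
# Monotone: more input can only add to the output (safe to stream results).
# Non-monotone: more input can retract earlier output (must wait / coordinate).

def deadlocked(waits_for):
    """Nodes on a waits-for cycle. Adding edges can only create more cycles."""
    on_cycle = set()
    for (a, b) in waits_for:
        seen, frontier = set(), [b]          # walk forward from b back to a
        while frontier:
            n = frontier.pop()
            if n == a:
                on_cycle.update({a, b})
                break
            if n not in seen:
                seen.add(n)
                frontier.extend(d for (c, d) in waits_for if c == n)
    return on_cycle

def garbage(nodes, refers_to, root="Root"):
    """'Not reachable from Root': a later edge can rescue a 'garbage' node."""
    reach, frontier = {root}, [root]
    while frontier:
        n = frontier.pop()
        for (a, b) in refers_to:
            if a == n and b not in reach:
                reach.add(b)
                frontier.append(b)
    return nodes - reach

print(deadlocked({("T1", "T2"), ("T2", "T1")}))        # {'T1', 'T2'}
print(garbage({"T1", "T2"}, {("Root", "T1")}))         # {'T2'} ... for now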

Let’s review

•  Consistency is tolerance to asynchrony
•  Tricks:
   –  focus on data in motion, not at rest
   –  avoid coordination when possible
   –  choose coordination carefully otherwise

(Tricks are great, but tools are better)

Outline

1.  Mourning the death of transactions
2.  What is so hard about distributed systems?
3.  Distributed consistency: managing asynchrony
4.  Fault-tolerance: progress despite failures

Grand challenge: composition

Hard problem: Is a given component fault-tolerant?
Much harder: Is this system (built up from components) fault-tolerant?

Example: Atomic multi-partition update

[Figure: the partitioned graph; an atomic update spans multiple partitions]

Two-phase commit

Example: replication

[Figure: the graph replicated across two replicas]

Reliable broadcast

Popular wisdom: don’t reinvent

Example: Kafka replication bug

Three “correct” components:
1.  Primary/backup replication
2.  Timeout-based failure detectors
3.  Zookeeper

One nasty bug: Acknowledged writes are lost

A guarantee would be nice

Bottom-up approach:
•  Use formal methods to verify individual components (e.g., protocols)
•  Build systems from verified components

Shortcomings:
•  Hard to use
•  Hard to compose

[Chart: investment vs. returns]

Bottom-up assurances

Formal verification

Program · Environment · Correctness Spec

Composing bottom-up assurances

Issue 1: incompatible failure models (e.g., crash failures vs. omissions)
Issue 2: specs do not compose (FT is an end-to-end property)

“If you take 10 components off the shelf, you are putting 10 world views together, and the result will be a mess.” -- Butler Lampson

Top-down “assurances”

Testing
Fault injection

End-to-end testing would be nice

Top-down approach:
•  Build a large-scale system
•  Test the system under faults

Shortcomings:
•  Hard to identify complex bugs
•  Fundamentally incomplete

[Chart: investment vs. returns]

Lineage-driven fault injection

Goal: top-down testing that
•  finds all of the fault-tolerance bugs, or
•  certifies that none exist

Lineage-driven fault injection

[Diagram: Molly, driven by a Correctness Specification and “malevolent sentience”]

Lineage-driven fault injection (LDFI)

Approach: think backwards from outcomes
Question: could a bad thing ever happen?
Reframe:
•  Why did a good thing happen?
•  What could have gone wrong along the way?

Thomasina: What a faint-heart! We must work outward from the middle of the maze. We will start with something simple.

The game

•  Both players agree on a failure model
•  The programmer provides a protocol
•  The adversary observes executions and chooses failures for the next execution (a sketch of this loop follows below)
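A schematic sketch (not part of the original slides) of that game loop in Python; run_protocol, lineage_to_cnf, and pick_failures are hypothetical stand-ins for the real machinery.

# Illustrative sketch (not from the talk): the LDFI game as a loop.
# The three callables are hypothetical placeholders, not a real API.

def play(run_protocol, lineage_to_cnf, pick_failures, max_rounds=10):
    failures = frozenset()                  # adversary starts with no failures
    for _ in range(max_rounds):
        good, lineage = run_protocol(failures)
        if not good:
            return "counterexample", failures      # the adversary wins
        cnf = lineage_to_cnf(lineage)              # ways to break every proof
        failures = pick_failures(cnf)              # the adversary's next move
        if failures is None:
            return "certified", None               # no failure set can break it
    return "undecided", None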

Dedalus: it’s about data

log(B, “data”)@5

What: the log fact carrying some data (“data”)
Where: node B
When: timestamp 5

Dedalus: it’s like Datalog

consequence :- premise[s]

log(Node, Pload) :- bcast(Node, Pload);

(Which is like SQL)

create view log as
select Node, Pload from bcast;

Dedalus: it’s about time

consequence@when :- premise[s]

node(Node, Neighbor)@next :- node(Node, Neighbor);
(state change)

log(Node2, Pload)@async :- bcast(Node1, Pload),
                           node(Node1, Node2);
(communication; natural join on bcast.Node1 == node.Node1)

The match

Protocol: Reliable broadcast

Specification:
Pre: a correct process delivers a message m
Post: all correct processes deliver m

Failure model:
(Permanent) crash failures; message loss / partitions

(A predicate form of this spec is sketched below.)
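A minimal sketch (not part of the original slides) of the spec as a Python predicate over the processes' final logs.

# Illustrative sketch (not from the talk): the reliable-broadcast spec as a
# predicate over final logs, given the set of correct (non-crashed) processes.

def reliable_broadcast_ok(logs, correct, m):
    """Pre: some correct process delivered m. Post: all correct ones did."""
    delivered = {p for p, log in logs.items() if m in log}
    if not (delivered & correct):
        return True                   # precondition unmet; holds vacuously
    return correct <= delivered

logs = {"a": {"data"}, "b": set(), "c": {"data"}}
print(reliable_broadcast_ok(logs, correct={"a", "b", "c"}, m="data"))  # False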

Round 1

node(Node, Neighbor)@next :- node(Node, Neighbor);
log(Node, Pload)@next :- log(Node, Pload);

log(Node, Pload) :- bcast(Node, Pload);

log(Node2, Pload)@async :- bcast(Node1, Pload),
                           node(Node1, Node2);

“An effort” delivery protocol

Round 1 in space / time

[Spacetime diagram: process a broadcasts at time 1; processes b and c log the message at time 2]

Round 1: Lineage

log(B, data)@5 ← log(B, data)@4 ← log(B, data)@3 ← log(B, data)@2 ← log(A, data)@1

log(Node, Pload)@next :- log(Node, Pload);
    log(B, data)@5 :- log(B, data)@4;

log(Node2, Pload)@async :- bcast(Node1, Pload),
                           node(Node1, Node2);
    log(B, data)@2 :- bcast(A, data)@1,
                      node(A, B)@1;

An execution is a (fragile) “proof” of an outcome

[Proof-tree diagram: the derivation of log(B, data)@5, traced back through rules r1, r2, r3 to log(A, data)@1 and node(A, B)@1]

(which required a message from A to B at time 1)

Valentine: “The unpredictable and the predetermined unfold together to make everything the way it is.”

Round 1: counterexample

The adversary wins!

[Spacetime diagram: a’s message to b is LOST; only c logs the message]

Round 2

Same as Round 1, but A retries.

bcast(N, P)@next :- bcast(N, P);

Round 2 in spacetime

[Spacetime diagram: a rebroadcasts at every timestep; b and c log the message again and again]

Round 2: Lineage

log(B, data)@5 ← log(B, data)@4 ← log(B, data)@3 ← log(B, data)@2 ← log(A, data)@1
(and, via the retries, also through log(A, data)@2, log(A, data)@3, log(A, data)@4)

log(Node, Pload)@next :- log(Node, Pload);
    log(B, data)@5 :- log(B, data)@4;

log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2);
    log(B, data)@3 :- bcast(A, data)@2, node(A, B)@2;

Retry provides redundancy in time

Traces are forests of proof trees

[Proof forest: four derivations of log(B, data)@5, one through each message from A to B (AB1, AB2, AB3, AB4); breaking them all requires AB1 ∧ AB2 ∧ AB3 ∧ AB4]

Round 2: counterexample

[Spacetime diagram: a’s first message to b is LOST, then a CRASHES; only c has logged the message]

The adversary wins!

Round 3

Same as in Round 2, but symmetrical.

bcast(N, P)@next :- log(N, P);

Round 3 in space / time

[Spacetime diagram: every process that has logged the message rebroadcasts it at every timestep; a, b, and c all log and relay]

Redundancy in space and time

Round 3: Lineage

log(B, data)@5 is supported by derivations through log(A, data), log(B, data), and log(C, data) at every earlier timestep, all rooted at log(A, data)@1.

Round 3

The programmer wins!

Let’s reflect

Fault-tolerance is redundancy in space and time.
Best strategy for both players: reason backwards from outcomes using lineage.
Finding bugs: find a set of failures that “breaks” all derivations.
Fixing bugs: add additional derivations.

The role of the adversary can be automated

1.  Break a proof by dropping any contributing message — a disjunction:
    (AB1 ∨ BC2)

2.  Find a set of failures that breaks all proofs of a good outcome — a conjunction of disjunctions (AKA CNF):
    (AB1 ∨ BC2) ∧ (AC1) ∧ (AC2)
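A brute-force sketch (not part of the original slides) of the adversary's move: given one clause per proof, find a small set of message drops that hits every clause.

# Illustrative sketch (not from the talk): the adversary's next move as a
# small hitting-set / CNF problem. Each clause lists the messages whose loss
# would break one proof; a set of drops hitting every clause breaks them all.

from itertools import combinations

def next_failure_set(clauses):
    msgs = sorted({m for clause in clauses for m in clause})
    for k in range(1, len(msgs) + 1):          # prefer the smallest set
        for drops in combinations(msgs, k):
            if all(set(clause) & set(drops) for clause in clauses):
                return set(drops)
    return None

# The CNF from the slides: (AB1 ∨ BC2) ∧ (AC1) ∧ (AC2)
print(next_failure_set([["AB1", "BC2"], ["AC1"], ["AC2"]]))
# e.g. {'AB1', 'AC1', 'AC2'}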

Molly, the LDFI prototype

Molly finds fault-tolerance violations quickly or guarantees that none exist.
Molly finds bugs by explaining good outcomes – then it explains the bugs.
Bugs identified: 2PC, 2PC-CTP, 3PC, Kafka
Certified correct: Paxos (synod), Flux, bully leader election, reliable broadcast

Commit protocols

Problem: atomically change things

Correctness properties:
1.  Agreement (all or nothing)
2.  Termination (something)

(Both are sketched as predicates below.)
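A minimal sketch (not part of the original slides) of those two properties as Python predicates over per-agent decisions.

# Illustrative sketch (not from the talk): commit-protocol correctness as
# predicates over decisions ('commit', 'abort', or None = undecided).

def agreement(decisions):
    """All or nothing: no two agents decide differently."""
    decided = [d for d in decisions.values() if d is not None]
    return len(set(decided)) <= 1

def termination(decisions, correct):
    """Something: every correct (non-crashed) agent eventually decides."""
    return all(decisions[a] is not None for a in correct)

decisions = {"a": "commit", "b": "commit", "d": None}
print(agreement(decisions))                     # True
print(termination(decisions, {"a", "b", "d"}))  # False: d never decided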

Two-phase commit

[Spacetime diagram: the Coordinator sends prepare to agents a, b, and d; the agents reply with votes; the Coordinator sends commit. Annotated: “Can I kick it?” — “YES YOU CAN” — “Well I’m gone”]

Two-phase commit

[Spacetime diagram: the Coordinator sends prepare and collects votes, then CRASHES before sending a decision; the agents are left waiting]

Violation: Termination

The collaborative termination protocol

Basic idea: agents talk amongst themselves when the coordinator fails.
Protocol: on timeout, ask the other agents about the decision.

2PC - CTP

[Spacetime diagram: the Coordinator sends prepare, collects votes, and CRASHES before deciding; agents a, b, and d time out and exchange decision_req messages, but none of them knows the decision. Annotated: “Can I kick it?” — “YES YOU CAN” — “……?”]

3PC

Basic idea: add a round, a state, and simple failure detectors (timeouts).

Protocol:
1.  Phase 1: just like in 2PC
    –  Agent timeout → abort
2.  Phase 2: send canCommit, collect acks
    –  Agent timeout → commit
3.  Phase 3: just like phase 2 of 2PC

(The agent-side timeout rule is sketched below.)
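A toy sketch (not part of the original slides) of that agent-side timeout rule, which is exactly what the partitions below will exploit.

# Illustrative sketch (not from the talk): what a 3PC agent assumes on timeout
# depends only on the last phase it completed.

def on_timeout(last_phase_completed):
    if last_phase_completed == 1:     # voted, but saw no go-ahead yet
        return "abort"
    if last_phase_completed == 2:     # saw the go-ahead (precommit) and acked
        return "commit"
    return "wait"

print(on_timeout(2))   # 'commit' -- even if the coordinator has since aborted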

3PC

[Spacetime diagram: the Coordinator sends cancommit; agents a, b, and d reply vote_msg; the Coordinator sends precommit; the agents ack; the Coordinator sends commit. Annotated: timeout → abort (before precommit), timeout → commit (after precommit)]

Network partitions make 3PC act crazy

[Spacetime diagram: agent d crashes after voting; a brief network partition cuts off the Coordinator. Agents a and b receive precommit and so learn the commit decision; the Coordinator, seeing d dead, decides to abort, but its abort messages to a and b are LOST. Agents a and b decide to commit while the Coordinator has aborted.]

Kafka durability bug

[Spacetime diagram: a Client, Zookeeper, and replicas a, b, and c. A brief network partition isolates replica a; a becomes leader and sole replica; a ACKs the client write; a then crashes. Data loss.]

Molly summary

Lineage allows us to reason backwards from good outcomes.
Molly: surgically-targeted fault injection.
Investment similar to testing; returns similar to formal methods.

Where we’ve been; where we’re headed

1.  Mourning the death of transactions → we need application-level guarantees
2.  What is so hard about distributed systems? → asynchrony × partial failure = too hard to hide! We need tools to manage it.
3.  Distributed consistency: managing asynchrony → focus on flow: data in motion
4.  Fault-tolerance: progress despite failures → backwards from outcomes

Remember

1.  We need application-level guarantees
2.  asynchrony × partial failure = too hard to hide! We need tools to manage it.
3.  Focus on flow: data in motion
4.  Backwards from outcomes

Composition is the hardest problem

A happy crisis

Valentine: “It makes me so happy. To be at the beginning again, knowing almost nothing.... It's the best possible time of being alive, when almost everything you thought you knew is wrong.”