RICON keynote: outwards from the middle of the maze

Posted on 14-Jun-2015


slides from my RICON keynote

Transcript of RICON keynote: outwards from the middle of the maze

Outwards from the middle of the maze

Peter Alvaro, UC Berkeley

Outline

1.  Mourning the death of transactions
2.  What is so hard about distributed systems?
3.  Distributed consistency: managing asynchrony
4.  Fault-tolerance: progress despite failures

The transaction concept

DEBIT_CREDIT:
    BEGIN_TRANSACTION;
    GET MESSAGE;
    EXTRACT ACCOUNT_NUMBER, DELTA, TELLER, BRANCH FROM MESSAGE;
    FIND ACCOUNT(ACCOUNT_NUMBER) IN DATA BASE;
    IF NOT_FOUND | ACCOUNT_BALANCE + DELTA < 0 THEN
        PUT NEGATIVE RESPONSE;
    ELSE DO;
        ACCOUNT_BALANCE = ACCOUNT_BALANCE + DELTA;
        POST HISTORY RECORD ON ACCOUNT (DELTA);
        CASH_DRAWER(TELLER) = CASH_DRAWER(TELLER) + DELTA;
        BRANCH_BALANCE(BRANCH) = BRANCH_BALANCE(BRANCH) + DELTA;
        PUT MESSAGE ('NEW BALANCE =' ACCOUNT_BALANCE);
        END;
    COMMIT;


The “top-down” ethos


Transactions: a holistic contract

[Diagram: the Application issues Writes and Reads against an Opaque store through the Transactions interface. Application-level invariant: Assert: balance > 0]

Incidental complexities

•  The “Internet.” Searching it.
•  Cross-datacenter replication schemes
•  CAP Theorem
•  Dynamo & MapReduce
•  “Cloud”

Fundamental complexity

“[…] distributed systems require that the programmer be aware of latency, have a different model of memory access, and take into account issues of concurrency and partial failure.”

Jim Waldo et al., A Note on Distributed Computing (1994)

A holistic contract… stretched to the limit

[Diagram: the Application / Opaque store / Transactions picture again, with the Write/Read contract stretched]

Are you blithely asserting that transactions aren’t webscale?

Some people just want to see the world burn. Those same people want to see the world use inconsistent databases.

- Emin Gun Sirer

Alternative to top-down design?

The “bottom-up,” systems tradition: Simple, reusable components first. Semantics later.

Alternative: the “bottom-up,” systems ethos

The “bottom-up” ethos


“‘Tis a fine barn, but sure ‘tis no castle, English”

The “bottom-up” ethos

Simple, reusable components first. Semantics later. This is how we live now. Question: Do we ever get those application-level guarantees back?

Low-level contracts

[Diagram: the Application issues Writes and Reads against a distributed store (KVS)]

Assert: balance > 0

R1(X=1)  R2(X=1)  W1(X=2)  W2(X=0)
W1(X=1)  W1(Y=2)  R2(Y=2)  R2(X=0)

causal? PRAM? delta? fork/join? red/blue? release?

When do contracts compose?

[Diagram: the Application over a composed Distributed service. Assert: balance > 0]

“Ew, did I get mongo in my riak?”

Composition is the last hard problem

Composing modules is hard enough. We must learn how to compose guarantees.

Outline

1.  Mourning the death of transactions
2.  What is so hard about distributed systems?
3.  Distributed consistency: managing asynchrony
4.  Fault-tolerance: progress despite failures

Why distributed systems are hard²

Asynchrony Partial Failure

Fundamental Uncertainty

Asynchrony isn’t that hard

Amelioration: logical timestamps, deterministic interleaving

Partial failure isn’t that hard

Amelioration: replication, replay

(asynchrony × partial failure) = hard²

Logical timestamps / deterministic interleaving
Replication / replay

Tackling one clown at a time

Poor strategy for programming distributed systems.
Winning strategy for analyzing distributed programs.

 

Outline

1.  Mourning the death of transactions
2.  What is so hard about distributed systems?
3.  Distributed consistency: managing asynchrony
4.  Fault-tolerance: progress despite failures

Distributed consistency

Today: A quick summary of some great work.

Consider a (distributed) graph

[Figure: a graph of nodes T1–T14]

Partitioned, for scalability

[Figure: the graph partitioned across machines]

Replicated, for availability

[Figure: each partition of the graph replicated]

Deadlock detection

Task: Identify strongly-connected components

Waits-for graph

[Figure: waits-for graph over transactions T1–T14]

Garbage collection

Task: Identify nodes not reachable from Root.

Refers-to graph

[Figure: refers-to graph over nodes T1–T14 with a distinguished Root node]

Correctness

Deadlock detection
•  Safety: no false positives
•  Liveness: identify all deadlocks

Garbage collection
•  Safety: never GC live memory!
•  Liveness: GC all orphaned memory

[Figures: the waits-for and refers-to graphs from before]

Consistency at the extremes

[Spectrum: Storage – Object – Flow – Language – Application]

Linearizable key-value store?    Custom solutions?

Efficient · Correct

Object-level consistency

Capture semantics of data structures that
•  allow greater concurrency
•  maintain guarantees (e.g., convergence)

[Spectrum: Storage – Object – Flow – Language – Application]

Object-level consistency

[Diagram: Insert and Read operations against a convergent data structure (e.g., a Set CRDT)]

Commutativity, associativity, and idempotence make the object tolerant to reordering, batching, and retry/duplication.
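A minimal sketch (not part of the original slides) of such an object in Python: a grow-only set whose merge is set union, hence commutative, associative, and idempotent.

# Illustrative sketch (not from the talk): a grow-only set (G-Set) CRDT.
# Merge is set union: commutative, associative, and idempotent, so replicas
# converge regardless of message reordering, batching, or retry/duplication.

class GSet:
    def __init__(self, elems=()):
        self.elems = frozenset(elems)

    def insert(self, x):
        return GSet(self.elems | {x})

    def merge(self, other):
        return GSet(self.elems | other.elems)

    def read(self):
        return set(self.elems)

# Two replicas see the same inserts in different orders, with a duplicate:
a = GSet().insert(1).insert(2).insert(2)
b = GSet().insert(2).insert(1)
assert a.merge(b).read() == b.merge(a).read() == {1, 2}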

Object-level composition?

[Diagram: the Application built on convergent data structures]

Assert: graph replicas converge
GC — Assert: no live nodes are reclaimed
?   ?

Flow-level consistency

[Spectrum: Storage – Object – Flow – Language – Application]

Capture semantics of data in motion
•  Asynchronous dataflow model
•  Component properties → system-wide guarantees

[Dataflow diagram: Transaction manager, Graph store, Transitive closure, Deadlock detector]

Flow-level consistency

Order-insensitivity (confluence):

output set = f(input set)

A confluent component produces the same output set for a given input set, no matter the order or batching in which the inputs arrive: {  } = {  }
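A small illustration (not part of the original slides): a set-in, set-out operator such as transitive closure is confluent, because any arrival order of the same inputs yields the same output set.

# Illustrative sketch (not from the talk): confluence as order-insensitivity.
# A confluent operator maps an input *set* to an output *set*, so any arrival
# order (or batching) of the same inputs yields the same result.

def transitive_closure(edges):
    """Pure set-in / set-out operator: reachability over a set of edges."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

stream1 = [("T1", "T2"), ("T2", "T3"), ("T3", "T1")]
stream2 = [("T3", "T1"), ("T1", "T2"), ("T2", "T3")]   # same edges, other order
assert transitive_closure(set(stream1)) == transitive_closure(set(stream2))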

Confluence is compositional

output set = f ∘ g(input set)

The composition of confluent components is itself confluent.
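Continuing the sketch above (not part of the original slides): composing two set-in, set-out operators preserves order-insensitivity end to end.

# Illustrative sketch (not from the talk): composing confluent operators.

def compose(f, g):
    return lambda xs: f(g(xs))

g = lambda nums: {n * n for n in nums}             # square every element
f = lambda nums: {n for n in nums if n % 2 == 0}   # keep the even squares

pipeline = compose(f, g)
stream1, stream2 = [1, 2, 3, 4], [4, 3, 2, 1]       # same inputs, other order
assert pipeline(set(stream1)) == pipeline(set(stream2)) == {4, 16}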

Graph queries as dataflow

[Dataflow diagram: Transaction manager, Graph store, Transitive closure, Deadlock detector — confluent, confluent, confluent]

[Dataflow diagram: Memory allocator, Graph store, Transitive closure, Garbage collector — confluent, confluent, not confluent]

Coordination: what is that?

Coordinate here: at the non-confluent garbage collector in the second dataflow.

Strategy 1: Establish a total order
Strategy 2: Establish a producer–consumer barrier
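A rough sketch (not part of the original slides) of Strategy 2: the non-confluent garbage collector waits for a barrier — a seal promising that no more edges will arrive — before it reclaims anything.

# Illustrative sketch (not from the talk): a producer–consumer barrier.
# The garbage collector is non-confluent (a later edge could invalidate its
# answer), so it only runs once the producer seals its input.

def reachable(root, edges):
    seen, frontier = {root}, [root]
    while frontier:
        n = frontier.pop()
        for (a, b) in edges:
            if a == n and b not in seen:
                seen.add(b)
                frontier.append(b)
    return seen

class GarbageCollector:
    def __init__(self, root):
        self.root, self.edges, self.nodes, self.sealed = root, set(), set(), False

    def add_edge(self, a, b):
        assert not self.sealed, "input already sealed"
        self.edges.add((a, b))
        self.nodes.update((a, b))

    def seal(self):           # the barrier: producers promise no more input
        self.sealed = True

    def collect(self):
        if not self.sealed:
            return None       # refuse to act on incomplete input
        return self.nodes - reachable(self.root, self.edges)

gc = GarbageCollector("Root")
gc.add_edge("Root", "T1"); gc.add_edge("T1", "T2"); gc.add_edge("T3", "T4")
gc.seal()
print(gc.collect())           # {'T3', 'T4'}: unreachable from Root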

Fundamental costs: FT via replication

[Diagram: the all-confluent deadlock-detection dataflow, replicated — (mostly) free!]

[Diagram: the garbage-collection dataflow, replicated — the non-confluent garbage collector needs Paxos (global synchronization!) or a Barrier]

The first principle of successful scalability is to batter the consistency mechanisms down to a minimum. – James Hamilton  

Language-level consistency

DSLs for distributed programming?
•  Capture consistency concerns in the type system

[Spectrum: Storage – Object – Flow – Language – Application]

Language-level consistency

CALM Theorem: monotonic → confluent

A conservative, syntactic test for confluence

Language-level consistency

Deadlock detector: monotonic
Garbage collector: nonmonotonic
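A small illustration (not part of the original slides) of why the two differ: a monotone query's output only grows as input arrives, so results can be emitted early; a non-monotone query like "not reachable" can be retracted by later input.

# Illustrative sketch (not from the talk): monotonic vs. non-monotonic queries.
# Monotone: more input can only add to the output (safe to stream results).
# Non-monotone: more input can retract earlier output (must wait / coordinate).

def deadlocked(waits_for):
    """Nodes on a waits-for cycle. Adding edges can only create more cycles."""
    on_cycle = set()
    for (a, b) in waits_for:
        seen, frontier = set(), [b]          # walk forward from b back to a
        while frontier:
            n = frontier.pop()
            if n == a:
                on_cycle.update({a, b})
                break
            if n not in seen:
                seen.add(n)
                frontier.extend(d for (c, d) in waits_for if c == n)
    return on_cycle

def garbage(nodes, refers_to, root="Root"):
    """'Not reachable from Root': a later edge can rescue a 'garbage' node."""
    reach, frontier = {root}, [root]
    while frontier:
        n = frontier.pop()
        for (a, b) in refers_to:
            if a == n and b not in reach:
                reach.add(b)
                frontier.append(b)
    return nodes - reach

print(deadlocked({("T1", "T2"), ("T2", "T1")}))        # {'T1', 'T2'}
print(garbage({"T1", "T2"}, {("Root", "T1")}))         # {'T2'} ... for now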

Let’s review

•  Consistency is tolerance to asynchrony
•  Tricks:
   –  focus on data in motion, not at rest
   –  avoid coordination when possible
   –  choose coordination carefully otherwise

(Tricks are great, but tools are better)

Outline

1.  Mourning the death of transactions
2.  What is so hard about distributed systems?
3.  Distributed consistency: managing asynchrony
4.  Fault-tolerance: progress despite failures

Grand challenge: composition

Hard problem: Is a given component fault-tolerant?
Much harder: Is this system (built up from components) fault-tolerant?

Example: Atomic multi-partition update

[Figure: the partitioned graph; an atomic update spans multiple partitions]

Two-phase commit

Example: replication

[Figure: the graph replicated across two replicas]

Reliable broadcast

Popular wisdom: don’t reinvent

Example: Kafka replication bug

Three “correct” components:
1.  Primary/backup replication
2.  Timeout-based failure detectors
3.  Zookeeper

One nasty bug: Acknowledged writes are lost

A guarantee would be nice

Bottom-up approach:
•  Use formal methods to verify individual components (e.g., protocols)
•  Build systems from verified components

Shortcomings:
•  Hard to use
•  Hard to compose

[Chart: investment vs. returns]

Bottom-up assurances

Formal verification

Program · Environment · Correctness Spec

Composing bottom-up assurances

Issue 1: incompatible failure models (e.g., crash failures vs. omissions)
Issue 2: specs do not compose (FT is an end-to-end property)

“If you take 10 components off the shelf, you are putting 10 world views together, and the result will be a mess.” -- Butler Lampson

Top-down “assurances”

Testing
Fault injection

End-to-end testing would be nice

Top-down approach:
•  Build a large-scale system
•  Test the system under faults

Shortcomings:
•  Hard to identify complex bugs
•  Fundamentally incomplete

[Chart: investment vs. returns]

Lineage-driven fault injection

Goal: top-down testing that
•  finds all of the fault-tolerance bugs, or
•  certifies that none exist

Lineage-driven fault injection

[Diagram: Molly, driven by a Correctness Specification and “malevolent sentience”]

Lineage-driven fault injection (LDFI)

Approach: think backwards from outcomes
Question: could a bad thing ever happen?
Reframe:
•  Why did a good thing happen?
•  What could have gone wrong along the way?

Thomasina: What a faint-heart! We must work outward from the middle of the maze. We will start with something simple.

The game

•  Both players agree on a failure model
•  The programmer provides a protocol
•  The adversary observes executions and chooses failures for the next execution (a sketch of this loop follows below)
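A schematic sketch (not part of the original slides) of that game loop in Python; run_protocol, lineage_to_cnf, and pick_failures are hypothetical stand-ins for the real machinery.

# Illustrative sketch (not from the talk): the LDFI game as a loop.
# The three callables are hypothetical placeholders, not a real API.

def play(run_protocol, lineage_to_cnf, pick_failures, max_rounds=10):
    failures = frozenset()                  # adversary starts with no failures
    for _ in range(max_rounds):
        good, lineage = run_protocol(failures)
        if not good:
            return "counterexample", failures      # the adversary wins
        cnf = lineage_to_cnf(lineage)              # ways to break every proof
        failures = pick_failures(cnf)              # the adversary's next move
        if failures is None:
            return "certified", None               # no failure set can break it
    return "undecided", None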

Dedalus: it’s about data

log(B, “data”)@5

What: the log fact carrying some data (“data”)
Where: node B
When: timestamp 5

Dedalus: it’s like Datalog

consequence :- premise[s]

log(Node, Pload) :- bcast(Node, Pload);

(Which is like SQL)

create view log as
select Node, Pload from bcast;

Dedalus: it’s about time

consequence@when :- premise[s]

node(Node, Neighbor)@next :- node(Node, Neighbor);
(state change)

log(Node2, Pload)@async :- bcast(Node1, Pload),
                           node(Node1, Node2);
(communication; natural join on bcast.Node1 == node.Node1)

The match

Protocol: Reliable broadcast

Specification:
Pre: a correct process delivers a message m
Post: all correct processes deliver m

Failure model:
(Permanent) crash failures; message loss / partitions

(A predicate form of this spec is sketched below.)
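A minimal sketch (not part of the original slides) of the spec as a Python predicate over the processes' final logs.

# Illustrative sketch (not from the talk): the reliable-broadcast spec as a
# predicate over final logs, given the set of correct (non-crashed) processes.

def reliable_broadcast_ok(logs, correct, m):
    """Pre: some correct process delivered m. Post: all correct ones did."""
    delivered = {p for p, log in logs.items() if m in log}
    if not (delivered & correct):
        return True                   # precondition unmet; holds vacuously
    return correct <= delivered

logs = {"a": {"data"}, "b": set(), "c": {"data"}}
print(reliable_broadcast_ok(logs, correct={"a", "b", "c"}, m="data"))  # False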

Round 1

node(Node, Neighbor)@next :- node(Node, Neighbor);
log(Node, Pload)@next :- log(Node, Pload);

log(Node, Pload) :- bcast(Node, Pload);

log(Node2, Pload)@async :- bcast(Node1, Pload),
                           node(Node1, Node2);

“An effort” delivery protocol

Round 1 in space / time

[Spacetime diagram: process a broadcasts at time 1; processes b and c log the message at time 2]

Round 1: Lineage

log(B, data)@5 ← log(B, data)@4 ← log(B, data)@3 ← log(B, data)@2 ← log(A, data)@1

log(Node, Pload)@next :- log(Node, Pload);
    log(B, data)@5 :- log(B, data)@4;

log(Node2, Pload)@async :- bcast(Node1, Pload),
                           node(Node1, Node2);
    log(B, data)@2 :- bcast(A, data)@1,
                      node(A, B)@1;

An execution is a (fragile) “proof” of an outcome

[Proof-tree diagram: the derivation of log(B, data)@5, traced back through rules r1, r2, r3 to log(A, data)@1 and node(A, B)@1]

(which required a message from A to B at time 1)

Valentine: “The unpredictable and the predetermined unfold together to make everything the way it is.”

Round 1: counterexample

The adversary wins!

[Spacetime diagram: a’s message to b is LOST; only c logs the message]

Round 2

Same as Round 1, but A retries.

bcast(N, P)@next :- bcast(N, P);

Round 2 in spacetime

[Spacetime diagram: a rebroadcasts at every timestep; b and c log the message again and again]

Round 2: Lineage

log(B, data)@5 ← log(B, data)@4 ← log(B, data)@3 ← log(B, data)@2 ← log(A, data)@1
(and, via the retries, also through log(A, data)@2, log(A, data)@3, log(A, data)@4)

log(Node, Pload)@next :- log(Node, Pload);
    log(B, data)@5 :- log(B, data)@4;

log(Node2, Pload)@async :- bcast(Node1, Pload), node(Node1, Node2);
    log(B, data)@3 :- bcast(A, data)@2, node(A, B)@2;

Retry provides redundancy in time

Traces are forests of proof trees

[Proof forest: four derivations of log(B, data)@5, one through each message from A to B (AB1, AB2, AB3, AB4); breaking them all requires AB1 ∧ AB2 ∧ AB3 ∧ AB4]

Round 2: counterexample

[Spacetime diagram: a’s first message to b is LOST, then a CRASHES; only c has logged the message]

The adversary wins!

Round 3

Same as in Round 2, but symmetrical.

bcast(N, P)@next :- log(N, P);

Round 3 in space / time

[Spacetime diagram: every process that has logged the message rebroadcasts it at every timestep; a, b, and c all log and relay]

Redundancy in space and time

Round 3: Lineage

log(B, data)@5 is supported by derivations through log(A, data), log(B, data), and log(C, data) at every earlier timestep, all rooted at log(A, data)@1.

Round 3

The programmer wins!

Let’s reflect

Fault-tolerance is redundancy in space and time.
Best strategy for both players: reason backwards from outcomes using lineage.
Finding bugs: find a set of failures that “breaks” all derivations.
Fixing bugs: add additional derivations.

The role of the adversary can be automated

1.  Break a proof by dropping any contributing message — a disjunction:
    (AB1 ∨ BC2)

2.  Find a set of failures that breaks all proofs of a good outcome — a conjunction of disjunctions (AKA CNF):
    (AB1 ∨ BC2) ∧ (AC1) ∧ (AC2)
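A brute-force sketch (not part of the original slides) of the adversary's move: given one clause per proof, find a small set of message drops that hits every clause.

# Illustrative sketch (not from the talk): the adversary's next move as a
# small hitting-set / CNF problem. Each clause lists the messages whose loss
# would break one proof; a set of drops hitting every clause breaks them all.

from itertools import combinations

def next_failure_set(clauses):
    msgs = sorted({m for clause in clauses for m in clause})
    for k in range(1, len(msgs) + 1):          # prefer the smallest set
        for drops in combinations(msgs, k):
            if all(set(clause) & set(drops) for clause in clauses):
                return set(drops)
    return None

# The CNF from the slides: (AB1 ∨ BC2) ∧ (AC1) ∧ (AC2)
print(next_failure_set([["AB1", "BC2"], ["AC1"], ["AC2"]]))
# e.g. {'AB1', 'AC1', 'AC2'}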

Molly, the LDFI prototype

Molly finds fault-tolerance violations quickly or guarantees that none exist.
Molly finds bugs by explaining good outcomes – then it explains the bugs.
Bugs identified: 2PC, 2PC-CTP, 3PC, Kafka
Certified correct: Paxos (synod), Flux, bully leader election, reliable broadcast

Commit protocols

Problem: atomically change things

Correctness properties:
1.  Agreement (all or nothing)
2.  Termination (something)

(Both are sketched as predicates below.)
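A minimal sketch (not part of the original slides) of those two properties as Python predicates over per-agent decisions.

# Illustrative sketch (not from the talk): commit-protocol correctness as
# predicates over decisions ('commit', 'abort', or None = undecided).

def agreement(decisions):
    """All or nothing: no two agents decide differently."""
    decided = [d for d in decisions.values() if d is not None]
    return len(set(decided)) <= 1

def termination(decisions, correct):
    """Something: every correct (non-crashed) agent eventually decides."""
    return all(decisions[a] is not None for a in correct)

decisions = {"a": "commit", "b": "commit", "d": None}
print(agreement(decisions))                     # True
print(termination(decisions, {"a", "b", "d"}))  # False: d never decided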

Two-phase commit

[Spacetime diagram: the Coordinator sends prepare to agents a, b, and d; the agents reply with votes; the Coordinator sends commit. Annotated: “Can I kick it?” — “YES YOU CAN” — “Well I’m gone”]

Two-phase commit

[Spacetime diagram: the Coordinator sends prepare and collects votes, then CRASHES before sending a decision; the agents are left waiting]

Violation: Termination

The collaborative termination protocol

Basic idea: agents talk amongst themselves when the coordinator fails.
Protocol: on timeout, ask the other agents about the decision.

2PC - CTP

[Spacetime diagram: the Coordinator sends prepare, collects votes, and CRASHES before deciding; agents a, b, and d time out and exchange decision_req messages, but none of them knows the decision. Annotated: “Can I kick it?” — “YES YOU CAN” — “……?”]

3PC

Basic idea: add a round, a state, and simple failure detectors (timeouts).

Protocol:
1.  Phase 1: just like in 2PC
    –  Agent timeout → abort
2.  Phase 2: send canCommit, collect acks
    –  Agent timeout → commit
3.  Phase 3: just like phase 2 of 2PC

(The agent-side timeout rule is sketched below.)
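A toy sketch (not part of the original slides) of that agent-side timeout rule, which is exactly what the partitions below will exploit.

# Illustrative sketch (not from the talk): what a 3PC agent assumes on timeout
# depends only on the last phase it completed.

def on_timeout(last_phase_completed):
    if last_phase_completed == 1:     # voted, but saw no go-ahead yet
        return "abort"
    if last_phase_completed == 2:     # saw the go-ahead (precommit) and acked
        return "commit"
    return "wait"

print(on_timeout(2))   # 'commit' -- even if the coordinator has since aborted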

3PC

[Spacetime diagram: the Coordinator sends cancommit; agents a, b, and d reply vote_msg; the Coordinator sends precommit; the agents ack; the Coordinator sends commit. Annotated: timeout → abort (before precommit), timeout → commit (after precommit)]

Network partitions make 3PC act crazy

[Spacetime diagram: agent d crashes after voting; a brief network partition cuts off the Coordinator. Agents a and b receive precommit and so learn the commit decision; the Coordinator, seeing d dead, decides to abort, but its abort messages to a and b are LOST. Agents a and b decide to commit while the Coordinator has aborted.]

Kafka durability bug

[Spacetime diagram: a Client, Zookeeper, and replicas a, b, and c. A brief network partition isolates replica a; a becomes leader and sole replica; a ACKs the client write; a then crashes. Data loss.]

Molly summary

Lineage allows us to reason backwards from good outcomes.
Molly: surgically-targeted fault injection.
Investment similar to testing; returns similar to formal methods.

Where we’ve been; where we’re headed

1.  Mourning the death of transactions → we need application-level guarantees
2.  What is so hard about distributed systems? → asynchrony × partial failure = too hard to hide! We need tools to manage it.
3.  Distributed consistency: managing asynchrony → focus on flow: data in motion
4.  Fault-tolerance: progress despite failures → backwards from outcomes

Remember

1.  We need application-level guarantees
2.  asynchrony × partial failure = too hard to hide! We need tools to manage it.
3.  Focus on flow: data in motion
4.  Backwards from outcomes

Composition is the hardest problem

A happy crisis

Valentine: “It makes me so happy. To be at the beginning again, knowing almost nothing.... It's the best possible time of being alive, when almost everything you thought you knew is wrong.”