Diego Ongaro and John Ousterhout - Raft · 2020-04-29 · Diego Ongaro and John Ousterhout November...


The Raft Consensus Algorithm
Diego Ongaro and John Ousterhout
November 2015

Source code available at https://github.com/ongardie/raft-talk.

Unless otherwise noted, this work is: © 2012-2015 Diego Ongaro, © 2012-2014 John Ousterhout.

Licensed under the Creative Commons Attribution 4.0 International License.


Motivation

Goal: shared key-value store (state machine)
Host it on a single machine attached to network
  Pros: easy, consistent
  Cons: prone to failure
With Raft, keep consistency yet deal with failures


What Is Consensus

Agreement on shared state (single system image)
Recovers from server failures autonomously
  Minority of servers fail: no problem
  Majority fail: lose availability, retain consistency
Key to building consistent storage systems
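
The minority/majority rule above comes down to simple arithmetic. The sketch below is only an illustration; the function names `majority` and `max_failures` are assumptions and do not appear in the slides.

```python
def majority(cluster_size: int) -> int:
    """Smallest number of servers that forms a majority of the cluster."""
    return cluster_size // 2 + 1


def max_failures(cluster_size: int) -> int:
    """Largest number of failed servers that still leaves a majority running."""
    return cluster_size - majority(cluster_size)


# Print cluster size, majority size, and tolerated failures.
for n in (3, 4, 5):
    print(n, majority(n), max_failures(n))
```

Under this arithmetic a 4-server cluster tolerates no more failures than a 3-server one, which is one reason odd cluster sizes are common.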


Replicated State Machines

Typical architecture for consensus systems

[Figure: clients send commands (e.g. z←6) to the servers; each of three servers has a consensus module, a log (x←3, y←2, x←1, z←6), and a state machine (x=1, y=2, z=6).]

Replicated log ⇒ replicated state machine
  All servers execute same commands in same order
Consensus module ensures proper log replication
System makes progress as long as any majority of servers up
Failure model: fail-stop (not Byzantine), delayed/lost msgs
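
To make "replicated log ⇒ replicated state machine" concrete, here is a minimal sketch that replays the log from the figure (x←3, y←2, x←1, z←6) on three simulated servers. The `Command` type and `apply_log` function are illustrative names, not something defined in the talk, and the consensus module that would keep the logs identical is assumed rather than implemented.

```python
from typing import Dict, List, Tuple

Command = Tuple[str, int]  # (key, value); ("x", 3) stands for x←3


def apply_log(log: List[Command]) -> Dict[str, int]:
    """Replay every command in log order to build the key-value state."""
    state: Dict[str, int] = {}
    for key, value in log:
        state[key] = value  # a later write to the same key overwrites the earlier one
    return state


# Assumption: the consensus module has already given every server the same log.
log = [("x", 3), ("y", 2), ("x", 1), ("z", 6)]
servers = [apply_log(log) for _ in range(3)]
print(servers[0])  # {'x': 1, 'y': 2, 'z': 6}
```

Because the state machine is deterministic, identical logs applied in the same order yield identical states on every server.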


Paxos Protocol

Leslie Lamport, 1989
Nearly synonymous with consensus

“The dirty little secret of the NSDI community is that at most five people really, truly understand every part of Paxos ;-).” —NSDI reviewer

“There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system...the final system will be based on an unproven protocol.” —Chubby authors


Raft's Design for Understandability

We wanted an algorithm optimized for building real systems
  Must be correct, complete, and perform well
  Must also be understandable
“What would be easier to understand or explain?”
  Fundamentally different decomposition than Paxos
  Less complexity in state space
  Less mechanism


Raft Overview

1. Leader election
  Select one of the servers to act as cluster leader
  Detect crashes, choose new leader
2. Log replication (normal operation)
  Leader takes commands from clients, appends to its log
  Leader replicates its log to other servers (overwriting inconsistencies)
3. Safety
  Only a server with an up-to-date log can become leader
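
The safety rule in item 3 can be sketched as a comparison a voter makes before granting its vote. The slide does not spell out what "up-to-date" means; the rule used here (compare the term of the last entry first, then the log length) follows the Raft paper. The type and function names are illustrative assumptions.

```python
from typing import List, Tuple

Entry = Tuple[int, str]  # (term, command)


def last_term_and_index(log: List[Entry]) -> Tuple[int, int]:
    """Return (term of last entry, 1-based index of last entry); (0, 0) if empty."""
    if not log:
        return (0, 0)
    return (log[-1][0], len(log))


def candidate_is_up_to_date(candidate_log: List[Entry],
                            voter_log: List[Entry]) -> bool:
    """True if the candidate's log is at least as up-to-date as the voter's.

    Tuples compare lexicographically, so the last term decides first and
    the last index only breaks ties between equal terms.
    """
    return last_term_and_index(candidate_log) >= last_term_and_index(voter_log)


voter = [(1, "x←3"), (2, "y←2")]
print(candidate_is_up_to_date([(3, "z←6")], voter))                         # True
print(candidate_is_up_to_date([(1, "x←3"), (1, "y←2"), (1, "x←1")], voter))  # False
```

A shorter log can still win if its last entry comes from a later term, while a longer log loses if its last entry is from an older term.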


Core Raft Review

1. Leader election
  Heartbeats and timeouts to detect crashes
  Randomized timeouts to avoid split votes
  Majority voting to guarantee at most one leader per term
2. Log replication (normal operation)
  Leader takes commands from clients, appends to its log
  Leader replicates its log to other servers (overwriting inconsistencies)
  Built-in consistency check simplifies how logs may differ
3. Safety
  Only elect leaders with all committed entries in their logs
  New leader defers committing entries from prior terms
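
The follower side of the consistency check in item 2 can be sketched as follows. The rule (reject the request unless the follower holds an entry at `prev_index` whose term is `prev_term`, then drop any conflicting suffix) follows the AppendEntries description in the Raft paper rather than anything spelled out on this slide. Current-term checks and commit-index updates are left out, and all names are illustrative assumptions.

```python
from typing import List, Tuple

Entry = Tuple[int, str]  # (term, command)


def append_entries(log: List[Entry], prev_index: int, prev_term: int,
                   entries: List[Entry]) -> bool:
    """Follower-side handling of new entries; mutates `log` in place.

    Indices are 1-based; prev_index == 0 means "append from the start".
    Returns False when the consistency check fails, leaving the log unchanged.
    """
    if prev_index > len(log):
        return False  # follower is missing the entry the leader expects
    if prev_index > 0 and log[prev_index - 1][0] != prev_term:
        return False  # entry exists but was written in a different term
    for offset, entry in enumerate(entries):
        position = prev_index + offset  # 0-based slot for this entry
        if position < len(log):
            if log[position][0] != entry[0]:
                del log[position:]  # conflict: drop this entry and everything after it
                log.append(entry)
            # same term at this slot: entry already present, nothing to do
        else:
            log.append(entry)
    return True


follower = [(1, "x←3"), (1, "y←2"), (2, "x←1")]
ok = append_entries(follower, prev_index=2, prev_term=1, entries=[(3, "z←6")])
print(ok, follower)
```

In this usage the follower's third entry conflicts with the leader's, so it is overwritten, which is the "overwriting inconsistencies" behaviour named in the list above.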


Conclusion

Consensus widely regarded as difficult
Raft designed for understandability
  Easier to teach in classrooms
  Better foundation for building practical systems
Pieces needed for a practical system:
  Cluster membership changes (simpler in dissertation)
  Log compaction (expanded tech report/dissertation)
  Client interaction (expanded tech report/dissertation)
  Evaluation (dissertation: understandability, correctness, leader election & replication performance)


Questions

raft.github.io
raft-dev mailing list


How Is Consensus Used?

Top-level system configuration
[Figure: two diagrams, each showing a replicated state machine on servers S1, S2, S3 above nodes N, N, N, N...; the second adds the labels leader, standby, standby.]

Replicate entire database state
[Figure: several replicated state machines, each on servers S1, S2, S3, linked by 2PC.]


Raft

Algorithm for implementing a replicated log
System makes progress as long as any majority of servers up
Failure model: fail-stop (not Byzantine), delayed/lost msgs
Designed for understandability


Raft User Study

[Figure: scatter plot of Raft grade (0 to 60) against Paxos grade (0 to 60); series: Raft then Paxos, Paxos then Raft.]

[Figure: bar chart of number of participants (0 to 20) for "implement" and "explain"; categories: Paxos much easier, Paxos somewhat easier, Roughly equal, Raft somewhat easier, Raft much easier.]


Randomized Timeouts

How much randomization is needed to avoid split votes?

[Figure: cumulative percent (0% to 100%) against time without leader (ms; axis ticks at 100, 1000, 10000, 100000); one curve per timeout range: 150-150ms, 150-151ms, 150-155ms, 150-175ms, 150-200ms, 150-300ms.]

Conservatively, use random range ~10x network latency
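
The rule of thumb above can be sketched as a helper that draws an election timeout. The function name, its parameters, and the example values (150 ms base, 15 ms network latency) are assumptions for illustration; they give a 150-300 ms range like the widest one in the plot.

```python
import random


def election_timeout_ms(base_ms: float, network_latency_ms: float,
                        rng: random.Random) -> float:
    """Draw a timeout uniformly from [base, base + 10 * network latency]."""
    spread_ms = 10 * network_latency_ms  # "random range ~10x network latency"
    return rng.uniform(base_ms, base_ms + spread_ms)


rng = random.Random(0)  # seeded only so that repeated runs match
timeouts = [election_timeout_ms(150, 15, rng) for _ in range(5)]
print([round(t) for t in timeouts])
```

Each server would draw a fresh timeout whenever it resets its election timer, so that servers rarely time out at the same instant and split the vote.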


Raft Implementations


Name | Primary Authors | Language | License
RethinkDB/clustering | | C++ | AGPL
etcd/raft | Blake Mizerany, Xiang Li and Yicheng Qin | Go | Apache2.0
LogCabin | Diego Ongaro (Stanford) | C++ | ISC
go-raft | Ben Johnson (Sky) and Xiang Li (CMU, CoreOS) | Go | MIT
hashicorp/raft | Armon Dadgar (hashicorp) | Go | MPL-2.0
hoverbear/raft | Andrew Hobden, Dan Burkert | Rust | MIT
ckite | Pablo Medina | Scala | Apache2
verdi/raft | James Wilcox, Doug Woos, Pavel Panchekha, Zach Tatlock, Xi Wang, Mike Ernst, and Tom Anderson (University of Washington) | Coq | BSD
OpenDaylight | Moiz Raja, Kamal Rameshan, Robert Varga (Cisco), Tom Pantelis (Brocade) | Java | Eclipse
zraft_lib | Gunin Alexander | Erlang | Apache2
kanaka/raft.js | Joel Martin | Javascript | MPL-2.0
akka-raft | Konrad Malawski | Scala | Apache2
rafter | Andrew Stone (Basho) | Erlang | Apache2
floss | Alexander Flatter | Ruby | MIT
willemt/raft | Willem-Hendrik Thiart | C | BSD
... | ... | ... | ...

Copied from Raft website, probably stale.


LogCabin

Started as research platform for Raft at Stanford
Developed into production system at Scale Computing
Network service running Raft replicated state machine
Data model: hierarchical key-value store, kept in memory
Written in C++ (gcc 4.4's C++0x)
