Transcript of: Canopus: A Scalable and Massively Parallel Consensus Protocol
Slides: conferences2.sigcomm.org/co-next/2017/presentation/S10_1.pdf

Page 1:

CANOPUS: A SCALABLE AND MASSIVELY PARALLEL CONSENSUS PROTOCOL

Bernard Wong

CoNEXT 2017

Joint work with Sajjad Rizvi and Srinivasan Keshav

Page 2:

CONSENSUS PROBLEM


Agreement among a set of nodes in the presence of failures

Asynchronous environment

Primarily used to provide fault tolerance

[Diagram: Nodes A and B concurrently issue writes W(x=1) and W(x=2). Each replica holds the same replicated log, W(x=3) W(y=1) W(z=1), and consensus decides which write fills the next slot.]
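A minimal sketch (not from the talk; names are illustrative) of why an agreed log order is all that is needed: replicas that apply the same log independently reach the same state.

```python
# Minimal sketch: once consensus fixes the order of writes in the
# replicated log, every replica that applies the log reaches the
# same state.

def apply_log(log):
    """Apply an ordered list of (key, value) writes to an empty store."""
    state = {}
    for key, value in log:
        state[key] = value
    return state

# Suppose consensus decided node A's W(x=1) wins the contested slot:
agreed_log = [("x", 3), ("y", 1), ("z", 1), ("x", 1)]

replica_a = apply_log(agreed_log)
replica_b = apply_log(agreed_log)
assert replica_a == replica_b   # identical state on every replica
```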

Page 3:

A BUILDING BLOCK IN DISTRIBUTED SYSTEMS

[Layer diagram:]

System applications: Spanner, Akka, Mesos, Hadoop, HBase, Kafka, BookKeeper, …

Coordination services: ZooKeeper, Consul, Chef, Puppet, etcd, Chubby, …

Consensus and atomic broadcast: ZAB, Paxos, Raft, Mencius, EPaxos, SPaxos, AllConcur, NOPaxos, NetPaxos, …

Page 4:

A BUILDING BLOCK IN DISTRIBUTED SYSTEMS

(The same layer diagram as the previous slide.)

Current consensus protocols are not scalable. However, most applications only require a small number of replicas for fault tolerance.

Page 5:

PERMISSIONED BLOCKCHAINS

A distributed ledger shared by all the participants

Consensus at a large scale

Large number of participants (e.g., financial institutions)

Must validate a block before committing it to the ledger

Examples

Hyperledger, Microsoft Coco, Kadena, Chain …


Page 6:

CANOPUS

Consensus among a large set of participants

Targets thousands of nodes distributed across the globe

Decentralized protocol

Nodes execute steps independently and in parallel

Designed for modern datacenters

Takes advantage of high performance networks and hardware redundancies


Page 7:

SYSTEM ASSUMPTIONS

Non-uniform network latencies and link capacities

Scalability is bandwidth limited

Protocol must be network topology aware

Deployment consists of racks of servers connected by redundant links

Full rack failures and network partitions are rare

[Diagram: global view, datacenters connected over a WAN; within a datacenter, racks of servers connected by redundant links.]

Page 8:

CONSENSUS CYCLES


Execution divided into a sequence of consensus cycles

In each cycle, Canopus determines the order of writes (state changes) received during the previous cycle

[Diagram: Canopus servers d, e, f, g, h, i, each receiving writes such as x = 1, y = 3, x = 5, and z = 2 during a cycle.]
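As a rough sketch of the cycle structure (illustrative, not the paper's code), a node batches the writes it receives during one cycle and appends their agreed order to its log when the next cycle resolves:

```python
class CanopusNode:
    def __init__(self):
        self.incoming = []   # writes received during the current cycle
        self.log = []        # the agreed, totally ordered log

    def receive_write(self, write):
        self.incoming.append(write)

    def seal_batch(self):
        """Close the batch for this cycle; the protocol orders it next."""
        batch, self.incoming = self.incoming, []
        return batch

    def finish_cycle(self, ordered_writes):
        """Append the cycle's writes, in their agreed global order."""
        self.log.extend(ordered_writes)
```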

Page 9:

SUPER-LEAVES AND VNODES


Nodes in the same rack form a logical group called a super-leaf

Use an intra-super-leaf consensus protocol to replicate write requests between nodes in the same super-leaf

[Diagram: two super-leaves, {d, e, f} and {g, h, i}.]

Page 10:

SUPER-LEAVES AND VNODES

Represent the state of each super-leaf as a height 1 virtual node (vnode)

[Diagram: the same two super-leaves, {d, e, f} and {g, h, i}, now summarized by height 1 vnodes b and c.]
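A minimal sketch (assumed representation, not the paper's code) of a vnode as the merged, ordered list of its members' proposals; the (random number, writes) pairing anticipates the proposal format shown on the next slides:

```python
from dataclasses import dataclass, field

@dataclass
class VNode:
    name: str
    height: int
    proposals: list = field(default_factory=list)   # ordered write requests

def super_leaf_vnode(name, member_proposals):
    """Fuse member proposals, as (random_number, writes) pairs, into a
    height 1 vnode; sorting gives every member the same merged order."""
    return VNode(name=name, height=1, proposals=sorted(member_proposals))

b = super_leaf_vnode("b", [(2, ("d1",)), (3, ("e1",)), (1, ("f1", "f2"))])
# b.proposals is identical at d, e, and f: [(1, ...), (2, ...), (3, ...)]
```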

Page 11:

ACHIEVING CONSENSUS

[Diagram: a height 2 tree with root vnode a over height 1 vnodes b = {d, e, f} and c = {g, h, i}; round 1 reaches consensus within each super-leaf, round 2 across super-leaves.]

Members of a height 1 vnode exchange state with members of nearby height 1 vnodes to compute a height 2 vnode

State exchange is greatly simplified since each vnode is fault tolerant

There are h rounds in a consensus cycle, where h is the height of the tree

A node completes a consensus cycle once it has computed the state of the root vnode
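Putting the rounds together, a minimal sketch (illustrative; fetch_siblings stands in for the real state exchange and must return the same states at every node):

```python
def merge(states):
    """Deterministically fuse sibling vnode states into their parent.
    A vnode state is a sorted list of (random_number, writes) proposals."""
    return sorted(p for state in states for p in state)

def consensus_cycle(my_state, fetch_siblings, height):
    state = my_state
    for level in range(1, height + 1):   # h rounds for a tree of height h
        state = merge(fetch_siblings(level) + [state])
    return state                          # root vnode state: the total order
```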

Page 12:

CONSENSUS PROTOCOL WITHIN A SUPER-LEAF

[Diagram: super-leaf nodes a, b, c, holding pending write requests A1, B1, and C1, C2.]

• Exploit the low latency within a rack
• Reliable broadcast
• Raft

Page 13:

CONSENSUS PROTOCOL WITHIN A SUPER-LEAF

1. Nodes prepare a proposal message that contains a random number and a list of pending write requests

[Diagram: node a prepares proposal (634, A1), node b prepares (746, B1), node c prepares (538, C1 C2).]

Page 14:

CONSENSUS PROTOCOL WITHIN A SUPER-LEAF

2. Nodes use reliable broadcast to exchange proposals within a super-leaf

[Diagram: after the broadcast, each of a, b, and c holds all three proposals: (634, A1), (746, B1), (538, C1 C2).]

Page 15:

CONSENSUS PROTOCOL WITHIN A SUPER-LEAF

3. Every node orders the proposals in the same way, sorting by their random numbers

[Diagram: each node sorts its copy of the three proposals into the identical order.]

Page 16:

CONSENSUS PROTOCOL WITHIN A SUPER-LEAF

These three steps make up a consensus round. At the end, all three nodes have the same state of their common parent.

[Diagram: nodes a, b, and c each hold the identical ordered proposal list: the state of their height 1 vnode.]
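The three steps as a worked sketch (illustrative; ordering proposals by their random numbers is an assumption consistent with the values on these slides):

```python
import random

def prepare_proposal(pending_writes):
    """Step 1: pair the node's pending writes with a random number."""
    return (random.getrandbits(32), tuple(pending_writes))

def order_proposals(proposals):
    """Step 3: sort by random number (ties broken by the writes), so
    every member computes the same order."""
    return sorted(proposals)

# Step 2, reliable broadcast, is assumed to have delivered all three
# proposals to every member. Using the slides' values:
a, b, c = (634, ("A1",)), (746, ("B1",)), (538, ("C1", "C2"))
vnode_state = order_proposals([a, b, c])
# -> [(538, ('C1', 'C2')), (634, ('A1',)), (746, ('B1',))] on every node
```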

Page 17:

CONSENSUS PROTOCOL BETWEEN SUPER-LEAVES

[Diagram: the vnode tree from before; within each super-leaf, some nodes are marked as representatives and others as emulators.]

Page 18:

CONSENSUS PROTOCOL BETWEEN SUPER-LEAVES

1. Representatives send proposal requests to fetch the states of vnodes

[Diagram: each super-leaf's representatives send {proposal request} messages to nodes in the other super-leaf.]

Page 19:

CONSENSUS PROTOCOL BETWEEN SUPER-LEAVES

2. Emulators reply with proposals

[Diagram: emulators return {proposal response} messages carrying their vnode's state.]

Page 20:

CONSENSUS PROTOCOL BETWEEN SUPER-LEAVES

3. Reliable broadcast within a super-leaf

[Diagram: each representative reliably broadcasts the fetched remote state to its super-leaf peers.]

Page 21:

CONSENSUS PROTOCOL BETWEEN SUPER-LEAVES

[Diagram: every node now holds the state of the root vnode.]

The consensus cycle ends for a node when it has completed the last round
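A minimal sketch of one inter-super-leaf round (illustrative; the method name and failure handling are assumptions): a representative fetches the remote vnode's state from one of its emulators, falling back to another replica on failure, then reliably broadcasts it within its super-leaf.

```python
def representative_round(remote_emulators, local_state, broadcast):
    remote_state = None
    for emulator in remote_emulators:     # any member can emulate its vnode
        try:
            remote_state = emulator.proposal_request()   # {proposal request}
            break
        except ConnectionError:
            continue                      # emulator unreachable; try the next
    if remote_state is None:
        raise RuntimeError("all emulators of the remote vnode are unreachable")

    broadcast(remote_state)               # reliable broadcast within the super-leaf
    return sorted(local_state + remote_state)   # merged parent-vnode state
```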

Page 22:

READ REQUESTS

Read requests can be serviced locally by any Canopus node

Reads do not need to be disseminated to other participating nodes

Provides linearizability by

Buffering read requests until the global ordering of writes has been determined

Locally ordering its pending reads and writes to preserve the request order of its clients

Significantly reduces bandwidth requirements for read requests

Achieves total ordering of both read and write requests
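A minimal sketch (illustrative; names are assumptions) of the read path: reads are buffered until the cycle that orders the preceding writes completes, then answered from local state.

```python
class ReadBuffer:
    def __init__(self):
        self.pending = []                 # (key, reply_callback) pairs

    def submit_read(self, key, reply):
        # Buffer instead of answering now: the read must be ordered after
        # the writes already accepted in this cycle, and never broadcast.
        self.pending.append((key, reply))

    def on_cycle_complete(self, state):
        # The global write order is now fixed and applied locally, so
        # every buffered read can be served without network traffic.
        for key, reply in self.pending:
            reply(state.get(key))
        self.pending.clear()
```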


Page 23:

ADDITIONAL OPTIMIZATIONS

Pipelining consensus cycles

Critical to achieving high throughput over high latency links

Write leases

For read-mostly workloads with low latency requirements

Reads can complete without waiting until the end of a consensus cycle
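A minimal sketch of the write-lease fast path (assumed mechanics, not the paper's exact design): while a lease guarantees that no other node can commit writes, local state cannot go stale, so reads return immediately.

```python
import time

class LeasedReader:
    def __init__(self, state):
        self.state = state                # local copy, current while leased
        self.lease_expiry = 0.0

    def grant_write_lease(self, duration_s):
        # Assumed: the lease guarantees no conflicting writes commit
        # elsewhere before it expires.
        self.lease_expiry = time.monotonic() + duration_s

    def try_fast_read(self, key):
        """Return (True, value) under a valid lease; otherwise (False,
        None), meaning the read waits for the consensus cycle to end."""
        if time.monotonic() < self.lease_expiry:
            return True, self.state.get(key)
        return False, None
```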


Page 24:

EVALUATION: MULTI DATACENTER CASE

3, 5, and 7 datacenters

Each datacenter corresponds to a super-leaf

3 nodes per datacenter (up to 21 nodes in total)

EC2 c3.4xlarge instances

100 clients on five machines per datacenter

Each client is connected to a random node in the same datacenter

[Table: pairwise latencies across datacenters (in ms). Regions: Ireland (IR), California (CA), Virginia (VA), Tokyo (TK), Oregon (OR), Sydney (SY), Frankfurt (FF).]

Page 25:

CANOPUS VS. EPAXOS (20% WRITES)

[Graph: performance comparison of Canopus and EPaxos with a 20% write workload.]

Page 26:

EVALUATION: SINGLE DATACENTER CASE

3 super-leaves of size 3, 5, 7, or 9 servers (i.e., up to 27 servers in total)

Each server has 32 GB RAM, a 200 GB SSD, and 12 cores running at 2.1 GHz

Each server has a 10 Gbps link to its ToR switch

The aggregation switch has dual 10 Gbps links to each ToR switch

180 clients, uniformly distributed on 15 machines

5 machines in each rack


Page 27:

ZKCANOPUS VS. ZOOKEEPER

[Graph: performance comparison of ZKCanopus and ZooKeeper.]

Page 28:

LIMITATIONS

We trade off fault tolerance for performance and understandability

Cannot tolerate full rack failures or network partitions

We trade off latency for throughput

At low throughputs, latencies can be higher than those of other consensus protocols

Stragglers can hold up the system (temporarily)

Super-leaf peers detect and remove them


Page 29:

ON-GOING WORK

Handling super-leaf failures

For applications with high availability requirements

Detect and remove failed super-leaves to continue

Byzantine fault tolerance

Canopus currently handles only crash-stop failures

Aiming to maintain our current throughput


Page 30:

CONCLUSIONS

Emerging applications involve consensus at large scales

A key barrier is the lack of a scalable consensus protocol

Addressed by Canopus

Decentralized

Network topology aware

Optimized for modern datacenters
