
On Barriers and the Gap between Active and Passive Replication (Full Version)⋆

Flavio P. Junqueira¹ and Marco Serafini²

¹ Microsoft Research, Cambridge, UK. [email protected]

² Yahoo! Research, Barcelona, Spain. [email protected]

Abstract. Active replication is commonly built on top of the atomic broadcast primitive. Passive replication, which has been recently used in the popular ZooKeeper coordination system, can be naturally built on top of the primary-order atomic broadcast primitive. Passive replication differs from active replication in that it requires processes to cross a barrier before they become primaries and start broadcasting messages. In this paper, we propose a barrier function τ that explains and encapsulates the differences between existing primary-order atomic broadcast algorithms, namely semi-passive replication and ZooKeeper atomic broadcast (Zab), as well as the differences between Paxos and Zab. We also show that implementing primary-order atomic broadcast on top of a generic consensus primitive and τ inherently results in higher time complexity than atomic broadcast, as witnessed by existing algorithms. We overcome this problem by presenting an alternative, primary-order atomic broadcast implementation that builds on top of a generic consensus primitive and uses consensus itself to form a barrier. This algorithm is modular and matches the time complexity of existing τ-based algorithms.

1 Introduction

Passive replication is a popular approach to achieve fault tolerance in practical systems [3]. Systems like ZooKeeper [8] or Megastore [1] use primary replicas to produce state updates or state mutations. Passive replication uses two types of replicas: primaries and backups. A primary replica executes client operations, without assuming that the execution is deterministic, and produces state updates. Backups apply state updates in the order generated by the primary. With active replication, by contrast, all replicas execute all client operations, assuming that the execution is deterministic. Replicas execute a sequence of consensus instances on client operations to agree on a single execution sequence using atomic broadcast (abcast). Passive replication has a few advantages, such as simplifying the design of replicated systems with non-deterministic operations, e.g., those depending on timeouts or interrupts.

It has been observed by Junqueira et al. [9] and Birman et al. [2] that using atomic broadcast for passive, instead of active, replication requires taking care of specific constraints. State updates must be applied in the exact sequence in which they have been generated: if a primary is in state A and executes an operation making it transition to state B, the resulting state update δAB must be applied to state A. Applying it to a different state C ≠ A is not safe because it might lead to an incorrect state, which is inconsistent with the history observed by the primary and potentially the clients. Because a state update is the difference between a new state and the previous one, there is a causal dependency between state updates. Unfortunately, passive replication algorithms on top of atomic broadcast (abcast) do not necessarily preserve this dependency: if multiple primaries are concurrently present in the system, they may generate conflicting state updates that followers end up applying in the wrong order. Primary-order atomic broadcast (POabcast) algorithms, like Zab [9], have additional safety properties that solve this problem. In particular, POabcast implements a barrier, the isPrimary predicate, which must be crossed by processes that want to broadcast messages.
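To make the hazard concrete, here is a minimal Python sketch (ours, not from the paper; the names StateUpdate and apply_update are illustrative) of why a state update may only be applied to the exact state it was generated from:

from dataclasses import dataclass

@dataclass(frozen=True)
class StateUpdate:
    pre_state: int   # state the primary executed on (A)
    post_state: int  # resulting state (B)

def apply_update(delta: StateUpdate, state: int) -> int:
    # Applying delta to any state other than delta.pre_state is unsafe:
    # the backup would diverge from the history observed by the primary.
    if state != delta.pre_state:
        raise ValueError("state update applied to the wrong state")
    return delta.post_state

# Two concurrent primaries generate conflicting updates from state 0:
d1 = StateUpdate(pre_state=0, post_state=1)  # delta_AB from one primary
d2 = StateUpdate(pre_state=0, post_state=2)  # conflicting update from another

s = apply_update(d1, 0)   # ok: s == 1
# apply_update(d2, s)     # would raise: d2 was generated from state 0, not 1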

Interestingly, the only existing passive replication algorithm using consensus as a communication primitive, the semi-passive replication algorithm of Defago et al. [7], has linear time complexity in the number of concurrently submitted requests. Recent algorithms for passive replication have constant complexity, but they directly implement POabcast without building on top of consensus [2,9].

⋆ A conference version of this work (without appendices) appears in the proceedings of the 27th International Symposium on Distributed Computing (DISC 2013).

arXiv:1308.2979v5 [cs.DC] 12 Oct 2015


Table 1. Time complexity of the POabcast algorithms presented in this paper; see Sect. 6.4 and 7.2 for details. We consider the use of Paxos as the underlying consensus algorithm since it has optimal latency [11]. However, only the third solution requires the use of Paxos; the other algorithms can use any implementation of consensus. For the latency analysis only, we assume that message delays are equal to ∆. The Stable periods column reports the time, in a passive replication system, between the receipt of a client request and its delivery by a single broadcasting primary/leader (c is the number of clients). The Leader change column reports the idle time after a new single leader is elected by Ω and before it can broadcast new messages.

                                                    Stable periods   Leader change
Atomic broadcast [10]                                2∆               2∆
τ-based POabcast (Sect. 6.1)                         2∆ · c           4∆
τ-based POabcast with white-box Paxos (Sect. 6.3)    2∆               4∆
τ-free POabcast (Sect. 7)                            2∆               4∆

During our work on the ZooKeeper coordination system [8] we have realized that it is still not clear how these algorithms relate, and whether this trade-off between modularity and time complexity is inherent. This paper shows that existing implementations of passive replication can be seen as instances of the same unified consensus-based POabcast algorithm, which is basically an atomic broadcast algorithm with a barrier predicate implemented through a barrier function τ we define in this work. The τ function outputs the identifier of the consensus instance a leader process must decide on before becoming a primary.

Existing algorithms constitute alternative implementations of τ; the discriminant is whether they consider the underlying consensus algorithm as a black box whose internal state cannot be observed. Our τ-based algorithm exposes an inherent trade-off. We show that if one implements τ while considering the consensus implementation as a black box, it is necessary to execute consensus instances sequentially, resulting in higher time complexity. This algorithm corresponds to semi-passive replication.

If the τ implementation can observe the internal state of the consensus primitive, we can avoid the impossibility and execute parallel instances. For example, Zab is similar to the instance of our unified algorithm that uses Paxos as the underlying consensus algorithm and implements the barrier by reading the internal state of the Paxos protocol. We show experimentally that using parallel instances almost doubles the maximum throughput of passive replication in stable periods, even considering optimizations such as batching. Abstracting away these two alternatives and their inherent limitations regarding time complexity and modularity is one of the main observations of this paper.

Finally, we devise a τ-free POabcast algorithm that makes this trade-off unnecessary, since it enables running parallel consensus instances using an unmodified consensus primitive as a black box. Unlike barrier-based algorithms, a process becomes a primary by proposing a special value in the next available consensus instances; this value marks the end of the sequence of accepted messages from old primaries. Table 1 compares the different POabcast algorithms we discuss in our paper.

Our barrier-free algorithm shows that both active and passive replication can be implemented on top of a black-box consensus primitive with small and well-understood changes and without compromising performance.

Differences with the DISC'13 version. This version includes gap handling in Algorithm 3. In addition, Section 6.3 gives more details on how Paxos handles gaps.

2 Related Work

Traditional work on passive replication and the primary-backup approach assumes synchronous links [3]. Group communication has been used to support primary-backup systems; it assumes a ♦P failure detector for liveness [6]. Both atomic broadcast and POabcast can be implemented in a weaker system model, i.e., an asynchronous system equipped with an Ω leader oracle [4]. For example, our algorithms do not need to agree on a new view every time a non-primary process crashes.

Some papers have addressed the problem of reconfiguration: dynamically changing the set of processes participating in the state machine replication group. Vertical Paxos supports reconfiguration by using an external master, which can be a replicated state machine [12]. This supports primary-backup systems, defined as replicated systems where write quorums consist of all processes and each single process is a read quorum. Vertical Paxos does not address the issues of passive replication and considers systems where commands, not state updates, are agreed upon by replicas. Virtually Synchronous Paxos (VS Paxos) aims at combining virtual synchrony and Paxos for reconfiguration [2]. Our work assumes a fixed set of processes and does not consider the problem of reconfiguring the set of processes participating in consensus. Shraer et al. have recently shown that reconfiguration can be implemented on top of a POabcast construction like the ones we present in this paper, making it an orthogonal topic [14].

While there has been a large body of work on group communication, only a few algorithms implement passive replication in asynchronous systems with Ω failure detectors: semi-passive replication [7], Zab [9], and Virtually Synchronous Paxos [2]. We relate these algorithms with our barrier-based algorithms in Sect. 6.5.

Pronto is an algorithm for database replication that shares several design choices with our τ-free algorithm and has the same time complexity in stable periods [13]. Both algorithms elect a primary using an unreliable failure detector and have a similar notion of epochs, which are associated with a single primary. Epoch changes are determined using an agreement protocol, and values from old epochs that are agreed upon after a new epoch has been agreed upon are ignored. Pronto, however, is an active replication protocol: all replicas execute transactions, and non-determinism is handled by agreeing on a per-transaction log of non-deterministic choices that are application specific. Our work focuses on passive replication algorithms, their difference with active replication protocols, and on the notion of barriers in their implementation.

3 System Model and Primitives

Throughout the paper, we consider an asynchronous system composed of a set Π = {p1, . . . , pn} of processes that can fail by crashing. They implement a passive replication algorithm, executing requests obtained from an unbounded number of client processes, which can also fail by crashing. Correct processes are those that never crash. Processes are equipped with an Ω failure detector oracle.

Definition 1 (Leader election oracle). A leader election oracle Ω operating on a set of processes Π outputs the identifier of some process p ∈ Π. Instances of the oracle running on different processes can return different outputs. Eventually, all instances of correct processes permanently output the same correct process.

Our algorithms build on top of (uniform) consensus, which has the following properties.

Definition 2 (Consensus). A consensus primitive consists of two operations: propose(v) and decide(v) of a value v. It satisfies the following properties:

Termination. If some correct process proposes a value, every correct process eventually decides some value.
Validity. If a process decides a value, this value was proposed by some process.
Integrity. Every correct process decides at most one value.
Agreement. No two processes decide differently.

Since our algorithms use multiple instances of consensus, propose and decide have an additional parameter denoting the identifier of the consensus instance.
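The following Python sketch shows the multi-instance consensus interface assumed by our algorithms; the class name MultiConsensus and the callback style are our own illustration, not part of any specific consensus implementation:

from typing import Any, Callable

class MultiConsensus:
    """Consensus with per-instance propose/decide, as assumed in this paper."""

    def __init__(self, on_decide: Callable[[Any, int], None]):
        self.on_decide = on_decide  # invoked as on_decide(value, instance)

    def propose(self, value: Any, instance: int) -> None:
        # Propose value for the consensus instance with this identifier;
        # a real implementation (e.g., Paxos) would run the protocol here.
        ...

    def _decision_reached(self, value: Any, instance: int) -> None:
        # Called when an instance decides; by Agreement, every process
        # observes the same value for the same instance identifier.
        self.on_decide(value, instance)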

Primary order atomic broadcast (POabcast) is an intermediate abstraction used by our unified passive replication algorithm. POabcast provides a broadcast primitive POabcast and a delivery primitive POdeliver. POabcast satisfies all safety properties of atomic broadcast.

Definition 3 (Atomic broadcast). An atomic broadcast primitive consists of two operations: broadcast and deliver of a value. It satisfies the following properties:

Integrity. If some process delivers v then some process has broadcast v.
Total order. If some process delivers v before v′, then any process that delivers v′ must deliver v before v′.
Agreement. If some process pi delivers v and some other process pj delivers v′, then either pi delivers v′ or pj delivers v.³

³ We modified the traditional formulation of agreement to state it as a safety property only.



initially
    replied(c, t) returns false for all inputs;
    initialized ← false;

upon isPrimary()
    Θ ← Σ;
    initialized ← true;

upon ¬isPrimary()
    initialized ← false;

upon receive 〈c, t, o〉 ∧ ¬replied(c, t) ∧ isPrimary() ∧ initialized
    Θ −o→ 〈r, δ〉;
    apply(δ, Θ);
    POabcast(〈δ, r, c, t〉);

upon receive operation 〈c, t, o〉 ∧ replied(c, t)
    send stored 〈c, t, r〉 to c;

upon POdeliver(〈δ, r, c, t〉)
    Σ ← apply(δ, Σ);
    set 〈c, t〉 as replied and store r as last reply to c;
    send 〈c, t, r〉 to c;

Algorithm 1: Passive replication based on POabcast - replica

initially
    t ← 0;

upon execute operation o
    t ← t + 1;
    reliably send 〈c, t, o〉 to all replicas;
    wait for 〈c, t, r〉 from any replica;

Algorithm 2: Passive replication based on POabcast - client c

POabcast extends atomic broadcast by introducing the concept of primary and a barrier: the additional isPrimary() primitive, which POabcast uses to signal when a process is ready to broadcast state updates. This predicate resembles Prmys in the specification of Budhiraja et al. [3]. However, as failure detectors are unreliable in our model, primary election is also unreliable: there might be multiple concurrent primaries at any given time, unlike in [3].

A primary epoch for a process p, or simply a primary, is a continuous period of time during which isPrimary() is true at p. Multiple primaries can be present at any given time: the isPrimary() predicate is local to a single process and multiple primary epochs can overlap in time. Let P be the set of primaries such that at least one value they propose is ever delivered by some process in the current run. A primary mapping Λ is a function that maps each primary in P to a unique primary identifier λ, which we also use to denote the process executing the primary role. We consider primaries as logical processes: saying that event ε occurs at primary λ is equivalent to saying that ε occurs at some process p during a primary epoch for p having primary identifier λ.

Definition 4 (Primary order atomic broadcast). A primary order atomic broadcast primitive consists of two operations broadcast(v) and deliver(v), and of a binary isPrimary() predicate, which indicates whether a process is a primary and is allowed to broadcast a value. Let Λ be a primary mapping and ≺Λ a total order relation among primary identifiers. Primary order broadcast satisfies the Integrity, Total order, and Agreement properties of atomic broadcast; furthermore, it also satisfies the following additional properties:

Local primary order. If λ broadcasts v before v′, then a process that delivers v′ delivers v before v′.
Global primary order. If λ broadcasts v, λ′ broadcasts v′, λ ≺Λ λ′, and some process p delivers v and v′, then p delivers v before v′.
Primary integrity. If λ broadcasts v, λ′ broadcasts v′, λ ≺Λ λ′, and some process delivers v, then λ′ delivers v before it broadcasts v′.



These properties are partially overlapping, as we show in Appendix A. For example, global primary order is very useful in reasoning about the behaviour of POabcast, but it can be implied from the other POabcast properties. It is also worth noting that local primary order is weaker than the single-sender FIFO property, since it only holds within a single primary epoch.

The above properties focus on safety. For liveness, it is sufficient to require the following:

Definition 5 (Eventual Single Primary). There exists a correct process such that eventually it is elected primary infinitely often and all messages it broadcasts are delivered by some process.

Definition 6 (Delivery Liveness). If a process delivers v then eventually every correct process delivers v.

4 Passive Replication from POabcast

Before describing our unified POabcast algorithm, we briefly show how to implement passive replication on top of POabcast.

In passive replication systems, all replicas keep a copy of an object and receive operations from clients. Each client waits for a reply to its last operation op before it submits the following one. Each operation has a unique identifier, comprising the client identifier c and a counter t local to the client.

The pseudocode of the passive replication algorithm is illustrated in Algorithms 1 and 2. This algorithm captures key aspects of practical systems, like the ZooKeeper coordination system. In the case of ZooKeeper, for example, the object is the data tree used to store client information.

A primary replica is a process whose isPrimary predicate is true. This predicate is determined by the underlying POabcast algorithm. We will see later in the paper that becoming primary entails crossing a barrier.

Replicas keep two states: a committed state Σ and a tentative state Θ. The primary (tentatively) executes an operation o on the tentative state Θ; the primary remembers the identifier of the last executed operation from each client (modeled by the replied(c, t) predicate) and the corresponding reply r. The execution of a new operation generates a state update δ, which the primary broadcasts to the other replicas together with the reply and the unique operation identifier 〈c, t〉. We use S −op→ 〈r, δSQ〉 to denote that executing op on the state S produces a reply r and a state update δSQ, which determines a transition from S to a new state Q. When a replica delivers state updates, it applies them onto its committed state Σ.

We show in Appendix B that this algorithm implements a linearizable replicated object. We also discuss a counterexample showing why POabcast cannot be replaced with atomic broadcast due to its lack of barriers; it is a generalization of the counterexample discussed in [9].

5 Unified POabcast Algorithm Using the Barrier Function

After motivating the use of POabcast to obtain natural implementations of passive replication and discussing the role of the barrier predicate isPrimary, we introduce our unified τ-based POabcast algorithm (Algorithm 3). It uses three underlying primitives: consensus, the Ω leader oracle, and a new barrier function τ we will define shortly.

Like typical atomic broadcast algorithms, our POabcast algorithm runs a sequence of consensus instances, each associated with an instance identifier [5]. Broadcast values are proposed using increasing consensus instance identifiers, tracked using the prop counter. Values are decided and delivered following the consensus instance order: if the last decided instance was dec, only the event decide(v, dec + 1) can be activated, resulting in an invocation of POdeliver. This abstracts the buffering of out-of-order decisions between the consensus primitive and our algorithm.
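The buffering mentioned above can be made explicit with a small Python sketch (ours; the names are illustrative): decide events may arrive out of instance order, but POdeliver must follow instance order, so later decisions are held back until the gap closes.

class InOrderDeliverer:
    def __init__(self, podeliver):
        self.podeliver = podeliver  # POdeliver callback
        self.dec = 0                # last delivered instance
        self.pending = {}           # instance -> decided value

    def on_decide(self, value, instance):
        self.pending[instance] = value
        # Flush the longest contiguous run starting at dec + 1.
        while self.dec + 1 in self.pending:
            self.dec += 1
            self.podeliver(self.pending.pop(self.dec))

buf = InOrderDeliverer(print)
buf.on_decide("v2", 2)  # buffered: instance 1 has not decided yet
buf.on_decide("v1", 1)  # delivers v1, then v2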

The most important difference between our algorithm and an implementation of atomic broadcast is that it imposes an additional barrier condition for broadcasting messages: isPrimary must hold. In particular, it is necessary for safety that dec ≥ τ. The condition that a primary must be a leader according to Ω is only required for the liveness property of POabcast: it ensures that eventually there is exactly one correct primary in the system. The barrier function τ returns an integer and is defined as follows.



initially
    dec ← 0;
    prop ← 0;

upon POabcast(v) ∧ isPrimary()
    prop ← max(prop + 1, dec + 1);
    propose(v, prop);

upon decide(v, dec + 1)
    dec ← dec + 1;
    POdeliver(v);

function isPrimary()
    return (dec ≥ τ) ∧ (Ω = p);

/* Gap handling */
upon Ω changes from q ≠ p to p
    forall the i ∈ [dec + 1, τ] do
        propose(skip(τ), i);

upon decide(skip(k), dec + 1)
    dec ← k;

Algorithm 3: POabcast based on the barrier function and consensus - process p

Definition 7 (Barrier function τ). Let σ be an infinite execution, Λ a primary mapping in σ, ≺Λ a total order among the primary identifiers, and λ a primary such that at least one value it proposes is delivered in σ. A barrier function τ for λ returns:⁴

    τ = max{ i : ∃ v, p, λ′ s.t. decide_p(v, i) ∈ σ ∧ propose_λ′(v, i) ∈ σ ∧ λ′ ≺Λ λ }

An actual implementation of the τ function can only observe the finite prefix of σ preceding its invocation; however, it must make sure that its outputs are valid in any infinite extension of the current execution. If none of the values proposed by a primary during a primary epoch are ever delivered, τ can return arbitrary values.

We show in Appendix C that this definition of τ is sufficient to guarantee the additional properties of POabcast compared to atomic broadcast. In particular, it is key to guarantee that the primary integrity property is respected. Local primary order is obtained by delivering elements in the order in which they are proposed and decided.

The key to defining a barrier function is identifying a primary mapping Λ and a total order of primary identifiers ≺Λ that satisfy the barrier property, as we will show in the following section. There are two important observations to make here. First, we use the same primary mapping Λ and total order ≺Λ for the barrier function and for POabcast. If no value proposed by a primary p is ever decided, then the primary has no identifier in Λ and is not ordered by ≺Λ, so the output of τ returned to p is not constrained by Definition 7. This is fine because values proposed by p are irrelevant for the correctness of passive replication: they are never delivered in the POabcast Algorithm 3, and therefore the corresponding state updates are never observed by clients in the passive replication Algorithms 1 and 2. Note also that a primary might not know its identifier λ: this is only needed for the correctness argument.

We call τ a "barrier" because of the way it is used in Alg. 3. Intuitively, a new leader waits until it determines either that some value proposed by a previous primary is decided at some point in time or that no such value will ever be delivered. Until the leader makes such a decision, it does not make further progress. Also, although this blocking mechanism is mainly implemented by the leader, the overall goal is to have all processes agreeing on the outcome for a value, so it resembles the classic notion of a barrier.

Consensus does not guarantee the termination of instances in which only a faulty primary has made a proposal. Therefore, a new primary proposes a special skip(τ) value to ensure progress for all instances in which no decision has been reached yet, that is, those with instance number between dec + 1 and τ. If the skip value is decided, all decision events on instances up to τ are ignored, and therefore the values decided in these instances are not POdelivered.

⁴ Subscripts denote the process that executed the propose or decide steps.



6 Implementations of the Barrier Function τ

6.1 Barriers with Black-Box Consensus

We first show how to implement τ using the consensus primitive as a black box. This solution is modular but imposes the use of sequential consensus instances: a primary is allowed to have at most one outstanding broadcast at a time. This corresponds to the semi-passive replication algorithm [7].

Let prop and dec be the variables used in Algorithm 3, and let τseq be equal to max(prop, dec). We have the following result:

Theorem 8. The function τseq is a barrier function.

Proof. We define Λ as follows: if a leader process p proposes a value vi,p for consensus instance i and vi,p is decided, p has primary identifier λ = i. A primary has only one identifier: after vi,p is broadcast, it holds that prop > dec and dec < τseq, so isPrimary() stops evaluating to true at p. The order ≺Λ is defined by ordering primary identifiers as regular integers.

If a process p proposes a value v for instance i = max(prop + 1, dec + 1) in Algorithm 3, it observes τseq = max(prop, dec) = i − 1 when it becomes a primary. If v is decided, p has primary identifier λ = i. All primaries preceding λ in ≺Λ have proposed values for instances preceding i, so τseq meets the requirements of barrier functions.
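As a sketch (ours, under the assumption that the Ω output is available locally as a boolean flag), τseq and the resulting isPrimary check of Algorithm 3 can be written as:

def tau_seq(prop: int, dec: int) -> int:
    return max(prop, dec)

def is_primary(prop: int, dec: int, omega_elects_me: bool) -> bool:
    return dec >= tau_seq(prop, dec) and omega_elects_me

# After broadcasting on instance prop = dec + 1 we have prop > dec, so
# isPrimary() is false until that instance is decided: instances are
# forced to run sequentially.
assert is_primary(prop=3, dec=3, omega_elects_me=True)      # may broadcast
assert not is_primary(prop=4, dec=3, omega_elects_me=True)  # must wait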

6.2 Impossibility

One might wonder if this limitation of sequential instances is inherent. Indeed, this is the case as we now show.

Theorem 9. Let Π be a set of two or more processes executing the τ-based POabcast algorithm with an underlying consensus implementation C that can only be accessed through its propose and decide calls. There is no local implementation of τ for C allowing a primary p to propose a value for instance i before p reaches a decision for instance i − 1.

Proof. The proof is by contradiction: we assume that a barrier function τc allowing primaries to propose values for multiple concurrent consensus instances exists.

Run σ1: The oracle Ω outputs some process p as the only leader in the system from the beginning of the run. Assume that p broadcasts two values v1 and v2 at the beginning of the run. For liveness of POabcast, p must eventually propose values for consensus instances 1 and 2. By assumption, τc allows p to start consensus instance 2 before a decision for instance 1 is reached. Therefore p observes τc = 0 when it proposes v1 and v2. The output of τc must be independent from the internal events of the underlying consensus implementation C, since τc cannot observe them. We can therefore assume that no process receives any message before p proposes v2.

Run σ′1: The prefix of σ1 that finishes immediately after p proposes v2. No process receives any message.

Run σ2: Similar to σ1, but the only leader is p′ ≠ p and the proposed values are v′1 and v′2. Process p′ observes τc = 0 when it proposes v′1 and v′2.

Run σ′2: The prefix of σ2 that finishes immediately after p′ proposes v′2. No process receives any message.

Run σ3: The beginning of this run is the union of all events in the runs σ′1 and σ′2. No process receives any message until the end of the union of σ′1 and σ′2. The Ω oracle is allowed to elect two distinct leaders for a finite time. Process p (resp. p′) cannot distinguish between run σ′1 (resp. σ′2) and the corresponding local prefix of σ3 based on the outputs of the consensus primitive and of the leader oracle. After the events of σ′1 and σ′2 have occurred, some process decides v′1 for consensus instance 1 and v2 for consensus instance 2.

Regardless of the definition of Λ and ≺Λ, the output of τc in σ3 is incorrect. Let p and p′ have primary identifiers λ and λ′ when they proposed v2 and v′1, respectively. If λ ≺Λ λ′, τc should have returned 2 instead of 0 when p′ became primary. If λ′ ≺Λ λ, τc should have returned 1 instead of 0 when p became primary.



6.3 Barriers with White-Box Paxos

An alternative that avoids the aforementioned impossibility, and that corresponds to Zab [9], is to consider the internal state of the underlying consensus algorithm. We exemplify this approach considering the popular Paxos algorithm [10]. A detailed discussion of Paxos is out of the scope of this work; we only present a summary for completeness.

Overview of Paxos. In Paxos each process keeps, for every consensus instance, an accepted value, which is the most current value it is aware of that might have been decided. A process p elected leader must first read, for each instance, the value that may have been decided upon for this instance, if any. To obtain this value, the leader selects a unique ballot number b and executes a read phase by sending a read message to all other processes. Processes that have not yet received messages from a leader with a ballot number higher than b reply by sending their current accepted value for the instance. Each accepted value is sent attached to the ballot number of the previous leader that proposed that value. The other processes also promise not to accept any message from leaders with ballots lower than b. When p receives accepted values from a majority of processes, it picks for each instance the accepted value with the highest attached ballot.

After completing the read phase, the new leader proposes the values it picked, as well as its own values for the instances for which no value was decided. The leader proposes values in a write phase: it sends them to all processes together with the current ballot number b. Processes accept proposed values only if they have not already received messages from a leader with a ballot number b′ > b. After they accept a proposed value, they send an acknowledgement to the leader proposing it. When a value has been written with the same ballot at a majority of processes, it is decided.

In a nutshell, the correctness argument of Paxos boils down to the following. If a value v has been decided, a majority of processes have accepted it with a given ballot number b; we say that the proposal 〈v, b〉 is chosen. If the proposal is chosen, no process in the majority will accept a value from a leader with a ballot number lower than b. At the same time, every leader with a ballot number higher than b will read the chosen proposal in the read phase, and will also propose v.

Integrating the Barrier Function. We modify Paxos to incorporate the barrier function. If a process is not a leader, there is no reason to evaluate τ. Whenever a process is elected leader, it executes the read phase. Given a process p such that Ω = p, let read(p) be the maximum consensus instance for which any value has been picked in the last read phase executed by p. The barrier function is implemented as follows:

    τPaxos = ⊤        if Ω ≠ p ∨ p is in the read phase
    τPaxos = read(p)  if Ω = p ∧ p is in the write phase

The output value ⊤ is such that dec ≥ τPaxos never holds for any value of dec. This prevents leaders from becoming primaries until a correct output for τPaxos is determined.
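A Python sketch of τPaxos under the assumptions of this section (ours; the leader flags and the picked map would come from a white-box Paxos implementation, and TOP models ⊤):

TOP = float("inf")  # dec >= TOP never holds, for any dec

def tau_paxos(is_leader: bool, in_read_phase: bool, picked: dict) -> float:
    # picked: instance -> value picked in the leader's last read phase
    if not is_leader or in_read_phase:
        return TOP                 # block: the process may not become primary
    return max(picked, default=0)  # read(p): highest instance with a picked value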

The sequence of values picked by the new leader p can have gaps. This occurs when p can pick some value for an instance i but cannot pick any value for an instance j < i. The new leader fills such gaps with no-op values. This is analogous to the gap handling mechanism in Algorithm 3, where the new primary proposes skip(τ) values to fill gaps. Therefore, whenever Paxos decides a value no-op for an instance i, Algorithm 3 treats this decision as if consensus decided skip(τ) for i.

We now show that this τ implementation is correct. The proof relies on the correctness argument of Paxos.

Theorem 10. The function τPaxos is a barrier function.

Proof. By the definition of τPaxos, a process becomes a primary if it is a leader and has completed the read phase. Let Λ associate a primary with the unique ballot number it uses in the Paxos read phase if some value it proposes is ever decided, and let ≺Λ be the ballot number order. Paxos guarantees that if any process ever decides a value v proposed by a leader with a ballot number smaller than that of λ, then v is picked by λ in the read phase [10]. This is sufficient to meet the requirements of τ.



6.4 Time Complexity of τ -Based POabcast with Different Barrier Functions

We now explain the second and third rows of Table 1. Just for the analysis, we assume that there are c clients in the system, that the communication delay is ∆, and that Paxos is used as the underlying consensus protocol in all our implementations since it is optimal [11].

We first consider the barrier function of Sect. 6.1. If a primary receives requests from all clients at the same time, it will broadcast and deliver the corresponding state updates sequentially. Delivering a message requires 2∆, the latency of the write phase of Paxos. Since each message will take 2∆ time to be delivered, the last message will be delivered in 2∆ · c time. During leader change, Paxos takes 2∆ time to execute the read phase and 2∆ to execute the write phase if a proposal by the old primary has been chosen and potentially decided in the last consensus instance.
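For illustration (numbers chosen for exposition only): with ∆ = 1 ms and c = 100 concurrently submitted requests, the sequential barrier delivers the last state update only after 2∆ · c = 200 ms, whereas each individual update still takes just 2∆ = 2 ms.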

With the barrier function of Sect. 6.3, consensus instances are executed in parallel with a latency of 2∆. The complexity for leader changes is the same, since the write phase is executed in parallel for all instances up to τ.

Note that the longer leader change time of POabcast algorithms compared to atomic broadcast (see Table 1) is due to the barrier: before it becomes a primary, a process must decide on all values that have been proposed by the previous primaries and potentially decided (chosen). This is equivalent to executing read and write phases, which require 4∆ time. In atomic broadcast, it is sufficient that a new leader proposes chosen values from previous leaders.

6.5 Relationship between τ Functions and Existing POabcast Algorithms

The POabcast algorithm with the barrier function of Sect. 6.1 is similar to semi-passive replication [7], since both enforce the same constraint: primaries only keep one outstanding consensus instance at a time. The time complexity of the two protocols using Paxos as the underlying consensus protocol is the same (Table 1, second row).

If the barrier function implementation selects a specific consensus protocol and assumes that it can access its internal state, as discussed in Sect. 6.3, our barrier-based POabcast algorithm can broadcast state updates in the presence of multiple outstanding consensus instances. This is the same approach as Zab, and indeed there are many parallels with this algorithm. The time complexity in stable periods is the same (see Table 1, third row). A closer look shows that the leader change complexity is also equal, apart from specific optimizations of the Zab protocol. In Zab, the read phase of Paxos corresponds to the discovery phase; the CEPOCH message is used to implement leader election and to speed up the selection of a unique ballot (or epoch, in Zab terms) number that is higher than any previous epoch number [9]. After the read phase is completed, the leader decides on all consensus instances up to the instance identifier returned by τPaxos; this is the synchronization phase, which corresponds to a write phase in Paxos. In our implementation, the barrier function returns and the leader waits until enough consensus instances are decided. At this point, the necessary condition dec ≥ τPaxos of our generic POabcast construction is fulfilled, so the leader crosses the barrier, becomes a primary, and can proceed with proposing values for new instances. In Zab, this corresponds to the broadcast phase.

Virtually Synchronous Paxos is also a modified version of Paxos that implements POabcast and the τPaxos barrier function, but it has the additional property of making the set of participating processes dynamic [2]. It has the same time complexity during stable periods and leader changes as in Table 1.

7 POabcast Using Consensus Instead of τ for the Barrier

The previous section shows an inherent trade-off in τ implementations between modularity, which can be achieved by using sequential consensus instances and using consensus as a black box, and performance, which can be increased by integrating the implementation of the barrier function in a specific consensus protocol. In this section, we show that this trade-off can be avoided through the use of an alternative POabcast algorithm.

7.1 Algorithm

Our τ-free algorithm (see Algorithm 4) implements POabcast, so it is an alternative to Algorithm 3. The algorithm is built upon a leader election oracle Ω and consensus. The main difference with Algorithm 3 is that the barrier predicate isPrimary is implemented using consensus instead of τ: consensus instances are used to agree not only on values, but also on primary election information.



 1 initially
 2     tent-epoch, dec, deliv-seqno, prop, seqno ← 0;
 3     epoch ← ⊥;
 4     primary ← false;
 5 upon Ω changes from q ≠ p to p
 6     try-primary();
 7 procedure try-primary()
 8     tent-epoch ← new unique epoch number;
 9     propose(〈NEW-EPOCH, tent-epoch〉, dec);
10 upon decide(〈NEW-EPOCH, tent-epochm〉, dec)
11     dec ← dec + 1;
12     epoch ← tent-epochm;
13     dec-array ← empty array;
14     prop-array ← empty array;
15     deliv-seqno ← dec;
16     if Ω = p then
17         if tent-epoch = tent-epochm then
18             prop ← dec;
19             seqno ← dec;
20             primary ← true;
21         else
22             primary ← false;
23             try-primary();
24 upon POabcast(v)
25     propose(〈VAL, v, epoch, seqno〉, prop);
26     prop-array[prop] ← 〈v, seqno〉;
27     prop ← prop + 1;
28     seqno ← seqno + 1;
29 upon decide(〈VAL, v, epochm, seqnom〉, dec)
30     if epochm = epoch then
31         dec-array[seqnom] ← v;
32         while dec-array[deliv-seqno] ≠ ⊥ do
33             POdeliver(dec-array[deliv-seqno]);
34             deliv-seqno ← deliv-seqno + 1;
35     if primary ∧ epochm ≠ epoch
36             ∧ prop ≥ dec then
37         〈v′, seqno′〉 ← prop-array[dec];
38         prop-array[prop] ← prop-array[dec];
39         propose(〈VAL, v′, epoch, seqno′〉, prop);
40         prop ← prop + 1;
41     if ¬primary ∧ Ω = p then
42         try-primary();
43     dec ← dec + 1;
44 upon Ω changes from p to q ≠ p
45     primary ← false;
46 function isPrimary()
47     return primary;

Algorithm 4: Barrier-free POabcast using black-box consensus - process p

Another difference is that some decided values may not be delivered. This requires the use of additional buffering, which slightly increases the complexity of the implementation.

When a process p becomes leader, it picks a unique epoch number tent-epoch and proposes a 〈NEW-EPOCH, tent-epoch〉 value in the smallest consensus instance dec in which p has not yet reached a decision (lines 5-9). Like in Algorithm 3, we use multiple consensus instances. All replicas keep a decision counter dec, which indicates the current instance in which a consensus decision is awaited, and a proposal counter prop, which indicates the next available instance for proposing a value. Another similarity with Algorithm 3 is that decision events are processed following the order of consensus instances, tracked using the variable dec (see lines 10 and 29). Out-of-order decision events are buffered, although this is omitted in the pseudocode.

Every time a NEW-EPOCH tuple is decided, the sender of the message is elected primary and its epoch tent-epoch is established (lines 10-23). When a new epoch is established, processes set their current epoch counter epoch to tent-epoch. If the process delivering the NEW-EPOCH tuple is a leader, it checks whether the epoch that has just been established is its own tentative epoch. If this is the case, the process considers itself a primary and sets primary to true; else, it tries to become a primary again.

When p becomes a primary, it can start to broadcast values by proposing VAL tuples in the next consensus instances, in parallel (lines 24-28). Ensuring that followers are in a state consistent with the new primary does not require using barriers: all processes establishing tent-epoch in consensus instance i have decided and delivered the same sequence of values in the instances preceding i. This guarantees that the primary integrity property of POabcast is respected.

Processes only POdeliver VAL tuples of the last established epoch until a different epoch is established (lines 29-33, see in particular the condition epochm = epoch). The algorithm establishes the following total order ≺Λ of primary identifiers: given two different primaries λ and λ′ which picked epoch numbers e and e′ respectively, we say that λ ≺Λ λ′ if and only if a tuple 〈NEW-EPOCH, e〉 is decided for a consensus instance n, a tuple 〈NEW-EPOCH, e′〉 is decided for a consensus instance m, and n < m. Suppose that p is the primary λ with epoch number eλ elected in consensus instance decλ. All processes set their current epoch variable e to eλ after deciding in instance decλ. From consensus instance number decλ + 1 to the next consensus instance in which a NEW-EPOCH tuple is decided, processes decide and deliver only values that are sent from λ and included in VAL tuples with epochm = eλ. Replicas thus deliver messages following the order ≺Λ of the primaries that sent them, fulfilling the global primary order property of POabcast.

The additional complexity in handling VAL tuples is necessary to guarantee the local primary order property of POabcast. VAL tuples of an epoch are not necessarily decided in the same order as they are proposed. This is why primaries include a sequence number seqno in VAL tuples. In some consensus instance, the tuples proposed by the current primary might not be the ones decided. This can happen in the presence of concurrent primaries, since primaries send proposals for multiple overlapping consensus instances without waiting for decisions. If a primary is demoted, values from old and new primaries could be interleaved in the sequence of decided values for a finite number of instances. All processes agree on the current epoch of every instance, so they do not deliver messages from other primaries with different epoch numbers. However, it is necessary to buffer out-of-order values from the current primary to deliver them later. That is why processes store decided values from the current primary in the dec-array array (line 31), and deliver them only if a continuous sequence of sequence numbers, tracked by deliv-seqno, can be delivered (lines 32-34).
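The epoch filter and the seqno-based reordering (lines 29-34) can be summarized in a Python sketch (ours; the names follow the pseudocode):

class EpochDeliverer:
    def __init__(self, podeliver):
        self.podeliver = podeliver
        self.epoch = None
        self.dec_array = {}       # seqno -> value, current epoch only
        self.deliv_seqno = 0

    def on_new_epoch(self, epoch, next_seqno):
        self.epoch = epoch
        self.dec_array = {}       # drop buffered values from old primaries
        self.deliv_seqno = next_seqno

    def on_val_decided(self, value, epoch_m, seqno_m):
        if epoch_m != self.epoch:
            return                # value from a stale primary: ignore
        self.dec_array[seqno_m] = value
        # Deliver the longest contiguous run of sequence numbers.
        while self.deliv_seqno in self.dec_array:
            self.podeliver(self.dec_array.pop(self.deliv_seqno))
            self.deliv_seqno += 1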

Primaries also need to resend VAL tuples that could not be decided in the correct order. When values are proposed, they are stored in prop-array following the sequence number order; this buffer is reset to the next ongoing consensus instance every time a new primary is elected. Primaries resend VAL tuples in lines 35-40. Primaries keep a proposal instance counter prop, indicating the next consensus instance in which values can be proposed. If an established primary has outstanding proposals for the currently decided instance dec, it holds that prop ≥ dec. In this case, if the decided VAL tuple is not one such outstanding proposal but has instead been sent by a previous primary, it holds that epochm ≠ epoch. If all the previous conditions hold, the established primary must resend the value that has been skipped, prop-array[dec].v′, using the same original sequence number prop-array[dec].seqno′ in the next available consensus instance, which is prop.

The arrays dec-array and prop-array do not need to grow indefinitely. Elements of dec-array (resp. prop-array) with position smaller than deliv-seqno (resp. dec) can be garbage-collected.

For liveness, a leader which is not a primary keeps trying to become a primary by sending a NEW-EPOCH tuple for every consensus instance (lines 22-23). The primary variable is true if a leader is an established primary. It stops being true if the primary is not a leader any longer (lines 44-45).

Algorithm 4 correctly implements POabcast, as shown in Appendix D.

7.2 Time Complexity

As before, we use Paxos as the consensus algorithm and assume a communication delay of ∆. During stable periods, the time to deliver a value is 2∆, which is the time needed to execute a Paxos write phase. When a new leader is elected, it first executes the read phase, which takes 2∆. Next, it executes the write phase for all instances in which values have been read but not yet decided, and for one additional instance for its NEW-EPOCH tuple. All these instances are executed in parallel, so they finish within 2∆ time. After this time, the new leader crosses the barrier, becomes a primary, and starts broadcasting new values.

8 Experimental Evaluation

Our τ-free algorithm combines modularity with constant time complexity. Since this work was motivated by our experience with systems like ZooKeeper, one might wonder whether this improvement has a practical impact. Current implementations of replicated systems can reach some degree of parallelism even if they execute consensus instances sequentially. This is achieved through an optimization called batching: multiple client requests are aggregated in a batch and agreed upon together using a single instance. Even in the presence of batching, we found that there is a substantial advantage in running multiple consensus instances in parallel.

We implemented two variants of the Paxos algorithm, one with sequential consensus instances and one with parallel ones, and measured the performance of running our POabcast algorithms on top of them. We consider fault-free runs in which the leader election oracle outputs the same leader to all processes from the beginning. We used three replicas and additional dedicated machines for the clients; all servers are quad-core 2.5 GHz CPU servers with 16 GB of RAM connected through a Gigabit network.

[Figure 1: latency (in ms) versus throughput (in Kops/s) for the Sequential and Parallel variants.]

Fig. 1. Latency and throughput with micro-benchmarks. Request and state update sizes were set to 1 kB, which is the typical size observed in ZooKeeper. Both protocols use batching.

The experiments consist of micro-benchmarks, in which the replicated object does nothing and has an empty state. These benchmarks are commonly used in the evaluation of replication algorithms because they reproduce a scenario in which the replication protocol, rather than execution, is the bottleneck of the system. In systems where execution is the bottleneck, using a more efficient replication algorithm has no impact on the performance of the system.

We used batching in all our experiments. With sequential consensus instances, we batch all requests received while a previous instance is ongoing. In the pipelined version, we start a new consensus instance when either the previous instance is completed or b requests have been batched. We found b = 50 to be optimal. Every measurement was repeated five times at steady state, and variances were negligible.
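The two batching policies can be sketched in Python as follows (our illustration; propose(batch) stands for starting one consensus instance on a batch of requests):

class Batcher:
    def __init__(self, propose, b=50, pipelined=True):
        self.propose = propose    # starts a consensus instance on a batch
        self.b = b                # batch size at which pipelining starts early
        self.pipelined = pipelined
        self.batch = []
        self.outstanding = 0      # number of instances currently running

    def on_request(self, req):
        self.batch.append(req)
        # Sequential: only start when no instance is outstanding.
        # Pipelined: also start early once b requests have accumulated.
        if self.outstanding == 0 or (self.pipelined and len(self.batch) >= self.b):
            self._flush()

    def on_instance_done(self):
        self.outstanding -= 1
        if self.batch and self.outstanding == 0:
            self._flush()

    def _flush(self):
        self.outstanding += 1
        self.propose(self.batch)
        self.batch = []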

Figure 1 reports the performance of the two variants with messages (requests and state updates) of size 1 kB, which is a common state update size for ZooKeeper and Zab [9]. Each point in the plot corresponds to a different number of clients concurrently issuing requests.

The peak throughput with parallel consensus instances is almost twice that with sequential instances. The same holds with messages of size 4 kB. The difference decreases with smaller updates than the ones we observe in practical systems like ZooKeeper. In the extreme case of empty requests and state updates, the two approaches have virtually the same request latency and throughput: they both achieve a maximum throughput of more than 110 Kops/s and a minimum latency of less than 0.5 ms.

These results show that low time complexity (see Table 1) is very important for high-performance passive replication. When there is little load in the system, the difference in latency between the two variants is negligible; in fact, due to the use of batching, running parallel consensus instances is not needed in that case. Since clients can only submit one request at a time, we increase load by increasing the number of clients concurrently submitting requests; this is equivalent to increasing the parameter c of Table 1. As the number of clients increases, latency grows faster in the sequential case, as predicted by our analysis. With sequential consensus instances, a larger latency also results in significantly worse throughput compared to the parallel variant, due to lower network and CPU utilization.

9 Conclusions

Some popular systems such as ZooKeeper have used passive replication to mask crash faults. We have shown that the barrier predicate isPrimary is a key element differentiating active and passive replication, and that its implementation significantly impacts the complexity and modularity of the algorithm. We have shown how to implement passive replication on top of POabcast. By making leader processes cross the barrier before becoming primaries, we prevent state updates from being decided and applied out of order.



We then extracted a unified algorithm for implementing POabcast using the barrier function, which abstracts existing approaches. The barrier function is a simple way to understand the difference between passive and active replication, as well as the characteristics of existing POabcast algorithms, but it imposes a trade-off between parallelism and modularity. We have proposed an algorithm that does not present such a limitation by not relying upon a barrier function, and yet it guarantees the order of state updates according to the succession of primaries over time. This algorithm differs from existing ones in its use of consensus, instead of barrier functions, for primary election.

Acknowledgement. We would like to express our gratitude to Alex Shraer and Benjamin Reed for the insightful feedback on previous versions of the paper, and to Daniel Gomez Ferro for helping out with the experiments.

References

1. Jason Baker, Chris Bond, James Corbett, JJ Furman, Andrey Khorlin, James Larson, Jean-Michel Leon, Yawei Li, Alexander Lloyd, and Vadim Yushprakh. Megastore: Providing scalable, highly available storage for interactive services. In CIDR, volume 11, pages 223–234, 2011.

2. Ken Birman, Dahlia Malkhi, and Robbert Van Renesse. Virtually synchronous methodology for dynamic service replication. Technical Report MSR-TR-2010-151, Microsoft Research, 2010.

3. Navin Budhiraja, Keith Marzullo, Fred B. Schneider, and Sam Toueg. The primary-backup approach, pages 199–216. ACM Press/Addison-Wesley, 1993.

4. Tushar Deepak Chandra, Vassos Hadzilacos, and Sam Toueg. The weakest failure detector for solving consensus. Journal of the ACM, 43(4):685–722, 1996.

5. Tushar Deepak Chandra and Sam Toueg. Unreliable failure detectors for reliable distributed systems. Journal of the ACM, 43(2):225–267, March 1996.

6. Gregory V. Chockler, Idit Keidar, and Roman Vitenberg. Group communication specifications: a comprehensive study. ACM Computing Surveys, 33(4):427–469, 2001.

7. Xavier Defago and Andre Schiper. Semi-passive replication and lazy consensus. Journal of Parallel and Distributed Computing, 64(12):1380–1398, 2004.

8. Patrick Hunt, Mahadev Konar, Flavio P. Junqueira, and Ben Reed. ZooKeeper: Wait-free coordination for Internet-scale systems. In USENIX Annual Technical Conference, pages 145–158, 2010.

9. Flavio P. Junqueira, Benjamin Reed, and Marco Serafini. Zab: High-performance broadcast for primary-backup systems. In IEEE Conference on Dependable Systems and Networks, pages 245–256, 2011.

10. Leslie Lamport. The part-time parliament. ACM Transactions on Computer Systems (TOCS), 16(2):133–169, 1998.

11. Leslie Lamport. Lower bounds for asynchronous consensus. In Future Directions in Distributed Computing Workshop, pages 22–23, October 2003.

12. Leslie Lamport, Dahlia Malkhi, and Lidong Zhou. Vertical Paxos and primary-backup replication. In ACM Symposium on Principles of Distributed Computing, pages 312–313, 2009.

13. Fernando Pedone and Svend Frolund. Pronto: A fast failover protocol for off-the-shelf commercial databases. In IEEE Symposium on Reliable Distributed Systems, pages 176–185, 2000.

14. Alexander Shraer, Benjamin Reed, Dahlia Malkhi, and Flavio Junqueira. Dynamic reconfiguration of primary/backup clusters. In USENIX Annual Technical Conference, pages 425–438, 2012.


A Alternative Definitions of POabcast

There are alternative formulations of POabcast. For example, while global primary order is important for understanding the properties of POabcast, it can be derived from the other properties of POabcast.

Theorem 11 Global primary order is implied by the other properties of POabcast.

Proof. Assume that a process pi delivers two values v and v′, which were broadcast by the primaries λ and λ′ respectively. Also, assume that λ ≺Λ λ′. Global primary order requires that pi deliver first v and then v′. Assume by contradiction that pi delivers the two values in the opposite order. By primary integrity, λ′ delivers v before broadcasting v′. By total order, λ′ delivers v′ before v. However, by integrity, λ′ cannot deliver v′ since it has not yet broadcast it, a contradiction.

B Correctness of Passive Replication on Top of POabcast

This section shows the correctness of Algorithm 1.

Formalism and rationale. Replicas keep a copy of a shared object. The shared object can be nondeterministic: there may be multiple state updates δSQ1, δSQ2, . . . and replies r1, r2, . . . such that S —op→ 〈δSQi, ri〉. We omit subscripts from state updates when these are not needed.

Applying state updates responds to the constraints we discussed in the introduction: state updates can only be applied to the state from which they were generated. Formally, given a state update δSQ such that S —op→ 〈δSQ, r〉 for any value of r, we have the following:

apply(δSQ, R) = Q if S = R, and apply(δSQ, R) = ⊥ otherwise.

We also use the notation S —op⇒ Q to denote the tentative execution of the operation op on a tentative state S, resulting in a transition to a tentative state Q. The notation ⊥ denotes an undefined state; a correct passive replication system shall never reach such an undefined state. This formalizes the difference between agreeing on state updates and agreeing on operations. The execution of operations never leads the system to an undefined state: the order of their execution may be constrained by consistency requirements, for example linearizability, but not by the semantics of the replicated object. State updates are different: they need to be applied exactly to the state in which they were generated, or an undefined state ⊥ can be reached.
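To make these definitions concrete, here is a minimal sketch of the state-update semantics in Python (our own illustration; StateUpdate, apply_update, and UNDEFINED are names we introduce, not identifiers from Algorithm 1):

```python
# Minimal sketch of the state-update semantics (our illustration).
# A state update δSQ records the pre-state S it was generated from
# and the post-state Q it produces; applying it to any other state
# yields the undefined state ⊥.

UNDEFINED = object()  # stands for the undefined state ⊥

class StateUpdate:
    def __init__(self, pre_state, post_state):
        self.pre_state = pre_state    # S: state the primary executed on
        self.post_state = post_state  # Q: state after the operation

def apply_update(delta, state):
    """apply(δSQ, R) = Q if S = R, and ⊥ otherwise."""
    if state == delta.pre_state:
        return delta.post_state
    return UNDEFINED
```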

The reason for adopting POabcast instead of regular atomic broadcast in Algorithm 1 can be illustrated with an example (see Fig. 2). Consider an algorithm similar to Algorithm 1 such that atomic broadcast is used instead of POabcast. Assume also that isPrimary is implemented using reliable leader election, that is, only one primary process exists from the beginning, and it is a correct process. Consider a run in which the initial state of all processes is A. A process pi is elected leader; it tentatively executes two operations op1 and op2 such that A —op1⇒ B —op2⇒ C, and it atomically broadcasts the state updates δAB and δBC. Assume now that pi crashes and pj becomes leader. pj has not delivered any message so far, so it is still in state A. Process pj executes an operation op3 such that A —op3⇒ D, and broadcasts δAD.

In this run there are no concurrent primaries. Yet, atomic broadcast leads to incorrect behavior, since it could deliver the sequence of state updates 〈δAD, δBC〉. Since all delivered state updates are executed, a backup would execute apply(δAD, A), transitioning to state D, and then execute apply(δBC, D), transitioning to an undefined state ⊥ that may not be reachable in correct runs. Note that even using FIFO atomic broadcast would not solve the problem. It guarantees that if a message from a process is delivered, then all messages previously broadcast by that process are delivered first. In our example, FIFO atomic broadcast could still deliver the sequence of state updates 〈δAD, δAB, δBC〉, which also leads to ⊥.
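The run above can be replayed with the apply_update sketch from earlier in this section (again our illustration), showing how both delivery orders drive a backup into ⊥:

```python
# Replaying the run of Fig. 2 (our illustration, reusing the sketch above).
delta_AB = StateUpdate("A", "B")  # from pi: A —op1⇒ B
delta_BC = StateUpdate("B", "C")  # from pi: B —op2⇒ C
delta_AD = StateUpdate("A", "D")  # from pj, which was still in state A

# Atomic broadcast may deliver 〈δAD, δBC〉:
state = apply_update(delta_AD, "A")    # -> "D"
state = apply_update(delta_BC, state)  # -> ⊥, since B ≠ D
assert state is UNDEFINED

# FIFO atomic broadcast may still deliver 〈δAD, δAB, δBC〉:
state = apply_update(delta_AD, "A")    # -> "D"
state = apply_update(delta_AB, state)  # -> ⊥ already, since A ≠ D
assert state is UNDEFINED
```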

POabcast prevents this incorrect run. First, it determines a total order between primary identifiers: for example, this could order the primary identifier of pi before that of pj. Global primary order ensures that followers apply state updates following this order: the updates sent by pj are delivered after the ones sent by pi. Primary integrity guarantees that, before broadcasting a state update, a primary delivers all state updates from previous primaries, thus making sure that all followers can safely apply the state updates it broadcasts. For example, pj would apply the sequence of state updates 〈δAB, δBC〉, reaching state C before it executes op3 and starts sending state updates: therefore, it will not broadcast the state update δAD.

Correctness. The following is a more general correctness result.

Theorem 12 Passive replication based on primary order atomic broadcast satisfies linearizability.

Proof. Since POabcast ensures the agreement and total order properties of atomic broadcast, it delivers state updates in a consistent total order ≺∆. We only need to consider delivered state updates, since they are the only ones whose effects are visible to clients.

Consider a state update δ = δSQ that is delivered by a process p and has position i in the sequence of state updates delivered by p. Let δ be broadcast by some primary λ that has generated δSQ by tentatively executing some operation op on its local tentative state Θ = S. We show, for any value of i ≥ 0, that:

(i) no state update preceding δ in ≺∆ is generated by tentatively executing op;
(ii) S is the state p reaches after delivering the state updates preceding δ in ≺∆.

Predicate (i) establishes a mapping of delivered state updates to operations; therefore, ≺∆ can be used to determine an execution order ≺o of operations. Predicate (ii) shows that replies to clients are legal and consistent with ≺o. If both predicates hold, then replicas transition from the state S to the state Q upon delivering the state update δSQ. Since S —op→ 〈δSQ, r〉 and apply(δSQ, S) = Q, Q is a legal state and r a legal reply. Therefore, linearizability follows by showing these two properties for any value of i ≥ 0.

Both properties (i) and (ii) trivially hold for i = 0, since all processes start from the same initial state. We now consider the case i > 0.

Let δ′ be the last state update delivered by λ before POabcasting any value. It follows from the integrity property of atomic broadcast that δ′ ≺∆ δ. Let k < i be the position of δ′ in the sequence of state updates delivered by λ. We use the notation (δ′, δ)≺∆ to indicate all state updates delivered between δ′ and δ, which have positions in (k, i).

We now consider two cases.

Case 1: All state updates in (δ′, δ)≺∆ are broadcast by λ. In our algorithm, when POabcast delivers a state update, it also delivers the identifier of the request that was executed to obtain the update. Therefore, λ knows the identifiers of all requests related to the state updates preceding and including δ′ in the delivery order ≺∆, and considers them as replied. Primaries generate state updates only for operations that have not been replied to, so op is not executed to generate any state update up to and including δ′ in the delivery order ≺∆. The primary λ also knows the identifiers of all the operations it executed to generate the state updates in (δ′, δ]≺∆. It executes op and broadcasts δ only if it knows that op was not already executed, ensuring property (i).

Fig. 2. In this run, atomic broadcast allows delivering the incorrect sequence of state updates 〈δAD, δBC〉. The star denotes becoming a primary. A, B, C, D are tentative states.


For property (ii), both p and λ deliver and apply the same sequence of state updates preceding and including δ′ in ≺∆. It follows from local primary order that p commits all state updates in (δ′, δ]≺∆ in the same order as they are generated by λ.

Case 2: There exists a state update δ′′ in (δ′, δ)≺∆ that has been broadcast by another primary λ′ ≠ λ. This case leads to a contradiction. If λ ≺Λ λ′, a contradiction follows from global primary order, which requires that p deliver δ before δ′′. If λ′ ≺Λ λ, a contradiction follows from primary integrity: since p delivers δ′′ before δ, λ must deliver δ′′ before δ′, so δ′′ ≺∆ δ′, contradicting δ′ ≺∆ δ′′.

Liveness directly follows from the liveness property of POabcast. All requests from correct clients eventually complete, since correct clients resend their requests to the current primary.

C Correctness of the Unified POabcast Algorithm

We now show the correctness of Algorithm 3.

Theorem 13 Primary order atomic broadcast based on the barrier function is correct.

Proof. Processes exchange values using consensus instances; this ensures that all atomic broadcast properties are respected. Local primary order follows from three observations. First, if a process proposes a value for an instance i, it will never propose a value for an instance j < i, since prop only increases. Second, values are decided and delivered according to the order of consensus instance identifiers. Third, skip values do not create gaps in the sequence of values proposed by a primary.

For primary integrity, consider that in order to broadcast a value, a process needs to become a primary first. Let λ be the primary identifier of a process. By the definition of the barrier function, the last invocation of τ returns to λ a value i greater than or equal to the maximum identifier of a consensus instance in which a value proposed by any previous primary λ′ ≺Λ λ is decided. Before broadcasting any message, λ decides on all instances up to i and delivers all decided values proposed by previous primaries. If λ proposes a skip value and this is decided for an instance number j < i, no value from previous primaries decided between j and i is delivered.

Global primary order follows from the other properties of POabcast (Theorem 11).

The liveness properties of the algorithm follow from the termination property of the underlying consensus primitive and from the leader election oracle property of Ω. If some previous primary is faulty, the consensus instances it initiated may not terminate. In this case, a new correct primary proposes skip values to guarantee progress. Since the new primary proposes values starting from instance τ, skipping up to τ is sufficient to guarantee liveness once a correct primary is eventually elected.
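To illustrate the barrier-crossing step this proof relies on, the following sketch (ours; tau, propose, wait_decision, deliver, and SKIP are placeholder names, not the interface of Algorithm 3) shows a newly elected leader catching up to the barrier before it broadcasts anything:

```python
# Sketch of crossing the barrier τ before broadcasting (our illustration;
# all names are placeholders, not the interface of Algorithm 3).

def cross_barrier(tau, propose, wait_decision, deliver, SKIP):
    """Before broadcasting, a new primary decides every consensus
    instance up to i = tau() and delivers all non-skip values decided
    in them, which are values from previous primaries."""
    i = tau()  # >= max instance with a decided value of a previous primary
    for instance in range(i + 1):
        propose(instance, SKIP)          # unblocks instances that might
                                         # otherwise never terminate
        value = wait_decision(instance)  # consensus: all processes agree
        if value is not SKIP:
            deliver(value)               # value from a previous primary
    return i + 1  # first instance the new primary may use for its own values
```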

D Correctness of the Barrier-Free Algorithm

This section shows that Algorithm 4 correctly implements POabcast.

Definition 14 (Total order of primaries) Given two primaries λ and λ′ with epoch numbers e and e′ respectively, we say that λ ≺Λ λ′ if and only if a tuple 〈NEW-EPOCH, e〉 is decided for consensus instance number n, a tuple 〈NEW-EPOCH, e′〉 is decided for consensus instance number m, and n < m.

The fact that this is a total order derives directly from the following two observations. First, all primaries are ordered: a leader becomes a primary with epoch number epoch only after its primary variable is set to true, and this occurs only after the 〈NEW-EPOCH, epoch〉 tuple is decided for some consensus instance. Second, a primary λ proposes a 〈NEW-EPOCH, epoch〉 tuple only once and in a single consensus instance.

In the following, we use the notation λ(epoch) to denote the primary with epoch number epoch.
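As a reading aid for the lemmas below, here is a sketch of a decide handler that manipulates epoch, da, and s the way the proofs describe (our reconstruction from the lemmas, not the actual pseudocode of Algorithm 4):

```python
# Sketch of a decide handler consistent with how the proofs below use
# the variables epoch, da, and s (our reconstruction, not Algorithm 4).

class Replica:
    def __init__(self):
        self.epoch = None  # epoch of the latest NEW-EPOCH tuple decided
        self.da = {}       # deliverable values, indexed by sequence number
        self.s = 0         # next sequence number to deliver

    def on_decide(self, instance, tup):
        if tup[0] == "NEW-EPOCH":
            _, new_epoch = tup
            self.epoch = new_epoch
            self.da = {}             # reset: drop values of older primaries
            self.s = instance + 1    # deliveries restart after this instance
        else:  # ("VAL", value, epoch_m, seqno_m)
            _, value, epoch_m, seqno_m = tup
            if epoch_m == self.epoch:          # only the current primary's
                self.da[seqno_m] = value       # values are kept
            while self.s in self.da:           # deliver in sequence-number
                self.deliver(self.da[self.s])  # order, with no gaps
                self.s += 1

    def deliver(self, value):
        print("delivered:", value)
```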

Lemma 1. If a process p enters a value v sent by a primary λ(tent-epochm) in da[seqnom] upon deciding on instance number n, then 〈NEW-EPOCH, tent-epochm〉 is the latest NEW-EPOCH tuple decided before instance n and the tuple 〈VAL, v, tent-epochm, seqnom〉 is decided for instance n.


Proof. Process p adds v to da upon deciding a 〈VAL, v, epochm, seqnom〉 tuple such that epochm = epoch. The variable epoch is set to tent-epochm upon deciding on the latest 〈NEW-EPOCH, tent-epochm〉 tuple, so epochm = epoch = tent-epochm.

Lemma 2. If a process delivers a value v upon deciding on consensus instance n, every process that decides on n delivers v.

Proof. All processes decide the same values in the same consensus instance order. A value v is delivered only upon deciding on a VAL tuple containing v. The decision of whether to deliver a value or not only depends on variables whose value is consistent across processes, since they are modified consistently by all processes every time a decision for the next instance is reached. Therefore, every process deciding on instance n delivers the same VAL tuple and takes a consistent decision on delivering v.

Lemma 3. A process delivers a value v in at most one consensus instance.

Proof. Consider all values that are broadcast to be distinct. A value v can only be delivered upon deciding a VAL tuple that contains v. The primary λ that has broadcast v can include v in more than one VAL tuple: λ stores v in ta[li] every time it proposes v for consensus instance li. The value ta[li] is re-proposed, however, only if v has not been delivered upon deciding on instance li; once v is delivered, it is never proposed again, so it is delivered in at most one instance.

Lemma 4. Algorithm 4 satisfies Integrity: If some process delivers v, then some process has broadcast v.

Proof. A value v is delivered only upon deciding on a VAL tuple containing v. These tuples are proposed either when v is broadcast, or when another VAL tuple is decided. In the latter case, v is taken from the array ta. However, only values that are broadcast are entered in ta.

Lemma 5. Algorithm 4 satisfies Total Order: If some process delivers v before v′, then any process that delivers v′ must deliver v before v′.

Proof. Let pi deliver v before v′ and pj deliver v′. We now show that pj delivers v before v′. Processes deliver values only upon deciding on a consensus instance. From Lemma 3, this happens only once per value. Since pi delivers v and v′, this occurs upon deciding on instances n and m respectively, with n < m. Since pj delivers v′, it decides on instance m; this happens only after deciding on instance n. From Lemma 2, pj delivers v upon deciding on instance n.

Lemma 6. Algorithm 4 satisfies Agreement: If some process pi delivers v and some other process pj delivers v′, then either pi delivers v′ or pj delivers v.

Proof. Processes deliver values only upon deciding on a consensus instance. From Lemma 3, this happens only once per value. Let n (resp. m) be the consensus instance upon which pi (resp. pj) decides and delivers v (resp. v′). If n < m, then pj decides on n before deciding on m, so it delivers v by Lemma 2. Otherwise, pi decides on m before n and delivers v′ by Lemma 2.

Lemma 7. Algorithm 4 satisfies Local Primary Order: If a primary λ broadcasts v before v′, then a process that delivers v′ delivers v before v′.

Proof. Assume by contradiction that a primary λ broadcasts a value v followed by a value v′, and a process p delivers v′ but not v. Processes deliver values only upon deciding on a consensus instance. From Lemma 3, this happens only once per value; from Lemma 2, this happens, for a given value, when deciding on the same consensus instance at all processes. Let n be the consensus instance whose decision triggers the delivery of v′, and let 〈VAL, v′, epochm, seqnom〉 be the tuple decided for n. The value v′ is stored in the array entry da[l] for some index l; the array da was reset the last time a NEW-EPOCH tuple was decided, say at consensus instance m < n.

Upon deciding on m, the value of s is set to m + 1. After deciding on m and until the decision on n, s is incremented only when the value da[s] is delivered. Therefore, upon delivering v′, a process has delivered all elements of da with indexes in [m + 1, l]. From Lemma 1, all values v′′ added to da[seqnom′] are sent by λ, which includes seqnom′ in


the VAL tuple based on its counter ls; this holds even if the VAL tuple is resent more than once by λ, since the value of the seqnom field is determined by reading the original index stored in the ta array.

λ sets ls to m + 1 upon election and increments it every time a new tuple is broadcast. If λ broadcasts v before v′, it has sent a VAL tuple with seqnom ∈ [m + 1, l − 1]. The value of da[seqnom] held by a process after deciding on m and before deciding on l can only be v. Therefore, if a process delivers v′, it also delivers v, contradicting the assumption.

Lemma 8. Algorithm 4 satisfies Global Primary Order: If a primary λ broadcasts v, λ′ broadcasts v′, and λ ≺Λ λ′, then a process that delivers both v and v′ delivers v before v′.

Proof. By definition, if λ ≺Λ λ′ then the NEW-EPOCH tuple of λ is decided for a consensus instance n preceding the one, m, in which the NEW-EPOCH tuple of λ′ is decided. Each primary sends a NEW-EPOCH tuple only once. When a process delivers v′, it has it stored in the da array, which is reset every time a NEW-EPOCH tuple is decided. From Lemma 1, v′ is delivered upon deciding on a consensus instance following m. Similarly, a process delivers v upon deciding on a consensus instance l ∈ [n + 1, m − 1]. Since decisions on consensus instances are taken sequentially, v is delivered before v′.

Lemma 9. Algorithm 4 satisfies Primary Integrity: If a primary λ broadcasts v, λ′ broadcasts v′, λ ≺Λ λ′, and some process delivers v, then λ′ delivers v before it broadcasts v′.

Proof. By definition, if λ ≺Λ λ′ then the NEW-EPOCH tuple of λ is decided for a consensus instance n preceding the one, m, in which the NEW-EPOCH tuple of λ′ is decided. Each primary sends a NEW-EPOCH tuple only once. When a process delivers v, it has it stored in the da array, which is reset every time a NEW-EPOCH tuple is decided. From Lemma 1, v is delivered upon deciding on a consensus instance l ∈ [n + 1, m − 1]. The primary λ′ proposes its NEW-EPOCH tuple for instance m after deciding on instance l. By Lemma 2, λ′ delivers v upon deciding on instance l, before sending its NEW-EPOCH tuple, and thus before becoming a primary and broadcasting values.

Lemma 10. Eventually, there exists a correct primary λ such that for any other primary λ′ ever existing, λ′ ≺Λ λ.

Proof. Let p be a correct process that is eventually indicated as a leader by Ω. Upon becoming a leader, p chooses a new epoch number tent-epoch and sends a NEW-EPOCH tuple for its current instance d.

Process p continues sending NEW-EPOCH tuples for all consensus instances from d on, until its tuple is accepted and it becomes a primary. We show this by induction. For each instance n ≥ d in which its NEW-EPOCH tuple is not decided, the following can happen. If another NEW-EPOCH tuple is decided with an epoch number tent-epochm ≠ tent-epoch, then p tries to become a primary in the next instance n + 1. If a VAL tuple is decided, p is not yet a primary, so primary is false; also in this case, p tries to become a primary in the next instance n + 1. Eventually, p remains the only leader in the system: all other processes either crash or have primary set to false because they are not currently leaders. Therefore, eventually a NEW-EPOCH tuple from p is decided, p becomes the primary λ, and its tuple is the last NEW-EPOCH tuple decided in any consensus instance. According to the definition of the total order of primaries, λ is the maximum.

Lemma 11. Algorithm 4 satisfies Eventual Single Primary: There eventually exists a correct primary λ such that there is no subsequent primary λ′ with λ ≺Λ λ′, and all messages λ broadcasts are eventually delivered by all correct processes.

Proof. It follows from Lemma 10 that there exists a λ = λ(epoch) that is maximum in the total order of primaries. By definition, its 〈NEW-EPOCH, epoch〉 tuple is the last NEW-EPOCH tuple decided, for an instance n. After deciding on n, λ proposes all tuples that are broadcast using a sequence number ls that starts from n + 1 and is increased by one every time a new value is broadcast.

We now show by induction on ls, starting from n + 1, that all tuples proposed by λ are eventually delivered. Assume by induction that all values proposed by λ with sequence numbers up to ls − 1 are eventually delivered (trivially true if ls = n + 1, since no value is proposed by λ with a lower sequence number). The primary λ proposes v for ls for the first time in consensus instance li, and sets ta[li] = 〈v, ls〉. Note that li ≥ ls, since li is incremented every time ls is.

Let d be the value of li at λ when v was proposed. If 〈VAL, v, epoch, ls〉 is decided in instance d, then da[ls] is set to v. By induction, all values with sequence numbers from n + 1 to ls − 1 are eventually delivered, so they are added to da. When all elements of da with indexes in [n + 1, ls − 1] are filled, the element da[ls] is delivered too.


If a different VAL tuple is decided in instance d, then it is not from λ, since λ proposes only one value per instance li. Therefore, epochm ≠ epoch, primary is true for λ, and li ≥ d, so the leader proposes a 〈VAL, v, epoch, ls〉 tuple again, this time for a new consensus instance li. This resending can happen several times. Eventually, however, only λ proposes VAL tuples, so only its tuples are decided.

Lemma 12. Algorithm 4 satisfies Delivery Liveness.

Proof. Assume that a process delivers v after deciding it in consensus instance i. By the liveness property of consensus, every correct process eventually decides v in instance i. Since the decision on delivering v deterministically depends on the consistent order of decided values, all correct processes deliver v.

The following Theorem 15 derives directly from Lemmas 4–12.

Theorem 15 Barrier-free primary order atomic broadcast is correct.
