
MDCC: Multi-Data Center Consistency

Tim Kraska Gene Pang Michael J. Franklin Samuel Madden♠ Alan Fekete†

University of California, Berkeley   ♠MIT   †University of Sydney
{kraska, gpang, franklin}@cs.berkeley.edu   [email protected]   [email protected]

Abstract

Replicating data across multiple data centers allows using data closer to the client, reducing latency for applications, and increases availability in the event of a data center failure. MDCC (Multi-Data Center Consistency) is an optimistic commit protocol for geo-replicated transactions that does not require a master or static partitioning, and is strongly consistent at a cost similar to eventually consistent protocols. MDCC takes advantage of Generalized Paxos for transaction processing and exploits commutative updates with value constraints in a quorum-based system. Our experiments show that MDCC outperforms existing synchronous transactional replication protocols, such as Megastore, by requiring only a single message round-trip in the normal operational case independent of the master location and by scaling linearly with the number of machines as long as transaction conflict rates permit.

1. Introduction

Tolerance to the outage of a single data center is now considered essential for many online services. Achieving this for a database-backed application requires replicating data across multiple data centers, and making efforts to keep those replicas reasonably synchronized and consistent. For example, Google's e-mail service Gmail is reported to use Megastore [2], synchronously replicating across five data centers to tolerate two data center outages: one planned, one unplanned.

Replication across geographically diverse data centers (called geo-replication) is qualitatively different from replication within a cluster, data center or region, because inter-data center network delays are in the hundreds of milliseconds and vary significantly (differing between pairs of locations, and also over time). These delays are close enough to the limit on total latency that users will tolerate, so it becomes crucial to reduce the number of message round-trips taken between data centers, and desirable to avoid waiting for the slowest data center to respond.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Eurosys'13, April 15-17, 2013, Prague, Czech Republic.
Copyright © 2013 ACM 978-1-4503-1994-2/13/04...$15.00

For database-backed applications, it is a very valuable feature when the system supports transactions: multiple operations (such as individual reads and writes) grouped together, with the system ensuring at least atomicity so that all changes made within the transaction are eventually persisted or none. The traditional mechanism for transactions that are distributed across databases is two-phase commit (2PC), but this has serious drawbacks in a geo-replicated system. 2PC depends on a reliable coordinator to determine the outcome of a transaction, so it will block for the duration of a coordinator failure, and (even worse) the blocked transaction will be holding locks that prevent other transactions from making progress until the recovery is completed.1

In deployed highly-available databases, asynchronous replication is often used, where all update transactions must be sent to a single master site, and then the updates are propagated asynchronously to other sites which can be used for reading (somewhat stale) data. Other common approaches give up some of the usual guarantees or generality of transactions. Some systems achieve only eventual consistency by allowing updates to be run first at any site (preferably local to the client) and then propagated asynchronously with some form of conflict resolution so replicas will converge later to a common state. Others restrict each transaction so it can be decided at one site, by only allowing updates to co-located data such as a single record or partition. In the event of a failure, these diverse approaches may lose committed transactions, become unavailable, or violate consistency.

Various projects [2, 8, 11, 18] proposed to coordinate transaction outcomes based on Paxos [14]. The oldest design, Consensus on Transaction Commit [11], shows how to use Paxos to reliably store the abort or commit decision of a resource manager for recovery. However, it treats data replication as an orthogonal issue. Newer proposals focus on using Paxos to agree on a log position, similar to state-machine replication. For example, Google's Megastore [2] uses Paxos to agree on a log position for every commit in a data shard called an entity group, imposing a total order of transactions per shard. Unfortunately, this design makes the system inherently unscalable, as it only allows executing one transaction at a time per shard; this was observed [13] in Google's App Engine, which uses Megastore. Google's new system Spanner [8] enhances the Megastore approach, automatically resharding the data and adding snapshot isolation, but does not remove the scalability bottleneck, as Paxos is still used to agree on a commit log position per shard (i.e., tablet). Paxos-CP [20] improves Megastore's replication protocol by combining non-conflicting transactions into one log position, significantly increasing the fraction of committed transactions. However, the same system bottleneck remains, and the paper's experimental evaluation is not encouraging, reporting only four transactions per second.

1 We are referring to the standard 2PC algorithm for transaction processing, which requires a durable abort/commit log entry stored at the coordinator. Of course, this log entry could be replicated at the cost of an additional message round-trip (as in 3-phase commit).

Surprisingly, all these new protocols still rely on two-phase commit, with all its disadvantages, to coordinate any transactions that access data across shards. They also all rely on a single master, requiring two round-trips from any client that is not local to the master, which can often result in several hundred milliseconds of additional latency. Such additional latency can negatively impact the usability of websites; for example, an additional 200 milliseconds of latency, the typical time of one message round-trip between geographically remote locations, can result in a significant drop in user satisfaction and "abandonment" of websites [23].

In this paper, we describe MDCC (short for "Multi-Data Center Consistency"), an optimistic commit protocol for transactions with a cost similar to eventually consistent protocols. MDCC requires only a single wide-area message round-trip to commit a transaction in the common case, and is "master-bypassing", meaning it can read or update from any node in any data center. Like 2PC, the MDCC commit protocol can be combined with different isolation levels that ensure varying properties for the recency and mutual consistency of read operations. In its default configuration, it guarantees "read-committed isolation" without lost updates [4] by detecting all write-write conflicts. That is, either all updates inside a transaction eventually persist or none (we refer to this property as atomic durability), updates from uncommitted transactions are never visible to other transactions (read committed), and concurrent updates to the same record are either resolved if commutative or prevented (no lost updates), but some updates from successfully committed transactions might be visible before all updates become visible (no atomic visibility). It should be noted that this isolation level is stronger than the default read-committed isolation of most commercial and open-source database platforms. On the TPC-W benchmark deployed across five Amazon data centers, MDCC reduces per-transaction latencies by at least 50% (to 234 ms) as compared to 2PC or Megastore, with orders of magnitude higher transaction throughput compared to Megastore.

MDCC is not the only system that addresses wide-area replication, but it is the only one that provides the combination of low latency (through one round-trip commits) and strong consistency (up to serializability) for transactions, without requiring a master or significantly limiting the application design (e.g., static data partitions with minimization of cross-partition transactions). MDCC is the first protocol to use Generalized Paxos [15] as a commit protocol on a per-record basis, combining it with techniques from the database community (escrow transactions [19] and demarcation [3]). The key idea is to achieve single round-trip commits by 1) executing parallel Generalized Paxos on each record, 2) ensuring every prepare has been received by a fast quorum of replicas, 3) disallowing aborts for successfully prepared records, and 4) piggybacking notification of commit state on subsequent transactions. A number of subtleties need to be addressed to create a "master-bypassing" approach, including support for commutative updates with value constraints, and for handling conflicts that occur between concurrent transactions.

In summary, the key contributions of MDCC are:
• A new optimistic commit protocol, which achieves wide-area transactional consistency while requiring only one network round-trip in the common case.
• A new approach to ensuring value constraints with quorum protocols.
• Performance results for the TPC-W benchmark showing that MDCC provides strong consistency with costs similar to eventually consistent protocols, and lower than other strongly consistent designs. We also explore the contribution of MDCC's various optimizations, the sensitivity of performance to workload characteristics, and the performance impact during a simulated data center failure.

In Section 2 we show the overall architecture of MDCC. Section 3 presents MDCC's new optimistic commit protocol for the wide-area network. Section 4 discusses MDCC's read consistency guarantees. Our experiments using MDCC across 5 data centers are in Section 5. In Section 6 we relate MDCC to other work.

2. Architecture Overview

MDCC uses a library-centric approach similar to the architectures of DBS3 [5], Megastore [2] or Spanner [8] (as shown in Figure 1). This architecture separates the stateful component of a database system as a distributed record manager. All higher-level functionality (such as query processing and transaction management) is provided through a stateless DB library, which can be deployed at the application server.

As a result, the only stateful component of the architecture, the storage node, is significantly simplified and scalable through standard techniques such as range partitioning, whereas all higher layers of the database can be replicated freely with the application tier because they are stateless. MDCC places storage nodes in geographically distributed data centers, with every node being responsible for one or more horizontal partitions. Although not required, we assume for the remainder of the paper that every data center contains a full replica of the data, and that the data within a single data center is partitioned across machines.

[Figure 1. MDCC architecture: application servers, storage servers, and a master server across Data centers I-IV.]

The DB library provides a programming model for transactions and is mainly responsible for coordinating the replication and consistency of the data using MDCC's commit protocol. The DB library also acts as a transaction manager and is responsible for determining the outcome of a transaction. In contrast to many other systems, MDCC supports an individual master per record, which can be either a storage node or an app-server and is responsible for coordinating the updates to that record. This allows the transaction manager either to take over the mastership for a single record and coordinate the update directly, or to choose a storage node (e.g., the current master) to act on its behalf (black arrows in Figure 1). Furthermore, it is often possible to avoid the master altogether, allowing the transaction manager to coordinate the update without acquiring any mastership (red arrows in Figure 1). This leads to a very flexible architecture in which storage nodes or application servers can act as coordinators, depending on the situation.

In the remaining sections, we concentrate on the MDCC protocol. Other parts of the system, such as load balancing or storage node design, are beyond the scope of this paper.

3. The MDCC Protocol

In this section, we describe our new optimistic commit protocol for transactions operating on cross-partition replicated data in the wide-area network. Intra-data center latencies are largely ignored because they are only a few milliseconds, compared to hundreds of milliseconds for inter-data center latencies. Our target is a fault-tolerant atomic commit protocol with reduced latency, achieved through fewer message rounds by avoiding contacting a master, and with high parallelism. We trade additional CPU cycles for reduced latency, making more sophisticated decisions at each site. We exploit a key observation about real workloads: either conflicts are rare, or many updates commute up to a limit (e.g., add/subtract with a value constraint that the stock must be at least 0).

At its core, the protocol is based on known extensions of Paxos, such as Multi-Paxos [14] and Generalized Paxos [15]. The innovations we introduce enhance these consensus algorithms in order to support transactions on multiple data items without requiring partitioning. In this section, we present a sequence of optimizations, refining from an initial design to the full MDCC protocol. Section 3.2 allows multi-record transactions with read-committed isolation and no lost updates (see Section 4.1) using Multi-Paxos, with two round-trips of messaging except when the masters for all items are local. Section 3.3 incorporates Fast Paxos, so one round-trip is often possible even without a local master. Then Section 3.4 uses Generalized Paxos to combine commit decisions for transactions that are known to be commutative, relying on database techniques that determine state-based commutativity for operations like decrement-subject-to-a-limit. While the component ideas for consensus and for deciding transaction commutativity were known, how we use them for transactions, and their combination, is novel.

3.1 Background: Paxos

In the following, we provide some background on the principles of Paxos and how we use it to update a single record.

3.1.1 Classic Paxos

Paxos is a family of quorum-based protocols for achieving consensus on a single value among a group of replicas. It tolerates a variety of failures, including lost, duplicated or reordered messages, as well as failure and recovery of nodes. Paxos distinguishes between clients, proposers, acceptors and learners. These can be directly mapped to our scenario, where clients are app-servers, proposers are masters, acceptors are storage nodes, and all nodes are learners. In the remainder of this paper we use the database terminology of clients, masters and storage nodes. In our implementation, we place masters on storage nodes, but that is not required.

The basic idea in Classic Paxos [14], as applied for replicating a transaction's updates to data, is as follows: every record has a master responsible for coordinating updates to the record. At the end of a transaction, the app-server sends the update requests to the master of each record, as shown by the solid lines in Figure 1. The master informs all storage nodes responsible for the record that it is the master for the next update. It is possible that multiple masters exist for a record, but to make progress, eventually only one master is allowed. The master processes the client request by attempting to convince the storage nodes to agree on it. A storage node accepts an update if and only if it comes from the most recent master the node knows of and it has not already accepted a more recent update for the record.

In more detail, the Classic Paxos algorithm operates in two phases. Phase 1 tries to establish the mastership for an update for a specific record r. A master P selects a proposal number m, also referred to as a ballot number or round, higher than any known proposal number, and sends a Phase1a request with m to at least a majority of the storage nodes responsible for r. The proposal numbers must be unique for each master because they are used to determine the latest request.2 If a storage node receives a Phase1a request greater than any proposal number it has already responded to, it responds with a Phase1b message containing m, the highest-numbered update (if any) including its proposal number n, and promises not to accept any future requests less than or equal to m. If P receives responses containing its proposal number m from a majority QC of storage nodes, it has been chosen as a master. Now, only P will be able to commit a value for proposal number m.

Phase 2 tries to write a value. P sends an accept request Phase2a to all the storage nodes of Phase 1 with the ballot number m and value v. v is either the update of the highest-numbered proposal among the Phase1b responses, or the requested update from the client if no Phase1b responses contained a value. P must re-send the previously accepted update to avoid losing the possibly saved value. If a storage node receives a Phase2a request for a proposal numbered m, it accepts the proposal, unless it has already responded to a Phase1a request having a number greater than m, and sends a Phase2b message containing m and the value back to P. If the master receives a Phase2b message from the majority QC of storage nodes for the same ballot number, consensus is reached and the value is learned. Afterwards, the master informs all other components, app-servers and responsible storage nodes, about the success of the update.3
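As an illustration only, a Classic Paxos acceptor for a single record version could be sketched in Scala, the language of our prototype, as follows; the class and message names are hypothetical and do not reflect MDCC's actual interfaces:

case class Ballot(round: Long, proposerId: String)                     // proposer id makes ballots unique per master
case class Phase1a(ballot: Ballot)
case class Phase1b(ballot: Ballot, accepted: Option[(Ballot, String)]) // highest accepted (ballot, update), if any
case class Phase2a(ballot: Ballot, update: String)
case class Phase2b(ballot: Ballot, update: String)

object Ballot {
  implicit val ordering: Ordering[Ballot] =
    Ordering.by((b: Ballot) => (b.round, b.proposerId))
}

class Acceptor {
  import Ballot.ordering
  private var promised: Option[Ballot] = None            // highest ballot promised in Phase 1
  private var accepted: Option[(Ballot, String)] = None  // highest update accepted in Phase 2

  def onPhase1a(msg: Phase1a): Option[Phase1b] =
    if (promised.forall(p => ordering.lt(p, msg.ballot))) {   // promise only for strictly higher ballots
      promised = Some(msg.ballot)
      Some(Phase1b(msg.ballot, accepted))                     // report the highest accepted update, if any
    } else None

  def onPhase2a(msg: Phase2a): Option[Phase2b] =
    if (promised.forall(p => ordering.lteq(p, msg.ballot))) { // accept unless a higher ballot was promised
      promised = Some(msg.ballot)
      accepted = Some((msg.ballot, msg.update))
      Some(Phase2b(msg.ballot, msg.update))
    } else None
}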

Note that Classic Paxos is only able to learn a single value per instance, which may consist of multiple ballots or rounds. Thus we use one separate Paxos instance per version of the record, with the requirement that the previous version has already been chosen successfully.

3.1.2 Multi-Paxos

The Classic Paxos algorithm requires two message rounds to agree on a value, one in Phase 1 and one in Phase 2. If the master is reasonably stable, using Multi-Paxos (the multi-decree Synod protocol) makes it possible to avoid Phase 1 by reserving the mastership for several instances [14]. Multi-Paxos is an optimization of Classic Paxos, and in practice, Multi-Paxos is implemented instead of Classic Paxos to take advantage of fewer message rounds.

We explore this by allowing the proposers to suggest the following meta-data: [StartInstance, EndInstance, Ballot]. Thus, the storage nodes can vote on the mastership for all instances from StartInstance to EndInstance with a single ballot number at once. The meta-data also allows for different masters for different instances. This supports custom master policies like round-robin, where serverA is the master for instance 1, serverB is the master for instance 2, and so on. Storage nodes react to these requests by applying the same semantics for each individual instance as defined in Phase1b, but they answer in a single message. The database stores this meta-data, including the current version number, as part of the record, which enables a separate Paxos instance per record. To support meta-data for inserts, each table stores a default meta-data value for any non-existent records.

2 To ensure uniqueness we concatenate the requester's IP address.
3 It is possible to avoid this delay by sending Phase2b messages directly to all involved nodes. As this significantly increases the number of messages, we do not use this optimization.

Therefore, the default configuration assigns a single master per table to coordinate inserts of new records. Although a potential bottleneck, the master is normally not in the critical path and is bypassed, as explained in Section 3.3.
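As a sketch of how this per-record meta-data might be represented (hypothetical names and layout, not the prototype's storage format):

// Hypothetical per-record Paxos meta-data, stored alongside the record value (illustrative only).
// A master may reserve a whole range of instances [startInstance, endInstance] with one ballot.
case class MasterLease(startInstance: Long, endInstance: Long, ballot: Long, masterId: String)

case class RecordMeta(
  currentVersion: Long,        // the next Paxos instance corresponds to the next record version
  leases: List[MasterLease]    // mastership reservations this node has voted for
)

object RecordMeta {
  // Default meta-data per table, used for records that do not exist yet (inserts).
  val default: RecordMeta = RecordMeta(currentVersion = 0L, leases = Nil)
}

// A storage node grants a Phase 1 request for a range only if no lease it already granted
// overlaps the requested range with an equal or higher ballot.
def grantLease(meta: RecordMeta, req: MasterLease): Option[RecordMeta] = {
  val conflicting = meta.leases.exists { l =>
    l.startInstance <= req.endInstance && req.startInstance <= l.endInstance && l.ballot >= req.ballot
  }
  if (conflicting) None else Some(meta.copy(leases = req :: meta.leases))
}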

3.2 Transaction Support

The first contribution of MDCC is the extension of Multi-Paxos to support multi-record transactions with read-committed isolation and without the lost-update problem. That is, we ensure atomic durability (all or no updates will persist), detect all write-write conflicts (if two transactions try to update the same record concurrently, at most one will succeed), and guarantee that only updates from successful transactions are visible. Guaranteeing higher read consistency levels, such as atomic visibility and snapshot isolation, is an orthogonal issue and is discussed in Section 4.

We guarantee this consistency level by using a Paxos instance per record to accept an option to execute the update, instead of writing the value directly. After the app-server learns the options for all the records in a transaction, it commits the transaction and asynchronously notifies the storage nodes to execute the options. If an option is not yet executed, it is called an outstanding option.

3.2.1 The Protocol

As in all optimistic concurrency control techniques, we assume that transactions collect a write-set of records at the end of the transaction, which the protocol then tries to commit. Updates to records create new versions, and are represented in the form vread → vwrite, where vread is the version of the record read by the transaction and vwrite is the new version of the record. This allows MDCC to detect write-write conflicts by comparing the current version of a record with vread. If they are not equal, the record was modified between the read and the write, and a write-write conflict was encountered. For inserts, the update has a missing vread, indicating that an insert should only succeed if the record does not already exist. Deletes work by marking the item as deleted and are handled as normal updates. We further require that only one option per record be outstanding at a time, and that the update is not visible until the option is executed.
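As a sketch (hypothetical names, not the prototype's data model), a write-set entry and the storage node's acceptance test could look as follows:

// Hypothetical write-set entry, representing an update as vRead -> vWrite.
// A missing vRead marks an insert (the record must not exist yet); deletes are normal updates
// that write a tombstone value.
case class VersionedUpdate(key: String, vRead: Option[Long], vWrite: Long, newValue: Array[Byte])

case class StoredRecord(version: Long, value: Array[Byte],
                        outstandingOption: Option[VersionedUpdate])

// A storage node accepts an option only if the read version still matches the current version
// and no other option is outstanding for the record (write-write conflict detection).
def isAcceptable(record: Option[StoredRecord], up: VersionedUpdate): Boolean = record match {
  case None      => up.vRead.isEmpty                                 // insert: record must not exist
  case Some(rec) => rec.outstandingOption.isEmpty && up.vRead.contains(rec.version)
}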

The app-server coordinates the transaction by trying to get the options accepted for all updates. It proposes the options to the Paxos instances running for each record, with the participants being the replicas of the record. Every storage node responds to the app-server with an accept or reject of the option, depending on whether vread is valid, similar to validated Byzantine agreement [6]. Hence, the storage nodes make an active decision to accept or reject the option. This is fundamentally different from existing uses of Paxos (e.g., Consensus on Transaction Commit [11] or Megastore), which require sending a fixed value (e.g., the "final" accept or commit decision) and only decide based on the ballot number whether the value should be accepted. The reason why this change does not violate the Paxos assumptions is that we defined at the end of Section 3.1.1 that a new record version can only be chosen if the previous version was successfully determined. Thus, all storage nodes will always make the same abort or commit decision. This prevents concurrent updates, but only per record and not for the entire shard as in Megastore. We will relax this requirement in Section 3.4.

Just as in 2PC, the app-server commits a transaction when it learns all options as accepted, and aborts a transaction when it learns any option as rejected. The app-server learns an option if and only if a majority of storage nodes agrees on the option. In contrast to 2PC, we made another important change: MDCC does not allow clients or app-servers to abort a transaction once it has been proposed. With MDCC, decisions are determined and stored by the distributed storage nodes, instead of being decided by a single coordinator as in 2PC. This ensures that the commit status of a transaction depends only on the status of the learned options and hence is always deterministic, even with failures. Otherwise, the decision of the app-server/client after the prepare would have to be reliably stored, which either impairs availability (the reason why 2PC is blocking) or requires an additional round, as done by three-phase commit or Consensus on Transaction Commit [11].
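The fact that the outcome is a deterministic function of the learned options can be sketched as follows (illustrative Scala, not the actual coordinator code):

import scala.concurrent.{ExecutionContext, Future}

sealed trait OptionOutcome
case object Accepted extends OptionOutcome
case object Rejected extends OptionOutcome

// The commit decision is a deterministic function of the learned options:
// commit iff every option was learned as accepted.
def decide(learned: Seq[OptionOutcome]): Boolean = learned.forall(_ == Accepted)

// The coordinator proposes all options, waits to learn them, decides, and then
// sends the asynchronous Learned/visibility notification off the commit path.
def commitTransaction(learnedOptions: Seq[Future[OptionOutcome]])
                     (notifyStorageNodes: Boolean => Unit)
                     (implicit ec: ExecutionContext): Future[Boolean] =
  Future.sequence(learnedOptions).map { outcomes =>
    val committed = decide(outcomes)
    notifyStorageNodes(committed)   // Learned message; correctness does not depend on when it arrives
    committed
  }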

If the app-server determines that the transaction is aborted or committed, it informs the involved storage nodes about the decision through a Learned message. The storage nodes in turn execute the option (make it visible) or mark it as rejected. Learning an option is the result of each Paxos instance and thus generates a new version-id of the record, whether the option is learned as accepted or rejected. Note that, so far, only one option per record can be outstanding at a time, as we require the previous instance (version) to be decided.

As a result, it is possible to decide the transaction (commit or abort) in a single round-trip across the data centers if all record masters are local. This is possible because the commit/abort decision of a transaction depends entirely on the learned values, and the application server is not allowed to prematurely abort a transaction (in contrast to 2PC or Consensus on Transaction Commit). The Learned message that notifies the storage nodes about the commit/abort can be asynchronous; it does not influence correctness and only affects the possibility of aborts caused by stale reads. By adding transaction support, this design is able to achieve 1-round-trip commits if the master is local, but when the master is not local it requires 2 round-trips, due to the additional communication with the remote master. Communication with a local master is ignored because the latency is negligible (a few milliseconds) compared to geographically remote master communication (hundreds of milliseconds).

3.2.2 Avoiding Deadlocks

The described protocol is able to atomically commit multi-record transactions. Without further effort, transactions might cause a deadlock by waiting on each other's options. For example, if two transactions t1 and t2 try to learn an option for the same two records r1 and r2, t1 might successfully learn the option for r1, and t2 for r2. Since transactions do not abort without learning at least one of the options as aborted, both transactions are now deadlocked because each transaction waits for the other to finish. We apply a simple pessimistic strategy to avoid deadlocks. The core idea is to relax the requirement that we can only learn a new version if the previous instance is committed. For example, if t1 learns the option v0 → v1 for record r1 in one instance as accepted, and t2 tries to acquire an option v0 → v2 for r1, then t1 learns the option v0 → v1 as accepted and t2 learns the option v0 → v2 as rejected in the next Paxos instance. This simple trick causes transaction t1 to commit and t2 to abort, or, in the deadlock case described before, both transactions to abort. The Paxos safety property is still maintained because all storage nodes will make the same decision based on the policy, and the master totally orders the record versions.

3.2.3 Failure Scenarios

Multi-Paxos allows our commit protocol to recover from various failures. For example, a failure of a storage node can be masked by the use of quorums. A master failure can be recovered from by selecting a new master (after some timeout) and triggering Phases 1 and 2 as described previously. Handling app-server failures is trickier, because an app-server failure can cause a transaction to be pending forever as a "dangling transaction". We avoid dangling transactions by including in every option a unique transaction-id (e.g., a UUID) as well as all primary keys of the write-set, and by additionally keeping a log of all learned options at the storage node. Therefore, every option includes all necessary information to reconstruct the state of the corresponding transaction. Whenever an app-server failure is detected by simple timeouts, the state is reconstructed by reading from a quorum of storage nodes for every key in the transaction, so any node can recover the transaction. A quorum is required to determine what was decided by the Paxos instance. Finally, a data center failure is treated simply as each of the nodes in the data center failing. In the future, we might adopt bulk-copy techniques to bring the data up-to-date more efficiently without involving the Paxos protocol (also see [2]).

3.3 Transactions Bypassing the Master

The previous subsection showed how we achieve transactions with multiple updates in a single round-trip, if the masters for all transaction records are in the same data center as the app-server. However, 2 round-trips are required when the masters are remote, or mastership needs to be acquired.

3.3.1 Protocol

Fast Paxos [16] avoids the master by distinguishing between classic and fast ballots. Classic ballots operate like the classic Paxos algorithm described above and are always the fall-back option. Fast ballots normally use a bigger quorum than classic ballots, but allow bypassing the master. This saves one message round to the master, which may be in a different data center. However, since updates are not serialized by the master, collisions may occur, which can only be resolved by a master using classic ballots.

We use this approach of fast ballots for MDCC. All versions start with an implicitly fast ballot number, unless a master changed the ballot number through a Phase1a message. This default ballot number informs the storage nodes to accept the next options from any proposer.

Afterwards, any app-server can propose an option directly to the storage nodes, which in turn promise only to accept the first proposed option. Simple majority quorums, however, are no longer sufficient to learn a value and ensure the safety of the protocol. Instead, learning an option without the master requires a fast quorum [16]. Fast and classic quorums are defined by the following requirements: (i) any two quorums must have a non-empty intersection, and (ii) there is a non-empty intersection of any three quorums consisting of two fast quorums QF1 and QF2 and a classic quorum QC. A typical setting for a replication factor of 5 is a classic quorum size of 3 and a fast quorum size of 4. If a proposer receives an acknowledgment from a fast quorum, the value is safe and guaranteed to be committed. However, if a fast quorum cannot be achieved, collision recovery is necessary. Note that a Paxos collision is different from a transaction conflict: collisions occur when nodes cannot agree on an option, whereas conflicts are caused by conflicting updates.

To resolve the collision, a new classic ballot must be started with Phase 1. After receiving responses from a classic quorum, all potential intersections with a fast quorum must be computed from the responses. If some intersection consists entirely of members reporting the highest ballot number, and all of them agree on some option v, then v must be proposed next. Otherwise, no option was previously agreed upon, so any new option can be proposed. For example, assume the following messages were received as part of a collision resolution from 4 out of 5 servers with the previously mentioned quorums (notation: (server-id, ballot number, update)): (1,3,v0→v1), (2,4,v1→v2), (3,4,v1→v3), (5,4,v1→v2). Here, the minimal intersection size is 2 and the highest ballot number is 4, so the protocol compares the following intersections:

[(2, 4, v1 → v2), (3, 4, v1 → v3)]
[(3, 4, v1 → v3), (5, 4, v1 → v2)]
[(2, 4, v1 → v2), (5, 4, v1 → v2)]

Only the last intersection has an option in common (for the other two, the set of common options is empty); hence, the option v1→v2 has to be proposed next. More details and the correctness proofs of Fast Paxos can be found in [16].

MDCC uses Fast Paxos to bypass the master for accepting an option, which reduces the number of required message rounds. Per fast ballot, only one option can be learned.

However, by combining the idea of Fast Paxos with Multi-Paxos and using the adjusted ballot-range definition from Section 3.1.2, [StartInstance, EndInstance, Fast, Ballot], it is possible to pre-set several instances as fast. Whenever a collision is detected, the instance is changed to classic, the collision is resolved, and the protocol moves on to the next instance, which can start as either classic or fast. It is important that classic ballot numbers are always ranked higher than fast ballot numbers in order to resolve collisions and save the correct value. Combined with our earlier observation that a new Paxos instance is started only if the previous instance is stable and learned, this allows the protocol to execute several consecutive fast instances without involving a master.

Without the option concept of Section 3.2, fast ballots would be impossible to use: it would be impossible to make an abort/commit decision without first acquiring a lock in a separate message round on the storage servers or on some master (e.g., as done by Spanner). This is also the main reason why other existing Paxos commit protocols cannot leverage fast ballots. It can be shown that the correctness of Paxos with options and the deadlock-avoidance policy still holds with fast instances, as long as every outstanding version is checked in order. That is because fast instances still determine a total order of operations. Finally, whenever a fast quorum of nodes is unavailable, classic ballots can be used, ensuring the same availability as before.

3.3.2 Fast-Policy

There exists a non-trivial trade-off between fast and classic instances. With fast instances, two concurrent updates might cause a collision requiring another two message rounds for the resolution, whereas classic instances usually require two message rounds, one to either contact the master or acquire the mastership, and one for Phase 2. Hence, fast instances should only be used if conflicts and collisions are rare.

Currently, we use a very simple strategy. The default meta-data for all instances and all records is pre-set to fast with [0, ∞, fast=true, ballot=0]. As the default meta-data for all records is the same, it does not need to be stored per record. A record's meta-data is managed separately only when collision resolution is triggered. If we detect a collision, we set the next γ instances (default 100) to classic. After γ transactions, fast instances are automatically tried again.
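A sketch of this policy, with hypothetical names:

// Hypothetical sketch of the fast/classic policy: instances default to fast
// ([0, infinity, fast = true, ballot = 0]); after a collision the next gamma instances run classic.
case class InstanceRange(start: Long, end: Long, fast: Boolean, ballot: Long)

class FastPolicy(gamma: Int = 100) {
  // The default range is identical for every record, so it need not be stored per record.
  private var ranges: List[InstanceRange] =
    List(InstanceRange(0L, Long.MaxValue, fast = true, ballot = 0L))

  def isFast(instance: Long): Boolean =
    ranges.find(r => r.start <= instance && instance <= r.end).forall(_.fast)

  // On a detected collision, force the next gamma instances to classic; once they are used up,
  // the default range applies again and fast instances are probed anew.
  def onCollision(instance: Long, ballot: Long): Unit = {
    val classic = InstanceRange(instance, instance + gamma - 1, fast = false, ballot)
    ranges = classic :: ranges   // more specific ranges are consulted before the default
  }
}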

This simple strategy stays in fast mode if possible (recall that fast has no upper instance limit) and in classic mode when necessary, while probing to go back to fast every γ instances. More advanced models could explicitly calculate the conflict rate; they remain future work.

3.4 Commutative Updates

The design based on Fast Paxos allows many transactions to commit with a single round-trip between data centers. However, whenever there are concurrent updates to a given data item, conflicts will arise and extra messages are needed. MDCC efficiently exploits cases where the updates are commutative to avoid extra messages, by using Generalized Paxos [15], which is an extension of Fast Paxos. In this section, we show how our novel option concept and the idea of using Paxos on a per-record basis instead of at the database log level, as described in the previous sections, enable us to use Generalized Paxos. Furthermore, in order to support the quite common case of operations on data that are subject to value constraints (e.g., the stock should be at least 0), we developed a new demarcation technique for quorums.

3.4.1 The Protocol

Generalized Paxos [15] uses the same ideas as Fast Paxos but relaxes the constraint that every acceptor must agree on the same exact sequence of values/commands. Since some commands may commute with each other, the acceptors only need to agree on sets of commands which are compatible with each other. MDCC utilizes this notion of compatibility to support commutative updates.

Fast commutative ballots are always started by a message from the master. The master sets the record's base value, which is the latest committed value. Afterwards, any client can propose commutative updates directly to all storage nodes, using the same option model as before. In contrast to the previous section, an option now contains commutative updates, which consist of one or more attributes and their respective delta changes (e.g., decrement(stock, 1)). If a fast quorum QF out of N storage nodes accepts the option, the update is committed. When the updates involved commute, the acceptors can accept multiple proposals in the same ballot, and the orderings do not have to be identical on all storage nodes. This allows MDCC to stay in the fast ballot for longer periods of time, bypassing the master and allowing the commit to happen in one message round. More details on Generalized Paxos are given in [15].

3.4.2 Global Constraints

Generalized Paxos is based on commutative operations like increment and decrement. However, many database applications must enforce integrity constraints, e.g., that the stock of an item must not drop below zero. Under a constraint like this, decrements do not commute in general. However, they do have state-based commutativity when the system contains sufficient stock. Thus we allow concurrent processing of decrements while ensuring domain integrity constraints, by requiring storage nodes to only accept an option if the option would not violate the constraint under all permutations of commit/abort outcomes for pending options. For example, given 5 transactions t1, ..., t5 (arriving in order), each generating an option [stock = stock − 1] with the constraint stock ≥ 0 and a current stock level of 4, a storage node s will reject t5 even though the first four options may abort. This definition is analogous to Escrow [19] and guarantees correctness even in the presence of aborts and failures.
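For decrements against a lower bound of 0, as in the stock example, this escrow-style acceptance test could be sketched as follows (illustrative only, with hypothetical names):

// Illustrative escrow-style check for decrements against a lower bound (0, as in the stock example):
// accept a new option only if the constraint would hold even if every pending option commits.
case class Decrement(txId: String, amount: Long)

class EscrowRecord(initialValue: Long, lowerBound: Long = 0L) {
  private var committedValue: Long = initialValue
  private var pending: List[Decrement] = Nil       // accepted but not yet executed options

  // Worst case for a lower bound: all pending decrements commit.
  def tryAccept(d: Decrement): Boolean = {
    val worstCase = committedValue - pending.map(_.amount).sum - d.amount
    if (worstCase >= lowerBound) { pending = d :: pending; true } else false
  }

  // Called when the Learned/visibility message arrives: execute or discard the option.
  def resolve(txId: String, committed: Boolean): Unit = {
    val (resolved, rest) = pending.partition(_.txId == txId)
    pending = rest
    if (committed) committedValue -= resolved.map(_.amount).sum
  }
}
// With an initial value of 4, the fifth unit decrement is rejected even if the first four later abort.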

Unfortunately, this still does not guarantee integrity constraints, as storage nodes base decisions on local, not global, knowledge. Figure 2 shows a possible message ordering for the above example with 5 storage nodes. Here, clients wait for QF responses (4), and each storage node makes a decision based on its local state. Through different message arrival orders it is possible for all 5 transactions to commit, even though committing them all violates the constraint.

[Figure 2. Message order: five storage nodes, each with base value 4, receive the options of transactions T1...T5 (each an Option[-1]) in different orders.]

We therefore developed a new demarcation-based strategy for quorum systems. Our demarcation technique is similar to the earlier technique [3] in that both use local limits, but it is used in a different scenario: the original demarcation technique uses limits to safely update distributed values, whereas MDCC uses limits for quorum-replicated values.

Without loss of generality, we assume a constraint that the value is at least 0 and that all updates are decrements. Let N be the replication factor (the number of storage nodes), X be the base value of some attribute, and δi be the decrement amount of transaction ti for that attribute. If we consider every replicated base value X as a resource, the total number of resources in the system is N · X. In order to commit an update, QF storage nodes must accept the update, so every successful transaction ti reduces the resources in the system by at least QF · δi. If we assume m successful transactions with δ1 + · · · + δm = X, the attribute value has reached 0, and the total amount of resources has been reduced by at least QF · (δ1 + · · · + δm) = QF · X. Even though the integrity constraint now forbids any further transactions, it is possible that the system still has (N − QF) · X resources remaining, due to failures and lost or out-of-order messages.

The worst case is when the remaining resources are equally distributed across all the storage nodes (otherwise, at least one of the storage nodes would start to reject options earlier). The remaining resources (N − QF) · X are divided evenly among the N storage nodes to derive a lower limit that guarantees the value constraint. Storage nodes must reject an option if it would cause the value to fall below:

    L = ((N − QF) / N) · X
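As an illustrative sketch (not the prototype's code), the limit and the corresponding local acceptance test could be computed as follows; rounding L up errs on the safe side:

// Illustrative sketch: the per-replica lower limit L = ((N - QF) / N) * X for a constraint of
// "value at least 0", and the local test a storage node applies.
def demarcationLimit(n: Int, qF: Int, baseValue: Long): Long =
  math.ceil((n - qF).toDouble * baseValue / n).toLong

// Accept a decrement option only if, even with all pending decrements committing,
// the local value stays at or above the limit.
def acceptUnderLimit(localValue: Long, pendingDecrements: Seq[Long], delta: Long,
                     n: Int, qF: Int, baseValue: Long): Boolean =
  localValue - pendingDecrements.sum - delta >= demarcationLimit(n, qF, baseValue)

// Example of Figure 2: n = 5, qF = 4 and base value 4 give L = ceil(0.8) = 1, so each replica
// accepts at most three unit decrements, and at most three of the five transactions can commit.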

This limit L is calculated with every new base value. When options in fast ballots are rejected because of this limit, the protocol handles it as a collision, resolves it by switching to classic ballots, and writes a new base value and limit L.

3.4.3 MDCC Pseudocode

The complete MDCC protocol is listed as pseudocode in Algorithms 1, 2, and 3, while Table 1 defines the symbols and variables used.


Symbols            Definitions
a                  an acceptor
l                  a leader
up                 an update
ω(up, _)           an option for an update, with ✓ or ✗
✓ / ✗              acceptance / rejection
m                  ballot number
val_a[i]           cstruct at ballot i at acceptor a
bal_a              max{k | val_a[k] ≠ none}
val_a              cstruct at bal_a at acceptor a
mbal_a             current ballot number at acceptor a
ldrBal_l           ballot number at leader l
maxTried_l         cstructs proposed by leader l
Q                  a quorum of acceptors
Quorum(k)          all possible quorums for ballot k
learned            cstruct learned by a learner
⊓                  greatest lower bound operator
⊔                  least upper bound operator
⊑                  partial order operator for cstructs
val • ω(up, _)     appends option ω(up, _) to cstruct val

Table 1. Definition of symbols for MDCC pseudocode.

Algorithm 1 Pseudocode for MDCC
 1: procedure TRANSACTIONSTART                    ▷ Client
 2:   for all up ∈ tx do
 3:     run SENDPROPOSAL(up)
 4:   wait to learn all update options
 5:   if ∀up ∈ tx : learned ω(up, ✓) then
 6:     send Visibility[up, ✓] to Acceptors
 7:   else
 8:     send Visibility[up, ✗] to Acceptors
 9: procedure SENDPROPOSAL(up)                    ▷ Proposer
10:   if classic ballot then
11:     send Propose[up] to Leader
12:   else
13:     send Propose[up] to Acceptors
14: procedure LEARN(up)                           ▷ Learner
15:   collect Phase2b[m, val_a] messages from Q
16:   if ∀a ∈ Q : v ⊑ val_a then
17:     learned ← learned ⊔ v
18:   if ω(up, _) ∉ learned then
19:     send StartRecovery[] to Leader
20:     return
21:   if classic ballot then
22:     move on to next instance
23:   else
24:     isComm ← up is CommutativeUpdate[delta]
25:     if ω(up, ✗) ∈ learned ∧ isComm then
26:       send StartRecovery[] to Leader

For simplicity, we do not show how liveness is guaranteed. The remainder of this section sketches the algorithm by focusing on how the different pieces from the previous subsections work together.

The app-server or client starts the transaction by sending proposals for every update on line 3. After learning the status of the options of all the updates (lines 14-26), the app-server sends visibility messages to "execute" the options on lines 5-8, as described in Section 3.2.1. While attempting to learn options, if the app-server does not learn the status of an option (line 18), it will initiate a recovery. Also, if the app-server learns a commutative option as rejected during a fast ballot (line 25), it will notify the master to start recovery.

Algorithm 2 Pseudocode for MDCC - Leader l
27: procedure RECEIVELEADERMESSAGE(msg)
28:   switch msg do
29:     case Propose[up]:
30:       run PHASE2ACLASSIC(up)
31:     case Phase1b[m, bal, val]:
32:       if received messages from Q then
33:         run PHASE2START(m, Q)
34:     case StartRecovery[]:
35:       m ← new unique ballot number greater than m
36:       run PHASE1A(m)
37: procedure PHASE1A(m)
38:   if m > ldrBal_l then
39:     ldrBal_l ← m
40:     maxTried_l ← none
41:     send Phase1a[m] to Acceptors
42: procedure PHASE2START(m, Q)
43:   maxTried_l ← PROVEDSAFE(Q, m)
44:   if new update to propose exists then
45:     run PHASE2ACLASSIC(up)
46: procedure PHASE2ACLASSIC(up)
47:   maxTried_l ← maxTried_l • ω(up, _)
48:   send Phase2a[ldrBal_l, maxTried_l] to Acceptors
49: procedure PROVEDSAFE(Q, m)
50:   k ≡ max{i | (i < m) ∧ (∃a ∈ Q : val_a[i] ≠ none)}
51:   ℛ ≡ {R ∈ Quorum(k) | ∀a ∈ Q ∩ R : val_a[k] ≠ none}
52:   γ(R) ≡ ⊓{val_a[k] | a ∈ Q ∩ R}, for all R ∈ ℛ
53:   Γ ≡ {γ(R) | R ∈ ℛ}
54:   if ℛ = ∅ then
55:     return {val_a[k] | (a ∈ Q) ∧ (val_a[k] ≠ none)}
56:   else
57:     return {⊔Γ}

Learning a rejected option for a commutative update during a fast ballot is an indicator of violating the quorum demarcation limit, so a classic ballot is required to update the limit.

When accepting new options, the storage nodes must evaluate the compatibility of the options and then accept or reject them. The compatibility validation is shown in lines 83-99. If the new update is not commutative, the storage node compares the read version of the update to the current value to determine compatibility, as shown in lines 86-92. For new commutative updates, the storage node computes the quorum demarcation limits as described in Section 3.4.2 and determines whether any combination of the pending commutative options violates the limits (lines 93-99). When a storage node receives a visibility message for an option, it executes the option in order to make the update visible, on line 103.

4. Consistency Guarantees

MDCC ensures atomicity (i.e., either all updates in a transaction persist or none) and that two concurrent write-conflicting update transactions do not both commit. By combining the protocol with different read strategies, it is possible to guarantee various degrees of consistency.

4.1 Read Committed without Lost Updates

MDCC's default consistency level is read committed, but without the lost update problem [4]. Read committed isolation prevents dirty reads, so no transaction will read any other transaction's uncommitted changes.


Algorithm 3 Pseudocode for MDCC - Acceptor a
 58: procedure RECEIVEACCEPTORMESSAGE(msg)
 59:   switch msg do
 60:     case Phase1a[m]:
 61:       run PHASE1B(m)
 62:     case Phase2a[m, v]:
 63:       run PHASE2BCLASSIC(m, v)
 64:     case Propose[up]:
 65:       run PHASE2BFAST(up)
 66:     case Visibility[up, status]:
 67:       run APPLYVISIBILITY(up, status)
 68: procedure PHASE1B(m)
 69:   if mbal_a < m then
 70:     mbal_a ← m
 71:     send Phase1b[m, bal_a, val_a] to Leader
 72: procedure PHASE2BCLASSIC(m, v)
 73:   if bal_a ≤ m then
 74:     bal_a ← m
 75:     val_a ← v
 76:     SETCOMPATIBLE(val_a)
 77:     send Phase2b[m, val_a] to Learners
 78: procedure PHASE2BFAST(up)
 79:   if bal_a = mbal_a then
 80:     val_a ← val_a • ω(up, _)
 81:     SETCOMPATIBLE(val_a)
 82:     send Phase2b[m, val_a] to Learners
 83: procedure SETCOMPATIBLE(v)
 84:   for all new options ω(up, _) in v do
 85:     switch up do
 86:       case PhysicalUpdate[vread, vwrite]:
 87:         validRead ← vread matches current value
 88:         validSingle ← no other pending options exist
 89:         if validRead ∧ validSingle then
 90:           set option to ω(up, ✓)
 91:         else
 92:           set option to ω(up, ✗)
 93:       case CommutativeUpdate[delta]:
 94:         U ← upper quorum demarcation limit
 95:         L ← lower quorum demarcation limit
 96:         if any option combinations violate U or L then
 97:           set option to ω(up, ✗)
 98:         else
 99:           set option to ω(up, ✓)
100: procedure APPLYVISIBILITY(up, status)
101:   update ω(up, _) in val_a to ω(up, status)
102:   if status = ✓ then
103:     apply up to make update visible

The lost update problem occurs when transaction t1 first reads a data item X, then one or more other transactions write to the same data item X, and finally t1 writes to data item X. The updates between the read and write of item X by t1 are "lost" because the write by t1 overwrites the value and loses the previous updates. MDCC guarantees read committed isolation by only reading committed values and not returning the values of uncommitted options. Lost updates are prevented by detecting every write-write conflict between transactions.

Currently, Microsoft SQL Server, Oracle Database, IBM DB2 and PostgreSQL all use read committed isolation by default. We therefore believe that MDCC's default consistency level is sufficient for a wide range of applications.

4.2 Staleness & Monotonicity

Reads can be done from any storage node and are guaranteed to return only committed data. However, by just reading from a single node, the read might be stale. For example, if a storage node missed updates due to a network problem, reads might return older data. Reading the latest value requires reading from a majority of storage nodes to determine the latest stable version, making it an expensive operation.

In order to allow up-to-date reads with classic rounds, we can leverage techniques from Megastore [2]. A simple strategy for up-to-date reads with fast rounds is to ensure that a special pseudo-master storage node is always part of the quorum of Phases 1 and 2, and to switch to classic whenever the pseudo-master cannot be contacted. The techniques from Megastore can then be applied to the pseudo-master to guarantee up-to-date reads in all data centers. The same strategy can guarantee monotonic reads, such as repeatable reads or read-your-writes, but can be further relaxed by requiring only the local storage node to always participate in the quorum.

4.3 Atomic Visibility

MDCC provides atomic durability, meaning either all or none of the operations of the transaction are durable, but it does not support atomic visibility. That is, some of the updates of a committed transaction might be visible whereas others are not. Two-phase commit also provides only atomic durability, not visibility, unless it is combined with other techniques such as two-phase locking or snapshot isolation. The same is true for MDCC. For example, MDCC could use a read/write locking service per data center, or snapshot isolation as done in Spanner [8], to achieve atomic visibility.

4.4 Other Isolation Levels

Finally, MDCC can support higher levels of isolation. In particular, Non-monotonic Snapshot Isolation (NMSI) [22] or Spanner's [8] snapshot isolation through synchronized clocks are natural fits for MDCC. Both would still allow fast commits while providing consistent snapshots. Furthermore, as we already check the write-set of transactions, the protocol could easily be extended to also consider read-sets, allowing us to leverage optimistic concurrency control techniques and ultimately provide full serializability.

5. Evaluation

We implemented a prototype of MDCC on top of a distributed key/value store across five different data centers using the Amazon EC2 cloud. To demonstrate the benefits of MDCC, we use TPC-W and micro-benchmarks to compare the performance characteristics of MDCC to other transactional and non-transactional, eventually consistent protocols. This section describes the benchmarks, the experimental setup, and our findings.

5.1 Experimental Setup

We implemented the MDCC protocol in Scala, on top of a distributed key/value store which uses Oracle BDB Java Edition as a persistent storage engine. We deployed the system across five geographically diverse data centers on Amazon EC2: US West (N. California), US East (Virginia), EU (Ireland), Asia Pacific (Singapore), and Asia Pacific (Tokyo). Each data center has a full replica of the data; within a data center, each table is range-partitioned by key and distributed across several storage nodes as m1.large instances (4 cores, 7.5GB memory). Therefore, every horizontal partition, or shard, of the data is replicated five times, with one copy in each data center. Unless noted otherwise, all clients issuing transactions are evenly distributed across all five data centers, on separate m1.large instances.

5.2 Comparison with other Protocols

To compare the overall performance of MDCC with alternative designs, we used TPC-W, a transactional benchmark that simulates the workload experienced by an e-commerce web server. TPC-W defines a total of 14 web interactions (WI), each of which is a web page request that issues several database queries. In TPC-W, the only transaction able to benefit from commutative operations is the product-buy request, which decreases the stock for each item in the shopping cart while ensuring that the stock never drops below 0 (otherwise, the transaction should abort). We implemented all the web interactions using our own SQL-like language but forgo the HTML rendering part of the benchmark to focus on the database part. TPC-W defines that these WI are requested by emulated browsers, or clients, with a wait-time between requests and varying browse-to-buy ratios. In our experiments, we forgo the wait-time between requests and use only the most write-heavy profile to stress the system. It should also be noted that read committed isolation is sufficient for TPC-W to never violate its data consistency.

In these experiments, the MDCC prototype uses fast ballots with commutativity where possible (reverting to classic after too many collisions have occurred, as described in Section 3.3.2). For comparison, we also implemented forms of some other replica management protocols in Scala, using the same distributed store and accessed by the same clients.

Quorum Writes (QW). The quorum writes protocol (QW) is the standard for most eventually consistent systems and is implemented by simply sending all updates to all involved storage nodes and then waiting for responses from a quorum of nodes. We used two different configurations for the write quorum: a quorum of size 3 out of 5 replicas for each record (we call this QW-3), and a quorum of size 4 out of 5 (QW-4). We use a read quorum of 1 to access only the local replica (the fastest read configuration). It is important to note that the quorum writes protocol provides no isolation, atomicity, or transactional guarantees.
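A sketch of this baseline, under our own assumed replica interface (ReplicaNode, QuorumWrite.write), is shown below: the update is sent to every replica and the call completes once w acknowledgements (3 for QW-3, 4 for QW-4) have arrived.

```scala
import scala.concurrent.{ExecutionContext, Future, Promise}
import java.util.concurrent.atomic.AtomicInteger

// Assumed replica interface: a fire-and-acknowledge write, no transactional guarantees.
trait ReplicaNode {
  def write(key: String, value: Array[Byte]): Future[Unit]
}

object QuorumWrite {
  // Send the update to all replicas and complete as soon as `w` acknowledgements arrive
  // (w = 3 for QW-3, w = 4 for QW-4). Reads would go to a single local replica.
  def write(replicas: Seq[ReplicaNode], w: Int, key: String, value: Array[Byte])
           (implicit ec: ExecutionContext): Future[Unit] = {
    val done = Promise[Unit]()
    val acks = new AtomicInteger(0)
    replicas.foreach { node =>
      node.write(key, value).foreach { _ =>
        if (acks.incrementAndGet() == w) done.trySuccess(())
      }
    }
    done.future
  }
}
```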

Two-Phase Commit (2PC). Two-phase commit (2PC) is still considered the standard protocol for distributed transactions. 2PC operates in two phases. In the first phase, a transaction manager tries to prepare all involved storage nodes to commit the updates. If all relevant nodes prepare successfully, then in the second phase the transaction manager sends a commit to all involved storage nodes; otherwise it sends an abort. Note that 2PC requires all involved storage nodes to respond and is not resilient to single node failures.
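The following is a minimal coordinator sketch of the two phases just described; the Participant interface is an assumption for illustration, and blocking calls stand in for network messages.

```scala
// Assumed participant interface for a storage node involved in the transaction.
trait Participant {
  def prepare(txnId: Long): Boolean   // true = prepared, false = vote to abort
  def commit(txnId: Long): Unit
  def abort(txnId: Long): Unit
}

object TwoPhaseCommit {
  // Phase 1: ask every involved storage node to prepare.
  // Phase 2: commit only if all of them prepared, otherwise abort.
  // Every node must answer, so a single unreachable node blocks the transaction.
  def run(txnId: Long, participants: Seq[Participant]): Boolean = {
    val allPrepared = participants.forall(_.prepare(txnId))
    if (allPrepared) participants.foreach(_.commit(txnId))
    else participants.foreach(_.abort(txnId))
    allPrepared
  }
}
```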

Megastore*. We were not able to compare MDCC directly against the Megastore system because it was not publicly available. Google App Engine uses Megastore, but the data centers and configuration are unknown and out of our control. Instead, we simulated the underlying protocol as described in [2] to compare it with MDCC; we do this as a special configuration of our system, referred to as Megastore*. In [2], the protocol is described mainly for transactions within a partition. The paper states that 2PC is used across partitions with looser consistency semantics, but omits details on the implementation, and the authors discourage using the feature because of its high latency. Therefore, for experiments with Megastore*, we placed all data into a single entity group to avoid transactions that span multiple entity groups. Furthermore, Megastore only allows one write transaction to execute at any time (all other competing transactions abort). As this results in unusable throughput for TPC-W, we include an improvement from [20] and allow non-conflicting transactions to commit using a subsequent Paxos instance. We also relaxed the read consistency to read committed, enabling a fair comparison between Megastore* and MDCC. Finally, we play in favor of Megastore* by placing all clients and masters in one data center (US-West), allowing all transactions to commit with a single round-trip.

5.2.1 TPC-W Write Response Times

To evaluate MDCC's main goal, reducing latency, we ran the TPC-W workload with each protocol. We used a TPC-W scale factor of 10,000 items, with the data evenly range partitioned and replicated to four storage nodes per data center. 100 evenly geo-distributed clients (on separate machines) each ran the TPC-W benchmark for 2 minutes, after a 1 minute warm-up period.⁴

Figure 3 shows the cumulative distribution functions (CDF) of the response times of committed write transactions for the different protocols. Note that the horizontal (time) axis is on a log scale. We only report the response times for write transactions, as read transactions were always local for all configurations and protocols. The two dashed lines (QW-3, QW-4) are non-transactional, eventually consistent protocols, and the three solid lines (MDCC, 2PC, Megastore*) are transactional, strongly consistent protocols.

Figure 3 shows that the non-transactional protocol QW-3 has the fastest response times, followed by QW-4, then the transactional systems, of which MDCC is the fastest, followed by 2PC, and finally Megastore* with the slowest times.

⁴ In a separate experiment, we studied for each protocol the effect of using different numbers of clients and storage nodes, and found that 4 storage nodes per data center for 100 clients gives the best utilization before latency starts increasing because of queuing/network effects. For brevity we omit this experiment.


Figure 3. TPC-W write transaction response times CDF (response time in ms, log-scale).

Figure 4. TPC-W transactions per second scalability (varying number of concurrent clients).

Figure 5. Micro-benchmark response times CDF (response time in ms).

The median response times are: 188ms for QW-3, 260ms for QW-4, 278ms for MDCC, 668ms for 2PC, and 17,810ms for Megastore*.

Since MDCC uses fast ballots whenever possible, it often commits transactions from any data center with a single round-trip to a quorum of size 4. This explains why the performance of MDCC is similar to QW-4. The difference between QW-3 and QW-4 arises from the need to wait for the 4th response instead of returning after the 3rd. Because of the non-uniform latencies between data centers, the 4th response is on average farther away than the 3rd, and there is more variance when waiting for more responses. Hence, an administrator might choose to configure an MDCC-like system to use classic instances with a local master if it is known that most requests in the workload are issued from the same data center (see Section 5.3.3 for an evaluation).
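The effect of waiting for the 4th rather than the 3rd response can be illustrated with a small back-of-the-envelope sketch: the commit latency is roughly the k-th smallest round-trip time to the replicas. The RTT values and the QuorumLatency name below are made up for illustration and are not measurements from our deployment.

```scala
object QuorumLatency {
  // The k-th fastest reply determines when a quorum of size k is reached.
  def kthResponse(rttsMs: Seq[Double], k: Int): Double = rttsMs.sorted.apply(k - 1)

  def main(args: Array[String]): Unit = {
    // Hypothetical round-trip times from a client's local data center to the five replicas.
    val rtts = Seq(1.0, 80.0, 170.0, 180.0, 250.0)
    println(s"quorum of 3 waits ~${kthResponse(rtts, 3)} ms")  // 3rd fastest reply
    println(s"quorum of 4 waits ~${kthResponse(rtts, 4)} ms")  // 4th fastest reply
  }
}
```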

MDCC reduces per-transaction latencies by 50% compared to 2PC because it commits in one round-trip instead of two. Most surprising, however, is the orders-of-magnitude improvement over Megastore*. This is because Megastore* must serialize all transactions with Paxos (it executes one transaction at a time), so heavy queuing effects occur. This queuing already occurs at the moderate load of this experiment, but it is possible to avoid it by reducing the load or by allowing multiple transactions to commit in one commit log record. In that case, performance would be similar to our classic Paxos configuration discussed in Section 5.3.1. Even without queuing effects, Megastore* would require an additional round-trip to the master for non-local transactions. Since Google's Spanner [8] uses 2PC across Paxos groups, and each Paxos group requires a master, we expect Spanner to behave similarly to the 2PC line in Figure 3.

We conclude from this experiment that MDCC achieves our main goal: it supports strongly consistent transactions with latencies similar to non-transactional protocols that provide weaker consistency, and it is significantly faster than other strongly consistent protocols (2PC, Megastore*).

5.2.2 TPC-W Throughput and Transaction Scalability

One of the intended advantages of cloud-based storage systems is the ability to scale out without affecting performance. We performed a scale-out experiment using the same setting as in the previous section, except that we varied the scale to (50 clients, 5,000 items), (100 clients, 10,000 items), and (200 clients, 20,000 items). For each configuration, we fixed the amount of data per storage node to a TPC-W scale factor of 2,500 items and scaled the number of nodes accordingly (keeping the ratio of clients to storage nodes constant). For the same reasons as before, we used a single partition for Megastore* to avoid cross-partition transactions.

Figure 4 shows the results of the throughput measurements for the various protocols. We see that the QW protocols have the lowest message and CPU overhead and therefore the highest throughput, with the MDCC throughput not far behind. For 200 concurrent clients, the MDCC throughput was within 10% of the throughput of QW-4. The experiments also demonstrate that MDCC has higher throughput than the other strongly consistent protocols, 2PC and Megastore*. The throughput for 2PC is significantly lower, mainly due to the additional waiting for the second round.

As expected, the QW protocols scale almost linearly; we see similar scaling for MDCC and 2PC. The Megastore* throughput is very low and does not scale out well, because all transactions are serialized for the single partition. This low throughput and poor scaling matches the results in [13] for Google App Engine, a public service using Megastore. In summary, Figure 4 shows that MDCC provides strongly consistent cross-data-center transactions with throughput and scalability similar to eventually consistent protocols.

5.3 Exploring the Design Space

We use our own micro-benchmark to independently study the different optimizations within the MDCC protocol and how sensitive it is to workload features. The data for the micro-benchmark is a single table of items, with randomly chosen stock values and a constraint on the stock attribute that it has to be at least 0. The benchmark defines a simple buy transaction that chooses 3 random items uniformly and, for each item, decrements the stock value by an amount between 1 and 3 (a commutative operation). Unless stated otherwise, we use 100 geo-distributed clients and a pre-populated product table with 10,000 items sharded on 2 storage nodes per data center.

5.3.1 Response Times

To study the effects of the different design choices in MDCC, we ran the micro-benchmark with 2PC and various MDCC configurations: MDCC: our full featured protocol, Fast: without the commutative update support, and Multi: all instances being Multi-Paxos (a stable master can skip Phase 1).
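For concreteness, the following sketch shows the buy transaction of the micro-benchmark described above. The in-memory map and the names (MicroBenchmark, buyTransaction) are our own simplifications and stand in for the replicated, quorum-based store.

```scala
import scala.util.Random
import scala.collection.mutable

object MicroBenchmark {
  val numItems = 10000
  private val rnd = new Random()

  // Choose 3 random items and decrement each stock by 1-3, subject to stock >= 0.
  // Returns the applied decrements, or None if the constraint would be violated,
  // in which case the transaction should abort.
  def buyTransaction(stock: mutable.Map[Int, Int]): Option[Seq[(Int, Int)]] = {
    val items = rnd.shuffle((0 until numItems).toVector).take(3)
    val updates = items.map(item => (item, 1 + rnd.nextInt(3)))
    if (updates.forall { case (item, amount) => stock.getOrElse(item, 0) - amount >= 0 }) {
      updates.foreach { case (item, amount) => stock(item) = stock(item) - amount }
      Some(updates)
    } else None
  }
}
```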


Figure 6. Commits/aborts for varying conflict rates (hot-spot sizes of 2%, 5%, 10%, 20%, 50%, and 90%).

Figure 7. Response times for varying master locality (probability of a local master from 100% down to 20%).

Figure 8. Time-series of response times during a data center failure.

The experiment ran for 3 minutes after a 1 minute warm-up. Figure 5 shows the cumulative distribution functions (CDF) of the response times of the successful transactions.

The median response times are: 245ms for MDCC, 276ms for Fast, 388ms for Multi, and 543ms for 2PC. 2PC is the slowest because it must use two round-trips across data centers and has to wait for responses from all 5 data centers. For Multi, with the masters being uniformly distributed across all data centers, most of the transactions (about 4/5 of them) require a message round-trip to contact the master, so two round-trips across data centers are required, similar to 2PC. In contrast to 2PC, Multi needs responses from only 3 of 5 data centers, so the response times are improved. The response times for Multi would also be observed for Megastore* if no queuing effects were experienced. Megastore* experiences heavier queuing effects because all transactions in the single entity group are serialized, whereas with Multi, only updates per record are serialized.

The main reason for the low latencies of MDCC is the use of fast ballots. Both MDCC and Fast return earlier than the other protocols because they often require only one round-trip across data centers and, unlike Multi, do not need to contact a master. The improvement from Fast to MDCC comes from commutative updates, which reduce conflicts and thus collisions, so MDCC can continue to use fast ballots and avoid resolving collisions as described in Section 3.3.2.

5.3.2 Varying Conflict Rates

MDCC attempts to take advantage of situations when conflicts are rare, so we study how MDCC's commit performance is affected by different conflict rates. We therefore defined a hot-spot area and modified the micro-benchmark to access items in the hot-spot area with 90% probability and the cold-spot portion of the data with the remaining 10% probability. By adjusting the size of the hot-spot as a percentage of the data, we alter the conflict rate of the access pattern. For example, when the hot-spot is 90% of the data, the access pattern is essentially uniformly random, since 90% of the accesses go to 90% of the data.
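The skewed key chooser can be sketched as follows; the 90/10 split matches the description above, while the function and parameter names (nextKey, hotSpotFraction) are ours.

```scala
import scala.util.Random

object HotSpotAccess {
  // With 90% probability pick a key from the hot-spot (a configurable fraction of the
  // key space), otherwise pick a key from the cold remainder.
  def nextKey(numItems: Int, hotSpotFraction: Double, rnd: Random): Int = {
    val hotSize = math.max(1, (numItems * hotSpotFraction).toInt)
    if (rnd.nextDouble() < 0.9) rnd.nextInt(hotSize)               // 90%: hot-spot keys
    else hotSize + rnd.nextInt(math.max(1, numItems - hotSize))    // 10%: cold keys
  }
}
```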

The Multi system uses masters to serialize transactions, so Paxos conflicts occur only when there are multiple potential masters, which should be rare. Multi simply aborts transactions when the read version is not the same as the version in the storage node (indicating a write-write transaction conflict) in order to keep the data consistent, making Paxos collisions independent of transaction conflicts. For Fast Paxos, on the other hand, collisions are related to transaction conflicts, as a collision/conflict occurs whenever a quorum of size 4 does not agree on the same decision. When this happens, collision resolution must be triggered, which eventually switches to a classic ballot and takes at least 2 more message rounds. MDCC improves on this by exploiting commutativity, but it can still incur an expensive collision resolution whenever the quorum demarcation integrity constraint is reached, as described in Section 3.4.2.

Figure 6 shows the number of commits and aborts of the different designs for various hot-spot sizes. When the hot-spot size is large, the conflict rate is low, so none of the configurations experiences many aborts. MDCC commits the most transactions because it does not abort any transactions. Fast commits slightly fewer, because it has to resolve the collisions that occur when different storage nodes see updates in different orders. Multi commits far fewer transactions because most updates have to be sent to a remote master, which increases the response times and decreases the throughput.

As the hot-spot decreases in size, the conflict rate increases because more of the transactions access smaller portions of the data. Therefore, more transactions abort as the hot-spot size decreases. When the hot-spot is at 5%, Fast commits fewer transactions than Multi. This can be explained by the fact that Fast needs 3 round-trips to ultimately resolve conflicting transactions, whereas Multi usually uses 2 rounds. When the hot-spot is at 2%, the conflict rate is very high, so both Fast and MDCC perform very poorly compared to Multi. The costly collision resolution for fast ballots is triggered so often that many transactions are unable to commit. We conclude that fast ballots can take advantage of master-less operation as long as the conflict rate is not very high. When the conflict rate is too high, a master-based approach is more beneficial and MDCC should be configured as Multi. Exploring policies to automatically determine the best strategy remains future work.

5.3.3 Data Access Locality

Classic ballots can save message trips when client requests have affinity for data with a local master. To explore the trade-off between fast and classic ballots, we modified the benchmark to vary the choice of data items within each transaction, so that a given percentage accesses records with local masters. At one extreme, 100% of transactions choose their items only from those with a local master; at the other, 20% of the transactions choose items with a local master (items are chosen uniformly at random).

Figure 7 shows box plots of the latencies of Multi and MDCC for different degrees of master locality. When all the masters are local to the clients, Multi has lower response times than MDCC, as shown in the graph for 100%. However, as updates access more remote masters, response times for Multi become slower and increase in variance, while MDCC maintains the same profile. Even when 80% of the updates are local, the median Multi response time (242ms) is slower than the median MDCC response time (231ms). Our MDCC design is targeted at situations without particular access locality, and Multi only outperforms MDCC when the locality is near 100%. It is also interesting to note that the maximum latency of the Multi configuration is higher than for full MDCC. This can be explained by the fact that some transactions have to queue until the previous transaction finishes, whereas MDCC normally operates with fast ballots and everything is done in parallel.

5.3.4 Data Center Fault Tolerance

We also experimented with various failure scenarios. Here, we only report on a simulated full data center outage while running the micro-benchmark, as other failures, such as the failure of a transaction coordinator, mainly depend on the configured time-outs. We started 100 clients issuing write transactions from the US-West data center. About two minutes into the experiment, we simulated a failure of the US-East data center, which is the data center closest to US-West. We simulated the failed data center by preventing it from receiving any messages. Since US-East is closest to US-West, "killing" US-East forces the protocol to tolerate the failure. We recorded all the committed transaction response times and plotted the time series, shown in Figure 8.
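One way such an outage can be injected at the messaging layer is sketched below: wrap the transport and silently drop anything addressed to the "failed" data center. The Transport trait, Message type, and FaultInjectingTransport wrapper are assumptions for illustration, not the prototype's actual interfaces.

```scala
// Illustrative message and transport abstractions.
final case class Message(toDataCenter: String, payload: Array[Byte])

trait Transport {
  def send(msg: Message): Unit
}

// Drops all messages destined for data centers marked as failed,
// simulating a full data center outage from the sender's point of view.
final class FaultInjectingTransport(inner: Transport) extends Transport {
  @volatile private var failed: Set[String] = Set.empty

  def failDataCenter(dc: String): Unit = { failed += dc }   // e.g. failDataCenter("us-east")

  override def send(msg: Message): Unit =
    if (!failed.contains(msg.toDataCenter)) inner.send(msg)
}
```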

Figure 8 shows the transaction response times before and after failing the data center, which occurred around 125 seconds into the experiment (solid vertical line). The average response time of transactions before the data center failure was 173.5 ms and the average response time after the failure was 211.7 ms (dashed horizontal lines). The MDCC system clearly continues to commit transactions seamlessly across the data center failure. The average transaction latencies increase after the failure, but this is expected behavior, since the MDCC commit protocol uses quorums and must wait for responses from another data center, potentially farther away. The same argument also explains the increase in variance. If the data center comes up again (not shown in the figure), only records that were updated during the failure would still be affected by the increased latency, until the next update or a background process brings them up-to-date. These results show MDCC's resilience against data center failures.

6. Related Work

There has been recent interest in scalable geo-replicated datastores. Several recent proposals use Paxos to agree on log positions, similar to state-machine replication. For example, Megastore [2] uses Multi-Paxos to agree on log positions to synchronously replicate data across multiple data centers (typically five). Google's Spanner [8] is similar, but uses synchronized timestamps to provide snapshot isolation. Furthermore, other state-machine techniques for WANs, such as Mencius [18] or HP/CoreFP [10], could also be used for wide-area database log replication. All these systems have in common that they significantly limit throughput by serializing all commit log records and thus implicitly executing only one transaction at a time. As a consequence, they must partition the data into small shards to get reasonable performance. Furthermore, these protocols rely on an additional protocol (usually 2PC, with all its disadvantages) to coordinate any transactions that access data across shards. Spanner and Megastore are both master-based approaches and introduce an additional message delay for remote clients. Mencius [18] uses a clever token-passing scheme so that it does not rely on a single static master; this scheme, however, is only useful for large partitions and is not easily applicable to finer-grained replication (i.e., at the record level). Like MDCC, HP/CoreFP avoids a master and improves on the cost of Paxos collisions by executing classic and fast rounds concurrently. This hybrid approach could easily be integrated into MDCC, but it requires significantly more messages, which is worrisome for real-world applications. In summary, MDCC improves over these approaches by not requiring partitioning, natively supporting transactions as a single protocol, and/or avoiding a master when possible.

Paxos-CP [20] improves Megastore's replication protocol by allowing non-conflicting transactions to move on to subsequent commit log positions and by combining commits into one log position, significantly increasing the fraction of committed transactions. The ideas are very interesting, but their performance evaluation does not show that the log-position bottleneck is removed (they only execute 4 transactions per second). Compared to MDCC, they require an additional master-based single-node transaction conflict detection, but are able to provide stronger serializability guarantees.

A more fine-grained use of Paxos was explored in Consensus on Commit [11] to reliably store the resource manager's decision (commit/abort) and make it resilient to failures. In theory, there could be a resource manager per record. However, they treat data replication as an orthogonal issue and require that a single resource manager make the decision (commit/abort), whereas MDCC assumes this decision is made by a quorum of storage nodes. Scalaris [24] applied consensus on commit to DHTs, but cannot leverage Fast Paxos as MDCC does. Our use of record versioning with Paxos has some commonalities with multi-OHS [1], a protocol to construct Byzantine fault-tolerant services, which also supports atomic updates to objects. However, multi-OHS only guarantees atomic durability for a single server (not across shards), and it is not obvious how to use the protocol for distributed transactions or commutative operations. The authors of Spanner describe that they tried a more fine-grained use of Paxos by running multiple instances per shard, but eventually gave up because of the complexity. In this paper, we showed that it is possible, and presented a system that uses multiple Paxos instances to execute transactions without requiring partitioning.

Other geo-replicated datastores include PNUTS [7], Amazon's Dynamo [9], Walter [25] and COPS [17]. These use asynchronous replication, with the risk of violating consistency and losing data in the event of major data center failures. Walter [25] also supports a second mode with stronger consistency guarantees between data centers, but this relies on 2PC and always requires two round-trip times.

The use of optimistic atomic broadcast protocols for transaction commit was proposed in [12, 21]. That technique does not exploit commutativity and often has considerably longer response times in the wide-area network because of the wait for a second verify message before the commit becomes final.

Finally, our demarcation strategy for quorums was inspired by [3], which first proposed using extra limits to ensure value constraints.

7. Conclusion

The long and fluctuating latencies between data centers make it hard to support highly available applications that can survive data center failures. Reducing the latency of transactional commit protocols is the goal of this paper.

We proposed MDCC as a new approach for synchronous replication in the wide-area network. MDCC's commit protocol is able to tolerate data center failures without compromising consistency, at a cost similar to that of eventually consistent protocols. It requires only one message round-trip across data centers in the common case. In contrast to 2PC, MDCC is an optimistic commit protocol and takes advantage of situations when conflicts are rare and/or when updates commute. It is the first protocol applying the ideas of Generalized Paxos to transactions that may access records spanning partitions. We also presented the first technique to guarantee domain integrity constraints in a quorum-based system.

In the future, we plan to explore more optimizations of the protocol, such as determining the best strategy (fast or classic) based on client locality, or batching techniques that reduce the message overhead. Supporting other levels of read isolation, such as PSI, is an interesting future avenue.

MDCC provides a transactional commit protocol for the wide-area network that achieves strong consistency at a cost similar to that of eventually consistent protocols.

8. Acknowledgments

We would like to thank the anonymous reviewers and our shepherd, Robbert van Renesse, for all the helpful comments and suggestions. This research is supported in part by NSF CISE Expeditions award CCF-1139158 and DARPA XData Award FA8750-12-2-0331, and gifts from Amazon Web Services, Google, SAP, Blue Goji, Cisco, Clearstory Data, Cloudera, Ericsson, Facebook, General Electric, Hortonworks, Huawei, Intel, Microsoft, NetApp, Oracle, Quanta, Samsung, Splunk, VMware and Yahoo!.

References

[1] M. Abd-El-Malek et al. Fault-Scalable Byzantine Fault-Tolerant Services. In Proc. of SOSP, 2005.
[2] J. Baker et al. Megastore: Providing Scalable, Highly Available Storage for Interactive Services. In CIDR, 2011.
[3] D. Barbará and H. Garcia-Molina. The Demarcation Protocol: A Technique for Maintaining Constraints in Distributed Database Systems. VLDB J., 3(3), 1994.
[4] H. Berenson et al. A Critique of ANSI SQL Isolation Levels. In Proc. of SIGMOD, 1995.
[5] M. Brantner et al. Building a Database on S3. In Proc. of SIGMOD, 2008.
[6] C. Cachin, K. Kursawe, F. Petzold, and V. Shoup. Secure and Efficient Asynchronous Broadcast Protocols. In Advances in Cryptology - CRYPTO 2001, 2001.
[7] B. F. Cooper et al. PNUTS: Yahoo!'s Hosted Data Serving Platform. Proc. VLDB Endow., 1, 2008.
[8] J. C. Corbett et al. Spanner: Google's Globally-Distributed Database. In Proc. of OSDI, 2012.
[9] G. DeCandia et al. Dynamo: Amazon's Highly Available Key-Value Store. In Proc. of SOSP, 2007.
[10] D. Dobre, M. Majuntke, M. Serafini, and N. Suri. HP: Hybrid Paxos for WANs. In Dependable Computing Conference, 2010.
[11] J. Gray and L. Lamport. Consensus on Transaction Commit. TODS, 31, 2006.
[12] B. Kemme, F. Pedone, G. Alonso, and A. Schiper. Processing Transactions over Optimistic Atomic Broadcast Protocols. In Proc. of ICDCS, 1999.
[13] D. Kossmann, T. Kraska, and S. Loesing. An Evaluation of Alternative Architectures for Transaction Processing in the Cloud. In Proc. of SIGMOD, 2010.
[14] L. Lamport. The Part-Time Parliament. TOCS, 16(2), 1998.
[15] L. Lamport. Generalized Consensus and Paxos. Technical Report MSR-TR-2005-33, Microsoft Research, 2005.
[16] L. Lamport. Fast Paxos. Distributed Computing, 19, 2006.
[17] W. Lloyd, M. J. Freedman, M. Kaminsky, and D. G. Andersen. Don't Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS. In Proc. of SOSP, 2011.
[18] Y. Mao, F. P. Junqueira, and K. Marzullo. Mencius: Building Efficient Replicated State Machines for WANs. In Proc. of OSDI, 2008.
[19] P. E. O'Neil. The Escrow Transactional Method. TODS, 11, 1986.
[20] S. Patterson et al. Serializability, not Serial: Concurrency Control and Availability in Multi-Datacenter Datastores. Proc. VLDB Endow., 5(11), 2012.
[21] F. Pedone. Boosting System Performance with Optimistic Distributed Protocols. IEEE Computer, 34(12), 2001.
[22] M. Saeida Ardekani, P. Sutra, N. Preguiça, and M. Shapiro. Non-Monotonic Snapshot Isolation. Research Report RR-7805, INRIA, 2011.
[23] E. Schurman and J. Brutlag. Performance Related Changes and their User Impact. Presented at Velocity Web Performance and Operations Conference, 2009.
[24] T. Schütt, F. Schintke, and A. Reinefeld. Scalaris: Reliable Transactional P2P Key/Value Store. In Erlang Workshop, 2008.
[25] Y. Sovran, R. Power, M. K. Aguilera, and J. Li. Transactional Storage for Geo-Replicated Systems. In Proc. of SOSP, 2011.