
ACMS: The Akamai Configuration Management System

Alex Sherman†‡, Philip A. Lisiecki†, Andy Berkheimer†, and Joel Wein†∗
†Akamai Technologies, Inc.  ‡Columbia University  ∗Polytechnic University
†{andyb,lisiecki,asherman,jwein}@akamai.com  ‡[email protected]  ∗[email protected]

Abstract

An important trend in information technology is the use of increasingly large distributed systems to deploy increasingly complex and mission-critical applications. In order for these systems to achieve the ultimate goal of having similar ease-of-use properties as centralized systems, they must allow fast, reliable, and lightweight management and synchronization of their configuration state. This goal poses numerous technical challenges in a truly Internet-scale system, including varying degrees of network connectivity, inevitable machine failures, and the need to distribute information globally in a fast and reliable fashion.

In this paper we discuss the design and implementation of a configuration management system for the Akamai Network. It allows reliable yet highly asynchronous delivery of configuration information, is significantly fault-tolerant, and can scale if necessary to hundreds of thousands of servers.

The system is fully functional today, providing configuration management to over 15,000 servers deployed in 1200+ different networks in 60+ countries.

1 Introduction

Akamai Technologies operates a system of 15,000+ widely dispersed servers on which its customers deploy their web content and applications in order to increase the performance and reliability of their web sites. When a customer extends their web presence from their own server or server farm to a third-party Content Delivery Network (CDN), a major concern is the ability to maintain close control over the manner in which their web content is served. Most customers require a level of control over their distributed presence that rivals that achievable in a centralized environment.

Akamai’s customers can configure many options that determine how their content is served by the CDN. These options may include HTML cache timeouts, whether to allow cookies, and whether to store session data for their web applications, among many other settings. Configuration files that capture these settings must be propagated quickly to all of the Akamai servers upon update.

In addition to configuring customer profiles, Akamai also runs many internal services and processes which require frequent updates or “reconfigurations.” One example is the mapping services, which assign users to Akamai servers based on network conditions. Subsystems that measure frequently-changing network connectivity and latency must distribute their measurements to the mapping services.

In this paper we describe the Akamai Configuration Management System (ACMS), which was built to support customers’ and internal services’ configuration propagation requirements. ACMS accepts distributed submissions of configuration information (captured in configuration files) and disseminates this information to the Akamai CDN. ACMS is highly available through significant fault-tolerance, allows reliable yet highly asynchronous and consistent delivery of configuration information, provides persistent storage of configuration updates, and can scale if necessary to hundreds of thousands of servers.

The system is fully functional today, providing configuration management to over 15,000 servers deployed in 1200+ different ISP networks in 60+ countries. Further, as a lightweight mechanism for making configuration changes, it has evolved into a critical element of how we administer our network in a flexible fashion.

Elements of ACMS bear resemblance to or draw from numerous previous efforts in distributed systems – from reliable messaging/multicast in wide-area systems, to fault-tolerant data replication techniques, to Microsoft’s Windows Update functionality; we present a detailed comparison in Section 8. We believe, however, that our system is designed to work in a relatively unique environment, due to a combination of the following factors.

• The set of end clients – our 15,000+ servers – are very widely dispersed.


• At any point in time a nontrivial fraction of these servers may be down or may have nontrivial connectivity problems to the rest of the system. An individual server may be out of commission for several months before being returned to active duty, and will need to get caught up in a sane fashion.

• Configuration changes are generated from widely dispersed places – for certain applications, any server in the system can generate configuration information that needs to be dispersed via ACMS.

• We have relatively strong consistency requirements. When a server that has been out of touch regains contact, it needs to become up to date quickly or risk serving customer content in an outdated mode.

Our solution is based on a small set of front-end distributed Storage Points and a back-end process that manages downloads from the front-end. We have designed and implemented a set of protocols that deal with our particular availability and consistency requirements.

The major contributions of this paper are as follows:

• We describe the design of a live working system that meets the requirements of configuration management in a very large distributed network.

• We present performance data and detail some lessons learned from building and deploying such a system.

• We discuss in detail the distributed synchronization protocols we introduced to manage the front-end Storage Points. While these protocols bear similarity to several previous efforts, they are targeted at a different combination of reliability and availability requirements and thus may be of interest in other settings.

1.1 Assumptions and Requirements

We assume that the configuration files will vary in size from a few hundred bytes up to 100MB. Although very large configuration files are possible and do occur, they should generally be rare. We assume that most updates must be distributed to every Akamai node, although some configuration files may have a relatively small number of subscribers. Since distinct applications submit configuration files dynamically, there is no particular arrival pattern of submissions, and at times we could expect several submissions per second. We also assume that the Akamai CDN will continue to grow. Such growth should not impede the CDN’s responsiveness to configuration changes. We assume that submissions could originate from a number of distinct applications running at distinct locations on the Akamai CDN.

We assume that each submission of a configuration file foo completely overwrites the earlier submitted version of foo. Thus, we do not need to store older versions of foo, but the system must correctly synchronize to the latest version. Finally, we assume that for each configuration file there is either a single writer or multiple idempotent (non-competing) writers.

Based on the motivation and assumptions described above, we formulate the following requirements for ACMS:

High Fault-Tolerance and Availability. In order to support all applications that dynamically submit configuration updates, the system must operate 24x7 and experience virtually no downtime. The system must be able to tolerate a number of machine failures and network partitions, and still accept and deliver configuration updates. Thus, the system must have multiple “entry points” for accepting and storing configuration updates such that failure of any one of them will not halt the system. Furthermore, these “entry points” must be located in distinct ISP networks so as to guarantee availability even if one of these networks becomes partitioned from the rest of the Internet.

Efficiency and Scalability. The system must deliver updates efficiently to a network of the size of the Akamai CDN, and all parts of the system must scale effectively to any anticipated growth. Since updates, such as a customer’s profile, directly affect how each Akamai node serves that customer’s content, it is imperative that the servers synchronize relatively quickly with respect to the new updates. The system must guarantee that propagation of updates to all “alive” nodes takes place within a few minutes of submission (provided, of course, that there is network connectivity to such “alive” or functioning nodes from some of our “entry points”).

Persistent Fault-Tolerant Storage. In a large network some machines will always be experiencing downtime due to power and network outages or process failures. Therefore, it is unlikely that a configuration update can be delivered synchronously to the entire CDN at the time of submission. Instead, the system must be able to store the updates permanently and deliver them asynchronously to machines as they become available.

Correctness. Since configuration file updates can be submitted to any of the “entry points,” it is possible that two updates for the same file foo arrive at different “entry points” simultaneously. We require that ACMS provide a unique ordering of all versions and that the system synchronize to the latest version for each configuration file. Since slight clock skews are possible among our machines, we relax this requirement and show that we allow a very limited, but bounded, reordering. (See section 3.4.2.)

Acceptance Guarantee. ACMS “accepts” a submission request only when the system has “agreed” on this version of the update. The agreement in ACMS is based on a “quorum” of “entry points.” (The quorum used in ACMS is at the core of our architecture and is discussed in great detail throughout the paper.) The agreement is necessary because if the “entry point” that receives an update submission becomes cut off from the Internet, it will not be able to propagate the update to the rest of the system. In essence, the Acceptance Guarantee stipulates that if a submission is accepted, a quorum has agreed to propagate the submission to the Akamai CDN.

Security. Configuration updates must be authenticated and encrypted so that ACMS cannot be spoofed, nor updates read by any third parties. The techniques that we use to accomplish this are standard, and we do not discuss them further in this document.

1.2 Our Approach

We observe that the ACMS requirements fall into two sets. The first set of requirements deals with update management: highly available, fault-tolerant storage and correct ordering of accepted updates. The second set of requirements deals with delivery: efficient and secure propagation of updates. Instinctively we split the architecture of the system into two subsystems – the “front-end” and the “back-end” – that correspond to the two sets of requirements. The front-end consists of a small set (typically 5 machines) of Storage Points (or SPs). The SPs are deployed in distinct Tier-1 networks inside well-connected data centers. The SPs are responsible for accepting and storing configuration updates. The back-end is the entire Akamai CDN that subscribes to the updates and aids in the update delivery.

High availability and fault-tolerance come from the fact that the SPs constitute a fully decentralized subsystem. ACMS does not depend on any particular SP to coordinate the updates, such as a database master in a persistent MOM (message-oriented middleware) store. Instead of relying on a coordinator, we use a set of distributed algorithms that help the SPs synchronize configuration submissions, and ACMS can tolerate a number of failures or partitions among the Storage Points. These algorithms, which will be discussed later, are quorum-based and require only a majority of the SPs to stay alive and connected to one another in order for the system to continue operation. Any majority of the SPs can reconstruct the full state of the configuration submissions and continue to accept and deliver submissions.

To propagate updates, we considered a push-based vs. a pull-based approach. In a push-based approach the SPs would need to monitor and maintain state for all Akamai hosts that require updates. In a pull-based approach all Akamai machines check for new updates and request them. We observed that the Akamai CDN itself is fully optimized for HTTP download, making the pull-based approach over HTTP download a natural choice. Since many configuration updates must be delivered to virtually every Akamai server, this allows us to use Akamai caches effectively for common downloads and thus reduce network bandwidth requirements. This natural choice helps ACMS scale with the growing size of the Akamai network.

As an optimization we add an additional set of machines (the Download Points) to the front-end. Download Points offer additional sites for HTTP download and thus alleviate the bandwidth demand placed on the Storage Points.

To further improve the efficiency of the HTTP download, we create an index hierarchy that concisely describes all configuration files available on the SPs. A downloading agent can start by downloading the root of the hierarchical index tree and work its way down to detect changes in any particular configuration files it is interested in.
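To make the top-down walk concrete, here is a minimal sketch of such an index. The nested-dict layout, MD5 digests, and function names are our illustrative assumptions, not Akamai's actual format: each subtree carries a digest over its children, so an agent descends only into subtrees whose digests differ from its cached copy.

```python
import hashlib

def digest(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()

def build_index(tree):
    """tree maps a name to either file bytes or a nested dict (a subtree).
    Returns (digest_of_this_subtree, mirrored_index_with_digests)."""
    index = {}
    parts = []
    for name in sorted(tree):
        node = tree[name]
        if isinstance(node, dict):
            d, sub = build_index(node)
        else:
            d, sub = digest(node), None
        index[name] = (d, sub)
        parts.append(name + ":" + d)
    return digest("|".join(parts).encode()), index

def changed_files(old, new, path=""):
    """Descend only into subtrees whose digests differ from the cached index."""
    out = []
    for name, (d, sub) in sorted(new.items()):
        d_old, sub_old = old.get(name, (None, None))
        if d == d_old:
            continue  # digest matches: the whole subtree is unchanged
        if sub is None:
            out.append(path + name)
        else:
            out.extend(changed_files(sub_old or {}, sub, path + name + "/"))
    return out

_, cached = build_index({"configs": {"foo": b"v1", "bar": b"v1"}})
root, latest = build_index({"configs": {"foo": b"v2", "bar": b"v1"}})
print(changed_files(cached, latest))  # only configs/foo needs downloading
```

Comparing the two root digests alone already tells the agent whether anything at all changed, which is what makes the root a cheap polling target.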

The rest of this paper is organized as follows. We give an architecture overview in section 2. We discuss our distributed techniques of quorum-based replication and recovery in sections 3 and 4. Section 5 describes the delivery mechanism. We share our operational experience and evaluation in sections 6 and 7. Section 8 discusses related work. We conclude in section 9.

2 Architecture Overview

The architecture of ACMS is depicted in Figure 1. First an application submitting an update (also known as a publisher) contacts an ACMS Storage Point. The publisher transmits a new version of a given configuration file. The SP that receives an update submission is also known as the Accepting SP for that submission. Before replying to the client, the Accepting SP makes sure to replicate the message on at least a quorum (a majority) of servers (i.e., Storage Points). Servers store the message persistently on disk as a file. In addition to copying the data, ACMS runs an algorithm called Vector Exchange that allows a quorum of SPs to agree on a submission. Only after the agreement is reached does the Accepting SP acknowledge the publisher’s request, by replying with “Accept.”

Once the agreement among the SPs is reached, the data can also be offered for download. The Storage Points upload the data to their local HTTP servers (i.e., HTTP servers run on the same machines as the SPs).

Since only a quorum of SPs is required to reach an agreement on a submission, some SPs may miss an occasional update due to downtime. To account for replication messages missed due to downtime, the SPs run a recovery scheme called Index Merging. Index Merging helps the Storage Points recover any missed updates from their peers.

Figure 1: ACMS: Publishers, Storage Points, and Receivers (Subscribers)

To subscribe for configuration updates, each server (also known as a node) on the Akamai CDN runs a process called Receiver that coordinates subscriptions for that node. Services on each node subscribe with their local Receiver process to receive configuration updates. Receivers periodically make HTTP IMS (If-Modified-Since) requests for these files from the SPs. Receivers send these requests via the Akamai CDN, and most of the requests are served from nearby Akamai caches, reducing network traffic requirements.
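The IMS polling pattern can be illustrated with a toy model. This is not the Receiver's real code; the function names and the simulated origin are invented for the sketch, and only the 304/200 semantics come from HTTP itself:

```python
from email.utils import formatdate, parsedate_to_datetime

def ims_headers(cached_mtime_epoch):
    """Build the If-Modified-Since header for the version already on disk."""
    return {"If-Modified-Since": formatdate(cached_mtime_epoch, usegmt=True)}

def simulate_origin(file_mtime_epoch, headers):
    """Toy origin: 304 if the file is not newer than the client's IMS date."""
    ims = headers.get("If-Modified-Since")
    if ims is not None:
        if file_mtime_epoch <= parsedate_to_datetime(ims).timestamp():
            return 304, None           # nothing to transfer, cache-friendly
    return 200, b"<new config bytes>"  # Receiver downloads the new version

cached = 1_700_000_000
print(simulate_origin(cached, ims_headers(cached)))       # (304, None)
print(simulate_origin(cached + 60, ims_headers(cached)))  # 200 with new body
```

The 304 case is what lets intermediate Akamai caches absorb most of the polling load: an unchanged file costs only a small header exchange.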

We add an additional set of a few well-positioned machines to the front-end, called the Download Points (DPs). DPs never participate in initial replication of updates and rely entirely on Index Merging to obtain the latest configuration files. DPs alleviate some of the download bandwidth requirements from the SPs. In this way data replication between the SPs does not need to compete as much for bandwidth with the download requests from subscribers.

3 Quorum-based Replication

The fault-tolerance of ACMS is based on the use of a simple quorum. In order for an Accepting SP to accept an update submission, we require that the update be both replicated to and agreed upon by a quorum of the ACMS SPs. We define quorum as a majority. As long as a majority of the SPs remain functional and not partitioned from one another, this majority subset will intersect with the initial quorum that accepted a submission. Therefore, this latter subset will collectively contain the knowledge of all previously accepted updates.

This approach is deeply rooted in our assumption that ACMS can maintain a majority of operational and connected SPs. If there is no quorum of SPs that are functional and can communicate with one another, ACMS will halt and refuse to accept new updates until a connected quorum of SPs is re-established.

Each SP maintains connectivity by exchanging liveness messages with its peers. Liveness messages also indicate whether the SPs are fully functional or healthy. Each SP reports whether it has pairwise connectivity to a quorum (including itself) of healthy SPs. The reports arrive at the Akamai NOCC (Network Operations Command Center) [2]. If a majority of ACMS SPs fails to report pairwise connectivity to a quorum, a red alert is generated in the NOCC and operations engineers perform immediate connectivity diagnosis and attempt to fix the network or server problem(s).

By placing SPs inside distinct ISP networks, we reduce the probability of an outage that would disrupt a quorum of these machines. (See some statistics in section 6.) Since we require only a majority of SPs to be connected, we can tolerate a number of failures due to partitioning, hardware, or software malfunctions. For example, with an initial set containing five SPs, we can tolerate two SP failures or partitions and still maintain a viable majority of three SPs. When any single SP malfunctions, a lesser-priority alert also triggers corrective action from the NOCC engineers. ACMS operational experience with maintaining a connected quorum and various failure cases is discussed in detail in section 6.
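The majority arithmetic behind this fault budget is simple; a small sketch (the function names are ours):

```python
def quorum_size(n):
    """Smallest majority of n Storage Points."""
    return n // 2 + 1

def tolerated_failures(n):
    """SPs that can fail or partition while a majority still survives."""
    return n - quorum_size(n)

def quorums_always_intersect(n, q):
    """Two quorums of size q out of n must share an SP exactly when 2q > n."""
    return 2 * q > n

print(quorum_size(5), tolerated_failures(5))  # 3 2, matching the five-SP example
```

The intersection test is the property the text relies on: any surviving majority overlaps the quorum that originally accepted an update.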

The rest of the section describes the quorum-based Acceptance Algorithm in detail. We also explain how ACMS replication and agreement methods satisfy the Correctness and Acceptance requirements outlined in section 1.1, and discuss maintenance of the ACMS SPs.

3.1 Acceptance Algorithm

The ACMS Acceptance Algorithm consists of two phases: replication and agreement. In the replication phase, the Accepting SP copies the update to at least a quorum of the SPs.

The Accepting SP first creates a temporary file with a unique filename (UID). For a configuration file foo the UID may look like this: “foo.A.1234”, where A is the name of the Accepting SP and “1234” is the timestamp of the request in UTC (shortened to 4 digits for this example). This UID is unique because each SP allows only one request per file per second.

The Accepting SP then sends this file that contains the update, along with its MD5 hash, to a number of SPs over a secure TCP connection. Each SP that receives the file stores it persistently on disk (under the UID name), verifies the hash, and acknowledges that it has stored the file.
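The UID construction can be sketched as follows. The class and its rejection behavior are our reading of the text (a real UID would carry full UTC seconds rather than the paper's shortened 4 digits):

```python
class AcceptingSP:
    def __init__(self, name):
        self.name = name
        self.last_ts = {}  # filename -> timestamp of the last issued UID

    def make_uid(self, filename, now):
        """UID = <filename>.<SP name>.<UTC seconds>. Uniqueness holds because
        each SP allows only one request per file per second."""
        ts = int(now)
        if self.last_ts.get(filename) == ts:
            raise RuntimeError("only one request per file per second")
        self.last_ts[filename] = ts
        return f"{filename}.{self.name}.{ts}"

sp = AcceptingSP("A")
print(sp.make_uid("foo", now=1234))  # foo.A.1234, as in the running example
```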

If the Accepting SP fails to replicate the data to a quorum after a timeout, it replies with an error to the publishing application. The timeout is based on the size of the update and a very low estimate of available bandwidth between this SP and its peers. (If the Accepting SP does not have connectivity to a quorum, it replies much sooner and does not wait for a timeout to expire.)

Otherwise, once at least a quorum of SPs (including the Accepting SP) has stored the temporary file, the Accepting SP initiates the second phase to obtain an agreement from the Storage Points on the submitted update.

3.2 Vector Exchange

Vector Exchange (also called “VE”) is a lightweight protocol that forms the second phase of the acceptance algorithm – the agreement phase. As the name suggests, VE involves Storage Points exchanging a state vector. The VE vector is just a bit vector with a bit corresponding to each Storage Point. A 1-bit indicates that the corresponding Storage Point knows of a given update. When a majority of bits are set to 1, we say that an agreement occurs and it is safe for any SP (that sees the majority of the bits set) to upload this latest update.

In the beginning of the agreement phase, the Accepting SP initializes a bit vector by setting its own bit to 1 and the rest to 0, and broadcasts the vector along with the UID of the update to the other SPs. Any SP that sees the vector sets its corresponding bit to 1, stores the vector persistently on disk, and re-broadcasts the modified vector to the rest of the SPs. Persistent storage guarantees that the SP will not lose its vector state on process restart or machine reboot. It is safe for each SP to set its bit even if it did not receive the temporary file during the replication phase: since at least a quorum of the SPs have stored this temporary file, the SP can always locate the file at a later stage.

Each SP learns of the agreement independently when it sees a quorum of bits set. Two actions can take place when a SP learns of the agreement for the first time. When the Accepting SP that initiated the VE instance learns of the agreement, it accepts the submission of the publishing application. When any SP (including the Accepting SP) learns of the agreement, it uploads the file. Uploading means that the SP copies the temporary file to a permanent location on its local HTTP server, where it is now available for download by the Receivers. If it does not have the temporary file, then it downloads it from one of the other SPs via the recovery routine (section 4).

Note that it is possible for the Accepting SP to become “cut off” from the quorum after it initiates the VE phase. In this case it does not know whether its broadcasts were received and whether the agreement took place. It is then forced to reply only with “Possible Accept” rather than “Accept” to the publishing application. We recommend that a publisher that gets cut off from the Accepting SP or receives a “Possible Accept” try to re-submit its update to another SP. (From a publisher’s perspective the reply of “Possible Accept” is equivalent to “Reject.” The distinction was made initially purely for the purpose of monitoring this condition.)
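The exchange can be simulated in a few lines. This toy in-memory model is our construction (synchronous delivery, no persistence, no random delays); it shows each live SP independently observing agreement while a down SP simply misses the broadcasts:

```python
QUORUM = 3  # majority of five Storage Points

class StoragePoint:
    def __init__(self, name, down=False):
        self.name, self.down = name, down
        self.vector = {}     # real ACMS persists this to disk before sending
        self.agreed = False  # set once a majority of bits has been observed

    def receive(self, net, uid, vector):
        if self.down:
            return           # a down SP misses the broadcast entirely
        merged = {n: max(vector.get(n, 0), self.vector.get(n, 0)) for n in net}
        merged[self.name] = 1
        changed = merged != self.vector
        self.vector = merged
        if sum(merged.values()) >= QUORUM:
            self.agreed = True  # safe to upload this version of the file
        if changed:             # re-broadcast only when our state advanced
            for peer in net.values():
                if peer is not self:
                    peer.receive(net, uid, merged)

net = {n: StoragePoint(n) for n in "ABCE"}
net["D"] = StoragePoint("D", down=True)
net["A"].receive(net, "foo.A.1234", {})  # Accepting SP A starts the exchange
print({n: sp.agreed for n, sp in net.items()})  # A, B, C, E agree; D does not
```

Because the merge is monotone (bits only ever go from 0 to 1), the simulated exchange always terminates, which mirrors the termination argument in section 3.5.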

As in many agreement schemes, the purpose of the VE protocol is to deal with some Byzantine network or machine failures [18]. In particular, VE prevents an individual SP (or a minority subset of SPs) from uploading new data and then becoming “disconnected” from the rest of the SPs. A quorum of SPs could then continue to operate successfully without the knowledge that the minority is advertising a new update. This new update would become available only to a small subset of the Akamai nodes that can reach the minority subset, possibly causing a discord in the Akamai network with respect to the latest updates.

VE is based on earlier ideas of vector clocks introduced by Fidge [10] and Mattern [24]. Section 8 compares the Acceptance Algorithm with Two-Phase Commit and other agreement schemes used in common distributed systems.

3.3 An Example

We give an example to demonstrate both phases of the Acceptance Algorithm. Imagine that our system contains five Storage Points named A, B, C, D, and E, with SP D down temporarily for a software upgrade. With five SPs the quorum required for the Acceptance Algorithm is three SPs.

SP A receives a submission update from publisher P for configuration file “foo”. To use the example from section 3.1, SP A stores the file under a temporary UID: foo.A.1234.

SP A initiates the replication phase by sending the file in parallel to as many SPs as it can reach. SPs B, C, and E store the temporary update under the UID name. (SP D is down and does not respond.) SPs B and C happen to be the first SPs to acknowledge the reception of the file and the MD5 hash check. Now A knows that the majority (A, B, and C) have stored the file, and A is ready to initiate the agreement phase.

SP A broadcasts the following VE message to the other SPs:

foo.A.1234 A:1 B:0 C:0 D:0 E:0


This message contains the UID of the pending update and the vector that has only A’s bit set. (A stores this vector state persistently on disk prior to sending it out.)

When SP B receives this message it adds its bit to thevector, stores the vector, and broadcasts it:

foo.A.1234 A:1 B:1 C:0 D:0 E:0

After a couple of rounds, all four live SPs store the following message, with all bits set except for D’s:

foo.A.1234 A:1 B:1 C:1 D:0 E:1

At this point, as each SP sees that the majority of bits is set, A, B, C, and E upload the temporary file in place of the permanent configuration file foo, and store in their local database the UID of the latest agreed-upon version of file foo: foo.A.1234. All older records of foo can be discarded.

3.4 Guarantees

We now show that our Acceptance Algorithm satisfies the acceptance and correctness requirements, provided that our quorum assumption holds.

3.4.1 Acceptance Guarantee

Having introduced the quorum-based scheme, we now restate the acceptance guarantee more precisely than in section 1.1. The acceptance guarantee states that if the Accepting SP has accepted a submission, it will be uploaded by a quorum of SPs.

Proof: The Accepting SP accepts only when the update has been replicated to a quorum AND when the Accepting SP can see a majority of bits set in the VE vector. Now if the Accepting SP can see a majority of bits set in the VE vector, it means that at least a majority of the SPs have stored a partially filled VE vector during the agreement phase. Therefore, any future quorum will include at least one SP that stores the VE vector for this update. Once such a SP is part of a quorum, after a few re-broadcast rounds, all of the SPs in this future quorum will have their bits set. Therefore, all the SPs in the latter quorum will decide to upload.

So, based on our assumption that a quorum of connected SPs can be reasonably maintained, acceptance by ACMS implies a future decision by at least a quorum to upload the update.

The converse of the acceptance guarantee does not necessarily hold. If the quorum decides to upload, it does not mean that the Accepting SP will accept. As stated earlier, the Accepting SP may be “cut off” from the quorum after the VE phase is initiated, but before it completes. In that case the Accepting SP replies with “Possible Accept,” because acceptance is likely but not definite. The publishing application treats this reply as “Reject” and tries to re-submit to another SP.

The probability of a “Possible Accept” is very small, and we have never seen it occur in the real system. The reason is that in order for the VE phase to be initiated, the replication phase must succeed. If replication is successful, it most likely means that the lighter VE phase, which also requires connectivity to a quorum (but less bandwidth), will also succeed. If the replication phase fails, ACMS replies with a definite “Reject.”

3.4.2 Correctness

The Correctness requirements state that ACMS provides a unique ordering of all update versions for a given configuration file AND that the system synchronizes to the latest submitted update. We later relaxed that guarantee to state that ACMS allows limited re-ordering in deciding which update is the latest, due to clock skews. More precisely, accepted updates for the same file submitted at least 2T + 1 seconds apart will be ordered correctly. T is the maximum allowed clock skew between any two communicating SPs.

The unique ordering of submitted updates is guaranteed by the UID assigned to a submission as soon as it is received by ACMS (regardless of whether it will be accepted). The UID contains both a UTC timestamp from the SP’s clock and the SP’s name. Submissions for the same configuration file are first ordered by time and then by the Accepting SP name. So “foo.B.1234” is considered to be more recent than “foo.A.1234”, and it is kept as the later version. A Storage Point accepts only one update per second for a given configuration file.
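This ordering rule reduces to a two-level sort key; a small sketch (`uid_key` is our helper name, not from the paper):

```python
def uid_key(uid):
    """Order versions of one file by (UTC timestamp, Accepting SP name)."""
    _filename, sp, ts = uid.rsplit(".", 2)
    return (int(ts), sp)

# Same second: the SP name breaks the tie, so foo.B.1234 is the later version.
assert uid_key("foo.B.1234") > uid_key("foo.A.1234")
# A later timestamp always wins regardless of SP name.
assert uid_key("foo.A.1235") > uid_key("foo.B.1234")

print(max(["foo.A.1233", "foo.B.1234", "foo.A.1234"], key=uid_key))
# prints foo.B.1234
```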

Since we do not use logical or synchronized clocks, slight clock skews and reordering of updates are possible. We now explain how we bound such reordering, and why any small reordering is acceptable in ACMS.

We bound the possible skew between any two communicating SPs by T seconds (where T is usually set to 20 seconds). Our communication protocols enforce this bound by rejecting liveness messages from SPs that are at least T seconds apart. (I.e., such pairs of servers appear virtually dead to each other.) As a result, it follows that no two SPs that accept updates for the same file can have a clock skew of more than 2T seconds.

Proof: Imagine SPs A and B that are both able to accept updates. This means both A and B are able to replicate these updates to a majority of SPs. These majorities must overlap by at least one SP. Moreover, neither A nor B can have more than a T-second clock skew from that SP. So A and B cannot be more than 2T seconds apart.
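A brute-force numeric check of this bound, with T = 20 as in the text (the integer-second enumeration is our simplification):

```python
T = 20  # liveness messages are rejected when clocks are >= T seconds apart

# Every skew-from-S that still passes the liveness check is strictly below T.
allowed_skews = range(-(T - 1), T)

# If A and B each stay within T of the shared SP S, the triangle inequality
# bounds their mutual skew by 2T.
max_pairwise = max(abs(a - b) for a in allowed_skews for b in allowed_skews)
print(max_pairwise, 2 * T)  # 38 40: the pairwise skew never exceeds 2T
```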

Developers of the Akamai subsystems that submit configuration files to Akamai nodes via ACMS are advised to avoid mis-ordering by submitting updates to the same configuration file at intervals of at least 2T + 1 seconds. In addition, we use NTP [3] to synchronize our server clocks, and in practice we find very rare instances when our servers are more than one second apart.

Finally, with ACMS it is actually acceptable to reorder updates within a small bound such as 2T. We are not dealing with competing editors of a distributed filesystem. Subsystems that are involved in configuring a large CDN such as Akamai must and do cooperate with each other. In fact, we considered two cases of such subsystems that update the same configuration file. Either there is only one process that submits updates for file “foo”, or there are redundant processes that submit the same or idempotent updates for file “foo”. In the case of a single publishing process, it can easily abide by the 2T rule and therefore avoid reordering. In the case of redundant writers – which exist for fault-tolerance – we do not care whose update within the 2T period is submitted first, as these updates are idempotent. Any more complex distributed systems that publish to ACMS use leader election to select a publishing process, effectively reducing these systems to one-publisher systems.

3.5 Termination and Message Complexity

In almost all cases VE terminates after the last SP in a quorum broadcasts its addition to the VE vector. However, in the unlikely event that a SP becomes partitioned off during a VE phase, it attempts to broadcast its vector state once every few seconds. This way, once it reconnects to a quorum, it can notify the other SPs of its partial state.

VE is not expensive, and the number of messages exchanged is quite small. We make a small change to the protocol as it was originally described by adding a small random delay (under 1 second) before a re-broadcast of the changed vector by a SP. This way, instead of all SPs re-broadcasting in parallel, only one SP broadcasts at a time. With the random delay, on average each SP will only broadcast once after setting its bit. This results in O(n²) unicast messages.

We use the gossip model because the number of participants and the size of the messages are both small. The protocol can easily be amended to have only the Accepting SP do a re-broadcast after it collects the replies. Only when an SP does not hear the re-broadcast does it switch to gossip mode. When the Accepting SP stays connected until termination, the number of messages exchanged is just O(n).
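Back-of-envelope counts for the two variants, assuming n SPs and no failures. The exact coordinator accounting (one broadcast, n−1 replies, one final re-broadcast) is our reading of the text; only the asymptotic orders come from the paper:

```python
def gossip_messages(n):
    """Each SP re-broadcasts its updated vector once to the n-1 others."""
    return n * (n - 1)  # O(n^2) unicast messages

def coordinator_messages(n):
    """Accepting SP broadcasts, peers reply, Accepting SP re-broadcasts."""
    return 3 * (n - 1)  # O(n) unicast messages

print(gossip_messages(5), coordinator_messages(5))  # 20 vs. 12 for five SPs
```

At n = 5 the gap is modest, which is consistent with the paper's choice of the simpler gossip model.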

3.6 Maintenance

Software or OS upgrades performed on individual Storage Points must be coordinated to prevent an outage of a quorum. Such upgrades are scheduled independently on individual Storage Points so that the remaining system still contains a connected quorum.

Adding and removing machines with quorum-based systems is a theoretically tricky problem. Rambo [19] is an example of a quorum-based system that solves dynamic set configuration changes by having an old quorum agree on a new configuration.

Since adding or removing SPs is extremely rare, we chose not to complicate the system to allow dynamic configuration changes. Instead, we halt the system temporarily by disallowing accepts of new updates, change the set configuration on all machines, wait for a new quorum to sync up on all state (via the Recovery algorithm), and allow all SPs to resume operation. Replacing a dead SP is a simpler procedure where we bring up a new SP with the same SP ID as the old one and clean state.

3.7 Flexibility of the VE Quorum

ACMS' quorum is configured as a majority. Just like in the Paxos [16] algorithm, this choice guarantees that any future quorum will necessarily intersect with an earlier one and all previously accepted submissions can be recovered. However, this definition is quite flexible in VE and allows for consistency vs. availability trade-offs. For example, one could define a quorum to be just a couple of SPs, which would offer loose consistency but much higher availability. Since there is a new VE instance for each submission, one could potentially configure a different quorum for each file. If desired, this property can be used to add or remove SPs by reconfiguring each SP independently, resulting in a very slight and temporary shift toward consistency over availability.

4 Recovery via Index Merging

Recovery is an important mechanism that allows all Storage Points that experience downtime or a network outage to "sync up" the latest configuration updates.

Our Acceptance Algorithm guarantees that at least a quorum of SPs stores each update. Some Akamai nodes may only be able to reach a subset of the SPs that were not part of the quorum that stored the update. Even if that subset intersects with the quorum, such an Akamai node may need to retry multiple downloads before reaching a SP that stores the update. To increase the number of Akamai nodes that can get their updates and improve download efficiency, preferably all SPs should store all state.

In order to "sync up" any missed updates, Storage Points continuously run a background recovery protocol with one another. The downloadable configuration files are represented on the SPs in the form of an index tree.


The recovery protocol is called Index Merging. The SPs "merge" their index trees to pick up any missed updates from one another.

The Download Points also need to "sync up" state. These machines do not participate in the Acceptance Algorithm and instead rely entirely on the recovery protocol on the Storage Points to pick up all state.

4.1 The Index Tree

For a concise representation of the configuration files, we organize the files into a tree. The configuration files are split into groups. A Group Index file lists the UIDs of the latest agreed-upon updates for each file in the group. The Root Index file lists all Group Index files together with the latest modification timestamps of those indexes. The top two layers (i.e., the Root and the Group indexes) completely describe the latest UIDs of all configuration files and together are known as the snapshot of the SP.

Each SP can modify its snapshot when it learns of a quorum agreement through the Acceptance Algorithm or by seeing a more recent UID in a snapshot of another SP.

Since a quorum of SPs should together have a complete state, for full recovery each SP needs only to merge in a snapshot from Q − 1 other SPs (where Q = majority). (Download Points need to merge in state from Q SPs.)

The configuration files are assigned to groups statically when a new configuration file is provisioned on ACMS. A group usually contains a logical set of files subscribed to by a set of related receiving applications.
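As a rough illustration of the two-layer tree just described, a snapshot can be modeled as a pair of plain dictionaries. The group names, file names, and UID format below are invented for the example and are not the ACMS on-disk format:

```python
# Root index: maps each group index to its last-modified timestamp.
# Group indexes: map file names to the latest agreed-upon UIDs.
# Together (root + group indexes) they form the SP's snapshot.
snapshot = {
    "root.index": {"webserver.group": 1700000123, "dns.group": 1700000101},
    "webserver.group": {"foo.conf": "uid-0042", "bar.conf": "uid-0017"},
    "dns.group": {"zones.conf": "uid-0031"},
}

def latest_uid(snapshot, group, filename):
    """Resolve a file's latest agreed-upon UID through the tree."""
    assert group in snapshot["root.index"]  # the root lists every group index
    return snapshot[group][filename]
```

A Receiver interested only in "webserver.group" need never touch "dns.group", which is the point of the grouping.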

4.2 The Index Merging Algorithm

At each round of the Index Merging Algorithm a SP A picks a random set of Q − 1 other SPs and downloads and parses the index files from those SPs. If it detects a more recent UID of a configuration file, SP A updates its own snapshot and attempts to download the missing file from one of its peers. Note that it is safe for A to update its snapshot before obtaining the file: since the UID is present in another SP's snapshot, the file has already been agreed upon and stored by a quorum.

To avoid frequent parsing of one another's index files, the SPs remember the timestamps of one another's index trees and make HTTP IMS (if-modified-since) requests. If an index file has not changed, HTTP 304 (not-modified) is returned on the download attempt.

Index Merging rounds run continuously.
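A single merging round can be sketched as follows. This is a hedged illustration only: we represent a snapshot as a dict of group indexes, and encode "more recent" as a numeric suffix on the UID, which is our assumption for the example, not the paper's UID format:

```python
import random

def newer(uid_a, uid_b):
    """Compare UIDs by a numeric suffix, e.g. 'uid-0042' > 'uid-0017'."""
    return int(uid_a.split("-")[1]) > int(uid_b.split("-")[1])

def merge_round(my_snapshot, peer_snapshots, q):
    """Merge group indexes from Q-1 random peers into our snapshot.
    Returns the set of (group, file) entries whose contents we must
    now download from a peer."""
    to_fetch = set()
    for peer in random.sample(peer_snapshots, q - 1):
        for group, files in peer.items():
            for fname, uid in files.items():
                mine = my_snapshot.setdefault(group, {})
                if fname not in mine or newer(uid, mine[fname]):
                    # Safe to adopt the UID before fetching the file:
                    # it already exists on a quorum.
                    mine[fname] = uid
                    to_fetch.add((group, fname))
    return to_fetch
```

Adopting the UID first and fetching the bytes later is what makes the round idempotent: repeating it against the same peers changes nothing.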

4.3 Snapshots for Receivers

As a side effect, the snapshots also provide an efficient way for Receivers to learn of the latest configuration file versions. Typically receivers are only interested in the subset of the index tree that describes their subscriptions. Receivers also download index files from the SPs via HTTP IMS requests.

Using HTTP IMS is efficient but also problematic, because each SP generates its own snapshot and assigns its own timestamps to the index files that it uploads. Thus it is possible for a SP A to generate an index file with a more recent timestamp than SP B, but less recent information. If a Receiver is unlucky and downloads the index file from A first, it will not download an index with a lower timestamp from B until B's timestamp increases. It may take a while for it to get all the necessary changes.

There are two solutions to this problem. In one solution we could require a Receiver to download an index tree independently from each SP, or at least a quorum of the SPs. Having each Receiver download multiple index trees is an unnecessary waste of bandwidth. Furthermore, requiring each Receiver to be able to reach a quorum of SPs reduces system availability. Ideally, we only require that a Receiver be able to reach one SP that is itself part of a quorum.

We implemented an alternative solution, where the SPs merge their index timestamps, not just the data listed in those indexes.

4.4 Index Time-stamping Rules

With just a couple of simple rules that constrain how Storage Points assign timestamps to their index files, we can present a coherent snapshot view to the Receivers:

1. If a Storage Point A has an index file bar.index with a timestamp T, and then A learns of new information inside bar.index (either through Vector Exchange agreement or Index Merging from a peer), then on the next iteration A must upload a new bar.index with a timestamp of at least T + 1.

2. If Storage Points A and B have an index file bar.index that contains identical information with timestamps Ta and Tb respectively, where Ta > Tb, then on the next iteration B must upload bar.index with a timestamp at least as great as Ta.

Simply put, rule 1 says that when a Storage Point includes new information it must increase the timestamp. This rule is really redundant: a new timestamp would be assigned anyway when a Storage Point writes a new file. Rule 2 says that a Storage Point should always raise its index's timestamp to the highest timestamp for that index among its peers (even if the index includes no new information).

Once a Storage Point modifies a group index it must modify the Root Index as well, following the same rules. (The same would apply to a hierarchy with more layers.)
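A small simulation (our own sketch, not ACMS code) of the two rules makes the intended guarantee concrete: after new content appears, every SP's published timestamp exceeds anything a Receiver could have seen before the change.

```python
def step(states):
    """One iteration of rule 2: every SP catches up to the highest peer
    timestamp among SPs holding the same content version."""
    max_ts = {}  # highest timestamp per content version
    for version, ts in states:
        max_ts[version] = max(max_ts.get(version, ts), ts)
    return [(version, max_ts[version]) for version, ts in states]

# Three SPs agree on content version 1 with assorted timestamps.
states = [(1, 5), (1, 7), (1, 7)]
old_max = max(ts for _, ts in states)   # 7: the most a Receiver can have seen

# New content (version 2) arrives; rule 1 bumps each SP's own timestamp
# as it incorporates the new version.
states = [(2, ts + 1) for _, ts in states]   # [(2, 6), (2, 8), (2, 8)]
states = step(states)                        # rule 2: everyone catches up to 8
assert all(ts >= old_max + 1 for _, ts in states)
```

After one rule-1 bump and one rule-2 catch-up pass, no SP serves the index with a timestamp a Receiver could mistake for already-seen.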


We now show the correctness of this timestamping algorithm.

4.5 Timestamping Correctness

Guarantee: If a Receiver downloads bar.index (the index file for group bar) with a timestamp T1 from any Storage Point, then when new information in group bar becomes available, all Storage Points will publish bar.index with a timestamp of at least T1 + 1, so that the Receiver will quickly pick up the change.

Proof: Assume in steady state a set of k Storage Points 1...k each has a bar.index with timestamps T1, T2, ..., Tk sorted in non-decreasing order (i.e., Tk is the highest timestamp). When new information becomes available, then following rule 1 above, Storage Point k will incorporate the new information and increase its timestamp to at least Tk + 1. On the next iteration, following rule 2, SPs 1...k − 1 will make their timestamps at least Tk + 1 as well. Before the change, the highest timestamp for bar.index known to a Receiver was Tk. A couple of iterations after the new information becomes incorporated, the lowest timestamp available on any Storage Point is Tk + 1. Thus, a Receiver will be able to detect an increase in the timestamp and pick up the new index quickly.

5 Data Delivery

In addition to providing high fault-tolerance and availability, the system must scale to support download by thousands of Akamai servers. We naturally use the Akamai CDN (Content Distribution Network), which is optimized for file download. In this section we describe the Receiver process, its use of the hierarchical index data, and the use of the Akamai CDN itself.

5.1 Receiver Process

Receivers run on each of over 15,000 Akamai nodes and check for message updates on behalf of the local subscribers.

A Subscription for a configuration file specifies the location of that file in the index tree: the root index, the group index that includes that file, and the file name itself. Receivers combine all local subscriptions into a subscription tree. (This is a subtree of the whole tree stored by the SPs.)

A Receiver checks for updates to the subscription tree by making HTTP IMS requests recursively, beginning at the Root Index. If the Root Index has changed, the Receiver parses the file and checks whether any intermediate indexes that are also in the Receiver's subscription tree have been updated (i.e., whether they are listed with a higher timestamp than previously downloaded by that Receiver). If so, it stores the timestamp listed for that index as the "target timestamp," and keeps making IMS requests until it downloads the index that is at least as recent as the target timestamp. Finally it parses that index and checks whether any files in its subscription tree (that belong to this index) have been updated. If so, the Receiver then tries to download a changed file until it gets one at least as recent as the target timestamp.

There are a few reasons why a Receiver may need to attempt multiple IMS requests before it gets a file with a target timestamp. First, some Storage Points may be a bit behind with Index Merging and not contain the latest files. Second, an old file may be cached by the Akamai network for a short while. The Receiver retries its downloads frequently until it gets the required file. Once the Receiver downloads the latest update for a subscription, it places the data in a file on local disk and points a local subscriber to it.
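The retry loop can be sketched as below; `fetch` stands in for an HTTP IMS request against some SP or CDN cache and is purely illustrative, not the Receiver's actual interface:

```python
def download_at_least(fetch, path, target_ts, max_tries=10):
    """Retry until we obtain a copy of `path` at least as recent as the
    target timestamp listed in the parent index. `fetch(path)` returns a
    (timestamp, body) pair and may serve stale data from a lagging SP
    or a cache."""
    for _ in range(max_tries):
        ts, body = fetch(path)
        if ts >= target_ts:
            return body
    raise TimeoutError(f"{path}: no copy newer than {target_ts} yet")
```

For example, a fetch that first hits a stale cache and then a fresh SP succeeds on the second try:

```python
responses = iter([(3, "old contents"), (5, "new contents")])
body = download_at_least(lambda path: next(responses), "bar.index", target_ts=5)
```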

The Receiver must know how to find the SPs. The Domain Name Service provides a natural mechanism to distribute the list of SPs' and DPs' addresses.

5.2 Optimized Download

The Akamai network's support for HTTP download is a natural fit to be leveraged by ACMS for message propagation. Since the indexes and the configuration files are requested by many machines on the network, these files benefit greatly from the caching capabilities of the Akamai network.

First, Receivers running on colocated nodes are likely to request the same files, which makes it likely that the request is served from a neighboring cache in the local Akamai cluster. Furthermore, if the request leaves the cluster it will be directed to other nearby Akamai clusters, which are also likely to have a response cached. Finally, if the file is not cached in another nearby Akamai cluster, the request goes through to one of the Storage Points. These cascading Akamai caches greatly reduce the network bandwidth required for message distribution and make pull-down propagation the ideal choice.

The trade-off of this high cacheability is increased propagation delay of the messages. The longer the file is served out of cache, the longer it takes for the Akamai system to refresh cached copies. Since we are more concerned here with efficient rather than very fast delivery, we set a long cache TTL on the ACMS files, for example, 30 seconds.

As mentioned in section 2, we augment the list of SPs with a set of a few Download Points. Download Points provide an elegant way to alleviate the bandwidth requirements on the SPs. As a result, the replication and recovery algorithms on the SPs experience less competition with download bandwidth.


6 Operational Experience

The design of ACMS has been an iterative process between implementation and field experience, where our assumptions of persistent storage, network connectivity, and OS/software fault-tolerance were tested.

6.1 Earlier Front-End Versions

Our prototype version of ACMS consisted of a single primary Accepting Storage Point replicating submissions to a few secondary Storage Points. Whenever the Accepting SP would lose connectivity to some of the Storage Points or experience a software or hardware malfunction, the entire system would halt. It quickly became imperative to design a system that did not rely entirely on any single machine. We also considered a solution of using a set of auto-replicating databases. We encountered two problems. First, commercial databases would prove unnecessarily expensive, as we would have to acquire licenses to match the number of customers using ACMS. More importantly, we required consistency. At the time we did not find database software that would deal with various Byzantine network failures. Although some academic systems were emerging that in theory did promise the right level of wide-area fault-tolerance, we required a professional, field-tested system that we could easily tune to our needs. Based on our study of Paxos [16] and BFS [17] we designed a simpler version of decentralized quorum-based techniques. Similar to Paxos and BFS, our algorithm requires a quorum. However, there is no leader to enforce strict ordering in VE, as bounded re-ordering is permitted with non-competing configuration applications.

6.2 Persistent Storage Assumption

Storage Points rely on persistent disk storage to store configuration files, snapshots, and temporary VE vectors. Most hard disks are highly reliable, but the guarantees are not absolute. Data may get corrupted, especially on systems with high levels of I/O. Moreover, if the operating system crashes before an OS buffer is flushed to disk, the result of the write may be lost.

After experiencing a few file corruptions, we adopted the technique of writing out an MD5 hash together with the file's contents before declaring a successful write. The hash is checked on opening the file. A Storage Point which detects a corrupted file will refuse to communicate with its peers and require an engineer's attention. Over the period of six months ending in February 2005, the NOCC [2] monitoring system recorded 3 instances of such file corruption on ACMS.

Since ACMS runs automatic recovery routines, replacing damaged or old hardware on ACMS is trivial. The SP process running on a clean disk quickly recovers all of the ACMS state from other SPs via Index Merging.

6.3 Connected Quorum Assumption

The assumption of a connected quorum turned out to be a very good one. Nonetheless, network partitions do occur, and the quorum requirement of our system does play its role. For the first 9 months of 2004, the NOCC monitoring system recorded 36 instances where a Storage Point did not have connectivity to a quorum due to network outages that lasted for more than 10 minutes. However, in all of those instances there was an operating quorum of other SPs that continued to accept submissions.

Brief network outages on the Internet are also common, although they generally do not result in a SP losing connectivity to a quorum. For example, a closer analysis of ACMS logs over a 6 day period revealed two short outages within the same hour between a pair of SPs located in different Tier-1 networks. They lasted for 8 and 2 minutes respectively. Such outages emphasize the necessity for an ACMS-like design to provide uninterrupted service.

6.4 Lessons Learned

As we anticipated, redundancy has been important in all aspects of our system. Placing the SPs in distinct networks has protected ACMS from individual network failures. Redundancy of multiple replicas helped ACMS cope with disk corruption and data loss on individual SPs.

Even the protocols used by ACMS are in some sense redundant. The continuous recovery scheme (i.e., Index Merging) helps the Storage Points recover updates that they may miss during the initial replication and agreement phases of the Acceptance Algorithm. In fact, in some initial deployments Index Merging helped ACMS overcome some communication software glitches in the Acceptance Algorithm.

The back-end of ACMS also benefited from redundancy. Receivers begin their download attempt from nearby Akamai nodes, but can fail over to higher layers of the Akamai network if needed. This approach allows Receivers to cope with downed servers on their download path.

Despite the redundant and self-healing design, sometimes human intervention is required. We rely heavily on the Akamai error reporting infrastructure and the operations of the NOCC to prevent critical failures of ACMS. Detection of and response to secondary failures, such as individual SP corruption or downtime, helps decrease the probability of full quorum failures.

7 Evaluation

To evaluate the effectiveness of the system we gathered data from the live ACMS system accepting and delivering configuration updates on the actual Akamai network.

7.1 Submission and Propagation

First we looked at the workload of the ACMS front-end over a 48 hour period in the middle of a work week. There were 14,276 total file submissions on the system with five operating Storage Points. The table below lists the distribution of the file sizes. Submissions of smaller files (under 100KB) were dominant, but files on the order of 50MB also appear about 3% of the time.

size range   avg file sz   distribution   avg. time (s)
0K-1K        290           40%            0.61
1K-10K       3K            26%            0.63
10K-100K     22K           23%            0.72
100K-1M      167K          7%             2.23
1M-10M       1.8M          1%             13.63
10M-100M     51M           3%             199.87

The last column of the table shows the average submission time for various file sizes. We evaluated the "submission" time by measuring the period from the time an Accepting SP is first contacted by a publishing application until it replies with "Accept." The submission time includes the replication and agreement phases of the Acceptance Algorithm. The agreement phase for all files takes 50 milliseconds on average. For files under 100KB, all "submission" times are under one second. However, with larger files, replication begins to dominate. For example, for 50MB files, the time is around 200 seconds. Even though our SPs are located in Tier 1 networks, they all share replication bandwidth with the download bandwidth from the Receivers. In addition, replication for multiple submissions and multiple peers is performed in parallel.

We also measured the total update propagation time, from when many configuration updates were first made available for download through receipt on the live Akamai network, for a random sampling of 250 Akamai nodes. Figure 2 shows the distribution of update propagation times. The average propagation time is approximately 55 seconds. Most of the delay comes from Receiver polling intervals and caching.

Figure 2: Propagation time distribution for a large number of configuration updates delivered to a sampling of thousands of machines.

Figure 3 examines the effect of file size on propagation time. We analyzed the mean and 95th-percentile delivery time for each submission in the test period. 99.95% of updates arrived within three minutes. The remaining 0.05% were delayed due to temporary network connectivity issues; the files were delivered promptly after connectivity was restored. These delivery times meet our objectives of distributing files within several minutes. The figure shows a high propagation time for especially small files. Although one would expect the propagation time to increase monotonically with file size, CDN caching slows down files submitted more frequently. We believe that many smaller files are updated frequently on ACMS. As a result, the caching TTL of the CDN is more heavily reflected in their propagation delay.

The use of caching reduces bandwidth on the Storage Points anywhere from 90% to 99%, increasing in general with system activity and with the size of the file being pushed, allowing large updates to be propagated to tens of thousands of machines without significant impact on Storage Point traffic.

Finally, to analyze general connectivity and the tail of the propagation distribution, we looked at the propagation of short files (under 20KB) to another random sample of 300 machines over a 4 day period. We found that 99.8% of the time a file was received within 2 minutes of becoming available, and 99.96% of the time it was received within 4 minutes.

7.2 Scalability

We analyzed the overhead of the Acceptance Algorithm and its effect on the scalability of the front-end. Over a recent 6 day period we recorded 43,504 successful file submissions with an average file size of 121KB. In a system with 5 SPs, the Accepting SP needs to replicate data to 4 other SPs, requiring 484 KBytes per file on average. The size of a VE message is roughly 100 bytes. With n(n − 1) VE messages exchanged per submission, VE uses 2 KB per file, or 0.4% of the replication bandwidth.
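These back-of-the-envelope numbers are easy to reproduce; the function below simply restates the arithmetic in the text:

```python
def ve_overhead(n_sps, avg_file_kb, ve_msg_bytes=100):
    """Per-file replication bandwidth, VE bandwidth, and their ratio."""
    replication_kb = (n_sps - 1) * avg_file_kb            # Accepting SP -> peers
    ve_kb = n_sps * (n_sps - 1) * ve_msg_bytes / 1024.0   # n(n-1) VE messages
    return replication_kb, ve_kb, ve_kb / replication_kb

rep, ve, ratio = ve_overhead(5, 121)    # 484 KB, ~2 KB, ~0.4%
rep15, ve15, ratio15 = ve_overhead(15, 121)  # ~1.7 MB, ~21 KB, ~1.2%
```

The 15-SP figures match the extended calculation discussed below: replication cost grows linearly with the number of SPs, while VE cost grows quadratically yet remains a small fraction of the total.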


Figure 3: Propagation times for various size files. The dashed line shows the average time for each file to propagate to 95% of its recipients. The solid line shows the average propagation time.

For our purposes we chose 5 SPs, so that during a software upgrade of one machine the system could tolerate one failure and still maintain a majority quorum of 3. Extending the calculation to 15 SPs, for example, with an average file size of 121 KB, the system would require 1.7 MB for replication and 21KB for VE. The VE overhead becomes 1.2%, which is higher but not significant.

Such a system is conceivable if one chooses not to rely on a CDN for efficient propagation, but instead to offer more download sites (SPs). The VE overhead can be further reduced as described in section 3.5. However, the minimum bandwidth required to replicate the data to all 15 machines may grow to be prohibitive. In such a system one could still allow each server to maintain all indexes, but split the actual storage into subsets based on some hashing function such as Consistent Hashing [4].
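A minimal consistent-hashing sketch (our own illustration of the technique of [4], not part of ACMS) shows how file storage could be split across SPs while each SP still keeps all indexes; when an SP is removed, only the files it owned move:

```python
import hashlib
from bisect import bisect_right

def h(key):
    """Hash a string to a point on the ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def make_ring(sps, vnodes=64):
    """Place several virtual points per SP on the hash ring."""
    return sorted((h(f"{sp}#{i}"), sp) for sp in sps for i in range(vnodes))

def owner(ring, filename):
    """A file is stored by the first SP clockwise from the file's hash."""
    points = [p for p, _ in ring]
    idx = bisect_right(points, h(filename)) % len(ring)
    return ring[idx][1]

sps = ["sp1", "sp2", "sp3", "sp4", "sp5"]
ring = make_ring(sps)
```

The defining property is stability: removing one SP from the ring leaves the owner of every other SP's files unchanged, so a reconfiguration migrates only 1/n of the stored files on average.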

For ACMS, choosing the Akamai CDN itself for propagation is the natural choice. The cacheability of the system grows as the CDN penetrates more ISP networks, and the system scales naturally with its own growth. Also, as the CDN grows, the reachability of Receivers inside more remote ISPs improves.

8 Related Work

8.1 Fault Tolerant Replication

Many distributed filesystems, such as Coda [20], Pangaea [21], and Bayou [22], store files across multiple replicas similar to the Storage Points of ACMS. Similar to ACMS' Index Merging, these filesystems run recovery algorithms that synchronize the data among replicas, such as Bayou's anti-entropy algorithm. However, all of these systems attempt to improve the availability of data at the expense of consistency. The aim is to allow file operations for clients on a set of disconnected machines. ACMS, on the other hand, must provide a very high level of consistency across the Akamai network and cannot allow a single SP to accept and upload a new update independently.

The two-phase Acceptance Algorithm used by ACMS is similar in nature to Two-Phase Commit [12]. Two-phase commit also separates a transaction phase from a commit phase, but its failure modes make it more suitable to a local environment.

The Vector Exchange (the agreement phase of our algorithm) was inspired by the concept of vector clocks introduced by Fidge [10] and Mattern [24], which are used to determine causality of events in a distributed system. Bayou also uses vectors to represent the latest known commit sequence numbers for each server. In our algorithm, the vectors' contents are simply bits, since each message only has two interesting states: known to a server or not. Each subsequent agreement is a separate "instance" of the protocol.

VE uses a quorum-based scheme similar to Paxos [16] and BFS [17]. Paxos defines a quorum as a strict majority, while BFS defines it as "more than 2/3." VE allows the "quorum" to be configurable as long as it is at least a majority. All these algorithms consider Byzantine failures and rely on persistent storage by a quorum to enable a later quorum to recover state. This strong property precludes scenarios, allowed by a simpler two-phase commit protocol, in which a minority of partitioned replicas commits a transaction. Other quorum systems include weighted voting [11] and hierarchical quorum consensus [15].

At the same time, VE is simpler than Paxos and BFS and does not implement full Byzantine fault-tolerance. It does not require an auxiliary protocol to determine a leader or a primary as in Paxos or BFS, respectively. This relaxation stems from the nature of ACMS applications, where only a single writer or redundant writers exist for each file and thus some bounded reordering is permissible, as explained in section 3.4.2. No leader is enforcing ordering.

OceanStore [31] is an example of a storage system that implements Byzantine fault-tolerance to have replicas agree on the order of updates that originate from different sources. ACMS, on the other hand, must complete "agreement" at the time of an update submission. This is primarily due to an important aspect of the Akamai network: an application that publishes a new configuration file must know that the system has agreed to upload and propagate the new update. (Otherwise it will keep retrying.)

8.2 Data Propagation

Similar to multicast [9], ACMS is designed to deliver data to many widely dispersed nodes in a way that conserves bandwidth. While ACMS takes advantage of the Akamai Network's optimizations for hierarchical file caching, multicast uses the proximity of network IP addresses to send fewer IP packets. However, due to the lack of more intelligent routing infrastructure between major networks on the Internet, it is virtually impossible to multicast data across these networks.

To bypass the Internet routing shortcomings, many application-level multicast schemes based on overlay networks were proposed: CAN-Multicast [27], Bayeux [34], and Scribe [29], among others [14] [7]. These systems leverage the communication topologies of P2P overlays such as CAN [26], Chord [30], Pastry [28], and Tapestry [33]. Unlike ACMS, these systems create a propagation tree for each new source of the multicast, incurring an overhead. As shown in [5], using these systems for multicast is not always efficient. In our system, on the other hand, once the data is injected into ACMS, it is available for download from any Storage or Download Point, and propagates down the tree from these distinct well-connected sources. The effect of the overlay networks used in reliable multicasting networks [23], [6] is replaced by cooperating caches in our system.

ACMS is similar to Message-Oriented Middleware (MOM) in that it provides persistent storage and asynchronous delivery of updates to subscribers that may be temporarily unavailable. Common MOMs include Sun's JMS [32], IBM's MQSeries [13], Microsoft's MSMQ [25], and the like. These systems usually contain a server that persists the messaging "queue," which helps deal with crash recovery but does create a single point of failure. The distributed model of ACMS storage, on the other hand, helps it tolerate multiple failures or partitions.

8.3 Software Updates

Finally, we compare the complete ACMS with existing software update systems. LCFG [35] and Novadigm [36] create systems to manage desktops and PDAs across an enterprise. While these systems scale to thousands of servers, they usually span a single enterprise network or a few of them. ACMS, on the other hand, delivers updates across multiple networks for critical customer-facing applications. As a result, ACMS focuses on highly fault-tolerant storage and efficient propagation.

Systems that deliver software, like Windows Update [37], target a much larger set of machines than found in the Akamai network. However, polling intervals for such updates are not as critical. Some Windows users take days to activate their updates, while each Akamai node is responsible for serving requests to tens of thousands of users and thus must synchronize to the latest updates very efficiently. Moreover, systems such as Windows Update use a rigorous, centralized process to push out new updates. ACMS accepts submissions from dynamic publishers dispersed throughout the Akamai network. Thus, highly fault-tolerant, available, and consistent storage of updates is required.

9 Conclusion

In this paper we have presented the Akamai Configuration Management System, which successfully manages configuration updates for the Akamai network of 15,000+ nodes. Through the use of simple quorum-based algorithms (Vector Exchange and Index Merging), ACMS provides highly available, distributed, and fault-tolerant management of configuration updates. Although these algorithms are based on earlier ideas, they were particularly adapted to suit a configuration publishing environment and provide a high level of consistency and easy recovery for ACMS' Storage Points. These schemes offer much flexibility and may be useful in other distributed systems.

Just like ACMS, any other management system could benefit from using a CDN such as Akamai's to propagate updates. First, a CDN managed by a third party offers a convenient overlay that can span thousands of networks effectively. A solution such as multicast requires much management and simply does not scale across different ISPs. Second, a CDN's caching and reach will allow the system to scale to hundreds of thousands of nodes and beyond.

Most importantly, we have presented valuable lessons learned from our operational experience. Redundancy of machines, networks, and even algorithms helps a distributed system such as ACMS cope with network and machine failures, and even human errors. Despite the 36 network failures recorded in the last 9 months that affected some ACMS Storage Points, the system continued to operate successfully. Finally, active monitoring of any critical distributed system is invaluable. We relied heavily on the NOCC infrastructure to maintain a high level of fault-tolerance.

Acknowledgements

We would like to thank William Weihl, Chris Joerg, and John Dilley, among many other Akamai engineers, for their advice and suggestions during the design. We want to thank Gong Ke Shen for her role as a developer on this project. We would like to thank Professor Jason Nieh for his motivation and advice with the paper. Finally, we want to thank all of the reviewers, and especially our NSDI shepherd Jeff Mogul, for their valuable comments.

References

[1] Akamai Technologies, Inc., http://www.akamai.com/.

[2] Network Operations Command Center,http://www.akamai.com/en/html/technology/nocc.html.

[3] http://www.ntp.org/.

[4] D.Karger, E.Lehman, T.Leighton, M.Levine, D.Lewinand R.Panigrahy, Consistent hashing and random trees:Distributed caching protocols for relieving hot spots onthe World Wide Web. In Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Comput-ing, pages 654-663 , 1997.

[5] M. Castro, M. B. Jones, A-M Kermarrec, A. Rowstron,M. Theimer, H. Wang, and A. Wolman, An Evaluationof Scalable Application-level Multicast Built Using Peer-to-Peer Overlays, in Proc. INFOCOM, 2003.

[6] Y. Chawathe, S. McCanne, and E. Brewer, RMX: Reli-able Multicast for Heterogeneous Networks, Proc. of IN-FOCOM, March 2000, pp. 795–804.

[7] Y. H. Chu, S. G. Rao, and H. Zhang, A case for endsystem multicast, Proc. of ACM Sigmetrics, June 2000,pp. 1–12.

[8] S. B. Davidson, H. Garcia-Molina, and D. Skeen, Con-sistency in partitioned networks, ACM Comput. Surveys,1985.

[9] S. E. Deering and D. R. Cheriton, Host extensions forIP multicasting, Technical Report RFC 1112, NetworkWorking Group, August 1989.

[10] C. J. Fidge, Timestamp in Message Passing Systems thatPreserves Partial Ordering, Proc. 11th Australian Com-puting Conf., 1988, pp. 56–66.

[11] D. K. Gifford, Weighted Voting for Replicated Data, Pro-ceedings 7th ACM Symposium on Operating Systems,1979.

[12] J. Gray, Notes on database operating systems, OperatingSystems: An Advanced Course. pp. 394–481, 1978.

[13] IBM Corporation, WebSphere MQ family,http://www-306.ibm.com/software/integration/mqfamily/.

[14] J. Jannotti, D. K. Gifford, K. L. Johnson, F. Kaashoek,and J. W. O’Toole, Overcast: Reliable Multicastingwith an Overlay Network, Proc. of OSDI, October 2000,pp. 197–212.

[15] A. Kumar, Hierarchical quorum consensus: A new algo-rithm for managing replicated data, IEEE Trans. Com-puters, 1991.

[16] L. Lamport, The Part-time Parliament, ACM Transac-tions in Computer Systems, May, 1998.

[17] M. Castro, B. Liskov, Practical Byzantine Fault-Tolerance, OSDI 1999.

[18] L. Lamport, R. Shostak, and M. Pease, The ByzantineGenerals Problem, ACM Transactions on ProgrammingLanguages and Systems, July 1982.

[19] N. Lynch and A. Shvartsman, RAMBO: A ReconfigurableAtomic Memory Service for Dynamic Networks, DISC,October 2002.

[20] M. Satyanarayanan, Scalable, Secure, and Highly Avail-able Distributed File Access, IEEE Computer, May 1990.

[21] S. Saito, C. Karamanolis, M. Karlsson, M. Mahalingam,Taming Aggressive Replication in the Pangaea Wide-areaFile System, OSDI 2002.

[22] K. Peterson, M. Spreitzer, D. Terry, Flexible UpdatePropagation for Weakly Consistent Replication, SOSP,1997.

[23] S. Paul, K. Sabnani, J. C. Lin, and S. Bhattacharyya, Reli-able Multicast Transport Protocol (RMTP), IEEE Journalon Selected Areas in Communications, April 1997.

[24] F. Mattern, Virtual Time and Global States of DistributedSystems, Proc. Parallel and Distributed Algorithms Conf.,Elsevier Science, 1988.

[25] Microsoft Corporation, Microsoft Message Queuing(MSMQ) Center,http://www.microsoft.com/windows2000/technologies/communications/msmq/default.asp.

[26] S. Rantmasamy, P. Francis, M. Handley, R. Karp, andS. Shenker, A Scalable Content-Addressable Network,Proc. of ACM SIGCOMM, August 2001.

[27] S. Ratnasamy, M. Handley, R. Karp and S. Shenker,Application-level multicast using content-addressablenetworks, Proc. of NGC, November 2001.

[28] A. Rowstron and P. Drischel, Pastry: Scalable, dis-tributed object location and routing for large-scale peer-to-peer systems, Proc of Middleware, November 2001.

[29] A. Rowstron, A. M. Kermarrec, M. Castro and P. Dr-uschel, Scribe: The design of a large-scale event noti-fication infra, structure in Proc of NGC, Nov 2001.

[30] I. Stoica, R. Morris, D. Karger, F. Kaashoek, and H. Bal-akrishnan, Chord: A scalable peer-to-peer lookup ser-vice for internet applications, in Proc of ACM SIG-COMM, August 2001.

[31] J. Kubiatowicz, D. Bindel, Y. Chen, S. Czerwinski,P. Eaton, D. Geels, R. Gummadi, S. Rhea, H. Weath-erspoon, W. Weimer, C. Wells, B. Zhao, OceanStore:an architecture for global-scale persistent storage, AS-PLOS 2000, November 2000.

[32] Sun Microsystems, Java Message Service,http://java.sun.com/products/jms.

[33] B. Zhao, J. Kubiatowicz and A. Joseph, Tapestry: Aninfrastructure for fault-resilient wide-area location androuting, U. C. Berkeley, Tech. Rep. April 2001.

[34] S. Zhuang, B. Zhao, A. Joseph, R. Katz, and J. Kubia-towicz, Bayeux: An Architecture for Scalable and Fault-Tolerant Wide-Area Data Dissemination, Proc. of NOSS-DAV, June 2001.

[35] http://www.lcfg.org/.

[36] http://www2.novadigm.com/hpworld/.

[37] Microsoft Windows Updatehttp://windowsupdate.microsoft.com.