
arXiv:cs/0406005v2 [cs.OS] 3 Jun 2004

A Microrebootable System — Design, Implementation, and Evaluation

George Candea, Shinichi Kawamoto, Yuichi Fujiki, Greg Friedman, Armando Fox
{candea,skawamo,fjk,gjf,fox}@cs.stanford.edu

Abstract

A significant fraction of software failures in large scale Internet systems are cured by rebooting, even when the exact failure causes are unknown. However, rebooting can be expensive, causing nontrivial service disruption or downtime even when clusters and failover are employed. In this work we use separation of process recovery from data recovery to enable frequent use of the microreboot, a fine grain recovery mechanism that restarts only suspected-faulty application components without disturbing the rest.

We evaluate this recovery approach on an eBay-like Internet auction application running on our microreboot-enabled application server. We find that microreboots recover from most of the same failures as full reboots, but do so an order of magnitude faster, resulting in an order of magnitude savings in lost work. Unlike full reboots, microreboot-based recovery is sufficiently inexpensive to be employed at the first sign of failure, even when mistakes in failure detection are likely. The cost of our microreboot-enabling modifications is a reduction of less than 1% in failure-free steady-state throughput.

1. Introduction

In spite of ever-improving development processes and tools, all production-quality software still has bugs; most of these are difficult to track down and resolve, taking the form of Heisenbugs, race conditions, resource leaks, and environment-dependent bugs [13, 34]. When these bugs strike in live systems, they can result in prolonged outages [17, 30]. Faced with failures, operators do not have time to run sophisticated diagnosis, but rather need to bring the system back up immediately; compounding this, up to 80% of software problems are due to bugs for which no fix is available at the time of failure [40]. The results of several studies [36, 17, 31, 11] as well as experience in the field [3, 33, 21] suggest that many failures can be recovered by rebooting even if their root causes are unknown. Consequently, the state of the art in achieving high availability in today's Internet clusters involves circumventing a failed node through failover, rebooting the failed node, and subsequently reintegrating the recovered node into the cluster.

Reboots provide a high-confidence way to reclaim stale or leaked resources, they do not rely on the correct functioning of the rebooted system, they are easy to implement and automate, and they return the software to its start state, which is often its best understood and best tested state. Unfortunately, in some systems, unexpected reboots can result in data loss and unpredictable recovery times. This occurs most frequently when the software lacks clean separation between data recovery and process recovery. For example, performance optimizations such as buffer caches open a window of vulnerability during which allegedly-persistent data is stored only in volatile memory that does not survive a crash; rebooting this system would restart the process, but would not recover the data in the buffers. Attempting to combine data and application recovery in the same code paths is difficult, and often falls to relatively inexperienced developers. In the face of demands for ever-increasing feature sets, application-specific recovery code that is both bug-free and efficient will likely be an increasingly elusive goal.

In this paper we separate process recovery from data recovery to enable component-level "microreboots" (µRB). We describe both the general conditions necessary for µRBs to be safe and a specific testbed on which we conducted fault injection to test the ability of µRBs to recover from a variety of failures. We show how using well-isolated, stateless components can make a system amenable to microrebooting. To ensure correctness, all important application state is segregated into specialized state stores, such as databases and dedicated session state stores, thus completely separating data recovery from (µRB-based) application recovery.

Our contribution is to show that:

• Microreboots achieve many of the same benefits in µRB-able systems as full reboots, but an order of magnitude more quickly and with an order of magnitude savings in lost work.

• Due to their low cost, microreboots can always be attempted as a first-line-of-defense recovery mechanism, even when failure detection is prone to false positives or when the failure is not known to be µRB-curable. If a µRB does not recover the system, but some other subsequent recovery mechanism does, the recovery time added by the µRB attempt is negligible.

• µRB-enabling modifications in our prototype Internet service brought about a performance overhead of less than 1% of original failure-free throughput.

The rest of the paper is structured as follows: Section 2 describes the general design requirements for microrebootability, and Section 3 describes our prototype implementation. Sections 4 and 5 experimentally evaluate the prototype's recovery properties using a realistic workload and fault injection, to determine whether µRBs can recover from full-reboot-curable failures and whether there is a resulting benefit in application availability. We also examine the benefits microreboots offer over existing cluster-failover techniques. Section 6 discusses limitations of µRBs, and then presents a roadmap for generalizing our approach beyond the implemented prototype. Section 7 presents related work, and Section 8 concludes.

2. Design Overview

Workloads faced by Internet services often consist of many relatively short tasks, rather than long-running ones. This affords the opportunity for recovery by reboot, because losing in-progress work typically represents a small fraction of requests served in a day. We therefore set out to optimize Internet-like systems for frequent, fine-grained rebooting. We strive to make microreboots so cheap that they are always tried as a first line of defense whenever a failure is suspected. To this end, we have three design goals:

• Fast and correct recovery of application components

• Strongly-localized recovery, with no impact on the rest of the system

• Fast and correct reintegration of recovered components

Here we discuss how to achieve these goals; the next section describes the realization of these goals in our prototype.

Component-level reboot time is determined by how long it takes for the system to restart the component and for the component to reinitialize. A microrebootable application therefore aims for components that are as small (in terms of program logic) as possible.

To ensure recovery correctness, we must prevent µRBs from inducing corruption or inconsistency in the application's persistent state. The inventors of transactional databases recognized that segregating recovery of persistent data from application logic can improve the recoverability of both the application and the data that must persist across failures. We take this idea further and require that µRB-able applications keep all important state in specialized state stores located outside the application, behind well-defined APIs. Examples of such state stores include databases and session state stores [23]. The complete separation of component recovery from data recovery makes unannounced microreboots safe. The burden of data management is shifted from the often-inexperienced application writers to the specialists who develop state stores.

For an application to gracefully tolerate the µRB of a component, coupling between components must be very loose: components in a µRB-able application have well-defined, enforced boundaries; direct references, such as pointers, may not span these boundaries. Indirect, µRB-safe references can be maintained outside the components, either by a state store or by the application platform.

This design approach embodies time-proven techniques for robust programming of distributed systems; we apply these techniques at finer levels of granularity within applications. In the following section we discuss the implementation of a platform for µRB-able applications that incorporates these design principles.

3. A Platform for Microreboots

We added microreboot capabilities to JBoss [19], a popular open-source application server written in Java. Its support for the enterprise edition of Java (J2EE) [37] allowed us to exploit the features of this programming framework. J2EE is increasingly used for critical Internet-connected applications, claiming, for example, 40% of today's enterprise application market [1]. The changes we made to JBoss can universally benefit all J2EE applications running on JBoss.

3.1 The J2EE Component Framework

A common design pattern for Internet applications is the three-tiered architecture: the presentation tier consists of stateless Web servers, the application tier runs the application per se, and the persistence tier stores long-term data in one or more databases. J2EE is a framework designed to simplify developing applications for this three-tiered model.

J2EE applications consist of portable components, called Enterprise Java Beans (EJBs), together with server-specific XML deployment descriptors. A J2EE application server uses the deployment information to instantiate an application's EJBs inside management containers; there is one container per EJB object, and it manages all instances of that EJB. The server-managed container provides a rich set of services: thread pooling and lifecycle management, client session management, database connection pooling, transaction management, security and access control, etc.

End users interact with a J2EE application through a Web interface, the application's presentation tier. This consists of servlets and Java Server Pages (JSPs) hosted in a Web server; they invoke methods on the EJBs and then format the returned results for presentation to the end user. Invoked EJBs can call on other EJBs, interact with the back-end databases, invoke other Web services, etc.

An EJB is similar to an event handler, in that it does not constitute a separate locus of control—a single Java thread shepherds a user request through multiple EJBs, from the point it enters the application tier until it returns to the Web tier. EJBs satisfy to a good extent the requirements outlined for µRB-able components in Section 2; our further modifications and extensions are described in the rest of this section.

3.2 Microreboot Machinery

We added a microreboot method to the JBoss EJB container that can be invoked from within the application server, or by an administrator through a Web-based management interface. Since we modified the JBoss container, microreboots can now be performed on any J2EE application (however, this is safe only if the application conforms to the guidelines of Sections 3.3 and 3.4). The microreboot method destroys all extant instances of the EJB and associated threads, releases all associated resources, discards server metadata maintained on behalf of the EJB, and then reinstantiates the EJB. This fixes many problems such as EJB-private variables being corrupted, EJB-caused memory leaks, or the inability of one EJB to call another because its reference to the callee has become stale.
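A minimal sketch of the shape of this operation at the container level; class and method names below are illustrative placeholders of ours, not actual JBoss internals:

    // Hypothetical sketch of a container-level microreboot; not JBoss's actual code.
    public final class MicrorebootableContainer {
        private final ClassLoader ejbClassLoader;  // deliberately preserved across microreboots
        private Object ejbMetadata;                // per-EJB server metadata

        public MicrorebootableContainer(ClassLoader cl) {
            this.ejbClassLoader = cl;
        }

        // Destroy instances and threads, release resources, discard metadata,
        // then reinstantiate the EJB -- the four steps described above.
        public synchronized void microreboot() {
            destroyInstancesAndThreads();
            releaseResources();
            ejbMetadata = null;                  // discard server metadata
            ejbMetadata = reinstantiateEjb();    // reload via the preserved classloader
        }

        private void destroyInstancesAndThreads() { /* drop all extant bean instances */ }
        private void releaseResources()           { /* return pooled connections, locks, ... */ }
        private Object reinstantiateEjb()         { return new Object(); /* placeholder */ }
    }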

The only server metadata we do not discard on µRB is the component's classloader. JBoss uses a separate classloader for each EJB to provide appropriate sandboxing between components; when a caller invokes an EJB method, the caller's thread switches to the EJB's classloader. A Java class' identity is determined both by its name and the classloader responsible for loading it; discarding an EJB's classloader upon µRB would have (unnecessarily) complicated the update of internal references to the µRB-ed component. Keeping the classloader active does not violate any of the sandboxing properties. Preserving classloaders does mean that EJB static variables are not reinitialized upon µRB, but J2EE strongly discourages the use of mutable static variables anyway (to simplify replication of EJBs in clusters).

3.3 State Segregation

Internet applications, like the ones we would expect to run on JBoss, typically handle three types of important state: long-term data that must persist for years (such as customer information), session data that needs to persist for the duration of a user session (e.g., shopping carts or workflow state in enterprise applications), and virtually read-only data (static images, HTML, JSPs, etc.). We keep these kinds of state in a database, a session state store, and an Ext3FS read-only filesystem, respectively.

Persistent state: There are three types of EJB: (a) entity beans, which map each bean instance's state to a row in a database table, (b) session EJBs, which are used to perform temporary operations (stateless session beans) or represent session objects (stateful session beans), and (c) message-driven EJBs (not of interest to this work). EJBs may interact with a database directly and issue SQL commands, or indirectly via an entity EJB. In µRB-able applications we require that only stateless session beans and entity beans be used; this is consistent with best practices for building scalable EJB applications [6]. The entity beans must make use of Container-Managed Persistence (CMP), a J2EE mechanism that delegates management of entity data to the EJB's container. CMP provides relatively transparent data persistence, relieving the programmer from the burden of managing this data directly or writing SQL code to interact with a database. Our prototype application, described in Section 4.1, conforms to these requirements.

Session state must persist on the application server for long enough to synthesize a user session from independent stateless HTTP requests, but can be discarded when the user logs out or the session times out. Typically, this state is maintained in the application server and is named by a cookie accompanying incoming HTTP requests. To ensure the session state survives both µRBs and full reboots, we externalize session state into a modified version of SSM, a session state store [23]. SSM's storage model is based on leases, so orphaned session state is eventually garbage-collected automatically. Many commercial application servers forgo this separation and store session state in local memory only, in which case a server crash or EJB µRB would cause the corresponding user sessions to be lost. In Section 5.5 we compare the cost of externalizing session state to the benefit of being able to preserve sessions across both µRBs and full reboots.
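The lease-based storage model can be sketched in a few lines; this is our own toy illustration of the idea, not SSM's implementation:

    import java.util.concurrent.ConcurrentHashMap;

    // Toy lease-based session store: entries expire unless rewritten, so state
    // orphaned by a crashed or µRB-ed component is reclaimed automatically.
    public final class LeasedSessionStore {
        private static final class Entry {
            final byte[] value; final long expiresAt;
            Entry(byte[] v, long e) { value = v; expiresAt = e; }
        }
        private final ConcurrentHashMap<String, Entry> map = new ConcurrentHashMap<>();

        public void write(String sessionId, byte[] state, long leaseMillis) {
            map.put(sessionId, new Entry(state, System.currentTimeMillis() + leaseMillis));
        }

        public byte[] read(String sessionId) {
            Entry e = map.get(sessionId);
            if (e == null || e.expiresAt < System.currentTimeMillis()) {
                map.remove(sessionId);  // expired lease: garbage-collect the entry
                return null;
            }
            return e.value;
        }
    }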

The segregation of state offers some level of recovery containment, since data shared across components by means of a state store does not require that the components be recovered together. Externalized state also helps to quickly reintegrate recovered components, because they do not need to perform data recovery following a µRB.

3.4 Containment and Reintegration

Further containment of recovery is obtained through compiler-enforced interfaces and type safety. EJBs cannot name each others' internal variables, nor can they use mutable static variables. While the latter is not enforced by the compiler, J2EE documents warn against the use of static variables and recommend instead the use of singleton EJB classes, whose state is accessed through standard accessor/mutator methods. EJBs can obtain references to each other in order to make inter-EJB method calls; references are obtained from a naming service (JNDI) provided by the application server, and may be cached once obtained. The inter-EJB calls themselves are also mediated by the application server via the containers, to abstract away the details of remote invocation (if the application server is running on a cluster) or replication (if the application server has replicated a particular EJB for performance or load balancing reasons).

Since EJBs may maintain references to other EJBs, µRB-ing a particular EJB causes those references to become stale. To remedy this, whenever an EJB is µRB-ed, we also µRB the transitive closure of its inter-EJB references. This ensures that when a reference goes out of scope, the referent disappears as well. While static or dynamic analysis could be used to determine this closure, we use the simpler method of determining groups statically by examining deployment descriptors, which are typically generated for the J2EE application by the development environment. The reference information is used by the application server to determine in what order to deploy the EJBs.
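As an illustration, the reboot group can be computed as a simple graph traversal over the reference information extracted from the deployment descriptors; this sketch is ours, not the prototype's code:

    import java.util.*;

    public final class RebootGroups {
        // refs maps each EJB to the EJBs it holds references to, as declared
        // in the deployment descriptors.
        public static Set<String> group(String root, Map<String, Set<String>> refs) {
            Set<String> closure = new LinkedHashSet<>();
            Deque<String> todo = new ArrayDeque<>();
            todo.push(root);
            while (!todo.isEmpty()) {
                String ejb = todo.pop();
                if (closure.add(ejb)) {
                    for (String r : refs.getOrDefault(ejb, Collections.emptySet())) {
                        todo.push(r);
                    }
                }
            }
            return closure;  // µRB-ing root implies µRB-ing every EJB in this set
        }
    }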

The time to reintegrate a µRB-ed component is determined by the amount of initialization it performs at startup and the time it takes for other components to recognize the newly-instantiated EJB. Initialization dominates reintegration time; in our prototype it takes on the order of hundreds of milliseconds, but varies considerably by component, as will be seen in Table 2. The time required to destroy and re-establish EJB metadata in the application server is negligible. Making the EJB known to other components happens through the JNDI naming service described earlier; this level of indirection ensures immediate reintegration once the component is initialized.

In the prototype described here, µRBs are considerably less disruptive than full reboots. First, recovery and reintegration are faster: the actions required to µRB an EJB take hundreds of milliseconds, whereas hot-redeploying the entire application (analogous to a warm full reboot) takes almost 12 seconds, while restarting the application server process (analogous to a cold full reboot) takes 52 seconds. Second, since we can selectively µRB only those EJBs suspected of causing an observed failure, unaffected EJBs can continue to serve requests from other users. As will be shown in Section 5.1, only a fraction of active users are making requests that require the EJBs in question, so isolated recovery results in higher overall system availability.


4. Evaluation Framework

To evaluate our prototype, we developed a client emulator, a fault injection framework, and an automated failure detection, diagnosis, and recovery system. Within this framework, we ran experiments using our modified version of JBoss and a J2EE application.

4.1 Test Application

Although the open-source JBoss application server hosts production applications in many companies, we have found companies unwilling to provide us with the applications themselves. We therefore evaluated our technique on a modified version of RUBiS [6] (Rice University Bidding System), a J2EE/Web-based auction system that mimics eBay's functionality. RUBiS maintains user accounts, allows bidding on, selling, and buying of items, has search facilities, customized information summary screens, user feedback pages, etc. RUBiS consists of 26K lines of non-comment source code in 582 Java files. We extended RUBiS to maintain server-side session state that enables users to log on and preserve their session information across interactions with the Web site. Since RUBiS was written to study different design strategies for J2EE applications, a few different implementations are provided; the implementation we use consists entirely of stateless session EJBs and entity EJBs.

The structure of RUBiS is typical of J2EE applications, in that there is a separate session EJB implementing each user operation and interfacing with entity beans. For example, there is a "place a bid on item X" EJB and a "view bid history for item X" EJB; the two session EJBs interact with the entity EJB implementing the "bid entity," which maintains bid information in the database. This illustrates why the EJB is a natural unit of recovery. Any unit smaller than an EJB would hinder independent recovery, because there would be too many dependencies to take into account within the EJB boundary.

Long-term data in RUBiS consists of user account information, item information, bid/buy/sell activity, etc., and is maintained in a MySQL database through 9 entity EJBs: IDManager, User, Item, Bid, Buy, Category, OldItem, Region, and UserFeedback. MySQL is crash-safe and recovers fast for our datasets (132K items, 1.5M bids, 100K users). Read-only presentation data, such as static HTML and GIF images, are stored in a journaling Ext3FS read-only filesystem accessed only by the stateless Web server. Session data in RUBiS takes the form of items that a user buys/sells/bids on during her session; we store this state in an extended version of SSM [23]. Session state in SSM is leased, rather than permanently allocated, which means that the application need not worry about reclamation.

4.2 Fault Injection

We measured end-user-perceived system availability in the face of failures caused by faults we injected, with the recognition that no fault injection experiment can claim to be complete or to accurately reflect real-life faults. As mentioned in Section 1, our work focuses exclusively on failures that can be cured with some form of a reboot. Despite J2EE's popularity as a commercial infrastructure, we were unable to find any published systematic studies of faults occurring in production J2EE systems, so we relied on advice from colleagues in industry who routinely work with enterprise applications or application servers [13, 34, 28, 12, 33, 21]. These discussions helped us conclude that J2EE systems suffer from the following categories of software-related failures:

• accidental use of null references (e.g., during exception handling) that result in NullPointerException

• hung threads due to deadlocks, interminable waits, etc.

• bug-induced corruption of volatile metadata

• leak-induced resource exhaustion

• various other Java exceptions and errors that are not handled correctly

We aimed to reproduce these problems in our system by adding facilities for runtime fault injection: in our experiments we can (a) set component class variables to null, (b) directly induce deadlock conditions and infinite loops in EJBs, (c) alter global volatile metadata, such as garble entries in the JNDI naming service's database, (d) leak a certain amount of memory per call, and (e) intercept calls to EJBs and, instead of passing the call through to the component, throw an exception/error of choice.
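Mechanism (e), for instance, fits naturally in an interception layer; a minimal sketch using a Java dynamic proxy (our illustration, not the prototype's actual interceptor):

    import java.lang.reflect.Proxy;

    public final class FaultInjector {
        // Wrap an EJB's business interface so every call throws the chosen
        // exception instead of reaching the component.
        @SuppressWarnings("unchecked")
        public static <T> T failCalls(Class<T> iface, RuntimeException fault) {
            return (T) Proxy.newProxyInstance(
                iface.getClassLoader(), new Class<?>[] { iface },
                (proxy, method, args) -> { throw fault; });
        }
    }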

We were concerned that our fault injection mechanism may not provide sufficient coverage of realistic faults visible to the J2EE applications. We therefore used both FIG [4] and FAUmachine [5] to inject hundreds of faults underneath our HotSpot Java virtual machine layer: memory and register bit flips, disk block errors, network packet drops, and erroneous returns from system calls for memory allocation and input/output. In all our test cases the outcome was either a Java exception (a condition we can simulate with mechanism (e) above), a JVM crash (which requires JVM restart), a resource leak (e.g., because free() or close() failed), or erroneous data (which we explicitly do not address in this work). These results are consistent with similar findings in other systems [7]. We therefore considered it sufficient to study our system's behavior under our injected faults, corresponding to application-level bugs.

4.3 Client Emulation

In order to emulate realistic clients, we extended and used the load generator that ships with RUBiS. It takes a description of the workload for emulated clients in the form of a state transition table T, with the client's states as rows and columns. These states correspond naturally to the various operations possible in RUBiS, such as Register, SearchItemsInCategory, AboutMe, etc. (29 in total). A table cell T(s, s′) represents the probability of a client transitioning from state s to state s′; e.g., T(ViewItem, BuyNow) describes the probability we associate with an end user clicking on the "Buy Now" button after viewing an item's description.

The emulator uses T to automatically navigate the RUBiS web site: when in s, it chooses a next state s′ with probability T(s, s′), constructs the URL representing s′ and does an HTTP GET for the given URL. In between successive clicks, emulated clients have a think time based on a random distribution with average 7 seconds and maximum 70 seconds, as done in the TPC-W benchmark [35]. The emulator uses a separate thread for each emulated client. In choosing the workload for our tests, we mimic the real workload seen by a major Internet auction site [39]; our workload is described in Table 1.

User operation results mostly in...                        Fraction of workload
Read-only DB access (e.g., ViewItem, ViewBidHistory)                32%
Creation/deletion of session state (e.g., Login, Logout)            23%
Exclusively static HTML content (e.g., home page)                   12%
Search (e.g., SearchItemsInCategory)                                12%
Forms that update session state (e.g., MakeBid, BuyNow)             11%
DB updates (e.g., CommitBid, RegisterItem)                          10%

Table 1. Workload. All operations entail additional access to static GIF or HTML content, so 12% underestimates the amount of static content read.
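The emulator's navigation thus amounts to sampling a Markov chain; a minimal sketch (class and method names are ours, not RUBiS's load generator):

    import java.util.Map;
    import java.util.Random;

    public final class EmulatedClient {
        private final Random rng = new Random();
        // table.get(s).get(s') holds the transition probability T(s, s')
        private final Map<String, Map<String, Double>> table;

        public EmulatedClient(Map<String, Map<String, Double>> t) { table = t; }

        // Sample the next operation from the row T(s, .)
        public String nextState(String s) {
            double r = rng.nextDouble(), cumulative = 0.0;
            for (Map.Entry<String, Double> e : table.get(s).entrySet()) {
                cumulative += e.getValue();
                if (r < cumulative) return e.getKey();
            }
            return s;  // guard against rounding: stay in s if the row sums below 1.0
        }
    }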

In choosing the number of clients to emulate, we aimed to maximize system utilization while still getting good performance; for our system, this balance was reached at 350 concurrent clients. This is slightly more aggressive than current Internet services, which typically run their application server nodes at 50-60% utilization [25, 14]. We deployed our application server with an embedded Web/servlet tier on Athlon 2600XP machines with 1.5 GB of RAM; the MySQL database and SSM were hosted on Pentium 2.8 GHz nodes with 1 GB of RAM and 7200 rpm 120 GB hard drives. The client emulator ran on a 4-way P-III 550 MHz multiprocessor with 1 GB of RAM. All machines were interconnected by a 100 Mbps Ethernet switch and ran Linux kernel 2.4.22 with Java 1.4.1 and J2EE 1.3.1.

4.4 Failure Detection

To enable automatic recovery, we implemented failure detection in the client emulator and primitive diagnosis facilities on the server side (described in the next section). Of course, real end-user Web browsers do not automatically report failures to the Internet services they use. What our client-side detector mimics is WAN services that deploy "client-like" end-to-end monitors around the Internet that can detect a service's user-visible failures [20].

If a response to a request is not received within a certain amount of time, the client concludes the request has failed. Responses that do arrive can be correct or incorrect: a bad response is either (a) a network-level error, such as not being able to connect to the server, (b) an HTTP 4xx or 5xx error, (c) an HTML page containing particular keywords that we know to be indicative of application errors, (d) an unexpected Web page, such as a login prompt when the user is already logged in, or (e) certain application-specific errors, such as a Web page with a negative item ID. Searching for error, failed, and exception in the returned HTML successfully detects all error pages generated by RUBiS in our experiments, with no false positives or false negatives. We ensured that none of the simulated users sells an item whose description could match these patterns. Complex failures, such as an erroneous bid loss, surreptitious modifications of bid amounts, etc., require manual detection.
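A sketch of the status-code and keyword parts of this check (our illustration; the actual detector also handles cases (a), (d), and (e)):

    public final class ResponseChecker {
        private static final String[] ERROR_KEYWORDS = { "error", "failed", "exception" };

        // Classify a response as bad on HTTP 4xx/5xx or on known error keywords.
        public static boolean looksFailed(int httpStatus, String htmlBody) {
            if (httpStatus >= 400) return true;      // case (b)
            String lower = htmlBody.toLowerCase();
            for (String k : ERROR_KEYWORDS) {
                if (lower.contains(k)) return true;  // case (c)
            }
            return false;
        }
    }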

When the client detects a failure, it performs a number of retries, emulating a browser's response to an HTTP/1.1 Retry-After reply, or a user manually reloading the page. If success does not occur within the configured number of retries, a failure report is sent to the server-side recovery service. The client emulator only retries idempotent operations; we encoded idempotency information based on our knowledge of the application. The only non-idempotent operations are: "register new item for auction", "make bid on item", "buy item now", and "give feedback on user". Other operations, such as "search item by category" or "register new user", are retry-safe.
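The retry logic can be sketched as follows; the operation names in the non-idempotent set are hypothetical identifiers of ours, standing in for the four operations listed above:

    import java.util.Set;
    import java.util.concurrent.Callable;

    public final class RetryPolicy {
        // Hypothetical names for the non-idempotent operations; never retried.
        private static final Set<String> NON_IDEMPOTENT =
            Set.of("RegisterItem", "MakeBid", "BuyNow", "LeaveUserFeedback");

        public static <T> T call(String op, Callable<T> request, int maxRetries) throws Exception {
            int attempts = NON_IDEMPOTENT.contains(op) ? 1 : 1 + maxRetries;
            Exception last = null;
            for (int i = 0; i < attempts; i++) {
                try { return request.call(); } catch (Exception e) { last = e; }
            }
            throw last;  // retries exhausted: report the failure to the recovery service
        }
    }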

4.5 Diagnosis and Recovery

We added to JBoss a recovery service that performs very simple failure diagnosis and recovery. This service listens on a UDP port for failure reports from the monitors. A failure report contains the failed URL and the type of failure observed. Using static analysis, we derived a mapping from each RUBiS URL to a path/sequence of calls to servlets and EJBs. The recovery service maintains for each component in the system a score, which gets incremented every time the component is in the path originating at a failed URL. The recovery policy for stateless session EJBs is simple: if a score exceeds a configured threshold, the recovery service µRBs the corresponding EJB. Entity beans, however, are located at the intersection of several call paths, because multiple URLs (operations) use the same entity EJB. In this case, the recovery service will wait to get failure reports from a configured number of different URLs mapping to that entity EJB before deciding to µRB it.

We emphasize that accurate or sophisticated failure detection is a non-goal of this work: to the contrary, our simplistic approach to diagnosis often yields false positives, but our goal is to show that the mistakes resulting even from simple or "sloppy" diagnosis are tolerable because of the very low cost of µRBs. In light of our naive diagnosis, the results we report are conservative.

4.6 Measuring Action-Weighted Goodput

A simple approach to measuring the effect of downtime on end users would be to measure goodput (i.e., number of requests completed successfully) under partial-failure conditions, averaged across all clients. This is usually how performability [26] is measured, yielding the amount of work successfully completed over a period of time in the presence of failures. Unfortunately, this simple metric fails to distinguish between operations that are relatively slow (e.g., performing a buy operation) and those that are fast (e.g., accessing the static home page). The net effect is that, in the presence of certain failures in the application server, goodput actually goes up, because the simulated user no longer waits for long-running operations, which fail immediately. Consequently, the simulated client is able to perform many fast operations in a small amount of time, artificially inflating goodput and masking the fact that a real user would not get useful work done.

Another problem with the simple goodput metric is that it does not allow operations to be connected to each other. If a user searches for an item and then fails on the buy operation, the entire interaction has failed, as far as the user is concerned. In other words, the goodput metric fails to capture the fact that user interactions are actually sequences of correlated operations that need to succeed "atomically" for the user to be satisfied.

Figure 1. Full reboots vs. microreboots. [Two plots of Gaw (responses/second) vs. time (minutes) for 350 clients: left, FULL REBOOTS, 56028 OK / 3916 failed requests; right, MICROREBOOTS, 61527 OK / 425 failed.] We injected a null reference fault in SB CommitBid, then a corrupt JNDI database fault for User-Item, followed by a RuntimeException in SB BrowseCategories, and a Java Error in SB CommitUserFeedback. The left graph shows automatic recovery via full application reboot; the right graph shows recovery from the same faultload using µRBs. SB CommitBid took 387 msec to recover, User-Item took 1742 msec, SB BrowseCategories 450 msec, and SB CommitUserFeedback 404 msec. In reaction to the second failure, our recovery service's diagnosis included 6 false positives, resulting in 6 unnecessary µRBs, totaling 4765 msec. In spite of this, there are 89% fewer failed requests (425 vs. 3916) and 9% more successful requests (61527 vs. 56028) in the case of µRB-ing.

We therefore evaluated the availability of our prototype using a new metric, action-weighted goodput (Gaw). We view a user session as beginning with a login operation and ending with an explicit logout or abandonment of the site. Each session consists of a sequence of user actions. Each user action is a sequence of operations that culminates with a "commit point": an operation that must succeed for that user action to be considered successful. For example, an action may take the form of "search for a pair of Salomon ski boots and place a $200 bid on them"; this action is successful if and only if all operations (i.e., HTTP requests) within that action succeed; otherwise, the action fails as a unit.

Whenever an operation fails, all operations in the containing action are counted as failed; when a "commit point" succeeds, all operations in the containing action count as successful. Gaw accounts for the fact that both long-running and short-running operations must succeed for a user to be happy with the site. Gaw also captures the fact that, when an action with many operations succeeds, it often means that the user got more work done than in a short action.

In our auction application, we identified nine commit points: registering a new user, auctioning a new item, making a bid, performing a "buy now", leaving user feedback, viewing account information, logging out, spontaneously deciding to abandon the site, and clicking to the homepage after a sequence of browse operations (if the user browses and returns to the homepage then he/she has still accomplished useful work). A user action is aborted any time the client detects a failure and all retries have been exhausted.
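The accounting rule behind Gaw can be sketched as follows; this is our own illustration of the metric, not the measurement harness's code:

    public final class GawCounter {
        private int pendingOps;        // operations in the current, uncommitted action
        private long okOps, failedOps;

        // Record one successful operation within the current action.
        public void operationSucceeded() { pendingOps++; }

        // A failed operation (after retries) aborts the action: every operation
        // in the containing action retroactively counts as failed.
        public void operationFailed() {
            failedOps += pendingOps + 1;
            pendingOps = 0;
        }

        // Reaching a commit point makes all pending operations count as successful.
        public void commitPointReached() {
            okOps += pendingOps;
            pendingOps = 0;
        }

        public long ok()     { return okOps; }
        public long failed() { return failedOps; }
    }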

5. Evaluation Results

We evaluated five different aspects of our prototype. Section 5.1 finds that µRB-based recovery is an order of magnitude faster than full reboots and just as effective. Section 5.2 shows that combining µRBs with failover in clusters can reduce failed requests by 94% compared to full reboots. In Section 5.3 we show that µRB-based recovery combined with a false positive rate of 97% or less in failure detection achieves better levels of availability than full reboots with perfect detection. Section 5.4 shows that µRBs are as effective as full reboots in preventing resource-leak-induced failures, but do so at a fourth of the cost in lost work. Finally, in Section 5.5, we evaluate the performance impact of our microreboot-enabling changes and find less than 1% loss in average throughput and a human-imperceptible increase in average latency.

5.1 Recovering from Reboot-Curable Failures

To illustrate the end-user-visible differences between recovering with a full reboot vs. using µRBs, we injected a sequence of faults into the application and allowed the system to automatically recover from failure in two ways: by restarting RUBiS or by µRB-ing a component, respectively. We consider recovery to be successful when end users do not experience any more failures after recovery (unless new faults are deliberately injected). JBoss allows for a J2EE application to be hot-redeployed, and this constitutes the fastest way to do reboot-based recovery in unmodified JBoss; we use RUBiS application restart as a baseline, and call this a full reboot (FRB).

Figure 1 shows Gaw as measured by the emulated clients; each sample point represents the number of successful/failed requests observed during the corresponding 1-second interval. In choosing the locations for fault injection, we wanted to study the extremes of recovery time, as well as the spectrum of workload-induced effects. Hence, we injected the first fault in the component that takes the shortest time to recover, the second fault in the one that takes the longest, the third fault in the most frequently called component, and the fourth fault in the least frequently called one. Although our recovery service misdiagnosed 6 additional components as faulty after the second fault injection, µRBs still reduced the overall number of failed requests by an order of magnitude.

To understand the factors that contribute to this reduction, we directed our attention to the dips in Gaw. The impact of a failure and recovery action can be qualitatively estimated by the area of the corresponding dip in Gaw: the larger this area, the higher the disruption of the service. The dip's area is determined by recovery time, the time to return to pre-failure performance levels, and the number of requests that were turned away during recovery. We now look at each factor in isolation.

Effects of Small Recovery Units

The longer the service is unavailable, the wider the dip in Gaw: submitted requests cannot be served during recovery, and user actions started in the past fail as users unsuccessfully attempt operations during recovery, causing operations to retroactively be counted as failed. This explains why Gaw = 0 during each full reboot recovery. Table 2 shows comprehensive measurements for all the recovery units in our system; the first three rows are not EJBs and are shown for comparison. Most EJBs generally recover an order of magnitude faster than an application reboot.

Component                        Avg     Min     Max
JBoss app server restart         51,800  49,000  54,000
RUBiS application restart        11,679   7,890  19,225
Jetty (embedded Web server)       1,390   1,009   2,005
SB CommitBid                        286     237     520
SB BrowseCategories                 340     277     413
SB ViewUserInfo                     398     288     566
SB ViewItem                         465     284     977
SB RegisterUser                     502     292     681
SB CommitUserFeedback               509     316     854
SB SearchItemsByRegion              529     296     906
SB CommitBuyNow                     550     297   1,102
SB RegisterItem                     552     363     837
SB Auth                             554     317   1,143
SB BrowseRegions                    597     241     906
SB BuyNow                           603     303   1,417
SB ViewBidHistory                   623     317   2,058
SB AboutMe                          639     330   1,287
SB LeaveUserFeedback                680     314   1,275
SB MakeBid                          856     232   2,920
SB SearchItemsByCategory            911     488   3,019
IDManager                         1,059     663   1,547
UserFeedback                      1,248     761   1,591
BuyNow                            1,421     668   4,453
User-Item                         1,828     876   4,636

Table 2. Recovery times (in msec) for whole JBoss, RUBiS, the Web tier collocated with JBoss, and the various application components. Components prefixed by SB are stateless session EJBs, while the rest are entity EJBs. We ran 10 trials per component, under load from 350 concurrent clients.

As described in Section 3, some EJBs have dependencies that require them to be µRB-ed together. We grouped such inter-dependent EJBs together into common JAR files (Java ARchive) and µRB by JAR. The largest such group, User-Item, contains 5 entity EJBs: Category, Region, User, Item, and Bid—any time one of these EJBs requires a µRB, we µRB the entire User-Item JAR file. This is the longest-recovering component; it is possible though to restructure the application such that each EJB is independently µRB-able, but we have not performed these optimizations yet.

The time to µRB a component varies considerably; this is mainly because over 90% of µRB time is spent reinitializing the component, and the amount of initialization varies between EJBs. In the particular case of entity EJBs, initialization includes establishing connections to the database, since entity EJB state is automatically persisted in the database by the application server. Finally, we also believe time measurements at sub-second granularity are likely to be influenced by thread scheduling events. The time required to µRB multiple components (e.g., in reaction to a propagating failure) is theoretically additive, but, as we will see in subsequent sections, in practice µRB times are less than additive.

Effects of Rapid Reintegration

Even when a system is "back up" from a functional point of view, it takes a while to recover to the pre-failure performance level after a FRB: caches need to be warmed up, DB connections need to be reestablished, etc. Performance recovery time explains the width of the Gaw dip following the completion of functional recovery. In the experiment of Figure 1, it took more than half a minute for performance to be restored after FRB-based recovery. In the case of µRB-ing, due to the fast reintegration of the recovered component, the performance impact is lower.

To illustrate the performance-related Gaw dip more precisely, we zoom in on the first recovery of Figure 1 and show in Figure 2 the request response times measured at the end users during that period. The higher latency is a reflection of the system's poorer performance during this interval, combined with an overload situation caused by the clients piling up to resubmit requests following the recovery; the cumulative result is the lower throughput seen in Figure 1. Note that we disabled timeout-based failure detection for these experiments; with this detection turned on, latencies exceeding the threshold turn into end-user-visible failures and widen the dip in FRB Gaw.

Figure 2. Response time during recovery: FRB on the left, µRB on the right. [Two plots of response time (sec) vs. time (minutes, 3.5 to 5).] Some of the response times in the FRB case exceed 12 sec and are clipped. Although recovery completes by t = 4 min, system response time takes more than half a minute to recover to pre-failure levels in the case of FRB.

Several researchers [27, 2] found that response times exceeding 8 seconds cause computer users to get distracted from the task they were pursuing and engage in others; industry reports [41] cite 8-second response times as the typical threshold beyond which customers abandon e-commerce Web sites; Miller [27] found that response times below 2 seconds are sufficient to maintain a feeling of interactivity for most users. Since µRB-based recovery can keep response times below the 8-second threshold at all times (even below 2 seconds in our experiment), whereas FRB-based recovery does not, we would expect user experience to be improved in a system that uses µRB-based recovery.

Effects of Recovery Containment

Figure 1 shows that Gaw drops to zero during a FRB, meaning that the system serves no requests during that period of time. In the case of µRB-based recovery, however, Gaw never drops to zero. This indicates that failure and recovery are contained, allowing most of the system to continue serving requests while the faulty component is being recovered. To illustrate this effect, we grouped the various client operations into 4 functional groups: bid/buy/sell, search, browse/view, and login/logout. Figure 3 shows the end-user-perceived availability of each of these functionality groups.

A solid vertical line indicates that, at the corresponding point in time, an end user had an outstanding request belonging to that functionality group and the request eventually completed successfully. Hence, the end user was able to conclude that the service was "up", servicing its request. The absence of such a line indicates that the system was not processing a request in that category, either because it was down or because no such request was submitted.

When FRB was used, the entire application became unavailable during recovery, whereas in the case of µRB, most components continued serving requests while the faulty ones recovered. From a functionality perspective, the µRB is not qualitatively noticeable, as the recovering component (SB CommitBid) shares responsibility for its bid/buy/sell group with 5 other session EJBs and 4 entity EJBs; it is therefore unlikely that a significant fraction of end users will notice the brief outage.

Figure 3. Recovery containment leads to reduced functionality disruption. [Plot of the availability of the Login/Logout, Browse/View, Search, and Bid/Buy/Sell groups over time, titled "Service Functionality Availability (MICROREBOOTS)".] Graph shows end-user-perceived availability of the auction service's 4 functional groups, during the [3.5 min, 4.5 min] interval of the experiment in Figure 1, for both FRB and µRB recovery. Each shade of gray represents a different functional group. The solid continuity seen in all 4 groups immediately after FRB is due to increased response time (Figure 2): requests submitted right after recovery took a long time to process, but eventually did complete successfully.

5.2 Complementing Cluster Failover with µRB

In a cluster, the unit of rebootability is a full node, which is small relative to the cluster as a whole; this begs the question of whether µRBs can yield any benefit in such systems. To study this question, we built a cluster of 2 independent application server nodes. Clusters of 2-4 J2EE servers are typical in enterprise settings, with high-end financial and telecom applications running on 10-24 nodes [14]; a huge service like eBay runs on pools of clusters totaling 2000 application servers [10].

In front of the two nodes we place a load balancer LB. Under failure-free operation, LB distributes new incoming login requests evenly between the two nodes. For established sessions, LB implements server affinity, i.e., non-login requests are directed to the node on which the session was originally established. If we inject a fault in one of the server instances (Nfaulty), then the failure monitor detects the failure and reports it to the load balancer; LB redirects the load to the other node (Ngood) while Nfaulty is recovering by µRB-ing or FRB-ing. Once Nfaulty is back up, LB resumes routing requests to Nfaulty.
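A sketch of this routing policy; this is our simplification for illustration, with hypothetical node names, not the actual load balancer:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public final class AffinityLoadBalancer {
        private final String[] nodes = { "N0", "N1" };   // hypothetical node names
        private final Map<String, String> affinity = new ConcurrentHashMap<>();
        private volatile String faulty;                  // set by the failure monitor
        private int nextLogin;

        // Logins alternate between nodes; later requests follow session affinity.
        // While one node recovers, its traffic is temporarily redirected.
        public synchronized String route(String sessionId, boolean isLogin) {
            String node = isLogin
                ? affinity.computeIfAbsent(sessionId, k -> nodes[nextLogin++ % nodes.length])
                : affinity.getOrDefault(sessionId, nodes[0]);
            return node.equals(faulty) ? other(node) : node;
        }

        private String other(String n) { return n.equals(nodes[0]) ? nodes[1] : nodes[0]; }
        public void markFaulty(String node)    { faulty = node; }
        public void markRecovered(String node) { if (node.equals(faulty)) faulty = null; }
    }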

Reducing Impact of Session State Loss with µRB

In the first set of experiments, we used a simple failover policy: as soon as LB is notified of Nfaulty's failure, it routes all incoming requests to Ngood, regardless of session affinity. In the µRB case, we used our modified application server, while the FRB case used vanilla JBoss. For both experiments we used a version of RUBiS that uses the Web server's HttpSession mechanism for storing server-side session state—this is fairly usual. The volatile HttpSession objects offer high access performance but cannot survive FRBs. We ran a load of 700 clients against the cluster and allowed it to recover from a deadlock fault injected in the UserFeedback entity EJB (this component is invoked by several operations, including "leave feedback on user", "view my information", "view seller feedback", etc.). We set timeout-based failure detection to 8 sec, corresponding to the earlier cited threshold for end-user patience.

In the face of failure, µRBs offered the preservation of HttpSession objects, which survive µRBs but not FRBs, and faster recovery time: our µRB-able cluster failed 66% fewer requests than the FRB-recovering cluster (378 failures vs. 1144). Specifically, 60% of failures in the FRB case were caused by requests in sessions that were initially bound to Nfaulty but then were failed over to Ngood during Nfaulty's recovery—these sessions failed because their session state was not available. Of the remaining 40%, some failures were caused by Nfaulty-bound sessions that returned to Nfaulty after recovery and did not find their session state, while the rest failed because they were issued while Nfaulty was FRB-ing. In the case of µRB, virtually all failed requests were due to being routed to Ngood and not finding their HttpSession objects; fast recovery reduced the number of requests that had to fail over and hence notice the loss of session state.

Using µRB to Reduce Overload of Good Nodes

In order to factor out the effect of session state availability, in the second set of experiments we enabled both the FRB and the µRB cluster with external session state storage. We ran SSM-enabled RUBiS in both clusters, injected the same deadlock fault as before, and used the same failover policy. Figure 4 shows the resulting Gaw.

The main difference compared to the previous experiment is that now Ngood is able to process the failed-over requests, since their session state is in SSM. This ability, however, has a downside: whereas in the previous experiment requests within Nfaulty's sessions failed early when arriving at Ngood, now Ngood can become quickly overloaded during Nfaulty's recovery. As a result, when using FRB recovery, response time on Ngood increases beyond 8 seconds and eventually triggers Ngood's reboot shortly after Nfaulty started rebooting. This is why Gaw drops to zero in the left graph of Figure 4. In the µRB case, however, there is not enough time for Ngood to become overloaded, thus saving the cluster from a double reboot. The net effect is a 96% reduction in the failed requests when using µRBs.

Although both FRB nodes eventually recovered gracefully, in other similar experiments Nfaulty and Ngood oscillated between overload and FRB-based recovery, but µRBs never induced such behavior. We asked our colleagues in industry whether commercial application servers do admission control when overloaded, and were surprised to learn that they currently do not [25, 14]. For this reason, cluster operators need to significantly overprovision their clusters or use expert-tuned load balancers that can avert the overload and oscillation problems. We showed here that µRB-based fast recovery may provide a complement to both overprovisioning and sophisticated load balancing.


Figure 4. Regular failover with no loss of session state. [Two plots of Gaw (responses/second) over 13 minutes: left, FULL REBOOT, 73551 OK / 1556 failed requests; right, MICROREBOOT, 75923 OK / 57 failed.] During Nfaulty's FRB-recovery, all 700 clients fail over to Ngood, leading to temporary overload; this causes Ngood to stop serving requests for a short time interval, triggering Ngood's recovery. In the case of µRBs, overload occurs briefly but does not last long enough to impact end users. Timeout-based failure detection is set at 8 sec; in both cases we use SSM to store session state.

Avoiding Failover Altogether

To explore the possibility of not doing failover at all, we performed a third set of experiments comparing best-case scenarios for FRB and µRB without SSM. Given the same cluster setup, faultload, and workload, we changed the load balancing policy as follows: in the FRB case, LB fails over only new login requests during Nfaulty's recovery, while in the µRB case, LB performs no failover at all. In the former case, LB immediately fails requests that it knows would fail due to unavailability of session state, thus avoiding the consumption of server resources and averting Ngood's overload.

Figure 5 shows that µRB-ing without failover reduces the number of failed requests by 95% over FRB-ing with failover. More interestingly, however, µRB-ing without failover results in fewer failures than µRB-ing in any of the previous experiments, indicating that the coarseness of failover may actually reduce the benefits of µRB-based recovery. This can be explained by the localization of recovery effects shown in Figure 3: Nfaulty can be more useful if allowed to continue serving requests while recovering.

We therefore believe that, in µRB-able clusters, one should first attempt µRB-based recovery prior to failover; should µRB-based recovery fail, LB can trigger failover and perform a full reboot of Nfaulty. The cost of µRB-ing in a non-µRB-curable case is negligible compared to the overall impact of recovery (in our case, 16 failed requests, corresponding to 3 user actions). This seems a small price to pay for potential order-of-magnitude benefits; moreover, µRB-ing prior to failover will largely preserve the cluster's load dynamics.

These experiments showed that, in a small cluster, µRBs complement redundancy/failover and provide an effective way to improve the recovery properties of cluster nodes, with relatively little engineering. Using µRBs reduces the number of failed requests whether or not an external session state store is available. µRB-ing can avert node overload induced by failover; in some cases, µRBs may even obviate the need for failover.

5.3 Tolerating Lax Failure Detection

[Figure 5 plots: Gaw [responses/second] vs. time [minutes], 0-13 min. Left panel, FULL REBOOT: 74475 requests OK / 1106 failed. Right panel, MICROREBOOT: 76378 requests OK / 53 failed.]

Figure 5. Trying µRB before failover. For the FRB-recovering system, LB redirects all new login requests over to Ngood during Nfaulty's recovery, and fails Nfaulty-bound sessions immediately, thus averting overload. In the µRB-recovering case, LB does no failover at all. Client-side timeout is 8 sec; no session state store is used.

In general, downtime for an incident consists of the time it takes for the failure to be detected by a monitor (Tdet), the time to diagnose the faulty component (Tdiag), and the time to recover (Trec): Tdown = Tdet + Tdiag + Trec. Failure detection quality can be characterized by three parameters: Tdet, the monitor's false positive rate FPdet (how many of the total detections were in fact mistaken), and the false negative rate FNdet (how many of the actual failures were missed). Failure monitors typically make tradeoffs between these parameters; e.g., a longer Tdet generally yields lower FPdet and FNdet, since the more sample points can be gathered, the more thorough the analysis.

In this section we show that our system using µRB-based recovery is more tolerant to both longer Tdet and higher FPdet, as compared to FRBs; this gives monitors more freedom in making tradeoffs. Figure 6 shows the number of failed requests resulting from the two types of recovery as a function of Tdet and FPdet, respectively. To vary time to detection, we introduced an artificial delay in the monitor's reporting of observed failures; since failure reporting is done over UDP in a LAN, we consider Tdet ≈ 0 in the absence of delays. To vary the false positive rate, we artificially triggered "failure detected" events in the monitor, prompting recovery on the server side. Varying the false negative rate is not interesting for our comparison, because its effects are independent of the technique used to recover (a missed failure triggers no recovery at all).

The graph on the left shows that, if we desired to maintain the same level of availability as zero-time-FD FRB, then µRB-ing would allow detection time to take 53 seconds longer. We expect the two curves to eventually meet for some high values of Tdet, because eventually the requests that fail due to the detection delay start dominating the total number of failed requests, drowning out those that fail due to the recovery method.

The graph on the right shows that µRB-ing allows an increase of almost two orders of magnitude in the monitors' false positive rate, if we wish to maintain the same level of availability as FRB recovery with no false positives: a 97.2% false positive rate using µRBs yields the same number of failed requests as a 0% false positive rate for FRBs. This result illustrates how the order-of-magnitude shorter recovery time of µRBs compounds the reduced effect µRB-ing has on a service's end-user-visible functional degradation.

[Figure 6 plots: failed requests (log scale) for Microreboots vs. Full Reboots, as a function of failure detection time [sec] (left) and of the number of false positives / false positive rate [%] (right).]

Figure 6. Impact of µRB on failure detection. We injected a null reference fault in the most frequently called component (9.3% of total workload), and recovered with FRB or µRB, respectively. In the left graph, we assume FPdiag = 0 and vary Tdet by introducing a corresponding delay in the monitor. In the right graph, Tdet ≈ 0 (corresponding to UDP packet delivery time in our LAN) and we vary FPdiag by generating spurious failure reports. Load is 350 clients.

Although µRB-ing requires more precise diagnosis than FRB-ing, the benefits of µRBs seem to outweigh the added requirement in precision. Diagnosis quality can be characterized in terms of the false negative rate FNdiag (how many of the failed components were not diagnosed as faulty) and the false positive rate FPdiag (how many of the diagnosed-faulty components were in fact not faulty). Our primitive diagnosis algorithm does not yield false negatives (FNdiag = 0), which we believe is typical for componentized applications like RUBiS that have well-defined mappings from entry points (servlets) to component paths. In terms of false positives, however, our algorithm is weak: in the experiment of Figure 1 it caused the µRB of 10 components, whereas only 4 were faulty (FPdiag = 60%). A fault-decision-tree-based diagnosis system used at eBay [10] achieves FPdiag = 24%; if µRBs can reduce the number of failed requests by 90% with FPdiag = 60%, then better diagnosis should only yield better results.

µRB-based recovery is cheap, hence more tolerant to unnecessary recovery actions (false positives) than full reboots; this suggests that more aggressive failure detectors with higher false positive rates can be used in µRB-able systems. Since a thorough discussion and implementation of failure detection and diagnosis was beyond the scope of this work, we expect the reader to extrapolate from these results to the detection/diagnosis algorithm of his/her choice.

5.4 Averting Failures with Microrejuvenation

Despite automatic garbage collection, resource leaks are a major problem for many large-scale Java applications; a recent study of IBM customers' J2EE e-business software revealed that production systems frequently crash because of memory leaks [29]. To avoid unpredictable leak-induced crashes, operators resort to preventive rebooting, or software rejuvenation [18]. Some of the largest U.S. financial companies reboot their J2EE applications as often as several times a day [28] to recover memory, network sockets, file descriptors, etc. In this section we show that µRB-based rejuvenation (microrejuvenation) is as effective as a FRB in preventing leak-induced failures, but does so at a fourth of a FRB's cost in failed requests.

We injected memory leaks in two components of our application: a stateless session EJB that is among the most frequently called components, and the longest-recovering entity EJB. We chose aggressive leak rates that allowed us to keep each experiment to less than 1 hour. With slower leaks we expect to see the same results, but after a longer time. To motivate our interest in rejuvenation, we show in Figure 7 the effects of these leaks in the absence of rejuvenation: once the 1-GB memory heap is exhausted, the system grinds to a halt. A few requests already in the system continue to be served correctly, but with latencies on the order of minutes. Most requests fail upon running out of memory, and new requests are not accepted.

[Figure 7 plots: top panel, Gaw [responses/second] for 350 clients with NO REJUVENATION: 28842 requests OK / 1804 failed. Bottom panel, used memory [MBytes] and cumulative GC time [minutes] vs. time [minutes].]

Figure 7. Memory leaks induce failure. We inject a 2 KB/call leak in Item and a 250 KB/call leak in SBViewItem; timeout-based failure detection is disabled. At t=25.2 and t=26, JBoss released an object pool, which freed up 4 KB of memory, allowing incoming requests to be accepted, but then promptly failing. The system spends all its time trying to reclaim memory.

In Figure 8 we show how this failure can be prevented. Our recovery service periodically checks the amount of available memory; if it drops below Malarm, the recovery service µRBs components in a rolling fashion until available memory exceeds a threshold Msufficient. In addition to monitoring memory, production systems could monitor a number of other system parameters, such as the number of file descriptors, CPU utilization, lock graphs for identifying deadlocks, etc.

The recovery service does not have any knowledge of which components need to be µRB-ed in order to reclaim memory. Our recovery service maintains a list of candidate components for rejuvenation, which is initially in random order. As it performs microrejuvenations, the service assigns a score to each component based on how much memory was released by µRB-ing it. The list of components is kept sorted by score and, the next time memory runs low, the recovery service proceeds with µRB-ing the components expected to release the most memory, re-sorting as needed.
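This policy fits in a few lines of code. What follows is a minimal sketch rather than our actual recovery-service code; the MemoryMonitor and Microrebooter interfaces are hypothetical stand-ins for the corresponding server facilities:

    // Sketch of the scored, rolling microrejuvenation policy; the two
    // interfaces below are assumed, not part of our prototype's API.
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    interface MemoryMonitor { long availableBytes(); }
    interface Microrebooter { void microreboot(String component); }

    class Microrejuvenator {
        private static class Candidate implements Comparable<Candidate> {
            final String name;
            long score; // bytes reclaimed by the last µRB of this component
            Candidate(String name) { this.name = name; }
            public int compareTo(Candidate other) { // highest score first
                return Long.compare(other.score, this.score);
            }
        }

        private final List<Candidate> candidates = new ArrayList<>();
        private final MemoryMonitor monitor;
        private final Microrebooter rebooter;
        private final long mAlarm, mSufficient; // thresholds, in bytes

        Microrejuvenator(List<String> components, MemoryMonitor monitor,
                         Microrebooter rebooter, long mAlarm, long mSufficient) {
            for (String c : components) candidates.add(new Candidate(c));
            Collections.shuffle(candidates); // initial order is random
            this.monitor = monitor; this.rebooter = rebooter;
            this.mAlarm = mAlarm; this.mSufficient = mSufficient;
        }

        // Called periodically (every 10 sec in our experiments).
        void maybeRejuvenate() {
            if (monitor.availableBytes() >= mAlarm) return;
            // Roll through the candidates, best expected yield first,
            // until enough memory is available again.
            for (Candidate c : candidates) {
                long before = monitor.availableBytes();
                rebooter.microreboot(c.name);
                c.score = monitor.availableBytes() - before; // memory released
                if (monitor.availableBytes() >= mSufficient) break;
            }
            Collections.sort(candidates); // re-sort by observed yield
        }
    }

Note that the score is only an estimate: how much memory a µRB frees depends on when the garbage collector runs and on how much the component has leaked since its last µRB, which is why the list is re-sorted after every pass.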

In Figure 8 we compare FRB-based rejuvenation to microrejuvenation using a scenario that is worst case for µRBs: the initial microrejuvenation order is such that the components leaking the most memory are µRB-ed last. Since the recovery service started in a worst-case scenario for µRBs, it ends up rebooting the entire application by pieces during the first round of rejuvenation. Afterward, the list of candidate components is reordered, improving the efficiency of subsequent rejuvenations. SBViewItem had the most leaked memory at the time of the first microrejuvenation, so it got pushed to the front of the candidate list, followed by Item. For the second rejuvenation, µRB-ing SBViewItem is sufficient to bring utilization below the 200 MB threshold (80% available memory). On the third rejuvenation, however, both SBViewItem and Item require rejuvenation, which explains the higher number of failed requests. On the fourth rejuvenation, µRB-ing SBViewItem is once again sufficient. We expect this pattern would have continued had we let the experiment run.

[Figure 8 plots: Gaw [responses/second] for 350 clients under FULL REJUVENATION (78405 requests OK / 4317 failed) and under MICROREJUVENATION (82911 requests OK / 1049 failed), each with used memory [MBytes] and cumulative GC time [minutes] vs. time [minutes].]

Figure 8. Full rejuvenation vs. microrejuvenation. We inject a 2 KB/call leak in Item and a 250 KB/call leak in SBViewItem. Malarm = 35% and Msufficient = 80%. The recovery service probes available memory every 10 sec. µRBs reduce the number of failed requests from 4,317 to 1,049 (a factor of four) and never let Gaw reach zero. The garbage collector is invoked after each µRB, so time spent garbage collecting suffers a fourfold increase, from 12 sec to 48 sec. In spite of this, microrejuvenation increases the number of successful requests by 5%, from 78,405 to 82,911.

We also implemented a fixed schedule for microrejuvenation, which µRBs components based on a per-component time-to-live specified in an XML file or set at runtime through a Web interface. As expected, it was less effective than the reactive policy described above. However, there may be resource leaks in the application that cannot be monitored as easily as memory, or for which thresholds cannot be easily established; in such cases, a fixed microrejuvenation schedule combined with a reactive one would be warranted.
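A fixed schedule of this kind is straightforward to express; the sketch below uses java.util.concurrent, with the Microrebooter interface and the ttlSeconds map (as would be parsed from the XML file) serving as illustrative placeholders:

    // Hypothetical fixed-schedule microrejuvenation: each component is
    // µRB-ed at its own period, independent of any resource measurements.
    import java.util.Map;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    class ScheduledRejuvenator {
        interface Microrebooter { void microreboot(String component); }

        private final ScheduledExecutorService timer =
                Executors.newSingleThreadScheduledExecutor();

        // ttlSeconds maps component name -> per-component time-to-live.
        void start(Map<String, Long> ttlSeconds, Microrebooter rebooter) {
            ttlSeconds.forEach((component, ttl) ->
                    timer.scheduleAtFixedRate(
                            () -> rebooter.microreboot(component),
                            ttl, ttl, TimeUnit.SECONDS));
        }
    }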

Notice that, in the first attempt, the entire application was rejuvenated by pieces without ever letting Gaw drop to zero. The commonly used argument for software rejuvenation is that it turns unplanned total downtime into planned total downtime. Here we've shown that microrejuvenation can turn unplanned total downtime into planned partial downtime. Of course, microrejuvenation cannot reclaim resources leaked by the application server, but this is a shortcoming it shares with FRB.

5.5 Performance Impact

This final set of experiments shows that our µRB-centric design has a negligible effect on steady-state fault-free throughput and latency. Moreover, our system's performance is in line with that of a production Internet auction service. Table 3 compares three different configurations: the vanilla JBoss 3.2.1 release running HttpSession-enabled RUBiS (as described in Section 5.2), and SSM-enabled RUBiS running on top of µRB-enabled JBoss and vanilla JBoss 3.2.1, respectively.

                           JBoss 3.2.1 w/      µRB-JBoss w/   JBoss 3.2.1 w/
                           HttpSession-RUBiS   SSM-RUBiS      SSM-RUBiS
  Throughput [req/sec]     44.8                44.5           44.4
  Request latency [msec]
    Avg                    39                  82             83
    Min                    3                   3              3
    Max                    826                 1245           1097
    StDev                  .066                .131           .142

Table 3. Performance comparison of µRB-able vs. non-µRB-able design. Results from a 30-minute fault-free run, with 350 concurrent clients.

Comparing the first two configurations, we see that in spite of the modifications we made to enable µRB-ing, throughput remains virtually the same (within 1%). On the other hand, average response time doubles from 39 to 82 msec. However, human-perceptible delay is commonly believed to be 100-200 msec, which means that the 82 msec latency is of no consequence for an interactive Internet service like ours. The worst-case response time increases from 0.8 sec to 1.2 sec; while certainly human-perceptible, this level is below the 2-second threshold for what human users perceive as fast [27]. Both latency and throughput are within the range of measurements done at a major Internet auction service [39]: average throughput per cluster node is 41 req/sec, while latency varies between 33 and 300 msec.

A comparison of HttpSession-RUBiS and SSM-RUBiS running on vanilla JBoss reveals that the observed increase in response time is entirely due to our use of SSM. Given that session state is externalized to a separate node, accessing it requires the session object to be marshalled, sent over the network, then unmarshalled; this consumes considerably more CPU than if it were kept in the server's memory. These results also suggest that maintaining a write-through cache of session data in the application server could absorb some of this latency penalty. However, performance optimization was not one of our goals, so we did not explore this possibility. We found that by dropping the per-node load to 150 clients, µRB-JBoss with SSM-RUBiS delivers an average latency of 38 msec. Thus, if latency is of concern, it would be sufficient to increase a 3-node cluster to a 7-node cluster (thus redistributing the client load) to recoup the loss. However, with only 150 clients, each node ends up being underutilized by more than a factor of two.
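The write-through cache we allude to could look roughly like the following; we did not build it, and the SSMClient interface is an assumed stand-in for SSM's actual read/write API:

    // Sketch of a write-through cache for externalized session state:
    // reads hit local memory, writes go both locally and to SSM, so a
    // µRB loses only the cache, never the authoritative copy in SSM.
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    interface SSMClient { // assumed session-store API
        byte[] read(String sessionId);
        void write(String sessionId, byte[] state);
    }

    class WriteThroughSessionCache {
        private final Map<String, byte[]> cache = new ConcurrentHashMap<>();
        private final SSMClient ssm;

        WriteThroughSessionCache(SSMClient ssm) { this.ssm = ssm; }

        byte[] get(String sessionId) {
            // Serve from memory when possible; fall back to SSM on a
            // miss (e.g., right after a µRB wiped the cache).
            return cache.computeIfAbsent(sessionId, ssm::read);
        }

        void put(String sessionId, byte[] state) {
            ssm.write(sessionId, state); // authoritative copy first
            cache.put(sessionId, state); // then the local copy
        }
    }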

We compared our µRB-able system to an unmodified J2EE system and found insignificant performance overhead. However, J2EE already provides many of the features we were looking for in a µRB-able system, so it may already incorporate a corresponding performance penalty. Nevertheless, the absolute performance of our system is comparable to that of a large, established Internet service.

6. Discussion

Some J2EE applications, like RUBiS, are already µRB-friendly and require minimal changes to take advantage of our µRB-enabled application server. From our experience with other J2EE applications, we concluded that the biggest challenges in making them µRB-able would be to extricate session state handling from the application logic, and to ensure that persistent state is properly updated with transactions. The rest of the work is already in our prototype application server, so it can be leveraged across all J2EE applications.

In the rest of this section we discuss limitations of µRB-based recovery, as well as the building of µRB-able systems outside the J2EE framework. Finally, we discuss the issue of truly resolving root causes of failure, as opposed to just temporarily curing them through µRB-ing.

6.1 Limitations of µRB-based Recovery

It may appear that µRBs introduce three classes of problems: interruption of a component during a state update, improper reclamation of a µRB-ed component's external resources, and delay of a (needed) full reboot.

Impact on non-transactional shared state

If state updates are atomic, as they are with a database and with SSM, there is no distinction between µRBs and FRBs from the point of view of the state store. The case of non-transactional, shared EJB state is more challenging: the µRB of one component may leave the state inconsistent, unbeknownst to the other components that share it. A FRB, on the other hand, would not give the other components the opportunity to see the inconsistent state, since they would be rebooted as well. J2EE best-practices documents do discourage sharing state by passing references between EJBs or using static variables, but we believe they could be enforced by a suitably modified JIT compiler; alternatively, should the runtime detect such practices, it could disable the use of µRBs for the application in question.
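As a contrived illustration of the hazard (this is not code from our testbed), consider two EJBs that share a mutable list through a static field, precisely the pattern the best-practices documents discourage:

    // Discouraged pattern: two EJBs share mutable state via a static
    // field. If CartBean is microrebooted midway through replace(), the
    // shared list is left half-modified and CheckoutBean silently sees
    // it; a full reboot would discard the JVM state of both beans.
    import java.util.ArrayList;
    import java.util.List;

    class SharedCart { // hypothetical shared structure
        static final List<String> ITEMS = new ArrayList<>();
    }

    class CartBean { // stand-in for a session EJB
        void replace(String oldItem, String newItem) {
            SharedCart.ITEMS.remove(oldItem); // a µRB here leaves ITEMS
            SharedCart.ITEMS.add(newItem);    // missing the new item
        }
    }

    class CheckoutBean { // another EJB sharing the state
        int itemCount() { return SharedCart.ITEMS.size(); } // may see the hole
    }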

Note that a FRB would avoid this pitfall, but at the cost of losing the state. In terms of our testbed, the shared state could be corrupted such that sessions are preserved but unusable, whereas in vanilla JBoss the sessions would be lost altogether. In general, this is a risk whenever transient or semi-persistent state is made persistent, and was a major reason why application-generic checkpoint-based recovery in Unix was found not to work well [24]. In the logical limit, all code becomes "stateless" and recovery will involve either repairing corrupted data structures in the persistent state itself or µRB-ing processing components.

Interaction with external resources

If a component circumvents JBoss and acquires an external resource that the application server is not aware of, then µRB-ing the component may leak the resource in a way that a JBoss restart would not (however, a RUBiS restart still would). For example, we experimentally verified that an EJB A can directly open a JDBC connection to a database (without using the application server's wrapped JDBC service), acquire a database lock, then share the connection with another EJB B. When A is µRB-ed, the lock persists, because the JDBC connection (and A's DB session) is not terminated upon µRB, as JBoss has no knowledge of it. The DB therefore does not release the lock until after A's DB session timeout. If we instead rebooted the whole JBoss process, the resulting termination by the OS of the underlying TCP connection would cause the immediate termination of the DB session and the release of the lock.

The above case is contrived in that it violates EJB programming practices, but it can occur in principle. It illustrates the need for application components to obtain resources exclusively through the facilities provided by the application server.
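For concreteness, the essence of the violation looks like the following sketch, which uses the standard java.sql API; the connection URL, credentials, and table are made up for illustration:

    // Anti-pattern: an EJB bypasses the server's managed JDBC service.
    // The connection below is invisible to JBoss, so a µRB of this bean
    // cannot close it; any row locks it holds survive until the
    // database times the session out.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;

    class RogueBean {
        private Connection conn; // held directly, not via the server's pool

        void lockRow() throws SQLException {
            conn = DriverManager.getConnection(
                    "jdbc:postgresql://db-host/auction", "user", "pass");
            conn.setAutoCommit(false);
            // SELECT ... FOR UPDATE takes a row lock in an open transaction.
            conn.createStatement()
                .executeQuery("SELECT * FROM items WHERE id = 1 FOR UPDATE");
            // If this bean is µRB-ed now, conn is orphaned and the lock
            // persists; a full JBoss reboot would sever the TCP connection
            // and make the DB release the lock immediately.
        }
    }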

Delaying a full reboot

The more state gets segregated out of the application, the less effective a reboot becomes at scrubbing this data. Moreover, our implementation of µRBs does not scrub application-global data maintained by the application server, such as the JDBC connection pool and various other caches (with a few exceptions, described in Section 4.2). µRBs also generally cannot recover from problems occurring at layers below the application, such as the application server or the JVM. In all these cases, a full server restart may be required.

Poor failure diagnosis may result in one or more ineffectual µRBs of the wrong EJBs, leading to µRB-ing progressively larger groups of components until the whole application is rebooted. Even in this case, however, µRB-ing adds only a small amount to the total recovery cost.

Finally, Java does not allow for explicit resource release, so the best we were able to do without extensive JVM modifications was to call the system garbage collector following a µRB. However, this form of resource reclamation does not complete in an amount of time that is independent of the size of the resource and the size of the system. We believe that efficient support for µRBs should provide a nearly constant-time resource reclamation mechanism, which would allow µRBs to synchronously clean up resources.

6.2 Generalizing Beyond Our Prototype

While we feel J2EE makes it easier to write a µRB-able application, because its model is amenable to state externalization and component isolation, we believe it is possible to provide µRB support to other types of systems. In this section we describe design aspects that would deserve consideration in such extensions.

Isolation: EJB isolation in J2EE is not enforced by lower-level (hardware) mechanisms, as would be the case with separate process address spaces; consequently, bugs in the Java virtual machine or the application server could result in state corruption crossing EJB boundaries. Depending on the system, stronger levels of isolation may be warranted, such as can be achieved with virtual machines [16], microkernels [22], protection domains [38], etc. Dependencies between components need to be minimized, because a dense dependency graph increases the size of recovery groups, so µRBs may take longer.

Workload: µRBs thrive on workloads consisting of fine-grain, independent requests; if a system is faced with long-running operations, then the activity of the system needs to be periodically checkpointed to keep the cost of µRBs low. In the same vein, requests need to be sufficiently self-contained to allow a fresh instance of a µRB-ed component to pick them up and continue processing where the previous instance left off. If requests incorporate information on whether they are idempotent, they enable transparent retries of failed requests.

Transparent Retry: To allow for µRB-ing and smooth reintegration of µRB-ed components, all inter-component interactions should have a timeout; if no response is received to a call within the allotted timeframe, the caller can gracefully handle the situation. Such timeouts provide an orthogonal mechanism for turning all non-Byzantine failures into fail-stop events, which are easier to accommodate and serve to contain failures. If components invoke a currently-µRB-ing component, they can be given a RetryAfter(n) exception, indicating that the requests can be re-submitted after the estimated time to recover. Recovering from a failed idempotent operation entails simply reissuing it; for non-idempotent operations, rollback or compensating operations are in order. Transparent recovery of requests can hide intra-system component failures and µRBs from end users, as sketched below.
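The caller-side discipline can be sketched as follows; RetryAfterException and the Component interface are illustrative, not part of J2EE:

    // Illustrative caller-side timeout-and-retry discipline.
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    class RetryAfterException extends Exception {
        final long retryAfterMillis; // estimated time to recover
        RetryAfterException(long ms) { this.retryAfterMillis = ms; }
    }

    interface Component {
        String invoke(String request) throws RetryAfterException;
    }

    class Caller {
        private final ExecutorService pool = Executors.newCachedThreadPool();

        // Invoke with a timeout, transparently retrying idempotent requests.
        String call(Component target, String request, long timeoutMillis,
                    boolean idempotent) throws Exception {
            while (true) {
                Future<String> f = pool.submit(() -> target.invoke(request));
                try {
                    return f.get(timeoutMillis, TimeUnit.MILLISECONDS);
                } catch (TimeoutException e) {
                    f.cancel(true);           // treat the callee as fail-stop
                    if (!idempotent) throw e; // needs rollback/compensation
                    // idempotent: fall through and reissue the request
                } catch (ExecutionException e) {
                    if (e.getCause() instanceof RetryAfterException) {
                        long wait = ((RetryAfterException) e.getCause()).retryAfterMillis;
                        Thread.sleep(wait);   // callee is µRB-ing; resubmit later
                    } else {
                        throw e;
                    }
                }
            }
        }
    }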

Leases: Resources in a frequently-µRB-ing system should be leased, rather than permanently allocated, to centralize the responsibility for cleaning up after µRBs that may leave orphaned resources. Such resources go beyond memory and file descriptors: we believe many types of persistent state should be leased (after which it is either deleted or archived out of the system), as should CPU time: if a computation hangs and does not renew its execution lease, it should be terminated with a µRB. If requests carry a time-to-live, then those that are stuck can be purged from the system once this TTL runs out.
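A lease table of the kind we have in mind might look like this sketch, where every name is hypothetical and the reclaim action stands in for deleting, archiving, or µRB-ing:

    // Hypothetical lease table: every resource registers with a TTL and
    // must renew it; a periodic reaper reclaims whatever was orphaned,
    // e.g., by a µRB of the holder.
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    class LeaseManager {
        interface Reclaimer { void reclaim(); } // delete, archive, or µRB

        private static class Lease {
            volatile long expiresAt;
            final Reclaimer reclaimer;
            Lease(long expiresAt, Reclaimer r) {
                this.expiresAt = expiresAt; this.reclaimer = r;
            }
        }

        private final Map<String, Lease> leases = new ConcurrentHashMap<>();

        void grant(String resource, long ttlMillis, Reclaimer r) {
            leases.put(resource, new Lease(System.currentTimeMillis() + ttlMillis, r));
        }

        void renew(String resource, long ttlMillis) {
            Lease l = leases.get(resource);
            if (l != null) l.expiresAt = System.currentTimeMillis() + ttlMillis;
        }

        // Called periodically: anything whose holder stopped renewing
        // is reclaimed centrally, in one place.
        void reap() {
            long now = System.currentTimeMillis();
            leases.entrySet().removeIf(e -> {
                if (e.getValue().expiresAt >= now) return false;
                e.getValue().reclaimer.reclaim();
                return true;
            });
        }
    }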

6.3 Short-Term vs. Long-Term Cures

Recovering by reboot does not mean that the root causes should not eventually be identified and the bugs fixed. However, rebooting provides a separation of concerns between recovery and diagnosis/repair, consistent with the observation that the latter is not always a prerequisite for the former.

Building crash-safe systems is, in some sense, well understood; we have shown here how to build a microcrash-safe system. Attempting to repair/recover a reboot-curable failure by anything other than a reboot in such systems always entails the risk of taking longer than a reboot would have taken in the first place: trying to diagnose the problem can often turn what could have been a fast reboot into a long-delayed reboot, thus hurting availability. We therefore advocate a general approach to recovery that involves always µRB-ing as the first recovery attempt.

7. Related Work

Our work has three major themes: reboot-based recovery, minimizing recovery time, and reducing disruption during recovery. In this section we discuss a sample of work related to these themes, as well as complementary work in the area of fault diagnosis.

Separation of control and data is key to reboot-based recovery. There are many ways to isolate subsystems (e.g., using processes, virtual machines [16], microkernels [22], protection domains [38], etc.). Additionally, in pre-J2EE transaction processing monitors, each piece of system functionality, such as doing I/O with clients, writing to the transaction log, etc., was a separate process communicating with the others using IPC or RPC, with shared system state kept in shared-memory structures and session state managed by a dedicated component. Although the architecture did not scale well and was eventually superseded by today's application servers, the multiple-processes approach provided better isolation and would have been more amenable to µRBs.

A number of efforts have centered on reducing reboot time through careful engineering. For example, FSMLabs [15] have trimmed down Linux such that it can boot in under 200 msec. While these optimizations are primarily driven by the real-time community seeking to embed Linux in a variety of devices, they do open the door for reboot-based recovery. In contrast to such approaches, we choose to partition our system and perform a "partial reboot" as a way to reduce recovery time.

Much work on Internet services has focused on reducing the functional disruption associated with recovering from a transient failure. Failover in clusters is the canonical example; Brewer [3] proposed the "DQ principle" as a way to understand how a partial failure can be mapped to either a decrease in queries served or a decrease in data returned per query.

A large fraction of recovery time, and therefore availability, is the time required to detect failures and localize them well enough to determine a recovery action [8]. A study [32] found that earlier detection might have mitigated or avoided 65% of reported user-visible failures. By enabling sloppier fault detection, we make a number of detection and diagnosis solutions more useful. For example, statistical learning approaches like Pinpoint [9], while prone to false positives, are useful for systems whose structure is not known. Combining Pinpoint with µRBs has the potential to improve the availability of Internet systems with relatively little engineering.

8. Conclusions

By completely separating process recovery from data recovery, we enabled the use of microreboots in a J2EE application server. µRBs have many of the same desirable recovery and resource-reclamation behaviors as full reboots, but are an order of magnitude less disruptive. We have made rebooting cheap enough to no longer be feared.

We showed that, in a 2-node cluster, skipping failover and simply µRB-ing the faulty node reduced failed requests by an order of magnitude compared to the commonly used failover-and-reboot approach. We also demonstrated that µRB-based recovery can achieve higher levels of availability even in the case of failure detection with false positive rates as high as 97%. Using microreboots, we were able to reclaim leaked memory in our prototype application without shutting it down, thus continuously maintaining availability. Finally, we showed that these benefits come at less than 1% loss in average throughput and a human-imperceptible increase in average latency.

Although we exploited some properties of the J2EE programming model to simplify our implementation, we believe the techniques can be applied more generally to non-J2EE systems. Microreboots are a recovery mechanism cheap enough for frequent use as a first line of defense; even if a µRB is ineffective, it cannot hurt. Therefore, microreboots offer the potential for significant simplification of recovery management.

References

[1] M. Barnes. J2EE application servers: Market overview. The Meta Group, Mar. 2004.

[2] N. Bhatti, A. Bouch, and A. Kuchinsky. Integrating user-perceived quality into web server design. In Proc. 9th International WWW Conference, Amsterdam, Holland, 2000.

[3] E. Brewer. Lessons from giant-scale services. IEEE Internet Computing, 5(4):46–55, July 2001.

[4] P. A. Broadwell, N. Sastry, and J. Traupman. FIG: A prototype tool for online verification of recovery mechanisms. In Workshop on Self-Healing, Adaptive and Self-Managed Systems, New York, NY, June 2002.

[5] K. Buchacker and V. Sieh. Framework for testing the fault-tolerance of systems including OS and network aspects. In Proc. IEEE High-Assurance System Engineering Symposium, Boca Raton, FL, 2001.

[6] E. Cecchet, J. Marguerite, and W. Zwaenepoel. Performance and scalability of EJB applications. In Proc. 17th Conference on Object-Oriented Programming, Systems, Languages, and Applications, Seattle, WA, 2002.

[7] S. Chandra and P. M. Chen. The impact of recovery mechanisms on the likelihood of saving corrupted state. In Proc. 13th International Symposium on Software Reliability Engineering, Annapolis, MD, 2002.

[8] M. Chen, A. Accardi, E. Kiciman, D. Patterson, A. Fox, and E. Brewer. Path-based macroanalysis for large, distributed systems. In NSDI, 2004.

[9] M. Chen, E. Kiciman, E. Fratkin, E. Brewer, and A. Fox. Pinpoint: Problem determination in large, dynamic, Internet services. In Proc. International Conference on Dependable Systems and Networks, Washington, DC, June 2002.

[10] M. Chen, A. Zheng, J. Lloyd, M. Jordan, and E. Brewer. Failure diagnosis using decision trees. In International Conference on Autonomic Computing, New York, NY, May 2004.

[11] T. C. Chou. Beyond fault tolerance. IEEE Computer, 30(4):31–36, 1997.

[12] T. C. Chou. Personal communication. Oracle Corp., 2003.

[13] H. Cohen and K. Jacobs. Personal communication. Oracle Corporation, 2002.

[14] S. Duvur. Personal communication. Sun Microsystems, 2004.

[15] FSMLabs. http://www.fsmlabs.com/.

[16] T. Garfinkel, B. Pfaff, J. Chow, M. Rosenblum, and D. Boneh. Terra: A virtual machine-based platform for trusted computing. In Proc. 19th ACM Symposium on Operating Systems Principles, Bolton Landing, NY, 2003.

[17] J. Gray. Why do computers stop and what can be done about it? In Proc. 5th Symposium on Reliability in Distributed Software and Database Systems, Los Angeles, CA, 1986.

[18] Y. Huang, C. M. R. Kintala, N. Kolettis, and N. D. Fulton. Software rejuvenation: Analysis, module and applications. In Proc. 25th International Symposium on Fault-Tolerant Computing, Pasadena, CA, 1995.

[19] JBoss. Homepage. http://www.jboss.org/, 2002.

[20] Keynote Systems. http://keynote.com.

[21] H. Levine. Personal communication. EBates.com, 2003.

[22] J. Liedtke. Toward real microkernels. Communications of the ACM, 39(9):70–77, 1996.

[23] B. Ling, E. Kiciman, and A. Fox. Session state: Beyond soft state. In NSDI, 2004.

[24] D. E. Lowell, S. Chandra, and P. M. Chen. Exploring failure transparency and the limits of generic recovery. In Proc. 4th USENIX Symposium on Operating Systems Design and Implementation, San Diego, CA, 2000.

[25] A. Messinger. Personal communication. BEA Systems, 2004.

[26] J. F. Meyer. On evaluating the performability of degradable computer systems. IEEE Transactions on Computers, C-29:720–731, Aug. 1980.

[27] R. Miller. Response time in man-computer conversational transactions. In Proc. AFIPS Fall Joint Computer Conference, volume 33, 1968.

[28] N. Mitchell. Personal communication. IBM Research, 2004.

[29] N. Mitchell and G. Sevitsky. LeakBot: An automated and lightweight tool for diagnosing memory leaks in large Java applications. In 17th European Conference on Object-Oriented Programming, Darmstadt, Germany, July 2003.

[30] B. Murphy and N. Davies. System reliability and availability drivers of Tru64 UNIX. In Proc. 29th International Symposium on Fault-Tolerant Computing, Madison, WI, 1999. Tutorial.

[31] B. Murphy and T. Gent. Measuring system and software reliability using an automated data collection process. Quality and Reliability Engineering International, 11:341–353, 1995.

[32] D. Oppenheimer, A. Ganapathi, and D. Patterson. Why do Internet services fail, and what can be done about it? In Proc. 4th USENIX Symposium on Internet Technologies and Systems, Seattle, WA, 2003.

[33] A. Pal. Personal communication. Yahoo!, Inc., 2002.

[34] D. Reimer. Personal communication. IBM Research, 2004.

[35] W. D. Smith. TPC-W: Benchmarking an e-commerce solution. Transaction Processing Council, 2002.

[36] M. Sullivan and R. Chillarege. Software defects and their impact on system availability – a study of field failures in operating systems. In Proc. 21st International Symposium on Fault-Tolerant Computing, Montreal, Canada, 1991.

[37] Sun Microsystems. J2EE platform specification. http://java.sun.com/j2ee/, 2002.

[38] M. M. Swift, B. N. Bershad, and H. M. Levy. Improving the reliability of commodity operating systems. In Proc. 19th ACM Symposium on Operating Systems Principles, Bolton Landing, NY, 2003.

[39] TBD. A major Internet auction site. Terms of disclosure are being negotiated, May 2004.

[40] A. P. Wood. Software reliability from the customer view. IEEE Computer, 36(8):37–42, Aug. 2003.

[41] Zona Research. The need for speed II. Zona Research Bulletin, Issue 5, Apr. 2001.
