
2005 USENIX Annual Technical Conference USENIX Association 149

Building a Reactive Immune System for Software Services

Stelios Sidiroglou  Michael E. Locasto  Stephen W. Boyd  Angelos D. Keromytis
Network Security Lab

Department of Computer Science, Columbia University
{stelios,locasto,swb48,angelos}@cs.columbia.edu

Abstract

We propose a reactive approach for handling a wide variety of software failures, ranging from remotely exploitable vulnerabilities to more mundane bugs that cause abnormal program termination (e.g., illegal memory dereference) or other recognizable bad behavior (e.g., computational denial of service). Our emphasis is on creating “self-healing” software that can protect itself against a recurring fault until a more comprehensive fix is applied.

Briefly, our system monitors an application during its execution using a variety of external software probes, trying to localize (in terms of code regions) observed faults. In future runs of the application, the “faulty” region of code will be executed by an instruction-level emulator. The emulator checks for recurrences of previously seen faults before each instruction is executed. When a fault is detected, we recover program execution to a safe control flow. Using the emulator for small pieces of code, as directed by the observed failure, allows us to minimize the performance impact on the immunized application.

We discuss the overall system architecture and a prototype implementation for the x86 platform. We show the effectiveness of our approach against a range of attacks and other software failures in real applications such as Apache, sshd, and Bind. Our preliminary performance evaluation shows that although full emulation can be prohibitively expensive, selective emulation can incur as little as 30% performance overhead relative to an uninstrumented (but failure-prone) instance of Apache. Although this overhead is significant, we believe our work is a promising first step in developing self-healing software.

1 Introduction

Despite considerable work in fault tolerance and reliability, software remains notoriously buggy and crash-prone. The situation is particularly troublesome with respect to services that must maintain high availability in the face of remote attacks, high-volume events (such as fast-spreading worms like Slammer [2] and Blaster [1]) that may trigger unrelated and possibly non-exploitable bugs, or simple application-level denial of service attacks. The majority of solutions to this problem fall into four categories:

• Proactive approaches that seek to make the code as dependable as possible, through a combination of safe languages (e.g., Java), libraries [3] and compilers [15], code analysis tools [8], and development methodologies.

• Debugging aids whose aim is to make post-fault analysis and recovery as easy as possible for the programmer.

• Runtime solutions that seek to contain the fault using some type of sandboxing, ranging from full-scale emulators such as VMWare, to system call sandboxes [24], to narrowly applicable schemes such as StackGuard [13].

• Byzantine fault-tolerance schemes (e.g., [34]), which use voting among a number of service instances to select the correct answer, under the assumption that only a minority of the replicas will exhibit faulty behavior.

The contribution of this paper is a reactive approach, accomplished by observing an application (or appropriately instrumented instances of it) for previously unseen failures. The types of faults we focus on in this paper consist of illegal memory dereferences, division-by-zero exceptions, and buffer overflow attacks. Other types of failures can be easily added to our system as long as their cause can be algorithmically determined (i.e., another piece of code can tell us what the fault is and where it occurred). We intend to enrich this set of faults in the future; specifically, we plan to examine Time-Of-Check-To-Time-Of-Use (TOCTTOU) violations and algorithmic-complexity denial of service attacks [9].


Our approach employs an Observe Orient Decide Act (OODA) feedback loop and uses a set of software probes that monitor the application for specific types of faults. Upon detection of a fault, we invoke a localized recovery mechanism that seeks to recognize and prevent the specific failure in future executions of the program. Using continuous hypothesis testing, we verify whether the fault has been repaired by re-running the application against the event sequence that apparently caused the failure. Our initial focus is on automatic healing of services against newly detected faults (whether accidental failures or attacks). We emphasize that we seek to address a wide variety of software failures, not just attacks.

For our recovery mechanism we introduce Selective Transactional EMulation (STEM), an instruction-level emulator that can be selectively invoked for arbitrary segments of code, allowing us to mix emulated and non-emulated execution inside the same process. The emulator allows us to (a) monitor for the specific type of failure prior to executing the instruction, (b) undo any memory changes made by the function inside which the fault occurred, by having the emulator record all memory modifications made during its execution, and (c) simulate an error return from said function.

One of our key assumptions is that we can create a mapping between the set of errors that could occur during a program’s execution and the limited set of errors that are explicitly handled by the program’s code. This “error virtualization” technique is based on heuristics that we present in Section 2.4. We believe that a majority of server applications are written to have relatively robust error handling; by virtualizing the errors, an application can continue execution even though a boundary condition that was not predicted by the programmer allowed a fault to “slip in.” In other words, error virtualization attempts to retrofit an exception-catching mechanism onto code that was not explicitly written to have such a capability. Our experiments with Apache, OpenSSH, and Bind validate this intuition. Evidence from other recent work [26, 25, 33] supports our findings.

Our current work focuses on server-type applications, since they typically have higher availability requirements than user-oriented applications. Micro-rebooting [7] has been proposed as another approach to dealing with errors, by restarting all or parts of an application upon recognizing a failure. However, server applications often cannot be simply restarted because they are typically long running (and thus accumulate a fair amount of state) and usually contain a number of threads that service many remote users. Restarting the whole server because of one failed thread unfairly denies service to other users. Also, unlike user-oriented applications, servers operate without direct human supervision and thus have a greater need for an automated reactive system. Furthermore, it is relatively easy to replay the offending sequence of events in such applications, as these are typically limited to input received over the network (as opposed to a user’s interaction with a graphical interface). We intend to investigate other classes of applications in the future.

To evaluate the effectiveness of our system and its impact on performance, we conduct a series of experiments using a number of open-source server applications including Apache, sshd, and Bind. The results show that our “virtualized error” mapping assumption holds for more than 88% of the cases we examined. Testing with real attacks against Apache, OpenSSH, and Bind, we show that our technique can be effective in quickly and automatically protecting against zero-day attacks and failures. Although full emulation of these applications is prohibitively expensive (3,000% slowdown), our selective emulation degrades performance by a factor of 1.3–2, depending on the size of the emulated code segment. We believe that our findings show that a reactive approach such as the one we advocate is a promising mechanism for dealing with application faults.

Paper Organization. Section 2 presents our approach, including the limitations of our system and the basic system architecture. Section 3 briefly discusses the implementation of STEM, and Section 4 presents some preliminary performance measurements of the system. We give an overview of related work in Section 5 and summarize our contributions and plans for future work in Section 6.

2 Approach

Our architecture, depicted in Figure 1, uses three types of components: a set of sensors that monitor an application (such as a web server) for faults; Selective Transactional EMulation (STEM), an instruction-level emulator that can selectively emulate “slices” (arbitrary segments) of code; and a testing environment where hypotheses about the effect of various fixes are evaluated. These components can operate without human supervision to minimize reaction time.

2.1 System Overview

When the sensor detects an error in the application’s execution (such as a segmentation fault), the system instruments the portion of the application’s code that immediately surrounds the faulty instruction(s), such that the code segment is emulated (the mechanics of this are explained in Section 3). To verify the effectiveness of the fix, the application is restarted in a test environment with the instrumentation enabled, and is supplied with the input that caused the failure (or the N most recent inputs, if the offending one cannot be easily identified, where N is a configurable parameter). We focus on server-type applications that have a transactional processing model, because it is easier to quickly correlate perceived failures with a small or finite set of inputs than with other types of applications (e.g., those with a GUI).

Figure 1: Feedback control loop: (1) a variety of sensors monitor the application for known types (but unknown instances) of faults; (2) upon recognizing a fault, we emulate the region of code where the fault occurred and test with the inputs seen before the fault occurred; (3) by varying the scope of emulation, we can determine the “narrowest” code slice we can emulate and still detect and recover from the fault; (4) we then update the production version of the server.

During emulation, STEM maintains a record of all memory changes (including global variables or library-internal state, e.g., libc standard I/O structures) that the emulated code makes, along with their original values. Furthermore, STEM examines the operands of each machine instruction and pre-determines the side effects of the instructions it emulates. The use of an emulator allows us to circumvent the complexity of code analysis, as we only need to focus on the operation and side effects of individual instructions, independently from each other. If the emulator determines that a fault is about to occur, the emulated execution is aborted. Specifically, all memory changes made by the emulated code are undone, and the currently executing function is “forced” to return an error. We describe how emulation and error virtualization are accomplished in Sections 2.3 and 2.4, respectively, and we experimentally validate the error virtualization hypothesis in Section 4. For our initial approach, we are primarily concerned with failures where there is a one-to-one correspondence between inputs and failures, and not with those that are caused by a combination of inputs.
Note, however, that many of the latter type of failures are in fact addressed by our system, because the last input (and the code leading to a failure) will be recognized as “problematic” and handled as we have discussed.

In the testing and error-localization phase, emulation stops after forcing the function to return. If the program crashes, the scope of the emulation is expanded to include the parent (calling) routine, and the application re-executes with the same inputs. This process is repeated until the application does not terminate after we abort a function call sequence. In the extreme case, the whole application could end up being emulated, at a significant performance cost. However, Section 4 shows that this failsafe measure is rarely necessary.

If the program does not crash after the forced return, we have found a “vaccine” for the fault, which we can use on the production server. Naturally, if the fault is not triggered during an emulated execution, emulation halts at the end of the vulnerable code segment, and all memory changes become permanent.

The overhead of emulation is incurred at all times (whether the fault is triggered or not). To minimize this cost, we must identify the smallest piece of code that we need to emulate in order to catch and recover from the fault. We currently treat functions as discrete entities and emulate the whole body of a function, even though the emulator allows us to start and stop emulation at arbitrary points, as described in Section 3. Future work will explore strategies for minimizing the scope of the emulation and balancing the tradeoff between coverage and performance.

In the remainder of this section, we describe the types of sensors we employ, give an overview of how the emulator operates (with more details on the implementation in Section 3), and describe how the emulator forces a function to return with an error code. We also discuss the limitations of reactive approaches in general and of our system in particular.


2.2 Application Monitors

The selection of appropriate failure-detection sensors depends on both the nature of the flaws themselves and the tolerance of their impact on system performance. We describe the two types of application monitors that we have experimented with.

The first approach is straightforward. The operating system forces a misbehaving application to abort and creates a core dump file that includes the type of failure and the stack trace at the time of the failure. This information is sufficient for our system to apply selective emulation, starting with the top-most function in the stack trace. Thus, we only need a watchdog process that waits until the service terminates before it invokes our system.

A second approach is to use an appropriately instrumented version of the application on a separate server as a honeypot, as we demonstrated for the case of network worms [29]. Under this scheme, we instrument the parts of the application that may be vulnerable to a particular class of attack (in this case, remotely exploitable buffer overflows) such that an attempt to exploit a new vulnerability exposes the attack vector and all pertinent information (attacked buffer, vulnerable function, stack trace, etc.). This information is then used to construct an emulator-based vaccine that effectively implements array bounds checking at the machine-instruction level. This approach has great potential for catching new vulnerabilities that are being indiscriminately attempted at high volume, as may be the case with an “auto-root” kit or a fast-spreading worm. Since the honeypot is not in the production server’s critical path, its performance is not a primary concern (assuming that attacks are relatively rare phenomena). In the extreme case, we can construct a honeypot that uses our instruction-level emulator to execute the whole application, although we do not further explore this possibility in this paper.

2.3 Selective Transactional EMulation (STEM)

The recovery mechanism uses an instruction-level emulator, STEM, that can be selectively invoked for arbitrary segments of code. This tool permits the execution of emulated and non-emulated code inside the same process. The emulator is implemented as a C library that defines special tags (a combination of macros and function calls) that mark the beginning and the end of selective emulation. To use the emulator, we can either link it with an application in advance, or compile it into the code in response to a detected failure, as was done in [29].

Upon entering the vulnerable section of code, the emulator snapshots the program state and executes all instructions on the virtual processor. When the program counter references the first instruction outside the bounds of emulation, the virtual processor copies its internal state back to the real CPU and lets the program continue execution natively. While registers are explicitly updated, memory updates have implicitly been applied throughout the execution of the emulation. The program, unaware of the instructions executed by the emulator, continues executing directly on the CPU.

To implement fault catching, the emulator simply checks the operands of the instructions it is emulating, taking into consideration additional information supplied by the sensor that detected the fault. For example, in the case of division by zero, the emulator need only check the value of the appropriate operand to the div instruction. For illegal memory dereferencing, the emulator verifies whether the source or destination addresses of any memory access (or the program counter, for instruction fetches) point to a page that is mapped into the process address space, using the mincore() system call. Buffer overflow detection is handled by padding the memory surrounding the vulnerable buffer, as identified by the sensor, by one byte, similar to the way StackGuard [13] operates. The emulator then simply watches for memory writes to these locations. This approach requires source code availability, so as to insert the “canary” variables. Contrary to StackGuard, our approach allows us to stop the overflow before it overwrites the rest of the stack, and thus to recover the execution. For algorithmic-complexity denial of service attacks, such as the one described in [9], we keep track of the amount of time (in terms of number of instructions) spent executing in the instrumented code; if this exceeds a pre-defined threshold, we abort the execution. This threshold may be defined manually, or it can be determined by profiling the application under real (or realistic) workloads, although we have not fully explored these possibilities.

We currently assume that the emulator is pre-linked with the vulnerable application, or that the source code of that application is available. It is possible to circumvent this limitation by using the CPU’s programmable breakpoint register (in much the same way that a debugger uses it to capture execution at a particular point in the program) to invoke the emulator, without the running process even being able to detect that it is now running under an emulator.

2.4 Recovery: Forcing Error Returns

Upon detecting a fault, our recovery mechanism undoes all memory changes and forces an error return from the currently executing function. To determine the appropriate error return value, we analyze the declared type of the function.

Depending on the return type of the emulated function, the system returns an “appropriate” value. This value is determined using some straightforward heuristics and is placed in the stack frame of the returning function. The emulator then transfers control back to the calling function. For example, if the return type is int, a value of −1 is returned; if the return type is unsigned int, the system returns 0; and so on. A special case applies when the function returns a pointer. Instead of blindly returning NULL, we examine whether the returned pointer is further dereferenced by the parent function. If so, we expand the scope of the emulation to include the parent function. We handle value-return function arguments similarly. There are some contexts where these heuristics may not work well; however, as a first approach they worked extremely well in our experiments (see Section 4).

In the future, we plan to use more aggressive source code analysis techniques to determine the return values that are appropriate for a function. Since in many cases a common error-code convention is used in large applications or modules, it may be possible to ask the programmer to provide a short description of this convention as input to our system, either through code annotations or as separate input. A similar approach can be used to mark functions that must be fail-safe and should return a specific value when an error return is forced, e.g., code that checks user permissions.

2.5 Caveats and Limitations

While promising, reactive approaches to software faults face a new set of challenges. As this is a relatively unexplored field, some problems are beyond the scope of this paper.

First, our primary goal is to evolve an application protected by STEM towards a state that is highly resistant to exploits and errors. While we expect the downtime for such a system to be reduced, we do not reasonably expect zero downtime. STEM fundamentally relies on the application monitors detecting an error or attack, stopping the application, marking the affected sections for emulated execution, and then restarting the application. This process necessarily involves downtime, but it is incurred only once for each detected vulnerability. We believe that combining our approach with micro-rebooting techniques can streamline this process.

A reaction system must evaluate and choose a response from a wide array of choices. Currently, when encountering a fault, a system can (a) crash, (b) crash and be restarted by a monitor [7], (c) return arbitrary values [26], or (d) slice off the functionality. Most proactive systems take the first approach. We elect to take the last approach. As Section 2.4 shows, this choice seems to work extremely well. This phenomenon also appears at the machine-instruction level [33].

However, there is a fundamental problem in choosing a particular response. Since the high-level behavior of any system cannot be algorithmically determined, the system must be careful to avoid cases where the response would take execution down a semantically incorrect path (from the viewpoint of the programmer’s intent). An example of this type of problem is skipping a check in sshd that would allow an otherwise unauthenticated user to gain access to the system. The exploration of ways to bound these types of errors is an open area of research. Our initial approach is to rely on the programmer to provide annotations indicating which parts of the code should not be circumvented.

There is a key tradeoff between code coverage (and thus confidence in the level of security the system provides) and performance (processing and memory overhead). Our emulator implementation is a proof of concept; many enhancements are possible to increase performance in a production system. Our main goal is to emphasize the service that such an emulator provides: the ability to selectively incur the cost of emulation for vulnerable program code only. Our system is directed to these vulnerable sections by runtime sensors; the quality of the application monitors dictates the quality of the code coverage.

Since our emulator is designed to operate at the user level, it hands control to the operating system during system calls. If a fault were to occur in the operating system, our system would not be able to react to it. In a related problem, I/O beyond the machine presents a problem for a rollback strategy. This problem can be partially addressed by the approach taken in [17], by having the application monitors log outgoing data and implementing a callback mechanism for the receiving process.

Finally, in our current work, we assume that the source code of the vulnerable application is available to our system. We briefly discussed how to partially circumvent this limitation in Section 2.3. Additional work is needed to enable our system to work in a binary-only environment.

3 Implementation

We implemented the STEM x86 emulator to validate thepracticality of providing a supervision framework forthe feedback control loop through selective emulation of


the Apache httpd, OpenSSH sshd, and Bind. We run profiled versions of the selected applications against a set of test suites and examine the subsequent call graphs generated by these tests with gprof and Valgrind [21]. The ensuing call trees are analyzed in order to extract leaf functions. The leaf functions are, in turn, employed as potentially vulnerable functions. Armed with the information provided by the call graphs, we run a script that inserts an early return in all the leaf functions (as described in Section 2.4), simulating an aborted function. Note that these tests do not require going back up the call stack.

In Apache’s case, we examined 154 leaf functions. For each aborted function, we monitor the program execution of Apache by running httperf [20], a web server performance measurement tool. Success for each test was defined as the application not crashing.

The results from these tests were very encouraging, as 139 of the 154 functions completed the httperf tests successfully. In these cases, program execution was not interrupted. What we found surprising was that not only did the program not crash, but in some cases all the pages were served (as reported by httperf). This result is probably because a large number of the functions are used for statistical and logging purposes. Furthermore, out of the 15 functions that produced segmentation faults, 4 did so at start-up (and would thus not be relevant in the case of a long-running process). While this result is encouraging, testing the correctness of this process would require a regression test suite that checks the page contents, headers, and HTTP status code of each response. We plan to build this “correctness” testing framework.

Similarly for sshd, we iterate through each aborted function while examining program execution during an scp transfer. In the case of sshd, we examined 81 leaf functions. Again, the results were auspicious: 72 of the 81 functions maintained program execution. Furthermore, only 4 functions caused segmentation faults; the rest simply did not allow the program to start.

For Bind, we examined the program execution of named during the execution of a set of queries; 67 leaf functions were tested. In this case, 59 of the 67 functions maintained the proper execution state. Similar to sshd, only 4 functions caused segmentation faults.

These results, along with supporting evidence from [26] and [33], validate our “error virtualization” hypothesis and approach. However, additional work is needed to determine the degree to which other types of applications (e.g., GUI-driven) exhibit the same behavior.

4.2 Attack Exploits

Given the success of our experimental evaluation of program execution, we wanted to further validate our hypothesis against a set of real exploits for Apache, OpenSSH sshd, and Bind. No prior knowledge of the vulnerabilities was encoded in our system: for all intents and purposes, each experiment was a zero-day attack.

For Apache, we used the apache-scalp exploit, which takes advantage of a buffer overflow vulnerability based on the incorrect calculation of the required buffer sizes for chunked-encoding requests. We applied selective emulation to the offending function and successfully recovered from the attack; the server successfully served subsequent requests.

The attack used for OpenSSH was the RSAREF2 exploit for SSH-1.2.27. This exploit relies on unchecked offsets that result in a buffer overflow vulnerability. Again, we were able to gracefully recover from the attack, and the sshd server continued normal operation.

Bind is susceptible to a number of known exploits; for the purposes of this experiment, we tested our approach against the TSIG bug on ISC Bind 8.2.2-x. In the same motif as the previous attacks, this exploit takes advantage of a buffer overflow vulnerability. As before, we were able to safely recover program execution while maintaining service availability.

4.3 Performance

We next turned our attention to the performance impact of our system. In particular, we measured the overhead imposed by the emulator component. STEM is meant to be a lightweight mechanism for executing selected portions of an application’s code. We can select these code slices according to a number of strategies, as discussed in Section 2.2.

We evaluated the performance impact of STEM by instrumenting the Apache 2.0.49 web server and OpenSSH sshd, as well as performing micro-benchmarks on various shell utilities such as ls, cat, and cp.

4.3.1 Testing Environment

The machine we chose to host Apache was a single Pentium III at 1 GHz with 512 MB of memory running RedHat Linux with kernel 2.4.20. The machine was under a light load during testing (standard set of background applications and an X11 server). The client machine was a dual Pentium II at 350 MHz with 256 MB of memory running RedHat Linux 8.0 with kernel 2.4.18smp. The client


machine was running a light load (X11 server, sshd, background applications) in addition to the test tool.

Both emulated and non-emulated versions of Apache were compiled with the --enable-static-support configuration option. Finally, the standard runtime configuration for Apache 2.0.49 was used; the only change we made was to enable the server-status module (which is compiled in by default but not enabled in the default configuration). STEM was compiled with the “-g -static -fno-defer-pop” flags. In order to simplify our debugging efforts, we did not include optimization.

We chose the Apache flood httpd testing tool to evaluate how quickly both the non-emulated and emulated versions of Apache would respond to and process requests. In our experiments, we chose to measure performance by the total number of requests processed, as reflected in Figures 3 and 4. The value for total number of requests per second is extrapolated (by flood’s reporting tool) from a smaller number of requests sent and processed within a smaller time slice; the value should not be interpreted to mean that our test Apache instances and our test hardware actually served some 6000 requests per second.

4.3.2 Emulation of Apache Inside Valgrind

To get a sense of the performance degradation imposed by running the entire system inside an emulator other than STEM, we tested Apache running in Valgrind version 2.0.0 on the Linux test machine that hosted Apache for our STEM test trials.

Valgrind has two notable features that improve performance over our full emulation of the main request loop. First, Valgrind maintains a 14 MB cache of translated instructions, which are executed natively after the first time they are emulated, while STEM always translates each encountered instruction. Second, Valgrind performs some internal optimizations to avoid redundant load, store, and register-to-register move operations.

We ran Apache under Valgrind with the default skin Memcheck, tracing all children processes. While Valgrind performed better than our emulation of the full request processing loop, it did not perform as well as our emulated slices, as shown in Figure 3 and the timing performance in Table 1.

Finally, the Valgrind-ized version of Apache is 10 times the size of the regular Apache image, while Apache with STEM is not noticeably larger.

4.3.3 Full Emulation and Baseline Performance

We demonstrate that emulating the bulk of an application entails a significant performance impact. In particular, we emulated the main request processing loop for Apache (contained in ap_process_http_connection()) and compared our results against a non-emulated Apache instance. In this experiment, the emulator executed roughly 213,000 instructions. The impact on performance is clearly seen in Figure 3 and further elucidated in Figure 4, which plots the performance of the fully emulated request-handling procedure.

[Figure 4 plot: requests per second vs. number of client threads for the fully emulated main loop (libtasvm-mainloop).]

Figure 4: A closer look at the performance of the fully emulated version of the main processing loop. While there is a considerable performance impact compared to the non-emulated request handling loop, the emulator appears to scale at the characteristic linear rate, indicating that it does not create additional overhead beyond the cost of emulation.

In order to get a more complete sense of this performance impact, we timed the execution of the request handling procedure for both the non-emulated and fully emulated versions of Apache by embedding calls to gettimeofday() where the emulation functions were (or would be) invoked.

For our test machines and sample loads, Apache normally (i.e., non-emulated) spent 6.3 milliseconds to perform the work in the ap_process_http_connection() function, as shown in Table 1. The fully instrumented loop running in the emulator spends an average of 278 milliseconds per request in that particular code section. For comparison, we also timed Valgrind’s execution of this section of code; after a large initial cost (to perform the initial translation and fill the internal instruction cache), Valgrind executes the section with a 34 millisecond average. These initial costs sometimes exceeded one or two seconds; we ignore them in our data and measure Valgrind only after it has stabilized.

[Figure 3 plot: requests per second vs. number of client threads for apache-mainloop, libtasvm-mainloop, libtasvm-parse-uri, libtasvm-header-parser, and valgrind-apache.]

Figure 3: Performance of the system under various levels of emulation. This data set includes Valgrind for reference. While full emulation is fairly expensive, selective emulation of input handling routines appears quite sustainable. Valgrind runs better than STEM when executing the entire request loop. As expected, selective emulation still performs better than Valgrind.

Apache      Trials   Mean      Std. Dev.
Normal      18       6314      847
STEM        18       277927    74488
Valgrind    18       34192     11204

Table 1: Timing of the main request processing loop. Times are in microseconds. This table shows the overhead of running the whole primary request handling mechanism inside the emulator. In each trial, a user thread issued an HTTP GET request.
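As a quick sanity check on Table 1, the relative slowdowns follow directly from the reported means:

```python
# Back-of-the-envelope check of Table 1: slowdown of the main request
# loop under STEM and Valgrind relative to the uninstrumented server.
# Figures are the reported means, in microseconds.

means_us = {"Normal": 6314, "STEM": 277927, "Valgrind": 34192}

slowdown = {name: t / means_us["Normal"] for name, t in means_us.items()}
print(f"STEM:     {slowdown['STEM']:.1f}x")      # ~44x for full emulation
print(f"Valgrind: {slowdown['Valgrind']:.1f}x")  # ~5.4x with caching
```

Full emulation of the loop in STEM is thus roughly 44x slower than native execution, while Valgrind’s cached translation brings the same section to about 5.4x.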

4.3.4 Selective Emulation

In order to identify possibly vulnerable sections of code in Apache 2.0.49, we used the RATS tool. The tool identified roughly 270 candidate lines of code, the majority of which contained fixed-size local buffers. We then correlated the entries on the list with code that was in the primary execution path of the request processing loop. The two functions that we measured perform work on input that is under client control, and are thus likely candidates for attack vectors.

The main request handling logic in Apache 2.0.49 begins in the ap_process_http_connection() function. The effective work of this function is carried out by two subroutines: ap_read_request() and ap_process_request(). The ap_process_request() function is where Apache spends most of its time during the handling of a particular request. In contrast, the ap_read_request() function accounts for a smaller fraction of the request handling work. We chose to emulate subroutines of each function in order to assess the impact of selective emulation.

We constructed a partial call tree and chose the ap_parse_uri() function (invoked via read_request_line() in ap_read_request()) and the ap_run_header_parser() function (invoked via ap_process_request_internal() in ap_process_request()). The emulator processed approximately 358 and 3229 instructions, respectively, for these two functions. In each case, the performance impact was, as expected, much less than the overhead incurred by needlessly emulating the entire work of the request processing loop.
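The intuition behind this result is that selective emulation confines the emulator to a tiny fraction of the loop’s work. Using the instruction counts reported above:

```python
# Rough illustration of why selective emulation is cheap: only a small
# fraction of the request loop's instructions run under the emulator.
# Counts are those reported in the text.

full_loop = 213_000                      # ap_process_http_connection()
selected = {"ap_parse_uri": 358, "ap_run_header_parser": 3229}

fraction = sum(selected.values()) / full_loop
print(f"{fraction:.1%} of the loop's instructions emulated")  # ~1.7%
```

Even if emulated instructions cost tens of times more than native ones, emulating under 2% of the loop keeps the overall overhead modest, which matches the selective-emulation curves in Figure 3.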

4.3.5 Microbenchmarks

Using the client machine from the Apache performance tests, we ran a number of micro-benchmarks to gain a broader view of the performance impact of STEM. We selected some common shell utilities and measured their performance for large workloads running both with and without STEM.

For example, we issued an ‘ls -R’ command on the root of the Apache source code with both stderr and stdout redirected to /dev/null in order to reduce the effects of screen I/O. We then used cat and cp on a large file (also with any screen output redirected to /dev/null). Table 2 shows the results of these measurements.

As expected, there is a large impact on performance when emulating the majority of an application. Our experiments demonstrate that emulating only potentially vulnerable sections of code offers a significant advantage over emulation of the entire system.

5 Related Work

Modeling executing software as a transaction that can be aborted has been examined in the context of language-based runtime systems (namely, Java) in [28, 27]. That work, focused on safely terminating misbehaving threads, introduces the concept of “soft termination.” Soft termination allows thread termination while preserving the stability of the language runtime, without imposing unreasonable performance overheads. In that approach, threads (or codelets) are each executed in self-encompassing transactions, applying standard ACID semantics. This allows changes to the runtime’s (and other threads’) state made by the terminated codelet to be rolled back. The performance overhead of that system can range from 200% up to 2300%. Relative to that work, our contribution is twofold. First, we apply the transactional model to an unsafe language such as C, addressing several (but not all) challenges presented by that environment. Second, by selectively emulating, we substantially reduce the performance overhead of the application. However, there is no free lunch: this reduction comes at the cost of allowing failures to occur. Our system aims to automatically evolve a piece of code such that it eventually (i.e., once an attack has been observed, possibly more than once) does not succumb to attacks.

Oplinger and Lam [23] propose another transactional approach to improve software reliability. Their key idea is to employ thread-level speculation (TLS) and execute an application’s monitoring code in parallel with the primary computation. The computation “transaction” is rolled back depending on the results of the monitoring code.

Virtual machine emulation of operating systems or processor architectures to provide a sandboxed environment

is an active area of research. Virtual machine monitors (VMMs) are employed in a number of security-related contexts, from autonomic patching of vulnerabilities [29] to intrusion detection [14].

Other protection mechanisms include compiler techniques like StackGuard [13] and safer libraries, such as libsafe and libverify [4]. Other tools exist to verify and supervise code during development or debugging. Of these tools, Purify and Valgrind [21] are popular choices. Valgrind is a program supervision framework that enables in-depth instrumentation and analysis of IA-32 binaries without recompilation. Valgrind has been used by Barrantes et al. [5] to implement instruction-set randomization techniques to protect programs against code insertion attacks. Other work on instruction-set randomization includes [16], which employs the i386 emulator Bochs.

Program shepherding [19] is a technique developed by Kiriansky, Bruening, and Amarasinghe. The authors describe a system based on the RIO [11] architecture for protecting and validating control flows according to some security policy, without modification of IA-32 binaries for Linux and Windows. The system works by validating branch instructions and storing the decision in a cache, thus incurring little overhead.

The work by Dunlap, King, Cinar, Basrai, and Chen [12] is closely related to the work presented in this paper. ReVirt is a system implemented in a VMM that logs detailed execution information. This detailed execution trace includes non-deterministic events such as timer interrupt information and user input. Because ReVirt is implemented in a VMM, it is more resistant to attack or subversion. However, ReVirt’s primary use is as a forensic tool to replay the events of an attack, while the goal of STEM is to provide a lightweight and minimally intrusive mechanism for protecting code against malicious input at runtime.

King, Dunlap, and Chen [18] discuss optimizations that reduce the performance penalties involved in using VMMs.
There are three basic optimizations: reduce the number of context switches by moving the VMM into the kernel; reduce the number of page faults by allowing each VMM process greater freedom in allocating and maintaining address space; and ameliorate the penalty for switching between guest kernel mode and guest user mode by simply changing the bounds on the guest memory area rather than re-mapping.

An interesting application of ReVirt [12] is BackTracker [17], a tool that can automatically identify the steps involved in an intrusion. Because detailed execution information is logged, a dependency graph can be constructed backward from the detection point to provide forensic information about an attack.

Test Type      Trials   Mean (s)   Std. Dev.   Min     Max      Instr. Emulated
ls (non-emu)   25       0.12       0.009       0.121   0.167    0
ls (emu)       25       42.32      0.182       42.19   43.012   18,000,000
cp (non-emu)   25       16.63      0.707       15.80   17.61    0
cp (emu)       25       21.45      0.871       20.31   23.42    2,100,000
cat (non-emu)  25       7.56       0.05        7.48    7.65     0
cat (emu)      25       8.75       0.08        8.64    8.99     947,892

Table 2: Micro-benchmark performance times for various command line utilities.

Toth and Kruegel [32] propose to detect buffer overflow payloads (including previously unseen ones) by treating inputs received over the network as code fragments. They show that legitimate requests will appear to contain relatively short sequences of valid x86 instruction opcodes, compared to attacks, which will contain long sequences. They integrate this mechanism into the Apache web server, resulting in a small performance degradation.

Some interesting work has been done to deal with memory errors at runtime. For example, Rinard et al. [25] have developed a compiler that inserts code to deal with writes to unallocated memory by automatically expanding the target buffer. Such a capability aims toward the same goal as our system: providing a more robust fault response rather than simply crashing. The technique presented in [25] is modified in [26] and introduced as failure-oblivious computing. The behavior of this technique is close to that of our system.

One of the most critical concerns with recovering from software faults and vulnerability exploits is ensuring the consistency and correctness of program data and state. An important contribution in this area is presented by Demsky [10], which discusses mechanisms for detecting corrupted data structures and fixing them to match some pre-specified constraints. While the precision of the fixes with respect to the semantics of the program is not guaranteed, their test cases continued to operate when faults were randomly injected.

Suh et al. [31] propose a hardware-based solution that can be used to thwart control-transfer attacks and restrict executable instructions by monitoring “tainted” input data. In order to identify “tainted” data, they rely on the operating system. If the processor detects the use of this tainted data as a jump address or an executed instruction, it raises an exception that can be handled by the operating system.
The authors do not address the issue of recovering program execution and suggest the immediate termination of the offending process.

DIRA [30] is a technique for automatic detection, identification, and repair of control-hijacking attacks. This solution is implemented as a GCC compiler extension that transforms a program’s source code and adds heavy instrumentation so that the resulting program can perform these tasks. The use of checkpoints throughout the program ensures that corruption of state can be detected if control-sensitive data structures are overwritten. Unfortunately, the performance implications of the system make it unusable as a front-line defense mechanism.

Newsome and Song [22] propose dynamic taint analysis for automatic detection of overwrite attacks. Tainted data is monitored throughout the program execution, and modified buffers with tainted information will result in protection faults. Once an attack has been identified, signatures are generated using automatic semantic analysis. The technique is implemented as an extension to Valgrind and does not require any modifications to the program’s source code, but it suffers from severe performance degradation.

While our prototype x86 emulator is a fairly straightforward implementation, it can gain further performance benefits by using Valgrind’s technique of caching already-translated instructions. With some further optimizations, STEM is a viable and practical approach to protecting code. In fact, [6] outlines several ways to optimize emulators; their approaches reduce the performance overhead (as measured by two SPEC2000 benchmarks, crafty and vpr) from a factor of 300 to about 1.7. Their optimizations include caching basic blocks (essentially what Valgrind does), linking direct and indirect branches, and building traces.

6 Conclusions

Software errors and the concomitant potential for exploitable vulnerabilities remain a pervasive problem. Accepted approaches to this problem are almost always proactive, but it seems unlikely that such strategies will result in error-free code. In the absence of such guarantees, reactive techniques for error toleration and recovery can be powerful tools.

We have described a lightweight mechanism for super-


[14] T. Garfinkel and M. Rosenblum. A Virtual Machine Introspection Based Architecture for Intrusion Detection. In 10th ISOC Symposium on Network and Distributed Systems Security (SNDSS), February 2003.

[15] T. Jim, G. Morrisett, D. Grossman, M. Hicks, J. Cheney, and Y. Wang. Cyclone: A Safe Dialect of C. In Proceedings of the USENIX Annual Technical Conference, pages 275–288, June 2002.

[16] G. S. Kc, A. D. Keromytis, and V. Prevelakis. Countering Code-Injection Attacks With Instruction-Set Randomization. In 10th ACM Conference on Computer and Communications Security (CCS), October 2003.

[17] S. T. King and P. M. Chen. Backtracking Intrusions. In 19th ACM Symposium on Operating Systems Principles (SOSP), October 2003.

[18] S. T. King, G. Dunlap, and P. Chen. Operating System Support for Virtual Machines. In Proceedings of the USENIX Annual Technical Conference, June 2003.

[19] V. Kiriansky, D. Bruening, and S. Amarasinghe. Secure Execution Via Program Shepherding. In Proceedings of the 11th USENIX Security Symposium, August 2002.

[20] D. Mosberger and T. Jin. httperf: A Tool for Measuring Web Server Performance. In First Workshop on Internet Server Performance, pages 59–67. ACM, June 1998.

[21] N. Nethercote and J. Seward. Valgrind: A Program Supervision Framework. In Electronic Notes in Theoretical Computer Science, volume 89, 2003.

[22] J. Newsome and D. Song. Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software. In The 12th Annual Network and Distributed System Security Symposium, February 2005.

[23] J. Oplinger and M. S. Lam. Enhancing Software Reliability with Speculative Threads. In Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS X), October 2002.

[24] N. Provos. Improving Host Security with System Call Policies. In Proceedings of the 12th USENIX Security Symposium, pages 257–272, August 2003.

[25] M. Rinard, C. Cadar, D. Dumitran, D. Roy, and T. Leu. A Dynamic Technique for Eliminating Buffer Overflow Vulnerabilities (and Other Memory Errors). In Proceedings of the 20th Annual Computer Security Applications Conference (ACSAC), December 2004.

[26] M. Rinard, C. Cadar, D. Dumitran, D. Roy, T. Leu, and W. Beebee, Jr. Enhancing Server Availability and Security Through Failure-Oblivious Computing. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI), December 2004.

[27] A. Rudys and D. S. Wallach. Transactional Rollback for Language-Based Systems. In ISOC Symposium on Network and Distributed Systems Security (SNDSS), February 2001.

[28] A. Rudys and D. S. Wallach. Termination in Language-based Systems. ACM Transactions on Information and System Security, 5(2), May 2002.

[29] S. Sidiroglou and A. D. Keromytis. A Network Worm Vaccine Architecture. In Proceedings of the IEEE Workshop on Enterprise Technologies: Infrastructure for Collaborative Enterprises (WETICE), Workshop on Enterprise Security, pages 220–225, June 2003.

[30] A. Smirnov and T. Chiueh. DIRA: Automatic Detection, Identification, and Repair of Control-Hijacking Attacks. In The 12th Annual Network and Distributed System Security Symposium, February 2005.

[31] G. E. Suh, J. W. Lee, D. Zhang, and S. Devadas. Secure Program Execution via Dynamic Information Flow Tracking. SIGOPS Oper. Syst. Rev., 38(5):85–96, 2004.

[32] T. Toth and C. Kruegel. Accurate Buffer Overflow Detection via Abstract Payload Execution. In Proceedings of the 5th Symposium on Recent Advances in Intrusion Detection (RAID), October 2002.

[33] N. Wang, M. Fertig, and S. Patel. Y-Branches: When You Come to a Fork in the Road, Take It. In Proceedings of the 12th International Conference on Parallel Architectures and Compilation Techniques, September 2003.

[34] J. Yin, J.-P. Martin, A. Venkataramani, L. Alvisi, and M. Dahlin. Separating Agreement from Execution for Byzantine Fault Tolerant Services. In Proceedings of ACM SOSP, October 2003.