
Graduated QoS by Decomposing Bursts: Don’t Let the Tail Wag Your Server

Lanyue Lu and Peter Varman, Rice University, USA

Email: [email protected], [email protected]

Kshitij Doshi, Intel Corporation, USA

Email: [email protected]

ABSTRACT

The growing popularity of hosted storage services and shared storage infrastructure in data centers is driving the recent interest in resource management and QoS in storage systems. The bursty nature of storage workloads raises significant performance and provisioning challenges, leading to increased infrastructure, management, and energy costs. We present a novel dynamic workload shaping framework to handle bursty workloads, where the arrival stream is dynamically decomposed to isolate its bursts, and then rescheduled to exploit available slack. We show how decomposition reduces the server capacity requirements dramatically while affecting QoS guarantees minimally. We present an optimal decomposition algorithm RTT and a recombination algorithm Miser, and show the benefits of the approach by performance evaluation using several storage traces.

1. Introduction

The increasing complexity of managing stored data and the economic benefits of consolidation are driving storage systems towards a service-oriented paradigm, in which personal and corporate clients purchase storage space and access bandwidth to store and retrieve their data. This paper deals with issues of performance and provisioning of server resources in storage data centers. In a typical setup, Service Level Agreements (SLAs) between the service provider and clients stipulate guarantees on throughput [1], [2] or latency [3], [4] for rate-controlled clients. The service provider must provision sufficient resources to meet these performance guarantees based on estimates of the resource demands of the individual clients, and the aggregate capacity requirements of the client mix admitted into the system. The run-time scheduler must isolate the individual clients from each other so that they receive their reservations without interference from misbehaving clients with demand overruns, and schedule their requests on the server appropriately [5].

A fundamental challenge in data center operations is the need to deal effectively with high-variance bursty workloads arising in network and storage server traffic [6], [7], [8]. These workloads are characterized by unpredictable bursty periods during which the instantaneous arrival rates can significantly exceed the average long-term rate. In the absence of explicit mechanisms to deal with them, the effects of these bursts are not confined to the localized regions where they occur, but spill over and affect otherwise well-behaved regions of the workload as well. As a consequence, although the bursty portion may be only a small fraction of the entire workload, it has a disproportionate effect on performance and provisioning decisions. This "tail wagging the dog" situation results in the server being forced to make unduly conservative estimates of resource requirements, resulting in excessive provisioning and energy consumption costs, and unnecessary throttling of the number of clients admitted into the system.

In this paper we present a novel approach to improving client performance and slimming resource provisioning at the server. In our approach we modify the characteristics of the arriving workload so that its behavior is dominated by the majority well-behaved portions of the request stream; the portions of the workload comprising the tail are identified and isolated so that their effects are localized. This results in more predictable behavior, and significantly lower resource requirements. Consequently, the performance SLA is specified by a distribution of response times rather than a single worst-case measure. By slightly relaxing the performance requirements, a significant reduction in server capacity can be achieved while maintaining stringent QoS guarantees for most of the workload. The server can pass on these savings by providing a variety of SLAs and pricing options to the client. Storage service subscribers that have highly streamlined request behavior, and who therefore require negligible surplus capacity in order to meet their deadlines, can be offered service on concessional terms as a reward for their "well-behavedness".

This paper makes the following specific contributions. We present a new framework for run-time scheduling of a client's workload based on decomposition and recombination of the request stream. This reshaped workload helps localize the effects of bursts so that a large percentage of the workload has superior response time guarantees, while keeping the behavior of the tail comparable to that achieved by traditional methods. The resource requirements for the reshaped workloads are shown to be significantly lower than those for the original workload, since they are closer to the average rather than the worst-case requirement. This translates into reductions in provisioned capacity, and reduced energy consumption as well. Finally, we show how the framework can be used to improve estimates of the aggregate resource requirements of multiple concurrent clients. Due to statistical variations, the peak inputs of the workloads are unlikely to line up simultaneously. Estimates based on simple aggregation of the requirements of each client therefore tend to overestimate the requirements significantly, but estimating the savings due to statistical multiplexing is difficult [9]. We show that aggregation based on the capacity of the reshaped workload provides more realistic estimates of resource requirements, compared to dealing with the unshaped workload.

The rest of the paper is organized as follows. Section 2 describes the reshaping framework and provides a high-level description of the decomposition and recombination phases. An optimal decomposition algorithm RTT is described in Section 3 along with several recombination schemes, including a new slack-based algorithm called Miser. Detailed evaluation results are presented in Section 4. Related work is summarized in Section 5, and conclusions are presented in Section 6.

2. Workload Shaping

The goal of workload shaping is to smooth the workload to reduce the unpredictability caused by bursty arrival patterns, which makes capacity planning difficult and degrades performance. Although the average utilization of the system tends to be low, the unpredictable bursts of high activity overwhelm server resources, leading to poor performance. With traditional scheduling the effects of these bursts are not confined to the localized regions where they occur, but spill over and affect otherwise well-behaved regions of the workload as well. Consequently, as noted before, a small amount of bursty behavior has a disproportionate effect on performance, provisioning, and admission control decisions. The workload shaping procedure consists of two complementary operations: decomposition and recombination, as shown schematically in Figure 1.

In the decomposition phase, the workload of a single application (or client) is partitioned into two or more classes with different performance guarantees. The requests belonging to the different classes are directed to separate queues. In the scheme shown in Figure 1 there are two classes, identified by queues Q1 and Q2 respectively. In this example, requests belonging to Q1 will be guaranteed a response time R1 while requests in Q2 are served in a best-effort fashion. In the recombination phase the requests of the two classes are multiplexed in a suitable manner to satisfy the individual performance constraints. Different scheduling algorithms which provide different response time distributions can be used in this phase. These will be discussed and evaluated in Sections 3 and 4.

Figure 1. Architecture of the workload shaper providing graduated QoS guarantees

Figure 2(a) shows a portion of an OpenMail trace [10] of I/O requests (displayed using aggregated requests in a time window of 100 ms). Note that the peak request rate is about 4440 IOPS while the average request rate is only about 534 IOPS. Figure 2(b) shows the class Q1 containing 90% of the requests after decomposing the workload using our decomposition algorithm RTT (described later). The capacity of the server is chosen so that all requests in Q1 meet a response time of 10 ms. As may be seen, Q1 is relatively even at this granularity. All requests in Q1 can meet their response time bounds with a capacity of only 1080 IOPS, compared to 9241 IOPS required for the original workload. Finally, Figure 2(c) shows the workload following recombination of Q1 and Q2 using the Miser algorithm (described later). This algorithm monitors the slack in the arrivals to find times where it can schedule a request from Q2 without causing any request in Q1 to miss its deadline, and dispatches a Q2 request at the earliest such time.

2.1. Decomposition and Recombination

The workload is characterized by its arrival sequence, which specifies the number of I/O requests n_i arriving at time a_i, i = 1, ..., N. The Cumulative Arrival Curve (abbreviated AC) A(t) is the total number of I/O requests that arrive during the interval [0, t]; i.e., A(t) = Σ_{j=1}^{i} n_j, where a_i ≤ t < a_{i+1}. Figure 3(a) shows AC as a staircase function with jumps corresponding to the arrival instants. The server provides service at a constant rate of C IOPS as long as there are unfinished requests. The Service Curve (SC) is shown by a line of slope C beginning at the origin during a busy period when the server is continuously busy. At any time, the vertical distance between SC and AC is the number of pending requests (either queued or in service).


[Figure 2 plots Request Rate (IOPS) versus Time (seconds) for the OpenMail trace: (a) the original workload; (b) the 90% of the workload admitted to Q1 after Decomposition; (c) 100% of the workload after Recombination.]

Figure 2. Shaping the OpenMail trace by Decomposition and Recombination

Each request has a response time requirement of δ, so that requests arriving at a_i have a deadline of d_i = a_i + δ. If the number of pending requests exceeds C × δ it signals an overload condition. Since at most C × δ requests can be completed in time δ, some of the requests pending at an overflow instant must necessarily miss their deadlines. In Figure 3(a) the line above and parallel to the Service Curve is an upper bound on the amount of pending work that can still meet its deadline. We call this the Service Curve Limit (SCL).
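To make the model concrete, the following is a minimal Python sketch (ours, not from the paper) that walks a time-sorted arrival sequence, tracks the vertical gap between the arrival and service curves, and flags instants where the pending count exceeds C × δ, i.e., where AC crosses the SCL.

```python
def overload_instants(arrivals, C, delta):
    """arrivals: list of (a_i, n_i) pairs sorted by arrival time (seconds);
    C: service rate in IOPS; delta: deadline in seconds.
    Returns the arrival instants at which pending work exceeds C * delta."""
    overloads = []
    cumulative = 0        # A(a): total arrivals up to and including time a
    served_before = 0     # requests completed before the current busy period
    busy_start = 0.0      # start time of the current busy period
    for a, n in arrivals:
        # Service completed by time a at rate C within the busy period.
        done = served_before + C * (a - busy_start)
        if done >= cumulative:       # server caught up: a new busy period
            served_before = cumulative
            busy_start = a
            done = cumulative
        cumulative += n              # the batch arriving at time a
        pending = cumulative - done  # vertical gap between AC and SC
        if pending > C * delta:      # AC is above the Service Curve Limit
            overloads.append((a, pending))
    return overloads

# Example: C * delta = 5 outstanding requests; the burst of 50 at t = 0.1
# overflows the SCL while the first small batch does not.
print(overload_instants([(0.0, 5), (0.1, 50)], C=100, delta=0.05))
```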

The operation of a decomposition algorithm can be described easily with respect to the Service Curve Limit. The goal is to identify requests to drop from the workload (in actuality, dropped requests are merely moved to Q2 and served from there). Consider time instants like 2 and 3 in Figure 3(a) where the AC exceeds the SCL. From the previous discussion, requests exceeding the SCL cause an overload condition, and some requests must be dropped in order for the rest of the requests to meet their deadline. If requests are dropped from the workload, the AC shifts down by an amount equal to that removed. This is shown in Figure 3(b), which shows the situation following the removal of one request at time 1 and another at time 2. As can be seen, the modified AC lies below the SCL, which means that all requests in the new AC will meet their deadlines. A different choice of two requests to remove is shown in Figure 3(c), where one request each at times 2 and 3 is removed. One can argue that for the given capacity and response time requirements, at least two requests in this workload must necessarily miss their deadlines (as in the two choices above). On the other hand, dropping two requests at time 1 is a poor choice, since a request arriving at time 3 will still miss its deadline. Note also that the decomposition method needs to be online, in that it must decide whether or not to drop a request based on past inputs only, without knowing the future pattern of requests. We show in Section 3 that our decomposition algorithm RTT satisfies these properties: it is online and minimizes the number of dropped requests for a given capacity and deadline.

We now describe the operation of a recombination algorithm. The goal is to service the overflowing requests that have been placed in Q2 concurrently with the guaranteed requests in Q1. For instance, in Figure 3(d) the two requests that were dropped at times 2 and 3 are scheduled from Q2 at times 4 and 5 when there is slack in the server. Several strategies with different tradeoffs can be employed for the recombination. One simple approach is to offload the overflowing requests to a separate physical server where they can be serviced without interfering with the guaranteed traffic (this is similar in principle to the write offloading strategy in [11], where bursts of write requests are distributed to a number of low-utilization disks for service). In cases where this offloading is not feasible, perhaps due to the lack of a suitable off-load server or the need for dedicated resources available only on the main server, a good strategy is to treat the two parts of the workload independently and multiplex them on the server using a Fair Queuing scheduler to keep them isolated. This approach actually has significant capacity benefits over the dedicated offload server approach (as we show in Section 4), due to the benefits of statistical multiplexing. Since the overflow workload is active only during bursts, the capacity during its idle periods can be profitably used by the guaranteed portion of the workload to improve its response time profile. We also propose a new slack-based scheduling algorithm to combine the two portions of the workload. This method, called Miser, allows better shaping of the tail of the workload than a Fair Queueing scheduler, but may in the worst case slightly increase the fraction of requests missing their deadlines, unless a small amount of additional capacity is provided.
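To illustrate the fair-queuing option, here is a toy proportional-share multiplexer sketch of ours; the paper's evaluation uses full schedulers such as WF²Q, SFQ, or pClock, so this simplified start-time-fair variant (unit-cost requests, two weights) is only an assumption-laden stand-in.

```python
import heapq

class ProportionalShare:
    """Dispatch requests from multiple queues in order of virtual start
    tags, so each backlogged queue receives service in proportion to its
    weight (e.g. weights = {"Q1": Cmin, "Q2": delta_C}), while an idle
    queue's unused share flows to the other queue."""
    def __init__(self, weights):
        self.weights = weights
        self.vtags = {q: 0.0 for q in weights}  # next start tag per queue
        self.heap = []                          # (start_tag, seq, queue, req)
        self.seq = 0                            # tie-breaker for the heap
        self.vtime = 0.0                        # global virtual time

    def enqueue(self, queue, request):
        start = max(self.vtags[queue], self.vtime)   # no credit for idling
        self.vtags[queue] = start + 1.0 / self.weights[queue]
        heapq.heappush(self.heap, (start, self.seq, queue, request))
        self.seq += 1

    def dispatch(self):
        if not self.heap:
            return None
        start, _, queue, request = heapq.heappop(self.heap)
        self.vtime = start               # advance virtual time to the tag
        return queue, request
```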

2.2. Capacity Provisioning

In this section we address the issue of how much server capacity needs to be reserved in order to meet a client's requirements. We consider both the cases of provisioning for a single client and for multiple, concurrent clients.

Single Client: We profile the workload to determine the capacity reservation needed to meet a stipulated QoS requirement. That is, given a response time bound δ, find the minimum server capacity Cmin required to guarantee that a given fraction f of the requests of the given workload meets their deadlines. Decomposing the workload in this way results in a much smaller server capacity requirement, while still maintaining a high QoS, albeit less than 100%.

[Figure 3 has four panels: (a) Workload model; (b) one way of dropping requests; (c) RTT-based request dropping; (d) RTT-based recombining.]

Figure 3. Illustrating the Decomposition and Recombination process

Although it is possible to find direct optimization methods for the problem stated, we found that a deterministic search of the solution space provided the answers with low computational overhead for even very large traces. We search the space as follows. For a given C and δ we use the RTT algorithm (detailed in Section 3.1) to find a decomposition that maximizes the number of requests meeting their deadlines. If the fraction meeting the deadline is higher than the required fraction f we reduce the capacity and try again; else we increase the capacity and retry. Since f is a monotonically non-decreasing function of C, a binary search converges rapidly (within O(log C) iterations) to the desired minimum capacity Cmin.
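As an illustration, the following hedged Python sketch performs this search. The fluid-drain simulation of RTT inside it is our simplification (unit requests, service at exactly rate C between arrival instants), not the paper's DiskSim setup, and the helper names are ours.

```python
def fraction_meeting_deadline(arrivals, C, delta):
    """Fraction of requests that RTT admits to Q1 (every admitted request
    meets its deadline). Q1 is capped at maxQ1 = C * delta outstanding
    requests and drains at the fluid rate C between arrival instants."""
    max_q1 = C * delta
    q1 = 0.0
    prev_t = 0.0
    accepted = total = 0
    for t, n in arrivals:                 # (time, batch size), time-sorted
        q1 = max(0.0, q1 - C * (t - prev_t))
        prev_t = t
        for _ in range(n):
            total += 1
            if q1 <= max_q1 - 1:          # room left in the primary queue
                q1 += 1
                accepted += 1
    return accepted / total if total else 1.0

def find_cmin(arrivals, delta, f, lo=1, hi=1 << 20):
    """Smallest integer capacity C (IOPS) for which at least a fraction f
    of the workload meets the bound delta after RTT decomposition.
    Monotonicity of the fraction in C justifies binary search."""
    while lo < hi:
        mid = (lo + hi) // 2
        if fraction_meeting_deadline(arrivals, mid, delta) >= f:
            hi = mid                      # feasible: try less capacity
        else:
            lo = mid + 1                  # infeasible: need more capacity
    return lo
```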

We provision a capacity of Cmin + ΔC, where the latter is used to prevent starvation of the secondary class in Q2. In our experiments an additional capacity of ΔC = 1/δ was found to be sufficient to obtain good performance for the entire workload.

Multiple Concurrent Clients: In a data center environment, the service provider needs to provision sufficient resources for several clients simultaneously sharing the system. Accurate provisioning is an extremely difficult problem, and several approaches based on statistical models of the arrival process have been proposed [9]. A brute-force approach is to estimate the worst-case capacity required for each client and then provision at least that much for each of the clients. This approach results in poor server utilization and overly cautious admission control policies. There are two main problems: first, the worst-case capacity requirements of a client are usually several times the average demand; secondly, adding the individual capacity requirements presumes that the worst cases of all the individual workloads line up simultaneously, an extremely unlikely situation in practice.

We argue that aggregation of the capacity requirements of reshaped clients provides a good estimate of the server capacity needed for handling multiple concurrent clients. That is, the capacity required by the reshaped workload serves as a measure of the "effective bandwidth" [9] of the client. We evaluate this in Section 4 and show that using the aggregated effective bandwidths provides a very good estimate of the required server capacity.
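A sketch of this estimate, reusing the find_cmin routine sketched earlier in this section (a name of ours, not the paper's): the shared-server estimate is simply the sum of the per-client reshaped capacities.

```python
def aggregate_capacity(client_workloads, delta, f):
    """Estimate the shared-server capacity as the sum of each client's
    'effective bandwidth', i.e., the Cmin of its reshaped workload."""
    return sum(find_cmin(w, delta, f) for w in client_workloads)
```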

3. Workload Shaping Algorithms

The system model is shown in Figure 1. The scheduler maintains two queues Q1 and Q2. The primary queue Q1 has bounded length to control the latencies of requests accepted into it. The overflow queue Q2 holds requests that are not accepted into Q1 because their latency cannot be guaranteed. The server has a capacity C, and the response time bound for the requests in the primary queue is δ.

3.1. RTT Decomposition

Algorithm 1: RTT Decomposition

RTT_Decompose():
begin
    maxQ1 = C × δ
    if lenQ1 ≤ maxQ1 − 1 then
        Add request to Q1
        Increment lenQ1
    else
        Add request to Q2
end

The decomposition algorithm RTT, shown in Algorithm 1, is used to partition the requests dynamically into the two queues. The algorithm is extremely simple. If the arriving request would cause the length of the primary queue Q1 (lenQ1) to exceed its maximum length (maxQ1), the request is diverted to the overflow queue; else it joins the end of the primary queue. The maximum length of Q1 is maxQ1 = C × δ. Despite its simplicity, RTT satisfies the following optimality property:

For a given workload, capacity, and response time bound,RTT identifies a maximal-sized set of requests that canmeet the deadline.
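The following Python rendering of Algorithm 1 is a sketch of ours: it tags each request with the queue RTT assigns it, using the same fluid-drain simplification as the earlier capacity-search sketch.

```python
def rtt_decompose(arrivals, C, delta):
    """arrivals: time-sorted list of (time, count). Returns one
    (time, queue) tag per request, with queue in {"Q1", "Q2"}."""
    max_q1 = C * delta                 # bound on outstanding Q1 requests
    q1 = 0.0
    prev_t = 0.0
    tags = []
    for t, n in arrivals:
        q1 = max(0.0, q1 - C * (t - prev_t))   # Q1 drains at rate C
        prev_t = t
        for _ in range(n):
            if q1 <= max_q1 - 1:       # admitting keeps lenQ1 within maxQ1
                q1 += 1
                tags.append((t, "Q1"))
            else:
                tags.append((t, "Q2")) # overflow: served best-effort
    return tags
```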


To show RTT optimality, we first show that in any period in which RTT is continuously busy, the number of requests it drops is the minimum possible. Lemma 1 shows a lower bound on the number of dropped requests in any interval, and Lemma 2 shows that RTT matches that bound in a busy period. Following this, we consider an arbitrary period of operation in which RTT may alternate between idle and busy periods. We show inductively in Lemma 3 that RTT cumulatively drops no more than a hypothetical (possibly offline) optimal algorithm OPT at the end of any busy period.

Recall from Section 2 that a_i represents a request arrival instant, and A(t) and S(t) represent the cumulative arrivals and service up to some time t. Also, define the function sgn(x) = ⌈x⌉ for x ≥ 0, and sgn(x) = 0 for x < 0. The proofs of the Lemmas have been omitted due to space, but can be found in the technical report available at [12].

Lemma 1: Given server capacity C, a lower bound on the number of requests in the workload that cannot meet their deadlines is given by max_{1≤k≤N} { sgn(A(a_k) − S(a_k + δ)) }.

Lemma 2: In any busy period [0, a_N], RTT will drop no more than max_{1≤i≤N} { sgn(A(a_i) − S(a_i + δ)) } requests.

Let intervals I_1, I_2, ..., I_m be successive busy periods of RTT during the time [0, T]. The following Lemma can be proved by induction.

Lemma 3: Let OPT be an optimal algorithm that drops the minimal number of requests in [0, T]. For all k, 1 ≤ k ≤ m, OPT drops at least Δ_k requests in I_k and incurs an idle period of at least η_k, where Δ_k is the number of requests dropped by RTT in I_k and η_k is the amount of idle time of RTT till the end of I_k.

3.2. Recombining Algorithms

We now describe four methods for combining the workload split by RTT and scheduling it at the server. Their performance evaluation is described in Section 4. FCFS: The requests are not partitioned, and are serviced in FCFS order. This serves as a base case for the evaluation. Split: The requests are partitioned by RTT and the overflow requests in Q2 are served by a separate physical server. The primary server's capacity Cmin is based on profiling the workload, and a small additional amount ΔC is provided to the secondary server. Fair Queueing: The requests are partitioned by RTT and the two queues Q1 and Q2 are served using a proportional-share bandwidth allocator (like WF²Q [13], SFQ [14], pClock [3]) that divides the server capacity in the specified ratio. The total capacity of the server is Cmin + ΔC, but by sharing a single physical server we hope to harness the benefits of statistical multiplexing. Miser: The scheduler uses slack in the scheduling of the primary queue to schedule requests in Q2 as early as possible. Unlike the previous two methods, where the additional capacity ΔC only affected the performance of the requests in Q2, here the two queues are more closely coupled. Being online, the composite algorithm (RTT + Miser) could sometimes drop more than the theoretical minimum number of requests. One can show that if ΔC = Cmin, then this can never occur. Our simulations show that even with a small amount of additional service ΔC = 1/δ, very few requests in Q1 are delayed beyond the deadline in practice, and the tail distribution of Q2 is much better behaved.

Algorithm 2: Miser Scheduling

On a request arrival:
begin
    RTT_Decompose()
    /* Compute slack */
    if request r_i is placed in Q1 then
        r_i.slack = ⌊maxQ1 − lenQ1⌋
        minSlack = min{minSlack, r_i.slack}
end

On a request departure:
begin
    /* Dispatch a request */
    if (minSlack ≥ 1) and (Q2 is not empty) then
        Issue request from Q2 in FIFO order
    else
        Issue request from Q1 in FIFO order
    /* Update slack */
    if the scheduled request r_i is from Q1 then
        if r_i.slack = minSlack then
            minSlack = min_{i ∈ Q1} {r_i.slack}
    else
        for all i ∈ Q1 do
            r_i.slack = r_i.slack − 1
        minSlack = minSlack − 1
end

Algorithm 2 shows the actions taken by Miser on request arrivals and completions. On a request arrival, the routine RTT_Decompose is first invoked to classify the request. If placed in the primary queue, it is assigned a slack value equal to the number of places still available in Q1. A request in the overflow queue Q2 is scheduled when the smallest slack value among the requests in Q1 is at least 1.
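Below is a simplified, runnable rendering of Algorithm 2, again a sketch of ours: it assumes unit-size requests, an integer bound maxQ1 = C × δ, and that dispatch() is invoked once per service completion; it recomputes minSlack on each dispatch rather than maintaining it incrementally.

```python
from collections import deque

class MiserScheduler:
    def __init__(self, C, delta):
        self.max_q1 = int(C * delta)
        self.q1 = deque()        # [request, slack] entries, FIFO order
        self.q2 = deque()        # overflow requests, FIFO order

    def arrive(self, req):
        # RTT decomposition: admit to Q1 only if its deadline is safe.
        if len(self.q1) <= self.max_q1 - 1:
            slack = self.max_q1 - (len(self.q1) + 1)  # free slots left
            self.q1.append([req, slack])
        else:
            self.q2.append(req)

    def dispatch(self):
        # Serve Q2 only when every Q1 request can tolerate the delay.
        min_slack = min((s for _, s in self.q1), default=self.max_q1)
        if min_slack >= 1 and self.q2:
            for entry in self.q1:        # Q2 service consumes one slot,
                entry[1] -= 1            # so every Q1 slack shrinks by 1
            return self.q2.popleft()
        if self.q1:
            return self.q1.popleft()[0]
        return self.q2.popleft() if self.q2 else None
```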

4. Experimental Evaluation

In this section, we evaluate the workload-shaping based scheduling framework using the storage system simulation tool DiskSim [15]. We implemented the RTT decomposition algorithm at the device driver level, which intercepts all the incoming requests before they reach the underlying disks. The workload is decomposed by RTT and put into separate queues. When the disk driver needs to dispatch a new request to the disk, our recombining scheduler is called to choose the next request for service.

We use traces of three different storage applications for our evaluation: a Web search engine (WebSearch), an OLTP application (FinTrans), and an email service (OpenMail). The traces are obtained from the UMass Storage Repository [16] and HP Research Labs [10]. All of these are low-level block storage I/O traces, which have not been filtered by the file system cache. The WebSearch traces are from a popular search engine and consist of user web search requests. The FinTrans traces are generated by financial transactions in an OLTP application running at two large financial institutions. The OpenMail traces are collected from HP email servers during the servers' busy periods.

                          Percentage of Workload Meeting Response Time Target
Workload         Target    90.0%   95.0%   99.0%   99.5%   99.9%    100%
WebSearch (WS)   5 ms        590     711     960    1055    1310    2325
                 10 ms       417     474     603     658     786    1538
                 20 ms       345     388     462     487     540     900
                 50 ms       328     363     419     437     467     533
FinTrans (FT)    5 ms        400     550     600     800    1000    3000
                 10 ms       200     300     360     400     500    1500
                 20 ms       150     168     216     236     280     750
                 50 ms       119     138     172     184     209     330
OpenMail (OM)    5 ms       1350    2000    3950    4800    6600   13990
                 10 ms      1080    1595    2965    3550    4860    9241
                 20 ms       900    1326    2361    2740    3480    5766
                 50 ms       745    1045    1805    2050    2495    3656

Table 1. Capacity (IOPS) required for the specified Workload Fraction to meet the Response Time target

We conducted four types of experiments: (i) measuring server capacity requirements as a function of the fraction f of requests that are guaranteed a response time δ; (ii) measuring the response time distribution obtained by a traditional FCFS scheduler that does not decompose the workload; (iii) comparing the response time distributions of the recombination algorithms Split, Fair Queueing, and Miser with FCFS and relative to each other; (iv) estimating capacity for multiple concurrent clients using the decomposition framework. For reasons of space only a representative sample of the results is presented here; more results are in our technical report [12].

4.1. Capacity-QoS Tradeoffs

Avoiding resource over-provisioning is a difficult problem due to the unpredictable bursty behavior of real workloads. This set of experiments explores the tradeoffs between the fraction f of the workload that is guaranteed to meet a specified response time bound δ, and the minimum server capacity Cmin required. The case f = 100% gives the minimum capacity required for all the requests to meet the latency bound. As f is relaxed, a smaller capacity should be sufficient. Our results confirm the existence of a sharp knee in the Cmin versus f relation, which shows that a very small percentage of the workload necessitates an overwhelming capacity to meet its guarantees.

Table 1 shows the capacity required to meet response time bounds of 5, 10, 20, and 50 ms for f between 90% and 100% using three different workloads. The capacity required falls off significantly by relaxing the response time guarantee for 1% to 10% of the workload. For instance, with δ = 5 ms, increasing f from 90% to 100% requires large capacity increases: almost 4 times (590 to 2325 IOPS) for WebSearch, 7.5 times (400 to 3000 IOPS) for FinTrans, and more than 10 times (1350 to 13990 IOPS) for OpenMail. Even going from 99% to 100%, the capacity required increases by factors of 2.4 (960 to 2325 IOPS) for WebSearch, 5 (600 to 3000 IOPS) for FinTrans, and 3.5 (3950 to 13990 IOPS) for OpenMail. For higher response times, the capacity required also increases by significant, though smaller, factors. For instance, in OpenMail for δ = 10, 20, and 50 ms respectively, the required capacity increases 8.6, 6.4, and 4.9 times in going from 90% to 100%, and 3.1, 2.4, and 2 times in going from 99% to 100%. The extent of burstiness (and potential for capacity savings) can be gauged by looking at FinTrans, where increasing f from just 99.9% to 100% requires capacity increases by factors of 3.0, 3.0, 2.7, and 1.6 respectively for the four response time bounds.

Summarizing, the experiments clearly indicate that exempting even a small fraction of the workload from the response time guarantees can substantially reduce the capacity that needs to be provisioned. The more aggressive the QoS specifications (lower response time requirements), the greater the savings from relaxing the fraction meeting the guarantee. Even a small percentage of bursts in the workload (such as 0.1%) can require a large amount of resources to guarantee the response time.

4.2. Response Time Distribution of FCFS

We next investigate the effects of the bursts on the response time distribution of the workload. In a shared data center, scheduling across clients may be done using a fair scheduler, and scheduling at the storage array uses some throughput-maximizing ordering of the requests in the low-level queue. However, in the absence of application-specific ordering, the requests of a single client are usually handled in a simple FCFS manner. The following experiments show that the bursts in the workload significantly degrade its response time profile. That is, the effects of bursts are not isolated but affect the behavior of the non-bursty part of the workload; isolation needs to be specifically enforced by a scheduler.

[Figure 4 plots the cumulative fraction of requests versus response time (ms, log scale) under FCFS for the WS, FT, and OM workloads: (a) target (90%, 10 ms) with C = 417, 200, and 1080 IOPS respectively; (b) target (90%, 50 ms) with C = 328, 119, and 745 IOPS; (c) target (95%, 50 ms) with C = 363, 138, and 1045 IOPS.]

Figure 4. Response time CDF of FCFS scheduling

The cumulative response time distribution obtained for the unpartitioned workloads using FCFS scheduling is shown in Figures 4(a) and 4(b) for δ = 10 ms and δ = 50 ms respectively, for three workloads. In each case the capacity is chosen so that 90% of the workload can meet the response time target if it were optimally decomposed using RTT.

In Figure 4(a), with a capacity of 417 IOPS only 54% of the unpartitioned WebSearch workload meets a δ = 10 ms latency bound, compared to 90% compliance for the workload decomposed by RTT. The unpartitioned workload reaches 90% compliance only for δ around 200 ms. Similarly, the unpartitioned OpenMail workload with δ = 10 ms and C = 1080 IOPS achieves only 71% compliance compared to 90% for the partitioned workload, and reaches 90% compliance only at around δ of 90 ms. For the FinTrans workload, a capacity of 200 IOPS resulted in 64% of the unpartitioned workload, and 90% of the partitioned workload, meeting the 10 ms response time bound.

In Figure 4(b), the response time target is relaxed to 50 ms. In this case, for WebSearch only a tiny 5% of the requests meet the 50 ms deadline, compared to 90% of the partitioned workload. For FinTrans and OpenMail the corresponding figures are still a low 29% and 55% of the workload respectively. With a more relaxed response time (50 ms instead of 10 ms), the partitioned workload can meet the 90% compliance target with a smaller capacity; however, for FCFS the smaller capacity causes the queues built up during the bursts to drain more slowly, increasing the response time for the well-behaved part of the workload as well. Figure 4(c) shows the performance of FCFS using the same capacity at which RTT guarantees a 50 ms response time for 95% of the workload. The corresponding percentages of requests meeting the 50 ms bound for WebSearch, FinTrans and OpenMail using FCFS are a low 30%, 57% and 85% respectively.

4.3. Response Time of Shaped Workload

In this section, we evaluate the recombination methods discussed in Section 3.2, namely Split, FairQueueing and Miser, and compare them with the performance of FCFS. In each case the total capacity provided for the workload is held fixed at Cmin + ΔC; ΔC was chosen to be a small amount, 1/δ. FCFS uses the total capacity for the unpartitioned workload. For Split and FairQueueing the capacity is divided in the ratio Cmin to ΔC for Q1 and Q2 respectively. In Split, the servers are not shared. FairQueueing multiplexes the capacity of a single server so that excess capacity can be flexibly moved from one part to the other, while guaranteeing a minimum reservation to each. Miser opportunistically uses the capacity to schedule the overflow requests depending on the amount of available slack.

In Figure 5, we evaluate the scheduling performance for the WebSearch workload with a response time target of 50 ms. We can see that Split and FairQueueing achieve the 90% target of 50 ms response time following decomposition of the workload. Miser, as noted previously, may incur some additional misses, but is still very close to the 90% target, even with just ΔC = 20 IOPS of additional capacity. However, FCFS can only finish 14% of the requests within 50 ms. Furthermore, FCFS has 74% of requests with response times greater than 1000 ms, while Split, FairQueueing and Miser have about 10%. Figure 5(b) shows the performance of these schedulers with percentage target 95% and δ = 50 ms. Split, FairQueueing and Miser still outperform FCFS, providing 95% guarantees of 50 ms response time, while FCFS finishes only 51% within 50 ms. For response times larger than 1000 ms, Split has 4.9%, FairQueueing has 4.1% and Miser has 4.6% of the requests respectively, while FCFS has 17.7%.

Figures 5(a) and 5(b) show that Split, FairQueueing and Miser are better able to guarantee a higher percentage of requests with small deadlines. But Split, FairQueueing and Miser have larger maximum response times than FCFS, because a decomposition-based scheduler will delay requests in Q2 in favor of those in Q1, leading to larger delays for the overflowing requests. But as shown, the total number of long-delay requests (greater than 1 s) is much less than for FCFS, even though the largest value may be higher.

[Figure 5 plots the fraction of requests with response times ≤50, ≤100, ≤500, ≤1000, and >1000 ms for FCFS, Split, FairQueue, and Miser on WebSearch: (a) target (90%, 50 ms) at 328+20 IOPS; (b) target (95%, 50 ms) at 363+20 IOPS; (c) best-effort request performance: average and maximum response times of Miser normalized to FairQueue at target percentages 90 and 95.]

Figure 5. Performance comparison of FCFS, Split, Fair Queuing and Miser: WebSearch workload

Finally, we compare the performance of Split, FairQueueing and Miser (see Figure 5(c)). For Split there is no capacity sharing between the two classes, resulting in average and maximal response times for Q2's requests that are an order of magnitude larger than those of FairQueueing and Miser. FairQueueing assigns the weighted capacity to the two classes; Q2 can only use the spare capacity of Q1 when the latter has no requests. However, Miser dynamically monitors the slack in Q1, and uses it to improve the performance of Q2's requests. Figure 5(c) shows the average and maximal response times of the requests in Q2 attained by Miser, normalized to those of FairQueueing in the above experiments. We can see that for WebSearch the average response time of the secondary class under Miser is about 85% - 90% of FairQueueing's, while the maximal response time is roughly 85% of FairQueueing's.

4.4. Multi-flow Consolidation

In a shared server environment, resource provisioning is usually hard to predict because of the bursty nature of the workloads. A straightforward aggregation of the reservation requirements of each client provides a simple estimate of the capacity requirements, but tends to severely overestimate the capacity, since it assumes strong correlation between the bursts of different clients. We evaluate the resource requirements for combinations of workloads with δ = 10 ms, and compare them with the estimated value, which is the sum of the individual capacities of the workloads. Figure 6(a) shows the results when combining different pairs of the three workloads. For WebSearch and FinTrans, the actual capacity needed is only 53% of the estimate, indicating considerable multiplexing gains in the combination. For FinTrans and OpenMail, and OpenMail and WebSearch, the actual capacity needed is 86% and 87% of the estimate respectively. The reason for these high ratios is that the capacity needed individually by OpenMail (9241 IOPS) dominates that required by WebSearch (1538 IOPS) and FinTrans (1500 IOPS).

To avoid overprovisioning and provide a good estimate of the required capacity, we argue that capacity provisioning based on workload decomposition works well in practice. In Figures 6(b) and 6(c), we report the capacity requirements based on decompositions of 90% and 95%, with a response time guarantee of 10 ms, for the same workload combinations as in Figure 6(a). We can see that after decomposition, the capacity estimate based on adding the individual capacity requirements is very close to the actual capacity needed, with an error of 0.3% for WebSearch + FinTrans, an error of 0.05% for FinTrans + OpenMail, and an error of 0.7% for OpenMail + WebSearch. Similar results hold for 95%, with relative errors of 6.2%, 2.6% and 0.1% for WebSearch + FinTrans, FinTrans + OpenMail and OpenMail + WebSearch respectively. By removing the high-variance portion of the individual workloads, the simple aggregation of the decomposed workloads provides a very good estimate for the combined workload.

[Figure 6 plots estimated versus real capacity (IOPS) for the workload pairs WS + FT, FT + OM, and OM + WS: (a) traditional 100% combination; (b) combination after 90% decomposition; (c) combination after 95% decomposition.]

Figure 6. Capacity required for workload multiplexing with a response time target of 10 ms

5. Related Work

Recently proposed QoS schedulers for storage servers [1], [2], [3], [4], [17] are generally based on Fair Queuing [13], [14], [18] principles, combined with throughput-enhancing mechanisms to exploit locality and concurrency in the storage arrays. These works do not explicitly address the issue of efficiency in resource provisioning, simply requiring the server to be provisioned for the worst-case traffic of each client based on its SLA. As shown in our results, this requires significant resource over-provisioning to account for the unpredictable burst-arrival pattern. Our scheduling framework differs from the above works by provisioning based on the well-behaved portion that comprises the overwhelming majority of the workload rather than the worst-case bursts, and by dynamically decomposing each client workload to conform to the provisioning.

A considerable amount of previous work has been devoted to designing optimal size-aware schedulers to improve performance [19], [20], [21] in Web servers. The basic idea is to separate jobs in terms of their size to avoid having short jobs get stuck behind long ones. The SRPT scheduler [19] gives preference to jobs or requests with short remaining processing times to improve the mean response time of Web servers. In a clustered server environment, D_EQAL [21] utilizes a size-based policy to assign jobs to different servers in terms of size distribution, and further enhances this by considering the autocorrelation property of the workload to deliberately unbalance the load to improve performance. Swap [20] also leverages the size-autocorrelation property of the jobs to simulate the Shortest Job First scheduler online and delay the long jobs in preference to short ones. Our scheduling framework is designed for storage systems, where the request size is not as diverse as in Web applications: big requests are already partitioned by the OS or storage device driver into smaller-sized block requests, such as up to 32 KB. Our work differs from the above works by considering the arrival-autocorrelation (bursty arrival rates) property of the workloads, and proposing to decompose the workload into different classes dynamically, based on their burst characteristics, to improve resource efficiency and performance.

A third body of related work can be found in the considerable literature on network QoS [22], where traffic shaping is used to tailor the workloads to fit QoS-based SLAs. Typically, arriving network traffic is made to conform to a token-bucket model by monitoring the arrivals and dropping requests that do not conform to the bucket parameters of the SLA. Alternatively, early detection of overload conditions is used to create back pressure to throttle the sources [23]. In storage workloads, request dropping is not a viable option since the protocols do not support automatic retry mechanisms, and throttling is difficult in an open system and can lead to loss of throughput in disks and storage arrays. Techniques leveraging statistical envelopes have been proposed [9] to reshape inbound traffic and to allocate resources in network systems in order to achieve probabilistically bounded service delays, while simultaneously multiplexing system resources among the requesters to achieve higher utilizations.

6. Conclusion and Future Work

We addressed the problem of response time degradation in storage servers caused by the bursty nature of many storage workloads. Since the arrival rates during a burst can be an order of magnitude or more greater than the long-term average arrival rate, providing worst-case guarantees requires very significant over-provisioning of server resources. Furthermore, even though the bursts make up only a small fraction of the requests, their effects are not isolated but affect even the well-behaved portions of the workload.

We presented a workload shaping framework to address this problem. In our approach, the workload is dynamically decomposed into its bursty and non-bursty portions based on the response time and capacity parameters. By recombining the bursty portions to exploit available slack in the rest of the workload, the entire workload can be scheduled with much smaller capacity and a superior response time distribution. We presented an optimal decomposition algorithm RTT and a slack-scheduling recombination method Miser to do the workload shaping, and evaluated them on several storage traces. The results show significant capacity reductions and better response time distributions over non-decomposed traditional scheduling methods. Finally, we showed how the decomposition could be used to provide more accurate capacity estimates for multiplexing several clients on a shared server, thereby improving admission control decisions.

In future work, we are focusing on how workload shaping affects disk throughput and how to best apply workload shaping in a multi-client environment. Disk throughput depends heavily on the spatial locality in the workload. Thus, during decomposition, we not only need to schedule the requests across different queues to meet response time bounds, but should also try to improve the system throughput by considering data locality in scheduling the queues. In a shared storage environment, each client or application may have different QoS requirements. The workload shaping framework could maintain private primary queues for each client and separate or consolidated overflow queues, while the total bandwidth is shared proportionally among all the clients.


Acknowledgment

Support by NSF grants CNS 0615376 and CNS 0541369 is gratefully acknowledged. We thank the reviewers for their useful comments that helped improve this paper.

References

[1] W. Jin, J. S. Chase, and J. Kaur, "Interposed proportional sharing for a storage service utility," in Proceedings of SIGMETRICS, 2004.

[2] J. Zhang, A. Sivasubramaniam, Q. Wang, A. Riska, and E. Riedel, "Storage performance virtualization via throughput and latency control," in Proceedings of MASCOTS, 2005.

[3] A. Gulati, A. Merchant, and P. Varman, "pClock: An arrival curve based approach for QoS in shared storage systems," in Proceedings of SIGMETRICS, 2007.

[4] C. Lumb, A. Merchant, and G. Alvarez, "Facade: Virtual storage devices with performance guarantees," in Proceedings of FAST, 2003.

[5] M. Aron, P. Druschel, and W. Zwaenepoel, "Cluster reserves: a mechanism for resource management in cluster-based network servers," in Proceedings of SIGMETRICS, 2000.

[6] M. E. Gomez and V. Santonja, "On the impact of workload burstiness on disk performance," in Workload Characterization of Emerging Computer Applications, 2001.

[7] W. E. Leland, M. S. Taqqu, W. Willinger, and D. V. Wilson, "On the self-similar nature of ethernet traffic," IEEE/ACM Transactions on Networking, 1994.

[8] A. Riska and E. Riedel, "Long-range dependence at the disk drive level," in Proceedings of QEST, 2006.

[9] E. W. Knightly and N. B. Shroff, "Admission control for statistical QoS: theory and practice," IEEE Network, 1999.

[10] "Public software (Storage Systems Department at HP Labs)," 2007, http://tesla.hpl.hp.com/publicsoftware/.

[11] D. Narayanan, A. Donnelly, E. Thereska, S. Elnikety, and A. Rowstron, "Everest: Scaling down peak loads through I/O off-loading," in Proceedings of OSDI, 2008.

[12] Technical report, http://www.ece.rice.edu/~ll2/distribute/miser.pdf.

[13] J. C. R. Bennett and H. Zhang, "WF²Q: Worst-case fair weighted fair queueing," in Proceedings of INFOCOM, 1996.

[14] P. Goyal, H. M. Vin, and H. Cheng, "Start-time fair queueing: a scheduling algorithm for integrated services packet switching networks," IEEE/ACM Transactions on Networking, vol. 5, no. 5, pp. 690–704, 1997.

[15] The DiskSim simulation environment, http://www.pdl.cmu.edu/DiskSim/.

[16] "Storage Performance Council (UMass Trace Repository)," 2007, http://traces.cs.umass.edu/index.php/Storage.

[17] P. J. Shenoy and H. M. Vin, "Cello: a disk scheduling framework for next generation operating systems," in Proceedings of SIGMETRICS, 1998.

[18] A. Demers, S. Keshav, and S. Shenker, "Analysis and simulation of a fair queuing algorithm," Journal of Internetworking Research and Experience, 1990.

[19] M. Harchol-Balter, B. Schroeder, N. Bansal, and M. Agrawal, "Size-based scheduling to improve web performance," ACM Transactions on Computer Systems, 2003.

[20] N. Mi, G. Casale, and E. Smirni, "Scheduling for performance and availability in systems with temporal dependent workloads," in Proceedings of DSN, 2008.

[21] Q. Zhang, N. Mi, A. Riska, and E. Smirni, "Load unbalancing to improve performance under autocorrelated traffic," in Proceedings of ICDCS, 2006.

[22] J. W. Evans and C. Filsfils, Deploying IP and MPLS QoS for Multiservice Networks. Morgan Kaufmann, 2007.

[23] S. Floyd and V. Jacobson, "Random early detection gateways for congestion avoidance," IEEE/ACM Transactions on Networking, 1993.
