
Estimating Computational Requirements in Multi-Threaded Applications

Juan F. Pérez, Member, IEEE, Giuliano Casale, Member, IEEE, and Sergio Pacheco-Sanchez, Member, IEEE

Abstract—Performance models provide effective support for managing quality-of-service (QoS) and costs of enterprise applications. However, expensive high-resolution monitoring would be needed to obtain key model parameters, such as the CPU consumption of individual requests, which are thus more commonly estimated from other measures. Yet, current estimators are often inaccurate in accounting for scheduling in multi-threaded application servers. To cope with this problem, we propose novel linear regression and maximum likelihood estimators. Our algorithms take as inputs response time and resource queue measurements and return estimates of CPU consumption for individual request types. Results on simulated and real application datasets indicate that our algorithms provide accurate estimates and can scale effectively with the threading levels.

Index Terms—Demand estimation, Multi-threaded application servers, Application performance management


1 INTRODUCTION

PERFORMANCE management of datacenter and cloud applications aims at guaranteeing quality-of-service (QoS) to end-users, while minimizing operational costs [1]. To achieve this goal, predictive models are enjoying a resurgence of interest as tools for automated decision-making at both design time and runtime [2], [3]. To successfully make use of predictive models, their parameterization is a key and non-trivial step, especially when estimating the resource requirements of requests. In performance models, this information is encoded in the resource demand, which is the total effective time a request seizes a resource, e.g., a CPU, to complete its execution. This parameter, or its inverse, the resource service rate, is fundamental to specify performance models such as Markov chains, queueing networks, or Petri nets, popular in software performance engineering. Estimating resource demands is also helpful to determine the maximum achievable throughput of each request type, for admission control algorithms, and to define a baseline for performance anomaly detection [4]. Further, these estimates can be used to parameterize performance models derived from high-level software specifications, such as UML MARTE [5] or PCM [6].

Although deep monitoring instrumentation could provide demand values, it typically poses unacceptably large overheads, especially when run at high resolution. For this reason, a number of approaches [2], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16] have been proposed to obtain mean resource demand estimates by applying statistical methods on coarse-grained measurements. Averaging is also convenient to summarize the intrinsic uncertainty and variability that applications face due to file size distributions, variability in input data, and caching [17]. However, most of these estimation approaches rely on utilization data, which is not always available, as in, e.g., Platform-as-a-Service (PaaS) deployments where the resource layer is hidden to the application and thus protected from external monitoring.

• J. F. Pérez and G. Casale are with the Department of Computing, Imperial College London, UK. E-mail: {j.perez-bernal,g.casale}@imperial.ac.uk

• S. Pacheco-Sanchez is with SAP HANA Cloud Computing, Systems Engineering, Belfast, UK. E-mail: [email protected]

The research of Giuliano Casale and Juan F. Pérez leading to these results has received funding from the European Union Seventh Framework Programme FP7/2007-2013 under grant agreement no. 318484 (MODAClouds). Sergio Pacheco-Sanchez has been co-funded by the InvestNI/SAP VIRTEX and Colab projects.

We propose novel estimation methods that rely on response time and thread-pool occupation measurements at an application server. A key advantage of response data is that it can be efficiently obtained by active probing or by simple injection of timers in the application code. However, it is sensitive to the CPU scheduling policy and thus more challenging to analyze than utilization data. Response-based maximum likelihood formulations have been recently attempted only for simple first-come first-served queues [11]; however, these cannot cope with multi-threading and limited thread pools. These complex features are addressed effectively by the proposed estimation methods.

The main contributions of this paper are detailed as follows. First, we present MINPS, a scheduling-aware demand estimation method for multi-threaded applications with a small to medium threading level, e.g., below twenty threads. The MINPS method relies on two subsidiary methods: RPS, which is a regression-based method derived from mean-value analysis [18], and MLPS, a maximum-likelihood (ML) method that relies on a Markovian description of the resource consumption process. Our second contribution aims


at applications with medium to large threading levels, in the order of several tens, hundreds or more, for which we introduce two alternative estimators: FMLPS, an ML approach based on fluid models, i.e., approximations of queueing models based on ordinary differential equations; and ERPS, a modified version of RPS that better accounts for parallelism in multi-core architectures. These methods complement each other, FMLPS being more accurate and ERPS more computationally efficient. An implementation of these methods is available at https://github.com/imperial-modaclouds/modaclouds-fg-demand.

Validation results on simulation data show that our algorithms offer good performance under a broad range of scenarios, including heterogeneous and non-exponential service demands. Further, we have tested the methods using empirical datasets from two multi-threaded applications: the commercial application SAP ERP [19], which has a small threading level; and the open-source e-commerce application OFBiz, which has a large threading level.

The paper is organized as follows. Section 2 reviews related work on demand estimation. Section 3 provides background and a motivating example. Section 4 introduces an empirical estimation method that provides a baseline for the other methods. The regression-based methods RPS and ERPS are introduced in Section 5, while the ML approach MLPS, and MINPS, are presented in Section 6. The scalable ML method FMLPS is the topic of Section 7, and additional validation results for all the methods are provided in Section 8. We illustrate the use of these methods on empirical datasets in Sections 9 and 10. Section 11 concludes the paper.

2 RELATED WORK

Resource demand estimation has received significant attention recently, in particular for resource management of self-adaptive systems [13]. In this context, different adaptation rules are undertaken for different request classes, where a class is a request type identified for example by a URL or by a clustering thereof. Estimation methods based on indirect measurements have gained wide acceptance, with the majority of them being based on regressing total CPU utilization and class throughputs to obtain class estimates of resource demands [7], [14], [15], [20]. Considering multiple classes is important since requests belonging to different classes may have very different resource demands. These methods, however, may suffer from multicollinearity, which can affect the estimates and their confidence intervals [2], [7]. To overcome this and other problems of regression-based methods, other approaches have been put forward, including Kalman filters [8], [16], [21], clustering [9], [10], and pattern recognition methods [22], [23]. Other works exploit CPU utilization measurements at the

Fig. 1. Layers in a multi-threaded server. [Diagram. (a) Server layer: an admission control buffer in front of W worker threads executing on V CPUs. (b) Resource layer: a closed network with population W workers, a worker think-time server, and queues CPU 1, ..., CPU V.]

hypervisor level to estimate resource demands [24] and virtualization overheads [25].

Other estimation methods rely on a different set of measurements, i.e., response times, queue lengths, or equivalently arrival and departure time pairs for a set of requests. Previous work on inference on these metrics is limited. Queue-length measurements have been used for a Gibbs sampler [26] based on a closed multi-class product-form queueing network (QN). Queue-length data is also used in [27] to develop a maximum-likelihood (ML) method based on the diffusion limit of a single-class first-come-first-served (FCFS) multi-server queueing model. Arrival and departure times are used in [28] to estimate service times, with a Markov Chain Monte Carlo approach, for fairly general single-class QNs. In [13], unknown demands are estimated via an optimization problem where the difference between the measured and the predicted response times is minimized. Similar approaches [12], [29] model the resources as a product-form open QN, with explicit expressions for the mean response times. The method in [30] uses utilization, response time, and arrival rate measurements to parameterize a regression model that predicts the overall average response time. Also based on response times, [11] proposes an ML demand estimation method assuming that requests are processed in an FCFS fashion. Our methods also make use of the measured response times, but we tackle the complications of estimation in multi-threaded, multi-core architectures, and in the presence of admission control and multi-class workloads. Further, our methods are able to consider applications with either small or large thread pools.

3 BACKGROUND

3.1 System model

Our reference system is a multi-threaded application running on a multi-core architecture. The application server processes requests incoming from the external world, either in an open or closed-loop fashion. This reference system is conveniently modeled as the layered queueing network shown in Figure 1. Figure 1(a) details the server admission queue (a FIFO buffer), which we assume to be configured


with a pool of W worker threads. A worker thread is an independent software processing unit capable of serving requests. We assume that workers cannot remain idle if a request sits in the admission buffer. Also, we assume that the measured response times are sanitized of the time where the worker is blocked without consuming computational resources at the local server. In other words, we assume that the time consumed in calls to services deployed on external resources, where the worker is effectively blocked, has been subtracted from the sampled response time. This can be achieved by relying on application logs. In Sections 9 and 10 we consider the SAP ERP and the OFBiz e-commerce applications, which have an architecture that can be mapped into this model. In both cases a set of services are exposed by the application, which is deployed on a single resource where all the services are executed. The case where the application is replicated on multiple servers, and a load balancing mechanism is applied to split the requests among them, may be handled by considering each resource separately. More complex applications, such as those that orchestrate a number of distributed services, fall beyond the reference model. However, the model may still be useful to characterize some of the individual services that compose the application.

Throughout the paper, we focus on the response times at the server, not on end-to-end delays, and restrict our attention to the dynamics inside the resource layer shown in Figure 1(b). This layer may represent a physical host or a virtual machine, with the queues modeling the V available CPUs. The queues are assumed to process jobs with a processor-sharing scheduling policy, capturing the time-sharing in the operating system. The resource layer may be abstracted by a closed network with a population of W workers placing resource demands, to serve the requests, on the V identical queues. Thus, the jobs in the closed network represent workers, each one associated with an admitted request that belongs to one of R classes. The think-time server models the time that elapses between completion of a request and admission of the following one. If the admission control buffer is full at the time of a completion, this think time is zero; otherwise, it is the residual time before the arrival of the next request. Further, after leaving the worker thread, and before going through the think time, a request can change its class randomly according to a discrete probability distribution. This user class-switching behavior accounts for systems where users may change the request type they send, e.g., sending requests to different URLs.

Scheduling. We take the simplifying assumption that the operating system dispatches active workers to CPUs to maximize the available capacity. Thus, the V CPUs are assumed to be fully shared and the W workers are placed to exploit all the available capacity. Importantly, we do not keep track of the specific CPU on which a worker executes. This assumption is useful to limit the measurement complexity, since threads can frequently migrate among CPUs and high-resolution measurements would be expensive. By looking at an aggregate level, we are able to ignore these details without compromising estimation accuracy, as we show in Section 4. To simplify, we also ignore possible CPU affinities, where a subset of threads are bound to specific CPUs, although our fluid models are in principle generalizable to this case.

Demand estimation. We refer to the expected service demand posed by class-r requests as E[Dr]. The main goal of the analysis is to characterize E[Dr], for each request class r, using measurements of response times inside the resource layer in Figure 1(b). This means that our response times must measure the execution time, from request assignment to a worker thread until completion, excluding non-CPU related overheads such as network transmission latency. Monitoring modules simplify the task of filtering out these components from the response time measurements [31].

It is also important to remark that, by looking only at mean resource demands, we can equivalently map our problem into estimating the service rates µr at which an application serves class-r requests, thanks to the relation E[Dr] = 1/µr.¹ However, since the application works in a multi-threaded fashion, at any time t there can be n(t) concurrent requests being processed at the CPU. A class-r job will thus be processed at an effective rate of µr/n(t). Hence, the technical problem we need to address is how to filter out such an n(t) factor from the measurements. This is simple to do if all information is known, but becomes difficult when sampling, as we show in the next sections.

3.2 Estimation methodology

The demand estimation methodology we propose requires the ability to collect a dataset of I system state samples n(ti) = (n0(ti), n1(ti), n2(ti), ..., nR(ti)) at a finite sequence of instants t1 < t2 < ... < tI, where

• nr(ti) represents the number of busy workers serving requests of class r, 1 ≤ r ≤ R, at time ti;

• n(ti) = Σ_{r=1}^{R} nr(ti) is the number of busy workers;

• n0(ti) = W − n(ti) is the number of idle worker threads at time ti.

Throughout the paper, we assume that the instant ti corresponds to the time an event happens in the system, i.e., a request enters or leaves a worker. We consider two main alternatives for the collection of this information. First, we consider the baseline case, where

1. Here we assume that the ratio of the number of visits to the processing node to the number of visits to the delay node is one, which is in line with the architecture described. This however can be easily generalized to any ratio [4].


the system is observed during a monitoring period of length T, along which all samples {n(ti)}0≤ti≤T are collected, i.e., every time a request enters or leaves a worker. While this is unrealistic for most production systems, due to the overhead necessary to collect this information, it serves us to set a lower bound on the estimation error attainable with a given sample size. It is important to stress that such estimation error will exist due to the finite size of the sample and, more importantly, due to our assumption of not keeping track of the specific CPU on which a worker executes.

The second, more realistic, alternative is the one of arrival sampling, which considers collecting a set of samples {(n(ti), ri) : i = 1, ..., I}, where the instants ti correspond to arrival times, and, together with the system state n(ti), we also collect the response time of the i-th sampled request. In other words, sample i is composed of the system state n(ti) at the job arrival time, ti, and the job response time ri, which is recorded at the job departure time ti + ri. Contrary to the first sample type, in this case the samples need not be consecutive and are treated as completely unrelated. Note that samples of the system state at these instants can be obtained through, e.g., server or application logs.
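For concreteness, one possible layout for such an arrival sample is sketched below in Python. This is our own illustration, not the authors' tooling, and all names (ArrivalSample, n_busy, n_idle) are hypothetical; it simply shows how the derived quantities n(ti) and n0(ti) defined above follow from the per-class busy-worker counts.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class ArrivalSample:
    """One arrival-sampling observation: the state seen on admission,
    plus the response time recorded later, at departure instant t_i + r_i."""
    t: float                     # arrival instant t_i
    n_by_class: Tuple[int, ...]  # (n_1(t_i), ..., n_R(t_i)): busy workers per class
    req_class: int               # class of the admitted request
    resp_time: float             # response time r_i

    def n_busy(self) -> int:
        # n(t_i) = sum over r of n_r(t_i)
        return sum(self.n_by_class)

    def n_idle(self, W: int) -> int:
        # n_0(t_i) = W - n(t_i)
        return W - self.n_busy()

s = ArrivalSample(t=12.3, n_by_class=(2, 1), req_class=1, resp_time=0.42)
print(s.n_busy(), s.n_idle(W=8))  # -> 3 5
```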

3.3 Motivating Example

In this section we consider the use of three existing methods, two of which require utilization measurements, and one based on response times and queue-length measurements. The method introduced in [29] relies on an open product-form QN model to represent the resource usage. We refer to this method as PF, for product form. The system we consider, as shown in Figure 1(b), features a closed topology due to the thread pool and the admission control. The common method used in [15] and other works is based on the utilization law and requires both utilization and throughput measurements to perform a non-negative linear regression. We refer to this method as NNLR, for non-negative linear regression. Instead, the method in [11] relies on response time and queue length measurements, but models the resource as a single FCFS server, which may not be adequate for a multi-threaded multi-core deployment. We refer to this method as FCFS.

We consider two system setups: V = 1 processor and W = 2 working threads, and V = 2 processors and W = 4 threads. We assume a total of N external users that generate requests, and let this number vary between 20 and 180, which allows for load values between 10% and 90% approximately. The users may belong to one of two classes, each one with a different mean resource demand. With these parameters, we set up a simulation (the details of which are described later) and sample, after a warm-up period and using sampling windows of five

Fig. 2. Various estimation methods - Two classes. [Plots of error (%) against the number of users (20-180) for the BL, PF, FCFS, and NNLR methods. (a) V = 1, W = 2; (b) V = 2, W = 4.]

seconds, the overall server utilization, as well as the average response time and the throughput for each job class. Figure 2 illustrates the mean relative absolute estimation error (see Eq. (2)) obtained with the three methods mentioned above, and with the baseline (BL) method to be introduced in Section 4. For the single-processor case we observe that the NNLR and FCFS methods perform well, but increasing the number of servers degrades their performance severely. While these methods have been shown to be very effective under different setups, the multi-threaded multi-core deployment, together with the limited thread pool, poses additional challenges. Further, the utilization measurements required by two of these methods may not be available in some deployments.

4 BL: BASELINE ESTIMATION

We consider here the baseline case where the dataset includes all samples n(ti) for the instants ti at which requests arrive and depart from the system, between any two instants t1 and tI. The estimation algorithm based on this dataset, named BL, will serve to test the performance of other methods that rely on less information.

4.1 BL Algorithm Description

In a single-processor system (V = 1), the baseline dataset allows reconstructing exactly the sample path of the system and the individual history of each processed job. In this case, it is straightforward to determine the empirical values of the demand of each processed job. That is, consider a class-r request j, 1 ≤ j ≤ J, arrived at time τj,1 = tl and departed at time τj,L = tl+L−1, where L − 2 events occur between its arrival and departure, at times τj,i = tl+i−1, 1 < i < L. These events correspond to the arrival and departure of other jobs and are recorded in the baseline dataset. Then the demand placed by job j is

dj,r = Σ_{i=1}^{L−1} (τj,i+1 − τj,i) / n(τ+j,i),  (1)

where n(τ+j,i) refers to the state of the system just after time τj,i. We therefore approximate the mean class-r


Fig. 3. Baseline estimation. [Diagram: the service demand received on CPU-1 by job 1 over time, with events at τ1,1, τ1,2, and τ1,3.]

demand E[Dr] with the sample mean of the values {dj,r | 1 ≤ j ≤ J}. Figure 3 illustrates (1), where a CPU starts serving three different jobs. We follow the light-gray job, marked with a one, which runs for two periods. The first period concludes when the job with the dashed background leaves the server, while the second begins at this instant and concludes when the light-gray job completes service. The length of the first period is τ1,2 − τ1,1, while the second period lasts τ1,3 − τ1,2. We thus have the two terms of the sum in (1), which give us the demand posed by the marked job as (τ1,2 − τ1,1)/3 + (τ1,3 − τ1,2)/2. The BL method also has the following property, the proof of which is provided in the supplemental material.

Proposition 1: When the processing times follow an exponential distribution, the BL estimator is efficient, i.e., it is unbiased and achieves the Cramér-Rao bound.

Since we assume that the state of each individual processor is not tracked separately, in the multi-processor case (V > 1) we estimate the value dj,r by considering two cases. In the first case, where the number of active workers n(τ+j,i) is less than or equal to the number of processors V, we assume that each of the active workers is assigned to a different processor. Thus dj,r is equal to the numerator in (1), as the workers do not need to share the processors' capacity. In the second case, where n(τ+j,i) > V, we assume that all processors are busy and the system can be approximated by a single "super-processor" with V times the speed of the individual processors. The demand of job j is thus computed as

dj,r = Σ_{i=1}^{L−1} (τj,i+1 − τj,i) min(n(τ+j,i), V) / n(τ+j,i).
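Since min(n, V)/n equals 1 whenever n ≤ V, the two cases collapse into a single per-interval weight, and the V = 1 case reduces to Eq. (1). A minimal sketch of this computation for one job is shown below; it is our own illustration of the BL rule under the paper's assumptions, with hypothetical function and parameter names, taking the job's event times τj,1, ..., τj,L and the post-event states n(τ+j,i) as given.

```python
def bl_demand(event_times, n_after, V=1):
    """Empirical demand d_{j,r} of one job under processor sharing (BL method).

    event_times: [tau_{j,1}, ..., tau_{j,L}], i.e., the job's arrival, all
        arrivals/departures of other jobs during its stay, and its departure.
    n_after: [n(tau+_{j,1}), ..., n(tau+_{j,L-1})], the number of busy workers
        just after each event.
    V: number of processors; capacity is shared only while n > V.
    """
    d = 0.0
    for i in range(len(event_times) - 1):
        interval = event_times[i + 1] - event_times[i]
        n = n_after[i]
        # fraction of one processor's speed the job receives in this interval
        d += interval * min(n, V) / n
    return d

# Worked example from Fig. 3 (V = 1): the marked job shares the CPU with
# 3 then 2 active jobs, so its demand is
# (tau_{1,2} - tau_{1,1})/3 + (tau_{1,3} - tau_{1,2})/2
d = bl_demand([0.0, 0.6, 1.0], [3, 2], V=1)  # 0.6/3 + 0.4/2 = 0.4
```

The class-r demand estimate E[Dr] is then the sample mean of these dj,r values over all class-r jobs.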

4.2 Results

Figure 4(a) reports the estimation error obtained with the BL method for a system with V = 2 processors and W = 8 threads, using different sample sizes (SS). In this and the upcoming experiments, we run each of the estimation methods with the same number of samples (100 per request class unless otherwise stated) and obtain an estimated mean service time for each class. Letting E[Dr] and D̂r,k be the actual and the estimated mean service time for class-r jobs in experiment k, respectively, we obtain the absolute

Fig. 4. BL - Two classes. [Plots of error (%) for sample sizes SS=100, SS=1000, SS=10000. (a) V = 2, W = 8: error against the number of users (20-180); (b) V = 4, N = 100: error against the threading level (8-128).]

relative error as

error_{r,k} = |E[Dr] − D̂r,k| / E[Dr].  (2)

For each system setup, we run 30 experiments and present the mean estimation error and its 95% confidence interval using the samples of error_{r,k} from all the request classes. In Figure 4 we observe that the error is on average 8.34% for a sample size of 100, with small variations depending on the number of users (load) in the system. The error diminishes to 2.45% on average for 1000 samples, and to 0.86% on average for 10000 samples. We have also observed that the estimates obtained with the BL method are not significantly affected by the threading level W, as depicted in Figure 4(b), nor by the number of processors V.

These results suggest the following considerations. First, since the samples are not affected by noise and are drawn from a model that behaves according to our assumptions, the values found are empirical lower bounds on the achievable performance of estimation methods that rely on arrival sampling. It is however possible that other estimation methods can provide a smaller estimation error than BL for specific sample sets. Our results indicate that even with 100 samples the estimates are reasonably accurate, but better results are obtained with more samples. It is interesting to note that the BL estimation approach avoids explicitly representing the state of each server in the multi-core case. Thus, it should be interpreted as an empirical lower bound on the achievable performance with this kind of representation. Describing the state of each server is possible, but poses additional challenges in terms of both the instrumentation needed to collect the data, and the modeling required to track the state of each processor separately.

5 RPS: A REGRESSION-BASED APPROACH

In this section we describe RPS, a regression-based estimation method that makes use of response times and queue lengths observed at arrival times. RPS is based on the mean-value analysis [18] theory for product-form closed queueing networks (QNs). Let N = (N1, ..., NR) be the population vector of a


closed QN, where Nr is the number of class-r jobs.For a processor-sharing (PS) station, let E[Rr] be theexpected response time of a class-r job, E[Dr] itsexpected service demand, and E[Q] the expected totalqueue length at the station. We addN as an argumentto the previous expressions to make explicit that theseare for a network with population N . Also, let er bea zero vector with a 1 in the r-th position. For thesingle PS server case, the main result from mean-valueanalysis states that

E[Rr](N) = E[Dr] (1 + E[Q](N − er)) ,

i.e., the expected response time for a class-r job in this station, in a network with population N, can be expressed as a function of its expected service time, and the expected queue length at this station in a network with one class-r customer less. From the arrival theorem [18] we also know that E[Q](N − er) is equal to the expected number of jobs seen upon admission by a class-r job (excluding itself) in a network with population N, referred to as E[QA] ≡ E[QA](N). We thus obtain E[Rr] = E[Dr](1 + E[QA]). Letting E[Q̄A] = 1 + E[QA] be the expected number seen upon admission, including the admitted job, we have

E[Rr] = E[Dr] E[Q̄A].

Thus, to estimate E[Dr] we perform a linear regression on observations of Rr against Q̄A. These observations are taken from the sample {(n(ti), ri)}, i = 1, . . . , I, discussed in Section 3.2, as ri is the response time and n(ti) the number of jobs seen upon admission by the i-th sample.

The previous result holds for a single processor only. For V > 1 processors we split the expected queue length equally among the processors, thus

E[Rr] = E[Dr] E[Q̄A]/V, (3)

becomes the regression equation to estimate E[Dr].
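As a sketch of this regression step, the slope of the regression of ri on the seen-on-admission queue (through the origin, per Eq. (3)) gives the demand estimate. The function name and argument names below are illustrative, not taken from the paper's implementation:

```python
import numpy as np

def rps_demand(qA, r, V=1):
    """Least-squares slope of r_i on qA_i / V through the origin (Eq. (3)).

    qA : jobs seen on admission (including the admitted one), one entry per
         class-r sample; r : the matching response times.
    """
    x = np.asarray(qA, dtype=float) / V
    y = np.asarray(r, dtype=float)
    return float(x @ y / (x @ x))  # regression estimate of E[D_r]
```

For noiseless data generated with E[Dr] = 0.5, the slope recovers the demand exactly; with measured data it returns the least-squares fit.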

5.1 ERPS: An extended RPS

We now introduce a small modification to the RPS method that consists of replacing the term V in Eq. (3) by an estimate V̂ of the number of busy servers. The purpose of this is to address the main drawback of the RPS method, which, as illustrated in the next section, is its inability to capture the behavior of the system under low loads. We estimate V̂ as

V̂ = min{ V, (1/I) Σ_{i=1}^{I} n(ti) }.

We refer to this method as Extended RPS (ERPS).
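The busy-server estimate is simply the mean total queue seen on admission, capped at V; a minimal sketch with illustrative names:

```python
import numpy as np

def erps_vhat(samples_n, V):
    """ERPS busy-server estimate: mean total queue length seen on
    admission, capped at the number of processors V."""
    return min(V, float(np.mean([sum(n) for n in samples_n])))
```

The returned V̂ then replaces V in Eq. (3) before running the RPS regression.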

Fig. 5. BL, RPS, ERPS, and MLPS, two classes. Error (%) vs. number of users. (a) V = 1, W = 2; (b) V = 4, W = 16 (RPS reaches 145% and 52%).

5.2 Results

We now illustrate the behavior of the RPS and ERPS methods under different system setups. We generate the data by means of a discrete-event simulation, implemented in Java Modelling Tools (JMT) [32], with the closed-loop topology described in Section 3.2, where the V CPUs are assumed to be fully shared among the workers.

We point out that the RPS method, as it relies on regression analysis, is computationally inexpensive and can be executed in less than one second on a standard desktop. Figure 5(a) shows the error rate for the BL, RPS and ERPS methods for the case with one processor and 2 threads. Both RPS and ERPS show a very good behavior, comparable with that of BL, over the whole range considered for the number of users. In Figure 5(b) we consider a larger number of processors, V = 4, and W = 16 worker threads. The error for the RPS method increases significantly for low loads, reaching 145% and 52% error for 20 and 60 users, respectively. This behavior, which worsens for a larger number of processors, is caused by the assumption that the queue length is split equally among the V servers; under a small load, fewer than V servers are actually active, and the service rate is less than the total capacity. ERPS shows a better behavior, thanks to the correction introduced by splitting the queue length among the mean number of busy servers V̂. The error of ERPS is however twice that of BL under low loads, a trend sustained in a broad range of multi-processor scenarios. Thus, RPS and ERPS estimate the resource demand accurately for the single-processor case, and for a larger number of processors under medium to high loads.

6 MLPS: AN ML APPROACH

In this section we introduce MLPS, a maximum likelihood (ML) estimation method well-suited for the multi-processor case under low loads.

6.1 Maximum Likelihood

An ML estimation procedure is an optimization method that aims at finding the value of the parameters of a probabilistic model such that the likelihood



L of obtaining a given set of samples is maximal. For MLPS, we assume the sample set {(n(ti), ri)}, i = 1, . . . , I, introduced in Section 3.2, is available. From this sample set, MLPS estimates the mean service demand E[Dr] for each class r.

The MLPS method seeks a solution to the optimization problem with objective function

max_{µ1,...,µR} L(r1, . . . , rI | n(t1), . . . , n(tI), µ1, . . . , µR),

where the service rates µr are the decision variables. Recall that finding the rates µr is equivalent to finding the mean demands, since E[Dr] = µr^{-1}. We prefer in MLPS to operate with rates, instead of demands, because the likelihood function leverages a Markov model parameterized by the rates µ = [µ1, . . . , µR]. For feasibility we add to this likelihood problem the constraints µr ≥ 0, for r = 1, . . . , R.

Assuming independent samples and applying logarithms, the objective function can be written as

max_{µ1,...,µR} Σ_{i=1}^{I} log(L(ri | n(ti), µ)).

Notice that, although the response times are not independent, the effect of correlation is reduced if the samples are randomly selected or well spaced in time. The latter can be achieved, for instance, by collecting the response times of jobs whose executions do not overlap or that belong to different busy periods. For instance, considering the scenarios that will be later introduced in Section 8, and with sets of 200 samples, we observe a mean lag-1 auto-correlation of the response times of 0.091. If the samples are taken only from non-overlapping jobs, the mean auto-correlation reduces to 0.065, while randomly sampling the response times is also effective, as it reduces the mean auto-correlation to 0.054. Further, if samples are taken from different busy periods, the mean auto-correlation reduces to 0.053. The next sections define a Markov model to compute L(ri | n(ti), µ).

6.2 The approximate model

To capture all the features of the multi-threaded application under consideration, the ideal model should consider the two layers depicted in Figure 1. A Markov model of this system, however, is infeasible since it would require a very large state space, limiting the scenarios that could be analyzed. To cope with this limitation, we introduce an approximate model that focuses only on the behavior after admission, as depicted in Figure 1(b). The model is a closed QN with a total of W circulating jobs. Each job represents a worker thread contending for CPU capacity. These jobs can be either at a PS processing node, where class-r jobs are served with rate µr, or at a delay node, where class-r jobs spend a think time with mean λr^{-1}, for r = 1, . . . , R. This think time captures the time between a service completion and the admission of a new request. Both processing and think times are assumed to be exponentially distributed to obtain the Markov-chain (MC) representation described in the next section. In Section 8 we show that the MLPS method provides good estimation accuracy also under hypo-exponential processing times. The hyper-exponential case proves harder, though we have obtained good results for cases with limited variability.

6.3 The absorbing Markov chain representation

To compute the likelihood L(ri | n(ti), µ) of obtaining a given sample, we define an absorbing MC such that the time to absorption reflects the total processing time received by the i-th sampled job. Consider an MC with m + 1 states such that its generator matrix can be written as

Q = [ T  t
      0  0 ],

where T holds the transition rates among the first m states, and is called a sub-generator matrix. The vector t = −Te, with e a column vector of ones, holds the transition rates from the first m states to state m + 1. Since the rates out of state m + 1 are zero, once the chain visits this state it stays there forever, making it an absorbing state, and all the others transient. Similarly, the initial state of this MC is selected according to the row probability vector [α α0], where the i-th entry of α holds the probability that the MC starts in transient state i, and α0 = 1 − αe. Starting from a transient state, the PDF of the time to absorption in state m + 1 is [33]

f(x) = α exp(Tx) t. (4)

Sub-generator T: In our model, given a sample (n(ti), ri) for a tagged job i, the parameters α and T of the absorbing MC are a function of the observed number of jobs in service n(ti) and the service rates µ, while the absorption time is equal to the total processing time ri. To keep track of the tagged job, we extend the set of classes with a tagged class, and allow only the tagged job to belong to it. To describe this model, the state space of the MC needs to consider all possible combinations of W jobs in R + 1 different classes and two nodes. This number is large even for small values of W, making the computation of the matrix exponential in (4) infeasible. To cope with this problem we introduce the following key assumption: we assume that the total population of class-r jobs (threads) is equal to the one observed by the tagged job upon admission, i.e., nr(ti).

As the number of jobs in each class is now limited, it is enough to keep track of the number of jobs of each class in the service node. Let k(ti) be the class of the i-th admitted job, and let Xr(t) be the number of class-r jobs in service, without considering the tagged job.



Require: State l = (l1, . . . , lR), l = Σ_{r=1}^{R} lr + 1
for r = 1, . . . , R do
  Service completion: T(l, l − er) = µr lr / l
  Think-time completion: T(l, l + er) = (nr(ti) − lr) λr
end for

Fig. 6. MC non-absorbing transition rates

We can thus describe the system with the variables {Xr(t), r = 1, . . . , R, t ≥ 0}, taking values in the set

{(l1, . . . , lR) | 0 ≤ lr ≤ nr(ti) − 1{r = k(ti)}}, (5)

where 1{·} is the indicator function. The non-absorbing rates of the MC {Xr(t), r = 1, . . . , R, t ≥ 0} are shown in Figure 6, including both arrivals and service completions at the processing node. Additionally, absorption occurs in state (l1, . . . , lR) with rate µ_{k(ti)}/l, with l = Σ_{r=1}^{R} lr + 1, corresponding to the tagged job's service completion. The non-absorbing rates compose the sub-generator matrix T(n(ti), µ) associated with sample (n(ti), ri).

Initial distribution α: To define the initial probability distribution α we observe that the i-th sampled job finds the processor with n(ti) jobs. Thus the initial probability vector α(n(ti)) has a one in the entry corresponding to state n(ti), and zeros everywhere else. In this manner, the likelihood of obtaining a sample (n(ti), ri) when the service rates are µ can be expressed as

L(ri | n(ti), µ) = α(n(ti)) exp(T(n(ti), µ) ri) t(n(ti), µ),

where t(n(ti), µ) = −T(n(ti), µ) e.
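As an illustration, the following sketch enumerates the truncated state space of Eq. (5), builds the rates of Figure 6, and evaluates this likelihood for the single-processor case with scipy. The function name, and the convention that n counts the admitted job itself (so the chain starts in state n − e_k), are our reading of the construction, not the authors' code:

```python
import itertools
import numpy as np
from scipy.linalg import expm

def mlps_likelihood(n, k, r_obs, mu, lam):
    """Likelihood of one sample (n, r_obs): a tagged job of class k is
    admitted when n jobs (assumed to include the tagged job) are present."""
    R = len(n)
    caps = [n[r] - (1 if r == k else 0) for r in range(R)]   # Eq. (5)
    states = list(itertools.product(*[range(c + 1) for c in caps]))
    idx = {s: j for j, s in enumerate(states)}
    m = len(states)
    T, t = np.zeros((m, m)), np.zeros(m)
    for s in states:
        j, tot = idx[s], sum(s) + 1              # +1 for the tagged job
        for r in range(R):
            if s[r] > 0:                         # service completion (Fig. 6)
                T[j, idx[s[:r] + (s[r] - 1,) + s[r + 1:]]] += mu[r] * s[r] / tot
            if s[r] < caps[r]:                   # think-time completion (Fig. 6)
                T[j, idx[s[:r] + (s[r] + 1,) + s[r + 1:]]] += (n[r] - s[r]) * lam[r]
        t[j] = mu[k] / tot                       # tagged completion -> absorption
    np.fill_diagonal(T, -T.sum(axis=1) - t)      # make T a proper sub-generator
    alpha = np.zeros(m)
    alpha[idx[tuple(caps)]] = 1.0                # start in the observed state
    return float(alpha @ expm(T * r_obs) @ t)    # f(r_i) = alpha exp(T r_i) t
```

For a single class with no other jobs present (n = [1]), the chain reduces to a single exponential phase and the likelihood is µ e^{−µ r}, as expected.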

6.4 The case of multiple processors

We now extend MLPS to consider multiple processors. As keeping track of the number of busy worker threads in each processor would suffer from the curse of dimensionality, we modify the transition rates in the absorbing MC as follows. In state l = (l1, . . . , lR) such that ∃r with lr ≥ 1, and l = Σ_{r=1}^{R} lr + 1 ≤ V, i.e., when the number of jobs in process is less than or equal to the number of processors, we assume that each job is being processed by a different processor, and therefore a transition occurs to state l − er with rate lr µr. Instead, when l > V, the service completion rate of a class-r job is V µr lr / l, as if the processors were a single "super-processor" with V times the capacity of a single processor.
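This rate rule can be stated compactly; a minimal sketch (the function name is ours):

```python
def completion_rate(mu_r, l_r, l_tot, V):
    """Class-r service-completion rate with V processors (Section 6.4 rule)."""
    if l_tot <= V:
        return mu_r * l_r          # each job runs on its own processor
    return V * mu_r * l_r / l_tot  # single "super-processor" of capacity V
```

The same rule is reused later by the fluid model of Section 7.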

6.5 Estimating the think times

To estimate the mean think times λr^{-1} for each class r, we assume that, during a given monitoring period, the admission rate of requests to the worker threads, called βr for class-r jobs, can be estimated. As this reduces to knowing the number of arrivals in the monitoring interval, this information can be extracted from server log files. Now, from the sample {(n(ti), ri)}, i = 1, . . . , I, we compute the average number of busy threads seen upon admission, that is, W̄ = (1/I) Σ_{i=1}^{I} n(ti). Since W − W̄ can be thought of as an estimate of the mean number of threads undergoing a think time, we approximate the think rate as

λr = βr / ((W − W̄)/R). (6)

This expression is a simple application of Little's law, as (W − W̄)/R approximates the mean number of class-r jobs in the delay node, λr^{-1} is the mean time spent in this node, and βr is the effective throughput. Here we divide the number of idle threads evenly among the different request classes. Notice that if the server is lightly loaded, the think rate λr is small, both because βr is small and because W̄ is close to zero. The opposite occurs under heavy loads.
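Eq. (6) can be sketched directly from the samples and the per-class admission rates; names below are illustrative:

```python
import numpy as np

def think_rates(samples_n, beta, W):
    """Eq. (6): split the W - Wbar idle threads evenly over the R classes.

    samples_n : per-sample vectors of jobs seen on admission;
    beta : per-class admission rates; W : number of worker threads.
    """
    W_bar = float(np.mean([sum(n) for n in samples_n]))
    R = len(beta)
    return [b / ((W - W_bar) / R) for b in beta]
```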

6.6 Results

We illustrate the performance of MLPS by considering, as in Section 5, a system with 2 job classes, 1 processor and 2 worker threads. We must highlight that the simulation is performed for the system as depicted in Figure 1, without considering the assumptions introduced in the derivation of MLPS. Figure 5(a) shows the error for MLPS, where a trend, repeated across a set of scenarios broader than the ones shown here, arises: MLPS provides estimates similar in accuracy to those of the BL method for low loads, but its accuracy diminishes for high loads. This behavior also holds for multiple processors, as illustrated in Figure 5(b), where we consider 4 processors and 16 worker threads. We have performed an exhaustive set of experiments, and found that this trend holds for a broad range of parameter values. MLPS thus performs well under low loads for the multi-processor case, complementing well the RPS method, which performs well under high loads but poorly under low loads.

6.7 The MINPS method

In this section we present the MINPS method, which is built on top of RPS and MLPS, and relies on two observations. First, RPS has difficulty under low loads because it is incapable of correctly treating the situation where n < V jobs are being processed. In this case, not all processors are busy, and approximating them as a single "super-processor" working at rate V µ becomes too coarse. As a result, RPS overestimates the mean service time, since the measured response times appear too long for a system with service rate V µ, while the actual service rate will be at most nµ.

Second, one of the drawbacks of MLPS is its inability to capture arrivals that occupy the worker threads beyond the state observed upon admission. That is, the jobs observed upon admission can finish their service and return later on, but the number of jobs in each class is limited by the numbers observed upon admission. As a result, under high loads, there will be samples where the observed number of jobs in process is (much) lower than the maximum number reached before the tagged job service is completed. In this case, MLPS over-estimates the service demand, as the response times observed have to be matched by a system where the servers are shared among fewer jobs than in the actual system.

Fig. 7. BL, ERPS, and MINPS, two classes. Error (%) vs. number of users. (a) V = 4, W = 16; (b) V = 4, W = 32.

From these observations we conclude that, when the RPS and MLPS estimates have larger errors, both methods are likely to fail by over-estimating the service demand. In MINPS we propose to run both methods and choose the one that offers the smaller estimated mean service demand, thus preventing the use of an over-estimated value caused by the failures described above. Figure 7(a) shows the behavior of MINPS against BL and ERPS for a setup with 4 processors and 16 threads. While their performance is similar under high loads, MINPS has a much smaller error under low loads, similar to that of BL. This behavior is sustained for setups with a larger number of threads, as in Figure 7(b), and processors.
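The combination rule itself is a one-liner; a sketch with illustrative names, taking the per-class demand estimates produced by the two methods:

```python
def minps(d_rps, d_mlps):
    """MINPS rule: per class, keep the smaller of the two demand
    estimates, since both methods fail by over-estimating."""
    return [min(a, b) for a, b in zip(d_rps, d_mlps)]
```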

7 FMLPS: A SCALABLE ML APPROACH

The MLPS method described in the previous section relies on an approximation to limit the MC state-space size. However, if the number of threads or processors is large, the MC state-space size can make the likelihood computation infeasible. In fact, the numbers of threads considered in the previous sections are compatible with memory-intensive applications, such as the ERP application analyzed in Section 9, where allowing a large number of parallel threads may cause memory bottlenecks. In this case, the number of parallel threads is limited and the state space of the MC underlying MLPS is tractable. In other types of applications, e.g., web sites, the number of worker threads can be significantly larger, as the memory access requirements are much lower.

To deal with applications with a large number of threads, we introduce an ML method that replaces the underlying MC in MLPS with a fluid approximation. The fluid model avoids the state-space explosion and is therefore suitable for models with a large number of threads and processors. With the fluid model, the matrix exponential in (4) is replaced by the solution of a set of ordinary differential equations (ODEs), which is derived from the original system by a scaling argument. As with MLPS, for a given sample (n(ti), ri) we model the state after admission as a closed multi-class QN with two stations: a processing station and a delay station. A key difference with respect to MLPS is that, instead of assuming a total population of n(ti) workers, FMLPS assumes the correct number of W workers, where W − n(ti) are initially located at the delay station. This better captures the dynamics of the admitted jobs, but it would be expensive to consider in the MC model of MLPS due to the state-space explosion.

7.1 The FMLPS QN model

We model the dynamics after admission as a QN with R classes and two stations. One station represents the super-processor obtained by aggregating the V CPUs; the other station is a think-time server to model the workers' idle time. We assume a total of W jobs and allow the jobs in the QN to change their class when transferring from the CPU to the think-time server, making this model a class-switching QN.2

We assign the admitted job to a tagged class, labeled R + 1, as in MLPS. With this setup, the QN can be modeled as an MC, the state of which at time t is given by the vector X(t) = {Xi,r(t), i ∈ {P, D}, 1 ≤ r ≤ R + 1}, where Xi,r(t) is the number of class-r jobs in resource i at time t, with i = P (resp. i = D) standing for the processing (resp. delay) station. The system state is modified by events, which are limited to admissions (think-time departures) and service completions (CPU departures). Once the service of a class-r job terminates in station i, the job proceeds to station j as a class-s job with probability P^{r,s}_{i,j}, allowing the worker class-switching behavior. In FMLPS, this feature is used to estimate the response time distribution, and thus the likelihood, of each sample. A service completion triggers a transition from state x to state x + ej,s − ei,r, where ei,r is a vector of zeros with a one in entry (i, r). Thus, the number of class-r jobs in station i decreases by one, while the number of class-s jobs in station j increases by one.

The transition rate from state x to state x + ej,s − ei,r is denoted by f^{r,s}_{i,j}(x), and its definition is given in Figure 8. Notice that the transition rates associated with service completions are adjusted to consider the PS multi-core case. Thus, when at most V jobs are

2. We refer to this as worker class-switching, to differentiate it from the user class-switching described in Section 3.1. Worker class-switching will be used only to estimate the likelihood of the samples, not to capture the user class-switching behavior. This would require a layered model and additional monitoring information to parameterize the model. This generalization is left for future work.



Require: State x = (xi,r), xi = Σ_{r=1}^{R+1} xi,r, i ∈ {P, D}
for r, s = 1, . . . , R + 1 do
  Service completion:
    if xP ≤ V then f^{r,s}_{P,D}(x) = µr P^{r,s}_{P,D} xP,r
    else f^{r,s}_{P,D}(x) = µr P^{r,s}_{P,D} V xP,r / xP
  Think-time completion:
    f^{r,s}_{D,P}(x) = λr P^{r,s}_{D,P} xD,r
end for

Fig. 8. MC transition rates with worker class switching

present at the processing station, each of the jobs is assumed to be assigned to a different processor, without contention. When more than V jobs are present, the service completion rate, as in MLPS, is that of a single super-processor with processing rate V times that of an individual processor. The transition rates associated with think-time completions use the rates λr, estimated as in Eq. (6). As in MLPS, both processing and think times are assumed to be exponentially distributed. However, this assumption can be relaxed to consider phase-type (PH) [33] processing and think times, as the fluid model can better accommodate the required extended state description without suffering from the state-space explosion that affects MLPS.

7.1.1 The fluid model

As mentioned above, the MC {X(t), t ≥ 0} cannot be analyzed directly as in MLPS, due to the state-space explosion. To overcome this issue we introduce a fluid approximation, i.e., an ODE formulation of the queueing model. This approximation can be shown to become exact when the number of jobs W grows asymptotically large. The asymptotic limit is obtained as follows. We consider a sequence of QN models, indexed by k, such that when k → +∞, the sample paths of the QN models tend to the solution of a deterministic ODE system. Let {X^k(t)}, k ∈ N+, be the sequence of QN models such that X^1(t) = X(t) (the QN model defined in the previous subsection), and X^k(t) for k ≥ 2 is defined as X^1(t) with a total population of kW jobs and kV servers. The state space of X^k(t) is thus {x ∈ N^{MR} : Σ_{i∈{P,D}} Σ_{r=1}^{R+1} xi,r = kW}.

As we show in the technical report [34], the sequence {X^k(t)}, k ∈ N+, verifies the conditions of [35, Theorem 3.1], such that the sample paths of the normalized sequence {X^v(t)/v}, v ∈ N+, converge in probability to a deterministic ODE system. Informally, this means that the solution of an ODE system can be used to approximate the original QN, with the approximation becoming tighter as v grows large. Notice that the fluid model is derived using a scaling argument, where we consider a certain limit for an increasing number of servers and jobs, but the resulting asymptotic ODE system is re-scaled to approximate the behavior of the original system X^1(t). We now introduce the ODE system for the evaluation of the FMLPS queueing model.

7.1.2 The ODE system

The state of the ODE system x(t) ∈ R^{MR} that describes the FMLPS QN model evolves according to

dx(t)/dt = F(x(t)), t ≥ 0, (7)

where, for any x ∈ R^{MR}, F(x) is the drift of X(t) in state x, defined as

F(x) = Σ_{i∈{P,D}} Σ_{j∈{P,D}} Σ_{r=1}^{R} Σ_{s=1}^{R} (ej,s − ei,r) f^{r,s}_{i,j}(x). (8)

The ODE initial state is partitioned as x0 = [x0^D, x0^P], where x0^D (resp. x0^P) refers to the initial number of jobs of each class undergoing a think time (resp. processing at a CPU). For sample (n(ti), ri), x0^P is set as

x0^P = [n1, . . . , nk−1, nk − 1, nk+1, . . . , nR, 1],

where k is the class of the admitted job, and we omit the parameter ti for readability. Notice that the number of class-k jobs is reduced by one, and this job is assigned to the tagged class R + 1. Additionally, the initial state of the delay station is set by taking the total number of jobs not present at the processing station, W − Σ_{r=1}^{R} nr(ti), and assigning them proportionally to the effective admission rate of each class (βr). The tagged class has zero jobs in the delay node. The initial state of the delay station is thus set to

x0^D = [ (W − Σ_{r=1}^{R} nr(ti)) β / (Σ_{r=1}^{R} βr), 0 ], (9)

where β is the R × 1 vector with entries βr. The routing matrix P is defined in the next section.

7.1.3 Response time distribution

To assess the likelihood of the observed response time ri with the fluid model, we follow a similar approach as in [36]. This approach consists of defining a transient class, and letting the jobs (workers) in this class switch to a non-transient class after processing. As in MLPS, for a class-k sample we define a tagged class R + 1 and assign the admitted (tagged) job to this class. Next, we set P^{R+1,k}_{P,D} = 1, such that, when the tagged job (worker) finishes, it switches from class R + 1 to its original class, making class R + 1 transient. For all the other classes, we set P^{r,r}_{D,P} = P^{r,r}_{P,D} = 1, thus routing the workers without any class switching.

As in MLPS, we require an expression for the likelihood L of obtaining a sample (n(ti), ri) given a set of processing rates µ, which for the fluid model we express as the response-time PDF fi(·) evaluated at ri. To approximate this function we introduce the



variable y(t), which keeps track of the station visited by the tagged job at time t, thus y(t) ∈ {D, P} for t ≥ 0. Let Ri be the random variable describing the response time of the tagged job associated with sample i, and let Φi be its cumulative distribution function. Since, for any t ≥ 0, the event {y(t) = P} is equivalent to {Ri > t}, we can state

Φi(t) = P(Ri ≤ t) = 1 − P(y(t) = P).

Further, P(y(t) = P) can be written as the expected value of the indicator function 1{y(t) = P}, since

E[1{y(t) = P}] = 1 · P(y(t) = P) + 0 · P(y(t) = D).

Notice that 1{y(t) = P} is the number of tagged jobs in the processing station, thus being equal to XP,R+1(t). We can therefore rewrite Φi(t) as

Φi(t) = 1 − E[XP,R+1(t)] ≈ 1 − xP,R+1(t),

where, as in [36], we make use of the fluid solution x(t) to approximate the expected value of X(t). Using this approximation, the response-time PDF fi(·) is obtained as the derivative of Φi(·),

fi(t) = dΦi(t)/dt ≈ −dxP,R+1(t)/dt = −FP,R+1(x(t)),

since, from (7), F(·) is precisely the derivative of x(t). The likelihood of sample (n(ti), ri) is thus given by −FP,R+1(x(ri)), obtained by solving the ODE in (7) between t = 0 and t = ri with initial state x0.
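Putting the pieces together, the fluid likelihood of one sample can be sketched with scipy's ODE solver. The drift implements the Figure 8 rates with the routing of Section 7.1.3; the assumption that the tagged class keeps the class-k service rate, and all names, are ours:

```python
import numpy as np
from scipy.integrate import solve_ivp

def fmlps_likelihood(n, k, r_obs, mu, lam, beta, W, V):
    """Fluid likelihood of one sample (n, r_obs): a sketch of the FMLPS
    construction, with n assumed to include the admitted job."""
    R = len(n)
    mu_ext = np.append(np.asarray(mu, float), mu[k])  # tagged class rate
    lam = np.asarray(lam, float)

    def drift(t, x):
        xP, xD = x[:R + 1], x[R + 1:]
        tot = xP.sum()
        scale = 1.0 if tot <= V else V / tot          # Fig. 8 PS rates
        dP, dD = np.zeros(R + 1), np.zeros(R + 1)
        for r in range(R + 1):
            comp = mu_ext[r] * scale * xP[r]          # service completion
            dP[r] -= comp
            dD[r if r < R else k] += comp             # tagged -> class k
        for r in range(R):                            # think-time completion
            adm = lam[r] * xD[r]
            dD[r] -= adm
            dP[r] += adm
        return np.concatenate([dP, dD])

    xP0 = np.append(np.asarray(n, float), 1.0)
    xP0[k] -= 1.0                                     # tagged job re-labeled
    free = W - sum(n)                                 # jobs not yet admitted
    xD0 = np.append(free * np.asarray(beta, float) / sum(beta), 0.0)  # Eq. (9)
    sol = solve_ivp(drift, (0.0, r_obs), np.concatenate([xP0, xD0]),
                    rtol=1e-9, atol=1e-12)
    xP = sol.y[:R + 1, -1]
    scale = 1.0 if xP.sum() <= V else V / xP.sum()
    return float(mu_ext[R] * scale * xP[R])           # f_i(r_i) = -F_{P,R+1}
```

Solving the ODE up to t = ri and reading off the outflow of the tagged class gives fi(ri) = −FP,R+1(x(ri)), as derived above.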

8 VALIDATION

In this section we explore the behavior of the different methods introduced in the paper. As summarized in Table 1, we consider a broad range for the number of processors and the threading ratio W/V. We also evaluate heterogeneous service demands by modifying the ratio µ1/µ2. The number of users N is set between 20 and 180. From simulations we know that for 20, 100, and 180 users, the server load is about 10%, 50%, and 90%, respectively. Recall that the simulation assumes that the V CPUs are fully shared among the workers, although we will also consider a Join the Shortest Queue (JSQ) policy for worker allocation. As before, we evaluate the estimates with the absolute relative error, defined in Eq. (2). For the cases with a small (resp. large) number of servers and threads we consider MLPS (resp. FMLPS), since MLPS suffers from the state-space explosion, while FMLPS is expected to be more accurate when these numbers are large due to the fluid approximation.

We also consider different user class-switching behaviors, by means of the parameter α, which is the probability that a job switches class after leaving the worker thread. In all the results to be presented we set α = 0.1, making the switching probability matrix fast mixing. We also considered slow-mixing cases, α = 0.001, but the estimation errors, although slightly higher, are very similar to those shown here.

TABLE 1
Experimental setup

Symbol  Parameter                     Value
V       Number of processors          {1, 2, 4, 8, 16}
W/V     Threading ratio               {2, 4, 8, 16, 32}
R       Number of classes             {1, 2, 3}
µ1/µ2   Service rate ratio            {2, 10, 1000}
α       Class-switching probability   {0.1, 0.001}
N       Number of users               {20, 60, . . . , 180}

Fig. 9. BL, ERPS, and MINPS, V = 2, W = 8. Error (%) vs. number of users. (a) SS = 100; (b) SS = 1000.

Fig. 10. BL, ERPS, and FMLPS, V = 8, W = 64. Error (%) vs. number of users. (a) SS = 100; (b) SS = 1000.

8.1 Larger sample size

Figure 9(a) shows that MINPS and ERPS provide similar estimation errors for the case of 2 processors and 100 samples. However, when increasing the sample size to 1000, as in Figure 9(b), MINPS performs similarly to BL, and much better than ERPS, under low loads. This is a direct effect of the better performance of MLPS under low loads, which is inherited by MINPS. Figure 10(a) illustrates the case of V = 8 and W = 64, with a sample size of 100. FMLPS performs remarkably well, similar to BL, over the whole load range considered, while ERPS shows larger errors under low loads. This effect is amplified when increasing the sample size to 1000, as in Figure 10(b), where FMLPS performs significantly better than ERPS, particularly under low to medium loads. Under high loads the performance of FMLPS diminishes, probably due to the approximation introduced to set the number of jobs of each class initially located at the delay station, cf. Eq. (9).

8.2 Heterogeneous service times

We now consider highly differentiated service times, setting the ratio µ2/µ1 = 1000, which was equal to 2 in the previous experiments. We also modify the think times, keeping the ratio λi/µi fixed, such that each class accounts for half the server load. This prevents one class from becoming insignificant with respect to the other. In Figure 11, we observe that, despite the significant differentiation, the estimation errors of all the methods remain very similar to the case µ2/µ1 = 2. Comparing Figures 9(a) and 11(a), we observe a minor increase for MINPS under medium loads, but the difference is not statistically significant. In the case of a large number of processors and threads, in Figures 10(a) and 11(b), we observe a similar behavior for both ERPS and FMLPS. In this and other experiments we have observed that the estimation errors are similar for the different request classes, provided a similar number of samples for each class is available.

Fig. 11. µ2/µ1 = 1000. Error (%) vs. number of users. (a) BL, ERPS, MINPS, V = 2, W = 8; (b) BL, ERPS, FMLPS, V = 8, W = 64.

8.3 Non-exponential service times

We now consider the impact of non-exponential service times on the proposed estimation methods, since both MLPS and FMLPS assume exponentially distributed processing times. This is particularly relevant since the exponential assumption does not necessarily hold in real applications. To this end we apply the estimation methods, which assume exponential processing times, while in the simulation the processing times are generated from non-exponential distributions. We consider two main cases for the processing times: an Erlang distribution made of five consecutive exponential phases, resulting in a squared coefficient of variation (SCV) of 0.2; and a hyper-exponential (HE) distribution made of the mixture of two exponential phases, which can match any SCV greater than one [37]. In both cases we keep the same mean as in the exponential experiments to maintain the same offered load on the servers.

Figure 12(a) presents the estimation errors under Erlang service times for the case V = 2 and W = 8. Under this setup MINPS performs better than in the exponential case, depicted in Figure 9(a), with estimation errors close to those of BL. ERPS, however, shows results similar to the exponential case under low loads, and improves only under high loads. Figure 13(a) also considers Erlang service times, for the case V = 8 and W = 64. We highlight the performance of FMLPS, which is as accurate as BL. ERPS instead presents larger errors under low loads, similar to the exponential case depicted in Figure 10(a), while under high loads it performs as well as BL. The methods thus handle hypo-exponential processing times well.

Fig. 12. BL, ERPS, and MINPS, V = 2, W = 8. Error (%) vs. number of users. (a) Erlang; (b) hyper-exponential.

Fig. 13. BL, ERPS, and FMLPS, V = 8, W = 64. Error (%) vs. number of users. (a) Erlang; (b) hyper-exponential.

Figure 12(b) illustrates the effect of HE service times with SCV = 2. While the results for both MINPS and ERPS worsen, their estimation errors are similar to those of BL, verifying their good performance. Similar results hold for a large number of processors and threads, shown in Figure 13(b), where all the methods show a similar behavior, with a slight advantage for FMLPS (resp. ERPS) under low (resp. high) loads. Notice that the performance of the BL method is affected by the scatter of the data around its mean, improving in the hypo-exponential case and worsening in the hyper-exponential case.

8.3.1 Extending FMLPS to HE processing times

We have observed that further increasing the variability of the processing times, beyond an SCV of 2, affects the accuracy of the ML methods, particularly under high loads. To handle this case we extend the FMLPS method to allow for non-exponential, specifically 2-phase hyper-exponential (HE), processing times. To this end, we exploit the class-switching assumption, slightly modifying the ML methods as follows. We assume 2 classes (k1, k2) for each class k in the model. For a class-k sample, we assume that it joins the server as a class-ki request with probability αki. A class-ki request poses a mean demand (µki)^−1 on the server, and upon completion it leaves the server as a class-k request. The likelihood of sample (n(tj), rj) is obtained by separately computing the likelihood as if it was a class-ki request and combining the results with weights αki. The method thus needs to estimate three parameters for each class k: µk1, µk2, and αk1, as αk2 = 1 − αk1.

Fig. 14. HE (SCV=10) - V = 8 - W = 64: estimation error (%) vs. number of users for BL, ERPS, and FMLPS, with sample sizes (a) SS = 100 and (b) SS = 500.
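The mixture idea can be illustrated with a small, self-contained EM fit. This is a stand-in rather than the FMLPS implementation: in the method above the per-phase likelihood comes from the queueing model and the samples are (n(tj), rj) pairs, whereas here, purely for illustration, the per-phase likelihood is a plain exponential density; the function name and initialization are ours.

```python
import math
import random

def em_hyperexp(samples, iters=200):
    """EM estimation of the three per-class parameters of the HE extension:
    the mixing weight alpha (= alpha_k1, with alpha_k2 = 1 - alpha) and the
    two phase rates mu1, mu2. The per-sample likelihood is the weighted sum
    of the per-phase likelihoods, mirroring the combination with weights
    alpha_ki described above."""
    m = sum(samples) / len(samples)
    alpha, mu1, mu2 = 0.5, 2.0 / m, 0.5 / m   # crude initial split of the mean
    for _ in range(iters):
        # E-step: posterior probability that each sample came from phase 1
        w = []
        for x in samples:
            l1 = alpha * mu1 * math.exp(-mu1 * x)
            l2 = (1.0 - alpha) * mu2 * math.exp(-mu2 * x)
            w.append(l1 / (l1 + l2))
        # M-step: re-estimate the weight and the two rates
        s1 = sum(w)
        alpha = s1 / len(samples)
        mu1 = s1 / sum(wi * x for wi, x in zip(w, samples))
        mu2 = (len(samples) - s1) / sum((1.0 - wi) * x
                                        for wi, x in zip(w, samples))
    return alpha, mu1, mu2

random.seed(7)
# synthetic HE(2) demands: phase rate 4.0 w.p. 0.8, rate 0.5 otherwise
data = [random.expovariate(4.0 if random.random() < 0.8 else 0.5)
        for _ in range(10_000)]
alpha, mu1, mu2 = em_hyperexp(data)
```

At convergence the fitted mean alpha/mu1 + (1 − alpha)/mu2 matches the sample mean, and with well-separated phase rates the mixture is recovered reliably.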

We illustrate this strategy in Figure 14 by considering highly-variable processing times with SCV = 10. The effect of the high variability is evident as all the methods, including BL, show larger estimation errors. Increasing the sample size to 500, as in Figure 14(b), results in smaller errors and a smaller variability in the estimates. We observe that the modified FMLPS behaves very well under low to medium loads, but shows a large error under high loads. ERPS instead behaves similarly to previous cases, with errors larger than BL and FMLPS under low loads, and similar to BL under high loads. The case of high variability and high load is thus difficult for FMLPS, and in these cases ERPS is preferable. Another option to handle high variability is to extend FMLPS to consider PH distributions, which are more flexible and can approximate long-tailed processing times [38]. The downside is the additional computational effort needed to estimate a larger number of parameters. This will be the topic of future work.

8.4 A different scheduling policy

In the previous scenarios the samples were collected from a simulation where the V CPUs are fully shared, to represent time-sharing in the operating system scheduler. For illustration, we now consider a different scheduling policy, namely Join the Shortest Queue (JSQ), to allocate the jobs to the CPUs. Figure 15 depicts the estimation errors under this policy, for the case V = 4, W = 16, which are only slightly higher than those obtained under full sharing, depicted in Figure 7(a). The good performance of our methods is related to the ability of the JSQ policy to balance the number of jobs assigned to each processor, although it is blind to the job size. Simpler allocation alternatives, such as random allocation, do not have this balancing property, and the methods presented in this paper are not well suited for their analysis.
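The JSQ rule itself is a one-line policy. The toy sketch below (ours, not the simulator) illustrates the balancing property mentioned above: per-CPU job counts even out although the rule never looks at job sizes; the tie-breaking convention is our assumption, as the text does not specify one.

```python
import random

def jsq_assign(queues):
    """Join the Shortest Queue: return the index of the CPU with the fewest
    jobs, breaking ties at random (one common convention)."""
    shortest = min(queues)
    return random.choice([i for i, q in enumerate(queues) if q == shortest])

# Admit W = 16 jobs onto V = 4 CPUs with no departures: JSQ necessarily
# distributes them evenly, regardless of how ties are broken.
queues = [0, 0, 0, 0]
for _ in range(16):
    queues[jsq_assign(queues)] += 1
print(queues)   # [4, 4, 4, 4]
```

Random allocation, by contrast, would typically leave the counts unbalanced, which is why it falls outside the scope of the proposed methods.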

Fig. 15. JSQ Scheduling: estimation error (%) vs. number of users for BL, ERPS, and MINPS.

Fig. 16. Execution times: MLPS vs. FMLPS execution time [s] (log scale) as a function of log2 V.

8.5 Computational cost

The regression-based methods, RPS and ERPS, scale very well with the number of processors and threads, since the size of the linear problem that needs to be solved for the regression depends on the number of samples only. The ML procedures are instead computationally more expensive. In fact, the key approximation in MLPS was introduced to deal with the state-space explosion and its effect on the computation times. FMLPS overcomes this problem with the fluid approximation, and can therefore tackle problems with a large number of threads and processors. Figure 16 compares the execution times of MLPS and FMLPS for an increasing number of processors, while keeping the thread-to-processor ratio W/V = 4 and the number of users N = 140 fixed. This experiment was performed on a 3.4 GHz 4-core Intel Core i7 machine, with 4 GB RAM, running Ubuntu Linux 12.04. We observe how, for a small number of processors, MLPS is about one order of magnitude faster than FMLPS. However, its execution times increase rapidly with the number of processors and surpass those of FMLPS. Instead, FMLPS shows a more stable behavior, with execution times between 1 and 4 min.
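The scalability claim for the regression-based methods follows from the shape of the least-squares problem: the design matrix has one row per sample and one column per request class, independently of V and W. The sketch below illustrates this with synthetic data; it is a deliberate simplification, not the exact RPS/ERPS regressors, and all values are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
R, n = 2, 500                      # request classes, samples
true_d = np.array([0.10, 0.40])    # hypothetical per-class demands (seconds)

# Hypothetical monitoring data: each row is the queue-length mix observed
# for one sampled request; the response time is modeled as a linear
# function of this mix, with the per-class demands as coefficients.
A = rng.integers(1, 6, size=(n, R)).astype(float)
r = A @ true_d + rng.normal(0.0, 0.01, size=n)   # synthetic response times

# The problem solved is n-by-R regardless of the number of CPUs or threads.
d_hat, *_ = np.linalg.lstsq(A, r, rcond=None)
print(d_hat)   # close to [0.10, 0.40]
```

The ML methods, in contrast, must evaluate a likelihood over the server state space (or its fluid approximation), which is what makes them sensitive to V and W.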

9 CASE STUDY I: SAP ERP APPLICATION

In this section we consider a multi-threaded application with a small threading level, the ERP application of SAP Business Suite [19], which runs on top of SAP NetWeaver [39], a middleware that defines the underlying architecture. Similar to the model in Section 3.1, SAP NetWeaver has an admission control queue that receives the incoming requests and dispatches them to software threads, referred to as work processes (WPs), for processing. Admission occurs when a WP becomes idle, based on an FCFS scheme. Thus, the waiting time in the admission queue tends to become the dominating component of the response time as the number of active users becomes large with respect to the threading level. The operations in the WPs are CPU-intensive and include calls to a database.

We installed SAP ERP on a two-tier configuration composed of an application server and a database server residing on the same virtual machine, with no other virtual machines running on the same physical computer. The virtualization software is the VMware ESX server, configured with 32 GB of memory, 230 GB of storage, and 2 virtual CPUs, running at 2.2 GHz and mapped to separate physical CPU cores. We consider different scenarios varying the number of WPs W ∈ {3, 6}, and the number of users N ∈ [10, 150], which correspond to a CPU utilization in the range [0.03, 0.90]. The users issue requests via a closed-loop workload generator with exponential think times with mean 10 s. The workload we consider consists of six transaction types, related to the creation and display of sales orders, the creation of billing documents, and the generation of outbound deliveries.

Fig. 17. SAP - MINPS - ERPS - V = 2: estimation error (%) vs. number of clients for BL, MINPS, and ERPS, with (a) W = 3 and (b) W = 6.

We use the data collected to estimate the mean demand with BL, ERPS, and MINPS, opting for MINPS instead of FMLPS since the threading level W is small, a consequence of the high memory requirements of the requests. We compare the results against measurements obtained from the internal profiler of the ERP application (STAD). The profiler collects the CPU consumption per request type, including initialization steps and database calls. In this case, these execution times are considered part of the request demand, since the WP is not freed during the database call and the database itself is located on the same physical host. Discriminating these times would require additional monitoring to measure the time consumed in database calls.

Figure 17 depicts the estimation errors obtained with 30 experiments, each based on 600 samples. We observe a small estimation error, particularly for BL and MINPS, when the number of clients is between 10 and 75. For a larger number of clients, the error increases to around 20% for BL, and between 20% and 30% for MINPS and ERPS, which show similar results. Even in this case, both ERPS and MINPS remain very close to the best achievable error (BL). A first source of error is the variability of the processing times, reflected in an SCV in the range [1.01, 1.72], which is above that of the exponential, although not very far from it. Another source of error may be related to the database calls, which, as mentioned above, are not explicitly modeled. The proposed methods, in spite of the simplifying assumptions, are able to provide useful estimates, especially under low to medium load scenarios. In these experiments MINPS shows execution times between 1 and 10 min, with an average under 3 min, while ERPS requires less than 10 ms.

Fig. 18. OFBiz - Estimation errors relative to BL: error (%) vs. number of OFBench clients for FMLPS and ERPS, with (a) R = 4 and (b) R = 8.

10 CASE STUDY II: OFBIZ

In this section we turn to an application with a large threading level, featuring several tens of concurrently executing requests. We consider the open-source Apache OFBiz e-commerce application3, which has a three-layer architecture, including presentation, application, and data layers, and uses scripting languages to generate dynamic content. We generate the workload with OFBench [40], a benchmarking tool designed to recreate the behavior of a set of OFBiz users, generating semantically-correct requests.

We have deployed OFBiz and the OFBench client on separate Amazon EC2 instances. Different from the case study in the previous section, the deployment on a public cloud implies that the VMs used can be co-located with other VMs on the same physical host, posing additional challenges to the estimation methods. The traces collected are used to estimate the mean resource demand with the BL, ERPS, and FMLPS methods. In this case, the default maximum number of threads is 200, and we therefore opt for FMLPS instead of MINPS. The OFBiz workload consists of 35 request types [40], a number that significantly increases the execution times of FMLPS. We have therefore clustered these classes into four and eight sets. This approach reflects the practice in performance analysis of aggregating several request types into a few groups with homogeneous business meaning.
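The aggregation step above groups request types by business meaning; a purely mechanical alternative, sketched below for illustration only, groups types with similar mean response times into contiguous buckets. All names and values here are hypothetical and not taken from the OFBiz traces.

```python
import random

random.seed(3)
# hypothetical mean response time per request type (seconds)
type_means = {f"type{i:02d}": random.uniform(0.05, 2.0) for i in range(35)}

def aggregate(means, k):
    """Sort request types by mean response time and split them into k
    contiguous, similarly-sized groups (a simple quantile-style grouping)."""
    ordered = sorted(means, key=means.get)
    size = -(-len(ordered) // k)               # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

groups = aggregate(type_means, 4)
print([len(g) for g in groups])                # [9, 9, 9, 8]
```

Either way, reducing 35 classes to 4 or 8 shrinks the number of demand parameters to estimate, which is what keeps the FMLPS execution times manageable.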

Figure 18 depicts the estimation error obtained with ERPS and FMLPS, when comparing their estimates with those obtained with BL. Different from the ERP application considered in the previous section, no profiler is available for OFBiz, as is the case with many applications, restricting the possibility of obtaining the values of the actual demands. Further, in the previous section the deployment was performed in a private environment, where we could control that only the application VM was deployed on the physical host, and the virtual cores were mapped to the physical cores. In this section instead, we consider a cloud deployment, where no control is provided over the physical resources, and the actual processing capacity depends on the current resource usage of the physical host. As a result, we make use of the estimates obtained with BL as the basis to compare the other two methods. These results are for a server deployed on a c1.medium VM instance, which has 2 virtual CPUs, but we have obtained similar results for other instance types. The scenarios considered, with the number of users between 10 and 50, generate an average utilization between 25% and 65%. We observe similar estimation errors for both methods, with a maximum mean error around 20%. In the case with R = 4 classes, the error is around 10% when the number of users is between 10 and 30. For 40 and 50 users, the error is around 20%, which is also the prevalent value for the case with R = 8 classes.

3. Apache OFBiz: http://ofbiz.apache.org/

Fig. 19. OFBiz - FMLPS - Execution times: execution time [min] vs. number of classes.

We also observe higher variability in the results, which is to be expected given the application characteristics and the cloud deployment. As with the SAP ERP application, we observe that the methods with limited information, FMLPS and ERPS, provide estimates that are generally in good agreement with those provided by BL, which has full information about the requests' arrival and departure times. In the R = 4 case, using 400 samples, FMLPS experiences running times between 2 and 30 min, with an average close to 9 min. It is thus appropriate for offline analysis or for online use at coarse timescales, e.g., every hour. Figure 19 shows how the FMLPS execution times increase with the number of classes. Although feasible, assuming a large number of classes implies significantly larger computation times, which can be appropriate for offline analysis only. ERPS instead executes in under 10 ms, and is thus well-suited for online estimation.

11 CONCLUSION

We have introduced demand estimation methods for multi-threaded applications with either large or small threading levels. An implementation of these methods is available at https://github.com/imperial-modaclouds/modaclouds-fg-demand. The methods provide accurate results under different server setups, including different numbers of processors and threads, and different server loads. The MINPS method is well-suited for low threading levels, while for large threading levels FMLPS (resp. ERPS) offers better results under low (resp. large) loads. MINPS is partly based on MLPS, which assumes a fixed population during the request execution time. This assumption is removed in FMLPS, as the underlying fluid model is able to model systems with a large number of requests. In both of the maximum-likelihood methods we proposed, MLPS and FMLPS, we make the simplifying assumption that the response times are independent. Although the response times are not independent, their dependence can be reduced by selecting samples that are random or well-spaced in time, avoiding the response times of jobs whose executions overlap or that correspond to the same busy period.
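The well-spaced sample selection mentioned above can be sketched as a simple filter over a time-ordered trace; the function name, trace layout, and values below are illustrative rather than taken from our implementation.

```python
def select_spaced(samples, min_gap):
    """Keep only samples whose arrival times are at least min_gap apart,
    reducing the dependence between the response times passed to the ML
    methods. `samples` is a list of (arrival_time, response_time) pairs
    sorted by arrival time."""
    selected, last = [], float("-inf")
    for t, r in samples:
        if t - last >= min_gap:
            selected.append((t, r))
            last = t
    return selected

# Toy trace: samples at 0.5 and 2.4 fall within 1.0 s of a kept sample.
trace = [(0.0, 0.2), (0.5, 0.3), (2.1, 0.1), (2.4, 0.4), (5.0, 0.2)]
print(select_spaced(trace, min_gap=1.0))
# [(0.0, 0.2), (2.1, 0.1), (5.0, 0.2)]
```

Choosing min_gap larger than typical busy periods avoids taking two samples from jobs whose executions overlap.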

We evaluate the estimation methods proposed by means of two case studies: an enterprise application deployed on a local host, and an e-commerce application deployed on a public cloud. Cloud deployments pose additional challenges that require further study, particularly regarding the variability in the CPU resources effectively available to the application when other VMs are deployed on the same physical host. Demand estimation approaches that account for this variability, without requiring access to hypervisor information, could rely on metrics such as CPU steal, which records the time a virtual CPU has to wait for the physical CPU when the hypervisor is busy with other virtual CPUs [41]. This will be the topic of future work.
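On Linux guests, the steal-time counter is exposed in /proc/stat as the eighth value after the label on the aggregate cpu line (per the proc(5) field order: user, nice, system, idle, iowait, irq, softirq, steal, ...). A minimal parser might look as follows; the counter values in the example line are invented for illustration.

```python
def steal_ticks(stat_line):
    """Extract the steal-time counter (in clock ticks) from a /proc/stat
    'cpu' line on Linux: the time the virtual CPU waited for the physical
    CPU while the hypervisor served other virtual CPUs."""
    fields = stat_line.split()
    return int(fields[8])   # fields[0] is the 'cpu' label

# Example line with illustrative counter values:
line = "cpu 10132153 290696 3084719 46828483 16683 0 25195 175628 0 0"
print(steal_ticks(line))   # 175628
```

A steal fraction over an interval would then be the increase in this counter divided by the increase in the sum of all counters between two readings.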

REFERENCES

[1] B. Addis, D. Ardagna, B. Panicucci, and L. Zhang, "Autonomic management of cloud service centers with availability guarantees," in Proc. of IEEE CLOUD, 2010.

[2] A. Kalbasi, D. Krishnamurthy, J. Rolia, and S. Dawson, "DEC: Service demand estimation with confidence," IEEE Trans. Software Eng., vol. 38, pp. 561–578, 2012.

[3] L. Zhang, X. Meng, S. Meng, and J. Tan, "K-Scope: Online performance tracking for dynamic cloud applications," in Proc. of the 10th ICAC, 2013.

[4] E. D. Lazowska, J. Zahorjan, G. S. Graham, and K. C. Sevcik, Quantitative System Performance: Computer System Analysis Using Queueing Network Models. Prentice-Hall, Inc., 1984.

[5] M. Tribastone, P. Mayer, and M. Wirsing, "Performance prediction of service-oriented systems with layered queueing networks," in Proc. of the 4th ISoLA, 2010.

[6] S. Becker, H. Koziolek, and R. Reussner, "Model-based performance prediction with the Palladio component model," in Proc. of the 6th WOSP, 2007.

[7] A. Kalbasi, D. Krishnamurthy, J. Rolia, and M. Richter, "MODE: Mix driven on-line resource demand estimation," in Proc. of IEEE CNSM, 2011.

[8] W. Wang, X. Huang, X. Qin, W. Zhang, J. Wei, and H. Zhong, "Application-level CPU consumption estimation: Towards performance isolation of multi-tenancy web applications," in Proc. of the 5th IEEE CLOUD, 2012.

[9] P. Cremonesi, K. Dhyani, and A. Sansottera, "Service time estimation with a refinement enhanced hybrid clustering algorithm," in Proc. of ASMTA, 2010.

[10] P. Cremonesi and A. Sansottera, "Indirect estimation of service demands in the presence of structural changes," in QEST, 2012.

[11] S. Kraft, S. Pacheco-Sanchez, G. Casale, and S. Dawson, "Estimating service resource consumption from response time measurements," in Proc. of the 4th VALUETOOLS, 2009.

[12] D. Kumar, L. Zhang, and A. Tantawi, "Enhanced inferencing: estimation of a workload dependent performance model," in Proc. of the 4th VALUETOOLS, 2009.

[13] D. Menasce, "Computing missing service demand parameters for performance models," in CMG 2008, 2008, pp. 241–248.

[14] G. Pacifici, W. Segmuller, M. Spreitzer, and A. Tantawi, "CPU demand for web serving: Measurement analysis and dynamic estimation," Performance Evaluation, vol. 65, pp. 531–553, 2008.

[15] Q. Zhang, L. Cherkasova, and E. Smirni, "A regression-based analytic model for dynamic resource provisioning of multi-tier applications," in Proc. of the 4th ICAC, 2007.

[16] T. Zheng, C. Woodside, and M. Litoiu, "Performance model estimation and tracking using optimal filters," IEEE Trans. Software Eng., vol. 34, pp. 391–406, 2008.

[17] N. Roy, A. S. Gokhale, and L. W. Dowdy, "Impediments to analytical modeling of multi-tiered web applications," in Proc. of IEEE MASCOTS, 2010.

[18] M. Reiser and S. S. Lavenberg, "Mean-value analysis of closed multichain queueing networks," Journal of the ACM, vol. 27, pp. 313–322, 1980.

[19] (2011) Enterprise resource planning (ERP). [Online]. Available: http://www.sap.com/solutions/business-suite/erp/index.epx

[20] G. Casale, P. Cremonesi, and R. Turrin, "Robust workload estimation in queueing network performance models," in Proc. of the 17th PDP, 2008.

[21] T. Zheng, J. Yang, M. Woodside, M. Litoiu, and G. Iszlai, "Tracking time-varying parameters in software systems with extended Kalman filters," in Proc. of CASCON '05, 2005.

[22] A. Khan, X. Yan, S. Tao, and N. Anerousis, "Workload characterization and prediction in the cloud: A multiple time series approach," in Proc. of IEEE NOMS, 2012, pp. 1287–1294.

[23] D. Gmach, J. Rolia, L. Cherkasova, and A. Kemper, "Workload analysis and demand prediction of enterprise data center applications," in Proc. of the 10th IEEE IISWC, 2007.

[24] C. Isci, J. Hanson, I. Whalley, M. Steinder, and J. Kephart, "Runtime demand estimation for effective dynamic resource management," in Proc. of IEEE NOMS, 2010.

[25] F. Brosig, F. Gorsler, N. Huber, and S. Kounev, "Evaluating approaches for performance prediction in virtualized environments," in Proc. of MASCOTS, 2013.

[26] W. Wang and G. Casale, "Bayesian service demand estimation with Gibbs sampling," in Proc. of IEEE MASCOTS, 2013.

[27] J. V. Ross, T. Taimre, and P. K. Pollett, "Estimation for queues from queue length data," Queueing Systems, vol. 55, pp. 131–138, 2007.

[28] C. Sutton and M. I. Jordan, "Bayesian inference for queueing networks and modeling of internet services," The Annals of Applied Statistics, vol. 5, pp. 254–282, 2011.

[29] Z. Liu, L. Wynter, C. H. Xia, and F. Zhang, "Parameter inference of queueing models for IT systems using end-to-end measurements," Perf. Eval., vol. 63, no. 1, pp. 36–60, 2006.

[30] C. Stewart, T. Kelly, and A. Zhang, "Exploiting nonstationarity for performance prediction," in Proc. of EuroSys '07, 2007.

[31] "Apache module log-firstbyte." [Online]. Available: https://code.google.com/p/mod-log-firstbyte/

[32] M. Bertoli, G. Casale, and G. Serazzi, "JMT: performance engineering tools for system modeling," ACM Perf. Eval. Rev., vol. 36, pp. 10–15, 2009.

[33] G. Latouche and V. Ramaswami, Introduction to Matrix Analytic Methods in Stochastic Modeling. SIAM, 1999.

[34] J. F. Perez and G. Casale, "A fluid model for closed queueing networks with PS stations," Imperial College London, Tech. Rep. DTR13-8, 2013. [Online]. Available: https://www.doc.ic.ac.uk/research/technicalreports/2013/DTR13-8.pdf

[35] T. G. Kurtz, "Solutions of ordinary differential equations as limits of pure jump Markov processes," Journal of Applied Probability, vol. 7, pp. 49–58, 1970.

[36] R. A. Hayden, A. Stefanek, and J. T. Bradley, "Fluid computation of passage-time distributions in large Markov models," Theor. Comput. Sci., vol. 413, no. 1, pp. 106–141, Jan. 2012.

[37] W. Whitt, "Approximating a point process by a renewal process, I: two basic methods," Operations Research, vol. 30, pp. 125–147, 1982.

[38] A. Riska, V. Diev, and E. Smirni, "Efficient fitting of long-tailed data sets into hyperexponential distributions," in Proc. of IEEE GLOBECOM '02, 2002.

[39] T. Schneider, SAP Performance Optimization Guide. SAP Press, 2003.

[40] J. Moschetta and G. Casale, "OFBench: an enterprise application benchmark for cloud resource management studies," in Proc. of MICAS, 2012.

[41] G. Casale, C. Ragusa, and P. Parpas, "A feasibility study of host-level contention detection by guest virtual machines," in Proc. of CloudCom, 2013.

Juan F. Perez is a Research Associate in performance analysis at Imperial College London, Department of Computing. He obtained a PhD in Computer Science from the University of Antwerp, Belgium, in 2010. His research interests center around the performance analysis of computer systems, especially cloud computing and optical networking.

Giuliano Casale has been a Lecturer in performance analysis and operations research at Imperial College London, Department of Computing, since January 2013. Prior to this, he was a full-time researcher at SAP Research (UK) and a postdoctoral research associate at the College of William & Mary (US). He obtained a PhD from Politecnico di Milano (Italy) in 2006. He has served as program co-chair for ACM SIGMETRICS/Performance 2012, QEST 2012, and ICAC 2014, as general co-chair for ACM/SPEC ICPE 2013, and as technical program committee member for more than 50 conferences and workshops.

Sergio Pacheco-Sanchez is a Research Specialist in performance analysis at the SAP HANA Cloud Computing, Systems Engineering division. Sergio obtained a PhD from the University of Ulster, UK, in 2012. Previously he was a Research Associate at SAP Research (UK), where he investigated and developed statistical inference methods to effectively parameterize queueing models for IT systems. He is currently investigating the impact of the latest hardware and software technologies on the performance of in-memory distributed database systems.