
Internet Traffic Modeling for Pakistan Internet Exchange

Mehreen Alam
FAST-NU, Islamabad

[email protected]

Abstract-This paper models the Internet traffic demand by applying statistical techniques to data collected between two nodes of Pakistan's Internet backbone over a period of two years. The traffic model is used to make predictions of future Internet usage, which simplifies the task of capacity planning for network management by helping them determine when, and to what extent, future provisioning is required in the backbone. This is not easy, as quantified information is missing about the rate of increase and the pattern followed by the Internet traffic demand. We have used various statistical analysis methods on the aggregate Internet demand between two nodes to isolate the long term trend from the noisy short term fluctuations in the overall traffic pattern, ensure its variance is within control limits, and finally build a model from it to make predictions for the future. The accuracy of the traffic model developed is proven empirically by the comparative result showing that the future estimates deviated by only 7% from the actual values observed during that time interval.

I. INTRODUCTION

The purpose of this paper is to model the Internet traffic for capacity planning purposes by employing statistical techniques on data collected between two nodes of Pakistan's Internet backbone over a period of two years. We have used various statistical analysis methods on the aggregate Internet demand between two nodes to isolate the long term trend from the noisy short term fluctuations in the overall traffic pattern, ensure its variance is within control limits, and finally build a model from it to make predictions for the future.

The research methodology incorporates wavelet multi-resolution analysis, analysis of variance and linear time series models. Wavelet multi-resolution analysis smoothes the collected measurements until the overall long-term trend is identified while keeping the short term trends intact. The fluctuations around the trend are further analyzed at multiple time scales. It has been observed that the largest amount of variability in the original signal is due to its fluctuations at the 24-hour time scale, which confirms the diurnal pattern exhibited by the traffic demand. We show that for network provisioning purposes one needs only to account for the overall long term trend, as the fluctuations at smaller time scales make an insignificant contribution to the Internet traffic pattern across the backbone. We model the inter-Point of Presence (PoP) aggregate demand using two components: the long term trend and its variation. The inter-PoP aggregate demand is mapped to a multiple linear regression model with the above-mentioned two identified components. Using Analysis of Variance (ANOVA) techniques, it is shown that the proposed model captures 98% of the total energy of the original signal and accounts for 91% of its variance. Weekly approximations of those components can be accurately modeled with low-order Auto-Regressive Integrated Moving Average (ARIMA) models. The accuracy of the traffic model developed is proven empirically by the comparative result showing that the future estimates deviated by only 7% from the actual values observed during that time interval. The method is practical in terms of computation and storage but is robust only as long as the technology of the Internet service model is unchanged.

The present mode of Internet traffic modeling and forecasting is restricted to two nodes only, which does not give us a consistent global status of the whole backbone. Future work would target modeling and forecasting for the overall backbone capacity planning, which would also help in load balancing, handling links that are temporarily down, and determining where a new node or link is required.

II. MEASUREMENT ENVIRONMENT

A. Pakistan Internet Exchange

The Pakistan Internet Exchange (PIE) was created in 2000 to cater for the needs of IP/ATM connectivity via a single core data backbone for the whole country. Figure 1 shows the existing architecture of the PIE network topology interconnecting major cities of Pakistan. All the links between the nodes of Karachi and Lahore (marked in red in Figure 1) were monitored over a period of almost 2 years (Dec 2003 to Dec 2005). New links were added and old ones were replaced by technologically advanced links. The data in bits per second is recorded at each link connecting the PoP Lahore with the PoP Karachi. Only the data exchanged on the links between these two nodes is taken into account.

Fig. 1. Architecture of Pakistan Internet Exchange.



B. Collected Data

Data was collected over a time period of approximately two years, with data missing for a few months of the first year and for a few weeks of the second. Missing data could have been interpolated, but we chose to keep it unmodified to preserve the originality of the data, as it has to undergo many levels of approximation later on. Data was captured and saved in graphical format using the MRTG tool; it had to be converted to vector form using customized image processing techniques.

C. MRTG Data

MRTG is an acronym for Multi Router Traffic Grapher, a tool to monitor the traffic load on network links. Normally, MRTG retrieves the statistics using SNMP, records the configuration values, updates the statistics, saves them in a graphical representation and finally incorporates the graphs in HTML pages. The whole process is typically repeated every five minutes and one HTML page is generated that features four graphs. The traffic workload on the node is plotted against time in day, week, month and year graphs. The green area shows the Internet traffic input on the node while the blue line marks the output traffic; we are only concerned with the input traffic and restrict the analysis to the green region for the time being. A sample MRTG graph is shown below (Figure 2).

D. Data Specifications

Data was collected from Dec 03 to Dec 05, with HTML pages generated at a frequency of 5 minutes for three links (namely lhr-kh-2, lhr-kh-10 and lhr-kh-30) between the nodes of Lahore and Karachi. However, data was not available for the whole duration, either because a link was down temporarily or because the stored data was lost. Data for the link 'lhr-kh-2' is missing for 29 weeks of the year 2004. Data for both the links 'lhr-kh-2' and 'lhr-kh-10' has some days when the whole data was lost together or the links were down separately.

The reason for keeping data in weekly form was to keep intact the trend exhibited in a week. For example, usage at weekends differs from weekdays; the former usually has a lower load. Such variations get smoothed out to an extent but still contribute to the Internet traffic demand model. Once a satisfactory model has been built, we deal with just one data value for each week. The predicted data is also expressed as one value per week.

Missing data could have been interpolated but we chose to omit those data points altogether. Firstly, almost half of the first year's data was missing; interpolating it would mean a substantial contribution of dummy data in the analysis, leading to inaccurate results. Secondly, it was wiser to preserve the originality of the data, as it already has to undergo many levels of approximation in aggregation, multi-time-scale resolution and forecasting using time series analysis.

A week's data is included in the analysis only if complete data is available for the week from Monday to Sunday, both inclusive. The order of the days is also important; for example, data for a week running Tuesday to Monday is not acceptable as it does not capture the weekly trend followed by weekdays and weekends.

Keeping the target of looking at the long term, it was imperative to select a graph of appropriate time granularity amongst the graphs MRTG gives for each day, week, month and year. The time granularity of each graph is 5 minutes, 30 minutes, 2 hours and 1 day respectively. Out of these, the 2-hour data was chosen as it is refined enough to safeguard the daily variation and at the same time coarse enough to remove the high-frequency components. Another, more technique-oriented reason is elaborated under the topic of multi-resolution time analysis. Last but not least, the data does not suffer from possible inaccuracies in the SNMP measurements, since such events are smoothed out by the averaging operation (each 2-hour value is an average over the 5-minute intervals).

Data was available in a visual representation stored in the lossless compressed bitmap image format PNG (Portable Network Graphics). Digital image processing techniques had to be applied on the individual image files to extract the data into vectors.
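As an illustration of this kind of pixel-based extraction, the sketch below recovers the input-traffic (green) area from an MRTG PNG. It is not the authors' procedure: the plotting-area coordinates, the y-axis maximum and the green-channel thresholds are hypothetical values that would have to be calibrated against each graph.

    import numpy as np
    from PIL import Image

    # Hypothetical calibration values -- in practice read off the MRTG graph axes.
    PLOT_LEFT, PLOT_RIGHT = 81, 581    # x-pixel range of the plotting area (assumed)
    PLOT_TOP, PLOT_BOTTOM = 14, 114    # y-pixel range of the plotting area (assumed)
    Y_AXIS_MAX_BPS = 2_000_000         # traffic value at the top of the y axis (assumed)

    def extract_input_traffic(png_path):
        """Recover one traffic sample per pixel column from an MRTG graph.

        The green area in an MRTG graph is the input traffic; the height of
        that area in each column is proportional to the bits-per-second value.
        """
        rgb = np.asarray(Image.open(png_path).convert("RGB"), dtype=int)
        plot = rgb[PLOT_TOP:PLOT_BOTTOM, PLOT_LEFT:PLOT_RIGHT]

        # A pixel is "green" if its green channel clearly dominates red and blue.
        r, g, b = plot[..., 0], plot[..., 1], plot[..., 2]
        green = (g > 150) & (r < 100) & (b < 100)

        # Height of the green region in each column -> traffic value.
        height_px = green.sum(axis=0)
        plot_height = PLOT_BOTTOM - PLOT_TOP
        return height_px / plot_height * Y_AXIS_MAX_BPS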

E. Data Transformation

For each link between the two nodes (Lahore and Karachi), demand is aggregated to form one vector. The intricacy involved in this procedure was to map the up and down state of each link to the corresponding timeline, while also making sure that the data for the chosen week is complete and belongs to the corresponding week number in the chosen month. A quick check to validate the correctness of the algorithm was to ensure that the length of the vector is a multiple of seven.

The final transformation done on the data before it is used for modeling is aggregation. Groups of three points are averaged into one point, which changes the data granularity from 2 hours to 6 hours. The reason for applying this aggregation becomes clear when the data is smoothed out in wavelet multi-resolution analysis (MRA). Wavelet MRA looks into the properties of the signal at time scales 2^j times coarser than the finest time scale, and the collected measurements exhibit strong periodicities at cycles of 12 and 24 hours. With 1.5 hours as the finest time scale, the third and fourth resolutions give the granularity at 12 and 24 hours:

3rd: 2^3 * 1.5 = 12 hours and 4th: 2^4 * 1.5 = 24 hours   (1)

while with 6 hours as the finest time scale it takes just the first and the second resolution to achieve the same objective:

1st: 2^1 * 6 = 12 hours and 2nd: 2^2 * 6 = 24 hours   (2)

Both are ways of smoothing the data. Using the 6-hour aggregated data, the first two MRA levels amount to conventional averaging when compared to the 1.5-hour data. Analyzing the data at the smaller time scales of 1.5 hours or 30 minutes is kept for further work.
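The 2-hour to 6-hour aggregation described above is a simple block average of three consecutive samples; a minimal sketch (the function name and the handling of a ragged tail are illustrative assumptions) is:

    import numpy as np

    def aggregate_to_6h(samples_2h):
        """Average non-overlapping groups of three 2-hour samples into 6-hour points."""
        samples_2h = np.asarray(samples_2h, dtype=float)
        usable = len(samples_2h) - len(samples_2h) % 3   # drop a ragged tail, if any
        return samples_2h[:usable].reshape(-1, 3).mean(axis=1)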

III. STATISTICAL ANALYSIS

Thorough observation of the collected data shows that it has multi-scale variability, strong periodicity and is non-stationary in nature. Periodicity with such attributes helps in coming up with forecasts for the next cycle of the same period. For instance, periodicities at the weekly cycle imply that the time series behavior from one week to the next can be predicted.

The data obtained has abrupt spikes and ditches (Figure 3). The former could be caused by a surge of traffic during a specific time resolution due to the transfer of load from one link to the other, routing changes, or simply denial-of-service attacks. A ditch simply implies that the link is down for that particular time interval. But as the aim of the project is long-term trend analysis, these abrupt instantaneous changes disturb the inferences and predictions we intend to make from the data. Therefore, we smooth out these outliers by aggregation and wavelets.

Fig. 3. Aggregated Internet traffic demand over the whole year.

F. Multi-Time Scale Analysis

Wavelet MRA makes use of a wavelet function ψ(t) (the mother wavelet) and a scaling function φ(t) (also called the father wavelet) in the time domain. At each time scale, the signal is decomposed into an approximate signal and a detailed signal through a series of wavelet functions ψjk(t) and scaling functions φjk(t), where j is the resolution level and k is the translation index.

The scaling and wavelet functions are obtained by dilating and translating the father scaling function φ(t) and the mother wavelet function ψ(t): φjk(t) = 2^(-j/2) φ(2^(-j) t - k) and ψjk(t) = 2^(-j/2) ψ(2^(-j) t - k). The approximation is represented by a series of (scaling) coefficients ajk and the detail by a series of (wavelet) coefficients djk.

The reason for aggregating the data points to a resolution of 6 hours now becomes clear. Wavelet MRA operates at time scales 2^j times coarser than the finest time scale, where j is the resolution index. As evident from Table I, decomposition at the two-hour granularity would deprive us of inspecting the periodicity at 12 and 24 hours, while the granularity of 6 hours captures both directly.

Wavelet MRA Application: For the MRA we have used the Haar wavelet, which is a special form of the Daubechies wavelets and has a single vanishing moment. Both of these attributes are pertinent in the analysis of signals that exhibit self-similarity and require easy encoding of information. In Figures 5 and 6, we show the approximation and detail of the original Internet traffic up to 4 levels at each time scale. The resolution levels are 12, 24, 48 and 96 hours, the last one being the coarsest. We did not add another level of granularity as it would take the window of observation over a week, by which we would lose the variability that exists from week to week. 96 hours makes up 4 days; getting any coarser would bring an overlap between weeks. Resolutions below 96 hours carry the effects of the 12-hour or 24-hour periods; these effects have been plotted in the detailed signals of the decompositions.
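A minimal sketch of such a Haar decomposition, assuming the PyWavelets (pywt) package and the 6-hour input samples, is shown below; reconstructing each level back to the original length gives the approximation a4 and the detail signals d1..d4 referred to in the text.

    import numpy as np
    import pywt

    def haar_mra(x, levels=4):
        """Haar multi-resolution analysis of the 6-hour aggregated demand.

        Returns the level-`levels` approximation signal and the detail signals
        d1..d`levels`, each reconstructed to the length of the input so that
        x is (approximately) the sum of the approximation and all details.
        """
        x = np.asarray(x, dtype=float)
        coeffs = pywt.wavedec(x, "haar", level=levels)

        # Reconstruct the approximation (keep only the coarsest coefficients).
        approx = [coeffs[0]] + [np.zeros_like(c) for c in coeffs[1:]]
        a = pywt.waverec(approx, "haar")[: len(x)]

        # Reconstruct each detail signal separately.
        details = []
        for j in range(1, levels + 1):
            sel = [np.zeros_like(c) for c in coeffs]
            sel[-j] = coeffs[-j]          # d_j sits at position -j in wavedec output
            details.append(pywt.waverec(sel, "haar")[: len(x)])
        return a, details   # for 6-hour data, details[1] is d2, the 24-hour fluctuation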


Fig. 4. Original Aggregated Signal

TABLE I
RESULTANT TIME SCALE FOR TIME GRANULARITY OF 2 HOURS AND 6 HOURS

Decomposition level   2-hour granularity     6-hour granularity
1st                   2^1 * 2 = 4 hours      2^1 * 6 = 12 hours
2nd                   2^2 * 2 = 8 hours      2^2 * 6 = 24 hours
3rd                   2^3 * 2 = 16 hours     2^3 * 6 = 48 hours
4th                   2^4 * 2 = 32 hours     2^4 * 6 = 96 hours
5th                   2^5 * 2 = 64 hours     -
6th                   2^6 * 2 = 128 hours    -


To prove the sufficiency of the approximation signal with respect to the decompositions done, we calculate the percentage of the energy retained by the final approximation signal over the total energy of the signal. The percentage energy of each of the detail signals is also calculated. It is evident that the overall trend has a huge share of almost 96% even after the decomposition. Amongst the detailed signals, d2 has the maximum energy share, which corresponds to the fluctuations across 24 hours. Combining the approximation a4 and the detailed signal d2 with equal weightage captures the maximum energy, 97.95% (95.89% + 2.06%), of the original aggregated signal.
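The energy bookkeeping behind these percentages can be expressed as a small helper; this is a sketch, with variable names following the text rather than any code from the paper:

    import numpy as np

    def energy_share(a4, details, x):
        """Percentage of the original signal's energy carried by each component."""
        x = np.asarray(x, dtype=float)
        total = np.sum(x ** 2)
        shares = {"a4": np.sum(np.asarray(a4) ** 2) / total * 100}
        for j, d in enumerate(details, start=1):
            shares[f"d{j}"] = np.sum(np.asarray(d) ** 2) / total * 100
        return shares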

ANOVA and Regression Analysis: For model formulation, the combination of the approximation and the 24-hour time scale detailed signal is used, as it maximizes the energy of the signal while keeping the linear multiple regression model simple. We have used the regression technique to come up with a linear model and evaluated its degree of correctness using the analysis of variance (ANOVA) technique. Both ANOVA and regression analysis study the impact of the independent variables (the approximation and the second time scale signal) on the response variable (the original signal). But while ANOVA seeks to define the scope of the variables that will be included in an experiment, regression analysis determines the coefficients for each variable. The variance of the detail signals d1(t), d3(t) and d4(t) contributes less than 5% to the original, and so we choose d2 and a4 as the final variables for the reduced model.

x(t) = a4(t) + β * d2(t) + e(t)   (3)

Next, we used the least squares method to evaluate the coefficients, which come out to be 0.907 for the approximation signal and 0.298 for d2(t). Simplifying the model, we take the coefficient for the approximation signal to be 1 and 0.3 for d2(t). The goodness of the regression (coefficient of determination R²) is quite significant (R² = 0.912), which means that the model accounts for a large fraction of the variability in the original signal. We also added the d1(t) detailed signal to see how this affects the R² value; it does increase to 0.955 and the residual sum of squares is halved, but the complexity introduced by the additional variable outweighs the benefit of a more accurate estimate. So we restrict ourselves to a4(t) and d2(t) only.
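A least-squares fit of model (3) can be sketched as follows, assuming a4, d2 and the original signal x are aligned NumPy arrays; the coefficient values quoted above (0.907 and 0.298) come from the authors' data, not from this illustrative code.

    import numpy as np

    def fit_regression(x, a4, d2):
        """Least-squares fit of x(t) = alpha * a4(t) + beta * d2(t) + e(t)."""
        x = np.asarray(x, dtype=float)
        X = np.column_stack([a4, d2])
        coef, _, _, _ = np.linalg.lstsq(X, x, rcond=None)
        alpha, beta = coef
        resid = x - X @ coef
        # Coefficient of determination relative to the mean of x.
        r2 = 1.0 - np.sum(resid ** 2) / np.sum((x - x.mean()) ** 2)
        return alpha, beta, r2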

Results and Deductions: Decomposition of the original Internet traffic reveals the existence of a long term trend in the traffic. The most influential period affecting the input traffic is the diurnal pattern, followed by the periodicity exhibited at the 12-hour interval. The fluctuations around this long term trend are mostly due to the significant changes in the traffic bandwidth at the time scale of 24 hours. The combined effect of the approximation signal and the diurnal variation accounts for 98% of the total energy in the original signal, which is a good foundation on which to base our future predictions about the traffic demand.

Capacity planning and finding long term trends require that the data is looked at from a broader window, i.e. in terms of weeks, rather than relying on the 24-hour variations over a week. For this, we find the mean of d2(t) (the 24-hour detail) for each week and also find its standard deviation. This smoothes out the diurnal periodicity present in the data and leads to a less complex ARIMA model (discussed in the section on time series analysis). The mean and the variance of the daily traffic represent the fluctuations of the traffic around the long term trend from day to day within each particular week. Therefore, we get one value per week for the approximation signal, denoted by l(t), and express the variation in terms of the detailed signal denoted by dt2(t). The ANOVA technique reinforces the conclusions reached above, as the maximum cause of variation in the original signal is the approximation signal and the detailed signal for the diurnal period. We proposed a model which explains approximately 91% of the variance in the original signal. At present we are restricted to the links between two nodes; when it comes to a generic model for the overall backbone, it is important to keep room for more variation. Keeping this strategy in mind, we propose the following model that fully captures the variations in the Internet traffic demand.

x(t) = l(t) + 1.5 * dt2(t) (4)
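A sketch of how the weekly quantities in (4) could be computed from the 6-hour signals is given below, assuming four samples per day (28 per week); l(t) is taken as the weekly mean of the approximation and dt2(t) as the weekly standard deviation of the 24-hour detail. The names and the per-week sample count are illustrative assumptions.

    import numpy as np

    SAMPLES_PER_WEEK = 7 * 4   # 6-hour samples: 4 per day, 28 per week (assumed layout)

    def weekly_model(a4, d2, k=1.5):
        """Weekly trend l(t), daily-variation spread dt2(t), and the upper
        envelope l(t) + k * dt2(t) used for provisioning."""
        n_weeks = len(a4) // SAMPLES_PER_WEEK
        a4 = np.asarray(a4[: n_weeks * SAMPLES_PER_WEEK], dtype=float).reshape(n_weeks, -1)
        d2 = np.asarray(d2[: n_weeks * SAMPLES_PER_WEEK], dtype=float).reshape(n_weeks, -1)

        l = a4.mean(axis=1)      # one trend value per week
        dt2 = d2.std(axis=1)     # spread of the 24-hour fluctuation within the week
        return l, dt2, l + k * dt2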

In Figure 7 we show the aggregate traffic demand, the long term trend in the data, and the two curves showing the approximation of the signal as the long term trend plus and minus 1.5 times the average daily standard deviation for each particular week. This model covers all the data points within the weekly standard deviation. Short term trends have almost vanished and the long term trend followed by the traffic pattern becomes apparent. Building on the foundations of this model, we project the trend to come up with forecasts that help in the capacity planning of the backbone. The next section discusses how time series models help us in this projection and later compares the values predicted with the actual observations over the testing period.

Fig. 7. Aggregate traffic demand against months.

G. Uni-Variate Time Series Analysis

The first requirement for an adequate model of Internet traffic is that it must be stochastic and not deterministic. There are many factors affecting the amount of traffic on the backbone, most of which cannot be measured or identified. To predict probable future traffic, the best available basis is an analysis of previously observed traffic patterns. Because we want to examine changes in the traffic over time, the second requirement is that the model must be a time series model. Furthermore, the Internet traffic data is non-stationary, so the model must be of a form that can accept non-stationary data. A time series model that fits these criteria is the autoregressive integrated moving average (ARIMA) process [2].

The ARIMA model is an extension of a set of time series models called autoregressive (AR), moving average (MA) and autoregressive moving average (ARMA). An autoregressive model of order p (denoted AR(p)) predicts the current value of a time series from the weighted sum of p previous values of the process plus a random shock a. (A shock is a random drawing from a white noise process with zero mean and finite variance.) A moving average model of order q (denoted MA(q)) predicts the current value from a random shock a and weighted values of q previous shocks. If these models are combined, the ARMA model of order (p, q) predicts the current value of the time series from p previous values and q previous shocks. The advantage of the ARMA model is that many stationary time series can be modeled with p and q values of 0, 1, or 2.

Fitting an ARIMA Model to Data: When fitting an ARIMA model to time series data, there are three basic steps which are used iteratively until a successful model is achieved:

1. Model Identification: This is the determination of the likely values of p, d, and q for this set of data. Often there will be several plausible models to be examined.

2. Parameter Estimation: Once a set of possible models has been selected, parameter (coefficient) values are determined for each model.

3. Diagnostic Checking: This involves both checking how well the fitted model conforms to the data, and also suggests how the model should be changed in case of a lack of good fit. Based on the outcome of the diagnostic test, p, d or q may be changed and steps 2 and 3 are repeated.

Once a good-fitting ARIMA model has been found by this method, it can be used to make forecasts of the future behavior of the system.

Time Series Analysis of the Long Term: ARIMA(1,1,1) was fitted using SPSS 11.5, with values calculated within 95% confidence intervals. Melard's algorithm was used for estimation, with termination criteria of a maximum Marquardt constant of 1.0E+09 and a maximum of 10 iterations. The resultant equation is as follows.

z(t) = 1.66678 + 0.07442 z(t-1) + 0.54446 a(t-1) + a(t)   (5)
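The paper fits the model with SPSS 11.5; an equivalent sketch in Python using statsmodels (an assumption, not the authors' tooling) would look like the following, with the drift term standing in for the constant in the differenced series.

    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    def fit_arima_111(weekly_demand):
        """Fit ARIMA(1,1,1) with drift to the weekly trend values."""
        model = ARIMA(np.asarray(weekly_demand, dtype=float),
                      order=(1, 1, 1), trend="t")   # drift term for d = 1
        result = model.fit()
        print(result.summary())     # AR, MA and trend coefficients, AIC, BIC
        return result

    # Eight-week forecast with 95% confidence intervals, as used for provisioning:
    # res = fit_arima_111(weekly_trend)
    # forecast = res.get_forecast(steps=8)
    # print(forecast.predicted_mean, forecast.conf_int(alpha=0.05))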

To check for the independence of the residuals, we calculated the autocorrelation function (acf) of the residuals and compared the critical values with the resultant t-values found using Bartlett's approximation for the standard error of the estimated autocorrelations. The residuals pass the test for independence if the t-values for lags 1, 2 and 3 are below 1.25, while for the remaining lags the t-values have to be less than 1.6.

The t-values for ARIMA(1,1,1) are well below the critical values, strengthening our confidence in the chosen model. We compared our model with ARIMA (1,0,0), (0,0,1), (1,1,0), (0,1,1), (2,0,0), (0,0,2), (2,1,2), (0,1,2) and (2,1,2). Against all these models, ARIMA(1,1,1) gave the best value of the AICC, BIC and log-likelihood objective functions and the smallest mean square prediction error [3]. The above-mentioned criteria not only improve the validity of the model but also fit the data parsimoniously by penalizing models with a large number of variables.
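A sketch of the residual check and order comparison, assuming statsmodels, is shown below; the Bartlett-based t-values follow the description above, while the candidate orders and any cut-off used on the scores are left to the caller.

    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA
    from statsmodels.tsa.stattools import acf

    def residual_t_values(residuals, nlags=10):
        """t-values of residual autocorrelations using Bartlett's standard error."""
        r = acf(residuals, nlags=nlags, fft=False)
        n = len(residuals)
        t_values = []
        for k in range(1, nlags + 1):
            se = np.sqrt((1 + 2 * np.sum(r[1:k] ** 2)) / n)   # Bartlett's approximation
            t_values.append(abs(r[k]) / se)
        return np.array(t_values)

    def compare_orders(weekly_demand, orders):
        """Score candidate ARIMA orders by AIC and BIC (lower is better)."""
        scores = {}
        for order in orders:
            res = ARIMA(np.asarray(weekly_demand, dtype=float), order=order).fit()
            scores[order] = (res.aic, res.bic)
        return scores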

Models of Traffic Demand: The proposed ARIMA(1,1,1) model has been thoroughly diagnostically tested and compared with the other prospective models. Differencing at lag 1 and a constant μ indicate that the long-term trend across all the traces is an exponential smoothing with growth, while the autoregressive part does not let the slope of the line equal μ until its effect dies out as the predictions are extended farther in time. This also implies that the long-term forecasts ultimately follow a sloping line with gradient equal to μ (i.e. 1.75 Mbps). This relates to the average aggregate demand between the two nodes only. This increase in traffic demand between the nodes can be used to calculate the cumulative increase in the Internet traffic of the whole backbone. The claim that the traffic demand between the two nodes increases at the rate of 1.75 Mbps under ARIMA(1,1,1) was made possible by the averaging that aggregates the two-hour data onto one point per week. This simplified the Box-Jenkins methodology by removing the three seasonal components (12, 24 and 48 hours) and evening out the effect of outliers. Had the original time series been used, it would have led to a highly unstable model, inaccurate forecasts and, at the same time, very expensive computation. Forecasting done with the aggregated time series is evaluated against the actual traffic in the following section.

Evaluation of Forecasts: Figure 8 shows the actual trend line with positive/negative deviations as well as the predicted traffic demand within 1.5 times the deviation from the actual average value. The green vertical line demarcates the point from which onwards the predictions are made. Empirically, as well as visually, it is clear that the predictions made by our model fully encompass the traffic demand variation on a weekly basis. For capacity planning purposes, we would consider the upper bound on the aggregate demand, as we want a congestion-free backbone infrastructure.


We find the percentage error incurred by our forecasts when compared against the actual measurements for each week during the evaluation period. A positive error refers to an overestimation, while a negative one points to an underestimated future prediction, i.e. the observed demand was larger than predicted. The accuracy of the predicted results can also be seen from the fact that the error values are centered around zero despite fluctuations from time to time. On average, the error incurred is 5% for all the links between the two nodes. Forecasts could have been made for a longer period, but to stay accurate and keep the error at its minimum it was reasonable to predict for a smaller, restricted period. We start facing greater deviations from the actual values as we move farther ahead in the forecasting time span. The effect gets more visible when discrepancies across all the nodes/links accumulate into a large error in the backbone. Even a larger input data set is unlikely to give far-future predictions that are close to the actual values. The solution lies in re-estimation. We propose to set a threshold for the maximum error encountered and revert to the model fitting phase when this threshold is crossed. This yields a new, adaptive, better-fitting model whose future values are more accurate than those of the previous model.
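The week-by-week evaluation can be sketched as a percentage-error computation with a re-estimation trigger; the 10% threshold shown here is an illustrative value, not one taken from the paper.

    import numpy as np

    def forecast_errors(actual, predicted):
        """Signed percentage error per week; positive means overestimation."""
        actual = np.asarray(actual, dtype=float)
        predicted = np.asarray(predicted, dtype=float)
        return (predicted - actual) / actual * 100.0

    def needs_refit(actual, predicted, threshold_pct=10.0):
        """Trigger re-estimation when any weekly error exceeds the threshold."""
        return np.any(np.abs(forecast_errors(actual, predicted)) > threshold_pct)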

IV. DISCUSSION AND FUTURE WORK

The first target is to increase the accuracy of the model by overcoming the limitation of the ARIMA model, which bases its predictions solely on previous data. By nature, ARIMA fails to take into account the impact of outside forces that may fundamentally change the pattern of the data.

Along with the forecasts at the weekly level, we can extend the analysis to a finer time scale, daily or 12-hourly. This would give us the pattern from one day to the next, which might not be of much help in capacity planning for the core routers, but might be useful for other network engineering tasks like scheduling of maintenance windows or large database network backups.

Data at a granularity of 2 hours was taken and further aggregated to a scale of 6 hours so as to remove the variation at 12 hours and 24 hours. A finer approach would be to take data at the time scale of 1.5 hours and use MRA directly so that the other contributing frequencies are not lost in simple, direct averaging. Another way to get a clearer picture of the traffic pattern in the whole backbone is to look into the output traffic data too, which is shown by the blue line on the MRTG graph.

V. CONCLUSION

The paper focuses on modeling the Internet traffic accurately so as to identify the long term trend between Lahore and Karachi, the two major nodes of the Pakistan Internet Exchange. We came up with quantitative Internet traffic demand forecasts for the next eight weeks using past data spanning over 2 years. The predictions were compared to the actual demand observed and it was encouraging to observe minimal error between the estimates and the observed values.

We aggregated the traffic demand for all the links between the two nodes at the time granularity of two hours. Using aggregation and multi-resolution analysis it was found that there were three seasonal periods present in the data: 12 hours, 24 hours and a week. Instead of tackling the seasonal patterns, we smoothed out the finer resolution periods and identified the long term trend from the data. The daily variation had the major effect on the long term. So, a model for the Internet traffic demand pattern was made that had the approximation signal and a weighted contribution of the diurnal signal. This gives a model that maps to the real traffic as accurately as possible in the long term, with minimal effects from the shorter seasonal patterns and outliers. Weekly approximations are used to calculate forecasts using a low-order ARIMA process. The forecasts deviated at most by 15% from the actual values, while on average there was a deviation of 7%, which empirically proves the accuracy of the Internet traffic modeling.

It was the use of aggregations, decompositions and averaging that helped in keeping the computation time within milliseconds, making the practical implementation of these calculations possible.

ACKNOWLEDGMENT

I am indebted to Dr Tariq Jadoon for helping me get insight into the traffic engineering issues and for his continued support and encouragement throughout the research process. I express my gratitude to Dr Arif Zaman and Dr Sohaib A Khan for helping me ascertain the right direction by reviewing the experimentation of different ideas. Thanks to Mr. Amir Mehmood for willingly lending the proprietary PTCL data for research purposes in time.

REFERENCES

[1] K. Papagiannaki, "Provisioning IP Backbone Networks Based on Measurements," PhD thesis, University College London, March 2003.

[2] G. Box and G. Jenkins, Time Series Analysis: Forecasting and Control, Holden-Day, San Francisco, CA, 1970.

[3] P. Brockwell and R. Davis, Introduction to Time Series and Forecasting, Springer, 1996.

[4] N. K. Groschwitz and G. C. Polyzos, "A Time Series Model of Long-Term NSFNET Backbone Traffic," IEEE ICC '94, 1994.

[5] S. Basu and A. Mukherjee, "Time Series Models for Internet Traffic," 24th Conf. on Local Computer Networks, Oct. 1999, pp. 164-171.

[6] J. Bolot and P. Hoschka, "Performance Engineering of the World Wide Web: Application to Dimensioning and Cache Design," 5th International World Wide Web Conference, May 1996.

[7] K. Chandra, C. You, G. Olowoyeye, and C. Thompson, "Non-Linear Time-Series Models of Ethernet Traffic," Tech. Rep., CACT, June 1998.

[8] R. A. Golding, "End-to-end performance prediction for the Internet," Tech. Rep. UCSC-CRL-92-96, CISB, University of California, Santa Cruz, June 1992.

[9] A. Sang and S. Li, "A Predictability Analysis of Network Traffic," INFOCOM, Tel Aviv, Israel, Mar. 2000.

[10] R. Wolski, "Dynamically Forecasting Network Performance Using the Network Weather Service," Journal of Cluster Computing, 1999.

[11] A. Mehmood, T. Jadoon and N. Sheikh, "Evaluation of VoIP Quality over the Pakistan Internet Exchange (PIE) Backbone," IEEE-ICET, 2005.

[12] I. Daubechies, "Ten Lectures on Wavelets," CBMS-NSF Regional Conference Series in Applied Mathematics, vol. 61, 1992.

[13] P. Brockwell and R. Davis, Introduction to Time Series and Forecasting, Springer, 1996.