
Research Unit VIII: Network Architectures
Computer Science Department, Technische Universität München

Automated deriving of statistical parameters characterizing network traffic

Systementwicklungsprojekt

Stefan Schneider

[email protected]

supervised by: Nils Kammenhuber


Abstract

To simulate network traffic representatively, its characteristics, for example the distribution of the sizes of the transferred objects or of the transfer durations, need to be known. These characteristics are well researched, and it is widely accepted that most of them show heavy-tailed and self-similar properties. But as networking technology advances rapidly, it can be suspected that the parameters characterizing the traffic shift correspondingly. To be able to analyze such changes, we developed a set of tools that automate the steps necessary to evaluate important statistical parameters from actual traffic. Although our tools significantly ease this task, they cannot remove the fact that the user performing the analysis still needs to be experienced in order to correctly judge the results and to understand the impact of statistical subtleties. Finally, we analyzed the traffic from our network and present some results.


Table of Contents

Abstract
1. Introduction
    1.1. Motivation
    1.2. Approach
        1.2.1. Measurement
        1.2.2. Data Extraction
        1.2.3. Data Analysis
2. Initial Situation
    2.1. Data Collection
        2.1.1. TCPDUMP
        2.1.2. BRO
    2.2. Fitting
        2.2.1. Fitting in the literature vs. our approach
        2.2.2. R
3. Theoretical Background
    3.1. Probability distributions
        3.1.1. Exponential distribution
        3.1.2. The generalised Pareto distribution
    3.2. Identification of the underlying distribution
        3.2.1. Mean Excess Plots
    3.3. Fitting parameters
        3.3.1. Hill Graph
        3.3.2. Maximum Likelihood Estimation
    3.4. Goodness of fit
        3.4.1. Quantile-Quantile Plot
        3.4.2. Cumulative distribution function
4. Implementation
    4.1. TCPDUMP
    4.2. BRO
    4.3. Perl
    4.4. R
    4.5. Stationarity
5. Results
    5.1. Handling
    5.2. Stationarity
    5.3. Goodness of fit
6. Conclusions
Bibliography
Appendix


Illustration Index

Illustration I: mean excess graph of 5000 exponentially distributed random values
Illustration II: mean excess graph of 5000 Pareto distributed random values
Illustration III: Hill graph of a Pareto distributed random sample with α = 0.2
Illustration IV: QQ-plot of two exponentially distributed random deviates that differ by a linear factor
Illustration V: QQ-plot of an exponential distribution vs. a Pareto distribution
Illustration VI: QQ-plot of object sizes 0 am to 12 am
Illustration VII: QQ-plot of object sizes 12 am to 12 am next day
Illustration VIII: Object sizes of transferred GIFs against fitted Pareto distribution


1. Introduction

1.1.Motivation

Network simulators are used for "what if" studies to better understand the interactions of the various network components, the scale, and the variability of the network. To generate realistic load, network simulators use random statistical distributions, which are configured by parameters. By carefully selecting the appropriate distributions and carefully tuning these parameters, it can be ensured that the generated data conforms with given requirements. Most of the time it is desired that the simulations closely represent real-world data. In networking the technology advances quickly, and therefore the characteristics of the traffic might change over time. Most network simulators have default parameters built in, but we wanted to be able to easily verify whether those defaults are still valid, or whether they have changed significantly and should thus be updated. Extracting these parameters from real-world traffic is tedious work that requires a lot of care and experience, so we intended to automate the necessary steps as far as possible.

1.2. Approach

Our approach is split into three parts: the first step is to measure the traffic and gather the data, the second step is to extract the relevant data from the measurement results, and the third step is to analyze the data, extract the needed parameters, and verify them.

1.2.1. Measurement

First the traffic is captured, and is then either stored in a file or processed directly. Running the data gathering directly on live traffic is limited by the machine's resources and is not actively supported. The more reasonable approach is to sniff the traffic to a file first and read the dump file later in step 2.

1.2.2. Data Extraction

From the dump file, we then extract the relevant HTTP connection information and create a separate file for each MIME type. After this step the data is also completely anonymized.

1.2.3. Data Analysis  

First, a summary of the gathered data is produced as an HTML document, which contains links to more detailed overviews. These overviews are used to estimate the supposed underlying distribution. To this end, plots of the ordered data, a histogram, and, if desired, a mean excess plot are given. After the user has chosen a specific distribution model for a given data type, he can then interactively fit the distribution parameters to the data.


2. Initial Situation

The characteristics of Web workload are relatively well researched, and there exist many papers on the topic, such as the dissertation of Paul Barford [BAR01]. We did not intend to redo or verify this kind of research. Instead, we wanted to make use of already existing techniques to verify whether the parameters for Web traffic have changed over time. Therefore, in this chapter we give a quick summary of the assumptions made and the techniques and setup used. The mathematical and statistical background will be presented in the next chapter.

2.1 Data Collection

In order to derive characteristics of traffic, we first need logs of real traffic to base our analysis on. Previous measurements were based on web traffic log files collected with different methods. Here we present an overview as described in [FEL02]:

• from users running a modified browser [CAT01]

• from Web content provider logging

• from Web proxies logging

• from the wire via packet monitoring

As described in that paper, the packet monitoring method is suited best for our purpose, as one gets a representative overview of all aspects of the HTTP traffic.

2.1.1. TCPDUMP

We used TCPDUMP on a designated capturing machine to record full traces of the traffic at the uplink of the "Münchner Wissenschaftsnetz" to the "Deutsche Forschungsnetzwerk". The traces were written directly to disk. Every time the trace reached a size of 1 Gigabyte, it was copied to a storage machine. To be able to handle the load and to run different analyses, we filtered out specific subnets. For the exact setup see [SCH01].

2.1.2. BRO

To extract the datasets from the traces we used BRO, an intrusion detection system that features a built-in HTTP analyzer [BRO]. This enabled us to easily access the data we needed by processing the events of BRO's HTTP analyzer, without having to deal with the difficulties described in [FEL01].

2.2. Fitting

Fitting is the process of finding parameters for a random distribution so that it best represents a given dataset.

2.2.1 Fitting in the literature vs our approach

As a reference, we cite the approach Barford proposes in [BAR01]:

1. Use log-log complementary distribution (LLCD) plots to determine whether or not the dataset has a heavy tail. If so, then hybrid modeling may be necessary (hybrid models may also be necessary even if the data is not heavy tailed).


2. Use standard visual techniques such as simple histograms or complementary distribution function ([C]CDF) plots to narrow the set of candidate models for the data. Logarithmic transformation may be necessary to distinguish important characteristics.

3. If the data appears to best be modeled by a hybrid model then use censoring methods to determine how to divide that data.

4. Use maximum likelihood estimators (MLE) to estimate parameters for candidate models for the data.

5. Use goodness of fit tests such as the Anderson-Darling test to see if there is a close fit between model and data. If this test shows no significance, then use a random sub sampling technique to test fit for small sample sizes.

6. Use goodness of fit metrics such as the λ² test to determine a discrepancy measure between data and model. This metric is especially important if all goodness of fit tests fail.

7. Outlier analysis should be done to determine if there are any data points which are skewing analysis.

To be able to automate the process, we used a slightly different (simplified) approach:

1. Use log-log complementary distribution, ordered data and mean excess plots to determine whether or not the dataset has a heavy tail, and to choose the model for the data.

2. Use the Hill Estimator to choose a threshold.

3. Use maximum likelihood estimators (MLE) interactively to estimate parameters for candidate models for the data.

4. Use ordered data and log-log complementary distribution plots to review the goodness of fit.

As our main focus was to review the tail of the data, we left out the hybrid models and fitted to the tail of the data. The goodness of fit test and the outlier determination were combined into an interactive fitting process using a threshold model. The proposed Anderson-Darling test [AD01] was left out, as it is only suitable for (log)normal distributions. We implemented only the Pareto distribution fitting, because it is the most common distribution for modeling various aspects of Web (and other Internet) traffic. It would be desirable to have more distributions to choose from; however, including further distributions would have forced us to leave out other, more important features due to time limitations.

2.2.2. R

For the analysis of the gathered data we used the GNU R software environment for statistical computing and graphics [R], extended by the Rmetrics library [Rmet]. An introduction to fitting with R is given by [RIC01].


3. Theoretical Background

This section gives a short introduction to the theory behind the statistical tools and methods that were employed in the project.

3.1 Probability distributions

First we take a quick look at the probability distributions for which we want to generate parameters.

3.1.1 Exponential distribution

The exponential distribution is a continuous probability distribution often used to model the time between events that happen at a constant average rate. It is defined as:

$$F(\lambda; x) := \begin{cases} 1 - e^{-\lambda x} & \text{if } x \ge 0 \\ 0 & \text{if } x < 0 \end{cases}$$

where λ > 0 is the rate parameter.
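The definition above translates directly into code. The following minimal Python sketch (the project's own tooling was written in R; all function names here are ours) implements the exponential CDF and draws samples from it via inverse-transform sampling:

```python
import math
import random

def exp_cdf(x, lam):
    """CDF of the exponential distribution: F(lambda; x) = 1 - exp(-lambda * x) for x >= 0."""
    if x < 0:
        return 0.0
    return 1.0 - math.exp(-lam * x)

def exp_sample(lam, rng):
    """Inverse-transform sampling: solve F(x) = u, i.e. x = -ln(1 - u) / lambda."""
    u = rng.random()
    return -math.log(1.0 - u) / lam

rng = random.Random(42)
lam = 2.0
sample = [exp_sample(lam, rng) for _ in range(100_000)]
mean = sum(sample) / len(sample)
print(round(exp_cdf(math.log(2) / lam, lam), 3))   # the median of Exp(lambda) has CDF 0.5
print(abs(mean - 1.0 / lam) < 0.02)                # sample mean is close to 1/lambda
```

The sample mean converging to 1/λ is the usual sanity check when generating exponentially distributed inter-event times.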

3.1.2 The generalised Pareto distribution

We present the generalised Pareto distribution (GPD) here, because nearly all available fitting tools are based on it. For the simulations we need parameters for the ordinary Pareto distribution, which is a special case of the GPD. As we will see later we only discovered generalised Pareto distribution parameters that correspond to an ordinary Pareto distribution.

The generalised Pareto distribution is defined as:

$$G_{\xi,\beta,\nu}(x) = \begin{cases} 1 - \left(1 + \xi\,\dfrac{x-\nu}{\beta}\right)^{-1/\xi} & \text{if } \xi \ne 0 \\[2mm] 1 - e^{-(x-\nu)/\beta} & \text{if } \xi = 0 \end{cases}$$

with

$$x \in \begin{cases} [\nu, \infty) & \text{if } \xi \ge 0 \\ [\nu,\, \nu - \beta/\xi] & \text{if } \xi < 0 \end{cases}$$

where ξ = 1/α is the shape parameter, α is the tail index, β is the scale parameter and ν is the location parameter. When ν = 0 and β = 1, the representation is known as the standard GPD. The shape parameter ξ determines the type of the generalised Pareto distribution. There are three possibilities:

• ξ > 0 produces an ordinary Pareto distribution,
• ξ = 0 produces an exponential distribution,
• ξ < 0 produces a Pareto II type distribution.

With our empirical data we only got values of ξ greater than zero.
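The case distinction above can be illustrated with a small Python sketch (again not the project's R code; names are ours). Note how a very small positive ξ already approximates the exponential special case:

```python
import math

def gpd_cdf(x, xi, beta, nu=0.0):
    """CDF of the generalised Pareto distribution G_{xi,beta,nu}(x)."""
    z = (x - nu) / beta
    if z < 0:
        return 0.0                       # below the location parameter
    if xi == 0.0:
        return 1.0 - math.exp(-z)        # xi = 0: the exponential special case
    arg = 1.0 + xi * z
    if arg <= 0.0:
        return 1.0                       # above the upper endpoint (only possible for xi < 0)
    return 1.0 - arg ** (-1.0 / xi)

# xi > 0 gives a Pareto-type tail; as xi -> 0 the CDF approaches the exponential case:
print(round(gpd_cdf(3.0, 0.5, 1.0), 4))   # 1 - 2.5**-2 = 0.84
print(round(gpd_cdf(3.0, 1e-9, 1.0), 4), round(gpd_cdf(3.0, 0.0, 1.0), 4))
```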


3.2. Identification of the underlying distribution

Before we can fit any parameterizable distribution to our measured data, we first need to identify which distribution suits our empirical data best. The first step usually is to take a look at a plot of the sorted data, but we also included a specific plot that helps to identify heavy-tailed distributions, as most of the literature suggests heavy-tailed distributions for Web workload.

3.2.1 Mean Excess Plots

The mean excess is the expected size of the exceedance over a given threshold, given that the threshold is exceeded. The mean excess e for a random variable X over the threshold u is defined as

$$e(u) = E[\,X - u \mid X > u\,], \quad \text{for } 0 \le u \le x_f$$

where $x_f$ is the upper bound of the distribution. To estimate the behavior of the tail we look at the form of the distribution of the mean excess. Let $X_1, \ldots, X_n$ be independent and identically distributed with empirical distribution $F_n$. Then the empirical mean excess is given by

$$e_n(u) = \frac{1}{|n(u)|} \sum_{i \in n(u)} (X_i - u), \quad u \ge 0$$

where

$$n(u) := \{\, i \mid i = 1, \ldots, n;\; X_i > u \,\}$$

So the mean excess is the summed-up excess over the threshold, divided by the number of exceedances over the threshold. The Mean Excess Graph is now formed by the following set of points:

$$\{\, (X_{k,n},\, e_n(X_{k,n})) \,\} \quad \text{with } k = 1, \ldots, n$$

From the shape of the Mean Excess Graph we can derive some information about the underlying distribution. The following two cases are of special interest for us:

• If the data can be assumed to be consistent with an exponential distribution, the Mean Excess Graph shows a horizontal line. See Illustration I for an example:

Illustration I: mean excess graph of 5000 exponentially distributed random values

(Axes: Threshold vs. Mean Excess)


• If the data can be assumed to be consistent with a generalised Pareto distribution with a positive shape parameter ξ, the graph will have a positive slope. See Illustration II for an example:
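The empirical mean excess $e_n(u)$ behind these plots is easy to compute directly. A minimal Python sketch (our actual tooling is in R; the samples here are generated by inverse-transform draws) contrasting the two cases:

```python
import random

def mean_excess(data, u):
    """Empirical mean excess e_n(u): average of X_i - u over the observations with X_i > u."""
    exceed = [x - u for x in data if x > u]
    return sum(exceed) / len(exceed) if exceed else float("nan")

rng = random.Random(1)
# Exponential sample: the mean excess stays roughly constant in u (memorylessness).
expo = [rng.expovariate(1.0) for _ in range(50_000)]
# Pareto sample (alpha = 1.5) via inverse transform: the mean excess grows with u.
pareto = [(1.0 - rng.random()) ** (-1.0 / 1.5) for _ in range(50_000)]

print(round(mean_excess(expo, 0.5), 2), round(mean_excess(expo, 2.0), 2))
print(round(mean_excess(pareto, 2.0), 2), round(mean_excess(pareto, 8.0), 2))
```

Plotting `mean_excess` over a range of thresholds reproduces exactly the horizontal vs. upward-sloping shapes discussed above.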

3.3 Fitting parameters

As soon as we have identified the underlying distribution, we can start to determine the parameters that best correspond with the empirical data. This so-called "fitting" searches for the most feasible parameters for a given dataset. As our data is empirical, there might be some outliers that can obstruct the fitting. As a remedy, the fitting routines offer the possibility to specify a threshold to which the data is restricted. The selection of an appropriate threshold can be tricky: choosing a low threshold makes the fitting biased, while choosing a high threshold cuts off many observations, so the parameter estimation may become imprecise. To help with the selection of a suitable threshold we have included the possibility to produce a so-called Hill Graph.

3.3.1 Hill Graph

The Hill Graph is a useful tool to get a first look at the tail index α and to determine a suitable threshold. The Hill estimator, which calculates a tail index for given order statistics $X_{n,n} < \ldots < X_{1,n}$ of independent and identically distributed random variables, is defined as:

$$\hat{\alpha}_{k,n}^{H} = \left( \frac{1}{k} \sum_{j=1}^{k} \left( \ln X_{j,n} - \ln X_{k,n} \right) \right)^{-1}$$

where k is the number of upper order statistics and n is the sample size. For a Hill Graph, the estimated α is plotted as a function of the number k of upper order statistics.

Illustration II: mean excess graph of 5000 Pareto distributed random values (axes: Threshold vs. Mean Excess)


Here we give an example of a Hill Graph for a set of 500 Pareto distributed random values:

To help estimating a threshold, the corresponding threshold is inscribed at the top of the plot. A good threshold for the fitting lies in regions where the graph of the tail index α is horizontal.
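The values behind such a Hill graph can be sketched in a few lines of Python (the project computed them in R; names here are ours, and the formula follows the text with reference point $\ln X_{k,n}$):

```python
import math
import random

def hill_alpha(data, k):
    """Hill estimator of the tail index alpha from the k upper order statistics."""
    xs = sorted(data, reverse=True)            # X_{1,n} >= X_{2,n} >= ... >= X_{n,n}
    logs = [math.log(x) for x in xs[:k]]
    h = sum(v - logs[k - 1] for v in logs) / k
    return 1.0 / h

rng = random.Random(7)
alpha = 1.5
sample = [(1.0 - rng.random()) ** (-1.0 / alpha) for _ in range(100_000)]
# A Hill graph plots hill_alpha(sample, k) against k; a few points of it:
for k in (100, 500, 2000):
    print(k, round(hill_alpha(sample, k), 2))
```

For a pure Pareto sample the estimates hover near the true α over a wide range of k, which is exactly the horizontal region one looks for when choosing a threshold.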

3.3.2. Maximum Likelihood Estimation

By default, all fitting is done using maximum likelihood estimation. Maximum likelihood estimation fits parameters by maximizing the likelihood function:

$$L(q) := \prod_{i=1}^{n} f(X_i; q)$$

which is derived by evaluating the probability density function f as a function of q for the fixed observed values $X_i$. Since it is maximized by means of its first derivative, which can become difficult to handle for exponential density functions, the logarithmic likelihood function given by

$$l(q) = \sum_{i=1}^{n} \ln f(X_i; q)$$

can also be used if desired.
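For the ordinary Pareto distribution the likelihood maximization has a well-known closed-form solution, which makes the interactive fitting step easy to illustrate. A Python sketch (assuming, as in our tail fitting, that only observations above the chosen threshold are used; the project itself fitted with R):

```python
import math
import random

def pareto_mle(data, threshold):
    """Closed-form ML estimate of the Pareto tail index alpha, fitted to the
    m observations above the threshold: alpha_hat = m / sum(ln(x_i / threshold))."""
    tail = [x for x in data if x > threshold]
    return len(tail) / sum(math.log(x / threshold) for x in tail)

rng = random.Random(3)
alpha = 1.3
sample = [(1.0 - rng.random()) ** (-1.0 / alpha) for _ in range(100_000)]
print(round(pareto_mle(sample, 1.0), 2))   # full sample
print(round(pareto_mle(sample, 5.0), 2))   # tail-only fit above a higher threshold
```

Raising the threshold discards observations and increases the variance of the estimate, which is the trade-off discussed in the threshold selection above.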

3.4. Goodness of fit

The parameters returned by the fitting have to be checked before we can use them in simulations. The most elegant way would be to perform a hypothesis test like the Kolmogorov-Smirnov test or the chi-square test, but these turned out to be far too strict for our empirical data. So we had to settle for comparing the generated distribution with the empirical data manually and, if necessary, tuning the threshold to get usable results. The simplest way to get a look at the data is again to inspect the sorted plot of it. This gives a good first impression but is far too rough for verification. Therefore some more detailed plots are provided.

Illustration III: Hill graph of a Pareto distributed random sample with α = 0.2 (axes: Order Statistics vs. alpha (CI, p = 0.95); corresponding thresholds inscribed on the top axis)


3.4.1. Quantile-Quantile Plot

Quantile-Quantile Plots (QQ-Plots) are plots of the quantiles of two data sets against each other. We use them to determine the goodness of fit of a distribution to our data, so in our case the QQ-Plots show the empirical data against the theoretical quantiles of the fitted distribution. The closer the points fall along a straight line, the better the fit. For reference, a 45-degree line is also plotted. In some cases useful information can be derived from the plot even if it indicates that the fit is not good:

• If the points form a line at an angle other than 45 degrees, the two datasets differ by a linear factor.

• If the points form a line that curves away at the ends, one data set has a heavier tail.
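Constructing the point set of such a QQ-plot needs no plotting library. A Python sketch of the pairing (plotting positions (i + 0.5)/n are one common convention, not necessarily the one R uses):

```python
import math
import random

def qq_points(data, theor_quantile):
    """Pair the sorted sample with theoretical quantiles at plotting positions (i + 0.5) / n."""
    xs = sorted(data)
    n = len(xs)
    return [(theor_quantile((i + 0.5) / n), xs[i]) for i in range(n)]

rng = random.Random(5)
lam = 2.0
sample = [rng.expovariate(lam) for _ in range(10_000)]
# Theoretical quantile function of Exp(lambda): F^{-1}(p) = -ln(1 - p) / lambda
points = qq_points(sample, lambda p: -math.log(1.0 - p) / lam)
# For a good fit, the points hug the 45-degree line; a crude check via a
# least-squares slope through the origin:
slope = sum(t * e for t, e in points) / sum(t * t for t, _ in points)
print(round(slope, 2))
```

A slope near 1 corresponds to points along the 45-degree reference line; a different slope indicates the linear-factor case from the first bullet above.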

3.4.2 Cumulative distribution function

A cumulative distribution function (CDF) describes the probability distribution of a real-valued random variable, X. For every real number x, the CDF is given by

$$F(x) = P(X \le x)$$

It simply represents the probability that the variable X takes on a value less than or equal to x. The complementary cumulative distribution function (CCDF) turns the question around and represents the probability that the variable X takes on a value greater than x.

$$F^c(x) = P(X > x) = 1 - F(x)$$

It is sometimes preferred because, if plotted on a logarithmic scale, the tail of the distribution is drawn at a larger scale. If plotted on a log-log scale, data from Pareto and similar distributions appears as a falling straight line.
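This log-log CCDF check can likewise be sketched in Python (the two-point slope estimate below is a rough stand-in for a fitted regression line; the project produced these plots in R):

```python
import math
import random

def empirical_ccdf(data):
    """Return (x, P(X > x)) pairs from the sorted sample."""
    xs = sorted(data)
    n = len(xs)
    return [(x, 1.0 - (i + 1) / n) for i, x in enumerate(xs)]

rng = random.Random(11)
alpha = 1.2
sample = [(1.0 - rng.random()) ** (-1.0 / alpha) for _ in range(100_000)]
ccdf = empirical_ccdf(sample)
# On log-log axes a Pareto CCDF is a straight line with slope -alpha;
# estimate that slope from two interior tail points:
x1, p1 = ccdf[90_000]   # roughly the 90th percentile
x2, p2 = ccdf[99_000]   # roughly the 99th percentile
slope = (math.log(p2) - math.log(p1)) / (math.log(x2) - math.log(x1))
print(round(slope, 2))
```

The recovered slope close to -α is exactly the straight-line behavior that makes the log-log CCDF a convenient heavy-tail diagnostic.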

Illustration IV: QQ-plot of an exponential distribution vs. a Pareto distribution (axes: pareto vs. exponential)

Illustration V: QQ-plot of two exponentially distributed random deviates that differ by a linear factor (axes: rexp(500) vs. rexp(500) * 2)


4. Implementation

Here we give a quick overview of the scripts we developed. For a more detailed view, user and developer documentation exists in the appendix.

4.1. TCPDUMP

The traces were recorded at a designated recording machine and stored on disk. Since in persistent HTTP connections a new object can start at any position within the connection, a full packet length trace is necessary. Every time the trace file reached a size of 1 Gigabyte, a new consecutively numbered file was created, and the completed file was anonymized and copied to a designated analysis machine.

4.2. BRO

At first we tried to use the outputs of the existing Bro analyzers to collect the datasets we wanted, but it turned out to be more reasonable to write a new specialized policy that directly collects the data. The data is written to a text file containing the following data sets:

Connection:

• request IP (anonymized)
• request port
• response IP (anonymized)
• response port
• number of transferred objects
• list of all objects transferred via the connection, as described below

Object:

• HTTP reply code
• interrupted
• MIME type
• URL
• request time
• response time, i.e., the time that the first byte of the response was seen
• response complete time, i.e., the time that the last byte of the response was seen
• list of query headers
• list of response headers


4.3. Perl scripts

The output of our Bro analyzer contains all relevant data grouped by connection. As it contains all header information, it usually still amounts to 2 to 7 percent of the size of the original packet trace. To be able to look at specific measurement categories and to remove any possibly privacy-harming information, the data is next processed by Perl scripts that extract the following data sets:

• number of occurrences for each MIME type
• distribution of object sizes for each MIME type
• number of embedded objects for each MIME type
• number of transfers per embedded object for each MIME type
• occurrences of reply codes
• occurrences of web clients (anonymized)
• occurrences of web servers (anonymized)
• transfer duration
• host-based inter-arrival times
• request-response time intervals
• inter-request times
• number of objects transferred per TCP connection
• number of unsuccessful transfers
• popularity of servers
• popularity of clients

Each data set is stored in a separate text file containing one value per line. This format was chosen so that it can easily be read with R.
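The project reads these one-value-per-line files with R; purely as an illustration of the format, a Python sketch of a reader (the file content and path here are made up for the demonstration):

```python
import os
import tempfile

def read_dataset(path):
    """Read a one-value-per-line dataset file, as produced by the Perl scripts."""
    with open(path) as f:
        return [float(line) for line in f if line.strip()]

# Demonstration with a temporary file standing in for a real dataset file:
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("512\n1024\n2048\n")
    path = f.name
values = read_dataset(path)
os.remove(path)
print(values)  # [512.0, 1024.0, 2048.0]
```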


4.4. R scripts

The preprocessed data from the Perl scripts is read into R in order to analyze it and to extract the distribution parameters. R gives the experienced user a huge tool set for the analysis and the possibility to plot results, but as we wanted to automate the process, we chose to use yet another Perl script that directs the flow of control. This Perl script creates an R script for every dataset, runs it and presents the results in an HTML page. Here the automation is suspended, as for the next step the user has to choose the distribution to fit to. From our work with the software it turned out that it is not reasonable to automate the fitting process, as manually choosing a suitable threshold was indispensable for useful results. For clarification we present two tables showing the parameters for a generalised Pareto distribution derived from traces of one hour length, comprising all traffic from the department of computer science at the Technische Universität München, excluding the high-traffic leo.org server farm.

These sample results clearly demonstrate the need for user interaction at this stage of the fitting process. Therefore we chose to implement a Perl script that lets the user interactively choose the threshold and then presents the corresponding results. To determine the goodness of fit, it also presents ordered data plots, quantile-quantile plots, a histogram and cumulative distribution plots.

4.5 Stationarity

All our methods are only well-defined if the stationarity of the input data, as described in [NIST01], is given. Especially the Hill estimator is disturbed very easily by trends in the data [GOG01]. As methodologies to remove trends [KRIS01] are relatively complex, their inclusion into our software would have been beyond the limited scope of a "Systementwicklungsprojekt". Thus we decided not to implement them.

Table II: fitted parameters with manually chosen threshold, according to the Hill Graph interpretation

Time of Day         01:00AM  06:00AM  09:00AM  10:00AM  12:00PM  02:00PM  04:00PM  06:00PM  09:00PM
Number of datasets  10943    7161     44134    74504    82625    89852    85259    48492    21891
ξ (shape)           1.16     1.47     1.29     1.3      1.3      1.3      1.31     1.39     1.42
β (scale)           2990     1811     2306     2453     2389     1338     2012     2345     3500

Table I: fitted parameters with standard threshold (0)

Time of Day         01:00AM  06:00AM  09:00AM  10:00AM  12:00PM  02:00PM  04:00PM  06:00PM  09:00PM
Number of datasets  10943    7161     44134    74504    82625    89852    85259    48492    21891
ξ (shape)           1.18     1.47     1.47     1.41     1.9      0.38     1.48     1.82     1.51
β (scale)           2907     1811     1446     911      719      45139    1048     954      1941


5. Results

To test our scripts we recorded a trace comprising all traffic that the department of computer science at the Technische Universität München exchanged with non-local addresses, again excluding leo.org. The trace started at 00:00 on 5 October 2005 and lasted until 14:30 on 7 October 2005; in total it contained 275 gigabytes of traffic.

5.1. Handling

First we tried to run our scripts on the whole trace at once to test whether they could handle the load. The BRO policy took approximately 14 hours to extract the connection data into a 6.3 gigabyte file. The process of splitting it up into the different data files worked as well, and we were able to generate the general overview, but R could not handle all of the big data files with our standard setup. Running all the analyses on data of this size is not sensible anyway, as many of them take a long time even on much smaller datasets. The mean excess plot in particular is very time-consuming; hence it can be switched off via a command line parameter.

5.2. Stationarity

To be able to run our analyses and to test the data for stationarity we cut the connection data file into slices of 15, 30 and 60 minutes. As seen in Table I, the obtained parameters varied heavily over the day, whereas the length of the slices did not affect the results significantly. As an example, here are the object sizes at 0 am compared to 12 am in a quantile-quantile plot:
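Cutting a trace into fixed-length slices can be sketched as follows. The actual work was done by Perl scripts; this Python fragment, with its assumed (timestamp, value) record layout, is only an illustration:

```python
from collections import defaultdict

def slice_records(records, slice_minutes):
    """Group (timestamp_seconds, value) records into fixed-length
    time slices; returns {slice_index: [values]}."""
    width = slice_minutes * 60
    slices = defaultdict(list)
    for ts, value in records:
        slices[int(ts // width)].append(value)
    return dict(slices)

# Three records spread over 75 minutes, cut into 60-minute slices;
# the first two fall into slice 0, the third into slice 1.
records = [(10.0, "a"), (1800.0, "b"), (4500.0, "c")]
hourly = slice_records(records, 60)
```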

The distinctly nonlinear shape of the qqplot indeed suggests that the traffic characteristics are not stationary but change with the time of day.
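The underlying computation of such a two-sample qqplot is just a pairing of matching empirical quantiles; points close to the diagonal y = x indicate similar distributions. A minimal sketch (illustrative Python, using a simple order-statistic quantile rather than R's interpolating `quantile`):

```python
def qq_points(sample_a, sample_b, n_points=100):
    """Quantile-quantile points of two samples: for each probability p,
    pair the empirical p-quantile of sample_a with that of sample_b."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    pts = []
    for i in range(n_points):
        p = (i + 0.5) / n_points          # mid-point probabilities
        pts.append((a[int(p * len(a))],   # crude order-statistic quantile
                    b[int(p * len(b))]))
    return pts

# Two identical samples lie exactly on the diagonal:
same = list(range(1000))
diag = qq_points(same, same)
```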

Illustration VI: qqplot of object sizes 0am – 12am of the same day


The following qqplot however suggests that the traffic from the same period of time on subsequent days shows a very similar behavior:

These considerations give a rough feeling for the stationarity of the datasets, but a thorough analysis of stationarity would take much more effort; for further reference see [BRK01].

5.3. Goodness of fit

Looking at the results of our sample analysis, we found that with our simple automated approach the goodness of fit was usually relatively poor when considered over the whole dataset. This has two sources: first, representing the data with a single distribution is a strong simplification; second, measured data always contains phenomena that cannot be represented by continuous random distributions. For example, it is very popular to use very small GIFs in web pages, all of which have the same file size.
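Such phenomena show up as point masses: a single exact value carrying a large share of the probability, which no continuous distribution can reproduce. A quick illustrative check (Python sketch; the "spacer GIF" size of 35 bytes below is an invented example value):

```python
from collections import Counter

def largest_atom(values):
    """Return (value, fraction) of the most frequent exact value.
    A large fraction signals a point mass in otherwise continuous data."""
    value, count = Counter(values).most_common(1)[0]
    return value, count / len(values)

# A hypothetical 35-byte spacer GIF dominating otherwise varied sizes:
sizes = [35] * 400 + list(range(100, 700))
atom, frac = largest_atom(sizes)
```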

But as we were mainly interested in checking whether the established parameters apply to our network, the tail of the datasets is far more interesting, and by carefully choosing a suitable threshold we could achieve a good fit there most of the time.

Illustration VII: qqplot of object sizes, 12 am – 12 am the next day

Illustration VIII: Object sizes of transferred GIFs against fitted Pareto distribution


6. Conclusions

We implemented a set of scripts that, though not fully automated, assists the user in fitting parameters of random distributions to various datasets derived from a given network traffic trace. Full automation turned out to be neither feasible in the given context nor reasonable, because the automated goodness-of-fit tests were not applicable to the measured data directly; a fully automated result might therefore mask errors, and it almost never yielded usable results.

The data collection and preprocessing scripts, on the other hand, run fully unattended and proved able to process even big traces with adequate performance.

With our work we evaluated the possibilities and difficulties of automating the extraction of parameters with the given resources. The data collection scripts proved to be a reasonable approach and are already used in production. The data analysis gives the user a feel for the data and a rough estimate of suitable parameters.

Further work could be done on the data processing, for example by adding the possibility to remove trends or outliers from the input data, although this will presumably also be hard to automate. The fitting might be improved by adding functionality to fit more distributions, or to fit the head and the tail of the data to different distributions.


Bibliography

[AD01] "Anderson-Darling Test", http://www.itl.nist.gov/div898/handbook/eda/sectio
[BAR01] Paul Robert Barford, "Modeling, Measurement and Performance of World Wide Web Transactions", 2001
[BRK01] Peter J. Brockwell, Richard A. Davis, "Time Series: Theory and Methods", 2002
[BRO] "BRO Intrusion Detection System", http://www.bro-ids.org
[CAT01] Lara D. Catledge, James E. Pitkow, "Characterizing Browsing Strategies in the World-Wide Web", 1995
[FEL01] Anja Feldmann, "Continuous online extraction of HTTP traces from packet traces", 1998
[FEL02] Anja Feldmann, "", 2000
[GOG01] Helmut Gogl, "Measurement and Characterization of Traffic Streams in High-Speed Wide A", 2001
[KRIS01] Krishna Kant, M. Venkatachalam, "Characterizing Non-stationarity in the Presence of Long-range Dependence", 2002
[NIST01] "NIST/SEMATECH e-Handbook of Statistical Methods", http://www.itl.nist.gov/div898/handbook/
[R] "R software environment for statistical computing and graphics", http://www.r-project.org
[RIC01] Vito Ricci, "Fitting distributions with R", 2005
[Rmet] "Rmetrics - R Library", http://www.rmetrics.org
[SCH01] Fabian Schneider, "Analyse der Leistung von BPF und libpcap in Gigabit-Ethernet Umgebungen", 2004


Appendix

All scripts that we developed, together with documentation and examples, are available via the project website: http://www.novogarchinsk.net/schneist/SEP/