CS626 Data Analysis and Simulation - William & Marykemper/cs626/slides/v4.pdf · Concept of input...

CS626 Data Analysis and Simulation

Today: Stochastic Input Modeling

Based on the WSC 2010 Tutorial by Biller and Gunes, CMU, slides used with permission.
Reference: Law/Kelton, Simulation Modeling and Analysis, Ch. 6.

Instructor: Peter Kemper, R 104A, phone 221-3462, email: [email protected]
Office hours: Monday, Wednesday 2-4 pm

Big Picture: Model-based Analysis of Systems

[Diagram. Boxes: real world problem; portion/facet of the real world; formal model (probability model, stochastic process); formal / computer-aided analysis; solution, rewards, qualitative and quantitative properties; decision; solution to real world problem. Arrow labels: perception, description, transformation, presentation, transfer.]

What is input modeling?

Input modeling: Deriving a representation of the uncertainty or randomness in a stochastic simulation.

Common representations:
- Measurement data
- Distributions derived from measurement data <-- focus of “Input modeling”
  - usually requires that samples are i.i.d. and that the corresponding random variables in the simulation model are i.i.d. (i.i.d. = independent and identically distributed)
  - theoretical distributions
  - empirical distribution
- Time-dependent stochastic process
- Other stochastic processes

Examples include:
- time to failure for a machining process;
- demand per unit time for inventory of a product;
- number of defective items in a shipment of goods;
- times between arrivals of calls to a call center.

Why are input models stochastic?

We cannot simply assume randomness away. Example (Nelson and Biller 2003): Suppose you are a supplier of a component that you know has a mean time to failure of 2 years. A client is willing to pay $1000 for your component, but wants you to pay a penalty of $5000 if failure occurs in less than one year. Should you take this contract?

No uncertainty:
- You will pocket $1000 for each component you sell.

Uncertainty:
- If you know that the time to failure is well modeled as being exponentially distributed (an input model) with mean 2 years, then F(1) = 1 - exp(-1/2) ≈ 0.39 and you can expect to lose about $950 on each component you sell (1000 - 5000 · 0.39 = -950).
- If you know that the time to failure is well modeled as being uniformly distributed (an input model) between 0 and 4 years (so that the mean lifetime is 2 years), then F(1) = 0.25 and the expected loss on each component is $250 (1000 - 5000 · 0.25 = -250).
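The arithmetic of the contract example can be checked with a short sketch. The function and constant names are illustrative and not part of the tutorial. Note that the $950 figure follows from the rounded value F(1) = 0.39; the unrounded exponential probability gives an expected loss of roughly $967.

```python
import math

PRICE = 1000.0    # payment received per component
PENALTY = 5000.0  # penalty paid if the component fails within one year

def expected_profit(p_fail_within_year):
    """Expected profit per component for a given failure probability F(1)."""
    return PRICE - PENALTY * p_fail_within_year

def exponential_cdf(t, mean):
    """CDF of an exponential distribution parameterized by its mean."""
    return 1.0 - math.exp(-t / mean)

def uniform_cdf(t, low, high):
    """CDF of a uniform distribution on [low, high]."""
    return min(max((t - low) / (high - low), 0.0), 1.0)

p_exp = exponential_cdf(1.0, mean=2.0)     # F(1) under the exponential model
p_uni = uniform_cdf(1.0, low=0.0, high=4.0)  # F(1) under the uniform model

print(f"exponential: F(1)={p_exp:.4f}, expected profit={expected_profit(p_exp):.2f}")
print(f"uniform:     F(1)={p_uni:.4f}, expected profit={expected_profit(p_uni):.2f}")
```

Both input models share the same mean lifetime of 2 years, yet the expected outcome of the contract differs by a factor of almost four, which is the point of the example.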

Learning objectives

- Concept of input modeling and its fit in simulation model development.
- Input modeling with data:
  - Physical basis for distributions.
  - Fitting and checking.
- Input modeling without data:
  - Sources of information.
  - Incorporating expert opinion.

Input modeling:

Deriving a representation of the uncertainty or randomness in a stochastic simulation.

Randomness? A way to describe the behavior of a subsystem that
- (lack of knowledge) we cannot describe as a deterministic system, or
- (lack of interest, abstraction from details) we do not want to describe as a deterministic system.


Example model: G/G/n/m FCFS queue

- Customers (tasks) arrive according to some general distribution G.
- Customers are served for a time according to some general distribution G.
- n servers are available to serve customers in parallel.
- Customers are scheduled first-come-first-served (FCFS).
- m is the capacity of the queue (customers hitting a full system are turned away).

Design question: What values of n and m are necessary
- to limit the waiting time for 90% of all customers to 10 min, and
- to limit the fraction of customers that get turned away to 5% in the long run?

What pieces of information does the input modeling contribute to this simulation study?
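To make the role of the inputs concrete, here is a minimal sketch of such a queue simulator in Python. It is not taken from the slides. It assumes that m counts waiting positions only (customers in service are not counted), and the exponential inter-arrival times, gamma service times, and the values of n and m in the demo are placeholders for whatever the input model would supply.

```python
import heapq
import random
from collections import deque

def simulate_queue(interarrivals, services, n, m):
    """Simulate a G/G/n/m FCFS queue.

    Returns (waiting times of accepted customers, number turned away).
    Assumption: m is the number of waiting positions, excluding servers.
    """
    free_at = [0.0] * n        # min-heap: time at which each server is next free
    heapq.heapify(free_at)
    pending_starts = deque()   # service start times of customers still waiting
    waits, rejected, now = [], 0, 0.0
    for gap, service in zip(interarrivals, services):
        now += gap
        # Customers whose service has started by now no longer occupy the queue.
        while pending_starts and pending_starts[0] <= now:
            pending_starts.popleft()
        all_busy = free_at[0] > now
        if all_busy and len(pending_starts) >= m:
            rejected += 1      # queue is full: customer is turned away
            continue
        start = max(now, heapq.heappop(free_at))  # earliest available server
        heapq.heappush(free_at, start + service)
        if start > now:
            pending_starts.append(start)
        waits.append(start - now)
    return waits, rejected

# Demo with placeholder input models (both are assumptions for illustration).
rng = random.Random(0)
N_CUSTOMERS = 5000
gaps = [rng.expovariate(1.0) for _ in range(N_CUSTOMERS)]
service_times = [rng.gammavariate(2.0, 0.8) for _ in range(N_CUSTOMERS)]
waits, rejected = simulate_queue(gaps, service_times, n=2, m=5)
p90_wait = sorted(waits)[int(0.9 * len(waits))]
print(f"90th percentile of waiting time: {p90_wait:.2f}")
print(f"fraction turned away: {rejected / N_CUSTOMERS:.3f}")
```

The simulator itself only consumes two sequences of numbers. Everything that makes the answer to the design question credible sits in how those sequences are produced, which is the job of the input model.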

Photo: Stuart Richards (Left-hand), Flickr, Creative Commons

Cookbook recipe for conducting a simulation study

[Flowchart. Box labels, listed without the connecting arrows:]
- Statement of the decision problem and objectives
- System Analysis
- Data Collection
- Input Modeling, Development
- Rough-cut Model Development
- Static Models
- Dynamic Models
- Model Building
- Static (Spreadsheet) Simulation
- Dynamic System Simulation
- Design and coding of the simulation program
- Verification
- Validation
- Experimental Design
- Simulation runs
- Removal of initial-condition bias
- Determination of the replication number for error control
- Output Analysis
- Statistical analysis of results and system design comparison
- Comparison via Simulation
- Simulation Optimization
- Final documentation
- Recommendation for decisions and implementation of the model

Simulation model development

[Diagram. Boxes: Real-World Process or Phenomenon; Simulation Model; Simulation Program; Random Input Model; Random Variate Generator. Activities connecting them: Simulation Modeling; Simulation Programming; Simulation Input Modeling; Random Variate Programming.]


G/G/n/m FCFS queueing model revisited

Conceptual model
- Customers (tasks) arrive according to some general distribution G.
- Customers are served for a time according to some general distribution G.
- n servers are available to serve customers in parallel.
- Customers are scheduled first-come-first-served (FCFS).
- m is the capacity of the queue (customers hitting a full system are turned away).

Design question: What values of n and m are necessary to limit the waiting time for 90% of all customers to 10 min and to limit the fraction of customers that get turned away to 5% in the long run?

Input model
- Measurement data for task arrivals and service times for a certain time
- Option 1: Trace-driven simulation
  - use measurement data to feed a simulation run
- Option 2: Simulation draws from a probability distribution
  - needs selection/configuration of a distribution (distribution fitting)
  - alternative: empirical distribution
- Option 3: Simulation executes a stochastic process (later)
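The first two options, plus the empirical alternative, can be sketched as interchangeable sources of input values. This is an illustration only: the generator names are made up, the sample trace is invented, and the exponential in Option 2 is fitted by simply matching the sample mean, which stands in for a proper distribution-fitting step.

```python
import itertools
import random

def trace_driven(trace):
    """Option 1: replay the measured values in order; ends when the trace ends."""
    yield from trace

def fitted_exponential(trace, rng):
    """Option 2: draw from an exponential whose mean matches the sample mean."""
    mean = sum(trace) / len(trace)
    while True:
        yield rng.expovariate(1.0 / mean)

def empirical(trace, rng):
    """Alternative to Option 2: resample the observed values with replacement."""
    while True:
        yield rng.choice(trace)

# Hypothetical measured inter-arrival times (seconds).
measured = [0.8, 2.1, 0.3, 1.7, 0.9, 4.2, 0.5]
rng = random.Random(1)

for name, source in [
    ("trace", trace_driven(measured)),
    ("fitted", fitted_exponential(measured, rng)),
    ("empirical", empirical(measured, rng)),
]:
    first_values = list(itertools.islice(source, 10))
    print(name, [round(v, 2) for v in first_values])
```

The trace runs out after seven values, while the other two sources can feed arbitrarily long runs; only the fitted distribution can produce values that were never observed.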

Input model development

There is no “true” model for any stochastic input. The best that we can hope is to obtain an approximation that yields useful results.

A key distinction in input modeling problems is the presence or absence of data:
- When we have data, then we fit a model to the data. Software support:
  - Special-purpose software, e.g., ExpertFit by A. Law
  - Simulation environments include this, e.g., Arena by Rockwell Automation
  - Statistics packages provide key functionality, e.g., R (www.r-project.org)
- When no data are available, then we have to creatively use what we can get to construct an input model.

Modelling

Essentially, all models are wrong, but some are useful.

Box, George E. P.; Norman R. Draper. Empirical Model-Building and Response Surfaces. Wiley, 1987.

. . . unfortunately some models are more wrong than others.

Collecting data

Generally hard, expensive, frustrating, boring:
- System might not exist.
- Data available on the wrong things – might have to change model according to what is available.
- Incomplete, dirty data.
- Too much data (!)

Further considerations:
- Sensitivity of outputs to uncertainty in inputs.
- Match model detail to quality of data.
- Cost – should be budgeted in project.
- Capture variability in data – model validity.


Example: Traffic measured at a node in a network

Plot shows the sequence of time stamps for a series of requests (arrival stream).

Observations:
- The data are a concatenation of several measurements; a new measurement shows up as a restart close to 0.0 or as an unreasonably wide gap to higher time stamp values.
- Thresholds are needed to automate subsequence detection: x = 20 s for a drop in time, y = 1000 s for an increase.

Note: Check consistency ahead of any numerical analysis!
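The threshold rule for subsequence detection can be sketched as follows. The function name and the sample time stamps are invented for illustration; only the two thresholds (a drop of more than 20 s, an increase of more than 1000 s) come from the slide.

```python
def split_subsequences(timestamps, max_drop=20.0, max_jump=1000.0):
    """Split a time stamp sequence wherever it drops by more than max_drop
    seconds or jumps forward by more than max_jump seconds."""
    runs = []
    current = []
    for t in timestamps:
        if current:
            delta = t - current[-1]
            if delta < -max_drop or delta > max_jump:
                runs.append(current)  # close the current measurement
                current = []
        current.append(t)
    if current:
        runs.append(current)
    return runs

# Hypothetical data: a restart near 0.0 and an unreasonably wide gap.
stamps = [100.0, 130.0, 125.0, 160.0, 0.5, 30.0, 5000.0, 5001.0]
for run in split_subsequences(stamps):
    print(run)
```

Small reversals such as 130.0 followed by 125.0 stay inside one subsequence, since they are below the drop threshold; they are a separate ordering problem.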

Example: Traffic measured at a node in a network

Plot shows the sequence of time differences for the first 20k events.

Observations:
- A closer look reveals that subsequences are not necessarily accurately ordered.

Options?
1. remove out-of-order entries
2. consider ordered subsequences
3. sort subsequence

Note: Check consistency ahead of any numerical analysis!
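The three options can be compared on a small example. This is a sketch with invented helper names and invented data; which option is appropriate depends on why the entries are out of order.

```python
def drop_out_of_order(timestamps):
    """Option 1: keep an entry only if it is not earlier than the last kept entry."""
    kept = []
    for t in timestamps:
        if not kept or t >= kept[-1]:
            kept.append(t)
    return kept

def ordered_runs(timestamps):
    """Option 2: split into maximal non-decreasing subsequences."""
    runs = []
    for t in timestamps:
        if runs and t >= runs[-1][-1]:
            runs[-1].append(t)
        else:
            runs.append([t])
    return runs

def interarrival_times(timestamps):
    """Differences between consecutive time stamps."""
    return [b - a for a, b in zip(timestamps, timestamps[1:])]

stamps = [1.0, 2.0, 1.5, 3.0, 4.0]  # hypothetical subsequence with one late entry
print("removed:", interarrival_times(drop_out_of_order(stamps)))
print("runs:   ", [interarrival_times(run) for run in ordered_runs(stamps)])
print("sorted: ", interarrival_times(sorted(stamps)))  # Option 3
```

Each option yields a different set of inter-arrival times from the same raw data, so the choice directly affects any distribution fitted afterwards.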


[Diagram of input model development. Labels: Real-World Process; Collecting Data; Approaches (Fitting Probability Distributions, Using Data Itself, Expert Opinion); Input Model (Fit); Goodness of the Fit; Validation.]


Notes on using data itself: Trace-driven simulation

Example: Simulator needs the arrival of the i-th customer: pick the i-th arrival from the data.

Limitations and challenges:
- Can never go outside your observed data. No tail and nothing in the gaps.
- Difficult to reflect dependencies in the inputs.
- Need to change the data when the input process changes.
- May not have enough data for long or many runs.
- Difficult to configure, e.g., customers arrive twice as fast ...
- Huge amount of data requires huge amount of space.

On the positive side:
- measurement data can naturally incorporate all kinds of qualitative and quantitative constraints and necessary details for a realistic run
- allows for a direct comparison of the real system with the simulated system, and for validation

Fitting Probability Distributions

Precondition:
- I.I.D. assumption for sample data used in fitting
- I.I.D. assumption for RVs in real system must be validated
- Corresponding graphical techniques/statistical tests ... later!

Focus: univariate distributions (i.e., just one RV)
- Most probability distributions were invented to represent a particular physical situation.
- If we know the physical basis for a distribution, then we can match it to the situation we have to model.
- Examples:
  - Binomial
  - Poisson and Exponential
  - Normal and Lognormal
  - Beta, Pert, and Triangular
  - Uniform
  (See Law, Chapter 6 (2007) for a detailed list)

G/G/n/m FCFS example refined (from Law, Example 6.1)

Does the selection of the distribution really matter?
- Arrivals: exponential, rate λ = 1; n = 1 server, m = ∞ (unlimited queue capacity)
- Service times: given 200 samples, distribution unknown
- Try different distributions, with the parameters of each fitted to match the data
- Make 100 independent simulation runs using each of the 5 distributions; continue each of the 500 runs to collect 100 delays; observe the impact of the selected distribution:

Distribution      Delay in queue   Number in queue   Prop. delays ≥ 20
Exponential       6.71             6.78              0.064
Gamma             4.54             4.60              0.019
Weibull (best)    4.36             4.41              0.013
Lognormal         7.19             7.30              0.078
Normal            6.04             6.13              0.045

Some Distributions

[Figure: plots of the Exponential, Gamma, Weibull, Lognormal, and Normal distributions]

Parameterization of distributions

Parameters of 3 basic types:
- Location
  - specifies an x-axis location point of a distribution’s range of values, usually the midpoint (e.g., mean for the normal distribution) or the lower end point of the distribution’s range
  - sometimes called shift parameter, since changing its value shifts the distribution to the left or right, e.g., for Y = X + γ
- Scale
  - determines the scale (unit) of measurement of the values in the range of the distribution (e.g., standard deviation σ for the normal distribution)
  - changing its value compresses/expands the distribution but does not alter its basic form, e.g., for Y = β X
- Shape
  - determines the basic form/shape of a distribution
  - changing its value alters a distribution’s properties, e.g., skewness, more fundamentally than a change in location or scale
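The difference between the three parameter types can be illustrated numerically. The sketch below uses a gamma distribution as an arbitrary example (an assumption for illustration, as are the values γ = 5 and β = 3): shifting and scaling a sample changes its mean and standard deviation but leaves the skewness untouched, whereas changing the shape parameter changes the skewness itself.

```python
import random
import statistics

def skewness(xs):
    """Sample skewness: third standardized central moment."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)

rng = random.Random(1)
x = [rng.gammavariate(2.0, 1.0) for _ in range(10000)]   # shape 2, scale 1

gamma_shift, beta_scale = 5.0, 3.0                        # illustrative values
y = [gamma_shift + beta_scale * v for v in x]             # Y = γ + β X

x_other_shape = [rng.gammavariate(20.0, 1.0) for _ in range(10000)]  # shape 20

for label, sample in [("X", x), ("γ + βX", y), ("shape 20", x_other_shape)]:
    print(f"{label:9s} mean={statistics.fmean(sample):7.3f} "
          f"std={statistics.pstdev(sample):6.3f} skew={skewness(sample):6.3f}")
```

For the gamma family the theoretical skewness is 2 divided by the square root of the shape parameter, so the shape-20 sample should look far more symmetric than the shape-2 sample.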

Physical basis for binomial distribution

Binomial: Models the number of successes in n independent Bernoulli trials, with probability p of success in each trial.

Example: The number of defective components found in a lot of n components, with probability p of picking a defective component.

[Figure: probability mass functions of Binomial(5, 0.2) and Binomial(5, 0.8)]

E[X] = np, Var[X] = np(1-p)
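The two moment formulas can be verified directly from the probability mass function for the two plotted cases. This is a small sketch; the helper names are illustrative.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def mean_and_variance(n, p):
    """Mean and variance computed by summing over the pmf."""
    pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
    mean = sum(k * q for k, q in enumerate(pmf))
    var = sum((k - mean) ** 2 * q for k, q in enumerate(pmf))
    return mean, var

for p in (0.2, 0.8):
    mean, var = mean_and_variance(5, p)
    print(f"Binomial(5, {p}): mean={mean:.3f} (np={5 * p:.3f}), "
          f"var={var:.3f} (np(1-p)={5 * p * (1 - p):.3f})")
```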


Physical basis for Poisson distribution

Poisson: Models the number of independent events that occur in a fixed amount of time.

Example: Number of customers arriving at a store during 1 hr.

[Figure: probability mass functions of Poisson(1) and Poisson(5)]

E[X] = λ, Var[X] = λ


Physical basis for exponential distribution

Exponential: Models the time between independent events, or a process time which is memoryless.

Example: The time to failure for a system that has a constant failure rate over time.

Note: If the times between events are independent and exponential, then the number of events in a fixed time interval is Poisson.

[Figure: densities of Expon(1) and Expon(3), each plotted with Shift=-2.5]

E[X] = λ, Var[X] = λ^2 (here λ denotes the mean of the distribution, i.e., the reciprocal of the event rate)
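The link between exponential inter-event times and Poisson counts can be checked by simulation. The sketch below is parameterized by the event rate (mean inter-event time = 1/rate); the rate of 5 events per unit time and the number of intervals are arbitrary choices for illustration. For a Poisson count, the sample mean and sample variance should both be close to the rate.

```python
import random
import statistics

def counts_per_interval(rate, n_intervals, rng):
    """Generate events with exponential inter-event times and count how many
    fall into each of n_intervals consecutive unit-length intervals."""
    counts = [0] * n_intervals
    t = rng.expovariate(rate)
    while t < n_intervals:
        counts[int(t)] += 1
        t += rng.expovariate(rate)
    return counts

rng = random.Random(7)
counts = counts_per_interval(rate=5.0, n_intervals=20000, rng=rng)
print(f"mean count     = {statistics.fmean(counts):.3f}")
print(f"variance count = {statistics.pvariance(counts):.3f}")
```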


Physical basis for normal distribution

Normal distribution: Models quantities that are the sum of a large number of other quantities.

Example: Time to assemble a product.

Student t distribution: Very similar to normal, but with heavier tails.

[Figure: densities of Normal(0, 1) vs Student(6) on the range -4 to 4, with markers at X <= -1.645 (5.0%) and X <= 1.645 (95.0%)]

Normal: E[X] = µ, Var[X] = σ^2

Physical basis for lognormal distribution

Lognormal: Models the distribution of a process that can be thought of as the product of a number of component processes.

Examples:
- The rate of return on an investment, when interest is compounded, is the product of the returns for a number of periods.
- Time to perform some task.
- Quantities that are the product of a large number of others (by virtue of the central limit theorem, applied to the sum of the logarithms).

[Figure: densities of Lognorm(2.5, 2) and Lognorm(2.5, 5), each plotted with Shift=-2.5]
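The product argument can be illustrated with a small experiment: multiply many independent positive factors and compare the skewness of the products with the skewness of their logarithms. The choice of 30 uniform factors on (0.5, 1.5) is an arbitrary assumption for this sketch. The logarithm of the product is a sum, so it should look roughly symmetric (normal), while the product itself is strongly right-skewed.

```python
import math
import random
import statistics

def skewness(xs):
    """Sample skewness: third standardized central moment."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)

rng = random.Random(42)
N_FACTORS, N_SAMPLES = 30, 5000   # illustrative sizes

products = []
for _ in range(N_SAMPLES):
    value = 1.0
    for _ in range(N_FACTORS):
        value *= rng.uniform(0.5, 1.5)   # independent positive factor
    products.append(value)

logs = [math.log(v) for v in products]
print(f"skewness of products:      {skewness(products):.2f}")
print(f"skewness of log(products): {skewness(logs):.2f}")
```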


Physical basis for beta distribution

Beta: An extremely flexible distribution used to model bounded (fixed upper and lower limits) random variables.
- Used as a rough model in the absence of data
- Distribution of a random proportion, such as the proportion of defective items in a shipment
- Time to complete a task, e.g., in a PERT network

Example: Proportion of defective items in a shipment.

[Figure: densities of Beta(1.5, 5) and Beta(5, 1.5)]


Physical basis for Pert (Beta) distribution

Pert (Beta in disguise): Used to model the activity times in project management problems and defined by three point estimates: min, mode, max.

Example: Time to complete a task in a PERT network.

[Figure: densities of Pert(5, 6, 15) and Pert(5, 13, 15)]

PERT is a method to analyze the involved tasks in completing a given project, especially the time needed to complete each task, and identifying the minimum time needed to complete the total project.
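To show how "Beta in disguise" works, the sketch below maps the three point estimates to beta shape parameters and rescales a standard beta variate to the interval [min, max]. It assumes the common PERT convention that puts weight 4 on the mode; whether the software that produced the plots uses exactly this convention is an assumption. Function names are illustrative.

```python
import random

def pert_shape_parameters(low, mode, high):
    """Beta shape parameters under the common PERT convention (weight 4 on the mode)."""
    span = high - low
    alpha = 1.0 + 4.0 * (mode - low) / span
    beta = 1.0 + 4.0 * (high - mode) / span
    return alpha, beta

def pert_mean(low, mode, high):
    """Mean of the PERT distribution: (min + 4 * mode + max) / 6."""
    return (low + 4.0 * mode + high) / 6.0

def sample_pert(low, mode, high, rng):
    """Draw one PERT variate by rescaling a beta variate from [0, 1] to [low, high]."""
    alpha, beta = pert_shape_parameters(low, mode, high)
    return low + (high - low) * rng.betavariate(alpha, beta)

rng = random.Random(3)
for low, mode, high in [(5.0, 6.0, 15.0), (5.0, 13.0, 15.0)]:
    draws = [sample_pert(low, mode, high, rng) for _ in range(20000)]
    print(f"Pert({low:g}, {mode:g}, {high:g}): "
          f"formula mean={pert_mean(low, mode, high):.3f}, "
          f"sample mean={sum(draws) / len(draws):.3f}")
```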

Physical basis for triangular distribution

Triangular: Models a process when only the minimum, most likely, and maximum values of the distribution are known.

Example: The minimum, most likely, and maximum inflation rate we will have this year.

[Figure: densities of Triang(5, 6, 15) and Triang(5, 13, 15)]


Physical basis for uniform distribution

Discrete Uniform: Models complete uncertainty, since all outcomes are equally likely.

Example: A first model for a quantity that is varying among the integers 1 through 4, but about which little else is known.

[Figure: probability mass function of DUniform({x})]


Distributions

Many theoretical distributions with nice properties:
- experience with scenarios when to apply those
- well-studied properties, parameters, characteristics
- compact representation of data
- software support for sampling in simulation runs
- software support to perform parameter fitting
- easy to vary by modification of parameters
- some allow for closed-form analytical formulas for system analysis (queueing networks)
- may allow for numbers beyond reasonable limits, e.g., negative values or very high values, such that truncation may be necessary
- less sensitive to data irregularities than an empirical distribution

For distributions and their relationships see also:
- Wheyming Song and Yi-Chun Chen, Simulation Input Models: Relationships Among Eighty Univariate Distributions Displayed in a Matrix Format, Proceedings Winter Simulation Conference 2010.
- Larry Leemis: Univariate Distribution Relationships, www.math.wm.edu/~leemis/chart/UDR/UDR.html

Overview of fitting with data

Select one or more candidate distributions based on physical characteristics of the process and graphical examination of the data.

Fit the distribution to the data: determine values for its unknown parameters.

Check the fit to the data via statistical tests and via graphical analysis.

If the distribution does not fit, select another candidate and repeat the process, or use an empirical distribution.
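The fit-and-check steps can be sketched for a single candidate. This is a minimal illustration, not the procedure of any particular package: the candidate is an exponential distribution fitted by maximum likelihood (the sample mean), the check is the Kolmogorov-Smirnov distance between the empirical and fitted CDFs, and both data sets are synthetic. Standard KS critical values assume the parameters were not estimated from the same data, so the distances are used here only for comparison.

```python
import math
import random

def fit_exponential_mean(data):
    """Maximum-likelihood estimate of the mean of an exponential distribution."""
    return sum(data) / len(data)

def exponential_cdf(mean):
    """Return the CDF of an exponential distribution with the given mean."""
    def cdf(x):
        return 1.0 - math.exp(-x / mean) if x > 0 else 0.0
    return cdf

def ks_statistic(data, cdf):
    """Largest vertical distance between the empirical CDF and the model CDF."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

rng = random.Random(3)
exp_data = [rng.expovariate(0.5) for _ in range(2000)]    # truly exponential, mean 2
uni_data = [rng.uniform(0.0, 4.0) for _ in range(2000)]   # uniform, same mean

d_good = ks_statistic(exp_data, exponential_cdf(fit_exponential_mean(exp_data)))
d_bad = ks_statistic(uni_data, exponential_cdf(fit_exponential_mean(uni_data)))
print(f"KS distance, exponential data: {d_good:.3f}")
print(f"KS distance, uniform data:     {d_bad:.3f}")
```

A clearly larger distance for the uniform data signals that the exponential candidate should be rejected there, which corresponds to selecting another candidate or falling back to an empirical distribution.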

