CS626 Data Analysis and Simulation - William & Marykemper/cs626/slides/v4.pdf · Concept of input...
CS626 Data Analysis and Simulation
Today: Stochastic Input Modeling
Based on WSC 2010 Tutorial by Biller and Gunes, CMU, slides used with permission.
Reference: Law/Kelton, Simulation Modeling and Analysis, Ch 6.
Instructor: Peter Kemper, R 104A, phone 221-3462, email: [email protected]
Office hours: Monday, Wednesday 2-4 pm
Big Picture: Model-based Analysis of Systems
[Figure: diagram relating a real world problem (portion/facet of the real world), a formal model (probability model, stochastic process), formal / computer aided analysis, the solution (rewards, qualitative and quantitative properties), and the solution to the real world problem. Arrow labels: perception, description, transformation, presentation, transfer, decision.]
What is input modeling?
Input modeling: Deriving a representation of the uncertainty or randomness in a stochastic simulation.
Common representations:
- Measurement data
- Distributions derived from measurement data <-- focus of “Input modeling”
  - usually requires that samples are i.i.d. and that the corresponding random variables in the simulation model are i.i.d. (i.i.d. = independent and identically distributed)
  - theoretical distributions
  - empirical distribution
- Time-dependent stochastic process
- Other stochastic processes
Examples include:
- time to failure for a machining process;
- demand per unit time for inventory of a product;
- number of defective items in a shipment of goods;
- times between arrivals of calls to a call center.
Why are input models stochastic?
We just cannot assume randomness away.
Example (Nelson and Biller 2003): Suppose you are a supplier of a component that you know has a mean time to failure of 2 years. A client is willing to pay $1000 for your component, but wants you to pay a penalty of $5000 if failure occurs in less than one year. Should you take this contract?
No uncertainty:
- You will pocket $1000 for each component you sell.
Uncertainty:
- If you know that the time to failure is well modeled as exponentially distributed (an input model) with mean 2 years, then F(1) = 0.39 and you can expect to lose $950 on each component you sell (1000 - 5000 x 0.39 = -950).
- If you know that the time to failure is well modeled as uniformly distributed (an input model) between 0 and 4 years (so that the mean lifetime is 2 years), then F(1) = 0.25 and the expected loss on each component is $250 (1000 - 5000 x 0.25 = -250).
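The arithmetic of the contract example can be reproduced in a few lines. This is a minimal sketch; the function names are made up for illustration. Without rounding, the exponential case gives F(1) of about 0.3935 and a loss of about $967; the $950 on the slide comes from rounding F(1) to 0.39.

```python
import math

PRICE = 1000.0    # payment per component (from the example)
PENALTY = 5000.0  # penalty if the component fails within the first year

def exponential_cdf(t, mean):
    """P(T <= t) for an exponential lifetime with the given mean."""
    return 1.0 - math.exp(-t / mean)

def uniform_cdf(t, low, high):
    """P(T <= t) for a lifetime that is uniform on [low, high]."""
    return min(1.0, max(0.0, (t - low) / (high - low)))

def expected_profit(p_early_failure):
    """Expected profit per component, given P(failure before one year)."""
    return PRICE - PENALTY * p_early_failure

cases = [
    ("exponential, mean 2 years", exponential_cdf(1.0, 2.0)),
    ("uniform on [0, 4] years", uniform_cdf(1.0, 0.0, 4.0)),
]
for name, p in cases:
    print(f"{name}: F(1) = {p:.4f}, expected profit = {expected_profit(p):.2f}")
```

Both input models have the same mean lifetime, yet the expected loss differs by a factor of almost four, which is the point of the example.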
Learning objectives
- Concept of input modeling and its fit in simulation model development.
- Input modeling with data:
  - Physical basis for distributions.
  - Fitting and checking.
- Input modeling without data:
  - Sources of information.
  - Incorporating expert opinion.
What is input modeling?
Input modeling:
Deriving a representation of the uncertainty or randomness in a stochastic simulation.
Randomness? A way to describe the behavior of a subsystem that
- (lack of knowledge) we cannot describe as a deterministic system, or
- (lack of interest, abstraction from details) we do not want to describe as a deterministic system.
What is input modeling?
Example model: G/G/n/m FCFS queue
- Customers (tasks) arrive according to some general distribution G.
- Customers are served for a time according to some general distribution G.
- n servers are available to serve customers in parallel.
- Customers are scheduled first-come-first-served (FCFS).
- m is the capacity of the queue (customers hitting a full system are turned away).
Design question: What values of n and m are necessary
- to limit the waiting time for 90% of all customers to 10 min, and
- to limit the fraction of customers that get turned away to 5% in the long run?
What pieces of information does the input modeling contribute to this simulation study?
Photo: Stuart Richards (Left-hand), Flickr, Creative Commons
Cookbook recipe for conducting a simulation study
[Figure: flowchart of a simulation study with the boxes: Statement of the decision problem and objectives; System Analysis; Data Collection; Input Modeling; Rough-cut Model Development; Model Building (Static Models: static (spreadsheet) simulation; Dynamic Models: dynamic system simulation); Design and coding of the simulation program; Verification; Validation; Experimental Design; Simulation runs (removal of initial-condition bias, determination of the replication number for error control); Output Analysis (statistical analysis of results and system design comparison: comparison via simulation, simulation optimization); Recommendation for decisions and implementation of the model; Final documentation.]
Simulation model development
[Figure: boxes Real-World Process or Phenomenon, Simulation Model, Simulation Program, Random Input Model, Random Variate Generator; connecting activities Simulation Modeling, Simulation Programming, Simulation Input Modeling, Random Variate Programming.]
G/G/n/m FCFS queueing model revisited
Conceptual model
- Customers (tasks) arrive according to some general distribution G.
- Customers are served for a time according to some general distribution G.
- n servers are available to serve customers in parallel.
- Customers are scheduled first-come-first-served (FCFS).
- m is the capacity of the queue (customers hitting a full system are turned away).
Design question: What values of n and m are necessary
- to limit the waiting time for 90% of all customers to 10 min, and
- to limit the fraction of customers that get turned away to 5% in the long run?
Input model
- Measurement data for task arrivals and service times for a certain time
- Option 1: Trace-driven simulation
  - use measurement data to feed a simulation run
- Option 2: Simulation draws from a probability distribution
  - needs selection/configuration of a distribution (distribution fitting)
  - alternative: empirical distribution
- Option 3: Simulation executes stochastic process (later)
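To see where the input model plugs into such a study, here is a minimal sketch of a G/G/n/m FCFS simulation in which inter-arrival and service times come from two caller-supplied functions. Everything named here (`simulate_queue`, its parameters) is hypothetical, and two modeling choices are assumptions rather than statements from the slides: m is read as the maximum number of customers in the system (waiting or in service), and the exponential samplers in the usage part are placeholders, not fitted distributions.

```python
import heapq
import random

def simulate_queue(next_interarrival, next_service, n_servers, capacity, n_customers):
    """Simulate an FCFS queue with n_servers parallel servers.

    capacity: assumed to be the maximum number of customers in the system.
    Returns (waiting times of accepted customers, number of rejected customers).
    """
    server_free = [0.0] * n_servers  # min-heap: times at which servers become idle
    in_system = []                   # min-heap: departure times of customers in system
    waits, rejected, now = [], 0, 0.0
    for _ in range(n_customers):
        now += next_interarrival()
        # Drop customers that have already departed by the arrival time.
        while in_system and in_system[0] <= now:
            heapq.heappop(in_system)
        if len(in_system) >= capacity:
            rejected += 1            # system full: customer is turned away
            continue
        # FCFS: the customer takes the server that becomes idle first.
        start = max(now, heapq.heappop(server_free))
        departure = start + next_service()
        heapq.heappush(server_free, departure)
        heapq.heappush(in_system, departure)
        waits.append(start - now)
    return waits, rejected

# Option 2: draw from probability distributions (placeholder parameters).
rng = random.Random(1)
waits, rejected = simulate_queue(
    lambda: rng.expovariate(1.0),   # mean inter-arrival time 1
    lambda: rng.expovariate(0.5),   # mean service time 2
    n_servers=3, capacity=10, n_customers=10000)
share_ok = sum(w <= 10.0 for w in waits) / len(waits)
print(f"share waiting <= 10: {share_ok:.3f}, rejected: {rejected / 10000:.3f}")

# Option 1: trace-driven, feeding recorded values instead (made-up numbers).
gaps, services = iter([0.5, 1.2, 0.3, 2.0]), iter([1.0, 0.7, 1.5, 0.4])
print(simulate_queue(lambda: next(gaps), lambda: next(services), 1, 2, 4))
```

Switching between Option 1 and Option 2 only changes the two sampler arguments; the queueing logic stays the same.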
Input model development
There is no “true” model for any stochastic input. The best that we can hope is to obtain an approximation that yields useful results.
A key distinction in input modeling problems is the presence or absence of data: When we have data, then we fit a model to the data.
Software support:
- Special purpose software, e.g., ExpertFit by A. Law
- Simulation environments include this, e.g., Arena by Rockwell Automation
- Statistics packages provide key functionality, e.g., R (www.r-project.org)
When no data are available, then we have to creatively use what we can get to construct an input model.
Modelling
Essentially, all models are wrong, but some are useful.
Box, George E. P.; Norman R. Draper. Empirical Model-Building and Response Surfaces. Wiley, 1987.
. . . unfortunately some models are more wrong than others.
Collecting data
Generally hard, expensive, frustrating, boring:
- System might not exist.
- Data available on the wrong things; might have to change model according to what is available.
- Incomplete, dirty data.
- Too much data (!)
Sensitivity of outputs to uncertainty in inputs.
Match model detail to quality of data.
Cost: should be budgeted in project.
Capture variability in data: model validity.
Example: Traffic measured at a node in a network
Plot shows sequence of time stamps for a series of requests (arrival stream).
Observations:
- concatenation of several measurements, with a restart close to 0.0 or unreasonably wide gaps to higher values of time stamps
Need thresholds to automate subsequence detection:
- x = 20s for drop of time,
- y = 1000s for increase
Note: Check consistency ahead of any numerical analysis!
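The threshold rule can be sketched as follows. The function name and the exact comparison are assumptions: a new subsequence is started when a time stamp falls by more than x = 20s or jumps ahead by more than y = 1000s relative to its predecessor.

```python
def split_trace(timestamps, drop=20.0, gap=1000.0):
    """Split a concatenated trace of time stamps into subsequences.

    A new subsequence starts when the time stamp falls by more than `drop`
    seconds (restart) or increases by more than `gap` seconds (wide gap).
    """
    if not timestamps:
        return []
    parts = [[timestamps[0]]]
    for prev, cur in zip(timestamps, timestamps[1:]):
        if prev - cur > drop or cur - prev > gap:
            parts.append([cur])      # boundary between two measurements
        else:
            parts[-1].append(cur)    # same measurement continues
    return parts

# Made-up time stamps: a restart near 0.0 and a wide gap.
trace = [100.0, 101.0, 102.0, 0.5, 1.5, 2000.0, 2001.0]
for part in split_trace(trace):
    print(part)
```

Small backward steps below the drop threshold stay inside a subsequence, so out-of-order entries within one measurement still need separate treatment.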
Example: Traffic measured at a node in a network
Plot shows sequence of time differences for the first 20k events.
Observations:
- a closer look reveals that subsequences are not necessarily accurately ordered
Options?
1. remove out-of-order entries
2. consider ordered subsequences
3. sort subsequence
Note: Check consistency ahead of any numerical analysis!
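The three options can be compared on a small made-up subsequence. This is only a sketch with hypothetical helper names; which option is appropriate depends on why the entries are out of order.

```python
def remove_out_of_order(timestamps):
    """Option 1: keep only entries that do not go back in time."""
    kept = []
    for t in timestamps:
        if not kept or t >= kept[-1]:
            kept.append(t)
    return kept

def ordered_runs(timestamps):
    """Option 2: split into maximal non-decreasing subsequences."""
    runs = []
    for t in timestamps:
        if runs and t >= runs[-1][-1]:
            runs[-1].append(t)
        else:
            runs.append([t])
    return runs

def interarrival_times(timestamps):
    """Time differences between consecutive entries."""
    return [b - a for a, b in zip(timestamps, timestamps[1:])]

trace = [1.0, 2.0, 1.5, 3.0, 4.0]  # made-up data with one out-of-order entry
print(interarrival_times(trace))                       # contains a negative difference
print(interarrival_times(remove_out_of_order(trace)))  # option 1
print([interarrival_times(r) for r in ordered_runs(trace)])  # option 2
print(interarrival_times(sorted(trace)))               # option 3
```

Note that the three options yield different sets of time differences, so the choice affects the data that a distribution is later fitted to.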
Input model development
[Figure: boxes Real-World Process, Collecting Data, Approaches (Fitting Probability Distributions, Using Data Itself, Expert Opinion), Input Model (Fit), Goodness of the Fit, Validation.]
Notes on using data itself: Trace-driven simulation
Example: Simulator needs arrival of i-th customer: pick i-th arrival from data
Limitations and Challenges
- Can never go outside your observed data. No tail and nothing in the gaps.
- Difficult to reflect dependencies in the inputs.
- Need to change the data when the input process changes.
- May not have enough data for long or many runs.
- Difficult to configure, e.g., customers arrive twice as fast ...
- Huge amount of data requires huge amount of space
On the positive side
- measurement data can naturally incorporate all kinds of qualitative and quantitative constraints and necessary details for a realistic run
- allows for a direct comparison of real system with simulated system and validation
Fitting Probability Distributions
Precondition:
- I.I.D. assumption for sample data used in fitting
- I.I.D. assumption for RVs in real system must be validated
- Corresponding graphical techniques/statistical tests ... later!
Focus: univariate distributions (i.e., just one RV)
Most probability distributions were invented to represent a particular physical situation. If we know the physical basis for a distribution, then we can match it to the situation we have to model.
Examples:
- Binomial
- Poisson and Exponential
- Normal and Lognormal
- Beta, Pert, and Triangular
- Uniform
(See Law, Chapter 6 (2007) for a detailed list)
G/G/m/n FCFS Example refined (from Law, Example 6.1)
Does the selection of the distribution really matter?
- Arrivals: exponential, rate λ = 1; m = 1, n = ∞ (in the G/G/m/n notation of this slide: a single server and unlimited capacity)
- Service times: given 200 samples, distribution unknown
- Exercise different distributions, with parameters fitted to match the data
- Make 100 independent simulation runs using each of the 5 distributions; continue each of the 500 runs to collect 100 delays; observe the impact of the selected distribution:

Distribution      Delay in queue   Number in queue   Prop. delays ≥ 20
Exponential       6.71             6.78              0.064
Gamma             4.54             4.60              0.019
Weibull (best)    4.36             4.41              0.013
Lognormal         7.19             7.30              0.078
Normal            6.04             6.13              0.045
Some Distributions
[Figure: plots of the Exponential, Gamma, Weibull, Lognormal, and Normal distributions.]
Parameterization of distributions
Parameters of 3 basic types
Location
- specifies an x-axis location point of a distribution’s range of values
- usually the midpoint (e.g., mean for normal distribution) or lower end point for the distribution’s range
- sometimes called shift parameter since changing its value shifts the distribution to the left or right, e.g., for Y = X + γ
Scale
- determines the scale (unit) of measurement of the values in the range of the distribution (e.g., standard deviation σ for normal distribution)
- changing its value compresses/expands the distribution but does not alter its basic form, e.g., for Y = β X
Shape
- determines basic form/shape of a distribution
- changing its value alters a distribution’s properties, e.g., skewness, more fundamentally than a change in location or scale
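A small experiment illustrates the difference between the three parameter types. This is a sketch under assumptions: a gamma distribution is used as the example family, and the parameter values and the `skewness` helper are chosen only for illustration. Shifting and scaling (Y = β X + γ) changes mean and standard deviation but leaves the skewness unchanged, whereas changing the shape parameter changes the skewness.

```python
import random
from statistics import fmean, pstdev

def skewness(xs):
    """Sample skewness: mean of the cubed standardized values."""
    m, s = fmean(xs), pstdev(xs)
    return fmean([((x - m) / s) ** 3 for x in xs])

rng = random.Random(42)
# X ~ Gamma with shape 2 and scale 1 (random.gammavariate(shape, scale)).
x = [rng.gammavariate(2.0, 1.0) for _ in range(20000)]

shift, scale = 10.0, 3.0                   # location gamma and scale beta
y = [scale * v + shift for v in x]         # Y = beta * X + gamma

# Same family, but a different shape parameter.
z = [rng.gammavariate(8.0, 1.0) for _ in range(20000)]

for name, data in [("X (shape 2)", x), ("Y = 3 X + 10", y), ("Z (shape 8)", z)]:
    print(f"{name}: mean = {fmean(data):.2f}, std = {pstdev(data):.2f}, "
          f"skewness = {skewness(data):.2f}")
```

For a gamma distribution the theoretical skewness is 2 divided by the square root of the shape parameter, so the larger shape value gives a more symmetric density.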
Physical basis for binomial distribution
Binomial
- Models the number of successes in n independent Bernoulli trials, with probability p of success in each trial.
Example:
- The number of defective components found in a lot of n components with probability p of picking a defective component.
E[X] = np, Var[X] = np(1-p)
[Figure: probability mass functions of Binomial(5, 0.2) and Binomial(5, 0.8).]
Physical basis for Poisson distribution
Poisson:
- Models the number of independent events that occur in a fixed amount of time.
Example:
- Number of customers arriving at a store during 1 hr.
E[X] = λ, Var[X] = λ
[Figure: probability mass functions of Poisson(1) and Poisson(5).]
Physical basis for exponential distribution
Exponential
- Models the time between independent events, or a process time which is memoryless.
Example:
- The time to failure for a system that has a constant failure rate over time.
Note: If the time between events is exponential, then the number of events in a fixed amount of time is Poisson.
E[X] = λ, Var[X] = λ^2, where λ here denotes the mean time between events (for an event rate r, E[X] = 1/r and Var[X] = 1/r^2).
[Figure: density functions of Expon(1) Shift=-2.5 and Expon(3) Shift=-2.5.]
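The link between exponential inter-event times and Poisson counts can be checked empirically. The following is a minimal sketch with an arbitrarily chosen rate of 5 events per time unit and a hypothetical helper name; for a Poisson count, mean and variance should both be close to the rate.

```python
import random
from statistics import fmean, pvariance

def counts_per_interval(rate, n_intervals, rng):
    """Generate events with exponential gaps and count them per unit interval."""
    counts = [0] * n_intervals
    t = rng.expovariate(rate)        # expovariate takes the rate, i.e. 1 / mean
    while t < n_intervals:
        counts[int(t)] += 1          # event falls into interval [int(t), int(t) + 1)
        t += rng.expovariate(rate)
    return counts

rng = random.Random(2010)
counts = counts_per_interval(rate=5.0, n_intervals=20000, rng=rng)
print(f"mean count = {fmean(counts):.3f}, variance = {pvariance(counts):.3f}")
```

Both printed values should be near 5, in line with E[X] = Var[X] = λ for the Poisson distribution.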
Physical basis for normal distribution
Normal distribution
- Models quantities that are the sum of a large number of other quantities.
Example:
- Time to assemble a product.
Student t distribution
- Very similar to normal, but with heavier tails.
Normal: E[X] = µ, Var[X] = σ^2
[Figure: Normal(0, 1) vs Student(6), plotted from -4 to 4, with markers X <= -1.645 (5.0%) and X <= 1.645 (95.0%).]
Physical basis for lognormal distribution
Lognormal:
- Models the distribution of a process that can be thought of as the product of a number of component processes.
Example:
- The rate of return on an investment, when interest is compounded, is the product of the returns for a number of periods.
- Time to perform some task
- Quantities that are the product of a large number of others (by virtue of the central limit theorem, applied to the sum of their logarithms)
[Figure: density functions of Lognorm(2.5, 2) Shift=-2.5 and Lognorm(2.5, 5) Shift=-2.5.]
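The physical basis of the lognormal can be made concrete with a short experiment. This sketch multiplies 50 independent factors drawn uniformly from [0.9, 1.1]; both the number of factors and their distribution are arbitrary assumptions. The logarithm of the product is a sum of independent terms and is therefore roughly normal (skewness near 0), while the product itself is right-skewed.

```python
import math
import random
from statistics import fmean, pstdev

def skewness(xs):
    """Sample skewness: mean of the cubed standardized values."""
    m, s = fmean(xs), pstdev(xs)
    return fmean([((x - m) / s) ** 3 for x in xs])

rng = random.Random(3)
products = []
for _ in range(20000):
    value = 1.0
    for _ in range(50):                  # e.g. 50 compounding periods
        value *= rng.uniform(0.9, 1.1)   # per-period growth factor
    products.append(value)

logs = [math.log(p) for p in products]
print(f"skewness of product      = {skewness(products):.2f}")
print(f"skewness of log(product) = {skewness(logs):.2f}")
```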
Physical basis for beta distribution
Beta
- An extremely flexible distribution used to model bounded (fixed upper and lower limits) random variables.
- Used as a rough model in the absence of data
- Distribution of a random proportion, such as the proportion of defective items in a shipment
- Time to complete a task, e.g., in a PERT network
Example:
- Proportion of defective items in a shipment.
[Figure: density functions of Beta(1.5, 5) and Beta(5, 1.5).]
Physical basis for Pert (Beta) distribution
Pert (Beta in disguise)
- Used to model the activity times in project management problems; defined by three point estimates: min, mode, max
Example:
- Time to complete a task in a PERT network.
PERT is a method to analyze the tasks involved in completing a given project, especially the time needed to complete each task, and to identify the minimum time needed to complete the total project.
[Figure: density functions of Pert(5, 6, 15) and Pert(5, 13, 15).]
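To make "Beta in disguise" concrete, here is a sketch that samples a Pert(min, mode, max) variate by rescaling a beta variate. It assumes the commonly used PERT parameterization with weight 4 on the mode, which gives the mean (min + 4 mode + max) / 6; whether the tool used for the plots applies exactly this parameterization is an assumption. The function name is hypothetical.

```python
import random

def pert_sample(low, mode, high, rng):
    """Draw one Pert(low, mode, high) variate via a rescaled beta variate.

    Assumed parameterization: alpha = 1 + 4 (mode - low) / (high - low),
    beta = 1 + 4 (high - mode) / (high - low).
    """
    span = high - low
    alpha = 1.0 + 4.0 * (mode - low) / span
    beta = 1.0 + 4.0 * (high - mode) / span
    return low + span * rng.betavariate(alpha, beta)

rng = random.Random(7)
samples = [pert_sample(5.0, 6.0, 15.0, rng) for _ in range(20000)]
sample_mean = sum(samples) / len(samples)
expected_mean = (5.0 + 4.0 * 6.0 + 15.0) / 6.0
print(f"sample mean = {sample_mean:.3f}, (min + 4 mode + max) / 6 = {expected_mean:.3f}")
```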
Physical basis for triangular distribution
Triangular:
- Models a process when only the minimum, most likely and maximum values of the distribution are known.
Example:
- The minimum, most likely and maximum inflation rate we will have this year.
[Figure: density functions of Triang(5, 6, 15) and Triang(5, 13, 15).]
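Because the triangular CDF is piecewise quadratic, variates can be generated by inverting it in closed form. The sketch below uses a hypothetical function name and takes the parameters in the order (min, most likely, max), as in Triang(5, 6, 15); the mean of a triangular distribution is (min + mode + max) / 3.

```python
import math
import random

def triangular_inverse_cdf(u, low, mode, high):
    """Inverse CDF of the triangular distribution for u in [0, 1]."""
    cut = (mode - low) / (high - low)          # CDF value at the mode
    if u < cut:
        return low + math.sqrt(u * (high - low) * (mode - low))
    return high - math.sqrt((1.0 - u) * (high - low) * (high - mode))

rng = random.Random(11)
samples = [triangular_inverse_cdf(rng.random(), 5.0, 6.0, 15.0) for _ in range(20000)]
sample_mean = sum(samples) / len(samples)
print(f"sample mean = {sample_mean:.3f}, (min + mode + max) / 3 = {(5.0 + 6.0 + 15.0) / 3:.3f}")
```

Python's standard library also offers `random.triangular(low, high, mode)`; the explicit inverse is shown here only to expose the mechanics.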
Physical basis for uniform distribution
Discrete Uniform
- Models complete uncertainty, since all outcomes are equally likely.
Example:
- A first model for a quantity that is varying among the integers 1 through 4, but about which little else is known.
[Figure: probability mass function of DUniform({x}), x-axis from 0.5 to 4.5.]
Distributions
Many theoretical distributions with nice properties:
- experience with scenarios when to apply those
- well-studied properties, parameters, characteristics
- compact representation of data
- software support for sampling in simulation runs
- software support to perform parameter fitting
- easy to vary by modification of parameters
- some allow for closed-form analytical formulas for system analysis (queueing networks)
- may allow for numbers beyond reasonable limits, e.g., negative values, very high values, such that truncation may be necessary
- less sensitive to data irregularities than an empirical distribution
For distributions and their relationships see also:
- Wheyming Song and Yi-Chun Chen, Simulation Input Models: Relationships Among Eighty Univariate Distributions Displayed in a Matrix Format, Proceedings Winter Simulation Conference 2010.
- Larry Leemis: Univariate Distribution Relationships, www.math.wm.edu/~leemis/chart/UDR/UDR.html
Overview of fitting with data
1. Select one or more candidate distributions based on physical characteristics of the process and graphical examination of the data.
2. Fit the distribution to the data: determine values for its unknown parameters.
3. Check the fit to the data via statistical tests and via graphical analysis.
4. If the distribution does not fit, select another candidate and repeat the process, or use an empirical distribution.
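The four steps can be walked through on synthetic data. This is a minimal sketch, not the procedure of any particular fitting tool: the data are generated rather than measured, the two candidates (exponential and uniform) are chosen for illustration, the fit uses simple estimators, and the check is the Kolmogorov-Smirnov distance. The reference value 1.36 / sqrt(n) is the usual large-sample 5% critical value for a fully specified distribution; with parameters estimated from the same data it is only a rough guide.

```python
import bisect
import math
import random

def ks_distance(data, cdf):
    """Largest gap between the empirical CDF of data and a candidate CDF."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

def empirical_cdf(data):
    """Fallback input model: the empirical distribution of the data."""
    xs = sorted(data)
    return lambda x: bisect.bisect_right(xs, x) / len(xs)

rng = random.Random(626)
data = [rng.expovariate(0.5) for _ in range(2000)]   # synthetic sample, mean 2

# Steps 1 and 2: candidate distributions and fitted parameters.
mean_hat = sum(data) / len(data)                     # MLE of the exponential mean
upper_hat = max(data)                                # uniform on [0, max(data)]
candidates = {
    "exponential": lambda x: 1.0 - math.exp(-x / mean_hat),
    "uniform": lambda x: min(1.0, max(0.0, x / upper_hat)),
}

# Step 3: check the fit; step 4: reject poor candidates.
reference = 1.36 / math.sqrt(len(data))
distances = {name: ks_distance(data, cdf) for name, cdf in candidates.items()}
for name, d in distances.items():
    verdict = "plausible" if d < reference else "poor fit"
    print(f"{name}: KS distance = {d:.4f} ({verdict}, reference {reference:.4f})")
```

If every candidate were rejected, `empirical_cdf(data)` could serve as the input model instead, as in the last step above.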