Statistical Computing
Session 4: Random Simulation
Paul Eilers & Dimitris Rizopoulos
Department of Biostatistics, Erasmus University Medical Center
Masters Track Statistical Sciences, 2010
The plan for today
• Short introduction: who am I?
• Random simulation, what is it good for?
• The R system of random number generation
• Tools for presentation and analysis of simulation output
• Simulation techniques
• Getting insight into classical procedures by simulation
Masters Track Statistical Sciences 2009 – Statistical Computing 1
Why simulation?
• Statistics is about random numbers
• We assume a model for generating the data
• The model contains parameters
• We estimate them from the data, using recipes
• Examples: regression, ANOVA, time series analysis, ...
• Crucial question: how good is a recipe?
• Classical (frequentist) statistics: confidence intervals
• Bayesian: combine prior and data to get posterior distribution
• Testing (mostly for zero values) is a special case
How to evaluate performance by analysis
• In some cases exact results can be obtained
• Especially for normally distributed data
• Examples: χ2, t and F distributions (ANOVA)
• You will encounter them in other courses here
• In the old days we used tables
• Nowadays they are built in as functions in statistical software
• It is always wise to use analytical results
How to evaluate performance by simulation
• Simulate data according to the model
• Apply the recipe
• Repeat this many times
• Study the distributions of parameter estimates
• Confidence intervals follow from them
• Probabilities of outcomes of tests can be determined
• Disadvantage: generalization can be hard
• What if I have more/less observations, larger variances, ...?
• You might have to simulate again for every case
The R system for generating random numbers
• For a given distribution you have four choices
• I use the normal distribution as an example
• Give me 100 draws from the standard normal: rnorm(100)
• The density over a vector x: dnorm(x)
• The cumulative distribution over a vector x: pnorm(x)
• The quantiles over a vector of probabilities p: qnorm(p)
• Let’s take a better look
Example: normal density
x = seq(-4, 4, by = 0.1)
y = dnorm(x)
plot(x, y, type = 'l', col = 'blue', lwd = 2)
[Figure: the standard normal density plotted over x from −4 to 4]
Example: cumulative normal distribution
x = seq(-4, 4, by = 0.1)
y = pnorm(x)
plot(x, y, type = 'l', col = 'blue', lwd = 2)
[Figure: the cumulative normal distribution, an S-shaped curve from 0 to 1, over x from −4 to 4]
Example: normal quantiles
p = seq(0.001, 0.999, by = 0.001)
y = qnorm(p)
plot(p, y, type = 'l', lwd = 2, col = 'blue')
[Figure: the normal quantile function qnorm(p) plotted over p from 0 to 1]
How to look at random numbers
• The result of rnorm() is a series of numbers
• There is little use in plotting them individually
• Instead we use the histogram
• We divide x-axis in a number of bins
• In each bin we count the number of observations
• We plot it as a bar chart
Example of a histogram
y = rnorm(200)
hist(y)
[Figure: histogram of y, 200 standard normal draws; counts against y]
Playing with the histogram
y = rnorm(200)
b = seq(-4, 4, by = 0.2) # Narrower bins
hist(y, breaks = b, col = 'gray', border = 'white')
[Figure: histogram of y with narrower bins, gray bars with white borders]
Adding the theoretical normal density
y = rnorm(200); b = seq(-4, 4, by = 0.2)
hist(y, breaks = b, col = 'gray', border = 'white', freq = FALSE)
lines(b, dnorm(b), col = 'blue', lwd = 2)
[Figure: density-scaled histogram of y with the theoretical normal density overlaid in blue]
A different result every time?
• Look at the histograms carefully
• They seem to be based on different numbers
• That’s indeed the case
• Every call to rnorm() gives a different result
• You might not like that (and you should not)
• It makes it hard to compare results
• Set the random “seed” to a starting value
• Like set.seed(123) or set.seed(911)
• Any integer number you like (see R help for details)
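A quick sketch of what the seed does: resetting the seed reproduces exactly the same stream of draws, while a further call without resetting simply continues the stream.

```r
set.seed(123)
a = rnorm(5)          # five draws
set.seed(123)
b = rnorm(5)          # the same seed repeats the same five draws
all(a == b)           # TRUE
rnorm(5)              # without resetting, the stream continues and differs
```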
The empirical cumulative distribution
n = 20; y = rnorm(n)
p = ((1:n) - 0.5) / n # Empirical probabilities
plot(sort(y), p, type = 's', col = 'red')
[Figure: the empirical cumulative distribution of 20 normal draws, drawn as a step function of sort(y)]
Adding the theoretical cumulative distribution
n = 40; y = rnorm(n); p = ((1:n) - 0.5) / n
plot(sort(y), p, type = 's', col = 'red')
x = seq(-3, 3, by = 0.1); lines(x, pnorm(x), col = 'blue', lwd = 2)
[Figure: the empirical cumulative distribution (red steps) with the theoretical normal cumulative distribution (blue curve) overlaid]
Randomness at work
• The fit to the theoretical distribution is not impressive
• This is generally the case for small samples
• Run these programs many times
• (Without set.seed(), of course)
• You will see randomness at work
• You will appreciate variation in small samples better
The Q-Q plot
• We have our sorted observations
• And the empirical cumulative probabilities
• For the latter we can compute normal quantiles (with qnorm())
• Plot one against the other
• We’ll program it ourselves
• That has some educative value, I hope
• The function qqnorm() is easier to use
• Add a line, based on mean and variance, with qqline()
• Of course, this only compares to the normal distribution
Q-Q plot from scratch
n = 40; y = rnorm(n)
p = ((1:n) - 0.5) / n
plot(qnorm(p), sort(y), col = 'red')
[Figure: hand-made Q-Q plot, sorted observations plotted against the normal quantiles qnorm(p)]
Q-Q plot the easy way
n = 40; y = rnorm(n)
qqnorm(y)
qqline(y, col = 'blue', lwd = 2)
[Figure: Normal Q-Q Plot produced by qqnorm(), sample quantiles against theoretical quantiles, with the qqline() in blue]
An improvement on the histogram
• The histogram is a great idea
• But it is sensitive to the width of the bins
• And to their positions (midpoints)
• An improvement is the kernel density estimator
• It is explained in Chapter 10 of the book
• We will discuss that later
• And develop a further improvement
Comparing histogram and kernels
y = rnorm(200); b = seq(-4, 4, by = 0.2)
plot(density(y), col = 'red', lwd = 2)
lines(b, dnorm(b), col = 'blue', lwd = 2)
[Figures: density-scaled histogram of y (left) and the kernel density estimate, density.default(x = y), N = 200, Bandwidth = 0.278 (right), each with the normal density overlaid in blue]
Part II
How to generate random numbers
• The uniform distribution is always the starting point
• Sum 6 (or more) “uniforms” to approximate normal distribution
• This is of limited usefulness
• Apply a function: -log(runif(100)) gives exponential distribution
• Inverse transform: compute quantiles of desired distribution (if possible)
• Acceptance-rejection sampling
• Combinations of distributions
• Mixtures
Masters Track Statistical Sciences 2009 – Statistical Computing 23
Sums of uniform random variables
• The uniform distribution runs from 0 to 1
• Mean 0.5, variance 1/12
• Sum of n (independent) “uniforms” has mean n/2
• The variance is n/12
• Subtract n/2
• Divide by √(n/12)
• To get standard normal distribution
• This is an approximation, but a good one
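The recipe above can be sketched in a few lines; here I use 12 uniforms, a popular choice because the variance of the sum is then exactly 1 (the division is kept anyway, to follow the recipe).

```r
set.seed(123)
n = 12                              # uniforms per draw
m = 10000                           # number of simulated values
s = rowSums(matrix(runif(m * n), m, n))
z = (s - n / 2) / sqrt(n / 12)      # subtract the mean, divide by the sd
c(mean(z), var(z))                  # both close to the standard normal's 0 and 1
```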
Illustration of sums
Y = matrix(runif(10000 * 10), 10000, 10)
y = apply(Y, 1, sum)
plot(density(y), col = 'red', lwd = 2)
x = seq(0, 10, by = 0.1)
lines(x, dnorm(x, 10 * 0.5, sqrt(10 / 12)), col = 'blue', lwd = 2)
[Figure: kernel density estimate of the sums (red), density.default(x = y), N = 10000, Bandwidth = 0.1302, with the approximating normal density in blue]
Simple transformations
• Apply a function to uniformly distributed random numbers
• Example: y = -log(runif(n)) gives exponential distribution
• Sums of these lead to the gamma distribution
• Many such “tricks” exist
• See examples in the book
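A sketch of the gamma “trick”: summing k independent exponentials gives a gamma distribution with (integer) shape k.

```r
set.seed(123)
n = 10000
k = 3                                   # integer shape parameter
E = matrix(-log(runif(n * k)), n, k)    # exponential draws via -log(uniform)
g = rowSums(E)                          # gamma distributed, shape k, rate 1
c(mean(g), var(g))                      # theory: both equal to k
```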
Inverse transform or quantiles
• Remember the cumulative distribution: p = F (x)
• Where F(x) = ∫_{−∞}^{x} f(t) dt
• And f(x) is the probability density
• We have that 0 < p < 1
• Turn things around: x = F−1(p)
• Generate uniformly distributed p
• Compute x = F−1(p)
• Then x will have density f
• Of course, you need a procedure to invert F
• Example: qnorm(runif(n)) is equivalent to rnorm(n)
Illustration of the quantile approach
[Figure: the cumulative normal curve pnorm(x); two uniform draws p and p' on the vertical axis are mapped through the inverse to x and x' on the horizontal axis]
A special case: the exponential distribution
• Density: f(x) = e−x (for positive x)
• Cumulative distribution: F (x) = 1− e−x
• You can check this by integrating it yourself
• From p = 1− e−x follows x = − log(1− p)
• But if p has a uniform distribution, so does 1 − p
• Hence x = − log(p) has exponential distribution
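A quick numerical check of this shortcut: the empirical quantiles of −log(uniform) draws match the theoretical exponential quantiles.

```r
set.seed(123)
x = -log(runif(100000))            # inverse-transform exponential draws
quantile(x, c(0.5, 0.9))           # empirical quantiles
qexp(c(0.5, 0.9))                  # theory: log(2) = 0.69 and log(10) = 2.30
```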
Inverting the cumulative distribution
• You need a formula for F−1(p)
• In many cases an exact formula is not available
• But over the years very good approximations were found
• Try ?qnorm to see pointers to literature
• Luckily, R has many built-in procedures
Acceptance-rejection sampling
• Suppose you have a formula for f (x), the density
• But none for F (x), the cumulative distribution
• Acceptance sampling is a solution
• Select a domain in which x will fall (almost surely)
• Let x1 (x2) be left (right) boundary
• Let u = max f(x) (over this domain)
• Generate y uniformly distributed between x1 and x2
• Generate z uniformly distributed between 0 and u
• If z ≤ f (y), accept y
• If not, repeat
Illustration of acceptance-rejection sampling
xlo = -3
xhi = 3
x = seq(xlo, xhi, by = 0.1)
f = dnorm(x)
plot(x, f, lwd = 2, type = 'l', col = 'blue')
u = max(f)
n = 1000
y = runif(n) * (xhi - xlo) + xlo   # Candidates between xlo and xhi
z = runif(n) * u                   # Uniform heights between 0 and u
a = 1 * (z < dnorm(y))             # 1 = accept, 0 = reject
g = c('-', '+')
points(y, z, pch = g[a + 1])
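The program above only marks candidates for plotting; a minimal variant that actually keeps the accepted draws could look like this (the slight truncation of the tails at ±3 barely affects the result).

```r
set.seed(123)
xlo = -3; xhi = 3
u = dnorm(0)                     # maximum of the target density on this domain
n = 10000
y = runif(n, xlo, xhi)           # candidate positions
z = runif(n, 0, u)               # uniform heights under the maximum
x = y[z < dnorm(y)]              # accept candidates below the density curve
c(mean(x), sd(x))                # close to 0 and 1
```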
The result of acceptance sampling
[Figure: the normal density over (−3, 3) with 1000 uniform candidate points scattered below the maximum; accepted points are plotted as '+', rejected points as '-']
Combinations of distributions
• Example: t distribution
• Let s be the sum of n squared standard normal variates
• Then s has a χ2 distribution with n degrees of freedom
• Let x be one independent standard normal variate
• Then x/√(s/n) has a t distribution with n degrees of freedom
• A “gamma” divided by the sum of two independent “gammas” has a beta distribution
• Ratio of two independent “normals” has the Cauchy distribution
• Ratio of two independent “χ2’s”, each divided by its degrees of freedom, has the F distribution
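A sketch of the t construction for n = 5 degrees of freedom (note the division by n under the square root):

```r
set.seed(123)
m = 10000
n = 5                                        # degrees of freedom
x = rnorm(m)                                 # standard normals
s = rowSums(matrix(rnorm(m * n)^2, m, n))    # chi-square variates with n df
tv = x / sqrt(s / n)                         # t distributed with n df
quantile(tv, 0.975)                          # close to qt(0.975, 5) = 2.57
```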
Mixtures
• Mixtures are a very flexible tool
• Basic idea: probability p1 to draw from density f1
• Probability p2 to draw from density f2
• And so on
• Typical application: outlier simulation
• Two normals: f1(x) = N(x; µ, σ), f2(x) = N(x; µ, 10σ)
• Probabilities: p1 = 0.98, p2 = 0.02
• Chance of 2% to get a large outlier
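A sketch of the two-component outlier mixture described above:

```r
set.seed(123)
n = 10000
mu = 0; sigma = 1
wide = runif(n) < 0.02                  # 2% chance of the wide component
y = rnorm(n, mu, ifelse(wide, 10 * sigma, sigma))
mean(abs(y) > 4)                        # far more extreme values than a pure normal gives
```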
The use of random numbers
• You probably will not often aim for a specific distribution
• Instead you program a data generating procedure
• Mimicking a model you are working on
• You study the distribution of statistical summaries
• Educational example: ANOVA and the F distribution
• Simulate from the null hypothesis: 5 groups, 10 observations in each
• Compare empirical (n = 5000) distributions (χ2 and F ) with theory
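A sketch of this simulation, computing MSB/MSW directly (hand-coded sums of squares rather than a call to aov(), to stay close to the formulas):

```r
set.seed(123)
nsim = 5000; k = 5; nper = 10               # 5 groups, 10 observations each
g = gl(k, nper)                             # group labels
f = numeric(nsim)
for (i in 1:nsim) {
  y = rnorm(k * nper)                       # data under the null hypothesis
  gm = tapply(y, g, mean)                   # group means
  msb = nper * sum((gm - mean(y))^2) / (k - 1)
  msw = sum((y - gm[g])^2) / (k * (nper - 1))
  f[i] = msb / msw
}
mean(f > qf(0.95, k - 1, k * (nper - 1)))   # rejection rate close to 0.05
```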
ANOVA simulation: examples of simulated data
[Figure: four example data sets, each with five groups of ten observations, simulated under the null hypothesis]
ANOVA: simulated sums of squares and ratio
[Figure: 500 simulated values of the sum of squares within (SSW), the sum of squares between (SSB), and the ratio MSB/MSW, plotted against simulation index]
ANOVA: histograms and expected values
[Figure: histograms of SSW, SSB, and F = MSB/MSW, with their expected values marked]
ANOVA: histograms and theoretical densities
[Figure: histograms of SSW, SSB, and F = MSB/MSW, with the theoretical χ2 and F densities overlaid]
Reading hints
• Theoretical statistical background: Chapter 2
• Random number simulation: Sections 3.1-3.5
• Kernel density estimation: Section 10.2
• Histogram and qqplot: R on-line documentation