Statistical Computing
Session 4: Random Simulation
Paul Eilers & Dimitris Rizopoulos
Department of Biostatistics, Erasmus University Medical Center
Masters Track Statistical Sciences, 2010
The plan for today
• Short introduction: who am I?
• Random simulation, what is it good for?
• The R system of random number generation
• Tools for presentation and analysis of simulation output
• Simulation techniques
• Getting insight into classical procedures by simulation
Masters Track Statistical Sciences 2009 – Statistical Computing 1
Why simulation?
• Statistics is about random numbers
• We assume a model for generating the data
• The model contains parameters
• We estimate them from the data, using recipes
• Examples: regression, ANOVA, time series analysis, ...
• Crucial question: how good is a recipe?
• Classical (frequentist) statistics: confidence intervals
• Bayesian: combine prior and data to get posterior distribution
• Testing (mostly for zero values) is a special case
How to evaluate performance by analysis
• In some cases exact results can be obtained
• Especially for normally distributed data
• Examples: χ2, t and F distributions (ANOVA)
• You will encounter them in other courses here
• In the old days we used tables
• Nowadays they are built in as functions in statistical software
• It is always wise to use analytical results
How to evaluate performance by simulation
• Simulate data according to the model
• Apply the recipe
• Repeat this many times
• Study the distributions of parameter estimates
• Confidence intervals follow from them
• Probabilities of outcomes of tests can be determined
• Disadvantage: generalization can be hard
• What if I have more/less observations, larger variances, ...?
• You might have to simulate again for every case
The R system for generating random numbers
• For a given distribution you have four choices
• I use the normal distribution as an example
• Give me 100 draws from the standard normal: rnorm(100)
• The density over a vector x: dnorm(x)
• The cumulative distribution over a vector x: pnorm(x)
• The quantiles over a vector of probabilities p: qnorm(p)
• Let’s take a better look
Example: normal density
x = seq(-4, 4, by = 0.1)
y = dnorm(x)
plot(x, y, type = 'l', col = 'blue', lwd = 2)
[Figure: the standard normal density plotted over x from −4 to 4]
Example: cumulative normal distribution
x = seq(-4, 4, by = 0.1)
y = pnorm(x)
plot(x, y, type = 'l', col = 'blue', lwd = 2)
[Figure: the cumulative normal distribution, an S-shaped curve from 0 to 1, over x from −4 to 4]
Example: normal quantiles
p = seq(0.001, 0.999, by = 0.001)
y = qnorm(p)
plot(p, y, type = 'l', lwd = 2, col = 'blue')
[Figure: the normal quantile function qnorm(p) plotted over p from 0 to 1]
How to look at random numbers
• The result of rnorm() is a series of numbers
• There is little use in plotting them individually
• Instead we use the histogram
• We divide x-axis in a number of bins
• In each bin we count the number of observations
• We plot it as a bar chart
Example of a histogram
y = rnorm(200)
hist(y)
[Figure: histogram of y, 200 standard normal draws; counts against y]
Playing with the histogram
y = rnorm(200)
b = seq(-4, 4, by = 0.2) # Narrower bins
hist(y, breaks = b, col = 'gray', border = 'white')
[Figure: histogram of y with narrower bins, gray bars with white borders]
Adding the theoretical normal density
y = rnorm(200); b = seq(-4, 4, by = 0.2)
hist(y, breaks = b, col = 'gray', border = 'white', freq = FALSE)
lines(b, dnorm(b), col = 'blue', lwd = 2)
[Figure: density-scaled histogram of y with the theoretical normal density overlaid in blue]
A different result every time?
• Look at the histograms carefully
• They seem to be based on different numbers
• That’s indeed the case
• Every call to rnorm() gives a different result
• You might not like that (and you should not)
• It makes it hard to compare results
• Set the random “seed” to a starting value
• Like set.seed(123) or set.seed(911)
• Any integer number you like (see R help for details)
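A quick sketch of what the seed does: resetting the seed reproduces exactly the same stream of draws, while a further call without resetting simply continues the stream.

```r
set.seed(123)
a = rnorm(5)          # five draws
set.seed(123)
b = rnorm(5)          # the same seed repeats the same five draws
all(a == b)           # TRUE
rnorm(5)              # without resetting, the stream continues and differs
```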
The empirical cumulative distribution
n = 20; y = rnorm(n)
p = ((1:n) - 0.5) / n # Empirical probabilities
plot(sort(y), p, type = 's', col = 'red')
[Figure: the empirical cumulative distribution of 20 normal draws, drawn as a step function of sort(y)]
Adding the theoretical cumulative distribution
n = 40; y = rnorm(n); p = ((1:n) - 0.5) / n
plot(sort(y), p, type = 's', col = 'red')
x = seq(-3, 3, by = 0.1); lines(x, pnorm(x), col = 'blue', lwd = 2)
[Figure: the empirical cumulative distribution (red steps) with the theoretical normal cumulative distribution (blue curve) overlaid]
Randomness at work
• The fit to the theoretical distribution is not impressive
• This is generally the case for small samples
• Run these programs many times
• (Without set.seed(), of course)
• You will see randomness at work
• You will appreciate variation in small samples better
The Q-Q plot
• We have our sorted observations
• And the empirical cumulative probabilities
• For the latter we can compute normal quantiles (with qnorm())
• Plot one against the other
• We’ll program it ourselves
• That has some educative value, I hope
• The function qqnorm() is easier to use
• Add a line, based on mean and variance, with qqline()
• Of course, this only compares to the normal distribution
Q-Q plot from scratch
n = 40; y = rnorm(n)
p = ((1:n) - 0.5) / n
plot(qnorm(p), sort(y), col = 'red')
[Figure: hand-made Q-Q plot, sorted observations plotted against the normal quantiles qnorm(p)]
Q-Q plot the easy way
n = 40; y = rnorm(n)
qqnorm(y)
qqline(y, col = 'blue', lwd = 2)
[Figure: Normal Q-Q Plot produced by qqnorm(), sample quantiles against theoretical quantiles, with the qqline() in blue]
An improvement on the histogram
• The histogram is a great idea
• But it is sensitive to the width of the bins
• And to their positions (midpoints)
• An improvement is the kernel density estimator
• It is explained in Chapter 10 of the book
• We will discuss that later
• And develop a further improvement
Comparing histogram and kernels
y = rnorm(200); b = seq(-4, 4, by = 0.2)
plot(density(y), col = 'red', lwd = 2)
lines(b, dnorm(b), col = 'blue', lwd = 2)
[Figures: density-scaled histogram of y (left) and the kernel density estimate, density.default(x = y), N = 200, Bandwidth = 0.278 (right), each with the normal density overlaid in blue]
Part II
How to generate random numbers
• The uniform distribution is always the starting point
• Sum 6 (or more) “uniforms” to approximate normal distribution
• This is of limited usefulness
• Apply a function: -log(runif(100)) gives exponential distribution
• Inverse transform: compute quantiles of desired distribution (if possible)
• Acceptance-rejection sampling
• Combinations of distributions
• Mixtures
Masters Track Statistical Sciences 2009 – Statistical Computing 23
Sums of uniform random variables
• The uniform distribution runs from 0 to 1
• Mean 0.5, variance 1/12
• Sum of n (independent) “uniforms” has mean n/2
• The variance is n/12
• Subtract n/2
• Divide by √(n/12)
• To get standard normal distribution
• This is an approximation, but a good one
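The recipe above can be sketched in a few lines; here I use 12 uniforms, a popular choice because the variance of the sum is then exactly 1 (the division is kept anyway, to follow the recipe).

```r
set.seed(123)
n = 12                              # uniforms per draw
m = 10000                           # number of simulated values
s = rowSums(matrix(runif(m * n), m, n))
z = (s - n / 2) / sqrt(n / 12)      # subtract the mean, divide by the sd
c(mean(z), var(z))                  # both close to the standard normal's 0 and 1
```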
Illustration of sums
Y = matrix(runif(10000 * 10), 10000, 10)
y = apply(Y, 1, sum)
plot(density(y), col = 'red', lwd = 2)
x = seq(0, 10, by = 0.1)
lines(x, dnorm(x, 10 * 0.5, sqrt(10 / 12)), col = 'blue', lwd = 2)
[Figure: kernel density estimate of the sums (red), density.default(x = y), N = 10000, Bandwidth = 0.1302, with the approximating normal density in blue]
Simple transformations
• Apply a function to uniformly distributed random numbers
• Example: y = -log(runif(n)) gives exponential distribution
• Sums of these lead to the gamma distribution
• Many such “tricks” exist
• See examples in the book
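A sketch of the gamma “trick”: summing k independent exponentials gives a gamma distribution with (integer) shape k.

```r
set.seed(123)
n = 10000
k = 3                                   # integer shape parameter
E = matrix(-log(runif(n * k)), n, k)    # exponential draws via -log(uniform)
g = rowSums(E)                          # gamma distributed, shape k, rate 1
c(mean(g), var(g))                      # theory: both equal to k
```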
Inverse transform or quantiles
• Remember the cumulative distribution: p = F (x)
• Where F(x) = ∫_{−∞}^{x} f(t) dt
• And f(x) is the probability density
• We have that 0 < p < 1
• Turn things around: x = F−1(p)
• Generate uniformly distributed p
• Compute x = F−1(p)
• Then x will have density f
• Of course, you need a procedure to invert F
• Example: qnorm(runif(n)) is equivalent to rnorm(n)
Illustration of the quantile approach
[Figure: the cumulative normal curve pnorm(x); two uniform draws p and p' on the vertical axis are mapped through the inverse to x and x' on the horizontal axis]
A special case: the exponential distribution
• Density: f(x) = e−x (for positive x)
• Cumulative distribution: F (x) = 1− e−x
• You can check this by integrating it yourself
• From p = 1− e−x follows x = − log(1− p)
• But if p has a uniform distribution, so does 1 − p
• Hence x = − log(p) has exponential distribution
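A quick numerical check of this shortcut: the empirical quantiles of −log(uniform) draws match the theoretical exponential quantiles.

```r
set.seed(123)
x = -log(runif(100000))            # inverse-transform exponential draws
quantile(x, c(0.5, 0.9))           # empirical quantiles
qexp(c(0.5, 0.9))                  # theory: log(2) = 0.69 and log(10) = 2.30
```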
Inverting the cumulative distribution
• You need a formula for F−1(p)
• In many cases an exact formula is not available
• But over the years very good approximations were found
• Try ?qnorm to see pointers to literature
• Luckily, R has many built-in procedures
Acceptance-rejection sampling
• Suppose you have a formula for f (x), the density
• But none for F (x), the cumulative distribution
• Acceptance sampling is a solution
• Select a domain in which x will fall (almost surely)
• Let x1 (x2) be left (right) boundary
• Let u = max f(x) (over this domain)
• Generate y uniformly distributed between x1 and x2
• Generate z uniformly distributed between 0 and u
• If z ≤ f (y), accept y
• If not, repeat
Illustration of acceptance-rejection sampling
xlo = -3
xhi = 3
x = seq(xlo, xhi, by = 0.1)
f = dnorm(x)
plot(x, f, lwd = 2, type = 'l', col = 'blue')
u = max(f)
n = 1000
y = runif(n) * (xhi - xlo) + xlo   # Candidates between xlo and xhi
z = runif(n) * u                   # Uniform heights between 0 and u
a = 1 * (z < dnorm(y))             # 1 = accept, 0 = reject
g = c('-', '+')
points(y, z, pch = g[a + 1])
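The program above only marks candidates for plotting; a minimal variant that actually keeps the accepted draws could look like this (the slight truncation of the tails at ±3 barely affects the result).

```r
set.seed(123)
xlo = -3; xhi = 3
u = dnorm(0)                     # maximum of the target density on this domain
n = 10000
y = runif(n, xlo, xhi)           # candidate positions
z = runif(n, 0, u)               # uniform heights under the maximum
x = y[z < dnorm(y)]              # accept candidates below the density curve
c(mean(x), sd(x))                # close to 0 and 1
```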
The result of acceptance sampling
[Figure: the normal density over (−3, 3) with 1000 uniform candidate points scattered below the maximum; accepted points are plotted as '+', rejected points as '-']
Combinations of distributions
• Example: t distribution
• Let s be the sum of n squared standard normal variates
• Then s has a χ2 distribution with n degrees of freedom
• Let x be one independent standard normal variate
• Then x/√(s/n) has a t distribution with n degrees of freedom
• A “gamma” divided by the sum of two independent “gammas” has a beta distribution
• Ratio of two independent “normals” has the Cauchy distribution
• Ratio of two independent “χ2’s”, each divided by its degrees of freedom, has the F distribution
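A sketch of the t construction for n = 5 degrees of freedom (note the division by n under the square root):

```r
set.seed(123)
m = 10000
n = 5                                        # degrees of freedom
x = rnorm(m)                                 # standard normals
s = rowSums(matrix(rnorm(m * n)^2, m, n))    # chi-square variates with n df
tv = x / sqrt(s / n)                         # t distributed with n df
quantile(tv, 0.975)                          # close to qt(0.975, 5) = 2.57
```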
Mixtures
• Mixtures are a very flexible tool
• Basic idea: probability p1 to draw from density f1
• Probability p2 to draw from density f2
• And so on
• Typical application: outlier simulation
• Two normals: f1(x) = N(x; µ, σ), f2(x) = N(x; µ, 10σ)
• Probabilities: p1 = 0.98, p2 = 0.02
• Chance of 2% to get a large outlier
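A sketch of the two-component outlier mixture described above:

```r
set.seed(123)
n = 10000
mu = 0; sigma = 1
wide = runif(n) < 0.02                  # 2% chance of the wide component
y = rnorm(n, mu, ifelse(wide, 10 * sigma, sigma))
mean(abs(y) > 4)                        # far more extreme values than a pure normal gives
```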
The use of random numbers
• You probably will not often aim for a specific distribution
• Instead you program a data generating procedure
• Mimicking a model you are working on
• You study the distribution of statistical summaries
• Educational example: ANOVA and the F distribution
• Simulate from the null hypothesis: 5 groups, 10 observations in each
• Compare empirical (n = 5000) distributions (χ2 and F ) with theory
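A sketch of this simulation, computing MSB/MSW directly (hand-coded sums of squares rather than a call to aov(), to stay close to the formulas):

```r
set.seed(123)
nsim = 5000; k = 5; nper = 10               # 5 groups, 10 observations each
g = gl(k, nper)                             # group labels
f = numeric(nsim)
for (i in 1:nsim) {
  y = rnorm(k * nper)                       # data under the null hypothesis
  gm = tapply(y, g, mean)                   # group means
  msb = nper * sum((gm - mean(y))^2) / (k - 1)
  msw = sum((y - gm[g])^2) / (k * (nper - 1))
  f[i] = msb / msw
}
mean(f > qf(0.95, k - 1, k * (nper - 1)))   # rejection rate close to 0.05
```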
ANOVA simulation: examples of simulated data
[Figure: four example data sets, each with five groups of ten observations, simulated under the null hypothesis]
ANOVA: simulated sums of squares and ratio
[Figure: 500 simulated values of the sum of squares within (SSW), the sum of squares between (SSB), and the ratio MSB/MSW, plotted against simulation index]
ANOVA: histograms and expected values
[Figure: histograms of SSW, SSB, and F = MSB/MSW, with their expected values marked]
ANOVA: histograms and theoretical densities
[Figure: histograms of SSW, SSB, and F = MSB/MSW, with the theoretical χ2 and F densities overlaid]
Reading hints
• Theoretical statistical background: Chapter 2
• Random number simulation: Sections 3.1-3.5
• Kernel density estimation: Section 10.2
• Histogram and qqplot: R on-line documentation