
Statistics 580

Monte Carlo Methods

Introduction

Statisticians have used simulation to evaluate the behaviour of complex random variables whose precise distribution cannot be evaluated exactly theoretically. For example, suppose that a statistic $T$ based on a sample $x_1, x_2, \ldots, x_n$ has been formulated for testing a certain hypothesis; the test procedure is to reject $H_0$ if $T(x_1, x_2, \ldots, x_n) \ge c$. Since the exact distribution of $T$ is unknown, the value of $c$ has been determined such that the asymptotic type I error rate is $\alpha = .05$ (say), i.e., $\Pr(T \ge c \mid H_0 \text{ is true}) = .05$ as $n \to \infty$. We can study the actual small-sample behavior of $T$ by the following procedure:

a. generate $x_1, x_2, \ldots, x_n$ from the appropriate distribution, say, the Normal distribution, and compute $T_j = T(x_1, x_2, \ldots, x_n)$,

b. repeat $N$ times, yielding a sample $T_1, T_2, \ldots, T_N$, and

c. compute the proportion of times $T_j \ge c$ as an estimate of the error rate, i.e.,
$$\hat{\alpha}(N) = \frac{1}{N} \sum_{j=1}^{N} I(T_j \ge c) = \frac{\#(T_j \ge c)}{N},$$
where $I(\cdot)$ is the indicator function.

d. Note that $\hat{\alpha}(N) \xrightarrow{a.s.} \alpha$, the type I error rate. (A sketch implementing these steps follows.)
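A minimal Python sketch of steps (a)-(d), with the one-sample $t$ statistic standing in for the generic $T$ of the notes; the statistic, critical value, and sample sizes are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 10, 100_000
c = 1.645   # asymptotic N(0,1) critical value giving alpha = .05 as n -> infinity

# (a), (b): N samples of size n generated under H0, and T computed on each;
# T is taken (for illustration) to be the one-sample t statistic for H0: mu = 0
x = rng.normal(0.0, 1.0, size=(N, n))
Tj = np.sqrt(n) * x.mean(axis=1) / x.std(axis=1, ddof=1)

# (c): the proportion of T_j >= c estimates the actual small-sample error rate
alpha_hat = np.mean(Tj >= c)
print(alpha_hat)   # (d): converges a.s. to the true rate; exceeds .05 for n = 10
```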

Results of such a study may provide evidence to support the asymptotic theory even for small samples. Similarly, we may study the power of the test under different alternatives using Monte Carlo sampling. The study of Monte Carlo methods involves learning about

(i) methods available to perform step (a) above correctly and efficiently, and

(ii) how to perform step (c) above efficiently, by incorporating "variance reduction" techniques to obtain more accurate estimates.

In 1908, Monte Carlo sampling was used by W. S. Gossett, who published under the pseudonym "Student", to approximate the distribution of $z = \sqrt{n}(\bar{x} - \mu)/s$, where $(x_1, x_2, \ldots, x_n)$ is a random sample from a Normal distribution with mean $\mu$ and variance $\sigma^2$ and $s^2$ is the sample variance. He writes:

"Before I had succeeded in solving my problem analytically, I had endeavoured to do so empirically [i.e., by simulation]. The material used was a ... table containing the height and left middle finger measurements of 3000 criminals.... The measurements were written out on 3000 pieces of cardboard, which were then very thoroughly shuffled and drawn at random ... each consecutive set of 4 was taken as a sample ... [i.e., n above is = 4] ... and the mean [and] standard deviation of each sample determined.... This provides us with two sets of ... 750 z's on which to test the theoretical results arrived at. The height and left middle finger ... table was chosen because the distribution of both was approximately normal..."


The object of many simulation experiments in statistics is the estimation of an expectation of the form $E[g(X)]$ where $X$ is a random vector. Suppose that $f(x)$ is the joint density of $X$. Then
$$\theta = E[g(X)] = \int g(x) f(x)\, dx.$$
Thus, virtually any Monte Carlo procedure may be characterized as a problem of evaluating an integral (a multidimensional one, as in this case). For example, if $T(x)$ is a statistic based on a random sample $x$, computing the following quantities involving functions of $T$ may be of interest:

the mean,
$$\theta_1 = \int T(x) f(x)\, dx,$$

the tail probability,
$$\theta_2 = \int I(T(x) \le c) f(x)\, dx,$$

or the variance,
$$\theta_3 = \int (T(x) - \theta_1)^2 f(x)\, dx = \int T(x)^2 f(x)\, dx - \theta_1^2,$$

where $I(\cdot)$ is the indicator function.

Crude Monte Carlo

As outlined in the introduction, applications of simulation in Statistics involve the computation of an integral of the form $\int g(x) f(x)\, dx$, where $g(x)$ is a function of a statistic computed on a random sample $x$. A simple estimate of the above integral can be obtained by using the following simulation algorithm:

1. Generate $X_1, X_2, \ldots, X_N$.

2. Compute $g(X_1), g(X_2), \ldots, g(X_N)$.

3. Estimate $\hat{\theta}(N) = \frac{1}{N} \sum_{j=1}^{N} g(X_j)$.

Here $N$ is the simulation sample size, and note that

• $\hat{\theta}(N) \xrightarrow{a.s.} \theta$;

• if $\text{var}(g(X)) = \sigma^2$, the usual estimate of $\sigma^2$ is
$$s^2 = \frac{\sum_{j=1}^{N} (g(X_j) - \hat{\theta})^2}{N - 1};$$

• thus $S.E.(\hat{\theta}) = s/\sqrt{N}$, and for large $N$ the Central Limit Theorem could be used to construct approximate confidence bounds for $\theta$:
$$\left[\hat{\theta}(N) - z_{\alpha/2} \frac{s}{\sqrt{N}},\; \hat{\theta}(N) + z_{\alpha/2} \frac{s}{\sqrt{N}}\right];$$

• clearly, the error in $\hat{\theta}$ is proportional to $1/\sqrt{N}$ (i.e., the error in $\hat{\theta}$ is $O(N^{-1/2})$), and it also depends on the value of $s^2$. (A numerical sketch follows.)
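The following sketch carries out this algorithm together with the confidence bound computation; the particular choices $g(x) = x^2$ and $X \sim N(0,1)$ (so $\theta = 1$) are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 10_000
g = lambda x: x**2

gx = g(rng.normal(size=N))     # g(X_1), ..., g(X_N)
theta_hat = gx.mean()          # (1/N) * sum of g(X_j)
s = gx.std(ddof=1)             # estimate of sigma = sd(g(X))
se = s / np.sqrt(N)            # S.E.(theta_hat) = s / sqrt(N)
z = 1.96                       # z_{alpha/2} for an approximate 95% interval
print(theta_hat, (theta_hat - z * se, theta_hat + z * se))
```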


Efficiency of Monte Carlo Simulation

Suppose as usual that it is required to estimate $\theta = E[g(X)]$. Then the standard simulation algorithm as detailed previously gives the estimate
$$\hat{\theta}(N) = \frac{1}{N} \sum_{j=1}^{N} g(X_j)$$
with standard error $s(\hat{\theta}) = s/\sqrt{N}$. One way to measure the quality of the estimator $\hat{\theta}(N)$ is by the half-width, HW, of the confidence interval for $\theta$. For a fixed $\alpha$ it is given by
$$HW = z_{\alpha/2} \sqrt{\frac{Var(g(X))}{N}}.$$
For increasing the precision of the estimate, HW needs to be made small, but sometimes this is difficult to achieve. This may be because $Var(g(X))$ is too large, or too much computational effort is required to simulate each $g(X_j)$ so that the simulation sample size $N$ cannot be too large, or both. If the effort required to produce one value of $g(X_j)$ is $E$, the total effort for the estimation of $\theta$ by a Monte Carlo experiment is $N \times E$. Since achieving a precision of HW requires a sample size of at least
$$N \ge \left(\frac{z_{\alpha/2}}{HW}\right)^2 Var(g(X)),$$
the quantity $Var(g(X)) \times E$ is proportional to the total effort required to achieve a fixed precision. This quantity may be used to compare the efficiency of different Monte Carlo methods. When the $E$'s are similar for two methods, the method with the lower variance of the estimator is considered superior. As a result, statistical techniques are often employed to increase simulation efficiency by decreasing the variability of the simulation output used to estimate $\theta$. The techniques used to do this are usually called variance reduction techniques.

Variance Reduction Techniques

Monte Carlo methods attempt to reduce $S.E.(\hat{\theta})$, that is, to obtain more accurate estimates without increasing $N$. These techniques are called variance reduction techniques. When considering variance reduction techniques, one must take into account the cost (in terms of difficulty level) of implementing such a procedure and weigh it against the gain in precision of estimates from incorporating such a procedure.

Before proceeding to a detailed description of available methods, a brief example is used to illustrate what is meant by variance reduction. Suppose the parameter to be estimated is an upper percentage point of the standard Cauchy distribution:
$$\theta = \Pr(X > 2)$$
where
$$f(x) = \frac{1}{\pi(1 + x^2)}, \quad -\infty < x < \infty,$$


is the Cauchy density. To estimate $\theta$, let $g(x) = I(x > 2)$, and express $\theta = \Pr(X > 2)$ as the integral $\int_{-\infty}^{\infty} I(x > 2) f(x)\, dx$. Note that $x$ is a scalar here. Using crude Monte Carlo, $\theta$ is estimated by generating $N$ Cauchy variates $x_1, x_2, \ldots, x_N$ and computing the proportion that exceeds 2, i.e.,
$$\hat{\theta} = \frac{\sum_{j=1}^{N} I(x_j > 2)}{N} = \frac{\#(x_j > 2)}{N}.$$
Note that since $N\hat{\theta} \sim \text{Binomial}(N, \theta)$, the variance of $\hat{\theta}$ is given by $\theta(1 - \theta)/N$. Since $\theta$ can be directly computed by analytical integration, i.e.,
$$\theta = 1 - F(2) = \frac{1}{2} - \pi^{-1} \tan^{-1} 2 \approx .1476,$$
the variance of the crude Monte Carlo estimate $\hat{\theta}$ is given approximately as $0.126/N$. In order to construct a more efficient Monte Carlo method, note that for $y = 2/x$,
$$\theta = \int_{2}^{\infty} \frac{1}{\pi(1 + x^2)}\, dx = \int_{0}^{1} \frac{2}{\pi(4 + y^2)}\, dy = \int_{0}^{1} g(y)\, dy.$$
This particular integral is in a form where it can be evaluated using crude Monte Carlo by sampling from the uniform distribution $U(0, 1)$. If $u_1, u_2, \ldots, u_N$ are from $U(0, 1)$, an estimate of $\theta$ is
$$\hat{\theta} = \frac{1}{N} \sum_{j=1}^{N} g(u_j).$$
For comparison with the estimate of $\theta$ obtained from crude Monte Carlo, the variance of $\hat{\theta}$ can again be computed by evaluating the integral
$$\text{var}(\hat{\theta}) = \frac{1}{N} \int \{g(x) - \theta\}^2 f(x)\, dx$$
analytically, where $f(x)$ is the density of $U(0, 1)$. This gives $\text{var}(\hat{\theta}) \approx 9.4 \times 10^{-5}/N$, a variance reduction of about 1340.

In the following sections, several methods available for variance reduction will be discussed. Among these are

• stratified sampling

• control variates

• importance sampling

• antithetic variates

• conditioning swindles


Stratified Sampling

Consider again the estimation of
$$\theta = E[g(X)] = \int g(x) f(x)\, dx.$$
For stratified sampling, we first partition the domain of integration $S$ into $m$ disjoint subsets $S_i$, $i = 1, 2, \ldots, m$. That is, $S_i \cap S_j = \emptyset$ for all $i \neq j$ and $\cup_{i=1}^{m} S_i = S$. Define $\theta_i$ as
$$\theta_i = \int_{S_i} g(x) f(x)\, dx$$
for $i = 1, 2, \ldots, m$. Then
$$\theta = \int_{S} g(x) f(x)\, dx = \sum_{i=1}^{m} \int_{S_i} g(x) f(x)\, dx = \sum_{i=1}^{m} \theta_i.$$
The motivation behind this method is to be able to sample more from regions of $S$ that contribute more variability to the estimator of $\theta$, rather than spread the samples evenly across the whole region $S$. This idea is the reason why stratified sampling is often used as a general sampling technique; for example, instead of random sampling from the entire state, one might use stratified sampling based on subregions such as counties or districts, sampling more from those regions that contribute more information about the quantity being estimated.

Define
$$p_i = \int_{S_i} f(x)\, dx$$
and
$$f_i(x) = \frac{1}{p_i} f(x).$$
Then
$$\sum_{i=1}^{m} p_i = 1 \quad \text{and} \quad \int_{S_i} f_i(x)\, dx = 1.$$
By expressing
$$g_i(x) = \begin{cases} g(x), & \text{if } x \in S_i \\ 0, & \text{otherwise,} \end{cases}$$
we can show that
$$\theta_i = \int_{S_i} p_i\, g(x) \frac{f(x)}{p_i}\, dx = p_i \int_{S} g_i(x) f_i(x)\, dx,$$
and the crude Monte Carlo estimator of $\theta_i$ is then
$$\hat{\theta}_i = p_i\, \frac{\sum_{j=1}^{N_i} g_i(x_j)}{N_i}


where $x_1, \ldots, x_{N_i}$ is a random sample from the distribution with density $f_i(x)$ on $S_i$, with variance given by
$$Var(\hat{\theta}_i) = \frac{p_i^2 \sigma_i^2}{N_i},$$
where $\sigma_i^2 = Var(g(X) \mid X \in S_i)$. Thus the combined estimator of $\theta$ is
$$\hat{\theta} = \sum_{i=1}^{m} \hat{\theta}_i = \sum_{i=1}^{m} \frac{p_i}{N_i} \sum_{j=1}^{N_i} g_i(x_j),$$
with variance given by
$$Var(\hat{\theta}) = \sum_{i=1}^{m} \frac{p_i^2 \sigma_i^2}{N_i}.$$

For the best stratification, we need to choose the $N_i$ optimally, i.e., by minimizing $Var(\hat{\theta})$ with respect to $N_i$, $i = 1, 2, \ldots, m$. That is, minimize the Lagrangian
$$\sum_{i=1}^{m} \frac{p_i^2 \sigma_i^2}{N_i} + \lambda\left(N - \sum_{i=1}^{m} N_i\right).$$
Equating partial derivatives with respect to $N_i$ to zero shows that the optimum occurs when
$$N_i = \frac{p_i \sigma_i}{\sum_{i=1}^{m} p_i \sigma_i}\, N$$
for $N = \sum_{i=1}^{m} N_i$. Thus, after the strata are chosen, the sample sizes $N_i$ should be selected proportional to $p_i \sigma_i$. Of course, since the $\sigma_i$ are unknown, they need to be estimated by some method, such as from past data or, in the absence of any information, by conducting a pilot study. Small samples of points are taken from each $S_i$, and the variances are estimated from these samples. This allows the sample sizes for the stratified sample to be allocated in such a way that more points are sampled from those strata where $g(x)$ shows the most variation.

Example: Evaluate $I = \int_{0}^{1} e^{-x^2}\, dx$ using stratified sampling.

Since in this example $f(x)$ is the $U(0, 1)$ density, we will use 4 equispaced partitions of the interval $(0, 1)$, giving $p_1 = p_2 = p_3 = p_4 = .25$. To conduct a pilot study to determine the strata sample sizes, we first estimated the strata variances of $g(x)$,
$$\sigma_i^2 = \text{var}(g(x) \mid x \in S_i) \quad \text{for } i = 1, 2, 3, 4,$$
by a simulation experiment. We generated 10000 $U(0, 1)$ variables $u$ and computed the standard deviation of $e^{-u^2}$ for the $u$'s falling in each of the intervals $(0, .25)$, $(.25, .5)$, $(.5, .75)$, and $(.75, 1)$.


[Figure: plot of $g(x) = e^{-x^2}$ over $x \in (0, 1)$.]

These came out to be $\hat{\sigma}_i = .018, .046, .061,$ and $.058$. Since $p_1 = p_2 = p_3 = p_4$ and $N_i \propto \sigma_i / \sum \sigma_i$, we get $\sigma_i / \sum \sigma_i = .0984, .2514, .3333,$ and $.3169$, implying that for $N = 10000$,
$$N_1 = 984,\quad N_2 = 2514,\quad N_3 = 3333,\quad N_4 = 3169.$$
We will round these and use
$$N_1 = 1000,\quad N_2 = 2500,\quad N_3 = 3500,\quad N_4 = 3000$$
as the sample sizes of our four strata.

A Monte Carlo experiment using stratified sampling for obtaining an estimate of $I$ and its standard error can now be carried out as follows. For each stratum $(m_i, m_{i+1})$, where $m_1 = 0$, $m_2 = .25$, $m_3 = .50$, $m_4 = .75$, and $m_5 = 1.0$, generate $N_i$ numbers $u_j \sim U(m_i, m_{i+1})$ for $j = 1, \ldots, N_i$ and compute $x_j = e^{-u_j^2}$ for the $N_i$ $u_j$'s. Then compute the estimates
$$\hat{\theta}_i = \sum x_j / N_i, \qquad \hat{\sigma}_i^2 = \sum (x_j - \bar{x})^2/(N_i - 1)$$
for $i = 1, 2, 3, 4$. Pool these estimates to obtain $\hat{\theta} = \sum \hat{\theta}_i / 4$ and $\widehat{\text{var}}(\hat{\theta}) = \frac{\sum \hat{\sigma}_i^2 / N_i}{4^2}$. We obtained
$$\hat{\theta} = 0.7470707 \quad \text{and} \quad \widehat{\text{var}}(\hat{\theta}) = 2.132326 \times 10^{-7}.$$
A crude Monte Carlo estimate of $I$ is obtained by ignoring strata and using the 10000 $U(0, 1)$ variables to compute the mean and variance of $e^{-u_j^2}$. We obtained
$$\hat{\theta} = 0.7462886 \quad \text{and} \quad \widehat{\text{var}}(\hat{\theta}) = 4.040211 \times 10^{-6}.$$
Thus the variance reduction is $4.040211 \times 10^{-6} / 2.132326 \times 10^{-7} \approx 19$.
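The stratified computation can be sketched in Python as follows, using the rounded allocation from the text; the numerical output will differ slightly from the figures quoted above since the random streams differ.

```python
import numpy as np

rng = np.random.default_rng(3)
g = lambda x: np.exp(-x**2)
edges = [0.0, 0.25, 0.50, 0.75, 1.0]       # stratum boundaries m_1, ..., m_5
Ni = [1000, 2500, 3500, 3000]              # rounded pilot-study allocation
p = 0.25                                   # each stratum has probability 1/4

theta_i, var_i = [], []
for (lo, hi), n in zip(zip(edges, edges[1:]), Ni):
    xj = g(rng.uniform(lo, hi, size=n))    # U(m_i, m_{i+1}) draws
    theta_i.append(xj.mean())
    var_i.append(xj.var(ddof=1) / n)

theta = p * np.sum(theta_i)                # sum of p_i * theta_i_hat
var = p**2 * np.sum(var_i)                 # sum of p_i^2 * sigma_i^2 / N_i
print(theta, var)                          # compare with the values quoted above
```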


Control Variates

Suppose it is required to estimate $\theta = E(g(X)) = \int g(x) f(x)\, dx$ where $f(x)$ is a joint density as before. Let $h(x)$ be a function "similar" to $g(x)$ such that $E(h(X)) = \int h(x) f(x)\, dx = \tau$ is known. Then $\theta$ can be written as
$$\theta = \int [g(x) - h(x)] f(x)\, dx + \tau = E[g(X) - h(X)] + \tau,$$
and the crude Monte Carlo estimator of $\theta$ now becomes
$$\hat{\theta} = \frac{1}{N} \sum_{j=1}^{N} [g(x_j) - h(x_j)] + \tau,$$
with
$$\text{var}(\hat{\theta}) = \frac{\text{var}(g(X)) + \text{var}(h(X)) - 2\,\text{cov}(g(X), h(X))}{N}.$$
If $\text{var}(h(X)) \approx \text{var}(g(X))$, which is plausible because $h$ mimics $g$, then there is a reduction in variance if
$$\text{corr}(g(X), h(X)) > \frac{1}{2}.$$
In general, $\theta$ can be written as
$$\theta = E[g(X) - \beta\{h(X) - \tau\}],$$
in which case the crude Monte Carlo estimator of $\theta$ is
$$\hat{\theta} = \frac{1}{N} \sum_{j=1}^{N} \left[g(x_j) - \beta\{h(x_j) - \tau\}\right].$$

As a trivial example, consider a Monte Carlo study to estimate $\theta = E(\text{median}(x_1, x_2, \ldots, x_n))$ where $x = (x_1, x_2, \ldots, x_n)'$ are i.i.d. random variables from some distribution. If $\mu = E(x)$ is known, then it might be a good idea to use $\bar{x}$ as a control variate. Crude Monte Carlo is performed on the difference $\{\text{median}(x) - \bar{x}\}$ rather than on $\text{median}(x)$, giving the estimator
$$\hat{\theta} = \frac{1}{N} \sum \{\text{median}(x) - \bar{x}\} + \mu$$
with
$$\text{var}(\hat{\theta}) = \frac{\text{var}\{\text{median}(x) - \bar{x}\}}{N}.$$
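A sketch of this control-variate estimator; the choice $X_i \sim N(0,1)$ (so that $\mu = 0$) and the sizes $n$, $N$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n, N, mu = 15, 10_000, 0.0

x = rng.normal(size=(N, n))
med = np.median(x, axis=1)
xbar = x.mean(axis=1)

theta_crude = med.mean()
theta_cv = (med - xbar).mean() + mu   # average of {median(x) - xbar} plus mu

# variance comparison: var(median - xbar) is much smaller than var(median)
print(med.var(ddof=1) / N, (med - xbar).var(ddof=1) / N)
```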

The basic problem in using the method is identifying a suitable variate to use as a control variable. Generally, if there is more than one candidate for a control variate, a possible way to combine them is to find an appropriate linear combination of them. Let $Z = g(x)$ and let $W_i = h_i(x)$ for $i = 1, \ldots, p$ be the possible control variates, where as above it is required to estimate $\theta = E(Z)$ and the $E(W_i) = \tau_i$ are known. Then $\theta$ can be expressed as
$$\theta = E[g(x) - \{\beta_1(h_1(x) - \tau_1) + \cdots + \beta_p(h_p(x) - \tau_p)\}] = E[Z - \beta_1(W_1 - \tau_1) - \cdots - \beta_p(W_p - \tau_p)].$$

This is a generalization of the application with a single control variate given above; i.e., if $p = 1$ then
$$\theta = E[Z - \beta(W - \tau)].$$
An unbiased estimator of $\theta$ in this case is obtained by averaging observations of $Z - \beta(W - \tau)$ as before, and $\beta$ is chosen to minimize $\text{var}(\hat{\theta})$. This amounts to regressing observations of $Z$ on the observations of $W$. In the general case, this leads to multiple regression of $Z$ on $W_1, \ldots, W_p$. Note, however, that the $W_i$'s are random, and therefore the estimators of $\theta$ and $\text{var}(\hat{\theta})$ obtained by standard regression methods may not be unbiased. However, if the estimates $\hat{\beta}_i$ are found in a preliminary experiment and then held fixed, unbiased estimators can be obtained.

To illustrate this technique, again consider the example of estimating $\theta = P(X > 2) = \frac{1}{2} - \int_{0}^{2} f(x)\, dx$ where $X \sim$ Cauchy. Expanding $f(x) = 1/\pi(1 + x^2)$ suggests the control variates $x^2$ and $x^4$, i.e., $h_1(x) = x^2$ and $h_2(x) = x^4$, whose integrals over $(0, 2)$ are known: $\int_0^2 x^2\, dx = 8/3$ and $\int_0^2 x^4\, dx = 32/5$. A small regression experiment based on samples drawn from $U(0, 2)$ gave
$$\hat{\theta} = \frac{1}{2} - \left[\overline{f(x)} + 0.15\left(\overline{x^2} - 8/3\right) - 0.025\left(\overline{x^4} - 32/5\right)\right],$$
where each bar denotes the Monte Carlo estimate of the corresponding integral over $(0, 2)$ based on the $U(0, 2)$ sample, with $\text{var}(\hat{\theta}) \approx 6.3 \times 10^{-4}/N$.

Koehler and Larntz (1980) consider the likelihood ratio (LR) statistic for testing

with var(θ) ≈ 6.3 × 10−4/N.Koehler and Larntz (1980) consider the likelihood ratio (LR) statistic for testing of

goodness-of-fit in multinomials:

H0 : pi = pi0, i = 1, . . . , k

where (X1, ..., Xk) ∼ mult(n,p), where p = (p1, ..., pk)′. The chi-squared statistic is given by

χ2 =k∑

j=1

(xj − npj0)2

npj0

,

and the LR statistic by

G2 = 2k∑

j=1

xj logxj

npj0

.

The problem is to estimate θ = E(G2) by Monte Carlo sampling and they consider usingthe χ2 statistic as a control variate for G2 as it is known that E(χ2) = k − 1 and that G2

and χ2 are often highly correlated. Then θ = E(G2 − χ2) + (k − 1) and the Monte Carloestimator is given by

θ =1

N

∑{G2 − χ2} + k − 1

and an estimate of var(θ) is given by

var(θ) =

var{G2 − χ2}N

.

Now, since in general, var(G2) and var(χ2) are not necessarily similar in magnitude, animproved estimator may be obtained by considering


$$\theta = E[G^2 - \beta(\chi^2 - (k - 1))],$$
and the estimator is
$$\hat{\theta}_{\hat{\beta}} = \frac{1}{N} \sum \{G^2 - \hat{\beta}\chi^2\} + \hat{\beta}(k - 1),$$
where $\hat{\beta}$ is an estimate of $\beta$ obtained through an independent experiment. To do this, compute simulated values of the $G^2$ and $\chi^2$ statistics from a number of independent samples (say, 100) obtained under the null hypothesis. Then regress the $G^2$ values on the $\chi^2$ values to obtain $\hat{\beta}$. An estimate of $\text{var}(\hat{\theta}_{\hat{\beta}})$ is given by
$$\widehat{\text{var}}(\hat{\theta}_{\hat{\beta}}) = \frac{\widehat{\text{var}}\{G^2 - \hat{\beta}\chi^2\}}{N}.$$
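A sketch of this procedure follows; the multinomial configuration ($k = 5$, $n = 50$, equal null cell probabilities) and the simulation sizes are illustrative assumptions, and zero cell counts are handled with the convention $0 \log 0 = 0$.

```python
import numpy as np

rng = np.random.default_rng(5)
k, n, N = 5, 50, 10_000
p0 = np.full(k, 1.0 / k)

def stats(counts):
    """G^2 and chi^2 for each row of a (reps, k) array of multinomial counts."""
    e = n * p0
    chi2 = np.sum((counts - e) ** 2 / e, axis=1)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(counts > 0, counts * np.log(counts / e), 0.0)
    g2 = 2.0 * terms.sum(axis=1)
    return g2, chi2

# preliminary experiment: regress G^2 on chi^2 to get beta-hat, then hold it fixed
g2p, chi2p = stats(rng.multinomial(n, p0, size=100))
beta = np.cov(g2p, chi2p)[0, 1] / np.var(chi2p, ddof=1)

g2, chi2 = stats(rng.multinomial(n, p0, size=N))
theta_beta = np.mean(g2 - beta * chi2) + beta * (k - 1)
var_beta = np.var(g2 - beta * chi2, ddof=1) / N
print(theta_beta, var_beta, np.var(g2, ddof=1) / N)   # control variate vs crude
```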

Importance Sampling

Again, the interest is in estimating $\theta = E(g(X)) = \int g(x) f(x)\, dx$ where $f(x)$ is a joint density. Importance sampling attempts to reduce the variance of the crude Monte Carlo estimator $\hat{\theta}$ by changing the distribution from which the actual sampling is carried out. Recall that in crude Monte Carlo the sample is generated from the distribution of $X$ (with density $f(x)$). Suppose that it is possible to find a distribution $H(x)$ with density $h(x)$ with the property that $g(x) f(x)/h(x)$ has small variability. Then $\theta$ can be expressed as
$$\theta = \int g(x) f(x)\, dx = \int \frac{g(x) f(x)}{h(x)} h(x)\, dx = E[\phi(X)],$$
where $\phi(x) = g(x) f(x)/h(x)$ and the expectation is with respect to $H$. Thus, to estimate $\theta$, the crude Monte Carlo estimator of $\phi(X)$ with respect to the distribution $H$ could be used, i.e.,
$$\tilde{\theta} = \frac{1}{N} \sum_{j=1}^{N} \phi(x_j)$$
where $x_1, \ldots, x_N$ are sampled from $H$. The variance of this estimator is
$$\text{var}(\tilde{\theta}) = \frac{1}{N} \int \{\phi(x) - \theta\}^2 h(x)\, dx = \frac{1}{N} \int \left[\frac{f(x) g(x)}{h(x)}\right]^2 h(x)\, dx - \frac{\theta^2}{N}.$$
Thus $\text{var}(\tilde{\theta})$ will be smaller than $\text{var}(\hat{\theta})$ if
$$\int \left[\frac{f(x) g(x)}{h(x)}\right]^2 h(x)\, dx \le \int [g(x)]^2 f(x)\, dx,$$
since
$$\text{var}(\hat{\theta}) = \frac{\int [g(x)]^2 f(x)\, dx - \theta^2}{N}.$$


It can be shown using the Cauchy-Schwarz inequality for integrals that the above inequality holds if
$$h(x) = \frac{|g(x)| f(x)}{\int |g(x)| f(x)\, dx},$$
i.e., if $h(x) \propto |g(x)| f(x)$. For an extreme example, if $h(x) = g(x) f(x)/\theta$, then $\text{var}(\tilde{\theta}) = 0$! The choice of an appropriate density $h$ is usually difficult and depends on the problem at hand. A slightly better idea is obtained by looking at the expression on the left-hand side of the above inequality a little differently:
$$\int \left[\frac{f(x) g(x)}{h(x)}\right]^2 h(x)\, dx = \int \left[g(x)^2 f(x)\right] \left[\frac{f(x)}{h(x)}\right] dx.$$
To ensure a reduction in $\text{var}(\tilde{\theta})$, this suggests that $h(x)$ should be chosen such that $h(x) > f(x)$ for large values of $g(x)^2 f(x)$ and $h(x) < f(x)$ for small values of $g(x)^2 f(x)$.

Since sampling is now done from $h(x)$, $\tilde{\theta}$ will be more efficient if $h(x)$ is selected so that regions of $x$ for which $g(x)^2 f(x)$ is large are more frequently sampled than when using $f(x)$. By choosing the sampling distribution $h(x)$ this way, the probability mass is redistributed according to the relative importance of $x$ as measured by $g(x)^2 f(x)$. This gives the method the name importance sampling, since important $x$'s are sampled more often; $h(x)$ is called the importance function.

Example 1:

Evaluate $\theta = \Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} e^{-y^2/2}\, dy = \int_{-\infty}^{\infty} I(y < x)\, \phi(y)\, dy$, where $\phi(y)$ is the standard Normal density. A density curve similar in shape to $\phi(y)$ is the logistic density
$$f(y) = \frac{\pi \exp(-\pi y/\sqrt{3})}{\sqrt{3}\,\left(1 + \exp(-\pi y/\sqrt{3})\right)^2}$$
with mean 0 and variance 1. The cdf of this distribution is
$$F(y) = \frac{1}{1 + \exp(-\pi y/\sqrt{3})}.$$
A crude Monte Carlo estimate of $\theta$ is
$$\hat{\theta} = \frac{\sum_{i=1}^{N} I(z_i < x)}{N} = \frac{\#(z_i < x)}{N},$$
where $z_1, \ldots, z_N$ is a random sample from the standard normal. An importance sampling estimator of $\theta$ can be obtained by considering
$$\theta = \int_{-\infty}^{\infty} I(y < x) \frac{\phi(y)}{f(y)} f(y)\, dy = \int_{-\infty}^{\infty} \frac{k\,\phi(y)}{f(y)} \cdot \frac{I(y < x)\, f(y)}{k}\, dy,$$
where $k$ is determined such that $f(y) I(y < x)/k$ is a density, i.e., $f(y)/k$ is a density over $(-\infty, x)$:
$$k = \frac{1}{1 + \exp(-\pi x/\sqrt{3})}.$$


The importance sampling estimator of $\theta$ is
$$\tilde{\theta} = \frac{k}{N} \sum_{i=1}^{N} \frac{\phi(y_i)}{f(y_i)},$$
where $y_1, \ldots, y_N$ is a random sample from the density $f(y) I(y < x)/k$. Sampling from this density is easily done through the method of inversion: generate a variate $u_i$ from uniform $(0, 1)$ and transform using
$$y_i = -\frac{\sqrt{3}}{\pi} \log\left\{\left(1 + \exp(-\pi x/\sqrt{3})\right)/u_i - 1\right\}.$$
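A sketch of this importance sampling estimator; the evaluation point $x = -1$ and the simulation size are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
x, N = -1.0, 100_000
b = np.pi / np.sqrt(3.0)

phi = lambda y: np.exp(-y**2 / 2) / np.sqrt(2 * np.pi)        # N(0,1) density
f = lambda y: b * np.exp(-b * y) / (1 + np.exp(-b * y)) ** 2  # logistic density
k = 1.0 / (1.0 + np.exp(-b * x))                              # F(x), the normalizer

u = rng.uniform(size=N)
y = -np.log((1.0 + np.exp(-b * x)) / u - 1.0) / b             # inversion; all y < x
theta = k * np.mean(phi(y) / f(y))
print(theta)   # compare with Phi(-1), approximately 0.1587
```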

Earlier, it was argued that one may expect to obtain variance reduction if $h(x)$ is chosen similar in shape to $g(x) f(x)$. One way to do this is to attempt to select $h(x)$ so that both $h(x)$ and $g(x) f(x)$ are maximized at the same value. With this choice, if $h$ is chosen from the same family of distributions as $f$, one might expect to gain substantial variance reduction. For example, if $f$ is multivariate normal, then $h$ might also be taken to be multivariate normal but with a different mean and/or covariance matrix.

Example 2:

Evaluate $\theta = E[g(X)] = E[X^4 e^{X^2/4} I_{[X \ge 2]}]$ where $X \sim N(0, 1)$. Choose the importance density $h$ to be also Normal, with mean $\mu$ but the same variance 1. Since $h$ is maximized at $\mu$, the above suggests that a good choice for $\mu$ is the value that maximizes $g(x) f(x)$:
$$\mu = \arg\max_x\, g(x) f(x) = \arg\max_x\, x^4 e^{x^2/4} I_{[x \ge 2]} \times \frac{e^{-x^2/2}}{\sqrt{2\pi}} = \arg\max_{x \ge 2}\, x^4 e^{-x^2/4} = \sqrt{8}.$$
Then
$$\tilde{\theta} = \frac{1}{N} \sum_{j=1}^{N} y_j^4\, e^{(y_j^2/4 - \sqrt{8}\, y_j + 4)}\, I(y_j \ge 2),$$
where the $y_j$ are generated from $N(\sqrt{8}, 1)$.
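A sketch of this estimator; the simulation size is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(7)
N = 100_000
mu = np.sqrt(8.0)

# phi(y) = g(y) f(y) / h(y) = y^4 exp(y^2/4 - sqrt(8) y + 4) I(y >= 2)
y = rng.normal(mu, 1.0, size=N)
phi = np.where(y >= 2, y**4 * np.exp(y**2 / 4 - mu * y + 4), 0.0)
print(phi.mean(), phi.std(ddof=1) / np.sqrt(N))
```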

A recommended procedure for selecting an appropriate $h(x)$, given that $g_L \le g(x) \le g_U$ for all $x$, is to choose $h(x)$ so that, for all $x$,
$$h(x) \ge \frac{g(x) - g_L}{g_U - g_L}.$$
While this does not guarantee variance reduction, it can be used to advantage when crude Monte Carlo would require large sample sizes to obtain a required accuracy. This usually occurs when $g(x)$ is unbounded, i.e., $g_U \equiv \infty$ in the region of integration. In this case, we may choose $h(x)$ such that $\phi(x) = \frac{f(x) g(x)}{h(x)}$ is bounded.


Example 3:

Consider the evaluation of
$$\theta = \int_{0}^{1} x^{\alpha - 1} e^{-x}\, dx, \qquad \frac{1}{2} < \alpha \le 1.$$
Here, if $g(x) = x^{\alpha - 1} e^{-x}$ and $f(x) = I_{(0,1)}(x)$, we find that $g(x)$ is unbounded at $x = 0$. Suppose we consider the case $\alpha = .6$. Then the crude MC estimate of $\theta$ is
$$\hat{\theta} = \frac{\sum_{j=1}^{N} g(u_j)}{N} = \frac{\sum_{j=1}^{N} u_j^{-.4} e^{-u_j}}{N},$$
where $u_1, u_2, \ldots, u_N$ are from $U(0, 1)$.

[Figure: left panel, plot of $g(x)^2 f(x)$ for $x \in (0, 1)$; right panel, plot of $h(x) = \alpha x^{\alpha - 1}$ for $\alpha = .6$, with $f(x) = I_{(0,1)}(x)$ shown as a dashed line.]

The left panel above plots $g(x)^2 f(x)$ for values of $x \in (0, 1)$. This plot shows that we need to select $h(x) > 1$ (say) for $x < .1$ (say) and $h(x) < 1$ for $x > .1$. Suppose we choose $h(x) = \alpha x^{\alpha - 1}$. The right panel plots $h(x)$ for $\alpha = .6$, with the dashed line showing $f(x) = I_{(0,1)}(x)$. This plot shows clearly that $h(x)$ gives more weight to the values of $x < .1$, the values for which $g(x)$ gets larger, than does $f(x)$, which assigns equal weight to all values of $g(x)$. We also see that since
$$\theta = \int_{0}^{1} x^{\alpha - 1} e^{-x}\, dx = \int_{0}^{1} \phi(x)\, \alpha x^{\alpha - 1}\, dx,$$
where $\phi(x) = e^{-x}/\alpha$, the function $\phi(x)$ is bounded at $x = 0$.
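A sketch comparing the crude and importance sampling estimators for $\alpha = .6$; sampling from $h(x) = \alpha x^{\alpha - 1}$ uses the inverse transform $X = U^{1/\alpha}$, since $H(x) = x^\alpha$ on $(0, 1)$.

```python
import numpy as np

rng = np.random.default_rng(8)
alpha, N = 0.6, 100_000

u = rng.uniform(size=N)
crude = np.mean(u ** (alpha - 1) * np.exp(-u))   # integrand unbounded at 0

x = rng.uniform(size=N) ** (1.0 / alpha)         # draws from h by inversion
is_est = np.mean(np.exp(-x) / alpha)             # phi(x) = e^(-x)/alpha, bounded
print(crude, is_est)
```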

Example 4: A more complex application

In sequential testing, the likelihood ratio test statistic $\lambda$ is used to test
$$H_0: \text{population density is } f_0(x) \quad \text{vs.} \quad H_A: \text{population density is } f_1(x),$$
where
$$\lambda = \frac{f_1(x_1, \ldots, x_m)}{f_0(x_1, \ldots, x_m)} = \prod_{j=1}^{m} \frac{f_1(x_j)}{f_0(x_j)} = \exp\left\{\sum_{j=1}^{m} \log \frac{f_1(x_j)}{f_0(x_j)}\right\} = \exp\left\{\sum_{j=1}^{m} z_j\right\}$$
and $z_j = \log\{f_1(x_j)/f_0(x_j)\}$. The decision rule is:


Reject $H_0$ if $\sum z_j \ge b$;
accept $H_0$ if $\sum z_j \le a$;
continue sampling if $a < \sum z_j < b$.

It is usually assumed that $a < 0 < b$. Here the stopping time $m$ is a random variable, and note also that $E_{H_0}(z_j) < 0$. The theory of sequential tests needs to evaluate probabilities such as $\theta = \Pr\{\sum z_j \ge b \mid H_0 \text{ is true}\}$.

As an example, consider $X \sim N(\mu, 1)$ and suppose that $H_0: \mu = \mu_0$ vs. $H_A: \mu = \mu_1$. For simplicity, let $\mu_0 \le 0$. Consider estimating the type I error rate
$$\theta = \Pr_{H_0}\left\{\sum z_j \ge b\right\},$$
where $z_j = x_j$, i.e., $\sum x_j$ is the test statistic used. Since $\theta$ in integral form is
$$\theta = \int_{b}^{\infty} g_0(x)\, dx,$$
where $g_0$ is the density of $\sum z_j$ under the null hypothesis $H_0$, which can also be expressed in the form
$$\theta = \int I\left(\sum z_j \ge b\right) f_0(x)\, dx,$$
where $f_0$ is the density of the $X_j$ under $H_0$, the usual Monte Carlo estimator of $\theta$ is
$$\hat{\theta} = \frac{\sum_{i=1}^{N} I(y_i \ge b)}{N} = \frac{\#(y_i \ge b)}{N},$$
where $y_1, \ldots, y_N$ are $N$ independently generated values of $\sum z_j$. An estimator of $\theta$ under importance sampling is obtained by considering
$$\theta = \int I\left(\sum z_j \ge b\right) \frac{f_0(x)}{f_1(x)} f_1(x)\, dx,$$
where $f_1$ is the density of the $X_j$ under the alternative hypothesis $H_A$. Generating $N$ independent samples $y_1, y_2, \ldots, y_N$ under $H_A$, the estimator of $\theta$ under importance sampling is
$$\tilde{\theta} = \frac{\sum_{i=1}^{N} I(y_i \ge b)\, f_0(y_i)/f_1(y_i)}{N}.$$
For this example,
$$\frac{f_0(y_i)}{f_1(y_i)} = \frac{e^{-\frac{1}{2}(y_i - \mu_0)^2}}{e^{-\frac{1}{2}(y_i - \mu_1)^2}} = e^{\frac{1}{2}(\mu_0 - \mu_1)(2y_i - (\mu_0 + \mu_1))}.$$
If $\mu_1$ is chosen to be $-\mu_0$ (remember $\mu_0 \le 0$ and thus $\mu_1 > 0$), the algebra simplifies to
$$\frac{f_0(y_i)}{f_1(y_i)} = e^{2\mu_0 y_i}.$$


Thus the importance sampling estimator reduces to $\frac{1}{N}\sum I(y_i \ge b) \exp(2\mu_0 y_i)$, where the $y_i$ are sampled under $H_A$. The choice of $-\mu_0$ for $\mu_1$ is optimal in the sense that it minimizes $\text{Var}(\tilde{\theta})$ asymptotically. The following results are from Siegmund (1976), where the fifth and sixth columns give $\{N\,\text{Var}(\hat{\theta})\}^{1/2}$ for the crude estimator and $\{N\,\text{Var}(\tilde{\theta})\}^{1/2}$ for the importance sampling estimator, and R.E. is the relative efficiency:

-mu_0     a     b     theta       crude      IS          R.E.
  .0     any   any    .5          .5         .5             1
  .125   -5     5     .199        .399       .102          15
  .25    -5     5     .0578       .233       .0200        136
  .50    -5     5     .00376      .0612      .00171      1280
  .125   -9     9     .0835       .277       .0273        102
  .25    -9     9     .00824      .0904      .00207      1910
  .50    -9     9     .0000692    .00832     .0000311    7160
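A sketch of one row of this experiment ($-\mu_0 = .25$, $a = -5$, $b = 5$): random walk paths are simulated under $H_A$ until they exit $(a, b)$, and the weight $e^{2\mu_0 y}$ is attached on the rejection event. The path-by-path loop and the simulation size are implementation choices, not part of Siegmund's description.

```python
import numpy as np

rng = np.random.default_rng(9)
mu0, a, b, N = -0.25, -5.0, 5.0, 10_000

w = np.zeros(N)
for i in range(N):
    y = 0.0
    while a < y < b:
        y += rng.normal(-mu0, 1.0)        # increments under H_A: mean mu1 = -mu0
    if y >= b:
        w[i] = np.exp(2.0 * mu0 * y)      # importance weight on rejection paths
theta = w.mean()
print(theta, w.std(ddof=1) / np.sqrt(N))  # table row: theta approximately .0578
```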

More on Importance Sampling

For estimating
$$\theta = E[g(X)] = \int g(x) f(x)\, dx = \int g(x) \frac{f(x)}{h(x)} h(x)\, dx,$$
we used
$$\hat{\theta} = \frac{1}{N} \sum_{j=1}^{N} g(x_j) w(x_j),$$
where
$$w(x_j) = f(x_j)/h(x_j)$$
and the $x_j$'s are generated from $H$. This estimator $\hat{\theta}$ is called the importance sampling estimator with unstandardized weights. For estimating a posterior mean, for example, $f$ is known only up to a constant. In that case, the problem will involve estimating the constant of integration as well, using Monte Carlo. That is,
$$\theta = E[g(X)] = \frac{\int g(x) f(x)\, dx}{\int f(x)\, dx} = \frac{\int g(x) \frac{f(x)}{h(x)} h(x)\, dx}{\int \frac{f(x)}{h(x)} h(x)\, dx}$$
is estimated by the importance sampling estimator with standardized weights, $\hat{\theta}^*$:
$$\hat{\theta}^* = \frac{\frac{1}{N} \sum_{j=1}^{N} g(x_j) w(x_j)}{\frac{1}{N} \sum_{j=1}^{N} w(x_j)} = \sum_{j=1}^{N} g(x_j) w^*(x_j)


where
$$w^*(x_j) = \frac{w(x_j)}{\sum_{j=1}^{N} w(x_j)}$$
and the $x_j$'s are generated from $H$ as before. Obtaining the standard error of $\hat{\theta}^*$ involves using an approximation, based either on Fieller's theorem or on the Delta method. To derive this, the following shorthand notation is introduced: $g_j \equiv g(x_j)$, $w_j \equiv w(x_j)$, and $z_j \equiv g_j w_j$. In the new notation,
$$\hat{\theta}^* = \frac{\frac{1}{N} \sum_{j=1}^{N} g_j w_j}{\frac{1}{N} \sum_{j=1}^{N} w_j} = \frac{\frac{1}{N} \sum_{j=1}^{N} z_j}{\frac{1}{N} \sum_{j=1}^{N} w_j} = \frac{\bar{z}}{\bar{w}}.$$

Now, from the central limit theorem, we have
$$\begin{pmatrix} \bar{z} \\ \bar{w} \end{pmatrix} \approx N\left[\begin{pmatrix} \mu_z \\ \mu_w \end{pmatrix},\ \frac{1}{n} \begin{pmatrix} \sigma_z^2 & \sigma_{zw} \\ \sigma_{zw} & \sigma_w^2 \end{pmatrix}\right]

where $\mu_z$, $\mu_w$, $\sigma_z^2$, $\sigma_{zw}$, and $\sigma_w^2$ are all expectations derived with respect to $H$. It is required to estimate $\mu_z/\mu_w$, and the standard error of this estimate would be the standard error of $\hat{\theta}^*$. To estimate $\mu_z/\mu_w$, either the Delta method or Fieller's theorem could be used. The Delta method gives

$$\frac{\bar{z}}{\bar{w}} \approx N\left(\frac{\mu_z}{\mu_w},\ \frac{1}{n \mu_w^2}\left(\sigma_z^2 - \frac{2\mu_z}{\mu_w} \sigma_{zw} + \frac{\mu_z^2}{\mu_w^2} \sigma_w^2\right)\right),$$
and the standard error of $\bar{z}/\bar{w}$ is given by
$$\text{s.e.}\left(\frac{\bar{z}}{\bar{w}}\right) = \frac{1}{\sqrt{n}\, \bar{w}} \left[s_z^2 - 2\left(\frac{\bar{z}}{\bar{w}}\right) s_{zw} + \left(\frac{\bar{z}^2}{\bar{w}^2}\right) s_w^2\right]^{1/2},$$
where
$$s_z^2 = \sum (z_i - \bar{z})^2/(n - 1), \quad s_{zw} = \sum (z_i - \bar{z})(w_i - \bar{w})/(n - 1), \quad s_w^2 = \sum (w_i - \bar{w})^2/(n - 1).$$

Rather than using variance reduction to measure the effectiveness of an importance sampling strategy under a selected $h(x)$, the following quantity, known as the effective sample size, has been suggested as a measure of efficiency. When using unstandardized weights this measure is given by
$$\hat{N}(f, h) = \frac{N}{1 + s_w^2}.$$
Clearly, this reduces to $N$ when $h$ is exactly equal to $f$, and it is smaller than $N$ otherwise. It can thus be interpreted to mean that the $N$ samples used in the importance sampling procedure with $h$ give the same precision as $\hat{N}$ samples drawn exactly from $f$ (i.e., using crude Monte Carlo). Thus it measures the loss in precision in approximating $f$ with $h$, and is a useful measure for comparing different importance functions. When using standardized weights, an equivalent measure is
$$\hat{N}(f, h) = \frac{N}{1 + \text{cv}^2}


where $\text{cv} = s_w/\bar{w}$ is the coefficient of variation of $w$. From the above results, for an importance sampling procedure to succeed, some general requirements must be met:

• the weights must be bounded;

• the coefficient of variation of the weights should be small;

• infinite variance of the weights must be avoided;

• $\max w_i / \sum w_i$ should be small, i.e., for calculating the standard error of the importance sampling estimate the central limit theorem must hold. Since the central limit theorem relies on the average of similar quantities, this condition ensures that a few large weights will not dominate the weighted sum.

Since standardized weights could be used even when $f$ is completely determined, the following criterion could be used to make a choice between the "unstandardized" and "standardized" estimators. The mean squared error of the standardized estimator is smaller than that of the unstandardized estimator, i.e., $MSE(\hat{\theta}^*) < MSE(\hat{\theta})$, if
$$\text{Corr}(z_j, w_j) > \frac{\text{cv}(w_j)}{2\,\text{cv}(z_j)},$$
where $z_j = g_j w_j$ as before. (A sketch of these computations follows.)
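The following sketch computes the standardized-weight estimator, its Delta-method standard error, and the effective sample size for an artificial target: $f \propto e^{-x^2/2}$ (unnormalized), importance density $h$ the $t_3$ distribution, and $g(x) = x^2$. This target/importance pair is an assumption made purely for illustration; note that the unknown normalizing constants of $f$ and $h$ cancel in $\hat{\theta}^*$.

```python
import numpy as np

rng = np.random.default_rng(10)
N = 100_000
g = lambda x: x**2
f = lambda x: np.exp(-x**2 / 2)               # target density, unnormalized
h = lambda x: (1 + x**2 / 3) ** -2            # t_3 density up to its constant

x = rng.standard_t(df=3, size=N)              # draws from h
w = f(x) / h(x)                               # unstandardized weights
z = g(x) * w
zbar, wbar = z.mean(), w.mean()
theta_star = zbar / wbar                      # standardized-weight estimator

sz2, sw2 = z.var(ddof=1), w.var(ddof=1)
szw = np.cov(z, w)[0, 1]
se = np.sqrt(sz2 - 2 * theta_star * szw + theta_star**2 * sw2) / (np.sqrt(N) * wbar)

cv2 = sw2 / wbar**2                           # squared coefficient of variation
ess = N / (1 + cv2)                           # effective sample size
print(theta_star, se, ess)                    # theta_star near E(X^2) = 1
```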

Antithetic Variates

Again, the problem of estimating $\theta = \int g(x) f(x)\, dx$ is considered. In the control variate and importance sampling variance reduction methods, the averaging of the function $g(x)$ done in crude Monte Carlo was replaced by the averaging of $g(x)$ combined with another function. Knowledge about this second function brought about the variance reduction of the resulting Monte Carlo method. In the antithetic variate method, two estimators are found by averaging two different functions, say $g_1$ and $g_2$. These two estimators are such that they are negatively correlated, so that the combined estimator has smaller variance.

Suppose that $\hat{\theta}_1$ is the crude Monte Carlo estimator based on $g_1$ and a sample of size $N$, and $\hat{\theta}_2$ that based on $g_2$ and a sample of size $N$. Then let the combined estimator be
$$\hat{\theta} = \frac{1}{2}\left(\hat{\theta}_1 + \hat{\theta}_2\right),$$
where $\hat{\theta}_1 = \sum g_1(x_j)/N$ and $\hat{\theta}_2 = \sum g_2(x_j)/N$. Then
$$\text{Var}(\hat{\theta}) = \frac{1}{4}\left[\text{Var}(\hat{\theta}_1) + \text{Var}(\hat{\theta}_2) + 2\,\text{cov}(\hat{\theta}_1, \hat{\theta}_2)\right] = \frac{1}{4}\left[\sigma_1^2/N + \sigma_2^2/N + 2\rho\sigma_1\sigma_2/N\right],$$
where $\rho = \text{corr}(\hat{\theta}_1, \hat{\theta}_2)$. A substantial reduction in $\text{Var}(\hat{\theta})$ over $\text{Var}(\hat{\theta}_1)$ will be achieved if $\rho \ll 0$. If this is so, then $g_2$ is said to be an antithetic variate for $g_1$. The usual way of generating values for $g_2$ so that the estimate $\hat{\theta}_2$ has high negative correlation with $\hat{\theta}_1$


is to reuse the uniform variates: the stream $1 - u_j$, paired with the same $u_j$ used to generate the samples for $\hat{\theta}_1$, is used to generate the samples for $\hat{\theta}_2$. In practice this is achieved by using the streams $u_j$ and $1 - u_j$ to generate two samples from the required distribution by inversion.

The following example will help clarify the idea. Consider estimation of $\theta = E[g(U)]$ where $U$ is uniformly distributed between 0 and 1. Then for any function $g(\cdot)$,
$$\theta = E(g(U)) = \int_{0}^{1} g(u)\, du,$$
so $\frac{1}{N} \sum g(u_j)$ is an unbiased estimator of $\theta$ for a random sample $u_1, \ldots, u_N$. But since $1 - U$ is also uniformly distributed on $(0, 1)$, $\frac{1}{N} \sum g(1 - u_j)$ is also an unbiased estimator of $\theta$. If $g(x)$ is a monotonic function in $x$, then $g(u)$ and $g(1 - u)$ are negatively correlated and thus are "antithetic variates."

In general, let
$$\hat{\theta}_1 = \frac{1}{N} \sum_{j=1}^{N} g\left(F_1^{-1}(U_{j1}), F_2^{-1}(U_{j2}), \ldots, F_m^{-1}(U_{jm})\right),$$
$$\hat{\theta}_2 = \frac{1}{N} \sum_{j=1}^{N} g\left(F_1^{-1}(1 - U_{j1}), F_2^{-1}(1 - U_{j2}), \ldots, F_m^{-1}(1 - U_{jm})\right),$$
where $F_1, F_2, \ldots, F_m$ are the c.d.f.'s of $X_1, X_2, \ldots, X_m$ and $U_1, U_2, \ldots, U_m$ are i.i.d. $U(0, 1)$. Then it can be shown that if $g$ is a monotone function, $\text{cov}(\hat{\theta}_1, \hat{\theta}_2) < 0$, and thus
$$\hat{\theta} = \frac{1}{2}\left(\hat{\theta}_1 + \hat{\theta}_2\right)$$
will have a smaller variance.

Example 1:

Evaluate $E[h(X)]$ where $h(x) = x/(2x - 1)$ and $X \sim \exp(2)$ (mean 2). Generate $N$ samples $u_i$, $i = 1, \ldots, N$, where $u \sim U(0, 1)$, and compute
$$\hat{\theta} = \frac{1}{2}\left(\sum h(x_i)/N + \sum h(y_i)/N\right),$$
where $x_i = -2 \log(u_i)$ and $y_i = -2 \log(1 - u_i)$, $i = 1, \ldots, N$.

Example 2: Antithetic Exponential Variates

For $X \sim \exp(\theta)$, the inverse transform gives $X = -\theta \log(1 - U)$ as a variate from $\exp(\theta)$. Then $Y = -\theta \log(U)$ is an antithetic variate. It can be shown, since
$$\text{Cov}(X, Y) = \theta^2\left[\int_{0}^{1} \log(u) \log(1 - u)\, du - \int_{0}^{1} \log(u)\, du \int_{0}^{1} \log(1 - u)\, du\right] = \theta^2 (1 - \pi^2/6)$$
and $\text{Var}(X) = \text{Var}(Y) = \theta^2$, that
$$\text{Corr}(X, Y) = \frac{\theta^2 (1 - \pi^2/6)}{\sqrt{\theta^2 \cdot \theta^2}} = (1 - \pi^2/6) \approx -.645.$$


Example 3: Antithetic Variates with Symmetric Distributions

Consider the inverse transform to obtain two standard normal random variables: $Z_1 = \Phi^{-1}(U)$ and $Z_2 = \Phi^{-1}(1 - U)$. $Z_1$ and $Z_2$ are antithetic variates, as $\Phi^{-1}$ is monotonic. Further, it is also easy to see that $\Phi(Z_1) = 1 - \Phi(Z_2)$. Since the standard normal distribution is symmetric, it is also true that $\Phi(z) = 1 - \Phi(-z)$, which implies that $Z_2 = -Z_1$. Thus, it is seen that $Z_1$ and $-Z_1$ are two antithetic random variables from the standard normal, and $\text{Corr}(Z_1, -Z_1) = -1$.

For example, to evaluate $\theta = E(e^Z)$ for $Z \sim N(0, 1)$, generate $z_1, z_2, \ldots, z_N$ and compute the antithetic estimate $\hat{\theta} = (\hat{\theta}_1 + \hat{\theta}_2)/2$ where $\hat{\theta}_1 = (\sum_{j=1}^{N} e^{z_j})/N$ and $\hat{\theta}_2 = (\sum_{j=1}^{N} e^{-z_j})/N$. In general, for distributions symmetric around $c$ it is true that $F(x - c) = 1 - F(c - x)$. Then for two antithetic random variables $X$ and $Y$ from this distribution we have $F(X - c) = U$ and $F(Y - c) = 1 - U$, which results in $Y = 2c - X$ and $\text{Corr}(X, Y) = -1$.
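A sketch of this antithetic estimate; the exact value is $e^{1/2} \approx 1.6487$, and the simulation size is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(11)
N = 10_000
z = rng.normal(size=N)

theta1 = np.exp(z).mean()
theta2 = np.exp(-z).mean()
theta = 0.5 * (theta1 + theta2)            # antithetic estimate

# variance of the pairwise average vs a crude estimate on the same N draws
pair = 0.5 * (np.exp(z) + np.exp(-z))
print(theta, pair.var(ddof=1) / N, np.exp(z).var(ddof=1) / N)
```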

Conditioning Swindles

For simplicity of discussion, consider the problem of estimating $\theta = E(X)$. For a random variable $W$ defined on the same probability space as $X$,
$$E(X) = E\{E(X \mid W)\}$$
and
$$\text{var}(X) = \text{var}\{E(X \mid W)\} + E\{\text{var}(X \mid W)\},$$
giving $\text{var}(X) \ge \text{var}\{E(X \mid W)\}$. The crude Monte Carlo estimate of $\theta$ is
$$\hat{\theta} = \frac{1}{N} \sum_{j=1}^{N} x_j,$$
with variance
$$\text{var}(\hat{\theta}) = \frac{\text{var}(X)}{N},$$
where $x_1, x_2, \ldots, x_N$ is a random sample from the distribution of $X$. The conditional Monte Carlo estimator is
$$\tilde{\theta} = \frac{1}{N} \sum_{j=1}^{N} q(w_j),$$
with variance
$$\text{var}(\tilde{\theta}) = \frac{\text{var}\{E(X \mid W)\}}{N},$$
where $w_1, w_2, \ldots, w_N$ is a random sample from the distribution of $W$ and $q(w_j) = E(X \mid W = w_j)$. The usual situation in which the conditioning swindle can be applied is when $q(w) = E(X \mid W = w)$ can be easily evaluated analytically.

In general, this type of conditioning is called Rao-Blackwellization. Recall that the Rao-Blackwell theorem says that the variance of an unbiased estimator can be reduced by conditioning on the sufficient statistics. Let $(X, Y)$ have a joint distribution and consider the estimation of $\theta = E_X(g(X))$. A crude Monte Carlo estimate of $\theta$ is
$$\hat{\theta} = \frac{1}{N} \sum_{j=1}^{N} g(x_j),$$
where $x_j$, $j = 1, \ldots, N$, are generated from the distribution of $X$. The estimator based on $\theta = E_Y[E_X(g(X) \mid Y)]$ is
$$\tilde{\theta} = \frac{1}{N} \sum_{j=1}^{N} E_X(g(X) \mid Y = y_j),$$
where $y_j$, $j = 1, \ldots, N$, are generated from the distribution of $Y$. This requires that the conditional expectation $E_X(g(X) \mid Y)$ be explicitly determined. It can be shown, as above, that $\tilde{\theta}$ dominates $\hat{\theta}$ in variance.

In general, we can write
$$\text{var}(g(X)) = \text{var}\{E(g(X) \mid W)\} + E\{\text{var}(g(X) \mid W)\},$$
and techniques such as importance sampling and stratified sampling are actually based on this conditioning principle.

Example 1:

Suppose $W \sim \text{Poisson}(10)$ and $X \mid W \sim \text{Beta}(w, w^2 + 1)$. The problem is to estimate $\theta = E(X)$. Attempting to use crude Monte Carlo, i.e., to average samples from $X$, will be inefficient. To use the conditioning swindle, first compute $E(X \mid W)$, which is shown to be $w/(w^2 + w + 1)$. Thus the conditional Monte Carlo estimator is
$$\tilde{\theta} = \frac{1}{N} \sum_{j=1}^{N} \frac{w_j}{w_j^2 + w_j + 1},$$
where $w_1, w_2, \ldots, w_N$ is a random sample from Poisson(10). In this example, the conditioning variable was rather obvious; sometimes real ingenuity is required to discover a good random variable to condition on.
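A sketch comparing the two estimators for this example; the guard against the rare event $w = 0$ (which would make the Beta parameters invalid) is an implementation detail, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(12)
N = 10_000
w = rng.poisson(10, size=N).astype(float)
w = np.maximum(w, 1.0)        # guard: Beta parameters must be positive (w = 0 is rare)

x = rng.beta(w, w**2 + 1)     # crude: sample X and average
q = w / (w**2 + w + 1)        # swindle: E(X | W) in closed form
print(x.mean(), q.mean(), x.var(ddof=1) / N, q.var(ddof=1) / N)
```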

Example 2:

Suppose $x_1, x_2, \ldots, x_n$ is a random sample from $N(0, 1)$ and let $M = \text{median}(X_1, \ldots, X_n)$. The object is to estimate $\theta = \text{var}(M) = E[M^2]$. The crude Monte Carlo estimate is obtained by generating $N$ samples of size $n$ from $N(0, 1)$ and simply averaging the squares of the medians from each sample:
$$\hat{\theta} = \frac{1}{N} \sum_{j=1}^{N} M_j^2.$$
An alternative is to use the independence decomposition swindle. If $\bar{X}$ is the sample mean, note that
$$\text{var}(M) = \text{var}(\bar{X} + (M - \bar{X})) = \text{var}(\bar{X}) + \text{var}(M - \bar{X}) + 2\,\text{cov}(\bar{X}, M - \bar{X}).$$


However, $\bar{X}$ is independent of $M - \bar{X}$, giving $\text{cov}(\bar{X}, M - \bar{X}) = 0$. Thus
$$\text{var}(M) = \frac{1}{n} + E(M - \bar{X})^2,$$
giving the swindle estimate of $\theta$ to be
$$\tilde{\theta} = \frac{1}{n} + \frac{1}{N} \sum_{j=1}^{N} (m_j - \bar{x}_j)^2.$$
One would expect the variation in $M - \bar{X}$ to be much less than the variation in $M$. In general, the idea is to decompose the estimator $X$ into $X = U + V$ where $U$ and $V$ are independent; thus $\text{var}(X) = \text{var}(U) + \text{var}(V)$. If $\text{var}(U)$ is known, then $\text{var}(V) \le \text{var}(X)$, and therefore $V$ can be estimated more precisely than $X$.
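A sketch of both estimates of $\theta = \text{var}(M)$; the choices $n = 11$ and $N = 10000$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(13)
n, N = 11, 10_000
x = rng.normal(size=(N, n))
m = np.median(x, axis=1)
xbar = x.mean(axis=1)

theta_crude = np.mean(m**2)                        # average of M^2
theta_swindle = 1.0 / n + np.mean((m - xbar) ** 2) # 1/n + average of (M - Xbar)^2
print(theta_crude, theta_swindle,
      np.var(m**2, ddof=1) / N, np.var((m - xbar) ** 2, ddof=1) / N)
```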


Princeton Robustness Study (Andrews et al. (1972))

This simulation study of robust estimators combines aspects of the conditioning and independence swindles. The problem is evaluating the performance of a number of possible robust estimators of location. Let $X_1, X_2, \ldots, X_n$ be a random sample from an unknown distribution which is assumed to be symmetric about its median $m$. Consider $T(X_1, \ldots, X_n)$ to be an unbiased estimator of $m$, i.e., $E(T) = m$. Additionally, the estimators will be required to have the property that
$$T(aX_1 + b, aX_2 + b, \ldots, aX_n + b) = a\, T(X_1, X_2, \ldots, X_n) + b,$$
i.e., $T$ is a location and scale equivariant statistic. A good estimator would be one with small variance over most of the possible distributions for the sample $X_1, X_2, \ldots, X_n$. Thus, in general, it will be required to estimate $\theta$ where
$$\theta = E\{g(T)\}.$$
If the interest is in estimating the variance of $T$, then $g(T) = T^2$, so the problem can essentially be reduced to estimating $\theta = E\{T^2\}$.

Next, a method to construct a reasonable class of distributions from which to generate sample data is considered. It was decided to represent the random variable $X$ as the ratio $Z/V$ where $Z \sim N(0, 1)$ and $V$ is a positive random variable independent of $Z$. This choice is motivated by the fact that conditional on $V$, $X$ will have a simple distribution, a fact that might be useful for variance reduction by employing a conditioning swindle. Some examples of possible distributions that may be represented are:

(i) If $V = 1/\sigma$, then $X \sim N(0, \sigma^2)$.

(ii) If $V = \sqrt{\chi^2(\nu)/\nu}$, then $X \sim t(\nu)$.

(iii) If $V = |\text{standard normal random variable}|$, then $X \sim$ Cauchy.

(iv) If $V$ has the two-point distribution
$$V = \begin{cases} c & \text{with probability } \alpha \\ 1 & \text{otherwise,} \end{cases}$$
then $X \sim$ contaminated Normal, i.e., $N(0, 1)$ with probability $1 - \alpha$ and $N(0, 1/c^2)$ with probability $\alpha$.

To take advantage of this relatively nice conditional distribution, the $X_j$'s must be generated using ratios of sequences of independent $Z_j$'s and $V_j$'s. Consider $X_j = Z_j/V_j$. Conditional on $V_j$, $X_j \sim N(0, 1/V_j^2)$. To incorporate the conditioning and independence swindles into the problem of Monte Carlo estimation of $\theta$, it is found convenient to express $X_j$ in the form
$$X_j = \beta + s C_j \quad \text{or} \quad \mathbf{X} = \beta \mathbf{1} + s\, \mathbf{C},$$
where $\beta$, $s$, and the $C_j$ are random variables to be defined later, such that $\beta$, $s$ and $\mathbf{C}$ are independent given $\mathbf{V}$. Note that because of the way $T$ was defined,


$$T(\mathbf{X}) = \beta + s\, T(\mathbf{C}).$$
Under these conditions it is possible to reduce the problem of estimating $\theta = E(T^2)$ to
$$\begin{aligned}
\theta = E\{T^2(\mathbf{X})\} &= E_{\mathbf{V}}\left[E_{\mathbf{X}|\mathbf{V}}\{T^2(\mathbf{X})\}\right] \\
&= E_{\mathbf{V}}\left[E_{\beta, s, \mathbf{C}}\{T^2(\beta, s, \mathbf{C})\}\right] \\
&= E_{\mathbf{V}}\left[E_{\mathbf{C}|\mathbf{V}}\left[E_{\beta|\mathbf{V}}\left[E_{s|\mathbf{V}}\left\{\beta^2 + 2\beta s\, T(\mathbf{C}) + s^2 T^2(\mathbf{C})\right\}\right]\right]\right] \\
&= E_{\mathbf{V}}\left[E_{\beta|\mathbf{V}}(\beta^2) + 2\, E_{\beta|\mathbf{V}}(\beta) E_{s|\mathbf{V}}(s) E_{\mathbf{C}|\mathbf{V}}(T(\mathbf{C})) + E_{s|\mathbf{V}}(s^2) E_{\mathbf{C}|\mathbf{V}}(T^2(\mathbf{C}))\right].
\end{aligned}$$
The random variables $\beta$ and $s^2$ will be defined so that conditional on $\mathbf{V}$, $\beta \sim N(0, 1/\Sigma v_j^2)$, $(n - 1)s^2 \sim \chi^2(n - 1)$, and $\beta$ is independent of $s^2$. Thus
$$E_{\beta|\mathbf{V}}(\beta) = 0, \quad E_{\beta|\mathbf{V}}(\beta^2) = 1/\Sigma V_j^2, \quad \text{and} \quad E_{s|\mathbf{V}}(s^2) = 1.$$
Therefore
$$\theta = E\left[1/\Sigma V_j^2\right] + E\left[T^2(\mathbf{C})\right].$$
Thus, the problem of estimation of $\theta$ has been split into two parts. In many cases, it is possible to evaluate $E(1/\mathbf{V}'\mathbf{V})$ analytically. For example, in generating samples from the $t$-distribution with $\nu$ d.f., $V_j \sim \sqrt{\chi^2(\nu)/\nu}$ and thus $E_{\mathbf{V}}(1/\Sigma V_j^2) = E(\nu/\chi^2(n\nu)) = \nu/(n\nu - 2)$. In other cases, the Delta method may be used to obtain an approximation (see the Appendix). The problem of estimating $\theta$ thus reduces to estimating $E(T^2(\mathbf{C}))$, which can be easily accomplished by averaging over $N$ samples: $\{T^2(\mathbf{C}_1) + \cdots + T^2(\mathbf{C}_N)\}/N$. The variance of this estimate should be lower than that of an estimate found by averaging $T^2(\mathbf{X})$, since $T^2(\mathbf{C})$ is much less variable than $T^2(\mathbf{X})$.

Now, $\beta$, $s$ and $\mathbf{C}$ will be defined such that
$$X_j = \beta + s C_j$$
and, given $\mathbf{V}$, the quantities $\beta$, $s$, and $\mathbf{C}$ are independent. Consider regressing $Z_j$ on $V_j$ through the origin. The slope is given by
$$\beta = \frac{\Sigma V_j Z_j}{\Sigma V_j^2},$$
and $\beta \sim N(0, 1/\mathbf{V}'\mathbf{V})$ given $\mathbf{V}$. The mean squared error from the regression is
$$s^2 = \frac{\sum_{j=1}^{n} (Z_j - \beta V_j)^2}{n - 1}


and $(n - 1)s^2 \sim \chi^2(n - 1)$ given $\mathbf{V}$. Since the $Z_j$ are independent and normally distributed, $s^2$ and $\beta$ are independent. Now form the standardized residuals from the regression,
$$\left\{\frac{Z_j - \beta V_j}{s}\right\}.$$
It can be shown that these standardized residuals are independent of $(\beta, s^2)$ (Simon (1976)). Now let
$$C_j = \frac{Z_j - \beta V_j}{V_j s} = \frac{Z_j/V_j - \beta}{s} = \frac{X_j - \beta}{s}.$$
The $C_j$ are independent of $(\beta, s^2)$ conditional on the $V_j$, since they are functions of the standardized residuals above. Then the needed form for $X_j$ is obtained:
$$X_j = \beta + s C_j.$$
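The construction can be sketched for a single replication as follows, with $T$ taken to be the median and the swindle term $k/(nk - 2) + T(\mathbf{c})^2$ anticipated from the experiment described below; all size choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(14)
n, k = 10, 4                               # sample size and t degrees of freedom
z = rng.normal(size=n)
v = np.sqrt(rng.chisquare(k, size=n) / k)  # V_j such that X_j = Z_j / V_j ~ t(k)
x = z / v

beta = np.sum(v * z) / np.sum(v**2)        # slope of the no-intercept regression
s = np.sqrt(np.sum((z - beta * v) ** 2) / (n - 1))
c = (x - beta) / s                         # C_j = (X_j - beta)/s

T = np.median                              # any location/scale equivariant T
w_star = k / (n * k - 2) + T(c) ** 2       # E(1/sum V_j^2) + T(C)^2 for one replication
assert np.allclose(x, beta + s * c)        # recovers X = beta + s C
print(w_star)
```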

The essence of this transformation is to average analytically over as much of the variation as possible. The construction of the distribution of $X$ as a normal-over-independent random variable is somewhat restrictive; however, it does include the $t$, Cauchy, Laplace and contaminated normal as possible distributions for $X$.

A procedure for a Monte Carlo experiment for comparing the variances of several estimators of location under a specified sampling distribution, using the location and scale swindle described above, follows. The variances are used to measure the standard of performance of the estimators and are the parameters of interest in the study. Good estimators are those which have comparably low variances over a range of possible parent distributions. In some cases theoretical results can be used to compare the performance of estimators; for example, if the sample is Normal then the sample mean $\bar{X}$ is certainly the best, since it has minimum variance. However, if the sample is from a Cauchy distribution, $\bar{X}$ has infinite variance and thus is not a contender.

The estimators considered are the mean, median, and 10% trimmed mean computed from independent samples from the $t$-distribution with $k$ d.f., generated as ratios of normal-over-independent random variables. Note that these estimators satisfy the location and scale equivariance property; the problem is thus one to which the swindle can be applied. This experiment is to be repeated for various sample sizes (done here for two sample sizes, $n = 10, 20$). In order to facilitate comparisons across different sample sizes, the actual quantity tabulated is $n \times \text{var}(T)$ instead of just $\text{var}(T)$. Since the data were generated from a distribution with median zero, the parameter to be estimated is $\theta = E(T^2)$. Both crude Monte Carlo and the swindle are employed. In the first case, for each replication,

1. Sample $z_1, \ldots, z_n \sim N(0, 1)$ and $v_1, \ldots, v_n \sim \chi^2(k)$, independently; set $x_i = z_i/\sqrt{v_i/k}$, $i = 1, \ldots, n$.

2. Form $W = T(\mathbf{x})^2$.

Then average $W$ over 1000 replications to obtain a crude MC estimate.


For the swindle, for each replication,

1. Sample $z_1, \ldots, z_n \sim N(0, 1)$ and $y_1, \ldots, y_n \sim \chi^2(k)$, independently; set $x_i = z_i/v_i$ where $v_i = \sqrt{y_i/k}$, $i = 1, \ldots, n$.

2. Compute $\beta$ and $s^2$ from $(v_i, z_i)$, $i = 1, \ldots, n$.

3. Set $c_i = (x_i - \beta)/s$ for $i = 1, \ldots, n$, and form $W^* = k/(nk - 2) + T(\mathbf{c})^2$.

Then average $W^*$ over the 1000 replications. Recall that $W$ and $W^*$ are the Monte Carlo estimates of $\text{var}(T)$ from crude Monte Carlo and from the swindle. In order to find out whether there is an actual variance reduction, the sampling variances of both these estimates are needed. Although formulas can be developed for computing these more efficiently, estimates can be obtained from the actual sampling procedure itself.

Table 1: Crude Monte Carlo and swindle estimates of $n \times \text{var}(T)$ based on 1000 simulations for estimators $T$ = Mean, Median, and 10% Trimmed Mean of samples of size $n = 10$ from the $t$-distribution with $k$ d.f.

                                          k
T                        3      4      10     20     100
Mean
  Crude MC estimate      2.53   1.94   1.16   1.17   1.04
  Standard error         .16    .10    .049   .054   .049
  Swindle estimate       2.84   1.93   1.24   1.11   1.02
  Standard error         .16    .065   .0098  .0047  .00082
  Variance reduction     1      2      25     132    3571
Median
  Crude MC estimate      1.65   1.72   1.31   1.57   1.46
  Standard error         .08    .094   .059   .073   .066
  Swindle estimate       1.76   1.68   1.45   1.45   1.42
  Standard error         .034   .033   .020   .019   .017
  Variance reduction     6      8      9      15     15
10% Trimmed Mean
  Crude MC estimate      1.59   1.54   1.11   1.19   1.11
  Standard error         .073   .075   .047   .055   .052
  Swindle estimate       1.75   1.52   1.19   1.13   1.07
  Standard error         .038   .024   .0077  .0049  .0029
  Variance reduction     4      10     37     126    322



Table 2: Crude Monte Carlo and swindle estimates of $n \times \text{var}(T)$ based on 1000 simulations for estimators $T$ = Mean, Median, and 10% Trimmed Mean of samples of size $n = 20$ from the $t$-distribution with $k$ d.f.

                                          k
T                        3      4      10     20     100
Mean
  Crude MC estimate      2.75   1.76   1.24   1.12   0.99
  Standard error         .19    .087   .056   .056   .045
  Swindle estimate       2.96   1.87   1.25   1.11   1.02
  Standard error         .20    .062   .010   .0046  .00085
  Variance reduction     1      2      31     148    2803
Median
  Crude MC estimate      1.71   1.75   1.67   1.59   1.46
  Standard error         .080   .075   .074   .077   .061
  Swindle estimate       1.79   1.72   1.58   1.54   1.46
  Standard error         .036   .035   .025   .022   .019
  Variance reduction     5      5      9      12     10
10% Trimmed Mean
  Crude MC estimate      1.55   1.40   1.23   1.15   1.04
  Standard error         .071   .060   .055   .057   .047
  Swindle estimate       1.64   1.46   1.20   1.13   1.07
  Standard error         .027   .019   .0082  .0055  .0029
  Variance reduction     7      10     45     107    263

Writing $\theta = \text{Var}(T) = E(T^2)$, we have
$$\hat{\theta} = \frac{1}{N} \sum_{j=1}^{N} T_j^2,$$
where, say, $T_j^2 \equiv T(\mathbf{x}_j)^2$. Then
$$\text{Var}(\hat{\theta}) = \text{Var}\left(\frac{1}{N} \sum_{j=1}^{N} T_j^2\right) = \frac{1}{N^2} \sum_{j=1}^{N} \text{Var}(T_j^2).$$
So an estimate of $\text{Var}(\hat{\theta})$ is
$$\widehat{\text{Var}}(\hat{\theta}) = \frac{1}{N^2} \times N \times \widehat{\text{Var}}(T^2),$$
where an estimate of $\text{Var}(T^2)$ can be obtained from the sample as
$$\widehat{\text{Var}}(T^2) = \frac{\sum T_j^4}{N} - \left(\frac{\sum T_j^2}{N}\right)^2.


The standard errors reported in Tables 1 and 2 are thus computed using
$$\text{s.e.}(n \times \widehat{\text{var}}(T)) = n \times \sqrt{\frac{\sum T_j^4}{N^2} - \frac{(\sum T_j^2)^2}{N^3}}.$$

The Location-Scale Swindle for a Scale Estimator

Here the problem is less straightforward than in the location parameter case, where the interest was in an unbiased estimator of the median of a symmetric distribution. It is less clear what the scale parameter of interest is, although the samples would still be drawn from a symmetric distribution. Suppose it is imposed that the scale estimator considered is location-invariant and scale-transparent, i.e.,
$$S(cX_1 + d, cX_2 + d, \ldots, cX_n + d) = |c|\, S(X_1, X_2, \ldots, X_n),$$
or, in vector form,
$$S(c\mathbf{X} + d\mathbf{1}) = |c|\, S(\mathbf{X});$$
then
$$\log S(c\mathbf{X} + d\mathbf{1}) = \log |c| + \log S(\mathbf{X}).$$
Thus, if the interest is in comparing scale estimators, $Q(\mathbf{X}) = \log S(\mathbf{X})$ is a good function of $S(\mathbf{X})$ to use, and $\text{Var}\{Q(\mathbf{X})\}$ provides a good measure of the performance of the estimator $S(\mathbf{X})$ since it is scale-invariant (i.e., it will not depend on $c$). Thus estimators $S(\mathbf{X})$ such that $\text{Var}\{\log S(\mathbf{X})\}$ is well-behaved over a large class of distributions would be better.

From our presentation on the location estimator we have $\mathbf{X} = \beta \mathbf{1} + s\mathbf{C}$, and it can also be shown easily that $S(\mathbf{X}) = s\, S(\mathbf{C})$. Thus
$$\begin{aligned}
\text{Var}\{Q(\mathbf{X})\} &= \text{Var}\{\log S(\mathbf{X})\} = \text{Var}\{\log s + \log S(\mathbf{C})\} \\
&= \text{Var}[E\{(\log s + \log S(\mathbf{C})) \mid \mathbf{V}\}] + E[\text{Var}\{(\log s + \log S(\mathbf{C})) \mid \mathbf{V}\}] \\
&= \text{Var}\left\{\log \sqrt{\frac{W}{n - 1}}\right\} + \text{Var}\{\log S(\mathbf{C})\},
\end{aligned}$$
where $W \sim \chi^2(n - 1)$; note that $\text{Var}\{E(\log s \mid \mathbf{V})\} = 0$.


Appendix

A note on approximating $E(1/\sum_{i=1}^{n} V_i^2)$ using the Delta method

A first-order Taylor series approximation does not provide sufficient accuracy for the computation of the expectation of a function of a random variable; at least the second-order term is needed. Consider the computation of $E(g(X))$. Assume $X \sim (\mu, \sigma^2)$ and expand $g(x)$ around $\mu$:
$$g(x) \approx g(\mu) + (x - \mu) g'(\mu) + \frac{1}{2}(x - \mu)^2 g''(\mu).$$
Thus
$$E[g(X)] \approx g(\mu) + \frac{1}{2} g''(\mu)\, \sigma^2.$$
This is the standard Delta method formula for the mean of a function of $X$. For $g(x) = \frac{1}{x}$, $g''(x) = \frac{2}{x^3}$, implying that $E\left(\frac{1}{X}\right) \approx \frac{1}{\mu} + \frac{\sigma^2}{\mu^3}$. Thus the first-order approximation for $E(\frac{1}{X})$ is $\frac{1}{\mu}$, and the second-order approximation is $\frac{1}{\mu} + \frac{\sigma^2}{\mu^3}$. In the swindle problem we have $X = \sum_{i=1}^{n} V_i^2$.

$N(0, 1)/U(0, 1)$ case:
Straightforward application of the Delta method, taking $V \sim U(0, 1)$ (so that $\mu = n E(V_i^2) = n/3$ and $\sigma^2 = n\,\text{Var}(V_i^2) = n\left(\frac{1}{5} - \frac{1}{9}\right) = \frac{4n}{45}$), gives the first-order approximation $E\left(\frac{1}{\Sigma V_i^2}\right) \approx \frac{3}{n}$, and the second-order approximation
$$E\left(\frac{1}{\Sigma V_i^2}\right) \approx \frac{3}{n} + \frac{12}{5n^2}.$$

$.9N(0, 1) + .1N(0, 9)$ case:
This is not as straightforward as above. First compute
$$\mu = E(\Sigma V_i^2) = n\left(1 \cdot (.9) + \frac{1}{9}(.1)\right) = \frac{8.2}{9} n,$$
$$\sigma^2 = \text{Var}\left(\sum_{i=1}^{n} V_i^2\right) = n\,\text{Var}(V_i^2) = n\left\{\left(1 - \frac{8.2}{9}\right)^2 (.9) + \left(\frac{1}{9} - \frac{8.2}{9}\right)^2 (.1)\right\} = \frac{5.76\, n}{81}.$$
Thus the first-order approximation for $E\left(\frac{1}{\Sigma V_i^2}\right)$ is $\frac{9}{8.2 n}$, and the second-order approximation is
$$E\left(\frac{1}{\Sigma V_i^2}\right) \approx \frac{9}{8.2 n} + \frac{5.76 n}{81} \cdot \left(\frac{9}{8.2}\right)^3 \frac{1}{n^2} \approx \frac{9}{8.2 n} + \frac{6480}{68921\, n^2}.
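A quick simulation check of these approximations for the $U(0, 1)$ case; the choice $n = 10$ is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(15)
n, N = 10, 100_000
v2 = rng.uniform(size=(N, n)) ** 2
sim = np.mean(1.0 / v2.sum(axis=1))   # Monte Carlo estimate of E(1 / sum V_i^2)

first = 3.0 / n                        # first-order Delta approximation
second = 3.0 / n + 12.0 / (5.0 * n**2) # second-order Delta approximation
print(sim, first, second)
```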


References

Andrews, D. F., Bickel, P. J., Hampel, F. R., Huber, P. J., Rogers, W. H., and Tukey, J. W. (1972). Robust Estimates of Location. Princeton University Press: Princeton, N.J.

Gentle, J. E. (1998). Random Number Generation and Monte Carlo Methods. Springer-Verlag: NY.

Hammersley, J. M. and Handscomb, D. C. (1964) Monte Carlo Methods. Methuen: London.

Johnstone, Iain and Velleman, Paul (1984). "Monte Carlo Swindles Based on Variance Decompositions," Proc. Stat. Comp. Section, Amer. Statist. Assoc. (1983), 52-57.

Koehler, Kenneth J. (1981). "An Improvement of a Monte Carlo Technique using Asymptotic Moments with an Application to the Likelihood Ratio Statistics," Communications in Statistics - Simulation and Computation, 343-357.

Lange, K. (1998). Numerical Analysis for Statisticians. Springer-Verlag: NY.

Ripley, B. D. (1987). Stochastic Simulation. Wiley: NY.

Rothery, P. (1982) “The Use of Control Variates in Monte Carlo Estimation of Power,” Ap-plied Statistics, 31, 125-129.

Rubinstein, R. Y. (1981) Simulation and the Monte Carlo Method. Wiley: New York.

Schruben, Lee W. and Margolin, Barry H. (1978). "Pseudo-random number assignment in statistically designed simulation and distribution sampling experiments," J. Amer. Statist. Assoc., 73, 504-519.

Siegmund, D. (1976) “Importance Sampling in the Monte Carlo study of Sequential Tests,”Annals of Statistics, 4, 673-684.

Simon, Gary (1976) “Computer Simulation Swindles, with Applications to Estimates ofLocation and Dispersion,” Applied Statistics, 25, 266-274.
