On the convergence of quasi-random sampling/importance resampling



Mathematics and Computers in Simulation 81 (2010) 490–505


Bart Vandewoestyne, Ronald Cools∗
Department of Computer Science, Katholieke Universiteit Leuven, Celestijnenlaan 200A, B-3001 Heverlee, Belgium

Received 11 September 2007; accepted 7 September 2009; available online 11 December 2009

Abstract

This article discusses the general problem of generating representative point sets from a distribution known up to a multiplicative constant. The sampling/importance resampling (SIR) algorithm is known to be useful in this context. Moreover, the quasi-random sampling/importance resampling (QSIR) scheme, based on quasi-Monte Carlo methods, is a more recent modification of the SIR algorithm and was empirically shown to have better convergence. By making use of quasi-Monte Carlo theory, we derive upper bounds for the error of the QSIR scheme.
© 2009 IMACS. Published by Elsevier B.V. All rights reserved.

Keywords: Sampling/importance resampling; Weighted bootstrap; Quasi-Monte Carlo methods; Bayesian inference

1. Introduction

A frequently occurring problem in many fields of science is the generation of samples from a given density. Many methods exist to solve this problem; see for example the monographs of Gentle [5] or Devroye [3]. One of the possible methods uses the inverse of the cumulative distribution function to generate the samples. We will refer to this method as the inverse CDF method. However, when the density is only known up to a multiplicative constant, the inverse CDF method cannot be used.

A method to obtain samples that have an approximate distribution with density h(θ) was suggested by Rubin in [12,13] and is called sampling/importance resampling (SIR). It has the advantage that the normalizing constant of the target density is not needed, so instead of the density h(θ) any nonnegative proportional function ch(θ) can be used.

Two modifications of the SIR algorithm were recently proposed in [11] but not supported by theory. The modified algorithms are based on low-discrepancy point sets and sequences, and it is empirically shown that more representative samples can be obtained using the modified algorithms. In this article, we complement the empirical results from [11] with convergence proofs. The proofs are based on quasi-Monte Carlo theory.

The outline of this article is as follows: in Section 2 we recall concepts from quasi-Monte Carlo theory that are useful for the rest of this paper. The ideas behind the SIR algorithm and its recent QSIR variant are given in Section 3. In Section 4, we prove convergence for the QSIR algorithm. Numerical examples are presented in Section 5 and we conclude in Section 6.

∗ Corresponding author. Fax: +32 16 32 79 96. E-mail addresses: [email protected] (B. Vandewoestyne), [email protected] (R. Cools).

doi:10.1016/j.matcom.2009.09.004


2. Quasi-Monte Carlo

If an s-dimensional function f(x) is of bounded variation in the sense of Hardy and Krause on [0, 1]^s (see page 19 in [9] for a definition), then the theory of quasi-Monte Carlo integration provides an upper bound for the error of integration over the s-dimensional unit cube by means of the Koksma–Hlawka inequality

$$\left| \frac{1}{N} \sum_{n=1}^{N} f(x_n) - \int_{[0,1]^s} f(x)\, dx \right| \le V_{HK}(f)\, D_N^*(P).$$

Here, V_HK(f) is the variation of f on [0, 1]^s in the sense of Hardy and Krause, P is a point set in [0, 1)^s consisting of the points x_1, ..., x_N, and D*_N(P) is the star discrepancy of the points in P, defined as

$$D_N^*(P) = \sup_{x \in [0,1]^s} \left| \frac{1}{N} \sum_{n=1}^{N} I_{[0,x)}(x_n) - \lambda(x) \right|,$$

with λ(x) the volume of the box [0, x) defined by x, and where the indicator function of A is defined as

$$I_A(x) = \begin{cases} 1, & \text{if } x \in A, \\ 0, & \text{if } x \notin A. \end{cases}$$

If the star discrepancy of a point set is O((log N)^s / N), then the point set is called a low-discrepancy point set or quasi-random point set. Several constructions of s-dimensional low-discrepancy point sets exist (see for example [4,6,8,15]).

For one-dimensional point sets, an explicit expression for the star discrepancy can be derived (see [9], page 15). If we denote the one-dimensional point set consisting of the points x_1, ..., x_N by P and we assume that 0 ≤ x_1 ≤ x_2 ≤ ... ≤ x_N ≤ 1, then

$$D_N^*(P) = \frac{1}{2N} + \max_{1 \le n \le N} \left| x_n - \frac{2n-1}{2N} \right|.$$

From this, it follows that the one-dimensional point set P consisting of N points and having the lowest star discrepancy D*_N(P) is given by

$$x_i = \frac{2i-1}{2N}, \quad i = 1, \ldots, N. \tag{1}$$
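As a quick illustration (our own, not from the paper), the explicit formula above is directly computable. The following Python sketch evaluates D*_N(P) for a sorted one-dimensional point set and confirms that the midpoint set (1) attains the minimal value 1/(2N).

```python
# A minimal sketch: the explicit 1D star discrepancy
# D*_N(P) = 1/(2N) + max_n |x_n - (2n-1)/(2N)|, for sorted points in [0, 1].
def star_discrepancy_1d(points):
    N = len(points)
    return 1.0 / (2 * N) + max(
        abs(x - (2 * n - 1) / (2.0 * N)) for n, x in enumerate(points, start=1)
    )

N = 100
midpoints = [(2 * i - 1) / (2.0 * N) for i in range(1, N + 1)]  # point set (1)
assert abs(star_discrepancy_1d(midpoints) - 1.0 / (2 * N)) < 1e-15
```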

To measure how representative a multidimensional point set with certain associated weights is for a certain distribution, we will need the empirical cumulative distribution function of this point set.

Definition 1. Let x_1, ..., x_N be (not necessarily distinct) points in R^s with respective weights q_1, ..., q_N, normalized so that ∑_{i=1}^N q_i = 1. Then the s-dimensional function

$$F_N(x) = \sum_{i=1}^{N} q_i\, I_{(-\infty,x]}(x_i)$$

is called the empirical cumulative distribution function (ECDF) of x_1, ..., x_N.

If no weights are specified, we assume all weights to be equal so that the ECDF becomes

$$F_N(x) = \frac{1}{N} \sum_{i=1}^{N} I_{(-\infty,x]}(x_i).$$
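In code, Definition 1 is a direct sum; the one-dimensional sketch below (names are our own) evaluates the weighted ECDF at a point x.

```python
# A minimal sketch of the weighted ECDF of Definition 1 (one-dimensional case):
# F_N(x) = sum_i q_i * I_{(-inf, x]}(x_i), with the q_i summing to 1.
def weighted_ecdf(points, weights, x):
    return sum(q for xi, q in zip(points, weights) if xi <= x)

points = [0.2, 0.5, 0.9]
weights = [0.5, 0.3, 0.2]                   # normalized weights
print(weighted_ecdf(points, weights, 0.6))  # 0.5 + 0.3 = 0.8
```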

In the rest of this paper, we will use the concept of F-discrepancy as our quality parameter.


Definition 2. Let F(x) be a cumulative distribution function in R^s and P = {x_1, ..., x_N} a set or multiset of points in R^s with normalized weights q_1, ..., q_N. Then

$$D_N^*(P, F) = \sup_{x \in \mathbb{R}^s} |F_N(x) - F(x)| \tag{2}$$

is called the F-discrepancy of P with respect to F(x), where F_N(x) is the ECDF of the points x_1, ..., x_N.

This measure, proposed by Wang and Fang [18], gives an indication of the representativeness of a point set P = {x_1, ..., x_N} with respect to a distribution F. If we can find a set P* = {x*_1, ..., x*_N} such that

$$D_N^*(P^*, F) = \min_{P} D_N^*(P, F),$$

where P runs over all sets of N points in R^s, then P* is called a set of cdf-rep-points of F(x). Using the concept of F-discrepancy, it is possible to obtain the following variant of the Koksma–Hlawka inequality for integration over R^s instead of over the unit cube [1,7,10].

Theorem 3. Let f be of bounded variation on R^s in the sense of Hardy and Krause and let P = {x_1, ..., x_N} be a point set on R^s. Moreover, let g be a probability density function and G its cumulative distribution function on R^s. Then for any positive integer N we have that

$$\left| \frac{1}{N} \sum_{n=1}^{N} f(x_n) - \int_{\mathbb{R}^s} f(x)\, dG(x) \right| \le V_{HK}(f)\, D_N^*(P, G).$$

A proof of this theorem for general distributions G is presented in [1]. The following lemma is straightforward.

Lemma 4. Let A = {x_1, ..., x_N} be a point set in [0, 1]^s. Denote by x_{i,j} the jth coordinate of the ith point x_i. Let g be a multivariate PDF with CDF G having independent marginals:

$$G(x) = G(x_1, \ldots, x_s) = \prod_{i=1}^{s} G_i(x_i). \tag{3}$$

Take B = {y_1, ..., y_N} to be the g-distributed sequence obtained by independently applying the inverse CDF to each dimension:

$$y_k = \left( G_1^{-1}(x_{k,1}), \ldots, G_s^{-1}(x_{k,s}) \right), \quad k = 1, \ldots, N.$$

Then D*_N(B, G) = D*_N(A).

Proof. By definition,

$$D_N^*(B, G) = \sup_{x \in \mathbb{R}^s} \left| \frac{1}{N} \sum_{i=1}^{N} I_{(-\infty,x]}(y_i) - G(x) \right|.$$

Because all G_i(x), i = 1, ..., s, are monotone increasing and independent, y_i ∈ (−∞, x] ⇔ x_i ∈ [0, G(x)], hence I_{(−∞,x]}(y_i) = I_{[0,G(x)]}(x_i). Therefore,

$$D_N^*(B, G) = \sup_{x \in \mathbb{R}^s} \left| \frac{1}{N} \sum_{i=1}^{N} I_{[0,G(x)]}(x_i) - G(x) \right| = \sup_{y \in [0,1]^s} \left| \frac{1}{N} \sum_{i=1}^{N} I_{[0,y]}(x_i) - \lambda(y) \right| = D_N^*(A). \qquad \square$$
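Lemma 4 is exactly what one implements when sampling a product distribution from a low-discrepancy set: push each coordinate through the corresponding inverse marginal CDF. A sketch, assuming SciPy's qmc module for a Halton set and standard normal marginals (our choice of example, not the paper's):

```python
# A sketch of the component-wise inverse CDF transform of Lemma 4: the
# G-discrepancy of the transformed set B equals the star discrepancy of A.
import numpy as np
from scipy.stats import norm, qmc

s, N = 2, 256
sampler = qmc.Halton(d=s, scramble=False)
sampler.fast_forward(1)   # skip the origin, where norm.ppf would give -inf
A = sampler.random(N)     # low-discrepancy point set A in [0, 1)^s
B = norm.ppf(A)           # y_k = (G_1^{-1}(x_{k,1}), ..., G_s^{-1}(x_{k,s}))
```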

Recall that the point set (1) is the one-dimensional point set with the lowest value for the star discrepancy. We therefore have the following corollary.

Corollary 5 (See [18], page 156). Let F(x) be a univariate continuous CDF and F^{-1}(y) its inverse. Then the set

$$Z = \left\{ z_i = F^{-1}\!\left( \frac{2i-1}{2N} \right),\ i = 1, \ldots, N \right\}$$

contains the cdf-rep-points of size N of F(x), with D*_N(Z, F) optimal and equal to 1/(2N).
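Corollary 5 gives a recipe for optimally representative one-dimensional samples. A hedged sketch for a standard normal target (our example; scipy's norm.ppf plays the role of F^{-1}):

```python
# A sketch of Corollary 5: cdf-rep-points z_i = F^{-1}((2i-1)/(2N)),
# with F-discrepancy exactly 1/(2N).
import numpy as np
from scipy.stats import norm

N = 50
u = (2 * np.arange(1, N + 1) - 1) / (2.0 * N)  # the midpoint set (1)
z = norm.ppf(u)                                # cdf-rep-points of the normal CDF
```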


Before generalizing Corollary 5 to more dimensions, we review the concept of multinomial resampling (see, e.g., page 24 of [16] or Chapter 9 in [2]). In this resampling method, we start from the ECDF F_N(θ) formed by the samples θ_1, ..., θ_N and their respective weights q_1, ..., q_N with ∑_{i=1}^N q_i = 1. From this ECDF, n new samples θ*_i are drawn independently. This is done as follows:

(1) draw n independent uniforms z_i on the interval (0, 1];
(2) set the index I_i = F̃_N^{-1}(z_i) and θ*_i = θ_{I_i}, for i = 1, ..., n, where F̃_N^{-1}(z) is the inverse of the ECDF associated with the weights q_i, that is, F̃_N^{-1}(z) = i for z ∈ (∑_{j=1}^{i-1} q_j, ∑_{j=1}^{i} q_j].

The new samples are all given the same weight 1/n and we will represent the ECDF belonging to these new samples by F_{N,n}(θ). Note also that the z_i are one-dimensional points; a sketch of the procedure in code is given below.
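The inverse-ECDF map F̃_N^{-1} amounts to locating z in the cumulative weights, so the procedure can be sketched as follows (numpy.searchsorted performs the interval lookup; the names are ours):

```python
# A sketch of multinomial resampling: theta*_i = theta_{I_i} with
# I_i = F~_N^{-1}(z_i), i.e. the index i such that z_i lies in
# (q_1 + ... + q_{i-1}, q_1 + ... + q_i].
import numpy as np

def multinomial_resample(theta, q, z):
    """theta: array of N samples; q: normalized weights; z: uniforms in (0, 1]."""
    cum = np.cumsum(q)                          # right interval endpoints
    idx = np.searchsorted(cum, z, side='left')  # smallest i with cum[i] >= z
    return theta[idx]
```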

For s-dimensional ECDFs, this leads to the following theorem.

Theorem 6. Let F_N(θ) be the ECDF of the ordered s-dimensional points θ_1 ≤ θ_2 ≤ ... ≤ θ_N with their respective weights q_i, where we use

$$\theta_i \le \theta_j \iff \begin{cases} \theta_{i,1} \le \theta_{j,1} \\ \quad \vdots \\ \theta_{i,s} \le \theta_{j,s} \end{cases}$$

to denote component-wise ordering. If the point set

$$Z = \left\{ z_i = \frac{2i-1}{2n},\ i = 1, \ldots, n \right\}$$

is used for the multinomial resampling, then

$$\sup_{\theta \in \mathbb{R}^s} |F_N(\theta) - F_{N,n}(\theta)| < \frac{1}{2n}.$$

Proof. Suppose the supremum occurs at θ_j. If other samples are equal to θ_j, then take i to be the largest index so that θ_j = θ_i. Due to the ordering of the points,

$$\begin{cases} \theta_k \le \theta_i & \text{for } k = 1, \ldots, i, \\ \theta_i < \theta_l & \text{for } l = i+1, \ldots, n, \end{cases} \tag{4}$$

and consequently

$$F_{N,n}(\theta_j) = F_{N,n}(\theta_i) = \frac{1}{n} \sum_{k=1}^{n} I_{(-\infty,\theta_i]}(\theta_k) = \frac{i}{n}.$$

Now because θ_i < θ_{i+1}, we must have that i/n ≤ F_N(θ) < (2i + 1)/(2n) over the domain (−∞, θ_{i+1}] \ (−∞, θ_i]. Making the wrong assumption that F_N(θ) ≥ (2i + 1)/(2n) would lead to θ_i = θ_{i+1}, which contradicts (4). From i/n ≤ F_N(θ) < (2i + 1)/(2n), we then finally have

$$\sup_{\theta \in \mathbb{R}^s} |F_N(\theta) - F_{N,n}(\theta)| < \frac{1}{2n}. \qquad \square$$

Fig. 1 illustrates the above for a one-dimensional ECDF.

For Theorem 6 to be applicable, the points must be ordered so that we have an ECDF that looks similar to the one in the left part of Fig. 2. In the case of unordered points that have an ECDF similar to the one in the right part of Fig. 2, we have the following theorem.


Fig. 1. For ECDFs, when using the lowest-discrepancy point set, observe how the F-discrepancy is always lower than or equal to 1/(2n).

Theorem 7. Let F_N(θ) be the ECDF of the unordered s-dimensional points θ_1, θ_2, ..., θ_N with their respective weights q_i. If the point set

$$Z = \left\{ z_i = \frac{2i-1}{2n},\ i = 1, \ldots, n \right\} \tag{5}$$

is used in place of the independent uniforms for the multinomial resampling, then

$$\sup_{\theta \in \mathbb{R}^s} |F_N(\theta) - F_{N,n}(\theta)| \le \frac{N}{n}.$$

Proof. We first create N′ ≤ N new samples θ′_i with weights q′_i so that each θ′_i occurs only once. We do this by adding the weights of equal samples in the original multiset to form the new weight for that sample. Now if, during the multinomial resampling with the points from (5), the sample θ′_i was selected k_i times, we know that

$$\frac{k_i - 1}{n} \le q'_i < \frac{k_i}{n}, \tag{6}$$

with 0 ≤ k_i ≤ n for i = 1, ..., N′ and ∑_{i=1}^{N′} k_i = n. We then have

$$\sup_{\theta \in \mathbb{R}^s} |F_N(\theta) - F_{N,n}(\theta)| = \sup_{\theta \in \mathbb{R}^s} \left| \sum_{i=1}^{N} q_i\, I_{(-\infty,\theta]}(\theta_i) - \frac{1}{n} \sum_{i=1}^{n} I_{(-\infty,\theta]}(\theta_i^*) \right| = \sup_{\theta \in \mathbb{R}^s} \left| \sum_{i=1}^{N'} q'_i\, I_{(-\infty,\theta]}(\theta'_i) - \frac{1}{n} \sum_{i=1}^{N'} k_i\, I_{(-\infty,\theta]}(\theta'_i) \right| = \sup_{\theta \in \mathbb{R}^s} \left| \sum_{i=1}^{N'} \left( q'_i - \frac{k_i}{n} \right) I_{(-\infty,\theta]}(\theta'_i) \right|.$$

Fig. 2. Cumulative distribution of three ordered (left) and three unordered (right) two-dimensional points.


From (6) it follows that |q′_i − k_i/n| ≤ 1/n, which leads to

$$\sup_{\theta \in \mathbb{R}^s} |F_N(\theta) - F_{N,n}(\theta)| \le \frac{1}{n} \sup_{\theta \in \mathbb{R}^s} \left| \sum_{i=1}^{N'} I_{(-\infty,\theta]}(\theta'_i) \right| \le \frac{N'}{n} \le \frac{N}{n}. \qquad \square$$

3. Quasi-random sampling/importance resampling

The SIR algorithm (also called 'importance resampling' [17] or 'weighted bootstrap' [14]), proposed in [12], is a non-iterative algorithm that is useful in many contexts, for example when generating samples from a distribution known up to a multiplicative constant. Suppose we want to generate samples from a distribution

$$f(\theta) = \frac{h(\theta)}{\int_{\mathbb{R}^s} h(\theta)\, d\theta}, \tag{7}$$

where the normalizing constant ∫_{R^s} h(θ) dθ could be unknown. Let θ_1, ..., θ_N be a random sample from an approximating distribution (also called the proposal distribution) with density g such that g(θ) ≠ 0 for any θ in the support of f. Now define weights

$$\omega_i = \frac{h(\theta_i)}{g(\theta_i)}, \quad \text{for } i = 1, \ldots, N,$$

and normalize them to obtain

$$q_i = \frac{\omega_i}{\sum_{j=1}^{N} \omega_j}, \quad \text{for } i = 1, \ldots, N.$$

If we generate n samples θ* from the discrete distribution over {θ_1, ..., θ_N} which places mass q_i at θ_i, then θ* is approximately distributed according to f. When N/n goes to infinity, the probability distribution of the generated values is the correct distribution. A proof for the multidimensional case is given in [11]. In the same paper, the authors propose two modifications of the original SIR algorithm. These modifications use low-discrepancy point sets and sequences. The samples generated using these algorithms are empirically verified in 1 and 2 dimensions to be more representative (in the sense of the F-discrepancy) than those obtained with the original algorithm. The first algorithm can be written as follows:

Algorithm 1. Execute the following steps:

(1) Generate a quasi-random sample θ_1, ..., θ_N from the proposal density g.
(2) Calculate the weights q_i.
(3) Obtain a random sample of size n from the discrete distribution over {θ_1, ..., θ_N} which places mass q_i at θ_i.

Note that in the first step, a quasi-random sample is generated instead of a (pseudo-)random one. In the third step, multinomial resampling with (pseudo-)random numbers is used to introduce randomness through sampling with replacement; a sketch in code is given below.
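The following Python sketch of Algorithm 1 assumes a one-dimensional proposal sampled by the inverse CDF method from the midpoint set (1); h, g_pdf and g_inv_cdf are user-supplied callables (our naming, not the paper's).

```python
# A sketch of QSIR Algorithm 1: quasi-random proposal sample,
# (pseudo-)random multinomial resampling.
import numpy as np

def qsir_algorithm1(h, g_pdf, g_inv_cdf, N, n, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    u = (2 * np.arange(1, N + 1) - 1) / (2.0 * N)  # 1D low-discrepancy set (1)
    theta = g_inv_cdf(u)               # step 1: quasi-random proposal sample
    w = h(theta) / g_pdf(theta)        # unnormalized importance weights
    q = w / w.sum()                    # step 2: normalized weights q_i
    z = rng.uniform(0.0, 1.0, size=n)  # step 3: random uniforms for resampling
    return theta[np.searchsorted(np.cumsum(q), z, side='left')]
```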

The second algorithm is completely deterministic.

Page 7: On the convergence of quasi-random sampling/importance resampling

496 B. Vandewoestyne, R. Cools / Mathematics and Computers in Simulation 81 (2010) 490–505

Algorithm 2. Execute the following steps:

(1) Generate a quasi-random sample θ_1, ..., θ_N from the proposal density g.
(2) Calculate the weights q_i.
(3) Calculate {u_i = (2i − 1)/(2n), i = 1, ..., n} for a sample size n.
(4) Use {u_1, ..., u_n} (instead of n (pseudo-)random numbers) to do multinomial resampling on the discrete distribution placing mass q_i at θ_i.

Note that there is no randomness in Algorithm 2. Instead, the lowest discrepancy point set on [0, 1] of size n is usedas a basis for the multinomial resampling in step 4.
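Under the same assumptions as the previous sketch, the deterministic variant only swaps the uniforms in the resampling step for the midpoints (2i − 1)/(2n):

```python
# A sketch of QSIR Algorithm 2: fully deterministic, using the midpoints
# u_i = (2i-1)/(2n) in place of random uniforms in the resampling step.
import numpy as np

def qsir_algorithm2(h, g_pdf, g_inv_cdf, N, n):
    u = (2 * np.arange(1, N + 1) - 1) / (2.0 * N)  # step 1: point set (1)
    theta = g_inv_cdf(u)
    w = h(theta) / g_pdf(theta)
    q = w / w.sum()                                # step 2: weights q_i
    z = (2 * np.arange(1, n + 1) - 1) / (2.0 * n)  # step 3: deterministic uniforms
    return theta[np.searchsorted(np.cumsum(q), z, side='left')]  # step 4
```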

4. Convergence results for QSIR

In this section, we investigate convergence properties and derive error bounds for the two QSIR schemes proposed in [11], using the definitions and theorems given in Section 2. In the remaining part of this paper, we write f(N) ∈ O(g(N)) to indicate that f belongs to the set of all functions dominated by g. More formally, f(N) ∈ O(g(N)) as N → ∞ if and only if

$$\exists N_0,\ \exists M > 0 \ \text{such that}\ |f(N)| \le M\, |g(N)| \ \text{for}\ N > N_0.$$

We start with the following theorem.

Theorem 8. Assume the PDF g has a cumulative distribution function of the form (3) and independent inverse transformations are used to generate the set of samples P = {θ_1, ..., θ_N}. Assume furthermore that in the inverse CDF method, a point set L on [0, 1)^s of size N and with star discrepancy D*_N(L) is used. Then, after the first step in both algorithms, the G-discrepancy of P has the value D*_N(L).

Proof. According to Lemma 4, the suggested transformation is discrepancy preserving, so that D*_N(P, G) = D*_N(L). □

In one dimension, the set L = {(2i − 1)/(2N), i = 1, ..., N} is the point set with the lowest possible star discrepancy value 1/(2N). Therefore, for one-dimensional cases, after the first step in both algorithms the G-discrepancy of the samples θ_i will have the optimal value 1/(2N).

To further prove convergence for QSIR Algorithms 1 and 2, we need the following lemmas:

Lemma 9. If b(N) ∈ O(1) and

$$|a - a(N)| \in O(r_a(N)), \qquad |b - b(N)| \in O(r_b(N)),$$

then

$$\left| \frac{a}{b} - \frac{a(N)}{b(N)} \right| \in O(\max\{|r_a(N)|, |r_b(N)|\}).$$

Proof. From

$$\frac{a}{b} - \frac{a(N)}{b(N)} = \frac{a\,(b(N) - b)}{b\, b(N)} - \frac{a(N) - a}{b(N)}$$

follows

$$\left| \frac{a}{b} - \frac{a(N)}{b(N)} \right| \le \left| \frac{a}{b} \right| \left| \frac{1}{b(N)} \right| |b(N) - b| + \left| \frac{1}{b(N)} \right| |a(N) - a| \in O(r_b(N)) + O(r_a(N)) \subseteq O(\max\{|r_a(N)|, |r_b(N)|\}). \qquad \square$$


Lemma 10. Denote by F_N(θ) the discrete CDF of the samples θ_1, ..., θ_N and their respective weights q_1, ..., q_N obtained after steps 1 and 2 in both Algorithms 1 and 2. If (h(t)/g(t)) I_{(−∞,θ]}(t) is of bounded variation in the sense of Hardy and Krause for any θ in R^s, then for both Algorithms 1 and 2 and with a proposal distribution G of the form (3), we have that

$$\sup_{\theta \in \mathbb{R}^s} |F(\theta) - F_N(\theta)| \in O(\kappa(N)),$$

with

$$\kappa(N) = \begin{cases} \dfrac{1}{N} & \text{if } s = 1, \\[6pt] \dfrac{(\log N)^s}{N} & \text{if } s > 1. \end{cases}$$

Proof. The θ_i are sampled quasi-randomly from G(θ) using a low-discrepancy point set L with star discrepancy D*_N(L). If we regard θ as a parameter and apply Theorem 3 with the function (h(t)/g(t)) I_{(−∞,θ]}(t), this results in

$$\left| \int_{\mathbb{R}^s} \frac{h(t)}{g(t)}\, I_{(-\infty,\theta]}(t)\, dG(t) - \frac{1}{N} \sum_{i=1}^{N} \frac{h(\theta_i)}{g(\theta_i)}\, I_{(-\infty,\theta]}(\theta_i) \right| \le V_{HK}\!\left( \frac{h(t)}{g(t)}\, I_{(-\infty,\theta]}(t) \right) D_N^*(P, G) \le V_{HK}\!\left( \frac{h(t)}{g(t)}\, I_{(-\infty,\theta]}(t) \right) D_N^*(L)$$

and also

$$\left| \int_{\mathbb{R}^s} \frac{h(t)}{g(t)}\, dG(t) - \frac{1}{N} \sum_{i=1}^{N} \frac{h(\theta_i)}{g(\theta_i)} \right| \le V_{HK}\!\left( \frac{h(t)}{g(t)} \right) D_N^*(P, G) \le V_{HK}\!\left( \frac{h(t)}{g(t)} \right) D_N^*(L).$$

Now since (1/N) ∑_{i=1}^N h(θ_i)/g(θ_i) ∈ O(1), and since D*_N(L) ∈ O((log N)^s / N) for an s-dimensional low-discrepancy point set while D*_N(L) = 1/(2N) for the one-dimensional point set with lowest discrepancy, applying Lemma 9 results in

$$\left| \frac{\int_{\mathbb{R}^s} (h(t)/g(t))\, I_{(-\infty,\theta]}(t)\, dG(t)}{\int_{\mathbb{R}^s} (h(t)/g(t))\, dG(t)} - \frac{(1/N) \sum_{i=1}^{N} (h(\theta_i)/g(\theta_i))\, I_{(-\infty,\theta]}(\theta_i)}{(1/N) \sum_{i=1}^{N} (h(\theta_i)/g(\theta_i))} \right| \in O(\kappa(N)),$$

which simplifies to

$$\left| \frac{1}{\int_{\mathbb{R}^s} h(t)\, dt} \int_{-\infty}^{\theta} \frac{h(t)}{g(t)}\, g(t)\, dt - \frac{(1/N) \sum_{i=1}^{N} (h(\theta_i)/g(\theta_i))\, I_{(-\infty,\theta]}(\theta_i)}{(1/N) \sum_{i=1}^{N} (h(\theta_i)/g(\theta_i))} \right| \in O(\kappa(N)),$$

that is,

$$|F(\theta) - F_N(\theta)| \in O(\kappa(N)). \tag{8}$$

Formula (8) is valid for all θ ∈ R^s, so also for the value where the supremum occurs. □

Using Lemma 10, we are now ready to show how the error of the QSIR approximation of Algorithm 1 converges.

Theorem 11. Suppose the proposal CDF G is of the form (3). If the function (h(t)/g(t)) I_{(−∞,θ]}(t) is of bounded variation in the sense of Hardy and Krause over R^s for all θ ∈ R^s, then for Algorithm 1

$$\sup_{\theta \in \mathbb{R}^s} |F(\theta) - F_{N,n}(\theta)| \in O\!\left( \max\!\left( \kappa(N), \frac{1}{\sqrt{n}} \right) \right).$$


Proof. We know that

$$\sup_{\theta \in \mathbb{R}^s} |F(\theta) - F_{N,n}(\theta)| \le \sup_{\theta \in \mathbb{R}^s} |F(\theta) - F_N(\theta)| + \sup_{\theta \in \mathbb{R}^s} |F_N(\theta) - F_{N,n}(\theta)|.$$

Now because the multinomial resampling in Algorithm 1 is based on random numbers, we know from Donsker's theorem that for any given θ

$$E[F_{N,n}(\theta)] = F_N(\theta) \quad \text{and} \quad \operatorname{Var}[F_{N,n}(\theta)] = n^{-1} F_N(\theta)(1 - F_N(\theta)),$$

so that we can write

$$\sup_{\theta \in \mathbb{R}^s} |F(\theta) - F_{N,n}(\theta)| \le \sup_{\theta \in \mathbb{R}^s} |F_N(\theta) - F(\theta)| + O\!\left( \frac{1}{\sqrt{n}} \right).$$

Applying Lemma 10 then completes the proof. □

The following theorem gives an error bound for the approximation in the resampling step of Algorithm 2.

Theorem 12. In step 4 of Algorithm 2, let θ*_1, ..., θ*_n be the samples (with weights all equal to n^{-1}) drawn (by the quasi-random multinomial resampling procedure) from the ECDF of the samples θ_1, ..., θ_N with their respective weights q_1, ..., q_N. If we denote the ECDF corresponding to θ*_1, ..., θ*_n by

$$F_{N,n}(\theta) = \frac{1}{n} \sum_{i=1}^{n} I_{(-\infty,\theta]}(\theta_i^*),$$

and the ECDF of the samples θ_1, ..., θ_N with their respective weights q_1, ..., q_N by

$$F_N(\theta) = \sum_{i=1}^{N} q_i\, I_{(-\infty,\theta]}(\theta_i),$$

then

$$\sup_{\theta \in \mathbb{R}^s} |F_{N,n}(\theta) - F_N(\theta)| \le \tau(N, n),$$

where

$$\tau(N, n) = \begin{cases} \dfrac{1}{2n} & \text{if } s = 1, \\[6pt] \dfrac{N}{n} & \text{if } s > 1. \end{cases}$$

Proof. For s = 1, the samples will be ordered after the first step in the algorithm, so that we can apply Theorem 6. If s > 1, we cannot be certain about the ordering, and applying Theorem 7 then completes the proof. □

An upper bound for the error of the QSIR approximation of Algorithm 2 is given by the following theorem.

Theorem 13. Suppose the CDF G of the proposal is of the form (3). If the function (h(t)/g(t)) I_{(−∞,θ]}(t) is of bounded variation in the sense of Hardy and Krause over R^s for all θ ∈ R^s, then for Algorithm 2

$$\sup_{\theta \in \mathbb{R}^s} |F_{N,n}(\theta) - F(\theta)| \le \tau(N, n) + O(\kappa(N)).$$

Proof. Again, we know that

$$\sup_{\theta \in \mathbb{R}^s} |F(\theta) - F_{N,n}(\theta)| \le \sup_{\theta \in \mathbb{R}^s} |F_N(\theta) - F_{N,n}(\theta)| + \sup_{\theta \in \mathbb{R}^s} |F(\theta) - F_N(\theta)|,$$

which by Theorem 12 and Lemma 10 reduces to

$$\sup_{\theta \in \mathbb{R}^s} |F(\theta) - F_{N,n}(\theta)| \le \tau(N, n) + O(\kappa(N)). \qquad \square$$


Table 1
Order of convergence for Algorithms 1 and 2 in different scenarios.

                             s = 1                      s > 1
Algorithm 1                  O(max(1/N, 1/√n))          O(max((log N)^s/N, 1/√n))
Algorithm 2 (ordered)        1/(2n) + O(1/N)            1/(2n) + O((log N)^s/N)
Algorithm 2 (unordered)      /                          N/n + O((log N)^s/N)

Table 1 summarizes the results from this section. For Algorithm 2, the labels 'ordered' and 'unordered' are used to distinguish between the cases where the points are either ordered or unordered after the first step in the algorithm.

5. Numerical examples

We first illustrate the previous convergence results on the first example from Table 1 in [11]. In this one-dimensional example, g is the standard Cauchy distribution and f is the t-distribution with two degrees of freedom:

$$g(x) = \frac{1}{\pi(1 + x^2)}, \qquad f(x) = h(x) = \frac{\Gamma(3/2)}{\sqrt{2\pi}\, \Gamma(1)} \left( 1 + \frac{x^2}{2} \right)^{-3/2}.$$

Observe that the normalizing constant for h is 1, so that f = h and we have

$$\frac{h(x)}{g(x)} = \frac{f(x)}{g(x)} = \frac{\pi}{2\sqrt{2}}\, (1 + x^2) \left( 1 + \frac{x^2}{2} \right)^{-3/2},$$

which is shown in Fig. 3. It is clear that h/g is a function of bounded variation in the sense of Hardy and Krause. From Table 1 we know that in one dimension, when using Algorithm 1, we have

$$\sup_{\theta \in \mathbb{R}} |F_{N,n}(\theta) - F(\theta)| \in O\!\left( \max\!\left( \frac{1}{N}, \frac{1}{\sqrt{n}} \right) \right),$$

while when using Algorithm 2 this becomes

$$\sup_{\theta \in \mathbb{R}} |F_{N,n}(\theta) - F(\theta)| \le O\!\left( \frac{1}{N} \right) + \frac{1}{2n} \in O\!\left( \max\!\left( \frac{1}{N}, \frac{1}{n} \right) \right).$$
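To reproduce this example numerically, one can use the Cauchy inverse CDF G^{-1}(u) = tan(π(u − 1/2)) and the closed-form t_2 CDF F(x) = (1 + x/√(x² + 2))/2. The sketch below (our own reconstruction of the experiment, with N = 1000 and n = 100) computes the F-discrepancy of the deterministic Algorithm 2 output.

```python
# A sketch of the 1D experiment: g standard Cauchy, f = h the t_2 density,
# resampled deterministically as in Algorithm 2.
import numpy as np

g = lambda x: 1.0 / (np.pi * (1.0 + x**2))
h = lambda x: (1.0 + x**2 / 2.0) ** -1.5 / (2.0 * np.sqrt(2.0))
g_inv = lambda u: np.tan(np.pi * (u - 0.5))          # Cauchy inverse CDF
F = lambda x: 0.5 * (1.0 + x / np.sqrt(x**2 + 2.0))  # t_2 CDF, closed form

N, n = 1000, 100
theta = g_inv((2 * np.arange(1, N + 1) - 1) / (2.0 * N))  # step 1
q = h(theta) / g(theta); q /= q.sum()                     # step 2
z = (2 * np.arange(1, n + 1) - 1) / (2.0 * n)             # step 3
ts = np.sort(theta[np.searchsorted(np.cumsum(q), z, side='left')])  # step 4
i = np.arange(1, n + 1)
# F-discrepancy of the equally weighted output (1D Kolmogorov form)
d = np.maximum(np.abs(F(ts) - i / n), np.abs(F(ts) - (i - 1) / n)).max()
print(d)  # Table 1 predicts roughly 1/(2n) + O(1/N)
```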

Fig. 3. The function h(x)/g(x) of the numerical example.

The error plots in Fig. 4 show on a log–log scale how the F-discrepancy converges as a function of n for a fixed value of N. The graph for Algorithm 1 with N = 25 shows that the dominating factor is the O(1/√n) term for n relatively small compared to N. As n gets larger, this term becomes negligible compared to the O(1/N) term, which is fixed here to a certain value. For Algorithm 2, this is even better visible due to the deterministic nature of the sampling. Here, N = 100 and we can see how the convergence follows the line 1/(2n) for n-values relatively small compared to N. As n gets larger and the O(1/n) term becomes negligible, we see how the F-discrepancy approaches our fixed O(1/N) term.

Fig. 4. Error convergence with fixed N.

Fig. 5 shows the results for fixed n-values. In the graph for Algorithm 1 with n = 10,000, we see how for N-values relatively small compared to √n the error convergence closely follows the line 1/(2N); but as soon as N gets relatively large compared to √n, the O(1/N) term is negligible compared to the fixed O(1/√n) term. Again, due to its deterministic nature, this is even more visible in the graph for Algorithm 2, where n = 100.

As a second example, we reconsider the two-dimensional problem that was used in [11] to illustrate the QSIR algorithm for sampling from a bivariate posterior distribution. Over [0, 1]^2, the likelihood function takes the form

$$h(\theta_1, \theta_2) = \prod_{i=1}^{3} \left[ \sum_{j_i} \binom{n_{i1}}{j_i} \binom{n_{i2}}{y_i - j_i}\, \theta_1^{j_i} (1 - \theta_1)^{n_{i1} - j_i}\, \theta_2^{y_i - j_i} (1 - \theta_2)^{n_{i2} - y_i + j_i} \right],$$

where max{0, y_i − n_{i2}} ≤ j_i ≤ min{n_{i1}, y_i}; outside [0, 1]^2 it is zero. The parameters considered are given in Table 2 and a plot of the likelihood function is given in Fig. 6. The normalizing constant from (7) is easy to compute:

$$f(\theta_1, \theta_2) = \frac{29993}{7927920}\, h(\theta_1, \theta_2).$$
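For reference, evaluating h is a direct transcription of the formula with the parameters of Table 2; a minimal sketch (math.comb provides the binomial coefficients):

```python
# A sketch evaluating the bivariate likelihood h(theta1, theta2),
# with the parameters (n_i1, n_i2, y_i) taken from Table 2.
from math import comb

PARAMS = [(5, 5, 7), (6, 4, 5), (4, 6, 6)]  # (n_i1, n_i2, y_i) for i = 1, 2, 3

def likelihood(t1, t2):
    prod = 1.0
    for n1, n2, y in PARAMS:
        prod *= sum(comb(n1, j) * comb(n2, y - j)
                    * t1**j * (1 - t1)**(n1 - j)
                    * t2**(y - j) * (1 - t2)**(n2 - y + j)
                    for j in range(max(0, y - n2), min(n1, y) + 1))
    return prod

print(likelihood(0.9, 0.8))  # one point of the surface shown in Fig. 6
```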


Fig. 5. Error convergence with fixed n.

In this example, the proposal density g is chosen as the uniform distribution, so to obtain the samples after the first step we can simply take the points of whatever low-discrepancy point set or sequence we wish; no transformation is needed. For this experiment, Hammersley points (see, e.g., [18], page 29) were used as the low-discrepancy point set. Fig. 7 shows the resulting θ* samples obtained by Algorithm 1 and Algorithm 2, respectively, together with a contour plot of h(θ_1, θ_2).

Fig. 6. Likelihood function h(θ1, θ2) for the two-dimensional example.


Table 2
Parameters for the two-dimensional example.

i    n_i1    n_i2    y_i
1    5       5       7
2    6       4       5
3    4       6       6

Fig. 7. Resulting samples at the end of Algorithm 1 (upper) and Algorithm 2 (lower).


The larger a dot, the more times that particular Hammersley point was resampled in the second step of the algorithm. Furthermore, from Table 1 we know for Algorithm 1 that

$$\sup_{\theta \in \mathbb{R}^2} |F(\theta) - F_{N,n}(\theta)| \in O\!\left( \max\!\left( \frac{(\log N)^2}{N}, \frac{1}{\sqrt{n}} \right) \right) \tag{9}$$

and for Algorithm 2

$$\sup_{\theta \in \mathbb{R}^2} |F(\theta) - F_{N,n}(\theta)| \le \frac{N}{n} + O\!\left( \frac{(\log N)^2}{N} \right). \tag{10}$$

To generate error convergence plots, we did not compute sup_{θ ∈ R^2} |F(θ) − F_{N,n}(θ)|. For reasons of computational simplicity, we computed the relative errors for the approximation of the expected values of the marginal distributions. It is easy to verify that

$$E[\theta_1] = \int_0^1 \!\! \int_0^1 \theta_1\, f(\theta_1, \theta_2)\, d\theta_1\, d\theta_2 = \frac{3325600}{6628453}, \qquad E[\theta_2] = \int_0^1 \!\! \int_0^1 \theta_2\, f(\theta_1, \theta_2)\, d\theta_1\, d\theta_2 = \frac{4472580}{6628453}.$$

Fig. 8. Relative errors for the expected value of the marginal distributions.


Fig. 8 then shows the relative errors

$$\left| \frac{(1/n) \sum_{i=1}^{n} \theta_{1,i}^* - E[\theta_1]}{E[\theta_1]} \right| \quad \text{and} \quad \left| \frac{(1/n) \sum_{i=1}^{n} \theta_{2,i}^* - E[\theta_2]}{E[\theta_2]} \right| \tag{11}$$

for both Algorithms 1 and 2 with n = 10^5 and N = 2^i for i = 1, ..., 14. Note that although we use (11) and not sup_{θ ∈ R^2} |F(θ) − F_{N,n}(θ)| as our error measure, we still observe the convergence from formulas (9) and (10): for n relatively large compared to N, the behavior is dominated by the O((log N)^2/N) term.

6. Conclusions

By making use of quasi-Monte Carlo theory, asymptotic error bounds for the quasi-random sampling/importance resampling scheme were derived. Our analysis explains the empirical results from [11] and is summarized in Table 1.

This table reveals, among other things, that for one-dimensional distributions, when taking N = n, Algorithm 1 will be O(1/√N), while Algorithm 2 is O(1/N) and is thus to be preferred in this case.

In a two-dimensional experiment, we used the error of the approximation for the expected value of the marginal distributions to determine the convergence of the QSIR algorithms, instead of the F-discrepancy measure. Nevertheless, we observed the convergence rates given in the table.

Finally, it should be mentioned that the order by which the QSIR algorithms converge depends on s, N and n, as can be seen from the summary in Table 1. Because the F-discrepancy is never larger than 1, the relative sizes of s, N and n will influence the usefulness of the bounds obtained.

Acknowledgements

The authors acknowledge support from the Research Programme of the Research Foundation - Flanders (FWO - Vlaanderen), project G.0597.06. This paper presents research results of the Belgian Network DYSCO (Dynamical Systems, Control and Optimization), funded by the Interuniversity Attraction Poles Programme, initiated by the Belgian State, Science Policy Office. The scientific responsibility rests with its author(s).

References

[1] P.O. Chelson, Quasi-random techniques for Monte Carlo methods, Ph.D. Thesis, The Claremont Graduate School, Claremont, CA, 1976.
[2] W.G. Cochran, Sampling Techniques, Wiley Series in Probability and Mathematical Statistics, 3rd ed., John Wiley and Sons, Inc., New York, 1977.
[3] L. Devroye, Non-Uniform Random Variate Generation, Springer-Verlag, New York, 1986.
[4] H. Faure, Discrépance de suites associées à un système de numération (en dimension s), Acta Arithmetica 41 (1982) 337–351.
[5] J.E. Gentle, Random Number Generation and Monte Carlo Methods, Statistics and Computing, 2nd ed., Springer-Verlag, New York, 2003.
[6] J.H. Halton, On the efficiency of certain quasi-random sequences of points in evaluating multi-dimensional integrals, Numerische Mathematik 2 (1960) 84–90.
[7] J. Hartinger, R.F. Kainhofer, R.F. Tichy, Quasi-Monte Carlo algorithms for unbounded weighted integration problems, Journal of Complexity 20 (2004) 654–668.
[8] H. Niederreiter, Low-discrepancy and low-dispersion sequences, Journal of Number Theory 30 (1) (1988) 51–70.
[9] H. Niederreiter, Random Number Generation and Quasi-Monte Carlo Methods, volume 63 of SIAM CBMS-NSF Regional Conference Series in Applied Mathematics, SIAM, Philadelphia, 1992.
[10] G. Ökten, Error reduction techniques in quasi-Monte Carlo integration, Mathematical and Computer Modelling 30 (1999) 61–69.
[11] C.J. Pérez, J. Martín, M.J. Rufo, C. Rojano, Quasi-random sampling importance resampling, Communications in Statistics B: Simulation and Computation 34 (1) (2005) 97–112.
[12] D.B. Rubin, Using the SIR algorithm to simulate posterior distributions, in: J.M. Bernardo, M.H. DeGroot, D.V. Lindley, A.F.M. Smith (Eds.), Bayesian Statistics 3, Oxford University Press, Oxford, 1988, pp. 395–402.
[13] D.B. Rubin, The calculation of posterior distributions by data augmentation. Comment: a noniterative sampling/importance resampling alternative to the data augmentation algorithm for creating a few imputations when fractions of missing information are modest: the SIR algorithm, Journal of the American Statistical Association 82 (398) (1987) 543–546.
[14] A.F.M. Smith, A.E. Gelfand, Bayesian statistics without tears: a sampling-resampling perspective, The American Statistician 46 (2) (1992) 84–88.
[15] I.M. Sobol', On the distribution of points in a cube and the approximate evaluation of integrals, USSR Computational Mathematics and Mathematical Physics 7 (1967) 86–112.
[16] I.M. Sobol', The Monte Carlo Method, Popular Lectures in Mathematics, The University of Chicago Press, 1974.
[17] J.F. Talbot, D. Cline, P.K. Egbert, Importance resampling for global illumination, in: K. Bala, P. Dutré (Eds.), Rendering Techniques 2005, Eurographics Symposium on Rendering, Eurographics Association, Aire-la-Ville, Switzerland, 2005, pp. 139–146.
[18] Y. Wang, K.-T. Fang, Number-Theoretic Methods in Statistics, volume 51 of Monographs on Statistics and Applied Probability, Chapman and Hall, London, 1994.