
Notes on weak convergence and related topics

Shota Gugushvili

Mathematical Institute, Faculty of Science, Leiden University, P.O. Box 9512, 2300 RA Leiden, The Netherlands

E-mail address: [email protected]


2010 Mathematics Subject Classification. 60-01

Key words and phrases. Central limit theorem, sequential compactness, tightness, weak convergence, weak law of large numbers

Abstract. These notes deal with weak convergence of probability measures on the real line. They are largely based on the lecture notes written by Peter Spreij to accompany the Measure Theoretic Probability course.


Contents

Preface

Chapter 1. Weak convergence
1.1. Generalities
1.2. Criteria for weak convergence
1.3. Convergence of distribution functions
1.4. Sequential compactness
1.5. Continuous mapping theorem
1.6. Almost sure representation theorem
1.7. Relation to other modes of convergence
1.8. Slutsky's lemma
Exercises

Chapter 2. Characteristic functions
2.1. Definition and first properties
2.2. Inversion formula and uniqueness
2.3. Necessary conditions
2.4. Multidimensional case
Exercises

Chapter 3. Limit theorems
3.1. Characteristic functions and weak convergence
3.2. Weak law of large numbers
3.3. Probabilities of large deviations
3.4. Central limit theorem
3.5. Delta method
3.6. Berry-Esseen theorem
Exercises

Bibliography


Preface

These notes deal with weak convergence of probability measures on the real line and related topics. They are to a large extent based on the lecture notes written by Peter Spreij to accompany the Measure Theoretic Probability course. Other sources we used are listed in the bibliography.

Shota Gugushvili


CHAPTER 1

Weak convergence

1.1. Generalities

We start with the definition of weak convergence of probability measures on (R, B), and that of a sequence of random variables.

Definition 1. Let µ, µ1, µ2, . . . be probability measures on (R, B). It is said that µn converges weakly to µ, and we then write µn w→ µ, if µn(f) → µ(f) for all f ∈ Cb(R). If X, X1, X2, . . . are random variables (possibly defined on different probability spaces) with distributions µ, µ1, µ2, . . . , then we say that Xn converges weakly to X, and write Xn ⇝ X, if it holds that µn w→ µ.

Another accepted notation for weak convergence of a sequence of random variables is Xn d→ X, and one says that Xn converges to X in distribution.

Consider the following example, which illustrates for a special case that there is some reasonableness in Definition 1.

Example 2. Let xn be a convergent sequence of real numbers with limn→∞ xn = 0. Then for every f ∈ Cb(R) one has f(xn) → f(0). Let µn be the Dirac measure concentrated at xn and µ the Dirac measure concentrated at the origin. Since µn(f) = f(xn), we see that µn w→ µ.
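Example 2 is easy to check numerically; the following is a minimal sketch in Python (using NumPy), where the test function f is an arbitrary choice of an element of Cb(R):

    import numpy as np

    # A bounded continuous test function f in Cb(R); any other choice would do.
    def f(x):
        return np.cos(x) / (1.0 + x**2)

    # mu_n is the Dirac measure at x_n = 1/n, so mu_n(f) = f(x_n); mu = delta_0.
    for n in [1, 10, 100, 1000]:
        x_n = 1.0 / n
        print(n, abs(f(x_n) - f(0.0)))  # |mu_n(f) - mu(f)| tends to 0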

As a further result, here is a statement that gives an appealing sufficient condition for weak convergence of a sequence of random variables, when the random variables involved admit densities.

Theorem 3. Consider distributions µ, µ1, µ2, . . . having densities f, f1, f2, . . . w.r.t. Lebesgue measure λ on (R, B). Suppose that fn → f λ-a.e. Then µn w→ µ.

Proof. We apply Scheffé's lemma to conclude that fn → f in L1(R, B, λ). Let g ∈ Cb(R). Since g is bounded, we also have fng → fg in L1(R, B, λ), and hence µn(g) = ∫ fng dλ → ∫ fg dλ = µ(g), i.e. µn w→ µ. □
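Numerically, Theorem 3 can be illustrated as follows; this is a sketch in Python with fn the N(1/n, 1) density, f the N(0, 1) density, and crude Riemann sums in place of the integrals:

    import numpy as np

    x = np.linspace(-10.0, 10.0, 200001)  # integration grid
    dx = x[1] - x[0]
    f = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)   # N(0, 1) density
    g = np.cos(x)                                # a bounded continuous test function

    for n in [1, 10, 100]:
        f_n = np.exp(-(x - 1.0 / n)**2 / 2) / np.sqrt(2 * np.pi)  # N(1/n, 1) density
        l1_dist = np.sum(np.abs(f_n - f)) * dx      # approximates the L1 distance (Scheffe)
        weak_gap = abs(np.sum((f_n - f) * g) * dx)  # approximates |mu_n(g) - mu(g)|
        print(n, l1_dist, weak_gap)

Both printed quantities decrease towards zero, in accordance with the proof.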

One could naively think of another definition of convergence of probability measures, for instance by requiring that µn(B) → µ(B) for every B ∈ B, which is the same as requiring that the class of test functions f consists of indicators of Borel sets, or even by requiring that the integrals µn(f) converge to µ(f) for every bounded measurable function. It turns out that each of these requirements is too strong to yield a useful convergence concept. One drawback of such a definition is demonstrated by the following example with Dirac measures.

Example 4. Assume the same setup as in Example 2 and take for concreteness xn = 1/n. Let B = (−∞, x] for some x > 0. Then for all xn < x, we have µn(B) = 1B(xn) = 1 and µ(B) = 1B(0) = 1, so that µn(B) → µ(B). For x < 0 we get that µn(B) = µ(B) = 0, and thus µn(B) → µ(B). But for B = (−∞, 0] we have µn(B) = 0 for all n, whereas µ(B) = 1. Hence the convergence µn(B) → µ(B) does not hold for this last choice of B, although it is quite natural in this case to say that µn → µ. For future reference note the following: if Fn is the distribution function of µn and F that of µ, then we have seen that Fn(x) → F(x) for all x ∈ R, except for x = 0.

1.2. Criteria for weak convergence

In this section we give several criteria for weak convergence. These are primarily useful in the proofs.

Theorem 5. The following are equivalent:

(i) µn w→ µ;
(ii) every subsequence µnj of µn has a further subsequence µnjk, such that µnjk w→ µ as k → ∞.

Proof. That (i) implies (ii) is obvious. We prove the reverse implication. Assume the convergence µn w→ µ fails. This means there exists a bounded continuous function f, a subsequence nj of n and a constant ε > 0, such that

|µnj(f) − µ(f)| ≥ ε

for all j. But then

|µnjk(f) − µ(f)| ≥ ε

for any subsequence njk of nj as well. Hence µnj has no further subsequence converging weakly to µ, a contradiction. □

Recall that the boundary ∂E of a set E ∈ B is ∂E = Ē \ E̊, where Ē is the closure and E̊ is the interior of E. The distance from a point x to a set E is

d(x, E) = inf{|x − y| : y ∈ E}.

The δ-neighbourhood of E (here δ > 0) is the set E^δ = {x : d(x, E) < δ}.

The following result is known as the portmanteau lemma.

Theorem 6 (Portmanteau lemma). Let µ, µ1, µ2, . . . be probability measures on (R, B). The following statements are equivalent.

(i) µn w→ µ.
(ii) lim sup_{n→∞} µn(F) ≤ µ(F) for all closed sets F.
(iii) lim inf_{n→∞} µn(G) ≥ µ(G) for all open sets G.
(iv) lim_{n→∞} µn(E) = µ(E) for all sets E with µ(∂E) = 0 (all µ-continuity sets).

Proof. We start with (i)⇒(ii). Given ε > 0, choose m ∈ N such that for δ = 1/m > 0, µ(F^δ) < µ(F) + ε. This is possible, because F is closed and hence F^δ ↓ F as m → ∞. Let

ϕ(x) = 1 for x ≤ 0,  ϕ(x) = 1 − x for 0 < x < 1,  ϕ(x) = 0 for x ≥ 1,

and define

f(x) = ϕ(d(x, F)/δ).


Note that f is continuous, nonnegative, bounded by 1 on R, equals 1 on F and vanishes outside F^δ. Therefore,

µn(F) = ∫_F f dµn ≤ ∫_R f dµn,

and

∫_R f dµ = ∫_{F^δ} f dµ ≤ µ(F^δ).

Since f ∈ Cb(R), the assumed weak convergence gives

lim_{n→∞} ∫_R f dµn = ∫_R f dµ.

Combining the above,

lim sup_{n→∞} µn(F) ≤ lim_{n→∞} ∫_R f dµn = ∫_R f dµ ≤ µ(F^δ) < µ(F) + ε.

Since ε is arbitrary, the implication follows.

(ii)⇔(iii) follows by a simple complementation argument.

(ii) and (iii) together imply (iv) by

µ(Ē) ≥ lim sup_{n→∞} µn(Ē) ≥ lim sup_{n→∞} µn(E) ≥ lim inf_{n→∞} µn(E) ≥ lim inf_{n→∞} µn(E̊) ≥ µ(E̊),

because µ(∂E) = 0 implies that the extreme terms are equal; the inequalities are then in fact equalities, and so lim_{n→∞} µn(E) = µ(E).

(iv)⇒(i). Let ε > 0, g ∈ Cb(R) and choose two constants C1, C2 such that C1 < g < C2. Let D = {x ∈ R : µ(g = x) > 0}. So D is the set of atoms of the law of g and hence it is at most countable (if not, µ would have an infinite total mass). Let C1 = x0 < . . . < xm = C2 be a finite set of points not in D, such that max{xk − xk−1 : k = 1, . . . , m} < ε. Let Ik = (xk−1, xk]. The continuity of g implies that if y is a boundary point of a set

{x : xk−1 < g(x) ≤ xk},

then g(y) is either xk−1 or xk. Hence the sets in the above display are µ-continuity sets. We have

(1.1) Σ_{k=1}^{m} xk−1 µn({x : xk−1 < g(x) ≤ xk}) ≤ ∫_R g dµn ≤ Σ_{k=1}^{m} xk µn({x : xk−1 < g(x) ≤ xk}),

and likewise,

(1.2) Σ_{k=1}^{m} xk−1 µ({x : xk−1 < g(x) ≤ xk}) ≤ ∫_R g dµ ≤ Σ_{k=1}^{m} xk µ({x : xk−1 < g(x) ≤ xk}).

Now note that the extreme terms in (1.1) converge to the respective extreme terms in (1.2). The latter differ by at most ε. Hence both the limit superior and the limit inferior of ∫_R g dµn are within distance ε of ∫_R g dµ. Since ε is arbitrary, the result follows. This finishes the proof of the theorem. □

Part (iv) of the portmanteau lemma is quite illustrative for understanding the definition of weak convergence and the way in which it differs from the requirement µn(B) → µ(B) for every Borel set B, the would-be definition of weak convergence discussed in Section 1.1.

1.3. Convergence of distribution functions

In this section we give an appealing characterisation of weak convergence (convergence in distribution) in terms of distribution functions, which makes the definition of weak convergence look less abstract.

Definition 7. We shall say that a sequence of distribution functions Fn on R converges weakly to a limit distribution function F, and shall write Fn ⇝ F, if Fn(x) → F(x) for all x ∈ CF, where CF is the set of all those points at which F is continuous.

Theorem 8. Let µ, µ1, µ2, . . . be probability measures on the real line and denote by F, F1, F2, . . . the corresponding distribution functions. The following statements are equivalent:

(i) µn w→ µ;
(ii) Fn ⇝ F.

Proof. Assume (i). If x is a continuity point of F, the set (−∞, x], the boundary of which is {x}, is a µ-continuity set. Hence

Fn(x) = µn((−∞, x]) → µ((−∞, x]) = F(x)

by the portmanteau lemma, and thus (ii) holds.

Conversely, let (ii) hold. Fix an arbitrary 0 < ε < 1 and pick two continuity points a and b of F in such a way that F(a) < ε and F(b) > 1 − ε. Next, given f ∈ Cb(R), choose continuity points xi of F such that a = x0 < x1 < . . . < xk = b and |f(x) − f(xi)| < ε for xi−1 ≤ x ≤ xi (this is possible by the uniform continuity of f on [a, b]). Define

S = Σ_{i=1}^{k} f(xi)[F(xi) − F(xi−1)],   Sn = Σ_{i=1}^{k} f(xi)[Fn(xi) − Fn(xi−1)].

By assumption, Sn → S as n → ∞. Let M = sup_{x∈R} |f(x)|. We have

|∫_R f dµ − S| < (2M + 1)ε.

Likewise,

|∫_R f dµn − Sn| ≤ ε + M Fn(a) + M(1 − Fn(b)) → ε + M F(a) + M(1 − F(b)) < (2M + 1)ε.

As a result,

lim sup_{n→∞} |∫_R f dµn − ∫_R f dµ| ≤ lim sup_{n→∞} |∫_R f dµn − Sn| + |∫_R f dµ − S| + lim_{n→∞} |Sn − S| ≤ 2(2M + 1)ε.

Since ε is arbitrary, the limit superior on the left-hand side of the first inequality is in fact zero, and the result follows. □

As shown in the next result, when the limit distribution function F is continuous everywhere, i.e. CF = R, the convergence Fn(t) → F(t) is in fact uniform in t ∈ R.

Theorem 9. Suppose Fn ⇝ F and F is continuous. Then

lim_{n→∞} sup_{t∈R} |Fn(t) − F(t)| = 0.

Proof. Let k ∈ N be fixed. By continuity of F and the intermediate value theorem, there exist points −∞ = x0 < x1 < . . . < xk = ∞, such that F(xi) = i/k. Therefore, for xi−1 ≤ x ≤ xi,

Fn(x) − F(x) ≤ Fn(xi) − F(xi−1) = Fn(xi) − F(xi) + 1/k,
Fn(x) − F(x) ≥ Fn(xi−1) − F(xi) = Fn(xi−1) − F(xi−1) − 1/k.

Thus

|Fn(x) − F(x)| ≤ sup_{0≤i≤k} |Fn(xi) − F(xi)| + 1/k,  x ∈ R.

For any ε > 0, choose k so large that 1/k ≤ ε/2. Next note that with this k, by the convergence Fn(x) → F(x) at all x ∈ R, the supremum sup_{0≤i≤k} |Fn(xi) − F(xi)| can be made arbitrarily small, in particular smaller than ε/2, by taking n large enough. Conclude that sup_{x∈R} |Fn(x) − F(x)| ≤ ε for all n large enough. Since ε is arbitrary, the result follows. □
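As a numerical check of Theorem 9, one can take Fn the N(1/n, 1) distribution function and F = Φ, and approximate the supremum over a fine grid; a Python sketch (the grid bounds are an arbitrary choice):

    import numpy as np
    from math import erf, sqrt

    def Phi(t):  # standard normal distribution function
        return 0.5 * (1.0 + erf(t / sqrt(2.0)))

    t_grid = np.linspace(-8.0, 8.0, 10001)  # grid approximation to the sup over R
    for n in [1, 10, 100, 1000]:
        sup_dist = max(abs(Phi(t - 1.0 / n) - Phi(t)) for t in t_grid)
        print(n, sup_dist)  # decreases to 0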

1.4. Sequential compactness

In the previous sections we studied several alternative characterisations of weak convergence. In this section we will take a more abstract stance and study a condition guaranteeing that a sequence of probability measures has at least one weakly convergent subsequence. We first introduce the notion of sequential compactness of a sequence of probability measures.

Definition 10. A sequence of probability measures µn on (R, B) is called sequentially compact, if every subsequence µnk of µn has a further weakly convergent subsequence.

A general answer to the question whether a sequence µn is sequentially compact or not is given by Prokhorov's theorem. In its proof we need one auxiliary result, known as Helly's theorem.

The Bolzano–Weierstraß theorem states that every bounded sequence of real numbers has a convergent subsequence. The theorem easily generalises to sequences in Rd, but fails to hold for uniformly bounded sequences in general metric spaces. But if extra properties are imposed, there can still be an affirmative answer. Something like that happens in Helly's theorem. At this point it is convenient to introduce the notion of a defective distribution function. Such a function, F say, has values in [0, 1], is right-continuous and increasing, but at least one of the two properties lim_{x→∞} F(x) = 1 and lim_{x→−∞} F(x) = 0 fails to hold. The measure µ corresponding to F on (R, B) will then be a subprobability measure, µ(R) < 1.

Theorem 11 (Helly's theorem). Let Fn be a sequence of distribution functions. Then there exist a possibly defective distribution function F and a subsequence Fnk, such that Fnk(x) → F(x) for all x ∈ CF.


Proof. The main ingredient of the proof is an infinite repetition of the Bolzano–Weierstraß theorem combined with the Cantor diagonalisation. First we restrict ourselves to working on Q instead of R, and exploit the countability of Q. Write Q = {q1, q2, . . .} and consider restrictions of Fn to Q. Then the sequence Fn(q1) is bounded and along some subsequence n^1_k it has a limit, ℓ(q1) say. Look then at the sequence F_{n^1_k}(q2). Again, along some subsequence of n^1_k, call it n^2_k, we have a limit, ℓ(q2) say. Note that along the thinned subsequence we still have lim_{k→∞} F_{n^2_k}(q1) = ℓ(q1). Continue like this to construct a nested sequence of subsequences n^j_k, for which lim_{k→∞} F_{n^j_k}(qi) = ℓ(qi) holds for every i ≤ j. Define the diagonal sequence nk by nk = n^k_k. For an arbitrary i, along this sequence one has lim_{k→∞} F_{nk}(qi) = ℓ(qi). In this way we have constructed a function ℓ : Q → [0, 1], and by the monotonicity of Fn(t) in t this function is increasing.

In the next step we extend this function to a function F on R that is right-continuous, and still increasing. We put

F(x) = inf{ℓ(q) : q ∈ Q, q > x}.

Obviously, F is an increasing function. It is also right-continuous: let x ∈ R and ε > 0. There is q ∈ Q with q > x such that ℓ(q) < F(x) + ε. Pick y ∈ (x, q). Then F(y) ≤ ℓ(q) and we have F(y) − F(x) < ε, which shows that F is right-continuous. However, lim_{x→∞} F(x) = 1 or lim_{x→−∞} F(x) = 0 do not necessarily hold true. Thus F is a possibly defective distribution function.

We now show that Fnk(x) → F(x) if x ∈ CF. Fix x ∈ CF and let ε > 0. Pick q ∈ Q with q > x such that ℓ(q) < F(x) + ε, as above. By left-continuity of F at x, there is y < x such that F(x) < F(y) + ε. Take now r ∈ (y, x) ∩ Q. Then F(y) ≤ ℓ(r), and hence F(x) < ℓ(r) + ε. So we have the inequalities

ℓ(q) − ε < F(x) < ℓ(r) + ε.

Then

lim sup_{k→∞} Fnk(x) ≤ lim_{k→∞} Fnk(q) = ℓ(q) < F(x) + ε,
lim inf_{k→∞} Fnk(x) ≥ lim inf_{k→∞} Fnk(r) = ℓ(r) > F(x) − ε.

Since ε is arbitrary, the result follows. □

Here is an example, for which the limit in Theorem 11 is not a true distribution function.

Example 12. Let µn be the Dirac measure concentrated at n. Then its distribution function is given by Fn(x) = 1_{[n,∞)}(x) and hence lim_{n→∞} Fn(x) = 0 for every x ∈ R. Hence the limit function F in Theorem 11 has to be the zero function, which is clearly defective. One colloquially says that in the limit the probability mass escapes to infinity.

Translated in terms of probability laws, Helly's theorem says that every sequence of probability measures µn has a (weakly) convergent subsequence, but that the limit law in general is only a subprobability measure. We are now interested in finding a condition that would guarantee that the limit is a bona fide probability measure. A possible path is to require that all probability measures involved have probability one on a fixed bounded set. That would prevent the phenomenon described in Example 12. However, this is too stringent an assumption, because it rules out many useful distributions. Fortunately, a considerably weaker assumption suffices. For any probability measure µ on (R, B) it holds that lim_{M→∞} µ([−M, M]) = 1. The next condition, tightness, gives a uniform version of this.

Definition 13. A sequence of probability measures µn on (R, B) is called tight, if lim_{M→∞} inf_n µn([−M, M]) = 1.
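As a numerical illustration of Definition 13 (a sketch only: the infimum over n is truncated to finitely many indices, and the two example families are arbitrary choices), compare a sequence with bounded spread with one whose mass drifts off to infinity:

    import numpy as np
    from math import erf, sqrt

    def Phi(t):
        return 0.5 * (1.0 + erf(t / sqrt(2.0)))

    def mass(m, s, M):  # N(m, s^2) measure of the interval [-M, M]
        return Phi((M - m) / s) - Phi((-M - m) / s)

    # mu_n = N(1/n, 1) is tight; nu_n = N(n, 1) is not (mass drifts to infinity).
    for M in [1.0, 5.0, 10.0, 50.0]:
        inf_mu = min(mass(1.0 / n, 1.0, M) for n in range(1, 2001))
        inf_nu = min(mass(float(n), 1.0, M) for n in range(1, 2001))
        print(M, inf_mu, inf_nu)  # inf_mu approaches 1 as M grows, inf_nu stays near 0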

Remark 14. Note that a sequence µn is tight iff every tail sequence {µn}n≥N is tight. In order to show that a sequence is tight, it is thus sufficient to show tightness from a certain suitably chosen index on.

Theorem 15 (Prokhorov's theorem¹). A sequence µn of probability measures on (R, B) is tight if and only if it is sequentially compact.

Proof. Suppose µn is sequentially compact, but not tight. Then there exists ε > 0, such that for any M > 0 one has µn([−M, M]^c) > ε for at least one n. It follows that for any j ∈ N and Ij = (−j, j), one can find an index nj, such that µnj(Ij^c) > ε. Extract from the sequence µnj a weakly convergent subsequence µnjk, and denote its weak limit by µ. By the portmanteau lemma, for every fixed j ∈ N,

lim sup_{k→∞} µnjk(Ij^c) ≤ µ(Ij^c).

Letting j → ∞, we see that the right-hand side converges to zero, while the left-hand side stays bounded from below by ε > 0. This contradiction proves the first implication.

We now prove the second implication. Let Fn be the distribution function of µn. By Helly's theorem, there exists a subsequence Fnj of the sequence of distribution functions Fn, such that Fnj ⇝ F as j → ∞, for some possibly defective distribution function F. We will show that in fact

(1.3) lim_{x→∞} F(x) = 1,  lim_{x→−∞} F(x) = 0,

so that F is a proper distribution function. By tightness of µn, for any constant 0 < ε < 1 there exists a constant Mε > 0, such that Fn(Mε) > 1 − ε for all n ∈ N. Without loss of generality, Mε can be taken to be a continuity point of F. Then

F(Mε) = lim_{j→∞} Fnj(Mε) ≥ 1 − ε.

Since ε is arbitrary, the above display and the monotonicity of F imply the first equality in (1.3). The second one can be proved in a similar manner. This completes the proof. □

Theorem 15 has a simple corollary.

Corollary 16. If µn w→ µ for some probability measure µ, then the sequence µn is tight.

We also remark that tightness of a sequence µn in general is not sufficient for its weak convergence. Here is a simple counterexample: let µn = N(0, 1) for n odd and µn = N(0, 2) for n even. Then µn is tight, but does not converge weakly.

¹The name Prokhorov is alternatively spelled as Prohorov, but Prokhorov is the way it appears in the English translation of the original paper containing (a much more general version of) the theorem. See Prokhorov (1956).


1.5. Continuous mapping theorem

The continuous mapping theorem is a result asserting that if a sequence of random variables Xn converges in a suitable sense to a random variable X, then for a continuous function g the transformed sequence g(Xn) converges to g(X). We will prove a slightly more general result that allows g to be discontinuous on a negligible set. Such a refinement does not require much additional technical effort, while occasionally being useful.

Theorem 17 (Continuous mapping theorem). Let g : R → R be continuous at every point of a set C, such that P(X ∈ C) = 1.

(i) If Xn a.s.→ X, then g(Xn) a.s.→ g(X).
(ii) If Xn ⇝ X, then g(Xn) ⇝ g(X).
(iii) If Xn P→ X, then g(Xn) P→ g(X).

Proof. Part (i) is trivial.

We prove part (ii). Let F be an arbitrary closed set, write A = g⁻¹(F) and let Ā be its closure. We have {g(Xn) ∈ F} = {Xn ∈ A}, and trivially A ⊂ Ā. Take an arbitrary x ∈ Ā. By definition, there exists a sequence xm, such that xm → x and g(xm) ∈ F. If x ∈ C, then g(xm) → g(x), and g(x) ∈ F, because F is closed. Otherwise x ∈ C^c. Hence Ā ⊂ A ∪ C^c. Then by the portmanteau lemma and the fact that P(X ∈ C) = 1,

lim sup_{n→∞} P(g(Xn) ∈ F) ≤ lim sup_{n→∞} P(Xn ∈ Ā) ≤ P(X ∈ Ā) ≤ P(X ∈ A) + P(X ∈ C^c) = P(g(X) ∈ F).

By another application of the portmanteau lemma we conclude that g(Xn) ⇝ g(X).

We move to part (iii). Assume that g(Xn) P→ g(X) fails. Then there exist ε > 0, δ > 0 and a subsequence nj of n, such that

(1.4) P(|g(Xnj) − g(X)| > ε) > δ

for all j. Extract from nj a further subsequence njk, such that Xnjk a.s.→ X (this is possible, because Xn P→ X). By part (i), g(Xnjk) a.s.→ g(X). But this contradicts (1.4). The proof of the theorem is completed. □

It would be more appropriate, albeit clumsier, to call Theorem 17 the almost surely continuous mapping theorem.

Example 18. Here is a simple illustration of Theorem 17. Let Y1, . . . , Yn be an i.i.d. sample from the normal distribution with mean zero and unknown variance σ². By the strong law of large numbers,

σ̂n² = (1/n) Σ_{i=1}^{n} Yi² a.s.→ σ²,

and hence σ̂n² is a reasonable estimator of σ². Since the function g(x) = √x is continuous, σ̂n is then a reasonable estimator of the standard deviation σ: we have σ̂n a.s.→ σ.
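In simulation, Example 18 looks as follows (a sketch; the seed and the true value σ = 2 are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)
    sigma = 2.0
    for n in [10, 1000, 100000]:
        y = rng.normal(0.0, sigma, size=n)
        sigma2_hat = np.mean(y**2)       # (1/n) * sum of Y_i^2, tends to sigma^2 a.s.
        sigma_hat = np.sqrt(sigma2_hat)  # continuous mapping with g(x) = sqrt(x)
        print(n, sigma_hat)              # approaches sigma = 2.0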

[Figure 1. Distribution and quantile functions of the discrete uniform distribution on the integers {1, 2}: (a) distribution function, (b) quantile function.]

1.6. Almost sure representation theorem

Suppose we want to prove some distributional property of a sequence Xn of random variables, knowing that Xn ⇝ X. In general this might be difficult, but is perhaps easier, if we knew that Xn a.s.→ X. Unfortunately, the latter almost sure convergence might be difficult to establish, or perhaps is even false. However, the situation is not hopeless. The almost sure representation theorem, proved below, tells us that there exists a probability space (Ω̂, F̂, P̂) that supports random variables X̂n, X̂, such that for all n ∈ N, X̂n d= Xn, X̂ d= X, and X̂n a.s.→ X̂. We then prove the distributional property we are interested in for the sequence X̂n on the space (Ω̂, F̂, P̂). The result automatically carries over to the original sequence Xn.

We will need a number of results on quantile functions, which are of independent interest as well.

A distribution function in general is only non-decreasing, but not necessarily strictly increasing. Therefore, it typically does not admit an inverse function in the usual sense. Nevertheless, a kind of inverse, the quantile function, can still be defined. The quantile function of a distribution function F is the generalised inverse F⁻¹ : (0, 1) → R given by

F⁻¹(p) = inf{x : F(x) ≥ p}.

For an illustration see Figure 1. The quantile function is left-continuous. Its range is equal to the support of F, or rather to the support of the corresponding probability measure µ; here the support of a probability measure on R is defined as the set of all those points x such that any open neighbourhood Ux of x has strictly positive measure, µ(Ux) > 0. Intuitively, this is the smallest closed subset of R that receives measure 1 under µ (this explanation is in fact valid even for probability measures on separable metric spaces; see e.g. Theorem 2.1 on pp. 27–28 in Parthasarathy (2005)). As one example, the support of the standard normal distribution is the whole of R, and therefore F⁻¹ is often unbounded. The evident fact that the quantile function is monotone implies that it can have at most a countable number of discontinuity points. The following lemma lists some other properties of F⁻¹. Of these we will only make use of (i)–(iv).

[Figure 2. Distribution function (red line).]

Lemma 19. For every 0 < p < 1 and x ∈ R,

(i) F⁻¹(p) ≤ x if and only if p ≤ F(x);
(ii) F ∘ F⁻¹(p) ≥ p, with equality holding if and only if p is in the range of F; the equality can fail only if F is discontinuous at F⁻¹(p);
(iii) F− ∘ F⁻¹(p) ≤ p, where F−(x) = F(x−);
(iv) F⁻¹ ∘ F(x) ≤ x; the equality fails if and only if x is in the interior or at the right endpoint of a flat part of F;
(v) F ∘ F⁻¹ ∘ F = F; F⁻¹ ∘ F ∘ F⁻¹ = F⁻¹;
(vi) (F ∘ G)⁻¹ = G⁻¹ ∘ F⁻¹.

Proof. (i) through (iv) can be proved either directly, by appealing to the definitions, or through a picture, such as the one given in Figure 2. To prove the first equality in (v), note that by (ii), the monotonicity of F and (iv),

F(x) = p ≤ F ∘ F⁻¹(p) = F ∘ F⁻¹ ∘ F(x) ≤ F(x).

The second equality in (v) follows from (ii), the monotonicity of F⁻¹ and (iv) by F⁻¹(q) ≤ F⁻¹ ∘ F ∘ F⁻¹(q) ≤ F⁻¹(q). Finally, (vi) is a consequence of the definition of (F ∘ G)⁻¹ and (i). □

As a consequence of (ii) and (iv), F ∘ F⁻¹(p) ≡ p on (0, 1) and F⁻¹ ∘ F(x) ≡ x on R if and only if F is continuous and strictly increasing. In that case F⁻¹ is a proper inverse of F, as it should be.

Corollary 20. Let F be an arbitrary distribution function and U a uniform random variable on [0, 1]. Then F⁻¹(U) ∼ F.

This follows from Lemma 19 (i). The transformation F⁻¹(U) is called the quantile transformation.


Corollary 21. Let X ∼ F for a continuous distribution function F. Then F(X) is uniformly distributed on [0, 1].

Again, this follows from Lemma 19 (i) and (ii) by

P(F(X) ≤ x) = P(F(X) < x) = 1 − P(F(X) ≥ x) = 1 − P(X ≥ F⁻¹(x)) = P(X < F⁻¹(x)) = P(X ≤ F⁻¹(x)) = F ∘ F⁻¹(x) = x,

where x ∈ (0, 1). The transformation F(X) for X ∼ F is called the probability integral transformation.
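Both transformations are easy to try out; here is a Python sketch for the exponential distribution F(x) = 1 − e^{−x}, whose quantile function is F⁻¹(p) = −log(1 − p) (the choice of distribution is arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    U = rng.uniform(size=100000)

    # Quantile transformation (Corollary 20): X = F^{-1}(U) has distribution F.
    X = -np.log(1.0 - U)
    print(np.mean(X <= 1.0), 1.0 - np.exp(-1.0))  # both approximately F(1) = 0.632

    # Probability integral transformation (Corollary 21): F(X) is uniform on [0, 1].
    V = 1.0 - np.exp(-X)
    print(np.mean(V <= 0.25), np.mean(V <= 0.75))  # approximately 0.25 and 0.75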

Quantile functions are occasionally useful when studying weak convergence of a sequence of random variables. In the next definition we introduce the notion of the weak convergence of a sequence of quantile functions.

Definition 22. We shall say that a sequence of quantile functions Fn⁻¹ converges weakly to a limit quantile function F⁻¹, and denote this by Fn⁻¹ ⇝ F⁻¹, if Fn⁻¹(t) → F⁻¹(t) at every point 0 < t < 1 at which F⁻¹ is continuous.

Both the terminology and the notation for the weak convergence of quantile functions are reminiscent of those for the weak convergence of distribution functions. In fact, as shown in the next lemma, the two types of convergence are equivalent.

Lemma 23. For any sequence of distribution functions Fn, Fn ⇝ F if and only if Fn⁻¹ ⇝ F⁻¹.

Proof. Let U be a standard uniform random variable on some probability space, for instance on ([0, 1], B[0, 1], λ). Since F⁻¹ has at most a countable number of discontinuity points and the distribution of U is absolutely continuous, Fn⁻¹ ⇝ F⁻¹ implies that Fn⁻¹(U) a.s.→ F⁻¹(U). Therefore, Fn⁻¹(U) ⇝ F⁻¹(U). By Corollary 20, this is exactly the weak convergence Fn ⇝ F.

Now we prove the reverse implication. Let V be a standard normal random variable on some probability space, for instance on ([0, 1], B[0, 1], λ), on which it can be obtained through the quantile transformation Φ⁻¹(U) for U a standard uniform random variable, see Corollary 20. Since the convergence Fn(t) → F(t) can fail only at a countable number of points t, and since the distribution of V is continuous, we have Fn(V) a.s.→ F(V) (and of course Fn(V) ⇝ F(V)). By Lemma 19 (i),

Φ(Fn⁻¹(t)) = P(V < Fn⁻¹(t)) = 1 − P(V ≥ Fn⁻¹(t)) = 1 − P(Fn(V) ≥ t) = P(Fn(V) < t),

and similarly, P(F(V) < t) = Φ(F⁻¹(t)). By the portmanteau lemma,

lim inf_{n→∞} P(Fn(V) < t) ≥ P(F(V) < t).

On the other hand, by elementary properties of the limits inferior and superior and the portmanteau lemma again,

lim inf_{n→∞} P(Fn(V) < t) ≤ lim sup_{n→∞} P(Fn(V) < t) ≤ lim sup_{n→∞} P(Fn(V) ≤ t) = 1 − lim inf_{n→∞} P(Fn(V) > t) ≤ 1 − P(F(V) > t) = P(F(V) ≤ t).

If the function t ↦ P(F(V) ≤ t) is continuous at t, then

P(F(V) ≤ t) = P(F(V) < t) = Φ(F⁻¹(t)),

and in this case

lim inf_{n→∞} P(Fn(V) < t) = lim sup_{n→∞} P(Fn(V) < t) = lim_{n→∞} P(Fn(V) < t) = P(F(V) < t) = Φ(F⁻¹(t)).

The function Φ(F⁻¹(·)) is certainly continuous at every point t at which F⁻¹ is. Since Φ⁻¹ is a continuous function as well (cf. Lemma 19), from this it follows that Fn⁻¹(t) → F⁻¹(t) at every point t at which F⁻¹ is continuous. Thus Fn⁻¹ ⇝ F⁻¹. □

The work we put into the previous results allows us to give a short proof of the almost sure representation theorem.

Theorem 24 (Almost sure representation). Let Xn ⇝ X. Then there exist a probability space (Ω̂, F̂, P̂) and random variables X̂n, X̂ defined on it, such that for all n ≥ 1, X̂n d= Xn, X̂ d= X, and X̂n a.s.→ X̂.

Proof. Let Fn and F be the distribution functions of Xn and X, respectively. Consider the probability space (Ω̂, F̂, P̂) = ([0, 1], B[0, 1], λ) and let U be a random variable on it with a standard uniform distribution. Define X̂n = Fn⁻¹(U) and X̂ = F⁻¹(U). By Corollary 20, X̂n d= Xn and X̂ d= X. By Lemma 23, the convergence Fn ⇝ F implies that Fn⁻¹ ⇝ F⁻¹. By definition the latter means that Fn⁻¹(t) → F⁻¹(t) at all points t at which F⁻¹ is continuous. Note that F⁻¹ has at most a countable number of discontinuity points, and hence the convergence Fn⁻¹(t) → F⁻¹(t) can perhaps fail only on a set of Lebesgue measure zero. Since U has a continuous distribution, this implies that Fn⁻¹(U) a.s.→ F⁻¹(U), i.e. X̂n a.s.→ X̂. □

Several applications of the almost sure representation theorem will be given in the next section.

1.7. Relation to other modes of convergence

Firstly, we show that convergence in probability implies convergence in distribution.

Theorem 25. Suppose that a sequence Xn of random variables and a random variable X are defined on the same probability space. Assume that Xn P→ X. Then Xn ⇝ X.

Proof. Suppose the convergence Xn ⇝ X fails. By definition this means that there exists f ∈ Cb(R), such that the convergence µn(f) → µ(f) fails. Thus there exist ε > 0 and a subsequence nk of n, such that |µnk(f) − µ(f)| ≥ ε for all nk. This is obviously true for any further subsequence of nk as well. Pick a subsequence nkℓ of nk, such that Xnkℓ a.s.→ X (this is possible, because Xn P→ X). Then µnkℓ(f) → µ(f) by the dominated convergence theorem. But this leads to a contradiction, which proves the theorem. □

Corollary 26. Suppose that a sequence Xn of random variables and a random variable X are defined on the same probability space. Assume that Xn a.s.→ X. Then Xn ⇝ X.

This follows from Theorem 25 and the fact that almost sure convergence implies convergence in probability.

The converse to Theorem 25 (and Corollary 26) is in general false.

Example 27. Let X ∼ N(0, 1) and Xn = −X for all n ∈ N. Then P(|Xn − X| > ε) = P(|X| > ε/2) > 0 for all n ∈ N, and thus convergence in probability fails. Obviously, so does almost sure convergence. On the other hand, by the symmetry of the standard normal distribution, Xn d= X, and hence Xn ⇝ X.

There is one notable exception, however.

Theorem 28. Let the random variables X, X1, X2, . . . be defined on the same probability space. If Xn ⇝ X, where P(X = x) = 1 for some x ∈ R, then also Xn P→ X.

Proof. The distribution µ of X is the Dirac measure at x. For any ε > 0, the sets (x + ε, ∞) and (−∞, x − ε) are µ-continuity sets. Note that

P(|Xn − X| > ε) = P(Xn > x + ε) + P(Xn < x − ε).

The right-hand side of the above display tends to zero as n → ∞ by the portmanteau lemma. This completes the proof. □

Next we move to convergence of the first moments. Since weak convergence in general does not imply convergence in probability, neither does it in general imply convergence of means. But when the collection Xn is uniformly integrable, the weak convergence Xn ⇝ X can be strengthened to convergence of means: E[Xn] → E[X]. The proof is a simple application of the almost sure representation theorem.

Theorem 29. Assume that Xn ⇝ X. If the sequence Xn is uniformly integrable, then E[Xn] → E[X] as n → ∞.

Proof. By the almost sure representation theorem, there exists a probability space (Ω̂, F̂, P̂) with random variables X̂, X̂1, X̂2, . . . , such that X̂ d= X, X̂n d= Xn for all n ∈ N, and X̂n a.s.→ X̂. By the uniform integrability of the family Xn, the family X̂n is also uniformly integrable. Therefore E[X̂n] → E[X̂], and since this latter convergence depends only on the laws of the random variables involved, the result follows. □

Remark 30. Assume that Xn and X are defined on the same probability space. Inspecting the proof of the previous theorem, one could have thought that not only do the means converge, but that we also have the L1-convergence: E[|Xn − X|] → 0. However, this in general is false, and here is a simple counterexample: take X ∼ N(0, 1) and Xn = −X for all n ∈ N. Then the conditions of Theorem 29 are satisfied, but E[|Xn − X|] = 2E[|X|], which does not tend to zero. The point is that E[|Xn − X|] depends on the bivariate law of (Xn, X), and this need not be the same as that of (X̂n, X̂) (marginals do not determine joint distributions uniquely). This serves as a warning as to when the almost sure representation theorem is applicable and when it is not: the representation does not in general preserve the dependence structure of Xn and X, and hence typically cannot be used for statements dealing with multivariate vectors obtained from Xn and X.

The following is what we can obtain without the uniform integrability assumption in Theorem 29. Again, the proof is an application of the almost sure representation theorem.

Theorem 31. If Xn ⇝ X, then lim inf_{n→∞} E[|Xn|] ≥ E[|X|].

Proof. By the almost sure representation theorem, there exists a probability space (Ω̂, F̂, P̂) with random variables X̂, X̂1, X̂2, . . . , such that X̂ d= X, X̂n d= Xn for all n ∈ N, and X̂n a.s.→ X̂. Fatou's lemma implies that E[|X̂|] ≤ lim inf_{n→∞} E[|X̂n|], and the statement follows. □

1.8. Slutsky’s lemma

Suppose Xn ⇝ X and the sequence Yn is close in some sense to Xn. What can be said about the weak limit of Yn? Or suppose that Xn and Yn are weakly convergent. What can be said about the weak convergence of the sequence XnYn? The following result, known as Slutsky's lemma², gives an answer to these questions.

Theorem 32. Let Xn and Yn be two sequences of random variables defined on the same probability space.

(i) If Xn ⇝ X and |Xn − Yn| P→ 0, then Yn ⇝ X.
(ii) If Xn ⇝ X and Yn ⇝ c for a constant c, then XnYn ⇝ cX.

Proof. We first prove (i). Let F be closed and δ = 1/m for m ∈ N. We have

P(Yn ∈ F) = P(Xn + (Yn − Xn) ∈ F)
= P(Xn + (Yn − Xn) ∈ F, |Xn − Yn| < δ) + P(Xn + (Yn − Xn) ∈ F, |Xn − Yn| ≥ δ)
≤ P(Xn ∈ F^δ) + P(|Xn − Yn| ≥ δ).

Letting n → ∞ and using the assumption |Xn − Yn| P→ 0 and the portmanteau lemma (applied to the closed set {x : d(x, F) ≤ δ}, which contains F^δ), we obtain that

lim sup_{n→∞} P(Yn ∈ F) ≤ P(X ∈ {x : d(x, F) ≤ δ}).

Since {x : d(x, F) ≤ δ} ↓ F as m → ∞, the result follows by another application of the portmanteau lemma.

Now we prove (ii). Write

(1.5) XnYn = Xn(Yn − c) + cXn.

²An alternative, but less common, spelling of Slutsky's name is Slutzky. Also, the result is at times called a theorem, not a lemma.


An elementary argument shows that for any ε > 0 and δ > 0,

(1.6) P(|Xn(Yn − c)| > ε) ≤ P(|Xn| > ε/δ) + P(|Yn − c| > δ).

Fix ε and pick δ such that ε/δ and −ε/δ are continuity points of the distribution of X. Then the first term on the right-hand side of the above display converges to P(|X| > ε/δ). The latter can be made arbitrarily small by taking δ small enough. As far as the second term in (1.6) is concerned, for every fixed δ it converges to zero. Hence Xn(Yn − c) P→ 0. It is also easy to see that cXn ⇝ cX (this can be done in a variety of ways; for instance, the almost sure representation theorem and the dominated convergence theorem give for f ∈ Cb(R) that E[f(cXn)] → E[f(cX)]). Now apply part (i) to the right-hand side of (1.5). □

Slutsky's lemma finds numerous applications in asymptotic theorems of mathematical statistics.
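To make this concrete, here is a simulation sketch of the classical use of part (ii) (anticipating the central limit theorem of Chapter 3): if √n Ȳn/σ ⇝ N(0, 1) and σ̂n/σ converges to 1 in probability, then the studentized statistic √n Ȳn/σ̂n ⇝ N(0, 1) as well. The sample size, number of replications and σ below are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(0)
    n, reps, sigma = 200, 20000, 3.0
    Y = rng.normal(0.0, sigma, size=(reps, n))
    T = np.sqrt(n) * Y.mean(axis=1) / Y.std(axis=1, ddof=1)  # studentized means
    print(T.mean(), T.var())    # approximately 0 and 1
    print(np.mean(T <= 1.96))   # approximately Phi(1.96) = 0.975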

Exercises

1. Show that Xn ⇝ X iff E[f(Xn)] → E[f(X)] for all bounded uniformly continuous functions f.
2. Show the implication Fn(x) → F(x) for all x ∈ CF ⇒ µn w→ µ without referring to the almost sure representation theorem. Hint: first take for a given ε > 0 a K > 0 such that F(K) − F(−K) > 1 − ε. Approximate a function f ∈ Cb(R) on the interval [−K, K] by a piecewise constant function, compute the integrals of this approximating function and use the convergence of Fn at continuity points of F, etc.

3. Let µn be a sequence of discrete uniform distributions on [0, 1]: µn({i/n}) = 1/n, i = 1, . . . , n. Show that µn is weakly convergent and identify the weak limit.

4. Let Xn be an i.i.d. sequence of exponentially distributed random variables: FXn(x) = P(Xn ≤ x) = 1 − e^{−x} for x ≥ 0 and FXn(x) = 0 for x < 0. Let Mn = −log n + max_{1≤i≤n} Xi. Show that FMn ⇝ FM, where FM(x) = P(M ≤ x) = e^{−e^{−x}}, x ∈ R. The latter distribution is known as the Gumbel distribution (or the extreme value distribution).
5. Consider the N(µn, σn²) distributions, where the µn are real numbers and the σn² nonnegative. Show that this family is tight iff the sequences (µn) and (σn²) are bounded. Under what condition do we have that the N(µn, σn²) distributions converge to a (weak) limit? What is this limit?
6. Let random variables X and Xn possess discrete distributions supported on N. Show that Xn ⇝ X if and only if P(Xn = m) → P(X = m) for every m ∈ N.
7. Give an example of distribution functions F and Fn on the real line, such that Fn w→ F, but sup_x |Fn(x) − F(x)| → 0 fails.
8. For a distribution function G on the real line the median is defined by G⁻¹(1/2). Assume that Fn ⇝ F and let m = med(F) and mn = med(Fn) be the medians of F and Fn, respectively. Find suitable assumptions under which mn → m as n → ∞.
9. Let F and G be two distribution functions on R and let

L(F, G) = inf{h > 0 : F(x − h) − h ≤ G(x) ≤ F(x + h) + h for all x ∈ R}

be the Lévy distance between them (accept as a fact, or prove for yourself, that L(F, G) defines a distance). Show that the weak convergence Fn ⇝ F is equivalent to convergence in the Lévy metric: L(Fn, F) → 0. Hint: the implication L(Fn, F) → 0 ⇒ Fn ⇝ F follows from the definition. The other one can be established by contradiction.
10. Prove uniqueness of the weak limit µ of a weakly convergent sequence of probability measures µn.


CHAPTER 2

Characteristic functions

2.1. Definition and first properties

Let X be a random variable defined on (Ω, F, P). X induces a probability measure on (R, B), the law or distribution of X, denoted by PX or µ. This probability measure, in turn, determines the distribution function F of X. Conversely, F also determines PX. Hence distribution functions on R and probability measures on (R, B) are in bijective correspondence. In this chapter we develop another such correspondence. We start with a definition.

Definition 33. Let µ be a probability measure on (R, B). Its characteristic function φ : R → C is defined by

(2.1) φ(u) = ∫_R e^{ıux} µ(dx).

Whenever needed, we write φµ instead of φ to express the dependence on µ.

Note that in this definition we integrate a complex-valued function. By splitting a complex-valued function f = g + ıh into its real part g and imaginary part h, we define ∫ f dµ := ∫ g dµ + ı ∫ h dµ. For integrals of complex-valued functions, previously shown theorems are, mutatis mutandis, true. For instance, one has |∫ f dµ| ≤ ∫ |f| dµ, where |·| denotes the norm of a complex number.

If X is a random variable with distribution µ, then φµ can alternatively be expressed as φ(u) = E[exp(ıuX)]. There are many random variables with distribution µ; they all have the same characteristic function. We also adopt the notation φX to indicate that we are dealing with the characteristic function of the random variable X.

Before we give some examples and elementary properties of characteristic functions, we look at a special case. Suppose that X admits a density f with respect to Lebesgue measure. Then

(2.2) φX(u) = ∫_R e^{ıux} f(x) dx.

Analysts define for f ∈ L1(R, B, λ) the Fourier transform f̂ by

f̂(u) = ∫_R e^{−ıux} f(x) dx.

What we thus see is the equality φX(u) = f̂(−u). Given the usefulness of Fourier transforms in various branches of mathematics, we then get a feeling that characteristic functions will be important in probability theory as well.

Computation of a characteristic function (if it is explicitly computable) is typically a clever exercise in integration.


Example 34. Let X ∼ N(0, 1). Then

φX(u) = E[e^{ıuX}] = (1/√(2π)) ∫_R e^{ıux} e^{−x²/2} dx = e^{−u²/2}.

In fact,

(1/√(2π)) ∫_R e^{ıux} e^{−x²/2} dx = (1/√(2π)) ∫_R Σ_{n=0}^{∞} ((ıux)^n/n!) e^{−x²/2} dx
= Σ_{n=0}^{∞} ((ıu)^n/n!) (1/√(2π)) ∫_R x^n e^{−x²/2} dx
= Σ_{n=0}^{∞} ((ıu)^n/n!) E[X^n].

For n odd, E[X^n] = 0, while by Stein's lemma (see Lemma 35 ahead), for n even, E[X^n] = (n − 1)!!. Hence the above chain of equalities can be continued as

Σ_{n=0}^{∞} ((ıu)^n/n!) E[X^n] = Σ_{n=0}^{∞} ((ıu)^{2n}/(2n)!) (2n − 1)!!
= Σ_{n=0}^{∞} ((ıu)^{2n}/(2n)!) ((2n)!/(2^n n!))
= Σ_{n=0}^{∞} (−u²/2)^n (1/n!)
= e^{−u²/2}.

Here we used the fact that

(2n − 1)!! = Π_{i=1}^{n} (2i − 1) = Π_{i=1}^{n} (2i − 1) · (Π_{i=1}^{n} (2i)) · (1/Π_{i=1}^{n} (2i)) = (2n)!/(2^n n!).
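The computation in Example 34 is easily corroborated by comparing the empirical characteristic function of a simulated N(0, 1) sample with e^{−u²/2}; a Python sketch (sample size and evaluation points are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=100000)

    for u in [0.0, 0.5, 1.0, 2.0]:
        ecf = np.mean(np.exp(1j * u * X))   # Monte Carlo estimate of E[exp(iuX)]
        print(u, ecf, np.exp(-u**2 / 2))    # real part close to exp(-u^2/2), imaginary part close to 0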

Lemma 35 (Stein's lemma). Let X ∼ N(0, 1) and let g be a differentiable function satisfying E[g(X)X] < ∞ and E[g′(X)] < ∞. Then E[g(X)X] = E[g′(X)].

Proof. We have

E[g(X)X] = (1/√(2π)) ∫_R g(x) x e^{−x²/2} dx.

By partial integration the right-hand side is equal to

−(1/√(2π)) g(x) e^{−x²/2} |_{x=−∞}^{x=∞} + (1/√(2π)) ∫_R g′(x) e^{−x²/2} dx = E[g′(X)].

This completes the proof. □

Here is another illustrative example.


Example 36. Let X have a standard Cauchy distribution. Directly from the definition, when u = 0, φX(u) = 1. Now assume u ≠ 0. Then

φX(u) = (1/π) ∫_R (e^{ıux}/(1 + x²)) dx = (1/π) ∫_R (cos(ux)/(1 + x²)) dx = |u| (1/π) ∫_R (cos(y)/(u² + y²)) dy.

The integral in the last equality is best evaluated through contour integration techniques. Let CR be a closed contour consisting of the real line segment from −R to R and the upper semi-circle ΓR centred at the origin and of radius R. It can be shown that

∫_{ΓR} (e^{ız}/(u² + z²)) dz → 0

as R → ∞, see pp. 145–146 in Bak and Newman (2010). Therefore,

∫_{CR} (e^{ız}/(u² + z²)) dz → ∫_R (e^{ıy}/(u² + y²)) dy.

Taking real parts on both sides, since z0 = ı|u| is the only pole of the function e^{ız}/(u² + z²) in the upper half-plane, by the residue theorem we get that

∫_R (cos(y)/(u² + y²)) dy = Re[2πı Res[e^{ız}/(u² + z²), z0]].

Now, since the denominator factorises as (z − ı|u|)(z + ı|u|), the point z0 is a simple pole, and it follows (cf. p. 130 in Bak and Newman (2010)) that

Res[e^{ız}/(u² + z²), z0] = lim_{z→z0} (z − z0) (e^{ız}/(u² + z²)) = e^{−|u|}/(2ı|u|).

Thus φX(u) = e^{−|u|} for u ≠ 0. We conclude that φX(u) = e^{−|u|} for all u ∈ R.

The following proposition lists some simple properties of characteristic functions.

Proposition 37. Let φ = φX be the characteristic function of some random variable X. The following hold true:

(i) φ(0) = 1, |φ(u)| ≤ 1 for all u ∈ R;
(ii) φ is uniformly continuous on R;
(iii) φ_{aX+b}(u) = φX(au) e^{ıub};
(iv) φ is real-valued and symmetric around zero, if X and −X have the same distribution;
(v) if X and Y are independent, then φ_{X+Y}(u) = φX(u)φY(u);
(vi) if E|X|^k < ∞, then φ ∈ C^k(R) and φ^{(k)}(0) = ı^k E[X^k].

Proof. Properties (i), (iii) and (iv) are trivial. Consider (ii). Fixing u ∈ R, we consider φ(u + t) − φ(u) for t → 0. We have

|φ(u + t) − φ(u)| = |∫ (exp(ı(u + t)x) − exp(ıux)) µ(dx)| ≤ ∫ |exp(ıtx) − 1| µ(dx).

The functions x ↦ exp(ıtx) − 1 converge to zero pointwise as t → 0 and are bounded by 2. The result thus follows from dominated convergence.

Property (v) follows from the product rule for expectations of independent random variables.

Finally, property (vi) for k = 1 follows by an application of the dominated convergence theorem and the inequality |e^{ıx} − 1| ≤ |x|, x ∈ R. The other cases can be treated similarly. □

Remark 38. Here is a simple application of Proposition 37 (iii): if X ∼ N(m, σ²), then φX(u) = e^{ıum − σ²u²/2}.

Remark 39. Warning: the converse to Proposition 37 (v) is typically false, i.e. from the equality

φ_{X+Y}(u) = φX(u)φY(u),  u ∈ R,

it cannot be concluded that X and Y are independent. Here is a counterexample: let X have a standard Cauchy distribution and let Y = X. Then

e^{−2|u|} = φ_{2X}(u) = φ_{X+Y}(u) = e^{−|u|} e^{−|u|} = φX(u)φY(u),

although X and Y are obviously dependent in this case. More on this example later.

2.2. Inversion formula and uniqueness

Given a characteristic function φ, how can we find the corresponding distribution function F, or the corresponding law µ? As we will see, an answer to this question is given by the inversion formula below. Note that the integration interval in formula (2.3) is symmetric around zero. This is essential: the improper integral

∫_{−∞}^{∞} ((e^{−ıua} − e^{−ıub})/(ıu)) φ(u) du

typically does not exist (in the Lebesgue sense). That the limit in (2.3), called the Cauchy limit, is finite, is actually part of the assertion of the theorem.

Theorem 40. Let µ be a probability law and φ its characteristic function. Then for all a < b,

(2.3) lim_{T→∞} (1/(2π)) ∫_{−T}^{T} ((e^{−ıua} − e^{−ıub})/(ıu)) φ(u) du = µ((a, b)) + (1/2) µ({a, b}).

Proof. We compute, using Fubini's theorem, which we will justify below,

(2.4) ΦT := (1/(2π)) ∫_{−T}^{T} ((e^{−ıua} − e^{−ıub})/(ıu)) φ(u) du
= (1/(2π)) ∫_{−T}^{T} ((e^{−ıua} − e^{−ıub})/(ıu)) ∫_R e^{ıux} µ(dx) du
= (1/(2π)) ∫_R ∫_{−T}^{T} ((e^{−ıua} − e^{−ıub})/(ıu)) e^{ıux} du µ(dx)
(2.5) = (1/(2π)) ∫_R ∫_{−T}^{T} ((e^{ı(x−a)u} − e^{ı(x−b)u})/(ıu)) du µ(dx)
=: ∫_R ET(x) µ(dx).

Application of Fubini's theorem is justified as follows. First, the integrand in (2.5) is bounded by b − a, because |e^{ıx} − e^{ıy}| ≤ |x − y| for all x, y ∈ R. Second, the product measure λ × µ on [−T, T] × R is finite.


By splitting the integrand of ET(x) into its real and imaginary parts, we see that the imaginary part vanishes and we are left with

ET(x) = (1/(2π)) ∫_{−T}^{T} ((sin((x − a)u) − sin((x − b)u))/u) du
= (1/(2π)) ∫_{−T}^{T} (sin((x − a)u)/u) du − (1/(2π)) ∫_{−T}^{T} (sin((x − b)u)/u) du
= (1/(2π)) ∫_{−T(x−a)}^{T(x−a)} ((sin v)/v) dv − (1/(2π)) ∫_{−T(x−b)}^{T(x−b)} ((sin v)/v) dv.

The function g given by g(s, t) = ∫_s^t ((sin y)/y) dy is continuous in (s, t). Hence it is bounded on any compact subset of R². Moreover, g(s, t) → π as s → −∞ and t → ∞ (this can be shown by contour integration techniques; see e.g. pp. 146–147 in Bak and Newman (2010); an alternative derivation is given at http://staff.science.uva.nl/~hvzanten/ex_5_9.pdf). Hence g, as a function on R², is bounded. We conclude that also ET(x) is bounded as a function of T and x, the first ingredient needed to apply the dominated convergence theorem to (2.5), since µ is a finite measure. The second ingredient is to identify E(x) := lim_{T→∞} ET(x). For an arbitrary α, a change of the integration variable x = αy gives

∫_0^∞ (sin(αy)/y) dy = sgn(α) π/2.

Here sgn(α) denotes 1, 0 or −1 according to whether α > 0, α = 0 or α < 0. By comparing the location of x relative to a and b, we use the value of the latter integral to obtain

E(x) = 1 if a < x < b;  E(x) = 1/2 if x = a or x = b;  E(x) = 0 else.

We thus get, using the dominated convergence theorem again, that

ΦT → µ((a, b)) + (1/2) µ({a, b})

as T → ∞. This completes the proof. □

Remark 41. If a and b are continuity points of F, then the right-hand side of (2.3) is F(b) − F(a). Thus φ determines F at all continuity points of F. But due to right-continuity of F, the latter completely determines F. F in turn determines µ, and so φ determines µ.

Let us now give another version of the inversion formula.

Theorem 42. If the characteristic function φ of a probability measure µ on (R, B) belongs to L1(R, B, λ), then µ admits a density f w.r.t. the Lebesgue measure λ. Moreover, f is continuous.

Proof. Define

(2.6) f(x) = (1/(2π)) ∫_R e^{−ıux} φ(u) du.


Since |φ| has a finite integral, f is well defined for every x. Observe that f is real-valued, because φ(−u) is the complex conjugate of φ(u). An easy application of the dominated convergence theorem shows that f is continuous. Now note first that the limit of the integral in (2.3) is equal to the (Lebesgue) integral (1/(2π)) ∫_R ((e^{−ıua} − e^{−ıub})/(ıu)) φ(u) du, again because of dominated convergence. Next we use Fubini's theorem to compute for any continuity points a < b of F that

∫_a^b f(x) dx = (1/(2π)) ∫_a^b ∫_R e^{−ıux} φ(u) du dx
= (1/(2π)) ∫_R φ(u) ∫_a^b e^{−ıux} dx du
= (1/(2π)) ∫_R φ(u) ((e^{−ıua} − e^{−ıub})/(ıu)) du
= F(b) − F(a),

where we also employed Theorem 40. Next, by continuity of ∫_a^b f(x) dx in a and b, the relationship

∫_a^b f(x) dx = F(b) − F(a)

in fact holds for any a, b ∈ R. By continuity of f, for any y ∈ [a, b] the Lebesgue integral ∫_a^y f(x) dx equals the Riemann integral. By the fundamental theorem of calculus it follows that F′(y) = f(y) for all y ∈ (a, b), and so for all y ∈ R. Since F is non-decreasing, f must be nonnegative, and hence it is a probability density. □
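Since the standard Cauchy characteristic function φ(u) = e^{−|u|} from Example 36 is integrable, (2.6) applies to it; here is a numerical sketch of the inversion in Python, with the integral replaced by a Riemann sum over a truncated range (both choices arbitrary):

    import numpy as np

    u = np.linspace(-60.0, 60.0, 600001)  # truncation error is of order exp(-60)
    du = u[1] - u[0]
    phi = np.exp(-np.abs(u))

    for x in [0.0, 1.0, 3.0]:
        f_x = (np.sum(np.exp(-1j * u * x) * phi) * du / (2 * np.pi)).real
        print(x, f_x, 1.0 / (np.pi * (1.0 + x**2)))  # recovers the Cauchy density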

Remark 43. Note the duality between the expressions (2.2) and (2.6). Apart from the presence of the minus sign in the integral and the factor 2π in the denominator in (2.6), the transformations f ↦ φ and φ ↦ f are similar.

The inversion theorem entails one very important result.

Theorem 44. Random variables X and Y are equal in distribution if and only if their characteristic functions are the same: φX(t) = φY(t) for all t ∈ R.

Proof. One side of the theorem is trivial. For the other side we argue as follows: suppose φX(t) = φY(t) for all t ∈ R. By Fubini's theorem and the inversion formula for characteristic functions, for every σn > 0 and y ∈ R we have

∫_R e^{−ıty} e^{−σn²t²/2} φX(t) dt = ∫_R e^{−ıty} e^{−σn²t²/2} E[e^{ıtX}] dt
= E[∫_R e^{−ıt(y−X)} e^{−σn²t²/2} dt]
= (√(2π)/σn) E[e^{−(y−X)²/(2σn²)}]
= (√(2π)/σn) ∫_R e^{−(y−x)²/(2σn²)} dFX(x)
= 2π f_{X+σnZ}(y).

Here Z is a standard normal random variable independent of X and f_{X+σnZ} is the density of X + σnZ with respect to the Lebesgue measure. Replace φX with φY in the above argument to see that f_{X+σnZ}(y) = f_{Y+σnZ}(y). This implies that for every σn > 0, X + σnZ d= Y + σnZ. Letting σn → 0 as n → ∞, Slutsky's lemma gives that X + σnZ ⇝ X. Likewise, X + σnZ ⇝ Y. Due to the uniqueness of the weak limit, we then obtain that X d= Y. □

Put another way, Theorem 44 implies that there is a one-to-one correspondence between probability measures and characteristic functions.

2.3. Necessary conditions

In the previous sections we have derived some properties a characteristic function possesses. Equally interesting is finding general conditions for a function φ to be a characteristic function. We will formulate two results in that direction. Their proofs can be found e.g. in Chung (2001) (see Theorems 6.5.2 and 6.5.3 there). The first result gives a necessary and sufficient condition, but is not easily verifiable. The second one is only sufficient, but its conditions are simpler.

Recall that a complex-valued function φ on R is called positive definite, if for any finite set of real numbers tj and complex numbers zj, 1 ≤ j ≤ n, n = 1, 2, . . . , we have

Σ_{j=1}^{n} Σ_{k=1}^{n} φ(tj − tk) zj z̄k ≥ 0,

where z̄k is the complex conjugate of zk.

Theorem 45 (Bochner–Khinchin theorem). A function φ is a characteristic function if and only if it is positive definite, continuous at 0, and φ(0) = 1.

Theorem 46 (Polya's theorem). Let φ satisfy the following conditions: φ(0) = 1, φ is nonnegative, symmetric around zero, and decreasing, continuous and convex on [0, ∞). Then φ is a characteristic function.

Example 47. Let 0 < α ≤ 1. An application of Polya's theorem gives that the function

φα(u) = e^{−|u|^α}

is a characteristic function (check this). No such luck when 1 < α < 2, but via an alternative route φα can nevertheless be shown to be a characteristic function in that case as well (see e.g. pp. 192–193 in Chung (2001)). When α = 2, we know that φα corresponds to the normal distribution. A probability distribution that has φα as a characteristic function is called a stable distribution with index α. We finally remark that it can be shown that φα with α > 2 is not a characteristic function (in this case φα is twice differentiable at zero and φα′(0) = φα″(0) = 0; assume φα is a characteristic function; by Theorem 6.4.1 in Chung (2001) the first and second moments of the corresponding probability law are then zero; but then µ must be the Dirac measure at zero, so that φα(u) = 1 for all u ∈ R, a contradiction).

2.4. Multidimensional case

Our treatment in this section is cursory and we omit most details.
The characteristic function φ of a probability measure µ on (Rk, B(Rk)) is defined by the k-dimensional analogue of (2.1): with u, x ∈ Rk and ⟨·, ·⟩ the standard inner product,

\[
\varphi(u) = \int_{\mathbb{R}^k} e^{\imath \langle u, x \rangle}\, \mu(dx).
\]


As in the real case, probability measures are uniquely determined by their characteristic functions. As a consequence we have the following characterization of independent random variables.

Proposition 48. Let X = (X1, . . . , Xk) be a k-dimensional random vector. Then X1, . . . , Xk are independent random variables if and only if

\[
\varphi_X(u) = \prod_{i=1}^{k} \varphi_{X_i}(u_i) \quad \text{for all } u = (u_1, \ldots, u_k) \in \mathbb{R}^k.
\]

Proof. If the Xi are independent, the factorization of the characteristic function is proved in the same way as Proposition 37 (v). If the characteristic function φX factorizes as stated, the result follows from the uniqueness property of characteristic functions.

Remark 49. Let k = 2 in the above proposition. If X1 = X2 as in Remark 39, then we do not have φX(u) = φX1(u1)φX2(u2) for every (u1, u2) (you check!), in agreement with the fact that X1 and X2 are not independent. But for the special choice u1 = u2 this product relation holds true.

Example 50. Let X and Y be independent standard normal random variables. Then, somewhat unexpectedly, the random variables X − Y and X + Y are also independent, which can be shown using Proposition 48; see the sketch below.
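A sketch of the computation behind Example 50 (spelled out here for convenience): with W = (X − Y, X + Y) and u = (u1, u2),

\[
\varphi_W(u) = E\big[ e^{\imath (u_1 (X - Y) + u_2 (X + Y))} \big]
= E\big[ e^{\imath (u_1 + u_2) X} \big]\, E\big[ e^{\imath (u_2 - u_1) Y} \big]
= e^{-(u_1 + u_2)^2/2}\, e^{-(u_2 - u_1)^2/2}
= e^{-u_1^2}\, e^{-u_2^2},
\]

and \(e^{-u_i^2}\) is precisely the characteristic function of an N(0, 2) random variable, i.e. of X − Y and of X + Y. Proposition 48 then yields the independence.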

Exercises

1 Let φ be a characteristic function. Show that so is |φ|².
2 If F and G are distribution functions, such that F = \(\sum_{j=1}^{m} b_j \delta_{a_j}\) and G has a density, say g, show that the convolution F ∗ G has a density and find it.
3 Show that for any characteristic function φ,

\[
\operatorname{Re}[1 - \varphi(u)] \ge \tfrac{1}{4} \operatorname{Re}[1 - \varphi(2u)].
\]

4 A random variable X with the characteristic function φ is symmetric if and only if φ(u) is real for all u ∈ R.
5 Let X1, X2, . . . be a sequence of i.i.d. random variables and N a Poisson(λ) distributed random variable, independent of the Xn. Put Y = \(\sum_{n=1}^{N} X_n\). Let φ be the characteristic function of the Xn and ψ the characteristic function of Y. Show that ψ = exp(λφ − λ).
6 If X has an exponential distribution with parameter λ, then φX(u) = λ/(λ − ıu).
7 Let φ be a real characteristic function with the property that φ(nu) = φ(u)ⁿ for all u ∈ R and n ∈ N. Show that for some α ≥ 0 it holds that φ(u) = exp(−α|u|). Let X have characteristic function φ(u) = exp(−α|u|). If α > 0, show that X admits the density

\[
x \mapsto \frac{\alpha}{\pi} \cdot \frac{1}{\alpha^2 + x^2}.
\]

What is the distribution of X if α = 0?
8 Prove the statement made in Example 50. Also verify that the function φα from Example 47 is indeed a characteristic function for 0 < α ≤ 1.
9 Let µ be a probability law on (R, B(R)) and let φ be the corresponding characteristic function. Show that for any fixed x ∈ R,

\[
\lim_{T \to \infty} \frac{1}{2T} \int_{-T}^{T} e^{-\imath u x} \varphi(u)\, du = \mu(\{x\}).
\]

Hint: reduce the question to studying

\[
\int_{\mathbb{R} \setminus \{x\}} \frac{\sin(T(y - x))}{T(y - x)}\, \mu(dy) + \int_{\{x\}} \mu(dy).
\]

10 Let the distribution function F on R have a density f with respect to the Lebesgue measure. Prove that for the corresponding characteristic function φ one has φ(u) → 0 as |u| → ∞. This result is known as the Riemann-Lebesgue lemma and its 'analytic counterpart' is of importance in harmonic analysis. You may additionally assume that f is continuous Lebesgue-a.e. You will get a bonus point if you prove the result for a general f (not necessarily continuous).


CHAPTER 3

Limit theorems

This chapter deals with a number of important limit theorems in probability theory. Their proofs are to a considerable extent based on characteristic function techniques.

3.1. Characteristic functions and weak convergence

In this section we study how characteristic functions relate to weak convergence. Our first result says that weak convergence of probability measures implies pointwise convergence of their characteristic functions.

Proposition 51. Let µ, µ1, µ2, . . . be probability measures on (R, B) and let φ, φ1, φ2, . . . be their characteristic functions. If µn w→ µ, then φn(u) → φ(u) for every u ∈ R.

Proof. Consider for fixed u the function f(x) = eıux. It is obviously bounded and continuous, and we obtain straight from the definition of weak convergence that µn(f) → µ(f). But µn(f) = φn(u) and µ(f) = φ(u).

Proposition 52. Let µ1, µ2, . . . be probability measures on (R, B) and let φ1, φ2, . . . be the corresponding characteristic functions. Assume that the sequence (µn) is tight and that for all u ∈ R the limit φ(u) := limn→∞ φn(u) exists. Then there exists a probability measure µ on (R, B), such that φ = φµ and µn w→ µ.

Proof. Since (µn) is tight, we use Prokhorov's theorem to deduce that there exists a weakly converging subsequence (µ_{n_k}) with a probability measure as limit. Call this limit µ. From Proposition 51 we know that φ_{n_k}(u) → φµ(u) for all u. Hence we must have φµ = φ. We now show that any convergent subsequence of (µn) has µ as its limit. Suppose that there exists a subsequence (µ_{n′_k}) with limit µ′. Then φ_{n′_k}(u) converges to φµ′(u) for all u. But, since (µ_{n′_k}) is a subsequence of the original sequence, by assumption the corresponding φ_{n′_k}(u) must converge to φ(u) for all u. Hence we conclude that φµ′ = φµ, and then µ′ = µ by Theorem 44.

Suppose now that the whole sequence (µn) does not converge to µ. Then there must exist a function f ∈ Cb(R), such that µn(f) does not converge to µ(f). So, there is ε > 0, such that for some subsequence (n′_k) we have

\[
(3.1) \qquad |\mu_{n'_k}(f) - \mu(f)| > \varepsilon.
\]

Using Prokhorov's theorem, the sequence (µ_{n′_k}) has a further subsequence (µ_{n″_k}) that has a limit probability measure µ″. By the same argument as above (convergence of the characteristic functions) we conclude that µ″ = µ, so that µ_{n″_k}(f) → µ(f), which contradicts (3.1).



Characteristic functions give a rough estimate of the tail probabilities of a random variable, which is useful for establishing tightness of a sequence of probability measures. To that end we will use the following lemma. By taking the complex conjugate, check first that \(\int_{-a}^{a} (1 - \varphi(u))\, du \in \mathbb{R}\) for every a > 0.

Lemma 53. Let a random variable X have distribution µ and characteristic function φ. Then for every K > 0,

\[
(3.2) \qquad P(|X| > 2K) \le K \int_{-1/K}^{1/K} (1 - \varphi(u))\, du.
\]

Proof. It follows from Fubini's theorem and

\[
\int_{-a}^{a} e^{\imath u x}\, du = 2\, \frac{\sin(ax)}{x}
\]

that

\[
\begin{aligned}
K \int_{-1/K}^{1/K} (1 - \varphi(u))\, du
&= K \int_{-1/K}^{1/K} \int_{\mathbb{R}} (1 - e^{\imath u x})\, \mu(dx)\, du \\
&= \int_{\mathbb{R}} K \int_{-1/K}^{1/K} (1 - e^{\imath u x})\, du\, \mu(dx) \\
&= 2 \int_{\mathbb{R}} \left[ 1 - \frac{\sin(x/K)}{x/K} \right] \mu(dx) \\
&\ge 2 \int_{|x/K| > 2} \left[ 1 - \frac{\sin(x/K)}{x/K} \right] \mu(dx) \\
&\ge \mu([-2K, 2K]^c),
\end{aligned}
\]

since (sin x)/x ≤ 1/2 for |x| > 2.¹
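As a quick illustration of how (3.2) is used (an added example, not part of the original argument): for a standard normal X one has φ(u) = e^{−u²/2} and 1 − e^{−u²/2} ≤ u²/2, so

\[
P(|X| > 2K) \le K \int_{-1/K}^{1/K} \frac{u^2}{2}\, du = \frac{1}{3K^2}.
\]

This is of course far from the true Gaussian tail, but the bound requires only control of φ near zero, and that is exactly what is available in the proof of the continuity theorem below.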

The following theorem is known as Levy’s continuity theorem.

Theorem 54 (Levy's continuity theorem). Let µ1, µ2, . . . be a sequence of probability measures on (R, B) and φ1, φ2, . . . the corresponding characteristic functions. Assume that for all u ∈ R the limit φ(u) := limn→∞ φn(u) exists. If φ is continuous at zero, then there exists a probability measure µ on (R, B), such that φ = φµ and µn w→ µ.

Proof. We will show that under the present assumptions the sequence (µn) is tight. To this end we will use Lemma 53. Let ε > 0. Since φ is continuous at zero, the same holds for u ↦ φ(−u), and there is δ > 0 such that |φ(u) + φ(−u) − 2| < ε if |u| < δ. Notice that φ(u) + φ(−u) is real-valued and bounded from above by 2. Hence

\[
2 \int_{-\delta}^{\delta} (1 - \varphi(u))\, du = \int_{-\delta}^{\delta} (2 - \varphi(u) - \varphi(-u))\, du \in [0, 2\delta\varepsilon).
\]

By the convergence of the characteristic functions (which are bounded), the dominated convergence theorem implies that

\[
\int_{-\delta}^{\delta} (1 - \varphi_n(u))\, du \to \int_{-\delta}^{\delta} (1 - \varphi(u))\, du.
\]

Hence, for all n ≥ N with N chosen large enough, we have

\[
\int_{-\delta}^{\delta} (1 - \varphi_n(u))\, du < 2\delta\varepsilon.
\]

¹ The function g(x) = (sin x)/x is called the cardinal sine, or simply the sinc function.


It now follows from Lemma 53 that for n ≥ N and K = 1/δ,

\[
\mu_n([-2K, 2K]^c) \le \frac{1}{\delta} \int_{-\delta}^{\delta} (1 - \varphi_n(u))\, du < 2\varepsilon.
\]

We conclude that (µn)_{n≥N} is tight, and then so is the sequence (µn)_{n∈N} as well. Apply Proposition 52 to conclude.

Corollary 55. Let µ, µ1, µ2, . . . be probability measures on (R, B) and φ, φ1, φ2, . . . their corresponding characteristic functions. Then µn w→ µ if and only if φn(u) → φ(u) for all u ∈ R.

Proof. If φn(u) → φ(u) for all u ∈ R, then we can apply Theorem 54: the function φ, being a characteristic function, is continuous at zero. Hence there is a probability measure to which the µn weakly converge. But since the φn(u) converge to φ(u), the limiting probability measure must be µ. The converse statement we have encountered as Proposition 51.
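A standard example (added here for illustration) shows that continuity of the limit at zero cannot be dropped from Theorem 54: let µn = N(0, n). Then

\[
\varphi_n(u) = e^{-n u^2/2} \to \varphi(u) =
\begin{cases}
1, & u = 0, \\
0, & u \ne 0,
\end{cases}
\]

so the pointwise limit exists but is discontinuous at zero. Indeed, (µn) is not tight: the mass escapes to infinity and µn does not converge weakly to any probability measure.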

3.2. Weak law of large numbers

In this section we present the weak law of large numbers for a sequence of i.i.d. random variables. In its proof we will need the following elementary result from calculus.

Lemma 56. Let z be a complex number, such that |z| ≤ 1/2. Then there exists a complex number θ depending on z, such that |θ| ≤ 1 and log(1 + z) = z + θ|z|².

Proof. Without loss of generality, assume that z ≠ 0 (when z = 0, log(1 + z) = 0 = z, and hence θ = 0). We have

\[
\begin{aligned}
\log(1 + z) &= z - \frac{z^2}{2} + \frac{z^3}{3} - \frac{z^4}{4} + \cdots \\
&= z + z^2 \left( -\frac{1}{2} + \frac{z}{3} - \frac{z^2}{4} + \cdots \right) \\
&= z + |z|^2\, \frac{z^2}{|z|^2} \left( -\frac{1}{2} + \frac{z}{3} - \frac{z^2}{4} + \cdots \right).
\end{aligned}
\]

We claim that

\[
\theta = \frac{z^2}{|z|^2} \left( -\frac{1}{2} + \frac{z}{3} - \frac{z^2}{4} + \cdots \right).
\]

To verify the claim, we need to check that |θ| ≤ 1. This, however, is easy:

\[
|\theta| = \left| -\frac{1}{2} + \frac{z}{3} - \frac{z^2}{4} + \cdots \right|
\le \frac{1}{2} + \frac{1}{3} \cdot \frac{1}{2} + \frac{1}{4} \left( \frac{1}{2} \right)^2 + \cdots
\le \sum_{k=1}^{\infty} \left( \frac{1}{2} \right)^k = 1.
\]

Corollary 57. If a sequence of complex numbers cn converges to the limit c, then

\[
\lim_{n \to \infty} \left( 1 + \frac{c_n}{n} \right)^n = e^c.
\]


Proof. It is sufficient to prove that

\[
\lim_{n \to \infty} \log \left( 1 + \frac{c_n}{n} \right)^n = \lim_{n \to \infty} n \log \left( 1 + \frac{c_n}{n} \right) = c.
\]

Since the sequence cn converges, it is bounded, and furthermore |cn/n| ≤ 1/2 for all n large enough. Then from Lemma 56,

\[
n \log \left( 1 + \frac{c_n}{n} \right) = c_n + \theta_n \frac{|c_n|^2}{n} = c_n + o(1).
\]

Because the right-hand side tends to c as n→∞, the result follows.

Theorem 58 (Weak law of large numbers). Let X1, X2, . . . be i.i.d. random variables with characteristic function φ. Assume that φ is differentiable at zero and φ′(0) = ıµ. Then

\[
\overline{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i \xrightarrow{P} \mu.
\]

Proof. By differentiability of φ at zero, we have, as t → 0,

\[
\varphi(t) = \varphi(0) + \varphi'(0)\, t + o(t) = 1 + \imath \mu t + o(t).
\]

By independence of the Xi, for every fixed t,

\[
E\big[ e^{\imath t \overline{X}_n} \big] = \varphi^n \left( \frac{t}{n} \right)
= \left( 1 + \imath \mu \frac{t}{n} + o\left( \frac{1}{n} \right) \right)^n.
\]

As n → ∞, by Corollary 57 the right-hand side converges to e^{ıtµ}. Now t ↦ e^{ıtµ} is the characteristic function of the constant random variable µ. By Levy's continuity theorem, X̄n ⇝ µ. Since convergence in distribution and convergence in probability are equivalent for constant limits, it follows that X̄n →P µ.
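A tiny simulation sketch of Theorem 58 (purely illustrative; the exponential distribution, the sample sizes and the seed are arbitrary choices, not from the text):

import random, statistics

random.seed(1)
mu = 2.0  # mean of the Exponential(1/2) distribution used below

# For each n, generate the sample mean of n i.i.d. draws with mean mu
# and watch it settle near mu, as the weak law predicts.
for n in [10, 100, 1000, 10000]:
    sample = [random.expovariate(1.0 / mu) for _ in range(n)]
    print(n, statistics.fmean(sample))

A typical run shows the sample means approaching 2.0 as n grows; of course a single run only illustrates, and does not prove, the convergence in probability.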

Remark 59. If E[|X1|] < ∞, then the dominated convergence theorem allows one to interchange the order of differentiation and expectation to obtain

\[
(3.3) \qquad \varphi'(t) = \frac{d}{dt} E\big[ e^{\imath t X_1} \big]
= E\left[ \frac{d}{dt} e^{\imath t X_1} \right]
= \imath\, E\big[ X_1 e^{\imath t X_1} \big].
\]

For t = 0 this yields φ′(0) = ıE[X1] = ıµ and X̄n →P E[X1], which is hardly surprising in light of the strong law of large numbers. However, integrability of X1 is a sufficient, but not a necessary, condition for the differentiability of φ at zero. Hence the weak law of large numbers holds under a weaker condition than the strong law.

Remark 60. The condition φ′(0) = ıµ is also necessary for the convergence X̄n →P µ. We will not prove this fact; for the proof see e.g. Theorem 2.5.5 in Revesz (1968). An alternative necessary and sufficient condition for the weak law of large numbers, one that does not employ characteristic functions, is also known (see e.g. Chung (2001), pp. 116–118). Furthermore, Chung (2001), pp. 118–119, contains an example in which the weak law of large numbers holds while the strong law fails.


3.3. Probabilities of large deviations

The weak law of large numbers does not provide information on the probabilities of large deviations of X̄n from µ. Deriving results in this setting is the task of an important and deep branch of probability theory, large deviations theory, which is beyond the scope of the present course. We only remark that the treatment of the case when a sequence of i.i.d. random variables {Xn} satisfies Cramer's condition,

\[
(3.4) \qquad \exists\, \lambda > 0 \ \text{such that} \ \phi(\lambda) = E\big[ e^{\lambda X_1} \big] < \infty,
\]

is relatively elementary, and we refer the reader to pp. 400–403 in Shiryaev (1996) for details. Under (3.4), E[X1] = µ < ∞. The function ϕ is called the moment-generating function of X1, or the Laplace transform (of the law) of X1, as it is often called in the nonprobabilistic literature. It is obtained by replacing the argument of the characteristic function of X1 with −ıλ. In light of this, the moment-generating function possesses many properties similar to those of a characteristic function, but unlike the latter it does not always exist. Define the function ψ by ψ(λ) = log ϕ(λ); this is the cumulant-generating function of X1. The inequality one gets is

\[
(3.5) \qquad P\big( |\overline{X}_n - \mu| \ge \varepsilon \big)
\le 2 \exp \big( -n \cdot \min(H(\mu - \varepsilon), H(\mu + \varepsilon)) \big),
\]

where the function

\[
H(a) = \sup_{\lambda \in \mathbb{R}}\, [a\lambda - \psi(\lambda)]
\]

is called the Cramer transform of X1 (in the terminology of convex analysis this is the Legendre transform of the cumulant-generating function ψ). The Cramer transform can be computed explicitly for a number of distributions, which yields explicit bounds on large deviations probabilities.

Example 61. Let {Xn} be a sequence of i.i.d. Bernoulli random variables with probability of success 0 < p < 1. Straightforward computations (sketched below) give that

\[
H(a) =
\begin{cases}
a \log\left( \dfrac{a}{p} \right) + (1 - a) \log\left( \dfrac{1 - a}{1 - p} \right), & a \in [0, 1], \\[1ex]
\infty, & \text{otherwise}.
\end{cases}
\]

Insert this expression in the right-hand side of (3.5) to obtain a bound on the probabilities of large deviations.
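A sketch of the computation for a ∈ (0, 1) (the boundary cases follow by direct inspection): here ψ(λ) = log(1 − p + pe^λ), and maximizing λ ↦ aλ − ψ(λ) gives the first-order condition

\[
\frac{p e^{\lambda}}{1 - p + p e^{\lambda}} = a, \quad \text{i.e.} \quad e^{\lambda^*} = \frac{a(1 - p)}{p(1 - a)},
\]

for which \(\psi(\lambda^*) = \log \frac{1 - p}{1 - a}\), so that

\[
H(a) = a \lambda^* - \psi(\lambda^*) = a \log \frac{a}{p} + (1 - a) \log \frac{1 - a}{1 - p}.
\]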

A much cruder bound on probabilities of large deviations is obtained by applying Chebyshev's inequality. If V[X1] = σ², then

\[
P\big( |\overline{X}_n - E[X_1]| \ge \varepsilon \big) \le \frac{V[\overline{X}_n]}{\varepsilon^2} = \frac{\sigma^2}{n \varepsilon^2}.
\]

In particular, when {Xn} is an i.i.d. sequence of Bernoulli random variables with probability of success p,

\[
(3.6) \qquad P\big( |\overline{X}_n - p| \ge \varepsilon \big) \le \frac{p(1 - p)}{n \varepsilon^2} \le \frac{1}{4 n \varepsilon^2}.
\]

If we denote

\[
p_n(k) = \binom{n}{k} p^k (1 - p)^{n - k},
\]


the inequality (3.6) can be rewritten as

\[
\sum_{k : |k/n - p| \ge \varepsilon} p_n(k) \le \frac{1}{4 n \varepsilon^2}.
\]

We will use this fact to give a probabilistic proof of the Weierstraß theorem, which asserts that for any continuous function u : [0, 1] → R there exists a sequence of polynomials un, such that

\[
(3.7) \qquad \lim_{n \to \infty} \sup_{p \in [0, 1]} |u_n(p) - u(p)| = 0,
\]

see Theorem 7.26 in Rudin (1976). Take

\[
u_n(p) = \sum_{k=0}^{n} u\left( \frac{k}{n} \right) \binom{n}{k} p^k (1 - p)^{n - k}.
\]

These are called Bernstein polynomials. We have

\[
E\big[ u(\overline{X}_n) \big] = u_n(p).
\]

Since the function u, being continuous on [0, 1], is uniformly continuous on that interval, for every ε > 0 one can find δ > 0, such that |u(x) − u(y)| ≤ ε whenever |x − y| ≤ δ. Also note that u is bounded on [0, 1]. We then get

\[
\begin{aligned}
|u_n(p) - u(p)|
&= \left| \sum_{k=0}^{n} \left[ u\left( \frac{k}{n} \right) - u(p) \right] \binom{n}{k} p^k (1 - p)^{n - k} \right| \\
&\le \sum_{k : |k/n - p| \le \delta} \left| u\left( \frac{k}{n} \right) - u(p) \right| p_n(k)
+ \sum_{k : |k/n - p| > \delta} \left| u\left( \frac{k}{n} \right) - u(p) \right| p_n(k) \\
&\le \varepsilon + \frac{\| u \|_\infty}{n \delta^2}.
\end{aligned}
\]

The bound on the right-hand side is independent of p. Letting n → ∞, we obtain that lim sup_{n→∞} sup_{p∈[0,1]} |un(p) − u(p)| ≤ ε. Since ε is arbitrary, (3.7) follows.
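The Bernstein approximation is easy to see numerically; a minimal sketch in Python (the test function u(x) = |x − 1/2|, the evaluation grid and the values of n are illustrative choices, not from the text; math.comb requires Python 3.8+):

import math

def bernstein(u, n, p):
    # u_n(p) = sum_k u(k/n) C(n,k) p^k (1-p)^(n-k)
    return sum(u(k / n) * math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n + 1))

u = lambda x: abs(x - 0.5)  # continuous but not differentiable at 1/2

# Approximate the sup-norm error on a grid of p values in [0, 1].
for n in [10, 100, 1000]:
    err = max(abs(bernstein(u, n, p / 200) - u(p / 200)) for p in range(201))
    print(n, err)

The printed errors decrease as n grows, in line with (3.7); the grid maximum is, of course, only a proxy for the true supremum over [0, 1].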

3.4. Central limit theorem

Let {Xi} be a sequence of random variables. In general the distribution of the sum Sn = \(\sum_{i=1}^{n} X_i\) might have a complicated form and hence be difficult to compute. The central limit theorem provides a simple approximation to it that is very useful in practice.

Although the result holds in much greater generality, we will prove the central limit theorem only for a sequence of i.i.d. random variables with finite second moments. The proof will yet again demonstrate the power of the method of characteristic functions.

Theorem 62 (Central limit theorem). Let {Xn} be a sequence of i.i.d. random variables with mean E[Xi] = µ and variance 0 < Var[Xi] = σ² < ∞. Let Sn = \(\sum_{i=1}^{n} X_i\). Then

\[
\frac{S_n - n\mu}{\sigma \sqrt{n}} \rightsquigarrow N(0, 1).
\]


Proof. Without loss of generality, we may suppose that E[Xi] = 0 and Var[Xi] = 1 (otherwise replace Xi with (Xi − µ)/σ, and note that this has mean 0 and variance 1). Let φ be the characteristic function of the Xi. Since by assumption E[Xi²] = 1, the characteristic function φ is twice differentiable and (cf. p. 290 in Hardy (1967) and Proposition 37 (vi)), as u → 0,

\[
\varphi(u) = \varphi(0) + \varphi'(0)\, u + \varphi''(0)\, \frac{u^2}{2} + o(u^2)
= 1 - \frac{u^2}{2} + o(u^2).
\]

By independence of the Xi and Corollary 57 we then get, for every fixed t ∈ R,

\[
E\big[ e^{\imath t S_n / \sqrt{n}} \big] = \varphi^n \left( \frac{t}{\sqrt{n}} \right)
= \left( 1 - \frac{1}{2} \left( \frac{t}{\sqrt{n}} \right)^2 + o\left( \left( \frac{|t|}{\sqrt{n}} \right)^2 \right) \right)^n
= \left( 1 - \frac{t^2}{2n} + o\left( \frac{1}{n} \right) \right)^n \to e^{-t^2/2}.
\]

The limit being the characteristic function of a standard normal random variable Z, the proof is completed upon invoking Levy's continuity theorem.

Example 63. Suppose we have an i.i.d. sample X1, . . . , Xn from the Bernoulli distribution with probability of success p, but we do not know p. The parameter p can be estimated by the sample mean p̂n = n⁻¹ \(\sum_{i=1}^{n} X_i\). By the strong law of large numbers p̂n →a.s. p, and by the central limit theorem

\[
\frac{\sqrt{n}}{\sqrt{p(1 - p)}}\, (\hat p_n - p) \rightsquigarrow N(0, 1).
\]

Thus for large n the estimator p̂n has approximately the normal distribution with mean p and variance p(1 − p)/n, which gives an idea of the precision with which p is recovered as n → ∞. The asymptotic variance p(1 − p)/n of the estimator can be estimated by p̂n(1 − p̂n)/n, and by Slutsky's lemma

\[
\frac{\sqrt{n}}{\sqrt{\hat p_n (1 - \hat p_n)}}\, (\hat p_n - p) \rightsquigarrow N(0, 1),
\]

so that, roughly speaking, we do not need to know the value of p in order to determine the precision with which it is recovered by p̂n: by a somewhat circular argument the latter can again be estimated by using p̂n.
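For instance (a standard consequence, spelled out here for illustration), inverting the last display gives the familiar approximate 95% confidence interval for p,

\[
\hat p_n \pm 1.96\, \sqrt{ \frac{\hat p_n (1 - \hat p_n)}{n} },
\]

where 1.96 is the 0.975-quantile of the standard normal distribution.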

3.5. Delta method

Let {Xn} be a sequence of i.i.d. random variables with mean E[Xi] = µ and variance 0 < Var[Xi] = σ² < ∞. By the central limit theorem,

\[
\frac{S_n - n\mu}{\sigma \sqrt{n}} \rightsquigarrow N(0, 1).
\]

Can we say something about the weak convergence of a sequence g(X̄n), where g : R → R is some fixed function? Such questions often arise in statistics. When g is differentiable, the answer is given by the following result, known as the delta method.


Theorem 64 (Delta method). Assume that the conditions of the central limit theorem (Theorem 62) hold. Let g be differentiable at µ with g′(µ) ≠ 0. Then

\[
\sqrt{n}\, \frac{g(\overline{X}_n) - g(\mu)}{\sigma g'(\mu)} \rightsquigarrow N(0, 1).
\]

Proof. The proof is an instance of an elegant application of the almost sure representation theorem. On some probability space (Ω̄, F̄, P̄) there exist random variables

\[
Z_n \stackrel{d}{=} \frac{S_n - n\mu}{\sigma \sqrt{n}}, \qquad Z \sim N(0, 1),
\]

such that Zn →a.s. Z (under P̄). By the foregoing, the definition of a derivative, the facts that σZn/√n →a.s. 0 and P̄(Z ≠ 0) = 1, and the continuous mapping theorem, we have

\[
\begin{aligned}
\sqrt{n}\, \frac{g(\overline{X}_n) - g(\mu)}{\sigma g'(\mu)}
&\stackrel{d}{=} \sqrt{n}\, \frac{g(\mu + \sigma Z_n/\sqrt{n}) - g(\mu)}{\sigma g'(\mu)} \cdot 1_{[Z_n \ne 0]} \\
&= \frac{g(\mu + \sigma Z_n/\sqrt{n}) - g(\mu)}{\sigma Z_n/\sqrt{n}} \cdot \frac{\sigma Z_n}{\sigma g'(\mu)} \cdot 1_{[Z_n \ne 0]} \\
&\xrightarrow{a.s.} g'(\mu) \cdot \frac{\sigma Z}{\sigma g'(\mu)} \cdot 1_{[Z \ne 0]}.
\end{aligned}
\]

The last term is equal to Z (P̄-almost surely), whence it follows that

\[
\sqrt{n}\, \frac{g(\overline{X}_n) - g(\mu)}{\sigma g'(\mu)} \rightsquigarrow N(0, 1)
\]

on the original probability space.

Example 65. This is a continuation of Example 63. Suppose we want to estimate the odds r = p/(1 − p). For example, if the data X1, . . . , Xn are the outcomes of a medical treatment with p = 3/4, then a patient has odds 3 : 1 of getting better. A natural estimator of r is r̂n = p̂n/(1 − p̂n), but how good is this estimator? Assume 0 < p < 1. Firstly, by the strong law of large numbers and the continuous mapping theorem, r̂n →a.s. r. Secondly, by the delta method (take g(p) = p/(1 − p) in Theorem 64; the computation is sketched below)

\[
\frac{\sqrt{n (1 - p)^3}}{\sqrt{p}}\, (\hat r_n - r) \rightsquigarrow N(0, 1),
\]

so that for large n the estimator r̂n is approximately normally distributed with mean r and variance p/[n(1 − p)³]. The latter can be estimated by p̂n/[n(1 − p̂n)³], and an application of Slutsky's lemma yields

\[
\frac{\sqrt{n (1 - \hat p_n)^3}}{\sqrt{\hat p_n}}\, (\hat r_n - r) \rightsquigarrow N(0, 1).
\]
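The variance computation behind Example 65, spelled out for convenience: with g(p) = p/(1 − p) one has g′(p) = 1/(1 − p)², so the delta method gives, in unnormalized form,

\[
\sqrt{n}\, (\hat r_n - r) \rightsquigarrow N\big( 0,\, p(1 - p)\, g'(p)^2 \big)
= N\left( 0,\, \frac{p(1 - p)}{(1 - p)^4} \right)
= N\left( 0,\, \frac{p}{(1 - p)^3} \right),
\]

which is exactly the normalization used in the display above.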

3.6. Berry-Esseen theorem

Convergence of some quantity to a limit inevitably leads to the question of the rate of convergence. In the setting of the central limit theorem proved above, the question is this: let Fn be the distribution function of (Sn − nµ)/(σ√n). Theorem 62 implies that for all x ∈ R, Fn(x) → Φ(x). Can we say something about the rate at which the difference |Fn(x) − Φ(x)| converges to zero? A good estimate on this quantity might be necessary in applications, in particular for numerical computations. One result in this direction is the Berry-Esseen theorem.

Theorem 66 (Berry-Esseen theorem). Let {Xn} be a sequence of i.i.d. random variables with mean zero, variance σ² and finite third absolute moment γ = E[|Xi|³] < ∞. Then there exists a universal constant A0, such that

\[
\sup_{x \in \mathbb{R}} |F_n(x) - \Phi(x)| \le \frac{A_0\, \gamma}{\sigma^3} \cdot \frac{1}{\sqrt{n}}.
\]

We will not prove this theorem; for the proof see e.g. Chung (2001), Section 7.4 (that particular proof is based on the method of characteristic functions). The exact value of the constant A0 is not known, but there exist good estimates on it (the latest (?) one is A0 ≤ 0.5129; this is quite sharp, because there also holds a lower bound proved by Esseen: A0 ≥ (√10 + 3)/(6√(2π)) ≈ 0.40973). The estimate in Theorem 66 is 'generic': for specific distributions, tighter bounds might hold. For instance, let X1, . . . , Xn be jointly normal and i.i.d. with Xi ∼ N(0, 1). Then Fn = Φ and sup_{x∈R} |Fn(x) − Φ(x)| is in fact zero.
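A numerical sketch of the 1/√n rate for centered Bernoulli(1/2) summands, for which γ/σ³ = 1 (the code and the choice of distribution are illustrative; math.comb assumes Python 3.8+):

import math

def Phi(x):
    # standard normal distribution function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def sup_diff(n, p=0.5):
    # sup_x |F_n(x) - Phi(x)| for standardized S_n with S_n ~ Bin(n, p);
    # F_n is a step function, so the supremum is attained at the jump
    # points, where we compare Phi with F_n just before and at each jump.
    sigma = math.sqrt(p * (1 - p))
    cdf, sup = 0.0, 0.0
    for k in range(n + 1):
        x = (k - n * p) / (sigma * math.sqrt(n))
        sup = max(sup, abs(cdf - Phi(x)))   # left limit at the jump
        cdf += math.comb(n, k) * p**k * (1 - p)**(n - k)
        sup = max(sup, abs(cdf - Phi(x)))   # value at the jump
    return sup

for n in [10, 40, 160, 640]:
    d = sup_diff(n)
    print(n, round(d, 4), round(d * math.sqrt(n), 4))

The last column (the sup-distance multiplied by √n) stays roughly constant across n, consistent with the 1/√n rate; comparing it with the constants quoted above is instructive.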

Exercises

1 Let {Xn} be a sequence of random variables with E[|Xn|] < ∞ and V[Xn] < ∞. Assume that the covariances Cov[Xi, Xj] → 0 as |i − j| → ∞. Prove the following version of the law of large numbers (due to Bernstein):

\[
P\left( \left| \frac{1}{n} \sum_{i=1}^{n} X_i - \frac{1}{n} \sum_{i=1}^{n} E[X_i] \right| > \varepsilon \right) \to 0
\]

as n → ∞. Hint: a sequence of random variables ξn converges to zero in probability when both the mean E[ξn] and the variance V[ξn] converge to zero as n → ∞ (show this).

2 Let {Xn} be a sequence of i.i.d. random variables. Prove that Sn = n^{−1/2} \(\sum_{i=1}^{n} X_i\) converges in probability as n → ∞ if and only if P(X1 = 0) = 1.
3 Let {Xn} be a sequence of i.i.d. random variables with E[X1²] < ∞. Prove that

\[
\frac{\max(|X_1|, \ldots, |X_n|)}{\sqrt{n}} \rightsquigarrow 0
\]

as n → ∞. Hint: for any ε > 0,

\[
P\left( \frac{\max(|X_1|, \ldots, |X_n|)}{\sqrt{n}} \le \varepsilon \right)
= \big[ P(X_1^2 \le n \varepsilon^2) \big]^n
\]

and nε² P(X1² > nε²) → 0 as n → ∞.
4 Let {Xn} be a sequence of i.i.d. random variables with mean zero and variance one, and let {dn} be a sequence of nonnegative numbers, such that dn = o(Dn) for Dn² = \(\sum_{i=1}^{n} d_i^2\). Prove that the sequence {dnXn} satisfies the central limit theorem:

\[
\frac{1}{D_n} \sum_{i=1}^{n} d_i X_i \rightsquigarrow Z
\]

for Z ∼ N(0, 1).


5 Let {Xn} be a sequence of i.i.d. random variables, such that P(X1 = ±1) = 1/2. Set Si = \(\sum_{k=1}^{i} X_k\). Show that

\[
\frac{1}{\sqrt{n}} \max_{1 \le i \le n} S_i \rightsquigarrow |Z|,
\]

where Z ∼ N(0, 1).
6 Let {Xn} be a sequence of i.i.d. random variables with E[X1] = 0 and E[X1²] = 1. Show that

\[
\frac{\sqrt{n}\, \overline{X}_n}{\sigma_n} \rightsquigarrow N(0, 1),
\]

where

\[
\overline{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i, \qquad
\sigma_n^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \overline{X}_n)^2.
\]

Incidentally, the result of this exercise also shows that if Yn possesses a t-distribution with n degrees of freedom, then Yn ⇝ N(0, 1). Explain why.

7 Let Xn have a Bin(n, λ/n) distribution (for n > λ). Show that Xn ⇝ X, where X has a Poisson(λ) distribution. This result is known as the Poisson theorem.

8 Let X, X1, X2, . . . be a sequence of random variables and Y a N(0, 1)-distributed random variable independent of that sequence. Let φn be the characteristic function of Xn and φ that of X. Let pn be the density of Xn + σY and p the density of X + σY.
(i) If φn → φ pointwise, then pn → p pointwise.
(ii) Let f : R → R be bounded by B. Show that

\[
|E f(X_n + \sigma Y) - E f(X + \sigma Y)| \le 2B \int (p(z) - p_n(z))^{+}\, dz.
\]

(iii) Show that |Ef(Xn + σY) − Ef(X + σY)| → 0 (with f bounded) if φn → φ pointwise.
(iv) Prove without referring to Corollary 55 that Xn ⇝ X iff φn → φ pointwise (hint: one implication is straightforward, for the other the result of Exercise 1.1 is useful).


Bibliography

J. Bak and D. J. Newman. Complex Analysis. Third edition. Undergraduate Texts in Mathematics. Springer, New York, 2010.

P. Billingsley. Weak Convergence of Measures: Applications in Probability. Conference Board of the Mathematical Sciences Regional Conference Series in Applied Mathematics, No. 5. Society for Industrial and Applied Mathematics, Philadelphia, PA, 1971.

K. L. Chung. A Course in Probability Theory. Third edition. Academic Press, Inc., San Diego, CA, 2001.

G. H. Hardy. A Course of Pure Mathematics. Tenth edition. Cambridge University Press, Cambridge, 1967.

K. R. Parthasarathy. Probability Measures on Metric Spaces. Reprint of the 1967 original. AMS Chelsea Publishing, Providence, RI, 2005.

Yu. V. Prokhorov. Convergence of random processes and limit theorems in probability theory. Theory Probab. Appl., 1(2), 157–214, 1956.

S. I. Resnick. A Probability Path. Birkhauser Boston, Inc., Boston, MA, 1999.

P. Revesz. The Laws of Large Numbers. Academic Press, New York, 1968.

W. Rudin. Principles of Mathematical Analysis. Third edition. International Series in Pure and Applied Mathematics. McGraw-Hill Book Co., New York-Auckland-Dusseldorf, 1976.

A. N. Shiryaev. Probability. Translated from the first (1980) Russian edition by R. P. Boas. Second edition. Graduate Texts in Mathematics, 95. Springer-Verlag, New York, 1996.

A. W. van der Vaart. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics, 3. Cambridge University Press, Cambridge, 1998.