
Part II — Probability and Measure

Based on lectures by J. Miller
Notes taken by Dexter Chua

Michaelmas 2016

These notes are not endorsed by the lecturers, and I have modified them (often significantly) after lectures. They are nowhere near accurate representations of what was actually lectured, and in particular, all errors are almost surely mine.

Analysis II is essential

Measure spaces, σ-algebras, π-systems and uniqueness of extension, statement *and proof* of Carathéodory's extension theorem. Construction of Lebesgue measure on R. The Borel σ-algebra of R. Existence of non-measurable subsets of R. Lebesgue–Stieltjes measures and probability distribution functions. Independence of events, independence of σ-algebras. The Borel–Cantelli lemmas. Kolmogorov's zero-one law. [6]

Measurable functions, random variables, independence of random variables. Construction of the integral, expectation. Convergence in measure and convergence almost everywhere. Fatou's lemma, monotone and dominated convergence, differentiation under the integral sign. Discussion of product measure and statement of Fubini's theorem. [6]

Chebyshev's inequality, tail estimates. Jensen's inequality. Completeness of Lp for 1 ≤ p ≤ ∞. The Hölder and Minkowski inequalities, uniform integrability. [4]

L2 as a Hilbert space. Orthogonal projection, relation with elementary conditional probability. Variance and covariance. Gaussian random variables, the multivariate normal distribution. [2]

The strong law of large numbers, proof for independent random variables with bounded fourth moments. Measure preserving transformations, Bernoulli shifts. Statements *and proofs* of maximal ergodic theorem and Birkhoff's almost everywhere ergodic theorem, proof of the strong law. [4]

The Fourier transform of a finite measure, characteristic functions, uniqueness and inversion. Weak convergence, statement of Lévy's convergence theorem for characteristic functions. The central limit theorem. [2]


Contents

0 Introduction
1 Measures
  1.1 Measures
  1.2 Probability measures
2 Measurable functions and random variables
  2.1 Measurable functions
  2.2 Constructing new measures
  2.3 Random variables
  2.4 Convergence of measurable functions
  2.5 Tail events
3 Integration
  3.1 Definition and basic properties
  3.2 Integrals and limits
  3.3 New measures from old
  3.4 Integration and differentiation
  3.5 Product measures and Fubini's theorem
4 Inequalities and Lp spaces
  4.1 Four inequalities
  4.2 Lp spaces
  4.3 Orthogonal projection in L2
  4.4 Convergence in L1(P) and uniform integrability
5 Fourier transform
  5.1 The Fourier transform
  5.2 Convolutions
  5.3 Fourier inversion formula
  5.4 Fourier transform in L2
  5.5 Properties of characteristic functions
  5.6 Gaussian random variables
6 Ergodic theory
  6.1 Ergodic theorems
7 Big theorems
  7.1 The strong law of large numbers
  7.2 Central limit theorem
Index


0 Introduction

In measure theory, the main idea is that we want to assign "sizes" to different sets. For example, we might think [0, 2] ⊆ R has size 2, while perhaps Q ⊆ R has size 0. This is known as a measure. One of the main applications of a measure is that we can use it to come up with a new definition of an integral. The idea is very simple, but it is going to be very powerful mathematically.

Recall that if f : [0, 1] → R is continuous, then the Riemann integral of f is defined as follows:

(i) Take a partition 0 = t0 < t1 < · · · < tn = 1 of [0, 1].

(ii) Consider the Riemann sum
    ∑_{j=1}^n f(t_j)(t_j − t_{j−1}).
(iii) The Riemann integral is
    ∫ f = limit of Riemann sums as the mesh size of the partition → 0.

(Figure: the graph of f on [0, 1], with the domain partitioned at 0, t_1, t_2, . . . , t_k, t_{k+1}, . . . , 1.)

The idea of measure theory is to use a different approximation scheme. Instead of partitioning the domain, we partition the range of the function. We fix some numbers r_0 < r_1 < r_2 < · · · < r_n.
We then approximate the integral of f by
    ∑_{j=1}^n r_j · ("size of f^{−1}([r_{j−1}, r_j])").
We then define the integral as the limit of approximations of this type as the mesh size of the partition → 0.
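
The two schemes can be compared numerically. Below is a minimal sketch (not from the notes; the function and grid sizes are arbitrary choices) where the "size" of each preimage is estimated by the fraction of grid points falling into it.

```python
import numpy as np

# A minimal numerical sketch: approximate the integral of f(x) = x^2 on [0, 1]
# both ways.  The "size" of each preimage f^{-1}([r_{j-1}, r_j]) is estimated
# by the fraction of grid points landing in it, standing in for its measure.
f = lambda x: x ** 2
xs = np.linspace(0, 1, 100_001)     # fine grid on the domain
fx = f(xs)

# Riemann-style: partition the domain [0, 1].
ts = np.linspace(0, 1, 101)
riemann = sum(f(ts[j]) * (ts[j] - ts[j - 1]) for j in range(1, len(ts)))

# Measure-theoretic style: partition the range of f (here also [0, 1]).
rs = np.linspace(0, 1, 101)
lebesgue = sum(rs[j] * np.mean((rs[j - 1] <= fx) & (fx < rs[j]))
               for j in range(1, len(rs)))

print(riemann, lebesgue)            # both are close to the true value 1/3
```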


(Figure: the graph of f again, now with the range partitioned into horizontal slabs.)

We can make an analogy with bankers: if a Riemann banker is given a stack of money, they would just add the values of the money in order. A measure-theoretic banker will sort the bank notes according to the type, and then find the total value by multiplying the number of each type by the value, and adding up.

Why would we want to do so? It turns out this leads to a much more general theory of integration on much more general spaces. Instead of integrating functions [a, b] → R only, we can replace the domain with any measure space. Even in the context of R, this theory of integration is much more powerful than the Riemann integral, and can integrate a much wider class of functions. While you probably don't care about those pathological functions anyway, being able to integrate more things means that we can state more general theorems about integration without having to put in funny conditions.

That was all about measures. What about probability? It turns out the concepts we develop for measures correspond exactly to many familiar notions from probability if we restrict it to the particular case where the total measure of the space is 1. Thus, when studying measure theory, we are also secretly studying probability!


1 Measures

In the course, we will write fn ↗ f for "fn converges to f monotonically increasingly", and fn ↘ f similarly for monotonically decreasing convergence. Unless otherwise specified, convergence is taken to be pointwise.

1.1 Measures

The starting point of all these is to come up with a function that determines the "size" of a given set, known as a measure. It turns out we cannot sensibly define a size for all subsets of [0, 1]. Thus, we need to restrict our attention to a collection of "nice" subsets. Specifying which subsets are "nice" would involve specifying a σ-algebra.

This section is mostly technical.

Definition (σ-algebra). Let E be a set. A σ-algebra E on E is a collection of subsets of E such that

(i) ∅ ∈ E .

(ii) A ∈ E implies that AC = E \ A ∈ E.

(iii) For any sequence (An) in E, we have that ⋃_n An ∈ E.

The pair (E, E) is called a measurable space.

Note that the axioms imply that σ-algebras are also closed under countable intersections, as we have A ∩ B = (AC ∪ BC)C.

Definition (Measure). A measure on a measurable space (E, E) is a function µ : E → [0,∞] such that

(i) µ(∅) = 0

(ii) Countable additivity: for any disjoint sequence (An) in E, we have
    µ(⋃_n An) = ∑_{n=1}^∞ µ(An).

Example. Let E be any countable set, and E = P(E) be the set of all subsets of E. A mass function is any function m : E → [0,∞]. We can then define a measure by setting
    µ(A) = ∑_{x∈A} m(x).
In particular, if we put m(x) = 1 for all x ∈ E, then we obtain the counting measure.
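
As a quick illustration, here is a minimal sketch (the names are my own, and only finite sets are handled, so the sum is an ordinary finite sum) of a measure defined from a mass function.

```python
# A measure built from a mass function m on a countable set E, as above.
def measure_from_mass(m):
    def mu(A):
        return sum(m.get(x, 0.0) for x in A)
    return mu

# Counting measure on E = {"a", "b", "c"}: give every point mass 1.
counting = measure_from_mass({"a": 1.0, "b": 1.0, "c": 1.0})
print(counting({"a", "c"}))   # 2.0
print(counting(set()))        # 0.0, as required of any measure
```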


Countable spaces are nice, because we can always take E = P(E), and the measure can be defined on all possible subsets. However, for "bigger" spaces, we have to be more careful. The set of all subsets is often "too large". We will see a concrete and also important example of this later.

In general, σ-algebras on large spaces are often described in terms of a smaller collection of sets, known as a generating set.

Definition (Generator of σ-algebra). Let E be a set, and A ⊆ P(E) a collection of subsets of E. We define
    σ(A) = {B ⊆ E : B belongs to every σ-algebra on E that contains A}.
In other words, σ(A) is the smallest σ-algebra that contains A. This is known as the σ-algebra generated by A.

Example. Take E = Z, and A = {{x} : x ∈ Z}. Then σ(A) is just P(E), since every subset of E can be written as a countable union of singletons.

Example. Take E = Z, and let A = {{x, x + 1, x + 2, x + 3, · · · } : x ∈ E}. Then again σ(A) is the set of all subsets of E.

The following is the most important σ-algebra in the course:

Definition (Borel σ-algebra). Let E = R, and A = {U ⊆ R : U is open}. Then σ(A) is known as the Borel σ-algebra, which is not the set of all subsets of R.
We can equivalently define this by A = {(a, b) : a < b, a, b ∈ Q}. Then σ(A) is also the Borel σ-algebra.

Often, we would like to prove results that allow us to deduce properties about the σ-algebra just by checking them on a generating set. However, usually, we cannot just check on an arbitrary generating set. Instead, the generating set has to satisfy some nice closure properties. We are now going to introduce a bunch of different definitions that you need not aim to remember (except when exams are near).

Definition (π-system). Let A be a collection of subsets of E. Then A is called a π-system if

(i) ∅ ∈ A

(ii) If A,B ∈ A, then A ∩B ∈ A.

Definition (d-system). Let A be a collection of subsets of E. Then A is called a d-system if

(i) E ∈ A

(ii) If A,B ∈ A and A ⊆ B, then B \A ∈ A

(iii) For all increasing sequences (An) in A, we have that ⋃_n An ∈ A.

The point of d-systems and π-systems is that they separate the axioms of a σ-algebra into two parts. More precisely, we have

Proposition. A collection A is a σ-algebra if and only if it is both a π-system and a d-system.


This follows rather straightforwardly from the definitions.
The following definitions are also useful:

Definition (Ring). A collection of subsets A is a ring on E if ∅ ∈ A and for all A, B ∈ A, we have B \ A ∈ A and A ∪ B ∈ A.

Definition (Algebra). A collection of subsets A is an algebra on E if ∅ ∈ A, and for all A, B ∈ A, we have AC ∈ A and A ∪ B ∈ A.

So an algebra is like a σ-algebra, but it is only closed under finite unions, rather than countable unions.

While the names π-system and d-system are rather arbitrary, we can make some sense of the names "ring" and "algebra". Indeed, a ring forms a ring (without unity) in the algebraic sense, with symmetric difference as "addition" and intersection as "multiplication". Then the empty set acts as the additive identity, and E, if present, acts as the multiplicative identity. Similarly, an algebra is a Boolean subalgebra of the Boolean algebra P(E).

A very important lemma about these things is Dynkin’s lemma:

Lemma (Dynkin's π-system lemma). Let A be a π-system. Then any d-system which contains A contains σ(A).

This will be very useful in the future. If we want to show that all elements of σ(A) satisfy a particular property for some generating π-system A, we just have to show that the elements of A satisfy that property, and that the collection of things that satisfy the property form a d-system.

While this use case might seem rather contrived, it is surprisingly common when we have to prove things.

Proof. Let D be the intersection of all d-systems containing A, i.e. the smallest d-system containing A. We show that D contains σ(A). To do so, we will show that D is a π-system, hence a σ-algebra.

There are two steps to the proof, both of which are straightforward verifications:

(i) We first show that if B ∈ D and A ∈ A, then B ∩A ∈ D.

(ii) We then show that if A,B ∈ D, then A ∩B ∈ D.

Then the result immediately follows from the second part.
We let
    D′ = {B ∈ D : B ∩ A ∈ D for all A ∈ A}.
We note that D′ ⊇ A because A is a π-system, and is hence closed under intersections. We check that D′ is a d-system. It is clear that E ∈ D′. If we have B1, B2 ∈ D′, where B1 ⊆ B2, then for any A ∈ A, we have

(B2 \B1) ∩A = (B2 ∩A) \ (B1 ∩A).

By definition of D′, we know B2 ∩ A and B1 ∩ A are elements of D. Since D is a d-system, we know this set difference is in D. So B2 \ B1 ∈ D′.

Finally, suppose that (Bn) is an increasing sequence in D′, with B = ⋃ Bn. Then for every A ∈ A, we have that
    (⋃_n Bn) ∩ A = ⋃_n (Bn ∩ A) = B ∩ A ∈ D.


Therefore B ∈ D′.
Therefore D′ is a d-system contained in D, which also contains A. By our choice of D, we know D′ = D.
We now let
    D′′ = {B ∈ D : B ∩ A ∈ D for all A ∈ D}.
Since D′ = D, we again have A ⊆ D′′, and the same argument as above implies that D′′ is a d-system which lies between A and D. But the only way that can happen is if D′′ = D, and this implies that D is a π-system.

After defining all sorts of things that are "weaker versions" of σ-algebras, we now define a bunch of measure-like objects that satisfy fewer properties. Again, no one really remembers these definitions:

Definition (Set function). Let A be a collection of subsets of E with ∅ ∈ A. A set function is a function µ : A → [0,∞] such that µ(∅) = 0.

Definition (Increasing set function). A set function is increasing if it has the property that for all A, B ∈ A with A ⊆ B, we have µ(A) ≤ µ(B).

Definition (Additive set function). A set function is additive if whenever A, B ∈ A with A ∩ B = ∅ and A ∪ B ∈ A, then µ(A ∪ B) = µ(A) + µ(B).

Definition (Countably additive set function). A set function is countably additive if whenever (An) is a sequence of disjoint sets in A with ⋃_n An ∈ A, then
    µ(⋃_n An) = ∑_n µ(An).

Under these definitions, a measure is just a countably additive set function defined on a σ-algebra.

Definition (Countably subadditive set function). A set function is countably subadditive if whenever (An) is a sequence of sets in A with ⋃_n An ∈ A, then
    µ(⋃_n An) ≤ ∑_n µ(An).

The big theorem that allows us to construct measures is the Carathéodory extension theorem. In particular, this will help us construct the Lebesgue measure on R.

Theorem (Carathéodory extension theorem). Let A be a ring on E, and µ a countably additive set function on A. Then µ extends to a measure on the σ-algebra generated by A.

Proof. (non-examinable) We start by defining what we want our measure to be. For B ⊆ E, we set
    µ∗(B) = inf{∑_n µ(An) : (An) is a sequence in A and B ⊆ ⋃_n An}.


If it happens that there is no such sequence, we set this to be ∞. This is known as the outer measure. It is clear that µ∗(∅) = 0, and that µ∗ is increasing.

We say a set A ⊆ E is µ∗-measurable if

µ∗(B) = µ∗(B ∩A) + µ∗(B ∩AC)

for all B ⊆ E. We let

M = {A ⊆ E : A is µ∗-measurable}.

We will show the following:

(i) M is a σ-algebra containing A.

(ii) µ∗ is a measure on M with µ∗|A = µ.

Note that it is not true in general that M = σ(A). However, we will always have M ⊇ σ(A).

We are going to break this up into five nice bite-size chunks.

Claim. µ∗ is countably subadditive.

Suppose B ⊆ ⋃_n Bn. We need to show that µ∗(B) ≤ ∑_n µ∗(Bn). We can wlog assume that µ∗(Bn) is finite for all n, or else the inequality is trivial. Let ε > 0. Then by definition of the outer measure, for each n, we can find a sequence (B_{n,m})_{m=1}^∞ in A with the property that

    Bn ⊆ ⋃_m B_{n,m}
and
    µ∗(Bn) + ε/2^n ≥ ∑_m µ(B_{n,m}).

Then we have
    B ⊆ ⋃_n Bn ⊆ ⋃_{n,m} B_{n,m}.

Thus, by definition, we have
    µ∗(B) ≤ ∑_{n,m} µ(B_{n,m}) ≤ ∑_n (µ∗(Bn) + ε/2^n) = ε + ∑_n µ∗(Bn).

Since ε was arbitrary, we are done.

Claim. µ∗ agrees with µ on A.

In the first example sheet, we will show that if A is a ring and µ is a countably additive set function on A, then µ is in fact countably subadditive and increasing.

Assuming this, suppose that A, (An) are in A and A ⊆ ⋃_n An. Then by subadditivity, we have
    µ(A) ≤ ∑_n µ(A ∩ An) ≤ ∑_n µ(An),


using that µ is countably subadditive and increasing. Note that we have to do this in two steps, rather than just applying countable subadditivity, since we did not assume that ⋃_n An ∈ A. Taking the infimum over all such sequences, we have
    µ(A) ≤ µ∗(A).
Also, we see by definition that µ(A) ≥ µ∗(A), since A covers itself. So we get that µ(A) = µ∗(A) for all A ∈ A.

Claim. M contains A.

Suppose that A ∈ A and B ⊆ E. We need to show that

µ∗(B) = µ∗(B ∩A) + µ∗(B ∩AC).

Since µ∗ is countably subadditive, we immediately have µ∗(B) ≤ µ∗(B ∩ A) + µ∗(B ∩ AC). For the other inequality, we first observe that it is trivial if µ∗(B) is infinite. If it is finite, then by definition, given ε > 0, we can find some (Bn) in A such that B ⊆ ⋃_n Bn and
    µ∗(B) + ε ≥ ∑_n µ(Bn).

Then we have
    B ∩ A ⊆ ⋃_n (Bn ∩ A),
    B ∩ AC ⊆ ⋃_n (Bn ∩ AC).

We notice that Bn ∩AC = Bn \A ∈ A. Thus, by definition of µ∗, we have

    µ∗(B ∩ A) + µ∗(B ∩ AC) ≤ ∑_n µ(Bn ∩ A) + ∑_n µ(Bn ∩ AC)
                            = ∑_n (µ(Bn ∩ A) + µ(Bn ∩ AC))
                            = ∑_n µ(Bn)
                            ≤ µ∗(B) + ε.

Since ε was arbitrary, the result follows.

Claim. We show that M is an algebra.

We first show that E ∈M. This is true since we obviously have

µ∗(B) = µ∗(B ∩ E) + µ∗(B ∩ EC)

for all B ⊆ E.
Next, note that if A ∈ M, then by definition we have, for all B,

µ∗(B) = µ∗(B ∩A) + µ∗(B ∩AC).

Now note that this definition is symmetric in A and AC. So we also have AC ∈ M.


Finally, we have to show that M is closed under intersection (which is equivalent to being closed under union when we have complements). Suppose A1, A2 ∈ M and B ⊆ E. Then we have

    µ∗(B) = µ∗(B ∩ A1) + µ∗(B ∩ A1^C)
          = µ∗(B ∩ A1 ∩ A2) + µ∗(B ∩ A1 ∩ A2^C) + µ∗(B ∩ A1^C)
          = µ∗(B ∩ (A1 ∩ A2)) + µ∗(B ∩ (A1 ∩ A2)^C ∩ A1) + µ∗(B ∩ (A1 ∩ A2)^C ∩ A1^C)
          = µ∗(B ∩ (A1 ∩ A2)) + µ∗(B ∩ (A1 ∩ A2)^C).

So we have A1 ∩A2 ∈M. So M is an algebra.

Claim. M is a σ-algebra, and µ∗ is a measure on M.

To show that M is a σ-algebra, we need to show that it is closed under countable unions. We let (An) be a disjoint sequence of sets in M (since M is an algebra, it suffices to consider disjoint sequences), and we want to show that A = ⋃_n An ∈ M and µ∗(A) = ∑_n µ∗(An).

Suppose that B ⊆ E. Then we have

    µ∗(B) = µ∗(B ∩ A1) + µ∗(B ∩ A1^C).
Using the fact that A2 ∈ M and A1 ∩ A2 = ∅, we have
    µ∗(B) = µ∗(B ∩ A1) + µ∗(B ∩ A2) + µ∗(B ∩ A1^C ∩ A2^C)
          = · · ·
          = ∑_{i=1}^n µ∗(B ∩ Ai) + µ∗(B ∩ A1^C ∩ · · · ∩ An^C)
          ≥ ∑_{i=1}^n µ∗(B ∩ Ai) + µ∗(B ∩ AC).

Taking the limit as n→∞, we have

    µ∗(B) ≥ ∑_{i=1}^∞ µ∗(B ∩ Ai) + µ∗(B ∩ AC).

By the countable-subadditivity of µ∗, we have

    µ∗(B ∩ A) ≤ ∑_{i=1}^∞ µ∗(B ∩ Ai).

Thus we obtain
    µ∗(B) ≥ µ∗(B ∩ A) + µ∗(B ∩ AC).

By countable subadditivity, we also have the inequality in the other direction. So equality holds. So A ∈ M. So M is a σ-algebra.

To see that µ∗ is a measure on M, note that the above implies that

    µ∗(B) = ∑_{i=1}^∞ µ∗(B ∩ Ai) + µ∗(B ∩ AC).


Taking B = A, this gives

    µ∗(A) = ∑_{i=1}^∞ µ∗(A ∩ Ai) + µ∗(A ∩ AC) = ∑_{i=1}^∞ µ∗(Ai).

Note that when A itself is actually a σ-algebra, the outer measure can be simply written as
    µ∗(B) = inf{µ(A) : A ∈ A, B ⊆ A}.

Carathéodory gives us the existence of some measure extending the set function on A. Could there be many? In general, there could. However, in the special case where the measure is finite, we do get uniqueness.

Theorem. Suppose that µ1, µ2 are measures on (E, E) with µ1(E) = µ2(E) < ∞. If A is a π-system with σ(A) = E, and µ1 agrees with µ2 on A, then µ1 = µ2.

Proof. Let
    D = {A ∈ E : µ1(A) = µ2(A)}.

We know that D ⊇ A. By Dynkin's lemma, it suffices to show that D is a d-system. The things to check are:

(i) E ∈ D — this follows by assumption.

(ii) If A, B ∈ D with A ⊆ B, then B \ A ∈ D. Indeed, we have the equations
    µ1(B) = µ1(A) + µ1(B \ A) < ∞,
    µ2(B) = µ2(A) + µ2(B \ A) < ∞.
Since µ1(B) = µ2(B) and µ1(A) = µ2(A), we must have µ1(B \ A) = µ2(B \ A).

(iii) Let (An) be an increasing sequence in D with ⋃ An = A. Then
    µ1(A) = lim_{n→∞} µ1(An) = lim_{n→∞} µ2(An) = µ2(A).

So A ∈ D.

The assumption that µ1(E) = µ2(E) < ∞ is necessary. The theorem does not necessarily hold without it. We can see this from a simple counterexample:

Example. Let E = Z, and let E = P(E). We let
    A = {{x, x + 1, x + 2, · · · } : x ∈ E} ∪ {∅}.
This is a π-system with σ(A) = E. We let µ1(A) be the number of elements in A, and µ2(A) = 2µ1(A). Then obviously µ1 ≠ µ2, but µ1 and µ2 agree on A, since both are ∞ on every non-empty A ∈ A (and both are 0 on ∅).

Definition (Borel σ-algebra). Let E be a topological space. We define the Borel σ-algebra as
    B(E) = σ({U ⊆ E : U is open}).

We write B for B(R).


Definition (Borel measure and Radon measure). A measure µ on (E, B(E)) is called a Borel measure. If µ(K) < ∞ for all K ⊆ E compact, then µ is a Radon measure.

The most important example of a Borel measure we will consider is the Lebesgue measure.

Theorem. There exists a unique Borel measure µ on R with µ([a, b]) = b− a.

Proof. We first show uniqueness. Suppose ν is another Borel measure on R satisfying the above property. We want to apply the previous uniqueness theorem, but our measures are not finite. So we need to carefully get around that problem.
For each n ∈ Z, we set
    µn(A) = µ(A ∩ (n, n + 1]),
    νn(A) = ν(A ∩ (n, n + 1]).
Then µn and νn are finite measures on B which agree on the π-system of intervals of the form (a, b] with a, b ∈ R, a < b. Therefore we have µn = νn for all n ∈ Z. Now we have
    µ(A) = ∑_{n∈Z} µ(A ∩ (n, n + 1]) = ∑_{n∈Z} µn(A) = ∑_{n∈Z} νn(A) = ν(A)

for all Borel sets A.
To show existence, we want to use the Carathéodory extension theorem. We let A be the collection of finite, disjoint unions of the form
    A = (a1, b1] ∪ (a2, b2] ∪ · · · ∪ (an, bn].
Then A is a ring of subsets of R, and σ(A) = B (details are to be checked on the first example sheet).

We set
    µ(A) = ∑_{i=1}^n (bi − ai).

We note that µ is well-defined, since if
    A = (a1, b1] ∪ · · · ∪ (an, bn] = (ã1, b̃1] ∪ · · · ∪ (ãm, b̃m],
then
    ∑_{i=1}^n (bi − ai) = ∑_{i=1}^m (b̃i − ãi).

Also, if A, B ∈ A, A ∩ B = ∅ and A ∪ B ∈ A, we obviously have µ(A ∪ B) = µ(A) + µ(B). So µ is additive.

Finally, we have to show that µ is in fact countably additive. Let (An) be a disjoint sequence in A, and let A = ⋃_{n=1}^∞ An ∈ A. Then we need to show that µ(A) = ∑_{n=1}^∞ µ(An).

Since µ is additive, we have
    µ(A) = µ(A1) + µ(A \ A1)
         = µ(A1) + µ(A2) + µ(A \ (A1 ∪ A2))
         = ∑_{i=1}^n µ(Ai) + µ(A \ ⋃_{i=1}^n Ai).


To finish the proof, we show that
    µ(A \ ⋃_{i=1}^n Ai) → 0 as n → ∞.

We are going to reduce this to the finite intersection property of compact sets in R: if (Kn) is a sequence of compact sets in R with the property that ⋂_{m=1}^n Km ≠ ∅ for all n, then ⋂_{m=1}^∞ Km ≠ ∅.

We first introduce some new notation. We let
    Bn = A \ ⋃_{m=1}^n Am.

We now suppose, for contradiction, that µ(Bn) ↛ 0 as n → ∞. Since the Bn's are decreasing, there must exist ε > 0 such that µ(Bn) ≥ 2ε for every n.

For each n, we take Cn ∈ A with the property that C̄n ⊆ Bn (where C̄n denotes the closure of Cn) and µ(Bn \ Cn) ≤ ε/2^n. This is possible since each Bn is just a finite union of intervals. Thus we have

    µ(Bn) − µ(⋂_{m=1}^n Cm) = µ(Bn \ ⋂_{m=1}^n Cm)
                             ≤ µ(⋃_{m=1}^n (Bm \ Cm))
                             ≤ ∑_{m=1}^n µ(Bm \ Cm)
                             ≤ ∑_{m=1}^n ε/2^m
                             ≤ ε.

On the other hand, we also know that µ(Bn) ≥ 2ε. So
    µ(⋂_{m=1}^n Cm) ≥ ε
for all n. We now let Kn = ⋂_{m=1}^n C̄m. Then each Kn is compact, and since Kn contains ⋂_{m=1}^n Cm, which has measure at least ε, we have Kn ≠ ∅ for all n.
Thus, the finite intersection property says

    ∅ ≠ ⋂_{n=1}^∞ Kn ⊆ ⋂_{n=1}^∞ Bn = ∅.

This is a contradiction. So we have µ(Bn)→ 0 as n→∞. So done.

Definition (Lebesgue measure). The Lebesgue measure is the unique Borel measure µ on R with µ([a, b]) = b − a.

Note that the Lebesgue measure is not a finite measure, since µ(R) = ∞. However, it is a σ-finite measure.


Definition (σ-finite measure). Let (E, E) be a measurable space, and µ a measure. We say µ is σ-finite if there exists a sequence (En) in E such that ⋃_n En = E and µ(En) < ∞ for all n.

This is the next best thing we can hope for after finiteness, and often proofs that involve finiteness carry over to σ-finite measures.

Proposition. The Lebesgue measure is translation invariant, i.e.

µ(A+ x) = µ(A)

for all A ∈ B and x ∈ R, where

A + x = {y + x : y ∈ A}.

Proof. We use the uniqueness of the Lebesgue measure. We let

µx(A) = µ(A+ x)

for A ∈ B. Then this is a measure on B satisfying µx([a, b]) = b − a. So the uniqueness of the Lebesgue measure shows that µx = µ.

It turns out that translation invariance actually characterizes the Lebesgue measure.

Proposition. Let µ be a Borel measure on R that is translation invariant and µ([0, 1]) = 1. Then µ is the Lebesgue measure.

Proof. We show that any such measure must satisfy

µ([a, b]) = b− a.

By additivity and translation invariance, we can show that µ([p, q]) = q − p for all rational p < q. By considering µ([p, p + 1/n]) for all n and using the increasing property, we know µ({p}) = 0. So µ([p, q)) = µ((p, q]) = µ((p, q)) = q − p for all rational p, q.

Finally, by countable additivity, we can extend this to all real intervals. Then the result follows from the uniqueness of the Lebesgue measure.

In the proof of the Carathéodory extension theorem, we constructed a measure µ∗ on the σ-algebra M of µ∗-measurable sets which contains A. This contains B = σ(A), but could in fact be bigger than it. We call M the Lebesgue σ-algebra.

Indeed, it can be given by

M = {A ∪ N : A ∈ B, N ⊆ B for some B ∈ B with µ(B) = 0}.

If A ∪ N ∈ M, then µ(A ∪ N) = µ(A). The proof is left for the example sheet.
It is also true that M is strictly larger than B, so there exists A ∈ M with A ∉ B. Construction of such a set was on last year's exam (2016).

This is a rather funny construction.


Example. For x, y ∈ [0, 1), we say x ∼ y if x − y is rational. This defines an equivalence relation on [0, 1). By the axiom of choice, we pick a representative of each equivalence class, and put them into a set S ⊆ [0, 1). We will show that S is not Lebesgue measurable.

Suppose that S were Lebesgue measurable. We are going to get a contradiction to the countable additivity of the Lebesgue measure. For each rational r ∈ [0, 1) ∩ Q, we define

Sr = {s + r mod 1 : s ∈ S}.

By translation invariance, we know Sr is also Lebesgue measurable, and µ(Sr) = µ(S).

Also, by construction of S, we know the sets Sr are pairwise disjoint, and ⋃_{r∈[0,1)∩Q} Sr = [0, 1).
Now by countable additivity, we have
    1 = µ([0, 1)) = µ(⋃_{r∈[0,1)∩Q} Sr) = ∑_{r∈[0,1)∩Q} µ(Sr) = ∑_{r∈[0,1)∩Q} µ(S),
which is clearly not possible. Indeed, if µ(S) = 0, then this says 1 = 0; if µ(S) > 0, then this says 1 = ∞. Both are absurd.

1.2 Probability measures

Since the course is called "probability and measure", we'd better start talking about probability! It turns out the notions we care about in probability theory are very naturally just special cases of the concepts we have previously considered.

Definition (Probability measure and probability space). Let (E, E, µ) be a measure space with the property that µ(E) = 1. Then we often call µ a probability measure, and (E, E, µ) a probability space.

Probability spaces are usually written as (Ω,F ,P) instead.

Definition (Sample space). In a probability space (Ω, F, P), we often call Ω the sample space.

Definition (Events). In a probability space (Ω, F, P), we often call the elements of F the events.

Definition (Probability). In a probability space (Ω, F, P), if A ∈ F, we often call P[A] the probability of the event A.

These are exactly the same things as measures, but with different names! However, thinking of them as probabilities could make us ask different questions about these measure spaces. For example, in probability, one is often interested in independence.

Definition (Independence of events). A sequence of events (An) is said to be independent if
    P[⋂_{n∈J} An] = ∏_{n∈J} P[An]
for all finite subsets J ⊆ N.


However, it turns out that talking about independence of events is usually too restrictive. Instead, we want to talk about the independence of σ-algebras:

Definition (Independence of σ-algebras). A sequence of σ-algebras (An) with An ⊆ F for all n is said to be independent if the following is true: if (An) is a sequence of events where An ∈ An for all n, then (An) is independent.

Proposition. Events (An) are independent iff the σ-algebras σ(An) are independent.

While proving this directly would be rather tedious (but not too hard), it is an immediate consequence of the following theorem:

Theorem. Suppose A1 and A2 are π-systems in F . If

P[A1 ∩A2] = P[A1]P[A2]

for all A1 ∈ A1 and A2 ∈ A2, then σ(A1) and σ(A2) are independent.

Proof. This will follow from two applications of the fact that a finite measure is determined by its values on a π-system which generates the entire σ-algebra.

We first fix A1 ∈ A1. We define the measures

µ(A) = P[A ∩A1]

and
    ν(A) = P[A]P[A1]

for all A ∈ F. By assumption, we know µ and ν agree on A2, and we have that µ(Ω) = P[A1] = ν(Ω) ≤ 1 < ∞. So µ and ν agree on σ(A2). So we have

P[A1 ∩A2] = µ(A2) = ν(A2) = P[A1]P[A2]

for all A2 ∈ σ(A2).
So we have now shown that if A1 and A2 are independent, then A1 and σ(A2) are independent. By symmetry, the same argument shows that σ(A1) and σ(A2) are independent.

Say we are rolling a die. Instead of asking what the probability of getting a 6 is, we might be interested instead in the probability of getting a 6 infinitely often. Intuitively, the answer is "it happens with probability 1", because in each roll, we have a probability of 1/6 of getting a 6, and the rolls are all independent.
We would like to make this precise and actually prove it. It turns out that the notions of "occurs infinitely often" and also "occurs eventually" correspond to the more analytic notions of lim sup and lim inf.

Definition (limsup and liminf). Let (An) be a sequence of events. We define
    lim sup An = ⋂_n ⋃_{m≥n} Am,
    lim inf An = ⋃_n ⋂_{m≥n} Am.


To parse these definitions more easily, we can read ⋂ as "for all", and ⋃ as "there exists". For example, we can write
    lim sup An = {∀n, ∃m ≥ n such that Am occurs}
               = {x : ∀n, ∃m ≥ n, x ∈ Am}
               = {Am occurs infinitely often}
               = {Am i.o.}

Similarly, we have
    lim inf An = {∃n, ∀m ≥ n such that Am occurs}
               = {x : ∃n, ∀m ≥ n, x ∈ Am}
               = {Am occurs eventually}
               = {Am e.v.}

We are now going to prove two "obvious" results, known as the Borel–Cantelli lemmas. These give us necessary conditions for an event to happen infinitely often, and in the case where the events are independent, the condition is also sufficient.

Lemma (Borel–Cantelli lemma). If
    ∑_n P[An] < ∞,
then P[An i.o.] = 0.

Proof. For each k, we have

    P[An i.o.] = P[⋂_n ⋃_{m≥n} Am] ≤ P[⋃_{m≥k} Am] ≤ ∑_{m=k}^∞ P[Am] → 0

as k →∞. So we have P[An i.o.] = 0.

Note that we did not need to use the fact that we are working with a probability measure. So in fact this holds for any measure space.

Lemma (Borel–Cantelli lemma II). Let (An) be independent events. If
    ∑_n P[An] = ∞,
then P[An i.o.] = 1.


Note that independence is crucial. If we flip a fair coin, and we set all the An to be equal to the event "getting a head", then ∑_n P[An] = ∑_n 1/2 = ∞, but we certainly do not have P[An i.o.] = 1. Instead it is just 1/2.

Proof. By the example sheet, if (An) is independent, then so is (An^C). Then we have

    P[⋂_{m=n}^N Am^C] = ∏_{m=n}^N P[Am^C]
                      = ∏_{m=n}^N (1 − P[Am])
                      ≤ ∏_{m=n}^N exp(−P[Am])
                      = exp(−∑_{m=n}^N P[Am]) → 0

as N → ∞, as we assumed that ∑_n P[An] = ∞. So we have

    P[⋂_{m=n}^∞ Am^C] = 0.

By countable subadditivity, we have

    P[⋃_n ⋂_{m=n}^∞ Am^C] = 0.

This in turn implies that

    P[⋂_n ⋃_{m=n}^∞ Am] = 1 − P[⋃_n ⋂_{m=n}^∞ Am^C] = 1.

So we are done.
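
As a sanity check of the key estimate in this proof, here is a rough Monte Carlo sketch (not from the notes; sample sizes are arbitrary): for independent die rolls with P[Am] = 1/6 for every m, the probability of avoiding a 6 in N consecutive rolls is (5/6)^N ≤ exp(−∑ P[Am]), which tends to 0 as N grows, in line with the lemma.

```python
import random

# Estimate the probability of seeing no 6 in N independent fair die rolls,
# and compare it with the exact value (5/6)^N.
random.seed(0)

def frac_with_no_six(num_rolls, trials=20_000):
    misses = 0
    for _ in range(trials):
        if all(random.randint(1, 6) != 6 for _ in range(num_rolls)):
            misses += 1
    return misses / trials

for num_rolls in (5, 25, 50):
    print(num_rolls, frac_with_no_six(num_rolls), (5 / 6) ** num_rolls)
```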


2 Measurable functions and random variables

We've had enough of measurable sets. As in most of mathematics, not only should we talk about objects, but also maps between objects. Here we want to talk about maps between measure spaces, known as measurable functions. In the case of a probability space, a measurable function is a random variable!

In this chapter, we are going to start by defining a measurable function and investigate some of its basic properties. In particular, we are going to prove the monotone class theorem, which is the analogue of Dynkin's lemma for measurable functions. Afterwards, we turn to the probabilistic aspects, and see how we can make sense of the independence of random variables. Finally, we are going to consider different notions of "convergence" of functions.

2.1 Measurable functions

The definition of a measurable function is somewhat like the definition of a continuous function, except that we replace "open" with "in the σ-algebra".

Definition (Measurable functions). Let (E, E) and (G, G) be measurable spaces. A map f : E → G is measurable if for every A ∈ G, we have
    f−1(A) = {x ∈ E : f(x) ∈ A} ∈ E.
If (G, G) = (R, B), then we will just say that f is measurable on E.
If (G, G) = ([0,∞], B), then we will just say that f is non-negative measurable.
If E is a topological space and E = B(E), then we call f a Borel function.

How do we actually check in practice that a function is measurable? It turns out we are lucky. We can simply check that f−1(A) ∈ E for A in any generating set Q of G.

Lemma. Let (E, E) and (G, G) be measurable spaces, and G = σ(Q) for some Q. If f−1(A) ∈ E for all A ∈ Q, then f is measurable.

Proof. We claim that
    {A ⊆ G : f−1(A) ∈ E}
is a σ-algebra on G. Then the result follows immediately by definition of σ(Q).
Indeed, this follows from the fact that f−1 preserves everything. More precisely, we have

    f−1(⋃_n An) = ⋃_n f−1(An), f−1(AC) = (f−1(A))C, f−1(∅) = ∅.

So if, say, all the An lie in this collection, then so does ⋃_n An.

Example. In the particular case where we have a function f : E → R, we know that B = B(R) is generated by (−∞, y] for y ∈ R. So we just have to check that
    {x ∈ E : f(x) ≤ y} = f−1((−∞, y]) ∈ E.


Example. Let E, F be topological spaces, and f : E → F be continuous. We will see that f is a measurable function (under the Borel σ-algebras). Indeed, by definition, whenever U ⊆ F is open, we have f−1(U) open as well. So f−1(U) ∈ B(E) for all U ⊆ F open. But since B(F) is the σ-algebra generated by the open sets, this implies that f is measurable.

This is one very important example. We can do another very important example.

Example. Suppose that A ⊆ E. The indicator function of A is 1A : E → {0, 1} given by
    1A(x) = 1 if x ∈ A, and 1A(x) = 0 if x ∉ A.
Suppose we give {0, 1} the non-trivial σ-algebra of all its subsets. Then 1A is a measurable function iff A ∈ E.

Example. The identity function is always measurable.

Example. Compositions of measurable functions are measurable. More precisely, if (E, E), (F, F) and (G, G) are measurable spaces, and the functions f : E → F and g : F → G are measurable, then the composition g ∘ f : E → G is measurable.
Indeed, if A ∈ G, then g−1(A) ∈ F, so f−1(g−1(A)) ∈ E. But f−1(g−1(A)) = (g ∘ f)−1(A). So done.

Definition (σ-algebra generated by functions). Now suppose we have a set E, and a family of real-valued functions {fi : i ∈ I} on E. We then define
    σ(fi : i ∈ I) = σ({f_i^{−1}(A) : A ∈ B, i ∈ I}).
This is the smallest σ-algebra on E which makes all the fi's measurable.
This is analogous to the notion of initial topologies for topological spaces.

If we want to construct more measurable functions, the following definition will be rather useful:

Definition (Product measurable space). Let (E, E) and (G, G) be measurable spaces. We define the product measurable space as E × G with the σ-algebra generated by the projections π1 : E × G → E and π2 : E × G → G. More explicitly, the σ-algebra is given by
    E ⊗ G = σ({A × B : A ∈ E, B ∈ G}).
More generally, if (Ei, Ei) is a collection of measurable spaces, the product measurable space has underlying set ∏_i Ei, and the σ-algebra generated by the projection maps πi : ∏_j Ej → Ei.

This satisfies the following property:


Proposition. Let fi : E → Fi be functions. Then the fi are all measurable iff (fi) : E → ∏ Fi is measurable, where the function (fi) is defined by setting the ith component of (fi)(x) to be fi(x).

Proof. If the map (fi) is measurable, then by composition with the projections πi, we know that each fi is measurable.

Conversely, if all fi are measurable, then since the σ-algebra of ∏ Fi is generated by sets of the form π_j^{−1}(A) for A ∈ Fj, and the pullback of such a set along (fi) is exactly f_j^{−1}(A), we know the function (fi) is measurable.

Using this, we can prove that a whole lot more functions are measurable.

Proposition. Let (E, E) be a measurable space. Let (fn : n ∈ N) be a sequence of non-negative measurable functions on E. Then the following are measurable:

    f1 + f2, f1 f2, max{f1, f2}, min{f1, f2}, inf_n fn, sup_n fn, lim inf_n fn, lim sup_n fn.

The same is true with "non-negative" replaced with "real", provided the new functions are real (i.e. not infinity).

Proof. This is an (easy) exercise on the example sheet. For example, the sum f1 + f2 can be written as the composition
    E → [0,∞]^2 → [0,∞],
where the first map is (f1, f2) and the second map is addition.

We know the second map is continuous, hence measurable. The first function is also measurable since the fi are. So the composition is also measurable.

The product follows similarly, but for the infimum and supremum, we need to check explicitly that the corresponding maps [0,∞]^N → [0,∞] are measurable.

Notation. We will write

    f ∧ g = min{f, g}, f ∨ g = max{f, g}.

We are now going to prove the monotone class theorem, which is a "Dynkin's lemma" for measurable functions. As in the case of Dynkin's lemma, it will sound rather awkward but will prove itself to be very useful.

Theorem (Monotone class theorem). Let (E, E) be a measurable space, and A ⊆ E be a π-system with σ(A) = E. Let V be a vector space of functions such that

(i) The constant function 1 = 1E is in V.

(ii) The indicator functions 1A ∈ V for all A ∈ A

(iii) V is closed under bounded, monotone limits.

More explicitly, if (fn) is a bounded non-negative sequence in V, fn ↗ f (pointwise) and f is also bounded, then f ∈ V.

Then V contains all bounded measurable functions.


Note that the conditions on V are rather like the conditions for a d-system, where taking a bounded, monotone limit is something like taking increasing unions.

Proof. We first deduce that 1A ∈ V for all A ∈ E. We let
    D = {A ∈ E : 1A ∈ V}.

We want to show that D = E . To do this, we have to show that D is a d-system.

(i) Since 1E ∈ V, we know E ∈ D.

(ii) If A, B ∈ D with A ⊆ B, then 1_{B\A} = 1_B − 1_A ∈ V, since V is a vector space. So B \ A ∈ D.

(iii) If (An) is an increasing sequence in D, then 1_{An} → 1_{⋃An} monotonically increasingly, and everything is bounded by 1. So 1_{⋃An} ∈ V, and hence ⋃_n An ∈ D.

So, by Dynkin's lemma, we know D = E. So V contains indicators of all measurable sets. We will now try to obtain any measurable function by approximating.

Suppose that f is bounded and non-negative measurable. We want to show that f ∈ V. To do this, we approximate it by letting
    fn = 2^{−n} ⌊2^n f⌋ = ∑_{k=0}^∞ k 2^{−n} 1_{{k 2^{−n} ≤ f < (k+1) 2^{−n}}}.

Note that since f is bounded, this is a finite sum. So it is a finite linear combination of indicators of elements in E. So fn ∈ V, and 0 ≤ fn → f monotonically. So f ∈ V.

More generally, if f is bounded and measurable, then we can write

f = (f ∨ 0) + (f ∧ 0) ≡ f+ − f−.

Then f+ and f− are bounded and non-negative measurable. So f ∈ V.
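
The dyadic approximation used above is easy to check numerically. Here is a small sketch (not from the notes; the test function is an arbitrary choice) confirming that fn = 2^{−n}⌊2^n f⌋ sits below f and within 2^{−n} of it.

```python
import numpy as np

# The staircase f_n = 2^{-n} * floor(2^n * f) increases with n and takes only
# finitely many values when f is bounded.
f = lambda x: np.sin(np.pi * x) ** 2      # a bounded non-negative example
xs = np.linspace(0, 1, 1000)

for n in (1, 3, 6, 10):
    fn = 2.0 ** (-n) * np.floor(2.0 ** n * f(xs))
    print(n, float(np.max(f(xs) - fn)))   # error is at most 2^{-n}, and fn <= f
```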

Unfortunately, we will not have a chance to use this result until the next chapter where we discuss integration. There we will use this a lot.

2.2 Constructing new measures

We are going to look at two ways to construct new measures on spaces based on some measurable function we have.

Definition (Image measure). Let (E, E) and (G, G) be measurable spaces. Suppose µ is a measure on (E, E) and f : E → G is a measurable function. We define the image measure ν = µ ∘ f−1 on G by

ν(A) = µ(f−1(A)).

It is a routine check that this is indeed a measure.
If we have a strictly increasing continuous function, then we know it is invertible (if we restrict the codomain appropriately), and the inverse is also strictly increasing. It is also clear that these conditions are necessary for an inverse to exist. However, if we relax the conditions a bit, we can get some sort of "pseudoinverse" (some categorists may call them "left adjoints" (and will tell you that it is a trivial consequence of the adjoint functor theorem)).

Recall that a function g is right continuous if xn ↘ x implies g(xn) → g(x), and similarly f is left continuous if xn ↗ x implies f(xn) → f(x).


Lemma. Let g : R → R be non-constant, non-decreasing and right continuous. We set
    g(±∞) = lim_{x→±∞} g(x).
We set I = (g(−∞), g(∞)). Since g is non-constant, this is non-empty. Then there is a non-decreasing, left continuous function f : I → R such that for all x ∈ I and y ∈ R, we have
    x ≤ g(y) ⇔ f(x) ≤ y.

Thus, taking the negation of this, we have

x > g(y)⇔ f(x) > y.

Explicitly, for x ∈ I, we define

    f(x) = inf{y ∈ R : x ≤ g(y)}.

Proof. We just have to verify that it works. For x ∈ I, consider

    Jx = {y ∈ R : x ≤ g(y)}.

Since g is non-decreasing, if y ∈ Jx and y′ ≥ y, then y′ ∈ Jx. Since g is right-continuous, if yn ∈ Jx is such that yn ↘ y, then y ∈ Jx. So we have

Jx = [f(x),∞).

Thus, for y ∈ R, we have

x ≤ g(y)⇔ f(x) ≤ y.

So we just have to prove the remaining properties of f. Now for x ≤ x′, we have Jx′ ⊆ Jx. So f(x) ≤ f(x′). So f is non-decreasing.

Similarly, if xn ↗ x, then we have Jx = ⋂_n J_{xn}. So f(xn) → f(x). So f is left continuous.

Example. (Figures omitted: a non-decreasing, right-continuous function g, and the corresponding left-continuous function f constructed in the lemma.)


This allows us to construct new measures on R with ease.

Theorem. Let g : R → R be non-constant, non-decreasing and right continuous. Then there exists a unique Radon measure dg on B such that

dg((a, b]) = g(b)− g(a).

Moreover, we obtain all non-zero Radon measures on R in this way.

We have already seen an instance of this when g was the identity function. Given the lemma, this is very easy.

Proof. Take I and f as in the previous lemma, and let µ be the restriction of the Lebesgue measure to Borel subsets of I. Now f is measurable since it is left continuous. We define dg = µ ∘ f−1. Then we have
    dg((a, b]) = µ({x ∈ I : a < f(x) ≤ b}) = µ({x ∈ I : g(a) < x ≤ g(b)}) = µ((g(a), g(b)]) = g(b) − g(a).

So dg is a Radon measure with the required property.
There are no other such measures by the argument used for uniqueness of the Lebesgue measure.
To show we get all non-zero Radon measures this way, suppose we have a Radon measure ν on R, and we want to produce a g such that ν = dg. We set

    g(y) = −ν((y, 0]) if y ≤ 0, and g(y) = ν((0, y]) if y > 0.

Then ν((a, b]) = g(b) − g(a). We see that ν is non-zero, so g is non-constant. It is also easy to see that g is non-decreasing and right continuous. So ν = dg by the uniqueness part above.
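
For example, taking g = 1_{[0,∞)} gives dg((a, b]) = g(b) − g(a), which is 1 exactly when 0 ∈ (a, b] and 0 otherwise; so dg is the point mass δ0 at the origin. Taking g(x) = x recovers the Lebesgue measure, as remarked above.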

2.3 Random variables

We are now going to look at these ideas in the context of probability. It turns out they are concepts we already know and love!

Definition (Random variable). Let (Ω, F, P) be a probability space, and (E, E) a measurable space. Then an E-valued random variable is a measurable function X : Ω → E.

By default, we will assume the random variables are real.

Usually, when we have a random variable X, we might ask questions like "what is the probability that X ∈ A?". In other words, we are asking for the "size" of the set of things that get sent to A. This is just the image measure!

Definition (Distribution/law). Given a random variable X : Ω → E, the distribution or law of X is the image measure µX = P ∘ X−1. We usually write
    P(X ∈ A) = µX(A) = P(X−1(A)).


If E = R, then µX is determined by its values on the π-system of intervals (−∞, y]. We set
    FX(x) = µX((−∞, x]) = P(X ≤ x).
This is known as the distribution function of X.

Proposition. We have FX(x) → 0 as x → −∞, and FX(x) → 1 as x → +∞.

Also, FX(x) is non-decreasing and right-continuous.

We call any function F with these properties a distribution function.

Definition (Distribution function). A distribution function is a non-decreasing, right continuous function F : R → [0, 1] satisfying F(x) → 0 as x → −∞ and F(x) → 1 as x → +∞.

We now want to show that every distribution function is indeed a distribution.

Proposition. Let F be any distribution function. Then there exists a probability space (Ω, F, P) and a random variable X such that FX = F.

Proof. Take (Ω, F, P) = ((0, 1), B(0, 1), Lebesgue). We take X : Ω → R to be
    X(ω) = inf{x : ω ≤ F(x)}.
Then we have
    X(ω) ≤ x ⇐⇒ ω ≤ F(x).
So we have
    FX(x) = P[X ≤ x] = P[(0, F(x)]] = F(x).

Therefore FX = F .

This construction is actually very useful in practice. If we are writing a computer program and want to sample a random variable, we will use this procedure. The computer usually comes with a uniform (pseudo)-random number generator. Then using this procedure allows us to produce random variables of any distribution from a uniform sample.
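
Here is a minimal sketch of that procedure (not from the notes; the target distribution Exp(1) is an arbitrary choice), often called inverse transform sampling: for a continuous, strictly increasing F, the generalised inverse inf{x : ω ≤ F(x)} is simply F^{−1}(ω).

```python
import math
import random

# Push a uniform sample omega through X(omega) = inf{x : omega <= F(x)},
# here for the Exp(1) distribution function F(x) = 1 - exp(-x).
def sample_exponential():
    omega = random.random()               # uniform sample on [0, 1)
    return -math.log(1.0 - omega)         # F^{-1}(omega)

random.seed(1)
samples = [sample_exponential() for _ in range(100_000)]
print(sum(samples) / len(samples))        # close to the Exp(1) mean, which is 1
```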

The next thing we want to consider is the notion of independence of random variables. Recall that for random variables X, Y, we used to say that they are independent if for any A, B, we have

P[X ∈ A, Y ∈ B] = P[X ∈ A]P[Y ∈ B].

But this is exactly the statement that the σ-algebras generated by X and Y are independent!

Definition (Independence of random variables). A family (Xn) of random variables is said to be independent if the family of σ-algebras (σ(Xn)) is independent.


Proposition. Two real-valued random variables X, Y are independent iff
    P[X ≤ x, Y ≤ y] = P[X ≤ x]P[Y ≤ y]
for all x, y ∈ R.

More generally, if (Xn) is a sequence of real-valued random variables, then they are independent iff
    P[X1 ≤ x1, · · · , Xn ≤ xn] = ∏_{j=1}^n P[Xj ≤ xj]
for all n and all x1, . . . , xn ∈ R.

Proof. The ⇒ direction is obvious. For the other direction, we simply note that {(−∞, x] : x ∈ R} is a generating π-system for the Borel σ-algebra of R.

In probability, we often say things like "let X1, X2, · · · be iid random variables". However, how can we guarantee that iid random variables do indeed exist? We start with the less ambitious goal of finding iid Bernoulli(1/2) random variables:

Proposition. Let

(Ω,F ,P) = ((0, 1),B(0, 1),Lebesgue).

be our probability space. Then there exists a sequence (Rn) of independent Bernoulli(1/2) random variables.

Proof. Suppose we have ω ∈ Ω = (0, 1). Then we write ω as a binary expansion

    ω = ∑_{n=1}^∞ ωn 2^{−n},

where ωn ∈ {0, 1}. We make the binary expansion unique by disallowing infinite sequences of zeroes.

We define Rn(ω) = ωn. We will show that Rn is measurable. Indeed, we can write

R1(ω) = ω1 = 1(1/2,1](ω),

where 1(1/2,1] is the indicator function. Since indicator functions of measurable sets are measurable, we know R1 is measurable. Similarly, we have

R2(ω) = 1(1/4,1/2](ω) + 1(3/4,1](ω).

So this is also a measurable function. More generally, we can do this for any Rn(ω): we have

    Rn(ω) = ∑_{j=1}^{2^{n−1}} 1_{(2^{−n}(2j−1), 2^{−n}(2j)]}(ω).

So each Rn is a random variable, as each can be expressed as a sum of indicators of measurable sets.

Now let’s calculate

    P[Rn = 1] = ∑_{j=1}^{2^{n−1}} 2^{−n}((2j) − (2j − 1)) = ∑_{j=1}^{2^{n−1}} 2^{−n} = 1/2.


Then we have
    P[Rn = 0] = 1 − P[Rn = 1] = 1/2

as well. So Rn ∼ Bernoulli(1/2).
We can straightforwardly check that (Rn) is an independent sequence, since

for n ≠ m, we have

    P[Rn = 0 and Rm = 0] = 1/4 = P[Rn = 0] P[Rm = 0].
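
A small empirical sketch of this construction (not from the notes; sample sizes are arbitrary): the nth binary digit of a uniform ω behaves like a Bernoulli(1/2) random variable, and two different digits look independent.

```python
import random

# R_n(omega) = nth binary digit of omega in (0, 1).
def digit(omega, n):
    return int(2 ** n * omega) % 2

random.seed(2)
samples = [random.random() for _ in range(100_000)]
p1 = sum(digit(w, 1) for w in samples) / len(samples)
p2 = sum(digit(w, 2) for w in samples) / len(samples)
both = sum(digit(w, 1) * digit(w, 2) for w in samples) / len(samples)
print(p1, p2, both)    # each near 1/2, and 'both' near 1/4
```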

We will now use the (Rn) to construct an independent sequence of random variables with any given distributions.

Proposition. Let

(Ω,F ,P) = ((0, 1),B(0, 1),Lebesgue).

Given any sequence (Fn) of distribution functions, there is a sequence (Xn) of independent random variables with F_{Xn} = Fn for all n.

Proof. Let m : N2 → N be any bijection, and relabel

Yk,n = Rm(k,n),

where the Rj are as in the previous proposition. We let

    Yn = ∑_{k=1}^∞ 2^{−k} Yk,n.

Then we know that (Yn) is an independent sequence of random variables, and each is uniform on (0, 1). As before, we define

    Gn(y) = inf{x : y ≤ Fn(x)}.

We set Xn = Gn(Yn). Then (Xn) is a sequence of independent random variables with F_{Xn} = Fn.
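
Here is a rough sketch of this construction in code (not from the notes; the pairing bijection, the bit truncation and the target distributions are my own choices, so this only approximates the exact construction).

```python
import math
import random

# Interleave one stream of fair coin flips (R_j) into many streams via a
# bijection m(k, n), rebuild uniforms Y_n = sum_k 2^{-k} Y_{k,n}, then apply
# generalised inverses G_n of the target distribution functions.
def m(k, n):
    return (k + n) * (k + n + 1) // 2 + n     # Cantor pairing as the bijection

random.seed(3)
flips = [random.randint(0, 1) for _ in range(10_000)]   # the sequence (R_j)

def Y(n, bits=30):
    return sum(2.0 ** (-k) * flips[m(k, n)] for k in range(1, bits + 1))

X0 = -math.log(1.0 - Y(0))   # G_0 for the Exp(1) distribution function
X1 = Y(1)                    # G_1 for the Uniform(0, 1) distribution function
print(X0, X1)
```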

We end the section with a random fact: let (Ω, F, P) and Rj be as above. Then (1/n) ∑_{j=1}^n Rj is the average of n independent Bernoulli(1/2) random variables. The weak law of large numbers says for any ε > 0, we have

    P[ |(1/n) ∑_{j=1}^n Rj − 1/2| ≥ ε ] → 0 as n → ∞.

The strong law of large numbers, which we will prove later, says that

    P[ {ω : (1/n) ∑_{j=1}^n Rj(ω) → 1/2} ] = 1.

So "almost every number" in (0, 1) has an equal proportion of 0's and 1's in its binary expansion. This is known as the normal number theorem.


2.4 Convergence of measurable functions

The next thing to look at is the convergence of measurable functions. In measure theory, wonderful things happen when we talk about convergence. In analysis, most of the time we had to require uniform convergence, or even stronger notions, if we want limits to behave well. However, in measure theory, the kinds of convergence we talk about are somewhat pointwise in nature. In fact, it will be weaker than pointwise convergence. Yet, we are still going to get good properties out of them.

Definition (Convergence almost everywhere). Suppose that (E, E , µ) is a mea-sure space. Suppose that (fn), f are measurable functions. We say fn → falmost everywhere (a.e.) if

µ(x ∈ E : fn(x) 6→ f(x)) = 0.

If (E, E , µ) is a probability space, this is called almost sure convergence.

To see this makes sense, i.e. the set in there is actually measurable, note that

x ∈ E : fn(x) 6→ f(x) = x ∈ E : lim sup |fn(x)− f(x)| > 0.

We have previously seen that lim sup |fn− f | is non-negative measurable. So theset x ∈ E : lim sup |fn(x)− f(x)| > 0 is measurable.

Another useful notion of convergence is convergence in measure.

Definition (Convergence in measure). Suppose that (E, E , µ) is a measure space.Suppose that (fn), f are measurable functions. We say fn → f in measure if foreach ε > 0, we have

µ(x ∈ E : |fn(x)− f(x)| ≥ ε)→ 0 as n→∞,

then we say that fn → f in measure.If (E, E , µ) is a probability space, then this is called convergence in probability.

In the case of a probability space, this says

P(|Xn −X| ≥ ε)→ 0 as n→∞

for all ε, which is how we state the weak law of large numbers in the past.After we define integration, we can consider the norms of a function f by

‖f‖p =

(∫|f(x)|p dx

)1/p

.

Then in particular, if ‖fn − f‖p → 0, then fn → f in measure, and this providesan easy way to see that functions converge in measure.

In general, neither of these notions imply each other. However, the followingtheorem provides us with a convenient dictionary to translate between the twonotions.

Theorem.

(i) If µ(E) <∞, then fn → f a.e. implies fn → f in measure.

29

Page 30: Part II - Probability and Measure - SRCF · Part II | Probability and Measure Based on lectures by J. Miller Notes taken by Dexter Chua Michaelmas 2016 These notes are not endorsed

2 Measurable functions and random variables II Probability and Measure

(ii) For any E, if fn → f in measure, then there exists a subsequence (fnk)

such that fnk→ f a.e.

Proof.

(i) First suppose µ(E) <∞, and fix ε > 0. Consider

µ(x ∈ E : |fn(x)− f(x)| ≤ ε).

We use the result from the first example sheet that for any sequence ofevents (An), we have

lim inf µ(An) ≥ µ(lim inf An).

Applying to the above sequence says

lim inf µ(x : |fn(x)− f(x)| ≤ ε) ≥ µ(x : |fm(x)− f(x)| ≤ ε eventually)≥ µ(x ∈ E : |fm(x)− f(x)| → 0)= µ(E).

As µ(E) <∞, we have µ(x ∈ E : |fn(x)− f(x)| > ε)→ 0 as n→∞.

(ii) Suppose that fn → f in measure. We pick a subsequence (nk) such that

µ

(x ∈ E : |fnk

(x)− f(x)| > 1

k

)≤ 2−k.

Then we have

∞∑k=1

µ

(x ∈ E : fnk

(x)− f(x)| > 1

k

)≤∞∑k=1

2−k = 1 <∞.

By the first Borel–Cantelli lemma, we know

µ

(x ∈ E : |fnk

(x)− f(x)| > 1

ki.o.

)= 0.

So fnk→ f a.e.

It is important that we assume that µ(E) <∞ for the first part.

Example. Consider (E, E , µ) = (R,B,Lebesgue). Take fn(x) = 1[n,∞)(x).Then fn(x) → 0 for all x, and in particular almost everywhere. However, wehave

µ

(x ∈ R : |fn(x)| > 1

2

)= µ([n,∞)) =∞

for all n.

There is one last type of convergence we are interested in. We will onlyfirst formulate it in the probability setting, but there is an analogous notion inmeasure theory known as weak convergence, which we will discuss much later onin the course.

30

Page 31: Part II - Probability and Measure - SRCF · Part II | Probability and Measure Based on lectures by J. Miller Notes taken by Dexter Chua Michaelmas 2016 These notes are not endorsed

2 Measurable functions and random variables II Probability and Measure

Definition (Convergence in distribution). Let (Xn), X be random variableswith distribution functions FXn

and FX , then we say Xn → X in distribution ifFXn

(x)→ FX(x) for all x ∈ R at which FX is continuous.

Note that here we do not need that (Xn) and X live on the same probabilityspace, since we only talk about the distribution functions.

But why do we have the condition with continuity points? The idea is thatif the resulting distribution has a “jump” at x, it doesn’t matter which side ofthe jump FX(x) is at. Here is a simple example that tells us why this is veryimportant:

Example. Let Xn to be uniform on [0, 1/n]. Intuitively, this should convergeto the random variable that is always zero.

We can compute

FXn(x) =

0 x ≤ 0

nx 0 < x < 1/n

1 x ≥ 1/n

.

We can also compute the distribution of the zero random variable as

F0 =

0 x < 0

1 x ≥ 0.

But FXn(0) = 0 for all n, while FX(0) = 1.

One might now think of cheating by cooking up some random variable suchthat F is discontinuous at so many points that random, unrelated things convergeto F . However, this cannot be done, because F is a non-decreasing function,and thus can only have countably many points of discontinuities.

The big theorem we are going to prove about convergence in distribution isthat actually it is very boring and doesn’t give us anything new.

Theorem (Skorokhod representation theorem of weak convergence).

(i) If (Xn), X are defined on the same probability space, and Xn → X inprobability. Then Xn → X in distribution.

(ii) If Xn → X in distribution, then there exists random variables (Xn) andX defined on a common probability space with FXn

= FXn and FX = FX

such that Xn → X a.s.

Proof. Let S = x ∈ R : FX is continuous.

(i) Assume that Xn → X in probability. Fix x ∈ S. We need to show thatFXn

(x)→ FX(x) as n→∞.

We fix ε > 0. Since x ∈ S, this implies that there is some δ > 0 such that

FX(x− δ) ≥ FX(x)− ε

2

FX(x+ δ) ≤ FX(x) +ε

2.

31

Page 32: Part II - Probability and Measure - SRCF · Part II | Probability and Measure Based on lectures by J. Miller Notes taken by Dexter Chua Michaelmas 2016 These notes are not endorsed

2 Measurable functions and random variables II Probability and Measure

We fix N large such that n ≥ N implies P[|Xn −X| ≥ δ] ≤ ε2 . Then

FXn(x) = P[Xn ≤ x]

= P[(Xn −X) +X ≤ x]

We now notice that (Xn−X) +X ≤ x ⊆ X ≤ x+ δ∪|Xn−X| > δ.So we have

≤ P[X ≤ x+ δ] + P[|Xn −X| > δ]

≤ FX(x+ δ) +ε

2≤ FX(x) + ε.

We similarly have

FXn(x) = P[Xn ≤ x]

≥ P[X ≤ x− δ]− P[|Xn −X| > δ]

≥ FX(x− δ)− ε

2≥ FX(x)− ε.

Combining, we have that n ≥ N implying |Fxn(x)− FX(x)| ≤ ε. Since εwas arbitrary, we are done.

(ii) Suppose Xn → X in distribution. We again let

(Ω,F ,B) = ((0, 1),B((0, 1)),Lebesgue).

We let

Xn(ω) = infx : ω ≤ FXn(x),

X(ω) = infx : ω ≤ FX(x).

Recall from before that Xn has the same distribution function as Xn forall n, and X has the same distribution as X. Moreover, we have

Xn(ω) ≤ x⇔ ω ≤ FXn(x)

x < Xn(ω)⇔ FXn(x) < ω,

and similarly if we replace Xn with X.

We are now going to show that with this particular choice, we have Xn → Xa.s.

Note that X is a non-decreasing function (0, 1) → R. Then by generalanalysis, X has at most countably many discontinuities. We write

Ω0 = ω ∈ (0, 1) : X is continuous at ω0.

Then (0, 1) \ Ω0 is countable, and hence has Lebesgue measure 0. So

P[Ω0] = 1.

32

Page 33: Part II - Probability and Measure - SRCF · Part II | Probability and Measure Based on lectures by J. Miller Notes taken by Dexter Chua Michaelmas 2016 These notes are not endorsed

2 Measurable functions and random variables II Probability and Measure

We are now going to show that Xn(ω)→ X(ω) for all ω ∈ Ω0.

Note that FX is a non-decreasing function, and hence the points of discon-tinuity R \ S is also countable. So S is dense in R. Fix ω ∈ Ω0 and ε > 0.We want to show that |Xn(ω)− X(ω)| ≤ ε for all n large enough.

Since S is dense in R, we can find x−, x+ in S such that

x− < X(ω) < x+

and x+−x− < ε. What we want to do is to use the characteristic propertyof X and FX to say that this implies

FX(x−) < ω < FX(x+).

Then since FXn→ FX at the points x−, x+, for sufficiently large n, we

haveFXn(x−) < ω < FXn(x+).

Hence we havex− < Xn(ω) < x+.

Then it follows that |Xn(ω)− X(ω)| < ε.

However, this doesn’t work, since X(ω) < x+ only implies ω ≤ FX(x+),and our argument will break down. So we do a funny thing where weintroduce a new variable ω+.

Since X is continuous at ω, we can find ω+ ∈ (ω, 1) such that X(ω+) ≤ x+.

X(ω)

ω

x−x+

ω+

ε

Then we havex− < X(ω) ≤ X(ω+) < x+.

Then we haveFX(x−) < ω < ω+ ≤ FX(x+).

So for sufficiently large n, we have

FXn(x−) < ω < FXn

(x+).

So we havex− < Xn(ω) ≤ x+,

and we are done.

33

Page 34: Part II - Probability and Measure - SRCF · Part II | Probability and Measure Based on lectures by J. Miller Notes taken by Dexter Chua Michaelmas 2016 These notes are not endorsed

2 Measurable functions and random variables II Probability and Measure

2.5 Tail events

Finally, we are going to quickly look at tail events. These are events that dependonly on the asymptotic behaviour of a sequence of random variables.

Definition (Tail σ-algebra). Let (Xn) be a sequence of random variables. Welet

Tn = σ(Xn+1, Xn+2, · · · ),

andT =

⋂n

Tn.

Then T is the tail σ-algebra.

Then T -measurable events and random variables only depend on the asymp-totic behaviour of the Xn’s.

Example. Let (Xn) be a sequence of real-valued random variables. Then

lim supn→∞

1

n

n∑j=1

Xj , lim infn→∞

1

n

n∑j=1

Xj

are T -measurable random variables. Finally, limn→∞

1

n

n∑j=1

Xj exists

∈ T ,since this is just the set of all points where the previous two things agree.

Theorem (Kolmogorov 0-1 law). Let (Xn) be a sequence of independent (real-valued) random variables. If A ∈ T , then P[A] = 0 or 1.

Moreover, if X is a T -measurable random variable, then there exists aconstant c such that

P[X = c] = 1.

Proof. The proof is very funny the first time we see it. We are going to provethe theorem by checking something that seems very strange. We are going toshow that if A ∈ T , then A is independent of A. It then follows that

P[A] = P[A ∩A] = P[A]P[A],

so P[A] = 0 or 1. In fact, we are going to prove that T is independent of T .Let

Fn = σ(X1, · · · , Xn).

This σ-algebra is generated by the π-system of events of the form

A = X1 ≤ x1, · · · , Xn ≤ xn.

Similarly, Tn = σ(Xn+1, Xn+2, · · · ) is generated by the π-system of events of theform

B = Xn+1 ≤ xn+1, · · · , Xn+k ≤ xn+k,

where k is any natural number.

34

Page 35: Part II - Probability and Measure - SRCF · Part II | Probability and Measure Based on lectures by J. Miller Notes taken by Dexter Chua Michaelmas 2016 These notes are not endorsed

2 Measurable functions and random variables II Probability and Measure

Since the Xn are independent, we know for any such A and B, we have

P[A ∩B] = P[A]P[B].

Since this is true for all A and B, it follows that Fn is independent of Tn.Since T =

⋂k Tk ⊆ Tn for each n, we know Fn is independent of T .

Now⋃k Fk is a π-system, which generates the σ-algebra F∞ = σ(X1, X2, · · · ).

We know that if A ∈⋃n Fn, then there has to exist an index n such that A ∈ Fn.

So A is independent of T . So F∞ is independent of T .Finally, note that T ⊆ F∞. So T is independent of T .To find the constant, suppose that X is T -measurable. Then

P[X ≤ x] ∈ 0, 1

for all x ∈ R since X ≤ x ∈ T .Now take

c = infx ∈ R : P[X ≤ x] = 1.

Then with this particular choice of c, it is easy to see that P[X = c] = 1. Thiscompletes the proof of the theorem.

35

Page 36: Part II - Probability and Measure - SRCF · Part II | Probability and Measure Based on lectures by J. Miller Notes taken by Dexter Chua Michaelmas 2016 These notes are not endorsed

3 Integration II Probability and Measure

3 Integration

3.1 Definition and basic properties

We are now going to work towards defining the integral of a measurable functionon a measure space (E, E , µ). Different sources use different notations for theintegral. The following notations are all commonly used:

µ(f) =

∫E

f dµ =

∫E

f(x) dµ(x) =

∫E

f(x)µ(dx).

In the case where (E, E , µ) = (R,B,Lebesgue), people often just write this as

µ(f) =

∫Rf(x) dx.

On the other hand, if (E, E , µ) = (Ω,F,P) is a probability space, and X is arandom variable, then people write the integral as E[X], the expectation of X.

So how are we going to define the integral? There are two steps to definingthe integral. The idea is that we first define the integral on simple functions,and then extend the definition to more general measurable functions by takingthe limit. When we do the definition for simple functions, it will be obvious thatthe definition satisfies the nice properties, and we will have to check that theyare preserved when we take the limit.

Definition (Simple function). A simple function is a measurable function thatcan be written as a finite non-negative linear combination of indicator functionsof measurable sets, i.e.

f =

n∑k=1

ak1Ak

for some Ak ∈ E and ak ≥ 0.

Note that some sources do not assume that ak ≥ 0, but assuming this makesour life easier.

It is obvious that

Proposition. A function is simple iff it is measurable, non-negative, and takeson only finitely-many values.

Definition (Integral of simple function). The integral of a simple function

f =

n∑k=1

ak1Ak

is given by

µ(f) =

n∑k=1

akµ(Ak).

Note that it can be that µ(Ak) = ∞, but ak = 0. When this happens, weare just going to declare that 0 · ∞ = 0 (this makes sense because this meanswe are ignoring all 0 · 1A terms for any A). After we do this, we can check theintegral is well-defined.

36

Page 37: Part II - Probability and Measure - SRCF · Part II | Probability and Measure Based on lectures by J. Miller Notes taken by Dexter Chua Michaelmas 2016 These notes are not endorsed

3 Integration II Probability and Measure

We are now going to extend this definition to non-negative measurablefunctions by a limiting procedure. Once we’ve done this, we are going to extendthe definition to measurable functions by linearity of the integral. Then wewould have a definition of the integral, and we are going to deduce properties ofthe integral using approximation.

Definition (Integral). Let f be a non-negative measurable function. We set

µ(f) = supµ(g) : g ≤ f, g is simple.

For arbitrary f , we write

f = f+ − f− = (f ∨ 0) + (f ∧ 0).

We put |f | = f+ + f−. We say f is integrable if µ(|f |) <∞. In this case, set

µ(f) = µ(f+)− µ(f−).

If only one of µ(f+), µ(f−) < ∞, then we can still make the above definition,and the result will be infinite.

In the case where we are integrating over (a subset of) the reals, we call itthe Lebesgue integral .

Proposition. Let f : [0, 1]→ R be Riemann integrable. Then it is also Lebesgueintegrable, and the two integrals agree.

We will not prove this, but this immediately gives us results like the funda-mental theorem of calculus, and also helps us to actually compute the integral.However, note that this does not hold for infinite domains, as you will see in thesecond example sheet.

But the Lebesgue integrable functions are better. A lot of functions areLebesgue integrable but not Riemann integrable.

Example. Take the standard non-Riemann integrable function

f = 1[0,1]\Q.

Then f is not Riemann integrable, but it is Lebesgue integrable, since

µ(f) = µ([0, 1] \Q) = 1.

We are now going to study some basic properties of the integral. We will firstlook at the properties of integrals of simple functions, and then extend them togeneral integrable functions.

For f, g simple, and α, β ≥ 0, we have that

µ(αf + βg) = αµ(f) + βµ(g).

So the integral is linear.Another important property is monotonicity — if f ≤ g, then µ(f) ≤ µ(g).Finally, we have f = 0 a.e. iff µ(f) = 0. It is absolutely crucial here that we

are talking about non-negative functions.Our goal is to show that these three properties are also satisfied for arbitrary

non-negative measurable functions, and the first two hold for integrable functions.In order to achieve this, we prove a very important tool — the monotone

convergence theorem. Later, we will also learn about the dominated convergencetheorem and Fatou’s lemma. These are the main and very important resultsabout exchanging limits and integration.

37

Page 38: Part II - Probability and Measure - SRCF · Part II | Probability and Measure Based on lectures by J. Miller Notes taken by Dexter Chua Michaelmas 2016 These notes are not endorsed

3 Integration II Probability and Measure

Theorem (Monotone convergence theorem). Suppose that (fn), f are non-negative measurable with fn f . Then µ(fn) µ(f).

In the proof we will use the fact that the integral is monotonic, which weshall prove later.

Proof. We will split the proof into five steps. We will prove each of the followingin turn:

(i) If fn and f are indicator functions, then the theorem holds.

(ii) If f is an indicator function, then the theorem holds.

(iii) If f is simple, then the theorem holds.

(iv) If f is non-negative measurable, then the theorem holds.

Each part follows rather straightforwardly from the previous one, and the readeris encouraged to try to prove it themself.

We first consider the case where fn = 1Anand f = 1A. Then fn f is true

iff An A. On the other hand, µ(fn) µ(f) iff µ(An) µ(A).For convenience, we let A0 = ∅. We can write

µ(A) = µ

(⋃n

An \An−1

)

=

∞∑n=1

µ(An \An−1)

= limN→∞

N∑n=1

µ(An \An−1)

= limN→∞

µ(AN ).

So done.

We next consider the case where f = 1A for some A. Fix ε > 0, and set

An = fn > 1− ε ∈ E .

Then we know that An A, as fn f . Moreover, by definition, we have

(1− ε)1An ≤ fn ≤ f = 1A.

As An A, we have that

(1− ε)µ(f) = (1− ε) limn→∞

µ(An) ≤ limn→∞

µ(fn) ≤ µ(f)

since fn ≤ f . Since ε is arbitrary, we know that

limn→∞

µ(fn) = µ(f).

38

Page 39: Part II - Probability and Measure - SRCF · Part II | Probability and Measure Based on lectures by J. Miller Notes taken by Dexter Chua Michaelmas 2016 These notes are not endorsed

3 Integration II Probability and Measure

Next, we consider the case where f is simple. We write

f =

m∑k=1

ak1Ak,

where ak > 0 and Ak are pairwise disjoint. Since fn f , we know

a−1k fn1Ak

1Ak.

So we have

µ(fn) =

m∑k=1

µ(fn1Ak) =

m∑k=1

akµ(a−1k fn1Ak

)→m∑k=1

akµ(Ak) = µ(f).

Suppose f is non-negative measurable. Suppose g ≤ f is a simple function.As fn f , we know fn ∧ g f ∧ g = g. So by the previous case, we know that

µ(fn ∧ g)→ µ(g).

We also know thatµ(fn) ≥ µ(fn ∧ g).

So we havelimn→∞

µ(fn) ≥ µ(g)

for all g ≤ f . This is possible only if

limn→∞

µ(fn) ≥ µ(f)

by definition of the integral. However, we also know that µ(fn) ≤ µ(f) for all n,again by definition of the integral. So we must have equality. So we have

µ(f) = limn→∞

µ(fn).

Theorem. Let f, g be non-negative measurable, and α, β ≥ 0. We have that

(i) µ(αf + βg) = αµ(f) + βµ(g).

(ii) f ≤ g implies µ(f) ≤ µ(g).

(iii) f = 0 a.e. iff µ(f) = 0.

Proof.

(i) Let

fn = 2−nb2nfc ∧ ngn = 2−nb2ngc ∧ n.

Then fn, gn are simple with fn f and gn g. Hence µ(fn) µ(f) andµ(gn g and µ(αfn + βgn) µ(αf + βg), by the monotone convergencetheorem. As fn, gn are simple, we have that

µ(αfn + βgn) = αµ(fn) + βµ(gn).

Taking the limit as n→∞, we get

µ(αf + βg) = αµ(f) + βµ(g).

39

Page 40: Part II - Probability and Measure - SRCF · Part II | Probability and Measure Based on lectures by J. Miller Notes taken by Dexter Chua Michaelmas 2016 These notes are not endorsed

3 Integration II Probability and Measure

(ii) We shall be careful not to use the monotone convergence theorem. Wehave

µ(g) = supµ(h) : h ≤ g simple≥ supµ(h) : h ≤ f simple= µ(f).

(iii) Suppose f 6= 0 a.e. Let

An =

x : f(x) >

1

n

.

Thenx : f(x) 6= 0 =

⋃n

An.

Since the left hand set has non-negative measure, it follows that there issome An with non-negative measure. For that n, we define

h =1

n1An

.

Then µ(f) ≥ µ(h) > 0. So µ(f) 6= 0.

Conversely, suppose f = 0 a.e. We let

fn = 2−nb2nfc ∧ n

be a simple function. Then fn f and fn = 0 a.e. So

µ(f) = limn→∞

µ(fn) = 0.

We now prove the analogous statement for general integrable functions.

Theorem. Let f, g be integrable, and α, β ≥ 0. We have that

(i) µ(αf + βg) = αµ(f) + βµ(g).

(ii) f ≤ g implies µ(f) ≤ µ(g).

(iii) f = 0 a.e. implies µ(f) = 0.

Note that in the last case, the converse is no longer true, as one can easilysee from the sign function sgn : [−1, 1]→ R.

Proof.

(i) We are going to prove these by applying the previous theorem.

By definition of the integral, we have µ(−f) = −µ(f). Also, if α ≥ 0, then

µ(αf) = µ(αf+)− µ(αf−) = αµ(f+)− αµ(f−) = αµ(f).

Combining these two properties, it then follows that if α is a real number,then

µ(αf) = αµ(f).

40

Page 41: Part II - Probability and Measure - SRCF · Part II | Probability and Measure Based on lectures by J. Miller Notes taken by Dexter Chua Michaelmas 2016 These notes are not endorsed

3 Integration II Probability and Measure

To finish the proof of (i), we have to show that µ(f + g) = µ(f) + µ(g).We know that this is true for non-negative functions, so we need to employa little trick to make this a statement about the non-negative version. Ifwe let h = f + g, then we can write this as

h+ − h− = (f+ − f−) + (g+ − g−).

We now rearrange this as

h+f− + g− = f+ + g+ + h−.

Now everything is non-negative measurable. So applying µ gives

µ(f+) + µ(f−) + µ(g−) = µ(f+) + µ(g+) + µ(h−).

Rearranging, we obtain

µ(h+)− µ(h−) = µ(f+)− µ(f−) + µ(g+)− µ(g−).

This is exactly the same thing as saying

µ(f + g) = µ(h) = µ(f) = µ(g).

(ii) If f ≤ g, then g−f ≥ 0. So µ(g−f) ≥ 0. By (i), we know µ(g)−µ(f) ≥ 0.So µ(g) ≥ µ(f).

(iii) If f = 0 a.e., then f+, f− = 0 a.e. So µ(f+) = µ(f−) = 0. So µ(f) =µ(f+)− µ(f−) = 0.

As mentioned, the converse to (iii) is no longer true. However, we do havethe following partial converse:

Proposition. If A is a π-system with E ∈ A and σ(A) = E , and f is anintegrable function that

µ(f1A) = 0

for all A ∈ A. Then µ(f) = 0 a.e.

Proof. LetD = A ∈ E : µ(f1A) = 0.

It follows immediately from the properties of the integral that D is a d-system.So D = E by Dynkin’s lemma. Let

A+ = x ∈ E : f(x) > 0,A− = x ∈ E : f(x) < 0.

Then A± ∈ E , andµ(f1A+) = µ(f1A−) = 0.

So f1A+ and f1A− vanish a.e. So f vanishes a.e.

Proposition. Suppose that (gn) is a sequence of non-negative measurablefunctions. Then we have

µ

( ∞∑n=1

gn

)=

∞∑n=1

µ(gn).

41

Page 42: Part II - Probability and Measure - SRCF · Part II | Probability and Measure Based on lectures by J. Miller Notes taken by Dexter Chua Michaelmas 2016 These notes are not endorsed

3 Integration II Probability and Measure

Proof. We know (N∑n=1

gn

)

( ∞∑n=1

gn

)as N →∞. So by the monotone convergence theorem, we have

N∑n=1

µ(gn) = µ

(N∑n=1

gn

) µ

( ∞∑n=1

gn

).

But we also know thatN∑n=1

µ(gn)∞∑n=1

µ(gn)

by definition. So we are done.

So for non-negative measurable functions, we can always switch the order ofintegration and summation.

Note that we can consider summation as integration. We let E = N andE = all subsets of N. We let µ be the counting measure, so that µ(A) is thesize of A. Then integrability (and having a finite integral) is the same as absoluteconvergence. Then if it converges, then we have∫

f dµ =

∞∑n=1

f(n).

So we can just view our proposition as proving that we can swap the order oftwo integrals. The general statement is known as Fubini’s theorem.

3.2 Integrals and limits

We are now going to prove more things about exchanging limits and integrals.These are going to be extremely useful in the future, as we want to exchangelimits and integrals a lot.

Theorem (Fatou’s lemma). Let (fn) be a sequence of non-negative measurablefunctions. Then

µ(lim inf fn) ≤ lim inf µ(fn).

Note that a special case was proven in the first example sheet, where we didit for the case where fn are indicator functions.

Proof. We start with the trivial observation that if k ≥ n, then we always havethat

infm≥n

fm ≤ fk.

By the monotonicity of the integral, we know that

µ

(infm≥n

fm

)≤ µ(fk).

for all k ≥ n.

42

Page 43: Part II - Probability and Measure - SRCF · Part II | Probability and Measure Based on lectures by J. Miller Notes taken by Dexter Chua Michaelmas 2016 These notes are not endorsed

3 Integration II Probability and Measure

So we have

µ

(infm≥n

fm

)≤ infk≥n

µ(fk) ≤ lim infm

µ(fm).

It remains to show that the left hand side converges to µ(lim inf fm). Indeed,we know that

infm≥n

fm lim infm

fm.

Then by monotone convergence, we have

µ

(infm≥n

fm

) µ

(lim infm

fm

).

So we haveµ(

lim infm

fm

)≤ lim inf

mµ(fm).

No one ever remembers which direction Fatou’s lemma goes, and this leads tomany incorrect proofs and results, so it is helpful to keep the following examplein mind:

Example. We let (E, E , µ) = (R,B,Lebesgue). We let

fn = 1[n,n+1].

Then we havelim inf

nfn = 0.

So we haveµ(fn) = 1 for all n.

So we havelim inf µ(fn) = 1, µ(lim inf fn) = 0.

So we haveµ(

lim infm

fm

)≤ lim inf

mµ(fm).

The next result we want to prove is the dominated convergence theorem.This is like the monotone convergence theorem, but we are going to remove theincreasing and non-negative measurable condition, and add in something else.

Theorem (Dominated convergence theorem). Let (fn), f be measurable withfn(x)→ f(x) for all x ∈ E. Suppose that there is an integrable function g suchthat

|fn| ≤ g

for all n, then we haveµ(fn)→ µ(f)

as n→∞.

43

Page 44: Part II - Probability and Measure - SRCF · Part II | Probability and Measure Based on lectures by J. Miller Notes taken by Dexter Chua Michaelmas 2016 These notes are not endorsed

3 Integration II Probability and Measure

Proof. Note that|f | = lim

n|f |n ≤ g.

So we know thatµ(|f |) ≤ µ(g) <∞.

So we know that f , fn are integrable.Now note also that

0 ≤ g + fn, 0 ≤ g − fnfor all n. We are now going to apply Fatou’s lemma twice with these series. Wehave that

µ(g) + µ(f) = µ(g + f)

= µ(

lim infn

(g + fn))

≤ lim infn

µ(g + fn)

= lim infn

(µ(g) + µ(fn))

= µ(g) + lim infn

µ(fn).

Since µ(g) is finite, we know that

µ(f) ≤ lim infn

µ(fn).

We now do the same thing with g − fn. We have

µ(g)− µ(f) = µ(g − f)

= µ(

lim infn

(g − fn))

≤ lim infn

µ(g − fn)

= lim infn

(µ(g)− µ(fn))

= µ(g)− lim supn

µ(fn).

Again, since µ(g) is finite, we know that

µ(f) ≥ lim supn

µ(fn).

These combine to tell us that

µ(f) ≤ lim infn

µ(fn) ≤ lim supn

µ(fn) ≤ µ(f).

So they must be all equal, and thus µ(fn)→ µ(f).

3.3 New measures from old

We have previously considered several ways of constructing measures from oldones, such as the image measure. We are now going to study a few more ways ofconstructing new measures, and see how integrals behave when we do these.

44

Page 45: Part II - Probability and Measure - SRCF · Part II | Probability and Measure Based on lectures by J. Miller Notes taken by Dexter Chua Michaelmas 2016 These notes are not endorsed

3 Integration II Probability and Measure

Definition (Restriction of measure space). Let (E, E , µ) be a measure space,and let A ∈ E . The restriction of the measure space to A is (A, EA, µA), where

EA = B ∈ E : B ⊆ A,

and µA is the restriction of µ to EA, i.e.

µA(B) = µ(B)

for all B ∈ EA.

It is easy to check the following:

Lemma. For (E, E , µ) a measure space and A ∈ E , the restriction to A is ameasure space.

Proposition. Let (E, E , µ) and (F,F , µ′) be measure spaces and A ∈ E . Letf : E → F be a measurable function. Then f |A is EA-measurable.

Proof. Let B ∈ F . Then

(f |A)−1(B) = f−1(B) ∩A ∈ EA.

Similarly, we have

Proposition. If f is integrable, then f |A is µA-integrable and µA(f |A) =µ(f1A).

Note that means we have

µ(f1A) =

∫E

f1A dµ =

∫A

f dµA.

Usually, we are lazy and just write

µ(f1A) =

∫A

f dµ.

In the particular case of Lebesgue integration, if A is an interval with left andright end points a, b (i.e. it can be open, closed, half open or half closed), thenwe write ∫

A

f dµ =

∫ b

a

f(x) dx.

There is another construction we would be interested in.

Definition (Pushforward/image of measure). Let (E, E) and (G,G) be measurespaces, and f : E → G a measurable function. If µ is a measure on (E, E), then

ν = µ f−1

is a measure on (G,G), known as the pushforward or image measure.

We have already seen this before, but we can apply this to integration asfollows:

45

Page 46: Part II - Probability and Measure - SRCF · Part II | Probability and Measure Based on lectures by J. Miller Notes taken by Dexter Chua Michaelmas 2016 These notes are not endorsed

3 Integration II Probability and Measure

Proposition. If g is a non-negative measurable function on G, then

ν(g) = µ(g f).

Proof. Exercise using the monotone class theorem (see example sheet).

Finally, we can specify a measure by specifying a density.

Definition (Density). Let (E, E , µ) be a measure space, and f be a non-negativemeasurable function. We define

ν(A) = µ(f1A).

Then ν is a measure on (E, E).

Proposition. The ν defined above is indeed a measure.

Proof.

(i) ν(φ) = µ(f1∅) = µ(0) = 0.

(ii) If (An) is a disjoint sequence in E , then

ν(⋃

An

)= µ(f1⋃

An) = µ

(f∑

1An

)=∑

µ (f1An) =

∑ν(f).

Definition (Density). Let X be a random variable. We say X has a density ifits law µX has a density with respect to the Lebesgue measure. In other words,there exists fX non-negative measurable so that

µX(A) = P[X ∈ A] =

∫A

fX(x) dx.

In this case, for any non-negative measurable function, for any non-negativemeasurable g, we have that

E[g(X)] =

∫Rg(x)fX(x) dx.

3.4 Integration and differentiation

In “normal” calculus, we had three results involving both integration and dif-ferentiation. One was the fundamental theorem of calculus, which we alreadystated. The others are the change of variables formula, and differentiating underthe integral sign.

We start by proving the change of variables formula.

Proposition (Change of variables formula). Let φ : [a, b]→ R be continuouslydifferentiable and increasing. Then for any bounded Borel function g, we have∫ φ(b)

φ(a)

g(y) dy =

∫ b

a

g(φ(x))φ′(x) dx. (∗)

We will use the monotone class theorem.

46

Page 47: Part II - Probability and Measure - SRCF · Part II | Probability and Measure Based on lectures by J. Miller Notes taken by Dexter Chua Michaelmas 2016 These notes are not endorsed

3 Integration II Probability and Measure

Proof. We let

V = Borel functions g such that (∗) holds.

We will want to use the monotone class theorem to show that this includes allbounded functions.

We already know that

(i) V contains 1A for all A in the π-system of intervals of the form [u, v] ⊆ [a, b].This is just the fundamental theorem of calculus.

(ii) By linearity of the integral, V is indeed a vector space.

(iii) Finally, let (gn) be a sequence in V , and gn ≥ 0, gn g. Then we knowthat ∫ φ(b)

φ(a)

gn(y) dy =

∫ b

a

gn(φ(x))φ′(x) dx.

By the monotone convergence theorem, these converge to∫ φ(b)

φ(a)

g(y) dy =

∫ b

a

g(φ(x))φ′(x) dx.

Then by the monotone class theorem, V contains all bounded Borel functions.

The next problem is differentiation under the integral sign. We want to knowwhen we can say

d

dt

∫f(x, t) dx =

∫∂f

∂t(x, t) dx.

Theorem (Differentiation under the integral sign). Let (E, E , µ) be a space,and U ⊆ R be an open set, and f : U × E → R. We assume that

(i) For any t ∈ U fixed, the map x 7→ f(t, x) is integrable;

(ii) For any x ∈ E fixed, the map t 7→ f(t, x) is differentiable;

(iii) There exists an integrable function g such that∣∣∣∣∂f∂t (t, x)

∣∣∣∣ ≤ g(x)

for all x ∈ E and t ∈ U .

Then the map

x 7→ ∂f

∂t(t, x)

is integrable for all t, and also the function

F (t) =

∫E

f(t, x)dµ

is differentiable, and

F ′(t) =

∫E

∂f

∂t(t, x) dµ.

47

Page 48: Part II - Probability and Measure - SRCF · Part II | Probability and Measure Based on lectures by J. Miller Notes taken by Dexter Chua Michaelmas 2016 These notes are not endorsed

3 Integration II Probability and Measure

The reason why we want the derivative to be bounded is that we want toapply the dominated convergence theorem.

Proof. Measurability of the derivative follows from the fact that it is a limit ofmeasurable functions, and then integrability follows since it is bounded by g.

Suppose (hn) is a positive sequence with hn → 0. Then let

gn(x) =f(t+ hn, x)− f(t, x)

hn− ∂f

∂t(t, x).

Since f is differentiable, we know that gn(x)→ 0 as n→∞. Moreover, by themean value theorem, we know that

|gn(x)| ≤ 2g(x).

On the other hand, by definition of F (t), we have

F (t+ hn)− F (t)

hn−∫E

∂f

∂t(t, x) dµ =

∫gn(x) dx.

By dominated convergence, we know the RHS tends to 0. So we know

limn→∞

F (t+ hn)− F (t)

hn→∫E

∂f

∂t(t, x) dµ.

Since hn was arbitrary, it follows that F ′(t) exists and is equal to the integral.

3.5 Product measures and Fubini’s theorem

Recall the following definition of the product σ-algebra.

Definition (Product σ-algebra). Let (E1, E1, µ1) and (E2, E2, µ2) be finite mea-sure spaces. We let

A = A1 ×A2 : A2 × E1, A2 × E2.

Then A is a π-system on E1 × E2. The product σ-algebra is

E = E1 ⊗ E2 = σ(A).

We now want to construct a measure on the product σ-algebra. We can, ofcourse, just apply the Caratheodory extension theorem, but we would want amore explicit description of the integral. The idea is to define, for A ∈ E1 ⊗ E2,

µ(A) =

∫E1

(∫E2

1A(x1, x2) µ2(dx2)

)µ1(dx1).

Doing this has the advantage that it would help us in a step of proving Fubini’stheorem.

However, before we can make this definition, we need to do some preparationto make sure the above statement actually makes sense:

Lemma. Let E = E1 × E2 be a product of σ-algebras. Suppose f : E → R isE-measurable function. Then

48

Page 49: Part II - Probability and Measure - SRCF · Part II | Probability and Measure Based on lectures by J. Miller Notes taken by Dexter Chua Michaelmas 2016 These notes are not endorsed

3 Integration II Probability and Measure

(i) For each x2 ∈ E2, the function x1 7→ f(x1, x2) is E1-measurable.

(ii) If f is bounded or non-negative measurable, then

f2(x2) =

∫E1

f(x1, x2) µ1(dx1)

is E2-measurable.

Proof. The first part follows immediately from the fact that for a fixed x2,the map ι1 : E1 → E given by ι1(x1) = (x1, x2) is measurable, and that thecomposition of measurable functions is measurable.

For the second part, we use the monotone class theorem. We let V bethe set of all measurable functions f such that x2 7→

∫E1f(x1, x2)µ1(dx1) is

E2-measurable.

(i) It is clear that 1E ,1A ∈ V for all A ∈ A (where A is as in the definitionof the product σ-algebra).

(ii) V is a vector space by linearity of the integral.

(iii) Suppose (fn) is a non-negative sequence in V and fn f , then(x2 7→

∫E1

fn(x1, x2) µ1(dx1)

)(x2 7→

∫E1

f(x1, x2) µ(dx1)

)by the monotone convergence theorem. So f ∈ V .

So the monotone class theorem tells us V contains all bounded measurablefunctions.

Now if f is a general non-negative measurable function, then f ∧n is boundedand measurable, hence f ∧n ∈ V . Therefore f ∈ V by the monotone convergencetheorem.

Theorem. There exists a unique measurable function µ = µ1 ⊗ µ2 on E suchthat

µ(A1 ×A2) = µ(A1)µ(A2)

for all A1 ×A2 ∈ A.

Here it is crucial that the measure space is finite. Actually, everythingstill works for σ-finite measure spaces, as we can just reduce to the finite case.However, things start to go wrong if we don’t have σ-finite measure spaces.

Proof. One might be tempted to just apply the Caratheodory extension theorem,but we have a more direct way of doing it here, by using integrals. We define

µ(A) =

∫E1

(∫E2

1A(x1, x2) µ2(dx2)

)µ1(dx1).

Here the previous lemma is very important. It tells us that these integralsactually make sense!

We first check that this is a measure:

(i) µ(∅) = 0 is immediate since 1∅ = 0.

49

Page 50: Part II - Probability and Measure - SRCF · Part II | Probability and Measure Based on lectures by J. Miller Notes taken by Dexter Chua Michaelmas 2016 These notes are not endorsed

3 Integration II Probability and Measure

(ii) Suppose (An) is a disjoint sequence and A =⋃An. Then we have

µ(A) =

∫E1

(∫E2

1A(x1, x2) µ2(dx2)

)µ1(dx1)

=

∫E1

(∫E2

∑n

1An(x1, x2) µ2(dx2)

)µ1(dx1)

We now use the fact that integration commutes with the sum of non-negative measurable functions to get

=

∫E1

(∑n

(∫E2

1A(x1, x2) µ2(dx2)

))µ1(dx1)

=∑n

∫E1

(∫E2

1An(x1, x2) µ2(dx2)

)µ1(dx1)

=∑n

µ(An).

So we have a working measure, and it clearly satisfies

µ(A1 ×A2) = µ(A1)µ(A2).

Uniqueness follows because µ is finite, and is thus characterized by its values onthe π-system A that generates E .

Exercise. Show the non-uniqueness of the product Lebesgue measure on [0, 1]and the counting measure on [0, 1].

Note that we could as well have defined the measure as

µ(A) =

∫E2

(∫E1

1A(x1, x2) µ1(dx1)

)µ2(dx2).

The same proof would go through, so we have another measure on the space.However, by uniqueness, we know they must be the same! Fubini’s theoremgeneralizes this to arbitrary functions.

Theorem (Fubini’s theorem).

(i) If f is non-negative measurable, then

µ(f) =

∫E1

(∫E2

f(x1, x2) µ2(dx2)

)µ1(dx1). (∗)

In particular, we have∫E1

(∫E2

f(x1, x2) µ2(dx2)

)µ1(dx1) =

∫E2

(∫E1

f(x1, x2) µ1(dx1)

)µ2(dx2).

This is sometimes known as Tonelli’s theorem.

50

Page 51: Part II - Probability and Measure - SRCF · Part II | Probability and Measure Based on lectures by J. Miller Notes taken by Dexter Chua Michaelmas 2016 These notes are not endorsed

3 Integration II Probability and Measure

(ii) If f is integrable, and

A =

x1 ∈ E :

∫E2

|f(x1, x2)|µ2(dx2) <∞.

thenµ1(E1 \A) = 0.

If we set

f1(x1) =

∫E2f(x1, x2) µ2(dx2) x1 ∈ A

0 x1 6∈ A,

then f1 is a µ1 integrable function and

µ1(f1) = µ(f).

Proof.

(i) Let V be the set of all measurable functions such that (∗) holds. Then Vis a vector space since integration is linear.

(a) By definition of µ, we know 1E and 1A are in V for all A ∈ A.

(b) The monotone convergence theorem on both sides tell us that V isclosed under monotone limits of the form fn f , fn ≥ 0.

By the monotone class theorem, we know V contains all bounded measur-able functions. If f is non-negative measurable, then (f ∧ n) ∈ V , andmonotone convergence for f ∧ n f gives that f ∈ V .

(ii) Assume that f is µ-integrable. Then

x1 7→∫E2

|f(x1, x2)| µ(dx2)

is E1-measurable, and, by (i), is µ1-integrable. So A1, being the inverseimage of ∞ under that map, lies in E1. Moreover, µ1(E1 \A1) = 0 becauseintegrable functions can only be infinite on sets of measure 0.

We set

f+1 (x1) =

∫E2

f+(x1, x2) µ2(dx2)

f−1 (x1) =

∫E2

f−(x1, x2) µ2(dx2).

Then we havef1 = (f+

1 − f−1 )1A1

.

So the result follows since

µ(f) = µ(f+)− µ(f−) = µ(f+1 )− µ1(f−1 ) = µ1(f1).

by (i).

51

Page 52: Part II - Probability and Measure - SRCF · Part II | Probability and Measure Based on lectures by J. Miller Notes taken by Dexter Chua Michaelmas 2016 These notes are not endorsed

3 Integration II Probability and Measure

Since R is σ-finite, we know that we can sensibly talk about the d-fold productof the Lebesgue measure on R to obtain the Lebesgue measure on Rd.

What σ-algebra is the Lebesgue measure on Rd defined on? We know theLebesgue measure on R is defined on B. So the Lebesgue measure is defined on

B × · · · × B = σ(B1 × · · · ×Bd : Bi ∈ B).

By looking at the definition of the product topology, we see that this is just theBorel σ-algebra on Rd!

Recall that when we constructed the Lebesgue measure, the Caratheodoryextension theorem yields a measure on the “Lebesgue σ-algebra” M, whichwas strictly bigger than the Borel σ-algebra. It was shown in the first examplesheet that M is complete, i.e. if we have A ⊆ B ⊆ R with B ∈ M, µ(B) = 0,then A ∈ M. We can also take the Lebesgue measure on Rd to be defined onM⊗ · · · ⊗M. However, it happens that M⊗M together with the Lebesguemeasure on R2 is no longer complete (proof is left as an exercise for the reader).

We now turn to probability. Recall that random variables X1, · · · , Xn areindependent iff the σ-algebras σ(X1), · · · , σ(Xn) are independent. We will showthat random variables are independent iff their laws are given by the productmeasure.

Proposition. Let X1, · · · , Xn be random variables on (Ω,F ,P) with values in(E1, E1), · · · , (En, En) respectively. We define

E = E1 × · · · × En, E = E1 ⊗ · · · ⊗ En.

Then X = (X1, · · · , Xn) is E-measurable and the following are equivalent:

(i) X1, · · · , Xn are independent.

(ii) µX = µX1⊗ · · · ⊗ µXn

.

(iii) For any f1, · · · , fn bounded and measurable, we have

E

[n∏k=1

fk(Xk)

]=

n∏k=1

E[fk(Xk)].

Proof.

– (i) ⇒ (ii): Let ν = µX1× · · · ⊗ µXn

. We want to show that ν = µX . Todo so, we just have to check that they agree on a π-system generating theentire σ-algebra. We let

A = A1 × · · · ×An : A1 ∈ E1, · · · , Ak ∈ Ek.

Then A is a generating π-system of E . Moreover, if A = A1×· · ·×An ∈ A,then we have

µX(A) = P[X ∈ A]

= P[X1 ∈ A1, · · · , Xn ∈ An]

52

Page 53: Part II - Probability and Measure - SRCF · Part II | Probability and Measure Based on lectures by J. Miller Notes taken by Dexter Chua Michaelmas 2016 These notes are not endorsed

3 Integration II Probability and Measure

By independence, we have

=

n∏k=1

P[Xk ∈ Ak]

= ν(A).

So we know that µX = ν = µX1⊗ · · · ⊗ µXn

on E .

– (ii) ⇒ (iii): By assumption, we can evaluate the expectation

E

[n∏k=1

fk(Xk)

]=

∫E

n∏k=1

fk(xk)µ(dxk)

=

n∏k=1

∫Ek

f(xk)µk(dxk)

=

n∏k=1

E[fk(Xk)].

Here in the middle we have used Fubini’s theorem.

– (iii) ⇒ (i): Take fk = 1Akfor Ak ∈ Ek. Then we have

P[X1 ∈ A1, · · · , Xn ∈ An] = E

[n∏k=1

1Ak(Xk)

]

=

n∏k=1

E[1Ak(Xk)]

=

n∏k=1

P[Xk ∈ Ak]

So X1, · · · , Xn are independent.

53

Page 54: Part II - Probability and Measure - SRCF · Part II | Probability and Measure Based on lectures by J. Miller Notes taken by Dexter Chua Michaelmas 2016 These notes are not endorsed

4 Inequalities and Lp spaces II Probability and Measure

4 Inequalities and Lp spaces

Eventually, we will want to define the Lp spaces as follows:

Definition (Lp spaces). Let (E, E , µ) be a measurable space. For 1 ≤ p <∞,we define Lp = Lp(E, E , µ) to be the set of all measurable functions f such that

‖f‖p =

(∫|f |pdµ

)1/p

<∞.

For p =∞, we let L∞ = L∞(E, E , µ) to be the space of functions with

‖f‖∞ = infλ ≥ 0 : |f | ≤ λ a.e. <∞.

However, it is not clear that this is a norm. First of all, ‖f‖p = 0 does notimply that f = 0. It only means that f = 0 a.e. But this is easy to solve. Wesimply quotient out the vector space by functions that differ on a set of measurezero. The more serious problem is that we don’t know how to prove the triangleinequality.

To do so, we are going to prove some inequalities. Apart from enabling us toshow that ‖ · ‖p is indeed a norm, they will also be very helpful in the futurewhen we want to bound integrals.

4.1 Four inequalities

The four inequalities we are going to prove are the following:

(i) Chebyshev/Markov inequality

(ii) Jensen’s inequality

(iii) Holder’s inequality

(iv) Minkowski’s inequality.

So let’s start proving the inequalities.

Proposition (Chebyshev’s/Markov’s inequality). Let f be non-negative mea-surable and λ > 0. Then

µ(f ≥ λ) ≤ 1

λµ(f).

This is often used when this is a probability measure, so that we are boundingthe probability that a random variable is big.

The proof is essentially one line.

Proof. We writef ≥ f1f≥λ ≥ λ1f≥λ.

Taking µ gives the desired answer.

This is incredibly simple, but also incredibly useful!The next inequality is Jensen’s inequality. To state it, we need to know what

a convex function is.

54

Page 55: Part II - Probability and Measure - SRCF · Part II | Probability and Measure Based on lectures by J. Miller Notes taken by Dexter Chua Michaelmas 2016 These notes are not endorsed

4 Inequalities and Lp spaces II Probability and Measure

Definition (Convex function). Let I ⊆ R be an interval. Then c : I → R isconvex if for any t ∈ [0, 1] and x, y ∈ I, we have

c(tx+ (1− t)y) ≤ tc(x) + (1− t)c(y).

x y(1 − t)x + ty

(1 − t)f(x) + tc(y)

Note that if c is twice differentiable, then this is equivalent to c′′ > 0.

Proposition (Jensen’s inequality). Let X be an integrable random variablewith values in I. If c : I → R is convex, then we have

E[c(X)] ≥ c(E[X]).

It is crucial that this only applies to a probability space. We need the totalmass of the measure space to be 1 for it to work. Just being finite is not enough.Jensen’s inequality will be an easy consequence of the following lemma:

Lemma. If c : I → R is a convex function and m is in the interior of I, thenthere exists real numbers a, b such that

c(x) ≥ ax+ b

for all x ∈ I, with equality at x = m.

φ

m

ax+ b

If the function is differentiable, then we can easily extract this from thederivative. However, if it is not, then we need to be more careful.

Proof. If c is smooth, then we know c′′ ≥ 0, and thus c′ is non-decreasing. Weare going to show an analogous statement that does not mention the word“derivative”. Consider x < m < y with x, y,m ∈ I. We want to show that

c(m)− c(x)

m− x≤ c(y)− c(m)

y −m.

55

Page 56: Part II - Probability and Measure - SRCF · Part II | Probability and Measure Based on lectures by J. Miller Notes taken by Dexter Chua Michaelmas 2016 These notes are not endorsed

4 Inequalities and Lp spaces II Probability and Measure

To show this, we turn off our brains and do the only thing we can do. We canwrite

m = tx+ (1− t)y

for some t. Then convexity tells us

c(m) ≤ tc(x) + (1− t)c(y).

Writing c(m) = tc(m) + (1− t)c(m), this tells us

t(c(m)− c(x)) ≤ (1− t)(c(y)− c(m)).

To conclude, we simply have to compute the actual value of t and plug it in. Wehave

t =y −my − x

, 1− t =m− xy − x

.

So we obtainy −my − x

(c(m)− c(x)) ≤ m− xy − x

(c(y)− c(m)).

Cancelling the y − x and dividing by the factors gives the desired result.Now since x and y are arbitrary, we know there is some a ∈ R such that

c(m)− c(x)

m− x≤ a ≤ c(y)− c(m)

y −m.

for all x < m < y. If we rearrange, then we obtain

c(t) ≥ a(t−m) + c(m)

for all t ∈ I.

Proof of Jensen’s inequality. To apply the previous result, we need to pick aright m. We take

m = E[X].

To apply this, we need to know that m is in the interior of I. So we assume thatX is not a.s. constant (that case is boring). By the lemma, we can find somea, b ∈ R such that

c(X) ≥ aX + b.

We want to take the expectation of the LHS, but we have to make sure theE[c(X)] is a sensible thing to talk about. To make sure it makes sense, we showthat E[c(X)−] = E[(−c(X)) ∨ 0] is finite.

We simply bound

[c(X)]− = [−c(X)] ∨ 0 ≤ |a||X|+ |b|.

So we haveE[c(X)−] ≤ |a|E|X|+ |b| <∞

since X is integrable. So E[c(X)] makes sense.We then just take

E[c(X)] ≥ E[aX + b] = aE[X] + b = am+ b = c(m) = c(E[X]).

So done.

56

Page 57: Part II - Probability and Measure - SRCF · Part II | Probability and Measure Based on lectures by J. Miller Notes taken by Dexter Chua Michaelmas 2016 These notes are not endorsed

4 Inequalities and Lp spaces II Probability and Measure

We are now going to use Jensen’s inequality to prove Holder’s inequality.Before that, we take note of the following definition:

Definition (Conjugate). Let p, q ∈ [1,∞]. We say that they are conjugate if

1

p+

1

q= 1,

where we take 1/∞ = 0.

Proposition (Holder’s inequality). Let p, q ∈ (1,∞) be conjugate. Then forf, g measurable, we have

µ(|fg|) = ‖fg‖1 ≤ ‖f‖p‖g‖q.

When p = q = 2, then this is the Cauchy-Schwarz inequality.

We will provide two different proofs.

Proof. We assume that ‖f‖p > 0 and ‖f‖p <∞. Otherwise, there is nothing toprove. By scaling, we may assume that ‖f‖p = 1. We make up a probabilitymeasure by

P[A] =

∫|f |p1A dµ.

Since we know

‖f‖p =

(∫|f |p dµ

)1/p

= 1,

we know P[ · ] is a probability measure. Then we have

µ(|fg|) = µ(|fg|1|f |>0)

= µ

(|g||f |p−1

1|f |>0|f |p)

= E[|g||f |p−1

1|f |>0

]Now use the fact that (E|X|)q ≤ E[|X|q] since x 7→ xq is convex for q > 1. Thenwe obtain

≤(E[|g|q

|f |(p−1)q1|f |>0

])1/q

.

The key realization now is that 1q + 1

p = 1 means that q(p − 1) = p. So thisbecomes

E[|g|q

|f |p1|f |>0

]1/q

= µ(|g|q)1/q = ‖g‖q.

Using the fact that ‖f‖p = 1, we obtain the desired result.

Alternative proof. We wlog 0 < ‖f‖p, ‖g‖q < ∞, or else there is nothing toprove. By scaling, we wlog ‖f‖p = ‖g‖q = 1. Then we have to show that∫

|f ||g| dµ ≤ 1.

57

Page 58: Part II - Probability and Measure - SRCF · Part II | Probability and Measure Based on lectures by J. Miller Notes taken by Dexter Chua Michaelmas 2016 These notes are not endorsed

4 Inequalities and Lp spaces II Probability and Measure

To do so, we notice if 1p + 1

q = 1, then the concavity of log tells us for any a, b > 0,we have

1

plog a+

1

qlog b ≤ log

(a

p+b

q

).

Replacing a with ap; b with bp and then taking exponentials tells us

ab ≤ ap

p+bq

q.

While we assumed a, b > 0 when deriving, we observe that it is also valid whensome of them are zero. So we have∫

|f ||g| dµ ≤∫ (

|f |p

p+|g|q

q

)dµ =

1

p+

1

q= 1.

Just like Jensen’s inequality, this is very useful when bounding integrals, andit is also theoretically very important, because we are going to use it to provethe Minkowski inequality. This tells us that the Lp norm is actually a norm.

Before we prove the Minkowski inequality, we prove the following tiny lemmathat we will use repeatedly:

Lemma. Let a, b ≥ 0 and p ≥ 1. Then

(a+ b)p ≤ 2p(ap + bp).

This is a terrible bound, but is useful when we want to prove that things arefinite.

Proof. We wlog a ≤ b. Then

(a+ b)p ≤ (2b)p ≤ 2pbp ≤ 2p(ap + bp).

Theorem (Minkowski inequality). Let p ∈ [1,∞] and f, g measurable. Then

‖f + g‖p ≤ ‖f‖p + ‖g‖p.

Again the proof is magic.

Proof. We do the boring cases first. If p = 1, then

‖f + g‖1 =

∫|f + g| ≤

∫(|f |+ |g|) =

∫|f |+

∫|g| = ‖f‖1 + ‖g‖1.

The proof of the case of p =∞ is similar.Now note that if ‖f + g‖p = 0, then the result is trivial. On the other hand,

if ‖f + g‖p =∞, then since we have

|f + g|p ≤ (|f |+ |g|)p ≤ 2p(|f |p + |g|p),

we know the right hand side is infinite as well. So this case is also done.

58

Page 59: Part II - Probability and Measure - SRCF · Part II | Probability and Measure Based on lectures by J. Miller Notes taken by Dexter Chua Michaelmas 2016 These notes are not endorsed

4 Inequalities and Lp spaces II Probability and Measure

Let’s now do the interesting case. We compute

µ(|f + g|p) = µ(|f + g||f + g|p−1)

≤ µ(|f ||f + g|p−1) + µ(|g||f + g|p−1)

≤ ‖f‖p‖|f + g|p−1‖q + ‖g‖p‖|f + g|p−1‖q= (‖f‖p + ‖g‖p)‖|f + g|p−1‖q= (‖f‖p + ‖g‖p)µ(|f + g|(p−1)q)1−1/p

= (‖f‖p + ‖g‖p)µ(|f + g|p)1−1/p.

So we knowµ(|f + g|p) ≤ (‖f‖p + ‖g‖p)µ(|f + g|p)1−1/p.

Then dividing both sides by (µ(|f + g|p)1−1/p tells us

µ(|f + g|p)1/p = ‖f + g‖p ≤ ‖f‖p + ‖g‖p.

Given these inequalities, we can go and prove some properties of Lp spaces.

4.2 Lp spaces

Recall the following definition:

Definition (Norm of vector space). Let V be a vector space. A norm on V isa function ‖ · ‖ : V → R≥0 such that

(i) ‖u+ v‖ ≤ ‖u‖+ ‖v‖ for all U, v ∈ V .

(ii) ‖αv‖ = |α|‖v‖ for all v ∈ V and α ∈ R

(iii) ‖v‖ = 0 implies v = 0.

Definition (Lp spaces). Let (E, E , µ) be a measurable space. For 1 ≤ p <∞,we define Lp = Lp(E, E , µ) to be the set of all measurable functions f such that

‖f‖p =

(∫|f |pdµ

)1/p

<∞.

For p =∞, we let L∞ = L∞(E, E , µ) to be the space of functions with

‖f‖∞ = infλ ≥ 0 : |f | ≤ λ a.e. <∞.

By Minkowski’s inequality, we know Lp is a vector space, and also (i) holds.By definition, (ii) holds obviously. However, (iii) does not hold for ‖ · ‖p, because‖f‖p = 0 does not imply that f = 0. It merely implies that f = 0 a.e.

To fix this, we define an equivalence relation as follows: for f, g ∈ Lp, we saythat f ∼ g iff f − g = 0 a.e. For any f ∈ Lp, we let [f ] denote its equivalenceclass under this relation. In other words,

[f ] = g ∈ Lp : f − g = 0 a.e..

59

Page 60: Part II - Probability and Measure - SRCF · Part II | Probability and Measure Based on lectures by J. Miller Notes taken by Dexter Chua Michaelmas 2016 These notes are not endorsed

4 Inequalities and Lp spaces II Probability and Measure

Definition (Lp space). We define

Lp = [f ] : f ∈ Lp,

where[f ] = g ∈ Lp : f − g = 0 a.e..

This is a normed vector space under the ‖ · ‖p norm.

One important property of Lp is that it is complete, i.e. every Cauchysequence converges.

Definition (Complete vector space/Banach spaces). A normed vector space(V, ‖ · ‖) is complete if every Cauchy sequence converges. In other words, if (vn)is a sequence in V such that ‖vn − vm‖ → 0 as n,m→∞, then there is somev ∈ V such that ‖vn − v‖ → 0 as n→∞. A complete vector space is known asa Banach space.

Theorem. Let 1 ≤ p ≤ ∞. Then Lp is a Banach space. In other words, if (fn)is a sequence in Lp, with the property that ‖fn − fm‖p → 0 as n,m→∞, thenthere is some f ∈ Lp such that ‖fn − f‖p → 0 as n→∞.

Proof. We will only give the proof for p < ∞. The p = ∞ case is left as anexercise for the reader.

Suppose that (fn) is a sequence in Lp with ‖fn − fm‖p → 0 as n,m → ∞.Take a subsequence (fnk

) of (fn) with

‖fnk+1− fnk

‖p ≤ 2−k

for all k ∈ N. We then find that∥∥∥∥∥M∑k=1

|fnk+1− fnk

|

∥∥∥∥∥p

≤M∑k=1

‖fnk+1− fnk

‖p ≤ 1.

We know that

M∑k=1

|fnk+1− fnk

| ∞∑k=1

|fnk+1− fnk

| as M →∞.

So applying the monotone convergence theorem, we know that∥∥∥∥∥∞∑k=1

|fnk+1− fnk

|

∥∥∥∥∥p

≤∞∑k=1

‖fnk+1− fnk

‖p ≤ 1.

In particular,∞∑k=1

|fnk+1− fnk

| <∞ a.e.

So fnk(x) converges a.e., since the real line is complete. So we set

f(x) =

limk→∞ fnk

(x) if the limit exists

0 otherwise

60

Page 61: Part II - Probability and Measure - SRCF · Part II | Probability and Measure Based on lectures by J. Miller Notes taken by Dexter Chua Michaelmas 2016 These notes are not endorsed

4 Inequalities and Lp spaces II Probability and Measure

By an exercise on the first example sheet, this function is indeed measurable.Then we have

‖fn − f‖pp = µ(|fn − f |p)

= µ

(lim infk→∞

|fn − fnk|p)

≤ lim infk→∞

µ(|fn − fnk|p),

which tends to 0 as n → ∞ since the sequence is Cauchy. So f is indeed thelimit.

Finally, we have to check that f ∈ Lp. We have

µ(|f |p) = µ(|f − fn + fn|p)≤ µ((|f − fn|+ |fn|)p)≤ µ(2p(|f − fn|p + |fn|p))= 2p(µ(|f − fn|p) + µ(|fn|p)2)

We know the first term tends to 0, and in particular is finite for n large enough,and the second term is also finite. So done.

4.3 Orthogonal projection in L2

In the particular case p = 2, we have an extra structure on L2, namely an innerproduct structure, given by

〈f, g〉 =

∫fg dµ.

This inner product induces the L2 norm by

‖f‖22 = 〈f, f〉.

Recall the following definition:

Definition (Hilbert space). A Hilbert space is a vector space with a completeinner product.

So L2 is not only a Banach space, but a Hilbert space as well.Somehow Hilbert spaces are much nicer than Banach spaces, because you

have an inner product structure as well. One particular thing we can do isorthogonal complements.

Definition (Orthogonal functions). Two functions f, g ∈ L2 are orthogonal if

〈f, g〉 = 0,

Definition (Orthogonal complement). Let V ⊆ L2. We then set

V ⊥ = f ∈ L2 : 〈f, v〉 = 0 for all v ∈ V .

61

Page 62: Part II - Probability and Measure - SRCF · Part II | Probability and Measure Based on lectures by J. Miller Notes taken by Dexter Chua Michaelmas 2016 These notes are not endorsed

4 Inequalities and Lp spaces II Probability and Measure

Note that we can always make these definitions for any inner product space.However, the completeness of the space guarantees nice properties of the orthog-onal complement.

Before we proceed further, we need to make a definition of what it meansfor a subspace of L2 to be closed. This isn’t the usual definition, since L2 isn’treally a normed vector space, so we need to accommodate for that fact.

Definition (Closed subspace). Let V ⊆ L2. Then V is closed if whenever (fn)is a sequence in V with fn → f , then there exists v ∈ V with v ∼ f .

Thee main thing that makes L2 nice is that we can use closed subspaces todecompose functions orthogonally.

Theorem. Let V be a closed subspace of L2. Then each f ∈ L2 has anorthogonal decomposition

f = u+ v,

where v ∈ V and u ∈ V ⊥. Moreover,

‖f − v‖2 ≤ ‖f − g‖2

for all g ∈ V with equality iff g ∼ v.

To prove this result, we need two simple identities, which can be easily provenby writing out the expression.

Lemma (Pythagoras identity).

‖f + g‖2 = ‖f‖2 + ‖g‖2 + 2〈f, g〉.

Lemma (Parallelogram law).

‖f + g‖2 + ‖f − g‖2 = 2(‖f‖2 + ‖g‖2).

To prove the existence of orthogonal decomposition, we need to use a slighttrick involving the parallelogram law.

Proof of orthogonal decomposition. Given f ∈ L2, we take a sequence (gn) in Vsuch that

‖f − gn‖2 → d(f, V ) = infg‖f − g‖2.

We now want to show that the infimum is attained. To do so, we show that gnis a Cauchy sequence, and by the completeness of L2, it will have a limit.

If we apply the parallelogram law with u = f − gn and v = f − gm, then weknow

‖u+ v‖22 + ‖u− v‖22 = 2(‖u‖22 + ‖v‖22).

Using our particular choice of u and v, we obtain∥∥∥∥2

(f − gn + gm

2

)∥∥∥∥2

2

+ ‖gn − gm‖22 = 2(‖f − gn‖22 + ‖f − gm‖22).

So we have

‖gn − gm‖22 = 2(‖f − gn‖22 + ‖f − gm‖22)− 4

∥∥∥∥f − gn + gm2

∥∥∥∥2

2

.


The first two terms on the right hand side tend to d(f, V)², and the last term is bounded below in magnitude by 4d(f, V)², since (gn + gm)/2 ∈ V. So as n, m → ∞, we must have ‖gn − gm‖2 → 0. By completeness of L2, there exists a g ∈ L2 such that gn → g.

Now since V is assumed to be closed, we can find a v ∈ V such that g = v a.e. Then we know
$$\|f - v\|_2 = \lim_{n\to\infty} \|f - g_n\|_2 = d(f, V).$$
So v attains the infimum. To show that this gives us an orthogonal decomposition, we want to show that
$$u = f - v \in V^\perp.$$
Suppose h ∈ V. We need to show that ⟨u, h⟩ = 0. We need to do another funny trick. Suppose t ∈ R. Then we have
$$d(f, V)^2 \le \|f - (v + th)\|_2^2 = \|f - v\|_2^2 + t^2\|h\|_2^2 - 2t\langle f - v, h\rangle.$$
We think of this as a quadratic in t, which is minimized when
$$t = \frac{\langle f - v, h\rangle}{\|h\|_2^2}.$$

But we know this quadratic is minimized when t = 0. So 〈f − v, h〉 = 0.

We are now going to look at the relationship between conditional expectation and orthogonal projection.

Definition (Conditional expectation). Suppose we have a probability space (Ω, F, P), and (Gn) is a collection of pairwise disjoint events with
$$\bigcup_n G_n = \Omega.$$
We let
$$\mathcal{G} = \sigma(G_n : n \in \mathbb{N}).$$
The conditional expectation of X given G is the random variable
$$Y = \sum_{n=1}^{\infty} E[X \mid G_n]\, 1_{G_n},$$
where
$$E[X \mid G_n] = \frac{E[X 1_{G_n}]}{P[G_n]} \quad \text{for } P[G_n] > 0.$$

In other words, given any x ∈ Ω, say x ∈ Gn, then Y(x) = E[X | Gn].

If X ∈ L2(P), then Y ∈ L2(P), and it is clear that Y is G-measurable. We claim that this is in fact the projection of X onto the subspace L2(G, P) of G-measurable L2 random variables in the ambient space L2(P).

Proposition. The conditional expectation of X given G is the projection of X onto the subspace L2(G, P) of G-measurable L2 random variables in the ambient space L2(P).

In some sense, this tells us Y is our best prediction of X given only the information encoded in G.


Proof. Let Y be the conditional expectation. It suffices to show that E[(X − W)²] is minimized for W = Y among G-measurable random variables. Suppose that W is a G-measurable random variable. Since
$$\mathcal{G} = \sigma(G_n : n \in \mathbb{N}),$$
it follows that
$$W = \sum_{n=1}^{\infty} a_n 1_{G_n},$$
where an ∈ R. Then

$$E[(X - W)^2] = E\left[\left(\sum_{n=1}^{\infty} (X - a_n) 1_{G_n}\right)^{\!2}\right] = E\left[\sum_n (X^2 + a_n^2 - 2a_n X)\, 1_{G_n}\right] = E\left[\sum_n (X^2 + a_n^2 - 2a_n E[X \mid G_n])\, 1_{G_n}\right].$$
We now optimize the quadratic
$$X^2 + a_n^2 - 2a_n E[X \mid G_n]$$
over an. We see that this is minimized for
$$a_n = E[X \mid G_n].$$
Note that this does not depend on what X is in the quadratic, since it is in the constant term.

Therefore we know that E[(X − W)²] is minimized for W = Y.
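As a small numerical sketch of this proposition (purely illustrative; the partition, distribution and sample size below are arbitrary choices, and NumPy is used for the computation): averaging X over each cell of a finite partition gives the G-measurable W with the smallest mean squared error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample X = U^2 where U ~ Uniform(0, 1); partition Omega = (0, 1) into G_n = [n/4, (n+1)/4).
U = rng.uniform(0.0, 1.0, size=200_000)
X = U ** 2
labels = np.minimum((U * 4).astype(int), 3)           # which cell of the partition U falls in

# Conditional expectation Y = sum_n E[X | G_n] 1_{G_n}: average X over each cell.
cond_means = np.array([X[labels == n].mean() for n in range(4)])
Y = cond_means[labels]

# Compare E[(X - Y)^2] with E[(X - W)^2] for another G-measurable W (perturbed cell values).
W = (cond_means + rng.normal(0.0, 0.1, size=4))[labels]
print("E[(X - Y)^2] =", np.mean((X - Y) ** 2))        # the smallest among G-measurable W
print("E[(X - W)^2] =", np.mean((X - W) ** 2))        # strictly larger
```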

We can also rephrase variance and covariance in terms of the L2 spaces. Suppose X, Y ∈ L2(P) with

mX = E[X], mY = E[Y].

Then variance and covariance just correspond to the L2 inner product and norm. In fact, we have
$$\operatorname{var}(X) = E[(X - m_X)^2] = \|X - m_X\|_2^2, \qquad \operatorname{cov}(X, Y) = E[(X - m_X)(Y - m_Y)] = \langle X - m_X,\, Y - m_Y\rangle.$$
More generally, the covariance matrix of a random vector X = (X1, · · · , Xn) is given by
$$\operatorname{var}(X) = (\operatorname{cov}(X_i, X_j))_{ij}.$$
On the example sheet, we will see that the covariance matrix is a non-negative definite matrix.
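A quick numerical sketch of this (illustrative only; the random vector and sample size are arbitrary choices): the empirical covariance matrix is exactly the matrix of L2 inner products of the centred coordinates, and its eigenvalues are non-negative.

```python
import numpy as np

rng = np.random.default_rng(1)

# A random vector X = (X1, X2, X3) with correlated coordinates.
Z = rng.normal(size=(100_000, 3))
A = np.array([[1.0, 0.0, 0.0],
              [0.5, 1.0, 0.0],
              [0.2, 0.3, 1.0]])
X = Z @ A.T

Xc = X - X.mean(axis=0)                     # centre: X - m_X
cov_inner = (Xc.T @ Xc) / len(X)            # <X_i - m_i, X_j - m_j> as empirical L^2 inner products
print(np.allclose(cov_inner, np.cov(X.T, bias=True)))   # matches the usual covariance matrix
print(np.linalg.eigvalsh(cov_inner).min() >= 0)          # all eigenvalues >= 0: non-negative definite
```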


4.4 Convergence in L1(P) and uniform integrability

What we are looking at here is the following question: suppose (Xn), X are random variables and Xn → X in probability. Under what extra assumptions is it true that Xn also converges to X in L1, i.e. E[|Xn − X|] → 0 as n → ∞?

This is not always true.

Example. Take (Ω, F, P) = ((0, 1), B((0, 1)), Lebesgue), and let
$$X_n = n\, 1_{(0, 1/n)}.$$
Then Xn → 0 in probability, and in fact Xn → 0 almost surely. However,
$$E[|X_n - 0|] = E[X_n] = n \cdot \frac{1}{n} = 1,$$
which does not converge to 0.
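A tiny numerical sketch of this example (purely illustrative; the sample size is an arbitrary choice): sampling ω uniformly from (0, 1), the probability that Xn(ω) is nonzero shrinks like 1/n, while E[Xn] stays equal to 1.

```python
import numpy as np

rng = np.random.default_rng(2)
omega = rng.uniform(0.0, 1.0, size=100_000)     # points of the sample space (0, 1)

for n in [1, 10, 100, 10_000]:
    Xn = np.where(omega < 1.0 / n, float(n), 0.0)   # X_n = n * 1_(0, 1/n)
    # P[X_n != 0] shrinks like 1/n, but E[X_n] stays at 1, so there is no L^1 convergence.
    print(n, "P[X_n != 0] ~", (Xn > 0).mean(), "  E[X_n] ~", Xn.mean())
```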

We see that the problem with this sequence is that there is a lot of "stuff" concentrated near 0, and indeed the functions can get unbounded near 0. We can easily curb this problem by requiring our functions to be bounded:

Theorem (Bounded convergence theorem). Suppose X, (Xn) are random variables. Assume that there exists a (non-random) constant C > 0 such that |Xn| ≤ C. If Xn → X in probability, then Xn → X in L1.

The proof is a rather standard manipulation.

Proof. We first show that |X| ≤ C a.e. Let ε > 0. We then have

P[|X| > C + ε] ≤ P[|X −Xn|+ |Xn| > C + ε]

≤ P[|X −Xn| > ε] + P[|Xn| > C]

We know the second term vanishes, while the first term → 0 as n → ∞. So we know
$$P[|X| > C + \varepsilon] = 0$$
for all ε. Since ε was arbitrary, we know |X| ≤ C a.s.

Now fix an ε > 0. Then
$$E[|X_n - X|] = E\big[|X_n - X|(1_{|X_n - X| \le \varepsilon} + 1_{|X_n - X| > \varepsilon})\big] \le \varepsilon + 2C\, P[|X_n - X| > \varepsilon].$$
Since Xn → X in probability, for n sufficiently large, the second term is ≤ ε. So E[|Xn − X|] ≤ 2ε, and we have convergence in L1.

But we can do better than that. We don't need the functions to be actually bounded. We just need that the functions aren't concentrated in arbitrarily small subsets of Ω. Thus, we make the following definition:

Definition (Uniformly integrable). Let 𝒳 be a family of random variables. Define
$$I_{\mathcal{X}}(\delta) = \sup\{E[|X|\, 1_A] : X \in \mathcal{X},\ A \in \mathcal{F} \text{ with } P[A] < \delta\}.$$
Then we say 𝒳 is uniformly integrable if 𝒳 is L1-bounded (see below), and $I_{\mathcal{X}}(\delta) \to 0$ as δ → 0.


Definition (Lp-bounded). Let 𝒳 be a family of random variables. Then we say 𝒳 is Lp-bounded if
$$\sup\{\|X\|_p : X \in \mathcal{X}\} < \infty.$$
In some sense, this is "uniform continuity for integration". It is immediate that

Proposition. Finite unions of uniformly integrable families are uniformly integrable.

How can we find uniformly integrable families? The following proposition gives us a large class of such families.

Proposition. Let 𝒳 be an Lp-bounded family for some p > 1. Then 𝒳 is uniformly integrable.

Proof. We let
$$C = \sup\{\|X\|_p : X \in \mathcal{X}\} < \infty.$$
Suppose that X ∈ 𝒳 and A ∈ F. We then have
$$E[|X|\, 1_A] \le E[|X|^p]^{1/p}\, P[A]^{1/q} \le C\, P[A]^{1/q}$$
by Hölder's inequality, where p, q are conjugate. This is now a uniform bound depending only on P[A]. So done.

This is the best we can get: L1-boundedness is not enough. Indeed, our earlier example
$$X_n = n\, 1_{(0, 1/n)}$$
is L1-bounded but not uniformly integrable.

For many practical purposes, it is convenient to rephrase the definition of uniform integrability as follows:

Lemma. Let 𝒳 be a family of random variables. Then 𝒳 is uniformly integrable if and only if
$$\sup\{E[|X|\, 1_{|X| > k}] : X \in \mathcal{X}\} \to 0$$
as k → ∞.

Proof.

(⇒) Suppose that 𝒳 is uniformly integrable. For any k and X ∈ 𝒳, by Chebyshev's inequality, we have
$$P[|X| \ge k] \le \frac{E[|X|]}{k}.$$
Given ε > 0, we pick δ such that E[|X| 1_A] < ε for all X ∈ 𝒳 and A ∈ F with P[A] < δ. Then pick k sufficiently large such that kδ > sup{E[|X|] : X ∈ 𝒳}. Then P[|X| ≥ k] < δ, and hence E[|X| 1_{|X| > k}] < ε for all X ∈ 𝒳.

(⇐) Suppose that the condition in the lemma holds. We first show that 𝒳 is L1-bounded. We have
$$E[|X|] = E[|X|(1_{|X| \le k} + 1_{|X| > k})] \le k + E[|X|\, 1_{|X| > k}] < \infty$$
by picking a large enough k.


Next note that for any measurable A and X ∈ 𝒳, we have
$$E[|X|\, 1_A] = E[|X|\, 1_A(1_{|X| > k} + 1_{|X| \le k})] \le E[|X|\, 1_{|X| > k}] + k P[A].$$
Thus, for any ε > 0, we can pick k sufficiently large such that the first term is < ε/2 for all X ∈ 𝒳 by assumption. Then when P[A] < ε/(2k), we have
$$E[|X|\, 1_A] \le \varepsilon.$$

As a corollary, we find that

Corollary. Let 𝒳 = {X}, where X ∈ L1(P). Then 𝒳 is uniformly integrable. Hence, a finite collection of L1 functions is uniformly integrable.

Proof. Note that
$$E[|X|] = \sum_{k=0}^{\infty} E[|X|\, 1_{|X| \in [k, k+1)}].$$
Since the sum is finite, we must have
$$E[|X|\, 1_{|X| \ge K}] = \sum_{k=K}^{\infty} E[|X|\, 1_{|X| \in [k, k+1)}] \to 0$$
as K → ∞.

With all that preparation, we now come to the main theorem on uniform integrability.

Theorem. Let X, (Xn) be random variables. Then the following are equivalent:

(i) Xn, X ∈ L1 for all n and Xn → X in L1.

(ii) Xn is uniformly integrable and Xn → X in probability.

The (i) ⇒ (ii) direction is just a standard manipulation. The idea of the (ii) ⇒ (i) direction is that we use uniform integrability to cut off Xn and X at some large value K, which gives us a small error, then apply bounded convergence.

Proof. We first assume that Xn, X are L1 and Xn → X in L1. We want to show that Xn is uniformly integrable and Xn → X in probability.

We first show that Xn → X in probability. This is just going to come from the Chebyshev inequality. Fix ε > 0. Then we have
$$P[|X - X_n| > \varepsilon] \le \frac{E[|X - X_n|]}{\varepsilon} \to 0$$
as n → ∞.

Next we show that (Xn) is uniformly integrable. Fix ε > 0. Take N such that n ≥ N implies E[|X − Xn|] ≤ ε/2. Since finite families of L1 random variables are uniformly integrable, we can pick δ > 0 such that A ∈ F and P[A] < δ implies
$$E[|X|\, 1_A],\ E[|X_n|\, 1_A] \le \frac{\varepsilon}{2}$$
for n = 1, · · · , N.


Now when n > N and A ∈ F with P[A] ≤ δ, we have
$$E[|X_n|\, 1_A] \le E[|X - X_n|\, 1_A] + E[|X|\, 1_A] \le E[|X - X_n|] + \frac{\varepsilon}{2} \le \frac{\varepsilon}{2} + \frac{\varepsilon}{2} = \varepsilon.$$

So Xn is uniformly integrable.

Assume now that (Xn) is uniformly integrable and Xn → X in probability. The first step is to show that X ∈ L1. We want to use Fatou's lemma, but to do so, we want almost sure convergence, not just convergence in probability. Recall that we have previously shown that there is a subsequence $(X_{n_k})$ of (Xn) such that $X_{n_k} \to X$ a.s. Then we have
$$E[|X|] = E\Big[\liminf_{k\to\infty} |X_{n_k}|\Big] \le \liminf_{k\to\infty} E[|X_{n_k}|] < \infty,$$
since uniformly integrable families are L1-bounded. So E[|X|] < ∞, hence X ∈ L1.

Next we want to show that Xn → X in L1. Take ε > 0. Then there exists K ∈ (0, ∞) such that
$$E[|X|\, 1_{|X| > K}],\ E[|X_n|\, 1_{|X_n| > K}] \le \frac{\varepsilon}{3}.$$
To set things up so that we can use the bounded convergence theorem, we have to invent new random variables
$$X_n^K = (X_n \vee (-K)) \wedge K, \qquad X^K = (X \vee (-K)) \wedge K.$$
Since Xn → X in probability, it follows that $X_n^K \to X^K$ in probability.

Now bounded convergence tells us that there is some N such that n ≥ N implies
$$E[|X_n^K - X^K|] \le \frac{\varepsilon}{3}.$$
Combining, we have for n ≥ N that
$$E[|X_n - X|] \le E[|X_n^K - X^K|] + E[|X|\, 1_{|X| \ge K}] + E[|X_n|\, 1_{|X_n| \ge K}] \le \varepsilon.$$
So we know that Xn → X in L1.

The main application is when (Xn) is a type of stochastic process known as a martingale. This will be done in III Advanced Probability and III Stochastic Calculus.


5 Fourier transform

We now turn to the exciting topic of the Fourier transform. There are two main questions we want to ask: when does the Fourier transform exist, and when can we recover a function from its Fourier transform?

Of course, not only do we want to know if the Fourier transform exists. We also want to know if it lies in some nice space, e.g. L2.

It turns out that when we want to prove things about Fourier transforms, it is often helpful to "smoothen" the function by doing what is known as a Gaussian convolution. So after defining the Fourier transform and proving some really basic properties, we are going to investigate convolutions and Gaussians for a bit (convolutions are also useful on their own, since they correspond to sums of independent random variables). After that, we can go and prove the actual important properties of the Fourier transform.

5.1 The Fourier transform

When talking about Fourier transforms, we will mostly want to talk about functions Rd → C. So from now on, we will write Lp for the complex-valued Borel functions on Rd with
$$\|f\|_p = \left(\int_{\mathbb{R}^d} |f|^p\right)^{1/p} < \infty.$$
The integrals of complex-valued functions are defined on the real and imaginary parts separately, and satisfy the properties we would expect them to. The details are on the first example sheet.

Definition (Fourier transform). The Fourier transform $\hat{f}: \mathbb{R}^d \to \mathbb{C}$ of f ∈ L1(Rd) is given by
$$\hat{f}(u) = \int_{\mathbb{R}^d} f(x)\, e^{i(u, x)}\, dx,$$
where u ∈ Rd and (u, x) denotes the inner product, i.e.
$$(u, x) = u_1 x_1 + \cdots + u_d x_d.$$
Why do we care about Fourier transforms? Many computations are easier with f̂ in place of f, especially computations that involve differentiation and convolutions (which are relevant to sums of independent random variables). In particular, we will use it to prove the central limit theorem.

More generally, we can define the Fourier transform of a measure:

Definition (Fourier transform of measure). The Fourier transform of a finite measure µ on Rd is the function µ̂ : Rd → C given by
$$\hat{\mu}(u) = \int_{\mathbb{R}^d} e^{i(u, x)}\, \mu(dx).$$

In the context of probability, we give these things a different name:

Definition (Characteristic function). Let X be a random variable. Then the characteristic function of X is the Fourier transform of its law, i.e.
$$\varphi_X(u) = E[e^{i(u, X)}] = \hat{\mu}_X(u),$$
where µX is the law of X.
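As a quick numerical sketch (illustrative only; the distribution, sample size and values of u are arbitrary choices), one can estimate a characteristic function by Monte Carlo and compare with a known closed form. For X ∼ Uniform(0, 1) the characteristic function is φX(u) = (e^{iu} − 1)/(iu).

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.uniform(0.0, 1.0, size=500_000)        # X ~ Uniform(0, 1)

for u in [0.5, 1.0, 3.0]:
    phi_mc = np.mean(np.exp(1j * u * X))        # Monte Carlo estimate of E[e^{iuX}]
    phi_exact = (np.exp(1j * u) - 1) / (1j * u)
    print(u, abs(phi_mc - phi_exact))           # small: the estimate matches (e^{iu} - 1)/(iu)
```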


We now make the following (trivial) observations:

Proposition. $\|\hat{f}\|_\infty \le \|f\|_1$ and $\|\hat{\mu}\|_\infty \le \mu(\mathbb{R}^d)$.

Less trivially, we have the following result:

Proposition. The functions f̂, µ̂ are continuous.

Proof. If un → u, then
$$f(x)\, e^{i(u_n, x)} \to f(x)\, e^{i(u, x)}.$$
Also, we know that
$$|f(x)\, e^{i(u_n, x)}| = |f(x)|.$$
So we can apply the dominated convergence theorem with |f| as the bound.

5.2 Convolutions

To actually do something useful with Fourier transforms, we need to talk about convolutions.

Definition (Convolution of random variables). Let µ, ν be probability measures. Their convolution µ ∗ ν is the law of X + Y, where X has law µ and Y has law ν, and X, Y are independent. Explicitly, we have
$$\mu * \nu(A) = P[X + Y \in A] = \iint 1_A(x + y)\, \mu(dx)\, \nu(dy).$$

Let's suppose that µ has a density function f with respect to the Lebesgue measure. Then we have
$$\mu * \nu(A) = \iint 1_A(x + y)\, f(x)\, dx\, \nu(dy) = \iint 1_A(x)\, f(x - y)\, dx\, \nu(dy) = \int 1_A(x) \left(\int f(x - y)\, \nu(dy)\right) dx.$$
So we know that µ ∗ ν has density
$$\int f(x - y)\, \nu(dy).$$
This thing has a name.

Definition (Convolution of function with measure). Let f ∈ Lp and ν a probability measure. Then the convolution of f with ν is
$$f * \nu(x) = \int f(x - y)\, \nu(dy) \in L^p.$$


Note that we do have to treat the two cases of convolutions separately, since a measure need not have a density, and a function need not specify a probability measure (it may not integrate to 1).

We check that it is indeed in Lp. Since ν is a probability measure, Jensen's inequality says we have
$$\|f * \nu\|_p^p = \int \left(\int |f(x - y)|\, \nu(dy)\right)^p dx \le \iint |f(x - y)|^p\, \nu(dy)\, dx = \iint |f(x - y)|^p\, dx\, \nu(dy) = \|f\|_p^p < \infty.$$

In fact, from this computation, we see that

Proposition. For any f ∈ Lp and ν a probability measure, we have

‖f ∗ ν‖p ≤ ‖f‖p.

The interesting thing happens when we try to take the Fourier transform of a convolution.

Proposition.
$$\widehat{f * \nu}(u) = \hat{f}(u)\, \hat{\nu}(u).$$

Proof. We have
$$\begin{aligned}
\widehat{f * \nu}(u) &= \int \left(\int f(x - y)\, \nu(dy)\right) e^{i(u, x)}\, dx \\
&= \iint f(x - y)\, e^{i(u, x)}\, dx\, \nu(dy) \\
&= \int \left(\int f(x - y)\, e^{i(u, x - y)}\, d(x - y)\right) e^{i(u, y)}\, \nu(dy) \\
&= \int \left(\int f(x)\, e^{i(u, x)}\, dx\right) e^{i(u, y)}\, \nu(dy) \\
&= \int \hat{f}(u)\, e^{i(u, y)}\, \nu(dy) \\
&= \hat{f}(u) \int e^{i(u, y)}\, \nu(dy) \\
&= \hat{f}(u)\, \hat{\nu}(u).
\end{aligned}$$

In the context of random variables, we have a similar result:

Proposition. Let µ, ν be probability measures, and X, Y be independent random variables with laws µ, ν respectively. Then
$$\widehat{\mu * \nu}(u) = \hat{\mu}(u)\, \hat{\nu}(u).$$
Proof. We have
$$\widehat{\mu * \nu}(u) = E[e^{i(u, X + Y)}] = E[e^{i(u, X)}]\, E[e^{i(u, Y)}] = \hat{\mu}(u)\, \hat{\nu}(u).$$
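A small numerical sketch of this factorisation (illustrative only; the two distributions and the values of u are arbitrary choices): for independent X and Y, a Monte Carlo estimate of E[e^{iu(X+Y)}] agrees with the product of the estimates for X and Y.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400_000
X = rng.exponential(1.0, size=n)            # X ~ Exp(1)
Y = rng.uniform(-1.0, 1.0, size=n)          # Y ~ Uniform(-1, 1), independent of X

for u in [0.7, 2.0]:
    lhs = np.mean(np.exp(1j * u * (X + Y)))                          # char. function of X + Y
    rhs = np.mean(np.exp(1j * u * X)) * np.mean(np.exp(1j * u * Y))  # product of the two
    print(u, abs(lhs - rhs))                # small up to Monte Carlo error
```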


5.3 Fourier inversion formula

We now want to work towards proving the Fourier inversion formula:

Theorem (Fourier inversion formula). Let $f, \hat{f} \in L^1$. Then
$$f(x) = \frac{1}{(2\pi)^d} \int \hat{f}(u)\, e^{-i(u, x)}\, du \quad \text{a.e.}$$

Our strategy is as follows:

(i) Show that the Fourier inversion formula holds for a Gaussian distribution by direct computation.

(ii) Show that the formula holds for Gaussian convolutions, i.e. the convolution of an arbitrary function with a Gaussian.

(iii) We show that any function can be approximated by a Gaussian convolution.

Note that the last part makes a lot of sense. If X is a random variable, then convolving with a Gaussian just corresponds to adding $\sqrt{t} Z$ to X, and if we take t → 0, we recover the original function. What we have to do is to show that this behaves sufficiently well with the Fourier transform and the Fourier inversion formula that we will actually get the result we want.

Gaussian densities

Before we start, we had better define the Gaussian distribution.

Definition (Gaussian density). The Gaussian density with variance t is
$$g_t(x) = \left(\frac{1}{2\pi t}\right)^{d/2} e^{-|x|^2/(2t)}.$$
This is equivalently the density of $\sqrt{t} Z$, where Z = (Z1, · · · , Zd) with Zi ∼ N(0, 1) independent.

We now want to compute the Fourier transform directly and show that the Fourier inversion formula works for this.

We start off by working in the case d = 1 and Z ∼ N(0, 1). We want to compute the Fourier transform of the law of this guy, i.e. its characteristic function. We will use a nice trick.

Proposition. Let Z ∼ N(0, 1). Then
$$\varphi_Z(u) = e^{-u^2/2}.$$
We see that this is in fact a Gaussian, up to a normalizing factor of $\sqrt{2\pi}$.

Proof. We have
$$\varphi_Z(u) = E[e^{iuZ}] = \frac{1}{\sqrt{2\pi}} \int e^{iux}\, e^{-x^2/2}\, dx.$$


We now notice that the function is bounded, so we can differentiate under the integral sign, and obtain
$$\varphi_Z'(u) = E[iZ e^{iuZ}] = \frac{1}{\sqrt{2\pi}} \int ix\, e^{iux}\, e^{-x^2/2}\, dx = -u\, \varphi_Z(u),$$
where the last equality is obtained by integrating by parts. So we know that φZ(u) solves
$$\varphi_Z'(u) = -u\, \varphi_Z(u).$$
This is easy to solve, since we can just integrate it. We find that
$$\log \varphi_Z(u) = -\frac{1}{2} u^2 + C.$$
So we have
$$\varphi_Z(u) = A e^{-u^2/2}.$$
We know that A = 1, since φZ(0) = 1. So we have
$$\varphi_Z(u) = e^{-u^2/2}.$$
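A short numerical sketch confirming this computation (illustrative only; the sample size and the values of u are arbitrary choices): a Monte Carlo average of e^{iuZ} over standard normal samples reproduces e^{-u²/2}.

```python
import numpy as np

rng = np.random.default_rng(5)
Z = rng.normal(0.0, 1.0, size=1_000_000)

for u in [0.0, 1.0, 2.5]:
    phi_mc = np.mean(np.exp(1j * u * Z))               # Monte Carlo estimate of E[e^{iuZ}]
    print(u, phi_mc.real, np.exp(-u * u / 2))          # real part ~ e^{-u^2/2}, imaginary part ~ 0
```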

We now do this problem in general.

Proposition. Let Z = (Z1, · · · , Zd) with Zj ∼ N(0, 1) independent. Then $\sqrt{t} Z$ has density
$$g_t(x) = \frac{1}{(2\pi t)^{d/2}} e^{-|x|^2/(2t)},$$
with
$$\hat{g}_t(u) = e^{-|u|^2 t/2}.$$

Proof. We have
$$\hat{g}_t(u) = E[e^{i(u, \sqrt{t} Z)}] = \prod_{j=1}^{d} E[e^{i u_j \sqrt{t} Z_j}] = \prod_{j=1}^{d} \varphi_Z(\sqrt{t}\, u_j) = \prod_{j=1}^{d} e^{-t u_j^2/2} = e^{-|u|^2 t/2}.$$
Again, gt and ĝt are almost the same, apart from the factor of (2πt)^{−d/2} and the t being replaced by 1/t. We can thus write this as
$$\hat{g}_t(u) = (2\pi)^{d/2}\, t^{-d/2}\, g_{1/t}(u).$$


So this tells us that
$$\hat{\hat{g}}_t(u) = (2\pi)^d\, g_t(u).$$
This is not exactly the same as saying the Fourier inversion formula works, because in the Fourier inversion formula, we integrated against $e^{-i(u,x)}$, not $e^{i(u,x)}$. However, we know that by the symmetry of the Gaussian distribution, we have
$$g_t(x) = g_t(-x) = (2\pi)^{-d}\, \hat{\hat{g}}_t(-x) = \left(\frac{1}{2\pi}\right)^{d} \int \hat{g}_t(u)\, e^{-i(u, x)}\, du.$$

So we conclude that

Lemma. The Fourier inversion formula holds for the Gaussian density function.

Gaussian convolutions

Definition (Gaussian convolution). Let f ∈ L1. Then a Gaussian convolution of f is a function of the form f ∗ gt.

We are now going to do a little computation that shows that functions of this type also satisfy the Fourier inversion formula.

Before we start, we make some observations about the Gaussian convolution. By the general theory of convolutions, we know that we have

Proposition. $\|f * g_t\|_1 \le \|f\|_1$.

We also have a pointwise bound:
$$|f * g_t(x)| = \left|\int f(x - y)\, e^{-|y|^2/(2t)} \left(\frac{1}{2\pi t}\right)^{d/2} dy\right| \le (2\pi t)^{-d/2} \int |f(x - y)|\, dy \le (2\pi t)^{-d/2}\, \|f\|_1.$$

This tells us that in fact

Proposition. $\|f * g_t\|_\infty \le (2\pi t)^{-d/2}\, \|f\|_1$.

So in fact the convolution is pointwise bounded. We see that the bound gets worse as t → 0, and we will see that this is because as t → 0, the convolution f ∗ gt becomes a better and better approximation of f, and we did not assume that f is bounded.

Similarly, we can compute that

Proposition.
$$\|\widehat{f * g_t}\|_1 = \|\hat{f}\, \hat{g}_t\|_1 \le (2\pi)^{d/2}\, t^{-d/2}\, \|f\|_1,$$
and
$$\|f * g_t\|_\infty \le \|f\|_\infty.$$


Now given these bounds, it makes sense to write down the Fourier inversion formula for a Gaussian convolution.

Lemma. The Fourier inversion formula holds for Gaussian convolutions.

We are going to reduce this to the fact that the Gaussian distribution itself satisfies Fourier inversion.

Proof. We have
$$\begin{aligned}
f * g_t(x) &= \int f(x - y)\, g_t(y)\, dy \\
&= \int f(x - y) \left(\frac{1}{(2\pi)^d} \int \hat{g}_t(u)\, e^{-i(u, y)}\, du\right) dy \\
&= \left(\frac{1}{2\pi}\right)^{d} \iint f(x - y)\, \hat{g}_t(u)\, e^{-i(u, y)}\, du\, dy \\
&= \left(\frac{1}{2\pi}\right)^{d} \int \left(\int f(x - y)\, e^{i(u, x - y)}\, dy\right) \hat{g}_t(u)\, e^{-i(u, x)}\, du \\
&= \left(\frac{1}{2\pi}\right)^{d} \int \hat{f}(u)\, \hat{g}_t(u)\, e^{-i(u, x)}\, du \\
&= \left(\frac{1}{2\pi}\right)^{d} \int \widehat{f * g_t}(u)\, e^{-i(u, x)}\, du.
\end{aligned}$$
So done.

The proof

Finally, we are going to extend the Fourier inversion formula from Gaussian convolutions to general f, f̂ ∈ L1.

Theorem (Fourier inversion formula). Let f ∈ L1 and
$$f_t(x) = (2\pi)^{-d} \int \hat{f}(u)\, e^{-|u|^2 t/2}\, e^{-i(u, x)}\, du = (2\pi)^{-d} \int \widehat{f * g_t}(u)\, e^{-i(u, x)}\, du.$$
Then $\|f_t - f\|_1 \to 0$ as t → 0, and the Fourier inversion formula holds whenever $f, \hat{f} \in L^1$.

To prove this, we first need to show that the Gaussian convolution is indeed a good approximation of f:

Lemma. Suppose that f ∈ Lp with p ∈ [1, ∞). Then ‖f ∗ gt − f‖p → 0 as t → 0.

Note that this cannot hold for p = ∞. Indeed, if p = ∞, then the ∞-norm is the uniform norm. But we know that f ∗ gt is always continuous, and the uniform limit of continuous functions is continuous. So the formula cannot hold if f is not already continuous.

Proof. We fix ε > 0. By a question on the example sheet, we can find h which is continuous and of compact support such that $\|f - h\|_p \le \frac{\varepsilon}{3}$. So we have
$$\|f * g_t - h * g_t\|_p = \|(f - h) * g_t\|_p \le \|f - h\|_p \le \frac{\varepsilon}{3}.$$


So it suffices for us to work with a continuous function h with compact support. We let
$$e(y) = \int |h(x - y) - h(x)|^p\, dx.$$
We first show that e is a bounded function:
$$e(y) \le \int 2^p\big(|h(x - y)|^p + |h(x)|^p\big)\, dx = 2^{p+1}\, \|h\|_p^p.$$

Also, since h is continuous and bounded, the dominated convergence theorem tells us that e(y) → 0 as y → 0.

Moreover, using the fact that $\int g_t(y)\, dy = 1$, we have
$$\|h * g_t - h\|_p^p = \int \left|\int (h(x - y) - h(x))\, g_t(y)\, dy\right|^p dx.$$

Since $g_t(y)\, dy$ is a probability measure, by Jensen's inequality, we can bound this by
$$\iint |h(x - y) - h(x)|^p\, g_t(y)\, dy\, dx = \int \left(\int |h(x - y) - h(x)|^p\, dx\right) g_t(y)\, dy = \int e(y)\, g_t(y)\, dy = \int e(\sqrt{t}\, y)\, g_1(y)\, dy,$$
where we used the definition of gt and a substitution. We know that this tends to 0 as t → 0 by the bounded convergence theorem, since we know that e is bounded.

Finally, we have
$$\|f * g_t - f\|_p \le \|f * g_t - h * g_t\|_p + \|h * g_t - h\|_p + \|h - f\|_p \le \frac{\varepsilon}{3} + \frac{\varepsilon}{3} + \|h * g_t - h\|_p = \frac{2\varepsilon}{3} + \|h * g_t - h\|_p.$$
Since we know that ‖h ∗ gt − h‖p → 0 as t → 0, we know that for all sufficiently small t, the whole expression is bounded above by ε. So we are done.
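Here is a small numerical sketch of this approximation (illustrative only; the test function, grid and values of t are arbitrary choices): for f = 1_{[0,1]}, the discretised L¹ distance ‖f ∗ g_t − f‖₁ shrinks as t → 0.

```python
import numpy as np

# Grid on which we represent functions on R.
dx = 0.001
x = np.arange(-4000, 5001) * dx                  # grid from -4 to 5
f = ((x >= 0.0) & (x <= 1.0)).astype(float)      # f = indicator of [0, 1]

def gaussian_convolve(f, t):
    """Approximate (f * g_t)(x) on the grid by a Riemann sum."""
    y = np.arange(-3000, 3001) * dx              # symmetric kernel grid, shorter than x
    g = np.exp(-y**2 / (2.0 * t)) / np.sqrt(2.0 * np.pi * t)
    return np.convolve(f, g, mode="same") * dx   # aligned with the grid of f

for t in [1.0, 0.1, 0.01, 0.001]:
    err = np.sum(np.abs(gaussian_convolve(f, t) - f)) * dx   # approximate L^1 distance
    print(t, round(float(err), 4))                            # decreases towards 0 as t -> 0
```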

With this lemma, we can now prove the Fourier inversion theorem.

Proof of Fourier inversion theorem. The first part is just a special case of the previous lemma. Indeed, recall that
$$\widehat{f * g_t}(u) = \hat{f}(u)\, e^{-|u|^2 t/2}.$$
Since Gaussian convolutions satisfy the Fourier inversion formula, we know that
$$f_t = f * g_t.$$


So the previous lemma says exactly that ‖ft − f‖1 → 0.

Suppose now that $\hat{f} \in L^1$ as well. Then looking at the integrand of
$$f_t(x) = (2\pi)^{-d} \int \hat{f}(u)\, e^{-|u|^2 t/2}\, e^{-i(u, x)}\, du,$$
we know that
$$\big|\hat{f}(u)\, e^{-|u|^2 t/2}\, e^{-i(u, x)}\big| \le |\hat{f}(u)|.$$
Then by the dominated convergence theorem with dominating function $|\hat{f}|$, we know that
$$f_t(x) \to (2\pi)^{-d} \int \hat{f}(u)\, e^{-i(u, x)}\, du \quad \text{as } t \to 0.$$
By the first part, we know that ‖ft − f‖1 → 0 as t → 0. So we can find a sequence (tn) with tn > 0, tn → 0 so that $f_{t_n} \to f$ a.e. Combining these, we know that
$$f(x) = (2\pi)^{-d} \int \hat{f}(u)\, e^{-i(u, x)}\, du \quad \text{a.e.}$$
So done.

5.4 Fourier transform in L2

It turns out wonderful things happen when we take the Fourier transform of an L2 function.

Theorem (Plancherel identity). For any function f ∈ L1 ∩ L2, the Plancherel identity holds:
$$\|\hat{f}\|_2 = (2\pi)^{d/2}\, \|f\|_2.$$
As we are going to see in a moment, this is just going to follow from the Fourier inversion formula plus a clever trick.

Proof. We first work with the special case where $f, \hat{f} \in L^1$, since the Fourier inversion formula holds for f. We then have
$$\begin{aligned}
\|f\|_2^2 &= \int f(x)\, \overline{f(x)}\, dx \\
&= \frac{1}{(2\pi)^d} \int \left(\int \hat{f}(u)\, e^{-i(u, x)}\, du\right) \overline{f(x)}\, dx \\
&= \frac{1}{(2\pi)^d} \int \hat{f}(u) \left(\int \overline{f(x)}\, e^{-i(u, x)}\, dx\right) du \\
&= \frac{1}{(2\pi)^d} \int \hat{f}(u)\, \overline{\left(\int f(x)\, e^{i(u, x)}\, dx\right)}\, du \\
&= \frac{1}{(2\pi)^d} \int \hat{f}(u)\, \overline{\hat{f}(u)}\, du \\
&= \frac{1}{(2\pi)^d}\, \|\hat{f}\|_2^2.
\end{aligned}$$
So the Plancherel identity holds for f.


To prove it for the general case, we use this result and an approximation argument. Suppose that f ∈ L1 ∩ L2, and let ft = f ∗ gt. Then by our earlier lemma, we know that
$$\|f_t\|_2 \to \|f\|_2 \quad \text{as } t \to 0.$$
Now note that
$$\hat{f}_t(u) = \hat{f}(u)\, \hat{g}_t(u) = \hat{f}(u)\, e^{-|u|^2 t/2}.$$
The important thing is that $e^{-|u|^2 t/2} \nearrow 1$ as $t \to 0$. Therefore, we know
$$\|\hat{f}_t\|_2^2 = \int |\hat{f}(u)|^2\, e^{-|u|^2 t}\, du \to \int |\hat{f}(u)|^2\, du = \|\hat{f}\|_2^2$$
as t → 0, by monotone convergence. Since $f_t, \hat{f}_t \in L^1$, we know that the Plancherel identity holds, i.e.
$$\|\hat{f}_t\|_2 = (2\pi)^{d/2}\, \|f_t\|_2.$$

Taking the limit as t→ 0, the result follows.
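A quick numerical sketch of the Plancherel identity in d = 1 (illustrative only; the test function and grids are arbitrary choices), using f(x) = e^{−x²/2}, whose Fourier transform with this convention is f̂(u) = √(2π) e^{−u²/2}, computed here by direct numerical integration:

```python
import numpy as np

dx = 0.005
x = np.arange(-2000, 2001) * dx                # grid for x, from -10 to 10
u = np.arange(-2000, 2001) * dx                # grid for u
f = np.exp(-x**2 / 2.0)

# \hat f(u) = \int f(x) e^{iux} dx, approximated by a Riemann sum for each u.
f_hat = np.array([np.sum(f * np.exp(1j * ui * x)) * dx for ui in u])

lhs = np.sum(np.abs(f_hat)**2) * dx            # ||f_hat||_2^2
rhs = 2.0 * np.pi * np.sum(np.abs(f)**2) * dx  # (2 pi)^d ||f||_2^2 with d = 1
print(lhs, rhs)                                # the two agree up to discretisation error
```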

What is this good for? It turns out that the Fourier transform gives us a bijection from L2 to itself. While it is not true that the Fourier inversion formula holds for everything in L2, it holds for enough of them that we can just approximate everything else by the ones that are nice. Then the above tells us that in fact this bijection is a norm-preserving automorphism.

Theorem. There exists a unique Hilbert space automorphism F : L2 → L2 such that
$$F([f]) = [(2\pi)^{-d/2}\, \hat{f}]$$
whenever f ∈ L1 ∩ L2.

Here [f] denotes the equivalence class of f in L2, and we say F : L2 → L2 is a Hilbert space automorphism if it is a linear bijection that preserves the inner product.

Note that in general, there is no guarantee that F sends a function to its Fourier transform. We only know this when f is a well-behaved function (i.e. in L1 ∩ L2). However, the formal property of it being a bijection from L2 to itself will be convenient for many things.

Proof. We define F0 : L1 ∩ L2 → L2 by
$$F_0([f]) = [(2\pi)^{-d/2}\, \hat{f}].$$
By the Plancherel identity, we know F0 preserves the L2 norm, i.e.
$$\|F_0([f])\|_2 = \|[f]\|_2.$$
Also, we know that L1 ∩ L2 is dense in L2, since even the continuous functions with compact support are dense. So we know F0 extends uniquely to an isometry F : L2 → L2.

Since it preserves distances, it is in particular injective. So it remains to show that the map is surjective. By Fourier inversion, the subspace
$$V = \{[f] \in L^2 : f, \hat{f} \in L^1\}$$


is sent to itself by the map F. Also, if f ∈ V, then F⁴[f] = [f] (note that applying it twice does not suffice, because we actually have F²[f](x) = [f](−x)). So V is contained in the image of F, and also V is dense in L2, again because it contains all Gaussian convolutions (we have $\widehat{f * g_t} = \hat{f}\, \hat{g}_t$, and $\hat{f}$ is bounded while $\hat{g}_t$ decays exponentially). So we know that F is surjective.

5.5 Properties of characteristic functions

We are now going to state a bunch of theorems about characteristic functions. Since the proofs are not examinable (but the statements are!), we are only going to provide rough proof sketches.

Theorem. The characteristic function φX of the distribution µX of a random variable X determines µX. In other words, if X and $\tilde{X}$ are random variables and $\varphi_X = \varphi_{\tilde{X}}$, then $\mu_X = \mu_{\tilde{X}}$.

Proof sketch. Use Fourier inversion to show that φX determines µX(g) = E[g(X)] for any bounded, continuous g.

Theorem. If φX is integrable, then µX has a bounded, continuous density function
$$f_X(x) = (2\pi)^{-d} \int \varphi_X(u)\, e^{-i(u, x)}\, du.$$
Proof sketch. Let Z ∼ N(0, 1) be independent of X. Then $X + \sqrt{t} Z$ has a bounded continuous density function which, by Fourier inversion, is
$$f_t(x) = (2\pi)^{-d} \int \varphi_X(u)\, e^{-|u|^2 t/2}\, e^{-i(u, x)}\, du.$$
We then send t → 0 and use the dominated convergence theorem with dominating function |φX|.

The next theorem relates to the notion of weak convergence.

Definition (Weak convergence of measures). Let µ, (µn) be Borel probability measures. We say that µn → µ weakly if and only if µn(g) → µ(g) for all bounded continuous g.

Similarly, we can define weak convergence of random variables.

Definition (Weak convergence of random variables). Let X, (Xn) be random variables. We say Xn → X weakly iff $\mu_{X_n} \to \mu_X$ weakly, iff E[g(Xn)] → E[g(X)] for all bounded continuous g.

This is related to the notion of convergence in distribution, which we defined a long time ago without talking about it much. It is an exercise on the example sheet that weak convergence of random variables in R is equivalent to convergence in distribution.

It turns out that weak convergence is very useful theoretically. One reason is that it is related to convergence of characteristic functions.

Theorem. Let X, (Xn) be random variables with values in Rd. If $\varphi_{X_n}(u) \to \varphi_X(u)$ for each u ∈ Rd, then $\mu_{X_n} \to \mu_X$ weakly.


The main application of this, which will appear later, is that it allows us to prove the central limit theorem.

Proof sketch. By the example sheet, it suffices to show that E[g(Xn)] → E[g(X)] for all compactly supported g ∈ C∞. We then use Fourier inversion and convergence of characteristic functions to check that
$$E[g(X_n + \sqrt{t} Z)] \to E[g(X + \sqrt{t} Z)]$$
for all t > 0, where Z ∼ N(0, 1) is independent of X, (Xn). Then we check that $E[g(X_n + \sqrt{t} Z)]$ is close to E[g(Xn)] for t > 0 small, and similarly for X.

5.6 Gaussian random variables

Recall that in the proof of the Fourier inversion theorem, we used these things called Gaussians, but didn't really say much about them. These will be useful later on when we want to prove the central limit theorem, because the central limit theorem says that in the long run, things look like Gaussians. So here we lay out some of the basic definitions and properties of Gaussians.

Definition (Gaussian random variable). Let X be a random variable on R. This is said to be Gaussian if there exist µ ∈ R and σ ∈ (0, ∞) such that the density of X is
$$f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right).$$
A constant random variable X = µ corresponds to σ = 0. We say this has mean µ and variance σ².

When this happens, we write X ∼ N(µ, σ2).

For completeness, we record some properties of Gaussian random variables.

Proposition. Let X ∼ N(µ, σ2). Then

E[X] = µ, var(X) = σ2.

Also, for any a, b ∈ R, we have

aX + b ∼ N(aµ+ b, a2σ2).

Lastly, we have

$$\varphi_X(u) = e^{i\mu u - u^2\sigma^2/2}.$$

Proof. All but the last of them follow from direct calculation, and can be found in IA Probability.

For the last part, if X ∼ N(µ, σ²), then we can write
$$X = \sigma Z + \mu,$$
where Z ∼ N(0, 1). Recall that we have previously found that the characteristic function of an N(0, 1) random variable is
$$\varphi_Z(u) = e^{-u^2/2}.$$


So we have
$$\varphi_X(u) = E[e^{iu(\sigma Z + \mu)}] = e^{iu\mu}\, E[e^{iu\sigma Z}] = e^{iu\mu}\, \varphi_Z(u\sigma) = e^{iu\mu - u^2\sigma^2/2}.$$
What we are next going to do is to talk about the corresponding facts for the Gaussian in higher dimensions. Before that, we need to come up with the definition of a higher-dimensional Gaussian distribution. This might be different from the one you've seen previously, because we want to allow some degeneracy in our random variable, e.g. some of the dimensions can be constant.

Definition (Gaussian random variable). Let X be a random variable. We say that X is Gaussian on Rn if (u, X) is Gaussian on R for all u ∈ Rn.

We are now going to prove a version of our previous result for higher-dimensional Gaussians.

Theorem. Let X be Gaussian on Rn, and let A be an m × n matrix and b ∈ Rm. Then

(i) AX + b is Gaussian on Rm.

(ii) X ∈ L2 and its law µX is determined by µ = E[X] and V = var(X), the covariance matrix.

(iii) We have
$$\varphi_X(u) = e^{i(u, \mu) - (u, Vu)/2}.$$

(iv) If V is invertible, then X has density
$$f_X(x) = (2\pi)^{-n/2} (\det V)^{-1/2} \exp\left(-\frac{1}{2}\big(x - \mu,\, V^{-1}(x - \mu)\big)\right).$$

(v) If X = (X1, X2) where Xi ∈ Rni, then cov(X1, X2) = 0 iff X1 and X2 are independent.

Proof.

(i) If u ∈ Rm, then we have
$$(AX + b, u) = (AX, u) + (b, u) = (X, A^T u) + (b, u).$$
Since $(X, A^T u)$ is Gaussian and (b, u) is constant, it follows that (AX + b, u) is Gaussian.

(ii) We know in particular that each component of X is a Gaussian random variable, which is in L2. So X ∈ L2. We will prove the second part of (ii) together with (iii).


(iii) If µ = E[X] and V = var(X), then for u ∈ Rn, we have
$$E[(u, X)] = (u, \mu), \qquad \operatorname{var}((u, X)) = (u, Vu).$$
So we know
$$(u, X) \sim N((u, \mu), (u, Vu)).$$
So it follows that
$$\varphi_X(u) = E[e^{i(u, X)}] = e^{i(u, \mu) - (u, Vu)/2}.$$
So µ and V determine the characteristic function of X, which in turn determines the law of X.

(iv) We start off with a boring Gaussian vector Y = (Y1, · · · , Yn), where the Yi ∼ N(0, 1) are independent. Then the density of Y is
$$f_Y(y) = (2\pi)^{-n/2}\, e^{-|y|^2/2}.$$
We are now going to construct X from Y. We define
$$\tilde{X} = V^{1/2} Y + \mu.$$
This makes sense because V is always non-negative definite. Then $\tilde{X}$ is Gaussian with $E[\tilde{X}] = \mu$ and $\operatorname{var}(\tilde{X}) = V$. Therefore $\tilde{X}$ has the same distribution as X. Since V is assumed to be invertible, we can compute the density of X using the change of variables formula.

(v) It is clear that if X1 and X2 are independent, then cov(X1, X2) = 0.

Conversely, let X = (X1, X2), where cov(X1, X2) = 0. Then we have
$$V = \operatorname{var}(X) = \begin{pmatrix} V_{11} & 0 \\ 0 & V_{22} \end{pmatrix}.$$
Then for u = (u1, u2), we have
$$(u, Vu) = (u_1, V_{11} u_1) + (u_2, V_{22} u_2),$$
where V11 = var(X1) and V22 = var(X2). Then we have
$$\varphi_X(u) = e^{i(u, \mu) - (u, Vu)/2} = e^{i(u_1, \mu_1) - (u_1, V_{11} u_1)/2}\, e^{i(u_2, \mu_2) - (u_2, V_{22} u_2)/2} = \varphi_{X_1}(u_1)\, \varphi_{X_2}(u_2).$$

So it follows that X1 and X2 are independent.
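The construction in (iv) translates directly into a sampling recipe. The following sketch (illustrative only; the particular µ and V below are arbitrary choices) draws X = V^{1/2} Y + µ with Y standard Gaussian and checks the empirical mean and covariance:

```python
import numpy as np

rng = np.random.default_rng(6)

mu = np.array([1.0, -2.0, 0.5])
V = np.array([[2.0, 0.6, 0.0],
              [0.6, 1.0, 0.3],
              [0.0, 0.3, 0.5]])           # symmetric, positive definite

# V^{1/2} via the spectral decomposition V = Q diag(lam) Q^T.
lam, Q = np.linalg.eigh(V)
V_half = Q @ np.diag(np.sqrt(lam)) @ Q.T

Y = rng.normal(size=(200_000, 3))         # iid N(0, 1) coordinates
X = Y @ V_half.T + mu                     # X = V^{1/2} Y + mu, row by row

print(X.mean(axis=0))                     # ~ mu
print(np.cov(X.T))                        # ~ V
```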


6 Ergodic theory

We are now going to study a new topic: ergodic theory. This is the study of the "long run behaviour" of a system under the evolution of some map Θ. Due to time constraints, we will not do much with it. We are going to prove two ergodic theorems that tell us what happens in the long run, and this will be useful when we prove our strong law of large numbers at the end of the course.

The general setting is that we have a measure space (E, E, µ) and a measurable map Θ : E → E that is measure preserving, i.e. µ(A) = µ(Θ⁻¹(A)) for all A ∈ E.

Example. Take (E, E, µ) = ([0, 1), B([0, 1)), Lebesgue). For each a ∈ [0, 1), we can define
$$\Theta_a(x) = x + a \mod 1.$$
By what we've done earlier in the course, we know this translation map preserves the Lebesgue measure on [0, 1).

Our goal is to try to understand the "long run averages" of the system when we apply Θ many times. One particular quantity we are going to look at is the following: let f be measurable. We define
$$S_n(f) = f + f \circ \Theta + \cdots + f \circ \Theta^{n-1}.$$
We want to know what the long run behaviour of $S_n(f)/n$ is as n → ∞.

The ergodic theorems are going to give us the answer in a certain special case. Finally, we will apply this in a particular case to get the strong law of large numbers.

Definition (Invariant subset). We say A ∈ E is invariant for Θ if A = Θ−1(A).

Definition (Invariant function). A measurable function f is invariant if f = f ∘ Θ.

Definition (EΘ). We write
$$\mathcal{E}_\Theta = \{A \in \mathcal{E} : A \text{ is invariant}\}.$$
It is easy to show that EΘ is a σ-algebra, and f : E → R is invariant iff it is EΘ-measurable.

Definition (Ergodic). We say Θ is ergodic if A ∈ EΘ implies µ(A) = 0 or µ(Aᶜ) = 0.

Example. For the translation map on [0, 1), Θa is ergodic iff a is irrational. The proof is left for example sheet 4.
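A numerical sketch of this (purely illustrative; the choices of a, f, n and the starting points are arbitrary): for irrational a the long-run averages Sn(f)/n appear to converge to ∫ f dµ regardless of the starting point, which is what the ergodic theorems below guarantee, while for rational a they depend on the starting point, since the orbit is finite.

```python
import numpy as np

def birkhoff_average(a, x0, f, n):
    """Compute S_n(f)(x0)/n for the translation map Theta_a(x) = x + a mod 1."""
    orbit = (x0 + a * np.arange(n)) % 1.0     # x0, Theta_a(x0), Theta_a^2(x0), ...
    return f(orbit).mean()

f = lambda x: (x < 0.3).astype(float)         # indicator of [0, 0.3); integral over [0, 1) is 0.3
n = 1_000_000

for a in [np.sqrt(2) % 1.0, 0.25]:            # irrational rotation vs rational rotation
    avgs = [birkhoff_average(a, x0, f, n) for x0 in (0.10, 0.27, 0.90)]
    print(round(a, 4), [round(v, 4) for v in avgs])
    # irrational a: all three averages ~ 0.3 = integral of f
    # rational a:   the averages depend on the starting point
```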

Proposition. If f is integrable and Θ is measure-preserving, then f ∘ Θ is integrable and
$$\int_E f \circ \Theta\, d\mu = \int_E f\, d\mu.$$
It turns out that if Θ is ergodic, then there aren't that many invariant functions.


Proposition. If Θ is ergodic and f is invariant, then there exists a constant c such that f = c a.e.

The proofs of these are left as an exercise on example sheet 4.

We are now going to spend a little bit of time studying a particular example, because this will be needed to prove the strong law of large numbers.

Example (Bernoulli shifts). Let m be a probability distribution on R. Then there exists an iid sequence Y1, Y2, · · · with law m. Recall we constructed this in a really funny way. Now we are going to build it in a more natural way.

We let E = R^N be the set of all real sequences (xn). We define the σ-algebra E to be the σ-algebra generated by the projections Xn(x) = xn. In other words, this is the smallest σ-algebra such that all these functions are measurable. Alternatively, this is the σ-algebra generated by the π-system
$$\mathcal{A} = \left\{\prod_{n \in \mathbb{N}} A_n : A_n \in \mathcal{B} \text{ for all } n \text{ and } A_n = \mathbb{R} \text{ eventually}\right\}.$$

Finally, to define the measure µ, we let

Y = (Y1, Y2, · · · ) : Ω→ E

where the Yi are the iid random variables defined earlier, and Ω is the sample space of the Yi.

Then Y is a measurable map because each of the Yi's is a random variable. We let µ = P ∘ Y⁻¹.

By the independence of the Yi's, we have that
$$\mu(A) = \prod_{n \in \mathbb{N}} m(A_n)$$
for any
$$A = A_1 \times A_2 \times \cdots \times A_n \times \mathbb{R} \times \cdots \times \mathbb{R}.$$
Note that the product is eventually 1, so it is really a finite product.

This (E, E, µ) is known as the canonical space associated with the sequence of iid random variables with law m.

Finally, we need to define Θ. We define Θ : E → E by
$$\Theta(x) = \Theta(x_1, x_2, x_3, \cdots) = (x_2, x_3, x_4, \cdots).$$

This is known as the shift map.

Why do we care about this? Later, we are going to look at the function

f(x) = f(x1, x2, · · · ) = x1.

Then we have

$$S_n(f) = f + f \circ \Theta + \cdots + f \circ \Theta^{n-1} = x_1 + \cdots + x_n.$$
So $S_n(f)/n$ will be the average of the first n terms. So ergodic theory will tell us about the long-run behaviour of the average.


Theorem. The shift map Θ is an ergodic, measure preserving transformation.

Proof. It is an exercise to show that Θ is measurable and measure preserving. To show that Θ is ergodic, recall the definition of the tail σ-algebra
$$\mathcal{T}_n = \sigma(X_m : m \ge n + 1), \qquad \mathcal{T} = \bigcap_n \mathcal{T}_n.$$
Suppose that $A = \prod_{n \in \mathbb{N}} A_n \in \mathcal{A}$. Then
$$\Theta^{-n}(A) = \{X_{n+k} \in A_k \text{ for all } k\} \in \mathcal{T}_n.$$
Since $\mathcal{T}_n$ is a σ-algebra, $\Theta^{-n}(A) \in \mathcal{T}_n$ for all $A \in \mathcal{A}$, and $\sigma(\mathcal{A}) = \mathcal{E}$, we know $\Theta^{-n}(A) \in \mathcal{T}_n$ for all $A \in \mathcal{E}$.

So if A ∈ EΘ, i.e. A = Θ⁻¹(A), then $A = \Theta^{-n}(A) \in \mathcal{T}_n$ for all n. So $A \in \mathcal{T}$. From the Kolmogorov 0-1 law, we know either µ[A] = 1 or µ[A] = 0. So done.

6.1 Ergodic theorems

The proofs in this section are non-examinable.

Instead of proving the ergodic theorems directly, we first start by proving the following magical lemma:

Lemma (Maximal ergodic lemma). Let f be integrable, and
$$S^* = \sup_{n \ge 0} S_n(f) \ge 0,$$
where S0(f) = 0 by convention. Then
$$\int_{\{S^* > 0\}} f\, d\mu \ge 0.$$

Proof. We let
$$S_n^* = \max_{0 \le m \le n} S_m$$
and
$$A_n = \{S_n^* > 0\}.$$
Now if 1 ≤ m ≤ n, then we know
$$S_m = f + S_{m-1} \circ \Theta \le f + S_n^* \circ \Theta.$$
Now on An, we have
$$S_n^* = \max_{1 \le m \le n} S_m,$$
since S0 = 0. So we have
$$S_n^* \le f + S_n^* \circ \Theta.$$
On $A_n^C$, we have
$$S_n^* = 0 \le S_n^* \circ \Theta.$$


So we know
$$\begin{aligned}
\int_E S_n^*\, d\mu &= \int_{A_n} S_n^*\, d\mu + \int_{A_n^C} S_n^*\, d\mu \\
&\le \int_{A_n} f\, d\mu + \int_{A_n} S_n^* \circ \Theta\, d\mu + \int_{A_n^C} S_n^* \circ \Theta\, d\mu \\
&= \int_{A_n} f\, d\mu + \int_E S_n^* \circ \Theta\, d\mu \\
&= \int_{A_n} f\, d\mu + \int_E S_n^*\, d\mu.
\end{aligned}$$
So we know
$$\int_{A_n} f\, d\mu \ge 0.$$
Taking the limit as n → ∞ gives the result by dominated convergence with dominating function |f|.

We are now going to prove the two ergodic theorems, which tell us the limiting behaviour of Sn(f).

Theorem (Birkhoff's ergodic theorem). Let (E, E, µ) be σ-finite and f be integrable. There exists an invariant function $\bar{f}$ such that
$$\mu(|\bar{f}|) \le \mu(|f|),$$
and
$$\frac{S_n(f)}{n} \to \bar{f} \quad \text{a.e.}$$
If Θ is ergodic, then $\bar{f}$ is a constant.

Note that the theorem only gives $\mu(|\bar{f}|) \le \mu(|f|)$. However, in many cases, we can use some integration theorems such as dominated convergence to argue that they must in fact be equal. In particular, in the ergodic case, this will allow us to find the value of $\bar{f}$.

Theorem (von Neumann's ergodic theorem). Let (E, E, µ) be a finite measure space. Let p ∈ [1, ∞) and assume that f ∈ Lp. Then there is some function $\bar{f} \in L^p$ such that
$$\frac{S_n(f)}{n} \to \bar{f} \quad \text{in } L^p.$$

Proof of Birkhoff's ergodic theorem. We first note that
$$\limsup_n \frac{S_n}{n}, \qquad \liminf_n \frac{S_n}{n}$$
are invariant functions. Indeed, we know
$$S_n \circ \Theta = f \circ \Theta + f \circ \Theta^2 + \cdots + f \circ \Theta^n = S_{n+1} - f.$$
So we have
$$\limsup_{n\to\infty} \frac{S_n \circ \Theta}{n} = \limsup_{n\to\infty}\left(\frac{S_{n+1}}{n} - \frac{f}{n}\right) = \limsup_{n\to\infty} \frac{S_n}{n}.$$


Exactly the same reasoning tells us the lim inf is also invariant.

What we now need to show is that the set of points on which the lim sup and lim inf do not agree has measure zero. We fix a < b. We let
$$D = D(a, b) = \left\{x \in E : \liminf_{n\to\infty} \frac{S_n(x)}{n} < a < b < \limsup_{n\to\infty} \frac{S_n(x)}{n}\right\}.$$
Now if $\limsup S_n(x)/n \ne \liminf S_n(x)/n$, then there are some a, b ∈ Q such that x ∈ D(a, b). So by countable subadditivity, it suffices to show that µ(D(a, b)) = 0 for all a, b.

We now fix a, b, and just write D. Since $\limsup S_n/n$ and $\liminf S_n/n$ are both invariant, we have that D is invariant. By restricting to D, we can assume that D = E.

Suppose that B ∈ E and µ(B) < ∞. We let
$$g = f - b 1_B.$$
Then g is integrable because f is integrable and µ(B) < ∞. Moreover, we have
$$S_n(g) = S_n(f - b 1_B) \ge S_n(f) - nb.$$
Since we know that $\limsup_n S_n(f)/n > b$ on D by definition, for each x we can find an n such that Sn(g)(x) > 0. So we know that
$$S^*(g)(x) = \sup_n S_n(g)(x) > 0$$

for all x ∈ D. By the maximal ergodic lemma, we know
$$0 \le \int_D g\, d\mu = \int_D (f - b 1_B)\, d\mu = \int_D f\, d\mu - b\mu(B).$$
If we rearrange this, we know
$$b\mu(B) \le \int_D f\, d\mu$$
for all measurable sets B ∈ E with finite measure. Since our space is σ-finite, we can find $B_n \nearrow D$ such that µ(Bn) < ∞ for all n. So taking the limit above tells us
$$b\mu(D) \le \int_D f\, d\mu. \tag{$\dagger$}$$
Now we can apply the same argument with (−a) in place of b and (−f) in place of f to get
$$(-a)\mu(D) \le -\int_D f\, d\mu. \tag{$\ddagger$}$$
Now note that since b > a, we know that at least one of b > 0 and a < 0 has to be true. In the first case, (†) tells us that µ(D) is finite, since f is integrable. Then combining with (‡), we see that
$$b\mu(D) \le \int_D f\, d\mu \le a\mu(D).$$


But a < b. So we must have µ(D) = 0. The second case follows similarly (or follows immediately by flipping the sign of f).

We are almost done. We can now define
$$\bar{f}(x) = \begin{cases} \lim S_n(f)/n & \text{if the limit exists} \\ 0 & \text{otherwise.} \end{cases}$$
Then by the above calculations, we have
$$\frac{S_n(f)}{n} \to \bar{f} \quad \text{a.e.}$$
Also, we know $\bar{f}$ is invariant, because lim Sn(f)/n is invariant, and so is the set where the limit exists.

Finally, we need to show that
$$\mu(|\bar{f}|) \le \mu(|f|).$$
This is since
$$\mu(|f \circ \Theta^n|) = \mu(|f|),$$
as Θⁿ preserves the measure µ. So we have that
$$\mu(|S_n|) \le n\mu(|f|) < \infty.$$

So by Fatou's lemma, we have
$$\mu(|\bar{f}|) \le \mu\left(\liminf_n \left|\frac{S_n}{n}\right|\right) \le \liminf_n \mu\left(\left|\frac{S_n}{n}\right|\right) \le \mu(|f|).$$

The proof of the von Neumann ergodic theorem follows easily from Birkhoff's ergodic theorem.

Proof of von Neumann ergodic theorem. It is an exercise on the example sheet to show that
$$\|f \circ \Theta\|_p^p = \int |f \circ \Theta|^p\, d\mu = \int |f|^p\, d\mu = \|f\|_p^p.$$
So we have
$$\left\|\frac{S_n}{n}\right\|_p = \frac{1}{n}\|f + f \circ \Theta + \cdots + f \circ \Theta^{n-1}\|_p \le \|f\|_p$$
by Minkowski's inequality.

So let ε > 0, and take M ∈ (0, ∞) so that if

g = (f ∨ (−M)) ∧M,


then
$$\|f - g\|_p < \frac{\varepsilon}{3}.$$
By Birkhoff's theorem, we know
$$\frac{S_n(g)}{n} \to \bar{g} \quad \text{a.e.}$$
Also, we know
$$\left|\frac{S_n(g)}{n}\right| \le M$$
for all n. So by the bounded convergence theorem, we know
$$\left\|\frac{S_n(g)}{n} - \bar{g}\right\|_p \to 0$$
as n → ∞. So we can find N such that n ≥ N implies
$$\left\|\frac{S_n(g)}{n} - \bar{g}\right\|_p \le \frac{\varepsilon}{3}.$$

Then we have
$$\|\bar{f} - \bar{g}\|_p^p = \int \liminf_n \left|\frac{S_n(f - g)}{n}\right|^p d\mu \le \liminf_n \int \left|\frac{S_n(f - g)}{n}\right|^p d\mu \le \|f - g\|_p^p.$$
So if n ≥ N, then we know
$$\left\|\frac{S_n(f)}{n} - \bar{f}\right\|_p \le \left\|\frac{S_n(f - g)}{n}\right\|_p + \left\|\frac{S_n(g)}{n} - \bar{g}\right\|_p + \|\bar{g} - \bar{f}\|_p \le \varepsilon.$$

So done.


7 Big theorems

We are now going to use all the tools we have previously developed to prove two of the most important theorems about sums of independent random variables, namely the strong law of large numbers and the central limit theorem.

7.1 The strong law of large numbers

Before we start proving the strong law of large numbers, we first spend some time discussing the difference between the strong law and the weak law. In both cases, we have a sequence (Xn) of iid random variables with E[Xi] = µ. We let

Sn = X1 + · · · + Xn.

– The weak law of large numbers says Sn/n → µ in probability as n → ∞, provided E[X₁²] < ∞.

– The strong law of large numbers says Sn/n → µ a.s., provided E[|X₁|] < ∞.

So we see that the strong law is indeed stronger, because convergence almost everywhere implies convergence in measure.

We are actually going to prove two versions of the strong law with different hypotheses.

Theorem (Strong law of large numbers assuming finite fourth moments). Let (Xn) be a sequence of independent random variables such that there exist µ ∈ R and M > 0 with
$$E[X_n] = \mu, \qquad E[X_n^4] \le M$$
for all n. With Sn = X1 + · · · + Xn, we have that
$$\frac{S_n}{n} \to \mu \quad \text{a.s. as } n \to \infty.$$
Note that in this version, we do not require that the Xn are iid. We simply need that they are independent and have the same mean.

The proof is completely elementary.

Proof. We reduce to the case that µ = 0 by setting

Yn = Xn − µ.

We then have

$$E[Y_n] = 0, \qquad E[Y_n^4] \le 2^4\, E[\mu^4 + X_n^4] \le 2^4(\mu^4 + M).$$
So it suffices to show that the theorem holds with Yn in place of Xn. So we can assume that µ = 0.

By independence, we know that for i ≠ j, we have
$$E[X_i X_j^3] = E[X_i]\, E[X_j^3] = 0.$$
Similarly, for all i, j, k, ℓ distinct, we have
$$E[X_i X_j X_k^2] = E[X_i X_j X_k X_\ell] = 0.$$


Hence we know that
$$E[S_n^4] = E\left[\sum_{k=1}^{n} X_k^4 + 6 \sum_{1 \le i < j \le n} X_i^2 X_j^2\right].$$
We know the first term is bounded by nM, and we also know that for i ≠ j, we have
$$E[X_i^2 X_j^2] = E[X_i^2]\, E[X_j^2] \le \sqrt{E[X_i^4]\, E[X_j^4]} \le M$$
by Jensen's inequality. So we know
$$E\left[6 \sum_{1 \le i < j \le n} X_i^2 X_j^2\right] \le 3n(n - 1)M.$$
Putting everything together, we have
$$E[S_n^4] \le nM + 3n(n - 1)M \le 3n^2 M.$$
So we know
$$E\left[(S_n/n)^4\right] \le \frac{3M}{n^2}.$$
So we know
$$E\left[\sum_{n=1}^{\infty} \left(\frac{S_n}{n}\right)^4\right] \le \sum_{n=1}^{\infty} \frac{3M}{n^2} < \infty.$$
So we know that
$$\sum_{n=1}^{\infty} \left(\frac{S_n}{n}\right)^4 < \infty \quad \text{a.s.}$$

So we know that (Sn/n)⁴ → 0 a.s., i.e. Sn/n → 0 a.s.
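A quick numerical sketch of the strong law (illustrative only; the distribution and horizon are arbitrary choices): along a single simulated sample path, the running averages Sn/n settle down to the mean.

```python
import numpy as np

rng = np.random.default_rng(7)
mu = 2.0
X = rng.exponential(mu, size=1_000_000)        # iid, mean 2, with finite fourth moment
running_avg = np.cumsum(X) / np.arange(1, len(X) + 1)

for n in [100, 10_000, 1_000_000]:
    print(n, round(float(running_avg[n - 1]), 4))   # approaches mu = 2 along this path
```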

We are now going to get rid of the assumption that we have finite fourth moments, but we'll need to work with iid random variables.

Theorem (Strong law of large numbers). Let (Yn) be an iid sequence of integrable random variables with mean ν. With Sn = Y1 + · · · + Yn, we have
$$\frac{S_n}{n} \to \nu \quad \text{a.s.}$$
We will use the ergodic theorem to prove this. This is not the "usual" proof of the strong law, but since we've done all that work on ergodic theory, we might as well use it to get a clean proof. Most of the work left is setting up the right setting for the proof.

Proof. Let m be the law of Y1 and let Y = (Y1, Y2, Y3, · · · ). We can view Y as a function
$$Y : \Omega \to \mathbb{R}^{\mathbb{N}} = E.$$
We let (E, E, µ) be the canonical space associated with the distribution m, so that
$$\mu = P \circ Y^{-1}.$$


We let f : E → R be given by

$$f(x_1, x_2, \cdots) = X_1(x_1, x_2, \cdots) = x_1.$$
Then X1 has law given by m, and in particular is integrable. Also, the shift map Θ : E → E given by
$$\Theta(x_1, x_2, \cdots) = (x_2, x_3, \cdots)$$
is measure-preserving and ergodic. Thus, with
$$S_n(f) = f + f \circ \Theta + \cdots + f \circ \Theta^{n-1} = X_1 + \cdots + X_n,$$
we have that
$$\frac{S_n(f)}{n} \to \bar{f} \quad \text{a.e.}$$
by Birkhoff's ergodic theorem. We also have convergence in L1 by von Neumann's ergodic theorem.

Here $\bar{f}$ is EΘ-measurable, and Θ is ergodic, so we know that $\bar{f} = c$ a.e. for some constant c. Moreover, we have
$$c = \mu(\bar{f}) = \lim_{n\to\infty} \mu(S_n(f)/n) = \nu.$$

So done.

7.2 Central limit theorem

Theorem. Let (Xn) be a sequence of iid random variables with E[Xi] = 0 and E[X₁²] = 1. Then if we set

Sn = X1 + · · · + Xn,

then for all x ∈ R, we have
$$P\left[\frac{S_n}{\sqrt{n}} \le x\right] \to \int_{-\infty}^{x} \frac{e^{-y^2/2}}{\sqrt{2\pi}}\, dy = P[N(0, 1) \le x]$$
as n → ∞.

Proof. Let φ(u) = E[e^{iuX₁}]. Since E[X₁²] = 1 < ∞, we can differentiate under the expectation twice to obtain
$$\varphi(u) = E[e^{iuX_1}], \qquad \varphi'(u) = E[iX_1 e^{iuX_1}], \qquad \varphi''(u) = E[-X_1^2 e^{iuX_1}].$$
Evaluating at 0, we have
$$\varphi(0) = 1, \qquad \varphi'(0) = 0, \qquad \varphi''(0) = -1.$$
So if we Taylor expand φ at 0, we have
$$\varphi(u) = 1 - \frac{u^2}{2} + o(u^2).$$


We consider the characteristic function of $S_n/\sqrt{n}$:
$$\varphi_n(u) = E[e^{iuS_n/\sqrt{n}}] = \prod_{j=1}^{n} E[e^{iuX_j/\sqrt{n}}] = \varphi(u/\sqrt{n})^n = \left(1 - \frac{u^2}{2n} + o\left(\frac{u^2}{n}\right)\right)^n.$$
We now take the logarithm to obtain
$$\log \varphi_n(u) = n \log\left(1 - \frac{u^2}{2n} + o\left(\frac{u^2}{n}\right)\right) = -\frac{u^2}{2} + o(1) \to -\frac{u^2}{2}.$$

So we know that
$$\varphi_n(u) \to e^{-u^2/2},$$
which is the characteristic function of an N(0, 1) random variable. So we have convergence of characteristic functions, hence weak convergence, hence convergence in distribution.
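As a final numerical sketch (illustrative only; the distribution, n, the number of trials and the evaluation points are arbitrary choices), one can compare the empirical distribution of $S_n/\sqrt{n}$ for centred, variance-one iid summands against the standard normal distribution function:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(8)
n, trials = 500, 10_000

# Centred, variance-1 summands: uniform on (-sqrt(3), sqrt(3)).
X = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(trials, n))
S = X.sum(axis=1) / np.sqrt(n)                       # S_n / sqrt(n) for each trial

Phi = lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0)))     # standard normal CDF
for x in [-1.0, 0.0, 1.5]:
    print(x, round(float((S <= x).mean()), 4), round(Phi(x), 4))  # empirical vs N(0, 1) CDF
```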
