  • STAT 6720

    Mathematical Statistics II

    Spring Semester 2008

    Dr. Jürgen Symanzik

    Utah State University

    Department of Mathematics and Statistics

    3900 Old Main Hill

    Logan, UT 84322–3900

    Tel.: (435) 797–0696

    FAX: (435) 797–1822

    e-mail: [email protected]

  • Contents

    Acknowledgements

    6 Limit Theorems
    6.1 Modes of Convergence
    6.2 Weak Laws of Large Numbers
    6.3 Strong Laws of Large Numbers
    6.4 Central Limit Theorems

    7 Sample Moments
    7.1 Random Sampling
    7.2 Sample Moments and the Normal Distribution

    8 The Theory of Point Estimation
    8.1 The Problem of Point Estimation
    8.2 Properties of Estimates
    8.3 Sufficient Statistics
    8.4 Unbiased Estimation
    8.5 Lower Bounds for the Variance of an Estimate
    8.6 The Method of Moments
    8.7 Maximum Likelihood Estimation
    8.8 Decision Theory — Bayes and Minimax Estimation

    9 Hypothesis Testing
    9.1 Fundamental Notions
    9.2 The Neyman–Pearson Lemma
    9.3 Monotone Likelihood Ratios
    9.4 Unbiased and Invariant Tests

    10 More on Hypothesis Testing
    10.1 Likelihood Ratio Tests
    10.2 Parametric Chi–Squared Tests
    10.3 t–Tests and F–Tests
    10.4 Bayes and Minimax Tests

    11 Confidence Estimation
    11.1 Fundamental Notions
    11.2 Shortest–Length Confidence Intervals
    11.3 Confidence Intervals and Hypothesis Tests
    11.4 Bayes Confidence Intervals

    12 Nonparametric Inference
    12.1 Nonparametric Estimation
    12.2 Single-Sample Hypothesis Tests
    12.3 More on Order Statistics

    13 Some Results from Sampling
    13.1 Simple Random Samples
    13.2 Stratified Random Samples

    14 Some Results from Sequential Statistical Inference
    14.1 Fundamentals of Sequential Sampling
    14.2 Sequential Probability Ratio Tests

    Index

  • Acknowledgements

I would like to thank all my students who helped from the Fall 1999 through the Spring 2006 semesters with the creation and improvement of these lecture notes and for their suggestions on how to improve some of the material presented in class.

    In addition, I particularly would like to thank Mike Minnotte and Dan Coster, who previously

    taught this course at Utah State University, for providing me with their lecture notes and other

    materials related to this course. Their lecture notes, combined with additional material from

    a variety of textbooks listed below, form the basis of the script presented here.

    The textbook required for this class is:

    • Casella, G., and Berger, R. L. (2002): Statistical Inference (Second Edition), Duxbury/Thomson Learning, Pacific Grove, CA.

    A Web page dedicated to this class is accessible at:

    http://www.math.usu.edu/~symanzik/teaching/2006_stat6720/stat6720.html

This course follows Casella and Berger (2002) as described in the syllabus. Additional material originates from lectures by Professors Hering, Trenkler, Gather, and Kreienbrock that I attended while studying at the Universität Dortmund, Germany, from the collection of Masters and PhD Preliminary Exam questions from Iowa State University, Ames, Iowa, and from the following textbooks:

• Bandelow, C. (1981): Einführung in die Wahrscheinlichkeitstheorie, Bibliographisches Institut, Mannheim, Germany.

• Büning, H., and Trenkler, G. (1978): Nichtparametrische statistische Methoden, Walter de Gruyter, Berlin, Germany.

• Casella, G., and Berger, R. L. (1990): Statistical Inference, Wadsworth & Brooks/Cole, Pacific Grove, CA.

• Fisz, M. (1989): Wahrscheinlichkeitsrechnung und mathematische Statistik, VEB Deutscher Verlag der Wissenschaften, Berlin, German Democratic Republic.

• Gibbons, J. D., and Chakraborti, S. (1992): Nonparametric Statistical Inference (Third Edition, Revised and Expanded), Dekker, New York, NY.

• Johnson, N. L., Kotz, S., and Balakrishnan, N. (1994): Continuous Univariate Distributions, Volume 1 (Second Edition), Wiley, New York, NY.

• Johnson, N. L., Kotz, S., and Balakrishnan, N. (1995): Continuous Univariate Distributions, Volume 2 (Second Edition), Wiley, New York, NY.

• Kelly, D. G. (1994): Introduction to Probability, Macmillan, New York, NY.

• Lehmann, E. L. (1983): Theory of Point Estimation (1991 Reprint), Wadsworth & Brooks/Cole, Pacific Grove, CA.

• Lehmann, E. L. (1986): Testing Statistical Hypotheses (Second Edition – 1994 Reprint), Chapman & Hall, New York, NY.

• Mood, A. M., Graybill, F. A., and Boes, D. C. (1974): Introduction to the Theory of Statistics (Third Edition), McGraw-Hill, Singapore.

• Parzen, E. (1960): Modern Probability Theory and Its Applications, Wiley, New York, NY.

• Rohatgi, V. K. (1976): An Introduction to Probability Theory and Mathematical Statistics, John Wiley and Sons, New York, NY.

• Rohatgi, V. K., and Saleh, A. K. E. (2001): An Introduction to Probability and Statistics (Second Edition), John Wiley and Sons, New York, NY.

• Searle, S. R. (1971): Linear Models, Wiley, New York, NY.

• Tamhane, A. C., and Dunlop, D. D. (2000): Statistics and Data Analysis – From Elementary to Intermediate, Prentice Hall, Upper Saddle River, NJ.

Additional definitions, integrals, sums, etc. originate from the following formula collections:

• Bronstein, I. N., and Semendjajew, K. A. (1985): Taschenbuch der Mathematik (22. Auflage), Verlag Harri Deutsch, Thun, Switzerland.

• Bronstein, I. N., and Semendjajew, K. A. (1986): Ergänzende Kapitel zu Taschenbuch der Mathematik (4. Auflage), Verlag Harri Deutsch, Thun, Switzerland.

• Sieber, H. (1980): Mathematische Formeln — Erweiterte Ausgabe E, Ernst Klett, Stuttgart, Germany.

Jürgen Symanzik, January 7, 2006

Lecture 02: We 01/07/04

6 Limit Theorems

(Based on Rohatgi, Chapter 6, Rohatgi/Saleh, Chapter 6 & Casella/Berger, Section 5.5)

Motivation:

I found this slide from my Stat 250, Section 003, "Introductory Statistics" class (an undergraduate class I taught at George Mason University in Spring 1999):

What does this mean at a more theoretical level???

6.1 Modes of Convergence

Definition 6.1.1:
Let $X_1, \ldots, X_n$ be iid rv's with common cdf $F_X(x)$. Let $T = T(\underline{X})$ be any statistic, i.e., a Borel–measurable function of $\underline{X}$ that does not involve the population parameter(s) $\vartheta$, defined on the support $\mathcal{X}$ of $\underline{X}$. The induced probability distribution of $T(\underline{X})$ is called the sampling distribution of $T(\underline{X})$.

Note:
(i) Commonly used statistics are:
Sample Mean: $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$
Sample Variance: $S_n^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X}_n)^2$
Sample Median, Order Statistics, Min, Max, etc.
(ii) Recall that if $X_1, \ldots, X_n$ are iid and if $E(X)$ and $Var(X)$ exist, then $E(\bar{X}_n) = \mu = E(X)$, $E(S_n^2) = \sigma^2 = Var(X)$, and $Var(\bar{X}_n) = \frac{\sigma^2}{n}$.
(iii) Recall that if $X_1, \ldots, X_n$ are iid and if $X$ has mgf $M_X(t)$ or characteristic function $\Phi_X(t)$, then $M_{\bar{X}_n}(t) = \big(M_X(\frac{t}{n})\big)^n$ and $\Phi_{\bar{X}_n}(t) = \big(\Phi_X(\frac{t}{n})\big)^n$.

Note: Let $\{X_n\}_{n=1}^\infty$ be a sequence of rv's on some probability space $(\Omega, L, P)$. Is there any meaning behind the expression $\lim_{n\to\infty} X_n = X$? Not immediately under the usual definitions of limits. We first need to define modes of convergence for rv's and probabilities.

Definition 6.1.2:
Let $\{X_n\}_{n=1}^\infty$ be a sequence of rv's with cdf's $\{F_n\}_{n=1}^\infty$ and let $X$ be a rv with cdf $F$. If $F_n(x) \to F(x)$ at all continuity points of $F$, we say that $X_n$ converges in distribution to $X$ ($X_n \xrightarrow{d} X$), or $X_n$ converges in law to $X$ ($X_n \xrightarrow{L} X$), or $F_n$ converges weakly to $F$ ($F_n \xrightarrow{w} F$).

Example 6.1.3:
Let $X_n \sim N(0, \frac{1}{n})$. Then
$$F_n(x) = \int_{-\infty}^{x} \frac{\sqrt{n}}{\sqrt{2\pi}} \exp\Big(-\frac{1}{2} n t^2\Big) \, dt = \int_{-\infty}^{\sqrt{n}x} \frac{1}{\sqrt{2\pi}} \exp\Big(-\frac{1}{2} s^2\Big) \, ds = \Phi(\sqrt{n}\,x)$$
$$\Longrightarrow \ F_n(x) \to \begin{cases} \Phi(\infty) = 1, & x > 0 \\ \Phi(0) = \frac{1}{2}, & x = 0 \\ \Phi(-\infty) = 0, & x < 0 \end{cases}$$
If $F_X(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases}$, the only point of discontinuity is at $x = 0$. Everywhere else, $\Phi(\sqrt{n}\,x) = F_n(x) \to F_X(x)$, where $\Phi(z) = P(Z \le z)$ with $Z \sim N(0,1)$.

So $X_n \xrightarrow{d} X$, where $P(X = 0) = 1$, or $X_n \xrightarrow{d} 0$, since the limiting rv here is degenerate, i.e., it has a Dirac(0) distribution.
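The limit behavior in Example 6.1.3 is easy to verify numerically. The following is a minimal sketch (assuming Python with numpy and scipy are available, which the notes themselves do not use): it evaluates $F_n(x) = \Phi(\sqrt{n}\,x)$ at a few fixed points and shows the values approaching 0, 1/2, and 1.

```python
# Numerical check of Example 6.1.3: F_n(x) = Phi(sqrt(n) x) tends to the
# degenerate cdf of Dirac(0) at every continuity point (i.e., for x != 0).
import numpy as np
from scipy.stats import norm

xs = np.array([-0.5, -0.05, 0.0, 0.05, 0.5])
for n in [1, 10, 100, 10_000]:
    print(n, np.round(norm.cdf(np.sqrt(n) * xs), 4))
```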

Example 6.1.4:
In this example, the sequence $\{F_n\}_{n=1}^\infty$ converges pointwise to something that is not a cdf: Let $X_n \sim$ Dirac($n$), i.e., $P(X_n = n) = 1$. Then
$$F_n(x) = \begin{cases} 0, & x < n \\ 1, & x \ge n \end{cases}$$
It is $F_n(x) \to 0 \ \forall x$, which is not a cdf. Thus, there is no rv $X$ such that $X_n \xrightarrow{d} X$.

Example 6.1.5:
Let $\{X_n\}_{n=1}^\infty$ be a sequence of rv's such that $P(X_n = 0) = 1 - \frac{1}{n}$ and $P(X_n = n) = \frac{1}{n}$, and let $X \sim$ Dirac(0), i.e., $P(X = 0) = 1$. It is
$$F_n(x) = \begin{cases} 0, & x < 0 \\ 1 - \frac{1}{n}, & 0 \le x < n \\ 1, & x \ge n \end{cases} \qquad F_X(x) = \begin{cases} 0, & x < 0 \\ 1, & x \ge 0 \end{cases}$$
It holds that $F_n \xrightarrow{w} F_X$ but
$$E(X_n^k) = 0^k \cdot \Big(1 - \frac{1}{n}\Big) + n^k \cdot \frac{1}{n} = n^{k-1} \not\to E(X^k) = 0.$$
Thus, convergence in distribution does not imply convergence of moments/means.

Note:
Convergence in distribution does not say that the $X_i$'s are close to each other or to $X$. It only means that their cdf's are (eventually) close to some cdf $F$. The $X_i$'s do not even have to be defined on the same probability space.

Example 6.1.6:
Let $X$ and $\{X_n\}_{n=1}^\infty$ be iid $N(0, 1)$. Obviously, $X_n \xrightarrow{d} X$ but $\lim_{n\to\infty} X_n \ne X$.

Theorem 6.1.7:
Let $X$ and $\{X_n\}_{n=1}^\infty$ be discrete rv's with supports $\mathcal{X}$ and $\{\mathcal{X}_n\}_{n=1}^\infty$, respectively. Define the countable set $A = \mathcal{X} \cup \bigcup_{n=1}^\infty \mathcal{X}_n = \{a_k : k = 1, 2, 3, \ldots\}$. Let $p_k = P(X = a_k)$ and $p_{nk} = P(X_n = a_k)$. Then it holds that $p_{nk} \to p_k \ \forall k$ iff $X_n \xrightarrow{d} X$.

Theorem 6.1.8:
Let $X$ and $\{X_n\}_{n=1}^\infty$ be continuous rv's with pdf's $f$ and $\{f_n\}_{n=1}^\infty$, respectively. If $f_n(x) \to f(x)$ for almost all $x$ as $n \to \infty$, then $X_n \xrightarrow{d} X$.

Theorem 6.1.9:
Let $X$ and $\{X_n\}_{n=1}^\infty$ be rv's such that $X_n \xrightarrow{d} X$. Let $c \in \mathbb{R}$ be a constant. Then it holds:
(i) $X_n + c \xrightarrow{d} X + c$.
(ii) $cX_n \xrightarrow{d} cX$.
(iii) If $a_n \to a$ and $b_n \to b$, then $a_n X_n + b_n \xrightarrow{d} aX + b$.

Proof:
Part (iii):
Suppose that $a > 0$, $a_n > 0$. (If $a < 0$, $a_n < 0$, the result follows via (ii) and $c = -1$.) Let $Y_n = a_n X_n + b_n$ and $Y = aX + b$. It is
$$F_Y(y) = P(Y < y) = P(aX + b < y) = P\Big(X < \frac{y - b}{a}\Big) = F_X\Big(\frac{y - b}{a}\Big).$$
Likewise,
$$F_{Y_n}(y) = F_{X_n}\Big(\frac{y - b_n}{a_n}\Big).$$
If $y$ is a continuity point of $F_Y$, then $\frac{y - b}{a}$ is a continuity point of $F_X$. Since $a_n \to a$, $b_n \to b$ and $F_{X_n}(x) \to F_X(x)$, it follows that $F_{Y_n}(y) \to F_Y(y)$ for every continuity point $y$ of $F_Y$. Thus, $a_n X_n + b_n \xrightarrow{d} aX + b$.

Lecture 38: We 11/29/00

Definition 6.1.10:
Let $\{X_n\}_{n=1}^\infty$ be a sequence of rv's defined on a probability space $(\Omega, L, P)$. We say that $X_n$ converges in probability to a rv $X$ ($X_n \xrightarrow{p} X$, $P\text{-}\lim_{n\to\infty} X_n = X$) if
$$\lim_{n\to\infty} P(|X_n - X| > \epsilon) = 0 \quad \forall \epsilon > 0.$$

Note:
The following are equivalent:
$$\lim_{n\to\infty} P(|X_n - X| > \epsilon) = 0 \iff \lim_{n\to\infty} P(|X_n - X| \le \epsilon) = 1 \iff \lim_{n\to\infty} P(\{\omega : |X_n(\omega) - X(\omega)| > \epsilon\}) = 0$$
If $X$ is degenerate, i.e., $P(X = c) = 1$, we say that $X_n$ is consistent for $c$. For example, let $X_n$ be such that $P(X_n = 0) = 1 - \frac{1}{n}$ and $P(X_n = 1) = \frac{1}{n}$. Then
$$P(|X_n| > \epsilon) = \begin{cases} \frac{1}{n}, & 0 < \epsilon < 1 \\ 0, & \epsilon \ge 1 \end{cases}$$
Therefore, $\lim_{n\to\infty} P(|X_n| > \epsilon) = 0 \ \forall \epsilon > 0$. So $X_n \xrightarrow{p} 0$, i.e., $X_n$ is consistent for 0.

Theorem 6.1.11:
(i) $X_n \xrightarrow{p} X \iff X_n - X \xrightarrow{p} 0$.
(ii) $X_n \xrightarrow{p} X$, $X_n \xrightarrow{p} Y \implies P(X = Y) = 1$.
(iii) $X_n \xrightarrow{p} X$, $X_m \xrightarrow{p} X \implies X_n - X_m \xrightarrow{p} 0$ as $n, m \to \infty$.
(iv) $X_n \xrightarrow{p} X$, $Y_n \xrightarrow{p} Y \implies X_n \pm Y_n \xrightarrow{p} X \pm Y$.
(v) $X_n \xrightarrow{p} X$, $k \in \mathbb{R}$ a constant $\implies kX_n \xrightarrow{p} kX$.
(vi) $X_n \xrightarrow{p} k$, $k \in \mathbb{R}$ a constant $\implies X_n^r \xrightarrow{p} k^r \ \forall r \in \mathbb{N}$.
(vii) $X_n \xrightarrow{p} a$, $Y_n \xrightarrow{p} b$, $a, b \in \mathbb{R} \implies X_n Y_n \xrightarrow{p} ab$.
(viii) $X_n \xrightarrow{p} 1 \implies X_n^{-1} \xrightarrow{p} 1$.
(ix) $X_n \xrightarrow{p} a$, $Y_n \xrightarrow{p} b$, $a \in \mathbb{R}$, $b \in \mathbb{R} - \{0\} \implies \frac{X_n}{Y_n} \xrightarrow{p} \frac{a}{b}$.
(x) $X_n \xrightarrow{p} X$, $Y$ an arbitrary rv $\implies X_n Y \xrightarrow{p} XY$.
(xi) $X_n \xrightarrow{p} X$, $Y_n \xrightarrow{p} Y \implies X_n Y_n \xrightarrow{p} XY$.

Proof:
See Rohatgi, pages 244–245, and Rohatgi/Saleh, pages 260–261, for partial proofs.

Theorem 6.1.12:
Let $X_n \xrightarrow{p} X$ and let $g$ be a continuous function on $\mathbb{R}$. Then $g(X_n) \xrightarrow{p} g(X)$.

Proof:
Preconditions:
1.) $X$ rv $\implies \forall \epsilon > 0 \ \exists k = k(\epsilon) : P(|X| > k) < \frac{\epsilon}{2}$
2.) $g$ is continuous on $\mathbb{R}$
$\implies$ $g$ is also uniformly continuous on $[-k, k]$ (see Definition of uniformly continuous in Theorem 3.3.3 (iii))
$\implies \exists \delta = \delta(\epsilon, k) : |X| \le k, |X_n - X| < \delta \Rightarrow |g(X_n) - g(X)| < \epsilon$

Let
$$A = \{|X| \le k\} = \{\omega : |X(\omega)| \le k\}$$
$$B = \{|X_n - X| < \delta\} = \{\omega : |X_n(\omega) - X(\omega)| < \delta\}$$
$$C = \{|g(X_n) - g(X)| < \epsilon\} = \{\omega : |g(X_n(\omega)) - g(X(\omega))| < \epsilon\}$$
If $\omega \in A \cap B$, then by 2.) $\omega \in C$
$$\implies A \cap B \subseteq C \implies C^c \subseteq (A \cap B)^c = A^c \cup B^c \implies P(C^c) \le P(A^c \cup B^c) \le P(A^c) + P(B^c)$$
Now:
$$P(|g(X_n) - g(X)| \ge \epsilon) \le \underbrace{P(|X| > k)}_{\le \frac{\epsilon}{2} \text{ by 1.)}} + \underbrace{P(|X_n - X| \ge \delta)}_{\le \frac{\epsilon}{2} \text{ for } n \ge n_0(\epsilon, \delta, k) \text{ since } X_n \xrightarrow{p} X} \le \epsilon \quad \text{for } n \ge n_0(\epsilon, \delta, k)$$

Corollary 6.1.13:
(i) Let $X_n \xrightarrow{p} c$, $c \in \mathbb{R}$, and let $g$ be a continuous function on $\mathbb{R}$. Then $g(X_n) \xrightarrow{p} g(c)$.
(ii) Let $X_n \xrightarrow{d} X$ and let $g$ be a continuous function on $\mathbb{R}$. Then $g(X_n) \xrightarrow{d} g(X)$.
(iii) Let $X_n \xrightarrow{d} c$, $c \in \mathbb{R}$, and let $g$ be a continuous function on $\mathbb{R}$. Then $g(X_n) \xrightarrow{d} g(c)$.

Theorem 6.1.14:
$X_n \xrightarrow{p} X \implies X_n \xrightarrow{d} X$.

Proof:
$X_n \xrightarrow{p} X \iff P(|X_n - X| > \epsilon) \to 0$ as $n \to \infty \ \forall \epsilon > 0$.
It holds:
$$P(X \le x - \epsilon) = P(X \le x - \epsilon, |X_n - X| \le \epsilon) + P(X \le x - \epsilon, |X_n - X| > \epsilon) \stackrel{(A)}{\le} P(X_n \le x) + P(|X_n - X| > \epsilon)$$
(A) holds since $X \le x - \epsilon$ and $X_n$ within $\epsilon$ of $X$ imply $X_n \le x$.
Similarly, it holds:
$$P(X_n \le x) = P(X_n \le x, |X_n - X| \le \epsilon) + P(X_n \le x, |X_n - X| > \epsilon) \le P(X \le x + \epsilon) + P(|X_n - X| > \epsilon)$$
Combining the 2 inequalities from above gives:
$$P(X \le x - \epsilon) - \underbrace{P(|X_n - X| > \epsilon)}_{\to 0 \text{ as } n \to \infty} \le \underbrace{P(X_n \le x)}_{= F_n(x)} \le P(X \le x + \epsilon) + \underbrace{P(|X_n - X| > \epsilon)}_{\to 0 \text{ as } n \to \infty}$$
Therefore,
$$P(X \le x - \epsilon) \le F_n(x) \le P(X \le x + \epsilon) \quad \text{as } n \to \infty.$$
Since the cdf's $F_n(\cdot)$ are not necessarily left continuous, we get the following result for $\epsilon \downarrow 0$:
$$P(X < x) \le F_n(x) \le P(X \le x) = F_X(x)$$
Let $x$ be a continuity point of $F$. Then it holds:
$$F(x) = P(X < x) \le F_n(x) \le F(x) \implies F_n(x) \to F(x) \implies X_n \xrightarrow{d} X$$

Theorem 6.1.15:
Let $c \in \mathbb{R}$ be a constant. Then it holds:
$$X_n \xrightarrow{d} c \iff X_n \xrightarrow{p} c.$$

Example 6.1.16:
In this example, we will see that $X_n \xrightarrow{d} X \not\Longrightarrow X_n \xrightarrow{p} X$ for some rv $X$. Let $X_n$ be identically distributed rv's and let $(X_n, X)$ have the following joint distribution:

                 X = 0    X = 1
    X_n = 0        0       1/2      1/2
    X_n = 1       1/2       0       1/2
                  1/2      1/2       1

Obviously, $X_n \xrightarrow{d} X$ since all have exactly the same cdf, but for any $\epsilon \in (0, 1)$, it is
$$P(|X_n - X| > \epsilon) = P(|X_n - X| = 1) = 1 \quad \forall n,$$
so $\lim_{n\to\infty} P(|X_n - X| > \epsilon) \ne 0$. Therefore, $X_n \not\xrightarrow{p} X$.

Theorem 6.1.17:
Let $\{X_n\}_{n=1}^\infty$ and $\{Y_n\}_{n=1}^\infty$ be sequences of rv's and $X$ be a rv defined on a probability space $(\Omega, L, P)$. Then it holds:
$$Y_n \xrightarrow{d} X, \ |X_n - Y_n| \xrightarrow{p} 0 \implies X_n \xrightarrow{d} X.$$

Proof:
Similar to the proof of Theorem 6.1.14. See also Rohatgi, page 253, Theorem 14, and Rohatgi/Saleh, page 269, Theorem 14.

Lecture 41: We 12/06/00

Theorem 6.1.18: Slutsky's Theorem
Let $\{X_n\}_{n=1}^\infty$ and $\{Y_n\}_{n=1}^\infty$ be sequences of rv's and $X$ be a rv defined on a probability space $(\Omega, L, P)$. Let $c \in \mathbb{R}$ be a constant. Then it holds:
(i) $X_n \xrightarrow{d} X$, $Y_n \xrightarrow{p} c \implies X_n + Y_n \xrightarrow{d} X + c$.
(ii) $X_n \xrightarrow{d} X$, $Y_n \xrightarrow{p} c \implies X_n Y_n \xrightarrow{d} cX$. If $c = 0$, then also $X_n Y_n \xrightarrow{p} 0$.
(iii) $X_n \xrightarrow{d} X$, $Y_n \xrightarrow{p} c \implies \frac{X_n}{Y_n} \xrightarrow{d} \frac{X}{c}$ if $c \ne 0$.

Proof:
(i) $Y_n \xrightarrow{p} c \stackrel{\text{Th.6.1.11(i)}}{\iff} Y_n - c \xrightarrow{p} 0$
$$\implies Y_n - c = Y_n + (X_n - X_n) - c = (X_n + Y_n) - (X_n + c) \xrightarrow{p} 0 \quad (A)$$
$$X_n \xrightarrow{d} X \stackrel{\text{Th.6.1.9(i)}}{\implies} X_n + c \xrightarrow{d} X + c \quad (B)$$
Combining (A) and (B), it follows from Theorem 6.1.17: $X_n + Y_n \xrightarrow{d} X + c$.

(ii) Case $c = 0$: $\forall \epsilon > 0 \ \forall k > 0$, it is
$$P(|X_n Y_n| > \epsilon) = P\Big(|X_n Y_n| > \epsilon, |Y_n| \le \frac{\epsilon}{k}\Big) + P\Big(|X_n Y_n| > \epsilon, |Y_n| > \frac{\epsilon}{k}\Big) \le P\Big(|X_n| \frac{\epsilon}{k} > \epsilon\Big) + P\Big(|Y_n| > \frac{\epsilon}{k}\Big) = P(|X_n| > k) + P\Big(|Y_n| > \frac{\epsilon}{k}\Big)$$
Since $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{p} 0$, it follows
$$\lim_{n\to\infty} P(|X_n Y_n| > \epsilon) \le P(|X| > k) \to 0 \text{ as } k \to \infty.$$
Therefore, $X_n Y_n \xrightarrow{p} 0$.
Case $c \ne 0$: Since $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{p} c$, it follows from (ii), case $c = 0$, that $X_n Y_n - cX_n = X_n(Y_n - c) \xrightarrow{p} 0$. Since $cX_n \xrightarrow{d} cX$ by Theorem 6.1.9 (ii), it follows from Theorem 6.1.17:
$$X_n Y_n \xrightarrow{d} cX$$

(iii) Let $Z_n \xrightarrow{p} 1$ and let $Y_n = cZ_n$.
$$\stackrel{c \ne 0}{\implies} \frac{1}{Y_n} = \frac{1}{Z_n} \cdot \frac{1}{c} \stackrel{\text{Th.6.1.11(v,viii)}}{\implies} \frac{1}{Y_n} \xrightarrow{p} \frac{1}{c}$$
With part (ii) above, it follows:
$$X_n \xrightarrow{d} X \text{ and } \frac{1}{Y_n} \xrightarrow{p} \frac{1}{c} \implies \frac{X_n}{Y_n} \xrightarrow{d} \frac{X}{c}$$
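Slutsky's Theorem is easy to illustrate by simulation. The following sketch (Python with numpy assumed; the choice $X_n \sim N(0,1)$ and $Y_n = c + Z_n/\sqrt{n}$ is just one convenient example) checks part (ii): the product $X_n Y_n$ behaves like $cX$, i.e., approximately $N(0, c^2)$.

```python
# Slutsky's Theorem (ii): X_n ->d X = N(0,1), Y_n ->p c, hence X_n Y_n ->d c X.
import numpy as np

rng = np.random.default_rng(1)
c, n, reps = 2.0, 10_000, 50_000
Xn = rng.standard_normal(reps)                      # X_n ~ N(0, 1)
Yn = c + rng.standard_normal(reps) / np.sqrt(n)     # Y_n ->p c
prod = Xn * Yn
print("mean ~", round(prod.mean(), 3), " sd ~", round(prod.std(), 3), " (target sd = |c| = 2)")
```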

Definition 6.1.19:
Let $\{X_n\}_{n=1}^\infty$ be a sequence of rv's such that $E(|X_n|^r) < \infty$ for some $r > 0$. We say that $X_n$ converges in the $r^{th}$ mean to a rv $X$ ($X_n \xrightarrow{r} X$) if $E(|X|^r) < \infty$ and
$$\lim_{n\to\infty} E(|X_n - X|^r) = 0.$$

Note:
The special cases $r = 1$ and $r = 2$ are called convergence in absolute mean for $r = 1$ ($X_n \xrightarrow{1} X$) and convergence in mean square for $r = 2$ ($X_n \xrightarrow{ms} X$ or $X_n \xrightarrow{2} X$).

Theorem 6.1.21:
Assume that $X_n \xrightarrow{r} X$ for some $r > 0$. Then $X_n \xrightarrow{p} X$.

Proof:
Using Markov's Inequality (Corollary 3.5.2), it holds for any $\epsilon > 0$:
$$\frac{E(|X_n - X|^r)}{\epsilon^r} \ge P(|X_n - X| \ge \epsilon) \ge P(|X_n - X| > \epsilon)$$
$$X_n \xrightarrow{r} X \implies \lim_{n\to\infty} E(|X_n - X|^r) = 0 \implies \lim_{n\to\infty} P(|X_n - X| > \epsilon) \le \lim_{n\to\infty} \frac{E(|X_n - X|^r)}{\epsilon^r} = 0 \implies X_n \xrightarrow{p} X$$

Example 6.1.22:
Let $\{X_n\}_{n=1}^\infty$ be a sequence of rv's defined by $P(X_n = 0) = 1 - \frac{1}{n^r}$ and $P(X_n = n) = \frac{1}{n^r}$ for some $r > 0$.
For any $\epsilon > 0$, $P(|X_n| > \epsilon) \to 0$ as $n \to \infty$; so $X_n \xrightarrow{p} 0$.
For $0 < s < r$, $E(|X_n|^s) = \frac{1}{n^{r-s}} \to 0$ as $n \to \infty$; so $X_n \xrightarrow{s} 0$. But $E(|X_n|^r) = 1 \not\to 0$ as $n \to \infty$; so $X_n \not\xrightarrow{r} 0$.

Theorem 6.1.23:
If $X_n \xrightarrow{r} X$, then it holds:
(i) $\lim_{n\to\infty} E(|X_n|^r) = E(|X|^r)$; and
(ii) $X_n \xrightarrow{s} X$ for $0 < s < r$.

Proof:
(i) For $0 < r \le 1$, it holds:
$$E(|X_n|^r) = E(|X_n - X + X|^r) \stackrel{(*)}{\le} E(|X_n - X|^r + |X|^r)$$
$$\implies E(|X_n|^r) - E(|X|^r) \le E(|X_n - X|^r) \implies \lim_{n\to\infty} E(|X_n|^r) - E(|X|^r) \le \lim_{n\to\infty} E(|X_n - X|^r) = 0$$
$$\implies \lim_{n\to\infty} E(|X_n|^r) \le E(|X|^r) \quad (A)$$
(*) holds due to Bronstein/Semendjajew (1986), page 36 (see Handout).
Similarly,
$$E(|X|^r) = E(|X - X_n + X_n|^r) \le E(|X_n - X|^r + |X_n|^r)$$
$$\implies E(|X|^r) - E(|X_n|^r) \le E(|X_n - X|^r) \implies E(|X|^r) - \lim_{n\to\infty} E(|X_n|^r) \le \lim_{n\to\infty} E(|X_n - X|^r) = 0$$
$$\implies E(|X|^r) \le \lim_{n\to\infty} E(|X_n|^r) \quad (B)$$
Combining (A) and (B) gives
$$\lim_{n\to\infty} E(|X_n|^r) = E(|X|^r).$$
For $r > 1$, it follows from Minkowski's Inequality (Theorem 4.8.3):
$$[E(|X - X_n + X_n|^r)]^{\frac{1}{r}} \le [E(|X - X_n|^r)]^{\frac{1}{r}} + [E(|X_n|^r)]^{\frac{1}{r}}$$
$$\implies [E(|X|^r)]^{\frac{1}{r}} - [E(|X_n|^r)]^{\frac{1}{r}} \le [E(|X - X_n|^r)]^{\frac{1}{r}}$$
$$\implies [E(|X|^r)]^{\frac{1}{r}} - \lim_{n\to\infty} [E(|X_n|^r)]^{\frac{1}{r}} \le \lim_{n\to\infty} [E(|X_n - X|^r)]^{\frac{1}{r}} = 0 \ \text{since } X_n \xrightarrow{r} X$$
$$\implies [E(|X|^r)]^{\frac{1}{r}} \le \lim_{n\to\infty} [E(|X_n|^r)]^{\frac{1}{r}} \quad (C)$$
Similarly,
$$[E(|X_n - X + X|^r)]^{\frac{1}{r}} \le [E(|X_n - X|^r)]^{\frac{1}{r}} + [E(|X|^r)]^{\frac{1}{r}}$$
$$\implies \lim_{n\to\infty} [E(|X_n|^r)]^{\frac{1}{r}} - [E(|X|^r)]^{\frac{1}{r}} \le \lim_{n\to\infty} [E(|X_n - X|^r)]^{\frac{1}{r}} = 0 \ \text{since } X_n \xrightarrow{r} X$$
$$\implies \lim_{n\to\infty} [E(|X_n|^r)]^{\frac{1}{r}} \le [E(|X|^r)]^{\frac{1}{r}} \quad (D)$$
Combining (C) and (D) gives
$$\lim_{n\to\infty} [E(|X_n|^r)]^{\frac{1}{r}} = [E(|X|^r)]^{\frac{1}{r}} \implies \lim_{n\to\infty} E(|X_n|^r) = E(|X|^r)$$

Lecture 42/1: Fr 12/08/00

(ii) For $1 \le s < r$, it follows from Lyapunov's Inequality (Theorem 3.5.4):
$$[E(|X_n - X|^s)]^{\frac{1}{s}} \le [E(|X_n - X|^r)]^{\frac{1}{r}} \implies E(|X_n - X|^s) \le [E(|X_n - X|^r)]^{\frac{s}{r}}$$
$$\implies \lim_{n\to\infty} E(|X_n - X|^s) \le \lim_{n\to\infty} [E(|X_n - X|^r)]^{\frac{s}{r}} = 0 \ \text{since } X_n \xrightarrow{r} X \implies X_n \xrightarrow{s} X$$
An additional proof is required for $0 < s < r < 1$.

Definition 6.1.24:
Let $\{X_n\}_{n=1}^\infty$ be a sequence of rv's on $(\Omega, L, P)$. We say that $X_n$ converges almost surely to a rv $X$ ($X_n \xrightarrow{a.s.} X$), or $X_n$ converges with probability 1 to $X$ ($X_n \xrightarrow{w.p.1} X$), or $X_n$ converges strongly to $X$, iff
$$P(\{\omega : X_n(\omega) \to X(\omega) \text{ as } n \to \infty\}) = 1.$$

Note:
An interesting characterization of convergence with probability 1 and convergence in probability can be found in Parzen (1960), "Modern Probability Theory and Its Applications", page 416 (see Handout).

Example 6.1.25:
Let $\Omega = [0, 1]$ and $P$ the uniform distribution on $\Omega$. Let $X_n(\omega) = \omega + \omega^n$ and $X(\omega) = \omega$.
For $\omega \in [0, 1)$, $\omega^n \to 0$ as $n \to \infty$. So $X_n(\omega) \to X(\omega) \ \forall \omega \in [0, 1)$.
However, for $\omega = 1$, $X_n(1) = 2 \ne 1 = X(1) \ \forall n$, i.e., convergence fails at $\omega = 1$.
Anyway, since $P(\{\omega : X_n(\omega) \to X(\omega) \text{ as } n \to \infty\}) = P(\{\omega \in [0, 1)\}) = 1$, it is $X_n \xrightarrow{a.s.} X$.

Theorem 6.1.26:
$X_n \xrightarrow{a.s.} X \implies X_n \xrightarrow{p} X$.

Proof:
Choose $\epsilon > 0$ and $\delta > 0$. Find $n_0 = n_0(\epsilon, \delta)$ such that
$$P\Big(\bigcap_{n=n_0}^{\infty} \{|X_n - X| \le \epsilon\}\Big) \ge 1 - \delta.$$
Since $\bigcap_{n=n_0}^{\infty} \{|X_n - X| \le \epsilon\} \subseteq \{|X_n - X| \le \epsilon\} \ \forall n \ge n_0$, it is
$$P(\{|X_n - X| \le \epsilon\}) \ge P\Big(\bigcap_{n=n_0}^{\infty} \{|X_n - X| \le \epsilon\}\Big) \ge 1 - \delta \quad \forall n \ge n_0.$$
Therefore, $P(\{|X_n - X| \le \epsilon\}) \to 1$ as $n \to \infty$. Thus, $X_n \xrightarrow{p} X$.

Example 6.1.27:
$X_n \xrightarrow{p} X \not\Longrightarrow X_n \xrightarrow{a.s.} X$:
Let $\Omega = (0, 1]$ and $P$ the uniform distribution on $\Omega$. Define $A_n$ by
$$A_1 = (0, \tfrac{1}{2}], \ A_2 = (\tfrac{1}{2}, 1]$$
$$A_3 = (0, \tfrac{1}{4}], \ A_4 = (\tfrac{1}{4}, \tfrac{1}{2}], \ A_5 = (\tfrac{1}{2}, \tfrac{3}{4}], \ A_6 = (\tfrac{3}{4}, 1]$$
$$A_7 = (0, \tfrac{1}{8}], \ A_8 = (\tfrac{1}{8}, \tfrac{1}{4}], \ \ldots$$
Let $X_n(\omega) = I_{A_n}(\omega)$.
It is $P(|X_n - 0| \ge \epsilon) \to 0 \ \forall \epsilon > 0$ since $X_n$ is 0 except on $A_n$ and $P(A_n) \downarrow 0$. Thus $X_n \xrightarrow{p} 0$.
But $P(\{\omega : X_n(\omega) \to 0\}) = 0$ (and not 1) because any $\omega$ keeps being in some $A_n$ beyond any $n_0$, i.e., the sequence $X_n(\omega)$ looks like $0 \ldots 0 1 0 \ldots 0 1 0 \ldots 0 1 0 \ldots$, so $X_n \not\xrightarrow{a.s.} 0$.
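The "sliding indicator" construction of Example 6.1.27 can be simulated directly. The sketch below (Python with numpy assumed) tracks one fixed $\omega$: the indicators keep returning to 1, even though $P(X_n = 1)$ shrinks to 0.

```python
# Example 6.1.27: X_n = I_{A_n} with dyadic intervals A_n; X_n ->p 0 but not a.s.
import numpy as np

def X(n, omega):
    # map n = 1, 2, 3, ... to its dyadic interval A_n = ((j-1)/2^k, j/2^k]
    k, start = 1, 1
    while n >= start + 2**k:          # locate the block k containing index n
        start += 2**k
        k += 1
    j = n - start + 1                 # position of A_n within block k
    return float((j - 1) / 2**k < omega <= j / 2**k)

rng = np.random.default_rng(2)
omega = rng.uniform(0.01, 1.0)
xs = [X(n, omega) for n in range(1, 2000)]
print("indices n < 2000 with X_n(omega) = 1:", [i + 1 for i, v in enumerate(xs) if v == 1])
# P(X_n = 1) is the length of A_n, i.e. 1/2^k, which tends to 0 as n grows.
```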

Example 6.1.28:
$X_n \xrightarrow{r} X \not\Longrightarrow X_n \xrightarrow{a.s.} X$:
Let $X_n$ be independent rv's such that $P(X_n = 0) = 1 - \frac{1}{n}$ and $P(X_n = 1) = \frac{1}{n}$.
It is $E(|X_n - 0|^r) = E(|X_n|^r) = E(|X_n|) = \frac{1}{n} \to 0$ as $n \to \infty$, so $X_n \xrightarrow{r} 0 \ \forall r > 0$ (and, due to Theorem 6.1.21, also $X_n \xrightarrow{p} 0$).
But
$$P(X_n = 0 \ \forall \, m \le n \le n_0) = \prod_{n=m}^{n_0} \Big(1 - \frac{1}{n}\Big) = \Big(\frac{m-1}{m}\Big)\Big(\frac{m}{m+1}\Big)\Big(\frac{m+1}{m+2}\Big) \cdots \Big(\frac{n_0 - 2}{n_0 - 1}\Big)\Big(\frac{n_0 - 1}{n_0}\Big) = \frac{m-1}{n_0}$$
As $n_0 \to \infty$, it is $P(X_n = 0 \ \forall \, m \le n \le n_0) \to 0 \ \forall m$, so $X_n \not\xrightarrow{a.s.} 0$.

Example 6.1.29:
$X_n \xrightarrow{a.s.} X \not\Longrightarrow X_n \xrightarrow{r} X$:
Let $\Omega = [0, 1]$ and $P$ the uniform distribution on $\Omega$. Let $A_n = [0, \frac{1}{\ln n}]$. Let $X_n(\omega) = n I_{A_n}(\omega)$ and $X(\omega) = 0$.
It holds that $\forall \omega > 0 \ \exists n_0 : \frac{1}{\ln n_0} < \omega \implies X_n(\omega) = 0 \ \forall n > n_0$, and $P(\omega = 0) = 0$. Thus, $X_n \xrightarrow{a.s.} 0$.
But $E(|X_n - 0|^r) = \frac{n^r}{\ln n} \to \infty \ \forall r > 0$, so $X_n \not\xrightarrow{r} X$.

Lecture 39: Fr 12/01/00

6.2 Weak Laws of Large Numbers

Theorem 6.2.1: WLLN: Version I
Let $\{X_i\}_{i=1}^\infty$ be a sequence of iid rv's with mean $E(X_i) = \mu$ and variance $Var(X_i) = \sigma^2 < \infty$. Then it holds that
$$\lim_{n\to\infty} P(|\bar{X}_n - \mu| \ge \epsilon) = 0 \quad \forall \epsilon > 0,$$
i.e., $\bar{X}_n \xrightarrow{p} \mu$.

Proof:
By Markov's Inequality (Corollary 3.5.2), it holds for all $\epsilon > 0$:
$$P(|\bar{X}_n - \mu| \ge \epsilon) \le \frac{E((\bar{X}_n - \mu)^2)}{\epsilon^2} = \frac{Var(\bar{X}_n)}{\epsilon^2} = \frac{\sigma^2}{n\epsilon^2} \longrightarrow 0 \text{ as } n \to \infty$$

Note:
For iid rv's with finite variance, $\bar{X}_n$ is consistent for $\mu$.
A more general way to derive a "WLLN" follows in the next Definition.

Definition 6.2.2:
Let $\{X_i\}_{i=1}^\infty$ be a sequence of rv's. Let $T_n = \sum_{i=1}^n X_i$. We say that $\{X_i\}$ obeys the WLLN with respect to a sequence of norming constants $\{B_i\}_{i=1}^\infty$, $B_i > 0$, $B_i \uparrow \infty$, if there exists a sequence of centering constants $\{A_i\}_{i=1}^\infty$ such that
$$B_n^{-1}(T_n - A_n) \xrightarrow{p} 0.$$

Theorem 6.2.3:
Let $\{X_i\}_{i=1}^\infty$ be a sequence of pairwise uncorrelated rv's with $E(X_i) = \mu_i$ and $Var(X_i) = \sigma_i^2$, $i \in \mathbb{N}$. If $\sum_{i=1}^n \sigma_i^2 \to \infty$ as $n \to \infty$, we can choose $A_n = \sum_{i=1}^n \mu_i$ and $B_n = \sum_{i=1}^n \sigma_i^2$ and get
$$\frac{\sum_{i=1}^n (X_i - \mu_i)}{\sum_{i=1}^n \sigma_i^2} \xrightarrow{p} 0.$$

Proof:
By Markov's Inequality (Corollary 3.5.2), it holds for all $\epsilon > 0$:
$$P\Big(\Big|\sum_{i=1}^n X_i - \sum_{i=1}^n \mu_i\Big| > \epsilon \sum_{i=1}^n \sigma_i^2\Big) \le \frac{E\big(\big(\sum_{i=1}^n (X_i - \mu_i)\big)^2\big)}{\epsilon^2 \big(\sum_{i=1}^n \sigma_i^2\big)^2} = \frac{1}{\epsilon^2 \sum_{i=1}^n \sigma_i^2} \longrightarrow 0 \text{ as } n \to \infty$$

Note:
To obtain Theorem 6.2.1, we choose $A_n = n\mu$ and $B_n = n\sigma^2$.

Theorem 6.2.4:
Let $\{X_i\}_{i=1}^\infty$ be a sequence of rv's. Let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$. A necessary and sufficient condition for $\{X_i\}$ to obey the WLLN with respect to $B_n = n$ is that
$$E\left(\frac{\bar{X}_n^2}{1 + \bar{X}_n^2}\right) \to 0 \quad \text{as } n \to \infty.$$

Proof:
Rohatgi, page 258, Theorem 2, and Rohatgi/Saleh, page 275, Theorem 2.

Example 6.2.5:
Let $(X_1, \ldots, X_n)$ be jointly Normal with $E(X_i) = 0$, $E(X_i^2) = 1$ for all $i$, and $Cov(X_i, X_j) = \rho$ if $|i - j| = 1$ and $Cov(X_i, X_j) = 0$ if $|i - j| > 1$.
Let $T_n = \sum_{i=1}^n X_i$. Then $T_n \sim N(0, n + 2(n-1)\rho) = N(0, \sigma^2)$. It is
$$E\left(\frac{\bar{X}_n^2}{1 + \bar{X}_n^2}\right) = E\left(\frac{T_n^2}{n^2 + T_n^2}\right) = \frac{2}{\sqrt{2\pi}\,\sigma} \int_0^\infty \frac{x^2}{n^2 + x^2} e^{-\frac{x^2}{2\sigma^2}} dx \qquad \Big|\ y = \frac{x}{\sigma}, \ dy = \frac{dx}{\sigma}$$
$$= \frac{2}{\sqrt{2\pi}} \int_0^\infty \frac{\sigma^2 y^2}{n^2 + \sigma^2 y^2} e^{-\frac{y^2}{2}} dy = \frac{2}{\sqrt{2\pi}} \int_0^\infty \frac{(n + 2(n-1)\rho)\, y^2}{n^2 + (n + 2(n-1)\rho)\, y^2} e^{-\frac{y^2}{2}} dy$$
$$\le \frac{n + 2(n-1)\rho}{n^2} \underbrace{\int_0^\infty \frac{2}{\sqrt{2\pi}}\, y^2 e^{-\frac{y^2}{2}} dy}_{=1, \text{ since Var of } N(0,1) \text{ distribution}} \longrightarrow 0 \text{ as } n \to \infty \implies \bar{X}_n \xrightarrow{p} 0$$

Note:
We would like to have a WLLN that just depends on means but does not depend on the existence of finite variances. To approach this, we consider the following:
Let $\{X_i\}_{i=1}^\infty$ be a sequence of rv's. Let $T_n = \sum_{i=1}^n X_i$. We truncate each $|X_i|$ at $c > 0$ and get
$$X_i^c = \begin{cases} X_i, & |X_i| \le c \\ 0, & \text{otherwise} \end{cases}$$
Let $T_n^c = \sum_{i=1}^n X_i^c$ and $m_n = \sum_{i=1}^n E(X_i^c)$.

Lemma 6.2.6:
For $T_n$, $T_n^c$ and $m_n$ as defined in the Note above, it holds:
$$P(|T_n - m_n| > \epsilon) \le P(|T_n^c - m_n| > \epsilon) + \sum_{i=1}^n P(|X_i| > c) \quad \forall \epsilon > 0$$

Proof:
It holds for all $\epsilon > 0$:
$$P(|T_n - m_n| > \epsilon) = P(|T_n - m_n| > \epsilon \text{ and } |X_i| \le c \ \forall i \in \{1, \ldots, n\}) + P(|T_n - m_n| > \epsilon \text{ and } |X_i| > c \text{ for at least one } i \in \{1, \ldots, n\})$$
$$\stackrel{(*)}{\le} P(|T_n^c - m_n| > \epsilon) + P(|X_i| > c \text{ for at least one } i \in \{1, \ldots, n\}) \le P(|T_n^c - m_n| > \epsilon) + \sum_{i=1}^n P(|X_i| > c)$$
(*) holds since $T_n^c = T_n$ when $|X_i| \le c \ \forall i \in \{1, \ldots, n\}$.

Note:
If the $X_i$'s are identically distributed, then
$$P(|T_n - m_n| > \epsilon) \le P(|T_n^c - m_n| > \epsilon) + n P(|X_1| > c) \quad \forall \epsilon > 0.$$
If the $X_i$'s are iid, then
$$P(|T_n - m_n| > \epsilon) \le \frac{n E((X_1^c)^2)}{\epsilon^2} + n P(|X_1| > c) \quad \forall \epsilon > 0. \quad (*)$$
Note that $P(|X_i| > c) = P(|X_1| > c) \ \forall i \in \mathbb{N}$ if the $X_i$'s are identically distributed and that $E((X_i^c)^2) = E((X_1^c)^2) \ \forall i \in \mathbb{N}$ if the $X_i$'s are iid.

Lecture 42/2: Fr 12/08/00

Theorem 6.2.7: Khintchine's WLLN
Let $\{X_i\}_{i=1}^\infty$ be a sequence of iid rv's with finite mean $E(X_i) = \mu$. Then it holds:
$$\bar{X}_n = \frac{1}{n} T_n \xrightarrow{p} \mu$$

Proof:
If we take $c = n$ and replace $\epsilon$ by $n\epsilon$ in (*) in the Note above, we get
$$P\left(\left|\frac{T_n - m_n}{n}\right| > \epsilon\right) = P(|T_n - m_n| > n\epsilon) \le \frac{E((X_1^n)^2)}{n\epsilon^2} + n P(|X_1| > n).$$
Since $E(|X_1|) < \infty$, it is $n P(|X_1| > n) \to 0$ as $n \to \infty$ by Theorem 3.1.9. From Corollary 3.1.12 we know that $E(|X|^\alpha) = \alpha \int_0^\infty x^{\alpha - 1} P(|X| > x) dx$. Therefore,
$$E((X_1^n)^2) = 2 \int_0^n x P(|X_1^n| > x) dx = 2 \int_0^A x P(|X_1^n| > x) dx + 2 \int_A^n x P(|X_1^n| > x) dx \stackrel{(+)}{\le} K + \delta \int_A^n dx \le K + n\delta$$
In (+), $A$ is chosen sufficiently large such that $x P(|X_1^n| > x) < \frac{\delta}{2} \ \forall x \ge A$ for an arbitrary constant $\delta > 0$, and $K > 0$ is a constant.
Therefore,
$$\frac{E((X_1^n)^2)}{n\epsilon^2} \le \frac{K}{n\epsilon^2} + \frac{\delta}{\epsilon^2}$$
Since $\delta$ is arbitrary, we can make the right hand side of this last inequality arbitrarily small for sufficiently large $n$.
Since $E(X_i) = \mu \ \forall i$, it is
$$\frac{m_n}{n} = \frac{\sum_{i=1}^n E(X_i^n)}{n} \to \mu \text{ as } n \to \infty.$$

Note:
Theorem 6.2.7 meets the previously stated goal of not having a finite variance requirement.
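Khintchine's WLLN only needs a finite mean, which can be illustrated with a heavy–tailed distribution. The sketch below (Python with numpy assumed) uses Lomax/Pareto-II samples with shape $a = 1.5$, which have mean $1/(a-1) = 2$ but no finite variance, so Theorem 6.2.1 does not apply while Theorem 6.2.7 does.

```python
# Khintchine's WLLN with infinite variance: Xbar_n still converges to mu = 2.
import numpy as np

rng = np.random.default_rng(4)
a = 1.5            # Lomax shape; mean = 1/(a-1) = 2, variance does not exist
for n in [100, 10_000, 1_000_000]:
    print(n, "Xbar_n =", round(rng.pareto(a, size=n).mean(), 3))
```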

6.3 Strong Laws of Large Numbers

Definition 6.3.1:
Let $\{X_i\}_{i=1}^\infty$ be a sequence of rv's. Let $T_n = \sum_{i=1}^n X_i$. We say that $\{X_i\}$ obeys the SLLN with respect to a sequence of norming constants $\{B_i\}_{i=1}^\infty$, $B_i > 0$, $B_i \uparrow \infty$, if there exists a sequence of centering constants $\{A_i\}_{i=1}^\infty$ such that
$$B_n^{-1}(T_n - A_n) \xrightarrow{a.s.} 0.$$

Note:
Unless otherwise specified, we will only use the case that $B_n = n$ in this section.

Theorem 6.3.2:
$$X_n \xrightarrow{a.s.} X \iff \lim_{n\to\infty} P\big(\sup_{m \ge n} |X_m - X| > \epsilon\big) = 0 \ \forall \epsilon > 0.$$

Proof: (see also Rohatgi, page 249, Theorem 11)
WLOG, we can assume that $X = 0$ since $X_n \xrightarrow{a.s.} X$ implies $X_n - X \xrightarrow{a.s.} 0$. Thus, we have to prove:
$$X_n \xrightarrow{a.s.} 0 \iff \lim_{n\to\infty} P\big(\sup_{m \ge n} |X_m| > \epsilon\big) = 0 \ \forall \epsilon > 0$$
Choose $\epsilon > 0$ and define
$$A_n(\epsilon) = \{\sup_{m \ge n} |X_m| > \epsilon\}, \qquad C = \{\lim_{n\to\infty} X_n = 0\}$$
"$\Longrightarrow$": Since $X_n \xrightarrow{a.s.} 0$, we know that $P(C) = 1$ and therefore $P(C^c) = 0$.
Let $B_n(\epsilon) = C \cap A_n(\epsilon)$. Note that $B_{n+1}(\epsilon) \subseteq B_n(\epsilon)$ and for the limit set $\bigcap_{n=1}^\infty B_n(\epsilon) = \emptyset$. It follows that
$$\lim_{n\to\infty} P(B_n(\epsilon)) = P\Big(\bigcap_{n=1}^\infty B_n(\epsilon)\Big) = 0.$$
We also have
$$P(B_n(\epsilon)) = P(A_n \cap C) = 1 - P(C^c \cup A_n^c) = 1 - \underbrace{P(C^c)}_{=0} - P(A_n^c) + \underbrace{P(C^c \cap A_n^c)}_{=0} = P(A_n)$$
$$\implies \lim_{n\to\infty} P(A_n(\epsilon)) = 0$$
"$\Longleftarrow$": Assume that $\lim_{n\to\infty} P(A_n(\epsilon)) = 0 \ \forall \epsilon > 0$ and define $D(\epsilon) = \{\limsup_{n\to\infty} |X_n| > \epsilon\}$.
Since $D(\epsilon) \subseteq A_n(\epsilon) \ \forall n \in \mathbb{N}$, it follows that $P(D(\epsilon)) = 0 \ \forall \epsilon > 0$. Also,
$$C^c = \{\lim_{n\to\infty} X_n \ne 0\} \subseteq \bigcup_{k=1}^\infty \{\limsup_{n\to\infty} |X_n| > \tfrac{1}{k}\}.$$
$$\implies 1 - P(C) \le \sum_{k=1}^\infty P\big(D(\tfrac{1}{k})\big) = 0 \implies X_n \xrightarrow{a.s.} 0$$

Note:
(i) $X_n \xrightarrow{a.s.} 0$ implies that $\forall \epsilon > 0 \ \forall \delta > 0 \ \exists n_0 \in \mathbb{N} : P(\sup_{n \ge n_0} |X_n| > \epsilon) < \delta$.
(ii) Recall that for a given sequence of events $\{A_n\}_{n=1}^\infty$,
$$A = \limsup_{n\to\infty} A_n = \lim_{n\to\infty} \bigcup_{k=n}^\infty A_k = \bigcap_{n=1}^\infty \bigcup_{k=n}^\infty A_k$$
is the event that infinitely many of the $A_n$ occur. We write $P(A) = P(A_n \text{ i.o.})$, where i.o. stands for "infinitely often".
(iii) Using the terminology defined in (ii) above, we can rewrite Theorem 6.3.2 as
$$X_n \xrightarrow{a.s.} 0 \iff P(|X_n| > \epsilon \text{ i.o.}) = 0 \ \forall \epsilon > 0.$$

Lecture 02: We 01/10/01

Theorem 6.3.3: Borel–Cantelli Lemma
(i) 1st BC–Lemma: Let $\{A_n\}_{n=1}^\infty$ be a sequence of events such that $\sum_{n=1}^\infty P(A_n) < \infty$. Then $P(A) = P(A_n \text{ i.o.}) = 0$.
(ii) 2nd BC–Lemma: Let $\{A_n\}_{n=1}^\infty$ be a sequence of independent events such that $\sum_{n=1}^\infty P(A_n) = \infty$. Then $P(A) = P(A_n \text{ i.o.}) = 1$.

Proof of (ii):
For $n_0 > n$, independence gives
$$P\Big(\bigcap_{k=n}^{n_0} A_k^c\Big) = \prod_{k=n}^{n_0} (1 - P(A_k)) \stackrel{(+)}{\le} \lim_{n_0\to\infty} \exp\Big(-\sum_{k=n}^{n_0} P(A_k)\Big) = 0,$$
so $P\big(\bigcup_{k=n}^\infty A_k\big) = 1$ for every $n$
$$\implies P(A) = 1$$
(+) holds since
$$1 - \exp\Big(-\sum_{j=n}^{n_0} \alpha_j\Big) \le 1 - \prod_{j=n}^{n_0} (1 - \alpha_j) \le \sum_{j=n}^{n_0} \alpha_j \quad \text{for } n_0 > n \text{ and } 0 \le \alpha_j \le 1$$

Example 6.3.4:
Independence is necessary for the 2nd BC–Lemma:
Let $\Omega = (0, 1)$ and $P$ the uniform distribution on $\Omega$. Let $A_n = (0, \frac{1}{n})$. Therefore,
$$\sum_{n=1}^\infty P(A_n) = \sum_{n=1}^\infty \frac{1}{n} = \infty.$$
But for any $\omega \in \Omega$, $A_n$ occurs only for $n = 1, 2, \ldots, \lfloor\frac{1}{\omega}\rfloor$, where $\lfloor\frac{1}{\omega}\rfloor$ denotes the largest integer ("floor") that is $\le \frac{1}{\omega}$. Therefore, $P(A) = P(A_n \text{ i.o.}) = 0$.

Lemma 6.3.5: Kolmogorov's Inequality
Let $\{X_i\}_{i=1}^\infty$ be a sequence of independent rv's with common mean 0 and variances $\sigma_i^2$. Let $T_n = \sum_{i=1}^n X_i$. Then it holds:
$$P\big(\max_{1 \le k \le n} |T_k| \ge \epsilon\big) \le \frac{\sum_{i=1}^n \sigma_i^2}{\epsilon^2} \quad \forall \epsilon > 0$$

Proof:
See Rohatgi, page 268, Lemma 2, and Rohatgi/Saleh, page 284, Lemma 1.

Lemma 6.3.6: Kronecker's Lemma
For any real numbers $x_n$, if $\sum_{n=1}^\infty x_n$ converges to $s < \infty$ and $\{B_n\}$ is a sequence of constants with $B_n > 0$, $B_n \uparrow \infty$, then
$$\frac{1}{B_n} \sum_{k=1}^n B_k x_k \longrightarrow 0 \quad \text{as } n \to \infty.$$

Proof:
See Rohatgi, page 269, Lemma 3, and Rohatgi/Saleh, page 285, Lemma 2.

Theorem 6.3.7: Cauchy Criterion
$$X_n \xrightarrow{a.s.} X \iff \lim_{n\to\infty} P\big(\sup_m |X_{n+m} - X_n| \le \epsilon\big) = 1 \ \forall \epsilon > 0.$$

Proof:
See Rohatgi, page 270, Theorem 5.

Theorem 6.3.8:
Let $\{X_n\}_{n=1}^\infty$ be a sequence of independent rv's. If $\sum_{n=1}^\infty Var(X_n) < \infty$, then $\sum_{n=1}^\infty (X_n - E(X_n))$ converges a.s.

Corollary 6.3.9:
Let $\{B_i\}_{i=1}^\infty$, $B_i > 0$, $B_i \uparrow \infty$, be a sequence of norming constants. Let $T_n = \sum_{i=1}^n X_i$. If
$$\sum_{i=1}^\infty \frac{Var(X_i)}{B_i^2} < \infty,$$
then $B_n^{-1}(T_n - E(T_n)) \xrightarrow{a.s.} 0$.

Lemma 6.3.10:
Let $\{X_n\}_{n=1}^\infty$ and $\{X'_n\}_{n=1}^\infty$ be sequences of rv's with $\sum_{n=1}^\infty P(X_n \ne X'_n) < \infty$. Then $T_n = \sum_{i=1}^n X_i$ and $T'_n = \sum_{i=1}^n X'_i$ are convergence–equivalent, i.e., for norming constants $B_n \uparrow \infty$, $B_n^{-1}(T_n - T'_n) \xrightarrow{a.s.} 0$, so that $B_n^{-1} T_n$ and $B_n^{-1} T'_n$ converge (to the same limit) or diverge together with probability 1.

Lemma 6.3.11:
Let $X$ be a rv with $E(|X|) < \infty$. Then
$$\sum_{n=1}^\infty P(|X| \ge n) \le E(|X|) \le 1 + \sum_{n=1}^\infty P(|X| \ge n).$$

Proof:
See Rohatgi, page 265, Theorem 3.

Theorem 6.3.13: Kolmogorov's SLLN
Let $\{X_i\}_{i=1}^\infty$ be a sequence of iid rv's. Let $T_n = \sum_{i=1}^n X_i$. Then it holds:
$$\frac{T_n}{n} = \bar{X}_n \xrightarrow{a.s.} \mu \ \text{for some constant } \mu \iff E(|X_1|) < \infty, \ \text{and in that case } \mu = E(X_1).$$

Proof ("$\Longleftarrow$" direction):
Suppose $E(|X|) < \infty$, where $X$ has the common distribution of the $X_i$'s. Truncate each $X_n$ at $n$, i.e., define
$$X'_n = \begin{cases} X_n, & |X_n| < n \\ 0, & \text{otherwise} \end{cases} \qquad T'_n = \sum_{i=1}^n X'_i.$$
Then
$$\sum_{n=1}^\infty P(X_n \ne X'_n) = \sum_{n=1}^\infty P(|X_n| \ge n) \stackrel{iid}{=} \sum_{k=1}^\infty P(|X| \ge k) \stackrel{\text{Lemma 6.3.11}}{\le} E(|X|) < \infty$$
By Lemma 6.3.10, it follows that $T_n$ and $T'_n$ are convergence–equivalent. Thus, it is sufficient to prove that $\bar{X}'_n \xrightarrow{a.s.} E(X)$.
We now establish the conditions needed in Corollary 6.3.9. It is
$$Var(X'_n) \le E((X'_n)^2) = \int_{-n}^n x^2 f_X(x) dx = \sum_{k=0}^{n-1} \int_{k \le |x| < k+1} x^2 f_X(x) dx \le \sum_{k=0}^{n-1} (k+1)^2 P(k \le |X| < k+1)$$
$$\implies \sum_{n=1}^\infty \frac{Var(X'_n)}{n^2} \le \sum_{k=1}^\infty (k+1)^2 P(k \le |X| < k+1) \sum_{n=k}^\infty \frac{1}{n^2} + P(0 \le |X| < 1) \sum_{n=1}^\infty \frac{1}{n^2} \quad (A)$$
It is
$$\sum_{n=k}^\infty \frac{1}{n^2} = \frac{1}{k^2} + \frac{1}{(k+1)^2} + \frac{1}{(k+2)^2} + \ldots \le \frac{1}{k^2} + \frac{1}{k(k+1)} + \frac{1}{(k+1)(k+2)} + \ldots = \frac{1}{k^2} + \sum_{n=k+1}^\infty \frac{1}{n(n-1)}$$
From Bronstein, page 30, # 7, we know that
$$1 = \frac{1}{1 \cdot 2} + \frac{1}{2 \cdot 3} + \frac{1}{3 \cdot 4} + \ldots + \frac{1}{n(n+1)} + \ldots = \frac{1}{1 \cdot 2} + \frac{1}{2 \cdot 3} + \ldots + \frac{1}{(k-1) \cdot k} + \sum_{n=k+1}^\infty \frac{1}{n(n-1)}$$
$$\implies \sum_{n=k+1}^\infty \frac{1}{n(n-1)} = 1 - \frac{1}{1 \cdot 2} - \frac{1}{2 \cdot 3} - \frac{1}{3 \cdot 4} - \ldots - \frac{1}{(k-1) \cdot k} = \frac{1}{2} - \frac{1}{2 \cdot 3} - \frac{1}{3 \cdot 4} - \ldots - \frac{1}{(k-1) \cdot k}$$
$$= \frac{1}{3} - \frac{1}{3 \cdot 4} - \ldots - \frac{1}{(k-1) \cdot k} = \frac{1}{4} - \ldots - \frac{1}{(k-1) \cdot k} = \ldots = \frac{1}{k}$$
$$\implies \sum_{n=k}^\infty \frac{1}{n^2} \le \frac{1}{k^2} + \sum_{n=k+1}^\infty \frac{1}{n(n-1)} = \frac{1}{k^2} + \frac{1}{k} \le \frac{2}{k}$$
Using this result in (A), we get
$$\sum_{n=1}^\infty \frac{1}{n^2} Var(X'_n) \le 2 \sum_{k=1}^\infty \frac{(k+1)^2}{k} P(k \le |X| < k+1) + 2 P(0 \le |X| < 1)$$
$$= 2 \sum_{k=0}^\infty k P(k \le |X| < k+1) + 4 \sum_{k=1}^\infty P(k \le |X| < k+1) + 2 \sum_{k=1}^\infty \frac{1}{k} P(k \le |X| < k+1) + 2 P(0 \le |X| < 1) \stackrel{(B)}{\le} 2 E(|X|) + 4 + 2 + 2 < \infty$$
To establish (B), we use an inequality from the Proof of Lemma 6.3.11, i.e.,
$$\sum_{k=0}^\infty k P(k \le |X| < k+1) \stackrel{\text{Proof}}{\le} \sum_{n=1}^\infty P(|X| \ge n) \stackrel{\text{Lemma 6.3.11}}{\le} E(|X|)$$
Thus, the conditions needed in Corollary 6.3.9 are met. With $B_n = n$, it follows that
$$\frac{1}{n} T'_n - \frac{1}{n} E(T'_n) \xrightarrow{a.s.} 0 \quad (C)$$
Since $E(X'_n) \to E(X)$ as $n \to \infty$, it follows by Kronecker's Lemma (6.3.6) that $\frac{1}{n} E(T'_n) \to E(X)$. Thus, when we replace $\frac{1}{n} E(T'_n)$ by $E(X)$ in (C), we get
$$\frac{1}{n} T'_n \xrightarrow{a.s.} E(X) \stackrel{\text{Lemma 6.3.10}}{\Longrightarrow} \frac{1}{n} T_n \xrightarrow{a.s.} E(X)$$
since $T_n$ and $T'_n$ are convergence–equivalent (as defined in Lemma 6.3.10).

Lecture 04: We 01/17/01

6.4 Central Limit Theorems

Let $\{X_n\}_{n=1}^\infty$ be a sequence of rv's with cdf's $\{F_n\}_{n=1}^\infty$. Suppose that the mgf $M_n(t)$ of $X_n$ exists.
Questions: Does $M_n(t)$ converge? Does it converge to a mgf $M(t)$? If it does converge, does it hold that $X_n \xrightarrow{d} X$ for some rv $X$?

Example 6.4.1:
Let $\{X_n\}_{n=1}^\infty$ be a sequence of rv's such that $P(X_n = -n) = 1$. Then the mgf is $M_n(t) = E(e^{tX_n}) = e^{-tn}$. So
$$\lim_{n\to\infty} M_n(t) = \begin{cases} 0, & t > 0 \\ 1, & t = 0 \\ \infty, & t < 0 \end{cases}$$
So $M_n(t)$ does not converge to a mgf and $F_n(x) \to F(x) = 1 \ \forall x$. But $F(x)$ is not a cdf.

Note:
Due to Example 6.4.1, the existence of mgf's $M_n(t)$ that converge to something is not enough to conclude convergence in distribution.
Conversely, suppose that $X_n$ has mgf $M_n(t)$, $X$ has mgf $M(t)$, and $X_n \xrightarrow{d} X$. Does it hold that $M_n(t) \to M(t)$? Not necessarily! See Rohatgi, page 277, Example 2, and Rohatgi/Saleh, page 289, Example 2, as a counterexample. Thus, convergence in distribution of rv's that all have mgf's does not imply the convergence of mgf's.
However, we can make the following statement in the next Theorem:

Theorem 6.4.2: Continuity Theorem
Let $\{X_n\}_{n=1}^\infty$ be a sequence of rv's with cdf's $\{F_n\}_{n=1}^\infty$ and mgf's $\{M_n(t)\}_{n=1}^\infty$. Suppose that $M_n(t)$ exists for $|t| \le t_0 \ \forall n$. If there exists a rv $X$ with cdf $F$ and mgf $M(t)$, which exists for $|t| \le t_1 < t_0$, such that $\lim_{n\to\infty} M_n(t) = M(t) \ \forall t \in [-t_1, t_1]$, then $F_n \xrightarrow{w} F$, i.e., $X_n \xrightarrow{d} X$.

Example 6.4.3:
Let $X_n \sim Bin(n, \frac{\lambda}{n})$. Recall (e.g., from Theorem 3.3.12 and related Theorems) that for $X \sim Bin(n, p)$ the mgf is $M_X(t) = (1 - p + pe^t)^n$. Thus,
$$M_n(t) = \Big(1 - \frac{\lambda}{n} + \frac{\lambda}{n} e^t\Big)^n = \Big(1 + \frac{\lambda(e^t - 1)}{n}\Big)^n \stackrel{(*)}{\longrightarrow} e^{\lambda(e^t - 1)} \text{ as } n \to \infty.$$
In (*) we use the fact that $\lim_{n\to\infty} (1 + \frac{x}{n})^n = e^x$. Recall that $e^{\lambda(e^t - 1)}$ is the mgf of a rv $X$ where $X \sim Poisson(\lambda)$. Thus, we have established the well–known result that the Binomial distribution approaches the Poisson distribution, given that $n \to \infty$ in such a way that $np = \lambda > 0$.
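The Binomial-to-Poisson limit of Example 6.4.3 can be checked numerically. A minimal sketch (Python with scipy assumed) compares the two pmf's for $\lambda = 3$:

```python
# Example 6.4.3: Bin(n, lambda/n) pmf approaches the Poisson(lambda) pmf.
import numpy as np
from scipy.stats import binom, poisson

lam, ks = 3.0, np.arange(0, 10)
for n in [10, 100, 10_000]:
    err = np.max(np.abs(binom.pmf(ks, n, lam / n) - poisson.pmf(ks, lam)))
    print(n, "max |pmf difference| on k = 0..9:", round(float(err), 5))
```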

Note:
Recall Theorem 3.3.11: Suppose that $\{X_n\}_{n=1}^\infty$ is a sequence of rv's with characteristic functions $\{\Phi_n(t)\}_{n=1}^\infty$. Suppose that
$$\lim_{n\to\infty} \Phi_n(t) = \Phi(t) \ \forall t \in (-h, h) \ \text{for some } h > 0,$$
and $\Phi(t)$ is the characteristic function of a rv $X$. Then $X_n \xrightarrow{d} X$.

Theorem 6.4.4: Lindeberg–Lévy Central Limit Theorem
Let $\{X_n\}_{n=1}^\infty$ be a sequence of iid rv's with $E(X_i) = \mu$ and $0 < Var(X_i) = \sigma^2 < \infty$. Then it holds for $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$ that
$$\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \xrightarrow{d} Z$$
where $Z \sim N(0, 1)$.

Proof:
Let $Z \sim N(0, 1)$. According to Theorem 3.3.12 (v), the characteristic function of $Z$ is $\Phi_Z(t) = \exp(-\frac{1}{2}t^2)$.
Let $\Phi(t)$ be the characteristic function of $X_i$. We now determine the characteristic function $\Phi_n(t)$ of $\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma}$:
$$\Phi_n(t) = E\left[\exp\left(it\, \frac{\sqrt{n}\big(\frac{1}{n}\sum_{i=1}^n X_i - \mu\big)}{\sigma}\right)\right] = \int_{-\infty}^\infty \cdots \int_{-\infty}^\infty \exp\left(it\, \frac{\sqrt{n}\big(\frac{1}{n}\sum_{i=1}^n x_i - \mu\big)}{\sigma}\right) dF_{\underline{X}}(\underline{x})$$
$$= \exp\Big(-\frac{it\sqrt{n}\mu}{\sigma}\Big) \int_{-\infty}^\infty \exp\Big(\frac{itx_1}{\sqrt{n}\sigma}\Big) dF_{X_1}(x_1) \cdots \int_{-\infty}^\infty \exp\Big(\frac{itx_n}{\sqrt{n}\sigma}\Big) dF_{X_n}(x_n) = \left(\Phi\Big(\frac{t}{\sqrt{n}\sigma}\Big) \exp\Big(-\frac{it\mu}{\sqrt{n}\sigma}\Big)\right)^n$$
Recall from Theorem 3.3.5 that if the $k^{th}$ moment exists, then $\Phi^{(k)}(0) = i^k E(X^k)$. In particular, it holds for the given distribution that $\Phi^{(1)}(0) = iE(X) = i\mu$ and $\Phi^{(2)}(0) = i^2 E(X^2) = i^2(\mu^2 + \sigma^2) = -(\mu^2 + \sigma^2)$. Also recall the definition of a Taylor series in MacLaurin's form:
$$f(x) = f(0) + \frac{f'(0)}{1!}x + \frac{f''(0)}{2!}x^2 + \frac{f'''(0)}{3!}x^3 + \ldots + \frac{f^{(n)}(0)}{n!}x^n + \ldots,$$
e.g.,
$$f(x) = e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \ldots$$
Thus, if we develop a Taylor series for $\Phi\big(\frac{t}{\sqrt{n}\sigma}\big)$ around $t = 0$, we get:
$$\Phi\Big(\frac{t}{\sqrt{n}\sigma}\Big) = \Phi(0) + \frac{t}{\sqrt{n}\sigma}\Phi'(0) + \frac{1}{2}\Big(\frac{t}{\sqrt{n}\sigma}\Big)^2\Phi''(0) + \ldots = 1 + \frac{t\, i\mu}{\sqrt{n}\sigma} - \frac{1}{2} t^2 \frac{\mu^2 + \sigma^2}{n\sigma^2} + o\left(\Big(\frac{t}{\sqrt{n}\sigma}\Big)^2\right)$$
Here we make use of the Landau symbol "o". In general, if we write $u(x) = o(v(x))$ for $x \to L$, this implies $\lim_{x\to L} \frac{u(x)}{v(x)} = 0$, i.e., $u(x)$ goes to 0 faster than $v(x)$ or $v(x)$ goes to $\infty$ faster than $u(x)$. We say that $u(x)$ is of smaller order than $v(x)$ as $x \to L$. Examples are $\frac{1}{x^3} = o(\frac{1}{x^2})$ and $x^2 = o(x^3)$ for $x \to \infty$. See Rohatgi, page 6, for more details on the Landau symbols "O" and "o".
Similarly, if we develop a Taylor series for $\exp\big(-\frac{it\mu}{\sqrt{n}\sigma}\big)$ around $t = 0$, we get:
$$\exp\Big(-\frac{it\mu}{\sqrt{n}\sigma}\Big) = 1 - \frac{t\, i\mu}{\sqrt{n}\sigma} - \frac{1}{2} t^2 \frac{\mu^2}{n\sigma^2} + o\left(\Big(\frac{t}{\sqrt{n}\sigma}\Big)^2\right)$$
Combining these results, we get:
$$\Phi_n(t) = \left(\Big(1 + \frac{t\, i\mu}{\sqrt{n}\sigma} - \frac{1}{2}t^2\frac{\mu^2 + \sigma^2}{n\sigma^2} + o\Big(\big(\tfrac{t}{\sqrt{n}\sigma}\big)^2\Big)\Big)\Big(1 - \frac{t\, i\mu}{\sqrt{n}\sigma} - \frac{1}{2}t^2\frac{\mu^2}{n\sigma^2} + o\Big(\big(\tfrac{t}{\sqrt{n}\sigma}\big)^2\Big)\Big)\right)^n$$
$$= \left(1 - \frac{t\, i\mu}{\sqrt{n}\sigma} - \frac{1}{2}t^2\frac{\mu^2}{n\sigma^2} + \frac{t\, i\mu}{\sqrt{n}\sigma} + t^2\frac{\mu^2}{n\sigma^2} - \frac{1}{2}t^2\frac{\mu^2 + \sigma^2}{n\sigma^2} + o\Big(\big(\tfrac{t}{\sqrt{n}\sigma}\big)^2\Big)\right)^n$$
$$= \left(1 - \frac{1}{2}\frac{t^2}{n} + o\Big(\big(\tfrac{t}{\sqrt{n}\sigma}\big)^2\Big)\right)^n = \left(1 + \frac{-\frac{1}{2}t^2}{n} + o\Big(\frac{1}{n}\Big)\right)^n \stackrel{(*)}{\longrightarrow} \exp\Big(-\frac{t^2}{2}\Big) \text{ as } n \to \infty$$
Thus, $\lim_{n\to\infty} \Phi_n(t) = \Phi_Z(t) \ \forall t$. For a proof of (*), see Rohatgi, page 278, Lemma 1. According to the Note above, it holds that
$$\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \xrightarrow{d} Z.$$

Lecture 05: Fr 01/19/01

Definition 6.4.5:
Let $X_1, X_2$ be iid non–degenerate rv's with common cdf $F$. Let $a_1, a_2 > 0$. We say that $F$ is stable if there exist constants $A$ and $B$ (depending on $a_1$ and $a_2$) such that $B^{-1}(a_1 X_1 + a_2 X_2 - A)$ also has cdf $F$.

Note:
When generalizing the previous definition to sequences of rv's, we have the following examples for stable distributions:
• $X_i$ iid Cauchy. Then $\frac{1}{n}\sum_{i=1}^n X_i \sim$ Cauchy (here $B_n = n$, $A_n = 0$).
• $X_i$ iid $N(0, 1)$. Then $\frac{1}{\sqrt{n}}\sum_{i=1}^n X_i \sim N(0, 1)$ (here $B_n = \sqrt{n}$, $A_n = 0$).

Definition 6.4.6:
Let $\{X_i\}_{i=1}^\infty$ be a sequence of iid rv's with common cdf $F$. Let $T_n = \sum_{i=1}^n X_i$. $F$ belongs to the domain of attraction of a distribution $V$ if there exist norming and centering constants $\{B_n\}_{n=1}^\infty$, $B_n > 0$, and $\{A_n\}_{n=1}^\infty$ such that
$$P(B_n^{-1}(T_n - A_n) \le x) = F_{B_n^{-1}(T_n - A_n)}(x) \to V(x) \ \text{as } n \to \infty$$
at all continuity points $x$ of $V$.

Note:
A very general Theorem from Loève states that only stable distributions can have domains of attraction. From the practical point of view, a wide class of distributions $F$ belong to the domain of attraction of the Normal distribution.

Theorem 6.4.7: Lindeberg Central Limit Theorem
Let $\{X_i\}_{i=1}^\infty$ be a sequence of independent non–degenerate rv's with cdf's $\{F_i\}_{i=1}^\infty$. Assume that $E(X_k) = \mu_k$ and $Var(X_k) = \sigma_k^2 < \infty$, and let $s_n^2 = \sum_{k=1}^n \sigma_k^2$. If the $X_k$ are continuous rv's with pdf's $F'_k$, assume that it holds for all $\epsilon > 0$ that
$$(A) \quad \lim_{n\to\infty} \frac{1}{s_n^2} \sum_{k=1}^n \int_{\{|x - \mu_k| > \epsilon s_n\}} (x - \mu_k)^2 F'_k(x) dx = 0.$$
If the $X_k$ are discrete rv's with support $\{x_{kl}\}$ and probabilities $\{p_{kl}\}$, $l = 1, 2, \ldots$, assume that it holds for all $\epsilon > 0$ that
$$(B) \quad \lim_{n\to\infty} \frac{1}{s_n^2} \sum_{k=1}^n \sum_{|x_{kl} - \mu_k| > \epsilon s_n} (x_{kl} - \mu_k)^2 p_{kl} = 0.$$
The conditions (A) and (B) are called Lindeberg Condition (LC). If either LC holds, then
$$\frac{\sum_{k=1}^n (X_k - \mu_k)}{s_n} \xrightarrow{d} Z$$
where $Z \sim N(0, 1)$.

Proof:
Similar to the proof of Theorem 6.4.4, we can use characteristic functions again. An alternative proof is given in Rohatgi, pages 282–288.

Note:
Feller shows that the LC is a necessary condition if $\frac{\sigma_n^2}{s_n^2} \to 0$ and $s_n^2 \to \infty$ as $n \to \infty$.

Corollary 6.4.8:
Let $\{X_i\}_{i=1}^\infty$ be a sequence of iid rv's such that $\frac{1}{\sqrt{n}}\sum_{i=1}^n X_i$ has the same distribution for all $n$. If $E(X_i) = 0$ and $Var(X_i) = 1$, then $X_i \sim N(0, 1)$.

Proof:
Let $F$ be the common cdf of $\frac{1}{\sqrt{n}}\sum_{i=1}^n X_i$ for all $n$ (including $n = 1$). By the CLT,
$$\lim_{n\to\infty} P\Big(\frac{1}{\sqrt{n}}\sum_{i=1}^n X_i \le x\Big) = \Phi(x),$$
where $\Phi(x)$ denotes $P(Z \le x)$ for $Z \sim N(0,1)$. Also, $P\big(\frac{1}{\sqrt{n}}\sum_{i=1}^n X_i \le x\big) = F(x)$ for each $n$. Therefore, we must have $F(x) = \Phi(x)$.

Note:
In general, if $X_1, X_2, \ldots$ are independent rv's such that there exists a constant $A$ with $P(|X_n| \le A) = 1 \ \forall n$, then the LC is satisfied if $s_n^2 \to \infty$ as $n \to \infty$. Why??
Suppose that $s_n^2 \to \infty$ as $n \to \infty$. Since the $|X_k|$'s are uniformly bounded (by $A$), so are the rv's $(X_k - E(X_k))$. Thus, for every $\epsilon > 0$ there exists an $N_\epsilon$ such that if $n \ge N_\epsilon$ then
$$P(|X_k - E(X_k)| < \epsilon s_n, \ k = 1, \ldots, n) = 1.$$
This implies that the LC holds since we would integrate (or sum) over the empty set, i.e., the set $\{|x - \mu_k| > \epsilon s_n\} = \emptyset$.
The converse also holds. For a sequence of uniformly bounded independent rv's, a necessary and sufficient condition for the CLT to hold is that $s_n^2 \to \infty$ as $n \to \infty$.

Example 6.4.9:
Let $\{X_i\}_{i=1}^\infty$ be a sequence of independent rv's such that $E(X_k) = 0$, $\alpha_k = E(|X_k|^{2+\delta}) < \infty$ for some $\delta > 0$, and $\sum_{k=1}^n \alpha_k = o(s_n^{2+\delta})$.
Does the LC hold? It is:
$$\frac{1}{s_n^2} \sum_{k=1}^n \int_{\{|x| > \epsilon s_n\}} x^2 f_k(x) dx \stackrel{(A)}{\le} \frac{1}{s_n^2} \sum_{k=1}^n \int_{\{|x| > \epsilon s_n\}} \frac{|x|^{2+\delta}}{\epsilon^\delta s_n^\delta} f_k(x) dx \le \frac{1}{s_n^2 \epsilon^\delta s_n^\delta} \sum_{k=1}^n \int_{-\infty}^\infty |x|^{2+\delta} f_k(x) dx$$
$$= \frac{1}{s_n^2 \epsilon^\delta s_n^\delta} \sum_{k=1}^n \alpha_k = \frac{1}{\epsilon^\delta} \frac{\sum_{k=1}^n \alpha_k}{s_n^{2+\delta}} \stackrel{(B)}{\longrightarrow} 0 \ \text{as } n \to \infty$$
(A) holds since for $|x| > \epsilon s_n$, it is $\frac{|x|^\delta}{\epsilon^\delta s_n^\delta} > 1$. (B) holds since $\sum_{k=1}^n \alpha_k = o(s_n^{2+\delta})$.
Thus, the LC is satisfied and the CLT holds.

Note:
(i) In general, if there exists a $\delta > 0$ such that
$$\frac{\sum_{k=1}^n E(|X_k - \mu_k|^{2+\delta})}{s_n^{2+\delta}} \longrightarrow 0 \ \text{as } n \to \infty,$$
then the LC holds.
(ii) Both the CLT and the WLLN hold for a large class of sequences of rv's $\{X_i\}_{i=1}^n$. If the $\{X_i\}$'s are independent uniformly bounded rv's, i.e., if $P(|X_n| \le M) = 1 \ \forall n$, the WLLN (as formulated in Theorem 6.2.3) holds. The CLT holds provided that $s_n^2 \to \infty$ as $n \to \infty$.
If the rv's $\{X_i\}$ are iid, then the CLT is a stronger result than the WLLN since the CLT provides an estimate of the probability $P\big(\frac{1}{n}|\sum_{i=1}^n X_i - n\mu| \ge \epsilon\big) \approx 1 - P\big(|Z| \le \frac{\epsilon\sqrt{n}}{\sigma}\big)$, where $Z \sim N(0,1)$, and the WLLN follows. However, note that the CLT requires the existence of a 2nd moment while the WLLN does not.
(iii) If the $\{X_i\}$ are independent (but not identically distributed) rv's, the CLT may apply while the WLLN does not.
(iv) See Rohatgi, pages 289–293, and Rohatgi/Saleh, pages 299–303, for additional details and examples.

7 Sample Moments

7.1 Random Sampling

(Based on Casella/Berger, Sections 5.1 & 5.2)

Definition 7.1.1:
Let $X_1, \ldots, X_n$ be iid rv's with common cdf $F$. We say that $\{X_1, \ldots, X_n\}$ is a (random) sample of size $n$ from the population distribution $F$. The vector of values $\{x_1, \ldots, x_n\}$ is called a realization of the sample. A rv $g(X_1, \ldots, X_n)$ which is a Borel–measurable function of $X_1, \ldots, X_n$ and does not depend on any unknown parameter is called a (sample) statistic.

Definition 7.1.2:
Let $X_1, \ldots, X_n$ be a sample of size $n$ from a population with distribution $F$. Then
$$\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$$
is called the sample mean and
$$S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2 = \frac{1}{n-1}\Big(\sum_{i=1}^n X_i^2 - n\bar{X}^2\Big)$$
is called the sample variance.

Definition 7.1.3:
Let $X_1, \ldots, X_n$ be a sample of size $n$ from a population with distribution $F$. The function
$$\hat{F}_n(x) = \frac{1}{n}\sum_{i=1}^n I_{(-\infty, x]}(X_i)$$
is called the empirical cumulative distribution function (empirical cdf).

Note:
For any fixed $x \in \mathbb{R}$, $\hat{F}_n(x)$ is a rv.

Theorem 7.1.4:
The rv $\hat{F}_n(x)$ has pmf
$$P\Big(\hat{F}_n(x) = \frac{j}{n}\Big) = \binom{n}{j} (F(x))^j (1 - F(x))^{n-j}, \quad j \in \{0, 1, \ldots, n\},$$

with $E(\hat{F}_n(x)) = F(x)$ and $Var(\hat{F}_n(x)) = \frac{F(x)(1 - F(x))}{n}$.

Proof:
It is $I_{(-\infty, x]}(X_i) \sim Bin(1, F(x))$. Then $n\hat{F}_n(x) \sim Bin(n, F(x))$. The results follow immediately.

Corollary 7.1.5:
By the WLLN, it follows that
$$\hat{F}_n(x) \xrightarrow{p} F(x).$$

Corollary 7.1.6:
By the CLT, it follows that
$$\frac{\sqrt{n}(\hat{F}_n(x) - F(x))}{\sqrt{F(x)(1 - F(x))}} \xrightarrow{d} Z,$$
where $Z \sim N(0, 1)$.

Theorem 7.1.7: Glivenko–Cantelli Theorem
$\hat{F}_n(x)$ converges uniformly to $F(x)$, i.e., it holds for all $\epsilon > 0$ that
$$\lim_{n\to\infty} P\Big(\sup_{-\infty < x < \infty} |\hat{F}_n(x) - F(x)| > \epsilon\Big) = 0.$$

Definition 7.1.8:
Let $X_1, \ldots, X_n$ be a sample of size $n$ from a population with distribution $F$. The $k^{th}$ sample moment (about the origin) is $a_k = \frac{1}{n}\sum_{i=1}^n X_i^k$, and the $k^{th}$ sample central moment is $b_k = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^k$. In particular, $a_1 = \bar{X}$ and $b_2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2 = \frac{n-1}{n}S^2$.

Theorem 7.1.9:
Let $X_1, \ldots, X_n$ be a sample of size $n$ from a population with distribution $F$. Assume that $E(X) = \mu$, $Var(X) = \sigma^2$, and $E((X - \mu)^k) = \mu_k$ exist. Then it holds:
(i) $E(a_1) = E(\bar{X}) = \mu$
(ii) $Var(a_1) = Var(\bar{X}) = \frac{\sigma^2}{n}$
(iii) $E(b_2) = \frac{n-1}{n}\sigma^2$
(iv) $Var(b_2) = \frac{\mu_4 - \mu_2^2}{n} - \frac{2(\mu_4 - 2\mu_2^2)}{n^2} + \frac{\mu_4 - 3\mu_2^2}{n^3}$
(v) $E(S^2) = \sigma^2$
(vi) $Var(S^2) = \frac{\mu_4}{n} - \frac{n-3}{n(n-1)}\mu_2^2$

Proof:
(i)
$$E(\bar{X}) = \frac{1}{n}\sum_{i=1}^n E(X_i) = \frac{n}{n}\mu = \mu$$
(ii)
$$Var(\bar{X}) = \Big(\frac{1}{n}\Big)^2 \sum_{i=1}^n Var(X_i) = \frac{\sigma^2}{n}$$
(iii)
$$E(b_2) = E\Big(\frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2\Big) = E\Big(\frac{1}{n}\sum_{i=1}^n X_i^2 - \frac{1}{n^2}\Big(\sum_{i=1}^n X_i\Big)^2\Big) = E(X^2) - \frac{1}{n^2} E\Big(\sum_{i=1}^n X_i^2 + \sum\sum_{i \ne j} X_i X_j\Big)$$
$$\stackrel{(*)}{=} E(X^2) - \frac{1}{n^2}\big(n E(X^2) + n(n-1)\mu^2\big) = \frac{n-1}{n}\big(E(X^2) - \mu^2\big) = \frac{n-1}{n}\sigma^2$$
(*) holds since $X_i$ and $X_j$ are independent and then, due to Theorem 4.5.3, it holds that $E(X_i X_j) = E(X_i)E(X_j)$.
See Casella/Berger, page 214, and Rohatgi, pages 303–306, for the proof of parts (iv) through (vi) and results regarding the 3rd and 4th moments and covariances.

7.2 Sample Moments and the Normal Distribution

(Based on Casella/Berger, Section 5.3)

Theorem 7.2.1:
Let $X_1, \ldots, X_n$ be iid $N(\mu, \sigma^2)$ rv's. Then $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$ and $(X_1 - \bar{X}, \ldots, X_n - \bar{X})$ are independent.

Proof:
By computing the joint mgf of $(\bar{X}, X_1 - \bar{X}, X_2 - \bar{X}, \ldots, X_n - \bar{X})$, we can use Theorem 4.6.3 (iv) to show independence. We will use the following two facts:
(1):
$$M_{\bar{X}}(t) = M_{\frac{1}{n}\sum_{i=1}^n X_i}(t) \stackrel{(A)}{=} \prod_{i=1}^n M_{X_i}\Big(\frac{t}{n}\Big) \stackrel{(B)}{=} \Big[\exp\Big(\frac{t}{n}\mu + \frac{\sigma^2 t^2}{2n^2}\Big)\Big]^n = \exp\Big(t\mu + \frac{\sigma^2 t^2}{2n}\Big)$$
(A) holds by Theorem 4.6.4 (i). (B) follows from Theorem 3.3.12 (vi) since the $X_i$'s are iid.

(2):
$$M_{X_1 - \bar{X}, X_2 - \bar{X}, \ldots, X_n - \bar{X}}(t_1, t_2, \ldots, t_n) \stackrel{\text{Def.4.6.1}}{=} E\Big[\exp\Big(\sum_{i=1}^n t_i(X_i - \bar{X})\Big)\Big] = E\Big[\exp\Big(\sum_{i=1}^n t_i X_i - \bar{X}\sum_{i=1}^n t_i\Big)\Big]$$
$$= E\Big[\exp\Big(\sum_{i=1}^n X_i(t_i - \bar{t})\Big)\Big], \ \text{where } \bar{t} = \frac{1}{n}\sum_{i=1}^n t_i$$
$$= E\Big[\prod_{i=1}^n \exp\big(X_i(t_i - \bar{t})\big)\Big] \stackrel{(C)}{=} \prod_{i=1}^n E\big(\exp(X_i(t_i - \bar{t}))\big) = \prod_{i=1}^n M_{X_i}(t_i - \bar{t}) \stackrel{(D)}{=} \prod_{i=1}^n \exp\Big(\mu(t_i - \bar{t}) + \frac{\sigma^2(t_i - \bar{t})^2}{2}\Big)$$
$$= \exp\Big(\mu\underbrace{\sum_{i=1}^n (t_i - \bar{t})}_{=0} + \frac{\sigma^2}{2}\sum_{i=1}^n (t_i - \bar{t})^2\Big) = \exp\Big(\frac{\sigma^2}{2}\sum_{i=1}^n (t_i - \bar{t})^2\Big)$$
(C) follows from Theorem 4.5.3 since the $X_i$'s are independent. (D) holds since we evaluate $M_X(h) = \exp(\mu h + \frac{\sigma^2 h^2}{2})$ for $h = t_i - \bar{t}$.

From (1) and (2), it follows:
$$M_{\bar{X}, X_1 - \bar{X}, \ldots, X_n - \bar{X}}(t, t_1, \ldots, t_n) \stackrel{\text{Def.4.6.1}}{=} E\big[\exp\big(t\bar{X} + t_1(X_1 - \bar{X}) + \ldots + t_n(X_n - \bar{X})\big)\big]$$
$$= E\big[\exp\big(t\bar{X} + t_1 X_1 - t_1\bar{X} + \ldots + t_n X_n - t_n\bar{X}\big)\big] = E\Big[\exp\Big(\sum_{i=1}^n X_i t_i - \Big(\sum_{i=1}^n t_i - t\Big)\bar{X}\Big)\Big]$$
$$= E\Big[\exp\Big(\sum_{i=1}^n X_i t_i - (t_1 + \ldots + t_n - t)\frac{\sum_{i=1}^n X_i}{n}\Big)\Big] = E\Big[\exp\Big(\sum_{i=1}^n X_i\Big(t_i - \frac{t_1 + \ldots + t_n - t}{n}\Big)\Big)\Big]$$
$$= E\Big[\prod_{i=1}^n \exp\Big(X_i\frac{nt_i - n\bar{t} + t}{n}\Big)\Big], \ \text{where } \bar{t} = \frac{1}{n}\sum_{i=1}^n t_i$$
$$\stackrel{(E)}{=} \prod_{i=1}^n E\Big[\exp\Big(X_i\frac{t + n(t_i - \bar{t})}{n}\Big)\Big] = \prod_{i=1}^n M_{X_i}\Big(\frac{t + n(t_i - \bar{t})}{n}\Big) \stackrel{(F)}{=} \prod_{i=1}^n \exp\Big(\frac{\mu[t + n(t_i - \bar{t})]}{n} + \frac{\sigma^2}{2}\frac{1}{n^2}[t + n(t_i - \bar{t})]^2\Big)$$
$$= \exp\Big(\frac{\mu}{n}\Big(nt + n\underbrace{\sum_{i=1}^n (t_i - \bar{t})}_{=0}\Big) + \frac{\sigma^2}{2n^2}\sum_{i=1}^n \big(t + n(t_i - \bar{t})\big)^2\Big)$$
$$= \exp(\mu t)\exp\Big(\frac{\sigma^2}{2n^2}\Big(nt^2 + 2nt\underbrace{\sum_{i=1}^n (t_i - \bar{t})}_{=0} + n^2\sum_{i=1}^n (t_i - \bar{t})^2\Big)\Big) = \exp\Big(\mu t + \frac{\sigma^2}{2n}t^2\Big)\exp\Big(\frac{\sigma^2}{2}\sum_{i=1}^n (t_i - \bar{t})^2\Big)$$
$$\stackrel{(1)\&(2)}{=} M_{\bar{X}}(t)\, M_{X_1 - \bar{X}, \ldots, X_n - \bar{X}}(t_1, \ldots, t_n)$$
Thus, $\bar{X}$ and $(X_1 - \bar{X}, \ldots, X_n - \bar{X})$ are independent by Theorem 4.6.3 (iv). (E) follows from Theorem 4.5.3 since the $X_i$'s are independent. (F) holds since we evaluate $M_X(h) = \exp(\mu h + \frac{\sigma^2 h^2}{2})$ for $h = \frac{t + n(t_i - \bar{t})}{n}$.

Corollary 7.2.2:
$\bar{X}$ and $S^2$ are independent.

Proof:
This can be seen since $S^2$ is a function of the vector $(X_1 - \bar{X}, \ldots, X_n - \bar{X})$, and $(X_1 - \bar{X}, \ldots, X_n - \bar{X})$ is independent of $\bar{X}$, as previously shown in Theorem 7.2.1. We can use Theorem 4.2.7 to formally complete this proof.

Corollary 7.2.3:
$$\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}.$$

Proof:
Recall the following facts:
• If $Z \sim N(0,1)$, then $Z^2 \sim \chi^2_1$.
• If $Y_1, \ldots, Y_n \sim$ iid $\chi^2_1$, then $\sum_{i=1}^n Y_i \sim \chi^2_n$.
• For $\chi^2_n$, the mgf is $M(t) = (1 - 2t)^{-n/2}$.
• If $X_i \sim N(\mu, \sigma^2)$, then $\frac{X_i - \mu}{\sigma} \sim N(0,1)$ and $\frac{(X_i - \mu)^2}{\sigma^2} \sim \chi^2_1$.
Therefore,
$$\sum_{i=1}^n \frac{(X_i - \mu)^2}{\sigma^2} \sim \chi^2_n \quad \text{and} \quad \frac{(\bar{X} - \mu)^2}{(\sigma/\sqrt{n})^2} = \frac{n(\bar{X} - \mu)^2}{\sigma^2} \sim \chi^2_1. \quad (*)$$

Now consider
$$\sum_{i=1}^n (X_i - \mu)^2 = \sum_{i=1}^n \big((X_i - \bar{X}) + (\bar{X} - \mu)\big)^2 = \sum_{i=1}^n \big((X_i - \bar{X})^2 + 2(X_i - \bar{X})(\bar{X} - \mu) + (\bar{X} - \mu)^2\big) = (n-1)S^2 + 0 + n(\bar{X} - \mu)^2$$
Therefore,
$$\underbrace{\sum_{i=1}^n \frac{(X_i - \mu)^2}{\sigma^2}}_{W} = \underbrace{\frac{n(\bar{X} - \mu)^2}{\sigma^2}}_{U} + \underbrace{\frac{(n-1)S^2}{\sigma^2}}_{V}$$
We have an expression of the form $W = U + V$.
Since $U$ and $V$ are functions of $\bar{X}$ and $S^2$, we know by Corollary 7.2.2 that they are independent and also that their mgf's factor by Theorem 4.6.3 (iv). Now we can write:
$$M_W(t) = M_U(t) M_V(t) \implies M_V(t) = \frac{M_W(t)}{M_U(t)} \stackrel{(*)}{=} \frac{(1 - 2t)^{-n/2}}{(1 - 2t)^{-1/2}} = (1 - 2t)^{-(n-1)/2}$$
Note that this is the mgf of $\chi^2_{n-1}$ by the uniqueness of mgf's. Thus, $V = \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$.
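Corollary 7.2.3 (and, implicitly, Corollary 7.2.2) can be checked by simulation. A minimal sketch (Python with numpy/scipy assumed):

```python
# (n-1) S^2 / sigma^2 should behave like a chi-square rv with n-1 degrees of freedom.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(7)
mu, sigma, n, reps = 5.0, 2.0, 10, 100_000
x = rng.normal(mu, sigma, size=(reps, n))
v = (n - 1) * x.var(axis=1, ddof=1) / sigma**2
print("mean ~", round(v.mean(), 3), " (chi2 mean = n-1 =", n - 1, ")")
print("var  ~", round(v.var(), 3), " (chi2 var = 2(n-1) =", 2 * (n - 1), ")")
print("P(V <= 95% quantile) ~", round((v <= chi2.ppf(0.95, n - 1)).mean(), 4))
```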

Corollary 7.2.4:
$$\frac{\sqrt{n}(\bar{X} - \mu)}{S} \sim t_{n-1}.$$

Proof:
Recall the following facts:
• If $Z \sim N(0,1)$, $Y \sim \chi^2_n$ and $Z$, $Y$ are independent, then $\frac{Z}{\sqrt{Y/n}} \sim t_n$.
• $Z_1 = \frac{\sqrt{n}(\bar{X} - \mu)}{\sigma} \sim N(0,1)$, $Y_{n-1} = \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$, and $Z_1$, $Y_{n-1}$ are independent.
Therefore,
$$\frac{\sqrt{n}(\bar{X} - \mu)}{S} = \frac{\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}}{\frac{S/\sqrt{n}}{\sigma/\sqrt{n}}} = \frac{\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}}{\sqrt{\frac{S^2(n-1)}{\sigma^2(n-1)}}} = \frac{Z_1}{\sqrt{\frac{Y_{n-1}}{n-1}}} \sim t_{n-1}.$$

Corollary 7.2.5:
Let $(X_1, \ldots, X_m) \sim$ iid $N(\mu_1, \sigma_1^2)$ and $(Y_1, \ldots, Y_n) \sim$ iid $N(\mu_2, \sigma_2^2)$. Let $X_i, Y_j$ be independent $\forall i, j$. Then it holds:
$$\frac{\bar{X} - \bar{Y} - (\mu_1 - \mu_2)}{\sqrt{[(m-1)S_1^2/\sigma_1^2] + [(n-1)S_2^2/\sigma_2^2]}} \cdot \sqrt{\frac{m + n - 2}{\sigma_1^2/m + \sigma_2^2/n}} \sim t_{m+n-2}$$
In particular, if $\sigma_1 = \sigma_2$, then:
$$\frac{\bar{X} - \bar{Y} - (\mu_1 - \mu_2)}{\sqrt{(m-1)S_1^2 + (n-1)S_2^2}} \cdot \sqrt{\frac{mn(m + n - 2)}{m + n}} \sim t_{m+n-2}$$

Proof:
Homework.

Corollary 7.2.6:
Let $(X_1, \ldots, X_m) \sim$ iid $N(\mu_1, \sigma_1^2)$ and $(Y_1, \ldots, Y_n) \sim$ iid $N(\mu_2, \sigma_2^2)$. Let $X_i, Y_j$ be independent $\forall i, j$. Then it holds:
$$\frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} \sim F_{m-1, n-1}$$
In particular, if $\sigma_1 = \sigma_2$, then:
$$\frac{S_1^2}{S_2^2} \sim F_{m-1, n-1}$$

Proof:
Recall that, if $Y_1 \sim \chi^2_m$ and $Y_2 \sim \chi^2_n$ are independent, then
$$F = \frac{Y_1/m}{Y_2/n} \sim F_{m, n}.$$
Now, $C_1 = \frac{(m-1)S_1^2}{\sigma_1^2} \sim \chi^2_{m-1}$ and $C_2 = \frac{(n-1)S_2^2}{\sigma_2^2} \sim \chi^2_{n-1}$. Therefore,
$$\frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} = \frac{\frac{(m-1)S_1^2}{\sigma_1^2(m-1)}}{\frac{(n-1)S_2^2}{\sigma_2^2(n-1)}} = \frac{C_1/(m-1)}{C_2/(n-1)} \sim F_{m-1, n-1}.$$
If $\sigma_1 = \sigma_2$, then $\frac{S_1^2}{S_2^2} \sim F_{m-1, n-1}$.

  • Lecture 07:

    We 01/24/01

    8 The Theory of Point Estimation

    (Based on Casella/Berger, Chapters 6 & 7)

    8.1 The Problem of Point Estimation

    Let X be a rv defined on a probability space (Ω, L, P ). Suppose that the cdf F of X depends

    on some set of parameters and that the functional form of F is known except for a finite

    number of these parameters.

    Definition 8.1.1:

    The set of admissible values of θ is called the parameter space Θ. If Fθ is the cdf of X

    when θ is the parameter, the set {Fθ : θ ∈ Θ} is the family of cdf’s. Likewise, we speak ofthe family of pdf’s if X is continuous, and the family of pmf’s if X is discrete.

    Example 8.1.2:

    X ∼ Bin(n, p), p unknown. Then θ = p and Θ = {p : 0 < p < 1}.

    X ∼ N(µ, σ2), (µ, σ2) unknown. Then θ = (µ, σ2) and Θ = {(µ, σ2) : −∞ < µ 0}.

    Definition 8.1.3:

Let X be a sample from Fθ, θ ∈ Θ ⊆ IR. Let a statistic T(X) map IR^n to Θ. We call T(X) an estimator of θ, and T(x), for a realization x of X, a (point) estimate of θ. In practice, the term estimate is used for both.

Example 8.1.4:

Let X1, . . . , Xn be iid Bin(1, p), p unknown. Estimates of p include:

T1(X) = X̄,   T2(X) = X1,   T3(X) = 1/2,   T4(X) = (X1 + X2)/3

Obviously, not all estimates are equally good.
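One informal way to see this is to estimate the mean squared error of T1, . . . , T4 by simulation for one particular p; the sketch below is an addition to the notes (assuming numpy) and anticipates the MSE criterion of Definition 8.4.4.

```python
# Added illustration (assumes numpy): simulated MSE of the four estimates of p.
import numpy as np

rng = np.random.default_rng(3)
n, p, reps = 20, 0.3, 100_000
x = rng.binomial(1, p, size=(reps, n))

estimates = {
    "T1 = sample mean": x.mean(axis=1),
    "T2 = X1": x[:, 0].astype(float),
    "T3 = 1/2": np.full(reps, 0.5),
    "T4 = (X1+X2)/3": (x[:, 0] + x[:, 1]) / 3.0,
}
for name, t in estimates.items():
    print(name, "MSE approx.", np.mean((t - p) ** 2))
```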


8.2 Properties of Estimates

Definition 8.2.1:

Let {Xi}_{i=1}^∞ be a sequence of iid rv's with cdf Fθ, θ ∈ Θ. A sequence of point estimates Tn(X1, . . . , Xn) = Tn is called

• (weakly) consistent for θ if Tn →_p θ as n → ∞ ∀θ ∈ Θ

• strongly consistent for θ if Tn →_{a.s.} θ as n → ∞ ∀θ ∈ Θ

• consistent in the rth mean for θ if Tn →_r θ as n → ∞ ∀θ ∈ Θ

Example 8.2.2:

Let {Xi}_{i=1}^∞ be a sequence of iid Bin(1, p) rv's. Let X̄n = (1/n) ∑_{i=1}^n Xi. Since E(Xi) = p, it follows by the WLLN that X̄n →_p p, i.e., consistency, and by the SLLN that X̄n →_{a.s.} p, i.e., strong consistency.

However, a consistent estimate may not be unique. We may even have infinitely many consistent estimates, e.g.,

(∑_{i=1}^n Xi + a) / (n + b) →_p p   ∀ finite a, b ∈ IR.
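As a numerical illustration (added here, assuming numpy), one can follow the estimate (∑Xi + a)/(n + b) along a single growing Bernoulli sequence and watch it settle near p:

```python
# Added illustration (assumes numpy): consistency of (sum(X_i) + a)/(n + b).
import numpy as np

rng = np.random.default_rng(4)
p, a, b = 0.3, 5.0, 10.0
x = rng.binomial(1, p, size=100_000)

csum = np.cumsum(x)
n = np.arange(1, x.size + 1)
t_n = (csum + a) / (n + b)           # biased for small n, but consistent

for k in (10, 100, 1_000, 100_000):
    print(k, t_n[k - 1])             # approaches p = 0.3 as k grows
```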

Theorem 8.2.3:

If Tn is a sequence of estimates such that E(Tn) → θ and Var(Tn) → 0 as n → ∞, then Tn is consistent for θ.

Proof:

P(|Tn − θ| > ε) (A)≤ E((Tn − θ)²)/ε²

= E[((Tn − E(Tn)) + (E(Tn) − θ))²]/ε²

= [Var(Tn) + 2E[(Tn − E(Tn))(E(Tn) − θ)] + (E(Tn) − θ)²]/ε²

= [Var(Tn) + (E(Tn) − θ)²]/ε²   (the cross term vanishes since E(Tn − E(Tn)) = 0)

(B)→ 0 as n → ∞


(A) holds due to Corollary 3.5.2 (Markov's Inequality). (B) holds since Var(Tn) → 0 as n → ∞ and E(Tn) → θ as n → ∞.

Definition 8.2.4:

Let G be a group of Borel–measurable functions of IR^n onto itself which is closed under composition and inverse. A family of distributions {Pθ : θ ∈ Θ} is invariant under G if for each g ∈ G and for all θ ∈ Θ, there exists a unique θ′ = g(θ) such that the distribution of g(X) is Pθ′ whenever the distribution of X is Pθ. We call g the induced function on θ since

Pθ(g(X) ∈ A) = P_{g(θ)}(X ∈ A).

Example 8.2.5:

Let (X1, . . . , Xn) be iid N(µ, σ²) with pdf

f(x1, . . . , xn) = (1/(√(2π) σ)^n) exp(−(1/(2σ²)) ∑_{i=1}^n (xi − µ)²).

The group of linear transformations G has elements

g(x1, . . . , xn) = (ax1 + b, . . . , axn + b),   a > 0, −∞ < b < ∞.

Definition 8.2.7:

An estimate T is:

• location invariant if T(X1 + a, . . . , Xn + a) = T(X1, . . . , Xn), a ∈ IR

• scale invariant if T(cX1, . . . , cXn) = T(X1, . . . , Xn), c ∈ IR − {0}

• permutation invariant if T(X_{i1}, . . . , X_{in}) = T(X1, . . . , Xn) ∀ permutations (i1, . . . , in) of 1, . . . , n

    Example 8.2.8:

Let Fθ ∼ N(µ, σ²).

S² is location invariant.

X̄ and S² are both permutation invariant.

Neither X̄ nor S² is scale invariant.

    Note:

    Different sources make different use of the term invariant. Mood, Graybill & Boes (1974)

    for example define location invariant as T (X1 + a, . . . ,Xn + a) = T (X1, . . . ,Xn) + a (page

    332) and scale invariant as T (cX1, . . . , cXn) = cT (X1, . . . ,Xn) (page 336). According to their

definition, X̄ is location invariant and scale invariant.


8.3 Sufficient Statistics

(Based on Casella/Berger, Section 6.2)

Definition 8.3.1:

Let X = (X1, . . . , Xn) be a sample from {Fθ : θ ∈ Θ ⊆ IR^k}. A statistic T = T(X) is sufficient for θ (or for the family of distributions {Fθ : θ ∈ Θ}) iff the conditional distribution of X given T = t does not depend on θ (except possibly on a null set A where

Pθ(T ∈ A) = 0 ∀θ).

    Note:

    (i) The sample X is always sufficient but this is not particularly interesting and usually is

    excluded from further considerations.

    (ii) Idea: Once we have “reduced” from X to T (X), we have captured all the information

    in X about θ.

    (iii) Usually, there are several sufficient statistics for a given family of distributions.

Example 8.3.2:

Let X = (X1, . . . , Xn) be iid Bin(1, p) rv's. To estimate p, can we ignore the order and simply count the number of “successes”?

Let T(X) = ∑_{i=1}^n Xi. It is

P(X1 = x1, . . . , Xn = xn | ∑_{i=1}^n Xi = t) = P(X1 = x1, . . . , Xn = xn, T = t) / P(T = t)

= P(X1 = x1, . . . , Xn = xn) / P(T = t)   if ∑_{i=1}^n xi = t, and 0 otherwise

= p^t (1 − p)^{n−t} / [(n choose t) p^t (1 − p)^{n−t}]   if ∑_{i=1}^n xi = t, and 0 otherwise

= 1/(n choose t)   if ∑_{i=1}^n xi = t, and 0 otherwise.


This does not depend on p. Thus, T = ∑_{i=1}^n Xi is sufficient for p.
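The meaning of sufficiency can also be checked empirically. The sketch below (an addition to the notes, assuming numpy) conditions on T = t and estimates the probability of one particular arrangement of successes; by the calculation above it should be 1/(n choose t) for every p.

```python
# Added illustration (assumes numpy): conditional on T = t, all arrangements are
# equally likely, no matter what p is.
import numpy as np
from math import comb

rng = np.random.default_rng(5)
n, t = 4, 2

for p in (0.2, 0.7):
    x = rng.binomial(1, p, size=(500_000, n))
    cond = x[x.sum(axis=1) == t]                       # keep samples with T = t
    freq = np.mean(np.all(cond == np.array([1, 1, 0, 0]), axis=1))
    print(p, freq, 1 / comb(n, t))                     # both close to 1/6
```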

Example 8.3.3:

Let X = (X1, . . . , Xn) be iid Poisson(λ). Is T = ∑_{i=1}^n Xi sufficient for λ? Recalling that T ∼ Poisson(nλ), it is

P(X1 = x1, . . . , Xn = xn | T = t) = P(X1 = x1, . . . , Xn = xn, T = t) / P(T = t)

= [∏_{i=1}^n e^{−λ} λ^{xi}/xi!] / [e^{−nλ} (nλ)^t/t!]   if ∑_{i=1}^n xi = t, and 0 otherwise

= [e^{−nλ} λ^{∑xi} / ∏ xi!] / [e^{−nλ} (nλ)^t/t!]   if ∑_{i=1}^n xi = t, and 0 otherwise

= t! / (n^t ∏_{i=1}^n xi!)   if ∑_{i=1}^n xi = t, and 0 otherwise

This does not depend on λ. Thus, T = ∑_{i=1}^n Xi is sufficient for λ.

Example 8.3.4:

Let X1, X2 be iid Poisson(λ). Is T = X1 + 2X2 sufficient for λ? It is

P(X1 = 0, X2 = 1 | X1 + 2X2 = 2) = P(X1 = 0, X2 = 1, X1 + 2X2 = 2) / P(X1 + 2X2 = 2)

= P(X1 = 0, X2 = 1) / P(X1 + 2X2 = 2)

= P(X1 = 0, X2 = 1) / [P(X1 = 0, X2 = 1) + P(X1 = 2, X2 = 0)]

= e^{−λ}(e^{−λ}λ) / [e^{−λ}(e^{−λ}λ) + (e^{−λ}λ²/2)e^{−λ}]

= 1/(1 + λ/2),


i.e., this is a counter–example. This expression still depends on λ. Thus, T = X1 + 2X2 is not sufficient for λ.
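For contrast, the dependence on λ can be made concrete numerically; this sketch (added here, assuming scipy) evaluates the conditional probability exactly for several λ and matches the formula 1/(1 + λ/2) derived above.

```python
# Added illustration (assumes scipy): the conditional probability in Example 8.3.4
# changes with lambda, so T = X1 + 2*X2 cannot be sufficient.
from scipy.stats import poisson

def cond_prob(lam):
    """P(X1=0, X2=1 | X1 + 2*X2 = 2) for X1, X2 iid Poisson(lam)."""
    p01 = poisson.pmf(0, lam) * poisson.pmf(1, lam)    # (X1, X2) = (0, 1)
    p20 = poisson.pmf(2, lam) * poisson.pmf(0, lam)    # (X1, X2) = (2, 0)
    return p01 / (p01 + p20)

for lam in (0.5, 1.0, 4.0):
    print(lam, cond_prob(lam), 1 / (1 + lam / 2))      # the two columns agree
```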

    Note:

    Definition 8.3.1 can be difficult to check. In addition, it requires a candidate statistic. We

    need something constructive that helps in finding sufficient statistics without having to check

    Definition 8.3.1. The next Theorem helps in finding such statistics.

Lecture 08:

Fr 01/26/01

Theorem 8.3.5: Factorization Criterion

Let X1, . . . , Xn be rv's with pdf (or pmf) f(x1, . . . , xn | θ), θ ∈ Θ. Then T(X1, . . . , Xn) is sufficient for θ iff we can write

f(x1, . . . , xn | θ) = h(x1, . . . , xn) g(T(x1, . . . , xn) | θ),

where h does not depend on θ and g does not depend on x1, . . . , xn except as a function of T.

Proof:

Discrete case only.

“=⇒”: Suppose T(X) is sufficient for θ. Let

g(t | θ) = Pθ(T(X) = t)

h(x) = P(X = x | T(X) = t)

Then it holds:

f(x | θ) = Pθ(X = x) (∗)= Pθ(X = x, T(X) = T(x) = t)

= Pθ(T(X) = t) P(X = x | T(X) = t)

= g(t | θ) h(x)

(∗) holds since X = x implies that T(X) = T(x) = t.

“⇐=”: Suppose the factorization holds. For fixed t0, it is

Pθ(T(X) = t0) = ∑_{x : T(x)=t0} Pθ(X = x)


= ∑_{x : T(x)=t0} h(x) g(T(x) | θ)

= g(t0 | θ) ∑_{x : T(x)=t0} h(x)   (A)

If Pθ(T(X) = t0) > 0, it holds:

Pθ(X = x | T(X) = t0) = Pθ(X = x, T(X) = t0) / Pθ(T(X) = t0)

= Pθ(X = x) / Pθ(T(X) = t0)   if T(x) = t0, and 0 otherwise

(A)= g(t0 | θ) h(x) / [g(t0 | θ) ∑_{x′ : T(x′)=t0} h(x′)]   if T(x) = t0, and 0 otherwise

= h(x) / ∑_{x′ : T(x′)=t0} h(x′)   if T(x) = t0, and 0 otherwise

This last expression does not depend on θ. Thus, T(X) is sufficient for θ.

    Note:

    (i) In the Theorem above, θ and T may be vectors.

    (ii) If T is sufficient for θ, then also any 1–to–1 mapping of T is sufficient for θ. However,

    this does not hold for arbitrary functions of T .

Example 8.3.6:

Let X1, . . . , Xn be iid Bin(1, p). It is

P(X1 = x1, . . . , Xn = xn | p) = p^{∑xi} (1 − p)^{n−∑xi}.

Thus, h(x1, . . . , xn) = 1 and g(∑xi | p) = p^{∑xi} (1 − p)^{n−∑xi}.

Hence, T = ∑_{i=1}^n Xi is sufficient for p.


Example 8.3.7:

Let X1, . . . , Xn be iid Poisson(λ). It is

P(X1 = x1, . . . , Xn = xn | λ) = ∏_{i=1}^n e^{−λ} λ^{xi}/xi! = e^{−nλ} λ^{∑xi} / ∏ xi!.

Thus, h(x1, . . . , xn) = 1/∏ xi! and g(∑xi | λ) = e^{−nλ} λ^{∑xi}.

Hence, T = ∑_{i=1}^n Xi is sufficient for λ.

Example 8.3.8:

Let X1, . . . , Xn be iid N(µ, σ²) where µ ∈ IR and σ² > 0 are both unknown. It is

f(x1, . . . , xn | µ, σ²) = (1/(√(2π) σ)^n) exp(−∑(xi − µ)²/(2σ²)) = (1/(√(2π) σ)^n) exp(−∑xi²/(2σ²) + µ ∑xi/σ² − nµ²/(2σ²)).

Hence, T = (∑_{i=1}^n Xi, ∑_{i=1}^n Xi²) is sufficient for (µ, σ²).
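A small numerical illustration of the Factorization Criterion in this example (added here, assuming numpy): two samples sharing the same value of T = (∑xi, ∑xi²) have identical normal likelihoods for every (µ, σ²), while a sample with a different T does not.

```python
# Added illustration (assumes numpy): the N(mu, sigma^2) likelihood depends on the
# data only through (sum x_i, sum x_i^2).
import numpy as np

def normal_loglik(x, mu, sigma2):
    """Log-likelihood of an iid N(mu, sigma2) sample x."""
    x = np.asarray(x)
    n = x.size
    return -n / 2 * np.log(2 * np.pi * sigma2) - np.sum((x - mu) ** 2) / (2 * sigma2)

x1 = np.array([1.0, 2.0, 3.0])                          # sum = 6, sum of squares = 14
r = np.sqrt(3.25)
x2 = np.array([1.5, (4.5 + r) / 2, (4.5 - r) / 2])      # also sum = 6, sum of squares = 14
x3 = np.array([0.0, 3.0, 3.0])                          # sum = 6, but sum of squares = 18

for mu, s2 in [(0.0, 1.0), (2.0, 0.5)]:
    print(normal_loglik(x1, mu, s2), normal_loglik(x2, mu, s2), normal_loglik(x3, mu, s2))
    # the first two values coincide, the third differs
```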

Example 8.3.9:

Let X1, . . . , Xn be iid U(θ, θ + 1) where −∞ < θ < ∞.

Example 8.3.11:

Let X1, . . . , Xn be iid Bin(1, p). We have seen in Example 8.3.6 that T = ∑_{i=1}^n Xi is sufficient for p. Is it also complete?

We know that T ∼ Bin(n, p). Thus,

Ep(g(T)) = ∑_{t=0}^n g(t) (n choose t) p^t (1 − p)^{n−t} = 0 ∀p ∈ (0, 1)

implies that

(1 − p)^n ∑_{t=0}^n g(t) (n choose t) (p/(1 − p))^t = 0 ∀p ∈ (0, 1).

However, ∑_{t=0}^n g(t) (n choose t) (p/(1 − p))^t is a polynomial in p/(1 − p) which is equal to 0 for all p ∈ (0, 1) only if all of its coefficients are 0.

Therefore, g(t) = 0 for t = 0, 1, . . . , n. Hence, T is complete.

Lecture 09:

Mo 01/29/01

Example 8.3.12:

Let X1, . . . , Xn be iid N(θ, θ²). We know from Example 8.3.8 that T = (∑_{i=1}^n Xi, ∑_{i=1}^n Xi²) is sufficient for θ. Is it also complete?

We know that ∑_{i=1}^n Xi ∼ N(nθ, nθ²). Therefore,

E((∑_{i=1}^n Xi)²) = nθ² + n²θ² = n(n + 1)θ²

E(∑_{i=1}^n Xi²) = n(θ² + θ²) = 2nθ²

It follows that

E(2(∑_{i=1}^n Xi)² − (n + 1) ∑_{i=1}^n Xi²) = 0 ∀θ.

But g(x1, . . . , xn) = 2(∑_{i=1}^n xi)² − (n + 1) ∑_{i=1}^n xi² is not identically 0.

Therefore, T is not complete.
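The key identity Eθ(g(X)) = 0 with g not identically zero can be checked by simulation; the following sketch is an addition to the notes and assumes numpy.

```python
# Added illustration (assumes numpy): E[2*(sum X_i)^2 - (n+1)*sum X_i^2] = 0 under
# every theta for X_i iid N(theta, theta^2), although g is not the zero function.
import numpy as np

rng = np.random.default_rng(6)
n, reps = 5, 1_000_000

for theta in (0.5, 1.0, 3.0):
    x = rng.normal(theta, abs(theta), size=(reps, n))    # sd = |theta|
    g = 2 * x.sum(axis=1) ** 2 - (n + 1) * (x ** 2).sum(axis=1)
    print(theta, g.mean())                               # close to 0, up to Monte Carlo error
```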

Note:

Recall from Section 5.2 what it means if we say the family of distributions {fθ : θ ∈ Θ} is a one–parameter (or k–parameter) exponential family.


Theorem 8.3.13:

Let {fθ : θ ∈ Θ} be a k–parameter exponential family. Let T1, . . . , Tk be statistics. Then the family of distributions of (T1(X), . . . , Tk(X)) is also a k–parameter exponential family given by

gθ(t) = exp(∑_{i=1}^k ti Qi(θ) + D(θ) + S∗(t))

for suitable S∗(t).

Proof:

The proof follows from our Theorems regarding the transformation of rv's.

Theorem 8.3.14:

Let {fθ : θ ∈ Θ} be a k–parameter exponential family with k ≤ n and let T1, . . . , Tk be statistics as in Theorem 8.3.13. Suppose that the range of Q = (Q1, . . . , Qk) contains an open set in IR^k. Then T = (T1(X), . . . , Tk(X)) is a complete sufficient statistic.

Proof:

Discrete case and k = 1 only.

Write Q(θ) = θ and let (a, b) ⊆ Θ. It follows from the Factorization Criterion (Theorem 8.3.5) that T is sufficient for θ. Thus, we only have to show that T is complete, i.e., that

Eθ(g(T(X))) = ∑_t g(t) Pθ(T(X) = t) (A)= ∑_t g(t) exp(θt + D(θ) + S∗(t)) = 0 ∀θ   (B)

implies g(t) = 0 ∀t. Note that in (A) we make use of a result established in Theorem 8.3.13.

We now define functions g+ and g− as:

g+(t) = g(t) if g(t) ≥ 0, and 0 otherwise

g−(t) = −g(t) if g(t) < 0, and 0 otherwise

It is g(t) = g+(t) − g−(t), where both functions, g+ and g−, are non–negative. Using g+ and g−, it turns out that (B) is equivalent to

∑_t g+(t) exp(θt + S∗(t)) = ∑_t g−(t) exp(θt + S∗(t)) ∀θ   (C)

where the term exp(D(θ)) in (A) drops out as a constant on both sides.


If we fix θ0 ∈ (a, b) and define

p+(t) = g+(t) exp(θ0 t + S∗(t)) / ∑_t g+(t) exp(θ0 t + S∗(t)),   p−(t) = g−(t) exp(θ0 t + S∗(t)) / ∑_t g−(t) exp(θ0 t + S∗(t)),

it is obvious that p+(t) ≥ 0 ∀t and p−(t) ≥ 0 ∀t, and by construction ∑_t p+(t) = 1 and ∑_t p−(t) = 1. Hence, p+ and p− are both pmf's.

From (C), it follows for the mgf's M+ and M− of p+ and p− that

M+(δ) = ∑_t e^{δt} p+(t)

= ∑_t e^{δt} g+(t) exp(θ0 t + S∗(t)) / ∑_t g+(t) exp(θ0 t + S∗(t))

= ∑_t g+(t) exp((θ0 + δ)t + S∗(t)) / ∑_t g+(t) exp(θ0 t + S∗(t))

(C)= ∑_t g−(t) exp((θ0 + δ)t + S∗(t)) / ∑_t g−(t) exp(θ0 t + S∗(t))

= ∑_t e^{δt} g−(t) exp(θ0 t + S∗(t)) / ∑_t g−(t) exp(θ0 t + S∗(t))

= ∑_t e^{δt} p−(t)

= M−(δ)   ∀δ ∈ (a − θ0, b − θ0), an interval that contains 0 since a < θ0 < b.

By the uniqueness of mgf's it follows that p+(t) = p−(t) ∀t

=⇒ g+(t) = g−(t) ∀t

=⇒ g(t) = 0 ∀t

=⇒ T is complete


Definition 8.3.15:

Let X = (X1, . . . , Xn) be a sample from {Fθ : θ ∈ Θ ⊆ IR^k} and let T = T(X) be a sufficient statistic for θ. T = T(X) is called a minimal sufficient statistic for θ if, for any other

    sufficient statistic T ′ = T ′(X), T (x) is a function of T ′(x).

    Note:

    (i) A minimal sufficient statistic achieves the greatest possible data reduction for a sufficient

    statistic.

    (ii) If T is minimal sufficient for θ, then also any 1–to–1 mapping of T is minimal sufficient

    for θ. However, this does not hold for arbitrary functions of T .

    Definition 8.3.16:

Let X = (X1, . . . , Xn) be a sample from {Fθ : θ ∈ Θ ⊆ IR^k}. A statistic T = T(X) is called ancillary if its distribution does not depend on the parameter θ.

Example 8.3.17:

Let X1, . . . , Xn be iid U(θ, θ + 1) where −∞ < θ < ∞. As shown in Example 8.3.9, T = (X(1), X(n)) is sufficient for θ. Define

Rn = X(n) − X(1).

Use the result from Stat 6710, Homework Assignment 5, Question (viii) (a) to obtain

f_{Rn}(r | θ) = f_{Rn}(r) = n(n − 1) r^{n−2} (1 − r) I_{(0,1)}(r).

This means that Rn ∼ Beta(n − 1, 2). Since this distribution does not depend on θ, Rn is ancillary.
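Both claims are easy to check by simulation; the sketch below is an addition to the notes and assumes numpy and scipy.

```python
# Added illustration (assumes numpy and scipy): the range R_n of a U(theta, theta+1)
# sample has the same distribution for every theta, and it matches Beta(n-1, 2).
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, reps = 6, 200_000

for theta in (-2.0, 10.0):
    x = rng.uniform(theta, theta + 1, size=(reps, n))
    r = x.max(axis=1) - x.min(axis=1)
    print(theta,
          np.quantile(r, [0.25, 0.5, 0.75]).round(3),
          stats.beta.ppf([0.25, 0.5, 0.75], n - 1, 2).round(3))
```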

    Theorem 8.3.18: Basu’s Theorem

Let X = (X1, . . . , Xn) be a sample from {Fθ : θ ∈ Θ ⊆ IR^k}. If T = T(X) is a complete and minimal sufficient statistic, then T is independent of any ancillary statistic.

    Theorem 8.3.19:

Let X = (X1, . . . , Xn) be a sample from {Fθ : θ ∈ Θ ⊆ IR^k}. If a minimal sufficient statistic T = T(X) exists for θ, then any complete sufficient statistic is also a minimal sufficient statistic.


Note:

(i) Due to the last Theorem, Basu's Theorem is often stated only in terms of a complete

    sufficient statistic (which automatically is also a minimal sufficient statistic).

(ii) As already shown in Corollary 7.2.2, X̄ and S² are independent when sampling from a N(µ, σ²) population. As outlined in Casella/Berger, page 289, we could also use Basu's

    Theorem to obtain the same result.

    (iii) The converse of Basu’s Theorem is false, i.e., if T (X) is independent of any ancillary

statistic, it does not necessarily follow that T(X) is a complete, minimal sufficient statistic.

(iv) As seen in Examples 8.3.8 and 8.3.12, T = (∑_{i=1}^n Xi, ∑_{i=1}^n Xi²) is sufficient for θ but it is not complete when X1, . . . , Xn are iid N(θ, θ²). However, it can be shown that T is minimal sufficient. So, there may be distributions where a minimal sufficient statistic exists but a complete statistic does not exist.

    (v) As with invariance, there exist several different definitions of ancillarity within the lit-

    erature — the one defined in this chapter being the most commonly used.


8.4 Unbiased Estimation

(Based on Casella/Berger, Section 7.3)

Definition 8.4.1:

Let {Fθ : θ ∈ Θ}, Θ ⊆ IR, be a nonempty set of cdf's. A Borel–measurable function T from IR^n to Θ is called unbiased for θ (or an unbiased estimate for θ) if

    Eθ(T ) = θ ∀θ ∈ Θ.

    Any function d(θ) for which an unbiased estimate T exists is called an estimable function.

    If T is biased,

    b(θ, T ) = Eθ(T ) − θ

    is called the bias of T .

Example 8.4.2:

If the kth population moment exists, the kth sample moment is an unbiased estimate. If Var(X) = σ², the sample variance S² is an unbiased estimate of σ².

However, note that for X1, . . . , Xn iid N(µ, σ²), S is not an unbiased estimate of σ:

(n − 1)S²/σ² ∼ χ²_{n−1} = Gamma((n − 1)/2, 2)

=⇒ E(√((n − 1)S²/σ²)) = ∫_0^∞ √x · x^{(n−1)/2 − 1} e^{−x/2} / (2^{(n−1)/2} Γ((n−1)/2)) dx

= [√2 Γ(n/2) / Γ((n−1)/2)] ∫_0^∞ x^{n/2 − 1} e^{−x/2} / (2^{n/2} Γ(n/2)) dx

(∗)= √2 Γ(n/2) / Γ((n−1)/2)

=⇒ E(S) = σ √(2/(n − 1)) Γ(n/2) / Γ((n−1)/2)

(∗) holds since x^{n/2 − 1} e^{−x/2} / (2^{n/2} Γ(n/2)) is the pdf of a Gamma(n/2, 2) distribution and thus the integral is 1.

So S is biased for σ and

b(σ, S) = σ (√(2/(n − 1)) Γ(n/2) / Γ((n−1)/2) − 1).
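The exact value of E(S) is easy to evaluate and to compare with a simulation; the following sketch is an addition to the notes, assuming numpy and scipy (gammaln is used so that the Gamma ratio stays numerically stable for larger n).

```python
# Added illustration (assumes numpy and scipy): E(S) < sigma for a normal sample.
import numpy as np
from scipy.special import gammaln

def expected_s(n, sigma):
    """E(S) = sigma * sqrt(2/(n-1)) * Gamma(n/2) / Gamma((n-1)/2)."""
    log_c = 0.5 * np.log(2.0 / (n - 1)) + gammaln(n / 2) - gammaln((n - 1) / 2)
    return sigma * np.exp(log_c)

n, sigma = 5, 2.0
rng = np.random.default_rng(8)
s = np.sqrt(rng.normal(0.0, sigma, size=(200_000, n)).var(axis=1, ddof=1))

print(expected_s(n, sigma))    # about 1.88, i.e. smaller than sigma = 2
print(s.mean())                # the simulation agrees
```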


Note:

    If T is unbiased for θ, g(T ) is not necessarily unbiased for g(θ) (unless g is a linear function).

Lecture 10:

We 01/31/01

Example 8.4.3:

Unbiased estimates may not exist (see Rohatgi, page 351, Example 2) or they may be absurd as in the following case:

Let X ∼ Poisson(λ) and let d(λ) = e^{−2λ}. Consider T(X) = (−1)^X as an estimate. It is

Eλ(T(X)) = e^{−λ} ∑_{x=0}^∞ (−1)^x λ^x/x!

= e^{−λ} ∑_{x=0}^∞ (−λ)^x/x!

= e^{−λ} e^{−λ}

= e^{−2λ}

= d(λ)

Hence T is unbiased for d(λ), but since T alternates between −1 and 1 while d(λ) > 0, T is not a good estimate.
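A tiny simulation (added here, assuming numpy) shows both points at once: the average of T(X) = (−1)^X is indeed close to e^{−2λ}, yet every single realization of T is −1 or +1.

```python
# Added illustration (assumes numpy): T(X) = (-1)^X is unbiased for exp(-2*lambda)
# but takes only the values -1 and +1.
import numpy as np

rng = np.random.default_rng(9)
lam, reps = 1.5, 1_000_000

x = rng.poisson(lam, size=reps)
t = (-1.0) ** x

print(t.mean(), np.exp(-2 * lam))   # both about 0.05: T is unbiased for d(lambda)
print(np.unique(t))                 # but each estimate is either -1.0 or 1.0
```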

    Note:

If there exist two unbiased estimates T1 and T2 of θ, then any estimate of the form αT1 + (1 − α)T2 for 0 < α < 1 will also be an unbiased estimate of θ. Which one should we choose?

Definition 8.4.4:

The mean square error of an estimate T of θ is defined as

MSE(θ, T) = Eθ((T − θ)²) = Varθ(T) + (b(θ, T))².

Let {Ti}_{i=1}^∞ be a sequence of estimates of θ. If

lim_{i→∞} MSE(θ, Ti) = 0 ∀θ ∈ Θ,

then {Ti} is called a mean–squared–error consistent (MSE–consistent) sequence of estimates of θ.


Note:

(i) If we allow all estimates and compare their MSE, generally it will depend on θ which estimate is better. For example, θ̂ = 17 is perfect if θ = 17, but it is lousy otherwise.

(ii) If we restrict ourselves to the class of unbiased estimates, then MSE(θ, T) = Varθ(T).

(iii) MSE–consistency means that both the bias and the variance of Ti approach 0 as i → ∞.

Definition 8.4.5:

Let θ0 ∈ Θ and let U(θ0) be the class of all unbiased estimates T of θ0 such that Eθ0(T²) < ∞.

An Excursion into Logic II

In our first “Excursion into Logic” in Stat 6710 Mathematical Statistics I, we have established the following results:

A ⇒ B is equivalent to ¬B ⇒ ¬A is equivalent to ¬A ∨ B:

A  B  |  A ⇒ B  ¬A  ¬B  ¬B ⇒ ¬A  ¬A ∨ B
1  1  |    1     0   0     1        1
1  0  |    0     0   1     0        0
0  1  |    1     1   0     1        1
0  0  |    1     1   1     1        1

When dealing with formal proofs, there exists one more technique to show A ⇒ B. Equivalently, we can show (A ∧ ¬B) ⇒ 0, a technique called Proof by Contradiction. This means, assuming that A and ¬B hold, we show that this implies 0, i.e., something that is always false, i.e., a contradiction. And here is the corresponding truth table:

A  B  |  A ⇒ B  ¬B  A ∧ ¬B  (A ∧ ¬B) ⇒ 0
1  1  |    1     0     0         1
1  0  |    0     1     1         0
0  1  |    1     0     0         1
0  0  |    1     1     0         1

Note:

We make use of this proof technique in the Proof of the next Theorem.

Example:

Let A : x = 5 and B : x² = 25. Obviously A ⇒ B.

But we can also prove this in the following way:

A : x = 5 and ¬B : x² ≠ 25

=⇒ x² = 25 ∧ x² ≠ 25

This is impossible, i.e., a contradiction. Thus, A ⇒ B.


Theorem 8.4.7:

Let U be the class of all unbiased estimates T of θ ∈ Θ with Eθ(T²) < ∞ ∀θ, and suppose that U is non–empty. Let U0 be the set of all unbiased estimates of 0, i.e.,

U0 = {ν : Eθ(ν) = 0, Eθ(ν²) < ∞ ∀θ ∈ Θ}.

Then T0 ∈ U is UMVUE for θ iff Eθ(νT0) = 0 ∀θ ∈ Θ and ∀ν ∈ U0.

“⇐=”: Let Eθ(νT0) = 0 for some T0 ∈ U for all θ ∈ Θ and all ν ∈ U0.

We choose T ∈ U ; then also T0 − T ∈ U0 and

Eθ(T0(T0 − T)) = 0 ∀θ ∈ Θ,

i.e.,

Eθ(T0²) = Eθ(T0 T) ∀θ ∈ Θ.

It follows from the Cauchy–Schwarz Inequality (Theorem 4.5.7 (ii)) that

Eθ(T0²) = Eθ(T0 T) ≤ (Eθ(T0²))^{1/2} (Eθ(T²))^{1/2}.

This implies

(Eθ(T0²))^{1/2} ≤ (Eθ(T²))^{1/2}

and, since Eθ(T0) = Eθ(T) = θ,

Varθ(T0) ≤ Varθ(T),

where T is an arbitrary unbiased estimate of θ. Thus, T0 is UMVUE.

Lecture 11:

Mo 02/05/01

Theorem 8.4.8:

Let U be the non–empty class of unbiased estimates of θ ∈ Θ as defined in Theorem 8.4.7. Then there exists at most one UMVUE T ∈ U for θ.

Proof:

Suppose T0, T1 ∈ U are both UMVUE.

Then T1 − T0 ∈ U0, Varθ(T0) = Varθ(T1), and Eθ(T0(T1 − T0)) = 0 ∀θ ∈ Θ

=⇒ Eθ(T0²) = Eθ(T0 T1)

=⇒ Covθ(T0, T1) = Eθ(T0 T1) − Eθ(T0) Eθ(T1) = Eθ(T0²) − (Eθ(T0))² = Varθ(T0) = Varθ(T1) ∀θ ∈ Θ

=⇒ ρ_{T0,T1} = 1 ∀θ ∈ Θ

=⇒ Pθ(aT0 + bT1 = 0) = 1 for some a, b ∀θ ∈ Θ

=⇒ θ = Eθ(T0) = Eθ(−(b/a) T1) = −(b/a) Eθ(T1) = −(b/a) θ ∀θ ∈ Θ

=⇒ −b/a = 1

=⇒ Pθ(T0 = T1) = 1 ∀θ ∈ Θ


Theorem 8.4.9:

    (i) If an UMVUE T exists for a real function d(θ), then λT is the UMVUE for λd(θ), λ ∈ IR.

    (ii) If UMVUE’s T1 and T2 exist for real functions d1(θ) and d2(θ), respectively, then T1 +T2

    is the UMVUE for d1(θ) + d2(θ).

    Proof:

    Homework.

    Theorem 8.4.10:

    If a sample consists of n independent observations X1, . . . ,Xn from the same distribution, the

    UMVUE, if it exists, is permutation invariant.

    Proof:

    Homework.

Theorem 8.4.11: Rao–Blackwell

Let {Fθ : θ ∈ Θ} be a family of cdf's, and let h be any statistic in U , where U is the non–empty class of all unbiased estimates of θ with Eθ(h²) < ∞ ∀θ ∈ Θ. Let T be a sufficient statistic for {Fθ : θ ∈ Θ}. Then E(h | T) does not depend on θ (so it is a statistic), E(h | T) is an unbiased estimate of θ, and

Varθ(E(h | T)) ≤ Varθ(h) ∀θ ∈ Θ.

Proof:

Since T is sufficient, E(h | T) does not depend on θ, and Eθ(E(h | T)) = Eθ(h) = θ, so E(h | T) is unbiased. Moreover, (E(h | T))² ≤ E(h² | T), so

Eθ((E(h | T))²) ≤ Eθ(E(h² | T)) = Eθ(h²),

which, together with the equal means, gives Varθ(E(h | T)) ≤ Varθ(h).

Equality holds iff

Eθ((E(h | T))²) = Eθ(h²) = Eθ(E(h² | T))

⇐⇒ Eθ(E(h² | T) − (E(h | T))²) = 0

⇐⇒ Eθ(Var(h | T)) = 0

⇐⇒ Eθ(E((h − E(h | T))² | T)) = 0

⇐⇒ E((h − E(h | T))² | T) = 0

⇐⇒ h is a function of T and h = E(h | T).

For the proof of the last step, see Rohatgi, page 170–171, Theorem 2, Corollary, and Proof of the Corollary.

Theorem 8.4.12: Lehmann–Scheffé

If T is a complete sufficient statistic and if there exists an unbiased estimate h of θ, then E(h | T) is the (unique) UMVUE.

Proof:

Suppose that h1, h2 ∈ U . Then Eθ(E(h1 | T)) = Eθ(E(h2 | T)) = θ by Theorem 8.4.11. Therefore,

Eθ(E(h1 | T) − E(h2 | T)) = 0 ∀θ ∈ Θ.

Since T is complete, E(h1 | T) = E(h2 | T).

Therefore, E(h | T) must be the same for all h ∈ U and E(h | T) improves all h ∈ U . Therefore, E(h | T) is UMVUE by Theorem 8.4.11.

    Note:

    We can use Theorem 8.4.12 to find the UMVUE in two ways if we have a complete sufficient

    statistic T :

(i) If we can find an unbiased estimate h(T), it will be the UMVUE.