  • STAT 6720

    Mathematical Statistics II

    Spring Semester 2008

    Dr. Jürgen Symanzik

    Utah State University

    Department of Mathematics and Statistics

    3900 Old Main Hill

    Logan, UT 84322–3900

    Tel.: (435) 797–0696

    FAX: (435) 797–1822

    e-mail: [email protected]

  • Contents

    Acknowledgements

    6 Limit Theorems
    6.1 Modes of Convergence
    6.2 Weak Laws of Large Numbers
    6.3 Strong Laws of Large Numbers
    6.4 Central Limit Theorems

    7 Sample Moments
    7.1 Random Sampling
    7.2 Sample Moments and the Normal Distribution

    8 The Theory of Point Estimation
    8.1 The Problem of Point Estimation
    8.2 Properties of Estimates
    8.3 Sufficient Statistics
    8.4 Unbiased Estimation
    8.5 Lower Bounds for the Variance of an Estimate
    8.6 The Method of Moments
    8.7 Maximum Likelihood Estimation
    8.8 Decision Theory — Bayes and Minimax Estimation

    9 Hypothesis Testing
    9.1 Fundamental Notions
    9.2 The Neyman–Pearson Lemma
    9.3 Monotone Likelihood Ratios
    9.4 Unbiased and Invariant Tests

    10 More on Hypothesis Testing
    10.1 Likelihood Ratio Tests
    10.2 Parametric Chi–Squared Tests
    10.3 t–Tests and F–Tests
    10.4 Bayes and Minimax Tests

    11 Confidence Estimation
    11.1 Fundamental Notions
    11.2 Shortest–Length Confidence Intervals
    11.3 Confidence Intervals and Hypothesis Tests
    11.4 Bayes Confidence Intervals

    12 Nonparametric Inference
    12.1 Nonparametric Estimation
    12.2 Single-Sample Hypothesis Tests
    12.3 More on Order Statistics

    13 Some Results from Sampling
    13.1 Simple Random Samples
    13.2 Stratified Random Samples

    14 Some Results from Sequential Statistical Inference
    14.1 Fundamentals of Sequential Sampling
    14.2 Sequential Probability Ratio Tests

    Index

  • Acknowledgements

I would like to thank all my students who helped from the Fall 1999 through the Spring 2006 semesters with the creation and improvement of these lecture notes and for their suggestions on how to improve some of the material presented in class.

    In addition, I particularly would like to thank Mike Minnotte and Dan Coster, who previously

    taught this course at Utah State University, for providing me with their lecture notes and other

    materials related to this course. Their lecture notes, combined with additional material from

    a variety of textbooks listed below, form the basis of the script presented here.

    The textbook required for this class is:

    • Casella, G., and Berger, R. L. (2002): Statistical Inference (Second Edition), Duxbury/Thomson Learning, Pacific Grove, CA.

    A Web page dedicated to this class is accessible at:

    http://www.math.usu.edu/~symanzik/teaching/2006_stat6720/stat6720.html

This course follows Casella and Berger (2002) as described in the syllabus. Additional material originates from lectures by Professors Hering, Trenkler, Gather, and Kreienbrock that I attended while studying at the Universität Dortmund, Germany, from the collection of Masters and PhD Preliminary Exam questions from Iowa State University, Ames, Iowa, and from the following textbooks:

• Bandelow, C. (1981): Einführung in die Wahrscheinlichkeitstheorie, Bibliographisches Institut, Mannheim, Germany.

• Büning, H., and Trenkler, G. (1978): Nichtparametrische statistische Methoden, Walter de Gruyter, Berlin, Germany.

• Casella, G., and Berger, R. L. (1990): Statistical Inference, Wadsworth & Brooks/Cole, Pacific Grove, CA.

• Fisz, M. (1989): Wahrscheinlichkeitsrechnung und mathematische Statistik, VEB Deutscher Verlag der Wissenschaften, Berlin, German Democratic Republic.

• Gibbons, J. D., and Chakraborti, S. (1992): Nonparametric Statistical Inference (Third Edition, Revised and Expanded), Dekker, New York, NY.

• Johnson, N. L., Kotz, S., and Balakrishnan, N. (1994): Continuous Univariate Distributions, Volume 1 (Second Edition), Wiley, New York, NY.

• Johnson, N. L., Kotz, S., and Balakrishnan, N. (1995): Continuous Univariate Distributions, Volume 2 (Second Edition), Wiley, New York, NY.

• Kelly, D. G. (1994): Introduction to Probability, Macmillan, New York, NY.

• Lehmann, E. L. (1983): Theory of Point Estimation (1991 Reprint), Wadsworth & Brooks/Cole, Pacific Grove, CA.

• Lehmann, E. L. (1986): Testing Statistical Hypotheses (Second Edition – 1994 Reprint), Chapman & Hall, New York, NY.

• Mood, A. M., Graybill, F. A., and Boes, D. C. (1974): Introduction to the Theory of Statistics (Third Edition), McGraw-Hill, Singapore.

• Parzen, E. (1960): Modern Probability Theory and Its Applications, Wiley, New York, NY.

• Rohatgi, V. K. (1976): An Introduction to Probability Theory and Mathematical Statistics, John Wiley and Sons, New York, NY.

• Rohatgi, V. K., and Saleh, A. K. E. (2001): An Introduction to Probability and Statistics (Second Edition), John Wiley and Sons, New York, NY.

• Searle, S. R. (1971): Linear Models, Wiley, New York, NY.

• Tamhane, A. C., and Dunlop, D. D. (2000): Statistics and Data Analysis – From Elementary to Intermediate, Prentice Hall, Upper Saddle River, NJ.

Additional definitions, integrals, sums, etc. originate from the following formula collections:

• Bronstein, I. N., and Semendjajew, K. A. (1985): Taschenbuch der Mathematik (22. Auflage), Verlag Harri Deutsch, Thun, Switzerland.

• Bronstein, I. N., and Semendjajew, K. A. (1986): Ergänzende Kapitel zu Taschenbuch der Mathematik (4. Auflage), Verlag Harri Deutsch, Thun, Switzerland.

• Sieber, H. (1980): Mathematische Formeln — Erweiterte Ausgabe E, Ernst Klett, Stuttgart, Germany.

Jürgen Symanzik, January 7, 2006

Lecture 02: We 01/07/04

6 Limit Theorems

(Based on Rohatgi, Chapter 6, Rohatgi/Saleh, Chapter 6 & Casella/Berger, Section 5.5)

Motivation:

I found this slide from my Stat 250, Section 003, "Introductory Statistics" class (an undergraduate class I taught at George Mason University in Spring 1999):

What does this mean at a more theoretical level???

6.1 Modes of Convergence

Definition 6.1.1:
Let $X_1, \ldots, X_n$ be iid rv's with common cdf $F_X(x)$. Let $T = T(\underline{X})$ be any statistic, i.e., a Borel–measurable function of $\underline{X}$ that does not involve the population parameter(s) $\vartheta$, defined on the support $\mathcal{X}$ of $\underline{X}$. The induced probability distribution of $T(\underline{X})$ is called the sampling distribution of $T(\underline{X})$.

Note:
(i) Commonly used statistics are:
Sample Mean: $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$
Sample Variance: $S_n^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X}_n)^2$
Sample Median, Order Statistics, Min, Max, etc.
(ii) Recall that if $X_1, \ldots, X_n$ are iid and if $E(X)$ and $Var(X)$ exist, then $E(\bar{X}_n) = \mu = E(X)$, $E(S_n^2) = \sigma^2 = Var(X)$, and $Var(\bar{X}_n) = \frac{\sigma^2}{n}$.
(iii) Recall that if $X_1, \ldots, X_n$ are iid and if $X$ has mgf $M_X(t)$ or characteristic function $\Phi_X(t)$, then $M_{\bar{X}_n}(t) = \big(M_X(\frac{t}{n})\big)^n$ and $\Phi_{\bar{X}_n}(t) = \big(\Phi_X(\frac{t}{n})\big)^n$.

Note: Let $\{X_n\}_{n=1}^\infty$ be a sequence of rv's on some probability space $(\Omega, L, P)$. Is there any meaning behind the expression $\lim_{n\to\infty} X_n = X$? Not immediately under the usual definitions of limits. We first need to define modes of convergence for rv's and probabilities.

Definition 6.1.2:
Let $\{X_n\}_{n=1}^\infty$ be a sequence of rv's with cdf's $\{F_n\}_{n=1}^\infty$ and let $X$ be a rv with cdf $F$. If $F_n(x) \to F(x)$ at all continuity points of $F$, we say that $X_n$ converges in distribution to $X$ ($X_n \xrightarrow{d} X$), or $X_n$ converges in law to $X$ ($X_n \xrightarrow{L} X$), or $F_n$ converges weakly to $F$ ($F_n \xrightarrow{w} F$).

Example 6.1.3:
Let $X_n \sim N(0, \frac{1}{n})$. Then
$$F_n(x) = \int_{-\infty}^{x} \frac{\sqrt{n}}{\sqrt{2\pi}} \exp\Big(-\frac{1}{2} n t^2\Big) \, dt = \int_{-\infty}^{\sqrt{n}x} \frac{1}{\sqrt{2\pi}} \exp\Big(-\frac{1}{2} s^2\Big) \, ds = \Phi(\sqrt{n}\,x)$$
$$\Longrightarrow \ F_n(x) \to \begin{cases} \Phi(\infty) = 1, & x > 0 \\ \Phi(0) = \frac{1}{2}, & x = 0 \\ \Phi(-\infty) = 0, & x < 0 \end{cases}$$
If $F_X(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases}$, the only point of discontinuity is at $x = 0$. Everywhere else, $\Phi(\sqrt{n}\,x) = F_n(x) \to F_X(x)$, where $\Phi(z) = P(Z \le z)$ with $Z \sim N(0,1)$.

So $X_n \xrightarrow{d} X$, where $P(X = 0) = 1$, or $X_n \xrightarrow{d} 0$, since the limiting rv here is degenerate, i.e., it has a Dirac(0) distribution.
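The limit behavior in Example 6.1.3 is easy to verify numerically. The following is a minimal sketch (assuming Python with numpy and scipy are available, which the notes themselves do not use): it evaluates $F_n(x) = \Phi(\sqrt{n}\,x)$ at a few fixed points and shows the values approaching 0, 1/2, and 1.

```python
# Numerical check of Example 6.1.3: F_n(x) = Phi(sqrt(n) x) tends to the
# degenerate cdf of Dirac(0) at every continuity point (i.e., for x != 0).
import numpy as np
from scipy.stats import norm

xs = np.array([-0.5, -0.05, 0.0, 0.05, 0.5])
for n in [1, 10, 100, 10_000]:
    print(n, np.round(norm.cdf(np.sqrt(n) * xs), 4))
```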

Example 6.1.4:
In this example, the sequence $\{F_n\}_{n=1}^\infty$ converges pointwise to something that is not a cdf: Let $X_n \sim$ Dirac($n$), i.e., $P(X_n = n) = 1$. Then
$$F_n(x) = \begin{cases} 0, & x < n \\ 1, & x \ge n \end{cases}$$
It is $F_n(x) \to 0 \ \forall x$, which is not a cdf. Thus, there is no rv $X$ such that $X_n \xrightarrow{d} X$.

Example 6.1.5:
Let $\{X_n\}_{n=1}^\infty$ be a sequence of rv's such that $P(X_n = 0) = 1 - \frac{1}{n}$ and $P(X_n = n) = \frac{1}{n}$, and let $X \sim$ Dirac(0), i.e., $P(X = 0) = 1$. It is
$$F_n(x) = \begin{cases} 0, & x < 0 \\ 1 - \frac{1}{n}, & 0 \le x < n \\ 1, & x \ge n \end{cases} \qquad F_X(x) = \begin{cases} 0, & x < 0 \\ 1, & x \ge 0 \end{cases}$$
It holds that $F_n \xrightarrow{w} F_X$ but
$$E(X_n^k) = 0^k \cdot \Big(1 - \frac{1}{n}\Big) + n^k \cdot \frac{1}{n} = n^{k-1} \not\to E(X^k) = 0.$$
Thus, convergence in distribution does not imply convergence of moments/means.

Note:
Convergence in distribution does not say that the $X_i$'s are close to each other or to $X$. It only means that their cdf's are (eventually) close to some cdf $F$. The $X_i$'s do not even have to be defined on the same probability space.

Example 6.1.6:
Let $X$ and $\{X_n\}_{n=1}^\infty$ be iid $N(0, 1)$. Obviously, $X_n \xrightarrow{d} X$ but $\lim_{n\to\infty} X_n \ne X$.

Theorem 6.1.7:
Let $X$ and $\{X_n\}_{n=1}^\infty$ be discrete rv's with supports $\mathcal{X}$ and $\{\mathcal{X}_n\}_{n=1}^\infty$, respectively. Define the countable set $A = \mathcal{X} \cup \bigcup_{n=1}^\infty \mathcal{X}_n = \{a_k : k = 1, 2, 3, \ldots\}$. Let $p_k = P(X = a_k)$ and $p_{nk} = P(X_n = a_k)$. Then it holds that $p_{nk} \to p_k \ \forall k$ iff $X_n \xrightarrow{d} X$.

Theorem 6.1.8:
Let $X$ and $\{X_n\}_{n=1}^\infty$ be continuous rv's with pdf's $f$ and $\{f_n\}_{n=1}^\infty$, respectively. If $f_n(x) \to f(x)$ for almost all $x$ as $n \to \infty$, then $X_n \xrightarrow{d} X$.

Theorem 6.1.9:
Let $X$ and $\{X_n\}_{n=1}^\infty$ be rv's such that $X_n \xrightarrow{d} X$. Let $c \in \mathbb{R}$ be a constant. Then it holds:
(i) $X_n + c \xrightarrow{d} X + c$.
(ii) $cX_n \xrightarrow{d} cX$.
(iii) If $a_n \to a$ and $b_n \to b$, then $a_n X_n + b_n \xrightarrow{d} aX + b$.

Proof:
Part (iii):
Suppose that $a > 0$, $a_n > 0$. (If $a < 0$, $a_n < 0$, the result follows via (ii) and $c = -1$.) Let $Y_n = a_n X_n + b_n$ and $Y = aX + b$. It is
$$F_Y(y) = P(Y < y) = P(aX + b < y) = P\Big(X < \frac{y - b}{a}\Big) = F_X\Big(\frac{y - b}{a}\Big).$$
Likewise,
$$F_{Y_n}(y) = F_{X_n}\Big(\frac{y - b_n}{a_n}\Big).$$
If $y$ is a continuity point of $F_Y$, then $\frac{y - b}{a}$ is a continuity point of $F_X$. Since $a_n \to a$, $b_n \to b$ and $F_{X_n}(x) \to F_X(x)$, it follows that $F_{Y_n}(y) \to F_Y(y)$ for every continuity point $y$ of $F_Y$. Thus, $a_n X_n + b_n \xrightarrow{d} aX + b$.

Lecture 38: We 11/29/00

Definition 6.1.10:
Let $\{X_n\}_{n=1}^\infty$ be a sequence of rv's defined on a probability space $(\Omega, L, P)$. We say that $X_n$ converges in probability to a rv $X$ ($X_n \xrightarrow{p} X$, $P\text{-}\lim_{n\to\infty} X_n = X$) if
$$\lim_{n\to\infty} P(|X_n - X| > \epsilon) = 0 \quad \forall \epsilon > 0.$$

Note:
The following are equivalent:
$$\lim_{n\to\infty} P(|X_n - X| > \epsilon) = 0 \iff \lim_{n\to\infty} P(|X_n - X| \le \epsilon) = 1 \iff \lim_{n\to\infty} P(\{\omega : |X_n(\omega) - X(\omega)| > \epsilon\}) = 0$$
If $X$ is degenerate, i.e., $P(X = c) = 1$, we say that $X_n$ is consistent for $c$. For example, let $X_n$ be such that $P(X_n = 0) = 1 - \frac{1}{n}$ and $P(X_n = 1) = \frac{1}{n}$. Then
$$P(|X_n| > \epsilon) = \begin{cases} \frac{1}{n}, & 0 < \epsilon < 1 \\ 0, & \epsilon \ge 1 \end{cases}$$
Therefore, $\lim_{n\to\infty} P(|X_n| > \epsilon) = 0 \ \forall \epsilon > 0$. So $X_n \xrightarrow{p} 0$, i.e., $X_n$ is consistent for 0.

Theorem 6.1.11:
(i) $X_n \xrightarrow{p} X \iff X_n - X \xrightarrow{p} 0$.
(ii) $X_n \xrightarrow{p} X$, $X_n \xrightarrow{p} Y \implies P(X = Y) = 1$.
(iii) $X_n \xrightarrow{p} X$, $X_m \xrightarrow{p} X \implies X_n - X_m \xrightarrow{p} 0$ as $n, m \to \infty$.
(iv) $X_n \xrightarrow{p} X$, $Y_n \xrightarrow{p} Y \implies X_n \pm Y_n \xrightarrow{p} X \pm Y$.
(v) $X_n \xrightarrow{p} X$, $k \in \mathbb{R}$ a constant $\implies kX_n \xrightarrow{p} kX$.
(vi) $X_n \xrightarrow{p} k$, $k \in \mathbb{R}$ a constant $\implies X_n^r \xrightarrow{p} k^r \ \forall r \in \mathbb{N}$.
(vii) $X_n \xrightarrow{p} a$, $Y_n \xrightarrow{p} b$, $a, b \in \mathbb{R} \implies X_n Y_n \xrightarrow{p} ab$.
(viii) $X_n \xrightarrow{p} 1 \implies X_n^{-1} \xrightarrow{p} 1$.
(ix) $X_n \xrightarrow{p} a$, $Y_n \xrightarrow{p} b$, $a \in \mathbb{R}$, $b \in \mathbb{R} - \{0\} \implies \frac{X_n}{Y_n} \xrightarrow{p} \frac{a}{b}$.
(x) $X_n \xrightarrow{p} X$, $Y$ an arbitrary rv $\implies X_n Y \xrightarrow{p} XY$.
(xi) $X_n \xrightarrow{p} X$, $Y_n \xrightarrow{p} Y \implies X_n Y_n \xrightarrow{p} XY$.

Proof:
See Rohatgi, pages 244–245, and Rohatgi/Saleh, pages 260–261, for partial proofs.

Theorem 6.1.12:
Let $X_n \xrightarrow{p} X$ and let $g$ be a continuous function on $\mathbb{R}$. Then $g(X_n) \xrightarrow{p} g(X)$.

Proof:
Preconditions:
1.) $X$ rv $\implies \forall \epsilon > 0 \ \exists k = k(\epsilon) : P(|X| > k) < \frac{\epsilon}{2}$
2.) $g$ is continuous on $\mathbb{R}$
$\implies$ $g$ is also uniformly continuous on $[-k, k]$ (see Definition of uniformly continuous in Theorem 3.3.3 (iii))
$\implies \exists \delta = \delta(\epsilon, k) : |X| \le k, |X_n - X| < \delta \Rightarrow |g(X_n) - g(X)| < \epsilon$

Let
$$A = \{|X| \le k\} = \{\omega : |X(\omega)| \le k\}$$
$$B = \{|X_n - X| < \delta\} = \{\omega : |X_n(\omega) - X(\omega)| < \delta\}$$
$$C = \{|g(X_n) - g(X)| < \epsilon\} = \{\omega : |g(X_n(\omega)) - g(X(\omega))| < \epsilon\}$$
If $\omega \in A \cap B$, then by 2.) $\omega \in C$
$$\implies A \cap B \subseteq C \implies C^c \subseteq (A \cap B)^c = A^c \cup B^c \implies P(C^c) \le P(A^c \cup B^c) \le P(A^c) + P(B^c)$$
Now:
$$P(|g(X_n) - g(X)| \ge \epsilon) \le \underbrace{P(|X| > k)}_{\le \frac{\epsilon}{2} \text{ by 1.)}} + \underbrace{P(|X_n - X| \ge \delta)}_{\le \frac{\epsilon}{2} \text{ for } n \ge n_0(\epsilon, \delta, k) \text{ since } X_n \xrightarrow{p} X} \le \epsilon \quad \text{for } n \ge n_0(\epsilon, \delta, k)$$

Corollary 6.1.13:
(i) Let $X_n \xrightarrow{p} c$, $c \in \mathbb{R}$, and let $g$ be a continuous function on $\mathbb{R}$. Then $g(X_n) \xrightarrow{p} g(c)$.
(ii) Let $X_n \xrightarrow{d} X$ and let $g$ be a continuous function on $\mathbb{R}$. Then $g(X_n) \xrightarrow{d} g(X)$.
(iii) Let $X_n \xrightarrow{d} c$, $c \in \mathbb{R}$, and let $g$ be a continuous function on $\mathbb{R}$. Then $g(X_n) \xrightarrow{d} g(c)$.

Theorem 6.1.14:
$X_n \xrightarrow{p} X \implies X_n \xrightarrow{d} X$.

Proof:
$X_n \xrightarrow{p} X \iff P(|X_n - X| > \epsilon) \to 0$ as $n \to \infty \ \forall \epsilon > 0$.
It holds:
$$P(X \le x - \epsilon) = P(X \le x - \epsilon, |X_n - X| \le \epsilon) + P(X \le x - \epsilon, |X_n - X| > \epsilon) \stackrel{(A)}{\le} P(X_n \le x) + P(|X_n - X| > \epsilon)$$
(A) holds since $X \le x - \epsilon$ and $X_n$ within $\epsilon$ of $X$ imply $X_n \le x$.
Similarly, it holds:
$$P(X_n \le x) = P(X_n \le x, |X_n - X| \le \epsilon) + P(X_n \le x, |X_n - X| > \epsilon) \le P(X \le x + \epsilon) + P(|X_n - X| > \epsilon)$$
Combining the 2 inequalities from above gives:
$$P(X \le x - \epsilon) - \underbrace{P(|X_n - X| > \epsilon)}_{\to 0 \text{ as } n \to \infty} \le \underbrace{P(X_n \le x)}_{= F_n(x)} \le P(X \le x + \epsilon) + \underbrace{P(|X_n - X| > \epsilon)}_{\to 0 \text{ as } n \to \infty}$$
Therefore,
$$P(X \le x - \epsilon) \le F_n(x) \le P(X \le x + \epsilon) \quad \text{as } n \to \infty.$$
Since the cdf's $F_n(\cdot)$ are not necessarily left continuous, we get the following result for $\epsilon \downarrow 0$:
$$P(X < x) \le F_n(x) \le P(X \le x) = F_X(x)$$
Let $x$ be a continuity point of $F$. Then it holds:
$$F(x) = P(X < x) \le F_n(x) \le F(x) \implies F_n(x) \to F(x) \implies X_n \xrightarrow{d} X$$

Theorem 6.1.15:
Let $c \in \mathbb{R}$ be a constant. Then it holds:
$$X_n \xrightarrow{d} c \iff X_n \xrightarrow{p} c.$$

Example 6.1.16:
In this example, we will see that $X_n \xrightarrow{d} X \not\Longrightarrow X_n \xrightarrow{p} X$ for some rv $X$. Let $X_n$ be identically distributed rv's and let $(X_n, X)$ have the following joint distribution:

                 X = 0    X = 1
    X_n = 0        0       1/2      1/2
    X_n = 1       1/2       0       1/2
                  1/2      1/2       1

Obviously, $X_n \xrightarrow{d} X$ since all have exactly the same cdf, but for any $\epsilon \in (0, 1)$, it is
$$P(|X_n - X| > \epsilon) = P(|X_n - X| = 1) = 1 \quad \forall n,$$
so $\lim_{n\to\infty} P(|X_n - X| > \epsilon) \ne 0$. Therefore, $X_n \not\xrightarrow{p} X$.

Theorem 6.1.17:
Let $\{X_n\}_{n=1}^\infty$ and $\{Y_n\}_{n=1}^\infty$ be sequences of rv's and $X$ be a rv defined on a probability space $(\Omega, L, P)$. Then it holds:
$$Y_n \xrightarrow{d} X, \ |X_n - Y_n| \xrightarrow{p} 0 \implies X_n \xrightarrow{d} X.$$

Proof:
Similar to the proof of Theorem 6.1.14. See also Rohatgi, page 253, Theorem 14, and Rohatgi/Saleh, page 269, Theorem 14.

Lecture 41: We 12/06/00

Theorem 6.1.18: Slutsky's Theorem
Let $\{X_n\}_{n=1}^\infty$ and $\{Y_n\}_{n=1}^\infty$ be sequences of rv's and $X$ be a rv defined on a probability space $(\Omega, L, P)$. Let $c \in \mathbb{R}$ be a constant. Then it holds:
(i) $X_n \xrightarrow{d} X$, $Y_n \xrightarrow{p} c \implies X_n + Y_n \xrightarrow{d} X + c$.
(ii) $X_n \xrightarrow{d} X$, $Y_n \xrightarrow{p} c \implies X_n Y_n \xrightarrow{d} cX$. If $c = 0$, then also $X_n Y_n \xrightarrow{p} 0$.
(iii) $X_n \xrightarrow{d} X$, $Y_n \xrightarrow{p} c \implies \frac{X_n}{Y_n} \xrightarrow{d} \frac{X}{c}$ if $c \ne 0$.

Proof:
(i) $Y_n \xrightarrow{p} c \stackrel{\text{Th.6.1.11(i)}}{\iff} Y_n - c \xrightarrow{p} 0$
$$\implies Y_n - c = Y_n + (X_n - X_n) - c = (X_n + Y_n) - (X_n + c) \xrightarrow{p} 0 \quad (A)$$
$$X_n \xrightarrow{d} X \stackrel{\text{Th.6.1.9(i)}}{\implies} X_n + c \xrightarrow{d} X + c \quad (B)$$
Combining (A) and (B), it follows from Theorem 6.1.17: $X_n + Y_n \xrightarrow{d} X + c$.

(ii) Case $c = 0$: $\forall \epsilon > 0 \ \forall k > 0$, it is
$$P(|X_n Y_n| > \epsilon) = P\Big(|X_n Y_n| > \epsilon, |Y_n| \le \frac{\epsilon}{k}\Big) + P\Big(|X_n Y_n| > \epsilon, |Y_n| > \frac{\epsilon}{k}\Big) \le P\Big(|X_n| \frac{\epsilon}{k} > \epsilon\Big) + P\Big(|Y_n| > \frac{\epsilon}{k}\Big) = P(|X_n| > k) + P\Big(|Y_n| > \frac{\epsilon}{k}\Big)$$
Since $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{p} 0$, it follows
$$\lim_{n\to\infty} P(|X_n Y_n| > \epsilon) \le P(|X| > k) \to 0 \text{ as } k \to \infty.$$
Therefore, $X_n Y_n \xrightarrow{p} 0$.
Case $c \ne 0$: Since $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{p} c$, it follows from (ii), case $c = 0$, that $X_n Y_n - cX_n = X_n(Y_n - c) \xrightarrow{p} 0$. Since $cX_n \xrightarrow{d} cX$ by Theorem 6.1.9 (ii), it follows from Theorem 6.1.17:
$$X_n Y_n \xrightarrow{d} cX$$

(iii) Let $Z_n \xrightarrow{p} 1$ and let $Y_n = cZ_n$.
$$\stackrel{c \ne 0}{\implies} \frac{1}{Y_n} = \frac{1}{Z_n} \cdot \frac{1}{c} \stackrel{\text{Th.6.1.11(v,viii)}}{\implies} \frac{1}{Y_n} \xrightarrow{p} \frac{1}{c}$$
With part (ii) above, it follows:
$$X_n \xrightarrow{d} X \text{ and } \frac{1}{Y_n} \xrightarrow{p} \frac{1}{c} \implies \frac{X_n}{Y_n} \xrightarrow{d} \frac{X}{c}$$
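Slutsky's Theorem is easy to illustrate by simulation. The following sketch (Python with numpy assumed; the choice $X_n \sim N(0,1)$ and $Y_n = c + Z_n/\sqrt{n}$ is just one convenient example) checks part (ii): the product $X_n Y_n$ behaves like $cX$, i.e., approximately $N(0, c^2)$.

```python
# Slutsky's Theorem (ii): X_n ->d X = N(0,1), Y_n ->p c, hence X_n Y_n ->d c X.
import numpy as np

rng = np.random.default_rng(1)
c, n, reps = 2.0, 10_000, 50_000
Xn = rng.standard_normal(reps)                      # X_n ~ N(0, 1)
Yn = c + rng.standard_normal(reps) / np.sqrt(n)     # Y_n ->p c
prod = Xn * Yn
print("mean ~", round(prod.mean(), 3), " sd ~", round(prod.std(), 3), " (target sd = |c| = 2)")
```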

Definition 6.1.19:
Let $\{X_n\}_{n=1}^\infty$ be a sequence of rv's such that $E(|X_n|^r) < \infty$ for some $r > 0$. We say that $X_n$ converges in the $r^{th}$ mean to a rv $X$ ($X_n \xrightarrow{r} X$) if $E(|X|^r) < \infty$ and
$$\lim_{n\to\infty} E(|X_n - X|^r) = 0.$$

Note:
The special cases $r = 1$ and $r = 2$ are called convergence in absolute mean for $r = 1$ ($X_n \xrightarrow{1} X$) and convergence in mean square for $r = 2$ ($X_n \xrightarrow{ms} X$ or $X_n \xrightarrow{2} X$).

Theorem 6.1.21:
Assume that $X_n \xrightarrow{r} X$ for some $r > 0$. Then $X_n \xrightarrow{p} X$.

Proof:
Using Markov's Inequality (Corollary 3.5.2), it holds for any $\epsilon > 0$:
$$\frac{E(|X_n - X|^r)}{\epsilon^r} \ge P(|X_n - X| \ge \epsilon) \ge P(|X_n - X| > \epsilon)$$
$$X_n \xrightarrow{r} X \implies \lim_{n\to\infty} E(|X_n - X|^r) = 0 \implies \lim_{n\to\infty} P(|X_n - X| > \epsilon) \le \lim_{n\to\infty} \frac{E(|X_n - X|^r)}{\epsilon^r} = 0 \implies X_n \xrightarrow{p} X$$

Example 6.1.22:
Let $\{X_n\}_{n=1}^\infty$ be a sequence of rv's defined by $P(X_n = 0) = 1 - \frac{1}{n^r}$ and $P(X_n = n) = \frac{1}{n^r}$ for some $r > 0$.
For any $\epsilon > 0$, $P(|X_n| > \epsilon) \to 0$ as $n \to \infty$; so $X_n \xrightarrow{p} 0$.
For $0 < s < r$, $E(|X_n|^s) = \frac{1}{n^{r-s}} \to 0$ as $n \to \infty$; so $X_n \xrightarrow{s} 0$. But $E(|X_n|^r) = 1 \not\to 0$ as $n \to \infty$; so $X_n \not\xrightarrow{r} 0$.

Theorem 6.1.23:
If $X_n \xrightarrow{r} X$, then it holds:
(i) $\lim_{n\to\infty} E(|X_n|^r) = E(|X|^r)$; and
(ii) $X_n \xrightarrow{s} X$ for $0 < s < r$.

Proof:
(i) For $0 < r \le 1$, it holds:
$$E(|X_n|^r) = E(|X_n - X + X|^r) \stackrel{(*)}{\le} E(|X_n - X|^r + |X|^r)$$
$$\implies E(|X_n|^r) - E(|X|^r) \le E(|X_n - X|^r) \implies \lim_{n\to\infty} E(|X_n|^r) - E(|X|^r) \le \lim_{n\to\infty} E(|X_n - X|^r) = 0$$
$$\implies \lim_{n\to\infty} E(|X_n|^r) \le E(|X|^r) \quad (A)$$
(*) holds due to Bronstein/Semendjajew (1986), page 36 (see Handout).
Similarly,
$$E(|X|^r) = E(|X - X_n + X_n|^r) \le E(|X_n - X|^r + |X_n|^r)$$
$$\implies E(|X|^r) - E(|X_n|^r) \le E(|X_n - X|^r) \implies E(|X|^r) - \lim_{n\to\infty} E(|X_n|^r) \le \lim_{n\to\infty} E(|X_n - X|^r) = 0$$
$$\implies E(|X|^r) \le \lim_{n\to\infty} E(|X_n|^r) \quad (B)$$
Combining (A) and (B) gives
$$\lim_{n\to\infty} E(|X_n|^r) = E(|X|^r).$$
For $r > 1$, it follows from Minkowski's Inequality (Theorem 4.8.3):
$$[E(|X - X_n + X_n|^r)]^{\frac{1}{r}} \le [E(|X - X_n|^r)]^{\frac{1}{r}} + [E(|X_n|^r)]^{\frac{1}{r}}$$
$$\implies [E(|X|^r)]^{\frac{1}{r}} - [E(|X_n|^r)]^{\frac{1}{r}} \le [E(|X - X_n|^r)]^{\frac{1}{r}}$$
$$\implies [E(|X|^r)]^{\frac{1}{r}} - \lim_{n\to\infty} [E(|X_n|^r)]^{\frac{1}{r}} \le \lim_{n\to\infty} [E(|X_n - X|^r)]^{\frac{1}{r}} = 0 \ \text{since } X_n \xrightarrow{r} X$$
$$\implies [E(|X|^r)]^{\frac{1}{r}} \le \lim_{n\to\infty} [E(|X_n|^r)]^{\frac{1}{r}} \quad (C)$$
Similarly,
$$[E(|X_n - X + X|^r)]^{\frac{1}{r}} \le [E(|X_n - X|^r)]^{\frac{1}{r}} + [E(|X|^r)]^{\frac{1}{r}}$$
$$\implies \lim_{n\to\infty} [E(|X_n|^r)]^{\frac{1}{r}} - [E(|X|^r)]^{\frac{1}{r}} \le \lim_{n\to\infty} [E(|X_n - X|^r)]^{\frac{1}{r}} = 0 \ \text{since } X_n \xrightarrow{r} X$$
$$\implies \lim_{n\to\infty} [E(|X_n|^r)]^{\frac{1}{r}} \le [E(|X|^r)]^{\frac{1}{r}} \quad (D)$$
Combining (C) and (D) gives
$$\lim_{n\to\infty} [E(|X_n|^r)]^{\frac{1}{r}} = [E(|X|^r)]^{\frac{1}{r}} \implies \lim_{n\to\infty} E(|X_n|^r) = E(|X|^r)$$

Lecture 42/1: Fr 12/08/00

(ii) For $1 \le s < r$, it follows from Lyapunov's Inequality (Theorem 3.5.4):
$$[E(|X_n - X|^s)]^{\frac{1}{s}} \le [E(|X_n - X|^r)]^{\frac{1}{r}} \implies E(|X_n - X|^s) \le [E(|X_n - X|^r)]^{\frac{s}{r}}$$
$$\implies \lim_{n\to\infty} E(|X_n - X|^s) \le \lim_{n\to\infty} [E(|X_n - X|^r)]^{\frac{s}{r}} = 0 \ \text{since } X_n \xrightarrow{r} X \implies X_n \xrightarrow{s} X$$
An additional proof is required for $0 < s < r < 1$.

Definition 6.1.24:
Let $\{X_n\}_{n=1}^\infty$ be a sequence of rv's on $(\Omega, L, P)$. We say that $X_n$ converges almost surely to a rv $X$ ($X_n \xrightarrow{a.s.} X$), or $X_n$ converges with probability 1 to $X$ ($X_n \xrightarrow{w.p.1} X$), or $X_n$ converges strongly to $X$, iff
$$P(\{\omega : X_n(\omega) \to X(\omega) \text{ as } n \to \infty\}) = 1.$$

Note:
An interesting characterization of convergence with probability 1 and convergence in probability can be found in Parzen (1960), "Modern Probability Theory and Its Applications", page 416 (see Handout).

Example 6.1.25:
Let $\Omega = [0, 1]$ and $P$ the uniform distribution on $\Omega$. Let $X_n(\omega) = \omega + \omega^n$ and $X(\omega) = \omega$.
For $\omega \in [0, 1)$, $\omega^n \to 0$ as $n \to \infty$. So $X_n(\omega) \to X(\omega) \ \forall \omega \in [0, 1)$.
However, for $\omega = 1$, $X_n(1) = 2 \ne 1 = X(1) \ \forall n$, i.e., convergence fails at $\omega = 1$.
Anyway, since $P(\{\omega : X_n(\omega) \to X(\omega) \text{ as } n \to \infty\}) = P(\{\omega \in [0, 1)\}) = 1$, it is $X_n \xrightarrow{a.s.} X$.

Theorem 6.1.26:
$X_n \xrightarrow{a.s.} X \implies X_n \xrightarrow{p} X$.

Proof:
Choose $\epsilon > 0$ and $\delta > 0$. Find $n_0 = n_0(\epsilon, \delta)$ such that
$$P\Big(\bigcap_{n=n_0}^{\infty} \{|X_n - X| \le \epsilon\}\Big) \ge 1 - \delta.$$
Since $\bigcap_{n=n_0}^{\infty} \{|X_n - X| \le \epsilon\} \subseteq \{|X_n - X| \le \epsilon\} \ \forall n \ge n_0$, it is
$$P(\{|X_n - X| \le \epsilon\}) \ge P\Big(\bigcap_{n=n_0}^{\infty} \{|X_n - X| \le \epsilon\}\Big) \ge 1 - \delta \quad \forall n \ge n_0.$$
Therefore, $P(\{|X_n - X| \le \epsilon\}) \to 1$ as $n \to \infty$. Thus, $X_n \xrightarrow{p} X$.

Example 6.1.27:
$X_n \xrightarrow{p} X \not\Longrightarrow X_n \xrightarrow{a.s.} X$:
Let $\Omega = (0, 1]$ and $P$ the uniform distribution on $\Omega$. Define $A_n$ by
$$A_1 = (0, \tfrac{1}{2}], \ A_2 = (\tfrac{1}{2}, 1]$$
$$A_3 = (0, \tfrac{1}{4}], \ A_4 = (\tfrac{1}{4}, \tfrac{1}{2}], \ A_5 = (\tfrac{1}{2}, \tfrac{3}{4}], \ A_6 = (\tfrac{3}{4}, 1]$$
$$A_7 = (0, \tfrac{1}{8}], \ A_8 = (\tfrac{1}{8}, \tfrac{1}{4}], \ \ldots$$
Let $X_n(\omega) = I_{A_n}(\omega)$.
It is $P(|X_n - 0| \ge \epsilon) \to 0 \ \forall \epsilon > 0$ since $X_n$ is 0 except on $A_n$ and $P(A_n) \downarrow 0$. Thus $X_n \xrightarrow{p} 0$.
But $P(\{\omega : X_n(\omega) \to 0\}) = 0$ (and not 1) because any $\omega$ keeps being in some $A_n$ beyond any $n_0$, i.e., the sequence $X_n(\omega)$ looks like $0 \ldots 0 1 0 \ldots 0 1 0 \ldots 0 1 0 \ldots$, so $X_n \not\xrightarrow{a.s.} 0$.
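The "sliding indicator" construction of Example 6.1.27 can be simulated directly. The sketch below (Python with numpy assumed) tracks one fixed $\omega$: the indicators keep returning to 1, even though $P(X_n = 1)$ shrinks to 0.

```python
# Example 6.1.27: X_n = I_{A_n} with dyadic intervals A_n; X_n ->p 0 but not a.s.
import numpy as np

def X(n, omega):
    # map n = 1, 2, 3, ... to its dyadic interval A_n = ((j-1)/2^k, j/2^k]
    k, start = 1, 1
    while n >= start + 2**k:          # locate the block k containing index n
        start += 2**k
        k += 1
    j = n - start + 1                 # position of A_n within block k
    return float((j - 1) / 2**k < omega <= j / 2**k)

rng = np.random.default_rng(2)
omega = rng.uniform(0.01, 1.0)
xs = [X(n, omega) for n in range(1, 2000)]
print("indices n < 2000 with X_n(omega) = 1:", [i + 1 for i, v in enumerate(xs) if v == 1])
# P(X_n = 1) is the length of A_n, i.e. 1/2^k, which tends to 0 as n grows.
```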

Example 6.1.28:
$X_n \xrightarrow{r} X \not\Longrightarrow X_n \xrightarrow{a.s.} X$:
Let $X_n$ be independent rv's such that $P(X_n = 0) = 1 - \frac{1}{n}$ and $P(X_n = 1) = \frac{1}{n}$.
It is $E(|X_n - 0|^r) = E(|X_n|^r) = E(|X_n|) = \frac{1}{n} \to 0$ as $n \to \infty$, so $X_n \xrightarrow{r} 0 \ \forall r > 0$ (and, due to Theorem 6.1.21, also $X_n \xrightarrow{p} 0$).
But
$$P(X_n = 0 \ \forall \, m \le n \le n_0) = \prod_{n=m}^{n_0} \Big(1 - \frac{1}{n}\Big) = \Big(\frac{m-1}{m}\Big)\Big(\frac{m}{m+1}\Big)\Big(\frac{m+1}{m+2}\Big) \cdots \Big(\frac{n_0 - 2}{n_0 - 1}\Big)\Big(\frac{n_0 - 1}{n_0}\Big) = \frac{m-1}{n_0}$$
As $n_0 \to \infty$, it is $P(X_n = 0 \ \forall \, m \le n \le n_0) \to 0 \ \forall m$, so $X_n \not\xrightarrow{a.s.} 0$.

Example 6.1.29:
$X_n \xrightarrow{a.s.} X \not\Longrightarrow X_n \xrightarrow{r} X$:
Let $\Omega = [0, 1]$ and $P$ the uniform distribution on $\Omega$. Let $A_n = [0, \frac{1}{\ln n}]$. Let $X_n(\omega) = n I_{A_n}(\omega)$ and $X(\omega) = 0$.
It holds that $\forall \omega > 0 \ \exists n_0 : \frac{1}{\ln n_0} < \omega \implies X_n(\omega) = 0 \ \forall n > n_0$, and $P(\omega = 0) = 0$. Thus, $X_n \xrightarrow{a.s.} 0$.
But $E(|X_n - 0|^r) = \frac{n^r}{\ln n} \to \infty \ \forall r > 0$, so $X_n \not\xrightarrow{r} X$.

Lecture 39: Fr 12/01/00

6.2 Weak Laws of Large Numbers

Theorem 6.2.1: WLLN: Version I
Let $\{X_i\}_{i=1}^\infty$ be a sequence of iid rv's with mean $E(X_i) = \mu$ and variance $Var(X_i) = \sigma^2 < \infty$. Then it holds that
$$\lim_{n\to\infty} P(|\bar{X}_n - \mu| \ge \epsilon) = 0 \quad \forall \epsilon > 0,$$
i.e., $\bar{X}_n \xrightarrow{p} \mu$.

Proof:
By Markov's Inequality (Corollary 3.5.2), it holds for all $\epsilon > 0$:
$$P(|\bar{X}_n - \mu| \ge \epsilon) \le \frac{E((\bar{X}_n - \mu)^2)}{\epsilon^2} = \frac{Var(\bar{X}_n)}{\epsilon^2} = \frac{\sigma^2}{n\epsilon^2} \longrightarrow 0 \text{ as } n \to \infty$$

Note:
For iid rv's with finite variance, $\bar{X}_n$ is consistent for $\mu$.
A more general way to derive a "WLLN" follows in the next Definition.

Definition 6.2.2:
Let $\{X_i\}_{i=1}^\infty$ be a sequence of rv's. Let $T_n = \sum_{i=1}^n X_i$. We say that $\{X_i\}$ obeys the WLLN with respect to a sequence of norming constants $\{B_i\}_{i=1}^\infty$, $B_i > 0$, $B_i \uparrow \infty$, if there exists a sequence of centering constants $\{A_i\}_{i=1}^\infty$ such that
$$B_n^{-1}(T_n - A_n) \xrightarrow{p} 0.$$

Theorem 6.2.3:
Let $\{X_i\}_{i=1}^\infty$ be a sequence of pairwise uncorrelated rv's with $E(X_i) = \mu_i$ and $Var(X_i) = \sigma_i^2$, $i \in \mathbb{N}$. If $\sum_{i=1}^n \sigma_i^2 \to \infty$ as $n \to \infty$, we can choose $A_n = \sum_{i=1}^n \mu_i$ and $B_n = \sum_{i=1}^n \sigma_i^2$ and get
$$\frac{\sum_{i=1}^n (X_i - \mu_i)}{\sum_{i=1}^n \sigma_i^2} \xrightarrow{p} 0.$$

Proof:
By Markov's Inequality (Corollary 3.5.2), it holds for all $\epsilon > 0$:
$$P\Big(\Big|\sum_{i=1}^n X_i - \sum_{i=1}^n \mu_i\Big| > \epsilon \sum_{i=1}^n \sigma_i^2\Big) \le \frac{E\big(\big(\sum_{i=1}^n (X_i - \mu_i)\big)^2\big)}{\epsilon^2 \big(\sum_{i=1}^n \sigma_i^2\big)^2} = \frac{1}{\epsilon^2 \sum_{i=1}^n \sigma_i^2} \longrightarrow 0 \text{ as } n \to \infty$$

Note:
To obtain Theorem 6.2.1, we choose $A_n = n\mu$ and $B_n = n\sigma^2$.

Theorem 6.2.4:
Let $\{X_i\}_{i=1}^\infty$ be a sequence of rv's. Let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$. A necessary and sufficient condition for $\{X_i\}$ to obey the WLLN with respect to $B_n = n$ is that
$$E\left(\frac{\bar{X}_n^2}{1 + \bar{X}_n^2}\right) \to 0 \quad \text{as } n \to \infty.$$

Proof:
Rohatgi, page 258, Theorem 2, and Rohatgi/Saleh, page 275, Theorem 2.

Example 6.2.5:
Let $(X_1, \ldots, X_n)$ be jointly Normal with $E(X_i) = 0$, $E(X_i^2) = 1$ for all $i$, and $Cov(X_i, X_j) = \rho$ if $|i - j| = 1$ and $Cov(X_i, X_j) = 0$ if $|i - j| > 1$.
Let $T_n = \sum_{i=1}^n X_i$. Then $T_n \sim N(0, n + 2(n-1)\rho) = N(0, \sigma^2)$. It is
$$E\left(\frac{\bar{X}_n^2}{1 + \bar{X}_n^2}\right) = E\left(\frac{T_n^2}{n^2 + T_n^2}\right) = \frac{2}{\sqrt{2\pi}\,\sigma} \int_0^\infty \frac{x^2}{n^2 + x^2} e^{-\frac{x^2}{2\sigma^2}} dx \qquad \Big|\ y = \frac{x}{\sigma}, \ dy = \frac{dx}{\sigma}$$
$$= \frac{2}{\sqrt{2\pi}} \int_0^\infty \frac{\sigma^2 y^2}{n^2 + \sigma^2 y^2} e^{-\frac{y^2}{2}} dy = \frac{2}{\sqrt{2\pi}} \int_0^\infty \frac{(n + 2(n-1)\rho)\, y^2}{n^2 + (n + 2(n-1)\rho)\, y^2} e^{-\frac{y^2}{2}} dy$$
$$\le \frac{n + 2(n-1)\rho}{n^2} \underbrace{\int_0^\infty \frac{2}{\sqrt{2\pi}}\, y^2 e^{-\frac{y^2}{2}} dy}_{=1, \text{ since Var of } N(0,1) \text{ distribution}} \longrightarrow 0 \text{ as } n \to \infty \implies \bar{X}_n \xrightarrow{p} 0$$

Note:
We would like to have a WLLN that just depends on means but does not depend on the existence of finite variances. To approach this, we consider the following:
Let $\{X_i\}_{i=1}^\infty$ be a sequence of rv's. Let $T_n = \sum_{i=1}^n X_i$. We truncate each $|X_i|$ at $c > 0$ and get
$$X_i^c = \begin{cases} X_i, & |X_i| \le c \\ 0, & \text{otherwise} \end{cases}$$
Let $T_n^c = \sum_{i=1}^n X_i^c$ and $m_n = \sum_{i=1}^n E(X_i^c)$.

Lemma 6.2.6:
For $T_n$, $T_n^c$ and $m_n$ as defined in the Note above, it holds:
$$P(|T_n - m_n| > \epsilon) \le P(|T_n^c - m_n| > \epsilon) + \sum_{i=1}^n P(|X_i| > c) \quad \forall \epsilon > 0$$

Proof:
It holds for all $\epsilon > 0$:
$$P(|T_n - m_n| > \epsilon) = P(|T_n - m_n| > \epsilon \text{ and } |X_i| \le c \ \forall i \in \{1, \ldots, n\}) + P(|T_n - m_n| > \epsilon \text{ and } |X_i| > c \text{ for at least one } i \in \{1, \ldots, n\})$$
$$\stackrel{(*)}{\le} P(|T_n^c - m_n| > \epsilon) + P(|X_i| > c \text{ for at least one } i \in \{1, \ldots, n\}) \le P(|T_n^c - m_n| > \epsilon) + \sum_{i=1}^n P(|X_i| > c)$$
(*) holds since $T_n^c = T_n$ when $|X_i| \le c \ \forall i \in \{1, \ldots, n\}$.

Note:
If the $X_i$'s are identically distributed, then
$$P(|T_n - m_n| > \epsilon) \le P(|T_n^c - m_n| > \epsilon) + n P(|X_1| > c) \quad \forall \epsilon > 0.$$
If the $X_i$'s are iid, then
$$P(|T_n - m_n| > \epsilon) \le \frac{n E((X_1^c)^2)}{\epsilon^2} + n P(|X_1| > c) \quad \forall \epsilon > 0. \quad (*)$$
Note that $P(|X_i| > c) = P(|X_1| > c) \ \forall i \in \mathbb{N}$ if the $X_i$'s are identically distributed and that $E((X_i^c)^2) = E((X_1^c)^2) \ \forall i \in \mathbb{N}$ if the $X_i$'s are iid.

Lecture 42/2: Fr 12/08/00

Theorem 6.2.7: Khintchine's WLLN
Let $\{X_i\}_{i=1}^\infty$ be a sequence of iid rv's with finite mean $E(X_i) = \mu$. Then it holds:
$$\bar{X}_n = \frac{1}{n} T_n \xrightarrow{p} \mu$$

Proof:
If we take $c = n$ and replace $\epsilon$ by $n\epsilon$ in (*) in the Note above, we get
$$P\left(\left|\frac{T_n - m_n}{n}\right| > \epsilon\right) = P(|T_n - m_n| > n\epsilon) \le \frac{E((X_1^n)^2)}{n\epsilon^2} + n P(|X_1| > n).$$
Since $E(|X_1|) < \infty$, it is $n P(|X_1| > n) \to 0$ as $n \to \infty$ by Theorem 3.1.9. From Corollary 3.1.12 we know that $E(|X|^\alpha) = \alpha \int_0^\infty x^{\alpha - 1} P(|X| > x) dx$. Therefore,
$$E((X_1^n)^2) = 2 \int_0^n x P(|X_1^n| > x) dx = 2 \int_0^A x P(|X_1^n| > x) dx + 2 \int_A^n x P(|X_1^n| > x) dx \stackrel{(+)}{\le} K + \delta \int_A^n dx \le K + n\delta$$
In (+), $A$ is chosen sufficiently large such that $x P(|X_1^n| > x) < \frac{\delta}{2} \ \forall x \ge A$ for an arbitrary constant $\delta > 0$, and $K > 0$ is a constant.
Therefore,
$$\frac{E((X_1^n)^2)}{n\epsilon^2} \le \frac{K}{n\epsilon^2} + \frac{\delta}{\epsilon^2}$$
Since $\delta$ is arbitrary, we can make the right hand side of this last inequality arbitrarily small for sufficiently large $n$.
Since $E(X_i) = \mu \ \forall i$, it is
$$\frac{m_n}{n} = \frac{\sum_{i=1}^n E(X_i^n)}{n} \to \mu \text{ as } n \to \infty.$$

Note:
Theorem 6.2.7 meets the previously stated goal of not having a finite variance requirement.
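Khintchine's WLLN only needs a finite mean, which can be illustrated with a heavy–tailed distribution. The sketch below (Python with numpy assumed) uses Lomax/Pareto-II samples with shape $a = 1.5$, which have mean $1/(a-1) = 2$ but no finite variance, so Theorem 6.2.1 does not apply while Theorem 6.2.7 does.

```python
# Khintchine's WLLN with infinite variance: Xbar_n still converges to mu = 2.
import numpy as np

rng = np.random.default_rng(4)
a = 1.5            # Lomax shape; mean = 1/(a-1) = 2, variance does not exist
for n in [100, 10_000, 1_000_000]:
    print(n, "Xbar_n =", round(rng.pareto(a, size=n).mean(), 3))
```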

6.3 Strong Laws of Large Numbers

Definition 6.3.1:
Let $\{X_i\}_{i=1}^\infty$ be a sequence of rv's. Let $T_n = \sum_{i=1}^n X_i$. We say that $\{X_i\}$ obeys the SLLN with respect to a sequence of norming constants $\{B_i\}_{i=1}^\infty$, $B_i > 0$, $B_i \uparrow \infty$, if there exists a sequence of centering constants $\{A_i\}_{i=1}^\infty$ such that
$$B_n^{-1}(T_n - A_n) \xrightarrow{a.s.} 0.$$

Note:
Unless otherwise specified, we will only use the case that $B_n = n$ in this section.

Theorem 6.3.2:
$$X_n \xrightarrow{a.s.} X \iff \lim_{n\to\infty} P\big(\sup_{m \ge n} |X_m - X| > \epsilon\big) = 0 \ \forall \epsilon > 0.$$

Proof: (see also Rohatgi, page 249, Theorem 11)
WLOG, we can assume that $X = 0$ since $X_n \xrightarrow{a.s.} X$ implies $X_n - X \xrightarrow{a.s.} 0$. Thus, we have to prove:
$$X_n \xrightarrow{a.s.} 0 \iff \lim_{n\to\infty} P\big(\sup_{m \ge n} |X_m| > \epsilon\big) = 0 \ \forall \epsilon > 0$$
Choose $\epsilon > 0$ and define
$$A_n(\epsilon) = \{\sup_{m \ge n} |X_m| > \epsilon\}, \qquad C = \{\lim_{n\to\infty} X_n = 0\}$$
"$\Longrightarrow$": Since $X_n \xrightarrow{a.s.} 0$, we know that $P(C) = 1$ and therefore $P(C^c) = 0$.
Let $B_n(\epsilon) = C \cap A_n(\epsilon)$. Note that $B_{n+1}(\epsilon) \subseteq B_n(\epsilon)$ and for the limit set $\bigcap_{n=1}^\infty B_n(\epsilon) = \emptyset$. It follows that
$$\lim_{n\to\infty} P(B_n(\epsilon)) = P\Big(\bigcap_{n=1}^\infty B_n(\epsilon)\Big) = 0.$$
We also have
$$P(B_n(\epsilon)) = P(A_n \cap C) = 1 - P(C^c \cup A_n^c) = 1 - \underbrace{P(C^c)}_{=0} - P(A_n^c) + \underbrace{P(C^c \cap A_n^c)}_{=0} = P(A_n)$$
$$\implies \lim_{n\to\infty} P(A_n(\epsilon)) = 0$$
"$\Longleftarrow$": Assume that $\lim_{n\to\infty} P(A_n(\epsilon)) = 0 \ \forall \epsilon > 0$ and define $D(\epsilon) = \{\limsup_{n\to\infty} |X_n| > \epsilon\}$.
Since $D(\epsilon) \subseteq A_n(\epsilon) \ \forall n \in \mathbb{N}$, it follows that $P(D(\epsilon)) = 0 \ \forall \epsilon > 0$. Also,
$$C^c = \{\lim_{n\to\infty} X_n \ne 0\} \subseteq \bigcup_{k=1}^\infty \{\limsup_{n\to\infty} |X_n| > \tfrac{1}{k}\}.$$
$$\implies 1 - P(C) \le \sum_{k=1}^\infty P\big(D(\tfrac{1}{k})\big) = 0 \implies X_n \xrightarrow{a.s.} 0$$

Note:
(i) $X_n \xrightarrow{a.s.} 0$ implies that $\forall \epsilon > 0 \ \forall \delta > 0 \ \exists n_0 \in \mathbb{N} : P(\sup_{n \ge n_0} |X_n| > \epsilon) < \delta$.
(ii) Recall that for a given sequence of events $\{A_n\}_{n=1}^\infty$,
$$A = \limsup_{n\to\infty} A_n = \lim_{n\to\infty} \bigcup_{k=n}^\infty A_k = \bigcap_{n=1}^\infty \bigcup_{k=n}^\infty A_k$$
is the event that infinitely many of the $A_n$ occur. We write $P(A) = P(A_n \text{ i.o.})$, where i.o. stands for "infinitely often".
(iii) Using the terminology defined in (ii) above, we can rewrite Theorem 6.3.2 as
$$X_n \xrightarrow{a.s.} 0 \iff P(|X_n| > \epsilon \text{ i.o.}) = 0 \ \forall \epsilon > 0.$$

Lecture 02: We 01/10/01

Theorem 6.3.3: Borel–Cantelli Lemma
(i) 1st BC–Lemma: Let $\{A_n\}_{n=1}^\infty$ be a sequence of events such that $\sum_{n=1}^\infty P(A_n) < \infty$. Then $P(A) = P(A_n \text{ i.o.}) = 0$.
(ii) 2nd BC–Lemma: Let $\{A_n\}_{n=1}^\infty$ be a sequence of independent events such that $\sum_{n=1}^\infty P(A_n) = \infty$. Then $P(A) = P(A_n \text{ i.o.}) = 1$.

Proof of (ii):
For $n_0 > n$, independence gives
$$P\Big(\bigcap_{k=n}^{n_0} A_k^c\Big) = \prod_{k=n}^{n_0} (1 - P(A_k)) \stackrel{(+)}{\le} \lim_{n_0\to\infty} \exp\Big(-\sum_{k=n}^{n_0} P(A_k)\Big) = 0,$$
so $P\big(\bigcup_{k=n}^\infty A_k\big) = 1$ for every $n$
$$\implies P(A) = 1$$
(+) holds since
$$1 - \exp\Big(-\sum_{j=n}^{n_0} \alpha_j\Big) \le 1 - \prod_{j=n}^{n_0} (1 - \alpha_j) \le \sum_{j=n}^{n_0} \alpha_j \quad \text{for } n_0 > n \text{ and } 0 \le \alpha_j \le 1$$

Example 6.3.4:
Independence is necessary for the 2nd BC–Lemma:
Let $\Omega = (0, 1)$ and $P$ the uniform distribution on $\Omega$. Let $A_n = (0, \frac{1}{n})$. Therefore,
$$\sum_{n=1}^\infty P(A_n) = \sum_{n=1}^\infty \frac{1}{n} = \infty.$$
But for any $\omega \in \Omega$, $A_n$ occurs only for $n = 1, 2, \ldots, \lfloor\frac{1}{\omega}\rfloor$, where $\lfloor\frac{1}{\omega}\rfloor$ denotes the largest integer ("floor") that is $\le \frac{1}{\omega}$. Therefore, $P(A) = P(A_n \text{ i.o.}) = 0$.

Lemma 6.3.5: Kolmogorov's Inequality
Let $\{X_i\}_{i=1}^\infty$ be a sequence of independent rv's with common mean 0 and variances $\sigma_i^2$. Let $T_n = \sum_{i=1}^n X_i$. Then it holds:
$$P\big(\max_{1 \le k \le n} |T_k| \ge \epsilon\big) \le \frac{\sum_{i=1}^n \sigma_i^2}{\epsilon^2} \quad \forall \epsilon > 0$$

Proof:
See Rohatgi, page 268, Lemma 2, and Rohatgi/Saleh, page 284, Lemma 1.

Lemma 6.3.6: Kronecker's Lemma
For any real numbers $x_n$, if $\sum_{n=1}^\infty x_n$ converges to $s < \infty$ and $\{B_n\}$ is a sequence of constants with $B_n > 0$, $B_n \uparrow \infty$, then
$$\frac{1}{B_n} \sum_{k=1}^n B_k x_k \longrightarrow 0 \quad \text{as } n \to \infty.$$

Proof:
See Rohatgi, page 269, Lemma 3, and Rohatgi/Saleh, page 285, Lemma 2.

Theorem 6.3.7: Cauchy Criterion
$$X_n \xrightarrow{a.s.} X \iff \lim_{n\to\infty} P\big(\sup_m |X_{n+m} - X_n| \le \epsilon\big) = 1 \ \forall \epsilon > 0.$$

Proof:
See Rohatgi, page 270, Theorem 5.

Theorem 6.3.8:
Let $\{X_n\}_{n=1}^\infty$ be a sequence of independent rv's. If $\sum_{n=1}^\infty Var(X_n) < \infty$, then $\sum_{n=1}^\infty (X_n - E(X_n))$ converges a.s.

Corollary 6.3.9:
Let $\{B_i\}_{i=1}^\infty$, $B_i > 0$, $B_i \uparrow \infty$, be a sequence of norming constants. Let $T_n = \sum_{i=1}^n X_i$. If
$$\sum_{i=1}^\infty \frac{Var(X_i)}{B_i^2} < \infty,$$
then $B_n^{-1}(T_n - E(T_n)) \xrightarrow{a.s.} 0$.

Lemma 6.3.10:
Let $\{X_n\}_{n=1}^\infty$ and $\{X'_n\}_{n=1}^\infty$ be sequences of rv's with $\sum_{n=1}^\infty P(X_n \ne X'_n) < \infty$. Then $T_n = \sum_{i=1}^n X_i$ and $T'_n = \sum_{i=1}^n X'_i$ are convergence–equivalent, i.e., for norming constants $B_n \uparrow \infty$, $B_n^{-1}(T_n - T'_n) \xrightarrow{a.s.} 0$, so that $B_n^{-1} T_n$ and $B_n^{-1} T'_n$ converge (to the same limit) or diverge together with probability 1.

Lemma 6.3.11:
Let $X$ be a rv with $E(|X|) < \infty$. Then
$$\sum_{n=1}^\infty P(|X| \ge n) \le E(|X|) \le 1 + \sum_{n=1}^\infty P(|X| \ge n).$$

Proof:
See Rohatgi, page 265, Theorem 3.

Theorem 6.3.13: Kolmogorov's SLLN
Let $\{X_i\}_{i=1}^\infty$ be a sequence of iid rv's. Let $T_n = \sum_{i=1}^n X_i$. Then it holds:
$$\frac{T_n}{n} = \bar{X}_n \xrightarrow{a.s.} \mu \ \text{for some constant } \mu \iff E(|X_1|) < \infty, \ \text{and in that case } \mu = E(X_1).$$

Proof ("$\Longleftarrow$" direction):
Suppose $E(|X|) < \infty$, where $X$ has the common distribution of the $X_i$'s. Truncate each $X_n$ at $n$, i.e., define
$$X'_n = \begin{cases} X_n, & |X_n| < n \\ 0, & \text{otherwise} \end{cases} \qquad T'_n = \sum_{i=1}^n X'_i.$$
Then
$$\sum_{n=1}^\infty P(X_n \ne X'_n) = \sum_{n=1}^\infty P(|X_n| \ge n) \stackrel{iid}{=} \sum_{k=1}^\infty P(|X| \ge k) \stackrel{\text{Lemma 6.3.11}}{\le} E(|X|) < \infty$$
By Lemma 6.3.10, it follows that $T_n$ and $T'_n$ are convergence–equivalent. Thus, it is sufficient to prove that $\bar{X}'_n \xrightarrow{a.s.} E(X)$.
We now establish the conditions needed in Corollary 6.3.9. It is
$$Var(X'_n) \le E((X'_n)^2) = \int_{-n}^n x^2 f_X(x) dx = \sum_{k=0}^{n-1} \int_{k \le |x| < k+1} x^2 f_X(x) dx \le \sum_{k=0}^{n-1} (k+1)^2 P(k \le |X| < k+1)$$
$$\implies \sum_{n=1}^\infty \frac{Var(X'_n)}{n^2} \le \sum_{k=1}^\infty (k+1)^2 P(k \le |X| < k+1) \sum_{n=k}^\infty \frac{1}{n^2} + P(0 \le |X| < 1) \sum_{n=1}^\infty \frac{1}{n^2} \quad (A)$$
It is
$$\sum_{n=k}^\infty \frac{1}{n^2} = \frac{1}{k^2} + \frac{1}{(k+1)^2} + \frac{1}{(k+2)^2} + \ldots \le \frac{1}{k^2} + \frac{1}{k(k+1)} + \frac{1}{(k+1)(k+2)} + \ldots = \frac{1}{k^2} + \sum_{n=k+1}^\infty \frac{1}{n(n-1)}$$
From Bronstein, page 30, # 7, we know that
$$1 = \frac{1}{1 \cdot 2} + \frac{1}{2 \cdot 3} + \frac{1}{3 \cdot 4} + \ldots + \frac{1}{n(n+1)} + \ldots = \frac{1}{1 \cdot 2} + \frac{1}{2 \cdot 3} + \ldots + \frac{1}{(k-1) \cdot k} + \sum_{n=k+1}^\infty \frac{1}{n(n-1)}$$
$$\implies \sum_{n=k+1}^\infty \frac{1}{n(n-1)} = 1 - \frac{1}{1 \cdot 2} - \frac{1}{2 \cdot 3} - \frac{1}{3 \cdot 4} - \ldots - \frac{1}{(k-1) \cdot k} = \frac{1}{2} - \frac{1}{2 \cdot 3} - \frac{1}{3 \cdot 4} - \ldots - \frac{1}{(k-1) \cdot k}$$
$$= \frac{1}{3} - \frac{1}{3 \cdot 4} - \ldots - \frac{1}{(k-1) \cdot k} = \frac{1}{4} - \ldots - \frac{1}{(k-1) \cdot k} = \ldots = \frac{1}{k}$$
$$\implies \sum_{n=k}^\infty \frac{1}{n^2} \le \frac{1}{k^2} + \sum_{n=k+1}^\infty \frac{1}{n(n-1)} = \frac{1}{k^2} + \frac{1}{k} \le \frac{2}{k}$$
Using this result in (A), we get
$$\sum_{n=1}^\infty \frac{1}{n^2} Var(X'_n) \le 2 \sum_{k=1}^\infty \frac{(k+1)^2}{k} P(k \le |X| < k+1) + 2 P(0 \le |X| < 1)$$
$$= 2 \sum_{k=0}^\infty k P(k \le |X| < k+1) + 4 \sum_{k=1}^\infty P(k \le |X| < k+1) + 2 \sum_{k=1}^\infty \frac{1}{k} P(k \le |X| < k+1) + 2 P(0 \le |X| < 1) \stackrel{(B)}{\le} 2 E(|X|) + 4 + 2 + 2 < \infty$$
To establish (B), we use an inequality from the Proof of Lemma 6.3.11, i.e.,
$$\sum_{k=0}^\infty k P(k \le |X| < k+1) \stackrel{\text{Proof}}{\le} \sum_{n=1}^\infty P(|X| \ge n) \stackrel{\text{Lemma 6.3.11}}{\le} E(|X|)$$
Thus, the conditions needed in Corollary 6.3.9 are met. With $B_n = n$, it follows that
$$\frac{1}{n} T'_n - \frac{1}{n} E(T'_n) \xrightarrow{a.s.} 0 \quad (C)$$
Since $E(X'_n) \to E(X)$ as $n \to \infty$, it follows by Kronecker's Lemma (6.3.6) that $\frac{1}{n} E(T'_n) \to E(X)$. Thus, when we replace $\frac{1}{n} E(T'_n)$ by $E(X)$ in (C), we get
$$\frac{1}{n} T'_n \xrightarrow{a.s.} E(X) \stackrel{\text{Lemma 6.3.10}}{\Longrightarrow} \frac{1}{n} T_n \xrightarrow{a.s.} E(X)$$
since $T_n$ and $T'_n$ are convergence–equivalent (as defined in Lemma 6.3.10).

Lecture 04: We 01/17/01

6.4 Central Limit Theorems

Let $\{X_n\}_{n=1}^\infty$ be a sequence of rv's with cdf's $\{F_n\}_{n=1}^\infty$. Suppose that the mgf $M_n(t)$ of $X_n$ exists.
Questions: Does $M_n(t)$ converge? Does it converge to a mgf $M(t)$? If it does converge, does it hold that $X_n \xrightarrow{d} X$ for some rv $X$?

Example 6.4.1:
Let $\{X_n\}_{n=1}^\infty$ be a sequence of rv's such that $P(X_n = -n) = 1$. Then the mgf is $M_n(t) = E(e^{tX_n}) = e^{-tn}$. So
$$\lim_{n\to\infty} M_n(t) = \begin{cases} 0, & t > 0 \\ 1, & t = 0 \\ \infty, & t < 0 \end{cases}$$
So $M_n(t)$ does not converge to a mgf and $F_n(x) \to F(x) = 1 \ \forall x$. But $F(x)$ is not a cdf.

Note:
Due to Example 6.4.1, the existence of mgf's $M_n(t)$ that converge to something is not enough to conclude convergence in distribution.
Conversely, suppose that $X_n$ has mgf $M_n(t)$, $X$ has mgf $M(t)$, and $X_n \xrightarrow{d} X$. Does it hold that $M_n(t) \to M(t)$? Not necessarily! See Rohatgi, page 277, Example 2, and Rohatgi/Saleh, page 289, Example 2, as a counterexample. Thus, convergence in distribution of rv's that all have mgf's does not imply the convergence of mgf's.
However, we can make the following statement in the next Theorem:

Theorem 6.4.2: Continuity Theorem
Let $\{X_n\}_{n=1}^\infty$ be a sequence of rv's with cdf's $\{F_n\}_{n=1}^\infty$ and mgf's $\{M_n(t)\}_{n=1}^\infty$. Suppose that $M_n(t)$ exists for $|t| \le t_0 \ \forall n$. If there exists a rv $X$ with cdf $F$ and mgf $M(t)$, which exists for $|t| \le t_1 < t_0$, such that $\lim_{n\to\infty} M_n(t) = M(t) \ \forall t \in [-t_1, t_1]$, then $F_n \xrightarrow{w} F$, i.e., $X_n \xrightarrow{d} X$.

Example 6.4.3:
Let $X_n \sim Bin(n, \frac{\lambda}{n})$. Recall (e.g., from Theorem 3.3.12 and related Theorems) that for $X \sim Bin(n, p)$ the mgf is $M_X(t) = (1 - p + pe^t)^n$. Thus,
$$M_n(t) = \Big(1 - \frac{\lambda}{n} + \frac{\lambda}{n} e^t\Big)^n = \Big(1 + \frac{\lambda(e^t - 1)}{n}\Big)^n \stackrel{(*)}{\longrightarrow} e^{\lambda(e^t - 1)} \text{ as } n \to \infty.$$
In (*) we use the fact that $\lim_{n\to\infty} (1 + \frac{x}{n})^n = e^x$. Recall that $e^{\lambda(e^t - 1)}$ is the mgf of a rv $X$ where $X \sim Poisson(\lambda)$. Thus, we have established the well–known result that the Binomial distribution approaches the Poisson distribution, given that $n \to \infty$ in such a way that $np = \lambda > 0$.
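The Binomial-to-Poisson limit of Example 6.4.3 can be checked numerically. A minimal sketch (Python with scipy assumed) compares the two pmf's for $\lambda = 3$:

```python
# Example 6.4.3: Bin(n, lambda/n) pmf approaches the Poisson(lambda) pmf.
import numpy as np
from scipy.stats import binom, poisson

lam, ks = 3.0, np.arange(0, 10)
for n in [10, 100, 10_000]:
    err = np.max(np.abs(binom.pmf(ks, n, lam / n) - poisson.pmf(ks, lam)))
    print(n, "max |pmf difference| on k = 0..9:", round(float(err), 5))
```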

Note:
Recall Theorem 3.3.11: Suppose that $\{X_n\}_{n=1}^\infty$ is a sequence of rv's with characteristic functions $\{\Phi_n(t)\}_{n=1}^\infty$. Suppose that
$$\lim_{n\to\infty} \Phi_n(t) = \Phi(t) \ \forall t \in (-h, h) \ \text{for some } h > 0,$$
and $\Phi(t)$ is the characteristic function of a rv $X$. Then $X_n \xrightarrow{d} X$.

Theorem 6.4.4: Lindeberg–Lévy Central Limit Theorem
Let $\{X_n\}_{n=1}^\infty$ be a sequence of iid rv's with $E(X_i) = \mu$ and $0 < Var(X_i) = \sigma^2 < \infty$. Then it holds for $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$ that
$$\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \xrightarrow{d} Z$$
where $Z \sim N(0, 1)$.

Proof:
Let $Z \sim N(0, 1)$. According to Theorem 3.3.12 (v), the characteristic function of $Z$ is $\Phi_Z(t) = \exp(-\frac{1}{2}t^2)$.
Let $\Phi(t)$ be the characteristic function of $X_i$. We now determine the characteristic function $\Phi_n(t)$ of $\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma}$:
$$\Phi_n(t) = E\left[\exp\left(it\, \frac{\sqrt{n}\big(\frac{1}{n}\sum_{i=1}^n X_i - \mu\big)}{\sigma}\right)\right] = \int_{-\infty}^\infty \cdots \int_{-\infty}^\infty \exp\left(it\, \frac{\sqrt{n}\big(\frac{1}{n}\sum_{i=1}^n x_i - \mu\big)}{\sigma}\right) dF_{\underline{X}}(\underline{x})$$
$$= \exp\Big(-\frac{it\sqrt{n}\mu}{\sigma}\Big) \int_{-\infty}^\infty \exp\Big(\frac{itx_1}{\sqrt{n}\sigma}\Big) dF_{X_1}(x_1) \cdots \int_{-\infty}^\infty \exp\Big(\frac{itx_n}{\sqrt{n}\sigma}\Big) dF_{X_n}(x_n) = \left(\Phi\Big(\frac{t}{\sqrt{n}\sigma}\Big) \exp\Big(-\frac{it\mu}{\sqrt{n}\sigma}\Big)\right)^n$$
Recall from Theorem 3.3.5 that if the $k^{th}$ moment exists, then $\Phi^{(k)}(0) = i^k E(X^k)$. In particular, it holds for the given distribution that $\Phi^{(1)}(0) = iE(X) = i\mu$ and $\Phi^{(2)}(0) = i^2 E(X^2) = i^2(\mu^2 + \sigma^2) = -(\mu^2 + \sigma^2)$. Also recall the definition of a Taylor series in MacLaurin's form:
$$f(x) = f(0) + \frac{f'(0)}{1!}x + \frac{f''(0)}{2!}x^2 + \frac{f'''(0)}{3!}x^3 + \ldots + \frac{f^{(n)}(0)}{n!}x^n + \ldots,$$
e.g.,
$$f(x) = e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \ldots$$
Thus, if we develop a Taylor series for $\Phi\big(\frac{t}{\sqrt{n}\sigma}\big)$ around $t = 0$, we get:
$$\Phi\Big(\frac{t}{\sqrt{n}\sigma}\Big) = \Phi(0) + \frac{t}{\sqrt{n}\sigma}\Phi'(0) + \frac{1}{2}\Big(\frac{t}{\sqrt{n}\sigma}\Big)^2\Phi''(0) + \ldots = 1 + \frac{t\, i\mu}{\sqrt{n}\sigma} - \frac{1}{2} t^2 \frac{\mu^2 + \sigma^2}{n\sigma^2} + o\left(\Big(\frac{t}{\sqrt{n}\sigma}\Big)^2\right)$$
Here we make use of the Landau symbol "o". In general, if we write $u(x) = o(v(x))$ for $x \to L$, this implies $\lim_{x\to L} \frac{u(x)}{v(x)} = 0$, i.e., $u(x)$ goes to 0 faster than $v(x)$ or $v(x)$ goes to $\infty$ faster than $u(x)$. We say that $u(x)$ is of smaller order than $v(x)$ as $x \to L$. Examples are $\frac{1}{x^3} = o(\frac{1}{x^2})$ and $x^2 = o(x^3)$ for $x \to \infty$. See Rohatgi, page 6, for more details on the Landau symbols "O" and "o".
Similarly, if we develop a Taylor series for $\exp\big(-\frac{it\mu}{\sqrt{n}\sigma}\big)$ around $t = 0$, we get:
$$\exp\Big(-\frac{it\mu}{\sqrt{n}\sigma}\Big) = 1 - \frac{t\, i\mu}{\sqrt{n}\sigma} - \frac{1}{2} t^2 \frac{\mu^2}{n\sigma^2} + o\left(\Big(\frac{t}{\sqrt{n}\sigma}\Big)^2\right)$$
Combining these results, we get:
$$\Phi_n(t) = \left(\Big(1 + \frac{t\, i\mu}{\sqrt{n}\sigma} - \frac{1}{2}t^2\frac{\mu^2 + \sigma^2}{n\sigma^2} + o\Big(\big(\tfrac{t}{\sqrt{n}\sigma}\big)^2\Big)\Big)\Big(1 - \frac{t\, i\mu}{\sqrt{n}\sigma} - \frac{1}{2}t^2\frac{\mu^2}{n\sigma^2} + o\Big(\big(\tfrac{t}{\sqrt{n}\sigma}\big)^2\Big)\Big)\right)^n$$
$$= \left(1 - \frac{t\, i\mu}{\sqrt{n}\sigma} - \frac{1}{2}t^2\frac{\mu^2}{n\sigma^2} + \frac{t\, i\mu}{\sqrt{n}\sigma} + t^2\frac{\mu^2}{n\sigma^2} - \frac{1}{2}t^2\frac{\mu^2 + \sigma^2}{n\sigma^2} + o\Big(\big(\tfrac{t}{\sqrt{n}\sigma}\big)^2\Big)\right)^n$$
$$= \left(1 - \frac{1}{2}\frac{t^2}{n} + o\Big(\big(\tfrac{t}{\sqrt{n}\sigma}\big)^2\Big)\right)^n = \left(1 + \frac{-\frac{1}{2}t^2}{n} + o\Big(\frac{1}{n}\Big)\right)^n \stackrel{(*)}{\longrightarrow} \exp\Big(-\frac{t^2}{2}\Big) \text{ as } n \to \infty$$
Thus, $\lim_{n\to\infty} \Phi_n(t) = \Phi_Z(t) \ \forall t$. For a proof of (*), see Rohatgi, page 278, Lemma 1. According to the Note above, it holds that
$$\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \xrightarrow{d} Z.$$

Lecture 05: Fr 01/19/01

Definition 6.4.5:
Let $X_1, X_2$ be iid non–degenerate rv's with common cdf $F$. Let $a_1, a_2 > 0$. We say that $F$ is stable if there exist constants $A$ and $B$ (depending on $a_1$ and $a_2$) such that $B^{-1}(a_1 X_1 + a_2 X_2 - A)$ also has cdf $F$.

Note:
When generalizing the previous definition to sequences of rv's, we have the following examples for stable distributions:
• $X_i$ iid Cauchy. Then $\frac{1}{n}\sum_{i=1}^n X_i \sim$ Cauchy (here $B_n = n$, $A_n = 0$).
• $X_i$ iid $N(0, 1)$. Then $\frac{1}{\sqrt{n}}\sum_{i=1}^n X_i \sim N(0, 1)$ (here $B_n = \sqrt{n}$, $A_n = 0$).

Definition 6.4.6:
Let $\{X_i\}_{i=1}^\infty$ be a sequence of iid rv's with common cdf $F$. Let $T_n = \sum_{i=1}^n X_i$. $F$ belongs to the domain of attraction of a distribution $V$ if there exist norming and centering constants $\{B_n\}_{n=1}^\infty$, $B_n > 0$, and $\{A_n\}_{n=1}^\infty$ such that
$$P(B_n^{-1}(T_n - A_n) \le x) = F_{B_n^{-1}(T_n - A_n)}(x) \to V(x) \ \text{as } n \to \infty$$
at all continuity points $x$ of $V$.

Note:
A very general Theorem from Loève states that only stable distributions can have domains of attraction. From the practical point of view, a wide class of distributions $F$ belong to the domain of attraction of the Normal distribution.

Theorem 6.4.7: Lindeberg Central Limit Theorem
Let $\{X_i\}_{i=1}^\infty$ be a sequence of independent non–degenerate rv's with cdf's $\{F_i\}_{i=1}^\infty$. Assume that $E(X_k) = \mu_k$ and $Var(X_k) = \sigma_k^2 < \infty$, and let $s_n^2 = \sum_{k=1}^n \sigma_k^2$. If the $X_k$ are continuous rv's with pdf's $F'_k$, assume that it holds for all $\epsilon > 0$ that
$$(A) \quad \lim_{n\to\infty} \frac{1}{s_n^2} \sum_{k=1}^n \int_{\{|x - \mu_k| > \epsilon s_n\}} (x - \mu_k)^2 F'_k(x) dx = 0.$$
If the $X_k$ are discrete rv's with support $\{x_{kl}\}$ and probabilities $\{p_{kl}\}$, $l = 1, 2, \ldots$, assume that it holds for all $\epsilon > 0$ that
$$(B) \quad \lim_{n\to\infty} \frac{1}{s_n^2} \sum_{k=1}^n \sum_{|x_{kl} - \mu_k| > \epsilon s_n} (x_{kl} - \mu_k)^2 p_{kl} = 0.$$
The conditions (A) and (B) are called Lindeberg Condition (LC). If either LC holds, then
$$\frac{\sum_{k=1}^n (X_k - \mu_k)}{s_n} \xrightarrow{d} Z$$
where $Z \sim N(0, 1)$.

Proof:
Similar to the proof of Theorem 6.4.4, we can use characteristic functions again. An alternative proof is given in Rohatgi, pages 282–288.

Note:
Feller shows that the LC is a necessary condition if $\frac{\sigma_n^2}{s_n^2} \to 0$ and $s_n^2 \to \infty$ as $n \to \infty$.

Corollary 6.4.8:
Let $\{X_i\}_{i=1}^\infty$ be a sequence of iid rv's such that $\frac{1}{\sqrt{n}}\sum_{i=1}^n X_i$ has the same distribution for all $n$. If $E(X_i) = 0$ and $Var(X_i) = 1$, then $X_i \sim N(0, 1)$.

Proof:
Let $F$ be the common cdf of $\frac{1}{\sqrt{n}}\sum_{i=1}^n X_i$ for all $n$ (including $n = 1$). By the CLT,
$$\lim_{n\to\infty} P\Big(\frac{1}{\sqrt{n}}\sum_{i=1}^n X_i \le x\Big) = \Phi(x),$$
where $\Phi(x)$ denotes $P(Z \le x)$ for $Z \sim N(0,1)$. Also, $P\big(\frac{1}{\sqrt{n}}\sum_{i=1}^n X_i \le x\big) = F(x)$ for each $n$. Therefore, we must have $F(x) = \Phi(x)$.

Note:
In general, if $X_1, X_2, \ldots$ are independent rv's such that there exists a constant $A$ with $P(|X_n| \le A) = 1 \ \forall n$, then the LC is satisfied if $s_n^2 \to \infty$ as $n \to \infty$. Why??
Suppose that $s_n^2 \to \infty$ as $n \to \infty$. Since the $|X_k|$'s are uniformly bounded (by $A$), so are the rv's $(X_k - E(X_k))$. Thus, for every $\epsilon > 0$ there exists an $N_\epsilon$ such that if $n \ge N_\epsilon$ then
$$P(|X_k - E(X_k)| < \epsilon s_n, \ k = 1, \ldots, n) = 1.$$
This implies that the LC holds since we would integrate (or sum) over the empty set, i.e., the set $\{|x - \mu_k| > \epsilon s_n\} = \emptyset$.
The converse also holds. For a sequence of uniformly bounded independent rv's, a necessary and sufficient condition for the CLT to hold is that $s_n^2 \to \infty$ as $n \to \infty$.

Example 6.4.9:
Let $\{X_i\}_{i=1}^\infty$ be a sequence of independent rv's such that $E(X_k) = 0$, $\alpha_k = E(|X_k|^{2+\delta}) < \infty$ for some $\delta > 0$, and $\sum_{k=1}^n \alpha_k = o(s_n^{2+\delta})$.
Does the LC hold? It is:
$$\frac{1}{s_n^2} \sum_{k=1}^n \int_{\{|x| > \epsilon s_n\}} x^2 f_k(x) dx \stackrel{(A)}{\le} \frac{1}{s_n^2} \sum_{k=1}^n \int_{\{|x| > \epsilon s_n\}} \frac{|x|^{2+\delta}}{\epsilon^\delta s_n^\delta} f_k(x) dx \le \frac{1}{s_n^2 \epsilon^\delta s_n^\delta} \sum_{k=1}^n \int_{-\infty}^\infty |x|^{2+\delta} f_k(x) dx$$
$$= \frac{1}{s_n^2 \epsilon^\delta s_n^\delta} \sum_{k=1}^n \alpha_k = \frac{1}{\epsilon^\delta} \frac{\sum_{k=1}^n \alpha_k}{s_n^{2+\delta}} \stackrel{(B)}{\longrightarrow} 0 \ \text{as } n \to \infty$$
(A) holds since for $|x| > \epsilon s_n$, it is $\frac{|x|^\delta}{\epsilon^\delta s_n^\delta} > 1$. (B) holds since $\sum_{k=1}^n \alpha_k = o(s_n^{2+\delta})$.
Thus, the LC is satisfied and the CLT holds.

Note:
(i) In general, if there exists a $\delta > 0$ such that
$$\frac{\sum_{k=1}^n E(|X_k - \mu_k|^{2+\delta})}{s_n^{2+\delta}} \longrightarrow 0 \ \text{as } n \to \infty,$$
then the LC holds.
(ii) Both the CLT and the WLLN hold for a large class of sequences of rv's $\{X_i\}_{i=1}^n$. If the $\{X_i\}$'s are independent uniformly bounded rv's, i.e., if $P(|X_n| \le M) = 1 \ \forall n$, the WLLN (as formulated in Theorem 6.2.3) holds. The CLT holds provided that $s_n^2 \to \infty$ as $n \to \infty$.
If the rv's $\{X_i\}$ are iid, then the CLT is a stronger result than the WLLN since the CLT provides an estimate of the probability $P\big(\frac{1}{n}|\sum_{i=1}^n X_i - n\mu| \ge \epsilon\big) \approx 1 - P\big(|Z| \le \frac{\epsilon\sqrt{n}}{\sigma}\big)$, where $Z \sim N(0,1)$, and the WLLN follows. However, note that the CLT requires the existence of a 2nd moment while the WLLN does not.
(iii) If the $\{X_i\}$ are independent (but not identically distributed) rv's, the CLT may apply while the WLLN does not.
(iv) See Rohatgi, pages 289–293, and Rohatgi/Saleh, pages 299–303, for additional details and examples.

7 Sample Moments

7.1 Random Sampling

(Based on Casella/Berger, Sections 5.1 & 5.2)

Definition 7.1.1:
Let $X_1, \ldots, X_n$ be iid rv's with common cdf $F$. We say that $\{X_1, \ldots, X_n\}$ is a (random) sample of size $n$ from the population distribution $F$. The vector of values $\{x_1, \ldots, x_n\}$ is called a realization of the sample. A rv $g(X_1, \ldots, X_n)$ which is a Borel–measurable function of $X_1, \ldots, X_n$ and does not depend on any unknown parameter is called a (sample) statistic.

Definition 7.1.2:
Let $X_1, \ldots, X_n$ be a sample of size $n$ from a population with distribution $F$. Then
$$\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$$
is called the sample mean and
$$S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2 = \frac{1}{n-1}\Big(\sum_{i=1}^n X_i^2 - n\bar{X}^2\Big)$$
is called the sample variance.

Definition 7.1.3:
Let $X_1, \ldots, X_n$ be a sample of size $n$ from a population with distribution $F$. The function
$$\hat{F}_n(x) = \frac{1}{n}\sum_{i=1}^n I_{(-\infty, x]}(X_i)$$
is called the empirical cumulative distribution function (empirical cdf).

Note:
For any fixed $x \in \mathbb{R}$, $\hat{F}_n(x)$ is a rv.

Theorem 7.1.4:
The rv $\hat{F}_n(x)$ has pmf
$$P\Big(\hat{F}_n(x) = \frac{j}{n}\Big) = \binom{n}{j} (F(x))^j (1 - F(x))^{n-j}, \quad j \in \{0, 1, \ldots, n\},$$

with $E(\hat{F}_n(x)) = F(x)$ and $Var(\hat{F}_n(x)) = \frac{F(x)(1 - F(x))}{n}$.

Proof:
It is $I_{(-\infty, x]}(X_i) \sim Bin(1, F(x))$. Then $n\hat{F}_n(x) \sim Bin(n, F(x))$. The results follow immediately.

Corollary 7.1.5:
By the WLLN, it follows that
$$\hat{F}_n(x) \xrightarrow{p} F(x).$$

Corollary 7.1.6:
By the CLT, it follows that
$$\frac{\sqrt{n}(\hat{F}_n(x) - F(x))}{\sqrt{F(x)(1 - F(x))}} \xrightarrow{d} Z,$$
where $Z \sim N(0, 1)$.

Theorem 7.1.7: Glivenko–Cantelli Theorem
$\hat{F}_n(x)$ converges uniformly to $F(x)$, i.e., it holds for all $\epsilon > 0$ that
$$\lim_{n\to\infty} P\Big(\sup_{-\infty < x < \infty} |\hat{F}_n(x) - F(x)| > \epsilon\Big) = 0.$$

Definition 7.1.8:
Let $X_1, \ldots, X_n$ be a sample of size $n$ from a population with distribution $F$. The $k^{th}$ sample moment (about the origin) is $a_k = \frac{1}{n}\sum_{i=1}^n X_i^k$, and the $k^{th}$ sample central moment is $b_k = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^k$. In particular, $a_1 = \bar{X}$ and $b_2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2 = \frac{n-1}{n}S^2$.

Theorem 7.1.9:
Let $X_1, \ldots, X_n$ be a sample of size $n$ from a population with distribution $F$. Assume that $E(X) = \mu$, $Var(X) = \sigma^2$, and $E((X - \mu)^k) = \mu_k$ exist. Then it holds:
(i) $E(a_1) = E(\bar{X}) = \mu$
(ii) $Var(a_1) = Var(\bar{X}) = \frac{\sigma^2}{n}$
(iii) $E(b_2) = \frac{n-1}{n}\sigma^2$
(iv) $Var(b_2) = \frac{\mu_4 - \mu_2^2}{n} - \frac{2(\mu_4 - 2\mu_2^2)}{n^2} + \frac{\mu_4 - 3\mu_2^2}{n^3}$
(v) $E(S^2) = \sigma^2$
(vi) $Var(S^2) = \frac{\mu_4}{n} - \frac{n-3}{n(n-1)}\mu_2^2$

Proof:
(i)
$$E(\bar{X}) = \frac{1}{n}\sum_{i=1}^n E(X_i) = \frac{n}{n}\mu = \mu$$
(ii)
$$Var(\bar{X}) = \Big(\frac{1}{n}\Big)^2 \sum_{i=1}^n Var(X_i) = \frac{\sigma^2}{n}$$
(iii)
$$E(b_2) = E\Big(\frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2\Big) = E\Big(\frac{1}{n}\sum_{i=1}^n X_i^2 - \frac{1}{n^2}\Big(\sum_{i=1}^n X_i\Big)^2\Big) = E(X^2) - \frac{1}{n^2} E\Big(\sum_{i=1}^n X_i^2 + \sum\sum_{i \ne j} X_i X_j\Big)$$
$$\stackrel{(*)}{=} E(X^2) - \frac{1}{n^2}\big(n E(X^2) + n(n-1)\mu^2\big) = \frac{n-1}{n}\big(E(X^2) - \mu^2\big) = \frac{n-1}{n}\sigma^2$$
(*) holds since $X_i$ and $X_j$ are independent and then, due to Theorem 4.5.3, it holds that $E(X_i X_j) = E(X_i)E(X_j)$.
See Casella/Berger, page 214, and Rohatgi, pages 303–306, for the proof of parts (iv) through (vi) and results regarding the 3rd and 4th moments and covariances.

7.2 Sample Moments and the Normal Distribution

(Based on Casella/Berger, Section 5.3)

Theorem 7.2.1:
Let $X_1, \ldots, X_n$ be iid $N(\mu, \sigma^2)$ rv's. Then $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$ and $(X_1 - \bar{X}, \ldots, X_n - \bar{X})$ are independent.

Proof:
By computing the joint mgf of $(\bar{X}, X_1 - \bar{X}, X_2 - \bar{X}, \ldots, X_n - \bar{X})$, we can use Theorem 4.6.3 (iv) to show independence. We will use the following two facts:
(1):
$$M_{\bar{X}}(t) = M_{\frac{1}{n}\sum_{i=1}^n X_i}(t) \stackrel{(A)}{=} \prod_{i=1}^n M_{X_i}\Big(\frac{t}{n}\Big) \stackrel{(B)}{=} \Big[\exp\Big(\frac{t}{n}\mu + \frac{\sigma^2 t^2}{2n^2}\Big)\Big]^n = \exp\Big(t\mu + \frac{\sigma^2 t^2}{2n}\Big)$$
(A) holds by Theorem 4.6.4 (i). (B) follows from Theorem 3.3.12 (vi) since the $X_i$'s are iid.

(2):
$$M_{X_1 - \bar{X}, X_2 - \bar{X}, \ldots, X_n - \bar{X}}(t_1, t_2, \ldots, t_n) \stackrel{\text{Def.4.6.1}}{=} E\Big[\exp\Big(\sum_{i=1}^n t_i(X_i - \bar{X})\Big)\Big] = E\Big[\exp\Big(\sum_{i=1}^n t_i X_i - \bar{X}\sum_{i=1}^n t_i\Big)\Big]$$
$$= E\Big[\exp\Big(\sum_{i=1}^n X_i(t_i - \bar{t})\Big)\Big], \ \text{where } \bar{t} = \frac{1}{n}\sum_{i=1}^n t_i$$
$$= E\Big[\prod_{i=1}^n \exp\big(X_i(t_i - \bar{t})\big)\Big] \stackrel{(C)}{=} \prod_{i=1}^n E\big(\exp(X_i(t_i - \bar{t}))\big) = \prod_{i=1}^n M_{X_i}(t_i - \bar{t}) \stackrel{(D)}{=} \prod_{i=1}^n \exp\Big(\mu(t_i - \bar{t}) + \frac{\sigma^2(t_i - \bar{t})^2}{2}\Big)$$
$$= \exp\Big(\mu\underbrace{\sum_{i=1}^n (t_i - \bar{t})}_{=0} + \frac{\sigma^2}{2}\sum_{i=1}^n (t_i - \bar{t})^2\Big) = \exp\Big(\frac{\sigma^2}{2}\sum_{i=1}^n (t_i - \bar{t})^2\Big)$$
(C) follows from Theorem 4.5.3 since the $X_i$'s are independent. (D) holds since we evaluate $M_X(h) = \exp(\mu h + \frac{\sigma^2 h^2}{2})$ for $h = t_i - \bar{t}$.

From (1) and (2), it follows:
$$M_{\bar{X}, X_1 - \bar{X}, \ldots, X_n - \bar{X}}(t, t_1, \ldots, t_n) \stackrel{\text{Def.4.6.1}}{=} E\big[\exp\big(t\bar{X} + t_1(X_1 - \bar{X}) + \ldots + t_n(X_n - \bar{X})\big)\big]$$
$$= E\big[\exp\big(t\bar{X} + t_1 X_1 - t_1\bar{X} + \ldots + t_n X_n - t_n\bar{X}\big)\big] = E\Big[\exp\Big(\sum_{i=1}^n X_i t_i - \Big(\sum_{i=1}^n t_i - t\Big)\bar{X}\Big)\Big]$$
$$= E\Big[\exp\Big(\sum_{i=1}^n X_i t_i - (t_1 + \ldots + t_n - t)\frac{\sum_{i=1}^n X_i}{n}\Big)\Big] = E\Big[\exp\Big(\sum_{i=1}^n X_i\Big(t_i - \frac{t_1 + \ldots + t_n - t}{n}\Big)\Big)\Big]$$
$$= E\Big[\prod_{i=1}^n \exp\Big(X_i\frac{nt_i - n\bar{t} + t}{n}\Big)\Big], \ \text{where } \bar{t} = \frac{1}{n}\sum_{i=1}^n t_i$$
$$\stackrel{(E)}{=} \prod_{i=1}^n E\Big[\exp\Big(X_i\frac{t + n(t_i - \bar{t})}{n}\Big)\Big] = \prod_{i=1}^n M_{X_i}\Big(\frac{t + n(t_i - \bar{t})}{n}\Big) \stackrel{(F)}{=} \prod_{i=1}^n \exp\Big(\frac{\mu[t + n(t_i - \bar{t})]}{n} + \frac{\sigma^2}{2}\frac{1}{n^2}[t + n(t_i - \bar{t})]^2\Big)$$
$$= \exp\Big(\frac{\mu}{n}\Big(nt + n\underbrace{\sum_{i=1}^n (t_i - \bar{t})}_{=0}\Big) + \frac{\sigma^2}{2n^2}\sum_{i=1}^n \big(t + n(t_i - \bar{t})\big)^2\Big)$$
$$= \exp(\mu t)\exp\Big(\frac{\sigma^2}{2n^2}\Big(nt^2 + 2nt\underbrace{\sum_{i=1}^n (t_i - \bar{t})}_{=0} + n^2\sum_{i=1}^n (t_i - \bar{t})^2\Big)\Big) = \exp\Big(\mu t + \frac{\sigma^2}{2n}t^2\Big)\exp\Big(\frac{\sigma^2}{2}\sum_{i=1}^n (t_i - \bar{t})^2\Big)$$
$$\stackrel{(1)\&(2)}{=} M_{\bar{X}}(t)\, M_{X_1 - \bar{X}, \ldots, X_n - \bar{X}}(t_1, \ldots, t_n)$$
Thus, $\bar{X}$ and $(X_1 - \bar{X}, \ldots, X_n - \bar{X})$ are independent by Theorem 4.6.3 (iv). (E) follows from Theorem 4.5.3 since the $X_i$'s are independent. (F) holds since we evaluate $M_X(h) = \exp(\mu h + \frac{\sigma^2 h^2}{2})$ for $h = \frac{t + n(t_i - \bar{t})}{n}$.

Corollary 7.2.2:
$\bar{X}$ and $S^2$ are independent.

Proof:
This can be seen since $S^2$ is a function of the vector $(X_1 - \bar{X}, \ldots, X_n - \bar{X})$, and $(X_1 - \bar{X}, \ldots, X_n - \bar{X})$ is independent of $\bar{X}$, as previously shown in Theorem 7.2.1. We can use Theorem 4.2.7 to formally complete this proof.

Corollary 7.2.3:
$$\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}.$$

Proof:
Recall the following facts:
• If $Z \sim N(0,1)$, then $Z^2 \sim \chi^2_1$.
• If $Y_1, \ldots, Y_n \sim$ iid $\chi^2_1$, then $\sum_{i=1}^n Y_i \sim \chi^2_n$.
• For $\chi^2_n$, the mgf is $M(t) = (1 - 2t)^{-n/2}$.
• If $X_i \sim N(\mu, \sigma^2)$, then $\frac{X_i - \mu}{\sigma} \sim N(0,1)$ and $\frac{(X_i - \mu)^2}{\sigma^2} \sim \chi^2_1$.
Therefore,
$$\sum_{i=1}^n \frac{(X_i - \mu)^2}{\sigma^2} \sim \chi^2_n \quad \text{and} \quad \frac{(\bar{X} - \mu)^2}{(\sigma/\sqrt{n})^2} = \frac{n(\bar{X} - \mu)^2}{\sigma^2} \sim \chi^2_1. \quad (*)$$

Now consider
$$\sum_{i=1}^n (X_i - \mu)^2 = \sum_{i=1}^n \big((X_i - \bar{X}) + (\bar{X} - \mu)\big)^2 = \sum_{i=1}^n \big((X_i - \bar{X})^2 + 2(X_i - \bar{X})(\bar{X} - \mu) + (\bar{X} - \mu)^2\big) = (n-1)S^2 + 0 + n(\bar{X} - \mu)^2$$
Therefore,
$$\underbrace{\sum_{i=1}^n \frac{(X_i - \mu)^2}{\sigma^2}}_{W} = \underbrace{\frac{n(\bar{X} - \mu)^2}{\sigma^2}}_{U} + \underbrace{\frac{(n-1)S^2}{\sigma^2}}_{V}$$
We have an expression of the form $W = U + V$.
Since $U$ and $V$ are functions of $\bar{X}$ and $S^2$, we know by Corollary 7.2.2 that they are independent and also that their mgf's factor by Theorem 4.6.3 (iv). Now we can write:
$$M_W(t) = M_U(t) M_V(t) \implies M_V(t) = \frac{M_W(t)}{M_U(t)} \stackrel{(*)}{=} \frac{(1 - 2t)^{-n/2}}{(1 - 2t)^{-1/2}} = (1 - 2t)^{-(n-1)/2}$$
Note that this is the mgf of $\chi^2_{n-1}$ by the uniqueness of mgf's. Thus, $V = \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$.
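Corollary 7.2.3 (and, implicitly, Corollary 7.2.2) can be checked by simulation. A minimal sketch (Python with numpy/scipy assumed):

```python
# (n-1) S^2 / sigma^2 should behave like a chi-square rv with n-1 degrees of freedom.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(7)
mu, sigma, n, reps = 5.0, 2.0, 10, 100_000
x = rng.normal(mu, sigma, size=(reps, n))
v = (n - 1) * x.var(axis=1, ddof=1) / sigma**2
print("mean ~", round(v.mean(), 3), " (chi2 mean = n-1 =", n - 1, ")")
print("var  ~", round(v.var(), 3), " (chi2 var = 2(n-1) =", 2 * (n - 1), ")")
print("P(V <= 95% quantile) ~", round((v <= chi2.ppf(0.95, n - 1)).mean(), 4))
```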

Corollary 7.2.4:
$$\frac{\sqrt{n}(\bar{X} - \mu)}{S} \sim t_{n-1}.$$

Proof:
Recall the following facts:
• If $Z \sim N(0,1)$, $Y \sim \chi^2_n$ and $Z$, $Y$ are independent, then $\frac{Z}{\sqrt{Y/n}} \sim t_n$.
• $Z_1 = \frac{\sqrt{n}(\bar{X} - \mu)}{\sigma} \sim N(0,1)$, $Y_{n-1} = \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$, and $Z_1$, $Y_{n-1}$ are independent.
Therefore,
$$\frac{\sqrt{n}(\bar{X} - \mu)}{S} = \frac{\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}}{\frac{S/\sqrt{n}}{\sigma/\sqrt{n}}} = \frac{\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}}{\sqrt{\frac{S^2(n-1)}{\sigma^2(n-1)}}} = \frac{Z_1}{\sqrt{\frac{Y_{n-1}}{n-1}}} \sim t_{n-1}.$$

Corollary 7.2.5:
Let $(X_1, \ldots, X_m) \sim$ iid $N(\mu_1, \sigma_1^2)$ and $(Y_1, \ldots, Y_n) \sim$ iid $N(\mu_2, \sigma_2^2)$. Let $X_i, Y_j$ be independent $\forall i, j$. Then it holds:
$$\frac{\bar{X} - \bar{Y} - (\mu_1 - \mu_2)}{\sqrt{[(m-1)S_1^2/\sigma_1^2] + [(n-1)S_2^2/\sigma_2^2]}} \cdot \sqrt{\frac{m + n - 2}{\sigma_1^2/m + \sigma_2^2/n}} \sim t_{m+n-2}$$
In particular, if $\sigma_1 = \sigma_2$, then:
$$\frac{\bar{X} - \bar{Y} - (\mu_1 - \mu_2)}{\sqrt{(m-1)S_1^2 + (n-1)S_2^2}} \cdot \sqrt{\frac{mn(m + n - 2)}{m + n}} \sim t_{m+n-2}$$

Proof:
Homework.

Corollary 7.2.6:
Let $(X_1, \ldots, X_m) \sim$ iid $N(\mu_1, \sigma_1^2)$ and $(Y_1, \ldots, Y_n) \sim$ iid $N(\mu_2, \sigma_2^2)$. Let $X_i, Y_j$ be independent $\forall i, j$. Then it holds:
$$\frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} \sim F_{m-1, n-1}$$
In particular, if $\sigma_1 = \sigma_2$, then:
$$\frac{S_1^2}{S_2^2} \sim F_{m-1, n-1}$$

Proof:
Recall that, if $Y_1 \sim \chi^2_m$ and $Y_2 \sim \chi^2_n$ are independent, then
$$F = \frac{Y_1/m}{Y_2/n} \sim F_{m, n}.$$
Now, $C_1 = \frac{(m-1)S_1^2}{\sigma_1^2} \sim \chi^2_{m-1}$ and $C_2 = \frac{(n-1)S_2^2}{\sigma_2^2} \sim \chi^2_{n-1}$. Therefore,
$$\frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} = \frac{\frac{(m-1)S_1^2}{\sigma_1^2(m-1)}}{\frac{(n-1)S_2^2}{\sigma_2^2(n-1)}} = \frac{C_1/(m-1)}{C_2/(n-1)} \sim F_{m-1, n-1}.$$
If $\sigma_1 = \sigma_2$, then $\frac{S_1^2}{S_2^2} \sim F_{m-1, n-1}$.

  • Lecture 07:

    We 01/24/01

    8 The Theory of Point Estimation

    (Based on Casella/Berger, Chapters 6 & 7)

    8.1 The Problem of Point Estimation

    Let X be a rv defined on a probability space (Ω, L, P ). Suppose that the cdf F of X depends

    on some set of parameters and that the functional form of F is known except for a finite

    number of these parameters.

    Definition 8.1.1:

    The set of admissible values of θ is called the parameter space Θ. If Fθ is the cdf of X

    when θ is the parameter, the set {Fθ : θ ∈ Θ} is the family of cdf’s. Likewise, we speak ofthe family of pdf’s if X is continuous, and the family of pmf’s if X is discrete.

    Example 8.1.2:

    X ∼ Bin(n, p), p unknown. Then θ = p and Θ = {p : 0 < p < 1}.

    X ∼ N(µ, σ2), (µ, σ2) unknown. Then θ = (µ, σ2) and Θ = {(µ, σ2) : −∞ < µ 0}.

    Definition 8.1.3:

Let X be a sample from Fθ, θ ∈ Θ ⊆ IR. Let a statistic T(X) map IR^n to Θ. We call T(X) an estimator of θ, and T(x), for a realization x of X, a (point) estimate of θ. In practice, the term estimate is used for both.

Example 8.1.4:

Let X1, . . . , Xn be iid Bin(1, p), p unknown. Estimates of p include:

T1(X) = X̄,   T2(X) = X1,   T3(X) = 1/2,   T4(X) = (X1 + X2)/3

Obviously, not all estimates are equally good.
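One informal way to see this is to estimate the mean squared error of T1, . . . , T4 by simulation for one particular p; the sketch below is an addition to the notes (assuming numpy) and anticipates the MSE criterion of Definition 8.4.4.

```python
# Added illustration (assumes numpy): simulated MSE of the four estimates of p.
import numpy as np

rng = np.random.default_rng(3)
n, p, reps = 20, 0.3, 100_000
x = rng.binomial(1, p, size=(reps, n))

estimates = {
    "T1 = sample mean": x.mean(axis=1),
    "T2 = X1": x[:, 0].astype(float),
    "T3 = 1/2": np.full(reps, 0.5),
    "T4 = (X1+X2)/3": (x[:, 0] + x[:, 1]) / 3.0,
}
for name, t in estimates.items():
    print(name, "MSE approx.", np.mean((t - p) ** 2))
```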


8.2 Properties of Estimates

Definition 8.2.1:

Let {Xi}_{i=1}^∞ be a sequence of iid rv's with cdf Fθ, θ ∈ Θ. A sequence of point estimates Tn(X1, . . . , Xn) = Tn is called

• (weakly) consistent for θ if Tn →_p θ as n → ∞ ∀θ ∈ Θ

• strongly consistent for θ if Tn →_{a.s.} θ as n → ∞ ∀θ ∈ Θ

• consistent in the rth mean for θ if Tn →_r θ as n → ∞ ∀θ ∈ Θ

Example 8.2.2:

Let {Xi}_{i=1}^∞ be a sequence of iid Bin(1, p) rv's. Let X̄n = (1/n) ∑_{i=1}^n Xi. Since E(Xi) = p, it follows by the WLLN that X̄n →_p p, i.e., consistency, and by the SLLN that X̄n →_{a.s.} p, i.e., strong consistency.

However, a consistent estimate may not be unique. We may even have infinitely many consistent estimates, e.g.,

(∑_{i=1}^n Xi + a) / (n + b) →_p p   ∀ finite a, b ∈ IR.
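As a numerical illustration (added here, assuming numpy), one can follow the estimate (∑Xi + a)/(n + b) along a single growing Bernoulli sequence and watch it settle near p:

```python
# Added illustration (assumes numpy): consistency of (sum(X_i) + a)/(n + b).
import numpy as np

rng = np.random.default_rng(4)
p, a, b = 0.3, 5.0, 10.0
x = rng.binomial(1, p, size=100_000)

csum = np.cumsum(x)
n = np.arange(1, x.size + 1)
t_n = (csum + a) / (n + b)           # biased for small n, but consistent

for k in (10, 100, 1_000, 100_000):
    print(k, t_n[k - 1])             # approaches p = 0.3 as k grows
```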

Theorem 8.2.3:

If Tn is a sequence of estimates such that E(Tn) → θ and Var(Tn) → 0 as n → ∞, then Tn is consistent for θ.

Proof:

P(|Tn − θ| > ε) (A)≤ E((Tn − θ)²)/ε²

= E[((Tn − E(Tn)) + (E(Tn) − θ))²]/ε²

= [Var(Tn) + 2E[(Tn − E(Tn))(E(Tn) − θ)] + (E(Tn) − θ)²]/ε²

= [Var(Tn) + (E(Tn) − θ)²]/ε²   (the cross term vanishes since E(Tn − E(Tn)) = 0)

(B)→ 0 as n → ∞


(A) holds due to Corollary 3.5.2 (Markov's Inequality). (B) holds since Var(Tn) → 0 as n → ∞ and E(Tn) → θ as n → ∞.

Definition 8.2.4:

Let G be a group of Borel–measurable functions of IR^n onto itself which is closed under composition and inverse. A family of distributions {Pθ : θ ∈ Θ} is invariant under G if for each g ∈ G and for all θ ∈ Θ, there exists a unique θ′ = g(θ) such that the distribution of g(X) is Pθ′ whenever the distribution of X is Pθ. We call g the induced function on θ since

Pθ(g(X) ∈ A) = P_{g(θ)}(X ∈ A).

Example 8.2.5:

Let (X1, . . . , Xn) be iid N(µ, σ²) with pdf

f(x1, . . . , xn) = (1/(√(2π) σ)^n) exp(−(1/(2σ²)) ∑_{i=1}^n (xi − µ)²).

The group of linear transformations G has elements

g(x1, . . . , xn) = (ax1 + b, . . . , axn + b),   a > 0, −∞ < b < ∞.

Definition 8.2.7:

An estimate T is:

• location invariant if T(X1 + a, . . . , Xn + a) = T(X1, . . . , Xn), a ∈ IR

• scale invariant if T(cX1, . . . , cXn) = T(X1, . . . , Xn), c ∈ IR − {0}

• permutation invariant if T(X_{i1}, . . . , X_{in}) = T(X1, . . . , Xn) ∀ permutations (i1, . . . , in) of 1, . . . , n

    Example 8.2.8:

Let Fθ ∼ N(µ, σ²).

S² is location invariant.

X̄ and S² are both permutation invariant.

Neither X̄ nor S² is scale invariant.

    Note:

    Different sources make different use of the term invariant. Mood, Graybill & Boes (1974)

    for example define location invariant as T (X1 + a, . . . ,Xn + a) = T (X1, . . . ,Xn) + a (page

    332) and scale invariant as T (cX1, . . . , cXn) = cT (X1, . . . ,Xn) (page 336). According to their

definition, X̄ is location invariant and scale invariant.


8.3 Sufficient Statistics

(Based on Casella/Berger, Section 6.2)

Definition 8.3.1:

Let X = (X1, . . . , Xn) be a sample from {Fθ : θ ∈ Θ ⊆ IR^k}. A statistic T = T(X) is sufficient for θ (or for the family of distributions {Fθ : θ ∈ Θ}) iff the conditional distribution of X given T = t does not depend on θ (except possibly on a null set A where

Pθ(T ∈ A) = 0 ∀θ).

    Note:

    (i) The sample X is always sufficient but this is not particularly interesting and usually is

    excluded from further considerations.

    (ii) Idea: Once we have “reduced” from X to T (X), we have captured all the information

    in X about θ.

    (iii) Usually, there are several sufficient statistics for a given family of distributions.

Example 8.3.2:

Let X = (X1, . . . , Xn) be iid Bin(1, p) rv's. To estimate p, can we ignore the order and simply count the number of “successes”?

Let T(X) = ∑_{i=1}^n Xi. It is

P(X1 = x1, . . . , Xn = xn | ∑_{i=1}^n Xi = t) = P(X1 = x1, . . . , Xn = xn, T = t) / P(T = t)

= P(X1 = x1, . . . , Xn = xn) / P(T = t)   if ∑_{i=1}^n xi = t, and 0 otherwise

= p^t (1 − p)^{n−t} / [(n choose t) p^t (1 − p)^{n−t}]   if ∑_{i=1}^n xi = t, and 0 otherwise

= 1/(n choose t)   if ∑_{i=1}^n xi = t, and 0 otherwise.


This does not depend on p. Thus, T = ∑_{i=1}^n Xi is sufficient for p.
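The meaning of sufficiency can also be checked empirically. The sketch below (an addition to the notes, assuming numpy) conditions on T = t and estimates the probability of one particular arrangement of successes; by the calculation above it should be 1/(n choose t) for every p.

```python
# Added illustration (assumes numpy): conditional on T = t, all arrangements are
# equally likely, no matter what p is.
import numpy as np
from math import comb

rng = np.random.default_rng(5)
n, t = 4, 2

for p in (0.2, 0.7):
    x = rng.binomial(1, p, size=(500_000, n))
    cond = x[x.sum(axis=1) == t]                       # keep samples with T = t
    freq = np.mean(np.all(cond == np.array([1, 1, 0, 0]), axis=1))
    print(p, freq, 1 / comb(n, t))                     # both close to 1/6
```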

Example 8.3.3:

Let X = (X1, . . . , Xn) be iid Poisson(λ). Is T = ∑_{i=1}^n Xi sufficient for λ? Recalling that T ∼ Poisson(nλ), it is

P(X1 = x1, . . . , Xn = xn | T = t) = P(X1 = x1, . . . , Xn = xn, T = t) / P(T = t)

= [∏_{i=1}^n e^{−λ} λ^{xi}/xi!] / [e^{−nλ} (nλ)^t/t!]   if ∑_{i=1}^n xi = t, and 0 otherwise

= [e^{−nλ} λ^{∑xi} / ∏ xi!] / [e^{−nλ} (nλ)^t/t!]   if ∑_{i=1}^n xi = t, and 0 otherwise

= t! / (n^t ∏_{i=1}^n xi!)   if ∑_{i=1}^n xi = t, and 0 otherwise

This does not depend on λ. Thus, T = ∑_{i=1}^n Xi is sufficient for λ.

Example 8.3.4:

Let X1, X2 be iid Poisson(λ). Is T = X1 + 2X2 sufficient for λ? It is

P(X1 = 0, X2 = 1 | X1 + 2X2 = 2) = P(X1 = 0, X2 = 1, X1 + 2X2 = 2) / P(X1 + 2X2 = 2)

= P(X1 = 0, X2 = 1) / P(X1 + 2X2 = 2)

= P(X1 = 0, X2 = 1) / [P(X1 = 0, X2 = 1) + P(X1 = 2, X2 = 0)]

= e^{−λ}(e^{−λ}λ) / [e^{−λ}(e^{−λ}λ) + (e^{−λ}λ²/2)e^{−λ}]

= 1/(1 + λ/2),


i.e., this is a counter–example. This expression still depends on λ. Thus, T = X1 + 2X2 is not sufficient for λ.
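For contrast, the dependence on λ can be made concrete numerically; this sketch (added here, assuming scipy) evaluates the conditional probability exactly for several λ and matches the formula 1/(1 + λ/2) derived above.

```python
# Added illustration (assumes scipy): the conditional probability in Example 8.3.4
# changes with lambda, so T = X1 + 2*X2 cannot be sufficient.
from scipy.stats import poisson

def cond_prob(lam):
    """P(X1=0, X2=1 | X1 + 2*X2 = 2) for X1, X2 iid Poisson(lam)."""
    p01 = poisson.pmf(0, lam) * poisson.pmf(1, lam)    # (X1, X2) = (0, 1)
    p20 = poisson.pmf(2, lam) * poisson.pmf(0, lam)    # (X1, X2) = (2, 0)
    return p01 / (p01 + p20)

for lam in (0.5, 1.0, 4.0):
    print(lam, cond_prob(lam), 1 / (1 + lam / 2))      # the two columns agree
```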

    Note:

    Definition 8.3.1 can be difficult to check. In addition, it requires a candidate statistic. We

    need something constructive that helps in finding sufficient statistics without having to check

    Definition 8.3.1. The next Theorem helps in finding such statistics.

Lecture 08:

Fr 01/26/01

Theorem 8.3.5: Factorization Criterion

Let X1, . . . , Xn be rv's with pdf (or pmf) f(x1, . . . , xn | θ), θ ∈ Θ. Then T(X1, . . . , Xn) is sufficient for θ iff we can write

f(x1, . . . , xn | θ) = h(x1, . . . , xn) g(T(x1, . . . , xn) | θ),

where h does not depend on θ and g does not depend on x1, . . . , xn except as a function of T.

Proof:

Discrete case only.

“=⇒”: Suppose T(X) is sufficient for θ. Let

g(t | θ) = Pθ(T(X) = t)

h(x) = P(X = x | T(X) = t)

Then it holds:

f(x | θ) = Pθ(X = x) (∗)= Pθ(X = x, T(X) = T(x) = t)

= Pθ(T(X) = t) P(X = x | T(X) = t)

= g(t | θ) h(x)

(∗) holds since X = x implies that T(X) = T(x) = t.

“⇐=”: Suppose the factorization holds. For fixed t0, it is

Pθ(T(X) = t0) = ∑_{x : T(x)=t0} Pθ(X = x)


= ∑_{x : T(x)=t0} h(x) g(T(x) | θ)

= g(t0 | θ) ∑_{x : T(x)=t0} h(x)   (A)

If Pθ(T(X) = t0) > 0, it holds:

Pθ(X = x | T(X) = t0) = Pθ(X = x, T(X) = t0) / Pθ(T(X) = t0)

= Pθ(X = x) / Pθ(T(X) = t0)   if T(x) = t0, and 0 otherwise

(A)= g(t0 | θ) h(x) / [g(t0 | θ) ∑_{x′ : T(x′)=t0} h(x′)]   if T(x) = t0, and 0 otherwise

= h(x) / ∑_{x′ : T(x′)=t0} h(x′)   if T(x) = t0, and 0 otherwise

This last expression does not depend on θ. Thus, T(X) is sufficient for θ.

    Note:

    (i) In the Theorem above, θ and T may be vectors.

    (ii) If T is sufficient for θ, then also any 1–to–1 mapping of T is sufficient for θ. However,

    this does not hold for arbitrary functions of T .

Example 8.3.6:

Let X1, . . . , Xn be iid Bin(1, p). It is

P(X1 = x1, . . . , Xn = xn | p) = p^{∑xi} (1 − p)^{n−∑xi}.

Thus, h(x1, . . . , xn) = 1 and g(∑xi | p) = p^{∑xi} (1 − p)^{n−∑xi}.

Hence, T = ∑_{i=1}^n Xi is sufficient for p.


Example 8.3.7:

Let X1, . . . , Xn be iid Poisson(λ). It is

P(X1 = x1, . . . , Xn = xn | λ) = ∏_{i=1}^n e^{−λ} λ^{xi}/xi! = e^{−nλ} λ^{∑xi} / ∏ xi!.

Thus, h(x1, . . . , xn) = 1/∏ xi! and g(∑xi | λ) = e^{−nλ} λ^{∑xi}.

Hence, T = ∑_{i=1}^n Xi is sufficient for λ.

Example 8.3.8:

Let X1, . . . , Xn be iid N(µ, σ²) where µ ∈ IR and σ² > 0 are both unknown. It is

f(x1, . . . , xn | µ, σ²) = (1/(√(2π) σ)^n) exp(−∑(xi − µ)²/(2σ²)) = (1/(√(2π) σ)^n) exp(−∑xi²/(2σ²) + µ ∑xi/σ² − nµ²/(2σ²)).

Hence, T = (∑_{i=1}^n Xi, ∑_{i=1}^n Xi²) is sufficient for (µ, σ²).
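A small numerical illustration of the Factorization Criterion in this example (added here, assuming numpy): two samples sharing the same value of T = (∑xi, ∑xi²) have identical normal likelihoods for every (µ, σ²), while a sample with a different T does not.

```python
# Added illustration (assumes numpy): the N(mu, sigma^2) likelihood depends on the
# data only through (sum x_i, sum x_i^2).
import numpy as np

def normal_loglik(x, mu, sigma2):
    """Log-likelihood of an iid N(mu, sigma2) sample x."""
    x = np.asarray(x)
    n = x.size
    return -n / 2 * np.log(2 * np.pi * sigma2) - np.sum((x - mu) ** 2) / (2 * sigma2)

x1 = np.array([1.0, 2.0, 3.0])                          # sum = 6, sum of squares = 14
r = np.sqrt(3.25)
x2 = np.array([1.5, (4.5 + r) / 2, (4.5 - r) / 2])      # also sum = 6, sum of squares = 14
x3 = np.array([0.0, 3.0, 3.0])                          # sum = 6, but sum of squares = 18

for mu, s2 in [(0.0, 1.0), (2.0, 0.5)]:
    print(normal_loglik(x1, mu, s2), normal_loglik(x2, mu, s2), normal_loglik(x3, mu, s2))
    # the first two values coincide, the third differs
```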

Example 8.3.9:

Let X1, . . . , Xn be iid U(θ, θ + 1) where −∞ < θ < ∞.

Example 8.3.11:

Let X1, . . . , Xn be iid Bin(1, p). We have seen in Example 8.3.6 that T = ∑_{i=1}^n Xi is sufficient for p. Is it also complete?

We know that T ∼ Bin(n, p). Thus,

Ep(g(T)) = ∑_{t=0}^n g(t) (n choose t) p^t (1 − p)^{n−t} = 0 ∀p ∈ (0, 1)

implies that

(1 − p)^n ∑_{t=0}^n g(t) (n choose t) (p/(1 − p))^t = 0 ∀p ∈ (0, 1).

However, ∑_{t=0}^n g(t) (n choose t) (p/(1 − p))^t is a polynomial in p/(1 − p) which is equal to 0 for all p ∈ (0, 1) only if all of its coefficients are 0.

Therefore, g(t) = 0 for t = 0, 1, . . . , n. Hence, T is complete.

Lecture 09:

Mo 01/29/01

Example 8.3.12:

Let X1, . . . , Xn be iid N(θ, θ²). We know from Example 8.3.8 that T = (∑_{i=1}^n Xi, ∑_{i=1}^n Xi²) is sufficient for θ. Is it also complete?

We know that ∑_{i=1}^n Xi ∼ N(nθ, nθ²). Therefore,

E((∑_{i=1}^n Xi)²) = nθ² + n²θ² = n(n + 1)θ²

E(∑_{i=1}^n Xi²) = n(θ² + θ²) = 2nθ²

It follows that

E(2(∑_{i=1}^n Xi)² − (n + 1) ∑_{i=1}^n Xi²) = 0 ∀θ.

But g(x1, . . . , xn) = 2(∑_{i=1}^n xi)² − (n + 1) ∑_{i=1}^n xi² is not identically 0.

Therefore, T is not complete.
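The key identity Eθ(g(X)) = 0 with g not identically zero can be checked by simulation; the following sketch is an addition to the notes and assumes numpy.

```python
# Added illustration (assumes numpy): E[2*(sum X_i)^2 - (n+1)*sum X_i^2] = 0 under
# every theta for X_i iid N(theta, theta^2), although g is not the zero function.
import numpy as np

rng = np.random.default_rng(6)
n, reps = 5, 1_000_000

for theta in (0.5, 1.0, 3.0):
    x = rng.normal(theta, abs(theta), size=(reps, n))    # sd = |theta|
    g = 2 * x.sum(axis=1) ** 2 - (n + 1) * (x ** 2).sum(axis=1)
    print(theta, g.mean())                               # close to 0, up to Monte Carlo error
```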

Note:

Recall from Section 5.2 what it means if we say the family of distributions {fθ : θ ∈ Θ} is a one–parameter (or k–parameter) exponential family.


Theorem 8.3.13:

Let {fθ : θ ∈ Θ} be a k–parameter exponential family. Let T1, . . . , Tk be statistics. Then the family of distributions of (T1(X), . . . , Tk(X)) is also a k–parameter exponential family given by

gθ(t) = exp(∑_{i=1}^k ti Qi(θ) + D(θ) + S∗(t))

for suitable S∗(t).

Proof:

The proof follows from our Theorems regarding the transformation of rv's.

Theorem 8.3.14:

Let {fθ : θ ∈ Θ} be a k–parameter exponential family with k ≤ n and let T1, . . . , Tk be statistics as in Theorem 8.3.13. Suppose that the range of Q = (Q1, . . . , Qk) contains an open set in IR^k. Then T = (T1(X), . . . , Tk(X)) is a complete sufficient statistic.

Proof:

Discrete case and k = 1 only.

Write Q(θ) = θ and let (a, b) ⊆ Θ. It follows from the Factorization Criterion (Theorem 8.3.5) that T is sufficient for θ. Thus, we only have to show that T is complete, i.e., that

Eθ(g(T(X))) = ∑_t g(t) Pθ(T(X) = t) (A)= ∑_t g(t) exp(θt + D(θ) + S∗(t)) = 0 ∀θ   (B)

implies g(t) = 0 ∀t. Note that in (A) we make use of a result established in Theorem 8.3.13.

We now define functions g+ and g− as:

g+(t) = g(t) if g(t) ≥ 0, and 0 otherwise

g−(t) = −g(t) if g(t) < 0, and 0 otherwise

It is g(t) = g+(t) − g−(t), where both functions, g+ and g−, are non–negative. Using g+ and g−, it turns out that (B) is equivalent to

∑_t g+(t) exp(θt + S∗(t)) = ∑_t g−(t) exp(θt + S∗(t)) ∀θ   (C)

where the term exp(D(θ)) in (A) drops out as a constant on both sides.


If we fix θ0 ∈ (a, b) and define

p+(t) = g+(t) exp(θ0 t + S∗(t)) / ∑_t g+(t) exp(θ0 t + S∗(t)),   p−(t) = g−(t) exp(θ0 t + S∗(t)) / ∑_t g−(t) exp(θ0 t + S∗(t)),

it is obvious that p+(t) ≥ 0 ∀t and p−(t) ≥ 0 ∀t, and by construction ∑_t p+(t) = 1 and ∑_t p−(t) = 1. Hence, p+ and p− are both pmf's.

From (C), it follows for the mgf's M+ and M− of p+ and p− that

M+(δ) = ∑_t e^{δt} p+(t)

= ∑_t e^{δt} g+(t) exp(θ0 t + S∗(t)) / ∑_t g+(t) exp(θ0 t + S∗(t))

= ∑_t g+(t) exp((θ0 + δ)t + S∗(t)) / ∑_t g+(t) exp(θ0 t + S∗(t))

(C)= ∑_t g−(t) exp((θ0 + δ)t + S∗(t)) / ∑_t g−(t) exp(θ0 t + S∗(t))

= ∑_t e^{δt} g−(t) exp(θ0 t + S∗(t)) / ∑_t g−(t) exp(θ0 t + S∗(t))

= ∑_t e^{δt} p−(t)

= M−(δ)   ∀δ ∈ (a − θ0, b − θ0), an interval that contains 0 since a < θ0 < b.

By the uniqueness of mgf's it follows that p+(t) = p−(t) ∀t

=⇒ g+(t) = g−(t) ∀t

=⇒ g(t) = 0 ∀t

=⇒ T is complete


Definition 8.3.15:

Let X = (X1, . . . , Xn) be a sample from {Fθ : θ ∈ Θ ⊆ IR^k} and let T = T(X) be a sufficient statistic for θ. T = T(X) is called a minimal sufficient statistic for θ if, for any other

    sufficient statistic T ′ = T ′(X), T (x) is a function of T ′(x).

    Note:

    (i) A minimal sufficient statistic achieves the greatest possible data reduction for a sufficient

    statistic.

    (ii) If T is minimal sufficient for θ, then also any 1–to–1 mapping of T is minimal sufficient

    for θ. However, this does not hold for arbitrary functions of T .

    Definition 8.3.16:

Let X = (X1, . . . , Xn) be a sample from {Fθ : θ ∈ Θ ⊆ IR^k}. A statistic T = T(X) is called ancillary if its distribution does not depend on the parameter θ.

Example 8.3.17:

Let X1, . . . , Xn be iid U(θ, θ + 1) where −∞ < θ < ∞. As shown in Example 8.3.9, T = (X(1), X(n)) is sufficient for θ. Define

Rn = X(n) − X(1).

Use the result from Stat 6710, Homework Assignment 5, Question (viii) (a) to obtain

f_{Rn}(r | θ) = f_{Rn}(r) = n(n − 1) r^{n−2} (1 − r) I_{(0,1)}(r).

This means that Rn ∼ Beta(n − 1, 2). Since this distribution does not depend on θ, Rn is ancillary.
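Both claims are easy to check by simulation; the sketch below is an addition to the notes and assumes numpy and scipy.

```python
# Added illustration (assumes numpy and scipy): the range R_n of a U(theta, theta+1)
# sample has the same distribution for every theta, and it matches Beta(n-1, 2).
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, reps = 6, 200_000

for theta in (-2.0, 10.0):
    x = rng.uniform(theta, theta + 1, size=(reps, n))
    r = x.max(axis=1) - x.min(axis=1)
    print(theta,
          np.quantile(r, [0.25, 0.5, 0.75]).round(3),
          stats.beta.ppf([0.25, 0.5, 0.75], n - 1, 2).round(3))
```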

    Theorem 8.3.18: Basu’s Theorem

Let X = (X1, . . . , Xn) be a sample from {Fθ : θ ∈ Θ ⊆ IR^k}. If T = T(X) is a complete and minimal sufficient statistic, then T is independent of any ancillary statistic.

    Theorem 8.3.19:

Let X = (X1, . . . , Xn) be a sample from {Fθ : θ ∈ Θ ⊆ IR^k}. If a minimal sufficient statistic T = T(X) exists for θ, then any complete sufficient statistic is also a minimal sufficient statistic.


Note:

(i) Due to the last Theorem, Basu's Theorem is often stated only in terms of a complete

    sufficient statistic (which automatically is also a minimal sufficient statistic).

(ii) As already shown in Corollary 7.2.2, X̄ and S² are independent when sampling from a N(µ, σ²) population. As outlined in Casella/Berger, page 289, we could also use Basu's

    Theorem to obtain the same result.

    (iii) The converse of Basu’s Theorem is false, i.e., if T (X) is independent of any ancillary

statistic, it does not necessarily follow that T(X) is a complete, minimal sufficient statistic.

(iv) As seen in Examples 8.3.8 and 8.3.12, T = (∑_{i=1}^n Xi, ∑_{i=1}^n Xi²) is sufficient for θ but it is not complete when X1, . . . , Xn are iid N(θ, θ²). However, it can be shown that T is minimal sufficient. So, there may be distributions where a minimal sufficient statistic exists but a complete statistic does not exist.

    (v) As with invariance, there exist several different definitions of ancillarity within the lit-

    erature — the one defined in this chapter being the most commonly used.


8.4 Unbiased Estimation

(Based on Casella/Berger, Section 7.3)

Definition 8.4.1:

Let {Fθ : θ ∈ Θ}, Θ ⊆ IR, be a nonempty set of cdf's. A Borel–measurable function T from IR^n to Θ is called unbiased for θ (or an unbiased estimate for θ) if

    Eθ(T ) = θ ∀θ ∈ Θ.

    Any function d(θ) for which an unbiased estimate T exists is called an estimable function.

    If T is biased,

    b(θ, T ) = Eθ(T ) − θ

    is called the bias of T .

Example 8.4.2:

If the kth population moment exists, the kth sample moment is an unbiased estimate. If Var(X) = σ², the sample variance S² is an unbiased estimate of σ².

However, note that for X1, . . . , Xn iid N(µ, σ²), S is not an unbiased estimate of σ:

(n − 1)S²/σ² ∼ χ²_{n−1} = Gamma((n − 1)/2, 2)

=⇒ E(√((n − 1)S²/σ²)) = ∫_0^∞ √x · x^{(n−1)/2 − 1} e^{−x/2} / (2^{(n−1)/2} Γ((n−1)/2)) dx

= [√2 Γ(n/2) / Γ((n−1)/2)] ∫_0^∞ x^{n/2 − 1} e^{−x/2} / (2^{n/2} Γ(n/2)) dx

(∗)= √2 Γ(n/2) / Γ((n−1)/2)

=⇒ E(S) = σ √(2/(n − 1)) Γ(n/2) / Γ((n−1)/2)

(∗) holds since x^{n/2 − 1} e^{−x/2} / (2^{n/2} Γ(n/2)) is the pdf of a Gamma(n/2, 2) distribution and thus the integral is 1.

So S is biased for σ and

b(σ, S) = σ (√(2/(n − 1)) Γ(n/2) / Γ((n−1)/2) − 1).
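The exact value of E(S) is easy to evaluate and to compare with a simulation; the following sketch is an addition to the notes, assuming numpy and scipy (gammaln is used so that the Gamma ratio stays numerically stable for larger n).

```python
# Added illustration (assumes numpy and scipy): E(S) < sigma for a normal sample.
import numpy as np
from scipy.special import gammaln

def expected_s(n, sigma):
    """E(S) = sigma * sqrt(2/(n-1)) * Gamma(n/2) / Gamma((n-1)/2)."""
    log_c = 0.5 * np.log(2.0 / (n - 1)) + gammaln(n / 2) - gammaln((n - 1) / 2)
    return sigma * np.exp(log_c)

n, sigma = 5, 2.0
rng = np.random.default_rng(8)
s = np.sqrt(rng.normal(0.0, sigma, size=(200_000, n)).var(axis=1, ddof=1))

print(expected_s(n, sigma))    # about 1.88, i.e. smaller than sigma = 2
print(s.mean())                # the simulation agrees
```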


Note:

    If T is unbiased for θ, g(T ) is not necessarily unbiased for g(θ) (unless g is a linear function).

Lecture 10:

We 01/31/01

Example 8.4.3:

Unbiased estimates may not exist (see Rohatgi, page 351, Example 2) or they may be absurd as in the following case:

Let X ∼ Poisson(λ) and let d(λ) = e^{−2λ}. Consider T(X) = (−1)^X as an estimate. It is

Eλ(T(X)) = e^{−λ} ∑_{x=0}^∞ (−1)^x λ^x/x!

= e^{−λ} ∑_{x=0}^∞ (−λ)^x/x!

= e^{−λ} e^{−λ}

= e^{−2λ}

= d(λ)

Hence T is unbiased for d(λ), but since T alternates between −1 and 1 while d(λ) > 0, T is not a good estimate.
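A tiny simulation (added here, assuming numpy) shows both points at once: the average of T(X) = (−1)^X is indeed close to e^{−2λ}, yet every single realization of T is −1 or +1.

```python
# Added illustration (assumes numpy): T(X) = (-1)^X is unbiased for exp(-2*lambda)
# but takes only the values -1 and +1.
import numpy as np

rng = np.random.default_rng(9)
lam, reps = 1.5, 1_000_000

x = rng.poisson(lam, size=reps)
t = (-1.0) ** x

print(t.mean(), np.exp(-2 * lam))   # both about 0.05: T is unbiased for d(lambda)
print(np.unique(t))                 # but each estimate is either -1.0 or 1.0
```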

    Note:

If there exist two unbiased estimates T1 and T2 of θ, then any estimate of the form αT1 + (1 − α)T2 for 0 < α < 1 will also be an unbiased estimate of θ. Which one should we choose?

Definition 8.4.4:

The mean square error of an estimate T of θ is defined as

MSE(θ, T) = Eθ((T − θ)²) = Varθ(T) + (b(θ, T))².

Let {Ti}_{i=1}^∞ be a sequence of estimates of θ. If

lim_{i→∞} MSE(θ, Ti) = 0 ∀θ ∈ Θ,

then {Ti} is called a mean–squared–error consistent (MSE–consistent) sequence of estimates of θ.


Note:

(i) If we allow all estimates and compare their MSE, generally it will depend on θ which estimate is better. For example, θ̂ = 17 is perfect if θ = 17, but it is lousy otherwise.

(ii) If we restrict ourselves to the class of unbiased estimates, then MSE(θ, T) = Varθ(T).

(iii) MSE–consistency means that both the bias and the variance of Ti approach 0 as i → ∞.

Definition 8.4.5:

Let θ0 ∈ Θ and let U(θ0) be the class of all unbiased estimates T of θ0 such that Eθ0(T²) < ∞.

An Excursion into Logic II

In our first “Excursion into Logic” in Stat 6710 Mathematical Statistics I, we have established the following results:

A ⇒ B is equivalent to ¬B ⇒ ¬A is equivalent to ¬A ∨ B:

A  B  |  A ⇒ B  ¬A  ¬B  ¬B ⇒ ¬A  ¬A ∨ B
1  1  |    1     0   0     1        1
1  0  |    0     0   1     0        0
0  1  |    1     1   0     1        1
0  0  |    1     1   1     1        1

When dealing with formal proofs, there exists one more technique to show A ⇒ B. Equivalently, we can show (A ∧ ¬B) ⇒ 0, a technique called Proof by Contradiction. This means, assuming that A and ¬B hold, we show that this implies 0, i.e., something that is always false, i.e., a contradiction. And here is the corresponding truth table:

A  B  |  A ⇒ B  ¬B  A ∧ ¬B  (A ∧ ¬B) ⇒ 0
1  1  |    1     0     0         1
1  0  |    0     1     1         0
0  1  |    1     0     0         1
0  0  |    1     1     0         1

Note:

We make use of this proof technique in the Proof of the next Theorem.

Example:

Let A : x = 5 and B : x² = 25. Obviously A ⇒ B.

But we can also prove this in the following way:

A : x = 5 and ¬B : x² ≠ 25

=⇒ x² = 25 ∧ x² ≠ 25

This is impossible, i.e., a contradiction. Thus, A ⇒ B.


Theorem 8.4.7:

Let U be the class of all unbiased estimates T of θ ∈ Θ with Eθ(T²) < ∞ ∀θ, and suppose that U is non–empty. Let U0 be the set of all unbiased estimates of 0, i.e.,

U0 = {ν : Eθ(ν) = 0, Eθ(ν²) < ∞ ∀θ ∈ Θ}.

Then T0 ∈ U is UMVUE for θ iff Eθ(νT0) = 0 ∀θ ∈ Θ and ∀ν ∈ U0.

“⇐=”: Let Eθ(νT0) = 0 for some T0 ∈ U for all θ ∈ Θ and all ν ∈ U0.

We choose T ∈ U ; then also T0 − T ∈ U0 and

Eθ(T0(T0 − T)) = 0 ∀θ ∈ Θ,

i.e.,

Eθ(T0²) = Eθ(T0 T) ∀θ ∈ Θ.

It follows from the Cauchy–Schwarz Inequality (Theorem 4.5.7 (ii)) that

Eθ(T0²) = Eθ(T0 T) ≤ (Eθ(T0²))^{1/2} (Eθ(T²))^{1/2}.

This implies

(Eθ(T0²))^{1/2} ≤ (Eθ(T²))^{1/2}

and, since Eθ(T0) = Eθ(T) = θ,

Varθ(T0) ≤ Varθ(T),

where T is an arbitrary unbiased estimate of θ. Thus, T0 is UMVUE.

Lecture 11:

Mo 02/05/01

Theorem 8.4.8:

Let U be the non–empty class of unbiased estimates of θ ∈ Θ as defined in Theorem 8.4.7. Then there exists at most one UMVUE T ∈ U for θ.

Proof:

Suppose T0, T1 ∈ U are both UMVUE.

Then T1 − T0 ∈ U0, Varθ(T0) = Varθ(T1), and Eθ(T0(T1 − T0)) = 0 ∀θ ∈ Θ

=⇒ Eθ(T0²) = Eθ(T0 T1)

=⇒ Covθ(T0, T1) = Eθ(T0 T1) − Eθ(T0) Eθ(T1) = Eθ(T0²) − (Eθ(T0))² = Varθ(T0) = Varθ(T1) ∀θ ∈ Θ

=⇒ ρ_{T0,T1} = 1 ∀θ ∈ Θ

=⇒ Pθ(aT0 + bT1 = 0) = 1 for some a, b ∀θ ∈ Θ

=⇒ θ = Eθ(T0) = Eθ(−(b/a) T1) = −(b/a) Eθ(T1) = −(b/a) θ ∀θ ∈ Θ

=⇒ −b/a = 1

=⇒ Pθ(T0 = T1) = 1 ∀θ ∈ Θ


Theorem 8.4.9:

    (i) If an UMVUE T exists for a real function d(θ), then λT is the UMVUE for λd(θ), λ ∈ IR.

    (ii) If UMVUE’s T1 and T2 exist for real functions d1(θ) and d2(θ), respectively, then T1 +T2

    is the UMVUE for d1(θ) + d2(θ).

    Proof:

    Homework.

    Theorem 8.4.10:

    If a sample consists of n independent observations X1, . . . ,Xn from the same distribution, the

    UMVUE, if it exists, is permutation invariant.

    Proof:

    Homework.

Theorem 8.4.11: Rao–Blackwell

Let {Fθ : θ ∈ Θ} be a family of cdf's, and let h be any statistic in U , where U is the non–empty class of all unbiased estimates of θ with Eθ(h²) < ∞ ∀θ ∈ Θ. Let T be a sufficient statistic for {Fθ : θ ∈ Θ}. Then E(h | T) does not depend on θ (so it is a statistic), E(h | T) is an unbiased estimate of θ, and

Varθ(E(h | T)) ≤ Varθ(h) ∀θ ∈ Θ.

Proof:

Since T is sufficient, E(h | T) does not depend on θ, and Eθ(E(h | T)) = Eθ(h) = θ, so E(h | T) is unbiased. Moreover, (E(h | T))² ≤ E(h² | T), so

Eθ((E(h | T))²) ≤ Eθ(E(h² | T)) = Eθ(h²),

which, together with the equal means, gives Varθ(E(h | T)) ≤ Varθ(h).

Equality holds iff

Eθ((E(h | T))²) = Eθ(h²) = Eθ(E(h² | T))

⇐⇒ Eθ(E(h² | T) − (E(h | T))²) = 0

⇐⇒ Eθ(Var(h | T)) = 0

⇐⇒ Eθ(E((h − E(h | T))² | T)) = 0

⇐⇒ E((h − E(h | T))² | T) = 0

⇐⇒ h is a function of T and h = E(h | T).

For the proof of the last step, see Rohatgi, page 170–171, Theorem 2, Corollary, and Proof of the Corollary.

Theorem 8.4.12: Lehmann–Scheffé

If T is a complete sufficient statistic and if there exists an unbiased estimate h of θ, then E(h | T) is the (unique) UMVUE.

Proof:

Suppose that h1, h2 ∈ U . Then Eθ(E(h1 | T)) = Eθ(E(h2 | T)) = θ by Theorem 8.4.11. Therefore,

Eθ(E(h1 | T) − E(h2 | T)) = 0 ∀θ ∈ Θ.

Since T is complete, E(h1 | T) = E(h2 | T).

Therefore, E(h | T) must be the same for all h ∈ U and E(h | T) improves all h ∈ U . Therefore, E(h | T) is UMVUE by Theorem 8.4.11.

    Note:

    We can use Theorem 8.4.12 to find the UMVUE in two ways if we have a complete sufficient

    statistic T :

(i) If we can find an unbiased estimate h(T), it will be the UMVUE.