STAT:5100 (22S:193) Statistical Inference I - Week 14
Source: homepage.stat.uiowa.edu/~luke/classes/193/notes-week14.pdf

STAT:5100 (22S:193) Statistical Inference I
Week 14

Luke Tierney

University of Iowa

Fall 2015

Luke Tierney (U Iowa) STAT:5100 (22S:193) Statistical Inference I Fall 2015 1


Monday, November 30, 2015 Central Limit Theorem

Recap

• Distances between probability distributions

• Convergence in distribution

• Central limit theorem


Theorem (Central Limit Theorem)

Let $X_1, X_2, \ldots$ be i.i.d. from a population with an MGF that is finite near the origin. Then $X_1$ has finite mean $\mu$ and finite variance $\sigma^2$. Let
$$Z_n = \frac{\overline{X}_n - \mu}{\sigma/\sqrt{n}} = \sqrt{n}\,\frac{\overline{X}_n - \mu}{\sigma}$$
and let $Z \sim N(0, 1)$. Then $Z_n \to Z$ in distribution, i.e.
$$P(Z_n \le z) \to \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}\, du$$
for all $z$.
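As a quick sanity check of the theorem (my own sketch, not from the slides; the sample size, seed, and use of Exponential(1) data are my choices), the standardized means of a skewed population should already be close to $N(0,1)$ for moderate $n$:

```r
## Standardized means of Exponential(1) samples (mean 1, SD 1),
## so Zn = sqrt(n) * (xbar - 1) should be approximately N(0, 1).
set.seed(42)
n <- 50
reps <- 10000
zn <- replicate(reps, sqrt(n) * (mean(rexp(n)) - 1))
## Kolmogorov-style distance between the empirical CDF of Zn and N(0, 1)
z <- seq(-3, 3, len = 101)
d <- max(abs(ecdf(zn)(z) - pnorm(z)))
d  ## small even though the population is quite skewed
```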


Notation

• If $(X_n - a_n)/b_n \xrightarrow{D} Z \sim N(0, 1)$, then we sometimes write
$$X_n \sim \mathrm{AN}(a_n, b_n^2)$$

• Read this as

Xn is approximately normal

or

Xn is asymptotically normal.

• $b_n^2$ is sometimes called the asymptotic variance.

• $b_n$ is sometimes called the asymptotic standard deviation.


Examples

• The sample mean $\overline{X}_n$ is approximately normally distributed with mean $\mu$ and SD $\sigma/\sqrt{n}$.
• Alternatively, $\overline{X}_n \sim \mathrm{AN}(\mu, \sigma^2/n)$.
• The sum $Y_n = \sum_{i=1}^n X_i$ has an approximate normal distribution with mean $n\mu$ and SD $\sqrt{n}\,\sigma$.
• Alternatively, $Y_n \sim \mathrm{AN}(n\mu, n\sigma^2)$.
• If $Y_n \sim \mathrm{Binomial}(n, p)$ then $Y_n \sim \mathrm{AN}(np, np(1-p))$.
  • $Y_n$ can be viewed as the sum of $n$ independent Bernoulli($p$) random variables.
• If $Y_n \sim \chi^2_n$ then $Y_n \sim \mathrm{AN}(n, 2n)$.
  • $Y_n$ can be viewed as a sum of $n$ independent $\chi^2_1$ random variables.
• If $Y_\alpha \sim \mathrm{Gamma}(\alpha, \beta)$ then $Y_\alpha \sim \mathrm{AN}(\alpha\beta, \alpha\beta^2)$ as $\alpha \to \infty$.
• If $Y_\lambda \sim \mathrm{Poisson}(\lambda)$ then $Y_\lambda \sim \mathrm{AN}(\lambda, \lambda)$ as $\lambda \to \infty$.
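The binomial case is easy to check numerically (a sketch of my own, not in the slides; the choice $p = 0.3$ and the grid of $n$ values are arbitrary):

```r
## Distance between the Binomial(n, p) CDF and its AN(np, np(1 - p))
## approximation, for increasing n.
p <- 0.3
kd <- sapply(c(10, 100, 1000), function(n) {
  k <- 0:n
  max(abs(pbinom(k, n, p) - pnorm(k, n * p, sqrt(n * p * (1 - p)))))
})
round(kd, 4)  ## shrinks roughly like 1 / sqrt(n)
```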


Proof of the Central Limit Theorem.

• Let $Y_i = (X_i - \mu)/\sigma$.
• Then $E[Y_i] = 0$, $\mathrm{Var}(Y_i) = 1$, and $Z_n = \sqrt{n}\,\overline{Y}_n$.
• The MGF of $Z_n$ is therefore
$$M_{Z_n}(t) = E\left[\exp\left\{t\sqrt{n}\,\frac{1}{n}\sum Y_i\right\}\right] = E\left[\exp\left\{\frac{t}{\sqrt{n}}\sum Y_i\right\}\right] = M_{Y_1}\!\left(\frac{t}{\sqrt{n}}\right)^{n}$$
• Now by Taylor's theorem we can write
$$M_Y(s) = M_Y(0) + sM_Y'(0) + \frac{s^2}{2}M_Y''(0) + R_Y(s)$$
where $R_Y(s) = o(s^2)$ as $s \to 0$.


Proof of the Central Limit Theorem (continued).

• Furthermore,
$$M_Y(0) = 1, \qquad M_Y'(0) = E[Y] = 0, \qquad M_Y''(0) = E[Y^2] = \mathrm{Var}(Y) = 1$$
• Therefore
$$M_Y\!\left(\frac{t}{\sqrt{n}}\right)^{n} = \left(1 + \frac{t^2}{2n} + R_Y\!\left(\frac{t}{\sqrt{n}}\right)\right)^{n} = \left(1 + \frac{\frac{1}{2}t^2 + nR_Y(t/\sqrt{n})}{n}\right)^{n} \to e^{t^2/2}$$
as $n \to \infty$.
• This is the MGF of the standard normal distribution.
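The final step relies on a standard calculus fact that the slides leave implicit; a short justification:

```latex
% Standard fact: if x_n \to x then (1 + x_n/n)^n \to e^x.
% Here x_n = \tfrac{1}{2}t^2 + n R_Y(t/\sqrt{n}), and since R_Y(s) = o(s^2),
%   n R_Y(t/\sqrt{n}) = t^2 \cdot \frac{R_Y(t/\sqrt{n})}{(t/\sqrt{n})^2} \to 0,
% so x_n \to t^2/2 and therefore
\lim_{n \to \infty}
  \left(1 + \frac{\tfrac{1}{2}t^2 + n R_Y(t/\sqrt{n})}{n}\right)^{\!n}
  = e^{t^2/2}.
```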


Monday, November 30, 2015 Other Limiting Distributions

Other Limiting Distributions

Example

• Suppose $X_1, X_2, \ldots$ are non-negative i.i.d. random variables from a distribution with CDF $F$.
• Let $Y_n = \min\{X_1, \ldots, X_n\}$.
• Suppose $F(x) \sim ax^b$ for some positive $a, b$ as $x \to 0$; this means
$$\lim_{x \to 0} \frac{F(x)}{ax^b} = 1.$$
• This implies that $F(y) > 0$ for $y > 0$.
• The CDF of $Y_n$ is $F_{Y_n}(y) = 1 - (1 - F(y))^n$.
• This converges to one for all $y > 0$, so $Y_n \xrightarrow{D} 0$ and $Y_n \xrightarrow{P} 0$.
• Some simulations for $X_i \sim$ Uniform[0, 1]: http://www.stat.uiowa.edu/~luke/classes/193/convergence.R


Example (continued)

• Can we find constants $c_n$ such that $Z_n = c_nY_n$ converges to a non-degenerate limit?
• The CDF of $Z_n$ is
$$F_{Z_n}(z) = P(Y_n \le z/c_n) = 1 - (1 - F(z/c_n))^n = 1 - \left(1 - \frac{nF(z/c_n)}{n}\right)^n.$$
• As $c_n \to \infty$ we have
$$nF(z/c_n) \sim na(z/c_n)^b = \frac{n}{c_n^b}\,az^b.$$
• So taking $c_n = n^{1/b}$ yields $nF(z/c_n) \to az^b$ and therefore
$$F_{Z_n}(z) \to 1 - e^{-az^b}$$
• This is the CDF of a Weibull distribution.
• For a Uniform[0, 1] distribution $a = 1$ and $b = 1$, so the limiting distribution is exponential with mean 1.
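For the uniform case this is easy to check exactly (my own sketch, not from the slides; the value $n = 50$ is arbitrary):

```r
## For Uniform[0, 1] data, a = b = 1 and cn = n, so the exact CDF of
## Zn = n * Yn should be close to the Exponential(1) CDF.
n <- 50
z <- seq(0, 4, len = 101)
exact <- 1 - (1 - punif(z / n))^n  # exact CDF of n * min(X1, ..., Xn)
lim <- 1 - exp(-z)                 # limiting exponential CDF
max(abs(exact - lim))              # small already for n = 50
```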


Example (continued)

• If $F$ is a Gamma($\alpha$, 1) CDF then $F(x) \sim \frac{x^\alpha}{\alpha\Gamma(\alpha)}$ as $x \to 0$.
• For $\alpha = 2$ this means $F(x) \sim \frac{1}{2}x^2$.
• This code provides a graphical comparison of the exact and approximate CDFs of $Z_n$ for $n = 10$:

n <- 10
x <- seq(0, 4, len = 101)
## limiting Weibull CDF with a = 1/2, b = 2
plot(x, 1 - exp(-0.5 * x^2), type = "l")
## exact CDF of Zn = sqrt(n) * Yn for Gamma(2, 1) data
lines(x, 1 - (1 - pgamma(x / sqrt(n), 2))^n, lty = 2)

• You could also compare the true densities to the density of the approximation.
• On the original scale,
$$F_{Y_n}(y) = F_{Z_n}(c_ny) \approx F_Z(c_ny) = 1 - e^{-a(c_ny)^b}.$$


Monday, November 30, 2015 Other Convergence Results

Other Convergence Results

Theorem (Slutsky’s Theorem)

Suppose $X_n \xrightarrow{D} X$ and $Y_n \xrightarrow{P} a$, where $a$ is a constant. Then
$$X_nY_n \xrightarrow{D} aX \qquad\text{and}\qquad X_n + Y_n \xrightarrow{D} X + a$$

Theorem (Continuous Mapping Theorem)

Suppose $X_n \to X$ in probability (in distribution) and $f$ has the property that
$$P(X \in \{x : f \text{ is continuous at } x\}) = 1$$
Then $f(X_n) \to f(X)$ in probability (in distribution).


Example

• Suppose $X_1, X_2, \ldots$ are i.i.d. with mean $\mu$ and finite variance $\sigma^2$.
• Let
$$\overline{X}_n = \frac{1}{n}\sum_{i=1}^n X_i \qquad\text{and}\qquad S_n^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \overline{X}_n)^2.$$
• We already know that
  • $\overline{X}_n \xrightarrow{P} \mu$
  • $\overline{X}_n \to \mu$ almost surely
  • $\overline{X}_n \sim \mathrm{AN}(\mu, \sigma^2/n)$.
• What can we say about the behavior of $S_n^2$ as $n \to \infty$?


Example (continued)

• If we assume that $X_1$ has a finite fourth moment $\theta_4 = E[(X_1 - \mu)^4]$ then we have
$$E[S_n^2] = \sigma^2, \qquad \mathrm{Var}(S_n^2) = \frac{1}{n}\left(\theta_4 - \frac{n-3}{n-1}\,\sigma^4\right) \to 0.$$
• So $S_n^2 \to \sigma^2$ in mean square and in probability.
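A small simulation sketch of my own (not in the slides; the seed, sample sizes, and use of normal data are my choices) illustrates the $1/n$ decay of the variance:

```r
## Var(S^2_n) shrinks like 1/n. For N(0, 1) data, theta4 = 3 and
## sigma^4 = 1, so Var(S^2_n) is approximately 2/n.
set.seed(7)
v <- sapply(c(20, 80, 320), function(n)
  var(replicate(4000, var(rnorm(n)))))
round(v, 4)  ## roughly 2/20, 2/80, 2/320
```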


Example (continued)

• Another approach is to decompose $S_n^2$ as
$$S_n^2 = \frac{1}{n-1}\left[\sum_{i=1}^n (X_i - \mu)^2 - n(\overline{X}_n - \mu)^2\right] = \frac{n}{n-1}\cdot\frac{1}{n}\sum_{i=1}^n (X_i - \mu)^2 - \frac{n}{n-1}(\overline{X}_n - \mu)^2 = \frac{n}{n-1}U_n - \frac{n}{n-1}V_n$$
• By the WLLN
$$U_n = \frac{1}{n}\sum_{i=1}^n (X_i - \mu)^2 \xrightarrow{P} \sigma^2.$$
• By the WLLN and the continuous mapping theorem $V_n = (\overline{X}_n - \mu)^2 \xrightarrow{P} 0$.
• Slutsky's theorem then shows that
$$S_n^2 = \frac{n}{n-1}U_n - \frac{n}{n-1}V_n \xrightarrow{P} \sigma^2.$$


Example (continued)

• By the SLLN we also have $U_n \to \sigma^2$ almost surely.

• By the SLLN and continuity $V_n \to 0$ almost surely as well.

• Continuity then also implies $S_n^2 \to \sigma^2$ almost surely.


Example (continued)

• Since the variance of $S_n^2$ is $O(n^{-1})$ we might expect
$$W_n = \sqrt{n}\,(S_n^2 - \sigma^2)$$
to have a non-degenerate limiting distribution.
• Big picture:
  • After dropping smaller order terms, $S_n^2 \approx U_n$.
  • $U_n$ is an average of i.i.d. variables $Y_i = (X_i - \mu)^2$ with
$$E[Y_i] = E[(X_i - \mu)^2] = \sigma^2, \qquad \mathrm{Var}(Y_i) = E[Y_i^2] - E[Y_i]^2 = E[(X_i - \mu)^4] - \sigma^4 = \theta_4 - \sigma^4.$$
  • By the CLT $U_n \sim \mathrm{AN}(\sigma^2, (\theta_4 - \sigma^4)/n)$.
  • So $S_n^2 \sim \mathrm{AN}(\sigma^2, (\theta_4 - \sigma^4)/n)$.
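This approximation can be checked by simulation (my own sketch, not from the slides; seed and sample size are my choices):

```r
## For N(0, 1) data theta4 = 3 and sigma^4 = 1, so
## sqrt(n) * (S^2_n - 1) should be approximately N(0, 2).
set.seed(11)
n <- 200
w <- replicate(5000, sqrt(n) * (var(rnorm(n)) - 1))
round(c(mean(w), var(w)), 3)  ## near 0 and 2
```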


Example (continued)

• To verify this carefully, write
$$S_n^2 = U_n + \frac{1}{n-1}U_n - \frac{n}{n-1}V_n.$$
• Then
$$\sqrt{n}\,(S_n^2 - \sigma^2) = \sqrt{n}\,(U_n - \sigma^2) + \frac{\sqrt{n}}{n-1}U_n - \frac{n}{n-1}\sqrt{n}\,V_n = Z_n + \frac{\sqrt{n}}{n-1}U_n - \frac{n}{n-1}\sqrt{n}\,V_n.$$
• By the CLT $Z_n \xrightarrow{D} Z \sim N(0, \theta_4 - \sigma^4)$.
• To complete the proof we need to show that the remaining two terms converge to zero in probability.


Example (continued)

• First term $\frac{\sqrt{n}}{n-1}U_n$:
  • $U_n \xrightarrow{P} \sigma^2$ by the WLLN, and $\frac{\sqrt{n}}{n-1} \to 0$.
  • So by Slutsky's theorem $\frac{\sqrt{n}}{n-1}U_n \xrightarrow{P} 0$.
• Second term $\frac{n}{n-1}\sqrt{n}\,V_n$:
  • $\frac{n}{n-1} \to 1$ and
$$\sqrt{n}\,V_n = \sqrt{n}\,(\overline{X}_n - \mu)^2 = \left[\sqrt{n}\,(\overline{X}_n - \mu)\right](\overline{X}_n - \mu).$$
  • By the CLT $\sqrt{n}\,(\overline{X}_n - \mu) \xrightarrow{D} V \sim N(0, \sigma^2)$.
  • By the WLLN $\overline{X}_n - \mu \xrightarrow{P} 0$.
  • So by Slutsky's theorem $\frac{n}{n-1}\sqrt{n}\,V_n \xrightarrow{P} 0$.
• This completes the proof.


Example

• From the CLT we know that
$$Z_n = \frac{\overline{X}_n - \mu}{\sigma/\sqrt{n}} = \sqrt{n}\,\frac{\overline{X}_n - \mu}{\sigma} \xrightarrow{D} Z \sim N(0, 1).$$
• What happens if we replace $\sigma$ by $S_n$ to get
$$T_n = \frac{\overline{X}_n - \mu}{S_n/\sqrt{n}}\,?$$
• Write $T_n$ as
$$T_n = \frac{\sigma}{S_n}\left[\sqrt{n}\,\frac{\overline{X}_n - \mu}{\sigma}\right] = \frac{\sigma}{S_n}Z_n.$$
• By a homework problem $\sigma/S_n \xrightarrow{P} 1$.
• So Slutsky's theorem implies that $T_n \xrightarrow{D} Z$.
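For normal data this convergence can be seen exactly, since $T_n$ then has a $t$ distribution with $n - 1$ degrees of freedom (a sketch of my own, not in the slides):

```r
## Distance between the t CDF and the N(0, 1) CDF shrinks as df grows.
z <- seq(-3, 3, len = 101)
d <- sapply(c(5, 20, 100), function(df) max(abs(pt(z, df) - pnorm(z))))
round(d, 4)  ## decreasing toward 0
```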


Wednesday, December 2, 2015 Delta Method

Recap

• Central limit theorem

• Other limiting distributions

• Slutsky’s theorem

• Continuous mapping theorem


Example

• What can we say about the sampling distribution of Sn =√

S2n?

• We know that E [Sn] < σ.

• Since Var(Sn) = E [S2n ]− E [Sn]2 = σ2 − E [Sn]2 we have

E [Sn] =√σ2 − Var(Sn).

• Var(Sn) is typically O(n−1), so typically

E [Sn] = σ + O(n−1).

• Since f (x) =√

x is continuous and S2n → σ2 in probability and

almost surely, we also have Sn → σ in probability and almost surely.

• Can we approximate the distribution of Sn?

• Some sumulations:http://www.stat.uiowa.edu/~luke/classes/193/convergence.R.


Example (continued)

• We know that the distribution of $S_n^2$ is concentrated near $\sigma^2$ for large $n$.
• Near $\sigma^2$ we can approximate $f(x) = \sqrt{x}$ by its tangent
$$f(x) \approx f(\sigma^2) + f'(\sigma^2)(x - \sigma^2) = \sqrt{\sigma^2} + \frac{1}{2}\frac{1}{\sqrt{\sigma^2}}(x - \sigma^2) = \sigma + \frac{1}{2\sigma}(x - \sigma^2)$$
• So $S_n \approx \sigma + \frac{1}{2\sigma}(S_n^2 - \sigma^2)$.
• We know that $S_n^2 - \sigma^2 \sim \mathrm{AN}(0, (\theta_4 - \sigma^4)/n)$.
• This suggests that
$$S_n \sim \mathrm{AN}(\sigma, (f'(\sigma^2))^2(\theta_4 - \sigma^4)/n) = \mathrm{AN}\left(\sigma, \left(\frac{1}{2\sigma}\right)^2(\theta_4 - \sigma^4)/n\right) = \mathrm{AN}\left(\sigma, \frac{1}{4n}\left(\frac{\theta_4}{\sigma^2} - \sigma^2\right)\right).$$


Example (continued)

• To verify this more carefully, Taylor's theorem applied to the continuous, differentiable function $f(x) = \sqrt{x}$ in a neighborhood of $\sigma^2$ gives
$$f(x) = f(\sigma^2) + f'(\sigma^2)(x - \sigma^2) + o(x - \sigma^2).$$
• Therefore
$$\sqrt{n}\,(f(S_n^2) - f(\sigma^2)) = f'(\sigma^2)\sqrt{n}\,(S_n^2 - \sigma^2) + \sqrt{n}\,o(S_n^2 - \sigma^2) = U_n + V_n.$$
• Now $\sqrt{n}\,(S_n^2 - \sigma^2) \xrightarrow{D} Z \sim N(0, \theta_4 - \sigma^4)$ and $f'(\sigma^2)$ is a constant.
• So $U_n = f'(\sigma^2)\sqrt{n}\,(S_n^2 - \sigma^2) \xrightarrow{D} U \sim N(0, (f'(\sigma^2))^2(\theta_4 - \sigma^4))$.
• The remainder is
$$V_n = \sqrt{n}\,o(S_n^2 - \sigma^2) = \left[\sqrt{n}\,(S_n^2 - \sigma^2)\right]\left[\frac{o(S_n^2 - \sigma^2)}{S_n^2 - \sigma^2}\right].$$
• This converges to zero in probability by the continuous mapping theorem and Slutsky's theorem.


Example (continued)

• The same idea can be used to approximate the distribution of $\log S_n^2$.
• The Taylor expansion near $\sigma^2$ produces
$$\log S_n^2 \approx \log \sigma^2 + \frac{1}{\sigma^2}(S_n^2 - \sigma^2).$$
• The resulting distribution approximation is
$$\log S_n^2 \sim \mathrm{AN}\left(\log \sigma^2, \frac{\theta_4 - \sigma^4}{n\sigma^4}\right).$$
• The log transformation
  • reduces the skewness of the distribution of $S_n^2$
  • spreads the distribution across the entire real line
• This may improve the quality of the approximation.


Theorem (Delta Method, Propagation of Error)

Suppose $X_n \xrightarrow{P} a$, $Y_n(X_n - a) \xrightarrow{D} Z$, and $f$ is differentiable at $a$. Then
$$Y_n(f(X_n) - f(a)) \xrightarrow{D} f'(a)Z$$
If $Y_n$ is real-valued, $X_n \in \mathbb{R}^p$, and $f : \mathbb{R}^p \to \mathbb{R}^q$, then this holds with
$$f'(a) = \left(\frac{\partial f_i}{\partial x_j}\right)_{\substack{i=1,\ldots,q \\ j=1,\ldots,p}}$$


• The delta method is a formal version of the approximation
$$f(X_n) \approx f(a) + f'(a)(X_n - a)$$
when $f$ is differentiable at $a$ and $X_n$ is close to $a$ with high probability.

• If $X_n \sim \mathrm{AN}(a, b_n^2)$ with $b_n \to 0$, then the delta method can be expressed as
$$f(X_n) \sim \mathrm{AN}(f(a), (|f'(a)|b_n)^2) \sim \mathrm{AN}(f(a), (f'(a))^2b_n^2)$$

• For $f : \mathbb{R}^p \to \mathbb{R}^q$, if $X_n \sim \mathrm{AN}(a, B_n)$ with $B_n \to 0$, then the delta method becomes
$$f(X_n) \sim \mathrm{AN}(f(a), f'(a)B_nf'(a)^T)$$


Wednesday, December 2, 2015 Examples

Example

• Suppose $X_1, \ldots, X_n$ are i.i.d. positive random variables with mean $\mu$ and variance $\sigma^2$.
• Let $T_n = \log \overline{X}_n$.
• Then, informally,
$$T_n \approx \log\mu + \frac{1}{\mu}(\overline{X}_n - \mu) \sim \mathrm{AN}\left(\log\mu, \frac{1}{n}\frac{\sigma^2}{\mu^2}\right).$$
• More formally:
  • By the central limit theorem
$$\sqrt{n}\,(\overline{X}_n - \mu) \xrightarrow{D} Z \sim N(0, \sigma^2)$$
  • Take $Y_n = \sqrt{n}$, $a = \mu$, $f(x) = \log x$ in the delta method.
  • Then $f'(a) = 1/\mu$, and
$$\sqrt{n}\,(\log \overline{X}_n - \log\mu) \xrightarrow{D} \frac{1}{\mu}Z \sim N(0, \sigma^2/\mu^2).$$


Example (continued)

• By the continuous mapping theorem
$$\frac{1}{\overline{X}_n} \xrightarrow{P} \frac{1}{\mu}$$
if $\mu \neq 0$.
• So by Slutsky's theorem, if $\mu \neq 0$,
$$S_n/\overline{X}_n \xrightarrow{P} \sigma/\mu.$$
• By Slutsky's theorem therefore
$$\sqrt{n}\,\frac{\overline{X}_n}{S_n}(\log \overline{X}_n - \log\mu) \xrightarrow{D} N(0, 1)$$
• This is useful for forming approximate confidence intervals for $\log\mu$.


Example

• Let X ∼ N(10, (0.1)2) and Y ∼ N(15, (0.2)2) be independent.

• X and Y represent measurements of the sides of a rectangle.

• An estimate of the area of the rectangle based on these measurements is A = XY.

• We would like an approximation to the distribution of A.

• The function f (x , y) = xy is continuous and differentiable.

• The gradient of f is

$$\nabla f(x, y) = \left(\frac{\partial f(x, y)}{\partial x}, \frac{\partial f(x, y)}{\partial y}\right) = (y, x).$$


Example (continued)

• The Taylor expansion of $f$ near (10, 15) is
$$f(x, y) \approx 10 \times 15 + 15(x - 10) + 10(y - 15) = 150 + 15(x - 10) + 10(y - 15).$$
• So
$$A \approx 150 + 15(X - 10) + 10(Y - 15) \sim N(150, (15 \times 0.1)^2 + (10 \times 0.2)^2) = N(150, (2.5)^2).$$
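A direct Monte Carlo check of this approximation (my own sketch, not from the slides; the seed and replication count are my choices):

```r
## Simulate the area A = XY directly and compare its mean and SD
## with the delta-method values 150 and 2.5.
set.seed(5)
A <- rnorm(10000, 10, 0.1) * rnorm(10000, 15, 0.2)
round(c(mean(A), sd(A)), 2)  ## close to 150 and 2.5
```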


Example (continued)

• The exact density of $A$ is
$$f_A(a) = \int f_X(a/y)\,f_Y(y)/|y|\,dy.$$
• This is not available in closed form but can be computed numerically.
• A graphical comparison is produced by

g <- function(a) {
  ## integrand: joint density along the curve x * y = a
  f <- function(y) dnorm(a/y, 10, .1) * dnorm(y, 15, .2) / abs(y)
  integrate(f, 14, 16)$value ## these limits seem reasonable
}
a <- seq(140, 160, len = 101)
d <- sapply(a, g)
plot(a, d, type = "l")
lines(a, dnorm(a, 150, 2.5), col = "red")


Friday, December 4, 2015

Recap

• Delta method

• Examples of using limit theorems


Friday, December 4, 2015 Examples

Example

• A line is described by the equation
$$y = A + Bx.$$
• $A$ and $B$ are jointly normally distributed with parameters
$$\mu_A = 2, \quad \sigma_A = 0.2, \quad \mu_B = 1, \quad \sigma_B = 0.1, \quad \rho_{AB} = 0.5.$$
• The value of $x$ where $y = 0$ is $X = -A/B$.
• We would like to approximate the distribution of $X$.
• $X = f(A, B)$ with $f(a, b) = -a/b$.
• $f$ is continuous and differentiable for $b \neq 0$.


Example (continued)

• The gradient of $f$ is
$$\nabla f(a, b) = (-1/b,\; a/b^2).$$
• The first order Taylor approximation to $f$ near $(a, b) = (\mu_A, \mu_B)$ is
$$f(a, b) \approx -\frac{\mu_A}{\mu_B} - \frac{1}{\mu_B}(a - \mu_A) + \frac{\mu_A}{\mu_B^2}(b - \mu_B) = -\frac{\mu_A}{\mu_B} - \frac{1}{\mu_B}a + \frac{\mu_A}{\mu_B^2}b = -2 - a + 2b.$$
• Therefore
$$X \approx -2 - A + 2B \sim N(-2 - \mu_A + 2\mu_B,\; \sigma_A^2 + 4\sigma_B^2 - 4\sigma_A\sigma_B\rho_{AB}) = N(-2, (0.2)^2).$$


Example (continued)

• The exact density is
$$f_X(x) = \int f_{AB}(-xb, b)\,|b|\,db.$$
• This is available in closed form but is somewhat messy.
• We can use a simple simulation to check the approximation:

N <- 10000
z1 <- rnorm(N)
z2 <- rnorm(N, 0.5 * z1, sqrt(0.75))  ## gives corr(z1, z2) = 0.5
A <- 2 + 0.2 * z1
B <- 1 + 0.1 * z2
plot(density(-A/B))
x <- seq(-3, -1, len = 100)
lines(x, dnorm(x, -2, 0.2), lty = 2)

• The normal approximation seems to work quite well.
• Are $\mu_X \approx -2$ and $\sigma_X^2 \approx (0.2)^2$ good approximations to the mean and variance of $X$?


Friday, December 4, 2015 Normal Approximation to the Beta Distribution

Normal Approximation to the Beta Distribution

• Suppose
  • $X_n \sim \mathrm{Gamma}(\alpha_n, 1)$, $Y_n \sim \mathrm{Gamma}(\beta_n, 1)$
  • $X_n, Y_n$ are independent
  • $\alpha_n \to \infty$ and $\beta_n \to \infty$
  • $\alpha_n/(\alpha_n + \beta_n) = p$ with $0 < p < 1$.
• Then
$$X_n \sim \mathrm{AN}(\alpha_n, \alpha_n), \qquad Y_n \sim \mathrm{AN}(\beta_n, \beta_n)$$
• A useful result:

Theorem
If $X_n \sim \mathrm{AN}(a_n, b_n^2)$, $Y_n \sim \mathrm{AN}(c_n, d_n^2)$, and $X_n, Y_n$ are independent, then
$$\begin{pmatrix} X_n \\ Y_n \end{pmatrix} \sim \mathrm{AN}\left(\begin{pmatrix} a_n \\ c_n \end{pmatrix}, \begin{pmatrix} b_n^2 & 0 \\ 0 & d_n^2 \end{pmatrix}\right)$$


• For this example, this implies
$$\begin{pmatrix} V_n \\ W_n \end{pmatrix} = \begin{pmatrix} \frac{X_n}{\alpha_n+\beta_n} \\[4pt] \frac{Y_n}{\alpha_n+\beta_n} \end{pmatrix} \sim \mathrm{AN}\left(\begin{pmatrix} \frac{\alpha_n}{\alpha_n+\beta_n} \\[4pt] \frac{\beta_n}{\alpha_n+\beta_n} \end{pmatrix}, \begin{pmatrix} \frac{\alpha_n}{(\alpha_n+\beta_n)^2} & 0 \\ 0 & \frac{\beta_n}{(\alpha_n+\beta_n)^2} \end{pmatrix}\right) \sim \mathrm{AN}\left(\begin{pmatrix} p \\ 1-p \end{pmatrix}, \begin{pmatrix} \frac{p}{\alpha_n+\beta_n} & 0 \\ 0 & \frac{1-p}{\alpha_n+\beta_n} \end{pmatrix}\right)$$


• Now set
$$U_n = \frac{X_n}{X_n + Y_n} = \frac{\frac{X_n}{\alpha_n+\beta_n}}{\frac{X_n}{\alpha_n+\beta_n} + \frac{Y_n}{\alpha_n+\beta_n}} = \frac{V_n}{V_n + W_n} \sim \mathrm{Beta}(\alpha_n, \beta_n)$$
• Let
$$f(x, y) = \frac{x}{x + y} = 1 - \frac{y}{x + y}$$
• Then $U_n = f(V_n, W_n)$.


• The function
$$f(x, y) = \frac{x}{x + y} = 1 - \frac{y}{x + y}$$
is differentiable for $x > 0$, $y > 0$.
• The gradient of $f$ is
$$\nabla f(x, y) = \left(\frac{\partial}{\partial x}f(x, y), \frac{\partial}{\partial y}f(x, y)\right) = \left(\frac{y}{(x + y)^2}, -\frac{x}{(x + y)^2}\right)$$
• So
$$\nabla f(p, 1-p) = (1 - p, -p)$$
• Also $f(p, 1-p) = p$.


• Therefore
$$U_n \sim \mathrm{AN}\left(p,\; (1-p, -p)\begin{pmatrix} \frac{p}{\alpha_n+\beta_n} & 0 \\ 0 & \frac{1-p}{\alpha_n+\beta_n} \end{pmatrix}\begin{pmatrix} 1-p \\ -p \end{pmatrix}\right) \sim \mathrm{AN}\left(p, \frac{p(1-p)^2 + (1-p)p^2}{\alpha_n + \beta_n}\right) \sim \mathrm{AN}\left(p, \frac{p(1-p)}{\alpha_n + \beta_n}\right)$$
• Replacing $p$ with its definition in terms of $\alpha_n$ and $\beta_n$:
$$U_n \sim \mathrm{AN}\left(\frac{\alpha_n}{\alpha_n + \beta_n}, \frac{\alpha_n\beta_n}{(\alpha_n + \beta_n)^3}\right)$$


• Comparing exact and approximate densities:

x <- seq(0, 1, len = 100)

n <- 10

p <- 0.25

plot(x, dbeta(x, p * n, (1 - p) * n), type = "l")

lines(x, dnorm(x, p, sqrt(p * (1 - p) / n)), lty = 2)

• Standardized to make approximation N(0, 1):

z <- seq(-3, 3, len = 100)

s <- sqrt(p * (1 - p) / n)

plot(z, s * dbeta(p + z * s, p * n , (1 - p) * n), type = "l")

lines(z, dnorm(z), lty = 2)


• Comparing standardized CDFs:

plot(z, pbeta(p + z * s, p * n , (1 - p) * n), type = "l")

lines(z, pnorm(z), lty = 2)

• A numerical summary (Kolmogorov distance):

max(abs(pbeta(p + z * s, p * n , (1 - p) * n) - pnorm(z)))

• Summary for several sample sizes:

nn <- c(10, 20, 40, 60, 100, 150, 200)

pp <- sapply(nn, function(n) {

s <- sqrt(p * (1 - p) / n)

max(abs(pbeta(p + z * s, p * n , (1 - p) * n) - pnorm(z)))

})

plot(nn, pp)


• A probability plot:

F1 <- pbeta(p + z * s, p * n , (1 - p) * n)

F2 <- pnorm(z)

plot(F1, F2, type = "l")

abline(0, 1, lty = 2)

• A quantile plot:

plot(p + s * z, qbeta(pnorm(z), p * n , (1 - p) * n), type = "l")

abline(0,1,lty=2)

• Relative errors:

plot(z, (F1-F2) / (F1 + F2), type = "l")

lines(z, (F1-F2) / (1 - F1 + 1 - F2), lty = 2)

• A numerical summary for relative errors:

max(abs(F1-F2) / (F1 + F2), abs(F1-F2) / (1 - F1 + 1 - F2))

• A better approximation in the tails is obtained by approximating the distribution of

log(U/(1− U)).


Friday, December 4, 2015 Approximate Distribution of Order Statistics

Approximate Distribution of Order Statistics

• For $0 < p < 1$ the $p$-th sample quantile for a U[0, 1] population satisfies
$$U_{(\{np\})} \sim \mathrm{Beta}(\{np\},\, n - \{np\} + 1).$$
• The normal approximation for the Beta distribution therefore implies that
$$U_{(\{np\})} \sim \mathrm{AN}(p,\, p(1-p)/n).$$
• This means that
$$U_{(\{np\})} \xrightarrow{P} p$$
• and
$$\sqrt{n}\,(U_{(\{np\})} - p) \xrightarrow{D} V \sim N(0, p(1-p))$$


• Suppose $F$ is a population CDF such that
  • $F$ is a continuous distribution with density $f$
  • at the $p$-th population quantile $F^{-1}(p)$,
$$F'(F^{-1}(p)) = f(F^{-1}(p)) > 0.$$
• Since
$$X_{(k)} \sim F^{-1}(U_{(k)})$$
we can approximate the distribution of $X_{(\{np\})}$ using
  • the normal approximation to the distribution of $U_{(\{np\})}$
  • the delta method


• The normal approximation to the distribution of $U_{(\{np\})}$ is
$$U_{(\{np\})} \sim \mathrm{AN}(p,\, p(1-p)/n).$$
• The delta method then shows that
$$X_{(\{np\})} = F^{-1}(U_{(\{np\})}) \sim \mathrm{AN}\left(F^{-1}(p), \left(\frac{d}{dp}F^{-1}(p)\right)^2 \frac{p(1-p)}{n}\right)$$
• Now
$$\frac{d}{dp}F^{-1}(p) = \frac{1}{f(F^{-1}(p))}$$
• So
$$X_{(\{np\})} \sim \mathrm{AN}\left(F^{-1}(p), \frac{p(1-p)}{n\,f(F^{-1}(p))^2}\right)$$


Example

• Suppose f ∼ N(µ, σ2) and p = 1/2. Then F−1(1/2) = µ and

f (F−1(p)) =1√2πσ

• Then

X̃ ∼ AN

(µ,

2πσ2

4n

)• Now X ∼ N(µ, σ2/n) and√

4≈ 1.2533 > 1

• So for a given sample size X is a more accurate estimator of µ thanX̃ .
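This efficiency ratio is visible in a small simulation (my own sketch, not from the slides; seed and sizes are my choices):

```r
## Ratio of the SDs of the sample median and sample mean for normal
## data, compared with the asymptotic value sqrt(pi / 2).
set.seed(17)
n <- 201
sims <- replicate(4000, { x <- rnorm(n); c(mean(x), median(x)) })
r <- sd(sims[2, ]) / sd(sims[1, ])
round(c(r, sqrt(pi / 2)), 3)  ## both near 1.25
```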


Friday, December 4, 2015 Higher Order Delta Method

Higher Order Delta Method

• The first order delta method works well when the derivative of the function is not zero.

• For example, suppose $X_n \sim \mathrm{Binomial}(n, p)$.

• Then $Y_n = X_n/n \sim \mathrm{AN}(p,\, p(1-p)/n)$.

• Let $S_n = Y_n(1 - Y_n)$.

• The first order delta method will produce a useful approximate distribution for $0 < p < 1$ and $p \neq 1/2$.

• For $p = 1/2$ a second order Taylor expansion of $f(p) = p(1-p)$ can be used.

• This is called the second order delta method.

• Other approaches, for example based on simulation, may be more effective.
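A simulation sketch of my own (not in the slides; the seed and sample sizes are my choices) shows why the first order method degenerates at $p = 1/2$: since $f'(1/2) = 0$, the fluctuations of $S_n$ are of order $1/n$ rather than $1/\sqrt{n}$, and $n(1/4 - S_n)$ has a non-normal (scaled chi-squared) limit:

```r
## At p = 1/2, n * (1/4 - Sn) = n * (Yn - 1/2)^2 is approximately
## (1/4) * chi-squared with 1 df; its mean is exactly 1/4.
set.seed(21)
n <- 400
y <- rbinom(5000, n, 0.5) / n
w <- n * (0.25 - y * (1 - y))
round(c(mean(w), 1 / 4), 3)  ## E[(1/4) chisq_1] = 1/4
```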
