Transcript of STAT:5100 (22S:193) Statistical Inference I, Week 13
Luke Tierney
University of Iowa
Fall 2015
Luke Tierney (U Iowa) STAT:5100 (22S:193) Statistical Inference I Fall 2015 1
Monday, November 16, 2015
Recap
• Normal populations
• Order statistics
• Marginal CDF and density of an order statistic
• Little “o” and big “O” notation
Monday, November 16, 2015 Order Statistics
Example
• A N(µ, σ²) population has both mean µ and median µ.
• Either the sample mean X̄n or the sample median X̃n could be used to estimate µ.
• Which would produce a better estimate?
• We can explore this question using both simulation and theory.
• Some R code: http://www.stat.uiowa.edu/~luke/classes/193/median.R.
• The standard deviation of the sample median seems to satisfy

      SD(X̃n) ≈ 1.25 σ/√n

• Many statistics have sampling distributions that follow this square-root relationship.
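The slide's simulation is in R (median.R); the following is a rough Python stand-in (the function name and all parameter defaults are mine, not from the course code). It estimates the sampling standard deviations of the mean and the median and checks that their ratio is near 1.25.

```python
import random
import statistics

def sd_of_mean_and_median(n, mu=0.0, sigma=1.0, reps=5000, seed=42):
    # Draw `reps` samples of size n from N(mu, sigma^2) and record
    # the sample mean and sample median of each.
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(reps):
        x = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(statistics.fmean(x))
        medians.append(statistics.median(x))
    return statistics.stdev(means), statistics.stdev(medians)

sd_mean, sd_median = sd_of_mean_and_median(n=100)
ratio = sd_median / sd_mean   # theory suggests about 1.25 for normal data
```

With n = 100 and σ = 1 the SD of the mean should be near σ/√n = 0.1, and the median's SD about 1.25 times larger.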
Monday, November 16, 2015 Order Statistics
Example (continued)
• Suppose we are considering estimating µ with X̄ using a sample of size n.
• What would be the equivalent sample size nE we would need to achieve the same accuracy using the median X̃?
• We need to solve the equation

      Var(X̄n) = Var(X̃nE)

  or

      σ²/n ≈ (1.25)² σ²/nE

• The solution is nE ≈ (1.25)² n = 1.5625 n.
• The ratio n/nE ≈ 1/1.5625 = 0.64 is called the relative efficiency of X̃ to X̄.
• X̃ is less efficient than X̄ if the data really are normally distributed.
• But X̃ is much more robust to outliers than X̄.
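The equivalent-sample-size arithmetic above can be checked directly (a minimal sketch; the numbers are just the slide's algebra evaluated for an arbitrary n):

```python
# Solving sigma^2/n = (1.25)^2 * sigma^2 / nE for nE gives nE = (1.25)^2 * n.
n = 1000
nE = 1.25**2 * n      # equivalent sample size needed by the median
rel_eff = n / nE      # relative efficiency of the median to the mean
```

For n = 1000 this gives nE = 1562.5 and relative efficiency 0.64: the median needs roughly 56% more data to match the mean's accuracy.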
Monday, November 16, 2015 Approximations and Limits
Approximations and Limits
• If we can’t say anything precise about a sampling distribution, we often look for approximations.
• Approximations are usually stated as limit results.
• This is common in mathematics, for example:
  • The statement “f is differentiable at x∗” means

        lim_{x→x∗} (f(x) − f(x∗))/(x − x∗) = f′(x∗)

  • This can also be expressed as

        f(x) = f(x∗) + f′(x∗)(x − x∗) + o(x − x∗)

    as x → x∗.
  • This suggests the linear approximation

        f(x) ≈ f(x∗) + f′(x∗)(x − x∗)

    for x close to x∗.
• Care and experience are needed in interpreting “≈” and “close to.”
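The o(x − x∗) error term can be illustrated numerically (a small sketch with f = exp and x∗ = 0, my choice of example): the error of the linear approximation, divided by the step size, shrinks as the step shrinks.

```python
import math

def linear_approx_error(h):
    # For f = exp at x* = 0, the linear approximation is 1 + h.
    # Return the approximation error relative to the step size h.
    return abs(math.exp(h) - (1 + h)) / abs(h)

# Errors relative to h for shrinking steps; o(h) means these tend to 0.
errors = [linear_approx_error(10**-k) for k in (1, 2, 3, 4)]
```

The relative errors are roughly h/2, so halving the step halves the relative error as well.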
Monday, November 16, 2015 Approximations and Limits
• We will look at two kinds of convergence results:
  • Convergence of sequences of random variables
  • Convergence of sequences of probability distributions
• This will help with questions like:
  • Should X̄n be close to µ for large samples?
  • Can the probability distribution of the error X̄n − µ be approximated by a normal distribution?
• These will be formalized as limits as n → ∞:
  • X̄n converges to µ.
  • The distribution of

        (X̄n − µ)/(σ/√n)

    converges to a standard normal distribution.
Monday, November 16, 2015 Approximations and Limits
• We will develop tools to help us answer questions like:
  • If Xn → X and Yn → Y, does this imply that Xn + Yn → X + Y?
  • If f is continuous and Xn → X, does this imply f(Xn) → f(X)?
  • If the distribution of

        (X̄n − µ)/(σ/√n)

    converges to a N(0, 1) distribution and Sn → σ, can we conclude that the distribution of

        (X̄n − µ)/(Sn/√n)

    also converges to a N(0, 1) distribution?
Monday, November 16, 2015 Convergence of Sequences of Random Variables
Convergence of Sequences of Random Variables
Examples
• Suppose we want to use a statistic Tn to estimate a parameter θ.
• To decide whether this makes sense, a minimal requirement might be that Tn → θ as n → ∞.
• This property is known as consistency.
• The Weak Law of Large Numbers is an example of such a result.
• In showing that an approximate distribution for √n(X̄n − µ)/σ can also be used as an approximate distribution for √n(X̄n − µ)/Sn, a useful step is to show that

      (X̄n − µ)/(Sn/√n) − (X̄n − µ)/(σ/√n) = (X̄n − µ)/(σ/√n) · (σ/Sn − 1) → 0
Monday, November 16, 2015 Convergence of Sequences of Random Variables
• Some simulations: http://www.stat.uiowa.edu/~luke/classes/193/convergence.R.
• X1, X2, . . . are i.i.d. Bernoulli(p) and

      Pn = (1/n) ∑_{i=1}^n Xi.

• Almost all sample paths converge to p.
• This means

      P(Pn → p) = P({s ∈ S : Pn(s) → p}) = 1.
• This is called almost sure convergence.
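The course's simulation here is in R (convergence.R); as a rough Python stand-in (function name and parameters are mine), the following traces a few sample paths of the running proportion Pn and checks that each one settles near p:

```python
import random

def running_proportion_path(p, n, seed):
    # One sample path of P_k = (X_1 + ... + X_k)/k for k = 1, ..., n,
    # with X_i i.i.d. Bernoulli(p).
    rng = random.Random(seed)
    total = 0
    path = []
    for k in range(1, n + 1):
        total += 1 if rng.random() < p else 0
        path.append(total / k)
    return path

# Final values of five independent sample paths of length 50,000.
final_values = [running_proportion_path(0.3, 50000, seed)[-1] for seed in range(5)]
```

Each path is one point s of the sample space; almost sure convergence says that, with probability one, the whole path eventually stays near p.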
Monday, November 16, 2015 Almost Sure Convergence
Almost Sure Convergence
Definition
A sequence X1, X2, . . . of random variables converges almost surely to a random variable X if

      P(lim_{n→∞} Xn = X) = 1

or

      P({s ∈ S : lim_{n→∞} Xn(s) = X(s)}) = 1

Notation:

      Xn →a.s. X        Xn → X a.s.        Plim_{n→∞} Xn = X
Monday, November 16, 2015 Almost Sure Convergence
Example
Theorem (Strong Law of Large Numbers)
Let X1, X2, . . . be i.i.d. with E[|X1|] < ∞ and µ = E[X1]. Then

      X̄n → µ
almost surely.
Monday, November 16, 2015 Almost Sure Convergence
Example
• Let Z be a standard normal random variable.
• Define Zn as

      Zn = i/n   if   (i − 0.5)/n ≤ Z < (i + 0.5)/n

  for all integers i.
• Using the notation {b} for the closest integer to b:

      Zn = (1/n){nZ}.

• Then Zn → Z almost surely:
  • For any number z define zn = (1/n){nz}.
  • Then |z − zn| ≤ 1/n → 0; i.e. zn → z.
  • Therefore Zn(s) = (1/n){nZ(s)} → Z(s) for all s ∈ S.
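The discretization above can be sketched in a few lines of Python (my own illustration, not course code): rounding nZ to the closest integer and dividing by n always lands within 1/(2n) of Z.

```python
import random

rng = random.Random(1)
z_values = [rng.gauss(0, 1) for _ in range(1000)]

def discretize(z, n):
    # Z_n = {n z}/n where {b} is the closest integer to b.
    return round(n * z) / n

# Every realization is within 1/(2n) of z, so Z_n -> Z for every sample point.
max_err = max(abs(z - discretize(z, 100)) for z in z_values)
```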
Wednesday, November 18, 2015
Recap
• Normal populations
• Approximations and limits — motivations from calculus
• Almost sure convergence
• Strong Law of Large Numbers
Wednesday, November 18, 2015 Almost Sure Convergence
Theorem
Suppose Xn → X almost surely and f is continuous. Then f(Xn) → f(X) almost surely.

Proof.
• Let A = {s ∈ S : Xn(s) → X(s)}.
• Since f is continuous, for any s ∈ A

      f(Xn(s)) → f(X(s))

• Therefore

      {s ∈ S : f(Xn(s)) → f(X(s))} ⊃ A

• So

      P(f(Xn) → f(X)) ≥ P(A) = 1.
Wednesday, November 18, 2015 Almost Sure Convergence
Example
• Suppose X1, X2, . . . are independent draws from a population with finite mean µ and finite variance σ².
• The sample variance can be written as

      S²n = (1/(n − 1)) ∑_{i=1}^n (Xi − X̄n)²
          = (1/(n − 1)) [∑_{i=1}^n (Xi − µ)² − n(X̄n − µ)²]
          = (n/(n − 1)) [(1/n) ∑_{i=1}^n (Xi − µ)² − (X̄n − µ)²]
          = (n/(n − 1)) Un
Wednesday, November 18, 2015 Almost Sure Convergence
Example (continued)
• By the Strong Law of Large Numbers

      X̄n → µ a.s.
      (1/n) ∑_{i=1}^n (Xi − µ)² → σ² a.s.

• Therefore

      Un = (1/n) ∑_{i=1}^n (Xi − µ)² − (X̄n − µ)² → σ² a.s.

• So

      S²n = (n/(n − 1)) Un → σ² a.s.

• Since the square root is continuous, we also have

      Sn = √(S²n) → √(σ²) = σ a.s.
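The conclusion S²n → σ² can be watched numerically (a Python sketch of my own; σ = 2 is an arbitrary choice): for a large normal sample the sample variance should sit close to σ² = 4.

```python
import random
import statistics

rng = random.Random(7)
# 200,000 draws from N(0, 2^2), so the population variance is 4.
x = [rng.gauss(0, 2) for _ in range(200000)]
s2 = statistics.variance(x)   # the (n - 1)-divisor sample variance S_n^2
```

Since statistics.variance uses the n − 1 divisor, this is exactly the S²n of the slide.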
Wednesday, November 18, 2015 Almost Sure Convergence
• Almost sure convergence, if you can show that you have it, is the easiest form of convergence to work with.
• But almost sure convergence can be difficult to verify.
• It is also more than we need for many useful results.
• It is useful to look for other notions of convergence that may be easier to verify.
• One alternative is convergence in probability.
Wednesday, November 18, 2015 Convergence in Probability
Convergence in Probability
Definition
A sequence of random variables X1, X2, . . . converges in probability to a random variable X if for every ε > 0

      lim_{n→∞} P(|Xn − X| ≥ ε) = 0

or

      lim_{n→∞} P(|Xn − X| < ε) = 1

Notation:

      Xn →P X        plim_{n→∞} Xn = X
Wednesday, November 18, 2015 Convergence in Probability
Examples
• Weak Law of Large Numbers: X̄n →P µ.
• Suppose U1, U2, . . . are independent Uniform[0, 1] random variables.
• Let Xn = max{U1, . . . , Un} and let X ≡ 1.
• Then for any ε > 0

      P(|Xn − X| ≥ ε) = P(Xn ≤ 1 − ε)
                      = (1 − ε)^n if ε < 1, and 0 otherwise
                      → 0

• So Xn →P 1.
• It is also true that Xn → 1 almost surely.
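The exact tail probability (1 − ε)^n can be compared with a simulated frequency (a Python sketch; the function name and the choices n = 50, ε = 0.05 are mine):

```python
import random

def tail_prob_sim(n, eps, reps=20000, seed=3):
    # Estimate P(max(U_1, ..., U_n) <= 1 - eps) by simulation.
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        xn = max(rng.random() for _ in range(n))
        if xn <= 1 - eps:
            hits += 1
    return hits / reps

n, eps = 50, 0.05
exact = (1 - eps)**n        # about 0.077 for n = 50, eps = 0.05
sim = tail_prob_sim(n, eps)
```

Increasing n drives both numbers toward zero, which is exactly convergence in probability to 1.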
Wednesday, November 18, 2015 Convergence in Probability
• If Xn → X almost surely, then Xn →P X:
  • For any ε > 0

        P(|Xn − X| ≥ ε) = E[1{|Xn − X| ≥ ε}].

  • For any s ∈ S where Xn(s) → X(s) we have

        1{|Xn(s) − X(s)| ≥ ε} → 0.

  • So 1{|Xn − X| ≥ ε} → 0 almost surely.
  • By the dominated convergence theorem this implies that

        P(|Xn − X| ≥ ε) → 0.
Wednesday, November 18, 2015 Convergence in Probability
• It is possible to have convergence in probability but not almost sure convergence.
• Let X1, X2, . . . be independent Bernoulli random variables with P(Xn = 1) = 1/n.
• For any ε > 0 with ε ≤ 1

      P(|Xn| ≥ ε) = 1/n → 0.

• So Xn →P 0.
• But for every n

      P(all of Xn, Xn+1, . . . are zero) = ∏_{k=n}^∞ (1 − 1/k) ≤ exp{−∑_{k=n}^∞ 1/k} = 0

• So with probability one the sequence X1, X2, . . . contains infinitely many ones and cannot converge almost surely to zero.
• So almost sure convergence is stronger than convergence in probability.
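A quick simulation of this counterexample (a Python sketch of my own; the cutoffs 10 and 20,000 are arbitrary) shows ones still appearing far out in the sequence, even though each individual P(Xn = 1) = 1/n is tiny:

```python
import random

rng = random.Random(11)
# Independent indicators with P(X_n = 1) = 1/n for n = 1, ..., 20000.
xs = [1 if rng.random() < 1.0 / n else 0 for n in range(1, 20001)]

# Ones observed at indices n > 10; the expected count is roughly
# log(20000/10), so a typical path has several late ones.
late_ones = sum(xs[10:])
```

A single path almost surely keeps producing ones, so it cannot converge to zero, even though at any fixed large n a one is very unlikely.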
Wednesday, November 18, 2015 A Sufficient Condition
A Sufficient Condition
Suppose for some p ≥ 1 we have E[|Xn|^p] < ∞ for all n, E[|X|^p] < ∞, and

      lim_{n→∞} E[|Xn − X|^p] = 0.

Then Xn →P X.

Proof.
By Markov’s inequality,

      P(|Xn − X| ≥ ε) = P(|Xn − X|^p ≥ ε^p) ≤ E[|Xn − X|^p]/ε^p → 0

for any ε > 0.
Wednesday, November 18, 2015 A Sufficient Condition
• If E[|Xn − X|^p] → 0 then Xn is said to converge to X in Lp.
• Usually we use this for p = 2.
• This is called convergence in mean square.
• If the limit is a constant a then the sufficient condition becomes

      E[(Xn − a)²] = Var(Xn) + (E[Xn] − a)² → 0.

• This convergence holds if and only if both

      Var(Xn) → 0
      E[Xn] → a.

• If Xn is used to estimate a then
  • E[Xn] − a is called the bias of Xn;
  • E[(Xn − a)²] is the Mean Squared Error (MSE).
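The decomposition MSE = Var + bias² can be checked on a simulated estimator (a Python sketch of my own; the biased variance estimator with divisor n, samples of size 10 from N(0, 1)):

```python
import random

rng = random.Random(5)

def biased_var(sample):
    # Variance estimator with divisor n (biased downward by the factor (n-1)/n).
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample) / len(sample)

a = 1.0   # the target: sigma^2 = 1 for N(0, 1)
ests = [biased_var([rng.gauss(0, 1) for _ in range(10)]) for _ in range(5000)]

mean_est = sum(ests) / len(ests)
mse = sum((t - a) ** 2 for t in ests) / len(ests)
var = sum((t - mean_est) ** 2 for t in ests) / len(ests)
bias = mean_est - a   # theory: -sigma^2/n = -0.1 here
```

With these empirical moments the identity MSE = Var + bias² holds exactly (up to floating point), since it is an algebraic fact about any distribution of the estimator.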
Wednesday, November 18, 2015 Weak Law of Large Numbers
Weak Law of Large Numbers
Theorem
Let X1, X2, . . . be i.i.d. with mean µ and finite variance σ². Let X̄n = (1/n) ∑_{i=1}^n Xi. Then

      X̄n →P µ.

Proof.

      E[(X̄n − µ)²] = Var(X̄n) = σ²/n → 0

This is sometimes called (weak) consistency of X̄n for µ.
Wednesday, November 18, 2015 Distance and Convergence
Distance and Convergence
• One way to develop a notion of convergence for complicated objects, like random variables, is to define a distance between two objects.
• A distance is a function d(x, y) with these properties:
  • d(x, y) ≥ 0 for all x, y.
  • d(x, y) = d(y, x) for all x, y.
  • d(x, y) = 0 if and only if x = y.
  • d(x, y) ≤ d(x, z) + d(z, y) for all x, y, z.
• A distance is also called a metric.
• A metric space is a set together with a distance.
• Convergence xn → x in a metric space means d(xn, x) → 0.
Wednesday, November 18, 2015 Distance and Convergence
Examples
• Lp convergence corresponds to convergence with respect to the distance

      d(Xn, X) = E[|Xn − X|^p]^{1/p}.

• To satisfy the requirement that d(X, Y) = 0 implies X = Y we need to work in terms of equivalence classes of almost surely equal random variables.
• Convergence in probability also corresponds to convergence with respect to a distance; one possible distance is the Ky Fan distance

      d(X, Y) = E[min{|X − Y|, 1}].
Wednesday, November 18, 2015 Distances for Probabilities
Distances for Probabilities
• One possible distance between two probabilities P and Q is the total variation distance

      dTV(P, Q) = sup_{A ∈ B} |P(A) − Q(A)|.

• If P and Q are both continuous with densities f and g then

      dTV(P, Q) = (1/2) ∫ |f(x) − g(x)| dx.

• An analogous result holds if P and Q are both discrete.
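The density formula can be evaluated numerically for two normals (a Python sketch of my own; the pair N(0,1) and N(1,1) is chosen because it reappears later with the Levy distance, and its total variation distance has the known closed form 2Φ(1/2) − 1 ≈ 0.383):

```python
import math

def phi(x, mu=0.0):
    # N(mu, 1) density.
    return math.exp(-((x - mu) ** 2) / 2) / math.sqrt(2 * math.pi)

# Midpoint-rule approximation of (1/2) * integral |f - g| over [-8, 9].
N = 20000
a, b = -8.0, 9.0
h = (b - a) / N
tv = 0.5 * h * sum(abs(phi(a + (i + 0.5) * h) - phi(a + (i + 0.5) * h, 1.0))
                   for i in range(N))

# Closed form for N(0,1) vs N(1,1): 2*Phi(1/2) - 1.
closed_form = 2 * 0.5 * (1 + math.erf(0.5 / math.sqrt(2))) - 1
```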
Wednesday, November 18, 2015 Distances for Probabilities
• If Fn and F are probability distributions with densities or PMFs fn and f, and fn(x) → f(x) for all x, then

      dTV(Fn, F) → 0.

• This is known as Scheffé’s Theorem.
• If P is continuous and Q is discrete, then

      dTV(P, Q) = 1.

• So total variation distance cannot be used to help with approximating continuous distributions by discrete ones, or vice versa.
Friday, November 20, 2015
Recap
• Almost sure convergence
• Strong Law of Large Numbers
• Convergence in probability
• Weak law of large numbers
• Lp convergence
• Distances and convergence
Friday, November 20, 2015 Distances for Probabilities
• A distance among cumulative distribution functions is the Kolmogorov distance:

      dK(F, G) = sup_{x ∈ R} |F(x) − G(x)|

• This is a useful distance for continuous distributions or for discrete distributions with a common support.
• It is useful for capturing convergence of a sequence of discrete distributions to a continuous distribution.
• For general discrete distributions it has some undesirable features.
Friday, November 20, 2015 Distances for Probabilities
Example
• Let Fy(x) be the CDF of a random variable that equals y with probability one:

      Fy(x) = 1 if x ≥ y, and 0 if x < y.

• Let yn = 1/n.
• Then dK(Fyn, F0) = 1 for all n.
Friday, November 20, 2015 Distances for Probabilities
• An alternative distance among CDFs is the Levy distance:

      dL(F, G) = inf{ε > 0 : F(x − ε) − ε ≤ G(x) ≤ F(x + ε) + ε for all x ∈ R}

• Another way of defining this distance:
  • Think of placing a square parallel to the axes with side ε in a gap between F and G.
  • dL is the largest ε that will fit.
• The Levy distance between a N(0, 1) and a N(1, 1) distribution is approximately 0.28.
• For point mass distributions Fx and Fy the Levy distance is dL(Fx, Fy) = |x − y|.
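The 0.28 figure can be reproduced directly from the definition (a Python sketch of my own: binary search for the smallest ε for which the two-sided inequality holds on a grid):

```python
import math

def Phi(x, mu=0.0):
    # N(mu, 1) CDF.
    return 0.5 * (1 + math.erf((x - mu) / math.sqrt(2)))

def levy_ok(eps, grid):
    # Check F(x - eps) - eps <= G(x) <= F(x + eps) + eps on the grid,
    # with F the N(0,1) CDF and G the N(1,1) CDF.
    return all(Phi(x - eps) - eps <= Phi(x, 1.0) <= Phi(x + eps) + eps
               for x in grid)

grid = [i / 100 for i in range(-800, 801)]
lo, hi = 0.0, 1.0
for _ in range(40):
    mid = (lo + hi) / 2
    if levy_ok(mid, grid):
        hi = mid
    else:
        lo = mid
levy = hi   # should be close to the 0.28 quoted on the slide
```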
Friday, November 20, 2015 Distances for Probabilities
• Two useful results: Suppose Xn ∼ Fn and X ∼ F. Then
  • dL(Fn, F) → 0 if and only if Fn(x) → F(x) for all x where F is continuous.
  • dL(Fn, F) → 0 if and only if

        E[g(Xn)] → E[g(X)]

    for all bounded, continuous functions g.
Friday, November 20, 2015 Convergence in Distribution
Convergence in Distribution
Definition
A sequence of random variables X1, X2, . . . converges in distribution to a random variable X if

      lim_{n→∞} FXn(x) = FX(x)

for all x where FX is continuous.

• This is different: it is really about distributions, not random variables.
• This is also called weak convergence of distributions.
• It corresponds to convergence in the Levy distance.

Notation:

      Xn →D X        Xn ⇒ X        L(Xn) → L(X)
Friday, November 20, 2015 Convergence in Distribution
Example
• Suppose X ∼ N(0, 1) and let Xn = (−1)^n X.
• Then Xn has the same distribution as X for all n, so Xn → X in distribution.
• But Xn does not converge to X almost surely or in probability.
Friday, November 20, 2015 Convergence in Distribution
Theorem
A sequence of random variables X1, X2, . . . converges to a random variable X in distribution if and only if

      P(Xn ∈ A) → P(X ∈ A)

for every (Borel) set A with P(X ∈ ∂A) = 0, where ∂A is the boundary of A.

Theorem
A sequence of random variables X1, X2, . . . converges to a random variable X in distribution if and only if

      E[g(Xn)] → E[g(X)]

for all bounded, continuous functions g.

Theorem
Suppose Xn has MGF Mn, n = 1, 2, . . ., X has MGF M, M is finite in a neighborhood of the origin, and Mn(t) → M(t) for all t in a neighborhood of the origin. Then Xn →D X.
Friday, November 20, 2015 Convergence in Distribution
Theorem
If Xn →P X then Xn →D X.

Theorem
If c is a constant and Xn →D c then Xn →P c.
Friday, November 20, 2015 Convergence in Distribution
Example
• Suppose U1, U2, . . . are independent Uniform[0, 1] random variables.
• Let Xn = max{U1, . . . , Un}.
• Then for 0 < x < 1

      FXn(x) = x^n → 0.

• For x ≤ 0 we have FXn(x) = 0, and FXn(x) = 1 for x ≥ 1.
• So FXn(x) → FX(x) for all x, where FX is the CDF of X ≡ 1.
• So Xn →D X.
Friday, November 20, 2015 Convergence in Distribution
Example (continued)
• Now suppose Yn = min{U1, . . . , Un} and Y ≡ 0.
• The CDF of Yn is FYn(y) = 1 − (1 − y)^n for 0 ≤ y ≤ 1.
• For y > 0 we have FYn(y) → 1.
• But for y = 0 we have FYn(y) = 0 for all n.
• So FYn(y) → FY(y) for all y except y = 0, where FY is not continuous.
• So Yn →D Y.
• Yn →P 0 as well (also almost surely).
Friday, November 20, 2015 Convergence in Distribution
Example (continued)
• At what rate does Yn → 0?
• The mean of Yn is

      E[Yn] = ∫₀^∞ (1 − FYn(t)) dt = ∫₀¹ (1 − t)^n dt = 1/(n + 1) = O(n⁻¹).

• What happens to the distribution of Vn = nYn?
• For 0 ≤ v ≤ n the CDF of Vn is

      FVn(v) = P(Vn ≤ v) = P(Yn ≤ v/n) = 1 − (1 − v/n)^n → 1 − e^{−v}

• So Vn converges in distribution to V ∼ Exponential(1).
• The distribution of Yn = Vn/n is approximately Exponential with rate λ = n.
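The limit Vn →D Exponential(1) is easy to see by simulation (a Python sketch of my own; n = 100 and 20,000 replications are arbitrary choices): the scaled minimum should have mean near 1 and P(Vn > 1) near e⁻¹ ≈ 0.368.

```python
import math
import random

rng = random.Random(9)
n, reps = 100, 20000

# V_n = n * min(U_1, ..., U_n) for each replication.
v = [n * min(rng.random() for _ in range(n)) for _ in range(reps)]

mean_v = sum(v) / reps                      # theory: n/(n+1), close to 1
tail = sum(1 for t in v if t > 1) / reps    # theory: (1 - 1/n)^n, close to 1/e
```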
Friday, November 20, 2015 Convergence in Distribution
Example
• Let Pn be the sample proportion of successes in n Bernoulli(p) trials.
• What can we say about the distribution of Pn for large n?
• It is useful to look at the standardized version

      Z = (Pn − p)/√(p(1 − p)/n).

• Some simulations: http://www.stat.uiowa.edu/~luke/classes/193/convergence.R
• The sample paths do not converge.
• Their probability distributions do converge.
• The limiting distribution is the standard normal distribution.
• This suggests that the distribution of Pn for large n is approximately

      N(p, p(1 − p)/n).
• This is an example of the Central Limit Theorem.
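The course's simulation is again in R (convergence.R); as a Python stand-in (my parameter choices p = 0.3, n = 400), the standardized proportions should have mean near 0, variance near 1, and about Φ(1) ≈ 0.841 of them at or below 1:

```python
import math
import random

rng = random.Random(13)
p, n, reps = 0.3, 400, 5000

z = []
for _ in range(reps):
    successes = sum(1 for _ in range(n) if rng.random() < p)
    pn = successes / n
    # Standardized sample proportion.
    z.append((pn - p) / math.sqrt(p * (1 - p) / n))

mean_z = sum(z) / reps
var_z = sum(t * t for t in z) / reps - mean_z**2
frac_below = sum(1 for t in z if t <= 1.0) / reps   # compare with Phi(1)
```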
Friday, November 20, 2015 Central Limit Theorem
Theorem (Central Limit Theorem)
Let X1, X2, . . . be i.i.d. from a population with an MGF that is finite near the origin. Then X1 has finite mean µ and finite variance σ². Let

      Zn = (X̄n − µ)/(σ/√n) = √n (X̄n − µ)/σ

and let Z ∼ N(0, 1). Then Zn → Z in distribution, i.e.

      P(Zn ≤ z) → ∫_{−∞}^z (1/√(2π)) e^{−u²/2} du

for all z.
Friday, November 20, 2015 Central Limit Theorem
• If we only assume E[X1²] < ∞ then the theorem is still true; the proof works with characteristic functions.
• Independence and identical distribution can be weakened somewhat.