Big Picture - Booth School of...
5. Data, Estimates, and Models: quantifying the accuracy of estimates.
5.1 Estimating a Normal Mean
5.2 The Distribution of the Normal Sample Mean
5.3 Normal data, confidence interval for μ, σ known
5.4 Normal data, confidence interval for μ, σ unknown (the t distribution)
5.5 Bernoulli data, confidence interval for p
5.6 The Central Limit Theorem and a General Approximate Confidence Interval for μ
Big Picture
• We now move to looking at using data to estimate parameters of models.
• We begin by considering estimation of the mean of a distribution. The mean and variance are the two parameters that describe a Normal model.
• We saw that as the sample size gets big, the sample average x̄ should get close to the mean μ.
• What determines how close? Can we quantify the accuracy?
Consider a plant which fills cereal boxes.
The manager needs to know “how much cereal is going into the boxes”, at least “on average”.
How accurate will the sample average be as an estimate for the true mean μ?
5.1 Estimating the Mean of a Normal distribution
The setup
• The distribution of cereal box weights is Normal(345, 15²). So the true mean (long run average weight) is 345.
• The manager doesn’t know that μ = 345, so she randomly grabs boxes that have been filled and uses the sample average of their weights as an estimate for the unknown true mean (345).
Our Approach
• First, I’ll show that if we know the true distribution of cereal box weights (say N(345, 15²)), we can describe how likely it is that the estimate constructed from the sample average lies near or far from the true value.
• Next, we’ll use the results from above to quantify the accuracy of our estimates in the realistic setting where we don’t know the true value of the mean.
Here are the time series and histogram of the observed weights for 500 boxes:
The weights of cereal boxes are iid normal with μ = 345 and σ = 15.
[Figure: left, time series of weights against observation # (looks iid); right, histogram of weights (looks Normal).]
With 500 observations, our guess for μ is probably pretty good (we get 344.83, very close).
But what if you had fewer observations?
Suppose you only had the first 10!
How would you guess μ?
The solid black line is the sample average of the first 10 observations. It is further from the true value 345.
[Figure: the first 10 observations, with horizontal lines marking the sample averages $\bar{x}_{10}$ and $\bar{x}_{500}$.]
We saw that the sample average of a large number of iid draws should converge to the mean of the distribution we are drawing from:
$$E(X) = \mu \approx \frac{1}{n}\sum_{i=1}^{n} X_i \quad \text{(for large } n\text{)}$$
In our cereal box example, the weights are iid draws from a N(345, 15²). This means that the sample average should be “close” to 345.
In general:
Given a sample of size n of observations that look iid normal, the sample mean,
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$
is our estimate of
$$\mu = E(X_i)$$
μ is sometimes called the population mean since it is the mean of the entire population of all potential values, while the sample mean is just the average of some of them.
5.2 The Distribution of the Normal Sample Mean
How bad can the estimate be if you only have 10 observations?
To investigate this we perform a conceptual experiment.
Let’s take our 500 observations and break them up into 50 groups of 10 consecutive observations. Each group represents a sample of size 10 that you might have gotten. For each group we calculate the mean.
This will show us what kinds of values we could get for the average of just 10 observations.
• I want to see how “noisy” the sample average is when we have a sample of size 10 so I will look at a bunch of sample averages constructed using different datasets of 10 observations. We will look at how close or far the sample averages lie from the true mean.
• In reality we would have just a single sample of size 10, but we could have gotten any of the 50 samples we look at.
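The conceptual experiment can be sketched in a few lines of Python. This is only an illustration: the original 500 weights are not available here, so the sketch simulates them from N(345, 15²) with an arbitrary seed, and the variable names are made up for the example.

```python
import random
import statistics

# Assumption: simulated weights stand in for the 500 observed boxes.
rng = random.Random(0)
weights = [rng.gauss(345, 15) for _ in range(500)]

# Break the 500 observations into 50 consecutive groups of 10.
groups = [weights[i:i + 10] for i in range(0, 500, 10)]

# One sample average per group: 50 estimates we "might have gotten".
group_means = [statistics.mean(g) for g in groups]

print("average of all 500:", round(statistics.mean(weights), 2))
print("smallest group mean:", round(min(group_means), 2))
print("largest group mean:", round(max(group_means), 2))
print("sd of the 50 group means:", round(statistics.stdev(group_means), 2))
```

The spread of the 50 group means is much smaller than the spread of the individual weights, which is what the histogram of sample averages shows.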
[Figure: the 500 weights against observation #, with a short solid segment drawn for each group of 10.]
The little solid segments are plotted at the mean of the corresponding 10 numbers.
Here is the histogram of the 50 sample averages:
[Figure: Histogram of 50 sample averages; frequency against sample average, from about 338 to 352.]
• These are the 50 sample averages, not the individual cereal boxes.
• They look Normal too!
• So the distribution of the types of values we get for our sample averages looks Normal too!
Suppose the manager is about to grab a new sample of size 10 using observations 501-510 and use that sample average as their estimate for the mean μ.
What values might they get for the sample average?
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$
Recall that empirically we found this histogram for our conceptual experiment.
[Figure: Histogram of 50 sample averages (repeated).]
• With the new sample, the manager could get any value like the ones we saw in our conceptual experiment (or other values).
• When we take a new sample, the result is like a random outcome. Why is it random?
– Because the data are random outcomes.
– Each Xi is a random draw from a N(345, 15²).
Key idea: Before we get the sample, each Xi is random. So we think of the sample mean as a random variable! It is a linear combination of iid Normals!
Q: What is the value that we will get for the first observation, X1? Ans: It’s unknown. It will be the outcome of a random draw from a N(345, 15²).
[Figure: histogram of weights.]
$$\bar{X} = \frac{1}{10}\left(X_1 + X_2 + \cdots + X_{10}\right)$$
• So, the big idea is that before we collect our n observations, we can think of the sample average as a random variable.
• When we finally take our sample it gives us one realization of the sample average.
• It is random because it is a linear combination of n iid random variables.
• Note that the notation will remain the same, but we now think of the sample average before we take the sample as random.
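$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{1}{n}X_1 + \frac{1}{n}X_2 + \cdots + \frac{1}{n}X_n$$
$$E(\bar{X}) = \frac{1}{n}E(X_1) + \frac{1}{n}E(X_2) + \cdots + \frac{1}{n}E(X_n) = \frac{1}{n}\mu + \frac{1}{n}\mu + \cdots + \frac{1}{n}\mu = \mu$$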
Since the expected value of X̄ is equal to the thing we are trying to estimate, μ, we say our X̄ is an unbiased estimate of the population mean μ.
What is the variance of the sample average?
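$$\bar{X} = \frac{1}{n}\left(X_1 + X_2 + \cdots + X_n\right) = \frac{1}{n}X_1 + \frac{1}{n}X_2 + \cdots + \frac{1}{n}X_n$$
$$\mathrm{Var}(\bar{X}) = \frac{1}{n^2}\mathrm{Var}(X_1) + \frac{1}{n^2}\mathrm{Var}(X_2) + \cdots + \frac{1}{n^2}\mathrm{Var}(X_n) = \frac{1}{n^2}\sigma^2 + \frac{1}{n^2}\sigma^2 + \cdots + \frac{1}{n^2}\sigma^2 = n\,\frac{1}{n^2}\sigma^2 = \frac{\sigma^2}{n}$$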
• So the sample average is unbiased and the variance of the sample average can be quantified.
• Ideally we would like the variance to be small so that the sample average should be close to the mean.
The variance of the sample average depends on two things: the variance of the population from which we are sampling, σ², and the sample size n.
• The variability of our sample average decreases with larger sample sizes (larger values of n).
• The variability of our sample average is larger when the population variance is larger. Larger population variance means that our individual draws of the X’s are more spread out.
Why don’t any covariances appear in the variance of X̄?
Because the Xi are independent, every covariance is zero.
Does this make sense?
Let
$$X_1, X_2, \ldots, X_n \sim N(\mu, \sigma^2) \text{ iid}$$
then,
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$
This is the same σ² in both lines. In the top line it represents the variance of the distribution of cereal box weights. In the second line, the ratio σ²/n provides the variance of sample averages constructed by averaging n cereal box weights.
Fact: since the average is a linear combination of independent Normals, it is also Normally distributed.
[Figure: Relationship between the distribution of cereal box weights and the sample average of ten. Two Normal densities over weights from 300 to 380, labeled “Cereal Box” and “Average of 10”.]
For different sample sizes we get different distributions for the sample averages:
Example
$$\bar{X} \sim N\left(345, \frac{15^2}{10}\right), \qquad \bar{X} \sim N\left(345, \frac{15^2}{50}\right), \qquad \bar{X} \sim N\left(345, \frac{15^2}{500}\right)$$
For different sample sizes, the normal curves tell us how close we can expect our estimate to be to the true value!
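As a quick check of how fast the spread shrinks, the standard deviation of the sample average, σ/√n, can be computed for the three sample sizes in the example. The function name below is just an illustrative choice.

```python
import math

def sd_of_sample_mean(sigma, n):
    """Standard deviation of the average of n iid draws, each with sd sigma."""
    return sigma / math.sqrt(n)

# Cereal box example: sigma = 15, sample sizes 10, 50 and 500.
for n in (10, 50, 500):
    print(n, round(sd_of_sample_mean(15, n), 3))
```

Going from 10 to 500 boxes cuts the standard deviation of the estimate by a factor of √50, about 7.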
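[Figure: densities of xbar for the three sample sizes, plotted from 335 to 355.]
If we assume μ = 345 and σ = 15:
[Figure: the 500 weights against observation #, with horizontal lines at the following values.]
$$\mu = 345$$
$$\mu + 2\frac{\sigma}{\sqrt{n}} = 345 + 2\frac{15}{\sqrt{10}} = 354.5, \qquad \mu - 2\frac{\sigma}{\sqrt{n}} = 345 - 2\frac{15}{\sqrt{10}} = 335.5$$
$$\mu + 2\sigma = 345 + 30 = 375, \qquad \mu - 2\sigma = 345 - 30 = 315$$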
5.3 Confidence Intervals 1: How do we use the results from the previous section when we don’t know μ?
• We just figured out that if we sample from a N(μ, σ²), we can figure out what kind of sample averages we will get from a sample of size n.
• What we really want to know is, given a sample average, where do we think μ is?
• At first, we will still assume that we know σ but we don’t know μ.
• In the next section we will relax this unrealistic assumption.
• We are assuming that the data are iid normal.
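First let’s add a bit of notation:
Let,
$$\sigma_{\bar{X}} = \sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}}$$
This will simplify the look of the formulas and emphasize that the sample mean has its own standard deviation.
$$\bar{X} \sim N(\mu, \sigma_{\bar{X}}^2)$$
Now we standardize
$$\frac{\bar{X} - \mu}{\sigma_{\bar{X}}} \sim N(0,1)$$
so,
$$\Pr\left(-2 < \frac{\bar{X} - \mu}{\sigma_{\bar{X}}} < 2\right) = .95$$
(really the 2 is 1.96!!)
so,
$$\Pr(\mu - 2\sigma_{\bar{X}} < \bar{X} < \mu + 2\sigma_{\bar{X}}) = .95$$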
This says that there is a 95% chance that the sample mean lies within two standard deviations of the true mean μ. Remember that σ_X̄ gets smaller as the sample size gets larger. So we should expect the sample mean to be closer to the true mean in larger samples.
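The remark that “the 2 is really 1.96” can be checked with the standard normal cdf. A minimal sketch using Python's standard library `statistics.NormalDist`:

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, sd 1

# Probability that a standard normal lies within +/- 2 and within +/- 1.96.
within_2 = z.cdf(2) - z.cdf(-2)
within_196 = z.cdf(1.96) - z.cdf(-1.96)

print(round(within_2, 4), round(within_196, 4))
```

Using 2 instead of 1.96 gives slightly more than 95% probability, so the rounded rule is a little conservative.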
[Figure: density of $\bar{X} \sim N(345, 15^2/10)$, centered at μ = 345, with μ − 2σ_X̄ and μ + 2σ_X̄ marked. There is a 95% chance that x̄ falls between them.]
Next let’s rearrange some more to get something useful!
If there is a 95% chance that x̄ lies within two standard deviations of μ, then there is a 95% chance that μ lies within two standard deviations of x̄.
Mathematically, we rearrange the last inequality to get:
$$\Pr(-\bar{X} - 2\sigma_{\bar{X}} < -\mu < -\bar{X} + 2\sigma_{\bar{X}}) = .95$$
$$\Pr(\bar{X} - 2\sigma_{\bar{X}} < \mu < \bar{X} + 2\sigma_{\bar{X}}) = .95$$
For iid normal data, with known standard deviation σ, a 95% confidence interval for the true mean μ is,
$$\bar{X} \pm 2\sigma_{\bar{X}}$$
95% of the time the true value will be contained in the interval.
A picture of the process:
[Figure: sampling distributions of X̄ centered at two candidate values, μ1 and μ2, with the observed x̄ marked.]
Our sample gives us a value for x̄. We want to ask what values for μ are ‘reasonable’.
Consider a possible value μ1. The red curve is the sampling distribution of X̄ if μ = μ1. Is this “reasonable”?
NO. If that were the right value of μ, it’s extremely unlikely we’d see an x̄ like the one we got in the data.
Consider μ2. If this were the right value of μ, it’s perfectly possible we’d see an x̄ like the one we saw in the data.
95% CI for μ: all values we cannot rule out based on the data.
Example:
Remember our weight data? Given 500 observations, what do we know about μ?
Assume σ = 15.
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{15}{\sqrt{500}} = .67$$
IN EXCEL, use the formula:
=15/sqrt(500)
EXCEL gives us: 0.670820
The sample average was 344.83.
The 95% ci is
(343.49, 346.17)
IN EXCEL, use the formulas:
= 344.83-2*.67
= 344.83+2*.67
EXCEL gives us the values:
343.490
346.170
Example:
Given just the 10 observations, what do we know about μ?
Assume σ = 15.
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{15}{\sqrt{10}} = 4.74$$
IN EXCEL, use the formula:
= 15/sqrt(10)
EXCEL gives us: 4.74342
The sample average was 348.5.
The 95% ci is
(339.02, 357.98)
IN EXCEL, use the formulas:
= 348.5-2*4.74
= 348.5+2*4.74
EXCEL gives us the values:
339.020
357.980
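The two Excel calculations can also be wrapped in a small function. This is a sketch, with a hypothetical function name, that uses the rule-of-thumb multiplier 2 by default. It does not round σ/√n before multiplying, so the last digit can differ slightly from the Excel values.

```python
import math

def ci_known_sigma(xbar, sigma, n, z=2.0):
    """Interval xbar +/- z * sigma / sqrt(n) for iid normal data, sigma known."""
    half_width = z * sigma / math.sqrt(n)
    return xbar - half_width, xbar + half_width

# All 500 boxes, then only the first 10 (sigma assumed to be 15 in both cases).
print(ci_known_sigma(344.83, 15, 500))
print(ci_known_sigma(348.5, 15, 10))
```

Passing `z=1.96` gives the slightly narrower interval that corresponds to exactly 95%.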
Confidence intervals answer the basic questions: what do you think the parameter is, and how sure are you?
In particular, a 95% CI means that if we took 100 samples and created 100 different confidence intervals, we would expect about 95 of them to contain the true (but unknown) value μ.
small interval: good, you know a lot
big interval: bad, you don’t know much.
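The “95 out of 100” interpretation can be illustrated by simulation. The sketch below assumes the cereal box setup (μ = 345, σ = 15, n = 10), draws many samples, and counts how often the interval x̄ ± 1.96σ/√n contains the true mean. The seed and the number of repetitions are arbitrary choices.

```python
import math
import random

rng = random.Random(1)
mu, sigma, n = 345, 15, 10          # true values, known only to the simulation
half_width = 1.96 * sigma / math.sqrt(n)

n_intervals = 10_000
covered = 0
for _ in range(n_intervals):
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(sample) / n
    # Does this sample's interval contain the true mean?
    if xbar - half_width < mu < xbar + half_width:
        covered += 1

coverage = covered / n_intervals
print("fraction of intervals containing mu:", coverage)
```

The fraction should come out close to 0.95. Any single interval either contains μ or it does not; the 95% describes the procedure.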
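Clearly there is nothing special (outside of convention) in using a 95% CI. We can construct any confidence interval we like.
For example: A 68% CI is given by
$$\bar{X} \pm \sigma_{\bar{X}}$$
More generally we can compute a 100(1−α)% confidence interval by:
$$\bar{X} \pm z_{\alpha/2}\,\sigma_{\bar{X}}$$
[Figure: standard normal density with area 1−α between −z_{α/2} and z_{α/2}, and area α/2 in each tail.]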
Here are some tabulated values:
1−α       .80    .90    .95    .99
α         .20    .10    .05    .01
α/2       .10    .05    .025   .005
z_{α/2}   1.28   1.64   1.96   2.58
The 100(1−α)% C.I. for μ is then given by
$$\bar{X} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$
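The tabulated z values can be reproduced with the inverse cdf of the standard normal. A minimal sketch using `statistics.NormalDist` from the Python standard library (the helper name `z_value` is made up for this example):

```python
from statistics import NormalDist

def z_value(confidence):
    """z_{alpha/2}: the point with probability alpha/2 above it under N(0,1)."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

for conf in (0.80, 0.90, 0.95, 0.99):
    print(conf, round(z_value(conf), 3))
```

The same function gives a value very close to 1 for a 68% interval, matching the one-standard-deviation rule.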
5.4 Normal data, confidence interval for μ, σ unknown
Now we will extend our ci to the more realistic situation where σ is unknown.
Typically you don’t know σ, so we have to estimate it as well.
How do we estimate σ?
Just as we now think of the sample mean as an estimate of μ, we can think of the sample sd as an estimate of σ.
Estimating σ
$$s_x^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
is our estimate for σ².
Fact:
$$E(s_x^2) = \sigma^2$$
We divide by n−1 so that the estimator is unbiased.
The estimate of σ is,
$$s_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
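The formula for the sample standard deviation translates directly into code. In the sketch below the five weights are hypothetical numbers chosen for easy arithmetic, not data from the cereal example, and the result is compared with the standard library's `statistics.stdev`, which also divides by n−1.

```python
import math
import statistics

def sample_sd(xs):
    """Sample standard deviation with the n-1 divisor."""
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)
    return math.sqrt(s2)

weights = [350.0, 338.0, 362.0, 345.0, 330.0]  # hypothetical box weights

print(sample_sd(weights))
print(statistics.stdev(weights))  # library version, same n-1 divisor
```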
Now our big idea is that in the formula, instead of using
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$
we use an estimate of it:
$$se(\bar{X}) = \frac{s_x}{\sqrt{n}}$$
This is called the standard error. Clearly, it is an estimate of the true standard deviation.
We might think that
$$\frac{\bar{X} - \mu}{se(\bar{X})} \approx N(0,1)$$
(squiggly lines mean “approximately distributed as”)
giving the ci: x̄ ± 2 se(X̄) (just replace σ_X̄ with its estimate).
This is approximately right for large n (n>30).
But it turns out that for iid normal data we can get an exact result.
First we need to learn about the t-distribution.
The t distribution
The t is just another continuous distribution.
It has one parameter called the degrees of freedom, which is usually denoted by the symbol ν.
Each value of ν gives you a different distribution.
Comparison of Normal and t distributions for different values of ν
[Figure: densities plotted for t from −4 to 4: the t dist with ν = 3 df, and two further curves. One of these is t with 30 df, the other is standard normal.]
When ν is bigger than about 30 the t is very much like the standard normal.
For smaller ν, it puts more probability in the “tails”.
Now, let $t_{n-1,.025}$ be such that
$$P(-t_{n-1,.025} < t_{n-1} < t_{n-1,.025}) = .95$$
where $t_{n-1}$ is a t rv with n−1 df.
[Figure: t density f(x) with area .95 between −t_{n−1,.025} and t_{n−1,.025}, and area .025 in each tail.]
For our Normal mean problem we use ν = n−1.
For n−1 greater than about 30, the t_{n−1} is so much like the standard normal that
$$t_{n-1,.025} \approx 2$$
For smaller n, the t value gets bigger than 2.
Here is a table of t values and n.
t_{n−1,.025}   n
4.303          3
2.228          11
2.086          21
2.042          31
2.00           61
We can see that for n>30 (or even about 20) the t value is about 2.
IN EXCEL, use the formula:
=TINV( 0.05, 10)
(the first argument is the probability in the tails, the second is the degrees of freedom)
EXCEL gives us: 2.228
There is .025 prob less than −2.228 and .025 prob greater than 2.228 for the t dist with 10 degrees of freedom.
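For readers without Excel, the t values in the table can be approximated with the Python standard library alone. The sketch below is one possible approach, not the method Excel uses: it integrates the t density numerically with Simpson's rule and then searches for the cutoff by bisection. The function names are made up for this example.

```python
import math

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, via Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

def t_value(df, tail=0.025):
    """Cutoff with probability `tail` above it, found by bisection."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_upper_tail(mid, df) > tail:
            lo = mid   # still too much probability above mid: move right
        else:
            hi = mid
    return (lo + hi) / 2

# Reproduce the table: n = 3, 11, 21, 31, 61 means df = n - 1.
for n in (3, 11, 21, 31, 61):
    print(n, round(t_value(n - 1), 3))
```

With the default tail of .025 this corresponds to `=TINV(0.05, df)` in Excel, since TINV takes the total probability in both tails.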
Our basic result is,
$$\frac{\bar{X} - \mu}{se(\bar{X})} \sim t_{n-1}$$
For small n, the t distribution accounts for our estimation of σ with s_x.
thus,
$$\Pr\left(-t_{n-1,.025} < \frac{\bar{X} - \mu}{se(\bar{X})} < t_{n-1,.025}\right) = .95$$
Just as before, we can rearrange this to obtain the interval:
$$\bar{x} \pm t_{n-1,.025}\,se(\bar{X})$$
An exact 95% confidence interval for μ with σ unknown is
$$\bar{x} \pm t_{n-1,.025}\,se(\bar{X})$$
Using the t value instead of the z value will make the interval bigger for smaller n.
This reflects the fact that we are not sure that our estimate for σ is quite right.
Example
Back to our weight data.
With n=500 the sample sd is 15.455, and the sample mean is 344.83.
The t dist with ν=499 is just like the standard normal so the t-value is about 2.
$$se(\bar{X}) = \frac{15.455}{\sqrt{500}} = .69$$
ci: 344.83 +/- 1.4
IN EXCEL, use the formulas:
= 344.83-1.4
= 344.83+1.4
EXCEL gives us the values:
343.430
346.230
T Confidence Intervals
Variable    N     Mean      StDev    SE Mean   95.0 % CI
weights     500   344.828   15.455   0.691     (343.470, 346.186)
[Figure: Histogram of weights (with 95% t-confidence interval for the mean); frequency against weights.]
$$se(\bar{X}) = \frac{s_x}{\sqrt{n}}$$
For the first 10 observations, the sample sd = 14.6, and the sample mean was 348.5.
The t_{9,.025} value is 2.262.
$$se(\bar{X}) = \frac{14.6}{\sqrt{10}} = 4.6$$
ci: 348.5 +/- 2.262*4.6
= 348.5 +/- 10.4
= (338.1, 358.9)
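The same arithmetic in code, as a sketch: the function below (a hypothetical helper) takes the t value as an input, so it relies on a table or software for that number. It keeps the standard error unrounded, so its output differs slightly from the hand calculation above.

```python
import math

def t_interval(xbar, s, n, t_value):
    """Interval xbar +/- t_value * s / sqrt(n) for iid normal data, sigma unknown."""
    se = s / math.sqrt(n)
    return xbar - t_value * se, xbar + t_value * se

# First 10 boxes: t value 2.262 for 9 degrees of freedom.
print(t_interval(348.5, 14.6, 10, 2.262))

# All 500 boxes: t value of about 2.
print(t_interval(344.83, 15.455, 500, 2.0))
```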
[Figure: Histogram of weights10 (with 95% t-confidence interval for the mean); frequency against weights10, from about 330 to 380.]
T Confidence Intervals
Variable    N    Mean     StDev   SE Mean   95.0 % CI
weights1    10   348.51   14.60   4.62      (338.07, 358.96)
IN EXCEL, use the pull-down menu:
StatPro > Statistical Inference > One-sample analysis …
Example: Let’s get a 95% ci for the true mean of Canadian returns.
$$se(\bar{X}) = \frac{s_x}{\sqrt{n}}, \qquad \bar{x} \pm t_{n-1,.025}\,se(\bar{X})$$
Results for one-sample analysis for canada
Summary measures
  Sample size                  107
  Sample mean                  0.009
  Sample standard deviation    0.038
Confidence interval for mean
  Confidence level             95.0%
  Sample mean                  0.009
  Std error of mean            0.004
  Degrees of freedom           106
  Lower limit                  0.002
  Upper limit                  0.016
[Figure: Histogram of canada (with 95% t-confidence interval for the mean); frequency against canada, from about −0.1 to 0.1.]
Is the confidence interval big?
Example: 95% CI for true mean of NYSE stock index over same period.
[Figure: Histogram of nyse (with 95% t-confidence interval for the mean); frequency against nyse, from about −0.2 to 0.1.]
T Confidence Intervals
Variable   N     Mean      StDev     SE Mean   95.0 % CI
nyse       107   0.01330   0.03686   0.00356   (0.00624, 0.02036)
Of course, just as for the case of the Normal, we can find any confidence interval that we would like.
The 100(1−α)% C.I. for μ is then given by
$$\bar{X} \pm t_{n-1,\alpha/2}\frac{s}{\sqrt{n}}$$
where t_{n−1,α/2} is defined similarly to z_{α/2} for the N(0,1).
5.5 Bernoulli data, confidence interval for p
Now we consider confidence intervals for p given iid Bernoulli observations.
[Figure: the 50 observations (C1, taking values 0 or 1) plotted against Index.]
Suppose we had this data where 1 means a default and 0 means no default.
What do you think the true default rate is and how sure are you?
• Our data consist of Bernoulli outcomes where a mortgage either defaults (1) or does not (0).
• Our best estimate of p will be the sample fraction of defaults. That is:
$$\hat{p} = \frac{\sum_{i=1}^{n} x_i}{n}$$
• For our data it is 12/50.
• We play the same game as before: before we take our sample we ask what can happen?
• This time the outcomes are realizations of iid Bernoulli(p).
• The sum of iid Bernoullis has a Binomial distribution, so the numerator is the outcome of a Binomial(n,p) where n is the sample size and p is the parameter we want to know.
For iid Bernoulli data, the estimate of p is
$$\hat{p} = \frac{\text{observed number of successes in the } n \text{ trials}}{\text{number of trials}} = \frac{Y}{n}, \qquad Y \sim B(n,p)$$
Before we get a sample of size n, what kind of estimate can we expect to get?
$$E(\hat{p}) = E\left(\frac{Y}{n}\right) = \frac{1}{n}E(Y) = \frac{1}{n}np = p \quad \text{(unbiased)}$$
$$\mathrm{Var}(\hat{p}) = \frac{1}{n^2}np(1-p) = \frac{p(1-p)}{n}$$
Two things:
1) The variance of p̂ is again decreasing in the sample size n.
2) The variance of p̂ depends on the value of p.
Unlike the normal case, only approximate results are available.
Since our estimate is a combination of independent Bernoullis, the central limit theorem tells us that it should be approximately normal:
$$\hat{p} \approx N\left(p, \frac{p(1-p)}{n}\right)$$
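We make a final approximation: we replace p with p̂ in the standard deviation.
$$se(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
so,
$$\frac{\hat{p} - p}{se(\hat{p})} \approx N(0,1)$$
The 95% interval for the true proportion p is,
$$\hat{p} \pm 2\,se(\hat{p})$$
In our example our interval would be: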
IN EXCEL, use the formulas:
= .24 - 2*sqrt(.24*(1-.24)/50)
= .24 + 2*sqrt(.24*(1-.24)/50)
EXCEL gives us the values:
0.119203
0.360797
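The same interval can be computed from the raw counts. A minimal sketch, with a hypothetical function name, using the multiplier 2 as in the Excel formulas:

```python
import math

def proportion_ci(successes, n, z=2.0):
    """Approximate interval p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

# Default data: 12 defaults among 50 mortgages.
print(proportion_ci(12, 50))
```

With only 50 observations the interval runs from roughly 12% to 36%, so the default rate is not pinned down very precisely.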
Example
Remember the discrimination case ?
We used .07 for p.
Not counting the firm being sued, we had 1128 partners, 77 of whom were female.
$$\hat{p} = \frac{77}{1128} \approx .07$$
The confidence interval is
$$.07 \pm 2\sqrt{\frac{.07(1-.07)}{1128}} = .07 \pm .0152$$
This interval tells us where we think p is:
(.0548, .0852)
Suppose we had only 100 partners, 7% of whom are female.
The interval would be
$$.07 \pm 2\sqrt{\frac{.07(1-.07)}{100}} = .07 \pm .05$$
(.02, .12)
This interval is much bigger, telling us that with only 100 observations, our estimate could be a lot farther from the truth.
100 observations have less information than 1128.
National Poll of likely voters (CNN)
Trump tops Clinton 58% to 56% in unfavorable poll by CNN/ORC
The CNN/ORC Poll was conducted by telephone October 20‐23 among a random national sample of 1,017 adults, including 779 who were determined to be likely voters. The margin of sampling error for results among the sample of likely voters is plus or minus 3.5 percentage points.
What is the “margin of error”?
• It is actually the half-width of a 95% Confidence Interval.
$$se(\hat{p}) = \sqrt{\frac{.58(1-.58)}{779}} = .0176$$
$$2 \cdot se(\hat{p}) = 2(.0176) = .0353 \approx 3.5\%$$
Why is the Margin of Error often 3%?
• Generally the sample size is a little over 1000.
• The numerator of the standard error depends on p, so why does the reported margin of error not depend on the estimate of p?
• The media uses the largest interval, which is obtained when p̂ = .5:
$$2 \cdot se(\hat{p}) = 2\sqrt{\frac{.5(1-.5)}{1000}} = .0316 \approx 3\%$$
• So if the estimate for p is different from .5, the confidence interval (margin of error) will be smaller than 3%.
• This means that we are at least 95% sure that the true value of p lies within plus or minus three percentage points of the estimate.
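Both margin-of-error calculations, and the claim that p̂ = .5 is the worst case, can be checked with a short sketch. The function name is illustrative, and the multiplier 2 follows the rule of thumb used throughout.

```python
import math

def margin_of_error(p_hat, n, z=2.0):
    """Half-width of the approximate 95% interval for a proportion."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# CNN/ORC poll: 58% among 779 likely voters.
print(round(margin_of_error(0.58, 779), 4))

# Worst case p_hat = .5 with a sample of 1000.
print(round(margin_of_error(0.5, 1000), 4))

# For a fixed n, the margin is largest at p_hat = .5.
margins = [margin_of_error(p / 100, 1000) for p in range(1, 100)]
print(max(margins) == margin_of_error(0.5, 1000))
```

Because p̂(1−p̂) peaks at p̂ = .5, quoting the .5 margin gives a single number that is safe for every question in the poll.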
5.6 The Central Limit Theorem and a General Approximate Confidence Interval for μ
Suppose we are willing to assume that our data are iid, but not willing to assume that they are normally distributed, and they are not Bernoulli.
We might still want to estimate
$$\mu = E(X_i)$$
It turns out that the approach we used for normal data is approximately correct (with large sample sizes):
$$\bar{X} \approx N\left(\mu, \frac{\sigma^2}{n}\right) \approx N\left(\mu, se(\bar{X})^2\right)$$
(first squiggle is the clt, second squiggle we just hope our estimate of σ is good)
so,
$$\frac{\bar{X} - \mu}{se(\bar{X})} \approx N(0,1)$$
Given iid observations Xi, an approximate 95% confidence interval for μ = E(Xi) is given by,
$$\bar{x} \pm 2\,se(\bar{X})$$
This is an extremely powerful result. It says that even if we don’t know that the distribution of the population we are sampling from is Normal, the distribution of the sample average will still be approximately Normal in large samples!
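A small simulation can make this concrete. The sketch below assumes a clearly non-Normal population, an exponential distribution with mean 1 (chosen here only for illustration), and checks how often x̄ ± 2 se(X̄) covers the true mean with samples of size 100. The seed and the number of repetitions are arbitrary.

```python
import math
import random

rng = random.Random(2)
true_mean = 1.0                 # mean of an Exponential(1) population
n, n_samples = 100, 2000

covered = 0
for _ in range(n_samples):
    sample = [rng.expovariate(1.0) for _ in range(n)]
    xbar = sum(sample) / n
    # Sample sd with the n-1 divisor, then the standard error.
    s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))
    se = s / math.sqrt(n)
    if xbar - 2 * se < true_mean < xbar + 2 * se:
        covered += 1

coverage = covered / n_samples
print("fraction of intervals containing the true mean:", coverage)
```

The coverage should land near 95% even though the individual observations are strongly skewed. With much smaller samples the approximation would be rougher.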