Big Picture - Booth School of...


5. Data, Estimates, and Models: quantifying the accuracy of estimates.

5.1 Estimating a Normal Mean
5.2 The Distribution of the Normal Sample Mean
5.3 Normal data, confidence interval for $\mu$, $\sigma$ known
5.4 Normal data, confidence interval for $\mu$, $\sigma$ unknown (the t distribution)
5.5 Bernoulli data, confidence interval for p
5.6 The Central Limit Theorem and a General Approximate Confidence Interval for $\mu$

Big Picture

• We now move to looking at using data to estimate parameters of models.

• We begin by considering estimation of the mean of a distribution. The mean and variance are the two parameters that describe a Normal model.

• We saw that as the sample size gets big, the sample average $\bar{x}$ should get close to the mean $\mu$.

• What determines how close? Can we quantify the accuracy?

Consider a plant which fills cereal boxes.

The manager needs to know “how much cereal is going in to the boxes”, at least “on average”.

How accurate will the sample average be as an estimate for the true mean $\mu$?

5.1 Estimating the Mean of a Normal distribution

The setup

• The distribution of cereal box weights is Normal(345, 15²), that is, $N(345, 15^2)$. So the true mean (long run average weight) is 345.

• The manager doesn’t know $\mu = 345$, so she randomly grabs boxes that have been filled and uses the sample average of their weights as an estimate for the unknown true mean (345).

Our Approach

• First, I’ll show that if we know the true distribution of cereal box weights (say $N(345, 15^2)$), we can describe how likely it is that the estimate constructed from the sample average lies near or far from the true value.

• Next, we’ll use the results from above to quantify the accuracy of our estimates in the realistic setting where we don’t know the true value of the mean.

Here are the time series and histogram of the observed weights for 500 boxes:

The weights of cereal boxes are iid normal with $\mu = 345$ and $\sigma = 15$.

[Figure: time series of weights against observation # (looks iid) and histogram of weights (looks Normal!).]

With 500 observations, our guess for $\mu$, the sample average $\bar{x}$, is probably pretty good (we get 344.83, very close).

But what if you had fewer observations?

Suppose you only had the first 10!

How would you guess?

The solid black line is the sample average of the first 10 observations. It is further from the true value 345.

[Figure: the first 10 observations, with the sample averages $\bar{x}_{10}$ and $\bar{x}_{500}$ marked.]

$$E(X) \approx \frac{1}{n}\sum_{i=1}^{n} X_i \quad \text{(for large } n\text{)}$$

We saw that the sample average of a large number of iid draws should converge to the mean of the distribution we are drawing from.

In our cereal box example, the weights are iid draws from a $N(345, 15^2)$. This means that the sample average should be “close” to 345.

In general:

Given a sample of size n of observations that look iid normal, the sample mean,

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

is our estimate of

$$\mu = E(X_i)$$

$\mu$ is sometimes called the population mean since it is the mean of the entire population of all potential values, while the sample mean is just the average of some of them.
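To make the idea concrete, here is a minimal Python sketch that simulates box weights and computes the sample mean from the first 10 and from all 500 draws. The seed and the simulated data are assumptions for illustration; they are not the 500 observations used in the slides, so the printed values will differ from 344.83 and 348.5.

```python
import random
import statistics

random.seed(0)  # arbitrary seed, only so the sketch is reproducible

MU, SIGMA = 345, 15  # true mean and sd from the cereal example

# Simulate 500 box weights as iid draws from N(345, 15^2).
weights = [random.gauss(MU, SIGMA) for _ in range(500)]

xbar_10 = statistics.mean(weights[:10])  # sample mean of the first 10 boxes
xbar_500 = statistics.mean(weights)      # sample mean of all 500 boxes

print(f"sample mean, first 10: {xbar_10:.2f}")
print(f"sample mean, all 500:  {xbar_500:.2f}")
```

Rerunning with different seeds shows how much more the 10-box average moves around than the 500-box average.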

5.2 The Distribution of the Normal Sample Mean

How bad can the estimate be if you only have 10 observations?

To investigate this we perform a conceptual experiment.

Let’s take our 500 observations and break them up into 50 groups of 10 consecutive observations. Each group represents a sample of size 10 that you might have gotten. For each group we calculate the mean.

This will show us what kinds of values we could get for the average of just 10 observations.

• I want to see how “noisy” the sample average is when we have a sample of size 10 so I will look at a bunch of sample averages constructed using different datasets of 10 observations. We will look at how close or far the sample averages lie from the true mean.

• In reality we would have just a single sample of size 10; it could have been any one of the 50 samples we look at.

[Figure: time series of the 500 weights. The little solid segments are plotted at the mean of the corresponding 10 numbers.]

Here is the histogram of the 50 sample averages:

[Figure: histogram of 50 sample averages (frequency against sample average, roughly 338 to 352).]

• These are the 50 sample averages, not the individual cereal boxes.

• They look Normal too!!

• So the distribution of the types of values we get for our sample averages looks Normal too!
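The conceptual experiment can be sketched in a few lines of Python. This is an illustration on simulated data (the seed and the draws are assumptions, not the slides' data): it splits 500 simulated weights into 50 consecutive groups of 10 and compares the spread of the group means with the spread of the individual weights.

```python
import random
import statistics

random.seed(1)  # arbitrary seed for reproducibility

MU, SIGMA, GROUP_SIZE = 345, 15, 10

# 500 simulated box weights, iid N(345, 15^2).
weights = [random.gauss(MU, SIGMA) for _ in range(500)]

# Break the data into 50 consecutive, non-overlapping groups of 10
# and compute the mean of each group.
group_means = [
    statistics.mean(weights[start:start + GROUP_SIZE])
    for start in range(0, len(weights), GROUP_SIZE)
]

spread_of_weights = statistics.stdev(weights)
spread = statistics.stdev(group_means)

print(f"number of group means: {len(group_means)}")
print(f"sd of individual weights: {spread_of_weights:.2f}")
print(f"sd of the group means:    {spread:.2f}")
```

The group means cluster much more tightly around 345 than the individual weights do, which is the pattern the histogram shows.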

Suppose the manager is about to grab a new sample of size 10 using observations 501-510 and use that sample average as their estimate for the mean $\mu$.

What values might they get for the sample average?

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

Recall that empirically we found this histogram for our conceptual experiment.

[Figure: histogram of 50 sample averages.]

• With the new sample, the manager could get any value like the ones we saw in our conceptual experiment (or other values).

• When we take a new sample it is like a random outcome, why is it random?

– Because the data are random outcomes.

– Each $X_i$ is a random draw from a $N(345, 15^2)$.

Key idea: Before we get the sample, each $X_i$ is random. So we think of the sample mean as a random variable!! It is a linear combination of iid Normals!

Q? What is the value that we will get for the first observation, $X_1$? Ans. It’s unknown. It will be the outcome of a random draw from a $N(345, 15^2)$.

[Figure: histogram of weights.]

$$\bar{X} = \frac{1}{10}\left(X_1 + X_2 + \cdots + X_{10}\right)$$

• So, the big idea is that before we collect our n observations, we can think of the sample average as a random variable.

• When we finally take our sample it gives us one realization of the sample average.

• It is random because it is a linear combination of n iid random variables.

• Note that the notation will remain the same, but we now think of the sample average before we take the sample as random.

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{1}{n}X_1 + \frac{1}{n}X_2 + \cdots + \frac{1}{n}X_n$$

$$E(\bar{X}) = \frac{1}{n}E(X_1) + \frac{1}{n}E(X_2) + \cdots + \frac{1}{n}E(X_n) = \frac{1}{n}\mu + \frac{1}{n}\mu + \cdots + \frac{1}{n}\mu = \mu$$

Since the expected value of $\bar{X}$ is equal to the thing we are trying to estimate, $\mu$, we say our $\bar{X}$ is an unbiased estimate of the population mean $\mu$.

What is the variance of the sample average?

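$$\bar{X} = \frac{1}{n}\left(X_1 + X_2 + \cdots + X_n\right) = \frac{1}{n}X_1 + \frac{1}{n}X_2 + \cdots + \frac{1}{n}X_n$$

$$\mathrm{Var}(\bar{X}) = \frac{1}{n^2}\mathrm{Var}(X_1) + \frac{1}{n^2}\mathrm{Var}(X_2) + \cdots + \frac{1}{n^2}\mathrm{Var}(X_n) = \frac{1}{n^2}\sigma^2 + \frac{1}{n^2}\sigma^2 + \cdots + \frac{1}{n^2}\sigma^2 = \frac{1}{n^2}\, n\sigma^2 = \frac{\sigma^2}{n}$$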

• So the sample average is unbiased and the variance of the sample average can be quantified.

• Ideally we would like the variance to be small so that the sample average is close to the mean.

The variance of the sample average depends on two things: the variance of the population from which we are sampling, $\sigma^2$, and the sample size n.

• The variability of our sample average decreases with larger sample sizes (larger values of n).

• The variability of our sample average is larger when the population variance is larger. Larger population variance means that our individual draws of the X’s are more spread out.

Why don’t any covariances appear in the variance of $\bar{X}$?

The $X_i$ must be independent.

Does this make sense?

Let

$$X_1, X_2, \ldots, X_n \sim N(\mu, \sigma^2) \quad \text{iid}$$

then,

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$

This is the same $\sigma^2$. In the top line it represents the variance of the distribution of cereal box weights. In the second line, the ratio $\sigma^2/n$ provides the variance of sample averages constructed by averaging n cereal box weights.

Fact: since the average is a combination of independent Normals, it is also Normally distributed.

[Figure: relationship between the distribution of cereal box weights and the distribution of the sample average of ten.]

For different sample sizes we get different distributions for the sample averages:

Example

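$$\bar{X} \sim N\left(345, \frac{15^2}{10}\right), \qquad \bar{X} \sim N\left(345, \frac{15^2}{50}\right), \qquad \bar{X} \sim N\left(345, \frac{15^2}{500}\right)$$

For different sample sizes, the normal curves tell us how close we can expect our estimate to be to the true value!

[Figure: densities of $\bar{x}$ for the three sample sizes, each centered at 345.]

If we assume $\mu = 345$ and $\sigma = 15$:

[Figure: time series of the 500 weights with reference lines at $\mu = 345$, at $\mu \pm 2\sigma = 345 \pm 30$ (315 and 375), and at $\mu \pm 2\sigma/\sqrt{n} = 345 \pm 2 \cdot 15/\sqrt{10}$ (335.5 and 354.5).]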
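As a quick check on these three distributions, the sketch below tabulates the standard deviation of the sample average, $\sigma/\sqrt{n}$, and the corresponding two-standard-deviation band around 345 for n = 10, 50, and 500. The helper name `sd_of_mean` is made up for this illustration.

```python
import math

MU, SIGMA = 345, 15  # values assumed in the cereal example


def sd_of_mean(sigma, n):
    """Standard deviation of the sample average of n iid draws."""
    return sigma / math.sqrt(n)


for n in (10, 50, 500):
    sd = sd_of_mean(SIGMA, n)
    low, high = MU - 2 * sd, MU + 2 * sd
    print(f"n={n:3d}  sd of mean={sd:.3f}  2-sd band=({low:.2f}, {high:.2f})")
```

For n = 10 the band is roughly (335.5, 354.5), matching the reference lines in the figure; for n = 500 it shrinks to about 345 plus or minus 1.34.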

5.3 Confidence Intervals 1: How do we use the results from the previous section when we don’t know $\mu$?

• We just figured out that if we sample from a $N(\mu, \sigma^2)$, we can figure out what kind of sample averages we will get from a sample of size n.

• What we really want to know is, given a sample average, where do we think $\mu$ is?

• At first, we will still assume that we know $\sigma$ but we don’t know $\mu$.

• In the next section we will relax this unrealistic assumption.

• We are assuming that the data are iid normal.

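First let’s add a bit of notation:

Let,

$$\sigma_{\bar{X}} = \sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}}$$

This will simplify the look of the formulas and emphasize that the sample mean has its own standard deviation.

Now we standardize:

$$\bar{X} \sim N(\mu, \sigma_{\bar{X}}^2) \quad \Rightarrow \quad \frac{\bar{X} - \mu}{\sigma_{\bar{X}}} \sim N(0,1)$$

so,

$$\Pr\left(-2 < \frac{\bar{X} - \mu}{\sigma_{\bar{X}}} < 2\right) = .95$$

(really the 2 is 1.96!!)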

so,

$$\Pr\left(\mu - 2\sigma_{\bar{X}} < \bar{X} < \mu + 2\sigma_{\bar{X}}\right) = .95$$

This says that there is a 95% chance that the sample mean lies within two standard deviations of the true mean $\mu$. Remember that $\sigma_{\bar{X}}$ gets smaller as the sample size gets larger. So we should expect the sample mean to be closer to the true mean in larger samples.

[Figure: density of $\bar{X} \sim N(345, 15^2/10)$, centered at 345. There is a 95% chance that $\bar{x}$ falls between $\mu - 2\sigma_{\bar{X}}$ and $\mu + 2\sigma_{\bar{X}}$.]

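Next let’s rearrange some more to get something useful!

Alternatively, if there is a 95% chance that $\bar{x}$ lies within two standard deviations of $\mu$, then there is a 95% chance that $\mu$ lies within two standard deviations of $\bar{x}$.

Mathematically, we rearrange the last inequality to get:

$$\Pr\left(-\bar{X} - 2\sigma_{\bar{X}} < -\mu < -\bar{X} + 2\sigma_{\bar{X}}\right) = .95$$

$$\Pr\left(\bar{X} - 2\sigma_{\bar{X}} < \mu < \bar{X} + 2\sigma_{\bar{X}}\right) = .95$$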

For iid normal data, with known standard deviation $\sigma$, a 95% confidence interval for the true mean $\mu$ is,

$$\bar{X} \pm 2\sigma_{\bar{X}}$$

95% of the time the true value will be contained in the interval.

A picture of the process:

[Figure: number line marking two candidate values $\mu_1$ and $\mu_2$ and the observed $\bar{x}$, with the sampling distribution of $\bar{x}$ drawn for a candidate value.]

Our sample gives us a value for $\bar{x}$. We want to ask what values for $\mu$ are ‘reasonable’.

Consider a possible value $\mu_1$. The red curve is the sampling distribution of $\bar{x}$. Is this “reasonable”?

NO. If that were the right value of $\mu$, it’s extremely unlikely we’d see an $\bar{x}$ like the one we got in the data.

Consider $\mu_2$. If this were the right value of $\mu$, it’s perfectly possible we’d see an $\bar{x}$ like the one we saw in the data.

95% CI for $\mu$: all values we cannot rule out based on the data.

Example:

Remember our weight data? Given 500 observations, what do we know about $\mu$?

Assume $\sigma = 15$.

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{15}{\sqrt{500}} = .67$$

IN EXCEL, use the formula:

=15/sqrt(500)

EXCEL gives us: 0.670820

The sample average was 344.83.

The 95% ci is

(343.49, 346.17)

IN EXCEL, use the formulas:

= 344.83-2*.67

= 344.83+2*.67

EXCEL gives us the values:

343.490

346.170

Example:

Given just the 10 observations, what do we know about $\mu$?

Assume $\sigma = 15$.

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{15}{\sqrt{10}} = 4.74$$

IN EXCEL, use the formula:

= 15/sqrt(10)

The sample average was 348.5.

The 95% ci is

(339.02, 357.98)

IN EXCEL, use the formulas:

= 348.5-2*4.74

= 348.5+2*4.74

EXCEL gives us the values:

339.020

357.980
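The same two calculations can be sketched in Python instead of Excel. The function name `ci_known_sigma` is made up for this illustration, and the default multiplier of 2 follows the slides' rounding of 1.96. Because the code does not round the standard deviation of the mean to two decimals first, the endpoints can differ from the Excel values in the third decimal place.

```python
import math


def ci_known_sigma(xbar, sigma, n, z=2.0):
    """Interval xbar +/- z * sigma / sqrt(n), for iid normal data with known sigma."""
    half_width = z * sigma / math.sqrt(n)
    return xbar - half_width, xbar + half_width


# All 500 boxes: sample average 344.83, sigma assumed to be 15.
lo_500, hi_500 = ci_known_sigma(344.83, 15, 500)

# First 10 boxes only: sample average 348.5, sigma assumed to be 15.
lo_10, hi_10 = ci_known_sigma(348.5, 15, 10)

print(f"n=500: ({lo_500:.2f}, {hi_500:.2f})")
print(f"n=10:  ({lo_10:.2f}, {hi_10:.2f})")
```

Passing `z=1.96` gives the slightly narrower exact 95% interval.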

Confidence intervals answer the basic questions: what do you think the parameter is, and how sure are you?

In particular, a 95% CI means that if we took 100 samples and created 100 different confidence intervals, we would expect 95 of them to contain the true (but unknown) value $\mu$.

small interval: good, you know a lot

big interval: bad, you don’t know much.
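The repeated-sampling interpretation can be checked by simulation. The sketch below is an illustration under assumed settings (2,000 simulated samples of size 10, an arbitrary seed): it builds the interval $\bar{x} \pm 2\sigma/\sqrt{n}$ for each sample and counts how often it contains the true mean.

```python
import math
import random
import statistics

random.seed(2)  # arbitrary seed for reproducibility

MU, SIGMA, N, REPS = 345, 15, 10, 2000
sd_of_mean = SIGMA / math.sqrt(N)  # sigma is treated as known here

covered = 0
for _ in range(REPS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    xbar = statistics.mean(sample)
    # Does this sample's interval contain the true mean?
    if xbar - 2 * sd_of_mean < MU < xbar + 2 * sd_of_mean:
        covered += 1

coverage = covered / REPS
print(f"fraction of intervals containing the true mean: {coverage:.3f}")
```

The fraction should land near 95% (slightly above, since 2 is a little larger than 1.96).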

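Clearly there is nothing special (outside of convention) in using a 95% CI. We could have constructed any confidence interval we like.

For example: A 68% CI is given by

$$\bar{X} \pm \sigma_{\bar{X}}$$

More generally we can compute a $100(1-\alpha)\%$ confidence interval by:

$$\bar{X} \pm z_{\alpha/2}\,\sigma_{\bar{X}}$$

[Figure: standard normal density with probability $1-\alpha$ between $-z_{\alpha/2}$ and $z_{\alpha/2}$, and $\alpha/2$ in each tail.]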

Here are some tabulated values:

$1-\alpha$       .80    .90    .95    .99
$\alpha$         .20    .10    .05    .01
$\alpha/2$       .10    .05    .025   .005
$z_{\alpha/2}$   1.28   1.64   1.96   2.58

The $100(1-\alpha)\%$ C.I. for $\mu$ is then given by

$$\bar{X} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$$
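The tabulated z values can be reproduced with Python's standard library, which may be handy for confidence levels not in the table. This is a small sketch; the helper name `z_value` is made up for illustration.

```python
from statistics import NormalDist


def z_value(confidence):
    """Return z_{alpha/2} for a two-sided interval with the given confidence level."""
    alpha = 1 - confidence
    # z_{alpha/2} leaves probability alpha/2 in the upper tail of N(0,1).
    return NormalDist().inv_cdf(1 - alpha / 2)


for confidence in (0.80, 0.90, 0.95, 0.99):
    print(f"{confidence:.2f}  z = {z_value(confidence):.2f}")
```

For example, `z_value(0.68)` comes out close to 1, which is why the 68% interval uses one standard deviation on each side.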

5.4 Normal data, confidence interval for $\mu$, $\sigma$ unknown

Now we will extend our ci to the more realistic situation where $\sigma$ is unknown.

Typically you don’t know $\sigma$, so we have to estimate it as well.

How do we estimate $\sigma$?

Just as we now think of the sample mean as an estimate of $\mu$, we can think of the sample sd as an estimate of $\sigma$.

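Estimating $\sigma$

$$s_x^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

is our estimate for $\sigma^2$.

Fact:

$$E(s_x^2) = \sigma^2$$

we divide by n-1 so that the estimator is unbiased.

The estimate of $\sigma$ is,

$$s_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$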
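A short Python sketch of the sample standard deviation, written out from the formula and compared with the standard library's `statistics.stdev`, which also divides by n-1. The five weights are made-up numbers for illustration only.

```python
import math
import statistics


def sample_sd(xs):
    """Sample standard deviation with the n-1 divisor."""
    n = len(xs)
    xbar = sum(xs) / n
    sum_sq_dev = sum((x - xbar) ** 2 for x in xs)
    return math.sqrt(sum_sq_dev / (n - 1))


data = [340.0, 352.5, 347.0, 361.0, 338.5]  # hypothetical box weights

print(f"by hand:          {sample_sd(data):.4f}")
print(f"statistics.stdev: {statistics.stdev(data):.4f}")
```

Note that `statistics.pstdev` divides by n instead, so it is not the estimator used here.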

Now our big idea is that in the formula, instead of using

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$

we use an estimate of it:

$$se(\bar{X}) = \frac{s_x}{\sqrt{n}}$$

This is called the standard error. Clearly, it is an estimate of the true standard deviation.

We might think that

$$\frac{\bar{X} - \mu}{se(\bar{X})} \approx N(0,1)$$

(squiggly lines mean “approximately distributed as”)

giving the ci: $\bar{x} \pm 2\,se(\bar{X})$ (just replace $\sigma_{\bar{X}}$ with its estimate).

This is approximately right for large n (n>30).

But it turns out that for iid normal data we can get an exact result.

First we need to learn about the t-distribution.

The t distribution

The t is just another continuous distribution.

It has one parameter called the degrees of freedom, which is usually denoted by the symbol $\nu$.

Each value of $\nu$ gives you a different distribution.

Comparison of Normal and t distributions for different values of $\nu$

[Figure: densities of the t distribution with $\nu = 3$ df, the t with 30 df, and the standard normal. One of these is t with 30 df, the other is standard normal.]

When $\nu$ is bigger than about 30 the t is very much like the standard normal.

For smaller $\nu$, it puts more prob in the “tails”.

For our Normal mean problem we use $\nu = n-1$.

Now, let $t_{n-1,.025}$ be such that

$$P\left(-t_{n-1,.025} < t_{n-1} < t_{n-1,.025}\right) = .95$$

where $t_{n-1}$ is a t rv with n-1 df.

[Figure: t density f(x) with probability .95 between $-t_{n-1,.025}$ and $t_{n-1,.025}$, and .025 in each tail.]

For n-1 greater than about 30, the $t_{n-1}$ is so much like the standard normal that

$$t_{n-1,.025} \approx 2$$

For smaller n, the t value gets bigger than 2.

Here is a table of t values and n.

$t_{n-1,.025}$    n
4.303             3
2.228             11
2.086             21
2.042             31
2.00              61

We can see that for n>30 (or even about 20) the t value is about 2.

In Excel, the first argument is the probability in the tails and the second is the degrees of freedom:

=TINV( 0.05, 10)

There is .025 prob less than -2.228 and .025 prob greater than 2.228 for the t dist with 10 degrees of freedom.
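For readers without Excel, the t values in the table can be approximated with the Python standard library alone. The sketch below is one possible approach, not the method behind TINV: it integrates the t density numerically with Simpson's rule and then finds the cutoff by bisection. The function names and the search range of 0 to 50 are assumptions made for this illustration.

```python
import math


def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x, df, steps=2000):
    """P(T <= x) for x >= 0: 0.5 plus the Simpson's-rule integral of the density on [0, x]."""
    if x == 0:
        return 0.5
    h = x / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    return 0.5 + total * h / 3


def t_critical(df, tail=0.025):
    """Value t such that P(T > t) = tail, found by bisection on [0, 50]."""
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < 1 - tail:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


for n in (3, 11, 21, 31, 61):
    print(f"n={n:2d}  t value = {t_critical(n - 1):.3f}")
```

If SciPy is available, `scipy.stats.t.ppf(0.975, df)` should return the same cutoffs with far less code.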

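Our basic result is,

$$\frac{\bar{X} - \mu}{se(\bar{X})} \sim t_{n-1}$$

For small n, the t distribution accounts for our estimation of $\sigma$ with $s_x$.

Thus,

$$\Pr\left(-t_{n-1,.025} < \frac{\bar{X} - \mu}{se(\bar{X})} < t_{n-1,.025}\right) = .95$$

Just as before, we can rearrange this to obtain the interval:

$$\bar{x} \pm t_{n-1,.025}\, se(\bar{X})$$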

An exact 95% confidence interval for $\mu$ with $\sigma$ unknown is

$$\bar{x} \pm t_{n-1,.025}\, se(\bar{X})$$

Using the t value instead of the z value will make the interval bigger for smaller n.

This reflects the fact that we are not sure that our estimate for $\sigma$ is quite right.

Example

Back to our weight data.

With n=500 the sample sd is 15.455, and the sample mean is 344.83.

The t dist with $\nu = 499$ is just like the standard normal, so the t-value is about 2.

$$se(\bar{X}) = \frac{15.455}{\sqrt{500}} = .69$$

ci: 344.83 +/- 1.4

IN EXCEL, use the formulas:

= 344.83-1.4

= 344.83+1.4

EXCEL gives us the values:

343.430

346.230

T Confidence Intervals

Variable    N     Mean      StDev    SE Mean    95.0 % CI
weights     500   344.828   15.455   0.691      (343.470, 346.186)

[Figure: histogram of weights (with 95% t-confidence interval for the mean).]

$$se(\bar{X}) = \frac{s_x}{\sqrt{n}}$$

For the first 10 observations, the sample sd = 14.6, and the sample mean was 348.5.

The $t_{9,.025}$ value is 2.262.

$$se(\bar{X}) = \frac{14.6}{\sqrt{10}} = 4.6$$

ci: 348.5 +/- 2.262*4.6 = 348.5 +/- 10.4

= (338.1, 358.9)

[Figure: histogram of weights10 (with 95% t-confidence interval for the mean).]

Variable    N    Mean     StDev   SE Mean   95.0 % CI
weights1    10   348.51   14.60   4.62      (338.07, 358.96)
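The 10-observation interval can be reproduced from the summary numbers in a few lines of Python. In this sketch the t value 2.262 is taken from the text rather than computed, and the function name `t_interval` is made up for illustration. The inputs are the rounded summaries 348.5 and 14.6, so the endpoints differ slightly from the software output, which uses unrounded data.

```python
import math


def t_interval(xbar, s, n, t_value):
    """Interval xbar +/- t_value * s / sqrt(n), with the t value supplied by the caller."""
    se = s / math.sqrt(n)  # standard error of the sample mean
    return xbar - t_value * se, xbar + t_value * se


# First 10 boxes: sample mean 348.5, sample sd 14.6, t value for 9 df from the text.
lo, hi = t_interval(348.5, 14.6, 10, 2.262)
print(f"95% t interval: ({lo:.2f}, {hi:.2f})")
```

Swapping in a t value of 2 shows how much narrower the interval would look if the extra uncertainty from estimating $\sigma$ were ignored.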

IN EXCEL, use the pull-down menu:

StatPro > Statistical Inference > One-sample analysis …

Example: Let’s get a 95% ci for the true mean of Canadian returns.

[Figure: histogram of canada (with 95% t-confidence interval for the mean).]

Is the confidence interval big?

$$se(\bar{X}) = \frac{s_x}{\sqrt{n}}, \qquad \bar{x} \pm t_{n-1,.025}\, se(\bar{X})$$

Results for one-sample analysis for canada

Summary measures
  Sample size                  107
  Sample mean                  0.009
  Sample standard deviation    0.038

Confidence interval for mean
  Confidence level             95.0%
  Sample mean                  0.009
  Std error of mean            0.004
  Degrees of freedom           106
  Lower limit                  0.002
  Upper limit                  0.016

Example: 95% CI for true mean of NYSE stock index over same period.

[Figure: histogram of nyse (with 95% t-confidence interval for the mean).]

Variable    N     Mean      StDev     SE Mean   95.0 % CI
nyse        107   0.01330   0.03686   0.00356   (0.00624, 0.02036)

Of course, just as for the case of the Normal, we can find any confidence interval that we would like.

The $100(1-\alpha)\%$ C.I. for $\mu$ is then given by

$$\bar{X} \pm t_{(\alpha/2,\, n-1)}\,\frac{s}{\sqrt{n}}$$

where $t_{(\alpha/2,\, n-1)}$ is defined similarly to $z_{\alpha/2}$ for the N(0,1).

5.5 Bernoulli data, confidence interval for p

Now we consider confidence intervals for p given iid Bernoulli observations.

[Figure: plot of the 50 observations (0 or 1) against index.]

Suppose we had this data where 1 means a default and 0 means no default.

What do you think the true default rate is and how sure are you?

• Our data consist of Bernoulli outcomes where a mortgage either defaults (1) or does not (0).

• Our best estimate of p will be the sample fraction of defaults. That is:

$$\hat{p} = \frac{\sum_{i=1}^{n} x_i}{n}$$

• For our data it is 12/50.

• We play the same game as before: before we take our sample we ask what can happen?

• This time the outcomes are realizations of iid Bernoulli(p).

• The sum of iid Bernoulli’s is a Binomial distribution so the numerator is the outcome of a Binomial(n,p) where n is the sample size and p is the parameter we want to know.

For iid Bernoulli data, the estimate of p is

$$\hat{p} = \frac{\text{observed number of successes in the n trials}}{\text{number of trials}} = \frac{Y}{n}, \qquad Y \sim B(n,p)$$

Before we get a sample of size n, what kind of estimate can we expect to get?

$$E(\hat{p}) = E\left(\frac{Y}{n}\right) = \frac{1}{n}E(Y) = \frac{1}{n}\,np = p \quad \text{(unbiased)}$$

$$\mathrm{Var}(\hat{p}) = \frac{1}{n^2}\,np(1-p) = \frac{p(1-p)}{n}$$

Two things:
1) The variance of $\hat{p}$ is again decreasing in the sample size n.
2) The variance of $\hat{p}$ depends on the value of p.

Unlike the normal case, only approximate results are available.

Since our estimate is a combination of independent Bernoullis, the central limit theorem tells us that it should be approximately normal:

$$\hat{p} \approx N\left(p, \frac{p(1-p)}{n}\right)$$

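We make a final approximation.

$$se(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

so,

$$\frac{\hat{p} - p}{se(\hat{p})} \approx N(0,1)$$

The 95% interval for the true proportion p is,

$$\hat{p} \pm 2\,se(\hat{p})$$

In our example our interval would be:

IN EXCEL, use the formulas:

= .24 - 2*sqrt(.24*(1-.24)/50)

= .24 + 2*sqrt(.24*(1-.24)/50)

EXCEL gives us the values:

0.119203

0.360797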
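The default-rate interval can also be sketched in Python. The function name `proportion_ci` is made up for this illustration, and the multiplier 2 follows the slides' rounding of 1.96.

```python
import math


def proportion_ci(successes, n, z=2.0):
    """Approximate interval p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se


# 12 defaults out of 50 mortgages, as in the example.
lo, hi = proportion_ci(12, 50)
print(f"95% interval for p: ({lo:.6f}, {hi:.6f})")
```

The interval is wide because 50 observations carry limited information about p; rerunning with the same default rate and a larger n shows how quickly it narrows.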

Example

Remember the discrimination case?

We used .07 for p.

Not counting the firm being sued, we had 1128 partners, 77 of whom were female.

$$\hat{p} = \frac{77}{1128} = .07$$

The confidence interval is

$$.07 \pm 2\sqrt{\frac{.07(1-.07)}{1128}} = .07 \pm .0152$$

This interval tells us where we think p is:

(.0548, .0852)

Suppose we had only 100 partners, 7% of whom are female.

The interval would be

$$.07 \pm 2\sqrt{\frac{.07(1-.07)}{100}} = .07 \pm .05$$

(.02, .12)

This interval is much bigger, telling us that with only 100 observations, our estimate could be a lot farther from the truth.

100 observations have less information than 1128.

National Poll of likely voters (CNN)

Trump tops Clinton 58% to 56% in unfavorable poll by CNN/ORC

The CNN/ORC Poll was conducted by telephone October 20‐23 among a random national sample of 1,017 adults, including 779 who were determined to be likely voters. The margin of sampling error for results among the sample of likely voters is plus or minus 3.5 percentage points.

What is the “margin of error”?

• It is actually the half-width of a 95% Confidence Interval.

$$se(\hat{p}) = \sqrt{\frac{.58(1-.58)}{779}} = .0177$$

$$2 \cdot se(\hat{p}) = 2(.0177) = .0354 \approx 3.5\%$$

Why is the Margin of Error often 3%?

• Generally the sample size is a little over 1000.

• The numerator of the standard error depends on p, so why are the errors not dependent on the value of the estimate of p?

• The media uses the largest interval, which is obtained when $\hat{p} = .5$:

$$2 \cdot se(\hat{p}) = 2\sqrt{\frac{.5(1-.5)}{1000}} = .0316 \approx 3\%$$

• So if the estimate for p is different from .5, the confidence interval (margin of error) will be smaller than 3%.

• This means that we are at least 95% sure that the true value of p lies within plus or minus three percentage points of the estimate.
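Both margin-of-error calculations can be checked with a small Python sketch. The function name `margin_of_error` is made up for this illustration; it returns twice the standard error of the estimated proportion.

```python
import math


def margin_of_error(p_hat, n, z=2.0):
    """Half-width of the approximate 95% interval for a proportion."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)


# CNN/ORC example: 58% among 779 likely voters.
print(f"p_hat=.58, n=779:  {margin_of_error(0.58, 779):.4f}")

# Worst case p_hat = .5 with a sample of 1000.
print(f"p_hat=.50, n=1000: {margin_of_error(0.5, 1000):.4f}")

# The margin is largest at p_hat = .5 for a fixed sample size.
for p_hat in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"p_hat={p_hat:.1f}, n=1000: {margin_of_error(p_hat, 1000):.4f}")
```

The last loop shows why quoting the value at .5 is conservative: every other estimate gives a smaller margin.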

5.6 The Central Limit Theorem and a General Approximate Confidence Interval for $\mu$

Suppose we are willing to assume that our data are iid but not willing to assume that they are normally distributed, and they are not Bernoulli.

We might still want to estimate

$$\mu = E(X_i)$$

It turns out that the approach we used for normal data is approximately correct (with large sample sizes):

$$\bar{X} \approx N\left(\mu, \frac{\sigma^2}{n}\right) \approx N\left(\mu, se(\bar{X})^2\right)$$

(first squiggle is the clt, second squiggle we just hope our estimate of $\sigma$ is good)

so,

$$\frac{\bar{X} - \mu}{se(\bar{X})} \approx N(0,1)$$

Given iid observations $X_i$, an approximate 95% confidence interval for $\mu = E(X_i)$ is given by,

$$\bar{x} \pm 2\,se(\bar{X})$$

This is an extremely powerful result. It says that even if we don’t know that the distribution of the population we are sampling from is Normal, the sample average will still be approximately Normally distributed in large samples!
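A simulation can illustrate this claim for clearly non-Normal data. The sketch below is an illustration under assumed settings (exponential data with mean 2, samples of size 100, 2,000 repetitions, arbitrary seed): it builds $\bar{x} \pm 2\,se(\bar{X})$ for each sample and records how often the interval contains the true mean.

```python
import math
import random
import statistics

random.seed(3)  # arbitrary seed for reproducibility

TRUE_MEAN, N, REPS = 2.0, 100, 2000

covered = 0
for _ in range(REPS):
    # Exponential draws are strongly skewed, so the data are far from Normal.
    sample = [random.expovariate(1 / TRUE_MEAN) for _ in range(N)]
    xbar = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(N)  # standard error of the mean
    if xbar - 2 * se < TRUE_MEAN < xbar + 2 * se:
        covered += 1

coverage = covered / REPS
print(f"fraction of intervals containing the true mean: {coverage:.3f}")
```

With n = 100 the coverage should be close to 95%; with much smaller samples from a skewed distribution the approximation is noticeably worse.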
