Inference for Categorical Variables 2/29/12

Inference for Categorical Variables 2/29/12 • Single Proportion, p • Distribution • Intervals and tests • Difference in proportions, p1 – p2 • One proportion or two? • Distribution • Intervals and tests • Sections 6.1, 6.2, 6.3, 6.7, 6.8, 6.9 • Professor Kari Lock Morgan, Duke University


Page 1: Inference for Categorical Variables 2/29/12

Inference for Categorical Variables
2/29/12

• Single Proportion, p
  • Distribution
  • Intervals and tests
• Difference in proportions, p1 – p2
  • One proportion or two?
  • Distribution
  • Intervals and tests

Sections 6.1, 6.2, 6.3, 6.7, 6.8, 6.9

Professor Kari Lock Morgan
Duke University

Page 2: Inference for Categorical Variables 2/29/12

To Do

• Homework 5 (due Monday, 3/12)

• Project 1 (due Thursday, 3/22) (NOTE: DUE DATE HAS CHANGED)

Page 3: Inference for Categorical Variables 2/29/12

Central Limit Theorem!

For a sufficiently large sample size, the distribution of sample statistics for a mean or a proportion is approximately normal

Page 4: Inference for Categorical Variables 2/29/12

Interval Using N(0,1)

IF SAMPLE SIZES ARE LARGE…

A confidence interval can be calculated by

$$\text{sample statistic} \pm z^* \times SE$$

where z* is a N(0,1) percentile depending on the level of confidence.

Page 5: Inference for Categorical Variables 2/29/12

SE of a Proportion

The standard error for a sample proportion can be calculated by

$$SE = \sqrt{\frac{p(1-p)}{n}}$$

*Notice the sample size in the denominator… as the sample size increases, the standard error decreases

Page 6: Inference for Categorical Variables 2/29/12

Paul the Octopus

If he is truly guessing randomly, then p = 0.5, so the SE of his sample proportion correct out of 8 guesses is

$$SE = \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{0.5(1-0.5)}{8}} = 0.177$$
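The calculation above can be checked with a few lines of Python. This is only a sketch: the function name `se_proportion` and the sample sizes in the loop are illustrative choices, not part of the lecture.

```python
import math

def se_proportion(p, n):
    """Standard error of a sample proportion: sqrt(p(1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)

# Paul the Octopus: 8 guesses, p = 0.5 if he is guessing randomly.
se_paul = se_proportion(0.5, 8)
print("SE for n = 8:", round(se_paul, 3))

# Illustrative sample sizes (assumed) to show the SE shrinking as n grows.
for n in (8, 80, 800):
    print(n, round(se_proportion(0.5, n), 4))
```

Because n sits in the denominator under the square root, making the sample 100 times larger cuts the standard error by a factor of 10.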

Page 7: Inference for Categorical Variables 2/29/12

Paul the Octopus

This is the same value we get from a randomization distribution…

www.lock5stat.com/statkey

Page 8: Inference for Categorical Variables 2/29/12

Paul the Octopus

If Paul really does have psychic powers, and can guess the correct team every time, then p = 1, and

$$SE = \sqrt{\frac{1(1-1)}{8}} = 0$$

Page 9: Inference for Categorical Variables 2/29/12

Distribution of p̂

[Figure: grid of plots of the distribution of p̂ for p = 0.1, 0.5, and 0.7 at sample sizes n = 1, 10, 30, 50, and 100; each horizontal axis runs from 0.0 to 1.0]

Page 10: Inference for Categorical Variables 2/29/12

CLT for a Proportion

If counts for each category are at least 10 (np ≥ 10 and n(1 – p) ≥ 10), then

$$\hat{p} \sim N\left(p,\ \sqrt{\frac{p(1-p)}{n}}\right)$$
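One way to see this result in action is to simulate many sample proportions and compare their spread with the formula. The sketch below is not from the lecture: the helper names, the choice n = 50 and p = 0.5, the number of repetitions, and the seed are all assumptions made for illustration.

```python
import math
import random
import statistics

def clt_conditions_met(n, p):
    """Check the rule of thumb: np >= 10 and n(1 - p) >= 10."""
    return n * p >= 10 and n * (1 - p) >= 10

def simulate_p_hats(n, p, reps=5000, seed=0):
    """Draw `reps` samples of size n and return each sample proportion."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [sum(rng.random() < p for _ in range(n)) / n for _ in range(reps)]

n, p = 50, 0.5  # illustrative values
p_hats = simulate_p_hats(n, p)

print("Conditions met:", clt_conditions_met(n, p))
print("Simulated mean of p-hat:", round(statistics.mean(p_hats), 3))
print("Simulated SD of p-hat:  ", round(statistics.stdev(p_hats), 3))
print("Formula SE:             ", round(math.sqrt(p * (1 - p) / n), 3))
```

When the conditions hold, the simulated mean should land near p and the simulated standard deviation near the formula value.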

Page 11: Inference for Categorical Variables 2/29/12

Standard Error

$$SE = \sqrt{\frac{p(1-p)}{n}}$$

• One small problem… if we are doing inference for p, we don’t know p!

• For confidence intervals, use your best guess for p: the sample proportion, p̂

Page 12: Inference for Categorical Variables 2/29/12

Confidence Interval for a Single Proportion

$$\text{sample statistic} \pm z^* \times SE$$

$$\hat{p} \pm z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

Page 13: Inference for Categorical Variables 2/29/12

Duke vs UNC Men’s Basketball

In head-to-head matchups between Duke and UNC in men’s basketball, UNC has won 131 and Duke has won 102.

What is Duke’s probability of winning on Saturday?

$$p = \text{???}$$

$$\hat{p} = \frac{102}{102 + 131} = \frac{102}{233} = 0.438$$

What do we have to assume? Is this reasonable?

Find a 95% CI.

Page 14: Inference for Categorical Variables 2/29/12

Duke vs UNC Men’s Basketball

Counts are greater than 10 in each category

For a 95% confidence interval, z* = 2

$$\hat{p} \pm z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

$$0.438 \pm 2\sqrt{\frac{0.438(1-0.438)}{233}}$$

$$0.438 \pm 0.065$$

$$(0.373,\ 0.503)$$
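The interval above can be reproduced in code. This is a minimal sketch, assuming a hypothetical helper named `proportion_ci` and the rule-of-thumb z* = 2 used on the slide.

```python
import math

def proportion_ci(successes, n, z_star=2.0):
    """Normal-based confidence interval for a single proportion."""
    p_hat = successes / n
    margin = z_star * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Duke has won 102 of the 233 games against UNC.
low, high = proportion_ci(102, 233)
print(round(low, 3), round(high, 3))
```

Passing `z_star=1.96` instead gives the slightly narrower interval based on the exact 95% percentile.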

Page 15: Inference for Categorical Variables 2/29/12

Duke vs UNC Men’s Basketball

(0.373, 0.503)

Page 16: Inference for Categorical Variables 2/29/12

Other Levels of Confidence

http://davidmlane.com/hyperstat/z_table.html

90% confidence: z* = 1.645

99% confidence: z* = 2.576

Technically, for 95% confidence, z* = 1.96, but 2 is much easier to remember and close enough.

Page 17: Inference for Categorical Variables 2/29/12

z* on TI-83

[Figure: N(0,1) curve with P% of the area between -z* and z*]

2nd → DISTR → 3: invNorm(

Enter the proportion below z*

(for a 95% CI, the proportion below z* is 0.975)
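For readers without a TI-83, the same lookup can be done with Python's standard library. This is a sketch rather than part of the lecture; `z_star` is a hypothetical helper name, and `NormalDist().inv_cdf` plays the role of invNorm.

```python
from statistics import NormalDist

def z_star(confidence):
    """z* such that the middle `confidence` of N(0,1) lies between -z* and z*."""
    proportion_below = 1 - (1 - confidence) / 2  # e.g. 0.975 for 95%
    return NormalDist().inv_cdf(proportion_below)

for level in (0.90, 0.95, 0.99):
    print(level, round(z_star(level), 3))
```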

Page 18: Inference for Categorical Variables 2/29/12

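Margin of Error

$$\hat{p} \pm z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

For a single proportion, what is the margin of error?

a) $\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$

b) $z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$

c) $2 z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$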

Page 19: Inference for Categorical Variables 2/29/12

Margin of Error

$$ME = z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

You can choose your sample size in advance, depending on your desired margin of error!

Given this formula for margin of error, solve for n.

$$n = \left(\frac{z^*}{ME}\right)^2 \hat{p}(1-\hat{p})$$

Page 20: Inference for Categorical Variables 2/29/12

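Margin of Error

$$n = \left(\frac{z^*}{ME}\right)^2 \hat{p}(1-\hat{p})$$

Neither p nor p̂ is known in advance. To be conservative, use p = 0.5.

For a 95% confidence interval, z* ≈ 2, so

$$n \approx \left(\frac{2}{ME}\right)^2 (0.5)(1-0.5) = \frac{1}{ME^2}$$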
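The sample-size formula is easy to wrap in a small function. The sketch below assumes a hypothetical helper `sample_size` and an illustrative margin of error of 0.05 that does not come from the lecture; it rounds up because a sample size must be a whole number.

```python
import math

def sample_size(margin_of_error, z_star=2.0, p=0.5):
    """Smallest n with z* * sqrt(p(1 - p) / n) <= margin_of_error.

    p = 0.5 is the conservative choice when nothing is known in advance.
    """
    n = (z_star / margin_of_error) ** 2 * p * (1 - p)
    return math.ceil(n)

# Illustrative margin of error (assumed): 0.05.
print(sample_size(0.05, z_star=1.96))  # exact 95% percentile
print(sample_size(0.05))               # rule of thumb, z* = 2
```

Using z* = 2 instead of 1.96 gives a slightly larger, and therefore slightly more cautious, sample size.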

Page 21: Inference for Categorical Variables 2/29/12

Margin of Error

Suppose we want to estimate a proportion with a margin of error of 0.03 with 95% confidence.

How large a sample size do we need?

(a) About 100
(b) About 500
(c) About 1000
(d) About 5000

$$n \approx \frac{1}{ME^2}$$

Page 22: Inference for Categorical Variables 2/29/12

Tests Using N(0,1)

IF SAMPLE SIZES ARE LARGE…

A p-value is the area in the tail(s) of a N(0,1) beyond

$$z = \frac{\text{sample statistic} - \text{null value}}{SE}$$

Page 23: Inference for Categorical Variables 2/29/12

z-statistic

$$z = \frac{\text{sample statistic} - \text{null value}}{SE}$$

If z = –3, using α = 0.05, we would

(a) Reject the null
(b) Not reject the null
(c) Impossible to tell
(d) I have no idea

About 95% of z-statistics are between -2 and +2, so anything beyond those values will be in the most extreme 5%, or equivalently will give a p-value less than 0.05.

Page 24: Inference for Categorical Variables 2/29/12

Hypothesis Testing

For hypothesis testing, we want the distribution of the sample proportion assuming the null hypothesis is true

$$H_0: p = p_0$$

$$SE = \sqrt{\frac{p(1-p)}{n}}$$

What to use for p?

Page 25: Inference for Categorical Variables 2/29/12

Hypothesis Testing

$$z = \frac{\text{sample statistic} - \text{null value}}{SE}$$

$$z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}}$$

The p-value is the area in the tail(s) beyond z in a N(0,1)

Page 26: Inference for Categorical Variables 2/29/12

Duke vs UNC Men’s Basketball

$$\hat{p} = \frac{102}{102 + 131} = \frac{102}{233} = 0.438$$

Using α = 0.05, is this evidence that one team is/was better than the other (combining past and present)?

(a) Yes
(b) No
(c) No idea

Page 27: Inference for Categorical Variables 2/29/12

Duke vs UNC Men’s Basketball

Counts are greater than 10 in each category

$$H_0: p = 0.5 \qquad H_a: p \neq 0.5$$

$$z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}} = \frac{0.438 - 0.5}{\sqrt{\frac{0.5(1-0.5)}{233}}} = -1.89$$

p-value = 0.059

Based on this data, we cannot conclude that either Duke or UNC is significantly better.
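The test above can be sketched in Python as follows. The function name `one_proportion_z_test` is hypothetical, and the code works from the raw counts (102 wins out of 233) rather than the rounded 0.438, so its output may differ from the slide in the last digit or two.

```python
import math
from statistics import NormalDist

def one_proportion_z_test(successes, n, p0):
    """Two-tailed z-test of H0: p = p0 against Ha: p != p0."""
    p_hat = successes / n
    se = math.sqrt(p0 * (1 - p0) / n)        # SE uses the null value p0
    z = (p_hat - p0) / se
    p_value = 2 * NormalDist().cdf(-abs(z))  # area in both tails beyond z
    return z, p_value

z, p_value = one_proportion_z_test(102, 233, 0.5)
print("z =", round(z, 2))
print("p-value =", round(p_value, 3))
```

Note that the standard error is computed from p0, not p̂, because the test asks what the sample proportion would look like if the null hypothesis were true.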

Page 28: Inference for Categorical Variables 2/29/12

Duke vs UNC Men’s Basketball

p-value = 0.058

Page 29: Inference for Categorical Variables 2/29/12

p-value on TI-83

[Figure: normal curve with an area labeled P%]

2nd → DISTR → 2: normalcdf(

Enter lower bound, upper bound

Hint: if you want the area greater than 2, just enter 2, 100 (or some other large number)
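A rough Python counterpart to the calculator steps is sketched below. The function is named `normalcdf` only to mirror the TI-83 command; it is a hypothetical helper built on the standard library's `NormalDist`, not a calculator API.

```python
from statistics import NormalDist

def normalcdf(lower, upper, mu=0.0, sigma=1.0):
    """Area under a normal curve between `lower` and `upper`."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(upper) - dist.cdf(lower)

# Area above 2, using a large number as the upper bound (as in the hint).
print(round(normalcdf(2, 100), 4))

# Area between -2 and 2.
print(round(normalcdf(-2, 2), 4))
```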

Page 30: Inference for Categorical Variables 2/29/12

One Proportion or Two?

• Two proportions: there are two separate categorical variables

• One proportion: there is only one categorical variable

Page 31: Inference for Categorical Variables 2/29/12

One Proportion or Two?

Of residents in the triangle area, on Saturday night will the proportion of people cheering for Duke or UNC be greater? How much greater?

a) Inference for one proportion
b) Inference for two proportions

(Note: assume no one will be cheering for both)

Page 32: Inference for Categorical Variables 2/29/12

One Proportion or Two?

Who is more likely to be wearing a blue shirt on Saturday night, a UNC fan or a Duke fan?

a) Inference for one proportion
b) Inference for two proportions

Page 33: Inference for Categorical Variables 2/29/12

Standard Error for p̂1 – p̂2

$$SE = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$$

Page 34: Inference for Categorical Variables 2/29/12

CLT for p̂1 – p̂2

Parameter: p1 – p2

Statistic: p̂1 – p̂2, based on sample sizes n1 and n2

If counts within each category (each cell of the two-way table) are at least 10, then

$$\hat{p}_1 - \hat{p}_2 \sim N\left(p_1 - p_2,\ \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}\right)$$

Page 35: Inference for Categorical Variables 2/29/12

Metal Tags and Penguins

Are metal tags detrimental to penguins? A study looked at the 10-year survival rate of penguins tagged with either a metal tag or an electronic tag. 20% of the 167 metal-tagged penguins survived, compared to 36% of the 189 electronic-tagged penguins.

Give a 90% confidence interval for the difference in proportions.

Source: Saraux et al. (2011). “Reliability of flipper-banded penguins as indicators of climate change,” Nature, 469, 203-206.

Page 36: Inference for Categorical Variables 2/29/12

Metal Tags and Penguins

33 ≥ 10, 134 ≥ 10, 68 ≥ 10, 121 ≥ 10

$$(\hat{p}_1 - \hat{p}_2) \pm z^* \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$

$$(0.2 - 0.36) \pm 1.645\sqrt{\frac{0.2(1-0.2)}{167} + \frac{0.36(1-0.36)}{189}}$$

$$-0.16 \pm 1.645 \times 0.047$$

$$(-0.237,\ -0.083)$$

We are 90% confident that the survival rate is between 0.083 and 0.237 lower for metal tagged penguins than for electronically tagged penguins.
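The same interval can be computed in code. This is a minimal sketch, assuming a hypothetical helper `diff_proportions_ci`; it takes the sample proportions as reported (20% and 36%) together with the two sample sizes.

```python
import math

def diff_proportions_ci(p1_hat, n1, p2_hat, n2, z_star):
    """Normal-based confidence interval for p1 - p2."""
    se = math.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    diff = p1_hat - p2_hat
    return diff - z_star * se, diff + z_star * se

# Group 1: metal tags (20% of 167). Group 2: electronic tags (36% of 189).
low, high = diff_proportions_ci(0.20, 167, 0.36, 189, z_star=1.645)
print(round(low, 3), round(high, 3))
```

Both endpoints are negative, which is what supports reading the interval as a lower survival rate for the metal-tagged group.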

Page 37: Inference for Categorical Variables 2/29/12

Metal Tags and Penguins

www.lock5stat.com/statkey

Page 38: Inference for Categorical Variables 2/29/12

Hypothesis Testing

$$H_0: p_1 = p_2$$

$$SE = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$$

What should we use for p1 and p2 in the formula for SE for hypothesis testing?

Page 39: Inference for Categorical Variables 2/29/12

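Pooled Proportion

$$\hat{p} = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2}$$

Overall sample proportion across both groups. It will be in between the two observed sample proportions.

$$SE = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}} \approx \sqrt{\frac{\hat{p}(1-\hat{p})}{n_1} + \frac{\hat{p}(1-\hat{p})}{n_2}} = \sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$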

Page 40: Inference for Categorical Variables 2/29/12

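Hypothesis Testing

$$z = \frac{\text{sample statistic} - \text{null value}}{SE}$$

$$z = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$

The p-value is the area in the tail(s) beyond z in a N(0,1)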

Page 41: Inference for Categorical Variables 2/29/12

Metal Tags and Penguins

20% of the 167 metal tagged penguins survived, compared to 36% of the 189 electronic tagged penguins.

Are metal tags detrimental to penguins?

(a) Yes
(b) No
(c) Cannot tell from this data

Page 42: Inference for Categorical Variables 2/29/12

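Metal Tags and Penguins

Are metal tags detrimental to penguins?

$$\hat{p} = \frac{33 + 68}{167 + 189} = 0.284$$

$$z = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} = \frac{(0.2 - 0.36) - 0}{\sqrt{0.284(1-0.284)\left(\frac{1}{167} + \frac{1}{189}\right)}} = \frac{-0.16}{0.048} = -3.33$$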

Page 43: Inference for Categorical Variables 2/29/12

Metal Tags and Penguins

z = –3.33

http://davidmlane.com/hyperstat/z_table.html

p-value = 0.0004

This is very strong evidence that metal tags are detrimental to penguins.
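The pooled-proportion test can be sketched in Python as well. The function name `two_proportion_z_test` is hypothetical, and the code starts from the raw survivor counts (33 of 167 and 68 of 189) instead of the rounded percentages, so its z-statistic and p-value will differ slightly from the rounded values on the slides.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(x1, n1, x2, n2):
    """z-test of H0: p1 = p2 using the pooled proportion.

    Returns the pooled proportion, the z-statistic, and the lower-tail
    p-value (for the alternative p1 < p2).
    """
    p1_hat, p2_hat = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat - 0) / se
    p_lower = NormalDist().cdf(z)  # area in the lower tail beyond z
    return pooled, z, p_lower

# Group 1: metal tags. Group 2: electronic tags.
pooled, z, p_value = two_proportion_z_test(33, 167, 68, 189)
print("pooled p-hat =", round(pooled, 3))
print("z =", round(z, 2))
print("p-value =", round(p_value, 4))
```

The lower tail is used here because the question is whether metal tags are detrimental, that is, whether survival is lower in the metal-tagged group.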

Page 44: Inference for Categorical Variables 2/29/12

Metal Tags and Penguins

www.lock5stat.com/statkey

p-value = 0.0004

Page 45: Inference for Categorical Variables 2/29/12

Accuracy

• The accuracy of intervals and p-values generated using simulation methods (bootstrapping and randomization) depends on the number of simulations (more simulations = more accurate)

• The accuracy of intervals and p-values generated using formulas and the normal distribution depends on the sample size (larger sample size = more accurate)

• If the distribution of the statistic is truly normal and you have generated many simulated randomizations, the p-values should be very close

Page 46: Inference for Categorical Variables 2/29/12

Summary

• For a single proportion:

$$SE = \sqrt{\frac{p(1-p)}{n}}$$

• For a difference in proportions:

$$SE = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$$