© 2010 Pearson Prentice Hall. All rights reserved
Chapter 12: Inference on Categorical Data
Section 12.1: Goodness-of-Fit Test
Objective
• Perform a goodness-of-fit test
Characteristics of the Chi-Square Distribution
1. It is not symmetric.
2. The shape of the chi-square distribution depends on the degrees of freedom, just like Student’s t-distribution.
3. As the number of degrees of freedom increases, the chi-square distribution becomes more nearly symmetric.
4. The values of $\chi^2$ are nonnegative, i.e., the values of $\chi^2$ are greater than or equal to 0.
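Characteristic 3 can be checked numerically. The following is a minimal sketch that assumes the standard result (not stated on the slides) that a chi-square distribution with k degrees of freedom has skewness sqrt(8/k); the function name is illustrative only.

```python
import math

def chi_square_skewness(df):
    """Skewness of a chi-square distribution with df degrees of freedom."""
    return math.sqrt(8.0 / df)

# The skewness shrinks toward 0 (perfect symmetry) as df grows.
for df in (1, 5, 10, 50, 100):
    print(df, round(chi_square_skewness(df), 3))
```

The skewness is always positive, which matches characteristic 1, and it never reaches 0 for any finite number of degrees of freedom.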
A goodness-of-fit test is an inferential procedure used to determine whether a frequency distribution follows a specific distribution.
Expected Counts
Suppose that there are n independent trials of an experiment with k ≥ 3 mutually exclusive possible outcomes. Let p1 represent the probability of observing the first outcome and E1 represent the expected count of the first outcome; p2 represent the probability of observing the second outcome and E2 represent the expected count of the second outcome; and so on. The expected counts for each possible outcome are given by
Ei = μi = npi for i = 1, 2, …, k
Parallel Example 1: Finding Expected Counts
A sociologist wishes to determine whether the distribution for the number of years care-giving grandparents are responsible for their grandchildren is different today than it was in 2000. According to the United States Census Bureau, in 2000, 22.8% of grandparents had been responsible for their grandchildren less than 1 year; 23.9% had been responsible for 1 or 2 years; 17.6% had been responsible for 3 or 4 years; and 35.7% had been responsible for 5 or more years. If the sociologist randomly selects 1,000 care-giving grandparents, compute the expected number within each category assuming the distribution has not changed from 2000.
Solution
Step 1: The probabilities are the relative frequencies from the 2000 distribution:
p(<1 yr) = 0.228     p(1-2 yr) = 0.239
p(3-4 yr) = 0.176    p(≥5 yr) = 0.357
Step 2: There are n = 1,000 trials of the experiment, so the expected counts are:
E(<1 yr) = n·p(<1 yr) = 1000(0.228) = 228
E(1-2 yr) = n·p(1-2 yr) = 1000(0.239) = 239
E(3-4 yr) = n·p(3-4 yr) = 1000(0.176) = 176
E(≥5 yr) = n·p(≥5 yr) = 1000(0.357) = 357
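The expected-count calculation in Example 1 can be sketched in a few lines of Python. The function name and the dictionary keys are illustrative choices, not part of the slides.

```python
def expected_counts(n, probabilities):
    """Return E_i = n * p_i for each category under the null hypothesis."""
    return {category: n * p for category, p in probabilities.items()}

# Relative frequencies from the 2000 distribution (Parallel Example 1).
p_2000 = {"<1": 0.228, "1-2": 0.239, "3-4": 0.176, ">=5": 0.357}

expected = expected_counts(1000, p_2000)
for category, count in expected.items():
    print(f"{category:>4} yr: {count:.0f}")
```

Because the probabilities sum to 1, the expected counts sum to n, which is a quick check on the arithmetic.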
Test Statistic for Goodness-of-Fit Tests
Let Oi represent the observed counts of category i, Ei represent the expected counts of category i, k represent the number of categories, and n represent the number of independent trials of an experiment. Then the formula

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$

approximately follows the chi-square distribution with k-1 degrees of freedom, provided that
• all expected frequencies are greater than or equal to 1 (all Ei ≥ 1), and
• no more than 20% of the expected frequencies are less than 5.
CAUTION!
Goodness-of-fit tests are used to test hypotheses regarding the distribution of a variable based on a single population. If you wish to compare two or more populations, you must use the tests for homogeneity presented in Section 12.2.
The Goodness-of-Fit Test
To test the hypotheses regarding a distribution, we use the steps that follow.
Step 1: Determine the null and alternative hypotheses.
H0: The random variable follows a certain distribution
H1: The random variable does not follow a certain distribution
Step 2: Decide on a level of significance, α, depending on the seriousness of making a Type I error.
Step 3:
a) Calculate the expected counts for each of the k categories. The expected counts are Ei = npi for i = 1, 2, …, k, where n is the number of trials and pi is the probability of the ith category, assuming that the null hypothesis is true.
b) Verify that the requirements for the goodness-of-fit test are satisfied.
1. All expected counts are greater than or equal to 1 (all Ei ≥ 1).
2. No more than 20% of the expected counts are less than 5.
c) Compute the test statistic:

$$\chi^2_0 = \sum \frac{(O_i - E_i)^2}{E_i}$$

Note: Oi is the observed count for the ith category.
CAUTION!
If the requirements in Step 3(b) are not satisfied, one option is to combine two or more of the low-frequency categories into a single category.
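The two requirements in Step 3(b), and the option of combining low-frequency categories, can be sketched as follows. The function names are illustrative, and the expected counts in the example are hypothetical numbers invented only to show a failing case.

```python
def requirements_met(expected):
    """Check both goodness-of-fit requirements on a list of expected counts."""
    all_at_least_one = all(e >= 1 for e in expected)
    share_below_five = sum(e < 5 for e in expected) / len(expected)
    return all_at_least_one and share_below_five <= 0.20

def combine(counts, i, j):
    """Merge categories i and j into one category by adding their counts."""
    merged = [c for idx, c in enumerate(counts) if idx not in (i, j)]
    merged.append(counts[i] + counts[j])
    return merged

# Hypothetical expected counts, invented for illustration only.
expected = [40.0, 30.0, 3.5, 2.5]
print(requirements_met(expected))                 # two of the four counts are below 5
print(requirements_met(combine(expected, 2, 3)))  # merged category has count 6.0
```

If categories are combined, the observed counts must be merged in the same way, and k in the degrees of freedom k-1 is the number of categories after merging.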
Classical Approach
Step 4: Determine the critical value. All goodness-of-fit tests are right-tailed tests, so the critical value is $\chi^2_\alpha$ with k-1 degrees of freedom.
Step 5: Compare the critical value to the test statistic. If $\chi^2_0 > \chi^2_\alpha$, reject the null hypothesis.
P-Value Approach
Step 4: Use Table VII to obtain an approximate P-value by determining the area under the chi-square distribution with k-1 degrees of freedom to the right of the test statistic.
Step 5: If the P-value < α, reject the null hypothesis.
Step 6: State the conclusion.
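Steps 4 and 5 of both approaches can also be carried out without a printed table. The sketch below is one possible way to do it in plain Python, not the method used on the slides: it assumes the standard power series for the regularized incomplete gamma function to get the right-tail area, and it finds the critical value by bisection. The names chi2_sf and chi2_critical are illustrative.

```python
import math

def chi2_sf(x, df):
    """Area to the right of x under the chi-square curve with df degrees of freedom."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    # Power series for the regularized lower incomplete gamma function P(a, z).
    term = 1.0 / a
    total = term
    n = 0
    while term > 1e-16 * total:
        n += 1
        term *= z / (a + n)
        total += term
    left_area = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return max(0.0, 1.0 - left_area)

def chi2_critical(alpha, df):
    """Value with right-tail area alpha, found by bisection on chi2_sf."""
    lo, hi = 0.0, float(df)
    while chi2_sf(hi, df) > alpha:   # expand until the tail area drops below alpha
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_sf(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

print(round(chi2_critical(0.05, 3), 3))   # critical value for alpha = 0.05, 3 df
```

Under the classical approach, reject H0 when the test statistic exceeds chi2_critical(alpha, k - 1); under the P-value approach, reject H0 when chi2_sf(statistic, k - 1) is less than alpha. The series is adequate for statistics of ordinary size but loses relative precision for extremely small tail areas.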
Parallel Example 2: Conducting a Goodness-of-Fit Test
A sociologist wishes to determine whether the distribution for the number of years care-giving grandparents are responsible for their grandchildren is different today than it was in 2000. According to the United States Census Bureau, in 2000, 22.8% of grandparents had been responsible for their grandchildren less than 1 year; 23.9% had been responsible for 1 or 2 years; 17.6% had been responsible for 3 or 4 years; and 35.7% had been responsible for 5 or more years. The sociologist randomly selects 1,000 care-giving grandparents and obtains the following data.
Number of Years    Observed Counts
<1                 252
1-2                255
3-4                162
≥5                 331
Test the claim that the distribution is different today than it was in 2000 at the α = 0.05 level of significance.
Solution
Step 1: We want to know if the distribution today is different than it was in 2000. The hypotheses are then:
H0: The distribution for the number of years care-giving grandparents are responsible for their grandchildren is the same today as it was in 2000
H1: The distribution for the number of years care-giving grandparents are responsible for their grandchildren is different today than it was in 2000
Step 2: The level of significance is α = 0.05.
Step 3:
(a) The expected counts were computed in Example 1.
Number of Years    Observed Counts    Expected Counts
<1                 252                228
1-2                255                239
3-4                162                176
≥5                 331                357
Step 3:
(b) Since all expected counts are greater than or equal to 5, the requirements for the goodness-of-fit test are satisfied.
(c) The test statistic is

$$\chi^2_0 = \frac{(252-228)^2}{228} + \frac{(255-239)^2}{239} + \frac{(162-176)^2}{176} + \frac{(331-357)^2}{357} \approx 6.605$$
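The test statistic in Step 3(c) can be reproduced with a short Python sketch. The function name is illustrative; the counts are the observed and expected counts from the table in Step 3(a).

```python
def chi_square_statistic(observed, expected):
    """Sum of (O_i - E_i)^2 / E_i over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [252, 255, 162, 331]   # <1, 1-2, 3-4, >=5 years
expected = [228, 239, 176, 357]   # from Example 1

statistic = chi_square_statistic(observed, expected)
degrees_of_freedom = len(observed) - 1
print(round(statistic, 3), degrees_of_freedom)   # 6.605 3
```

Each term is nonnegative, so the statistic grows as the observed counts move away from the expected counts, which is why the test is right-tailed.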
Solution: Classical Approach
Step 4: There are k = 4 categories, so we find the critical value using 4-1 = 3 degrees of freedom. The critical value is $\chi^2_{0.05} = 7.815$.
Step 5: Since the test statistic, $\chi^2_0 = 6.605$, is less than the critical value, $\chi^2_{0.05} = 7.815$, we fail to reject the null hypothesis.
Solution: P-Value Approach
Step 4: There are k = 4 categories. The P-value is the area under the chi-square distribution with 4-1 = 3 degrees of freedom to the right of $\chi^2_0 = 6.605$. Thus, P-value ≈ 0.09.
Step 5: Since the P-value ≈ 0.09 is greater than the level of significance α = 0.05, we fail to reject the null hypothesis.
Step 6: There is insufficient evidence to conclude that the distribution for the number of years care-giving grandparents are responsible for their grandchildren is different today than it was in 2000 at the α = 0.05 level of significance.
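The P-value of about 0.09 can be checked without Table VII. The sketch below assumes the closed-form right-tail area that holds only for exactly 3 degrees of freedom, erfc(sqrt(x/2)) + sqrt(2x/π)·exp(-x/2); this identity is a standard result, not something given on the slides, and the function name is illustrative.

```python
import math

def chi2_sf_3df(x):
    """Right-tail area of the chi-square distribution with exactly 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)

alpha = 0.05
p_value = chi2_sf_3df(6.605)
print(round(p_value, 2))   # 0.09
print(p_value < alpha)     # False, so we fail to reject H0
```

The same function evaluated at the critical value 7.815 returns an area very close to 0.05, which ties the classical and P-value approaches together.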
Section 12.2: Tests for Independence and the Homogeneity of Proportions
Objectives
1. Perform a test for independence
2. Perform a test for homogeneity of proportions
Objective 1
• Perform a Test for Independence
The chi-square test for independence is used to determine whether there is an association between a row variable and column variable in a contingency table constructed from sample data. The null hypothesis is that the variables are not associated; in other words, they are independent. The alternative hypothesis is that the variables are associated, or dependent.
“In Other Words”
In a chi-square independence test, the null hypothesis is always
H0: The variables are independent
The alternative hypothesis is always
H1: The variables are not independent
The idea behind testing these types of claims is to compare actual counts to the counts we would expect if the null hypothesis were true (if the variables are independent). If a significant difference between the actual counts and expected counts exists, we would take this as evidence against the null hypothesis.
If two events are independent, then P(A and B) = P(A)P(B)
We can use the Multiplication Rule for Independent Events to obtain the expected proportion of observations within each cell under the assumption of independence and multiply this result by n, the sample size, in order to obtain the expected count within each cell.
Parallel Example 1: Determining the Expected Counts in a Test for Independence
In a poll, 883 males and 893 females were asked “If you could have only one of the following, which would you pick: money, health, or love?” Their responses are presented in the table below. Determine the expected counts within each cell assuming that gender and response are independent.
          Money    Health    Love
Men       82       446       355
Women     46       574       273
Source: Based on a Fox News Poll conducted in January, 1999
Solution
Step 1: We first compute the row and column totals:
                 Money    Health    Love    Row Totals
Men              82       446       355     883
Women            46       574       273     893
Column totals    128      1020      628     1776
Step 2: Next compute the relative marginal frequencies for the row variable and column variable:
                      Money                 Health                 Love                  Relative Frequency
Men                   82                    446                    355                   883/1776 ≈ 0.4972
Women                 46                    574                    273                   893/1776 ≈ 0.5028
Relative Frequency    128/1776 ≈ 0.0721     1020/1776 ≈ 0.5743     628/1776 ≈ 0.3536     1
Step 3: Assuming gender and response are independent, we use the Multiplication Rule for Independent Events to compute the proportion of observations we would expect in each cell.
          Money     Health    Love
Men       0.0358    0.2855    0.1758
Women     0.0362    0.2888    0.1778
Step 4: We multiply the expected proportions from Step 3 by 1776, the sample size, to obtain the expected counts under the assumption of independence.
          Money                      Health                      Love
Men       1776(0.0358) ≈ 63.5808     1776(0.2855) ≈ 507.048      1776(0.1758) ≈ 312.2208
Women     1776(0.0362) ≈ 64.2912     1776(0.2888) ≈ 512.9088     1776(0.1778) ≈ 315.7728
Expected Frequencies in a Chi-Square Test for Independence
To find the expected frequencies in a cell when performing a chi-square independence test, multiply the row total of the row containing the cell by the column total of the column containing the cell and divide this result by the table total. That is,
Expected frequency = (row total)(column total) / (table total)
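The row-total-times-column-total rule can be sketched in Python using the poll data from Example 1. The function name is illustrative. This sketch divides exactly rather than rounding the expected proportions to four decimal places first, so its results differ slightly from the slide values (for example, about 63.64 rather than 63.5808 for men who chose money).

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / table total."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in column_totals] for r in row_totals]

# Observed counts: rows are Men, Women; columns are Money, Health, Love.
observed = [[82, 446, 355],
            [46, 574, 273]]

for row in expected_counts(observed):
    print([round(e, 2) for e in row])
```

The expected counts in each row add up to that row's total, and likewise for columns, which gives a simple check on the table.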
Test Statistic for the Test of Independence
Let Oi represent the observed number of counts in the ith cell and Ei represent the expected number of counts in the ith cell. Then

$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$

approximately follows the chi-square distribution with (r-1)(c-1) degrees of freedom, where r is the number of rows and c is the number of columns in the contingency table, provided that (1) all expected frequencies are greater than or equal to 1 and (2) no more than 20% of the expected frequencies are less than 5.
Chi-Square Test for Independence
To test the association (or independence) of two variables in a contingency table:
Step 1: Determine the null and alternative hypotheses.
H0: The row variable and column variable are independent.
H1: The row variable and column variable are dependent.
Step 2: Choose a level of significance, α, depending on the seriousness of making a Type I error.
Step 3:
a) Calculate the expected frequencies (counts) for each cell in the contingency table.
b) Verify that the requirements for the chi-square test for independence are satisfied:
1. All expected frequencies are greater than or equal to 1 (all Ei ≥ 1).
2. No more than 20% of the expected frequencies are less than 5.
c) Compute the test statistic:

$$\chi^2_0 = \sum \frac{(O_i - E_i)^2}{E_i}$$

Note: Oi is the observed count for the ith cell.
Classical Approach
Step 4: Determine the critical value. All chi-square tests for independence are right-tailed tests, so the critical value is $\chi^2_\alpha$ with (r-1)(c-1) degrees of freedom, where r is the number of rows and c is the number of columns in the contingency table.
Step 5: Compare the critical value to the test statistic. If $\chi^2_0 > \chi^2_\alpha$, reject the null hypothesis.
P-Value Approach
Step 4: Use Table VII to determine an approximate P-value by determining the area under the chi-square distribution with (r-1)(c-1) degrees of freedom to the right of the test statistic.
Step 5: If the P-value < α, reject the null hypothesis.
Step 6: State the conclusion.
Parallel Example 2: Performing a Chi-Square Test for Independence
In a poll, 883 males and 893 females were asked “If you could have only one of the following, which would you pick: money, health, or love?” Their responses are presented in the table below. Test the claim that gender and response are independent at the α = 0.05 level of significance.
          Money    Health    Love
Men       82       446       355
Women     46       574       273
Source: Based on a Fox News Poll conducted in January, 1999
Solution
Step 1: We want to know whether gender and response are dependent or independent, so the hypotheses are:
H0: gender and response are independent
H1: gender and response are dependent
Step 2: The level of significance is α = 0.05.
Step 3:
(a) The expected frequencies were computed in Example 1 and are given in parentheses in the table below, along with the observed frequencies.
          Money             Health            Love
Men       82 (63.5808)      446 (507.048)     355 (312.2208)
Women     46 (64.2912)      574 (512.9088)    273 (315.7728)
Step 3:
(b) Since none of the expected frequencies are less than 5, the requirements for the chi-square test for independence are satisfied.
(c) The test statistic is

$$\chi^2_0 = \frac{(82-63.5808)^2}{63.5808} + \frac{(446-507.048)^2}{507.048} + \cdots + \frac{(273-315.7728)^2}{315.7728} \approx 36.82$$
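The full six-term sum, including the terms abbreviated in Step 3(c), can be sketched in Python. The function name is illustrative. Because this sketch computes each expected count as an exact quotient instead of using the rounded values from Example 1, it gives a statistic of roughly 36.84 rather than 36.82; the conclusion of the test is unaffected.

```python
def independence_statistic(table):
    """Return (chi-square statistic, degrees of freedom) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * column_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df

# Rows: Men, Women; columns: Money, Health, Love.
statistic, df = independence_statistic([[82, 446, 355],
                                        [46, 574, 273]])
print(round(statistic, 2), df)
```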
Solution: Classical Approach
Step 4: There are r = 2 rows and c = 3 columns, so we find the critical value using (2-1)(3-1) = 2 degrees of freedom. The critical value is $\chi^2_{0.05} = 5.99$.
Step 5: Since the test statistic, $\chi^2_0 = 36.82$, is greater than the critical value, $\chi^2_{0.05} = 5.99$, we reject the null hypothesis.
Solution: P-Value Approach
Step 4: There are r = 2 rows and c = 3 columns, so we find the P-value using (2-1)(3-1) = 2 degrees of freedom. The P-value is the area under the chi-square distribution with 2 degrees of freedom to the right of $\chi^2_0 = 36.82$, which is approximately 0 (less than 0.0001).
Step 5: Since the P-value is less than the level of significance α = 0.05, we reject the null hypothesis.
Step 6: There is sufficient evidence to conclude that gender and response are dependent at the α = 0.05 level of significance.
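With exactly 2 degrees of freedom, both the critical value and the P-value have simple closed forms: the right-tail area is exp(-x/2), so the critical value is -2·ln(α). This is a standard property of the chi-square distribution with 2 degrees of freedom, not something stated on the slides, and the function names below are illustrative.

```python
import math

def chi2_sf_2df(x):
    """Right-tail area of the chi-square distribution with exactly 2 degrees of freedom."""
    return math.exp(-x / 2.0)

def chi2_critical_2df(alpha):
    """Critical value with right-tail area alpha for 2 degrees of freedom."""
    return -2.0 * math.log(alpha)

alpha = 0.05
statistic = 36.82                            # test statistic from Step 3(c)
print(round(chi2_critical_2df(alpha), 2))    # 5.99
print(chi2_sf_2df(statistic) < alpha)        # True, so we reject H0
```

These shortcuts do not apply to other degrees of freedom, where a table or a numerical routine is needed.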
To see the relation between response and gender, we draw bar graphs of the conditional distributions of response by gender. Recall that a conditional distribution lists the relative frequency of each category of a variable, given a specific value of the other variable in a contingency table.
Parallel Example 3: Constructing a Conditional Distribution and Bar Graph
Find the conditional distribution of response by gender for the data from the previous example, reproduced below.
          Money    Health    Love
Men       82       446       355
Women     46       574       273
Source: Based on a Fox News Poll conducted in January, 1999
Solution
We first compute the conditional distribution of response by gender.
          Money                Health                Love
Men       82/883 ≈ 0.0929      446/883 ≈ 0.5051      355/883 ≈ 0.4020
Women     46/893 ≈ 0.0515      574/893 ≈ 0.6428      273/893 ≈ 0.3057
[Figure: bar graphs of the conditional distribution of response by gender]
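The conditional distribution, along with a rough stand-in for the bar graph, can be sketched in Python. The function name and the text-based bars are illustrative choices.

```python
def conditional_distribution(table):
    """Relative frequency of each column category within each row."""
    return [[count / sum(row) for count in row] for row in table]

responses = ["Money", "Health", "Love"]
observed = {"Men": [82, 446, 355], "Women": [46, 574, 273]}

distribution = conditional_distribution(list(observed.values()))
for gender, row in zip(observed, distribution):
    print(gender, {r: round(p, 4) for r, p in zip(responses, row)})
    # Crude text bar chart: one '#' per 2 percentage points.
    for r, p in zip(responses, row):
        print(f"  {r:<6} {'#' * round(p * 50)}")
```

Comparing the bars row by row shows where the two distributions differ, for example a larger share of women than men choosing health.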
Objective 2
• Perform a Test for Homogeneity of Proportions
In a chi-square test for homogeneity of proportions, we test whether different populations have the same proportion of individuals with some characteristic.
The procedures for performing a test of homogeneity are identical to those for a test of independence.
Parallel Example 5: A Test for Homogeneity of Proportions
The following question was asked of a random sample of individuals in 1992, 2002, and 2008: “Would you tell me if you feel being a teacher is an occupation of very great prestige?” The results of the survey are presented below:
        1992    2002    2008
Yes     418     479     525
No      602     541     485
Test the claim that the proportion of individuals who feel being a teacher is an occupation of very great prestige is the same for each year at the α = 0.01 level of significance.
Source: The Harris Poll
Solution
Step 1: The null hypothesis is a statement of “no difference,” so the proportions for each year who feel that being a teacher is an occupation of very great prestige are equal. We state the hypotheses as follows:
H0: p1 = p2 = p3
H1: At least one of the proportions is different from the others.
Step 2: The level of significance is α = 0.01.
Step 3:
(a) The expected frequencies are found by multiplying the appropriate row and column totals and then dividing by the total sample size. They are given in parentheses in the table below, along with the observed frequencies.
        1992             2002             2008
Yes     418 (475.554)    479 (475.554)    525 (470.892)
No      602 (544.446)    541 (544.446)    485 (539.108)
Step 3:
(b) Since none of the expected frequencies are less than 5, the requirements are satisfied.
(c) The test statistic is

$$\chi^2_0 = \frac{(418-475.554)^2}{475.554} + \frac{(479-475.554)^2}{475.554} + \cdots + \frac{(485-539.108)^2}{539.108} \approx 24.74$$
Solution: Classical Approach
Step 4: There are r = 2 rows and c = 3 columns, so we find the critical value using (2-1)(3-1) = 2 degrees of freedom. The critical value is $\chi^2_{0.01} = 9.210$.
Step 5: Since the test statistic, $\chi^2_0 = 24.74$, is greater than the critical value, $\chi^2_{0.01} = 9.210$, we reject the null hypothesis.
Solution: P-Value Approach
Step 4: There are r = 2 rows and c = 3 columns, so we find the P-value using (2-1)(3-1) = 2 degrees of freedom. The P-value is the area under the chi-square distribution with 2 degrees of freedom to the right of $\chi^2_0 = 24.74$, which is approximately 0 (less than 0.0001).
Step 5: Since the P-value is less than the level of significance α = 0.01, we reject the null hypothesis.
Step 6: There is sufficient evidence to reject the null hypothesis at the α = 0.01 level of significance. We conclude that the proportion of individuals who believe that teaching is a very prestigious career is different for at least one of the three years.
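The whole homogeneity test for this example can be sketched end to end in Python. The function name is illustrative, and the P-value line assumes the closed form exp(-x/2), which is valid only for 2 degrees of freedom; the sketch raises an error for tables of any other shape rather than return a wrong answer.

```python
import math

def homogeneity_test(table, alpha):
    """Chi-square test for homogeneity of proportions on a table with 2 degrees of freedom."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * column_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    if df != 2:
        raise ValueError("this sketch only handles tables with 2 degrees of freedom")
    p_value = math.exp(-statistic / 2.0)   # closed form valid only for df = 2
    return statistic, p_value, p_value < alpha

# Rows: Yes, No; columns: 1992, 2002, 2008.
statistic, p_value, reject = homogeneity_test([[418, 479, 525],
                                               [602, 541, 485]], alpha=0.01)
print(round(statistic, 2), reject)   # 24.74 True
```

The computation is identical to the test for independence; only the hypotheses and the way the samples were drawn differ.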