Introduction to Statistical Data Analysis Lecture 7: The ...
Review of Data Measurement Scales
The Chi-Square Goodness-of-Fit Test
Chi-Square Test for Independence
Introduction to Statistical Data Analysis
Lecture 7: The Chi-Square Distribution
James V. Lambers
Department of Mathematics
The University of Southern Mississippi
James V. Lambers Statistical Data Analysis 1 / 24
Introduction
In this lecture, we will use hypothesis testing for new purposes:
- To determine whether a given data set follows a specific probability distribution, and
- To determine whether two random variables are statistically independent.
Review of Data Measurement Scales
Recall from Lecture 1 that there are four data measurement scales: nominal, ordinal, interval, and ratio.
The hypothesis testing techniques presented in Lecture 6 only apply to the scales that are more quantitative, interval and ratio.
Now, though, we can use hypothesis testing for data measured in nominal or ordinal scales as well.
This is because we are working with frequency distributions, which can be constructed from any data set, regardless of its measurement scale.
The Chi-Square Goodness-of-Fit Test
The chi-square goodness-of-fit test uses a sample to determine whether the frequency distribution of the population conforms to a particular probability distribution that it is believed to follow.
Example: Suppose that a six-sided die is rolled 150 times, and the result of each roll is recorded. If the die is fair, the number of rolls showing each of the values 1, 2, 3, 4, 5, and 6 should follow a uniform distribution.
A chi-square goodness-of-fit test can be used to compare the observed number of rolls for each value, from 1 to 6, to the expected number of rolls for each value, which is 150/6 = 25.
Stating the Hypotheses
For the chi-square goodness-of-fit test, the null hypothesis $H_0$ is that the population does follow the predicted distribution, and the alternative hypothesis $H_1$ is that it does not.
Observed and Expected Frequencies
The chi-square goodness-of-fit test works with two frequency distributions that have the same classes, with frequencies denoted by $\{O_i\}$ and $\{E_i\}$, respectively.
Each frequency $O_i$ is the actual number of observations from the sample that belong to the $i$th class.
Each frequency $E_i$ is the expected number of observations that should belong to class $i$, assuming $H_0$ is true.
It is essential that the total number of observations in the two frequency distributions be equal; that is,
\[
\sum_{i=1}^{n} O_i = \sum_{i=1}^{n} E_i,
\]
where $n$ is the number of classes.
Calculating the Chi-Square Statistic
The test statistic for the chi-square goodness-of-fit test, also known as the chi-square score, is given by
\[
\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i},
\]
where, as before, $n$ is the number of classes.
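As a minimal sketch in R, the chi-square score for the die example could be computed as follows. The observed counts are hypothetical (they are not data from the lecture); only the expected count of 150/6 = 25 per face comes from the example. The names obs, expd, and chi2 are illustrative.

```r
# Hypothetical observed counts for faces 1..6 over 150 rolls (illustrative only)
obs <- c(22, 28, 25, 27, 23, 25)

# Expected counts under H0 (fair die): 150/6 = 25 for each face
expd <- rep(sum(obs) / 6, 6)

# The two frequency distributions must have the same total
stopifnot(sum(obs) == sum(expd))

# Chi-square score: sum over classes of (O_i - E_i)^2 / E_i
chi2 <- sum((obs - expd)^2 / expd)
print(chi2)
```

For these made-up counts the squared deviations are 9, 9, 0, 4, 4, 0, so the score is 26/25 = 1.04.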
Determining the Critical Chi-Square Score
Once we have computed the test statistic, we compare it against the critical value $\chi^2_c$, which can be obtained as follows:
- It can be looked up in a table of right-tail areas for the chi-square distribution, using the degrees of freedom d.f. $= n - 1$ and the chosen significance level $\alpha$, or
- One can use the R function qchisq with first parameter $1 - \alpha$ and second parameter d.f. $= n - 1$. This function takes a left-tail area as its first parameter and returns the corresponding chi-square value, in contrast to the table given in Appendix A, which is indexed by right-tail areas; this is why $1 - \alpha$ is given as the first parameter instead of $\alpha$.
If the chi-square score $\chi^2$ is greater than this critical value $\chi^2_c$, then we reject $H_0$; otherwise we do not reject $H_0$.
Because the test statistic is never negative, and only large values indicate a mismatch between the observed and expected frequencies, the chi-square goodness-of-fit test is always a one-tail (right-tail) test.
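The decision rule can be sketched in R as below. The number of classes and the chi-square score are hypothetical values chosen for illustration; the variable names are not from the lecture. The second qchisq call uses the lower.tail argument to pass the right-tail area directly, which should give the same critical value.

```r
alpha <- 0.05        # chosen significance level
n_classes <- 6       # hypothetical: six classes, as for a six-sided die
dof <- n_classes - 1 # degrees of freedom, n - 1

# qchisq takes a left-tail probability, so pass 1 - alpha
crit <- qchisq(1 - alpha, dof)

# Equivalent call that takes the right-tail area directly
crit_right <- qchisq(alpha, dof, lower.tail = FALSE)

chi2 <- 1.04         # hypothetical chi-square score, for illustration only
reject_H0 <- chi2 > crit

print(crit)
print(reject_H0)
```

With these assumed values the critical value is about 11.07, so a score of 1.04 would not lead to rejecting the null hypothesis.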
Characteristics of a Chi-Square Distribution
The chi-square distribution is of a very different character than the other distributions that we have seen.
If $Z_1, Z_2, \ldots, Z_n$ are independent standard normal random variables, then the random variable $Q$ defined by
\[
Q = \sum_{i=1}^{n} Z_i^2
\]
follows the chi-square distribution with $n$ degrees of freedom.
It is not symmetric; rather, it is skewed to the right, with its values concentrated toward zero, which is the leftmost value of the distribution.
However, as the number of degrees of freedom (d.f.) increases, the distribution becomes more symmetric.
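A small simulation can make this definition concrete. The sketch below draws sums of squared standard normal values and compares them with R's chi-square quantiles. The seed, the choice of 4 degrees of freedom, and the number of repetitions are arbitrary assumptions. It also relies on the standard fact that a chi-square distribution with $n$ degrees of freedom has mean $n$.

```r
set.seed(1)      # arbitrary seed, for reproducibility
n <- 4           # degrees of freedom (hypothetical choice)
reps <- 100000   # number of simulated values of Q

# Each row holds n independent standard normal draws; Q is the row sum of squares
Z <- matrix(rnorm(reps * n), nrow = reps, ncol = n)
Q <- rowSums(Z^2)

# Compare simulated quantiles with the chi-square quantiles for n d.f.
probs <- c(0.25, 0.5, 0.75, 0.95)
sim_q <- quantile(Q, probs)
theo_q <- qchisq(probs, n)
print(rbind(simulated = sim_q, theoretical = theo_q))

# Right skew: the mean (close to n) exceeds the median
print(c(mean = mean(Q), median = median(Q)))
```

The simulated and theoretical quantiles should agree closely, and the mean exceeding the median reflects the right skew.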
Characteristics, cont’d
The probability density function for this distribution is
\[
f_n(x) = \frac{1}{2^{n/2}\,\Gamma(n/2)}\, x^{n/2-1} e^{-x/2}, \quad x > 0,
\]
where $n$ is the number of degrees of freedom and $\Gamma$ is the gamma function.
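As a quick check of this formula, the density can be written out by hand in R and compared with the built-in dchisq. The function name chisq_pdf, the evaluation points, and the choice of 3 degrees of freedom are illustrative assumptions.

```r
# Density of the chi-square distribution with n d.f., written out from the formula
chisq_pdf <- function(x, n) {
  x^(n / 2 - 1) * exp(-x / 2) / (2^(n / 2) * gamma(n / 2))
}

x <- c(0.5, 1, 2, 5, 10)   # arbitrary positive evaluation points
n <- 3                     # hypothetical degrees of freedom

manual  <- chisq_pdf(x, n)
builtin <- dchisq(x, df = n)
print(cbind(x, manual, builtin))
```

The two columns should agree up to floating-point rounding.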
A Goodness-of-Fit Test with the Binomial Distribution
Suppose a coin is flipped 10 times, and the number of times it comes up heads is recorded.
Then, this process is repeated several times, for a total of 100 sequences of 10 flips each.
Since coin flips are Bernoulli trials, the number of heads in a sequence follows a binomial distribution (with 10 trials and, for a fair coin, success probability 0.5), which yields the expected number of sequences that produce $k$ heads.
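The expected number of sequences with $k$ heads can be sketched in R by scaling the binomial probabilities by the number of sequences. This assumes a fair coin (probability 0.5 of heads); the variable names are illustrative.

```r
flips <- 10        # flips per sequence
sequences <- 100   # number of sequences
p_heads <- 0.5     # assumption: the coin is fair

# P(k heads) for k = 0..10, scaled by the number of sequences
expected <- sequences * dbinom(0:flips, size = flips, prob = p_heads)
names(expected) <- 0:flips
print(round(expected, 3))
```

Because the binomial probabilities sum to 1, the expected counts sum to 100, matching the total number of observed sequences.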
Observed and Expected Values
Number of heads   Observed sequences   Expected sequences
0                 1                    0.098
1                 2                    0.977
2                 3                    4.395
3                 9                    11.719
4                 18                   20.508
5                 26                   24.609
6                 21                   20.508
7                 13                   11.719
8                 5                    4.395
9                 2                    0.977
10                0                    0.098
Performing the Chi-Square Test
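Our null hypothesis $H_0$ is that the number of heads does in fact follow a binomial distribution.
The chi-square score is
\[
\begin{aligned}
\chi^2 &= \sum_{i=0}^{10} \frac{(O_i - E_i)^2}{E_i} \\
&= \frac{(1 - 0.098)^2}{0.098} + \frac{(2 - 0.977)^2}{0.977} + \frac{(3 - 4.395)^2}{4.395} + \frac{(9 - 11.719)^2}{11.719} \\
&\quad + \frac{(18 - 20.508)^2}{20.508} + \frac{(26 - 24.609)^2}{24.609} + \frac{(21 - 20.508)^2}{20.508} + \frac{(13 - 11.719)^2}{11.719} \\
&\quad + \frac{(5 - 4.395)^2}{4.395} + \frac{(2 - 0.977)^2}{0.977} + \frac{(0 - 0.098)^2}{0.098} \\
&= 12.274.
\end{aligned}
\]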
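The same sum can be reproduced term by term in R, which also shows how much each class contributes. This is a sketch using the observed counts from the table and expected counts computed under the assumption of a fair coin; the names contrib and chi2 are illustrative.

```r
obs <- c(1, 2, 3, 9, 18, 26, 21, 13, 5, 2, 0)   # observed sequences, 0..10 heads
expected <- 100 * dbinom(0:10, 10, 0.5)          # expected sequences under H0

# Per-class contributions (O_i - E_i)^2 / E_i
contrib <- (obs - expected)^2 / expected
names(contrib) <- 0:10
print(round(contrib, 3))

# Chi-square score: the sum of the contributions
chi2 <- sum(contrib)
print(round(chi2, 3))
```

Most of the score comes from the 0-heads class, where the expected count is very small, so a single observed sequence produces a large relative deviation.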
And the Verdict is...
This is compared to the critical value $\chi^2_c$, with degrees of freedom d.f. $= n - 1 = 10$, since there are $n = 11$ classes, and with level of significance $\alpha = 0.05$.
We can use the R expression qchisq(1-0.05,10) to obtain $\chi^2_c = 18.307$.
Since $\chi^2 < \chi^2_c$, we do not reject $H_0$: the data are consistent with the number of heads from each sequence of 10 flips following a binomial distribution, as expected.
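The comparison can be sketched in R, together with the equivalent p-value view: the score exceeds the critical value exactly when the right-tail area beyond the score is smaller than $\alpha$. The variable names are illustrative, and the score is the rounded value quoted in the lecture.

```r
alpha <- 0.05
dof <- 11 - 1      # n - 1 with n = 11 classes
chi2 <- 12.274     # chi-square score from the lecture (rounded)

crit <- qchisq(1 - alpha, dof)                    # critical value
p_value <- pchisq(chi2, dof, lower.tail = FALSE)  # right-tail area beyond chi2

print(round(c(critical = crit, p_value = p_value), 4))
print(chi2 > crit)   # TRUE would mean "reject H0"
```

Here the p-value is well above 0.05, which agrees with the decision not to reject the null hypothesis.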
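Chi-Square Goodness-of-Fit Test in R
> obs=c(1,2,3,9,18,26,21,13,5,2,0)   # observed sequences for 0, 1, ..., 10 heads
> pexp=dbinom(0:10,10,0.5)           # binomial probabilities under H0
> chisq.test(obs,p=pexp)             # goodness-of-fit test against pexp
Chi-squared test for given probabilities
data: obs
X-squared = 12.2743, df = 10, p-value = 0.2671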
Chi-Square Test for Independence
Now, we use the chi-square distribution to test whether two given random variables are statistically independent.
For this test, the null hypothesis $H_0$ is that the variables are independent, while the alternative hypothesis $H_1$ is that they are not.
Contingency Tables
To compute the test statistic, we construct a contingency table, which is a two-dimensional array, or a matrix, in which each cell contains an observed frequency of an ordered pair of values of the two variables.
That is, the entry in row $i$, column $j$, which we denote by $O_{i,j}$, contains the number of observations that fall into class $i$ of the first variable and class $j$ of the second.
The frequencies in this table are the observed frequencies for the chi-square goodness-of-fit test.
Computing Expected Frequencies
Next, for each row $i$ and each column $j$, we compute $E_{i,j}$, which is
(sum of entries in row $i$) × (sum of entries in column $j$),
divided by the total number of observations, to get the expected frequencies for the chi-square goodness-of-fit test.
Relation to Independent Events
That is, if the contingency table has $m$ rows and $n$ columns, then
\[
E_{i,j} = \frac{\left( \sum_{k=1}^{n} O_{i,k} \right) \left( \sum_{\ell=1}^{m} O_{\ell,j} \right)}{\sum_{\ell=1}^{m} \sum_{k=1}^{n} O_{\ell,k}}.
\]
It should be noted that this quantity, divided again by the total number of observations, is exactly $P(A_i)P(B_j)$, where $A_i$ is the event that the first variable falls into class $i$, and $B_j$ is the event that the second variable falls into class $j$.
By the multiplication rule, this probability would equal $P(A_i \cap B_j)$ if the variables were independent.
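The formula and its link to the multiplication rule can be sketched in R with outer, which forms every product of a row total and a column total. The 2 x 2 table below is hypothetical, and the names O, E, pA, and pB are illustrative.

```r
# Hypothetical 2 x 2 contingency table (illustrative counts only)
O <- matrix(c(30, 20,
              10, 40), nrow = 2, byrow = TRUE)
N <- sum(O)   # total number of observations

# E[i, j] = (row i total) * (column j total) / N
E <- outer(rowSums(O), colSums(O)) / N

# Estimated marginal probabilities P(A_i) and P(B_j)
pA <- rowSums(O) / N
pB <- colSums(O) / N

print(E)
print(E / N)           # expected frequencies divided again by N
print(outer(pA, pB))   # products P(A_i) P(B_j): the same values
```

For this made-up table the row totals are 50 and 50 and the column totals are 40 and 60, so each row of expected frequencies is 20 and 30.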
The Test Statistic
Then, the test statistic is
\[
\chi^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}}.
\]
We then obtain the critical value $\chi^2_c$ using d.f. $= (m-1)(n-1)$ and our chosen level of significance $\alpha$.
As before, if $\chi^2 > \chi^2_c$, then we reject $H_0$ and conclude that the variables are in fact statistically dependent.
Example
Suppose that 300 voters were surveyed, and classified according to gender and political affiliation: Democrat, Republican, or Independent.
The contingency table for these classifications is as follows:
                    Affiliation
Gender    Democrat   Republican   Independent   Total
Female    68         56           32            156
Male      52         72           20            144
Total     120        128          52            300
That is, 68 of the voters are female and Democrat, 72 of the voters are male and Republican, and so on.
The entry in row $i$ and column $j$ is the observation $O_{i,j}$.
Computing Expected Frequencies
Let $G_i$ be the event that the voter is of the gender for row $i$, $i = 1, 2$, and let $A_j$ be the event that the voter’s affiliation corresponds to column $j$, $j = 1, 2, 3$. Then, we compute the expected frequencies $E_{i,j}$, which equal 300 times the value $P(G_i)P(A_j)$ that $P(G_i \cap A_j)$ would have under independence, as follows:
(i, j)   G_i ∩ A_j              E_{i,j} = 300 P(G_i) P(A_j)
(1, 1)   Female, Democrat       (156)(120)/300 = 62.40
(1, 2)   Female, Republican     (156)(128)/300 = 66.56
(1, 3)   Female, Independent    (156)(52)/300 = 27.04
(2, 1)   Male, Democrat         (144)(120)/300 = 57.60
(2, 2)   Male, Republican       (144)(128)/300 = 61.44
(2, 3)   Male, Independent      (144)(52)/300 = 24.96
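These expected frequencies can be reproduced in R from the row and column totals of the voter table. This is a sketch; the dimnames labels and the variable names M and E are illustrative choices.

```r
# Observed counts from the voter contingency table (without the Total row/column)
M <- matrix(c(68, 56, 32,
              52, 72, 20), nrow = 2, byrow = TRUE,
            dimnames = list(Gender = c("Female", "Male"),
                            Affiliation = c("Democrat", "Republican", "Independent")))

# E[i, j] = (row i total) * (column j total) / (grand total)
E <- outer(rowSums(M), colSums(M)) / sum(M)
print(round(E, 2))
```

Each row of E sums to the corresponding observed row total (156 and 144), so the observed and expected tables have the same total of 300.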
The Test Statistic
Then, the test statistic is
\[
\begin{aligned}
\chi^2 &= \sum_{i=1}^{2} \sum_{j=1}^{3} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}} \\
&= \frac{(68 - 62.4)^2}{62.4} + \frac{(56 - 66.56)^2}{66.56} + \frac{(32 - 27.04)^2}{27.04} + \frac{(52 - 57.60)^2}{57.60} + \cdots \\
&= 6.433.
\end{aligned}
\]
We compare this value against the critical value $\chi^2_c$, with degrees of freedom d.f. $= (2-1)(3-1) = 2$ and significance level 0.05.
Since this value is $\chi^2_c = 5.991$, and $\chi^2 > \chi^2_c$, we reject the null hypothesis that gender and political affiliation are independent.
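The whole calculation for the voter example can be sketched step by step in R, without calling chisq.test. The variable names are illustrative; the significance level 0.05 is the one used in the lecture.

```r
# Observed counts: rows are Female, Male; columns are Democrat, Republican, Independent
M <- matrix(c(68, 56, 32,
              52, 72, 20), nrow = 2, byrow = TRUE)

# Expected frequencies under independence
E <- outer(rowSums(M), colSums(M)) / sum(M)

chi2 <- sum((M - E)^2 / E)                        # test statistic
dof <- (nrow(M) - 1) * (ncol(M) - 1)              # (m - 1)(n - 1)
crit <- qchisq(1 - 0.05, dof)                     # critical value at alpha = 0.05
p_value <- pchisq(chi2, dof, lower.tail = FALSE)  # right-tail area beyond chi2

print(round(c(chi2 = chi2, critical = crit, p_value = p_value), 4))
print(chi2 > crit)   # TRUE means "reject H0"
```

The score exceeds the critical value and the p-value falls just below 0.05, so both views lead to rejecting independence.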
Independence Test in R
> M=matrix(c(68,52,56,72,32,20),nrow=2,ncol=3)   # filled column by column; rows are Female, Male
> chisq.test(M)                                  # chi-square test for independence
Pearson’s Chi-squared test
data: M
X-squared = 6.4329, df = 2, p-value = 0.0401