
Introduction to Statistical Data Analysis
Lecture 7: The Chi-Square Distribution

James V. Lambers

Department of Mathematics
The University of Southern Mississippi




Introduction

In this lecture, we will use hypothesis testing for new purposes:

- To determine whether a given data set follows a specific probability distribution, and

- To determine whether two random variables are statistically independent.




Review of Data Measurement Scales

Recall from Lecture 1 that there are four data measurement scales: nominal, ordinal, interval, and ratio.

The hypothesis testing techniques presented in Lecture 6 apply only to the more quantitative scales, interval and ratio.

Now, though, we can use hypothesis testing for data measured in nominal or ordinal scales as well.

This is because we are working with frequency distributions, which can be constructed from any data set, regardless of its measurement scale.





The Chi-Square Goodness-of-Fit Test

The chi-square goodness-of-fit test uses a sample to determine whether the frequency distribution of the population conforms to a particular probability distribution that it is believed to follow.

Example: Suppose that a six-sided die is rolled 150 times, and the result of each roll is recorded. If the die is fair, the number of rolls showing each of the values 1, 2, 3, 4, 5, or 6 should follow a uniform distribution.

A chi-square goodness-of-fit test can be used to compare the observed number of rolls for each value, from 1 to 6, to the expected number of rolls for each value, which is 150/6 = 25.





Stating the Hypotheses

For the chi-square goodness-of-fit test, the null hypothesis $H_0$ is that the population does follow the predicted distribution, and the alternative hypothesis $H_1$ is that it does not.





Observed and Expected Frequencies

The chi-square goodness-of-fit test works with two frequency distributions that share the same classes, with frequencies denoted by $\{O_i\}$ and $\{E_i\}$, respectively.

Each frequency $O_i$ is the actual number of observations from the sample that belong to the $i$th class.

Each frequency $E_i$ is the expected number of observations that should belong to class $i$, assuming $H_0$ is true.

It is essential that the total number of observations in the two frequency distributions be equal; that is,

$$\sum_{i=1}^{n} O_i = \sum_{i=1}^{n} E_i,$$

where $n$ is the number of classes.




Calculating the Chi-Square Statistic

The test statistic for the chi-square goodness-of-fit test, also known as the chi-square score, is given by

$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i},$$

where, as before, $n$ is the number of classes.
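The formula translates directly into R. The following is a minimal sketch, not part of the original lecture: the helper name `chisq_stat` and the observed die counts are hypothetical, with the counts chosen only so that they total 150 rolls as in the die example.

```r
# Chi-square score: sum over all classes of (O_i - E_i)^2 / E_i.
# chisq_stat is a hypothetical helper name, not a built-in R function.
chisq_stat <- function(observed, expected) {
  # Both frequency distributions must have the same classes and the same total.
  stopifnot(length(observed) == length(expected),
            isTRUE(all.equal(sum(observed), sum(expected))))
  sum((observed - expected)^2 / expected)
}

# Made-up counts for faces 1..6 from 150 rolls (illustrative data only).
obs_die <- c(22, 28, 25, 30, 20, 25)
exp_die <- rep(150 / 6, 6)   # 25 expected rolls per face under H0

score_die <- chisq_stat(obs_die, exp_die)
print(score_die)
```

The check inside the function enforces the requirement that the observed and expected frequencies add up to the same total.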





Determining the Critical Chi-Square Score

Once we have computed the test statistic, we compare it against the critical value $\chi^2_c$, which can be obtained in either of two ways:

- It can be looked up in a table of right-tail areas for the chi-square distribution, using the degrees of freedom d.f. $= n - 1$ and the chosen significance level $\alpha$, or

- One can use the R function qchisq with first parameter $1 - \alpha$ and second parameter d.f. $= n - 1$. This function expects a left-tail area as its first parameter and returns the corresponding chi-square value, in contrast to the table given in Appendix A, which is indexed by right-tail area. This is why $1 - \alpha$ is given as the first parameter instead of $\alpha$.

If the chi-square score $\chi^2$ is greater than this critical value $\chi^2_c$, then we reject $H_0$; otherwise we do not reject $H_0$.

Because the test statistic is never negative, and it grows as the observed frequencies depart from the expected frequencies in either direction, the chi-square goodness-of-fit test is always a one-tail test.
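The lookup and the decision rule can be sketched in R as follows. The number of classes and the score used here are hypothetical values chosen for illustration; `qchisq` and its `lower.tail` argument are part of base R.

```r
alpha <- 0.05   # chosen significance level
n     <- 6      # number of classes (for instance, the six faces of a die)
df    <- n - 1  # degrees of freedom

# qchisq expects a LEFT-tail area, so pass 1 - alpha ...
crit_left <- qchisq(1 - alpha, df)
# ... or pass the right-tail area alpha together with lower.tail = FALSE.
crit_right <- qchisq(alpha, df, lower.tail = FALSE)

# Decision rule: reject H0 only when the score exceeds the critical value.
# The score below is a made-up value used purely to illustrate the comparison.
score     <- 2.72
reject_h0 <- score > crit_left

print(crit_left)
print(reject_h0)
```

Both calls return the same critical value; the second form mirrors a right-tail table more directly.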





Characteristics of a Chi-Square Distribution

The chi-square distribution is of a very different character than the other distributions that we have seen.

If $Z_1, Z_2, \ldots, Z_n$ are independent standard normal random variables, then the random variable $Q$ defined by

$$Q = \sum_{i=1}^{n} Z_i^2$$

follows the chi-square distribution with $n$ degrees of freedom.

It is not symmetric; rather, it is skewed to the right and takes only nonnegative values, with zero as the leftmost value of the distribution.

However, as the number of degrees of freedom (d.f.) increases, the distribution becomes more symmetric.
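The definition of $Q$ can be checked by simulation. The sketch below is an illustration under assumed settings (the seed, the number of degrees of freedom, and the number of simulated values are arbitrary choices, not taken from the lecture). It relies on the standard fact that a chi-square variable with $n$ degrees of freedom has mean $n$.

```r
set.seed(1)       # arbitrary seed, for reproducibility only
df     <- 4       # number of squared standard normals in each sum
n_sims <- 20000   # number of simulated values of Q

# Each row holds df independent standard normal draws;
# Q is the sum of their squares.
z <- matrix(rnorm(n_sims * df), nrow = n_sims, ncol = df)
q <- rowSums(z^2)

# The sample mean of Q should be close to df.
mean_q <- mean(q)

# About 95% of the simulated values should fall below the
# theoretical 95th percentile of the chi-square distribution.
frac_below <- mean(q <= qchisq(0.95, df))

print(c(mean_q = mean_q, frac_below = frac_below))
```

Calling `hist(q)` on the simulated values shows the right skew, and rerunning with a larger `df` shows the shape becoming more symmetric.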





Characteristics, cont’d

The probability density function for this distribution is

$$f_n(x) = \frac{1}{2^{n/2}\,\Gamma(n/2)}\, x^{n/2-1} e^{-x/2}, \qquad x > 0,$$

where $n$ is the number of degrees of freedom and $\Gamma$ is the gamma function.
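As a quick consistency check, the density formula can be coded by hand and compared with R's built-in `dchisq`. The helper name `chisq_pdf` and the evaluation points are hypothetical choices for this sketch.

```r
# Density of the chi-square distribution with n degrees of freedom, for x > 0.
# chisq_pdf is a hypothetical helper name; the built-in equivalent is dchisq.
chisq_pdf <- function(x, n) {
  x^(n / 2 - 1) * exp(-x / 2) / (2^(n / 2) * gamma(n / 2))
}

x       <- c(0.5, 1, 2, 5, 10)   # arbitrary positive evaluation points
manual  <- chisq_pdf(x, n = 3)
builtin <- dchisq(x, df = 3)

print(cbind(x, manual, builtin))
```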





A Goodness-of-Fit Test with the Binomial Distribution

Suppose a coin is flipped 10 times, and the number of times it comes up heads is recorded.

Then, this process is repeated several times, for a total of 100 sequences of 10 flips each.

Since coin flips are Bernoulli trials, the number of heads in a sequence follows a binomial distribution, which yields the expected number of sequences that produce $k$ heads, for $k = 0, 1, \ldots, 10$.
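The expected number of sequences for each value of $k$ is 100 times the binomial probability of $k$ heads. A minimal sketch, assuming a fair coin (probability of heads 0.5); the variable names are illustrative:

```r
n_flips     <- 10    # flips per sequence
n_sequences <- 100   # number of sequences
p_heads     <- 0.5   # assumption: the coin is fair

heads    <- 0:n_flips
# Expected number of sequences with k heads = 100 * P(k heads in 10 flips).
expected <- n_sequences * dbinom(heads, size = n_flips, prob = p_heads)

print(data.frame(heads, expected), digits = 5)
```

Because the binomial probabilities sum to 1, the expected frequencies sum to 100, matching the total number of observed sequences.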





Observed and Expected Values

Number of heads   Observed Sequences   Expected Sequences
 0                 1                    0.098
 1                 2                    0.977
 2                 3                    4.395
 3                 9                   11.719
 4                18                   20.508
 5                26                   24.609
 6                21                   20.508
 7                13                   11.719
 8                 5                    4.395
 9                 2                    0.977
10                 0                    0.098





Performing the Chi-Square Test

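Our null hypothesis $H_0$ is that the number of heads does in fact follow a binomial distribution.

The chi-square score is

$$
\begin{aligned}
\chi^2 &= \sum_{i=0}^{10} \frac{(O_i - E_i)^2}{E_i} \\
&= \frac{(1 - 0.098)^2}{0.098} + \frac{(2 - 0.977)^2}{0.977} + \frac{(3 - 4.395)^2}{4.395} + \frac{(9 - 11.719)^2}{11.719} \\
&\quad + \frac{(18 - 20.508)^2}{20.508} + \frac{(26 - 24.609)^2}{24.609} + \frac{(21 - 20.508)^2}{20.508} + \frac{(13 - 11.719)^2}{11.719} \\
&\quad + \frac{(5 - 4.395)^2}{4.395} + \frac{(2 - 0.977)^2}{0.977} + \frac{(0 - 0.098)^2}{0.098} \\
&= 12.274,
\end{aligned}
$$

where the final value is computed from the unrounded expected frequencies.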
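The sum can be reproduced term by term in R. This sketch uses the observed counts from the table and the unrounded binomial expected frequencies; the variable names are illustrative.

```r
# Observed number of sequences with 0, 1, ..., 10 heads (from the table).
obs <- c(1, 2, 3, 9, 18, 26, 21, 13, 5, 2, 0)
# Unrounded expected frequencies under H0 (fair coin, 100 sequences of 10 flips).
expected <- 100 * dbinom(0:10, size = 10, prob = 0.5)

terms <- (obs - expected)^2 / expected   # contribution of each class
chi2  <- sum(terms)

print(round(terms, 3))
print(chi2)
```

Printing the individual terms shows that the class with 0 heads contributes most of the score, because its expected frequency is very small.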





And the Verdict is...

This is compared to the critical value $\chi^2_c$, with degrees of freedom d.f. $= n - 1 = 10$, since there are $n = 11$ classes, and with level of significance $\alpha = 0.05$.

We can use the R expression qchisq(1-0.05,10) to obtain $\chi^2_c = 18.307$.

Since $\chi^2 < \chi^2_c$, we do not reject $H_0$: the data are consistent with the number of heads from each sequence of 10 flips following a binomial distribution, as expected.
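The comparison can also be phrased in terms of a p-value, the right-tail area beyond the observed score. A minimal sketch using base R's `qchisq` and `pchisq`, with the score taken from the computation above:

```r
chi2  <- 12.274   # chi-square score from the coin-flip example
df    <- 11 - 1   # eleven classes (0 through 10 heads)
alpha <- 0.05

crit    <- qchisq(1 - alpha, df)                 # critical value
p_value <- pchisq(chi2, df, lower.tail = FALSE)  # right-tail area beyond the score

# Rejecting when chi2 > crit is equivalent to rejecting when p_value < alpha.
reject_h0 <- chi2 > crit

print(c(critical = crit, p_value = p_value))
print(reject_h0)
```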





Chi-Square Goodness-of-fit Test in R

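> obs = c(1, 2, 3, 9, 18, 26, 21, 13, 5, 2, 0)   # observed sequences with 0, 1, ..., 10 heads
> pexp = dbinom(0:10, 10, 0.5)                   # binomial probabilities under H0
> chisq.test(obs, p = pexp)                      # goodness-of-fit test against pexp

        Chi-squared test for given probabilities

data:  obs
X-squared = 12.2743, df = 10, p-value = 0.2671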





Chi-Square Test for Independence

Now, we use the chi-square distribution to test whether two given random variables are statistically independent.

For this test, the null hypothesis $H_0$ is that the variables are independent, while the alternative hypothesis $H_1$ is that they are not.




Contingency Tables

To compute the test statistic, we construct a contingency table, which is a two-dimensional array, or matrix, in which each cell contains the observed frequency of an ordered pair of values of the two variables.

That is, the entry in row $i$, column $j$, which we denote by $O_{i,j}$, contains the number of observations that fall into class $i$ of the first variable and class $j$ of the second.

The frequencies in this table play the role of the observed frequencies in the chi-square statistic, just as in the goodness-of-fit test.




Computing Expected Frequencies

Next, for each row $i$ and each column $j$, we compute $E_{i,j}$, which is:

(sum of entries in row $i$) × (sum of entries in column $j$),

divided by the total number of observations. These values play the role of the expected frequencies in the chi-square statistic.




Relation to Independent Events

That is, if the contingency table has $m$ rows and $n$ columns, then

$$E_{i,j} = \frac{\left(\sum_{k=1}^{n} O_{i,k}\right)\left(\sum_{\ell=1}^{m} O_{\ell,j}\right)}{\sum_{\ell=1}^{m} \sum_{k=1}^{n} O_{\ell,k}}.$$

It should be noted that this quantity, divided again by the total number of observations, is exactly $P(A_i)P(B_j)$ as estimated from the sample, where $A_i$ is the event that the first variable falls into class $i$, and $B_j$ is the event that the second variable falls into class $j$.

By the multiplication rule, this probability would equal $P(A_i \cap B_j)$ if the variables were independent.
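In R, the whole matrix of expected frequencies can be formed at once as an outer product of the row and column totals. The sketch below is illustrative only: the helper name `expected_counts` and the 2 × 2 table are hypothetical.

```r
# Expected frequencies under independence:
# (row total) * (column total) / (grand total), for every cell.
# expected_counts is a hypothetical helper name; tab is a matrix of observed counts.
expected_counts <- function(tab) {
  outer(rowSums(tab), colSums(tab)) / sum(tab)
}

# Made-up 2 x 2 table, used only to exercise the function.
tab <- matrix(c(10, 20,
                30, 40), nrow = 2, byrow = TRUE)
E <- expected_counts(tab)

# Dividing E by the grand total gives the product of the estimated
# marginal probabilities P(A_i) * P(B_j).
p_row <- rowSums(tab) / sum(tab)
p_col <- colSums(tab) / sum(tab)

print(E)
print(all.equal(E / sum(tab), outer(p_row, p_col)))
```

The expected table has the same row and column totals as the observed table, so both frequency distributions contain the same number of observations.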




The Test Statistic

Then, the test statistic is

$$\chi^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}}.$$

We then obtain the critical value $\chi^2_c$ using d.f. $= (m - 1)(n - 1)$ and our chosen level of significance $\alpha$.

As before, if $\chi^2 > \chi^2_c$, then we reject $H_0$ and conclude that the variables are in fact statistically dependent.




Example

Suppose that 300 voters were surveyed, and classified according to gender and political affiliation: Democrat, Republican, or Independent.

The contingency table for these classifications is as follows:

                       Affiliation
Gender    Democrat   Republican   Independent   Total
Female       68          56           32         156
Male         52          72           20         144
Total       120         128           52         300

That is, 68 of the voters are female and Democrat, 72 of the voters are male and Republican, and so on.

The entry in row $i$ and column $j$ is the observation $O_{i,j}$.




Computing Expected Frequencies

Let $G_i$ be the event that the voter is of the gender for row $i$, $i = 1, 2$, and let $A_j$ be the event that the voter's affiliation corresponds to column $j$, $j = 1, 2, 3$. Then, we compute the expected frequencies $E_{i,j} = 300\,P(G_i)\,P(A_j)$ as follows:

(i, j)    G_i ∩ A_j               E_{i,j}
(1, 1)    Female, Democrat        (156)(120)/300 = 62.40
(1, 2)    Female, Republican      (156)(128)/300 = 66.56
(1, 3)    Female, Independent     (156)(52)/300  = 27.04
(2, 1)    Male, Democrat          (144)(120)/300 = 57.60
(2, 2)    Male, Republican        (144)(128)/300 = 61.44
(2, 3)    Male, Independent       (144)(52)/300  = 24.96
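The six expected frequencies can be obtained in one step from the row and column totals of the survey table. A minimal sketch; the variable names and the dimension labels are illustrative.

```r
# Observed counts from the voter survey (rows: gender, columns: affiliation).
voters <- matrix(c(68, 56, 32,
                   52, 72, 20),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Gender = c("Female", "Male"),
                                 Affiliation = c("Democrat", "Republican",
                                                 "Independent")))

# E[i, j] = (row i total) * (column j total) / (grand total)
expected <- outer(rowSums(voters), colSums(voters)) / sum(voters)

print(rowSums(voters))   # 156 female, 144 male
print(colSums(voters))   # 120 Democrat, 128 Republican, 52 Independent
print(expected)
```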




The Test Statistic

Then, the test statistic is

$$
\begin{aligned}
\chi^2 &= \sum_{i=1}^{2} \sum_{j=1}^{3} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}} \\
&= \frac{(68 - 62.4)^2}{62.4} + \frac{(56 - 66.56)^2}{66.56} + \frac{(32 - 27.04)^2}{27.04} + \frac{(52 - 57.60)^2}{57.60} + \cdots \\
&= 6.433.
\end{aligned}
$$

We compare this value against the critical value $\chi^2_c$, with degrees of freedom d.f. $= (2 - 1)(3 - 1) = 2$ and significance level 0.05.

Since this value is $\chi^2_c = 5.991$, and $\chi^2 > \chi^2_c$, we reject the null hypothesis that gender and political affiliation are independent.
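The full calculation for the voter data can be carried out by hand in R, without calling `chisq.test`. This is a sketch with illustrative variable names; the p-value is the right-tail area beyond the score.

```r
# Observed counts (rows: Female, Male; columns: Democrat, Republican, Independent).
voters <- matrix(c(68, 56, 32,
                   52, 72, 20), nrow = 2, byrow = TRUE)

# Expected frequencies under independence.
expected <- outer(rowSums(voters), colSums(voters)) / sum(voters)

chi2    <- sum((voters - expected)^2 / expected)        # test statistic
df      <- (nrow(voters) - 1) * (ncol(voters) - 1)      # (m - 1)(n - 1)
crit    <- qchisq(1 - 0.05, df)                         # critical value at alpha = 0.05
p_value <- pchisq(chi2, df, lower.tail = FALSE)         # right-tail area

print(c(chi2 = chi2, df = df, critical = crit, p_value = p_value))
```

Since the score exceeds the critical value, and equivalently the p-value is below 0.05, the null hypothesis of independence is rejected.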




Independence Test in R

> M = matrix(c(68, 52, 56, 72, 32, 20), nrow = 2, ncol = 3)   # filled column by column
> chisq.test(M)

        Pearson's Chi-squared test

data:  M
X-squared = 6.4329, df = 2, p-value = 0.0401

