
Review of Data Measurement Scales
The Chi-Square Goodness-of-Fit Test
Chi-Square Test for Independence

Introduction to Statistical Data Analysis
Lecture 7: The Chi-Square Distribution

James V. Lambers

Department of Mathematics
The University of Southern Mississippi

James V. Lambers Statistical Data Analysis 1 / 24


Introduction

In this lecture, we will use hypothesis testing for new purposes:

- To determine whether a given data set follows a specific probability distribution, and
- To determine whether two random variables are statistically independent.


Review of Data Measurement Scales

Recall from Lecture 1 that there are four data measurement scales: nominal, ordinal, interval, and ratio.

The hypothesis testing techniques presented in Lecture 6 apply only to the more quantitative scales, interval and ratio.

Now, though, we can use hypothesis testing for data measured in nominal or ordinal scales as well.

This is because we are working with frequency distributions, which can be constructed from any data set, regardless of its measurement scale.


Stating the Hypotheses
Observed and Expected Frequencies
Calculating the Chi-Square Statistic
Determining the Critical Chi-Square Score
Characteristics of a Chi-Square Distribution
A Goodness-of-Fit Test with the Binomial Distribution

The Chi-Square Goodness-of-Fit Test

The chi-square goodness-of-fit test uses a sample to determine whether the frequency distribution of the population conforms to a particular probability distribution that it is believed to follow.

Example: Suppose that a six-sided die is rolled 150 times, and the result of each roll is recorded. If the die is fair, the number of rolls that are a 1, 2, 3, 4, 5, or 6 should follow a uniform distribution.

A chi-square goodness-of-fit test can be used to compare the observed number of rolls for each value, from 1 to 6, to the expected number of rolls for each value, which is 150/6 = 25.


Stating the Hypotheses

For the chi-square goodness-of-fit test, the null hypothesis $H_0$ is that the population does follow the predicted distribution, and the alternative hypothesis $H_1$ is that it does not.


Observed and Expected Frequencies

The chi-square goodness-of-fit test works with two frequency distributions, with the same classes, and frequencies denoted by $\{O_i\}$ and $\{E_i\}$, respectively.

Each frequency $O_i$ is the actual number of observations from the sample that belong to the $i$th class.

Each frequency $E_i$ is the expected number of observations that should belong to class $i$, assuming $H_0$ is true.

It is essential that the total number of observations in both frequency distributions is equal; that is,

\[ \sum_{i=1}^{n} O_i = \sum_{i=1}^{n} E_i, \]

where $n$ is the number of classes.
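As a minimal sketch of how expected frequencies can be built so that the two totals agree, the R snippet below returns to the die example. The observed counts are hypothetical (they are not from the lecture), and the names observed, p_null, and expected are chosen only for illustration.

```r
# Hypothetical observed counts for 150 rolls of a die (not from the lecture).
observed <- c(22, 28, 25, 30, 20, 25)

# Under H0 (a fair die), each of the 6 classes has probability 1/6.
p_null <- rep(1 / 6, 6)

# Expected frequency of each class: total observations times class probability.
expected <- sum(observed) * p_null

# Both frequency distributions must account for the same number of observations.
totals_match <- isTRUE(all.equal(sum(observed), sum(expected)))

print(expected)      # 25 for every face
print(totals_match)  # TRUE
```

Because the class probabilities sum to 1, scaling them by the sample size guarantees that the expected frequencies sum to the number of observations.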


Calculating the Chi-Square Statistic

The test statistic for the chi-square goodness-of-fit test, also known as the chi-square score, is given by

\[ \chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}, \]

where, as before, $n$ is the number of classes.
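The formula translates almost directly into R. The following is a sketch only: chi_square_score is a hypothetical helper name, and the die counts are made-up data used to exercise it.

```r
# Chi-square score: sum over classes of (O_i - E_i)^2 / E_i.
chi_square_score <- function(observed, expected) {
  # The two frequency distributions must share the same classes,
  # and every expected frequency must be positive to divide by it.
  stopifnot(length(observed) == length(expected), all(expected > 0))
  sum((observed - expected)^2 / expected)
}

# Hypothetical counts for 150 rolls of a die; a fair die expects 25 per face.
observed <- c(22, 28, 25, 30, 20, 25)
expected <- rep(25, 6)

score <- chi_square_score(observed, expected)
print(score)  # 2.72
```

Each class contributes its squared deviation scaled by the expected frequency, so a deviation of 5 counts for more when only 25 observations were expected than it would if 250 were expected.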


Determining the Critical Chi-Square Score

Once we have computed the test statistic, we compare it against the critical value $\chi^2_c$, which can be obtained as follows:

- It can be looked up in a table of right-tail areas for the chi-square distribution, with degrees of freedom $d.f. = n - 1$ and the chosen significance level $\alpha$, or
- One can use the R function qchisq with first parameter $1 - \alpha$ and second parameter $d.f. = n - 1$. This function returns the value whose left-tail area equals its first parameter, in contrast to the table given in Appendix A, which is indexed by right-tail area; this is why $1 - \alpha$ is given as the first parameter instead of $\alpha$.

If the chi-square score $\chi^2$ is greater than this critical value $\chi^2_c$, then we reject $H_0$; otherwise we do not reject $H_0$.

Because the test statistic is never negative, and only large values indicate a poor fit, the chi-square goodness-of-fit test is always a one-tail (right-tail) test.
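The decision rule can be sketched in R as follows. The score 2.72 and the choice of 6 classes are hypothetical values for a die-rolling experiment; qchisq and pchisq are the standard R quantile and distribution functions for the chi-square distribution.

```r
alpha <- 0.05        # chosen significance level
n_classes <- 6       # hypothetical: six faces of a die
df <- n_classes - 1  # degrees of freedom, d.f. = n - 1

# Critical value: the point with left-tail area 1 - alpha (right-tail area alpha).
critical <- qchisq(1 - alpha, df)
# The same value, requested by right-tail area instead:
critical_right <- qchisq(alpha, df, lower.tail = FALSE)

score <- 2.72        # hypothetical chi-square score
reject_h0 <- score > critical

# Equivalent view: the right-tail area beyond the score (the p-value).
p_value <- pchisq(score, df, lower.tail = FALSE)

print(round(critical, 3))  # 11.07
print(reject_h0)           # FALSE
```

Comparing the score with the critical value and comparing the p-value with $\alpha$ always lead to the same decision, since both describe the same right-tail area.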


Characteristics of a Chi-Square Distribution

The chi-square distribution is of a very different character than the other distributions that we have seen.

If $Z_1, Z_2, \ldots, Z_n$ are independent standard normal random variables, then the random variable $Q$ defined by

\[ Q = \sum_{i=1}^{n} Z_i^2 \]

follows the chi-square distribution with $n$ degrees of freedom.

It is not symmetric; rather, it is skewed to the right, with a long right tail and a lower bound of zero, which is the leftmost value of the distribution.

However, as the number of degrees of freedom ($d.f.$) increases, the distribution becomes more symmetric.
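The definition of $Q$ can be checked by simulation. This is a sketch under arbitrary assumptions: the seed, the choice of 4 degrees of freedom, and the number of repetitions are illustrative only.

```r
set.seed(1)    # arbitrary seed, for reproducibility only
n <- 4         # degrees of freedom (hypothetical choice)
reps <- 20000  # number of simulated values of Q

# Each row holds n independent standard normal draws Z_1, ..., Z_n.
z <- matrix(rnorm(reps * n), nrow = reps, ncol = n)

# Q is the sum of the squared entries in each row.
q <- rowSums(z^2)

# Compare the simulation with the theoretical chi-square distribution.
sim_mean <- mean(q)                   # theory: the mean equals n
sim_q95 <- unname(quantile(q, 0.95))  # theory: qchisq(0.95, n)
theory_q95 <- qchisq(0.95, df = n)

print(c(sim_mean = sim_mean, sim_q95 = sim_q95, theory_q95 = theory_q95))
```

Calling hist(q) on the simulated values shows the right skew; repeating the experiment with a larger n should give a visibly more symmetric histogram.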


Characteristics, cont’d

The probability density function for this distribution is

\[ f_n(x) = \frac{1}{2^{n/2}\,\Gamma(n/2)}\, x^{n/2-1} e^{-x/2}, \quad x > 0, \]

where $n$ is the number of degrees of freedom and $\Gamma$ is the gamma function.
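As a quick check, the density can be coded by hand and compared with R's built-in dchisq. The function name chisq_pdf and the evaluation points are hypothetical choices for this sketch.

```r
# Hand-coded chi-square density with n degrees of freedom.
chisq_pdf <- function(x, n) {
  # The density is zero for x <= 0; gamma() is R's gamma function.
  ifelse(x > 0,
         x^(n / 2 - 1) * exp(-x / 2) / (2^(n / 2) * gamma(n / 2)),
         0)
}

x <- c(0.5, 1, 2, 5, 10)  # arbitrary evaluation points
n <- 3                    # arbitrary degrees of freedom

# Largest discrepancy between the formula and the built-in density.
max_diff <- max(abs(chisq_pdf(x, n) - dchisq(x, df = n)))
print(max_diff < 1e-12)   # TRUE
```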


A Goodness-of-Fit Test with the Binomial Distribution

Suppose a coin is flipped 10 times, and the number of times it comes up heads is recorded.

Then, this process is repeated, for a total of 100 sequences of 10 flips each.

Since coin flips are Bernoulli trials, the number of heads in a sequence follows a binomial distribution, which yields the expected number of sequences that produce $k$ heads, for $k = 0, 1, \ldots, 10$: 100 times the binomial probability of $k$ heads in 10 flips of a fair coin.


Observed and Expected Values

Number of heads   Observed sequences   Expected sequences
       0                   1                  0.098
       1                   2                  0.977
       2                   3                  4.395
       3                   9                 11.719
       4                  18                 20.508
       5                  26                 24.609
       6                  21                 20.508
       7                  13                 11.719
       8                   5                  4.395
       9                   2                  0.977
      10                   0                  0.098
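The expected column of the table can be reproduced with dbinom, assuming a fair coin (probability of heads 0.5), as in the R session later in this lecture. The variable names below are illustrative only.

```r
heads <- 0:10       # possible numbers of heads in 10 flips
n_sequences <- 100  # number of sequences of 10 flips

# Binomial probability of k heads in 10 flips of a fair coin.
p_heads <- dbinom(heads, size = 10, prob = 0.5)

# Expected number of sequences with k heads.
expected <- n_sequences * p_heads

print(round(expected, 3))
# 0.098 0.977 4.395 11.719 20.508 24.609 20.508 11.719 4.395 0.977 0.098
```

The expected frequencies sum to 100, matching the total of the observed column, as the test requires.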


Performing the Chi-Square Test

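Our null hypothesis $H_0$ is that the number of heads does in fact follow a binomial distribution.

The chi-square score is

\[
\begin{aligned}
\chi^2 &= \sum_{i=0}^{10} \frac{(O_i - E_i)^2}{E_i} \\
&= \frac{(1 - 0.098)^2}{0.098} + \frac{(2 - 0.977)^2}{0.977} + \frac{(3 - 4.395)^2}{4.395} + \frac{(9 - 11.719)^2}{11.719} \\
&\quad + \frac{(18 - 20.508)^2}{20.508} + \frac{(26 - 24.609)^2}{24.609} + \frac{(21 - 20.508)^2}{20.508} + \frac{(13 - 11.719)^2}{11.719} \\
&\quad + \frac{(5 - 4.395)^2}{4.395} + \frac{(2 - 0.977)^2}{0.977} + \frac{(0 - 0.098)^2}{0.098} \\
&= 12.274.
\end{aligned}
\]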
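The sum above can be reproduced term by term in R. This sketch uses the unrounded expected frequencies from dbinom, assuming a fair coin; the names terms and score are illustrative.

```r
# Observed numbers of sequences with 0, 1, ..., 10 heads (from the table).
observed <- c(1, 2, 3, 9, 18, 26, 21, 13, 5, 2, 0)

# Expected numbers under H0: 100 sequences, binomial with 10 flips, p = 0.5.
expected <- 100 * dbinom(0:10, size = 10, prob = 0.5)

# Contribution of each class to the chi-square score.
terms <- (observed - expected)^2 / expected
score <- sum(terms)

print(round(terms, 3))
print(round(score, 3))  # 12.274
```

Printing the individual terms shows that most of the score comes from the class with 0 heads, where a single observed sequence is compared with an expected frequency of about 0.098.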


And the Verdict is...

This is compared to the critical value $\chi^2_c$, with degrees of freedom $d.f. = n - 1 = 10$, since there are $n = 11$ classes, and level of significance $\alpha = 0.05$.

We can use the R expression qchisq(1-0.05,10) to obtain $\chi^2_c = 18.307$.

Since $\chi^2 < \chi^2_c$, we do not reject $H_0$: the data give no significant evidence against the claim that the number of heads in each sequence of 10 flips follows a binomial distribution, as expected.
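The comparison can be sketched in R as below, using the score 12.274 computed on the previous slide. The p-value line is an addition for illustration; it expresses the same decision as a right-tail area.

```r
score <- 12.274   # chi-square score for the coin-flip data
n_classes <- 11   # 0, 1, ..., 10 heads
df <- n_classes - 1
alpha <- 0.05

critical <- qchisq(1 - alpha, df)  # about 18.307
reject_h0 <- score > critical      # FALSE: do not reject H0

# Right-tail area beyond the score; compare with alpha.
p_value <- pchisq(score, df, lower.tail = FALSE)  # about 0.267

print(round(c(critical = critical, p_value = p_value), 3))
```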


Chi-Square Goodness-of-fit Test in R

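> # Observed numbers of sequences with 0, 1, ..., 10 heads
> obs=c(1,2,3,9,18,26,21,13,5,2,0)
> # Binomial probabilities of 0, 1, ..., 10 heads in 10 flips of a fair coin
> pexp=dbinom(0:10,10,0.5)
> # Goodness-of-fit test of obs against those probabilities
> chisq.test(obs,p=pexp)

        Chi-squared test for given probabilities

data:  obs
X-squared = 12.2743, df = 10, p-value = 0.2671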


Chi-Square Test for Independence

Now, we use the chi-square distribution to test whether two given random variables are statistically independent.

For this test, the null hypothesis $H_0$ is that the variables are independent, while the alternative hypothesis $H_1$ is that they are not.


Contingency Tables

To compute the test statistic, we construct a contingency table, which is a two-dimensional array, or matrix, in which each cell contains the observed frequency of an ordered pair of values of the two variables.

That is, the entry in row $i$, column $j$, which we denote by $O_{i,j}$, contains the number of observations that fall into class $i$ of the first variable and class $j$ of the second.

The frequencies in this table play the role of the observed frequencies in the chi-square goodness-of-fit test.


Computing Expected Frequencies

Next, for each row $i$ and each column $j$, we compute $E_{i,j}$, which is

(sum of entries in row $i$) × (sum of entries in column $j$),

divided by the total number of observations. These values play the role of the expected frequencies in the chi-square goodness-of-fit test.


Relation to Independent Events

That is, if the contingency table has $m$ rows and $n$ columns, then

\[ E_{i,j} = \frac{\left( \sum_{k=1}^{n} O_{i,k} \right) \left( \sum_{\ell=1}^{m} O_{\ell,j} \right)}{\sum_{\ell=1}^{m} \sum_{k=1}^{n} O_{\ell,k}}. \]

It should be noted that this quantity, divided again by the total number of observations, is exactly $P(A_i)P(B_j)$, where $A_i$ is the event that the first variable falls into class $i$, and $B_j$ is the event that the second variable falls into class $j$.

By the multiplication rule, this probability would equal $P(A_i \cap B_j)$ if the variables were independent.
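The relation between $E_{i,j}$ and $P(A_i)P(B_j)$ can be illustrated on a small table. The 2-by-3 table O below is hypothetical data chosen to give round numbers; the probabilities are the relative frequencies of the row and column totals.

```r
# Hypothetical 2 x 3 contingency table of observed frequencies.
O <- matrix(c(10, 20, 30,
              20, 10, 10), nrow = 2, byrow = TRUE)
N <- sum(O)  # total number of observations (100)

# E[i, j] = (sum of row i) * (sum of column j) / N.
E <- outer(rowSums(O), colSums(O)) / N

# Relative frequencies of the row classes A_i and column classes B_j.
p_row <- rowSums(O) / N
p_col <- colSums(O) / N

# E[i, j] / N should equal P(A_i) * P(B_j) for every cell.
same <- isTRUE(all.equal(E / N, outer(p_row, p_col)))

print(E)     # rows: 18 18 24 and 12 12 16
print(same)  # TRUE
```

A side effect of this construction is that the expected table has the same row and column totals as the observed table.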


The Test Statistic

Then, the test statistic is

\[ \chi^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}}. \]

We then obtain the critical value $\chi^2_c$ using $d.f. = (m - 1)(n - 1)$ and our chosen level of significance $\alpha$.

As before, if $\chi^2 > \chi^2_c$, then we reject $H_0$ and conclude that the variables are in fact statistically dependent.


Example

Suppose that 300 voters were surveyed, and classified according to gender and political affiliation: Democrat, Republican, or Independent.

The contingency table for these classifications is as follows:

                      Affiliation
Gender    Democrat   Republican   Independent   Total
Female       68          56           32         156
Male         52          72           20         144
Total       120         128           52         300

That is, 68 of the voters are female and Democrat, 72 of the voters are male and Republican, and so on.

The entry in row $i$ and column $j$ is the observation $O_{i,j}$.


Computing Expected Frequencies

Let $G_i$ be the event that the voter is of the gender for row $i$, $i = 1, 2$, and let $A_j$ be the event that the voter's affiliation corresponds to column $j$, $j = 1, 2, 3$. Then, we compute the expected frequencies as follows:

(i, j)    G_i ∩ A_j              E_{i,j} = 300 P(G_i) P(A_j)
(1, 1)    Female, Democrat       (156)(120)/300 = 62.40
(1, 2)    Female, Republican     (156)(128)/300 = 66.56
(1, 3)    Female, Independent    (156)(52)/300  = 27.04
(2, 1)    Male, Democrat         (144)(120)/300 = 57.60
(2, 2)    Male, Republican       (144)(128)/300 = 61.44
(2, 3)    Male, Independent      (144)(52)/300  = 24.96
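These six values can be cross-checked in R. The sketch below computes them from the row and column totals and compares the result with the expected component that chisq.test returns; the dimnames are added only to label the output.

```r
# Observed voter counts from the contingency table (rows: gender).
M <- matrix(c(68, 56, 32,
              52, 72, 20), nrow = 2, byrow = TRUE,
            dimnames = list(Gender = c("Female", "Male"),
                            Affiliation = c("Democrat", "Republican", "Independent")))

# Expected frequencies: (row total) * (column total) / (grand total).
E_manual <- outer(rowSums(M), colSums(M)) / sum(M)

# chisq.test stores the same expected frequencies in its result.
E_builtin <- chisq.test(M)$expected

print(E_manual)
agree <- isTRUE(all.equal(E_manual, E_builtin, check.attributes = FALSE))
print(agree)  # TRUE
```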


The Test Statistic

Then, the test statistic is

\[
\begin{aligned}
\chi^2 &= \sum_{i=1}^{2} \sum_{j=1}^{3} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}} \\
&= \frac{(68 - 62.4)^2}{62.4} + \frac{(56 - 66.56)^2}{66.56} + \frac{(32 - 27.04)^2}{27.04} + \frac{(52 - 57.60)^2}{57.60} + \cdots \\
&= 6.433.
\end{aligned}
\]

We compare this value against the critical value $\chi^2_c$, with degrees of freedom $d.f. = (2 - 1)(3 - 1) = 2$ and significance level 0.05.

Since this critical value is $\chi^2_c = 5.991$, and $\chi^2 > \chi^2_c$, we reject the null hypothesis that gender and political affiliation are independent.
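The whole calculation for the voter data fits in a few lines of R. This is a sketch that carries out the steps by hand rather than calling chisq.test; the p-value line is an addition for illustration.

```r
# Observed voter counts (rows: Female, Male; columns: Dem., Rep., Ind.).
M <- matrix(c(68, 56, 32,
              52, 72, 20), nrow = 2, byrow = TRUE)

# Expected frequencies under independence.
E <- outer(rowSums(M), colSums(M)) / sum(M)

# Chi-square score: sum over all cells of (O - E)^2 / E.
score <- sum((M - E)^2 / E)          # about 6.433

# Degrees of freedom (m - 1)(n - 1) and critical value at alpha = 0.05.
df <- (nrow(M) - 1) * (ncol(M) - 1)  # 2
critical <- qchisq(1 - 0.05, df)     # about 5.991

reject_h0 <- score > critical        # TRUE: reject independence
p_value <- pchisq(score, df, lower.tail = FALSE)  # about 0.040

print(round(c(score = score, critical = critical, p_value = p_value), 4))
```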


Independence Test in R

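> # matrix() fills column by column: Democrat (68,52), Republican (56,72), Independent (32,20)
> M=matrix(c(68,52,56,72,32,20),nrow=2,ncol=3)
> # Test of independence between the rows (gender) and columns (affiliation)
> chisq.test(M)

        Pearson's Chi-squared test

data:  M
X-squared = 6.4329, df = 2, p-value = 0.0401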
