Chi Square Lecture

8/13/2019 Chi Square Lecture


CHAPTER 11

CHI-SQUARE: NON-PARAMETRIC COMPARISONS OF FREQUENCY

The hypothesis testing statistics detailed thus far in this text have all been designed to

allow comparison of the means of two or more samples to determine if they are significantly

different from each other. Such comparisons can only be conducted when the researcher has

interval level data. While the use of interval level data is preferred by most researchers

because it provides a more precise measurement of the phenomena under consideration, it is

often impossible to obtain. Researchers must then turn to another set of statistical tools that

allow the testing of hypotheses using nominal and ordinal data. These tools are referred to in

the field of statistics as non-parametric tests. A parameter is a quantity which is constant

for a given population. Parameters can also be defined as numerical descriptive measures of

a population. Two major parameters already explained in earlier chapters are measures of

central tendency and variability. For example, the mean is a parameter which describes an

entire distribution of values. Obviously, these parameters cannot be obtained for nominal and

ordinal data. It follows then that statistics not dependent on calculating measures of central

tendency or variability are non-parametric. However, this is not to say that parameters are

not studied when using non-parametric statistics. One just does not know or make

assumptions about any specific values of a parameter. Statisticians generally refer to t tests

and ANOVA tests as parametric statistics.

This chapter introduces and explains the use of Chi-Square, used to test hypotheses

involving nominal data, while the next is devoted to a statistic called Mann-Whitney U which

is employed for hypothesis testing when working with ordinal measures. It should be pointed

out by way of a cautionary note that statistics designed to test hypotheses for nominal and



ordinal data are no better than the data which they are designed to analyze. Interval data are

more precise and accurate. The lower level of precision possible using nominal or ordinal

measures makes non-parametric statistics somewhat less accurate for hypothesis

testing. This limitation is partially addressed through the use of more stringent demands for

statistical significance when non-parametric statistics are used.

CHI-SQUARE

The most frequently used non-parametric statistic for testing hypotheses with nominal

data is Chi-Square. The nature of nominal data, as explained in chapter one, involves

assigning data to mutually exclusive categories, labeling, or naming the data. Nominal data are

most generally analyzed by frequency of occurrence. The non-parametric statistic Chi-

Square is a comparison of relative frequencies among two or more groups. The null

hypothesis for Chi-Square is that there is no statistically significant difference in the relative

frequency of one outcome over another. For example, a possible null hypothesis might be that

there is no statistically significant difference in the relative frequency of Hispanics failing

their first math course in college and the relative frequency of Whites failing their first math

course. In other words, there is no statistical difference between the two groups as measured

by frequency of failure. Nominal data for testing this hypothesis can be organized in a two-

by-two data matrix containing two rows and two columns for pass-fail categories and by

group. This approach to organization is shown for a sample of 100 Hispanics and a sample of

100 Whites in Figure 11:1.



FIGURE 11:1

            Pass   Fail   Total
Whites        50     50     100
Hispanics     50     50     100
Total        100    100     200

In this example, the null hypothesis would be accepted because one can simply observe that there is no difference between Hispanics and Whites. The pass and fail frequencies are the same for both groups. No statistics are necessary for nominal data equally distributed between groups, but not all frequencies are this simple. Generally, decisions about accepting or rejecting null hypotheses require far more complex analyses because differences between samples do occur. Statistical tests are conducted to determine whether these differences are sufficient to suggest a statistically significant difference in the overall populations.

Calculation of the Chi-Square statistic is basically a comparison between observed and expected frequencies. Observed frequencies are the actual nominal data for each characteristic under consideration by the researcher. In the above example, one observes that fifty Whites and fifty Hispanics failed and fifty Whites and fifty Hispanics passed. The expected frequencies are the nominal data results one would expect to find if the null hypothesis is to be accepted. In the above example, one would expect the proportion of pass and fail frequencies for Whites and Hispanics to be the same. The theory behind the Chi-Square statistic is that if the difference between the observed and expected frequencies is large enough, then even allowing for sampling error, the null hypothesis is rejected. One would conclude that a statistically significant difference between two or more groups does exist. By implication, this also means that not all differences between observed and expected frequencies are significant; some are the result of sampling error or are too small to be significant.

The formula for calculating the Chi-Square statistic is:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

Where:

$O$ = the observed frequencies for each position in the matrix

$E$ = the expected frequencies for each position in the matrix
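As a minimal sketch of this calculation, the Python function below sums the squared difference between each observed and expected frequency, divided by the expected frequency. The function name `chi_square` and the flat cell-by-cell list layout are illustrative choices, not part of the text. Applied to the Figure 11:1 data, where every observed count equals its expected count, it returns zero.

```python
def chi_square(observed, expected):
    """Sum (O - E)^2 / E over paired observed and expected frequencies."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Figure 11:1, cell by cell: Whites pass, Whites fail, Hispanics pass, Hispanics fail.
observed = [50, 50, 50, 50]
# Under the null hypothesis both groups share the same pass/fail split: 50 per cell.
expected = [50, 50, 50, 50]

print(chi_square(observed, expected))  # 0.0
```

A value of zero is the numerical counterpart of the observation that the two groups do not differ at all.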

Calculation of the Chi-Square statistic is a simple process involving the use of a solution

matrix. For example, suppose a researcher wanted to test the difference between frequencies

of high or low incomes for men and women in the same profession. A research question

could be stated as follows: Do male lawyers have higher incomes than female lawyers? The

null hypothesis might be stated as follows: There is no statistically significant difference

between the frequencies of the high and low incomes for males and the frequencies of the high

and low incomes for females.

Organizing the solution matrix for the Chi-Square statistic is straightforward. First,

the data are organized by row and column in the form of a data matrix. The actual or observed

values for each place in the data matrix are recorded. Then the values in each row and

column are totaled and the total number of cases under consideration (n) is determined. The

solution matrix will vary in size depending on the number of rows and columns needed to

display the observed frequencies. In Figure 11:2, the observed frequencies of high and low

incomes (nominal) for men and women (nominal) are displayed in a 2 x 2 data matrix.

Figure 11:2: DATA MATRIX

              Men       Women     Total
High Income   15        25        40
              (19.66)   (20.34)
Low Income    14        5         19
              (9.34)    (9.66)
Total         29        30        59

Once the data matrix has been constructed, the expected frequencies for each cell in the matrix can be determined using the formula:

$$E = \frac{(\text{row total})(\text{column total})}{n}$$

For example, for the row 1, column 1 cell of the matrix, which represents high income men, the calculation of the expected frequency is:

$$E = \frac{(40)(29)}{59} = 19.66$$

Row 1 column 2 is calculated:

$$E = \frac{(40)(30)}{59} = 20.34$$

Expected frequencies are similarly obtained for all of the cells of the data matrix and included in parentheses within the data matrix immediately below the observed values. When the expected frequencies have been calculated, the remaining Chi-Square calculations are simple mathematics.
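The expected frequencies can also be computed programmatically. The sketch below assumes the usual rule of row total times column total divided by n, which reproduces the parenthesized values in the Figure 11:2 data matrix. The function name `expected_frequencies` and the nested-list layout are illustrative assumptions.

```python
def expected_frequencies(observed):
    """Expected count for each cell: (row total * column total) / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


# Observed frequencies from Figure 11:2
# (rows: high income, low income; columns: men, women).
observed = [[15, 25], [14, 5]]

for row in expected_frequencies(observed):
    print([round(e, 2) for e in row])
# [19.66, 20.34]
# [9.34, 9.66]
```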

Solution Matrix for $\chi^2$

Row   Column   O    E       O - E   (O - E)^2   (O - E)^2 / E
1     1        15   19.66   -4.66   21.72       1.10
1     2        25   20.34    4.66   21.72       1.07
2     1        14    9.34    4.66   21.72       2.33
2     2         5    9.66   -4.66   21.72       2.25

The value of the Chi-Square statistic is 6.75. The next step in the process of testing the hypothesis requires that the degrees of freedom be determined. The simple formula for finding the degrees of freedom for $\chi^2$ is:

d.f. = (Total Rows - 1) (Total Columns - 1)

In the context of the present example, d.f. = (2-1)(2-1) = 1(1) = 1

By consulting the table in Appendix H, the critical values for $\chi^2$ at .05 and .01 are 3.84 and 6.63 for 1 degree of freedom. The researcher compares the obtained value for $\chi^2$ with the critical value to determine if the observed difference in frequencies is statistically significant. The null hypothesis is rejected at both the .05 and .01 levels (the 95% and 99% confidence levels) in this case because the obtained value is higher than either of the critical values from Appendix H. Therefore, the researcher must conclude that there is a statistically significant difference between the relative frequencies of high and low incomes for men and women. Even allowing for the presence of sampling error, the value of $\chi^2$ is large enough to suggest that a real difference exists between the populations represented by these samples. In this example, the research conclusion is that female lawyers have high incomes with greater relative frequency than male lawyers (25 of 30 women versus 15 of 29 men). A very useful rule for accepting or rejecting the null hypothesis for $\chi^2$ is as follows:

Accept the null hypothesis if the obtained $\chi^2$ is less than the critical value in the table. Reject the null hypothesis if the obtained $\chi^2$ is equal to or greater than the critical value in the table.¹

¹ The critical values are critical because they are the basis for accepting or rejecting the null hypothesis. Since Chi-Square is a statistic based on nominal data, the obtained Chi-Square must be larger than these critical values in the table for a significant difference in the frequencies.
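The whole procedure for the lawyer income example can be sketched in a few lines of Python. The function name `chi_square_test` and the `CRITICAL_DF1` dictionary are illustrative. The critical values are the ones the text quotes from Appendix H for 1 degree of freedom, not values looked up by the code.

```python
def chi_square_test(observed):
    """Return (chi-square, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected frequency
            chi2 += (o - e) ** 2 / e
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df


# Critical values for 1 degree of freedom, as quoted in the text from Appendix H.
CRITICAL_DF1 = {0.05: 3.84, 0.01: 6.63}

chi2, df = chi_square_test([[15, 25], [14, 5]])
print(round(chi2, 2), df)  # 6.75 1
for alpha, critical in CRITICAL_DF1.items():
    decision = "reject" if chi2 >= critical else "accept"
    print(f"alpha={alpha}: {decision} the null hypothesis")
```

Both comparisons lead to rejection, matching the conclusion reached with the solution matrix.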

Under certain circumstances when working with a 2x2 data matrix, the formula used to

calculate the Chi-Square statistic is adjusted slightly. This process is utilized when any of the

expected frequencies within the data matrix are lower than 10. The alternative Chi-Square

formula is known as the Yates' Correction. When expected frequencies are this low,

researchers have determined that it is appropriate to make the standard for rejecting the null

hypothesis more stringent by subtracting .5 from the absolute value of the difference between

each observed and expected frequency before the differences are squared. The formula for

Chi-Square using Yates' Correction is as follows:

$$\chi^2 = \sum \frac{(|O - E| - .5)^2}{E}$$

Applying this correction requires an additional column in the solution matrix and the correction will also reduce the size of Chi-Square. The reduction is an effort to be more conservative and reduce the probability of making the alpha (Type I) error. The comparison of frequencies of men and women in high and low income categories earlier in the chapter provides an example of a context in which Yates' Correction is to be applied. Compare the solution matrix using Yates' Correction presented in Figure 11:3 below with the one produced earlier. Notice the difference in the value of Chi-Square and the difference in statistical conclusions required when Yates' Correction is employed.

FIGURE 11:3: YATES CORRECTION

              Men       Women     Total
High Income   15        25        40
              (19.66)   (20.34)
Low Income    14        5         19
              (9.34)    (9.66)
Total         29        30        59

Row   Column   O    E       O - E   |O - E| - .5   (|O - E| - .5)^2   (|O - E| - .5)^2 / E
1     1        15   19.66   -4.66   4.16           17.31              .88
1     2        25   20.34    4.66   4.16           17.31              .85
2     1        14    9.34    4.66   4.16           17.31              1.85
2     2         5    9.66   -4.66   4.16           17.31              1.79

The obtained value for Chi-Square is 5.37, which is still significant at the .05 level but which is no longer significant at the .01 level.
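A short Python sketch of the corrected calculation follows. The function name `chi_square_yates` is illustrative, and the critical values 3.84 and 6.63 are the ones the text quotes from Appendix H for 1 degree of freedom. Carrying full precision gives roughly 5.38 rather than 5.37, because the 5.37 is the sum of row values that were each rounded to two decimals. The statistical conclusion is the same either way.

```python
def chi_square_yates(observed):
    """Chi-square for a 2x2 table with .5 subtracted from each |O - E|."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected frequency
            chi2 += (abs(o - e) - 0.5) ** 2 / e
    return chi2


chi2 = chi_square_yates([[15, 25], [14, 5]])
print(round(chi2, 2))
print(chi2 >= 3.84)  # True: still significant at the .05 level
print(chi2 >= 6.63)  # False: no longer significant at the .01 level
```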

In summation, the Chi-Square statistic is used to test hypotheses by comparing

observed and expected frequencies of a characteristic for two or more groups. Chi-Square is

not limited to the comparison of two samples. One may have a 5 x 5, 10 x 10, 7 x 10, or any

size data matrix for many independent samples. Unlike the t test, Chi-Square is not used for

dependent samples. In addition, Chi-Square is used only for nominal data, and a researcher

should make use of Yates' Correction when it applies.
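For the matrix sizes mentioned in this summary, the degrees of freedom formula given earlier in the chapter works out as follows. This is a worked illustration of that formula only, not additional results from the text.

```latex
\begin{aligned}
5 \times 5:   &\quad d.f. = (5-1)(5-1)   = 16 \\
10 \times 10: &\quad d.f. = (10-1)(10-1) = 81 \\
7 \times 10:  &\quad d.f. = (7-1)(10-1)  = 54
\end{aligned}
```

Larger matrices therefore require the critical values listed for correspondingly larger degrees of freedom.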



EXERCISES - CHAPTER 11

(1) A researcher wants to determine whether students who had taken a driver's education course sponsored by the school passed their state driver's examination with a higher relative frequency than those who did not take the class. Using the data provided in the 2x2 matrix below and Yates' Correction:

A. Write a null hypothesis

B. Calculate the value for Chi-Square

C. Draw statistical and research conclusions

              Taken Driver's Education
Test Result   Yes   No    Total
Pass          14     6    20
Fail           5    10    15
Total         19    16    35

(2) In a poll of New York residents, the following results were recorded with reference to political ideology and party affiliations. For 65 Republicans: 20 conservative, 35 liberal, and 10 neither. For 120 Democrats: 40 conservative, 70 liberal, and 10 neither. Test a null hypothesis for these data and draw statistical conclusions.
