Chi Square Lecture


  • 8/13/2019 Chi Square Lecture


    CHAPTER 11

    CHI-SQUARE: NON-PARAMETRIC COMPARISONS OF FREQUENCY

    The hypothesis testing statistics detailed thus far in this text have all been designed to

    allow comparison of the means of two or more samples to determine if they are significantly

    different from each other. Such comparisons can only be conducted when the researcher has

    interval level data. While the use of interval level data is preferred by most researchers

    because it provides a more precise measurement of the phenomena under consideration, it is

    often impossible to obtain. Researchers must then turn to another set of statistical tools that

    allow the testing of hypotheses using nominal and ordinal data. These tools are referred to in

    the field of statistics as non-parametric tests. A parameter is a quantity which is constant

    for a given population. Parameters can also be defined as numerical descriptive measures of

    a population. Two major parameters already explained in earlier chapters are measures of

    central tendency and variability. For example, the mean is a parameter which describes an

    entire distribution of values. Obviously, these parameters cannot be obtained for nominal and

    ordinal data. It follows then that statistics not dependent on calculating measures of central

    tendency or variability are non-parametric. However, this is not to say that parameters are

    not studied when using non-parametric statistics. One just does not know or make

    assumptions about any specific values of a parameter. Statisticians generally refer to t-tests

    and ANOVA tests as parametric statistics.

    This chapter introduces and explains the use of Chi-Square, used to test hypotheses

    involving nominal data, while the next is devoted to a statistic called Mann-Whitney U which

    is employed for hypothesis testing when working with ordinal measures. It should be pointed

    out by way of a cautionary note that statistics designed to test hypotheses for nominal and


    ordinal data are no better than the data which they are designed to analyze. Interval data are

    more precise and accurate. The lower level of precision possible using nominal or ordinal

    measures makes non-parametric statistics somewhat less accurate for hypothesis

    testing. This limitation is partially addressed through the use of more stringent demands for

    statistical significance when non-parametric statistics are used.

    CHI-SQUARE

    The most frequently used non-parametric statistic for testing hypotheses with nominal

    data is Chi-Square. The nature of nominal data as explained in chapter one involves

    assigning data to mutually exclusive categories, labeling, or naming the data. Nominal data are

    most generally analyzed by frequency of occurrence. The non-parametric statistic Chi-

    Square is a comparison of relative frequencies among two or more groups. The null

    hypothesis for Chi-Square is that there is no statistically significant difference in the relative

    frequency of one outcome over another. For example, a possible null hypothesis might be that

    there is no statistically significant difference in the relative frequency of Hispanics failing

    their first math course in college and the relative frequency of Whites failing their first math

    course. In other words, there is no statistical difference between the two groups as measured

    by frequency of failure. Nominal data for testing this hypothesis can be organized in a two-

    by-two data matrix containing two rows and two columns for pass-fail categories and by

    group. This approach to organization is shown for a sample of 100 Hispanics and a sample of

    100 Whites in Figure 11:1.


    FIGURE 11:1

                 Pass    Fail    Total
    Whites         50      50      100
    Hispanics      50      50      100
    Total         100     100      200

    In this example, the null hypothesis would be accepted because one can simply observe that

    there is no difference between Hispanics and Whites. The pass and fail frequencies are

    the same for both groups. No statistics are necessary for nominal data equally distributed

    between groups, but not all frequencies are this simple. Generally, decisions relative to

    accepting and rejecting null hypotheses require far more complex analyses because differences

    between samples do occur. Whether or not these differences are sufficient to suggest a

    statistically significant difference in the overall populations is the reason for conducting

    statistical tests.

    Calculation of the Chi-Square statistic is basically a comparison between observed and

    expected frequencies. Observed frequencies are the actual nominal data for each characteristic

    under consideration by the researcher. In the above example, one observes that fifty Whites

    and fifty Hispanics failed and fifty Whites and fifty Hispanics passed. The expected frequencies are

    the nominal data results one would expect to find if the null hypothesis is to be accepted. In

    the above example, one would expect the proportion of pass and fail frequencies for Whites

    and Hispanics to be the same. The theory behind the Chi-Square statistic is that if the

    difference between the observed and expected frequencies is large enough, even allowing for

    sampling error, the null hypothesis is rejected. One would conclude that a statistically

    significant difference between two or more groups does exist. By implication, this also means

    that not all differences between observed and expected frequencies are significant; some are

    the result of sampling error or are too small to be significant.

    The formula for calculating the Chi-Square statistic is:

    $$\chi^2 = \sum \frac{(O - E)^2}{E}$$

    Where:

    $O$ = the observed frequencies for each position in the matrix

    $E$ = the expected frequencies for each position in the matrix
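    The comparison of observed and expected frequencies can be sketched in a few lines of Python. This is only an illustration: the function name chi_square and the flat lists of cell values are assumptions made for the example, not part of the text. It sums the squared difference between each observed and expected frequency, divided by the expected frequency.

```python
def chi_square(observed, expected):
    """Sum (O - E)^2 / E over paired observed and expected frequencies."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Figure 11:1: every cell holds 50 observed cases and 50 expected cases,
# so there is no difference to accumulate.
equal_groups = chi_square([50, 50, 50, 50], [50, 50, 50, 50])
print(equal_groups)  # 0.0
```

    A value of zero corresponds to the situation described for Figure 11:1, where the observed frequencies match the expected frequencies exactly.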

    Calculation of the Chi-Square statistic is a simple process involving the use of a solution

    matrix. For example, suppose a researcher wanted to test the difference between frequencies

    of high or low incomes for men and women in the same profession. A research question

    could be stated as follows: Do male lawyers have higher incomes than female lawyers? The

    null hypothesis might be stated as follows: There is no statistically significant difference

    between the frequencies of the high and low incomes for males and the frequencies of the high

    and low incomes for females.

    Organizing the solution matrix for the Chi-Square statistic is simple and easy. First,

    the data are organized by row and column in the form of a data matrix. The actual or observed

    values for each place in the data matrix are recorded. Then the values in each row and

    column are totaled and the total number of cases under consideration (n) is determined. The

    solution matrix will vary in size depending on the number of rows and columns needed to

    display the observed frequencies. In Figure 11:2, the observed frequencies of high and low

    incomes (nominal) for men and women (nominal) are displayed in a 2 x 2 data matrix.

    Figure 11:2: DATA MATRIX

                    Men          Women        Total
    High Income     15 (19.66)   25 (20.34)   40
    Low Income      14 (9.34)     5 (9.66)    19
    Total           29           30           59

    Once the data matrix has been constructed, the expected frequencies for each cell in the matrix

    can be determined using the formula:

    $$\text{expected frequency} = \frac{\text{row total} \times \text{column total}}{n}$$

    For example, for the row 1, column 1 square of the matrix, which represents high income men,

    the calculation of the expected frequency is:

    $$\frac{40 \times 29}{59} = 19.66$$

    Row 1 column 2 is calculated:

    $$\frac{40 \times 30}{59} = 20.34$$

    Expected frequencies are similarly obtained for all of the squares of the data matrix and

    included in parentheses within the data matrix beside the observed values. When

    the expected frequencies have been calculated, the remaining Chi-Square calculations are

    simple mathematics.
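    As a rough check on the expected frequencies, the row total times column total divided by n rule can be applied to every cell at once. The sketch below assumes the observed frequencies are stored as a list of rows; the function name expected_frequencies is hypothetical and chosen only for this example.

```python
def expected_frequencies(matrix):
    """Return expected counts: (row total * column total) / n for each cell."""
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


# Observed frequencies from Figure 11:2
# (rows: high income, low income; columns: men, women).
observed = [[15, 25], [14, 5]]
expected = expected_frequencies(observed)
rounded = [[round(e, 2) for e in row] for row in expected]
print(rounded)  # [[19.66, 20.34], [9.34, 9.66]]
```

    The rounded values agree with the figures shown in parentheses in the data matrix.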

    Solution Matrix for $\chi^2$

    Row   Column   Observed (O)   Expected (E)   O - E   (O - E)^2   (O - E)^2 / E
    1     1        15             19.66          -4.66   21.72       1.10
    1     2        25             20.34           4.66   21.72       1.07
    2     1        14              9.34           4.66   21.72       2.33
    2     2         5              9.66          -4.66   21.72       2.25

    The value of the Chi-Square statistic is 6.75. The next step in the process of testing the

    hypothesis requires that the degrees of freedom be determined. The simple formula for finding

    the degrees of freedom for $\chi^2$ is:

    d.f. = (Total Rows - 1) (Total Columns - 1)

    In the context of the present example, d.f. = (2-1)(2-1) = 1(1) = 1

    By consulting the table in Appendix H, the critical values for $\chi^2$ at .05 and .01 are

    3.84 and 6.63 for 1 degree of freedom. The researcher compares the obtained value for $\chi^2$

    with the critical value to determine if the observed difference in frequencies is statistically

    significant. The null hypothesis is rejected at both the .05 and .01 levels (the 95% and 99%

    confidence levels) in this case because the obtained value is higher than either of the critical

    values from Appendix H. Therefore, the researcher must conclude that there is a statistically

    significant difference between the relative frequencies of high and low incomes for men and

    women. Even allowing for the presence of sampling error, the value of $\chi^2$ is large enough to

    suggest that a real difference exists between the populations represented by these samples. In

    this example, the research conclusion is that the female lawyers have higher incomes than

    male lawyers. A very useful rule for accepting or rejecting the null hypothesis for $\chi^2$ is as

    follows:

    Accept the null hypothesis if the obtained $\chi^2$ is less than the critical $\chi^2$ values in the

    table. Reject the null hypothesis if the obtained $\chi^2$ is equal to or greater than the critical

    $\chi^2$ values in the table. [1]

    [1] The critical values are critical because they are the basis for accepting or rejecting the null

    hypothesis. Since Chi-Square is a statistic based on nominal data, the obtained Chi-Square must be

    larger than these critical values in the table for a significant difference in the frequencies.
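    The whole procedure for the lawyer income example (expected frequencies, the Chi-Square sum, the degrees of freedom, and the comparison with the critical values) might be sketched as follows. The function name chi_square_from_matrix and the CRITICAL_1_DF dictionary are assumptions made for illustration; the critical values themselves are the ones quoted in the text from Appendix H.

```python
def chi_square_from_matrix(matrix):
    """Return (chi_square, degrees_of_freedom) for a matrix of observed counts."""
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    n = sum(row_totals)
    chi_sq = 0.0
    for i, row in enumerate(matrix):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n
            chi_sq += (obs - exp) ** 2 / exp
    # d.f. = (total rows - 1) * (total columns - 1)
    df = (len(matrix) - 1) * (len(matrix[0]) - 1)
    return chi_sq, df


# Critical values for 1 degree of freedom, as quoted in the text.
CRITICAL_1_DF = {0.05: 3.84, 0.01: 6.63}

chi_sq, df = chi_square_from_matrix([[15, 25], [14, 5]])
# Reject the null hypothesis when the obtained value reaches the critical value.
decisions = {alpha: chi_sq >= crit for alpha, crit in CRITICAL_1_DF.items()}
print(round(chi_sq, 2), df, decisions)  # 6.75 1 {0.05: True, 0.01: True}
```

    Because the sketch carries full precision instead of two-decimal intermediate values, individual cells may differ slightly from the hand calculation, but the total still rounds to 6.75.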

    Under certain circumstances when working with a 2x2 data matrix, the formula used to

    calculate the Chi-Square statistic is adjusted slightly. This process is utilized when any of the

    expected frequencies within the data matrix are lower than 10. The alternative Chi-Square

    formula is known as the Yates' Correction. When expected frequencies are this low,

    researchers have determined that it is appropriate to make the standard for rejecting the null

    hypothesis more stringent by subtracting .5 from the absolute value of the difference between

    each observed and expected frequency before the differences are squared. The formula for

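    Chi-Square using Yates' Correction is as follows:

    $$\chi^2 = \sum \frac{(|O - E| - .5)^2}{E}$$

    where $O$ and $E$ are the observed and expected frequencies for each position in the matrix.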

    Applying this correction requires an additional column in the solution matrix and the


    correction will also reduce the size of Chi-Square. The reduction is an effort to be more

    conservative and reduce the probability of making the alpha error. The comparison of

    frequencies of men and women in high and low income categories earlier in the chapter

    provides an example of a context in which Yates' Correction is to be applied. Compare the

    solution matrix using Yates' Correction presented in Figure 11:3 below with the one produced

    earlier. Notice the difference in the value of Chi-Square and the difference in statistical

    conclusions required when the Yates Correction is employed.

    FIGURE 11:3: YATES CORRECTION

                    Men          Women        Total
    High Income     15 (19.66)   25 (20.34)   40
    Low Income      14 (9.34)     5 (9.66)    19
    Total           29           30           59

    Row   Column   Observed (O)   Expected (E)   O - E   |O - E| - .5   (|O - E| - .5)^2   (|O - E| - .5)^2 / E
    1     1        15             19.66          -4.66   4.16           17.31               .88
    1     2        25             20.34           4.66   4.16           17.31               .85
    2     1        14              9.34           4.66   4.16           17.31              1.85
    2     2         5              9.66          -4.66   4.16           17.31              1.79

    The obtained value for Chi-Square is 5.37, which is still significant at the .05 level but which

    is no longer significant at the .01 level.
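    The corrected calculation can be sketched in the same way as the uncorrected one, subtracting .5 from the absolute value of each difference before squaring. The function name yates_chi_square is hypothetical, and the restriction to a 2 x 2 matrix follows the description in this chapter. Note that carrying full precision gives a value of about 5.38, slightly above the 5.37 obtained by hand from two-decimal intermediate values; the statistical conclusion is the same.

```python
def yates_chi_square(matrix):
    """Chi-Square for a 2 x 2 matrix with .5 subtracted from each |O - E|."""
    if len(matrix) != 2 or any(len(row) != 2 for row in matrix):
        raise ValueError("Yates' Correction is applied to a 2 x 2 matrix")
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(matrix):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n
            total += (abs(obs - exp) - 0.5) ** 2 / exp
    return total


corrected = yates_chi_square([[15, 25], [14, 5]])
# Critical values for 1 degree of freedom quoted earlier: 3.84 (.05) and 6.63 (.01).
print(round(corrected, 2), corrected >= 3.84, corrected >= 6.63)
```

    The corrected value falls between the two critical values for 1 degree of freedom, which matches the conclusion drawn from Figure 11:3.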

    In summation, the Chi-Square statistic is used to test hypotheses by comparing

    observed and expected frequencies of a characteristic for two or more groups. Chi-Square is

    not limited to the comparison of two samples. One may have a 5 x 5, 10 x 10, 7 x 10, or any

    size data matrix for many independent samples. Unlike the t test, Chi-Square is not used for

    dependent samples. In addition, Chi-Square is used only for nominal data, and a researcher

    should make use of Yates' Correction when it applies.
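    For larger data matrices, the degrees of freedom grow quickly, which changes the critical value that must be looked up. A minimal sketch of the d.f. formula applied to the matrix sizes mentioned above follows; the function name degrees_of_freedom and the check for at least two rows and two columns are assumptions made for this example.

```python
def degrees_of_freedom(rows, columns):
    """d.f. = (total rows - 1) * (total columns - 1)."""
    if rows < 2 or columns < 2:
        # Assumption: a comparison needs at least two rows and two columns.
        raise ValueError("a data matrix needs at least two rows and two columns")
    return (rows - 1) * (columns - 1)


for rows, columns in [(2, 2), (5, 5), (10, 10), (7, 10)]:
    print(f"{rows} x {columns}: d.f. = {degrees_of_freedom(rows, columns)}")
```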


    EXERCISES - CHAPTER 11

    (1) A researcher wants to determine whether students who had taken a drivers education

    course sponsored by the school passed their state drivers examination with a higher relative

    frequency than those who did not take the class. Using the data provided in the 2x2 matrix

    below and Yates Correction:

    A. Write a null hypothesis

    B. Calculate the value for Chi-Square

    C. Draw statistical and research conclusions

    (2) In a poll of New York residents, the following results were recorded with

    reference to political ideology and party affiliations. For 65 Republicans: 20

    conservative, 35 liberal, and 10 neither. For 120 Democrats: 40 conservative, 70

    liberal, and 10 neither. Test a null hypothesis for these data and draw statistical

    conclusions.

    Data matrix for exercise (1):

                   Taken Drivers Education
    Test Result    Yes    No    Total
    Pass           14      6    20
    Fail            5     10    15
    Total          19     16    35