Contingency Tables
• Chapters Seven, Sixteen, and Eighteen
• Chapter Seven
– Definition of Contingency Tables
– Basic Statistics
– SPSS program (Crosstabulation)
• Chapter Sixteen
– Basic Probability Theory Concepts
– Test of Hypothesis of Independence
Basic Empirical Situation
• Unit of data.
• Two nominal scales measured for each unit.
– Example: in an interview study, the sex of the respondent and a variable such as whether or not the subject has a cellular telephone.
– Objective is to compare males and females with respect to the fraction who have cellular telephones.
Contingency Table
• One column for each value of the column variable; C is the number of columns.
• One row for each value of the row variable; R is the number of rows.
• R x C contingency table.
Contingency Table
• Each entry is the OBSERVED COUNT O(i,j): the number of units falling in the (i,j) contingency (row i, column j).
• Column of marginal totals.
• Row of marginal totals.
Basic Hypothesis
• ASSUME column variable is the independent variable.
• Hypothesis is independence.
• That is, the conditional distribution in any column is the same as the conditional distribution in any other column.
Expected Count
• Basic idea is proportional allocation of observations in a column based on column total.
• Expected count in (i,j) contingency = E(i,j) = (total number in row i * total number in column j) / total number in table.
• Expected count need not be an integer; one expected count for each contingency.
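The proportional-allocation rule above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `expected_counts` is hypothetical, and the counts are the marijuana-use table that appears later in this transcript (rows are use at time 4, columns are use at time 3).

```python
# Observed counts: rows = use at time 4 (no, yes), columns = use at time 3 (no, yes).
observed = [[120, 9],
            [95, 142]]

def expected_counts(table):
    """Return E(i,j) = row total i * column total j / table total for every cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

expected = expected_counts(observed)
for row in expected:
    # Expected counts are generally not integers.
    print([round(e, 2) for e in row])
```

Each row of expected counts adds up to the observed row total, and likewise for columns, which is a quick check that the allocation was done correctly.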
Residual
• Residual in (i,j) contingency = observed count in (i,j) contingency - expected count in (i,j) contingency.
• That is, R(i,j)= O(i,j)-E(i,j)
• One residual for each contingency.
Pearson Chi-squared Component
• Chi-squared component for (i,j) contingency = C(i,j) = (Residual in (i,j) contingency)^2 / expected count in (i,j) contingency.
• C(i,j) = (R(i,j))^2 / E(i,j)
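As a rough sketch of how the residuals R(i,j) and components C(i,j) are obtained cell by cell, the following Python applies the two definitions to the marijuana-use counts given later in this transcript. The function name `residuals_and_components` is an illustrative assumption, not something from the course materials.

```python
# Observed counts: rows = use at time 4 (no, yes), columns = use at time 3 (no, yes).
observed = [[120, 9],
            [95, 142]]

def residuals_and_components(table):
    """Return (residuals, components) with R(i,j) = O - E and C(i,j) = R^2 / E."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    residuals, components = [], []
    for i, row in enumerate(table):
        res_row, comp_row = [], []
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n   # expected count E(i,j)
            r = o - e                               # residual R(i,j)
            res_row.append(r)
            comp_row.append(r * r / e)              # component C(i,j)
        residuals.append(res_row)
        components.append(comp_row)
    return residuals, components

residuals, components = residuals_and_components(observed)
for res_row, comp_row in zip(residuals, components):
    print([round(r, 2) for r in res_row], [round(c, 2) for c in comp_row])
```

In a 2 x 2 table the four residuals have the same absolute value, so the components differ only because the expected counts differ.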
Assessing Pearson Component
• Rough guides on whether the (i,j) contingency has an excessively large chi-squared component C(i,j), treating the component as chi-squared with one degree of freedom:
– The observed significance level of 3.84 is about 0.05.
– Of 6.63 is about 0.01.
– Of 10.83 is about 0.001.
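The three cutoffs can be checked numerically. The sketch below assumes each component is compared with a chi-squared distribution with one degree of freedom, which is the distribution these cutoffs come from; the helper name `p_value_one_df` is hypothetical. It uses only the standard library, relying on the fact that a one-degree-of-freedom chi-squared variable is the square of a standard normal variable.

```python
import math

def p_value_one_df(x):
    """Upper-tail probability P(X > x) for X ~ chi-squared with 1 degree of freedom.

    With one degree of freedom X = Z**2 for a standard normal Z, so
    P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(x / 2.0))

for cutoff in (3.84, 6.63, 10.83):
    print(cutoff, p_value_one_df(cutoff))
```

This shortcut is only valid for one degree of freedom; a table-wide test with more degrees of freedom needs the general chi-squared distribution.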
Pearson Chi-Squared Test
• Sum C(i,j) over all contingencies.
• Pearson chi-squared test has (R-1)(C-1) degrees of freedom.
• Under null hypothesis
– Expected value of chi-square equals its degrees of freedom.
– Variance is twice its degrees of freedom.
Marijuana Use at Time 4 by Marijuana Use at Time 3
Use at time 4      No use at time 3   Used at time 3   Total
No use at time 4   120                9                129
Used at time 4     95                 142              237
Total              215                151              366
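For this table, the Pearson chi-squared statistic and its degrees of freedom can be computed as in the following sketch. The function name `pearson_chi_squared` is an illustrative assumption; the calculation itself is the sum of (observed - expected)^2 / expected over all cells, with (R-1)(C-1) degrees of freedom.

```python
# Observed counts: rows = use at time 4 (no, yes), columns = use at time 3 (no, yes).
observed = [[120, 9],
            [95, 142]]

def pearson_chi_squared(table):
    """Return (statistic, degrees of freedom) for an R x C table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n   # expected count
            statistic += (o - e) ** 2 / e           # chi-squared component
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df

statistic, df = pearson_chi_squared(observed)
print(statistic, df)
```

The statistic comes out close to 96.6 on one degree of freedom, far beyond the 10.83 guide value, so the hypothesis that use at time 4 is independent of use at time 3 is not tenable for these data.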
Contingency Tables
• Chapter Eighteen
– Measures of Association
– For nominal variables
– For ordinal variables
Measures of Association
• Measures strength of an association
– Usually a dimensionless number between 0 and 1 in absolute value.
– Values near 0 indicate no association; values near 1 indicate strong association.
• Correlation coefficient is a measure of association
• Chi-square test statistic is not: its value depends on the number of observations.
Measures of Association for Nominal Scale Variables
• Chi-square based
– Phi coefficient
– Coefficient of contingency
– Cramér’s V
• Proportional reduction in error
– Lambda, symmetric
– Lambda, not symmetric
Chi-squared Measure: Phi Coefficient
• Definition of the Phi Coefficient: Phi = (chi-squared / N)^0.5
Phi Coefficient
• Can be greater than one.
• N is the total number of units in the table.
• For marijuana at time 3 and 4 data, phi coefficient is (96.595/366)^0.5 = 0.51.
Coefficient of Contingency
• Definition of coefficient of contingency: (chi-squared / (chi-squared + N))^0.5
Coefficient of Contingency
• Can never get as large as one.
• Largest value depends on the number of rows and columns in the table.
• For the marijuana example, the coefficient of contingency is (96.595/(96.595+366))^0.5 = 0.46.
Cramér’s V
• Definition of statistic: V = (chi-squared / (N(k-1)))^0.5, where k is the smaller of the number of rows and the number of columns.
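The three chi-squared based measures can be computed side by side. The sketch below takes the chi-squared value 96.595 and N = 366 quoted in this transcript for the marijuana table as given inputs; the variable names are illustrative assumptions.

```python
import math

chi_squared = 96.595    # Pearson chi-squared for the marijuana table, as quoted in the text
n = 366                 # total number of units in the table
rows, cols = 2, 2       # dimensions of the marijuana table

# Phi coefficient: square root of chi-squared over N.
phi = math.sqrt(chi_squared / n)

# Coefficient of contingency: square root of chi-squared over (chi-squared + N).
contingency = math.sqrt(chi_squared / (chi_squared + n))

# Cramer's V: rescales by k - 1, where k is the smaller table dimension.
k = min(rows, cols)
cramers_v = math.sqrt(chi_squared / (n * (k - 1)))

print(round(phi, 2), round(contingency, 2), round(cramers_v, 2))
```

For a 2 x 2 table k - 1 equals 1, so Cramér's V coincides with the phi coefficient; the two differ only in larger tables.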
Interpretation of Chi-squared measures of association
• An approximate observed level of significance is given for each measure.
• Use this in the usual way.
Proportional Reduction in Error (PRE) Measures
• Prediction is the modal category.
• Predict overall
– Predict used marijuana at time 4; correct for 237 and wrong for 129.
• Number of misclassified is 129.
Proportional Reduction in Error (PRE) Measures
• Predict for each condition of the independent variable.
– Predict no use at time 4 for those not using at time 3
• correct 120 of 215 times
• misclassify 95 times
– Predict use at time 4 for those using at time 3
• correct 142 of 151 times
• misclassify 9 times.
Proportional Reduction in Error (PRE) Measures
• Using only totals, number of misclassified is 129.
• Using marijuana use at time 3, number misclassified is 95 + 9 = 104.
• The lambda measure is λ = (129-104)/129 = 0.19
Lambda PRE Measures
• There is a lambda measure using marijuana use at time 4 as the independent variable.
– Total: predict no usage at time 3: 151 errors.
– Conditional
• no usage at time 4: predict none at 3 with 9 errors
• usage at time 4: predict use at 3 with 95 errors
• 104 total errors.
– Lambda measure is (151-104)/151 = 0.31
Lambda PRE Measures
• There is a symmetric lambda measure.
• [(129-104)+(151-104)]/(129+151)=0.26
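The error counts behind the three lambda values can be reproduced with a short sketch. The helper names `prediction_errors` and `transpose` are hypothetical; the logic is the modal-category prediction described on the preceding slides, applied to the marijuana table (rows are use at time 4, columns are use at time 3).

```python
# Observed counts: rows = use at time 4 (no, yes), columns = use at time 3 (no, yes).
observed = [[120, 9],
            [95, 142]]

def prediction_errors(table):
    """Errors when predicting the ROW variable by its modal category.

    Returns (errors using row totals only, errors predicting within each column).
    """
    row_totals = [sum(row) for row in table]
    n = sum(row_totals)
    overall = n - max(row_totals)
    conditional = sum(sum(col) - max(col) for col in zip(*table))
    return overall, conditional

def transpose(table):
    return [list(col) for col in zip(*table)]

# Time 3 as the independent variable: predict use at time 4.
e4_total, e4_cond = prediction_errors(observed)
lambda_time4 = (e4_total - e4_cond) / e4_total

# Time 4 as the independent variable: predict use at time 3.
e3_total, e3_cond = prediction_errors(transpose(observed))
lambda_time3 = (e3_total - e3_cond) / e3_total

# Symmetric lambda pools the error reductions from both directions.
lambda_symmetric = ((e4_total - e4_cond) + (e3_total - e3_cond)) / (e4_total + e3_total)

print(round(lambda_time4, 2), round(lambda_time3, 2), round(lambda_symmetric, 2))
```

The conditional error count is 104 in both directions here because each column's mode and each row's mode fall on the diagonal cells of this 2 x 2 table; that will not generally hold for other tables.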
Text Example Data Set
Subject   Life   Degree
Case 1    1      2
Case 2    2      3
Case 3    3      2
Comparing Pairs of Cases
• Concordant pair of cases: sign of the difference on variable 1 is the same as the sign of the difference on variable 2.
– Case 1 and Case 2: concordant.
– Case 2 and Case 3: discordant.
– Case 1 and Case 3: tied.
• Let P be number of concordant pairs and Q be the number of discordant pairs.
Measures Based on Concordant and Discordant Pairs
• Goodman and Kruskal’s Gamma: (P-Q)/(P+Q)
• Kendall’s Tau-b
• Kendall’s Tau-c
• Somers’ d
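As an illustration of how P and Q are counted, the sketch below classifies every pair of cases in the three-case text example (Life, Degree) and then computes Goodman and Kruskal's gamma. The helper names `sign` and `count_pairs` are assumptions made for this example; the other measures in the list differ in how they treat tied pairs and are not computed here.

```python
from itertools import combinations

# Text example data set: (Life, Degree) for each case.
cases = {"Case 1": (1, 2), "Case 2": (2, 3), "Case 3": (3, 2)}

def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def count_pairs(data):
    """Count concordant, discordant, and tied pairs over all pairs of cases."""
    concordant = discordant = tied = 0
    for (a_life, a_degree), (b_life, b_degree) in combinations(data.values(), 2):
        product = sign(b_life - a_life) * sign(b_degree - a_degree)
        if product > 0:
            concordant += 1      # differences have the same sign
        elif product < 0:
            discordant += 1      # differences have opposite signs
        else:
            tied += 1            # at least one variable is tied
    return concordant, discordant, tied

P, Q, ties = count_pairs(cases)
gamma = (P - Q) / (P + Q)        # undefined if every pair is tied (P + Q == 0)
print(P, Q, ties, gamma)
```

With one concordant and one discordant pair, gamma is zero for this tiny data set; the tied pair does not enter gamma at all.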
Choosing a measure
• Choose a measure “interpretable for the purpose in hand”!
• Avoid data dredging (taking the measure that is largest for the data set that you have).
Other measures
• Correlation based
– Pearson’s correlation
– Spearman correlation: replace values by ranks.
• Measures of agreement
– Cohen’s kappa.
Summary
• Contingency table methods are crucial to the analysis of market research and social science data.
• Hypothesis of independence is tested with the Pearson chi-squared test.
• Measures of association describe the strength of the dependence between two variables.