Session IV: Contingency Tables Tests of Association and Independence (Zar, Chapters 23, 24.10)
Session IV: Contingency Tables
Tests of Association and Independence
(Zar, Chapters 23, 24.10)
Chapter 23: Contingency Tables
Another (and last look!!) at peas!
Remember the pea color & texture:
                        YS        Yw        gS        gw        n
Observed:               152       39        53        6         250
Expected proportion:    9    :    3    :    3    :    1         16
Expected probability:   9/16      3/16      3/16      1/16
Expected number:        140.625   46.875    46.875    15.625
Put the data into a matrix:
                         Color
Texture          Y        g        Row Totals
S                152      53       205
w                39       6        45
Column Totals    191      59       250
What was the hypothesis from the last session?
Answer: 9 : 3 : 3: 1, or ?
(1) Y is dominant!
Test the hypothesis with:
                        Y            g            n
Observed:               191          59           250
Expected probability:   .75          .25
Expected number:        .75 x 250    .25 x 250
Then $\chi^2$ for the answer.
(2) S is dominant!
Test the hypothesis with:
                        S            w            n
Observed:               205          45           250
Expected probability:   .75          .25
Expected number:        .75 x 250    .25 x 250
Then $\chi^2$ for the answer.
(3) Color & Texture are independent
Problem: How to test this w/o assuming proportions in the hypothesis?
                                Y          g         Row Totals   Estimated row probability
S                               152        53        205          205/250
w                               39         6         45           45/250
Column Totals                   191        59        250
Estimated column probability    191/250    59/250
Estimate from the Marginal Totals
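Given the marginal probability estimates, if the H0 of independence is true, what are the probabilities in the table “cells”?
$$\hat{P}(YS) = \hat{P}(Y)\,\hat{P}(S) = \frac{191}{250}\cdot\frac{205}{250}$$
$$\hat{P}(Yw) = \hat{P}(Y)\,\hat{P}(w) = \frac{191}{250}\cdot\frac{45}{250}$$
$$\hat{P}(gS) = \hat{P}(g)\,\hat{P}(S) = \frac{59}{250}\cdot\frac{205}{250}$$
$$\hat{P}(gw) = \hat{P}(g)\,\hat{P}(w) = \frac{59}{250}\cdot\frac{45}{250}$$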
Given the marginal probability estimates and table estimates,
how do we get the expected number?
Expected # = probability x total number
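$$\hat{E}(YS) = \hat{P}(YS)\,n = \frac{191}{250}\cdot\frac{205}{250}\cdot 250 = \frac{(191)(205)}{250}$$
$$\hat{E}(Yw) = \hat{P}(Yw)\,n = \frac{191}{250}\cdot\frac{45}{250}\cdot 250 = \frac{(191)(45)}{250}$$
$$\hat{E}(gS) = \hat{P}(gS)\,n = \frac{59}{250}\cdot\frac{205}{250}\cdot 250 = \frac{(59)(205)}{250}$$
$$\hat{E}(gw) = \hat{P}(gw)\,n = \frac{59}{250}\cdot\frac{45}{250}\cdot 250 = \frac{(59)(45)}{250}$$
Rule: $\hat{E}_{ij} = \dfrac{(\text{Row}_i)(\text{Column}_j)}{n}$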
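Observed counts, with the expected count under each:

                 Y                          g                          Row Totals
S                152                        53                         205
                 (191)(205)/250 = 156.62    (59)(205)/250 = 48.38
w                39                         6                          45
                 (191)(45)/250 = 34.38      (59)(45)/250 = 10.62
Column Totals    191                        59                         250

$$\chi^2 = \frac{(152-156.62)^2}{156.62} + \frac{(53-48.38)^2}{48.38} + \frac{(39-34.38)^2}{34.38} + \frac{(6-10.62)^2}{10.62}$$
$$= 0.136 + 0.441 + 0.621 + 2.010 = 3.208$$

D.F. = 4 - 1 - 1 - 1 = 1
(4 cells, less 1 because they sum to 250, less 1 for the row estimate, less 1 for the column estimate)

$\chi^2_{.05,1} = 3.841$, p = .073
Since 3.208 < 3.841, H0 (independence) is not rejected at the .05 level.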
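The arithmetic above can be checked with a few lines of Python. This is only a sketch using the standard library; the function names `expected_counts` and `chi_square` are illustrative and do not come from the slides.

```python
# Observed pea counts: rows = texture (S, w), columns = color (Y, g).
observed = [[152, 53],
            [39, 6]]

def expected_counts(table):
    """Expected counts under independence: (row total * column total) / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_square(table):
    """Pearson chi-square: sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_counts(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))

expected = expected_counts(observed)
chi2 = chi_square(observed)
print([[round(e, 2) for e in row] for row in expected])
print(round(chi2, 3))
```

The expected counts reproduce 156.62, 48.38, 34.38 and 10.62, and the statistic agrees with the 3.208 worked out by hand.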
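In General: 2 x 2 Tables:

$$\chi^2 = \sum_{i=1}^{2}\sum_{j=1}^{2} \frac{(f_{ij} - \hat{f}_{ij})^2}{\hat{f}_{ij}}$$

DF: 2 x 2 - 1 - 1 - 1 = 1
(4 cells, less 1 for the row, 1 for the column, and 1 for the total)

                            A
             a1                              a2
B    b1      f11                             f12                             R1 or f1.
             $\hat{f}_{11} = R_1 C_1 / n$    $\hat{f}_{12} = R_1 C_2 / n$
     b2      f21                             f22                             R2 or f2.
             $\hat{f}_{21} = R_2 C_1 / n$    $\hat{f}_{22} = R_2 C_2 / n$
             C1 or f.1                       C2 or f.2                       n

$$\hat{f}_{ij} = \frac{f_{i\cdot}\, f_{\cdot j}}{n} = \frac{R_i C_j}{n}, \quad \text{where } i = 1,2;\ j = 1,2$$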
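In general:

                            A
             a1     a2     a3     ...    ac     rows
     b1      f11    f12    f13    ...    f1c    R1
     b2      f21    f22    f23    ...    f2c    R2
B    b3      f31    f32    f33    ...    f3c    R3
     ...
     br      fr1    fr2    fr3    ...    frc    Rr
columns      C1     C2     C3     ...    Cc     n

$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(f_{ij} - \hat{f}_{ij})^2}{\hat{f}_{ij}}$$

Where:
$$R_i = f_{i\cdot} = \sum_{j=1}^{c} f_{ij}; \qquad C_j = f_{\cdot j} = \sum_{i=1}^{r} f_{ij}$$

Estimates of the “cells”:
$$\hat{f}_{ij} = \frac{R_i C_j}{n} = \frac{f_{i\cdot}\, f_{\cdot j}}{n}$$

Degrees of Freedom:
d.f. = rc - (c-1) - (r-1) - 1
(rc cells, less c-1 for the columns, r-1 for the rows, and 1 for the total)
     = rc - c + 1 - r + 1 - 1
     = rc - c - r + 1
     = (r-1)(c-1)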
Special cases:
Rows fixed (the numbers to be seen in the rows are fixed before the study):
d.f. = rc - (c-1) - 1
Columns fixed:
d.f. = rc - (r-1) - 1
Both fixed (ex: 9 : 3 : 3 : 1):
d.f. = rc - 1
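The degrees-of-freedom bookkeeping on this slide can be written as one small helper. This is a sketch that simply follows the slide's rules; the function name and its keyword arguments are hypothetical.

```python
def contingency_df(r, c, rows_fixed=False, columns_fixed=False):
    """Degrees of freedom for an r x c table, following the slide's accounting.

    Start from rc cells, subtract 1 for the total, then subtract (r - 1) if the
    row probabilities are estimated and (c - 1) if the column probabilities are.
    """
    df = r * c - 1
    if not rows_fixed:
        df -= r - 1   # row probabilities estimated from the data
    if not columns_fixed:
        df -= c - 1   # column probabilities estimated from the data
    return df

print(contingency_df(2, 2))                                       # ordinary 2 x 2 table
print(contingency_df(2, 4))                                       # hair color table
print(contingency_df(2, 2, rows_fixed=True, columns_fixed=True))  # 9:3:3:1 case
```

With nothing fixed the result equals (r-1)(c-1); with both fixed, the 2 x 2 pea table gives 3, the same as the four-category 9 : 3 : 3 : 1 test.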
Example 23.1: Hair Color and Gender
                            Hair Color
Gender     Black      Brown      Blond      Red        Total
Male       32         43         16         9          100 (=R1)
           (29.0)     (36.0)     (26.67)    (8.33)
Female     55         65         64         16         200 (=R2)
           (58.0)     (72.0)     (53.33)    (16.67)
Total      87         108        80         25         300 (=n)
           (=C1)      (=C2)      (=C3)      (=C4)

$$\chi^2 = \sum\sum \frac{(f_{ij} - \hat{f}_{ij})^2}{\hat{f}_{ij}} = \frac{(32-29.0)^2}{29.0} + \frac{(43-36.0)^2}{36.0} + \cdots + \frac{(16-16.67)^2}{16.67}$$
$$= 0.3103 + 1.3611 + 4.2667 + 0.0533 + 0.1552 + 0.6806 + 2.1333 + 0.0267 = 8.987$$

df = (r-1)(c-1) = (2-1)(4-1) = 3
$\chi^2_{0.05,3} = 7.815$
Therefore, reject H0
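The same calculation for a general r x c table, sketched in Python with only the standard library. The function name `chi_square_rxc` is illustrative; it returns the per-cell contributions as well, since those are examined next.

```python
def chi_square_rxc(table):
    """Return (chi-square, degrees of freedom, per-cell contributions) for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    contributions = []
    for row, R in zip(table, row_totals):
        line = []
        for f, C in zip(row, col_totals):
            f_hat = R * C / n                      # expected count R_i * C_j / n
            line.append((f - f_hat) ** 2 / f_hat)  # this cell's share of chi-square
        contributions.append(line)
    chi2 = sum(sum(line) for line in contributions)
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df, contributions

# Example 23.1: rows = Male, Female; columns = Black, Brown, Blond, Red.
hair = [[32, 43, 16, 9],
        [55, 65, 64, 16]]
chi2, df, cells = chi_square_rxc(hair)
print(round(chi2, 3), df)
print([[round(x, 4) for x in line] for line in cells])
```

The two blond cells contribute 4.2667 and 2.1333, which is most of the total of 8.987.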
Okay, we have now rejected the null hypothesis:
H0: The rows and the columns are independent
And accepted the alternative hypothesis:
HA: The rows and the columns are not independent
What now?
Look for sub-hypotheses?
Look at the individual chi-squares!
MBk       MBr       MBl       MR        FBk       FBr       FBl       FR
0.3103    1.3611    4.2667    0.0533    0.1552    0.6806    2.1333    0.0267
It makes sense to combine hair colors with similar gender ratios.
Blonds vs. Non-Blonds?
But are all the Non-Blonds the same?
H0: Black=Brown=Red
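                       Hair Color
Gender     Black      Brown      Red        Total
Male       32         43         9          84 (=R1)
           (33.22)    (41.24)    (9.55)
Female     55         65         16         136 (=R2)
           (53.78)    (66.76)    (15.45)
Total      87         108        25         220 (=n)
           (=C1)      (=C2)      (=C3)

(Expected counts are computed from the margins of this reduced table, n = 220.)

$\chi^2 = 0.245$ with 2 DF
$\chi^2_{0.05,2} = 5.991$; Accept H0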
Now
H0: Blonds=Non-Blonds
                  Hair Color
Gender     Blond       Non-Blond    Total
Male       16          84           100
           (26.67)     (73.33)
Female     64          136          200
           (53.33)     (146.67)
Total      80          220          300

$\chi^2 = 8.727$, DF = 1
$\chi^2_{0.05,1} = 3.841$; Reject H0
(1) Too Few Male Blonds
(2) Too Many Female Blonds
(3) Too Many Male NBl
(4) Too Few Female NBl
(each relative to the counts expected if Hair Color were independent of Gender)
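The two sub-hypotheses can be reproduced with the same chi-square routine applied to the reduced tables. This is a sketch; `chi_square` is an illustrative helper, and each table's expected counts are taken from its own margins.

```python
def chi_square(table):
    """Pearson chi-square for an r x c table, expected counts from the table's own margins."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum((f - R * C / n) ** 2 / (R * C / n)
               for row, R in zip(table, row_totals)
               for f, C in zip(row, col_totals))

# Rows = Male, Female.
non_blond_only = [[32, 43, 9],      # Black, Brown, Red
                  [55, 65, 16]]
blond_vs_rest = [[16, 84],          # Blond, Non-Blond (Black + Brown + Red)
                 [64, 136]]

chi2_non_blond = chi_square(non_blond_only)   # 2 DF
chi2_blond = chi_square(blond_vs_rest)        # 1 DF
print(round(chi2_non_blond, 3), round(chi2_blond, 3))
```

The two pieces, about 0.245 and 8.727, add up to roughly the overall 8.987 on 3 DF, so nearly all of the departure from independence sits in the Blond vs. Non-Blond comparison.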
The 2 x 2 Table re-examined:
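(1) The problem of counts vs. estimates

$f_{ij}$: counts, whole numbers, integers
$\hat{f}_{ij} = \dfrac{R_i C_j}{n}$: rational number, fraction

$$\chi^2 = \sum\sum \frac{(f_{ij} - \hat{f}_{ij})^2}{\hat{f}_{ij}}$$

(2) The Yates Correction

$$\chi^2_c = \sum\sum \frac{\left(|f_{ij} - \hat{f}_{ij}| - 0.5\right)^2}{\hat{f}_{ij}}$$

The subtraction of 0.5 is “conservative”.
Rounding would be better.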
(3) The Computing formulae
$$\chi^2 = \frac{n\,(f_{11} f_{22} - f_{12} f_{21})^2}{R_1 R_2 C_1 C_2}$$

$$\chi^2_c = \frac{n\left(|f_{11} f_{22} - f_{12} f_{21}| - n/2\right)^2}{R_1 R_2 C_1 C_2}$$
(4) The Cochran/Haber Correction
The Cochran/Haber (Haber (1980)) correction gives better results when routinely employed than does either the Yates-corrected or non-corrected chi-square calculation.
(a) Determine each of the four expected frequencies, denoting the smallest as $\hat{f}$.
(b) Calculate the absolute difference between the smallest expected frequency and its corresponding observed frequency: $d = |f - \hat{f}|$.
(c) If $f \le 2\hat{f}$, define D = the largest multiple of 0.5 that is < d.
(d) If $f > 2\hat{f}$, define D = d - 0.5.
(e) Calculate
$$\chi^2_{c'} = \frac{n^3 D^2}{R_1 R_2 C_1 C_2}$$
Use the Blond/Non-Blond Example:

                  Hair Color
Gender     Blond       Non-Blond    Total
Male       16          84           100
           (26.67)     (73.33)
Female     64          136          200
           (53.33)     (146.67)
Total      80          220          300

Uncorrected:
$\chi^2 = 8.727$, DF = 1
$\chi^2_{0.05,1} = 3.841$; Reject H0

Yates:
$$\chi^2_c = \frac{300\left(|(16)(136) - (84)(64)| - 150\right)^2}{(80)(220)(100)(200)} = 7.928$$

Cochran:
$$d = |16 - 26.67| = 10.67, \qquad D = 10.5$$
$$\chi^2_{c'} = \frac{(300)^3 (10.5)^2}{(80)(220)(100)(200)} = 8.457$$
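The three 2 x 2 statistics can be computed side by side. The sketch below follows the computing formulae and steps (a) to (e) as given on the slides; the function name is hypothetical, and the `max(..., 0)` guards are an added safety assumption that is not part of the slides.

```python
import math

def two_by_two_statistics(f11, f12, f21, f22):
    """Return (uncorrected, Yates, Cochran/Haber) chi-square for a 2 x 2 table."""
    r1, r2 = f11 + f12, f21 + f22
    c1, c2 = f11 + f21, f12 + f22
    n = r1 + r2
    denom = r1 * r2 * c1 * c2
    cross = abs(f11 * f22 - f12 * f21)

    uncorrected = n * cross ** 2 / denom
    # Guard (assumption, not on the slides): never let the corrected difference go negative.
    yates = n * max(cross - n / 2, 0) ** 2 / denom

    # (a) smallest expected frequency and its observed count
    cells = [(f11, r1 * c1 / n), (f12, r1 * c2 / n),
             (f21, r2 * c1 / n), (f22, r2 * c2 / n)]
    f_obs, f_hat = min(cells, key=lambda cell: cell[1])
    d = abs(f_obs - f_hat)                     # (b)
    if f_obs <= 2 * f_hat:                     # (c) largest multiple of 0.5 strictly below d
        D = max((math.ceil(2 * d) - 1) / 2, 0)
    else:                                      # (d)
        D = d - 0.5
    haber = n ** 3 * D ** 2 / denom            # (e)
    return uncorrected, yates, haber

uncorrected, yates, haber = two_by_two_statistics(16, 84, 64, 136)
print(round(uncorrected, 3), round(yates, 3), round(haber, 3))
```

For the Blond/Non-Blond table this gives approximately 8.727, 7.928 and 8.457, so the Cochran/Haber value falls between the uncorrected and Yates-corrected ones.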
(5) The Fisher – Exact Method
2 x 2 tables only
                 A
           a1       a2
B    b1    f11      f12      R1
     b2    f21      f22      R2
           C1       C2       n

Hypergeometric Distribution:
$$P = \frac{R_1!\, R_2!\, C_1!\, C_2! \,/\, n!}{f_{11}!\, f_{12}!\, f_{21}!\, f_{22}!}$$

Just like the binomial, we need this probability + all more extreme.
To find those more extreme, look for the smallest expected frequency as in Cochran’s Method. Form successive tables as in the next example:

               Beetles      Bugs
Upper Leaf     12           7            19
               (9.17)       (9.83)
Lower Leaf     2            8            10
               (4.83)       (5.17)
               14           15           29

$$P = \frac{R_1!\, R_2!\, C_1!\, C_2! \,/\, n!}{f_{11}!\, f_{12}!\, f_{21}!\, f_{22}!} = \frac{19!\,10!\,14!\,15! \,/\, 29!}{12!\,7!\,2!\,8!}$$
= antilog[(log 19! + log 10! + log 14! + log 15! - log 29!) - (log 12! + log 7! + log 2! + log 8!)]
= antilog[15.75522 - 17.28932]
= antilog[-1.53410]
= antilog[0.46590 - 2.00000]
= 0.02923
From 2 to 1:

               Beetles      Bugs
Upper Leaf     13           6            19
               (9.17)       (9.83)
Lower Leaf     1            9            10
               (4.83)       (5.17)
               14           15           29

$$P = \frac{19!\,10!\,14!\,15! \,/\, 29!}{13!\,6!\,1!\,9!}$$
= antilog[15.75522 - (log 13! + log 6! + log 1! + log 9!)]
= antilog[-2.45615]
= 0.003498

And 1 to 0:

               Beetles      Bugs
Upper Leaf     14           5            19
               (9.17)       (9.83)
Lower Leaf     0            10           10
               (4.83)       (5.17)
               14           15           29

$$P = \frac{19!\,10!\,14!\,15! \,/\, 29!}{14!\,5!\,0!\,10!}$$
= antilog[15.75522 - (log 14! + log 5! + log 0! + log 10!)]
= antilog[-3.82413]
= 0.0001499

Prob = 0.02923 + 0.003498 + 0.0001499 = 0.03288. Reject H0: Independence or ?
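The successive tables can be generated and summed without log-factorial tables by using exact binomial coefficients, which give the same hypergeometric probability. This is a sketch; the function names are illustrative, and it assumes, as in the example, that "more extreme" means stepping the lower-left cell (f21) down to 0 with the margins held fixed.

```python
from math import comb

def table_probability(f11, f12, f21, f22):
    """Hypergeometric probability of one 2 x 2 table with all margins fixed."""
    r1, r2 = f11 + f12, f21 + f22
    c1 = f11 + f21
    n = r1 + r2
    # Equals R1! R2! C1! C2! / (n! f11! f12! f21! f22!)
    return comb(r1, f11) * comb(r2, f21) / comb(n, c1)

def one_tailed_fisher(f11, f12, f21, f22):
    """Sum the observed table and every table reached by lowering f21 toward 0."""
    total = 0.0
    while f21 >= 0 and f12 >= 0:
        total += table_probability(f11, f12, f21, f22)
        # Move one count out of f21 while keeping row and column totals fixed.
        f11, f12, f21, f22 = f11 + 1, f12 - 1, f21 - 1, f22 + 1
    return total

p_observed = table_probability(12, 7, 2, 8)
p_tail = one_tailed_fisher(12, 7, 2, 8)
print(round(p_observed, 5), round(p_tail, 5))
```

The observed table has probability about 0.0292, and the tail sum agrees with the 0.03288 obtained from the logarithms.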
A heterogeneity chi-square analysis of 2 x2 contingency tables.
H0: The four samples are homogeneous.
HA: The four samples are heterogeneous.
Example 6.4a (Edition 2) Or see Example 23.8a (Edition 4)

Experiment 1
               Dead     Alive    Total
Treated        9        15       24        $\chi^2$ = 2.4806
Not treated    15       10       25        DF = 1
Total          24       25       49

Experiment 2
               Dead     Alive    Total
Treated        13       12       25        $\chi^2$ = 2.1222
Not treated    18       7        25        DF = 1
Total          31       19       50

Experiment 3
               Dead     Alive    Total
Treated        12       13       25        $\chi^2$ = 2.0525
Not treated    17       8        25        DF = 1
Total          29       21       50

Experiment 4
               Dead     Alive    Total
Treated        10       14       24        $\chi^2$ = 2.4522
Not treated    16       9        25        DF = 1
Total          26       23       49

Pooled Data for Experiments 1-4
H0: Treatment & status are independent
               Dead     Alive    Total
Treated        44       54       98        $\chi^2$ = 8.9262
Not treated    66       34       100       DF = 1
Total          110      88       198

Total chi-square            9.1075    DF = 4
Chi-square of pooled data   8.9262    DF = 1
Heterogeneity chi-square    0.1813    DF = 3

$\chi^2_{0.05,3} = 7.815$; Accept H0, Experiments Homogeneous
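The heterogeneity analysis is just bookkeeping on 2 x 2 chi-squares, as the following Python sketch shows. It uses the uncorrected computing formula; the variable and function names are illustrative.

```python
def chi2_2x2(f11, f12, f21, f22):
    """Uncorrected chi-square for a 2 x 2 table via the computing formula."""
    r1, r2 = f11 + f12, f21 + f22
    c1, c2 = f11 + f21, f12 + f22
    n = r1 + r2
    return n * (f11 * f22 - f12 * f21) ** 2 / (r1 * r2 * c1 * c2)

# (treated dead, treated alive, not-treated dead, not-treated alive) per experiment
experiments = [(9, 15, 15, 10),
               (13, 12, 18, 7),
               (12, 13, 17, 8),
               (10, 14, 16, 9)]

total_chi2 = sum(chi2_2x2(*e) for e in experiments)   # DF = number of experiments
pooled = tuple(sum(cell) for cell in zip(*experiments))
pooled_chi2 = chi2_2x2(*pooled)                        # DF = 1
heterogeneity_chi2 = total_chi2 - pooled_chi2
heterogeneity_df = len(experiments) - 1

print(pooled)
print(round(total_chi2, 4), round(pooled_chi2, 4),
      round(heterogeneity_chi2, 4), heterogeneity_df)
```

The heterogeneity value of about 0.18 on 3 DF is far below 7.815, which is why the four experiments can be pooled.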
THE LOG-LIKELIHOOD RATIO
The log-likelihood ratio was introduced in Session 2. In tables, it is calculated:
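$$G = 2\Big[\sum_i\sum_j f_{ij}\ln f_{ij} - \sum_i R_i \ln R_i - \sum_j C_j \ln C_j + n \ln n\Big],$$
or
$$G = 4.60517\Big[\sum_i\sum_j f_{ij}\log f_{ij} - \sum_i R_i \log R_i - \sum_j C_j \log C_j + n \log n\Big].$$
Since G is approximately distributed as $\chi^2$, Table B.1 may be used with (r-1)(c-1) degrees of freedom.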
Log-likelihood Example: Hair Color

                       Hair Color
Sex        Black    Brown    Blond    Red      Total
Male       32       43       16       9        100
Female     55       65       64       16       200
Total      87       108      80       25       300

$$G = 4.60517\Big[\sum\sum f_{ij}\log f_{ij} - \sum R_i \log R_i - \sum C_j \log C_j + n \log n\Big]$$
= 4.60517[(32)(1.50515) + (43)(1.63347) + (16)(1.20412)
+ (9)(0.95424) + (55)(1.74036) + (65)(1.81291)
+ (64)(1.80618) + (16)(1.20412) - (100)(2.00000)
- (200)(2.30103) - (87)(1.93952) - (108)(2.03342)
- (80)(1.90309) - (25)(1.39794) + (300)(2.47712)]
= 4.60517(2.06518)
= 9.510 with DF = 3
$\chi^2_{0.05,3} = 7.815$; reject H0.
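The natural-log form of G is easy to evaluate directly. The sketch below uses only the standard library, and `g_statistic` is an illustrative name. Working at full floating-point precision rather than with five-decimal logarithms can shift the last digit of the result slightly.

```python
import math

def g_statistic(table):
    """Log-likelihood ratio G for an r x c table (natural-log form), with its DF."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Cells with a zero count contribute nothing (0 * ln 0 is taken as 0).
    s = sum(f * math.log(f) for row in table for f in row if f > 0)
    s -= sum(R * math.log(R) for R in row_totals)
    s -= sum(C * math.log(C) for C in col_totals)
    s += n * math.log(n)
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return 2 * s, df

hair = [[32, 43, 16, 9],
        [55, 65, 64, 16]]
G, df = g_statistic(hair)
print(round(G, 2), df)
```

G comes out at about 9.51 on 3 DF, above the critical value 7.815, in line with the Pearson chi-square of 8.987.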
Three and Higher Dimensional Tables
Ex: Problem III:
Row: Remission
Column: Age
Tier: LI
[Figure: 2 x 2 x 2 table. Row #1: complete remission; Row #2: not complete remission. Columns: age <50, >50. Tiers: LI >8%, <8%.]
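Notation: Row i; column j; tier l
$f_{ijl}$ = # in cell i, j, l

• Mutual independence
[Figure: cube of cells f111, f112, f121, f122, f211, f221, ...]

$$\hat{f}_{ijl} = \frac{R_i C_j T_l}{n^2}$$
where
$$R_i = \sum_{j=1}^{c}\sum_{l=1}^{t} f_{ijl}; \qquad C_j = \sum_{i=1}^{r}\sum_{l=1}^{t} f_{ijl}; \qquad T_l = \sum_{i=1}^{r}\sum_{j=1}^{c} f_{ijl}$$

$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\sum_{l=1}^{t} \frac{(f_{ijl} - \hat{f}_{ijl})^2}{\hat{f}_{ijl}}$$

df = rct - (c-1) - (r-1) - (t-1) - 1 = rct - c - r - t + 2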
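A sketch of the mutual-independence calculation for a three-way table, in plain Python. The counts below are hypothetical illustration data, not the Problem III data, and the function name is illustrative; the table is indexed as `table[i][j][l]` for row i, column j, tier l.

```python
def mutual_independence_chi2(table):
    """Chi-square and DF for mutual independence in an r x c x t table."""
    r, c, t = len(table), len(table[0]), len(table[0][0])
    R = [sum(table[i][j][l] for j in range(c) for l in range(t)) for i in range(r)]
    C = [sum(table[i][j][l] for i in range(r) for l in range(t)) for j in range(c)]
    T = [sum(table[i][j][l] for i in range(r) for j in range(c)) for l in range(t)]
    n = sum(R)
    chi2 = 0.0
    for i in range(r):
        for j in range(c):
            for l in range(t):
                f_hat = R[i] * C[j] * T[l] / n ** 2   # expected count R_i C_j T_l / n^2
                chi2 += (table[i][j][l] - f_hat) ** 2 / f_hat
    df = r * c * t - c - r - t + 2
    return chi2, df

# Hypothetical 2 x 2 x 2 counts, for illustration only.
example = [[[20, 10], [15, 5]],
           [[12, 18], [8, 12]]]
chi2, df = mutual_independence_chi2(example)
print(round(chi2, 3), df)
```

For a 2 x 2 x 2 table the degrees of freedom come to 8 - 2 - 2 - 2 + 2 = 4.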
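Types of Independence:
3-D:
Partial

R vs C & T: Column & Tier spread out as a single variable

             Col. 1    Col. 1    ...   Col. 1    Col. 2    ...   Col. c
             Tier 1    Tier 2    ...   Tier t    Tier 1    ...   Tier t    Total
ROW   1      f111      f112      ...   f11t      f121      ...   f1ct      R1
      2      f211      f212      ...   f21t      f221      ...   f2ct      R2
      ...
      r      fr11      fr12      ...   fr1t      fr21      ...   frct      Rr
             (CT)11    (CT)12    ...   (CT)1t    (CT)21    ...   (CT)ct    n

$$\hat{f}_{ijl} = \frac{R_i\,(CT)_{jl}}{n}, \qquad \chi^2 = \sum_i\sum_j\sum_l \frac{(f_{ijl} - \hat{f}_{ijl})^2}{\hat{f}_{ijl}}, \qquad \text{d.f.} = (r-1)(ct-1)$$

C vs R & T: Row & Tier

             (R1 T1)   (R1 T2)   ...   (Rr Tt)   Total
COL   1      f111      f112      ...   fr1t      C1
      2      f121      f122      ...   fr2t      C2
      ...
      c      f1c1      f1c2      ...   frct      Cc
             (RT)11    (RT)12    ...   (RT)rt    n

$$\hat{f}_{ijl} = \frac{C_j\,(RT)_{il}}{n}, \qquad \text{df} = (c-1)(rt-1)$$

T vs R & C

             (R1 C1)   (R1 C2)   ...   (Rr Cc)   Total
TIER  1      f111      f121      ...   frc1      T1
      2      f112      f122      ...   frc2      T2
      ...
      t      f11t      f12t      ...   frct      Tt
             (RC)11    (RC)12    ...   (RC)rc    n

$$\hat{f}_{ijl} = \frac{T_l\,(RC)_{ij}}{n}, \qquad \text{df} = (t-1)(cr-1)$$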
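Pairwise

R vs C
                    C
           1      2      ...    c
R    1     f11    f12    ...    f1c    R1
     2     f21    f22    ...    f2c    R2
     ...
     r     fr1    fr2    ...    frc    Rr
           C1     C2     ...    Cc     n

$$\hat{f}_{ij\cdot} = \frac{R_i C_j}{n}, \qquad \text{df} = (r-1)(c-1)$$

R vs T
                    T
           1      2      ...    t
R    1     f11    f12    ...    f1t    R1
     2     f21    f22    ...    f2t    R2
     ...
     r     fr1    fr2    ...    frt    Rr
           T1     T2     ...    Tt     n

$$\hat{f}_{i\cdot l} = \frac{R_i T_l}{n}, \qquad \text{df} = (r-1)(t-1)$$

C vs T
                    T
           1      2      ...    t
C    1     f11    f12    ...    f1t    C1
     2     f21    f22    ...    f2t    C2
     ...
     c     fc1    fc2    ...    fct    Cc
           T1     T2     ...    Tt     n

$$\hat{f}_{\cdot jl} = \frac{C_j T_l}{n}, \qquad \text{df} = (c-1)(t-1)$$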
Testing Proportions:
A standard problem for which contingency tables and the independence null hypothesis are used is a test of proportions (more about this in Session 7).
The problem in its simplest form is a 2 by 2 contingency table similar to the following table, comparing two groups (control and treatment, say) for which there are two outcomes:
positive/negative, alive/dead, remission/no remission

                            Group
                   Control    Treatment    Total
Status  Positive   f11        f12          np
        Negative   f21        f22          nn
        Total      nc         nt           n

The control group percent positive = (f11 / nc) x 100%
The treatment group percent positive = (f12 / nt) x 100%
H0: % Positive Control = % Positive Treatment
HA: % Positive Control ≠ % Positive Treatment
The hypothesis of independence, tested with the Fisher-Exact method or the chi-square approximation, is a test of this hypothesis.
Power Calculations and Sample size Determination:
Proportion positive control (p1)
Proportion positive treatment (p2)
Sample Size (n)
Type I Error: α = Pr(Rejecting H0 | H0 is true)
Type II Error: β = Pr(Rejecting HA | HA is true); 1 - β = Pr(Accepting HA | HA is true) = power
Given any four of these (p1, p2, n, α, β), the fifth one is specified.
This is using the formula of Casagrande JT, Pike MC (An Improved Formula for Calculating Sample Sizes, for Comparing Two Binomial Distributions, Biometrics, 34, 483-486):
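$$n = \frac{A\left[1 + \sqrt{1 + 4(p_1 - p_2)/A}\,\right]^2}{4(p_1 - p_2)^2}$$
where
$$A = \left[K_{1-\alpha}\sqrt{2\bar{p}\bar{q}} + K_{1-\beta}\sqrt{p_1 q_1 + p_2 q_2}\,\right]^2$$
$$\bar{p} = (p_1 + p_2)/2$$
$$\bar{q} = 1 - \bar{p}$$
$K_{1-\alpha}$ is the $(1-\alpha)\,100\%$ or $(1-\alpha/2)\,100\%$ critical level for the Normal/t-distribution (as appropriate), and $K_{1-\beta}$ is the corresponding critical level for the power.
This equation can be solved for the parameter that is not defined. With the t-distribution, iteration is required.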
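Solving the formula for n is a direct evaluation when Normal critical values are used. The sketch below rests on several assumptions that are not stated on the slide: the function name and defaults are hypothetical, n is read as the size of each group, q1 = 1 - p1 and q2 = 1 - p2, the absolute difference |p1 - p2| is used so the order of the groups does not matter, and the critical values come from the standard Normal via Python's `statistics.NormalDist`.

```python
import math
from statistics import NormalDist

def sample_size(p1, p2, alpha=0.05, power=0.80, two_sided=True):
    """Sample size from the Casagrande-Pike formula, using Normal critical values."""
    z = NormalDist()                      # standard Normal
    k_alpha = z.inv_cdf(1 - alpha / 2 if two_sided else 1 - alpha)
    k_beta = z.inv_cdf(power)             # power = 1 - beta
    p_bar = (p1 + p2) / 2
    q_bar = 1 - p_bar
    delta = abs(p1 - p2)                  # assumption: order of the groups is irrelevant
    A = (k_alpha * math.sqrt(2 * p_bar * q_bar)
         + k_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    n = A * (1 + math.sqrt(1 + 4 * delta / A)) ** 2 / (4 * delta ** 2)
    return math.ceil(n)                   # round up to a whole subject

n_small_difference = sample_size(0.20, 0.40)
n_large_difference = sample_size(0.20, 0.60)
print(n_small_difference, n_large_difference)
```

As expected, a larger difference between p1 and p2 needs fewer subjects, and raising the power raises n. Solving instead for power or for a detectable p2 at a given n would need a numerical search over this function.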