Session IV: Contingency Tables

Tests of Association and Independence

(Zar, Chapters 23, 24.10)

Chapter 23: Contingency Tables

Another (and last!!) look at peas!

Remember the pea color & texture:

                        YS        Yw        gS        gw        Total
Observed                152       39        53        6         250
Expected proportion     9         3         3         1         16
Expected probability    9/16      3/16      3/16      1/16
Expected number         140.625   46.875    46.875    15.625

Put the data into a matrix:

                          Color
                     Y       g       Row totals
Texture    S         152     53      205
           w         39      6       45
Column totals        191     59      250

What was the hypothesis from the last session?

Answer: 9 : 3 : 3: 1, or ?

(1) Y is dominant!

Test the hypothesis with:

                        Y                    g                   Total
Observed                191                  59                  250
Expected probability    0.75                 0.25
Expected number         0.75 x 250 = 187.5   0.25 x 250 = 62.5

Then compute $\chi^2$ for the answer.

(2) S is dominant!

Test the hypothesis with:

                        S                    w                   Total
Observed                205                  45                  250
Expected probability    0.75                 0.25
Expected number         0.75 x 250 = 187.5   0.25 x 250 = 62.5

Then compute $\chi^2$ for the answer.
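The two single-trait tests above can be carried out numerically. The following is a minimal Python sketch, assuming the 3 : 1 expected ratio stated for each trait; the function name `chi_square_gof` is an illustrative choice and not part of the session material.

```python
def chi_square_gof(observed, expected_probs):
    """Goodness-of-fit chi-square: sum of (observed - expected)^2 / expected."""
    n = sum(observed)
    expected = [p * n for p in expected_probs]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

CRITICAL_05_1DF = 3.841  # chi-square critical value for alpha = 0.05, 1 DF

# Color: yellow (Y) vs. green (g), expected 3 : 1
chi2_color = chi_square_gof([191, 59], [0.75, 0.25])
# Texture: smooth (S) vs. wrinkled (w), expected 3 : 1
chi2_texture = chi_square_gof([205, 45], [0.75, 0.25])

for name, chi2 in [("color", chi2_color), ("texture", chi2_texture)]:
    decision = "reject" if chi2 > CRITICAL_05_1DF else "do not reject"
    print(f"{name}: chi-square = {chi2:.3f}, {decision} H0 at alpha = 0.05")
```

Each test has 2 categories and therefore 1 degree of freedom, which is why both statistics are compared with 3.841.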

(3) Color & Texture are independent

Problem: How to test this w/o assuming proportions in the hypothesis?

                     Y          g          Row totals   Estimated row probability
S                    152        53         205          205/250
w                    39         6          45           45/250
Column totals        191        59         250
Estimated column
probability          191/250    59/250

Estimate from the Marginal Totals

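Given the marginal probability estimates, if the H0 of independence is true, what are the probabilities in the table “cells”?

$\hat P(YS) = \hat P(Y)\,\hat P(S) = \dfrac{191}{250}\cdot\dfrac{205}{250}$

$\hat P(Yw) = \hat P(Y)\,\hat P(w) = \dfrac{191}{250}\cdot\dfrac{45}{250}$

$\hat P(gS) = \hat P(g)\,\hat P(S) = \dfrac{59}{250}\cdot\dfrac{205}{250}$

$\hat P(gw) = \hat P(g)\,\hat P(w) = \dfrac{59}{250}\cdot\dfrac{45}{250}$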

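Given the marginal probability estimates and the table (cell) estimates, how do we get the expected number?

Expected # = probability x total number

$\hat E(YS) = \hat P(YS)\,n = \dfrac{191}{250}\cdot\dfrac{205}{250}\cdot 250 = \dfrac{191 \times 205}{250}$

$\hat E(Yw) = \hat P(Yw)\,n = \dfrac{191}{250}\cdot\dfrac{45}{250}\cdot 250 = \dfrac{191 \times 45}{250}$

$\hat E(gS) = \hat P(gS)\,n = \dfrac{59}{250}\cdot\dfrac{205}{250}\cdot 250 = \dfrac{59 \times 205}{250}$

$\hat E(gw) = \hat P(gw)\,n = \dfrac{59}{250}\cdot\dfrac{45}{250}\cdot 250 = \dfrac{59 \times 45}{250}$

Rule: $\hat E_{ij} = \dfrac{\text{Row}_i \times \text{Column}_j}{n}$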

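Observed counts, with expected counts in parentheses:

           Y                                   g                                  Row totals
S          152  (191 x 205 / 250 = 156.62)     53  (59 x 205 / 250 = 48.38)       205
w          39   (191 x 45 / 250 = 34.38)       6   (59 x 45 / 250 = 10.62)        45
Totals     191                                 59                                 250

$$\chi^2 = \frac{(152-156.62)^2}{156.62} + \frac{(53-48.38)^2}{48.38} + \frac{(39-34.38)^2}{34.38} + \frac{(6-10.62)^2}{10.62} = 0.136 + 0.441 + 0.621 + 2.010 = 3.208$$

D.F. = 4 - 1 - 1 - 1 = 1 (4 cells, minus 1 because the counts sum to 250, minus 1 for the row estimate, minus 1 for the column estimate)

$\chi^2_{0.05,1} = 3.841$, p = 0.073

Since 3.208 < 3.841, H0 (independence) is not rejected.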
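The expected counts and the chi-square for the pea table can be reproduced with a few lines of Python. This is a sketch using only the standard library; the p-value line relies on the identity that, for 1 degree of freedom only, the upper-tail chi-square probability equals erfc(sqrt(x/2)).

```python
import math

observed = [[152, 53],   # S (smooth):   Y, g
            [39, 6]]     # w (wrinkled): Y, g

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total x column total / n
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2 = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
           for i in range(2) for j in range(2))

# Valid for 1 DF only: P(chi-square > x) = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(chi2 / 2))

print([[round(e, 2) for e in row] for row in expected])
print(f"chi-square = {chi2:.3f}, p = {p_value:.3f}")
```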

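In General: 2 x 2 Tables:

$$\chi^2 = \sum_{i=1}^{2}\sum_{j=1}^{2}\frac{(f_{ij}-\hat f_{ij})^2}{\hat f_{ij}}$$

DF: 2 x 2 - 1 - 1 - 1 = 1 (4 cells, minus 1 for the row, 1 for the column, 1 for the total)

                              A
               a1                      a2                      Total
B    b1        f11                     f12                     R1 or f1.
               (expected R1 C1 / n)    (expected R1 C2 / n)
     b2        f21                     f22                     R2 or f2.
               (expected R2 C1 / n)    (expected R2 C2 / n)
Total          C1 or f.1               C2 or f.2               n

$$\hat f_{ij} = \frac{f_{i\cdot}\,f_{\cdot j}}{n} = \frac{R_i C_j}{n}, \quad \text{where } i = 1,2;\ j = 1,2$$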

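In general:

                         A
            a1     a2     a3     ...    ac     Rows
     b1     f11    f12    f13    ...    f1c    R1
     b2     f21    f22    f23    ...    f2c    R2
B    b3     f31    f32    f33    ...    f3c    R3
     ...
     br     fr1    fr2    fr3    ...    frc    Rr
Columns     C1     C2     C3     ...    Cc     n

$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(f_{ij}-\hat f_{ij})^2}{\hat f_{ij}}$$

Where:

$$R_i = \sum_{j=1}^{c} f_{ij} = f_{i\cdot}\,; \qquad C_j = \sum_{i=1}^{r} f_{ij} = f_{\cdot j}$$

Estimates of the “cells”:

$$\hat f_{ij} = \frac{R_i C_j}{n} = \frac{f_{i\cdot}\,f_{\cdot j}}{n}$$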

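Degrees of Freedom:

d.f. = r x c - (c-1) - (r-1) - 1 = (r-1)(c-1)

Here r x c is the number of cells, (c-1) and (r-1) are the independent column and row proportions estimated from the margins, and 1 is for the total.

d.f. = rc - (c-1) - (r-1) - 1

     = rc - c + 1 - r + 1 - 1

     = rc - c - r + 1

     = (r-1)(c-1)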

Special cases:

Rows fixed: the numbers to be observed in each row are fixed before the study.

d.f. = rc - (c-1) - 1

Columns fixed:

d.f. = rc - (r-1) - 1

Both fixed (ex: 9 : 3 : 3 : 1):

d.f. = rc - 1

Example 23.1: Hair Color and Gender

Observed counts, with expected frequencies in parentheses:

                          Hair Color
Gender     Black         Brown         Blond         Red           Total
Male       32 (29.0)     43 (36.0)     16 (26.67)    9 (8.33)      100 (=R1)
Female     55 (58.0)     65 (72.0)     64 (53.33)    16 (16.67)    200 (=R2)
Total      87 (=C1)      108 (=C2)     80 (=C3)      25 (=C4)      300 (=n)

$$\chi^2 = \sum_i\sum_j \frac{(f_{ij}-\hat f_{ij})^2}{\hat f_{ij}} = \frac{(32-29.0)^2}{29.0} + \frac{(43-36.0)^2}{36.0} + \cdots + \frac{(16-16.67)^2}{16.67}$$

= 0.3103 + 1.3611 + 4.2667 + 0.0533 + 0.1552 + 0.6806 + 2.1333 + 0.0267

= 8.987

df = (r-1)(c-1) = (2-1)(4-1) = 3

$\chi^2_{0.05,3} = 7.815$

Therefore, reject H0
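The general r x c calculation can be written once and reused. The sketch below is a plain-Python illustration (the function name `chi_square_independence` is an assumption of this example, not from Zar) applied to the hair color data of Example 23.1.

```python
def chi_square_independence(table):
    """Chi-square test statistic and DF for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, f in enumerate(row):
            f_hat = row_totals[i] * col_totals[j] / n  # expected: R_i * C_j / n
            chi2 += (f - f_hat) ** 2 / f_hat
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

hair = [[32, 43, 16, 9],    # Male:   Black, Brown, Blond, Red
        [55, 65, 64, 16]]   # Female: Black, Brown, Blond, Red

chi2_hair, df_hair = chi_square_independence(hair)
print(f"chi-square = {chi2_hair:.3f} with {df_hair} DF (critical value 7.815)")
```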

Okay, we have now rejected the null hypothesis:

H0: The rows and the columns are independent

And accepted the alternative hypothesis:

HA: The rows and the columns are not independent

What now?

Look for sub-hypotheses?

Look at the individual chi-squares!

Cell:          MBk      MBr      MBl      MR       FBk      FBr      FBl      FR
Chi-square:    0.3103   1.3611   4.2667   0.0533   0.1552   0.6806   2.1333   0.0267

It makes sense to combine hair colors with similar gender ratios.

Blonds vs. Non-Blonds?

But are all the Non-Blonds the same?

H0: Black=Brown=Red

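Observed counts, with expected frequencies recomputed from the margins of this reduced table in parentheses:

                     Hair Color
Gender     Black          Brown          Red            Total
Male       32 (33.22)     43 (41.24)     9 (9.55)       84 (=R1)
Female     55 (53.78)     65 (66.76)     16 (15.45)     136 (=R2)
Total      87 (=C1)       108 (=C2)      25 (=C3)       220 (=n)

$\chi^2 = 0.245$ with 2 DF

$\chi^2_{0.05,2} = 5.991$; Accept H0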

Now

H0: Blonds=Non-Blonds

                     Hair Color
Gender     Blond          Non-Blond        Total
Male       16 (26.67)     84 (73.33)       100
Female     64 (53.33)     136 (146.67)     200
Total      80             220              300

$\chi^2 = 8.727$, DF = 1

$\chi^2_{0.05,1} = 3.841$; Reject H0

(1) Too Few Male Blonds

(2) Too Many Female Blonds

(3) Too Many Male NBl

(4) Too Few Female NBl

Hair color is not independent of gender.
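The two sub-hypotheses can be checked with the same r x c calculation. The sketch below recomputes the expected frequencies from the margins of each reduced table; the helper name is illustrative only.

```python
def chi_square_independence(table):
    """Chi-square statistic and DF for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = sum((f - row_totals[i] * col_totals[j] / n) ** 2
               / (row_totals[i] * col_totals[j] / n)
               for i, row in enumerate(table) for j, f in enumerate(row))
    return chi2, (len(table) - 1) * (len(table[0]) - 1)

# H0: Black = Brown = Red (non-blond colors only)
non_blond = [[32, 43, 9],     # Male
             [55, 65, 16]]    # Female
chi2_nonblond, df_nonblond = chi_square_independence(non_blond)

# H0: Blonds = Non-Blonds (non-blond colors pooled)
blond_vs_rest = [[16, 84],    # Male:   Blond, Non-Blond
                 [64, 136]]   # Female: Blond, Non-Blond
chi2_blond, df_blond = chi_square_independence(blond_vs_rest)

print(f"non-blond colors: {chi2_nonblond:.3f} with {df_nonblond} DF")
print(f"blond vs. rest:   {chi2_blond:.3f} with {df_blond} DF")
```

The two degrees of freedom (2 and 1) add up to the 3 DF of the full table, and the two statistics add up approximately, though not exactly, to the overall 8.987.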

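The 2 x 2 Table re-examined:

(1) The problem of counts vs. estimates

$f_{ij}$: counts, whole numbers, integers

$\hat f_{ij} = \dfrac{R_i C_j}{n}$: rational number, fraction

$$\chi^2 = \sum_i\sum_j \frac{(f_{ij}-\hat f_{ij})^2}{\hat f_{ij}}$$

(2) The Yates Correction

$$\chi^2_c = \sum_i\sum_j \frac{\left(|f_{ij}-\hat f_{ij}| - 0.5\right)^2}{\hat f_{ij}}$$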

The subtraction of 0.5 is “conservative”.

Rounding would be better.

(3) The Computing formulae

$$\chi^2 = \frac{n\,(f_{11}f_{22} - f_{12}f_{21})^2}{R_1 R_2 C_1 C_2}$$

$$\chi^2_c = \frac{n\,\left(|f_{11}f_{22} - f_{12}f_{21}| - n/2\right)^2}{R_1 R_2 C_1 C_2}$$

(4) The Cochran/Haber Correction

The Cochran/Haber (Haber (1980)) correction gives better results when routinely employed than does either the Yates-corrected or non-corrected chi-square calculation.

(a) Determine each of the four expected frequencies, denoting the smallest as $\hat f$ and its corresponding observed frequency as $f$.

(b) Calculate the absolute difference between the smallest expected frequency and its corresponding observed frequency: $d = |f - \hat f|$.

(c) If $f \le 2\hat f$, define D = the largest multiple of 0.5 that is < d.

(d) If $f > 2\hat f$, define D = d - 0.5.

(e) Calculate

$$\chi^2_{c'} = \frac{n^3 D^2}{R_1 R_2 C_1 C_2}$$

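Use the Blond/Non-Blond Example:

                     Hair Color
Gender     Blond          Non-Blond        Total
Male       16 (26.67)     84 (73.33)       100
Female     64 (53.33)     136 (146.67)     200
Total      80             220              300

Yates:

$$\chi^2_c = \frac{300\,\left(|16 \times 136 - 84 \times 64| - 150\right)^2}{80 \times 220 \times 100 \times 200} = 7.928$$

Cochran:

$d = |16 - 26.67| = 10.67$

$D = 10.5$

$$\chi^2_{c'} = \frac{300^3\,(10.5)^2}{80 \times 220 \times 100 \times 200} = 8.457$$

Uncorrected:

$\chi^2 = 8.727$, DF = 1

$\chi^2_{0.05,1} = 3.841$; Reject H0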
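The three 2 x 2 statistics for the Blond/Non-Blond table can be compared side by side. The following Python sketch follows steps (a) to (e) of the Cochran/Haber procedure as listed above; the function name `corrections_2x2` and the use of `math.ceil` to find "the largest multiple of 0.5 that is < d" are choices made for this illustration.

```python
import math

def corrections_2x2(f11, f12, f21, f22):
    """Return (uncorrected, Yates, Cochran/Haber) chi-squares for a 2 x 2 table."""
    r1, r2 = f11 + f12, f21 + f22
    c1, c2 = f11 + f21, f12 + f22
    n = r1 + r2
    denom = r1 * r2 * c1 * c2
    cross = abs(f11 * f22 - f12 * f21)

    uncorrected = n * cross ** 2 / denom
    yates = n * (cross - n / 2) ** 2 / denom

    # (a) smallest expected frequency, paired with its observed count
    cells = [(r1 * c1 / n, f11), (r1 * c2 / n, f12),
             (r2 * c1 / n, f21), (r2 * c2 / n, f22)]
    f_hat, f = min(cells)
    # (b) absolute difference
    d = abs(f - f_hat)
    # (c) / (d) choose D
    if f <= 2 * f_hat:
        D = (math.ceil(d / 0.5) - 1) * 0.5  # largest multiple of 0.5 strictly below d
    else:
        D = d - 0.5
    # (e) corrected statistic
    haber = n ** 3 * D ** 2 / denom
    return uncorrected, yates, haber

uncorrected, yates, haber = corrections_2x2(16, 84, 64, 136)
print(f"uncorrected = {uncorrected:.3f}, Yates = {yates:.3f}, Cochran/Haber = {haber:.3f}")
```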

(5) The Fisher Exact Method

2 x 2 tables only

                 A
           a1      a2      Total
B    b1    f11     f12     R1
     b2    f21     f22     R2
Total      C1      C2      n

$$\Pr\left(\begin{matrix} f_{11} & f_{12} \\ f_{21} & f_{22} \end{matrix}\ \middle|\ R_1, R_2, C_1, C_2\right) = \frac{R_1!\,R_2!\,C_1!\,C_2!}{n!\,f_{11}!\,f_{12}!\,f_{21}!\,f_{22}!}$$

(Hypergeometric distribution)

Just like the binomial, we need this probability + all more extreme.

To find those more extreme, look for the smallest expected frequency as in Cochran’s Method. Form successive tables as in the next example:

Observed table (expected frequencies in parentheses):

                Beetles       Bugs          Total
Upper Leaf      12 (9.17)     7 (9.83)      19
Lower Leaf      2 (4.83)      8 (5.17)      10
Total           14            15            29

$$P = \frac{R_1!\,R_2!\,C_1!\,C_2!}{n!\,f_{11}!\,f_{12}!\,f_{21}!\,f_{22}!} = \frac{19!\,10!\,14!\,15!}{29!\,12!\,7!\,2!\,8!}$$

= antilog[(log 19! + log 10! + log 14! + log 15! - log 29!) - (log 12! + log 7! + log 2! + log 8!)]

= antilog[15.75522 - 17.28932]

= antilog[-1.53410]

= antilog[0.46590 - 2.00000]

= 0.02923

From 2 to 1:

                Beetles       Bugs          Total
Upper Leaf      13 (9.17)     6 (9.83)      19
Lower Leaf      1 (4.83)      9 (5.17)      10
Total           14            15            29

$$P = \frac{19!\,10!\,14!\,15!}{29!\,13!\,6!\,1!\,9!}$$

= antilog[15.75522 - (log 13! + log 6! + log 1! + log 9!)]

= antilog[-2.45615]

= 0.003498

And 1 to 0:

                Beetles       Bugs          Total
Upper Leaf      14 (9.17)     5 (9.83)      19
Lower Leaf      0 (4.83)      10 (5.17)     10
Total           14            15            29

$$P = \frac{19!\,10!\,14!\,15!}{29!\,14!\,5!\,0!\,10!}$$

= antilog[15.75522 - (log 14! + log 5! + log 0! + log 10!)]

= antilog[-3.82413]

= 0.0001499

Prob = 0.02923 + 0.003498 + 0.0001499 = 0.03288. Reject H0: Independence or ?
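The factorial formula is easier to evaluate with binomial coefficients than with log tables. The sketch below uses `math.comb` from the Python standard library; the product of combinations is algebraically the same as R1! R2! C1! C2! / (n! f11! f12! f21! f22!). It computes the one-tailed probability only, stepping the Lower Leaf / Beetles cell from 2 down to 0 as in the example.

```python
from math import comb

def table_probability(f11, f12, f21, f22):
    """Hypergeometric probability of one 2 x 2 table with its margins fixed."""
    r1, r2 = f11 + f12, f21 + f22
    c1 = f11 + f21
    n = r1 + r2
    # Same value as R1! R2! C1! C2! / (n! f11! f12! f21! f22!)
    return comb(r1, f11) * comb(r2, f21) / comb(n, c1)

# Observed beetle/bug table: upper leaf (12, 7), lower leaf (2, 8)
f11, f12, f21, f22 = 12, 7, 2, 8

probs = []
while f21 >= 0 and f12 >= 0:
    probs.append(table_probability(f11, f12, f21, f22))
    # Next more extreme table with the same margins
    f11, f12, f21, f22 = f11 + 1, f12 - 1, f21 - 1, f22 + 1

p_one_tail = sum(probs)
print([f"{p:.7f}" for p in probs])
print(f"one-tailed P = {p_one_tail:.5f}")
```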

A heterogeneity chi-square analysis of 2 x 2 contingency tables.

H0: The four samples are homogeneous.
HA: The four samples are heterogeneous.

Example 6.4a (Edition 2), or see Example 23.8a (Edition 4)

Experiment 1
               Dead    Alive   Total
Treated        9       15      24
Not treated    15      10      25
Total          24      25      49
chi-square = 2.4806, DF = 1

Experiment 2
               Dead    Alive   Total
Treated        13      12      25
Not treated    18      7       25
Total          31      19      50
chi-square = 2.1222, DF = 1

Experiment 3
               Dead    Alive   Total
Treated        12      13      25
Not treated    17      8       25
Total          29      21      50
chi-square = 2.0525, DF = 1

Experiment 4
               Dead    Alive   Total
Treated        10      14      24
Not treated    16      9       25
Total          26      23      49
chi-square = 2.4522, DF = 1

Pooled Data for Experiments 1-4
               Dead    Alive   Total
Treated        44      54      98
Not treated    66      34      100
Total          110     88      198

H0: Treatment & status are independent

Total of chi-squares          9.1075   DF = 4
Chi-square of pooled data     8.9262   DF = 1
Heterogeneity chi-square      0.1813   DF = 3

$\chi^2_{0.05,3} = 7.815$; Accept H0: the experiments are homogeneous
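The heterogeneity calculation is the sum of the individual chi-squares minus the chi-square of the pooled table. The Python sketch below illustrates this; note that the cell counts used for Experiment 3 (12, 13, 17, 8) are an assumption, chosen as the values consistent with that experiment's totals (29, 21, 50) and with the stated total and pooled chi-squares.

```python
def chi_square_2x2(a, b, c, d):
    """Uncorrected chi-square for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# (treated dead, treated alive, not treated dead, not treated alive)
experiments = [
    (9, 15, 15, 10),   # Experiment 1
    (13, 12, 18, 7),   # Experiment 2
    (12, 13, 17, 8),   # Experiment 3 (assumed cells, see lead-in)
    (10, 14, 16, 9),   # Experiment 4
]

total_chi2 = sum(chi_square_2x2(*e) for e in experiments)   # DF = 4
pooled = [sum(cells) for cells in zip(*experiments)]        # cell-wise sums
pooled_chi2 = chi_square_2x2(*pooled)                       # DF = 1
heterogeneity_chi2 = total_chi2 - pooled_chi2
heterogeneity_df = len(experiments) - 1                     # DF = 3

print(f"total = {total_chi2:.4f}, pooled = {pooled_chi2:.4f}, "
      f"heterogeneity = {heterogeneity_chi2:.4f} with {heterogeneity_df} DF")
```

Pooling is only justified when the heterogeneity chi-square is not significant, as is the case here.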

THE LOG-LIKELIHOOD RATIO

The log-likelihood ratio was introduced in Session 2. In tables, it is calculated:

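$$G = 2\left[\sum_i\sum_j f_{ij}\ln f_{ij} - \sum_i R_i \ln R_i - \sum_j C_j \ln C_j + n \ln n\right],$$

or

$$G = 4.60517\left[\sum_i\sum_j f_{ij}\log f_{ij} - \sum_i R_i \log R_i - \sum_j C_j \log C_j + n \log n\right].$$

Since G is approximately distributed as $\chi^2$, Table B.1 may be used with (r-1)(c-1) degrees of freedom.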

Log-likelihood Example: Hair Color

                     Hair Color
Sex        Black    Brown    Blond    Red      Total
Male       32       43       16       9        100
Female     55       65       64       16       200
Total      87       108      80       25       300

$$G = 4.60517\left[\sum_i\sum_j f_{ij}\log f_{ij} - \sum_i R_i \log R_i - \sum_j C_j \log C_j + n \log n\right]$$

= 4.60517[(32)(1.50515) + (43)(1.63347) + (16)(1.20412)

+ (9)(0.95424) + (55)(1.74036) + (65)(1.81291)

+ (64)(1.80618) + (16)(1.20412) - (100)(2.00000)

- (200)(2.30103) - (87)(1.93952) - (108)(2.03342)

- (80)(1.90309) - (25)(1.39794) + (300)(2.47712)]

= 4.60517(2.06518)

= 9.510 with DF = 3

$\chi^2_{0.05,3} = 7.815$; reject H0.
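The same G can be computed directly with natural logarithms, which avoids the 4.60517 conversion factor. This is a minimal Python sketch of the natural-log form of the formula; the function name `g_statistic` is illustrative. Small differences in the third decimal from the hand calculation are expected because the slide rounds each logarithm to five places.

```python
import math

def g_statistic(table):
    """Log-likelihood ratio G and DF for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    def x_ln_x(x):
        # By convention 0 * ln(0) is taken as 0
        return x * math.log(x) if x > 0 else 0.0

    cell_term = sum(x_ln_x(f) for row in table for f in row)
    row_term = sum(x_ln_x(r) for r in row_totals)
    col_term = sum(x_ln_x(c) for c in col_totals)
    g = 2 * (cell_term - row_term - col_term + x_ln_x(n))
    df = (len(table) - 1) * (len(table[0]) - 1)
    return g, df

hair = [[32, 43, 16, 9],    # Male:   Black, Brown, Blond, Red
        [55, 65, 64, 16]]   # Female: Black, Brown, Blond, Red

g_hair, df_hair = g_statistic(hair)
print(f"G = {g_hair:.3f} with {df_hair} DF (critical value 7.815)")
```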

Three and Higher Dimensional Tables

Ex: Problem III: Row: Remission; Column: Age; Tier: LI

[Figure: 2 x 2 x 2 table. Row 1: complete remission; Row 2: not complete remission. Columns: age < 50, age > 50. Tiers: LI < 8%, LI > 8%.]

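Notation: Row i; column j; tier l

$f_{ijl}$ = # in row i, column j, tier l

[Figure: cube of cells labeled f111, f112, f121, f122, f211, f221]

Mutual independence:

$$\hat f_{ijl} = \frac{R_i\,C_j\,T_l}{n^2}$$

where

$$R_i = \sum_{j=1}^{c}\sum_{l=1}^{t} f_{ijl}\,; \qquad C_j = \sum_{i=1}^{r}\sum_{l=1}^{t} f_{ijl}\,; \qquad T_l = \sum_{i=1}^{r}\sum_{j=1}^{c} f_{ijl}$$

$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\sum_{l=1}^{t}\frac{(f_{ijl}-\hat f_{ijl})^2}{\hat f_{ijl}}$$

df = rct - (c-1) - (r-1) - (t-1) - 1 = rct - c - r - t + 2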
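The mutual-independence calculation extends the two-way rule directly: each expected count is the product of the three marginal totals divided by n squared. The Python sketch below illustrates this on a 2 x 2 x 2 table. The counts are hypothetical, invented only to exercise the formula; they are not the remission data from the example.

```python
from itertools import product

# Hypothetical counts f[i][j][l] for a 2 x 2 x 2 table (NOT the slide's data)
f = [[[20, 15], [10, 5]],
     [[12, 18], [8, 12]]]

r, c, t = len(f), len(f[0]), len(f[0][0])
cells = list(product(range(r), range(c), range(t)))
n = sum(f[i][j][l] for i, j, l in cells)

# Marginal totals for rows, columns and tiers
R = [sum(f[i][j][l] for j in range(c) for l in range(t)) for i in range(r)]
C = [sum(f[i][j][l] for i in range(r) for l in range(t)) for j in range(c)]
T = [sum(f[i][j][l] for i in range(r) for j in range(c)) for l in range(t)]

# Expected count under mutual independence: R_i * C_j * T_l / n^2
f_hat = {(i, j, l): R[i] * C[j] * T[l] / n ** 2 for i, j, l in cells}

chi2 = sum((f[i][j][l] - f_hat[i, j, l]) ** 2 / f_hat[i, j, l]
           for i, j, l in cells)
df = r * c * t - r - c - t + 2

print(f"chi-square = {chi2:.3f} with {df} DF")
```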

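Types of Independence:

3-D:

Partial

R vs C & T: Column & Tier spread out as a single variable

$$\hat f_{ijl} = \frac{R_i\,(CT)_{jl}}{n}, \qquad \chi^2 = \sum_i\sum_j\sum_l \frac{(f_{ijl}-\hat f_{ijl})^2}{\hat f_{ijl}}, \qquad \text{d.f.} = (r-1)(ct-1)$$

          Col. 1    Col. 1          Col. 1    Col. 2          Col. c
          Tier 1    Tier 2    ...   Tier t    Tier 1    ...   Tier t    Total
Row 1     f111      f112      ...   f11t      f121      ...   f1ct      R1
Row 2     f211      f212      ...   f21t      f221      ...   f2ct      R2
...
Row r     fr11      fr12      ...   fr1t      fr21      ...   frct      Rr
Total     (CT)11    (CT)12    ...   (CT)1t    (CT)21    ...   (CT)ct    n

C vs R & T: Row & Tier spread out as a single variable

          (R1 T1)   (R1 T2)   ...   (Rr Tt)   Total
Col. 1    f111      f112      ...   fr1t      C1
Col. 2    f121      f122      ...   fr2t      C2
...
Col. c    f1c1      f1c2      ...   frct      Cc
Total     (RT)11    (RT)12    ...   (RT)rt    n

$$\hat f_{ijl} = \frac{C_j\,(RT)_{il}}{n}, \qquad \text{df} = (c-1)(rt-1)$$

T vs R & C: Row & Column spread out as a single variable

          (R1 C1)   (R1 C2)   ...   (Rr Cc)   Total
Tier 1                                        T1
Tier 2                                        T2
...
Tier t                                        Tt
Total     (RC)11    (RC)12    ...   (RC)rc    n

$$\hat f_{ijl} = \frac{T_l\,(RC)_{ij}}{n}, \qquad \text{df} = (t-1)(rc-1)$$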

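Pairwise

In each pairwise table the cell entries are summed over the third variable.

R vs C

                    C
          1      2      ...    c      Total
     1    f11    f12    ...    f1c    R1
R    2    f21    f22    ...    f2c    R2
     ...
     r    fr1    fr2    ...    frc    Rr
Total     C1     C2     ...    Cc     n

$$\hat f_{ij\cdot} = \frac{R_i C_j}{n}, \qquad \text{df} = (r-1)(c-1)$$

R vs T

                    T
          1      2      ...    t      Total
     1    f11    f12    ...    f1t    R1
R    2    f21    f22    ...    f2t    R2
     ...
     r    fr1    fr2    ...    frt    Rr
Total     T1     T2     ...    Tt     n

$$\hat f_{i\cdot l} = \frac{R_i T_l}{n}, \qquad \text{df} = (r-1)(t-1)$$

C vs T

                    T
          1      2      ...    t      Total
     1    f11    f12    ...    f1t    C1
C    2    f21    f22    ...    f2t    C2
     ...
     c    fc1    fc2    ...    fct    Cc
Total     T1     T2     ...    Tt     n

$$\hat f_{\cdot jl} = \frac{C_j T_l}{n}, \qquad \text{df} = (c-1)(t-1)$$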

Testing Proportions:

A standard problem for which contingency tables and the independence null hypothesis are used is a test of proportions (more about this in Session 7).

The problem in its simplest form is a 2 by 2 contingency table like the following, comparing two groups (control and treatment, say) for which there are two outcomes:

positive/negative, alive/dead, remission/no remission

                          Group
                    Control    Treatment    Total
Status   Positive   f11        f12          np
         Negative   f21        f22          nn
Total               nc         nt           n

The control group percent positive = (f11 / nc) x 100%

The treatment group percent positive = (f12 / nt) x 100%

H0: % Positive Control = % Positive Treatment

HA: % Positive Control ≠ % Positive Treatment

The hypothesis of independence, tested with the Fisher exact method or the chi-square approximation, is a test of this hypothesis.
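As a small illustration of the link between the two proportions and the 2 x 2 test, the Python sketch below borrows the counts from Experiment 1 of the heterogeneity example and treats "dead" as the positive outcome and "not treated" as the control group. That mapping is an assumption made only for this example. The p-value uses the 1-DF identity P(chi-square > x) = erfc(sqrt(x/2)).

```python
import math

# Counts borrowed from Experiment 1 of the heterogeneity example;
# "dead" is treated as the positive outcome (illustrative assumption).
f11, f12 = 15, 9     # positive: control, treatment
f21, f22 = 10, 15    # negative: control, treatment

n_c, n_t = f11 + f21, f12 + f22      # group sizes (column totals)
n_p, n_n = f11 + f12, f21 + f22      # outcome totals (row totals)
n = n_c + n_t

pct_control = 100 * f11 / n_c
pct_treatment = 100 * f12 / n_t

# Uncorrected 2 x 2 chi-square (computing formula), 1 DF
chi2 = n * (f11 * f22 - f12 * f21) ** 2 / (n_p * n_n * n_c * n_t)
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"control: {pct_control:.1f}% positive, treatment: {pct_treatment:.1f}% positive")
print(f"chi-square = {chi2:.4f}, p = {p_value:.3f}")
```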

Power Calculations and Sample Size Determination:

Proportion positive control (p1)
Proportion positive treatment (p2)
Sample size (n)
Type I error: α = Pr(rejecting H0 | H0 is true)
Type II error: β = Pr(rejecting HA | HA is true) = 1 - Pr(accepting HA | HA is true) = 1 - power

Given any four of these (p1, p2, n, α, β), the fifth one is specified.

This uses the formula of Casagrande JT, Pike MC (An Improved Formula for Calculating Sample Sizes for Comparing Two Binomial Distributions, Biometrics, 34, 483-486):

$$n = \frac{A\left[1 + \sqrt{1 + 4(p_1 - p_2)/A}\,\right]^2}{4(p_1 - p_2)^2}$$

where

$$A = \left[K_\alpha\sqrt{2\bar p\,\bar q} + K_\beta\sqrt{p_1 q_1 + p_2 q_2}\,\right]^2$$

$$\bar p = (p_1 + p_2)/2, \qquad \bar q = 1 - \bar p, \qquad q_1 = 1 - p_1, \qquad q_2 = 1 - p_2$$

$K_\alpha$ is the (1 - α) x 100% or (1 - α/2) x 100% critical level for the Normal/t-distribution (as appropriate), and $K_\beta$ is the corresponding (1 - β) x 100% level.

This equation can be solved for the parameter that is not defined. With the t-distribution, iteration is required.
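A direct evaluation of the sample-size formula can be sketched in Python using Normal critical values from the standard library's `statistics.NormalDist`. Several points here are assumptions of this illustration rather than statements from the session: n is read as the size of each group, the absolute difference |p1 - p2| is used so the order of the two proportions does not matter, the Normal rather than the t-distribution supplies the critical values, and the function name and default arguments are arbitrary.

```python
import math
from statistics import NormalDist

def sample_size(p1, p2, alpha=0.05, power=0.80, two_sided=False):
    """Sample size from the Casagrande-Pike formula, Normal critical values."""
    z = NormalDist().inv_cdf
    k_alpha = z(1 - alpha / 2) if two_sided else z(1 - alpha)
    k_beta = z(power)                       # power = 1 - beta

    p_bar = (p1 + p2) / 2
    q_bar = 1 - p_bar
    a = (k_alpha * math.sqrt(2 * p_bar * q_bar)
         + k_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2

    delta = abs(p1 - p2)                    # assumption: order of p1, p2 is irrelevant
    return a * (1 + math.sqrt(1 + 4 * delta / a)) ** 2 / (4 * delta ** 2)

n_small_gap = sample_size(0.6, 0.4)         # one-sided alpha = 0.05, power = 0.80
n_large_gap = sample_size(0.7, 0.4)

print(f"p1 = 0.6, p2 = 0.4: n = {math.ceil(n_small_gap)}")
print(f"p1 = 0.7, p2 = 0.4: n = {math.ceil(n_large_gap)}")
```

Solving for a different unknown (for example, power given n) would require inverting this expression numerically, in line with the remark that iteration may be needed.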
