Chi Square Test

52
1 Chi-square test Chi-square test or 2 2 test A A nonparametric nonparametric hypothesis test hypothesis test

Transcript of Chi Square Test

Page 1: Chi Square Test

1

Chi-square testChi-square testor22 test A A nonparametric nonparametric hypothesis testhypothesis test

Page 2: Chi Square Test

2

Parametric vs. Parametric vs. Nonparametric TestsNonparametric Tests

•Parametric hypothesis test– about population parameter (

or p)– z, t, tests– interval/ratio data

•Nonparametric tests– do not test a specific parameter– nominal & ordinal data– frequency data ~

Page 3: Chi Square Test

3

Chi-square testChi-square test

Used to test the countscounts of categorical dataThreeThree types

Goodness of fit (univariate)Independence (bivariate)Homogeneity (univariate with two samples)

Page 4: Chi Square Test

4

22 distribution – distribution –

df=3

df=5

df=10

Page 5: Chi Square Test

5

22 distribution distribution

Different df have different curvesSkewed rightAs df increases, curve shifts toward right & becomes more like a normal curvenormal curve

Page 6: Chi Square Test

6

2 2 assumptionsassumptionsSRS SRS – reasonably random sampleHave countscounts of categorical data & we expect each category to happen at least onceSample sizeSample size – to insure that the sample size is large enough we should expect at least five in each category.

***Be sure to list expected counts!!

Combine these together:

All expected counts are at

least 5.

Page 7: Chi Square Test

7

2 2 formulaformula

i

ii

EO 2

2 E

where where OOi i is the observed frequencyis the observed frequencyEEii is the expected frequency is the expected frequency

Page 8: Chi Square Test

8

2 2 Goodness of fit testGoodness of fit testUses univariate dataWant to see how well the observed counts “fit” what we expect the counts to be

Based on df –Based on df –

df = number of df = number of categoriescategories - 1 - 1

Page 9: Chi Square Test

9

Hypotheses – written in Hypotheses – written in wordswords

H0: the observed counts equal the expected counts, i.e., there is no significant difference between the observed and the expected counts H1: the observed counts are not equal to the expected counts

Be sure to write in context!

Page 10: Chi Square Test

10

Steps for Computation of Steps for Computation of 22(1)Compute the expected frequencies E1,

E2,…, En corresponding to the observed frequencies O1, O2,…, On under some hypothesis.

(2) Compute the deviations (O-E) for each frequency and then square them to obtain (O-E)2 .

(3)Divide the square of the deviations (O-E)2 by the corresponding expected frequency to obtain (O-E)2 / E.

(4) Add the values obtained in step 3 to compute:

i

ii

EO 2

2 E

Page 11: Chi Square Test

11

Steps for Computation of Steps for Computation of 2 2 (con’t)(con’t)5 Under the null hypothesis test the theory

fits well, the above statistic follows 22 distribution with v=n-1 d.f.

6 Look up the tabulated value of 22 for (n-1) d.f at given level of significance.

7 If the calculated value of 2 2 is less than the corresponding tabulated value obtained in step 6, then it is said to be non-significant at the required level of significance.

8 If the calculated value of 2 2 is greater than the corresponding tabulated value obtained in step 6, then it is said to be significant at the required level of significance.

Page 12: Chi Square Test

12

Let’s test our dice!Let’s test our dice!

Page 13: Chi Square Test

13

1. A dice is rolled 100 times with the following distribution:Number : 1 2 3 4 5 6 Observed frequency : 17 14 20 17 17 15At the 0.01 level of significance, determine whether dice is true (unbiased).Solution. We are given: Number of categories = 6 N = total frequency = 17+14+20+17+17+15 =100

Null Hypothesis: Ho : The die is unbiased, i.e., the probability of obtaining six faces is same

Alternate Hypothesis: H1 : The dice is biased

CASELETSCASELETS

Page 14: Chi Square Test

14

CASELETS (con’t)CASELETS (con’t)Expected frequency for each face (E) = N*p =100/6 =16.67(all expected values are greater than 5)

Number Observed Frequency (O)

Expected Frequency (E)

(O-E) (O-E)2 (O-E)2/E

1 17 16.67 0.33 0.1089 0.00652 14 16.67 -2.67 7.1289 0.42763 20 16.67 3.33 11.088 0.66524 17 16.67 0.33 0.1089 0.00655 17 16.67 0.33 0.1089 0.00656 15 16.67 -1.67 2.7889 0.1673Total

22==1.2791.27966

Page 15: Chi Square Test

15

CASELETS (con’t)CASELETS (con’t)

2796.1E 22

i

ii

EO

Since the calculated value of 22 = 1.2796 = 1.2796 is less than the tabulated value of 22, i.e., 15.086 (for 5 d.f at 1% level of significance) therefore the null hypothesis is accepted and we conclude that the dice is regarded as true (unbiased).

Page 16: Chi Square Test

16

2. Offspring of certain fruit flies may have yellow or ebony bodies and normal wings or short wings. Genetic theory predicts that these traits will appear in the ratio 9:3:3:1 (yellow & normal, yellow & short, ebony & normal, ebony & short) A researcher checks 100 such flies and finds the distribution of traits to be 59, 20, 11, and 10, respectively. What are the expected counts? df?Are the results consistent with the theoretical distribution predicted by the genetic model? (5% level of significance)

Expected counts:Y & N = 56.25Y & S = 18.75E & N = 18.75E & S = 6.25 We expect 9/16 of the

100 flies to have yellow and normal

wings. (Y & N)

Since there are 4 categories,

df = 4 – 1 = 3

CASELETSCASELETS

Page 17: Chi Square Test

17

Assumption:All expected counts are greater than 5. Expected counts:Y & N = 56.25, Y & S = 18.75, E & N = 18.75, E & S = 6.25H0: The distribution of fruit flies is the same as the theoretical model.Ha: The distribution of fruit flies is not the same as the theoretical model.Number Obser.

Freq. (O)ExpecFreq. (E)

(O-E) (O-E)2 (O-E)2/E

1 59 56.25 2.75 7.5625 0.1352 20 18.75 1.25 1.5625 0.0833 11 18.75 -7.75 60.062

53.203

4 10 6.25 3.75 14.0625

2.25

Total

22==5.6715.671

CASELETS (con’t)CASELETS (con’t)

Page 18: Chi Square Test

18

CASELETS (con’t)CASELETS (con’t)

671.5E 22

i

ii

EO

Since the calculated value of 22 = 5.671 = 5.671 is less than the tabulated value of 22, i.e., 7.815 (for 3 d.f at 5% level of significance) therefore the null hypothesis is accepted and we conclude that the distribution of fruit flies is the same as the theoretical model.

Page 19: Chi Square Test

19

3.Does your zodiac sign determine how successful you will be? Fortune magazine collected the zodiac signs of 256 heads of the largest 400 companies. Is there sufficient evidence to claim that successful people are more likely to be born under some signs than others?Aries 23 Libra 18 Leo

20Taurus 20 Scorpio 21 Virgo 19Gemini 18 Sagittarius19 Aquarius

24Cancer 23 Capricorn 22 Pisces

29How many would you expect in each sign if there were no difference between them?How many degrees of freedom?

Since there are 12 signs –df = 12 – 1 = 11

CASELETSCASELETS

I would expect CEOs to be equally born under all signs.

So 256/12 = 21.333333

Page 20: Chi Square Test

20

AssumptionAll expected counts are greater than 5. (I expect 21.33 CEO’s to be born in each sign.)H0: The number of CEO’s born under each sign is the same.H1: The number of CEO’s born under each sign is the different.

094.53.21

3.2129...3.21

3.21203.21

3.2123 2222

CASELETS (con’t)CASELETS (con’t)

Page 21: Chi Square Test

21

CASELETS (con’t)CASELETS (con’t)

094.5E 22

i

ii

EO

Since the calculated value of 22 = 5.094 = 5.094 is less than the tabulated value of 22, i.e., 19.675 (for 11 d.f at 1% level of significance) therefore the null hypothesis is accepted and we conclude that number of CEO’s born under each sign is the same.

Page 22: Chi Square Test

22

CASELETSCASELETS4.Records taken of the number of male and female births in 800 families having four children are given as follows:

No. of births FrequencyMale Female

0 4 321 3 1782 2 2903 1 2364 0 64

Test whether the data are consistent with the hypothesis that the binomial law holds and the chance of a male birth is equal to that of female birth.

Page 23: Chi Square Test

23

CASELETS (con’t)CASELETS (con’t)Let us set up the null hypothesis that the data are consistent with the binomial law of equal probability for male and female births

We are given n = 4, N = 800According to binomial probability law, the frequency of r male births is given by:F(r) = N*p(r) = N* nCr * pr

*qn-r

= 800* 4Cr * (0.5)r

*(0.5)4-r

= 50 * 4Cr; (r = 0,1,2,3,4)

No. of male births

Expected frequency

F(r)0 50 * 4C0 = 501 50 * 4C1 = 2002 50 * 4C2 = 3003 50 * 4C3 = 2004 50 * 4C4 = 50

Total 800

Page 24: Chi Square Test

24

No. of male birth

Obser.Freq. (O)

ExpecFreq. (E)

(O-E) (O-E)2 (O-E)2/E

0 32 50 -18 324 6.481 178 200 -22 484 2.422 290 300 -10 100 0.333 236 200 36 1296 6.484 64 50 14 196 3.92Total

22==19.6319.63

CASELETS (con’t)CASELETS (con’t)

63.19E 22

i

ii

EO

Since the calculated value of 22 = 19.63 = 19.63 is greater than the tabulated value of 22, i.e., 9.488 (for 4 d.f at 5% level of significance) therefore the null hypothesis is rejected and we conclude that hypothesis of equal male and female births is wrong

Page 25: Chi Square Test

25

5. A company says its premium mixture of nuts contains 10% Brazil nuts, 20% cashews, 20% almonds, 10% hazelnuts and 40% peanuts. You buy a large can and separate the nuts. Upon weighing them, you find there are 112 g Brazil nuts, 183 g of cashews, 207 g of almonds, 71 g or hazelnuts, and 446 g of peanuts. You wonder whether you mix is significantly different from what the company advertises?Why is the chi-square goodness-of-fit test NOT appropriate here?What might you do instead of weighing the nuts in order to use chi-square?

Because we do NOT have countscounts

of the type of nuts.We could countcount the number of each type of nut and then perform a

2 test.

CASELETSCASELETS

Page 26: Chi Square Test

26

1. The following figures show the distribution of digits in numbers chosen at random from a telephone directory:

Digit : 0 1 2 3 4 5 6 7 8 9 10

Frequency:1026 1107 997 966 1075 933 1107 972 964 853

Test whether the digits may be taken to occur equally frequently in the directory.(tabular value for 9 d.f at 5% level of significance is 16.92)

2. The number of scooter accidents per month in a certain town were as follows:

12, 8, 20, 2, 14, 10, 15, 6, 9, 4Are these frequencies in agreement with the belief that

accidents conditions were the same during this 10 month period? (tabular value for 9 d.f at 5% level is 16.92)

Practice CASELETSPractice CASELETS

Page 27: Chi Square Test

27

2 test for independence

Used with categorical, bivariate data from ONE sampleUsed to see if the two categorical variables are associated (dependent) or not associated (independent)

Page 28: Chi Square Test

28

Assumptions & Assumptions & formula remain the formula remain the same!same!

Page 29: Chi Square Test

29

Hypotheses – written in Hypotheses – written in wordswordsH0: two variables are independentH1: two variables are dependent

Be sure to write in context!

Page 30: Chi Square Test

30

1. A beef distributor wishes to determine whether there is a relationship between geographic region and cut of meat preferred. If there is no relationship, we will say that beef preference is independent of geographic region. Suppose that, in a random sample of 500 customers, 300 are from the North and 200 from the South. Also, 150 prefer cut A, 275 prefer cut B, and 75 prefer cut C. Also suppose that in the actual sample of 500 consumers the observed numbers were as follows:

CASELETSCASELETS

Page 31: Chi Square Test

31

CASELETS (con’t)CASELETS (con’t)

 Is there sufficient evidence to suggest that geographic regions and beef preference are not independent? (Is there a difference between the expected and observed counts?)

North South Total

Cut A 100 50 150

Cut B 150 125 275

Cut C 50 25 75

Total 300 200 500

Page 32: Chi Square Test

32

Expected Counts Expected Counts

Assuming H0 is true,

totaltablealcolumn tot totalrow counts expected

CASELETS (con’t)CASELETS (con’t)

SolutionSolution

Page 33: Chi Square Test

33

Degrees of freedomDegrees of freedom

)1c)(1(r df

Or cover up one row & one column & count the number of cells remaining!

CASELETS (con’t)CASELETS (con’t)

Page 34: Chi Square Test

34

If beef preference is independent of geographic region, how would we expect this table to be filled in?

North South Total

Cut A 150

Cut B 275

Cut C 75

Total 300 200 500

90 60165

110

45 30

CASELETS (con’t)CASELETS (con’t)

Page 35: Chi Square Test

35

Assumptions:All expected counts are greater than 5.

H0: geographic region and beef preference are independent H1: geographic region and beef preference are dependent Obser.

Freq. (O)ExpecFreq. (E)

(O-E) (O-E)2 (O-E)2/E

100 90 10 100 0.1150 60 -10 100 1.67150 165 -15 225 0.33125 110 15 225 1.3650 45 5 25 0.5625 30 -5 25 0.83

22==4.864.86

CASELETS (con’t)CASELETS (con’t)

Page 36: Chi Square Test

36

Since the calculated value of 22 = 4.86 = 4.86 is less than the tabulated value of 22, i.e., 5.991 (for (3-1)*(2-1)=2 d.f at 5% level of significance) therefore the null hypothesis is accepted and we conclude that geographic region and beef preference are independent

CASELETS (con’t)CASELETS (con’t)

Page 37: Chi Square Test

37

2. In a certain sample of 2000 families 1400 families are consumer of tea. Out of 1800 Hindu families, 1236 families consume tea. Use Chi-Square test and state whether there is any significant difference between consumption of tea among Hindu and non-Hindu families. (5% level of significance)

CASELETSCASELETS

Number of Hindu Non-Hindu TotalFamilies consuming tea

1236 164 1400

Families not consuming tea

564 36 600

Total 1800 200 2000

Page 38: Chi Square Test

38

Solution: Expected Solution: Expected Counts Counts

•Assuming H0 is true,

totaltablealcolumn tot totalrow counts expected

CASELETS (con’t)CASELETS (con’t)

Number of Hindu Non-Hindu TotalFamilies consuming tea

1400

Families not consuming tea

600

Total 1800 200 2000

1260540

14060

Page 39: Chi Square Test

39

Assumptions:All expected counts are greater than 5.

H0: consumption of tea and community are independent, i.e., there is no significance difference between the consumption of tea among Hindu and non-Hindu familiesH1: consumption of tea and community are dependent

Obser.Freq. (O)

ExpecFreq. (E)

(O-E) (O-E)2 (O-E)2/E

1236 1260 -24 576 0.457564 540 24 576 1.067164 140 24 576 4.11436 60 -24 576 9.6

Total

22==15.2315.2388

CASELETS (con’t)CASELETS (con’t)

Page 40: Chi Square Test

40

Since the calculated value of 22 = 15.238 = 15.238 is greater than the tabulated value of 22, i.e., 3.841 (for (2-1)*(2-1)=1 d.f at 5% level of significance) therefore the null hypothesis is rejected and we conclude that the two communities differ significantly as regards the consumption of tea among them.

CASELETS (con’t)CASELETS (con’t)

Page 41: Chi Square Test

41

3. A sample of 400 students of under-graduate and 400 students of post-graduate classes was taken to know their opinion about autonomous colleges. 290 of the under-graduate and 310 of the post-graduate students favored the autonomous status. Present these facts in the form of a table and test at 55 level, that the opinion regarding autonomous status of college is independent of the level of classes of students.

CASELETSCASELETS

Class Number of Students Total

Favoring Opposing

Under Graduate 290 400Post Graduate 310 400Total

SolutionObserved Frequencies

11090

600

200

800

Page 42: Chi Square Test

42

Expected Counts Expected Counts •Assuming H0 is true,

totaltablealcolumn tot totalrow counts expected

CASELETS (con’t)CASELETS (con’t)

Class Number of Students Total

Favoring Opposing

Under Graduate 400Post Graduate 400Total 600 200 800

300

10030

0100

Page 43: Chi Square Test

43

Assumptions:All expected counts are greater than 5.

H0: Opinion about autonomous colleges is independent of the level of classesH1: consumption of tea and community are dependent

Obser.Freq. (O)

ExpecFreq. (E)

(O-E) (O-E)2 (O-E)2/E

290 300 -10 100 0.33110 100 10 100 1.00310 300 10 100 0.3390 100 -10 100 1.00

Total

22==2.662.66

CASELETS (con’t)CASELETS (con’t)

Page 44: Chi Square Test

44

Since the calculated value of 22 = 2.66 = 2.66 is less than the tabulated value of 22, i.e., 3.841 (for (2-1)*(2-1)=1 d.f at 5% level of significance) therefore the null hypothesis is accepted and we conclude that the opinion about autonomous colleges may be regarded to be independent of the level of classes of the students

CASELETS (con’t)CASELETS (con’t)

Page 45: Chi Square Test

45

4. Suppose that, in a public opinion survey answers to the questions-

(a)Do you drink(b)Are you in favor of local option on sale of liquor?Were as given in the table

CASELETSCASELETS

Question (b)

Question (a) TotalYes No

Yes 56 31 87No 18 6 24Total 74 37 111

Can you infer that opinion on local option is dependent on whether or not an individual drinks?

Page 46: Chi Square Test

46

Solution: Expected Solution: Expected Counts Counts

•Assuming H0 is true,

totaltablealcolumn tot totalrow counts expected

CASELETS (con’t)CASELETS (con’t)

Question (b)

Question (a) TotalYes No

Yes 87No 24Total 74 37 111

58 2916 8

Page 47: Chi Square Test

47

Assumptions:All expected counts are greater than 5.

H0: the local option on sale of liquor is independent of whether or not an individual drinksH1: geographic region and beef preference are dependent

Obser.Freq. (O)

ExpecFreq. (E)

(O-E) (O-E)2 (O-E)2/E

56 58 -2 4 0.06931 29 2 4 0.13818 16 2 4 0.256 8 -2 4 0.5

Total

22==0.9570.957

CASELETS (con’t)CASELETS (con’t)

Page 48: Chi Square Test

48

Since the calculated value of 22 = 0.957 = 0.957 is less than the tabulated value of 22, i.e., 3.841 (for (2-1)*(2-1)=1 d.f at 5% level of significance) therefore the null hypothesis is accepted and we conclude that the opinion on local option on sale of liquor is independent (not dependent) of whether or not an individual drinks

CASELETS (con’t)CASELETS (con’t)

Page 49: Chi Square Test

49

Yates correction for Yates correction for ContinuityContinuity

If any cell frequency in 2X2 table is less than 5, then for the application of Chi-Square test it has to be pooled with the preceding or succeeding frequency so that total is greater than 5. This results in the loss of 1 d.f. Since for 2X2 table, d.f. = (2-1)X(2-1) = 1; the d.f. left after adjusting for pooling are v = 1-1 = 0, which is absurd. In such situation we apply Yates correction for ‘continuity’. In this method we add 0.5 to the cell frequency which is less than 5 and adjusting the remaining frequencies accordingly, since row and column totals are fixed and then applying Chi-Square.

Page 50: Chi Square Test

50

22 test for homogeneity test for homogeneity

Used with a single single categoricalcategorical variable from two (or more) two (or more) independent samplesindependent samplesUsed to see if the two populations are the same (homogeneous)

Page 51: Chi Square Test

51

Assumptions & formula remain the same!Expected counts & df are found the same way as test for independence.

OnlyOnly change is the hypotheses!

Page 52: Chi Square Test

52

Hypotheses – written in Hypotheses – written in wordswordsH0: the two (or more) distributions are the sameH1: the distributions are different

Be sure to write in context!