Statistics lecture 10(ch10)

Description: Chi-squared tests

Transcript of Statistics lecture 10(ch10)

Page 1: Statistics lecture 10(ch10)

Page 2: Statistics lecture 10(ch10)

OBJECTIVES

• Recognise a distribution to which a chi-squared test can suitably be applied

• Conduct the goodness-of-fit hypothesis test

• Conduct the test of independence

• Conduct the test of homogeneity

Page 3: Statistics lecture 10(ch10)

Chi-squared distribution

• Positively skewed

• The test is done on the right tail only

• Therefore all chi-squared tests are right-tailed, with a positive test statistic and one critical value only

• The basic steps of the hypothesis test are the same; only the test statistic and the distribution have changed

Page 4: Statistics lecture 10(ch10)

• The techniques used up to now analysed data measured on a quantitative scale.

• Results of tests can often be classified into categories where there is no natural order:

– Categorical variable

– Categories

– Categorical data

• Categorical data can be analysed with Chi-squared tests:

– Simple random sample

– Sample size reasonably large

Page 5: Statistics lecture 10(ch10)

Example:

• Survey of job satisfaction

• Employed persons classified as satisfied, neutral, dissatisfied

CATEGORICAL VARIABLE – employee satisfaction

CATEGORIES – satisfied, neutral, dissatisfied

CATEGORICAL DATA – number of employees who are satisfied, neutral or dissatisfied (also referred to as the frequency of each category)

Page 6: Statistics lecture 10(ch10)

Examples

1. A person's income can be categorised as high, medium or low. Define the categorical variable, the categories and the categorical data.

2. We want to investigate different types of industries, e.g. information technology, financial and transformation. Define the categorical variable, the categories and the categorical data.

Page 7: Statistics lecture 10(ch10)

Example answers

1. The categorical variable is income. The categories are high, medium and low. The categorical data are the numbers of people who have a high, medium or low income.

2. The categorical variable is type of industry. The categories are information technology, financial and transformation. The categorical data are the numbers of industries that are information technology, financial or transformation.

Page 8: Statistics lecture 10(ch10)

• Chi-squared goodness-of-fit test

– This test describes a single population of categorical data.

– The multinomial experiment studied is an extension of the binomial experiment.

• There are n independent trials.

• The outcome of each trial can be classified into one of k categories.

• The probability pi of cell i remains constant for each trial. Moreover, p1 + p2 + … + pk = 1.

– The experiment records the observed number of trials (the frequency) in each category.

– The frequencies are denoted by f1, f2, …, fk, and f1 + f2 + … + fk = n.

Page 9: Statistics lecture 10(ch10)

EXAMPLE

In a box of Smarties you will find 6 different colours: brown, red, yellow, blue, orange and green. A random sample of Smarties (6918 in total) was taken and the frequency of each colour was counted. The distribution of colours is given below. Determine whether the Smarties survey fits the description of a multinomial experiment.

Colour   Brown   Red    Yellow   Blue   Orange   Green
f        1611    1172   1308     904    921      1002
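The bookkeeping part of the multinomial description (k categories, frequencies that add up to n) can be checked with a short sketch like the one below. It only uses the counts from the table; the variable names are illustrative, and the independence of the trials still has to be argued from how the sample was taken.

```python
# Colour counts from the Smarties table (f_i for each of the k categories).
counts = {
    "Brown": 1611, "Red": 1172, "Yellow": 1308,
    "Blue": 904, "Orange": 921, "Green": 1002,
}

n = sum(counts.values())   # total number of trials: f1 + f2 + ... + fk
k = len(counts)            # number of categories

# Sample proportions f_i / n; by construction they sum to 1.
proportions = {colour: f / n for colour, f in counts.items()}

print("n =", n, "k =", k)
for colour, share in proportions.items():
    print(f"{colour}: {share:.4f}")
```

The frequencies add up to the stated sample size of 6918, so every sweet falls into exactly one of the six categories.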

Page 10: Statistics lecture 10(ch10)

EXAMPLE

Answer:

See example 10.1, p350, textbook

Page 11: Statistics lecture 10(ch10)

• The goodness-of-fit test

– Used to determine if the observed counts of the categories agree with the probabilities specified for each category.

– Observed frequencies (f ) compared with the expected frequencies (e).

Testing H0: Proportions agree with specified probabilities

Alternative hypothesis: H1: H0 is not true

Decision rule: Reject H0 if $\chi^2 > \chi^2_{k-1;\,1-\alpha}$

Test statistic: $\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i}$

To use the χ²-tests, all expected frequencies must be at least 5.

Page 12: Statistics lecture 10(ch10)

• Example

– A household detergent is marketed in three sizes:

• 1 000 ml, 750 ml and 250 ml

– The distributors believe that the market shares of the different sizes are as follows:

• 1 000 ml = 40%

• 750 ml = 45%

• 250 ml = 15%.

– To study the effect of the economic climate on the sales of the products, 200 customers were asked to state which size they prefer.

• Survey results:

– 82 customers preferred the 1 000 ml

– 102 customers preferred the 750 ml

– 16 customers preferred the 250 ml

Page 13: Statistics lecture 10(ch10)

• Solution

– The population investigated is the size preferences.

– The data are in categories.

– This is a multinomial experiment (three categories).

– The question of interest: Are p1, p2 and p3 different from the expected 40%, 45% and 15%?

Page 14: Statistics lecture 10(ch10)

• The hypotheses are:

– H0: p1 = 0,40, p2 = 0,45, p3 = 0,15

– H1: At least one pi is not equal to its specified value.

Are the observed and the expected frequencies the same?

Size       Specified share   Observed value   Expected frequency (ei = npi)
1 000 ml   40%               82               40% of 200 = 80
750 ml     45%               102              45% of 200 = 90
250 ml     15%               16               15% of 200 = 30

Expected frequencies are all ≥ 5.

Page 16: Statistics lecture 10(ch10)

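• The hypotheses are:

– H0: p1 = 0,40, p2 = 0,45, p3 = 0,15

– H1: At least one pi is not equal to its specified value.

α = 0,05

Test statistic:

$\chi^2 = \sum \frac{(f_i - e_i)^2}{e_i} = \frac{(82 - 80)^2}{80} + \frac{(102 - 90)^2}{90} + \frac{(16 - 30)^2}{30} = 8{,}18$

Critical value: $\chi^2_{k-1;\,1-\alpha} = \chi^2_{2;\,0{,}95} = 5{,}9915$

[Figure: χ² distribution with the critical value 5,9915 marked; "Accept H0" region to its left, "Reject H0" region to its right.]

– Reject H0, since 8,18 > 5,9915.

Conclusion: At the 5% significance level there is sufficient evidence to reject the null hypothesis. At least one of the probabilities pi is different. Thus, at least two market shares have changed.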
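The detergent calculation can be reproduced with a minimal sketch such as the one below. The variable names are illustrative. The critical value is computed with a shortcut that holds only for 2 degrees of freedom, where the upper-tail probability of the chi-squared distribution is exp(−x/2); for other degrees of freedom a table value would be needed.

```python
import math

observed = [82, 102, 16]          # f_i for 1 000 ml, 750 ml, 250 ml
specified = [0.40, 0.45, 0.15]    # p_i under H0
alpha = 0.05

n = sum(observed)
expected = [n * p for p in specified]      # e_i = n * p_i

# The chi-squared test requires every expected frequency to be at least 5.
all_large_enough = all(e >= 5 for e in expected)

# Test statistic: sum over categories of (f_i - e_i)^2 / e_i.
chi2_stat = sum((f - e) ** 2 / e for f, e in zip(observed, expected))

# Shortcut valid for k - 1 = 2 degrees of freedom only:
# P(X > x) = exp(-x / 2), so the critical value is -2 * ln(alpha).
critical = -2 * math.log(alpha)

reject_h0 = chi2_stat > critical
print(round(chi2_stat, 2), round(critical, 2), reject_h0)
```

The statistic comes out at about 8,18 and the critical value at about 5,99, so H0 is rejected, in line with the slide.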

Page 17: Statistics lecture 10(ch10)

Two friends were playing a board game in which a die played a big role. One of the players believed that the die was not fair. 60 tosses of the die produced the results below. Test at the 5% significance level whether the die was fair.

Number of dots     1   2   3   4    5    6
Number of tosses   7   6   7   18   15   7

Page 18: Statistics lecture 10(ch10)

H0: p1 = … = p6 = 1/6

H1: At least one pi ≠ 1/6

α = 0,05

ei = npi = 60(1/6) = 10

Expected values for the six categories are: 10 10 10 10 10 10

$\chi^2 = \sum \frac{(f - e)^2}{e} = \frac{(7 - 10)^2}{10} + \dots + \frac{(7 - 10)^2}{10} = 13{,}2$

$\chi^2_{k-1;\,1-\alpha} = \chi^2_{5;\,0{,}95} = 11{,}07$

Therefore, reject H0. The probabilities of the dots are not equal and the die was not fair.

[Figure: χ² distribution showing the "Accept H0" and "Reject H0" regions.]
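To see which faces drive the result, the contribution of each category to the statistic can be listed separately. The sketch below uses the die data from the example; the critical value 11,07 is taken from the slide rather than computed, and the names are illustrative.

```python
observed = {1: 7, 2: 6, 3: 7, 4: 18, 5: 15, 6: 7}   # face -> number of tosses

n = sum(observed.values())
expected = n / len(observed)      # e_i = n * (1/6) = 10 for every face

# Contribution of each face: (f_i - e_i)^2 / e_i.
contributions = {
    face: (f - expected) ** 2 / expected for face, f in observed.items()
}
chi2_stat = sum(contributions.values())

critical = 11.07                  # table value for 5 degrees of freedom, alpha = 0,05
reject_h0 = chi2_stat > critical

for face, part in contributions.items():
    print(face, round(part, 2))
print(round(chi2_stat, 2), reject_h0)
```

Faces 4 and 5 account for most of the statistic, which suggests the die favours those two outcomes.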

Page 19: Statistics lecture 10(ch10)

• Chi-squared test for independence

– Cross-classify two categorical variables using a contingency table.

– Rows represent one variable and columns represent the other variable.

– Each cell value indicates the frequency in the cross-classification.

– The table can have any number of rows (r) and columns (c):

• r×c cells

Page 20: Statistics lecture 10(ch10)

CONCEPT QUESTIONS

• Questions 1 – 3, p356

Page 21: Statistics lecture 10(ch10)

• Chi-squared test for independence

– H0: the two variables are independent – no relationship.

– H1: the two variables are dependent – there is a relationship.

For a 2×2 contingency table, the observed frequencies are:

             B
A       B1    B2    Total
A1      f11   f12   r1
A2      f21   f22   r2
Total   c1    c2    n

Page 22: Statistics lecture 10(ch10)

• Chi-squared test for independence

– Contingency tables describe the relationship between two categorical variables.

– For each observed frequency an expected frequency must be calculated:

$e = \frac{\text{row total} \times \text{column total}}{n}$

Page 23: Statistics lecture 10(ch10)

• Chi-squared test for independence

– For a 2×2 contingency table the expected frequencies are:

$e_{11} = (r_1 \times c_1)/n; \quad e_{12} = (r_1 \times c_2)/n$

$e_{21} = (r_2 \times c_1)/n; \quad e_{22} = (r_2 \times c_2)/n$

Page 24: Statistics lecture 10(ch10)

Testing H0: Variables are independent

Alternative hypothesis: H1: Variables are dependent

Decision rule: Reject H0 if $\chi^2 > \chi^2_{(r-1)(c-1);\,1-\alpha}$

Test statistic: $\chi^2 = \sum \frac{(f - e)^2}{e}$

Page 25: Statistics lecture 10(ch10)

• Example

– A household detergent is marketed in three sizes:

• 1 000 ml, 750 ml and 250 ml

– The market for potential buyers is divided into three

age groups:

• < 30 years old

• 30–50 years old

• > 50 years old

– Market researchers believe that there is a relationship between the age of a buyer and the size of the packaging.

Page 26: Statistics lecture 10(ch10)

• Solution

– The data are summarised in a 3×3 contingency table.

– H0: Size and age are independent.

– H1: Size and age are dependent.

Observed frequencies:

                  Age groups
Size       < 30   30–50   > 50   Total
1 000 ml   27     41      14     82
750 ml     39     18      45     102
250 ml     8      2       6      16
Total      74     61      65     200

Page 27: Statistics lecture 10(ch10)

• Solution

– Calculate the expected frequency of each cell: (row total × column total)/n

– For example, for 1 000 ml and < 30: (74×82)/200 = 30,34

Observed frequencies, with expected frequencies in parentheses:

                  Age groups
Size       < 30         30–50        > 50         Total
1 000 ml   27 (30,34)   41 (25,01)   14 (26,65)   82
750 ml     39 (37,74)   18 (31,11)   45 (33,15)   102
250 ml     8 (5,92)     2 (4,88)     6 (5,20)     16
Total      74           61           65           200
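The expected frequencies for the whole table follow from the row and column totals. A minimal sketch, using the observed counts from the detergent example (the variable names are illustrative):

```python
# Observed frequencies: rows are sizes, columns are age groups (< 30, 30-50, > 50).
observed = [
    [27, 41, 14],   # 1 000 ml
    [39, 18, 45],   # 750 ml
    [8, 2, 6],      # 250 ml
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequency of each cell: (row total x column total) / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Cells whose expected frequency falls below the usual threshold of 5.
small = [
    (i, j)
    for i, row in enumerate(expected)
    for j, e in enumerate(row)
    if e < 5
]

for row in expected:
    print([round(e, 2) for e in row])
print("cells with e < 5:", small)
```

One cell (250 ml, 30–50) has an expected frequency of 4,88, slightly below the threshold of 5 stated for chi-squared tests; the worked example proceeds regardless.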

Page 28: Statistics lecture 10(ch10)

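• The hypotheses are:

– H0: Size and age are independent

– H1: Size and age are dependent

α = 0,05

Test statistic:

$\chi^2 = \sum \frac{(f - e)^2}{e} = \frac{(27 - 30{,}34)^2}{30{,}34} + \frac{(41 - 25{,}01)^2}{25{,}01} + \dots + \frac{(6 - 5{,}20)^2}{5{,}20} = 28{,}95$

Critical value: $\chi^2_{(r-1)(c-1);\,1-\alpha} = \chi^2_{(3-1)(3-1);\,0{,}95} = 9{,}49$

[Figure: χ² distribution with the critical value 9,49 marked; "Accept H0" region to its left, "Reject H0" region to its right.]

– Reject H0, since 28,95 > 9,49.

Conclusion: At the 5% significance level there is sufficient evidence to reject the null hypothesis. There is a relationship between the size of detergent that people prefer and their age.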
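The test statistic and degrees of freedom for an r×c table can be wrapped in a small helper. The function name `chi2_independence` is hypothetical, and the critical value 9,49 is the table value quoted on the slide, not something the code derives.

```python
def chi2_independence(observed):
    """Return (statistic, degrees of freedom) for an r x c table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, f in enumerate(row):
            e = row_totals[i] * col_totals[j] / n   # expected frequency
            statistic += (f - e) ** 2 / e

    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df


# Detergent size (rows) by age group (columns).
stat, df = chi2_independence([[27, 41, 14], [39, 18, 45], [8, 2, 6]])

critical = 9.49          # table value for 4 degrees of freedom, alpha = 0,05
reject_h0 = stat > critical
print(round(stat, 2), df, reject_h0)
```

The statistic is about 28,95 with 4 degrees of freedom, so H0 is rejected as on the slide.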

Page 29: Statistics lecture 10(ch10)

A recent survey of marketing managers in four different industries provided the data in the table below, which gives managers' attitudes to market research and its value in marketing decision making.

Test at the 1% level of significance whether managers' perception of the value of market research is dependent on the type of industry in which a marketing manager is employed.

                                      INDUSTRY TYPE
Perceived value of market research   Consumer businesses   Industrial organisations   Retail & wholesale   Finance & insurance
Little value                         9                     22                         13                   9
Moderate value                       29                    41                         6                    17
Great value                          26                    28                         6                    27
TOTAL                                64                    91                         25                   53

Page 30: Statistics lecture 10(ch10)

                                      Industry type
Perceived value of market research   Consumer businesses   Industrial organisations   Retail and wholesale   Finance and insurance   Total
Little value                         9 (14,56)             22 (20,7)                  13 (5,69)              9 (12,06)               53
Moderate value                       29 (25,55)            41 (36,32)                 6 (9,98)               17 (21,15)              93
Great value                          26 (23,9)             28 (33,98)                 6 (9,33)               27 (19,79)              87
Total                                64                    91                         25                     53                      233

H0: Managers' perception is independent of industry type.

H1: Managers' perception is dependent on industry type.

α = 0,01

$\chi^2 = \sum \frac{(f - e)^2}{e} = \frac{(9 - 14{,}56)^2}{14{,}56} + \dots + \frac{(27 - 19{,}79)^2}{19{,}79} = 20{,}895$

$\chi^2_{(r-1)(c-1);\,1-\alpha} = \chi^2_{6;\,0{,}99} = 16{,}81$

Therefore, reject H0. Managers' perception is dependent on the industry type.

[Figure: χ² distribution showing the "Accept H0" and "Reject H0" regions.]
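As a check on the arithmetic, the sketch below recomputes the statistic for the market research table with unrounded expected frequencies. The variable names are illustrative, and the critical value 16,81 is the table value from the slide.

```python
# Rows: little, moderate, great value; columns: the four industry types.
observed = [
    [9, 22, 13, 9],
    [29, 41, 6, 17],
    [26, 28, 6, 27],
]

rows = [sum(r) for r in observed]
cols = [sum(c) for c in zip(*observed)]
n = sum(rows)

# Degrees of freedom: (r - 1)(c - 1).
df = (len(rows) - 1) * (len(cols) - 1)

chi2_stat = 0.0
for i, row in enumerate(observed):
    for j, f in enumerate(row):
        e = rows[i] * cols[j] / n     # unrounded expected frequency
        chi2_stat += (f - e) ** 2 / e

critical = 16.81                      # table value for 6 degrees of freedom, alpha = 0,01
reject_h0 = chi2_stat > critical
print(df, round(chi2_stat, 2), reject_h0)
```

With unrounded expected frequencies the statistic is roughly 20,9, close to the 20,895 on the slide; the small gap presumably comes from rounding the expected frequencies before squaring. The decision to reject H0 is unaffected.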

Page 31: Statistics lecture 10(ch10)

Questions 4 – 6, p361, textbook

Page 32: Statistics lecture 10(ch10)

• Chi-squared test of homogeneity

– Tests whether two or more populations are homogeneous (similar) with regard to a certain characteristic.

– H0: The proportions of elements with a certain characteristic in two or more different populations are the same.

– H1: The proportions of elements with a certain characteristic in two or more different populations are not the same.

– The rest of the test is the same as the test for independence.

Page 33: Statistics lecture 10(ch10)

An immigration attorney was investigating which industries to target for obtaining new clients who might have problems with changes in the immigration laws. The lawyer selected five industries; twenty workers were randomly selected in each industry and their visa statuses were verified.

Test at the 1% level of significance whether the 5 industries are homogeneous with respect to the visa status of their workers.

                   INDUSTRY
VISA STATUS        A   B    C   D    E
Illegal resident   8   10   5   10   1
Legal resident     4   2    6   4    9
SA citizen         8   8    9   6    10

Page 34: Statistics lecture 10(ch10)

                   Industry
Visa status        A         B          C         D          E          Total
Illegal resident   8 (6,8)   10 (6,8)   5 (6,8)   10 (6,8)   1 (6,8)    34
Legal resident     4 (5)     2 (5)      6 (5)     4 (5)      9 (5)      25
SA citizen         8 (8,2)   8 (8,2)    9 (8,2)   6 (8,2)    10 (8,2)   41
Total              20        20         20        20         20         100

H0: The five industries are homogeneous with respect to the visa status of their workers.

H1: The five industries are heterogeneous with respect to the visa status of their workers.

α = 0,01

$\chi^2 = \sum \frac{(f - e)^2}{e} = \frac{(8 - 6{,}8)^2}{6{,}8} + \dots + \frac{(10 - 8{,}2)^2}{8{,}2} = 15{,}32$

$\chi^2_{(r-1)(c-1);\,1-\alpha} = \chi^2_{8;\,0{,}99} = 20{,}09$

Therefore, do not reject H0. There is insufficient evidence that the five industries differ with respect to the visa status of their workers.
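In the homogeneity setting each column is a separate sample, and under H0 every industry shares the same pooled proportion for each visa status. The sketch below makes that explicit; the names are illustrative and the critical value 20,09 is the table value from the slide.

```python
# Visa status (rows) by industry A-E (columns); 20 workers sampled per industry.
observed = {
    "Illegal resident": [8, 10, 5, 10, 1],
    "Legal resident": [4, 2, 6, 4, 9],
    "SA citizen": [8, 8, 9, 6, 10],
}

sample_sizes = [sum(col) for col in zip(*observed.values())]
n = sum(sample_sizes)

chi2_stat = 0.0
for status, row in observed.items():
    pooled_share = sum(row) / n            # common proportion under H0
    for f, size in zip(row, sample_sizes):
        e = pooled_share * size            # same as (row total x column total) / n
        chi2_stat += (f - e) ** 2 / e

df = (len(observed) - 1) * (len(sample_sizes) - 1)
critical = 20.09                           # table value for 8 degrees of freedom, alpha = 0,01
reject_h0 = chi2_stat > critical
print(round(chi2_stat, 2), df, reject_h0)
```

Because the samples are of equal size, the expected frequency is the same across each row (6,8, 5 and 8,2), and the statistic of about 15,32 stays below the critical value.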

Page 35: Statistics lecture 10(ch10)

CLASSWORK/HOMEWORK

1. Activities 1, 2, 3, 4 – p168–174, Module Manual

2. Revision exercises 1, 2, 3, 4 – p174–176, Module Manual

3. Self Review Test 1 – 4, p368, textbook

4. Supplementary Exercises 1 – 11, p370, textbook