CHAPTER 16 THE FURTHER DATA ANALYSIS

16.1 Introduction

16.2 FURTHER DATA ANALYSIS (MEASURED V ATTRIBUTE)

FDA is a procedure that enables a decision to be made, based on the sample evidence, between two viewpoints:
There is no relationship
There is a relationship

These statistical procedures are called hypothesis tests.

Hypothesis: a statement about a population developed for the purpose of testing.

Hypothesis test: a procedure based on sample evidence and probability theory to determine whether the hypothesis is a reasonable statement.

Four stages of hypothesis tests
Stage 1: Specifying the hypotheses.
Stage 2: Defining the test parameters and the decision rule.
Stage 3: Examining the sample evidence.
Stage 4: The conclusions.

FDA for Measured v Attribute requires two different hypothesis tests:
Two levels of attribute explanatory variable
Three or more levels of attribute explanatory variable

16.3 HYPOTHESIS TEST 1: Measured Response v Attribute Explanatory Variable with exactly two levels

Illustrative Example
Response Variable: AMOUNT spent on clothes per month
Attribute Explanatory Variable: GENDER (Male/Female)

If Males and Females have the same 'spending on clothes' characteristics then the average amounts spent monthly by Males and by Females should be the same.

If Males and Females have different 'spending on clothes' characteristics then the average amounts spent monthly by Males and by Females would be different.

The total population can be split into two or more sub-populations according to the level of the attribute; here, a population of Males and a population of Females.

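Stage 1: Specifying the hypotheses.

NULL HYPOTHESIS (population means the same): $H_0: \mu_0 = \mu_1$

ALTERNATIVE HYPOTHESIS (population means different): $H_1: \mu_0 \neq \mu_1$

Here $\mu_0$ and $\mu_1$ are the true mean amounts for the two sub-populations.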

Stage 2: The Decision Rule

Results of IDA for the Illustrative Example

Outcome 1
Male Mean = £45 (Stand Dev = £20)
Female Mean = £55 (Stand Dev = £20)
Not enough evidence to form a clear judgement; FDA is required.

Outcome 2
Male Mean = £45 (Stand Dev = £10)
Female Mean = £55 (Stand Dev = £10)
The widths of the boxes would lead to the decision from the IDA that there is definitely a link.

Outcome 3
Male Mean = £45 (Stand Dev = £40)
Female Mean = £55 (Stand Dev = £40)
FDA is required, and the standard deviations are bigger.

Measure of Relative Separation of the boxplots
Considers not only the MEANS but also the STANDARD DEVIATIONS of the two samples.
Requires finding a “threshold value”:
If Measure of Relative Separation > Threshold value, there is a connection.
If Measure of Relative Separation < Threshold value, there is no connection.

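Student's t Ratio (a measure of the relative separation of the boxplots)

When the sample data are Normally distributed, the appropriate test is Student's t-test, and tcalc is the value of the t-ratio:

$$t_{calc} = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

where $\bar{X}_i$, $s_i$ and $n_i$ are the mean, standard deviation and size of sample $i$.

The bigger |tcalc|, the larger the separation: Outcome 2 > Outcome 1 > Outcome 3. The next step is to set up the decision rule.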
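As a rough illustration of the t-ratio, the following Python sketch computes tcalc for the three outcomes. The sample sizes are not given in the text, so 30 observations per group is purely an assumption, and the function name t_ratio is illustrative only.

```python
import math

def t_ratio(mean1, sd1, n1, mean2, sd2, n2):
    """Student's t ratio: difference in sample means over its standard error."""
    standard_error = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    return (mean1 - mean2) / standard_error

# Means (45 and 55) and standard deviations come from Outcomes 1-3 in the text.
# ASSUMPTION: 30 observations per group; the text does not state sample sizes.
N = 30
standard_deviations = {"Outcome 1": 20, "Outcome 2": 10, "Outcome 3": 40}

t_values = {
    name: t_ratio(45, sd, N, 55, sd, N)
    for name, sd in standard_deviations.items()
}

for name, t in t_values.items():
    print(f"{name}: tcalc = {t:.2f}")
```

Whatever common sample size is assumed, the ordering of |tcalc| stays the same, because only the standard deviation changes between the outcomes.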

Decision Rule

If the tcalc value lies within the range -tcrit to +tcrit, the decision rule flags H0, supporting the viewpoint that there is no relationship.

If the tcalc value lies outside the range -tcrit to +tcrit, the decision rule flags H1, supporting the viewpoint that there is a relationship.

Value of tcrit

The value of tcrit depends upon the sample size, through a measure called Degrees of Freedom (DF), and can be looked up in statistical tables.

The hypothesis test described above is called Student's t-test; it is a two-tailed test using the 5% level of significance.

Formally, the level of significance may be defined as the chance the tester is prepared to take of coming to the wrong conclusion about H0, that is, of rejecting H0 when it is in fact true.
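Tables give tcrit for each DF. For large samples the value approaches the corresponding point of the standard Normal distribution. A minimal sketch using only the Python standard library, assuming the large-sample approximation is acceptable (for small samples the t tables give a larger value):

```python
from statistics import NormalDist

ALPHA = 0.05  # 5% level of significance, split across two tails

# ASSUMPTION: large-sample approximation. The Normal distribution stands in
# for Student's t, which it approaches as the degrees of freedom grow.
t_crit_large_sample = NormalDist().inv_cdf(1 - ALPHA / 2)

print(f"Approximate tcrit for large DF: {t_crit_large_sample:.2f}")
```

The two-tailed test puts 2.5% in each tail, which is why the 97.5% point of the distribution is used rather than the 95% point.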

Stage 3: Doing the calculations

Calculate tcalc from the sample data and compare it with the table value tTable:

If the tcalc value lies within the range -tTable to +tTable, the decision rule flags H0: there is no relationship.

If the tcalc value lies outside the range -tTable to +tTable, the decision rule flags H1: there is a relationship.

Stage 4: The conclusions

The conclusions are stated in terms of the original business problem specification. For example: on the basis of the sample evidence, there is evidence to suggest that there is a link between the amount spent on clothes and gender. Males spend on average about £45 per month and Females about £55.

Worked Example: CREDIT

IDA

FDA

Stage 1: Define the hypotheses:

$\mu_0$: true average amount borrowed on credit for house owners
$\mu_1$: true average amount borrowed on credit for non house owners

$H_0: \mu_0 = \mu_1$
$H_1: \mu_0 \neq \mu_1$

Stage 2: Defining the test parameters and the decision rule

Student's t-test

Stage 3: Examining the sample evidence

MINITAB is used to do the calculations on the sample data.

tTable = 1.96
tcalc = -4.51 lies outside the range -1.96 to 1.96, so reject H0 and accept H1.
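The comparison in Stage 3 can be sketched as a small Python function. The function name is hypothetical, and treating a value exactly on the boundary as supporting H0 is an assumption, since the text does not say how ties are handled.

```python
def t_test_decision(t_calc, t_table):
    """Two-tailed rule: H0 if t_calc lies within -t_table..+t_table, else H1.

    ASSUMPTION: a value exactly equal to +/- t_table is counted as H0.
    """
    return "H0" if -t_table <= t_calc <= t_table else "H1"

# Values quoted in the worked example.
decision = t_test_decision(t_calc=-4.51, t_table=1.96)
print(f"Decision rule flags {decision}")
```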

Stage 4: The conclusions.

Based on the sample evidence there is a connection between Amount Borrowed on Credit and House-ownership. On average house owners borrow £869.50 and non house owners borrow £1009.00.

16.4 HYPOTHESIS TEST 2: Measured Response v Attribute Explanatory Variable with three or more levels

For example
Response variable: amount spent in a supermarket
Explanatory variable: the customer's marital status, with four categories: Single, Married, Divorced or Widowed

The common data analysis methodology applies and has the following three stages:
Initial Data Analysis
Further Data Analysis
Describing the Relationship

Example 1: No evidence of a connection.

Example 2: Some degree of separation, so a measure of relative separation is needed.

Hypothesis Test: Four stages
Stage 1: Specifying the hypotheses.
Stage 2: Defining the test parameters and the decision rule.
Stage 3: Examining the sample evidence.
Stage 4: The conclusions.

Stage 1: Specifying the hypotheses.

By definition, if there is no connection then all the population means are equal, whilst if there is a connection at least one of the means must be different.

Null hypothesis: $H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4$

Alternative hypothesis: $H_1$: at least one mean is different

Stage 2: Defining the test parameters and the decision rule.

Decision rule: based on the F-Ratio.
Test procedure: One-way Analysis of Variance (ANalysis Of VAriance: ANOVA).
Fcrit is the particular value of F that splits the area under the distribution in the proportions 95%/5%.

Decision rule
If the value of Fcalc is between 0 and Fcrit then conclude that there is no link.
If the value of Fcalc is greater than Fcrit then conclude that, on the basis of the sample evidence, there is a link.

Stage 3: Examining the sample evidence

Example 1: Fcalc would be small. The F-Ratio is defined in such a way that if the null hypothesis is true, i.e. all the means are equal, then Fcalc is expected to be close to 1.

Example 2: Fcalc measures the relative separation; the wider the separation, the larger the Fcalc value.

To find the threshold value Fcrit:
The F-Ratio has two degrees of freedom (which depend on the number of groups and the sample size).
Look up the statistical tables: Ftable.

Suppose Fcalc = 8.91 and the degrees of freedom are (3, 80). Then Ftable = 2.72.

Stage 4: The conclusions.

Since the value of Fcalc (8.91) is larger than the value of Ftable (2.72), the conclusion is that, on the basis of the sample evidence, there is enough evidence to suggest that there is a link between the amount spent by customers in a supermarket and the customer's marital status. The remaining issue is to describe the connection.
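The text does not spell out how Fcalc is computed. A minimal sketch of the usual one-way ANOVA calculation (between-group mean square divided by within-group mean square) is given below; the function name and both data sets are hypothetical and are not taken from the supermarket example.

```python
def one_way_f_ratio(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of the observations around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    df_between = k - 1
    df_within = n_total - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# HYPOTHETICAL amounts spent (£) for three groups.
similar = [[40, 50, 60], [42, 52, 62], [38, 48, 58]]        # means close together
separated = [[40, 50, 60], [70, 80, 90], [100, 110, 120]]   # means far apart

f_similar, df1, df2 = one_way_f_ratio(similar)
f_separated, _, _ = one_way_f_ratio(separated)
print(f"similar: F = {f_similar:.2f}, separated: F = {f_separated:.2f}, DF = ({df1}, {df2})")
```

The within-group spread is identical in both data sets, so the difference in F comes entirely from how far apart the group means are.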

Worked Example: CREDIT data scenario

Question:
Does the explanatory variable 'REGION' influence the response variable 'CREDIT'?
Is the amount borrowed on credit dependent upon the region of the country where the customer lives?

FDA

Stage 1: Specifying the hypotheses.

$H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5$
$H_1$: at least one mean is different

Stage 2: Defining the test parameters and the decision rule.

Test procedure: One-way Analysis of Variance, with the decision rule based on the F-Ratio.

Stage 3: Examining the sample evidence

MINITAB: ANOVA, One-way

Analysis of Variance for CREDIT
Source     DF         SS      MS     F    P
REGION      4    3445125  861281  5.10  0.0
Error     649  109631953  168924
Total     653  113077078

Ftable = 2.39
Since Fcalc = 5.10 > Ftable = 2.39, the sample evidence indicates a link between "Amount borrowed on credit" and "The region the customer lives in".
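The entries in the MINITAB table are linked by simple arithmetic: each mean square (MS) is the sum of squares (SS) divided by its degrees of freedom (DF), and F is the ratio of the two mean squares. A short Python check, using only the figures quoted above:

```python
# Sums of squares and degrees of freedom copied from the MINITAB table.
ss_region, df_region = 3445125, 4
ss_error, df_error = 109631953, 649

ms_region = ss_region / df_region  # mean square for REGION
ms_error = ss_error / df_error     # mean square for Error
f_calc = ms_region / ms_error

f_table = 2.39  # threshold value quoted in the text
link_indicated = f_calc > f_table

print(f"MS(REGION) = {ms_region:.0f}, MS(Error) = {ms_error:.0f}, Fcalc = {f_calc:.2f}")
print("Link indicated" if link_indicated else "No link indicated")
```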

Stage 4: The conclusions

Examination of the average values shows London to be the region with the highest amount on credit, then the South-West and South-East with similar average credits, with the North having the lowest amount on credit.

REGION        AMOUNT
SOUTH-WEST    £977.10
SOUTH-EAST    £958.40
LONDON        £1061.80
MIDLANDS      £898.10
NORTH         £864.30

Examine the diagram displaying the 95% confidence intervals for each level of the attribute variable.

Interpretation: the decision rule is that if the confidence limits for two levels of the attribute don't overlap, then there is a real difference between the means for those two levels.

For example, Region 3 (London) has an average amount on credit that is statistically significantly larger than the average amount on credit for Region 4 (the Midlands), because the two sets of confidence limits don't overlap.

The final description of the link can be summarised as follows: the amount borrowed on credit in London is significantly higher than in the Midlands and the North.

          level 2         level 3         level 4         level 5
level 1   No Difference   No Difference   No Difference   No Difference
level 2                   No Difference   No Difference   No Difference
level 3                                   Difference      Difference
level 4                                                   No Difference
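The overlap rule can be sketched in Python as follows. The region means and MS(Error) are taken from the text, but the group sizes are not given, so 130 customers per region is an assumption, as are the use of a pooled standard deviation and the large-sample multiplier 1.96. The intervals in the original diagram may have been constructed differently.

```python
import math
from itertools import combinations

# Region means (£) from the text, in level order 1 to 5.
means = {
    "SOUTH-WEST": 977.10,
    "SOUTH-EAST": 958.40,
    "LONDON": 1061.80,
    "MIDLANDS": 898.10,
    "NORTH": 864.30,
}

POOLED_SD = math.sqrt(168924)  # square root of MS(Error) in the ANOVA table
N_PER_REGION = 130             # ASSUMPTION: group sizes are not given in the text
MULTIPLIER = 1.96              # ASSUMPTION: large-sample 95% multiplier

half_width = MULTIPLIER * POOLED_SD / math.sqrt(N_PER_REGION)
intervals = {r: (m - half_width, m + half_width) for r, m in means.items()}

def overlap(a, b):
    """True if two (low, high) intervals share any point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Pairs of regions whose confidence intervals do not overlap.
different = [
    (r1, r2)
    for r1, r2 in combinations(means, 2)
    if not overlap(intervals[r1], intervals[r2])
]
print(different)
```

Under these assumptions the only non-overlapping pairs are London with the Midlands and London with the North, which agrees with the level 3 row of the comparison table.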
