CHAPTER 16 THE FURTHER DATA ANALYSIS
16.1 Introduction
16.2 FURTHER DATA ANALYSIS (MEASURED V ATTRIBUTE)
FDA is a procedure that enables a decision to be made, based on the sample evidence, between two possibilities:
There is no relationship
There is a relationship
These statistical procedures are called hypothesis tests.
Hypothesis: a statement about a population, developed for the purpose of testing.
Hypothesis test: a procedure, based on sample evidence and probability theory, to determine whether the hypothesis is a reasonable statement.
Four stages of a hypothesis test
Stage 1: Specifying the hypotheses.
Stage 2: Defining the test parameters and the decision rule.
Stage 3: Examining the sample evidence.
Stage 4: The conclusions.
FDA for Measured v Attribute requires two different hypothesis tests, depending on the explanatory variable:
An attribute explanatory variable with exactly two levels
An attribute explanatory variable with three or more levels
16.3 HYPOTHESIS TEST 1 Measured Response v Attribute Explanatory Variable with exactly two levels
Illustrative Example
Response Variable: AMOUNT spent on clothes per month
Attribute Explanatory Variable: GENDER (Male/Female)
If Males and Females have the same 'spending on clothes' characteristics, then the average amounts spent monthly by Males and by Females should be the same.
If Males and Females have different 'spending on clothes' characteristics, then the average amounts spent monthly by Males and by Females would be different.
The total population can be split into two or more sub-populations according to the level of the attribute: here, a population of Males and a population of Females.
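Stage 1: Specifying the hypotheses.
NULL HYPOTHESIS (population means the same): $H_0: \mu_0 = \mu_1$
ALTERNATIVE HYPOTHESIS (population means different): $H_1: \mu_0 \neq \mu_1$
Here $\mu_0$ and $\mu_1$ are the population means for the two levels of the attribute.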
Stage 2: The Decision Rule
Results of IDA for the Illustrative Example
Outcome 1: Male Mean = £45 (Stand Dev = £20), Female Mean = £55 (Stand Dev = £20). Not enough evidence to form a clear judgement; FDA is required.
Outcome 2: Male Mean = £45 (Stand Dev = £10), Female Mean = £55 (Stand Dev = £10). The widths of the boxes would lead to the decision from the IDA that there is definitely a link.
Outcome 3: Male Mean = £45 (Stand Dev = £40), Female Mean = £55 (Stand Dev = £40). The Stand Dev is bigger, so the boxes overlap more; FDA is required.
Measure of Relative Separation of the boxplots
This considers not only the MEANS but also the STANDARD DEVIATIONS of the two samples, and is compared with a "Threshold value":
If Measure of Relative Separation > Threshold value, there is a connection.
If Measure of Relative Separation < Threshold value, there is no connection.
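Student's t Ratio (a measure of the relative separation of the boxplots)
If the sample data are Normally distributed, Student's t-test can be used; tcalc is the calculated value of the t-ratio:
$$t_{calc} = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$
where $\bar{X}$, $s$ and $n$ are the mean, standard deviation and size of each sample.
The bigger |tcalc|, the larger the separation: Outcome 2 > Outcome 1 > Outcome 3.
The next step is to set up the decision rule.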
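The ordering of the three outcomes can be checked numerically. The following is a minimal sketch of the unpooled t-ratio, using the means and standard deviations of the three outcomes; the sample size of 30 per group is an assumption made only for illustration, since the slides do not state one.

```python
import math

def t_ratio(mean1, sd1, n1, mean2, sd2, n2):
    """Unpooled t-ratio: difference in sample means divided by its standard error."""
    standard_error = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    return (mean1 - mean2) / standard_error

# Means (45 v 55) and standard deviations come from the three outcomes in the text.
# n = 30 per group is an ASSUMED sample size; the source does not give one.
n = 30
outcome_sds = {"Outcome 1": 20, "Outcome 2": 10, "Outcome 3": 40}
t_values = {name: t_ratio(45, sd, n, 55, sd, n) for name, sd in outcome_sds.items()}

for name, t in t_values.items():
    print(f"{name}: tcalc = {t:.2f}")
```

Whatever common sample size is assumed, halving the standard deviation doubles |tcalc|, which is why Outcome 2 shows the largest separation and Outcome 3 the smallest.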
Decision Rule
If the tcalc value lies within the range -tcrit to +tcrit, the decision rule flags H0, supporting the viewpoint that there is no relationship.
If the tcalc value lies outside the range -tcrit to +tcrit, the decision rule flags H1, supporting the viewpoint that there is a relationship.
Value of tcrit
The value depends upon the sample size, through a measure called the Degrees of Freedom (DF), and can be looked up in statistical tables.
The hypothesis test described above is called Student's t-test; it is a two-tailed test using the 5% level of significance.
Formally, the level of significance may be defined as the chance the tester is prepared to take of coming to the wrong conclusion about H0, that is, of rejecting H0 when it is in fact true.
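For large samples the critical value can be approximated without tables. The sketch below assumes the sample is large enough for the Normal distribution to stand in for Student's t (its large-sample limit); for small samples the tabulated t value for the appropriate degrees of freedom is larger and should be used instead. The function name is illustrative only.

```python
from statistics import NormalDist

def large_sample_critical_value(significance=0.05):
    """Two-tailed critical value from the standard Normal distribution.

    ASSUMPTION: large sample, so Normal approximates Student's t.
    The significance level is split equally between the two tails.
    """
    return NormalDist().inv_cdf(1 - significance / 2)

t_crit = large_sample_critical_value(0.05)
print(f"tcrit at the 5% level (large sample): {t_crit:.2f}")
```

With a 5% level of significance, 2.5% of the area lies in each tail, which gives the familiar large-sample value of about 1.96.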
Stage 3: Doing the calculations
If the tcalc value lies within the range -tTable to +tTable, the decision rule flags H0: there is no relationship.
If the tcalc value lies outside the range -tTable to +tTable, the decision rule flags H1: there is a relationship.
Stage 4: The conclusions
The conclusions are stated in terms of the original business problem specification. For example: on the basis of the sample evidence, there is evidence to suggest that there is a link between the amount spent on clothes and gender; males spend on average about £45 per month and females about £55.
Worked Example: CREDIT
IDA
FDA
Stage 1: Define the hypotheses:
$\mu_0$ = true average amount borrowed on credit for house owners
$\mu_1$ = true average amount borrowed on credit for non house owners
$H_0: \mu_0 = \mu_1$
$H_1: \mu_0 \neq \mu_1$
Stage 2: Defining the test parameters and the decision rule
Student's t-test
Stage 3: Examining the sample evidence
MINITAB is used to do the calculations on the sample data.
tTable = 1.96
tcalc = -4.51 lies outside the range -1.96 to 1.96, so reject H0 and accept H1.
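The two-tailed decision rule applied here can be sketched as a small function. The function name and return labels are illustrative; the values of tcalc and tTable are those quoted in the worked example.

```python
def t_test_decision(t_calc, t_table):
    """Two-tailed decision rule: flag H0 inside [-t_table, +t_table], otherwise H1."""
    if -t_table <= t_calc <= t_table:
        return "H0"  # no evidence of a relationship
    return "H1"      # evidence of a relationship

# Values from the CREDIT worked example (house owners v non house owners).
decision = t_test_decision(t_calc=-4.51, t_table=1.96)
print(f"Decision rule flags {decision}")
```

The sign of tcalc only reflects which sample mean was subtracted from which; the rule depends on whether its magnitude exceeds tTable.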
Stage 4: The conclusions.
Based on the sample evidence, there is a connection between Amount Borrowed on Credit and House-ownership. On average, house owners borrow £869.50 and non house owners borrow £1009.00.
16.4 HYPOTHESIS TEST 2: Measured Response v Attribute Explanatory Variable with three or more levels
For example:
Response variable: amount spent in a supermarket
Explanatory variable: the customer's marital status, with four categories: Single, Married, Divorced or Widowed
The common data analysis methodology applies and has the following three stages:
Initial Data Analysis
Further Data Analysis
Describing the Relationship
Example 1: No evidence of a connection.
Example 2: Some degree of separation; a measure of relative separation is needed.
Hypothesis Test: Four stages
Stage 1: Specifying the hypotheses.
Stage 2: Defining the test parameters and the decision rule.
Stage 3: Examining the sample evidence.
Stage 4: The conclusions.
Stage 1: Specifying the hypotheses.
By definition, if there is no connection then all the population means are equal, whilst if there is a connection at least one of the means must be different.
Null hypothesis: $H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4$
Alternative hypothesis: $H_1$: at least one mean is different
Stage 2: Defining the test parameters and the decision rule.
Decision rule: based on the F-Ratio.
Test procedure: One-way Analysis of Variance (ANalysis Of VAriance: ANOVA).
Fcrit is the particular value of F that splits the area under the distribution in the proportions 95%/5%.
Decision rule
If the value of Fcalc is between 0 and Fcrit, then conclude that there is no link.
If the value of Fcalc is greater than Fcrit, then conclude that on the basis of the sample evidence there is a link.
Stage 3: Examining the sample evidence
Example 1: Fcalc would be small. The F-Ratio is defined in such a way that if the null hypothesis is true, i.e. all the means are equal, then Fcalc is expected to be close to 1.
Example 2: Fcalc measures the relative separation; the wider the separation, the larger the Fcalc value.
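How the F-Ratio responds to separation can be seen in a minimal sketch of the one-way ANOVA calculation: the variation between group means divided by the variation within groups. The two small data sets are hypothetical and are not taken from the slides; they simply mimic a "no separation" and a "clear separation" picture.

```python
from statistics import mean

def f_ratio(groups):
    """One-way ANOVA F-ratio: between-group mean square / within-group mean square."""
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)
    k = len(groups)            # number of levels of the attribute
    n_total = len(all_values)  # total sample size
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)        # DF between = k - 1
    ms_within = ss_within / (n_total - k)    # DF within = n_total - k
    return ms_between / ms_within

# HYPOTHETICAL data: three levels, three observations each.
overlapping = [[20, 30, 40], [22, 30, 38], [21, 31, 41]]  # similar means, wide spread
separated = [[20, 22, 24], [30, 32, 34], [40, 42, 44]]    # distinct means, narrow spread

print(f"Overlapping groups: Fcalc = {f_ratio(overlapping):.2f}")
print(f"Separated groups:   Fcalc = {f_ratio(separated):.2f}")
```

The two degrees of freedom used in the division, k - 1 and n_total - k, are the same pair that is needed to look up the threshold value in the F tables.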
To find the Threshold Value: Fcrit
The F-Ratio has two degrees of freedom (which depend on the sample size and the number of levels).
Look up the statistical tables: Ftable
Suppose Fcalc = 8.91 and the degrees of freedom are (3, 80). Then Ftable = 2.72.
Stage 4: The conclusions.
Since the value of Fcalc (8.91) is larger than the value of Ftable (2.72), the conclusion is that, on the basis of the sample evidence, there is enough evidence to suggest that there is a link between the amount spent by customers in a supermarket and the customer's marital status. The remaining issue is to describe the connection.
Worked Example: CREDIT data scenario
Question: Does the explanatory variable 'REGION' influence the response variable 'CREDIT'? That is, is the amount borrowed on credit dependent upon the region of the country where the customer lives?
IDA
FDA
Stage 1: Specifying the hypotheses.
$H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5$
$H_1$: at least one mean is different
Stage 2: Defining the test parameters and the decision rule.
Stage 3: Examining the sample evidence
MINITAB: ANOVA, ONE WAY
Analysis of Variance for CREDIT
Source     DF          SS       MS      F     P
REGION      4     3445125   861281   5.10   0.0
Error     649   109631953   168924
Total     653   113077078
Ftable = 2.39
Since Fcalc = 5.10 > Ftable = 2.39, the sample evidence indicates a link between "Amount borrowed on credit" and "The region the customer lives in".
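The entries in the Analysis of Variance table are linked by simple arithmetic, which can be checked as follows. This is only a sketch that re-derives MS and F from the SS and DF figures in the MINITAB output; the Ftable value is the one quoted in the text, not computed here.

```python
# Sums of squares and degrees of freedom from the MINITAB output.
ss_region, df_region = 3445125, 4
ss_error, df_error = 109631953, 649

# Mean square = sum of squares / degrees of freedom.
ms_region = ss_region / df_region
ms_error = ss_error / df_error

# F-ratio = mean square for REGION / mean square for Error.
f_calc = ms_region / ms_error
f_table = 2.39  # 5% threshold value quoted in the text

link_indicated = f_calc > f_table
print(f"MS(REGION) = {ms_region:.0f}, MS(Error) = {ms_error:.0f}, Fcalc = {f_calc:.2f}")
print("Link indicated" if link_indicated else "No link indicated")
```

The Total row is the sum of the REGION and Error rows, for both SS and DF, which gives a quick consistency check on any ANOVA table.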
Stage 4: The conclusions
Examination of the average values shows London to be the region with the highest amount on credit, then the South-West and South-East with similar average credits; the North having the lowest amount on credit.
REGION AMOUNT
SOUTH-WEST £977.10
SOUTH-EAST £958.40
LONDON £1061.80
MIDLANDS £898.10
NORTH £864.30
Examine the diagram displaying the 95% confidence intervals for each level of the attribute variable.
Interpretation: The decision rule is that if the confidence limits don't overlap, then there is a real difference between the means for the two levels of the attribute.
For example, Region 3, London, has an average amount on credit that is statistically significantly larger than the average amount on credit for Region 4, the Midlands, because the two confidence intervals don't overlap.
The final description of the link can be summarised as: the amount borrowed on credit in London is significantly higher than in the Midlands and the North.
           level 2         level 3         level 4         level 5
level 1    No Difference   No Difference   No Difference   No Difference
level 2                    No Difference   No Difference   No Difference
level 3                                    Difference      Difference
level 4                                                    No Difference
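The overlap rule behind this table can be sketched in a few lines. The regional means are those listed above, but the confidence limits themselves are not given in the slides, so the common half-width of £70 used here is purely an assumed value for illustration; the real limits would be read from the MINITAB confidence interval diagram.

```python
from itertools import combinations

# Sample means from the REGION table in the text.
means = {
    "SOUTH-WEST": 977.10,
    "SOUTH-EAST": 958.40,
    "LONDON": 1061.80,
    "MIDLANDS": 898.10,
    "NORTH": 864.30,
}

HALF_WIDTH = 70.0  # ASSUMED half-width of each 95% interval; not from the source

def interval(sample_mean):
    """Confidence interval as (lower limit, upper limit)."""
    return (sample_mean - HALF_WIDTH, sample_mean + HALF_WIDTH)

def overlap(a, b):
    """True if two intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Pairs of regions whose intervals do NOT overlap are flagged as different.
differences = [
    (r1, r2)
    for r1, r2 in combinations(means, 2)
    if not overlap(interval(means[r1]), interval(means[r2]))
]
print(differences)
```

Under this assumed half-width, the only pairs flagged are London against the Midlands and London against the North, matching the two "Difference" entries in the table.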