9.11 Using Statistics To Make Inferences 9 Summary Correlation – Pearson correlation. Spearmans...
9.11
Using Statistics To Make Inferences 9
Summary
Correlation – Pearson correlation. Spearman's rank correlation. Point biserial correlation.
Tuesday 18 April 2023 03:53 PM
9.22
Goals
To evaluate the correlation, rank correlation and the point biserial correlation and test if they are significant.
Practical
Perform scatter plots and evaluate correlations.
9.33
Recall
What graph would you use to represent any possible relationship between two variables?
Scatter plot
9.44
Looking for Relationships
Raw data: $(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_n, y_n)$
9.55
Which do we plot horizontally?
Horizontally: the independent variable, measured accurately, $x$
Vertically: the dependent variable, subject to errors, $y(x)$
$y = mx + c$, equivalently $y = ax + b$
9.66
Scatterplot
Subjects are scored on verbal and spatial reasoning skills.
Subject 1 2 3 4 5 6 7 8 9 10 11 12
Verbal 50 66 73 84 57 83 76 95 73 78 48 53
Spatial 69 85 88 70 84 78 90 97 79 95 67 60
Plot first
Which variable is “dependent”?
9.77
Scatterplot
[Figure: scatterplot of Spatial vs Verbal. Horizontal axis: Verbal (50 to 100). Vertical axis: Spatial (60 to 100).]
Is there a relationship?
9.88
Recall: mean and variance
Given the raw data (x)
15 9 4 15 10 13 9
Find the sample mean and variance.
The following sums might prove useful: Σx = 75 and Σx² = 897
9.99
Recall
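n = 7, Σx = 75 and Σx² = 897
$\bar{x} = \frac{1}{n}\sum x = \frac{75}{7} = 10.71$
$s^2 = \frac{1}{n-1}\left(\sum x^2 - \frac{1}{n}\left(\sum x\right)^2\right) = \frac{1}{7-1}\left(897 - \frac{75^2}{7}\right) = 15.57$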
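The recall exercise can be checked with a few lines of Python. This is a minimal sketch using only the seven raw values from the slide; the variable names are illustrative.

```python
# Raw data from the recall slide.
x = [15, 9, 4, 15, 10, 13, 9]

n = len(x)
sum_x = sum(x)                     # the slide quotes 75
sum_x2 = sum(v * v for v in x)     # the slide quotes 897

mean = sum_x / n
# Sample variance from the sums (divisor n - 1).
variance = (sum_x2 - sum_x ** 2 / n) / (n - 1)

print(f"mean = {mean:.2f}, variance = {variance:.2f}")
```

The same sum-of-squares pattern reappears in the definitions of Sxx, Syy and Sxy used for the correlation.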
9.1010
Looking for Relationships
Correlation (r) – the Pearson Correlation
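$S_{xx} = \sum x_i^2 - \frac{1}{n}\left(\sum x_i\right)^2$
$S_{yy} = \sum y_i^2 - \frac{1}{n}\left(\sum y_i\right)^2$
$S_{xy} = \sum x_i y_i - \frac{1}{n}\left(\sum x_i\right)\left(\sum y_i\right)$
$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$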
Compare to the variance
9.1111
Notation
Formally
$\text{Variance}(x) = \frac{1}{n-1} S_{xx}$
$\text{Variance}(y) = \frac{1}{n-1} S_{yy}$
$\text{Covariance}(x, y) = \frac{1}{n-1} S_{xy}$
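Because each of the three quantities carries the same factor 1/(n - 1), the correlation can be written either with the S sums or with the variances and covariance. The sketch below illustrates this on a small made-up data set (the five pairs are hypothetical, chosen only to keep the arithmetic simple).

```python
import math

def corrected_sums(x, y):
    """Return (Sxx, Syy, Sxy) as defined on the Pearson correlation slide."""
    n = len(x)
    sxx = sum(a * a for a in x) - sum(x) ** 2 / n
    syy = sum(b * b for b in y) - sum(y) ** 2 / n
    sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    return sxx, syy, sxy

# Hypothetical illustration data, not taken from the slides.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]
n = len(x)

sxx, syy, sxy = corrected_sums(x, y)
var_x, var_y, cov_xy = sxx / (n - 1), syy / (n - 1), sxy / (n - 1)

r_from_sums = sxy / math.sqrt(sxx * syy)
r_from_moments = cov_xy / math.sqrt(var_x * var_y)  # the (n - 1) factors cancel

print(r_from_sums, r_from_moments)
```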
9.1212
Significance Test
ν = n - 2
Degrees of freedom
9.1313
Interpretation
-1 ≤ r ≤ 1
From tables the critical value is rcrit.
r > rcrit: a significant positive correlation. Any fitting line has a positive slope.
r < -rcrit: a significant negative correlation. Any fitting line has a negative slope.
-rcrit < r < rcrit: no significant correlation. Any fitting line is effectively horizontal.
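The tabulated critical values can be related to the t distribution. The slides only state ν = n - 2; the relation t = r√ν / √(1 - r²) used below is the standard textbook one and is an assumption here, as is the value 2.228 (the usual two-tailed 5% point of t with 10 degrees of freedom, hard-coded from a t table). A rough sketch:

```python
import math

def t_from_r(r, nu):
    """t statistic corresponding to a correlation r with nu = n - 2 degrees of freedom."""
    return r * math.sqrt(nu) / math.sqrt(1 - r * r)

def r_crit_from_t(t_crit, nu):
    """Invert the relation above: the r that corresponds to a critical t."""
    return t_crit / math.sqrt(t_crit ** 2 + nu)

nu = 10          # n = 12 pairs
t_crit = 2.228   # assumed table value: t, 10 d.f., two-tailed 5%

r_crit = r_crit_from_t(t_crit, nu)
print(round(r_crit, 3))  # 0.576
```

This reproduces the 0.576 entry used for ν = 10 in the worked example.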
9.1414
Assumptions
Variables are measured at the interval or ratio level (continuous).
Variables are approximately normally distributed. Essentially neither set of data is independently skewed.
There is a linear relationship between the two variables.
Pearson’s correlation is sensitive to outliers so it is best if outliers are kept to a minimum or there are no outliers.
9.1515
Concerns
Don't forget causality: a correlation between two sets of data may arise from a third influencing factor (firemen cause fires, storks bring babies…).
Variables should be homoscedastic, meaning that the scatter pattern is consistent over the whole range. Otherwise, you may get a positive correlation over one range of the data that is tainted by an unproven correlation in another area.
9.1616
Example
Subjects are scored on verbal and spatial reasoning skills.
Subject 1 2 3 4 5 6 7 8 9 10 11 12
Verbal x 50 66 73 84 57 83 76 95 73 78 48 53
Spatial y 69 85 88 70 84 78 90 97 79 95 67 60
9.1717
Scatterplot
[Figure: scatterplot of Spatial vs Verbal. Horizontal axis: Verbal (50 to 100). Vertical axis: Spatial (60 to 100).]
9.1818
Calculation
n = 12
Σxi = 50 + 66 + … + 53 = 836
Σyi = 69 + 85 + … + 60 = 962
Σxi² = 50² + 66² + … + 53² = 60706
Σyi² = 69² + 85² + … + 60² = 78634
Σxiyi = 50×69 + 66×85 + … + 53×60 = 68254
9.1919
Calculation Sxx
n = 12, Σxi = 836, Σyi = 962, Σxi² = 60706, Σyi² = 78634, Σxiyi = 68254
$S_{xx} = \sum x_i^2 - \frac{1}{n}\left(\sum x_i\right)^2 = 60706 - \frac{836^2}{12} = 2464.67$
9.2020
Calculation Syy
n = 12, Σxi = 836, Σyi = 962, Σxi² = 60706, Σyi² = 78634, Σxiyi = 68254
$S_{yy} = \sum y_i^2 - \frac{1}{n}\left(\sum y_i\right)^2 = 78634 - \frac{962^2}{12} = 1513.67$
9.2121
Calculation Sxy
n = 12, Σxi = 836, Σyi = 962, Σxi² = 60706, Σyi² = 78634, Σxiyi = 68254
$S_{xy} = \sum x_i y_i - \frac{1}{n}\left(\sum x_i\right)\left(\sum y_i\right) = 68254 - \frac{836 \times 962}{12} = 1234.67$
9.2222
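Calculation
n = 12, Σxi = 836, Σyi = 962, Σxi² = 60706, Σyi² = 78634, Σxiyi = 68254
$S_{xx} = \sum x_i^2 - \frac{1}{n}\left(\sum x_i\right)^2 = 60706 - \frac{836^2}{12} = 2464.67$
$S_{yy} = \sum y_i^2 - \frac{1}{n}\left(\sum y_i\right)^2 = 78634 - \frac{962^2}{12} = 1513.67$
$S_{xy} = \sum x_i y_i - \frac{1}{n}\left(\sum x_i\right)\left(\sum y_i\right) = 68254 - \frac{836 \times 962}{12} = 1234.67$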
9.2323
Conclusion
$S_{xx} = 2464.67$, $S_{yy} = 1513.67$, $S_{xy} = 1234.67$
$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{1234.67}{\sqrt{2464.67 \times 1513.67}} = 0.64$
ν = n – 2 = 10
ν p=0.1 p=0.05 p=0.025 p=0.01 p=0.005 p=0.002
10 0.497 0.576 0.640 0.708 0.750 0.795
The tables give one and two tail values.
Since r10(0.025) = 0.58, there appears to be a significant correlation at the 95% confidence level (0.64 > 0.58).
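The whole hand calculation can be reproduced in a few lines. This is a sketch using the verbal and spatial scores from the example; the critical value 0.576 is copied from the table row for ν = 10 rather than computed.

```python
import math

# Scores for the 12 subjects in the example.
verbal = [50, 66, 73, 84, 57, 83, 76, 95, 73, 78, 48, 53]
spatial = [69, 85, 88, 70, 84, 78, 90, 97, 79, 95, 67, 60]
n = len(verbal)

sxx = sum(x * x for x in verbal) - sum(verbal) ** 2 / n
syy = sum(y * y for y in spatial) - sum(spatial) ** 2 / n
sxy = sum(x * y for x, y in zip(verbal, spatial)) - sum(verbal) * sum(spatial) / n

r = sxy / math.sqrt(sxx * syy)
r_crit = 0.576  # table value for nu = n - 2 = 10

print(f"Sxx = {sxx:.2f}, Syy = {syy:.2f}, Sxy = {sxy:.2f}, r = {r:.3f}")
print("significant" if abs(r) > r_crit else "not significant")
```

The comparison uses |r| so that the same check works for negative correlations.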
9.2424
SPSS
Analyze > Correlate > Bivariate
9.2525
SPSS
The correlation (0.64) and p value (p < 0.05) are consistent with our calculation.
Correlations
Variable | Statistic | Verbal | Spatial
Verbal | Pearson Correlation | 1 | .639*
Verbal | Sig. (2-tailed) | | .025
Verbal | N | 12 | 12
Spatial | Pearson Correlation | .639* | 1
Spatial | Sig. (2-tailed) | .025 |
Spatial | N | 12 | 12
*. Correlation is significant at the 0.05 level (2-tailed).
9.2626
Aside
From previous experience it is known that in a population, measurements of IQ are approximately normally distributed with standard deviation 16. Tests are carried out on a particular subgroup of 14 individuals from the population. Calculate a 95% confidence interval for the population mean. Is your interval consistent with a population mean of 132?
Observed data: 133.06 119.30 109.17 93.88 116.93 130.98 135.25 140.02 118.38 121.86 124.78 142.91 135.52 132.96
What are the key words/information?
9.2828
Aside
Which tests would be appropriate for the sample mean?
z or t
9.2929
Aside
Which tests would be appropriate for the sample mean? What are the key parameters for these tests?
z: µ, σ, n
t: µ, s, n
9.3030
Aside
In this case which test would be appropriate for the sample mean? What are the key parameters for this test?
z: µ, σ, n
Since σ is available, use the z test.
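One way the aside might be worked through with the z approach is sketched below. It assumes the usual interval, mean ± 1.96 σ/√n, where 1.96 is the standard normal two-sided 95% point (an assumption, since the slides do not show the interval formula).

```python
import math

# Observed IQ scores for the subgroup of 14 individuals.
iq = [133.06, 119.30, 109.17, 93.88, 116.93, 130.98, 135.25,
      140.02, 118.38, 121.86, 124.78, 142.91, 135.52, 132.96]

sigma = 16.0   # known population standard deviation
z = 1.96       # assumed: standard normal two-sided 95% point
n = len(iq)

mean = sum(iq) / n
half_width = z * sigma / math.sqrt(n)
lower, upper = mean - half_width, mean + half_width

print(f"mean = {mean:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
print("consistent with 132" if lower <= 132 <= upper else "not consistent with 132")
```

Because σ is given, no sample standard deviation is computed; with σ unknown the t distribution would replace the factor 1.96.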
9.3131
Rank Correlation
What if we cannot assume normality or there are outliers when calculating and assessing a correlation?
If your samples violate the assumption of normality or have outliers then you might need to consider using a non-parametric test such as Spearman's Correlation.
9.3232
Spearman’s Rank Correlation Coefficient (rs)
$r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$
where $d_i$ is the difference between the rankings in each pair of scores and $n$ is the number of pairs of scores.
Note that rs only matches the conventional correlation (direct calculation) for the ranked data if there are no ties.
9.3333
Example
A researcher has a theory that phonological working memory (memory for speech, the auditory component of the working memory model) is related to children's vocabulary size. The researcher tests this theory by measuring both phonological working memory (A) and vocabulary size (B) in children of 4 and 5 years of age.
9.3434
Data
Child A B
1 18 187
2 14 134
3 15 121
4 11 150
5 17 145
6 18 178
7 12 112
8 9 87
First plot the data
9.3535
Scatterplot
[Figure: scatterplot of B (vertical axis, 85 to 185) against A (horizontal axis, 8 to 18).]
9.3636
Data
Replace all observed values (A,B) by their ranks
9.3737
Rank A
Child A Rank A
1 18 1
6 18 2
5 17 3
3 15 4
2 14 5
7 12 6
4 11 7
8 9 8
The first two observations are tied!
Is there a problem?
9.3838
Rank A
Child A Rank A True Rank
1 18 1 1.5
6 18 2 1.5
5 17 3 3
3 15 4 4
2 14 5 5
7 12 6 6
4 11 7 7
8 9 8 8
9.3939
Rank B
Child B Rank B
1 187 1
6 178 2
4 150 3
5 145 4
2 134 5
3 121 6
7 112 7
8 87 8
No ties in this case
9.4040
Rank B
Child B Rank B True Rank
1 187 1 1
6 178 2 2
4 150 3 3
5 145 4 4
2 134 5 5
3 121 6 6
7 112 7 7
8 87 8 8
9.4141
Differences
Child | A | True Rank A
1 18 1.5
2 14 5
3 15 4
4 11 7
5 17 3
6 18 1.5
7 12 6
8 9 8
Total
Rearrange the A data against the identifier (child)
9.4242
Differences
Child | A | True Rank A | B | True Rank B
1 18 1.5 187 1
2 14 5 134 5
3 15 4 121 6
4 11 7 150 3
5 17 3 145 4
6 18 1.5 178 2
7 12 6 112 7
8 9 8 87 8
Total
Similarly for the B data
Now find the difference between the true ranks
9.4343
Differences
Child | A | True Rank A | B | True Rank B | di
1 18 1.5 187 1 0.5
2 14 5 134 5 0
3 15 4 121 6 -2
4 11 7 150 3 4
5 17 3 145 4 -1
6 18 1.5 178 2 -0.5
7 12 6 112 7 -1
8 9 8 87 8 0
Total
Now square these differences
9.4444
Differences
Child | A | True Rank A | B | True Rank B | di | di²
1 18 1.5 187 1 0.5 0.25
2 14 5 134 5 0 0
3 15 4 121 6 -2 4
4 11 7 150 3 4 16
5 17 3 145 4 -1 1
6 18 1.5 178 2 -0.5 0.25
7 12 6 112 7 -1 1
8 9 8 87 8 0 0
Total
And form the total
9.4545
Differences
Child | A | True Rank A | B | True Rank B | di | di²
1 18 1.5 187 1 0.5 0.25
2 14 5 134 5 0 0
3 15 4 121 6 -2 4
4 11 7 150 3 4 16
5 17 3 145 4 -1 1
6 18 1.5 178 2 -0.5 0.25
7 12 6 112 7 -1 1
8 9 8 87 8 0 0
Total 22.5
Now calculate the correlation
9.4646
Conclusion
n = 8, Σdi² = 22.5
$r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \times 22.5}{8(8^2 - 1)} = 1 - \frac{135}{8 \times 63} = 0.732$
n p = 0.05 p = 0.01
8 0.738 0.881
The tables give one and two tail values.
Note that SPSS reports an approximate p value based on the Pearson correlation.
9.4747
Conclusion
n = 8, rcrit = 0.738, rs = 0.732
The critical value for n = 8 at the p = 0.025 level is 0.738. Since rs = 0.732 is less than 0.738, rs is apparently not significant at the 95% confidence level. More plainly, these data do not provide significant evidence of a relationship.
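The ranking steps, including the shared "true rank" for tied values, can be sketched as follows. The helper names are illustrative; ranks run from the largest value (rank 1) downwards, as on the slides, and tied values share the mean of the positions they occupy.

```python
def average_ranks(values):
    """Rank from largest (rank 1) to smallest; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value equals the current one (a tie).
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + 1 + end + 1) / 2  # mean of positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

def spearman_from_differences(a, b):
    """Spearman's rs from the rank-difference formula; also returns the sum of d squared."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(a)
    sum_d2 = sum((p - q) ** 2 for p, q in zip(ra, rb))
    return 1 - 6 * sum_d2 / (n * (n * n - 1)), sum_d2

# Phonological working memory (A) and vocabulary size (B) for the 8 children.
memory = [18, 14, 15, 11, 17, 18, 12, 9]
vocab = [187, 134, 121, 150, 145, 178, 112, 87]

rs, sum_d2 = spearman_from_differences(memory, vocab)
print(f"sum of d^2 = {sum_d2}, rs = {rs:.3f}")
```

Children 1 and 6 both score 18 on A, so each receives rank 1.5, matching the "True Rank" column.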
9.4848
Example
The hypothesis tested is that prices should decrease with distance from the key area of gentrification surrounding the Contemporary Art Museum (CAM, El Raval, Barcelona). The line followed is Transect 2 in the map, with continuous sampling of the price of a 50cl. bottle of water at every convenience store.
9.4949
MACBA Barcelona Contemporary Art Museum
9.5050
Map
Selected transect Museum
9.5151
Data
Convenience Store | Distance from CAM (m) | Price of 50cl. bottle (€)
1 50 1.80
2 175 1.20
3 270 2.00
4 375 1.00
5 425 1.00
6 580 1.20
7 710 0.80
8 790 0.60
9 890 1.00
10 980 0.85
First plot the data
Which variable is dependent?
9.5252
Scatterplot
[Figure: scatterplot of Price (Euros, vertical axis, 0.5 to 1.9) against Distance from CAM (horizontal axis, 40 to 940).]
9.5353
Data
Can we assume normality?
Now rank the data
9.5454
Rank Price
Ties have been identified
Convenience Store | Price of 50cl. bottle (€) | Rank
3 2 1
1 1.8 2
2 1.2 3
6 1.2 4
4 1 5
5 1 6
9 1 7
10 0.85 8
7 0.8 9
8 0.6 10
9.5555
Rank Price
Convenience Store | Price of 50cl. bottle (€) | Rank | True Rank
3 2 1 1
1 1.8 2 2
2 1.2 3 3.5
6 1.2 4 3.5
4 1 5 6
5 1 6 6
9 1 7 6
10 0.85 8 8
7 0.8 9 9
8 0.6 10 10
9.5656
Rank Distance
There are no ties
Convenience Store | Distance from CAM (m) | Rank
10 980 1
9 890 2
8 790 3
7 710 4
6 580 5
5 425 6
4 375 7
3 270 8
2 175 9
1 50 10
9.5757
Rank Distance
Convenience Store | Distance from CAM (m) | Rank | True Rank
10 980 1 1
9 890 2 2
8 790 3 3
7 710 4 4
6 580 5 5
5 425 6 6
4 375 7 7
3 270 8 8
2 175 9 9
1 50 10 10
9.5858
Differences
Convenience Store | Price of 50cl. bottle (€) | True Rank Price
1 1.8 2
2 1.2 3.5
3 2 1
4 1 6
5 1 6
6 1.2 3.5
7 0.8 9
8 0.6 10
9 1 6
10 0.85 8
Arrange true rank of price by store identifier
9.5959
Differences
Convenience Store | Price of 50cl. bottle (€) | True Rank Price | Distance from CAM (m) | True Rank Distance
1 1.8 2 50 10
2 1.2 3.5 175 9
3 2 1 270 8
4 1 6 375 7
5 1 6 425 6
6 1.2 3.5 580 5
7 0.8 9 710 4
8 0.6 10 790 3
9 1 6 890 2
10 0.85 8 980 1
Arrange true rank of distance by store identifier
9.6060
Differences
Convenience Store | Price of 50cl. bottle (€) | True Rank Price | Distance from CAM (m) | True Rank Distance | di
1 1.8 2 50 10 8
2 1.2 3.5 175 9 5.5
3 2 1 270 8 7
4 1 6 375 7 1
5 1 6 425 6 0
6 1.2 3.5 580 5 1.5
7 0.8 9 710 4 -5
8 0.6 10 790 3 -7
9 1 6 890 2 -4
10 0.85 8 980 1 -7
Calculate the differences between the true ranks
9.6161
Differences
Convenience Store | Price of 50cl. bottle (€) | True Rank Price | Distance from CAM (m) | True Rank Distance | di | di²
1 1.8 2 50 10 8 64
2 1.2 3.5 175 9 5.5 30.25
3 2 1 270 8 7 49
4 1 6 375 7 1 1
5 1 6 425 6 0 0
6 1.2 3.5 580 5 1.5 2.25
7 0.8 9 710 4 -5 25
8 0.6 10 790 3 -7 49
9 1 6 890 2 -4 16
10 0.85 8 980 1 -7 49
Calculate the square of the differences
9.6262
Differences
Convenience Store | Price of 50cl. bottle (€) | True Rank Price | Distance from CAM (m) | True Rank Distance | di | di²
1 1.8 2 50 10 8 64
2 1.2 3.5 175 9 5.5 30.25
3 2 1 270 8 7 49
4 1 6 375 7 1 1
5 1 6 425 6 0 0
6 1.2 3.5 580 5 1.5 2.25
7 0.8 9 710 4 -5 25
8 0.6 10 790 3 -7 49
9 1 6 890 2 -4 16
10 0.85 8 980 1 -7 49
Total 285.5
Form the total of the differences squared
9.6363
Conclusion
n = 10, Σdi² = 285.5
$r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \times 285.5}{10(10^2 - 1)} = 1 - \frac{1713}{10 \times 99} = -0.730$
n p = 0.05 p = 0.01
10 0.648 0.794
The tables give one and two tail values.
9.6464
Conclusion
n = 10, rcrit = 0.648, rs = -0.730
The critical value for n = 10 at the p = 0.025 level is 0.648. Since |rs| = 0.730 exceeds 0.648, the two-tailed significance level is slightly less than 5%.
Apparently there is a relationship: price tends to decrease with distance from CAM.
9.6565
Comparison
Recall that rs only matches the conventional correlation for the ranked data if there are no ties.
The previous calculation is repeated using ranked data and the full correlation formula and then tested with software.
9.6666
Calculation
n = 10
Convenience Store: 1 2 3 4 5 6 7 8 9 10
x, rank of Distance from CAM (m): 10 9 8 7 6 5 4 3 2 1
y, rank of Price of 50cl. bottle (€): 2 3.5 1 6 6 3.5 9 10 6 8
Σxi = 10 + 9 + … + 1 = 55
Σyi = 2 + 3.5 + … + 8 = 55 (the sums agree)
Σxi² = 10² + 9² + … + 1² = 385
Σyi² = 2² + 3.5² + … + 8² = 382.5 (the sums of squares disagree because of ties)
Σxiyi = 10×2 + 9×3.5 + … + 1×8 = 241
9.6767
Calculation Sxx
n = 10, Σxi = 55, Σyi = 55, Σxi² = 385, Σyi² = 382.5, Σxiyi = 241
$S_{xx} = \sum x_i^2 - \frac{1}{n}\left(\sum x_i\right)^2 = 385 - \frac{55^2}{10} = 82.5$
9.6868
Calculation Syy
n = 10, Σxi = 55, Σyi = 55, Σxi² = 385, Σyi² = 382.5, Σxiyi = 241
$S_{yy} = \sum y_i^2 - \frac{1}{n}\left(\sum y_i\right)^2 = 382.5 - \frac{55^2}{10} = 80$
9.6969
Calculation Sxy
n = 10, Σxi = 55, Σyi = 55, Σxi² = 385, Σyi² = 382.5, Σxiyi = 241
$S_{xy} = \sum x_i y_i - \frac{1}{n}\left(\sum x_i\right)\left(\sum y_i\right) = 241 - \frac{55 \times 55}{10} = -61.5$
9.7070
Calculation
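n = 10, Σxi = 55, Σyi = 55, Σxi² = 385, Σyi² = 382.5, Σxiyi = 241
$S_{xx} = \sum x_i^2 - \frac{1}{n}\left(\sum x_i\right)^2 = 385 - \frac{55^2}{10} = 82.5$
$S_{yy} = \sum y_i^2 - \frac{1}{n}\left(\sum y_i\right)^2 = 382.5 - \frac{55^2}{10} = 80$
$S_{xy} = \sum x_i y_i - \frac{1}{n}\left(\sum x_i\right)\left(\sum y_i\right) = 241 - \frac{55 \times 55}{10} = -61.5$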
9.7171
Conclusion
$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{-61.5}{\sqrt{82.5 \times 80}} = -0.757$
n p = 0.05 p = 0.01
10 0.648 0.794
The tables give one and two tail values.
n = 10, rcrit = 0.648, rs = -0.757
The critical value for n = 10 at the p = 0.025 level is 0.648. The value 0.757 for two tails gives a significance level of slightly less than 5%. Apparently there is a relationship.
Note the slight difference from the value -0.730 given by the rank-difference formula, which is due to the ties.
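The two routes to the rank correlation can be compared directly. The sketch below starts from the true ranks tabulated for the ten stores and evaluates both the rank-difference formula and the full correlation formula applied to the ranks.

```python
import math

# True ranks by store identifier (1..10), as tabulated on the slides.
rank_distance = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
rank_price = [2, 3.5, 1, 6, 6, 3.5, 9, 10, 6, 8]
n = len(rank_distance)

# Route 1: rank-difference formula.
sum_d2 = sum((x - y) ** 2 for x, y in zip(rank_distance, rank_price))
rs_shortcut = 1 - 6 * sum_d2 / (n * (n * n - 1))

# Route 2: full correlation formula applied to the ranks.
sxx = sum(x * x for x in rank_distance) - sum(rank_distance) ** 2 / n
syy = sum(y * y for y in rank_price) - sum(rank_price) ** 2 / n
sxy = sum(x * y for x, y in zip(rank_distance, rank_price)) - sum(rank_distance) * sum(rank_price) / n
r_on_ranks = sxy / math.sqrt(sxx * syy)

print(f"shortcut: {rs_shortcut:.3f}, correlation of ranks: {r_on_ranks:.3f}")
```

The two values differ only because the price ranks contain ties; with no ties Syy would equal Sxx and the two formulas would agree.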
9.7272
SPSS
Analyze > Correlate > Bivariate
9.7373
SPSS
The correlation (-0.76) and p value (p < 0.05) are consistent with our calculation.
Correlations (Spearman's rho)
Variable | Statistic | Distance | Price
Distance | Correlation Coefficient | 1.000 | -.757*
Distance | Sig. (2-tailed) | . | .011
Distance | N | 10 | 10
Price | Correlation Coefficient | -.757* | 1.000
Price | Sig. (2-tailed) | .011 | .
Price | N | 10 | 10
*. Correlation is significant at the 0.05 level (2-tailed).
9.7474
Point Biserial Correlation
The point biserial correlation coefficient (rpb) is a correlation coefficient used when one variable is dichotomous; ideally it will be “naturally” dichotomous, such as pass/fail (P/F). The point biserial correlation is mathematically equivalent to the Pearson correlation. This can be shown by assigning two distinct numerical values (usually 0/1) to the dichotomous variable.
9.7575
Point Biserial Correlation
To calculate rpb, use the dichotomous variable to divide the data set into two groups. Evaluate
MP, the mean score for group P
MF, the mean score for group F
S, the population standard deviation, evaluated for all entries
p, the proportion of those in group P
f, the proportion of those in group F (f = 1 – p)
$r_{pb} = \frac{M_P - M_F}{S}\sqrt{pf}$
There is no version of the formula for a case where you only have sample data.
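The claimed equivalence with the Pearson correlation can be checked numerically. The sketch below uses only the first eight students of the example that follows (scores and first-question results), purely to keep the listing short, so its value is not the .37 of the full data set; the function names are illustrative.

```python
import math

def point_biserial(scores, passed):
    """r_pb = (M_P - M_F) / S * sqrt(p * f), with S the population standard deviation."""
    n = len(scores)
    group_p = [s for s, g in zip(scores, passed) if g == 1]
    group_f = [s for s, g in zip(scores, passed) if g == 0]
    m_p = sum(group_p) / len(group_p)
    m_f = sum(group_f) / len(group_f)
    mean = sum(scores) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in scores) / n)  # divisor n, not n - 1
    p = len(group_p) / n
    f = 1 - p
    return (m_p - m_f) / s * math.sqrt(p * f)

def pearson(x, y):
    """Pearson r from the corrected sums Sxx, Syy and Sxy."""
    n = len(x)
    sxx = sum(a * a for a in x) - sum(x) ** 2 / n
    syy = sum(b * b for b in y) - sum(y) ** 2 / n
    sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    return sxy / math.sqrt(sxx * syy)

# First eight students only (a subset, for illustration).
scores = [30, 38, 24, 22, 31, 33, 37, 38]
passed = [1, 0, 1, 1, 1, 1, 1, 0]

r_pb = point_biserial(scores, passed)
r = pearson(scores, passed)
print(r_pb, r)
```

Coding the dichotomous variable as 0/1 and feeding it to the ordinary correlation formula gives the same number as the group-means formula, up to floating-point rounding.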
9.7676
Point Biserial Correlation - Example
The following data is for 165 students attempting 50 multiple-choice questions. The final mark for the examination (out of 50) was
30 38 24 22 31 33 37 38 33 27 35 31 35 42 24 26 34 33 35 22 25 22 38 36 41 40 26 29 43 30 34 28 16 38 34 33
26 32 39 27 12 32 35 17 39 20 18 30 37 17 26 37 21 19 38 25 38 31 21 29 25 26 27 31 33 37 35 26 35 17 22 26
24 21 34 40 32 22 28 24 38 23 17 22 19 33 13 32 17 33 26 15 39 32 22 23 32 19 41 29 33 29 24 19 20 35 31 31
33 37 24 18 38 22 33 29 26 32 27 24 25 27 26 29 29 34 35 38 27 23 35 35 34 26 19 27 33 38 32 25 37 24 39 30
28 25 32 28 33 26 22 26 29 25 32 30 37 29 33 28 22 33 29 27 29
Individual success on the first question (1 pass, 0 fail) was
1 0 1 1 1 1 1 0 1 0 1 0 1 1 1 0 1 1 1 1 0 1 0 1 1 1 0 0 1 0 1 1 0 1 1 1 1 1 1 0 0 1 1 1 1 1 1 0 1 0 0 1 1 0 1 0 1 1 0 1
1 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 0 0 0 1 0 1 0 1 1 0 1 0 0 0 0 0 1 1 0 1 0 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 1 0 1
0 1 0 0 1 1 1 1 0 0 1 1 0 0 0 1 1 1 1 1 1 0 1 1 1 0 1 1 0 1 1 1 1 0 1 1 1 0 1 1 0 1 0 0 0
9.7777
Point Biserial Correlation - Example
MP = 30.87 is the mean score for group P
MF = 25.71 is the mean score for group F
S = 6.63 is the population standard deviation, evaluated for all entries
p = .65 is the proportion of those in group P
f = .35 is the proportion of those in group F (f = 1 – p)
$r_{pb} = \frac{M_P - M_F}{S}\sqrt{pf} = \frac{30.87 - 25.71}{6.63}\sqrt{0.65 \times 0.35} = .37$
This may be verified by direct calculation.
Correlations
Variable | Statistic | Question_1
Total | Pearson Correlation | .372**
Total | Sig. (2-tailed) | .000
Total | N | 165
**. Correlation is significant at the 0.01 level (2-tailed).
9.7878
Point Biserial Correlation - Example
Since rcrit is 0.153 (use the Calculator) and rpb is 0.37, the result is significant.
A large positive correlation is what we would expect: students with high scores on the overall test tend to get the item right, and students with low scores on the overall test tend to get the item wrong.
A large negative correlation is not what we would expect: students who get the item correct would tend to do poorly on the overall test, and students who get the item wrong would tend to do well on the test.
9.7979
Read
Read Howitt and Cramer pages 59-74
Read Howitt and Cramer (e-text) pages 87-124
Read Russo (e-text) pages 176-201
Read Davis and Smith pages 173-192
9.8080
Practical 9
This material is available from the module web page.
http://www.staff.ncl.ac.uk/mike.cox
Module Web Page
9.8181
Practical 9
The material for the practical is available.
Instructions for the practical: Practical 9
Material for the practical: Practical 9
9.8282
Whoops!
"There's this cluster of interrelated findings", said Richard A. Lippa, a professor of psychology at California State University at Fullerton, who has found evidence that in gay men, the hair on the back of the head is more likely to curl counter-clockwise than in straight men. "These are all biological markers that something must have gone on early in development".
Washington Post
5 February 2008
Source
9.8383
However!
Dilbert
9.8484
Cause and Effect
Recall - Type I error – false positive – conclude that the variable or coefficient is important, but the true state of nature is that it is not.
Firemen cause damage
Storks bring babies
Conclude two drugs differ, when in fact they do not
Correlation does not imply causation! Spurious Correlations
5.8585
Big Data Helps Companies Find Some Surprising Correlations
There's a Link Between Sales and Phases of the Moon, Among Other Things, by Deborah Gage, Wall Street Journal, 23 March 2014
9.8686
Whoops!