9.11 Using Statistics To Make Inferences 9 Summary Correlation – Pearson correlation. Spearmans...

Post on 17-Dec-2015


9.11

Using Statistics To Make Inferences 9

Summary 

Correlation – Pearson correlation. Spearman's rank correlation. Point biserial correlation.

 

Tuesday 18 April 2023 03:53 PM

9.22

Goals 

To evaluate the correlation, rank correlation and the point biserial correlation and test if they are significant.

Practical 

Perform scatter plots and evaluate correlations.

9.33

Recall

What graph would you use to represent any possible relationship between two variables?

Scatter plot

9.44

Looking for Relationships

Raw data

$(x_1, y_1),\ (x_2, y_2),\ (x_3, y_3),\ \ldots,\ (x_n, y_n)$

9.55

Which do we plot horizontally?

Horizontally: the independent variable x, measured accurately.

Vertically: the dependent variable y(x), subject to errors.

A straight line is written y = m x + c or y = a x + b.

9.66

Scatterplot

Subjects are scored on verbal and spatial reasoning skills.

Subject   1   2   3   4   5   6   7   8   9  10  11  12
Verbal   50  66  73  84  57  83  76  95  73  78  48  53
Spatial  69  85  88  70  84  78  90  97  79  95  67  60

Plot first

Which variable is “dependent”?

9.77

Scatterplot

[Figure: scatterplot of Spatial (vertical axis, 60 to 100) vs Verbal (horizontal axis, 50 to 100)]

Is there a relationship?

9.88

Recall: mean and variance

Given the raw data (x)

15 9 4 15 10 13 9

Find the sample mean and variance.

The following sums might prove useful: Σx = 75 and Σx² = 897

9.99

Recall

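n = 7, Σx = 75 and Σx² = 897

$\bar{x} = \frac{1}{n}\sum x = \frac{75}{7} = 10.71$

$s^2 = \frac{1}{n-1}\left(\sum x^2 - \frac{1}{n}\left(\sum x\right)^2\right) = \frac{1}{7-1}\left(897 - \frac{75^2}{7}\right) = 15.57$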
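The recall calculation can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the variable names are illustrative and not part of the original slides.

```python
# Raw data from the recall slide
x = [15, 9, 4, 15, 10, 13, 9]

n = len(x)
sum_x = sum(x)                    # Σx
sum_x2 = sum(v * v for v in x)    # Σx²

mean = sum_x / n
# Sample variance via the shortcut formula, with divisor n - 1
variance = (sum_x2 - sum_x ** 2 / n) / (n - 1)

print(f"mean = {mean:.2f}, variance = {variance:.2f}")
```

The two sums reproduce the values quoted on the slide (75 and 897), so the shortcut formula needs only n, Σx and Σx².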

9.1010

Looking for Relationships

Correlation (r) – the Pearson Correlation

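$S_{xx} = \sum x_i^2 - \frac{1}{n}\left(\sum x_i\right)^2$

$S_{yy} = \sum y_i^2 - \frac{1}{n}\left(\sum y_i\right)^2$

$S_{xy} = \sum x_i y_i - \frac{1}{n}\sum x_i \sum y_i$

$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$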

Compare to the variance

9.1111

Notation

Formally

$\mathrm{Variance}(x) = \frac{1}{n-1} S_{xx}$

$\mathrm{Variance}(y) = \frac{1}{n-1} S_{yy}$

$\mathrm{Covariance}(x, y) = \frac{1}{n-1} S_{xy}$
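As a brief worked step (a sketch that uses only the definitions on this slide), the factors of n − 1 cancel when the covariance is scaled by the two standard deviations, which is why r can be computed from the S quantities alone:

```latex
r = \frac{\mathrm{Covariance}(x,y)}{\sqrt{\mathrm{Variance}(x)\,\mathrm{Variance}(y)}}
  = \frac{\frac{1}{n-1}S_{xy}}{\sqrt{\frac{1}{n-1}S_{xx}\cdot\frac{1}{n-1}S_{yy}}}
  = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}
```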

9.1212

Significance Test

ν = n - 2

Degrees of freedom

9.1313

Interpretation

-1 ≤ r ≤ 1

r > rcrit a significant positive correlation

Any fitting line has a positive slope

r < -rcrit a significant negative correlation

Any fitting line has a negative slope

-rcrit < r < rcrit uncorrelated

Any fitting line is effectively horizontal

From tables the critical value is rcrit
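The decision rule on this slide can be written as a small function. This is an illustrative sketch only; the function name `interpret` is hypothetical, and the critical value must still be looked up in tables for the relevant degrees of freedom.

```python
def interpret(r, r_crit):
    """Classify a correlation r against a tabulated critical value r_crit."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and 1")
    if r > r_crit:
        return "significant positive correlation"
    if r < -r_crit:
        return "significant negative correlation"
    return "uncorrelated"


# Values taken from the worked examples later in this section
print(interpret(0.64, 0.576))
print(interpret(-0.730, 0.648))
print(interpret(0.732, 0.738))
```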

9.1414

Assumptions

Variables are measured at the interval or ratio level (continuous).

Variables are approximately normally distributed. Essentially neither set of data is independently skewed.

There is a linear relationship between the two variables.

Pearson’s correlation is sensitive to outliers so it is best if outliers are kept to a minimum or there are no outliers.

9.1515

Concerns

Don’t forget causality: a correlation between two sets of data may arise from a third influencing factor (firemen cause fires, storks bring babies…).

Variables should be homoscedastic, which means that there needs to be a consistent scatter pattern over the whole range. Otherwise, you may get a positive correlation over a range of the data that is tainted by an unproven correlation in another area.

9.1616

Example

Subjects are scored on verbal and spatial reasoning skills.

Subject     1   2   3   4   5   6   7   8   9  10  11  12
Verbal x   50  66  73  84  57  83  76  95  73  78  48  53
Spatial y  69  85  88  70  84  78  90  97  79  95  67  60

9.1717

Scatterplot

[Figure: scatterplot of Spatial (vertical axis, 60 to 100) vs Verbal (horizontal axis, 50 to 100)]

9.1818

Calculation

n = 12

Σxi = 50+66+…+53 = 836

Σyi = 69+85+…+60 = 962

Σxi² = 50²+66²+…+53² = 60706

Σyi² = 69²+85²+…+60² = 78634

Σxiyi = 50×69+66×85+…+53×60 = 68254

9.1919

Calculation Sxx

n = 12, Σxi = 836, Σyi = 962, Σxi² = 60706, Σyi² = 78634, Σxiyi = 68254

$S_{xx} = \sum x_i^2 - \frac{1}{n}\left(\sum x_i\right)^2 = 60706 - \frac{836^2}{12} = 2464.67$

9.2020

Calculation Syy

n = 12, Σxi = 836, Σyi = 962, Σxi² = 60706, Σyi² = 78634, Σxiyi = 68254

$S_{yy} = \sum y_i^2 - \frac{1}{n}\left(\sum y_i\right)^2 = 78634 - \frac{962^2}{12} = 1513.67$

9.2121

Calculation Sxy

n = 12, Σxi = 836, Σyi = 962, Σxi² = 60706, Σyi² = 78634, Σxiyi = 68254

$S_{xy} = \sum x_i y_i - \frac{1}{n}\sum x_i \sum y_i = 68254 - \frac{836 \times 962}{12} = 1234.67$


9.2323

Conclusion

$S_{xx} = 2464.67$, $S_{yy} = 1513.67$, $S_{xy} = 1234.67$

$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{1234.67}{\sqrt{2464.67 \times 1513.67}} = 0.64$

ν = n – 2 = 10

ν    p=0.1   p=0.05  p=0.025  p=0.01  p=0.005  p=0.002
10   0.497   0.576   0.640    0.708   0.750    0.795

The tables give one and two tail values. Since r10(0.025) = 0.58 (0.025 in each tail, which is the two tail p = 0.05 column), there appears to be a significant correlation at the 95% confidence level (0.64 > 0.58).
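The hand calculation above can be reproduced directly from the raw scores. The following is a minimal sketch in plain Python; the helper name `pearson_r` is an assumption made for illustration, and the critical value 0.576 is simply copied from the table above rather than computed.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation via the S_xx, S_yy, S_xy shortcut formulas."""
    n = len(x)
    sxx = sum(a * a for a in x) - sum(x) ** 2 / n
    syy = sum(b * b for b in y) - sum(y) ** 2 / n
    sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    return sxy / sqrt(sxx * syy)


verbal = [50, 66, 73, 84, 57, 83, 76, 95, 73, 78, 48, 53]
spatial = [69, 85, 88, 70, 84, 78, 90, 97, 79, 95, 67, 60]

r = pearson_r(verbal, spatial)
r_crit = 0.576  # tabulated value for ν = 10, two tail p = 0.05

print(f"r = {r:.3f}, significant: {r > r_crit}")
```

The result rounds to 0.639, which agrees with the 0.64 obtained by hand.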

9.2424

SPSS

Analyze > Correlate > Bivariate

9.2525

SPSS

The correlation (0.64) and p value (p < 0.05) are consistent with our calculation.

Correlations

                               Verbal   Spatial
Verbal   Pearson Correlation   1        .639*
         Sig. (2-tailed)                .025
         N                     12       12
Spatial  Pearson Correlation   .639*    1
         Sig. (2-tailed)       .025
         N                     12       12

*. Correlation is significant at the 0.05 level (2-tailed).

9.2626

Aside

From previous experience it is known that in a population, measurements of IQ are approximately normally distributed with standard deviation 16. Tests are carried out on a particular subgroup of 14 individuals from the population. Calculate a 95% confidence interval for the population mean. Is your interval consistent with a population mean of 132?

Observed data: 133.06 119.30 109.17 93.88 116.93 130.98 135.25 140.02 118.38 121.86 124.78 142.91 135.52 132.96

What are the key words/information?


9.2828

Aside

Which tests would be appropriate for the sample mean?

z or t

9.2929

Aside

Which tests would be appropriate for the sample mean? What are the key parameters for these tests?

z: µ, σ, n

t: µ, s, n

9.3030

Aside

In this case which test would be appropriate for the sample mean? What are the key parameters for this test?

z: µ, σ, n

Since σ is available, use the z test.
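The interval asked for in this aside can be sketched as follows. The slides do not show the working, so this is an illustration under stated assumptions: σ = 16 is treated as known, and 1.96 is the usual two-sided 95% point of the standard normal distribution taken from tables.

```python
from math import sqrt

# Observed IQ scores for the subgroup of 14 individuals
iq = [133.06, 119.30, 109.17, 93.88, 116.93, 130.98, 135.25,
      140.02, 118.38, 121.86, 124.78, 142.91, 135.52, 132.96]

sigma = 16.0   # known population standard deviation
z = 1.96       # two-sided 95% point of the standard normal (from tables)

n = len(iq)
mean = sum(iq) / n
half_width = z * sigma / sqrt(n)
lower, upper = mean - half_width, mean + half_width

print(f"mean = {mean:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
print("consistent with 132:", lower <= 132 <= upper)
```

Under these assumptions the interval contains 132, so the data are consistent with a population mean of 132.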

9.3131

Rank Correlation

What if we cannot assume normality or there are outliers when calculating and assessing a correlation?

If your samples violate the assumption of normality or have outliers then you might need to consider using a non-parametric test such as Spearman's Correlation.

9.3232

Spearman’s Rank Correlation Coefficient (rs)

di is the difference between the rankings in each pair of scores; n is the number of pairs of scores.

$r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n\left(n^2 - 1\right)}$

Note that rs only matches the conventional correlation (direct calculation) for the ranked data if there are no ties.

9.3333

Example

A researcher has a theory that phonological working memory (memory for speech, the auditory component of the working memory model) is related to children's vocabulary size. The researcher tests this theory by measuring both phonological working memory (A) and vocabulary size (B) in children of 4 and 5 years of age.

9.3434

Data

Child A B

1 18 187

2 14 134

3 15 121

4 11 150

5 17 145

6 18 178

7 12 112

8 9 87

First plot the data

9.3535

Scatterplot

85

105

125

145

165

185

8 10 12 14 16 18

A

B

9.3636

Data

Child A B

1 18 187

2 14 134

3 15 121

4 11 150

5 17 145

6 18 178

7 12 112

8 9 87

Replace all observed values (A,B) by their ranks

9.3737

Rank A

Child A Rank A

1 18 1

6 18 2

5 17 3

3 15 4

2 14 5

7 12 6

4 11 7

8 9 8

The first two observations are tied!

Is there a problem?


9.3838

Rank A

Child A Rank A True Rank

1 18 1 1.5

6 18 2 1.5

5 17 3 3

3 15 4 4

2 14 5 5

7 12 6 6

4 11 7 7

8 9 8 8

9.3939

Rank B

Child B Rank B

1 187 1

6 178 2

4 150 3

5 145 4

2 134 5

3 121 6

7 112 7

8 87 8

No ties in this case

9.4040

Rank B

Child B Rank B True Rank

1 187 1 1

6 178 2 2

4 150 3 3

5 145 4 4

2 134 5 5

3 121 6 6

7 112 7 7

8 87 8 8

9.4141

Differences

Rearrange the A data against the identifier (child), and similarly for the B data. Now find the difference di between the true ranks, square these differences, and form the total.

Child  A   True Rank A  B    True Rank B  di    di²
1      18  1.5          187  1             0.5   0.25
2      14  5            134  5             0     0
3      15  4            121  6            -2     4
4      11  7            150  3             4     16
5      17  3            145  4            -1     1
6      18  1.5          178  2            -0.5   0.25
7      12  6            112  7            -1     1
8      9   8            87   8             0     0
Total                                            22.5

Now calculate the correlation

9.4646

Conclusion

n = 8, Σdi² = 22.5

$r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n\left(n^2 - 1\right)} = 1 - \frac{6 \times 22.5}{8\left(8^2 - 1\right)} = 1 - \frac{135}{8 \times 63} = 0.732$

n    p = 0.05  p = 0.01
8    0.738     0.881

The tables give one and two tail values. Note that SPSS reports an approximate p value based on the Pearson correlation.

9.4747

Conclusion

n = 8, rcrit = 0.738, rs = 0.732

The critical value for n = 8 at the p = 0.025 level is 0.738. Since rs = 0.732 is less than 0.738, rs is not significant at the 95% confidence level. More plainly, these data do not provide sufficient evidence of a relationship.
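The ranking and difference steps of this example can be sketched in plain Python. The helper names `average_ranks` and `spearman_rs` are illustrative assumptions; ranks run from 1 for the largest value, and tied values share the average of the ranks they occupy, as in the True Rank columns above.

```python
def average_ranks(values):
    """Rank values with 1 = largest; tied values share the mean of their ranks."""
    order = sorted(values, reverse=True)
    ranks = []
    for v in values:
        first = order.index(v) + 1   # first rank position occupied by v
        count = order.count(v)       # number of tied observations
        ranks.append(first + (count - 1) / 2)
    return ranks


def spearman_rs(x, y):
    """Spearman's rs from the 1 - 6Σd²/(n(n²-1)) formula."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))


a = [18, 14, 15, 11, 17, 18, 12, 9]            # phonological working memory
b = [187, 134, 121, 150, 145, 178, 112, 87]    # vocabulary size

rs = spearman_rs(a, b)
print(f"rs = {rs:.3f}, exceeds 0.738: {rs > 0.738}")
```

The two tied scores of 18 both receive rank 1.5, and the result rounds to 0.732, just below the tabulated 0.738.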

9.4848

Example

The hypothesis tested is that prices should decrease with distance from the key area of gentrification surrounding the Contemporary Art Museum (CAM, El Raval, Barcelona). The line followed is Transect 2 in the map, with continuous sampling of the price of a 50cl. bottle of water at every convenience store.

9.4949

MACBA Barcelona Contemporary Art Museum

9.5050

Map

Selected transect Museum

9.5151

Data

Convenience Store  Distance from CAM (m)  Price of 50cl. bottle (€)
1                  50                     1.80
2                  175                    1.20
3                  270                    2.00
4                  375                    1.00
5                  425                    1.00
6                  580                    1.20
7                  710                    0.80
8                  790                    0.60
9                  890                    1.00
10                 980                    0.85

First plot the data

Which variable is dependent?

9.5252

Scatterplot

[Figure: scatterplot of Price (Euros, vertical axis, 0.5 to 1.9) vs Distance from CAM (horizontal axis, 40 to 940 m)]

9.5353

Data

Can we assume normality?

Now rank the data

9.5454

Rank Price

Ties have been identified; tied prices share the average of the ranks they occupy (True Rank).

Convenience Store  Price of 50cl. bottle (€)  Rank  True Rank
3                  2.00                       1     1
1                  1.80                       2     2
2                  1.20                       3     3.5
6                  1.20                       4     3.5
4                  1.00                       5     6
5                  1.00                       6     6
9                  1.00                       7     6
10                 0.85                       8     8
7                  0.80                       9     9
8                  0.60                       10    10

9.5656

Rank Distance

There are no ties, so the true ranks equal the ranks.

Convenience Store  Distance from CAM (m)  Rank  True Rank
10                 980                    1     1
9                  890                    2     2
8                  790                    3     3
7                  710                    4     4
6                  580                    5     5
5                  425                    6     6
4                  375                    7     7
3                  270                    8     8
2                  175                    9     9
1                  50                     10    10

9.5858

Differences

Arrange the true rank of price and the true rank of distance by store identifier. Calculate the differences di between the true ranks (distance rank minus price rank), calculate the square of the differences, and form the total of the differences squared.

Convenience Store  Price of 50cl. bottle (€)  True Rank Price  Distance from CAM (m)  True Rank Distance  di    di²
1                  1.80                       2                50                     10                   8    64
2                  1.20                       3.5              175                    9                    5.5  30.25
3                  2.00                       1                270                    8                    7    49
4                  1.00                       6                375                    7                    1    1
5                  1.00                       6                425                    6                    0    0
6                  1.20                       3.5              580                    5                    1.5  2.25
7                  0.80                       9                710                    4                   -5    25
8                  0.60                       10               790                    3                   -7    49
9                  1.00                       6                890                    2                   -4    16
10                 0.85                       8                980                    1                   -7    49
Total                                                                                                           285.5

9.6363

Conclusion

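n = 10, Σdi² = 285.5

$r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n\left(n^2 - 1\right)} = 1 - \frac{6 \times 285.5}{10\left(10^2 - 1\right)} = 1 - \frac{1713}{10 \times 99} = -0.730$

n    p = 0.05  p = 0.01
10   0.648     0.794

The tables give one and two tail values.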

9.6464

Conclusion

n = 10, rcrit = 0.648, rs = -0.730

The critical value for n = 10 at the p = 0.025 level is 0.648. The magnitude of rs, 0.73, exceeds this, so for two tails the significance level is slightly less than 5%.

Apparently there is a relationship.

9.6565

Comparison

Recall that rs only matches the conventional correlation for the ranked data if there are no ties.

The previous calculation is repeated using ranked data and the full correlation formula and then tested with software.

9.6666

Calculation

n = 10

Convenience Store                        1   2    3  4  5  6    7  8   9  10
x  Rank of distance from CAM (m)         10  9    8  7  6  5    4  3   2  1
y  Rank of price of 50cl. bottle (€)     2   3.5  1  6  6  3.5  9  10  6  8

Σxi = 10+9+…+1 = 55

Σyi = 2+3.5+…+8 = 55

Σxi² = 10²+9²+…+1² = 385

Σyi² = 2²+3.5²+…+8² = 382.5

Σxiyi = 10×2+9×3.5+…+1×8 = 241

Σxi and Σyi agree. Σxi² and Σyi² disagree because of ties.

9.6767

Calculation Sxx

n = 10, Σxi = 55, Σyi = 55, Σxi² = 385, Σyi² = 382.5, Σxiyi = 241

$S_{xx} = \sum x_i^2 - \frac{1}{n}\left(\sum x_i\right)^2 = 385 - \frac{55^2}{10} = 82.5$

9.6868

Calculation Syy

n = 10, Σxi = 55, Σyi = 55, Σxi² = 385, Σyi² = 382.5, Σxiyi = 241

$S_{yy} = \sum y_i^2 - \frac{1}{n}\left(\sum y_i\right)^2 = 382.5 - \frac{55^2}{10} = 80$

9.6969

Calculation Sxy

n = 10, Σxi = 55, Σyi = 55, Σxi² = 385, Σyi² = 382.5, Σxiyi = 241

$S_{xy} = \sum x_i y_i - \frac{1}{n}\sum x_i \sum y_i = 241 - \frac{55 \times 55}{10} = -61.5$


9.7171

Conclusion

$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{-61.5}{\sqrt{82.5 \times 80}} = -0.757$

n = 10, rcrit = 0.648, rs = -0.757

n    p = 0.05  p = 0.01
10   0.648     0.794

The tables give one and two tail values. The critical value for n = 10 at the p = 0.025 level is 0.648. The value 0.757 for two tails gives a significance level of slightly less than 5%. Apparently there is a relationship.

Note the slight difference from the value given by the di formula (-0.730), which is due to the ties.
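The comparison between the two routes can be sketched in plain Python. The helper names `average_ranks` and `pearson_r` are illustrative assumptions; the ranking convention (1 for the largest value, tied values share the average rank) follows the True Rank columns of the example.

```python
from math import sqrt


def average_ranks(values):
    """Rank values with 1 = largest; tied values share the mean of their ranks."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 + (order.count(v) - 1) / 2 for v in values]


def pearson_r(x, y):
    """Pearson correlation via the S_xx, S_yy, S_xy shortcut formulas."""
    n = len(x)
    sxx = sum(a * a for a in x) - sum(x) ** 2 / n
    syy = sum(b * b for b in y) - sum(y) ** 2 / n
    sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    return sxy / sqrt(sxx * syy)


distance = [50, 175, 270, 375, 425, 580, 710, 790, 890, 980]
price = [1.80, 1.20, 2.00, 1.00, 1.00, 1.20, 0.80, 0.60, 1.00, 0.85]

rank_d = average_ranks(distance)
rank_p = average_ranks(price)
n = len(distance)

# Route 1: the di formula, which is exact only when there are no ties
sum_d2 = sum((d - p) ** 2 for d, p in zip(rank_d, rank_p))
rs_formula = 1 - 6 * sum_d2 / (n * (n * n - 1))

# Route 2: the full correlation formula applied to the ranks
rs_pearson = pearson_r(rank_d, rank_p)

print(f"di formula: {rs_formula:.3f}, correlation of ranks: {rs_pearson:.3f}")
```

The first route gives -0.730 and the second -0.757, the value SPSS reports; the gap arises only because some prices are tied.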

9.7272

SPSS

Analyze > Correlate > Bivariate

9.7373

SPSS

The correlation (-0.76) and p value (p < 0.05) are consistent with our calculation.

Correlations (Spearman's rho)

                                    Distance  Price
Distance  Correlation Coefficient   1.000     -.757*
          Sig. (2-tailed)           .         .011
          N                         10        10
Price     Correlation Coefficient   -.757*    1.000
          Sig. (2-tailed)           .011      .
          N                         10        10

*. Correlation is significant at the 0.05 level (2-tailed).

9.7474

Point Biserial Correlation

The point biserial correlation coefficient (rpb) is a correlation coefficient used when one variable is dichotomous; ideally it will be “naturally” dichotomous such as pass/fail (P/F). The point biserial correlation is mathematically equivalent to the Pearson correlation. This can be shown by assigning two distinct numerical values (usually 0/1) to the dichotomous variable.

9.7575

Point Biserial Correlation

To calculate rpb, use the dichotomous variable to divide the data set into two groups. Evaluate

MP, the mean score for group P
MF, the mean score for group F
S, the population standard deviation, evaluated for all entries
p, the proportion of those in group P
f, the proportion of those in group F (f = 1 – p)

There is no version of the formula for a case where you only have sample data.

$r_{pb} = \frac{M_P - M_F}{S}\sqrt{pf}$
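The stated equivalence with the Pearson correlation can be illustrated numerically. This is a sketch on a small hypothetical data set (the six scores and pass/fail codes below are invented for illustration and are not from the lecture); the helper names are likewise assumptions.

```python
from math import sqrt


def point_biserial(scores, group):
    """rpb = (MP - MF) / S * sqrt(p f), with S the population standard deviation."""
    n = len(scores)
    in_p = [s for s, g in zip(scores, group) if g == 1]
    in_f = [s for s, g in zip(scores, group) if g == 0]
    m_p = sum(in_p) / len(in_p)
    m_f = sum(in_f) / len(in_f)
    mean = sum(scores) / n
    s_pop = sqrt(sum((s - mean) ** 2 for s in scores) / n)  # divisor n, not n - 1
    p = len(in_p) / n
    return (m_p - m_f) / s_pop * sqrt(p * (1 - p))


def pearson_r(x, y):
    """Pearson correlation via the S_xx, S_yy, S_xy shortcut formulas."""
    n = len(x)
    sxx = sum(a * a for a in x) - sum(x) ** 2 / n
    syy = sum(b * b for b in y) - sum(y) ** 2 / n
    sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    return sxy / sqrt(sxx * syy)


# Hypothetical data: total score and pass (1) or fail (0) on one item
scores = [12, 15, 9, 20, 18, 7]
passed = [1, 1, 0, 1, 1, 0]

r_pb = point_biserial(scores, passed)
r = pearson_r(passed, scores)
print(f"rpb = {r_pb:.4f}, Pearson r = {r:.4f}")
```

Coding the dichotomous variable as 0/1 and applying the ordinary correlation formula returns the same number as the rpb formula, up to floating point rounding.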

9.7676

Point Biserial Correlation - Example

The following data is for 165 students attempting 50 multiple-choice questions. The final mark for the examination (out of 50) was

30 38 24 22 31 33 37 38 33 27 35 31 35 42 24 26 34 33 35 22 25 22 38 36 41 40 26 29 43 30 34 28 16 38 34 33

26 32 39 27 12 32 35 17 39 20 18 30 37 17 26 37 21 19 38 25 38 31 21 29 25 26 27 31 33 37 35 26 35 17 22 26

24 21 34 40 32 22 28 24 38 23 17 22 19 33 13 32 17 33 26 15 39 32 22 23 32 19 41 29 33 29 24 19 20 35 31 31

33 37 24 18 38 22 33 29 26 32 27 24 25 27 26 29 29 34 35 38 27 23 35 35 34 26 19 27 33 38 32 25 37 24 39 30

28 25 32 28 33 26 22 26 29 25 32 30 37 29 33 28 22 33 29 27 29

Individual success on the first question (1 pass, 0 fail) was

1 0 1 1 1 1 1 0 1 0 1 0 1 1 1 0 1 1 1 1 0 1 0 1 1 1 0 0 1 0 1 1 0 1 1 1 1 1 1 0 0 1 1 1 1 1 1 0 1 0 0 1 1 0 1 0 1 1 0 1

1 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 0 0 0 1 0 1 0 1 1 0 1 0 0 0 0 0 1 1 0 1 0 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 1 0 1

0 1 0 0 1 1 1 1 0 0 1 1 0 0 0 1 1 1 1 1 1 0 1 1 1 0 1 1 0 1 1 1 1 0 1 1 1 0 1 1 0 1 0 0 0 

9.7777

Point Biserial Correlation - Example

MP = 30.87 is the mean score for group P
MF = 25.71 is the mean score for group F
S = 6.63 is the population standard deviation, evaluated for all entries
p = .65 is the proportion of those in group P
f = .35 is the proportion of those in group F (f = 1 – p)

$r_{pb} = \frac{M_P - M_F}{S}\sqrt{pf} = .37$

This may be verified by direct calculation.

Correlations

                             Question_1
Total  Pearson Correlation   .372**
       Sig. (2-tailed)       .000
       N                     165

**. Correlation is significant at the 0.01 level (2-tailed).
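The arithmetic behind the value .37 can be checked from the summary statistics quoted on this slide. This is a minimal sketch; the variable names are illustrative, and the inputs are the rounded figures shown above rather than values recomputed from the raw marks.

```python
from math import sqrt

m_p = 30.87   # mean score for group P
m_f = 25.71   # mean score for group F
s = 6.63      # population standard deviation of all scores
p = 0.65      # proportion in group P
f = 1 - p     # proportion in group F

r_pb = (m_p - m_f) / s * sqrt(p * f)
print(f"rpb = {r_pb:.2f}")
```

Because the inputs are rounded, the result agrees with the SPSS value of .372 only to two decimal places.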

9.7878

Point Biserial Correlation - Example

Since rcrit is 0.153 (use the Calculator) and rpb is 0.37, the result is significant.

A large positive correlation corresponds to what we would expect: students with high scores on the overall test tend to get the item right, and students with low scores on the overall test tend to get the item wrong.

A large negative correlation corresponds to what we would not expect: students who get the item correct tend to do poorly on the overall test, and students who get the item wrong tend to do well on the test.

9.7979

Read

Read Howitt and Cramer pages 59-74

Read Howitt and Cramer (e-text) pages 87-124

Read Russo (e-text) pages 176-201

Read Davis and Smith pages 173-192

9.8080

Practical 9

This material is available from the module web page.

http://www.staff.ncl.ac.uk/mike.cox

Module Web Page

9.8181

Practical 9

The material for the practical is available.

Instructions for the practical: Practical 9

Material for the practical: Practical 9

9.8282

Whoops!

"There's this cluster of interrelated findings", said Richard A. Lippa, a professor of psychology at California State University at Fullerton, who has found evidence that in gay men, the hair on the back of the head is more likely to curl counter-clockwise than in straight men. "These are all biological markers that something must have gone on early in development".

Washington Post

5 February 2008

Source

9.8383

However!

Dilbert

9.8484

Cause and Effect

Recall - Type I error – false positive – conclude that the variable or coefficient is important, but the true state of nature is that it is not.

Firemen cause damage

Storks bring babies

Conclude two drugs differ, when in fact they do not

Correlation does not imply causation! Spurious Correlations

5.8585

Big Data Helps Companies Find Some Surprising Correlations

There's a Link Between Sales and Phases of the Moon, Among Other Things, by Deborah Gage, Wall Street Journal, 23 March 2014

9.8686

Whoops!