Correlation

Page 1: Correlation

• Assume you have two measurements, x and y, on a set of objects, and would like to know if x and y are related.

• If they are directly related, when x is high, y tends to be high, and when x is low, y tends to be low.

• If they are inversely related, when x is high, y tends to be low, and when x is low, y tends to be high.

• This suggests summing the products of the z-scores.

Page 2: Scatter plot of y vs. x

case   x      y
1      7.7    64.2
2      10.9   65.7
3      10.3   64.3
4      8.7    64.3
5      9.6    64.6
6      10.3   65.8
7      13.6   67.9
8      12.8   67.5
9      8.7    63.5
10     13.0   66.6
11     10.7   64.1
12     9.9    63.9
13     10.7   65.4
14     9.8    66.4
15     9.9    64.2
16     11.5   66.1
17     11.4   65.5
18     11.4   66.8
19     10.7   64.2
20     8.8    64.4

[Figure: scatter plot of y vs. x]

Page 3: Correlation coefficient (Pearson’s) r

case   x      y      x (z-score)   y (z-score)   product
1      7.7    64.2   -1.9          -0.8           1.55
2      10.9   65.7    0.3           0.3           0.08
3      10.3   64.3   -0.1          -0.8           0.11
4      8.7    64.3   -1.2          -0.8           0.91
5      9.6    64.6   -0.6          -0.5           0.32
6      10.3   65.8   -0.1           0.4          -0.06
7      13.6   67.9    2.0           2.0           4.17
8      12.8   67.5    1.5           1.7           2.62
9      8.7    63.5   -1.2          -1.4           1.66
10     13.0   66.6    1.6           1.0           1.70
11     10.7   64.1    0.1          -0.9          -0.11
12     9.9    63.9   -0.4          -1.1           0.44
13     10.7   65.4    0.1           0.1           0.01
14     9.8    66.4   -0.5           0.9          -0.42
15     9.9    64.2   -0.4          -0.8           0.34
16     11.5   66.1    0.6           0.6           0.42
17     11.4   65.5    0.6           0.2           0.10
18     11.4   66.8    0.6           1.2           0.69
19     10.7   64.2    0.1          -0.8          -0.10
20     8.8    64.4   -1.1          -0.7           0.77
mean   10.5   65.3    0.0           0.0           0.80

r = mean of the products of the z-scores = 0.80
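The table can be checked directly in code. This is a minimal sketch of the mean-of-z-score-products definition, using the 20 cases above (population standard deviations, so that the mean of the products equals r):

```python
# Recompute Pearson's r for the 20 cases above as the mean of the
# products of the z-scores (population SDs, so mean of products = r).
import math

x = [7.7, 10.9, 10.3, 8.7, 9.6, 10.3, 13.6, 12.8, 8.7, 13.0,
     10.7, 9.9, 10.7, 9.8, 9.9, 11.5, 11.4, 11.4, 10.7, 8.8]
y = [64.2, 65.7, 64.3, 64.3, 64.6, 65.8, 67.9, 67.5, 63.5, 66.6,
     64.1, 63.9, 65.4, 66.4, 64.2, 66.1, 65.5, 66.8, 64.2, 64.4]

def z_scores(v):
    m = sum(v) / len(v)
    sd = math.sqrt(sum((a - m) ** 2 for a in v) / len(v))  # population SD
    return [(a - m) / sd for a in v]

zx, zy = z_scores(x), z_scores(y)
r = sum(a * b for a, b in zip(zx, zy)) / len(x)  # mean of the products
print(round(r, 2))  # 0.8
```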

Page 4: Properties of r

• r ranges from -1.0 to +1.0

• r = +1 means a perfect linear relationship with positive slope

• r = -1 means perfect linear relationship with negative slope

• r = 0 means no correlation

Page 5: Example scatterplots

[Figure: example scatterplots with r = 0.80, r = 0.35, r = -.24, r = .41, r = 0, r = -.66, and r = .94]

Page 6: Correlation and causation

• “Correlation does not imply causation”

• More precisely, x correlated with y does not imply x causes y, because

• correlation could be a type I error

• y could cause x

• z could cause both x and y

Page 7: Uncorrelated does not mean independent

[Figure: scatter plot of y vs. x in which x is highly predictive of y, but r = 0]
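The figure itself is not recoverable, but a standard example of this situation (an assumption here, not necessarily the one plotted) is a symmetric quadratic relationship, where y is perfectly predictable from x yet r is exactly 0:

```python
# y is a deterministic function of x, yet r = 0, because the
# relationship is symmetric rather than linear.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [x ** 2 for x in xs]          # y = x^2: perfectly predictable from x

n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
print(cov)  # 0.0, so r = cov / (sd_x * sd_y) = 0
```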

Page 8: Significance test for r

• The aim is to test the null hypothesis that the population correlation ρ (rho) is 0.

• The larger n, the less likely a given r will happen under the null hypothesis.

From r and n, we can compute a p-value

From n and α, we can compute a critical r

• Numerical example
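The numerical example does not survive in the transcript. A sketch of the usual t-statistic form of this test, plugging in the r = 0.80 and n = 20 from the earlier slides, might look like this (the critical value of about 2.101 for df = 18 at two-tailed α = .05 comes from standard t tables):

```python
# t-test for H0: rho = 0, using r = 0.80 and n = 20 from the slides.
import math

r, n = 0.80, 20
df = n - 2
t = r * math.sqrt(df / (1 - r ** 2))  # t = r * sqrt((n-2) / (1 - r^2))
print(round(t, 2))  # 5.66, well beyond the critical t of ~2.101
```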

Page 9: Regression

• Correlation suggests a linear model of y as a function of x

• A linear model is defined by

  y = mx + b + e

  where mx + b is the equation for a line (m is the slope, b is the intercept), e is random error with mean 0, and the predicted value is ŷ = mx + b.
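The slope and intercept of the least-squares line can be computed directly from the definitions. A sketch, on data invented for illustration:

```python
# Least-squares fit of the linear model y = m*x + b + e (made-up data).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]   # roughly y = 2x

n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n
# m = cov(x, y) / var(x); b = mean(y) - m * mean(x)
m = sum((xi - mx) * (yi - my) for xi, yi in zip(xs, ys)) \
    / sum((xi - mx) ** 2 for xi in xs)
b = my - m * mx
y_hat = [m * xi + b for xi in xs]                   # predicted values
residuals = [yi - yh for yi, yh in zip(ys, y_hat)]  # the error term e
print(round(m, 2), round(b, 2))  # 1.99 0.05
```

With an intercept in the model, the residuals always sum to zero, which is a quick sanity check on the fit.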

Page 10: Regression example

[Figure: top panel, regression of y on x; bottom panel, residuals e vs. x]

R² = 0.5336, F = 20.5949, p = 0.0003
Regression line: y = -1.18x + 18.77

Page 11: r vs. R²

• R² is actually the square of r. So why is it capitalized and squared in a regression?

• r ranges from -1 to 1.

• But in a regression, r cannot meaningfully be negative, because it is the correlation between y and ŷ. Since ŷ is the best estimate of y, this correlation is automatically positive.

• The capitalization and squaring reflect this situation.

• It is squared so that it can be interpreted as a proportion of variance accounted for.

Page 12: Interpretation of R²

• R² can be interpreted as the proportion of the variance accounted for:

  R² = 1 - SS_error / SS_total = SS_reg / SS_total

[Figure: scatter plot with the regression line and the mean of y marked]

R² is high when the unexplained (residual) variance is small relative to the total amount of variance.
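Both forms of the definition can be checked on a small invented data set; for a least-squares fit with an intercept, 1 - SS_error/SS_total and SS_reg/SS_total agree:

```python
# Compute R^2 both ways for a least-squares fit (made-up data).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
m = sum((xi - mx) * (yi - my) for xi, yi in zip(xs, ys)) \
    / sum((xi - mx) ** 2 for xi in xs)
b = my - m * mx
y_hat = [m * xi + b for xi in xs]

ss_total = sum((yi - my) ** 2 for yi in ys)                  # total
ss_error = sum((yi - yh) ** 2 for yi, yh in zip(ys, y_hat))  # residual
ss_reg = sum((yh - my) ** 2 for yh in y_hat)                 # explained
r2_a = 1 - ss_error / ss_total
r2_b = ss_reg / ss_total
print(round(r2_a, 4), round(r2_b, 4))  # the two forms agree
```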

Page 13: Simpson’s paradox

[Figure: length of ears vs. size of animal, with separate clusters for rabbits, humans, and whales. Negatively correlated, or positively correlated?]

Adding a variable can change the sign of the correlation.

Page 14: Effect size

• Beyond computing significance, we often need an estimate of the magnitude of an effect.

• There are two basic ways of expressing this:

- Normalized mean difference

- Proportion of variance accounted for

Page 15: The normalized difference between means

Cohen’s d expresses the difference between two means relative to the spread in the data:

  d = (M1 - M2) / s

where s is the pooled standard deviation.
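A minimal sketch of Cohen’s d with a pooled standard deviation (the two samples are invented):

```python
# Cohen's d with a pooled SD (made-up samples).
import math

g1 = [5.0, 6.0, 7.0, 8.0]
g2 = [2.0, 3.0, 4.0, 5.0]

def mean(v):
    return sum(v) / len(v)

def ss(v):  # sum of squared deviations from the group mean
    m = mean(v)
    return sum((a - m) ** 2 for a in v)

# pooled SD combines the within-group sums of squares
s = math.sqrt((ss(g1) + ss(g2)) / (len(g1) + len(g2) - 2))
d = (mean(g1) - mean(g2)) / s
print(round(d, 2))  # 2.32
```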

Page 16: Proportion of variance accounted for

• R² can be interpreted as the proportion of all the variance in the data that is predicted by a regression model.

• η² (eta squared) can be interpreted as the proportion of all variance in a factorial design that is accounted for by main effects and interactions.

Page 17: Power

•Power is the probability of finding an effect of a given size in your experiment, i.e.

•The probability of rejecting the null hypothesis if the null hypothesis is actually false.

Page 19: Outliers

• An outlier is a measurement that is so discrepant from others that it seems “suspicious.”

• If p(x_suspicious | distribution) is low enough, we “reject the null hypothesis” that x_suspicious came from the same distribution as the others, and remove it.

• A common rule of thumb is |z| > 2.5 (or 2, or 3), BUT...

• But also consider transforms that avoid outliers in the first place, like 1/x.

• Removed data is best NOT REPLACED. But if it must be replaced, do so “conservatively,” i.e. in a manner biased towards the null hypothesis.
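The rule-of-thumb screen can be sketched as follows (the data and the 2.5 cutoff are illustrative; removal should still be justified case by case):

```python
# z-score screen for outliers: flag values with |z| > 2.5 (made-up data).
import math

data = [9.8, 10.1, 10.3, 9.9, 10.0, 10.2, 9.7, 10.1, 17.5]  # last value suspicious

m = sum(data) / len(data)
sd = math.sqrt(sum((x - m) ** 2 for x in data) / (len(data) - 1))  # sample SD
outliers = [x for x in data if abs((x - m) / sd) > 2.5]
print(outliers)  # [17.5]
```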

Page 20: Chi-squared

• Assume that K mutually exclusive outcomes are predicted to occur E1, E2, ..., EK times...

• ...but are actually observed to occur N1, N2, ..., NK times respectively...

• A chi-squared test allows us to evaluate the null hypothesis that the proportions were as expected, with deviations “by chance.”

Page 21: Performing a chi-squared test

• For each outcome, compute (Nk - Ek)² / Ek

• Sum them up over all outcomes

• Then, under the null hypothesis, this total will be distributed as a χ² distribution with K - 1 degrees of freedom.
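The two steps can be sketched for a hypothetical fair-die experiment (the observed counts are invented):

```python
# Chi-squared statistic for a fair six-sided die: K = 6 outcomes,
# equal expected counts E_k under the null (observed counts made up).
observed = [18, 22, 16, 25, 19, 20]       # N_k, total of 120 rolls
expected = [sum(observed) / 6] * 6         # E_k = 20 under the null

chi2 = sum((n - e) ** 2 / e for n, e in zip(observed, expected))
print(round(chi2, 2))  # compare to a chi-squared distribution with K-1 = 5 df
```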

Page 22: The Bayesian perspective

• Conventional statistics is based on a frequentist definition of probability, which insists that hypotheses do not have “probabilities.”

→ All we can do is “reject” H, or not reject it.

• Bayesian inference is based on a subjectivist definition of probability, which considers p(H) to be the “degree of belief” in hypothesis H, simply expressing our uncertainty about H in light of the data.

→ Instead of accepting or rejecting, we seek p(H|E).

Page 23: Cartoon 1: Fisher

• Fisher: Given the sampling distribution of the null, p(E|H0), consider the likelihood of the null hypothesis, integrated out to the tail. If this probability is low, this tends to contradict the null hypothesis.

• In fact, if it is lower than .05, we informally “reject” the null.

[Figure: probability density p(E|H0) as a function of the evidence E]

Page 24: Cartoon 2: Neyman & Pearson

• N&P: There are really two hypotheses, the null H0 and some alternative H1.

• Our main goal is to avoid a Type I error. So set this probability at α, which determines our criterion for rejecting the null.

• Note though that there is also the possibility of making a Type II error, a hit, or a correct rejection.

• Compute power and set sample size to control the probability of a Type II error.

[Figure: probability densities p(E|H0), centered at 0, and p(E|H1), centered at the expected effect size μ1]

Page 25: Cartoon 3: Bayes (/Laplace/Jeffreys)

• What we really want is to evaluate how strongly our data favors either hypothesis, not just make an accept/reject decision.

• For each H, the degree of belief in it, conditioned on the data, is p(H|E). So to evaluate the relative strength of H1 and H0, consider the posterior ratio

  degree of belief in H1 / degree of belief in H0 = p(H1|E) / p(H0|E)

• This expresses how strongly the data and priors favor H1 relative to H0, taking into account everything we know about the situation.

Page 26: Decomposing the posterior ratio

  p(H1|E) / p(H0|E) = [p(H1) / p(H0)] × [p(E|H1) / p(E|H0)]

  posterior ratio = prior ratio × likelihood ratio

• If you want to be “unbiased,” set the prior ratio to 1, sometimes called an “uninformative prior.”

• Then your posterior belief about H0 and H1 depends entirely on the likelihood ratio, aka “Bayes factor.”
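The decomposition can be sketched with two Gaussian sampling distributions (all numbers here are invented for illustration):

```python
# posterior ratio = prior ratio * likelihood ratio, for two simple
# Gaussian hypotheses about a single observation E (made-up numbers).
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) \
        / (sigma * math.sqrt(2 * math.pi))

E = 1.5                             # observed evidence
lik_h0 = normal_pdf(E, 0.0, 1.0)    # p(E|H0): null centered at 0
lik_h1 = normal_pdf(E, 2.0, 1.0)    # p(E|H1): alternative centered at mu1 = 2
prior_ratio = 1.0                   # "uninformative" prior
bayes_factor = lik_h1 / lik_h0      # likelihood ratio
posterior_ratio = prior_ratio * bayes_factor
print(round(bayes_factor, 2))
```

With the prior ratio set to 1, the posterior ratio just equals the Bayes factor, as the slide says.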

Page 27: Visualizing the Likelihood Ratio

[Figure: probability densities p(E|H0), centered at 0, and p(E|H1), centered at the expected effect size μ1, evaluated at the observed E]

  likelihood ratio = p(E|H1) / p(E|H0) = height of the p(E|H1) curve at E / height of the p(E|H0) curve at E

Page 28: Interpretation of likelihood ratios

• LR = 1 means the evidence was neutral about which hypothesis was correct.

• LR > 1 means the evidence favors the alternative hypothesis H1.

• Jeffreys (1939) suggested rules of thumb, e.g. LR > 3 means “substantial” evidence in favor of H1, LR > 10 means “strong” evidence, etc.

• LR < 1 means the evidence actually favored the null hypothesis.

Page 29: LRs vs. p-values

• Likelihood ratios and p-values are not at all the same thing.

• But in practice, they are related.

Dixon (1998)