
The Regression Equation

Using the regression equation to individualize prediction and move beyond saying that everyone is equal, that everyone should score right at the mean

Best fitting line: A review

The definition of the best fitting line plotted on t axes

• A “best fitting line” minimizes the average squared vertical distance of Y scores in the sample (expressed as tY scores) from the line.

• The best fitting line is a least squares, unbiased estimate of values of Y in the sample.

• The generic formula for a line is Y = mX + b, where m is the slope and b is the Y intercept.

• Thus, any specific line, such as the best fitting line, can be defined by its slope and its intercept.

The intercept of the best fitting line plotted on t axes

• The origin is the point where both tX and tY = 0.000.

• So the origin represents the mean of both the X and Y variables.

• When plotted on t axes all best fitting lines go through the origin.

• Thus, the tY intercept of the best fitting line = 0.000.

The slope of, and formula for, the best fitting line

• When plotted on t axes the slope of the best fitting line = r, the correlation coefficient.

• To define a line we need its slope and Y intercept.

• r = the slope, and the tY intercept = 0.000.

• The formula for the best fitting line is therefore tY = rtX + 0.000, or simply tY = rtX.
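Here is a quick numerical check of that claim, as a minimal Python sketch (the data are invented for illustration): standardize X and Y, fit a least squares line to the standardized scores, and the slope comes out equal to Pearson's r while the intercept comes out at 0.000.

    import numpy as np

    # Hypothetical raw scores, for illustration only.
    x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])
    y = np.array([1.0, 3.0, 2.0, 6.0, 8.0])

    # Convert raw scores to t (standardized) scores.
    tx = (x - x.mean()) / x.std()
    ty = (y - y.mean()) / y.std()

    # Least squares fit of tY on tX.
    slope, intercept = np.polyfit(tx, ty, 1)
    r = np.corrcoef(x, y)[0, 1]

    print(slope, r)    # the two values match (about .95 for these data)
    print(intercept)   # essentially 0.000: the line passes through the origin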

Here's how a visual representation of the best fitting line (slope = r, tY intercept = 0.000) and the dots representing tX and tY scores might be described. (Whether the correlation is positive or negative doesn't matter.)

• Perfect - scores fall exactly on a straight line.

• Strong - most scores fall near the line.

• Moderate - some are near the line, some not.

• Weak - the scores are only mildly linear.

• Independent - the scores are not linear at all.

[Four scatterplots titled "Strength of a relationship," each plotted on tX and tY axes running from -1.5 to 1.5: Perfect; Strong (r about .800); Moderate (r about .500); Independent (r about 0.000).]

Notice what that formula for independent variables says

• tY = rtX = 0.000(tX) = 0.000.

• When tY = 0.000, you are at the mean of Y.

• So, when variables are independent, the best fitting line says that making the best estimate of Y scores in the sample requires you to go back to predicting that everyone will score at the mean of Y (regardless of his or her score on X).

• Thus, when variables are independent, we go back to saying everyone will score right at the mean, as the short sketch below shows.
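In code form (a toy sketch; the numbers are invented), plugging r = 0.000 into tY = rtX gives tY = 0.000 for every person, and a t score of 0.000 converts back to the raw mean of Y:

    import numpy as np

    y = np.array([3.0, 5.0, 8.0, 4.0])   # hypothetical Y scores
    r = 0.0                              # independent variables

    for tx in (-1.5, 0.0, 2.0):          # any tX whatsoever
        ty = r * tx                      # always 0.000
        y_pred = y.mean() + ty * y.std() # convert back to raw units
        print(y_pred)                    # always the mean of Y (5.0 here)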

A note of caution: Watch out for the plot for which the best fitting line is a curve.

[Scatterplot on tX and tY axes running from -1.5 to 1.5 in which the points follow a curve rather than a straight line.]

Moving from the best fitting line to the regression equation and the regression line

The best fitting line (tY = rtX) was the line closest to the Y values in the sample. But what should we do if we want to go beyond our sample and use a version of our best fitting line to make individualized predictions for the rest of the population?

What do we need to do to be able to use the regression equation?

• tY' = rtX

• Notice this is not quite the same as the formula for the best fitting line. The formula now reads tY' (read "t-y-prime"), not tY.

• tY' is the predicted score on the Y variable for every X score in the population within the range observed in our random sample.

• Before, we were describing the linear relationship of the X and Y variables in our sample. Now we are predicting estimated Z scores (t scores) for most or all of the population, as in the sketch below.
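Here is a minimal sketch of how such an individualized prediction might run in Python (the sample statistics and the value of r are invented; in practice, r must first survive the null-hypothesis test described later):

    # Hypothetical sample statistics, for illustration only.
    mean_x, sd_x = 50.0, 10.0
    mean_y, sd_y = 100.0, 15.0
    r = 0.60   # assume this r has already been shown to be significant

    def predict_y(x_new):
        # Apply tY' = r * tX, then convert the predicted t score back to raw units.
        tx = (x_new - mean_x) / sd_x
        ty_prime = r * tx
        return mean_y + ty_prime * sd_y

    print(predict_y(65.0))   # tX = 1.5, tY' = 0.9, predicted Y = 113.5
    print(predict_y(50.0))   # X at its mean: predicted Y = the mean of Y, 100.0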

This is one of the key points in the course; a point when things change radically.

Up to this point, we have just been describing scores, means, and relationships. We have not yet gone beyond predicting that everyone in the population who was not in our sample will score at the mean of Y.

But now we want to be able to make individualized predictions for the rest of the population, the people not in our sample and for whom we don't have Y scores.

In the context of correlation and regression (Ch. 7 & 8), this means making predictions of one score from a pre-existing difference among individuals, a score on the X variable.

In Chapter 9 we will determine whether such predictions can be made from the different ways people are treated in an experiment.

That's dangerous. Our first rule as scientists is "Do not increase error." Individualizing prediction can easily do that.

Let me give you an example.

Assume you are the personnel officer for a mid-size company.

• You need to hire a typist.

• There are 2 applicants for the job.

• You give the applicants a typing test.

• Which would you hire: someone who types 6 words a minute with 12 mistakes, or someone who types 100 words a minute with 1 mistake?

Whom would you hire?

• Of course, you would predict that the second person will be a better typist and hire that person.

• Notice that we never gave the person with 6 words/minute a chance to be a typist in our firm.

• We prejudged her on the basis of the typing test.

• That is probably valid in this case – a typing test probably predicts fairly well how good a typist someone will be.

But say the situation is a little more complicated!

• You have several applicants for a leadership position in your firm.

• But it is not 2004, it is 1954, when we "knew" that only white males were capable of leadership in corporate America.

• That is, we all "know" that leadership ability is correlated with both gender and skin color: white and male are associated with high leadership ability, darker skin color and female gender with lower leadership ability.

In 1954, it would have been just as absurd to hire someone of color or a woman for a leadership position as it would be to hire the bad (6-words-a-minute-with-12-mistakes) typist now. Everyone knew that 1) they couldn't do the job and/or that, even if they had some talent, 2) no subordinate would be comfortable following him or her.

• We now know this is absurd, but lots of people were never given a chance to try their hand at leadership, because of a standard pre-judgment that you can now see as obvious prejudice.

We would have been much better off saying that everyone is equal, everyone should be predicted to score at the mean.

• Pre-judgments on the basis of supposed relationships between variables that have no real scientific support are a form of prejudice.

• In the case we just discussed, they cost potential leaders jobs in which they could have shown their ability. That is unfair.

We would have been much better off saying that everyone is equal, everyone should be predicted to score at the mean.

• Moreover, by excluding such individuals, you narrow the talent pool of potential leaders. The more restricted the group of potential leaders, the less talented the average leader will be.

• This is why aristocracies don’t work in the long run. The talent pool is too small.

So, to avoid prejudice you must start with the notion that everyone will score at the mean.

In correlational language, to make that prediction you have to hypothesize that rho = 0.000.

Only if you can disprove the notion that rho = 0.000, and at no other time, should you make any other prediction than "Everyone is equal, everyone should be predicted to score right at the mean."

We call the hypothesis that rho=0.000 the null hypothesis

The symbol for the null hypothesis is H0. We will see the null hypothesis many times during the rest of this course. It is the hypothesis that you will learn to test statistically.

Confidence intervals around rhoT

• In Chapter 6 we learned to create confidence intervals around muT that allowed us to test a theory.

• To test our theory about mu we took a random sample, computed the sample mean and standard deviation, and determined whether the sample mean fell into that interval.

• If the sample mean fell into the confidence interval, there was some support for our theory, and we held onto it.

Confidence intervals around rhoT

• The interesting case was when the sample mean fell outside the confidence interval around muT.

• In that case, the data from our sample falsified our theory, so we had to discard the theory and the estimate of mu specified by the theory.

Remember the example from Chapter 6 involving the drug thought to increase body temperature

• In Chapter 6, we hypothesized that our new antidepressant drug had no effect on body temperature.

• We created a 95% confidence interval around normal body temperature, 98.6° F.

• The CI.95 went 0.75° F above and below 98.6, from 97.85° to 99.35° F.

• Then we administered the drug and measured body temperature.

• Our sample mean fell outside the CI.95.

If we discard a theory-based prediction, what do we use in its place?

Since X-bar (99.5° F) fell outside the confidence interval, we discarded our theory about the value of mu and concluded our drug caused a slight fever.

• How slight a fever? If we reject the theory (hypothesis) about mu, we must go back to using X-bar, the sample mean that fell outside the confidence interval and falsified our theory, as our best (least squares, unbiased, consistent) estimate of mu. So we would predict that people from the same population who took the drug would have an average body temperature of 99.5° F.
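The Chapter 6 logic reads naturally as a few lines of Python (a sketch; only the 98.6, 0.75, and 99.5 figures come from the slides):

    # Build the CI.95 around the theory-based mu, then see
    # whether the sample mean falls inside it.
    mu_theory = 98.6    # theorized mu: normal body temperature (deg F)
    half_width = 0.75   # CI.95 half-width from the slides

    ci_low, ci_high = mu_theory - half_width, mu_theory + half_width
    print(ci_low, ci_high)   # 97.85 to 99.35

    x_bar = 99.5             # sample mean observed on the drug
    if ci_low <= x_bar <= ci_high:
        print("Theory supported: keep mu =", mu_theory)
    else:
        print("Theory falsified: best estimate of mu is now", x_bar)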

To test any theory about any population parameter, we go through similar steps:

• We theorize about the value of the population parameter.

• We obtain some measure of the variability of sample-based estimates of the population parameter.

• We create a test of the theory about the population parameter by creating a confidence interval, almost always a CI.95.

• We then obtain an estimate of the parameter from a random sample.

The sample statistic will fall inside or outside of the CI.95

• If the sample statistic falls inside the confidence interval, our theory has received some support and we hold on to it.

• But the more interesting case is when the sample statistic falls outside the confidence interval.

• Then we must discard the theory and the theory based estimate of the population parameter.

• In that case, our best estimate of the population parameter is the sample statistic.

• Remember, the sample statistic is a least squares, unbiased, consistent estimate of its population parameter.

We are going to do the same thing with a theory about rho that we did in Ch. 6 with muT.

• rho is the correlation coefficient for the population.

• If we have a theory about rho, we can create a 95% confidence interval around our theory-based prediction of rho.

• We will almost always predict that rho = 0.000.

• If our theory is true, we expect r, our estimated correlation coefficient, to fall into the CI.95 around 0.000.

• An r computed from a random sample will then fall inside or outside the confidence interval.

When r falls inside the CI.95 around rhoT

• If r falls inside the confidence interval, our theory about rho has received some support and we hold on to it.

• Remember, our theory will almost always be that the null hypothesis is true.

The null hypothesis almost always predicts that rho = 0.000

• The null predicts that the variables are independent.

• It says that knowing a score on X will therefore tell you nothing about what to predict as a score on Y.

• If r falls in the 95% confidence interval around 0.000, we have failed to disprove the null.

• To minimize error in that case, go back to predicting that everyone will score right at the mean of Y.

When r falls outside the CI.95 around rhoT

• But the more interesting case is when r falls outside the confidence interval.

• Then we must discard the null hypothesis and its theory-based estimate of the population parameter (rho = 0.000).

• In that case, our best estimate of rho is the r we found in our random sample.

• Thus, when r falls outside the CI.95, we can use Pearson's r in the regression equation as a least squares, unbiased estimate of rho.

• One caveat: this only works for X scores in the population that fall in the range of X scores seen in our random sample. More about this next time.

Then what?

• If r falls outside the range predicted by the null, we can use the r from our sample, the r that falsified the theory that rho=0.000, in the regression equation:

• tY'=rtX

• In that case, we will individualize predictions, making a different prediction of Y scores for each different X score for members of the population who were not members of the random sample.

To repeat

• If rho = 0.000, we should go back to saying everyone is equal, everyone will score at the mean of Y.

• To be fair and avoid doing damage, we must test the hypothesis that rho=0.000 before doing anything else.

• To test the theory that rho=0.00, we create a CI.95 around rho=0.000.

• If, and only if, we disprove the notion that rho=0.000 by having r fall outside the CI.95 can we use r in the regression equation, tY'=rtX.

The r table

I could teach you how to calculate the confidence interval for rho=0.000

• But other people have already calculated the intervals for many different df.

• Those calculations are summarized in the r table.

df      nonsignificant     .05      .01
1       -.996 to .996      .997     .9999
2       -.949 to .949      .950     .990
3       -.877 to .877      .878     .959
4       -.810 to .810      .811     .917
5       -.753 to .753      .754     .874
6       -.706 to .706      .707     .834
7       -.665 to .665      .666     .798
8       -.631 to .631      .632     .765
9       -.601 to .601      .602     .735
10      -.575 to .575      .576     .708
11      -.552 to .552      .553     .684
12      -.531 to .531      .532     .661
...
100     -.194 to .194      .195     .254
200     -.137 to .137      .138     .181
300     -.112 to .112      .113     .148
500     -.087 to .087      .088     .115
1000    -.061 to .061      .062     .081
2000    -.043 to .043      .044     .058
10000   -.019 to .019      .020     .026

How the r table is laid out: the important columns

– Column 1 of the r table shows degrees of freedom for correlation and regression (dfREG), where dfREG = nP - 2.

– Column 2 shows the CI.95 for varying degrees of freedom.

– Column 3 shows the absolute value of the r that falls just outside the CI.95. Any r this far or further from 0.000 falsifies the hypothesis that rho = 0.000 and can be used in the regression equation to make predictions of Y scores for people who were not in the original sample but who were part of the population from which the sample was drawn.
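The tabled values don't have to be taken on faith: the critical r for a given dfREG equals t / sqrt(t² + dfREG), where t is the two-tailed critical t. A sketch using scipy (a library choice of mine, not the course's):

    from scipy import stats

    def critical_r(df, alpha=0.05):
        # Smallest |r| that falls outside the CI around rho = 0.000
        # (the Column 3 value for alpha = .05, Column 4 for alpha = .01).
        t_crit = stats.t.ppf(1 - alpha / 2, df)
        return t_crit / (t_crit**2 + df) ** 0.5

    print(round(critical_r(8), 3))              # 0.632, the Column 3 value for df = 8
    print(round(critical_r(28), 3))             # 0.361, matching the .360 used later
    print(round(critical_r(28, alpha=0.01), 3)) # 0.463, the Column 4 value for df = 28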


Testing H0: rho = 0.000

• To test the null, select a random sample, then see if the resultant r falls inside or outside the CI.95 around 0.000.

Let's test the (fairly absurd) hypothesis that liking for strong sensations in one area is related to liking for strong sensations in other areas. To test our hypothesis, we ask a random sample about their liking for two things that usually produce strong sensations: anchovy pizza and horror movies.

Ratings of liking for anchovy pizza and horror films (scale 0-9)

H1: People who enjoy food with strong flavors also enjoy other strong sensations.

H0: There is no relationship between enjoying food with strong flavors and enjoying other strong sensations.

Anchovy pizza:  7  7  3  3  0  8  4  1  1  1
Horror films:   7  9  8  6  9  6  5  2  1  6

Can we reject the null hypothesis?

Is this more or less linear? Yes.

[Scatterplot of the ten ratings, horror films against anchovy pizza, both axes running 0 to 8, showing a roughly linear scatter.]

Can we reject the null hypothesis?

We do the math and we find that r = .352, with df = 8. Looking under df = 8 in the r table above, the CI.95 around 0.000 runs from -.631 to +.631.
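The whole test can be reproduced in a few lines of Python (a sketch; my reading of the flattened ratings above could differ from the original by a digit or two, and the r it yields comes out near the slide's .352 either way):

    import numpy as np
    from scipy import stats

    pizza  = [7, 7, 3, 3, 0, 8, 4, 1, 1, 1]   # ratings as read off the slide
    horror = [7, 9, 8, 6, 9, 6, 5, 2, 1, 6]

    r = np.corrcoef(pizza, horror)[0, 1]
    df = len(pizza) - 2                        # dfREG = nP - 2 = 8

    t_crit = stats.t.ppf(0.975, df)
    r_crit = t_crit / (t_crit**2 + df) ** 0.5  # 0.632 for df = 8

    print(round(r, 3), df)                     # about .35 with 8 df
    print("significant" if abs(r) >= r_crit else "n.s.")   # n.s.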


This finding falls within the CI.95 around 0.000

• We call such findings "nonsignificant."

• Nonsignificant is abbreviated n.s.

• We would report this finding as follows: r(8) = 0.352, n.s.

• In English, we would say: the correlation with 8 degrees of freedom was .352. That finding is nonsignificant, and we fail to falsify the null. Therefore, we cannot use the regression equation, as we have no evidence that the correlation in the population as a whole is not 0.000. We go back to predicting that everyone will score right at the mean of Y.

This system prevents plausible, but incorrect, theories from affecting people's futures.

• I would guess that like most variables, desire for anchovy pizza and horror movies are not really correlated.

• This sample probably has an r of .352 solely because of the way samples of this size fluctuate around a rho of zero.

How to report a significant r

• For example, let's say that you had a sample (nP = 30) and r = -.400.

• Looking under nP - 2 = 28 dfREG, we find that the interval consistent with the null runs from -.360 to +.360.

• So we are outside the CI.95 for rho = 0.000.

• We would write that result as r(28) = -.400, p<.05.

• This tells you that there were 28 df for r, that r = -.400, and that you can expect an r that far from 0.000 five or fewer times in 100 when rho = 0.000.

Then there is Column 4

• Column 4 shows the values that lie outside a CI.99.

• (The CI.99 itself isn't shown the way the CI.95 is in Column 2, because it isn't important enough.)

• However, Column 4 gives you bragging rights.

• If your r is as far or further from 0.000 as the number in Column 4, you can say there is 1 or fewer chance in 100 of an r being this far from zero (p<.01).

• For example, let's say that you had a sample (nP = 30) and r = -.525.

• The critical value at .01 is .463. You are further from 0.000 than that, so you can brag.

• You write that result as r(28) = -.525, p<.01.

To summarize

• If r falls inside the CI.95 around 0.000, it is nonsignificant (n.s.) and you can't use the regression equation (e.g., r(28) = .300, n.s.).

• If r falls outside the CI.95, but not as far from 0.000 as the number in Column 4, you have a significant finding and can use the regression equation (e.g., r(28) = -.400, p<.05).

• If r is as far or further from zero as the number in Column 4, you can use the regression equation and brag while doing it (e.g., r(28) = -.525, p<.01). These three rules fit in one small function, sketched below.
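A sketch of the summary as code (the critical values are computed from the t distribution, as in the earlier sketch, rather than looked up in the table):

    from scipy import stats

    def critical_r(df, alpha):
        t = stats.t.ppf(1 - alpha / 2, df)
        return t / (t**2 + df) ** 0.5

    def report_r(r, df):
        # Report r against the CI.95 and CI.99 around rho = 0.000.
        if abs(r) >= critical_r(df, 0.01):
            return f"r({df}) = {r:.3f}, p < .01 (use the regression equation, and brag)"
        if abs(r) >= critical_r(df, 0.05):
            return f"r({df}) = {r:.3f}, p < .05 (use the regression equation)"
        return f"r({df}) = {r:.3f}, n.s. (predict the mean of Y for everyone)"

    print(report_r(0.300, 28))    # n.s.
    print(report_r(-0.400, 28))   # p < .05
    print(report_r(-0.525, 28))   # p < .01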

The rest of this course is largely about hypothesis (theory) testing. The one and only hypothesis that we will test statistically from this point on is the NULL HYPOTHESIS. As a result of our statistical tests, we will either reject the null or fail to reject the null, based on the data from a random sample.