SIMPLE LINEAR REGRESSION

Last week

Discussed the ideas behind:
  Hypothesis testing
  Random Sampling Error
  Statistical Significance, Alpha, and p-values

Examined Correlation, specifically Pearson’s r:
  What it’s used for, when to use it (and not to use it)
  Statistical Assumptions
  Interpretation of r (direction/magnitude) and p

Tonight

Extend our discussion on correlation into simple linear regression
  Correlation and regression are specifically linked together, conceptually and mathematically
  Often see correlations paired with regression

Regression is nothing but one step past r
  You’ve all done it in high school math

First…brief review…

Quick Review/Quiz

A health researcher plans to determine if there is an association between physical activity and body composition. Specifically, the researcher thinks that people who are more physically active (PA) will have a lower percent body fat (%BF).

Write out a null and alternative hypothesis

PA and %BF

HO: There is no association between PA and %BF

HA: People with ↑ PA will have ↓ %BF

The researcher will use a Pearson correlation to determine this association. He sets alpha ≤ 0.05.

Write out what that means (alpha ≤ 0.05)

Alpha

If the researcher sets alpha ≤ 0.05, this means that he/she will reject the null hypothesis if the p-value of the correlation is equal to or less than 0.05.
  This is the level of confidence/risk the researcher is willing to accept

If the p-value of the test is greater than 0.05, there is a greater than 5% chance that the result could be due to ___________________, rather than a real effect/association

Results

The researcher runs the correlation in SPSS and this is in the output: n = 100, r = -0.75, p = 0.02

1) What is the direction of the correlation? What does this mean?
2) What is the sample size?
3) Describe the magnitude of the association.
4) Is this result statistically significant?
5) Did he/she fail to reject the null hypothesis OR reject the null hypothesis?

Results defined

There is a negative, moderate-to-strong relationship between PA and %BF (r = -0.75, p = 0.02).
  Those with higher levels of physical activity tended to have lower %BF (or vice versa)
  Reject the null hypothesis and accept the alternative

Based on this correlation alone, does PA cause %BF to change? Why or why not?

Error

Assume the association seen here between PA and %BF is REAL (not due to RSE). What type of error is made if the researcher fails to reject the null hypothesis (and accepts HO)?
  Says there is no association when there really is
  Type II Error

Assume the association seen here between PA and %BF is due to RSE (not REAL). What type of error is made if the researcher rejects the null hypothesis (and accepts HA)?
  Says there is an association when there really is not
  Type I Error

                      Our Decision
What is True      Reject HO         Accept HO
HO                Type I Error      Correct
HA                Correct           Type II Error

HA: There is an association between PA and %BF
HO: There is not an association between PA and %BF
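The decision table can also be written as a small lookup. This is only an illustrative sketch; the function name `classify_decision` and its two boolean parameters are hypothetical and not part of the lecture.

```python
def classify_decision(null_is_true: bool, rejected_null: bool) -> str:
    """Return the outcome for one cell of the decision table."""
    if null_is_true and rejected_null:
        return "Type I Error"    # claimed an association that is not real
    if not null_is_true and not rejected_null:
        return "Type II Error"   # missed an association that is real
    return "Correct"


# Walk through all four cells of the table.
for null_is_true in (True, False):
    for rejected_null in (True, False):
        truth = "HO true" if null_is_true else "HA true"
        decision = "reject HO" if rejected_null else "accept HO"
        print(f"{truth}, {decision}: {classify_decision(null_is_true, rejected_null)}")
```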

Questions…?

Back to correlations

Recall, correlations provide two critical pieces of information about a relationship between two variables:
  1) Direction (+ or -)
  2) Strength/Magnitude

However, the correlation coefficient (r) can also be used to describe how well one variable can be used to predict another.
  A frequent goal of statistics
  For example…

Association vs Prediction

Is undergrad GPA associated with grad school GPA? Can grad school GPA be predicted by undergrad GPA?

Are skinfolds measurements associated with %BF? Can %BF be predicted by skinfolds?

Is muscular strength associated with injury risk? Can muscular strength be predictive of injury risk?

Is event attendance associated with ticket price? Can event attendance be predicted by ticket price? (i.e., what ticket price will maximize profits?)

Correlation and Prediction

This idea should seem reasonable. Look at the three correlations below. In which of the three do you think it would be easiest (most accurate) to predict the y variable from the x variable?

[Figure: three scatterplots labeled A, B, and C]

Correlation and Prediction

The stronger the relationship between two variables, the more accurately you can use information from one of those variables to predict the other

Which do you think you could predict more accurately?

Bench press repetitions from body weight ?

Or

40-yard dash from 10-yard dash?

Explained Variance

The stronger the relationship between two variables, the more accurately you can use information from one of those variables to predict the other

This concept is “explained variance” or “variance accounted for”
  Variance = the spread of the data around the center
  Why the values are different for everyone
  Calculated by squaring the correlation coefficient, r²

Above correlation: r = 0.624 and r² = 0.389
  aka, Coefficient of Determination
  What percentage of the variability in y is explained by x
  The 10-yard dash explains 39% of the variance in the 40-yard dash
  If we could explain 100% of the variance, we’d be able to make a perfect prediction

Coefficient of Determination, r²

What percentage of the variability in y is explained by x
  The 10-yard dash explains 39% of the variance in the 40-yard dash
  So about 61% (100% - 39% = 61%) of the variance remains unexplained (is due to other things)
  The more variance you can explain, the better the prediction
  The less variance that is explained, the more error in the prediction

Examples, notice how quickly the prediction degrades:
  r = 1.00; r² = 100%
  r = 0.87; r² = 75%
  r = 0.71; r² = 50%
  r = 0.50; r² = 25%
  r = 0.22; r² = 5%
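The arithmetic behind these examples is just squaring r. A minimal sketch, using the r values quoted on the slides; the helper name `variance_explained` is hypothetical. The slide percentages are rounded, so the printed values differ slightly in the last digit.

```python
def variance_explained(r: float) -> float:
    """Percentage of variance in y accounted for by x, given Pearson's r."""
    return r ** 2 * 100


# r values quoted on the slides, including the dash example (r = 0.624).
for r in [1.00, 0.87, 0.71, 0.624, 0.50, 0.22]:
    explained = variance_explained(r)
    print(f"r = {r:.3f}: {explained:.1f}% explained, {100 - explained:.1f}% unexplained")
```

Note how a correlation of 0.50, which sounds moderate, leaves three quarters of the variance unexplained.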

Example with BP…

Average systolic blood pressure in the United States

Note mean – and variation (variance) in the values

Mean = 119 mmHg

SD = 20

N = 22,270

Variance: BP

Why are these values so spread out?

What things influence blood pressure?
  Age
  Gender
  Physical Activity
  Diet
  Stress

Which of these variables do you think is most important? Least important?

If we could measure all of these, could we perfectly predict blood pressure?

Correlating each variable with BP would allow us to answer these questions using r²

Beyond r²

Obviously you want to have an estimate of how well a prediction might work, but r² does not tell you how to make that prediction
  For that we use some form of regression

Regression is a generic term (like correlation)
  There are several different methods to create a prediction equation:
    Simple Linear Regression
    Multiple Linear Regression
    Logistic Regression (pregnancy test)
    and many more…

Example using Height to predict Weight

[Figure: scatterplot of Weight (80 to 170 lbs) against Height (55 to 75 inches); r = 0.81]

Note the correlation coefficient above (r = 0.81, so r² = 0.65)

SPSS is going to do all the work. It will use a process called: Least Squares Estimation

Let’s start with a scatterplot between the two variables…

[Figure: scatterplot of Weight against Height with a candidate line drawn through the points; r = .81]

Least squares estimation: Fancy process where, conceptually, SPSS tries every possible line through the points until finding the line where the squared vertical deviations from that line add up to the smallest total

The green line indicates a possible line, the blue arrows indicate the deviations – longer arrows = bigger deviations

This is a crappy attempt – it will keep trying new lines until it finds the best one

[Figure: scatterplot of Weight against Height with the line of best fit; r = .81]

Eventually, SPSS will get it right, finding the line that minimizes deviations, known as the Line of Best Fit
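In practice the best line does not have to be found by trial and error. A minimal sketch, assuming the standard closed-form least squares formulas for slope and intercept; the height and weight values are invented for illustration and are not the lecture's dataset, and `least_squares_line` and `sum_squared_deviations` are hypothetical helper names.

```python
from statistics import mean


def least_squares_line(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared vertical deviations."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def sum_squared_deviations(xs, ys, intercept, slope):
    """Total squared vertical distance between the points and a given line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical heights (in) and weights (lb), invented for illustration only.
heights = [60, 62, 65, 68, 70, 72]
weights = [100, 112, 118, 140, 139, 160]

a, b = least_squares_line(heights, weights)
best = sum_squared_deviations(heights, weights, a, b)
# Any other line, here one with a slightly steeper slope, fits worse.
worse = sum_squared_deviations(heights, weights, a, b + 0.5)

print(f"intercept = {a:.2f}, slope = {b:.2f}")
print(f"best line beats the altered line: {best < worse}")
```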

[Figure: scatterplot of Weight against Height with the line of best fit, annotated with its slope and intercept; r = .81]

The Line of Best Fit is the end-product of regression

This line will have a certain slope… (SLOPE)
  Up so many units
  In so many others

And it will have a value on the y-axis for the zero value of the x-axis (INTERCEPT): -234

The intercept can be seen more clearly if we redraw the graph with appropriate axes…

[Figure: Weight (-300 to 200 lbs) against Height (0 to 80 inches); the line of best fit crosses the y-axis at -234 lbs]

The intercept will sometimes be a nonsense value – in this case, nobody is 0 inches tall or weighs -234 lbs.

[Figure: scatterplot of Weight against Height with the line of best fit; Slope = 5.4; a height of 68 inches maps to a weight of 135 lbs; r = .81]

From the line (its equation), we can predict that an increase in height of 1 inch predicts a rise in weight of 5.4 lbs

We can now estimate weight from height. A person that’s 68 inches tall should weigh about 135 lbs.

SPSS will output the equation, among a number of other items if you ask for them

SPSS output:

Coefficients (a)

                            Unstandardized Coefficients    Standardized Coefficients
Model                       B            Std. Error        Beta                         t         Sig.
1   (Constant)              -234.681     71.552                                         -3.280    .005
    Height (in inches)      5.434        1.067             .806                         5.092     .000

a. Dependent Variable: Weight (in pounds)

The B coefficient (unstandardized) for Height is the SLOPE of the line
The (Constant) is the INTERCEPT of the line
The p-value is still here. In this case, height is a statistically significant predictor of weight (association likely NOT due to RSE)

We can use those two values to write out the equation for our line

Depending on your high school math teacher:

Y = a + bX    or    Y = b + mX

Weight = -234 + 5.434 (Height)
  INTERCEPT = -234
  SLOPE = 5.434
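To see the equation at work, here is a short sketch that plugs a height into it. The coefficients are copied from the SPSS output for this example; the function name `predict_weight` is hypothetical.

```python
# Coefficients copied from the SPSS output for the height/weight example.
INTERCEPT = -234.681   # (Constant)
SLOPE = 5.434          # B for Height (in inches)


def predict_weight(height_in: float) -> float:
    """Predicted weight in pounds from height in inches: Y = a + bX."""
    return INTERCEPT + SLOPE * height_in


print(f"68 inches -> {predict_weight(68):.0f} lbs")
# One extra inch of height adds SLOPE pounds to the prediction.
print(f"69 vs 68 inches -> {predict_weight(69) - predict_weight(68):.3f} lbs more")
```

Plugging in a height of 0 returns the intercept itself, which is why the intercept can be a nonsense value when 0 is far outside the observed data.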

Model Fit?

Once you create your regression equation, this equation is called the ‘model’
  i.e., we just modeled (created a model for) the association between height and weight

How good is the model? How well do the data fit?
  Can use r² for a general comparison
    How well one variable can predict the other
    Lower r² means less variance accounted for, more error
    Our r = 0.81 for height/weight, so r² = 0.65
  We can also use Standard Error of the Estimate

How good, generally, is the fit?

Standard error of the estimate (SEE)
  Imagine we used our prediction equation to predict weight for each subject in our dataset (X to predict Y)
  Will our equation perfectly estimate each Y from X?
    Unless r² = 1.0, there will be some error between the real Y and the predicted Y

The SEE is the standard deviation of those differences
  The standard deviation of actual Y’s about predicted Y’s
  Estimates typical size of the error in predicting Y (sort of)
  Critically related to r², but SEE is more specific to your equation

[Figure: scatterplot of Weight against Height with the line of best fit; individual points labeled Large Error, Small Error, and Very Small Error; r = .81]

Let’s go back to our line of best fit (this line represents the predicted value of Y for each X):

Notice some real Y’s are closer to the line than others

SEE = The standard deviation of actual Y’s about predicted Y’s

SEE is the standard deviation of all these errors

SEE

Why calculate the ‘standard deviation’ of these errors instead of just calculating the ‘average error’?
  By using standard deviation instead of the mean, we can describe what percentage of estimates are within 1 SEE of the line
  In other words, if we used this prediction equation and the errors are roughly normally distributed, we would expect that
    About 68% fall within 1 SEE
    About 95% fall within 2 SEE
    About 99.7% fall within 3 SEE

Knowing, “How often is this accurate?” is probably more important than asking, “What’s the average error?”

Of course, how large the SEE is depends on your r² and your sample size (larger samples make more accurate predictions)
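A minimal sketch of how the SEE can be computed by hand, assuming the usual textbook definition for simple linear regression: the square root of the summed squared errors divided by n - 2. The data are invented for illustration and the function names are hypothetical.

```python
from math import sqrt
from statistics import mean


def fit_line(xs, ys):
    """Least squares (intercept, slope) for one predictor."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - slope * x_bar, slope


def standard_error_of_estimate(xs, ys):
    """Spread of the actual Y's about the predicted Y's (n - 2 in the denominator)."""
    intercept, slope = fit_line(xs, ys)
    errors = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    return sqrt(sum(e ** 2 for e in errors) / (len(xs) - 2))


# Hypothetical heights (in) and weights (lb), invented for illustration only.
heights = [60, 62, 65, 68, 70, 72]
weights = [100, 112, 118, 140, 139, 160]

see = standard_error_of_estimate(heights, weights)
print(f"SEE = {see:.2f} lbs")
print(f"about 68% of predictions expected within +/- {see:.2f} lbs")
print(f"about 95% of predictions expected within +/- {2 * see:.2f} lbs")
```

If every point sat exactly on the line, the SEE would be 0.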

[Figure: scatterplot of Weight against Height with the line of best fit; individual points labeled Large Residual, Small Residual, and Very Small Residual; r = .81]

Let’s go back to our line of best fit:

In regression, we call these errors/deviations “residuals”

Residual Y = Real Y – Predicted Y

Notice that some of the residuals are - and some are +, where we over-estimated (-) or under-estimated (+) weight

SEE is the standard deviation of the residuals

Residuals

The line of best fit is a line where the residuals are minimized (least error)
  The residuals will sum to 0
  The mean of the residuals will also be 0
  The Line of Best Fit is the ‘balance point’ of the scatterplot
  The standard deviation of the residuals is the SEE

Recognize this concept/terminology: if there is a residual, that means the effect of other variables is creating error
  Confounding variables create residuals
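The ‘balance point’ property can be checked numerically. A small sketch, again with invented data and the standard least squares formulas; none of the numbers come from the lecture's dataset.

```python
from statistics import mean

# Hypothetical heights (in) and weights (lb), invented for illustration only.
heights = [60, 62, 65, 68, 70, 72]
weights = [100, 112, 118, 140, 139, 160]

# Least squares slope and intercept for one predictor.
x_bar, y_bar = mean(heights), mean(weights)
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(heights, weights))
         / sum((x - x_bar) ** 2 for x in heights))
intercept = y_bar - slope * x_bar

# Residual = Real - Predicted, so a positive residual means we under-estimated.
residuals = [y - (intercept + slope * x) for x, y in zip(heights, weights)]
for x, y, e in zip(heights, weights, residuals):
    label = "under-estimated" if e > 0 else "over-estimated"
    print(f"height {x}: real {y}, residual {e:+.2f} ({label})")

# Up to floating-point rounding, the positive and negative residuals cancel.
print(f"sum of residuals = {sum(residuals):.10f}")
```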

QUESTIONS…?

Statistical Assumptions of Simple Linear Regression

See last week’s notes on assumptions of correlation…
  Variables are normally distributed
  Homoscedasticity of variance
  Sample is representative of population
  Relationship is linear (remember, y = a + bX)
  The variables are ratio/interval (continuous)
    Can’t use nominal or ordinal variables
    …at least pretend for now, we’ll break this one next week.

Simple Linear Regression: Example

Let’s start simple, with two variables we know to be very highly correlated
  40-yard dash and 20-yard dash

Can we predict 40-yard dash from 20-yard dash?

SLR

Trimmed dataset down to just two variables

Let’s look at a scatterplot first

All my assumptions are good, should be able to produce a decent prediction

Next step, correlation

Correlation

Strength? Direction?
  Statistically significant correlations will (usually) produce statistically significant predictors
  r² = ?? (0.66)

Now, run the regression in SPSS

SPSS

The ‘predictor’ is the independent variable

Model Outputs

Adjusted r² = Adjusts the r² value based on sample size… small samples tend to overestimate the ability to predict the DV with the IV (our sample is 428, adjusted is similar)
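The size of this adjustment can be sketched with the standard adjusted r² formula, 1 - (1 - r²)(n - 1)/(n - k - 1), where k is the number of predictors. The function name is hypothetical, and using r² = 0.66 with n = 428 assumes those are the values behind this output; the n = 10 case is invented purely for contrast.

```python
def adjusted_r_squared(r2: float, n: int, k: int = 1) -> float:
    """Adjusted r^2 for n cases and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Large sample (n = 428, as stated on the slide): barely any shrinkage.
print(f"n = 428: {adjusted_r_squared(0.66, 428):.4f}")
# Hypothetical small sample with the same r^2: noticeably more shrinkage.
print(f"n = 10:  {adjusted_r_squared(0.66, 10):.4f}")
```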

Model Outputs

Notice our SEE of 0.06 seconds.
  68% of residuals are within 0.06 seconds of predicted
  95% of residuals are within 0.12 seconds of predicted

Model Outputs

The ‘ANOVA’ portion of the output tells you if the entire model is statistically significant. However, since our model just includes one variable (20-yard dash), the p-value here will match the one to follow

Outputs

Y-intercept = 1.259
Slope = 1.245
20-yard dash is a statistically significant predictor
What is our equation to predict 40-yard dash?

Equation

40-yard dash time = 1.245(20-yard time) + 1.259

If a player ran the 20-yard dash in 2.5 seconds, what is their estimated 40-yard dash time?
  1.245(2.5) + 1.259 = 4.37 seconds

If the player actually ran 4.53 seconds, what is the residual?
  Residual = Real – Predicted
  4.53 – 4.37 = 0.16
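The same worked example as a short sketch. The intercept, slope, and SEE are the values reported in the SPSS output above; the function and variable names are hypothetical.

```python
# Values reported in the SPSS output for the dash example.
INTERCEPT = 1.259   # seconds
SLOPE = 1.245
SEE = 0.06          # seconds


def predict_40_yard(time_20_yard: float) -> float:
    """Predicted 40-yard dash time (s) from 20-yard dash time (s)."""
    return SLOPE * time_20_yard + INTERCEPT


predicted = predict_40_yard(2.5)
actual = 4.53
residual = actual - predicted            # Residual = Real - Predicted
within_two_see = abs(residual) <= 2 * SEE

print(f"predicted = {predicted:.2f} s")
print(f"residual  = {residual:.2f} s")
print(f"within 2 SEE of the prediction: {within_two_see}")
```

The positive residual means the equation under-estimated this player's time, and by more than 2 SEE (0.12 seconds), so this runner would be among the roughly 5% of cases the model predicts poorly.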

Significance vs. Importance in Regression

A statistically significant model/variable does NOT mean the equation is good at predicting

The p-value tells you if the independent variable (predictor) can be used as a predictor of the dependent variable

The r² tells you how good the independent variable might be as a predictor (variance accounted for)

The SEE tells you how good the predictor (model) is at predicting

QUESTIONS…?

Upcoming…

In-class activity…

Homework:
  Cronk Section 5.3
  Holcomb Exercises 29, 44, 46 and 33

Multiple Linear Regression next week