General approach for most statistical modeling: Lecture 10 ...eckel/biostat2/notes/notes10.pdf ·...

11
Lecture 10: Linear Regression Approach, Assumptions and Diagnostics Sandy Eckel [email protected] 8 May 2008 2 Approach to Modeling I General approach for most statistical modeling: Define the population of interest State the scientific questions & underlying theories Describe and explore the observed data Define the model Probability part (models the randomness / noise) Systematic part (models the expectation / signal) 3 Approach to Modeling II Estimate the parameters in the model Fit the model to the observed data Make inferences about covariates Check the validity of the model Verify the model assumptions Re-define, re-fit, and re-check the model if necessary Interpret the results of the analysis in terms of the scientific questions of interest 4 Model Validity Checks Check that the assumptions of the model hold Plots Residual Checking Global Model Checks (adjusted R 2 , AIC, BIC)

Transcript of General approach for most statistical modeling: Lecture 10 ...eckel/biostat2/notes/notes10.pdf ·...

Lecture 10: Linear Regression Approach, Assumptions and Diagnostics

Sandy [email protected]

8 May 2008

2

Approach to Modeling I

General approach for most statistical modeling:

� Define the population of interest

� State the scientific questions & underlying theories

� Describe and explore the observed data

� Define the model� Probability part (models the randomness / noise)

� Systematic part (models the expectation / signal)

3

Approach to Modeling II

� Estimate the parameters in the model� Fit the model to the observed data

� Make inferences about covariates

� Check the validity of the model� Verify the model assumptions

� Re-define, re-fit, and re-check the model if necessary

� Interpret the results of the analysis in terms of the scientific questions of interest

4

Model Validity Checks

� Check that the assumptions of the model hold

� Plots

� Residual Checking

� Global Model Checks (adjusted R2, AIC, BIC)

5

Linear Regression ASSUMPTIONS

Model Systematic:

yi = β0 + β1xi + εi

Probability:

εi ~ N(0,σ2)

Assumptions

� L Linear relationship

� I Independent observations

� N Normally distributed around line

� E Equal variance across X’s6

How do you check the assumptions?In general…plot your data!

� Simply plotting the data can be one of the most powerful model checking techniques

� From a simple plot of Y on X that includes the fitted regression line, we can check:

� linearity, normality, equal (constant) variance, outliers, etc.

7

LINELinear relationship

� Is the model correct?� Is this the right line?

� Are there outliers for which the model may be wrong?

� Assess with graphs� 1 continuous X:

� graph Y vs. X with line

� residual plot

� 2+ continuous X’s:

� adjusted variable plot

8

LINE Independent observations

� Are all the people surveyed independent from one another?

� Cannot be assessed graphically or statistically

� Must know how the data were collected

9

LINE Normally distributed errors

� At every value of X, the observed points should follow a roughly normal distribution centered at the fitted value of Y.

� Assess with residual plots

10

LINE Equal variance

� At every value of X, the observed points should follow a roughly normal distribution with the same variance across all X’s

� Assess with residual plots

11

Model ChecksPlot of y versus x with regression line

Non-Linear

Y

X-2 0 2

-6

-4

-2

0

Outlier

Y

X0 10 20 30

-5

0

5

10

15

Non-Normal

Y

X-2 -1 0 1 2

-2

0

2

4

6

Non-Constant Variance

Y

X.8 1 1.2

-5

0

5

10

15

4 types of assumption violations

12

Model ChecksResidual Analysis: standardized residuals

� The residuals, εi , are the differences between the observed values, yi, and their fitted values:

� Since our model states: εi ~ N(0,σ2)

� We know that the standardized residuals,

should follow a standard normal distribution

iii yy ˆ−=ε

MSEi =− 2

2ˆ where

ˆ

σ

ε

13

Model ChecksResidual Analysis

If the model fits the data well, we expect:

� A histogram of the standardized residuals should look normal. � Check for asymmetry and outliers.

� A plot of the residuals vs. X should look like a random scatter (no systematic relationship)

� A plot of the residuals vs. ŷi (the fitted values) should also look like a random scatter.

14

Model ChecksResidual Analysis: Example plots

Health Status and Pollution Levels

He

alth

Sta

tus

Mea

su

re

Pollution Level40 60 80 100

60

80

100

120

Histogram of Standardized ResidualsNormal Curve Overlay

Standardized Residuals-2 0 2

0

.1

.2

.3

.4

Standardized Residuals vs. X

Sta

nd

ard

ized

Res

idu

als

Pollution Level40 60 80 100

-2

0

2

Standardized Residuals vs. Yhat

Sta

nd

ard

ized

Res

idu

als

Predicted Health Measure60 70 80 90 100

-2

0

2

Example: Relationship between health status and pollution in 20 geographic areas

15

Model ChecksResidual Analysis: Example conclusions

� Regression scatterplot looks good

� Standard Residuals appear fairly normally distributed

� Standard Residuals vs X appear randomly scattered (i.e. no apparent patterns & no extreme outliers)

� Standard Residuals vs predicted values appear randomly scattered (i.e. no apparent patterns & no extreme outliers)

16

Residual Analysis: Nonlinear Example plots

Non-Linear

Y

X-2 0 2

-6

-4

-2

0

Histogram of Standardized ResidualsNormal Curve Overlay

Standardized Residuals-3 -2 -1 1.1e-15 1

0

.1

.2

.3

Standardized Residuals vs. X

Sta

nd

ard

ized

Re

sid

ua

ls

x110-1-2-3

-3

-2

-1

1.1e-15

1

Standardized Residuals vs. Yhat

Sta

nd

ard

ized

Re

sid

ua

ls

Fitted values-1.5 -1.4 -1.3 -1.2 -1.1

-3

-2

-1

1.1e-15

1

17

Residual Analysis: Nonlinear Example conclusions

� Regression scatterplot shows non-linear relationship

� Standard Residuals don’t look normally distributed

� Standard Residuals vs X shows non-linear relationship

� Standard Residuals vs predicted values shows non-linear relationship

18

Residual Analysis: Outlier Example plots

Outlier

Y

X0 10 20 30

-5

0

5

10

15

Histogram of Standardized ResidualsNormal Curve Overlay

Standardized Residuals-4 -2 0 2

0

.1

.2

.3

.4

Standardized Residuals vs. X

std

res

2

x20 10 20 30

-4

-2

0

2

Standardized Residuals vs. Yhat

std

res

2

Fitted values0 5 10 15

-4

-2

0

2

19

Residual Analysis: Outlier Example conclusions

� Regression scatterplot shows outlier

� Standard Residuals look normal but ‘large’ residual present

� Standard Residuals vs X shows a pattern & the outlier

� Standard Residuals vs Y shows a pattern & the outlier

20

Residual Analysis: Non-Normal Example plots

Non-Normal

Y

X-2 -1 0 1 2

-2

0

2

4

6

Histogram of Standardized ResidualsNormal Curve Overlay

Standardized Residuals0 2 4 6

0

.5

1

Standardized Residuals vs. X

std

res3

x3-2 -1 0 1 2

0

2

4

6

Standardized Residuals vs. Yhat

std

res3

Fitted values-2 0 2 4 6

0

2

4

6

21

Residual Analysis: Non-Normal Example conclusions

� Regression scatterplot shows non-even spread

� Standard Residuals don’t look normally distributed

� Standard Residuals vs X shows non-even spread

� Standard Residuals vs predicted values shows non-even spread

22

Residual Analysis: Increasing Variance Example plots

Non-Constant Variance

Y

X.8 1 1.2

-5

0

5

10

15

Histogram of Standardized ResidualsNormal Curve Overlay

Standardized Residuals-4 -2 0 2 4

0

.2

.4

.6

.8

Standardized Residuals vs. X

std

res

4

x4.8 1 1.2

-4

-2

0

2

4

Standardized Residuals vs. Yhat

std

res

4

Fitted values4.8 4.9 5 5.1

-4

-2

0

2

4

23

Residual Analysis: Increasing Variance Example conclusions

� Regression scatterplot shows increasing variability

� Standard Residuals do look normally distributed

� Standard Residuals vs X shows increasing variability

� Standard Residuals vs predicted values shows increasing variability

24

Example Wage Data

01

02

03

04

05

0

0 20 40 60Years of Experience

Ho

urly W

age

ii

ii

Y

Y

Experience04.038.8ˆ

Experienceβ̂β̂ˆ10

+=⇒

+=

25

PlotResiduals vs. continuous X

� Used to assess remaining relationships within data� assumption of “linearity”

� Line has been “flattened”

� Residuals (or error terms) are centered at 0: horizontal line shows “flattened”regression line

26

PlotResiduals vs. continuous X example

-10

01

02

030

40

Resid

uals

0 20 40 60Years of Experience

27

Another approach Plot standardized residuals vs. X

� With the actual residuals, it’s hard to tell which points are extreme

� Standardized residuals are

� |sres|>2 about 5%

� |sres|>3 about 1%

i

residuals

ii Z

SD==

residual ResidualedStandardiz

28

Wage examplePlot standardized residuals vs. X

-20

24

68

Sta

nd

ard

ized

resid

ua

ls

0 20 40 60Years of Experience

Residuals > 2 standard deviations from 0 should happen only 5% of the time

29

Standardized residuals vs. fitted values

� Plotting against X is fine when there’s only one continuous X in model

� When multiple continuous X’s are in model

� plotting residuals against fitted values is like plotting against all the X’s at once

� if problems are seen, one can plot residuals against each X to see which causes problem

30

Standardized residuals vs. fitted values

-20

24

68

Sta

nd

ard

ized

resid

ua

ls

8.5 9 9.5 10 10.5Fitted values

31

Summary: Residual plots

� Y-axis� residuals

� standardized residuals

� standardized residuals are Z values, so extreme observations are obvious

� X-axis� continuous X

� fitted values

� fitted values are a linear combination of X’s

32

Is there a “linear relationship”

� Does the wage-experience model appear to fit well?

� There seems to be a residual pattern for people with very little experience� could add a spline to allow for non-linearity

� There is one outlier� should check data entry and information for

that person

33

Outliers

� Outliers far from the pattern of the rest of the X’s may affect the line

� the regression line always goes through (mean X, mean Y)

� an outlier near the mean X will not influence the line very much

� an outlier far from the mean X can draw the line towards itself

34

Are the observations independent?

� This cannot be assessed graphically.

� The data were collected by randomly sampling workers, so independence can be assumed

35

Are the data normally distributed around the line?

-10

01

02

03

04

0R

esid

uals

0 20 40 60Years of Experience

Goal within each slice of X:

Normal curve centered at 036

Non-normal residuals

� Parameters estimates are still correct, but CI’s are misleading

� Including additional predictors sometimes solves this problem

� Another solution is to transform Y� ln(Y) or sqrt(Y) draws in data skewed to high

values� 1/Y or 1/sqrt(Y) draws in data skewed to low

values� use transformed Y instead of original Y� interpret parameters according to transformed Y!

37

Are the data normally distributed with equal variance?

-10

01

02

03

04

0R

esid

uals

0 20 40 60Years of ExperienceGoal within each slice of X:

Normal curve with equal variance

38

What if the variance is unequal?

� Again, parameter estimates are valid, but CI’s are misleading

� Adding additional parameters may solve the problem

39

Assumptions:From SLR to MLR

� In simple linear regression (SLR):� One Predictor / Covariate: X

� In multiple linear regression (MLR):� Same Assumptions as SLR, (i.e. L.I.N.E.), but:

� More than one Covariate: X1, X2, X3, …, Xp

Model:� Y ~ N(µ, σ2)

� µ = E(Y | X) = β0 + β1X1 + β2X2 + β3X3 +... βpXp

40

Example

E(wage | experience, age, gender, education)

= E(Y | X) = β0 + β1X1 + β2X2 + β3X3 + β4X4

where: Y = wage

X1 = experience (in years)

X2 = age

X3 = gender

X4 = education

41

Assessing assumptions with multiple continuous X’s

� Residual plot is usually vs. fitted values� if problems, then plot against each X

� Can’t assess model by plotting Y vs. X� plot shows unadjusted relationship

� model shows relationship adjusted for other X’s

� Use AV plot instead � shows the relationship between Y and X after

adjusting for the other X’s

42

Lecture 10 Summary

� Linear regression assumptions

� L assess with Y vs. X (AV plot)

� I cannot assess

� N assess with Y vs. X (AVplot), residual plot

� E assess with Y vs. X (AVplot), residual plot

� Check assumptions for final model

� may have to continue modeling based on results