CSC323 – Week 3: Regression line. Residual analysis and diagnostics for linear regression.


Transcript of CSC323 – Week 3 Regression line Residual analysis and diagnostics for linear regression.

Page 1:

CSC323 – Week 3

• Regression line

• Residual analysis and diagnostics for linear regression

Page 2:

Regression line – Fitting a line to data

If the scatter plot shows a clear linear pattern, a straight line through the points can describe the overall pattern.

Fitting a line means drawing a line that is as close as possible to the points: the “best” straight line is the regression line.

[Scatter plot: birth rate (per 1,000 pop.) vs log G.N.P.]

Page 3:

Prediction errors

For a given x, use the regression line to predict the response y.

The accuracy of the prediction depends on how spread out the observations are around the line.

[Figure: scatter plot with the regression line; for a given x, the observed value y, the predicted value ŷ on the line, and the prediction error y − ŷ shown as the vertical gap.]

Page 4:

Simple Example: Productivity level

To see how productivity was related to the level of maintenance, a firm randomly selected 5 of its high-speed machines for an experiment. Each machine was randomly assigned a different level of maintenance X and then had its average number of stoppages Y recorded.

[Scatter plot: number of interruptions Y vs hours of maintenance X.]

Hours X | Average interr. Y
4 | 1.6
6 | 1.2
8 | 1.1
10 | 0.5
12 | 0.6

Ave(x) = 8, s(x) = 3.16; Ave(y) = 1, s(y) = 0.45; r = −0.94

Page 5:

Least squares regression line

Definition: The regression line of y on x is the line that makes the sum of the squares of the vertical distances (deviations) of the data points from the line as small as possible.

It is defined as ŷ = a + b·x, where

b = r · s.d.(y) / s.d.(x)
a = ave(y) − b · ave(x)

We use ŷ to distinguish the values predicted from the regression line from the observed values y.

Note: b has the same sign as r.
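The two formulas above can be sketched in a few lines of Python (this code is not part of the slides; it simply re-derives b and a for the maintenance example that follows):

```python
from statistics import mean, stdev

def regression_line(x, y):
    """Return (a, b) for the least-squares line y-hat = a + b*x,
    using b = r * s.d.(y)/s.d.(x) and a = ave(y) - b * ave(x)."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sx, sy = stdev(x), stdev(y)
    # Pearson correlation coefficient r
    r = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / ((n - 1) * sx * sy)
    b = r * sy / sx
    a = my - b * mx
    return a, b

# Maintenance example from the slides: 5 machines
hours = [4, 6, 8, 10, 12]
interr = [1.6, 1.2, 1.1, 0.5, 0.6]
a, b = regression_line(hours, interr)
print(round(a, 2), round(b, 3))  # 2.08 -0.135
```

The result matches the line ŷ = 2.08 − 0.135·x derived on the next slide.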

Page 6:

Example: cont.

The regression line of the number of interruptions on the hours of maintenance per week is calculated from the descriptive statistics for x and y: Ave(x) = 8, s(x) = 3.16; Ave(y) = 1, s(y) = 0.45; r = −0.94.

Slope: b = r · s(y)/s(x) = −0.94 · (0.45/3.16) = −0.135

Intercept: a = ave(y) − b·ave(x) = 1 − (−0.135)·8 = 2.08

Regression line: ŷ = 2.08 − 0.135·hours

Page 7:

Regression line: ŷ = 2.08 − 0.135·hours

To draw a line: find two points that satisfy the regression equation and connect them! Point of averages: (8, 1). Point on the line: (6, 1.27), found by plugging x = 6 into the regression equation, so that ŷ = 2.08 − 0.135·6 = 1.27.

[Scatter plot: number of interruptions vs hours of maintenance, r = −0.94, with the regression line through the point of averages; a residual is shown as a vertical gap.]

Page 8:

Example: CPU Usage

A study was conducted to examine what factors affect the CPU usage.

A set of 38 processes written in a programming language was considered. For each program, data were collected on the CPU usage (time), in seconds, and the number of lines (line), in thousands, generated by the program execution.

[Scatter plot: CPU usage vs number of lines.]

The scatter plot shows a clear positive association.

We’ll fit a regression line to model the association!

Page 9:

Variable | N | Mean | Std Dev | Sum | Minimum | Maximum
Y time | 38 | 0.15710 | 0.13129 | 5.96980 | 0.01960 | 0.46780
X line | 38 | 3.16195 | 3.96094 | 120.15400 | 0.10200 | 14.87200

Pearson correlation coefficient = 0.89802

The regression line is ŷ = 0.063 + 0.0297x.
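As a quick sanity check, the slide's coefficients can be reproduced from the summary statistics alone, using b = r·s(y)/s(x) and a = ave(y) − b·ave(x). A minimal Python sketch (the numbers come from the table above):

```python
# Summary statistics for the CPU-usage data (from the slide)
r = 0.89802
mean_y, sd_y = 0.15710, 0.13129   # time, in seconds
mean_x, sd_x = 3.16195, 3.96094   # lines, in thousands

b = r * sd_y / sd_x               # slope
a = mean_y - b * mean_x           # intercept
print(round(a, 3), round(b, 4))
```

The result agrees with the reported line ŷ = 0.063 + 0.0297x up to rounding.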

Page 10:

Goodness of fit measures

1. Coefficient of determination

R² = (correlation coefficient)²

R² describes how well the regression line explains the response y: it is the fraction of the variation in the values of y that is explained by the regression line of y on x. It varies between 0 and 1. If R² is close to 1, the regression line provides a good explanation of the data; if it is close to zero, the regression line is not able to capture the variability in the data.

EXAMPLE (cont.): The correlation coefficient is r = −0.94, so R² = (−0.94)² = 0.883.

The regression line is able to capture 88.3% of the variability in the data.
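Since R² is simply the square of the correlation coefficient, the example's figure is a one-line check (illustrative):

```python
r = -0.94
r_squared = r ** 2  # R^2 = (correlation coefficient)^2
# about 0.88: the line explains roughly 88% of the variation in y
```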

Page 11:

2. Residuals

The vertical distances between the observed points and the regression line can be regarded as the "left-over" variation in the response after fitting the regression line. A prediction error in statistics is called a residual.

A residual is the difference between an observed value of the response variable y and the value predicted by the regression line:

Residual e = observed y − predicted ŷ = y − ŷ

A special property: the average of the residuals is always zero.

Page 12:


EXAMPLE: Residuals for the regression line ŷ = 2.08 − 0.135x for the number of interruptions Y on the hours of maintenance X.

Hours X | Average interr. Y | Predicted interr. ŷ | Residual y − ŷ
4 | 1.6 | 2.08 − 0.135·4 = 1.54 | 1.6 − 1.54 = 0.06
6 | 1.2 | 2.08 − 0.135·6 = 1.27 | 1.2 − 1.27 = −0.07
8 | 1.1 | 2.08 − 0.135·8 = 1.00 | 1.1 − 1.00 = 0.10
10 | 0.5 | 2.08 − 0.135·10 = 0.73 | 0.5 − 0.73 = −0.23
12 | 0.6 | 2.08 − 0.135·12 = 0.46 | 0.6 − 0.46 = 0.14

Average = 0
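The residual table above can be reproduced directly, and the special property (the residuals average to zero) can be verified. A sketch in Python, using the slide's fitted line:

```python
hours = [4, 6, 8, 10, 12]
observed = [1.6, 1.2, 1.1, 0.5, 0.6]

# Regression line from the slides: y-hat = 2.08 - 0.135 * x
predicted = [2.08 - 0.135 * x for x in hours]
residuals = [y - yhat for y, yhat in zip(observed, predicted)]

for x, e in zip(hours, residuals):
    print(x, round(e, 2))
print("average residual:", round(sum(residuals) / len(residuals), 10))
```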

Page 13:

3. Accuracy of the predictions

If the cloud of points is football-shaped, the prediction errors are similar along the regression line. One possible measure of the accuracy of the regression predictions is given by the root mean square error (r.m.s. error).

The r.m.s. error is defined as the square root of the average squared residual:

r.m.s. error = √[(resid(#1)² + resid(#2)² + … + resid(#n)²) / (n − 1)]

This is an estimate of the variation of y about the regression line.

Page 14:

Roughly 68% of the points lie within 1 r.m.s. error of the regression line; roughly 95% of the points lie within 2 r.m.s. errors.

Page 15:

Computing the r.m.s. error:

Hours X | Average interr. Y | Predicted interr. | Residual | Squared residual
4 | 1.6 | 1.54 | 0.06 | 0.0036
6 | 1.2 | 1.27 | −0.07 | 0.0049
8 | 1.1 | 1.00 | 0.10 | 0.0100
10 | 0.5 | 0.73 | −0.23 | 0.0529
12 | 0.6 | 0.46 | 0.14 | 0.0196
Total | | | | 0.0910

The r.m.s. error is √(0.0910/4) = 0.151.

If the company schedules 7 hours of maintenance per week, the predicted weekly number of interruptions of the machine will be ŷ = 2.08 − 0.135·7 = 1.135 on average.

Using the r.m.s. error, the number of interruptions will most likely be between 1.135 − 2·0.151 = 0.833 and 1.135 + 2·0.151 = 1.437.
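The r.m.s. error computation and the resulting ±2 r.m.s. prediction range can be sketched as follows (same data and fitted line as above):

```python
from math import sqrt

hours = [4, 6, 8, 10, 12]
observed = [1.6, 1.2, 1.1, 0.5, 0.6]
predict = lambda x: 2.08 - 0.135 * x          # regression line from the slides

residuals = [y - predict(x) for x, y in zip(hours, observed)]
# Square root of the sum of squared residuals over n - 1
rms = sqrt(sum(e * e for e in residuals) / (len(hours) - 1))

# Prediction for 7 hours of maintenance, with a rough 95% range
yhat = predict(7)
low, high = yhat - 2 * rms, yhat + 2 * rms
print(round(rms, 3), round(yhat, 3), round(low, 3), round(high, 3))
```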

Page 16:

Looking at vertical strips

When all the vertical strips in a scatter plot show a similar amount of spread, the diagram is said to be homoscedastic. A football-shaped cloud of points is homoscedastic!

Consider the data on the birth rate and the GNP index in 97 countries.

[Scatter plot: birth rate (per 1,000 pop.) vs log G.N.P., with predicted points marked in corresponding vertical strips.]

Page 17:

In a football-shaped scatter diagram, consider the points in a vertical strip. The value ŷ predicted by the regression line can be regarded as the average of their y-values. Their standard deviation is about equal to the r.m.s. error of the regression line.

[Figure: scatter plot of birth rate vs log G.N.P.; within each vertical strip, ŷ is the average of the y-values and the s.d. is roughly the r.m.s. error.]

Page 18:

Computing the r.m.s. error

In large data sets, the r.m.s. error is approximately equal to

r.m.s. error ≈ √(1 − r²) · s.d.(y)

Consider the example on birth rate & GNP index:

Variable | Average | St. dev.
Birth rate Y | 29.33 | 13.55
Log G.N.P. X | 7.51 | 1.65

r = −0.74

The regression line is ŷ = 74.97 − 6.077x.

For x = 8 the predicted birth rate is ŷ = 74.97 − 6.077·8 = 26.35.

How accurate is this prediction?

Page 19:

The r.m.s. error is √(1 − 0.74²) · 13.55 = 9.11. Thus about 68% of the countries with log GNP = 8 (equal to about 3,000 dollars per capita) have a birth rate between 26.35 − 9.11 = 17.24 and 26.35 + 9.11 = 35.46. Most likely, the countries with log GNP = 8 have a birth rate between 8.13 and 44.57, since 26.35 − 2·9.11 = 8.13 and 26.35 + 2·9.11 = 44.57.

[Scatter plot: birth rate vs log G.N.P.; roughly 95% of the points in the vertical strip at log GNP = 8 lie between 8.13 and 44.57.]
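For large data sets, the shortcut above gives the r.m.s. error from r and s.d.(y) alone. Reproducing the slide's birth-rate numbers in Python (a sketch; all figures are taken from the summary table on the previous page):

```python
from math import sqrt

r, sd_y = -0.74, 13.55                  # birth rate vs log GNP, 97 countries
rms = sqrt(1 - r ** 2) * sd_y           # approximate r.m.s. error

yhat = 74.97 - 6.077 * 8                # predicted birth rate at log GNP = 8
low, high = yhat - 2 * rms, yhat + 2 * rms
print(round(rms, 2), round(yhat, 2), round(low, 2), round(high, 2))
```

The computed interval matches the slide's (8.13, 44.57) up to rounding.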

Page 20:

Detect problems in the regression analysis: the residual plots

The analysis of the residuals is useful to detect possible problems and anomalies in the regression

A residual plot is a scatter plot of the regression residuals against the explanatory variable.

Points should be randomly scattered inside a band centered around the horizontal line at zero (the mean of the residuals).

Page 21:

"Good case" and "bad cases"

[Residual plots against X: in the good case the residuals are randomly scattered about zero; the bad cases show a nonlinear relationship and variation of y changing with x.]

Page 22:

Anomalies in the regression analysis

• If the residual plot displays a curve, a straight line is not a good description of the association between x and y.

• If the residual plot is fan-shaped, the variation of y is not constant. In the fan-shaped case, predictions of y will be less precise as x increases, since y shows higher variability for higher values of x.

Be careful if you use the r.m.s. error!
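One simple numeric symptom of a fan-shaped residual plot is that the residuals spread out more for large x than for small x. The sketch below simulates such data and compares the two spreads; the data and the split point are illustrative, not from the slides:

```python
import random
random.seed(0)

# Simulate y = 2x + noise whose spread grows with x (fan shape)
x = [1 + 9 * i / 199 for i in range(200)]           # 200 points in [1, 10]
y = [2 * xi + random.gauss(0, 0.5 * xi) for xi in x]

# Fit the least-squares line
n = len(x)
mx, my = sum(x) / n, sum(y) / n
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
    sum((xi - mx) ** 2 for xi in x)
a = my - b * mx
resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Spread of residuals in the right half vs the left half of the x-range
def spread(vals):
    m = sum(vals) / len(vals)
    return (sum((v - m) ** 2 for v in vals) / (len(vals) - 1)) ** 0.5

left = spread([e for xi, e in zip(x, resid) if xi <= 5.5])
right = spread([e for xi, e in zip(x, resid) if xi > 5.5])
print(right > left)   # a clearly larger right-hand spread signals a fan shape
```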

Page 23:

Example: CPU usage data – Residual plot

Do you see any striking pattern?

Page 24:

Example: 100-meter dash

At the 1987 World Championship in Rome, Ben Johnson set a new world record in the 100-meter dash.

The data: Y = the elapsed time from the start of the race for Ben Johnson, recorded in 10-meter increments; X = meters.

Variable | Average | St. dev.
Meters X | 55 | 30.27
Johnson Y | 5.83 | 2.52

Correlation = 0.999

[Scatter plot of Johnson's elapsed time vs meters.]

Page 25:

Regression line

The fitted regression line is ŷ = 1.11 + 0.09·meters.

The value of R² is 0.999; therefore 99.9% of the variability in the data is explained by the regression line.

[Scatter plot of elapsed time vs meters with the fitted regression line.]

Page 26:

Residual plot

[Residual plot vs meters.]

Does the graph show any anomaly?

Page 27:

Confounding factor

A confounding factor is a variable that has an important effect on the relationship among the variables in a study but is not included in the study.

Example: The mathematics department of a large university must plan the timetable for the following year. Data are collected on the enrollment year, the number x of first-year students and the number y of students enrolled in elementary math courses.

The fitted regression line has equation ŷ = 2491.69 + 1.0663x, with R² = 0.694.

Page 28:

Residual Analysis

Do the residuals have a random pattern?

Page 29:

Scatter plot of residuals vs year

The plot of the residuals against the year suggests that a change took place between 1994 and 1995. This caused a higher number of students to take math courses (one school changed its curriculum).

[Residual plot vs enrollment year, 1990–1997.]

Page 30:

Outliers and influential points

An outlier is an observation that lies outside the overall pattern of the other observations.

[Scatter plot: an outlier with a large residual lies away from the overall pattern.]

Page 31:

Influential point

An observation is influential for the regression line if removing it would change the fitted line considerably. An influential point pulls the regression line towards itself.

[Scatter plot: an influential point, together with the regression line that results if the point is omitted.]

Page 32:

Example: house prices in Albuquerque.

The regression line of annual tax on selling price is ŷ = 365.66 + 0.5488·price. The coefficient of determination is R² = 0.4274.

[Scatter plot: annual tax vs selling price.]

What does the value of R² say?

Page 33:

New analysis: omitting the influential points

The regression line is ŷ = −55.364 + 0.8483·price.

The coefficient of determination is R² = 0.8273: the new regression line explains about 83% of the variation in y.

[Scatter plot: annual tax vs selling price, with the new regression line and the previous regression line.]
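The effect of removing an influential point can be demonstrated on a small synthetic data set (illustrative numbers, not the Albuquerque house data):

```python
def fit(x, y):
    """Least-squares slope and intercept for y-hat = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b

# Five points close to the line y = 2x, plus one influential point at (20, 10)
x = [1, 2, 3, 4, 5, 20]
y = [2.1, 3.9, 6.2, 8.0, 9.9, 10.0]

a_all, b_all = fit(x, y)                 # fit including the influential point
a_clean, b_clean = fit(x[:-1], y[:-1])   # fit with it removed
print(round(b_all, 2), round(b_clean, 2))
```

The slope fitted with the influential point included is pulled far below the slope of the remaining five points, just as the regression line in the figure is pulled towards the influential point.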

Page 34:

Extrapolation

Extrapolation is the use of a regression equation to predict values outside the range of the observed data. This is dangerous and often inappropriate, and may produce unreasonable answers.

Example: a linear model relates weight gain to age for young children. Applying such a model to adults, or even teenagers, would be absurd.

Example: for the selling prices of houses, the regression line should not be used to predict the annual taxes for expensive houses that cost over 500,000 dollars.

Page 35:

Summary – Warnings

1. Correlation measures linear association; the regression line should be used only when the association is linear

2. Extrapolation – do not use the regression line to predict values outside the observed range – predictions are not reliable

3. Correlation and regression line are sensitive to influential / extreme points

4. Check residual plots to detect anomalies and “hidden” patterns which are not captured by the regression line

Page 36:

Example of regression analysis

Leaning Tower of Pisa.

Response variable: the lean (Y) = the distance between where a point at the top of the tower is and where it would be if the tower were straight. The units for the lean are tenths of a millimeter above 2.9 meters.

Explanatory variable: time (X), the years 1975–1987.

Steps of our analysis: plot the data, fit a line, predict the future lean.

Page 37:

Regression line

The equation of the regression line is ŷ = −61.12 + 9.32·year.

Page 38:

[Residual plot and normal probability plot of the residuals.]

Page 39:

Obs | Year | Lean | Predicted value | Residual
1 | 75 | 642.0 | 637.8 | 4.2198
2 | 76 | 644.0 | 647.1 | −3.0989
3 | 77 | 656.0 | 656.4 | −0.4176
4 | 78 | 667.0 | 665.7 | 1.2637
5 | 79 | 673.0 | 675.1 | −2.0549
6 | 80 | 688.0 | 684.4 | 3.6264
7 | 81 | 696.0 | 693.7 | 2.3077
8 | 82 | 698.0 | 703.0 | −5.0110
9 | 83 | 713.0 | 712.3 | 0.6703
10 | 84 | 717.0 | 721.6 | −4.6484
11 | 85 | 725.0 | 731.0 | −5.9670
12 | 86 | 742.0 | 740.3 | 1.7143
13 | 87 | 757.0 | 749.6 | 7.3956

Prediction in 2002: Obs 14, year = 102, predicted lean = 889.4.
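The SAS output on the following pages can be cross-checked by fitting the same 13 observations by hand. A Python sketch (data exactly as in the table above; year is coded as 75–87, so 2002 is year 102):

```python
year = [75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87]
lean = [642, 644, 656, 667, 673, 688, 696, 698, 713, 717, 725, 742, 757]

# Least-squares slope and intercept
n = len(year)
mx, my = sum(year) / n, sum(lean) / n
b = sum((x - mx) * (y - my) for x, y in zip(year, lean)) / \
    sum((x - mx) ** 2 for x in year)
a = my - b * mx

print(round(a, 2), round(b, 2))   # intercept and slope
print(round(a + b * 102, 1))      # predicted lean in 2002 -> 889.4
```

The coefficients agree with the SAS parameter estimates (−61.12088 and 9.31868) on the last page.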

Page 40:

PROC REG in SAS

PROC REG;
MODEL yvar=xvar1;
PLOT yvar*xvar1/nostat;                      * scatter plot and regression line;
PLOT residual.*xvar1 residual.*predicted.;   * residual plots;
PLOT npp.*residual.;                         * normal probability plot for the residuals;
PLOT yvar*xvar1/PRED;                        * scatter plot with upper and lower prediction bounds;
RUN;

The option nostat in the PLOT statement suppresses the equation of the regression line that is otherwise displayed in the regression plot. The option lineprinter produces line-printer plots (low-level graphics).

Page 41:

SAS Data Step

data pisa;
input year lean @@;
datalines;
75 642 76 644 77 656 78 667 79 673 80 688
81 696 82 698 83 713 84 717 85 725 86 742
87 757
;

SAS PROC GPLOT & REG

proc gplot;
plot lean*year;
proc reg;
model lean=year;
plot lean*year/pred;
plot residual.*year;
plot npp.*residual.;
run;

Page 42:

The REG Procedure
Dependent Variable: lean (lean in tenths of a millimeter)

Analysis of Variance

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1   15804            15804         904.12    <.0001
Error             11   192.28571       17.48052
Corrected Total   12   15997

Root MSE          4.18097     R-Square   0.9880
Dependent Mean    693.69231   Adj R-Sq   0.9869
Coeff Var         0.60271

Parameter Estimates

Variable    Label       DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept    1   -61.12088            25.12982         -2.43     0.0333
year                     1   9.31868              0.30991          30.07     <.0001