MAT 3375 - Regression Analysis

Model Adequacy Checking and Diagnostics for Leverage and Influence

Chapters 4, 5 and 6

Professor: Termeh Kousha

Fall 2016


1 Introduction to Model Adequacy Checking

In the previous report, we made the following assumptions for the multiple regression model:

1. The relation between the response y and the regressors X is approximately linear.

2. The errors $\varepsilon_j$ ($j = 1, \cdots, n$) follow the normal distribution $N(0, \sigma^2)$ and are uncorrelated.

If any of these assumptions is violated, the model may be unstable, and standard summary statistics such as the t or F tests, or $R^2$, are not enough to ensure model adequacy. In this report we present several methods, based on the residuals, for diagnosing violations of the basic regression assumptions.

2 Residual Analysis

2.1 Definition of Residuals

As defined in the previous report, the residual is

$$e = y - \hat{y} = y - Hy = (I - H)y.$$

Here e is the deviation between the observation y and the fitted value $\hat{y}$, and it also measures the unexplained variability in the response. The residuals of a sample are the differences between the observed values and the fitted values, while the errors of a sample are the deviations of the observed values from the true values. We can consider the residuals as the observed values of the model errors, which cannot themselves be observed. Therefore the residuals should show up any departures from the assumptions on the errors, in the sense that residual analysis is effective for checking model inadequacy. The residuals can also be written in terms of $\varepsilon$ as follows,

$$e = X\beta + \varepsilon - X(X'X)^{-1}X'(X\beta + \varepsilon) = \varepsilon - H\varepsilon = (I - H)\varepsilon = M\varepsilon.$$

So that
$$E(e) = 0, \qquad \mathrm{Var}(e) = (I - H)\sigma^2 = M\sigma^2,$$
$$\mathrm{Var}(e_i) = (1 - h_{ii})\sigma^2, \qquad \mathrm{Cov}(e_i, e_j) = -h_{ij}\sigma^2.$$

Here $h_{ii}$ and $h_{ij}$ are elements of the hat matrix H, and $h_{ii}$ is a measure of the location of the ith point in x space, so the variance of $e_i$ depends on where the point $x_i$ lies, with $0 \le h_{ii} \le 1$. Furthermore, the larger $h_{ii}$ is, the smaller $\mathrm{Var}(e_i)$ is, which is why $h_{ii}$ is called the leverage. When $h_{ii}$ is close to 1, $\mathrm{Var}(e_i)$ is almost 0, that is, $\hat{y}_i \approx y_i$. This shows that when $x_i$ is far from the center $\bar{x}$, the point $(x_i, y_i)$ pulls the regression line toward itself. If such a point has a great effect on the estimation of the parameters, it is called a high-leverage case.

The approximate average variance is estimated by the residual mean square MSE:
$$MSE = \frac{SSE}{n - p} = \frac{e'e}{n - p},$$
$$E(MSE) = E\left(\frac{SSE}{n - k - 1}\right) = E\left(\frac{e'e}{n - k - 1}\right) = \sigma^2,$$
$$\hat{\sigma}^2 = MSE.$$
(Here $p = k + 1$, where k is the number of regressors.)

Under the assumptions the errors are independent, but the residuals are not; in fact, the residuals have $n - p$ degrees of freedom. However, as long as the number of observations n is much larger than the number of parameters p, this dependence among the residuals has little effect on model adequacy checking.

Generally speaking, residual analysis plays an important role in

1. verifying the model’s assumptions,

2. finding observations that are outliers and extreme values.

The points $(x_{i1}, \cdots, x_{ip}, y_i)$ that have great influence on statistical inference are called influence points. We hope each observation has some influence, but not a very large one, so that our estimates will be stable. Otherwise, if simply removing the influence points changes the estimated model massively compared with the original model, we should not trust the original model, and should doubt whether it shows the true relationship between the response and the regressors. If an unusual observation results from a systematic error, such as a mistake in measurement, we can delete the record directly.


If the possibility of recording errors is ruled out, we had better collect more data or use robust estimation to reduce the effect of influence points on the estimates.

2.2 Methods for Scaling Residuals

It is convenient to find outliers and extreme values by using scaled residuals. Here we introduce four popular methods for scaling residuals.

2.2.1 Standardized Residuals

We have seen that MSE estimates the approximate average variance of the residuals, so a reasonable scaling of the residuals is the standardized residuals
$$d_i = \frac{e_i}{\sqrt{MSE}} = \frac{e_i}{\hat{\sigma}}, \quad i = 1, \cdots, n,$$
$$E(d_i) = 0, \quad \mathrm{Var}(d_i) \approx 1.$$

The standardized residuals (sometimes written $ZRE_i$) make the residuals comparable. If $d_i$ is large ($d_i > 3$, say), the observation is regarded as an outlier. The standardized residuals simplify this determination, but they do not solve the problem of unequal variance.

2.2.2 Studentized Residuals

One important application of residuals is to find outliers based on their absolute values. However, the ordinary residuals depend on $h_{ii}$, so it is not appropriate to compare them directly. Different from the standardized residuals, the studentized residuals are another way to standardize the residuals,

$$\frac{e_i}{\sigma\sqrt{1 - h_{ii}}}.$$
Because $\sigma$ is unknown, we replace $\sigma^2$ with its estimate $\hat{\sigma}^2 = MSE$. The studentized residuals are then
$$r_i = \frac{e_i}{\sqrt{MSE\,(1 - h_{ii})}}.$$


If there is only one regressor, it is easy to show that the studentized residuals are
$$r_i = \frac{e_i}{\sqrt{MSE\left[1 - \frac{1}{n} - \frac{(x_i - \bar{x})^2}{S_{xx}}\right]}},$$
where $S_{xx} = \sum_{i=1}^{n}(x_i - \bar{x})^2$.

Compared with the standardized residuals $d_i$, the $r_i$ have constant variance $\mathrm{Var}(r_i) = 1$ regardless of the location of $x_i$; in other words, they solve the problem of unequal variance. When the data set is large enough the variance of the residuals stabilizes, so in many cases the standardized and studentized residuals convey equivalent information, since there is little difference between them. However, the studentized residuals are better because they reduce the influence of points with large residuals and large $h_{ii}$.

Example 1. The residuals for the model fitted to the pull strength data from a wire bond in a semiconductor manufacturing process are shown below.

A normal probability plot of these residuals is shown below. No severe deviations from normality are apparent, although the two largest residuals ($e_{15} = 5.84$ and $e_{17} = 4.33$) do not fall extremely close to a straight line drawn through the remaining residuals.


Calculate the standardized and studentized residuals corresponding to $e_{15}$ and $e_{17}$. What can you conclude?


2.2.3 PRESS Residuals

Besides the standardized and studentized residuals, another improved residual makes use of $y_i - \hat{y}_{(i)}$, where $\hat{y}_{(i)}$ is the fitted value of the ith response based on all observations except the ith one. This approach deletes the ith observation so that the fitted value cannot be influenced by that observation, and the resulting residual can be used to decide whether the observation is an outlier.

We define the prediction error as
$$e_{(i)} = y_i - \hat{y}_{(i)}.$$
These prediction errors are called PRESS residuals (from the prediction error sum of squares). The ith PRESS residual can be written as
$$e_{(i)} = \frac{e_i}{1 - h_{ii}}.$$

If $h_{ii}$ is large, the PRESS residual will also be large, and the point will be a high-influence point. A large difference between the PRESS residual and the ordinary residual indicates a point where the model fits the data well but predicts poorly.

$$\mathrm{Var}(e_{(i)}) = \mathrm{Var}\left(\frac{e_i}{1 - h_{ii}}\right) = \frac{1}{(1 - h_{ii})^2}\left[\sigma^2 (1 - h_{ii})\right] = \frac{\sigma^2}{1 - h_{ii}}.$$

A standardized PRESS residual is
$$\frac{e_{(i)}}{\sqrt{\mathrm{Var}(e_{(i)})}} = \frac{e_i/(1 - h_{ii})}{\sigma/\sqrt{1 - h_{ii}}} = \frac{e_i}{\sigma\sqrt{1 - h_{ii}}}.$$

Homoscedasticity implies that the variance is the same for all observations ($\sigma_i = \sigma$ for all i). The prediction error sum of squares is


$$PRESS = \sum_{i=1}^{n} e_{(i)}^2 = \sum_{i=1}^{n}\left(\frac{e_i}{1 - h_{ii}}\right)^2.$$

2.2.4 R-student

Sometimes the R-student is also called the externally studentized residual, while the studentized residuals discussed before are called internally studentized residuals. Instead of the MSE used for the studentized residuals, the R-student approach uses
$$S_{(i)}^2 = \frac{(n - p)MSE - e_i^2/(1 - h_{ii})}{n - p - 1},$$
and the R-student is given by
$$r_i^* = t_i = \frac{e_i}{\sqrt{S_{(i)}^2 (1 - h_{ii})}}.$$
Under some assumptions, $r_i^* \sim t_{n-p-1}$. The relationship between $r_i$ and $r_i^*$ is
$$r_i^{*2} = \frac{n - p - 1}{n - p - r_i^2} \cdot r_i^2.$$

In many situations $r_i^*$ differs little from $r_i$. However, when the ith observation is influential, the R-student is more sensitive to that point than the studentized residual.

2.3 Residual Plots

After calculating the different types of residuals, we can use graphical analysis of the residuals to investigate the adequacy of the fit of a regression model and to check whether the assumptions are satisfied. In this report we introduce several residual plots: the normal probability plot, and plots of the residuals against the fitted values $\hat{y}_i$, the regressors $x_i$, and the time sequence.


2.3.1 Normal Probability Plot

Figure 1: Patterns for residual plots: (a) satisfactory; (b) funnel; (c) double bow; (d) nonlinear.

If the errors come from a distribution with heavier tails than the normal, they will produce outliers that pull the regression line toward them. In such cases, other estimation techniques such as robust regression should be considered.

The normal probability plot is a direct method to check the normality assumption. The first step is to rank the residuals in increasing order $(e_{[1]}, e_{[2]}, \cdots, e_{[n]})$, and then plot $e_{[i]}$ against the cumulative probability
$$P_i = \frac{i - 1/2}{n}$$
on the normal probability plot. It is usually called a P-P plot.

1. If the resulting points lie approximately on the diagonal line, the errors are consistent with a normal distribution.

2. If the fitted curve is sharper than the diagonal line at both extremes, the errors come from a distribution with lighter tails.

3. If the fitted curve is flatter than the diagonal line at both extremes, the errors come from a distribution with heavier tails.

4. If the fitted curve rises or falls at the right extreme, the errors come from a positively or negatively skewed distribution, respectively.


Similar to the P-P plot, the Q-Q plot plots $e_{[i]}$ against the quantiles of the distribution.

2.3.2 Plot of Residuals against the Fitted Values $\hat{y}_i$

The plot of residuals ($e_i$, $r_i$, $d_i$ or $r_i^*$) against the fitted values $\hat{y}_i$ is useful for detecting several model inadequacies. In this plot, the vertical axis is the residuals and the horizontal axis is the fitted values.

1. The ideal plot has all points in a horizontal band.

2. If the plot looks like an outward-opening (or inward-opening) funnel, the variance of the errors is an increasing (or decreasing) function of y.

3. If the plot looks like a double bow, the variance of a binomial proportion near 0.5 is greater than near 0 or 1.

4. If the plot looks like a curve, the model is nonlinear.

Example 2. The diagnostic plots show the residuals in four different ways. If you want to look at all four plots at once rather than one by one:

par(mfrow=c(2,2)) # Change the panel layout to 2 x 2

plot(lm)          # lm is a fitted model object, e.g. lm <- lm(y ~ x)

par(mfrow=c(1,1)) # Change back to 1 x 1

Let’s take a look at the first type of plot:

Residuals vs Fitted

This plot shows whether the residuals have non-linear patterns. There could be a non-linear relationship between the predictor variables and the outcome variable, and the pattern will show up in this plot if the model doesn’t capture the non-linear relationship. If you find equally spread residuals around a horizontal line without distinct patterns, that is a good indication that you don’t have non-linear relationships.

Let’s look at residual plots from a good model and a bad model. The good-model data are simulated in a way that meets the regression assumptions very well, while the bad-model data are not:


We don’t see any distinctive pattern in Case 1, but we see a parabola in Case 2, where the non-linear relationship was not explained by the model and was left in the residuals.

Normal Q-Q

This plot shows whether the residuals are normally distributed. Do the residuals follow a straight line well, or do they deviate severely? It is good if the residuals lie close to the straight dashed line.

Case 2 definitely concerns us. We would not be too concerned by Case 1, although the observation numbered 38 looks a little off. Let’s look at the next plot while keeping in mind that 38 might be a potential problem.

Scale-Location


This plot shows whether the residuals are spread equally along the range of the predictors. This is how you can check the assumption of equal variance (homoscedasticity). It’s good if you see a horizontal line with equally (randomly) spread points.

In Case 1 the residuals appear randomly spread, whereas in Case 2 the residuals begin to spread wider along the x-axis as it passes around 5. Because the residuals spread wider and wider, the red smooth line is not horizontal and shows a steep angle in Case 2.

Residuals vs Leverage

This plot helps us find influential cases (i.e., subjects), if any. Not all outliers are influential in linear regression analysis (whatever outliers mean). Even though data have extreme values, they might not be influential in determining the regression line: the results wouldn’t be much different whether we include or exclude them from the analysis. They follow the trend of the majority of cases and they don’t really matter; they are not influential. On the other hand, some cases can be very influential even if they appear to be within a reasonable range of values. They can be extreme against the regression line and can alter the results if we exclude them from the analysis. Another way to put it is that they don’t get along with the trend of the majority of the cases.

Unlike the other plots, this time patterns are not relevant. We watch for outlying values at the upper right corner or at the lower right corner; those are the places where cases can be influential against a regression line. Look for cases outside the dashed lines marking Cook’s distance. When cases are outside of the Cook’s distance lines (meaning they have high Cook’s distance scores), they are influential to the regression results, which will be altered if we exclude those cases.

Case 1 is the typical look when there is no influential case. You can barely see the Cook’s distance lines (red dashed lines) because all cases are well inside of them. In Case 2, a case lies far beyond the Cook’s distance lines (the other residuals appear clustered on the left because the second plot is scaled to show a larger area than the first). The plot identifies the influential observation as 49. If we exclude the 49th case from the analysis, the slope coefficient changes from 2.14 to 2.68 and R² from .757 to .851. A pretty big impact!

The four plots flag potentially problematic cases by their row numbers in the data set. If some cases are identified across all four plots, you might want to take a close look at them individually. Is there anything special about the subject? Or could it simply be an error in data entry?

So, what does having patterns in the residuals mean for your research? It’s not just a go-or-stop sign. It tells you about your model and data. Your current model might not be the best way to understand your data if there is so much good stuff left in the data.

In that case, you may want to go back to your theory and hypotheses. Is it really a linear relationship between the predictors and the outcome? You may want to include a quadratic term, for example. A log transformation may better represent the phenomena that you would like to model. Or is there an important variable that you left out of your model? Other variables you didn’t include (e.g., age or gender) may play an important role in your model and data. Or maybe your data were systematically biased during collection, and you may want to redesign the data collection methods.

2.3.3 Plot of Residuals against the Regressor

Here the horizontal axis is the regressor. Compared with the ordinary residuals, outliers are easier to identify with studentized residuals. The patterns are the same four situations as before. With $x_j$ on the horizontal axis, pattern 2 or 3 indicates nonconstant variance, and pattern 4 indicates that the assumed relationship between y and $x_j$ is not correct.


3 Transformation and Weighting to Correct Model Inadequacies

In the previous section we mentioned that we would introduce methods to deal with unequal variance. Data transformation and weighted least squares are two common methods that are useful in building models without violations of the assumptions. In this section we lay emphasis on data transformation.

3.1 Transformation for Nonlinear Relation Only

First we consider transformations for linearizing a nonlinear regression relation when the distribution of the error terms is reasonably close to normal and the error terms have approximately constant variance. In this situation, transformations on X should be attempted. The reason transformations on Y may not be desirable here is that a transformation on Y may change the shape of the distribution of the error terms and also lead to differing error term variances.

Example 3. In one study we have 10 participants; X represents the number of days of training received and Y the performance score in a battery of simulated sales situations. A scatter plot of these data is shown below. Clearly the regression relation appears to be curvilinear, so the simple regression model does not seem appropriate. We consider a square root transformation $X' = \sqrt{X}$. In the scatter plot below, the same data are plotted with the predictor variable transformed to $X'$. Note that the scatter plot now shows a reasonably linear relation.


> data
     X     Y
1  0.5  42.5
2  0.5  50.6
3  1.0  68.5
4  1.0  80.7
5  1.5  89.0
6  1.5  99.6
7  2.0 105.3
8  2.0 111.8
9  2.5 112.3
10 2.5 125.7

> lm<-lm(Y~X)

> X2<-sqrt(data$X)

> X2

[1] 0.7071068 0.7071068 1.0000000 1.0000000 1.2247449 1.2247449 1.4142136

[8] 1.4142136 1.5811388 1.5811388

> plot(Y~X2)

> lm2<-lm(Y~X2)   # fit the model on the transformed predictor

> summary(lm2)

Call:

lm(formula = Y ~ X2)

Residuals:

Min 1Q Median 3Q Max

-9.3221 -4.1884 -0.2367 4.1007 7.7200


Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -10.328 7.892 -1.309 0.227

X2 83.453 6.444 12.951 1.2e-06 ***

---

Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Residual standard error: 6.272 on 8 degrees of freedom

Multiple R-squared: 0.9545, Adjusted R-squared: 0.9488

F-statistic: 167.7 on 1 and 8 DF, p-value: 1.197e-06

We obtain the following fitted regression function:
$$\hat{Y} = -10.33 + 83.45 X' = -10.33 + 83.45\sqrt{X}.$$

The plot of residuals shows no evidence of unequal error variance and no strong indications of substantial departures from normality.


3.2 Assumption of Constant Variance

One assumption of the regression model is that the errors have constant variance. A common reason for a violation ($\mathrm{Var}(\varepsilon_i) \neq \mathrm{Var}(\varepsilon_j)$, $i \neq j$) is that the response y follows a distribution whose variance is related to its mean (and hence to the regressors x). For example, if y follows a Poisson distribution in a simple linear regression model, then $E(y) = \mathrm{Var}(y) = \lambda$, where $\lambda$ is the Poisson parameter and is related to the regressor variable x.

As another example, suppose we analyze the relationship between resident income and purchasing power, $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$. The variation among low-income families is small because their main purchases are daily necessities, while the variation among high-income families is large because their purchase range is wider, from automobiles to houses. This explains the cause of unequal error variance.

In yet another example, if y is a proportion, i.e. $0 \le y_i \le 1$, then the variance of y is proportional to $E(y)[1 - E(y)]$. In such cases a variance-stabilizing transformation is useful.

Many factors can produce heteroscedasticity. When the sample data are cross-sectional (different subjects observed at the same time or period), the errors are more likely to have different variances.

To solve these problems, we can re-express the response y and then transform the predicted values back into the original units. Table 1 shows a series of common variance-stabilizing transformations. Two things to note about these transformations:

1. If a mild transformation is applied over a relatively narrow range of values ($y_{max}/y_{min} < 3$), it has little effect.

2. If a strong transformation is applied over a wide range, it has a dramatic effect on the analysis.

After making a suitable transformation, use $y'$ as the study variable in the respective case. Through the residual plot we can find the trend of the residuals against the fitted values $\hat{y}$. If the plot looks like pattern (b), (c) or (d) in Figure 1, it indicates variance inequality; we can then try a transformation to get a better residual plot. The strength of a transformation depends on the amount of curvature in the relation between the study and explanatory variables. The transformations mentioned here range from relatively mild to relatively strong: the square root transformation is relatively mild, while the reciprocal transformation is relatively strong.


Relation of σ² to E(y)          Transformation
σ² ∝ constant                   y′ = y
σ² ∝ E(y)                       y′ = √y
σ² ∝ E(y)[1 − E(y)]             y′ = sin⁻¹(√y)   (0 ≤ yᵢ ≤ 1)
σ² ∝ [E(y)]²                    y′ = ln(y) (log)
σ² ∝ [E(y)]³                    y′ = y^(−1/2)
σ² ∝ [E(y)]⁴                    y′ = y⁻¹

Table 1: Useful Variance-Stabilizing Transformations


3.2.1 Box-Cox Method for Appropriate Transformation

Transformations chosen by examining residual plots and scatter diagrams of y against x are empirical. Here we introduce a more formal and objective way to transform a regression model: the Box-Cox method.

The Box-Cox method applies a power transformation to the response y of the form $y^\lambda$; $\lambda$ and the regression parameters can be estimated simultaneously by the method of maximum likelihood. There is an obvious problem: as $\lambda$ approaches zero, $y^\lambda$ approaches 1, which is meaningless. We can instead transform the response as

$$y^{(\lambda)} = \begin{cases} \dfrac{y^\lambda - 1}{\lambda} & \lambda \neq 0 \\ \ln y & \lambda = 0. \end{cases}$$

However, there is still a problem: as $\lambda$ changes, the value of $(y^\lambda - 1)/\lambda$ changes dramatically. We improve the formula to

$$y^{(\lambda)} = \begin{cases} \dfrac{y^\lambda - 1}{\lambda\, \dot{y}^{\lambda - 1}} & \lambda \neq 0 \\ \dot{y} \ln y & \lambda = 0, \end{cases}$$
where $\dot{y} = \ln^{-1}\left(\frac{1}{n}\sum_{i=1}^{n} \ln y_i\right)$ is the geometric mean of the observations.


3.2.2 Weighted Least Squares (WLS)

In many cases the assumptions of uncorrelated, homoscedastic errors may not be valid, and alternative methods should be considered in order to compute improved parameter estimates.

One alternative is weighted least squares (WLS). This method redistributes the influence of the data points on the estimation of the parameters, and is used when the model has heteroscedastic errors (non-constant variances). In the weighted least squares approach, observations with greater variances receive smaller weights, and conversely observations with smaller variances receive greater weights. With this method, the variance-covariance matrix is replaced by a more general expression:

$$\mathrm{Cov}(\varepsilon) = \sigma^2 W^{-1},$$
where W is a positive definite matrix (Fahrmeir 2013).

W is defined as the diagonal matrix of weights as follows:

W = diag(w1, w2, ..., wn)

or:

$$W = \begin{pmatrix} w_1 & 0 & \cdots & 0 \\ 0 & w_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & w_n \end{pmatrix}, \qquad W^{-1} = \begin{pmatrix} \frac{1}{w_1} & 0 & \cdots & 0 \\ 0 & \frac{1}{w_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{1}{w_n} \end{pmatrix}.$$


Thus the heteroscedastic variances can be written as $\mathrm{Var}(\varepsilon_i) = \sigma^2 / w_i$.

For the WLS approach, the response variable (y), the regressor matrix (X) and the errors ($\varepsilon$) are transformed so that the transformed variables follow a linear model with homoscedastic errors. For $i = 1, 2, ..., n$, with k regressors in the model:

• Transformed errors: $\varepsilon_i^* = \sqrt{w_i}\,\varepsilon_i$. This transformation induces constant variance, since $\mathrm{Var}(\varepsilon_i^*) = \mathrm{Var}(\sqrt{w_i}\,\varepsilon_i) = w_i\,\mathrm{Var}(\varepsilon_i) = \sigma^2$.

• Transformed response variable: $y_i^* = \sqrt{w_i}\, y_i$

• Transformed predictors: $x_{ik}^* = \sqrt{w_i}\, x_{ik}$

Thus, the linear model can be rewritten as
$$W^{1/2} y = W^{1/2} X\beta + W^{1/2}\varepsilon, \qquad (1)$$
where
$$W^{1/2} = \mathrm{diag}(\sqrt{w_1}, \sqrt{w_2}, ..., \sqrt{w_n}).$$

The regression coefficient estimates are obtained by minimizing the weighted sum of squares (WSS):
$$WSS(\beta) = \sum_{i=1}^{n} e_{wi}^2 = \sum_{i=1}^{n} w_i (y_i - x_i'\beta)^2, \qquad i = 1, 2, ..., n,$$
where the residuals of a weighted least squares model are defined as
$$e_{wi} = \sqrt{w_i}\,(y_i - x_i'\hat{\beta}).$$


Example 4. For a simple linear regression model:

yi = β0 + β1xi + εi

the weighted least squares estimates can be obtained as follows:

1. We define the weighted sum of squares:
$$WSS(\beta_0, \beta_1) = \sum_{i=1}^{n} w_i (y_i - \beta_0 - \beta_1 x_i)^2.$$

2. We then differentiate $WSS(\beta_0, \beta_1)$ with respect to $\beta_0$ and $\beta_1$ and equate the derivatives to zero. Let $Q = \sum_{i=1}^{n} w_i (y_i - \beta_0 - \beta_1 x_i)^2$. Then
$$\frac{\partial Q}{\partial \beta_0} = -2\sum_{i=1}^{n} w_i (y_i - \beta_0 - \beta_1 x_i) = 0,$$
which gives the first normal equation
$$\hat{\beta}_0 \sum_{i=1}^{n} w_i + \hat{\beta}_1 \sum_{i=1}^{n} w_i x_i = \sum_{i=1}^{n} w_i y_i.$$
The same procedure with respect to $\beta_1$ gives
$$\hat{\beta}_0 \sum_{i=1}^{n} w_i x_i + \hat{\beta}_1 \sum_{i=1}^{n} w_i x_i^2 = \sum_{i=1}^{n} w_i x_i y_i.$$

3. Finally, we solve these equations for the weighted least squares estimates of $\beta_0$ and $\beta_1$. For example, it can be shown that the weighted estimator for $\beta_1$ is
$$\hat{\beta}_{1w} = \frac{\sum_{i=1}^{n} w_i (x_i - \bar{x}_w)(y_i - \bar{y}_w)}{\sum_{i=1}^{n} w_i (x_i - \bar{x}_w)^2},$$
where $\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$ and $\bar{y}_w = \frac{\sum_{i=1}^{n} w_i y_i}{\sum_{i=1}^{n} w_i}$. From this we find the weighted least squares estimate of the intercept:
$$\hat{\beta}_{0w} = \bar{y}_w - \hat{\beta}_{1w} \bar{x}_w.$$


Example 5. Consider the dataset "cleaningwgt". We want to develop a regression equation modeling the relationship between the number of rooms cleaned, Y, and the number of crews, X, and to predict the number of rooms cleaned by 4 and 16 crews. In this case we take
$$w_i = \frac{1}{(\text{standard deviation of } Y_i)^2},$$
so that $y_i$ has variance $\sigma^2 / w_i$, with $\sigma^2 = 1$.

> cleaningwtd <- read.table(file.choose(), header=TRUE, sep=",")

> cleaningwtd

Case Crews Rooms StdDev

1 1 16 51 12.000463

2 2 10 37 7.927123

3 3 12 37 7.289910

4 4 16 46 12.000463

5 5 16 45 12.000463

6 6 4 11 4.966555

7 7 2 6 3.000000

8 8 4 19 4.966555

9 9 6 29 4.690416

10 10 2 14 3.000000

11 11 12 47 7.289910

12 12 8 37 6.642665

13 13 16 60 12.000463

14 14 2 6 3.000000

15 15 2 11 3.000000

16 16 2 10 3.000000

17 17 6 19 4.690416

18 18 10 33 7.927123

19 19 16 46 12.000463

20 20 16 69 12.000463

21 21 10 41 7.927123

22 22 6 19 4.690416

23 23 2 6 3.000000

24 24 6 27 4.690416

25 25 10 35 7.927123

26 26 12 55 7.289910


27 27 4 15 4.966555

28 28 4 18 4.966555

29 29 16 72 12.000463

30 30 8 22 6.642665

31 31 10 55 7.927123

32 32 16 65 12.000463

33 33 6 26 4.690416

34 34 10 52 7.927123

35 35 12 55 7.289910

36 36 8 33 6.642665

37 37 10 38 7.927123

38 38 8 23 6.642665

39 39 8 38 6.642665

40 40 2 10 3.000000

41 41 16 65 12.000463

42 42 8 31 6.642665

43 43 8 33 6.642665

44 44 12 47 7.289910

45 45 10 42 7.927123

46 46 16 78 12.000463

47 47 2 6 3.000000

48 48 2 6 3.000000

49 49 8 40 6.642665

50 50 12 39 7.289910

51 51 4 9 4.966555

52 52 4 22 4.966555

53 53 12 41 7.289910

> attach(cleaningwtd)

> wm1 <- lm(Rooms~Crews,weights=1/StdDev^2)

> summary(wm1)

Call:

lm(formula = Rooms ~ Crews, weights = 1/StdDev^2)

Weighted Residuals:

Min 1Q Median 3Q Max

-1.43184 -0.82013 0.03909 0.69029 2.01030


Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 0.8095 1.1158 0.725 0.471

Crews 3.8255 0.1788 21.400 <2e-16 ***

---

Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Residual standard error: 0.9648 on 51 degrees of freedom

Multiple R-squared: 0.8998, Adjusted R-squared: 0.8978

F-statistic: 458 on 1 and 51 DF, p-value: < 2.2e-16

We would like to find 95% prediction intervals for Y when x = 4 and x = 16.

> predict(wm1,newdata=data.frame(Crews=c(4,16)),interval="prediction",level=0.95)

fit lwr upr

1 16.11133 13.71210 18.51056

2 62.01687 57.38601 66.64773

Warning message:

In predict.lm(wm1, newdata = data.frame(Crews = c(4, 16)), interval = "prediction", :

Assuming constant prediction variance even though model fit is weighted

Write the model as a multiple linear regression with two predictors and no intercept.

> ynew <- Rooms/StdDev

> x1new <- 1/StdDev

> x2new <- Crews/StdDev

> wm1check <- lm(ynew~x1new + x2new - 1)

> summary(wm1check)

Call:

lm(formula = ynew ~ x1new + x2new - 1)

Residuals:

Min 1Q Median 3Q Max

-1.43184 -0.82013 0.03909 0.69029 2.01030

Coefficients:

Estimate Std. Error t value Pr(>|t|)


x1new 0.8095 1.1158 0.725 0.471

x2new 3.8255 0.1788 21.400 <2e-16 ***

---

Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Residual standard error: 0.9648 on 51 degrees of freedom

Multiple R-squared: 0.9617, Adjusted R-squared: 0.9602

F-statistic: 639.6 on 2 and 51 DF, p-value: < 2.2e-16

3.3 Assumption of Linear Relationship Between Response and Regressors

In some cases a nonlinear model can be linearized by a suitable transformation. Such nonlinear models are called intrinsically or transformably linear. The advantage of transforming a nonlinear function into a linear one is that the statistical tools developed for the linear regression model become available. If the scatter diagram of y against x looks like the curve of a function that we know, we can use the linearized form of that function to represent the data. For example, consider the exponential function

$$y = \beta_0 e^{\beta_1 x}\varepsilon.$$

Using the transformation $y' = \ln(y)$, the model becomes
$$\ln y = \ln\beta_0 + \beta_1 x + \ln\varepsilon,$$
or
$$y' = \beta_0' + \beta_1 x + \varepsilon',$$
where $\beta_0' = \ln\beta_0$.


4 Detection and Treatment of Outliers (Influential Observations)

As mentioned before, an outlier is an observation that is distant from the other observations; it may lie three or four standard deviations from the mean ($|d_i| > 3$). An outlier may be due to variability in the measurement, or it may indicate experimental error; the latter can simply be removed from the data set. Using studentized or R-student residuals, the residual plots against $\hat{y}_i$, and the normal probability plot, we can find outliers easily. If an outlier does not come from a systematic error, then it plays a more important role than other points because it may control many key model properties.

We can compare the original values of the parameters, or of summary statistics such as the t statistic, F statistic, $R^2$ and MSR, with the values obtained after removing the outliers.

Outliers fall into two categories: outliers in the response y and outliers in the regressors $x_i$. If the absolute value of the R-student is greater than 3 ($|r_i^*| > 3$), the observation can be declared an outlier and can be deleted.

Sometimes a residual is large, yet the point is an influence point rather than an outlier in the response. A high-leverage point does not necessarily harm the regression model, but it has significant influence on the fit. As noted in Section 2.1, $h_{ii}$, the ith diagonal element of the hat matrix H, is called the leverage of the ith observation because it is a standardized measure of the distance of the ith observation from the center of the x space; it can be written as
$$h_{ii} = x_i'(X'X)^{-1}x_i,$$
where $x_i'$ is the ith row of the X matrix. The larger $h_{ii}$ is, the smaller $\mathrm{Var}(e_i)$ is. Points with large leverage pull the regression line in their direction; however, if a high-leverage point lies almost on the line passing through the remaining observations, it has no effect on the regression coefficients. The mean value of the $h_{ii}$ is
$$\bar{h} = \frac{1}{n}\sum_{i=1}^{n} h_{ii} = \frac{p}{n},$$
since $\sum_{i=1}^{n} h_{ii} = \mathrm{rank}(H) = \mathrm{rank}(X) = p$. If a leverage $h_{ii}$ is 2 or 3 times greater than $\bar{h}$, it is regarded as a high leverage value. Observations with large hat diagonals and large residuals are called influence points. Influence points are not always outliers in the response, so we cannot automatically conclude that an influence point is an outlier.

It is desirable to consider both the location of the point in the x space and the response variable when measuring influence. Therefore we introduce Cook's distance, a measure of squared distance, named after the American statistician R. Dennis Cook, who introduced the concept in 1977. The distance measure can be expressed as

$$D_i = \frac{(\hat{\beta}_{(i)} - \hat{\beta})'\, X'X\, (\hat{\beta}_{(i)} - \hat{\beta})}{p\,MSE}, \qquad i = 1, \cdots, n,$$
where $\hat{\beta}$ is the least-squares estimate based on all n points and $\hat{\beta}_{(i)}$ is the estimate obtained by deleting the ith point. It can also be written as
$$D_i = \frac{r_i^2}{p} \cdot \frac{\mathrm{Var}(\hat{y}_i)}{\mathrm{Var}(e_i)} = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}} = \frac{e_i^2}{p\,MSE} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}.$$

From the above equation we see that Cook's distance reflects the combined effect of the leverage $h_{ii}$ and the residual $e_i$.

Points with large values of $D_i$ have great influence on the least-squares estimates. However, it is complicated to set a standard for the size of Cook's distance, and there are different opinions on what cut-off values to use for spotting highly influential points. A simple operational guideline is that a point with $D_i > 1$ is regarded as influential, while a point with $D_i < 0.5$ is not. Others have suggested $D_i > 4/n$ as the cutoff.

R code:

library(foreign)

library(MASS)

cd <- cooks.distance(lm)   # cooks.distance() is in base R's stats package; lm is a fitted model

Example 6. The table below lists the values of the hat matrix diagonals $h_{ii}$ and Cook's distance measure $D_i$ for the wire bond pull strength data. Calculate $D_1$.


5 Lack of Fit of the Regression Model

In the formal test for lack of fit, we assume that the regression model satisfies the requirements of normality, independence and constant variance, and we only want to determine whether the relationship is a straight line or not.

Replicating experiments in regression analysis aims to make clear whether there are any non-negligible factors other than x. Here replication means having $n_i$ separate experiments at the level $x = x_i$ and observing the value of the response each time, rather than running one experiment and measuring the response $n_i$ times. The residuals of the latter method come only from differences in measurement and are useless for analyzing the fitted model. If other uncontrolled or non-negligible factors, including interactions, are present in the model, the fit may be poor; this is called lack of fit. In this situation, even if the hypothesis test shows that the regression is significant, it only indicates that the regressor x has an effect on the response y, not that the fit is good. Thus, when the data set contains replicated data, we can use the test for lack of fit on the expected function. Determining whether the model is good or not is done mainly through residual analysis.

The residuals are composed of two parts: one part, called pure error, is random and cannot be eliminated, while the other part, called lack of fit, is related to the model.

The test for lack of fit is used to judge whether a regression model should be accepted. It is based on the relative sizes of the lack of fit and the pure error: if the lack of fit is significantly greater than the pure error, the model should be rejected.


A sum of squares due to lack of fit is one of the components of a partition of the sum of squares in an analysis of variance, used in the numerator of an F-test of the null hypothesis that the proposed model fits well:

$$SSE = SSPE + SSLOF,$$

where SSPE is the sum of squares due to pure error and SSLOF is the sum of squares due to lack of fit.

Notation:
$x_i$, $i = 1, \cdots, m$: the ith level of the regressor x.
$n_i$: the number of replications at the ith level of x, with $\sum_{i=1}^{m} n_i = n$.
$y_{ij}$, $i = 1, \cdots, m$, $j = 1, \cdots, n_i$: the jth observation at $x_i$.
The ijth residual is
$$y_{ij} - \hat{y}_i = (y_{ij} - \bar{y}_i) + (\bar{y}_i - \hat{y}_i).$$

The full model is
$$y_{ij} = \mu_i + \varepsilon_{ij}, \qquad E(y_{ij}) = \mu_i.$$
For the full model we can estimate $\mu_i$ as $\hat{\mu}_i = \bar{y}_i$, so
$$SSE(\text{Full Model}) = \sum_{i=1}^{m}\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2.$$

The reduced model is
$$y_{ij} = \beta_0 + \beta_1 x_i + \varepsilon_{ij}, \qquad \hat{y}_{ij} = \hat{\beta}_0 + \hat{\beta}_1 x_i.$$
Note that the reduced model is an ordinary simple linear regression model, and
$$SSE = SSE(\text{Reduced Model}) = \sum_{i=1}^{m}\sum_{j=1}^{n_i}\left(y_{ij} - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\right)^2 = \sum_{i=1}^{m}\sum_{j=1}^{n_i} (y_{ij} - \hat{y}_i)^2$$
$$= \sum_{i=1}^{m}\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2 + \sum_{i=1}^{m} n_i (\bar{y}_i - \hat{y}_i)^2.$$


The cross-product term $\sum_{i=1}^{m}\sum_{j=1}^{n_i}(y_{ij} - \bar{y}_i)(\bar{y}_i - \hat{y}_i)$ equals 0. The two components are
$$SSPE = SSE(\text{Full Model}) = \sum_{i=1}^{m}\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2,$$
$$SSLOF = SSE - SSPE = \sum_{i=1}^{m} n_i (\bar{y}_i - \hat{y}_i)^2.$$

The terms $n_i$ and $\hat{y}_i$ appear in this formula because all the $y_{ij}$ at the level $x_i$ have the same fitted value $\hat{y}_i$.

If the fitted values $\hat{y}_i$ are close to the corresponding average responses $\bar{y}_i$, the lack of fit is approximately zero, which indicates that the regression function is likely to be linear, and vice versa.

The number of degrees of freedom for pure error at each level $x_i$ is $n_i - 1$ (similar to SST), and the total number of degrees of freedom associated with the sum of squares due to pure error is $\sum_{i=1}^{m}(n_i - 1) = n - m$. The number of degrees of freedom associated with SSLOF is $m - 2$, because the regressor has m levels and two parameters must be estimated to obtain the $\hat{y}_i$.

The hypotheses are
$$H_0: E(y_i) = \beta_0 + \beta_1 x_i \quad \text{(the reduced model is adequate)}$$
$$H_1: E(y_i) \neq \beta_0 + \beta_1 x_i \quad \text{(the reduced model is not adequate)},$$
and the test statistic for lack of fit is
$$F_0 = \frac{SSLOF/(m - 2)}{SSPE/(n - m)} = \frac{MS_{LOF}}{MS_{PE}}.$$

In theory, if a model fits well, the parameters should not change much over several replicated experiments, and the smaller SSLOF is, the better. If $F_0 > F_{\alpha, m-2, n-m}$, or the P-value of $F_0$ is less than $\alpha$, the null hypothesis that the tentative model adequately describes the data should be rejected; in other words, lack of fit exists at level $\alpha$. Otherwise there is no evidence of lack of fit.


Example 7. Perform the F-test for lack of fit on the following data:

(90, 81), (90, 83), (79, 75), (66, 68), (66, 60), (66, 62), (51, 60), (51, 64), (35, 51), (35, 53)

> lmred<-lm(y~x) # reduced model

> lmfull<-lm(y~as.factor(x)) #full model

> anova(lmred,lmfull)

Analysis of Variance Table

Model 1: y ~ x

Model 2: y ~ as.factor(x)

Res.Df RSS Df Sum of Sq F Pr(>F)

1 8 118.441

2 5 46.667 3 71.775 2.5634 0.168

From the output above we conclude: SSE = 118.44 (with 8 df), SSPE = 46.667 (with 5 df), and SSLOF = 71.775 (with 3 df).
