
Chapter 11

Multiple Linear Regression

Introduction, Theory, SAS, Summary

By: Airelle, Bochao, Chelsea, Menglin, Reezan, Tim, Wu, Xinyi, Yuming

Introduction

Regression Analysis in the Making

The earliest form of regression analysis was the method of least squares, published by Legendre in 1805 in the paper "Nouvelles méthodes pour la détermination des orbites des comètes." [1]

Legendre used least squares to study the orbits of comets around the Sun.

"Sur la Méthode des moindres quarrés" ("On the method of least squares") appeared as an appendix to the paper. [1]

Adrien-Marie Legendre (1752-1833) [2]

[1] Firmin Didot, Paris, 1805. "Nouvelles méthodes pour la détermination des orbites des comètes." "Sur la Méthode des moindres quarrés" appears as an appendix.
[2] Picture from <http://www.superstock.com/stock-photos-images/1899-40028>

Regression Analysis in the Making

Gauss also developed the method of least squares, for the purpose of astronomical observations.

In 1809 he published "Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientium" (Theory of the motion of the heavenly bodies moving about the Sun in conic sections), which contains his account of the method of least squares. [1] He developed the method further in his later treatise "Theoria combinationis observationum erroribus minimis obnoxiae" (Theory of the combination of observations least subject to errors).

Johann Carl Friedrich Gauss (1777-1855)

Shown here on the 10 Deutsche Mark (German) banknote! [2]

[1] C.F. Gauss. Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientium. (1809)
[2] Picture from <http://www.pictobrick.de/en/gallery_gauss.shtml>

Why “regression”?

The term was coined in the 19th century by Sir Francis Galton. [1]

Sir Francis Galton (1822-1911) [2]

[1] "Regression analysis." Wikipedia, The Free Encyclopedia. Wikimedia Foundation, Inc. 22 July 2004. Web. 20 Nov. 2013.
[2] Picture from <http://hu.wikipedia.org/wiki/Szineszt%C3%A9zia>

Why “regression”?

The term was used to describe how the heights of descendants of tall ancestors tend to "regress" down toward the average height of the current generation; the phenomenon is also known as "regression towards the mean." [1]

[1] "Regression analysis." Wikipedia, The Free Encyclopedia. Wikimedia Foundation, Inc. 22 July 2004. Web. 20 Nov. 2013.
[2] Picture from <http://en.wikipedia.org/wiki/File:Miles_Park_Romney_family.jpg>

Fun Fact

Before 1970, one run of linear regression could take up to 24 hours on an electromechanical desk calculator. [1]

[1] "Regression analysis." Wikipedia, The Free Encyclopedia. Wikimedia Foundation, Inc. 22 July 2004. Web. 20 Nov. 2013.
[2] Picture from <http://www.technikum29.de/en/computer/electro-mechanical>

Uses of linear regression

1. Making predictions: fit the model by linear regression to an observed set of data and outcomes, then use it to predict the outcome for a new, unobserved case.

2. Correlating data: quantify the strength of the relationship between two sets of data, without implying that one is "causal" to the other.

Theory

Multiple Linear Regression

• Review simple linear regression, where we have only one predictor variable:

  Y_i = β_0 + β_1 x_i + ε_i,  i = 1, ..., n.

• What if there is more than one predictor?

• Multiple linear regression model: a generalization of linear regression that considers more than one independent variable.

• We fit a model of the form

  Y_i = β_0 + β_1 x_i1 + β_2 x_i2 + ... + β_k x_ik + ε_i,  i = 1, ..., n,

  where
  - k ≥ 2 is the number of predictor variables,
  - x_i1, x_i2, ..., x_ik are the values of the predictors for the ith observation,
  - β_0, β_1, ..., β_k are k+1 unknown parameters,
  - ε_i is a random error.

• Note: the model is "linear" because it is linear in the β's, not necessarily in the x's. For example, y_i may be the salary of the ith person in the sample, x_i1 the years of experience, and x_i2 the years of education.

Graph 1. Regression plane for the model with 2 predictor variables (source of Graph 1: http://reliawiki.org/index.php/Multiple_Linear_Regression_Analysis)

• Here we assume that the random errors ε_i are independent N(0, σ²) random variables.

• It follows that the Y_i are independent random variables with Y_i ~ N(μ_i, σ²), where

  E(Y_i) = μ_i = β_0 + β_1 x_i1 + β_2 x_i2 + ... + β_k x_ik,  i = 1, 2, ..., n.

Fitting the Multiple Regression Model

• Least Squares (LS) Fit: to estimate the unknown parameters β_0, β_1, ..., β_k we minimize

  Q = Σ_{i=1}^n [ y_i - (β_0 + β_1 x_i1 + β_2 x_i2 + ... + β_k x_ik) ]².

• We set the first partial derivatives of Q with respect to β_0, β_1, ..., β_k equal to zero:

  ∂Q/∂β_0 = -2 Σ_{i=1}^n [ y_i - (β_0 + β_1 x_i1 + ... + β_k x_ik) ] = 0,

  ∂Q/∂β_j = -2 Σ_{i=1}^n [ y_i - (β_0 + β_1 x_i1 + ... + β_k x_ik) ] x_ij = 0,  j = 1, ..., k.

• Simplification leads to the following normal equations:

  β_0 n + β_1 Σ x_i1 + ... + β_k Σ x_ik = Σ y_i,

  β_0 Σ x_ij + β_1 Σ x_i1 x_ij + ... + β_k Σ x_ik x_ij = Σ x_ij y_i,  j = 1, 2, ..., k.

• The resulting solutions are the least squares (LS) estimates of β_0, β_1, ..., β_k and are denoted by β̂_0, β̂_1, ..., β̂_k, respectively.

• Goodness of Fit of the Model

• We use the residuals, defined by

  e_i = y_i - ŷ_i,  i = 1, 2, ..., n,

  where the ŷ_i are the fitted values:

  ŷ_i = β̂_0 + β̂_1 x_i1 + β̂_2 x_i2 + ... + β̂_k x_ik,  i = 1, 2, ..., n.

• As an overall measure of the goodness of fit, we can use the error sum of squares

  SSE = Σ_{i=1}^n e_i²

  (which is the minimum value of Q). We compare this SSE to the total sum of squares

  SST = Σ (y_i - ȳ)².

• Define the regression sum of squares by SSR = SST - SSE. The ratio of SSR to SST is called the coefficient of multiple determination:

  r² = SSR/SST = 1 - SSE/SST.

• r² ranges between 0 and 1, with values closer to 1 representing better fits. Note that adding more predictor variables to a model generally increases r². The positive square root r of r² is the multiple correlation coefficient.

• Multiple Regression Model in Matrix Notation

• Let

  Y = (Y_1, Y_2, ..., Y_n)',  y = (y_1, y_2, ..., y_n)',  ε = (ε_1, ε_2, ..., ε_n)'

  be the n×1 vectors of the random variables Y_i, their observed values y_i, and the random errors ε_i, respectively. Let

  X = [ 1  x_11  x_12  ...  x_1k ]
      [ 1  x_21  x_22  ...  x_2k ]
      [ ...                      ]
      [ 1  x_n1  x_n2  ...  x_nk ]

  be the n×(k+1) matrix of the values of the predictor variables.

• Let

  β = (β_0, β_1, ..., β_k)'  and  β̂ = (β̂_0, β̂_1, ..., β̂_k)'

  be the (k+1)×1 vectors of the unknown parameters and their LS estimates, respectively.

• Then the model can be written as

  Y = Xβ + ε.

• The normal equations can be written simultaneously in matrix form as

  X'X β̂ = X'y.

• If the inverse of the matrix X'X exists, then the solution is given by

  β̂ = (X'X)⁻¹ X'y.

• Generalized linear model (GLM): The generalized linear model is a flexible generalization of ordinary linear regression that allows for response variables that have error distribution models other than a normal distribution. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.

Source: http://en.wikipedia.org/wiki/Generalized_linear_model
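As a brief illustration (not from the chapter; the dataset name mydata and the variables count, x1, x2 are hypothetical), a GLM with a non-normal response can be fit in SAS with PROC GENMOD by naming the distribution and link function:

PROC GENMOD DATA=mydata;
  /* Hypothetical Poisson regression: the log link relates the linear predictor to the mean count */
  MODEL count = x1 x2 / DIST=POISSON LINK=LOG;
RUN;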

Statistical Inference for Multiple Regression

First, we assume that the ε_i are i.i.d. N(0, σ²).

Then, in order to determine which predictor variables have statistically significant effects on the response variable, we need to test the hypotheses

  H_0: β_j = 0  vs.  H_1: β_j ≠ 0.

If we reject H_0: β_j = 0, then x_j is a significant predictor of y.

• It can be shown that each β̂_j is normally distributed with mean β_j and variance σ² v_jj, where v_jj is the jth diagonal entry (j = 0, 1, ..., k) of the matrix V = (X'X)⁻¹.

But how can we get this mean and variance?

Mean: β̂ is an unbiased estimator of β, just as β̂_0 and β̂_1 are unbiased in simple linear regression. Componentwise,

  E(β̂_0) = β_0,  E(β̂_1) = β_1,  ...,  E(β̂_k) = β_k.

Variance: From the assumption that the ε_i are i.i.d. N(0, σ²), the covariance matrix of Y is

  Var(Y) = σ² I = [ σ²  0   ...  0  ]
                  [ 0   σ²  ...  0  ]
                  [ ...              ]
                  [ 0   0   ...  σ² ].

Since β̂ = (X'X)⁻¹ X'Y, we can get

  Var(β̂) = σ² (X'X)⁻¹.

Letting v_jj be the jth diagonal entry (j = 0, 1, ..., k) of the matrix V = (X'X)⁻¹, we get

  Var(β̂_j) = σ² v_jj.

Derive the pivotal quantity (PQ) for the inference:

The unbiased estimator of the unknown error variance σ² is given by

  S² = Σ e_i² / (n - (k+1)) = SSE / (n - (k+1)) = MSE.

Here MSE is the error mean square and n - (k+1) is its degrees of freedom.

Let

  W = (n - (k+1)) S² / σ² = SSE / σ² ~ χ²_{n-(k+1)}.

S² and the β̂_j are statistically independent. (Statistically independent: the occurrence of one event does not affect the outcome of the other event.)

Recalling the definition of the t-distribution, we can obtain the pivotal quantity. Since

  Z = (β̂_j - β_j) / (σ √v_jj) ~ N(0, 1),

we have

  T = Z / √( W / (n - (k+1)) ) = (β̂_j - β_j) / (S √v_jj) ~ t_{n-(k+1)}.

Confidence interval:

A 100(1-α)% confidence interval on β_j follows from

  P( -t_{n-(k+1),α/2} ≤ T ≤ t_{n-(k+1),α/2} ) = 1 - α

  P( -t_{n-(k+1),α/2} ≤ (β̂_j - β_j)/(S √v_jj) ≤ t_{n-(k+1),α/2} ) = 1 - α

  P( β̂_j - t_{n-(k+1),α/2} S √v_jj ≤ β_j ≤ β̂_j + t_{n-(k+1),α/2} S √v_jj ) = 1 - α.

So the confidence interval for β_j is

  β̂_j ± t_{n-(k+1),α/2} SE(β̂_j),  where SE(β̂_j) = S √v_jj.
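In SAS, these t-based intervals for the coefficients can be requested directly from PROC REG; a minimal sketch (the dataset mydata and variables y, x1-x3 are placeholders):

PROC REG DATA=mydata;
  /* CLB prints 100(1-alpha)% confidence limits for each parameter estimate */
  MODEL y = x1 x2 x3 / CLB ALPHA=0.05;
RUN;
QUIT;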

Derivation of the hypothesis test for β_j at level α:

  H_0: β_j = β_j0  vs.  H_1: β_j ≠ β_j0.

Test statistic:

  t = (β̂_j - β_j0) / SE(β̂_j).

Reject H_0 if

  |t| > t_{n-(k+1),α/2}.

Another hypothesis test to determine if the model is useful:

H_0: β_1 = β_2 = ... = β_k = 0, the null hypothesis that none of the predictors x_j is related to y.
H_a: at least one β_j ≠ 0, i.e., at least one of them is related.

The test statistic is

  F = MSR / MSE ~ f_{k, n-(k+1)},  where MSR = SSR / k and MSE = SSE / (n - (k+1)).

Using the coefficient of multiple determination r² = SSR/SST, this can be written as

  F = MSR/MSE = [ r² / k ] / [ (1 - r²) / (n - (k+1)) ].

We can see that F is an increasing function of r², and in this form F is used to test the statistical significance of r², which is equivalent to testing H_0.

Reject H_0 if F > f_{k, n-(k+1), α}.

Extra Sum of Squares Method for Testing Subsets of Parameters

Consider the full model:

  Y_i = β_0 + β_1 x_i1 + ... + β_{k-m} x_{i,k-m} + β_{k-m+1} x_{i,k-m+1} + ... + β_k x_ik + ε_i,

and the partial model obtained by setting the last m coefficients to zero:

  Y_i = β_0 + β_1 x_i1 + ... + β_{k-m} x_{i,k-m} + ε_i.

To test whether the full model is significantly better than the partial model, we test

  H_0: β_{k-m+1} = ... = β_k = 0  vs.  H_1: at least one of these β_j ≠ 0.

Since SST is fixed regardless of the particular model, SST = SSR_k + SSE_k = SSR_{k-m} + SSE_{k-m}, so

  SSR_k - SSR_{k-m} = SSE_{k-m} - SSE_k.

The test statistic is

  F = [ (SSE_{k-m} - SSE_k) / m ] / [ SSE_k / (n - (k+1)) ].

Reject H_0 if F > f_{m, n-(k+1), α}.

The numerator df m is the number of coefficients set to zero; the denominator df n - (k+1) is the error df for the full model. The extra sum of squares SSE_{k-m} - SSE_k in the numerator represents the part of the variation in y that is accounted for by regression on the m extra predictors; it is divided by m to get an average contribution per term.
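The extra sum of squares (partial F) test can be carried out in SAS with the TEST statement of PROC REG; a sketch with placeholder names, assuming we want to test whether x3 and x4 can be dropped from a four-predictor model:

PROC REG DATA=mydata;
  MODEL y = x1 x2 x3 x4;
  /* Partial F-test of H0: beta3 = beta4 = 0 (m = 2 coefficients set to zero) */
  DropTwo: TEST x3 = 0, x4 = 0;
RUN;
QUIT;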

ANOVA table

  Source of Variation | DF        | Sum of Squares (SS) | Mean Square (MS)        | F
  Regression          | k         | SSR                 | MSR = SSR / k           | F = MSR / MSE
  Error               | n - (k+1) | SSE                 | MSE = SSE / (n - (k+1)) |
  Total               | n - 1     | SST                 |                         |

Links between ANOVA and the extra sum of squares method: let m = k, so that the partial model contains only the constant term β_0. Then

  SSE_0 = Σ_{i=1}^n (y_i - ȳ)² = SST,

  SSE_0 - SSE_k = SST - SSE = SSR,

and the extra sum of squares F statistic becomes

  F = [ SSR / k ] / [ SSE / (n - (k+1)) ] = MSR / MSE ~ f_{k, n-(k+1)},

which is exactly the ANOVA F test of H_0: β_1 = ... = β_k = 0.

Prediction of Future Observations

• Having fitted a multiple regression model, suppose that we want to predict the future value Y* of y for a specified vector of predictor variables x* = (1, x*_1, ..., x*_k)'. (Notice that we have included 1 as the first component of the vector to correspond to the constant term in the model.)

• One way is to estimate the mean μ* = E(Y*) = β_0 + β_1 x*_1 + ... + β_k x*_k = x*'β by a confidence interval (CI). We already have the point estimate

  Ŷ* = β̂_0 + β̂_1 x*_1 + ... + β̂_k x*_k = x*'β̂,

and

  E(Ŷ*) = μ*,  Var(Ŷ*) = σ² x*'(X'X)⁻¹ x*.

• Replacing σ by its estimate S, which has n - (k+1) df, the pivotal quantity is

  T = (Ŷ* - μ*) / ( S √( x*'(X'X)⁻¹ x* ) ) ~ t_{n-(k+1)},

and a 100(1-α)% CI for μ* is given by

  Ŷ* ± t_{n-(k+1),α/2} S √( x*'(X'X)⁻¹ x* ).

• Another way is to predict Y* itself by a prediction interval (PI). We know

  Y* ~ N( β_0 + β_1 x*_1 + ... + β_k x*_k, σ² ).

The prediction error Y* - Ŷ* is the difference between two independent random variables, with mean 0 and variance

  Var(Y* - Ŷ*) = σ² [ 1 + x*'(X'X)⁻¹ x* ].

• Replacing σ by its estimate S, which has n - (k+1) df, the pivotal quantity is

  T = (Y* - Ŷ*) / ( S √( 1 + x*'(X'X)⁻¹ x* ) ) ~ t_{n-(k+1)},

and a 100(1-α)% PI for Y* is given by

  Ŷ* ± t_{n-(k+1),α/2} S √( 1 + x*'(X'X)⁻¹ x* ).
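Both intervals can be printed by PROC REG for each observation; a minimal sketch with placeholder names. CLM requests the confidence limits for the mean response and CLI the prediction limits for an individual future observation. To get intervals at a new x*, a common approach is to append a row with the desired x-values and a missing y before running the procedure.

PROC REG DATA=mydata;
  /* CLM: confidence limits for E(Y*); CLI: prediction limits for a single future Y* */
  MODEL y = x1 x2 / CLM CLI;
RUN;
QUIT;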

Residual Analysis

Recall that the fitted vector can be written as

  ŷ = Hy,  where H = X(X'X)⁻¹X'

is called the hat matrix.

Standardized residuals are given by

  e*_i = e_i / SE(e_i) = e_i / ( s √(1 - h_ii) ),

where h_ii is the ith diagonal element of the hat matrix H. Large |e*_i| values indicate outlier observations.

Moreover,

  trace(H) = Σ_{i=1}^n h_ii = k + 1,

so the average of the h_ii is (k+1)/n, and we conclude that the ith observation is influential (has high leverage) if

  h_ii > 2(k+1)/n.
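The residuals, standardized residuals, and hat-matrix diagonals can be saved to a SAS dataset for plotting and for flagging outliers and high-leverage points; a sketch with placeholder names:

PROC REG DATA=mydata;
  MODEL y = x1 x2;
  /* STUDENT= saves the standardized (internally studentized) residuals, H= the leverages h_ii */
  OUTPUT OUT=diagnostics PREDICTED=yhat RESIDUAL=resid STUDENT=std_resid H=leverage;
RUN;
QUIT;

Observations with |std_resid| > 2 or leverage > 2(k+1)/n can then be flagged in a DATA step.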

Data Transformation

• Transformations of the variables (both y and the x's) are often necessary to satisfy the assumptions of linearity, normality, and constant variance. Many seemingly nonlinear models can be written as the multiple linear regression model after making a suitable transformation.

Example:

  y = β_0 x_1^{β_1} x_2^{β_2} ε.

We can do the transformation by taking the natural log (ln) of both sides:

  ln(y) = ln(β_0) + β_1 ln(x_1) + β_2 ln(x_2) + ln(ε).

Let

  y* = ln(y),  β*_0 = ln(β_0),  x*_1 = ln(x_1),  x*_2 = ln(x_2),  ε* = ln(ε).

We now have

  y* = β*_0 + β_1 x*_1 + β_2 x*_2 + ε*,

which is a multiple linear regression model.
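In SAS the transformation is just a DATA step before the fit; a sketch assuming a hypothetical dataset raw with variables y, x1, x2:

DATA logged;
  SET raw;
  /* Natural logs linearize the multiplicative model y = b0 * x1**b1 * x2**b2 * error */
  ln_y  = LOG(y);
  ln_x1 = LOG(x1);
  ln_x2 = LOG(x2);
RUN;

PROC REG DATA=logged;
  /* Ordinary multiple linear regression on the transformed variables */
  MODEL ln_y = ln_x1 ln_x2;
RUN;
QUIT;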

SAS: Code, Tables, and Graphs

Voting Example

• Setup: data on individual state voting percentages for the winners of the last twelve (15) U.S. presidential elections.

  y  = New York voting percentage ('ny')
  x1 = California voting percentage ('ca')
  x2 = South Carolina voting percentage ('sc')
  x3 = Wisconsin voting percentage ('wi')

• Goal: see if there is any positive correlation between NY's and California's voting patterns (two traditionally Democratic states), or a negative correlation between NY's and South Carolina's (one Democratic, one Republican state).

• Note: Wisconsin was included as a variable although its traditional stance is (seemingly) more ambiguous.

Source: <http://www.presidency.ucsb.edu/elections.php>
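A minimal sketch of the corresponding regression, assuming the election data have been read into a SAS dataset named voting with variables ny, ca, sc, and wi:

PROC REG DATA=voting;
  /* Regress NY's voting percentage on CA, SC, and WI */
  MODEL ny = ca sc wi;
RUN;
QUIT;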

Section 11.6: Multicollinearity, Standardized Regression Coefficients, Dummy Variables

Multicollinearity (Section 11.6.1)

• The "multicollinearity problem": the columns of X are approximately linearly dependent.
  – Result:
  • → the columns of X'X are approximately linearly dependent
  • → det(X'X) ≈ 0
  • → X'X is approximately singular (non-invertible)
  • → β̂ is unstable (large variance) or impossible to calculate

Multicollinearity (Section 11.6.1)

• One cause: predictor variables x1, x2, ..., xk are highly correlated.
  – Example: x1 = income, x2 = expenditures, x3 = savings.
• Linear relationship: x3 = x1 - x2 → we don't want this.
• Conclusion: only two (2) of the three variables should be included.

Multicollinearity (from Tamhane/Dunlop – pg. 416)

• Example 11.5: Data on the heat evolved in calories during hardening of cement (y) along with percentages of four ingredients (x1, x2, x3, x4).

– Table 11.2: Cement Data

  No.  x1  x2  x3  x4    y
   1    7  26   6  60   78.5
   2    1  29  15  52   74.3
   3   11  56   8  20  104.3
   4   11  31   8  47   87.6
   5    7  52   6  33   95.9
   6   11  55   9  22  109.2
   7    3  71  17   6  102.7
   8    1  31  22  44   72.5
   9    2  54  18  22   93.1
  10   21  47   4  26  115.9
  11    1  40  23  34   83.8
  12   11  66   9  12  113.3
  13   10  68   8  12  109.4

Multicollinearity (from Tamhane/Dunlop – pg. 416 cont'd)

• Linear relationship: x2 + x4 ≈ 80 (again, we don't want this).
• Results (Tamhane/Dunlop):

– Correlations:

         x1      x2      x3      x4
  x2   0.229
  x3  -0.824  -0.139
  x4  -0.245  -0.973   0.030
  y    0.731   0.816  -0.535  -0.821

– Regression:

  Predictor   Coef     t-ratio  p-value
  Constant    62.41     0.89    0.399
  x1           1.5511   2.08    0.071
  x2           0.5102   0.70    0.501
  x3           0.1019   0.14    0.896
  x4          -0.1441  -0.20    0.844

Multicollinearity (from Tamhane/Dunlop – pg. 416 cont'd)

– ANOVA:

  SOURCE      DF   SS       MS      F       p-value
  Regression   4   2667.90  666.97  111.48  0.000
  Error        8     47.86    5.98
  Total       12   2715.76

• Conclusion: all coefficients except x1 are nonsignificant, but the model as a whole is a good fit.
• Further work: a better model includes only x1 and x2.
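Multicollinearity of this kind is usually flagged with variance inflation factors; a sketch assuming the cement data above have been read into a SAS dataset named cement:

PROC REG DATA=cement;
  /* VIF prints variance inflation factors, TOL the tolerances (1/VIF), COLLIN a collinearity analysis */
  MODEL y = x1 x2 x3 x4 / VIF TOL COLLIN;
RUN;
QUIT;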

Standardized Regression Coefficients (Section 11.6.4)

• Purpose: allows the user to compare predictors in terms of the magnitudes of their effects on the response variable y.
  – The magnitudes of the β̂_j's depend on the units of the x_j's and y → we need to "standardize."

• Result: new predictors and response obtained by centering each variable at its mean and scaling by its standard deviation. The standardized coefficients β̂*_j can then be computed from the correlation matrix R of the predictors and the vector r of correlations between the predictors and y.
• All entries of R and r are between -1 and 1 → more stable computation of the β̂*_j.
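In SAS, standardized coefficients can be requested with the STB option of PROC REG; a sketch with placeholder dataset and variable names:

PROC REG DATA=mydata;
  /* STB adds standardized parameter estimates alongside the usual (raw-unit) estimates */
  MODEL y = x1 x2 x3 / STB;
RUN;
QUIT;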

Dummy Predictor Variables (Section 11.6.3)

• Dummy variables: allow inclusion of categorical data in the regression model.
  – Example: "gender" as a predictor variable → x = 1 for female, x = 0 for male.

• k ≥ 2 categories → k - 1 dummy variables x1, x2, ..., x_{k-1}, where x_i = 1 for the ith category and 0 otherwise.

• x1 + x2 + ... + xk = 1 → we avoid this linear dependence by not including x_k.

Dummy Predictor Variables (Section 11.6.3)

• Example 11.6 (Tamhane/Dunlop): Winter → (x1, x2, x3) = (1, 0, 0); Spring → (x1, x2, x3) = (0, 1, 0); Summer → (x1, x2, x3) = (0, 0, 1); Fall → (x1, x2, x3) = (0, 0, 0). A sketch of this coding in a SAS DATA step follows below.

• General Linear Model: a model with at least one (1) categorical regressor (i.e., at least one dummy variable x).
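A sketch of the season coding of Example 11.6 in a SAS DATA step (the input dataset mydata and the character variable season are assumptions):

DATA seasons;
  SET mydata;
  /* Fall is the baseline category: all three dummies equal 0 */
  x1 = (season = 'Winter');
  x2 = (season = 'Spring');
  x3 = (season = 'Summer');
RUN;

Alternatively, a CLASS statement in PROC GLM handles the categorical coding automatically.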

Dummy Predictor Variables in the General Linear Model

• Consider a GLM representing k categories as the predictors.
  – k categories → k - 1 dummy variables (discussed earlier).
• Model:

  Y = β_0 + β_1 x_1 + β_2 x_2 + ... + β_{k-1} x_{k-1} + ε,

  where x_i = 1 if the observation is in the ith category and x_i = 0 otherwise (i = 1, ..., k - 1).

• Interpretation of the β's:
  – β_i (i = 1, ..., k - 1) is interpreted as the value added (or subtracted) to the expected value of Y when the ith category is true.
  – Note: all x_i = 0 if the kth (baseline) category is true, so E(Y) = β_0 for that category.

Dummy Predictor Variables in the General Linear Model

• ANOVA (general definition):
  – Testing the null hypothesis that all groups of data come from the same population (same mean).
  – Setup: k groups with means μ_1, μ_2, ..., μ_k; the null hypothesis is H_0: μ_1 = μ_2 = ... = μ_k.
  – Test statistic: the usual one-way ANOVA F ratio of the between-group mean square to the within-group mean square.
  – Reject at significance level α if F exceeds the corresponding upper-α critical value of the F distribution.

• Our case (GLM with dummy variables): we are testing the null hypothesis H_0: μ_1 = μ_2 = ... = μ_k, where μ_i is the (population) mean for category i.
• But μ_i = β_0 + β_i for i = 1, ..., k - 1 and μ_k = β_0, so we have the
  – Result: simplified ANOVA hypotheses (plug in and subtract β_0): H_0: β_1 = β_2 = ... = β_{k-1} = 0.

• Test statistic: the overall regression F statistic F = MSR/MSE for the dummy-variable model.

• Estimated model: ŷ = β̂_0 + β̂_1 x_1 + ... + β̂_{k-1} x_{k-1}.
• Let ȳ_i denote the sample mean of the ith category.
• Fact (can be shown): the fitted value equals ȳ_i for every observation in category i, for all categories.

Source: "Statistics 104: A Note on ANOVA and Dummy Variable Regression" – lecture by Sundeep Iyer (4/23/2010)

• Conclusion: the F-test for this case is the same as the overall regression F-test (with k - 1 predictors), with the same set of hypotheses.

Variable Selection Methods

Consider a model that includes all of the following variables:

  WP     = Winning Percentage
  NL     = National League Indicator (dummy variable)
  AVG    = Batting Average
  OBP    = On Base Percentage
  HR     = Home Runs
  B2     = Doubles
  B3     = Triples
  hitSB  = Stolen Bases (offense)
  ERA    = Earned Run Average
  SO     = Strike Outs
  Errors = Errors*

*Note: Errors is a baseball term referring to the number of mistakes players have made.

DATA dataMLB5;
  INFILE DATALINES DSD;
  INFORMAT TEAM $21.;
  INPUT TEAM $ WP NL AVG OBP HR B2 B3 hitSB ERA SO Errors;
  LABEL WP="Winning Percentage"
        NL="National League Indicator"
        AVG="Batting Average"
        OBP="On Base Percentage"
        HR="Home Runs"
        B2="Doubles"
        B3="Triples"
        hitSB="Stolen Bases"
        ERA="Earned Run Average"
        SO="Strike Outs"
        Errors="Errors";
  DATALINES;
Boston Red Sox,0.599,0,0.277,0.349,178,363,29,123,3.79,1294,80
Tampa Bay Rays,0.564,0,0.257,0.329,165,296,23,73,3.74,1310,59
Baltimore Orioles,0.525,0,0.26,0.313,212,298,14,79,4.2,1169,54
...
San Francisco Giants,0.469,1,0.26,0.32,107,280,35,67,4,1256,107
Philadelphia Phillies,0.451,1,0.248,0.306,140,255,32,73,4.32,1199,97
Colorado Rockies,0.457,1,0.27,0.323,159,283,36,112,4.44,1064,90
;
RUN;

Source (MLB.com): <http://mlb.mlb.com/stats/sortable.jsp?c_id=mlb&tcid=mm_mlb_stats#statType=hitting&elem=%5Bobject+Object%5D&tab_level=child&click_text=Sortable+Player+hitting&%23167%3BionType=sp&page=1&ts=1385005626059&season=2013&season_type=ANY&playerType=QUALIFIER&sportCode='mlb'&league_code='MLB'&split=&team_id=&active_sw=&game_type='R'&position=&page_type=SortablePlayer&sortOrder='asc'&sortColumn=era&results=&perPage=50&timeframe=&last_x_days=&extended=0&sectionType=sp>

Variable Selection Methods

Run the regression on all of the data:

PROC REG DATA=dataMLB5;
  TITLE "Regression - Whole Model";
  MODEL WP = NL AVG OBP HR B2 B3 hitSB ERA SO Errors / P R TOL VIF COLLIN;
RUN;
QUIT;

Variable Selection Methods

Some MODEL statement options:
  P        compute predicted values
  R        compute residuals
  TOL      display tolerance values for parameter estimates
  VIF      compute variance-inflation factors
  COLLIN   produce a collinearity analysis

Variable Selection Methods

The output for the whole model shows many very high p-values; some of these variables should not be in the regression.

Variable Selection Methods: Stepwise, Forward, and Backward Regression vs. Best Subsets Regression

Stepwise, forward, and backward regression:
• Successively adds/removes variables from the model.
• The final model is not guaranteed to be optimal.
• Produces a single "final" model (in actuality there might be several, almost equally good models).
• Not influenced by other considerations, such as the practicality of including certain variables.

Best subsets regression:
• A subset of variables is chosen that optimizes a criterion.
• The final model is optimal with respect to that criterion.
• Determines a specified number of best subsets of each size; with k candidate predictors there are 2^k possible subsets.
• When the number of variables is large, efficient algorithms are required to determine the optimum subset.

Variable Selection Methods: Stepwise, Forward, and Backward Regression

Assume x_1, ..., x_{p-1} are in the model. To determine if x_p should be included, compare the following models:

(p-1)-Model:  Y = β_0 + β_1 x_1 + ... + β_{p-1} x_{p-1} + ε

p-Model:      Y = β_0 + β_1 x_1 + ... + β_{p-1} x_{p-1} + β_p x_p + ε

Partial F-test. Test statistic (the extra sum of squares test with m = 1):

  F_p = (SSE_{p-1} - SSE_p) / [ SSE_p / (n - (p+1)) ].

Since F_p = t_p², we reject H_0: β_p = 0 (at significance level α) when

  F_p > f_{1, n-(p+1), α}  (equivalently, |t_p| > t_{n-(p+1), α/2}).

Partial Correlation Coefficients: the squared partial correlation between y and x_p, adjusted for x_1, ..., x_{p-1}, is

  r²_{yx_p·x_1,...,x_{p-1}} = (SSE_{p-1} - SSE_p) / SSE_{p-1},

the proportion of the variation in y not explained by x_1, ..., x_{p-1} that is explained by adding x_p.

Variable Selection Methods: Stepwise, Forward, and Backward Regression

Variable Selection Methods: Stepwise Regression Algorithm (Tamhane, Dunlop, p. 430)

1. Start with p = 0 (no predictors in the model).
2. Compute the partial F_i for each candidate variable x_i not yet in the model (i = p+1, ..., k).
3. Is max F_i > F_IN? If no, STOP. If yes, enter the x_i producing the max F_i, set p to p+1, and relabel the variables in the model as x_1, ..., x_p.
4. Compute the partial F_i for each variable x_i in the model (i = 1, ..., p).
5. Is min F_i < F_OUT? If yes, remove the x_i producing the min F_i, set p to p-1, relabel the variables as x_1, ..., x_p, and repeat step 4.
6. If no: does p = k? If yes, STOP; otherwise return to step 2.

Forward Selection

PROC REG DATA=dataMLB5;
  MODEL WP = NL AVG OBP HR B2 B3 hitSB ERA SO Errors /
        SELECTION=FORWARD SLENTRY=0.05;
RUN;
QUIT;

Adds, at each step, the variable producing the highest F value in the partial F-test. SELECTION= chooses the method (FORWARD, BACKWARD, or STEPWISE); SLENTRY=0.05 is the significance level required for a variable to enter the model.

Forward Selection

Forward Selection – Final Step

F values of the variables within the final model.

Note:
- F values changed at each step as variables were added.
- For the FORWARD method, once a variable is included in the model, it always stays in the model (its relative significance is not reconsidered as additional variables are added).

Forward Selection - Summary

Results of partial F tests at each step

Backward Elimination

PROC REG DATA=dataMLB5;
  MODEL WP = NL AVG OBP HR B2 B3 hitSB ERA SO Errors /
        SELECTION=BACKWARD SLSTAY=0.05;
RUN;
QUIT;

Removes variables, one by one, until every remaining variable has an F value significant at the 0.05 level. SLSTAY=0.05 is the significance level required for a variable to remain in the model.

In this output, AVG will be removed in the next step.

Backward Elimination: AVG is no longer included.

SAS will continue to remove variables until every remaining variable, including the one with the lowest F value, meets the specified significance level.

Backward Elimination - Final Step

F values of variables within the final model all meet the specified significance level.

Backward Elimination – Summary

Stepwise Selection

PROC REG DATA=dataMLB5;
  MODEL WP = NL AVG OBP HR B2 B3 hitSB ERA SO Errors /
        SELECTION=STEPWISE SLENTRY=0.05 SLSTAY=0.05;
RUN;
QUIT;

SLENTRY=0.05 is the significance level for a variable to enter the model; SLSTAY=0.05 is the significance level to remain in the model (variables already in the model are re-evaluated after each new variable is added).

Stepwise Selection - Final Model

F values of variables within the final model all meet the specified significance levels.

Stepwise Selection - Summary

If, at any step, a variable within the model no longer met the specified significance level, it would be removed.

Ultimately, none of the variables outside the model are significant, and all of the variables in the model are significant.

F values of variables at each step.

Variable Selection Methods – Best Subsets: Optimality Criteria

1) r²_p criterion:
• Maximized when all the variables are in the model.
• Only provides a goodness-of-fit consideration (not how well the model predicts).

2) Adjusted r²_p criterion:
• Unlike r²_p, adding variables can cause it to decrease.

3) MSE_p criterion:
• Minimizing MSE_p is essentially equivalent to maximizing adjusted r²_p.

4) C_p criterion: measures the ability of the model to predict the y-values.
• Select predictor variable vectors x (potentially representing a range for future predictions).
• The criterion is based on the standardized mean square error of prediction, Γ_p, evaluated under the assumption that the full model is the "true model."
• It should be noted that as we increase the number of variables, the prediction variance increases.

Variable Selection Methods – Best Subsets

4) C_p criterion (continued):

Mallows' C_p statistic,

  C_p = SSE_p / σ̂² - [ n - 2(p+1) ],

is considered an "almost" unbiased estimate of Γ_p.

We use σ̂² to estimate σ², and commonly the MSE for the whole (full) model is used for σ̂².

Note: C_p = p + 1 for the full model (assumed to have a zero bias), so C_p = k + 1 when p = k.

Variable Selection Methods – Best Subsets

5) PRESS_p criterion:
• Considers the impact of removing observations on predictability.
• Observations are removed one at a time, and the model is re-fit after each observation is removed.

LS estimates when the ith observation is removed: β̂_(i).

The predicted value for the observation that was just removed: ŷ_(i) = x_i' β̂_(i).

Prediction error sum of squares (PRESS):

  PRESS_p = Σ_{i=1}^n ( y_i - ŷ_(i) )².

Select the model with the smallest PRESS_p.
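PRESS_p can be computed from a single fit using the identity y_i - ŷ_(i) = e_i / (1 - h_ii); a sketch with placeholder dataset and variable names:

PROC REG DATA=mydata;
  MODEL y = x1 x2;
  /* Save ordinary residuals and leverages from the fit on the full data */
  OUTPUT OUT=diag RESIDUAL=resid H=leverage;
RUN;
QUIT;

DATA press_terms;
  SET diag;
  /* Squared deleted (PRESS) residual for each observation */
  press_i = (resid / (1 - leverage))**2;
RUN;

PROC MEANS DATA=press_terms SUM;
  /* PRESS is the sum of the squared deleted residuals */
  VAR press_i;
RUN;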

Variable Selection Methods – Best Subsets

PROC REG DATA=dataMLB5;
  MODEL WP = NL AVG OBP HR B2 B3 hitSB ERA SO Errors /
        SELECTION=RSQUARE BEST=1;
  MODEL WP = NL AVG OBP HR B2 B3 hitSB ERA SO Errors /
        SELECTION=ADJRSQ BEST=10;
  MODEL WP = NL AVG OBP HR B2 B3 hitSB ERA SO Errors /
        SELECTION=CP BEST=10 ADJRSQ RSQUARE;
RUN;
QUIT;

SELECTION=RSQUARE finds r² for subsets of models; ADJRSQ finds adjusted r² for subsets of models; CP computes Mallows' C_p statistic.

• The first MODEL statement (BEST=1) computes the model with the largest r² for each number of variables that could be in the model.
• The second (BEST=10) computes the best 10 models based on the largest adjusted r² values.
• The third (BEST=10) computes the best 10 models based on the lowest C_p values, and also includes r² and adjusted r² values for each model.

Variable Selection Methods – Best Subsets

R-Square Selection Method
• Each listed model represents the combination of variables (from 1 variable through all variables) with the largest r² value.
• r² increases as additional variables are added to the model.

Variable Selection Methods – Best Subsets

Adjusted R-Square Selection Method
• Considers all combinations (and numbers) of variables in the model.
• Models are listed in order of decreasing adjusted r² values.
• Since BEST=10 was specified, the top 10 models are included.

Variable Selection Methods – Best Subsets

C_p Selection Method
• Considers all combinations (and numbers) of variables in the model.
• Models are listed in order of increasing C_p values.
• Since BEST=10 was specified, the top 10 models are included.

PROC REG DATA=dataMLB5;
  MODEL WP = NL AVG OBP HR B2 B3 hitSB ERA SO Errors /
        SELECTION=CP ADJRSQ STOP=1 BEST=1;
  MODEL WP = NL AVG OBP HR B2 B3 hitSB ERA SO Errors /
        SELECTION=CP ADJRSQ STOP=2 BEST=1;
  MODEL WP = NL AVG OBP HR B2 B3 hitSB ERA SO Errors /
        SELECTION=CP ADJRSQ STOP=3 BEST=1;
  MODEL WP = NL AVG OBP HR B2 B3 hitSB ERA SO Errors /
        SELECTION=CP ADJRSQ STOP=4 BEST=1;
  MODEL WP = NL AVG OBP HR B2 B3 hitSB ERA SO Errors /
        SELECTION=CP ADJRSQ STOP=5 BEST=1;
RUN;
QUIT;

Variable Selection Methods – Best Subsets

Each MODEL statement determines the model with the lowest C_p value among models including no more than the STOP= number of variables (1 through 5).

Variable Selection Methods – Best Subsets

Model with the lowest C_p value, with up to 1 variable in the model.

Model with the lowest C_p value, with up to 2 variables in the model.

Variable Selection Methods – Best Subsets

Model with the lowest C_p value, with up to 3 variables in the model.

Model with the lowest C_p value, with up to 4 variables in the model.

Note the decreasing C_p.

Variable Selection Methods – Best Subsets

Model with the lowest C_p value, with up to 5 variables in the model: same result as "up to 4 variables."

• The forward selection, backward elimination and stepwise selection methods produced the same model.

• We could conclude this is the best model. However, other data might produce different models using the various selection methods.

Variable Selection Methods

PROC REG DATA=dataMLB5;
  MODEL WP = OBP HR ERA Errors / P R TOL VIF COLLIN;
RUN;
QUIT;

Model Building Strategy

1. Decide the type of model.
2. Collect the data.
3. Explore the data.
4. Divide the data.
5. Fit several candidate models.
6. Select and evaluate "good" models.
7. Select the final model.


Step I: Decide the Type of Model Needed

• Predictive: a model used to predict the response variable from a chosen set of predictor variables.
• Theoretical: a model based on a theoretical relationship between a response variable and predictor variables.
• Control: a model used to control a response variable by manipulating predictor variables.
• Inferential: a model used to explore the strength of relationships between a response variable and individual predictor variables.
• Data Summary: a model used primarily as a device to summarize a large set of data by a single equation.

Step II: Collect the Data

• Decide the variables (predictor and response) on which to collect data.
• Consult subject matter experts.
• Obtain pertinent, bias-free data.

Step III: Explore the Data (Very Important!)

• Check for outliers, gross errors, missing values, etc. on a univariate basis.
• Study bivariate relationships to reveal other outliers, to suggest possible transformations, and to identify possible multicollinearities.

Step IV: Divide the Data

Randomly divide the data into:
• Training set: with at least 15-20 error degrees of freedom, used to estimate the model.
• Testing set: used for cross-validation of the fitted model.

The split into the training and test sets should be done randomly. A 50:50 split can be used if the sample size is large enough; otherwise more data may be put into the training set. A sketch of such a split in SAS follows below.
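A simple random split can be done in a DATA step; a sketch assuming the full data are in a dataset named alldata (the seed is arbitrary):

DATA train test;
  SET alldata;
  /* Send each observation to the training or test set with probability 0.5 */
  IF RANUNI(20131120) < 0.5 THEN OUTPUT train;
  ELSE OUTPUT test;
RUN;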

Step V: Fit Several Candidate Models

Candidate models can generally be identified using the training data set:
• Use best subsets regression.
• Use stepwise regression, which of course only yields one model unless different alpha-to-remove and alpha-to-enter values are specified.

Step VI: Select and Evaluate "Good" Models

• Select several models based on the criteria we learned, such as the C_p statistic and the number of predictors (p).
• Check for violations of the model assumptions.
• Consider further transformations of the response and/or predictor variables.
• If none of the models provides a satisfactory fit, try something else, such as collecting more data, identifying different predictors, or formulating a different type of model.

Step VII: Select the Final Model

• Compare competing models by cross-validating them against the test data.
• Choose a model with a smaller cross-validation SSE.
• Final selection of the model is based on considerations such as residual plots, outliers, parsimony, relevance, etc.

Summary

• The multiple regression model extends the simple linear regression model to k ≥ 2 predictor variables:

  Y_i = β_0 + β_1 x_i1 + β_2 x_i2 + ... + β_k x_ik + ε_i.

• The least squares estimate of the parameter vector β, based on n complete sets of observations on all variables, equals

  β̂ = (X'X)⁻¹ X'y,

  where X is the n×(k+1) matrix of observations on the predictor variables and y is the n×1 vector of observations on y.

Summary

• The fitted vector is ŷ = Xβ̂ and the residual vector is e = y - ŷ, with e_i = y_i - ŷ_i.

• Error sum of squares: SSE = Σ_{i=1}^n e_i².

• Total sum of squares: SST = Σ (y_i - ȳ)².

• Multiple coefficient of determination: r² = SSR/SST = 1 - SSE/SST.

• The positive square root r of r² is called the multiple correlation coefficient.

Summary

• In the probabilistic model for multiple regression, we assume the random errors ε_i to be independent N(0, σ²).

• It follows that the β̂_j are N(β_j, σ² v_jj), where v_jj is the jth diagonal entry of the matrix V = (X'X)⁻¹.

Summary

• Furthermore,

  S² = SSE / (n - (k+1)) = MSE

  is an unbiased estimate of σ², and (n - (k+1)) S² / σ² has a χ²_{n-(k+1)} distribution, independent of the β̂_j.

• From this result we can draw inferences on the β_j based on the t-distribution with n - (k+1) df. A 100(1-α)% confidence interval on β_j is

  β̂_j ± t_{n-(k+1),α/2} S √v_jj.

Summary

• The extra sum of squares method is useful for deriving F-tests on subsets of the β_j's.

• To test the hypothesis that a specified subset of m ≤ k of the β_j's equals 0, let SSE_k and SSE_{k-m} denote the error sums of squares for the full and partial models, respectively.

Summary

• Then the F statistic is given by

  F = [ (SSE_{k-m} - SSE_k) / m ] / [ SSE_k / (n - (k+1)) ],  with m and n - (k+1) df.

• Two special cases of this method are the tests of significance on:
  – a single β_j: the t-test;
  – all the β_j's (not including β_0): the F statistic

    F = [ SSR / k ] / [ SSE / (n - (k+1)) ] = MSR / MSE,  with k and n - (k+1) df.

Summary

• The fitted vector can be written as ŷ = Hy, where H = X(X'X)⁻¹X' is called the hat matrix.

• If the ith diagonal element of H satisfies h_ii > 2(k+1)/n, then the ith observation is regarded as influential because it has high leverage on the model fit.

• Residuals are used to check model assumptions such as normality and constant variance by making appropriate residual plots. If the standardized residual |e*_i| = |e_i / SE(e_i)| > 2, then the ith observation is regarded as an outlier.

Summary

• Multicollinearity: the columns of X are approximately linearly dependent → the estimate of β is unstable.
  – Major cause: high correlation between the x_i's.

• Dummy variables: used to represent categorical data.
  – x_i = 1 if the ith category is true; x_i = 0 otherwise.

Summary

• Stepwise regression: selects and deletes variables based on their marginal contributions to the model.
  – Uses partial F-statistics and partial correlation coefficients.

• Best subsets regression: chooses the subset of predictor variables that optimizes a certain criterion function (e.g., adjusted r²_p or the C_p statistic).

Summary

• Fitting a model:
  1) Decide on the type of model.
  2) Collect the data.
  3) Explore the data.
  4) Divide the data.
  5) Fit several candidate models.
  6) Select/evaluate "good" models.
  7) Select the final model.