Topic 3: Multiple Regression Analysis (transcript: professional.education.uiowa.edu/skliethe/teaching...)
3a-1
Design and Analysis of Biomedical Studies (171:162) Summer Semester 2012
Topic 3: Multiple Linear Regression I (Part i) Models and Parameter Estimation
Stephanie Kliethermes
June 13, 2012
3a-2
Multiple Linear Regression

Multiple regression analysis is an extension of simple linear (straight-line) regression. With multiple linear regression we use more than one explanatory variable (or higher-order terms, e.g. X²) to explain or predict a single response variable. Dealing with several independent variables simultaneously in a regression analysis is considerably more difficult than dealing with a single independent variable:

1. It is more difficult to choose the best model.
2. It is more difficult to visualize what the fitted model looks like.
3. It can be more difficult to interpret what the model means in clinical terms.
4. Computations are virtually impossible by hand.
3a-3
Example 3.1 (Problem 8.2, pages 127-8)

A psychiatrist wants to know whether the level of pathology (Y) in psychotic patients 6 months after treatment can be predicted with reasonable accuracy from knowledge of pre-treatment symptom ratings of thinking disturbance (X1) and hostile suspiciousness (X2).
3a-4
First try to fit two simple linear regression models:

Model: MODEL1
Dependent Variable: pathology

                        Analysis of Variance
                              Sum of        Mean
Source            DF         Squares      Square   F Value   Pr > F
Model              1      1535.85697  1535.85697      6.39   0.0146
Error             51           12255   240.30025
Corrected Total   52           13791

Root MSE          15.50162    R-Square    0.1114
Dependent Mean    22.69811    Adj R-Sq    0.0939
Coeff Var         68.29476

                        Parameter Estimates
                             Parameter    Standard
Variable   Label      DF      Estimate       Error   t Value   Pr > |t|
Intercept  Intercept   1     -25.04084    19.00283     -1.32     0.1935
thinking   thinking    1      15.95111     6.30947      2.53     0.0146
3a-5
Model: MODEL2
Dependent Variable: pathology

                        Analysis of Variance
                              Sum of        Mean
Source            DF         Squares      Square   F Value   Pr > F
Model              1       157.84733   157.84733      0.59   0.4458
Error             51           13633   267.32005
Corrected Total   52           13791

Root MSE          16.34993    R-Square    0.0114
Dependent Mean    22.69811    Adj R-Sq   -0.0079
Coeff Var         72.03209

3a-6

                        Parameter Estimates
                             Parameter    Standard
Variable   Label      DF      Estimate       Error   t Value   Pr > |t|
Intercept  Intercept   1      37.61326    19.53945      1.92     0.0598
hostile    hostile     1      -2.25150     2.93001     -0.77     0.4458
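As a quick sanity check on the listings above, the R-Square and F Value entries can be recomputed from the printed sums of squares. A minimal sketch in Python (not the course's SAS), using MODEL1's values; since SSE is rounded to 12255 in the listing, agreement is to the displayed decimals:

```python
# Recomputing MODEL1's summary statistics from its printed sums of squares.
ssr = 1535.85697           # Model sum of squares
sse = 12255.0              # Error sum of squares (rounded in the listing)
ssy = ssr + sse            # Corrected Total (13791 in the listing)
r_square = ssr / ssy       # R-Square
mse = sse / 51             # Mean Square for Error (DF = 51)
f_value = (ssr / 1) / mse  # F Value = MSR / MSE
print(round(r_square, 4), round(f_value, 2))  # → 0.1114 6.39
```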
Multiple Regression Models

The general form of a regression model for k independent variables is given by

    Y = β0 + β1X1 + β2X2 + ... + βkXk + E

where
  Y is the dependent (response) variable,
  X1, X2, ..., Xk are the independent (explanatory) variables, and
  β0, β1, β2, ..., βk are the regression coefficients that need to be estimated.

Data Structure
3a-7

Subject    Y     X1     X2    ...    Xk
   1       y1    x11    x21   ...    xk1
   2       y2    x12    x22   ...    xk2
   ...
   n       yn    x1n    x2n   ...    xkn
3a-8
Instead of looking at individual relationships, we can look at all of the
pair-wise scatterplots:
3a-9
With two independent variables, our representation of the data can best be viewed in three dimensions. (As the number of independent variables increases, our best-fitting “curve” becomes a hypersurface in (k+1)-dimensional space.)
3a-10
Thus, in three dimensions the response surface is not a line but a plane. As in simple linear regression, we want to find the best-fitting plane through the data using the method of least squares. The plane describes the mean value of Y at each combination of the independent variables (X1 = thinking, X2 = hostile).
3a-11
Assumptions of Multiple Regression
1. Existence: For each specific combination of values of the (basic) independent variables X1, X2, …, Xk, Y is a (univariate) random variable with a certain probability distribution having finite mean and variance.
2. Independence: The Y observations are statistically independent of one another.

3. Linearity: The mean value of Y for each specific combination of X1, X2, ..., Xk is a linear function of X1, X2, ..., Xk; that is,

    μ_{Y|X1,X2,...,Xk} = β0 + β1X1 + β2X2 + ... + βkXk

or

    Y = β0 + β1X1 + β2X2 + ... + βkXk + E

where E is the error component reflecting the difference between an individual's observed response Y and the true average response μ_{Y|X1,X2,...,Xk}.
3a-12
4. Homoscedasticity: The variance of Y is the same for any fixed combination of X1, X2, …, Xk; that is
    Var(Y | X1, X2, ..., Xk) = σ²_{Y|X1,X2,...,Xk} ≡ σ²
5. Normality: For any fixed combination of X1, X2, …, Xk, the variable Y
is normally distributed.
    Y ~ N( μ_{Y|X1,X2,...,Xk} , σ² )
or equivalently we can assume that E ~ N( 0, σ 2 ).
• Normality is required for inference on the regression model, but not to fit the model using least squares.
• Tests are fairly robust; only extreme departures from normality will produce incorrect results.
• If lack of normality does occur, a transformation of the Y values is considered.
3a-13
As in simple linear regression:
• Y is an observable random variable, while X1, X2, ..., Xk are generally considered fixed (nonrandom) known quantities.
• The constants β0, β1, ..., βk are unknown regression (population) parameters.
• E is an unobservable random variable.
• If one estimates β0, β1, ..., βk with β̂0, β̂1, ..., β̂k, then an acceptable estimate of Ei for the ith subject is

    Ê_i = Y_i − Ŷ_i = Y_i − ( β̂0 + β̂1X_i1 + ... + β̂kX_ik )

• The estimated error Ê_i is usually called a residual.
  o Ŷ = β̂0 + β̂1X1 + ... + β̂kXk is called the prediction equation. With numbers (estimates) plugged in, we call it the least squares equation, fitted model, or estimated regression equation.
  o Y = β0 + β1X1 + ... + βkXk + E is called the regression model.
3a-14
Determining the Best Estimate of the Multiple Regression Equation

The least squares approach is again used to estimate the multiple regression equation. Just as before, we minimize the sum of squared distances between the observed responses and the predicted responses.
Let the fitted regression model be
    Ŷ = β̂0 + β̂1X1 + β̂2X2 + ... + β̂kXk

We then obtain the estimates that minimize

    SSE = Σ( Y_i − Ŷ_i )² = Σ( Y_i − β̂0 − β̂1X_i1 − ... − β̂kX_ik )²
Calculation of these estimates is beyond the scope of this class but estimates are easily calculated in standard statistical software.
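Although the hand calculation is beyond scope, the underlying recipe is short: form the normal equations (XᵀX)b = XᵀY and solve the resulting linear system. A minimal illustrative sketch in Python (not the course's SAS); the data below are made up so that Y fits the model exactly and the estimates recover the true coefficients:

```python
# Sketch: least squares estimation via the normal equations (X'X)b = X'Y,
# solved by Gaussian elimination, for a tiny hypothetical dataset (k = 2).

def solve(a, b):
    """Gaussian elimination with partial pivoting for a small linear system."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def least_squares(X, Y):
    """Estimate (b0, b1, ..., bk) minimizing SSE = sum of (Yi - Yhat_i)^2."""
    rows = [[1.0] + list(x) for x in X]   # prepend the intercept column
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, Y)) for i in range(p)]
    return solve(xtx, xty)

# Hypothetical data: Y generated exactly as 2 + 3*X1 - 1*X2 (no noise),
# so the least squares estimates should recover those coefficients.
X = [(1, 2), (2, 1), (3, 5), (4, 2), (5, 7)]
Y = [2 + 3 * x1 - x2 for x1, x2 in X]
b = least_squares(X, Y)
print([round(v, 6) for v in b])  # → [2.0, 3.0, -1.0]
```

Real software inverts or factors XᵀX with more numerically careful methods, but the estimates it reports are the solution of this same system.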
3a-15
Properties of the least squares estimators:
1. The estimators β̂0, β̂1, ..., β̂k are each a linear function of the Y-values. Since the Y values are assumed to be normally distributed (and independent), β̂0, β̂1, ..., β̂k will be normally distributed, too.

2. The least-squares equation Ŷ = β̂0 + β̂1X1 + β̂2X2 + ... + β̂kXk is the combination of variables X1, X2, ..., Xk that has the maximum possible correlation with Y:

    r_{Y,Ŷ} = Σ(Yi − Ȳ)(Ŷi − Ŷ̄) / √[ Σ(Yi − Ȳ)² · Σ(Ŷi − Ŷ̄)² ]

where Ȳ and Ŷ̄ are the means of the observed and fitted values.
The quantity 𝑟𝑌,𝑌� is called the multiple correlation coefficient.
3. Multiple regression is related to the multivariate normal distribution
just as straight-line regression is related to the bivariate normal.
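To make property 2 concrete, the multiple correlation coefficient can be computed directly from its definition. A small sketch with hypothetical observed and fitted values (the numbers below are made up for illustration, not from the example dataset):

```python
# Computing the multiple correlation coefficient r_{Y,Yhat} by its
# definition, using hypothetical observed (Y) and fitted (Yhat) values.
import math

Y    = [3.0, 5.0, 4.0, 8.0, 10.0]   # hypothetical observed responses
Yhat = [3.5, 4.5, 4.0, 8.5,  9.5]   # hypothetical fitted values

ybar = sum(Y) / len(Y)
fbar = sum(Yhat) / len(Yhat)
num  = sum((y - ybar) * (f - fbar) for y, f in zip(Y, Yhat))
den  = math.sqrt(sum((y - ybar) ** 2 for y in Y)
                 * sum((f - fbar) ** 2 for f in Yhat))
r = num / den
print(round(r, 3))  # → 0.986
```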
3a-16
Interpretation of Regression Parameters

In multiple linear regression, the interpretation of βi (i = 1, 2, ..., k) is the same as it was in simple linear regression, with one additional (and important) restriction.
βi (i = 1, 2, ..., k) represents the expected change in Y corresponding to a one-unit increase in Xi, given that the other independent variables are held fixed (and the combination of values must be within the scope of the model).
If we allowed the other predictors to change, we would be unable to
determine how much of the change in the mean of Y was associated with the
differing values of the predictor of interest vs. how much was due to some
other predictor(s) differing as well.
3a-17
The ANOVA Table for Multiple Regression

As in SLR, the ANOVA table can provide an overall summary of a multiple regression analysis. We begin by partitioning the sums of squares.

SSY = Σ(Yi − Ȳ)² is called the total sum of squares
  - represents the total variability in Y without accounting for any X variables in the regression equation.

SSE = Σ(Yi − Ŷi)² is the residual sum of squares
  - represents the amount of Y variation left unexplained after the X variables have been used in the regression equation.

SSR = SSY − SSE = Σ(Ŷi − Ȳ)² is the regression sum of squares
  - represents the reduction in variation due to the X variables in the regression equation.
3a-18
SSY = SSR + SSE

    Σ_{i=1}^{n} (Yi − Ȳ)²  =  Σ_{i=1}^{n} (Ŷi − Ȳ)²  +  Σ_{i=1}^{n} (Yi − Ŷi)²
         ↑                         ↑                          ↑
     SS(Total)               SS(Regression)             SS(Residuals)
                             SS(Explained)              SS(Unexplained)
In SAS:

Source     DF        Sum of Squares    Mean Square           F
Model      k         SSR               MSR = SSR/k           F = MSR/MSE
Error      n-k-1     SSE               MSE = SSE/(n-k-1)
C Total    n-1       SSY

Note: Different software packages may present the form of the ANOVA table with slight variations.
3a-19
    R² = SSR/SSY = (SSY − SSE)/SSY = 1 − SSE/SSY

• The R² value is a quantitative measure of how well the fitted model containing variables X1, X2, ..., Xk predicts the dependent variable Y.
• 0 ≤ R² ≤ 1.

Back to our example:
PROC REG DATA=Ex0802;
  MODEL y=x1 x2;
  TITLE 'Multivariate Regression Analysis';
RUN;
3a-20
Multivariate Regression Analysis

The REG Procedure
Model: MODEL1
Dependent Variable: pathology

                        Analysis of Variance
                              Sum of        Mean
Source            DF         Squares      Square   F Value   Pr > F
Model              2      2753.87136  1376.93568      6.24   0.0038
Error             50           11037   220.74597
Corrected Total   52           13791

Root MSE          14.85752    R-Square    0.1997
Dependent Mean    22.69811    Adj R-Sq    0.1677
Coeff Var         65.45708

                        Parameter Estimates
                        Parameter    Standard
Variable    DF           Estimate       Error   t Value   Pr > |t|
Intercept    1           -0.63535    20.96833     -0.03     0.9759
thinking     1           23.45144     6.83851      3.43     0.0012
hostile      1           -7.07261     3.01092     -2.35     0.0228
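The R-Square and Adj R-Sq entries in this output follow directly from the R² formula above. A minimal check in Python (not the course's SAS; SSE is rounded in the listing, so agreement is to the displayed 4 decimals):

```python
# Checking the full model's R-Square and Adj R-Sq against the printed
# sums of squares (n = 53 patients, k = 2 predictors).
n, k = 53, 2
ssr = 2753.87136
sse = 11037.0
ssy = ssr + sse                  # Corrected Total (≈ 13791)
r_square = 1 - sse / ssy         # R² = 1 - SSE/SSY
adj_r_sq = 1 - (sse / (n - k - 1)) / (ssy / (n - 1))
print(round(r_square, 4), round(adj_r_sq, 4))  # → 0.1997 0.1677
```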
3a-21
The least squares equation is

    Ŷ = −0.635 + 23.451 X1 − 7.073 X2

that is, β̂0 = −0.635, β̂1 = 23.451, and β̂2 = −7.073.
• Given a fixed rating on hostile suspiciousness (X2), every one-unit increase in thinking disturbance (X1) yields a 23.451-unit increase in level of pathology, on average.
• Given a fixed rating of thinking disturbance (X1), every one-unit increase in hostile suspiciousness (X2) yields a 7.073-unit decrease in level of pathology, on average.
Using this equation, determine the predicted level of pathology for a patient
with pretreatment scores of 2.80 on thinking disturbance and 7.0 on hostile
suspiciousness. How does this predicted value compare with the value
actually obtained for patient 5 (Y5=25)?
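As a check on the exercise's arithmetic, the prediction can be worked out by plugging the scores into the rounded least squares equation (treat the result as approximate, since the coefficients are rounded):

```python
# Predicted pathology for pretreatment scores X1 = 2.80 (thinking)
# and X2 = 7.0 (hostile), using the rounded least squares coefficients.
b0, b1, b2 = -0.635, 23.451, -7.073
x1, x2 = 2.80, 7.0
y_hat = b0 + b1 * x1 + b2 * x2
residual = 25 - y_hat           # observed Y5 = 25 minus predicted value
print(round(y_hat, 2), round(residual, 2))  # → 15.52 9.48
```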
3a-22
Comparison of parameter estimates in the three models considered:

Ŷ = β̂0 + β̂1X1
                        Parameter    Standard
Variable    DF           Estimate       Error   t Value   Pr > |t|
Intercept    1          -25.04084    19.00283     -1.32     0.1935
thinking     1           15.95111     6.30947      2.53     0.0146

Ŷ = β̂0 + β̂1X2
                        Parameter    Standard
Variable    DF           Estimate       Error   t Value   Pr > |t|
Intercept    1           37.61326    19.53945      1.92     0.0598
hostile      1           -2.25150     2.93001     -0.77     0.4458

Ŷ = β̂0 + β̂1X1 + β̂2X2
                        Parameter    Standard
Variable    DF           Estimate       Error   t Value   Pr > |t|
Intercept    1           -0.63535    20.96833     -0.03     0.9759
thinking     1           23.45144     6.83851      3.43     0.0012
hostile      1           -7.07261     3.01092     -2.35     0.0228