Transcript of Multiple Regression ppt
Chap 14-1
Chapter 14
Introduction to Multiple Regression
Basic Business Statistics, 10th Edition
Chap 14-2
Learning Objectives
In this chapter, you learn:
How to develop a multiple regression model
How to interpret the regression coefficients
How to determine which independent variables to include in the regression model
How to determine which independent variables are more important in predicting a dependent variable
How to use categorical variables in a regression model
How to predict a categorical dependent variable using logistic regression
Chap 14-3
The Multiple Regression Model
Idea: Examine the linear relationship between 1 dependent (Y) & 2 or more independent variables (Xi)
Multiple regression model with k independent variables:
Y_i = β0 + β1 X_1i + β2 X_2i + … + βk X_ki + ε_i
where β0 is the Y-intercept, β1, …, βk are the population slopes, and ε_i is the random error
Chap 14-4
Multiple Regression Equation
The coefficients of the multiple regression model are estimated using sample data
Multiple regression equation with k independent variables:
Ŷ_i = b0 + b1 X_1i + b2 X_2i + … + bk X_ki
where Ŷ_i is the estimated (or predicted) value of Y, b0 is the estimated intercept, and b1, …, bk are the estimated slope coefficients
In this chapter we will always use Excel to obtain the regression slope coefficients and other regression
summary measures.
Chap 14-5
Multiple Regression Equation (continued)
Two-variable model: Ŷ = b0 + b1 X1 + b2 X2
(b1 is the slope for variable X1; b2 is the slope for variable X2)
[Figure: regression plane for Y over the X1 and X2 axes]
Chap 14-6
Example: 2 Independent Variables
A distributor of frozen dessert pies wants to evaluate factors thought to influence demand
Dependent variable: Pie sales (units per week)
Independent variables: Price (in $), Advertising ($100's)
Data are collected for 15 weeks
Chap 14-7
Pie Sales Example
Multiple regression equation: Sales = b0 + b1 (Price) + b2 (Advertising)

Week  Pie Sales  Price ($)  Advertising ($100s)
 1      350        5.50        3.3
 2      460        7.50        3.3
 3      350        8.00        3.0
 4      430        8.00        4.5
 5      350        6.80        3.0
 6      380        7.50        4.0
 7      430        4.50        3.0
 8      470        6.40        3.7
 9      450        7.00        3.5
10      490        5.00        4.0
11      340        7.20        3.5
12      300        7.90        3.2
13      440        5.90        4.0
14      450        5.00        3.5
15      300        7.00        2.7
Chap 14-8
Using Minitab
Stat | Regression | Regression … Enter appropriate variables for Response and Predictors
Chap 14-9
Minitab Multiple Regression Output
Regression Analysis: Pie Sales versus Price, Advertising
The regression equation is
Pie Sales = 307 - 25.0 Price + 74.1 Advertising

Predictor     Coef  SE Coef      T      P
Constant     306.5    114.3   2.68  0.020
Price       -24.98    10.83  -2.31  0.040
Advertising  74.13    25.97   2.85  0.014

S = 47.4634   R-Sq = 52.1%   R-Sq(adj) = 44.2%

Analysis of Variance

Source          DF     SS     MS     F      P
Regression       2  29460  14730  6.54  0.012
Residual Error  12  27033   2253
Total           14  56493

Source       DF  Seq SS
Price         1   11100
Advertising   1   18360
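The deck obtains these numbers from Minitab. As an illustrative cross-check (not part of the original slides), the same fit can be reproduced with ordinary least squares in NumPy, using the week-by-week data from the pie sales table above:

```python
# Reproduce the Minitab fit with NumPy least squares (illustrative cross-check).
import numpy as np

sales = np.array([350, 460, 350, 430, 350, 380, 430, 470, 450,
                  490, 340, 300, 440, 450, 300], dtype=float)
price = np.array([5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40, 7.00,
                  5.00, 7.20, 7.90, 5.90, 5.00, 7.00])
adv = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5,
                4.0, 3.5, 3.2, 4.0, 3.5, 2.7])

X = np.column_stack([np.ones_like(price), price, adv])  # intercept column first
b, *_ = np.linalg.lstsq(X, sales, rcond=None)
b0, b1, b2 = b  # intercept and the two slope estimates

resid = sales - X @ b
sse = float(resid @ resid)                        # error sum of squares
sst = float(((sales - sales.mean()) ** 2).sum())  # total sum of squares
r2 = 1 - sse / sst
print(b0, b1, b2, r2)
```

Running this recovers the coefficients reported above (about 306.5, -24.98, and 74.13) and R² ≈ .521.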
Chap 14-10
The Multiple Regression Equation
Sales = 306.5 - 24.98 (Price) + 74.13 (Advertising)
b1 = -24.98: sales will decrease, on average, by 24.98 pies per week for each $1 increase in selling price, net of the effects of changes due to advertising
b2 = 74.13: sales will increase, on average, by 74.13 pies per week for each $100 increase in advertising, net of the effects of changes due to price
where Sales is in number of pies per week Price is in $ Advertising is in $100’s.
Chap 14-11
Using The Equation to Make Predictions
Predict sales for a week in which the selling price is $5.50 and advertising is $350:
Sales = 306.5 - 24.98 (Price) + 74.13 (Advertising)
      = 306.5 - 24.98 (5.50) + 74.13 (3.5)
      = 428.6
Predicted sales is 428.6 pies
Note that Advertising is in $100's, so $350 means that X2 = 3.5
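The arithmetic of this prediction can be checked directly (a quick sketch using the rounded coefficients from the output above, not part of the original slides):

```python
# Predicted pie sales at Price = $5.50 and Advertising = $350 (so X2 = 3.5)
b0, b1, b2 = 306.5, -24.98, 74.13    # rounded coefficients from the output
price, advertising = 5.50, 3.5       # advertising is measured in $100s
pred = b0 + b1 * price + b2 * advertising
print(round(pred, 1))
```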
Chap 14-12
Predictions in Minitab
Select Stat | Regression | Regression … then click on the "Options" box
Check the Confidence limits and Prediction limits boxes
Enter desired value for each independent variable
Chap 14-13
Predictions in Minitab (continued)
Minitab output:

Predicted Values for New Observations
NewObs    Fit  SE Fit        95% CI            95% PI
     1  428.6    17.2  (391.1, 466.1)  (318.6, 538.6)

Values of Predictors for New Observations
NewObs  Price  Advertising
     1   5.50         3.50   <- input values

Fit is the predicted Y value; the 95% CI is the confidence interval for the mean Y value, given these X's; the 95% PI is the prediction interval for an individual Y value, given these X's
Chap 14-14
Coefficient of Multiple Determination
Reports the proportion of total variation in Y explained by all X variables taken together
r² = SSR / SST = regression sum of squares / total sum of squares
Chap 14-15
Regression Analysis: Pie Sales versus Price, Advertising
(Minitab regression output for Pie Sales versus Price, Advertising repeated here; see slide Chap 14-9 above)
r² = SSR / SST = 29460 / 56493 = .5215
52.1% of the variation in pie sales is explained by the variation in price and advertising
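The ratio can be verified from the ANOVA-table sums of squares (a quick numeric check, not part of the original slides):

```python
# r^2 = SSR / SST, using the sums of squares from the ANOVA table
ssr = 29460.0  # regression sum of squares
sst = 56493.0  # total sum of squares
r2 = ssr / sst
print(round(r2, 4))
```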
Multiple Coefficient of Determination
(continued)
Chap 14-16
Adjusted r2
r² never decreases when a new X variable is added to the model
This can be a disadvantage when comparing models
What is the net effect of adding a new variable?
We lose a degree of freedom when a new X variable is added
Did the new X variable add enough explanatory power to offset the loss of one degree of freedom?
Chap 14-17
Shows the proportion of variation in Y explained by all X variables adjusted for the number of X variables used
(where n = sample size, k = number of independent variables)
Penalize excessive use of unimportant independent variables Smaller than r2
Useful in comparing among models
Adjusted r2
(continued)
r²_adj = 1 - [(1 - r²) (n - 1) / (n - k - 1)]
Chap 14-18
Regression Analysis: Pie Sales versus Price, Advertising
(Minitab regression output for Pie Sales versus Price, Advertising repeated here; see slide Chap 14-9 above)
Multiple Coefficient of Determination
(continued)
r²_adj = .442
44.2% of the variation in pie sales is explained by the variation in price and advertising, taking into account the sample size and number of independent variables
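The adjusted value follows from the formula on the previous slide, with n = 15 and k = 2 (a quick numeric check, not part of the original slides):

```python
# Adjusted r^2 penalizes for the number of predictors used
ssr, sst = 29460.0, 56493.0
n, k = 15, 2
r2 = ssr / sst
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(r2_adj, 3))
```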
Chap 14-19
Is the Model Significant?
F Test for Overall Significance of the Model
Shows if there is a linear relationship between all of the X variables considered together and Y
Use F-test statistic
Hypotheses:
H0: β1 = β2 = … = βk = 0 (no linear relationship)
H1: at least one βi ≠ 0 (at least one independent variable affects Y)
Chap 14-20
F Test for Overall Significance
Test statistic:
F = MSR / MSE = (SSR / k) / (SSE / (n - k - 1))
where F has k numerator and (n - k - 1) denominator degrees of freedom
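With the pie-sales sums of squares, the statistic works out as follows (a quick numeric check, not part of the original slides):

```python
# F = MSR / MSE = (SSR / k) / (SSE / (n - k - 1))
ssr, sse = 29460.0, 27033.0
n, k = 15, 2
msr = ssr / k                # mean square regression, 14730
mse = sse / (n - k - 1)      # mean square error, about 2253
F = msr / mse
print(round(F, 2))
```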
Chap 14-21
(continued)
F Test for Overall Significance
With 2 and 12 degrees of freedom
P-value for the F Test
Regression Analysis: Pie Sales versus Price, Advertising
(Minitab regression output for Pie Sales versus Price, Advertising repeated here; see slide Chap 14-9 above)
F = MSR / MSE = 14730 / 2253 = 6.54
Chap 14-22
H0: β1 = β2 = 0
H1: β1 and β2 not both zero
α = .05, df1 = 2, df2 = 12
Critical Value: F.05 = 3.885
Test Statistic: F = MSR / MSE = 6.54
Decision: since the F test statistic is in the rejection region (F = 6.54 > F.05 = 3.885, p-value < .05), reject H0
Conclusion: there is evidence that at least one independent variable affects Y
[Figure: F distribution with the α = .05 rejection region to the right of F.05 = 3.885]
F Test for Overall Significance (continued)
Chap 14-23
Are Individual Variables Significant?
Use t tests of individual variable slopes Shows if there is a linear relationship between
the variable Xj and Y
Hypotheses: H0: βj = 0 (no linear relationship)
H1: βj ≠ 0 (linear relationship does exist between Xj and Y)
Chap 14-24
Are Individual Variables Significant?
H0: βj = 0 (no linear relationship)
H1: βj ≠ 0 (linear relationship does exist between xj and y)
Test Statistic:
t = (b_j - 0) / S_b_j    (df = n - k - 1)
(continued)
Chap 14-25
(continued)
Are Individual Variables Significant?
Regression Analysis: Pie Sales versus Price, Advertising
(Minitab regression output for Pie Sales versus Price, Advertising repeated here; see slide Chap 14-9 above)
t-value for Price is t = -2.31, with p-value 0.04
t-value for Advertising is t = 2.85, with p-value 0.014
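Each t value is simply the coefficient divided by its standard error (a quick numeric check using the output values, not part of the original slides):

```python
# t = b_j / S_bj for each slope, from the Minitab output values
t_price = -24.98 / 10.83
t_adv = 74.13 / 25.97
print(round(t_price, 2), round(t_adv, 2))
```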
Chap 14-26
d.f. = 15 - 2 - 1 = 12, α = .05
tα/2 = 2.1788
Inferences about the Slope: t Test Example
H0: βj = 0
H1: βj ≠ 0
From Minitab output:
             Coef  SE Coef      T      P
Price      -24.98    10.83  -2.31  0.040
Advertising 74.13    25.97   2.85  0.014
Decision: the test statistic for each variable falls in the rejection region (|t| > tα/2 = 2.1788; p-values < .05), so reject H0 for each variable
Conclusion: there is evidence that both Price and Advertising affect pie sales at α = .05
[Figure: t distribution with α/2 = .025 rejection regions beyond ±2.1788]
Chap 14-27
Confidence Interval Estimate for the Slope
Confidence interval for the population slope βj
Example: Form a 95% confidence interval for the effect of changes in price (X1) on pie sales:
-24.98 ± (2.1788)(10.83)
So the interval is (-48.576 , -1.374)
(This interval does not contain zero, so price has a significant effect on sales)
b_j ± t_(n-k-1) · S_b_j    where t has (n - k - 1) d.f.
d.f. = 15 - 2 - 1 = 12, α = .05
tα/2 = 2.1788
             Coef  SE Coef
Price      -24.98    10.83
Advertising 74.13    25.97
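The interval endpoints can be computed directly (a sketch using the rounded coefficient from the output, so the upper limit differs very slightly from the slide's -1.374; not part of the original slides):

```python
# 95% CI for the Price slope: b1 +/- t * S_b1, with t(12 d.f.) = 2.1788
b1, se_b1, t_crit = -24.98, 10.83, 2.1788
lower = b1 - t_crit * se_b1
upper = b1 + t_crit * se_b1
print(round(lower, 3), round(upper, 3))
```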
Chap 14-28
Confidence Interval Estimate for the Slope
Confidence interval for the population slope βj
Interpretation:
Weekly sales are estimated to be reduced by between 1.37 and 48.58 pies for each $1 increase in the selling price
(continued)
Example: Form a 95% confidence interval for the effect of changes in price (X1) on pie sales:
-24.98 ± (2.1788)(10.83)
So the interval is (-48.576 , -1.374)
Chap 14-29
Assumptions of Regression
Use the acronym LINE:
Linearity: the underlying relationship between X and Y is linear
Independence of Errors: error values are statistically independent
Normality of Error: error values (ε) are normally distributed for any given value of X
Equal Variance (Homoscedasticity): the probability distribution of the errors has constant variance
Chap 14-30
Residual Analysis
The residual for observation i, e_i, is the difference between its observed and predicted value:
e_i = Y_i - Ŷ_i
Check the assumptions of regression by examining the residuals
Examine for linearity assumption
Evaluate independence assumption
Evaluate normal distribution assumption
Examine for constant variance for all levels of X (homoscedasticity)
Graphical Analysis of Residuals: can plot residuals vs. X
Chap 14-31
Residual Analysis for Linearity
[Figures: Y vs. X scatterplots and residuals-vs.-X plots; a curved residual pattern indicates Not Linear, a random scatter indicates Linear]
Chap 14-32
Residual Analysis for Independence
[Figures: residuals-vs.-X plots; a systematic pattern indicates Not Independent, a random scatter indicates Independent]
Chap 14-33
Residual Analysis for Normality
A normal probability plot of the residuals can be used to check for normality:
[Figure: Percent vs. Residual normal probability plot; points close to a straight line indicate normality]
Chap 14-34
Residual Analysis for Equal Variance
[Figures: residuals-vs.-X plots; a fan-shaped spread indicates Non-constant variance, a uniform band indicates Constant variance]
Chap 14-35
Examples of Minitab Plots
Chap 14-36
Used when data are collected over time to detect if autocorrelation is present
Autocorrelation exists if residuals in one time period are related to residuals in another period
Measuring Autocorrelation:The Durbin-Watson Statistic
Chap 14-37
Autocorrelation
Autocorrelation is correlation of the errors (residuals) over time
Violates the regression assumption that residuals are random and independent
[Figure: Time (t) Residual Plot; residuals plotted against time, ranging from about -15 to +15]
Here, the residuals show a cyclic pattern, not random. Cyclical patterns are a sign of positive autocorrelation
Chap 14-38
The Durbin-Watson Statistic
D = Σ_{i=2..n} (e_i - e_{i-1})² / Σ_{i=1..n} e_i²
The possible range is 0 ≤ D ≤ 4
D should be close to 2 if H0 is true
D less than 2 may signal positive autocorrelation, D greater than 2 may signal negative autocorrelation
The Durbin-Watson statistic is used to test for autocorrelation
H0: residuals are not correlated
H1: positive autocorrelation is present
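The definition translates directly into code. A sketch on made-up residual series (not the textbook's data; the function implements the formula above):

```python
# Durbin-Watson: D = sum_{i=2..n}(e_i - e_{i-1})^2 / sum_{i=1..n} e_i^2
def durbin_watson(residuals):
    num = sum((residuals[i] - residuals[i - 1]) ** 2
              for i in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den

# Runs of like-signed residuals -> D well below 2 (positive autocorrelation)
positive = [1.0, 1.2, 0.9, 1.1, -0.8, -1.0, -1.1, -0.9]
# Residuals flipping sign every period -> D well above 2 (negative autocorrelation)
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
print(durbin_watson(positive), durbin_watson(alternating))
```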
Chap 14-39
Testing for Positive Autocorrelation
H0: positive autocorrelation does not exist
H1: positive autocorrelation is present
Calculate the Durbin-Watson test statistic D (the Durbin-Watson statistic can be found using Excel or Minitab)
Find the values dL and dU from the Durbin-Watson table (for sample size n and number of independent variables k)
Decision rule: reject H0 if D < dL; the test is inconclusive if dL ≤ D ≤ dU; do not reject H0 if D > dU
[Scale: 0 — Reject H0 — dL — Inconclusive — dU — Do not reject H0 — 2]
Chap 14-40
Suppose we have the following time series data:
Is there autocorrelation?
[Figure: scatterplot of Sales vs. Time with fitted line y = 30.65 + 4.7038x, R² = 0.8976]
Testing for Positive Autocorrelation
(continued)
Chap 14-41
Example with n = 25:
Durbin-Watson Calculations
Sum of Squared Difference of Residuals   3296.18
Sum of Squared Residuals                 3279.98
Durbin-Watson Statistic                  1.00494
[Figure: scatterplot of Sales vs. Time with fitted line y = 30.65 + 4.7038x, R² = 0.8976]
Testing for Positive Autocorrelation
(continued)
D = Σ(e_i - e_{i-1})² / Σ e_i² = 3296.18 / 3279.98 = 1.00494
Chap 14-42
Minitab Output
Chap 14-43
Here, n = 25 and there is k = 1 independent variable
Using the Durbin-Watson table, dL = 1.29 and dU = 1.45
D = 1.00494 < dL = 1.29, so reject H0 and conclude that significant positive autocorrelation exists
Therefore the linear model is not the appropriate model to forecast sales
Testing for Positive Autocorrelation
(continued)
Decision: reject H0 since D = 1.00494 < dL = 1.29
[Scale: 0 — Reject H0 — dL = 1.29 — Inconclusive — dU = 1.45 — Do not reject H0 — 2]
Chap 14-44
Influence Analysis
Several methods can be used to measure the influence of individual observations:
The hat matrix elements hi
The Studentized deleted residuals ti
Cook’s distance statistic Di
Chap 14-45
The Hat Matrix Elements hi
The hat matrix diagonal element for observation i, denoted hi, reflects the possible influence of Xi on the regression equation
If potentially influential observations are present, you may need to reevaluate the need to keep them in the model
Rule: if hi > 2(k + 1)/n , then Xi is an influential observation and is a candidate for removal from the model
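The hat values come from the design matrix. A sketch for the pie-sales predictors (not shown in the deck), applying the 2(k + 1)/n rule above:

```python
# Leverage: h_i is the i-th diagonal element of H = X (X'X)^{-1} X'
import numpy as np

price = np.array([5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40, 7.00,
                  5.00, 7.20, 7.90, 5.90, 5.00, 7.00])
adv = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5,
                4.0, 3.5, 3.2, 4.0, 3.5, 2.7])
X = np.column_stack([np.ones_like(price), price, adv])

h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

n, k = len(price), 2
threshold = 2 * (k + 1) / n   # = 0.4 for n = 15, k = 2
influential = np.where(h > threshold)[0]
print(h.round(3), influential)
```

A useful sanity check: the hat diagonals always sum to the number of fitted parameters (k + 1).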
Chap 14-46
The Studentized Deleted Residuals ti
The Studentized deleted residual measures the difference of each Yi from the value predicted by a model that includes all observations except observation i
Expressed as a t statistic
Chap 14-47
The Studentized Deleted Residuals ti
The Studentized deleted residual:
where e_i = the residual for observation i, k = number of independent variables, SSE = error sum of squares, h_i = hat matrix diagonal element for observation i
If t_i > t_(n-k-2) or t_i < -t_(n-k-2) (using a two-tail test at α = 0.10), then observation i is highly influential and a candidate for removal
(continued)
t_i = e_i · sqrt[ (n - k - 2) / (SSE (1 - h_i) - e_i²) ]
Chap 14-48
Cook’s distance statistic Di
Cook’s distance statistic Di is based on both hi and the Studentized residual
Used to decide whether an observation flagged by either the hi or ti criterion is unduly affecting the model
Chap 14-49
Cook’s distance statistic Di
Cook’s Di statistic
where e_i = the residual for observation i, k = number of independent variables, SSE = error sum of squares, h_i = hat matrix diagonal element for observation i
If D_i > F_(k+1, n-k-1) at α = 0.05, then the observation is highly influential on the regression equation and is a candidate for removal
(continued)
D_i = [e_i² / (k · MSE)] · [h_i / (1 - h_i)²]
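A sketch applying the slide's Cook's-distance formula to the pie-sales fit (illustrative only; the deck does not show these values):

```python
# Cook's distance per the slide: D_i = e_i^2 / (k * MSE) * h_i / (1 - h_i)^2
import numpy as np

sales = np.array([350, 460, 350, 430, 350, 380, 430, 470, 450,
                  490, 340, 300, 440, 450, 300], dtype=float)
price = np.array([5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40, 7.00,
                  5.00, 7.20, 7.90, 5.90, 5.00, 7.00])
adv = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5,
                4.0, 3.5, 3.2, 4.0, 3.5, 2.7])
X = np.column_stack([np.ones_like(price), price, adv])

b, *_ = np.linalg.lstsq(X, sales, rcond=None)
e = sales - X @ b                                  # residuals
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)      # hat diagonals

n, k = len(sales), 2
mse = (e @ e) / (n - k - 1)                        # ~2253 from the ANOVA table
D = e**2 / (k * mse) * h / (1 - h)**2
print(D.round(3))
```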
Chap 14-50
Collinearity
Collinearity: High correlation exists among two or more independent variables
This means the correlated variables contribute redundant information to the multiple regression model
Chap 14-51
Collinearity
Including two highly correlated independent variables can adversely affect the regression results:
No new information provided
Can lead to unstable coefficients (large standard error and low t-values)
Coefficient signs may not match prior expectations
(continued)
Chap 14-52
Some Indications of Strong Collinearity
Incorrect signs on the coefficients
Large change in the value of a previous coefficient when a new variable is added to the model
A previously significant variable becomes non-significant when a new independent variable is added
The estimate of the standard deviation of the model increases when a variable is added to the model
Chap 14-53
Detecting Collinearity (Variance Inflationary Factor)
VIFj is used to measure collinearity:
If VIFj > 5, Xj is highly correlated with the other independent variables
where R2j is the coefficient of determination of
variable Xj with all other X variables
VIF_j = 1 / (1 - R_j²)
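With only two X variables, R_j² is just the squared correlation between them, so the VIF can be computed directly (a sketch on the pie-sales predictors, not part of the original slides):

```python
# VIF_j = 1 / (1 - R_j^2); for two predictors, R_j^2 = corr(Price, Advertising)^2
import numpy as np

price = np.array([5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40, 7.00,
                  5.00, 7.20, 7.90, 5.90, 5.00, 7.00])
adv = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5,
                4.0, 3.5, 3.2, 4.0, 3.5, 2.7])

r = np.corrcoef(price, adv)[0, 1]   # correlation between the two predictors
vif = 1.0 / (1.0 - r**2)
print(round(vif, 2))
```

This matches the VIF of 1.0 that Minitab reports for both predictors on the next slide.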
Chap 14-54
Example: Pie Sales
Sales = b0 + b1 (Price) + b2 (Advertising)

Week  Pie Sales  Price ($)  Advertising ($100s)
 1      350        5.50        3.3
 2      460        7.50        3.3
 3      350        8.00        3.0
 4      430        8.00        4.5
 5      350        6.80        3.0
 6      380        7.50        4.0
 7      430        4.50        3.0
 8      470        6.40        3.7
 9      450        7.00        3.5
10      490        5.00        4.0
11      340        7.20        3.5
12      300        7.90        3.2
13      440        5.90        4.0
14      450        5.00        3.5
15      300        7.00        2.7
Recall the multiple regression equation of chapter 13:
Chap 14-55
Detecting Collinearity in Minitab
Stat | Regression | Regression …
Click on the "Options" box, then select the "Variance inflationary factors" box
Regression Analysis: Pie Sales versus Price, Advertising
The regression equation is
Pie Sales = 307 - 25.0 Price + 74.1 Advertising

Predictor     Coef  SE Coef      T      P  VIF
Constant     306.5    114.3   2.68  0.020
Price       -24.98    10.83  -2.31  0.040  1.0
Advertising  74.13    25.97   2.85  0.014  1.0

S = 47.4634   R-Sq = 52.1%   R-Sq(adj) = 44.2%
Output for the pie sales example:
VIFs are < 5, so there is no evidence of collinearity between Price and Advertising
Chap 14-56
Chapter Summary
Developed the multiple regression model
Tested the significance of the multiple regression model
Discussed adjusted r²
Discussed using residual plots to check model assumptions
Tested individual regression coefficients