Transcript of Multiple Regression ppt
Chap 14-1
Chapter 14
Introduction to Multiple Regression
Basic Business Statistics, 10th Edition
Chap 14-2
Learning Objectives
In this chapter, you learn:
How to develop a multiple regression model
How to interpret the regression coefficients
How to determine which independent variables to include in the regression model
How to determine which independent variables are more important in predicting a dependent variable
How to use categorical variables in a regression model
How to predict a categorical dependent variable using logistic regression
Chap 14-3
The Multiple Regression Model
Idea: Examine the linear relationship between 1 dependent (Y) & 2 or more independent variables (Xi)
Multiple regression model with k independent variables:
Y_i = β0 + β1 X_1i + β2 X_2i + … + βk X_ki + ε_i
where β0 is the Y-intercept, β1, …, βk are the population slopes, and ε_i is the random error
Chap 14-4
Multiple Regression Equation
The coefficients of the multiple regression model are estimated using sample data
Multiple regression equation with k independent variables:
Ŷ_i = b0 + b1 X_1i + b2 X_2i + … + bk X_ki
where Ŷ_i is the estimated (or predicted) value of Y, b0 is the estimated intercept, and b1, …, bk are the estimated slope coefficients
In this chapter we will always use Excel to obtain the regression slope coefficients and other regression
summary measures.
Chap 14-5
Multiple Regression Equation (continued)
Two-variable model: Ŷ = b0 + b1 X1 + b2 X2
(b1 is the slope for variable X1; b2 is the slope for variable X2)
[Figure: regression plane for Y over the X1 and X2 axes]
Chap 14-6
Example: 2 Independent Variables
A distributor of frozen dessert pies wants to evaluate factors thought to influence demand
Dependent variable: Pie sales (units per week)
Independent variables: Price (in $), Advertising ($100's)
Data are collected for 15 weeks
Chap 14-7
Pie Sales Example
Multiple regression equation: Sales = b0 + b1 (Price) + b2 (Advertising)

Week  Pie Sales  Price ($)  Advertising ($100s)
 1      350        5.50        3.3
 2      460        7.50        3.3
 3      350        8.00        3.0
 4      430        8.00        4.5
 5      350        6.80        3.0
 6      380        7.50        4.0
 7      430        4.50        3.0
 8      470        6.40        3.7
 9      450        7.00        3.5
10      490        5.00        4.0
11      340        7.20        3.5
12      300        7.90        3.2
13      440        5.90        4.0
14      450        5.00        3.5
15      300        7.00        2.7
Chap 14-8
Using Minitab
Stat | Regression | Regression … Enter appropriate variables for Response and Predictors
Chap 14-9
Minitab Multiple Regression Output
Regression Analysis: Pie Sales versus Price, Advertising
The regression equation is
Pie Sales = 307 - 25.0 Price + 74.1 Advertising

Predictor     Coef  SE Coef      T      P
Constant     306.5    114.3   2.68  0.020
Price       -24.98    10.83  -2.31  0.040
Advertising  74.13    25.97   2.85  0.014

S = 47.4634   R-Sq = 52.1%   R-Sq(adj) = 44.2%

Analysis of Variance

Source          DF     SS     MS     F      P
Regression       2  29460  14730  6.54  0.012
Residual Error  12  27033   2253
Total           14  56493

Source       DF  Seq SS
Price         1   11100
Advertising   1   18360
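The deck obtains these numbers from Minitab. As an illustrative cross-check (not part of the original slides), the same fit can be reproduced with ordinary least squares in NumPy, using the week-by-week data from the pie sales table above:

```python
# Reproduce the Minitab fit with NumPy least squares (illustrative cross-check).
import numpy as np

sales = np.array([350, 460, 350, 430, 350, 380, 430, 470, 450,
                  490, 340, 300, 440, 450, 300], dtype=float)
price = np.array([5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40, 7.00,
                  5.00, 7.20, 7.90, 5.90, 5.00, 7.00])
adv = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5,
                4.0, 3.5, 3.2, 4.0, 3.5, 2.7])

X = np.column_stack([np.ones_like(price), price, adv])  # intercept column first
b, *_ = np.linalg.lstsq(X, sales, rcond=None)
b0, b1, b2 = b  # intercept and the two slope estimates

resid = sales - X @ b
sse = float(resid @ resid)                        # error sum of squares
sst = float(((sales - sales.mean()) ** 2).sum())  # total sum of squares
r2 = 1 - sse / sst
print(b0, b1, b2, r2)
```

Running this recovers the coefficients reported above (about 306.5, -24.98, and 74.13) and R² ≈ .521.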
Chap 14-10
The Multiple Regression Equation
Sales = 306.5 - 24.98 (Price) + 74.13 (Advertising)
b1 = -24.98: sales will decrease, on average, by 24.98 pies per week for each $1 increase in selling price, net of the effects of changes due to advertising
b2 = 74.13: sales will increase, on average, by 74.13 pies per week for each $100 increase in advertising, net of the effects of changes due to price
where Sales is in number of pies per week Price is in $ Advertising is in $100’s.
Chap 14-11
Using The Equation to Make Predictions
Predict sales for a week in which the selling price is $5.50 and advertising is $350:
Sales = 306.5 - 24.98 (Price) + 74.13 (Advertising)
      = 306.5 - 24.98 (5.50) + 74.13 (3.5)
      = 428.6
Predicted sales is 428.6 pies
Note that Advertising is in $100's, so $350 means that X2 = 3.5
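The arithmetic of this prediction can be checked directly (a quick sketch using the rounded coefficients from the output above, not part of the original slides):

```python
# Predicted pie sales at Price = $5.50 and Advertising = $350 (so X2 = 3.5)
b0, b1, b2 = 306.5, -24.98, 74.13    # rounded coefficients from the output
price, advertising = 5.50, 3.5       # advertising is measured in $100s
pred = b0 + b1 * price + b2 * advertising
print(round(pred, 1))
```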
Chap 14-12
Predictions in Minitab
Select Stat | Regression | Regression … then click on the "Options" box
Check the Confidence limits and Prediction limits boxes
Enter desired value for each independent variable
Chap 14-13
Predictions in Minitab (continued)
Minitab output:

Predicted Values for New Observations
NewObs    Fit  SE Fit        95% CI            95% PI
     1  428.6    17.2  (391.1, 466.1)  (318.6, 538.6)

Values of Predictors for New Observations
NewObs  Price  Advertising
     1   5.50         3.50   <- input values

Fit is the predicted Y value; the 95% CI is the confidence interval for the mean Y value, given these X's; the 95% PI is the prediction interval for an individual Y value, given these X's
Chap 14-14
Coefficient of Multiple Determination
Reports the proportion of total variation in Y explained by all X variables taken together
r² = SSR / SST = regression sum of squares / total sum of squares
Chap 14-15
Regression Analysis: Pie Sales versus Price, Advertising
(Minitab regression output for Pie Sales versus Price, Advertising repeated here; see slide Chap 14-9 above)
r² = SSR / SST = 29460 / 56493 = .5215
52.1% of the variation in pie sales is explained by the variation in price and advertising
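The ratio can be verified from the ANOVA-table sums of squares (a quick numeric check, not part of the original slides):

```python
# r^2 = SSR / SST, using the sums of squares from the ANOVA table
ssr = 29460.0  # regression sum of squares
sst = 56493.0  # total sum of squares
r2 = ssr / sst
print(round(r2, 4))
```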
Multiple Coefficient of Determination
(continued)
Chap 14-16
Adjusted r2
r² never decreases when a new X variable is added to the model
This can be a disadvantage when comparing models
What is the net effect of adding a new variable?
We lose a degree of freedom when a new X variable is added
Did the new X variable add enough explanatory power to offset the loss of one degree of freedom?
Chap 14-17
Shows the proportion of variation in Y explained by all X variables adjusted for the number of X variables used
(where n = sample size, k = number of independent variables)
Penalize excessive use of unimportant independent variables Smaller than r2
Useful in comparing among models
Adjusted r2
(continued)
r²_adj = 1 - [(1 - r²) (n - 1) / (n - k - 1)]
Chap 14-18
Regression Analysis: Pie Sales versus Price, Advertising
(Minitab regression output for Pie Sales versus Price, Advertising repeated here; see slide Chap 14-9 above)
Multiple Coefficient of Determination
(continued)
r²_adj = .442
44.2% of the variation in pie sales is explained by the variation in price and advertising, taking into account the sample size and number of independent variables
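The adjusted value follows from the formula on the previous slide, with n = 15 and k = 2 (a quick numeric check, not part of the original slides):

```python
# Adjusted r^2 penalizes for the number of predictors used
ssr, sst = 29460.0, 56493.0
n, k = 15, 2
r2 = ssr / sst
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(r2_adj, 3))
```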
Chap 14-19
Is the Model Significant?
F Test for Overall Significance of the Model
Shows if there is a linear relationship between all of the X variables considered together and Y
Use F-test statistic
Hypotheses:
H0: β1 = β2 = … = βk = 0 (no linear relationship)
H1: at least one βi ≠ 0 (at least one independent variable affects Y)
Chap 14-20
F Test for Overall Significance
Test statistic:
F = MSR / MSE = (SSR / k) / (SSE / (n - k - 1))
where F has k numerator and (n - k - 1) denominator degrees of freedom
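With the pie-sales sums of squares, the statistic works out as follows (a quick numeric check, not part of the original slides):

```python
# F = MSR / MSE = (SSR / k) / (SSE / (n - k - 1))
ssr, sse = 29460.0, 27033.0
n, k = 15, 2
msr = ssr / k                # mean square regression, 14730
mse = sse / (n - k - 1)      # mean square error, about 2253
F = msr / mse
print(round(F, 2))
```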
Chap 14-21
(continued)
F Test for Overall Significance
With 2 and 12 degrees of freedom
P-value for the F Test
Regression Analysis: Pie Sales versus Price, Advertising
(Minitab regression output for Pie Sales versus Price, Advertising repeated here; see slide Chap 14-9 above)
F = MSR / MSE = 14730 / 2253 = 6.54
Chap 14-22
H0: β1 = β2 = 0
H1: β1 and β2 not both zero
α = .05, df1 = 2, df2 = 12
Critical Value: F.05 = 3.885
Test Statistic: F = MSR / MSE = 6.54
Decision: since the F test statistic is in the rejection region (F = 6.54 > F.05 = 3.885, p-value < .05), reject H0
Conclusion: there is evidence that at least one independent variable affects Y
[Figure: F distribution with the α = .05 rejection region to the right of F.05 = 3.885]
F Test for Overall Significance (continued)
Chap 14-23
Are Individual Variables Significant?
Use t tests of individual variable slopes Shows if there is a linear relationship between
the variable Xj and Y
Hypotheses: H0: βj = 0 (no linear relationship)
H1: βj ≠ 0 (linear relationship does exist between Xj and Y)
Chap 14-24
Are Individual Variables Significant?
H0: βj = 0 (no linear relationship)
H1: βj ≠ 0 (linear relationship does exist between xj and y)
Test Statistic:
t = (b_j - 0) / S_b_j    (df = n - k - 1)
(continued)
Chap 14-25
(continued)
Are Individual Variables Significant?
Regression Analysis: Pie Sales versus Price, Advertising
(Minitab regression output for Pie Sales versus Price, Advertising repeated here; see slide Chap 14-9 above)
t-value for Price is t = -2.31, with p-value 0.04
t-value for Advertising is t = 2.85, with p-value 0.014
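Each t value is simply the coefficient divided by its standard error (a quick numeric check using the output values, not part of the original slides):

```python
# t = b_j / S_bj for each slope, from the Minitab output values
t_price = -24.98 / 10.83
t_adv = 74.13 / 25.97
print(round(t_price, 2), round(t_adv, 2))
```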
Chap 14-26
d.f. = 15 - 2 - 1 = 12, α = .05
tα/2 = 2.1788
Inferences about the Slope: t Test Example
H0: βj = 0
H1: βj ≠ 0
From Minitab output:
             Coef  SE Coef      T      P
Price      -24.98    10.83  -2.31  0.040
Advertising 74.13    25.97   2.85  0.014
Decision: the test statistic for each variable falls in the rejection region (|t| > tα/2 = 2.1788; p-values < .05), so reject H0 for each variable
Conclusion: there is evidence that both Price and Advertising affect pie sales at α = .05
[Figure: t distribution with α/2 = .025 rejection regions beyond ±2.1788]
Chap 14-27
Confidence Interval Estimate for the Slope
Confidence interval for the population slope βj
Example: Form a 95% confidence interval for the effect of changes in price (X1) on pie sales:
-24.98 ± (2.1788)(10.83)
So the interval is (-48.576 , -1.374)
(This interval does not contain zero, so price has a significant effect on sales)
b_j ± t_(n-k-1) · S_b_j    where t has (n - k - 1) d.f.
d.f. = 15 - 2 - 1 = 12, α = .05
tα/2 = 2.1788
             Coef  SE Coef
Price      -24.98    10.83
Advertising 74.13    25.97
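The interval endpoints can be computed directly (a sketch using the rounded coefficient from the output, so the upper limit differs very slightly from the slide's -1.374; not part of the original slides):

```python
# 95% CI for the Price slope: b1 +/- t * S_b1, with t(12 d.f.) = 2.1788
b1, se_b1, t_crit = -24.98, 10.83, 2.1788
lower = b1 - t_crit * se_b1
upper = b1 + t_crit * se_b1
print(round(lower, 3), round(upper, 3))
```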
Chap 14-28
Confidence Interval Estimate for the Slope
Confidence interval for the population slope βj
Interpretation:
Weekly sales are estimated to be reduced by between 1.37 and 48.58 pies for each $1 increase in the selling price
(continued)
Example: Form a 95% confidence interval for the effect of changes in price (X1) on pie sales:
-24.98 ± (2.1788)(10.83)
So the interval is (-48.576 , -1.374)
Chap 14-29
Assumptions of Regression
Use the acronym LINE:
Linearity: the underlying relationship between X and Y is linear
Independence of Errors: error values are statistically independent
Normality of Error: error values (ε) are normally distributed for any given value of X
Equal Variance (Homoscedasticity): the probability distribution of the errors has constant variance
Chap 14-30
Residual Analysis
The residual for observation i, e_i, is the difference between its observed and predicted value:
e_i = Y_i - Ŷ_i
Check the assumptions of regression by examining the residuals
Examine for linearity assumption
Evaluate independence assumption
Evaluate normal distribution assumption
Examine for constant variance for all levels of X (homoscedasticity)
Graphical Analysis of Residuals: can plot residuals vs. X
Chap 14-31
Residual Analysis for Linearity
[Figures: Y vs. X scatterplots and residuals-vs.-X plots; a curved residual pattern indicates Not Linear, a random scatter indicates Linear]
Chap 14-32
Residual Analysis for Independence
[Figures: residuals-vs.-X plots; a systematic pattern indicates Not Independent, a random scatter indicates Independent]
Chap 14-33
Residual Analysis for Normality
A normal probability plot of the residuals can be used to check for normality:
[Figure: Percent vs. Residual normal probability plot; points close to a straight line indicate normality]
Chap 14-34
Residual Analysis for Equal Variance
[Figures: residuals-vs.-X plots; a fan-shaped spread indicates Non-constant variance, a uniform band indicates Constant variance]
Chap 14-35
Examples of Minitab Plots
Chap 14-36
Used when data are collected over time to detect if autocorrelation is present
Autocorrelation exists if residuals in one time period are related to residuals in another period
Measuring Autocorrelation:The Durbin-Watson Statistic
Chap 14-37
Autocorrelation
Autocorrelation is correlation of the errors (residuals) over time
Violates the regression assumption that residuals are random and independent
[Figure: Time (t) Residual Plot; residuals plotted against time, ranging from about -15 to +15]
Here, the residuals show a cyclic pattern, not random. Cyclical patterns are a sign of positive autocorrelation
Chap 14-38
The Durbin-Watson Statistic
D = Σ_{i=2..n} (e_i - e_{i-1})² / Σ_{i=1..n} e_i²
The possible range is 0 ≤ D ≤ 4
D should be close to 2 if H0 is true
D less than 2 may signal positive autocorrelation, D greater than 2 may signal negative autocorrelation
The Durbin-Watson statistic is used to test for autocorrelation
H0: residuals are not correlated
H1: positive autocorrelation is present
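The definition translates directly into code. A sketch on made-up residual series (not the textbook's data; the function implements the formula above):

```python
# Durbin-Watson: D = sum_{i=2..n}(e_i - e_{i-1})^2 / sum_{i=1..n} e_i^2
def durbin_watson(residuals):
    num = sum((residuals[i] - residuals[i - 1]) ** 2
              for i in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den

# Runs of like-signed residuals -> D well below 2 (positive autocorrelation)
positive = [1.0, 1.2, 0.9, 1.1, -0.8, -1.0, -1.1, -0.9]
# Residuals flipping sign every period -> D well above 2 (negative autocorrelation)
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
print(durbin_watson(positive), durbin_watson(alternating))
```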
Chap 14-39
Testing for Positive Autocorrelation
H0: positive autocorrelation does not exist
H1: positive autocorrelation is present
Calculate the Durbin-Watson test statistic D (the Durbin-Watson statistic can be found using Excel or Minitab)
Find the values dL and dU from the Durbin-Watson table (for sample size n and number of independent variables k)
Decision rule: reject H0 if D < dL; the test is inconclusive if dL ≤ D ≤ dU; do not reject H0 if D > dU
[Scale: 0 — Reject H0 — dL — Inconclusive — dU — Do not reject H0 — 2]
Chap 14-40
Suppose we have the following time series data:
Is there autocorrelation?
[Figure: scatterplot of Sales vs. Time with fitted line y = 30.65 + 4.7038x, R² = 0.8976]
Testing for Positive Autocorrelation
(continued)
Chap 14-41
Example with n = 25:
Durbin-Watson Calculations
Sum of Squared Difference of Residuals   3296.18
Sum of Squared Residuals                 3279.98
Durbin-Watson Statistic                  1.00494
[Figure: scatterplot of Sales vs. Time with fitted line y = 30.65 + 4.7038x, R² = 0.8976]
Testing for Positive Autocorrelation
(continued)
D = Σ(e_i - e_{i-1})² / Σ e_i² = 3296.18 / 3279.98 = 1.00494
Chap 14-42
Minitab Output
Chap 14-43
Here, n = 25 and there is k = 1 independent variable
Using the Durbin-Watson table, dL = 1.29 and dU = 1.45
D = 1.00494 < dL = 1.29, so reject H0 and conclude that significant positive autocorrelation exists
Therefore the linear model is not the appropriate model to forecast sales
Testing for Positive Autocorrelation
(continued)
Decision: reject H0 since D = 1.00494 < dL = 1.29
[Scale: 0 — Reject H0 — dL = 1.29 — Inconclusive — dU = 1.45 — Do not reject H0 — 2]
Chap 14-44
Influence Analysis
Several methods can be used to measure the influence of individual observations:
The hat matrix elements hi
The Studentized deleted residuals ti
Cook’s distance statistic Di
Chap 14-45
The Hat Matrix Elements hi
The hat matrix diagonal element for observation i, denoted hi, reflects the possible influence of Xi on the regression equation
If potentially influential observations are present, you may need to reevaluate the need to keep them in the model
Rule: if hi > 2(k + 1)/n , then Xi is an influential observation and is a candidate for removal from the model
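The hat values come from the design matrix. A sketch for the pie-sales predictors (not shown in the deck), applying the 2(k + 1)/n rule above:

```python
# Leverage: h_i is the i-th diagonal element of H = X (X'X)^{-1} X'
import numpy as np

price = np.array([5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40, 7.00,
                  5.00, 7.20, 7.90, 5.90, 5.00, 7.00])
adv = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5,
                4.0, 3.5, 3.2, 4.0, 3.5, 2.7])
X = np.column_stack([np.ones_like(price), price, adv])

h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

n, k = len(price), 2
threshold = 2 * (k + 1) / n   # = 0.4 for n = 15, k = 2
influential = np.where(h > threshold)[0]
print(h.round(3), influential)
```

A useful sanity check: the hat diagonals always sum to the number of fitted parameters (k + 1).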
Chap 14-46
The Studentized Deleted Residuals ti
The Studentized deleted residual measures the difference of each Yi from the value predicted by a model that includes all observations except observation i
Expressed as a t statistic
Chap 14-47
The Studentized Deleted Residuals ti
The Studentized deleted residual:
where e_i = the residual for observation i, k = number of independent variables, SSE = error sum of squares, h_i = hat matrix diagonal element for observation i
If t_i > t_(n-k-2) or t_i < -t_(n-k-2) (using a two-tail test at α = 0.10), then observation i is highly influential and a candidate for removal
(continued)
t_i = e_i · sqrt[ (n - k - 2) / (SSE (1 - h_i) - e_i²) ]
Chap 14-48
Cook’s distance statistic Di
Cook’s distance statistic Di is based on both hi and the Studentized residual
Used to decide whether an observation flagged by either the hi or ti criterion is unduly affecting the model
Chap 14-49
Cook’s distance statistic Di
Cook’s Di statistic
where e_i = the residual for observation i, k = number of independent variables, SSE = error sum of squares, h_i = hat matrix diagonal element for observation i
If D_i > F_(k+1, n-k-1) at α = 0.05, then the observation is highly influential on the regression equation and is a candidate for removal
(continued)
D_i = [e_i² / (k · MSE)] · [h_i / (1 - h_i)²]
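A sketch applying the slide's Cook's-distance formula to the pie-sales fit (illustrative only; the deck does not show these values):

```python
# Cook's distance per the slide: D_i = e_i^2 / (k * MSE) * h_i / (1 - h_i)^2
import numpy as np

sales = np.array([350, 460, 350, 430, 350, 380, 430, 470, 450,
                  490, 340, 300, 440, 450, 300], dtype=float)
price = np.array([5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40, 7.00,
                  5.00, 7.20, 7.90, 5.90, 5.00, 7.00])
adv = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5,
                4.0, 3.5, 3.2, 4.0, 3.5, 2.7])
X = np.column_stack([np.ones_like(price), price, adv])

b, *_ = np.linalg.lstsq(X, sales, rcond=None)
e = sales - X @ b                                  # residuals
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)      # hat diagonals

n, k = len(sales), 2
mse = (e @ e) / (n - k - 1)                        # ~2253 from the ANOVA table
D = e**2 / (k * mse) * h / (1 - h)**2
print(D.round(3))
```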
Chap 14-50
Collinearity
Collinearity: High correlation exists among two or more independent variables
This means the correlated variables contribute redundant information to the multiple regression model
Chap 14-51
Collinearity
Including two highly correlated independent variables can adversely affect the regression results:
No new information provided
Can lead to unstable coefficients (large standard error and low t-values)
Coefficient signs may not match prior expectations
(continued)
Chap 14-52
Some Indications of Strong Collinearity
Incorrect signs on the coefficients
Large change in the value of a previous coefficient when a new variable is added to the model
A previously significant variable becomes non-significant when a new independent variable is added
The estimate of the standard deviation of the model increases when a variable is added to the model
Chap 14-53
Detecting Collinearity (Variance Inflationary Factor)
VIFj is used to measure collinearity:
If VIFj > 5, Xj is highly correlated with the other independent variables
where R2j is the coefficient of determination of
variable Xj with all other X variables
VIF_j = 1 / (1 - R_j²)
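With only two X variables, R_j² is just the squared correlation between them, so the VIF can be computed directly (a sketch on the pie-sales predictors, not part of the original slides):

```python
# VIF_j = 1 / (1 - R_j^2); for two predictors, R_j^2 = corr(Price, Advertising)^2
import numpy as np

price = np.array([5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40, 7.00,
                  5.00, 7.20, 7.90, 5.90, 5.00, 7.00])
adv = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5,
                4.0, 3.5, 3.2, 4.0, 3.5, 2.7])

r = np.corrcoef(price, adv)[0, 1]   # correlation between the two predictors
vif = 1.0 / (1.0 - r**2)
print(round(vif, 2))
```

This matches the VIF of 1.0 that Minitab reports for both predictors on the next slide.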
Chap 14-54
Example: Pie Sales
Sales = b0 + b1 (Price) + b2 (Advertising)

Week  Pie Sales  Price ($)  Advertising ($100s)
 1      350        5.50        3.3
 2      460        7.50        3.3
 3      350        8.00        3.0
 4      430        8.00        4.5
 5      350        6.80        3.0
 6      380        7.50        4.0
 7      430        4.50        3.0
 8      470        6.40        3.7
 9      450        7.00        3.5
10      490        5.00        4.0
11      340        7.20        3.5
12      300        7.90        3.2
13      440        5.90        4.0
14      450        5.00        3.5
15      300        7.00        2.7
Recall the multiple regression equation of chapter 13:
Chap 14-55
Detecting Collinearity in Minitab
Stat | Regression | Regression …
Click on the "Options" box, then select the "Variance inflationary factors" box
Regression Analysis: Pie Sales versus Price, Advertising
The regression equation is
Pie Sales = 307 - 25.0 Price + 74.1 Advertising

Predictor     Coef  SE Coef      T      P  VIF
Constant     306.5    114.3   2.68  0.020
Price       -24.98    10.83  -2.31  0.040  1.0
Advertising  74.13    25.97   2.85  0.014  1.0

S = 47.4634   R-Sq = 52.1%   R-Sq(adj) = 44.2%
Output for the pie sales example:
VIFs are < 5, so there is no evidence of collinearity between Price and Advertising
Chap 14-56
Chapter Summary
Developed the multiple regression model
Tested the significance of the multiple regression model
Discussed adjusted r²
Discussed using residual plots to check model assumptions
Tested individual regression coefficients