Chapter 11 Regression Analysis
1
Doing Statistics for Business: Data, Inference, and Decision Making
Marilyn K. Pelosi and Theresa M. Sandifer
Chapter 11: Regression Analysis
2
Chapter 11 Objectives
Find the linear regression equation for a dependent variable Y as a function of a single independent variable X
Determine whether a relationship between X and Y exists
Analyze the results of a regression analysis to determine whether the simple linear model is appropriate
3
Figure 11.1 Deterministic Relationship Between Total Order Cost and Number of Items Ordered
4
Figure 11.2 Statistical Relationship Between Revenue and Advertising Expenditures
5
TRY IT NOW! Increasing Capacity
Plotting Data to Look at the Relationship
An oil company is trying to determine how the number of refining sites available for refining crude oil relates to the overall refining capacity. It would use this information to determine whether expansion will provide the increase in capacity that it wants or whether other steps to increase capacity will be necessary. The company collects data on other competing companies and finds the following:
6
TRY IT NOW! Increasing Capacity
Plotting Data to Look at the Relationship (con’t)
Source: The Economist (July 15, 1995)

Oil Company          # Sites   Refining Capacity (m tons per year)
Royal Dutch/Shell      13        81.82
Exxon                  10        81.82
Agip                   13        58.18
BP                      8        43.64
Repsol                  5        40.00
Total                   7        36.36
Turkish Petroleum       4        34.55
Elf                     8        32.73
Mobil                   7        29.09
Petrofina               3        25.45
7
TRY IT NOW! Increasing Capacity
Plotting Data to Look at the Relationship (con’t)
Use a grid to create a scatter plot of the data.
Do you think that a linear model is a good one?
8
The true relationship between the variables X and Y, the Simple Linear Regression Model, can be described by the equation
$y = \beta_0 + \beta_1 x + \varepsilon$
9
Figure 11.3 The True Regression Model Showing How Y Varies for a Given Value of X
(Plot: the true regression line, with the values of Y at $x_1$ and $x_2$ marked.)
10
Figure 11.4 Straight Line Approximating the Relationship Between Advertising and Revenue
11
Figure 11.5 A Single Criterion Can Produce Many Different Lines
12
The distance between the predicted value of Y, $\hat{y}$, and the actual value of Y, $y$, is called the deviation or error.
13
Figure 11.6 Deviations Between the Data Points and the Line
(Plot: least squares line of Y against X, annotated with the deviation between a data point and the line.)
14
The technique that finds the equation of the line that minimizes the total or sum of the squared deviations between the actual data points and the line is called the least squares method.
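In symbols, the least squares criterion and the estimates it leads to can be written as below. The names $b_0$ and $b_1$ for the estimated intercept and slope are an assumption of this sketch, not notation taken from the slide; the formulas are the standard closed-form results for simple linear regression.

```latex
\begin{aligned}
\text{minimize}\quad & \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
  = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2 \\
b_1 &= \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2} \\
b_0 &= \bar{y} - b_1 \bar{x}
\end{aligned}
```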
15
TRY IT NOW! Increasing Capacity
Finding the Equation of the Least-Squares Regression Line
The oil company that is looking at increasing refining capacity has decided that a linear relationship is appropriate. Fill in the table shown on the following slide or use some other means to find the equation of the least-squares line:
16
TRY IT NOW! Increasing Capacity
Finding the Equation of the Least-Squares Regression Line (con’t)

Obs. #   # Sites (X)   Capacity (Y)   XY   X²
  1          13           81.82
  2          10           81.82
  3          13           58.18
  4           8           43.64
  5           5           40.00
  6           7           36.36
  7           4           34.55
  8           8           32.73
  9           7           29.09
 10           3           25.45
Totals
17
TRY IT NOW! Increasing Capacity
Finding the Equation of the Least-Squares Regression Line (con’t)
Interpret the meaning of the estimate of the slope of the line. Does the y intercept make sense for these data?
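As one of the "other means" of finding the line, the table totals and the resulting estimates can be computed directly. This is a minimal sketch in plain Python; the function name `least_squares_line` is hypothetical, and the data are the ten observations listed on the earlier slide.

```python
# Oil company data from the slides: number of refining sites (x)
# and refining capacity in m tons per year (y).
sites = [13, 10, 13, 8, 5, 7, 4, 8, 7, 3]
capacity = [81.82, 81.82, 58.18, 43.64, 40.00, 36.36, 34.55, 32.73, 29.09, 25.45]


def least_squares_line(x, y):
    """Return (intercept, slope) of the least-squares line through (x, y)."""
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))   # total of the XY column
    sum_x2 = sum(xi ** 2 for xi in x)               # total of the X^2 column
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    intercept = sum_y / n - slope * sum_x / n
    return intercept, slope


b0, b1 = least_squares_line(sites, capacity)
print(f"y-hat = {b0:.2f} + {b1:.2f} x")
```

The printed coefficients can be compared with the equation used on the later slides of this exercise.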
18
The value of $\hat{y}$ that we find is really a prediction of the mean value of Y for a given value of X.
19
Using the equation to predict values of Y within the range of the X data is called interpolation. Predicting values of Y for values of X outside the observed range is called extrapolation.
20
TRY IT NOW! Increasing Capacity
Using the Regression Equation to Predict the Value of Y
Use the equation of the regression line you found earlier to predict the refining capacity for each of the observed values of X, the number of sites.
21
TRY IT NOW! Increasing Capacity
Using the Regression Equation to Predict the Value of Y (con’t)

Obs. #   # Sites (X)   Capacity (Y)   $\hat{y} = 9.03 + 4.79x$
  1          13           81.82
  2          10           81.82
  3          13           58.18
  4           8           43.64
  5           5           40.00
  6           7           36.36
  7           4           34.55
  8           8           32.73
  9           7           29.09
 10           3           25.45
22
The difference between the observed value of Y, $y_i$, and the predicted value of Y from the regression equation, $\hat{y}_i$, for a value of X = $x_i$, is called the ith residual, $e_i$.
23
TRY IT NOW! Increasing Capacity
Calculating the Residuals
The oil company that is looking at the relationship between refining capacity and the number of refining sites wants to get a better idea of how the regression line relates to the actual data. It decides to calculate the residuals for each observed value of X, the number of sites.
Find the residuals and fill in the table found on the following slide:
24
TRY IT NOW! Increasing Capacity
Calculating the Residuals (con’t)
Oil Refineries: The Economist, July 15, 1995

Obs. #   # Sites (X)   Refining Capacity (Y)   $\hat{y}_i = 9.03 + 4.79x_i$   $e_i = y_i - \hat{y}_i$
  1          13              81.82                    71.3
  2          10              81.82                    56.9
  3          13              58.18                    71.3
  4           8              43.64                    47.4
  5           5              40.00                    33.0
  6           7              36.36                    42.6
  7           4              34.55                    28.2
  8           8              32.73                    47.4
  9           7              29.09                    42.6
 10           3              25.45                    23.4
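The residual column can be checked with a few lines of code. This is a minimal sketch that uses the rounded equation from the table header; the helper name `predict` is hypothetical.

```python
# Oil company data from the slides.
sites = [13, 10, 13, 8, 5, 7, 4, 8, 7, 3]
capacity = [81.82, 81.82, 58.18, 43.64, 40.00, 36.36, 34.55, 32.73, 29.09, 25.45]


def predict(x, intercept=9.03, slope=4.79):
    """Predicted capacity from the rounded regression equation on the slide."""
    return intercept + slope * x


# Residual = observed value minus predicted value, one per observation.
residuals = [y - predict(x) for x, y in zip(sites, capacity)]

for obs, (x, y, e) in enumerate(zip(sites, capacity, residuals), start=1):
    print(f"{obs:>2}  x={x:>2}  y={y:6.2f}  y-hat={predict(x):5.1f}  e={e:7.2f}")
```

Because the coefficients are rounded, the residuals sum to a value close to zero rather than exactly zero.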
25
TRY IT NOW! Increasing Capacity
Calculating the Residuals (con’t)
To get a picture of how the residuals and the regression line fit together,
the company also decides to graph the regression line on a plot of the
data.
Graph the regression line on the data plot. How well do you think the line
represents the data?
26
TRY IT NOW! Increasing Capacity
Calculating the Residuals (con’t)
27
The standard error of the estimate, $s_{yx}$, is a measure of how much the data vary around the regression line.
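One common way to compute this quantity is $s_{yx} = \sqrt{SSE/(n-2)}$, where SSE is the sum of squared residuals. The sketch below applies that formula to the oil company data from the earlier slides; the function name is hypothetical, and the formula is the standard one rather than something stated on this slide.

```python
import math

# Oil company data from the slides.
sites = [13, 10, 13, 8, 5, 7, 4, 8, 7, 3]
capacity = [81.82, 81.82, 58.18, 43.64, 40.00, 36.36, 34.55, 32.73, 29.09, 25.45]


def standard_error_of_estimate(x, y):
    """Fit the least-squares line, then return sqrt(SSE / (n - 2))."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    # Sum of squared deviations between the data points and the line.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(sse / (n - 2))


s_yx = standard_error_of_estimate(sites, capacity)
print(f"s_yx = {s_yx:.2f}")
```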
28
Figure 11.8 Computer Output Showing the Standard Error of the Estimate

Excel Output

SUMMARY OUTPUT
Regression Statistics
Multiple R          0.879773847
R Square            0.774002022
Adjusted R Square   0.745752275
Standard Error      0.65297731
Observations        10

Minitab Output

Regression Analysis
The regression equation is
Rev $bn = 0.784 + 1.14 Members (m)

Predictor   Coef     StDev    T      P
Constant    0.7845   0.5050   1.55   0.159
Members     1.1439   0.2185   5.23   0.000

S = 0.6530   R-Sq = 77.4%   R-Sq(adj) = 74.6%
29
Figure 11.9 (a) Line with non-zero slope
(b) Line with zero slope
(a) (b)
30
Figure 11.10 t-test Portion of Computer Output
(Panels: Minitab output and Excel output.)
31
TRY IT NOW! Increasing Capacity
Testing for Significance of the Regression Model
The oil company that is looking at increasing capacity wants to determine whether the relationship between refining capacity and number of refining sites that it calculated is significant.
Write down the hypotheses that the company needs to test.
32
TRY IT NOW! Increasing Capacity
Testing for Significance of the Regression Model (con’t)
The company decides to use a 0.01 level of significance for the test. Find the critical values for the test.
It used a computer software package to run the analysis and obtained the following output:
33
TRY IT NOW! Increasing Capacity
Testing for Significance of the Regression Model (con’t)
From the computer output, find the slope of the regression line, the standard error of the slope, and the value of the t statistic.
Perform the hypothesis test and make a decision about the regression line.

            Coefficients   Standard Error   t Stat        P-value       Lower 95%      Upper 95%
Intercept   9.028295455    11.04587353      0.817345539   0.437393793   -16.44355105   34.50014196
# Sites     4.786628788    1.307226853      3.661666509   0.006385827   1.77215631     7.801101266
34
TRY IT NOW! Increasing Capacity
Testing for Significance of the Regression Model (con’t)
Find the p value of the test from the output and explain how you could use the p value on the output to make the same decision.
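The two routes to the decision can be laid side by side in a short sketch. It assumes the usual two-sided hypotheses for the slope (no linear relationship against some linear relationship) and takes the critical value 3.355 from a standard t table for 8 degrees of freedom at the 0.01 level; neither is printed on the slide. The other numbers are read from the output shown earlier in this exercise.

```python
# Values read from the regression output in this exercise.
slope = 4.786628788
se_slope = 1.307226853
p_value = 0.006385827

alpha = 0.01   # significance level chosen by the company
n = 10         # number of observations

# Two-tailed critical value for n - 2 = 8 degrees of freedom at alpha = 0.01,
# taken from a standard t table (an assumption of this sketch).
t_critical = 3.355

# Test statistic: estimated slope divided by its standard error.
t_stat = slope / se_slope

reject_by_t = abs(t_stat) > t_critical   # critical-value approach
reject_by_p = p_value < alpha            # p-value approach

print(f"t = {t_stat:.3f}")
print(f"reject H0 using the critical value: {reject_by_t}")
print(f"reject H0 using the p value: {reject_by_p}")
```

Both comparisons use the same test, so they lead to the same decision.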
Once we have determined that the relationship between X and Y is significant, we can perform some additional analyses to see if the predictions we obtain are useful for the purposes of decision making and to determine the strength of the relationship.
35
Figure 11.11 Components of the Variation in y Value
(Plot labels: SST, SSE, and SSR marked at a data point $y_i$; the fitted line; axes X and Y.)
36
Figure 11.12 Computer ANOVA Output for Regression Analysis

Excel Output

ANOVA
             df   SS            MS            F             Significance F
Regression    1   37.7124744    37.7124744    182.6413924   1.9909E-12
Residual     23   4.749125596   0.206483722
Total        24   42.4616

Minitab Output

Analysis of Variance
Source       DF   SS       MS       F        P
Regression    1   37.712   37.712   182.64   0.000
Error        23   4.749    0.206
Total        24   42.462
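The entries of the ANOVA table are linked by simple arithmetic, which the sketch below reproduces from the Excel values in Figure 11.12. The variable names are illustrative, and the R-squared line uses the standard identity SSR/SST, which does not appear in this part of the output.

```python
# Sums of squares and degrees of freedom from the Excel ANOVA output.
ss_regression = 37.7124744
ss_residual = 4.749125596
df_regression = 1
df_residual = 23

# Total variation is the sum of the explained and unexplained parts.
ss_total = ss_regression + ss_residual

# Mean squares are sums of squares divided by their degrees of freedom.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

# F statistic and the share of variation explained by the regression.
f_stat = ms_regression / ms_residual
r_squared = ss_regression / ss_total

print(f"SST = {ss_total:.4f}")
print(f"MSE = {ms_residual:.6f}")
print(f"F = {f_stat:.2f}")
print(f"R-squared = {r_squared:.4f}")
```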
37
A Confidence Interval provides an estimate for the mean value of Y ($\mu_{yx}$) at a particular value of X.
38
Figure 11.13 Confidence Interval for the Mean Estimate
(Plot: regression line of Y against X with confidence intervals.)
39
TRY IT NOW! Increasing Capacity
Finding Confidence Intervals for the Mean Predicted Value
After calculating the regression model and deciding that the model is significant, the analysts at the oil company would like to know about the accuracy of the estimates from the model. They decide to calculate 95% confidence intervals for X = 8 and X = 13 sites. They know from previous work that for the set of 10 observations in the model, $s_{yx} = 13.43$, $\sum x = 78$, and $\sum x^2 = 714$.
40
TRY IT NOW! Increasing Capacity
Finding Confidence Intervals for the Mean Predicted Value (con’t)
Find 95% confidence intervals for the mean estimates.
Do you think that these estimates would be useful for planning purposes? Why or why not?
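A sketch of the calculation, assuming the standard interval for the mean response, $\hat{y} \pm t \, s_{yx} \sqrt{1/n + (x - \bar{x})^2 / (\sum x^2 - (\sum x)^2/n)}$, which is not printed on the slide. The t value 2.306 comes from a standard t table for 95% confidence with 8 degrees of freedom, the equation is the rounded one from the earlier slides, and the function name is hypothetical.

```python
import math

# Summary values given in the exercise.
n = 10
s_yx = 13.43
sum_x = 78
sum_x2 = 714

# Rounded regression equation from the earlier slides.
b0, b1 = 9.03, 4.79

# t table value for 95% confidence and n - 2 = 8 degrees of freedom (assumed).
t_crit = 2.306


def mean_confidence_interval(x0):
    """Return (lower, upper) 95% limits for the mean of Y at X = x0."""
    x_bar = sum_x / n
    s_xx = sum_x2 - sum_x ** 2 / n
    y_hat = b0 + b1 * x0
    half_width = t_crit * s_yx * math.sqrt(1 / n + (x0 - x_bar) ** 2 / s_xx)
    return y_hat - half_width, y_hat + half_width


for x0 in (8, 13):
    low, high = mean_confidence_interval(x0)
    print(f"X = {x0}: {low:.2f} to {high:.2f}")
```

The interval at X = 13 is wider than the one at X = 8 because 13 lies farther from the mean number of sites.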
41
A Prediction Interval gives an estimate for an individual value of Y at a particular value of X.
42
TRY IT NOW! Increasing Capacity
Calculating Prediction Intervals for Regression Estimates
The oil company analysts decide to calculate 95% prediction intervals for the two X values that they are interested in. The relevant values from the set of 10 observations are $s_{yx} = 13.43$, $\sum x = 78$, and $\sum x^2 = 714$.
Find 95% prediction intervals for X = 8 and X = 13 refining sites.
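A sketch of the prediction interval calculation, assuming the standard formula $\hat{y} \pm t \, s_{yx} \sqrt{1 + 1/n + (x - \bar{x})^2 / (\sum x^2 - (\sum x)^2/n)}$, which differs from the interval for the mean only by the extra 1 under the square root. As before, the t value 2.306 is taken from a standard t table (8 degrees of freedom), the equation is the rounded one from the earlier slides, and the function name is hypothetical.

```python
import math

# Summary values given in the exercise.
n = 10
s_yx = 13.43
sum_x = 78
sum_x2 = 714

# Rounded regression equation from the earlier slides.
b0, b1 = 9.03, 4.79

# t table value for 95% confidence and n - 2 = 8 degrees of freedom (assumed).
t_crit = 2.306


def prediction_interval(x0):
    """Return (lower, upper) 95% limits for an individual Y at X = x0."""
    x_bar = sum_x / n
    s_xx = sum_x2 - sum_x ** 2 / n
    y_hat = b0 + b1 * x0
    # The leading 1 accounts for the variation of a single observation.
    half_width = t_crit * s_yx * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / s_xx)
    return y_hat - half_width, y_hat + half_width


for x0 in (8, 13):
    low, high = prediction_interval(x0)
    print(f"X = {x0}: {low:.2f} to {high:.2f}")
```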
43
TRY IT NOW! Increasing Capacity
Calculating Prediction Intervals for Regression Estimates (con’t)
Do you think that confidence intervals or prediction intervals would be more appropriate for the oil company’s purpose?
44
The Correlation Coefficient is used as a measure of the strength of a linear relationship. A correlation of -1 corresponds to a perfect negative relationship, a correlation of 0 corresponds to no linear relationship, and a correlation of +1 corresponds to a perfect positive relationship.
45
Figure 11.14 Three Types of Relationships: Perfect Negative, No Relationship, and Perfect Positive
46
TRY IT NOW! Increasing Capacity
Calculating the Correlation Coefficient
The relevant data to calculate the correlation coefficient for the oil company problem are
$n = 10$, $\sum x = 78$, $\sum y = 463.64$, $\sum xy = 4121.86$, $\sum x^2 = 714$, $\sum y^2 = 25{,}359.3224$
Find the correlation coefficient for the data.
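The calculation can be sketched with the computational formula for the correlation coefficient, which is assumed here rather than quoted from the slide. To keep the example self-contained, the sums are rebuilt from the ten observations listed earlier instead of being typed in; the function name is hypothetical.

```python
import math

# Oil company data from the slides.
sites = [13, 10, 13, 8, 5, 7, 4, 8, 7, 3]
capacity = [81.82, 81.82, 58.18, 43.64, 40.00, 36.36, 34.55, 32.73, 29.09, 25.45]


def correlation(x, y):
    """Pearson correlation coefficient from the raw sums."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    sum_y2 = sum(yi ** 2 for yi in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = correlation(sites, capacity)
print(f"r = {r:.3f}, r squared = {r ** 2:.3f}")
```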
47
Figure 11.15 Examples of Residual Plots
(Four panels, each plotting residuals against X.)
48
Figure 11.16 Histograms of Residuals
(Four panels, each a histogram of residual frequency: OK Resids, Std. Nonlin, Funnel, and Bow.)
49
A Normal Probability Plot is a plot of the ordered data against their expected values under a normal distribution. When data are normally distributed, the plot will be a straight line.
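The coordinates of such a plot can be computed without a plotting library. The sketch below pairs the sorted residuals from the oil company example with expected standard normal scores. The plotting positions (i - 0.375) / (n + 0.25) are one common convention and an assumption here, since software packages differ on this detail; the function name is hypothetical.

```python
from statistics import NormalDist

# Oil company data and the rounded regression equation from the earlier slides.
sites = [13, 10, 13, 8, 5, 7, 4, 8, 7, 3]
capacity = [81.82, 81.82, 58.18, 43.64, 40.00, 36.36, 34.55, 32.73, 29.09, 25.45]
residuals = [y - (9.03 + 4.79 * x) for x, y in zip(sites, capacity)]


def normal_scores(values):
    """Pair each sorted value with its expected standard normal score."""
    n = len(values)
    ordered = sorted(values)
    standard_normal = NormalDist()
    # Assumed plotting positions: (i - 0.375) / (n + 0.25) for i = 1..n.
    scores = [standard_normal.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    return list(zip(scores, ordered))


pairs = normal_scores(residuals)
for score, value in pairs:
    print(f"normal score {score:6.2f}   residual {value:7.2f}")
```

Plotting the residuals against the normal scores and checking how closely the points follow a straight line gives the visual test described above.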
50
Figure 11.17 Regression Diagnostic Plots
(Four sets of Residual Model Diagnostics, each with four panels: Normal Plot of Residuals, I Chart of Residuals, Histogram of Residuals, and Residuals vs. Fits.)
51
Figure 11.18 Warning Output from Minitab

Obs   Members   HMORev $   Fit     StDev Fit   Residual   St Resid
  1    4.24      5.486     5.631    0.509      -0.146     -0.36 X
  2    3.19      4.629     4.432    0.313       0.197      0.34
  3    1.83      3.857     2.878    0.215       0.979      1.59
  4    1.62      3.600     2.639    0.232       0.961      1.58
  5    2.07      3.429     3.153    0.206       0.276      0.45
  6    2.30      2.914     3.415    0.210      -0.501     -0.81
  7    1.83      2.743     2.878    0.215      -0.136     -0.22
  8    2.15      2.400     3.244    0.206      -0.844     -1.37
  9    0.97      1.714     1.896    0.323      -0.182     -0.32
 10    0.89      1.200     1.805    0.336      -0.605     -1.08

X denotes an observation whose X value gives it large influence.

Obs   _ of Sta   RadioRev   Fit      StDev Fit   Residual   St Resid
  1      83      1.0500     0.4628    0.1191      0.5872     2.75R
  2      57      0.3143     0.3446    0.0783     -0.0303    -0.13
  3     104      0.3143     0.5583    0.1720     -0.2440    -1.40
  4      35      0.3048     0.2446    0.0942      0.0602     0.27
  5      21      0.2857     0.1809    0.1232      0.1048     0.50
  6      63      0.2286     0.3719    0.0831     -0.1433    -0.62
  7      67      0.2190     0.3901    0.0882     -0.1710    -0.75
  8      41      0.2095     0.2718    0.0852     -0.0623    -0.27
  9      38      0.2095     0.2582    0.0894     -0.0487    -0.21
 10      20      0.1238     0.1764    0.1256     -0.0526    -0.25

R denotes an observation with a large standardized residual.
52
Simple Linear Regression Model in Excel
1. From the list of data analysis tools, select Regression.
2. Position the cursor in the textbox labeled Input Y Range: and highlight the data range for the Y variable, in this case, Revenues.
3. Move the cursor in the textbox for Input X Range: and highlight the data range of the X variable, in this case, Members.
53
Simple Linear Regression Model in Excel
4. If the data ranges contain labels, click on the Labels checkbox. If you want confidence intervals for the regression estimates, click the checkbox for Confidence Level.
5. Specify the location where you want the output to appear, either in the current sheet, in a new worksheet, or in a new workbook.
6. Click the checkbox for Residuals. Do not check the Standardized Residuals checkbox. Excel does not calculate these values correctly.
54
Simple Linear Regression Model in Excel
7. Click the checkboxes for Residual Plots and Line Fit Plots. Do not click the checkbox for Normal Probability Plot. The plot is not created correctly.
8. Click on OK. The output will appear in the location you specified.
55
Figure 11.21 Completed Regression Dialog Box
56
Figure 11.22 Summary Section of Regression Output
57
Figure 11.23 Residual Output
58
Figure 11.24 Plots from Regression Analysis
59
Although Excel does perform linear regression, KaddStat can also be used for the analysis. The basic input is the same, although KaddStat has slightly different output.
From the Kadd menu select Regression and correlation > Single/Multiple. The dialog box shown in Figure 11.25 opens.
60
Figure 11.25 Regression Dialog Box in KaddStat
61
1. Position the cursor in the box labeled Input Range and highlight your data in the Excel worksheet.
Although nothing changes immediately, if you click on the drop down arrow in the box labeled Dependent Variable all of the variable names in the Input Range appear in the boxes for Dependent and Independent Variables as shown in Figure 11.26.
62
Figure 11.26 Variable lists for regression analysis
63
1. From the drop down list, select Rev $bn for the Dependent Variable.
2. Move the cursor over to the box labeled Independent Variable and from the list, click on the variable that you want to use for the independent variable, in this case, Members (m).
3. In the bottom part of the dialog box indicate which plots you want included in the output.
4. Indicate where you want the output to appear and click OK.
64
The main portion of the output is shown in Figure 11.27.
65
The remainder of the output consists of the graphs requested and the residuals and standardized residuals shown in Figure 11.28.
66
67
Kadd will calculate the predicted values for the data points, or for any other x values.
Click on the box labeled Forecast and the dialog box will open.
68
69
Place the cursor in the Forecast Data Range box and highlight the location of the values of the independent variable for which you want predictions.
Indicate where you want the output located.
Click OK.
70
71
Chapter 11 Summary
In this chapter you have learned:
Linear regression analysis is a powerful tool for determining how two variables are related.
The regression equation can be used for:
Description - used when you are simply trying to understand the way that two variables are related.
72
Chapter 11 Summary (con’t)
Control - describes when the model is used to set standards or reduce variability.
Predictability - describes when the model is used to determine what the resulting Y value should be when X takes on certain values.
Although the simple linear model may be significant, it might not be correct.
73
Chapter 11 Summary (con’t)
It is necessary to test the Assumptions of the linear model to see whether the model you obtain is appropriate.