Transcript of Chapter 26: Multiple Regression, Logistic Regression, and Indicator Variables
6/11/2013
Chapter 26
Multiple Regression,
Logistic Regression, and
Indicator Variables
26.1 S4/IEE Application Examples:
Multiple Regression
• An S4/IEE project was created to improve the 30,000-foot-
level metric DSO. Two inputs that surfaced from a cause-
and-effect diagram were the size of the invoice and the
number of line items included within the invoice. A multiple
regression analysis was conducted for DSO versus size of
invoice and number of line items included in invoice.
• An S4/IEE project was created to improve the 30,000-foot-
level metric, the diameter of a manufactured part. Inputs
that surfaced from a cause-and-effect diagram were the
temperature, pressure, and speed of the manufacturing
process. A multiple regression analysis of diameter versus
temperature, pressure, and speed was conducted.
26.2 Description
• A general model includes polynomial terms in one or more
variables such as
𝑌 = 𝛽0 + 𝛽1𝑥1 + 𝛽2𝑥2 + 𝛽3𝑥1² + 𝛽4𝑥2² + 𝛽5𝑥1𝑥2 + ε
• Where 𝛽’s are unknown parameters and 𝜀 is random error.
This full quadratic model of 𝑌 on 𝑥1 and 𝑥2 is of great use in
DOE.
• For the situation without polynomial terms where there are
𝑘 predictor variables, the general model reduces to the form
𝑌 = 𝛽0 + 𝛽1𝑥1 + ⋯+ 𝛽𝑘𝑥𝑘 + ε
26.2 Description
• The objective is to determine from data the least squares
estimates (𝑏0, 𝑏1, … , 𝑏𝑘) of the unknown parameters
(𝛽0, 𝛽1, … , 𝛽𝑘) for the prediction equation
𝑌̂ = 𝑏0 + 𝑏1𝑥1 + ⋯ + 𝑏𝑘𝑥𝑘
• Where 𝑌̂ is the predicted value of 𝑌 for given values of
(𝑥1, … , 𝑥𝑘). Many statistical software packages can perform
these calculations.
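As a sketch of what such packages compute internally, here is a minimal least squares fit using NumPy. The coefficients and data are made up purely for illustration; the point is that `lstsq` recovers the known parameters from noiseless data.

```python
import numpy as np

# Minimal sketch: recover known parameters b0=2, b1=0.5, b2=-1.5
# from noiseless data via ordinary least squares.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(30, 2))        # 30 observations, k = 2 predictors
y = 2.0 + 0.5 * x[:, 0] - 1.5 * x[:, 1]     # Y = b0 + b1*x1 + b2*x2 (no noise)

X = np.column_stack([np.ones(len(x)), x])   # design matrix with intercept column
b, *_ = np.linalg.lstsq(X, y, rcond=None)   # least squares estimates (b0, b1, b2)
print(b)
```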
26.3 Example 26.1:
Multiple Regression
• An investigator wants to
determine the relationship
of a key process output
variable, product strength,
to two key process input
variables, hydraulic
pressure during a forming
process and acid
concentration. The data
are given as follows,
Strength Pressure Concentration
665 110 116
618 119 104
620 138 94
578 130 86
682 143 110
594 133 87
722 147 114
700 142 106
681 125 107
695 135 106
664 152 98
548 118 86
620 155 87
595 128 96
740 146 120
670 132 108
640 130 104
590 112 91
570 113 92
640 120 100
26.3 Example 26.1:
Multiple Regression
Regression Analysis: Strength versus Pressure, Concentration
The regression equation is
Strength = 16.3 + 1.57 Pressure + 4.16 Concentration
Predictor Coef SE Coef T P
Constant 16.28 44.30 0.37 0.718
Pressure 1.5718 0.2606 6.03 0.000
Concentration 4.1629 0.3340 12.47 0.000
S = 15.0996 R-Sq = 92.8% R-Sq(adj) = 92.0%
Analysis of Variance
Source DF SS MS F P
Regression 2 50101 25050 109.87 0.000
Residual Error 17 3876 228
Total 19 53977
Source DF Seq SS
Pressure 1 14673
Concentration 1 35428
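As a cross-check, the regression equation above can be reproduced with a direct least squares calculation on the 20 observations (a sketch using NumPy rather than Minitab; the coefficients should roughly match the Coef column and R-Sq in the output):

```python
import numpy as np

# The 20 observations from Example 26.1: strength, pressure, concentration.
data = np.array([
    [665, 110, 116], [618, 119, 104], [620, 138,  94], [578, 130,  86],
    [682, 143, 110], [594, 133,  87], [722, 147, 114], [700, 142, 106],
    [681, 125, 107], [695, 135, 106], [664, 152,  98], [548, 118,  86],
    [620, 155,  87], [595, 128,  96], [740, 146, 120], [670, 132, 108],
    [640, 130, 104], [590, 112,  91], [570, 113,  92], [640, 120, 100],
], dtype=float)
y = data[:, 0]
X = np.column_stack([np.ones(len(data)), data[:, 1], data[:, 2]])

b, *_ = np.linalg.lstsq(X, y, rcond=None)     # (b0, b1, b2)
resid = y - X @ b
r_sq = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
print(b, r_sq)  # should be close to (16.3, 1.57, 4.16) and 0.928
```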
26.3 Example 26.1:
Multiple Regression
• The P columns give the significance level for each model term.
Typically, if a P value is less than or equal to 0.05, the variable
is considered statistically significant (i.e., null hypothesis is
rejected). If a P value is greater than 0.10, the term is
removed from the model. A practitioner might leave the term in
the model if the P value is in the gray region between these
two probability levels.
• The coefficient of determination (𝑅2) is presented as R-Sq and
R-Sq (adj) in the output. When a variable is added to an
equation the coefficient of determination will get larger, even if
the added variable has no real value. 𝑅2(adj) is an
approximate unbiased estimate that compensates for this.
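The compensation can be written out directly. Using the standard adjustment formula with this example's values (n = 20 observations, k = 2 predictors):

```python
# Adjusted R^2 penalizes added predictors:
#   R^2(adj) = 1 - (1 - R^2) * (n - 1) / (n - k - 1)
# For Example 26.1: n = 20 observations, k = 2 predictors, R^2 = 92.8%.
n, k, r_sq = 20, 2, 0.928
r_sq_adj = 1 - (1 - r_sq) * (n - 1) / (n - k - 1)
print(round(r_sq_adj, 3))  # 0.92, matching R-Sq(adj) = 92.0% in the output
```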
26.3 Example 26.1:
Multiple Regression
• In the analysis of variance portion of this output the F value is
used to determine an overall P value for the model fit. In this
case the resulting P value of 0.000 indicates a very high level
of significance. The regression and residual sum of squares
(SS) and mean square (MS) values are interim steps toward
determining the F value. Standard error is the square root of
the mean square.
• No unusual patterns were apparent in the residual analysis
plots. Also, no correlation was shown between hydraulic
pressure and acid concentration.
26.4 Other Considerations
• Regressor variables should be independent within a model
(i.e., completely uncorrelated).
• Multicollinearity occurs when variables are dependent.
• A measure of the magnitude of multicollinearity that is
often available in statistical software is the variance
inflation factor (VIF).
• VIF quantifies how much the variance of an estimated
regression coefficient increases if the predictors are
correlated.
• Regression coefficients can be considered poorly
estimated when VIF exceeds 5 or 10.
• Strategies for breaking up multicollinearity include
collecting additional data or using different predictors.
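The VIF calculation can be sketched directly from its definition, VIF_j = 1/(1 - R_j²), where R_j² comes from regressing predictor j on the remaining predictors. The data below are made up: two nearly collinear predictors and one independent one.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of predictor matrix X."""
    out = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(X)), others])
        b, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ b
        r_sq = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
        out.append(1 / (1 - r_sq))
    return out

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = x1 + 0.1 * rng.normal(size=100)   # nearly collinear with x1 -> large VIF
x3 = rng.normal(size=100)              # independent -> VIF near 1
vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)
```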
26.4 Other Considerations
• Another approach to data analysis is the use of stepwise
regression (Draper and Smith 1966) or of all possible
regressions of the data when selecting the number of terms
to include in a model.
• This approach can be most useful when the data were not
collected using a structured experiment design.
• However, experimenters should be aware of the
potential pitfalls resulting from such happenstance
data (Box et al. 1978).
• A multiple regression best subset analysis is another
analysis alternative.
6/11/2013
6
26.4 Other Considerations
Best Subsets Regression: Output timing versus mot_temp, algor, …
Response is Output timing
Vars R-Sq R-Sq(adj) Mallows Cp S
(X columns, left to right: mot_temp, algor, mot_adj, ext_adj, sup_volt)
1 57.7 54.7 43.3 1.2862 X
1 33.4 28.6 75.2 1.6152 X
2 91.1 89.7 1.7 0.61288 X X
2 58.3 51.9 44.5 1.3251 X X
3 91.7 89.6 2.9 0.61593 X X X
3 91.5 89.4 3.1 0.62300 X X X
4 92.1 89.2 4.3 0.62718 X X X X
4 91.9 89.0 4.5 0.63331 X X X X
5 92.4 88.5 6.0 0.64701 X X X X X
Minitab: Stat > Regression > Best Subsets
26.4 Other Considerations
• A multiple regression best subset analysis first considers
only one factor in a model, then two, and so forth.
• The 𝑅2 value is then considered for each of the models;
for each number of factors, only the two factor
combinations with the highest 𝑅2 values are shown.
• The Mallows 𝐶𝑝 statistic is useful for determining the
minimum number of parameters that best fits the data.
Technically this statistic measures the sum of the squared
biases plus the squared random errors in 𝑌 at all n data
points (Daniel and Wood 1980).
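In symbols, one common presentation of this statistic (a standard form, assuming p parameters including the intercept, SSE_p the error sum of squares of the p-parameter model, and the full-model mean square error as the estimate of σ²) is:

```latex
C_p = \frac{SSE_p}{\hat{\sigma}^2} - (n - 2p)
```

When the p-parameter model is approximately unbiased, C_p is expected to be close to p, which is why small C_p values near the number of parameters are sought.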
26.4 Other Considerations
• The minimum number of factors needed in the model occurs
when the Mallows 𝐶𝑝 statistic is a minimum. From this output
the pertinent Mallows 𝐶𝑝 statistic values under consideration
as a function of the number of factors in the model are
• From this summary it is noted that the Mallows 𝐶𝑝 statistic is
minimized when there are two parameters in the model.
The corresponding factors are algor and mot_adj.
Number in Model Mallows 𝑪𝒑
1 43.3
2 1.7**
3 2.9
4 4.3
5 6.0
26.5 Example 26.2: Multiple
Regression Best Subset Analysis
• The results from a cause-and-effect matrix lead to a passive
analysis of factors A, B, C, and D on Throughput. In a plastic
molding process, for example, the throughput response might be
shrinkage as a function of the input factors. A best subsets
computer regression analysis of the collected data yielded:
Best Subsets Regression: Thruput versus A, B, C, D
Response is Thruput
Vars R-Sq R-Sq(adj) Mallows Cp S A B C D
1 92.1 91.4 38.3 0.25631 X
1 49.2 44.6 294.2 0.64905 X
2 96.3 95.6 14.9 0.18282 X X
2 95.2 94.3 21.5 0.20867 X X
3 98.5 98.0 4.1 0.12454 X X X
3 97.9 97.1 7.8 0.14723 X X X
4 98.7 98.0 5.0 0.12363 X X X X
26.5 Example 26.2: Multiple
Regression Best Subset Analysis
• From this output we note:
• R-Sq: Look for the highest value when comparing models
with the same number of predictors (vars).
• Adj. R-Sq: Look for the highest value when comparing models
with different numbers of predictors.
• 𝐶𝑝: Look for models where 𝐶𝑝 is small and close to the
number of parameters in the model, e.g., look for a model
with 𝐶𝑝 close to four for a three-predictor model that has an
intercept constant (often we just look for the lowest 𝐶𝑝 value).
• 𝑠: We want 𝑠, the estimate of the standard deviation about
the regression, to be as small as possible.
26.5 Example 26.2: Multiple
Regression Best Subset Analysis
• The regression equation for a three-predictor model from a computer
program is:
• The magnitude of the VIFs is satisfactory, i.e., not larger than 5 to 10. In
addition, there were no observed problems with the residual analysis.
Regression Analysis: Thruput versus A, C, D
The regression equation is
Thruput = 3.87 + 0.393 A + 3.19 C + 0.0162 D
Predictor Coef SE Coef T P VIF
Constant 3.8702 0.7127 5.43 0.000
A 0.39333 0.07734 5.09 0.001 1.368
C 3.1935 0.2523 12.66 0.000 1.929
D 0.016189 0.004570 3.54 0.006 1.541
S = 0.124543 R-Sq = 98.5% R-Sq(adj) = 98.0%
26.6 Indicator Variables (Dummy
Variables) to Analyze Categorical Data
• Categorical data such as location, operator, and color can
also be modeled using simple and multiple linear regression.
• It is generally incorrect to assign arbitrary numerical codes
when analyzing this type of data within regression, since the
fitted values within the model will depend upon the
assignment of the numerical values.
• The correct approach is through the use of indicator
variables or dummy variables, which indicate whether a
factor should or should not be included in the model.
26.6 Indicator Variables (Dummy
Variables) to Analyze Categorical Data
• If we are given the values of two of the indicator variables,
we can calculate the third. Hence, only two indicator
variables are needed for a model that has three categories,
and it does not matter which variable is left out of the model.
After indicator or dummy variables are created, they are
analyzed using regression to create a cell means model.
• If the intercept is left out of the regression equation, a no
intercept cell means model is created. For the case where
there are three indicator variables, a no intercept model
would then have three terms where the coefficients are the
cell means.
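This property is easy to demonstrate: fitting a no-intercept regression on three indicator columns returns the cell (group) means as coefficients. The data below are made up, with known group means of 10, 20, and 30.

```python
import numpy as np

# Made-up data: three groups with means 10, 20, 30.
groups = ["AZ"] * 4 + ["FL"] * 4 + ["TX"] * 4
y = np.array([9.0, 11.0, 10.0, 10.0,     # AZ, mean 10
              19.0, 21.0, 20.0, 20.0,    # FL, mean 20
              29.0, 31.0, 30.0, 30.0])   # TX, mean 30

# Indicator (dummy) columns: one per category, no intercept column.
levels = ["AZ", "FL", "TX"]
X = np.array([[1.0 if g == lev else 0.0 for lev in levels] for g in groups])

b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(dict(zip(levels, b)))  # coefficients equal the cell means
```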
26.7 Example 26.3: Indicator
Variables
• Revenue for Arizona, Florida, and Texas is shown in Table 26.3
(Bower 2001). This table also contains indicator variables that
were created to represent these states.
Regression Analysis: Revenue versus AZ, FL, TX
* TX is highly correlated with other X variables
* TX has been removed from the equation.
The regression equation is
Revenue = 48.7 - 23.8 AZ - 16.0 FL
Predictor Coef SE Coef T P
Constant 48.7329 0.4537 107.41 0.000
AZ -23.8190 0.6416 -37.12 0.000
FL -15.9927 0.6416 -24.93 0.000
S = 3.20811 R-Sq = 90.7% R-Sq(adj) = 90.6%
26.7 Example 26.3: Indicator
Variables
• Calculations for various revenues would be:
Texas Revenue = 48.7 - 23.8(0) - 16.0(0) = 48.7
Arizona Revenue = 48.7 - 23.8(1) - 16.0(0) = 24.9
Florida Revenue = 48.7 - 23.8(0) - 16.0(1) = 32.7
• A no intercept cell means model from a computer analysis would
be
Regression Analysis: Revenue versus AZ, FL, TX
The regression equation is
Revenue = 24.9 AZ + 32.7 FL + 48.7 TX
Predictor Coef SE Coef T P
Noconstant
AZ 24.9139 0.4537 54.91 0.000
FL 32.7402 0.4537 72.16 0.000
TX 48.7329 0.4537 107.41 0.000
S = 3.20811
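The two analyses are numerically consistent: in the intercept model the constant is the mean of the omitted state (TX), and each coefficient is the difference between that state's mean and the TX mean. A quick arithmetic check using the unrounded coefficients from the outputs:

```python
# Consistency of the intercept model with the no-intercept cell means model.
tx_mean = 48.7329                  # constant in the intercept model
az_coef, fl_coef = -23.8190, -15.9927

az_mean = tx_mean + az_coef        # should equal AZ coefficient, 24.9139,
fl_mean = tx_mean + fl_coef        # and FL coefficient, 32.7402, in the
print(az_mean, fl_mean)            # no-intercept cell means model
```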
26.8 Example 26.4: Indicator
Variables with Covariate
• Consider the following data set, which has created indicator
variables and a covariate. This covariate might be a continuous
variable such as process temperature or dollar amount for an
invoice.
Response Factor1 Factor2 A B High Covariate
1 A 1 1 0 1 11
3 A 0 1 0 -1 7
2 A 1 1 0 1 5
2 A 0 1 0 -1 6
4 B 1 0 1 1 6
6 B 0 0 1 -1 3
3 B 1 0 1 1 14
5 B 0 0 1 -1 20
8 C 1 -1 -1 1 2
9 C 0 -1 -1 -1 17
7 C 1 -1 -1 1 19
10 C 0 -1 -1 -1 14
26.8 Example 26.4: Indicator
Variables with Covariate Regression Analysis: Response versus Factor2, A, B, High, Covariate
* High is highly correlated with other X variables
* High has been removed from the equation.
The regression equation is
Response = 6.50 - 1.77 Factor2 - 3.18 A - 0.475 B - 0.0598 Covariate
Predictor Coef SE Coef T P
Constant 6.5010 0.4140 15.70 0.000
Factor2 -1.7663 0.3391 -5.21 0.001
A -3.1844 0.2550 -12.49 0.000
B -0.4751 0.2374 -2.00 0.086
Covariate -0.05979 0.03039 -1.97 0.090
S = 0.580794 R-Sq = 97.6% R-Sq(adj) = 96.2%
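The fit above can be reproduced with a direct least squares calculation on the data set, after dropping the collinear High column as the software did (High equals 2·Factor2 − 1 in this data set, which is why it was removed):

```python
import numpy as np

# Rows from the data set: (Response, Factor2, A, B, Covariate).
rows = [
    (1, 1,  1,  0, 11), (3, 0,  1,  0,  7), (2, 1,  1,  0,  5),
    (2, 0,  1,  0,  6), (4, 1,  0,  1,  6), (6, 0,  0,  1,  3),
    (3, 1,  0,  1, 14), (5, 0,  0,  1, 20), (8, 1, -1, -1,  2),
    (9, 0, -1, -1, 17), (7, 1, -1, -1, 19), (10, 0, -1, -1, 14),
]
y = np.array([r[0] for r in rows], dtype=float)
X = np.array([[1.0, r[1], r[2], r[3], r[4]] for r in rows])

b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)  # should be close to (6.50, -1.77, -3.18, -0.475, -0.0598)
```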
26.10 Example 26.5: Binary
Logistic Regression
• Ingots prepared
with different
heating and
soaking times
are tested for
readiness to be
rolled:
Sample Heat Soak Ready Not Ready
1 7 1.0 10 0
2 7 1.7 17 0
3 7 2.2 7 0
4 7 2.8 12 0
5 7 4.0 9 0
6 14 1.0 31 0
7 14 1.7 43 0
8 14 2.2 31 2
9 14 2.8 31 0
10 14 4.0 19 0
11 27 1.0 55 1
12 27 1.7 40 4
13 27 2.2 21 0
14 27 2.8 21 1
15 27 4.0 15 1
16 51 1.0 10 3
17 51 1.7 1 0
18 51 2.2 1 0
19 51 4.0 1 0
26.10 Example 26.5: Binary
Logistic Regression
Binary Logistic Regression: Ready, Trials versus Heat, Soak
Link Function: Normit
Response Information
Variable Value Count
Ready Event 375
Non-event 12
Trials Total 387
Logistic Regression Table
Predictor Coef SE Coef Z P
Constant 2.89342 0.500601 5.78 0.000
Heat -0.0399555 0.0118466 -3.37 0.001
Soak -0.0362537 0.146743 -0.25 0.805
Log-Likelihood = -47.480
Test that all slopes are zero: G = 12.029, DF = 2, P-Value =
0.002
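With the normit (probit) link, the fitted coefficients give the predicted probability of "ready" as p = Φ(2.89342 − 0.0399555·Heat − 0.0362537·Soak), where Φ is the standard normal CDF. That CDF can be evaluated with the standard library's `erf`:

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def p_ready(heat, soak):
    """Predicted 'ready' probability from the fitted normit model."""
    return phi(2.89342 - 0.0399555 * heat - 0.0362537 * soak)

print(p_ready(7, 1.0))    # near 1: short heat times are almost always ready
print(p_ready(51, 1.0))   # noticeably lower at the longest heat time
```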
26.10 Example 26.5: Binary
Logistic Regression
• Heat would be considered statistically significant. Let's now
address the question of which levels are important.
Rearranging the data by heat only and examining a p chart of
these data, it appears that heat at the 51 level causes a larger
proportion of not-readys.
Heat Not Ready Sample Size
7 0 55
14 2 157
27 7 159
51 3 16
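The p chart conclusion can be sketched numerically (a sketch, assuming the chart uses the pooled not-ready proportion and the usual 3-sigma upper control limit for each subgroup size):

```python
import math

# Rearranged data: heat level -> (not ready, sample size).
data = {7: (0, 55), 14: (2, 157), 27: (7, 159), 51: (3, 16)}

total_bad = sum(bad for bad, n in data.values())
total_n = sum(n for bad, n in data.values())
p_bar = total_bad / total_n           # pooled not-ready proportion, 12/387

for heat, (bad, n) in data.items():
    ucl = p_bar + 3 * math.sqrt(p_bar * (1 - p_bar) / n)
    p = bad / n
    print(heat, round(p, 3), round(ucl, 3), p > ucl)
```

Only the heat-51 proportion (3/16 ≈ 0.19) exceeds its upper control limit, which is consistent with singling out that level.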