Chapter 26
Multiple Regression, Logistic Regression, and Indicator Variables

26.1 S4/IEE Application Examples: Multiple Regression

• An S4/IEE project was created to improve the 30,000-foot-level metric DSO. Two inputs that surfaced from a cause-and-effect diagram were the size of the invoice and the number of line items included within the invoice. A multiple regression analysis was conducted for DSO versus size of invoice and number of line items in the invoice.

• An S4/IEE project was created to improve the 30,000-foot-level metric, the diameter of a manufactured part. Inputs that surfaced from a cause-and-effect diagram were the temperature, pressure, and speed of the manufacturing process. A multiple regression analysis of diameter versus temperature, pressure, and speed was conducted.


26.2 Description

• A general model includes polynomial terms in one or more variables, such as

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_2^2 + \beta_5 x_1 x_2 + \varepsilon$$

where the $\beta$'s are unknown parameters and $\varepsilon$ is random error. This full quadratic model of $Y$ on $x_1$ and $x_2$ is of great use in DOE.

• For the situation without polynomial terms, where there are $k$ predictor variables, the general model reduces to the form

$$Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \varepsilon$$

• The object is to determine from data the least squares estimates $(b_0, b_1, \ldots, b_k)$ of the unknown parameters $(\beta_0, \beta_1, \ldots, \beta_k)$ for the prediction equation

$$\hat{Y} = b_0 + b_1 x_1 + \cdots + b_k x_k$$

where $\hat{Y}$ is the predicted value of $Y$ for given values of $(x_1, \ldots, x_k)$. Many statistical software packages can perform these calculations.
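As a minimal sketch of such a calculation (numpy and the small made-up data set below are assumptions for illustration, not material from the text), the least squares estimates can be obtained directly:

```python
import numpy as np

# Illustrative (made-up) data: two predictors and a response.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y  = np.array([3.1, 3.9, 7.2, 7.8, 10.9])

# Design matrix with an intercept column: Y-hat = b0 + b1*x1 + b2*x2.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least squares estimates (b0, b1, b2) minimize the residual sum of squares.
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print("b0, b1, b2 =", b)

# Predicted values and residual sum of squares for the fitted equation.
y_hat = X @ b
print("residual SS =", np.sum((y - y_hat) ** 2))
```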


26.3 Example 26.1: Multiple Regression

• An investigator wants to determine the relationship of a key process output variable, product strength, to two key process input variables: hydraulic pressure during a forming process and acid concentration. The data are given as follows:

Strength  Pressure  Concentration
   665       110         116
   618       119         104
   620       138          94
   578       130          86
   682       143         110
   594       133          87
   722       147         114
   700       142         106
   681       125         107
   695       135         106
   664       152          98
   548       118          86
   620       155          87
   595       128          96
   740       146         120
   670       132         108
   640       130         104
   590       112          91
   570       113          92
   640       120         100


Regression Analysis: Strength versus Pressure, Concentration

The regression equation is
Strength = 16.3 + 1.57 Pressure + 4.16 Concentration

Predictor        Coef  SE Coef      T      P
Constant        16.28    44.30   0.37  0.718
Pressure       1.5718   0.2606   6.03  0.000
Concentration  4.1629   0.3340  12.47  0.000

S = 15.0996   R-Sq = 92.8%   R-Sq(adj) = 92.0%

Analysis of Variance

Source          DF     SS     MS       F      P
Regression       2  50101  25050  109.87  0.000
Residual Error  17   3876    228
Total           19  53977

Source         DF  Seq SS
Pressure        1   14673
Concentration   1   35428
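This output is easy to reproduce outside Minitab. The following is a minimal sketch using Python's statsmodels (an assumption; any least squares routine works), with the data keyed in from the table above:

```python
import numpy as np
import statsmodels.api as sm

# Data from the table above (Strength, Pressure, Concentration).
strength = np.array([665, 618, 620, 578, 682, 594, 722, 700, 681, 695,
                     664, 548, 620, 595, 740, 670, 640, 590, 570, 640])
pressure = np.array([110, 119, 138, 130, 143, 133, 147, 142, 125, 135,
                     152, 118, 155, 128, 146, 132, 130, 112, 113, 120])
concentration = np.array([116, 104, 94, 86, 110, 87, 114, 106, 107, 106,
                          98, 86, 87, 96, 120, 108, 104, 91, 92, 100])

# Fit Strength = b0 + b1*Pressure + b2*Concentration by least squares.
X = sm.add_constant(np.column_stack([pressure, concentration]))
fit = sm.OLS(strength, X).fit()

# summary() reports the coefficients, t statistics, P values, S, R-Sq,
# R-Sq(adj), and the ANOVA F test, matching the output shown above.
print(fit.summary())
```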



• The P columns give the significance level for each model term. Typically, if a P value is less than or equal to 0.05, the variable is considered statistically significant (i.e., the null hypothesis is rejected). If a P value is greater than 0.10, the term is removed from the model. A practitioner might leave the term in if its P value falls in the gray region between these two probability levels.

• The coefficient of determination (R²) is presented as R-Sq and R-Sq(adj) in the output. When a variable is added to an equation, the coefficient of determination will get larger, even if the added variable has no real value. R²(adj) is an approximately unbiased estimate that compensates for this.
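The adjustment can be written out explicitly (a standard formula, added here as a check); with n = 20 observations and k = 2 predictors, the values in the output above are consistent:

$$R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n-1}{n-k-1} = 1 - (1 - 0.928)\,\frac{19}{17} \approx 0.920$$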


• In the analysis of variance portion of this output, the F value is used to determine an overall P value for the model fit. In this case the resulting P value of 0.000 indicates a very high level of significance. The regression and residual sum of squares (SS) and mean square (MS) values are interim steps toward determining the F value. The standard error S is the square root of the residual mean square.
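Both quantities can be verified from the ANOVA table above:

$$F = \frac{MS_{\text{regression}}}{MS_{\text{residual}}} = \frac{25050}{228} \approx 109.9, \qquad S = \sqrt{228} \approx 15.1$$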

• No unusual patterns were apparent in the residual analysis plots. Also, no correlation was shown between hydraulic pressure and acid concentration.


26.4 Other Considerations

• Regressor variables should be independent within a model (i.e., completely uncorrelated).

• Multicollinearity occurs when the regressor variables are not independent, i.e., correlated with one another.

• A measure of the magnitude of multicollinearity that is often available in statistical software is the variance inflation factor (VIF).

• VIF quantifies how much the variance of an estimated regression coefficient increases when the predictors are correlated, as sketched below.

• Regression coefficients can be considered poorly estimated when VIF exceeds 5 or 10.

• Strategies for breaking up multicollinearity include collecting additional data or using different predictors.
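As a sketch of the computation (statsmodels and the made-up, deliberately collinear data below are assumptions for illustration):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Made-up illustrative data: x2 is deliberately almost a multiple of x1,
# so those two predictors are highly correlated; x3 is independent.
rng = np.random.default_rng(1)
x1 = rng.normal(size=50)
x2 = 2.0 * x1 + rng.normal(scale=0.1, size=50)
x3 = rng.normal(size=50)

# Design matrix with an intercept; VIF is computed per predictor column.
X = sm.add_constant(np.column_stack([x1, x2, x3]))
for j, name in enumerate(["x1", "x2", "x3"], start=1):
    print(f"VIF({name}) = {variance_inflation_factor(X, j):.1f}")
# Expect VIF(x1) and VIF(x2) to far exceed the 5-10 guideline, VIF(x3) ~ 1.
```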


• Another approach to data analysis is the use of stepwise regression (Draper and Smith 1966) or of all possible regressions of the data when selecting the number of terms to include in a model.

• This approach can be most useful when the data derive from a study that lacks experimental structure.

• However, experimenters should be aware of the potential pitfalls of such happenstance data (Box et al. 1978).

• A multiple regression best subsets analysis is another analysis alternative, shown next.


Best Subsets Regression: Output timing versus mot_temp, algor, mot_adj, ext_adj, sup_volt

Response is Output timing. Predictors, in column order: mot_temp, algor, mot_adj, ext_adj, sup_volt; an X marks a predictor included in that model.

Vars  R-Sq  R-Sq(adj)  Mallows Cp        S
  1   57.7     54.7        43.3     1.2862   X
  1   33.4     28.6        75.2     1.6152   X
  2   91.1     89.7         1.7     0.61288  X X
  2   58.3     51.9        44.5     1.3251   X X
  3   91.7     89.6         2.9     0.61593  X X X
  3   91.5     89.4         3.1     0.62300  X X X
  4   92.1     89.2         4.3     0.62718  X X X X
  4   91.9     89.0         4.5     0.63331  X X X X
  5   92.4     88.5         6.0     0.64701  X X X X X

Minitab: Stat > Regression > Best Subsets


• A multiple regression best subsets analysis first considers only one factor in a model, then two, and so forth.

• The R² value is then considered for each of the models; for each number of factors, only the factor combinations with the highest two R² values are shown.

• The Mallows Cp statistic is useful for determining the minimum number of parameters that best fits the model. Technically this statistic measures the sum of the squared biases plus the squared random errors in Y at all n data points (Daniel and Wood 1980); a common working form is given below.
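A standard computational form (a textbook identity, not from the slides) is

$$C_p = \frac{SSE_p}{\hat{\sigma}^2_{\text{full}}} - (n - 2p)$$

where $SSE_p$ is the residual sum of squares of a candidate model with $p$ parameters (predictors plus intercept) and $\hat{\sigma}^2_{\text{full}}$ is the residual mean square of the full model; a model with little bias has $C_p \approx p$.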


• The minimum number of factors needed in the model occurs where the Mallows Cp statistic is a minimum. From this output, the pertinent Mallows Cp values under consideration, as a function of the number of factors in the model, are:

Number in Model   Mallows Cp
       1             43.3
       2              1.7**
       3              2.9
       4              4.3
       5              6.0

• From this summary it is noted that the Mallows Cp statistic is minimized when there are two factors in the model. The corresponding factors are algor and mot_adj.
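Minitab's best subsets display is straightforward to approximate by brute force when there are only a handful of candidate predictors. The sketch below is a minimal illustration under assumptions (pandas/statsmodels, and a data frame holding the predictors and response); it is not the original analysis:

```python
from itertools import combinations

import pandas as pd
import statsmodels.api as sm

def best_subsets(df: pd.DataFrame, response: str, top: int = 2):
    """Fit every subset of predictors; report R-Sq, R-Sq(adj), Mallows Cp."""
    predictors = [c for c in df.columns if c != response]
    y = df[response]

    # sigma^2 estimate from the full model, used in the Cp formula.
    full = sm.OLS(y, sm.add_constant(df[predictors])).fit()
    sigma2 = full.mse_resid
    n = len(df)

    rows = []
    for k in range(1, len(predictors) + 1):
        for subset in combinations(predictors, k):
            fit = sm.OLS(y, sm.add_constant(df[list(subset)])).fit()
            p = k + 1  # parameters: k predictors plus the intercept
            cp = fit.ssr / sigma2 - (n - 2 * p)
            rows.append((k, fit.rsquared, fit.rsquared_adj, cp, subset))

    out = pd.DataFrame(rows, columns=["Vars", "R-Sq", "R-Sq(adj)", "Cp", "model"])
    # Keep the top models per size, mirroring Minitab's display.
    return (out.sort_values(["Vars", "R-Sq"], ascending=[True, False])
               .groupby("Vars").head(top))
```

Calling, say, `best_subsets(df, "Output_timing")` on a suitably named data frame (a hypothetical column name) would tabulate the two best models of each size, mirroring the display above.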

26.5 Example 26.2: Multiple Regression Best Subset Analysis

• The results from a cause-and-effect matrix led to a passive analysis of factors A, B, C, and D on throughput. In a plastic molding process, for example, the throughput response might be shrinkage as a function of the input factors. A best subsets computer regression analysis of the collected data yielded:

Best Subsets Regression: Thruput versus A, B, C, D

Response is Thruput

Vars  R-Sq  R-Sq(adj)  Mallows Cp        S     A B C D
  1   92.1     91.4        38.3     0.25631   X
  1   49.2     44.6       294.2     0.64905   X
  2   96.3     95.6        14.9     0.18282   X X
  2   95.2     94.3        21.5     0.20867   X X
  3   98.5     98.0         4.1     0.12454   X X X
  3   97.9     97.1         7.8     0.14723   X X X
  4   98.7     98.0         5.0     0.12363   X X X X


• From this output we note:

• R-Sq: Look for the highest value when comparing models with the same number of predictors (vars).

• R-Sq(adj): Look for the highest value when comparing models with different numbers of predictors.

• Cp: Look for models where Cp is small and close to the number of parameters in the model; e.g., look for a model with Cp close to four for a three-predictor model that has an intercept constant (often we just look for the lowest Cp value).

• s: We want s, the estimate of the standard deviation about the regression, to be as small as possible.


• The regression equation for a model with three predictors (A, C, D) from a computer program is shown below.

• The magnitude of the VIFs is satisfactory, i.e., not larger than 5-10. In addition, there were no observed problems with the residual analysis.

Regression Analysis: Thruput versus A, C, D

The regression equation is
Thruput = 3.87 + 0.393 A + 3.19 C + 0.0162 D

Predictor      Coef   SE Coef      T      P    VIF
Constant     3.8702    0.7127   5.43  0.000
A           0.39333   0.07734   5.09  0.001  1.368
C            3.1935    0.2523  12.66  0.000  1.929
D          0.016189  0.004570   3.54  0.006  1.541

S = 0.124543   R-Sq = 98.5%   R-Sq(adj) = 98.0%


26.6 Indicator Variables (Dummy Variables) to Analyze Categorical Data

• Categorical data such as location, operator, and color can also be modeled using simple and multiple linear regression.

• It is not generally correct to use numerical codes for the categories when analyzing this type of data within regression, since the fitted values within the model will depend upon the assignment of the numerical values.

• The correct approach is through the use of indicator variables, or dummy variables, which indicate whether a factor should or should not be included in the model; a sketch of creating them follows.
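A minimal sketch of creating such indicator variables (Python/pandas and the example values are assumptions for illustration, not data from the text):

```python
import pandas as pd

# Illustrative categorical data: one row per observation.
df = pd.DataFrame({"state": ["AZ", "FL", "TX", "AZ", "TX", "FL"],
                   "revenue": [24.0, 33.1, 49.2, 25.3, 48.1, 32.2]})

# One 0/1 indicator column per category.
indicators = pd.get_dummies(df["state"], dtype=int)
print(indicators)  # columns AZ, FL, TX

# For a model with an intercept, drop one category: the three columns
# always sum to 1, so keeping all three plus a constant is redundant.
indicators_k_minus_1 = pd.get_dummies(df["state"], drop_first=True, dtype=int)
print(indicators_k_minus_1)  # columns FL, TX; AZ becomes the baseline
```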

• With three categories, if we are given the values of two of the indicator variables, we can calculate the third. Hence only two indicator variables are needed for a model with an intercept, and it does not matter which variable is left out of the model. After indicator or dummy variables are created, they are analyzed using regression to create a cell means model.

• If the intercept is left out of the regression equation, a no-intercept cell means model is created. For the case where there are three indicator variables, a no-intercept model would then have three terms, where the coefficients are the cell means; see the sketch below.
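Continuing the same hypothetical data as the previous sketch, a no-intercept fit on all three indicator columns returns the cell means directly (statsmodels is again an assumption):

```python
import pandas as pd
import statsmodels.api as sm

# Same illustrative (made-up) data as the previous sketch.
df = pd.DataFrame({"state": ["AZ", "FL", "TX", "AZ", "TX", "FL"],
                   "revenue": [24.0, 33.1, 49.2, 25.3, 48.1, 32.2]})
indicators = pd.get_dummies(df["state"], dtype=int)

# All three indicator columns and no intercept column: the fitted
# coefficients are the mean revenue within each state (the cell means).
fit = sm.OLS(df["revenue"], indicators).fit()
print(fit.params)                             # AZ, FL, TX cell means
print(df.groupby("state")["revenue"].mean())  # identical grouped means
```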


26.7 Example 26.3: Indicator Variables

• Revenue for Arizona, Florida, and Texas is shown in Table 26.3 (Bower 2001). This table also contains indicator variables that were created to represent these states.

Regression Analysis: Revenue versus AZ, FL, TX

* TX is highly correlated with other X variables
* TX has been removed from the equation.

The regression equation is
Revenue = 48.7 - 23.8 AZ - 16.0 FL

Predictor      Coef  SE Coef       T      P
Constant    48.7329   0.4537  107.41  0.000
AZ         -23.8190   0.6416  -37.12  0.000
FL         -15.9927   0.6416  -24.93  0.000

S = 3.20811   R-Sq = 90.7%   R-Sq(adj) = 90.6%


• Calculations for the various revenues, using the coefficients above, would be:

Texas Revenue   = 48.7 - 23.8(0) - 16.0(0) = 48.7
Arizona Revenue = 48.7 - 23.8(1) - 16.0(0) = 24.9
Florida Revenue = 48.7 - 23.8(0) - 16.0(1) = 32.7

• A no-intercept cell means model from a computer analysis would be:

Regression Analysis: Revenue versus AZ, FL, TX

The regression equation is
Revenue = 24.9 AZ + 32.7 FL + 48.7 TX

Predictor     Coef  SE Coef       T      P
Noconstant
AZ         24.9139   0.4537   54.91  0.000
FL         32.7402   0.4537   72.16  0.000
TX         48.7329   0.4537  107.41  0.000

S = 3.20811
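• Note the consistency between the two outputs (an arithmetic check, not part of the original slide): the no-intercept coefficients are exactly the intercept-model predictions, e.g., 48.7329 − 23.8190 = 24.9139 for AZ and 48.7329 − 15.9927 = 32.7402 for FL, while TX retains the intercept value 48.7329.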


26.8 Example 26.4: Indicator Variables with Covariate

• Consider the following data set, which has created indicator variables and a covariate. This covariate might be a continuous variable such as process temperature or dollar amount for an invoice.

Response  Factor1  Factor2   A   B  High  Covariate
    1        A        1      1   0    1      11
    3        A        0      1   0   -1       7
    2        A        1      1   0    1       5
    2        A        0      1   0   -1       6
    4        B        1      0   1    1       6
    6        B        0      0   1   -1       3
    3        B        1      0   1    1      14
    5        B        0      0   1   -1      20
    8        C        1     -1  -1    1       2
    9        C        0     -1  -1   -1      17
    7        C        1     -1  -1    1      19
   10        C        0     -1  -1   -1      14

Regression Analysis: Response versus Factor2, A, B, High, Covariate

* High is highly correlated with other X variables
* High has been removed from the equation.

The regression equation is
Response = 6.50 - 1.77 Factor2 - 3.18 A - 0.475 B - 0.0598 Covariate

Predictor      Coef  SE Coef      T      P
Constant     6.5010   0.4140  15.70  0.000
Factor2     -1.7663   0.3391  -5.21  0.001
A           -3.1844   0.2550 -12.49  0.000
B           -0.4751   0.2374  -2.00  0.086
Covariate  -0.05979  0.03039  -1.97  0.090

S = 0.580794   R-Sq = 97.6%   R-Sq(adj) = 96.2%
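Since the full data set appears in the table above, this fit is easy to reproduce. The sketch below uses Python's statsmodels (an assumption; any OLS routine works) and, like Minitab, omits the High column:

```python
import numpy as np
import statsmodels.api as sm

# Data from the table above: Response, Factor2, A, B, Covariate.
# (High is omitted, as in the Minitab output, because it is perfectly
# correlated with the other X variables.)
response  = np.array([1, 3, 2, 2, 4, 6, 3, 5, 8, 9, 7, 10])
factor2   = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0])
a         = np.array([1, 1, 1, 1, 0, 0, 0, 0, -1, -1, -1, -1])
b         = np.array([0, 0, 0, 0, 1, 1, 1, 1, -1, -1, -1, -1])
covariate = np.array([11, 7, 5, 6, 6, 3, 14, 20, 2, 17, 19, 14])

X = sm.add_constant(np.column_stack([factor2, a, b, covariate]))
fit = sm.OLS(response, X).fit()
print(fit.params)  # should track 6.50, -1.77, -3.18, -0.475, -0.0598
print(fit.summary())
```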


26.10 Example 26.5: Binary Logistic Regression

• Ingots prepared with different heating and soaking times are tested for readiness to be rolled:

Sample  Heat  Soak  Ready  Not Ready
   1      7    1.0    10       0
   2      7    1.7    17       0
   3      7    2.2     7       0
   4      7    2.8    12       0
   5      7    4.0     9       0
   6     14    1.0    31       0
   7     14    1.7    43       0
   8     14    2.2    31       2
   9     14    2.8    31       0
  10     14    4.0    19       0
  11     27    1.0    55       1
  12     27    1.7    40       4
  13     27    2.2    21       0
  14     27    2.8    21       1
  15     27    4.0    15       1
  16     51    1.0    10       3
  17     51    1.7     1       0
  18     51    2.2     1       0
  19     51    4.0     1       0


Binary Logistic Regression: Ready, Trials versus Heat, Soak

Link Function: Normit

Response Information

Variable  Value      Count
Ready     Event        375
          Non-event     12
Trials    Total        387

Logistic Regression Table

Predictor        Coef    SE Coef      Z      P
Constant      2.89342   0.500601   5.78  0.000
Heat       -0.0399555  0.0118466  -3.37  0.001
Soak       -0.0362537   0.146743  -0.25  0.805

Log-Likelihood = -47.480
Test that all slopes are zero: G = 12.029, DF = 2, P-Value = 0.002
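Minitab's normit link is the probit. The following is a minimal sketch of the same fit in Python's statsmodels (an assumption; any binomial GLM with a probit link should behave similarly), using the grouped counts from the table above:

```python
import numpy as np
import statsmodels.api as sm

# Grouped data from the table above: one row per heat/soak cell.
heat  = np.array([7, 7, 7, 7, 7, 14, 14, 14, 14, 14,
                  27, 27, 27, 27, 27, 51, 51, 51, 51])
soak  = np.array([1.0, 1.7, 2.2, 2.8, 4.0, 1.0, 1.7, 2.2, 2.8, 4.0,
                  1.0, 1.7, 2.2, 2.8, 4.0, 1.0, 1.7, 2.2, 4.0])
ready = np.array([10, 17, 7, 12, 9, 31, 43, 31, 31, 19,
                  55, 40, 21, 21, 15, 10, 1, 1, 1])
not_ready = np.array([0, 0, 0, 0, 0, 0, 0, 2, 0, 0,
                      1, 4, 0, 1, 1, 3, 0, 0, 0])

# Binomial GLM with a probit ("normit") link; the response is the
# (events, non-events) count pair for each cell.
X = sm.add_constant(np.column_stack([heat, soak]))
endog = np.column_stack([ready, not_ready])
fit = sm.GLM(endog, X,
             family=sm.families.Binomial(link=sm.families.links.Probit())).fit()
print(fit.summary())  # coefficients should track the Minitab table above
```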


• Heat would be considered statistically significant. Let's now address the question of which levels are important. Rearranging the data by heat only and examining a p chart of the resulting proportions, it appears that heat at the 51 level causes a larger proportion of not-ready ingots:

Heat  Not Ready  Sample Size
  7       0           55
 14       2          157
 27       7          159
 51       3           16
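The proportions themselves (arithmetic added here, computed from the table) make the contrast plain: 0/55 = 0.0% for heat 7, 2/157 ≈ 1.3% for heat 14, and 7/159 ≈ 4.4% for heat 27, but 3/16 ≈ 18.8% for heat 51.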