Decision 411: Class 8 - Duke University


Transcript of Decision 411: Class 8 - Duke University

Page 1: Decision 411: Class 8 - Duke University


Decision 411: Class 8

• One more way to model seasonality
• Advanced regression (power tools):
  • Stepwise and all possible regressions
  • 1-way ANOVA
  • Multifactor ANOVA
  • General Linear Models (GLM)
• Out-of-sample validation of regression models
• Logistic regression

One more way to model seasonality with regression

Suppose a time series has an underlying stable trend and stable seasonal pattern (either additive or multiplicative), with effects of other independent variables added on, so effects of the independent variables are not seasonal.

Suppose that you also have an externally supplied seasonal index.

Then it may be appropriate to use the seasonal index and/or the seasonal index multiplied by the time index as separate regressors to capture the seasonal part of the overall pattern.

Page 2: Decision 411: Class 8 - Duke University


Details

Let SINDEX denote a seasonal index variable and let TIME denote a time index variable.

Then by including TIME, SINDEX, and SINDEX*TIME as potential regressors, you can model a range of patterns with stable trend and seasonality.

Depending on the amount of trend and the degree to which seasonal swings get larger as the level of the series rises, perhaps not all of these terms would be significant.

Depending on the estimated coefficients, you could fit any of these patterns:

[Figure: three example series plotted against time (t = 0 to 20), one for each case below]

Only SINDEX is significant ⇒ seasonal pattern with no trend (no real difference between additive and multiplicative)

Only TIME and SINDEX are significant ⇒ additive seasonal pattern with trend

SINDEX*TIME is significant ⇒ multiplicative seasonal pattern with trend
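To make this concrete, here is a minimal Python sketch (everything here, data included, is invented for illustration): an ordinary least-squares fit of a synthetic multiplicative series on TIME, SINDEX, and SINDEX*TIME.

```python
# Illustrative sketch (data, index, and helpers all invented): fit
# y = b0 + b1*TIME + b2*SINDEX + b3*SINDEX*TIME by ordinary least
# squares, solving the normal equations with Gaussian elimination.

def solve(a, b):
    """Solve a*x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [u - f * v for u, v in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(X, y):
    """OLS coefficients via the normal equations X'X b = X'y."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)

# Synthetic quarterly series: level (10 + 0.5*t) scaled by an
# externally supplied seasonal index that averages to 1.0.
sindex = [0.8, 1.1, 1.3, 0.8]
time = list(range(1, 21))
s = [sindex[(t - 1) % 4] for t in time]
y = [(10 + 0.5 * t) * st for t, st in zip(time, s)]

X = [[1.0, t, st, st * t] for t, st in zip(time, s)]  # 1, TIME, SINDEX, SINDEX*TIME
beta = ols(X, y)
# Multiplicative series: beta comes out approximately [0, 0, 10, 0.5]
```

Because the toy series is multiplicative, the fit loads on SINDEX and SINDEX*TIME and leaves TIME's own coefficient near zero, matching the third pattern above.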

Page 3: Decision 411: Class 8 - Duke University


Stepwise regression

Automatic stepwise variable selection is a standard feature in multiple regression (a right-mouse-button analysis option in Statgraphics).

Backward stepwise regression

Automates the common process of sequentially removing the variable with the smallest t-stat, if that t-stat is less than a specified threshold.

The "F-to-remove" parameter is the square of the minimum t-stat needed to remain in the model, given the other variables still present.

Can be used to fine-tune the selection of variables, but should not be used to "go fishing" for significant variables in a large pool.

Page 4: Decision 411: Class 8 - Duke University


Forward stepwise

Automates the process of sequentially adding the variable that would have the highest t-stat if it were the next variable entered.

F-to-enter is the square of the minimum t-stat needed to enter, given the other variables already present.

Can be used (with care!) to go fishing for significant variables in a large pool.

It's a potentially powerful data exploration tool, because it does something that would be hard to do by hand.
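The mechanics of forward selection can be sketched in a few lines of Python. This is an invented, simplified implementation (F-to-enter computed from the change in SSE, which equals the squared t-stat of the entering variable); the toy data are made up for illustration.

```python
# Sketch of forward stepwise selection with an F-to-enter rule.
# Helper functions and data are invented for this illustration.

def sse(X, y):
    """Residual sum of squares of an OLS fit (normal equations)."""
    k = len(X[0])
    a = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for c in range(k):  # Gauss-Jordan with partial pivoting
        p = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[p] = a[p], a[c]
        for r in range(k):
            if r != c:
                f = a[r][c] / a[c][c]
                a[r] = [u - f * v for u, v in zip(a[r], a[c])]
    b = [a[i][k] / a[i][i] for i in range(k)]
    return sum((yi - sum(bj * xj for bj, xj in zip(b, row))) ** 2
               for row, yi in zip(X, y))

def forward_stepwise(data, y, f_to_enter=4.0):
    """Repeatedly add the variable with the largest entry F-stat,
    F = (SSE_old - SSE_new) / MSE_new, while that F exceeds f_to_enter."""
    n, chosen = len(y), ["const"]
    while True:
        cur = sse([[data[v][i] for v in chosen] for i in range(n)], y)
        best = None
        for v in data:
            if v in chosen:
                continue
            cols = chosen + [v]
            new = sse([[data[c2][i] for c2 in cols] for i in range(n)], y)
            f = (cur - new) / (new / (n - len(cols))) if new > 0 else float("inf")
            if best is None or f > best[0]:
                best = (f, v)
        if best is None or best[0] < f_to_enter:
            return chosen
        chosen.append(best[1])

# Toy data: y depends strongly on x1; x2 is noise-like.
n = 30
data = {
    "const": [1.0] * n,
    "x1": [float(i) for i in range(n)],
    "x2": [float(i * 7 % 5) for i in range(n)],
}
y = [2.0 * i + 0.1 * (-1) ** i for i in range(n)]
picked = forward_stepwise(data, y)
```

With these data, x1 enters immediately (huge F) and x2 never clears the F-to-enter threshold of 4.0. Backward stepwise is the mirror image: repeatedly delete the variable with the smallest F-to-remove until every remaining F exceeds the threshold.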

Example: enrollment revisited

Dependent variable: ROLL

Potential regressors (8):

lag(ROLL,1), lag(ROLL,2)
HSGRAD, lag(HSGRAD,1)
UNEMP, lag(UNEMP,1)
INCOME, lag(INCOME,1)

Note that YOU are responsible for anticipating all transformations that may be useful.

Previously we had considered models whose equations involved up to 2 lags of ROLL and up to 1 lag of HSGRAD and UNEMP. INCOME is an additional predictive variable that was not considered before.

Page 5: Decision 411: Class 8 - Duke University


Here is the "all likely suspects" model: many variables are not significant.

With F-to-enter and F-to-remove set at 4.0, both forward and backward stepwise regression lead to a 2-variable model.

Page 6: Decision 411: Class 8 - Duke University


Details of the steps in the backward stepwise regression: you can see here by how much the MSE changes as variables are removed.

The INCOME variables are removed first, followed by the HSGRAD variables, leaving UNEMP as the only exogenous variable after step 4. MSE goes up somewhat in steps 5 & 6, although the variables removed at those points are technically not significant at the 0.05 level.

MSE actually improves (i.e., gets smaller) when some of the least significant variables are removed. In this case the smallest MSE was actually reached at step 2, although the MSE's after steps 1-2-3-4 are the same for all practical purposes.

When F-to-enter and F-to-remove are lowered to 3.0, which corresponds to a permissible t-stat as low as 1.73 in magnitude, forward stepwise still leads to the 2-variable model, while backward stepwise leads to this 4-variable model in which one variable (lag(ROLL,2)) has a t-stat of -1.86.

Page 7: Decision 411: Class 8 - Duke University


Caveats

• Stepwise regression (or any other automatic model selection method) is not a substitute for logical thinking and graphical data exploration.
• There is a danger of overfitting from fishing in too large a pool of potential variables and finding "spurious" regressors.
• Resist the urge to lower F-to-enter or F-to-remove below 3.0 to find more significant variables.
• Ideally you should hold out a significant sample of data while selecting variables, for later out-of-sample validation of the model.
• Validation is not "honest" if you peeked at the hold-out data while trying to identify significant variables.

All possible regressions

Automatic stepwise selection (forward or backward) is efficient, but not guaranteed to find the "best" model that can be constructed from a given set of potential regressors.

It is computationally feasible to test "all possible regressions" that can be constructed with k out of m potential regressors.

Beware: danger of getting obsessed with rankings & forgetting about logic & intuition!

Page 8: Decision 411: Class 8 - Duke University


All possible regressions, continued

In Statgraphics, all-possible-regressions is the "Regression model selection" procedure.

Analysis options allow you to set the maximum number of variables (default is 5).

Outputs include rankings of models by adjusted R-squared and the Mallows Cp stat.

Pane options for these reports allow you to limit the number of "best" models shown for a given number of variables (default is 5).

What is the Mallows Cp stat?

Cp = p + (n − p) × ( MSE(subset of size p) / MSE(all variables) − 1 )

where p = # of coefficients in the subset model, including the constant.

Ideally, Cp should be "small" and ≤ p. Note that Cp = p for the all-variable model, so Cp < p if the subset model has lower MSE than the all-variable model. Ideally you should approach or even beat the all-variable MSE with fewer variables.

Ranking by Cp penalizes more heavily for model complexity than ranking by adjusted R-squared.

Page 9: Decision 411: Class 8 - Duke University


Example: enrollment revisited

Dependent variable: ROLL

Potential regressors (8):

lag(ROLL,1), lag(ROLL,2)
HSGRAD, lag(HSGRAD,1)
UNEMP, lag(UNEMP,1)
INCOME, lag(INCOME,1)

Number of possible models = 2^8 = 256

Lining up the suspects:

256 models are fitted to only 27 data points. Overkill?

Since there are "only" 8 potential regressors, it is feasible to ask for reports on all possible models with up to 8 regressors (just for purposes of illustration!!).

The default maximum number of variables is 5, which is usually plenty! In most applications, I would not recommend raising it.
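The bookkeeping behind "all possible models" is just subset enumeration. A Python sketch (the regressor names follow the list above; only the enumeration, not the fitting, is shown):

```python
# Sketch: enumerating every subset of the pool of potential regressors,
# as an all-possible-regressions procedure would. In practice each
# subset would then be fitted and ranked by adjusted R-squared or Cp.
from itertools import combinations

regressors = ["lag(ROLL,1)", "lag(ROLL,2)", "HSGRAD", "lag(HSGRAD,1)",
              "UNEMP", "lag(UNEMP,1)", "INCOME", "lag(INCOME,1)"]

subsets = [combo
           for k in range(len(regressors) + 1)
           for combo in combinations(regressors, k)]
print(len(subsets))   # 2**8 = 256 candidate models (incl. the empty one)
```

Capping the subset size at the default of 5 variables already covers 219 of the 256 subsets, which is why raising the cap rarely buys much.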

Page 10: Decision 411: Class 8 - Duke University


How does it work so fast?

It is actually unnecessary for the computer to run a complete set of calculations "from scratch" for each possible regression.

Once the correlation matrix has been computed from the original variables, a simple sequence of calculations on the correlation matrix can determine the R-squareds and MSEs of all possible models.

The big problem is the length of the reports!
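The simplest instance of this trick: for any one-regressor model, R-squared is the squared correlation between that regressor and the dependent variable, so it can be read straight off the correlation matrix without refitting. A toy sketch (invented data):

```python
# Sketch of the simplest case: a one-regressor model's R-squared equals
# the squared correlation with the dependent variable, so it comes
# straight from the correlation matrix. (Toy data invented.)
from math import sqrt

def corr(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    return suv / sqrt(sum((a - mu) ** 2 for a in u)
                      * sum((b - mv) ** 2 for b in v))

y = [1.0, 2.0, 3.0]
x = [2.0, 4.0, 6.0]     # perfectly correlated with y
z = [1.0, 3.0, 2.0]     # correlation 0.5 with y

r2_x = corr(x, y) ** 2  # 1.0
r2_z = corr(z, y) ** 2  # 0.25
```

The full procedure generalizes this with sweep-style updates on the correlation matrix, so multi-variable subsets are also handled without refitting from scratch.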

Ranking by R-squared

Note hair-splitting differences in adjusted R-squared among models at the top of the rankings. Easy to get lost here!

Since the dependent variable is non-stationary, all of the "good" models have adj. R-sq. very close to 100%.

As usual, it's better to focus on MSE to decide whether additional variables are worth their weight in model complexity: MSE goes down as R-squared goes up.

Some 7-variable models, and even the 8-variable model, show up near the top of the rankings (yikes!).

Page 11: Decision 411: Class 8 - Duke University


Plot of R-squared vs. # coefficients

Plot of adjusted R-squared vs. # coefficients (including constant) shows that most of the variance is explained by the first regressor added, which happens to be lag(ROLL,1).

This is what should be expected with a nonstationary (strongly trended) dependent variable.

Ranking by Cp

Ranking by Cp favors models with fewer coefficients and discriminates more finely among the models at the top. The "best" model includes lag(ROLL,1), lag(ROLL,2), UNEMP, and lag(UNEMP,1).

Page 12: Decision 411: Class 8 - Duke University


Plot of Cp vs. # coefficients

The Cp plot shows that Cp is minimized at 5 coefficients (i.e., 4 regressors + constant). The 4-coefficient model also yields Cp close to p.

Note: the Y-axis scale had to be adjusted to show only small values of Cp.

Details of the best-Cp model

It's necessary to run a "manual" regression to see the coefficients. Note that the coefficients of UNEMP and lag(UNEMP,1) are roughly "equal and opposite." Would a difference be just as good?

Page 13: Decision 411: Class 8 - Duke University


Restarting from a different set of transformed variables

Let's re-run the all-possible-regressions with the lagged regressors replaced by differences. This allows the selection algorithm to choose the difference alone or the difference together with the unlagged variable, which would be logically equivalent to including the lags separately.

New ranking by Cp

Here the two clearly-best models both include lag(ROLL,1), lag(ROLL,2), and diff(UNEMP), and the "top" model also includes diff(HSGRAD). Thus, collapsing separate lags into a difference may allow a model with fewer coefficients to fit as well, or allow a model with the same number of coefficients to fit better.
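The "collapsing" step is just algebra: coefficients of c and -c on a variable and its lag contribute exactly c times the difference. A quick numerical check with invented numbers:

```python
# Sketch checking the "equal and opposite" observation: coefficients
# (c, -c) on a variable and its one-period lag contribute exactly
# c * diff(variable) to the prediction. (Toy numbers, invented.)
unemp = [5.0, 6.5, 6.0, 7.2, 6.8]
c = 0.75   # hypothetical coefficient, for illustration only

pair_terms = [c * unemp[t] + (-c) * unemp[t - 1]
              for t in range(1, len(unemp))]
diff_terms = [c * (unemp[t] - unemp[t - 1])
              for t in range(1, len(unemp))]
# The two sets of contributions agree term by term.
```

So replacing the lag pair with a single difference is the constrained version of the original model: it spends one coefficient instead of two whenever "equal and opposite" is approximately true.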

Page 14: Decision 411: Class 8 - Duke University


New plot of Cp vs. p

The two highest-ranking models now have Cp much less than p.

Conclusions

All-possible-regressions makes sure you don't overlook the model with the lowest possible error stats for a given number of regressors...

...but staring at rankings can distract you from thinking about other issues, such as which model makes the most sense.

It won't find a good model by magic: you still have to choose the set of potential regressors and consider transformations of the variables. (Ditto for stepwise!)

You are NOT REQUIRED to choose the model that is "#1 in the rankings" (on whatever measure).

Page 15: Decision 411: Class 8 - Duke University


Caution: do not "overdifference"

In several of our examples of regressions of nonstationary time series, it has turned out that a differencing transformation was useful.

However, beware of using differencing when it is not really needed!

Differencing adds complexity to a model, and sometimes it may even create artificial correlation patterns and increase the variance to be explained.

Differencing is most appropriate when the original variables either look like random walks (e.g., stock prices) or else are very "smooth" (e.g., ROLL), with variances dramatically reduced by differencing.
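The variance rule of thumb is easy to check numerically. A minimal sketch with invented toy series (a pure trend standing in for a smooth variable like ROLL, and an alternating sequence standing in for noise):

```python
# Illustrative check (toy series, invented): differencing a smooth
# trended series shrinks its variance dramatically, while differencing
# a noisy series *increases* the variance to be explained.

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

smooth = [float(t) for t in range(50)]      # steady trend (ROLL-like)
noisy = [(-1.0) ** t for t in range(50)]    # alternating noise stand-in

d_smooth = [b - a for a, b in zip(smooth, smooth[1:])]  # all exactly 1.0
d_noisy = [b - a for a, b in zip(noisy, noisy[1:])]     # jumps of +/-2

# var(smooth) = 208.25 but var(d_smooth) = 0.0;
# var(noisy) = 1.0 but var(d_noisy) is roughly 4.0
```

For the smooth series, differencing removes essentially all of the variance; for the noisy one, it amplifies it, which is exactly the overdifferencing trap described above.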

Analysis of Variance (ANOVA)

ANOVA is multiple regression with (only) categorical independent variables.

In a "one-way" ANOVA, a dummy variable is created for all-but-one level of the independent variable.

The model then estimates the mean of the dependent variable for each level of the independent variable.

A "pooled" estimate of the error standard deviation is used to compute standard errors of the means.

This is how one-way ANOVA differs from separate calculations of the means, in which standard errors are based on separate standard deviations.

Page 16: Decision 411: Class 8 - Duke University


ANOVA in practice

Analysis of variance is typically used to analyze data from designed experiments in marketing research, pharmaceutical research, crop science, quality, etc. Interest often centers on nonlinear effects and/or interactions among effects of independent variables:

• Does relative effectiveness of different ad formats vary with market or demographics?
• Which combinations and dosages of drugs work best?
• Which combinations and quantities of crop treatments maximize yield and/or quality?

ANOVA is also appropriate for "natural experiments" with categorical variables, if error variances can be assumed to be the same for all categories.

Example: "cardata"

Let's start by doing a one-way ANOVA of mpg vs. origin (continent).

Origin codes 1, 2, 3 refer to America, Europe, and Japan, respectively.

This model will test for differences in average (mean) mpg among the 3 origins. Typical ANOVA output:

• ANOVA table
• Means table
• Box and whisker plot

Page 17: Decision 411: Class 8 - Duke University


The ANOVA table shows the decomposition of the sum of squared deviations from the grand mean and the corresponding variances (but no R-squared!):

• Sum of squared errors (deviations from group means): the "unexplained variance"
• Sum of squared deviations between group means (predictions) & the grand mean: the "explained variance"
• Sum of squared deviations from the grand mean (the total)

The F-ratio (30.20) is the ratio of the explained variance (1189.61) to the unexplained variance (39.3876). The variables in the model are jointly significant if this ratio is significantly greater than 1, which means they are doing more than what would happen if you just "dummied out" some data points.

[Figure: a scatter of points around a grand mean and two group means (A and B), illustrating total variation, between-groups variation (prediction), and within-groups variation (error)]

Decomposition of the sum of squares: SS(total) = SS(between) + SS(within)
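This decomposition is straightforward to compute by hand. A minimal Python sketch on invented data (two groups of three observations):

```python
# Sketch of the one-way ANOVA decomposition on invented data:
# SS(total) = SS(between) + SS(within), and
# F = [SS(between)/(k-1)] / [SS(within)/(n-k)].
groups = {"A": [1.0, 2.0, 3.0], "B": [5.0, 6.0, 7.0]}

allv = [x for g in groups.values() for x in g]
grand = sum(allv) / len(allv)
means = {g: sum(v) / len(v) for g, v in groups.items()}

ss_between = sum(len(v) * (means[g] - grand) ** 2
                 for g, v in groups.items())
ss_within = sum((x - means[g]) ** 2
                for g, v in groups.items() for x in v)
ss_total = sum((x - grand) ** 2 for x in allv)

k, n = len(groups), len(allv)
f_ratio = (ss_between / (k - 1)) / (ss_within / (n - k))
# Here ss_total = 28 splits as 24 (between) + 4 (within), and F = 24
```

The F-ratio here plays exactly the role described above: the explained variance (between-groups mean square) divided by the unexplained variance (within-groups mean square).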

Page 18: Decision 411: Class 8 - Duke University


Means table

The table of means shows the estimated mean of the dependent variable for each level of the independent variable: in this case, mean mpg's for cars from each continent. These are just "ordinary" means. However, the standard errors of the means are based on a "pooled" estimate of the standard deviation of the errors.

[Figure: box-and-whisker plot of mpg by origin (1, 2, 3), with the mean marked in each box]

The "box and whisker" plot provides a nice visual comparison of the means, interquartile ranges, and extreme values:

• Interquartile range (25%-tile to 75%-tile)
• Minimum & maximum (if not "outside")
• "Outside point" (outside the box by >1.5x the interquartile range)

Here we see that the American cars have a significantly lower mean mpg than the European or Japanese cars, although the highest-mpg American car is in the upper quartile of the European and Japanese ranges. Also, although the European and Japanese cars have similar mean mpg's, the Japanese cars have a tighter distribution of mpg's except for two outliers, one high and one low.

Page 19: Decision 411: Class 8 - Duke University


For comparison, here's the same model fitted by using multiple regression with dummy variables for the first two origin codes. Note that the ANOVA table shows the same F-ratio, etc. The CONSTANT in this model is the mean for origin=3, and the coefficients of the dummy variables origin=1 and origin=2 are the differences in means for the other two levels.

Thus, ANOVA is nothing really "new": it's just a repackaging of regression output for the special case when the independent variables are dummies for levels of a categorical variable.
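That equivalence can be verified directly: regress the dependent variable on dummies for the first two levels, and the constant reproduces the level-3 mean while each dummy coefficient is a difference in means. A sketch on invented data (not the actual cardata values):

```python
# Sketch verifying that one-way ANOVA is dummy-variable regression
# (invented data). With dummies for levels 1 and 2, the constant is
# the level-3 mean and each dummy coefficient is that level's mean
# minus the level-3 mean.

def ols(X, y):
    """OLS via the normal equations, Gauss-Jordan with partial pivoting."""
    k = len(X[0])
    a = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[p] = a[p], a[c]
        for r in range(k):
            if r != c:
                f = a[r][c] / a[c][c]
                a[r] = [u - f * v for u, v in zip(a[r], a[c])]
    return [a[i][k] / a[i][i] for i in range(k)]

mpg = {1: [18.0, 20.0], 2: [28.0, 30.0], 3: [31.0, 33.0]}  # invented
X, y = [], []
for level, values in mpg.items():
    for v in values:
        X.append([1.0,
                  1.0 if level == 1 else 0.0,
                  1.0 if level == 2 else 0.0])
        y.append(v)

b0, b1, b2 = ols(X, y)
# Group means are 19, 29, 32, so b0 = 32, b1 = 19-32 = -13, b2 = 29-32 = -3
```

Running a one-way ANOVA on the same invented data would report those three group means directly; the regression just reparameterizes them as a baseline plus differences.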

The standard error of the regression (6.27596) is the square root of the mean square for error (39.3876)...

...and R-squared is the model SS (2379.22) divided by the total SS (8326.75).

Multifactor ANOVA

Multifactor ANOVA is regression with dummy variables for levels of two or more categorical independent variables.

When there are two or more variables, you can estimate not only "main effects", but also "interactions" among levels of two different variables.

One of the questions of interest is whether the interactions are significant.

Page 20: Decision 411: Class 8 - Duke University


Multi-factor ANOVA: possible patterns in data

[Bar charts of hypothetical mean responses for 2 levels of factor A and 3 levels of factor B, in four panels: A main effect only; B main effect only; both A and B main effects without interaction; both A and B main effects with interaction. Interactions between factors??]
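To make "interaction" concrete: given a table of cell means, the additive (main-effects-only) model predicts each cell as grand mean + A effect + B effect, and the interaction is whatever is left over. Here is a pure-Python sketch with hypothetical cell means (2 levels of A × 3 levels of B, not data from the slides) chosen to be exactly additive, so every interaction term comes out to zero.

```python
# Hypothetical cell means for 2 levels of factor A x 3 levels of factor B.
# These numbers are perfectly additive, i.e., there is no interaction.
cells = {(1, 1): 10.0, (1, 2): 14.0, (1, 3): 18.0,
         (2, 1): 13.0, (2, 2): 17.0, (2, 3): 21.0}

grand = sum(cells.values()) / len(cells)

# Main effects: deviation of each level's mean from the grand mean.
a_effect = {a: sum(v for (i, _), v in cells.items() if i == a) / 3 - grand
            for a in (1, 2)}
b_effect = {b: sum(v for (_, j), v in cells.items() if j == b) / 2 - grand
            for b in (1, 2, 3)}

# Interaction: what the additive (main-effects-only) model fails to explain.
interaction = {(a, b): cells[(a, b)] - (grand + a_effect[a] + b_effect[b])
               for (a, b) in cells}
```

Perturbing any single cell mean would make its interaction term nonzero: the "with interaction" pattern in the last panel above.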

Page 21: Decision 411: Class 8 - Duke University


Two factors: mpg vs. origin & year

Here is a 2-factor ANOVA with no interactions: only main effects have been estimated. The ANOVA table now shows separate F-ratios for each of the two input variables, reflecting the joint significance of their respective dummy variables. (Both are significant here, but origin is more significant.)

Main effects (mean mpg) for origin & year

The means table now shows means of the dependent variable for each level of both variables: these are the "main effects." (The means by origin are slightly different from those of the one-way ANOVA, since the coefficients of the origin dummies are now being estimated simultaneously with those for year.)

Page 22: Decision 411: Class 8 - Duke University


[Plots: Means and 95.0 Percent LSD Intervals for mpg by origin (levels 1-3) and for mpg by year (78-82).]

The LSD (least significant difference) intervals are constructed in such a way that if two means are the same, their intervals will overlap 95.0% of the time. Any pair of intervals that do not overlap vertically corresponds to a pair of means which have a statistically significant difference.

Here Europe and Japan have the same high average mpg.

Same model fitted by multiple regression: the differences among coefficients for different levels of the same variable are the same as the differences among means in the ANOVA output.

These coefficients can be computed from the ones in the multifactor ANOVA output, but not vice versa: the multiple regression output does not show the "grand mean."

Page 23: Decision 411: Class 8 - Duke University


Estimating interaction effects

If the order of interactions is set to 2, additional dummy variables will be added for all possible combinations of a level of one variable and a level of the other variable.

Are interaction effects significant?

The F-ratio for the variance explained by the interaction terms is not significant. Hence there is no significant interaction between origin and year. This means that variations of average mpg across years are essentially the same for each origin, and correspondingly, variations of average mpg across origins are essentially the same for each year.

Page 24: Decision 411: Class 8 - Duke University


Here are the details of the estimated interactions, as well as the main effects.

Categorical + quantitative?

What if you want to include a "quantitative" independent variable along with dummies for categorical variables?

Example: suppose you want to include weight as an additional regressor to control for differences in average weights in cars from different countries of origin.

This brings us to...

Page 25: Decision 411: Class 8 - Duke University


General linear models (GLM)

GLM is a combination of multifactor ANOVA and ordinary multiple regression.

You can specify both categorical and quantitative independent variables.

You can also estimate interactions and "nested" effects.

Input variables for GLM

Page 26: Decision 411: Class 8 - Duke University


Effects and interactions to be estimated

After the variables have been specified on the data input panel, this panel is used to specify interactions and/or nesting of effects (if any). To begin with, we will just look for main effects…

Here's the ANOVA report. Note that weight has a very significant F-ratio. (For a quantitative variable, the F-ratio is simply the square of its t-stat. Here the F-ratio of 201 corresponds to a t-stat of around 14.)

Page 27: Decision 411: Class 8 - Duke University


At the bottom of the Analysis Summary report are the usual regression statistics, including separate error stats for a validation period. If you use the "Select" box to hold out data in this (or any Advanced Regression) procedure, the de-selected points are used as the hold-out sample. Hence if you use the GLM procedure to fit a multiple regression model, you can perform out-of-sample validation!

Regression with out-of-sample validation!

Holding out a random sample

An additional column is added to the data worksheet, with the name "random120." The Generate Data option is used with the expression RANDOM(120) to fill the column with 1's in 120 random places, 0's elsewhere. When this variable is used as the "Select" criterion, only the randomly chosen rows with 1's will be fitted.
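The effect of RANDOM(120) can be sketched in a few lines of Python. The worksheet length below (150 rows) is a made-up number for illustration; the slides don't state the actual size of the data set.

```python
import random

random.seed(411)  # seed chosen only to make this illustration reproducible

n_rows = 150  # hypothetical worksheet length (the slides don't state it)
n_fit = 120

# Equivalent of Statgraphics' RANDOM(120): 1's in 120 random rows, 0's elsewhere.
chosen = set(random.sample(range(n_rows), n_fit))
random120 = [1 if i in chosen else 0 for i in range(n_rows)]

# Rows flagged 1 are fitted; rows flagged 0 become the hold-out sample.
fit_rows = [i for i, flag in enumerate(random120) if flag == 1]
holdout_rows = [i for i, flag in enumerate(random120) if flag == 0]
```

Using `random.sample` (sampling without replacement) guarantees exactly 120 rows get a 1, matching what the RANDOM(120) expression does to the column.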

Page 28: Decision 411: Class 8 - Duke University


Refitting the GLM model with a random hold-out sample

Here the new random120 variable is used as the "Select" criterion. In this case it is appropriate for the hold-out sample to be determined randomly because the variables are not time series and the rows are sorted according to the values of some of the independent variables. Hence holding out the last k values would not necessarily yield a representative sample.

Validation results

Of the 120 rows that were randomly selected for fitting, only 119 had non-missing values for all independent variables. Note that MSE is actually smaller in the validation period (perhaps there was less variance in the hold-out sample), while MAPE is slightly larger. So, the model appears to be valid, i.e., not overfitted.
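The two error statistics being compared are straightforward to compute. A minimal sketch, using made-up actual/predicted values rather than the report's numbers:

```python
def mse(actual, predicted):
    # mean squared error
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    # mean absolute percentage error
    return 100.0 * sum(abs(a - p) / abs(a)
                       for a, p in zip(actual, predicted)) / len(actual)

# Made-up fitting-period and hold-out-period values, just to show the comparison.
fit_actual, fit_pred = [20.0, 25.0, 30.0], [21.0, 24.0, 32.0]
val_actual, val_pred = [22.0, 28.0], [23.0, 27.0]

fit_mse, val_mse = mse(fit_actual, fit_pred), mse(val_actual, val_pred)
fit_mape, val_mape = mape(fit_actual, fit_pred), mape(val_actual, val_pred)
# A validation MSE not much larger than the fitting MSE (here it is smaller)
# suggests the model is not overfitted.
```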

Page 29: Decision 411: Class 8 - Duke University


The dummy variables actually have values of +1/-1/0 instead of +1/0, although taken together they are equivalent to the "usual" dummy variables.

Back to the original model with no hold-out: here's the "Model Coefficients" report. It also includes Variance Inflation Factors to test for multicollinearity (VIF > 10 is "bad").

What is multicollinearity?

Multicollinearity refers to a situation in which the independent variables are strongly linearly related to each other.

When multicollinearity exists, the estimated coefficients may not represent the "true" effects of the variables, and standard errors will be inflated (variables may all appear to be insignificant despite high R-squared).

In the most extreme case, where one independent variable is an exact linear function of the others, the regression will fail to produce any results at all (you will get an error condition).

Page 30: Decision 411: Class 8 - Duke University


What are Variance Inflation Factors?

The VIF for the kth regressor is

(VIF)_k = 1 / (1 − R_k²)

…where R_k² is the R-squared obtained by regressing the kth regressor on all the other regressors.

Thus, (VIF)_k = 1 when R_k² = 0.

Severe multicollinearity is indicated if (VIF)_k > 10, which means R_k² > 90%.
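As a sanity check on the formula, here is a sketch of the two-regressor case, where R_k² reduces to the squared Pearson correlation between the two regressors (so both VIFs coincide). The weight/horsepower numbers are invented for illustration, not taken from the slides' data.

```python
def vif_two_regressors(x1, x2):
    # With exactly two regressors, R_k^2 from regressing either one on the
    # other is the squared Pearson correlation, so both VIFs are equal.
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    var1 = sum((a - m1) ** 2 for a in x1)
    var2 = sum((b - m2) ** 2 for b in x2)
    r_squared = cov * cov / (var1 * var2)
    return 1.0 / (1.0 - r_squared)

# Hypothetical regressors: horsepower is nearly a linear function of weight,
# so collinearity is severe and the VIF blows up well past 10.
weight = [2.0, 2.5, 3.0, 3.5, 4.0]
horsepower = [70.0, 85.0, 100.0, 118.0, 130.0]
vif = vif_two_regressors(weight, horsepower)
```

With essentially uncorrelated regressors the same function returns a value close to 1, the no-collinearity benchmark.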

[Plots: Means and 95.0 Percent LSD Intervals for mpg by origin (levels 1-3) and for mpg by year (78-82), from the model that also includes weight.]

When we control for weight, the pattern of main effects is different: European cars get higher mileage for a given weight. Japanese cars evidently get high mileage by being lighter than American or European cars on average. Also, there has been a general upward trend in mpg except for a drop in 1981.

GLM example, continued

Page 31: Decision 411: Class 8 - Duke University


[Residual Plot: studentized residual vs. predicted mpg.]

Another nice feature of the GLM procedure: both autocorrelation and probability plots are "pane options" for the residual plot.

Probability plot in GLM

[Normal Probability Plot for mpg: percentage vs. studentized residual.]

Here's the (vertical) probability plot. The slight S-shaped pattern indicates that the "tails" of the residual distribution are a bit fatter than normal, but nothing much to worry about in this case. (There is no simple transformation of the data that will make the distribution look any better; apparently a few cars are just exceptional.)

Page 32: Decision 411: Class 8 - Duke University


Here's the same model fitted in the Multiple Regression procedure instead. Note that the regression stats and the coefficient of weight are identical. The differences in coefficients between levels of the same factor are also identical.

GLM with interaction effects

The GLM procedure can also be used to test for interaction ("cross") effects, exactly as in the ANOVA procedure, as well as to use "nested" experimental designs. Here this feature is used to look for interactions between the categorical factors while controlling for the effect of a quantitative factor (weight).

Hit the Cross button to insert the interaction operator (*).

Page 33: Decision 411: Class 8 - Duke University


The F-ratio for the variance explained by the interaction between origin and year is larger when controlling for weight, but still not technically significant (F=1.76, P=0.089).

Summary of GLM features

GLM is an "all-everything" procedure for fitting models with categorical and/or quantitative factors, with or without interaction effects.

It can perform out-of-sample validation.

Variance Inflation Factors (VIFs) are a test for multicollinearity (>10 is bad).

It also includes a few more built-in plots (residual autocorrelation & probability plot).

Page 34: Decision 411: Class 8 - Duke University


Logistic regression

Logistic regression is regression with a binary (0-1) dependent variable, e.g., an indicator variable for the occurrence of some event or condition.

Applications: predicting probabilities of events, or fractions of individuals who will respond to a given promotion or medical treatment, etc.

In this case Ŷ_t is the predicted probability that Y_t = 1.

The probabilistic prediction equation has this form:

Ŷ_t = exp(β0 + β1 X1,t + β2 X2,t + …) / (1 + exp(β0 + β1 X1,t + β2 X2,t + …))

Predictions expressed in terms of "odds"

The predicted probability can equivalently be expressed in the form of "odds in favor of Y_t = 1":

The predicted total odds is a product rather than a sum of contributions of the independent variables.

If X_i,t increases by one unit, the predicted odds in favor of Y_t = 1 increase by the factor exp(β_i), other things being equal.

Ŷ_t / (1 − Ŷ_t) = exp(β0 + β1 X1,t + β2 X2,t + …)
                = exp(β0) × exp(β1)^X1,t × exp(β2)^X2,t × …

Here Ŷ_t / (1 − Ŷ_t) is the predicted odds in favor of Y_t = 1, exp(β0) is the constant odds factor, and the odds factor exp(β_i) of the ith variable is raised to the power X_i,t.
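The "one-unit increase multiplies the odds by exp(β_i)" property can be verified numerically. The coefficients below are invented for illustration (a constant, an age coefficient, and a sex-dummy coefficient), not estimated from the slides' subscription data.

```python
import math

# Hypothetical fitted coefficients: constant, age, and a sex dummy.
b0, b1, b2 = -2.0, 0.05, 0.8

def predicted_prob(x1, x2):
    # Logistic prediction equation: exp(z) / (1 + exp(z))
    z = b0 + b1 * x1 + b2 * x2
    return math.exp(z) / (1.0 + math.exp(z))

def odds(p):
    # odds in favor of Y = 1
    return p / (1.0 - p)

p_40 = predicted_prob(40, 1)
p_41 = predicted_prob(41, 1)   # age raised by one unit, sex held fixed

# The predicted odds change by exactly the factor exp(b1).
odds_ratio = odds(p_41) / odds(p_40)
```

Since p/(1 − p) equals exp(z) exactly, the ratio of odds for two inputs differing by one unit of X1 is exp(β1) with no approximation involved.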

Page 35: Decision 411: Class 8 - Duke University


Predictions expressed in terms of "log-odds"

The predicted probability can also be equivalently expressed in "log odds" form:

log(Ŷ_t / (1 − Ŷ_t)) = β0 + β1 X1,t + β2 X2,t + …

Thus, logistic regression uses a linear regression equation to predict the log odds in favor of Y_t = 1.

However, you can't estimate the model by regressing log(Y_t/(1−Y_t)) on the X's. (Can't take the log of zero!)

In practice, the betas are estimated by a procedure that is similar to minimizing a weighted sum of the squared prediction errors Σ (Y_t − Ŷ_t)².
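The slides don't say exactly which algorithm Statgraphics uses; the standard procedure matching this description is Newton-Raphson, also known as iteratively reweighted least squares. A minimal pure-Python sketch for one predictor plus a constant, on made-up age/response data:

```python
import math

def fit_logistic(x, y, iters=25):
    # Newton-Raphson (iteratively reweighted least squares) for the model
    # p = exp(b0 + b1*x) / (1 + exp(b0 + b1*x)), maximizing the likelihood.
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]          # the iteration's "weights"
        g0 = sum(yi - pi for yi, pi in zip(y, p))  # gradient of log-likelihood
        g1 = sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, p))
        h00 = sum(w)                               # negative Hessian entries
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det          # 2x2 solve (Cramer's rule)
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Made-up ages and 0-1 responses (overlapping groups, so the fit stays finite).
ages = [20.0, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0, 55.0]
resp = [0, 0, 1, 0, 1, 1, 0, 1]
b0, b1 = fit_logistic(ages, resp)
```

At the solution, the fitted probabilities sum to the number of observed 1's, which is the score equation the weighted iterations are driving to zero.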

Logistic example: predicting magazine subscription responses by age and sex

The dependent variable can either be a binary (0-1) variable (shown here) or it can be a vector of proportions or probabilities, together with a vector of sample sizes.

Page 36: Decision 411: Class 8 - Duke University


Coefficients, standard errors, and R-squared are interpreted in the same manner as in multiple regression. The odds ratio of an independent variable is just EXP(beta).

Differences in predicted subscription responses for male and female subjects

Page 37: Decision 411: Class 8 - Duke University


"This plot shows a summary of the prediction capability of the fitted model. First, the model is used to predict the response using the information in each row of the data file. If the predicted value is larger than the cutoff, the response is predicted to be TRUE. If the predicted value is less than or equal to the cutoff, the response is predicted to be FALSE. The table shows the percent of the observed data correctly predicted at various cutoff values. For example, using a cutoff equal to 0.56, 75.0% of all TRUE responses were correctly predicted, while 95.0% of all FALSE responses were correctly predicted, for a total of 85.0%. Using the cutoff value which maximizes the total percentage correct may provide a good value to use for predicting additional individuals."
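The percentages in such a table can be reproduced directly from the predicted probabilities. This sketch uses made-up responses and probabilities, not the report's actual 0.56 / 75.0% / 95.0% figures:

```python
def percent_correct(actual, prob, cutoff):
    # Percent of TRUE (1) and FALSE (0) responses correctly predicted at a
    # given cutoff, plus the overall percentage.
    true_hits = sum(1 for a, p in zip(actual, prob) if a == 1 and p > cutoff)
    false_hits = sum(1 for a, p in zip(actual, prob) if a == 0 and p <= cutoff)
    n_true = sum(actual)
    n_false = len(actual) - n_true
    pct_true = 100.0 * true_hits / n_true
    pct_false = 100.0 * false_hits / n_false
    pct_total = 100.0 * (true_hits + false_hits) / len(actual)
    return pct_true, pct_false, pct_total

# Made-up data: 1 = subscribed, 0 = did not, with model-predicted probabilities.
actual = [1, 1, 1, 1, 0, 0, 0, 0]
prob = [0.9, 0.7, 0.6, 0.4, 0.5, 0.3, 0.2, 0.1]

# Scan several cutoffs and keep the one with the best total percentage correct.
best = max((0.2, 0.4, 0.5, 0.6),
           key=lambda c: percent_correct(actual, prob, c)[2])
```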

Other advanced regression procedures

Comparison of regression lines: fit several simple regressions to the same X & Y variables, splitting the data on levels of another variable.

Nonlinear regression: estimate a model such as Y = 1/(a + b*X^c). Similar to Solver in Excel.

Page 38: Decision 411: Class 8 - Duke University


Resources

You can find out more about these and other procedures via the Statgraphics help system, StatAdvisor, and user manuals (in pdf files in your Statgraphics directory).

There are also many good on-line sources:

Statsoft on-line textbook:
http://www.statsoft.com/textbook/stathome.html

David Garson's on-line textbook:
http://www2.chass.ncsu.edu/garson/pa765/statnote.htm

(…links are available on the Decision 411 course home page)