Regression Methods


Transcript of Regression Methods

Page 1: Regression Methods

Olawale Awe. LISA Short Course, Department of Statistics, Virginia Tech. November 21, 2013.

Regression Methods

Page 2: Regression Methods

About

What? Laboratory for Interdisciplinary Statistical Analysis

Why? Mission: to provide statistical advice, analysis, and education to Virginia Tech researchers

How? Collaboration requests, Walk-in Consulting, Short Courses

Where? Walk-in Consulting in GLC and various other locations (www.lisa.stat.vt.edu/?q=walk_in); collaboration meetings typically held in Sandy 312

Statistical Collaborators? Graduate students and faculty members in the VT statistics department

Page 3: Regression Methods

Requesting a LISA Meeting

Go to www.lisa.stat.vt.edu and click the link for “Collaboration Request Form”.

Sign into the website using your VT PID and password.

Enter your information (email, college, etc.).

Describe your project (project title, research goals, specific research questions, whether you have already collected data, special requests, etc.).

Contact your assigned LISA collaborators as soon as possible to schedule a meeting.

Page 4: Regression Methods

Laboratory for Interdisciplinary Statistical Analysis

Collaboration:

Visit our website to request personalized statistical advice and assistance with:

Experimental Design • Data Analysis • Interpreting Results • Grant Proposals • Software (R, SAS, JMP, SPSS...)

LISA statistical collaborators aim to explain concepts in ways useful for your research.

Great advice right now: Meet with LISA before collecting your data.

All services are FREE for VT researchers. We assist with research—not class projects or homework.

LISA helps VT researchers benefit from the use of Statistics

www.lisa.stat.vt.edu

LISA also offers:

Educational Short Courses: designed to help graduate students apply statistics in their research.

Walk-In Consulting: M-F 1-3 PM in the GLC Video Conference Room for questions requiring <30 mins; also 11 AM-1 PM at the Port (Library/Torg Bridge) and 9:30-11:30 AM at ICTAS Café X.

Page 5: Regression Methods

Outline

Introduction to Regression Analysis
Simple Linear Regression
Multiple Linear Regression
Regression Model Assumptions
Residual Analysis
Assessing Multicollinearity: Correlation and VIF
Model Selection Procedures
Illustrative Example (Brief Demo with SPSS/PASW)
Model Diagnostics and Interpretation

Page 6: Regression Methods

Introduction

Regression is a statistical technique for investigating, describing, and predicting the relationship between two or more variables.

Regression has been regarded as the most widely used technique in statistics.

It is as basic to statistics as the Pythagorean theorem is to geometry (Montgomery et al., 2006).

Page 7: Regression Methods

Regression: Intro

Regression Analysis has tremendous applications in almost every field of human endeavor.

One of the most popular statistical techniques used by researchers.

Widely used in engineering, physical and chemical sciences, economics, management, social sciences, life and biological sciences, etc.

Easy to understand and interpret. Simply put, regression analysis is used to find equations that fit data.

Page 8: Regression Methods

When do we use Regression Technique?

The choice of technique depends on the types of the response and explanatory variables:

Response Variable | Explanatory: Categorical                 | Explanatory: Continuous | Explanatory: Categorical & Continuous
Categorical       | Contingency Table or Logistic Regression | Logistic Regression     | Logistic Regression
Continuous        | ANOVA                                    | Regression              | ANCOVA or Regression with categorical variables

Page 9: Regression Methods


SIMPLE LINEAR REGRESSION

Page 10: Regression Methods

Simple Linear Regression


Simple Linear Regression (SLR) is a statistical method for modeling the relationship between ONLY two continuous variables.

A researcher may be interested in modeling the relationship between Life Expectancy and Per Capita GDP of seven countries as follows.

Scatterplots are first used to graphically examine the relationship between the two variables.

[Scatterplot: Life Expectancy (76-79 years) vs. Per Capita GDP (18-24)]

Page 11: Regression Methods

Types of Relationships Between Two Continuous Variables


A scatter plot is a visual representation of the relationship between two variables.

Positive and negative linear relationship

[Two scatterplots of Y vs. X, one showing a positive linear relationship and one showing a negative linear relationship]

Page 12: Regression Methods

Other Types of Relationships…


Curvilinear Relationships

No Relationship

[Scatterplots of Y vs. X illustrating a curvilinear relationship and no relationship]

Page 13: Regression Methods

Simple Linear Regression


Can we describe the behavior between the two variables with a linear equation?

The variable on the x-axis is often called the explanatory or predictor variable (X).

The variable on the y-axis is called the response variable (Y).

Page 14: Regression Methods

Simple Linear Regression Model


The Simple Linear Regression model is given by

yi = β0 + β1xi + εi

where yi is the response of the ith observation, β0 is the y-intercept, β1 is the slope, xi is the value of the predictor variable for the ith observation, and εi is the random error.
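As an illustrative aside (not part of the original slides), here is a minimal sketch of fitting an SLR model in Python with statsmodels; the data values are made up to echo the Life Expectancy vs. Per Capita GDP example:

import numpy as np
import statsmodels.api as sm

# Hypothetical data echoing the slide's example (seven countries)
gdp = np.array([18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0])   # per capita GDP
life = np.array([76.1, 76.6, 77.0, 77.6, 78.0, 78.4, 78.9])  # life expectancy

X = sm.add_constant(gdp)     # adds the column of 1s for the intercept β0
fit = sm.OLS(life, X).fit()  # least-squares estimates of β0 and β1

print(fit.params)            # [estimated intercept, estimated slope]
print(fit.rsquared)          # coefficient of determination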

Page 15: Regression Methods

Interpretation of Slope and Intercept Parameter

β1 is the difference in the predicted value of Y for a one-unit difference in X.

β0 is the mean response when the predictor variable is zero (it often has no practical meaning but should be included).

If β1 > 0, there is a positive relationship: as variable X increases, Y also increases.

If β1 < 0, there is a negative relationship: as variable X increases, Y decreases.

If β1 = 0, there is no linear relationship between the two variables (see graphs below).

Page 16: Regression Methods

Graphs of Relationships Between Two Continuous Variables


[Three scatterplots of Y vs. X illustrating β1 > 0, β1 < 0, and β1 = 0]

Page 17: Regression Methods

Line of Best fit

A line of best fit is a straight line that best represents your data on a scatter plot.

It has the same form as the equation of a straight line from elementary math class: y = mx + b, where m = slope and b = y-intercept.

The residual is r = y − ŷ, where y is the observed response and ŷ is the predicted response; the residuals satisfy E(r) = 0 (more on residuals later).

Page 18: Regression Methods

Regression Assumptions


Linearity between the dependent and independent variable(s).

Observations are independent. This is based on how the data were collected; check by plotting residuals vs. the order in which the data were collected.

Constant variance of the error terms. Check using a residual plot (plot residuals vs. ŷ).

The error terms are normally distributed. Check by making a histogram or normal quantile plot of the residuals.

Page 19: Regression Methods

Example 1

Consider data on 15 American women collected by a researcher.

We can fit a model of the form Weight = β0 + β1Age + ε to the data.

Page 20: Regression Methods

Scatter Plot of Weight vs Age

[Scatter plot of Weight vs. Age with the line of best fit overlaid]

Page 21: Regression Methods

Model Estimation and Result

The estimated regression line is: Weight = -87.52 + 3.45Age

Can you interpret these results?

Page 22: Regression Methods

Description/Interpretation

The above results can be interpreted as follows:

- The Sig. (p-value) of 0.000 indicates that the model is a good fit to the data: Age makes a significant contribution to explaining the variability in the women's weights.

- The value of β1 (slope = 3.45) indicates a positive relationship between weight and age. For every additional unit increase in age, we can expect weight to increase by an average of 3.45 kilograms.

- R indicates that there is a high association between the dependent variable and the predictor. The R-squared value of 0.991 means that 99.1% of the variability in the women's weights is explained by the model.

Page 23: Regression Methods

Prediction

Using the regression model above, we can predict the weight of a woman who is 75 years old:

Weight = -87.52 + 3.45(75) = 171.23 ≈ 171

Exercise: using the SLR model above, predict the weight of a woman whose age is 82. (Ans: ≈195 kg)
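As a tiny sketch (not from the slides), the same arithmetic in Python, using the fitted coefficients reported above:

# Coefficients from the estimated line: Weight = -87.52 + 3.45*Age
b0, b1 = -87.52, 3.45

def predict_weight(age):
    # Predicted weight from the fitted SLR model
    return b0 + b1 * age

print(predict_weight(75))  # 171.23
print(predict_weight(82))  # 195.38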

Page 24: Regression Methods

MULTIPLE REGRESSION

Page 25: Regression Methods

Frequently there are many predictors that we want to use simultaneously.

The multiple linear regression model is:

yi = β0 + β1xi1 + β2xi2 + … + βpxip + εi

• Similar to simple linear regression, except now there is more than one explanatory variable.

In this situation each βj represents the partial slope of predictor xj.

It can be interpreted as “the mean change in the response variable for one unit of change in that predictor variable while holding the other predictors in the model constant.” A fitting sketch appears below.
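As a hedged sketch (names and data are hypothetical, not from the slides), fitting a two-predictor model in Python with statsmodels:

import numpy as np
import statsmodels.api as sm

# Hypothetical data: age and height as predictors of weight
rng = np.random.default_rng(0)
age = rng.uniform(30, 60, size=15)
height = rng.uniform(58, 72, size=15)
weight = 3.4 * age - 1.0 * height + rng.normal(0, 2, size=15)

X = sm.add_constant(np.column_stack([age, height]))  # intercept + 2 predictors
fit = sm.OLS(weight, X).fit()
print(fit.summary())  # partial slopes, p-values, R-squared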

Page 26: Regression Methods

Example 2:

Suppose the researcher in Example 1 above is interested in knowing whether height also contributes to changes in weight:

Page 27: Regression Methods

Step 1: Scatterplots


Page 28: Regression Methods

Model Estimation with SPSS


Page 29: Regression Methods

Multiple Regression

The new model is therefore written as: Weight = β0 + β1Age + β2Height + Error

So, the fit is: Weight = -81.53 + 3.46Age - 1.11Height

Page 30: Regression Methods

Model Interpretation

The result of the model estimation above shows that height does not contribute to explaining the variability in the women's weights. Its high p-value shows that it is not statistically significant (changes in height are not associated with changes in weight).

No statistically significant linear dependence of the mean of weight on height was detected.

Note that the R-squared and Adjusted R-squared values did not decrease when we added the additional independent variable.

For every one-unit increase in age, average weight increases at a rate of 3.46 units, holding height constant.

Page 31: Regression Methods

Model Diagnostic and Residual Analysis

A residual is a measure of the variability in the response variable not explained by the regression model.

Analysis of the residuals is an effective way to discover violations of the model assumptions.

Plotting residuals is a very effective way to investigate how well the regression model fits the data.

A residual plot is used to check the assumption of constant variance and to check model fit (can the model be trusted?).
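An illustrative sketch (not part of the slides) of the standard residual plots in Python, assuming fit is a statsmodels OLS results object like the ones above:

import matplotlib.pyplot as plt

# Residuals vs. predicted values: a good fit shows a patternless band around 0
plt.scatter(fit.fittedvalues, fit.resid)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.show()

# Histogram of residuals: check the normality assumption
plt.hist(fit.resid, bins=10)
plt.xlabel("Residual")
plt.show()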

Page 32: Regression Methods

Diagnostics: Residual Plot

The residuals should fall in a symmetrical pattern and have a constant spread throughout their range.

A good residual plot shows no pattern.

[Residual plot: residuals vs. X scattered randomly around zero]

Page 33: Regression Methods

We Can Plot:

Residuals vs. independent variable(s)
Residuals vs. predicted values
Residuals vs. order of the data
Residual lag plot
Histogram of residuals
Standardized residuals vs. standardized predicted values, etc.

Page 34: Regression Methods

Residuals

Column 3 in the table below shows the residuals of the regression model Weight = -87.52 + 3.45Age.

A residual is the deviation between the data and the fit (Actual Y − Predicted Y).

Page 35: Regression Methods

Residual Diagnostics: Very Important!

Left: residuals show non-constant variance. Right: residuals show a non-linear pattern.

[Two residual plots vs. X: one with steadily increasing spread, one with a curved pattern]

Page 36: Regression Methods

Look at the Figures Below, What Do You Think?


Page 37: Regression Methods

Residual Plot


Page 38: Regression Methods

Residual Plots


Page 39: Regression Methods

What if the Assumptions Are Not Met?

Linearity: transform the dependent variable (see next slide).

Normality: transform the data (also when an outlier is present), or use robust regression, where normality is not required; increase the sample size, if possible.

Homogeneity of variance: try transforming the data.

Page 40: Regression Methods

Some Tips on Transformation

Log Y: used if Y is positively skewed and has positive values, or if Y has a Poisson distribution (i.e., is count data).

1/Y: used if the variance of Y is proportional to the 4th power of E(Y).

sin⁻¹(√Y): used if Y is a proportion or rate.
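A brief hypothetical sketch of these transformations in Python (y is assumed to be a NumPy array of responses; which transformation to use depends on the diagnostics above):

import numpy as np

y = np.array([0.10, 0.25, 0.40, 0.55, 0.80])  # hypothetical positive responses

y_log = np.log(y)               # for positively skewed or count-like y
y_inv = 1.0 / y                 # when Var(y) is proportional to E(y)**4
y_asin = np.arcsin(np.sqrt(y))  # when y is a proportion or rate in [0, 1]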

Page 41: Regression Methods

Multicollinearity

Multicollinearity is a common problem in multiple regression that develops when one or more of the independent variables is highly correlated with one or more of the other independent variables.

How the explanatory variables relate to each other is fundamental to understanding their relationship with the response variable.

Usually, when you see estimated beta weights larger than 1 in a regression analysis, consider the possibility of multicollinearity.

Multicollinearity can be mild or severe, depending on how high the correlations are and whether any VIFs exceed 10.

Page 42: Regression Methods

Effects on P-Values

You will get different p-values for the same variables in different regressions as you add/remove other explanatory variables.

A variable can be significantly related to Y by itself, but not be significantly related to Y after accounting for several other variables. In that case, the variable is viewed as redundant.

If all the X variables are correlated, it is possible ALL the variables may be insignificant, even if each is significantly related to Y by itself.


Page 43: Regression Methods

Multicollinearity Effect on Coefficients

Similarly, the coefficients of individual explanatory variables can change depending on what other explanatory variables are present.

They may change signs sporadically, and may be excessively large when there is multicollinearity.

Page 44: Regression Methods

Multicollinearity Isn’t Tragic

In most practical datasets there will be some degree of multicollinearity. If the degree of multicollinearity isn’t too bad (more on its assessment in the next slides) then it can be safely ignored.

If you have serious multicollinearity, then your goals must be considered and there are various options.

In what follows, we first focus on how to assess multicollinearity, then what to do about it should it be found to be a problem.


Page 45: Regression Methods

Assessing Multicollinearity: Two Methods

Most statistical packages provide measures of multicollinearity.

We discuss two methods for assessing multicollinearity in this course: (1) the correlation matrix and (2) the Variance Inflation Factor (VIF).

Page 46: Regression Methods

Correlation Matrices

A correlation matrix is simply a table indicating the correlations between each pair of explanatory variables.

If you haven't seen it before, the correlation between two variables is simply the square root of the R² from regressing one on the other, with a sign indicating a positive or negative association.

If you see values close to 1 or -1, those variables are strongly associated with each other and you may have multicollinearity problems.

If you see many correlations greater in absolute value than 0.7, you may also have problems with your model.
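As a hedged sketch (hypothetical data, with column names borrowed from the example that follows), a correlation matrix in Python with pandas:

import numpy as np
import pandas as pd

# Hypothetical predictors; FER and MS are built to be nearly collinear
rng = np.random.default_rng(1)
base = rng.normal(size=50)
df = pd.DataFrame({
    "FER": base + rng.normal(scale=0.1, size=50),
    "MS": base + rng.normal(scale=0.1, size=50),
    "TR": rng.normal(size=50),
})

print(df.corr().round(2))  # entries near +/-1 flag possible multicollinearity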

Page 47: Regression Methods

Correlation Matrix

A cursory look at the correlation matrix of the independent variables shows whether there is multicollinearity in our experiment.

Page 48: Regression Methods

Correlation Matrix Involving the DV

       GDP   FER   MS    CE    TR    ER
GDP    1     0.99  0.95  0.61  0.14  0.87
FER    0.99  1     0.92  0.55  0.08  0.85
MS     0.95  0.92  1     0.76  0.09  0.76
CE     0.61  0.55  0.76  1     0.05  0.46
TR     0.14  0.09  0.09  0.05  1     0.36
ER     0.87  0.85  0.76  0.46  0.36  1

This can help to form a preliminary idea of the bivariate association of the dependent variable with the independent variables.

Page 49: Regression Methods

Disadvantages of Using Correlation Matrices

Correlation matrices only work with two variables at a time, so we can only see pairwise relationships. If a more complicated relationship exists, the correlation matrix won't find it.

Multicollinearity is not a bivariate problem. Use VIFs!

Page 50: Regression Methods

Variance Inflation Factors (VIFs)

Variance inflation factors measure the relationships among all the variables simultaneously, so they avoid the “two at a time” disadvantage of correlation matrices.

They are harder to explain. There is a VIF for each variable.

Loosely, the VIF is based on regressing each variable on the remaining variables. If the remaining variables can explain the variable of interest, then that variable has a high VIF.

Page 51: Regression Methods

Using VIFs

The use of the variance inflation factor is the most reliable way to examine multicollinearity.

For the jth independent variable, VIFj = 1/Tolerance = 1/(1 − Rj²), where Rj² comes from regressing that variable on the other independent variables.

Tolerance is the proportion of variance in an independent variable not explained by its relationship with the other independent variables.

In practice, all VIFs are greater than 1. VIFs are considered “bad or severe” if they exceed 10. A computational sketch appears below.
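A hedged sketch of computing VIFs in Python with statsmodels (the DataFrame is the same hypothetical one as in the correlation-matrix sketch):

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictors; FER and MS are nearly collinear by construction
rng = np.random.default_rng(1)
base = rng.normal(size=50)
df = pd.DataFrame({
    "FER": base + rng.normal(scale=0.1, size=50),
    "MS": base + rng.normal(scale=0.1, size=50),
    "TR": rng.normal(size=50),
})

Xc = sm.add_constant(df)  # include the intercept column
for i, name in enumerate(Xc.columns):
    if name != "const":
        # VIFj = 1/(1 - Rj^2), Rj^2 from regressing column j on the rest
        print(name, round(variance_inflation_factor(Xc.values, i), 2))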

Page 52: Regression Methods

So Multicollinearity is an Issue – What Do You Do About It?

Remember, if multicollinearity is present but not excessive (no high correlations, no VIFs above 10), you can ignore it.

If multicollinearity is a big issue in your dataset, your goal becomes extremely important.


Page 53: Regression Methods

Variance Inflation Factor


VIFs of 16.545 and 17.149 are ‘severe’.

Page 54: Regression Methods

If Your Goal is Prediction…

With severe multicollinearity everything fails, except if your goal is just prediction.

If your main goal is prediction (using the available explanatory variables to predict the response), then you can safely ignore the multicollinearity.


Page 55: Regression Methods

If Interest Centers on the Real Relationships Between the Variables…

When you have serious multicollinearity, the variables are sufficiently redundant that you cannot reliably distinguish their effects.

There is no single solution for this problem.


Page 56: Regression Methods

Some Tips:

Drop one of the ‘offending’ variables from the regression equation, but often the variables are so intertwined that you cannot distinguish them.

Combine the collinear variables. For example, if in a sociological study you find that the variables “father's education level” and “mother's education level” are strongly related, it may be sufficient to use a single variable, “parents' education level”, which is some function of the two.

Sometimes, you may not be able to disentangle your explanatory variables.

Page 57: Regression Methods

Dealing with Multicollinearity

In many situations you get to select some of the explanatory variables (in engineering studies you often get to select almost all of them; in medical studies you can select drug dosage).

You can use centered independent variables.

Use Ridge Regression or PCA (see Montgomery et al., 2006).

Use one of the analytic procedures like LISREL (see Adelodun and Awe, 2013).

Page 58: Regression Methods

Note…

Most importantly, make sure you set up your experiments in a way that you do not “install” multicollinearity.

Since multicollinearity diagnostics are so easy to obtain (through stat. packages), no researcher should ever report results of regressions with obvious multicollinearity problems!


Page 59: Regression Methods

SHORT QUIZ:

Consider the regression model below:

yi = β0 + β1xi + εi

Which values are known/unknown? Which are data, and which are parameters?

Which term is the slope? The intercept?

What are the common assumptions about the error structure (fill in the blanks): εi ~ ___(___, ___)

What is the difference between the error εi and the residual ei?

Page 60: Regression Methods

Let’s take a break for a few minutes!

Page 61: Regression Methods

A Brief Review…

You have several explanatory variables and a single response Y.

You run the multiple regression first and check the residuals and collinearity diagnostic measures.

If the residuals look bad, deal with those first (you may need a transformation or to fit a polynomial).

Now suppose you have decent residuals…


Page 62: Regression Methods

With Decent Residuals…

Check the collinearity measures. If these are problematic (any VIF above 10, or high correlations), then you must start removing or combining variables before you can trust the output. This tends to be a substantive, not a statistical, task.

The variables with the highest VIFs can be targeted for deletion first: they are the “most redundant”.

After you do anything, remember to check the residuals again.

Now suppose you have decent residuals and collinearity measures…

Look at the p-values and R². If these are significant, stop. Otherwise continue to the next slide…

Page 63: Regression Methods

Model Selection Procedures

“Model selection” refers to determining which of the explanatory variables should be placed in a final model.

Usually, we want a parsimonious model, or a model which describes the response well but is as free from multicollinearity as possible.

"All models are wrong, but some are useful." So said the statistician George Box.


Page 64: Regression Methods

Variable Subset Selection Uses Statistical Criteria to Identify a Set of Predictors for Our Model

Variable subset selection: among a set of potential predictors, choose a subset to include in the model based on some statistical criterion, e.g. p-values.

Forward selection: add variables one at a time, starting with the x most strongly associated with y. Stop when no other ‘significant’ variables are identified.

Drawback: variables added to the model cannot be taken out again.

Page 65: Regression Methods

Variable Subset Selection Continued

Backward elimination: start with every candidate predictor in the model. Remove variables one at a time until all remaining variables are “significantly” associated with the response.

Drawback: variables taken out of the model cannot be added back.

Stepwise selection: as forward selection, but at each iteration remove variables that are made obsolete by new additions. A combination of the forward and backward methods. A sketch of the forward step appears below.

Page 66: Regression Methods

A Recap on Meaning and Interpretation of Regression Results

Let us review and familiarize ourselves with the meaning of each entity that appears in the regression results.

Note that the regression procedures in SPSS and JMP are similar.

All the estimates and analyses can be done easily using statistical packages.


Page 67: Regression Methods

Coefficient of Multiple Determination

The coefficient of multiple determination, R², is the percent of variation in the response y explained by the set of explanatory variables (closer to 1 is a better model).

The adjusted coefficient of determination, adjusted R², introduces a penalty for additional explanatory variables (it takes sample size into account, and so is more reliable).
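For reference (a standard identity, not shown on the original slides), with n observations and p explanatory variables:

Adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1)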

Page 68: Regression Methods

Interpretation of Terms…

P-value: The p-value in a regression provides a test of whether that variable is significantly related to Y, after accounting for everything else.

ANOVA is used to evaluate the overall model significance.

Standard error: a measure of the uncertainty of an estimate; it measures the variability of the actual Y values around the predicted Y.

R is the correlation, which measures how the variables move in association with each other.

Page 69: Regression Methods

ANOVA Table for Simple Linear Regression

Source      SS    df    MS                F          p-value
Regression  SSR   1     MSR = SSR/1       MSR/MSE    P(F > F_obs)
Error       SSE   n-2   MSE = SSE/(n-2)
Total       SST   n-1

The F-test tests whether there is a linear relationship between the two variables (it is used to determine whether the model is significant). Null hypothesis: H0: β1 = 0. Alternative hypothesis: Ha: β1 ≠ 0.

Page 70: Regression Methods

Illustrative Example: Brief SPSS Demo

Please wait patiently for a brief SPSS Demo involving example 3.


Page 71: Regression Methods

Example 3:Practical

Suppose a researcher is interested in measuring the effect of several economic indicators on the GDP of a particular country in Africa (say, Nigeria).

He may specify the multiple linear regression model as follows:

GDPi = β0 + β1FERi + β2MSi + β3CEi + β4TRi + β5ERi + εi, where i = 1, …, 50.

See the data and estimation of this model in the demo section soon.

Page 72: Regression Methods

Where…

GDP = Gross Domestic Product (Y)
FER = Foreign Exchange Reserve (X1)
MS = Money Supply (X2)
CE = Capital Expenditure (X3)
TR = Treasury Bill Rate (X4)
ER = Exchange Rate (X5)
ε = Stochastic Error

After inputting the data into SPSS/PASW, click Analyze → Regression → Linear… (a code sketch follows below).
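For those who prefer code to menus, a hedged sketch of the same model in Python with statsmodels; the file name and DataFrame are hypothetical stand-ins for the demo data:

import pandas as pd
import statsmodels.api as sm

# Hypothetical: 50 observations with the columns defined above
econ = pd.read_csv("nigeria_econ.csv")

X = sm.add_constant(econ[["FER", "MS", "CE", "TR", "ER"]])
fit = sm.OLS(econ["GDP"], X).fit()
print(fit.summary())  # coefficients, p-values, R-squared, overall F-test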

Page 73: Regression Methods

MODEL DIAGNOSTICS AND INTERPRETATION

Page 74: Regression Methods

Look at the Following (SPSS) Regression Output: Can You Diagnose and Interpret these Results?


Page 75: Regression Methods

Residual Plots for the Final Model

Page 76: Regression Methods

Histogram of Residuals.

Page 77: Regression Methods

Some Lessons…

A high R² value does not always indicate a good model!

Always check your residuals after each analysis. If you notice non-random patterns in your residuals, it means that your model is missing something. Possibilities include:

- A missing variable.
- A missing higher-order term of a variable in the model, to explain curvature.
- A missing interaction between terms already in the model.

While trying to fit a parsimonious model, these possibilities can be explored further and figured out by the researcher.

Page 78: Regression Methods


Page 79: Regression Methods

Some References

Michael Sullivan III (2004). Statistics: Informed Decisions Using Data. Upper Saddle River, New Jersey: Pearson Education.

Michael H. Kutner, Christopher J. Nachtsheim, John Neter and William Li (2005). Applied Linear Statistical Models. New York: McGraw-Hill Irwin.

Gordon, Robert A. (1968). Issues in multiple regression. American Journal of Sociology, Vol. 73, pp. 592-616.

Schroeder, Mary Ann (1990). Diagnosing and Dealing with Multicollinearity. Western Journal of Nursing Research, 12(2), 175-187.

“Multicollinearity”. Dr. Bunty Ethington, EDPR 7/8542, University of Memphis.

Montgomery et al. (2006). Introduction to Linear Regression Analysis. 3rd Ed. Wiley Series.

Awe et al. (2013). Regression Model Diagnostic, Test and Robustification in the Presence of Multicollinear Covariates. International Journal of Electronic and Computer Research (India), Vol. 2(2).

Adelodun, A. A. and Awe, O. O. (2013). Using LISREL for Empirical Research. Transnational Journal of Science and Technology (Macedonia), Vol. 3(8), pp. 1-14.

www.lisa.stat.vt.edu

Page 80: Regression Methods

Acknowledgement

Thanks to the following: Dr. Eric Vance, Dr. Chris Franck, Tonya Pruitt.