Introduction to Regression Analysis, Chapter 13, Regression analysis is used to: –Predict values...

33
Introduction to Regression Analysis, Chapter 13, Regression analysis is used to: Predict values of a dependent variable,Y, based on its relationship with values of at least one independent variable, X. Explain the impact of changes in an independent variable on the dependent variable by estimating the numerical value of the relationship Dependent variable: the variable we wish to explain Independent variable: the variable used to explain the dependent variable

Transcript of Introduction to Regression Analysis, Chapter 13, Regression analysis is used to: –Predict values...

Page 1: Introduction to Regression Analysis, Chapter 13, Regression analysis is used to: –Predict values of a dependent variable,Y, based on its relationship with.

Introduction to Regression Analysis, Chapter 13,

• Regression analysis is used to:

– Predict values of a dependent variable,Y, based on its relationship with values of at least one independent variable, X.

– Explain the impact of changes in an independent variable on the dependent variable by estimating the numerical value of the relationship

Dependent variable: the variable we wish to explain

Independent variable: the variable used to explain the dependent variable

Page 2: Introduction to Regression Analysis, Chapter 13, Regression analysis is used to: –Predict values of a dependent variable,Y, based on its relationship with.

Simple Linear Regression Model

• Only one independent variable (thus, simple), X

• Relationship between X and Y is described by a linear function

• Changes in Y are assumed to be caused by changes in X, that is,

– Change In X Causes Change in Y

Page 3: Introduction to Regression Analysis, Chapter 13, Regression analysis is used to: –Predict values of a dependent variable,Y, based on its relationship with.

Important points before we start a regression analysis:

• The most important thing in deciding whether or not there is a relationship between X and Y is to have a systematic model that is based on logical reasons.

• Investigate the nature of the relationship between X and Y (use scatter diagram, covariance, correlation of coefficient)

• Remember that regression is not an exact or deterministic mathematical equation. It is a behavioral relationship that is subject to randomness.

• Remember that X is not the only thing that explains the behavior of Y. There are other factor that you may not have information about.

• All you are trying to do is to have an estimate of the relationship using the best linear fit possible

Page 4: Introduction to Regression Analysis, Chapter 13, Regression analysis is used to: –Predict values of a dependent variable,Y, based on its relationship with.

Types of Relationships

Y

X

Y

X

Y

Y

X

X

Linear relationships Curvilinear relationships

Page 5: Introduction to Regression Analysis, Chapter 13, Regression analysis is used to: –Predict values of a dependent variable,Y, based on its relationship with.

Types of Relationships

Y

X

Y

X

Y

Y

X

X

Strong relationships Weak relationships

(continued)

Page 6: Introduction to Regression Analysis, Chapter 13, Regression analysis is used to: –Predict values of a dependent variable,Y, based on its relationship with.

Types of Relationships

Y

X

Y

X

No relationship

(continued)

Page 7: Introduction to Regression Analysis, Chapter 13, Regression analysis is used to: –Predict values of a dependent variable,Y, based on its relationship with.

ii10i εXββY Linear component

Simple Linear Regression Conceptual Model

The population regression model: This is a conceptual model, a hypothesis, or a postulation

Population Y intercept

Population SlopeCoefficient

Random Error term

Dependent Variable

Independent Variable

Random Error component

Page 8: Introduction to Regression Analysis, Chapter 13, Regression analysis is used to: –Predict values of a dependent variable,Y, based on its relationship with.

• The model to be estimated from sample data is:

• The actual estimated from the sample

– Where

ii10ieXbbY Residual

(random error from the sample)

i10iXbbY

Estimate of the regression intercept

Estimate of the regression slope

Estimated (or predicted) Y value for observation i

Value of X for observation i

iiiYYe

Page 9: Introduction to Regression Analysis, Chapter 13, Regression analysis is used to: –Predict values of a dependent variable,Y, based on its relationship with.

• The individual random error terms, ei, have a mean of zero, i.e.,

• Since the sum of random error is zero, we try to estimate the regression line such that the sum of squared differences are minimized, thus, the name Ordinary Least Squared Method (OLS)

• i.e.,

• The estimated values of b0 and b1 by OLS are the only possible values of b0 and b1 that minimize the sum of the squared differences between Y and

0ei

n

1i

2

i10i

2

iii2

))Xb(b(Ymin

)Y(Yminemin

Y

Page 10: Introduction to Regression Analysis, Chapter 13, Regression analysis is used to: –Predict values of a dependent variable,Y, based on its relationship with.

Random Error for this Xi value

Y

X

Observed Value of Y for Xi

Predicted Value of Y for Xi

Xi

Slope = b1

Intercept = b0

Simple Linear Regression Model

i10iXbbY

iiiYYe

Yi Actual

estimatedY i

Page 11: Introduction to Regression Analysis, Chapter 13, Regression analysis is used to: –Predict values of a dependent variable,Y, based on its relationship with.

• b0 is the estimated average value of Y when the value of X is b0

zero

• b1 is the estimated change in the average value of Y as a result of a

one-unit change in X

• Units of measurement of X and Y are very important for the correct interpretation of the slop and the intercept

• Example:

• Predict the app. Value of a house with 10,000 s.f. lot size

• How Good is this prediction?

Interpretation of the slope and the intercept

0 1( | 0) ; E(Y|X)/ (X);E Y X

size)(Lot 93.6 03.165 Val App

$234,330(10) 93.6 03.165 Val App

Page 12: Introduction to Regression Analysis, Chapter 13, Regression analysis is used to: –Predict values of a dependent variable,Y, based on its relationship with.

How Good is the Model’s prediction Power?

SSE SSR SST Total Sum of Squares

Regression Sum of Squares

Error Sum of Squares

2

i)YY(SST 2

ii)YY(SSE 2

i)YY(SSR

where:

= Average value of the dependent variable

Yi = Observed values of the dependent variable

i = Predicted value of Y for the given Xi valueY

Y

• Total variation is made up of two parts:

Page 13: Introduction to Regression Analysis, Chapter 13, Regression analysis is used to: –Predict values of a dependent variable,Y, based on its relationship with.

• SST = total sum of squares– Measures total variation of the Yi values around their

mean

• SSR = regression sum of squares (Explained)– Explained portion of total variation attributed to Y’s

relationship with X

• SSE = error sum of squares (Unexplained)– Variation of Y values attributable to other factors than

its relationship with X

Page 14: Introduction to Regression Analysis, Chapter 13, Regression analysis is used to: –Predict values of a dependent variable,Y, based on its relationship with.

Xi

Y, W/O the effect of X

X

Yi

SST = (Yi - Y)2

SSE = (Yi - Yi )2

SSR = (Yi - Y)2

_

_

Y

Y_Y

X

Page 15: Introduction to Regression Analysis, Chapter 13, Regression analysis is used to: –Predict values of a dependent variable,Y, based on its relationship with.

How Good is the Model’s prediction Power?

• The coefficient of determination is the portion of the total variation in the dependent variable,Y, that is explained by variation in the independent variable, X

• The coefficient of determination is also called r-squared and is denoted as r2

squares of sum total

squares of sum regression

SST

SSRr 2

1r0 2

Page 16: Introduction to Regression Analysis, Chapter 13, Regression analysis is used to: –Predict values of a dependent variable,Y, based on its relationship with.

Standard Error of Estimate• The standard deviation of the variation of

observations around the regression line is estimated by– Where SSE = error sum of squares; n = sample size

– The concept is the same as the standard deviation (average difference) around the mean of a univariate

MSE2n

)YY(

2n

SSES

n

1i

2

ii

YX

Page 17: Introduction to Regression Analysis, Chapter 13, Regression analysis is used to: –Predict values of a dependent variable,Y, based on its relationship with.

Comparing Standard Errors• SYX is a measure of the variation of observed Y values from

the regression line

YY

X XYX

s small YXslarge

The magnitude of SYX should always be judged relative to the size of the Y values in the sample data

i.e., SYX = $36.34K is moderately small relative to house prices in the $200 - $300K range (average 215K)

Page 18: Introduction to Regression Analysis, Chapter 13, Regression analysis is used to: –Predict values of a dependent variable,Y, based on its relationship with.

Assumptions of Regression• Normality of Error

– Error values (ε) are normally distributed for any given value of X

• Homoscedasticity

– The probability distribution of the errors has constant variance

• Independence of Errors

– Error values are statistically independent

Page 19: Introduction to Regression Analysis, Chapter 13, Regression analysis is used to: –Predict values of a dependent variable,Y, based on its relationship with.

How to investigate the appropriateness of the fitted model

• The residual for observation i, ei, is the difference between its observed and predicted value;

• Check the assumptions of regression by examining the residuals– Examine for linearity assumption

– Examine for constant variance for all levels of X (homoscedasticity)

– Evaluate normal distribution assumption

– Evaluate independence assumption

• Graphical Analysis of Residuals– Can plot residuals vs. X

iiiYYe

Page 20: Introduction to Regression Analysis, Chapter 13, Regression analysis is used to: –Predict values of a dependent variable,Y, based on its relationship with.

Residual Analysis for Linearity

Not Linear Linear

resi

dual

s

Y Y

x

resi

dual

s

x

x x

Page 21: Introduction to Regression Analysis, Chapter 13, Regression analysis is used to: –Predict values of a dependent variable,Y, based on its relationship with.

Residual Analysis for Homoscedasticity

Non-constant variance Constant variance

x x

Y

x x

Y

resi

dua

ls

resi

dua

ls

Page 22: Introduction to Regression Analysis, Chapter 13, Regression analysis is used to: –Predict values of a dependent variable,Y, based on its relationship with.

Residual Analysis for Independence

Not Independent Independent

X

Xresi

dual

s

resi

dual

s

X

resi

dual

s

Page 23: Introduction to Regression Analysis, Chapter 13, Regression analysis is used to: –Predict values of a dependent variable,Y, based on its relationship with.

Measuring Autocorrelation: the Durbin-Watson Statistic (DO NOT COVER FOR)

• Used when data are collected over time to detect if autocorrelation is present. Can be useful for cross sectional data.

• Autocorrelation exists if residuals in one time period are related to residuals in another period

• Here, residuals show a cyclic pattern, not random

Time (t) Residual Plot

-15

-10

-5

0

5

10

15

0 2 4 6 8

Time (t)

Res

idu

als

Page 24: Introduction to Regression Analysis, Chapter 13, Regression analysis is used to: –Predict values of a dependent variable,Y, based on its relationship with.

The Durbin-Watson Statistic

n

1i

2

i

n

2i

2

1ii

e

)ee(D Actual

H0: residuals are not correlatedH1: autocorrelation is present

• The Durbin-Watson statistic is used to test for autocorrelation.

• The possible range is 0 ≤ D ≤ 4

• D should be close to 2 if H0 is true

• D less than 2 may signal positive autocorrelation, D greater than 2 may signal negative autocorrelation

Page 25: Introduction to Regression Analysis, Chapter 13, Regression analysis is used to: –Predict values of a dependent variable,Y, based on its relationship with.

• Calculate the Durbin-Watson test statistic = D

• Find the values dL and dU from the Durbin-Watson table (for sample size n and number of independent variables k)

2 4-dL 44-du

Reject H0Do not reject H0 Inconclusive

0 dU 2dL

Reject H0 Do not reject H0Inconclusive

H0: positive autocorrelation does not exist

H1: positive autocorrelation is present

H0: Negative autocorrelation does not exist

H1: Negative autocorrelation is present

Positive Auto

Negative Auto

Page 26: Introduction to Regression Analysis, Chapter 13, Regression analysis is used to: –Predict values of a dependent variable,Y, based on its relationship with.

Inferences about Estimated Parameters• t test for a population slope

– Is there a linear relationship between X and Y?

• Null and alternative hypotheses

– H0: β1 = 0 (no linear relationship)

– H1: β1 0 (linear relationship does exist)

• Test statistic

1b

11

S

βbt

2nd.f.

where:

b1 = regression slope coefficient

β1 = hypothesized slope

Sb1 = estimate of the standard error of the slope

Page 27: Introduction to Regression Analysis, Chapter 13, Regression analysis is used to: –Predict values of a dependent variable,Y, based on its relationship with.

• The standard error of the regression slope coefficient (b1) is estimated by

• is a measure of the variation in the slope of regression lines from different possible samples

2

i

YXYX

b

)X(X

S

SSX

SS

1 2n

SSE SWhere

YX

Y

X

Y

X

1bS small1b

Slarge

1bS

Page 28: Introduction to Regression Analysis, Chapter 13, Regression analysis is used to: –Predict values of a dependent variable,Y, based on its relationship with.

F-Test for Significance•A second approach to test the existence of a significant relationship is the ratio of explained to unexplained variances. For multiple regression, this is also a test of the entire model.

•H0: β1 = 0; H1: β1 ≠ 0

•The Ratio is,

• where F follows an F distribution with k numerator and (n – k - 1) denominator degrees of freedom. (k = the number of independent variables in the regression model)

df Sum S. Mean S. Actual F

Regression K SSR MSR=SSR/K Fk,(n-k-1)=MSR/MSEResidual (error) n-k-1 SSE MSE=SSE/(n-k-1)Total n-1 SST      

MSE

MSRF

Page 29: Introduction to Regression Analysis, Chapter 13, Regression analysis is used to: –Predict values of a dependent variable,Y, based on its relationship with.

Confidence Interval Estimates and Prediction of Individual Values of Y1. Confidence Interval Estimate for the Slope

2. Confidence interval estimate for the mean value of Y given a particular Xi

Where

– Note that the size of interval varies according to distance away from mean, X

iYX2n

XX|Y

hStY

:μ for interval Confidencei

1b2n11 Stb

2

i

2

i

2

i

i )X(X

)X(X

n

1

SSX

)X(X

n

1h

Page 30: Introduction to Regression Analysis, Chapter 13, Regression analysis is used to: –Predict values of a dependent variable,Y, based on its relationship with.

3. Confidence interval estimate for an Individual value of Y given a particular Xi

-- This extra term, 1, adds to the interval width to reflect the added uncertainty for an individual case

iYX2n

XX

h1StY

:Y for interval Confidencei

Page 31: Introduction to Regression Analysis, Chapter 13, Regression analysis is used to: –Predict values of a dependent variable,Y, based on its relationship with.

• Graphical Presentation of Confidence Interval Estimation

Y

Y = b0+b1Xi

Confidence Interval for the mean of

Y, given Xi

Prediction Interval

for an individual Y, given Xi

Y

Xi Xj

Page 32: Introduction to Regression Analysis, Chapter 13, Regression analysis is used to: –Predict values of a dependent variable,Y, based on its relationship with.

Pitfalls of Regression Analysis1. Lacking an awareness of the assumptions underlying

least-squares regression

2. Not knowing how to evaluate the assumptions

3. Not knowing the alternatives to least-squares regression if a particular assumption is violated

4. Using a regression model without knowledge of the subject matter

5. Extrapolating outside the relevant range

6. Let’s look at 4 different data set, with their scatter diagram and residual plots on pages 509-510.

Strategies:1. Start with a scatter plot of X on Y to observe possible

relationship

Page 33: Introduction to Regression Analysis, Chapter 13, Regression analysis is used to: –Predict values of a dependent variable,Y, based on its relationship with.

2. Perform residual analysis to check the assumptions Plot the residuals vs. X to check for violations of

assumptions such as homoscedasticity Use a histogram, stem-and-leaf display, box-and-whisker

plot, or normal probability plot of the residuals to uncover possible non-normality

3. If there is violation of any assumption, use alternative methods or models

4. If there is no evidence of assumption violation, then test for the significance of the regression coefficients and construct confidence intervals and prediction intervals

5. Avoid making predictions or forecasts outside the relevant range