Regression In Excel

Source: qcfinance.in/regression_in_excel.pdf


Contents

Regression Model

Regression Analysis in Excel

Simple Linear Regression

Correlation

How To Do A Regression in Excel

Slope

Intercept

ANOVA

References


Regression Model

A multiple regression model is:

y = β1 + β2 x2 + β3 x3 + u

where:

y is the dependent variable

x2 and x3 are the independent variables

β1 is the constant

β2 and β3 are the regression coefficients

u is the error term. It is assumed that the errors are independent with constant variance.

We wish to estimate the regression line, where b1, b2 and b3 are the least squares estimates of β1, β2 and β3:

y = b1 + b2 x2 + b3 x3
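As a rough illustration of what "estimating the regression line" involves, the sketch below computes b1, b2 and b3 by solving the normal equations. It is a minimal sketch, not what Excel runs internally; the function names `solve` and `ols` and the data are hypothetical, and the data are constructed so that y = 1 + 2*x2 + 3*x3 holds exactly.

```python
# Minimal sketch: ordinary least squares for y = b1 + b2*x2 + b3*x3.
# The data are hypothetical, built so that y = 1 + 2*x2 + 3*x3 exactly.

def solve(a, b):
    """Solve the square system a @ z = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest entry of this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(rows, y):
    """Least squares estimates from the normal equations (X'X) b = X'y."""
    x = [[1.0] + list(r) for r in rows]  # the leading 1 produces the constant b1
    k = len(x[0])
    xtx = [[sum(xi[p] * xi[q] for xi in x) for q in range(k)] for p in range(k)]
    xty = [sum(xi[p] * yi for xi, yi in zip(x, y)) for p in range(k)]
    return solve(xtx, xty)


regressors = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8)]  # hypothetical (x2, x3) pairs
y = [1 + 2 * x2 + 3 * x3 for x2, x3 in regressors]
b1, b2, b3 = ols(regressors, y)
print(round(b1, 6), round(b2, 6), round(b3, 6))
```

Because the hypothetical y values contain no error term, the estimates should recover the constants used to build them.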


Regression Analysis in Excel

We do this using the Regression tool in the Data Analysis add-in (the Analysis ToolPak).

Example:


The regression output has three components:

• Regression statistics table

• ANOVA table

• Regression coefficients table


Interpreting the Regression Statistics Table

The standard error here refers to the estimated standard deviation of the error term u.

It is sometimes called the standard error of the regression. It equals sqrt(SSE/(n-k)), where SSE is the residual sum of squares, n is the number of observations and k is the number of estimated coefficients (including the constant).

It is not to be confused with the standard error of y itself (from descriptive statistics) or with the standard errors of the regression coefficients given below.

R² = 0.8025 means that 80.25% of the variation of yi around its mean is explained by the regressors x2i and x3i.


The regression output of most interest is the following table of coefficients and associated output:


Let βj denote the population coefficient of the jth regressor (intercept, HH SIZE and CUBED HH SIZE). Then:

• Column "Coefficient" gives the least squares estimates bj of βj.

• Column "Standard error" gives the standard errors (i.e. the estimated standard deviations) of the least squares estimates bj of βj.

• Column "t Stat" gives the computed t-statistic for H0: βj = 0 against Ha: βj ≠ 0. This is the coefficient divided by the standard error. It is compared to a t distribution with (n-k) degrees of freedom, where here n = 5 and k = 3.

• Column "P-value" gives the p-value for the test of H0: βj = 0 against Ha: βj ≠ 0. This equals Pr{|t| > |t-Stat|}, where t is a t-distributed random variable with n-k degrees of freedom and t-Stat is the computed value of the t-statistic given in the previous column. Note that this p-value is for a two-sided test. For a one-sided test, divide this p-value by 2 (after checking that the sign of the t-Stat agrees with the alternative hypothesis).

• Columns "Lower 95%" and "Upper 95%" define a 95% confidence interval for βj.
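The column definitions above can be summarised in formulas. The notation se(b_j) for the standard error of b_j and t_{0.975, n-k} for the 97.5th percentile of the t distribution with n-k degrees of freedom is introduced here for illustration and does not appear on the slides.

```latex
t_j = \frac{b_j}{\operatorname{se}(b_j)}, \qquad
p_j = \Pr\{\, |t| > |t_j| \,\}, \quad t \sim t_{n-k}, \qquad
\text{95\% CI for } \beta_j:\; b_j \pm t_{0.975,\,n-k}\,\operatorname{se}(b_j)
```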


A simple summary of the previous output is that the fitted line is:

y = 0.8966 + 0.3365x + 0.0021z

where x is HH SIZE and z is CUBED HH SIZE.


Regression and Correlation

Techniques that are used to establish whether there is a mathematical relationship between two or more variables, so that the behavior of one variable can be used to predict the behavior of others. Applicable to “Variables” data only.

• “Regression” provides a functional relationship (Y=f(x)) between the variables; the function represents the “average” relationship.

• “Correlation” tells us the direction and the strength of the relationship.

The analysis starts with a Scatter Plot of Y vs. X.


Simple Linear Regression

What is it?

Determines if Y depends on X and provides a math equation for the relationship (continuous data).

Examples:

• Process conditions and product properties

• Sales and advertising budget

[Figure: scatter of y against x with candidate lines. Does Y depend on X? Which line is correct?]


Simple Linear Regression

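A simple linear relationship can be described mathematically by

Y = mX + b

m = slope = rise / run

b = Y intercept = the Y value at the point where the line intersects the Y axis.

[Figure: a straight line on X and Y axes, showing the intercept b at X = 0 and the rise and run that define the slope]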


Simple Linear Regression

[Figure: a straight line plotted on X and Y axes, with the rise and run marked]

slope = rise / run = (6 - 3) / (10 - 4) = 1/2

intercept = 1

Y = 0.5X + 1
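The rise-over-run calculation can be sketched in a few lines. The helper name `line_through` is hypothetical, and the two points (4, 3) and (10, 6) are an assumption read off the numbers in the slope calculation, since the figure itself is not available.

```python
def line_through(p1, p2):
    """Return (slope, intercept) of the straight line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)   # rise over run
    intercept = y1 - slope * x1     # value of Y where X = 0
    return slope, intercept


m, b = line_through((4, 3), (10, 6))
print(f"Y = {m}X + {b}")  # prints "Y = 0.5X + 1.0"
```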


Simple Regression Example

An agent for a residential real estate company in a large city would like to predict the monthly rental cost for apartments based on the size of the apartment as defined by square footage. A sample of 25 apartments in a particular residential neighborhood was selected to gather the information.


Size (square feet)   Rent ($)
850                  950
1450                 1600
1085                 1200
1232                 1500
718                  950
1485                 1700
1136                 1650
726                  935
700                  875
956                  1150
1100                 1400
1285                 1650
1985                 2300
1369                 1800
1175                 1400
1225                 1450
1245                 1100
1259                 1700
1150                 1200
896                  1150
1361                 1600
1040                 1650
755                  1200
1000                 800
1200                 1750

The data on size and rent for the 25 apartments will be analyzed in EXCEL.

Scatter Plot

[Figure: scatter plot of Rent against Size]

Scatter plot suggests that there is a ‘linear’ relationship between Rent and Size


Interpreting EXCEL output

Regression Equation

Rent = 177.121+1.065*Size

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.85
R Square             0.72
Adjusted R Square    0.71
Standard Error       194.60
Observations         25

ANOVA
             df   SS            MS            F             Significance F
Regression    1   2268776.545   2268776.545   59.91376452   7.51833E-08
Residual     23   870949.4547   37867.3676
Total        24   3139726

            Coefficients   Standard Error   t Stat   P-value       Lower 95%   Upper 95%
Intercept   177.121        161.004          1.100    0.282669853   -155.942    510.184
Size        1.065          0.138            7.740    7.51833E-08   0.780       1.350
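The main figures in this output can be reproduced outside Excel from the 25 size and rent pairs. The following is a minimal sketch using the textbook least squares formulas for one regressor; the variable names are illustrative, and the printed values should agree with the Excel output to rounding.

```python
# Size (square feet) and Rent ($) for the 25 apartments in the sample.
size = [850, 1450, 1085, 1232, 718, 1485, 1136, 726, 700, 956, 1100, 1285, 1985,
        1369, 1175, 1225, 1245, 1259, 1150, 896, 1361, 1040, 755, 1000, 1200]
rent = [950, 1600, 1200, 1500, 950, 1700, 1650, 935, 875, 1150, 1400, 1650, 2300,
        1800, 1400, 1450, 1100, 1700, 1200, 1150, 1600, 1650, 1200, 800, 1750]

n = len(size)
mean_x = sum(size) / n
mean_y = sum(rent) / n

# Sums of squares and cross-products about the means.
sxx = sum((x - mean_x) ** 2 for x in size)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(size, rent))
sst = sum((y - mean_y) ** 2 for y in rent)   # "Total" SS

slope = sxy / sxx
intercept = mean_y - slope * mean_x

ssr = slope * sxy                            # "Regression" SS
sse = sst - ssr                              # "Residual" SS
std_error = (sse / (n - 2)) ** 0.5           # "Standard Error" of the regression
slope_se = std_error / sxx ** 0.5            # standard error of the Size coefficient

print(f"Rent = {intercept:.3f} + {slope:.3f}*Size")
print(f"R Square = {ssr / sst:.2f}, Standard Error = {std_error:.2f}")
print(f"t Stat for Size = {slope / slope_se:.3f}")
```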


Interpretation of the Regression Coefficient

What does the coefficient of Size mean?

For every additional square foot of Size, the predicted Rent goes up by $1.065.


Using Regression for Prediction

Predict monthly rent when apartment size is 1000 square feet:

Regression Equation:

Rent = 177.121+1.065*Size

Thus, when Size=1000

Rent=177.121+1.065*1000=$1242 (rounded)


Using Regression for Prediction – Caution!

The regression equation is valid only over the range of X values over which it was estimated, so we should only interpolate.

Do not use the equation to predict Y when the X values are not within the range of data used to develop the equation. Extrapolation can be risky.

Thus, we should not use the equation to predict rent for an apartment whose size is 500 square feet, since this value is not in the range of Size values (700 to 1985 square feet) used to create the regression equation.
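One way to make this caution concrete is to have the prediction refuse inputs outside the sampled range. This is a minimal sketch: the function name `predict_rent` is hypothetical, the coefficients are the rounded values from the Excel output, and the limits 700 and 1985 are the smallest and largest Size values in the sample.

```python
INTERCEPT, SLOPE = 177.121, 1.065   # rounded coefficients from the Excel output
MIN_SIZE, MAX_SIZE = 700, 1985      # smallest and largest Size in the sample


def predict_rent(size_sqft):
    """Predict monthly rent, refusing to extrapolate beyond the sampled sizes."""
    if not MIN_SIZE <= size_sqft <= MAX_SIZE:
        raise ValueError(
            f"{size_sqft} sq ft is outside {MIN_SIZE}-{MAX_SIZE}; extrapolation is risky"
        )
    return INTERCEPT + SLOPE * size_sqft


print(round(predict_rent(1000)))  # prints 1242
try:
    predict_rent(500)
except ValueError as err:
    print("refused:", err)
```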

Why Extrapolation is Risky

[Figure: sample data, the true relationship, and the extrapolated relationship]

In this figure, we fit our regression model using sample data, but the linear relation implicit in our regression model does not hold outside our sample. By extrapolating, we make erroneous estimates.


Correlation (r)

“Correlation coefficient”, r, is a measure of the strength and the direction of the linear relationship between two variables. Values of r range from +1 (perfect direct relationship), through “0” (no linear relationship), to –1 (perfect inverse relationship). It measures the degree of scatter of the points around the “Least Squares” regression line.


Coefficient of Correlation from EXCEL

The sign of r is the same as that of the coefficient of X (Size) in the regression equation (in our case the sign is positive). Also, if you look at the scatter plot, you will note that the sign should be positive.

R=0.85 suggests a fairly ‘strong’ correlation between size and rent.
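For simple regression, the magnitude of r can be recovered from the ANOVA sums of squares and its sign from the slope. The sketch below assumes the SS values and the Size coefficient reported in the Excel output; the variable names are illustrative.

```python
import math

ssr, sst = 2268776.545, 3139726.0   # Regression and Total SS from the ANOVA table
slope = 1.065                       # coefficient of Size

r_squared = ssr / sst
r = math.copysign(math.sqrt(r_squared), slope)  # r takes the sign of the slope
print(round(r, 2))  # prints 0.85
```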


Coefficient of Determination (r²)

“Coefficient of Determination”, r-squared (sometimes R-squared), is the proportion of the variation in Y that is attributable to variation in X.


Getting r² from EXCEL

It is important to remember that r-squared is never negative. It is the square of the coefficient of correlation r, so it lies between 0 and 1. In our case, r² = 0.72 suggests that 72% of the variation in Rent is explained by the variation in Size. The higher the value of r², the better the simple regression model fits the data.
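Both "R Square" and "Adjusted R Square" in the output follow from the ANOVA sums of squares. This is a minimal sketch using the Residual and Total SS from the Excel output; the adjusted formula shown is the standard degrees-of-freedom correction, which the slides report but do not spell out.

```python
sse, sst = 870949.4547, 3139726.0   # Residual and Total SS from the ANOVA table
n, k = 25, 2                        # observations; estimated coefficients (intercept, Size)

r_squared = 1 - sse / sst
adj_r_squared = 1 - (sse / (n - k)) / (sst / (n - 1))
print(round(r_squared, 2), round(adj_r_squared, 2))  # prints 0.72 0.71
```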


Standard Error (SE)

Standard error measures the variability or scatter of the observed values around the regression line.

[Figure: scatter plot of Rent ($) against Size (square feet)]

Getting the Standard Error (SE) from EXCEL

In our example, the standard error associated with estimating rent is $194.60.


Is the Simple Regression Model Statistically Valid?

It is important to test whether the regression model developed from sample data is statistically valid.

For simple regression, we can use 2 approaches to test whether the coefficient of X is equal to zero

1. using t-test

2. using ANOVA


Is the coefficient of X equal to zero?

In both cases, the hypothesis we test is:

H0: Slope = 0

H1: Slope ≠ 0

What could we say about the linear relationship between X and Y if the slope were zero?


Using coefficient information for testing if slope=0

t-stat = 7.740 and P-value = 7.52E-08. The P-value is very small. If it is smaller than our α level, we reject the null hypothesis; otherwise we do not. If α = 0.05, we reject the null and conclude that the slope is not zero. The same result holds at α = 0.01 because the P-value is smaller than 0.01. Thus, at the 0.05 (or 0.01) level, we conclude that the slope is NOT zero, implying that our model is statistically valid.

P-value = 7.52E-08 = 7.52 × 10^-8 = 0.0000000752


Using ANOVA for testing if slope=0 in EXCEL

F = 59.91376 and P-value = 7.51833E-08. The P-value is again very small. If it is smaller than our α level, we reject the null hypothesis; otherwise we do not. Thus, at the 0.05 (or 0.01) level, the slope is NOT zero, implying that our model is statistically valid. This is the same conclusion we reached using the t-test.
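The agreement between the two tests is not a coincidence: with a single regressor, F equals the square of the t statistic for the slope, so both tests return the same P-value. A minimal sketch using the SS and df values from the ANOVA table:

```python
import math

ms_regression = 2268776.545 / 1   # SS / df for the Regression row
ms_residual = 870949.4547 / 23    # SS / df for the Residual row

f_stat = ms_regression / ms_residual
print(round(f_stat, 4))             # compare with F in the ANOVA table
print(round(math.sqrt(f_stat), 3))  # compare with the t Stat for Size
```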


Confidence Interval for the Slope of Size

The 95% CI tells us that we can be 95% confident that, for every 1 square foot increase in apartment Size, the average Rent increases by between $0.78 and $1.35.
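The interval is the coefficient plus or minus a t critical value times its standard error. In this sketch the critical value 2.0687 (97.5th percentile of the t distribution with 23 degrees of freedom) is an assumption taken from a t table rather than from the slides, and the coefficient and standard error are the rounded values in the output.

```python
coef, se = 1.065, 0.138   # Size coefficient and its standard error (rounded)
t_crit = 2.0687           # assumed: 97.5th percentile of t with 23 df, from a t table

lower = coef - t_crit * se
upper = coef + t_crit * se
print(f"{lower:.2f} to {upper:.2f}")  # prints "0.78 to 1.35"
```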


How To Do A Regression In Excel

1. Enter the data into a spreadsheet.

2. Choose Tools / Data Analysis / Regression (in Excel 2007 and later: Data tab / Data Analysis / Regression).

3. Enter the dependent variable in the “y” column and the independent variable (or variables) in the “x” columns.

4. Indicate where the output should go (the 1st cell in the range works).

5. The basic regression is done. (You may need to widen columns.)

1. Construct Scatterplot to see if data looks linear. Click Chart Wizard.

Select Scatter and this chart sub-type.

Highlight only cells that contain the x and y values.

Enter a chart title. Label x and y axes.

Store on a new worksheet. Name the worksheet.

1. Click on grey background -- Delete

2. Click on any horizontal line -- Delete

3. Click on legend -- Delete

Right mouse click on any data point. Select “Add Trendline”.

Select Linear from the trendline options.

Looks linear. Return to Original Worksheet.

Go to Tools Menu. Select Data Analysis.

Select Regression.

1. Highlight cells of y-variable

2. Highlight cells of x-variable

3. Check Labels (if first row has labels)

4. Check Confidence Level -- for other than 95% intervals (and change %)

5. Click New Worksheet Ply and give name

Low p-value: Linear model OK

ŷ = 46486.49 + 52.56757x

[Figure: annotated regression output. The callouts mark r, r², adj r², s, n, SSR, SSE, SSTOTAL, the 95% confidence interval for β1 and the 99% confidence interval for β1.]

Slope

Returns the slope of the linear regression line through data points in known_y's and known_x's. The slope is the vertical distance divided by the horizontal distance between any two points on the line, which is the rate of change along the regression line.

Syntax

SLOPE(known_y's,known_x's)

Known_y's is an array or cell range of numeric dependent data points.

Known_x's is the set of independent data points.


INTERCEPT

Calculates the point at which a line will intersect the y-axis by using existing x-values and y-values. The intercept point is based on a best-fit regression line plotted through the known x-values and known y-values. Use the INTERCEPT function when you want to determine the value of the dependent variable when the independent variable is 0 (zero). For example, you can use the INTERCEPT function to predict a metal's electrical resistance at 0°C when your data points were taken at room temperature and higher.

Syntax

INTERCEPT(known_y's,known_x's)

Known_y's is the dependent set of observations or data.

Known_x's is the independent set of observations or data.
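For readers who want to check what these two worksheet functions compute, the sketch below writes out the least squares formulas in Python. The function names `slope` and `intercept` merely mirror the Excel argument order (y values first) and are not an Excel API; the sample points are hypothetical values lying on the line Y = 0.5X + 1.

```python
def slope(known_ys, known_xs):
    """Least squares slope of y on x."""
    n = len(known_xs)
    mean_x = sum(known_xs) / n
    mean_y = sum(known_ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(known_xs, known_ys))
    sxx = sum((x - mean_x) ** 2 for x in known_xs)
    return sxy / sxx


def intercept(known_ys, known_xs):
    """Value of the fitted line where x = 0."""
    n = len(known_xs)
    return sum(known_ys) / n - slope(known_ys, known_xs) * sum(known_xs) / n


xs = [0, 4, 10]   # hypothetical points on the line Y = 0.5X + 1
ys = [1, 3, 6]
print(round(slope(ys, xs), 6), round(intercept(ys, xs), 6))
```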


The basic ANOVA situation

Two variables: 1 Categorical, 1 Quantitative

Main Question: Does the mean of the quantitative variable depend on which group (given by the categorical variable) the individual is in?

If the categorical variable has only 2 values:

• 2-sample t-test

ANOVA allows for 3 or more groups


An example ANOVA situation

Subjects: 25 patients with blisters

Treatments: Treatment A, Treatment B, Placebo

Measurement: # of days until blisters heal

Data [and means]:

• A: 5,6,6,7,7,8,9,10 [7.25]

• B: 7,7,8,9,9,10,10,11 [8.875]

• P: 7,9,9,10,10,10,11,12,13 [10.11]

Are these differences significant?


Informal Investigation

Graphical investigation:

• side-by-side box plots

• multiple histograms

Whether the differences between the groups are significant depends on

• the difference in the means

• the standard deviations of each group

• the sample sizes

ANOVA determines P-value from the F statistic

Side by Side Boxplots

[Figure: side-by-side box plots of days (axis from 5 to 13) for treatments A, B and P]

What does ANOVA do?

At its simplest (there are extensions) ANOVA tests the following hypotheses:

H0: The means of all the groups are equal.

Ha: Not all the means are equal (this doesn’t say how or which ones differ).

We can follow up with “multiple comparisons”.

Note: we usually refer to the sub-populations as “groups” when doing ANOVA.


Assumptions of ANOVA

Each group is approximately normal

• check this by looking at histograms and/or normal quantile plots, or use assumptions

• can handle some non-normality, but not severe outliers

Standard deviations of each group are approximately equal

• rule of thumb: ratio of largest to smallest sample st. dev. must be less than 2:1


Normality Check

We should check for normality using:

• assumptions about population

• histograms for each group

• normal quantile plot for each group

With such small data sets, there isn’t a really good way to check normality from the data, but we make the common assumption that physical measurements of people tend to be normally distributed.


Standard Deviation Check

Compare largest and smallest standard deviations:

• largest: 1.764

• smallest: 1.458

• 1.458 x 2 = 2.916 > 1.764

Note: variance ratio of 4:1 is equivalent.

Variable   treatment   N   Mean     Median   StDev
days       A           8   7.250    7.000    1.669
           B           8   8.875    9.000    1.458
           P           9   10.111   10.000   1.764
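The rule-of-thumb check can be written as a single ratio. A minimal sketch using the three sample standard deviations from the table; the dictionary name is illustrative.

```python
st_devs = {"A": 1.669, "B": 1.458, "P": 1.764}   # sample standard deviations from the table

ratio = max(st_devs.values()) / min(st_devs.values())
print(f"largest / smallest = {ratio:.2f}; below 2: {ratio < 2}")
```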


Notation for ANOVA

• $n$ = number of individuals all together

• $I$ = number of groups

• $\bar{x}$ = mean for entire data set

Group i has

• $n_i$ = # of individuals in group i

• $x_{ij}$ = value for individual j in group i

• $\bar{x}_i$ = mean for group i

• $s_i$ = standard deviation for group i

How ANOVA works (outline)

ANOVA measures two sources of variation in the data and compares their relative sizes

• variation BETWEEN groups: for each data value look at the difference between its group mean and the overall mean, $(\bar{x}_i - \bar{x})^2$

• variation WITHIN groups: for each data value we look at the difference between that value and the mean of its group, $(x_{ij} - \bar{x}_i)^2$

The ANOVA F-statistic is the ratio of the Between Group Variation to the Within Group Variation:

$F = \frac{\text{Between}}{\text{Within}} = \frac{MSG}{MSE}$

A large F is evidence against H0, since it indicates that there is more difference between groups than within groups.


How are These Computations Made?

We want to measure the amount of variation due to BETWEEN group variation and WITHIN group variation.

For each data value, we calculate its contribution to:

• BETWEEN group variation: $(\bar{x}_i - \bar{x})^2$

• WITHIN group variation: $(x_{ij} - \bar{x}_i)^2$

An Even Smaller Example

Suppose we have three groups

• Group 1: 5.3, 6.0, 6.7

• Group 2: 5.5, 6.2, 6.4, 5.7

• Group 3: 7.5, 7.2, 7.9

We get the following statistics:

SUMMARY
Groups     Count   Sum    Average    Variance
Column 1   3       18     6          0.49
Column 2   4       23.8   5.95       0.176667
Column 3   3       22.6   7.533333   0.123333

Excel ANOVA Output

ANOVA
Source of Variation   SS         df   MS         F          P-value    F crit
Between Groups        5.127333   2    2.563667   10.21575   0.008394   4.737416
Within Groups         1.756667   7    0.250952
Total                 6.884      9

Degrees of freedom (df):

• Between Groups: 1 less than number of groups

• Within Groups: number of data values - number of groups (equals df for each group added together)

• Total: 1 less than number of individuals (just like other situations)

Computing ANOVA F statistic

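                           WITHIN                  BETWEEN
                           difference:             difference:
                           data - group mean       group mean - overall mean
data   group   mean        plain    squared        plain    squared
5.3    1       6.00        -0.70    0.490          -0.44    0.194
6.0    1       6.00         0.00    0.000          -0.44    0.194
6.7    1       6.00         0.70    0.490          -0.44    0.194
5.5    2       5.95        -0.45    0.203          -0.49    0.240
6.2    2       5.95         0.25    0.063          -0.49    0.240
6.4    2       5.95         0.45    0.203          -0.49    0.240
5.7    2       5.95        -0.25    0.063          -0.49    0.240
7.5    3       7.53        -0.03    0.001           1.09    1.188
7.2    3       7.53        -0.33    0.109           1.09    1.188
7.9    3       7.53         0.37    0.137           1.09    1.188
TOTAL                               1.757                   5.106
TOTAL/df                            0.25095714              2.55275

overall mean: 6.44

Here "mean" is the group mean, the WITHIN total is divided by df = 7, and the BETWEEN total by df = 2.

F = 2.55275 / 0.25095714 ≈ 10.17 from the rounded entries above (the group 3 mean is rounded to 7.53). With unrounded group means the BETWEEN total is 5.127333, and F = 2.563667 / 0.250952 = 10.21575, matching the Excel output.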
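
The per-observation bookkeeping in this table can be sketched in Python. This is an illustrative sketch only (variable names such as `ss_within` and `ss_between` are ours), and it keeps the group means unrounded, so its BETWEEN total is 5.127333 as in the Excel output rather than the 5.106 obtained by hand with the group 3 mean rounded to 7.53.

```python
from statistics import mean

# The three groups from the small example.
groups = [
    [5.3, 6.0, 6.7],
    [5.5, 6.2, 6.4, 5.7],
    [7.5, 7.2, 7.9],
]

all_values = [x for g in groups for x in g]
grand_mean = mean(all_values)

ss_within = 0.0
ss_between = 0.0
for g in groups:
    group_mean = mean(g)
    for x in g:
        # Each data value contributes one WITHIN and one BETWEEN term.
        ss_within += (x - group_mean) ** 2
        ss_between += (group_mean - grand_mean) ** 2

df_between = len(groups) - 1               # 1 less than number of groups
df_within = len(all_values) - len(groups)  # data values minus groups

msg = ss_between / df_between
mse = ss_within / df_within
f_stat = msg / mse

print(f"SS between = {ss_between:.6f}, SS within = {ss_within:.6f}")
print(f"MSG = {msg:.6f}, MSE = {mse:.6f}, F = {f_stat:.5f}")
```

The P-value and F crit columns need the F distribution, which the Python standard library does not provide, so they are left out of this sketch.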

ANOVA Output

Analysis of Variance for days

Source      DF   SS      MS      F      P
treatment   2    34.74   17.37   6.45   0.006
Error       22   59.26   2.69
Total       24   94.00

DF column:

• treatment: 1 less than # of groups

• Error: # of data values - # of groups (equals df for each group added together)

• Total: 1 less than # of individuals (just like other situations)

ANOVA Output

Analysis of Variance for days

Source      DF   SS      MS      F      P
treatment   2    34.74   17.37   6.45   0.006
Error       22   59.26   2.69
Total       24   94.00

SS stands for sum of squares. ANOVA splits this into 3 parts:

• treatment: $\sum_{obs} (\bar{x}_i - \bar{x})^2$

• Error: $\sum_{obs} (x_{ij} - \bar{x}_i)^2$

• Total: $\sum_{obs} (x_{ij} - \bar{x})^2$

ANOVA Output

Analysis of Variance for days

Source      DF   SS      MS      F      P
treatment   2    34.74   17.37   6.45   0.006
Error       22   59.26   2.69
Total       24   94.00

MSG = SSG / DFG

MSE = SSE / DFE

F = MSG / MSE

P-value comes from F(DFG, DFE)

(P-values for the F statistic are in Table E)
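
As a check on the arithmetic, the MS and F columns can be recomputed from the SS and DF columns of the table. A minimal Python sketch (the variable names are ours; the inputs are the printed table values, so results agree only to the table's rounding):

```python
# SS and DF values as printed in the "Analysis of Variance for days" table.
ssg, dfg = 34.74, 2    # treatment (between groups)
sse, dfe = 59.26, 22   # Error (within groups)

msg = ssg / dfg        # mean square for groups
mse = sse / dfe        # mean square for error
f_stat = msg / mse     # ANOVA F statistic

print(f"MSG = {msg:.2f}, MSE = {mse:.2f}, F = {f_stat:.2f}")
```

Note that F is computed from the unrounded MSE (about 2.6936), not from the printed 2.69.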

So How Big is F?

Since F is

Mean Square Between / Mean Square Within = MSG / MSE,

a large value of F indicates relatively more difference between groups than within groups (evidence against H0).

To get the P-value, we compare to the F(I-1, n-I) distribution:

• I-1 degrees of freedom in numerator (# groups - 1)

• n-I degrees of freedom in denominator (rest of df)

Connections between SST, MST, and Standard Deviation

If we ignore the groups for a moment and just compute the standard deviation of the entire data set, we see

$s^2 = \frac{\sum (x_{ij} - \bar{x})^2}{n - 1} = \frac{SST}{DFT} = MST$

So $SST = (n - 1) s^2$, and $MST = s^2$. That is, SST and MST measure the TOTAL variation in the data set.
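
The relation between SST and the overall sample variance can be checked numerically. The sketch below is illustrative only: it pools the ten values of the small three-group example from earlier in these slides (Total SS = 6.884 in the Excel output) and compares the direct sum of squares with (n - 1) times the sample variance.

```python
from statistics import mean, variance

# All ten data values of the small example, with the grouping ignored.
data = [5.3, 6.0, 6.7, 5.5, 6.2, 6.4, 5.7, 7.5, 7.2, 7.9]

n = len(data)
grand_mean = mean(data)

# SST computed directly as a sum of squared deviations from the overall mean.
sst_direct = sum((x - grand_mean) ** 2 for x in data)

# SST recovered from the sample variance s^2 (divisor n - 1).
s2 = variance(data)
sst_from_s2 = (n - 1) * s2

mst = sst_direct / (n - 1)  # DFT = n - 1, so MST equals s^2

print(f"SST = {sst_direct:.3f}, (n-1)*s^2 = {sst_from_s2:.3f}, MST = {mst:.6f}")
```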

Connections between SSE, MSE, and Standard Deviation

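Remember:

$s_i^2 = \frac{\sum_j (x_{ij} - \bar{x}_i)^2}{n_i - 1} = \frac{SS[\text{Within Group } i]}{df_i}$

So $SS[\text{Within Group } i] = (s_i^2)(df_i)$

This means that we can compute SSE from the standard deviations and sizes (df) of each group:

$SSE = SS[\text{Within}] = \sum_i SS[\text{Within Group } i] = \sum_i s_i^2 (n_i - 1) = \sum_i s_i^2 (df_i)$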

Pooled Estimate for Standard Deviation

One of the ANOVA assumptions is that all groups have the same standard deviation. We can estimate this with a weighted average:

$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + \dots + (n_I - 1)s_I^2}{n - I}$

$s_p^2 = \frac{(df_1)s_1^2 + (df_2)s_2^2 + \dots + (df_I)s_I^2}{df_1 + df_2 + \dots + df_I}$

$s_p^2 = \frac{SSE}{DFE} = MSE$

so MSE is the pooled estimate of variance
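
To see the pooled estimate at work, the sketch below applies the weighted average to the group sizes and standard deviations reported for levels A, B and P in the "Analysis of Variance for days" output elsewhere in these slides (N = 8, 8, 9; StDev = 1.669, 1.458, 1.764). It is a hedged illustration with names of our choosing; because the inputs are rounded to three decimals, SSE comes out near, not exactly at, the tabulated 59.26.

```python
from math import sqrt

# (n_i, s_i) per level, copied from the Minitab-style output in the slides.
levels = {"A": (8, 1.669), "B": (8, 1.458), "P": (9, 1.764)}

# SSE = sum over groups of s_i^2 * df_i, with df_i = n_i - 1.
sse = sum((n - 1) * s ** 2 for n, s in levels.values())
dfe = sum(n - 1 for n, _ in levels.values())

mse = sse / dfe          # pooled estimate of variance, s_p^2
pooled_sd = sqrt(mse)    # pooled estimate of standard deviation, s_p

print(f"SSE = {sse:.2f}, DFE = {dfe}, MSE = {mse:.3f}, pooled SD = {pooled_sd:.3f}")
```

The pooled standard deviation rounds to 1.641, the value printed as "Pooled StDev" in that output.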

In Summary

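$SST = \sum_{obs} (x_{ij} - \bar{x})^2 = s^2 (DFT)$

$SSE = \sum_{obs} (x_{ij} - \bar{x}_i)^2 = \sum_{groups} s_i^2 (df_i)$

$SSG = \sum_{obs} (\bar{x}_i - \bar{x})^2 = \sum_{groups} n_i (\bar{x}_i - \bar{x})^2$

$SSE + SSG = SST; \quad MS = \frac{SS}{DF}; \quad F = \frac{MSG}{MSE}$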

R2 Statistic

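$R^2 = \frac{SS[\text{Between}]}{SS[\text{Total}]} = \frac{SSG}{SST}$

$R^2$ gives the proportion of the total variation that is due to between group variation (multiply by 100 to state it as a percent)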

We will see R2 again when we study regression.
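
For a concrete sense of scale, R2 can be computed from the two ANOVA tables shown in these slides. This is a small illustrative sketch; the helper name `r_squared` is our own.

```python
def r_squared(ss_between, ss_total):
    """Share of total variation attributed to between-group variation."""
    return ss_between / ss_total


# Small three-group example (Excel output): SSG = 5.127333, SST = 6.884.
r2_small = r_squared(5.127333, 6.884)

# "Analysis of Variance for days" table: SSG = 34.74, SST = 94.00.
r2_days = r_squared(34.74, 94.00)

print(f"small example: R^2 = {r2_small:.4f} ({100 * r2_small:.1f}%)")
print(f"days example:  R^2 = {r2_days:.4f} ({100 * r2_days:.1f}%)")
```

So roughly 74% of the variation in the small example, and roughly 37% in the days data, is between-group variation.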

Where’s the Difference?

Once ANOVA indicates that the groups do not all appear to have the same means, what do we do?

Analysis of Variance for days

Source      DF   SS      MS      F      P
treatment   2    34.74   17.37   6.45   0.006
Error       22   59.26   2.69
Total       24   94.00

Individual 95% CIs For Mean Based on Pooled StDev

Level  N  Mean    StDev  ----------+---------+---------+------
A      8   7.250  1.669  (-------*-------)
B      8   8.875  1.458             (-------*-------)
P      9  10.111  1.764                      (------*-------)
                         ----------+---------+---------+------
Pooled StDev = 1.641              7.5       9.0      10.5

Clearest difference: P is worse than A (CI’s don’t overlap)
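
The interval plot can be approximated numerically. The sketch below assumes each interval is mean ± t × (pooled StDev) / √N with the error degrees of freedom (22), and it hardcodes the two-sided 95% critical value t ≈ 2.074 from a standard t table, since the Python standard library has no t distribution. Both the formula choice and the constant are our assumptions, not stated in the slides.

```python
from math import sqrt

T_CRIT = 2.074      # assumed: t table value for 95% two-sided, 22 df
POOLED_SD = 1.641   # "Pooled StDev" from the output above

# (N, Mean) per level, copied from the output above.
levels = {"A": (8, 7.250), "B": (8, 8.875), "P": (9, 10.111)}


def pooled_ci(n, mean_value):
    """95% CI for one group mean, using the pooled standard deviation."""
    half_width = T_CRIT * POOLED_SD / sqrt(n)
    return mean_value - half_width, mean_value + half_width


intervals = {name: pooled_ci(n, m) for name, (n, m) in levels.items()}

for name, (lo, hi) in intervals.items():
    print(f"{name}: ({lo:.2f}, {hi:.2f})")
```

Under these assumptions the A interval ends below where the P interval begins, while the B interval overlaps both, which is the pattern the plot shows.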

Multiple Comparisons

Once ANOVA indicates that the groups do not all have the same means, we can compare them two by two using the 2-sample t test.

• We need to adjust our p-value threshold because we are doing multiple tests with the same data.

• There are several methods for doing this.

• If we really just want to test the difference between one pair of treatments, we should set the study up that way.
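
One simple way to adjust the threshold is the Bonferroni correction, which divides the family error rate by the number of tests. The slides do not name this method, so the sketch below is only an illustration of the idea of adjusting for multiple tests; the level labels A, B and P are taken from the days example in these slides.

```python
from itertools import combinations

levels = ["A", "B", "P"]   # treatment labels from the days example
family_alpha = 0.05        # desired overall (family) error rate

# Every unordered pair of levels gets its own 2-sample comparison.
pairs = list(combinations(levels, 2))

# Bonferroni: split the family error rate evenly across the tests.
per_test_alpha = family_alpha / len(pairs)

print(f"{len(pairs)} pairwise tests: {pairs}")
print(f"per-test alpha = {per_test_alpha:.4f}")
```

With three groups there are three pairwise tests, so each would be run at about 0.0167. Bonferroni is conservative; other methods, such as Tukey's, generally use a somewhat larger individual error rate for the same family rate.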

Tukey’s Pairwise Comparisons

Family error rate = 0.0500 (95% confidence)

Individual error rate = 0.0199

Critical value = 3.55

Intervals for (column level mean) - (row level mean)

        A                   B
B       (-3.685, 0.435)
P       (-4.863, -0.859)    (-3.238, 0.766)

Use alpha = 0.0199 for each test.

These give 98.01% CI’s for each pairwise difference. Only P vs A is significant (both endpoints have the same sign, so the interval excludes 0).

98% CI for A-P is (-4.86, -0.86)

Tukey’s Method in R

Tukey multiple comparisons of means

95% family-wise confidence level

       diff     lwr        upr
B-A    1.6250   -0.43650   3.6865
P-A    2.8611    0.85769   4.8645
P-B    1.2361   -0.76731   3.2395
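
Reading this output comes down to checking which intervals exclude zero. A minimal Python sketch of that check, with the numbers copied from the R output above (the dictionary and function names are ours):

```python
# pair -> (diff, lwr, upr), copied from the Tukey output above.
tukey = {
    "B-A": (1.6250, -0.43650, 3.6865),
    "P-A": (2.8611, 0.85769, 4.8645),
    "P-B": (1.2361, -0.76731, 3.2395),
}


def excludes_zero(lwr, upr):
    """True when both endpoints have the same sign, i.e. 0 is not inside."""
    return lwr > 0 or upr < 0


significant = [
    pair for pair, (_, lwr, upr) in tukey.items() if excludes_zero(lwr, upr)
]

print("Significant pairwise differences:", significant)
```

Only the P-A interval excludes zero, which agrees with the earlier pairwise-comparison output (there reported with the opposite sign, as A minus P).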

Forecasting: Basic Time Series Decomposition in Excel

Forecast method 1 – Guess

Forecast method 2 – Linear Regression

Forecast method 3 – Time Series Decomposition (TSD)

References

http://www.wikihow.com/Run-Regression-Analysis-in-Microsoft-Excel

http://office.microsoft.com/en-001/excel-help/slope-HP005209264.aspx

http://office.microsoft.com/en-in/excel-help/intercept-HP005209143.aspx

http://capacitas.wordpress.com/2013/01/14/forecasting-basic-time-series-decomposition-in-excel/

THANK YOU
