Multiple Regression Analysis Dr. Shishen Xie Computer and Math Sciences.

43
Multiple Regression Analysis Dr. Shishen Xie Computer and Math Sciences

Transcript of Multiple Regression Analysis Dr. Shishen Xie Computer and Math Sciences.

Page 1: Multiple Regression Analysis Dr. Shishen Xie Computer and Math Sciences.

Multiple Regression Analysis

Dr. Shishen Xie

Computer and Math Sciences

Page 2: Multiple Regression Analysis Dr. Shishen Xie Computer and Math Sciences.

The Simple Linear Regression Model

• The simple linear regression model assumes that there is a line with y-intercept α and slope β, called the true or population regression line. When a value of the independent variable x is fixed and an observation on the dependent variable y is made,

y = α + βx +e• Without the random deviation e, all observed (x, y)

points would fall exactly on the population regression line. The inclusion of e in the model equation recognizes that points will deviate from the line.

Page 3: Multiple Regression Analysis Dr. Shishen Xie Computer and Math Sciences.

Simple Linear Regression Model

Page 4: Multiple Regression Analysis Dr. Shishen Xie Computer and Math Sciences.

Multiple Regression Analysis

• The general objective of regression analysis: To model the relationship between a dependent variable y and one or more independent variables.

• For example, the variation of house prices can be attributed to house size, location, builder’s reputation, and other variables.

• Multiple regression model includes at least two predictor variables.

• Many of the concepts of simple linear regression can carry over to multiple regression with little modification.

• The calculations are more tedious, so a computer is an indispensable tool for multiple regression analysis.

Page 5: Multiple Regression Analysis Dr. Shishen Xie Computer and Math Sciences.

1. Multiple Regression Models

• Deterministic model: the value of y is completely determined once values of the independent variable have been specified.

y = α + β1x1 + β2x2 + …+ βkxk

• Example: In a school district, y = 38000 + 800x1 + 60x2

indicates that a teacher starts at salary ( y) of $38,000, and receives an additional $800 per year for each year of teaching experience (x1) and $60 per year for each unit of post college coursework (x2) .

• The value of y is entirely determined by x1 and x2 through the deterministic formula. If two different teachers both have same number of years teaching experience (x1) and same number of post college units (x2) , they will have the same salary (y).

Page 6: Multiple Regression Analysis Dr. Shishen Xie Computer and Math Sciences.

A general additive multiple regression models

• A probabilistic model is more realistic in most situations:

y = α + β1x1 + β2x2 + …+ βkxk + e

• The random deviation e is assumed to be normally distributed with mean value 0 and standard deviation σ.

• This implies that for fixed x1 , x2 , … xk values, y has a normal distribution with standard deviation σ and

mean y value = α + β1x1 + β2x2 + …+ βkxk

• The βi’s is called population regression coefficients

• The deterministic portion α + β1x1 + β2x2 + …+ βkxk is called the population regression function.

Page 7: Multiple Regression Analysis Dr. Shishen Xie Computer and Math Sciences.

Multiple Regression Models• Example: What factors contribute to the academic success of

college sophomores? Data collected in a survey of 1000 sophomores suggests that GPA at the end of second year is related to the student’s level of interaction with faculty and staff, and to the student’s commitment to his/her major.

• y = GPA at the end of the second year.

x1 = level of faculty and staff interaction (a scale from 1 to 5)

x2 = level of commitment to major (a scale from 1 to 5)

• One possible population model might be α =1.4, β1=0.33 and β2= 0.16: y = 1.4 + .33x1 + 0.16x2 with σ = 0.15.

• Find mean GPA for students whose x1=4.2 and x2=2.1.

• An actual y value will be within 2σ of the mean value.

Page 8: Multiple Regression Analysis Dr. Shishen Xie Computer and Math Sciences.

Buffalo Bayou Project

• We want to know how fungi population is affected by the pH value, metal concentration, temperature, etc. of the soil at different locations along Buffalo Bayou.

• In a possible multiple regression model, we can let y = fungi population

x1 = pH value

x2 = metal concentration

x3 = temperature

x4 = location…

Page 9: Multiple Regression Analysis Dr. Shishen Xie Computer and Math Sciences.

Polynomial Regression

• Suppose that a scatterplot of the n sample (x, y) pairs has one of the appearances in the figure. A polynomial regression model is clearly more appropriate.

Page 10: Multiple Regression Analysis Dr. Shishen Xie Computer and Math Sciences.

kth-Degree Polynomial Regression Model

• y = α + β1x + β2x2 + …+ βkxk + e is a special case of the

general multiple regression model

y = α + β1x1 + β2x2 + …+ βkxk + e

with x1= x, x2=x2, x3=x3, … xk=xk.

• The most important special case other than simple linear regression (k = 1) is the quadratic regression model.

y = α + β1x + β2x2 + e

• A less encountered special case is that of cubic regression:

y = α + β1x + β2x2 + β3x3 + e

Page 11: Multiple Regression Analysis Dr. Shishen Xie Computer and Math Sciences.

Interaction Between Variables

• y = product from a certain chemical reaction

x1 = reaction temperature,

x2 = reaction pressure

• A chemist suggests for 80 ≤ x1 ≤ 100 and 50 ≤ x2 ≤ 70 the probabilistic model:

y = 1200 + 15 x1 – 35 x2 + e.

• x1 = 90: mean y value = 1200 + 15(90) – 35 x2 = 2550 – 35 x2

x1 = 95: mean y value = 1200 + 15(95) – 35 x2 = 2625 – 35 x2

x1 = 100: mean y value = 1200 + 15(100) – 35 x2 = 2700 – 35 x2

• Each graph is a straight line with the same slope – 35

Page 12: Multiple Regression Analysis Dr. Shishen Xie Computer and Math Sciences.

Interaction Between Variables

• The three lines indicates that the average change in yield when pressure x2 is increased by 1 units is – 35, regardless of temperature.

• But chemical theory suggests that the decline in average yield when pressure x2 increases should be more rapid for a high temperature than for a low temperature.

Page 13: Multiple Regression Analysis Dr. Shishen Xie Computer and Math Sciences.

Interaction Between Variables

• Therefore, the line for a temperature x1 = 100 should have a steeper decline than the line x1 = 95, and that line in turn should be steeper than the one for x1 = 90.

• The previous model may not be appropriate.

• A model with this property included can be

y = – 4500 + 75 x1 + 60x2 – x1 x2 + e

• x1 = 90: y = 2250 – 30 x2

x1 = 95: y = 2625 – 35 x2

x1 = 100: y = 3000 – 40 x2

Page 14: Multiple Regression Analysis Dr. Shishen Xie Computer and Math Sciences.

Interaction Between Variables

• If the change in the mean y value associated with a 1-unit increase in one variable depends on the value of a second variable, there is interaction between these two variables. When the variables are denoted by x1 and x2, such interaction can be modeled by including x1 x2, the product of the variables that interact, as a predictor variable.

• The multiple regression model based on two independent variables x1 and x2 and includes an interaction predictor is

y = α + β1x1 + β2x2 + β3 x1 x2 + e.• If there are 3 independent variables, a possible model is

y = α + β1x1 + β2x2 + β3 x3 + β4x4 + β5x5 + β6 x6 + e,

where x4 = x1 x2. x5 = x1 x3 , x6 = x2 x3 .

Page 15: Multiple Regression Analysis Dr. Shishen Xie Computer and Math Sciences.

Nonlinear Relationship and Transformations

• Often a scatterplot exhibits a curved pattern indicating a nonlinear relationship between y and one of the variables xi.

• One method to find a curve to fit the data is to find a way to transform the xi value.

• A transformation involves using a simple function of a variable in place of the variable itself. For examples:

ŷ = a + b ln (x) A 1 unit increase in x is associated with smaller increase or decrease in the value of y for larger x values.

ŷ = a + b x½ A 1 unit increase in x is associated with smaller increase or decrease in the value of y for larger x values.

Page 16: Multiple Regression Analysis Dr. Shishen Xie Computer and Math Sciences.

Nonlinear Multiple Regression Models

• A frequently used model involving two independent variables x1 and x2 but k = 5 predictors is the full quadratic or complete second-order model

y = α + β1x1 + β2x2 + β3 x1x2 + β4x12 + β5x2

2 + e

When x1 has a fixed value, the graph of the regression function for x2 is a parabola.

• Many nonlinear relationship can be put into the form

y = α + β1x1 + β2x2 + … + βk xk + e by transforming one or more of the variables. An appropriate transformation could be suggested by theory or by various plots of data.

• There are also relationships that cannot be linearized by transformation, and more complicated methods of analysis must be used.

Page 17: Multiple Regression Analysis Dr. Shishen Xie Computer and Math Sciences.

Example: Wind Chill Factor

• The wind chill index combines information on air temperature and wind speed to describe how cold it really feels. The following table gives the wind chill index for various combination of air temperature and wind speed.

Page 18: Multiple Regression Analysis Dr. Shishen Xie Computer and Math Sciences.

Example: Wind Chill Factor

• y = wind chill index;

x1 = air temperature,

x2 = wind speed

• Observation: The wind chill index y increases linearly with air temperature x1 at each value of the wind speeds x2, but the linear pattern for the different wind speeds are not parallel.

• An interaction is appropriate.

Page 19: Multiple Regression Analysis Dr. Shishen Xie Computer and Math Sciences.

Example: Wind Chill Factor

• Observation: The relationship between wind chill index and wind speed is nonlinear at each of the different temperatures.

• The pattern is more markedly curved at some temperatures than at others.

• Therefore a transformation is necessary.

Page 20: Multiple Regression Analysis Dr. Shishen Xie Computer and Math Sciences.

Example: Wind Chill Factor

• y = wind chill index;

x1 = air temperature, x2 = wind speed

• The model used by the National Weather Service for relating wind chill index to air temperature and wind speed is

mean y value = 35.74 + 0.621 x1 – 35.75x2′ + 0.4275 x1 x2′,

where x2′ = x20.16

• This model incorporates a transformed x2 to model the nonlinear relationship between wind chill index and wind speed, and an interaction term.

Page 21: Multiple Regression Analysis Dr. Shishen Xie Computer and Math Sciences.

Example: Predictors of Writing Competence

• The article “Grade Level and Gender Differences in Writing Self-Beliefs of Middle School Students” considered relating writing competence score to a number of predictor variables, including perceived value of writing and gender.

• Let y = writing competence score

x1 = gender (x1 = 0 if male, and x1 = 1 if female)

x2 = perceived value of writing• One possible multiple regression model is

y = α + β1x1 + β2x2 + e• Another possible model with an interaction term is

y = α + β1x1 + β2x2 + β3 x1 x2 + e

Page 22: Multiple Regression Analysis Dr. Shishen Xie Computer and Math Sciences.

Example: Predictors of Writing Competence(no interaction)

• For multiple regression model

y = α + β1x1 + β2x2 + e

when =0 (male)

average y score = α + β2x2 ,

when =1 (female)

average y score = α + β1 + β2x2 ,

where β1 is the coefficient is the difference in average writing score between male and female when is x2 fixed.

• The two lines are parallel with the slope both equal to β2, although the intercept are different.

Page 23: Multiple Regression Analysis Dr. Shishen Xie Computer and Math Sciences.

Example: Predictors of Writing Competence(interaction)

• For the model with interaction,

y = α + β1x1 + β2x2 + β3 x1 x2 + e

when =0 (male)

average y score = α + β2x2 ,

when =1 (female)

average y score = α + β1 + (β2 +β3 ) x2

• With interaction, the lines not only have different intercepts but also have different slope (unless β3 = 0).

• The change in average writing score when x2 increases by 1 depends on gender x1.

Page 24: Multiple Regression Analysis Dr. Shishen Xie Computer and Math Sciences.

2. Fitting a Model and Assessing Its Utility

• Suppose a multiple regression model includes a selected set of k predictor variables x1 , x2 , … xk

y = α + β1x1 + β2x2 + … + βk xk + e

• It is then necessary to

– estimate the model coefficients,

– assess the model’s utility, and

– use the estimated model to make further inferences.

• A sample of n independent observations is selected, with each observation consists of k + 1 numbers: (x1 , x2 , … xk , y).

Page 25: Multiple Regression Analysis Dr. Shishen Xie Computer and Math Sciences.

Definition: The Least Squares Estimates

• According to the principle of least squares, the fit of a particular estimated regression function

a + b1x1 + b2x2 + …+ bk xk

to the observed data is measured by the sum of squared deviations between the observed y values and the y values predicted by the estimated function:

∑ [ y − (a + b1x1 + b2x2 + …+ bk xk ) ]2

The least –squares estimates of α, β1, β2, …, βk are those values of a, b1, b2, …, bk that make this sum of squared deviations as small as possible.

Page 26: Multiple Regression Analysis Dr. Shishen Xie Computer and Math Sciences.

Example: Graduation Rates

• One way colleges measure success is by graduation rates. We are considering the following variables:y = six-year graduation rate

x1 = median SAT score of students accepted to the college

x2 = student-related expense per full-time student (in dollars)

x3 = 1 if college has only female or only male students= 0 if college has both male and female students

The data in next slide represent a random sample of 22 colleges selected from the 1037 colleges in US with enrollments less than 5000 students.

Fit a linear regression model to describe the relationship between y (the six year graduation rate) and the three predictor variables.

Page 27: Multiple Regression Analysis Dr. Shishen Xie Computer and Math Sciences.

College y x1 x2 x3

Cornerstone University 0.391 1065 9482 0

Barry University 0.389 950 13149 0

Wilkes University 0.532 1090 9418 0

Colgate University 0.893 1350 26969 0

Lourdes College 0.313 930 8489 0

Concordia University at Austin 0.315 985 8329 0

Carleton College 0.896 1390 29605 0

Letourneau University 0.545 1170 13154 0

Ohio Valley College 0.288 950 10887 0

Chadron State College 0.469 990 6046 0

Meredith College 0.679 1035 14889 1

Tougaloo College 0.495 845 11694 0

Hawaii Pacific University 0.410 1000 9911 0

University of Michigan-Dearborn 0.497 1065 9371 0

Whitter College 0.553 1065 14051 0

Wheaton College 0.845 1325 18420 0

Southampton College of of Long Island 0.465 1035 13302 0

Keene State College 0.541 1005 8098 0

Mount St Mary’s College 0.579 918 12999 1

Wellesley College 0.912 1370 35393 1

Fort Lewis College 0.298 970 5518 0

Bowdoin College 0.891 1375 35669 0

Page 28: Multiple Regression Analysis Dr. Shishen Xie Computer and Math Sciences.

Example: Graduation Rate at Small Colleges

• In SPSS enter the data.

• In “Analyze”, choose the “Regression”, and then “Linear…”.

• Enter y as the dependent and x1, x2, x3 as independent

• Click “Statistics” button, and a window opens. Choose “Estimates” and “Model Fit”, which should be the default. Click “Continue”.

• We are back to the “Linear Regression” window. Click OK.

• We can see the “Output - SPSS Viewer”. Pull the screen down we can see the “coefficients”. These are the constant term and coefficients of x variables:

y = −0.3906 + 0.0007602x1 + 0.000006969x2 + 0.125 x3

Page 29: Multiple Regression Analysis Dr. Shishen Xie Computer and Math Sciences.
Page 30: Multiple Regression Analysis Dr. Shishen Xie Computer and Math Sciences.
Page 31: Multiple Regression Analysis Dr. Shishen Xie Computer and Math Sciences.
Page 32: Multiple Regression Analysis Dr. Shishen Xie Computer and Math Sciences.
Page 33: Multiple Regression Analysis Dr. Shishen Xie Computer and Math Sciences.
Page 34: Multiple Regression Analysis Dr. Shishen Xie Computer and Math Sciences.
Page 35: Multiple Regression Analysis Dr. Shishen Xie Computer and Math Sciences.
Page 36: Multiple Regression Analysis Dr. Shishen Xie Computer and Math Sciences.

Example: Graduation Rate at Small Colleges

• The estimated mean value of y for specified x1, x2, and x3 values

are y = −0.3906 + 0.0007602x1 + 0.000006969x2 + 0.125 x3 • Use the regression function to estimate the mean six-year

graduation rate of coed college with a median SAT of 1000 and an expenditure per full-time student of $11,000.

• What is the six-year graduation rate for a particular college with a median SAT of 1000 and an expenditure per full-time student of $11,000?

Page 37: Multiple Regression Analysis Dr. Shishen Xie Computer and Math Sciences.

Is the Model Useful? – The F Test

• In the multiple regression model with regression function:

y = α + β1x1 + β2x2 + … + βk xk, if all k coefficients β1, β2, …, βk are 0, there is no useful linear relationship between y and any of the predictor variables x1, x2, …, xk in the model.

• We need to confirm the model’s utility through a formal test procedure.

• When all k βi′s are 0 in the model y = α + β1x1 + β2x2 + … + βkxk

+ e and when the distribution of e is normal with mean 0 and variance σ2 for any particular values of x1, x2, …, xk, the model utility test for multiple regression has an F probability distribution based on df1 = k (the number of model predictors), and df2 = n – ( k + 1 ) (n is the sample size).

Page 38: Multiple Regression Analysis Dr. Shishen Xie Computer and Math Sciences.

F Distribution for df1=4, df2 = 6:

Calculated F value P-value

F = 2.16 P > .1

F = 5.70 .01 < P < .05

F = 9.20 .001 < P < .01

F = 25.03 P < .001

Page 39: Multiple Regression Analysis Dr. Shishen Xie Computer and Math Sciences.

Hypotheses Test

• The null hypothesis, denoted by H0, is a claim about population characteristics that is initially assumed to be true.

• The alternative hypothesis, denoted by Ha, is the competing claim.

• The two possible conclusions are reject H0 , or fail to reject H0.• α is a pre-determined significance level, which is the

probability of rejecting H0 when H0 is true.• P-value is the area under the associated F curve to the right of

the calculated F value. • H0 should be rejected if P-value ≤ α.• H0 should not be rejected if P-value > α.

Page 40: Multiple Regression Analysis Dr. Shishen Xie Computer and Math Sciences.

The F Test for Utility of the Model y = α + β1x1 + β2x2 + … + βkxk + e

• Null hypothesis: H0: β1= β2 = … = βk = 0 (No useful linear relationship between y and any of the predictors.)

• Alternative hypothesis: Ha: At least one of β1, β2, …, βk is not 0. (A useful linear relationship between y and at least one of the predictors.)

• Test statistics: F (F is defined by a relatively complicated formula, but the calculated value of F is also available from the output of statistics software, such as SPSS.)

• Reject H0 if P-value ≤ α.

Fail to reject H0 if P-value > α.

Page 41: Multiple Regression Analysis Dr. Shishen Xie Computer and Math Sciences.

F Test for Utility of the Regression Model for the Graduation Rate of Small Colleges

• The model is y = α + β1x1 + β2x2 + βkx3 + e where y = six-year graduation rate, x1 = median SAT score, x2 = expenditure per full time student, and x3 is an indicator if the college is coed (0) or not (1).

• H0: β1= β2 = β3 = 0, Ha: At least one of β1, β2, β3 is not 0.

• The pre-determined significance level: α = .05.

• Test statistics: F = 37.164 (from the “Output-SPSS Viewer” when we used SPSS to estimate the regression model.)

• df1 = k = 3, and df2 = n – ( k + 1 ) = 22 – (3+1) = 18. Use the F-curve Table to find P < .001

• Reject H0 because P < α. We thus confirm the utility of the model.

Page 42: Multiple Regression Analysis Dr. Shishen Xie Computer and Math Sciences.

Price Calories Protein Fat

1.40 180 12 3.0

1.28 200 14 6.0

1.31 210 16 7.0

1.10 220 13 6.0

2.29 220 17 11.0

1.15 230 14 4.5

2.24 240 24 10.0

1.99 270 24 5.0

2.57 320 31 9.0

0.94 110 5 30.0

1.40 180 10 4.5

0.53 200 7 6.0

1.02 220 8 5.0

1.13 230 9 6.0

1.29 230 10 2.0

1.28 240 10 4.0

1.44 260 6 5.0

1.27 260 7 5.0

1.47 290 13 6.0

Exercise: The following data includes the price, calorie content, protein content, and fat content for a sample of 19 energy bars. What factors contribute to the price of energy bars? Use α = .01.

Page 43: Multiple Regression Analysis Dr. Shishen Xie Computer and Math Sciences.

Exercises

• Complete the previous example.

• Do problems:

14.2, 14.5, 14.7, 14.9

14.15, 14.23, 14.27, 14.29