SW388R7 Data Analysis & Computers II
Slide 1
Strategy for Complete Regression Analysis
Additional issues in regression analysis
Assumption of independence of errors
Influential cases
Multicollinearity
Adjusted R²
Strategy for solving problems
Sample problems
Complete regression analysis
Slide 2
Assumption of independence of errors - 1
Multiple regression assumes that the errors are independent and there is no serial correlation. Errors are the residuals or differences between the actual score for a case and the score estimated using the regression equation. No serial correlation implies that the size of the residual for one case has no impact on the size of the residual for the next case.
The Durbin-Watson statistic is used to test for the presence of serial correlation among the residuals. The value of the Durbin-Watson statistic ranges from 0 to 4. As a general rule of thumb, the residuals are not correlated if the Durbin-Watson statistic is approximately 2, and an acceptable range is 1.50 - 2.50.
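As a rough illustration of the statistic described above, the following Python sketch computes Durbin-Watson from a list of residuals and applies the 1.50 - 2.50 rule of thumb. The function names are hypothetical, and the residuals are assumed to be listed in case order.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


def within_rule_of_thumb(dw, low=1.50, high=2.50):
    """Apply the rule of thumb from the slide: acceptable if 1.50 <= DW <= 2.50."""
    return low <= dw <= high


# Residuals that flip sign on every case push the statistic toward 4.
alternating = [1.0, -1.0, 1.0, -1.0]
# Residuals with no case-to-case pattern give a value near 2.
unpatterned = [1.0, -1.0, -1.0, 1.0]

print(durbin_watson(alternating), within_rule_of_thumb(durbin_watson(alternating)))
print(durbin_watson(unpatterned), within_rule_of_thumb(durbin_watson(unpatterned)))
```

Values near 0 suggest positive serial correlation and values near 4 suggest negative serial correlation, which is why the acceptable band is centered on 2.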
Slide 3
Assumption of independence of errors - 2
Serial correlation is more of a concern in analyses that involve time series.
If it does occur in relationship analyses, its presence can usually be understood by changing the sequence of cases and running the analysis again.
If the problem with serial correlation disappears, it may be ignored.
If the problem with serial correlation remains, it can usually be handled by using the difference between successive data points as the dependent variable.
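The differencing remedy mentioned above can be sketched in a few lines of Python. The function name and the sample values are made up for illustration; the result has one fewer value than the input because the first data point has no predecessor.

```python
def successive_differences(values):
    """Return the difference between each data point and the one before it."""
    return [current - previous for previous, current in zip(values, values[1:])]


# Hypothetical series of a dependent variable measured over time.
series = [3, 5, 4, 8]
print(successive_differences(series))
```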
Slide 4
Multicollinearity - 1
Multicollinearity is a problem in regression analysis that occurs when two independent variables are highly correlated, e.g. r = 0.90, or higher.
The relationship between the independent variables and the dependent variables is distorted by the very strong relationship between the independent variables, leading to the likelihood that our interpretation of relationships will be incorrect.
In the worst case, if the variables are perfectly correlated, the regression cannot be computed.
SPSS guards against the failure to compute a regression solution by arbitrarily omitting the collinear variable from the analysis.
Slide 5
Multicollinearity - 2
Multicollinearity is detected by examining the tolerance for each independent variable. Tolerance is the amount of variability in one independent variable that is not explained by the other independent variables.
Tolerance values less than 0.10 indicate collinearity.
If we discover collinearity in the regression output, we should reject the interpretation of the relationships as false until the issue is resolved.
Multicollinearity can be resolved by combining the highly correlated variables through principal component analysis, or omitting a variable from the analysis.
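To make the definition of tolerance concrete, here is a minimal Python sketch for the special case of exactly two independent variables, where the variance in one explained by the other is simply the squared Pearson correlation, so tolerance = 1 - r². With more than two independent variables, each one would instead be regressed on all the others. The function names and data are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def tolerance_two_predictors(x1, x2):
    """Tolerance of either predictor when the model has only these two predictors."""
    return 1 - pearson_r(x1, x2) ** 2


# Made-up predictors: the second pair is nearly a copy of x1.
x1 = [1, 2, 3, 4, 5]
moderately_related = [2, 4, 5, 4, 5]
nearly_identical = [1.1, 1.9, 3.0, 4.1, 4.9]

print(tolerance_two_predictors(x1, moderately_related))
print(tolerance_two_predictors(x1, nearly_identical) < 0.10)
```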
Slide 6
Adjusted R²
The coefficient of determination, R², which measures the strength of a relationship, can usually be increased simply by adding more variables. In fact, if the number of variables equals the number of cases, it is often possible to produce a perfect R² of 1.00.
Adjusted R² is a measure which attempts to reduce the inflation in R² by taking into account the number of independent variables and the number of cases.
If there is a large discrepancy between R² and Adjusted R², extraneous variables should be removed from the analysis and R² recomputed.
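The slide does not give a formula for Adjusted R². The sketch below assumes the commonly used adjustment, 1 - (1 - R²)(n - 1) / (n - k - 1), where n is the number of cases and k is the number of independent variables. The function name and example numbers are illustrative only.

```python
def adjusted_r_squared(r_squared, n_cases, n_ivs):
    """Shrink R-squared according to the number of cases and independent variables.

    Assumes the usual formula: 1 - (1 - R2) * (n - 1) / (n - k - 1).
    """
    return 1 - (1 - r_squared) * (n_cases - 1) / (n_cases - n_ivs - 1)


# With few cases the penalty is large; with many cases it nearly vanishes.
print(adjusted_r_squared(0.5, n_cases=11, n_ivs=2))
print(adjusted_r_squared(0.5, n_cases=1001, n_ivs=2))
```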
Slide 7
Influential cases
The degree to which outliers affect the regression solution depends upon where the outlier is located relative to the other cases in the analysis. Outliers whose location has a large effect on the regression solution are called influential cases.
Whether or not a case is influential is measured by Cook’s distance.
Cook’s distance is an index measure; it is compared to a critical value based on the formula: 4 / (n – k – 1) where n is the number of cases and k is the number of independent variables.
If a case has a Cook’s distance greater than the critical value, it should be examined for exclusion.
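The critical value formula above is simple arithmetic. As a small sketch, the hypothetical helper below evaluates 4 / (n – k – 1) for 254 cases and 3 independent variables, the figures used in the sample problem later in these slides.

```python
def cooks_distance_cutoff(n_cases, n_ivs):
    """Critical value for Cook's distance: 4 / (n - k - 1)."""
    return 4 / (n_cases - n_ivs - 1)


# 254 cases and 3 independent variables, as in the sample problem.
print(round(cooks_distance_cutoff(254, 3), 4))
```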
Slide 8
Overall strategy for solving problems
1. Run a baseline regression using the method for including variables implied by the problem statement to find the initial strength of the relationship, baseline R².
2. Test for useful transformations to improve normality, linearity, and homoscedasticity.
3. Substitute transformed variables and check for outliers and influential cases.
4. If R² from regression model using transformed variables and omitting outliers is at least 2% better than baseline R², select it for interpretation; otherwise select baseline model.
5. Validate and interpret the selected regression model.
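Step 4 of the strategy is a simple decision rule, sketched below in Python. This assumes that "2% better" means an absolute gain of 0.02 in R² (two percentage points of explained variance); the slide does not spell this out. The function name and return labels are hypothetical.

```python
def select_model(baseline_r2, revised_r2, min_gain=0.02):
    """Pick the revised model only if R-squared improves by at least min_gain.

    Assumption: the 2% threshold is an absolute difference in R-squared.
    """
    if revised_r2 - baseline_r2 >= min_gain:
        return "revised"
    return "baseline"


# A gain from 0.187 to 0.384 clears the threshold; 0.187 to 0.195 does not.
print(select_model(0.187, 0.384))
print(select_model(0.187, 0.195))
```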
Slide 9
Problem 1
In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.05 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions. Use 0.0160 as the criterion for identifying influential cases. Validate the results of your regression analysis by splitting the sample in two, using 788035 as the random number seed.
The variables "age" [age], "sex" [sex], and "respondent's socioeconomic index" [sei] have a strong relationship to the variable "how many in family earned money" [earnrs].
Survey respondents who were older had fewer family members earning money. The variables sex and respondent's socioeconomic index did not have a relationship to how many in family earned money.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic
Slide 10
Dissecting problem 1 - 1
The problem may give us different levels of significance for the analysis.
In this problem, we are told to use 0.05 as alpha for the regression, but 0.01 for testing assumptions.
When we test for influential cases using Cook’s distance, we need to compute a critical value for comparison using the formula:
4 / (n – k – 1), where n is the number of cases and k is the number of independent variables. The correct value (0.0160) is provided in the problem.
The random number seed (788035) for the split sample validation is provided.
After evaluating assumptions, outliers, and influential cases, we will decide whether we should use the model with transformations and excluding outliers, or the model with the original form of the variables and all cases.
Slide 11
Dissecting problem 1 - 2
When a problem states that there is a relationship between some independent variables and a dependent variable, we do standard multiple regression.
The variables listed first in the problem statement are the independent variables (IVs): "age" [age], "sex" [sex], and "respondent's socioeconomic index" [sei].
The variable that is the target of the relationship is the dependent variable (DV): "how many in family earned money" [earnrs].
Slide 12
Dissecting problem 1 - 3
In order for a problem to be true, we will have to find that there is a statistically significant relationship between the set of IVs and the DV, and the strength of the relationship stated in the problem must be correct.
In addition, the relationship or lack of relationship between the individual IVs and the DV must be identified correctly, and must be characterized correctly.
Slide 13
LEVEL OF MEASUREMENT
Multiple regression requires that the dependent variable be metric and the independent variables be metric or dichotomous.
"How many in family earned money" [earnrs] is an interval level variable, which satisfies the level of measurement requirement.
"Age" [age] and "respondent's socioeconomic index" [sei] are interval level variables, which satisfy the level of measurement requirements for multiple regression analysis.
"Sex" [sex] is a dichotomous or dummy-coded nominal variable which may be included in multiple regression analysis.
Slide 14
The baseline regression
We begin our analysis by running a standard multiple regression analysis with earnrs as the dependent variable and age, sex, and sei as the independent variables.
Select Enter as the Method for including variables to produce a standard multiple regression.
Click on the Statistics… button to select statistics we will need for the analysis.
Slide 15
The baseline regression
Retain the default checkboxes for Estimates and Model fit to obtain the baseline R², which will be used to determine whether we should use the model with transformations and excluding outliers, or the model with the original form of the variables and all cases.
Mark the Descriptives checkbox to get the number of cases available for the analysis.
Mark the checkbox for the Durbin-Watson statistic, which will be used to test the assumption of independence of errors.
Slide 16
Descriptive Statistics
          Mean      Std. Deviation   N
EARNRS    1.47      1.008            254
AGE       46.62     16.642           254
SEX       1.57      .496             254
SEI       48.601    19.1110          254
Initial sample size
The initial sample size before excluding outliers and influential cases is 254. With 3 independent variables, the ratio of cases to variables is 84.7 to 1, satisfying both the minimum and preferred sample size for multiple regression.
If the sample size did not initially satisfy the minimum requirement, regression analysis is not appropriate.
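The cases-to-variables ratio quoted above can be reproduced with a short sketch. The slide refers to a minimum and a preferred sample size without stating the numbers on this slide, so the 5-to-1 and 15-to-1 thresholds used below are assumptions for illustration, as are the function names.

```python
def cases_per_variable(n_cases, n_ivs):
    """Ratio of valid cases to independent variables."""
    return n_cases / n_ivs


def meets_ratio(n_cases, n_ivs, required_ratio):
    """True if the cases-to-variables ratio is at least the required ratio."""
    return cases_per_variable(n_cases, n_ivs) >= required_ratio


ASSUMED_MINIMUM = 5     # assumed threshold, not stated on this slide
ASSUMED_PREFERRED = 15  # assumed threshold, not stated on this slide

print(round(cases_per_variable(254, 3), 1))
print(meets_ratio(254, 3, ASSUMED_MINIMUM), meets_ratio(254, 3, ASSUMED_PREFERRED))
```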
Slide 17
R² before transformations or removing outliers
Prior to any transformations of variables to satisfy the assumptions of multiple regression or removal of outliers, the proportion of variance in the dependent variable explained by the independent variables (R²) was 18.7%.
The relationship is statistically significant, though we would not stop if it were not significant because the lack of significance may be a consequence of violation of assumptions or the inclusion of outliers and influential cases.
The R² of 0.187 is the benchmark that we will use to evaluate the utility of transformations and the elimination of outliers/influential cases.
Slide 18
Assumption of independence of errors: the Durbin-Watson statistic
The Durbin-Watson statistic is used to test for the presence of serial correlation among the residuals, i.e., the assumption of independence of errors, which requires that the residuals or errors in prediction do not follow a pattern from case to case.
The value of the Durbin-Watson statistic ranges from 0 to 4. As a general rule of thumb, the residuals are not correlated if the Durbin-Watson statistic is approximately 2, and an acceptable range is 1.50 - 2.50.
The Durbin-Watson statistic for this problem is 1.849 which falls within the acceptable range.
If the Durbin-Watson statistic was not in the acceptable range, we would add a caution to the findings for a violation of regression assumptions.
Slide 19
Normality of dependent variable: how many in family earned money
We start by evaluating the normality of the dependent variable. Afterward, we examine the normality of each metric independent variable and the linearity of its relationship with the dependent variable.
To test the normality of number of earners in family, run the script: NormalityAssumptionAndTransformations.SBS
First, move the dependent variable EARNRS to the list box of variables to test.
Second, click on the OK button to produce the output.
Slide 20
Descriptives
HOW MANY IN FAMILY EARNED MONEY
                                                 Statistic   Std. Error
Mean                                             1.43        .061
95% Confidence Interval for Mean: Lower Bound    1.31
95% Confidence Interval for Mean: Upper Bound    1.56
5% Trimmed Mean                                  1.37
Median                                           1.00
Variance                                         1.015
Std. Deviation                                   1.008
Minimum                                          0
Maximum                                          5
Range                                            5
Interquartile Range                              1.00
Skewness                                         .742        .149
Kurtosis                                         1.324       .296
Normality of dependent variable: how many in family earned money
The dependent variable "how many in family earned money" [earnrs] does not satisfy the criteria for a normal distribution.
The skewness (0.742) fell between -1.0 and +1.0, but the kurtosis (1.324) fell outside the range from -1.0 to +1.0.
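The rule applied above (skewness and kurtosis both between -1.0 and +1.0) can be sketched in Python. This assumes the bias-adjusted sample skewness and excess kurtosis formulas commonly reported by statistical packages; the function names and the tiny data set are hypothetical and are not the GSS data.

```python
import math


def _mean_and_sd(xs):
    """Mean and sample standard deviation (n - 1 in the denominator)."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    return mean, sd


def sample_skewness(xs):
    """Bias-adjusted sample skewness (requires at least 3 values)."""
    n = len(xs)
    mean, sd = _mean_and_sd(xs)
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / sd) ** 3 for x in xs)


def sample_kurtosis(xs):
    """Bias-adjusted sample excess kurtosis (requires at least 4 values)."""
    n = len(xs)
    mean, sd = _mean_and_sd(xs)
    fourth = sum(((x - mean) / sd) ** 4 for x in xs)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * fourth
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))


def satisfies_normality_rule(xs, bound=1.0):
    """Both skewness and kurtosis must fall between -bound and +bound."""
    return abs(sample_skewness(xs)) <= bound and abs(sample_kurtosis(xs)) <= bound


flat = [1, 2, 3, 4, 5]  # symmetric but flat: kurtosis falls below -1.0
print(sample_skewness(flat), sample_kurtosis(flat), satisfies_normality_rule(flat))
```

As with the dependent variable on this slide, a distribution can pass the skewness check and still fail the rule on kurtosis alone.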
Slide 21
Normality of dependent variable: how many in family earned money
The logarithmic transformation improves the normality of "how many in family earned money" [earnrs]. In evaluating normality, the skewness (-0.483) and kurtosis (-0.309) were both within the range of acceptable values from -1.0 to +1.0.
The square root transformation also has values of skewness and kurtosis in the acceptable range.
However, by our order of preference for which transformation to use, the logarithm is preferred to the square root or inverse.
Slide 22
Transformation for how many in family earned money
The logarithmic transformation improves the normality of "how many in family earned money" [earnrs].
We will substitute the logarithmic transformation of how many in family earned money as the dependent variable in the regression analysis.
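The output tables in these slides label the transformed variable as LG10(1+EARNRS). A minimal Python sketch of that transformation follows; the function name is hypothetical. Adding 1 before taking the logarithm keeps the result defined for families with 0 earners, the minimum value reported for this variable.

```python
import math


def log_transform(value):
    """Base-10 logarithm of (1 + value), mirroring LG10(1+EARNRS)."""
    return math.log10(1 + value)


# EARNRS ranges from 0 to 5 in the descriptives, so transform each possible value.
for earners in range(0, 6):
    print(earners, round(log_transform(earners), 3))
```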
Slide 23
Adding a transformed variable
Before testing the assumptions for the independent variables, we need to add the transformation of the dependent variable to the data set.
First, move the variable that we want to transform to the list box of variables to test.
Second, mark the checkbox for the transformation we want to add to the data set, and clear the other checkboxes.
Third, clear the checkbox for Delete transformed variables from the data. This will save the transformed variable.
Fourth, click on the OK button to produce the output.
Slide 24
The transformed variable in the data editor
If we scroll to the extreme right in the data editor, we see that the transformed variable has been added to the data set.
Whenever we add transformed variables to the data set, we should be sure to delete them before starting another analysis.
Slide 25
Normality/linearity of independent variable: age
After evaluating the dependent variable, we examine the normality of each metric variable and linearity of its relationship with the dependent variable.
To test the normality of age, run the script: NormalityAssumptionAndTransformations.SBS
First, move the independent variable AGE to the list box of variables to test.
Second, click on the OK button to produce the output.
Slide 26
Descriptives
AGE OF RESPONDENT
                                                 Statistic   Std. Error
Mean                                             45.99       1.023
95% Confidence Interval for Mean: Lower Bound    43.98
95% Confidence Interval for Mean: Upper Bound    48.00
5% Trimmed Mean                                  45.31
Median                                           43.50
Variance                                         282.465
Std. Deviation                                   16.807
Minimum                                          19
Maximum                                          89
Range                                            70
Interquartile Range                              24.00
Skewness                                         .595        .148
Kurtosis                                         -.351       .295
Normality/linearity of independent variable: age
In evaluating normality, the skewness (0.595) and kurtosis (-0.351) were both within the range of acceptable values from -1.0 to +1.0.
Slide 27
Normality/linearity of independent variable: age
To evaluate the linearity of age and the log transformation of number of earners in the family, run the script for the assumption of linearity: LinearityAssumptionAndTransformations.SBS
First, move the transformed dependent variable LOGEARN to the text box for the dependent variable.
Second, move the independent variable, AGE, to the list box for independent variables.
Third, click on the OK button to produce the output.
Slide 28
Correlations
                       LG10(1+EARNRS)   AGE       LG10(AGE)   (AGE)**2   SQRT(AGE)   -1/(AGE)
Logarithm of EARNRS [LG10(1+EARNRS)]
  Pearson Correlation  1                -.493**   -.417**     -.552**    -.457**     -.336**
  Sig. (2-tailed)      .                .000      .000        .000       .000        .000
  N                    269              269       269         269        269         269
AGE OF RESPONDENT
  Pearson Correlation  -.493**          1         .979**      .983**     .995**      .916**
  Sig. (2-tailed)      .000             .         .000        .000       .000        .000
  N                    269              270       270         270        270         270
Logarithm of AGE [LG10(AGE)]
  Pearson Correlation  -.417**          .979**    1           .926**     .994**      .978**
  Sig. (2-tailed)      .000             .000      .           .000       .000        .000
  N                    269              270       270         270        270         270
Square of AGE [(AGE)**2]
  Pearson Correlation  -.552**          .983**    .926**      1          .960**      .832**
  Sig. (2-tailed)      .000             .000      .000        .          .000        .000
  N                    269              270       270         270        270         270
Square Root of AGE [SQRT(AGE)]
  Pearson Correlation  -.457**          .995**    .994**      .960**     1           .951**
  Sig. (2-tailed)      .000             .000      .000        .000       .           .000
  N                    269              270       270         270        270         270
Inverse of AGE [-1/(AGE)]
  Pearson Correlation  -.336**          .916**    .978**      .832**     .951**      1
  Sig. (2-tailed)      .000             .000      .000        .000       .000        .
  N                    269              270       270         270        270         270
**. Correlation is significant at the 0.01 level (2-tailed).
Normality/linearity of independent variable: age
The evidence of linearity in the relationship between the independent variable "age" [age] and the dependent variable "log transformation of how many in family earned money" [logearn] was the statistical significance of the correlation coefficient (r = -0.493). The probability for the correlation coefficient was <0.001, less than or equal to the level of significance of 0.01. We reject the null hypothesis that r = 0 and conclude that there is a linear relationship between the variables.
The independent variable "age" [age] satisfies the criteria for both the assumption of normality and the assumption of linearity with the dependent variable "log transformation of how many in family earned money" [logearn].
Slide 29
Normality/linearity of independent variable: respondent's socioeconomic index
To test the normality of respondent's socioeconomic index, run the script: NormalityAssumptionAndTransformations.SBS
First, move the independent variable SEI to the list box of variables to test.
Second, click on the OK button to produce the output.
Slide 30
Descriptives
RESPONDENT'S SOCIOECONOMIC INDEX
                                                 Statistic   Std. Error
Mean                                             48.710      1.1994
95% Confidence Interval for Mean: Lower Bound    46.348
95% Confidence Interval for Mean: Upper Bound    51.072
5% Trimmed Mean                                  47.799
Median                                           39.600
Variance                                         366.821
Std. Deviation                                   19.1526
Minimum                                          19.4
Maximum                                          97.2
Range                                            77.8
Interquartile Range                              31.100
Skewness                                         .585        .153
Kurtosis                                         -.862       .304
Normality/linearity of independent variable: respondent's socioeconomic index
The independent variable "respondent's socioeconomic index" [sei] satisfies the criteria for the assumption of normality, but does not satisfy the assumption of linearity with the dependent variable "log transformation of how many in family earned money" [logearn].
In evaluating normality, the skewness (0.585) and kurtosis (-0.862) were both within the range of acceptable values from -1.0 to +1.0.
Slide 31
Normality/linearity of independent variable: respondent's socioeconomic index
To evaluate the linearity of the relationship between respondent's socioeconomic index and the log transformation of how many in family earned money, run the script for the assumption of linearity: LinearityAssumptionAndTransformations.SBS
First, move the transformed dependent variable LOGEARN to the text box for the dependent variable.
Second, move the independent variable, SEI, to the list box for independent variables.
Third, click on the OK button to produce the output.
Slide 32
Correlations
                       LG10(1+EARNRS)   SEI       LG10(SEI)   (SEI)**2   SQRT(SEI)   -1/(SEI)
Logarithm of EARNRS [LG10(1+EARNRS)]
  Pearson Correlation  1                .055      .073        .036       .064        .092
  Sig. (2-tailed)      .                .385      .243        .563       .309        .142
  N                    269              254       254         254        254         254
RESPONDENT'S SOCIOECONOMIC INDEX
  Pearson Correlation  .055             1         .987**      .988**     .997**      .948**
  Sig. (2-tailed)      .385             .         .000        .000       .000        .000
  N                    254              255       255         255        255         255
Logarithm of SEI [LG10(SEI)]
  Pearson Correlation  .073             .987**    1           .951**     .997**      .986**
  Sig. (2-tailed)      .243             .000      .           .000       .000        .000
  N                    254              255       255         255        255         255
Square of SEI [(SEI)**2]
  Pearson Correlation  .036             .988**    .951**      1          .973**      .892**
  Sig. (2-tailed)      .563             .000      .000        .          .000        .000
  N                    254              255       255         255        255         255
Square Root of SEI [SQRT(SEI)]
  Pearson Correlation  .064             .997**    .997**      .973**     1           .970**
  Sig. (2-tailed)      .309             .000      .000        .000       .           .000
  N                    254              255       255         255        255         255
Inverse of SEI [-1/(SEI)]
  Pearson Correlation  .092             .948**    .986**      .892**     .970**      1
  Sig. (2-tailed)      .142             .000      .000        .000       .000        .
  N                    254              255       255         255        255         255
**. Correlation is significant at the 0.01 level (2-tailed).
Normality/linearity of independent variable: respondent's socioeconomic index
The probability for the correlation coefficient was 0.385, greater than the level of significance of 0.01. We cannot reject the null hypothesis that r = 0, and cannot conclude that there is a linear relationship between the variables.
Since none of the transformations to improve linearity were successful, it is an indication that the problem may be a weak relationship, rather than a curvilinear relationship correctable by using a transformation. A weak relationship is not a violation of the assumption of linearity, and does not require a caution.
Slide 33
Homoscedasticity of independent variable: Sex
To evaluate the homoscedasticity of the relationship between sex and the log transformation of how many in family earned money, run the script for the assumption of homogeneity of variance: HomoscedasticityAssumptionAndTransformations.SBS
First, move the transformed dependent variable LOGEARN to the text box for the dependent variable.
Second, move the independent variable, SEX, to the list box for independent variables.
Third, click on the OK button to produce the output.
Slide 34
Homoscedasticity of independent variable: Sex
Based on the Levene Test, the variance in "log transformation of how many in family earned money" [logearn] is homogeneous for the categories of "sex" [sex].
The probability associated with the Levene Statistic (0.767) is greater than the level of significance, so we fail to reject the null hypothesis and conclude that the homoscedasticity assumption is satisfied.
Slide 35
The regression to identify outliers and influential cases
To run the regression again, select the Regression | Linear command from the Analyze menu.
We use the regression procedure to identify univariate outliers, multivariate outliers, and influential cases.
We start with the same dialog we used for the baseline analysis and substitute the transformed variables which we think will improve the analysis.
Slide 36
The regression to identify outliers and influential cases
First, we substitute the logarithmic transformation of earnrs, logearn, as the dependent variable.
Second, we keep the method of entry set to Enter so that all variables will be included in the detection of outliers.
NOTE: we should always use Enter when testing for outliers and influential cases to make sure all variables are included in the determination.
Third, we want to save the calculated values of the outlier statistics to the data set. Click on the Save… button to specify what we want to save.
Slide 37
Saving the measures of outliers/influential cases
First, mark the checkbox for Studentized residuals in the Residuals panel. Studentized residuals are z-scores computed for a case based on the data for all other cases in the data set.
Second, mark the checkbox for Mahalanobis in the Distances panel. This will compute Mahalanobis distances for the set of independent variables.
Third, mark the checkbox for Cook’s in the Distances panel. This will compute Cook’s distances to identify influential cases.
Fourth, click on the OK button to complete the specifications.
Slide 38
The variables for identifying outliers/influential cases
The variable for identifying univariate outliers for the dependent variable is in a column which SPSS has named sre_1. These are the studentized residuals for the log transformed variables.
The variable for identifying multivariate outliers for the independent variables is in a column which SPSS has named mah_1.
The variable containing Cook’s distances for identifying influential cases has been named coo_1 by SPSS.
Slide 39
Computing the probability for Mahalanobis D²
To compute the probability of D², we will use an SPSS function in a Compute command.
First, select the Compute… command from the Transform menu.
Slide 40
Formula for probability for Mahalanobis D²
First, in the target variable text box, type the name "p_mah_1" as an acronym for the probability of the mah_1, the Mahalanobis D² score.
Second, to complete the specifications for the CDF.CHISQ function, type the name of the variable containing the D² scores, mah_1, followed by a comma, followed by the number of variables used in the calculations, 3.
Since the CDF function (cumulative distribution function) computes the cumulative probability from the left end of the distribution up through a given value, we subtract it from 1 to obtain the probability in the upper tail of the distribution.
Third, click on the OK button to signal completion of the Compute Variable dialog.
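The computation described on this slide, 1 minus the chi-square CDF of the D² score with 3 degrees of freedom, can be sketched outside SPSS. The Python function below is a hypothetical stand-in that uses the closed-form upper-tail expressions for integer degrees of freedom; it is not the SPSS implementation. The sample D² values are made up.

```python
import math


def chi_square_upper_tail(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2
    if df % 2 == 0:
        # Even df: exp(-x/2) times the sum of (x/2)^j / j! for j = 0 .. df/2 - 1.
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) plus a finite series in odd powers of sqrt(x).
    term, total = math.sqrt(x), 0.0
    for j in range(1, (df - 1) // 2 + 1):
        if j > 1:
            term *= x / (2 * j - 1)
        total += term
    return math.erfc(math.sqrt(half)) + math.sqrt(2 / math.pi) * math.exp(-half) * total


# Made-up D² scores for three cases, with 3 independent variables.
for d_squared in (2.5, 9.0, 17.0):
    p_mah = chi_square_upper_tail(d_squared, 3)
    print(d_squared, round(p_mah, 4), "outlier" if p_mah <= 0.001 else "ok")
```

With 3 degrees of freedom the probability drops to 0.001 at a D² of about 16.27, so only the last made-up case would be flagged under the 0.001 criterion.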
Slide 41
Univariate outliers
A score on the dependent variable is considered unusual if its studentized residual is greater than +3.0 or less than -3.0.
Slide 42
Multivariate outliers
The combination of scores for the independent variables is an outlier if the probability of the Mahalanobis D² distance score is less than or equal to 0.001.
Slide 43
Influential cases
In addition, a case may have a large influence on the regression analysis, resulting in an analysis that is less representative of the population represented by the sample. The criterion for identifying an influential case is a Cook's distance score with a value of 0.0160 or greater.
The criterion for Cook’s distance is:
4 / (n – k – 1) = 4 / (254 – 3 – 1) = 0.0160
Slide 44
Omitting the outliers and influential cases
To omit the outliers and influential cases from the analysis, we select the cases that are not outliers and are not influential cases.
First, select the Select Cases… command from the Data menu.
Slide 45
Specifying the condition to omit outliers
First, mark the If condition is satisfied option button to indicate that we will enter a specific condition for including cases.
Second, click on the If… button to specify the criteria for inclusion in the analysis.
Slide 46
The formula for omitting outliers
To eliminate the outliers and influential cases, we request the cases that are not outliers or influential cases.
The formula specifies that we should include cases if the studentized residual (regardless of sign) is less than 3, the probability for Mahalanobis D² is higher than the level of significance of 0.001, and the Cook’s distance value is less than the critical value of 0.0160.
After typing in the formula, click on the Continue button to close the dialog box.
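The inclusion condition described above combines three tests. The Python sketch below expresses the same logic on plain dictionaries. The keys sre_1, p_mah_1, and coo_1 follow the variable names used in these slides, while the function name, case labels, and values are hypothetical.

```python
def keep_case(case, resid_cutoff=3.0, mah_alpha=0.001, cook_cutoff=0.0160):
    """Include a case only if it passes all three outlier and influence checks."""
    return (
        abs(case["sre_1"]) < resid_cutoff    # studentized residual, regardless of sign
        and case["p_mah_1"] > mah_alpha      # probability of Mahalanobis D²
        and case["coo_1"] < cook_cutoff      # Cook's distance
    )


# Hypothetical cases, not taken from GSS2000.sav.
cases = [
    {"id": "A", "sre_1": 0.42, "p_mah_1": 0.6100, "coo_1": 0.0011},
    {"id": "B", "sre_1": 3.13, "p_mah_1": 0.2500, "coo_1": 0.0320},   # outlier and influential
    {"id": "C", "sre_1": -1.20, "p_mah_1": 0.0004, "coo_1": 0.0050},  # multivariate outlier
    {"id": "D", "sre_1": 1.75, "p_mah_1": 0.4000, "coo_1": 0.0167},   # influential only
]

selected = [case["id"] for case in cases if keep_case(case)]
print(selected)
```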
Slide 47
Completing the request for the selection
To complete the request, we click on the OK button.
Slide 48
An omitted outlier and influential case
SPSS identifies the excluded cases by drawing a slash mark through the case number. This omitted case has a large studentized residual, greater than 3.0, as well as a Cook’s distance value that is greater than the critical value, 0.0160.
Slide 49
The outliers and influential cases
Case 20000159 is an influential case (Cook's distance=0.0320) as well as an outlier on the dependent variable (studentized residual=3.13).
Case 20000915 is an influential case (Cook's distance=0.0239).
Case 20001016 is an influential case (Cook's distance=0.0598) as well as an outlier on the dependent variable (studentized residual=-3.12).
Case 20001761 is an influential case (Cook's distance=0.0167).
Case 20002587 is an influential case (Cook's distance=0.0264).
Case 20002597 is an influential case (Cook's distance=0.0293).
There are 6 cases that have a Cook's distance score that is large enough to be considered influential cases.
Slide 50
Running the regression omitting outliers
We run the regression again, without the outliers which we selected out with the Select If command.
Select the Regression | Linear command from the Analyze menu.
Slide 51
Opening the save options dialog
We specify the dependent and independent variables, continuing to substitute any transformed variables required by assumptions.
On our last run, we instructed SPSS to save studentized residuals, Mahalanobis distance, and Cook’s distance. To prevent these values from being calculated again, click on the Save… button.
Slide 52
Clearing the request to save outlier data
First, clear the checkbox for Studentized residuals.
Second, clear the checkbox for Mahalanobis distance.
Third, clear the checkbox for Cook’s distance.
Fourth, click on the OK button to complete the specifications.
Slide 53
Opening the statistics options dialog
Once we have removed outliers, we need to check the sample size requirement for regression.
Since we will need the descriptive statistics for this, click on the Statistics… button.
Slide 54
Requesting descriptive statistics
First, mark the checkbox for Descriptives.
Second, mark the checkbox for Collinearity diagnostics to obtain the tolerance values for each independent variable in order to assess multicollinearity.
Third, click on the Continue button to complete the specifications.
Slide 55
Requesting the output
Having specified the output needed for the analysis, we click on the OK button to obtain the regression output.
Slide 56
Model Summary (b)
Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
1       .620 (a)   .384       .377                .1457258
a. Predictors: (Constant), SEI, AGE, SEX
b. Dependent Variable: LOGEARN
SELECTION OF MODEL FOR INTERPRETATION
Prior to any transformations of variables to satisfy the assumptions of multiple regression and the removal of outliers and influential cases, the proportion of variance in the dependent variable explained by the independent variables (R²) was 18.7%.
After substituting transformed variables and removing outliers and influential cases, the proportion of variance in the dependent variable explained by the independent variables (R²) was 38.4%.
Since the regression analysis using transformations and omitting outliers and influential cases explained at least two percent more variance than the regression analysis with all cases and no transformations, the regression analysis with transformed variables omitting outliers and influential cases was interpreted.
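The selection rule above reduces to a single comparison. The sketch below assumes that "two percent more variance" means a gain of at least 0.02 in R²; the function name and return labels are hypothetical.

```python
def pick_model(baseline_r2, revised_r2, min_gain=0.02):
    """Choose which regression to interpret under the two-point R-squared rule."""
    gain = revised_r2 - baseline_r2
    return "revised" if gain >= min_gain else "baseline"


# R-squared values from the slides: 18.7% before, 38.4% after.
print(pick_model(0.187, 0.384))
```

Here the gain is 19.7 percentage points, so the model with transformed variables and omitted cases is the one interpreted.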
Slide 57
Descriptive Statistics
           Mean      Std. Deviation   N
LOGEARN    .354289   .1845814         248
AGE        46.70     16.677           248
SEX        1.57      .496             248
SEI        48.819    19.1071          248
SAMPLE SIZE
The minimum ratio of valid cases to independent variables for multiple regression is 5 to 1. After removing 6 influential cases or outliers, there are 248 valid cases and 3 independent variables.
The ratio of cases to independent variables for this analysis is 82.67 to 1, which satisfies the minimum requirement. In addition, the ratio of 82.67 to 1 satisfies the preferred ratio of 15 to 1.
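The sample size check is simple arithmetic, sketched here in Python with the counts from the Descriptive Statistics table. The function name is hypothetical; the thresholds of 5 and 15 are the minimum and preferred ratios stated above.

```python
def cases_per_variable(valid_cases, independent_variables):
    """Ratio of valid cases to independent variables."""
    return valid_cases / independent_variables


ratio = cases_per_variable(248, 3)
print(round(ratio, 2))   # 82.67
print(ratio >= 5)        # meets the minimum ratio
print(ratio >= 15)       # meets the preferred ratio
```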
Slide 58
OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT VARIABLES
The probability of the F statistic (50.759) for the overall regression relationship is <0.001, less than or equal to the level of significance of 0.05. We reject the null hypothesis that there is no relationship between the set of independent variables and the dependent variable (R² = 0). We support the research hypothesis that there is a statistically significant relationship between the set of independent variables and the dependent variable.
Slide 59
OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT VARIABLES
The Multiple R for the relationship between the set of independent variables and the dependent variable is 0.620, which would be characterized as strong using the rule of thumb that a correlation less than or equal to 0.20 is characterized as very weak; greater than 0.20 and less than or equal to 0.40 is weak; greater than 0.40 and less than or equal to 0.60 is moderate; greater than 0.60 and less than or equal to 0.80 is strong; and greater than 0.80 is very strong.
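The rule of thumb can be written as a short lookup. This is a sketch with a hypothetical function name; the cut points are the ones given in the text.

```python
def describe_strength(r):
    """Label a correlation using the course's rule of thumb."""
    size = abs(r)
    if size <= 0.20:
        return "very weak"
    if size <= 0.40:
        return "weak"
    if size <= 0.60:
        return "moderate"
    if size <= 0.80:
        return "strong"
    return "very strong"


print(describe_strength(0.620))   # strong
```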
Slide 60
Coefficients (a)
Model 1       B       Std. Error   Beta    t         Sig.   Tolerance   VIF
(Constant)    .626    .048                 12.989    .000
AGE           -.007   .001         -.615   -12.237   .000   .999        1.001
SEX           .024    .019         .065    1.284     .200   .997        1.003
SEI           .000    .000         .018    .354      .724   .997        1.004
B and Std. Error: unstandardized coefficients. Beta: standardized coefficient. Tolerance and VIF: collinearity statistics.
a. Dependent Variable: LOGEARN
MULTICOLLINEARITY
Multicollinearity occurs when one independent variable is so strongly correlated with one or more other variables that its relationship to the dependent variable is likely to be misinterpreted. Its potential unique contribution to explaining the dependent variable is minimized by its strong relationship to other independent variables. Multicollinearity is indicated when the tolerance value for an independent variable is less than 0.10.
The tolerance values for all of the independent variables are larger than 0.10. Multicollinearity is not a problem in this regression analysis.
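The tolerance screen can be sketched as a filter over the values in the Coefficients table. The function name and the second, collinear example are hypothetical; the 0.10 cutoff is the one stated above. VIF is the reciprocal of tolerance, so a tolerance below 0.10 corresponds to a VIF above 10.

```python
def collinear_variables(tolerances, cutoff=0.10):
    """Return the names of variables whose tolerance falls below the cutoff."""
    return [name for name, tol in tolerances.items() if tol < cutoff]


# Tolerance values from the Coefficients table.
reported = {"AGE": 0.999, "SEX": 0.997, "SEI": 0.997}
print(collinear_variables(reported))        # no variables flagged

# Hypothetical example in which two predictors nearly duplicate each other.
made_up = {"X1": 0.08, "X2": 0.07, "X3": 0.95}
print(collinear_variables(made_up))
```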
Slide 61
RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 1
For the independent variable age, the probability of the t statistic (-12.237) for the b coefficient is <0.001 which is less than or equal to the level of significance of 0.05. We reject the null hypothesis that the slope associated with age is equal to zero (b = 0) and conclude that there is a statistically significant relationship between age and log transformation of how many in family earned money.
The b coefficient associated with age (-0.007) is negative, indicating an inverse relationship in which higher numeric values for age are associated with lower numeric values for log transformation of how many in family earned money. Therefore, the negative value of b implies that survey respondents who were older had fewer family members earning money.
Slide 62
RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 2
For the independent variable sex, the probability of the t statistic (1.284) for the b coefficient is 0.200 which is greater than the level of significance of 0.05.
We fail to reject the null hypothesis that the slope associated with sex is equal to zero (b = 0) and conclude that there is not a statistically significant relationship between sex and log transformation of how many in family earned money.
Slide 63
RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 3
For the independent variable respondent's socioeconomic index, the probability of the t statistic (0.354) for the b coefficient is 0.724 which is greater than the level of significance of 0.05.
We fail to reject the null hypothesis that the slope associated with respondent's socioeconomic index is equal to zero (b = 0) and conclude that there is not a statistically significant relationship between respondent's socioeconomic index and log transformation of how many in family earned money.
Slide 64
Validation analysis: set the random number seed
To set the random number seed, select the Random Number Seed… command from the Transform menu.
Slide 65
Set the random number seed
First, click on the Set seed to option button to activate the text box.
Second, type in the random seed stated in the problem.
Third, click on the OK button to complete the dialog box.
Note that SPSS does not provide you with any feedback about the change.
Slide 66
Validation analysis: compute the split variable
To enter the formula for the variable that will split the sample in two parts, click on the Compute… command.
Slide 67
The formula for the split variable
First, type the name for the new variable, split, into the Target Variable text box.
Second, the formula for the value of split is shown in the text box.
The uniform(1) function generates a random decimal number between 0 and 1. The random number is compared to the value 0.50.
If the random number is less than or equal to 0.50, the value of the formula will be 1, the SPSS numeric equivalent to true. If the random number is larger than 0.50, the formula will return a 0, the SPSS numeric equivalent to false.
Third, click on the OK button to complete the dialog box.
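The same split logic can be sketched in Python for readers who want to see it as code. The function name is hypothetical, and Python's random number generator is not the one SPSS uses, so the same seed will not reproduce the SPSS assignment of cases; only the logic matches.

```python
import random


def make_split(n_cases, seed):
    """Assign each case 1 if a uniform draw is <= 0.50, otherwise 0."""
    rng = random.Random(seed)   # seeded so the split is repeatable
    return [1 if rng.uniform(0, 1) <= 0.50 else 0 for _ in range(n_cases)]


# 248 valid cases; 788035 is the seed given in the problem statement.
split = make_split(248, seed=788035)
first_sample = [i for i, value in enumerate(split) if value == 0]
second_sample = [i for i, value in enumerate(split) if value == 1]
print(len(first_sample), len(second_sample))
```

Because each case is assigned independently, the two samples are only approximately equal in size, which is also what the SPSS output shows.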
Slide 68
The split variable in the data editor
In the data editor, the split variable shows a random pattern of zeros and ones.
To select half of the sample for each validation analysis, we will first select the cases where split = 0, then select the cases where split = 1.
Slide 69
Repeat the regression with first validation sample
To repeat the multiple regression analysis for the first validation sample, select Linear Regression from the Dialog Recall tool button.
Slide 70
Using "split" as the selection variable
First, scroll down the list of variables and highlight the variable split.
Second, click on the right arrow button to move the split variable to the Selection Variable text box.
Slide 71
Setting the value of split to select cases
When the variable named split is moved to the Selection Variable text box, SPSS adds "=?" after the name to prompt us to enter a specific value for split.
Click on the Rule… button to enter a value for split.
Slide 72
Completing the value selection
First, type the value for the first half of the sample, 0, into the Value text box.
Second, click on the Continue button to complete the value entry.
Slide 73
Requesting output for the first validation sample
When the value entry dialog box is closed, SPSS adds the value we entered after the equal sign. This specification now tells SPSS to include in the analysis only those cases that have a value of 0 for the split variable.
Click on the OK button to request the output.
Since the validation analysis requires us to compare the results of the analysis using the two split samples, we will request the output for the second sample before doing any comparison.
Slide 74
Repeat the regression with second validation sample
To repeat the multiple regression analysis for the second validation sample, select Linear Regression from the Dialog Recall tool button.
Slide 75
Setting the value of split to select cases
Since the split variable is already in the Selection Variable text box, we only need to change its value.
Click on the Rule… button to enter a different value for split.
Slide 76
Completing the value selection
First, type the value for the second half of the sample, 1, into the Value text box.
Second, click on the Continue button to complete the value entry.
Slide 77
Requesting output for the second validation sample
When the value entry dialog box is closed, SPSS adds the value we entered after the equal sign. This specification now tells SPSS to include in the analysis only those cases that have a value of 1 for the split variable.
Click on the OK button to request the output.
Slide 78
SPLIT-SAMPLE VALIDATION - 1
ANOVA (b, c)
Model 1       Sum of Squares   df    Mean Square   F        Sig.
Regression    1.692            3     .564          24.220   .000 (a)
Residual      2.538            109   .023
Total         4.230            112
a. Predictors: (Constant), SEI, AGE, SEX
b. Dependent Variable: LOGEARN
c. Selecting only cases for which SPLIT = .0000
In both of the split-sample validation analyses, the relationship between the independent variables and the dependent variable was statistically significant.
ANOVA (b, c)
Model 1       Sum of Squares   df    Mean Square   F        Sig.
Regression    1.500            3     .500          25.062   .000 (a)
Residual      2.614            131   .020
Total         4.114            134
a. Predictors: (Constant), SEI, SEX, AGE
b. Dependent Variable: LOGEARN
c. Selecting only cases for which SPLIT = 1.0000
In the first validation, the probability for the F statistic testing overall relationship was <0.001.
For the second validation analysis, the probability for the F statistic testing overall relationship was <0.001.
Thus far, the validation verifies the existence of the relationship between the dependent variable and the independent variables.
Slide 79
Model Summary (b, c)
Model 1
R, SPLIT = 1.0000 (Selected):                               .604 (a)
R, SPLIT ~= 1.0000 (Unselected):                            .621
R Square:                                                   .365
Adjusted R Square:                                          .350
Std. Error of the Estimate:                                 .1412615
Durbin-Watson Statistic, SPLIT = 1.0000 (Selected):         1.839
Durbin-Watson Statistic, SPLIT ~= 1.0000 (Unselected):      2.161
a. Predictors: (Constant), SEI, SEX, AGE
b. Unless noted otherwise, statistics are based only on cases for which SPLIT = 1.0000.
c. Dependent Variable: LOGEARN
Model Summary (b, c)
Model 1
R, SPLIT = .0000 (Selected):                                .632 (a)
R, SPLIT ~= .0000 (Unselected):                             .593
R Square:                                                   .400
Adjusted R Square:                                          .383
Std. Error of the Estimate:                                 .1525916
Durbin-Watson Statistic, SPLIT = .0000 (Selected):          2.117
Durbin-Watson Statistic, SPLIT ~= .0000 (Unselected):       1.862
a. Predictors: (Constant), SEI, AGE, SEX
b. Unless noted otherwise, statistics are based only on cases for which SPLIT = .0000.
c. Dependent Variable: LOGEARN
SPLIT-SAMPLE VALIDATION - 2
The total proportion of variance explained in the model using the full data set was 38.4%, compared to 40.0% for the first split-sample validation and 36.5% for the second split-sample validation.
In both of the split-sample validation analyses, the total proportion of variance in the dependent variable explained by the independent variables was within 5% of the variance explained in the model using the full data set (38.4%).
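The comparison of R² values can be sketched as follows. The function name is hypothetical, and the sketch assumes that "within 5%" means a difference of no more than 5 percentage points of explained variance (0.05 on the R² scale).

```python
def validation_r2_agrees(full_r2, validation_r2s, tolerance=0.05):
    """True when every validation R-squared is within tolerance of the full model."""
    return all(abs(r2 - full_r2) <= tolerance for r2 in validation_r2s)


# Full data set: 0.384; split samples: 0.400 and 0.365.
print(validation_r2_agrees(0.384, [0.400, 0.365]))   # True
```

The two differences are 1.6 and 1.9 percentage points, both well inside the 5-point band.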
Slide 80
Coefficients (a, b)
Model 1       B       Std. Error   Beta    t        Sig.   Tolerance   VIF
(Constant)    .663    .077                 8.603    .000
AGE           -.007   .001         -.628   -8.429   .000   .992        1.008
SEX           .024    .029         .062    .828     .410   .989        1.011
SEI           .000    .001         -.039   -.525    .601   .985        1.015
B and Std. Error: unstandardized coefficients. Beta: standardized coefficient. Tolerance and VIF: collinearity statistics.
a. Dependent Variable: LOGEARN
b. Selecting only cases for which SPLIT = .0000
RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 1
The relationship between "age" [age] and "log transformation of how many in family earned money" [logearn] was statistically significant for the model using the full data set (p<0.001). Similarly, the relationships in both of the validation analyses were statistically significant.
In the first validation analysis, the probability for the test of relationship between "age" [age] and "log transformation of how many in family earned money" [logearn] was <0.001, which was less than or equal to the level of significance of 0.05 and statistically significant.
Slide 81
Coefficients (a, b)
Model 1       B       Std. Error   Beta    t        Sig.   Tolerance   VIF
(Constant)    .595    .062                 9.552    .000
AGE           -.007   .001         -.598   -8.590   .000   .999        1.001
SEX           .022    .025         .063    .907     .366   .999        1.001
SEI           .001    .001         .076    1.098    .274   .999        1.001
B and Std. Error: unstandardized coefficients. Beta: standardized coefficient. Tolerance and VIF: collinearity statistics.
a. Dependent Variable: LOGEARN
b. Selecting only cases for which SPLIT = 1.0000
RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 2
In the second validation analysis, the probability for the test of relationship between "age" [age] and "log transformation of how many in family earned money" [logearn] was <0.001, which was less than or equal to the level of significance of 0.05 and statistically significant.
Slide 82
RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 3
The relationship between "respondent's socioeconomic index" [sei] and "log transformation of how many in family earned money" [logearn] was not statistically significant for the model using the full data set (p=0.724). Similarly, the relationships in both of the validation analyses were not statistically significant.
In the first validation analysis, the probability for the test of relationship between "respondent's socioeconomic index" [sei] and "log transformation of how many in family earned money" [logearn] was 0.601, which was greater than the level of significance of 0.05 and not statistically significant.
Slide 83
RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 4
In the second validation analysis, the probability for the test of relationship between "respondent's socioeconomic index" [sei] and "log transformation of how many in family earned money" [logearn] was 0.274, which was greater than the level of significance of 0.05 and not statistically significant.
Slide 84
RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 5
The relationship between "sex" [sex] and "log transformation of how many in family earned money" [logearn] was not statistically significant for the model using the full data set (p=0.200). Similarly, the relationships in both of the validation analyses were not statistically significant.
In the first validation analysis, the probability for the test of relationship between "sex" [sex] and "log transformation of how many in family earned money" [logearn] was 0.410, which was greater than the level of significance of 0.05 and not statistically significant.
Slide 85
RELATIONSHIP OF INDIVIDUAL INDEPENDENT VARIABLES TO DEPENDENT VARIABLE - 6
In the second validation analysis, the probability for the test of relationship between "sex" [sex] and "log transformation of how many in family earned money" [logearn] was 0.366, which was greater than the level of significance of 0.05 and not statistically significant.
The split sample validation supports the findings of the regression analysis using the full data set.
The answer to the original question is true.
Slide 86
Table of validation results: standard regression
                                          Full Data Set       Split = 0 (Split1 = 1)   Split = 1 (Split2 = 1)
ANOVA significance (sig <= 0.05)          <0.001              <0.001                   <0.001
R²                                        0.384               0.400                    0.365
Significant coefficients (sig <= 0.05)    Age of respondent   Age of respondent        Age of respondent
It may be helpful to create a table for our validation results and fill in its cells as we complete the analysis.
The split sample validation supports the findings of the regression analysis using the full data set.
Slide 87
Answering the problem question - 1
In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.05 for the regression analysis. Use a level of significance of 0.01 for evaluating assumptions. Use 0.0160 as the criterion for identifying influential cases. Validate the results of your regression analysis by splitting the sample in two, using 788035 as the random number seed.
The variables "age" [age], "sex" [sex], and "respondent's socioeconomic index" [sei] have a strong relationship to the variable "how many in family earned money" [earnrs].
Survey respondents who were older had fewer family members earning money. The variables sex and respondent's socioeconomic index did not have a relationship to how many in family earned money.
1. True 2. True with caution 3. False 4. Inappropriate application of a statistic
We have found that there is a statistically significant relationship between the set of IVs and the DV (p<0.001), and the Multiple R was 0.620, which would be characterized as a strong relationship.
Slide 88
Answering the problem question - 2
The b coefficient associated with age was statistically significant (p< 0.001), so there was an individual relationship to interpret.
The b coefficient (-0.007) was negative, indicating an inverse relationship in which higher numeric values for age are associated with lower numeric values for log transformation of how many in family earned money.
Therefore, the negative value of b implies that survey respondents who were older had fewer family members earning money.
Slide 89
Answering the problem question - 3
For the independent variable sex, the probability of the t statistic (1.284) for the b coefficient is 0.200 which is greater than the level of significance of 0.05. Sex did not have a relationship to the number of persons in the family earning money.
For the independent variable respondent's socioeconomic index, the probability of the t statistic (0.354) for the b coefficient is 0.724 which is greater than the level of significance of 0.05. Socioeconomic status did not have a relationship to the number of persons in the family earning money.
The answer to the question is true.
Slide 90
Steps in regression analysis: running the baseline model
The following is a guide to the decision process for answering problems about the complete regression analysis:
1. Dependent variable metric? Independent variables metric or dichotomous?
   No: Inappropriate application of a statistic.
   Yes: continue.
2. Run baseline regression, using method for including variables identified in the research question. Record R² for evaluation of transformations and removal of outliers and influential cases. Record Durbin-Watson statistic for assumption of independence of errors.
3. Ratio of cases to independent variables at least 5 to 1?
   No: Inappropriate application of a statistic.
   Yes: continue.
Slide 91
Steps in regression analysis: evaluating assumptions - 1
1. Is the dependent variable normally distributed?
   Yes: continue.
   No: Try 1. Logarithmic transformation, 2. Square root transformation, 3. Inverse transformation. If unsuccessful, add caution for violation of regression assumptions.
2. Metric IVs normally distributed and linearly related to DV?
   Yes: continue.
   No: Try 1. Logarithmic transformation, 2. Square root transformation, (3. Square transformation), 4. Inverse transformation. If unsuccessful, add caution for violation of regression assumptions.
Slide 92
Steps in regression analysis: evaluating assumptions - 2
1. DV is homoscedastic for categories of dichotomous IVs?
   Yes: continue.
   No: Add caution for violation of regression assumptions.
2. Residuals are independent, Durbin-Watson between 1.5 and 2.5?
   Yes: continue.
   No: Add caution for violation of regression assumptions.
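For reference, the Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. The sketch below computes it and applies the 1.5 to 2.5 rule of thumb; the function names and the toy residual lists are hypothetical, and SPSS reports the statistic directly, so this is only meant to show what the number measures.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squared residuals."""
    numerator = sum((residuals[i] - residuals[i - 1]) ** 2
                    for i in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


def errors_look_independent(dw, low=1.5, high=2.5):
    """Apply the rule-of-thumb acceptable range."""
    return low <= dw <= high


# Toy residuals that alternate in sign give a value above the acceptable range.
print(durbin_watson([1, -1, 1, -1]))      # 3.0
# Durbin-Watson values reported for the two validation samples.
print(errors_look_independent(1.839), errors_look_independent(2.117))
```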
Slide 93
Steps in regression analysis: evaluating outliers
1. Request statistics for detecting outliers and influential cases by running standard multiple regression using Enter method to include all variables and substituting transformed variables.
2. Univariate outliers (DV), multivariate outliers (IVs), or influential cases?
   No: continue.
   Yes: Remove outliers and influential cases from data set.
3. Ratio of cases to independent variables at least 5 to 1?
   Yes: continue.
   No: Restore outliers and influential cases to data set, add caution to findings.
Slide 94
Steps in regression analysis: picking regression model for interpretation
1. Were transformed variables substituted, or outliers and influential cases omitted?
   No: Pick baseline regression for interpretation.
   Yes: Evaluate impact of transformations and removal of outliers by running regression again, using method for including variables identified in the research question.
2. R² for evaluated regression greater than R² for baseline regression by 2% or more?
   Yes: Pick regression with transformations and omitting outliers for interpretation.
   No: Pick baseline regression for interpretation.
Slide 95
Steps in regression analysis: overall relationship is interpretable
1. Probability of ANOVA test of regression less than/equal to level of significance?
   No: False.
   Yes: continue.
2. Tolerance for all IVs greater than 0.10, indicating no multicollinearity?
   No: False.
   Yes: continue.
Slide 96
Steps in regression analysis: validation - 1
1. Enough valid cases to split sample and keep 5 to 1 ratio of cases/variables?
   Yes:
   • Set the random seed and compute the split variable
   • Re-run regression with split = 0
   • Re-run regression with split = 1
   No:
   • Set the first random seed and compute the split1 variable
   • Re-run regression with split1 = 1
   • Set the second random seed and compute the split2 variable
   • Re-run regression with split2 = 1
2. Probability of ANOVA test <= level of significance for both validation analyses?
   No: False.
   Yes: continue.
Slide 97
Steps in regression analysis: validation - 2
1. Pattern of significance for independent variables in both validations matches pattern for full data set?
   No: False.
   Yes: continue.
2. Change in R² statistically significant in both validation analyses? (Hierarchical only)
   No: False.
   Yes: continue.
3. R² for both validations within 5% of R² for analysis of full data set?
   No: False.
   Yes: continue.
Slide 98
Steps in regression analysis: answering the question
1. Assumptions not violated? Outliers/influential cases excluded from interpretation?
   No: True with caution.
   Yes: continue.
2. Satisfies ratio for preferred sample size: 15 to 1 (stepwise: 50 to 1)?
   No: True with caution.
   Yes: continue.
3. DV is interval level and IVs are interval level or dichotomous?
   No: True with caution.
   Yes: True.
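The final step of the decision process can be sketched as a function that collects any reasons for caution. This is one reading of the flowchart, not an official scoring rule: the function name, parameter names, and caution labels are hypothetical, and the flags passed in the example are assumed for illustration.

```python
def final_answer(assumptions_met, outliers_excluded, cases, predictors,
                 interval_level, stepwise=False):
    """Return 'True' or 'True with caution' plus the reasons for caution."""
    preferred_ratio = 50 if stepwise else 15
    cautions = []
    if not (assumptions_met and outliers_excluded):
        cautions.append("assumptions or outliers")
    if cases / predictors < preferred_ratio:
        cautions.append("preferred sample size")
    if not interval_level:
        cautions.append("level of measurement")
    answer = "True with caution" if cautions else "True"
    return answer, cautions


# 248 valid cases and 3 independent variables, as in the worked problem.
print(final_answer(True, True, 248, 3, True))
```

The function only applies once every earlier check has passed; a failure at any earlier step leads to "False" or "Inappropriate application of a statistic" instead.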
Slide 99
Interpreting the coefficients when the dependent variable is transformed
Slide 100
Coefficients (a)
Model 1       B       Std. Error   Beta    t         Sig.   Tolerance   VIF
(Constant)    .626    .048                 12.989    .000
AGE           -.007   .001         -.615   -12.237   .000   .999        1.001
SEX           .024    .019         .065    1.284     .200   .997        1.003
SEI           .000    .000         .018    .354      .724   .997        1.004
B and Std. Error: unstandardized coefficients. Beta: standardized coefficient. Tolerance and VIF: collinearity statistics.
a. Dependent Variable: LOGEARN
Interpreting b coefficients when the dependent variable is transformed - 1
For the independent variable age, the probability of the t statistic (-12.237) for the b coefficient is <0.001 which is less than or equal to the level of significance of 0.05. We reject the null hypothesis that the slope associated with age is equal to zero (b = 0) and conclude that there is a statistically significant relationship between age and log transformation of how many in family earned money.
The b coefficient associated with age (-0.007) is negative, indicating an inverse relationship in which higher numeric values for age are associated with lower numeric values for log transformation of how many in family earned money. Therefore, the negative value of b implies that survey respondents who were older had fewer family members earning money.
Slide 101
Interpreting b coefficients when the dependent variable is transformed - 2
If we want to interpret a specific change in number of earners for some amount of change in age, we will need to find the answer and convert it from log units to decimal units. We can use Microsoft Excel to calculate the answer.
In the worksheet, I have entered the b coefficient from the SPSS output (-0.007) in row 1.
In row 2, I have entered different ages, e.g. 20, 30, 40, and 50.
Slide 102
Interpreting b coefficients when the dependent variable is transformed - 3
On row 3, we multiply the value for the independent variable age by the b coefficient, which measures the contribution to the dependent variable in log units.
On row 4, we reverse the log transform back to decimal units by raising the number 10 to the power of the value on row 3.
The caret symbol is used by Excel for raise to a power, so 10^-0.14 is 10 raised to the -0.14 power, or 0.7244.
Slide 103
Interpreting b coefficients when the dependent variable is transformed - 4
Based on our table, a respondent at age 20 contributes 0.7244 to the number of earners in the family. If a respondent were 30, rather than 20, the contribution to number of earners would be 0.6166, a decrease of 0.1078. Thus, increasing age has a negative effect on the number of earners.
Note that as we go up in increments of 10, the difference between increments is decreasing. The logarithmic scale is not linear, requiring us to compute the change for any specific interval of interest.
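The worksheet calculation can also be sketched in Python. The variable names are hypothetical; the b coefficient (-0.007) and the ages (20, 30, 40, 50) are the ones entered in the worksheet, and the sketch assumes a base-10 log transformation, as the worksheet does.

```python
b_age = -0.007                 # b coefficient for age from the SPSS output
ages = [20, 30, 40, 50]

# Row 3 of the worksheet is b * age (log units); row 4 raises 10 to that power.
contributions = {age: 10 ** (b_age * age) for age in ages}

for age in ages:
    print(age, round(contributions[age], 4))

# Change in contribution for each 10-year step.
steps = [round(contributions[a] - contributions[a + 10], 4) for a in ages[:-1]]
print(steps)
```

The printed steps shrink as age increases, which is the nonlinearity the note above describes.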