STAT 360: Exam #1course1.winona.edu/cmalone/stat360/Exams/Final_F14… · Web viewThe following...

STAT 360: Take-Home Final, Fall 2014
Name(s): _________________________  ________________________
Points: 100 points

Consider the Fan Cost Index, which is used to track the cost of attendance for a family of four at major sporting events. According to their website, Team Marketing Report is the leading publisher of sports marketing and sponsorship information used daily by executives from collegiate and professional sports properties, sponsor companies, marketing agencies and media partners.

https://www.teammarketing.com/btSubscriptions/fancostindex/index

The calculation of the Fan Cost Index (FCI) is described on the above website and can be generally expressed as follows.

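\[
\text{FCI} = 4 \cdot \text{Seating} + \underbrace{4 \cdot \text{Pop} + 2 \cdot \text{Beer} + 4 \cdot \text{Hotdog}}_{\text{Food}} + 2 \cdot \text{Program} + \text{Parking} + 2 \cdot \text{Cap}
\]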
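A minimal sketch of the FCI calculation above, written in Python. The function name and the prices are hypothetical and are not taken from the Team Marketing Report table; substitute a team's actual prices to check a published FCI value.

```python
def fan_cost_index(seating, pop, beer, hotdog, program, parking, cap):
    """Fan Cost Index for a family of four, following the formula above."""
    # Food component: 4 soft drinks, 2 beers, 4 hot dogs.
    food = 4 * pop + 2 * beer + 4 * hotdog
    # 4 tickets, food, 2 programs, parking for one car, 2 caps.
    return 4 * seating + food + 2 * program + parking + 2 * cap


# Hypothetical prices in dollars (assumed for illustration only).
example_fci = fan_cost_index(
    seating=30.00, pop=4.00, beer=7.00, hotdog=5.00,
    program=5.00, parking=20.00, cap=15.00,
)
print(example_fci)
```

Keeping the Food component as its own intermediate value makes it easy to see what is missing from a dataset in which Food was not collected.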

1. Verify the calculation of FCI for the Boston Red Sox in the above table. (2 pts)

I have collected data on the Fan Cost Index for all teams in the MLB, NFL, and NHL leagues from their website. The “Food” component of the Fan Cost Index was not collected.


2. Consider the following mean function for Fan Cost Index (FCI).

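\[
E(\text{FCI} \mid \text{Seating}, \text{Program}, \text{Parking}, \text{Cap}) = \beta_0 + \beta_1 \cdot \text{Seating} + \beta_2 \cdot \text{Parking} + \beta_3 \cdot \text{Program} + \beta_4 \cdot \text{Cap}
\]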

a. Fit the above mean function in JMP. Paste a copy of the model output here. (3 pts)

b. Are the estimated regression coefficients from your regression model fairly close to the values we’d expect them to be considering how FCI is computed? Discuss. (3 pts)

c. Consider any one of the estimated regression coefficients; give a practical interpretation of this quantity. (2 pts)

d. Use the added variable plots (or, as JMP calls them, Leverage plots) to confirm the notion that Seating is the most important predictor. Copy and paste the added variable plot for Seating below. Explain how we know Seating is the most important predictor as compared to the other predictors. (3 pts)

Consider a second regression model that includes only Seating.

\[
E(\text{FCI} \mid \text{Seating}) = \beta_0 + \beta_1 \cdot \text{Seating}
\]

e. Does this model do a decent job of modeling Fan Cost Index? Discuss. (3 pts)

f. Compare the R² value from the model that includes all the predictors (i.e. full model) to the model with only Seating (i.e. reduced model). From this comparison, discuss whether or not these additional predictors are important in helping to understand the variability in Fan Cost Index. (2 pts)

g. The “Big Test” (see Handout #10) can be used to more formally evaluate the importance of these additional predictor variables. In particular, the “Big Test” can be used to directly compare the full and reduced models proposed above. Conduct the “Big Test” to simultaneously assess the statistical importance of these additional predictors, i.e. Parking, Program, and Cap. (4 pts)
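A minimal sketch of a full-versus-reduced model comparison, assuming the “Big Test” of Handout #10 is the usual partial F-test for nested models (the handout itself is not reproduced here, so this is an assumption). The function name and all numbers are hypothetical; the error sums of squares and error degrees of freedom would come from the two JMP fits.

```python
def partial_f_statistic(sse_reduced, df_reduced, sse_full, df_full):
    """F statistic comparing a reduced model nested inside a full model.

    sse_* are error (residual) sums of squares and df_* are the matching
    error degrees of freedom from each fitted model.
    """
    if df_reduced <= df_full:
        raise ValueError("the reduced model must have more error df than the full model")
    # Drop in SSE per extra predictor in the full model.
    numerator = (sse_reduced - sse_full) / (df_reduced - df_full)
    # Mean squared error of the full model.
    denominator = sse_full / df_full
    return numerator / denominator


# Hypothetical values (assumed), NOT computed from the Fan Cost Index data.
f_stat = partial_f_statistic(sse_reduced=5000.0, df_reduced=90,
                             sse_full=4000.0, df_full=87)
print(round(f_stat, 2))
```

Under this assumption the statistic would be compared to an F distribution with (df_reduced - df_full, df_full) degrees of freedom; a large value suggests that the extra predictors are jointly useful.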


h. One predictor that was included in the dataset was not used. In particular, Premium Seating was included in this dataset, but was not considered in the model proposed above. I have included the pairwise correlation matrix for our predictor variables here.

From this correlation matrix, I decide two things: i) I should not include both Seating and Premium Seating, and ii) I should include Seating over Premium Seating in my model. Carefully explain how I arrived at both of these conclusions. (2 pts)

3. Suppose I add the Baseball Indicator and Football Indicator variables to the original model fit above. The Fit Model dialog box and estimated regression coefficients are shown here for this model.

Fit Model Dialog Window Estimated Regression Coefficients

a. What is the effect of including the Baseball Indicator and Football Indicator predictors in this model? It may help to write out the model using coefficients like we did in class (see notes from Handout #11). (5 pts)

b. The p-value for Baseball Indicator is 0.1540; thus, this predictor does not appear to be useful. What are the ramifications of such an outcome on our model? Discuss. (3 pts)

c. Complete the “Big Test” to confirm that neither the Baseball Indicator predictor nor the Football Indicator predictor is necessary for this model. (4 pts)

d. Practically speaking, what does the outcome from the test run in part c. imply about Fan Cost Index for these three sports? (3 pts)


4. Consider the following study done in Japan that investigated the effects of sulfur dioxide (SO2) on the prevalence ratio of chronic bronchitis. The data from this study is provided below.

Comment: Observations 1 – 8 are from a non-polluted area of the city; this is evident as sulfur dioxide measurements are low for this area, i.e. less than 0.30. Observations 9 – 17 are from a polluted area of the city.

Data from study Scatterplot

It is known that the general population has a baseline prevalence ratio for chronic bronchitis. This baseline can be easily estimated by computing the average prevalence ratio using the data from the non-polluted area of the city. The researcher knows that low levels of sulfur dioxide will not increase this baseline level; however, at some point an increased level of sulfur dioxide will start to adversely affect the prevalence of chronic bronchitis. This point is called the threshold point and will be denoted by x0 here.

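Model Setup
o Response: Prevalence Ratio
o Predictor: SO2
o Threshold value: x0

Mean Functions

\[
E(\text{Prevalence Ratio} \mid SO_2) =
\begin{cases}
\text{Average}_{SO_2} & SO_2 \le x_0 \\
\beta_0 + \beta_1 \cdot SO_2 & SO_2 > x_0
\end{cases}
\]

Variance Functions

\[
\text{Var}(\text{Prevalence Ratio} \mid SO_2) =
\begin{cases}
\sigma_1^2 & SO_2 \le x_0 \\
\sigma_2^2 & SO_2 > x_0
\end{cases}
\]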

a. The first thing the researcher has requested is an estimate of the threshold value. The following process will provide such an estimate.

Finding an estimate of x0
Step 1: Estimate the average prevalence in the non-polluted area.
Step 2: Obtain the mean function for prevalence ratio in the polluted area.
Step 3: Set the value from Step 1 equal to the mean function from Step 2.

Solving for x in this equation provides an estimate of x0.
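The three steps above can be sketched as follows, assuming the polluted-area mean function is the straight line β0 + β1∗SO2 from the model setup. The function name and all numbers are hypothetical and are not taken from the study data; the intercept and slope would come from the regression fit to observations 9 – 17.

```python
from statistics import mean


def estimate_threshold(baseline_ratios, b0, b1):
    """Solve mean(baseline_ratios) = b0 + b1 * x for x."""
    if b1 == 0:
        raise ValueError("the slope must be nonzero to solve for the threshold")
    baseline = mean(baseline_ratios)   # Step 1: average in the non-polluted area
    # Steps 2 and 3: set baseline equal to b0 + b1 * x and solve for x.
    return (baseline - b0) / b1


# Hypothetical inputs (assumed for illustration only).
x0_hat = estimate_threshold([0.02, 0.03, 0.04], b0=-0.01, b1=0.05)
print(round(x0_hat, 3))
```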

What value do you obtain for the estimated threshold value? (3 pts)

x0 = ________

b. The following can be used to obtain the predicted prevalence ratio for our two part mean function.

\[
\text{Predicted} =
\begin{cases}
\text{Average}_{SO_2} & SO_2 \le x_0 \\
\beta_0 + \beta_1 \cdot SO_2 & SO_2 > x_0
\end{cases}
\]

Create the following plot of the two-part estimated mean function using the predicted values computed above. (2 pts)

c. Compute the R² for this two-part mean function. (2 pts)

R² = ________
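A minimal sketch of the two-part prediction and its R², assuming R² is computed as 1 − SSE/SST with SST taken about the overall mean of the response (the course handouts may define it equivalently in another form). Function names and data are hypothetical, not the study data.

```python
def predict_two_part(so2, baseline, b0, b1, x0):
    """Baseline average at or below the threshold, fitted line above it."""
    return baseline if so2 <= x0 else b0 + b1 * so2


def r_squared(actual, predicted):
    """R^2 = 1 - SSE/SST, with SST measured about the mean of `actual`."""
    y_bar = sum(actual) / len(actual)
    sse = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    sst = sum((y - y_bar) ** 2 for y in actual)
    return 1 - sse / sst


# Hypothetical data and coefficients (assumed for illustration only).
so2_values = [0.1, 0.2, 1.0, 2.0]
ratios = [1.0, 1.0, 2.0, 5.0]
predicted = [predict_two_part(x, baseline=1.0, b0=0.0, b1=2.0, x0=0.5)
             for x in so2_values]
r2 = r_squared(ratios, predicted)
print(round(r2, 4))
```

The same `r_squared` function can be reused for any other set of predicted values, which makes it straightforward to compare competing mean functions on one scale.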

d. Consider the following improvements that an experienced modeler, i.e. me or you, might attempt in order to improve the above model.

o Improvement #1: After careful consideration, you realize that the predicted values for data at SO2 = 0.9, 1.0, and 1.15 were computed using AverageSO2 because these observations were less than the threshold value. You decide to slide the threshold value to the left slightly, say to 0.75 to improve the prediction of these points.


\[
\text{Predicted} =
\begin{cases}
\text{Average}_{SO_2} & SO_2 \le 0.75 \\
\beta_0 + \beta_1 \cdot SO_2 & SO_2 > 0.75
\end{cases}
\]

o Improvement #2: Use a quadratic mean function and obtain the predicted values in the usual manner.

Mean Function:
\[
E(\text{Prevalence Ratio} \mid SO_2) = \beta_0 + \beta_1 \cdot SO_2 + \beta_2 \cdot SO_2^2
\]

Verify, via R², that the above suggestions are actually improvements. (4 pts)

Method of obtaining predicted values | R²
Two-part mean function using threshold value = x0 | ______
Two-part mean function using threshold value = 0.75 | ______
Using quadratic mean function | ______

e. Use the three sets of predicted values computed above to create the following plot via Graph > Overlay Plot. (2 pts)


f. You have worked hard to improve the predictions for this model. Unfortunately, the researcher is *not* impressed. In fact, she sarcastically laughs and refuses to consider your improvements. She encourages you to consider what SO2 level would minimize the prevalence ratio for each “improvement” method you suggest. She informs you that the smallest prevalence ratio should be occurring at very small SO2 values, i.e. close to 0.

i. Which SO2 value produces the smallest predicted prevalence ratio if improvement method #1 is used? Discuss. (2 pts)

ii. Which SO2 value produces the smallest predicted prevalence ratio if improvement method #2 is used? (3 pts)

This problem requires calculus, i.e. compute the derivative, set it equal to 0, and solve for SO2. I have computed the derivative below for those that have not had calculus.

JMP uses a centering option when fitting a quadratic mean function. This problem is easier if this is not done. I used R to obtain the estimated quadratic mean function (without centering).

Quadratic Mean Function (without JMP’s centering option): [estimated equation not recoverable from the source]

Derivative:
\[
\beta_1 + 2 \cdot \beta_2 \cdot SO_2 = -0.005 + 2 \cdot 0.00553 \cdot SO_2
\]
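For reference, the general calculus step described above (compute the derivative, set it equal to 0, solve for SO2) can be written out for any quadratic mean function. This is a sketch in terms of the symbols β1 and β2 only; the starred symbol is notation introduced here for the solution.

```latex
\[
\frac{d}{d\,SO_2}\left(\beta_0 + \beta_1 \cdot SO_2 + \beta_2 \cdot SO_2^2\right)
= \beta_1 + 2 \cdot \beta_2 \cdot SO_2 = 0
\quad\Longrightarrow\quad
SO_2^{*} = -\frac{\beta_1}{2 \cdot \beta_2}
\]
```

The solution is a minimum only when β2 > 0, since the second derivative of the quadratic is 2∗β2.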


5. For this problem, we will be modeling data from school districts in Texas. The data was obtained from the Texas Education Agency.

Source: http://ritter.tea.state.tx.us/perfreport/snapshot/

Consider the following:

Response: 4 Year Graduation Rate
Potential predictor variables: Identified in the table below.
Usual form for the mean and variance function.
o \( E(\text{4YearGradRate} \mid \text{????}) = \beta_0 + \dots \)
o \( \text{Var}(\text{4YearGradRate} \mid \text{????}) = \sigma^2 \)

Possible predictor variables that will be considered initially are listed here. We will ignore the demographic variables for now.

Variable | Role
DistrictName | Demographic
CountyNumberName | Demographic
Region | Demographic
DistictSize | Demographic
CommunityType | Demographic
WealthIndictor | Demographic
TAXRATE | Demographic
AccountabilityRating | Demographic
%AfricianAmerican Students | Predictor
%Hispanic Students | Predictor
%White Students | Predictor
%AmericanIndian Students | Predictor
%Asian Students | Predictor
%EconomicDisadvantaged Students | Predictor
%EnglishLanguageLearners Students | Predictor
%SpecialEd Students | Predictor
%Billingual/ESL Students | Predictor
%CareerTechnical Students | Predictor
%GiftedTalented Students | Predictor
AttendenceRate | Predictor
DropOutRate | Predictor
4YearGradRate | Response
STAAR%Level2orHigherALL (Standardized tests) | Predictor
%CollegeAdmitted | Predictor
SATAverage | Predictor
ACTAverage | Predictor
SalaryTeacherAvg | Predictor
StudentsPerTeacher | Predictor
Teacher%5orFewerYrsExp | Predictor
TeacherAvgYearExp | Predictor
TeacherTurnoverRate | Predictor
RevenuePerPupil | Predictor
ExpendituresActualPerPupil | Predictor
Expenditures%Athletics | Predictor


a. In JMP, use Analyze > Multivariate Methods and obtain a correlation matrix that includes the response and all the potential predictors identified above.

Identify on the table below which predictor variables, if any, will be removed due to concerns of multicollinearity. (5 pts)

Variable | Role | Removed because of Multicollinearity? (Yes / No)
DistrictName | Demographic | ______
CountyNumberName | Demographic | ______
Region | Demographic | ______
DistictSize | Demographic | ______
CommunityType | Demographic | ______
WealthIndictor | Demographic | ______
TAXRATE | Demographic | ______
AccountabilityRating | Demographic | ______
%AfricianAmerican | Predictor | ______
%Hispanic | Predictor | ______
%White | Predictor | ______
%AmericanIndian | Predictor | ______
%Asian | Predictor | ______
%EconomicDisadvantaged | Predictor | ______
%EnglishLanguageLearners | Predictor | ______
%SpecialEd | Predictor | ______
%Billingual/ESL | Predictor | ______
%CareerTechnical | Predictor | ______
%GiftedTalented | Predictor | ______
AttendenceRate | Predictor | ______
DropOutRate | Predictor | ______
4YearGradRate | Response | ______
STAAR%Level2orHigherALL | Predictor | ______
%CollegeAdmitted | Predictor | ______
SATAverage | Predictor | ______
ACTAverage | Predictor | ______
SalaryTeacherAvg | Predictor | ______
StudentsPerTeacher | Predictor | ______
Teacher%5orFewerYrsExp | Predictor | ______
TeacherAvgYearExp | Predictor | ______
TeacherTurnoverRate | Predictor | ______
RevenuePerPupil | Predictor | ______
ExpendituresActualPerPupil | Predictor | ______
Expenditures%Athletics | Predictor | ______

b. Use the Minimum AICc criterion with Forward selection to obtain a preliminary list of the necessary predictor variables. Provide a screen capture of the Step History portion of the output below. Also, in the table below, identify the preliminary importance of each predictor variable via this approach. (5 pts)

c. Use the P-value Threshold criterion with Mixed (or Stepwise) selection. Use 0.25 for Prob to Enter and 0.10 for Prob to Leave. Once again, provide a screen capture of the Step History portion of the output and again identify the preliminary importance of each predictor variable via this approach in the table below. (5 pts)


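Variable | Role | Was this predictor in your final model when Minimum AICc was used? | Was this predictor in your final model when P-Value Threshold was used?
%AfricianAmerican | Predictor | ______ | ______
%Hispanic | Predictor | ______ | ______
%White | Predictor | ______ | ______
%AmericanIndian | Predictor | ______ | ______
%Asian | Predictor | ______ | ______
%EconomicDisadvantaged | Predictor | ______ | ______
%EnglishLanguageLearners | Predictor | ______ | ______
%SpecialEd | Predictor | ______ | ______
%Billingual/ESL | Predictor | ______ | ______
%CareerTechnical | Predictor | ______ | ______
%GiftedTalented | Predictor | ______ | ______
AttendenceRate | Predictor | ______ | ______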








d. Discuss any differences between the Minimum AICc – Forward and the P-Value Threshold – Mixed selection procedures. (3 pts)


e. Decide on a final model. Identify the predictors you are going to use in your final model in the table below.

Variable | Role | Identify which predictor variables will be used in my final model.
%AfricianAmerican | Predictor | ______
%Hispanic | Predictor | ______
%White | Predictor | ______
%AmericanIndian | Predictor | ______
%Asian | Predictor | ______
%EconomicDisadvantaged | Predictor | ______
%EnglishLanguageLearners | Predictor | ______
%SpecialEd | Predictor | ______
%Billingual/ESL | Predictor | ______
%CareerTechnical | Predictor | ______
%GiftedTalented | Predictor | ______
AttendenceRate | Predictor | ______








f. Obtain the regression output for your final model. Does this model do a decent job of predicting 4 year graduation rate? Discuss. (3 pts)

g. Use the added variable plots (or, as JMP calls them, Leverage plots) to identify the most important predictor variable in your model. Copy the added variable plot below. Discuss. (2 pts)


h. Consider the information regarding Cook’s Distance from Handout #9. Cook’s Distance can be easily obtained in JMP by selecting Save Columns > Cook’s D Influence from the red drop-down menu.

Cook’s Distance
Cook (1977) developed a measure of the effect of an individual observation by combining the magnitude of the internally studentized residual with the magnitude of leverage for this observation. This statistic is simply referred to as Cook’s Distance or Cook’s D.

\[
\text{Cook's } D_i = \underbrace{\frac{r_i^2}{k+1}}_{\text{residual}} \cdot \underbrace{\frac{h_i}{1-h_i}}_{\text{leverage}}
\]

where
r_i = internally studentized residual
h_i = leverage
k = number of predictors in model

Suggested Rules for Cook’s Distance
o An observation whose Cook’s Distance is substantially larger than others should be investigated.
o Cook suggests it is always important to investigate observations whose Cook’s D > 1.
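A minimal sketch of the Cook’s Distance formula above, useful for checking a value that JMP reports. The function name and the inputs are hypothetical; in practice the studentized residual and leverage would be saved columns from the fitted model.

```python
def cooks_distance(r, h, k):
    """Cook's D from an internally studentized residual r, leverage h,
    and k predictors in the model."""
    if not 0 <= h < 1:
        raise ValueError("leverage must satisfy 0 <= h < 1")
    residual_part = r ** 2 / (k + 1)
    leverage_part = h / (1 - h)
    return residual_part * leverage_part


# Hypothetical observation (assumed): r = 2.0, h = 0.5, k = 3 predictors.
d_example = cooks_distance(r=2.0, h=0.5, k=3)
print(d_example)
```

The product form shows why a point needs both a sizable residual and sizable leverage to be influential: if either part is near zero, so is Cook’s D.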

Remove any observations that have a Cook’s Distance larger than 1 from your dataset. Refit the model. Provide the updated regression output here. In what ways is this model different than your final model you obtained above? (3 pts)

i. Next, consider an investigation of the studentized residuals for your model. These can be easily obtained by selecting the red drop-down menu and selecting Save Columns > Studentized Residuals.

\[
\text{Studentized Residual}_i = \frac{e_i - 0}{\sqrt{\hat{\sigma}^2 \cdot (1 - h_{ii})}}
\]

Any observation whose studentized residual is less than -2.0 or larger than 2.0 would indicate a poor prediction. In particular, a studentized residual less than -2.0 would indicate a substantial over-prediction and a studentized residual more than 2.0 would indicate a substantial under-prediction for 4 year graduation rate by the model. Identify all school districts (using their name) for the following situations. (5 pts)

School districts for which the model substantially over predicted:

School districts for which the model substantially under predicted:
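The studentized residual formula and the ±2.0 rule described above can be sketched as follows, assuming the residual is defined as actual minus predicted (so a negative residual means the model predicted too high). Function names, labels, and numbers are hypothetical.

```python
import math


def studentized_residual(residual, sigma2_hat, leverage):
    """(e_i - 0) / sqrt(sigma2_hat * (1 - h_ii))."""
    return (residual - 0) / math.sqrt(sigma2_hat * (1 - leverage))


def classify_prediction(student_r, cutoff=2.0):
    """Label an observation using the +/- 2.0 rule described above."""
    if student_r < -cutoff:
        return "over-predicted"
    if student_r > cutoff:
        return "under-predicted"
    return "within range"


# Hypothetical district (assumed): residual -6.0, sigma^2 estimate 4.0, leverage 0.19.
r_example = studentized_residual(residual=-6.0, sigma2_hat=4.0, leverage=0.19)
print(round(r_example, 2), classify_prediction(r_example))
```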


j. Create a plot that has the predicted 4 year graduation rate on the x-axis and the actual 4 year graduation rate on the y-axis. This provides a visualization of the R² value from our model. Add the y = x line to this plot for reference.

Does the model appear to be performing well? Discuss. (3 pts)


6. Consider again the cross-validation procedures discussed in Handout #18. For this problem we will again use the Grandfather_Clocks dataset as was done in this handout.

Recall, the following visualization for the Train / Test Cross-Validation, i.e. the simple 2-fold cross-validation approach discussed in Handout #18.

Train / Test, i.e. 2-Fold, Cross-Validation
Divide data into 2 folds or 2 parts; Training dataset is used to build model; Test dataset is used to obtain RMSE to measure predictive ability.

The following describes the k-fold cross-validation approach. The Grandfather_Clocks dataset has 32 observations, so an 8-fold cross-validation will be used. Thus, each fold will consist of 4 observations.

Iteration | k-Fold Approach
0 | Create k unique subsets of the data
1 | Folds 1, 2, 3, 4, 5, 6, 7, __ are used to create Training dataset; Fold 8 is used as Test dataset; Obtain RMSE using observations in Fold 8.
2 | Folds 1, 2, 3, 4, 5, 6, __ , 8 are used to create Training dataset; Fold 7 is used as Test dataset; Obtain RMSE using observations in Fold 7.
3-7 | Repeat process for Folds 3 - 7
8 | Folds __, 2, 3, 4, 5, 6, 7, 8 are used to create Training dataset; Fold 1 is used as Test dataset; Obtain RMSE using observations in Fold 1.
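The k-fold procedure in the table above can be sketched in Python as follows. This is an illustration only and is not the R code used in this exam: the data are synthetic, the model is a one-predictor least-squares line, RMSE is taken as the square root of the mean squared prediction error on the Test fold, and every name is hypothetical.

```python
import math
import random


def make_folds(n_obs, k):
    """Fold labels in blocks: 1,1,1,1,2,2,2,2,... (n_obs must divide evenly by k)."""
    if n_obs % k != 0:
        raise ValueError("n_obs must be a multiple of k")
    size = n_obs // k
    return [i // size + 1 for i in range(n_obs)]


def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def rmse(actual, predicted):
    """Root mean squared prediction error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def k_fold_rmse(x, y, folds):
    """Predictive RMSE for each fold when that fold is held out as the Test dataset."""
    results = {}
    for test_fold in sorted(set(folds), reverse=True):   # Fold 8 first, as in the table
        train = [i for i, f in enumerate(folds) if f != test_fold]
        test = [i for i, f in enumerate(folds) if f == test_fold]
        b0, b1 = fit_line([x[i] for i in train], [y[i] for i in train])
        preds = [b0 + b1 * x[i] for i in test]
        results[test_fold] = rmse([y[i] for i in test], preds)
    return results


rng = random.Random(360)                        # fixed seed so the sketch is repeatable
folds = make_folds(32, 8)                       # 32 observations, 8 folds of 4
random_folds = rng.sample(folds, len(folds))    # same labels, shuffled order

# Synthetic data (assumed), standing in for a response and one predictor.
x = [float(i) for i in range(32)]
y = [100.0 + 10.0 * xi + rng.gauss(0, 5) for xi in x]

per_fold = k_fold_rmse(x, y, random_folds)
average_rmse = sum(per_fold.values()) / len(per_fold)
print(round(average_rmse, 2))
```

Shuffling the fold labels before use keeps each fold from being a block of adjacent rows, which matters when the rows of a dataset are sorted in some meaningful order.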

a. Type the following into R.

What is the difference between folds and random.folds? Discuss (3 pts)


b. Suppose the name of the data frame that contains the Grandfather Clock dataset is Grandfather_Clocks. Type the following in R (or something similar if your data frame has a different name).

What is this doing to the Grandfather_Clocks dataset? Discuss (3 pts).

c. The following sequence of commands will obtain the RMSE using Fold #1 as the Test dataset.

What is the predictive RMSE when Fold #1 is used for the Test dataset? (2 pts)

Predictive RMSE: _____________

d. Repeat part c for each of the 8 folds. Enter the predictive RMSE from each Fold in the table below. (4 pts)

Fold used for Test dataset | Predictive RMSE
1 | ______
2 | ______
3 | ______
4 | ______
5 | ______
6 | ______
7 | ______
8 | ______

e. Obtain the average predictive RMSE value for the k=8 fold cross-validation approach used here.

Average Predictive RMSE: ______________

f. The RMSE from the full model (when no cross-validation is used) is 133.1. Is the average predictive RMSE larger than 133.1? Why would we expect this to be the case? Discuss. (2 pts)

