Variable Selection Methods With Applications to Crime Prediction

Transcript of Variable Selection Methods

Page 1: Variable Selection Methods

Variable Selection Methods With Applications to Crime Prediction

Page 2: Variable Selection Methods

Goals

There are several variable selection techniques presented in the literature and coursework; however, there is no agreed-upon "best" method.

In my thesis, I apply variable selection techniques to a 1990 United States data set to determine:

• which set of socio-economic and demographic factors is most predictive of violent crime, and
• which model selection technique produces the best out-of-sample predictions.

Variable selection criteria: AIC, MSE, R², and adjusted R² (aR²).
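For reference, one standard set of definitions for these criteria in a Gaussian linear model with n observations, p predictors, residual sum of squares RSS, and total sum of squares TSS (the AIC form below is a common convention; the additive constant varies by software):

$$
\mathrm{MSE}=\frac{\mathrm{RSS}}{n},\qquad
R^2=1-\frac{\mathrm{RSS}}{\mathrm{TSS}},\qquad
aR^2=1-\frac{(1-R^2)(n-1)}{n-p-1},\qquad
\mathrm{AIC}=n\log\!\left(\frac{\mathrm{RSS}}{n}\right)+2(p+1).
$$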

Page 3: Variable Selection Methods
Page 4: Variable Selection Methods

The Crime Data Set

The data set is the 1990 United States crime database from the UC Irvine Machine Learning Repository. The original data set contains 147 variables and 2215 observations (cities), including unrelated attributes such as county code.

The cleaned data set has 95 variables and 807 observations.

The response variable Y, violent crime, is the number of violent crimes in a community measured per capita (per 100,000 population).

The predictor matrix X includes income, family structure, ethnicity, number of children, education, percentage of households with two parents, etc.
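As a rough illustration of this cleaning step, the sketch below assumes the data sit in a CSV with a header row and "?" marking missing values; the file and identifier column names here are assumptions, not necessarily those in the repository file.

```r
# Minimal sketch of the cleaning step (assumed file and column names).
crime <- read.csv("communities_and_crime.csv", na.strings = "?")

# Drop identifier attributes unrelated to prediction (e.g., county code),
# then keep only the complete cases.
id_cols <- c("communityname", "countyCode", "state")
crime   <- crime[, setdiff(names(crime), id_cols)]
crime   <- crime[complete.cases(crime), ]
dim(crime)  # the thesis's cleaned data: 807 rows, 95 columns
```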

Page 5: Variable Selection Methods

Variable Selection Methods

The cleaned data was split into 75% training and 25% test sets. The following models were fitted on the training data set and validated on the test data set: forward stepwise, backward stepwise, both-direction stepwise (simultaneously forward and backward), lasso, elastic net, PCR, decision tree, tree with pruning, bagging, random forest, and three newly developed techniques.

Newly developed techniques (a consensus-counting sketch follows this list):

• Best Random Subset Selection (BRSS)
  – approximates best subset selection
  – limits overfitting
  – reduces computation time
• Count 4 & Count 5 consensus models: constructed using predictors that are selected at least 4 or 5 times, respectively, by the other models.
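A minimal sketch of the Count-4/Count-5 idea (not the thesis code; the model list and predictor names below are toy inputs for illustration):

```r
# Each element names the predictors one fitted model selected (toy input).
selected <- list(
  forward  = c("racepctblack", "PctKids2Par", "MalePctDivorce"),
  backward = c("racepctblack", "PctKids2Par", "pctWRetire"),
  bothFB   = c("racepctblack", "PctKids2Par"),
  lasso    = c("racepctblack", "PctKids2Par", "PctPersDenseHous"),
  elastic  = c("racepctblack", "PctKids2Par", "PctPersDenseHous")
)

# Count how many models chose each predictor, then keep those chosen
# at least 4 (or 5) times and refit OLS on the consensus set.
counts <- table(unlist(selected))
count4 <- names(counts)[counts >= 4]
count5 <- names(counts)[counts >= 5]
# fit4 <- lm(reformulate(count4, "ViolentCrimesPerPop"), data = train)
```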

Page 6: Variable Selection Methods

BRSS Model

Let p = the number of predictors in a model, and N = the number of randomly selected models with p predictors.

For each p, as N → ∞, we successfully approximate the best subset of size p. If we limit the maximum number of predictors p to Q, where Q is much less than 94, we can reduce bias and the risk of overfitting our data.

We will use the plot of p versus training aR² to help us determine the number of predictors we should use in the BRSS model.

Page 7: Variable Selection Methods

The cleaned data has 807 observations.

Typically, we take N = 1000. Later, we will show how the BRSS model performs for various values of N and p.

Page 8: Variable Selection Methods
Page 9: Variable Selection Methods
Page 10: Variable Selection Methods

BRSS Procedure for N = 1000, p = 20

1. Randomly split the cleaned data into 75% training and 25% test sets.
2. For each value of p = 1, ..., 20, fit N = 1000 models with p randomly selected predictors on the training data set, and pick the model with the maximum training R² for each p. At the end of this step, there are 20 best models, one for each value of p.
3. Among the 20 best models, pick the overall best using the training aR² of each model.
4. Cross-validate the selected overall best model using the test data set.

A code sketch of this procedure follows.
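A minimal sketch of the procedure, assuming `train`/`test` data frames with response `ViolentCrimesPerPop` (the response name and settings are assumptions, not the thesis code):

```r
set.seed(1)
preds <- setdiff(names(train), "ViolentCrimesPerPop")
N <- 1000; Q <- 20
best_per_p <- vector("list", Q)

for (p in 1:Q) {
  best_r2 <- -Inf
  for (i in 1:N) {
    vars <- sample(preds, p)  # a random subset of p predictors
    fit  <- lm(reformulate(vars, "ViolentCrimesPerPop"), data = train)
    r2   <- summary(fit)$r.squared
    if (r2 > best_r2) { best_r2 <- r2; best_per_p[[p]] <- fit }
  }
}

# Pick the overall best model by training adjusted R-squared, then
# check it against the held-out test set.
ar2  <- sapply(best_per_p, function(f) summary(f)$adj.r.squared)
best <- best_per_p[[which.max(ar2)]]
mean((test$ViolentCrimesPerPop - predict(best, newdata = test))^2)  # test MSE
```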

Page 11: Variable Selection Methods

BRSS Results with 15 Predictors

                       Estimate  Std. Error  t value  Pr(>|t|)
PctRecImmig5           -64.97      18.27     -3.557   0.000406
racepctblack             6.762      1.81      3.736   0.000205
agePct65up              21.28       5.752     3.7     0.000235
PctEmploy                4.878      3.527     1.383   0.167173
PctForeignBorn          24.93       5.283     4.719   2.96e-06
PctPersOwnOccup          0.4028     2.041     0.197   0.843662
NumKidsBornNeverMar      0.0068     0.00196   3.453   0.000595
MedYrHousBuilt           3.922      1.768     2.218   0.02691
PctPersDenseHous        24.15      10.61      2.276   0.023177
PctLess9thGrade        -13.28       5.117    -2.596   0.009665
MedOwnCostPctIncNoMtg  -38.17      12.52     -3.049   0.002399
PctHousNoPhone          30.94       7.571     4.087   4.98e-05
PctLargHouseFam         -6.362     12.28     -0.518   0.604464
PctKidsBornNeverMar     96.95      10.83      8.954   < 2e-16
MalePctDivorce          37.94       8.042     4.717   2.99e-06

Number of predictors: 15
Training aR²: 0.6836   Training R²: 0.6914   Training AIC: 8876.299
Test aR²: 0.6486       Test R²: 0.6749       Test AIC: 3030.488

Page 12: Variable Selection Methods

Results of Various BRSS Models

M = 10, N = 200, p = 24
            Min.    1st Qu.  Median  Mean    3rd Qu.  Max.
train aR²   0.6481  0.6587   0.6718  0.6705  0.6818   0.6958
test aR²    0.6018  0.6219   0.6496  0.6554  0.6906   0.7104
train MSE   126000  134600   139200  138700  142600   151900
test MSE    140500  157700   171900  177100  198500   225000

M = 10, N = 200, p = 20
            Min.    1st Qu.  Median  Mean    3rd Qu.  Max.
train aR²   0.6559  0.6673   0.6739  0.674   0.6837   0.6913
test aR²    0.6009  0.6351   0.6463  0.6466  0.6571   0.6921
train MSE   128400  134400   138300  138200  142600   145600
test MSE    142500  156100   187000  179800  193900   232300

M = 10, N = 200, p = 15
            Min.    1st Qu.  Median  Mean    3rd Qu.  Max.
train aR²   0.6516  0.658    0.663   0.6651  0.671    0.6841
test aR²    0.5852  0.6302   0.6525  0.6443  0.6645   0.6835
train MSE   141600  143800   149100  148000  149600   158500
test MSE    134500  145400   155300  155900  167700   179000

M = 10, N = 200, p = 8
            Min.    1st Qu.  Median  Mean    3rd Qu.  Max.
train aR²   0.6209  0.6375   0.6476  0.6482  0.6564   0.6793
test aR²    0.5685  0.5976   0.6321  0.6238  0.6397   0.6762
train MSE   129200  150400   154700  152400  159300   162600
test MSE    141800  171200   180500  179400  192200   215800

Page 13: Variable Selection Methods

Results: Count 4 and Count 5

Predictor              Count 4    Count 5
HousVacant               0.01467
MalePctDivorce          27.38
MedOwnCostPctIncNoMtg  -42.5      -42.147
NumKidsBornNeverMar     -0.00372
PctEmplManu             -6.866    -10.53
PctEmplProfServ         -6.353    -10.72
PctForeignBorn           8.862
PctKids2Par            -11.97     -19.391
PctKidsBornNeverMar     64.78      66.911
PctPersDenseHous         8.902     11.772
PctVacantBoarded         6.327
PctWorkMom              -7.375
pctWRetire              -3.354     -3.812
racepctblack            12.94       6.436
racePctWhite             4.724

Number of Variables:    15          8
Training aR²:            0.6738     0.6547
Training R²:             0.6819     0.6593
Model AIC:            8894.692   8922.207
Test aR²:                0.6697     0.6465
Test R²:                 0.6944     0.6606
Test AIC:             3017.975   3025.162

Page 14: Variable Selection Methods

Results of Fitting Each Model Once

            Forward  Backward  bothFB  lasso   Elastic  BRSS    Count4  Count5
# of Vars.  21       39        23      28      27       15      15      8
Train aR²   0.7092   0.7281    0.7154  0.652   0.659    0.6836  0.6738  0.6547
Train R²    0.7193   0.7456    0.7263  0.6526  0.6595   0.6914  0.6819  0.6593
Train AIC   8881     8857      8870    8970    8958     8876    8895    8922
Test aR²    0.5959   0.5234    0.5927  0.5545  0.5634   0.6486  0.6697  0.6465
Test R²     0.5979   0.5258    0.5947  0.5567  0.5655   0.6749  0.6944  0.6606
Test AIC    2996     3030      2998    3016    3012     3030    3018    3025

Page 15: Variable Selection Methods

Regression Tree: Tree with Pruning

Main advantage: easy to interpret.

Example: the predicted violent crime rate for communities with PctKids2Par < 55.39% and PctUsePubTrans < 15.935% is 1733 per 100,000.
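A minimal sketch of fitting and pruning such a tree, using the rpart package as one possible implementation (the thesis's actual package and tuning are not shown in the transcript):

```r
library(rpart)

# Grow a full regression tree, then prune it back to the subtree whose
# complexity parameter (cp) minimizes the cross-validated error.
fit     <- rpart(ViolentCrimesPerPop ~ ., data = train, method = "anova")
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)

plot(pruned); text(pruned)  # leaves give predicted crime rates per 100,000
```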

Page 16: Variable Selection Methods

Regression Tree Models

Predictor             Full Tree  Tree Prune  Bag*     Random Forest**
population            x
pctWInvInc            x
PctUnemployed         x
PctKids2Par           x          x
NumKidsBornNeverMar   x          x
PctKidsBornNeverMar   x          x
PctSpeakEnglOnly      x
PctVacantBoarded      x
PctUsePubTrans        x          x

# of Trees            1          1           500      500
# of Variables:       9          4           94       94
Model R²:             0.75689    0.6527      0.93858  0.94564
Model MSE:            111479     159253      28164.7  24927.9
# terminal nodes:     12         5
Test R²:              0.29573    0.41268     0.55822  0.59348
Test MSE:             275471     229728      172800   159009

* number of variables tried at each split: 13
** number of variables tried at each split: 6
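A minimal sketch of the bagging and random forest fits using the randomForest package (an assumed implementation; the mtry values follow the footnotes above, though textbook bagging would try all 94 predictors at each split):

```r
library(randomForest)

set.seed(1)
bag <- randomForest(ViolentCrimesPerPop ~ ., data = train,
                    ntree = 500, mtry = 13)  # 13 variables tried per split (*)
rf  <- randomForest(ViolentCrimesPerPop ~ ., data = train,
                    ntree = 500, mtry = 6)   # 6 variables tried per split (**)

# Held-out test MSE for each ensemble.
mean((test$ViolentCrimesPerPop - predict(bag, test))^2)
mean((test$ViolentCrimesPerPop - predict(rf,  test))^2)
```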

Page 17: Variable Selection Methods

Principal Component Regression

Goal 1: Like the other models, use PCR to make predictions.

Goal 2: Model complexity versus error: how do the training and test MSE and R² change as the number of components used in training the model varies?
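A minimal sketch of this comparison using the pls package (one possible implementation; the response name and component range are assumptions):

```r
library(pls)

# Fit PCR once with up to 50 components, then score each truncation.
fit <- pcr(ViolentCrimesPerPop ~ ., data = train, scale = TRUE, ncomp = 50)
mse <- function(y, yhat) mean((y - yhat)^2)

train_mse <- sapply(1:50, function(k)
  mse(train$ViolentCrimesPerPop, predict(fit, train, ncomp = k)))
test_mse  <- sapply(1:50, function(k)
  mse(test$ViolentCrimesPerPop,  predict(fit, test,  ncomp = k)))

# Training error keeps falling as components are added; test error
# flattens or rises once extra components start fitting noise.
matplot(1:50, cbind(train_mse, test_mse), type = "l",
        xlab = "number of components", ylab = "MSE")
```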

Page 18: Variable Selection Methods

PCR: Model Complexity vs. Accuracy

Page 19: Variable Selection Methods

Model Performance: Results of Fitting Each Model Once

Dashed line marks the BRSS model.

Page 20: Variable Selection Methods

Simulation Study of 1000 Training and Test Splits

Repeat this process 1000 times. Models: forward, backward, bothFB, lasso, elastic net, BRSS, decision tree, tree with pruning, bagging, and random forest. (A sketch of the repeated-split loop appears below.)

BRSS uses at most 20 predictors per model (shown as BRSS* in the tables that follow), where the other models search over all 94 candidate predictors.
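A minimal sketch of the repeated-split loop, shown for lasso only (the response name is an assumption; the other models would be scored inside the same loop):

```r
library(glmnet)

y <- crime$ViolentCrimesPerPop
x <- as.matrix(crime[, setdiff(names(crime), "ViolentCrimesPerPop")])

R <- 1000
test_r2 <- numeric(R)
for (r in 1:R) {
  idx   <- sample(nrow(x), floor(0.75 * nrow(x)))  # 75/25 split
  cvfit <- cv.glmnet(x[idx, ], y[idx], alpha = 1)  # lasso
  yhat  <- predict(cvfit, x[-idx, ], s = "lambda.min")
  test_r2[r] <- 1 - sum((y[-idx] - yhat)^2) /
                    sum((y[-idx] - mean(y[-idx]))^2)
}
summary(test_r2)  # compare with the quantile tables on the following pages
```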

Page 21: Variable Selection Methods

• All three stepwise methods (forward, backward, bothFB) have similar performance.

• They are consistently over-optimistic: training R² exceeds test R².

Page 22: Variable Selection Methods

• This group of models performs better than the stepwise models.

Page 23: Variable Selection Methods
Page 24: Variable Selection Methods

R² for 1000 Training/Test Splits

train R²   Min.    1st Qu.  Median  Mean    3rd Qu.  Max.    sd
bothFB     0.6531  0.6909   0.7014  0.7013  0.7116   0.7466  0.0153
forward    0.6535  0.692    0.7029  0.7029  0.7136   0.7457  0.0154
backward   0.6749  0.7131   0.7225  0.7224  0.732    0.7646  0.0145
lasso      0.5299  0.6055   0.6214  0.6206  0.6365   0.7065  0.0241
elastic    0.4941  0.6048   0.6221  0.6202  0.6387   0.6983  0.0267
BRSS*      0.6171  0.6625   0.6719  0.6712  0.6812   0.7136  0.0142
tree       0.6894  0.7317   0.7425  0.7428  0.7542   0.8023  0.0175
prune      0.4994  0.5517   0.5656  0.5665  0.5803   0.6337  0.0216
bag        0.9364  0.9431   0.9449  0.9449  0.9467   0.9539  0.0027
rf         0.9345  0.9411   0.9429  0.9429  0.9447   0.9508  0.0026

test R²    Min.    1st Qu.  Median  Mean    3rd Qu.  Max.    sd
bothFB     0.1177  0.5796   0.6228  0.6115  0.6575   0.7452  0.0678
forward    0.1893  0.5794   0.6227  0.6121  0.6573   0.7439  0.0666
backward   0.0285  0.5542   0.6019  0.5891  0.6371   0.7345  0.0733
lasso      0.4655  0.5796   0.6078  0.6071  0.6368   0.7338  0.0411
elastic    0.4679  0.5793   0.607   0.6054  0.6331   0.7295  0.0402
BRSS*      0.5304  0.6295   0.6591  0.6596  0.6925   0.8024  0.0457
tree       0.2957  0.6093   0.6529  0.6475  0.6951   0.8111  0.0665
prune      0.3171  0.5245   0.5694  0.5617  0.6086   0.7132  0.0670
bag        0.4276  0.6158   0.6477  0.6427  0.6726   0.7597  0.0448
rf         0.5164  0.6364   0.6610  0.6582  0.6842   0.7610  0.0380

Page 25: Variable Selection Methods

Results for 1000 Splits of the Cleaned Data. Dashed line marks the BRSS model.

Page 26: Variable Selection Methods

Model Performance: Average of 1000 R² Values

Dashed line marks the BRSS model. Question: which model is the best?

Page 27: Variable Selection Methods

Conclusion: The Best Model

Commonly used variable selection methods can suffer from overfitting; the worst offenders were random forest and bagging. Stepwise procedures are not quite as bad but still suffer from overfitting.

Methods that consistently did not overfit the data included lasso, elastic net, tree with pruning, and BRSS. BRSS tended to give the best predictions on average: it had the highest test R² and was unbiased in its assessment of model fit to external data.

BRSS also selects fewer variables, which makes it easier to draw inference and produces more parsimonious models.

We also saw some evidence that combining important variables selected across different methods (Count 4, Count 5) may be useful as well.

Page 28: Variable Selection Methods

Conclusion: Violent Crime

The socio-demographic variables most consistently selected by the models were:

• percentage of the population that is African American,
• percentage of households with retirement income in 1989,
• percentage of people 16 and over employed in manufacturing jobs,
• percentage of people 16 and over employed in professional services,
• percentage of kids in family housing with two parents,
• number of kids born to never-married parents,
• percentage of people in dense housing, and
• median owner cost as a percentage of household income (for owners without a mortgage).

Similar to the other countries I investigated, I found that measures of employment, urbanization/density, income, poverty, and family structure were all consistently associated with violent crime.

Although these were the most consistently selected factors, it is important not to draw causal conclusions and to acknowledge that there are likely other confounding variables as well.