Using Random Forest as a Tool for Policy Analysis. Reuben Ternes, November 2012.

Slide 1
Using Random Forest as a Tool for Policy Analysis. Reuben Ternes, November 2012.

Slide 2
Part 1: Policy Analysis: Test Optional
Part 2: The Weaknesses of Parametric Statistics
Part 3: Data Mining: Random Forest as an Alternative
Part 4: Real-World Example

Slide 3
Policy Analysis: Test Optional. Reuben Ternes, November 2012.

Slide 4
Lots of institutions are considering going test-optional these days. Should yours? How can we use IR data to arrive at a reasonable policy recommendation? As a caveat: OU is not considering changing its admissions rules; this is more of a theoretical exercise. We're already partially test-optional.

Slide 5
Geiser and Studley (2003) suggested that HSGPA was better than the SAT at predicting first-year GPAs. Sample size of 80,000+, University of California data. They used regression, logistic regression, and some HLM to reach their conclusions. Fairly rigorous methodology. A 2007 follow-up study (Geiser & Santelices) found the same pattern with four-year outcomes. Very influential.

Slide 6
The literature on the topic is vast. Most of it supports the notion that HSGPA is a better predictor than the SAT/ACT. Many studies find that ACT/SAT scores add predictive validity; some do not, or find that the addition is trivial. Almost all of the literature uses a parametric regression (of some kind) to estimate the SAT/ACT's predictive validity.

Slide 7
The Weaknesses of Parametric Statistics. Reuben Ternes, November 2012.

Slide 8
OLS regression is a fantastic tool, but its failings as a predictive tool are well known:
Missing data is difficult to deal with.
Categorical data is difficult to deal with.
Interactions must be modeled by hand.
Non-linearities must be modeled by hand.
It handles data sets with lots of variables poorly.
Overfitting is common.
It is not a good tool for understanding the predictive contribution of ACT scores.

Slide 9
All parametric statistical techniques make certain assumptions about the data. In regression:
Normality
Homoscedasticity
Linearity
No multicollinearity
Among others.

Slide 10
In practice, these assumptions are often incorrect. We still use parametric statistics because they are useful, but they are not perfect estimators of the predictive contributions of different variables, and they don't always make good predictions!

Slide 11
Imagine that you have one categorical variable with 10 categories. In regression, you have to code this as 10 dummy variables (0, 1). If you have 10 such variables, then you have 100 additional variables in your regression model. This reduces your degrees of freedom! Now imagine that you have interaction terms with 10 other potential continuous variables. That's 1,000 different variables!

Slide 12
Now imagine that you have 10 continuous variables. You should, at the very least, include quadratic and cubic versions of these variables in your model, just in case they are not linearly related. Now you have 30 variables. Don't forget your interaction terms!

Slide 13
But if you actually model all of this, you've probably gone too far. Eventually, you'll start modeling noise, not real patterns. It is difficult to figure out when you've overfitted your data when using regression. What will happen? Your test data will look good, but your actual predictions will be poor.

Slide 14
You must model missing data, or data is lost. It is common to impute the median or mean value for continuous data, the most common response for categorical data, or to code values as missing. If you don't impute, then every case with missing data, even one that is mostly complete, won't be used in the final analysis. If the data isn't missing at random, you could be in serious trouble. Often, you don't know why data is missing.
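The imputation described on Slide 14 can be sketched in a few lines. This is a minimal illustration, assuming Python with pandas and scikit-learn rather than whatever software the presenter used; the column names are hypothetical.

```python
# Minimal sketch of Slide 14's imputation: median for continuous predictors,
# most frequent response for categorical ones. Column names are hypothetical.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "hs_gpa": [3.2, np.nan, 3.8, 2.9],
    "act": [24, 28, np.nan, 21],
    "residency": ["in_state", np.nan, "out_of_state", "in_state"],
})

num_cols = ["hs_gpa", "act"]   # continuous: fill with the column median
cat_cols = ["residency"]       # categorical: fill with the most common response

df[num_cols] = SimpleImputer(strategy="median").fit_transform(df[num_cols])
df[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(df[cat_cols])

print(df)  # no rows are dropped, so every case can be used in the analysis
```

Without a step like this, most regression routines drop every row that has any missing value, which is the loss of cases the slide warns about.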
Slide 15
Data Mining: Random Forest as an Alternative. Reuben Ternes, November 2012.

Slide 16
Netflix Prize, Target, Yahoo, Amazon, etc. All are using prediction algorithms to match customers with products. The prediction tools they are using are much more sophisticated than simple regression!

Slide 17
There are other ways to understand predictive contributions. Data mining and machine learning algorithms have improved greatly over the past decade and are now recognized to be much better predictors than many standard regression techniques. Random Forest, in particular, stands out.

Slide 18
Random Forest:
Deals with missing data well.
Robust to overfitting.
Relatively easy to use.
Can handle hundreds of different variables.
Categorical (i.e., non-numerical) data is OK.
Makes no distributional assumptions (non-parametric).
Overall good performance.

Slide 19
It builds lots of (decision) trees, randomly. (That's why it's called Random Forest.)

Slide 20

Slide 21
Step 1) Build each tree from a random subset of the predictor variables. The size of the subset is sqrt(p) for a classification outcome or p/3 for a continuous outcome, where p is the number of predictor variables.
Step 2) Draw N random cases from the dataset, sampling with replacement (bootstrapping). For each tree, approximately 1/3 of the dataset isn't used.
Step 3) After building the tree, run the unused cases through it and record the result of each.
Step 4) Repeat this process 500-1,000 times. Probabilities are generated by the total proportion of "yes" votes; regression predictions are generated by the average prediction across trees.

Slide 22
You could build a giant decision tree with dozens of variables, but it would be big. Too big. It suffers from some of the same problems as standard regression techniques (it overfits, poorly models interaction effects, etc.). Instead, Random Forest uses random elements to its advantage: 1) it builds many smaller trees (500-1,000) using a random sample of the predictors, and 2) it samples N cases with replacement.

Slide 23
The trees are smaller, and smaller trees are easier to deal with. That means you can make a lot of them. Aggregating lots of small trees does a better job of capturing interaction effects without overfitting. Ditto for non-linearities (the split point on any continuous predictor will be different for every tree).

Slide 24
It keeps N high, but creates a hold-out set. This hold-out set is used to create an (unbiased) estimate of the error rate. This means you don't need a separate test data set! (Essentially, every tree is both a test data set and a training data set rolled into one.) There are known issues with sampling with replacement: it does not affect the raw predictions, but it does affect the variable importance measures.

Slide 25
Let me pause for questions before continuing.

Slide 26
Random Forest results are not like regression results. You get a variable importance list, based on node purity measures (the Gini coefficient); the numbers themselves are pretty much uninterpretable. There is no explanation of how variables interact with the outcome, and no established method for creating p-values. You really only get: prediction results, a vague sense of how important each variable is, and either an error rate (categorical outcomes) or a percent of total variance explained (continuous outcomes).

Slide 27
If you can't get p-values, how can you do policy analysis with Random Forest? What you can do is run various sets of predictions and look at the accuracy of those predictions, systematically excluding the variables that you are interested in examining.
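To make Slides 21-26 concrete, here is a minimal sketch of fitting a random forest and reading off the out-of-bag accuracy and the variable importance list. It assumes Python's scikit-learn, not the presenter's actual software, and the data frame, outcome, and column names are placeholders.

```python
# Sketch of Slides 21-26: 500 trees, ~1/3 of predictors per split for a
# continuous outcome, out-of-bag cases as a built-in hold-out set, and a
# node-purity-based variable importance list. X and y are placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def fit_forest(X: pd.DataFrame, y: pd.Series) -> RandomForestRegressor:
    forest = RandomForestRegressor(
        n_estimators=500,    # Slide 21: repeat the tree-building 500-1,000 times
        max_features=1 / 3,  # Slide 21: ~1/3 of predictors for a continuous outcome
        oob_score=True,      # Slide 24: unused (out-of-bag) cases estimate the error
        random_state=0,
    )
    forest.fit(X, y)
    print(f"OOB percent of variance explained: {forest.oob_score_:.1%}")
    # Slide 26: the importance numbers rank variables but are not
    # interpretable the way regression coefficients are.
    print(pd.Series(forest.feature_importances_, index=X.columns)
            .sort_values(ascending=False))
    return forest

# Hypothetical usage:
# forest = fit_forest(admits[["hs_gpa", "act", "pct_pell_by_zip"]],
#                     admits["first_year_gpa"])
```

Slide 27's strategy is then just a matter of refitting the same forest on predictor sets that exclude the variable of interest and comparing the accuracy figures.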
Slide 28
Real-World Example of Random Forest. Reuben Ternes, November 2012.

Slide 29
Should your institution go test-optional? Another way to ask this question is: how much do admissions tests tell us about future student outcomes? We will test just first-year GPA, but you could test anything (retention, graduation, etc.).

Slide 30
I consider three models:
1) Saturated: an extreme (unrealistic) amount of data on incoming students, obtained late in the admissions cycle; more information than a human could process to make a decision; about 50 variables we collect during the admissions cycle.
2) Just HS GPA and ACT scores.
3) HS GPA, ACT scores, and one measure of SES, obtained by aggregating the percentage of Pell students by zip code for OU over the last 10 years. I test this model because one of the common complaints against standardized tests is that they only measure SES.

Slide 31
Saturated model, averaged over 5 trials (500 trees per trial).
All variables: 29.9% of total variance explained.
Excluding ACT scores: 29.7% of total variance explained.
Conclusion: ACT scores do not add much information to the total model, though they probably add something. But this is an unrealistic model for admissions decisions, so it doesn't answer our question.

Slide 32
HS GPA + ACT model, averaged over 5 trials (500 trees per trial).
HS GPA + ACT scores: 21.2% of total variance explained.
Excluding ACT scores: 20.2% of total variance explained.
Conclusion: ACT scores improve predictions by a noticeable, but still small, amount at OU.

Slide 33
HS GPA, ACT, + SES model, averaged over 5 trials (500 trees per trial).
HS GPA, ACT, and SES: 25.0% of total variance explained.
Excluding ACT scores: 21.6% of total variance explained.
Conclusion 1: ACT scores improve predictions noticeably.
Conclusion 2: There are some very important and non-trivial interaction effects between ACT scores and SES. If our goal is to develop predictive decision rules that correlate with academic success, we are leaving a lot of useful information out by not considering SES data.
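The comparisons on Slides 31-33 follow the recipe from Slide 27: refit the forest with and without the variable of interest and compare the out-of-bag percent of variance explained, averaged over several trials. A hedged sketch, again assuming scikit-learn and hypothetical column names:

```python
# Sketch of the Slide 31-33 procedure: average the OOB percent of variance
# explained over 5 trials (500 trees each), with and without ACT scores.
# The data frame and column names are illustrative placeholders.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def mean_oob_r2(X: pd.DataFrame, y: pd.Series, n_trials: int = 5) -> float:
    scores = []
    for seed in range(n_trials):
        forest = RandomForestRegressor(
            n_estimators=500, max_features=1 / 3,
            oob_score=True, random_state=seed,
        )
        forest.fit(X, y)
        scores.append(forest.oob_score_)
    return float(np.mean(scores))

# Hypothetical usage for the "HS GPA, ACT, + SES" model:
# X = admits[["hs_gpa", "act", "pct_pell_by_zip"]]
# y = admits["first_year_gpa"]
# with_act    = mean_oob_r2(X, y)
# without_act = mean_oob_r2(X.drop(columns=["act"]), y)
# print(f"With ACT: {with_act:.1%}   Without ACT: {without_act:.1%}")
```

A large drop when ACT scores are excluded, as on Slide 33, is the signal that the variable carries predictive information the remaining predictors cannot replace.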