Prediction and cross validation
Soc Stats Reading Group
Alex Kindel
Princeton University
1 December 2016
Outline
1. Civil war
2. Cross validation
3. Back to civil war
4. Why care about prediction?
Ward, Greenhill & Bakke (2010)
“The perils of policy by p-value: Predicting civil conflicts.” Journal of Peace Research 47(4), 363-75.

“. . . basing policy prescriptions on statistical summaries of probabilistic models (which are predictions) can lead to misleading policy prescriptions if out-of-sample predictive heuristics are ignored.”

- In a word: overfitting
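A minimal sketch of what that looks like (simulated data, not from the paper): a logit fit to pure noise can discriminate impressively in sample while predicting nothing new.

library(Hmisc)  # somers2(), whose first element (the C index) is the AUC

set.seed(3)
n <- 150
X <- matrix(rnorm(n * 30), n, 30)  # 30 pure-noise predictors
y <- rbinom(n, 1, 0.5)             # outcome unrelated to the predictors
mx <- glm(y ~ X, family = binomial(link = logit))
somers2(fitted(mx), y)[1]  # in-sample AUC lands well above 0.5
# Out of sample, the same model hovers near 0.5 (chance):
# it has memorized noise, and cross validation makes that visible.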
Civil wars
- Based on logistic regression
- Widely used to guide policy
  - World Bank, House of Representatives
  - The New Yorker, The New York Times, etc.
Civil wars
But: strikingly poor performance on in-sample prediction
Cross validation
Procedure
1. Split the data into k “folds” (equally sized groups)
2. Withholding one fold, re-estimate the model on the remaining folds
3. Test the predictive power of the model on the withheld fold (AUC), repeating for each of the k folds
Receiver operating characteristic (ROC) curve
We use area under the ROC curve (AUC) as a heuristic measure of predictiveness
- Intuitively, a higher AUC means the true positive rate exceeds the false positive rate across classification thresholds

(From the people who brought you instructional television...)
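One way to make that intuition concrete (a sketch with simulated scores, not from the talk): the AUC equals the probability that a randomly drawn positive case outscores a randomly drawn negative case, which is exactly the C index that somers2() reports.

library(Hmisc)  # somers2()

# Rank-based AUC: P(random positive outscores random negative), ties count half
auc.rank <- function(preds, labels) {
  pos <- preds[labels == 1]
  neg <- preds[labels == 0]
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}

set.seed(1)
labels <- rbinom(500, 1, 0.3)
preds <- rnorm(500, mean = ifelse(labels == 1, 1, 0))  # positives score higher on average
auc.rank(preds, labels)    # about 0.76
somers2(preds, labels)[1]  # identical value: the C index is the AUC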
Tricks and missteps
- Bias-variance tradeoff
  - k = n (LOOCV): higher variance (low variance among training sets), but lower bias
  - k < n (k-fold): lower variance, but higher bias (overestimates prediction error)
- General consensus is that it might be better to overestimate prediction error (a conservative bias)
  - Also, LOOCV is “more expensive” to compute
- Don’t do (supervised) feature selection before model validation!
  - It will drastically overestimate the AUC (see the sketch below)
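Here is a sketch of that misstep (simulated pure-noise data, hypothetical variable names v1–v10) using the cv.predict.logit() helper defined on the next slide: screening predictors on the full data before cross-validating lets the test folds leak into the model.

library(functional)  # Curry(), as used below

set.seed(2)
n <- 100; p <- 500
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, 0.5)  # outcome independent of every predictor: true AUC is 0.5

# Misstep: pick the 10 predictors most correlated with y using ALL the data...
keep <- order(abs(cor(X, y)), decreasing = TRUE)[1:10]
df <- setNames(data.frame(y, X[, keep]), c("y", paste0("v", 1:10)))

# ...then cross-validate. The screening step already saw the test folds,
# so the reported AUC lands well above 0.5 despite there being no signal.
mx <- Curry(glm, formula = y ~ v1 + v2 + v3 + v4 + v5 + v6 + v7 + v8 + v9 + v10,
            family = binomial(link = logit))
cv.predict.logit(df, "y", mx, k = 4)
# The fix: redo the screening inside each fold, on the training set only.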
Cross validation: pretty easy to implement!

library(dplyr)    # mutate(), filter(), %>%
library(magrittr) # %<>% assignment pipe
library(Hmisc)    # somers2(), whose C index is the AUC

# Function to divide data into folds randomly
fold <- function(data, k) {
  data <- data[sample(nrow(data)), ]  # Shuffle data
  data %<>% mutate(fold = cut(seq_len(nrow(data)), breaks = k, labels = FALSE))
  return(data)
}

# Function to cross-validate data on a given model (curried)
cv.predict.logit <- function(data, dv, model.fx, k) {
  data %<>% fold(k)  # Fold data
  aucs <- c()
  for (i in 1:k) {
    # Divide data into train and test sets
    train <- data %>% filter(fold != i)
    test <- data %>% filter(fold == i)
    # Estimate model on training data
    mx <- model.fx(data = train)
    # Predict on test data and calculate AUC
    preds <- predict(mx, newdata = test, type = "response")
    aucs[i] <- somers2(preds, test[[dv]])[1]
  }
  return(mean(aucs, na.rm = TRUE))  # Yield mean AUC across folds
}

# Function to rerun CV n times, returning the vector of per-cycle mean AUCs
crossval <- function(data, dv, model.fx, k, n) {
  aucs <- replicate(n, cv.predict.logit(data, dv, model.fx, k))
  return(aucs)
}
Back to civil war
# Define Collier & Hoeffler modelch.form <- as.factor(warsa) ~ sxp + sxp2 + secm + gy1 + peace + geogia + lnpop + frac + etdo4590ch.mx <- Curry(glm, formula=ch.form, family=binomial(link=logit))
# Define Fearon & Laitin modelfl.form <- as.factor(onset) ~ warl + gdpenl + lpopl1 + lmtnest + ncontig + Oil + nwstate + instab + polity2l + ethfrac + relfracfl.mx <- Curry(glm, formula=fl.form, family=binomial(link=logit))
# Perform cross-validationk <- 4 # Set k foldsch.auc <- cv.predict.logit(ch, "warsa", ch.mx, k)fl.auc <- cv.predict.logit(fl, "onset", fl.mx, k)
c(ch.auc, fl.auc)
## [1] 0.8090876 0.7423249
Calculating a stable AUC
- Sensitive to dataset randomization during “folding”
  - Not too much to worry about here (usually)
- Sensitive to choice of k
  - Low k: upward bias in AUC
  - High k: higher variance in AUC
Sensitivity to randomization: F&Lk <- 4n <- 200 # Set n CV cyclesch.aucs <- crossval(ch, "warsa", ch.mx, k, n)
[Figure: density of the 200 cross-validated AUCs for the C&H model (x-axis: auc); vertical lines mark the mean over N cycles and the AUC in the first cycle]
Sensitivity to randomization: C&Hk <- 4n <- 200 # Set n CV cyclesfl.aucs <- crossval(fl, "onset", fl.mx, k, n)
[Figure: density of the 200 cross-validated AUCs for the F&L model (x-axis: auc); vertical lines mark the mean over N cycles and the AUC in the first cycle]
Sensitivity to choice of k: F&L
n <- 100list(k4 = crossval(fl, "onset", fl.mx, 4, n),
k10 = crossval(fl, "onset", fl.mx, 10, n),k20 = crossval(fl, "onset", fl.mx, 20, n),k100 = crossval(fl, "onset", fl.mx, 100, n),k500 = crossval(fl, "onset", fl.mx, 500, n)) ->
fl.aucs.ks
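A quick numeric read of those distributions (a short sketch using the list just built):

sapply(fl.aucs.ks, mean)  # average AUC is broadly similar across choices of k
sapply(fl.aucs.ks, sd)    # but the spread grows as k increases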
Sensitivity to choice of k: F&L## Using as id variables## Using as id variables
[Figure: densities of the cross-validated AUC values (x-axis: value) for k4, k10, k20, k100, k500]
Conclusion: why might we care?
- Technical tradeoff between variable significance and model predictiveness (Ward et al. 2010; Lo et al. 2015)
- If we really think our models explain causal effects, shouldn’t they be predictive? (Watts 2014)
  - Especially if we’re basing policy on our findings
- Distinguishing origins from effects (Sewell 1996; Pierson 2000; Clemens 2007)