To Explain or To Predict?
Explanatory vs. Predictive Modeling in Scientific Research
Galit Shmueli
Georgetown University, October 30, 2009
The path to discovery
Predict
Explain
What are “explaining” and “predicting”?
Statistical modeling in social science research
Purpose: test causal theory (“explain”)
Association-based statistical models
Prediction nearly absent
Whether statisticians like it or not,
in the social sciences,
association-based statistical models are used for testing causal theory.
Justification: a strong underlying theoretical model provides the causality.
Lesson #1:
Definition: Explanatory Model
A statistical model used for testing causal theory
(“proper” or not)
Definition: Predictive Model
An empirical model used for predicting new records/scenarios
Multi-page sections with theoretical justifications of each hypothesis
Concept operationalization
4 pages of such tables
Constructs: Anger, Economic stability, Trust, Well-being, Poverty
Statistical model (here: path analysis)
“Statistical” conclusions
Research conclusions
Lesson #2
In the social sciences,
empirical analysis is mainly used for testing causal theory.
Empirical prediction is considered un-academic.
Some statisticians share this view: “The two goals in analyzing data... I prefer to describe as ‘management’ and ‘science’. Management seeks profit... Science seeks truth.”
Parzen, Statistical Science 2001
Prediction in the Information Systems literature
Predictive goal stated? Predictive power assessed?
“Examples of [predictive] theory in IS do not come readily to hand, suggesting that they are not common” Gregor, MISQ 2006
1072 articles
of which
52 empirical with predictive claims
Breakdown of the 52 “predictive” articles
To Explain: test causal theory
To Predict: utility, relevance, new theory, predictability
Scientific use of empirical models
Why Predict?
Why are statistical explanatory models different from predictive models?
Theory vs. its manifestation
?
“The goal of finding models that are predictively accurate differs from the goal of finding models that are true.”
Given the research environment in the social sciences, two critically important points are:
1. Explanatory power and predictive accuracy cannot be inferred from one another.
2. The “best” explanatory model is (nearly) never the “best” predictive model, and vice versa.
Point #1
Explanatory Power
Predictive Power ≠
Cannot infer one from the other
What is R²?
In-sample vs. out-of-sample evaluation
Explanatory: in-sample goodness-of-fit, p-values, R²; danger: type I, II errors
Predictive: out-of-sample prediction accuracy, costs, run time, interpretation; danger: over-fitting
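The in-sample/out-of-sample gap is easy to demonstrate. A minimal sketch (simulated data, numpy only; the model degrees and all constants are illustrative choices, not from the talk) showing how a flexible model's in-sample R² can outpace its performance on new records:

```python
import numpy as np

rng = np.random.default_rng(0)

def r2(y, yhat):
    """Coefficient of determination: 1 - SSE/SST."""
    return 1 - np.sum((y - yhat) ** 2) / np.sum((y - np.mean(y)) ** 2)

# Simulated data: the true relationship is linear, plus noise.
x = rng.uniform(0, 1, 60)
y = 2.0 * x + rng.normal(0, 0.5, 60)
x_new = rng.uniform(0, 1, 60)            # fresh "new records"
y_new = 2.0 * x_new + rng.normal(0, 0.5, 60)

in_sample, out_sample = {}, {}
for degree in (1, 10):
    coefs = np.polyfit(x, y, degree)     # least-squares polynomial fit
    in_sample[degree] = r2(y, np.polyval(coefs, x))
    out_sample[degree] = r2(y_new, np.polyval(coefs, x_new))

print("in-sample R^2:    ", in_sample)   # the higher-degree fit is always at least as good here
print("out-of-sample R^2:", out_sample)  # on new data the flexible model often does worse
```

The degree-10 fit cannot lose in-sample (its model space contains the line), which is exactly why in-sample fit says little about predictive power.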
Suggestion for social scientists:
Report predictive accuracy in addition to explanatory power
[Plot: Explanatory Power vs. Predictive Power]
Best explanatory model ≠ Best predictive model
Point #2
Predict ≠ Explain
+ ?
“We should mention that not all data features were found to be useful. For example, we tried to benefit from an extensive set of attributes describing each of the movies in the dataset. Those attributes certainly carry a significant signal and can explain some of the user behavior. However, we concluded that they could not help at all for improving the accuracy of well tuned collaborative filtering models.”
Bell et al., 2008
Predict ≠ Explain
The FDA considers two products bioequivalent if the 90% CI of the relative mean of the generic to brand formulation is within 80%-125%
“We are planning to… develop predictive models for bioavailability and bioequivalence”
Lester M. Crawford, 2005Acting Commissioner of Food & Drugs
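The bioequivalence rule itself reduces to a simple interval check; here is a sketch (the function name and the CI endpoints are made up for illustration; estimating the 90% CI is the real statistical work):

```python
def bioequivalent(ci_low, ci_high, lo=0.80, hi=1.25):
    """FDA-style check: the whole 90% CI of the generic/brand
    mean ratio must lie within the 80%-125% equivalence limits."""
    return lo <= ci_low and ci_high <= hi

# Hypothetical 90% confidence intervals for the relative mean:
print(bioequivalent(0.87, 1.12))  # entirely inside the limits -> True
print(bioequivalent(0.78, 1.05))  # lower bound below 80% -> False
```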
Let’s dig in
Explanatory goal:
minimize model bias
Predictive goal:
minimize MSE (model bias + sampling variance)
What is Optimized?
Prediction MSE: $E[(Y - \hat{f}(x))^2] = \mathrm{Var}(Y) + \mathrm{bias}^2 + \mathrm{Var}(\hat{f}(x))$
- $\mathrm{Var}(Y)$: uncontrollable
- $\mathrm{bias}^2$: model misspecification
- $\mathrm{Var}(\hat{f}(x))$: estimation (sampling) variance
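A short Monte Carlo sketch of this decomposition (the intercept-only model, the true function, and all constants are illustrative assumptions): a deliberately misspecified model is fit repeatedly, and the empirical prediction MSE at one point is compared against noise + bias² + estimation variance.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 1.0                     # sqrt of Var(Y), the uncontrollable noise
x = np.linspace(0, 1, 50)
x0, f = 0.9, lambda t: 2.0 * t  # evaluation point and true function

# Misspecified model: predict with the sample mean (an intercept-only fit).
preds, sq_err = [], []
for _ in range(20000):
    y = f(x) + rng.normal(0, sigma, x.size)
    yhat0 = y.mean()                        # prediction at x0
    y0_new = f(x0) + rng.normal(0, sigma)   # a new observation at x0
    preds.append(yhat0)
    sq_err.append((y0_new - yhat0) ** 2)

preds = np.array(preds)
bias2 = (preds.mean() - f(x0)) ** 2   # model misspecification
est_var = preds.var()                 # sampling (estimation) variance
mse = np.mean(sq_err)
print(mse, sigma**2 + bias2 + est_var)  # the two quantities nearly match
```

With these settings the bias² term (about 0.64) dominates the estimation variance (about σ²/50), so the two printed numbers agree up to Monte Carlo error.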
Linear Regression Example

True model: $f(x) = \beta_1 x_1 + \beta_2 x_2$
Estimated model: $\hat{f}(x) = \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2$
$MSE_1 = E[(Y - \hat{f}(x))^2] = \sigma^2 + 0 + \mathrm{Var}(\hat{\beta}_1 x_1 + \hat{\beta}_2 x_2)$

Underspecified model: $f^*(x) = \gamma_1 x_1$
Estimated model: $\hat{f}^*(x) = \hat{\gamma}_1 x_1$
$MSE_2 = E[(Y - \hat{f}^*(x))^2] = \sigma^2 + (x_1 A \beta_2 - x_2 \beta_2)^2 + \mathrm{Var}(\hat{\gamma}_1 x_1)$, where $A = (x_1' x_1)^{-1} x_1' x_2$

$MSE_2 < MSE_1$ when:
- $\sigma^2$ large
- $|\beta_2|$ small
- $\mathrm{corr}(x_1, x_2)$ high
- limited range of the x’s
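These conditions can be checked by simulation. The sketch below (all parameter values are illustrative assumptions) compares out-of-sample MSE of the true two-predictor model against the underspecified one-predictor model under large σ², small |β₂|, and high corr(x₁, x₂):

```python
import numpy as np

rng = np.random.default_rng(2)
n, beta1, beta2, sigma = 30, 3.0, 0.2, 5.0   # small |beta2|, large sigma^2

def ols_predict(X_train, y_train, X_test):
    """Least-squares fit on the training rows, predictions for the test rows."""
    coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    return X_test @ coef

mse_full = mse_under = 0.0
reps = 2000
for _ in range(reps):
    # x2 is built to be highly correlated with x1 (corr about 0.95).
    x1 = rng.normal(0, 1, 2 * n)
    x2 = 0.95 * x1 + np.sqrt(1 - 0.95**2) * rng.normal(0, 1, 2 * n)
    y = beta1 * x1 + beta2 * x2 + rng.normal(0, sigma, 2 * n)
    tr, te = slice(0, n), slice(n, 2 * n)    # train / new-data split

    X_full = np.column_stack([x1, x2])       # correctly specified model
    mse_full += np.mean((y[te] - ols_predict(X_full[tr], y[tr], X_full[te])) ** 2)
    X_under = x1[:, None]                    # drop x2: underspecified model
    mse_under += np.mean((y[te] - ols_predict(X_under[tr], y[tr], X_under[te])) ** 2)

print(mse_full / reps, mse_under / reps)     # the "wrong" model predicts better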
Two statistical modeling paths
[Photo: China's Diverging Paths, by Clark Smith]
Goal Definition → Design & Collection → Data Preparation → EDA → Variables? Methods? → Evaluation, Validation & Model Selection → Model Use & Reporting
Study design & data collection
Observational or experiment?
Primary or secondary data?
Hierarchical data
Instrument (reliability + validity vs. measurement accuracy)
How much data? How to sample?
Data preparation & EDA
Missing values, outliers, data partitioning, reduced-feature models
Summary stats, plots, interactive visualization
PCA / SVD, trends
Variables & Methods
Which variables? Multicollinearity? A, B, A*B?
Theory, associations, ex-post availability
Methods / Models: blackbox vs. interpretable; mapping to theory
Bias/variance tradeoff: ensembles, PLS, ridge regression, PCR, boosting
Evaluation, Validation & Model Selection
Predictive path: training data → empirical model → holdout data; predictive power; over-fitting analysis; naïve/baseline comparison
Explanatory path: theoretical model → empirical model → data; validation (model fit ≠ explanatory power); inference; null hypothesis
Model Use
To Explain: test causal theory
To Predict: predictions (utility), relevance, new theory, predictability
How does all this impact research in the (social) sciences?
Three Current Problems
“While the value of scientific prediction… is beyond question… the inexact sciences [do not] have…the use of predictive expertise well in hand.”
Helmer & Rescher, 1959
1. Distinction blurred
2. Inappropriate modeling/assessment
3. Prediction underappreciated
Why?
What can be done?
Statisticians should acknowledge the difference and teach it!
It’s time for Change
To Predict
To Explain