To Explain or To Predict?
Explanatory vs. Predictive Modeling in Scientific Research
Galit Shmueli
Georgetown University, October 30, 2009
The path to discovery
Predict
Explain
What are “explaining” and “predicting”?
Statistical modeling in social science research
Purpose: test causal theory (“explain”)
Association-based statistical models
Prediction nearly absent
Whether statisticians like it or not,
in the social sciences,
association-based statistical models are used for testing causal theory.
Justification: a strong underlying theoretical model provides the causality.
Lesson #1:
Definition: Explanatory Model
A statistical model used for testing causal theory
(“proper” or not)
Definition: Predictive Model
An empirical model used for predicting new records/scenarios
Multi-page sections with theoretical justifications of each hypothesis
Concept operationalization
4 pages of such tables
Constructs: Anger, Economic stability, Trust, Well-being, Poverty
Statistical model (here: path analysis)
“Statistical” conclusions
Research conclusions
Lesson #2
In the social sciences,
empirical analysis is mainly used for testing causal theory.
Empirical prediction is considered un-academic.
Some statisticians share this view: “The two goals in analyzing data... I prefer to describe as ‘management’ and ‘science’. Management seeks profit... Science seeks truth.”
Parzen, Statistical Science 2001
Prediction in the Information Systems literature
Predictive goal stated? Predictive power assessed?
“Examples of [predictive] theory in IS do not come readily to hand, suggesting that they are not common” Gregor, MISQ 2006
1072 articles
of which
52 empirical with predictive claims
Breakdown of the 52 “predictive” articles
To Explain: test causal theory
To Predict: utility, relevance, new theory, predictability
Scientific use of empirical models
Why Predict?
Why are statistical explanatory models different from predictive models?
Theory vs. its manifestation
?
“The goal of finding models that are predictively accurate differs from the goal of finding models that are true.”
Given the research environment in the social sciences, two critically important points are:
1. Explanatory power and predictive accuracy cannot be inferred from one another.
2. The “best” explanatory model is (nearly) never the “best” predictive model, and vice versa.
Point #1
Explanatory Power
Predictive Power ≠
Cannot infer one from the other
What is R²?
In-sample vs. out-of-sample evaluation
Explanatory: in-sample goodness-of-fit, p-values, R²; danger: type I, II errors
Predictive: out-of-sample prediction accuracy, costs, run time, interpretation; danger: over-fitting
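The in-sample/out-of-sample gap is easy to demonstrate. A minimal sketch (simulated data, numpy only; the model degrees and all constants are illustrative choices, not from the talk) showing how a flexible model's in-sample R² can outpace its performance on new records:

```python
import numpy as np

rng = np.random.default_rng(0)

def r2(y, yhat):
    """Coefficient of determination: 1 - SSE/SST."""
    return 1 - np.sum((y - yhat) ** 2) / np.sum((y - np.mean(y)) ** 2)

# Simulated data: the true relationship is linear, plus noise.
x = rng.uniform(0, 1, 60)
y = 2.0 * x + rng.normal(0, 0.5, 60)
x_new = rng.uniform(0, 1, 60)            # fresh "new records"
y_new = 2.0 * x_new + rng.normal(0, 0.5, 60)

in_sample, out_sample = {}, {}
for degree in (1, 10):
    coefs = np.polyfit(x, y, degree)     # least-squares polynomial fit
    in_sample[degree] = r2(y, np.polyval(coefs, x))
    out_sample[degree] = r2(y_new, np.polyval(coefs, x_new))

print("in-sample R^2:    ", in_sample)   # the higher-degree fit is always at least as good here
print("out-of-sample R^2:", out_sample)  # on new data the flexible model often does worse
```

The degree-10 fit cannot lose in-sample (its model space contains the line), which is exactly why in-sample fit says little about predictive power.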
Suggestion for social scientists:
Report predictive accuracy in addition to explanatory power
[Plot: Explanatory Power vs. Predictive Power]
Best explanatory model ≠ Best predictive model
Point #2
Predict ≠ Explain
+ ?
“We should mention that not all data features were found to be useful. For example, we tried to benefit from an extensive set of attributes describing each of the movies in the dataset. Those attributes certainly carry a significant signal and can explain some of the user behavior. However, we concluded that they could not help at all for improving the accuracy of well tuned collaborative filtering models.”
Bell et al., 2008
Predict ≠ Explain
The FDA considers two products bioequivalent if the 90% CI of the relative mean of the generic to brand formulation is within 80%-125%
“We are planning to… develop predictive models for bioavailability and bioequivalence”
Lester M. Crawford, 2005Acting Commissioner of Food & Drugs
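The bioequivalence rule itself reduces to a simple interval check; here is a sketch (the function name and the CI endpoints are made up for illustration; estimating the 90% CI is the real statistical work):

```python
def bioequivalent(ci_low, ci_high, lo=0.80, hi=1.25):
    """FDA-style check: the whole 90% CI of the generic/brand
    mean ratio must lie within the 80%-125% equivalence limits."""
    return lo <= ci_low and ci_high <= hi

# Hypothetical 90% confidence intervals for the relative mean:
print(bioequivalent(0.87, 1.12))  # entirely inside the limits -> True
print(bioequivalent(0.78, 1.05))  # lower bound below 80% -> False
```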
Let’s dig in
Explanatory goal:
minimize model bias
Predictive goal:
minimize MSE (model bias + sampling variance)
What is Optimized?
Prediction MSE: $E[(Y - \hat{f}(x))^2] = \mathrm{Var}(Y) + \mathrm{bias}^2 + \mathrm{Var}(\hat{f}(x))$
- $\mathrm{Var}(Y)$: uncontrollable
- $\mathrm{bias}^2$: model misspecification
- $\mathrm{Var}(\hat{f}(x))$: estimation (sampling) variance
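A short Monte Carlo sketch of this decomposition (the intercept-only model, the true function, and all constants are illustrative assumptions): a deliberately misspecified model is fit repeatedly, and the empirical prediction MSE at one point is compared against noise + bias² + estimation variance.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 1.0                     # sqrt of Var(Y), the uncontrollable noise
x = np.linspace(0, 1, 50)
x0, f = 0.9, lambda t: 2.0 * t  # evaluation point and true function

# Misspecified model: predict with the sample mean (an intercept-only fit).
preds, sq_err = [], []
for _ in range(20000):
    y = f(x) + rng.normal(0, sigma, x.size)
    yhat0 = y.mean()                        # prediction at x0
    y0_new = f(x0) + rng.normal(0, sigma)   # a new observation at x0
    preds.append(yhat0)
    sq_err.append((y0_new - yhat0) ** 2)

preds = np.array(preds)
bias2 = (preds.mean() - f(x0)) ** 2   # model misspecification
est_var = preds.var()                 # sampling (estimation) variance
mse = np.mean(sq_err)
print(mse, sigma**2 + bias2 + est_var)  # the two quantities nearly match
```

With these settings the bias² term (about 0.64) dominates the estimation variance (about σ²/50), so the two printed numbers agree up to Monte Carlo error.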
Linear Regression Example

True model: $f(x) = \beta_1 x_1 + \beta_2 x_2$
Estimated model: $\hat{f}(x) = \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2$
$MSE_1 = E[(Y - \hat{f}(x))^2] = \sigma^2 + 0 + \mathrm{Var}(\hat{\beta}_1 x_1 + \hat{\beta}_2 x_2)$

Underspecified model: $f^*(x) = \gamma_1 x_1$
Estimated model: $\hat{f}^*(x) = \hat{\gamma}_1 x_1$
$MSE_2 = E[(Y - \hat{f}^*(x))^2] = \sigma^2 + (x_1 A \beta_2 - x_2 \beta_2)^2 + \mathrm{Var}(\hat{\gamma}_1 x_1)$, where $A = (x_1' x_1)^{-1} x_1' x_2$

$MSE_2 < MSE_1$ when:
- $\sigma^2$ large
- $|\beta_2|$ small
- $\mathrm{corr}(x_1, x_2)$ high
- limited range of the x’s
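These conditions can be checked by simulation. The sketch below (all parameter values are illustrative assumptions) compares out-of-sample MSE of the true two-predictor model against the underspecified one-predictor model under large σ², small |β₂|, and high corr(x₁, x₂):

```python
import numpy as np

rng = np.random.default_rng(2)
n, beta1, beta2, sigma = 30, 3.0, 0.2, 5.0   # small |beta2|, large sigma^2

def ols_predict(X_train, y_train, X_test):
    """Least-squares fit on the training rows, predictions for the test rows."""
    coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    return X_test @ coef

mse_full = mse_under = 0.0
reps = 2000
for _ in range(reps):
    # x2 is built to be highly correlated with x1 (corr about 0.95).
    x1 = rng.normal(0, 1, 2 * n)
    x2 = 0.95 * x1 + np.sqrt(1 - 0.95**2) * rng.normal(0, 1, 2 * n)
    y = beta1 * x1 + beta2 * x2 + rng.normal(0, sigma, 2 * n)
    tr, te = slice(0, n), slice(n, 2 * n)    # train / new-data split

    X_full = np.column_stack([x1, x2])       # correctly specified model
    mse_full += np.mean((y[te] - ols_predict(X_full[tr], y[tr], X_full[te])) ** 2)
    X_under = x1[:, None]                    # drop x2: underspecified model
    mse_under += np.mean((y[te] - ols_predict(X_under[tr], y[tr], X_under[te])) ** 2)

print(mse_full / reps, mse_under / reps)     # the "wrong" model predicts better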
Two statistical modeling paths
[Photo: China's Diverging Paths, by Clark Smith]
Goal Definition → Design & Collection → Data Preparation → EDA → Variables? Methods? → Evaluation, Validation & Model Selection → Model Use & Reporting
Study design & data collection
Observational or experiment?
Primary or secondary data?
Hierarchical data
Instrument (reliability + validity vs. measurement accuracy)
How much data? How to sample?
Data preparation & EDA
Missing values, outliers, data partitioning, reduced-feature models
Summary stats, plots, interactive visualization
PCA / SVD, trends
Variables & Methods
Which variables? Multicollinearity? A, B, A*B?
Theory, associations, ex-post availability
Methods / Models: blackbox vs. interpretable; mapping to theory
Bias/variance tradeoff: ensembles, PLS, ridge regression, PCR, boosting
Evaluation, Validation & Model Selection
Predictive path: training data → empirical model → holdout data; predictive power; over-fitting analysis; naïve/baseline comparison
Explanatory path: theoretical model → empirical model → data; validation (model fit ≠ explanatory power); inference; null hypothesis
Model Use
To Explain: test causal theory
To Predict: predictions (utility), relevance, new theory, predictability
How does all this impact research in the (social) sciences?
Three Current Problems
“While the value of scientific prediction… is beyond question… the inexact sciences [do not] have…the use of predictive expertise well in hand.”
Helmer & Rescher, 1959
1. Distinction blurred
2. Inappropriate modeling/assessment
3. Prediction underappreciated
Why?
What can be done?
Statisticians should acknowledge the difference and teach it!
It’s time for Change
To Predict
To Explain