Page 1

Targeted Learning for High-Dimensional Variable Importance

Alan Hubbard, Nima Hejazi, Wilson Cai, Anna Decker

Division of Biostatistics, University of California, Berkeley

July 27, 2016

for the Centre de Recherches Mathématiques (CRM) workshop on "Statistical Causal Inference and its Applications to Genetics"

Page 2

Outline

1 Introduction

2 Statistical Roadmap

3 Causal Inference

4 Statistical Model and SuperLearning

5 TMLE

6 LIMMA

7 Data Analysis

Page 3

GLUE Study - Trauma Genomics

• Study of the mechanisms underlying inflammation, coagulation, and host response to injury (Tompkins 2015).

• Data used were a subset of the GLUE cohort (n = 167), who had gene expression measured (Affy U133 Plus 2) in WBCs (when possible) within 12 hours, and at 1, 4, 7, 14, and 28 days.

• Interested in how expression at various times is related to outcomes afterwards (such as multiple organ failure; MOF).

• So, a parameter like Ψ(P)(t) = E(Y_{a_hi}(t) − Y_{a_low}(t)).

Page 4

Both Expression and Clinical Measurements on Patients

Heart rate, blood pressure, oxygen saturation, pulmonary artery pressure, intracranial pressure, temperature, nursing documentation (medications, treatments, intake, output, vital signs, blood products, IV fluids), tissue oxygen, inspired oxygen, tidal volume, peak pressure, end expiratory pressure, respiratory rate, ventilation mode, cardiac output, sedation level, ICP wave.

Page 5

Joint Variable Importance Measures

Use a combination of causal inference to motivate the estimand, and

• Super Learning (van der Laan et al. 2007),
• TMLE (van der Laan and Rose 2011),
• LIMMA (Smyth 2004),

to produce joint inference on variable importance, adjusting for other covariates.

Page 6

Outline

1 Introduction

2 Statistical Roadmap

3 Causal Inference

4 Statistical Model and SuperLearning

5 TMLE

6 LIMMA

7 Data Analysis

Page 7

Foundations of Inference: Stay on the Road(map)

[Figure: the statistical roadmap. Define data/experiment → statistical model → causal model → parameter of interest → estimator → inference (sampling distribution) → interpretation → action; the failure modes along the way: wrong parameter, inefficient, bias, non-normality.]

Page 8

Roadmap

[Figure: ingredients of the roadmap: science, design, the causal model, identifiability, cross-validation, the oracle inequality, empirical processes, pathwise differentiability, efficiency theory, and the influence curve.]

Page 9

Outline

1 Introduction

2 Statistical Roadmap

3 Causal Inference

4 Statistical Model and SuperLearning

5 TMLE

6 LIMMA

7 Data Analysis

Page 10

SCM for Point Treatment

• X = {W, A, Y}
• Errors: U = (U_W, U_A, U_Y)
• Structural equations (nature):
  - W = f_W(U_W)
  - A = f_A(W, U_A)
  - Y = f_Y(W, A, U_Y)

Distribution of (U, X) generated by:

1. Draw U from P_U.
2. Generate W via f_W with input U_W.
3. Generate A via f_A with inputs (W, U_A).
4. Generate Y via f_Y with inputs (W, A, U_Y).

Page 11

Estimand as variable importance measure

• Let's say, keeping the above tentative causal model, we have the equality (skipping identifiability steps):

Ψ(P_X) = E(Y_{a_1} − Y_{a_0}) =_{assumptions} E_{0,W}{E_0(Y | A = a_1, W) − E_0(Y | A = a_0, W)} = Ψ(P_0)

for discrete A with chosen comparison levels (a_0, a_1).

• Now, generalize the data above to the situation with O = (W, A, Y), where A is a vector of biomarkers: A = (A_1, ..., A_g, ..., A_G).

• Assume we estimate the Ψ_g(P_0) corresponding to each of the A_g, g = 1, ..., G.

• The estimator is based on original work by Robins and Rotnitzky (1992).

• Like Chambaz et al. (2012), Ritter et al. (2014), van der Laan (2006), etc., we propose using estimands motivated by causal inference for variable importance measures in high-dimensional contexts.

Page 12

Outline

1 Introduction

2 Statistical Roadmap

3 Causal Inference

4 Statistical Model and SuperLearning

5 TMLE

6 LIMMA

7 Data Analysis

Page 13

Loss-Based Estimation

• In our example, we wish to estimate Q̄_0 = E_0(Y | A, W).

• Before we can choose a "best" algorithm to estimate this regression function, we must have a way to define what "best" means, made explicit by a loss function.

• Data structure: O = (W, A, Y) ~ P_0, with empirical distribution P_n, which places probability 1/n on each observed O_i, i = 1, ..., n.

• The "best" algorithm is defined in terms of a loss function (for a candidate function Q̄ when applied to an observation O):

L : (O, Q̄) → L(O, Q̄) ∈ ℝ.

• It is a function of the random variable O and the parameter value Q̄.

• Example: the L2 squared-error (or quadratic) loss function (see the sketch below):

L(O, Q̄) = (Y − Q̄(A, W))².
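To make this concrete, here is a minimal R sketch of the squared-error loss and its empirical risk; the simulated data and the linear-model candidate are hypothetical illustrations, not the deck's analysis.

```r
## Hypothetical data: O = (W, A, Y) with a continuous outcome
set.seed(27)
n <- 200
W <- rnorm(n)
A <- rbinom(n, 1, plogis(W))
Y <- 1 + A + W + rnorm(n)

## Squared-error loss L(O, Qbar) = (Y - Qbar(A, W))^2
sq_loss <- function(y, pred) (y - pred)^2

## Empirical risk (mean loss) of one candidate estimator of Qbar_0
fit <- lm(Y ~ A + W)
mean(sq_loss(Y, predict(fit)))  # estimates the risk E_0 L(O, Qbar)
```

In-sample risk like this is optimistic, which is exactly what motivates the cross-validated risk estimates on the next slides.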

Page 14

Loss Defines Parameter of Interest

• We define our parameter of interest, Q̄_0 = E_0(Y | A, W), as the minimizer of the expected squared-error loss:

Q̄_0 = argmin_{Q̄} E_0 L(O, Q̄), where L(O, Q̄) = (Y − Q̄(A, W))².

• The risk, E_0 L(O, Q̄), is minimized at the optimal choice Q̄_0.

Page 15

Unbiased Estimate of Risk: Cross-Validation

• V-fold cross-validation: partition the sample of n observations O_1, ..., O_n into training and corresponding validation sets.

• Cross-validation is used to select our "best" algorithm from a library.

• In addition, we also use cross-validation to evaluate the overall performance of the super learner itself.

Page 16

[Figure: fold 1 of V-fold cross-validation, with the learning set split into blocks 1–10; one block (the validation set) is held out and the rest form the training set.]

• In V-fold cross-validation, our observed data O_1, ..., O_n is referred to as the learning set.

• Estimate our parameter of interest by partitioning the learning set into V sets of size ≈ n/V.

• For any given fold, V − 1 sets comprise the training set and the remaining set is the validation set.

• The observations in the training set are used to construct (or train) the candidate estimators.

• The observations in the validation set are used to assess the performance (i.e., risk) of the candidate algorithms; see the sketch below.
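A minimal hand-rolled sketch of V-fold cross-validated risk in R, reusing the hypothetical simulated data from the loss sketch above; the two candidate regressions are arbitrary stand-ins for a real library.

```r
set.seed(27)
n <- 200
W <- rnorm(n); A <- rbinom(n, 1, plogis(W)); Y <- 1 + A + W + rnorm(n)
dat <- data.frame(Y, A, W)

V <- 10
fold <- sample(rep(1:V, length.out = n))  # random partition into V sets

## Hypothetical candidate algorithms
candidates <- list(small = Y ~ A, full = Y ~ A * W)

cv_risk <- sapply(candidates, function(form) {
  mean(sapply(1:V, function(v) {
    fit <- lm(form, data = dat[fold != v, ])           # train on V - 1 sets
    mean((dat$Y[fold == v] -
          predict(fit, newdata = dat[fold == v, ]))^2) # validate on 1 set
  }))
})
cv_risk  # one honest (cross-validated) risk estimate per candidate
```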

Page 17

Cross-Validation

[Figure: folds 1–10 of 10-fold cross-validation over blocks 1–10 of the learning set.]

The validation set rotates V times, such that each set is used as the validation set once.

Page 18

Discrete Super Learner

Suppose a researcher is interested in using three different parametric statistical models to estimate E_0(Y | A, W).

We can use these algorithms to build a library of algorithms and select the one with the smallest (honest) cross-validated risk.

Page 19

Discrete Super Learner

[Figure: the discrete super learner on mortality data. Candidate algorithms regression_a, regression_b, and regression_c are fit across 10 cross-validation blocks, producing predictions Z_{1,a}, ..., Z_{10,c} and cross-validated risks CV Risk_a, CV Risk_b, CV Risk_c; here regression_b is selected.]

1. Input data and a collection of algorithms.
2. Split the data into 10 blocks.
3. Fit each of the 3 algorithms on the training set (non-shaded blocks).
4. Predict the estimated probabilities of death (Z) on the validation set (shaded block) for each algorithm, based on the corresponding training-set fit.
5. Calculate the estimated risk within each validation set for each algorithm using Z. Average the risks across validation sets, resulting in one estimated cross-validated risk for each algorithm.
6. The discrete super learner selects the algorithm with the smallest cross-validated risk.

Page 20

Oracle Properties

• Assume K algorithms in the library of algorithms.

• The oracle selector chooses the algorithm with the smallest risk under the true data-generating distribution, P_0.

• However, the oracle selector is unknown, since it depends on both the observed data and P_0.

• The oracle inequality (van der Laan et al. 2004) suggests the discrete super learner performs as well as the oracle selector, up to a second-order term.

• The loss function must be bounded; then we will perform as well as the algorithm that is the risk minimizer of the expected loss function.

• The number of algorithms, K, in the library can grow with sample size.

Page 21

Ensemble Super Learner - Wisdom of the Crowd (van der Laan, Polley, and Hubbard 2007)

A stacking algorithm that takes multiple algorithms as inputs can outperform a single algorithm in realistic nonparametric and semiparametric statistical models.

• Define the parameter of interest, Q̄_0 = E_0(Y | A, W), as the minimizer of the expected squared-error loss:

Q̄_0 = argmin_{Q̄} E_0 L(O, Q̄), where L(O, Q̄) = (Y − Q̄(A, W))².

• The risk, E_0 L(O, Q̄), evaluates a candidate Q̄, and it is minimized at the optimal choice Q̄_0.

Page 22

Super Learner: How it works

• Provides a family of weighted combinations of the algorithms in the library of learners, indexed by the weight vector α.

• The family of weighted combinations includes only those α-vectors that sum to one and whose entries are non-negative.

• Example for a binary outcome:

P_n(Y = 1 | Z) = expit(α_{a,n} Z_a + α_{b,n} Z_b + ... + α_{p,n} Z_p),

where Z are the candidate learners' predictions.

• The (cross-validated) probabilities of death (Z) for each algorithm are used as inputs in a working (statistical) model to predict the outcome Y.

• Deriving the weights is formulated as a regression of the outcomes Y on the predicted values of the algorithms (Z); see the sketch below.
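A minimal sketch using the SuperLearner R package (van der Laan et al. 2007); the simulated data and the three-wrapper library are illustrative assumptions, not the deck's analysis.

```r
library(SuperLearner)

set.seed(27)
n <- 200
W1 <- rnorm(n); W2 <- rnorm(n)
A <- rbinom(n, 1, plogis(W1))
Y <- rbinom(n, 1, plogis(A + W1 - W2))  # binary outcome, as in the example

sl <- SuperLearner(Y = Y, X = data.frame(A, W1, W2), family = binomial(),
                   SL.library = c("SL.glm", "SL.mean", "SL.step"),
                   cvControl = list(V = 10))
sl$cvRisk                    # cross-validated risk of each algorithm
sl$coef                      # ensemble weights alpha (non-negative, sum to 1)
names(which.min(sl$cvRisk))  # what the discrete super learner would pick
```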

Page 23

Ensemble Super Learner: How well it works

The super learner improves asymptotically on the discrete super learner by working with a larger library.

• Asymptotic results: in realistic scenarios (where none of the algorithms is a correctly specified parametric model), the discrete super learner performs asymptotically as well as the oracle.

• When the collection of algorithms contains a correctly specified parametric statistical model, the super learner will approximate the truth as fast as that parametric statistical model, although it will be more variable.

Page 24

Outline

1 Introduction

2 Statistical Roadmap

3 Causal Inference

4 Statistical Model and SuperLearning

5 TMLE

6 LIMMA

7 Data Analysis

Page 25

First MLE

• A maximum likelihood estimator for a parametric statistical model {p_θ : θ} is defined as the maximizer, over all densities in the parametric statistical model, of the empirical mean of the log density:

θ_n = argmax_θ Σ_{i=1}^n log p_θ(O_i).

• This discussion applies equally when the log-density loss L(p)(O) is replaced by any other loss function L(Q) for a relevant part Q_0 of p_0, satisfying E_0 L(Q_0)(O) ≤ E_0 L(Q)(O) for each possible Q; a sketch follows below.
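As a toy illustration of the display above (not from the slides), R code maximizing the empirical mean of the log density over a normal location-scale working model:

```r
## MLE over {p_theta}: maximize sum_i log p_theta(O_i) for a N(mu, sigma^2)
## working model; the data are a hypothetical sample
set.seed(27)
O <- rnorm(100, mean = 2, sd = 1.5)

neg_loglik <- function(theta) {
  -sum(dnorm(O, mean = theta[1], sd = exp(theta[2]), log = TRUE))
}
opt <- optim(c(0, 0), neg_loglik)  # minimize the negative log-likelihood
c(mu_n = opt$par[1], sigma_n = exp(opt$par[2]))  # theta_n, the MLE
```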

Page 26

MLE - Substitution Estimator

Ψ(P0) for the causal risk difference can be written as the g-formula:

Ψ(P_0) = Σ_w [ Σ_y y P_0(Y = y | A = a_h, W = w) − Σ_y y P_0(Y = y | A = a_l, W = w) ] P_0(W = w),

where

P_0(Y = y | A = a, W = w) = P_0(W = w, A = a, Y = y) / Σ_y P_0(W = w, A = a, Y = y)

is the conditional probability distribution of Y = y, given A = a, W = w, and

P_0(W = w) = Σ_{y,a} P_0(W = w, A = a, Y = y).

Page 27

MLE - Substitution

• Maximum-likelihood-based substitution estimators of the g-formula are obtained by substituting a maximum-likelihood-based estimator of Q_0 into the parameter mapping Ψ(Q_0).

• The marginal distribution of W can be estimated with the nonparametric maximum likelihood estimator, which happens to be the empirical distribution that puts mass 1/n on each W_i, i = 1, ..., n:

ψ_n = Ψ(Q_n) = (1/n) Σ_{i=1}^n {Q̄_n(1, W_i) − Q̄_n(0, W_i)},

where this estimate is obtained by plugging Q_n = (Q̄_n, Q_{W,n}) into the mapping Ψ; a sketch follows below.
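A minimal sketch of this plug-in estimator in R, assuming simulated point-treatment data and a simple logistic fit standing in for the (super learner) estimate of Q̄_0; all names are illustrative.

```r
## Hypothetical data: confounder W, binary treatment A, binary outcome Y
set.seed(27)
n <- 500
W <- rnorm(n)
A <- rbinom(n, 1, plogis(W))
Y <- rbinom(n, 1, plogis(-1 + A + W))

## Initial regression Qbar_n (a super learner fit in practice)
Qbar <- glm(Y ~ A + W, family = binomial())

## Substitution: average Qbar_n(1, W_i) - Qbar_n(0, W_i) over the empirical
## distribution of W (mass 1/n on each W_i)
Q1 <- predict(Qbar, newdata = data.frame(A = 1, W), type = "response")
Q0 <- predict(Qbar, newdata = data.frame(A = 0, W), type = "response")
psi_n <- mean(Q1 - Q0)
psi_n
```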

Page 28

MLE - fitting Q0

• MLE using regression in a parametric working model. Q̄_0(A, W) is estimated using regression in a parametric working (statistical) model and plugged into the formula given previously.

• ML-based super learning. Estimate Q̄_0 with the super learner, in which the collection of estimators may include stratified maximum likelihood estimators, maximum likelihood estimators based on dimension reductions implied by the propensity score, and maximum likelihood estimators based on parametric working models, besides many other machine learning algorithms for estimation of Q̄_0.

Page 29

TMLE: Targeted Maximum Likelihood Estimation

• TMLE (van der Laan and Rubin 2006) produces a well-defined, unbiased, efficient substitution estimator of target parameters of a data-generating distribution.

• It is generally an iterative procedure (though there is now a general one-step version; van der Laan and Gruber 2016) that updates an initial (super learner) estimate of the relevant part Q_0 of the data-generating distribution P_0, possibly using an estimate of a nuisance parameter g_0.

• Like the corresponding A-IPTW estimators (Robins and Rotnitzky 1992), it removes the asymptotic residual bias of the initial estimator for the target parameter if it uses a consistent estimator of g_0; it is thus doubly robust.

• If the initial estimator was consistent for the target parameter, the additional fitting of the data in the targeting step may remove finite-sample bias, and it preserves the consistency property of the initial estimator.

Page 30

TMLE and Machine Learning

• Natural use of machine learning methods for the estimation of both Q_0 and g_0.

• Focuses effort on achieving minimal bias and the asymptotic semiparametric efficiency bound for the variance, while still getting inference (under some assumptions).

Page 31

TMLE Algorithm

[Figure: the TMLE algorithm. Inputs: the observed data random variables O_1, ..., O_n; the statistical model (the set of possible probability distributions of the data, containing the true distribution P_0); and the target parameter map Ψ(). An initial estimator P_n^0 of the probability distribution of the data is updated into a targeted estimator P_n^*. Values of the target parameter are mapped to the real line, with better estimates closer to the true value (estimand) Ψ(P_0): Ψ(P_n^*) improves on Ψ(P_n^0).]

From van der Laan and Rose (2011), Chapter 5, "Understanding TMLE": TMLE is a two-step procedure where one first obtains an estimate of the data-generating distribution P_0, or the relevant portion Q_0 of P_0. The second stage updates this initial fit in a step targeted toward making an optimal bias-variance trade-off for the parameter of interest Ψ(Q_0), instead of the overall density P_0. The procedure is double robust and can incorporate data-adaptive likelihood-based estimation procedures to estimate Q_0 and the treatment mechanism. The double robustness of TMLE has important implications in both randomized controlled trials and observational studies, with potential reductions in bias and gains in efficiency.

Page 32

Example: TMLE for the Average Causal Effect

NPSEM/SCM for a point-treatment data structure with missing outcome:

W = f_W(U_W), A = f_A(W, U_A), ∆ = f_∆(W, A, U_∆), Y = f_Y(W, A, ∆, U_Y).

We can now define counterfactuals Y_{1,1} and Y_{0,1} corresponding to interventions setting A and ∆.

The additive causal effect E Y_{1,1} − E Y_{0,1} equals:

Ψ(P) = E[E(Y | A = 1, ∆ = 1, W) − E(Y | A = 0, ∆ = 1, W)].

Page 33

Example: TMLE for "Causal" Risk Difference:E0,W{E0(Y | A = ah, W )− E0(Y | A = al , W )}

• Generate an initial estimator of P0n of P; we estimate

E (Y | A,∆ = 1,W ).

• Fluctuate this initial estimator with a logistic regression:

logitP0n (ε)(Y = 1 | A,∆ = 1,W ) = logitP0

n (Y = 1 | A,∆ = 1,W )+εh

whereh(A,W ) = 1

Π(A,W )

( I(A = ah)g(ah |W ) −

I(A = al )g(al |W

)andg(a |W ) = P(A = a |W ) Treatment MechanismΠ(A,W ) = P(∆ = 1 | A,W ) Missingness Mechanism.

Page 34

TMLE for risk difference (cont.)

• Let ε_n be the maximum likelihood estimator and P_n^* = P_n^0(ε_n).

• The TMLE is given by Ψ(P_n^*).

• Specifically, in the case of the risk difference:

logit Q̄_n^1(A, W) = logit Q̄_n^0(A, W) + ε_n H_n^*(A, W).

• This parametric working model incorporates information from g_n, through H_n^*(A, W), into an updated regression; a sketch follows below.
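A minimal sketch of this targeting step in R, for the simpler case without missingness (Π ≡ 1) and comparison levels (a_h, a_l) = (1, 0); the simulated data and the logistic fits are hypothetical stand-ins for the initial super learner fit.

```r
set.seed(27)
n <- 500
W <- rnorm(n)
A <- rbinom(n, 1, plogis(W))
Y <- rbinom(n, 1, plogis(-1 + A + W))

## Initial estimator Qbar0_n and treatment mechanism g_n
Qfit <- glm(Y ~ A + W, family = binomial())
gfit <- glm(A ~ W, family = binomial())
g1 <- predict(gfit, type = "response")      # g_n(1 | W)

## Clever covariate H*_n(A, W), with Pi(A, W) = 1 (no missingness)
H <- A / g1 - (1 - A) / (1 - g1)
QAW <- predict(Qfit, type = "response")     # Qbar0_n(A, W)

## Fluctuation: logistic regression of Y on H with the initial fit as offset
eps <- coef(glm(Y ~ -1 + H + offset(qlogis(QAW)), family = binomial()))

## Updated fit Qbar1_n and the TMLE of the risk difference
Q1W <- predict(Qfit, newdata = data.frame(A = 1, W), type = "response")
Q0W <- predict(Qfit, newdata = data.frame(A = 0, W), type = "response")
Q1star <- plogis(qlogis(Q1W) + eps / g1)        # H*(1, W) =  1 / g_n(1 | W)
Q0star <- plogis(qlogis(Q0W) - eps / (1 - g1))  # H*(0, W) = -1 / g_n(0 | W)
psi_tmle <- mean(Q1star - Q0star)
psi_tmle
```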

Page 35

Inference (Standard errors) via Influence Curve (IC)

• The influence curve for the estimator is:

IC_n(O_i) = ( I(A_i = a_h) / g_n(a_h | W_i) − I(A_i = a_l) / g_n(a_l | W_i) ) (Y_i − Q̄_n^1(A_i, W_i)) + Q̄_n^1(a_h, W_i) − Q̄_n^1(a_l, W_i) − ψ_{TMLE,n}.

• Sample variance of the estimated influence curve:

S²(IC_n) = (1/n) Σ_{i=1}^n (IC_n(O_i))².

• Use the sample variance to estimate the standard error of our estimator:

SE_n = √(S²(IC_n) / n).

• Use this to derive uncertainty measures (p-values, confidence intervals, etc.), as sketched below.
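Continuing the TMLE sketch above (same hypothetical R session and variable names), influence-curve-based inference looks like:

```r
## Plug-in influence curve at the targeted fit
QAWstar <- plogis(qlogis(QAW) + eps * H)
IC <- H * (Y - QAWstar) + Q1star - Q0star - psi_tmle

S2 <- mean(IC^2)                    # sample variance of the estimated IC
se <- sqrt(S2 / n)                  # SE of the TMLE
ci <- psi_tmle + c(-1, 1) * qnorm(0.975) * se
p  <- 2 * pnorm(-abs(psi_tmle / se))
round(c(psi = psi_tmle, se = se, lo = ci[1], hi = ci[2], p = p), 4)
```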

Page 36

Repeating Estimates of Variable Importance One Biomarker at a Time

• Suppose this is repeated for j = 1, ..., J different biomarkers, so that one has, for each j:

Ψ_j(Q_{j,n}^*), S_j²(IC_{j,n}),

i.e., an estimate of variable importance and its standard error for all J.

• We propose an existing joint-inferential procedure that can add some finite-sample robustness to an estimator that can be highly variable.

Page 37

Outline

1 Introduction

2 Statistical Roadmap

3 Causal Inference

4 Statistical Model and SuperLearning

5 TMLE

6 LIMMA

7 Data Analysis

Page 38

LIMMA: Linear Models for Microarray Data

• Thus, one can define a standard t-test statistic for a general (asymptotically linear) parameter estimate (over j = 1, ..., J) as:

t_j = √n (Ψ_j(P_n) − ψ_0) / S_j(IC_{j,n}).

• Consider the moderated t-statistic proposed by Smyth (2005):

t̃_j = √n (Ψ_j(P_n) − ψ_0) / S̃_j,

where the posterior estimate of the variance of the influence curve is

S̃_j² = (d_0 S_0² + d_j S_j²(IC_{j,n})) / (d_0 + d_j).

Page 39

Implement for Any Asymptotically Linear Parameter

• Treat it like a one-sample problem: an estimate of the parameter with an associated SE from the IC.

• Just need to get the estimate for each j, as well as the plug-in IC for every observation for that j, and repeat for all j.

• Transform the data into a J × n matrix whose new entries are:

Y*_{j,i} = IC_{j,n}(O_i; P_n) + Ψ_j(P_n).

• Since the average of the IC_{j,n} across the columns (units) for a single j will be 0, the average of this transform will be the original estimate Ψ_j(P_n).

• For simplicity, assume the null value is ψ_0 = 0 for all j. Then running the limma package on this transform, Y*_{j,i}, will generate multiple-testing corrections based on the moderated statistics t̃_j presented above; a sketch follows below.
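A minimal sketch of this transform-then-limma step in R (Bioconductor package limma); since it does not rerun the TMLEs, the J × n matrix of estimates and influence-curve values is simulated here purely for illustration.

```r
library(limma)  # Bioconductor package (Smyth 2004, 2005)

## Hypothetical inputs: J parameter estimates and a J x n matrix of IC values
set.seed(27)
J <- 1000; n <- 167
psi <- rnorm(J, 0, 0.05)           # Psi_j(P_n), j = 1, ..., J
IC  <- matrix(rnorm(J * n), J, n)  # IC_{j,n}(O_i; P_n)
IC  <- IC - rowMeans(IC)           # each row of ICs averages to 0

## Transform: Y*_{j,i} = IC_{j,n}(O_i; P_n) + Psi_j(P_n)
Ystar <- sweep(IC, 1, psi, "+")

## One-sample moderated t-tests of psi_0 = 0 via an intercept-only design
fit <- eBayes(lmFit(Ystar, design = matrix(1, n, 1)))
topTable(fit, coef = 1, adjust.method = "BH")  # moderated t, adjusted p
```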

Page 40

Why LIMMA approach in this context?

• Oftentimes these analyses are based on relatively small samples.

• We want data-adaptive estimates, but at least with the standard implementations of these estimators (estimating-equation, substitution, ...), the SEs can be non-robust.

• Practically, one can get "significant" estimates of variable importance measures that are driven by poorly estimated, underestimated S_j²(IC_{j,n}).

• LIMMA shrinks these S_j²(IC_{j,n}) by making the small ones bigger, and thus takes biomarkers with small parameter estimates but very small S_j²(IC_{j,n}) out of statistical significance.

• Also, it just seems to work very well...

Page 41

Outline

1 Introduction

2 Statistical Roadmap

3 Causal Inference

4 Statistical Model and SuperLearning

5 TMLE

6 LIMMA

7 Data Analysis

Page 42

GLUE Study

• Goal: ascertain relative variable importance based on the estimand for the adjusted risk difference discussed above.

• Because expression was generally fairly low, and we wanted to discretize expression, we used a cut-off that (roughly) separated some expression from background.

• Though there are thousands of expression values, we focus here on a few genes known to be involved in coagulation.

Page 43

GLUE Study Expression Data - coagulation pathway genes

Page 44

Substitution based on Simple Parametric Model vs. SL

[Figure: variable importance (VI, roughly −1.0 to 0.5) over time (0–400 hrs) for the genes ProC, F2, and F8. Left panel: substitution estimator based on a simple parametric model; right panel: TMLE with super learning (T-MLE(SL)).]

Page 45

References

Antoine Chambaz, Pierre Neuvial, and Mark J. van der Laan. Estimation of a non-parametric variable importance measure of a continuous exposure. Electronic Journal of Statistics, 6:1059, 2012.

S. Ritter, N. Jewell, and A. Hubbard. multiPIM: An R package for variable importance analysis. Journal of Statistical Software, 57(8), 2014.

J. M. Robins and A. Rotnitzky. Recovery of information and adjustment for dependent censoring using surrogate markers. In AIDS Epidemiology: Methodological Issues. Birkhäuser, 1992.

G. K. Smyth. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology, 3(1), 2004. URL http://www.bepress.com/sagmb/vol3/iss1/art3.

Page 46

References

Gordon K. Smyth. Limma: linear models for microarray data. In R. Gentleman, V. Carey, S. Dudoit, R. Irizarry, and W. Huber, editors, Bioinformatics and Computational Biology Solutions Using R and Bioconductor, pages 397–420. Springer, New York, 2005.

Ronald G. Tompkins. Genomics of injury: The Glue Grant experience. The Journal of Trauma and Acute Care Surgery, 78(4):671, 2015.

M. van der Laan and S. Rose. Targeted Learning: Causal Inference for Observational and Experimental Data. Springer, 2011.

Mark van der Laan and Susan Gruber. One-step targeted minimum loss-based estimation based on universal least favorable one-dimensional submodels. The International Journal of Biostatistics, 12(1):351–378, 2016.

Page 47

References

Mark J. van der Laan, Eric C. Polley, and Alan E. Hubbard. Super learner. Statistical Applications in Genetics and Molecular Biology, 6: Article 25, 2007. doi: 10.2202/1544-6115.1309.

M. J. van der Laan. Statistical inference for variable importance. International Journal of Biostatistics, 2(1), 2006. URL http://www.bepress.com/ijb/vol2/iss1/2.

M. J. van der Laan and D. B. Rubin. Targeted maximum likelihood learning. International Journal of Biostatistics, 2(1), Article 11, 2006. URL http://www.bepress.com/ijb/vol2/iss1/11.

M. J. van der Laan, S. Dudoit, and S. Keles. Asymptotic optimality of likelihood-based cross-validation. Statistical Applications in Genetics and Molecular Biology, 3(1): Article 4, 2004.
