Intro to nonsmooth inference
Eric B. Laber
Department of Statistics, North Carolina State University
April 2019, SAMSI
![Page 2: Intro to nonsmooth inference - Nc State Universitydavidian/dtr19/dtr9.pdf · Intro to nonsmooth inference Eric B. Laber Department of Statistics, North Carolina State University April](https://reader034.fdocuments.net/reader034/viewer/2022050604/5fabea4140ec6a377b40e8c1/html5/thumbnails/2.jpg)
Last time

- SMARTs are the gold standard for estimation and evaluation of txt regimes
- Highly configurable, but choices should be driven by the science
- Looked at examples with varying scientific/clinical goals, which lead to different timing, txt options, response criteria, etc.
- Often powered by simple comparisons
  - First-stage response rates
  - Fixed regimes (most- vs. least-intensive)
  - First-stage txts (problematic)
- If the test statistic is regular and asymptotically normal under the null, we can use the same basic template for power
Quick SMART review
[SMART schematic: participants are randomized (R) between PCST-Full (Treatment 0) and PCST-Brief (Treatment 1). Responders and non-responders to each first-stage txt are then re-randomized among second-stage options: PCST-Full maintenance (Treatment 2), no further treatment/intervention (Treatment 3), PCST-Plus (Treatment 4), and PCST-Brief maintenance (Treatment 5).]
Refresher

- Suppose that researchers are interested in comparing the embedded regimes:
  (e1) assign PCST-Full initially, assign PCST-Full maintenance to responders, and assign PCST-Plus to non-responders;
  (e2) assign PCST-Brief initially, assign no further intervention to responders, and assign PCST-Brief maintenance to non-responders.
- Recall our general template:
  - Test statistic: T_n = V_n(e1) − V_n(e2), where V_n is the IPWE
  - Use that √n T_n/σ_{e1,e2,n} is asy normal and reject when this is large in magnitude
Goals for today

- Introduction to inference for txt regimes
- Nonregular inference (and why we should care)
- Basic strategies with a toy problem
- Examples in one-stage problems
Warm up part I: quiz!

- Discuss with your stat buddy:
  - What are some common scenarios where series approximations or the bootstrap cannot ensure correct operating characteristics?
  - What is a local alternative?
  - How do we know if an asymptotic approximation is adequate?
- True or false:
  - If n is large, asymptotic approximations can be trusted.
  - The top review of CLT on Yelp complains about the burritos being too expensive.
  - The BBC produced a Hitler-themed sitcom titled 'Heil Honey, I'm Home' in the 1950s.
On reality and fantasy

"Your cat didn't say that. You know how I know? It's a cat. It doesn't talk. If you died, it would eat you. Starting with your face." — Matt Zabka, recently single
Asymptotic approximations

- Basic idea: study the behavior of a statistical procedure in terms of its dominating features while ignoring lower-order ones
  - Often, but not always, consider a diverging sample size
  - 'Dominating features' intentionally ambiguous
- Generate new insights and general statistical procedures, as large classes of problems share the same dominating features
- Asymptotics mustn't be applied mindlessly
  - Disgusting trend in statistics: propose a method, push through irrelevant asymptotics, handpick simulation experiments
  - Requires careful thought about what op characteristics are needed scientifically and how to ensure these hold with the kind of data that are likely to be observed
- No panacea ⇒ handcrafted construction and evaluation
Inferential questions in precision medicine

- Identify key tailoring variables
- Evaluate performance of the true optimal regime
- Evaluate performance of an estimated optimal regime
- Compare performance of two or more (possibly data-driven) regimes
- ...
Toy problem: max of means

- Simple problem that retains many of the salient features of inference for txt regimes
  - Non-smooth function of smooth functionals
  - Well-studied in the literature
- Basic notation
  - For Z_1, ..., Z_n ~ i.i.d. P comprising ind copies of Z ~ P, write

      P f(Z) = ∫ f(z) dP(z)   and   P_n f(Z) = n⁻¹ ∑_{i=1}^n f(Z_i)

  - Use '⇝' to denote convergence in distribution
  - Check: assuming the requisite moments exist, √n (P_n − P) Z ⇝ ????
Max of means

- Observe X_1, ..., X_n ~ i.i.d. P in R^p with µ_0 = PX; define

    θ_0 = ⋁_{j=1}^p µ_{0,j} = max(µ_{0,1}, ..., µ_{0,p})

- While we consider this estimand primarily for illustration, it corresponds to the problem of estimating the mean outcome under an optimal one-size-fits-all treatment recommendation, where µ_{0,j} is the mean outcome under treatment j = 1, ..., p.
Max of means: estimation

- Define µ_n = P_n X; the plug-in estimator of θ_0 is

    θ_n = ⋁_{j=1}^p µ_{n,j}

- Warm-up:
  - Three minutes trying to derive the limiting distn of √n(θ_n − θ_0)
  - Three minutes discussing your soln with your stat buddy
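As a quick numerical companion to the warm-up (a simulation sketch of my own, not from the slides): with p = 2 tied coordinate means, the plug-in estimator θ_n = max_j µ_{n,j} is biased upward in finite samples.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, n_reps = 100, 2, 2000
theta0 = 0.0                          # mu_0 = (0, 0), so theta_0 = 0

est = np.empty(n_reps)
for r in range(n_reps):
    X = rng.normal(loc=0.0, scale=1.0, size=(n, p))
    mu_n = X.mean(axis=0)             # plug-in estimate of mu_0
    est[r] = mu_n.max()               # theta_n = max of coordinate means

# With ties, E[max(Z_1, Z_2)] = 1/sqrt(pi) ~ 0.564 for ind std normals,
# so the mean of sqrt(n)(theta_n - theta_0) sits near 0.56 rather than 0.
print(np.sqrt(n) * (est - theta0).mean())
```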
Max of means: first result

- For v ∈ R^p define U(v) = arg max_j v_j

Lemma. Assume regularity conditions under which √n(P_n − P)X is asymptotically normal with mean zero and variance-covariance matrix Σ. Then

    √n(θ_n − θ_0) ⇝ ⋁_{j ∈ U(µ_0)} Z_j,

where Z ~ Normal(0, Σ).
Max of means: proof of first result
Max of means: discussion of first result

- Limiting distribution of √n(θ_n − θ_0) depends abruptly on µ_0
  - If µ_0 = (0, 0)ᵀ and Σ = I_2, the limiting distn is the max of two ind std normals
  - If µ_0 = (0, ε)ᵀ for ε > 0 and Σ = I_2, the limiting distn is std normal even if ε = 1 × 10⁻²⁷!!
- How can we use such an asymptotic result in practice?!
- Limiting distn of √n(θ_n − θ_0) depends only on the submatrix of Σ corresponding to the elements of U(µ_0). What about in finite samples?
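The abrupt dependence is easy to see numerically. A small simulation sketch (assumed setup of mine: µ_0 = (0, ε) with ε = 10⁻²⁷ and Σ = I_2): the lemma says the limit is standard normal, yet at any realistic n the statistic behaves like the max of two ind std normals, e.g. P{√n(θ_n − θ_0) ≤ 0} is near 1/4 rather than 1/2.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_reps, eps = 100, 4000, 1e-27
mu0 = np.array([0.0, eps])            # unique maximizer, but only barely
theta0 = mu0.max()

stat = np.empty(n_reps)
for r in range(n_reps):
    X = rng.normal(loc=mu0, scale=1.0, size=(n, 2))
    stat[r] = np.sqrt(n) * (X.mean(axis=0).max() - theta0)

# P(max(Z_1, Z_2) <= 0) = 1/4 for ind std normals; a std normal gives 1/2
print((stat <= 0).mean())
```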
Max of means: discussion of first result cont'd

- Suppose X_1, ..., X_n ~ i.i.d. Normal(µ_0, I_p) and µ_0 has a unique maximizer, i.e., U(µ_0) is a singleton, say {1}, so θ_0 = µ_{0,1}
  - √n(θ_n − θ_0) ⇝ Normal(0, 1)
  - In finite samples,

      P{√n(θ_n − θ_0) ≤ t} = Φ(t) ∏_{j=2}^p Φ(t + √n(θ_0 − µ_{0,j}))

    Quick break: derive this.
  - If the gaps θ_0 − µ_{0,j} are small relative to 1/√n, the finite-sample behavior can be quite different from the limit Φ(t)¹

¹Note that the limiting distribution doesn't depend on these gaps at all!
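The displayed finite-sample CDF can be checked directly. A minimal sketch (simulated normal data with gaps of 0.1 at n = 25; these choices are mine): evaluate the product formula and compare with a Monte Carlo estimate; the two agree, and both sit well below the limiting value Φ(t).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n, n_reps, t = 25, 20000, 0.5
mu0 = np.array([2.0, 1.9, 1.9])       # gaps of 0.1, i.e. 0.5/sqrt(n)
theta0 = mu0.max()

# exact finite-sample CDF from the slide
exact = norm.cdf(t) * np.prod(norm.cdf(t + np.sqrt(n) * (theta0 - mu0[1:])))

# Monte Carlo check
X = rng.normal(loc=mu0, scale=1.0, size=(n_reps, n, 3))
stat = np.sqrt(n) * (X.mean(axis=1).max(axis=1) - theta0)
mc = (stat <= t).mean()

print(exact, mc, norm.cdf(t))   # exact and mc agree; both well below Phi(t)
```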
Max of means: normal approximation in pictures

- Generate data from Normal(µ, I_6) with µ_1 = 2 and µ_j = µ_1 − δ for j = 2, ..., 6. Results shown for n = 100.
[Three histograms of √n(θ_n − θ_0), one panel each for δ = 0.5, 0.1, and 0.01; horizontal axis from −4 to 4, density axis from 0.0 to 0.6.]
Choosing the right asymptotic framework

- Dangerous pattern of thinking:
  - In practice, none of the txt effect differences are zero.
  - I'll build my asy approximations assuming a unique maximizer.
  - There are finitely many components, so the maximizer is well-separated.
  - Idea! Plug in the estimated maximizer and use the asy normal approx.
- The preceding pattern happens frequently, e.g., the oracle property in model selection, max eigenvalues in matrix problems, and txt regimes
Choosing the right asymptotic framework

- What goes wrong? After all, this thinking works well in many other settings, e.g., everything you learned in stat 101.
- Finite-sample behavior is driven by small (not necessarily zero) differences in txt effectiveness
  - We saw this analytically in the normal case
  - Intuition helped by thinking in extremes, e.g., what if all txts were equal? What if one were infinitely better than the others?
- Abrupt dependence of the limiting distribution on U(µ_0) is a red flag. It is tempting to construct procedures that will recover this limiting distn even if some txt differences are exactly zero. This is asymptotics for asymptotics' sake. Don't do it.
Asymptotic working assumptions

- A useful asy approximation should be robust to the setting where some (or all) txt differences are zero
  - Necessary but not sufficient
- Heuristic: in small samples, one cannot distinguish between small (but nonzero) txt differences, so use an asy framework which allows for exact equality.
  - This heuristic has been misinterpreted and misused in the lit.
- Some procedures we'll look at are designed for such robustness
Local asymptotics: horseshoes and hand grenades

- Allowing null txt differences is problematic
  - Asymptotically, differences are either zero or infinite²
  - Txt differences are (probably) not exactly zero
- Challenge: allow small differences to persist as n diverges
  - The local or moving-parameter asy framework does this
  - Idea: allow the generative model to change with n so that the gaps θ_0 − max_{j ∉ U(µ_0)} µ_{0,j} shrink to zero as n increases³

²In the stat sense that we have power one to discriminate between them.
³This idea should be familiar from hypothesis testing.
Triangular arrays

- For each n, X_{1,n}, ..., X_{n,n} ~ i.i.d. P_n

    Observations                            Distribution
    X_{1,1}                                 P_1
    X_{1,2}  X_{2,2}                        P_2
    X_{1,3}  X_{2,3}  X_{3,3}               P_3
    X_{1,4}  X_{2,4}  X_{3,4}  X_{4,4}      P_4
    ...                                     ...

- Define µ_{0,n} = P_n X and θ_{0,n} = ⋁_{j=1}^p µ_{0,n,j}
- Assume µ_{0,n} = µ_0 + s/√n, where s ∈ R^p is called the local parameter
- Assume √n(P_n − P_n)X ⇝ Normal(0, Σ)⁴

⁴This is true under very mild conditions on the sequence of distributions {P_n}_{n≥1}. However, given our limited time we will not discuss such conditions. See van der Vaart and Wellner (1996) for details.
Quick quiz

- Suppose that X_{1,n}, ..., X_{n,n} ~ i.i.d. Normal(µ_0 + s/√n, Σ); what is the distribution of √n(P_n − P_n)X?
Local alternatives anticipate unstable performance

Lemma. Let s ∈ R^p be fixed. Assume that for each n we observe {X_{i,n}}_{i=1}^n drawn i.i.d. from P_n, which satisfies: (i) P_n X = µ_0 + s/√n, and (ii) √n(P_n − P_n)X ⇝ Normal(0, Σ). Then, under P_n,

    √n(θ_n − θ_{0,n}) ⇝ ⋁_{j ∈ U(µ_0)} (Z_j + s_j) − ⋁_{j ∈ U(µ_0)} s_j,

where Z ~ Normal(0, Σ).

- Discussion/observations on the local limiting distn
  - Dependence of the limiting distn on s ⇒ nonregular
  - The set U(µ_0) represents the set of near-maximizers, though s_j = 0 corresponds to exact equality (so we haven't ruled this out)
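The lemma's triangular-array setup can be simulated (an illustration of my own: µ_0 = (0, 0), local parameter s = (0, 1), Σ = I_2): the √n-scaled statistic matches the s-dependent limit max(Z_1 + s_1, Z_2 + s_2) − max(s_1, s_2), not the fixed-parameter limit max(Z_1, Z_2).

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_reps = 400, 4000
s = np.array([0.0, 1.0])              # local parameter
mu0n = s / np.sqrt(n)                 # mu_0 = (0, 0), so mu_{0,n} = mu_0 + s/sqrt(n)
theta0n = mu0n.max()

stat = np.empty(n_reps)
for r in range(n_reps):
    X = rng.normal(loc=mu0n, scale=1.0, size=(n, 2))
    stat[r] = np.sqrt(n) * (X.mean(axis=0).max() - theta0n)

# the lemma's limit: max(Z_1 + s_1, Z_2 + s_2) - max(s_1, s_2)
Z = rng.normal(size=(n_reps, 2))
limit = np.maximum(Z[:, 0] + s[0], Z[:, 1] + s[1]) - s.max()

# both means are shifted away from 0 because the limit depends on s
print(stat.mean(), limit.mean())
```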
Proof of local limiting distribution
Intermission: more regularity after this
Comments on nonregularity

- Sensitivity of the estimator to local alternatives cannot be rectified through the choice of a more clever estimator
  - It is an inherent property of the estimand⁵
  - This has not stopped some from trying...
- Remainder of today's notes: a catalog of confidence intervals

⁵See van der Vaart (1991), Hirano and Porter (2012), and L. et al. (2011, 2014, 2019).
Projection region

- Idea: exploit the following two facts
  - µ_n is nicely behaved (regular, asy normal)
  - If µ_0 were known, this would be trivial⁶
- Let α ∈ (0, 1) denote an acceptable error level and ζ_{n,1−α} a confidence region for µ_0, e.g.,

    ζ_{n,1−α} = {µ ∈ R^p : n(µ_n − µ)ᵀ Σ_n⁻¹ (µ_n − µ) ≤ χ²_{p,1−α}},

  where Σ_n = P_n(X − µ_n)(X − µ_n)ᵀ
- Projection CI:

    Γ_{n,1−α} = {θ ∈ R : θ = ⋁_{j=1}^p µ_j for some µ ∈ ζ_{n,1−α}}

⁶In this problem, θ_0 is a function of µ_0 and is thus completely known when µ_0 is known. In more complicated problems, knowing the value of a nuisance parameter will make the inference problem of interest regular.
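A minimal sketch of the projection interval (not the lecture's code; the optimizer and tuning choices are mine): the upper endpoint of Γ_{n,1−α} separates coordinate-wise, while the lower endpoint is the minimum of the convex map µ ↦ max_j µ_j over the ellipsoid, found here via an epigraph reformulation and SLSQP.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

def projection_interval(X, alpha=0.05):
    n, p = X.shape
    mu_n = X.mean(axis=0)
    Sigma_n = np.cov(X, rowvar=False)
    Sigma_inv = np.linalg.inv(Sigma_n)
    c = chi2.ppf(1 - alpha, df=p)

    # sup of max_j mu_j over the ellipsoid separates coordinate-wise
    upper = np.max(mu_n + np.sqrt(c * np.diag(Sigma_n) / n))

    # inf of max_j mu_j: epigraph form with z = (t, mu); minimize t
    # subject to t >= mu_j for all j and mu inside the ellipsoid
    cons = [{"type": "ineq", "fun": lambda z, j=j: z[0] - z[1 + j]}
            for j in range(p)]
    cons.append({"type": "ineq",
                 "fun": lambda z: c - n * (mu_n - z[1:]) @ Sigma_inv @ (mu_n - z[1:])})
    z0 = np.concatenate([[mu_n.max()], mu_n])  # feasible start
    lower = minimize(lambda z: z[0], z0, constraints=cons, method="SLSQP").fun
    return lower, upper

rng = np.random.default_rng(4)
X = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2))
lo, hi = projection_interval(X)
print(lo, hi)   # contains the plug-in estimate max_j mu_{n,j}
```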
Prove the following with your stat buddy

- P(θ_0 ∈ Γ_{n,1−α}) ≥ 1 − α + o_P(1)
Comments on projection regions

- Useful when the parameter of interest is a non-smooth functional of a smooth (regular) parameter
- Robust and widely applicable, but conservative
  - Projection interval valid under local alternatives (why?)
  - Can reduce conservatism using a pre-test (L. et al., 2014)
- See Berger and Boos (1994) and Robins (2004) for seminal papers
- Consider as a first option in a new non-regular problem
Bound-based confidence intervals

- Idea: sandwich the non-smooth functional between smooth upper and lower bounds, then bootstrap the bounds to form a conf region
- Let {τ_n}_{n≥1} be a seq of pos constants such that τ_n → ∞ and τ_n = o(√n) as n → ∞; define

    U_n(µ_0) = {j : max_k √n(µ_{n,k} − µ_{n,j})/σ_{j,k,n} ≤ τ_n},

  where σ_{j,k,n} is an est of the asy std dev of √n(µ_{n,k} − µ_{n,j}).

  Note: it may help to think of U_n(µ_0) as the indices of txts that we cannot distinguish from being optimal.
Bound-based confidence intervals cont'd

- Given U_n(µ_0), define

    S_n(µ_0) = {s ∈ R^p : s_j = µ_{0,j} if j ∈ U_n(µ_0)};

  then it follows that

    U_n = sup_{s ∈ S_n(µ_0)} √n { ⋁_{j=1}^p (µ_{n,j} − µ_{0,j} + s_j) − ⋁_{j=1}^p s_j }

  is an upper bound on √n(θ_n − θ_0). (Why?) A lower bound, L_n, is constructed by replacing the sup with an inf.
Dad, where do bounds come from?

- U_n is obtained by taking the sup over all local, i.e., order 1/√n, perturbations of the generative model
  - By construction, insensitive to local perturbations ⇒ regular
  - U_n(µ_0) is a conservative est of U(µ_0); let's wave our hands:
- General strategy: look at the behavior of the (properly centered and scaled) estimator under local perturbations and take the sup/inf over the nonregular parts
Bootstrapping the bounds

- Both U_n and L_n are regular, and their distns can be consistently estimated via the nonpar bootstrap
- Let u⁽ᵇ⁾_{n,1−α/2} be the (1 − α/2) × 100 percentile of the bootstrap distn of U_n and l⁽ᵇ⁾_{n,α/2} the (α/2) × 100 percentile of the bootstrap distn of L_n
- Bound-based confidence interval:

    [θ_n − u⁽ᵇ⁾_{n,1−α/2}/√n,  θ_n − l⁽ᵇ⁾_{n,α/2}/√n]
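A rough sketch of the bound-based interval, under two assumptions of mine that go beyond the slides: the tuning choice τ_n = √(log n), and the simplification that, for this max-of-means problem, the sup/inf over S_n reduce to the max/min of √n(µ⁽ᵇ⁾_{n,j} − µ_{n,j}) over the estimated near-maximizer set.

```python
import numpy as np

def bound_ci(X, alpha=0.05, n_boot=500, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    tau = np.sqrt(np.log(n))                   # tuning choice (assumption)
    mu_n = X.mean(axis=0)
    theta_n = mu_n.max()

    def near_max_set(Z):
        """Arms whose standardized deficit from every other arm is <= tau."""
        m, v = Z.mean(axis=0), Z.var(axis=0, ddof=1)
        se = np.sqrt(v[None, :] + v[:, None])  # se for mu_k - mu_j (ind coords)
        T = np.sqrt(len(Z)) * (m[None, :] - m[:, None]) / se
        return np.where(T.max(axis=1) <= tau)[0]

    U_b = np.empty(n_boot)
    L_b = np.empty(n_boot)
    for b in range(n_boot):
        Xb = X[rng.integers(0, n, size=n)]     # nonparametric bootstrap draw
        idx = near_max_set(Xb)
        c = np.sqrt(n) * (Xb.mean(axis=0)[idx] - mu_n[idx])
        U_b[b], L_b[b] = c.max(), c.min()      # bootstrap upper/lower bounds

    u = np.quantile(U_b, 1 - alpha / 2)
    l = np.quantile(L_b, alpha / 2)
    return theta_n - u / np.sqrt(n), theta_n - l / np.sqrt(n)

rng = np.random.default_rng(5)
X = rng.normal(loc=[0.0, 0.0, 0.0], scale=1.0, size=(200, 3))
lo, hi = bound_ci(X)
print(lo, hi)
```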
Bound-based intervals discussion

- General approach; applies to implicitly defined estimators as well as those with closed-form expressions like the one considered here
- Less conservative than the projection interval but still conservative; such conservatism is unavoidable
  - The bounds are tightest in some sense
  - Bounding the quantiles directly rather than the estimand may reduce conservatism, though possibly at the price of addl complexity
- See Fan et al. (2017) for other improvements/refinements
Bootstrap methods

- The bootstrap is not consistent here without modification
  - Due to instability (nonregularity)
  - Nondifferentiability of the max operator causes this instability (see Shao 1994 for a nice review)
- The bootstrap is appealing for complex problems
  - Doesn't require explicitly computing asy approximations⁷
  - Higher-order convergence properties

⁷There are exceptions to this, including the parametric bootstrap and those based on quadratic expansions.
How about some witchcraft?

- The m-out-of-n bootstrap can be used to create valid confidence intervals for non-smooth functionals
- Idea: resample datasets of size m_n = o(n) so that sample-level parameters converge 'faster' than their bootstrap analogs (i.e., witchcraft)
m-out-of-n bootstrap

- Accepting some components of witchcraft on faith:
  - √m_n(µ⁽ᵇ⁾_{m_n} − µ_n) ⇝ Normal(0, Σ) conditional on the data
  - See Arcones and Giné (1989) for details
- An even toyier example than our toy example: W_1, ..., W_n i.i.d. w/ mean µ and variance σ²; derive the limit distns of √n(|W̄_n| − |µ|) and √m_n(|W̄⁽ᵇ⁾_{m_n}| − |W̄_n|)
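The toyier example can be simulated (a sketch with my own choices n = 50,000 and m_n = 50, so m_n = o(n)): at the nonregular point µ = 0, the m-out-of-n bootstrap distribution of √m_n(|W̄⁽ᵇ⁾_{m_n}| − |W̄_n|) tracks the true limit |Z| (median ≈ 0.674), while the n-out-of-n version need not.

```python
import numpy as np

rng = np.random.default_rng(6)
n, m, B = 50_000, 50, 1000            # m_n = o(n); these choices are mine
W = rng.normal(0.0, 1.0, size=n)      # mu = 0: the nonregular point
Wbar = W.mean()

boot_n = np.empty(B)
boot_m = np.empty(B)
for b in range(B):
    boot_n[b] = np.sqrt(n) * (abs(W[rng.integers(0, n, n)].mean()) - abs(Wbar))
    boot_m[b] = np.sqrt(m) * (abs(W[rng.integers(0, n, m)].mean()) - abs(Wbar))

# the true limit of sqrt(n)(|Wbar_n| - |mu|) at mu = 0 is |Z|, median ~ 0.674
print(np.median(boot_m), np.median(boot_n))
```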
m-out-of-n bootstrap with max of means

- Derive the limiting distribution of √m_n(θ⁽ᵇ⁾_{m_n} − θ_n):
Intermission

"I wouldn't want to wind up hooked to a bunch of wires and tubes, unless somehow the wires and tubes were keeping me alive." — Don Alden Adams
Finally! Back to treatment regimes (briefly)

- Consider a one-stage problem with observed data {(X_i, A_i, Y_i)}_{i=1}^n, where X ∈ R^p, A ∈ {−1, 1}, and Y ∈ R
- Assume the requisite causal conditions hold
- Assume linear rules π(x) = sign(xᵀβ), where β ∈ R^p and x might contain polynomial terms, etc.
Warm-up! Derive the limiting distn of the parameters in linear Q-learning!!!!⁸

- Posit the linear model Q(x, a; β) = x_0ᵀβ_0 + a x_1ᵀβ_1, indexed by β = (β_0ᵀ, β_1ᵀ)ᵀ, with x_0, x_1 known features.
- β_n = arg min_β P_n{Y − Q(X, A; β)}² and β* = arg min_β P{Y − Q(X, A; β)}²
- Derive the limiting distribution of √n(β_n − β*)
- Construct a confidence interval for Q(x, a), assuming Q(x, a) = Q(x, a; β*)

⁸He exclaimed.
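A worked numerical version of the warm-up on simulated data (the generative model and the homoskedastic-error variance formula are my choices, not the slides'): fit Q(x, a; β) = x_0ᵀβ_0 + a x_1ᵀβ_1 by least squares and form a Wald CI for Q(x, a).

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # x0 = x1 = (1, x)
A = rng.choice([-1, 1], size=n)
beta0_true = np.array([1.0, 0.5])
beta1_true = np.array([0.25, -0.5])
Y = X @ beta0_true + A * (X @ beta1_true) + rng.normal(size=n)

# least squares on the design [x0, a * x1]
D = np.column_stack([X, A[:, None] * X])
beta_n, *_ = np.linalg.lstsq(D, Y, rcond=None)
resid = Y - D @ beta_n
sigma2 = resid @ resid / (n - D.shape[1])
cov = sigma2 * np.linalg.inv(D.T @ D)     # homoskedastic-error variance

# Wald CI for Q(x, a) at x = (1, 0.3), a = 1; true Q(x, 1) here is 1.25
x, a = np.array([1.0, 0.3]), 1.0
d = np.concatenate([x, a * x])
q_hat = d @ beta_n
se = np.sqrt(d @ cov @ d)
ci = (q_hat - 1.96 * se, q_hat + 1.96 * se)
print(q_hat, ci)
```

Q(x, a; β) is linear in β, so the CI follows directly from the asymptotic normality of √n(β_n − β*) that the warm-up asks you to derive.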
Parameters in (1-stage) Q-learning are easy!

I Similar arguments show that the coefficients indexing g-computation and outcome weighted learning are asymptotically normal

I Preview: consider the (regression-based) estimator of the value of π̂n(x) = sign(x1ᵀβ̂1,n), which you'll recall is

V̂n(β̂1,n) = Pn max_a Q(X, a; β̂n) = Pn X0ᵀβ̂0,n + Pn|X1ᵀβ̂1,n|

What is the limit of √n{V̂n(β̂1,n) − V(β̂1,n)}, and can we use it to derive a CI for V(β̂n)? What about a CI for V(β∗)?
Parameters in (1-stage) OWL are easy!

I To illustrate, assume P(A = 1|X) = P(A = −1|X) = 1/2 w.p. 1

I Recall OWL is based on a convex relaxation of the IPWE

V̂n(β) = Pn[ Y 1{A = sign(Xᵀβ)} / P(A|X) ] = 2 Pn Y 1{A Xᵀβ > 0}

I Let ℓ : R → R be convex; the OWL estimator is

β̂n = arg min_{β∈Rp} Pn |Y| ℓ(Wᵀβ), where W = sign(Y) A X
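The weighted-classification form above can be sketched in a few lines. The slides leave ℓ generic (hinge loss is the usual OWL choice); the sketch below uses the smooth convex surrogate ℓ(u) = log(1 + e^{−u}) so plain gradient descent suffices. The data-generating model is hypothetical, with P(A = 1|X) = 1/2 by design.

```python
import numpy as np

# OWL sketch: minimize Pn |Y| ell(W'beta), W = sign(Y) A X, with the
# assumed surrogate ell(u) = log(1 + exp(-u)). Toy data; true optimal
# rule is sign(x1) because the treatment effect is x1.
rng = np.random.default_rng(1)
n = 2000
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 2))])
A = rng.choice([-1.0, 1.0], size=n)
Y = 1.0 + A * X[:, 1] + 0.25 * rng.normal(size=n)
W = np.sign(Y)[:, None] * A[:, None] * X          # W = sign(Y) A X
w = np.abs(Y)                                     # weights |Y|

beta = np.zeros(3)
for _ in range(2000):                             # gradient descent; loss is convex
    u = W @ beta
    p = 1.0 / (1.0 + np.exp(np.clip(u, -30, 30))) # equals -ell'(u)
    grad = -(W * (w * p)[:, None]).mean(axis=0)
    beta -= 0.5 * grad

# the fitted rule sign(x'beta) should mostly match the true rule sign(x1)
agree = np.mean(np.sign(X @ beta) == np.sign(X[:, 1]))
```

Note the convexity in β: the objective is a nonnegatively weighted average of convex functions of a linear map, which is exactly the structural fact the next slide exploits.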
Some facts about OWL (and more generally convex M-estimators)

I β ↦ |y|ℓ(wᵀβ) is the composition of a linear and a convex function and thus convex in β for each (y, w) ⇒ greatly simplifies inference!

I Regularity conditions

I β∗ = arg min_β P|Y|ℓ(Wᵀβ) exists and is unique

I The map β ↦ P|Y|ℓ(Wᵀβ) is differentiable in a neighborhood of β∗9

I Under these conditions, √n(β̂n − β∗) is regular10 and asymptotically normal ⇒ many results from Q-learning port over

9More formally, require |y|ℓ(wᵀ(β∗ + δ)) − |y|ℓ(wᵀβ∗) = S(y, w; β∗)ᵀδ + R(y, w, δ; β∗), where PS(Y, W; β∗) = 0, Σ_O = P S(Y, W; β∗)S(Y, W; β∗)ᵀ is finite, and PR(Y, W, δ; β∗) = (1/2)δᵀΩ_O δ + o(‖δ‖²) (Haberman, 1989; Niemiro, 1992; Hjort and Pollard, 2011).

10I am being a bit loose with language in this course by referring to both estimands and rescaled estimators as 'regular' or 'non-regular.'
Value function(s)

I Three ways to measure performance

I Conditional value: V(π̂n) = P Y∗(π̂n) = E{Y∗(π̂n) | π̂n}, measures the performance of an estimated decision rule as if it were to be deployed in the population (note: this is a random variable)

I Unconditional value: V̄n = E V(π̂n), measures the average performance of the algorithm used to construct π̂n with a sample of size n

I Population-level value: V(π∗), where π∗(x) = sign(xᵀβ∗), measures the potential of applying a precision medicine strategy in a given domain if the algorithm for constructing π̂n will be used

I Discuss these measures with your stat buddy. Is there a meaningful distinction as the sample size grows large?
It's a wacky world out there

I The three value measures need not coincide asymptotically

I Let π̂n(x) = sign(xᵀβ̂n) and suppose √n β̂n ⇝ Normal(0, Σ), so that β∗ ≡ 0 and π∗(x) ≡ −1. With your stat buddy, compute:

I V̂n(β̂n) ⇝ ????

I V̄n = E V(β̂n) → ????

I V(β∗) = ????
Calculon!
Have some confidence you useless pile!

I We'll construct confidence sets for V(β̂n) and V(β∗) as these are most commonly of interest in application

I Starting with the conditional value, assume that the data-generating model is a triangular array P_n such that:

(A0) π̂n(x) = sign(xᵀβ̂n)

(A1) There exists β∗n s.t. β∗n = β∗ + s/√n for some s ∈ Rp, and √n(β̂n − β∗n) = √n(Pn − P_n)u(X, A, Y) + o_{P_n}(1), where u does not depend on s, sup_n P_n‖u(X, A, Y)‖² < ∞, and Cov u(X, A, Y) is p.d.

(A2) If F is a uniformly bounded Donsker class and √n(Pn − P) ⇝ T in ℓ∞(F) under P, then √n(Pn − P_n) ⇝ T in ℓ∞(F) under P_n.

(A3) sup_n P_n‖Y‖² < ∞.

I A detailed discussion is beyond the scope of this class. Laber will wave his hands a bit. Our goal is to understand the key steps/ideas.
Building block: joint distribution before the nonsmooth operator

I Define the class of functions

G = { g(X, A, Y; δ) = Y 1{A Xᵀδ > 0} 1{Xᵀβ∗ = 0} : δ ∈ Rp },

and view √n(Pn − P_n) as a random element of ℓ∞(Rp).

Lemma. Assume (A0)-(A3). Then

√n( Pn − P_n, β̂n − β∗, (Pn − P_n) Y 1{A Xᵀβ∗ > 0} ) ⇝ (T, Z, W)

in ℓ∞(Rp) × Rp × R under P_n.
Limiting distribution of V̂n(β̂n)

Corollary. Assume (A0)-(A3). Then

√n{V̂n(β̂n) − V(β̂n)} ⇝ T(Z + s) + W.

I Notes

I The presence of s shows this is nonregular

I T is a Brownian bridge indexed by Rp

I W and Z are normal
Hand-waving!
Bound-based confidence interval

I Limiting distribution: T(Z + s) + W

I The local parameter appears only in the first term

I An (asymptotic) bound therefore need only affect this term

I Schematic for constructing a bound

I Partition the input space into points that are 'near' the decision boundary xᵀβ∗ = 0 vs. those that are 'far' from the boundary

I Take the sup/inf over local perturbations of points in the 'near' group
Upper bound

I Let Σ̂n be an estimator of the asymptotic variance of β̂n; an upper bound on √n{V̂n(β̂n) − V(β̂n)} is

Un = sup_{ω∈Rp} √n(Pn − P_n) Y 1{A Xᵀω > 0} 1{ n(Xᵀβ̂n)² / (Xᵀ Σ̂n X) ≤ τn }
+ √n(Pn − P_n) Y 1{A Xᵀβ̂n > 0} 1{ n(Xᵀβ̂n)² / (Xᵀ Σ̂n X) > τn },

where τn is a sequence of tuning parameters s.t. τn → ∞ and τn = o(n) as n → ∞. The lower bound Ln is constructed by replacing sup with inf.
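The near/far decomposition can be sketched numerically. Everything below is a toy illustration under assumed simplifications: p = 2, a grid of directions standing in for the sup over ω ∈ Rp, and a single nonparametric bootstrap draw standing in for the centered process √n(Pn − P); it is not a faithful implementation of the bound.

```python
import numpy as np

# Toy sketch of U_n and L_n: split points into 'near'/'far' using the
# studentized distance to the boundary, sup/inf the 'near' term over a
# direction grid, and keep the plug-in rule on the 'far' term.
rng = np.random.default_rng(2)
n = 500
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 1))])
A = rng.choice([-1.0, 1.0], size=n)
beta_hat = np.array([0.1, 1.0])
Y = 1.0 + A * (X @ beta_hat) + rng.normal(size=n)
Sigma = np.cov((A[:, None] * X).T)          # crude plug-in variance estimate
tau = np.log(n)                             # tau_n -> inf, tau_n = o(n)

# 'near' the boundary: n (x'beta)^2 / (x' Sigma x) <= tau
quad = np.einsum('ij,jk,ik->i', X, Sigma, X)
near = n * (X @ beta_hat) ** 2 / quad <= tau

wts = rng.multinomial(n, [1.0 / n] * n)     # one bootstrap draw of Pn*
def centered(ind):                          # sqrt(n)(Pn* - Pn) applied to Y 1{ind}
    g = Y * ind
    return np.sqrt(n) * ((wts * g).sum() / n - g.mean())

far = centered((A * (X @ beta_hat) > 0) & ~near)
thetas = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
vals = [centered((A * (X @ np.array([np.cos(t), np.sin(t)])) > 0) & near)
        for t in thetas]
U_n, L_n = max(vals) + far, min(vals) + far
```

By construction L_n ≤ U_n, and the gap between them shrinks as fewer points fall in the 'near' group, matching the remark on the next slide that the bounds are tight when all subjects have large treatment effects.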
Limiting distribution of the bounds

Theorem. Assume (A0)-(A3). Then

(Ln, Un) ⇝ ( inf_{ω∈Rp} T(ω) + W, sup_{ω∈Rp} T(ω) + W )

under P_n.

I Recall the limiting distribution of √n{V̂n(β̂n) − V(β̂n)} is T(Z + s) + W

I The bounds are equivalent to taking the sup/inf over local perturbations

I If all subjects have large treatment effects, the bounds are tight

I Bootstrap the bounds to construct a confidence interval; theoretical results for the bootstrapped bounds are given in the book
Note on tuning

I The sequence {τn}n≥1 can affect finite-sample performance

I Idea: tune using the double bootstrap, i.e., bootstrap the bootstrap samples to estimate coverage and adapt τn

I The double bootstrap is considered computationally expensive, but it is not much of a burden in most problems with modern computing infrastructure

I Tuning can be done without affecting the theoretical results
Algy the friendly tuning algorithm

Alg. 1.5: Tuning the critical value τn using the double bootstrap
Input: {(Xi, Ai, Yi)}ni=1, M, α ∈ (0, 1), τn(1), ..., τn(L)
1 V̂ = V̂n(d̂n)
2 for j = 1, ..., L do
3   c(j) = 0
4   for b = 1, ..., M do
5     Draw a sample of size n, say Sn(b), from {(Xi, Ai, Yi)}ni=1 with replacement
6     Compute the bound-based confidence set, ζ(b), using sample Sn(b) and critical value τn(j)
7     if V̂ ∈ ζ(b) then c(j) = c(j) + 1
8   end
9 end
10 Set j∗ = arg min_{j : c(j) ≥ M(1−α)} c(j)
Output: Return τn(j∗)

Let Pmn(b) denote the empirical measure based on a resample of size mn; more generally, for any Z = f(P, Pn) write Zmn(b) to denote f(Pn, Pmn(b)). The m-out-of-n bootstrap approximates the sampling distribution of √n{V̂n(d̂n) − V(d̂n)} with the conditional distribution of √mn{V̂mn(b)(d̂mn(b)) − V̂n(d̂mn(b))} given the observed data. For α ∈ (0, 1), let ℓ̂mn and ûmn denote the (α/2) × 100 and (1 − α/2) × 100 percentiles of √mn{V̂mn(b)(d̂mn(b)) − V̂n(d̂mn(b))}; then [V̂n(d̂n) − ûmn/√mn, V̂n(d̂n) − ℓ̂mn/√mn] is a (1 − α) × 100% m-out-of-n bootstrap confidence set for V(d̂n). Algorithm 1.6 provides a schematic for computing this confidence set.

Let Z be defined as above. To establish consistency of the m-out-of-n bootstrap, in addition to (A0)-(A3) and the asymptotic growth conditions on mn, we assume (A4): √mn(β̂mn(b) − β̂n), conditional on the observed data, converges in distribution to Z in probability. This holds under quite general conditions (see Gine and Zinn, 1990, and Ch. 3.6 of Van Der Vaart and Wellner, 1996).

Theorem 1.2.7. Assume (A0)-(A3) with local parameter s = 0. Let mn be a sequence of positive integers such that mn → ∞ and mn = o(n) as n → ∞.
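The counting skeleton of Alg. 1.5 can be illustrated on a deliberately simple target: a mean, with a grid of candidate critical values standing in for τn(1), ..., τn(L), and plain bootstrap intervals standing in for the bound-based sets. This is a hypothetical stand-in kept only to show the double-bootstrap selection logic.

```python
import numpy as np

# Toy Alg. 1.5 skeleton: for each candidate critical value, count how
# often the second-level bootstrap interval covers the point estimate,
# then pick the least conservative value meeting M(1 - alpha).
rng = np.random.default_rng(3)
y = rng.normal(1.0, 1.0, size=200)
M, alpha = 100, 0.1
taus = [1.0, 1.5, 1.96, 2.5]                # candidate critical values (toy)
V_hat = y.mean()                            # plays the role of V-hat_n(d-hat_n)
counts = []
for tau in taus:
    c = 0
    for _ in range(M):
        yb = rng.choice(y, size=y.size, replace=True)   # bootstrap sample
        half = tau * yb.std() / np.sqrt(yb.size)
        c += (yb.mean() - half <= V_hat <= yb.mean() + half)
    counts.append(c)
# coverage is roughly increasing in tau, so the smallest adequate tau
# corresponds to the argmin over {j : c(j) >= M(1 - alpha)} in Alg. 1.5
ok = [t for t, c in zip(taus, counts) if c >= M * (1 - alpha)]
tau_star = min(ok) if ok else taus[-1]
```

The nested loop is embarrassingly parallel over (j, b), which is why the slide's claim that the double bootstrap "is not much of a burden" on modern infrastructure is plausible.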
Intermission

If any man says he hates war more than I do, he better have a knife, that's all I have to say. –Ghandi
m-out-of-n bootstrap

I Bound-based intervals are complex (conceptually and technically)

I Subsampling is easier to implement and understand11

I Does not require specialized code, etc.

I Let mn be a resample size s.t. mn → ∞ and mn = o(n), and let Pmn(b) be the bootstrap empirical distribution

I Approximate √n{V̂n(β̂n) − V(β̂n)} with its bootstrap analog √mn{V̂mn(b)(β̂mn(b)) − V̂n(β̂n)} (Laber might draw a picture)

I Let ℓ̂mn and ûmn be the (α/2) × 100 and (1 − α/2) × 100 percentiles of √mn{V̂mn(b)(β̂mn(b)) − V̂n(β̂n)}; the CI is given by

[V̂n(β̂n) − ûmn/√mn, V̂n(β̂n) − ℓ̂mn/√mn]

11Though the theory underpinning subsampling can be non-trivial, so 'understand' here is meant more mechanically.
Emmy the subsampling bootstrap algorithm

Alg. 1.6: m-out-of-n bootstrap confidence set for the conditional value
Input: mn, {(Xi, Ai, Yi)}ni=1, M, α ∈ (0, 1)
1 for b = 1, ..., M do
2   Draw a sample of size mn, say Smn(b), from {(Xi, Ai, Yi)}ni=1 with replacement
3   Compute β̂mn(b) on Smn(b)
4   Δmn(b) = √mn[ (1/mn) Σ_{i∈Smn(b)} Yi 1{Ai Xiᵀβ̂mn(b) > 0} − (1/n) Σ_{k=1}^n Yk 1{Ak Xkᵀβ̂mn(b) > 0} ]
5 end
6 Relabel so that Δmn(1) ≤ Δmn(2) ≤ ··· ≤ Δmn(M)
7 ℓ̂mn = Δmn(⌈Mα/2⌉)
8 ûmn = Δmn(⌈M(1−α/2)⌉)
Output: [V̂n(d̂n) − ûmn/√mn, V̂n(d̂n) − ℓ̂mn/√mn]

... sampling mn observations with replacement from the observed data,

sup_{g∈BL1(Rp)} | P g(Z) − P̂M,mn g(√mn(β̂mn(b) − β̂n)) |

converges to zero in probability.

Let Gmn(b) = √mn(Pmn(b) − Pn). A proof of this result is omitted; however, the crux of the proof is the expansion

√mn{V̂mn(b)(d̂mn(b)) − V̂n(d̂mn(b))} = Gmn(b) Y 1{A Xᵀβ̂mn(b) > 0}
= Gmn(b)( Y 1[ A Xᵀ{√mn(β̂mn(b) − β̂n) + (mn/n)^{1/2} √n(β̂n − β∗)} > 0 ] 1{Xᵀβ∗ = 0} )
+ Gmn(b) Y 1{A Xᵀβ̂mn(b) > 0} 1{Xᵀβ∗ ≠ 0},

where Gmn(b), seen as an element of ℓ∞(Rp) via η ↦ Gmn(b) Y 1{A Xᵀη > 0} 1{Xᵀβ∗ = 0}, converges in distribution to T (Van Der Vaart and Wellner, 1996, Thm 3.6.3), and (mn/n)^{1/2} √n(β̂n − β∗) converges to zero in probability.

Application of the m-out-of-n bootstrap confidence interval requires a choice of resample size mn. The asymptotic conditions mn → ∞ and mn = o(n) do not provide guidance about what mn should be for a given data set.
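Alg. 1.6 is short enough to sketch end to end. The sketch below uses toy modeling choices not specified in the slides: an IPW value estimator with P(A|X) = 1/2 (so the weight is 2), an OLS stage model to get β̂, and mn = n^0.8 as the resample size.

```python
import numpy as np

# Toy Alg. 1.6: m-out-of-n bootstrap CI for the conditional value of the
# plug-in linear rule sign(x'beta_hat). Modeling choices are hypothetical.
rng = np.random.default_rng(4)
n = 1000
m = int(n ** 0.8)                        # resample size: m -> inf, m = o(n)
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 1))])
A = rng.choice([-1.0, 1.0], size=n)
Y = 1.0 + A * X[:, 1] + rng.normal(size=n)

def fit_beta(X, A, Y):
    # OLS for E[Y|X,A] = x'b0 + a x'b1; return the interaction part b1
    D = np.hstack([X, A[:, None] * X])
    b, *_ = np.linalg.lstsq(D, Y, rcond=None)
    return b[X.shape[1]:]

def value(X, A, Y, beta):                # IPW value of sign(x'beta), P(A|X)=1/2
    return np.mean(2.0 * Y * (A * (X @ beta) > 0))

beta_hat = fit_beta(X, A, Y)
V_hat = value(X, A, Y, beta_hat)
M = 200
deltas = np.empty(M)
for b in range(M):
    idx = rng.integers(0, n, size=m)     # draw S_m^(b) with replacement
    bb = fit_beta(X[idx], A[idx], Y[idx])
    deltas[b] = np.sqrt(m) * (value(X[idx], A[idx], Y[idx], bb)
                              - value(X, A, Y, bb))
lo_q, hi_q = np.quantile(deltas, [0.05, 0.95])
ci = (V_hat - hi_q / np.sqrt(m), V_hat - lo_q / np.sqrt(m))
```

Note the quantile flip in the last line: the upper percentile of Δ is subtracted to form the lower endpoint, exactly as in the Output line of Alg. 1.6.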
m-out-of-n cont'd

I Provides valid confidence intervals under (A0)-(A3)

I Proof omitted (tedious)

I Can tune mn using the double bootstrap

I Reliance on asymptotic tomfoolery makes me hesitant to use this in practice12

12I did not always think this way; see Chakraborty, L., and Zhao (2014ab). Also, I should not be so dismissive of these methods. Some of the work in this area has been quite deep and has produced general, uniformly convergent methods. See work by Romano and colleagues.
Confidence interval for the optimal regime within a class

I Let π∗ denote the optimal treatment regime within a given class; our goal is to construct a CI for V(π∗)

I Were π∗ known, one could use √n{V̂n(π∗) − V(π∗)}; with your stats buddy, compute this limiting distribution

I Suppose that we could construct a valid confidence region for π∗; suggest a method for a CI for V(π∗)
Projection interval

I For any fixed π, let ζn,1−ν(π) be a (1 − ν) × 100% confidence set for V(π), e.g., using the asymptotic approximation on the previous slide

I Let Dn,1−η denote a (1 − η) × 100% confidence set for π∗; then a (1 − η − ν) × 100% confidence region for V(π∗) is

∪_{π ∈ Dn,1−η} ζn,1−ν(π)

Why?
Ex. projection interval for a linear regime

I Consider regimes of the form π(x; β) = sign(xᵀβ); then

√n{V̂n(β) − V(β)} = √n(Pn − P)[ Y 1{A Xᵀβ > 0} / P(A|X) ] ⇝ Normal(0, σ²(β)),

so take

ζn,1−ν(β) = [ V̂n(β) − z_{1−ν/2} σ̂n(β)/√n, V̂n(β) + z_{1−ν/2} σ̂n(β)/√n ]

I If √n(β̂n − β∗) ⇝ Normal(0, Σ), take

Dn,1−η = { β : n(β̂n − β)ᵀ Σ̂n⁻¹ (β̂n − β) ≤ χ²_{p,1−η} }

I Projection interval: ∪_{β ∈ Dn,1−η} ζn,1−ν(β)
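A sketch of this construction: form the Wald ellipsoid for β∗, and take the union of pointwise normal CIs for V(β) over it. The union over the full ellipsoid is approximated here by random points on its boundary plus the center, and the covariance plug-in is a rough toy stand-in, so this is an illustration of the geometry rather than the exact projection set.

```python
import numpy as np

# Projection-interval sketch for a linear regime, p = 2. All model and
# plug-in choices are hypothetical.
rng = np.random.default_rng(5)
n, p = 800, 2
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 1))])
A = rng.choice([-1.0, 1.0], size=n)
Y = 1.0 + A * X[:, 1] + rng.normal(size=n)

D = np.hstack([X, A[:, None] * X])
bhat, *_ = np.linalg.lstsq(D, Y, rcond=None)
b1 = bhat[p:]                                 # interaction coefficients
Sig = 4.0 * np.cov((A[:, None] * X).T)        # rough stand-in for Cov of sqrt(n)(b1_hat - b1*)
Lc = np.linalg.cholesky(Sig)
chi2 = 5.99                                   # ~ chi^2 quantile, 2 df, 0.95

def v_and_se(beta):                           # IPW value, P(A|X) = 1/2
    g = 2.0 * Y * (A * (X @ beta) > 0)
    return g.mean(), g.std() / np.sqrt(n)

v0, se0 = v_and_se(b1)
lo, hi = v0 - 1.96 * se0, v0 + 1.96 * se0
for _ in range(500):
    u = rng.normal(size=p)
    u /= np.linalg.norm(u)
    b = b1 + np.sqrt(chi2 / n) * (Lc @ u)     # point on the ellipsoid boundary
    v, se = v_and_se(b)
    lo, hi = min(lo, v - 1.96 * se), max(hi, v + 1.96 * se)
```

The union can only widen the interval, which is the price of the Bonferroni-style (1 − η − ν) guarantee on the previous slide.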
Quiz break!

I What does the 'Q' in Q-learning stand for?

I In treatment regimes, which of the following is not yet a thing: A-learning, B-learning, C-learning, D-learning, E-learning?

I Write down the two-stage Q-learning algorithm assuming binary treatments and linear models at each stage

I True or false:

I I would rather have a zombie ice dragon than two live fire dragons.

I The story of the Easter bunny is based on the little-known story of Jesus swapping the internal organs of chickens and rabbits to prevent a widespread famine

I Q-learning has been used to obtain state-of-the-art performance in game-playing domains like chess, backgammon, and Atari
Inference for two-stage linear Q-learning

I Learning objectives

I Identify the source of nonregularity

I Understand the implications for coverage and asymptotic bias

I Intuition behind the bounds

I Hopefully this will be trivial for you now!
Reminder: setup and notation

I Observe {(X1,i, A1,i, X2,i, A2,i, Yi)}ni=1, i.i.d. from P

I X1 ∈ Rp1: baseline subject info

I A1 ∈ {0, 1}: first treatment

I X2 ∈ Rp2: interim subject info collected during the course of A1

I A2 ∈ {0, 1}: second treatment

I Y ∈ R: outcome, higher is better

I Define histories H1 = X1 and H2 = (X1, A1, X2)

I DTR π = (π1, π2) where πt : supp Ht → supp At; a patient presenting with Ht = ht is assigned treatment πt(ht)
Characterizing the optimal DTR

I The optimal regime maximizes the value E Y∗(π)

I Define the Q-functions

Q2(h2, a2) = E( Y | H2 = h2, A2 = a2 )

Q1(h1, a1) = E{ max_{a2} Q2(H2, a2) | H1 = h1, A1 = a1 }

I Dynamic programming (Bellman, 1957): πt^opt(ht) = arg max_{at} Qt(ht, at)
Q-learning

I Regression-based dynamic programming algorithm

(Q0) Postulate working models for the Q-functions

Qt(ht, at; βt) = ht,0ᵀβt,0 + at ht,1ᵀβt,1, with ht,0, ht,1 features of ht

(Q1) Compute β̂2 = arg min_{β2} Pn{Y − Q2(H2, A2; β2)}²

(Q2) Compute β̂1 = arg min_{β1} Pn{ max_{a2} Q2(H2, a2; β̂2) − Q1(H1, A1; β1) }²

(Q3) π̂t(ht) = arg max_{at} Qt(ht, at; β̂t)

I Population parameters βt∗ are obtained by replacing Pn with P

I Inference for β2∗ is standard, just OLS

I Focus on confidence intervals for cᵀβ1∗ for fixed c ∈ R^{dim β1∗}
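Steps (Q0)-(Q3) can be written out directly. The generative model and feature maps below are hypothetical; the only structural facts used are that A2 ∈ {0, 1}, so max over a2 of a2·(h2,1ᵀβ̂2,1) equals the positive part [h2,1ᵀβ̂2,1]+, and that each stage is an OLS fit.

```python
import numpy as np

# Two-stage linear Q-learning sketch, (Q0)-(Q3), on toy simulated data.
rng = np.random.default_rng(6)
n = 1000
X1 = rng.normal(size=n)
A1 = rng.integers(0, 2, size=n).astype(float)
X2 = 0.5 * X1 + rng.normal(size=n)
A2 = rng.integers(0, 2, size=n).astype(float)
Y = X1 + A1 * (1.0 - X1) + A2 * (X2 - 0.5) + rng.normal(size=n)

def ols(D, y):
    b, *_ = np.linalg.lstsq(D, y, rcond=None)
    return b

# (Q1) stage-2 regression: Q2(h2, a2) = h20'b20 + a2 h21'b21
H20 = np.column_stack([np.ones(n), X1, A1, A1 * X1, X2])
H21 = np.column_stack([np.ones(n), X2])
b2 = ols(np.hstack([H20, A2[:, None] * H21]), Y)
b20, b21 = b2[:5], b2[5:]                # b21 should be near (-0.5, 1)

# (Q2) pseudo-outcome max_{a2} Q2 = h20'b20 + [h21'b21]_+ , then stage-1 OLS
pseudo = H20 @ b20 + np.maximum(0.0, H21 @ b21)
H10 = H11 = np.column_stack([np.ones(n), X1])
b1 = ols(np.hstack([H10, A1[:, None] * H11]), pseudo)
b10, b11 = b1[:2], b1[2:]

# (Q3) estimated rules
pi2 = (H21 @ b21 > 0).astype(int)
pi1 = (H11 @ b11 > 0).astype(int)
```

The positive part in the pseudo-outcome is exactly where the nonsmoothness enters: β̂1 is a smooth function of the data only away from {h2,1ᵀβ2,1∗ = 0}, which is the source of the nonregularity discussed next.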
Inference for cᵀβ1∗

I The nonsmooth max operator makes β̂1 nonregular

I The distribution of cᵀ√n(β̂1 − β1∗) is sensitive to small perturbations of P

I The limiting distribution does not have mean zero (asymptotic bias)

I Occurs with small second-stage treatment effects, H2,1ᵀβ2,1∗ ≈ 0

I Confidence intervals based on series approximations or the bootstrap can perform poorly; proposed remedies include:

I Apply shrinkage to reduce asymptotic bias

I Form conservative estimates of tail probabilities of cᵀ√n(β̂1 − β1∗)
Characterizing asymptotic bias

Definition. For a constant c ∈ R^{dim β1∗} and a √n-consistent estimator β̂1 of β1∗ with √n(β̂1 − β1∗) ⇝ M, define the c-directional asymptotic bias

Bias(β̂1, c) ≜ E cᵀM.
Characterizing asymptotic bias cont'd

Theorem (Asymptotic bias of Q-learning). Let c ∈ R^{dim β1∗} be fixed. Under moment conditions:

Bias(β̂1, c) = cᵀ Σ1,∞⁻¹ P( B1 √(H2,1ᵀ Σ21,21 H2,1) 1{H2,1ᵀβ2,1∗ = 0} ) / √(2π),

where B1 = (H1,0ᵀ, A1 H1,1ᵀ)ᵀ, Σ1,∞ = P B1B1ᵀ, and Σ21,21 is the asymptotic covariance of √n(β̂2,1 − β2,1∗).

I Asymptotic bias for Q-learning

I An average of cᵀΣ1,∞⁻¹B1 with weights ∝ √Var( H2,1ᵀβ̂2,1 1{H2,1ᵀβ2,1∗ = 0} | H2,1 )

I May be reduced by shrinking h2,1ᵀβ̂2,1 when h2,1ᵀβ2,1∗ = 0
Reducing asymptotic bias to improve inference

I Shrinkage is a popular method for reducing asymptotic bias with the goal of improving interval coverage

I Chakraborty et al. (2009) apply soft-thresholding

I Moodie et al. (2010) apply hard-thresholding

I Goldberg et al. (2013) and Song et al. (2015) use lasso-type penalization

I Shrinkage methods target

max_{a2} Q2(h2, a2; β̂2) = h2,0ᵀβ̂2,0 + max_{a2∈{0,1}} a2 h2,1ᵀβ̂2,1 = h2,0ᵀβ̂2,0 + [h2,1ᵀβ̂2,1]+
Soft-thresholding (Chakraborty et al., 2009)

I In Q-learning, replace max_{a2} Q2(H2, a2; β̂2) with

H2,0ᵀβ̂2,0 + [H2,1ᵀβ̂2,1]+ { 1 − σ H2,1ᵀ Σ̂21,21 H2,1 / ( n (H2,1ᵀβ̂2,1)² ) }+

I The amount of shrinkage is governed by σ > 0

I Penalization schemes (Goldberg et al., 2013; Song et al., 2015) reduce to this estimator under certain designs

I No theoretical justification is given in Chakraborty et al. (2009), but it improves coverage of bootstrap intervals in some settings
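The thresholded pseudo-outcome is a one-liner once the pieces are named. In this sketch the function arguments (the fitted linear predictors and the quadratic form H2,1ᵀΣ̂21,21H2,1) are assumed to be precomputed; the numeric inputs in the usage example are made up.

```python
import numpy as np

def soft_threshold_max(h20b20, h21b21, h21_var_quad, n, sigma=3.0):
    """Soft-thresholded stage-2 maximum (Chakraborty et al., 2009 form).

    h20b20:       H20' b20-hat (main-effect fit)
    h21b21:       H21' b21-hat (interaction fit)
    h21_var_quad: H21' Sigma21,21-hat H21 (plug-in variance quadratic form)
    """
    shrink = np.maximum(0.0, 1.0 - sigma * h21_var_quad / (n * h21b21 ** 2))
    return h20b20 + np.maximum(0.0, h21b21) * shrink

# hypothetical usage: a small and a large fitted effect at n = 100
vals = soft_threshold_max(np.array([1.0, 1.0]),
                          np.array([0.05, 2.0]),
                          np.array([1.0, 1.0]), n=100)
# the small effect (0.05) is shrunk all the way to zero; the large
# effect (2.0) is barely shrunk (factor 1 - 3/400 = 0.9925)
```

The shrinkage factor compares the squared fitted effect to its estimated variance, so only effects indistinguishable from zero at scale √n are thresholded away; larger σ widens the "indistinguishable" zone.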
Soft-thresholding and asymptotic bias

Theorem. Let c ∈ R^{dim β1∗} and let β̂1σ denote the soft-thresholding estimator. Under moment conditions:

1. |Bias(β̂1σ, c)| ≤ |Bias(β̂1, c)| for any σ > 0.

2. If Bias(β̂1, c) ≠ 0, then for σ > 0

Bias(β̂1σ, c) / Bias(β̂1, c) = exp(−σ/2) − σ ∫_{√σ}^∞ (1/x) exp(−x²/2) dx
Soft-thresholding and asymptotic bias cont'd

I Is thresholding useful in reducing asymptotic bias?

I The preceding theorem says yes, and more shrinkage is better

I Chakraborty et al. suggest σ = 3, which corresponds to a 13-fold decrease in asymptotic bias

I However, the preceding theorem is based on pointwise, i.e., fixed-parameter, asymptotics and may not faithfully reflect small-sample performance
Local generative model

I Use local asymptotics to approximate the small-sample behavior of soft-thresholding

I Assume:

1. For any s ∈ R^{dim β2,1∗} there exists a sequence of distributions P_n so that

∫ [ √n(dP_n^{1/2} − dP^{1/2}) − (1/2) νs dP^{1/2} ]² → 0,

for some measurable function νs.

2. β2,1,n∗ = β2,1∗ + s/√n, where β2,n∗ = arg min_{β2} P_n{Y − Q2(H2, A2; β2)}²
Local asymptotics view of soft-thresholding

Theorem. Let c ∈ R^{dim β1∗} be fixed. Under the local generative model and moment conditions:

1. sup_{s ∈ R^{dim β2,1∗}} |Bias(β̂1, c)| ≤ K < ∞.

2. sup_{s ∈ R^{dim β2,1∗}} |Bias(β̂1σ, c)| → ∞ as σ → ∞.

I Thresholding can be infinitely worse than doing nothing if done too aggressively in small samples
![Page 94: Intro to nonsmooth inference - Nc State Universitydavidian/dtr19/dtr9.pdf · Intro to nonsmooth inference Eric B. Laber Department of Statistics, North Carolina State University April](https://reader034.fdocuments.net/reader034/viewer/2022050604/5fabea4140ec6a377b40e8c1/html5/thumbnails/94.jpg)
Data-driven tuning
- Is it possible to construct a data-driven choice of $\sigma$ that consistently leads to less asymptotic bias than no shrinkage?
- Consider data from a two-arm randomized trial $\{(A_i, Y_i)\}_{i=1}^{n}$, $A \in \{0, 1\}$, $Y \in \mathbb{R}$ coded so that higher is better¹³
- Define $\mu^*_a \triangleq E(Y \mid A = a)$ and $\widehat{\mu}_a = \mathbb{P}_n Y 1_{A=a} / \mathbb{P}_n 1_{A=a}$
- Mean outcome under optimal treatment assignment is $\theta^* = \max(\mu^*_0, \mu^*_1)$, with corresponding estimator
  $$\widehat{\theta} = \max(\widehat{\mu}_0, \widehat{\mu}_1) = \widehat{\mu}_0 + \left[\widehat{\mu}_1 - \widehat{\mu}_0\right]_+$$
- Soft-thresholding estimator
  $$\widehat{\theta}^{\sigma} = \widehat{\mu}_0 + \left[\widehat{\mu}_1 - \widehat{\mu}_0\right]_+ \left\{1 - \frac{4\sigma}{n\,(\widehat{\mu}_1 - \widehat{\mu}_0)^2}\right\}_+$$

¹³This is equivalent to two-stage Q-learning with no covariates and a single first-stage treatment.
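Both estimators above are a few lines of code. A minimal sketch (the function name and the simulated data are illustrative, not from the slides):

```python
import numpy as np

def soft_threshold_value(a, y, sigma):
    """Plug-in and soft-thresholded estimators of theta* = max(mu0*, mu1*).

    a : 0/1 treatment indicators; y : outcomes; sigma : thresholding parameter.
    The thresholded estimator shrinks [mu1 - mu0]_+ by the factor
    {1 - 4*sigma / (n*(mu1 - mu0)^2)}_+ from the slide.
    """
    a, y = np.asarray(a), np.asarray(y)
    n = len(y)
    mu0 = y[a == 0].mean()
    mu1 = y[a == 1].mean()
    diff = mu1 - mu0
    theta_hat = mu0 + max(diff, 0.0)                      # hard-max estimator
    shrink = max(1.0 - 4.0 * sigma / (n * diff**2), 0.0)  # soft-threshold factor
    theta_sigma = mu0 + max(diff, 0.0) * shrink
    return theta_hat, theta_sigma

# Illustrative use on simulated trial data
rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=200)
y = 0.3 * a + rng.normal(size=200)
print(soft_threshold_value(a, y, sigma=1.0))
```

Note that the shrinkage factor equals one when the estimated effect is large relative to $\sigma/n$ and zero when it is small, so $\widehat{\theta}^{\sigma}$ interpolates between $\widehat{\theta}$ and $\widehat{\mu}_0$.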
Data-driven tuning: toy example
[Figure: bias of the soft-thresholding estimator as a function of $\sigma$; panels for n = 10 and n = 100.]

- Optimal value of $\sigma$ depends on $\mu^*_1 - \mu^*_0$
- Variability in $\widehat{\mu}_1 - \widehat{\mu}_0$ prevents identification of the optimal value of $\sigma$; using a plug-in estimator may lead to large bias
- Constructing a data-driven $\sigma$ that significantly improves asymptotic bias over no shrinkage is difficult
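The toy example can be reproduced approximately by Monte Carlo. A sketch, assuming standard normal errors; the particular values of $\delta = \mu^*_1 - \mu^*_0$ and $\sigma$ are illustrative:

```python
import numpy as np

def mc_bias(delta, sigma, n=100, reps=2000, seed=1):
    """Monte Carlo bias of the soft-thresholded estimator of
    theta* = max(mu0*, mu1*) when mu0* = 0 and mu1* = delta."""
    rng = np.random.default_rng(seed)
    theta_star = max(0.0, delta)
    est = np.empty(reps)
    for r in range(reps):
        a = rng.integers(0, 2, size=n)
        y = delta * a + rng.normal(size=n)
        mu0, mu1 = y[a == 0].mean(), y[a == 1].mean()
        d = mu1 - mu0
        shrink = max(1.0 - 4.0 * sigma / (n * d * d), 0.0)
        est[r] = mu0 + max(d, 0.0) * shrink
    return est.mean() - theta_star

# Bias across sigma for a null and a moderate effect size
for delta in (0.0, 0.5):
    print([round(mc_bias(delta, s), 3) for s in (0.0, 1.0, 4.0)])
```

At $\delta = 0$ the unshrunk estimator ($\sigma = 0$) is biased upward and shrinkage reduces the bias, while at larger $\delta$ aggressive shrinkage drags the estimator down, illustrating why no single $\sigma$ works for all $\mu^*_1 - \mu^*_0$.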
Asymptotic bias: discussion
- Asymptotic bias exists in Q-learning
- Local asymptotics show that aggressively shrinking to reduce asymptotic bias can be infinitely worse than no shrinkage
- Data-driven tuning seems to require either choosing $\sigma$ very small or risking large bias
Confidence intervals for $c^{\intercal}\beta^*_1$

- Possible to construct valid confidence intervals in the presence of asymptotic bias
- Idea: construct regular bounds on $c^{\intercal}\sqrt{n}(\widehat{\beta}_1 - \beta^*_1)$
- Bootstrap the bounds to form a confidence interval
- Tightest among all regular bounds ⇒ automatic adaptivity
- Local uniform convergence
- Can also obtain conditional properties (Robins and Rotnitzky, 2014) and global uniform convergence (Wu, 2014)
Regular bounds on $c^{\intercal}\sqrt{n}(\widehat{\beta}_1 - \beta^*_1)$

- Define
  $$\mathbb{V}_n(c, \gamma) = c^{\intercal}\mathbb{S}_n + c^{\intercal}\widehat{\Sigma}_1^{-1}\mathbb{P}_n \mathbb{U}_n(\gamma),$$
  where
  $$\mathbb{S}_n = \widehat{\Sigma}_1^{-1}\sqrt{n}(\mathbb{P}_n - P)B_1\left\{H^{\intercal}_{2,0}\beta^*_{2,0} + \left(H^{\intercal}_{2,1}\beta^*_{2,1}\right)_+ - B^{\intercal}_1\beta^*_1\right\} + \widehat{\Sigma}_1^{-1}\sqrt{n}\,\mathbb{P}_n B_1 H^{\intercal}_{2,0}\left(\widehat{\beta}_{2,0} - \beta^*_{2,0}\right),$$
  $$\mathbb{U}_n(\gamma) = B_1\left[\left(H^{\intercal}_{2,1}(Z_n + \gamma)\right)_+ - \left(H^{\intercal}_{2,1}\gamma\right)_+\right]$$
- $\mathbb{S}_n$ is smooth and $\mathbb{U}_n(\gamma)$ is non-smooth
Regular bounds on $c^{\intercal}\sqrt{n}(\widehat{\beta}_1 - \beta^*_1)$ cont'd

- It can be shown that $c^{\intercal}\sqrt{n}(\widehat{\beta}_1 - \beta^*_1) = \mathbb{V}_n(c, \beta^*_{2,1})$
- Use pretesting to construct the upper bound
  $$\mathcal{U}_n(c) = c^{\intercal}\mathbb{S}_n + c^{\intercal}\widehat{\Sigma}_1^{-1}\mathbb{P}_n\mathbb{U}_n(\beta^*_{2,1})\,1_{T_n(H_{2,1}) > \lambda_n} + \sup_{\gamma}\, c^{\intercal}\widehat{\Sigma}_1^{-1}\mathbb{P}_n\mathbb{U}_n(\gamma)\,1_{T_n(H_{2,1}) \leq \lambda_n},$$
  where $T_n(h_{2,1})$ is a test statistic for the null $h^{\intercal}_{2,1}\beta^*_{2,1} = 0$ and $\lambda_n$ is a critical value
- The lower bound, $\mathcal{L}_n(c)$, is obtained by taking the inf
- Bootstrap the bounds to find a confidence interval
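In the scalar toy problem (no covariates), the nonsmooth piece satisfies $\sup_{\gamma}\{(z+\gamma)_+ - \gamma_+\} = (z)_+$ and the corresponding inf is $\min(z, 0)$, so the bounds can be bootstrapped in closed form. A rough sketch of the bounding idea for $\theta^* = \max(\mu^*_0, \mu^*_1)$, assuming a fixed pretest critical value `lam` rather than the tuned $\lambda_n$ from the slides:

```python
import numpy as np

def aci_toy(a, y, alpha=0.05, lam=3.0, boots=1000, seed=0):
    """Sketch of the bounding ("ACI"-style) interval for theta* = max(mu0*, mu1*)."""
    rng = np.random.default_rng(seed)
    a, y = np.asarray(a), np.asarray(y)
    n = len(y)
    mu0, mu1 = y[a == 0].mean(), y[a == 1].mean()
    d = mu1 - mu0
    theta = mu0 + max(d, 0.0)
    # Variance of the difference in means, for the pretest statistic
    v = y[a == 0].var(ddof=1) / (a == 0).sum() + y[a == 1].var(ddof=1) / (a == 1).sum()
    pretest_rejects = d * d / v > lam        # evidence the arms differ
    gam = np.sqrt(n) * d                     # plug-in local parameter
    U, L = np.empty(boots), np.empty(boots)
    for b in range(boots):
        idx = rng.integers(0, n, size=n)
        ab, yb = a[idx], y[idx]
        m0, m1 = yb[ab == 0].mean(), yb[ab == 1].mean()
        S = np.sqrt(n) * (m0 - mu0)          # smooth part
        Z = np.sqrt(n) * ((m1 - m0) - d)     # centered difference in means
        if pretest_rejects:                  # nonsmooth part: plug in gam
            U[b] = L[b] = S + max(Z + gam, 0.0) - max(gam, 0.0)
        else:                                # sup / inf over gamma, in closed form
            U[b] = S + max(Z, 0.0)
            L[b] = S + min(Z, 0.0)
    u = np.quantile(U, 1 - alpha / 2)
    lo_q = np.quantile(L, alpha / 2)
    return theta - u / np.sqrt(n), theta - lo_q / np.sqrt(n)
```

When the pretest rejects, the upper and lower bounds coincide and the interval behaves like an ordinary bootstrap interval; when it accepts, the sup/inf terms widen the interval to guard against nonregularity.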
Validity of the bounds
Theorem. Let $c \in \mathbb{R}^{\dim \beta^*_1}$ be fixed. Assume the local generative model; under moment conditions and conditions on the pretest:

1. $c^{\intercal}\sqrt{n}(\widehat{\beta}_1 - \beta^*_{1,n}) \rightsquigarrow c^{\intercal}\mathbb{S}_{\infty} + c^{\intercal}\Sigma^{-1}_{1,\infty}P B_1 H^{\intercal}_{2,1} Z_{\infty}\, 1_{H^{\intercal}_{2,1}\beta^*_{2,1} > 0} + c^{\intercal}\Sigma^{-1}_{1,\infty}P B_1\left[\left(H^{\intercal}_{2,1}(Z_{\infty} + s)\right)_+ - \left(H^{\intercal}_{2,1}s\right)_+\right] 1_{H^{\intercal}_{2,1}\beta^*_{2,1} = 0}$

2. $\mathcal{U}_n(c) \rightsquigarrow c^{\intercal}\mathbb{S}_{\infty} + c^{\intercal}\Sigma^{-1}_{1,\infty}P B_1 H^{\intercal}_{2,1} Z_{\infty}\, 1_{H^{\intercal}_{2,1}\beta^*_{2,1} > 0} + \sup_{\gamma}\, c^{\intercal}\Sigma^{-1}_{1,\infty}P B_1\left[\left(H^{\intercal}_{2,1}(Z_{\infty} + \gamma)\right)_+ - \left(H^{\intercal}_{2,1}\gamma\right)_+\right] 1_{H^{\intercal}_{2,1}\beta^*_{2,1} = 0}$
Validity of the bootstrap bounds
Theorem. Fix $\alpha \in (0, 1)$ and $c \in \mathbb{R}^{\dim \beta^*_1}$. Let $\widehat{\ell}$ and $\widehat{u}$ denote the $(\alpha/2)\times 100$ and $(1 - \alpha/2)\times 100$ percentiles of the bootstrap distribution of the bounds, and let $P_M$ denote the distribution with respect to the bootstrap weights. Under moment conditions and conditions on the pretest, for any $\epsilon > 0$:

$$P\left\{P_M\left(c^{\intercal}\widehat{\beta}_1 - \frac{\widehat{u}}{\sqrt{n}} \leq c^{\intercal}\beta^*_1 \leq c^{\intercal}\widehat{\beta}_1 - \frac{\widehat{\ell}}{\sqrt{n}}\right) < 1 - \alpha - \epsilon\right\} = o(1).$$
Uniform validity of the bootstrap bounds (Tianshuang Wu)

Theorem. Fix $\alpha \in (0, 1)$ and $c \in \mathbb{R}^{\dim \beta^*_1}$. Let $\widehat{\ell}$ and $\widehat{u}$ denote the $(\alpha/2)\times 100$ and $(1 - \alpha/2)\times 100$ percentiles of the bootstrap distribution of the bounds, and let $P_M$ denote the distribution with respect to the bootstrap weights. Under moment conditions and conditions on the pretest, for any $\epsilon > 0$:

$$\sup_{P \in \mathcal{P}} P\left\{P_M\left(c^{\intercal}\widehat{\beta}_1 - \frac{\widehat{u}}{\sqrt{n}} \leq c^{\intercal}\beta^*_1 \leq c^{\intercal}\widehat{\beta}_1 - \frac{\widehat{\ell}}{\sqrt{n}}\right) < 1 - \alpha - \epsilon\right\}$$

converges to zero for a large class of distributions $\mathcal{P}$.
Simulation experiments
- Class of generative models:
  - $X_t \in \{-1, 1\}$, $A_t \in \{-1, 1\}$, $t \in \{1, 2\}$
  - $P(A_t = 1) = P(A_t = -1) = 0.5$, $t \in \{1, 2\}$
  - $X_1 \sim \mathrm{Bernoulli}(0.5)$
  - $X_2 \mid X_1, A_1 \sim \mathrm{Bernoulli}\left\{\mathrm{expit}(\delta_1 X_1 + \delta_2 A_1)\right\}$
  - $\epsilon \sim N(0, 1)$
  - $Y = \gamma_1 + \gamma_2 X_1 + \gamma_3 A_1 + \gamma_4 X_1 A_1 + \gamma_5 A_2 + \gamma_6 X_2 A_2 + \gamma_7 A_1 A_2 + \epsilon$
- Vary parameters to obtain a range of effect sizes; classify generative models as:
  - Non-regular (NR)
  - Nearly non-regular (NNR)
  - Regular (R)
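A sketch of a simulator for this class of generative models (the function name is illustrative; treatments and covariates are coded $\pm 1$ as on the slide):

```python
import numpy as np

def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + np.exp(-x))

def simulate(n, gamma, delta, seed=0):
    """Draw one dataset: gamma = (g1,...,g7), delta = (d1, d2)."""
    rng = np.random.default_rng(seed)
    x1 = 2 * rng.binomial(1, 0.5, size=n) - 1   # X1 ~ Bernoulli(0.5), coded +/-1
    a1 = 2 * rng.binomial(1, 0.5, size=n) - 1   # A1 randomized with prob 0.5
    p2 = expit(delta[0] * x1 + delta[1] * a1)   # X2 | X1, A1
    x2 = 2 * rng.binomial(1, p2) - 1
    a2 = 2 * rng.binomial(1, 0.5, size=n) - 1   # A2 randomized with prob 0.5
    g = gamma
    y = (g[0] + g[1] * x1 + g[2] * a1 + g[3] * x1 * a1
         + g[4] * a2 + g[5] * x2 * a2 + g[6] * a1 * a2
         + rng.normal(size=n))                  # epsilon ~ N(0, 1)
    return x1, a1, x2, a2, y
```

Setting $\gamma_5 = \gamma_6 = \gamma_7 = 0$ makes the second-stage treatment effect exactly zero (a non-regular model); small nonzero values give nearly non-regular models.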
Simulation experiments cont’d
- Compare the bounding confidence interval (ACI) with the bootstrap (BOOT) and bootstrap thresholding (THRESH)
- Compare in terms of width and coverage (target 95%)
- Results based on 1000 Monte Carlo replications with datasets of size n = 150
- Bootstrap computed with 1000 resamples
- Tuning parameter $\lambda_n$ chosen with the double bootstrap
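For reference, the BOOT comparator is the naive percentile bootstrap applied directly to the nonsmooth estimator; in the toy two-arm problem it looks like the following sketch (trial data simulated for illustration):

```python
import numpy as np

def boot_ci(a, y, alpha=0.05, boots=1000, seed=0):
    """Naive percentile bootstrap for theta* = max(mu0*, mu1*).

    This is the BOOT comparator; it is known to undercover
    near the nonregular point mu0* = mu1*.
    """
    rng = np.random.default_rng(seed)
    a, y = np.asarray(a), np.asarray(y)
    n = len(y)
    stats = np.empty(boots)
    for b in range(boots):
        idx = rng.integers(0, n, size=n)       # resample with replacement
        ab, yb = a[idx], y[idx]
        stats[b] = max(yb[ab == 0].mean(), yb[ab == 1].mean())
    return np.quantile(stats, alpha / 2), np.quantile(stats, 1 - alpha / 2)

# Illustrative use
rng = np.random.default_rng(7)
a = rng.integers(0, 2, size=150)
y = 0.5 * a + rng.normal(size=150)
print(boot_ci(a, y))
```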
Simulation experiments: results
Coverage (target 95%)

  Method    Ex. 1 NNR   Ex. 2 NR   Ex. 3 NNR   Ex. 4 R   Ex. 5 NR   Ex. 6 NNR
  BOOT      0.935*      0.930*     0.933*      0.928*    0.925*     0.928*
  THRESH    0.945       0.938      0.942       0.943     0.759*     0.762*
  ACI       0.971       0.958      0.961       0.943     0.953      0.953

Average width

  Method    Ex. 1 NNR   Ex. 2 NR   Ex. 3 NNR   Ex. 4 R   Ex. 5 NR   Ex. 6 NNR
  BOOT      0.385*      0.430*     0.430*      0.436*    0.428*     0.428*
  THRESH    0.339       0.426      0.427       0.436     0.426*     0.424*
  ACI       0.441       0.470      0.470       0.469     0.473      0.473
Ex. DTR for ADHD without uncertainty
[Flowchart:]

- Prior medication? Yes → low-dose MEDS; No → low-dose BMOD
- MEDS branch: Adequate response? Yes → continue MEDS; No → High adherence? Yes → intensify MEDS; No → add BMOD
- BMOD branch: Adequate response? Yes → continue BMOD; No → High adherence? Yes → intensify BMOD; No → add MEDS
Ex. DTR for ADHD with uncertainty
[Flowchart:]

- Prior medication? Yes → low-dose MEDS ∼OR∼ BMOD; No → low-dose BMOD
- MEDS branch: Adequate response? Yes → continue MEDS; No → High adherence? Yes → intensify MEDS; No → add OTHER ∼OR∼ intensify SAME
- BMOD branch: Adequate response? Yes → continue BMOD; No → High adherence? Yes → add MEDS ∼OR∼ intensify BMOD; No → add MEDS