Intro to nonsmooth inference
Eric B. Laber
Department of Statistics, North Carolina State University
April 2019, SAMSI
![Page 2: Intro to nonsmooth inference - Nc State Universitydavidian/dtr19/dtr9.pdf · Intro to nonsmooth inference Eric B. Laber Department of Statistics, North Carolina State University April](https://reader034.fdocuments.net/reader034/viewer/2022050604/5fabea4140ec6a377b40e8c1/html5/thumbnails/2.jpg)
Last time

- SMARTs are the gold standard for estimation and evaluation of txt regimes
- Highly configurable, but choices should be driven by the science
- Looked at examples with varying scientific/clinical goals, which lead to different timing, txt options, response criteria, etc.
- Often powered by simple comparisons
  - First-stage response rates
  - Fixed regimes (most- vs. least-intensive)
  - First-stage txts (problematic)
- If the test statistic is regular and asymptotically normal under the null, we can use the same basic template for power
Quick SMART review
[SMART schematic: participants are randomized (R) between PCST-Full (Treatment 0) and PCST-Brief (Treatment 1). Responders and non-responders to each first-stage txt are then re-randomized among second-stage options: PCST-Full maintenance (Treatment 2), no further treatment/intervention (Treatment 3), PCST-Plus (Treatment 4), and PCST-Brief maintenance (Treatment 5).]
Refresher

- Suppose that researchers are interested in comparing the embedded regimes:
  (e1) assign PCST-Full initially, assign PCST-Full maintenance to responders, and assign PCST-Plus to non-responders;
  (e2) assign PCST-Brief initially, assign no further intervention to responders, and assign PCST-Brief maintenance to non-responders.
- Recall our general template:
  - Test statistic: T_n = V_n(e1) − V_n(e2), where V_n is the IPWE
  - Use that √n T_n/σ_{e1,e2,n} is asy normal and reject when this is large in magnitude
Goals for today

- Introduction to inference for txt regimes
- Nonregular inference (and why we should care)
- Basic strategies with a toy problem
- Examples in one-stage problems
Warm up part I: quiz!

- Discuss with your stat buddy:
  - What are some common scenarios where series approximations or the bootstrap cannot ensure correct operating characteristics?
  - What is a local alternative?
  - How do we know if an asymptotic approximation is adequate?
- True or false:
  - If n is large, asymptotic approximations can be trusted.
  - The top review of CLT on Yelp complains about the burritos being too expensive.
  - The BBC produced a Hitler-themed sitcom titled 'Heil Honey, I'm Home' in the 1950s.
On reality and fantasy

"Your cat didn't say that. You know how I know? It's a cat. It doesn't talk. If you died, it would eat you. Starting with your face." — Matt Zabka, recently single
Asymptotic approximations

- Basic idea: study the behavior of a statistical procedure in terms of its dominating features while ignoring lower-order ones
  - Often, but not always, consider a diverging sample size
  - 'Dominating features' intentionally ambiguous
- Generate new insights and general statistical procedures, as large classes of problems share the same dominating features
- Asymptotics mustn't be applied mindlessly
  - Disgusting trend in statistics: propose a method, push through irrelevant asymptotics, handpick simulation experiments
  - Requires careful thought about what op characteristics are needed scientifically and how to ensure these hold with the kind of data that are likely to be observed
- No panacea ⇒ handcrafted construction and evaluation
Inferential questions in precision medicine

- Identify key tailoring variables
- Evaluate performance of the true optimal regime
- Evaluate performance of an estimated optimal regime
- Compare performance of two or more (possibly data-driven) regimes
- ...
Toy problem: max of means

- Simple problem that retains many of the salient features of inference for txt regimes
  - Non-smooth function of smooth functionals
  - Well-studied in the literature
- Basic notation
  - For Z_1, ..., Z_n ~ i.i.d. P comprising ind copies of Z ~ P, write

      P f(Z) = ∫ f(z) dP(z)   and   P_n f(Z) = n⁻¹ ∑_{i=1}^n f(Z_i)

  - Use '⇝' to denote convergence in distribution
  - Check: assuming the requisite moments exist, √n (P_n − P) Z ⇝ ????
Max of means

- Observe X_1, ..., X_n ~ i.i.d. P in R^p with µ_0 = PX; define

    θ_0 = ⋁_{j=1}^p µ_{0,j} = max(µ_{0,1}, ..., µ_{0,p})

- While we consider this estimand primarily for illustration, it corresponds to the problem of estimating the mean outcome under an optimal one-size-fits-all treatment recommendation, where µ_{0,j} is the mean outcome under treatment j = 1, ..., p.
Max of means: estimation

- Define µ_n = P_n X; the plug-in estimator of θ_0 is

    θ_n = ⋁_{j=1}^p µ_{n,j}

- Warm-up:
  - Three minutes trying to derive the limiting distn of √n(θ_n − θ_0)
  - Three minutes discussing your soln with your stat buddy
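As a quick numerical companion to the warm-up (a simulation sketch of my own, not from the slides): with p = 2 tied coordinate means, the plug-in estimator θ_n = max_j µ_{n,j} is biased upward in finite samples.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, n_reps = 100, 2, 2000
theta0 = 0.0                          # mu_0 = (0, 0), so theta_0 = 0

est = np.empty(n_reps)
for r in range(n_reps):
    X = rng.normal(loc=0.0, scale=1.0, size=(n, p))
    mu_n = X.mean(axis=0)             # plug-in estimate of mu_0
    est[r] = mu_n.max()               # theta_n = max of coordinate means

# With ties, E[max(Z_1, Z_2)] = 1/sqrt(pi) ~ 0.564 for ind std normals,
# so the mean of sqrt(n)(theta_n - theta_0) sits near 0.56 rather than 0.
print(np.sqrt(n) * (est - theta0).mean())
```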
Max of means: first result

- For v ∈ R^p define U(v) = arg max_j v_j

Lemma. Assume regularity conditions under which √n(P_n − P)X is asymptotically normal with mean zero and variance-covariance matrix Σ. Then

    √n(θ_n − θ_0) ⇝ ⋁_{j ∈ U(µ_0)} Z_j,

where Z ~ Normal(0, Σ).
Max of means: proof of first result
Max of means: discussion of first result

- Limiting distribution of √n(θ_n − θ_0) depends abruptly on µ_0
  - If µ_0 = (0, 0)ᵀ and Σ = I_2, the limiting distn is the max of two ind std normals
  - If µ_0 = (0, ε)ᵀ for ε > 0 and Σ = I_2, the limiting distn is std normal even if ε = 1 × 10⁻²⁷!!
- How can we use such an asymptotic result in practice?!
- Limiting distn of √n(θ_n − θ_0) depends only on the submatrix of Σ corresponding to the elements of U(µ_0). What about in finite samples?
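The abrupt dependence is easy to see numerically. A small simulation sketch (assumed setup of mine: µ_0 = (0, ε) with ε = 10⁻²⁷ and Σ = I_2): the lemma says the limit is standard normal, yet at any realistic n the statistic behaves like the max of two ind std normals, e.g. P{√n(θ_n − θ_0) ≤ 0} is near 1/4 rather than 1/2.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_reps, eps = 100, 4000, 1e-27
mu0 = np.array([0.0, eps])            # unique maximizer, but only barely
theta0 = mu0.max()

stat = np.empty(n_reps)
for r in range(n_reps):
    X = rng.normal(loc=mu0, scale=1.0, size=(n, 2))
    stat[r] = np.sqrt(n) * (X.mean(axis=0).max() - theta0)

# P(max(Z_1, Z_2) <= 0) = 1/4 for ind std normals; a std normal gives 1/2
print((stat <= 0).mean())
```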
Max of means: discussion of first result cont'd

- Suppose X_1, ..., X_n ~ i.i.d. Normal(µ_0, I_p) and µ_0 has a unique maximizer, i.e., U(µ_0) is a singleton, say {1}, so θ_0 = µ_{0,1}
  - √n(θ_n − θ_0) ⇝ Normal(0, 1)
  - In finite samples,

      P{√n(θ_n − θ_0) ≤ t} = Φ(t) ∏_{j=2}^p Φ(t + √n(θ_0 − µ_{0,j}))

    Quick break: derive this.
  - If the gaps θ_0 − µ_{0,j} are small relative to 1/√n, the finite-sample behavior can be quite different from the limit Φ(t)¹

¹Note that the limiting distribution doesn't depend on these gaps at all!
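The displayed finite-sample CDF can be checked directly. A minimal sketch (simulated normal data with gaps of 0.1 at n = 25; these choices are mine): evaluate the product formula and compare with a Monte Carlo estimate; the two agree, and both sit well below the limiting value Φ(t).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n, n_reps, t = 25, 20000, 0.5
mu0 = np.array([2.0, 1.9, 1.9])       # gaps of 0.1, i.e. 0.5/sqrt(n)
theta0 = mu0.max()

# exact finite-sample CDF from the slide
exact = norm.cdf(t) * np.prod(norm.cdf(t + np.sqrt(n) * (theta0 - mu0[1:])))

# Monte Carlo check
X = rng.normal(loc=mu0, scale=1.0, size=(n_reps, n, 3))
stat = np.sqrt(n) * (X.mean(axis=1).max(axis=1) - theta0)
mc = (stat <= t).mean()

print(exact, mc, norm.cdf(t))   # exact and mc agree; both well below Phi(t)
```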
Max of means: normal approximation in pictures

- Generate data from Normal(µ, I_6) with µ_1 = 2 and µ_j = µ_1 − δ for j = 2, ..., 6. Results shown for n = 100.
[Three histograms of √n(θ_n − θ_0), one panel each for δ = 0.5, 0.1, and 0.01; horizontal axis from −4 to 4, density axis from 0.0 to 0.6.]
Choosing the right asymptotic framework

- Dangerous pattern of thinking:
  - In practice, none of the txt effect differences are zero.
  - I'll build my asy approximations assuming a unique maximizer.
  - There are finitely many components, so the maximizer is well-separated.
  - Idea! Plug in the estimated maximizer and use the asy normal approx.
- The preceding pattern happens frequently, e.g., the oracle property in model selection, max eigenvalues in matrix problems, and txt regimes
Choosing the right asymptotic framework

- What goes wrong? After all, this thinking works well in many other settings, e.g., everything you learned in stat 101.
- Finite-sample behavior is driven by small (not necessarily zero) differences in txt effectiveness
  - We saw this analytically in the normal case
  - Intuition helped by thinking in extremes, e.g., what if all txts were equal? What if one were infinitely better than the others?
- Abrupt dependence of the limiting distribution on U(µ_0) is a red flag. It is tempting to construct procedures that will recover this limiting distn even if some txt differences are exactly zero. This is asymptotics for asymptotics' sake. Don't do it.
Asymptotic working assumptions

- A useful asy approximation should be robust to the setting where some (or all) txt differences are zero
  - Necessary but not sufficient
- Heuristic: in small samples, one cannot distinguish between small (but nonzero) txt differences, so use an asy framework which allows for exact equality.
  - This heuristic has been misinterpreted and misused in the lit.
- Some procedures we'll look at are designed for such robustness
Local asymptotics: horseshoes and hand grenades

- Allowing null txt differences is problematic
  - Asymptotically, differences are either zero or infinite²
  - Txt differences are (probably) not exactly zero
- Challenge: allow small differences to persist as n diverges
  - The local or moving-parameter asy framework does this
  - Idea: allow the generative model to change with n so that the gaps θ_0 − max_{j ∉ U(µ_0)} µ_{0,j} shrink to zero as n increases³

²In the stat sense that we have power one to discriminate between them.
³This idea should be familiar from hypothesis testing.
Triangular arrays

- For each n, X_{1,n}, ..., X_{n,n} ~ i.i.d. P_n

    Observations                            Distribution
    X_{1,1}                                 P_1
    X_{1,2}  X_{2,2}                        P_2
    X_{1,3}  X_{2,3}  X_{3,3}               P_3
    X_{1,4}  X_{2,4}  X_{3,4}  X_{4,4}      P_4
    ...                                     ...

- Define µ_{0,n} = P_n X and θ_{0,n} = ⋁_{j=1}^p µ_{0,n,j}
- Assume µ_{0,n} = µ_0 + s/√n, where s ∈ R^p is called the local parameter
- Assume √n(P_n − P_n)X ⇝ Normal(0, Σ)⁴

⁴This is true under very mild conditions on the sequence of distributions {P_n}_{n≥1}. However, given our limited time we will not discuss such conditions. See van der Vaart and Wellner (1996) for details.
Quick quiz

- Suppose that X_{1,n}, ..., X_{n,n} ~ i.i.d. Normal(µ_0 + s/√n, Σ); what is the distribution of √n(P_n − P_n)X?
Local alternatives anticipate unstable performance

Lemma. Let s ∈ R^p be fixed. Assume that for each n we observe {X_{i,n}}_{i=1}^n drawn i.i.d. from P_n, which satisfies: (i) P_n X = µ_0 + s/√n, and (ii) √n(P_n − P_n)X ⇝ Normal(0, Σ). Then, under P_n,

    √n(θ_n − θ_{0,n}) ⇝ ⋁_{j ∈ U(µ_0)} (Z_j + s_j) − ⋁_{j ∈ U(µ_0)} s_j,

where Z ~ Normal(0, Σ).

- Discussion/observations on the local limiting distn
  - Dependence of the limiting distn on s ⇒ nonregular
  - The set U(µ_0) represents the set of near-maximizers, though s_j = 0 corresponds to exact equality (so we haven't ruled this out)
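The lemma's triangular-array setup can be simulated (an illustration of my own: µ_0 = (0, 0), local parameter s = (0, 1), Σ = I_2): the √n-scaled statistic matches the s-dependent limit max(Z_1 + s_1, Z_2 + s_2) − max(s_1, s_2), not the fixed-parameter limit max(Z_1, Z_2).

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_reps = 400, 4000
s = np.array([0.0, 1.0])              # local parameter
mu0n = s / np.sqrt(n)                 # mu_0 = (0, 0), so mu_{0,n} = mu_0 + s/sqrt(n)
theta0n = mu0n.max()

stat = np.empty(n_reps)
for r in range(n_reps):
    X = rng.normal(loc=mu0n, scale=1.0, size=(n, 2))
    stat[r] = np.sqrt(n) * (X.mean(axis=0).max() - theta0n)

# the lemma's limit: max(Z_1 + s_1, Z_2 + s_2) - max(s_1, s_2)
Z = rng.normal(size=(n_reps, 2))
limit = np.maximum(Z[:, 0] + s[0], Z[:, 1] + s[1]) - s.max()

# both means are shifted away from 0 because the limit depends on s
print(stat.mean(), limit.mean())
```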
Proof of local limiting distribution
Intermission: more regularity after this
Comments on nonregularity

- Sensitivity of the estimator to local alternatives cannot be rectified through the choice of a more clever estimator
  - It is an inherent property of the estimand⁵
  - This has not stopped some from trying...
- Remainder of today's notes: a catalog of confidence intervals

⁵See van der Vaart (1991), Hirano and Porter (2012), and L. et al. (2011, 2014, 2019).
Projection region

- Idea: exploit the following two facts
  - µ_n is nicely behaved (regular, asy normal)
  - If µ_0 were known, this would be trivial⁶
- Let α ∈ (0, 1) denote an acceptable error level and ζ_{n,1−α} a confidence region for µ_0, e.g.,

    ζ_{n,1−α} = {µ ∈ R^p : n(µ_n − µ)ᵀ Σ_n⁻¹ (µ_n − µ) ≤ χ²_{p,1−α}},

  where Σ_n = P_n(X − µ_n)(X − µ_n)ᵀ
- Projection CI:

    Γ_{n,1−α} = {θ ∈ R : θ = ⋁_{j=1}^p µ_j for some µ ∈ ζ_{n,1−α}}

⁶In this problem, θ_0 is a function of µ_0 and is thus completely known when µ_0 is known. In more complicated problems, knowing the value of a nuisance parameter will make the inference problem of interest regular.
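A minimal sketch of the projection interval (not the lecture's code; the optimizer and tuning choices are mine): the upper endpoint of Γ_{n,1−α} separates coordinate-wise, while the lower endpoint is the minimum of the convex map µ ↦ max_j µ_j over the ellipsoid, found here via an epigraph reformulation and SLSQP.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

def projection_interval(X, alpha=0.05):
    n, p = X.shape
    mu_n = X.mean(axis=0)
    Sigma_n = np.cov(X, rowvar=False)
    Sigma_inv = np.linalg.inv(Sigma_n)
    c = chi2.ppf(1 - alpha, df=p)

    # sup of max_j mu_j over the ellipsoid separates coordinate-wise
    upper = np.max(mu_n + np.sqrt(c * np.diag(Sigma_n) / n))

    # inf of max_j mu_j: epigraph form with z = (t, mu); minimize t
    # subject to t >= mu_j for all j and mu inside the ellipsoid
    cons = [{"type": "ineq", "fun": lambda z, j=j: z[0] - z[1 + j]}
            for j in range(p)]
    cons.append({"type": "ineq",
                 "fun": lambda z: c - n * (mu_n - z[1:]) @ Sigma_inv @ (mu_n - z[1:])})
    z0 = np.concatenate([[mu_n.max()], mu_n])  # feasible start
    lower = minimize(lambda z: z[0], z0, constraints=cons, method="SLSQP").fun
    return lower, upper

rng = np.random.default_rng(4)
X = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2))
lo, hi = projection_interval(X)
print(lo, hi)   # contains the plug-in estimate max_j mu_{n,j}
```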
Prove the following with your stat buddy

- P(θ_0 ∈ Γ_{n,1−α}) ≥ 1 − α + o_P(1)
Comments on projection regions

- Useful when the parameter of interest is a non-smooth functional of a smooth (regular) parameter
- Robust and widely applicable, but conservative
  - Projection interval valid under local alternatives (why?)
  - Can reduce conservatism using a pre-test (L. et al., 2014)
- See Berger and Boos (1994) and Robins (2004) for seminal papers
- Consider as a first option in a new non-regular problem
Bound-based confidence intervals

- Idea: sandwich the non-smooth functional between smooth upper and lower bounds, then bootstrap the bounds to form a conf region
- Let {τ_n}_{n≥1} be a seq of pos constants such that τ_n → ∞ and τ_n = o(√n) as n → ∞; define

    U_n(µ_0) = {j : max_k √n(µ_{n,k} − µ_{n,j})/σ_{j,k,n} ≤ τ_n},

  where σ_{j,k,n} is an est of the asy std dev of √n(µ_{n,k} − µ_{n,j}).

  Note: it may help to think of U_n(µ_0) as the indices of txts that we cannot distinguish from being optimal.
Bound-based confidence intervals cont'd

- Given U_n(µ_0), define

    S_n(µ_0) = {s ∈ R^p : s_j = µ_{0,j} if j ∈ U_n(µ_0)};

  then it follows that

    U_n = sup_{s ∈ S_n(µ_0)} √n { ⋁_{j=1}^p (µ_{n,j} − µ_{0,j} + s_j) − ⋁_{j=1}^p s_j }

  is an upper bound on √n(θ_n − θ_0). (Why?) A lower bound, L_n, is constructed by replacing the sup with an inf.
Dad, where do bounds come from?

- U_n is obtained by taking the sup over all local, i.e., order 1/√n, perturbations of the generative model
  - By construction, insensitive to local perturbations ⇒ regular
  - U_n(µ_0) is a conservative est of U(µ_0); let's wave our hands:
- General strategy: look at the behavior of the (properly centered and scaled) estimator under local perturbations and take the sup/inf over the nonregular parts
Bootstrapping the bounds

- Both U_n and L_n are regular, and their distns can be consistently estimated via the nonpar bootstrap
- Let u⁽ᵇ⁾_{n,1−α/2} be the (1 − α/2) × 100 percentile of the bootstrap distn of U_n and l⁽ᵇ⁾_{n,α/2} the (α/2) × 100 percentile of the bootstrap distn of L_n
- Bound-based confidence interval:

    [θ_n − u⁽ᵇ⁾_{n,1−α/2}/√n,  θ_n − l⁽ᵇ⁾_{n,α/2}/√n]
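A rough sketch of the bound-based interval, under two assumptions of mine that go beyond the slides: the tuning choice τ_n = √(log n), and the simplification that, for this max-of-means problem, the sup/inf over S_n reduce to the max/min of √n(µ⁽ᵇ⁾_{n,j} − µ_{n,j}) over the estimated near-maximizer set.

```python
import numpy as np

def bound_ci(X, alpha=0.05, n_boot=500, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    tau = np.sqrt(np.log(n))                   # tuning choice (assumption)
    mu_n = X.mean(axis=0)
    theta_n = mu_n.max()

    def near_max_set(Z):
        """Arms whose standardized deficit from every other arm is <= tau."""
        m, v = Z.mean(axis=0), Z.var(axis=0, ddof=1)
        se = np.sqrt(v[None, :] + v[:, None])  # se for mu_k - mu_j (ind coords)
        T = np.sqrt(len(Z)) * (m[None, :] - m[:, None]) / se
        return np.where(T.max(axis=1) <= tau)[0]

    U_b = np.empty(n_boot)
    L_b = np.empty(n_boot)
    for b in range(n_boot):
        Xb = X[rng.integers(0, n, size=n)]     # nonparametric bootstrap draw
        idx = near_max_set(Xb)
        c = np.sqrt(n) * (Xb.mean(axis=0)[idx] - mu_n[idx])
        U_b[b], L_b[b] = c.max(), c.min()      # bootstrap upper/lower bounds

    u = np.quantile(U_b, 1 - alpha / 2)
    l = np.quantile(L_b, alpha / 2)
    return theta_n - u / np.sqrt(n), theta_n - l / np.sqrt(n)

rng = np.random.default_rng(5)
X = rng.normal(loc=[0.0, 0.0, 0.0], scale=1.0, size=(200, 3))
lo, hi = bound_ci(X)
print(lo, hi)
```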
Bound-based intervals discussion

- General approach; applies to implicitly defined estimators as well as those with closed-form expressions like the one considered here
- Less conservative than the projection interval but still conservative; such conservatism is unavoidable
  - The bounds are tightest in some sense
  - Bounding the quantiles directly rather than the estimand may reduce conservatism, though possibly at the price of addl complexity
- See Fan et al. (2017) for other improvements/refinements
Bootstrap methods

- The bootstrap is not consistent here without modification
  - Due to instability (nonregularity)
  - Nondifferentiability of the max operator causes this instability (see Shao 1994 for a nice review)
- The bootstrap is appealing for complex problems
  - Doesn't require explicitly computing asy approximations⁷
  - Higher-order convergence properties

⁷There are exceptions to this, including the parametric bootstrap and those based on quadratic expansions.
How about some witchcraft?

- The m-out-of-n bootstrap can be used to create valid confidence intervals for non-smooth functionals
- Idea: resample datasets of size m_n = o(n) so that sample-level parameters converge 'faster' than their bootstrap analogs (i.e., witchcraft)
m-out-of-n bootstrap

- Accepting some components of witchcraft on faith:
  - √m_n(µ⁽ᵇ⁾_{m_n} − µ_n) ⇝ Normal(0, Σ) conditional on the data
  - See Arcones and Giné (1989) for details
- An even toyier example than our toy example: W_1, ..., W_n i.i.d. w/ mean µ and variance σ²; derive the limit distns of √n(|W̄_n| − |µ|) and √m_n(|W̄⁽ᵇ⁾_{m_n}| − |W̄_n|)
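The toyier example can be simulated (a sketch with my own choices n = 50,000 and m_n = 50, so m_n = o(n)): at the nonregular point µ = 0, the m-out-of-n bootstrap distribution of √m_n(|W̄⁽ᵇ⁾_{m_n}| − |W̄_n|) tracks the true limit |Z| (median ≈ 0.674), while the n-out-of-n version need not.

```python
import numpy as np

rng = np.random.default_rng(6)
n, m, B = 50_000, 50, 1000            # m_n = o(n); these choices are mine
W = rng.normal(0.0, 1.0, size=n)      # mu = 0: the nonregular point
Wbar = W.mean()

boot_n = np.empty(B)
boot_m = np.empty(B)
for b in range(B):
    boot_n[b] = np.sqrt(n) * (abs(W[rng.integers(0, n, n)].mean()) - abs(Wbar))
    boot_m[b] = np.sqrt(m) * (abs(W[rng.integers(0, n, m)].mean()) - abs(Wbar))

# the true limit of sqrt(n)(|Wbar_n| - |mu|) at mu = 0 is |Z|, median ~ 0.674
print(np.median(boot_m), np.median(boot_n))
```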
m-out-of-n bootstrap with max of means

- Derive the limiting distribution of √m_n(θ⁽ᵇ⁾_{m_n} − θ_n):
Intermission

"I wouldn't want to wind up hooked to a bunch of wires and tubes, unless somehow the wires and tubes were keeping me alive." — Don Alden Adams
Finally! Back to treatment regimes (briefly)

- Consider a one-stage problem with observed data {(X_i, A_i, Y_i)}_{i=1}^n, where X ∈ R^p, A ∈ {−1, 1}, and Y ∈ R
- Assume the requisite causal conditions hold
- Assume linear rules π(x) = sign(xᵀβ), where β ∈ R^p and x might contain polynomial terms, etc.
Warm-up! Derive the limiting distn of the parameters in linear Q-learning!!!!⁸

- Posit the linear model Q(x, a; β) = x_0ᵀβ_0 + a x_1ᵀβ_1, indexed by β = (β_0ᵀ, β_1ᵀ)ᵀ, with x_0, x_1 known features.
- β_n = arg min_β P_n{Y − Q(X, A; β)}² and β* = arg min_β P{Y − Q(X, A; β)}²
- Derive the limiting distribution of √n(β_n − β*)
- Construct a confidence interval for Q(x, a), assuming Q(x, a) = Q(x, a; β*)

⁸He exclaimed.
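A worked numerical version of the warm-up on simulated data (the generative model and the homoskedastic-error variance formula are my choices, not the slides'): fit Q(x, a; β) = x_0ᵀβ_0 + a x_1ᵀβ_1 by least squares and form a Wald CI for Q(x, a).

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # x0 = x1 = (1, x)
A = rng.choice([-1, 1], size=n)
beta0_true = np.array([1.0, 0.5])
beta1_true = np.array([0.25, -0.5])
Y = X @ beta0_true + A * (X @ beta1_true) + rng.normal(size=n)

# least squares on the design [x0, a * x1]
D = np.column_stack([X, A[:, None] * X])
beta_n, *_ = np.linalg.lstsq(D, Y, rcond=None)
resid = Y - D @ beta_n
sigma2 = resid @ resid / (n - D.shape[1])
cov = sigma2 * np.linalg.inv(D.T @ D)     # homoskedastic-error variance

# Wald CI for Q(x, a) at x = (1, 0.3), a = 1; true Q(x, 1) here is 1.25
x, a = np.array([1.0, 0.3]), 1.0
d = np.concatenate([x, a * x])
q_hat = d @ beta_n
se = np.sqrt(d @ cov @ d)
ci = (q_hat - 1.96 * se, q_hat + 1.96 * se)
print(q_hat, ci)
```

Q(x, a; β) is linear in β, so the CI follows directly from the asymptotic normality of √n(β_n − β*) that the warm-up asks you to derive.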
Parameters in (1-stage) Q-learning are easy!

I Similar arguments show that the coefficients indexing g-computation and outcome weighted learning are asymptotically normal

I Preview: consider the (regression-based) estimator of the value of π̂n(x) = sign(x1ᵀβ̂1,n), which you'll recall is

V̂n(β̂1,n) = Pn max_a Q(X, a; β̂n) = Pn X0ᵀβ̂0,n + Pn|X1ᵀβ̂1,n|

What is the limit of √n{V̂n(β̂1,n) − V(β̂1,n)}, and can we use it to derive a CI for V(β̂n)? What about a CI for V(β∗)?
Parameters in (1-stage) OWL are easy!

I To illustrate, assume P(A = 1|X) = P(A = −1|X) = 1/2 w.p. 1

I Recall OWL is based on a convex relaxation of the IPWE

V̂n(β) = Pn[ Y 1{A = sign(Xᵀβ)} / P(A|X) ] = 2 Pn Y 1{A Xᵀβ > 0}

I Let ℓ : R → R be convex; the OWL estimator is

β̂n = arg min_{β∈Rp} Pn |Y| ℓ(Wᵀβ), where W = sign(Y) A X
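The weighted-classification form above can be sketched in a few lines. The slides leave ℓ generic (hinge loss is the usual OWL choice); the sketch below uses the smooth convex surrogate ℓ(u) = log(1 + e^{−u}) so plain gradient descent suffices. The data-generating model is hypothetical, with P(A = 1|X) = 1/2 by design.

```python
import numpy as np

# OWL sketch: minimize Pn |Y| ell(W'beta), W = sign(Y) A X, with the
# assumed surrogate ell(u) = log(1 + exp(-u)). Toy data; true optimal
# rule is sign(x1) because the treatment effect is x1.
rng = np.random.default_rng(1)
n = 2000
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 2))])
A = rng.choice([-1.0, 1.0], size=n)
Y = 1.0 + A * X[:, 1] + 0.25 * rng.normal(size=n)
W = np.sign(Y)[:, None] * A[:, None] * X          # W = sign(Y) A X
w = np.abs(Y)                                     # weights |Y|

beta = np.zeros(3)
for _ in range(2000):                             # gradient descent; loss is convex
    u = W @ beta
    p = 1.0 / (1.0 + np.exp(np.clip(u, -30, 30))) # equals -ell'(u)
    grad = -(W * (w * p)[:, None]).mean(axis=0)
    beta -= 0.5 * grad

# the fitted rule sign(x'beta) should mostly match the true rule sign(x1)
agree = np.mean(np.sign(X @ beta) == np.sign(X[:, 1]))
```

Note the convexity in β: the objective is a nonnegatively weighted average of convex functions of a linear map, which is exactly the structural fact the next slide exploits.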
Some facts about OWL (and more generally convex M-estimators)

I β ↦ |y|ℓ(wᵀβ) is the composition of a linear and a convex function and thus convex in β for each (y, w) ⇒ greatly simplifies inference!

I Regularity conditions

I β∗ = arg min_β P|Y|ℓ(Wᵀβ) exists and is unique

I The map β ↦ P|Y|ℓ(Wᵀβ) is differentiable in a neighborhood of β∗9

I Under these conditions, √n(β̂n − β∗) is regular10 and asymptotically normal ⇒ many results from Q-learning port over

9More formally, require |y|ℓ(wᵀ(β∗ + δ)) − |y|ℓ(wᵀβ∗) = S(y, w; β∗)ᵀδ + R(y, w, δ; β∗), where PS(Y, W; β∗) = 0, Σ_O = P S(Y, W; β∗)S(Y, W; β∗)ᵀ is finite, and PR(Y, W, δ; β∗) = (1/2)δᵀΩ_O δ + o(‖δ‖²) (Haberman, 1989; Niemiro, 1992; Hjort and Pollard, 2011).

10I am being a bit loose with language in this course by referring to both estimands and rescaled estimators as 'regular' or 'non-regular.'
Value function(s)

I Three ways to measure performance

I Conditional value: V(π̂n) = P Y∗(π̂n) = E{Y∗(π̂n) | π̂n}, measures the performance of an estimated decision rule as if it were to be deployed in the population (note: this is a random variable)

I Unconditional value: V̄n = E V(π̂n), measures the average performance of the algorithm used to construct π̂n with a sample of size n

I Population-level value: V(π∗), where π∗(x) = sign(xᵀβ∗), measures the potential of applying a precision medicine strategy in a given domain if the algorithm for constructing π̂n will be used

I Discuss these measures with your stat buddy. Is there a meaningful distinction as the sample size grows large?
It's a wacky world out there

I The three value measures need not coincide asymptotically

I Let π̂n(x) = sign(xᵀβ̂n) and suppose √n β̂n ⇝ Normal(0, Σ), so that β∗ ≡ 0 and π∗(x) ≡ −1. With your stat buddy, compute:

I V̂n(β̂n) ⇝ ????

I V̄n = E V(β̂n) → ????

I V(β∗) = ????
Calculon!
Have some confidence you useless pile!

I We'll construct confidence sets for V(β̂n) and V(β∗) as these are most commonly of interest in application

I Starting with the conditional value, assume that the data-generating model is a triangular array P_n such that:

(A0) π̂n(x) = sign(xᵀβ̂n)

(A1) There exists β∗n s.t. β∗n = β∗ + s/√n for some s ∈ Rp, and √n(β̂n − β∗n) = √n(Pn − P_n)u(X, A, Y) + o_{P_n}(1), where u does not depend on s, sup_n P_n‖u(X, A, Y)‖² < ∞, and Cov u(X, A, Y) is p.d.

(A2) If F is a uniformly bounded Donsker class and √n(Pn − P) ⇝ T in ℓ∞(F) under P, then √n(Pn − P_n) ⇝ T in ℓ∞(F) under P_n.

(A3) sup_n P_n‖Y‖² < ∞.

I A detailed discussion is beyond the scope of this class. Laber will wave his hands a bit. Our goal is to understand the key steps/ideas.
Building block: joint distribution before the nonsmooth operator

I Define the class of functions

G = { g(X, A, Y; δ) = Y 1{A Xᵀδ > 0} 1{Xᵀβ∗ = 0} : δ ∈ Rp },

and view √n(Pn − P_n) as a random element of ℓ∞(Rp).

Lemma. Assume (A0)-(A3). Then

√n( Pn − P_n, β̂n − β∗, (Pn − P_n) Y 1{A Xᵀβ∗ > 0} ) ⇝ (T, Z, W)

in ℓ∞(Rp) × Rp × R under P_n.
Limiting distribution of V̂n(β̂n)

Corollary. Assume (A0)-(A3). Then

√n{V̂n(β̂n) − V(β̂n)} ⇝ T(Z + s) + W.

I Notes

I The presence of s shows this is nonregular

I T is a Brownian bridge indexed by Rp

I W and Z are normal
Hand-waving!
Bound-based confidence interval

I Limiting distribution: T(Z + s) + W

I The local parameter appears only in the first term

I An (asymptotic) bound therefore need only affect this term

I Schematic for constructing a bound

I Partition the input space into points that are 'near' the decision boundary xᵀβ∗ = 0 vs. those that are 'far' from the boundary

I Take the sup/inf over local perturbations of points in the 'near' group
Upper bound

I Let Σ̂n be an estimator of the asymptotic variance of β̂n; an upper bound on √n{V̂n(β̂n) − V(β̂n)} is

Un = sup_{ω∈Rp} √n(Pn − P_n) Y 1{A Xᵀω > 0} 1{ n(Xᵀβ̂n)² / (Xᵀ Σ̂n X) ≤ τn }
+ √n(Pn − P_n) Y 1{A Xᵀβ̂n > 0} 1{ n(Xᵀβ̂n)² / (Xᵀ Σ̂n X) > τn },

where τn is a sequence of tuning parameters s.t. τn → ∞ and τn = o(n) as n → ∞. The lower bound Ln is constructed by replacing sup with inf.
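The near/far decomposition can be sketched numerically. Everything below is a toy illustration under assumed simplifications: p = 2, a grid of directions standing in for the sup over ω ∈ Rp, and a single nonparametric bootstrap draw standing in for the centered process √n(Pn − P); it is not a faithful implementation of the bound.

```python
import numpy as np

# Toy sketch of U_n and L_n: split points into 'near'/'far' using the
# studentized distance to the boundary, sup/inf the 'near' term over a
# direction grid, and keep the plug-in rule on the 'far' term.
rng = np.random.default_rng(2)
n = 500
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 1))])
A = rng.choice([-1.0, 1.0], size=n)
beta_hat = np.array([0.1, 1.0])
Y = 1.0 + A * (X @ beta_hat) + rng.normal(size=n)
Sigma = np.cov((A[:, None] * X).T)          # crude plug-in variance estimate
tau = np.log(n)                             # tau_n -> inf, tau_n = o(n)

# 'near' the boundary: n (x'beta)^2 / (x' Sigma x) <= tau
quad = np.einsum('ij,jk,ik->i', X, Sigma, X)
near = n * (X @ beta_hat) ** 2 / quad <= tau

wts = rng.multinomial(n, [1.0 / n] * n)     # one bootstrap draw of Pn*
def centered(ind):                          # sqrt(n)(Pn* - Pn) applied to Y 1{ind}
    g = Y * ind
    return np.sqrt(n) * ((wts * g).sum() / n - g.mean())

far = centered((A * (X @ beta_hat) > 0) & ~near)
thetas = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
vals = [centered((A * (X @ np.array([np.cos(t), np.sin(t)])) > 0) & near)
        for t in thetas]
U_n, L_n = max(vals) + far, min(vals) + far
```

By construction L_n ≤ U_n, and the gap between them shrinks as fewer points fall in the 'near' group, matching the remark on the next slide that the bounds are tight when all subjects have large treatment effects.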
Limiting distribution of the bounds

Theorem. Assume (A0)-(A3). Then

(Ln, Un) ⇝ ( inf_{ω∈Rp} T(ω) + W, sup_{ω∈Rp} T(ω) + W )

under P_n.

I Recall the limiting distribution of √n{V̂n(β̂n) − V(β̂n)} is T(Z + s) + W

I The bounds are equivalent to taking the sup/inf over local perturbations

I If all subjects have large treatment effects, the bounds are tight

I Bootstrap the bounds to construct a confidence interval; theoretical results for the bootstrapped bounds are given in the book
Note on tuning

I The sequence {τn}n≥1 can affect finite-sample performance

I Idea: tune using the double bootstrap, i.e., bootstrap the bootstrap samples to estimate coverage and adapt τn

I The double bootstrap is considered computationally expensive, but it is not much of a burden in most problems with modern computing infrastructure

I Tuning can be done without affecting the theoretical results
Algy the friendly tuning algorithm

Alg. 1.5: Tuning the critical value τn using the double bootstrap
Input: {(Xi, Ai, Yi)}ni=1, M, α ∈ (0, 1), τn(1), ..., τn(L)
1 V̂ = V̂n(d̂n)
2 for j = 1, ..., L do
3   c(j) = 0
4   for b = 1, ..., M do
5     Draw a sample of size n, say Sn(b), from {(Xi, Ai, Yi)}ni=1 with replacement
6     Compute the bound-based confidence set, ζ(b), using sample Sn(b) and critical value τn(j)
7     if V̂ ∈ ζ(b) then c(j) = c(j) + 1
8   end
9 end
10 Set j∗ = arg min_{j : c(j) ≥ M(1−α)} c(j)
Output: Return τn(j∗)

Let Pmn(b) denote the empirical measure based on a resample of size mn; more generally, for any Z = f(P, Pn) write Zmn(b) to denote f(Pn, Pmn(b)). The m-out-of-n bootstrap approximates the sampling distribution of √n{V̂n(d̂n) − V(d̂n)} with the conditional distribution of √mn{V̂mn(b)(d̂mn(b)) − V̂n(d̂mn(b))} given the observed data. For α ∈ (0, 1), let ℓ̂mn and ûmn denote the (α/2) × 100 and (1 − α/2) × 100 percentiles of √mn{V̂mn(b)(d̂mn(b)) − V̂n(d̂mn(b))}; then [V̂n(d̂n) − ûmn/√mn, V̂n(d̂n) − ℓ̂mn/√mn] is a (1 − α) × 100% m-out-of-n bootstrap confidence set for V(d̂n). Algorithm 1.6 provides a schematic for computing this confidence set.

Let Z be defined as above. To establish consistency of the m-out-of-n bootstrap, in addition to (A0)-(A3) and the asymptotic growth conditions on mn, we assume (A4): √mn(β̂mn(b) − β̂n), conditional on the observed data, converges in distribution to Z in probability. This holds under quite general conditions (see Gine and Zinn, 1990, and Ch. 3.6 of Van Der Vaart and Wellner, 1996).

Theorem 1.2.7. Assume (A0)-(A3) with local parameter s = 0. Let mn be a sequence of positive integers such that mn → ∞ and mn = o(n) as n → ∞.
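The counting skeleton of Alg. 1.5 can be illustrated on a deliberately simple target: a mean, with a grid of candidate critical values standing in for τn(1), ..., τn(L), and plain bootstrap intervals standing in for the bound-based sets. This is a hypothetical stand-in kept only to show the double-bootstrap selection logic.

```python
import numpy as np

# Toy Alg. 1.5 skeleton: for each candidate critical value, count how
# often the second-level bootstrap interval covers the point estimate,
# then pick the least conservative value meeting M(1 - alpha).
rng = np.random.default_rng(3)
y = rng.normal(1.0, 1.0, size=200)
M, alpha = 100, 0.1
taus = [1.0, 1.5, 1.96, 2.5]                # candidate critical values (toy)
V_hat = y.mean()                            # plays the role of V-hat_n(d-hat_n)
counts = []
for tau in taus:
    c = 0
    for _ in range(M):
        yb = rng.choice(y, size=y.size, replace=True)   # bootstrap sample
        half = tau * yb.std() / np.sqrt(yb.size)
        c += (yb.mean() - half <= V_hat <= yb.mean() + half)
    counts.append(c)
# coverage is roughly increasing in tau, so the smallest adequate tau
# corresponds to the argmin over {j : c(j) >= M(1 - alpha)} in Alg. 1.5
ok = [t for t, c in zip(taus, counts) if c >= M * (1 - alpha)]
tau_star = min(ok) if ok else taus[-1]
```

The nested loop is embarrassingly parallel over (j, b), which is why the slide's claim that the double bootstrap "is not much of a burden" on modern infrastructure is plausible.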
Intermission

If any man says he hates war more than I do, he better have a knife, that's all I have to say. –Ghandi
m-out-of-n bootstrap

I Bound-based intervals are complex (conceptually and technically)

I Subsampling is easier to implement and understand11

I Does not require specialized code, etc.

I Let mn be a resample size s.t. mn → ∞ and mn = o(n), and let Pmn(b) be the bootstrap empirical distribution

I Approximate √n{V̂n(β̂n) − V(β̂n)} with its bootstrap analog √mn{V̂mn(b)(β̂mn(b)) − V̂n(β̂n)} (Laber might draw a picture)

I Let ℓ̂mn and ûmn be the (α/2) × 100 and (1 − α/2) × 100 percentiles of √mn{V̂mn(b)(β̂mn(b)) − V̂n(β̂n)}; the CI is given by

[V̂n(β̂n) − ûmn/√mn, V̂n(β̂n) − ℓ̂mn/√mn]

11Though the theory underpinning subsampling can be non-trivial, so 'understand' here is meant more mechanically.
Emmy the subsampling bootstrap algorithm

Alg. 1.6: m-out-of-n bootstrap confidence set for the conditional value
Input: mn, {(Xi, Ai, Yi)}ni=1, M, α ∈ (0, 1)
1 for b = 1, ..., M do
2   Draw a sample of size mn, say Smn(b), from {(Xi, Ai, Yi)}ni=1 with replacement
3   Compute β̂mn(b) on Smn(b)
4   Δmn(b) = √mn[ (1/mn) Σ_{i∈Smn(b)} Yi 1{Ai Xiᵀβ̂mn(b) > 0} − (1/n) Σ_{k=1}^n Yk 1{Ak Xkᵀβ̂mn(b) > 0} ]
5 end
6 Relabel so that Δmn(1) ≤ Δmn(2) ≤ ··· ≤ Δmn(M)
7 ℓ̂mn = Δmn(⌈Mα/2⌉)
8 ûmn = Δmn(⌈M(1−α/2)⌉)
Output: [V̂n(d̂n) − ûmn/√mn, V̂n(d̂n) − ℓ̂mn/√mn]

... sampling mn observations with replacement from the observed data,

sup_{g∈BL1(Rp)} | P g(Z) − P̂M,mn g(√mn(β̂mn(b) − β̂n)) |

converges to zero in probability.

Let Gmn(b) = √mn(Pmn(b) − Pn). A proof of this result is omitted; however, the crux of the proof is the expansion

√mn{V̂mn(b)(d̂mn(b)) − V̂n(d̂mn(b))} = Gmn(b) Y 1{A Xᵀβ̂mn(b) > 0}
= Gmn(b)( Y 1[ A Xᵀ{√mn(β̂mn(b) − β̂n) + (mn/n)^{1/2} √n(β̂n − β∗)} > 0 ] 1{Xᵀβ∗ = 0} )
+ Gmn(b) Y 1{A Xᵀβ̂mn(b) > 0} 1{Xᵀβ∗ ≠ 0},

where Gmn(b), seen as an element of ℓ∞(Rp) via η ↦ Gmn(b) Y 1{A Xᵀη > 0} 1{Xᵀβ∗ = 0}, converges in distribution to T (Van Der Vaart and Wellner, 1996, Thm 3.6.3), and (mn/n)^{1/2} √n(β̂n − β∗) converges to zero in probability.

Application of the m-out-of-n bootstrap confidence interval requires a choice of resample size mn. The asymptotic conditions mn → ∞ and mn = o(n) do not provide guidance about what mn should be for a given data set.
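Alg. 1.6 is short enough to sketch end to end. The sketch below uses toy modeling choices not specified in the slides: an IPW value estimator with P(A|X) = 1/2 (so the weight is 2), an OLS stage model to get β̂, and mn = n^0.8 as the resample size.

```python
import numpy as np

# Toy Alg. 1.6: m-out-of-n bootstrap CI for the conditional value of the
# plug-in linear rule sign(x'beta_hat). Modeling choices are hypothetical.
rng = np.random.default_rng(4)
n = 1000
m = int(n ** 0.8)                        # resample size: m -> inf, m = o(n)
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 1))])
A = rng.choice([-1.0, 1.0], size=n)
Y = 1.0 + A * X[:, 1] + rng.normal(size=n)

def fit_beta(X, A, Y):
    # OLS for E[Y|X,A] = x'b0 + a x'b1; return the interaction part b1
    D = np.hstack([X, A[:, None] * X])
    b, *_ = np.linalg.lstsq(D, Y, rcond=None)
    return b[X.shape[1]:]

def value(X, A, Y, beta):                # IPW value of sign(x'beta), P(A|X)=1/2
    return np.mean(2.0 * Y * (A * (X @ beta) > 0))

beta_hat = fit_beta(X, A, Y)
V_hat = value(X, A, Y, beta_hat)
M = 200
deltas = np.empty(M)
for b in range(M):
    idx = rng.integers(0, n, size=m)     # draw S_m^(b) with replacement
    bb = fit_beta(X[idx], A[idx], Y[idx])
    deltas[b] = np.sqrt(m) * (value(X[idx], A[idx], Y[idx], bb)
                              - value(X, A, Y, bb))
lo_q, hi_q = np.quantile(deltas, [0.05, 0.95])
ci = (V_hat - hi_q / np.sqrt(m), V_hat - lo_q / np.sqrt(m))
```

Note the quantile flip in the last line: the upper percentile of Δ is subtracted to form the lower endpoint, exactly as in the Output line of Alg. 1.6.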
m-out-of-n cont'd

I Provides valid confidence intervals under (A0)-(A3)

I Proof omitted (tedious)

I Can tune mn using the double bootstrap

I Reliance on asymptotic tomfoolery makes me hesitant to use this in practice12

12I did not always think this way; see Chakraborty, L., and Zhao (2014ab). Also, I should not be so dismissive of these methods. Some of the work in this area has been quite deep and has produced general, uniformly convergent methods. See work by Romano and colleagues.
Confidence interval for the optimal regime within a class

I Let π∗ denote the optimal treatment regime within a given class; our goal is to construct a CI for V(π∗)

I Were π∗ known, one could use √n{V̂n(π∗) − V(π∗)}; with your stats buddy, compute this limiting distribution

I Suppose that we could construct a valid confidence region for π∗; suggest a method for a CI for V(π∗)
Projection interval

I For any fixed π, let ζn,1−ν(π) be a (1 − ν) × 100% confidence set for V(π), e.g., using the asymptotic approximation on the previous slide

I Let Dn,1−η denote a (1 − η) × 100% confidence set for π∗; then a (1 − η − ν) × 100% confidence region for V(π∗) is

∪_{π ∈ Dn,1−η} ζn,1−ν(π)

Why?
Ex. projection interval for a linear regime

I Consider regimes of the form π(x; β) = sign(xᵀβ); then

√n{V̂n(β) − V(β)} = √n(Pn − P)[ Y 1{A Xᵀβ > 0} / P(A|X) ] ⇝ Normal(0, σ²(β)),

so take

ζn,1−ν(β) = [ V̂n(β) − z_{1−ν/2} σ̂n(β)/√n, V̂n(β) + z_{1−ν/2} σ̂n(β)/√n ]

I If √n(β̂n − β∗) ⇝ Normal(0, Σ), take

Dn,1−η = { β : n(β̂n − β)ᵀ Σ̂n⁻¹ (β̂n − β) ≤ χ²_{p,1−η} }

I Projection interval: ∪_{β ∈ Dn,1−η} ζn,1−ν(β)
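A sketch of this construction: form the Wald ellipsoid for β∗, and take the union of pointwise normal CIs for V(β) over it. The union over the full ellipsoid is approximated here by random points on its boundary plus the center, and the covariance plug-in is a rough toy stand-in, so this is an illustration of the geometry rather than the exact projection set.

```python
import numpy as np

# Projection-interval sketch for a linear regime, p = 2. All model and
# plug-in choices are hypothetical.
rng = np.random.default_rng(5)
n, p = 800, 2
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 1))])
A = rng.choice([-1.0, 1.0], size=n)
Y = 1.0 + A * X[:, 1] + rng.normal(size=n)

D = np.hstack([X, A[:, None] * X])
bhat, *_ = np.linalg.lstsq(D, Y, rcond=None)
b1 = bhat[p:]                                 # interaction coefficients
Sig = 4.0 * np.cov((A[:, None] * X).T)        # rough stand-in for Cov of sqrt(n)(b1_hat - b1*)
Lc = np.linalg.cholesky(Sig)
chi2 = 5.99                                   # ~ chi^2 quantile, 2 df, 0.95

def v_and_se(beta):                           # IPW value, P(A|X) = 1/2
    g = 2.0 * Y * (A * (X @ beta) > 0)
    return g.mean(), g.std() / np.sqrt(n)

v0, se0 = v_and_se(b1)
lo, hi = v0 - 1.96 * se0, v0 + 1.96 * se0
for _ in range(500):
    u = rng.normal(size=p)
    u /= np.linalg.norm(u)
    b = b1 + np.sqrt(chi2 / n) * (Lc @ u)     # point on the ellipsoid boundary
    v, se = v_and_se(b)
    lo, hi = min(lo, v - 1.96 * se), max(hi, v + 1.96 * se)
```

The union can only widen the interval, which is the price of the Bonferroni-style (1 − η − ν) guarantee on the previous slide.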
Quiz break!

I What does the 'Q' in Q-learning stand for?

I In treatment regimes, which of the following is not yet a thing: A-learning, B-learning, C-learning, D-learning, E-learning?

I Write down the two-stage Q-learning algorithm assuming binary treatments and linear models at each stage

I True or false:

I I would rather have a zombie ice dragon than two live fire dragons.

I The story of the Easter bunny is based on the little-known story of Jesus swapping the internal organs of chickens and rabbits to prevent a widespread famine

I Q-learning has been used to obtain state-of-the-art performance in game-playing domains like chess, backgammon, and Atari
Inference for two-stage linear Q-learning

I Learning objectives

I Identify the source of nonregularity

I Understand the implications for coverage and asymptotic bias

I Intuition behind the bounds

I Hopefully this will be trivial for you now!
Reminder: setup and notation

I Observe {(X1,i, A1,i, X2,i, A2,i, Yi)}ni=1, i.i.d. from P

I X1 ∈ Rp1: baseline subject info

I A1 ∈ {0, 1}: first treatment

I X2 ∈ Rp2: interim subject info collected during the course of A1

I A2 ∈ {0, 1}: second treatment

I Y ∈ R: outcome, higher is better

I Define histories H1 = X1 and H2 = (X1, A1, X2)

I DTR π = (π1, π2) where πt : supp Ht → supp At; a patient presenting with Ht = ht is assigned treatment πt(ht)
Characterizing the optimal DTR

I The optimal regime maximizes the value E Y∗(π)

I Define the Q-functions

Q2(h2, a2) = E( Y | H2 = h2, A2 = a2 )

Q1(h1, a1) = E{ max_{a2} Q2(H2, a2) | H1 = h1, A1 = a1 }

I Dynamic programming (Bellman, 1957): πt^opt(ht) = arg max_{at} Qt(ht, at)
Q-learning

I Regression-based dynamic programming algorithm

(Q0) Postulate working models for the Q-functions

Qt(ht, at; βt) = ht,0ᵀβt,0 + at ht,1ᵀβt,1, with ht,0, ht,1 features of ht

(Q1) Compute β̂2 = arg min_{β2} Pn{Y − Q2(H2, A2; β2)}²

(Q2) Compute β̂1 = arg min_{β1} Pn{ max_{a2} Q2(H2, a2; β̂2) − Q1(H1, A1; β1) }²

(Q3) π̂t(ht) = arg max_{at} Qt(ht, at; β̂t)

I Population parameters βt∗ are obtained by replacing Pn with P

I Inference for β2∗ is standard, just OLS

I Focus on confidence intervals for cᵀβ1∗ for fixed c ∈ R^{dim β1∗}
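Steps (Q0)-(Q3) can be written out directly. The generative model and feature maps below are hypothetical; the only structural facts used are that A2 ∈ {0, 1}, so max over a2 of a2·(h2,1ᵀβ̂2,1) equals the positive part [h2,1ᵀβ̂2,1]+, and that each stage is an OLS fit.

```python
import numpy as np

# Two-stage linear Q-learning sketch, (Q0)-(Q3), on toy simulated data.
rng = np.random.default_rng(6)
n = 1000
X1 = rng.normal(size=n)
A1 = rng.integers(0, 2, size=n).astype(float)
X2 = 0.5 * X1 + rng.normal(size=n)
A2 = rng.integers(0, 2, size=n).astype(float)
Y = X1 + A1 * (1.0 - X1) + A2 * (X2 - 0.5) + rng.normal(size=n)

def ols(D, y):
    b, *_ = np.linalg.lstsq(D, y, rcond=None)
    return b

# (Q1) stage-2 regression: Q2(h2, a2) = h20'b20 + a2 h21'b21
H20 = np.column_stack([np.ones(n), X1, A1, A1 * X1, X2])
H21 = np.column_stack([np.ones(n), X2])
b2 = ols(np.hstack([H20, A2[:, None] * H21]), Y)
b20, b21 = b2[:5], b2[5:]                # b21 should be near (-0.5, 1)

# (Q2) pseudo-outcome max_{a2} Q2 = h20'b20 + [h21'b21]_+ , then stage-1 OLS
pseudo = H20 @ b20 + np.maximum(0.0, H21 @ b21)
H10 = H11 = np.column_stack([np.ones(n), X1])
b1 = ols(np.hstack([H10, A1[:, None] * H11]), pseudo)
b10, b11 = b1[:2], b1[2:]

# (Q3) estimated rules
pi2 = (H21 @ b21 > 0).astype(int)
pi1 = (H11 @ b11 > 0).astype(int)
```

The positive part in the pseudo-outcome is exactly where the nonsmoothness enters: β̂1 is a smooth function of the data only away from {h2,1ᵀβ2,1∗ = 0}, which is the source of the nonregularity discussed next.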
Inference for cᵀβ1∗

I The nonsmooth max operator makes β̂1 nonregular

I The distribution of cᵀ√n(β̂1 − β1∗) is sensitive to small perturbations of P

I The limiting distribution does not have mean zero (asymptotic bias)

I Occurs with small second-stage treatment effects, H2,1ᵀβ2,1∗ ≈ 0

I Confidence intervals based on series approximations or the bootstrap can perform poorly; proposed remedies include:

I Apply shrinkage to reduce asymptotic bias

I Form conservative estimates of tail probabilities of cᵀ√n(β̂1 − β1∗)
Characterizing asymptotic bias

Definition. For a constant c ∈ R^{dim β1∗} and a √n-consistent estimator β̂1 of β1∗ with √n(β̂1 − β1∗) ⇝ M, define the c-directional asymptotic bias

Bias(β̂1, c) ≜ E cᵀM.
Characterizing asymptotic bias cont'd

Theorem (Asymptotic bias of Q-learning). Let c ∈ R^{dim β1∗} be fixed. Under moment conditions:

Bias(β̂1, c) = cᵀ Σ1,∞⁻¹ P( B1 √(H2,1ᵀ Σ21,21 H2,1) 1{H2,1ᵀβ2,1∗ = 0} ) / √(2π),

where B1 = (H1,0ᵀ, A1 H1,1ᵀ)ᵀ, Σ1,∞ = P B1B1ᵀ, and Σ21,21 is the asymptotic covariance of √n(β̂2,1 − β2,1∗).

I Asymptotic bias for Q-learning

I An average of cᵀΣ1,∞⁻¹B1 with weights ∝ √Var( H2,1ᵀβ̂2,1 1{H2,1ᵀβ2,1∗ = 0} | H2,1 )

I May be reduced by shrinking h2,1ᵀβ̂2,1 when h2,1ᵀβ2,1∗ = 0
Reducing asymptotic bias to improve inference

I Shrinkage is a popular method for reducing asymptotic bias with the goal of improving interval coverage

I Chakraborty et al. (2009) apply soft-thresholding

I Moodie et al. (2010) apply hard-thresholding

I Goldberg et al. (2013) and Song et al. (2015) use lasso-type penalization

I Shrinkage methods target

max_{a2} Q2(h2, a2; β̂2) = h2,0ᵀβ̂2,0 + max_{a2∈{0,1}} a2 h2,1ᵀβ̂2,1 = h2,0ᵀβ̂2,0 + [h2,1ᵀβ̂2,1]+
Soft-thresholding (Chakraborty et al., 2009)

I In Q-learning, replace max_{a2} Q2(H2, a2; β̂2) with

H2,0ᵀβ̂2,0 + [H2,1ᵀβ̂2,1]+ { 1 − σ H2,1ᵀ Σ̂21,21 H2,1 / ( n (H2,1ᵀβ̂2,1)² ) }+

I The amount of shrinkage is governed by σ > 0

I Penalization schemes (Goldberg et al., 2013; Song et al., 2015) reduce to this estimator under certain designs

I No theoretical justification is given in Chakraborty et al. (2009), but it improves coverage of bootstrap intervals in some settings
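The thresholded pseudo-outcome is a one-liner once the pieces are named. In this sketch the function arguments (the fitted linear predictors and the quadratic form H2,1ᵀΣ̂21,21H2,1) are assumed to be precomputed; the numeric inputs in the usage example are made up.

```python
import numpy as np

def soft_threshold_max(h20b20, h21b21, h21_var_quad, n, sigma=3.0):
    """Soft-thresholded stage-2 maximum (Chakraborty et al., 2009 form).

    h20b20:       H20' b20-hat (main-effect fit)
    h21b21:       H21' b21-hat (interaction fit)
    h21_var_quad: H21' Sigma21,21-hat H21 (plug-in variance quadratic form)
    """
    shrink = np.maximum(0.0, 1.0 - sigma * h21_var_quad / (n * h21b21 ** 2))
    return h20b20 + np.maximum(0.0, h21b21) * shrink

# hypothetical usage: a small and a large fitted effect at n = 100
vals = soft_threshold_max(np.array([1.0, 1.0]),
                          np.array([0.05, 2.0]),
                          np.array([1.0, 1.0]), n=100)
# the small effect (0.05) is shrunk all the way to zero; the large
# effect (2.0) is barely shrunk (factor 1 - 3/400 = 0.9925)
```

The shrinkage factor compares the squared fitted effect to its estimated variance, so only effects indistinguishable from zero at scale √n are thresholded away; larger σ widens the "indistinguishable" zone.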
Soft-thresholding and asymptotic bias

Theorem. Let c ∈ R^{dim β1∗} and let β̂1σ denote the soft-thresholding estimator. Under moment conditions:

1. |Bias(β̂1σ, c)| ≤ |Bias(β̂1, c)| for any σ > 0.

2. If Bias(β̂1, c) ≠ 0, then for σ > 0

Bias(β̂1σ, c) / Bias(β̂1, c) = exp(−σ/2) − σ ∫_{√σ}^∞ (1/x) exp(−x²/2) dx
Soft-thresholding and asymptotic bias cont'd

I Is thresholding useful in reducing asymptotic bias?

I The preceding theorem says yes, and more shrinkage is better

I Chakraborty et al. suggest σ = 3, which corresponds to a 13-fold decrease in asymptotic bias

I However, the preceding theorem is based on pointwise, i.e., fixed-parameter, asymptotics and may not faithfully reflect small-sample performance
Local generative model

I Use local asymptotics to approximate the small-sample behavior of soft-thresholding

I Assume:

1. For any s ∈ R^{dim β2,1∗} there exists a sequence of distributions P_n so that

∫ [ √n(dP_n^{1/2} − dP^{1/2}) − (1/2) νs dP^{1/2} ]² → 0,

for some measurable function νs.

2. β2,1,n∗ = β2,1∗ + s/√n, where β2,n∗ = arg min_{β2} P_n{Y − Q2(H2, A2; β2)}²
Local asymptotics view of soft-thresholding

Theorem. Let c ∈ R^{dim β1∗} be fixed. Under the local generative model and moment conditions:

1. sup_{s ∈ R^{dim β2,1∗}} |Bias(β̂1, c)| ≤ K < ∞.

2. sup_{s ∈ R^{dim β2,1∗}} |Bias(β̂1σ, c)| → ∞ as σ → ∞.

I Thresholding can be infinitely worse than doing nothing if done too aggressively in small samples
![Page 94: Intro to nonsmooth inference - Nc State Universitydavidian/dtr19/dtr9.pdf · Intro to nonsmooth inference Eric B. Laber Department of Statistics, North Carolina State University April](https://reader034.fdocuments.net/reader034/viewer/2022050604/5fabea4140ec6a377b40e8c1/html5/thumbnails/94.jpg)
Data-driven tuning
- Is it possible to construct a data-driven choice of $\sigma$ that consistently leads to less asymptotic bias than no shrinkage?
- Consider data from a two-arm randomized trial $\{(A_i, Y_i)\}_{i=1}^{n}$, $A \in \{0, 1\}$, $Y \in \mathbb{R}$ coded so that higher is better¹³
- Define $\mu^*_a \triangleq E(Y \mid A = a)$ and $\widehat{\mu}_a = \mathbb{P}_n Y 1_{A=a} / \mathbb{P}_n 1_{A=a}$
- Mean outcome under optimal treatment assignment is $\theta^* = \max(\mu^*_0, \mu^*_1)$, with corresponding estimator
  $$\widehat{\theta} = \max(\widehat{\mu}_0, \widehat{\mu}_1) = \widehat{\mu}_0 + \left[\widehat{\mu}_1 - \widehat{\mu}_0\right]_+$$
- Soft-thresholding estimator
  $$\widehat{\theta}^{\sigma} = \widehat{\mu}_0 + \left[\widehat{\mu}_1 - \widehat{\mu}_0\right]_+ \left\{1 - \frac{4\sigma}{n\,(\widehat{\mu}_1 - \widehat{\mu}_0)^2}\right\}_+$$

¹³This is equivalent to two-stage Q-learning with no covariates and a single first-stage treatment.
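Both estimators above are a few lines of code. A minimal sketch (the function name and the simulated data are illustrative, not from the slides):

```python
import numpy as np

def soft_threshold_value(a, y, sigma):
    """Plug-in and soft-thresholded estimators of theta* = max(mu0*, mu1*).

    a : 0/1 treatment indicators; y : outcomes; sigma : thresholding parameter.
    The thresholded estimator shrinks [mu1 - mu0]_+ by the factor
    {1 - 4*sigma / (n*(mu1 - mu0)^2)}_+ from the slide.
    """
    a, y = np.asarray(a), np.asarray(y)
    n = len(y)
    mu0 = y[a == 0].mean()
    mu1 = y[a == 1].mean()
    diff = mu1 - mu0
    theta_hat = mu0 + max(diff, 0.0)                      # hard-max estimator
    shrink = max(1.0 - 4.0 * sigma / (n * diff**2), 0.0)  # soft-threshold factor
    theta_sigma = mu0 + max(diff, 0.0) * shrink
    return theta_hat, theta_sigma

# Illustrative use on simulated trial data
rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=200)
y = 0.3 * a + rng.normal(size=200)
print(soft_threshold_value(a, y, sigma=1.0))
```

Note that the shrinkage factor equals one when the estimated effect is large relative to $\sigma/n$ and zero when it is small, so $\widehat{\theta}^{\sigma}$ interpolates between $\widehat{\theta}$ and $\widehat{\mu}_0$.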
Data-driven tuning: toy example
[Figure: bias of the soft-thresholding estimator as a function of $\sigma$; panels for n = 10 and n = 100.]

- Optimal value of $\sigma$ depends on $\mu^*_1 - \mu^*_0$
- Variability in $\widehat{\mu}_1 - \widehat{\mu}_0$ prevents identification of the optimal value of $\sigma$; using a plug-in estimator may lead to large bias
- Constructing a data-driven $\sigma$ that significantly improves asymptotic bias over no shrinkage is difficult
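The toy example can be reproduced approximately by Monte Carlo. A sketch, assuming standard normal errors; the particular values of $\delta = \mu^*_1 - \mu^*_0$ and $\sigma$ are illustrative:

```python
import numpy as np

def mc_bias(delta, sigma, n=100, reps=2000, seed=1):
    """Monte Carlo bias of the soft-thresholded estimator of
    theta* = max(mu0*, mu1*) when mu0* = 0 and mu1* = delta."""
    rng = np.random.default_rng(seed)
    theta_star = max(0.0, delta)
    est = np.empty(reps)
    for r in range(reps):
        a = rng.integers(0, 2, size=n)
        y = delta * a + rng.normal(size=n)
        mu0, mu1 = y[a == 0].mean(), y[a == 1].mean()
        d = mu1 - mu0
        shrink = max(1.0 - 4.0 * sigma / (n * d * d), 0.0)
        est[r] = mu0 + max(d, 0.0) * shrink
    return est.mean() - theta_star

# Bias across sigma for a null and a moderate effect size
for delta in (0.0, 0.5):
    print([round(mc_bias(delta, s), 3) for s in (0.0, 1.0, 4.0)])
```

At $\delta = 0$ the unshrunk estimator ($\sigma = 0$) is biased upward and shrinkage reduces the bias, while at larger $\delta$ aggressive shrinkage drags the estimator down, illustrating why no single $\sigma$ works for all $\mu^*_1 - \mu^*_0$.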
Asymptotic bias: discussion
- Asymptotic bias exists in Q-learning
- Local asymptotics show that aggressively shrinking to reduce asymptotic bias can be infinitely worse than no shrinkage
- Data-driven tuning seems to require either choosing $\sigma$ very small or risking large bias
Confidence intervals for $c^{\intercal}\beta^*_1$

- Possible to construct valid confidence intervals in the presence of asymptotic bias
- Idea: construct regular bounds on $c^{\intercal}\sqrt{n}(\widehat{\beta}_1 - \beta^*_1)$
- Bootstrap the bounds to form a confidence interval
- Tightest among all regular bounds ⇒ automatic adaptivity
- Local uniform convergence
- Can also obtain conditional properties (Robins and Rotnitzky, 2014) and global uniform convergence (Wu, 2014)
Regular bounds on $c^{\intercal}\sqrt{n}(\widehat{\beta}_1 - \beta^*_1)$

- Define
  $$\mathbb{V}_n(c, \gamma) = c^{\intercal}\mathbb{S}_n + c^{\intercal}\widehat{\Sigma}_1^{-1}\mathbb{P}_n \mathbb{U}_n(\gamma),$$
  where
  $$\mathbb{S}_n = \widehat{\Sigma}_1^{-1}\sqrt{n}(\mathbb{P}_n - P)B_1\left\{H^{\intercal}_{2,0}\beta^*_{2,0} + \left(H^{\intercal}_{2,1}\beta^*_{2,1}\right)_+ - B^{\intercal}_1\beta^*_1\right\} + \widehat{\Sigma}_1^{-1}\sqrt{n}\,\mathbb{P}_n B_1 H^{\intercal}_{2,0}\left(\widehat{\beta}_{2,0} - \beta^*_{2,0}\right),$$
  $$\mathbb{U}_n(\gamma) = B_1\left[\left(H^{\intercal}_{2,1}(Z_n + \gamma)\right)_+ - \left(H^{\intercal}_{2,1}\gamma\right)_+\right]$$
- $\mathbb{S}_n$ is smooth and $\mathbb{U}_n(\gamma)$ is non-smooth
Regular bounds on $c^{\intercal}\sqrt{n}(\widehat{\beta}_1 - \beta^*_1)$ cont'd

- It can be shown that $c^{\intercal}\sqrt{n}(\widehat{\beta}_1 - \beta^*_1) = \mathbb{V}_n(c, \beta^*_{2,1})$
- Use pretesting to construct the upper bound
  $$\mathcal{U}_n(c) = c^{\intercal}\mathbb{S}_n + c^{\intercal}\widehat{\Sigma}_1^{-1}\mathbb{P}_n\mathbb{U}_n(\beta^*_{2,1})\,1_{T_n(H_{2,1}) > \lambda_n} + \sup_{\gamma}\, c^{\intercal}\widehat{\Sigma}_1^{-1}\mathbb{P}_n\mathbb{U}_n(\gamma)\,1_{T_n(H_{2,1}) \leq \lambda_n},$$
  where $T_n(h_{2,1})$ is a test statistic for the null $h^{\intercal}_{2,1}\beta^*_{2,1} = 0$ and $\lambda_n$ is a critical value
- The lower bound, $\mathcal{L}_n(c)$, is obtained by taking the inf
- Bootstrap the bounds to find a confidence interval
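In the scalar toy problem (no covariates), the nonsmooth piece satisfies $\sup_{\gamma}\{(z+\gamma)_+ - \gamma_+\} = (z)_+$ and the corresponding inf is $\min(z, 0)$, so the bounds can be bootstrapped in closed form. A rough sketch of the bounding idea for $\theta^* = \max(\mu^*_0, \mu^*_1)$, assuming a fixed pretest critical value `lam` rather than the tuned $\lambda_n$ from the slides:

```python
import numpy as np

def aci_toy(a, y, alpha=0.05, lam=3.0, boots=1000, seed=0):
    """Sketch of the bounding ("ACI"-style) interval for theta* = max(mu0*, mu1*)."""
    rng = np.random.default_rng(seed)
    a, y = np.asarray(a), np.asarray(y)
    n = len(y)
    mu0, mu1 = y[a == 0].mean(), y[a == 1].mean()
    d = mu1 - mu0
    theta = mu0 + max(d, 0.0)
    # Variance of the difference in means, for the pretest statistic
    v = y[a == 0].var(ddof=1) / (a == 0).sum() + y[a == 1].var(ddof=1) / (a == 1).sum()
    pretest_rejects = d * d / v > lam        # evidence the arms differ
    gam = np.sqrt(n) * d                     # plug-in local parameter
    U, L = np.empty(boots), np.empty(boots)
    for b in range(boots):
        idx = rng.integers(0, n, size=n)
        ab, yb = a[idx], y[idx]
        m0, m1 = yb[ab == 0].mean(), yb[ab == 1].mean()
        S = np.sqrt(n) * (m0 - mu0)          # smooth part
        Z = np.sqrt(n) * ((m1 - m0) - d)     # centered difference in means
        if pretest_rejects:                  # nonsmooth part: plug in gam
            U[b] = L[b] = S + max(Z + gam, 0.0) - max(gam, 0.0)
        else:                                # sup / inf over gamma, in closed form
            U[b] = S + max(Z, 0.0)
            L[b] = S + min(Z, 0.0)
    u = np.quantile(U, 1 - alpha / 2)
    lo_q = np.quantile(L, alpha / 2)
    return theta - u / np.sqrt(n), theta - lo_q / np.sqrt(n)
```

When the pretest rejects, the upper and lower bounds coincide and the interval behaves like an ordinary bootstrap interval; when it accepts, the sup/inf terms widen the interval to guard against nonregularity.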
Validity of the bounds
Theorem. Let $c \in \mathbb{R}^{\dim \beta^*_1}$ be fixed. Assume the local generative model; under moment conditions and conditions on the pretest:

1. $c^{\intercal}\sqrt{n}(\widehat{\beta}_1 - \beta^*_{1,n}) \rightsquigarrow c^{\intercal}\mathbb{S}_{\infty} + c^{\intercal}\Sigma^{-1}_{1,\infty}P B_1 H^{\intercal}_{2,1} Z_{\infty}\, 1_{H^{\intercal}_{2,1}\beta^*_{2,1} > 0} + c^{\intercal}\Sigma^{-1}_{1,\infty}P B_1\left[\left(H^{\intercal}_{2,1}(Z_{\infty} + s)\right)_+ - \left(H^{\intercal}_{2,1}s\right)_+\right] 1_{H^{\intercal}_{2,1}\beta^*_{2,1} = 0}$

2. $\mathcal{U}_n(c) \rightsquigarrow c^{\intercal}\mathbb{S}_{\infty} + c^{\intercal}\Sigma^{-1}_{1,\infty}P B_1 H^{\intercal}_{2,1} Z_{\infty}\, 1_{H^{\intercal}_{2,1}\beta^*_{2,1} > 0} + \sup_{\gamma}\, c^{\intercal}\Sigma^{-1}_{1,\infty}P B_1\left[\left(H^{\intercal}_{2,1}(Z_{\infty} + \gamma)\right)_+ - \left(H^{\intercal}_{2,1}\gamma\right)_+\right] 1_{H^{\intercal}_{2,1}\beta^*_{2,1} = 0}$
Validity of the bootstrap bounds
Theorem. Fix $\alpha \in (0, 1)$ and $c \in \mathbb{R}^{\dim \beta^*_1}$. Let $\widehat{\ell}$ and $\widehat{u}$ denote the $(\alpha/2)\times 100$ and $(1 - \alpha/2)\times 100$ percentiles of the bootstrap distribution of the bounds, and let $P_M$ denote the distribution with respect to the bootstrap weights. Under moment conditions and conditions on the pretest, for any $\epsilon > 0$:

$$P\left\{P_M\left(c^{\intercal}\widehat{\beta}_1 - \frac{\widehat{u}}{\sqrt{n}} \leq c^{\intercal}\beta^*_1 \leq c^{\intercal}\widehat{\beta}_1 - \frac{\widehat{\ell}}{\sqrt{n}}\right) < 1 - \alpha - \epsilon\right\} = o(1).$$
Uniform validity of the bootstrap bounds (Tianshuang Wu)

Theorem. Fix $\alpha \in (0, 1)$ and $c \in \mathbb{R}^{\dim \beta^*_1}$. Let $\widehat{\ell}$ and $\widehat{u}$ denote the $(\alpha/2)\times 100$ and $(1 - \alpha/2)\times 100$ percentiles of the bootstrap distribution of the bounds, and let $P_M$ denote the distribution with respect to the bootstrap weights. Under moment conditions and conditions on the pretest, for any $\epsilon > 0$:

$$\sup_{P \in \mathcal{P}} P\left\{P_M\left(c^{\intercal}\widehat{\beta}_1 - \frac{\widehat{u}}{\sqrt{n}} \leq c^{\intercal}\beta^*_1 \leq c^{\intercal}\widehat{\beta}_1 - \frac{\widehat{\ell}}{\sqrt{n}}\right) < 1 - \alpha - \epsilon\right\}$$

converges to zero for a large class of distributions $\mathcal{P}$.
Simulation experiments
- Class of generative models:
  - $X_t \in \{-1, 1\}$, $A_t \in \{-1, 1\}$, $t \in \{1, 2\}$
  - $P(A_t = 1) = P(A_t = -1) = 0.5$, $t \in \{1, 2\}$
  - $X_1 \sim \mathrm{Bernoulli}(0.5)$
  - $X_2 \mid X_1, A_1 \sim \mathrm{Bernoulli}\left\{\mathrm{expit}(\delta_1 X_1 + \delta_2 A_1)\right\}$
  - $\epsilon \sim N(0, 1)$
  - $Y = \gamma_1 + \gamma_2 X_1 + \gamma_3 A_1 + \gamma_4 X_1 A_1 + \gamma_5 A_2 + \gamma_6 X_2 A_2 + \gamma_7 A_1 A_2 + \epsilon$
- Vary parameters to obtain a range of effect sizes; classify generative models as:
  - Non-regular (NR)
  - Nearly non-regular (NNR)
  - Regular (R)
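A sketch of a simulator for this class of generative models (the function name is illustrative; treatments and covariates are coded $\pm 1$ as on the slide):

```python
import numpy as np

def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + np.exp(-x))

def simulate(n, gamma, delta, seed=0):
    """Draw one dataset: gamma = (g1,...,g7), delta = (d1, d2)."""
    rng = np.random.default_rng(seed)
    x1 = 2 * rng.binomial(1, 0.5, size=n) - 1   # X1 ~ Bernoulli(0.5), coded +/-1
    a1 = 2 * rng.binomial(1, 0.5, size=n) - 1   # A1 randomized with prob 0.5
    p2 = expit(delta[0] * x1 + delta[1] * a1)   # X2 | X1, A1
    x2 = 2 * rng.binomial(1, p2) - 1
    a2 = 2 * rng.binomial(1, 0.5, size=n) - 1   # A2 randomized with prob 0.5
    g = gamma
    y = (g[0] + g[1] * x1 + g[2] * a1 + g[3] * x1 * a1
         + g[4] * a2 + g[5] * x2 * a2 + g[6] * a1 * a2
         + rng.normal(size=n))                  # epsilon ~ N(0, 1)
    return x1, a1, x2, a2, y
```

Setting $\gamma_5 = \gamma_6 = \gamma_7 = 0$ makes the second-stage treatment effect exactly zero (a non-regular model); small nonzero values give nearly non-regular models.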
Simulation experiments cont’d
- Compare the bounding confidence interval (ACI) with the bootstrap (BOOT) and bootstrap thresholding (THRESH)
- Compare in terms of width and coverage (target 95%)
- Results based on 1000 Monte Carlo replications with datasets of size n = 150
- Bootstrap computed with 1000 resamples
- Tuning parameter $\lambda_n$ chosen with the double bootstrap
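For reference, the BOOT comparator is the naive percentile bootstrap applied directly to the nonsmooth estimator; in the toy two-arm problem it looks like the following sketch (trial data simulated for illustration):

```python
import numpy as np

def boot_ci(a, y, alpha=0.05, boots=1000, seed=0):
    """Naive percentile bootstrap for theta* = max(mu0*, mu1*).

    This is the BOOT comparator; it is known to undercover
    near the nonregular point mu0* = mu1*.
    """
    rng = np.random.default_rng(seed)
    a, y = np.asarray(a), np.asarray(y)
    n = len(y)
    stats = np.empty(boots)
    for b in range(boots):
        idx = rng.integers(0, n, size=n)       # resample with replacement
        ab, yb = a[idx], y[idx]
        stats[b] = max(yb[ab == 0].mean(), yb[ab == 1].mean())
    return np.quantile(stats, alpha / 2), np.quantile(stats, 1 - alpha / 2)

# Illustrative use
rng = np.random.default_rng(7)
a = rng.integers(0, 2, size=150)
y = 0.5 * a + rng.normal(size=150)
print(boot_ci(a, y))
```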
Simulation experiments: results
Coverage (target 95%)

  Method    Ex. 1 NNR   Ex. 2 NR   Ex. 3 NNR   Ex. 4 R   Ex. 5 NR   Ex. 6 NNR
  BOOT      0.935*      0.930*     0.933*      0.928*    0.925*     0.928*
  THRESH    0.945       0.938      0.942       0.943     0.759*     0.762*
  ACI       0.971       0.958      0.961       0.943     0.953      0.953

Average width

  Method    Ex. 1 NNR   Ex. 2 NR   Ex. 3 NNR   Ex. 4 R   Ex. 5 NR   Ex. 6 NNR
  BOOT      0.385*      0.430*     0.430*      0.436*    0.428*     0.428*
  THRESH    0.339       0.426      0.427       0.436     0.426*     0.424*
  ACI       0.441       0.470      0.470       0.469     0.473      0.473
Ex. DTR for ADHD without uncertainty
[Flowchart:]

- Prior medication? Yes → low-dose MEDS; No → low-dose BMOD
- MEDS branch: Adequate response? Yes → continue MEDS; No → High adherence? Yes → intensify MEDS; No → add BMOD
- BMOD branch: Adequate response? Yes → continue BMOD; No → High adherence? Yes → intensify BMOD; No → add MEDS
Ex. DTR for ADHD with uncertainty
[Flowchart:]

- Prior medication? Yes → low-dose MEDS ∼OR∼ BMOD; No → low-dose BMOD
- MEDS branch: Adequate response? Yes → continue MEDS; No → High adherence? Yes → intensify MEDS; No → add OTHER ∼OR∼ intensify SAME
- BMOD branch: Adequate response? Yes → continue BMOD; No → High adherence? Yes → add MEDS ∼OR∼ intensify BMOD; No → add MEDS