Challenges in Fiducial Inference
Parts of this talk are joint work with T. C. M. Lee (UC Davis), Randy Lai (U of Maine), H. Iyer (NIST), J. Williams, Y. Cui (UNC)
BFF 2017
Jan Hannig*
University of North Carolina at Chapel Hill
* NSF support acknowledged
outline
Outline
Introduction
Definition
Sparsity
Regularization
Conclusions
introduction
Fiducial?
▶ Oxford English Dictionary
▶ adjective, technical: (of a point or line) used as a fixed basis of comparison.
▶ Origin: from Latin fiducia 'trust, confidence'
▶ Merriam-Webster dictionary
▶ 1. taken as a standard of reference: a fiducial mark
▶ 2. founded on faith or trust
▶ 3. having the nature of a trust: fiduciary
introduction
Aim of this talk
▶ Explain the definition of generalized fiducial distribution
▶ Challenge of extra information:
▶ Sparsity
▶ Regularization
▶ My point of view: frequentist
▶ Justified using asymptotic theorems and simulations.
▶ GFI tends to work well
definition
Comparison to likelihood
▶ Density is the function f(x, ξ), where ξ is fixed and x is variable.
▶ Likelihood is the function f(x, ξ), where ξ is variable and x is fixed.
▶ Likelihood as a distribution?
definition Formal Definition
General Definition
▶ Data generating equation X = G(U, ξ), e.g., Xi = μ + σUi
▶ A distribution on the parameter space is a Generalized Fiducial Distribution if it can be obtained as a limit (as ε ↓ 0) of

argmin_ξ ∥x − G(U⋆, ξ)∥ | {min_ξ ∥x − G(U⋆, ξ)∥ ≤ ε}   (1)

▶ Similar to ABC; generating from the prior is replaced by minimization.
▶ Is this practical? Can we compute?
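The limit in (1) can be approximated directly for a toy location model X_i = μ + U_i with standard normal U_i: repeatedly draw a fresh U⋆, take the minimizing μ, and keep it only when the minimum distance is below ε. A minimal sketch (the data values and tolerance are made up; for this model the GFD is N(x̄, 1/n)):

```python
import random
import statistics

random.seed(1)
x = [2.3, 1.7, 2.9]        # toy observed data
n, eps = len(x), 0.5       # smaller eps: better approximation, lower acceptance rate
samples = []
for _ in range(400_000):                               # iteration cap
    u = [random.gauss(0.0, 1.0) for _ in range(n)]     # fresh copy U* of the noise
    d = [xi - ui for xi, ui in zip(x, u)]              # x - U*
    mu_hat = statistics.mean(d)                        # argmin over mu of ||x - (mu + U*)||_2
    resid = sum((di - mu_hat) ** 2 for di in d) ** 0.5
    if resid <= eps:                                   # keep U* that nearly solve x = G(U*, mu)
        samples.append(mu_hat)
    if len(samples) >= 2000:
        break

print(len(samples), round(statistics.mean(samples), 3))
```

The accept step mirrors the conditioning event in (1); replacing "generate ξ from the prior" by the minimization over ξ is exactly the contrast with ABC drawn above.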
Explicit limit (1)
▶ Assume X ∈ Rⁿ is continuous; parameter ξ ∈ Rᵖ
▶ The limit in (1) has density (H, Iyer, Lai & Lee, 2016)

r(ξ|x) = f_X(x|ξ) J(x, ξ) / ∫_Ξ f_X(x|ξ′) J(x, ξ′) dξ′,

where J(x, ξ) = D( (d/dξ) G(u, ξ) |_{u = G⁻¹(x, ξ)} )
▶ n = p gives D(A) = |det A|
▶ ∥·∥₂ gives D(A) = (det AᵀA)^{1/2}; compare to Fraser, Reid, Marras & Yi (2010)
▶ ∥·∥∞ gives D(A) = Σ_{i=(i₁,…,iₚ)} |det (A)ᵢ| (sum over p-row submatrices of A)
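The three choices of D(·) can be compared numerically on a small matrix (the 3×2 matrix below is an arbitrary made-up example; by Cauchy-Binet, det(AᵀA) equals the sum of squared subdeterminants, so the ∥·∥₂ value never exceeds the ∥·∥∞ value, and all choices coincide when n = p):

```python
from itertools import combinations
from math import sqrt

# A stands in for (d/dxi) G evaluated at u = G^{-1}(x, xi); here a toy 3x2 example
A = [[1.0, 0.5],
     [1.0, 1.5],
     [1.0, 2.5]]
n, p = len(A), len(A[0])

def det2(M):  # determinant of a 2x2 matrix
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

# ||.||_2 norm: D(A) = det(A^T A)^{1/2}
AtA = [[sum(A[k][i] * A[k][j] for k in range(n)) for j in range(p)] for i in range(p)]
D2 = sqrt(det2(AtA))

# ||.||_inf norm: D(A) = sum over p-row submatrices (A)_i of |det (A)_i|
Dinf = sum(abs(det2([A[r] for r in rows])) for rows in combinations(range(n), p))

print(D2, Dinf)
```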
Example -- Linear Regression
▶ Data generating equation Y = Xβ + σZ
▶ (d/dθ)Y = (X, Z) and Z = (Y − Xβ)/σ.
▶ The L₂ Jacobian is

J(y, β, σ) = ( det( (X, (y − Xβ)/σ)ᵀ (X, (y − Xβ)/σ) ) )^{1/2} = σ⁻¹ |det(XᵀX)|^{1/2} (RSS)^{1/2}

▶ The fiducial distribution happens to be the same as the independence Jeffreys posterior, with an explicit normalizing constant.
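The displayed identity can be checked numerically. A sketch with an arbitrary design and evaluation point (β, σ and the seed are made up; det(MᵀM) is computed as a Gram determinant via Gram-Schmidt, and RSS is obtained from det([X y]ᵀ[X y]) = det(XᵀX)·RSS):

```python
import math
import random

random.seed(0)
n, p = 8, 2
X = [[1.0, random.gauss(0, 1)] for _ in range(n)]       # design: intercept + one covariate
beta, sigma = [0.7, -1.2], 1.3                          # arbitrary evaluation point
y = [sum(Xi[j] * beta[j] for j in range(p)) + sigma * random.gauss(0, 1) for Xi in X]

def gram_det(cols):
    """det(M^T M) for the matrix M with the given columns (Gram-Schmidt)."""
    d, basis = 1.0, []
    for c in cols:
        c = list(c)
        for b in basis:
            coef = sum(ci * bi for ci, bi in zip(c, b)) / sum(bi * bi for bi in b)
            c = [ci - coef * bi for ci, bi in zip(c, b)]
        d *= sum(ci * ci for ci in c)
        basis.append(c)
    return d

cols_X = [[Xi[j] for Xi in X] for j in range(p)]
z = [(yi - sum(Xi[j] * beta[j] for j in range(p))) / sigma for yi, Xi in zip(y, X)]

J_direct = math.sqrt(gram_det(cols_X + [z]))            # det((X, Z)^T (X, Z))^{1/2}

GX = gram_det(cols_X)                                   # det(X^T X)
RSS = gram_det(cols_X + [y]) / GX                       # y^T (I - H) y
J_closed = math.sqrt(GX) * math.sqrt(RSS) / sigma       # sigma^{-1} |det X^T X|^{1/2} RSS^{1/2}

print(J_direct, J_closed)
```

The agreement is exact up to floating point because (I − H)Xβ = 0, so the Z-column's residual after projecting out X is RSS/σ².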
Example -- Uniform(θ, θ²)
▶ Xi i.i.d. U(θ, θ²), θ > 1
▶ Data generating equation Xi = θ + (θ² − θ)Ui, Ui ∼ U(0, 1).
▶ Compute the Jacobian: (d/dθ)[θ + (θ² − θ)Ui] = 1 + (2θ − 1)Ui, with Ui = (Xi − θ)/(θ² − θ).
▶ Using ∥·∥∞ we have J(x, θ) = n (x̄(2θ − 1) − θ²) / (θ² − θ).
▶ Reference prior π(θ) = e^{ψ(2θ/(2θ−1))} (2θ − 1) / (θ(θ − 1)), Berger, Bernardo & Sun (2009); complicated to derive.
▶ In simulations, fiducial was marginally better than the reference prior, which was much better than the flat prior.
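The uniform example's GFD can be evaluated numerically: on its support θ ∈ (√(max xi), min xi) the unnormalized density is f(x|θ)J(x, θ) = (θ² − θ)⁻ⁿ · n(x̄(2θ − 1) − θ²)/(θ² − θ). A sketch with made-up data nominally from U(2, 4):

```python
import math

x = [2.2, 2.6, 3.1, 3.9, 2.9]           # toy sample, nominally from U(theta, theta^2)
n, xbar = len(x), sum(x) / len(x)
lo, hi = math.sqrt(max(x)), min(x)      # support: theta < min x and theta^2 > max x

def unnorm(t):
    """f(x|theta) * J(x,theta) = (t^2 - t)^(-n) * n * (xbar*(2t - 1) - t^2) / (t^2 - t)."""
    return (t * t - t) ** (-n) * n * (xbar * (2 * t - 1) - t * t) / (t * t - t)

m = 4000
grid = [lo + (hi - lo) * (i + 0.5) / m for i in range(m)]   # midpoint grid
w = [unnorm(t) for t in grid]
Z = sum(w) * (hi - lo) / m              # normalizing constant by the midpoint rule
dens = [wi / Z for wi in w]
print(lo, hi, sum(dens) * (hi - lo) / m)
```

From `dens` one can read off fiducial quantiles and interval estimates for θ directly, since the parameter is one-dimensional.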
Important Simple Observations
▶ GFD is always proper
▶ GFD is invariant to re-parametrizations (same as Jeffreys)
▶ GFD is not invariant to smooth transformations of the data if n > p
▶ GFD does not satisfy the likelihood principle.
definition Asymptotic results
Various Asymptotic Results
r(ξ|x) ∝ f_X(x|ξ) J(x, ξ), where J(x, ξ) = D( (d/dξ) G(u, ξ) |_{u = G⁻¹(x, ξ)} )

▶ Most results start with C_n⁻¹ J(x, ξ) → J(ξ₀, ξ)
▶ A Bernstein-von Mises theorem for fiducial distributions provides asymptotic correctness of fiducial CIs: H (2009, 2013), Sonderegger & H (2013).
▶ Consistency of model selection: H & Lee (2009), Lai, H & Lee (2015), H, Iyer, Lai & Lee (2016).
▶ Regular higher-order asymptotics in Pal Majumdar & H (2016+).
sparsity
Model Selection
▶ X = G(M, ξ_M, U), M ∈ M, ξ_M ∈ Ξ_M
Theorem (H, Iyer, Lai, Lee 2016): Under assumptions,

r(M|y) ∝ q^{|M|} ∫_{Ξ_M} f_M(y, ξ_M) J_M(y, ξ_M) dξ_M

▶ Need for a penalty: in the fiducial framework, additional equations 0 = P_k, k = 1, …, min(|M|, n)
▶ Default value q = n^{−1/2} (motivated by MDL)
Alternative to penalty
▶ A penalty is used to discourage models with many parameters
▶ The real issue is not too many parameters, but that a smaller model can do almost the same job

r(M|y) ∝ ∫_{Ξ_M} f_M(y, ξ_M) J_M(y, ξ_M) h_M(ξ_M) dξ_M,

h_M(ξ_M) = 0 if a smaller model predicts nearly as well, 1 otherwise

▶ Motivated by the non-local priors of Johnson & Rossell (2009)
Regression
▶ Y = Xβ + σZ
▶ First idea: h_M(β_M) = I{|βi| > ϵ, i ∈ M}; issue: collinearity
▶ Better:

h_M(β_M) := I{ (1/2)∥Xᵀ(X_M β_M − X b_min)∥₂² ≥ ε(n, |M|) }

where b_min solves

min_{b ∈ Rᵖ} (1/2)∥Xᵀ(X_M β_M − Xb)∥₂² subject to ∥b∥₀ ≤ |M| − 1.

▶ Algorithm: Bertsimas et al. (2016)
▶ Similar to the Dantzig selector, Candès & Tao (2007), with a different norm and target
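For small p the constrained minimization defining h_M can be brute-forced: write v = XᵀX_M β_M and G = XᵀX, enumerate supports S for b, and solve the induced least-squares problem against the columns of G. A sketch (the design, the model M = {0, 1, 2}, and its coefficients are made up; the mixed-integer approach of Bertsimas et al. would replace the enumeration at scale):

```python
import random
from itertools import combinations

random.seed(2)
n, p = 30, 4
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]

def xt(v):
    """X^T v."""
    return [sum(X[i][j] * v[i] for i in range(n)) for j in range(p)]

def solve(A, rhs):
    """Gauss-Jordan elimination for a small dense linear system."""
    k = len(rhs)
    M = [row[:] + [rhs[i]] for i, row in enumerate(A)]
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(k):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [M[r][j] - f * M[c][j] for j in range(k + 1)]
    return [M[i][k] / M[i][i] for i in range(k)]

G = [xt([X[i][j] for i in range(n)]) for j in range(p)]      # columns of X^T X

def subset_gap(model, beta_M, size):
    """min over ||b||_0 <= size of (1/2)||X^T (X_M beta_M - X b)||_2^2, by enumeration."""
    fit = [sum(X[i][j] * bj for j, bj in zip(model, beta_M)) for i in range(n)]  # X_M beta_M
    v = xt(fit)
    best = 0.5 * sum(vi * vi for vi in v)                    # b = 0 is always feasible
    for S in combinations(range(p), size):
        A = [[sum(G[a][r] * G[b][r] for r in range(p)) for b in S] for a in S]
        rhs = [sum(G[s][r] * v[r] for r in range(p)) for s in S]
        bS = solve(A, rhs)                                   # least squares on support S
        res = [v[r] - sum(bS[k] * G[S[k]][r] for k in range(len(S))) for r in range(p)]
        best = min(best, 0.5 * sum(ri * ri for ri in res))
    return best

theta = [1.5, -2.0, 0.8]
gap2 = subset_gap((0, 1, 2), theta, size=2)   # h_M compares this gap to eps(n, |M|)
gap3 = subset_gap((0, 1, 2), theta, size=3)   # exact support is available, so ~0
print(gap2, gap3)
```

A large `gap2` means no smaller model comes close, so h_M = 1 and the model keeps its mass.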
GFD
r(M|y) ∝ π^{|M|/2} Γ((n − |M|)/2) RSS_M^{−(n−|M|−1)/2} E[h^ε_M(β⋆_M)]

Observations:
▶ The expectation is with respect to the within-model GFD (usual T)
▶ r(M|y) is negligibly small for large models because of h; e.g., |M| > n implies r(M|y) = 0.
▶ Implemented using Grouped Independence Metropolis-Hastings (Andrieu & Roberts, 2009).
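The closed-form part of this expression is easy to evaluate for a pair of candidate models. A sketch on simulated data (the factor E[h^ε_M] is set to 1 purely for illustration, so this shows only the π^{|M|/2} Γ(·) RSS^(·) piece; log-gamma is used for numerical stability):

```python
import math
import random

random.seed(3)
n = 50
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
y = [1.0 + 2.0 * a + random.gauss(0, 1) for a in x1]     # x2 is a pure noise covariate

def rss(cols):
    """Residual sum of squares of y regressed on the given columns (Gram-Schmidt)."""
    r, basis = y[:], []
    for c in cols:
        c = list(c)
        for b in basis:
            coef = sum(ci * bi for ci, bi in zip(c, b)) / sum(bi * bi for bi in b)
            c = [ci - coef * bi for ci, bi in zip(c, b)]
        basis.append(c)
        coef = sum(ri * ci for ri, ci in zip(r, c)) / sum(ci * ci for ci in c)
        r = [ri - coef * ci for ri, ci in zip(r, c)]
    return sum(ri * ri for ri in r)

one = [1.0] * n
models = {"1+x1": [one, x1], "1+x1+x2": [one, x1, x2]}

logr = {}
for name, cols in models.items():
    m = len(cols)
    logr[name] = (m / 2) * math.log(math.pi) + math.lgamma((n - m) / 2) \
                 - ((n - m - 1) / 2) * math.log(rss(cols))

mx = max(logr.values())                     # log-sum-exp normalization over candidates
Z = sum(math.exp(v - mx) for v in logr.values())
probs = {k: math.exp(v - mx) / Z for k, v in logr.items()}
print(probs)
```

In the full method the E[h^ε_M] factor, estimated by Monte Carlo over the within-model GFD, is what zeroes out the over-large models.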
Main Result
Theorem (Williams & H, 2017+). Suppose the true model is given by M_T. Then under certain conditions, for a fixed positive constant α < 1,

r(M_T|y) = r(M_T|y) / [ Σ_{j=1}^{n^α} Σ_{M:|M|=j} r(M|y) ] → 1 in probability as n, p → ∞.
Some Conditions
▶ Number of predictors: liminf_{n,p→∞} n^{1−α} / log(p) > 2
▶ For the true model/parameter: p_T < log n^γ and

ε_{M_T}(n, p) ≤ (1/18) ∥Xᵀ(μ_T − X b_min)∥₂²,

where b_min minimizes the norm subject to ∥b∥₀ ≤ p_T − 1.
▶ For a large model |M| > p_T and large enough n or p,

(9/2) ∥Xᵀ(H_M − H_{M(−1)}) μ_T∥₂² < ε_M(n, p),

where H_M and H_{M(−1)} are the projection matrices for M and for M with a covariate removed, respectively.
Simulation
▶ Setup from Rockova & George (2015)
▶ n = 100, p = 1000, p_T = 8.
▶ Columns of X either (a) independent or (b) correlated with ρ = 0.6
▶ ε_M(n, p) = Λ_M σ̂²_M ( n^{0.51}/9 + 1.1 |M| log(pπ)/9 − log(n)^γ )₊ with γ = 1.45.
Highlight of simulation results
▶ See Jon Williams' poster for details on theory and simulation
▶ When X is independent: usually selects the correct model
▶ When X is correlated: usually selects too small a model
▶ The conditions of the Theorem are violated
▶ Based on the conditions, decreasing p to 500 satisfies them, and performance improves.
Comments
▶ Standardized way of measuring closeness in other models?
▶ What if a small model is not the right target, e.g., gene interactions?
regularization
Recall
▶ A distribution on the parameter space is a Generalized Fiducial Distribution if it can be obtained as a limit (as ε ↓ 0) of

argmin_ξ ∥x − G(U⋆, ξ)∥ | {min_ξ ∥x − G(U⋆, ξ)∥ ≤ ε}

▶ Conditioning U⋆ on {x = G(U⋆, ξ)}: "regularization by model"
Most general iid model
▶ Data generating equation:

Xi = F⁻¹(Ui), Ui i.i.d. Uniform(0, 1)

▶ Inverting (solving for F) we get

F*(xi⁻) ≤ Ui* ≤ F*(xi).

There is a solution iff the order of the Ui* matches the order of the xi.
▶ See Yifan Cui's poster for an extension to censored data.
Additional Constraints
▶ Location-scale family with known density f(x) and cdf F(x), e.g., N(μ, σ²).
▶ Condition Ui* on the existence of μ*, σ* so that

F(σ*⁻¹(xi − μ*)) = Ui* for all i

[Figure: lower and upper step-function bounds on F* at the xi, with the N(4.5, 3²) cdf passing between them.]

▶ The GFD is r(μ, σ) ∝ σ⁻¹ ∏ᵢ₌₁ⁿ σ⁻¹ f(σ⁻¹(xi − μ))
Constraint complications
Toy example: X = μ + Z, μ > 0.
▶ Option 1: condition Z⋆ | x − Z⋆ > 0
▶ r(μ) = φ(x − μ)/Φ(x) · I{μ>0}
▶ Lower confidence bounds do not have correct coverage.
▶ Option 2: projection to μ > 0
▶ r(μ) = (1 − Φ(x)) I{0} + φ(x − μ) I{μ>0}
▶ Correct coverage; possible to get {0} as a CI (a sure bet against)
▶ Option 3: mixture
▶ r(μ) = min(1/2, 1 − Φ(x)) I{0} + max(1/(2Φ(x)), 1) φ(x − μ) I{μ>0}
▶ Correct/conservative coverage; no {0} for reasonable-α CIs.
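Option 1's coverage failure is easy to check by Monte Carlo: the α-quantile of the truncated-normal GFD r(μ) = φ(x − μ)/Φ(x) on μ > 0 solves Φ(x − m) = (1 − α)Φ(x) in closed form. A sketch (μ_true = 0.2 is a made-up value near the boundary, where the shortfall is visible):

```python
import random
from statistics import NormalDist

nd = NormalDist()
random.seed(5)
mu_true, alpha, N = 0.2, 0.05, 20_000

miss = 0
for _ in range(N):
    xobs = mu_true + random.gauss(0, 1)
    # alpha-quantile of the Option 1 GFD: Phi(x - m) = (1 - alpha) * Phi(x),
    # so m = x - Phi^{-1}((1 - alpha) * Phi(x)); note m > 0 always.
    m = xobs - nd.inv_cdf((1 - alpha) * nd.cdf(xobs))   # nominal 95% lower bound
    if m > mu_true:
        miss += 1

coverage = 1 - miss / N
print(coverage)
```

Because the bound is always strictly positive, the coverage at μ near 0 falls well below the nominal 95%, matching the slide's observation; Options 2 and 3 put an atom at {0} exactly to repair this.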
Shape restrictions - preliminary results
▶ Example: positive i.i.d. data with a concave cdf (the MLE is the Grenander estimator)
▶ Condition U⋆ on a concave solution existing (Gibbs sampler)
▶ Project the unrestricted GFD to the space of concave functions (quadratic program)

[Figures: fiducial cdf samples under the Condition and Projection approaches.]
Comments
▶ When to use conditioning vs. projection?
▶ Connection to ancillarity and IM.
▶ Computational cost a consideration?
conclusions
Fiducial Future
▶ What is it that we provide?
▶ GFI: a general-purpose method that often works well
▶ Computational convenience and efficiency
▶ Fiducial options in software.
▶ Theoretical guarantees
▶ Applications
▶ The proof is in the pudding
List of successful applications
▶ General linear mixed models: E, H & Iyer (2008); Cisewski & H (2012)
▶ Confidence sets for wavelet regression: H & Lee (2009); and free-knot splines: Sonderegger & H (2014)
▶ Extreme value data (generalized Pareto), maximum mean, and model comparison: Wandler & H (2011, 2012ab)
▶ Uncertainty quantification for ultra-high-dimensional regression: Lai, H & Lee (2015), Wandler & H (2017+)
▶ Volatility estimation for high-frequency data: Katsoridas & H (2016+)
▶ Logistic regression with random effects (response models): Liu & H (2016, 2017)
I have a dream …
▶ One famous statistician said (I paraphrase): "I use Bayes because there is no need to prove an asymptotic theorem; it is correct."
▶ I have a dream that by the time I retire people will have similar trust in fiducial-inspired approaches.

Thank you!