
Statistical Inference based on Bridge Divergences

Arun Kumar Kuchibhotla∗1, Somabha Mukherjee†1, and Ayanendranath Basu‡2

1 Department of Statistics, University of Pennsylvania
2 Interdisciplinary Statistical Research Unit, Indian Statistical Institute

Abstract

M-estimators offer simple robust alternatives to the maximum likelihood estimator. Much of the robustness literature, however, has focused on the problems of location, location-scale and regression estimation rather than on estimation of general parameters. The density power divergence (DPD) and the logarithmic density power divergence (LDPD) measures provide two classes of competitive M-estimators (obtained from divergences) in general parametric models which contain the MLE as a special case. In each of these families, the robustness of the estimator is achieved through a density power down-weighting of outlying observations. Both the families have proved to be very useful tools in the area of robust inference. However, the relation and hierarchy between the minimum distance estimators of the two families are yet to be comprehensively studied or fully established. Given a particular set of real data, how does one choose an optimal member from the union of these two classes of divergences? In this paper, we present a generalized family of divergences incorporating the above two classes; this family provides a smooth bridge between the DPD and the LDPD measures. This family helps to clarify and settle several longstanding issues in the relation between the important families of DPD and LDPD, apart from being an important tool in different areas of statistical inference in its own right.

1 Introduction

Statistical procedures based on minimization of divergences are popular in the literature. In our context, a divergence is a distance-like dissimilarity measure between two distributions which does not necessarily demand metric properties. Density based minimum divergence methods are very useful as they combine high efficiency with strong robustness properties. Basu et al. (1998) proposed the class of density power divergences

∗[email protected]  †[email protected]  ‡[email protected]



(DPD). These divergences, when constructed between an arbitrary density g and a parametric model density fθ, are indexed by a non-negative robustness tuning parameter α and have the form

\[
\rho_1^{(\alpha)}(g, f_\theta) = \int \left\{ f_\theta^{1+\alpha} - \left(1 + \frac{1}{\alpha}\right) g f_\theta^{\alpha} + \frac{1}{\alpha} g^{1+\alpha} \right\}. \tag{1.1}
\]

The divergence at α = 0 is defined as the continuous limit of the expression in Equation (1.1) as α → 0. This generates the divergence

\[
\rho_1^{(0)}(g, f_\theta) = \int g \log\left(\frac{g}{f_\theta}\right), \tag{1.2}
\]

where log represents the natural logarithm. Observe that the only term on the right hand side of Equation (1.1) containing both g and fθ has g in degree one. A divergence with such a property has been referred to as a decomposable divergence by some authors; see, for example, Broniatowski et al. (2012).

Another class of divergences with some similar properties was introduced by Jones et al. (2001) and was followed up by Fujisawa and Eguchi (2008), Broniatowski et al. (2012) and Fujisawa (2013), among others. In spite of its formal similarity with the DPD, this family, referred to herein as the logarithmic density power divergence (LDPD) family, was originally developed following the robust model fitting idea of Windham (1995). This class also has a non-negative robustness tuning parameter α and is defined as

\[
\rho_0^{(\alpha)}(g, f_\theta) = \log \int f_\theta^{1+\alpha} - \left(1 + \frac{1}{\alpha}\right) \log \int g f_\theta^{\alpha} + \frac{1}{\alpha} \log \int g^{1+\alpha}. \tag{1.3}
\]

The divergence ρ_0^{(0)} is once again defined as a limiting case, and some simple algebra shows that ρ_0^{(0)}(g, fθ) = ρ_1^{(0)}(g, fθ), the common divergence being a version of the Kullback-Leibler divergence.

Jones et al. (2001) also provided a general form of divergence measures in terms of two tuning parameters α and φ given by

\[
\rho_\phi^{(\alpha)}(g, f_\theta) = \frac{1}{\phi}\left(\int f_\theta^{1+\alpha}\right)^{\phi} - \frac{1}{\phi}\left(1 + \frac{1}{\alpha}\right)\left(\int g f_\theta^{\alpha}\right)^{\phi} + \frac{1}{\alpha\phi}\left(\int g^{1+\alpha}\right)^{\phi}, \tag{1.4}
\]

for 0 ≤ φ ≤ 1 and α ≥ 0. The DPD and the LDPD measures are recovered from this general form for φ = 1 and φ = 0, the second one being obtained as the limiting case as φ → 0. Accordingly, the DPD and the LDPD families have also been referred to as type 1 and type 0 divergences in the literature. A scaled version of the LDPD has also been called the γ-divergence by Fujisawa and Eguchi (2008). Both the DPD and the LDPD measures have proved to be useful additions to the literature of robust parameter estimation based on divergences. Both families, as well as the corresponding minimum divergence estimators, have been heavily cited in the literature and applied to many practical problems, and a thorough exploration of their relationship and hierarchy can help us develop better compromises. In either case, a simple density power down-weighting indicates the source of robustness of the resulting estimator. In both cases, the parameter α controls the trade-off between robustness and efficiency; smaller values of α lead to greater model efficiency and larger values of α lead to greater outlier stability. The minimum divergence estimator corresponding to α = 0 is the maximum likelihood estimator in either case.
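As an illustration of Equations (1.1) and (1.3), the following R sketch (ours, not the authors' code; the function names dpd and ldpd are hypothetical) evaluates the two divergences between a density g and a model density fθ by numerical integration; both vanish when g = fθ.

dpd <- function(g, f, alpha, lower = -Inf, upper = Inf) {
  I <- function(h) integrate(h, lower, upper)$value
  # Equation (1.1): int f^{1+a} - (1 + 1/a) int g f^a + (1/a) int g^{1+a}
  I(function(x) f(x)^(1 + alpha)) -
    (1 + 1 / alpha) * I(function(x) g(x) * f(x)^alpha) +
    (1 / alpha) * I(function(x) g(x)^(1 + alpha))
}
ldpd <- function(g, f, alpha, lower = -Inf, upper = Inf) {
  I <- function(h) integrate(h, lower, upper)$value
  # Equation (1.3): the same three integrals, each on the logarithmic scale
  log(I(function(x) f(x)^(1 + alpha))) -
    (1 + 1 / alpha) * log(I(function(x) g(x) * f(x)^alpha)) +
    (1 / alpha) * log(I(function(x) g(x)^(1 + alpha)))
}
dpd(dnorm, dnorm, alpha = 0.5)    # essentially 0 when g = f_theta
ldpd(dnorm, dnorm, alpha = 0.5)   # essentially 0 when g = f_theta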

In terms of comparison between the minimum DPD and the minimum LDPD estimators, Jones et al. (2001) expressed a (weak) preference for the former. In particular, they observed that the presence of observations very close to zero could lead to a spurious global minimum in case of the LDPD under the exponential model. On the other hand, Fujisawa and Eguchi (2008) and Fujisawa (2013) have claimed that the minimum LDPD estimators exhibit a greater relative stability under heavy contamination, leading to smaller bias and smaller mean squared error. They argued that the small bias that the minimum LDPD estimator has is due to an approximate Pythagorean relation that the LDPD satisfies. In contrast, Broniatowski et al. (2012), based on their own simulation study, do not report any particular relative advantage for the minimum LDPD estimator.

These points raise certain unsettled issues involving these two useful classes of divergences and corresponding estimators which we hope to at least partially reconcile in the present paper. The generalized family of divergences in Equation (1.4), useful as it is, does not generate minimum divergence estimators that are legitimate M-estimators except when φ = 0 or 1. We will construct an alternative generalized class of divergences providing a bridge between these classes where several of the intermediate divergences lead to reasonable compromises between the positives of these two families. This new family will be called the family of bridge density power divergences and the generated estimators are all M-estimators. This will provide the user with the flexibility of choosing a suitable estimator from a larger class.

The new family of divergences will depend on two tuning parameters which are (i) the robustness parameter α and (ii) the bridge parameter λ. It would be of interest to choose the tuning parameters adaptively with respect to the proportion of outliers so that the estimation is optimal in an appropriate sense. The robustness tuning parameter α should be close to zero for pure data and should assume a moderately large positive value in case of contaminated data. In this respect, Hong and Kim (2001) proposed the first method of choosing the tuning parameter by minimizing an estimate of the asymptotic variance, and Warwick and Jones (2005) refined this process by using the mean squared error criterion together with a pilot estimate. We provide some justification for using a modified Hong and Kim (2001) procedure; see Section 8.

Before concluding this section, we summarize the main achievements of this paper.

1. We introduce a new family of divergences, the Bridge Density Power Divergences (BDPD), which produces a smooth link between the DPD and the LDPD; each intermediate divergence is decomposable and leads to an M-estimator. Apart from helping to understand the relations between the DPD and the LDPD, this family is important in its own right, and in specific cases the performance of the intermediate divergences turns out to be superior to both of the marginal divergences (DPD and LDPD).

2. We demonstrate that the spurious root problem observed in case of the LDPD, as noticed by Jones et al. (2001), is not an isolated problem for the exponential model, and is a more general phenomenon.

3. It is shown that the results and assertions of Fujisawa and Eguchi (2008) and Fujisawa (2013) are substantially correct, but only when a number of residual concerns are answered. In particular, the observed superior performance of the "minimum LDPD estimator" is, most of the time, achieved at a local minimum of the LDPD measure (rather than a global one), and has to be viewed as a weighted method of moments estimator rather than a minimum divergence estimator.

4. We demonstrate that the weighted method of moment equations corresponding to the LDPD (and indeed many other BDPD measures) can potentially throw up multiple roots; this is a real problem with practical implications. As the desired root is not necessarily the minimizer of the corresponding divergence, there is no automatic root selection strategy. As a choice of the incorrect solution can be disastrous, it is imperative that a clear answer to this question is provided in the literature, where this issue is so far unexplored.

5. We clearly define the target parameter when using the LDPD for parametric estimation. Our results in this paper show that it cannot, in general, be the global minimizer of the LDPD, nor any arbitrary root of the weighted method of moments equation. Our description of the target depends on an actual numerical algorithm for its selection (see item 6 below). This interpretation of the target parameter actually extends to the entire BDPD family, except the ordinary DPD.

6. Finally, we provide an algorithm for the selection of the suitable root of the weighted method of moment equations for any member of the BDPD family, including the LDPD. We also provide a method for the selection of "optimal" tuning parameters for analyzing real data.

The rest of the paper is organized as follows. In Section 2, we introduce the basic parametric set up and also provide the construction leading to the family of bridge divergences. Various properties of the estimators including strong consistency, asymptotic normality and robustness are discussed in Section 3. In Section 4, we provide a heuristic explanation for the spurious minimum behaviour of the LDPD measures in the case of small outliers. Section 5 rigorously proves that the LDPD is bound to fail in some specific examples. In Section 6, we address issues regarding multiple roots of the LDPD estimating equation and well-definedness of the estimators as discussed in the existing literature. We also introduce an algorithm for getting hold of a good root of the LDPD and all the other BDPD estimating equations. In Section 7, we provide the results of a simulation study conducted to support the claims made in previous sections. In Section 8, we give some directions on how to choose the tuning parameters. Finally we conclude with a few remarks in Section 9.

2 The Bridge Density Power Divergence (BDPD) Family

The problem of parameter estimation considered in this paper uses the following set up and notation. Let 𝒢 denote the set of all probability distributions having densities with respect to some σ-finite base measure µ. We assume that the data generating distribution G and the model family F_Θ = {F_θ : θ ∈ Θ ⊂ R^p} belong to 𝒢. Let g and fθ be the corresponding densities (with respect to µ). Let X1, X2, . . . , Xn be an independent and identically distributed (iid) random sample from G which is to be modeled by the family F_Θ. Our aim is to estimate the parameter θ by choosing the model density which gives the "closest fit" to the data. In this paper, we quantify the "closeness" using the density power divergences, the logarithmic density power divergences or their generalizations as mentioned in the previous section. However, we will see in the later sections that this idea of closeness will require a refinement for most of the minimum BDPD estimators. All the integrals considered in this paper, including those in the previous section, are with respect to the measure µ.

In order to estimate θ based on an iid sample, one needs to construct an empirical estimate of the divergence. In this regard, note that the first terms of Equations (1.1) and (1.3), depending only on the model density fθ, need no estimation. The third terms are independent of θ and so do not figure in any minimization process over θ. For the second terms, note that the integral involved can be written as E_g f_θ^α(X) and so can be estimated by the average of f_θ^α(X_i), 1 ≤ i ≤ n. This is a consequence of the decomposability property referred to in Section 1. In light of this discussion, the parameter estimates based on the divergences ρ_1^{(α)} and ρ_0^{(α)} are given by

\[
\theta_{n1}^{\alpha} := \arg\min_{\theta\in\Theta}\left[\int f_\theta^{1+\alpha} - \left(1 + \frac{1}{\alpha}\right)\frac{1}{n}\sum_{i=1}^{n} f_\theta^{\alpha}(X_i)\right], \tag{2.1}
\]

and

\[
\theta_{n0}^{\alpha} := \arg\min_{\theta\in\Theta}\left[\log\left(\int f_\theta^{1+\alpha}\right) - \left(1 + \frac{1}{\alpha}\right)\log\left(\frac{1}{n}\sum_{i=1}^{n} f_\theta^{\alpha}(X_i)\right)\right]. \tag{2.2}
\]

Under certain regularity conditions allowing the interchange of the derivative and the integral, the estimating equations corresponding to the above divergences are given, respectively, by

\[
\int f_\theta^{1+\alpha} u_\theta = \frac{1}{n}\sum_{i=1}^{n} f_\theta^{\alpha}(X_i)\, u_\theta(X_i), \tag{2.3}
\]

and

\[
\left(\frac{1}{n}\sum_{i=1}^{n} f_\theta^{\alpha}(X_i)\right)\int f_\theta^{1+\alpha} u_\theta = \left(\frac{1}{n}\sum_{i=1}^{n} f_\theta^{\alpha}(X_i)\, u_\theta(X_i)\right)\int f_\theta^{1+\alpha}. \tag{2.4}
\]


Here uθ(x) represents the likelihood score function given by uθ(x) = ∇ log fθ(x). Throughout this manuscript, we use the symbols ∇, ∇² to denote the first and the second derivatives with respect to θ. Equations (2.3) and (2.4) demonstrate that both the minimum DPD and the minimum LDPD estimators are legitimate M-estimators. Later in this section and in Section 6, we will consider the weighted method of moments representation of these estimating equations.
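As a concrete illustration of the sample objectives in Equations (2.1) and (2.2), the following R sketch (ours, not the authors' code) minimizes both for the exponential model with mean θ, for which ∫f_θ^{1+α} = 1/((1+α)θ^α) in closed form; the function names and the use of optimize() for this one-dimensional problem are our own choices.

dpd_obj <- function(theta, x, alpha) {
  t1 <- 1 / ((1 + alpha) * theta^alpha)           # int f_theta^{1+alpha} for Exp(mean theta)
  t1 - (1 + 1 / alpha) * mean(dexp(x, 1 / theta)^alpha)
}
ldpd_obj <- function(theta, x, alpha) {
  t1 <- 1 / ((1 + alpha) * theta^alpha)
  log(t1) - (1 + 1 / alpha) * log(mean(dexp(x, 1 / theta)^alpha))
}
set.seed(1)
x <- rexp(100, rate = 1)                          # pure data, true mean 1
optimize(dpd_obj,  interval = c(0.01, 10), x = x, alpha = 0.5)$minimum
optimize(ldpd_obj, interval = c(0.01, 10), x = x, alpha = 0.5)$minimum
# for such pure data both minimizers are typically close to the true mean 1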

In this paper, we will show that the DPD and the LDPD families can be combined in a larger super-family of divergences where each intermediate divergence class leads to an M-estimator, unlike the generalization given in Equation (1.4). For notational simplicity in the following derivation, define

\[
t_1(\theta) := \int f_\theta^{1+\alpha}, \qquad t_2(\theta) := \int g f_\theta^{\alpha}, \qquad t_3(\theta) := \int g^{1+\alpha}.
\]

We consider the density versions (in terms of the true density g) of Equations (2.3) and (2.4) rather than the versions based on the empiricals. These equations are given by

\[
\int f_\theta^{1+\alpha} u_\theta = \int f_\theta^{\alpha}\, g\, u_\theta, \tag{2.5}
\]

and

\[
\int f_\theta^{\alpha}\, g \int f_\theta^{1+\alpha} u_\theta = \int f_\theta^{\alpha}\, g\, u_\theta \int f_\theta^{1+\alpha}. \tag{2.6}
\]

Rewriting the estimating equations (2.5) and (2.6) in terms of t1 and t2, we get

\[
\left[\frac{\nabla t_1(\theta)}{\alpha+1} - \frac{\nabla t_2(\theta)}{\alpha}\right] = 0, \tag{2.7}
\]

\[
\left[\frac{t_2(\theta)\nabla t_1(\theta)}{\alpha+1} - \frac{t_1(\theta)\nabla t_2(\theta)}{\alpha}\right] = 0. \tag{2.8}
\]

For λ ∈ [0, 1], consider an estimating equation which equates a convex combination of the above two estimating functions to zero; it is given explicitly by

\[
\lambda\left[\frac{\nabla t_1(\theta)}{\alpha+1} - \frac{\nabla t_2(\theta)}{\alpha}\right] + (1-\lambda)\left[\frac{t_2(\theta)\nabla t_1(\theta)}{\alpha+1} - \frac{t_1(\theta)\nabla t_2(\theta)}{\alpha}\right] = 0. \tag{2.9}
\]

Rearranging the terms on the left hand side leads to the differential equation

\[
\frac{\nabla t_1(\theta)}{1+\alpha}\left[\lambda + \bar\lambda\, t_2(\theta)\right] = \frac{\nabla t_2(\theta)}{\alpha}\left[\lambda + \bar\lambda\, t_1(\theta)\right],
\qquad\text{that is,}\qquad
\frac{\nabla t_1(\theta)}{\lambda + \bar\lambda\, t_1(\theta)} = \left(\frac{1+\alpha}{\alpha}\right)\frac{\nabla t_2(\theta)}{\lambda + \bar\lambda\, t_2(\theta)}, \tag{2.10}
\]

where λ̄ = 1 − λ. It is now easy to derive the corresponding objective function, which turns out to be

\[
\frac{1}{\bar\lambda}\log\left(\lambda + \bar\lambda\, t_1(\theta)\right) - \frac{1}{\bar\lambda}\left(\frac{1+\alpha}{\alpha}\right)\log\left(\lambda + \bar\lambda\, t_2(\theta)\right) + C, \tag{2.11}
\]


for some constant C independent of θ. Imposing the condition that the divergence has to be zero when g = fθ, the constant C is recovered to be (1/λ̄)(1/α) log(λ + λ̄ t3(θ)). Hence the new divergence can be written as

\[
\rho^{(\alpha,\lambda)}(g, f_\theta) = \frac{1}{\bar\lambda}\log\left(\lambda + \bar\lambda\, t_1(\theta)\right) - \frac{1}{\bar\lambda}\left(\frac{1+\alpha}{\alpha}\right)\log\left(\lambda + \bar\lambda\, t_2(\theta)\right) + \frac{1}{\bar\lambda}\,\frac{1}{\alpha}\log\left(\lambda + \bar\lambda\, t_3(\theta)\right). \tag{2.12}
\]

This is our class of bridge density power divergences, defined for α ≥ 0 and λ ∈ [0, 1]. For λ = 0, this reduces to the class of logarithmic density power divergences; as λ → 1, we recover the class of density power divergences. It is also easy to verify that as α → 0, the limiting divergence is the Kullback-Leibler divergence as given in Equation (1.2), irrespective of the value of λ. Due to the consideration of efficiency, we will, in the rest of the paper, restrict the robustness parameter α to the range [0, 1].

The estimating equation (2.10) for the bridge divergence with parameters (α, λ) can be written as

\[
\frac{\int f_\theta^{1+\alpha} u_\theta}{\lambda + \bar\lambda\int f_\theta^{1+\alpha}} = \frac{\frac{1}{n}\sum_{i=1}^{n} f_\theta^{\alpha}(X_i)\, u_\theta(X_i)}{\lambda + \bar\lambda\,\frac{1}{n}\sum_{i=1}^{n} f_\theta^{\alpha}(X_i)}. \tag{2.13}
\]

This may be referred to as the BDPD moment equations. It is immediately obvious that the above equation represents a very rich class of score moment equations. When α = 0, the above represents the ordinary likelihood score equation irrespective of the value of λ. More generally it represents a score moment equation where the scores are variably weighted by powers of densities depending on the parameter α, and the weights are variably normalized depending on the parameter λ. When λ = 0, the weights are perfectly normalized (as in the case of the LDPD), while we get the non-normalized equations for λ = 1, as in the case of the DPD. For intermediate values of λ we get partially normalized estimating equations, with the degree of normalization dropping off with increasing λ. We will observe the effect of this normalization throughout the rest of the paper, most notably in Section 4, where we will give some indication of its role in the occurrence of the spurious roots.
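The weighting and partial normalization in Equation (2.13) can be made explicit with a small R sketch (ours, not the authors' code; bdpd_weights is a hypothetical helper): each score uθ(Xi) receives the weight f_θ^α(Xi)/(λ + λ̄·(1/n)Σ_j f_θ^α(Xj)).

bdpd_weights <- function(x, theta, alpha, lambda, dens) {
  fa <- dens(x, theta)^alpha
  fa / (lambda + (1 - lambda) * mean(fa))   # weight attached to the score u_theta(X_i)
}
# N(theta, 1) model with one far outlier: the outlier gets essentially zero weight
# for alpha > 0, while lambda only controls how the weights are normalized.
set.seed(2)
x <- c(rnorm(9), 10)
dn <- function(x, th) dnorm(x, th, 1)
round(bdpd_weights(x, theta = 0, alpha = 0.5, lambda = 0, dens = dn), 3)  # LDPD: normalized
round(bdpd_weights(x, theta = 0, alpha = 0.5, lambda = 1, dens = dn), 3)  # DPD: unnormalized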

Remark 2.1 The BDPD family was constructed using an appropriate combination of the DPD and the LDPD estimating equations with the same tuning parameter α; there is no specific reason, apart from mathematical convenience, for considering the same α in both the estimating equations. However, if estimating equations with different values of α are combined, it does not appear to lead to a genuine objective function for such an estimating equation. □

Remark 2.2 Equation (2.9) cannot be obtained by starting directly with a convex combination of the DPD and the LDPD with the same value of α. □


3 Properties of Minimum BDPD Estimators

Based on the bridge density power divergence defined in the previous section and an iid random sample X1, X2, . . . , Xn from a distribution G, an estimator of θ is given by

\[
\theta_n^{(\alpha,\lambda)} := \arg\min_{\theta\in\Theta}\left[\frac{1}{\bar\lambda}\log\left(\lambda + \bar\lambda\, t_1(\theta)\right) - \frac{1}{\bar\lambda}\left(\frac{1+\alpha}{\alpha}\right)\log\left(\lambda + \bar\lambda\,\frac{1}{n}\sum_{i=1}^{n} f_\theta^{\alpha}(X_i)\right)\right]. \tag{3.1}
\]

By definition the BDPD leads to an M-estimator and so its asymptotic and robustness properties can be derived from the well-established M-estimation theory. Therefore, we present the results for consistency and asymptotic normality of the bridge divergence estimator without proofs. For this, we use the following notation:

\[
P_{i,\gamma} = \int g^{i} f_\theta^{\gamma}, \qquad
Q_{i,\gamma} = \int g^{i} f_\theta^{\gamma}\, u_\theta u_\theta^{\top}, \qquad
R_{i,\gamma} = \int g^{i} f_\theta^{\gamma}\, u_\theta, \qquad
S_{i,\gamma} = \int g^{i} f_\theta^{\gamma}\, \nabla u_\theta.
\]

We now present a consistency theorem with conditions in alignment with those in Wald's consistency theorem (Ferguson (1996, Chapter 17)) for the maximum likelihood estimator; as the proof is essentially the same, it is moved to the Supplementary Material.

Theorem 3.1 (Consistency). Suppose that the following assumptions hold.

(C1) Θ is a compact metric space;

(C2) There exists a function K(x) (independent of θ) such that |ϕ_α(x)| ≤ K(x) and K(X) has finite expectation (with respect to G). Here ϕ_α(x) = f_θ^α(x) for α > 0 and ϕ_0(x) = log f_θ(x);

(C3) For any sequence θ_n → θ,

\[
\lim_{n\to\infty} f_{\theta_n}(x) = f_\theta(x)
\]

for all x, except possibly on a set (which might depend on θ but not on the sequence {θ_n}) of µ-measure zero. Also, θ_g^{(α,λ)} is the unique minimizer of the bridge density power divergence ρ^{(α,λ)}(g, f_θ).

Then the minimum BDPD estimator θ_n^{(α,λ)} is strongly consistent for θ_g^{(α,λ)} for any fixed α ∈ [0, 1] and λ > 0.

Remark 3.1 Note that the conditions are independent of the value of λ and match those of Wald's consistency theorem when α = 0. The consistency theorem only requires Θ to be a metric space and not a Euclidean space. The proof can be extended to settings other than iid samples using various generalizations of the uniform strong law of large numbers (USLLN). □

Remark 3.2 The main ingredient in the proof of Theorem 3.1 is Theorem 5.7 of van der Vaart (1998), which uses uniform convergence of the sample based objective functions to their population counterparts. For λ > 0, the function ξ ↦ log(λ + λ̄ξ) is Lipschitz on any bounded closed interval I ⊂ [0, ∞), even if I contains 0. This fact allows one to conclude the following implication (under (C2)):

\[
\sup_{\theta\in\Theta}\left|\frac{1}{n}\sum_{i=1}^{n} f_\theta^{\alpha}(X_i) - \int g f_\theta^{\alpha}\right| = o_p(1) \quad\text{(follows from the uniform SLLN)}
\]
\[
\Longrightarrow\quad \sup_{\theta\in\Theta}\left|\log\left(\lambda + \bar\lambda\,\frac{1}{n}\sum_{i=1}^{n} f_\theta^{\alpha}(X_i)\right) - \log\left(\lambda + \bar\lambda\int g f_\theta^{\alpha}\right)\right| = o_p(1).
\]

At λ = 0 the Lipschitz property fails near 0 and the above implication breaks down. This is the reason for restricting the range of λ to λ > 0. Thus the above proof does not automatically cover the LDPD (λ = 0) case. To extend the consistency claim to λ = 0, one needs to make the additional assumption inf_{θ∈Θ} ∫ g f_θ^α > 0. □

The proof of asymptotic normality of the minimum BDPD estimator can be obtained through a quadratic approximation of the objective function. Using Cramér-Rao type regularity conditions, we also observe that there exists a sequence of roots of the estimating equation which is consistent and asymptotically normal. We do not repeat the conditions here but refer the reader to Theorem 2 of Basu et al. (1998). The proof is similar to Lehmann's proof (Lehmann and Casella (1998)) of consistency and asymptotic normality of the MLE, and is hence omitted.

Theorem 3.2 (Asymptotic Normality). Under certain regularity conditions, there exists a sequence of roots θ_n of the bridge divergence estimating equation which is consistent, and √n(θ_n − θ_g^{(α,λ)}) has an asymptotically normal distribution with mean 0 and variance given by J(θ_g)^{-1} K(θ_g) J(θ_g)^{-1}, where θ_g := θ_g^{(α,λ)},

\[
\begin{aligned}
K(\theta) &= \bar\lambda^{2} P_{1,2\alpha}\, R_{0,\alpha+1}R_{0,\alpha+1}^{\top}
 - \bar\lambda\left[\lambda + \bar\lambda P_{0,\alpha+1}\right] R_{0,\alpha+1}R_{1,2\alpha}^{\top}
 - \bar\lambda\left[\lambda + \bar\lambda P_{0,\alpha+1}\right] R_{1,2\alpha}R_{0,\alpha+1}^{\top}
 + \left[\lambda + \bar\lambda P_{0,\alpha+1}\right]^{2} Q_{1,2\alpha} \\
&\quad - \bar\lambda^{2} P_{1,\alpha}^{2}\, R_{0,\alpha+1}R_{0,\alpha+1}^{\top}
 + \bar\lambda\left[\lambda + \bar\lambda P_{0,\alpha+1}\right] P_{1,\alpha}\, R_{1,\alpha}R_{0,\alpha+1}^{\top}
 + \bar\lambda\left[\lambda + \bar\lambda P_{0,\alpha+1}\right] P_{1,\alpha}\, R_{0,\alpha+1}R_{1,\alpha}^{\top}
 - \left[\lambda + \bar\lambda P_{0,\alpha+1}\right]^{2} R_{1,\alpha}R_{1,\alpha}^{\top},
\end{aligned}
\]

\[
\begin{aligned}
J(\theta) &= (\alpha+1)\, Q_{0,\alpha+1}\left[\lambda + \bar\lambda P_{1,\alpha}\right] + S_{0,\alpha+1}\left[\lambda + \bar\lambda P_{1,\alpha}\right]
 - \alpha\, Q_{1,\alpha}\left[\lambda + \bar\lambda P_{0,\alpha+1}\right] - S_{1,\alpha}\left[\lambda + \bar\lambda P_{0,\alpha+1}\right] \\
&\quad + \bar\lambda\alpha\, R_{1,\alpha}R_{0,\alpha+1}^{\top} - \bar\lambda(\alpha+1)\, R_{0,\alpha+1}R_{1,\alpha}^{\top}.
\end{aligned}
\]

3.1 Pythagorean Relation

Fujisawa and Eguchi (2008) have claimed that under heavy contamination, the minimum LDPD estimator achieves a smaller bias (compared to the minimum DPD estimator); it is also suggested that this phenomenon can be partially explained by an approximate Pythagorean relation which the LDPD satisfies. We present a sequence of results which shows that such a Pythagorean relation holds in general for all BDPD (except the DPD), so that the LDPD result is a special case of the more general result. For any 0 ≤ t ≤ 1, let t̄ = 1 − t. Define the cross entropy between any two densities g and f for λ, α ∈ [0, 1] by

\[
d_{\lambda,\alpha}(g, f) = \frac{1}{\bar\lambda}\,\frac{1}{1+\alpha}\log\left(\lambda + \bar\lambda\int f^{1+\alpha}\right) - \frac{1}{\alpha\bar\lambda}\log\left(\lambda + \bar\lambda\int g f^{\alpha}\right).
\]

The divergence induced by this cross entropy is given by

\[
D_{\lambda,\alpha}(g, f) = -d_{\lambda,\alpha}(g, g) + d_{\lambda,\alpha}(g, f),
\]

which is a scaled version of ρ^{(α,λ)}(g, f), satisfying D_{λ,α}(·, ·) = ρ^{(α,λ)}(·, ·)/(1 + α).

Theorem 3.3. Let f and δ be given probability density functions and let 0 ≤ ε ≤ 1. If g(·) = (1 − ε)f(·) + εδ(·) and h is any positive function, then for any α ∈ [0, 1] and λ ∈ [0, 1),

\[
d_{\lambda,\alpha}(g, h) = d_{\lambda,\alpha}(f, h) - \frac{1}{\alpha\bar\lambda}\log(1-\varepsilon) + O(T_{\varepsilon,\delta}),
\]

where

\[
T_{\varepsilon,\delta} := \frac{\varepsilon}{1-\varepsilon}\left[\lambda + \bar\lambda\int \delta h^{\alpha}\right]\Big/\;\alpha\bar\lambda\left[\lambda + \bar\lambda\int f h^{\alpha}\right].
\]

For λ = 1, we have

\[
d_{1,\alpha}(g, h) = d_{1,\alpha}(f, h) + \frac{\varepsilon}{\alpha}\left[\int \{f - \delta\}\, h^{\alpha}\right].
\]

Proof. First consider the case 0 ≤ λ < 1. By the definition of d_{λ,α}, we have, writing ℓ(h) = (1/(λ̄(1+α))) log(λ + λ̄∫h^{1+α}),

\[
\begin{aligned}
d_{\lambda,\alpha}(g, h) &= \ell(h) - \frac{1}{\bar\lambda\alpha}\log\left(\lambda + \bar\lambda\int\{(1-\varepsilon)f + \varepsilon\delta\}\, h^{\alpha}\right) \\
&= \ell(h) - \frac{1}{\bar\lambda\alpha}\log\left(\lambda + (1-\varepsilon)\bar\lambda\int f h^{\alpha} + \varepsilon\bar\lambda\int \delta h^{\alpha}\right) \\
&= \ell(h) - \frac{1}{\bar\lambda\alpha}\log\left(\varepsilon\left[\lambda + \bar\lambda\int \delta h^{\alpha}\right] + (1-\varepsilon)\left[\lambda + \bar\lambda\int f h^{\alpha}\right]\right).
\end{aligned}
\]

Now by a simple Taylor series expansion, we get that

\[
d_{\lambda,\alpha}(g, h) = \ell(h) - \frac{1}{\alpha\bar\lambda}\log\left((1-\varepsilon)\left[\lambda + \bar\lambda\int f h^{\alpha}\right]\right) + O(T_{\varepsilon,\delta}),
\]

from which the result follows. For the case λ = 1, observe that

\[
\begin{aligned}
d_{1,\alpha}(g, h) &= \frac{1}{1+\alpha}\int h^{1+\alpha} - \frac{1}{\alpha}\int g h^{\alpha} \\
&= \frac{1}{1+\alpha}\int h^{1+\alpha} - \frac{1}{\alpha}\int f h^{\alpha} + \frac{\varepsilon}{\alpha}\left[\int f h^{\alpha} - \int \delta h^{\alpha}\right] \\
&= d_{1,\alpha}(f, h) + \frac{\varepsilon}{\alpha}\left[\int \{f - \delta\}\, h^{\alpha}\right].
\end{aligned}
\]

Remark 3.3 This theorem can be used to prove a Pythagorean relation (for 0 ≤ λ < 1) similar to Theorem 3.2 of Fujisawa and Eguchi (2008). The statement of the result is given below for completeness; however the proof is very similar to the one in Fujisawa and Eguchi (2008) and is hence omitted. The Fujisawa and Eguchi (2008) result thus becomes a particular case of the following theorem. □

Theorem 3.4 (Pythagorean Relation). Let f and δ be given probability density functions and let 0 ≤ ε ≤ 1. If g(·) = (1 − ε)f(·) + εδ(·) and h is any positive function, then for any 0 ≤ λ < 1,

\[
\Delta(g, f, h) = D_{\lambda,\alpha}(g, h) - D_{\lambda,\alpha}(g, f) - D_{\lambda,\alpha}(f, h) = O(\varepsilon\nu),
\]

where

\[
\nu = \lambda + \bar\lambda\max\left\{\int \delta f^{\alpha},\ \int \delta h^{\alpha}\right\}.
\]

Remark 3.4 Theorem 3.4 is one of the main theoretical reasons for the behaviour of the minimum BDPD estimators to be observed in the subsequent sections over various choices of tuning parameters and various contaminating distributions. Note that the function

\[
\xi(\lambda) = \frac{1}{\bar\lambda}\,\frac{\lambda + \bar\lambda a}{\lambda + \bar\lambda b},
\]

for fixed positive values of a and b, is necessarily increasing in λ ∈ [0, 1) if and only if a ≤ b. This implies that when

\[
\int \delta h^{\alpha} \le \int f h^{\alpha}, \tag{3.2}
\]

the error term T_{ε,δ} is an increasing function of λ ∈ [0, 1) and so the bias of the minimum BDPD estimator is expected to be an increasing function of λ ∈ [0, 1); however, as the Pythagorean relation does not exist for the DPD, calculations based on ξ(λ) are not helpful in theoretically comparing the bias of the minimum DPD estimator. The condition (3.2) with h = f_θ and f = f_{θ0} for θ, θ0 ∈ Θ is implied by the usual approximate singularity condition in the robustness literature; the latter condition dictates that the contaminating distribution (in this case represented by δ) in the gross error model is approximately singular with respect to the parametric family, so that ∫δh^α ≈ 0 and ∫f h^α is relatively large. In the numerical simulations, we will see that this reasoning generally fits in well with the observed behaviour of the estimators except for very small values of α; in the latter case the quantities in Equation (3.2) can be very close and the above observations may not hold.

But when the contaminating distribution δ is concentrated at (or around) the mode of the model density h = f_{θ0}, the integral ∫δh^α is often at least as large as ∫f h^α, so that the changing behaviour of ξ(λ) over λ, with a = ∫δh^α and b = ∫f h^α, is less predictable. We will observe in the rest of the paper that the performance of the minimum LDPD estimator (and several other minimum BDPD estimators) may suffer badly in this case. □

In the following sections, we will consider two kinds of contamination for the parametric model. The first case will correspond to the approximate singularity idea of the gross error model, where the contaminating (minor) component is well-separated from the target (major) component. We will refer to this as outer contamination; in this case the contaminating values will be surprising observations in the sense of Lindsay (1994). In the second case of contamination, we will choose the contaminating component near the mode of the major component so that these observations are no longer surprising observations but will nevertheless distort the shape of the distribution relative to the parametric model. We will refer to this case as inner contamination.

4 Spurious Behaviour under Inner Contamination

Jones et al. (2001) reported that in case of the exponential distribution, small outliers(near the origin) made the minimum LDPD estimator non-robust. These small outliersare essentially what we have described as constituting inner contamination in Section3. Let ηθ(x) denote the density of the exponential distribution with mean θ. The non-robustness of the minimum LDPD estimator under small outliers was demonstrated byJones et al. (2001) in simulation studies and was confirmed by determining the LDPDbetween the densities ηθ and the 0.85η1 + 0.15δx0 mixture with x0 = 0.0001 where δx0represents the indicator function at x = x0. In Figures 1 and 2, we have exhibited theBDPD objective function between the above two densities over θ for α = 0.5 and severalvalues of λ in [0, 1].
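For reference, the following R sketch (ours, not the authors' code; bdpd_obj_exp is a hypothetical name) computes this population BDPD objective, up to the term free of θ, using the closed-form integrals for the exponential model; the point-mass component contributes 0.15·η_θ^α(x0) to the ∫g η_θ^α term.

# Closed forms for the exponential(mean theta) model:
#   t1 = int eta_theta^{1+a} = 1/((1 + a) theta^a)
#   int eta_1 eta_theta^a = theta^{1 - a}/(theta + a),  eta_theta^a(x0) = theta^{-a} exp(-a x0/theta)
bdpd_obj_exp <- function(theta, alpha = 0.5, lambda, x0 = 1e-4, eps = 0.15) {
  t1 <- 1 / ((1 + alpha) * theta^alpha)
  t2 <- (1 - eps) * theta^(1 - alpha) / (theta + alpha) +
        eps * theta^(-alpha) * exp(-alpha * x0 / theta)
  lb <- 1 - lambda                                        # lambda-bar = 1 - lambda
  if (lambda == 1) return(t1 - (1 + alpha) / alpha * t2)  # DPD limit
  (1 / lb) * log(lambda + lb * t1) -
    (1 / lb) * ((1 + alpha) / alpha) * log(lambda + lb * t2)
}
theta_grid <- 10^seq(-5, log10(2), length.out = 400)
obj_ldpd   <- sapply(theta_grid, bdpd_obj_exp, alpha = 0.5, lambda = 0)
theta_grid[which.min(obj_ldpd)]   # needle-sharp spurious minimum near x0 for small lambda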

Figure 1: Plots of the BDPD objective function against θ for λ = 0, 0.25, 0.5 and 0.7 (α = 0.5, x0 = 10^{-4}).

Figure 2: Plots of the BDPD objective function against θ for λ = 0.8, 0.9, 0.95 and 0.975 (α = 0.5, x0 = 10^{-4}).

It is clear that the global minimizer of the objective function remains stuck at a value very close to zero at least up to λ = 0.7. It is noteworthy that this spurious minimum may easily be missed by any gradient descent root search algorithm since the minimum is obtained as a needle sharp notch over a very limited region. It might very well be possible (and indeed it is the case in Figures 1 and 2) that a local minimizer of the LDPD may serve as a reasonable robust solution even in this case. But the asymptotic results for a sequence of roots of the estimating equation present in the literature do not prescribe the method to choose a suitable sequence of roots in case of multiple roots. In any case, it is clear that the global minimizer of the LDPD up to (at least) λ = 0.7 is a nonsensical value.

The main reason for this phenomenon in the LDPD and the bridge divergences close to it is the behaviour of the log function at 0, where it diverges to −∞. If one writes the LDPD between the η_θ and 0.85η_1 + 0.15δ_{0.0001} densities, we can easily check that there is a sharp drop in the objective function around a value very close to zero because of the above logarithm effect (although at 0, the objective function is positive infinity). This effect slowly smooths out as the λ parameter increases, since the presence of the additive λ term in each of the logarithms eventually forces the argument to be bounded away from zero. The validity of this reasoning is confirmed by examining the behaviour of the LDPD and the BDPD close to it in case of the N(0, σ²) model, where it can be expected that the estimate of σ will be driven to zero if there are some outliers near 0 in the sample. For that matter, this behaviour can also be seen in case of the N(µ, σ²) model with some outliers near the true mean. Figure 3 exhibits the LDPD at α = 0.5 as a function of σ when computed between the densities of the N(0, σ²) and the mixture 0.85N(0, 1) + 0.15δ_{0.001}. Clearly, there is a reasonable local minimum around σ = 0.8; but it is beaten hands down by the useless global minimum in the neighbourhood of zero.

Figure 3: Plot of the LDPD objective function (α = 0.5) against σ under the N(0, σ²) model.

While the behaviour observed in Figures 1–3 represents the patterns in the divergences between the actual densities, such behaviour can also be observed under pure data (although it is relatively rare in scale models). An actual random sample of size 20 from N(0, 1) with seed 129 was obtained in R as

−1.120900, −0.989724, −1.374697, −1.355645, 1.996755, 0.695870, 0.771968, −0.002847, 1.008549, −0.990280, 1.131772, −0.244929, 1.186625, −1.671537, −0.081999, −1.831365, 0.358867, 0.891639, 0.489801, 0.000010.

The exact value of the last observation is 1.048653 × 10^{-5}, which forces a spurious behaviour in the LDPD objective function plotted in Figure 4 for α = 0.8. To be precise, if f_σ(x) denotes the probability density of N(0, σ²) with respect to the Lebesgue measure, then

\[
\int f_\sigma^{\alpha+1}(x)\,dx = \frac{1}{\sqrt{\alpha+1}\,(\sqrt{2\pi}\,\sigma)^{\alpha}},
\]

and the LDPD objective function satisfies

\[
M_n(\sigma) \le \log\left(\frac{1}{\sqrt{\alpha+1}\,(\sqrt{2\pi}\,\sigma)^{\alpha}}\right) - \left(\frac{\alpha+1}{\alpha}\right)\log\left(\frac{1}{n}\, f_\sigma^{\alpha}(X_{20})\right)
= \log\left(n^{1+1/\alpha}\sigma\right) + \frac{1}{2}\log\left(\frac{2\pi}{\alpha+1}\right) + \left(\frac{\alpha+1}{2}\right)\frac{X_{20}^{2}}{\sigma^{2}},
\]

where X_{20} is the last observation, which is close to zero, and n = 20. This inequality confirms our reasoning. As σ slides down towards zero, for a while the first term involving the logarithm dominates, pulling the objective function down. However, as σ gets really close to zero, the third term takes over and there is an extremely sharp rise in the objective function, which generates a minimum with a razor sharp notch. Figure 4 shows the spurious (global) minimum near zero (at σ = 0.00001309906 to be exact), although there is a reasonable local minimum at σ = 1.265882. For this sample, the spurious global minimum phenomenon is observed for all α ≥ 0.500144205 and all λ ∈ [0, 0.228827308]. In contrast, the global minimum of the DPD objective function (also plotted in Figure 4) for α = 0.8 is obtained at σ = 1.20833. If the last observation 1.048653 × 10^{-5} is removed from this data set, the spurious minimum behaviour of the LDPD objective function disappears. The minimum LDPD estimator then equals σ = 1.321807 at α = 0.8, an entirely reasonable value (the corresponding minimum DPD estimator is 1.298228). Thus one single observation can bring about an absolutely drastic change in the minimum LDPD estimator, which is against the spirit of stability that robust estimators should have; so far neither we, nor to our knowledge anybody else, have detected such spurious behaviour in any scenario involving the DPD.
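The sample LDPD objective for this example is easy to reproduce; the following R sketch (ours, not the authors' code) evaluates Equation (2.2) for the N(0, σ²) model at α = 0.8 over a log-spaced grid of σ values, using the 20 observations listed above.

x <- c(-1.120900, -0.989724, -1.374697, -1.355645, 1.996755, 0.695870,
       0.771968, -0.002847, 1.008549, -0.990280, 1.131772, -0.244929,
       1.186625, -1.671537, -0.081999, -1.831365, 0.358867, 0.891639,
       0.489801, 1.048653e-05)
ldpd_obj_norm <- function(sigma, alpha = 0.8) {
  t1 <- 1 / (sqrt(alpha + 1) * (sqrt(2 * pi) * sigma)^alpha)  # int f_sigma^{1+alpha}
  t2 <- mean(dnorm(x, 0, sigma)^alpha)                        # (1/n) sum f_sigma^alpha(X_i)
  log(t1) - ((alpha + 1) / alpha) * log(t2)
}
sig <- 10^seq(-5, log10(2), length.out = 1000)
sig[which.min(sapply(sig, ldpd_obj_norm))]   # spurious global minimum near 1.3e-05 (cf. Figure 4)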

Thus the spurious root issue in case of the LDPD is not an isolated problem limited to the case of the exponential distribution. It is, in fact, a more serious problem in the case of the location-scale model (compared to just the scale model), as we will see in the next section.

Figure 4: Plots of the sample based DPD and LDPD objective functions against σ under the N(0, σ²) model, based on a sample of size 20 from N(0, 1).

5 Unboundedness of the Bridge Divergences

In the previous section, we explained the reason for observing a spurious minimum with the LDPD in the case of inner contamination and argued that it might happen even in case of real data generated from the pure model. Intuitively, this may be explained by the fact that the probability of observing a value near the mode of the majority distribution is not too small at moderate sample sizes.

Here we will formally prove that the sample version of the BDPD objective function (given on the right hand side of Equation (2.12)) is unbounded below in case the parametric model family is a location-scale family 𝒢_{f,Θ}, where f is a probability density function, Θ = R × R^{+}, and

\[
\mathcal{G}_{f,\Theta} = \left\{ f_\theta(\cdot) : f_\theta(x) = \frac{1}{\sigma}\, f\!\left(\frac{x-\mu}{\sigma}\right) \text{ for some } \theta := (\mu,\sigma) \in \Theta \right\}.
\]

This result, proved in Theorem 5.1, implies that one cannot fit a location-scale family using the minimizer of a BDPD, except for the DPD.

Theorem 5.1. Let X1, X2, . . . , Xn be independent and identically distributed observations from a density g, which is modeled by the densities in the location-scale family of distributions 𝒢_{f,Θ} with a fixed f satisfying f(0) > 0. Then, for any 0 ≤ λ < 1 and α > 0,

\[
\inf_{\theta\in\Theta}\left[\log\left(\lambda + \bar\lambda\int f_\theta^{1+\alpha}(x)\,dx\right) - \left(\frac{1+\alpha}{\alpha}\right)\log\left(\lambda + \frac{\bar\lambda}{n}\sum_{i=1}^{n} f_\theta^{\alpha}(X_i)\right)\right] = -\infty. \tag{5.1}
\]

Proof. Set the objective function on the left hand side of Equation (5.1) as M_n(θ). Then for θ = (µ, σ),

\[
M_n(\theta) = \log\left(\lambda + \frac{\bar\lambda}{\sigma^{\alpha}}\int f^{1+\alpha}(x)\,dx\right) - \frac{1+\alpha}{\alpha}\log\left(\lambda + \frac{\bar\lambda}{n\sigma^{\alpha}}\sum_{i=1}^{n} f^{\alpha}\!\left(\frac{X_i-\mu}{\sigma}\right)\right).
\]

We will now show that for each 1 ≤ j ≤ n, M_n(X_j, σ) (that is, θ = (X_j, σ)) converges to −∞ as σ ↓ 0. Fix 1 ≤ j ≤ n. Since f(x) ≥ 0 for all x, we get, by taking µ = X_j,

\[
\left(\frac{1+\alpha}{\alpha}\right)\log\left(\lambda + \frac{\bar\lambda}{n\sigma^{\alpha}}\sum_{i=1}^{n} f^{\alpha}\!\left(\frac{X_i-\mu}{\sigma}\right)\right) \ge \left(\frac{1+\alpha}{\alpha}\right)\log\left(\frac{\bar\lambda}{n\sigma^{\alpha}}\, f^{\alpha}(0)\right).
\]

Substituting this bound in M_n(θ) = M_n(X_j, σ), we obtain

\[
M_n(X_j, \sigma) \le \log\left(n^{\frac{1+\alpha}{\alpha}}\left(\lambda\sigma^{\alpha+1} + \bar\lambda\sigma\int f^{1+\alpha}\right)\right) - C_f,
\]

where C_f = (1 + α) log(λ̄^{1/α} f(0)). Letting σ tend to zero implies M_n(X_j, σ) → −∞, proving Equation (5.1).

Remark 5.1 From the proof, it is seen that the global minimizer of the bridge divergence in case of any sample is at the extreme point (σ = 0) and the global minimum, over all (µ, σ) combinations, is attained for at least n points, namely (X_j, 0), 1 ≤ j ≤ n. This is similar in spirit to the classical example of likelihood inference for Gaussian mixture modelling as explained in Example 2.4.5 of Bickel and Doksum (2015). Note that the theorem does not cover the case of the DPD, which is given below. □

Figure 5: LDPD objective function for the N(µ, σ²) model with an artificial dataset of four observations.

In order to illustrate the phenomenon discussed above, we plot the LDPD objective with α = 0.5 in Figure 5, based on an artificial sample of size 4, where the sample observations are −2, −1, 0.5, 2. The underlying model family is assumed to be {N(µ, σ²) : µ ∈ R, σ > 0}. We observe that the LDPD objective drops down sharply to −∞ as σ approaches 0, when µ is any of the four sample points. See Section S.2 of the supplementary materials for more details.

Theorem 5.2. Let X1, X2, . . . , Xn be independent and identically distributed observations from a density g, which is modeled by the densities of the location-scale family 𝒢_{f,Θ} with a fixed f. Suppose that lim_{|x|→∞} f(x) = 0. Then, for any α > 0, the following holds for all n if µ does not belong to the set {X1, . . . , Xn}, and for all n > (1+α)f^α(0)/(α∫f^{1+α}) if µ is equal to some Xi in the above set:

\[
\lim_{\sigma\to 0}\left( \int f_{(\mu,\sigma)}^{1+\alpha}(x)\,dx - \left(\frac{1+\alpha}{\alpha}\right)\frac{1}{n}\sum_{i=1}^{n} f_{(\mu,\sigma)}^{\alpha}(X_i) \right) = \infty. \tag{5.2}
\]

Proof. First, note that the quantity on the left hand side of (5.2), whose limit needs to be evaluated as σ → 0, can be written as

\[
\sigma^{-\alpha}\left( \int f^{1+\alpha} - \left(\frac{1+\alpha}{\alpha n}\right)\sum_{i=1}^{n} f^{\alpha}\!\left(\frac{X_i-\mu}{\sigma}\right) \right). \tag{5.3}
\]

Observe that, if µ does not belong to the set {X1, . . . , Xn}, then

\[
\lim_{\sigma\to 0} f^{\alpha}\!\left(\frac{X_i-\mu}{\sigma}\right) = 0 \quad\text{for all } 1\le i\le n.
\]

Consequently, (5.3) goes to ∞ as σ → 0. On the other hand, if µ equals X_j for some 1 ≤ j ≤ n, then (5.3) can be written as

\[
\sigma^{-\alpha}\left( \int f^{1+\alpha} - \left(\frac{1+\alpha}{\alpha n}\right) f^{\alpha}(0) - \left(\frac{1+\alpha}{\alpha n}\right)\sum_{i\ne j} f^{\alpha}\!\left(\frac{X_i-\mu}{\sigma}\right) \right). \tag{5.4}
\]

It is now easy to see that for n > (1+α)f^α(0)/(α∫f^{1+α}), (5.4) goes to ∞ as σ → 0. This completes the proof.

Remark 5.2 Theorem 5.2 shows that the DPD objective function for the location-scale family (for α > 0) diverges to ∞ as σ → 0, for all n, if µ is different from all the observations. On the other hand, if µ equals one of the observations, the situation differs in the sense that there exists a_f such that if n > a_f, the DPD objective function diverges to ∞, and if n < a_f, it diverges to −∞. The case of n = a_f (when a_f is a natural number) depends on the rate of convergence of f(x) to zero as |x| → ∞. □

Remark 5.3 Theorem 5.1, in particular, covers the case of the N(µ, σ²) parametric family, and out of the entire BDPD family the DPD alone stands out as a practically valid method of inference. Fujisawa and Eguchi (2008) provided an iterative algorithm to obtain the minimizer of the LDPD in case the parametric family is one of the exponential families, applicable to the N(µ, σ²) family. However, while their iterative algorithm is monotone decreasing, it does not in general guarantee convergence to the global minimizer. In fact, the small bias property shown for the scale parameter in their numerical study (Table 2, Fujisawa and Eguchi (2008)) suggests, in view of our discussion in Section 4, that their algorithm mostly converges to a local minimizer, at least in the normal family. This is also very similar to the case of likelihood inference for Gaussian mixture modelling, wherein one uses the EM algorithm to get hold of the useful local maximizer of the log-likelihood. This fact is also consistent with the asymptotic normality result presented in Section 5 of Fujisawa and Eguchi (2008). The asymptotic normality holds for a sequence of roots of the estimating equation, while the global minimizer does not exist (in the interior of the parameter space). Note that in this case the global minimizer does not appear as a root of the estimating equation, owing to the fact that any global minimum is attained on the boundary of the parameter space. □

Remark 5.4 The result as stated in Theorem 5.1 does not apply to a scale family with a fixed location (say, 0). But if one of the observations is exactly 0, then the proof as presented above can be employed to get the same conclusion. More precisely, we have (taking σ = X_{(1)}),

\[
\inf_{\sigma>0}\left[\log\int f_\sigma^{1+\alpha} - \left(1 + \frac{1}{\alpha}\right)\log\left(\frac{1}{n}\sum_{i=1}^{n} f_\sigma^{\alpha}(X_i)\right)\right] \le \log\left(n^{1+1/\alpha} X_{(1)}\right) + D_f,
\]

where D_f = log∫f^{1+α} − (1 + α) log f(1), under the assumption f(1) > 0. Here X_{(1)} denotes the first order statistic of the sample X1, . . . , Xn and f_σ(x) = f(x/σ)/σ for a density f on R^{+}. This indicates that if we have inner contamination (that is, contamination near the mode 0) so that n^{1+1/α} X_{(1)} is small enough, then there is a possibility of a spurious global minimum near zero. Observe that the true distribution is a mixture described as

\[
X \mid Z = 0 \sim f_{\sigma_0} \quad\text{and}\quad X \mid Z = 1 \sim \delta_0,
\]

where Z ∼ Bernoulli(p) and δ_0 represents a point mass at 0. This represents a simple inner contamination model with contamination level p. Then, in a sample of size n, we have, for any fixed ε > 0,

\[
P\left(X_{(1)} > n^{-1-1/\alpha}\varepsilon\right) = \left(1 - p - (1-p)F\left(n^{-1-1/\alpha}\varepsilon/\sigma_0\right)\right)^{n} \le (1-p)^{n},
\]

implying convergence with probability one of n^{1+1/α} X_{(1)} to zero as n → ∞ for any fixed p > 0. Here F represents the cumulative distribution function corresponding to the density f. In a given sample, though, this spurious minimum may disappear (or may not be significantly pronounced) depending on the proportion of contamination. In this case, the BDPDs do not behave as badly as in the location-scale family. As already observed, the divergences close to the DPD will lead to reasonable estimators. □

6 Towards the Desired Estimator

As we have seen, the analysis using the LDPD can be beset with different kinds of problems, both theoretical and computational. In this section we add to the discussion of the possible flaws in the current state of the analysis based on the LDPD, eventually giving a recipe for consolidating the existing knowledge and overcoming the present difficulties so that one can arrive at a best compromise. We do not exactly refute the findings of Fujisawa and Eguchi (2008) and Fujisawa (2013), as we find a substantial part of this research to be valuable and useful. However, there are too many loose ends that remain unaccounted for, which limit the practical applications of the method without further consolidation.

Fujisawa and Eguchi (2008) and Fujisawa (2013) define the latent bias of an estimator as the difference between the target parameter and the limit of the estimator. To describe the basic flaw with the minimum LDPD estimator, or in general the normalized estimating equations, we rework the arguments of Fujisawa (2013) which give the hint of a very small latent bias under heavy contamination for the minimum LDPD estimator. Let f(x) = f_{θ*}(x) be the target density within our parametric family and δ(x) be the contamination density, so that the data generating density g is given by

\[
g(x) = (1-\varepsilon)f(x) + \varepsilon\delta(x).
\]

Let ℓ(x; θ) = log f_θ(x) and u_θ(x) = ∇ℓ(x; θ) be the usual log-likelihood and the score, respectively. A general normalized estimating equation is given by

\[
\frac{E_g[\xi(\ell(X;\theta))\, u_\theta(X)]}{E_g[\xi(\ell(X;\theta))]} = \frac{E_{f_\theta}[\xi(\ell(X;\theta))\, u_\theta(X)]}{E_{f_\theta}[\xi(\ell(X;\theta))]}, \tag{6.1}
\]

where ξ(·) is a non-negative weight function satisfying ξ(a) → 0 as a → −∞ and E_h[·] represents the expectation with respect to the density h. Assume that

\[
E_\delta[\xi(\ell(X;\theta))] \approx 0, \quad\text{and}\quad E_\delta[\xi(\ell(X;\theta))\, u_\theta(X)] \approx 0, \tag{6.2}
\]

in a neighborhood of θ = θ*, which is usually satisfied under the classical outlier model formulation (or, as we call it, outer contamination). Substituting the form of the density g, we get

\[
\frac{(1-\varepsilon)E_{f_{\theta^*}}[\xi(\ell(X;\theta))\, u_\theta(X)] + \varepsilon E_\delta[\xi(\ell(X;\theta))\, u_\theta(X)]}{(1-\varepsilon)E_{f_{\theta^*}}[\xi(\ell(X;\theta))] + \varepsilon E_\delta[\xi(\ell(X;\theta))]} = \frac{E_{f_\theta}[\xi(\ell(X;\theta))\, u_\theta(X)]}{E_{f_\theta}[\xi(\ell(X;\theta))]}.
\]

Whenever the approximations in (6.2) hold, we get the following approximate equality:

\[
\frac{E_{f_{\theta^*}}[\xi(\ell(X;\theta))\, u_\theta(X)]}{E_{f_{\theta^*}}[\xi(\ell(X;\theta))]} \approx \frac{E_{f_\theta}[\xi(\ell(X;\theta))\, u_\theta(X)]}{E_{f_\theta}[\xi(\ell(X;\theta))]}.
\]

Thus, θ* is an approximate solution of the normalized estimating equation (6.1). For concreteness, consider the case where f(x) = f_0(x) (i.e., θ* = 0), where f_µ denotes the density of the normal distribution with mean µ and variance 1, and δ(x) = f_{10}(x). Take θ = 10 in the normalized estimating equation (6.1). It is easy to see that

\[
E_{f_0}[\xi(\ell(X;\theta))] \approx 0, \quad\text{and}\quad E_{f_0}[\xi(\ell(X;\theta))\, u_\theta(X)] \approx 0,
\]

in a neighborhood of θ = 10. Therefore, as above, the equality

\[
\frac{E_{f_{10}}[\xi(\ell(X;\theta))\, u_\theta(X)]}{E_{f_{10}}[\xi(\ell(X;\theta))]} = \frac{E_{f_\theta}[\xi(\ell(X;\theta))\, u_\theta(X)]}{E_{f_\theta}[\xi(\ell(X;\theta))]}
\]

holds approximately. Thus, not only θ* = 0 but θ = 10 is also an approximate root of this estimating equation. The latent bias argument for normalized estimating equations would be one sided if one were to use it only for θ*, and not for the root corresponding to the contaminating component. One probably can have an estimator with small latent bias in this case, but one would have to appropriately choose between the multiple roots to ensure this small bias.

To validate our argument, we generated 20 observations randomly from the mixture distribution 0.9N(10, 1) + 0.1N(0, 1). We present the entire sample for the future reproducibility of our results:

10.73402487, 10.07316778, 12.06112801, 8.52976810, 9.66834408, 10.69923160, 10.65200828, 9.87964160, 8.76851845, 10.91329572, 8.12296270, 8.57642728, 8.52115355, 12.42173904, 10.45406538, 9.86883487, −0.89282090, −0.06273843, 1.34072270, −0.65001187.

The parametric family is taken to be F = {N(µ, 1) : µ ∈ R}. The LDPD objective function in (2.2) with α = 0.5 is plotted against θ in Figure 6. It is seen to have a (local) minimum near 0 and a (possibly global) minimum around 10, which in this case is the mean of the larger component of the mixture. Note that while there are at least two obvious roots of the corresponding estimating equation, a clear root selection strategy is unavailable. While it so happens that in this case the divergence is well behaved, in many standard cases it is not (as we have seen in Sections 4 and 5).
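This objective function is straightforward to reproduce; the following R sketch (ours, not the authors' code) evaluates the sample LDPD objective (2.2) for the N(µ, 1) model at α = 0.5 on the 20 observations listed above.

y <- c(10.73402487, 10.07316778, 12.06112801, 8.52976810, 9.66834408,
       10.69923160, 10.65200828, 9.87964160, 8.76851845, 10.91329572,
       8.12296270, 8.57642728, 8.52115355, 12.42173904, 10.45406538,
       9.86883487, -0.89282090, -0.06273843, 1.34072270, -0.65001187)
ldpd_norm_mean <- function(mu, alpha = 0.5) {
  t1 <- 1 / (sqrt(alpha + 1) * (2 * pi)^(alpha / 2))   # int N(mu,1)^{1+alpha}, free of mu
  t2 <- mean(dnorm(y, mu, 1)^alpha)                    # (1/n) sum f_mu^alpha(X_i)
  log(t1) - ((alpha + 1) / alpha) * log(t2)
}
mu_grid <- seq(-5, 15, by = 0.01)
plot(mu_grid, sapply(mu_grid, ldpd_norm_mean), type = "l",
     xlab = expression(mu), ylab = "objective function")
# two local minima appear, one near mu = 0 and one near mu = 10 (cf. Figure 6)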

Figure 6: LDPD objective function (α = 0.5) against µ for the N(µ, 1) model under the normal mixture data.

We are now in a position to give an analysis of the minimum LDPD estimator in particular (and the minimum BDPD estimator in general) by consolidating the different bits of results that we have presented so far in this paper. In Section 3 we have proved the consistency of the minimum BDPD estimator. What it means in layman's terms is that the global minimizer of the empirical LDPD objective function converges to the global minimizer of the theoretical objective function obtained by replacing the empirical distribution function by the true one. However, the usefulness of these consistency results is immediately challenged by our findings in Sections 4 and 5, as well as the previous part of this section. First of all, it is quite possible, even in case of pure data, that the global minimizer of the empirical objective function is a value which is entirely useless to the statistician for all practical purposes. Yet, the procedure might actually have a very reasonable local minimizer close to our target. This is what we have observed in Figure 4, for example. Fujisawa and Eguchi (2008) or Fujisawa (2013) – or, indeed, anybody else for that matter – do not give any root selection criteria in such cases; and in the absence of any such criteria, one would be hard put to justify the choice of a local minimizer when a perfectly legitimate (although statistically useless) global minimizer exists. Our results in Section 5 show that the global minimizers for the LDPD would always be at the boundary of the parameter space and will be of very little practical value to us under location-scale models. However, in almost every such case there exist – as we have seen in our simulations – reasonable roots of the estimating equations, representing appropriate local minima of the objective function. The bias and mean square error reported in Figures 2-5 of Fujisawa (2013) correspond to the root generated by this reasonable local (but not global) minimum.

Secondly, an equally serious problem is the extreme shift in the target parameter itself under inner contamination when the LDPD is the divergence of choice. As we have seen in Figure 3, a small proportion of inner contamination is enough to shift the target parameter (the global minimizer of the divergence between the theoretical densities) to such an extreme degree that the new target does not characterize the original major component in any way at all. Yet, once again, the divergence has a local minimizer which reasonably characterizes the major component. Once again, however, there is no conceivable reason for considering a local minimizer as the target in the presence of a global minimizer (which, unfortunately, does not characterize the component of interest).

It is clear, therefore, that the usual inference based on the LDPD alone can lead one astray, if one proceeds with the global minimizer. Other reasonable roots do exist, but root selection strategies do not. Existing literature claims the existence of "a root" which has the desired properties, without any recipe for arriving at that root. Such inadequacy exists in case of many other members of the BDPD class. The only member within this class which is entirely free – as far as we have observed in our explorations – from the anomalies of spurious roots and ill-defined targets is the DPD. In every model and every case that we have looked at, the DPD has a well defined minimizer that would reasonably represent, in theory, a properly described target, and would generate a suitable estimator for the same target under randomly generated data. Acknowledging that the LDPD does have certain roots which are superior in dealing with the bias under heavy contamination, we propose to utilize the global minimizer of the DPD to arrive at the desired root of the LDPD. The next subsection describes the chain algorithm which we propose for this purpose.

6.1 The Chain Algorithm

Under repeated simulations we have observed, under many standard parametric models, that as far as the global minimizers of the BDPDs are considered over all λ ∈ [0, 1] given a specific value of α, the minimum DPD estimator is almost always the best in terms of latent bias (and not the minimum LDPD estimator). At the same time we acknowledge that the minimum LDPD estimating equations (or, more generally, the minimum BDPD estimating equations) may have roots which represent some local minimum of the divergences and which might improve upon the performance of the minimum DPD estimator in terms of latent bias. As indicated by the heuristics, and as we have seen in our simulations, the LDPD (more generally the BDPD) has a local minimizer that is close to the true θ*. This, we believe, is at least partially due to the Pythagorean relation that we presented in Section 3. Since the members of the BDPD family form a literal bridge between the DPD and the LDPD, we have an opportunity for getting at a root of the LDPD estimating equations that is consistent and asymptotically normal. A completely rigorous proof is unavailable at this time, but we provide a strong heuristic argument and extensive simulations to support this claim.

From the form of the bridge divergence, for a fixed α and θ, we obtain the limit

\[
\rho^{(\alpha,\lambda_k)}(g, f_\theta) \to \rho^{(\alpha,\lambda_0)}(g, f_\theta) \quad\text{as } \lambda_k \to \lambda_0. \tag{6.3}
\]

Heuristically this indicates that ρ^{(α,λ)}(g, f_θ) has at least a local minimizer close to θ^α_{n1} (defined in Equation (2.1)) if λ is close to one. For example, one can take λ = 1 − n^{-1/2}, which is sufficiently close to 1 for this root to be reasonably good. We shall now present a chain algorithm for getting hold of a good root of the LDPD estimating equation (indeed, of all bridge estimating equations). The chain algorithm proceeds in the following steps. Fix α ∈ [0, 1].

1. First choose a sequence {λ_i} satisfying

\[
1 = \lambda_K > \lambda_{K-1} > \cdots > \lambda_2 > \lambda_1 > \lambda_0 = 0 \quad\text{and}\quad \max_{1\le i\le K}|\lambda_i - \lambda_{i-1}| \le r_n,
\]

with r_n → 0 at some rate. Define a function λ ↦ θ^α(λ).

2. Solve the DPD problem completely, that is, find the global minimizer θ^α_{n1} of the sample estimate of the DPD. Set θ^α(λ_K) = θ^α_{n1}. Note that λ = 1 in the bridge divergence expression in Equation (2.12) corresponds to the DPD (in the limit).

3. For i = K − 1, K − 2, . . . , 0, find the local minimizer of the sample estimate of the bridge divergence with parameters (α, λ_i) that is closest to θ^α(λ_{i+1}). Set this closest local minimizer as θ^α(λ_i).

4. Return {θ^α(λ_i) : 0 ≤ i ≤ K}.

Step 3 above can be solved by using any off-the-shelf algorithm for minimization with θ^α(λ_{i+1}) as a starting point. To draw an analogy with existing algorithms of this type, we note that the LASSO, which has taken over the literature in the past few years, is solved by using a chain/path algorithm as above; there the issue is about getting a fast algorithm, not about distinguishing local and global minimizers (it is a convex problem).
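The following R sketch (ours, not the authors' code; the names bdpd_emp and chain_algorithm are hypothetical) implements the steps above for a one-dimensional parameter, warm-starting each local search at the previous solution as a stand-in for "the closest local minimizer".

bdpd_emp <- function(theta, alpha, lambda, x, dens) {
  # sample BDPD objective of Equation (3.1); lambda = 1 is taken as the DPD objective (2.1)
  t1 <- integrate(function(z) dens(z, theta)^(1 + alpha), -Inf, Inf)$value
  t2 <- mean(dens(x, theta)^alpha)
  lb <- 1 - lambda                                  # lambda-bar
  if (lambda == 1) return(t1 - (1 + alpha) / alpha * t2)
  (1 / lb) * log(lambda + lb * t1) -
    (1 / lb) * ((1 + alpha) / alpha) * log(lambda + lb * t2)
}

chain_algorithm <- function(x, dens, alpha, lambdas = seq(1, 0, by = -0.1),
                            starts = seq(-10, 10, length.out = 50)) {
  # Step 2: solve the DPD (lambda = 1) problem "completely" using a crude multi-start grid
  fits <- sapply(starts, function(s)
    optim(s, bdpd_emp, alpha = alpha, lambda = 1, x = x, dens = dens,
          method = "BFGS")$par)
  vals  <- sapply(fits, bdpd_emp, alpha = alpha, lambda = 1, x = x, dens = dens)
  theta <- fits[which.min(vals)]
  # Step 3: walk lambda from 1 down to 0, warm-starting each local search at the previous root
  path <- numeric(length(lambdas))
  for (k in seq_along(lambdas)) {
    theta   <- optim(theta, bdpd_emp, alpha = alpha, lambda = lambdas[k],
                     x = x, dens = dens, method = "BFGS")$par
    path[k] <- theta
  }
  data.frame(lambda = lambdas, theta = path)   # step 4: the whole chain; last row is the LDPD root
}

# Example (normal location model, data x):
# chain_algorithm(x, dens = function(x, th) dnorm(x, th, 1), alpha = 0.5)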

Conjecture 6.1. In any model, under the assumptions that guarantee the asymptotic normality of θ^α_{n1}, θ^α(λ_0) is the consistent root of the LDPD estimating equation that is closest to θ*. And this is the solution claimed by Fujisawa and Eguchi (2008) to have a small latent bias.

The simulations using this chain algorithm are very encouraging and we think that a proof of the above conjecture can be obtained by an application of the Taylor series and using (6.3) with an appropriate choice of the rate of convergence of λ_k − λ_0 to zero. In many nicely behaved parametric densities, simply choosing the root of the LDPD estimating equation closest to the minimum DPD estimator might be good enough. However, in general cases, the difference between the LDPD and DPD estimating equations can be too drastic for this simple trick to work (at least for a theoretical guarantee).


In the following section we will frequently use the term "minimum bridge divergence estimator". It is to be understood that this will refer to the local minimizer obtained by using the chain algorithm, which uses the global minimizer of the DPD as the starting value. In addition, we will also frequently use the term "root of the estimating equation". In this case it will be understood that the correspondence is with a local minimizer, and not an intervening local maximizer (or saddle point), which may also produce a legitimate root of the estimating equation.

7 Simulation Study

In this section, we apply the chain algorithm discussed in Section 6 to simulated data. We consider the exponential and the normal scale families, the former to illustrate the case of outer contamination, and the latter to illustrate the case of inner contamination. The estimators are computed from 1000 replications. The bias and mean squared error over these 1000 replications are calculated against the target component of the underlying majority distribution.

The chain algorithm is applied for each pair (α, ε), where the robustness parameter α runs from 0 to 1 in steps of 0.2 and the contamination level ε is 0, 0.05 or 0.2. For each (α, ε) combination, the bridge parameter λ runs from 1 to 0 in steps of 0.1 as the chain algorithm proceeds. Since the parameter α is somewhat better understood in the literature and our introduced parameter λ needs to be analyzed more deeply, the latter is varied through finer steps than the former.

7.1 Exponential Scale Model

Here the model is the class of exponential distributions with mean σ, over σ ∈ (0, ∞). Data are simulated from the mixture (1 − ε)Exp(1) + εUnif(6 − 10^{-4}, 6 + 10^{-4}), where the first component is exponential with rate 1, and the second is uniform over the indicated range. The contamination level ε is taken to be 0, 0.05 and 0.2. For step 2 of the chain algorithm, which involves solving the DPD problem completely, 25 starting points are drawn uniformly from the interval [0, 0.1] and the remaining 75 uniformly from the interval (0.1, 10].
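As a usage illustration (ours, not from the paper), the snippet below generates one contaminated sample from this mixture and runs the chain_algorithm sketch given in Section 6; the choice α = 0.4 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2017)
n, eps = 100, 0.20
is_outlier = rng.uniform(size=n) < eps
x = np.where(is_outlier,
             rng.uniform(6 - 1e-4, 6 + 1e-4, n),   # contaminating component near 6
             rng.exponential(1.0, n))              # majority component Exp(1)

path = chain_algorithm(x, alpha=0.4)               # sketch from Section 6
for lam in sorted(path, reverse=True):
    print(f"lambda = {lam:.1f}: sigma estimate = {path[lam]:.4f}")
```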

7.2 Normal Scale Model

Here the model is the class of normal distributions with mean 5 and variance σ^2, over σ ∈ (0, ∞). Data are simulated from the mixture (1 − ε)N(5, 1) + εUnif(5 − 10^{-5}, 5 + 10^{-5}), where the first and the second components are normal and uniform respectively, with parameters as indicated. The contamination level ε is taken to be 0, 0.05 and 0.2. For step 2 of the chain algorithm, which involves solving the DPD problem completely, 25 starting points are drawn uniformly from the interval [0, 0.1] and the remaining 75 uniformly from the interval (0.1, 10].

For the sake of brevity, we will only include the graphs of the mean-squared errors (scaled by sample size n = 100) of the minimum bridge divergence estimators in the exponential scale model case, for contamination levels 0.05 and 0.2 (the two panels of Figure 7), in the main article. Tables presenting exact values of the bias and MSE of the minimum bridge divergence estimators for both the exponential and normal scale families, for all the three contamination levels 0, 0.05 and 0.2, are given in the online supplement.

It is observed that for the 5% contamination level, the MSE of the minimum bridge divergence estimators decreases as the chain algorithm proceeds (λ goes from 1 to 0) for α up to 0.6, and increases as the chain algorithm proceeds for α = 0.8 and 1. On the other hand, for the 20% contamination level, the MSE of the minimum bridge divergence estimators decreases as the chain algorithm proceeds from λ = 1 to λ = 0 for all values of α. As a function of α (with λ held fixed), for 5% contamination, the MSE decreases roughly up to α = 0.6 and then increases with α, whereas for 20% contamination, a completely decreasing trend of the MSE is observed as α increases.

Note that the values reported here are not, except for the DPD (λ = 1) case, based on the global minimizers of the divergences; rather they are the chain algorithm solutions. What this shows is that for heavy contamination (at the level of 20%) the LDPD solution (not necessarily the global minimizer) as obtained from the chain algorithm does seem to dominate the other BDPD solutions, and the mean square errors decrease uniformly in the direction 1 → 0 over λ for each α. This provides partial confirmation of the results of Fujisawa and Eguchi (2008) and Fujisawa (2013). However, the results are now stronger and more useful in that we identify the root for which it works; it does not work for just any root, and certainly not for the global minimizer. This is so in spite of the fact that the contamination considered here is an outer contamination. Any exponential distribution has a mode at zero, and randomly throwing up values close to zero, leading to spurious global minimizers, is not entirely uncommon.

A similar situation is observed in the normal distribution case. Here the contamination is an inner contamination, which leads to spurious global minimizers more often for the BDPDs. However, the chain algorithm guides the process to a sensible root (that is also a local minimizer) in each case.

For the data of size 20 presented in Section 4 from N(0, 1), the minimum bridge divergence estimators of σ under the N(0, σ^2) model with α = 0.8 and varying λ from 0 to 1 are presented in Table 1. For each value of λ, the true global minimizers of the bridge divergence objective functions are presented in the parentheses. It is remarkable that even with λ as high as 0.9, the spurious root phenomenon is observed.

8 Choice of Tuning Parameters

Most of the robust estimators derive their robustness from down-weighting the observations suspected as outliers. As can be seen from the BDPD objective functions, the observations are proportionally weighted by powers of the model density, which demonstrates that the observations are down-weighted for being outliers with respect to the density fθ. It is this outlier stability property which makes the corresponding estimators desirable from the point of view of robustness.

[Figure 7 about here: each panel plots 100 × mean-squared error against λ, with one curve for each of α = 0.2, 0.4, 0.6, 0.8 and 1.0; the left panel corresponds to contamination error ε = 0.05 and the right panel to ε = 0.20.]

Figure 7: Scaled MSE of the Minimum Bridge Divergence Estimators for the Exp(σ) model at 5% and 20% Contaminations.

Table 1: Bridge Divergence Estimators of σ with α = 0.8. (The global minimizers are presented in the parentheses.)

λ      θα,λ                    λ      θα,λ                    λ      θα,λ
0.9    1.245808 (1.4411e-5)    0.6    1.248243 (1.4124e-5)    0.3    1.252942 (1.4085e-5)
0.8    1.246477 (1.4218e-5)    0.5    1.249401 (1.4106e-5)    0.2    1.255756 (1.4078e-5)
0.7    1.247269 (1.4155e-5)    0.4    1.250926 (1.4094e-5)    0.1    1.259926 (1.4073e-5)

If there are no outliers in the data, then the tuning parameter α must ideally be set to zero in order to get optimal inference. But if the data do contain outliers, then the tuning parameters α and λ should be chosen so as to partially or fully eliminate the effect of the outlying observations. However, the choice of the tuning parameters should be an automatic procedure, where a data-driven choice of the parameters α and λ is generated by the method. One of the first methods of choosing tuning parameters in the case of the DPD was proposed in Hong and Kim (2001), where the tuning parameter is chosen by minimizing the trace of an estimate of the asymptotic covariance matrix. Warwick and Jones (2005) refined this method to find an “optimal” tuning parameter by minimizing the trace of an estimate of the MSE of the estimator. Here the MSE is computed by separately estimating the bias and the variance components. From the asymptotic variance formula, one can easily estimate the variance. But bias estimation involves the use of a pilot estimate, and the estimated mean square error has a strong dependence on it. Warwick and Jones (2005) used the minimum L2 distance estimator as the pilot; however, under the current state of the art, there is no universally acceptable choice of the pilot, on which the process critically depends.

We acknowledge that the method of Warwick and Jones (2005) would be perfect if we could eliminate its dependence on the pilot estimator, or find an independent estimate of the bias as a function of the tuning parameter. However, here we present a modified version of the approach of Hong and Kim (2001) and give an (almost theoretical) justification of the same. Firstly, one should not use the trace of an estimate of the asymptotic covariance matrix (although some of the present authors have done so in the past in the absence of a better strategy) for the construction of the objective function to be minimized. This leads to adding quantities with different units. Consider the N(µ, σ^2) example, in which case the asymptotic variances of (µ, σ^2), or any of their estimates, have the units of Xi and Xi^2, which should not be added. One could, of course, think of estimating (µ, σ), where adding the variance estimates would be more sensible. But this remedy does not work in general parameter spaces.

Why is the asymptotic variance adequate? An intuitive explanation is as follows. It is well known that the delete-d jackknife estimator is a consistent estimator of the asymptotic variance. The delete-d jackknife estimator is obtained by calculating the difference between the estimator calculated based on n observations and the estimator calculated based on n − d observations. By the definition of a good robust estimator, this difference should be small for a robust estimator; but for a non-robust estimator like the maximum likelihood estimator, this difference would be large if the d deleted observations contain some outliers. This implies that robust estimators should have a “smaller” asymptotic variance than non-robust estimators. Of course, if there is no contamination, then it is known that the maximum likelihood estimator would have a “smaller” variance for a large enough sample size. Using a bootstrap asymptotic variance estimator also gives a similar conclusion. We do ignore the bias, but the question clearly goes in favor of a robust estimator: by definition, robust estimators are closer to the true parameter based on the majority of the data.
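A toy numerical illustration of this delete-d intuition (ours, not part of the paper) compares the maximum shift of a non-robust estimator (the sample mean) with that of a robust one (the sample median) when d observations are deleted from a mildly contaminated sample:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 95), rng.normal(10.0, 0.1, 5)])  # 5% outliers
d = 10

def max_jackknife_shift(x, estimator, d, n_subsets=500):
    """Largest |T(full sample) - T(n - d retained points)| over random deletions."""
    full = estimator(x)
    shifts = [abs(estimator(x[rng.choice(len(x), len(x) - d, replace=False)]) - full)
              for _ in range(n_subsets)]
    return max(shifts)

print("mean  :", max_jackknife_shift(x, np.mean, d))    # can shift substantially
print("median:", max_jackknife_shift(x, np.median, d))  # stays small
```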

A more concrete explanation is offered by using the closeness measure introduced in Zhang and He (2016). Let VΛ denote a set of variance matrices Vα indexed by a (possibly vector) parameter α ∈ Λ.

Definition 8.1. A non-negative real-valued matrix function m(·) defined on the set VΛ is called a closeness measure if and only if the following conditions hold.

1. Consistency. For any V1, V2 ∈ VΛ, if V1 ⪰ V2, then m(V1) ≥ m(V2), where the equality holds only if V1 = V2.

2. Continuity. For any matrix sequence {Vn} ⊆ VΛ with Vn ⪰ B, as n → ∞, ‖Vn − B‖∞ → 0 if and only if m(Vn) → m(B).

Here by A ⪰ B, we mean that A − B is non-negative definite. Definition 8.1 implies that if there is an efficient variance matrix in the set VΛ, then minimizing m(Vα) over α ∈ Λ leads to such a matrix. Zhang and He (2016) additionally prove that the trace and the Frobenius norm are valid closeness measures. We use the determinant of the matrix as the measure of closeness (see Section S.5 of the supplementary material). Note that the determinant is a valid quantity to consider from the point of view of units, and it also has a practical interpretation in terms of the volume of a confidence set. These arguments give a justification for minimizing a closeness measure of an estimate of the asymptotic variance.

Summing up all these arguments, we claim that minimizing the determinant of an estimate of the asymptotic variance provides an asymptotically optimal estimator whenever such an estimator exists in the family of estimators under consideration. A detailed analysis of this procedure will be taken up in a future paper. Just as a remark, for the normal data of size 20 presented in Section 4, the optimal λ parameter with α = 0.8 using this procedure turned out to be λ = 1. A plot of the asymptotic variance over all λ is given in the supplementary material.
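In code, the selection rule just described amounts to a grid search. The sketch below (ours, not part of the paper) assumes a user-supplied function estimated_acov(alpha, lam, data) returning an estimate of the asymptotic covariance matrix (for instance, the formula of Theorem 3.2 evaluated at the chain-algorithm estimate); that function is a placeholder and not something we specify here.

```python
# Pick (alpha, lambda) minimizing the determinant of the estimated asymptotic covariance.
import numpy as np

def select_tuning(data, estimated_acov, alphas, lambdas):
    best_pair, best_det = None, np.inf
    for a in alphas:
        for lam in lambdas:
            det = np.linalg.det(estimated_acov(a, lam, data))
            if det < best_det:
                best_pair, best_det = (a, lam), det
    return best_pair, best_det
```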

9 Concluding Remarks

In this paper, the competing families of divergences, the DPD and the LDPD, are critically examined for their strengths and deficiencies. The bridge divergence family introduced in this paper is an attempt to combine the strengths of both and nullify the disadvantages of either. Unlike the DPD, the LDPD estimating equation admits roots with small latent bias. However, these roots may not be the global minimizers of the LDPD objective. The phenomenon of spurious global minimizers of the LDPD is rigorously proved in specific parametric families where the DPD provably works. The bridge divergences partly suppress this problem for certain tuning parameter (λ) values along the bridge.

However, the bridge divergence is not a perfect solution, in that it also faces the same problem as the LDPD in some cases. The point made here is that one cannot expect the global minimizer of the LDPD to generate a good estimator; the DPD estimator is a safe bet in all cases, but, whenever possible, some members of the class of minimum bridge divergence estimators can help reduce the latent bias of the DPD estimator.

Acknowledgements

The authors would like to thank Srijata Samanta of the University of Florida for her contribution towards Remark 5.4.

Bibliography

Basu, A., Harris, I. R., Hjort, N. L., and Jones, M. C. (1998). Robust and efficient estimation by minimising a density power divergence. Biometrika, 85(3):549–559.

Bickel, P. J. and Doksum, K. A. (2015). Mathematical statistics—basic ideas and selected topics. Vol. 1. Texts in Statistical Science Series. CRC Press, Boca Raton, FL, second edition.

Broniatowski, M., Toma, A., and Vajda, I. (2012). Decomposable pseudodistances and applications in statistical estimation. J. Statist. Plann. Inference, 142(9):2574–2585.

Ferguson, T. (1996). A Course in Large Sample Theory. Chapman & Hall Texts in Statistical Science Series. Taylor & Francis.

Fujisawa, H. (2013). Normalized estimating equation for robust parameter estimation. Electron. J. Stat., 7:1587–1606.

Fujisawa, H. and Eguchi, S. (2008). Robust parameter estimation with a small bias against heavy contamination. J. Multivariate Anal., 99(9):2053–2081.

Hong, C. and Kim, Y. (2001). Automatic selection of the tuning parameter in the minimum density power divergence estimation. J. Korean Statist. Soc., 30(3):453–465.

Jones, M. C., Hjort, N. L., Harris, I. R., and Basu, A. (2001). A comparison of related density-based minimum divergence estimators. Biometrika, 88(3):865–873.

Lehmann, E. L. and Casella, G. (1998). Theory of point estimation. Springer Texts in Statistics. Springer-Verlag, New York, second edition.

Lindsay, B. G. (1994). Efficiency versus robustness: the case for minimum Hellinger distance and related methods. Ann. Statist., 22(2):1081–1114.

van der Vaart, A. W. (1998). Asymptotic statistics. Cambridge University Press, Cambridge.

Warwick, J. and Jones, M. C. (2005). Choosing a robustness tuning parameter. J. Stat. Comput. Simul., 75(7):581–588.

Windham, M. P. (1995). Robustifying model fitting. J. Roy. Statist. Soc. Ser. B, 57(3):599–609.

Zhang, S. and He, X. (2016). Inference based on adaptive grid selection of probability transforms. Statistics, 50(3):667–688.

Supplementary Material for “Statistical Inference based on Bridge Divergences”

Arun Kumar Kuchibhotla, Somabha Mukherjee and Ayanendranath Basu

S.1 Consistency of the Minimum Bridge Divergence Estimator

In this section, we give a proof of Theorem 3.1 of the main paper. To begin with, let us define two quantities:

Mn(θ) = (1/λ̄) log( λ + λ̄ ∫ f_θ^{1+α} ) − ((1 + α)/(α λ̄)) log( λ + λ̄ (1/n) Σ_{i=1}^n f_θ^α(X_i) ),

M(θ) = (1/λ̄) log( λ + λ̄ ∫ f_θ^{1+α} ) − ((1 + α)/(α λ̄)) log( λ + λ̄ ∫ g f_θ^α ),

where λ̄ = 1 − λ. We have the estimator and the target as

θ_n^{(α,λ)} = argmin_{θ∈Θ} Mn(θ)   and   θ_g^{(α,λ)} = argmin_{θ∈Θ} M(θ).

In order to conclude that θ_n^{(α,λ)} → θ_g^{(α,λ)} almost surely, we apply Theorem 5.7 of van der Vaart (1998). To this end, we prove almost sure uniform convergence of the (random) functions Mn(θ) to M(θ). Note that

sup_{θ∈Θ} |Mn(θ) − M(θ)|
  = ((1 + α)/(α λ̄)) sup_{θ∈Θ} | log( λ + λ̄ (1/n) Σ_{i=1}^n f_θ^α(X_i) ) − log( λ + λ̄ ∫ g f_θ^α ) |
  ≤ ((1 + α)/(α λ)) sup_{θ∈Θ} | (1/n) Σ_{i=1}^n f_θ^α(X_i) − ∫ g f_θ^α |.

The last inequality follows from the fact that if λ ≤ a < b < ∞, then there exists ξ ∈ (a, b) such that log b − log a = (1/ξ)(b − a) ≤ (1/λ)(b − a). By (C2), we know that ∫ g f_θ^α ≤ ∫ g(x)K(x) dx < ∞ for all θ. Applying Theorem 16(a) of Ferguson (1996) with U(x, θ) = f_θ^α(x), we conclude that the quantity on the right hand side above converges almost surely to zero. Note that the conditions (1), (2) and (3) of Theorem 16(a) of Ferguson (1996) are exactly the same as our conditions (C1), (C3) and (C2), respectively, and we are done.

Remark 1. For the above proof to hold, the assumption λ > 0 is crucial to bound the terms λ + λ̄ (1/n) Σ_{i=1}^n f_θ^α(X_i) and λ + λ̄ ∫ g f_θ^α away from 0, so that the mean value theorem can be applied to the log function over its domain (0, ∞). Hence the proof, as it is, cannot be applied to the LDPD, but it works for every other fixed member of the bridge family.
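For the reader's convenience, a short expansion (ours, based on the form of Mn(θ) as transcribed above, with λ̄ = 1 − λ) indicates why the λ = 1 member corresponds to the DPD in the limit, as stated in Section 6: for any fixed a > 0,

$$\frac{1}{\bar\lambda}\log\bigl(\lambda + \bar\lambda\, a\bigr) = \frac{1}{\bar\lambda}\log\bigl(1 + \bar\lambda\,(a - 1)\bigr) \longrightarrow a - 1 \qquad \text{as } \bar\lambda \to 0,$$

so that, applying this to both terms of $M_n(\theta)$,

$$M_n(\theta) \longrightarrow \int f_\theta^{1+\alpha} - \frac{1+\alpha}{\alpha}\cdot\frac{1}{n}\sum_{i=1}^n f_\theta^{\alpha}(X_i) + \frac{1}{\alpha},$$

which is the sample DPD objective up to an additive constant.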

S.2 Unboundedness of the LDPD Objective based on Simulated Data

In Section 5 of the main article, we plotted the LDPD objective with α = 0.5 based on an artificial sample of size 4, the underlying model being the family {N(µ, σ^2) : µ ∈ R, σ > 0}. Here, we demonstrate the unboundedness of the LDPD objective with α = 0.5 based on 100 observations simulated from N(0, 1). The interactive MATLAB figure can be found in the file hundnormfinal.pdf (a part of the online supplement), and the simulated data can be found in norhundfin.mat (a part of the online supplement).


Figure 1: LDPD Objective Based on 100 Observations Simulated From N(0, 1)

The figure shows sharp dips of the LDPD objective when the mean parameter is one of the 100 data points and the standard deviation approaches 0. From the proof of Theorem 5.1 of the main article, we see that the bridge divergence objective function goes to +∞ at a fast rate as σ goes to 0, for all those µ which do not equal any data point. That is why the plot looks steep when µ equals a data point. However, putting λ = 0 in the same proof, we see that the LDPD objective function goes to −∞ at rate log σ, which is a slow rate, for all those µ which equal one of the data points. That is why we need a very small starting value on the σ-axis to observe the drop in the LDPD objective function.

S.3 Bias and Mean-Squared Error of the Bridge Estimators

In Section 7 of the main paper, only the plots of the mean-squared errors of the bridge estimators for the exponential scale model case with 5% and 20% contaminations are included. We now present the exact numerical values of the (scaled up) bias and the (scaled up) mean-squared error of the bridge estimators (in Tables 1 to 6) for all the three levels of contamination: 0%, 5% and 20%, for both the exponential and the normal scale family cases. To gain more numerically significant digits, we scaled up the bias and MSE by 10 (= √n) and 100 (= n), respectively. In each table, along each row, the bias/MSE of the bridge estimators is noted as the chain algorithm proceeds from λ = 1 to λ = 0 in steps of 0.1.

In the normal scale family case with contamination level 0.2, it is seen that for each λ, the maximum likelihood estimator (the estimator with α = 0) has the smallest mean squared error. The reason behind this is that in this case, we have an inner contamination around the mode of the majority distribution.

n∑

i=1

wαi uθ(Xi) = 0. (S.3.1)

Suppose that wi = fη(Xi), where fθ is the density function of N(5, θ2), and η is an initial value of θ. Supposethat the true value of θ is 1. For α > 0, wαi is large if Xi is close to 5. So, observations coming from thecontaminating distribution get more weight relative to observations from the true majority distribution.Consequently, the solution

θ(α) =

√∑ni=1w

αi (Xi − 5)2

∑ni=1w

αi

to (S.3.1) shrinks more towards 0 when α > 0, than when α = 0.Thus, the mean squared error around 1,i.e. E(θ(α) − 1)2 is inversely related to shrinkage towards 0. That is why, we should expect

MSE(θ(α)

)�MSE

(θ(0))

for α > 0. The estimating equation (S.3.1) is similar to that of the DPD. Similar heuristic argument alsoapplies to the other bridge divergences, as they downweight using powers of densities.
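A quick numerical check of this shrinkage heuristic (ours, not part of the paper) for the N(5, θ^2) model with inner contamination at 5:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, eps = 100, 0.2
x = np.where(rng.uniform(size=n) < eps,
             rng.uniform(5 - 1e-5, 5 + 1e-5, n),   # inner contamination at the mode
             rng.normal(5.0, 1.0, n))              # majority component N(5, 1)

def theta_alpha(x, alpha, eta=1.0):
    """Closed-form solution of (S.3.1) with weights w_i = f_eta(X_i)."""
    w = norm.pdf(x, loc=5.0, scale=eta) ** alpha
    return np.sqrt(np.sum(w * (x - 5.0) ** 2) / np.sum(w))

for a in [0.0, 0.5, 1.0]:
    print(f"alpha = {a}: theta(alpha) = {theta_alpha(x, a):.4f}")  # shrinks as alpha grows
```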

S.4 Asymptotic Variance Plot of the Bridge Estimators

The plot of the estimated asymptotic variance of the bridge estimators across λ for α = 0.8, using the normal data of size 20 presented in Section 4 of the main article, is given in Figure 2. The quantity plotted is the estimated asymptotic variance using the formula given in Theorem 3.2, evaluated at θ = θ(α,λ) obtained through the chain algorithm for these data. It is seen that the estimated asymptotic variance decreases monotonically with λ.

We also plot the estimated asymptotic variance of the bridge estimators across both λ and α, based on the normal data of size 20 presented in Section 4 of the main article (in Figure 3), and based on a sample of size 1000 simulated from N(0, 1) (in Figure 4). The interactive MATLAB figures can be found in the files twodplotfinal.fig and thousand.fig, and the simulated sample of size 1000 can be found in x.mat (which are parts of the online supplement). The plot in Figure 2 represents the cross section of the plot in Figure 3 along α = 0.8.


Figure 2: Plot of Estimated Asymptotic Variance of Bridge Estimators across λ for α = 0.8, based on the normal data of size 20


Figure 3: Plot of Estimated Asymptotic Variance of Bridge Estimators across λ and α, based on the normal data of size 20


Figure 4: Plot of Estimated Asymptotic Variance of Bridge Estimators across λ and α, based on the simulated normal data of size 1000

S.5 The Determinant is a Closeness Measure

Theorem S.5.1. The determinant of a matrix is a closeness measure.

Proof. Before getting into the proof, note that for any two positive definite matrices A and B of order p × p with a positive semi-definite difference A − B, we have

|A| = |A − B + B| = |B| · |I + B^{−1/2}(A − B)B^{−1/2}| = |B| ∏_{i=1}^p (1 + λ_i),

where λ_i ≥ 0, 1 ≤ i ≤ p, are the eigenvalues of B^{−1/2}(A − B)B^{−1/2}.

(Consistency). The equality above implies that |A| ≥ |B| for A ⪰ B. To show that the equality |A| = |B| for A ⪰ B holds if and only if A = B, first note that A = B trivially implies |A| = |B|. If |A| = |B|, then by the equality above we get ∏_{i=1}^p (1 + λ_i) = 1, which implies that λ_i = 0 for all 1 ≤ i ≤ p; since B^{−1/2}(A − B)B^{−1/2} is symmetric and positive semi-definite, this forces A − B = 0, that is, A = B.

(Continuity). To show that ‖An − B‖∞ → 0 if and only if |An| → |B| as n → ∞, under the assumption that An ⪰ B for all n, first note that ‖An − B‖∞ converging to zero as n → ∞ implies that |An| − |B| → 0. This is because the determinant is a continuous function of the elements of the matrix. So it is now enough to prove the opposite direction. For this, define λ_{in}, 1 ≤ i ≤ p, as the eigenvalues of B^{−1/2}(An − B)B^{−1/2}. Then by the equality above, we get that as n → ∞,

∏_{i=1}^p (1 + λ_{in}) → 1.

This ensures that λ_{in} → 0 as n → ∞ for all 1 ≤ i ≤ p. Therefore ‖An − B‖∞ converges to zero as n → ∞.
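A quick numerical sanity check of the consistency property (ours, not part of the paper): if A − B is positive semi-definite, then det(A) ≥ det(B).

```python
import numpy as np

rng = np.random.default_rng(2)
p = 4
for _ in range(1000):
    M = rng.normal(size=(p, p))
    B = M @ M.T + np.eye(p)          # a positive definite B
    N = rng.normal(size=(p, p - 1))
    D = N @ N.T                      # a positive semi-definite increment A - B
    assert np.linalg.det(B + D) >= np.linalg.det(B) - 1e-9
print("det(A) >= det(B) held in all 1000 trials")
```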

References

Ferguson, T. (1996). A Course in Large Sample Theory. Chapman & Hall Texts in Statistical Science Series. Taylor & Francis.

van der Vaart, A. W. (1998). Asymptotic statistics. Cambridge University Press, Cambridge.


Table 1: Bias of the minimum bridge divergence estimators for ε = 0 in the exponential scale family case. The mean squared errors are given in the parentheses.

α \ λ: 1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0
α = 0.0: -0.0286 (0.9963), -0.0286 (0.9963), -0.0286 (0.9963), -0.0286 (0.9963), -0.0286 (0.9963), -0.0286 (0.9963), -0.0286 (0.9963), -0.0286 (0.9963), -0.0286 (0.9963), -0.0286 (0.9963), -0.0286 (0.9963)
α = 0.2: -0.0132 (1.0915), -0.0132 (1.0918), -0.0132 (1.0922), -0.0132 (1.0925), -0.0132 (1.0929), -0.0132 (1.0933), -0.0132 (1.0937), -0.0132 (1.0942), -0.0132 (1.0946), -0.0131 (1.0951), -0.0131 (1.0956)
α = 0.4: -0.0050 (1.2918), -0.0051 (1.2950), -0.0051 (1.2984), -0.0052 (1.3021), -0.0052 (1.3063), -0.0052 (1.3108), -0.0052 (1.3158), -0.0053 (1.3214), -0.0052 (1.3277), -0.0052 (1.3348), -0.0051 (1.3428)
α = 0.6: 0.0028 (1.5066), 0.0026 (1.5155), 0.0023 (1.5257), 0.0020 (1.5372), 0.0017 (1.5505), 0.0014 (1.5660), 0.0010 (1.5842), 0.0008 (1.6060), 0.0006 (1.6324), 0.0005 (1.6651), 0.0007 (1.7068)
α = 0.8: 0.0113 (1.6968), 0.0107 (1.7123), 0.0101 (1.7304), 0.0094 (1.7517), 0.0086 (1.7773), 0.0077 (1.8085), 0.0068 (1.8473), 0.0057 (1.8970), 0.0047 (1.9627), 0.0041 (2.0535), 0.0047 (2.1873)
α = 1.0: 0.0198 (1.8563), 0.0190 (1.8770), 0.0180 (1.9016), 0.0168 (1.9315), 0.0155 (1.9686), 0.0138 (2.0156), 0.0119 (2.0771), 0.0095 (2.1609), 0.0068 (2.2813), 0.0044 (2.4684), 0.0056 (2.7981)


Table 2: Bias of the minimum bridge divergence estimators for ε = 0.05 in the exponential scale family case. The mean squared errors are given in the parentheses.

α \ λ: 1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0
α = 0.0: 2.5006 (8.3365), 2.5006 (8.3365), 2.5006 (8.3365), 2.5006 (8.3365), 2.5006 (8.3365), 2.5006 (8.3365), 2.5006 (8.3365), 2.5006 (8.3365), 2.5006 (8.3365), 2.5006 (8.3365), 2.5006 (8.3365)
α = 0.2: 1.5970 (4.5538), 1.5954 (4.5490), 1.5937 (4.5439), 1.5920 (4.5385), 1.5901 (4.5329), 1.5882 (4.5270), 1.5861 (4.5208), 1.5840 (4.5142), 1.5817 (4.5072), 1.5793 (4.4999), 1.5768 (4.4921)
α = 0.4: 0.9389 (2.8677), 0.9306 (2.8564), 0.9215 (2.8441), 0.9117 (2.8307), 0.9008 (2.8162), 0.8888 (2.8002), 0.8756 (2.7826), 0.8608 (2.7632), 0.8442 (2.7417), 0.8256 (2.7177), 0.8044 (2.6907)
α = 0.6: 0.6599 (2.4616), 0.6459 (2.4543), 0.6301 (2.4465), 0.6122 (2.4382), 0.5918 (2.4294), 0.5682 (2.4202), 0.5406 (2.4105), 0.5080 (2.4007), 0.4690 (2.3911), 0.4215 (2.3826), 0.3626 (2.3766)
α = 0.8: 0.5923 (2.4580), 0.5769 (2.4572), 0.5590 (2.4569), 0.5382 (2.4574), 0.5134 (2.4595), 0.4836 (2.4637), 0.4471 (2.4715), 0.4015 (2.4853), 0.3428 (2.5093), 0.2652 (2.5517), 0.1585 (2.6287)
α = 1.0: 0.6000 (2.5751), 0.5857 (2.5800), 0.5689 (2.5865), 0.5487 (2.5955), 0.5240 (2.6080), 0.4933 (2.6261), 0.4540 (2.6532), 0.4021 (2.6956), 0.3307 (2.7659), 0.2272 (2.8918), 0.0670 (3.1419)


Table 3: Bias of the minimum bridge divergence estimators for ε = 0.2 in the exponential scale family case. The mean squared errors are given in the parentheses.

α \ λ: 1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0
α = 0.0: 9.9310 (103.1433), 9.9310 (103.1433), 9.9310 (103.1433), 9.9310 (103.1433), 9.9310 (103.1433), 9.9310 (103.1433), 9.9310 (103.1433), 9.9310 (103.1433), 9.9310 (103.1433), 9.9310 (103.1433), 9.9310 (103.1433)
α = 0.2: 8.5802 (80.1781), 8.5776 (80.1383), 8.5748 (80.0956), 8.5719 (80.0508), 8.5688 (80.0029), 8.5655 (79.9513), 8.5619 (79.8964), 8.5581 (79.8376), 8.5541 (79.7743), 8.5496 (79.7059), 8.5449 (79.6319)
α = 0.4: 6.3748 (49.0325), 6.3504 (48.7654), 6.3232 (48.4686), 6.2927 (48.1369), 6.2583 (47.7637), 6.2192 (47.3413), 6.1744 (46.8588), 6.1225 (46.3030), 6.0618 (45.6555), 5.9898 (44.8928), 5.9034 (43.9815)
α = 0.6: 4.4240 (27.5293), 4.3646 (27.0756), 4.2958 (26.5580), 4.2153 (25.9626), 4.1202 (25.2711), 4.0060 (24.4596), 3.8666 (23.4958), 3.6934 (22.3361), 3.4734 (20.9208), 3.1869 (19.1671), 2.8039 (16.9600)
α = 0.8: 3.5066 (18.6063), 3.4379 (18.1708), 3.3566 (17.6661), 3.2591 (17.0746), 3.1402 (16.3735), 2.9922 (15.5316), 2.8036 (14.5061), 2.5565 (13.2397), 2.2216 (11.6579), 1.7511 (9.6848), 1.0688 (7.3460)
α = 1.0: 3.2200 (15.6965), 3.1591 (15.3395), 3.0861 (14.9194), 2.9967 (14.4182), 2.8852 (13.8115), 2.7422 (13.0645), 2.5529 (12.1278), 2.2921 (10.9318), 1.9147 (9.3892), 1.3370 (7.4499), 0.4157 (5.4069)


Table 4: Bias of the minimum bridge divergence estimators for ε = 0 in the normal scale family case. The mean squared errors are given in the parentheses.

α \ λ: 1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0
α = 0.0: -0.0191 (2.1467), -0.0191 (2.1467), -0.0191 (2.1467), -0.0191 (2.1467), -0.0191 (2.1467), -0.0191 (2.1467), -0.0191 (2.1467), -0.0191 (2.1467), -0.0191 (2.1467), -0.0191 (2.1467), -0.0191 (2.1467)
α = 0.2: -0.0263 (0.5846), -0.0263 (0.5846), -0.0263 (0.5847), -0.0263 (0.5848), -0.0263 (0.5849), -0.0263 (0.5850), -0.0263 (0.5851), -0.0263 (0.5852), -0.0262 (0.5853), -0.0262 (0.5854), -0.0262 (0.5855)
α = 0.4: -0.0266 (0.6812), -0.0267 (0.6818), -0.0267 (0.6826), -0.0268 (0.6834), -0.0269 (0.6843), -0.0269 (0.6853), -0.0270 (0.6864), -0.0271 (0.6877), -0.0272 (0.6893), -0.0273 (0.6910), -0.0274 (0.6931)
α = 0.6: -0.0294 (0.7944), -0.0296 (0.7964), -0.0298 (0.7987), -0.0301 (0.8013), -0.0304 (0.8043), -0.0307 (0.8080), -0.0311 (0.8124), -0.0315 (0.8178), -0.0321 (0.8245), -0.0327 (0.8332), -0.0335 (0.8449)
α = 0.8: -0.0309 (0.9041), -0.0313 (0.9078), -0.0317 (0.9121), -0.0323 (0.9172), -0.0329 (0.9234), -0.0337 (0.9312), -0.0346 (0.9410), -0.0359 (0.9540), -0.0374 (0.9718), -0.0395 (0.9978), -0.0423 (1.0393)
α = 1.0: -0.0302 (0.9987), -0.0307 (1.0037), -0.0313 (1.0097), -0.0321 (1.0170), -0.0331 (1.0262), -0.0343 (1.0380), -0.0358 (1.0538), -0.0379 (1.0760), -0.0410 (1.1093), -0.0455 (1.1647), -0.0525 (1.2741)


Table 5: Bias of the minimum bridge divergence estimators for ε = 0.05 in the normal scale family case. The mean squared errors are given in the parentheses.

α \ λ: 1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0
α = 0.0: -0.5136 (2.3361), -0.5136 (2.3361), -0.5136 (2.3361), -0.5136 (2.3361), -0.5136 (2.3361), -0.5136 (2.3361), -0.5136 (2.3361), -0.5136 (2.3361), -0.5136 (2.3361), -0.5136 (2.3361), -0.5136 (2.3361)
α = 0.2: -0.3604 (0.7260), -0.3605 (0.7260), -0.3606 (0.7262), -0.3606 (0.7264), -0.3607 (0.7265), -0.3608 (0.7267), -0.3609 (0.7269), -0.3610 (0.7270), -0.3611 (0.7272), -0.3612 (0.7275), -0.3613 (0.7277)
α = 0.4: -0.4476 (0.9104), -0.4481 (0.9118), -0.4486 (0.9132), -0.4493 (0.9149), -0.4500 (0.9167), -0.4508 (0.9187), -0.4516 (0.9209), -0.4526 (0.9235), -0.4537 (0.9264), -0.4550 (0.9297), -0.4564 (0.9336)
α = 0.6: -0.5335 (1.1352), -0.5350 (1.1396), -0.5366 (1.1448), -0.5385 (1.1506), -0.5406 (1.1574), -0.5432 (1.1654), -0.5462 (1.1750), -0.5498 (1.1865), -0.5542 (1.2008), -0.5597 (1.2190), -0.5667 (1.2427)
α = 0.8: -0.6092 (1.3591), -0.6117 (1.3679), -0.6147 (1.3783), -0.6182 (1.3907), -0.6223 (1.4057), -0.6274 (1.4242), -0.6338 (1.4476), -0.6421 (1.4782), -0.6530 (1.5198), -0.6683 (1.5794), -0.6911 (1.6720)
α = 1.0: -0.6716 (1.5545), -0.6750 (1.5673), -0.6790 (1.5827), -0.6839 (1.6016), -0.6900 (1.6253), -0.6977 (1.6558), -0.7079 (1.6967), -0.7219 (1.7543), -0.7422 (1.8413), -0.7742 (1.9851), -0.8314 (2.2646)


Table 6: Bias of the minimum bridge divergence estimators for ε = 0.2 in the normal scale family case. The mean squared errors are given in the parentheses.

α \ λ: 1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0
α = 0.0: -1.9778 (5.7870), -1.9778 (5.7870), -1.9778 (5.7870), -1.9778 (5.7870), -1.9778 (5.7870), -1.9778 (5.7870), -1.9778 (5.7870), -1.9778 (5.7870), -1.9778 (5.7870), -1.9778 (5.7870), -1.9778 (5.7870)
α = 0.2: -2.7379 (18.1322), -2.7381 (18.1329), -2.7385 (18.1339), -2.7387 (18.1348), -2.7391 (18.1358), -2.7394 (18.1368), -2.7397 (18.1378), -2.7401 (18.1389), -2.7404 (18.1401), -2.7408 (18.1412), -2.7413 (18.1426)
α = 0.4: -2.3770 (11.0261), -2.3794 (11.0370), -2.3821 (11.0486), -2.3850 (11.0614), -2.3882 (11.0755), -2.3918 (11.0910), -2.3957 (11.1082), -2.4001 (11.1275), -2.4050 (11.1493), -2.4106 (11.1739), -2.4170 (11.2020)
α = 0.6: -2.2992 (6.8577), -2.3084 (6.9119), -2.3187 (6.9739), -2.3307 (7.0458), -2.3445 (7.1303), -2.3608 (7.2318), -2.3831 (7.4016), -2.4134 (7.6577), -2.4445 (7.8807), -2.4831 (8.1785), -2.5295 (8.5203)
α = 0.8: -2.6605 (8.6585), -2.6814 (8.8162), -2.7062 (9.0067), -2.7382 (9.2747), -2.7941 (9.8667), -2.8534 (10.4664), -2.9425 (11.4595), -3.0508 (12.6542), -3.2000 (14.3434), -3.4096 (16.7522), -3.6874 (19.9901)
α = 1.0: -2.9865 (10.7030), -3.0278 (11.1038), -3.0800 (11.6395), -3.1562 (12.5220), -3.2333 (13.3224), -3.3812 (15.1724), -3.5566 (17.2957), -3.8028 (20.3636), -4.1377 (24.4783), -4.6124 (30.3887), -5.3932 (40.3177)
