
Deviance Information Criterion for Model Selection: Justification and Variation∗

Yong Li, Renmin University of China

Jun Yu, Singapore Management University

Tao Zeng, Zhejiang University

November 23, 2017

Abstract

Deviance information criterion (DIC) has been extensively used for model selection based on the MCMC output. Although it is understood as a Bayesian version of AIC, a rigorous justification has not been provided in the literature. In this paper, we show that when the plug-in predictive distribution is used, DIC can have a rigorous decision-theoretic justification in a frequentist setup. Under a set of regularity conditions, we show that DIC asymptotically chooses a model that gives the smallest expected Kullback-Leibler divergence between the data generating process (DGP) and the plug-in predictive distribution. An alternative expression for DIC, based on the Bayesian predictive distribution, is proposed. The new DIC has a smaller penalty term than the original DIC and is very easy to compute from the MCMC output. It is invariant to reparameterization and yields a smaller expected loss than the original DIC asymptotically.

JEL classification: C11, C12, G12

Keywords: AIC; DIC; Bayesian Predictive Distribution; Plug-in Predictive Distribution; Loss Function; Model Comparison; Expected Loss

1 Introduction

A highly important statistical inference problem often faced by model builders and empirical researchers is model selection (Phillips, 1995, 1996). Many penalty-based information criteria have been proposed to select from candidate models. In the frequentist statistical framework, the most popular information criterion is AIC. Arguably one of the most important developments in the Bayesian literature in recent years is the deviance information criterion (DIC) of Spiegelhalter et al. (2002) for model selection.1 DIC is understood as a Bayesian version of AIC. Like AIC, it trades off a measure of model adequacy against a measure of complexity and is concerned with how replicate data predict the observed data. Unlike AIC, DIC takes prior information into account.

∗We wish to thank Eric Renault, Peter Phillips and David Spiegelhalter for their helpful comments. Yong Li, Hanqing Advanced Institute of Economics and Finance, Renmin University of China, Beijing, China 100872. Jun Yu, School of Economics and Lee Kong Chian School of Business, Singapore Management University, 90 Stamford Rd, Singapore 178903. Email for Jun Yu: [email protected]. URL: http://www.mysmu.edu/faculty/yujun/. Tao Zeng, School of Economics and Academy of Financial Research, Zhejiang University, Zhejiang, China 310027. Li gratefully acknowledges the financial support of the Chinese Natural Science Fund (No. 71271221), Program for New Century Excellent Talents in University.

DIC is constructed based on the posterior distribution of the log-likelihood or the deviance, and has several desirable features. Firstly, DIC is easy to calculate when the likelihood function is available in closed form and the posterior distributions of the models are obtained by Markov chain Monte Carlo (MCMC) simulation. Secondly, it is applicable to a wide range of statistical models. Thirdly, unlike the Bayes factor (BF), it is not subject to the Jeffreys-Lindley paradox and can be calculated when non-informative or improper priors are used.

However, as acknowledged in Spiegelhalter et al. (2002, 2014), the decision-theoretic justification of DIC in the literature is not rigorous. In fact, in the heuristic justification given by Spiegelhalter et al. (2002), the frequentist framework and the Bayesian framework were mixed together. The first contribution of the present paper is to provide a rigorous decision-theoretic justification of DIC purely in a frequentist setup. It can be shown that DIC is an asymptotically unbiased estimator of the expected Kullback-Leibler (KL) divergence between the data generating process (DGP) and the plug-in predictive distribution, when the posterior mean is used as the plug-in estimator. This justification is similar to how AIC has been justified.

Moreover, DIC is not invariant to reparameterization due to the use of the plug-in predictive distribution in defining the loss function. In the Bayesian framework, an alternative predictive distribution is the Bayesian predictive distribution. Naturally, the KL divergence between the DGP and the Bayesian predictive distribution can be used as the loss function, which can in turn be used to derive a new information criterion for model comparison. Unlike the plug-in predictive distribution, the Bayesian predictive distribution is invariant to reparameterization. The second contribution of the present paper is to develop a new information criterion that is an asymptotically unbiased estimator of the new expected KL divergence under a general framework. Compared with the information criterion developed in Ando and Tsay (2010) in the independent and identically distributed (iid) environment, our information criterion does not need the iid assumption and has a simpler expression. Moreover, it is easier to compare our information criterion with other information criteria.

Our theoretical results show that asymptotically the expected loss implied by the Bayesian predictive distribution is smaller than that implied by the plug-in predictive distribution. Hence, from the predictive viewpoint, the Bayesian predictive distribution is a better predictive distribution. This represents another important advantage of using the Bayesian predictive distribution and hence our new information criterion.

1According to Spiegelhalter et al. (2014), Spiegelhalter et al. (2002) was the third most cited paper in international mathematical sciences between 1998 and 2008. Up to October 2017, it has received 4898 citations on the Web of Knowledge and over 8623 on Google Scholar.

The paper is organized as follows. Section 2 explains how to treat model selection as a decision problem and gives a brief review of the decision-theoretic justification of AIC, which helps explain our theory for DIC. Section 3 provides a rigorous decision-theoretic justification of DIC of Spiegelhalter et al. (2002) under a set of regularity conditions and shows why DIC can be interpreted as the Bayesian version of AIC. In Section 4, based on the Bayesian predictive distribution, a new information criterion is proposed. Its theoretical properties are established and comparisons with other information criteria are also made in this section. Section 5 concludes the paper. The Appendix collects the proofs of the theoretical results in the paper. To save space, the proof of Lemma 3.2 is provided in a technical supplement to the present article (Li et al., 2017).

2 Decision-theoretic Justification of AIC

There are essentially two strands of literature on model selection.2 The first strand aims to answer the following question: which model best explains the observed data? The BF (Kass and Raftery, 1995) and its variations belong to this strand. They compare models by examining "posterior probabilities" given the observed data and search for the "true" model. BIC is a large-sample approximation to the BF although it is based on the maximum likelihood estimator. The second strand aims to answer the following question: which model gives the best predictions of future observations generated by the same mechanism that gives rise to the observed data? Clearly this is a utility-based approach where the utility is set to be the prediction. Ideally, we would like to choose the model that gives the best overall predictions of future values. Some cross-validation-based criteria have been developed in which the original sample is split into a training set and a validation set (Vehtari and Lampinen, 2002; Zhang and Yang, 2015). Unfortunately, different ways of splitting the sample often lead to different outcomes. Alternatively, based on hypothetical replicate data generated by the same mechanism that gives rise to the observed data, some predictive information criteria have been proposed for model selection. They minimize a loss function associated with the predictive decisions. AIC and DIC are two well-known criteria in this framework. After the decision is made about which model should be used for prediction, a unique prediction action for future observations can be obtained to fulfill the original goal. The latter approach is what we follow in the present paper. Given the relevance of prediction in economics, not surprisingly, such an approach to model selection has been widely used in applications.

2For more information about the literature, see Vehtari and Ojanen (2012) and Burnham and Anderson (2002).


2.1 Predictive model selection as a decision problem

Assume that the probabilistic behavior of the observed data, $y = (y_1, y_2, \cdots, y_n)' \in \mathbf{Y}$, is described by a set of probabilistic models $\{M_k\}_{k=1}^{K} := \{p(y|\theta_k, M_k)\}_{k=1}^{K}$, where the parameter $\theta_k$ is the set of parameters in candidate model $M_k$ and $p(\cdot)$ denotes a probability density function (pdf). Formally, the model selection problem can be taken as a decision problem of selecting a model among $\{M_k\}_{k=1}^{K}$, where the action space has $K$ elements, namely $\{d_k\}_{k=1}^{K}$, with $d_k$ meaning that $M_k$ is selected.

For this decision problem, a loss function $\ell(y, d_k)$, which measures the loss of decision $d_k$ as a function of $y$, must be specified. Given the loss function, the expected loss (or risk) can be defined as (Berger, 1985)
$$\mathrm{Risk}(d_k) = E_y\left[\ell(y, d_k)\right] = \int \ell(y, d_k)\, g(y)\, dy,$$
where $g(y)$ is the pdf of the DGP of $y$. Hence, the model selection problem is equivalent to optimizing the statistical decision,
$$k^{*} = \arg\min_{k} \mathrm{Risk}(d_k).$$
Based on the set of candidate models $\{M_k\}_{k=1}^{K}$, the model $M_{k^{*}}$ associated with the decision $d_{k^{*}}$ is selected.

Let $y_{rep}$ be the replicate data independently generated by the same mechanism that gives rise to the observed data $y$. Assume the sample size of $y_{rep}$ is the same as that of $y$. Consider the predictive density of this replicate experiment for a candidate model $M_k$. The plug-in predictive density for $M_k$ can be expressed as $p(y_{rep}|\hat\theta_n(y), M_k)$, where $\hat\theta_n(y)$ is the quasi maximum likelihood (QML) estimate of $\theta_k$ based on $y$ (when there is no confusion we simply write $\hat\theta_n(y)$ as $\hat\theta_n$ or even $\hat\theta$), defined by
$$\hat\theta_n(y) = \arg\max_{\theta_k \in \Theta} \ln p(y|\theta_k, M_k),$$
which is the global maximum interior to $\Theta$.

The quantity that has been used to measure the quality of the candidate model, in terms of its ability to make predictions, is the KL divergence between $g(y_{rep})$ and $p(y_{rep}|\hat\theta_n(y), M_k)$ multiplied by 2,
$$2 \times \mathrm{KL}\left[g(y_{rep}), p(y_{rep}|\hat\theta_n(y), M_k)\right] = 2 E_{y_{rep}}\left[\ln \frac{g(y_{rep})}{p(y_{rep}|\hat\theta_n(y), M_k)}\right] = 2\int \ln \frac{g(y_{rep})}{p(y_{rep}|\hat\theta_n(y), M_k)}\, g(y_{rep})\, dy_{rep}.$$

Naturally, the loss function associated with decision $d_k$ is
$$\ell(y, d_k) = 2 \times \mathrm{KL}\left[g(y_{rep}), p(y_{rep}|\hat\theta_n(y), M_k)\right].$$
As a result, the model selection problem is
$$k^{*} = \arg\min_k \mathrm{Risk}(d_k) = \arg\min_k E_y\left[\ell(y, d_k)\right] = \arg\min_k 2 E_y E_{y_{rep}}\left[\ln \frac{g(y_{rep})}{p(y_{rep}|\hat\theta_n(y), M_k)}\right] = \arg\min_k \left\{E_y E_{y_{rep}}\left[2\ln g(y_{rep})\right] + E_y E_{y_{rep}}\left[-2\ln p(y_{rep}|\hat\theta_n(y), M_k)\right]\right\}.$$
Since $g(y_{rep})$ is the DGP, $E_{y_{rep}}\left[2\ln g(y_{rep})\right]$ is the same across all candidate models and, hence, is dropped from the above expression. Consequently,
$$k^{*} = \arg\min_k \mathrm{Risk}(d_k) = \arg\min_k E_y E_{y_{rep}}\left[-2\ln p(y_{rep}|\hat\theta_n(y), M_k)\right].$$
The smaller $\mathrm{Risk}(d_k)$ is, the better the candidate model performs when $p(y_{rep}|\hat\theta_n(y), M_k)$ is used to predict $g(y_{rep})$. The optimal decision makes it necessary to evaluate the risk. AIC is an asymptotically unbiased estimator of $E_y E_{y_{rep}}\left[-2\ln p(y_{rep}|\hat\theta_n(y), M_k)\right]$.

2.2 AIC for predictive model selection

To show the asymptotic unbiasedness of AIC, let us first fix some notation. When there is no confusion, we simply write the candidate model $p(y|\theta_k, M_k)$ as $p(y|\theta)$, where $\theta = (\theta_1, \ldots, \theta_P)' \in \Theta \subseteq \mathbb{R}^{P}$. Under the iid assumption, let $y = (y_1, y_2, \cdots, y_n)'$ denote the observed data, $y_{rep} = (y_{1,rep}, \cdots, y_{n,rep})'$ denote the replicate data, and $n$ be the sample size in both sets of data. Although $-2\ln p(y|\hat\theta_n(y))$ is a natural estimate of $E_{y_{rep}}\left[-2\ln p(y_{rep}|\hat\theta_n(y))\right]$, it is asymptotically biased. Let
$$c(y) = E_{y_{rep}}\left[-2\ln p(y_{rep}|\hat\theta_n(y))\right] - \left[-2\ln p(y|\hat\theta_n(y))\right] \tag{1}$$
be the "optimism". Under a set of regularity conditions, one can show that $E_y\left[c(y)\right] \to 2P$ as $n \to \infty$. Hence, if we let $\mathrm{AIC} = -2\ln p(y|\hat\theta_n(y)) + 2P$, then, as $n \to \infty$,
$$E_y(\mathrm{AIC}) - E_y E_{y_{rep}}\left[-2\ln p(y_{rep}|\hat\theta_n(y))\right] \to 0.$$

To see why a penalty term, $2P$, is needed in AIC, let
$$\theta_n^p := \arg\min_{\theta \in \Theta} \frac{1}{n}\mathrm{KL}\left[g(y), p(y|\theta)\right] \tag{2}$$
be the pseudo-true parameter value; let $\hat\theta_n(y_{rep})$ be the QML estimate of $\theta$ obtained from $y_{rep}$; and let $p(y|\theta)$ be a "good approximation" to the DGP, a concept that will be defined formally in the next section. Note that
$$
\begin{aligned}
E_y E_{y_{rep}}\left[-2\ln p(y_{rep}|\hat\theta_n(y))\right]
&= \left\{E_y E_{y_{rep}}\left[-2\ln p(y_{rep}|\hat\theta_n(y_{rep}))\right]\right\} && (T_1)\\
&\quad + \left\{E_y E_{y_{rep}}\left[-2\ln p(y_{rep}|\theta_n^p)\right] - E_y E_{y_{rep}}\left[-2\ln p(y_{rep}|\hat\theta_n(y_{rep}))\right]\right\} && (T_2)\\
&\quad + \left\{E_y E_{y_{rep}}\left[-2\ln p(y_{rep}|\hat\theta_n(y))\right] - E_y E_{y_{rep}}\left[-2\ln p(y_{rep}|\theta_n^p)\right]\right\}. && (T_3)
\end{aligned}
$$

Clearly, the term $T_1$ is the same as $E_y\left[-2\ln p(y|\hat\theta_n(y))\right]$. The term $T_2$ is the expectation of the likelihood ratio statistic based on the replicate data. Under a set of regularity conditions that ensure $\sqrt{n}$-consistency and asymptotic normality of the QML estimate, we have $T_2 = T_3 + o(1)$. To approximate the term $T_3$, if $\hat\theta_n(y)$ is a consistent estimate of $\theta_n^p$, we have
$$T_3 = E_y\left\{-2 E_{y_{rep}}\left[\frac{\partial \ln p(y_{rep}|\theta_n^p)}{\partial \theta'}\left(\hat\theta_n(y) - \theta_n^p\right)\right]\right\} + E_y\left\{E_{y_{rep}}\left[-\left(\hat\theta_n(y) - \theta_n^p\right)'\frac{\partial^2 \ln p(y_{rep}|\theta_n^p)}{\partial \theta \partial \theta'}\left(\hat\theta_n(y) - \theta_n^p\right)\right]\right\} + o(1).$$
By the definition of $\theta_n^p$, we have $E_{y_{rep}}\left[\partial \ln p(y_{rep}|\theta_n^p)/\partial \theta\right] = 0$, implying that
$$E_y\left\{-2 E_{y_{rep}}\left[\frac{\partial \ln p(y_{rep}|\theta_n^p)}{\partial \theta'}\left(\hat\theta_n(y) - \theta_n^p\right)\right]\right\} = -2 E_{y_{rep}}\left[\frac{\partial \ln p(y_{rep}|\theta_n^p)}{\partial \theta'}\right] E_y\left[\hat\theta_n(y) - \theta_n^p\right] = 0.$$
Consequently, under the same regularity conditions used for approximating $T_2$, we have
$$T_3 = \mathrm{tr}\left\{E_{y_{rep}}\left[\frac{\partial^2 \ln p(y_{rep}|\theta_n^p)}{\partial \theta \partial \theta'}\right] E_y\left[-\left(\hat\theta_n(y) - \theta_n^p\right)\left(\hat\theta_n(y) - \theta_n^p\right)'\right]\right\} = P + o(1),$$
where $\mathrm{tr}$ denotes the trace of a matrix. Following Burnham and Anderson (2002), we have
$$E_y E_{y_{rep}}\left[-2\ln p(y_{rep}|\hat\theta_n(y))\right] = E_y\left[-2\ln p(y|\hat\theta_n(y)) + 2P\right] + o(1) = E_y(\mathrm{AIC}) + o(1),$$
that is, AIC is an asymptotically unbiased estimator of $E_y E_{y_{rep}}\left[-2\ln p(y_{rep}|\hat\theta_n(y))\right]$. From the decision viewpoint, among the candidate models, AIC selects the model which minimizes the expected loss asymptotically when the plug-in predictive distribution is used for making predictions.
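For concreteness, the following is a minimal sketch of the resulting decision rule: compute AIC for each candidate model from its maximized log-likelihood and parameter count, and select the minimizer. The model names and numerical values are purely illustrative assumptions, not taken from the paper.

```python
def aic(loglik_at_mle: float, num_params: int) -> float:
    """AIC = -2 ln p(y | theta_hat) + 2P."""
    return -2.0 * loglik_at_mle + 2.0 * num_params

# Hypothetical candidates: (maximized log-likelihood, number of free parameters P).
candidates = {"M1": (-512.3, 3), "M2": (-508.9, 5), "M3": (-509.1, 4)}

scores = {name: aic(ll, p) for name, (ll, p) in candidates.items()}
best = min(scores, key=scores.get)  # decision d_{k*}: smallest estimated expected loss
print(scores, "selected:", best)
```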

It is clear that the decision-theoretic justification of AIC rests on a frequentist framework. Specifically, it requires a careful choice of the KL divergence function, the use of QML estimation, and a set of regularity conditions that ensure the $\sqrt{n}$-consistency and the asymptotic normality of the QML estimates. The penalty term in AIC arises from two sources. First, the pseudo-true value has to be estimated. Second, the estimate obtained from the observed data is not the same as that obtained from the replicate data. Moreover, as pointed out in Burnham and Anderson (2002), the justification of AIC requires the candidate model to be a "good approximation" to the DGP so that the trace in $T_3$ is asymptotically $P$, although a formal definition of "good approximation" was not provided.


3 Decision-theoretic Justification of DIC

3.1 DIC

Spiegelhalter et al. (2002) proposed DIC for Bayesian model selection. The criterion is based on the deviance
$$D(\theta) = -2\ln p(y|\theta),$$
and takes the form
$$\mathrm{DIC} = \overline{D(\theta)} + P_D. \tag{3}$$
The first term, interpreted as a Bayesian measure of model fit, is defined as the posterior expectation of the deviance, that is,
$$\overline{D(\theta)} = E_{\theta|y}\left[D(\theta)\right] = E_{\theta|y}\left[-2\ln p(y|\theta)\right].$$
The better the model fits the data, the larger the log-likelihood value and, hence, the smaller the value of $\overline{D(\theta)}$. The second term, used to measure the model complexity and also known as the "effective number of parameters", is defined as the difference between the posterior mean of the deviance and the deviance evaluated at the posterior mean of the parameters:
$$P_D = \overline{D(\theta)} - D\left(\bar\theta_n(y)\right) = -2\int \left[\ln p(y|\theta) - \ln p\left(y|\bar\theta_n(y)\right)\right] p(\theta|y)\, d\theta, \tag{4}$$
where $\bar\theta_n(y)$ is the Bayesian estimator based on $y$, more precisely the posterior mean of $\theta$, $\int \theta\, p(\theta|y)\, d\theta$. When there is no confusion, we simply write $\bar\theta_n(y)$ as $\bar\theta_n$ or even $\bar\theta$.

DIC can be rewritten in two equivalent forms:
$$\mathrm{DIC} = D\left(\bar\theta_n\right) + 2P_D, \tag{5}$$
and
$$\mathrm{DIC} = 2\overline{D(\theta)} - D\left(\bar\theta_n\right) = -4E_{\theta|y}\left[\ln p(y|\theta)\right] + 2\ln p\left(y|\bar\theta_n\right). \tag{6}$$

DIC defined in Equation (5) bears similarity to AIC of Akaike (1973) and can be interpreted as a classical "plug-in" measure of fit plus a measure of complexity (i.e., $2P_D$, also known as the penalty term). In Equation (3) the Bayesian measure, $\overline{D(\theta)}$, is the same as $D\left(\bar\theta_n\right) + P_D$, which already includes a penalty term for model complexity and, thus, could be better thought of as a measure of model adequacy rather than of pure goodness of fit.
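When posterior draws are available, the quantities in (3)-(6) can be estimated directly from the MCMC output. The sketch below is a minimal illustration assuming a user-supplied log-likelihood function and an array of posterior draws; the interface and the toy normal example are illustrative, not taken from the paper.

```python
import numpy as np

def dic_from_draws(log_lik, y, theta_draws):
    """Estimate D_bar, P_D and DIC from posterior draws.

    log_lik(y, theta) should return ln p(y | theta); theta_draws holds one
    posterior draw per row. This interface is illustrative, not from the paper.
    """
    ll = np.array([log_lik(y, th) for th in theta_draws])  # ln p(y | theta^(j))
    d_bar = -2.0 * ll.mean()                               # posterior mean of the deviance
    theta_bar = theta_draws.mean(axis=0)                   # posterior mean of theta
    p_d = d_bar - (-2.0 * log_lik(y, theta_bar))           # effective number of parameters
    return d_bar + p_d, p_d                                # DIC = D_bar + P_D, and P_D

# Toy example: normal data with unknown mean and known unit variance (illustrative).
rng = np.random.default_rng(0)
y = rng.normal(loc=1.0, size=200)
draws = rng.normal(loc=y.mean(), scale=1.0 / np.sqrt(len(y)), size=(5000, 1))
loglik = lambda data, th: -0.5 * np.sum((data - th[0]) ** 2) - 0.5 * len(data) * np.log(2 * np.pi)
print(dic_from_draws(loglik, y, draws))
```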

However, as stated explicitly in Spiegelhalter et al. (2002) (Section 7.3 on page 603 and the first paragraph on page 605), the justification of DIC is informal and heuristic. In fact, it mixes a frequentist setup and a Bayesian setup. In this section, we provide a rigorous decision-theoretic justification of DIC purely in a frequentist setup. Specifically, we show that, when a proper loss function is selected, DIC is an asymptotically unbiased estimator of the expected loss.


3.2 Decision-theoretic justification of DIC

When developing DIC, Spiegelhalter et al. (2002) assume that there is a true distribution for $y$ (their Section 2.2), a pseudo-true parameter value $\theta_n^p$ for a candidate model (also their Section 2.2), and an independent replicate data set $y_{rep}$ (their Section 7.1). All these assumptions are identical to what was done when justifying AIC. Furthermore, as explained in Section 7.1 of Spiegelhalter et al. (2002), the goal of model selection is to estimate the expected loss, where the expectation is taken with respect to $y_{rep}|\theta_n^p$. The assumptions and the goal indicate that a frequentist framework was considered. On the other hand, since the "optimism" associated with the natural estimator depends on the pseudo-true parameter value $\theta_n^p$, instead of replacing it with a frequentist estimator and then deriving the asymptotic property of the "optimism", Spiegelhalter et al. (2002) replace $\theta_n^p$ with a random quantity $\theta$ and then calculate the posterior expectation of the "optimism"; see their Sections 7.1 and 7.3. As a result, a Bayesian framework is adopted when studying the behavior of the "optimism".

Spiegelhalter et al. (2002) did not explicitly specify the KL divergence function. However, from Equation (33) on page 602, the loss function defined in the first paragraph on page 603, and Equation (40) on page 603 of their paper, one may deduce that the following KL divergence was used:3
$$\mathrm{KL}\left[p(y_{rep}|\theta), p\left(y_{rep}|\bar\theta_n(y)\right)\right] = E_{y_{rep}|\theta}\left[\ln \frac{p(y_{rep}|\theta)}{p\left(y_{rep}|\bar\theta_n(y)\right)}\right]. \tag{7}$$
Hence,
$$2 \times \mathrm{KL}\left[p(y_{rep}|\theta), p\left(y_{rep}|\bar\theta_n(y)\right)\right] = 2E_{y_{rep}|\theta}\left[\ln p(y_{rep}|\theta)\right] + E_{y_{rep}|\theta}\left[-2\ln p\left(y_{rep}|\bar\theta_n(y)\right)\right]. \tag{8}$$
With this KL function, unfortunately, the first term on the right hand side of Equation (8) is no longer constant across candidate models. This is because, when the pseudo-true value is replaced by a random quantity $\theta$, the first term on the right hand side of Equation (8) becomes model dependent. Clearly, another KL divergence function is needed.

As in AIC, we first consider the plug-in predictive distribution $p\left(y_{rep}|\bar\theta_n(y)\right)$ in the following KL divergence:
$$\mathrm{KL}\left[g(y_{rep}), p\left(y_{rep}|\bar\theta_n(y)\right)\right] = E_{y_{rep}}\left[\ln \frac{g(y_{rep})}{p\left(y_{rep}|\bar\theta_n(y)\right)}\right].$$
The corresponding expected loss of a statistical decision $d_k$ for model selection is
$$\mathrm{Risk}(d_k) = E_y E_{y_{rep}}\left[2\ln \frac{g(y_{rep})}{p\left(y_{rep}|\bar\theta_n(y), M_k\right)}\right] = E_y E_{y_{rep}}\left[2\ln g(y_{rep})\right] + E_y E_{y_{rep}}\left[-2\ln p\left(y_{rep}|\bar\theta_n(y), M_k\right)\right].$$
Since $E_y E_{y_{rep}}\left[2\ln g(y_{rep})\right]$ is the same across candidate models, minimizing the expected loss $\mathrm{Risk}(d_k)$ is equivalent to minimizing
$$E_y E_{y_{rep}}\left[-2\ln p\left(y_{rep}|\bar\theta_n(y), M_k\right)\right].$$
Denote the selected model by $M_{k^{*}}$. Then $p\left(y_{rep}|\bar\theta_n(y), M_{k^{*}}\right)$ is used to generate future observations, where $\bar\theta_n(y)$ is the posterior mean of $\theta$ in $M_{k^{*}}$.

3In Equation (33) of Spiegelhalter et al. (2002), the expectation is taken with respect to $y_{rep}|\theta^{t}$, which corresponds to the candidate model. In AIC, the expectation is taken with respect to $y_{rep}$, which corresponds to the DGP.

We are now in a position to provide a rigorous decision-theoretic justification of DIC in a frequentist framework based on a set of regularity conditions. Let $y^t := (y_0, y_1, \ldots, y_t)$ for any $0 \le t \le n$ and let $l_t(y^t, \theta) := \ln p(y^t|\theta) - \ln p(y^{t-1}|\theta)$ be the conditional log-likelihood of the $t$th observation for any $1 \le t \le n$. When there is no confusion, we abbreviate $l_t(y^t, \theta)$ as $l_t(\theta)$, so that the log-likelihood function $\ln p(y|\theta)$ is $\sum_{t=1}^{n} l_t(\theta)$.4 Let $\nabla^j l_t(\theta)$ denote the $j$th derivative of $l_t(\theta)$, with $\nabla^j l_t(\theta) := l_t(\theta)$ when $j = 0$. Furthermore, define
$$s(y^t, \theta) := \frac{\partial \ln p(y^t|\theta)}{\partial \theta} = \sum_{i=1}^{t} \nabla l_i(\theta), \quad h(y^t, \theta) := \frac{\partial^2 \ln p(y^t|\theta)}{\partial \theta \partial \theta'} = \sum_{i=1}^{t} \nabla^2 l_i(\theta),$$
$$s_t(\theta) := \nabla l_t(\theta) = s(y^t, \theta) - s(y^{t-1}, \theta), \quad h_t(\theta) := \nabla^2 l_t(\theta) = h(y^t, \theta) - h(y^{t-1}, \theta),$$
$$B_n(\theta) := \mathrm{Var}\left[\frac{1}{\sqrt{n}}\sum_{t=1}^{n} \nabla l_t(\theta)\right], \quad H_n(\theta) := \frac{1}{n}\sum_{t=1}^{n} h_t(\theta),$$
$$J_n(\theta) := \frac{1}{n}\sum_{t=1}^{n}\left[s_t(\theta) - \bar{s}(\theta)\right]\left[s_t(\theta) - \bar{s}(\theta)\right]', \quad \bar{s}(\theta) := \frac{1}{n}\sum_{t=1}^{n} s_t(\theta),$$
$$L_n(\theta) := \ln p(\theta|y), \quad L_n^{(j)}(\theta) := \partial^j \ln p(\theta|y)/\partial \theta^j,$$
$$\bar{H}_n(\theta) := \int H_n(\theta)\, g(y)\, dy, \quad \bar{J}_n(\theta) := \int J_n(\theta)\, g(y)\, dy.$$

4In the definition of the log-likelihood, we ignore the initial condition $\ln p(y_0)$. For weakly dependent data, the impact of ignoring the initial condition is asymptotically negligible.

In this paper, we impose the following regularity conditions.

Assumption 1: $\Theta \subset \mathbb{R}^{P}$ is compact.

Assumption 2: $\{y_t\}_{t=1}^{\infty}$ satisfies the strong mixing condition with mixing coefficient $\alpha(m) = O\left(m^{-\frac{2r}{r-2}-\varepsilon}\right)$ for some $\varepsilon > 0$ and $r > 2$.

Assumption 3: For all $t$, $l_t(\theta)$ satisfies the standard measurability and continuity conditions, and the eight-times differentiability condition on $\mathcal{F}_{-\infty}^{t} \times \Theta$, where $\mathcal{F}_{-\infty}^{t} = \sigma(y_t, y_{t-1}, \cdots)$.

Assumption 4: For $j = 0, 1, 2$ and any $\theta, \theta' \in \Theta$, $\left\|\nabla^j l_t(\theta) - \nabla^j l_t(\theta')\right\| \le c_t^j(y^t)\left\|\theta - \theta'\right\|$, where $c_t^j(y^t)$ is a positive random variable with the following two properties: $\sup_t E\left\|c_t^j(y^t)\right\| < \infty$ and $\frac{1}{n}\sum_{t=1}^{n}\left(c_t^j(y^t) - E\left[c_t^j(y^t)\right]\right) \stackrel{p}{\to} 0$.

Assumption 5: For $j = 0, 1, \ldots, 8$, there exist $M_t(y^t)$ and $M < \infty$ such that for all $\theta \in \Theta$, $\nabla^j l_t(\theta)$ exists, $\sup_{\theta \in \Theta}\left\|\nabla^j l_t(\theta)\right\| \le M_t(y^t)$, and $\sup_t E\left\|M_t(y^t)\right\|^{r+\delta} \le M$ for some $\delta > 0$, where $r$ is the same as in Assumption 2.


Assumption 6: $\{\nabla^j l_t(\theta)\}$ is $L_2$-near epoch dependent with respect to $\{y_t\}$ of size $-1$ for $j = 0, 1$ and of size $-\frac{1}{2}$ for $j = 2$, uniformly on $\Theta$.

Assumption 7: Let $\theta_n^p$ be the pseudo-true value that minimizes the KL divergence between the DGP and the candidate model,
$$\theta_n^p = \arg\min_{\theta \in \Theta} \frac{1}{n}\int \ln \frac{g(y)}{p(y|\theta)}\, g(y)\, dy,$$
where $\{\theta_n^p\}$ is a sequence of minimizers that are interior to $\Theta$ uniformly in $n$. For all $\varepsilon > 0$,
$$\limsup_{n \to \infty} \sup_{\Theta \setminus N(\theta_n^p, \varepsilon)} \frac{1}{n}\sum_{t=1}^{n}\left\{E\left[l_t(\theta)\right] - E\left[l_t(\theta_n^p)\right]\right\} < 0, \tag{9}$$
where $N(\theta_n^p, \varepsilon)$ is the open ball of radius $\varepsilon$ around $\theta_n^p$.

Assumption 8: The sequence $\{\bar{H}_n(\theta_n^p)\}$ is negative definite and $\{B_n(\theta_n^p)\}$ is positive definite, both uniformly in $n$.

Assumption 9: $\bar{H}_n(\theta_n^p) + B_n(\theta_n^p) = o(1)$.

Assumption 10: The prior density $p(\theta)$ is eight-times continuously differentiable, $p(\theta_n^p) > 0$, and $\int \|\theta\|^2 p(\theta)\, d\theta < \infty$.

Remark 3.1 Assumption 1 is the compactness condition. Assumptions 2 and 6 imply weak dependence in $y_t$ and $l_t$. The first part of Assumption 3 is the continuity condition. Assumption 4 is the Lipschitz condition for $l_t$, first introduced in Andrews (1987) to develop the uniform law of large numbers for dependent and heterogeneous stochastic processes. Assumption 5 contains the domination condition for $l_t$. Assumption 7 is the identification condition. These assumptions are well-known primitive conditions for developing the QML theory, namely consistency and asymptotic normality, for dependent and heterogeneous data; see, for example, Gallant and White (1988) and Wooldridge (1994).

Remark 3.2 The eight-times differentiability condition in Assumption 3 and the domination condition on up to the eighth derivative of $l_t$ are important for developing a high-order stochastic Laplace expansion. In particular, as shown in Kass et al. (1990), these two conditions, together with the well-known consistency condition for QML given by (10) below, are sufficient for developing the Laplace expansion. The consistency condition requires that, for any $\varepsilon > 0$, there exists $K_1(\varepsilon) > 0$ such that
$$\lim_{n \to \infty} P\left\{\sup_{\Theta \setminus N(\theta_n^p, \varepsilon)} \frac{1}{n}\sum_{t=1}^{n}\left[l_t(\theta) - l_t(\theta_n^p)\right] < -K_1(\varepsilon)\right\} = 1. \tag{10}$$
Our Assumption 7 is clearly more primitive than the consistency condition (10). In the following lemma, we show that Assumptions 1-7, including the identification condition (9), are sufficient to ensure (10), as well as the concentration condition around the posterior mode given by Chen (1985) and the concentration condition around the QML estimator given by Kim (1994, 1998). Together with Assumption 10, the concentration condition suggests that the stochastic Laplace expansion can be applied to the posterior distribution and that the asymptotic normality of the posterior distribution can be established. To the best of our knowledge, this is the first time in the literature that primitive conditions have been proposed for the stochastic Laplace expansion. Assumption 10 ensures that the second moment of the prior is bounded. As argued in Geweke (2001), such a condition typically leads to a finite second moment of the posterior. Moreover, it implies that the prior is negligible asymptotically.

Lemma 3.1 If Assumptions 1-7 hold, then Equation (10) holds. Furthermore, if Assumptions 1-7 hold, then for any $\varepsilon > 0$ there exists $K_2(\varepsilon) > 0$ such that
$$\lim_{n \to \infty} P\left\{\sup_{\Theta \setminus N(\hat\theta_n, \varepsilon)} \frac{1}{n}\left[\sum_{t=1}^{n} l_t(\theta) - \sum_{t=1}^{n} l_t(\theta_n^p)\right] < -K_2(\varepsilon)\right\} = 1. \tag{11}$$
If, in addition, Assumption 10 holds, then for any $\varepsilon > 0$ there exists $K_3(\varepsilon) > 0$ such that
$$\lim_{n \to \infty} P\left\{\sup_{\Theta \setminus N(\overleftrightarrow{\theta}_n, \varepsilon)} \frac{1}{n}\left(\sum_{t=1}^{n}\left[l_t(\theta) - l_t(\theta_n^p)\right] + \ln p(\theta) - \ln p(\theta_n^p)\right) < -K_3(\varepsilon)\right\} = 1, \tag{12}$$
where $\overleftrightarrow{\theta}_n := \arg\max_{\theta \in \Theta}\left\{\sum_{t=1}^{n} l_t(\theta) + \ln p(\theta)\right\}$ is the posterior mode.

Remark 3.3 Assumption 9 gives the exact requirement for a "good approximation". This generalizes the definition of the "information matrix equality"; see White (1996). We now specify a few cases in which Assumption 9 is satisfied. The details of these cases can be found in White (1996) and Wooldridge (1994).

1. According to White (1996, p. 55), a model is correctly specified in its entirety if there exists $\theta_0 \in \Theta$ such that
$$p(y|\theta_0) = g(y). \tag{13}$$
In this case, $\theta_n^p = \theta_0$ for all $n$ and $\bar{H}_n(\theta_0) + B_n(\theta_0) = 0$, implying Assumption 9 (White, 1996, p. 93).

2. For any $t$, if $\mathrm{Cov}\left[s_t(\theta_n^p), s_{t+j}(\theta_n^p)\right] = 0$ for all $j \ge 1$, then $\bar{J}_n(\theta_n^p) = B_n(\theta_n^p)$. For any $t$, if $\mathrm{Var}\left[s_t(\theta_n^p)\right] = -E\left[h_t(\theta_n^p)\right]$, then $\bar{J}_n(\theta_n^p) + \bar{H}_n(\theta_n^p) = 0$. If both conditions are satisfied, we have $\bar{H}_n(\theta_n^p) + B_n(\theta_n^p) = 0$, again implying Assumption 9.

3. Suppose $y_t$ is a $k_y \times 1$ vector which can be partitioned into a $k_w \times 1$ vector of endogenous variables $w_t$ and a $k_z \times 1$ vector of exogenous variables $z_t$. Let $x_t$ be a subset of $(z_t, w_{t-1}, z_{t-1}, \ldots, w_1, z_1)$. According to White (1996, p. 55), a model is correctly specified for the dependent variable $w_t$ if there exists $\theta_0 \in \Theta$ such that
$$p(w_t|x_t, \theta_0) = g(w_t|x_t), \quad \text{for all } x_t,\ t = 1, 2, \ldots. \tag{14}$$
If $l_t(\theta) = \ln p(w_t|x_t, \theta)$, then $E\left[s_t(\theta_0)|x_t\right] = 0$. Under condition (14), $\theta_n^p = \theta_0$ for all $n$, $E\left[s_t(\theta_0)s_t(\theta_0)'\right] = -E\left[h_t(\theta_0)\right]$, and $\bar{J}_n(\theta_0) = -\bar{H}_n(\theta_0)$. If the model is further assumed to be dynamically complete in distribution, in the sense that
$$g(w_t|x_t, w_{t-1}, z_{t-1}, \ldots, w_1, z_1) = g(w_t|x_t),$$
then $E\left[s_t(\theta_0)s_{t+j}(\theta_0)'\right] = 0$ and $\bar{J}_n(\theta_0) = B_n(\theta_0)$. Hence, $\bar{H}_n(\theta_0) + B_n(\theta_0) = 0$, implying Assumption 9; see White (1996, pp. 95-97).

4. A model is correctly specified for the conditional mean and the conditional variance if there exists $\theta_0 \in \Theta$ such that
$$E(w_t|x_t) = m_t(x_t, \theta_0), \quad \mathrm{Var}(w_t|x_t) = \Omega_t(x_t, \theta_0), \quad \text{for any } t,$$
where $m_t(x_t, \theta)$ and $\Omega_t(x_t, \theta)$ are the conditional mean and the conditional variance of the model. In this case, the QML method that maximizes the Gaussian quasi-log-likelihood function can be used. Let
$$l_t(\theta) := -\frac{1}{2}\ln\left|\Omega_t(x_t, \theta)\right| - \frac{1}{2}u_t'\Omega_t(x_t, \theta)^{-1}u_t,$$
where $u_t = u_t(\theta) = w_t - m_t(x_t, \theta)$. It is easy to show that $E\left[s_t(\theta_n^p)|x_t\right] = 0$. If the model is further assumed to be dynamically complete in mean and variance, in the sense that
$$E(w_t|x_t, w_{t-1}, z_{t-1}, \ldots, w_1, z_1) = E(w_t|x_t), \quad \mathrm{Var}(w_t|x_t, w_{t-1}, z_{t-1}, \ldots, w_1, z_1) = \mathrm{Var}(w_t|x_t), \tag{15}$$
then $\theta_n^p = \theta_0$, $s_t(\theta_0)$ is a martingale difference sequence, and $\bar{J}_n(\theta_0) = B_n(\theta_0)$. If
$$E\left[\left\{\mathrm{vec}(u_t u_t')\right\}u_t'\,\big|\,x_t\right] = 0,$$
$$E\left[\left\{\mathrm{vec}(u_t u_t') - \mathrm{vec}\left(\Omega_t(x_t, \theta_0)\right)\right\}\left\{\mathrm{vec}(u_t u_t') - \mathrm{vec}\left(\Omega_t(x_t, \theta_0)\right)\right\}'\,\big|\,x_t\right] = 2N_{k_w}\left[\Omega_t(x_t, \theta_0) \otimes \Omega_t(x_t, \theta_0)\right],$$
where $\mathrm{vec}$ is the column-wise vectorization, $N_{k_w} = D_{k_w}\left(D_{k_w}'D_{k_w}\right)^{-1}D_{k_w}'$, and $D_{k_w}$ is the $k_w^2 \times k_w(k_w+1)/2$ duplication matrix, then $\bar{J}_n(\theta_0) = -\bar{H}_n(\theta_0)$. Hence, $\bar{H}_n(\theta_0) + B_n(\theta_0) = 0$, implying Assumption 9; see Wooldridge (1994, p. 2690).

The above four cases all imply that $\bar{H}_n(\theta_n^p) + B_n(\theta_n^p) = 0$, which is stronger than what Assumption 9 requires. We now give an example where $\bar{H}_n(\theta_n^p) + B_n(\theta_n^p) = o(1)$.


Example 3.1 Let the DGP be
$$y_t = x_{1t}\beta_0 + x_{2t}\gamma_0 + \varepsilon_t, \quad \varepsilon_t \stackrel{iid}{\sim} N(0, \sigma_0^2),$$
where $(x_{1t}, x_{2t})$ is iid over $t$ and independent of $\varepsilon_t$. Assume that $\gamma_0 = \delta_0/n^{1/2}$, where $\delta_0$ is an unknown constant. Let the candidate model be
$$y_t = x_{1t}\beta + v_t, \quad v_t \stackrel{iid}{\sim} N(0, \sigma^2).$$
In this case
$$l_t(\theta) = -\frac{1}{2}\ln 2\pi - \frac{1}{2}\ln \sigma^2 - \frac{(y_t - x_{1t}\beta)^2}{2\sigma^2},$$
where $\theta = (\beta, \sigma^2)'$. The pseudo-true value is $\theta_n^p = (\beta_n^p, \sigma_n^{2p})'$, which maximizes $E\left[l_t(\theta)\right]$ and can be expressed as
$$\beta_n^p = \beta_0 + b\gamma_0, \quad \sigma_n^{2p} = \sigma_0^2 + c\gamma_0^2,$$
where $b = \left[E(x_{1t}^2)\right]^{-1}E(x_{1t}x_{2t})$ and $c = E(x_{2t}^2) - \left[E(x_{1t}x_{2t})\right]^2\left[E(x_{1t}^2)\right]^{-1}$. Hence,
$$-E\left[h_t(\theta_n^p)\right] = \begin{pmatrix} \dfrac{E(x_{1t}^2)}{\sigma_n^{2p}} & 0 \\ 0 & -\dfrac{1}{2(\sigma_n^{2p})^2} + \dfrac{\sigma_0^2 + c\gamma_0^2}{(\sigma_n^{2p})^3} \end{pmatrix}, \quad -\bar{H}_n(\theta_n^p) = -\frac{1}{n}\sum_{t=1}^{n}E\left[h_t(\theta_n^p)\right] = -E\left[h_t(\theta_n^p)\right].$$
From the iid assumption, we have
$$\mathrm{Var}\left(s_t(\theta_n^p)\right) = E\left(s_t(\theta_n^p)s_t(\theta_n^p)'\right) = \begin{pmatrix} \dfrac{\sigma_0^2 E(x_{1t}^2)}{\sigma_n^{2p}} + \dfrac{d_1\gamma_0^2}{(\sigma_n^{2p})^2} & \dfrac{d_2\gamma_0^3}{2(\sigma_n^{2p})^2} \\ \dfrac{d_2\gamma_0^3}{2(\sigma_n^{2p})^2} & -\dfrac{1}{4(\sigma_n^{2p})^2} + \dfrac{3\sigma_0^4 + 6c\sigma_0^2\gamma_0^2 + d_3\gamma_0^4}{4(\sigma_n^{2p})^4} \end{pmatrix},$$
where $d_j = E\left[x_{1t}^{3-j}(x_{2t} - x_{1t}b)^{j+1}\right]$ for $j = 1, 2, 3$, and
$$B_n(\theta_n^p) = \mathrm{Var}\left(\frac{1}{\sqrt{n}}\sum_{t=1}^{n}s_t(\theta_n^p)\right) = \frac{1}{n}\sum_{t=1}^{n}\mathrm{Var}\left(s_t(\theta_n^p)\right) = \bar{J}_n(\theta_n^p) = \begin{pmatrix} \dfrac{\sigma_0^2 E(x_{1t}^2)}{\sigma_n^{2p}} + \dfrac{d_1\gamma_0^2}{(\sigma_n^{2p})^2} & \dfrac{d_2\gamma_0^3}{2(\sigma_n^{2p})^2} \\ \dfrac{d_2\gamma_0^3}{2(\sigma_n^{2p})^2} & -\dfrac{1}{4(\sigma_n^{2p})^2} + \dfrac{3(\sigma_0^2)^2 + 6c\sigma_0^2\gamma_0^2 + d_3\gamma_0^4}{4(\sigma_n^{2p})^4} \end{pmatrix}.$$
Hence,
$$\lim_{n\to\infty}B_n(\theta_n^p) = \lim_{n\to\infty}\bar{J}_n(\theta_n^p) = \lim_{n\to\infty}\left[-\bar{H}_n(\theta_n^p)\right] = \begin{pmatrix} \dfrac{E(x_{1t}^2)}{\sigma_0^2} & 0 \\ 0 & \dfrac{1}{2(\sigma_0^2)^2} \end{pmatrix}$$
since $\gamma_0 = \delta_0/n^{1/2}$. Thus, $\bar{H}_n(\theta_n^p) + B_n(\theta_n^p) = o(1)$. However, $\bar{H}_n(\theta_n^p) + B_n(\theta_n^p) \neq 0$ for any finite $n$.
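The behavior in Example 3.1 can also be checked numerically. The following is a minimal simulation sketch: it generates data from the DGP with $\gamma_0 = \delta_0/n^{1/2}$, fits the misspecified model by QML, and reports $H_n + B_n$ evaluated at the QML estimate (used here as a proxy for the pseudo-true value). The distributions of the regressors and the constants are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
delta0, beta0, sigma0 = 1.0, 0.5, 1.0

def h_plus_b(n):
    """Estimate H_n + B_n for the misspecified model when gamma_0 = delta_0 / sqrt(n)."""
    gamma0 = delta0 / np.sqrt(n)
    x1 = rng.normal(size=n)
    x2 = 0.5 * x1 + rng.normal(size=n)                 # regressor correlated with x1
    y = x1 * beta0 + x2 * gamma0 + sigma0 * rng.normal(size=n)
    beta = np.sum(x1 * y) / np.sum(x1 ** 2)            # QML (OLS) estimate of beta
    u = y - x1 * beta
    s2 = np.mean(u ** 2)                               # QML estimate of sigma^2
    # Per-observation scores of the Gaussian quasi-log-likelihood at (beta, s2).
    S = np.column_stack([x1 * u / s2, -0.5 / s2 + u ** 2 / (2 * s2 ** 2)])
    B = S.T @ S / n                                    # outer-product estimate of B_n
    H = np.array([[-np.mean(x1 ** 2) / s2, -np.mean(x1 * u) / s2 ** 2],
                  [-np.mean(x1 * u) / s2 ** 2, 0.5 / s2 ** 2 - np.mean(u ** 2) / s2 ** 3]])
    return H + B

for n in (200, 2_000, 20_000):
    print(n, np.round(h_plus_b(n), 4))   # entries shrink toward zero as n grows
```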


To develop the Laplace expansion, we need to fix additional notation. For convenience of exposition, let $H_n^{(j)}(\theta) = \frac{1}{n}\sum_{t=1}^{n}\nabla^j l_t(\theta)$ for $j = 3, 4, 5$. Let $\pi(\theta) = \ln p(\theta)$, and let $\hat{p}$, $\hat{\pi}$, $\nabla^j \hat{p}$, and $\nabla^j \hat{\pi}$ denote the values of the functions $p(\theta)$, $\pi(\theta)$, $\nabla^j p(\theta)$, and $\nabla^j \pi(\theta)$ evaluated at $\hat\theta_n$. The next lemma extends the result in Kass et al. (1990) to a higher order in matrix form.

Lemma 3.2 Under Assumptions 1-10, we have
$$\frac{\int l_t(\theta)\,p(\theta)\,p(y|\theta)\,d\theta}{\int p(\theta)\,p(y|\theta)\,d\theta} = l_t(\hat\theta_n) + \frac{1}{n}B_{t,1} + \frac{1}{n^2}\left(B_{t,2} - B_{t,3}\right) + O_p\left(n^{-3}\right), \tag{16}$$
where, writing $H_n$, $H_n^{(j)}$, and $\nabla^j l_t$ for $H_n(\hat\theta_n)$, $H_n^{(j)}(\hat\theta_n)$, and $\nabla^j l_t(\hat\theta_n)$ (all evaluated at $\hat\theta_n$),
$$B_{t,1} = -\frac{1}{2}\mathrm{tr}\left[H_n^{-1}\nabla^2 l_t\right] - \nabla l_t'\,H_n^{-1}\frac{\nabla\hat{p}}{\hat{p}} - \frac{1}{2}\mathrm{vec}\left(H_n^{-1}\right)'H_n^{(3)}H_n^{-1}\nabla l_t, \tag{17}$$
$$
\begin{aligned}
B_{t,2} ={}& -\frac{1}{8}\nabla l_t'\,H_n^{-1}H_n^{(5)\prime}\,\mathrm{vec}\left[H_n^{-1}\otimes\mathrm{vec}\left(H_n^{-1}\right)\right] \\
&+ \frac{35}{48}\nabla l_t'\,A_2\,H_n^{-1}H_n^{(3)\prime}\,\mathrm{vec}\left(H_n^{-1}\right) - \frac{35}{48}\nabla l_t'\,H_n^{-1}H_n^{(3)\prime}\,\mathrm{vec}\left(H_n^{-1}\right)A_1 \\
&- \frac{5}{8}\nabla l_t'\,H_n^{-1}\frac{\nabla\hat{p}}{\hat{p}}\,\mathrm{tr}\left[A_2\right] + \frac{35}{24}\nabla l_t'\,H_n^{-1}\frac{\nabla\hat{p}}{\hat{p}}\,A_1 \\
&- \frac{5}{4}\nabla l_t'\,H_n^{-1}H_n^{(3)\prime}\,\mathrm{vec}\left(H_n^{-1}\right)\mathrm{tr}\left[H_n^{-1}\frac{\nabla^2\hat{p}}{\hat{p}}\right] + \frac{1}{2}\nabla l_t'\,H_n^{-1}\frac{\nabla^3\hat{p}}{\hat{p}}\,\mathrm{vec}\left(H_n^{-1}\right) \\
&- \frac{5}{16}\mathrm{tr}\left[A_2\right]\mathrm{tr}\left[H_n^{-1}\nabla^2 l_t\right] + \frac{35}{48}A_1\,\mathrm{tr}\left[H_n^{-1}\nabla^2 l_t\right] \\
&- \frac{5}{12}\mathrm{vec}\left(H_n^{-1}\right)'H_n^{(3)}H_n^{-1}\nabla^3 l_t'\,\mathrm{vec}\left(H_n^{-1}\right) + \frac{1}{8}\mathrm{tr}\left\{\left[H_n^{-1}\otimes\mathrm{vec}\left(H_n^{-1}\right)\right]'\nabla^4 l_t\right\} \\
&- \frac{5}{4}\mathrm{vec}\left(H_n^{-1}\right)'H_n^{(3)}H_n^{-1}\nabla^2 l_t\,H_n^{-1}\frac{\nabla\hat{p}}{\hat{p}} + \frac{1}{2}\mathrm{vec}\left(H_n^{-1}\right)'\nabla^3 l_t\,H_n^{-1}\frac{\nabla\hat{p}}{\hat{p}} \\
&+ \frac{3}{4}\mathrm{tr}\left[\nabla^2 l_t\,H_n^{-1}\right]\mathrm{tr}\left[H_n^{-1}\frac{\nabla^2\hat{p}}{\hat{p}}\right],
\end{aligned} \tag{18}
$$
$$B_{t,3} = B_4 \times B_{t,1}, \quad \text{with} \tag{19}$$
$$B_4 = -\frac{1}{2}\mathrm{tr}\left[H_n^{-1}\frac{\nabla^2\hat{p}}{\hat{p}}\right] + \frac{1}{2}\mathrm{vec}\left(H_n^{-1}\right)'H_n^{(3)}H_n^{-1}\frac{\nabla\hat{p}}{\hat{p}} - \frac{5}{24}A_1 + \frac{1}{8}\mathrm{tr}\left[A_2\right], \tag{20}$$
$$A_1 = \mathrm{vec}\left(H_n^{-1}\right)'H_n^{(3)}H_n^{-1}H_n^{(3)\prime}\,\mathrm{vec}\left(H_n^{-1}\right), \qquad A_2 = \left[H_n^{-1}\otimes\mathrm{vec}\left(H_n^{-1}\right)\right]'H_n^{(4)}.$$

We are now in a position to develop a high-order expansion of $P_D$ and DIC.

Lemma 3.3 Under Assumptions 1-10, we have
$$P_D = P + \frac{1}{n}C_1 + \frac{1}{n}C_2 + O_p\left(n^{-2}\right),$$
$$\mathrm{DIC} = \mathrm{AIC} + \frac{1}{n}D_1 + \frac{1}{n}D_2 + O_p\left(n^{-2}\right),$$
where
$$C_1 = -\frac{-13 + 15P}{12}A_1 - \frac{1 - 2P}{4}\mathrm{tr}\left[A_2\right], \quad C_2 = \frac{3 - P}{2}C_{21} - PC_{22} + (1 - P)C_{23},$$
$$D_1 = -\frac{1 - 2P}{2}\mathrm{tr}\left[A_2\right] - \frac{-23 + 25P}{12}A_1, \quad D_2 = (2 - P)C_{21} - 2PC_{22} + (1 - 2P)C_{23},$$
$$C_{21} = \nabla\hat\pi'\,H_n(\hat\theta_n)^{-1}H_n^{(3)}(\hat\theta_n)'\,\mathrm{vec}\left(H_n(\hat\theta_n)^{-1}\right), \quad C_{22} = \mathrm{tr}\left[H_n(\hat\theta_n)^{-1}\nabla^2\hat\pi\right], \quad C_{23} = \nabla\hat\pi'\,H_n(\hat\theta_n)^{-1}\nabla\hat\pi,$$
and $A_1$ and $A_2$ are defined as in Lemma 3.2.

Theorem 3.1 Under Assumptions 1-10, we have
$$E_y E_{y_{rep}}\left[-2\ln p\left(y_{rep}|\bar\theta_n(y)\right)\right] = E_y\left[\mathrm{DIC} + o_p(1)\right] = E_y\left[\mathrm{DIC}\right] + o(1).$$

Remark 3.4 Like AIC, DIC is an asymptotically unbiased estimator of $E_y E_{y_{rep}}\left[-2\ln p\left(y_{rep}|\bar\theta(y)\right)\right]$, according to Theorem 3.1. Hence, the decision-theoretic justification of DIC is that DIC selects a model that asymptotically minimizes the expected loss, which is the expected KL divergence between the DGP and the plug-in predictive distribution $p\left(y_{rep}|\bar\theta_n(y)\right)$, where the expectation is taken with respect to the DGP. A key difference between AIC and DIC is that the plug-in predictive distribution is based on different estimators. In AIC the QML estimate, $\hat\theta_n(y)$, is used, while in DIC the Bayesian posterior mean, $\bar\theta_n(y)$, is used. That is why DIC is interpreted as the Bayesian version of AIC.


Remark 3.5 The justification of DIC remains valid if the posterior mean is replaced with the posterior mode or with the QML estimate and $P_D$ is replaced with $P$. This is because the justification of DIC requires that the information matrix identity hold asymptotically and that the posterior distribution converge to a normal distribution (the posterior mean converges to the posterior mode and the posterior variance converges to zero).

Remark 3.6 In AIC, the number of degrees of freedom, $P$, is used to measure the model complexity. When the prior is informative, it imposes additional restrictions on the parameter space and, hence, $P_D$ may not be close to $P$ for finite $n$. A useful contribution of DIC is to provide a way to measure the model complexity when prior information is incorporated; see Brooks (2002). From Lemma 3.3, the effect of the prior information on $P_D$ is reflected by $C_2$ and is of order $O_p(n^{-1})$. Similarly, the effect of the prior information on DIC is reflected by $D_2$ and is also of order $O_p(n^{-1})$.

Remark 3.7 As pointed out in Spiegelhalter et al. (2014), the consistency of the BF requires that there is a true model and that the true model is among the candidate models. However, AIC and DIC are prediction-based criteria which are designed to find the best model for making predictions among the candidate models. Neither AIC nor DIC attempts to find the true model.

Remark 3.8 If $p(y|\theta)$ has a closed-form expression, DIC is trivially computable from the MCMC output. This computational tractability, together with the versatility of MCMC and the fact that DIC is incorporated into the Bayesian software WinBUGS, is among the reasons why DIC has enjoyed a very wide range of applications.

4 DIC Based on Bayesian Predictive Distribution

The above decision-theoretic justification of DIC is based on the loss function constructed from the plug-in predictive distribution. Unfortunately, the plug-in predictive distribution is not invariant to reparameterization and, hence, the corresponding DIC can be sensitive to the parameterization. From the purely Bayesian viewpoint, only the Bayesian predictive distribution, not the plug-in predictive distribution, is a proper, full predictive distribution. The Bayesian predictive distribution is invariant to reparameterization. Hence, the corresponding loss function and information criterion will be invariant to reparameterization; see Ando and Tsay (2010) and Spiegelhalter et al. (2014).

Let $p(y_{rep}|y, M_k)$ be the Bayesian predictive distribution, that is,
$$p(y_{rep}|y, M_k) = \int p(y_{rep}|\theta, M_k)p(\theta|y, M_k)\, d\theta.$$
The KL divergence based on the Bayesian predictive distribution is
$$\mathrm{KL}\left[g(y_{rep}), p(y_{rep}|y, M_k)\right] = E_{y_{rep}}\left[\ln g(y_{rep})\right] - E_{y_{rep}}\left[\ln p(y_{rep}|y, M_k)\right]. \tag{21}$$


The expected loss of a statistical decision $d_k$ that selects model $M_k$ is
$$\mathrm{Risk}(d_k) = E_y\left\{2 \times \mathrm{KL}\left[g(y_{rep}), p(y_{rep}|y, M_k)\right]\right\} = E_y E_{y_{rep}}\left[2\ln g(y_{rep})\right] + E_y E_{y_{rep}}\left[-2\ln p(y_{rep}|y, M_k)\right].$$
A better model is expected to yield a smaller value of $\mathrm{Risk}(d_k)$. Since $E_{y_{rep}}\left[2\ln g(y_{rep})\right]$ is the same across all candidate models, it is dropped from (21) when comparing models. As a result, we propose to choose the model that gives the smallest value of (again suppressing $M_k$ for notational simplicity)
$$E_y E_{y_{rep}}\left[-2\ln p(y_{rep}|y)\right] = \int\int -2\ln p(y_{rep}|y)\, g(y_{rep})\, g(y)\, dy_{rep}\, dy.$$
Let the selected model be $M_{k^{*}}$ (i.e., the optimal decision is $d_{k^{*}}$). Then $p(y_{rep}|y, M_{k^{*}})$, which is the closest to $g(y_{rep})$ in terms of the expected KL divergence, is used to generate predictions of future observations.

4.1 IC$_{AT}$

Under the iid assumption, Ando and Tsay (2010) showed that
$$E_y E_{y_{rep}}\left[-2\ln p(y_{rep}|y)\right] = E_y\left\{-2\ln p(y|y) + \mathrm{tr}\left[J_{AT}^{-1}\left(\hat\theta_{AT}\right)I_{AT}\left(\hat\theta_{AT}\right)\right]\right\} + o(1),$$
where
$$\hat\theta_{AT} = \arg\max_{\theta \in \Theta}\left\{2\ln p(y|\theta) + \ln p(\theta)\right\},$$
$$J_{AT}(\theta) = -\frac{1}{n}\sum_{t=1}^{n}\frac{\partial^2 \ln \xi(y_t|\theta)}{\partial \theta \partial \theta'}, \quad I_{AT}(\theta) = \frac{1}{n}\sum_{t=1}^{n}\frac{\partial \ln \xi(y_t|\theta)}{\partial \theta}\frac{\partial \ln \xi(y_t|\theta)}{\partial \theta'},$$
with $\ln \xi(y_t|\theta) = \ln p(y_t|\theta) + \ln p(\theta)/(2n)$ for $t = 1, \cdots, n$.

Remark 4.1 Ando and Tsay (2010) defined the following information criterion:
$$\mathrm{IC}_{AT} = -2\ln p(y|y) + \mathrm{tr}\left[I_{AT}^{-1}\left(\hat\theta_{AT}\right)J_{AT}\left(\hat\theta_{AT}\right)\right]. \tag{22}$$
Since $\mathrm{IC}_{AT}$ is constructed from the Bayesian predictive distribution, it is invariant to reparameterization. Interestingly, the first term in $\mathrm{IC}_{AT}$ is different from that in AIC or DIC. To compute the first term in $\mathrm{IC}_{AT}$ from the MCMC output, note that
$$-2\ln p(y|y) = -2\ln\left(\int p(y|\theta)p(\theta|y)\, d\theta\right) \approx -2\ln\left[\frac{1}{J}\sum_{j=1}^{J}p\left(y|\theta^{(j)}\right)\right],$$
where $\left\{\theta^{(j)}\right\}_{j=1}^{J}$ are $J$ effective random samples drawn from the posterior distribution $p(\theta|y)$. To compute the penalty term $\mathrm{tr}\left[I_{AT}^{-1}\left(\hat\theta_{AT}\right)J_{AT}\left(\hat\theta_{AT}\right)\right]$, one first needs to obtain $\hat\theta_{AT}$. In general, $\hat\theta_{AT}$ is not available analytically and, hence, one has to use a numerical method to find it. One then needs to calculate $I_{AT}(\theta)$ and $J_{AT}(\theta)$ and to invert $I_{AT}(\theta)$.


Remark 4.2 Under the assumptions that the prior is $O_p(1)$ and that the candidate model encompasses the DGP, Ando and Tsay (2010) simplified the information criterion to
$$\mathrm{IC}_{AT} = -2\ln p(y|y) + P. \tag{23}$$
In this case, the penalty term has a very simple expression: it equals the number of degrees of freedom and no longer depends on the prior information.

4.2 DIC$^{BP}$

Using $-2\ln p(y|y)$ as the first term makes it difficult to compare with DIC, AIC, or BIC. In this paper, we propose a Bayesian predictive distribution-based information criterion whose first term is the same as that of DIC, i.e., $\overline{D(\theta)}$, but without resorting to the iid assumption. It turns out that such a choice not only facilitates the comparison with DIC, AIC, or BIC, but also leads to a much simpler expression for the penalty term. In the following theorem we propose a new information criterion based on the KL divergence between the DGP and the Bayesian predictive distribution and show its asymptotic unbiasedness.

Theorem 4.1 Define the information criterion
$$\mathrm{DIC}^{BP} = \overline{D(\theta)} + (1 + \ln 2)P_D, \tag{24}$$
where $P_D$ is defined in (4). Under Assumptions 1-10, we have
$$E_y E_{y_{rep}}\left[-2\ln p(y_{rep}|y)\right] = E_y\left[\mathrm{DIC}^{BP}\right] + o(1).$$

Remark 4.3 The justification of $\mathrm{DIC}^{BP}$ remains valid if the posterior mean is replaced with the posterior mode or with the QML estimate and if $P_D$ is replaced with $P$. Clearly, the penalty term in $\mathrm{DIC}^{BP}$ is smaller than that in DIC, AIC, and BIC (i.e., $(1 + \ln 2)P_D \approx (1 + \ln 2)P$, as opposed to $2P_D$ in DIC, $2P$ in AIC, and $P\ln n$ in BIC).

Remark 4.4 It can be shown that $\mathrm{DIC}^{BP} = \mathrm{IC}_{AT} + o_p(1)$ and $E_y\left[\mathrm{DIC}^{BP}\right] = E_y\left[\mathrm{IC}_{AT}\right] + o(1)$. Clearly $(1 + \ln 2)P_D \approx (1 + \ln 2)P > P$, implying $-2\ln p(y|\bar\theta) < -2\ln p(y|y)$. To see why this is the case, note that $p(y|\bar\theta) = p\left(y\,\big|\int \theta\, p(\theta|y)\, d\theta\right)$ and $p(y|y) = \int p(y|\theta)p(\theta|y)\, d\theta$. Under Assumptions 1-10, $p(\theta|y)$ is approximately Gaussian and $p(y|\theta)$, viewed as a function of $\theta$, is concave in the region where the posterior concentrates. The inequality $-2\ln p(y|\bar\theta) < -2\ln p(y|y)$ then follows from Jensen's inequality.

Remark 4.5 Like $\mathrm{IC}_{AT}$, $\mathrm{DIC}^{BP}$ is based on the Bayesian predictive distribution and, hence, is invariant to reparameterization. $\mathrm{DIC}^{BP}$ has several good properties. First, $\mathrm{DIC}^{BP}$ is developed without resorting to the iid assumption. Second, $\mathrm{DIC}^{BP}$ is easier to compute. When the MCMC output is available, $\mathrm{IC}_{AT}$ requires evaluating
$$\ln p(y|y) = \ln\left(\int p(y|\theta)p(\theta|y)\, d\theta\right) \approx \ln\left[\frac{1}{J}\sum_{j=1}^{J}p\left(y|\theta^{(j)}\right)\right],$$
while $\mathrm{DIC}^{BP}$ requires evaluating
$$\int \ln p(y|\theta)p(\theta|y)\, d\theta \approx \frac{1}{J}\sum_{j=1}^{J}\ln p\left(y|\theta^{(j)}\right).$$
Numerically, the log-likelihood function is usually much more stable than the likelihood function. Thus, for the same value of $J$, the accuracy of $\frac{1}{J}\sum_{j=1}^{J}\ln p\left(y|\theta^{(j)}\right)$ is often much higher than that of $\ln\left[\frac{1}{J}\sum_{j=1}^{J}p\left(y|\theta^{(j)}\right)\right]$. Moreover, $\mathrm{DIC}^{BP}$ is as easy to compute as DIC. Since DIC is monitored in WinBUGS, no additional effort is needed for calculating $\mathrm{DIC}^{BP}$. Third, like DIC, the penalty term depends on the prior information. Hence, the prior information is incorporated in $\mathrm{DIC}^{BP}$.
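The numerical point in this remark can be illustrated as follows: averaging log-likelihood draws is trivially stable, whereas $\ln p(y|y)$ requires a log-sum-exp evaluation to avoid underflow when the log-likelihood values are large in magnitude. The draw values below are synthetic and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic posterior log-likelihood draws ln p(y | theta^(j)); values are illustrative.
log_lik_draws = -850.0 + rng.normal(scale=3.0, size=5000)

# Quantity needed by DIC^BP: a plain average of log-likelihoods (numerically stable).
mean_loglik = log_lik_draws.mean()

# Quantity needed by IC_AT: ln p(y|y) ~ ln[(1/J) sum_j p(y | theta^(j))].
# Direct exponentiation underflows, so use the log-sum-exp trick.
m = log_lik_draws.max()
log_pred = m + np.log(np.mean(np.exp(log_lik_draws - m)))

print(mean_loglik, log_pred)  # log_pred >= mean_loglik by Jensen's inequality
```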

Remark 4.6 A recent literature suggests minimizing the posterior mean of the KL divergence between $g(y_{rep})$ and $p(y_{rep}|\theta)$, i.e.,
$$\int\left[\int \ln\frac{g(y_{rep})}{p(y_{rep}|\theta)}g(y_{rep})\, dy_{rep}\right]p(\theta|y)\, d\theta = \int \ln g(y_{rep})\, g(y_{rep})\, dy_{rep} - \int\int \ln p(y_{rep}|\theta)\, g(y_{rep})\, dy_{rep}\, p(\theta|y)\, d\theta.$$
Hence, the corresponding expected loss of a statistical decision $d_k$ is
$$\mathrm{Risk}(d_k) = E_y\left\{\int\left[\int \ln\frac{g(y_{rep})}{p(y_{rep}|\theta, M_k)}g(y_{rep})\, dy_{rep}\right]p(\theta|y, M_k)\, d\theta\right\} = \int \ln g(y_{rep})\, g(y_{rep})\, dy_{rep} - E_y\left\{\int\int \ln p(y_{rep}|\theta, M_k)\, g(y_{rep})\, dy_{rep}\, p(\theta|y, M_k)\, d\theta\right\}.$$
Since the first term is constant across different models, van der Linde (2005, 2012), Plummer (2008), Ando (2007), and Ando (2012) proposed to choose a model to minimize
$$E_y\left\{\int\int\left[-2\ln p(y_{rep}|\theta)\right]g(y_{rep})\, dy_{rep}\, p(\theta|y)\, d\theta\right\}.$$
Under the iid assumption, it was shown that
$$E_y\left\{\int\int\left[-2\ln p(y_{rep}|\theta)\right]g(y_{rep})\, dy_{rep}\, p(\theta|y)\, d\theta\right\} \approx E_y\left[\overline{D(\theta)} + 3P_D\right],$$
leading to an information criterion called BPIC by Ando (2007). Clearly, the target here is different from that of DIC and also from that of $\mathrm{DIC}^{BP}$. According to Spiegelhalter et al. (2014) and Vehtari and Ojanen (2012), BPIC chooses an "average target" rather than a "representative target". Although BPIC can select a model, it cannot tell the user how to actually predict future observations because there is no utility corresponding to this criterion.


Remark 4.7 To understand why the penalty term in BPIC is larger than that in $\mathrm{DIC}^{BP}$, note that the loss employed by BPIC is
$$\int\int -2\ln\left(p(y_{rep}|\theta)\right)g(y_{rep})\, p(\theta|y)\, d\theta\, dy_{rep} = \int\left[\int -2\ln\left(p(y_{rep}|\theta)\right)p(\theta|y)\, d\theta\right]g(y_{rep})\, dy_{rep},$$
while the loss employed by $\mathrm{DIC}^{BP}$ is
$$\int -2\ln\left(\int p(y_{rep}|\theta)p(\theta|y)\, d\theta\right)g(y_{rep})\, dy_{rep}.$$
By Jensen's inequality,
$$-2\ln\left(\int p(y_{rep}|\theta)p(\theta|y)\, d\theta\right) < \int -2\ln\left(p(y_{rep}|\theta)\right)p(\theta|y)\, d\theta.$$
Hence, the expected loss in $\mathrm{DIC}^{BP}$ is smaller than that in BPIC for the same candidate model.

4.3 Expected loss of DIC and DIC$^{BP}$

From the decision viewpoint, DIC and $\mathrm{DIC}^{BP}$ lead to different statistical decisions. If the optimal model selected by DIC is different from that selected by $\mathrm{DIC}^{BP}$, the two statistical decisions are obviously different. Even when DIC and $\mathrm{DIC}^{BP}$ select the same model, the two statistical decisions are still different because the predictions come from two different distributions, namely $p\left(y_{rep}|\bar\theta(y), M_k\right)$ versus $p(y_{rep}|y, M_k)$. In both cases, therefore, the expected losses implied by DIC and $\mathrm{DIC}^{BP}$ are different. Although it is known that the Bayesian predictive distribution is a full predictive distribution and is invariant to reparameterization, questions such as "which predictive distribution should be used for making predictions?" and "is there any difference between using the two predictive distributions for making predictions?" remain unanswered in the literature. The theoretical development of $\mathrm{DIC}^{BP}$ allows us to answer these two important questions.

With the two information criteria, the action space is larger than before. Denote the action space by $\{d_{k0}, d_{k1}\}_{k=1}^{K}$, where $d_{ka}$ ($a \in \{0, 1\}$) means that $M_k$ is selected and the predictions come from $p\left(y_{rep}|\bar\theta_k(y), M_k\right)$ if $a = 0$ (i.e., DIC is the corresponding information criterion) or from $p(y_{rep}|y, M_k)$ if $a = 1$ (i.e., $\mathrm{DIC}^{BP}$ is the corresponding information criterion). Let the two KL divergence functions be represented uniformly as
$$\ell(y, d_{ka}) = 2 \times \mathrm{KL}\left[g(y_{rep}), p(y_{rep}|y, d_{ka})\right],$$
where $p(y_{rep}|y, d_{k0}) := p\left(y_{rep}|\bar\theta_k(y), M_k\right)$ and $p(y_{rep}|y, d_{k1}) := p(y_{rep}|y, M_k)$. The expected loss associated with $d_{ka}$ is
$$\mathrm{Risk}(d_{ka}) = E_y\left[\ell(y, d_{ka})\right] = \int \ell(y, d_{ka})g(y)\, dy.$$


Hence, the model selection problem is equivalent to the following statistical decision:
$$\min_{a \in \{0,1\}}\ \min_{k \in \{1, \cdots, K\}} \mathrm{Risk}(d_{ka}). \tag{25}$$
According to Lemma 3.3, $P_D = P + o_p(1)$. Since $P > 0$ and $\ln 2 + 1 < 2$, for any model $M_k$ we have $\mathrm{DIC} > \mathrm{DIC}^{BP}$ with probability approaching one (w.p.a.1). As a result, $E_y(\mathrm{DIC}) > E_y\left(\mathrm{DIC}^{BP}\right)$ w.p.a.1. Following Theorem 3.1 and Theorem 4.1, w.p.a.1, we have
$$E_y E_{y_{rep}}\left[-2\ln p\left(y_{rep}|\bar\theta_k(y), M_k\right)\right] > E_y E_{y_{rep}}\left[-2\ln p(y_{rep}|y, M_k)\right],$$
$$2E_y E_{y_{rep}}\left[\ln g(y_{rep})\right] - E_y E_{y_{rep}}\left[2\ln p\left(y_{rep}|\bar\theta_k(y), M_k\right)\right] > 2E_y E_{y_{rep}}\left[\ln g(y_{rep})\right] - E_y E_{y_{rep}}\left[2\ln p(y_{rep}|y, M_k)\right],$$
and
$$\mathrm{Risk}(d_{k0}) = E_y\left[\ell(y, d_{k0})\right] > E_y\left[\ell(y, d_{k1})\right] = \mathrm{Risk}(d_{k1}). \tag{26}$$
Hence, w.p.a.1,
$$\min_{k \in \{1, \cdots, K\}} \mathrm{Risk}(d_{k0}) > \min_{k \in \{1, \cdots, K\}} \mathrm{Risk}(d_{k1}), \tag{27}$$
and
$$\arg\min_{a \in \{0,1\}}\left[\min_{k \in \{1, \cdots, K\}} \mathrm{Risk}(d_{ka})\right] = 1.$$
This means that the optimal solution to the statistical decision problem given in Section 2.1 is obtained by $\mathrm{DIC}^{BP}$ w.p.a.1. This is true even in the case where $\arg\min_{k \in \{1, \cdots, K\}} \mathrm{Risk}(d_{k0}) = \arg\min_{k \in \{1, \cdots, K\}} \mathrm{Risk}(d_{k1})$. Therefore, as far as the expected loss is concerned, the Bayesian predictive distribution, not the plug-in predictive distribution, should be used for making predictions.

5 Conclusion

This paper provides a rigorous decision-theoretic justification of DIC based on a set of regularity conditions but without requiring the iid assumption. The candidate model is not required to encompass the DGP. It is shown that DIC is an asymptotically unbiased estimator of the expected KL divergence between the DGP and the plug-in predictive distribution.

Based on the Bayesian predictive distribution, a new information criterion ($\mathrm{DIC}^{BP}$) is constructed for model selection. Its first term has the same expression as that of DIC, but its penalty term is smaller than that of DIC. The asymptotic justification of $\mathrm{DIC}^{BP}$ is provided in the same way as DIC has been justified. The expected loss of $\mathrm{DIC}^{BP}$ is compared with that of DIC and BPIC. It is shown that, as $n \to \infty$, $\mathrm{DIC}^{BP}$ leads to a smaller expected loss than DIC and BPIC. From the decision viewpoint, the Bayesian predictive distribution and $\mathrm{DIC}^{BP}$, not the plug-in predictive distribution or DIC, lead to the optimal decision action.


Although the theoretical framework under which we justify DIC and $\mathrm{DIC}^{BP}$ is general, it requires the consistency of the posterior mean, the asymptotic normal approximation to the posterior distribution, and the asymptotic normality of the QML estimator. When there are latent variables in the candidate model, such that the number of latent variables grows as $n$ grows and the parameter space is enlarged to include the latent variables, the consistency and the asymptotic normality may not hold. As a result, DIC and $\mathrm{DIC}^{BP}$ are not justified in that setting. Moreover, when the data are nonstationary, the asymptotic normality may not hold. In this case, it remains unknown whether or not DIC and $\mathrm{DIC}^{BP}$ are still justified.

Appendix

Notation

$:=$  definitional equality
$o(1)$  tends to zero
$o_p(1)$  tends to zero in probability
$\stackrel{p}{\to}$  converges in probability
$\bar\theta_n$  posterior mean
$\overleftrightarrow{\theta}_n$  posterior mode
$\hat\theta_n$  QML estimate
$\theta_n^p$  pseudo-true parameter
$\hat\theta_{AT}$  arg max of $2\ln p(y|\theta) + \ln p(\theta)$
$\tilde\theta_n$  arg max of $\ln p(y_{rep}|\theta) + \ln p(y|\theta) + \ln p(\theta)$

Proof of Lemma 3.1

$$\frac{1}{n}\sum_{t=1}^{n}\left[l_t(\theta) - l_t(\theta_n^p)\right] = \frac{1}{n}\sum_{t=1}^{n}\left(l_t(\theta) - E\left[l_t(\theta)\right]\right) + \frac{1}{n}\sum_{t=1}^{n}\left(E\left[l_t(\theta)\right] - E\left[l_t(\theta_n^p)\right]\right) + \frac{1}{n}\sum_{t=1}^{n}\left(E\left[l_t(\theta_n^p)\right] - l_t(\theta_n^p)\right).$$
From (9), we know that for any $\varepsilon > 0$ there exist $\delta_1(\varepsilon) > 0$ and $N(\varepsilon) > 0$ such that, for all $n > N(\varepsilon)$,
$$\frac{1}{n}\sum_{t=1}^{n}\left\{E\left[l_t(\theta)\right] - E\left[l_t(\theta_n^p)\right]\right\} < -\delta_1(\varepsilon)$$
whenever $\theta \in \Theta\setminus N(\theta_n^p, \varepsilon)$. Thus, for any $\varepsilon > 0$, if $\theta \in \Theta\setminus N(\theta_n^p, \varepsilon)$ and $n > N(\varepsilon)$,
$$\frac{1}{n}\sum_{t=1}^{n}\left[l_t(\theta) - l_t(\theta_n^p)\right] < \frac{1}{n}\sum_{t=1}^{n}\left(l_t(\theta) - E\left[l_t(\theta)\right]\right) - \delta_1(\varepsilon) + \frac{1}{n}\sum_{t=1}^{n}\left(E\left[l_t(\theta_n^p)\right] - l_t(\theta_n^p)\right),$$
and
$$
\begin{aligned}
\sup_{\Theta\setminus N(\theta_n^p, \varepsilon)}\frac{1}{n}\sum_{t=1}^{n}\left[l_t(\theta) - l_t(\theta_n^p)\right]
&\le \sup_{\Theta\setminus N(\theta_n^p, \varepsilon)}\frac{1}{n}\sum_{t=1}^{n}\left(l_t(\theta) - E\left[l_t(\theta)\right]\right) - \delta_1(\varepsilon) + \frac{1}{n}\sum_{t=1}^{n}\left(E\left[l_t(\theta_n^p)\right] - l_t(\theta_n^p)\right)\\
&\le \sup_{\Theta\setminus N(\theta_n^p, \varepsilon)}\left|\frac{1}{n}\sum_{t=1}^{n}\left(l_t(\theta) - E\left[l_t(\theta)\right]\right)\right| - \delta_1(\varepsilon) + \left|\frac{1}{n}\sum_{t=1}^{n}\left(E\left[l_t(\theta_n^p)\right] - l_t(\theta_n^p)\right)\right|\\
&\le 2\sup_{\theta\in\Theta}\left|\frac{1}{n}\sum_{t=1}^{n}\left(l_t(\theta) - E\left[l_t(\theta)\right]\right)\right| - \delta_1(\varepsilon). \qquad (28)
\end{aligned}
$$
Under Assumptions 1-6, the uniform convergence condition is satisfied, i.e.,
$$P\left(\sup_{\theta\in\Theta}\left|\frac{1}{n}\sum_{t=1}^{n}\left(l_t(\theta) - E\left[l_t(\theta)\right]\right)\right| < \varepsilon\right) \to 1. \tag{29}$$
From the uniform convergence, if we choose $\delta_2$ such that $0 < \delta_2 < \delta_1(\varepsilon)/2$, we have
$$P\left[\sup_{\theta\in\Theta}\left|\frac{1}{n}\sum_{t=1}^{n}\left(l_t(\theta) - E\left[l_t(\theta)\right]\right)\right| < \delta_2\right] \to 1.$$
Hence,
$$P\left[2\sup_{\theta\in\Theta}\left|\frac{1}{n}\sum_{t=1}^{n}\left(l_t(\theta) - E\left[l_t(\theta)\right]\right)\right| - \delta_1(\varepsilon) < 2\delta_2 - \delta_1(\varepsilon)\right] \to 1.$$
From (28), we have
$$P\left[2\sup_{\theta\in\Theta}\left|\frac{1}{n}\sum_{t=1}^{n}\left(l_t(\theta) - E\left[l_t(\theta)\right]\right)\right| - \delta_1(\varepsilon) < 2\delta_2 - \delta_1(\varepsilon)\right] \le P\left\{\sup_{\Theta\setminus N(\theta_n^p, \varepsilon)}\frac{1}{n}\left[\sum_{t=1}^{n}l_t(\theta) - \sum_{t=1}^{n}l_t(\theta_n^p)\right] < 2\delta_2 - \delta_1(\varepsilon)\right\}.$$
Letting $K_1(\varepsilon) = -\left(2\delta_2 - \delta_1(\varepsilon)\right) > 0$, we have, for any $\varepsilon$,
$$\lim_{n\to\infty}P\left\{\sup_{\Theta\setminus N(\theta_n^p, \varepsilon)}\frac{1}{n}\left[\sum_{t=1}^{n}l_t(\theta) - \sum_{t=1}^{n}l_t(\theta_n^p)\right] < -K_1(\varepsilon)\right\} = 1,$$
which proves the consistency condition given by (10). The proofs of the other two concentration conditions, (11) and (12), are similar and hence omitted.

Proof of Lemma 3.2: See the technical supplement (Li et al., 2017).

Proof of Lemma 3.3

From the definition,
\begin{align*}
P_D &= \int -2\left[\ln p(y|\theta)-\ln p(y|\bar{\theta}_n)\right]p(\theta|y)\,d\theta
= -2\int \ln p(y|\theta)p(\theta|y)\,d\theta + 2\ln p(y|\bar{\theta}_n)\\
&= -2\sum_{t=1}^{n}\int l_t(\theta)p(\theta|y)\,d\theta + 2\ln p(y|\bar{\theta}_n).
\end{align*}
By Lemma 3.2 we have
\[
\int l_t(\theta)p(\theta|y)\,d\theta = l_t(\hat{\theta}_n) + \frac{1}{n}B_{t,1} + \frac{1}{n^2}\left(B_{t,2}-B_{t,3}\right) + O_p\left(\frac{1}{n^3}\right),
\]
where $B_{t,i}$ is defined in Lemma 3.2 for $i=1,2,3$. Note that
\begin{align*}
\frac{1}{n}\sum_{t=1}^{n}B_{t,1}
&= -\frac{1}{2}\mathrm{tr}\left[H_n(\hat{\theta}_n)^{-1}\frac{1}{n}\sum_{t=1}^{n}\nabla^2 l_t(\hat{\theta}_n)\right]
-\frac{1}{n}\sum_{t=1}^{n}\nabla l_t(\hat{\theta}_n)'H_n(\hat{\theta}_n)^{-1}\frac{\nabla p}{p}
-\frac{1}{2}\mathrm{vec}\left(H_n(\hat{\theta}_n)^{-1}\right)'H_n^{(3)}(\hat{\theta}_n)H_n(\hat{\theta}_n)^{-1}\frac{1}{n}\sum_{t=1}^{n}\nabla l_t(\hat{\theta}_n)\\
&= -\frac{1}{2}\mathrm{tr}\left[H_n(\hat{\theta}_n)^{-1}H_n(\hat{\theta}_n)\right] = -\frac{1}{2}P,
\end{align*}
since $\sum_{t=1}^{n}\nabla l_t(\hat{\theta}_n)=0$. For the same reason,
\begin{align*}
\frac{1}{n}\sum_{t=1}^{n}B_{t,2}
&= -\frac{5}{16}\mathrm{tr}\left[A_2\right]\mathrm{tr}\left[H_n(\hat{\theta}_n)^{-1}\frac{1}{n}\sum_{t=1}^{n}\nabla^2 l_t(\hat{\theta}_n)\right]
+\frac{35}{48}A_1\,\mathrm{tr}\left[H_n(\hat{\theta}_n)^{-1}\frac{1}{n}\sum_{t=1}^{n}\nabla^2 l_t(\hat{\theta}_n)\right]\\
&\quad-\frac{5}{12}\mathrm{vec}\left(H_n(\hat{\theta}_n)^{-1}\right)'H_n^{(3)}(\hat{\theta}_n)H_n(\hat{\theta}_n)^{-1}\frac{1}{n}\sum_{t=1}^{n}\nabla^3 l_t(\hat{\theta}_n)'\mathrm{vec}\left(H_n(\hat{\theta}_n)^{-1}\right)
+\frac{1}{8}\mathrm{tr}\left[\left[H_n(\hat{\theta}_n)^{-1}\otimes\mathrm{vec}\left(H_n(\hat{\theta}_n)^{-1}\right)\right]\frac{1}{n}\sum_{t=1}^{n}\nabla^4 l_t(\hat{\theta}_n)'\right]\\
&\quad-\frac{5}{4}\mathrm{vec}\left(H_n(\hat{\theta}_n)^{-1}\right)'H_n^{(3)}(\hat{\theta}_n)H_n(\hat{\theta}_n)^{-1}\frac{1}{n}\sum_{t=1}^{n}\nabla^2 l_t(\hat{\theta}_n)H_n(\hat{\theta}_n)^{-1}\frac{\nabla p}{p}
+\frac{1}{2}\mathrm{vec}\left(H_n(\hat{\theta}_n)^{-1}\right)'\frac{1}{n}\sum_{t=1}^{n}\nabla^3 l_t(\hat{\theta}_n)H_n(\hat{\theta}_n)^{-1}\frac{\nabla p}{p}\\
&\quad+\frac{3}{4}\mathrm{tr}\left[H_n(\hat{\theta}_n)^{-1}\frac{1}{n}\sum_{t=1}^{n}\nabla^2 l_t(\hat{\theta}_n)\right]\mathrm{tr}\left[H_n(\hat{\theta}_n)^{-1}\frac{\nabla^2 p}{p}\right]\\
&= -\frac{5}{16}\mathrm{tr}\left[A_2\right]P + \frac{35}{48}A_1 P - \frac{5}{12}A_1 + \frac{1}{8}\mathrm{tr}\left[A_2\right]
-\frac{3}{4}\mathrm{vec}\left(H_n(\hat{\theta}_n)^{-1}\right)'H_n^{(3)}(\hat{\theta}_n)H_n(\hat{\theta}_n)^{-1}\frac{\nabla p}{p}
+\frac{3}{4}P\,\mathrm{tr}\left[H_n(\hat{\theta}_n)^{-1}\frac{\nabla^2 p}{p}\right]\\
&= \frac{2-5P}{16}\mathrm{tr}\left[A_2\right] + \frac{-20+35P}{48}A_1 - \frac{3}{4}C_{21} + \frac{3P}{4}C_{22} + \frac{3P}{4}C_{23},
\end{align*}
where $C_{21}$, $C_{22}$ and $C_{23}$ are defined in Lemma 3.3. Clearly the first two terms are not related to the prior while the last three terms are related to the prior. We can also show that
\begin{align*}
\frac{1}{n}\sum_{t=1}^{n}B_{t,3}
&= B_4\cdot\frac{1}{n}\sum_{t=1}^{n}B_{t,1}\\
&= -\frac{P}{2}\left(-\frac{1}{2}\mathrm{tr}\left[H_n(\hat{\theta}_n)^{-1}\frac{\nabla^2 p}{p}\right]
+\frac{1}{2}\mathrm{vec}\left(H_n(\hat{\theta}_n)^{-1}\right)'H_n^{(3)}(\hat{\theta}_n)H_n(\hat{\theta}_n)^{-1}\frac{\nabla p}{p}\right)
-\frac{P}{2}\left(-\frac{5}{24}A_1+\frac{1}{8}\mathrm{tr}\left[A_2\right]\right)\\
&= \frac{P}{4}C_{22} + \frac{P}{4}C_{23} - \frac{P}{4}C_{21} + \frac{P}{2}\left(\frac{5}{24}A_1-\frac{1}{8}\mathrm{tr}\left[A_2\right]\right).
\end{align*}
Hence,
\begin{align*}
\sum_{t=1}^{n}\int l_t(\theta)p(\theta|y)\,d\theta
&= \sum_{t=1}^{n}l_t(\hat{\theta}_n) - \frac{P}{2}
+\frac{1}{n}\left(\frac{2-5P}{16}\mathrm{tr}\left[A_2\right]+\frac{-20+35P}{48}A_1-\frac{3}{4}C_{21}+\frac{3P}{4}C_{22}+\frac{3P}{4}C_{23}\right)\\
&\quad+\frac{1}{n}\left[-\frac{P}{4}C_{22}-\frac{P}{4}C_{23}+\frac{P}{4}C_{21}-\frac{P}{2}\left(\frac{5}{24}A_1-\frac{1}{8}\mathrm{tr}\left[A_2\right]\right)\right]+O_p\left(n^{-2}\right)\\
&= \sum_{t=1}^{n}l_t(\hat{\theta}_n) - \frac{P}{2}
+\frac{1}{n}\left(\frac{1-2P}{8}\mathrm{tr}\left[A_2\right]+\frac{-10+15P}{24}A_1-\frac{3-P}{4}C_{21}+\frac{P}{2}C_{22}+\frac{P}{2}C_{23}\right)+O_p\left(n^{-2}\right).
\end{align*}

From the stochastic expansion, we have
\[
\bar{\theta}_n = \hat{\theta}_n - \frac{1}{n}H_n(\hat{\theta}_n)^{-1}\frac{\nabla p}{p}
+\frac{1}{2n}H_n(\hat{\theta}_n)^{-1}H_n^{(3)}(\hat{\theta}_n)'\mathrm{vec}\left(H_n(\hat{\theta}_n)^{-1}\right)+O_p\left(\frac{1}{n^2}\right),
\]
and
\[
\hat{\theta}_n-\bar{\theta}_n = \frac{1}{n}H_n(\hat{\theta}_n)^{-1}\frac{\nabla p}{p}
-\frac{1}{2n}H_n(\hat{\theta}_n)^{-1}H_n^{(3)}(\hat{\theta}_n)'\mathrm{vec}\left(H_n(\hat{\theta}_n)^{-1}\right)+O_p\left(\frac{1}{n^2}\right). \qquad (30)
\]
By the Taylor expansion, we get
\begin{align*}
\ln p(y|\bar{\theta}_n)
&= \ln p(y|\hat{\theta}_n)+\frac{\partial\ln p(y|\hat{\theta}_n)}{\partial\theta'}\left(\bar{\theta}_n-\hat{\theta}_n\right)
+\frac{1}{2}\left(\bar{\theta}_n-\hat{\theta}_n\right)'\frac{\partial^2\ln p(y|\hat{\theta}_n)}{\partial\theta\partial\theta'}\left(\bar{\theta}_n-\hat{\theta}_n\right)+O_p\left(\frac{1}{n^2}\right)\\
&= \ln p(y|\hat{\theta}_n)
+\frac{1}{2n}\left(\frac{\nabla p}{p}\right)'H_n(\hat{\theta}_n)^{-1}H_n(\hat{\theta}_n)H_n(\hat{\theta}_n)^{-1}\frac{\nabla p}{p}\\
&\quad+\frac{1}{8n}\mathrm{vec}\left(H_n(\hat{\theta}_n)^{-1}\right)'\frac{1}{n}\sum_{t=1}^{n}\nabla^3 l_t(\hat{\theta}_n)H_n(\hat{\theta}_n)^{-1}H_n(\hat{\theta}_n)H_n(\hat{\theta}_n)^{-1}\frac{1}{n}\sum_{t=1}^{n}\nabla^3 l_t(\hat{\theta}_n)'\mathrm{vec}\left(H_n(\hat{\theta}_n)^{-1}\right)\\
&\quad-\frac{1}{2n}\left(\frac{\nabla p}{p}\right)'H_n(\hat{\theta}_n)^{-1}H_n(\hat{\theta}_n)H_n(\hat{\theta}_n)^{-1}\frac{1}{n}\sum_{t=1}^{n}\nabla^3 l_t(\hat{\theta}_n)'\mathrm{vec}\left(H_n(\hat{\theta}_n)^{-1}\right)+O_p(n^{-2})\\
&= \ln p(y|\hat{\theta}_n)-\frac{1}{2n}C_{21}+\frac{1}{2n}C_{23}+\frac{1}{8n}A_1+O_p(n^{-2}). \qquad (31)
\end{align*}
Hence, from (30) and (31) we have
\begin{align*}
P_D &= -2\sum_{t=1}^{n}\int l_t(\theta)p(\theta|y)\,d\theta + 2\ln p(y|\bar{\theta}_n)\\
&= -2\ln p(y|\hat{\theta}_n) + P
+\frac{1}{n}\left(-\frac{1-2P}{4}\mathrm{tr}\left[A_2\right]-\frac{-10+15P}{12}A_1+\frac{3-P}{2}C_{21}-PC_{22}-PC_{23}\right)
+2\ln p(y|\bar{\theta}_n)+O_p\left(n^{-2}\right)\\
&= P+\frac{1}{n}\left(-C_{21}+C_{23}+\frac{1}{4}A_1\right)
+\frac{1}{n}\left(-\frac{1-2P}{4}\mathrm{tr}\left[A_2\right]-\frac{-10+15P}{12}A_1+\frac{3-P}{2}C_{21}-PC_{22}-PC_{23}\right)+O_p\left(n^{-2}\right)\\
&= P+\frac{1}{n}\left(-\frac{1-2P}{4}\mathrm{tr}\left[A_2\right]-\frac{-13+15P}{12}A_1\right)
+\frac{1}{n}\left(\frac{1-P}{2}C_{21}-PC_{22}+(1-P)C_{23}\right)+O_p\left(n^{-2}\right)\\
&= P+\frac{1}{n}C_1+\frac{1}{n}C_2+O_p\left(n^{-2}\right), \qquad (32)
\end{align*}

where $C_1$ and $C_2$ are defined in Lemma 3.3. From the definition of DIC, (31) and (32), we have
\begin{align*}
\mathrm{DIC} &= -4\sum_{t=1}^{n}\int l_t(\theta)p(\theta|y)\,d\theta + 2\ln p(y|\bar{\theta}_n)
= -2\sum_{t=1}^{n}\int l_t(\theta)p(\theta|y)\,d\theta + P_D\\
&= -2\ln p(y|\hat{\theta}_n)+P
+\frac{1}{n}\left(-\frac{1-2P}{4}\mathrm{tr}\left[A_2\right]-\frac{-10+15P}{12}A_1+\frac{3-P}{2}C_{21}-PC_{22}-PC_{23}\right)\\
&\quad+P+\frac{1}{n}\left(-\frac{1-2P}{4}\mathrm{tr}\left[A_2\right]-\frac{-13+15P}{12}A_1+\frac{1-P}{2}C_{21}-PC_{22}+(1-P)C_{23}\right)+O_p\left(n^{-2}\right)\\
&= -2\ln p(y|\hat{\theta}_n)+2P
+\frac{1}{n}\left[-\frac{1-2P}{2}\mathrm{tr}\left[A_2\right]-\frac{-23+30P}{12}A_1+(2-P)C_{21}-2PC_{22}+(1-2P)C_{23}\right]+O_p\left(n^{-2}\right)\\
&= \mathrm{AIC}+\frac{1}{n}D_1+\frac{1}{n}D_2+O_p\left(n^{-2}\right),
\end{align*}
where $D_1$ and $D_2$ are defined in Lemma 3.3.
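As a purely illustrative check of the expansion $P_D = P + O_p(n^{-1})$ and of how $P_D$ and DIC are assembled from MCMC output, the following Python sketch uses a toy normal location model with known variance and a flat prior; the model, sample size and variable names are our own assumptions for illustration and are not part of the paper. In this conjugate case the posterior is available in closed form, so exact posterior draws stand in for an MCMC sampler, and $P_D$ computed from the draws is close to $P=1$.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy setup (assumed for illustration): y_t ~ N(mu, 1) with a flat prior on mu,
    # so p(mu | y) = N(ybar, 1/n) and the plug-in estimate is the posterior mean.
    n = 500
    y = rng.normal(loc=0.3, scale=1.0, size=n)

    def loglik(mu):
        # ln p(y | mu) under the N(mu, 1) model
        return -0.5 * n * np.log(2 * np.pi) - 0.5 * np.sum((y - mu) ** 2)

    draws = rng.normal(y.mean(), 1.0 / np.sqrt(n), size=20000)  # stand-in for MCMC output

    Dbar = np.mean([-2.0 * loglik(mu) for mu in draws])  # posterior mean of the deviance
    Dhat = -2.0 * loglik(draws.mean())                   # deviance at the posterior mean
    PD = Dbar - Dhat                                     # effective number of parameters
    DIC = Dhat + 2.0 * PD                                # equivalently Dbar + PD

    print(f"PD  = {PD:.3f}   (Lemma 3.3: PD = P + O_p(1/n); here P = 1)")
    print(f"DIC = {DIC:.3f}")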

Proof of Theorem 3.1

We write $H_n(\theta_n^p)$ as $\bar{H}_n$, $B_n(\theta_n^p)$ as $\bar{B}_n$, and let $\bar{C}_n=\bar{H}_n^{-1}\bar{B}_n\bar{H}_n^{-1}$. Under Assumptions 1-10, we can show that
\[
\bar{\theta}_n(y)=\hat{\theta}_n(y)+O_p\left(n^{-1}\right) \qquad (33)
\]
by (30). Then we have
\[
\bar{\theta}_n(y)=\theta_n^p+O_p\left(n^{-1/2}\right),
\]
\[
\frac{1}{\sqrt{n}}\bar{B}_n^{-1/2}\frac{\partial\ln p(y_{rep}|\theta_n^p)}{\partial\theta}\xrightarrow{d}N(0,I_P), \qquad (34)
\]
and
\[
\bar{C}_n^{-1/2}\sqrt{n}\left(\bar{\theta}_n(y)-\theta_n^p\right)\xrightarrow{d}N(0,I_P). \qquad (35)
\]
Note that
\begin{align*}
E_yE_{y_{rep}}\left(-2\ln p(y_{rep}|\bar{\theta}_n(y))\right)
&= \left[E_yE_{y_{rep}}\left(-2\ln p(y_{rep}|\bar{\theta}_n(y_{rep}))\right)\right] && (T_1)\\
&\quad+\left[E_yE_{y_{rep}}\left(-2\ln p(y_{rep}|\theta_n^p)\right)-E_yE_{y_{rep}}\left(-2\ln p(y_{rep}|\bar{\theta}_n(y_{rep}))\right)\right] && (T_2)\\
&\quad+\left[E_yE_{y_{rep}}\left(-2\ln p(y_{rep}|\bar{\theta}_n(y))\right)-E_yE_{y_{rep}}\left(-2\ln p(y_{rep}|\theta_n^p)\right)\right] && (T_3).
\end{align*}
Now let us analyze $T_2$ and $T_3$. First expand $\ln p(y_{rep}|\theta_n^p)$ at $\bar{\theta}_n(y_{rep})$:
\begin{align*}
\ln p(y_{rep}|\theta_n^p)
&= \ln p(y_{rep}|\bar{\theta}_n(y_{rep}))
+\frac{\partial\ln p(y_{rep}|\bar{\theta}_n(y_{rep}))}{\partial\theta'}\left(\theta_n^p-\bar{\theta}_n(y_{rep})\right)
+\frac{1}{2}\left(\theta_n^p-\bar{\theta}_n(y_{rep})\right)'\frac{\partial^2\ln p(y_{rep}|\bar{\theta}_n(y_{rep}))}{\partial\theta\partial\theta'}\left(\theta_n^p-\bar{\theta}_n(y_{rep})\right)+o_p(1)\\
&= \ln p(y_{rep}|\bar{\theta}_n(y_{rep}))
+\frac{\partial\ln p(y_{rep}|\hat{\theta}_n(y_{rep}))}{\partial\theta'}\left(\theta_n^p-\bar{\theta}_n(y_{rep})\right)
+\left(\bar{\theta}_n(y_{rep})-\hat{\theta}_n(y_{rep})\right)'\frac{\partial^2\ln p(y_{rep}|\bar{\theta}_n(y_{rep}))}{\partial\theta\partial\theta'}\left(\theta_n^p-\bar{\theta}_n(y_{rep})\right)\\
&\quad+\frac{1}{2}\left(\theta_n^p-\bar{\theta}_n(y_{rep})\right)'\frac{\partial^2\ln p(y_{rep}|\bar{\theta}_n(y_{rep}))}{\partial\theta\partial\theta'}\left(\theta_n^p-\bar{\theta}_n(y_{rep})\right)+o_p(1)\\
&= \ln p(y_{rep}|\bar{\theta}_n(y_{rep}))
+\frac{1}{2}\left(\theta_n^p-\bar{\theta}_n(y_{rep})\right)'\frac{\partial^2\ln p(y_{rep}|\bar{\theta}_n(y_{rep}))}{\partial\theta\partial\theta'}\left(\theta_n^p-\bar{\theta}_n(y_{rep})\right)+o_p(1),
\end{align*}
from (33). Then we have
\begin{align*}
T_2 &= E_yE_{y_{rep}}\left[-2\ln p(y_{rep}|\theta_n^p)+2\ln p(y_{rep}|\bar{\theta}_n(y_{rep}))\right]\\
&= E_yE_{y_{rep}}\left[-\left(\bar{\theta}_n(y_{rep})-\theta_n^p\right)'\frac{\partial^2\ln p(y_{rep}|\bar{\theta}_n(y_{rep}))}{\partial\theta\partial\theta'}\left(\bar{\theta}_n(y_{rep})-\theta_n^p\right)+o_p(1)\right]\\
&= E_{y_{rep}}\left[-\left(\bar{\theta}_n(y_{rep})-\theta_n^p\right)'\frac{\partial^2\ln p(y_{rep}|\bar{\theta}_n(y_{rep}))}{\partial\theta\partial\theta'}\left(\bar{\theta}_n(y_{rep})-\theta_n^p\right)\right]+o(1)\\
&= E_y\left[-\left(\bar{\theta}_n(y)-\theta_n^p\right)'\frac{\partial^2\ln p(y|\bar{\theta}_n(y))}{\partial\theta\partial\theta'}\left(\bar{\theta}_n(y)-\theta_n^p\right)\right]+o(1),
\end{align*}
by Assumptions 5-6 and the dominated convergence theorem (Chung, 2001; DasGupta, 2008). Next, we expand $\ln p(y_{rep}|\bar{\theta}_n(y))$ at $\theta_n^p$:
\[
\ln p(y_{rep}|\bar{\theta}_n(y)) = \ln p(y_{rep}|\theta_n^p)
+\frac{\partial\ln p(y_{rep}|\theta_n^p)}{\partial\theta'}\left(\bar{\theta}_n(y)-\theta_n^p\right)
+\frac{1}{2}\left(\bar{\theta}_n(y)-\theta_n^p\right)'\frac{\partial^2\ln p(y_{rep}|\theta_n^p)}{\partial\theta\partial\theta'}\left(\bar{\theta}_n(y)-\theta_n^p\right)+o_p(1).
\]
Substituting the above expansion into $T_3$, we have
\begin{align*}
T_3 &= E_yE_{y_{rep}}\left(-2\ln p(y_{rep}|\bar{\theta}_n(y))\right)-E_yE_{y_{rep}}\left(-2\ln p(y_{rep}|\theta_n^p)\right)\\
&= E_yE_{y_{rep}}\left[-2\frac{\partial\ln p(y_{rep}|\theta_n^p)}{\partial\theta'}\left(\bar{\theta}_n(y)-\theta_n^p\right)
-\left(\bar{\theta}_n(y)-\theta_n^p\right)'\frac{\partial^2\ln p(y_{rep}|\theta_n^p)}{\partial\theta\partial\theta'}\left(\bar{\theta}_n(y)-\theta_n^p\right)\right]+o(1)\\
&= -2E_{y_{rep}}\left(\frac{\partial\ln p(y_{rep}|\theta_n^p)}{\partial\theta'}\right)E_y\left[\bar{\theta}_n(y)-\theta_n^p\right]
+E_y\left[-\left(\bar{\theta}_n(y)-\theta_n^p\right)'E_{y_{rep}}\left(\frac{\partial^2\ln p(y_{rep}|\theta_n^p)}{\partial\theta\partial\theta'}\right)\left(\bar{\theta}_n(y)-\theta_n^p\right)\right]+o(1)\\
&= E_y\left[-\sqrt{n}\left(\bar{\theta}_n(y)-\theta_n^p\right)'E_y\left(\frac{1}{n}\frac{\partial^2\ln p(y|\theta_n^p)}{\partial\theta\partial\theta'}\right)\sqrt{n}\left(\bar{\theta}_n(y)-\theta_n^p\right)\right]+o(1),
\end{align*}
since
\[
E_yE_{y_{rep}}\left[-2\frac{\partial\ln p(y_{rep}|\theta_n^p)}{\partial\theta'}\left(\bar{\theta}_n(y)-\theta_n^p\right)\right]
=E_{y_{rep}}\left[-2\frac{\partial\ln p(y_{rep}|\theta_n^p)}{\partial\theta'}\right]E_y\left[\bar{\theta}_n(y)-\theta_n^p\right]=0
\]
by (34), (35), and the dominated convergence theorem. Note that
\[
\frac{1}{n}\frac{\partial^2\ln p(y|\bar{\theta}_n(y))}{\partial\theta\partial\theta'}
=E_y\left(\frac{1}{n}\frac{\partial^2\ln p(y|\theta_n^p)}{\partial\theta\partial\theta'}\right)+o_p(1),
\]
under Assumptions 1-10 and by the uniform law of large numbers. Hence, we get
\begin{align*}
T_2 &= E_y\left[-\left(\bar{\theta}_n(y)-\theta_n^p\right)'\frac{\partial^2\ln p(y|\bar{\theta}_n(y))}{\partial\theta\partial\theta'}\left(\bar{\theta}_n(y)-\theta_n^p\right)\right]+o(1)\\
&= E_y\left[-\sqrt{n}\left(\bar{\theta}_n(y)-\theta_n^p\right)'\frac{1}{n}E_y\left(\frac{\partial^2\ln p(y|\theta_n^p)}{\partial\theta\partial\theta'}\right)\sqrt{n}\left(\bar{\theta}_n(y)-\theta_n^p\right)+o_p(1)\right]+o(1)
= T_3+o(1).
\end{align*}
So we only need to analyze $T_3$. Note that
\begin{align*}
T_3 &= E_y\left[\sqrt{n}\left(\bar{\theta}_n(y)-\theta_n^p\right)'\left(-\bar{H}_n\right)\sqrt{n}\left(\bar{\theta}_n(y)-\theta_n^p\right)\right]+o(1)\\
&= E_y\left[\left(\bar{C}_n^{-1/2}\sqrt{n}\left(\bar{\theta}_n(y)-\theta_n^p\right)\right)'\bar{C}_n^{1/2}\left(-\bar{H}_n\right)\bar{C}_n^{1/2}\,\bar{C}_n^{-1/2}\sqrt{n}\left(\bar{\theta}_n(y)-\theta_n^p\right)\right]+o(1)\\
&= E_y\,\mathrm{tr}\left[\left(-\bar{H}_n\right)\bar{C}_n^{1/2}\,\bar{C}_n^{-1/2}\sqrt{n}\left(\bar{\theta}_n(y)-\theta_n^p\right)\sqrt{n}\left(\bar{\theta}_n(y)-\theta_n^p\right)'\bar{C}_n^{-1/2}\,\bar{C}_n^{1/2}\right]+o(1)\\
&= \mathrm{tr}\left\{\left(-\bar{H}_n\right)\bar{C}_n^{1/2}\,E_y\left[\bar{C}_n^{-1/2}\sqrt{n}\left(\bar{\theta}_n(y)-\theta_n^p\right)\sqrt{n}\left(\bar{\theta}_n(y)-\theta_n^p\right)'\bar{C}_n^{-1/2}\right]\bar{C}_n^{1/2}\right\}+o(1),
\end{align*}
and
\[
E_y\left[\bar{C}_n^{-1/2}\sqrt{n}\left(\bar{\theta}_n(y)-\theta_n^p\right)\sqrt{n}\left(\bar{\theta}_n(y)-\theta_n^p\right)'\bar{C}_n^{-1/2}\right]=I_P+o(1).
\]
Hence,
\[
T_3=\mathrm{tr}\left(\left(-\bar{H}_n\right)\bar{C}_n^{1/2}\bar{C}_n^{1/2}\right)+o(1)
=\mathrm{tr}\left(\left(-\bar{H}_n\right)\bar{C}_n\right)+o(1)
=\mathrm{tr}\left(\left(-\bar{H}_n\right)\left(-\bar{H}_n\right)^{-1}\bar{B}_n\left(-\bar{H}_n\right)^{-1}\right)+o(1)
=\mathrm{tr}\left(\bar{B}_n\left(-\bar{H}_n\right)^{-1}\right)+o(1),
\]
and
\begin{align*}
E_y\left[E_{y_{rep}}\left(-2\ln p(y_{rep}|\bar{\theta}_n(y))\right)\right]
&= E_y\left[E_{y_{rep}}\left(-2\ln p(y_{rep}|\bar{\theta}_n(y_{rep}))\right)\right]+2\,\mathrm{tr}\left(\bar{B}_n\left(-\bar{H}_n\right)^{-1}\right)+o(1)\\
&= E_y\left[E_y\left(-2\ln p(y|\bar{\theta}_n(y))\right)\right]+2\,\mathrm{tr}\left(\bar{B}_n\left(-\bar{H}_n\right)^{-1}\right)+o(1)\\
&= E_y\left(-2\ln p(y|\bar{\theta}_n(y))\right)+2\,\mathrm{tr}\left(\bar{B}_n\left(-\bar{H}_n\right)^{-1}\right)+o(1)\\
&= E_y\left(-2\ln p(y|\bar{\theta}_n(y))+2P\right)+o(1),
\end{align*}
under the condition that $\bar{B}_n+\bar{H}_n=o(1)$. Following Lemma 3.3, we get $P_D=P+o_p(1)$. Finally, we have
\[
E_yE_{y_{rep}}\left[-2\ln p(y_{rep}|\bar{\theta}_n(y))\right]
=E_y\left[-2\ln p(y|\bar{\theta}_n)+2P+o_p(1)\right]
=E_y\left[D(\bar{\theta}_n)+2P_D+o_p(1)\right]
=E_y\left[\mathrm{DIC}+o_p(1)\right]=E_y\left[\mathrm{DIC}\right]+o(1),
\]
by Assumption 10 and the dominated convergence theorem.
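Theorem 3.1 states that $E_y[\mathrm{DIC}]$ tracks the out-of-sample risk $E_yE_{y_{rep}}[-2\ln p(y_{rep}|\bar{\theta}_n(y))]$ asymptotically. The sketch below gives a small Monte Carlo illustration of this equality in the same toy normal location model; the model and all names are our own illustrative assumptions, and $P_D$ is replaced by its limit $P=1$ to keep the sketch short.

    import numpy as np

    rng = np.random.default_rng(1)

    mu0, n, n_mc = 0.3, 200, 2000   # assumed DGP mean, sample size, Monte Carlo replications

    def neg2loglik(data, mu):
        return data.size * np.log(2 * np.pi) + np.sum((data - mu) ** 2)

    risk, dic = [], []
    for _ in range(n_mc):
        y = rng.normal(mu0, 1.0, n)
        yrep = rng.normal(mu0, 1.0, n)        # independent replicate from the same DGP
        post_mean = y.mean()                  # posterior mean under the flat prior
        risk.append(neg2loglik(yrep, post_mean))
        dic.append(neg2loglik(y, post_mean) + 2.0 * 1.0)   # DIC with PD replaced by P = 1

    print(f"E_y E_yrep[-2 ln p(yrep | plug-in)] ~ {np.mean(risk):.2f}")
    print(f"E_y [DIC]                           ~ {np.mean(dic):.2f}")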

Proof of Theorem 4.1

Denote
\[
\tilde{\theta}_n:=\arg\max_{\theta}\left\{\ln p(y_{rep}|\theta)+\ln p(y|\theta)+\ln p(\theta)\right\}.
\]
Then we have the following three lemmas under the condition that $y$ and $y_{rep}$ are independent. These three lemmas are used to prove Theorem 4.1.

Lemma 5.1 Under Assumptions 1-7 and 10, $\tilde{\theta}_n\xrightarrow{p}\theta_n^p$.

Proof. The proof follows the argument in Theorem 4.2 of Wooldridge (1994) and Bester and Hansen (2006). Let
\[
Q_n(\theta)=n^{-1}\sum_{t=1}^{n}l_t(y^t,\theta)+n^{-1}\sum_{t=1}^{n}l_t(y_{rep}^t,\theta)+n^{-1}\ln p(\theta),
\qquad
\bar{Q}_n(\theta)=n^{-1}E\left[\ln p(y_{rep}|\theta)+\ln p(y|\theta)\right].
\]
Then we need to show that, for each $\varepsilon>0$,
\[
P\left[\sup_{\theta\in\Theta}\left|Q_n(\theta)-\bar{Q}_n(\theta)\right|>\varepsilon\right]\rightarrow0.
\]
Let $\delta>0$ be a number to be set later. Because $\Theta$ is compact, there exists a finite number of spheres of radius $\delta$ about $\theta_j$, say $\zeta_{\delta}(\theta_j)=\{\theta\in\Theta:\|\theta-\theta_j\|\leq\delta\}$, $j=1,\ldots,K(\delta)$, which cover $\Theta$ (Gallant and White, 1988). Set $\zeta_j=\zeta_{\delta}(\theta_j)$, $K=K(\delta)$. It follows that
\[
P\left[\sup_{\theta\in\Theta}\left|Q_n(\theta)-\bar{Q}_n(\theta)\right|>\varepsilon\right]
\leq P\left[\max_{1\leq j\leq K}\sup_{\theta\in\zeta_j}\left|Q_n(\theta)-\bar{Q}_n(\theta)\right|>\varepsilon\right]
\leq\sum_{j=1}^{K}P\left[\sup_{\theta\in\zeta_j}\left|Q_n(\theta)-\bar{Q}_n(\theta)\right|>\varepsilon\right].
\]
For all $\theta\in\zeta_j$,
\begin{align*}
\left|Q_n(\theta)-\bar{Q}_n(\theta)\right|
&\leq\left|Q_n(\theta)-Q_n(\theta_j)\right|+\left|Q_n(\theta_j)-\bar{Q}_n(\theta_j)\right|+\left|\bar{Q}_n(\theta_j)-\bar{Q}_n(\theta)\right|\\
&\leq\frac{1}{n}\sum_{t=1}^{n}\left|l_t(y^t,\theta)-l_t(y^t,\theta_j)\right|
+\frac{1}{n}\sum_{t=1}^{n}\left|l_t(y_{rep}^t,\theta)-l_t(y_{rep}^t,\theta_j)\right|
+\left|\frac{1}{n}\sum_{t=1}^{n}\left(l_t(y^t,\theta_j)-E\left[l_t(\theta_j)\right]\right)\right|\\
&\quad+\left|\frac{1}{n}\sum_{t=1}^{n}\left(l_t(y_{rep}^t,\theta_j)-E\left[l_t(\theta_j)\right]\right)\right|
+\frac{1}{n}\sum_{t=1}^{n}\left|E\left[l_t(\theta)\right]-E\left[l_t(\theta_j)\right]\right|
+\frac{1}{n}\sum_{t=1}^{n}\left|E\left[l_t(\theta)\right]-E\left[l_t(\theta_j)\right]\right|
+\left|\frac{\ln p(\theta)}{n}\right|,
\end{align*}
where $E\left[l_t(\theta)\right]:=E\left[l_t(y^t,\theta)\right]=E\left[l_t(y_{rep}^t,\theta)\right]$. By Assumption 4, for all $\theta\in\zeta_j$,
\[
\left|l_t(y^t,\theta)-l_t(y^t,\theta_j)\right|\leq c_t(y^t)\|\theta-\theta_j\|\leq\delta c_t(y^t),
\qquad
\left|E\left[l_t(\theta)\right]-E\left[l_t(\theta_j)\right]\right|\leq\delta\bar{c}_t,
\]
where $\bar{c}_t=E\left[c_t(y^t)\right]=E\left[c_t(y_{rep}^t)\right]$. Similarly, we have $\left|l_t(y_{rep}^t,\theta)-l_t(y_{rep}^t,\theta_j)\right|\leq\delta c_t(y_{rep}^t)$. Thus, we have
\begin{align*}
\sup_{\theta\in\zeta_j}\left|Q_n(\theta)-\bar{Q}_n(\theta)\right|
&\leq\frac{\delta}{n}\sum_{t=1}^{n}c_t(y^t)
+\left|\frac{1}{n}\sum_{t=1}^{n}\left(l_t(y^t,\theta_j)-E\left[l_t(\theta_j)\right]\right)\right|
+\frac{\delta}{n}\sum_{t=1}^{n}\bar{c}_t
+\frac{\delta}{n}\sum_{t=1}^{n}c_t(y_{rep}^t)
+\left|\frac{1}{n}\sum_{t=1}^{n}\left(l_t(y_{rep}^t,\theta_j)-E\left[l_t(\theta_j)\right]\right)\right|
+\frac{\delta}{n}\sum_{t=1}^{n}\bar{c}_t
+\left|\frac{\ln p(\theta)}{n}\right|\\
&\leq\frac{2\delta}{n}\sum_{t=1}^{n}\bar{c}_t
+\delta\left|\frac{1}{n}\sum_{t=1}^{n}\left(c_t(y^t)-\bar{c}_t\right)\right|
+\left|\frac{1}{n}\sum_{t=1}^{n}\left(l_t(y^t,\theta_j)-E\left[l_t(\theta_j)\right]\right)\right|
+\frac{2\delta}{n}\sum_{t=1}^{n}\bar{c}_t
+\delta\left|\frac{1}{n}\sum_{t=1}^{n}\left(c_t(y_{rep}^t)-\bar{c}_t\right)\right|
+\left|\frac{1}{n}\sum_{t=1}^{n}\left(l_t(y_{rep}^t,\theta_j)-E\left[l_t(\theta_j)\right]\right)\right|
+\left|\frac{\ln p(\theta)}{n}\right|\\
&\leq4\delta C
+\delta\left|\frac{1}{n}\sum_{t=1}^{n}\left(c_t(y^t)-\bar{c}_t\right)\right|
+\left|\frac{1}{n}\sum_{t=1}^{n}\left(l_t(y^t,\theta_j)-E\left[l_t(\theta_j)\right]\right)\right|
+\delta\left|\frac{1}{n}\sum_{t=1}^{n}\left(c_t(y_{rep}^t)-\bar{c}_t\right)\right|
+\left|\frac{1}{n}\sum_{t=1}^{n}\left(l_t(y_{rep}^t,\theta_j)-E\left[l_t(\theta_j)\right]\right)\right|
+\left|\frac{\ln p(\theta)}{n}\right|,
\end{align*}
where $n^{-1}\sum_{t=1}^{n}\bar{c}_t\leq C<\infty$ by Assumption 4. It follows that
\[
P\left[\sup_{\theta\in\zeta_j}\left|Q_n(\theta)-\bar{Q}_n(\theta)\right|>\varepsilon\right]
\leq P\left[\delta\left|\frac{1}{n}\sum_{t=1}^{n}\left(c_t(y^t)-\bar{c}_t\right)\right|
+\left|\frac{1}{n}\sum_{t=1}^{n}\left(l_t(y^t,\theta_j)-E\left[l_t(\theta_j)\right]\right)\right|
+\delta\left|\frac{1}{n}\sum_{t=1}^{n}\left(c_t(y_{rep}^t)-\bar{c}_t\right)\right|
+\left|\frac{1}{n}\sum_{t=1}^{n}\left(l_t(y_{rep}^t,\theta_j)-E\left[l_t(\theta_j)\right]\right)\right|
+\left|\frac{\ln p(\theta)}{n}\right|>\varepsilon-4\delta C\right].
\]
Now choose $\delta$ such that $0<\delta<\varepsilon/(8C)$ (so that $\varepsilon-4\delta C>\varepsilon/2$). Then
\[
P\left[\sup_{\theta\in\zeta_j}\left|Q_n(\theta)-\bar{Q}_n(\theta)\right|>\varepsilon\right]
\leq P\left[\delta\left|\frac{1}{n}\sum_{t=1}^{n}\left(c_t(y^t)-\bar{c}_t\right)\right|
+\left|\frac{1}{n}\sum_{t=1}^{n}\left(l_t(y^t,\theta_j)-E\left[l_t(\theta_j)\right]\right)\right|
+\delta\left|\frac{1}{n}\sum_{t=1}^{n}\left(c_t(y_{rep}^t)-\bar{c}_t\right)\right|
+\left|\frac{1}{n}\sum_{t=1}^{n}\left(l_t(y_{rep}^t,\theta_j)-E\left[l_t(\theta_j)\right]\right)\right|
+\left|\frac{\ln p(\theta)}{n}\right|>\varepsilon/2\right].
\]
Next, choose $n_0$ so that the probability on the right-hand side is at most $\varepsilon/K$ for all $n\geq n_0$ and all $j=1,\ldots,K$, which is possible by Assumptions 1-10 since $K$ is finite. Hence,
\[
P\left[\sup_{\theta\in\Theta}\left|Q_n(\theta)-\bar{Q}_n(\theta)\right|>\varepsilon\right]\rightarrow0.
\]
It then follows that $Q_n(\theta)$ satisfies a uniform law of large numbers, and the consistency of $\tilde{\theta}_n$ follows by the usual argument.

Lemma 5.2 Under Assumptions 1-10, $(-2\bar{H}_n)^{1/2}\sqrt{n}\left(\tilde{\theta}_n-\theta_n^p\right)\xrightarrow{d}N(0,I_P)$.

Proof. The proof follows from Bester and Hansen (2006). By Lemma 5.1, we have
\begin{align*}
0 &= \frac{1}{n}\sum_{t=1}^{n}\left[\nabla l_t(y^t,\tilde{\theta}_n)+\nabla l_t(y_{rep}^t,\tilde{\theta}_n)\right]+\frac{1}{n}\nabla\ln p(\tilde{\theta}_n)\\
&= \frac{1}{n}\sum_{t=1}^{n}\left[\nabla l_t(y^t,\theta_n^p)+\nabla l_t(y_{rep}^t,\theta_n^p)\right]
+\frac{1}{n}\sum_{t=1}^{n}\left[\nabla^2 l_t(y^t,\dot{\theta}_{n1})+\nabla^2 l_t(y_{rep}^t,\dot{\theta}_{n1})\right]\left(\tilde{\theta}_n-\theta_n^p\right)
+\frac{\nabla\ln p(\tilde{\theta}_n)}{n},
\end{align*}
where $\dot{\theta}_{n1}$ is an intermediate value between $\tilde{\theta}_n$ and $\theta_n^p$. It follows that
\[
\sqrt{n}\left(\tilde{\theta}_n-\theta_n^p\right)
=\left(-n^{-1}\sum_{t=1}^{n}\left[\nabla^2 l_t(y^t,\dot{\theta}_{n1})+\nabla^2 l_t(y_{rep}^t,\dot{\theta}_{n1})\right]\right)^{-1}
\left(n^{-1/2}\sum_{t=1}^{n}\left[\nabla l_t(y^t,\theta_n^p)+\nabla l_t(y_{rep}^t,\theta_n^p)\right]+n^{-1/2}\nabla\ln p(\tilde{\theta}_n)\right).
\]
Under the assumptions, we have
\[
n^{-1/2}\nabla\ln p(\tilde{\theta}_n)=o_p(1),\qquad
-n^{-1}\sum_{t=1}^{n}\nabla^2 l_t(y^t,\dot{\theta}_{n1})\xrightarrow{p}-\bar{H}_n,\qquad
-n^{-1}\sum_{t=1}^{n}\nabla^2 l_t(y_{rep}^t,\dot{\theta}_{n1})\xrightarrow{p}-\bar{H}_n,
\]
\[
\left[-\bar{H}_n\right]^{-1/2}n^{-1/2}\sum_{t=1}^{n}\nabla l_t(y^t,\theta_n^p)\xrightarrow{d}N(0,I_P),\qquad
\left[-\bar{H}_n\right]^{-1/2}n^{-1/2}\sum_{t=1}^{n}\nabla l_t(y_{rep}^t,\theta_n^p)\xrightarrow{d}N(0,I_P).
\]
Note that $\mathrm{Var}\left(n^{-1/2}\sum_{t=1}^{n}\nabla l_t(y^t,\theta_n^p)\right)\rightarrow-\bar{H}_n$ as $n\rightarrow\infty$. By the central limit theorem and the Cramér-Wold device, we get
\[
(-2\bar{H}_n)^{1/2}\sqrt{n}\left(\tilde{\theta}_n-\theta_n^p\right)\xrightarrow{d}N(0,I_P)
\quad\text{or}\quad
\sqrt{2n}\,(-\bar{H}_n)^{1/2}\left(\tilde{\theta}_n-\theta_n^p\right)\xrightarrow{d}N(0,I_P).
\]

Lemma 5.3 Under Assumptions 1-10, the asymptotic joint distribution of $\sqrt{n}\left(\tilde{\theta}_n-\theta_n^p\right)$ and $\sqrt{n}\left(\overleftrightarrow{\theta}_n-\theta_n^p\right)$ is given by
\[
\begin{bmatrix}(-2\bar{H}_n)^{-1} & (-2\bar{H}_n)^{-1}\\ (-2\bar{H}_n)^{-1} & (-\bar{H}_n)^{-1}\end{bmatrix}^{-1/2}
\begin{pmatrix}\sqrt{n}\left(\tilde{\theta}_n-\theta_n^p\right)\\ \sqrt{n}\left(\overleftrightarrow{\theta}_n-\theta_n^p\right)\end{pmatrix}
\xrightarrow{d}N(0,I_{2P}).
\]
Proof. By Lemma 5.2, we have
\[
\sqrt{n}\left(\tilde{\theta}_n-\theta_n^p\right)
=\left(-n^{-1}\sum_{t=1}^{n}\left[\nabla^2 l_t(y^t,\dot{\theta}_{n1})+\nabla^2 l_t(y_{rep}^t,\dot{\theta}_{n1})\right]\right)^{-1}
\left(n^{-1/2}\sum_{t=1}^{n}\left[\nabla l_t(y^t,\theta_n^p)+\nabla l_t(y_{rep}^t,\theta_n^p)\right]+n^{-1/2}\nabla\ln p(\tilde{\theta}_n)\right),
\]
and
\[
\sqrt{n}\left(\overleftrightarrow{\theta}_n-\theta_n^p\right)
=\left(-n^{-1}\sum_{t=1}^{n}\nabla^2 l_t(y^t,\dot{\theta}_{n2})\right)^{-1}
\left(n^{-1/2}\sum_{t=1}^{n}\nabla l_t(y^t,\theta_n^p)+n^{-1/2}\nabla\ln p(\overleftrightarrow{\theta}_n)\right),
\]
where $\dot{\theta}_{n2}$ is an intermediate value between $\overleftrightarrow{\theta}_n$ and $\theta_n^p$. Hence, we have
\begin{align*}
\mathrm{Cov}\left(\sqrt{n}\left(\tilde{\theta}_n-\theta_n^p\right),\sqrt{n}\left(\overleftrightarrow{\theta}_n-\theta_n^p\right)\right)
&= E\left(\sqrt{n}\left(\tilde{\theta}_n-\theta_n^p\right)\sqrt{n}\left(\overleftrightarrow{\theta}_n-\theta_n^p\right)'\right)\\
&= E\left[\left\{-n^{-1}\sum_{t=1}^{n}\left[\nabla^2 l_t(y^t,\theta_n^p)+\nabla^2 l_t(y_{rep}^t,\theta_n^p)\right]\right\}^{-1}
n^{-1/2}\sum_{t=1}^{n}\nabla l_t(y^t,\theta_n^p)\right.\\
&\qquad\left.\times\left[n^{-1/2}\sum_{t=1}^{n}\nabla l_t(y^t,\theta_n^p)\right]'
\left(-n^{-1}\sum_{t=1}^{n}\nabla^2 l_t(y^t,\theta_n^p)\right)^{-1}\right]+o(1)\\
&= (-2\bar{H}_n)^{-1}(-\bar{H}_n)(-\bar{H}_n)^{-1}+o(1)
= (-2\bar{H}_n)^{-1}+o(1).
\end{align*}
Then we have
\[
\begin{bmatrix}(-2\bar{H}_n)^{-1} & (-2\bar{H}_n)^{-1}\\ (-2\bar{H}_n)^{-1} & (-\bar{H}_n)^{-1}\end{bmatrix}^{-1/2}
\begin{pmatrix}\sqrt{n}\left(\tilde{\theta}_n-\theta_n^p\right)\\ \sqrt{n}\left(\overleftrightarrow{\theta}_n-\theta_n^p\right)\end{pmatrix}
\xrightarrow{d}N(0,I_{2P}).
\]

We are now in a position to prove Theorem 4.1. Under Assumptions 1-10, it can be shown that
\[
E_yE_{y_{rep}}\left(-2\ln p(y_{rep}|y)\right)
=E_y\left[-2\ln p(y|\overleftrightarrow{\theta}_n)+(1+\ln2)P\right]+o(1).
\]
By the Laplace approximation (Tierney et al., 1989; Kass et al., 1990) and Lemma 5.2, we have
\begin{align*}
p(y_{rep}|y) &= \int p(y_{rep}|\theta)p(\theta|y)\,d\theta
=\frac{\int p(y_{rep}|\theta)p(y|\theta)p(\theta)\,d\theta}{\int p(y|\theta)p(\theta)\,d\theta}
=\frac{\int\exp\left(-nh_N(\theta)\right)d\theta}{\int\exp\left(-nh_D(\theta)\right)d\theta}\\
&=\frac{\left|\nabla^2h_N(\tilde{\theta}_n)\right|^{-1/2}\exp\left(-nh_N(\tilde{\theta}_n)\right)}
{\left|\nabla^2h_D(\overleftrightarrow{\theta}_n)\right|^{-1/2}\exp\left(-nh_D(\overleftrightarrow{\theta}_n)\right)}
\left(1+O_p\left(\frac{1}{n}\right)\right)+O_p\left(\frac{1}{n^2}\right),
\end{align*}
where
\[
h_N(\theta)=-\frac{1}{n}\left(\ln p(y_{rep}|\theta)+\ln p(y|\theta)+\ln p(\theta)\right),\qquad
h_D(\theta)=-\frac{1}{n}\left(\ln p(y|\theta)+\ln p(\theta)\right).
\]
Note that
\[
\ln\frac{\left|\nabla^2h_N(\tilde{\theta}_n)\right|^{-1/2}\exp\left(-nh_N(\tilde{\theta}_n)\right)}
{\left|\nabla^2h_D(\overleftrightarrow{\theta}_n)\right|^{-1/2}\exp\left(-nh_D(\overleftrightarrow{\theta}_n)\right)}
=-\frac{1}{2}\left(\ln\left|\nabla^2h_N(\tilde{\theta}_n)\right|-\ln\left|\nabla^2h_D(\overleftrightarrow{\theta}_n)\right|\right)
+\left[-nh_N(\tilde{\theta}_n)+nh_D(\overleftrightarrow{\theta}_n)\right].
\]
The first term is
\begin{align*}
-\frac{1}{2}\left(\ln\left|\nabla^2h_N(\tilde{\theta}_n)\right|-\ln\left|\nabla^2h_D(\overleftrightarrow{\theta}_n)\right|\right)
&= -\frac{1}{2}\ln\left|-\frac{1}{n}\frac{\partial^2\ln p(y_{rep}|\tilde{\theta}_n)}{\partial\theta\partial\theta'}
-\frac{1}{n}\frac{\partial^2\ln p(y|\tilde{\theta}_n)}{\partial\theta\partial\theta'}
-\frac{1}{n}\frac{\partial^2\ln p(\tilde{\theta}_n)}{\partial\theta\partial\theta'}\right|
+\frac{1}{2}\ln\left|-\frac{1}{n}\frac{\partial^2\ln p(y|\overleftrightarrow{\theta}_n)}{\partial\theta\partial\theta'}
-\frac{1}{n}\frac{\partial^2\ln p(\overleftrightarrow{\theta}_n)}{\partial\theta\partial\theta'}\right|\\
&= -\frac{1}{2}\ln\left|-\bar{H}_n-\bar{H}_n\right|+\frac{1}{2}\ln\left|-\bar{H}_n\right|+o_p(1)\\
&= -\frac{1}{2}\ln\left(2^P\left|-\bar{H}_n\right|\right)+\frac{1}{2}\ln\left|-\bar{H}_n\right|+o_p(1)
= -\frac{1}{2}P\ln2+o_p(1). \qquad (36)
\end{align*}

Here we can see how ln 2 shows up in the penalty term.

The second term is

−nhN(θn

)+ nhD

(←→θ n

)= ln p

(yrep|θn

)+ ln p

(y|θn

)+ ln p

(θn

)− ln p

(y|←→θ n

)− ln p

(←→θ n

)= ln p

(yrep|θn

)+ ln p

(y|θn

)− ln p

(y|←→θ n

)+ op (1)

= ln p(yrep|

←→θ n

)+ E1 + E2 + op (1) , (37)

where

E1 = ln p(yrep|θn

)− ln p

(yrep|

←→θ n

), E2 = ln p

(y|θn

)− ln p

(y|←→θ n

).

34

Page 35: Deviance Information Criterion for Model Selection: Justi ...mysmu.edu/faculty/yujun/Research/DIC_Theory27.pdfthe present article (Li et al., 2017). 2 Decision-theoretic Justi–cation

We can further decompose E1 as

E1 = E11 + E12,

where

E11 = ln p(yrep|θn

)− ln p (yrep|θpn) , E12 = ln p (yrep|θpn)− ln p

(yrep|

←→θ n

).

For E11, we have

E11 = ln p(yrep|θn

)− ln p (yrep|θpn)

=1√n

∂ ln p (yrep|θpn)

∂θ′√n(θn − θpn

)+

1

2

√n(θn − θpn

)′ 1

n

∂2 ln p (yrep|θpn)

∂θ∂θ′√n(θn − θpn

)+ op (1) .

Following Assumption 1-10 and Lemma 5.3, we can similarly prove that

1√n

∂ ln p (yrep|θpn)

∂θ′√n(θn − θpn

)=√n(←→θ (yrep)− θpn

)′(−n−1

n∑t=1

52lt(ytrep,θ

pn

))√n(θn − θpn

)+ op (1)

=√n(←→θ (yrep)− θpn

)′(−Hn)

√n(θn − θpn

)+ op (1)

= tr

[(−Hn)

√n(θn − θpn

)√n(←→θ (yrep)− θpn

)′]+ op (1)

=√n(←→θ (yrep)− θpn

)′(−Hn)1/2 (−Hn)−1/2 (−Hn) (−2Hn)−1/2

× (−2Hn)1/2√n(θn − θpn

)+ op (1)

= 2−1/2[(−Hn)1/2

√n(←→θ (yrep)− θpn

)]′(−2Hn)1/2

√n(θn − θpn

)+ op (1) .

where [(−Hn)1/2

√n(←→θ n (yrep)− θpn

)]′(−2Hn)1/2

√n(θn − θpn

)=

(−Hn)1/2(−n−1

n∑t=1

52lt(ytrep, θn3

))−1n−1/2

n∑t=1

5lt(ytrep,θ

pn

)× (−2Hn)1/2

(−n−1

n∑t=1

[52lt

(yt, θn1

)+52lt

(ytrep, θn1

)])−1

×[n−1/2

n∑t=1

5lt(yt,θpn

)+ n−1/2

n∑t=1

5lt(ytrep,θ

pn

)]

=

[(−Hn)−1/2 n−1/2

n∑t=1

5lt(ytrep,θ

pn

)]′(−2Hn)−1/2

×[n−1/2

n∑t=1

5lt(yt,θpn

)+ n−1/2

n∑t=1

5lt(ytrep,θ

pn

)]+ op (1)

35

Page 36: Deviance Information Criterion for Model Selection: Justi ...mysmu.edu/faculty/yujun/Research/DIC_Theory27.pdfthe present article (Li et al., 2017). 2 Decision-theoretic Justi–cation

=

[(−Hn)−1/2 n−1/2

n∑t=1

5lt(ytrep,θ

pn

)]′(−2Hn)−1/2 n−1/2

n∑t=1

5lt(yt,θpn

)+2−1/2

[(−Hn)−1/2 n−1/2

n∑t=1

5lt(ytrep,θ

pn

)]′(−Hn)−1/2

×n−1/2n∑t=1

5lt(ytrep,θ

pn

)+ op (1) ,

with θn3 lying between←→θ n (yrep) and θpn. Hence,[

(−Hn)−1/2 n−1/2n∑t=1

5lt(ytrep,θ

pn

)]′(−Hn)−1/2 n−1/2

n∑t=1

5lt(ytrep,θ

pn

) d→ χ2 (P ) . (38)

Thus, we have
\[
E_yE_{y_{rep}}\left(\left[(-\bar{H}_n)^{-1/2}n^{-1/2}\sum_{t=1}^{n}\nabla l_t(y_{rep}^t,\theta_n^p)\right]'(-2\bar{H}_n)^{-1/2}n^{-1/2}\sum_{t=1}^{n}\nabla l_t(y^t,\theta_n^p)\right)=0, \qquad (39)
\]
since $y$ and $y_{rep}$ are independent. Hence, we have
\begin{align*}
&E_yE_{y_{rep}}\left[\frac{1}{\sqrt{n}}\frac{\partial\ln p(y_{rep}|\theta_n^p)}{\partial\theta'}\sqrt{n}\left(\tilde{\theta}_n-\theta_n^p\right)\right]\\
&= E_yE_{y_{rep}}\left[2^{-1/2}\left[(-\bar{H}_n)^{1/2}\sqrt{n}\left(\overleftrightarrow{\theta}_n(y_{rep})-\theta_n^p\right)\right]'(-2\bar{H}_n)^{1/2}\sqrt{n}\left(\tilde{\theta}_n-\theta_n^p\right)+o_p(1)\right]\\
&= E_yE_{y_{rep}}\left(2^{-1/2}\left[(-\bar{H}_n)^{-1/2}n^{-1/2}\sum_{t=1}^{n}\nabla l_t(y_{rep}^t,\theta_n^p)\right]'(-2\bar{H}_n)^{-1/2}n^{-1/2}\sum_{t=1}^{n}\nabla l_t(y^t,\theta_n^p)\right)\\
&\quad+E_yE_{y_{rep}}\left(2^{-1}\left[(-\bar{H}_n)^{-1/2}n^{-1/2}\sum_{t=1}^{n}\nabla l_t(y_{rep}^t,\theta_n^p)\right]'(-\bar{H}_n)^{-1/2}n^{-1/2}\sum_{t=1}^{n}\nabla l_t(y_{rep}^t,\theta_n^p)\right)+o(1)\\
&= \frac{1}{2}P+o(1). \qquad (40)
\end{align*}
Moreover,
\begin{align*}
\frac{1}{2}\sqrt{n}\left(\tilde{\theta}_n-\theta_n^p\right)'\frac{1}{n}\frac{\partial^2\ln p(y_{rep}|\theta_n^p)}{\partial\theta\partial\theta'}\sqrt{n}\left(\tilde{\theta}_n-\theta_n^p\right)
&= -\frac{1}{2}\sqrt{n}\left(\tilde{\theta}_n-\theta_n^p\right)'(-\bar{H}_n)\sqrt{n}\left(\tilde{\theta}_n-\theta_n^p\right)+o_p(1)\\
&= -\frac{1}{2}\sqrt{n}\left(\tilde{\theta}_n-\theta_n^p\right)'(-2\bar{H}_n)^{1/2}(-2\bar{H}_n)^{-1/2}(-\bar{H}_n)(-2\bar{H}_n)^{-1/2}(-2\bar{H}_n)^{1/2}\sqrt{n}\left(\tilde{\theta}_n-\theta_n^p\right)+o_p(1)\\
&= -\frac{1}{4}\left[(-2\bar{H}_n)^{1/2}\sqrt{n}\left(\tilde{\theta}_n-\theta_n^p\right)\right]'(-2\bar{H}_n)^{1/2}\sqrt{n}\left(\tilde{\theta}_n-\theta_n^p\right)+o_p(1),
\end{align*}
where
\[
\left[(-2\bar{H}_n)^{1/2}\sqrt{n}\left(\tilde{\theta}_n-\theta_n^p\right)\right]'(-2\bar{H}_n)^{1/2}\sqrt{n}\left(\tilde{\theta}_n-\theta_n^p\right)\xrightarrow{d}\chi^2(P). \qquad (41)
\]
From (40) and (41),
\[
E_yE_{y_{rep}}\left(E_{11}\right)=\left(\frac{1}{2}-\frac{1}{4}\right)P+o(1)=\frac{1}{4}P+o(1).
\]

For $E_{12}$, we have
\[
E_{12}=\ln p(y_{rep}|\theta_n^p)-\ln p(y_{rep}|\overleftrightarrow{\theta}_n)
=-\frac{1}{\sqrt{n}}\frac{\partial\ln p(y_{rep}|\theta_n^p)}{\partial\theta'}\sqrt{n}\left(\overleftrightarrow{\theta}_n-\theta_n^p\right)
-\frac{1}{2}\left(\overleftrightarrow{\theta}_n-\theta_n^p\right)'\frac{\partial^2\ln p(y_{rep}|\theta_n^p)}{\partial\theta\partial\theta'}\left(\overleftrightarrow{\theta}_n-\theta_n^p\right)+o_p(1).
\]
Since
\[
-\left(\overleftrightarrow{\theta}_n-\theta_n^p\right)'\frac{\partial^2\ln p(y_{rep}|\theta_n^p)}{\partial\theta\partial\theta'}\left(\overleftrightarrow{\theta}_n-\theta_n^p\right)\xrightarrow{d}\chi^2(P), \qquad (42)
\]
\[
E_{y_{rep}}\left(\frac{1}{\sqrt{n}}\frac{\partial\ln p(y_{rep}|\theta_n^p)}{\partial\theta}\right)=0,\qquad
E_y\left(\sqrt{n}\left(\overleftrightarrow{\theta}_n-\theta_n^p\right)\right)=o(1), \qquad (43)
\]
from (42) and (43), we have
\[
E_yE_{y_{rep}}\left(E_{12}\right)=\frac{1}{2}P+o(1).
\]
Then
\[
E_yE_{y_{rep}}\left(E_1\right)=E_yE_{y_{rep}}\left(\ln p(y_{rep}|\tilde{\theta}_n)-\ln p(y_{rep}|\overleftrightarrow{\theta}_n)\right)
=E_yE_{y_{rep}}\left(E_{11}+E_{12}\right)=\frac{3}{4}P+o(1). \qquad (44)
\]
Similarly, we can decompose $E_2=\ln p(y|\tilde{\theta}_n)-\ln p(y|\overleftrightarrow{\theta}_n)$ as $E_2=E_{21}+E_{22}$, where
\[
E_{21}=\ln p(y|\tilde{\theta}_n)-\ln p(y|\theta_n^p),\qquad
E_{22}=\ln p(y|\theta_n^p)-\ln p(y|\overleftrightarrow{\theta}_n).
\]
From the discussion above, $\ln p(y|\tilde{\theta}_n)-\ln p(y|\theta_n^p)$ has the same asymptotic property as $\ln p(y_{rep}|\tilde{\theta}_n)-\ln p(y_{rep}|\theta_n^p)$, so we have
\[
E_yE_{y_{rep}}\left(E_{21}\right)=E_yE_{y_{rep}}\left(E_{11}\right)=\frac{1}{4}P+o(1).
\]

For $E_{22}$, we have
\begin{align*}
\ln p(y|\theta_n^p)-\ln p(y|\overleftrightarrow{\theta}_n)
&= \ln p(y|\overleftrightarrow{\theta}_n)
+\frac{1}{\sqrt{n}}\frac{\partial\ln p(y|\overleftrightarrow{\theta}_n)}{\partial\theta'}\sqrt{n}\left(\theta_n^p-\overleftrightarrow{\theta}_n\right)
+\frac{1}{2}\sqrt{n}\left(\overleftrightarrow{\theta}_n-\theta_n^p\right)'\frac{1}{n}\frac{\partial^2\ln p(y|\overleftrightarrow{\theta}_n)}{\partial\theta\partial\theta'}\sqrt{n}\left(\overleftrightarrow{\theta}_n-\theta_n^p\right)\\
&\quad-\ln p(y|\overleftrightarrow{\theta}_n)+o_p(1), \qquad (45)
\end{align*}
where
\[
\frac{1}{\sqrt{n}}\frac{\partial\ln p(y|\overleftrightarrow{\theta}_n)}{\partial\theta'}=o_p(1),\qquad
-\sqrt{n}\left(\overleftrightarrow{\theta}_n-\theta_n^p\right)'\frac{1}{n}\frac{\partial^2\ln p(y|\overleftrightarrow{\theta}_n)}{\partial\theta\partial\theta'}\sqrt{n}\left(\overleftrightarrow{\theta}_n-\theta_n^p\right)\xrightarrow{d}\chi^2(P).
\]
Then we can get
\[
E_yE_{y_{rep}}\left(E_{22}\right)=-\frac{1}{2}P+o(1),
\]
and
\[
E_yE_{y_{rep}}\left(E_2\right)=E_yE_{y_{rep}}\left(\ln p(y|\tilde{\theta}_n)-\ln p(y|\overleftrightarrow{\theta}_n)\right)
=E_yE_{y_{rep}}\left(E_{21}+E_{22}\right)=-\frac{1}{4}P+o(1). \qquad (46)
\]

Note that
\[
\bar{\theta}_n=\overleftrightarrow{\theta}_n+o_p\left(n^{-1/2}\right),
\]
by Lemma 3.3. Mimicking the proof of Theorem 3.1, we get
\[
E_yE_{y_{rep}}\ln p(y_{rep}|\overleftrightarrow{\theta}_n)=E_y\left[\ln p(y|\overleftrightarrow{\theta}_n)\right]-P+o(1). \qquad (47)
\]
With (36), (44), (46) and (47), we have
\begin{align*}
E_y\left[E_{y_{rep}}\ln p(y_{rep}|y)\right]
&= E_yE_{y_{rep}}\ln\left[\frac{\left|\nabla^2h_N(\tilde{\theta}_n)\right|^{-1/2}\exp\left(-nh_N(\tilde{\theta}_n)\right)}
{\left|\nabla^2h_D(\overleftrightarrow{\theta}_n)\right|^{-1/2}\exp\left(-nh_D(\overleftrightarrow{\theta}_n)\right)}
\left(1+O_p\left(\frac{1}{n}\right)\right)\right]+O_p\left(\frac{1}{n^2}\right)\\
&= -\frac{1}{2}\left(\ln\left|\nabla^2h_N(\tilde{\theta}_n)\right|-\ln\left|\nabla^2h_D(\overleftrightarrow{\theta}_n)\right|\right)\\
&\quad+E_yE_{y_{rep}}\left[\ln p(y_{rep}|\overleftrightarrow{\theta}_n)
+\ln p(y_{rep}|\tilde{\theta}_n)-\ln p(y_{rep}|\overleftrightarrow{\theta}_n)
+\ln p(y|\tilde{\theta}_n)-\ln p(y|\overleftrightarrow{\theta}_n)\right]+o(1)\\
&= -\frac{P}{2}\ln2+E_yE_{y_{rep}}\left[\ln p(y_{rep}|\overleftrightarrow{\theta}_n)\right]+E_yE_{y_{rep}}\left[E_1+E_2\right]+o(1)\\
&= -\frac{P}{2}\ln2+E_yE_{y_{rep}}\left[\ln p(y_{rep}|\overleftrightarrow{\theta}_n)\right]+\frac{3}{4}P-\frac{1}{4}P+o(1)\\
&= E_yE_{y_{rep}}\left(\ln p(y_{rep}|\overleftrightarrow{\theta}_n)+\left(\frac{1}{2}-\frac{\ln2}{2}\right)P\right)+o(1)\\
&= E_y\ln p(y|\overleftrightarrow{\theta}_n)-P+\left(\frac{1}{2}-\frac{\ln2}{2}\right)P+o(1)\\
&= E_y\left[\ln p(y|\overleftrightarrow{\theta}_n)-\frac{1+\ln2}{2}P\right]+o(1) \qquad (48)\\
&= E_y\left[\ln p(y|\bar{\theta}_n)-\frac{1+\ln2}{2}P\right]+o(1).
\end{align*}
Therefore, $-2\ln p(y|\bar{\theta}_n)+(1+\ln2)P$ is an asymptotically unbiased estimator of $E_{y_{rep}}\left(-2\ln p(y_{rep}|y)\right)$.
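For practical use, the criterion of Theorem 4.1 replaces the penalty $2P$ of the original DIC by $(1+\ln2)P\approx1.693P$, with everything else computable from the same MCMC output. The sketch below contrasts the two criteria in the illustrative normal location model used earlier; the model and names are our own assumptions, and $P_D$ is used as a plug-in estimate of $P$, which is one convenient (but not the only) way to evaluate the penalty from MCMC draws.

    import numpy as np

    rng = np.random.default_rng(3)

    n = 500
    y = rng.normal(0.3, 1.0, n)
    draws = rng.normal(y.mean(), 1.0 / np.sqrt(n), 20000)    # stand-in for MCMC draws

    def neg2loglik(mu):
        return n * np.log(2 * np.pi) + np.sum((y - mu) ** 2)

    Dhat = neg2loglik(draws.mean())                          # deviance at the posterior mean
    PD = np.mean([neg2loglik(mu) for mu in draws]) - Dhat    # estimate of P

    DIC = Dhat + 2.0 * PD                                    # original DIC, penalty 2P
    DIC_BP = Dhat + (1.0 + np.log(2.0)) * PD                 # Theorem 4.1 penalty (1 + ln 2)P

    print(f"DIC    = {DIC:.3f}   (penalty = {2.0 * PD:.3f})")
    print(f"DIC_BP = {DIC_BP:.3f}   (penalty = {(1.0 + np.log(2.0)) * PD:.3f})")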

References

1. Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. Second International Symposium on Information Theory, Springer Verlag, 1, 267-281.
2. Andrews, D. W. K. (1987). Consistency in nonlinear econometric models: A generic uniform law of large numbers. Econometrica, 55(6), 1465-1471.
3. Andrews, D. W. K. (1988). Laws of large numbers for dependent non-identically distributed random variables. Econometric Theory, 4(3), 458-467.
4. Ando, T. (2007). Bayesian predictive information criterion for the evaluation of hierarchical Bayesian and empirical Bayes models. Biometrika, 443-458.
5. Ando, T. (2012). Predictive Bayesian model selection. American Journal of Mathematical and Management Sciences, 31(1-2), 13-38.
6. Ando, T. and Tsay, R. (2010). Predictive likelihood for Bayesian model selection and averaging. International Journal of Forecasting, 26, 744-763.
7. Berger, J. O. (1985). Statistical Decision Theory and Bayesian Analysis. 2nd edition. Springer-Verlag.
8. Bester, C. A. and Hansen, C. (2006). Bias reduction for Bayesian and frequentist estimators. SSRN Working Paper Series.
9. Brooks, S. (2002). Discussion on the paper by Spiegelhalter, Best, Carlin, and van der Linde (2002). Journal of the Royal Statistical Society Series B, 64, 616-618.
10. Burnham, K. and Anderson, D. (2002). Model Selection and Multi-model Inference: A Practical Information-theoretic Approach. Springer.
11. Chen, C. (1985). On asymptotic normality of limiting density functions with Bayesian implications. Journal of the Royal Statistical Society Series B, 47, 540-546.
12. Chung, K. L. (2001). A Course in Probability Theory. Academic Press.
13. DasGupta, A. (2008). Asymptotic Theory of Statistics and Probability. Springer Science & Business Media.
14. Gallant, A. R. and White, H. (1988). A Unified Theory of Estimation and Inference for Nonlinear Dynamic Models. Blackwell.
15. Geweke, J. and Keane, M. (2001). Computationally intensive methods for integration in econometrics. Handbook of Econometrics, 5, 3463-3568.
16. Geweke, J. (2005). Contemporary Bayesian Econometrics and Statistics. Wiley-Interscience.
17. Ghosh, J. and Ramamoorthi, R. (2003). Bayesian Nonparametrics. Springer Verlag.
18. Kass, R., Tierney, L. and Kadane, J. (1990). The validity of posterior expansions based on Laplace's method. In Bayesian and Likelihood Methods in Statistics and Econometrics, ed. by S. Geisser, J. S. Hodges, S. J. Press and A. Zellner. Elsevier Science Publishers B.V.: North-Holland.
19. Jennrich, R. I. (1969). Asymptotic properties of non-linear least squares estimators. The Annals of Mathematical Statistics, 40(2), 633-643.
20. Kim, J. (1994). Bayesian asymptotic theory in a time series model with a possible nonstationary process. Econometric Theory, 10, 764-773.
21. Kim, J. (1998). Large sample properties of posterior densities, Bayesian information criterion and the likelihood principle in nonstationary time series models. Econometrica, 66, 359-380.
22. Li, Y., Yu, J. and Zeng, T. (2017). Online Supplement to "Deviance Information Criterion for Model Selection: Justification and Variation". Singapore Management University.
23. Magnus, J. R. and Neudecker, H. (1986). Symmetry, 0-1 matrices and Jacobians: A review. Econometric Theory, 2(2), 157-190.
24. Phillips, P. C. B. (1995). Bayesian model selection and prediction with empirical application (with discussions). Journal of Econometrics, 69(1), 289-331.
25. Phillips, P. C. B. (1996). Econometric model determination. Econometrica, 64, 763-812.
26. Plummer, M. (2008). Penalized loss functions for Bayesian model comparison. Biostatistics, 9(3), 523-539.
27. Spiegelhalter, D., Best, N., Carlin, B. and van der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society Series B, 64, 583-639.
28. Spiegelhalter, D., Best, N., Carlin, B. and van der Linde, A. (2014). The deviance information criterion: 12 years on. Journal of the Royal Statistical Society Series B, 76, 485-493.
29. Schervish, M. J. (2012). Theory of Statistics. Springer Science & Business Media.
30. Tierney, L., Kass, R. E. and Kadane, J. B. (1989). Fully exponential Laplace approximations to expectations and variances of nonpositive functions. Journal of the American Statistical Association, 84(407), 710-716.
31. van der Linde, A. (2005). DIC in variable selection. Statistica Neerlandica, 59(1), 45-56.
32. van der Linde, A. (2012). A Bayesian view of model complexity. Statistica Neerlandica, 66, 253-271.
33. Vehtari, A. and Ojanen, J. (2012). A survey of Bayesian predictive methods for model assessment, selection and comparison. Statistics Surveys, 6, 142-228.
34. Vehtari, A. and Lampinen, J. (2002). Bayesian model assessment and comparison using cross-validation predictive densities. Neural Computation, 14, 2439-2468.
35. White, H. (1996). Estimation, Inference and Specification Analysis. Cambridge University Press, Cambridge, UK.
36. Wooldridge, J. M. (1994). Estimation and inference for dependent processes. Handbook of Econometrics, 4, 2639-2738.
37. Zhang, Y. and Yang, Y. (2015). Cross-validation for selecting a model selection procedure. Journal of Econometrics, 187(1), 95-112.