
Asymptotic optimality of Cp-type criteria in high-dimensional multivariate linear regression models

(Last Modified: January 31, 2020)

Shinpei Imori

Department of Mathematics, Hiroshima University

1-3-1 Kagamiyama, Higashi-Hiroshima, Hiroshima 739-8526, Japan

[email protected]

Abstract

We study the asymptotic optimality of Cp-type criteria from the perspective of prediction in high-dimensional multivariate linear regression models, where the dimension of the response matrix is large but does not exceed the sample size. We derive conditions under which the generalized Cp (GCp) exhibits asymptotic loss efficiency (ALE) and asymptotic mean efficiency (AME) for such high-dimensional data. Moreover, we clarify that one of the conditions is necessary for GCp to exhibit both ALE and AME. As a result, it is shown that the modified Cp can claim both ALE and AME in high-dimensional data but the original Cp cannot. The finite sample performance of GCp with several tuning parameters is compared through a simulation study.

Key words: Asymptotic loss efficiency; Asymptotic mean efficiency; High-dimensional data analysis; Multivariate linear regression model; Variable selection.

1 Introduction

Variable selection problems are crucial in statistical fields for improving the prediction accuracy and/or interpretability of a resultant model. A burgeoning literature has attempted to solve the variable selection problem, and many selection procedures and their theoretical properties have been studied. For example, Mallows' Cp criterion (Mallows 1973) and the Akaike information criterion (AIC) (Akaike 1974) are known as useful selection methods from a predictive point of view because these procedures are optimal in some predictive sense (see Shibata 1981, 1983, Li 1987, Shao 1997). On the other hand, the Bayes information criterion (BIC) proposed by Schwarz (1978) is consistent (Nishii 1984); that is, the probability that the model selected by BIC coincides with the true model converges to 1 as the sample size n tends to infinity. In this sense, BIC is a feasible method from the perspective of interpretability.


However, Cp and AIC are inconsistent (Nishii 1984) under the same condition. The properties of selection procedures are studied in detail in Shao (1997) in the context of univariate linear regression models. Here, by contrast, our target is multivariate linear regression models.

Recently, high-dimensional data are often encountered, where the dimension pn of the response matrix in a multivariate linear regression model is large, whereas pn does not exceed the sample size n. Considering such high-dimensional multivariate linear regression models, one may presume that the properties of selection methods, such as optimality and consistency, are inherited from univariate models. However, interestingly, properties derived when pn is fixed can be altered in high-dimensional situations. For example, Yanagihara et al. (2015) showed that AIC acquires the consistency property and that BIC loses its consistency in high-dimensional data. Similar results for Cp-type criteria were reported by Fujikoshi et al. (2014). The reason why this inversion arises may be that the difference in risks between two over-specified models (i.e., models including the true model) diverges as n and pn tend to infinity, and thus the penalty terms of Cp and AIC are moderate but that of BIC is too strong. In addition to these studies, model selection criteria in high-dimensional data contexts and their consistency properties have been vigorously studied in various models and situations (e.g., Katayama and Imori 2014, Imori and von Rosen 2015, Yanagihara 2015, Fujikoshi and Sakurai 2016, Bai et al. 2018).

Compared with the consistency property, asymptotic optimality for prediction in high-dimensional data contexts is under-researched. Conventional results derived from univariate models are no longer reliable in high-dimensional data contexts, and extension to such cases is not mathematically trivial. In the present paper, we focus on asymptotic loss efficiency (ALE) (Li 1987, Shao 1997) and asymptotic mean efficiency (AME) (Shibata 1983) as criteria for the asymptotic optimality of variable selection. We derive sufficient conditions under which a generalization of Cp (GCp) exhibits ALE and AME in high-dimensional data. We also show that one of the sufficient conditions is necessary for GCp to exhibit both of these efficiencies. As a result, we observe that the modified Cp (MCp) introduced by Fujikoshi and Satoh (1997) exhibits ALE and AME under moderate conditions, although the original Cp does not under the same conditions.

The remainder of this paper is organized as follows. In Section 2, we clarify the variable selection framework used in this paper. In Section 3, the sufficient conditions for ALE and AME of GCp are given. In Section 4, we study the asymptotic inefficiency of GCp. Section 5 illustrates the finite sample performances of some Cp-type criteria. Finally, conclusions are offered in Section 6.


2 Model selection framework

2.1 True and candidate models

Let Y be an n × pn response variable matrix and X be an n × kn explanatory variable matrix, where n is the sample size, pn is the dimension of the response, and kn is the number of explanatory variables. We assume X to be of full rank and non-stochastic. We allow kn and pn to diverge to infinity as n tends to infinity, although neither kn nor pn exceeds n. Specific conditions on n, kn, and pn are given later.

The true distribution of Y is given by

\[ Y = \Gamma_* + E\,\Sigma_*^{1/2}, \]

where Γ∗ = E[Y], E is an n × pn error matrix whose entries are independent and identically distributed as the standard normal distribution N(0, 1), and Σ∗ is the true covariance matrix of each row of Y. The relationship between Y and X is represented by a multivariate linear regression model as follows:

\[ Y = XB + E\,\Sigma^{1/2}, \]

where B is a kn × pn matrix of unknown regression coefficients and Σ is a pn × pn unknown covariance matrix. To consider a variable selection problem, let M ⊂ MF = {1, . . . , kn} be a candidate model, where XM is an n × kM sub-matrix of X corresponding to M, and kM is the cardinality of M. Then, a candidate model M implies a multivariate linear regression model defined as follows:

\[ Y = X_M B_M + E\,\Sigma^{1/2}, \]

where BM is a kM × pn matrix of unknown regression coefficients. A set of candidate models is denoted by Mn, a subset of the comprehensive set {M | M ⊂ MF}. Note that Mn does not have to include the full model MF.

2.2 Loss and risk functions

Herein, the goodness of fit of a candidate model M is measured by a quadratic loss function Ln given by

\[ L_n(M) = \mathrm{tr}\{(\Gamma_* - X_M \hat{B}_M)\,\Sigma_*^{-1} (\Gamma_* - X_M \hat{B}_M)^\top\}, \tag{1} \]

where \hat{B}_M is the least squares estimator of BM, i.e.,

\[ \hat{B}_M = (X_M^\top X_M)^{-1} X_M^\top Y. \tag{2} \]

By substituting (2) into (1), we have

\[ L_n(M) = \mathrm{tr}(\Delta_M) + \mathrm{tr}(E^\top P_M E), \tag{3} \]

where \Delta_M = \Sigma_*^{-1/2} \Gamma_*^\top P_M^\perp \Gamma_* \Sigma_*^{-1/2}, P_M^\perp = I_n - P_M, and P_M = X_M (X_M^\top X_M)^{-1} X_M^\top. Then, a risk function Rn is obtained as

\[ R_n(M) = \mathrm{E}[L_n(M)] = \mathrm{tr}(\Delta_M) + k_M p_n. \tag{4} \]

The best models with respect to the loss and risk functions are denoted by M_L^* and M_R^*, which minimize (1) and (4) over Mn, respectively, i.e.,

\[ M_L^* = \arg\min_{M \in \mathcal{M}_n} L_n(M), \qquad M_R^* = \arg\min_{M \in \mathcal{M}_n} R_n(M). \]

Note that M_L^* is a random variable, M_R^* is non-stochastic, and both of them depend on n although this is suppressed for brevity.
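A minimal numpy sketch of the loss (1) and risk (4), assuming Γ∗ and Σ∗ are given; the function and variable names are ours:

```python
import numpy as np

def proj(X_M):
    """Projection matrix P_M = X_M (X_M^T X_M)^{-1} X_M^T."""
    return X_M @ np.linalg.solve(X_M.T @ X_M, X_M.T)

def loss_L(Gamma_star, Sigma_star, X_M, Y):
    """Quadratic loss L_n(M) of Eq. (1), with the least squares fit X_M B_hat_M = P_M Y."""
    resid = Gamma_star - proj(X_M) @ Y
    return np.trace(resid @ np.linalg.inv(Sigma_star) @ resid.T)

def risk_R(Gamma_star, Sigma_star, X_M, p_n):
    """Risk R_n(M) = tr(Delta_M) + k_M p_n of Eq. (4)."""
    n = X_M.shape[0]
    P_perp = np.eye(n) - proj(X_M)
    L_inv = np.linalg.inv(np.linalg.cholesky(Sigma_star))  # a square root of Sigma_*^{-1}
    Delta_M = L_inv @ Gamma_star.T @ P_perp @ Gamma_star @ L_inv.T
    return np.trace(Delta_M) + X_M.shape[1] * p_n
```

When Γ∗ lies in the column space of X_M (an over-specified model), tr(Δ_M) vanishes and the risk reduces to k_M p_n, matching (4).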

2.3 Selection method and asymptotic efficiency

To select the best model among Mn, we use GCp defined by

\[ \mathrm{GC}_p(M; \alpha_n) = n \alpha_n\, \mathrm{tr}(\hat{\Sigma}_M S^{-1}) + 2 k_M p_n, \tag{5} \]

where αn is a positive sequence, \hat{\Sigma}_M = Y^\top P_M^\perp Y / n, and S = n \hat{\Sigma}_{M_F} / (n - k_n). For theoretical purposes, we use αn satisfying

\[ \lim_{n \to \infty} \alpha_n = a \in [0, \infty). \]

When αn = 1, GCp reduces to the Cp proposed by Mallows (1973), and when αn = 1 − (pn + 1)/(n − kn), selection by GCp coincides with the modified Cp (called MCp) of Fujikoshi and Satoh (1997). If the full model includes the true model, then MCp is an unbiased estimator of Rn(M) for all M ∈ Mn (Fujikoshi and Satoh 1997). Note that Atkinson (1980) introduced a criterion equivalent to GCp for univariate data and Nagai et al. (2012) proposed GCp for multivariate generalized ridge regression models. The best model selected by minimizing GCp among Mn is denoted by M̂n, i.e.,

\[ \hat{M}_n = \arg\min_{M \in \mathcal{M}_n} \mathrm{GC}_p(M; \alpha_n). \]

Then, we say that GCp exhibits ALE (Li 1987, Shao 1997) if

\[ \frac{L_n(\hat{M}_n)}{L_n(M_L^*)} \xrightarrow{p} 1, \quad n \to \infty, \tag{6} \]

and exhibits AME (Shibata 1983) if

\[ \lim_{n \to \infty} \frac{\mathrm{E}[L_n(\hat{M}_n)]}{R_n(M_R^*)} = 1. \tag{7} \]

Note that Ln(M̂n) and E[Ln(M̂n)] are respectively referred to as the loss and risk functions of the best model selected by GCp.
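The criterion (5) can be computed directly from residual covariance matrices. The sketch below assumes n − kn > pn so that S is invertible; it is an illustration, not the authors' code:

```python
import numpy as np

def gcp(Y, X, M, alpha_n):
    """GC_p(M; alpha_n) = n * alpha_n * tr(Sigma_hat_M S^{-1}) + 2 * k_M * p_n  (Eq. (5))."""
    n, p_n = Y.shape
    k_n = X.shape[1]

    def resid_cov(X_sub):
        P = X_sub @ np.linalg.solve(X_sub.T @ X_sub, X_sub.T)
        return Y.T @ (np.eye(n) - P) @ Y / n

    Sigma_M = resid_cov(X[:, M])
    S = n * resid_cov(X) / (n - k_n)        # S = n * Sigma_hat_{M_F} / (n - k_n)
    return n * alpha_n * np.trace(np.linalg.solve(S, Sigma_M)) + 2 * len(M) * p_n

# alpha_n = 1 gives Cp; alpha_n = 1 - (p_n + 1)/(n - k_n) gives MCp.
```

For the full model, tr(Σ̂_{M_F} S^{-1}) = (n − kn)pn/n deterministically, which is a convenient sanity check on an implementation.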

3 Asymptotic efficiency of GCp

In this section, we present ALE and AME of GCp(M; αn). Hereafter, we may omit the symbol "n → ∞" to simplify expressions.

Firstly, we assume the following conditions for ALE:

(C1) kn = o(n), limn→∞ pn/n = cp ∈ [0, 1), and n − kn − pn > 0.

(C2) σ1(∆MF) = o(n), where for any matrix A, σℓ(A) denotes the ℓth largest singular value of A.

(C3) For all δ ∈ (0, 1),

\[ \lim_{n \to \infty} \sum_{M \in \mathcal{M}_n} \delta^{R_n(M)} = 0. \]

The first part of condition (C1) is the same as a condition assumed in Shibata (1981, 1983) if the full model MF is included in the set of candidate models Mn. The second part of (C1) constructs our high-dimensional framework, which is also considered in previous studies (Fujikoshi et al. 2014, Yanagihara et al. 2015, Yanagihara 2015, Fujikoshi and Sakurai 2016, Bai et al. 2018). The final part of (C1) is required to guarantee the regularity of S, and it is satisfied asymptotically under the previous two parts. Condition (C2) is used to make the effect of ∆MF negligible, and condition (C3) controls the number of candidate models. When pn = 1, conditions (C2) and (C3) correspond to assumptions in Shao (1997) and Shibata (1981, 1983), respectively. Then, we can derive sufficient conditions for ALE of GCp as the following theorem, whose proof is given in Appendix A.2.

Theorem 3.1. Suppose that conditions (C1)–(C3) hold. If αn → a = 1 − cp as n → ∞, then GCp(M; αn) exhibits ALE, i.e.,

\[ \frac{L_n(\hat{M}_n)}{L_n(M_L^*)} \xrightarrow{p} 1, \quad n \to \infty. \]

Next, we show AME of GCp. Besides conditions (C1)–(C3), we assume the following condition:

(C4) There exists γ0 ∈ (0, 1) such that maxM∈Mn Rn(M)/Rn(M_R^*) = O(exp(n^{γ0})).

Condition (C4) sets an upper bound on the risk ratio Rn(M)/Rn(M_R^*), which prevents the maximum risk from being too large. Because of conditions (C1) and (C3), condition (C4) is satisfied when maxM∈Mn tr(∆M) = O(exp(n^{γ0})), which is a weak condition. Then, sufficient conditions for AME of GCp are obtained.

Theorem 3.2. Suppose that conditions (C1)–(C4) hold. If αn → a = 1 − cp as n → ∞, then GCp(M; αn) exhibits AME, i.e.,

\[ \lim_{n \to \infty} \frac{\mathrm{E}[L_n(\hat{M}_n)]}{R_n(M_R^*)} = 1. \]

A proof of this theorem is provided in Appendix A.3. For both ALE and AME of GCp, we assume αn → a = 1 − cp. Unless cp = 0, this condition does not hold when αn = 1 (i.e., the original Cp). On the other hand, it is satisfied for all cp ∈ [0, 1) when αn = 1 − (pn + 1)/(n − kn) (i.e., MCp). Hence, MCp is more suitable for variable selection in high-dimensional data contexts from the perspective of prediction.

4 Asymptotic inefficiency of GCp

As noted in the previous section, αn → a = 1 − cp is a key condition for GCp to acquire ALE and AME. In this section, we show that this is a necessary condition. Namely, when αn → a ≠ 1 − cp, there is a situation in which

\[
\lim_{n \to \infty} \Pr\!\left( \frac{L_n(\hat{M}_n)}{L_n(M_L^*)} > 1 \right) = 1, \qquad
\lim_{n \to \infty} \frac{\mathrm{E}[L_n(\hat{M}_n)]}{R_n(M_R^*)} > 1,
\]

even under conditions (C1)–(C4).


For expository purposes, let X = (x1, x2), i.e., kn = 2, such that X⊤X = I2, Γ∗ = √n x2 β⊤, where β ∈ R^{pn}, Σ∗ = I_{pn}, and Mn = {{1}, {1, 2}}. Suppose that cp ∈ (0, 1) and β satisfies ∥β∥² → b ∈ (0, ∞), where ∥·∥ is the Euclidean norm. Then, because σ1(∆MF) = 0, Rn({1}) = n∥β∥² + pn, and Rn({1, 2}) = 2pn, conditions (C1)–(C4) are satisfied for sufficiently large n.

From the definition of GCp,

\[
\begin{aligned}
\mathrm{GC}_p(\{1,2\};\alpha_n) - \mathrm{GC}_p(\{1\};\alpha_n)
&= n\alpha_n\, \mathrm{tr}\{(\hat{\Sigma}_{\{1,2\}} - \hat{\Sigma}_{\{1\}}) S^{-1}\} + 2p_n \\
&= -(n-2)\alpha_n\, x_2^\top Y Y^\top x_2 \cdot
\frac{x_2^\top Y \{Y^\top(I_n - x_1 x_1^\top - x_2 x_2^\top) Y\}^{-1} Y^\top x_2}{x_2^\top Y Y^\top x_2} + 2p_n.
\end{aligned}
\]

It follows from Theorem 3.2.11 in Muirhead (1982) that

\[
\left( \frac{x_2^\top Y \{Y^\top(I_n - x_1 x_1^\top - x_2 x_2^\top) Y\}^{-1} Y^\top x_2}{x_2^\top Y Y^\top x_2} \right)^{-1} \sim \chi^2_{n - p_n - 1}.
\]

On the other hand, because Y⊤x2 = √n β + E⊤x2 ∼ N_{pn}(√n β, I_{pn}), we have x2⊤Y Y⊤x2 ∼ χ²_{pn}(n∥β∥²), which denotes a non-central chi-square distribution with non-centrality parameter n∥β∥². Note that χ²_{n−pn−1}/n = 1 − cp + op(1) and χ²_{pn}(n∥β∥²)/n = cp + b + op(1). Hence, it holds that

\[
\frac{\mathrm{GC}_p(\{1,2\};\alpha_n) - \mathrm{GC}_p(\{1\};\alpha_n)}{n}
= -\frac{a(c_p + b)}{1 - c_p} + 2c_p + o_p(1). \tag{8}
\]

Meanwhile, the loss functions of models {1} and {1, 2} are given as

\[
L_n(\{1\}) = n\|\beta\|^2 + x_1^\top E E^\top x_1, \qquad
L_n(\{1,2\}) = x_1^\top E E^\top x_1 + x_2^\top E E^\top x_2.
\]

Because x_i^\top E E^\top x_i \sim \chi^2_{p_n} (i = 1, 2), it follows that

\[
\frac{L_n(\{1\})}{L_n(\{1,2\})} \xrightarrow{p} \frac{c_p + b}{2c_p} \in (0, \infty), \tag{9}
\]
\[
\lim_{n \to \infty} \frac{R_n(\{1\})}{R_n(\{1,2\})} = \frac{c_p + b}{2c_p} \in (0, \infty). \tag{10}
\]
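The limits (8) and (9) can be checked numerically. The sketch below uses hypothetical values n = 2000, c_p = 0.3, b = 0.5, and a = 1 (the original Cp); it is a Monte Carlo illustration of the counterexample, not part of the proof:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical limits: p_n/n -> c_p, ||beta||^2 -> b, alpha_n -> a.
n, c_p, b, a = 2000, 0.3, 0.5, 1.0
p_n = int(c_p * n)

Q, _ = np.linalg.qr(rng.standard_normal((n, 2)))    # orthonormal design X = (x1, x2)
x1, x2 = Q[:, 0], Q[:, 1]
beta = np.full(p_n, np.sqrt(b / p_n))               # ||beta||^2 = b exactly
E_mat = rng.standard_normal((n, p_n))
Y = np.sqrt(n) * np.outer(x2, beta) + E_mat         # Gamma_* = sqrt(n) x2 beta^T, Sigma_* = I

# GCp({1,2}) - GCp({1}) = -a (n-2) v^T G^{-1} v + 2 p_n, where v = Y^T x2 and
# G = Y^T (I - x1 x1^T - x2 x2^T) Y, because Sigma_hat_{1,2} - Sigma_hat_{1} = -v v^T / n.
v = Y.T @ x2
u = Y.T @ x1
G = Y.T @ Y - np.outer(u, u) - np.outer(v, v)
diff = -a * (n - 2) * (v @ np.linalg.solve(G, v)) + 2 * p_n

# diff / n should be close to -a(c_p + b)/(1 - c_p) + 2 c_p, as in (8).
# The loss ratio should be close to (c_p + b)/(2 c_p), as in (9).
loss1 = n * (beta @ beta) + np.sum((E_mat.T @ x1) ** 2)
loss12 = np.sum((E_mat.T @ x1) ** 2) + np.sum((E_mat.T @ x2) ** 2)
ratio = loss1 / loss12
```

At these values the predicted limits are about −0.543 for diff/n and 4/3 for the loss ratio; one draw at n = 2000 typically lands within a few percent of each.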

First, we consider a situation where a > 0. Let b = cp(1 − cp)/a. It follows from (8) and (9) that

\[
\frac{\mathrm{GC}_p(\{1,2\};\alpha_n) - \mathrm{GC}_p(\{1\};\alpha_n)}{n} \xrightarrow{p} \frac{c_p(1 - c_p - a)}{1 - c_p}, \qquad
\frac{L_n(\{1\})}{L_n(\{1,2\})} \xrightarrow{p} \frac{a + 1 - c_p}{2a} = 1 + \frac{1 - c_p - a}{2a}.
\]

Hence, we have

\[
\frac{L_n(\hat{M}_n)}{L_n(M_L^*)} \xrightarrow{p}
\begin{cases}
(a + 1 - c_p)/(2a) > 1, & a < 1 - c_p, \\
(2a)/(a + 1 - c_p) > 1, & a > 1 - c_p.
\end{cases}
\]

This implies that GCp does not exhibit ALE when 0 < a < 1 − cp or a > 1 − cp.

On the other hand, (10) yields M_R^* = {1, 2} (resp. {1}) for sufficiently large n when a < 1 − cp (resp. a > 1 − cp). Thus, writing M_R^{**} = Mn \ M_R^*, we can see that

\[
\begin{aligned}
\frac{\mathrm{E}[L_n(\hat{M}_n)]}{R_n(M_R^*)}
&= \frac{\mathrm{E}[L_n(M_R^*)\, I(\hat{M}_n = M_R^*)]}{R_n(M_R^*)} + \frac{\mathrm{E}[L_n(M_R^{**})\, I(\hat{M}_n = M_R^{**})]}{R_n(M_R^*)} \\
&= \frac{R_n(M_R^{**})}{R_n(M_R^*)} - \frac{\mathrm{E}[\{L_n(M_R^{**}) - L_n(M_R^*)\}\, I(\hat{M}_n = M_R^*)]}{R_n(M_R^*)} \\
&\ge \frac{R_n(M_R^{**})}{R_n(M_R^*)} - \frac{\sqrt{\mathrm{E}[\{L_n(\{1\}) - L_n(\{1,2\})\}^2]}}{R_n(M_R^*)} \sqrt{\Pr(\hat{M}_n = M_R^*)},
\end{aligned}
\]

where I(·) is an indicator function and the last inequality follows from the Cauchy–Schwarz inequality.

Note that

\[
\begin{aligned}
\frac{\sqrt{\mathrm{E}[\{L_n(\{1,2\}) - L_n(\{1\})\}^2]}}{R_n(M_R^*)}
&= \sqrt{\mathrm{E}[(\chi^2_{p_n} - n\|\beta\|^2)^2]}\, \max\!\left\{\frac{1}{2p_n}, \frac{1}{p_n + n\|\beta\|^2}\right\} \\
&= \sqrt{2p_n + (p_n - n\|\beta\|^2)^2}\, \max\!\left\{\frac{1}{2p_n}, \frac{1}{p_n + n\|\beta\|^2}\right\} \\
&\to |a - (1 - c_p)| \max\!\left\{\frac{1}{2a}, \frac{1}{a + 1 - c_p}\right\} < \infty.
\end{aligned}
\]

Because limn→∞ Pr(M̂n = M_R^*) = 0 and Rn(M_R^{**})/Rn(M_R^*) > 1, GCp does not exhibit AME when 0 < a < 1 − cp or 1 − cp < a.

Next, we consider a situation where a = 0. Then, (8) implies that Pr(M̂n = {1}) → 1. However, when b > cp, (9) and (10) yield Pr(M_L^* = {1, 2}) → 1 and M_R^* = {1, 2} for sufficiently large n, respectively. Hence, in the same manner as the argument for a > 0, we can see that GCp does not exhibit ALE or AME when a = 0.

Therefore, αn → a = 1 − cp is a necessary and sufficient condition for ALE and AME of GCp under conditions (C1)–(C4).

5 Simulation study

This section provides details of a simulation study comparing GCp for several choices of αn, where the goodness of a criterion is measured by the loss function of the best model it selects. We prepare three tuning parameters for αn: αn = 1 (i.e., Cp), αn = 1 − (pn + 1)/(n − kn) (i.e., MCp), and αn = 2/log n (i.e., a BIC-type Cp, say BCp). Because 2/log n ≤ 1 − (pn + 1)/(n − kn) ≤ 1 in our settings described below, the dimension of the model selected by Cp (resp. BCp) is larger (resp. smaller) than that selected by MCp. Generally speaking, this inequality always holds for sufficiently large n.
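The claimed ordering of the three tuning parameters can be verified numerically for the settings below (n from 100 to 800, pn = ⌊n/3⌋, kn = ⌊5 log n⌋; the natural logarithm is assumed, which matches the kn values reported in the tables):

```python
import math

# Tuning parameters compared in the simulation: BCp, MCp, and Cp.
def alphas(n):
    p_n = n // 3                       # high-dimensional case p_n = floor(n/3)
    k_n = int(5 * math.log(n))         # k_n = floor(5 log n)
    a_cp = 1.0
    a_mcp = 1.0 - (p_n + 1) / (n - k_n)
    a_bcp = 2.0 / math.log(n)
    return a_bcp, a_mcp, a_cp

# A smaller alpha_n down-weights the fit term, so the 2 k_M p_n penalty dominates
# and smaller models are selected: BCp <= MCp <= Cp in selected dimension.
for n in (100, 200, 400, 800):
    a_bcp, a_mcp, a_cp = alphas(n)
    assert a_bcp <= a_mcp <= a_cp
```

For instance, at n = 100 (pn = 33, kn = 23) the three values are roughly 0.434, 0.558, and 1.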

Hereafter, we explain the simulation settings. Let the first column of X be a vector of ones in R^n and the other entries be independently generated from the uniform distribution U(0, 1). For all 1 ≤ i ≤ kn and 1 ≤ j ≤ pn, let (B∗)ij = u_{ij} d_i, where the u_{ij} are independently generated from U(0, 1/2) and d_i = 5√(kn − i + 1)/kn. For comparative purposes, we examine a situation where Γ∗ = XB∗, which implies that the full model is the true model. Suppose that Σ∗ = I_{pn}. To reduce the computational burden, we adopt a nested model set, i.e., Mn = {{1}, . . . , {1, . . . , kn}}. It should be noted that the true (full) model is not the best model from the perspective of prediction in our simulation study because some coefficients are very small, so variable selection makes sense in this situation. This supposition is confirmed below. We prepare two cases for pn, a high-dimensional case and a fixed-dimensional case: pn = ⌊n/3⌋ for the high-dimensional case, where ⌊α⌋ denotes the largest integer that does not exceed α, and pn = 10 for the fixed case. The sample size n varies from 100 to 800, and we set kn = ⌊5 log n⌋. Then, we generate Y and select the best subset of explanatory variables by each Cp-type criterion. After variable selection, we calculate the loss function for each best model.
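A single replication of this procedure can be sketched as follows (one generated data set, the high-dimensional case n = 100; the decay d_i = 5√(kn − i + 1)/kn is our reading of the design and should be treated as an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
p_n, k_n = n // 3, int(5 * np.log(n))           # the paper's high-dimensional setting

# Design and coefficients as described above: u_ij ~ U(0, 1/2), d_i decreasing in i.
X = np.column_stack([np.ones(n), rng.uniform(size=(n, k_n - 1))])
d = 5 * np.sqrt(k_n - np.arange(1, k_n + 1) + 1) / k_n
B_star = rng.uniform(0, 0.5, size=(k_n, p_n)) * d[:, None]
Y = X @ B_star + rng.standard_normal((n, p_n))  # Sigma_* = I

P_F = X @ np.linalg.solve(X.T @ X, X.T)
S = Y.T @ (np.eye(n) - P_F) @ Y / (n - k_n)

def selected_dim(alpha_n):
    """Dimension minimizing GC_p over the nested set {1}, {1,2}, ..., {1,...,k_n}."""
    crit = []
    for k in range(1, k_n + 1):
        X_M = X[:, :k]
        P_M = X_M @ np.linalg.solve(X_M.T @ X_M, X_M.T)
        Sigma_M = Y.T @ (np.eye(n) - P_M) @ Y / n
        crit.append(n * alpha_n * np.trace(np.linalg.solve(S, Sigma_M)) + 2 * k * p_n)
    return int(np.argmin(crit)) + 1

dim_cp = selected_dim(1.0)
dim_mcp = selected_dim(1.0 - (p_n + 1) / (n - k_n))
```

Because the fit term is nonincreasing and the penalty increasing in the model dimension, the selected dimension is monotone in αn, so dim_mcp ≤ dim_cp always holds here; averaging over many replications reproduces the pattern in Table 2.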

Table 1: Average values of Ln(M̂n)/Ln(M_L^*) and Ln(M̂n)/Rn(M_R^*) of Cp, MCp and BCp over 1,000 repetitions for each (n, pn, kn). Standard deviations are shown in parentheses. The best values of Ln(M̂n)/Ln(M_L^*) and Ln(M̂n)/Rn(M_R^*) are emboldened for each (n, pn, kn). All values are rounded to 3 decimal places.

                       Ln(M̂n)/Ln(M*L)                  Ln(M̂n)/Rn(M*R)
   n   pn  kn      Cp       MCp      BCp           Cp       MCp      BCp
  100  33  23    1.805    1.057    1.012         1.798    1.053    1.008
                (0.311)  (0.088)  (0.030)       (0.312)  (0.093)  (0.037)
  200  66  26    1.291    1.033    1.114         1.277    1.022    1.101
                (0.111)  (0.034)  (0.030)       (0.113)  (0.039)  (0.011)
  400 133  29    1.151    1.013    1.599         1.150    1.012    1.597
                (0.048)  (0.014)  (0.104)       (0.052)  (0.024)  (0.097)
  800 266  33    1.081    1.003    1.813         1.079    1.001    1.809
                (0.024)  (0.005)  (0.139)       (0.027)  (0.015)  (0.134)
  100  10  23    1.214    1.117    1.031         1.183    1.087    1.003
                (0.216)  (0.139)  (0.040)       (0.222)  (0.150)  (0.045)
  200  10  26    1.110    1.101    1.158         1.075    1.065    1.117
                (0.100)  (0.087)  (0.073)       (0.117)  (0.101)  (0.027)
  400  10  29    1.068    1.067    1.783         1.043    1.042    1.732
                (0.052)  (0.051)  (0.135)       (0.089)  (0.085)  (0.023)
  800  10  33    1.043    1.044    2.222         1.030    1.032    2.183
                (0.042)  (0.044)  (0.295)       (0.083)  (0.082)  (0.228)

Table 1 provides the average values of Ln(M̂n)/Ln(M_L^*) and Ln(M̂n)/Rn(M_R^*) of Cp, MCp and BCp based on 1,000 repetitions for each (n, pn, kn). Note that Ln(M̂n)/Ln(M_L^*) and Ln(M̂n)/Rn(M_R^*) are criteria for ALE and AME, respectively, and smaller is better. From this table, we can confirm that MCp performs well regardless of pn, and that Cp works well when pn = 10 but not when pn is large. On the other hand, BCp has higher values of Ln(M̂n)/Ln(M_L^*) and Ln(M̂n)/Rn(M_R^*) except when the sample size is small. These results concur with our theoretical exposition regarding efficiency and inefficiency.

Table 2: Average dimensions of the models selected by Cp, MCp, and BCp and of the loss-minimizing models over 1,000 repetitions for each (n, pn, kn). Standard deviations are shown in parentheses. All values are rounded to 3 decimal places.

   n   pn  kn       Cp       MCp      BCp      Loss
  100  33  23    17.870    2.953    1.234    1.574
                 (3.691)  (2.808)  (0.826)  (1.327)
  200  66  26    21.582    9.064    1.023    9.119
                 (2.337)  (3.616)  (0.234)  (2.918)
  400 133  29    25.911   15.723    1.766   15.296
                 (1.427)  (2.492)  (1.528)  (1.573)
  800 266  33    31.013   24.783    7.324   24.676
                 (0.917)  (0.857)  (1.830)  (0.616)
  100  10  23     6.271    3.474    1.026    2.656
                 (4.240)  (3.137)  (0.165)  (1.884)
  200  10  26     9.388    7.785    1.008    8.069
                 (4.121)  (3.954)  (0.089)  (3.076)
  400  10  29    17.677   17.001    1.057   17.441
                 (3.740)  (3.709)  (0.249)  (3.378)
  800  10  33    25.241   24.986    2.459   25.284
                 (2.472)  (2.552)  (1.657)  (1.509)

Table 2 shows the average dimensions of the models selected by each GCp and of the loss-minimizing models. It indicates that the dimension of the loss-minimizing model varies with the sample size and that the full model is not (always) the best model even though the full model is true. Under our simulation settings, BCp tends to select much smaller models than those with the smallest loss, while Cp often selects larger models when pn is large. The average dimension of the models selected by MCp is close to that of the loss-minimizing models in both the high- and fixed-dimensional situations. This implies that αn substantially affects the dimensions of the selected models as well as efficiency.

Hence, these results indicate that MCp is a useful variable selection method regardless of pn, and thus we recommend its use from the perspective of robust prediction.


6 Conclusions

We have derived sufficient conditions for ALE and AME of GCp in high-dimensional multivariate linear regression models. It is shown that MCp exhibits ALE and AME in high-dimensional data, while the original Cp, known as an asymptotically efficient criterion in univariate cases, does not exhibit ALE or AME under the same conditions. This is because a non-trivial bias term is omitted in the original Cp as an estimator of the risk function; this term plays an important role in adapting to high-dimensional frameworks. Indeed, we showed that if the tuning parameter αn of GCp converges to some a ≠ 1 − cp, as in the cases of Cp and BCp, then GCp is asymptotically inefficient. Through a simulation study, the finite sample performances of Cp-type criteria were compared, and MCp outperformed Cp and BCp in high-dimensional data.

As shown in Fujikoshi et al. (2014), MCp is consistent for variable selection in high-dimensional multivariate linear regression models under moderate conditions when the true model is included in the set of candidate models. Hence, MCp can be regarded as a feasible method for variable selection from the perspective of both prediction and interpretability. This attractive property is only seen in high-dimensional situations.

When pn is greater than n, we cannot directly calculate S^{-1} and thus GCp. Therefore, we need different approaches to estimate the covariance matrix Σ, such as sparse or ridge estimation (e.g., Yamamura et al. 2010, Katayama and Imori 2014, Fujikoshi and Sakurai 2016). If we can estimate Σ accurately via these procedures, ALE and AME can be established by using the estimate in place of S. It should also be noted that our proof depends on the assumption that the response matrix follows a Gaussian distribution. Because we use some properties of the Gaussian distribution, this is not a trivial limitation from the perspective of generalizing the results. How best to navigate this issue represents fruitful terrain for future research.

Acknowledgements

This study is supported in part by JSPS KAKENHI Grant Number JP17K12650 and "Funds for the Development of Human Resources in Science and Technology" under MEXT, through the "Home for Innovative Researchers and Academic Knowledge Users (HIRAKU)" consortium.

References

H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19:716–723, 1974.

A. C. Atkinson. A note on the generalized information criterion for choice of a model. Biometrika, 67(2):413–418, 1980.

Z. Bai, K. P. Choi, and Y. Fujikoshi. Consistency of AIC and BIC in estimating the number of significant components in high-dimensional principal component analysis. The Annals of Statistics, 46(3):1050–1076, 2018.

Y. Fujikoshi and T. Sakurai. High-dimensional consistency of rank estimation criteria in multivariate linear model. Journal of Multivariate Analysis, 149:199–212, 2016.

Y. Fujikoshi and K. Satoh. Modified AIC and Cp in multivariate linear regression. Biometrika, 84(3):707–716, 1997.

Y. Fujikoshi, T. Sakurai, and H. Yanagihara. Consistency of high-dimensional AIC-type and Cp-type criteria in multivariate linear regression. Journal of Multivariate Analysis, 123:184–200, 2014.

S. Imori and D. von Rosen. Covariance components selection in high-dimensional growth curve model with random coefficients. Journal of Multivariate Analysis, 136:86–94, 2015.

S. Katayama and S. Imori. Lasso penalized model selection criteria for high-dimensional multivariate linear regression analysis. Journal of Multivariate Analysis, 132:138–150, 2014.

K.-C. Li. Asymptotic optimality for Cp, CL, cross-validation and generalized cross-validation: Discrete index set. The Annals of Statistics, 15(3):958–975, 1987.

C. L. Mallows. Some comments on Cp. Technometrics, 15(4):661–675, 1973.

R. J. Muirhead. Aspects of Multivariate Statistical Theory. Wiley Series in Probability and Statistics. John Wiley & Sons, Inc., 1982.

I. Nagai, H. Yanagihara, and K. Satoh. Optimization of ridge parameters in multivariate generalized ridge regression by plug-in methods. Hiroshima Mathematical Journal, 42(3):301–324, 2012.

R. Nishii. Asymptotic properties of criteria for selection of variables in multiple regression. The Annals of Statistics, 12(2):758–765, 1984.

G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.

J. Shao. An asymptotic theory for linear model selection. Statistica Sinica, pages 221–242, 1997.

R. Shibata. An optimal selection of regression variables. Biometrika, 68(1):45–54, 1981.

R. Shibata. Amendments and corrections: An optimal selection of regression variables. Biometrika, 69(2):492–492, 1982.

R. Shibata. Asymptotic mean efficiency of a selection of regression variables. Annals of the Institute of Statistical Mathematics, 35(3):415–423, 1983.

M. J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019.

M. Yamamura, H. Yanagihara, and M. S. Srivastava. Variable selection in multivariate linear regression models with fewer observations than the dimension. Japanese Journal of Applied Statistics, 39(1):1–19, 2010.

H. Yanagihara. Conditions for consistency of a log-likelihood-based information criterion in normal multivariate linear regression models under the violation of the normality assumption. Journal of the Japan Statistical Society, 45(1):21–56, 2015.

H. Yanagihara, H. Wakaki, and Y. Fujikoshi. A consistency property of the AIC for multivariate linear models when the dimension and the sample size are large. Electronic Journal of Statistics, 9(1):869–897, 2015.


A Appendix

In this section, we show ALE and AME of GCp. The outline of the proof is based on Li (1987) and Shibata (1983), although techniques from random matrix theory are required to overcome the difficulties imposed by high dimensionality.

A.1 Concentration inequalities

We introduce three concentration inequalities that are used in showing ALE and AME.

Lemma A.1. Let χ²_m be a random variable distributed as a chi-square distribution with m degrees of freedom. Then, for all t > 0,

\[
\Pr(\chi^2_m - m \ge mt) \le \exp\!\left\{ -\frac{m\{t - \log(1+t)\}}{2} \right\}, \qquad
\Pr(\chi^2_m - m \le -mt) \le \exp\!\left( -\frac{mt^2}{4} \right).
\]

The former is obtained from Shibata (1982), while the latter is given in Lemma 2.1 of Shibata (1981).
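These two tail bounds can be compared against Monte Carlo estimates of the corresponding probabilities; the sketch below uses arbitrary illustrative choices m = 50 and t = 0.5:

```python
import numpy as np

rng = np.random.default_rng(2)
m, t, N = 50, 0.5, 200_000
chi2 = rng.chisquare(m, size=N)

# Empirical upper-tail probability versus exp{-m(t - log(1+t))/2}
upper_emp = np.mean(chi2 - m >= m * t)
upper_bnd = np.exp(-m * (t - np.log(1 + t)) / 2)

# Empirical lower-tail probability versus exp(-m t^2 / 4)
lower_emp = np.mean(chi2 - m <= -m * t)
lower_bnd = np.exp(-m * t**2 / 4)
```

Both empirical tail frequencies come out well below the corresponding exponential bounds, as expected for concentration inequalities of this type.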

Lemma A.2. Let X be a random variable distributed as N(0, 1). Then, for all t > 0,

\[ \Pr(X \ge t) \le \exp\!\left( -\frac{t^2}{2} \right). \]

Lemma A.3. Let X ∼ N_{n,p}(O, I_n, I_p) and n > p. It holds that for all t > 0,

\[ \sqrt{n} - \sqrt{p} - t \le \sigma_p(X) \le \sigma_1(X) \le \sqrt{n} + \sqrt{p} + t \]

with probability at least 1 − 2 exp(−Ct²), where C is a positive constant that does not depend on n and p.

Because Lemmas A.2 and A.3 can be found elsewhere (see, e.g., Wainwright 2019: Example 2.1, Example 6.2), we omit their proofs for brevity.
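A quick numerical illustration of Lemma A.3 on a single Gaussian matrix (arbitrary sizes n = 2000, p = 400 and slack t = 3):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 2000, 400
X = rng.standard_normal((n, p))
s = np.linalg.svd(X, compute_uv=False)   # singular values sigma_1 >= ... >= sigma_p

# With high probability, the extreme singular values stay within
# [sqrt(n) - sqrt(p) - t, sqrt(n) + sqrt(p) + t].
t = 3.0
low = np.sqrt(n) - np.sqrt(p) - t
high = np.sqrt(n) + np.sqrt(p) + t
```

For these sizes √n − √p ≈ 24.7 and √n + √p ≈ 64.7, and the observed σ_p(X) and σ_1(X) concentrate near those edges.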


A.2 Proof of Theorem 3.1

From the definitions of $L_n(M)$ and $GC_p(M;\alpha_n)$, their difference can be decomposed as
$$\begin{aligned}
GC_p(M;\alpha_n) - L_n(M) &= 2\{k_Mp_n - \mathrm{tr}(E^\top P_ME)\} + 2\,\mathrm{tr}(\Sigma_*^{-1/2}\Gamma_*^\top P_M^\perp E) + \mathrm{tr}\{\Gamma_*^\top(P_{M_F} - P_M)\Gamma_*(\alpha_nS^{-1} - \Sigma_*^{-1})\}\\
&\quad - \mathrm{tr}\{\Sigma_*^{1/2}E^\top P_ME\Sigma_*^{1/2}(\alpha_nS^{-1} - \Sigma_*^{-1})\} + 2\,\mathrm{tr}\{\Gamma_*^\top(P_{M_F} - P_M)E\Sigma_*^{1/2}(\alpha_nS^{-1} - \Sigma_*^{-1})\}\\
&\quad + \mathrm{tr}\{\Sigma_*^{1/2}E^\top P_{M_F}E\Sigma_*^{1/2}(\alpha_nS^{-1} - \Sigma_*^{-1})\} + \mathrm{tr}\{Y^\top P_{M_F}^\perp Y(\alpha_nS^{-1} - \Sigma_*^{-1})\}.
\end{aligned}$$

Thus, we only need to consider the terms that depend on $M$, defined as follows:
$$\begin{aligned}
B_n(M) &= 2\{k_Mp_n - \mathrm{tr}(E^\top P_ME)\} + 2\,\mathrm{tr}(\Sigma_*^{-1/2}\Gamma_*^\top P_M^\perp E) + \mathrm{tr}\{\Gamma_*^\top(P_{M_F} - P_M)\Gamma_*(\alpha_nS^{-1} - \Sigma_*^{-1})\}\\
&\quad - \mathrm{tr}\{\Sigma_*^{1/2}E^\top P_ME\Sigma_*^{1/2}(\alpha_nS^{-1} - \Sigma_*^{-1})\} + 2\,\mathrm{tr}\{\Gamma_*^\top(P_{M_F} - P_M)E\Sigma_*^{1/2}(\alpha_nS^{-1} - \Sigma_*^{-1})\}.
\end{aligned}$$

It follows from the definition of $\hat{M}_n$ that
$$L_n(\hat{M}_n) + \{GC_p(\hat{M}_n;\alpha_n) - L_n(\hat{M}_n)\} \le L_n(M_L^*) + \{GC_p(M_L^*;\alpha_n) - L_n(M_L^*)\},$$
and thus, because the terms of $GC_p(M;\alpha_n) - L_n(M)$ that do not depend on $M$ cancel on both sides,
$$L_n(\hat{M}_n)\left\{1 + \frac{B_n(\hat{M}_n)}{L_n(\hat{M}_n)}\right\} \le L_n(M_L^*)\left\{1 + \frac{B_n(M_L^*)}{L_n(M_L^*)}\right\}.$$

This implies that $GC_p$ exhibits ALE if we can show that $\max_{M\in\mathcal{M}_n}|B_n(M)|/L_n(M)$ converges to zero in probability. The following lemma provides a concentration inequality for $\max_{M\in\mathcal{M}_n}|B_n(M)|/L_n(M)$.

Lemma A.4. Let $\gamma \in (0,1)$. Suppose that conditions (C1) and (C2) hold. If $\alpha_n \to a = 1 - c_p$ as $n \to \infty$, then for all $\varepsilon > 0$, there exist positive constants $C_j > 0$ ($j = 1, \dots, 4$) such that for sufficiently large $n$,
$$\Pr\left(\max_{M\in\mathcal{M}_n}\frac{|B_n(M)|}{L_n(M)} > \varepsilon\right) \le C_1\exp(-C_2n^\gamma) + C_3\sum_{M\in\mathcal{M}_n}\exp\{-C_4R_n(M)\}.$$

Because $L_n(M_L^*) \le L_n(M)$ for all $M \in \mathcal{M}_n$, Lemma A.4 immediately yields Theorem 3.1. Hence, hereafter, we attempt to show Lemma A.4. The following lemma enables us to consider $\max_{M\in\mathcal{M}_n}|B_n(M)|/R_n(M)$ instead of $\max_{M\in\mathcal{M}_n}|B_n(M)|/L_n(M)$.

Lemma A.5. For all $\varepsilon > 0$, there exists a positive constant $C_\varepsilon > 0$ such that for all $M \in \mathcal{M}_n$,
$$\Pr\left(\left|\frac{L_n(M)}{R_n(M)} - 1\right| > \varepsilon\right) \le 2\exp\{-C_\varepsilon R_n(M)\}.$$


Proof. Let
$$\xi(M) = \frac{L_n(M) - R_n(M)}{R_n(M)} = \frac{\mathrm{tr}(E^\top P_ME) - k_Mp_n}{R_n(M)}.$$
Note that $\mathrm{tr}(E^\top P_ME)$ follows a chi-square distribution with $k_Mp_n$ degrees of freedom. For all $\varepsilon > 0$, it holds from Lemma A.1 and the fact that $R_n(M) \ge k_Mp_n$ that

$$\begin{aligned}
\Pr(|\xi(M)| > \varepsilon) &= \Pr\left(\frac{\mathrm{tr}(E^\top P_ME) - k_Mp_n}{k_Mp_n} > \frac{\varepsilon R_n(M)}{k_Mp_n}\right) + \Pr\left(\frac{\mathrm{tr}(E^\top P_ME) - k_Mp_n}{k_Mp_n} < -\frac{\varepsilon R_n(M)}{k_Mp_n}\right)\\
&\le \exp\left[-\frac{k_Mp_n}{2}\left\{\frac{\varepsilon R_n(M)}{k_Mp_n} - \log\left(1 + \frac{\varepsilon R_n(M)}{k_Mp_n}\right)\right\}\right] + \exp\left\{-\frac{\varepsilon^2R_n(M)^2}{4k_Mp_n}\right\}\\
&\le \exp\left[-\frac{\varepsilon R_n(M)}{2}\left\{1 - \frac{k_Mp_n}{\varepsilon R_n(M)}\log\left(1 + \frac{\varepsilon R_n(M)}{k_Mp_n}\right)\right\}\right] + \exp\left\{-\frac{\varepsilon^2R_n(M)}{4}\right\}.
\end{aligned}$$

Here, the function $f(x) = 1 - x^{-1}\log(1+x)$ is monotonically increasing for all $x \ge 1$, and thus $f(x) \ge f(1) = 1 - \log 2 > 0.3$ for all $x \ge 1$. Therefore, if $\varepsilon R_n(M) \ge k_Mp_n$, then

$$\Pr(|\xi(M)| > \varepsilon) \le \exp\left\{-\frac{3\varepsilon R_n(M)}{20}\right\} + \exp\left\{-\frac{\varepsilon^2R_n(M)}{4}\right\}.$$
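The monotonicity of $f$ and the bound $f(1) = 1 - \log 2 > 0.3$ used above are elementary calculus facts; as a quick illustrative check on a grid (assuming numpy, grid endpoints arbitrary):

```python
import numpy as np

# f(x) = 1 - log(1 + x)/x, the function used in the proof of Lemma A.5.
x = np.linspace(1.0, 100.0, 10_000)
f = 1.0 - np.log1p(x) / x

increasing = bool(np.all(np.diff(f) > 0))  # f is increasing on the grid
at_one = f[0]                              # f(1) = 1 - log(2)
```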

If not, $R_n(M) \ge k_Mp_n > \varepsilon R_n(M)$ implies that
$$\begin{aligned}
\Pr(|\xi(M)| > \varepsilon) &\le \Pr\left(\frac{\mathrm{tr}(E^\top P_ME) - k_Mp_n}{k_Mp_n} > \varepsilon\right) + \Pr\left(\frac{\mathrm{tr}(E^\top P_ME) - k_Mp_n}{k_Mp_n} < -\frac{\varepsilon R_n(M)}{k_Mp_n}\right)\\
&\le \exp\left[-\frac{\{\varepsilon - \log(1+\varepsilon)\}k_Mp_n}{2}\right] + \exp\left\{-\frac{\varepsilon^2R_n(M)}{4}\right\}\\
&\le \exp\left[-\frac{\varepsilon\{\varepsilon - \log(1+\varepsilon)\}R_n(M)}{2}\right] + \exp\left\{-\frac{\varepsilon^2R_n(M)}{4}\right\}.
\end{aligned}$$

By letting $C_\varepsilon = \min[3\varepsilon/20,\ \varepsilon^2/4,\ \varepsilon\{\varepsilon - \log(1+\varepsilon)\}/2] > 0$, we can show that for all $M \in \mathcal{M}_n$,
$$\Pr(|\xi(M)| > \varepsilon) \le 2\exp\{-C_\varepsilon R_n(M)\}.$$

Note that the proof of Lemma A.5 also provides an evaluation of the first term of $B_n(M)$. The next lemma provides an evaluation of the second term of $B_n(M)$.


Lemma A.6. For all $\varepsilon > 0$ and for all $M \in \mathcal{M}_n$,
$$\Pr\left(\frac{|\mathrm{tr}(\Sigma_*^{-1/2}\Gamma_*^\top P_M^\perp E)|}{R_n(M)} > \varepsilon\right) \le 2\exp\left\{-\frac{\varepsilon^2R_n(M)}{2}\right\}.$$

Proof. Because $\mathrm{tr}(\Sigma_*^{-1/2}\Gamma_*^\top P_M^\perp E)/\sqrt{\mathrm{tr}(\Delta_M)}$ follows $N(0,1)$, Lemma A.2 implies that for all $\varepsilon > 0$,
$$\Pr\left(\frac{|\mathrm{tr}(\Sigma_*^{-1/2}\Gamma_*^\top P_M^\perp E)|}{R_n(M)} > \varepsilon\right) = \Pr\left(\frac{|\mathrm{tr}(\Sigma_*^{-1/2}\Gamma_*^\top P_M^\perp E)|}{\sqrt{\mathrm{tr}(\Delta_M)}} > \frac{\varepsilon R_n(M)}{\sqrt{\mathrm{tr}(\Delta_M)}}\right) \le 2\exp\left\{-\frac{\varepsilon^2R_n(M)^2}{2\,\mathrm{tr}(\Delta_M)}\right\} \le 2\exp\left\{-\frac{\varepsilon^2R_n(M)}{2}\right\}.$$
The last inequality follows from $\mathrm{tr}(\Delta_M) \le R_n(M)$.

Next, we evaluate the third term of $B_n(M)$. For all $M \in \mathcal{M}_n \setminus \{M_F\}$, let $Q_M \in \mathbb{R}^{(k_n-k_M)\times n}$ satisfy $Q_M^\top Q_M = P_{M_F} - P_M$ and $Q_MQ_M^\top = I_{k_n-k_M}$. Then, we consider the matrix
$$W_M = \alpha_n(A_M^\top A_M)^{-1/2}A_M^\top\Sigma_*^{1/2}S^{-1}\Sigma_*^{1/2}A_M(A_M^\top A_M)^{-1/2},$$
where $A_M = \Sigma_*^{-1/2}\Gamma_*^\top Q_M^\top$. This yields

$$\begin{aligned}
|\mathrm{tr}\{\Gamma_*^\top(P_{M_F} - P_M)\Gamma_*(\alpha_nS^{-1} - \Sigma_*^{-1})\}| &= |\mathrm{tr}\{A_MA_M^\top(\alpha_n\Sigma_*^{1/2}S^{-1}\Sigma_*^{1/2} - I_{p_n})\}|\\
&= |\mathrm{tr}\{A_M^\top A_M(W_M - I_{k_n-k_M})\}|\\
&\le \sigma_1(W_M - I_{k_n-k_M})\,\mathrm{tr}(A_M^\top A_M).
\end{aligned}$$

Therefore, to evaluate the third term of $B_n(M)$, we calculate $\sigma_1(W_M - I_{k_n-k_M})$ and $\mathrm{tr}(A_M^\top A_M)$. The latter can be evaluated as $\mathrm{tr}(A_M^\top A_M) \le \mathrm{tr}(\Delta_M) \le R_n(M)$. Hence, we attempt to show that
$$\max_{M\in\mathcal{M}_n\setminus\{M_F\}}\sigma_1(W_M - I_{k_n-k_M}) = o(1)$$
with high probability.

Let $H_F$ be an $n \times (n-k_n)$ matrix such that $P_{M_F}^\perp = H_FH_F^\top$ and $H_F^\top H_F = I_{n-k_n}$. Then, define the event
$$\mathcal{E}^{(1)}_{n,\gamma}:\quad \sqrt{n-k_n} - \sqrt{p_n} - n^{\gamma/2} \le \sigma_{p_n}(H_F^\top E) \le \sigma_1(H_F^\top E) \le \sqrt{n-k_n} + \sqrt{p_n} + n^{\gamma/2},$$
where $\gamma \in (0,1)$ is a constant. On the event $\mathcal{E}^{(1)}_{n,\gamma}$, we evaluate the difference between $W_M$ and

$$V_M = (n-k_n)\alpha_n(A_M^\top A_M)^{-1/2}A_M^\top(E^\top P_{M_F}^\perp E)^{-1}A_M(A_M^\top A_M)^{-1/2}.$$

Lemma A.7. Suppose that conditions (C1) and (C2) hold. Then, it follows that on $\mathcal{E}^{(1)}_{n,\gamma}$ with $\gamma \in (0,1)$,
$$\max_{M\in\mathcal{M}_n\setminus\{M_F\}}\sigma_1(W_M - V_M) = o(1).$$

Proof. Because the columns of $A_M(A_M^\top A_M)^{-1/2}$ are orthonormal,
$$\max_{M\in\mathcal{M}_n\setminus\{M_F\}}\sigma_1(W_M - V_M) \le (n-k_n)\alpha_n\,\sigma_1\big(\Sigma_*^{1/2}(Y^\top P_{M_F}^\perp Y)^{-1}\Sigma_*^{1/2} - (E^\top P_{M_F}^\perp E)^{-1}\big).$$

Without loss of generality, we assume $\Sigma_* = I_{p_n}$. From basic linear algebra, it follows that
$$\begin{aligned}
(Y^\top P_{M_F}^\perp Y)^{-1} - (E^\top P_{M_F}^\perp E)^{-1} &= -(Y^\top P_{M_F}^\perp Y)^{-1}(Y^\top P_{M_F}^\perp Y - E^\top P_{M_F}^\perp E)(E^\top P_{M_F}^\perp E)^{-1}\\
&= -(Y^\top P_{M_F}^\perp Y)^{-1}(\Delta_{M_F} + \Gamma_*^\top P_{M_F}^\perp E + E^\top P_{M_F}^\perp\Gamma_*)(E^\top P_{M_F}^\perp E)^{-1}.
\end{aligned}$$

This equation implies that
$$\sigma_1\big((Y^\top P_{M_F}^\perp Y)^{-1} - (E^\top P_{M_F}^\perp E)^{-1}\big) \le \frac{\sigma_1(\Delta_{M_F} + \Gamma_*^\top P_{M_F}^\perp E + E^\top P_{M_F}^\perp\Gamma_*)}{\sigma_{p_n}(Y^\top P_{M_F}^\perp Y)\,\sigma_{p_n}(E^\top P_{M_F}^\perp E)} \le \frac{\sigma_1(\Delta_{M_F}) + 2\sigma_1(\Gamma_*^\top P_{M_F}^\perp E)}{\sigma_{p_n}(Y^\top P_{M_F}^\perp Y)\,\sigma_{p_n}(E^\top P_{M_F}^\perp E)}. \quad (11)$$

Moreover, it holds from condition (C2) that
$$0 \le \sigma_{p_n}(\Delta_{M_F}) \le \sigma_1(\Delta_{M_F}) = o(n). \quad (12)$$

Hence, on $\mathcal{E}^{(1)}_{n,\gamma}$, under conditions (C1) and (C2), the following evaluations of the singular values are obtained:
$$\sigma_1(\Gamma_*^\top P_{M_F}^\perp E) \le \sqrt{\sigma_1(\Delta_{M_F})}\,\{\sqrt{n-k_n} + \sqrt{p_n} + n^{\gamma/2}\} = o(n), \qquad \sigma_{p_n}(E^\top P_{M_F}^\perp E) \ge \{\sqrt{n-k_n} - \sqrt{p_n} - n^{\gamma/2}\}^2 \asymp n, \quad (13)$$
where for a sequence $a_n > 0$, $a_n \asymp n$ means that $\limsup_{n\to\infty} a_n/n < \infty$ and $\limsup_{n\to\infty} n/a_n < \infty$. From (12)


and (13), we have
$$\frac{\sigma_1(\Delta_{M_F}) + 2\sigma_1(\Gamma_*^\top P_{M_F}^\perp E)}{\sigma_{p_n}(E^\top P_{M_F}^\perp E)} = o(1).$$

Note that $\alpha_n = O(1)$. Hence, (11) implies that it suffices for the proof to show that $\sigma_{p_n}(Y^\top P_{M_F}^\perp Y)/n$ does not converge to zero. It follows from (12) and (13) that
$$\sigma_{p_n}(Y^\top P_{M_F}^\perp Y) \ge \sigma_{p_n}(\Delta_{M_F}) + \sigma_{p_n}(E^\top P_{M_F}^\perp E) - 2\sigma_1(\Gamma_*^\top P_{M_F}^\perp E) \asymp n.$$
Hence, the proof is completed.

This lemma implies that $W_M$ can be replaced by $V_M$ on $\mathcal{E}^{(1)}_{n,\gamma}$ with a small bias. To evaluate the singular values of $V_M$ instead, we consider the following event:
$$\mathcal{E}^{(2)}_{n,\gamma}:\quad \frac{\{\sqrt{n-p_n} - \sqrt{k_n} - n^{\gamma/2}\}^2}{(n-k_n)\alpha_n} \le \sigma_{k_n}(V_\emptyset^{-1}) \le \sigma_1(V_\emptyset^{-1}) \le \frac{\{\sqrt{n-p_n} + \sqrt{k_n} + n^{\gamma/2}\}^2}{(n-k_n)\alpha_n},$$

where $\gamma \in (0,1)$,
$$V_\emptyset = (n-k_n)\alpha_nK_\emptyset^{-1/2}Q_\emptyset\Gamma_*\Sigma_*^{-1/2}(E^\top P_{M_F}^\perp E)^{-1}\Sigma_*^{-1/2}\Gamma_*^\top Q_\emptyset^\top K_\emptyset^{-1/2},$$
$K_\emptyset = Q_\emptyset\Gamma_*\Sigma_*^{-1}\Gamma_*^\top Q_\emptyset^\top$, and $Q_\emptyset = (X_{M_F}^\top X_{M_F})^{-1/2}X_{M_F}^\top$. Note that if $\alpha_n \to a = 1 - c_p$, then $V_\emptyset$ approaches $I_{k_n}$ on $\mathcal{E}^{(2)}_{n,\gamma}$.

Lemma A.8. Suppose that conditions (C1) and (C2) hold. If $\alpha_n \to a = 1 - c_p$, then it holds that on $\mathcal{E}^{(1)}_{n,\gamma} \cap \mathcal{E}^{(2)}_{n,\gamma}$ with $\gamma \in (0,1)$,
$$\max_{M\in\mathcal{M}_n\setminus\{M_F\}}\sigma_1(W_M - I_{k_n-k_M}) = o(1).$$

Proof. First, we consider $\sigma_1(V_M - I_{k_n-k_M})$ instead of $\sigma_1(W_M - I_{k_n-k_M})$. For all $M \in \mathcal{M}_n \setminus \{M_F\}$ and $x \in \mathbb{R}^{k_n-k_M}$, let $y = K_\emptyset^{1/2}Q_\emptyset Q_M^\top K_M^{-1/2}x \in \mathbb{R}^{k_n}$, where $K_M = Q_M\Gamma_*\Sigma_*^{-1}\Gamma_*^\top Q_M^\top$. Note that $P_{M_F}Q_M^\top Q_M = P_{M_F}(P_{M_F} - P_M) = P_{M_F} - P_M = Q_M^\top Q_M$, and thus $P_{M_F}Q_M^\top = Q_M^\top$. Hence, we have
$$\begin{aligned}
y^\top y &= x^\top K_M^{-1/2}Q_MQ_\emptyset^\top K_\emptyset Q_\emptyset Q_M^\top K_M^{-1/2}x\\
&= x^\top K_M^{-1/2}Q_MP_{M_F}\Gamma_*\Sigma_*^{-1}\Gamma_*^\top P_{M_F}Q_M^\top K_M^{-1/2}x\\
&= x^\top K_M^{-1/2}Q_M\Gamma_*\Sigma_*^{-1}\Gamma_*^\top Q_M^\top K_M^{-1/2}x\\
&= x^\top x.
\end{aligned}$$


This indicates that
$$\frac{1}{(n-k_n)\alpha_n}\frac{y^\top V_\emptyset y}{y^\top y} = \frac{y^\top K_\emptyset^{-1/2}Q_\emptyset\Gamma_*\Sigma_*^{-1/2}(E^\top P_{M_F}^\perp E)^{-1}\Sigma_*^{-1/2}\Gamma_*^\top Q_\emptyset^\top K_\emptyset^{-1/2}y}{y^\top y} = \frac{x^\top K_M^{-1/2}Q_M\Gamma_*\Sigma_*^{-1/2}(E^\top P_{M_F}^\perp E)^{-1}\Sigma_*^{-1/2}\Gamma_*^\top Q_M^\top K_M^{-1/2}x}{x^\top x} = \frac{1}{(n-k_n)\alpha_n}\frac{x^\top V_Mx}{x^\top x}.$$

Hence, for all $M \in \mathcal{M}_n \setminus \{M_F\}$,
$$\inf_{y\in\mathbb{R}^{k_n}}\frac{y^\top V_\emptyset y}{y^\top y} \le \inf_{x\in\mathbb{R}^{k_n-k_M}}\frac{x^\top V_Mx}{x^\top x} \le \sup_{x\in\mathbb{R}^{k_n-k_M}}\frac{x^\top V_Mx}{x^\top x} \le \sup_{y\in\mathbb{R}^{k_n}}\frac{y^\top V_\emptyset y}{y^\top y}.$$

Under condition (C1) and on $\mathcal{E}^{(2)}_{n,\gamma}$, the assumption $\alpha_n \to a = 1 - c_p$ yields
$$\max_{M\in\mathcal{M}_n\setminus\{M_F\}}\sigma_1(V_M - I_{k_n-k_M}) = o(1).$$
Hence, it follows from Lemma A.7 that on $\mathcal{E}^{(1)}_{n,\gamma} \cap \mathcal{E}^{(2)}_{n,\gamma}$,
$$\max_{M\in\mathcal{M}_n\setminus\{M_F\}}\sigma_1(W_M - I_{k_n-k_M}) \le \max_{M\in\mathcal{M}_n\setminus\{M_F\}}\sigma_1(W_M - V_M) + \max_{M\in\mathcal{M}_n\setminus\{M_F\}}\sigma_1(V_M - I_{k_n-k_M}) = o(1).$$

It remains to show that $\mathcal{E}^{(1)}_{n,\gamma}$ and $\mathcal{E}^{(2)}_{n,\gamma}$ occur with high probability. This is confirmed as follows.

Lemma A.9. Suppose that condition (C1) holds. There exists $C > 0$ such that for all $\gamma \in (0,1)$,
$$\Pr(\mathcal{E}^{(1)}_{n,\gamma}) \ge 1 - 2\exp(-Cn^\gamma), \qquad \Pr(\mathcal{E}^{(2)}_{n,\gamma}) \ge 1 - 2\exp(-Cn^\gamma).$$

Proof. Because $H_F^\top E \sim N_{(n-k_n),p_n}(O, I_{n-k_n}, I_{p_n})$, applying Lemma A.3 immediately yields the former assertion. On the other hand, $(n-k_n)\alpha_nV_\emptyset^{-1} \sim W_{k_n}(n-p_n, I_{k_n})$, which is shown by Theorem 3.2.11 in Muirhead (1982). By using $Z \sim N_{(n-p_n),k_n}(O, I_{n-p_n}, I_{k_n})$, we can observe that $Z^\top Z \sim W_{k_n}(n-p_n, I_{k_n})$. Hence,
$$\Pr(\mathcal{E}^{(2)}_{n,\gamma}) = \Pr\big(\sqrt{n-p_n} - \sqrt{k_n} - n^{\gamma/2} \le \sigma_{k_n}(Z) \le \sigma_1(Z) \le \sqrt{n-p_n} + \sqrt{k_n} + n^{\gamma/2}\big),$$
which is lower bounded by $1 - 2\exp(-Cn^\gamma)$ with some constant $C > 0$ due to Lemma A.3.

Hence, from Lemmas A.8 and A.9, the following lemma holds.

Lemma A.10. Let $\gamma \in (0,1)$. Suppose that conditions (C1) and (C2) hold. If $\alpha_n \to a = 1 - c_p$, then there exists $C > 0$ such that for all $\varepsilon > 0$ and for sufficiently large $n$,
$$\Pr\left(\max_{M\in\mathcal{M}_n}\frac{|\mathrm{tr}\{\Gamma_*^\top(P_{M_F} - P_M)\Gamma_*(\alpha_nS^{-1} - \Sigma_*^{-1})\}|}{R_n(M)} > \varepsilon\right) \le 4\exp(-Cn^\gamma).$$

We can evaluate the fourth term of $B_n(M)$ similarly to Lemma A.10 by replacing $Q_M$ with a matrix $\tilde{Q}_M \in \mathbb{R}^{k_M\times n}$ satisfying $\tilde{Q}_M^\top\tilde{Q}_M = P_M$ and $\tilde{Q}_M\tilde{Q}_M^\top = I_{k_M}$, and $A_M$ with $\tilde{A}_M = E^\top\tilde{Q}_M^\top$. Then, we obtain the following lemma.

Lemma A.11. Let $\gamma \in (0,1)$. Suppose that conditions (C1) and (C2) hold. If $\alpha_n \to a = 1 - c_p$, then there exist positive constants $C_1, C_2 > 0$ such that for all $\varepsilon > 0$ and for sufficiently large $n$,
$$\Pr\left(\max_{M\in\mathcal{M}_n}\frac{|\mathrm{tr}\{\Sigma_*^{1/2}E^\top P_ME\Sigma_*^{1/2}(\alpha_nS^{-1} - \Sigma_*^{-1})\}|}{R_n(M)} > \varepsilon\right) \le 2\sum_{M\in\mathcal{M}_n}\exp\{-C_1R_n(M)\} + 4\exp(-C_2n^\gamma).$$

Proof. The proof for the fourth term is slightly different from that of Lemma A.10 because $\mathrm{tr}(\tilde{A}_M^\top\tilde{A}_M) = \mathrm{tr}(E^\top P_ME)$ is a random variable, although it has already been evaluated in Lemma A.5. By using $\tilde{A}_M$, we have the following evaluation:
$$|\mathrm{tr}\{\Sigma_*^{1/2}E^\top P_ME\Sigma_*^{1/2}(\alpha_nS^{-1} - \Sigma_*^{-1})\}| = |\mathrm{tr}\{\tilde{A}_M^\top\tilde{A}_M(\tilde{W}_M - I_{k_M})\}| \le \sigma_1(\tilde{W}_M - I_{k_M})\,\mathrm{tr}(\tilde{A}_M^\top\tilde{A}_M),$$
where $\tilde{W}_M$ is the counterpart of $W_M$ defined by
$$\tilde{W}_M = \alpha_n(\tilde{A}_M^\top\tilde{A}_M)^{-1/2}\tilde{A}_M^\top\Sigma_*^{1/2}S^{-1}\Sigma_*^{1/2}\tilde{A}_M(\tilde{A}_M^\top\tilde{A}_M)^{-1/2}.$$

Here, let us divide the target probability into two parts as follows:
$$\begin{aligned}
&\Pr\left(\max_{M\in\mathcal{M}_n}\frac{|\mathrm{tr}\{\Sigma_*^{1/2}E^\top P_ME\Sigma_*^{1/2}(\alpha_nS^{-1} - \Sigma_*^{-1})\}|}{R_n(M)} > \varepsilon\right)\\
&\le \Pr\left(\max_{M\in\mathcal{M}_n}\frac{\sigma_1(\tilde{W}_M - I_{k_M})\,\mathrm{tr}(\tilde{A}_M^\top\tilde{A}_M)}{R_n(M)} > \varepsilon\right)\\
&\le \Pr\left(\max_{M\in\mathcal{M}_n}\frac{\mathrm{tr}(\tilde{A}_M^\top\tilde{A}_M)}{R_n(M)} > 2\right) + \Pr\left(2\max_{M\in\mathcal{M}_n}\sigma_1(\tilde{W}_M - I_{k_M}) > \varepsilon\right). \quad (14)
\end{aligned}$$

First, we evaluate the former probability in (14). Because $\mathrm{tr}(\tilde{A}_M^\top\tilde{A}_M) = \mathrm{tr}(E^\top P_ME)$ and $k_Mp_n \le R_n(M)$,
$$\max_{M\in\mathcal{M}_n}\frac{\mathrm{tr}(\tilde{A}_M^\top\tilde{A}_M)}{R_n(M)} \le \max_{M\in\mathcal{M}_n}\frac{k_Mp_n + |\mathrm{tr}(E^\top P_ME) - k_Mp_n|}{R_n(M)} \le 1 + \max_{M\in\mathcal{M}_n}\frac{|\mathrm{tr}(E^\top P_ME) - k_Mp_n|}{R_n(M)}.$$

Lemma A.5 indicates that there exists a positive constant $C_1 > 0$ such that
$$\Pr\left(\max_{M\in\mathcal{M}_n}\frac{\mathrm{tr}(\tilde{A}_M^\top\tilde{A}_M)}{R_n(M)} > 2\right) \le \Pr\left(\max_{M\in\mathcal{M}_n}\frac{|\mathrm{tr}(E^\top P_ME) - k_Mp_n|}{R_n(M)} > 1\right) \le 2\sum_{M\in\mathcal{M}_n}\exp\{-C_1R_n(M)\}.$$

Next, let us calculate the latter probability in (14). We define the counterpart of $V_M$ as follows:
$$\tilde{V}_M = (n-k_n)\alpha_n(\tilde{A}_M^\top\tilde{A}_M)^{-1/2}\tilde{A}_M^\top(E^\top P_{M_F}^\perp E)^{-1}\tilde{A}_M(\tilde{A}_M^\top\tilde{A}_M)^{-1/2}.$$

Then, from the proof of Lemma A.7, when conditions (C1) and (C2) hold, it follows that on $\mathcal{E}^{(1)}_{n,\gamma}$,
$$\max_{M\in\mathcal{M}_n}\sigma_1(\tilde{W}_M - \tilde{V}_M) = o(1). \quad (15)$$

Thus, we can consider $\sigma_1(\tilde{V}_M - I_{k_M})$ instead of $\sigma_1(\tilde{W}_M - I_{k_M})$. Here, let us consider the event $\mathcal{E}^{(3)}_{n,\gamma}$ given by
$$\mathcal{E}^{(3)}_{n,\gamma}:\quad \frac{\{\sqrt{n-p_n} - \sqrt{k_n} - n^{\gamma/2}\}^2}{(n-k_n)\alpha_n} \le \sigma_{k_n}(\tilde{V}_{M_F}^{-1}) \le \sigma_1(\tilde{V}_{M_F}^{-1}) \le \frac{\{\sqrt{n-p_n} + \sqrt{k_n} + n^{\gamma/2}\}^2}{(n-k_n)\alpha_n}.$$

Similarly to $V_\emptyset$, if $\alpha_n \to a = 1 - c_p$, then $\tilde{V}_{M_F}$ approaches $I_{k_n}$ on $\mathcal{E}^{(3)}_{n,\gamma}$. From arguments similar to those of Lemma A.8, it follows that for all $M \in \mathcal{M}_n$,
$$\inf_{y\in\mathbb{R}^{k_n}}\frac{y^\top\tilde{V}_{M_F}y}{y^\top y} \le \inf_{x\in\mathbb{R}^{k_M}}\frac{x^\top\tilde{V}_Mx}{x^\top x} \le \sup_{x\in\mathbb{R}^{k_M}}\frac{x^\top\tilde{V}_Mx}{x^\top x} \le \sup_{y\in\mathbb{R}^{k_n}}\frac{y^\top\tilde{V}_{M_F}y}{y^\top y}.$$

This implies that when $\alpha_n \to a = 1 - c_p$, on $\mathcal{E}^{(3)}_{n,\gamma}$,
$$\max_{M\in\mathcal{M}_n}\sigma_1(\tilde{V}_M - I_{k_M}) = o(1). \quad (16)$$
Hence, by combining (15) and (16), it is established that
$$\max_{M\in\mathcal{M}_n}\sigma_1(\tilde{W}_M - I_{k_M}) = o(1) \quad (17)$$
on $\mathcal{E}^{(1)}_{n,\gamma} \cap \mathcal{E}^{(3)}_{n,\gamma}$. Thus, for sufficiently large $n$,
$$\Pr\left(2\max_{M\in\mathcal{M}_n}\sigma_1(\tilde{W}_M - I_{k_M}) > \varepsilon\right) \le \Pr\big((\mathcal{E}^{(1)}_{n,\gamma})^c\big) + \Pr\big((\mathcal{E}^{(3)}_{n,\gamma})^c\big).$$

Note that $(n-k_n)\alpha_n\tilde{V}_{M_F}^{-1}$ follows $W_{k_n}(n-p_n, I_{k_n})$ from Theorem 3.2.11 in Muirhead (1982). Hence, the same arguments as in Lemma A.9 show that there exists a positive constant $C_2 > 0$ such that for all $\gamma \in (0,1)$,
$$\Pr(\mathcal{E}^{(3)}_{n,\gamma}) \ge 1 - 2\exp(-C_2n^\gamma). \quad (18)$$
Combining this with Lemma A.9, there exists a positive constant $C_3 > 0$ such that for sufficiently large $n$, we have
$$\Pr\left(2\max_{M\in\mathcal{M}_n}\sigma_1(\tilde{W}_M - I_{k_M}) > \varepsilon\right) \le 4\exp(-C_3n^\gamma).$$
Hence, the proof is completed.

Finally, we evaluate the fifth term of $B_n(M)$. First, we derive an upper bound of $\sigma_1((\alpha_nS^{-1} - \Sigma_*^{-1})\Sigma_*)$.

Lemma A.12. Let $\gamma \in (0,1)$. Suppose that conditions (C1) and (C2) hold. If $\alpha_n \to a = 1 - c_p$, then there exists a positive constant $C > 0$ such that for sufficiently large $n$,
$$\Pr\left(\sigma_1\big((\alpha_nS^{-1} - \Sigma_*^{-1})\Sigma_*\big) > \frac{2}{1-\sqrt{c_p}}\right) \le 2\exp(-Cn^\gamma).$$

Proof. Without loss of generality, we assume $\Sigma_* = I_{p_n}$. By the triangle inequality,
$$\sigma_1(\alpha_nS^{-1} - I_{p_n}) \le (n-k_n)\alpha_n\,\sigma_1\big((Y^\top P_{M_F}^\perp Y)^{-1} - (E^\top P_{M_F}^\perp E)^{-1}\big) + \sigma_1\big((n-k_n)\alpha_n(E^\top P_{M_F}^\perp E)^{-1} - I_{p_n}\big).$$

From the proof of Lemma A.7, it follows that on $\mathcal{E}^{(1)}_{n,\gamma}$,
$$(n-k_n)\alpha_n\,\sigma_1\big((Y^\top P_{M_F}^\perp Y)^{-1} - (E^\top P_{M_F}^\perp E)^{-1}\big) = o(1).$$


On the other hand, event $\mathcal{E}^{(1)}_{n,\gamma}$ implies that
$$\begin{aligned}
&\sigma_1\big((n-k_n)\alpha_n(E^\top P_{M_F}^\perp E)^{-1} - I_{p_n}\big)\\
&= \max\big\{(n-k_n)\alpha_n\sigma_{p_n}(H_F^\top E)^{-2} - 1,\ 1 - (n-k_n)\alpha_n\sigma_1(H_F^\top E)^{-2}\big\}\\
&\le \max\left\{\frac{(n-k_n)\alpha_n}{(\sqrt{n-k_n} - \sqrt{p_n} - n^{\gamma/2})^2} - 1,\ 1 - \frac{(n-k_n)\alpha_n}{(\sqrt{n-k_n} + \sqrt{p_n} + n^{\gamma/2})^2}\right\}\\
&= \max\left\{\frac{1-c_p}{(1-\sqrt{c_p})^2} - 1 + o(1),\ 1 - \frac{1-c_p}{(1+\sqrt{c_p})^2} + o(1)\right\}\\
&= \max\left\{\frac{1+\sqrt{c_p}}{1-\sqrt{c_p}} - 1 + o(1),\ 1 - \frac{1-\sqrt{c_p}}{1+\sqrt{c_p}} + o(1)\right\}\\
&= \frac{2\sqrt{c_p}}{1-\sqrt{c_p}} + o(1).
\end{aligned}$$
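The algebraic simplifications in the chain above, e.g. $(1-c_p)/(1-\sqrt{c_p})^2 - 1 = 2\sqrt{c_p}/(1-\sqrt{c_p})$, can be verified numerically over a grid of $c_p \in (0,1)$; an illustrative check (assuming numpy, grid chosen arbitrarily):

```python
import numpy as np

c = np.linspace(0.01, 0.99, 99)  # grid of values for c_p in (0, 1)
r = np.sqrt(c)

# The two candidate values inside the max{...} above (ignoring the o(1) terms).
lhs1 = (1 - c) / (1 - r) ** 2 - 1   # should equal 2*sqrt(c)/(1 - sqrt(c))
lhs2 = 1 - (1 - c) / (1 + r) ** 2   # should equal 2*sqrt(c)/(1 + sqrt(c))

ok1 = bool(np.allclose(lhs1, 2 * r / (1 - r)))
ok2 = bool(np.allclose(lhs2, 2 * r / (1 + r)))
# The first branch dominates, so the max equals 2*sqrt(c)/(1 - sqrt(c)).
dominates = bool(np.all(lhs1 >= lhs2))
```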

Because $c_p < 1$, for sufficiently large $n$, on $\mathcal{E}^{(1)}_{n,\gamma}$,
$$\sigma_1(\alpha_nS^{-1} - I_{p_n}) \le \frac{2}{1-\sqrt{c_p}}.$$

Hence, it can be discerned that for sufficiently large $n$,
$$\begin{aligned}
\Pr\left(\sigma_1(\alpha_nS^{-1} - \Sigma_*^{-1}) > \frac{2}{1-\sqrt{c_p}}\right) &\le \Pr\left(\left\{\sigma_1(\alpha_nS^{-1} - \Sigma_*^{-1}) > \frac{2}{1-\sqrt{c_p}}\right\}\cap\mathcal{E}^{(1)}_{n,\gamma}\right) + \Pr\big((\mathcal{E}^{(1)}_{n,\gamma})^c\big)\\
&\le 2\exp(-Cn^\gamma),
\end{aligned}$$
where the last inequality follows from Lemma A.9.

Noting that $\Gamma_*^\top(P_{M_F} - P_M)E\Sigma_*^{1/2}$ is a function of $Q_ME$, which is independent of $S$, we can obtain the following lemma.

Lemma A.13. Let $\gamma \in (0,1)$. Suppose that conditions (C1) and (C2) hold. If $\alpha_n \to a = 1 - c_p$, then there exists a positive constant $C > 0$ such that for all $\varepsilon > 0$ and for sufficiently large $n$,
$$\Pr\left(\max_{M\in\mathcal{M}_n}\frac{|\mathrm{tr}\{\Gamma_*^\top(P_{M_F} - P_M)E\Sigma_*^{1/2}(\alpha_nS^{-1} - \Sigma_*^{-1})\}|}{R_n(M)} > \varepsilon\right) \le 2\sum_{M\in\mathcal{M}_n}\exp\left\{-\frac{(1-\sqrt{c_p})^2\varepsilon^2R_n(M)}{8}\right\} + 2\exp(-Cn^\gamma).$$

Proof. Recall that for all $M \in \mathcal{M}_n \setminus \{M_F\}$, $Q_M \in \mathbb{R}^{(k_n-k_M)\times n}$ satisfies $Q_M^\top Q_M = P_{M_F} - P_M$ and $Q_MQ_M^\top = I_{k_n-k_M}$. Then, given $S$, $\mathrm{tr}\{\Gamma_*^\top(P_{M_F} - P_M)E\Sigma_*^{1/2}(\alpha_nS^{-1} - \Sigma_*^{-1})\}$ is normally distributed with mean 0 and variance
$$v_M^2 = \mathrm{tr}\{Q_M\Gamma_*(\alpha_nS^{-1} - \Sigma_*^{-1})\Sigma_*(\alpha_nS^{-1} - \Sigma_*^{-1})\Gamma_*^\top Q_M^\top\}.$$

It follows from Lemma A.12 that, with probability at least $1 - 2\exp(-Cn^\gamma)$, for all $M \in \mathcal{M}_n$,
$$v_M^2 \le \sigma_1\big((\alpha_nS^{-1} - \Sigma_*^{-1})\Sigma_*\big)^2\,\mathrm{tr}(Q_M\Gamma_*\Sigma_*^{-1}\Gamma_*^\top Q_M^\top) \le \left(\frac{2}{1-\sqrt{c_p}}\right)^2R_n(M),$$
where we used $\mathrm{tr}(Q_M\Gamma_*\Sigma_*^{-1}\Gamma_*^\top Q_M^\top) \le \mathrm{tr}(\Delta_M) \le R_n(M)$ in the last inequality. Hence, from Lemma A.2, we have

$$\begin{aligned}
&\Pr\left(\max_{M\in\mathcal{M}_n}\frac{|\mathrm{tr}\{\Gamma_*^\top(P_{M_F} - P_M)E\Sigma_*^{1/2}(\alpha_nS^{-1} - \Sigma_*^{-1})\}|}{R_n(M)} > \varepsilon\right)\\
&\le \Pr\left(\max_{M\in\mathcal{M}_n}\frac{|\mathrm{tr}\{\Gamma_*^\top(P_{M_F} - P_M)E\Sigma_*^{1/2}(\alpha_nS^{-1} - \Sigma_*^{-1})\}|}{|v_M|} > \frac{(1-\sqrt{c_p})\varepsilon\sqrt{R_n(M)}}{2}\right) + \Pr\left(\max_{M\in\mathcal{M}_n}\frac{v_M^2}{R_n(M)} > \frac{4}{(1-\sqrt{c_p})^2}\right)\\
&\le 2\sum_{M\in\mathcal{M}_n}\exp\left\{-\frac{(1-\sqrt{c_p})^2\varepsilon^2R_n(M)}{8}\right\} + 2\exp(-Cn^\gamma).
\end{aligned}$$

Thus, by combining Lemmas A.5, A.6, A.10, A.11, and A.13, Lemma A.4 is obtained.

A.3 Proof of Theorem 3.2

To show AME, the following lemma plays an important role.

Lemma A.14. Let $X_n$ and $Y_n$ be random variables such that $Y_n \ge 1$. If $E[(X_n/Y_n - 1)^2] \to 0$ and $E[Y_n^2] \to 1$, then
$$\lim_{n\to\infty}E[X_n] = 1.$$

Proof. Because $X_n - 1$ can be decomposed as $X_n - 1 = (X_n/Y_n - 1)Y_n + (Y_n - 1)$, applying the triangle inequality and the Cauchy–Schwarz inequality yields
$$|E[X_n - 1]| \le \left|E\left[\left(\frac{X_n}{Y_n} - 1\right)Y_n\right]\right| + |E[Y_n - 1]| \le \sqrt{E\left[\left(\frac{X_n}{Y_n} - 1\right)^2\right]}\sqrt{E[Y_n^2]} + \sqrt{E[(Y_n - 1)^2]}.$$
Because $Y_n \ge 1$ by assumption, we can see that $1 \le E[Y_n] \le E[Y_n^2] \to 1$, and hence $E[(Y_n - 1)^2] = E[Y_n^2] - 2E[Y_n] + 1 \to 0$. Therefore, the right-hand side of the above inequality goes to 0.
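The chain of inequalities in this proof (triangle inequality plus Cauchy–Schwarz) holds for any probability measure, including an empirical one, so it can be sanity-checked on simulated data. The snippet below uses a hypothetical pair $(X_n, Y_n)$ with $Y_n \ge 1$, chosen only for illustration (numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(2)
n_sim = 100_000

# Hypothetical random variables: y >= 1 and close to 1, x a noisy multiple of y.
y = 1.0 + np.abs(rng.standard_normal(n_sim)) / 100
x = y * (1.0 + rng.standard_normal(n_sim) / 100)

# Empirical version of
# |E[Xn - 1]| <= sqrt(E[(Xn/Yn - 1)^2]) * sqrt(E[Yn^2]) + sqrt(E[(Yn - 1)^2]).
lhs = abs(np.mean(x - 1.0))
rhs = (np.sqrt(np.mean((x / y - 1.0) ** 2)) * np.sqrt(np.mean(y ** 2))
       + np.sqrt(np.mean((y - 1.0) ** 2)))
```

Since Cauchy–Schwarz and the triangle inequality hold for the empirical measure as well, `lhs <= rhs` holds for any sample, not just in expectation.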

By verifying the conditions of Lemma A.14 with
$$X_n = \frac{L_n(\hat{M}_n)}{R_n(M_R^*)}, \qquad Y_n = \frac{R_n(\hat{M}_n)}{R_n(M_R^*)},$$
that is, $E[Y_n^2] \to 1$ and $E[(X_n/Y_n - 1)^2] \to 0$, we can show that $GC_p$ exhibits AME.

Lemma A.15. Under conditions (C1)–(C4), if $\alpha_n \to a = 1 - c_p$, then
$$\lim_{n\to\infty}\frac{E[R_n(\hat{M}_n)^2]}{R_n(M_R^*)^2} = 1.$$

Proof. For all $\eta > 0$, $\mathcal{M}_n$ is partitioned into $\mathcal{M}_n^{(1)}$ and $\mathcal{M}_n^{(2)}$ such that
$$\mathcal{M}_n^{(1)} = \left\{M \in \mathcal{M}_n \,\middle|\, \frac{R_n(M)}{R_n(M_R^*)} \le 1+\eta\right\}, \qquad \mathcal{M}_n^{(2)} = \left\{M \in \mathcal{M}_n \,\middle|\, \frac{R_n(M)}{R_n(M_R^*)} > 1+\eta\right\}.$$

Then, $E[R_n(\hat{M}_n)^2]/R_n(M_R^*)^2 - 1$ can be decomposed as follows:
$$\frac{E[R_n(\hat{M}_n)^2]}{R_n(M_R^*)^2} - 1 = \sum_{M\in\mathcal{M}_n^{(1)}}\frac{R_n(M)^2}{R_n(M_R^*)^2}\Pr(\hat{M}_n = M) - 1 + \sum_{M\in\mathcal{M}_n^{(2)}}\frac{R_n(M)^2}{R_n(M_R^*)^2}\Pr(\hat{M}_n = M).$$

If the last term on the right-hand side of the above equation converges to zero, it holds that $\Pr(\hat{M}_n \in \mathcal{M}_n^{(2)}) \to 0$ because $R_n(M)^2/R_n(M_R^*)^2 > (1+\eta)^2$ for all $M \in \mathcal{M}_n^{(2)}$. This then leads to $\Pr(\hat{M}_n \in \mathcal{M}_n^{(1)}) \to 1$. On the other hand, $1 \le R_n(M)^2/R_n(M_R^*)^2 \le (1+\eta)^2$ for all $M \in \mathcal{M}_n^{(1)}$. Hence, for sufficiently large $n$,
$$\left|\sum_{M\in\mathcal{M}_n^{(1)}}\frac{R_n(M)^2}{R_n(M_R^*)^2}\Pr(\hat{M}_n = M) - 1\right| < (1+\eta)^2\Pr(\hat{M}_n \in \mathcal{M}_n^{(1)}) - 1 \le (1+\eta)^2 - 1.$$


Because $\eta$ is an arbitrary positive constant, it follows that
$$\lim_{n\to\infty}\frac{E[R_n(\hat{M}_n)^2]}{R_n(M_R^*)^2} = 1$$
if
$$\lim_{n\to\infty}\sum_{M\in\mathcal{M}_n^{(2)}}\frac{R_n(M)^2}{R_n(M_R^*)^2}\Pr(\hat{M}_n = M) = 0.$$

Let $\mathcal{E}_{n,\gamma} = \mathcal{E}^{(1)}_{n,\gamma} \cap \mathcal{E}^{(2)}_{n,\gamma} \cap \mathcal{E}^{(3)}_{n,\gamma}$ and $\gamma = (1+\gamma_0)/2 \in (\gamma_0, 1)$, where $\gamma_0$ is defined in (C4). Then, we can see that

$$\begin{aligned}
&\sum_{M\in\mathcal{M}_n^{(2)}}\frac{R_n(M)^2}{R_n(M_R^*)^2}\Pr(\hat{M}_n = M)\\
&\le \sum_{M\in\mathcal{M}_n^{(2)}}\frac{R_n(M)^2}{R_n(M_R^*)^2}\Pr(\{\hat{M}_n = M\}\cap\mathcal{E}_{n,\gamma}) + \max_{M\in\mathcal{M}_n}\frac{R_n(M)^2}{R_n(M_R^*)^2}\sum_{M\in\mathcal{M}_n^{(2)}}\Pr(\{\hat{M}_n = M\}\cap\mathcal{E}_{n,\gamma}^c)\\
&\le \sum_{M\in\mathcal{M}_n^{(2)}}\frac{R_n(M)^2}{R_n(M_R^*)^2}\Pr(\{\hat{M}_n = M\}\cap\mathcal{E}_{n,\gamma}) + \max_{M\in\mathcal{M}_n}\frac{R_n(M)^2}{R_n(M_R^*)^2}\Pr(\mathcal{E}_{n,\gamma}^c).
\end{aligned}$$

Hence, it is shown from Lemma A.9 and (18) that there exists a positive constant $C > 0$ such that
$$\max_{M\in\mathcal{M}_n}\frac{R_n(M)^2}{R_n(M_R^*)^2}\Pr(\mathcal{E}_{n,\gamma}^c) \le 6\max_{M\in\mathcal{M}_n}\frac{R_n(M)^2}{R_n(M_R^*)^2}\exp(-Cn^\gamma) = O(\exp(n^{\gamma_0} - Cn^\gamma)),$$
where the last equality follows from condition (C4). Recall that $\gamma > \gamma_0$, which implies that
$$\max_{M\in\mathcal{M}_n}\frac{R_n(M)^2}{R_n(M_R^*)^2}\Pr(\mathcal{E}_{n,\gamma}^c) = o(1).$$

Because $R_n(M_R^*) \to \infty$ due to condition (C3), it suffices to show that
$$\lim_{n\to\infty}\sum_{M\in\mathcal{M}_n^{(2)}}R_n(M)^2\Pr(\{\hat{M}_n = M\}\cap\mathcal{E}_{n,\gamma}) = 0.$$


For all $M \in \mathcal{M}_n^{(2)}$,
$$\begin{aligned}
&\Pr(\{\hat{M}_n = M\}\cap\mathcal{E}_{n,\gamma})\\
&\le \Pr(\{GC_p(M_R^*;\alpha_n) - GC_p(M;\alpha_n) \ge 0\}\cap\mathcal{E}_{n,\gamma})\\
&= \Pr\left(\left\{\frac{GC_p(M_R^*;\alpha_n) - GC_p(M;\alpha_n)}{R_n(M)} \ge 0\right\}\cap\mathcal{E}_{n,\gamma}\right)\\
&= \Pr\left(\left\{\frac{GC_p(M_R^*;\alpha_n) - GC_p(M;\alpha_n)}{R_n(M)} - \frac{R_n(M_R^*) - R_n(M)}{R_n(M)} \ge 1 - \frac{R_n(M_R^*)}{R_n(M)}\right\}\cap\mathcal{E}_{n,\gamma}\right)\\
&\le \Pr\left(\left\{\frac{R_n(M) - GC_p(M;\alpha_n)}{R_n(M)} - \frac{R_n(M_R^*) - GC_p(M_R^*;\alpha_n)}{R_n(M)} \ge \frac{\eta}{1+\eta}\right\}\cap\mathcal{E}_{n,\gamma}\right)\\
&\le \Pr\left(\left\{\left|\frac{L_n(M) - R_n(M)}{R_n(M)}\right| + \left|\frac{L_n(M_R^*) - R_n(M_R^*)}{R_n(M)}\right| + \frac{|B_n(M)|}{R_n(M)} + \frac{|B_n(M_R^*)|}{R_n(M)} \ge \frac{\eta}{1+\eta}\right\}\cap\mathcal{E}_{n,\gamma}\right)\\
&\le \Pr\left(\left|\frac{L_n(M) - R_n(M)}{R_n(M)}\right| \ge \frac{\eta}{4(1+\eta)}\right) + \Pr\left(\left|\frac{L_n(M_R^*) - R_n(M_R^*)}{R_n(M)}\right| \ge \frac{\eta}{4(1+\eta)}\right)\\
&\quad + \Pr\left(\left\{\frac{|B_n(M)|}{R_n(M)} \ge \frac{\eta}{4(1+\eta)}\right\}\cap\mathcal{E}_{n,\gamma}\right) + \Pr\left(\left\{\frac{|B_n(M_R^*)|}{R_n(M)} \ge \frac{\eta}{4(1+\eta)}\right\}\cap\mathcal{E}_{n,\gamma}\right).
\end{aligned}$$

It follows from arguments similar to those for Lemmas A.4 and A.5 that there exist positive constants $C_1$ and $C_2$ such that
$$\sum_{M\in\mathcal{M}_n^{(2)}}R_n(M)^2\Pr(\{\hat{M}_n = M\}\cap\mathcal{E}_{n,\gamma}) \le C_1\sum_{M\in\mathcal{M}_n}\exp\{-C_2R_n(M)\} \to 0.$$
The last convergence follows from condition (C3).

Lemma A.16. Under condition (C3), it holds that
$$\lim_{n\to\infty}E\left[\left(\frac{L_n(\hat{M}_n)}{R_n(\hat{M}_n)} - 1\right)^2\right] = 0.$$

Proof. As in the proof of Lemma A.5, define
$$\xi(M) = \frac{L_n(M) - R_n(M)}{R_n(M)} = \frac{\mathrm{tr}(E^\top P_ME) - k_Mp_n}{R_n(M)}.$$


For all $\varepsilon > 0$,
$$\begin{aligned}
E[\xi(\hat{M}_n)^2] &= E[\xi(\hat{M}_n)^2I(|\xi(\hat{M}_n)| > \varepsilon)] + E[\xi(\hat{M}_n)^2I(|\xi(\hat{M}_n)| \le \varepsilon)]\\
&\le E[\xi(\hat{M}_n)^2I(|\xi(\hat{M}_n)| > \varepsilon)] + \varepsilon^2\\
&\le \sum_{M\in\mathcal{M}_n}E[\xi(M)^2I(|\xi(M)| > \varepsilon)I(\hat{M}_n = M)] + \varepsilon^2\\
&\le \sum_{M\in\mathcal{M}_n}E[\xi(M)^2I(|\xi(M)| > \varepsilon)] + \varepsilon^2\\
&\le \sum_{M\in\mathcal{M}_n}\sqrt{E[\xi(M)^4]}\sqrt{\Pr(|\xi(M)| > \varepsilon)} + \varepsilon^2.
\end{aligned}$$

The last inequality follows from the Cauchy–Schwarz inequality. Because $R_n(M) \ge k_Mp_n$, it holds from a moment formula of the chi-square distribution that there exists a positive constant $C_\xi$ such that
$$\max_{M\in\mathcal{M}_n}E[\xi(M)^4] \le C_\xi.$$

Under condition (C3), from Lemma A.5, we can see that there exists a positive constant $C_\varepsilon$ such that for all sufficiently large $n$,
$$\sum_{M\in\mathcal{M}_n}\sqrt{E[\xi(M)^4]}\sqrt{\Pr(|\xi(M)| > \varepsilon)} \le \sqrt{2C_\xi}\sum_{M\in\mathcal{M}_n}\exp\{-C_\varepsilon R_n(M)/2\} \le \varepsilon.$$

Hence, we have
$$E[\xi(\hat{M}_n)^2] \le \varepsilon + \varepsilon^2.$$
Because $\varepsilon$ can be arbitrarily small, the proof is completed.

By combining Lemmas A.15 and A.16 with Lemma A.14, Theorem 3.2 is established.
