Asymptotic optimality of Cp-type criteria in high-dimensional multivariate
linear regression models
(Last Modified: January 31, 2020)
Shinpei Imori
Department of Mathematics, Hiroshima University
1-3-1 Kagamiyama, Higashi-Hiroshima, Hiroshima 739-8526, Japan
Abstract
We study the asymptotic optimality of Cp-type criteria from the perspective of prediction in high-
dimensional multivariate linear regression models, where the dimension of a response matrix is large but
does not exceed the sample size. We derive conditions in order that the generalized Cp (GCp) exhibits
asymptotic loss efficiency (ALE) and asymptotic mean efficiency (AME) in such high-dimensional data.
Moreover, we clarify that one of the conditions is necessary for GCp to exhibit both ALE and AME. As a result, it is shown that the modified Cp achieves both ALE and AME in high-dimensional data, whereas the original Cp does not. The finite sample performance of GCp with several tuning parameters is compared through a simulation study.
Key words: Asymptotic loss efficiency; Asymptotic mean efficiency; High-dimensional data analysis;
Multivariate linear regression model; Variable selection.
1 Introduction
Variable selection problems are crucial in statistical fields to improve prediction accuracy and/or in-
terpretability of a resultant model. There is a burgeoning literature which has attempted to solve the
variable selection problem, and many selection procedures and their theoretical properties have been
studied.
For example, Mallows’ Cp criterion (Mallows 1973) and Akaike information criterion (AIC) (Akaike
1974) are known as useful selection methods from a predictive point of view because these procedures
are optimal in some predictive sense (see Shibata 1981, 1983, Li 1987, Shao 1997). On the other hand,
Bayes information criterion (BIC) proposed by Schwarz (1978) is consistent (Nishii 1984); that is, the
probability that a model selected by BIC coincides with the true model converges to 1 as the sample
size n tends to infinity. In this sense, BIC is a feasible method from the perspective of interpretability.
However, Cp and AIC are inconsistent under the same condition (Nishii 1984). The properties of these selection procedures are studied in detail by Shao (1997) in the context of univariate linear regression models; here, however, our target is multivariate linear regression models.
Recently, high-dimensional data are often encountered, in which the dimension p_n of the response matrix in a multivariate linear regression model is large, whereas p_n does not exceed the sample size n.
Considering such high-dimensional multivariate linear regression models, one may presume that the
properties of selection methods such as optimality and consistency are inherited from univariate models.
However, interestingly, properties derived when pn is fixed can be altered in high-dimensional situations.
For example, Yanagihara et al. (2015) showed that AIC acquires the consistency property and that
BIC loses its consistency in high-dimensional data. Similar results for Cp-type criteria were reported by
Fujikoshi et al. (2014). The reason why this inversion arises may be that a difference in risks between two
over-specified models (i.e., models including the true model) diverges with n and pn tending to infinity,
and thus penalty terms of Cp and AIC are moderate but that of BIC is too strong. In addition to these
studies, model selection criteria in high-dimensional data contexts and their consistency properties have
been vigorously studied in various models and situations (e.g., Katayama and Imori 2014, Imori and von
Rosen 2015, Yanagihara 2015, Fujikoshi and Sakurai 2016, Bai et al. 2018).
Compared with the consistency property, asymptotic optimality for prediction in high-dimensional
data contexts is under-researched. Conventional results derived from univariate models are no longer
reliable in high-dimensional data contexts, and extension to such cases is not mathematically trivial. In
the present paper, we focus on asymptotic loss efficiency (ALE) (Li 1987, Shao 1997) and asymptotic
mean efficiency (AME) (Shibata 1983) as criteria for the asymptotic optimality of variable selection. We
derive sufficient conditions in order that a generalization of Cp (GCp) exhibits ALE and AME in high-
dimensional data. We also show that one of the sufficient conditions is necessary for GCp to exhibit both
of these efficiencies. As a result, we can observe that the modified Cp (MCp) introduced by Fujikoshi
and Satoh (1997) exhibits ALE and AME assuming moderate conditions although the original Cp does
not under the same conditions.
The remainder of this paper is composed as follows. In Section 2, we clarify the variable selection
framework used in this paper. In Section 3, the sufficient conditions for ALE and AME of GCp are
given. In Section 4, we study the asymptotic inefficiency of GCp. Section 5 illustrates the finite sample
performances of some Cp-type criteria. Finally, conclusions are offered in Section 6.
2 Model selection framework
2.1 True and candidate models
Let Y be an n × pn response variable matrix and X be an n × kn explanatory variable matrix, where
n is the sample size, pn is the dimension of response and kn is the number of explanatory variables. We
assume X to be of full rank and non-stochastic. We allow kn and pn to diverge to infinity with n tending
to infinity, although neither kn nor pn exceeds n. Specific conditions for n, kn, and pn are given later.
The true distribution of Y is given by

Y = Γ_* + E Σ_*^{1/2},

where Γ_* = E[Y], E is an n × p_n error matrix, of which all entries are independent and identically distributed as the standard normal distribution N(0, 1), and Σ_* is the true covariance matrix of each row of Y. The relationship between Y and X is represented by a multivariate linear regression model as follows:

Y = XB + E Σ^{1/2},

where B is a k_n × p_n matrix of unknown regression coefficients and Σ is a p_n × p_n unknown covariance
matrix. To consider a variable selection problem, let M ⊂ M_F = {1, . . . , k_n} be a candidate model, and let X_M be the n × k_M sub-matrix of X corresponding to M, where k_M is the cardinality of M. Then, a candidate model M implies a multivariate linear regression model defined as follows:

Y = X_M B_M + E Σ^{1/2},

where B_M is a k_M × p_n matrix of unknown regression coefficients. The set of candidate models is denoted by Mn, a subset of the power set {M | M ⊂ M_F}. Note that Mn does not have to include the full model M_F.
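As a concrete illustration, the data-generating process above can be sketched numerically. The sizes, the AR(1)-type choice of Σ_*, and all variable names below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p_n, k_n = 50, 10, 5  # illustrative sizes with k_n, p_n < n

# Non-stochastic full-rank design X and true mean structure Gamma_* = X B_*
X = np.column_stack([np.ones(n), rng.uniform(size=(n, k_n - 1))])
B_star = rng.normal(size=(k_n, p_n))
Gamma_star = X @ B_star

# True covariance Sigma_* (an AR(1)-type matrix, chosen only for illustration)
Sigma_star = 0.5 ** np.abs(np.subtract.outer(np.arange(p_n), np.arange(p_n)))

# Y = Gamma_* + E Sigma_*^{1/2}, with E having i.i.d. N(0, 1) entries
E = rng.normal(size=(n, p_n))
L = np.linalg.cholesky(Sigma_star)  # one valid square root of Sigma_*
Y = Gamma_star + E @ L.T

print(Y.shape)  # (50, 10)
```

Each row of Y then has mean given by the corresponding row of Γ_* and covariance L L^⊤ = Σ_*.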
2.2 Loss and risk functions
Herein, the goodness of fit of a candidate model M is measured by the quadratic loss function L_n given by

L_n(M) = tr{(Γ_* − X_M B̂_M) Σ_*^{−1} (Γ_* − X_M B̂_M)^⊤},  (1)

where B̂_M is the least squares estimator of B_M, i.e.,

B̂_M = (X_M^⊤ X_M)^{−1} X_M^⊤ Y.  (2)

By substituting (2) into (1), we have

L_n(M) = tr(Δ_M) + tr(E^⊤ P_M E),  (3)

where Δ_M = Σ_*^{−1/2} Γ_*^⊤ P_M^⊥ Γ_* Σ_*^{−1/2}, P_M^⊥ = I_n − P_M, and P_M = X_M (X_M^⊤ X_M)^{−1} X_M^⊤. Then, the risk function R_n is obtained as

R_n(M) = E[L_n(M)] = tr(Δ_M) + k_M p_n.  (4)
The best models with respect to the loss and risk functions are denoted by M*_L and M*_R, which minimize (1) and (4) over Mn, respectively, i.e.,

M*_L = arg min_{M ∈ Mn} L_n(M),  M*_R = arg min_{M ∈ Mn} R_n(M).

Note that M*_L is a random variable, M*_R is non-stochastic, and both of them depend on n although this dependence is suppressed for brevity.
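For intuition, the loss (1) and risk (4) can be evaluated numerically on simulated data. This is a minimal sketch assuming Σ_* = I and illustrative dimensions; all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p_n, k_n = 60, 8, 4
X = np.column_stack([np.ones(n), rng.uniform(size=(n, k_n - 1))])
Gamma_star = X @ rng.normal(size=(k_n, p_n))   # true mean, so the full model is true
Y = Gamma_star + rng.normal(size=(n, p_n))     # Sigma_* = I_{p_n}

def proj(M):
    X_M = X[:, M]
    return X_M @ np.linalg.solve(X_M.T @ X_M, X_M.T)

def loss(M):
    # L_n(M) = tr{(Gamma_* - X_M B_hat_M)(Gamma_* - X_M B_hat_M)'} when Sigma_* = I
    resid = Gamma_star - proj(M) @ Y           # X_M B_hat_M equals P_M Y
    return np.trace(resid @ resid.T)

def risk(M):
    # R_n(M) = tr(Delta_M) + k_M p_n, with tr(Delta_M) = tr(Gamma_*' P_M^perp Gamma_*) here
    resid = Gamma_star - proj(M) @ Gamma_star
    return np.trace(resid.T @ resid) + len(M) * p_n

full = list(range(k_n))
print(round(risk(full), 6))  # tr(Delta_{M_F}) = 0, so the full-model risk is k_n * p_n = 32
```

Averaging loss(M) over repeated draws of Y approximates risk(M), in line with (4).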
2.3 Selection method and asymptotic efficiency
To select the best model among Mn, we use GCp, defined by

GCp(M; α_n) = n α_n tr(Σ̂_M S^{−1}) + 2 k_M p_n,  (5)

where α_n is a positive sequence, Σ̂_M = Y^⊤ P_M^⊥ Y / n, and S = n Σ̂_{M_F} / (n − k_n). For theoretical purposes, we use α_n satisfying

lim_{n→∞} α_n = a ∈ [0, ∞).

When α_n = 1, GCp reduces to the Cp proposed by Mallows (1973), and when α_n = 1 − (p_n + 1)/(n − k_n), selection by GCp coincides with the modified Cp (MCp) of Fujikoshi and Satoh (1997). If the full model includes the true model, then MCp is an unbiased estimator of the risk function for all M ∈ Mn (Fujikoshi and Satoh 1997). Note that Atkinson (1980) introduced a criterion equivalent to GCp for univariate data and Nagai et al. (2012) proposed a GCp for multivariate generalized ridge regression models. The best model selected by minimizing GCp over Mn is denoted by M̂_n, i.e.,

M̂_n = arg min_{M ∈ Mn} GCp(M; α_n).

Then, we say that GCp exhibits ALE (Li 1987, Shao 1997) if

L_n(M̂_n) / L_n(M*_L) →p 1, n → ∞,  (6)

and exhibits AME (Shibata 1983) if

lim_{n→∞} E[L_n(M̂_n)] / R_n(M*_R) = 1.  (7)

Note that L_n(M̂_n) and E[L_n(M̂_n)] are referred to as the loss and risk functions, respectively, of the best model selected by GCp.
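Criterion (5) and the selection rule can be sketched as follows, under illustrative dimensions with Σ_* = I; the near-zero trailing coefficients are an assumption added so that variable selection is non-trivial, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p_n, k_n = 80, 10, 5
X = np.column_stack([np.ones(n), rng.uniform(size=(n, k_n - 1))])
B_star = rng.normal(size=(k_n, p_n))
B_star[-2:, :] *= 1e-3  # last coefficients nearly zero, so smaller models may predict better
Y = X @ B_star + rng.normal(size=(n, p_n))  # Sigma_* = I

# Quantities from equation (5): Sigma_hat_M = Y' P_M^perp Y / n, S = n Sigma_hat_{M_F} / (n - k_n)
P_F = X @ np.linalg.solve(X.T @ X, X.T)
S = Y.T @ (np.eye(n) - P_F) @ Y / (n - k_n)
S_inv = np.linalg.inv(S)

def gcp(M, alpha_n):
    X_M = X[:, M]
    P_M = X_M @ np.linalg.solve(X_M.T @ X_M, X_M.T)
    Sigma_hat_M = Y.T @ (np.eye(n) - P_M) @ Y / n
    return n * alpha_n * np.trace(Sigma_hat_M @ S_inv) + 2 * len(M) * p_n

# Nested candidate set and the two tuning choices discussed above
models = [list(range(j + 1)) for j in range(k_n)]
best_cp = min(models, key=lambda M: gcp(M, 1.0))                       # original Cp
best_mcp = min(models, key=lambda M: gcp(M, 1 - (p_n + 1) / (n - k_n)))  # MCp
print(len(best_cp), len(best_mcp))
```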
3 Asymptotic efficiency of GCp
In this section, we present ALE and AME of GCp(M; α_n). Hereafter, we may omit the symbol “n → ∞” to simplify expressions.
Firstly, we assume the following conditions for ALE:

(C1) k_n = o(n), lim_{n→∞} p_n/n = c_p ∈ [0, 1), and n − k_n − p_n > 0.

(C2) σ_1(Δ_{M_F}) = o(n), where for any matrix A, σ_ℓ(A) denotes the ℓth largest singular value of A.

(C3) For all δ ∈ (0, 1), lim_{n→∞} Σ_{M ∈ Mn} δ^{R_n(M)} = 0.

The first part of condition (C1) is the same as a condition assumed in Shibata (1981, 1983) if the full model M_F is included in the set of candidate models Mn. The second part of (C1) constructs our high-dimensional framework, which is also considered in previous studies (Fujikoshi et al. 2014, Yanagihara et al. 2015, Yanagihara 2015, Fujikoshi and Sakurai 2016, Bai et al. 2018). The final part of (C1) is required to guarantee the regularity of S, and is satisfied asymptotically under the previous two conditions. Condition (C2) is used to make the effect of Δ_{M_F} negligible, and condition (C3) controls the number of candidate models. When p_n = 1, conditions (C2) and (C3) correspond to assumptions in Shao (1997) and Shibata (1981, 1983), respectively. Then, sufficient conditions for ALE of GCp are derived in the following theorem, whose proof is given in Appendix A.2.
Theorem 3.1. Suppose that conditions (C1)–(C3) hold. If α_n → a = 1 − c_p as n → ∞, then GCp(M; α_n) exhibits ALE, i.e.,

L_n(M̂_n) / L_n(M*_L) →p 1, n → ∞.
Next, we show AME of GCp. Besides conditions (C1)–(C3), we assume the following condition:

(C4) There exists γ_0 ∈ (0, 1) such that max_{M ∈ Mn} R_n(M)/R_n(M*_R) = O(exp(n^{γ_0})).

Condition (C4) sets an upper bound on the risk ratio R_n(M)/R_n(M*_R), which prevents the maximum risk from being too large. Because of conditions (C1) and (C3), condition (C4) is satisfied when max_{M ∈ Mn} tr(Δ_M) = O(exp(n^{γ_0})), which is a weak condition. Then, sufficient conditions for AME of GCp are obtained.
Theorem 3.2. Suppose that conditions (C1)–(C4) hold. If α_n → a = 1 − c_p as n → ∞, then GCp(M; α_n) exhibits AME, i.e.,

lim_{n→∞} E[L_n(M̂_n)] / R_n(M*_R) = 1.
A proof of this theorem is provided in Appendix A.3. For both ALE and AME of GCp, we assume
αn → a = 1 − cp. Unless cp = 0, this condition does not hold when αn = 1 (i.e., the original Cp).
On the other hand, this condition is satisfied for all cp ∈ [0, 1) when αn = 1 − (pn + 1)/(n − kn) (i.e.,
MCp). Hence, MCp is more reasonable for variable selection in high-dimensional data contexts from the
perspective of prediction.
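The distinction between the two criteria can be checked directly: with p_n/n → c_p, MCp's tuning sequence α_n = 1 − (p_n + 1)/(n − k_n) converges to 1 − c_p, while Cp's α_n = 1 stays away from this limit whenever c_p > 0. The growth rules for p_n and k_n below follow the simulation design of Section 5.

```python
import math

c_p = 1 / 3  # limit of p_n / n, as in the high-dimensional setting of Section 5
for n in (100, 1000, 10000, 100000):
    p_n = n // 3
    k_n = int(5 * math.log(n))
    alpha_mcp = 1 - (p_n + 1) / (n - k_n)
    # MCp's tuning sequence approaches 1 - c_p; Cp's stays at 1
    print(n, round(alpha_mcp, 4), round(1 - c_p, 4))
```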
4 Asymptotic inefficiency of GCp
As noted in the previous section, α_n → a = 1 − c_p is a key condition for GCp to acquire ALE and AME. In this section, we show that this condition is also necessary. Namely, when α_n → a ≠ 1 − c_p, there is a situation in which

lim_{n→∞} Pr( L_n(M̂_n) / L_n(M*_L) > 1 ) = 1,  lim_{n→∞} E[L_n(M̂_n)] / R_n(M*_R) > 1,

even under conditions (C1)–(C4).
For expository purposes, let X = (x_1, x_2), i.e., k_n = 2, such that X^⊤ X = I_2, and let Γ_* = √n x_2 β^⊤, where β ∈ R^{p_n}, Σ_* = I_{p_n}, and Mn = {{1}, {1, 2}}. Suppose that c_p ∈ (0, 1) and that β satisfies ∥β∥² → b ∈ (0, ∞), where ∥·∥ is the Euclidean norm. Then, because σ_1(Δ_{M_F}) = 0, R_n({1}) = n∥β∥² + p_n, and R_n({1, 2}) = 2p_n, conditions (C1)–(C4) are satisfied for sufficiently large n.
From the definition of GCp,

GCp({1, 2}; α_n) − GCp({1}; α_n)
= n α_n tr{(Σ̂_{{1,2}} − Σ̂_{{1}}) S^{−1}} + 2p_n
= −(n − 2) α_n · x_2^⊤ Y Y^⊤ x_2 · [ x_2^⊤ Y {Y^⊤(I_n − x_1 x_1^⊤ − x_2 x_2^⊤) Y}^{−1} Y^⊤ x_2 / (x_2^⊤ Y Y^⊤ x_2) ] + 2p_n.

It follows from Theorem 3.2.11 in Muirhead (1982) that

[ x_2^⊤ Y {Y^⊤(I_n − x_1 x_1^⊤ − x_2 x_2^⊤) Y}^{−1} Y^⊤ x_2 / (x_2^⊤ Y Y^⊤ x_2) ]^{−1} ∼ χ²_{n−p_n−1}.

On the other hand, because Y^⊤ x_2 = √n β + E^⊤ x_2 ∼ N_{p_n}(√n β, I_{p_n}), we have x_2^⊤ Y Y^⊤ x_2 ∼ χ²_{p_n}(n∥β∥²), which denotes a non-central chi-square distribution with non-centrality parameter n∥β∥². Note that χ²_{n−p_n−1}/n = 1 − c_p + o_p(1) and χ²_{p_n}(n∥β∥²)/n = c_p + b + o_p(1). Hence, it holds that

{GCp({1, 2}; α_n) − GCp({1}; α_n)} / n = −a(c_p + b)/(1 − c_p) + 2c_p + o_p(1).  (8)
Meanwhile, the loss functions of models {1} and {1, 2} are given as

L_n({1}) = n∥β∥² + x_1^⊤ E E^⊤ x_1,
L_n({1, 2}) = x_1^⊤ E E^⊤ x_1 + x_2^⊤ E E^⊤ x_2.

Because x_i^⊤ E E^⊤ x_i ∼ χ²_{p_n} (i = 1, 2), it follows that

L_n({1}) / L_n({1, 2}) →p (c_p + b)/(2c_p) ∈ (0, ∞),  (9)

lim_{n→∞} R_n({1}) / R_n({1, 2}) = (c_p + b)/(2c_p) ∈ (0, ∞).  (10)
First, we consider a situation where a > 0. Let b = c_p(1 − c_p)/a. It follows from (8) and (9) that

{GCp({1, 2}; α_n) − GCp({1}; α_n)} / n →p c_p(1 − c_p − a)/(1 − c_p),

L_n({1}) / L_n({1, 2}) →p (a + 1 − c_p)/(2a) = 1 + (1 − c_p − a)/(2a).

Hence, we have

L_n(M̂_n) / L_n(M*_L) →p (a + 1 − c_p)/(2a) > 1 when a < 1 − c_p,
L_n(M̂_n) / L_n(M*_L) →p (2a)/(a + 1 − c_p) > 1 when a > 1 − c_p.

This implies that GCp does not exhibit ALE when 0 < a < 1 − c_p or a > 1 − c_p.
On the other hand, (10) yields M*_R = {1, 2} (resp. {1}) for sufficiently large n when a < 1 − c_p (resp. a > 1 − c_p). Thus, writing M**_R = Mn \ {M*_R}, we can see that

E[L_n(M̂_n)] / R_n(M*_R)
= E[L_n(M*_R) I(M̂_n = M*_R)] / R_n(M*_R) + E[L_n(M**_R) I(M̂_n = M**_R)] / R_n(M*_R)
= R_n(M**_R) / R_n(M*_R) − E[{L_n(M**_R) − L_n(M*_R)} I(M̂_n = M*_R)] / R_n(M*_R)
≥ R_n(M**_R) / R_n(M*_R) − [ √E[{L_n({1}) − L_n({1, 2})}²] / R_n(M*_R) ] √Pr(M̂_n = M*_R),

where I(·) is an indicator function and the last inequality follows from the Cauchy–Schwarz inequality. Note that

√E[{L_n({1, 2}) − L_n({1})}²] / R_n(M*_R)
= √E[(χ²_{p_n} − n∥β∥²)²] · max{1/(2p_n), 1/(p_n + n∥β∥²)}
= √(2p_n + (p_n − n∥β∥²)²) · max{1/(2p_n), 1/(p_n + n∥β∥²)}
→ |a − (1 − c_p)| · max{1/(2a), 1/(a + 1 − c_p)} < ∞.

Because lim_{n→∞} Pr(M̂_n = M*_R) = 0 and R_n(M**_R)/R_n(M*_R) > 1, GCp does not exhibit AME when 0 < a < 1 − c_p or 1 − c_p < a.
Next, we consider a situation where a = 0. Then, (8) implies that Pr(M̂_n = {1}) → 1. However, when b > c_p, (9) and (10) yield Pr(M*_L = {1, 2}) → 1 and M*_R = {1, 2} for sufficiently large n, respectively. Hence, in the same manner as the argument for a > 0, we can see that GCp exhibits neither ALE nor AME when a = 0.

Therefore, α_n → a = 1 − c_p is a necessary and sufficient condition for ALE and AME of GCp under conditions (C1)–(C4).
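The limit (8) behind this counterexample can be checked by direct simulation; the values of n, c_p, and b below are illustrative assumptions, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4000
p_n = n // 4                  # so c_p = 1/4 (illustrative)
c_p, b = 0.25, 0.5            # limits of p_n/n and ||beta||^2

# Orthonormal design with k_n = 2, as in the counterexample above
x1 = np.ones(n) / np.sqrt(n)
x2 = np.concatenate([np.ones(n // 2), -np.ones(n // 2)]) / np.sqrt(n)
beta = np.full(p_n, np.sqrt(b / p_n))                             # ||beta||^2 = b
Y = np.sqrt(n) * np.outer(x2, beta) + rng.normal(size=(n, p_n))   # Sigma_* = I

z1, z2 = Y.T @ x1, Y.T @ x2
W = Y.T @ Y - np.outer(z1, z1) - np.outer(z2, z2)  # Y' P_{M_F}^perp Y
S_inv = np.linalg.inv(W / (n - 2))

def gcp_diff(alpha):
    """{GCp({1,2}; alpha) - GCp({1}; alpha)} / n, using
    Sigma_hat_{1,2} - Sigma_hat_{1} = -Y' x2 x2' Y / n."""
    return (-alpha * z2 @ S_inv @ z2 + 2 * p_n) / n

for a in (1.0, 1 - c_p):      # the original Cp value and the efficient limit a = 1 - c_p
    limit = -a * (c_p + b) / (1 - c_p) + 2 * c_p
    print(round(gcp_diff(a), 2), round(limit, 2))
```

Each printed pair compares the simulated difference with the theoretical limit from (8); for a = 1 − c_p the difference stays positive here (GCp prefers {1}), matching the necessity argument.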
5 Simulation study
This section provides details of a simulation study comparing GCp for several choices of α_n, where the goodness of a criterion is measured by the loss function of the best model it selects. We prepare three tuning parameters: α_n = 1 (i.e., Cp), α_n = 1 − (p_n + 1)/(n − k_n) (i.e., MCp), and α_n = 2/log n (i.e., a BIC-type Cp, say BCp). Because 2/log n ≤ 1 − (p_n + 1)/(n − k_n) ≤ 1 in the settings described below, the dimension of the model selected by Cp (resp. BCp) is larger (resp. smaller) than that selected by MCp. This inequality holds in general for sufficiently large n.
Hereafter, we explain the simulation settings. Let the first column of X be a vector of ones in R^n and the other entries be independently generated from the uniform distribution U(0, 1). For all 1 ≤ i ≤ k_n and 1 ≤ j ≤ p_n, let (B_*)_{ij} = u_{ij} d_i, where the u_{ij} are independently generated from U(0, 1/2) and d_i = 5√(k_n − i + 1)/k_n. For comparative purposes, we examine a situation where Γ_* = X B_*, which implies that the full model is the true model. Suppose that Σ_* = I_{p_n}. To reduce the computational burden, we adopt a nested model set, i.e., Mn = {{1}, {1, 2}, . . . , {1, . . . , k_n}}. It should be noted that the true (full) model is not the best model from the perspective of prediction in our simulation study, because some coefficients are very small; hence, variable selection makes sense in this situation. This supposition is confirmed below.
We prepare two cases for p_n: a high-dimensional case with p_n = ⌊n/3⌋ and a fixed-dimensional case with p_n = 10, where ⌊α⌋ denotes the largest integer that does not exceed α. The sample size n varies from 100 to 800, and we set k_n = ⌊5 log n⌋. Then, we generate Y and select the best subset of explanatory variables by each Cp-type criterion. After variable selection, we calculate the loss function of each best model.
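The data-generating design just described can be sketched as follows, here reading d_i as 5√(k_n − i + 1)/k_n so that later coefficients shrink; variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
p_n = n // 3                      # high-dimensional case p_n = floor(n/3)
k_n = int(5 * np.log(n))          # k_n = floor(5 log n) = 23 when n = 100

# Design: first column all ones, remaining entries i.i.d. U(0, 1)
X = np.column_stack([np.ones(n), rng.uniform(size=(n, k_n - 1))])

# Coefficients (B_*)_{ij} = u_{ij} d_i with u_{ij} ~ U(0, 1/2), d_i = 5 sqrt(k_n - i + 1) / k_n
i = np.arange(1, k_n + 1)
d = 5 * np.sqrt(k_n - i + 1) / k_n
B_star = rng.uniform(0, 0.5, size=(k_n, p_n)) * d[:, None]

Y = X @ B_star + rng.normal(size=(n, p_n))   # Gamma_* = X B_*, Sigma_* = I

# Nested candidate set {{1}, {1,2}, ..., {1,...,k_n}}
models = [list(range(j + 1)) for j in range(k_n)]
print(p_n, k_n, len(models))  # 33 23 23
```

Because d_i decreases in i, the trailing rows of B_* are small, which is why the loss-minimizing model can be smaller than the true (full) model.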
Table 1: Average values of L_n(M̂_n)/L_n(M*_L) and L_n(M̂_n)/R_n(M*_R) of Cp, MCp, and BCp among 1,000 repetitions for each (n, p_n, k_n). Standard deviations are shown in parentheses. All values are rounded to 3 decimal places.

                        L_n(M̂_n)/L_n(M*_L)                L_n(M̂_n)/R_n(M*_R)
   n   p_n  k_n     Cp       MCp      BCp             Cp       MCp      BCp
 100    33   23   1.805    1.057    1.012           1.798    1.053    1.008
                 (0.311)  (0.088)  (0.030)         (0.312)  (0.093)  (0.037)
 200    66   26   1.291    1.033    1.114           1.277    1.022    1.101
                 (0.111)  (0.034)  (0.030)         (0.113)  (0.039)  (0.011)
 400   133   29   1.151    1.013    1.599           1.150    1.012    1.597
                 (0.048)  (0.014)  (0.104)         (0.052)  (0.024)  (0.097)
 800   266   33   1.081    1.003    1.813           1.079    1.001    1.809
                 (0.024)  (0.005)  (0.139)         (0.027)  (0.015)  (0.134)
 100    10   23   1.214    1.117    1.031           1.183    1.087    1.003
                 (0.216)  (0.139)  (0.040)         (0.222)  (0.150)  (0.045)
 200    10   26   1.110    1.101    1.158           1.075    1.065    1.117
                 (0.100)  (0.087)  (0.073)         (0.117)  (0.101)  (0.027)
 400    10   29   1.068    1.067    1.783           1.043    1.042    1.732
                 (0.052)  (0.051)  (0.135)         (0.089)  (0.085)  (0.023)
 800    10   33   1.043    1.044    2.222           1.030    1.032    2.183
                 (0.042)  (0.044)  (0.295)         (0.083)  (0.082)  (0.228)
Table 1 provides the average values of L_n(M̂_n)/L_n(M*_L) and L_n(M̂_n)/R_n(M*_R) of Cp, MCp, and BCp based on 1,000 repetitions for each (n, p_n, k_n). Note that L_n(M̂_n)/L_n(M*_L) and L_n(M̂_n)/R_n(M*_R) are criteria for ALE and AME, respectively, and smaller is better. From this table, we can confirm that MCp exhibits good performance regardless of p_n, whereas Cp works well when p_n = 10 but not when p_n is large. On the other hand, BCp has higher values of L_n(M̂_n)/L_n(M*_L) and L_n(M̂_n)/R_n(M*_R) except when the sample size is small. These results concur with our theoretical exposition regarding efficiency and inefficiency.
Table 2: Average dimensions of the models selected by Cp, MCp, and BCp and of the loss-minimizing models among 1,000 repetitions for each (n, p_n, k_n). Standard deviations are shown in parentheses. All values are rounded to 3 decimal places.

   n   p_n  k_n      Cp       MCp      BCp      Loss
 100    33   23   17.870    2.953    1.234     1.574
                  (3.691)  (2.808)  (0.826)   (1.327)
 200    66   26   21.582    9.064    1.023     9.119
                  (2.337)  (3.616)  (0.234)   (2.918)
 400   133   29   25.911   15.723    1.766    15.296
                  (1.427)  (2.492)  (1.528)   (1.573)
 800   266   33   31.013   24.783    7.324    24.676
                  (0.917)  (0.857)  (1.830)   (0.616)
 100    10   23    6.271    3.474    1.026     2.656
                  (4.240)  (3.137)  (0.165)   (1.884)
 200    10   26    9.388    7.785    1.008     8.069
                  (4.121)  (3.954)  (0.089)   (3.076)
 400    10   29   17.677   17.001    1.057    17.441
                  (3.740)  (3.709)  (0.249)   (3.378)
 800    10   33   25.241   24.986    2.459    25.284
                  (2.472)  (2.552)  (1.657)   (1.509)
Table 2 shows the average dimensions of the models selected by each Cp-type criterion and of the loss-minimizing models. It indicates that the dimension of the loss-minimizing model varies with the sample size, and the full model is not (always) the best model even though the full model is true. Under our simulation settings, BCp tends to select much smaller models than the loss-minimizing ones, while Cp often selects larger models when p_n is large. The average number
of dimensions of models selected by MCp is close to that of the loss minimizing models in both high- and
fixed-dimensional situations. This implies that αn substantially affects the dimensions of selected models
as well as efficiency.
Hence, these results indicate that MCp is a useful variable selection method regardless of pn, and
thus we recommend its use from the perspective of robust prediction.
6 Conclusions
We have derived sufficient conditions for ALE and AME of GCp in high-dimensional multivariate linear
regression models. It is shown that MCp exhibits ALE and AME in high-dimensional data, while the
original Cp, known as an asymptotically efficient criterion in univariate cases, does not exhibit ALE or
AME under the same conditions. This is because a non-trivial bias term is omitted in the original Cp as an estimator of the risk function; this term plays an important role in adapting to high-dimensional frameworks. Indeed, we showed that if the tuning parameter α_n of GCp converges to a ≠ 1 − c_p, as in the cases of Cp and BCp, then GCp is asymptotically inefficient. Through a simulation study, the finite sample performances of Cp-type criteria were compared, and MCp was better than Cp and BCp in high-dimensional data.
As shown in Fujikoshi et al. (2014), MCp is consistent for variable selection in high-dimensional
multivariate linear regression models under moderate conditions when the true model is included in
the set of candidate models. Hence, MCp can be regarded as a feasible method for variable selection
from the perspective of both prediction and interpretability. This attractive property is only seen in
high-dimensional situations.
When p_n is greater than n, we cannot directly calculate S^{−1} and thus GCp. Therefore, different approaches are needed to estimate the covariance matrix Σ, such as sparse or ridge estimation (e.g., Yamamura et al. 2010, Katayama and Imori 2014, Fujikoshi and Sakurai 2016). If we can estimate Σ accurately via these procedures, ALE and AME can be established by using the estimate in place of S. It should also be noted that our proof depends on the assumption that the response matrix follows a Gaussian distribution. Because we use several properties of the Gaussian distribution, relaxing this assumption is not trivial; how best to navigate this issue represents fruitful terrain for future research.
Acknowledgements
This study is supported in part by JSPS KAKENHI Grant Number JP17K12650 and “Funds for the
Development of Human Resources in Science and Technology” under MEXT, through the “Home for
Innovative Researchers and Academic Knowledge Users (HIRAKU)” consortium.
References
H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control,
19:716–723, 1974.
A. C. Atkinson. A note on the generalized information criterion for choice of a model. Biometrika, 67
(2):413–418, 1980.
Z. Bai, K. P. Choi, and Y. Fujikoshi. Consistency of AIC and BIC in estimating the number of significant
components in high-dimensional principal component analysis. The Annals of Statistics, 46(3):1050–
1076, 2018.
Y. Fujikoshi and T. Sakurai. High-dimensional consistency of rank estimation criteria in multivariate
linear model. Journal of Multivariate Analysis, 149:199–212, 2016.
Y. Fujikoshi and K. Satoh. Modified AIC and Cp in multivariate linear regression. Biometrika, 84(3):
707–716, 1997.
Y. Fujikoshi, T. Sakurai, and H. Yanagihara. Consistency of high-dimensional AIC-type and Cp-type
criteria in multivariate linear regression. Journal of Multivariate Analysis, 123:184–200, 2014.
S. Imori and D. von Rosen. Covariance components selection in high-dimensional growth curve model
with random coefficients. Journal of Multivariate Analysis, 136:86–94, 2015.
S. Katayama and S. Imori. Lasso penalized model selection criteria for high-dimensional multivariate
linear regression analysis. Journal of Multivariate Analysis, 132:138–150, 2014.
K.-C. Li. Asymptotic optimality for Cp, CL, cross-validation and generalized cross-validation: Discrete
index set. The Annals of Statistics, 15(3):958–975, 1987.
C. L. Mallows. Some comments on Cp. Technometrics, 15(4):661–675, 1973.
R. J. Muirhead. Aspects of Multivariate Statistical Theory. Wiley series in probability and statistics.
John Wiley & Sons, Inc., 1982.
I. Nagai, H. Yanagihara, and K. Satoh. Optimization of ridge parameters in multivariate generalized
ridge regression by plug-in methods. Hiroshima Mathematical Journal, 42(3):301–324, 2012.
R. Nishii. Asymptotic properties of criteria for selection of variables in multiple regression. The Annals
of Statistics, 12(2):758–765, 1984.
G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.
J. Shao. An asymptotic theory for linear model selection. Statistica Sinica, pages 221–242, 1997.
R. Shibata. An optimal selection of regression variables. Biometrika, 68(1):45–54, 1981.
R. Shibata. Amendments and corrections: An optimal selection of regression variables. Biometrika, 69
(2):492–492, 1982.
R. Shibata. Asymptotic mean efficiency of a selection of regression variables. Annals of the Institute of
Statistical Mathematics, 35(3):415–423, 1983.
M. J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in
Statistical and Probabilistic Mathematics. Cambridge University Press, 2019.
M. Yamamura, H. Yanagihara, and M. S. Srivastava. Variable selection in multivariate linear regression
models with fewer observations than the dimension. Japanese Journal of Applied Statistics, 39(1):1–19,
2010.
H. Yanagihara. Conditions for consistency of a log-likelihood-based information criterion in normal
multivariate linear regression models under the violation of the normality assumption. Journal of the
Japan Statistical Society, 45(1):21–56, 2015.
H. Yanagihara, H. Wakaki, and Y. Fujikoshi. A consistency property of the AIC for multivariate linear
models when the dimension and the sample size are large. Electronic Journal of Statistics, 9(1):869–897,
2015.
A Appendix
In this section, we show ALE and AME of GCp. The outline of the proofs follows Li (1987) and Shibata (1983), although techniques from random matrix theory are required to overcome the difficulties imposed by high dimensionality.
A.1 Concentration inequalities
We introduce three concentration inequalities that are used for showing ALE and AME.
Lemma A.1. Let χ²_m be a random variable distributed as a chi-square distribution with m degrees of freedom. Then, for all t > 0,

Pr(χ²_m − m ≥ mt) ≤ exp{−m{t − log(1 + t)}/2},
Pr(χ²_m − m ≤ −mt) ≤ exp(−mt²/4).
The former is obtained from Shibata (1982), while the latter is given from Lemma 2.1 in Shibata
(1981).
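These bounds can be checked by Monte Carlo; the values of m, t, and the sample size below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(5)
m, t = 200, 0.5
samples = rng.chisquare(m, size=200_000)

# Empirical tail probabilities versus the bounds of Lemma A.1
emp_upper = np.mean(samples - m >= m * t)
emp_lower = np.mean(samples - m <= -m * t)
bound_upper = np.exp(-m * (t - np.log(1 + t)) / 2)
bound_lower = np.exp(-m * t**2 / 4)

print(emp_upper <= bound_upper, emp_lower <= bound_lower)  # True True
```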
Lemma A.2. Let X be a random variable distributed as N(0, 1). Then, for all t > 0,

Pr(X ≥ t) ≤ exp(−t²/2).
Lemma A.3. Let X ∼ N_{n,p}(O, I_n, I_p) with n > p. It holds that for all t > 0,

√n − √p − t ≤ σ_p(X) ≤ σ_1(X) ≤ √n + √p + t

with probability at least 1 − 2 exp(−Ct²), where C is a positive constant that does not depend on n and p.

Because Lemmas A.2 and A.3 can be found elsewhere (see, e.g., Wainwright 2019: Example 2.1, Example 6.2), we omit their proofs for brevity.
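Lemma A.3 can likewise be illustrated numerically; the matrix sizes and the value of t below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, t = 400, 100, 3.0
X = rng.normal(size=(n, p))          # X ~ N_{n,p}(O, I_n, I_p)
sv = np.linalg.svd(X, compute_uv=False)

# Singular values should lie in [sqrt(n) - sqrt(p) - t, sqrt(n) + sqrt(p) + t]
lower = np.sqrt(n) - np.sqrt(p) - t  # = 7
upper = np.sqrt(n) + np.sqrt(p) + t  # = 33
print(lower <= sv[-1] <= sv[0] <= upper)  # True
```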
A.2 Proof of Theorem 3.1
From the definitions of L_n(M) and GCp(M; α_n), their difference can be decomposed as

GCp(M; α_n) − L_n(M)
= 2{k_M p_n − tr(E^⊤ P_M E)} + 2tr(Σ_*^{−1/2} Γ_*^⊤ P_M^⊥ E) + tr{Γ_*^⊤ (P_{M_F} − P_M) Γ_* (α_n S^{−1} − Σ_*^{−1})}
− tr{Σ_*^{1/2} E^⊤ P_M E Σ_*^{1/2} (α_n S^{−1} − Σ_*^{−1})} + 2tr{Γ_*^⊤ (P_{M_F} − P_M) E Σ_*^{1/2} (α_n S^{−1} − Σ_*^{−1})}
+ tr{Σ_*^{1/2} E^⊤ P_{M_F} E Σ_*^{1/2} (α_n S^{−1} − Σ_*^{−1})} + tr{Y^⊤ P_{M_F}^⊥ Y (α_n S^{−1} − Σ_*^{−1})}.

Thus, we only need to consider the terms that depend on M, collected as

B_n(M) = 2{k_M p_n − tr(E^⊤ P_M E)} + 2tr(Σ_*^{−1/2} Γ_*^⊤ P_M^⊥ E) + tr{Γ_*^⊤ (P_{M_F} − P_M) Γ_* (α_n S^{−1} − Σ_*^{−1})}
− tr{Σ_*^{1/2} E^⊤ P_M E Σ_*^{1/2} (α_n S^{−1} − Σ_*^{−1})} + 2tr{Γ_*^⊤ (P_{M_F} − P_M) E Σ_*^{1/2} (α_n S^{−1} − Σ_*^{−1})}.

Because GCp(M̂_n; α_n) ≤ GCp(M*_L; α_n) by the definition of M̂_n, and because GCp(M; α_n) − L_n(M) differs from B_n(M) only by terms that do not depend on M, we have

L_n(M̂_n) + B_n(M̂_n) ≤ L_n(M*_L) + B_n(M*_L),

and thus

L_n(M̂_n) {1 + B_n(M̂_n)/L_n(M̂_n)} ≤ L_n(M*_L) {1 + B_n(M*_L)/L_n(M*_L)}.

This implies that GCp exhibits ALE if we can show that max_{M ∈ Mn} |B_n(M)|/L_n(M) converges to zero in probability. The following lemma gives a concentration inequality for max_{M ∈ Mn} |B_n(M)|/L_n(M).
Lemma A.4. Let γ ∈ (0, 1). Suppose that conditions (C1) and (C2) hold. If α_n → a = 1 − c_p as n → ∞, then for all ε > 0, there exist positive constants C_j > 0 (j = 1, . . . , 4) such that for sufficiently large n,

Pr( max_{M ∈ Mn} |B_n(M)|/L_n(M) > ε ) ≤ C_1 exp(−C_2 n^γ) + C_3 Σ_{M ∈ Mn} exp{−C_4 R_n(M)}.
Because Ln(M∗L) ≤ Ln(M) for all M ∈ Mn, Lemma A.4 immediately yields Theorem 3.1. Hence,
hereafter, we attempt to show Lemma A.4.
The following lemma enables us to consider max_{M ∈ Mn} |B_n(M)|/R_n(M) instead of max_{M ∈ Mn} |B_n(M)|/L_n(M).
Lemma A.5. For all ε > 0, there exists a constant C_ε > 0 such that for all M ∈ Mn,

Pr( |L_n(M)/R_n(M) − 1| > ε ) ≤ 2 exp{−C_ε R_n(M)}.
Proof. Let

ξ(M) = {L_n(M) − R_n(M)}/R_n(M) = {tr(E^⊤ P_M E) − k_M p_n}/R_n(M).

Note that tr(E^⊤ P_M E) follows a chi-square distribution with k_M p_n degrees of freedom. For all ε > 0, it holds from Lemma A.1 and the fact that R_n(M) ≥ k_M p_n that

Pr(|ξ(M)| > ε)
= Pr( {tr(E^⊤ P_M E) − k_M p_n}/(k_M p_n) > ε R_n(M)/(k_M p_n) ) + Pr( {tr(E^⊤ P_M E) − k_M p_n}/(k_M p_n) < −ε R_n(M)/(k_M p_n) )
≤ exp{ −(k_M p_n/2) { ε R_n(M)/(k_M p_n) − log(1 + ε R_n(M)/(k_M p_n)) } } + exp{ −ε² R_n(M)²/(4 k_M p_n) }
≤ exp{ −(ε R_n(M)/2) { 1 − {k_M p_n/(ε R_n(M))} log(1 + ε R_n(M)/(k_M p_n)) } } + exp{ −ε² R_n(M)/4 }.

Here, the function f(x) = 1 − x^{−1} log(1 + x) is monotonically increasing, and thus f(x) ≥ f(1) = 1 − log 2 > 0.3 for all x ≥ 1. Therefore, if ε R_n(M) ≥ k_M p_n, then

Pr(|ξ(M)| > ε) ≤ exp{−3ε R_n(M)/20} + exp{−ε² R_n(M)/4}.

If not, R_n(M) ≥ k_M p_n > ε R_n(M) implies that

Pr(|ξ(M)| > ε) ≤ Pr( {tr(E^⊤ P_M E) − k_M p_n}/(k_M p_n) > ε ) + Pr( {tr(E^⊤ P_M E) − k_M p_n}/(k_M p_n) < −ε R_n(M)/(k_M p_n) )
≤ exp{ −{ε − log(1 + ε)} k_M p_n/2 } + exp{ −ε² R_n(M)/4 }
≤ exp{ −ε{ε − log(1 + ε)} R_n(M)/2 } + exp{ −ε² R_n(M)/4 }.

By letting C_ε = min{3ε/20, ε²/4, ε{ε − log(1 + ε)}/2} > 0, we can show that for all M ∈ Mn,

Pr(|ξ(M)| > ε) ≤ 2 exp{−C_ε R_n(M)}.
Note that from the proof of Lemma A.5, the first term of Bn(M) can be evaluated. The next lemma
provides an evaluation of the second term of Bn(M).
Lemma A.6. For all ε > 0 and for all M ∈ Mn,

Pr( |tr(Σ_*^{−1/2} Γ_*^⊤ P_M^⊥ E)| / R_n(M) > ε ) ≤ 2 exp{−ε² R_n(M)/2}.

Proof. Because tr(Σ_*^{−1/2} Γ_*^⊤ P_M^⊥ E)/√tr(Δ_M) follows N(0, 1), Lemma A.2 implies that for all ε > 0,

Pr( |tr(Σ_*^{−1/2} Γ_*^⊤ P_M^⊥ E)| / R_n(M) > ε )
= Pr( [ |tr(Σ_*^{−1/2} Γ_*^⊤ P_M^⊥ E)| / √tr(Δ_M) ] · [ √tr(Δ_M) / R_n(M) ] > ε )
≤ 2 exp{ −ε² R_n(M)² / (2 tr(Δ_M)) }
≤ 2 exp{ −ε² R_n(M)/2 }.

The last inequality follows from tr(Δ_M) ≤ R_n(M).
Next, we evaluate the third term of B_n(M). For all M ∈ Mn \ {M_F}, let Q_M ∈ R^{(k_n − k_M) × n} satisfy Q_M^⊤ Q_M = P_{M_F} − P_M and Q_M Q_M^⊤ = I_{k_n − k_M}. Then, we consider the matrix

W_M = α_n (A_M^⊤ A_M)^{−1/2} A_M^⊤ Σ_*^{1/2} S^{−1} Σ_*^{1/2} A_M (A_M^⊤ A_M)^{−1/2},

where A_M = Σ_*^{−1/2} Γ_*^⊤ Q_M^⊤. This yields

|tr{Γ_*^⊤ (P_{M_F} − P_M) Γ_* (α_n S^{−1} − Σ_*^{−1})}| = |tr{A_M A_M^⊤ (α_n Σ_*^{1/2} S^{−1} Σ_*^{1/2} − I_{p_n})}|
= |tr{A_M^⊤ A_M (W_M − I_{k_n − k_M})}|
≤ σ_1(W_M − I_{k_n − k_M}) tr(A_M^⊤ A_M).

Therefore, for the evaluation of the third term of B_n(M), we calculate σ_1(W_M − I_{k_n − k_M}) and tr(A_M^⊤ A_M). The latter can be evaluated as tr(A_M^⊤ A_M) ≤ tr(Δ_M) ≤ R_n(M). Hence, we attempt to show that

max_{M ∈ Mn} σ_1(W_M − I_{k_n − k_M}) = o(1)

with high probability.

Let H_F be an n × (n − k_n) matrix such that P_{M_F}^⊥ = H_F H_F^⊤ and H_F^⊤ H_F = I_{n − k_n}. Then, define the event

E^{(1)}_{n,γ}: √(n − k_n) − √p_n − n^{γ/2} ≤ σ_{p_n}(H_F^⊤ E) ≤ σ_1(H_F^⊤ E) ≤ √(n − k_n) + √p_n + n^{γ/2},

where γ ∈ (0, 1) is a constant. On the event E^{(1)}_{n,γ}, we evaluate the difference between W_M and

V_M = (n − k_n) α_n (A_M^⊤ A_M)^{−1/2} A_M^⊤ (E^⊤ P_{M_F}^⊥ E)^{−1} A_M (A_M^⊤ A_M)^{−1/2}.
Lemma A.7. Suppose that conditions (C1) and (C2) hold. Then, on E^{(1)}_{n,γ} with γ ∈ (0, 1),

max_{M ∈ Mn \ {M_F}} σ_1(W_M − V_M) = o(1).

Proof. Because the columns of A_M (A_M^⊤ A_M)^{−1/2} are orthonormal,

max_{M ∈ Mn \ {M_F}} σ_1(W_M − V_M) ≤ (n − k_n) α_n σ_1( Σ_*^{1/2} (Y^⊤ P_{M_F}^⊥ Y)^{−1} Σ_*^{1/2} − (E^⊤ P_{M_F}^⊥ E)^{−1} ).

Without loss of generality, we assume Σ_* = I_{p_n}. From basic linear algebra, it follows that

(Y^⊤ P_{M_F}^⊥ Y)^{−1} − (E^⊤ P_{M_F}^⊥ E)^{−1}
= −(Y^⊤ P_{M_F}^⊥ Y)^{−1} (Y^⊤ P_{M_F}^⊥ Y − E^⊤ P_{M_F}^⊥ E) (E^⊤ P_{M_F}^⊥ E)^{−1}
= −(Y^⊤ P_{M_F}^⊥ Y)^{−1} (Δ_{M_F} + Γ_*^⊤ P_{M_F}^⊥ E + E^⊤ P_{M_F}^⊥ Γ_*) (E^⊤ P_{M_F}^⊥ E)^{−1}.

This equation implies that

σ_1( (Y^⊤ P_{M_F}^⊥ Y)^{−1} − (E^⊤ P_{M_F}^⊥ E)^{−1} )
≤ σ_1(Δ_{M_F} + Γ_*^⊤ P_{M_F}^⊥ E + E^⊤ P_{M_F}^⊥ Γ_*) / { σ_{p_n}(Y^⊤ P_{M_F}^⊥ Y) σ_{p_n}(E^⊤ P_{M_F}^⊥ E) }
≤ { σ_1(Δ_{M_F}) + 2σ_1(Γ_*^⊤ P_{M_F}^⊥ E) } / { σ_{p_n}(Y^⊤ P_{M_F}^⊥ Y) σ_{p_n}(E^⊤ P_{M_F}^⊥ E) }.  (11)

It holds from condition (C2) that

0 ≤ σ_{p_n}(Δ_{M_F}) ≤ σ_1(Δ_{M_F}) = o(n).  (12)

Hence, on E^{(1)}_{n,γ}, under conditions (C1) and (C2), the following evaluations of the singular values are obtained:

σ_1(Γ_*^⊤ P_{M_F}^⊥ E) ≤ √σ_1(Δ_{M_F}) {√(n − k_n) + √p_n + n^{γ/2}} = o(n),
σ_{p_n}(E^⊤ P_{M_F}^⊥ E) ≥ {√(n − k_n) − √p_n − n^{γ/2}}² ≍ n,  (13)

where, for a sequence a_n > 0, a_n ≍ n means that lim_{n→∞} a_n/n < ∞ and lim_{n→∞} n/a_n < ∞. From (12) and (13), we have

{ σ_1(Δ_{M_F}) + 2σ_1(Γ_*^⊤ P_{M_F}^⊥ E) } / σ_{p_n}(E^⊤ P_{M_F}^⊥ E) = o(1).

Note that α_n = O(1). Hence, (11) implies that it suffices for the proof to show that σ_{p_n}(Y^⊤ P_{M_F}^⊥ Y)/n does not converge to zero. It follows from (12) and (13) that

σ_{p_n}(Y^⊤ P_{M_F}^⊥ Y) ≥ σ_{p_n}(Δ_{M_F}) + σ_{p_n}(E^⊤ P_{M_F}^⊥ E) − 2σ_1(Γ_*^⊤ P_{M_F}^⊥ E) ≍ n.

Hence, the proof is completed.
This lemma implies that $W_M$ can be replaced by $V_M$ on $\mathcal{E}^{(1)}_{n,\gamma}$ with a small bias. To evaluate singular values of $V_M$ instead, we consider the following event:
\[
\mathcal{E}^{(2)}_{n,\gamma}:\ \frac{\{\sqrt{n-p_n} - \sqrt{k_n} - n^{\gamma/2}\}^2}{(n-k_n)\alpha_n} \le \sigma_{k_n}(V_\emptyset^{-1}) \le \sigma_1(V_\emptyset^{-1}) \le \frac{\{\sqrt{n-p_n} + \sqrt{k_n} + n^{\gamma/2}\}^2}{(n-k_n)\alpha_n},
\]
where $\gamma \in (0,1)$,
\[
V_\emptyset = (n-k_n)\alpha_n K_\emptyset^{-1/2} Q_\emptyset \Gamma_* \Sigma_*^{-1/2} (E^\top P_{M_F}^\perp E)^{-1} \Sigma_*^{-1/2} \Gamma_*^\top Q_\emptyset^\top K_\emptyset^{-1/2},
\]
$K_\emptyset = Q_\emptyset \Gamma_* \Sigma_*^{-1} \Gamma_*^\top Q_\emptyset^\top$ and $Q_\emptyset = (X_{M_F}^\top X_{M_F})^{-1/2} X_{M_F}^\top$. Note that if $\alpha_n \to a = 1 - c_p$, then $V_\emptyset$ approaches $I_{k_n}$ on $\mathcal{E}^{(2)}_{n,\gamma}$.
Lemma A.8. Suppose that conditions (C1) and (C2) hold. If $\alpha_n \to a = 1 - c_p$, then it holds that on $\mathcal{E}^{(1)}_{n,\gamma} \cap \mathcal{E}^{(2)}_{n,\gamma}$ with $\gamma \in (0,1)$,
\[
\max_{M \in \mathcal{M}_n \setminus \{M_F\}} \sigma_1(W_M - I_{k_n - k_M}) = o(1).
\]
Proof. First, we consider $\sigma_1(V_M - I_{k_n - k_M})$ instead of $\sigma_1(W_M - I_{k_n - k_M})$. For all $M \in \mathcal{M}_n \setminus \{M_F\}$ and $x \in \mathbb{R}^{k_n - k_M}$, let $y = K_\emptyset^{1/2} Q_\emptyset Q_M^\top K_M^{-1/2} x \in \mathbb{R}^{k_n}$, where $K_M = Q_M \Gamma_* \Sigma_*^{-1} \Gamma_*^\top Q_M^\top$. Note that $P_{M_F} Q_M^\top Q_M = P_{M_F}(P_{M_F} - P_M) = P_{M_F} - P_M = Q_M^\top Q_M$, and thus $P_{M_F} Q_M^\top = Q_M^\top$. Hence, we have
\begin{align*}
y^\top y &= x^\top K_M^{-1/2} Q_M Q_\emptyset^\top K_\emptyset Q_\emptyset Q_M^\top K_M^{-1/2} x \\
&= x^\top K_M^{-1/2} Q_M P_{M_F} \Gamma_* \Sigma_*^{-1} \Gamma_*^\top P_{M_F} Q_M^\top K_M^{-1/2} x \\
&= x^\top K_M^{-1/2} Q_M \Gamma_* \Sigma_*^{-1} \Gamma_*^\top Q_M^\top K_M^{-1/2} x \\
&= x^\top x.
\end{align*}
This indicates that
\begin{align*}
\frac{1}{(n-k_n)\alpha_n}\,\frac{y^\top V_\emptyset y}{y^\top y}
&= \frac{y^\top K_\emptyset^{-1/2} Q_\emptyset \Gamma_* \Sigma_*^{-1/2} (E^\top P_{M_F}^\perp E)^{-1} \Sigma_*^{-1/2} \Gamma_*^\top Q_\emptyset^\top K_\emptyset^{-1/2} y}{y^\top y} \\
&= \frac{x^\top K_M^{-1/2} Q_M \Gamma_* \Sigma_*^{-1/2} (E^\top P_{M_F}^\perp E)^{-1} \Sigma_*^{-1/2} \Gamma_*^\top Q_M^\top K_M^{-1/2} x}{x^\top x} \\
&= \frac{1}{(n-k_n)\alpha_n}\,\frac{x^\top V_M x}{x^\top x}.
\end{align*}
Hence, for all $M \in \mathcal{M}_n$,
\[
\inf_{y \in \mathbb{R}^{k_n}} \frac{y^\top V_\emptyset y}{y^\top y} \le \inf_{x \in \mathbb{R}^{k_n - k_M}} \frac{x^\top V_M x}{x^\top x} \le \sup_{x \in \mathbb{R}^{k_n - k_M}} \frac{x^\top V_M x}{x^\top x} \le \sup_{y \in \mathbb{R}^{k_n}} \frac{y^\top V_\emptyset y}{y^\top y}.
\]
Under condition (C1) and on $\mathcal{E}^{(2)}_{n,\gamma}$, the assumption $\alpha_n \to a = 1 - c_p$ yields
\[
\max_{M \in \mathcal{M}_n \setminus \{M_F\}} \sigma_1(V_M - I_{k_n - k_M}) = o(1).
\]
Hence, it follows from Lemma A.7 that on $\mathcal{E}^{(1)}_{n,\gamma} \cap \mathcal{E}^{(2)}_{n,\gamma}$,
\[
\max_{M \in \mathcal{M}_n \setminus \{M_F\}} \sigma_1(W_M - I_{k_n - k_M})
\le \max_{M \in \mathcal{M}_n \setminus \{M_F\}} \sigma_1(W_M - V_M) + \max_{M \in \mathcal{M}_n \setminus \{M_F\}} \sigma_1(V_M - I_{k_n - k_M}) = o(1).
\]
It remains to be shown that $\mathcal{E}^{(1)}_{n,\gamma}$ and $\mathcal{E}^{(2)}_{n,\gamma}$ occur with high probability. This is confirmed as follows.

Lemma A.9. Suppose that condition (C1) holds. Then there exists $C > 0$ such that for all $\gamma \in (0,1)$,
\[
\Pr(\mathcal{E}^{(1)}_{n,\gamma}) \ge 1 - 2\exp(-Cn^\gamma), \qquad
\Pr(\mathcal{E}^{(2)}_{n,\gamma}) \ge 1 - 2\exp(-Cn^\gamma).
\]
Proof. Because $H_F^\top E \sim N_{(n-k_n),p_n}(O, I_{n-k_n}, I_{p_n})$, by applying Lemma A.3, we immediately obtain the former assertion. On the other hand, $(n-k_n)\alpha_n V_\emptyset^{-1} \sim W_{k_n}(n-p_n, I_{k_n})$, which is shown by Theorem 3.2.11 in Muirhead (1982). By using $Z \sim N_{(n-p_n),k_n}(O, I_{n-p_n}, I_{k_n})$, we can observe that $Z^\top Z \sim W_{k_n}(n-p_n, I_{k_n})$. Hence,
\[
\Pr(\mathcal{E}^{(2)}_{n,\gamma}) = \Pr\bigl(\sqrt{n-p_n} - \sqrt{k_n} - n^{\gamma/2} \le \sigma_{k_n}(Z) \le \sigma_1(Z) \le \sqrt{n-p_n} + \sqrt{k_n} + n^{\gamma/2}\bigr),
\]
which is bounded from below by $1 - 2\exp(-Cn^\gamma)$ with some constant $C > 0$ due to Lemma A.3.
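The proof above rests on the Gaussian singular-value concentration stated in Lemma A.3: all singular values of an $m \times p$ standard Gaussian matrix lie in $[\sqrt{m} - \sqrt{p} - t,\ \sqrt{m} + \sqrt{p} + t]$ with probability at least $1 - 2e^{-t^2/2}$. A minimal numerical sketch (sizes below are illustrative assumptions, not from the paper):

```python
import numpy as np

# Singular values of an m x p standard Gaussian matrix concentrate in
# [sqrt(m) - sqrt(p), sqrt(m) + sqrt(p)]; t plays the role of n^{gamma/2}.
rng = np.random.default_rng(0)
m, p = 2000, 500                      # illustrative sizes
Z = rng.standard_normal((m, p))
s = np.linalg.svd(Z, compute_uv=False)
lo, hi = np.sqrt(m) - np.sqrt(p), np.sqrt(m) + np.sqrt(p)
t = m ** 0.25                         # slack term, analogous to n^{gamma/2}
assert lo - t <= s[-1] and s[0] <= hi + t
```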
Hence, from Lemmas A.8 and A.9, the following lemma holds.

Lemma A.10. Let $\gamma \in (0,1)$. Suppose that conditions (C1) and (C2) hold. If $\alpha_n \to a = 1 - c_p$, then there exists $C > 0$ such that for all $\varepsilon > 0$ and for sufficiently large $n$,
\[
\Pr\Bigl(\max_{M \in \mathcal{M}_n} \frac{|\mathrm{tr}\{\Gamma_*^\top (P_{M_F} - P_M) \Gamma_* (\alpha_n S^{-1} - \Sigma_*^{-1})\}|}{R_n(M)} > \varepsilon\Bigr) \le 4\exp(-Cn^\gamma).
\]
We can evaluate the fourth term of $B_n(M)$ similarly to Lemma A.10 by replacing $Q_M$ with a matrix $\tilde{Q}_M \in \mathbb{R}^{k_M \times n}$ satisfying $\tilde{Q}_M^\top \tilde{Q}_M = P_M$ and $\tilde{Q}_M \tilde{Q}_M^\top = I_{k_M}$, and $A_M$ with $\tilde{A}_M = E^\top \tilde{Q}_M^\top$. Then, we obtain the following lemma.

Lemma A.11. Let $\gamma \in (0,1)$. Suppose that conditions (C1) and (C2) hold. If $\alpha_n \to a = 1 - c_p$, then there exist positive constants $C_1, C_2 > 0$ such that for all $\varepsilon > 0$ and for sufficiently large $n$,
\[
\Pr\Bigl(\max_{M \in \mathcal{M}_n} \frac{|\mathrm{tr}\{\Sigma_*^{1/2} E^\top P_M E \Sigma_*^{1/2} (\alpha_n S^{-1} - \Sigma_*^{-1})\}|}{R_n(M)} > \varepsilon\Bigr) \le 2\sum_{M \in \mathcal{M}_n} \exp\{-C_1 R_n(M)\} + 4\exp(-C_2 n^\gamma).
\]
Proof. The proof for the fourth term is slightly different from that of Lemma A.10 because $\mathrm{tr}(\tilde{A}_M^\top \tilde{A}_M) = \mathrm{tr}(E^\top P_M E)$ is a random variable, although it has already been evaluated in Lemma A.5. By using $\tilde{A}_M$, we have the following evaluation:
\[
|\mathrm{tr}\{\Sigma_*^{1/2} E^\top P_M E \Sigma_*^{1/2} (\alpha_n S^{-1} - \Sigma_*^{-1})\}| = |\mathrm{tr}\{\tilde{A}_M^\top \tilde{A}_M (\tilde{W}_M - I_{k_M})\}| \le \sigma_1(\tilde{W}_M - I_{k_M})\,\mathrm{tr}(\tilde{A}_M^\top \tilde{A}_M),
\]
where $\tilde{W}_M$ is a counterpart of $W_M$ defined by
\[
\tilde{W}_M = \alpha_n (\tilde{A}_M^\top \tilde{A}_M)^{-1/2} \tilde{A}_M^\top \Sigma_*^{1/2} S^{-1} \Sigma_*^{1/2} \tilde{A}_M (\tilde{A}_M^\top \tilde{A}_M)^{-1/2}.
\]
Here, let us divide the target probability into two parts as follows:
\begin{align}
&\Pr\Bigl(\max_{M \in \mathcal{M}_n} \frac{|\mathrm{tr}\{\Sigma_*^{1/2} E^\top P_M E \Sigma_*^{1/2} (\alpha_n S^{-1} - \Sigma_*^{-1})\}|}{R_n(M)} > \varepsilon\Bigr) \nonumber \\
&\quad \le \Pr\Bigl(\max_{M \in \mathcal{M}_n} \frac{\sigma_1(\tilde{W}_M - I_{k_M})\,\mathrm{tr}(\tilde{A}_M^\top \tilde{A}_M)}{R_n(M)} > \varepsilon\Bigr) \nonumber \\
&\quad \le \Pr\Bigl(\max_{M \in \mathcal{M}_n} \frac{\mathrm{tr}(\tilde{A}_M^\top \tilde{A}_M)}{R_n(M)} > 2\Bigr) + \Pr\Bigl(2\max_{M \in \mathcal{M}_n} \sigma_1(\tilde{W}_M - I_{k_M}) > \varepsilon\Bigr). \tag{14}
\end{align}
First, we evaluate the former probability of (14). Because $\mathrm{tr}(\tilde{A}_M^\top \tilde{A}_M) = \mathrm{tr}(E^\top P_M E)$ and $k_M p_n \le R_n(M)$,
\[
\max_{M \in \mathcal{M}_n} \frac{\mathrm{tr}(\tilde{A}_M^\top \tilde{A}_M)}{R_n(M)} \le \max_{M \in \mathcal{M}_n} \frac{k_M p_n + |\mathrm{tr}(E^\top P_M E) - k_M p_n|}{R_n(M)} \le 1 + \max_{M \in \mathcal{M}_n} \frac{|\mathrm{tr}(E^\top P_M E) - k_M p_n|}{R_n(M)}.
\]
Lemma A.5 indicates that there exists a positive constant $C_1 > 0$ such that
\[
\Pr\Bigl(\max_{M \in \mathcal{M}_n} \frac{\mathrm{tr}(\tilde{A}_M^\top \tilde{A}_M)}{R_n(M)} > 2\Bigr) \le \Pr\Bigl(\max_{M \in \mathcal{M}_n} \frac{|\mathrm{tr}(E^\top P_M E) - k_M p_n|}{R_n(M)} > 1\Bigr) \le 2\sum_{M \in \mathcal{M}_n} \exp\{-C_1 R_n(M)\}.
\]
Next, let us calculate the latter probability of (14). We define a counterpart of $V_M$ as follows:
\[
\tilde{V}_M = (n-k_n)\alpha_n (\tilde{A}_M^\top \tilde{A}_M)^{-1/2} \tilde{A}_M^\top (E^\top P_{M_F}^\perp E)^{-1} \tilde{A}_M (\tilde{A}_M^\top \tilde{A}_M)^{-1/2}.
\]
Then, from the proof of Lemma A.7, when conditions (C1) and (C2) hold, it follows that on $\mathcal{E}^{(1)}_{n,\gamma}$,
\[
\max_{M \in \mathcal{M}_n} \sigma_1(\tilde{W}_M - \tilde{V}_M) = o(1). \tag{15}
\]
Thus, we can consider $\sigma_1(\tilde{V}_M - I_{k_M})$ instead of $\sigma_1(\tilde{W}_M - I_{k_M})$. Here, let us consider an event $\mathcal{E}^{(3)}_{n,\gamma}$ given as
\[
\mathcal{E}^{(3)}_{n,\gamma}:\ \frac{\{\sqrt{n-p_n} - \sqrt{k_n} - n^{\gamma/2}\}^2}{(n-k_n)\alpha_n} \le \sigma_{k_n}(\tilde{V}_{M_F}^{-1}) \le \sigma_1(\tilde{V}_{M_F}^{-1}) \le \frac{\{\sqrt{n-p_n} + \sqrt{k_n} + n^{\gamma/2}\}^2}{(n-k_n)\alpha_n}.
\]
Similar to $V_\emptyset$, if $\alpha_n \to a = 1 - c_p$, then $\tilde{V}_{M_F}$ approaches $I_{k_n}$ on $\mathcal{E}^{(3)}_{n,\gamma}$. From arguments similar to those of Lemma A.8, it follows that for all $M \in \mathcal{M}_n$,
\[
\inf_{y \in \mathbb{R}^{k_n}} \frac{y^\top \tilde{V}_{M_F} y}{y^\top y} \le \inf_{x \in \mathbb{R}^{k_M}} \frac{x^\top \tilde{V}_M x}{x^\top x} \le \sup_{x \in \mathbb{R}^{k_M}} \frac{x^\top \tilde{V}_M x}{x^\top x} \le \sup_{y \in \mathbb{R}^{k_n}} \frac{y^\top \tilde{V}_{M_F} y}{y^\top y}.
\]
This implies that when $\alpha_n \to a = 1 - c_p$, on $\mathcal{E}^{(3)}_{n,\gamma}$,
\[
\max_{M \in \mathcal{M}_n} \sigma_1(\tilde{V}_M - I_{k_M}) = o(1). \tag{16}
\]
Hence, by combining (15) and (16), it is established that
\[
\max_{M \in \mathcal{M}_n} \sigma_1(\tilde{W}_M - I_{k_M}) = o(1) \tag{17}
\]
on $\mathcal{E}^{(1)}_{n,\gamma} \cap \mathcal{E}^{(3)}_{n,\gamma}$. Thus, for sufficiently large $n$,
\[
\Pr\Bigl(2\max_{M \in \mathcal{M}_n} \sigma_1(\tilde{W}_M - I_{k_M}) > \varepsilon\Bigr) \le \Pr\bigl((\mathcal{E}^{(1)}_{n,\gamma})^c\bigr) + \Pr\bigl((\mathcal{E}^{(3)}_{n,\gamma})^c\bigr).
\]
Note that $(n-k_n)\alpha_n \tilde{V}_{M_F}^{-1}$ follows $W_{k_n}(n-p_n, I_{k_n})$ from Theorem 3.2.11 in Muirhead (1982). Hence, the same arguments as in Lemma A.9 show that there exists a positive constant $C_2 > 0$ such that for all $\gamma \in (0,1)$,
\[
\Pr(\mathcal{E}^{(3)}_{n,\gamma}) \ge 1 - 2\exp(-C_2 n^\gamma). \tag{18}
\]
Combining this with Lemma A.9, there exists a positive constant $C_3 > 0$ such that for sufficiently large $n$, we have
\[
\Pr\Bigl(2\max_{M \in \mathcal{M}_n} \sigma_1(\tilde{W}_M - I_{k_M}) > \varepsilon\Bigr) \le 4\exp(-C_3 n^\gamma).
\]
Hence, the proof is completed.
Finally, we evaluate the fifth term of $B_n(M)$. First, we derive an upper bound of $\sigma_1((\alpha_n S^{-1} - \Sigma_*^{-1})\Sigma_*)$.

Lemma A.12. Let $\gamma \in (0,1)$. Suppose that conditions (C1) and (C2) hold. If $\alpha_n \to a = 1 - c_p$, then there exists a positive constant $C > 0$ such that for sufficiently large $n$,
\[
\Pr\Bigl(\sigma_1\bigl((\alpha_n S^{-1} - \Sigma_*^{-1})\Sigma_*\bigr) > \frac{2}{1-\sqrt{c_p}}\Bigr) \le 2\exp(-Cn^\gamma).
\]
Proof. Without loss of generality, we assume $\Sigma_* = I_{p_n}$. By the triangle inequality,
\[
\sigma_1(\alpha_n S^{-1} - I_{p_n}) \le (n-k_n)\alpha_n\,\sigma_1\bigl((Y^\top P_{M_F}^\perp Y)^{-1} - (E^\top P_{M_F}^\perp E)^{-1}\bigr) + \sigma_1\bigl((n-k_n)\alpha_n(E^\top P_{M_F}^\perp E)^{-1} - I_{p_n}\bigr).
\]
From the proof of Lemma A.7, it follows that on $\mathcal{E}^{(1)}_{n,\gamma}$,
\[
(n-k_n)\alpha_n\,\sigma_1\bigl((Y^\top P_{M_F}^\perp Y)^{-1} - (E^\top P_{M_F}^\perp E)^{-1}\bigr) = o(1).
\]
On the other hand, event $\mathcal{E}^{(1)}_{n,\gamma}$ implies that
\begin{align*}
&\sigma_1\bigl((n-k_n)\alpha_n(E^\top P_{M_F}^\perp E)^{-1} - I_{p_n}\bigr) \\
&\quad = \max\bigl\{(n-k_n)\alpha_n \sigma_{p_n}(H_F^\top E)^{-2} - 1,\ 1 - (n-k_n)\alpha_n \sigma_1(H_F^\top E)^{-2}\bigr\} \\
&\quad \le \max\Bigl\{\frac{(n-k_n)\alpha_n}{(\sqrt{n-k_n} - \sqrt{p_n} - n^{\gamma/2})^2} - 1,\ 1 - \frac{(n-k_n)\alpha_n}{(\sqrt{n-k_n} + \sqrt{p_n} + n^{\gamma/2})^2}\Bigr\} \\
&\quad = \max\Bigl\{\frac{1-c_p}{(1-\sqrt{c_p})^2} - 1 + o(1),\ 1 - \frac{1-c_p}{(1+\sqrt{c_p})^2} + o(1)\Bigr\} \\
&\quad = \max\Bigl\{\frac{1+\sqrt{c_p}}{1-\sqrt{c_p}} - 1 + o(1),\ 1 - \frac{1-\sqrt{c_p}}{1+\sqrt{c_p}} + o(1)\Bigr\} \\
&\quad = \frac{2\sqrt{c_p}}{1-\sqrt{c_p}} + o(1).
\end{align*}
Because $c_p < 1$, for sufficiently large $n$, on $\mathcal{E}^{(1)}_{n,\gamma}$,
\[
\sigma_1(\alpha_n S^{-1} - I_{p_n}) \le \frac{2}{1-\sqrt{c_p}}.
\]
Hence, for sufficiently large $n$,
\begin{align*}
\Pr\Bigl(\sigma_1(\alpha_n S^{-1} - \Sigma_*^{-1}) > \frac{2}{1-\sqrt{c_p}}\Bigr)
&\le \Pr\Bigl(\Bigl\{\sigma_1(\alpha_n S^{-1} - \Sigma_*^{-1}) > \frac{2}{1-\sqrt{c_p}}\Bigr\} \cap \mathcal{E}^{(1)}_{n,\gamma}\Bigr) + \Pr\bigl((\mathcal{E}^{(1)}_{n,\gamma})^c\bigr) \\
&\le 2\exp(-Cn^\gamma),
\end{align*}
where the last inequality follows from Lemma A.9.
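The limit computation in the proof of Lemma A.12 can be sketched numerically in the simplest setting, taking $k_n = 0$ so that $P_{M_F}^\perp = I_n$ and, with $\Sigma_* = I$, $E^\top E$ is Wishart. All sizes below are illustrative assumptions:

```python
import numpy as np

# With c = p/n and alpha = 1 - c, the largest deviation of the eigenvalues of
# n*alpha*(E'E)^{-1} from 1 should be close to 2*sqrt(c)/(1 - sqrt(c)),
# hence below the threshold 2/(1 - sqrt(c)) used in Lemma A.12.
rng = np.random.default_rng(1)
n, p = 4000, 1000                       # illustrative sizes, c = 0.25
c = p / n
alpha = 1.0 - c
E = rng.standard_normal((n, p))
eigs = np.linalg.eigvalsh(E.T @ E)      # eigenvalues of the Wishart matrix E'E
dev = np.max(np.abs(n * alpha / eigs - 1.0))
predicted = 2 * np.sqrt(c) / (1 - np.sqrt(c))
assert dev < 2 / (1 - np.sqrt(c))       # the bound of Lemma A.12
assert abs(dev - predicted) < 0.15      # finite-sample slack around the limit
```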
Noting that $\Gamma_*^\top (P_{M_F} - P_M) E \Sigma_*^{1/2}$ is a function of $Q_M E$, which is independent of $S$, we can obtain the following lemma.

Lemma A.13. Let $\gamma \in (0,1)$. Suppose that conditions (C1) and (C2) hold. If $\alpha_n \to a = 1 - c_p$, then there exists a positive constant $C > 0$ such that for all $\varepsilon > 0$ and for sufficiently large $n$,
\[
\Pr\Bigl(\max_{M \in \mathcal{M}_n} \frac{|\mathrm{tr}\{\Gamma_*^\top (P_{M_F} - P_M) E \Sigma_*^{1/2} (\alpha_n S^{-1} - \Sigma_*^{-1})\}|}{R_n(M)} > \varepsilon\Bigr)
\le 2\sum_{M \in \mathcal{M}_n} \exp\Bigl\{-\frac{(1-\sqrt{c_p})^2 \varepsilon^2 R_n(M)}{8}\Bigr\} + 2\exp(-Cn^\gamma).
\]
Proof. Recall that for all $M \in \mathcal{M}_n \setminus \{M_F\}$, $Q_M \in \mathbb{R}^{(k_n - k_M) \times n}$ satisfies $Q_M^\top Q_M = P_{M_F} - P_M$ and $Q_M Q_M^\top = I_{k_n - k_M}$. Then, given $S$, $\mathrm{tr}\{\Gamma_*^\top (P_{M_F} - P_M) E \Sigma_*^{1/2} (\alpha_n S^{-1} - \Sigma_*^{-1})\}$ is normally distributed with mean $0$ and variance
\[
v_M^2 = \mathrm{tr}\{Q_M \Gamma_* (\alpha_n S^{-1} - \Sigma_*^{-1}) \Sigma_* (\alpha_n S^{-1} - \Sigma_*^{-1}) \Gamma_*^\top Q_M^\top\}.
\]
It follows from Lemma A.12 that with probability at least $1 - 2\exp(-Cn^\gamma)$, for all $M \in \mathcal{M}_n$,
\[
v_M^2 \le \sigma_1\bigl((\alpha_n S^{-1} - \Sigma_*^{-1})\Sigma_*\bigr)^2\,\mathrm{tr}(Q_M \Gamma_* \Sigma_*^{-1} \Gamma_*^\top Q_M^\top) \le \Bigl(\frac{2}{1-\sqrt{c_p}}\Bigr)^2 R_n(M),
\]
where we used $\mathrm{tr}(Q_M \Gamma_* \Sigma_*^{-1} \Gamma_*^\top Q_M^\top) \le \mathrm{tr}(\Delta_M) \le R_n(M)$ at the last inequality. Hence, from Lemma A.2, we have
\begin{align*}
&\Pr\Bigl(\max_{M \in \mathcal{M}_n} \frac{|\mathrm{tr}\{\Gamma_*^\top (P_{M_F} - P_M) E \Sigma_*^{1/2} (\alpha_n S^{-1} - \Sigma_*^{-1})\}|}{R_n(M)} > \varepsilon\Bigr) \\
&\quad \le \Pr\Bigl(\max_{M \in \mathcal{M}_n} \frac{|\mathrm{tr}\{\Gamma_*^\top (P_{M_F} - P_M) E \Sigma_*^{1/2} (\alpha_n S^{-1} - \Sigma_*^{-1})\}|}{|v_M|} > \frac{(1-\sqrt{c_p})\varepsilon\sqrt{R_n(M)}}{2}\Bigr) \\
&\qquad + \Pr\Bigl(\max_{M \in \mathcal{M}_n} \frac{v_M^2}{R_n(M)} > \frac{4}{(1-\sqrt{c_p})^2}\Bigr) \\
&\quad \le 2\sum_{M \in \mathcal{M}_n} \exp\Bigl\{-\frac{(1-\sqrt{c_p})^2 \varepsilon^2 R_n(M)}{8}\Bigr\} + 2\exp(-Cn^\gamma).
\end{align*}
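The conditional-variance formula for $v_M^2$ used in the proof of Lemma A.13 can be checked by simulation. The sketch below (toy sizes, randomly generated matrices, illustrative only) compares the Monte Carlo variance of the trace statistic with the closed form:

```python
import numpy as np

# Given S, T = tr{Gamma' (P_MF - P_M) E Sigma^{1/2} C} is linear in the iid
# N(0,1) entries of E, so Var(T) = tr{Q_M Gamma C Sigma C Gamma' Q_M'}
# for C = alpha*S^{-1} - Sigma^{-1} (C is symmetric).
rng = np.random.default_rng(4)
n, p, k, kM, alpha = 12, 4, 6, 2, 0.8           # illustrative sizes
Gamma = rng.standard_normal((n, p))
B = rng.standard_normal((p, p)); Sigma = B @ B.T + p * np.eye(p)
Bs = rng.standard_normal((p, p)); S = Bs @ Bs.T + p * np.eye(p)
Qfull, _ = np.linalg.qr(rng.standard_normal((n, k)))
P_diff = Qfull[:, kM:] @ Qfull[:, kM:].T        # plays the role of P_MF - P_M
QM = Qfull[:, kM:].T
w, U = np.linalg.eigh(Sigma)
Sig_half = U @ np.diag(np.sqrt(w)) @ U.T        # symmetric square root
C = alpha * np.linalg.inv(S) - np.linalg.inv(Sigma)
v2 = np.trace(QM @ Gamma @ C @ Sigma @ C @ Gamma.T @ QM.T)
T = np.array([np.trace(Gamma.T @ P_diff @ rng.standard_normal((n, p)) @ Sig_half @ C)
              for _ in range(20000)])
assert abs(T.mean()) < 3 * np.sqrt(v2 / len(T)) + 0.05 * np.sqrt(v2)
assert abs(T.var() - v2) < 0.1 * v2             # Monte Carlo variance matches
```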
Thus, by combining Lemmas A.5, A.6, A.10, A.11, and A.13, Lemma A.4 is obtained.
A.3 Proof of Theorem 3.2
To show AME, the following lemma plays an important role.
Lemma A.14. Let $X_n$ and $Y_n$ be random variables such that $Y_n \ge 1$. If $E[(X_n/Y_n - 1)^2] \to 0$ and $E[Y_n^2] \to 1$, then
\[
\lim_{n \to \infty} E[X_n] = 1.
\]
Proof. Because $X_n - 1$ can be decomposed as $X_n - 1 = (X_n/Y_n - 1)Y_n + (Y_n - 1)$, by applying the triangle inequality and the Cauchy--Schwarz inequality, it follows that
\begin{align*}
|E[X_n - 1]| &\le \Bigl|E\Bigl[\Bigl(\frac{X_n}{Y_n} - 1\Bigr)Y_n\Bigr]\Bigr| + |E[Y_n - 1]| \\
&\le \sqrt{E\Bigl[\Bigl(\frac{X_n}{Y_n} - 1\Bigr)^2\Bigr]}\sqrt{E[Y_n^2]} + \sqrt{E[(Y_n - 1)^2]}.
\end{align*}
Because $Y_n \ge 1$ by assumption, we can see that $1 \le E[Y_n] \le E[Y_n^2] \to 1$. Hence, the right-hand side of the above inequality goes to 0.
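Lemma A.14 can be illustrated numerically with artificial sequences satisfying its hypotheses; the particular $X_n$ and $Y_n$ below are assumptions made for the demonstration, not the ones used in the paper:

```python
import numpy as np

# Y_n >= 1 with E[Y_n^2] -> 1, and X_n/Y_n -> 1 in L2, force E[X_n] -> 1.
rng = np.random.default_rng(2)

def mean_X(n, reps=200_000):
    U = rng.uniform(size=reps)
    Y = 1.0 + U / n                                # Y_n >= 1, E[Y_n^2] -> 1
    ratio = 1.0 + rng.standard_normal(reps) / n    # X_n/Y_n, E[(X_n/Y_n - 1)^2] = 1/n^2
    return float(np.mean(ratio * Y))               # empirical E[X_n]

for n in (10, 100, 1000):
    # exact value is 1 + 1/(2n), which tends to 1 as n grows
    assert abs(mean_X(n) - 1.0) < 1.0 / n + 0.01
```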
If we show the conditions of Lemma A.14 with
\[
X_n = \frac{L_n(\hat{M}_n)}{R_n(M_R^*)}, \qquad Y_n = \frac{R_n(\hat{M}_n)}{R_n(M_R^*)},
\]
i.e., $E[Y_n^2] \to 1$ and $E[(X_n/Y_n - 1)^2] \to 0$, then it follows that GC$_p$ has AME.

Lemma A.15. Under conditions (C1)--(C4), if $\alpha_n \to a = 1 - c_p$, then
\[
\lim_{n \to \infty} \frac{E[R_n(\hat{M}_n)^2]}{R_n(M_R^*)^2} = 1.
\]
Proof. For all $\eta > 0$, $\mathcal{M}_n$ is separated into $\mathcal{M}_n^{(1)}$ and $\mathcal{M}_n^{(2)}$ such that
\[
\mathcal{M}_n^{(1)} = \Bigl\{M \in \mathcal{M}_n \Bigm| \frac{R_n(M)}{R_n(M_R^*)} \le 1 + \eta\Bigr\}, \qquad
\mathcal{M}_n^{(2)} = \Bigl\{M \in \mathcal{M}_n \Bigm| \frac{R_n(M)}{R_n(M_R^*)} > 1 + \eta\Bigr\}.
\]
Then, $E[R_n(\hat{M}_n)^2]/R_n(M_R^*)^2 - 1$ can be decomposed as follows:
\[
\frac{E[R_n(\hat{M}_n)^2]}{R_n(M_R^*)^2} - 1 = \sum_{M \in \mathcal{M}_n^{(1)}} \frac{R_n(M)^2}{R_n(M_R^*)^2}\Pr(\hat{M}_n = M) - 1 + \sum_{M \in \mathcal{M}_n^{(2)}} \frac{R_n(M)^2}{R_n(M_R^*)^2}\Pr(\hat{M}_n = M).
\]
If the last term on the right-hand side of the above equation converges to zero, it holds that $\Pr(\hat{M}_n \in \mathcal{M}_n^{(2)}) \to 0$ because $R_n(M)^2/R_n(M_R^*)^2 > (1+\eta)^2$ for all $M \in \mathcal{M}_n^{(2)}$. This then leads to $\Pr(\hat{M}_n \in \mathcal{M}_n^{(1)}) \to 1$. On the other hand, $1 \le R_n(M)^2/R_n(M_R^*)^2 \le (1+\eta)^2$ for all $M \in \mathcal{M}_n^{(1)}$. Hence, for sufficiently large $n$,
\[
\Bigl|\sum_{M \in \mathcal{M}_n^{(1)}} \frac{R_n(M)^2}{R_n(M_R^*)^2}\Pr(\hat{M}_n = M) - 1\Bigr| < (1+\eta)^2 \Pr(\hat{M}_n \in \mathcal{M}_n^{(1)}) - 1 \le (1+\eta)^2 - 1.
\]
Because $\eta$ is an arbitrary positive constant, it follows that
\[
\lim_{n \to \infty} \frac{E[R_n(\hat{M}_n)^2]}{R_n(M_R^*)^2} = 1
\]
if
\[
\lim_{n \to \infty} \sum_{M \in \mathcal{M}_n^{(2)}} \frac{R_n(M)^2}{R_n(M_R^*)^2}\Pr(\hat{M}_n = M) = 0.
\]
Let $\mathcal{E}_{n,\gamma} = \mathcal{E}^{(1)}_{n,\gamma} \cap \mathcal{E}^{(2)}_{n,\gamma} \cap \mathcal{E}^{(3)}_{n,\gamma}$ and $\gamma = (1+\gamma_0)/2 \in (\gamma_0, 1)$, where $\gamma_0$ is defined in (C4). Then, we can see that
\begin{align*}
&\sum_{M \in \mathcal{M}_n^{(2)}} \frac{R_n(M)^2}{R_n(M_R^*)^2}\Pr(\hat{M}_n = M) \\
&\quad \le \sum_{M \in \mathcal{M}_n^{(2)}} \frac{R_n(M)^2}{R_n(M_R^*)^2}\Pr(\{\hat{M}_n = M\} \cap \mathcal{E}_{n,\gamma}) + \max_{M \in \mathcal{M}_n} \frac{R_n(M)^2}{R_n(M_R^*)^2} \sum_{M \in \mathcal{M}_n^{(2)}} \Pr(\{\hat{M}_n = M\} \cap \mathcal{E}_{n,\gamma}^c) \\
&\quad \le \sum_{M \in \mathcal{M}_n^{(2)}} \frac{R_n(M)^2}{R_n(M_R^*)^2}\Pr(\{\hat{M}_n = M\} \cap \mathcal{E}_{n,\gamma}) + \max_{M \in \mathcal{M}_n} \frac{R_n(M)^2}{R_n(M_R^*)^2}\Pr(\mathcal{E}_{n,\gamma}^c).
\end{align*}
Hence, it is shown from Lemma A.9 and (18) that there exists a positive constant $C > 0$ such that
\[
\max_{M \in \mathcal{M}_n} \frac{R_n(M)^2}{R_n(M_R^*)^2}\Pr(\mathcal{E}_{n,\gamma}^c) \le 6\max_{M \in \mathcal{M}_n} \frac{R_n(M)^2}{R_n(M_R^*)^2}\exp(-Cn^\gamma) = O(\exp(n^{\gamma_0} - Cn^\gamma)),
\]
where the last equality follows from condition (C4). Recall that $\gamma > \gamma_0$, which implies that
\[
\max_{M \in \mathcal{M}_n} \frac{R_n(M)^2}{R_n(M_R^*)^2}\Pr(\mathcal{E}_{n,\gamma}^c) = o(1).
\]
Because $R_n(M_R^*) \to \infty$ due to condition (C3), it suffices to show that
\[
\lim_{n \to \infty} \sum_{M \in \mathcal{M}_n^{(2)}} R_n(M)^2 \Pr(\{\hat{M}_n = M\} \cap \mathcal{E}_{n,\gamma}) = 0.
\]
For all $M \in \mathcal{M}_n^{(2)}$,
\begin{align*}
&\Pr(\{\hat{M}_n = M\} \cap \mathcal{E}_{n,\gamma}) \\
&\quad \le \Pr(\{\mathrm{GC}_p(M_R^*;\alpha_n) - \mathrm{GC}_p(M;\alpha_n) \ge 0\} \cap \mathcal{E}_{n,\gamma}) \\
&\quad = \Pr\Bigl(\Bigl\{\frac{\mathrm{GC}_p(M_R^*;\alpha_n) - \mathrm{GC}_p(M;\alpha_n)}{R_n(M)} \ge 0\Bigr\} \cap \mathcal{E}_{n,\gamma}\Bigr) \\
&\quad = \Pr\Bigl(\Bigl\{\frac{\mathrm{GC}_p(M_R^*;\alpha_n) - \mathrm{GC}_p(M;\alpha_n)}{R_n(M)} - \frac{R_n(M_R^*) - R_n(M)}{R_n(M)} \ge 1 - \frac{R_n(M_R^*)}{R_n(M)}\Bigr\} \cap \mathcal{E}_{n,\gamma}\Bigr) \\
&\quad \le \Pr\Bigl(\Bigl\{\frac{R_n(M) - \mathrm{GC}_p(M;\alpha_n)}{R_n(M)} - \frac{R_n(M_R^*) - \mathrm{GC}_p(M_R^*;\alpha_n)}{R_n(M)} \ge \frac{\eta}{1+\eta}\Bigr\} \cap \mathcal{E}_{n,\gamma}\Bigr) \\
&\quad \le \Pr\Bigl(\Bigl\{\Bigl|\frac{L_n(M) - R_n(M)}{R_n(M)}\Bigr| + \Bigl|\frac{L_n(M_R^*) - R_n(M_R^*)}{R_n(M)}\Bigr| + \frac{|B_n(M)|}{R_n(M)} + \frac{|B_n(M_R^*)|}{R_n(M)} \ge \frac{\eta}{1+\eta}\Bigr\} \cap \mathcal{E}_{n,\gamma}\Bigr) \\
&\quad \le \Pr\Bigl(\Bigl|\frac{L_n(M) - R_n(M)}{R_n(M)}\Bigr| \ge \frac{\eta}{4(1+\eta)}\Bigr) + \Pr\Bigl(\Bigl|\frac{L_n(M_R^*) - R_n(M_R^*)}{R_n(M)}\Bigr| \ge \frac{\eta}{4(1+\eta)}\Bigr) \\
&\qquad + \Pr\Bigl(\Bigl\{\frac{|B_n(M)|}{R_n(M)} \ge \frac{\eta}{4(1+\eta)}\Bigr\} \cap \mathcal{E}_{n,\gamma}\Bigr) + \Pr\Bigl(\Bigl\{\frac{|B_n(M_R^*)|}{R_n(M)} \ge \frac{\eta}{4(1+\eta)}\Bigr\} \cap \mathcal{E}_{n,\gamma}\Bigr).
\end{align*}
It follows from arguments similar to those for Lemmas A.4 and A.5 that there exist positive constants $C_1$ and $C_2$ such that
\[
\sum_{M \in \mathcal{M}_n^{(2)}} R_n(M)^2 \Pr(\{\hat{M}_n = M\} \cap \mathcal{E}_{n,\gamma}) \le C_1 \sum_{M \in \mathcal{M}_n} \exp\{-C_2 R_n(M)\} \to 0.
\]
The last convergence follows from condition (C3).
Lemma A.16. Under condition (C3), it can be seen that
\[
\lim_{n \to \infty} E\Bigl[\Bigl(\frac{L_n(\hat{M}_n)}{R_n(\hat{M}_n)} - 1\Bigr)^2\Bigr] = 0.
\]
Proof. As in the case of Lemma A.5, define
\[
\xi(M) = \frac{L_n(M) - R_n(M)}{R_n(M)} = \frac{\mathrm{tr}(E^\top P_M E) - k_M p_n}{R_n(M)}.
\]
For all $\varepsilon > 0$,
\begin{align*}
E[\xi(\hat{M}_n)^2] &= E[\xi(\hat{M}_n)^2 I(|\xi(\hat{M}_n)| > \varepsilon)] + E[\xi(\hat{M}_n)^2 I(|\xi(\hat{M}_n)| \le \varepsilon)] \\
&\le E[\xi(\hat{M}_n)^2 I(|\xi(\hat{M}_n)| > \varepsilon)] + \varepsilon^2 \\
&\le \sum_{M \in \mathcal{M}_n} E[\xi(M)^2 I(|\xi(M)| > \varepsilon) I(\hat{M}_n = M)] + \varepsilon^2 \\
&\le \sum_{M \in \mathcal{M}_n} E[\xi(M)^2 I(|\xi(M)| > \varepsilon)] + \varepsilon^2 \\
&\le \sum_{M \in \mathcal{M}_n} \sqrt{E[\xi(M)^4]}\sqrt{\Pr(|\xi(M)| > \varepsilon)} + \varepsilon^2.
\end{align*}
The last inequality follows from the Cauchy--Schwarz inequality. Because $R_n(M) \ge k_M p_n$, it holds from a moment formula of the chi-square distribution that there exists a positive constant $C_\xi$ such that
\[
\max_{M \in \mathcal{M}_n} E[\xi(M)^4] \le C_\xi.
\]
Under condition (C3), from Lemma A.5, we can see that there exists a positive constant $C_\varepsilon$ such that for all sufficiently large $n$,
\[
\sum_{M \in \mathcal{M}_n} \sqrt{E[\xi(M)^4]}\sqrt{\Pr(|\xi(M)| > \varepsilon)} \le \sqrt{2C_\xi} \sum_{M \in \mathcal{M}_n} \exp(-C_\varepsilon R_n(M)) \le \varepsilon.
\]
Hence, we have
\[
E[\xi(\hat{M}_n)^2] \le \varepsilon + \varepsilon^2.
\]
Because $\varepsilon$ can be arbitrarily small, the proof is completed.
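Since $\mathrm{tr}(E^\top P_M E) \sim \chi^2_{k_M p_n}$ for a rank-$k_M$ projection $P_M$, the concentration of $\xi(M)$ exploited above can be illustrated as follows (sizes are illustrative assumptions, not from the paper):

```python
import numpy as np

# xi(M) = (tr(E' P_M E) - k_M p_n) / R_n(M) concentrates around 0:
# tr(E' P_M E) is chi-square with df = k_M * p_n, variance 2*df.
rng = np.random.default_rng(5)
n, p, kM = 500, 50, 10                          # illustrative sizes
df = kM * p
Qfull, _ = np.linalg.qr(rng.standard_normal((n, kM)))
P_M = Qfull @ Qfull.T                           # rank-kM projection
xi = np.array([(np.trace(E.T @ P_M @ E) - df) / df
               for E in (rng.standard_normal((n, p)) for _ in range(200))])
# taking R_n(M) = k_M p_n here, xi has std sqrt(2/df) ~ 0.063
assert abs(xi.mean()) < 0.05
assert xi.std() < 3 * np.sqrt(2 / df)
```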
By combining Lemmas A.15 and A.16 with Lemma A.14, Theorem 3.2 is established.