
CHAPTER 2

Model Order Selection

Visa Koivunen and Esa Ollila, Department of Signal Processing and Acoustics, School of Electrical Engineering, Aalto University, Finland

3.02.1 Introduction

In model selection the main task is to choose a mathematical model for a phenomenon from a collection of models, given some evidence such as the observed data at hand. Informally, one aims at identifying a proper complexity for the model. Typical application examples include finding the number of radio frequency signals impinging on an antenna array, choosing the order of an ARMA model used for analyzing time series data, variable selection in the regression model in statistics, estimating the channel order in a wireless communication receiver, and selecting the order of a state-space model used in tracking a maneuvering target in radar.

Model order selection uses a set of observed data and seeks the best dimension for the parametric model used. This is related to the idea of the inverse experiment and to maximum likelihood estimation, where one looks for the parameters that most likely have produced the observed data. Various methods for model selection or model order estimation stem from detection and estimation theory, statistical inference, information and coding theory, and machine learning. The outcome of model order selection is an integer value describing the dimension of the model.

One of the well-known heuristic approaches used in model selection is Occam's razor (e.g., [1]). It favors parsimony, i.e., simpler explanations instead of more complex ones, while retaining the explanatory power of the model. Hence, Occam's razor chooses the simplest model among the candidate models. Simpler models tend to capture the underlying structure of the phenomenon (such as the signal of interest) better, and consequently provide better performance in predicting the output when new observations are processed. More complex models tend to overfit the observed data and to be more sensitive to noise; moreover, such a model may not generalize well to new data. The same problem is addressed in statistical inference, where there is a trade-off between bias and variance when explaining the observed data. Too simple a model leads to a larger systematic error (bias), while too complex a model may lead to an increased variance of the estimated parameters and a higher Cramér-Rao bound, as well as to finding structures in the data that do not actually exist. Parsimony is also favored in probability theory, since the probability of making an error increases when additional assumptions that may be invalid or unnecessary are introduced.

Traditionally it is assumed that there is a single correct hypothesis within the set of hypotheses considered that describes the (true) model generating the given data. In practice, any model that we assume will only be an approximation to the truth: for example, the data do not necessarily follow


an exactly normal distribution. Yet the described criteria are still useful in selecting the "best model" from the set, i.e., the model most supported by the data within the mathematical paradigm of the selected information criterion. There are many practical consequences of a misspecified model order. In state-space models and Kalman filtering, too low a model order makes the innovation sequence correlated and the optimality of the filter is lost. On the other hand, if the order is too high, one may end up tracking the noise, which can lead to divergence of the Kalman filter. In regression, models of too low an order may lead to biased estimates since they do not have enough flexibility to fit the data with high fidelity. On the other hand, too many parameters increase the variance of the estimates unnecessarily and cause a loss of optimality.

Model selection is of fundamental interest to the investigator seeking a better understanding of the phenomenon under study, and the literature on this subject is considerable. Several review articles [2–4] are notable, as well as the full-length textbooks [1,5,6]. In this overview, we first review the classical model selection paradigms: the Bayesian Information Criterion (BIC) with an emphasis on its asymptotic approximation due to Schwarz [7], Akaike's Information Criterion (AIC) [8,9], stochastic complexity [10] and its early development, the Minimum Description Length (MDL) criterion due to Rissanen [11], and the generalized likelihood ratio test (GLRT) based sequential hypothesis testing (see, e.g., [12]). Among the methods considered, the MDL and AIC have their foundations in information theory and coding, whereas the BIC and GLRT methods have their roots in the major statistical inference paradigms (the Bayesian and likelihood principles). The AIC is based on an asymptotic approximation of the Kullback-Leibler (K-L) divergence, a fundamental notion in information theory, whereas Rissanen's MDL is based on coding theory. However, both of these methods can also be linked with the likelihood principle, and the classification into statistical inference and information theory based methods is provided mainly for pedagogic and informative purposes. There are some more recent contributions, such as the exponentially embedded family (EEF) method by Kay [13] and Bozdogan's extensions of AIC [14] among others, which are not reviewed herein.

This chapter is organized as follows. First, Section 3.02.2 describes the basic principles, challenges, and the complexity of model selection problems with a simple example: variable selection in the multiple linear regression model. In particular, the need for a compromise between goodness of fit and concision (the "principle of parsimony"), which is the key idea of model selection, is illustrated, and the forward stepwise regression variable selection principle using the AIC is explained along with the cross-validation principle. Section 3.02.3 addresses statistical inference based methods (BIC, GLRT) in detail. Section 3.02.4 introduces the AIC, the MDL, and their variants and enhancements. In Section 3.02.5, the model selection methods are derived and illustrated using a practical engineering example: determining the dimension of the signal subspace; such problems arise commonly in sensor array processing and in harmonic retrieval; see, e.g., [15–18].

Notation. More specifically, model selection refers to the multihypothesis testing problem of determining which model best describes a given data (measurement) y of size N. Let H_m, m = 1, ..., M, denote the set of hypotheses, which need not be nested (i.e., H_1 ⊂ H_2 ⊂ ··· ⊂ H_M need not hold), and assume that each hypothesis specifies a probability density function (p.d.f.) f_m with a parameter vector θ of dimension K = dim(θ|H_m) in the parameter space Θ. For notational convenience we often suppress the subscript m (the model index) from θ, K, and Θ, but the reader should keep in mind that these differ for each model. Model order selection refers to finding the true dimension K of the model, commonly in


the nested set of hypotheses. Let

ℓ_m(θ|y) ≜ ln f_m(y|θ)   and   p_m ≜ P(H_m) ≡ P(y ∼ f_m)

denote the log-likelihood function of the data under the mth model and the a priori probability of the mth hypothesis (∑_{m=1}^M p_m = 1), respectively. Usually, all models are a priori equally likely, in which case p_m = 1/M for m = 1, ..., M. In likelihood theory, the natural point estimate of θ is the maximum likelihood estimate (MLE) θ̂ ≜ arg max_{θ∈Θ} ℓ_m(θ|y), which is assumed to be unique. Let us denote by

J_m ≡ J_m(θ) ≜ −E[ ∂²ℓ_m(θ|y) / ∂θ∂θ^T ]   and   Ĵ_m ≜ − ∂²ℓ_m(θ|y) / ∂θ∂θ^T |_{θ=θ̂}      (2.1)

the (expected) Fisher information matrix (FIM) and the observed FIM, respectively.

3.02.2 Example: variable selection in regression

In the multiple linear regression model the data y is modeled as y = Xβ + ε, where β ∈ R^k denotes the unknown vector of regression coefficients, X is the known N × k matrix of explanatory variables (k < N), and ε is the N-variate noise vector with i.i.d. components from the normal distribution N(0, σ²) with an unknown variance σ² > 0. Let us call the columns of X (explanatory) variables, following the convention in statistics, i.e., we use X_i to denote both the ith column of X and the ith variable of the model. Thus y ∼ N_N(Xβ, σ²I), and the p.d.f. and the log-likelihood are

f(y|θ) = (2π)^{−N/2} (σ²)^{−N/2} exp{ −‖y − Xβ‖² / (2σ²) },
ℓ(θ|y) = −(N/2) ln σ² − ‖y − Xβ‖² / (2σ²),

where θ = (β^T, σ²)^T denotes the parameter vector. Commonly there is a large set of explanatory variables and the main interest is in choosing the subset of explanatory variables that significantly contribute to the data (response variable) y; see [19,20]. Thus, we consider the hypotheses

H_I: only the subset of variables, say X_{i_1}, ..., X_{i_m}, contributes to y,

where I = (i_1, ..., i_m), 1 ≤ i_1 < i_2 < ··· < i_m ≤ k, denotes the index set of the contributing variables. Then under H_I, the MLEs of the regression coefficients and the noise variance correspond to the classic least squares (LS) estimate β̂ = (X_I^T X_I)^{−1} X_I^T y and the residual sample variance σ̂² = (1/N) RSS(β̂), respectively, where RSS(β̂) = ‖y − X_I β̂‖² is the residual sum of squares. Hence −2 × (maximum log-likelihood) under hypothesis H_I becomes

−2 ℓ(θ̂|y) = N ln σ̂² + ‖y − X_I β̂‖² / σ̂² = N ln σ̂² + N,      (2.2)

which always decreases when we include additional variables in the model. This is because σ̂²_{+1} ≤ σ̂², where σ̂²_{+1} denotes the residual variance with one additional variable in the model. Indeed, σ̂² = 0 if we introduce as many explanatory variables as there are responses. More generally, if two models that are compared are nested, then the one with more parameters will always yield a larger maximum log-likelihood, even though it is not necessarily a better model.
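The monotonicity above is easy to verify numerically. The following short sketch (an illustrative example, not from the chapter; Python/NumPy assumed) fits nested LS models that use the first m columns of a random design matrix and prints the quantity N ln σ̂² + N from (2.2): it never increases as columns are added, even though only the first two columns actually carry signal.

import numpy as np

rng = np.random.default_rng(0)
N, k = 50, 8
X = rng.standard_normal((N, k))              # candidate explanatory variables
beta_true = np.array([2.0, -1.0] + [0.0] * (k - 2))
y = X @ beta_true + rng.standard_normal(N)   # only the first two columns matter

for m in range(1, k + 1):
    Xm = X[:, :m]                            # nested models: first m columns
    beta_hat, *_ = np.linalg.lstsq(Xm, y, rcond=None)
    sigma2_hat = np.sum((y - Xm @ beta_hat) ** 2) / N   # MLE of the noise variance
    print(f"m = {m}:  -2 x max-log-lik = {N * np.log(sigma2_hat) + N:.2f}")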


3.02.2.1 AIC and stepwise regression

Previously, we noticed that minimizing −2ℓ(θ̂|y) alone cannot be used as the selection criterion. An intuitive approach is to introduce an additive term that penalizes overfitting. The penalty term needs to be an increasing function of K = dim(θ|H_I) = m + 1, so that the decrease in −2 × (maximum log-likelihood) obtained by adding parameters is counteracted by the growing penalty, yielding a trade-off between goodness of fit and simplicity. This principle of parsimony is the fundamental idea of model selection. As we shall see, the classic approximations of the information criteria reduce to such forms. For example, the AIC can be stated as

AIC = −2 × (maximum log-likelihood) + 2 × (number of estimable parameters),

that is, AIC = −2ℓ(θ̂|y) + 2K, and the regression model (hypothesis H_I) that minimizes the criterion is chosen. Here, the penalty term is 2K, which increases with K and hence penalizes selecting too many variables. In the regression setting,

AIC(m) = N ln σ̂² + 2m,

where we have suppressed the irrelevant additive constant N from (2.2) and the "plus one" term from K = m + 1, neither of which depends on the number of explanatory variables m = |I| in the model. Note that robust versions of the AIC have been proposed, e.g., [21], which utilize robust M-estimation.

The AIC is not computationally feasible for variable selection when the number of explanatory variable candidates k is large: we would need to consider all index sets I of size |I| = m for each m = 1, ..., k, and for each m there are (k choose m) possible index sets. A more practical and commonly used approach is the stepwise regression variable selection method. In the forward stepwise search strategy one starts with a model containing only an intercept term (or a subset of explanatory variables), and then includes the variable that yields the largest reduction in the AIC, i.e., the largest reduction in the RSS (and thus the largest improvement in fit). One stops when no variable produces a normalized reduction greater than a threshold. There are several variants of stepwise regression which differ in the criterion used for inclusion/exclusion of variables. For example, instead of AIC, Mallows' Cp, BIC, etc. can be used. One particularly simple stepwise regression is to include in each step the variable that is most correlated with the current residual r = y − ŷ, where ŷ is the LS fit based on the current selection of variables. This strategy is called orthogonal matching pursuit [22] in the engineering literature. Some modern approaches for variable selection in regression that can be related to forward stepwise regression are the LASSO (least absolute shrinkage and selection operator) [23] and LARS (least angle regression) [24]. LASSO minimizes the usual LS criterion (the sum of squared residuals) with an additional L1-norm constraint on the regression coefficients. As a result, some regression coefficients are shrunk to(wards) zero. LARS is a forward stepwise regression method where the selection of variables is done differently: namely, a regression coefficient of a variable is increased until that variable is no longer the one most correlated with the residual r. A simple modification of the LARS algorithm can be used to compute the LASSO solution [24].
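To make the forward search concrete, the following sketch (a minimal illustration with hypothetical data and function names, Python/NumPy assumed; not the chapter's reference implementation) greedily adds, at each step, the candidate column that most reduces AIC(m) = N ln σ̂² + 2m and stops when no addition improves the criterion.

import numpy as np

def aic_regression(y, X_sel):
    """AIC(m) = N ln(sigma2_hat) + 2m for the LS fit on the selected columns."""
    N, m = X_sel.shape
    beta, *_ = np.linalg.lstsq(X_sel, y, rcond=None)
    sigma2 = np.sum((y - X_sel @ beta) ** 2) / N
    return N * np.log(sigma2) + 2 * m

def forward_stepwise_aic(y, X):
    """Greedy forward selection: add the column giving the largest AIC decrease."""
    N, k = X.shape
    selected, remaining = [], list(range(k))
    best_aic = np.inf
    while remaining:
        scores = [(aic_regression(y, X[:, selected + [j]]), j) for j in remaining]
        cand_aic, j_best = min(scores)
        if cand_aic >= best_aic:         # no improvement: stop
            break
        best_aic = cand_aic
        selected.append(j_best)
        remaining.remove(j_best)
    return selected, best_aic

# toy data: only columns 0 and 3 carry signal
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 10))
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.standard_normal(100)
print(forward_stepwise_aic(y, X))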

3.02.2.2 Cross-validation and bootstrap methods

Cross-validation (CV) is a widely used technique for model order selection in machine learning, regression analysis, pattern recognition, and density estimation; see [25]. In CV one is interested not only in finding a statistical model for the observed data but also in its prediction performance on other, independent data sets. The goal is to ensure that the model generalizes well and does not adapt too much to the particular data set at hand. Unlike many other model selection techniques, CV can be considered a computer-based heuristic method that replaces some statistical analyses by repeated computations over subsets of data.

In CV, one splits the data set into subsets: one training set for the initial statistical analysis and the other (potentially multiple) subsets for testing or validating the analysis. If the test data are independent and identically distributed (i.i.d.), they can be viewed as new data in prediction. If the sample size of the training set is small or the dimension of the initial model is high, it is likely that the obtained model will fit the set of observations with too high fidelity. Obviously, this specific model will not fit the other test data sets equally well, and the predictions produced by the initial model using the training set may deviate significantly from the expected values.

The prediction results from the testing (validation) sets are used to evaluate the goodness of the model optimized for the training set. The original model may then be adjusted by combining it with the models estimated from the test sets to produce a single model, for example by averaging. CV may also be used in cases where the response is an indicator of class membership instead of being continuous valued. This is the case in M-ary detection problems and pattern recognition, for example.

There are many ways to subdivide the data into training and test sets. The assignment may be random, such that the training set and the test set have an equal number of observations; these two data sets may then be used for training and testing in an alternating manner. A very commonly used approach is so-called leave-one-out (LOO) cross-validation: each individual observation in the data set serves as the test set in turn while the remaining data form the training set. This is done exhaustively until every observation has been used as the test set. Obviously LOO has a high computational complexity, since the training set is different and the training process is repeated every time.

Cross-validation can also be used for model order selection in regression. Let us assume that we have a "supermodel" with a large number of explanatory variables and we want to select which subset of explanatory (predictor) variables gives the best model in terms of its prediction performance. By using the cross-validation approach with the test sets of observations one may find the subset of explanatory variables that gives the best prediction performance measured, for example, in terms of mean square error. If cross-validation is not used, adding more predictor variables to the model decreases the risk or error measure, such as the mean square error, the sum of squared residuals, or the classification error, on the training set, but the predictive performance on new data may be very different.
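As a sketch of this use of CV (hypothetical data and helper names, Python/NumPy assumed), the following routine scores nested regression models of increasing order by K-fold cross-validated prediction MSE and selects the order with the smallest score.

import numpy as np

def cv_mse(y, X, order, n_folds=5, seed=0):
    """K-fold cross-validated prediction MSE for the model using the first `order` columns of X."""
    rng = np.random.default_rng(seed)
    N = len(y)
    folds = np.array_split(rng.permutation(N), n_folds)
    errs = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(N), test_idx)
        Xtr, Xte = X[train_idx, :order], X[test_idx, :order]
        beta, *_ = np.linalg.lstsq(Xtr, y[train_idx], rcond=None)
        errs.append(np.mean((y[test_idx] - Xte @ beta) ** 2))
    return np.mean(errs)

rng = np.random.default_rng(2)
X = rng.standard_normal((120, 8))
y = 2 * X[:, 0] - X[:, 1] + 0.5 * rng.standard_normal(120)   # true order is 2

scores = {m: cv_mse(y, X, m) for m in range(1, 9)}
print("CV-MSE per order:", scores)
print("selected order:", min(scores, key=scores.get))

Setting n_folds equal to the number of observations gives the LOO variant discussed above.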

A few remarks on the statistical performance of the CV method are in order. Evaluating the model on the training set for which it was optimized gives a too positive view of its quality: the risk or error measure, such as the MSE, will be smaller for the training set than for the test set. The variance of the cross-validation estimates depends on the data splitting scheme, and it is difficult to analyze the performance of the method in a general case; however, it has been established in some special cases, see [25]. If the model is estimated for each subset of test data, the values of the estimates will vary (the estimator is a random variable).

The CV estimators often have some bias that depends on the size of the training sample relative to the whole sample size; it tends to decrease as the training sample size increases. For the same size of training set the bias remains the same, but differences may be observed in the variance (accuracy). The overall performance of the CV estimators depends on many factors, including the sizes of the training data set and the validation data set and the number of data subsets.


Bootstrapping [26] is a resampling method that is typically used for characterizing confidence intervals. It has found applications in hypothesis testing and model order estimation as well; see, e.g., [27]. Assuming independent and identically distributed (i.i.d.) data samples, it resamples the observed data set with replacement to simulate new data, called bootstrap data. The parameter of interest, such as the variance or the mean, is then computed for each bootstrap sample. It is a computer-based method, since a very large number of resamples must be drawn and a bootstrap estimate computed for each bootstrap sample. The availability of highly powerful computing machinery has made the use of the bootstrap more feasible in different applications, as well as allowed the sample sizes to be increased. The bootstrap method does not make any assumptions on the underlying distribution of the observed data: it is a nonparametric method and allows for characterizing the distribution of the observed data. This is particularly useful for non-Gaussian data where the distribution may not be symmetric or unimodal. An example of model order estimation in a sensor array signal processing application is provided in [27].
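A minimal sketch of the nonparametric bootstrap just described (assuming i.i.d. scalar data; Python/NumPy assumed): resample the observations with replacement, recompute the statistic of interest for each bootstrap sample, and read a confidence interval off the empirical percentiles.

import numpy as np

def bootstrap_ci(data, statistic, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a statistic of i.i.d. data."""
    rng = np.random.default_rng(seed)
    N = len(data)
    reps = np.array([statistic(rng.choice(data, size=N, replace=True))
                     for _ in range(n_boot)])
    return np.percentile(reps, [100 * alpha / 2, 100 * (1 - alpha / 2)])

x = np.random.default_rng(3).exponential(scale=2.0, size=200)   # skewed, non-Gaussian data
print("bootstrap 95% CI for the mean:", bootstrap_ci(x, np.mean))
print("bootstrap 95% CI for the variance:", bootstrap_ci(x, np.var))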

3.02.3 Methods based on statistical inference paradigms

In this section, we review the model selection methods based on the two main stream statistical inference paradigms (the Bayesian and likelihood principles). The widely used asymptotic approximation of the Bayesian Information Criterion (BIC) due to Schwarz [7] is reviewed first, followed by the GLRT-based model selection method, which is based on the sequential use of the generalized likelihood ratio test (GLRT). The latter method is one of the oldest approaches in model selection.

3.02.3.1 Bayesian Information Criterion (BIC)

We start by recalling the steps and assumptions that lead to the classic asymptotic approximation of BIC due to Schwarz [7], which can be stated as follows:

BIC = −2 × (maximum log-likelihood) + ln N × (number of estimable parameters),

so that BIC = −2ℓ(θ̂|y) + (ln N)K, and the hypothesis H_m (mth model) that minimizes the criterion is to be chosen. This approximation is based on a second-order Taylor expansion of the log posterior density and some regularity assumptions on the log-likelihood and log posterior. Recall from the earlier discussion on nested hypotheses that the term −2 × (maximum log-likelihood) monotonically decreases with increasing m (model index), and hence cannot be used alone, as it leads to the choice of the model with the largest number of parameters. The second term is the penalty term that increases with K = dim(θ|H_m) and hence penalizes the use of too many parameters.

In the Bayesian framework, the parameter vector is taken to be a random vector (r.v.) with a prior

density q_m(θ), θ ∈ Θ. For the mth model, let

g_m(θ|y) ≜ ℓ_m(θ|y) + ln q_m(θ)

denote the log posterior density (neglecting the irrelevant additive constant term). The full BIC is obtained by integrating out the nuisance variable θ in the posterior density of θ given the data y and the model m:

BIC(m) ≜ p_m ∫_Θ exp{g_m(θ|y)} dθ.      (2.3)


The mth model that maximizes BIC(m) over m = 1, ..., M is selected for the data. In most cases, the integration cannot be solved in closed form and approximations are sought.

In the Bayesian approach, the maximum a posteriori (MAP) estimate θ̂_B ≜ arg max_{θ∈Θ} g_m(θ|y) is a natural point estimate of θ, and we assume that it is unique. An approximation of g_m(·|y) around the MAP estimate based on a second-order Taylor expansion yields

g_m(θ|y) ≈ g_m(θ̂_B|y) − (1/2)(θ − θ̂_B)^T Ĥ_m (θ − θ̂_B),   where   Ĥ_m ≜ − ∂²g_m(θ|y) / ∂θ∂θ^T |_{θ=θ̂_B}.

Note that the first-order term vanishes when evaluated at the maximum. Let us assume that g_m is twice continuously differentiable (g_m ∈ C²(Θ)). This means that its second-order partial derivatives w.r.t. θ exist on Θ and that the resulting K × K (negative) Hessian matrix Ĥ_m is symmetric. Substituting the approximation into (2.3) yields

BIC(m) ≈ p_m exp{g_m(θ̂_B|y)} ∫_Θ exp{ −(1/2)(θ − θ̂_B)^T Ĥ_m (θ − θ̂_B) } dθ
       = p_m exp{g_m(θ̂_B|y)} (2π)^{K/2} |Ĥ_m|^{−1/2},

where the last equality follows by noticing that the integrand is the density of the K-variate N_K(θ̂_B, Ĥ_m^{−1}) distribution up to the constant term (2π)^{−K/2}|Ĥ_m|^{1/2}, where |·| denotes the matrix determinant. Equivalently, we can work with the logarithm of BIC. In logarithmic form, the BIC becomes

BIC(m) ≈ ln p_m + g_m(θ̂_B|y) + (K/2) ln(2π) − (1/2) ln|Ĥ_m|.      (2.4)

An alternative approximation for the BIC can be derived with the following assumption:

(A) the prior distribution q_m(θ) is sufficiently flat around the MAP estimate θ̂_B and does not depend on N for m = 1, ..., M.

Due to (A), when evaluating the integral in (2.3), we may extract the q_m(θ) term from the integrand as a constant proportional to q_m(θ̂_B), yielding the approximation

BIC(m) ≈ p_m q_m(θ̂_B) ∫_Θ exp{ℓ_m(θ|y)} dθ      (2.5)
       ≈ p_m q_m(θ̂_B) exp{ℓ_m(θ̂|y)} (2π)^{K/2} |Ĵ_m|^{−1/2},      (2.6)

where Ĵ_m is the observed Fisher information matrix of the mth model evaluated at the unique MLE θ̂. Note that (2.6) is an approximation of (2.5) using (similarly as earlier) a second-order Taylor expansion of ℓ_m(θ|y) around the MLE. Thus in terms of logarithms, we have

BIC(m) ≈ ln p_m + ln q_m(θ̂_B) + (K/2) ln(2π) + ℓ_m(θ̂|y) − (1/2) ln|Ĵ_m|.      (2.7)

Note that approximations (2.4) and (2.7) differ, as the former (resp. the latter) is obtained by expanding the full log posterior (resp. the log-likelihood) to second-order terms. Furthermore, (2.7) relies upon (A).


Next, note that under some regularity conditions and due to consistency of the MLE θ̂, we can in most cases approximate ln|Ĵ_m| for a sufficiently large N as

ln|Ĵ_m| ≈ K ln N.      (2.8)

Since the first three terms in (2.7) are independent of N [recall assumption (A)], we may neglect them for large N, yielding, after multiplying by −2, the earlier stated classic approximation due to Schwarz,

BIC(m) = −2 ℓ_m(θ̂|y) + K ln N.      (2.9)

Let us recall the assumptions behind the above result. First of all, the log-likelihood function ℓ ought to be a C²-function in a neighborhood of the unique MLE θ̂. Second, the prior density is assumed to verify assumption (A). Third, sufficient regularity conditions from likelihood theory, e.g., consistency of θ̂, are needed for the approximation (2.8) to hold. Note that the classic BIC approximation, in contrast to approximations (2.4) or (2.7), is based on asymptotic arguments, i.e., N is assumed to be sufficiently large. Naturally, how large an N is required depends on the application at hand. The benefit of such an approximation is its simplicity and the good performance observed in many applications. As we shall see in Section 3.02.4, Rissanen's MDL, derived using arguments from coding theory, can be asymptotically approximated in the same form as BIC in (2.9). This sometimes leads to the misunderstanding that MDL and BIC are the same.
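The practical difference between the AIC penalty 2K and the BIC penalty K ln N is easy to see in a simple order selection task. The sketch below (an illustrative polynomial regression example, not from the chapter; Python/NumPy assumed) evaluates both criteria for polynomial orders 0-7 when the true order is 2; for moderate to large N the BIC penalty is markedly heavier.

import numpy as np

rng = np.random.default_rng(4)
N = 200
t = np.linspace(-1, 1, N)
y = 1.0 + 2.0 * t - 1.5 * t**2 + 0.3 * rng.standard_normal(N)   # true polynomial order is 2

def criteria(order):
    """Return (AIC, BIC) for an LS polynomial fit of the given order."""
    X = np.vander(t, order + 1, increasing=True)       # columns 1, t, ..., t^order
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma2 = np.mean((y - X @ beta) ** 2)
    K = order + 2                                       # polynomial coefficients + noise variance
    neg2ll = N * np.log(sigma2)                         # -2 x max-log-lik up to a constant
    return neg2ll + 2 * K, neg2ll + np.log(N) * K

aic = {p: criteria(p)[0] for p in range(8)}
bic = {p: criteria(p)[1] for p in range(8)}
print("AIC selects order", min(aic, key=aic.get), "; BIC selects order", min(bic, key=bic.get))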

If we assume that the set of hypotheses {H_m} includes the true model H_m that generated the data, then the probability that the true mth model is selected can often be calculated analytically. For example, in the regression setting, the BIC estimate has been proven to be consistent [28], that is, P("correctly choosing H_m") → 1 as N → ∞. This can be contrasted with the AIC estimate, which fails to be consistent due to a tendency to overfit [28].

3.02.3.2 GLRT-based sequential hypothesis testing

Let us assume that the model p.d.f.'s f_m(y|θ), θ ∈ Θ_m, belong to the same parametric family (i.e., share the same functional form, so f = f_m for m = 1, ..., M) and are ordered in the sense that Θ_m ⊂ Θ_{m+1}. Hence, we have a nested set of hypotheses H_1 ⊂ H_2 ⊂ ··· ⊂ H_M. We can consider H_M as the "full model" and H_m, m < M, a parsimonious model presuming that a subset of the parameters of the full model are equal to zero. This type of situation arises in many engineering applications; a practical application is given in Section 3.02.5. In such a scenario, the GLRT statistic is defined as −2 times the log of the ratio of the likelihood functions maximized under H_m and under the full model H_M:

Λ_m = −2{ℓ(θ̂_m|y) − ℓ(θ̂_M|y)} = −2 ln( max_{θ∈Θ_m} f(y|θ) / max_{θ∈Θ_M} f(y|θ) ),      (2.10)

where θ̂_m = arg max_{θ∈Θ_m} ℓ(θ|y), θ̂_M = arg max_{θ∈Θ_M} ℓ(θ|y), and ℓ(θ|y) = ln f(y|θ) denotes the log-likelihood. By general maximum likelihood theory, Λ_m asymptotically has a χ²_{DOF(m)} distribution (i.e., a chi-squared distribution with DOF(m) degrees of freedom), where

DOF(m) = dim(θ|H_M) − dim(θ|H_m).


The null hypothesis H_m is rejected if Λ_m exceeds a rejection threshold γ_m, where γ_m is the (1 − α)th quantile of the χ²_{DOF(m)} distribution (i.e., P(Λ_m > γ_m) = α when Λ_m ∼ χ²_{DOF(m)}). Such a detector has an asymptotic probability of false alarm (PFA) equal to α. For example, choosing PFA equal to α = 0.05 is quite reasonable for many applications; however, in radar applications, a smaller PFA is often desired.

The sequential GLRT-based model selection procedure chooses the first hypothesis H_m (and hence dim(θ|H_m) as the model order) that is not rejected by the GLRT statistic. In other words, we first test H_1 and, if it is rejected by the GLRT statistic (2.10) (Λ_1 > γ_1 for some predetermined PFA α), we then test H_2, etc., until acceptance, i.e., until Λ_m ≤ γ_m. In case all hypotheses are rejected, we accept the full model H_M. Note that the above decision sequence is not statistically independent, even asymptotically. Hence, even though the asymptotic PFA of each individual GLRT for testing H_m is α, it does not mean that the asymptotic PFA of the GLRT-based model selection is α. It is also possible to use a backward testing strategy, i.e., to test the sequence of hypotheses H_1, ..., H_{M−2}, H_{M−1} from the reverse direction; then we select the model in the sequence before the first rejection. That is, we first test H_{M−1} and, if it is accepted, then test H_{M−2}. Continue until the first rejection and select the hypothesis tested prior to the rejection. If it is known a priori that a large model order is more likely than a small model order, then the backward strategy might be preferred from a purely computational point of view.
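A schematic sketch of the forward sequential procedure (Python assumed; loglik and dim are hypothetical user-supplied callables returning the maximized log-likelihood and parameter count under each hypothesis) using the asymptotic chi-squared threshold:

from scipy.stats import chi2

def sequential_glrt(loglik, dim, M, alpha=0.05):
    """Forward sequential GLRT model selection.

    loglik(m): maximized log-likelihood under hypothesis H_m (m = 1..M)
    dim(m):    number of estimable parameters under H_m
    Returns the first m whose GLRT against the full model H_M is not rejected.
    """
    ll_full, dim_full = loglik(M), dim(M)
    for m in range(1, M):
        lam = -2.0 * (loglik(m) - ll_full)        # GLRT statistic (2.10)
        dof = dim_full - dim(m)
        gamma = chi2.ppf(1.0 - alpha, dof)        # (1 - alpha) quantile of chi^2 with DOF(m) degrees of freedom
        if lam <= gamma:                          # H_m not rejected: accept it
            return m
    return M                                      # all rejected: keep the full model

The backward variant would simply traverse m = M−1, M−2, ... and return the hypothesis tested before the first rejection.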

3.02.4 Information and coding theory based methods

Information and coding theory based approaches to model order selection are frequently used. We will consider two such techniques: the information criterion proposed by Akaike [8,9] and the Minimum Description Length (MDL) methods proposed by Rissanen [10,11]. Enhanced versions of both methods have been introduced in order to provide more accurate results for finite sample sizes and, in some cases, to relax unnecessary assumptions on the underlying distribution models.

3.02.4.1 Akaike Information Criterion (AIC)

The AIC criterion is based on an asymptotic approximation of the K-L divergence. Let f_0(y) denote the true p.d.f. of the data and let f(y|θ) denote the approximating model. The Kullback-Leibler (K-L) divergence (or relative entropy) can be defined as the integral (as f and f_0 are continuous functions)

D(f_0‖f) = E_0[ ln( f_0(y) / f(y|θ) ) ] = ∫ f_0(y) ln( f_0(y) / f(y|θ) ) dy

and it can be viewed as the information loss when the model f is used to approximate f_0, or as a "distance" (discrepancy) between f and f_0, since it verifies

D(f_0‖f) ≥ 0 with equality if and only if f_0 = f.

Clearly, among the models in the set of hypotheses H_m: y ∼ f_m(y|θ_m), the best one loses the least information (has the smallest value of D(f_0‖f_m)) relative to the other models. Note that the K-L divergence can be expressed as

D(f_0‖f) = E_0[ln f_0(y)] − E_0[ln f(y|θ)],


where the model-dependent part is −E_0[ln f(y|θ)]. Hence, finding the minimum of D(f_0‖f) over the set of models considered is equivalent to finding the maximum of the function

I(f_0, f) = E_0[ln f(y|θ)] = E_0[ℓ(θ|y)]

among all the models. The function I(f_0, f), sometimes referred to as the relative K-L information, cannot be used directly in model selection because it requires full knowledge of the true data p.d.f. f_0 and of the true value of the parameter θ in the approximating model f(y|θ), both of which are unknown. Hence we settle for its estimate

Î(f_0, f) = E_y[ E_w[ ln f(w|θ̂(y)) ] ],      (2.11)

where w is an (imaginary) independent data vector from the same unknown true distribution f_0 as the observed data y, and θ̂ = θ̂(y) denotes the MLE of θ based on the assumed model f and the data y. Note that the statistical expectations above are w.r.t. the true (unknown) p.d.f. f_0, and we maximize (2.11) instead of I(f_0, f). The key finding is that for a large sample size N, the maximized log-likelihood ℓ(θ̂|y) = ln f(y|θ̂) is a biased estimate of (2.11), where the bias is approximately equal to K, the number of estimable parameters in the model. The maximum log-likelihood corrected by the asymptotic bias, i.e., ℓ(θ̂|y) − K, can be viewed (under regularity conditions) as an unbiased estimator of I(f_0, f) when N is large. Multiplying by −2 gives the famous Akaike Information Criterion

AIC = −2ℓ(θ̂|y) + 2K.

Unlike the BIC, the AIC is not consistent; namely, the probability of correctly identifying the true model does not converge to one with the sample length. It can be shown that for nested hypotheses and under fairly general conditions,

P("choosing H_k ⊂ H_true") → 0,
P("choosing H_k ⊃ H_true") → c > 0 as N → ∞,

that is, the AIC tends to overfit, i.e., to choose a model order that is larger than the true model order. See, e.g., [15].

Some enhanced versions have been introduced in which the penalty term is larger in order to reduce the chance of overfitting. The corrected AIC rule [29,30]

AICc = −2ℓ(θ̂|y) + 2K + 2K(K + 1)/(N − K − 1)      (2.12)
     = −2ℓ(θ̂|y) + C_N · 2K,   where C_N = N/(N − K − 1),      (2.13)

provides a finite-sample (second-order bias correction) adjustment to AIC; asymptotically they are the same, since C_N → 1 when N → ∞, but C_N > 1 for finite N. Consequently the penalty term of AICc is always larger than that of AIC, which means that the criterion is less likely to overfit. On the other hand, this comes with an increased risk of underfitting, i.e., choosing a smaller model order than the true one. Since the underfitting risk does not seem to be unduly large in many practical experiments, many researchers recommend the use of AICc instead of AIC. It should be noted, however, that AICc possesses


a sound theoretical justification only in the regression model with normal errors, where it is an unbiased estimate of the relative K-L information; note that AIC is by its construction only an asymptotically unbiased estimate of the relative K-L information. Beyond the regression model, the reasoning for C_N rests merely on its property of reducing the risk of overfitting. The so-called generalized information criterion (GIC) [2], defined as in (2.13) by

GIC = −2ℓ(θ̂|y) + C'_N · 2K,

can often outperform AIC for C'_N > 1. For example, the choice C'_N = C_N gives the AICc, and the choice C'_N = (1/2) ln N yields the MDL criterion.
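Since AIC, AICc, and the MDL/BIC-type rule differ only in the multiplier C'_N, a single routine suffices to evaluate all of them for a fitted model; the sketch below (hypothetical helper, Python assumed, with made-up log-likelihood values in the usage lines) illustrates the idea.

import numpy as np

def gic(max_loglik, K, N, variant="aic"):
    """Generalized information criterion -2*loglik + C * 2K for a fitted model."""
    if variant == "aic":
        c = 1.0
    elif variant == "aicc":
        c = N / (N - K - 1)          # corrected-AIC multiplier C_N
    elif variant == "mdl":           # C = (1/2) ln N, i.e., the BIC/MDL-type penalty K ln N
        c = 0.5 * np.log(N)
    else:
        raise ValueError(variant)
    return -2.0 * max_loglik + c * 2 * K

# compare two candidate models (log-likelihood values are made up for illustration)
for variant in ("aic", "aicc", "mdl"):
    print(variant, gic(-120.3, K=4, N=50, variant=variant), gic(-118.9, K=7, N=50, variant=variant))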

3.02.4.2 Minimum Description Length

The Minimum Description Length [11] method stems from algorithmic information and coding theory. The basic idea is to find any regularity in the observed data that can be used to compress the data. This allows for condensing the data so that fewer symbols are needed to represent it. The code used for compressing the data should be such that the total length (in bits) of the description of the code and the description of the data is minimal. There exist a simple and a refined version of the MDL method. The concept of stochastic complexity will be employed in estimating the model order. It defines the code length of the data with respect to the employed probability distribution model. It also establishes a connection to statistical model order estimation methods such as BIC. In some special cases, the approximation of the stochastic complexity and the BIC lead to the same criterion. Detailed descriptions of the MDL method can be found in [1,4].

The MDL method does not make any assumptions on the underlying distribution of the data. In fact, the distribution is considered less important, since the task at hand is not to estimate the distribution that generated the data but rather to extract useful information or a model from the data that facilitates compressing it efficiently. However, codes used for compressing data and probability mass functions are related. The Kraft inequality establishes a correspondence between code lengths and probabilities, whereas the information inequality relates the optimal code, in terms of the minimum expected code length, to the distribution of the data. If the data obey the distribution f(y), then a code with length − log f(y) achieves the minimum code length on average.

A simple version of the MDL principle may now be stated as follows. Let 𝓗 be a class of candidate models or codes used to explain the data y. The best model H ∈ 𝓗 is the one that minimizes

L(H) + L(y|H),

i.e., the sum of the length of the description of the model and the length of the data encoded using the model. A trade-off between the complexity of the code and the complexity of the data is thus built into the MDL method. Hence, it is less likely to overestimate the model order and overfit the data than, for example, the AIC.

A probability distribution in which the parameter θ may vary defines a family of distributions. The task of interest is to encode the data optimally and to use the whole family of distributions in order to make the process efficient. An obvious choice from the distribution family would be to use the code corresponding to f(y|θ̂), where θ̂ is the MLE of θ for the observed data, i.e., the θ that most likely would have produced the data set. Consequently, the code corresponding to the maximum likelihood (or the probability model) depends on the data, and the code cannot be specified before observing the data.


This makes the approach less practical. One could avoid this problem by using a universal distribution model that is representative of the entire family of distributions and would allow coding the data set almost as efficiently as the code associated with the maximized likelihood function. The refined MDL uses the normalized maximum likelihood approach for finding a universal distribution model used to assist the coding. This leads to an excess code length that is often called the regret. The Kullback-Leibler divergence D(p‖f) may be used as a starting point, where p is employed to approximate the distribution f. Finding the optimal universal distribution p* can be formulated as a minimax problem:

p* = arg min_p max_q [ ln( f(y|θ̂) / p(y) ) ],

where q ranges over the set of all feasible distributions. The distribution q(y) represents the worst-case approximation of the maximum likelihood code, and it is allowed to be almost any distribution within a model class of feasible distributions. Hence, the assumption on the true distribution generating the data may be considered relaxed. The so-called normalized maximum likelihood (NML) distribution satisfies the above optimization problem. It may be written as [31]

p*(y) = f(y|θ̂) / ∫ f(Y|θ̂) dY,

where θ̂ denotes the maximum likelihood estimate of θ for the corresponding observation set. The denominator is a normalizing factor, the sum of the maximized likelihoods over all potential observation sets Y acquired in experiments, and the numerator represents the maximized likelihood value. With this solution, the worst-case data regret is minimized. It has been further shown that by solving the related minimax problem

p* = arg min_p max_q E_q[ ln( f(y|θ̂) / p(y) ) ],

the worst case of the expected regret is considered instead of the regret for the observed data set; see [32]. The code length associated with the normalized maximum likelihood is called the stochastic complexity of the data for a distribution family. It minimizes the worst-case expected regret

E_q[ ln( f(y|θ̂) / p*(y) ) ].

The MDL method first employs the normalized maximum likelihood distributions to order the candidate models and then chooses the model that gives the minimal stochastic complexity. An early result by Rissanen [11] and a frequently used approximation for the stochastic complexity is given as follows:

MDL(K) = −2 log f(y|θ̂) + K log N,

which is of the same form as the approximation of BIC by Schwarz given earlier. Other, more accurate approximations of the stochastic complexity avoid some pitfalls of the above approximation, for example

MDL(K) = −2 log f(y|θ̂) + K log N + log ∫_Θ √|I(θ)| dθ + o(1),


where I(θ) denotes the Fisher information matrix for a sample of size 1. The term o(1) contains the higher-order terms and vanishes as N → ∞. The integral term involving the Fisher information needs to be finite. See [33] for more detail on other approximations and their computation. The MDL principle and the normalized maximum likelihood (NML) lead to the same expression as the Bayesian Information Criterion (BIC) in some special cases, for example, when Jeffreys' noninformative prior is used in a Bayesian model and N → ∞. In general, however, the NML and BIC have different formulations. The MDL has been shown to provide consistent estimates of the model order, e.g., for estimating the dimensionality of the signal subspace in the presence of white noise [34]. This problem is reviewed in detail in the next section.
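For discrete data the NML normalizer can be computed exactly, which makes the stochastic complexity concrete. The sketch below (an illustrative Bernoulli example, not from the chapter; Python assumed) sums the maximized likelihood over all possible sequences of N coin flips, so that the stochastic complexity of an observed sequence is its maximized code length plus the logarithm of that sum (the parametric complexity).

import numpy as np
from math import comb, log

def bernoulli_parametric_complexity(N):
    """ln of the NML normalizer: the sum of maximized likelihoods over all data sets of length N."""
    total = 0.0
    for k in range(N + 1):
        p_hat = k / N
        # maximized likelihood of any particular sequence containing k ones
        ml = (p_hat ** k) * ((1 - p_hat) ** (N - k)) if 0 < k < N else 1.0
        total += comb(N, k) * ml                 # number of such sequences times their likelihood
    return log(total)

def stochastic_complexity(y):
    """Code length -ln f(y|theta_hat) + ln C_N of a binary sequence under the NML distribution."""
    y = np.asarray(y)
    N, k = len(y), int(y.sum())
    p_hat = k / N
    ml = (p_hat ** k) * ((1 - p_hat) ** (N - k)) if 0 < k < N else 1.0
    return -log(ml) + bernoulli_parametric_complexity(N)

y = np.random.default_rng(5).integers(0, 2, size=40)
print("stochastic complexity (nats):", stochastic_complexity(y))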

3.02.5 Example: estimating the number of signals in subspace methods

We consider the problem of estimating the dimensionality of the signal subspace. Such a problem commonly arises, for example, in array processing, time-delay estimation, frequency estimation, and channel order estimation. In these applications, if the signals are not coherent and the signals are narrowband, the dimension of the signal subspace defines the number of signals.

We consider the model where the received data y ∈ C^M consists of an (unobservable) random zero-mean signal vector v ∈ C^M plus additive zero-mean white noise n ∈ C^M (E[nn^H] = σ²I with σ² > 0 unknown) uncorrelated with the signal v, which is assumed to lie in an m-dimensional subspace of C^M (m ≤ M), called the signal subspace. The covariance matrix of the data y = v + n is then Σ = E[yy^H] = Ψ + σ²I, where the signal covariance matrix Ψ = E[vv^H] has rank(Ψ) = m ≤ M. Let λ_1 ≥ λ_2 ≥ ··· ≥ λ_M ≥ 0 denote the ordered eigenvalues of Σ. If rank(Ψ) = m, then the smallest eigenvalue λ_M equals σ² and has multiplicity M − m, whereas the m largest eigenvalues are equal to the nonzero eigenvalues of Ψ plus σ². Hence, testing that the signal subspace has dimensionality m (m ∈ {0, 1, ..., M − 1}) can be formulated as a hypothesis of the equality of the eigenvalues

H_m: λ_{m+1} = λ_{m+2} = ··· = λ_M      (2.14)

and a GLRT can be formed against the general alternative H_M: Σ arbitrary. Thus, we have a set of hypotheses {H_m, m = 0, 1, ..., M} and a model selection method can be used to determine the best model given the data. To this end, note that H_0 corresponds to the noise-only hypothesis. Let us assume that v and n possess M-variate circular complex normal (CN) distributions, in which case the received data y has a zero-mean CN distribution with covariance matrix Σ. Further suppose that an i.i.d. random sample y_1, ..., y_N is observed from y ∼ CN_M(0, Σ).

For the above problem, the GLRT-based sequential test and the MDL and AIC criteria have been widely used in sensor array signal processing and subspace estimation methods; see, e.g., [15–17], where the task at hand is to determine the number of signals m impinging on the sensor array. In this case, the signal v has the representation v = a(ϑ_1)s_1 + ··· + a(ϑ_m)s_m, where a(ϑ_i) denotes the array response (steering vector) for a random source signal s_i impinging on the array from the direction-of-arrival (DOA) ϑ_i (i = 1, ..., m). In this application, the sample y_1, ..., y_N represents the array snapshots at time instants t_1, ..., t_N and the SCM Σ̂ is the conventional estimate of the array covariance matrix Σ, while the signal covariance matrix Ψ has the representation Ψ = A E[ss^H] A^H, where A = (a(ϑ_1) ··· a(ϑ_m)) denotes the array


steering matrix and s = (s_1, ..., s_m)^T the vector of source signals. Another type of problem, determining the degree of noncircularity of complex random signals, is illustrated in detail in [35].

Let us proceed by first constructing the sequential GLRT-based model order selection procedure for

the problem at hand. The p.d.f. of y ∼ CN_M(0, Σ) is f(y|Σ) = π^{−M} |Σ|^{−1} exp(−y^H Σ^{−1} y), and thus the log-likelihood function of the sample is

ℓ(θ) = ∑_{i=1}^N ln f(y_i|Σ) ∝ −N ln|Σ| − N Tr(Σ^{−1} Σ̂),

where Σ̂ = (1/N) ∑_{i=1}^N y_i y_i^H denotes the sample covariance matrix (SCM), Tr(·) denotes the matrix trace, and θ denotes the vector consisting of the functionally independent real-valued parameters in Σ. Naturally, θ has a different parameter space Θ_m under each hypothesis H_m and hence also a different dimension dim(θ|H_m). Under the no-structure hypothesis H_M, dim(θ|H_M) = M². This follows as Σ is Hermitian and hence there are M(M − 1) estimable parameters due to Re(Σ_ij) and Im(Σ_ij) for i < j ∈ {1, ..., M}, and M estimable parameters Σ_ii ≥ 0, i = 1, ..., M. For H_m, where m < M, K is the number of functionally independent real parameters in the rank-m signal covariance matrix Ψ plus 1 due to σ² > 0. The eigenvalue decomposition of Ψ is Ψ = ∑_{i=1}^m ψ_i u_i u_i^H, where ψ_i = λ_i − σ² are its nonzero eigenvalues and the u_i are the corresponding eigenvectors. There are 2Mm real parameters in the eigenvectors, but they are tied by 2m constraints due to the scale and phase indeterminacy of each eigenvector (the scale indeterminacy is fixed by assuming u_i^H u_i = 1, and the phase of u_i can be fixed similarly by any suitable convention) and by an additional m(m − 1) constraints due to the orthogonality of the eigenvectors, u_i^H u_j = 0 + j0 for i < j ∈ {1, ..., m}. Since there are m eigenvalues ψ_i and σ², the total number of estimable parameters in θ is

dim(θ|H_m) = 2Mm + m + 1 − 2m − m(m − 1) = m(2M − m) + 1      (2.15)

when m < M.

Let us then derive the GLRT statistic Λ_m as defined in (2.10). Under H_m for m ≤ M, the maximum log-likelihood is

ℓ(θ̂_m) = max_{θ∈Θ_m} ℓ(θ) = −N ln ∏_{i=1}^m λ̂_i − N(M − m) ln σ̂² − NM,      (2.16)

where λ̂_1, λ̂_2, ..., λ̂_M denote the eigenvalues of the SCM Σ̂ and σ̂² = (1/(M − m)) ∑_{i=m+1}^M λ̂_i. The proof of (2.16) follows similarly as in Anderson [36, Theorem 2] for the real case. Given the expression (2.16), it is easy to verify that the GLRT statistic Λ_m = −2{ℓ(θ̂_m) − ℓ(θ̂_M)} simplifies into the form

Λ_m = −2N(M − m) ln ρ(m),      (2.17)

where

ρ(m) = (λ̂_{m+1} λ̂_{m+2} ··· λ̂_M)^{1/(M−m)} / σ̂²      (2.18)

is the ratio of the geometric to the arithmetic mean of the noise subspace eigenvalues of the SCM. If H_m holds, this ratio tends to unity as N → ∞. Now recall from Section 3.02.3.2 that the GLRT statistic

Λ_m in (2.17) has a limiting χ²_{DOF(m)} distribution, where

DOF(m) = dim(θ|H_M) − dim(θ|H_m) = M² − {m(2M − m) + 1} = (M − m)² − 1.

The (forward) GLRT-based model selection procedure then chooses the first H_m not rejected by the GLRT, i.e., the first m which verifies Λ_m ≤ γ_m, where (as discussed in Section 3.02.3.2) the detection threshold γ_m can be set as the (1 − α)th quantile of the χ²_{DOF(m)} distribution. This choice yields the asymptotic PFA α for testing hypothesis (2.14).

Let us now derive the AIC and MDL criteria. First note that the −NM term in (2.16) can be dropped, as it does not depend on m, whereas any constant term (namely, c = N ln ∏_{i=1}^M λ̂_i) not dependent on m can be added; this shows that ℓ(θ̂_m) in (2.16) can be written compactly as N(M − m) ln ρ(m) up to an irrelevant additive constant not dependent on m. Further noting that the +1 can be dropped from the penalty term dim(θ|H_m) in (2.15), we obtain the AIC and MDL criteria in the following forms:

AIC(m) = −2N(M − m) ln ρ(m) + 2m(2M − m),
MDL(m) = −2N(M − m) ln ρ(m) + m(2M − m) ln N.

An estimate of the AIC/MDL model order, i.e., of the dimensionality m of the signal subspace, is found by choosing the m ∈ {0, 1, ..., M − 1} that minimizes the AIC/MDL criterion above. As already pointed out, the penalty term of the MDL is (1/2) ln N times the penalty term of the AIC. Since (1/2) ln N > 1 already at small sample lengths (N ≥ 8), the MDL method is less prone to overfitting the data than the AIC. Note that the first term in the MDL and AIC criteria above is simply the GLRT statistic Λ_m. Wax and Kailath [15] showed that the MDL rule is consistent in choosing the true dimensionality, whereas the AIC tends to overestimate the dimensionality (in the large sample limit). Later, [34] proved the consistency of the MDL rule when the CN assumption does not hold.
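The AIC/MDL rules above depend on the data only through the SCM eigenvalues, so they are straightforward to implement. The following sketch (Python/NumPy assumed; the array geometry, DOAs, and SNR are made-up values for illustration) simulates N snapshots of two uncorrelated narrowband sources in white noise received by an M-element uniform linear array, forms the SCM, and picks the m minimizing AIC(m) and MDL(m).

import numpy as np

def aic_mdl_num_signals(eigvals, N):
    """AIC(m), MDL(m), m = 0..M-1, from the SCM eigenvalues (criteria of the form derived above)."""
    M = len(eigvals)
    lam = np.sort(eigvals)[::-1]                    # descending order
    aic, mdl = np.zeros(M), np.zeros(M)
    for m in range(M):
        noise = lam[m:]
        # ln of (geometric mean / arithmetic mean) of the noise-subspace eigenvalues, i.e. ln rho(m)
        log_rho = np.mean(np.log(noise)) - np.log(np.mean(noise))
        ll = N * (M - m) * log_rho                  # compact maximized log-likelihood
        aic[m] = -2 * ll + 2 * m * (2 * M - m)
        mdl[m] = -2 * ll + m * (2 * M - m) * np.log(N)
    return aic, mdl

# simulate two sources impinging on an M-sensor half-wavelength ULA (illustrative values)
rng = np.random.default_rng(6)
M, N, doas_deg, snr_db = 8, 500, [-10.0, 20.0], 5.0
d = np.arange(M)[:, None]
A = np.exp(1j * np.pi * d * np.sin(np.deg2rad(doas_deg)))        # M x 2 steering matrix
s = np.sqrt(10 ** (snr_db / 10) / 2) * (rng.standard_normal((2, N)) + 1j * rng.standard_normal((2, N)))
n = np.sqrt(0.5) * (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N)))
Y = A @ s + n
scm = (Y @ Y.conj().T) / N
aic, mdl = aic_mdl_num_signals(np.linalg.eigvalsh(scm).real, N)
print("AIC estimate of m:", int(np.argmin(aic)), " MDL estimate of m:", int(np.argmin(mdl)))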

3.02.6 Conclusions

In this chapter, we provided an overview of model order estimation methods. The majority of well-established methods stem from statistical inference or from coding and information theory. The likelihood function plays a crucial role in most methods that rest on a solid theoretical foundation. Statistical and information-theoretic methods stem from fundamentally different theoretical backgrounds but have been shown to lead to identical results under certain restrictive assumptions and model classes. A practical example of model order estimation was provided for the low-rank model that is commonly used in sensor array signal processing and many other tasks requiring high-resolution capability.

References

[1] P.D. Grünwald, The Minimum Description Length Principle, MIT Press, 2007.
[2] P. Stoica, Y. Selen, Model-order selection: a review of information criterion rules, IEEE Signal Process. Mag. 21 (4) (2004) 36–47.
[3] A.D. Lanterman, Schwarz, Wallace and Rissanen: intertwining themes in theories of model selection, Int. Stat. Rev. 69 (2) (2001) 185–212.
[4] P.D. Grünwald, I. Myung, M. Pitt (Eds.), Advances in Minimum Description Length: Theory and Applications, MIT Press, 2005.
[5] H. Linhart, W. Zucchini, Model Selection, Wiley, New York.
[6] K. Burnham, D. Anderson, Model Selection and Inference: A Practical Information-Theoretic Approach, Springer, New York, 1998.
[7] G. Schwarz, Estimating the dimension of a model, Ann. Stat. 6 (2) (1978) 461–464.
[8] H. Akaike, A new look at the statistical model identification, IEEE Trans. Automat. Control 19 (1974) 716–723.
[9] H. Akaike, On the likelihood of a time series model, The Statistician 27 (3) (1978) 217–235.
[10] J. Rissanen, Stochastic complexity and modeling, Ann. Stat. 14 (3) (1986) 1080–1100.
[11] J. Rissanen, Modeling by the shortest data description, Automatica 14 (1978) 465–471.
[12] S.M. Kay, Fundamentals of Statistical Signal Processing: Detection Theory, Prentice Hall, 1998.
[13] S. Kay, Exponentially embedded families—new approaches to model order estimation, IEEE Trans. Aerosp. Electron. Syst. 41 (1) (2005) 333–344.
[14] H. Bozdogan, Model selection and Akaike's information criterion (AIC): the general theory and its analytical extensions, Psychometrika 52 (3) (1987) 345–370.
[15] M. Wax, T. Kailath, Detection of signals by information theoretic criteria, IEEE Trans. Acoust. Speech Signal Process. 33 (2) (1985) 387–392.
[16] Q.T. Zhang, K.M. Wong, Information theoretic criteria for the determination of the number of signals in spatially correlated noise, IEEE Trans. Signal Process. 41 (4) (1993) 1652–1663.
[17] G. Xu, R.H. Roy III, T. Kailath, Detection of number of sources via exploitation of centro-symmetry property, IEEE Trans. Signal Process. 42 (1) (1994) 102–112.
[18] A.A. Shah, D.W. Tufts, Determination of the dimension of a signal subspace from short data records, IEEE Trans. Signal Process. 42 (9) (1994) 2531–2535.
[19] S. Weisberg, Applied Linear Regression, Wiley, New York, 1980.
[20] N.R. Draper, H. Smith, Applied Regression Analysis, third ed., Wiley, New York, 1998.
[21] E. Ronchetti, Robust model selection in regression, Stat. Prob. Lett. 3 (1985) 21–23.
[22] J.A. Tropp, A.C. Gilbert, Signal recovery from random measurements via orthogonal matching pursuit, IEEE Trans. Inform. Theory 53 (12) (2007) 4655–4666.
[23] R. Tibshirani, Regression shrinkage and selection via the lasso, J. Roy. Stat. Soc. Ser. B 58 (1996) 267–288.
[24] B. Efron, T. Hastie, I. Johnstone, R. Tibshirani, Least angle regression, Ann. Stat. 32 (2) (2004) 407–451.
[25] S. Arlot, A. Celisse, A survey of cross-validation procedures for model selection, Stat. Surv. 4 (2010) 40–79.
[26] B. Efron, Bootstrap methods: another look at the jackknife, Ann. Stat. 7 (1) (1979) 1–26.
[27] R. Brcich, A.M. Zoubir, P. Pelin, Detection of sources using bootstrap techniques, IEEE Trans. Signal Process. 50 (2) (2002) 206–215.
[28] R. Nishii, Asymptotic properties of criteria for selection of variables in multiple regression, Ann. Stat. 12 (1984) 758–765.
[29] J.E. Cavanaugh, Unifying the derivations for the Akaike and corrected Akaike information criteria, Stat. Prob. Lett. 33 (1997) 201–208.
[30] N. Sugiura, Further analysis of the data by Akaike's information criterion and the finite corrections, Commun. Stat. Theory Methods 7 (1978) 13–26.
[31] Y. Shtarkov, Universal sequential coding of single messages, Prob. Inform. Transmission 23 (3) (1987) 175–186.
[32] J. Myung, D. Navarro, M. Pitt, Model selection by normalized maximum likelihood, J. Math. Psychol. 50 (2006) 167–179.
[33] P. Kontkanen, W. Buntine, P. Myllymaki, J. Rissanen, H. Tirri, Efficient computation of stochastic complexity, in: Proceedings of the International Conference on Artificial Intelligence and Statistics, 2003, pp. 233–238.
[34] L.C. Zhao, P.R. Krishnaiah, Z.D. Bai, On detection of the number of signals in presence of white noise, J. Multivar. Anal. 20 (1) (1986) 1–25.
[35] M. Novey, E. Ollila, T. Adali, On testing the extent of noncircularity, IEEE Trans. Signal Process. 59 (11) (2011) 5632–5637.
[36] T.W. Anderson, Asymptotic theory for principal component analysis, Ann. Math. Stat. 34 (1) (1963) 122–148.