

Optimal experimental design and some related control problems ⋆

Luc Pronzato

Laboratoire I3S, CNRS-Université de Nice-Sophia Antipolis, France

Abstract

This paper traces the strong relations between experimental design and control, such as the use of optimal inputs to obtain precise parameter estimation in dynamical systems and the introduction of suitably designed perturbations in adaptive control. The mathematical background of optimal experimental design is briefly presented, and the role of experimental design in the asymptotic properties of estimators is emphasized. Although most of the paper concerns parametric models, some results are also presented for statistical learning and prediction with nonparametric models.

Key words: Parameter estimation; design of experiments; adaptive control; active control; active learning.

1 Introduction

The design of experiments (DOE) is a well developed methodology in statistics, to which several books have been dedicated, see e.g. [42], [167], [125], [4], [149], [44]. See also the series of proceedings of the Model-Oriented Design and Analysis workshops (Springer Verlag 1987; Physica Verlag, 1990, 1992, 1995, 1998, 2001, 2004). Its application to the construction of persistently exciting inputs for dynamical systems is well known in control theory (see Chapter 6 of [58], Chapter 14 of [104], Chapter 6 of [188], the book [196] and the recent surveys [53], [66]). A first objective of this paper is to briefly present the mathematical background of the methodology and make it accessible to a wider audience. DOE, which can be apprehended as a technique for extracting the most useful information from data to be collected, is thus a central (and sometimes hidden) methodology in every situation where unknown quantities must be estimated and the choice of a method for this estimation is open. DOE may therefore serve different purposes and happens to be a suitable vehicle for establishing links between problems like optimization, estimation, prediction and control. Hence, a second objective of the paper is to exhibit links and similarities between seemingly different issues (for instance, we shall see that parameter estimation and prediction of a model response are essentially equivalent problems for parametric models and that the construction of an optimal method for global optimization can be cast as a stochastic control problem). At the same time, attention will be drawn to fundamental differences that exist between seemingly similar problems (in particular, evidence will be given of the gap between using parametric or nonparametric models for prediction). A third objective is to point out and explain some inherent difficulties in estimation problems when combined with optimization or control (hence we shall see why adaptive control is an intrinsically difficult subject), indicate some tentative remedies and suggest possible developments.

⋆ This paper was not presented at any IFAC meeting. Corresponding author L. Pronzato. Tel. +33 (0)4 92942708. Fax +33 (0)4 92942896. Email address: pronzato@i3s.unice.fr (Luc Pronzato).

Mentioning these three objectives should not shroud the main message of the paper, which consists in pointing out prospective research directions for experimental design in relation to control, in short: classical DOE relies on the assumption of persistence of excitation, but many issues remain open in other situations; DOE should be driven by the final purpose of the identification (the intended model application of [57]) and this should be reflected in the construction of design criteria; DOE should face the new challenges raised by nonparametric models and robust control; algorithms and practical methods for DOE in non-standard situations are still missing. The program is rather ambitious, and this survey does not pretend to be exhaustive (for instance, only the case of scalar observations is considered; Bayesian techniques are only slightly touched upon; measurement errors are assumed to be independent, although correlated errors would deserve a special treatment; distributed parameter systems are



not considered; nonparametric modelling is briefly considered and for static systems only, etc.). However, references are indicated where a detailed enough presentation is lacking. None of the results presented is really new, but their collection in a single document probably is, and will hopefully be useful to the reader.

Section 2 presents different types of application of optimal experimental design, partly through examples, and serves as an introduction to the topic. In particular, the fourth application concerns optimization and forms a preliminary illustration of the link between sequential design and adaptive control. Section 3 concerns statistical learning and nonparametric modelling, where DOE is still at an early stage of development. The rest of the paper mainly deals with parametric models, for which parameter uncertainty is suitably characterized through information matrices, due to the asymptotic normality of parameter estimators and the Cramér-Rao bound. This is considered in Section 4 for regression models. Section 5 presents the mathematical background of optimal experimental design for parameter estimation. The case of dynamical models is considered in Section 6, where the input is designed to yield the most accurate estimation of the model parameters, while possibly taking a robust-control objective into account. Section 7 concerns adaptive control: the ultimate objective is process control, but the construction of the controller requires the estimation of the model parameters. The difficulties are illustrated through a series of simple examples. Optimal DOE yields input sequences that are optimally (and persistently) exciting. At the same time, by focussing attention on parameter estimation, it reveals the intrinsic difficulties of adaptive control through the links between dual (active) control and sequential design. General sequential design (for static systems) is briefly considered in Section 8. Finally, Section 9 suggests further developments and research directions in DOE, concerning in particular active learning and nonlinear feedback control. Here also the presentation is mainly through examples.

2 Examples of applications of DOE

Although the paper is mainly dedicated to parameter estimation issues, DOE may have quite different objectives (and it is indeed one of the purposes of the paper to use DOE to exhibit links relating these objectives). They are illustrated through examples which also serve to progressively introduce the notation. The first one concerns an extremely simple parameter estimation problem where the benefit of a suitably designed experiment is spectacular.

2.1 A weighing problem

Suppose we wish to determine the weights of eight objects with a chemical balance. The result y of a weighing (the observation) corresponds to the mass on the left pan of the balance minus the mass on the right pan, plus some measurement error ε. The errors associated with a series of measurements are assumed to be independently identically distributed (i.i.d.) with the normal distribution N(0, σ²). The objects have weights θ_i, i = 1, …, 8. Each weighing is characterized by an 8-dimensional vector u with components u_i equal to 1, −1 or 0 depending on whether object i is on the left pan, on the right pan, or absent from the weighing, and the associated observation is y = u⊤θ + ε. We thus have a linear model (in the statistical sense: the response is linear in the parameter vector θ), and the Least-Squares (LS) estimator θ̂^N associated with N observations y_k characterized by experimental conditions (design points¹) u_k, k = 1, …, N, is

\[
\hat\theta^N = \arg\min_\theta \sum_{k=1}^N [y_k - u_k^\top \theta]^2 = M_N^{-1} \sum_{k=1}^N y_k u_k\,, \quad (1)
\]
with
\[
M_N = \sum_{k=1}^N u_k u_k^\top\,. \quad (2)
\]

We consider two weighing methods. In method a the eight objects are weighed successively: the vectors u_i for the eight observations coincide with the basis vectors e_i of R⁸ and the observations are y_i = θ_i + ε_i, i = 1, …, 8. The estimated weights are simply given by the observations, that is, θ̂_i = y_i ∼ N(θ_i, σ²). Method b is slightly more sophisticated. Eight measurements are performed, each time using a different configuration of the objects on the two pans, so that the vectors u_i form an 8 × 8 Hadamard matrix (|u_{ij}| = 1 for all i, j and u_i⊤u_j = 0 for all i ≠ j, i, j = 1, …, 8). The estimates then satisfy θ̂_i ∼ N(θ_i, σ²/8) with 8 observations only. To obtain the same precision with method a, one would need to perform eight independent repetitions of the experiment, requiring 64 observations in total².

¹ Although design points and experimental variables are usually denoted by the letter x in the statistical literature, we shall use the letter u due to the attention given here to control problems. In this weighing example, u_k denotes the decisions made concerning the k-th observation, which already reveals the contiguity between experimental design and control.
² Note that we implicitly assumed that the range of the instrument allows all objects to be weighed simultaneously in method b. Also note that the gain from using method b would be smaller if the variance of the measurement errors increased with the total weight on the balance.

In a linear model of this type, the LS estimator (1) is unbiased: IE_θ{θ̂^N − θ} = 0, where IE_θ{·} denotes the mathematical expectation conditional on θ being the true vector of unknown parameters. Its covariance matrix is IE_θ{(θ̂^N − θ)(θ̂^N − θ)⊤} = σ² M_N⁻¹, with M_N given by (2) (note that it does not depend on θ). Choosing an experiment that provides a precise estimation of the parameters thus amounts to choosing N vectors u_k such that (M_N is nonsingular and) "M_N⁻¹ is as small as possible", in the sense that a scalar function of M_N⁻¹ is minimized (or a scalar function of M_N is maximized), see Section 5. In the weighing problem above the optimization problem is combinatorial, since u_{ki} ∈ {−1, 0, 1}. In the design of method b the vectors u_k optimize most "reasonable" criteria Φ(M_N), see, e.g., [29], [162]. This case will not be considered in the rest of the paper but corresponds to a topic that has a long and rich history (it originated in agriculture through the pioneering work of Fisher, see [46]).
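The gain of method b is easy to check numerically. The short sketch below is an illustration added here (it is not taken from the paper); it assumes numpy and scipy are available, sets σ² = 1 arbitrarily, builds the matrix M_N of (2) for both weighing methods and compares the resulting LS variances σ²[M_N⁻¹]_{ii}.

```python
# Compare the two weighing methods of Section 2.1: method a uses the canonical
# basis vectors as design vectors u_k, method b the rows of an 8x8 Hadamard matrix.
import numpy as np
from scipy.linalg import hadamard

sigma2 = 1.0                      # error variance (arbitrary illustrative value)
U_a = np.eye(8)                   # method a: one object per weighing
U_b = hadamard(8).astype(float)   # method b: entries +/-1, mutually orthogonal rows

for name, U in [("a", U_a), ("b", U_b)]:
    M_N = U.T @ U                               # information matrix (2)
    cov = sigma2 * np.linalg.inv(M_N)           # covariance of the LS estimator (1)
    print(f"method {name}: var of each estimated weight = {cov[0, 0]:.3f}")
# method a gives variance sigma2 = 1.000, method b gives sigma2/8 = 0.125,
# both with the same number (8) of observations.
```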

2.2 An example of parameter estimation in a dynamical model

The example is taken from [39] and concerns a so-called compartment model, widely used in pharmacokinetics. A drug is injected into the blood (intravenous infusion) with an input profile u(t); the drug moves from the central compartment C (blood) to the peripheral compartment P, and the respective quantities of drug at time t are denoted x_C(t) and x_P(t). These obey the following differential equations:

\[
\frac{dx_C(t)}{dt} = -(K_{EL} + K_{CP})\, x_C(t) + K_{PC}\, x_P(t) + u(t)\,, \qquad
\frac{dx_P(t)}{dt} = K_{CP}\, x_C(t) - K_{PC}\, x_P(t)\,,
\]

where K_CP, K_PC and K_EL are unknown parameters. One observes the drug concentration in blood, that is, y(t) = x_C(t)/V + ε(t) at time t, where the errors ε(t_i) corresponding to different observations are assumed to be i.i.d. N(0, σ²) and where V denotes the (unknown) volume of the central compartment. There are thus four unknown parameters to be estimated, which we denote θ = (K_CP, K_PC, K_EL, V). The profile of the input u(t) is imposed: it consists of a 1 min loading infusion of 75 mg/min followed by a continuous maintenance infusion of 1.45 mg/min. The experimental variables correspond to the sampling times t_i, 1 ≤ t_i ≤ 720 min (the time instants at which the observations (blood samples) are taken). Suppose that the true parameters take the values (0.066 min⁻¹, 0.038 min⁻¹, 0.0242 min⁻¹, 30 l). Two different experimental designs are considered. The first one, called "conventional", is given by t = (5, 10, 30, 60, 120, 180, 360, 720) (in min); the "optimal" one (D-optimal for θ, see Section 5.1) is t* = (1, 1, 10, 10, 74, 74, 720, 720) (in min). (Note that both designs contain 8 observations and that t* comprises repetitions of observations at the same time, which means that it is implicitly assumed that the collection of several simultaneous independent measurements is possible.) Figure 1 presents the (approximate) marginal density of the LS estimator of K_EL, see [129], [139], when σ = 0.2 µg/ml. Similar pictures are obtained for the other parameters.

Fig. 1. Approximate marginal densities for the LS estimator of K_EL (dashed line for the conventional design, solid line for the optimal one); the true value is K_EL = 0.0242 min⁻¹.

Clearly, the "optimal" design t* yields a much more precise estimation of θ than the conventional one, although both involve the same number of observations. On the other hand, with only 4 = dim(θ) distinct sampling times, t* does not permit testing the validity of the model. DOE for model discrimination, which we consider next, is especially indicated for situations where one hesitates between several structures.
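For concreteness, here is a minimal simulation sketch of the two-compartment model under the stated infusion profile (my own illustration, not code from [39]; it uses the parameter values quoted above, assumes scipy's ODE solver is available, and omits measurement noise).

```python
# Simulate the compartment model of Section 2.2 and evaluate the noise-free
# blood concentration y(t) = x_C(t)/V at the sampling times quoted in the text.
import numpy as np
from scipy.integrate import solve_ivp

K_CP, K_PC, K_EL, V = 0.066, 0.038, 0.0242, 30.0   # values used in the example

def u(t):
    # 1 min loading infusion of 75 mg/min, then maintenance infusion of 1.45 mg/min
    return 75.0 if t < 1.0 else 1.45

def rhs(t, x):
    xC, xP = x
    return [-(K_EL + K_CP) * xC + K_PC * xP + u(t),
            K_CP * xC - K_PC * xP]

t_conv = [5, 10, 30, 60, 120, 180, 360, 720]        # "conventional" design
t_opt = [1, 10, 74, 720]                            # distinct times of the D-optimal design
times = sorted(set(t_conv + t_opt))

sol = solve_ivp(rhs, (0.0, 720.0), [0.0, 0.0], t_eval=times,
                max_step=0.5)                       # small steps resolve the 1-min loading phase
for t, xC in zip(sol.t, sol.y[0]):
    print(f"t = {t:5.0f} min   concentration = {xC / V:6.3f} mg/l")
```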

2.3 Discrimination between model structures

Design for discrimination between model structures will not be detailed in the paper; only the basic principle of a simple method is indicated below, and one can refer to [17] and the survey papers [3], [65] for other approaches. When there are two model structures η⁽¹⁾(θ₁, u) and η⁽²⁾(θ₂, u) and the errors are i.i.d., a simple sequential procedure is as follows, see [5]:

• after the observation of y(u₁), …, y(u_k), estimate θ̂₁^k and θ̂₂^k for both models;
• place the next point u_{k+1} where [η⁽¹⁾(θ̂₁^k, u) − η⁽²⁾(θ̂₂^k, u)]² is maximum;
• set k → k + 1 and repeat.

When there are more than two structures in competition, one should estimate θ̂_i^k for all of them and place the next point using the two models with the best and second best fit, see [6]. The idea is to place the design point where the predictions of the competitors differ most, so that when one of the structures is correct (which is the underlying assumption), the next observation should be close to the prediction of that model and should thus give evidence that the other structures are wrong. Similar ideas can be used to design input sequences for detecting changes in the behavior of dynamical systems, see the book [82].
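As a minimal illustration of this selection rule (a sketch of my own, with two arbitrary rival structures rather than an example from [5] or [6]): given the current estimates, the next design point is simply the maximizer of the squared difference of the two fitted responses over a candidate grid.

```python
# Discrimination rule: place the next observation where the two fitted model
# structures disagree most (the structures below are illustrative choices).
import numpy as np

def eta1(theta, u):                    # structure 1: exponential decay
    return theta[0] * np.exp(-theta[1] * u)

def eta2(theta, u):                    # structure 2: rational response
    return theta[0] / (1.0 + theta[1] * u)

theta1_hat = np.array([1.0, 0.8])      # current estimate for structure 1 (illustrative)
theta2_hat = np.array([1.0, 1.1])      # current estimate for structure 2 (illustrative)

u_grid = np.linspace(0.0, 5.0, 501)    # candidate design region U
gap = (eta1(theta1_hat, u_grid) - eta2(theta2_hat, u_grid)) ** 2
u_next = u_grid[np.argmax(gap)]        # u_{k+1}: maximizer of the squared prediction gap
print("next design point:", u_next)
```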

2.4 Optimization of a model response

Suppose that one wishes to maximize a function η(θ, u) with respect to u ∈ R^d, with θ ∈ R^p a vector of unknown parameters. When a value u_i is proposed, the function is observed through y_i = y(u_i) = η(θ, u_i) + ε_i, with ε_i a measurement error. Since the problem is to determine u* = u*(θ) = arg max_u η(θ, u), it seems natural to first estimate θ̂ = θ̂[y] from a vector of observations y = [y₁, …, y_N]⊤ and then predict the optimum by u*(θ̂). The question is then which values to use for the u_i's for estimating θ, that is, which criterion to optimize for designing the experiment? It could be (i) based on the precision of θ̂, or (ii) based on the precision of u*(θ̂), or, preferably, (iii) oriented towards the final objective and based on the cost C(θ̂|θ) of using θ̂ when the true value of the parameters is θ. A possible choice is C(θ̂|θ) = η[θ, u*(θ)] − η[θ, u*(θ̂)] ≥ 0, which leads to a design that minimizes the Bayesian risk R = IE{C(θ̂[y]|θ)}, where the expectation is with respect to y and θ, for which a prior distribution π(·) is assumed, see, e.g., [144] (see also [27] and the book [132] for a review of Bayesian DOE).

The approaches (i)-(iii) above are standard in experimental design: optimization is performed in two steps; first some design points u_i are selected for estimation, second θ is estimated and used to construct u*(θ̂). However, in general each response η(θ, u_i) is far from the maximum η[θ, u*(θ)] (since the explicit objective of the design is estimation, not maximization), while in some situations it is required to have η(θ, u_i) as large as possible for every i, that is, u_i close to u*(θ), which is unknown. A sequential approach is then natural: try u_i, observe y_i, estimate θ̂^i = θ̂(y_1^i), suggest u_{i+1}, and so on. (Notice that this involves a feedback of information in the sequence of design points, the control sequence, and thus induces a dynamical aspect although the initial problem is purely static.) Each u_i has two objectives: help to estimate θ, and try to maximize η(θ, u). The design problem thus corresponds to a dual control problem, to be considered in Section 7.4. When no parametric form is known for the function to be maximized, it is classical to resort to suboptimal methods such as the Kiefer-Wolfowitz scheme [83], or the response surface methodology, which involves linear and quadratic approximations, see, e.g., [18]. Optimization with a nonparametric model will be considered in Section 9, combining statistical learning with global optimization.

3 Statistical learning, nonparametric models

One can refer to the books [183], [184], [62] and the surveys [37], [10] for a detailed exposition of statistical learning. Based on so-called "training data" D = {[u₁, y(u₁)], …, [u_N, y(u_N)]} we wish to predict the response y(u) of a process at some unsampled input u using Nadaraya-Watson regression [118], [189], Radial Basis Functions (RBF), Support Vector Machine (SVM) regression or Kriging (Gaussian processes). All these approaches can be cast in the class of kernel methods, see [185], [186] and [159] for a more precise formulation, and we only consider the last one, Kriging, due to its wide flexibility and easy interpretability. The associated DOE problem is considered in Section 3.2. We denote by ŷ_D(u) the prediction at u and set y = [y(u₁), …, y(u_N)]⊤.

3.1 Gaussian process and Kriging

The method originated in geostatistics, see [86], [107], and has a long history. When the modelling errors concern a transfer function observed in the Nyquist plane, the approach possesses strong similarities with the so-called "stochastic embedding" technique, see, e.g., [59] and the survey paper [124]. The observations are modelled as y(u_k) = θ₀ + P(u_k, ω) + ε_k, where P(u, ω) denotes a second-order stationary zero-mean random process with covariance IE{P(u, ω)P(z, ω)} = K(u, z) = σ²_P C(u − z) and the ε_k's are i.i.d., with zero mean and variance σ². The best linear unbiased predictor at u is ŷ_D(u) = v⊤(u)y, where v(u) minimizes IE{(v⊤y − [θ₀ + P(u, ω)])²} under the constraint IE{v⊤y} = θ₀ ∑_{i=1}^N v_i = IE{y(u)} = θ₀, that is, ∑_{i=1}^N v_i = 1. This optimization problem is solvable explicitly, which gives

\[
\hat y_D(u) = v^\top(u)\, y = \hat\theta_0 + c^\top(u)\, C_y^{-1} (y - \hat\theta_0 \mathbf{1}) \quad (3)
\]

where C_y = σ² I_N + σ²_P C_P, with I_N the N-dimensional identity matrix and C_P the N × N matrix defined by [C_P]_{i,j} = C(u_i − u_j); 1 is the N-dimensional vector with all components equal to 1, c(u) = σ²_P [C(u − u₁), …, C(u − u_N)]⊤ and θ̂₀ = (1⊤C_y⁻¹y)/(1⊤C_y⁻¹1) (a weighted LS estimator of θ₀). Note that the prediction takes the form ŷ_D(u) = θ̂₀ + ∑_{k=1}^N a_k K(u, u_k), i.e., a linear combination of kernel values. The Mean-Squared Error (MSE) of the prediction ŷ_D(u) at u is given by

\[
\rho_D^2(u) = \sigma_P^2 - \begin{bmatrix} c^\top(u) & 1 \end{bmatrix}
\begin{bmatrix} C_y & \mathbf{1} \\ \mathbf{1}^\top & 0 \end{bmatrix}^{-1}
\begin{bmatrix} c(u) \\ 1 \end{bmatrix} \quad (4)
\]

and, if σ² = 0 (i.e., there are no measurement errors ε_k), ŷ_D(u_i) = y(u_i) and ρ²_D(u_i) = 0 for any i. The predictor ŷ_D(u) is then a perfect interpolator. This method thus makes statistical inference possible even for purely deterministic systems, the model uncertainty being represented by the trajectory of a random process. Since the publication [157] it has been successfully applied in many domains of engineering where simulations (computer codes) replace real physical experiments (and measurement errors are thus absent), see, e.g., [158].
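A compact numerical sketch of the predictor (3) and the MSE (4) follows (an illustration of mine, not taken from the references above; it assumes a squared-exponential correlation C(h) = exp(−h²/ℓ²) with arbitrary values of ℓ, σ²_P and σ²).

```python
# Ordinary-Kriging sketch: predictor (3) and MSE (4) on a 1-d example.
import numpy as np

def corr(h, ell=0.5):
    return np.exp(-(h / ell) ** 2)          # stationary correlation C(h)

rng = np.random.default_rng(0)
U = np.array([0.0, 0.3, 0.5, 0.9])          # design points u_1..u_N
y = np.sin(2 * np.pi * U) + 0.05 * rng.standard_normal(U.size)

sigma2, sigma2_P = 0.05 ** 2, 1.0
C_P = corr(U[:, None] - U[None, :])
C_y = sigma2 * np.eye(U.size) + sigma2_P * C_P
one = np.ones(U.size)

Cy_inv = np.linalg.inv(C_y)
theta0_hat = (one @ Cy_inv @ y) / (one @ Cy_inv @ one)   # weighted LS estimate of theta_0

def predict(u):
    c = sigma2_P * corr(u - U)
    y_hat = theta0_hat + c @ Cy_inv @ (y - theta0_hat * one)          # eq. (3)
    A = np.block([[C_y, one[:, None]], [one[None, :], np.zeros((1, 1))]])
    b = np.concatenate([c, [1.0]])
    rho2 = sigma2_P - b @ np.linalg.solve(A, b)                       # eq. (4)
    return y_hat, rho2

print(predict(0.7))   # prediction and its MSE at an unsampled point
```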

If the characteristics of the process P(u, ω) and of the errors ε_k belong to a parametric family, the unknown parameters that are involved can be estimated. For instance, for a Gaussian process with C(z) parameterized as C(z) = C(β, z) and for normal errors ε_k, the parameters β, σ²_P and σ² can be estimated by Maximum Likelihood; see the book [173], in particular for recommendations concerning the choice of the covariance function C(z). See also the survey [105] and the papers [195], [181] concerning the asymptotic properties of the estimator. The method can be extended in several directions: the constant term θ₀ can be replaced by a linear model r⊤(u)θ (this is called universal Kriging, or intrinsic Kriging when generalized covariances are used, which is then equivalent to splines, see [187]), a prior distribution can be set on θ (Bayesian Kriging, see [38]), the derivative (gradient) of the response y(u) can also be predicted from observations y(u_k), see [185], or observations of the derivatives can be used to improve the prediction of the response, see [114], [106], [102]. Nonparametric modelling can be used in optimization, and an application of Kriging to global optimization is presented in Section 9.

3.2 DOE for nonparametric models

The approaches can be classified into those that are model-free (of the space-filling type) and those that use a model.

3.2.1 Model-free design (space filling)

For U the design set (the admissible set for u), we call S ⊂ U the finite set of chosen design points, or sites, u_k where the observations are made, k = 1, …, N. Maximin-distance design [78] chooses the sites S that maximize the minimum distance between points of S, i.e. min_{u ≠ u′ ∈ S²} d(u, u′). The chosen sites u_k are thus maximally spread in U (in particular, some points are set on the boundary of U). When U is a discrete set, minimax-distance design [78] chooses sites that minimize the maximum distance between a point in U and S, i.e. max_{z∈U} d(z, S) = max_{z∈U} min_{u∈S} d(z, u). In order to ensure good projection properties in all directions (for each component of the u_k's), it is recommended to work in the class of Latin hypercube designs, see [113] (when U is scaled to [0, 1]^d, for every i = 1, …, d the components u_{ki}, k = 1, …, N, then take all the values 0, 1/(N − 1), 2/(N − 1), …, 1).
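A small sketch of how candidate designs can be scored by the maximin-distance criterion, here with a crude random search over Latin hypercube designs (my own illustration; this is not the algorithm of [78] or [113]).

```python
# Sketch: among random Latin hypercube designs, keep the one with the largest
# minimum pairwise distance -- a rough maximin-distance search.
import numpy as np
from itertools import combinations

def latin_hypercube(N, d, rng):
    # each column is a random permutation of 0, 1/(N-1), ..., 1
    return np.column_stack([rng.permutation(N) / (N - 1) for _ in range(d)])

def min_pairwise_distance(X):
    return min(np.linalg.norm(a - b) for a, b in combinations(X, 2))

rng = np.random.default_rng(1)
best = max((latin_hypercube(8, 2, rng) for _ in range(200)), key=min_pairwise_distance)
print("maximin value:", min_pairwise_distance(best))
print(best)
```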

3.2.2 Model-based design

In order to relate the choice of the design to the quality of the prediction ŷ_D(u), a first step is to characterize the uncertainty on ŷ_D(u). This raises difficult issues in nonparametric modelling, in particular due to the difficulty of deriving a global measure expressing the speed of decrease of the MSE of the prediction as N, the number of observations, increases (we shall see in Section 5.3.4 that the situation is opposite in the parametric case). A reason is that the effect of the addition of a new observation is local: when we observe at u, the MSE of the prediction at z decreases for z close to u (for instance, for Kriging without measurement errors ρ_D(u) becomes zero), but is only weakly modified for z far from u. Hence, DOE is often ignored in the statistical learning literature³, where the set of training data D is generally assumed to be a collection of i.i.d. pairs [u_k, y(u_k)], see, e.g., [37], [10]. The local influence just mentioned has the consequence that an optimal design should (asymptotically) tend to observe everywhere in U, and distribute the points u_k with a density (i.e. according to a probability measure absolutely continuous with respect to the Lebesgue measure on U; again, we shall see that the situation is opposite in the parametric case). Few results exist on that difficult topic, see e.g. [30]: for u scalar, observations y(u_k) = f(u_k) + ε_k with i.i.d. errors ε_k, and a prediction of the Nadaraya-Watson type ([118], [189]), a sequential algorithm is constructed that is asymptotically optimal (it tends to distribute the points u_k with a density proportional to |f″(u)|^{2/9}). See also [115], [41] for related results. The uniform distribution may turn out to be optimal when considering minimax optimality over a class of functions, see [12].

When Kriging is used for prediction, the MSE is given by (4) and S can be chosen for instance by minimizing the maximum MSE max_{u∈U} ρ²_D(u) (which is related to minimax-distance design, see [78]) or by minimizing the integrated MSE ∫_U ρ²_D(u) π(du), with π(·) some probability density for u, see [156]. Maximum entropy sampling [163] provides an elegant alternative design method, usually requiring easier computations. It can be related to maximin-distance design, see [78].

³ There exists a literature on active learning, which aims at selecting training data using techniques from DOE. However, it seems that when explicit reference to DOE is made, the attention is restricted to learning with a parametric model, see in particular [33], [34]. In that case, the underlying assumption that the data are generated by a process whose structure coincides with that of the model is often hardly tenable, especially for a behavioral model, e.g. of the neural-network type; see Section 5.3.4 for a discussion.

Notice finally that in general the parameters β, σ²_P and σ² in the covariance matrix C_y used in Kriging are estimated from data, so that the precision of their estimation influences the precision of the prediction. This seems to have received very little attention, although designs for prediction (space filling for instance) are clearly not appropriate for the precise estimation of these parameters, see [197].

4 Parametric models and information matrices

Throughout this section we consider regression models with observations

\[
y(u_k) = \eta(\bar\theta, u_k) + \varepsilon_k\,, \qquad \bar\theta \in \Theta\,, \quad u_k \in U\,, \qquad (5)
\]

where the errors ε_k are independent with zero mean and variance IE_{u_k}(ε²_k) = σ²(u_k), k = 1, 2, … (with 0 < a ≤ σ²(u) ≤ b < ∞). The function η(θ, u_k) is known, possibly nonlinear in θ, and θ̄, the true value of the model parameters, is unknown. The asymptotic behavior of the LS estimator, in relation with the design, is recalled in the next section (precise proofs are generally rather technical, and we give conditions on the design that facilitate their construction). Maximum-Likelihood estimation and estimating functions are considered next. The extension to dynamical systems requires more technical developments beyond the scope of this paper. One can refer e.g. to [58], [104], [24], [171] for a detailed presentation, including data-recursive estimation methods. Also, one can refer to the monograph [180] for the identification of systems with distributed parameters and e.g. to [90], [151], [152] for optimal input design for such systems.

4.1 Weighted LS estimation

Consider the weighted LS (WLS) estimator

\[
\hat\theta^N_{WLS} = \arg\min_\theta \frac{1}{N} \sum_{k=1}^N w(u_k)\, [y(u_k) - \eta(\theta, u_k)]^2
\]

with w(·) a known function, bounded on U. To investigate the asymptotic properties of θ̂^N_WLS for N → ∞ we need to specify how the design points u_k are generated. In that sense, the asymptotic properties of the estimator are strongly related to the design. The early and now classical reference [77] makes assumptions on the finite tail products of the regression function η and its derivatives, but the results are more easily obtained in at least two cases:
• (i) (u_k) forms a sequence of i.i.d. random variables (vectors), distributed with a probability measure ξ (which we call a random design);
• (ii) the empirical measure ξ_N, with distribution function IF_{ξ_N}(u) = ∑_{i=1, u_i<u}^N (1/N) (where the inequality u_i < u is componentwise), converges strongly (in variation, see [164], p. 360) to a discrete probability measure ξ on U with finite support S_ξ = {u ∈ U : ξ(u) > 0}, that is, lim_{N→∞} ξ_N(u) = ξ(u) for any u ∈ U.

Note that in case (i) the pairs (ε_k, u_k) are i.i.d. and in case (ii) there exists a finite number of support points u_i that receive positive weights ξ(u_i) > 0, so that, as N increases, the observations at those u_i's are necessarily repeated. In both cases the asymptotic distribution of the estimator is characterized by ξ.

The strong consistency of θ̂^N_WLS, i.e., θ̂^N_WLS → θ̄ a.s. as N → ∞, can easily be proved for designs satisfying (i) or (ii) under continuity and boundedness assumptions on η(θ, u) when the estimability condition [∫_U w(u)[η(θ̄, u) − η(θ′, u)]² ξ(du) = 0 ⇔ θ′ = θ̄] is satisfied. Supposing, moreover, that η(θ, u) is twice continuously differentiable in θ and that the matrix

\[
M_1(\xi, \bar\theta) = \int_U w(u)\, \left.\frac{\partial \eta(\theta, u)}{\partial \theta}\right|_{\bar\theta} \left.\frac{\partial \eta(\theta, u)}{\partial \theta^\top}\right|_{\bar\theta} \xi(du)
\]

has full rank, an application of the Central Limit Theorem to a Taylor series development of ∇_θ J_N(θ), the gradient of the WLS criterion, around θ̂^N_WLS gives

\[
\sqrt{N}\,(\hat\theta^N_{WLS} - \bar\theta) \xrightarrow{d} z \sim \mathcal{N}(0, C(w, \xi, \bar\theta))\,, \quad N \to \infty\,, \quad (6)
\]

where C(w, ξ, θ̄) = M₁⁻¹(ξ, θ̄) M₂(ξ, θ̄) M₁⁻¹(ξ, θ̄) with

\[
M_2(\xi, \bar\theta) = \int_U w^2(u)\, \sigma^2(u)\, \frac{\partial \eta(\bar\theta, u)}{\partial \theta} \frac{\partial \eta(\bar\theta, u)}{\partial \theta^\top}\, \xi(du)\,.
\]

One may notice that C(w, ξ, θ̄) − M⁻¹(ξ, θ̄) is non-negative definite for any weighting function w(·), where M(ξ, θ̄) denotes the matrix

\[
M(\xi, \bar\theta) = \int_U \sigma^{-2}(u)\, \frac{\partial \eta(\bar\theta, u)}{\partial \theta} \frac{\partial \eta(\bar\theta, u)}{\partial \theta^\top}\, \xi(du)\,. \quad (7)
\]

The equality C(w, ξ, θ̄) = M⁻¹(ξ, θ̄) is obtained for w(u) = c σ⁻²(u), with c a positive constant, and this choice of w(·) is thus optimal (in terms of asymptotic variance) among all WLS estimators. This result can be compared to that obtained for linear regression in Section 2.1, where σ² M_N⁻¹ was the exact expression for the variance of θ̂^N for finite N. In nonlinear regression the expression C(w, ξ, θ̄)/N for the variance of θ̂^N is only valid asymptotically, see (6); moreover, it depends on the unknown true value θ̄ of the parameters. These results can easily be extended to situations where the variance of the errors also depends on the parameters θ of the response η, that is, IE_{u_k}(ε²_k) = σ²(u_k) = β λ(θ̄, u_k), see e.g. [130], [140].
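A tiny numerical illustration of the sandwich form C(w, ξ, θ̄) = M₁⁻¹M₂M₁⁻¹ and of the optimality of the weights w(u) ∝ σ⁻²(u), for a one-parameter exponential model and an arbitrary heteroscedastic variance function (my own example, chosen only to make the formulas concrete).

```python
# Compare the asymptotic variance of WLS under uniform weights with the one
# obtained for the optimal weights w(u) = 1/sigma^2(u), for the toy model
# eta(theta, u) = exp(-theta * u) observed on a 4-point design with equal weights.
import numpy as np

theta_bar = 1.0
supp = np.array([0.2, 0.5, 1.0, 2.0])          # support points of the design measure xi
xi = np.full(4, 0.25)                           # design weights
sigma2 = 0.1 + supp ** 2                        # heteroscedastic error variance sigma^2(u)

g = -supp * np.exp(-theta_bar * supp)           # d eta / d theta (scalar parameter)

def sandwich(w):
    M1 = np.sum(xi * w * g ** 2)
    M2 = np.sum(xi * w ** 2 * sigma2 * g ** 2)
    return M2 / M1 ** 2                         # C(w, xi, theta) for p = 1

print("uniform weights :", sandwich(np.ones(4)))
print("optimal weights :", sandwich(1.0 / sigma2))       # equals M^{-1}(xi, theta)
print("M^{-1}          :", 1.0 / np.sum(xi * g ** 2 / sigma2))
```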

4.2 Maximum-likelihood estimation

Denote by φ_{u_k}(·) the probability density function (pdf) of the error ε_k in (5). Due to the independence of the errors, we obtain for the vector y of observations the pdf π(y|θ) = ∏_{k=1}^N π[y(u_k)|θ] = ∏_{k=1}^N φ_{u_k}[y(u_k) − η(θ, u_k)], and the Maximum-Likelihood (ML) estimator θ̂^N_ML minimizes − log π(y|θ) = ∑_{k=1}^N − log φ_{u_k}[y(u_k) − η(θ, u_k)]. Different pdfs φ yield different estimators (LS for Gaussian errors, L₁ estimation for errors with a Laplace distribution, etc.). Under standard regularity assumptions on φ_u(·) and for designs satisfying conditions (i) or (ii) of Section 4.1, θ̂^N_ML → θ̄ a.s. and

\[
\sqrt{N}\,(\hat\theta^N_{ML} - \bar\theta) \xrightarrow{d} z \sim \mathcal{N}(0, M_F^{-1}(\xi, \bar\theta))\,, \quad N \to \infty\,, \quad (8)
\]

with M_F(ξ, θ̄) the Fisher information matrix (average, per sample) given by

\[
M_F(\xi, \bar\theta) = \mathrm{IE}_{\bar\theta}\left\{\frac{1}{N}\, \frac{\partial \log \pi(y|\theta)}{\partial \theta}\, \frac{\partial \log \pi(y|\theta)}{\partial \theta^\top}\right\}
= -\mathrm{IE}_{\bar\theta}\left\{\frac{1}{N}\, \frac{\partial^2 \log \pi(y|\theta)}{\partial \theta\, \partial \theta^\top}\right\}\,.
\]

In the particular case of the regression model considered here we obtain

\[
M_F(\xi, \bar\theta) = \int_U I(u)\, \frac{\partial \eta(\bar\theta, u)}{\partial \theta} \frac{\partial \eta(\bar\theta, u)}{\partial \theta^\top}\, \xi(du) \quad (9)
\]

with I(u) = ∫ [φ′_u(z)]²/φ_u(z) dz the Fisher information for location of the pdf φ_u. From the Cramér-Rao inequality, M_F⁻¹(ξ, θ̄) forms a lower bound on the covariance matrix of any unbiased estimator θ̂^N of θ̄, i.e., IE_{θ̄}{(θ̂^N − θ̄)(θ̂^N − θ̄)⊤} − M_F⁻¹(ξ, θ̄)/N is non-negative definite for any estimator θ̂^N such that IE_{θ̄}{θ̂^N} = θ̄. When the errors ε_k are normal N(0, σ²(u_k)), I(u) = σ⁻²(u) and ML estimation coincides with WLS with optimal weights (and M_F(ξ, θ̄) coincides with (7)). When they are i.i.d., that is φ_u = φ for any u, I(u) = I constant, and

\[
M_F(\xi, \bar\theta) = I \int_U \frac{\partial \eta(\bar\theta, u)}{\partial \theta} \frac{\partial \eta(\bar\theta, u)}{\partial \theta^\top}\, \xi(du)\,. \quad (10)
\]

4.2.1 Estimating functions

Estimating functions form a very generally applicable set of tools for parameter estimation in stochastic models. As the example below will illustrate, they can yield very simple estimators for dynamical systems. One can refer to [63] for a general exposition of the methodology; see also the discussion paper [103], which comprises a short historical perspective. Instrumental variables methods (see, e.g., [169], [170] and Chapter 8 of [171]), used in dynamical systems as an alternative to LS estimation when the regressors and errors are correlated (so that the LS estimator is biased), can be considered as methods for constructing unbiased estimating functions. Their implementation often involves the construction of regressors obtained through simulations with previous values of parameter estimates, but simpler constructions are possible.

Consider a discrete-time system with scalar state and input, respectively x_i and u_i, defined by the recurrence equation

\[
x_{i+1} = x_i + T\,[u_i + \theta\,(x_i + 1)]\,, \quad i = 0, 1, 2, \ldots \quad (11)
\]

with known sampling period T and initial state x₀. The observations are given by y_i = x_i + ε_i for i ≥ 1, where (ε_i) denotes a sequence of i.i.d. normal errors N(0, σ²). The unknown parameter θ can be estimated by LS (which corresponds to ML estimation since the errors are normal), but recursive LS cannot be used since x_i depends nonlinearly on θ. However, simpler estimators can be used if one is prepared to lose some precision in the estimation. For instance, substitute y_i for the state x_i in (11) and form the equation in θ

\[
g_{i+1}(\theta) = y_{i+1} - y_i - T\,[u_i + \theta\,(y_i + 1)] = 0\,; \quad (12)
\]

k successive observations then give G_k(θ) = (1/k) ∑_{i=1}^k g_i(θ) = 0. Since G_k(θ) is linear in the y_i's, IE_θ{G_k(θ)} = 0 for any θ, and G_k(θ) is called an unbiased estimating function⁴, see, e.g., [103]. Since G_k(θ) is linear in θ, the solution θ̂_k of G_k(θ) = 0 is simply given by

\[
\hat\theta_k = \frac{(y_k - y_0)/(kT) - \left(\sum_{i=0}^{k-1} u_i\right)/k}{1 + \left(\sum_{i=0}^{k-1} y_i\right)/k} \quad (13)
\]

(provided that the denominator is different from zero) and forms an estimator of θ. Notice that the true value θ̄ satisfies a similar equation with the y_i's replaced by the noise-free values x_i. Estimation by θ̂_k is less precise than LS estimation, see Figure 3 in Section 9, but requires much less computation. Were other parameters present in the model, other estimating functions would be required. For instance, a function of the type G_{k,α}(θ) = ∑_{i=1}^k i^α g_i(θ) would put more stress on the transient (respectively long-term) behavior of the system when α < 0 (respectively α > 0). Also, the multiplication of g_i(θ) by a known function of u_i gives a new estimating function. When information on the noise statistics is available, it is desirable for the (asymptotic) precision of the estimation to choose G_k as (proportional to) an approximation of the score function ∂ log π(y|θ)/∂θ, with π(y|θ) the pdf of the observations y₁, …, y_k, see, e.g., [36] p. 274 and [103].

⁴ Nonlinearity in the observations is allowed, provided that the bias is suitably corrected; for instance the function g′_{i+1}(θ) = (1 + y_i) g_{i+1}(θ) + σ²(1 + Tθ), with g_{i+1}(θ) given by (12), satisfies IE_θ{g′_{i+1}(θ)} = 0 for any θ when the errors ε_i are i.i.d. with zero mean and variance σ², and (1/k)∑_{i=1}^k g′_i(θ) is an unbiased estimating function for θ.
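A quick simulation sketch of (11)-(13) (my own illustration; the sampling period, input sequence, horizon and noise level are arbitrary choices, and the initial observation is taken equal to the known initial state).

```python
# Simulate the system (11), then recover theta from the closed-form
# estimating-function solution (13).
import numpy as np

rng = np.random.default_rng(0)
T, theta_true, x0 = 0.1, -0.5, 1.0
k = 200
u = 1.0 + 0.5 * np.sin(0.3 * np.arange(k))   # arbitrary known input sequence

x = np.empty(k + 1)
x[0] = x0
for i in range(k):
    x[i + 1] = x[i] + T * (u[i] + theta_true * (x[i] + 1.0))   # recurrence (11)

y = x + 0.05 * rng.standard_normal(k + 1)    # noisy observations y_i = x_i + eps_i
y[0] = x0                                    # initial state assumed known here

theta_hat = ((y[k] - y[0]) / (k * T) - u.sum() / k) / (1.0 + y[:k].sum() / k)   # (13)
print("estimate:", theta_hat, "  true value:", theta_true)
```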

There seems to be a revival of interest in estimating functions, partly due to the elegant algebraic framework recently developed for time-continuous linear systems (differential equations); see [47], where estimating functions are constructed through Laplace transforms. However, in this algebraic setting only multiplications by s or s⁻¹ and differentiation with respect to s are considered (with s the Laplace variable), which seems unnecessarily restrictive. Consider for instance the time-continuous version of (11),

\[
\dot x = u + \theta\,(x + 1)\,, \quad x(0) = x_0\,, \quad (14)
\]

where ẋ denotes differentiation with respect to time. Its Laplace transform is sX(s) = U(s) + θX(s) + s⁻¹θ + x₀, which can first be multiplied by s, then differentiated twice with respect to s and the result multiplied by s⁻² to avoid differentiation with respect to time. This gives an estimating function comprising double integrations with respect to time. Multiple integrations may be avoided by noticing that the multiplication of the initial differential equation by any function of time preserves the linearity of the estimating function with respect to both θ and the state (provided that the integrals involved are well defined). For instance, when u is a known function of time, the multiplication of (14) by the input u followed by integration with respect to time gives the estimating function

\[
\frac{x(t)u(t) - x_0 u_0}{t} - \frac{1}{t}\int_0^t [x(\tau)\dot u(\tau) + u^2(\tau)]\, d\tau = \frac{\theta}{t}\int_0^t u(\tau)[1 + x(\tau)]\, d\tau\,,
\]

which is linear in x. Infinitely many unbiased estimating functions can thus be easily constructed in this way. (Note that, due to linearity, the introduction of process noise in (14) as dx(t) = {u(t) + θ[x(t) + 1]} dt + dB_t(ω), with B_t(ω) a Brownian motion, leaves the estimating function above unbiased.)

The analysis of the asymptotic behavior of the estimator θ̂_k associated with an estimating function is straightforward when the function is unbiased and linear in θ. The expression of the asymptotic variance of the estimator can be used to select suitable experiments in terms of the precision of the estimation, as is the case for LS or ML estimation. However, in general the asymptotic variance of the estimator takes a more complicated form than M⁻¹(ξ, θ) or M_F⁻¹(ξ, θ), see (7), (10), so that DOE for such estimators does not seem to have been considered so far. The recent revival of interest in this method might provide some motivation for such developments (see also Section 9).

4.3 DOE

To obtain a precise estimation of θ one should first use a good estimator (WLS with weights proportional to σ⁻², or ML) and second select a good design⁵ ξ*. In the next section we shall consider classical DOE for parameter estimation, which is based on the information matrix (10)⁶. Hence, we shall choose the ξ* that optimizes Φ[M_F(ξ, θ)], for some criterion function Φ(·). For models nonlinear in θ, this raises two difficulties: (i) the criterion function, and thus ξ*, depends on a guessed value of θ. This is called local DOE (the design ξ* is optimal locally, when the guessed value is close to the true one); some alternatives to local optimal design will be presented in Section 5.3.5. (ii) The method relies on the asymptotic properties of the estimator. More accurate approximations of the precision of the estimation exist, see e.g. [126], but are complicated and seldom used for DOE, see [128], [138] (see also the recent work [25] concerning the finite sample size properties of estimators, which raises challenging DOE issues). They will not be considered here. For dynamical systems with correlated observations or containing an autoregressive part, classical DOE also relies on the information matrix, which then has a more complicated expression, see Section 6. Also, the calculation of the asymptotic covariance of some estimators requires specific developments that are not presented here, see e.g. [58], [104], [24] for recursive estimation methods. For Bayesian estimation, a standard approach for DOE consists in replacing M_F(ξ, θ) by M_F(ξ, θ) + Ω⁻¹/N, with Ω the prior covariance matrix for θ, see e.g. [132], [27]. Note finally the central role of the design concerning the asymptotic properties of estimators. In particular, the conditions (i) and (ii) of Section 4.1 on the design imply some stationarity of the "inputs" u_k and guarantee the persistence of excitation, which can be expressed as a condition on the minimum eigenvalue of the information matrix: lim inf_{N→∞} λ_min[M_F(ξ_N, θ)] > 0, with ξ_N the empirical measure of u₁, …, u_N (that is, lim inf_{N→∞} λ_min(M_N)/N > 0 for the linear regression model of Section 2.1, see (2)).

⁵ We shall thus follow the standard approach, in which the estimator is chosen first, and an optimal design is then constructed for that given estimator (even though it may be optimal for different estimators); this can be justified under rather general conditions, see [119].
⁶ Note that defining η̃(θ, u) = σ⁻¹(u) η(θ, u) and η̃(θ, u) = √(I(u)) η(θ, u), one can respectively write the matrices (7) and (9) in the same form as (10). Also notice that classical DOE uses the covariance matrix with the simplest expression: DOE for WLS estimation is more complicated for non-optimal weights than for the optimal ones, compare C(w, ξ, θ) to M⁻¹(ξ, θ) in Section 4.1. Similarly, the asymptotic covariance matrix for a general M-estimator (see, e.g., [72]) is more complicated than for ML.

5 DOE for parameter estimation

5.1 Design criteria

We consider criteria for designing optimal experiments (for parameter estimation) that are scalar functions of the (Fisher) information matrix (average, per sample) (10)⁷. For N observations at the design points u_i ∈ U, i = 1, …, N, we shall denote U_1^N = (u₁, …, u_N), which is called a finite (or discrete) design of size N, or N-point design. The associated information matrix is then

\[
M_F(U_1^N, \theta) = \frac{I}{N} \sum_{i=1}^N \frac{\partial \eta(\theta, u_i)}{\partial \theta} \frac{\partial \eta(\theta, u_i)}{\partial \theta^\top}\,. \quad (15)
\]

⁷ Notice that the analytic form of the sensitivities ∂η(θ, u)/∂θ of the model response is not required: for a model given by differential equations, as in Section 2.2, or by difference equations, the sensitivities can be obtained by simulation, together with the model response itself; see, e.g., Chapter 4 of [188].


The admissible design set U is sometimes a finite set, U = {u¹, …, u^K}, K < ∞. We shall more generally assume that U is a compact subset of R^d. For a linear regression model with i.i.d. errors N(0, σ²), the ellipsoid

\[
\mathcal{R}(\hat\theta^N_{LS}, \alpha) = \{\theta : (\theta - \hat\theta^N_{LS})^\top M_F(U_1^N)\, (\theta - \hat\theta^N_{LS}) \le \chi^2_\alpha(p)/N\}\,,
\]

where χ²_α(p) has the probability α to be exceeded by a chi-square distributed random variable with p degrees of freedom, satisfies Pr{θ̄ ∈ R(θ̂^N_LS, α)} = α, and this is asymptotically true in nonlinear situations⁸.

Most classical design criteria are related to characteristics of (asymptotic) confidence ellipsoids. Minimizing Φ(M) = trace[M⁻¹] corresponds to minimizing the sum of the squared lengths of the axes of (asymptotic) confidence ellipsoids for θ and is called A-optimal design (minimizing Φ(M) = trace[Q⊤QM⁻¹], with Q some weighting matrix, is called L-optimal design, see [31] for an early reference). Minimizing the longest axis of (asymptotic) confidence ellipsoids for θ is equivalent to maximizing the minimum eigenvalue of M and is called E-optimal design. D-optimal design maximizes det(M), or equivalently minimizes the volume of (asymptotic) confidence ellipsoids for θ (their volume being proportional to 1/√det M). This approach is very much used, in particular due to the invariance of a D-optimal experiment under re-parametrization of the model (since det M(ξ, θ′) = det M(ξ, θ) [det(∂θ′/∂θ⊤)]⁻²). Most often D-optimal experiments consist of replications of a small number of different experimental conditions. This has been illustrated by the example of Section 2.2, for which p = 4 and four sampling times were duplicated in the D-optimal design t*.
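For reference, the three criteria are immediate to evaluate numerically; a minimal sketch on an arbitrary 2 × 2 information matrix (my own illustration).

```python
# Evaluate the A-, E- and D-optimality criteria for a given information matrix M.
import numpy as np

M = np.array([[4.0, 1.0],
              [1.0, 2.0]])                 # an arbitrary positive-definite example

A_value = np.trace(np.linalg.inv(M))       # A-optimality: minimize trace(M^{-1})
E_value = np.linalg.eigvalsh(M).min()      # E-optimality: maximize lambda_min(M)
D_value = np.linalg.det(M)                 # D-optimality: maximize det(M)
print(A_value, E_value, D_value)
```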

5.2 Algorithms for discrete design

Consider the regression model (5) with i.i.d. errors and N observations at U_1^N = (u₁, …, u_N), where the support points u_i belong to U ⊂ R^d. The Fisher information matrix M_F(U_1^N, θ) is then given by (15). The (local) design problem consists in optimizing Ψ_θ(U_1^N) = Φ[M_F(U_1^N, θ)] for a given θ, with respect to U_1^N ∈ R^{N×d}. If the problem dimension N × d is not too large, standard optimization algorithms can be used (note, however, that constraints may exist in the definition of the admissible set U and that local optima exist in general). When N × d is large, specific algorithms are recommended. They are usually of the exchange type, see [42], [108]. Since several local optima exist in general, these methods provide locally optimal solutions only.

⁸ Such confidence regions for θ can be transformed into simultaneous confidence regions for functions of θ, see in particular [160], [14].

5.3 Approximate design theory

5.3.1 Design measures

Suppose that replications of observations exist, so that several u_i's coincide in (15). Let m < N denote the number of different u_i's, so that

\[
M_F(U_1^N, \theta) = I \sum_{i=1}^m \frac{r_i}{N}\, \frac{\partial \eta(\theta, u_i)}{\partial \theta} \frac{\partial \eta(\theta, u_i)}{\partial \theta^\top}
\]

with r_i/N the proportion of observations collected at u_i, which can be considered as the percentage of experimental effort at u_i, or the weight of the support point u_i. Denote this weight by λ(u_i). The design U_1^N is then characterized by the support points u₁, …, u_m and their associated weights λ(u₁), …, λ(u_m) satisfying ∑_{i=1}^m λ(u_i) = 1, that is, a normalized discrete distribution on the u_i's, with the constraints λ(u_i) = r_i/N, i = 1, …, m. Relaxing these constraints, one defines an approximate design as a discrete probability measure with support points u_i and weights λ_i (with ∑_{i=1}^m λ_i = 1). Relaxing now the discreteness constraint, a design measure is simply defined as any probability measure ξ on U, see [84], and M_F(ξ, θ) takes the form (10). Now, M_F(ξ, θ) belongs to the convex hull of the set M₁ of rank-one matrices of the form M(δ_u, θ) = I [∂η(θ, u)/∂θ][∂η(θ, u)/∂θ⊤]. It is a p × p symmetric matrix, and thus belongs to a p(p + 1)/2-dimensional space. Therefore, from Carathéodory's Theorem, it can be written as a linear combination of at most p(p + 1)/2 + 1 elements of M₁; that is,

\[
M_F(\xi, \theta) = I \sum_{i=1}^m \lambda_i\, \frac{\partial \eta(\theta, u_i)}{\partial \theta} \frac{\partial \eta(\theta, u_i)}{\partial \theta^\top}\,, \quad (16)
\]

with m ≤ p(p + 1)/2 + 1. The information matrix associated with any design measure ξ can thus always be considered as obtained from a discrete probability measure with at most p(p + 1)/2 + 1 support points. This is true in particular for the optimal design⁹. Given such a discrete design measure ξ with m support points, a discrete design U_1^N with repetitions can be obtained by choosing the numbers of repetitions r_i such that r_i/N is an approximation¹⁰ of λ_i, the weight of u_i for ξ, see, e.g., [150].

⁹ In general the situation is even more favorable. For instance, if ξ_D is D-optimal (it maximizes det M_F(ξ, θ)), then M_F(ξ_D, θ) is on the boundary of the convex closure of M₁ and p(p + 1)/2 support points are enough.
¹⁰ This is at the origin of the name approximate design theory. However, a design ξ (even with a density) can sometimes be implemented without any approximation: this is the case in Section 6.2, where ξ corresponds to the power spectral density of the input signal.

The property that the matrices in the sum (16) have rank one is not fundamental here and is only due

to the fact that we considered single-output models (i.e., scalar observations). In the multiple-output case with independent errors, say with y(u) of dimension q corrupted by errors having the q × q covariance matrix Σ(u), the model response is a q-dimensional vector η(θ, u) and the information matrix for WLS estimation with weights Σ⁻¹(u) is M(ξ, θ) = ∫_U [∂η⊤(θ, u)/∂θ] Σ⁻¹(u) [∂η(θ, u)/∂θ⊤] ξ(du), to be compared with (7) obtained in the single-output case, see, e.g., [42], Section 1.7 and Chapter 5. Carathéodory's Theorem still applies and, with the same notation as above, we can write

\[
M(\xi, \theta) = \sum_{i=1}^m \lambda_i\, \frac{\partial \eta^\top(\theta, u_i)}{\partial \theta}\, \Sigma^{-1}(u_i)\, \frac{\partial \eta(\theta, u_i)}{\partial \theta^\top}\,,
\]

again with m ≤ p(p + 1)/2 + 1. All the results concerning DOE for scalar observations thus easily generalize to the multiple-output situation.

5.3.2 Properties

Only the main properties are indicated; one may refer to the books [42], [167], [125], [4], [149], [44] for more detailed developments. Suppose that the design criterion Φ[M] to be minimized (respectively maximized) is strictly convex (respectively concave). For instance, for D-optimality, maximizing det[M] is equivalent to maximizing log det[M] and, for any positive-definite matrices M₁, M₂ such that M₁ ≠ M₂ and any α, 0 < α < 1, log det[(1 − α)M₁ + αM₂] > (1 − α) log det[M₁] + α log det[M₂], so that Φ[·] = log det[·] is a strictly concave function. Since M_F(ξ, θ) belongs to a convex set, the optimal matrix M*_F = M_F(ξ*, θ) for Φ is unique (which usually does not imply that the optimal design ξ* is unique; however, the set of optimal design measures is convex). The uniqueness of the optimum and differentiability of the criterion directly yield a necessary and sufficient condition for optimality, and in the case of D-optimality we obtain the following, known as the Kiefer-Wolfowitz Equivalence Theorem [85] (other equivalence theorems are easily obtained for other design criteria having suitable regularity and the appropriate convexity or concavity property).

Theorem 1 The following statements are equivalent:
(1) ξ_D is D-optimal for θ;
(2) max_{u∈U} d_θ(u, ξ_D) = p;
(3) ξ_D minimizes max_{u∈U} d_θ(u, ξ_D);
where d_θ(u, ξ) is defined by

\[
d_\theta(u, \xi) = I\, \frac{\partial \eta(\theta, u)}{\partial \theta^\top}\, M_F^{-1}(\xi, \theta)\, \frac{\partial \eta(\theta, u)}{\partial \theta}\,. \quad (17)
\]

Moreover, for any support point u_i of ξ_D, d_θ(u_i, ξ_D) = p.

Note that condition (2) is easily checked when u is scalar by plotting d_θ(u, ξ) as a function of u.

Theorem 1 relates optimality in the parameter space to optimality in the space of observations, in the following sense. Let θ̂^N_ML be obtained for a design ξ; the variance of the prediction η(θ̂^N_ML, u) of the response at u is then such that N var[η(θ̂^N_ML, u)] tends to

\[
\left.\frac{\partial \eta(\theta, u)}{\partial \theta^\top}\right|_{\bar\theta} M_F^{-1}(\xi, \bar\theta)\, \left.\frac{\partial \eta(\theta, u)}{\partial \theta}\right|_{\bar\theta} = \frac{d_{\bar\theta}(u, \xi)}{I} \quad (18)
\]

when N → ∞, see (8). Therefore, a D-optimal experiment also minimizes the maximum of the (asymptotic) variance of the prediction over the experimental domain U. This is called G-optimality, and Theorem 1 thus expresses the equivalence between D- and G-optimality. (It is also related to maximum entropy sampling, considered in Section 3.2.2, see [193].)

Suppose that the observations are collected sequentially and that the choice of the design points can be made accordingly (sequential design). After the collection of y(u₁), …, y(u_N), which gives the parameter estimates θ̂^N and the prediction η(θ̂^N, u), in order to improve the precision of the prediction the next observation should intuitively be placed where var[η(θ̂^N, u)] is large, that is, where d_{θ̂^N}(u, ξ_N) is large, with ξ_N the empirical measure of the first N design points. This receives a theoretical justification in the algorithms presented below.

5.3.3 Algorithms

The presentation is for D-optimality, but most algorithms easily generalize to other criteria. Let ξ^k denote the design measure at iteration k of the algorithm. The steepest-ascent direction at ξ^k corresponds to the delta measure that puts mass 1 at u*_{k+1} = arg max_{u∈U} d_θ(u, ξ^k). Hence, at iteration k, algorithms of the steepest-ascent type add the support point u*_{k+1} to ξ^k as follows.

Fedorov–Wynn Algorithm:
• Step 1: Choose ξ¹ not degenerate (det M_F(ξ¹, θ) ≠ 0) and ε such that 0 < ε ≪ 1; set k = 1.
• Step 2: Compute u*_{k+1} = arg max_{u∈U} d_θ(u, ξ^k). If d_θ(u*_{k+1}, ξ^k) < p + ε, stop: ξ^k is almost D-optimal.
• Step 3: Set ξ^{k+1} = (1 − α_k)ξ^k + α_k δ_{u*_{k+1}}, k → k + 1, and return to Step 2.

Fedorov's algorithm corresponds to choosing the step-length α*_k that maximizes log det M_F(ξ^{k+1}, θ), which gives α*_k = [d_θ(u*_{k+1}, ξ^k) − p] / {p [d_θ(u*_{k+1}, ξ^k) − 1]} (note that 0 < α*_k < 1/p) and ensures monotonic convergence towards a D-optimal measure ξ_D, see [42].
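A compact implementation sketch of the algorithm above, written for a linear model η(θ, u) = θ₁ + θ₂u + θ₃u², so that ∂η(θ, u)/∂θ = (1, u, u²)⊤ and I = 1, with a grid on [−1, 1] as candidate set (my own illustration; the model and grid are arbitrary choices).

```python
# Fedorov-Wynn vertex-direction algorithm for D-optimal design on a finite grid,
# with Fedorov's optimal step length alpha_k.
import numpy as np

u_grid = np.linspace(-1.0, 1.0, 201)
F = np.column_stack([np.ones_like(u_grid), u_grid, u_grid ** 2])   # rows f(u)^T
p = F.shape[1]

w = np.full(len(u_grid), 1.0 / len(u_grid))      # xi^1: uniform weights (non-degenerate)
eps = 1e-4
for _ in range(10_000):
    M = F.T @ (w[:, None] * F)                   # M_F(xi^k)
    d = np.einsum('ij,jk,ik->i', F, np.linalg.inv(M), F)   # d_theta(u, xi^k), eq. (17)
    j = int(np.argmax(d))
    if d[j] < p + eps:                           # equivalence-theorem stopping rule
        break
    alpha = (d[j] - p) / (p * (d[j] - 1.0))      # Fedorov's step length
    w *= (1.0 - alpha)
    w[j] += alpha                                # move mass toward delta_{u*}
print("support:", u_grid[w > 1e-3], "weights:", np.round(w[w > 1e-3], 3))
# For this quadratic model the D-optimal design is supported on {-1, 0, 1}
# with equal weights 1/3 (a classical result), which the iteration approaches.
```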

Wynn's algorithm corresponds to a sequence satisfying 0 < α_k < 1, lim_{k→∞} α_k = 0 and ∑_{k=1}^∞ α_k = ∞, see [192] (the convergence is then not monotonic). One may notice that in sequential design, where the design points enter M_F(U_1^N, θ) given by (15) one at a time, one has

\[
M_F(U_1^{k+1}, \theta) = \frac{k}{k+1}\, M_F(U_1^k, \theta) + \frac{1}{k+1}\, I\, \frac{\partial \eta(\theta, u_{k+1})}{\partial \theta} \frac{\partial \eta(\theta, u_{k+1})}{\partial \theta^\top}
\]

and, when u_{k+1} = arg max_{u∈U} d_θ(u, ξ_k), this corresponds to Wynn's algorithm with α_k = 1/(k + 1).

Contrary to the exchange algorithms of Section 5.2, these steepest-ascent methods guarantee convergence to the optimum. However, in practice they are rather slow (in particular because a support point present at iteration k is never totally removed in subsequent iterations, since αk < 1 for any k), and faster methods, still of the steepest-ascent type, have been proposed, see e.g. [13], [111], [112] and [44] p. 49. An acceleration of the algorithms can also be obtained by using a submodularity property of the design criterion, see [154], or by removing design points that cannot support a D-optimal design measure, see [61].

When the set U is finite (which can be obtained by a suitable discretization), say with cardinality K, the optimal design problem in the approximate design framework corresponds to the minimization of a convex function of K positive weights λi summing to one, and any convex optimization algorithm can be used. The recent progress in interior-point methods, see for instance the survey [48] and the books [120], [40], [190], [194], provides alternatives to the usual sequential quadratic programming algorithm. In control theory these methods have led to the development of tools based on linear matrix inequalities, see, e.g., [20], which in turn have been suggested for D-optimal design, see [182] and Chapter 7 of [21]. Alternatively, a simple updating rule can sometimes be used for the optimization of a design criterion over a finite set U = {u1, . . . , uK}. For instance, convergence to a D-optimal measure is guaranteed when the weight λ_i^k of ui at iteration k is updated as

\lambda_i^{k+1} = \lambda_i^{k}\, \frac{d_\theta(u_i,\xi_k)}{p}\,,    (19)

where ξk is the measure defined by the support points ui and their associated weights λ_i^k, and dθ(u, ξ) is given by (17), see [176], [168], [177] and Chapter 5 of [125]. (Note that Σ_{i=1}^K λ_i^{k+1} = 1 and that λ_i^{k+1} > 0 when λ_i^k > 0.)
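The multiplicative rule (19) is particularly simple to implement; the sketch below (an added illustration with assumed names, again for the quadratic example of the earlier sketches) iterates it on a grid and concentrates the weights near the same D-optimal support {−1, 0, 1}.

```python
import numpy as np

def regressors(u):
    return np.array([1.0, u, u**2])

U_grid = np.linspace(-1, 1, 201)
F = np.array([regressors(u) for u in U_grid])     # K x p
p = F.shape[1]
w = np.full(len(U_grid), 1.0 / len(U_grid))        # strictly positive initial weights

for _ in range(2000):
    M = F.T @ (w[:, None] * F)
    d = np.einsum('ij,jk,ik->i', F, np.linalg.inv(M), F)
    w = w * d / p          # rule (19); sum_i w_i d_i = trace(M^-1 M) = p, so the weights keep summing to 1

print("total weight near -1, 0, 1:",
      [round(w[np.abs(U_grid - u0) < 0.05].sum(), 3) for u0 in (-1.0, 0.0, 1.0)])
```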

The extension to the case where the information matrices associated with single points have ranks larger than one (see Section 5.3.1) is considered in [180].

Finally, it is worth noticing that D-optimal design is connected with a minimum-volume-ellipsoid problem. Indeed, using Lagrangian theory one can easily show that the construction of ξD that maximizes the determinant of MF(ξ, θ) given by (10) with respect to the probability measure ξ on U is equivalent to the construction of the minimum-volume ellipsoid, centered at the origin, that contains the set S_θ = {∂η(θ, u)/∂θ : u ∈ U} ⊂ R^p, see [165]. The construction of the minimum-volume ellipsoid centered at 0 containing a given set U ⊂ R^p thus corresponds to a D-optimal design problem on U for the linear regression model η(θ, u) = u^⊤θ. In the case where the center of the ellipsoid is free, one can show equivalence with a D-optimal design problem in a (p + 1)-dimensional space where the regression model is η(θ, u) = (1 u^⊤)θ, θ ∈ R^{p+1}, see [166], [175]. Algorithms with iterations of the type (19) are then strongly connected with steepest-descent-type algorithms for minimizing a quadratic function, see [147], [148] and Chapter 7 of [146]. In system identification, minimum-volume ellipsoids find applications in parameter bounding (or parameter estimation with bounded errors), see, e.g., [145] and [153] for an application to robust control.

5.3.4 Active learning with parametric models

When learning with a parametric model, the prediction ŷ_D(u) at u is η(θ̂^N, u), with θ̂^N estimated from the data D = {[u1, y(u1)], . . . , [uN, y(uN)]}. As Theorem 1 shows, the precision of the prediction is directly related to the precision of the estimation of the model parameters θ: a D-optimal design minimizes the maximum (asymptotic) variance 11 of ŷ_D(u) for u ∈ U. Similar properties hold for other measures of the precision of the prediction. Consider for instance the integrated (asymptotic) variance of the prediction with respect to some given probability measure π (which may express the importance given to different values of u in U). It is given by Ψ_{θ,H}(ξ) = trace[H M^{-1}(ξ, θ)], where H = H(θ) = ∫_U [∂η(θ, u)/∂θ][∂η(θ, u)/∂θ^⊤] π(du), see (18), and its minimization corresponds to an L-optimal design problem, see Section 5.1. The following parametric learning problem is addressed in [81]: the measure π is unknown; n samples ui from π are used, together with the associated observations, to estimate θ and H, respectively by θ̂^n and H_n(θ̂^n); N − n samples are then chosen optimally for Ψ_{θ̂^n, H_n}(ξ). It is shown that the optimal balance between the two sample sizes corresponds to n being proportional to √N. When the samples ui are cheap and only the observations y(ui) are expensive, one may decide on-line whether or not to collect an observation for updating the estimate θ̂^n and the information matrix M_n. A sequential selection rule is proposed in [136], which is asymptotically optimal when a given proportion n = ⌊αN⌋ of samples, α ∈ (0, 1), can be accepted in a sequence of length N, N → ∞.

11 We could also speak of MSE since in parametric models the estimators are usually unbiased for models linear in θ, and for nonlinear models (under the condition of persistence of excitation) the squared bias decreases as 1/N² whereas the variance decreases as 1/N, see [19].

There exists a fundamental difference between learning with parametric and nonparametric models. For parametric models, the MSE of the prediction globally decreases as 1/N, and precise predictions are obtained for optimal designs which, from Caratheodory's Theorem (see Section 5.3.1), are concentrated on a finite number of sites. These are the points ui that carry the maximum information about θ useful for prediction, in terms of the selected design criterion. In contrast, precise predictions for nonparametric models are obtained when the observation sites are spread over U, see Section 3.2.2. Note, however, that parametric methods rely on the extremely strong assumption that the data are generated by a model with known structure. Since optimal designs will tend to repeat observations at the same sites (whatever the method used for their construction), modelling errors will not be detected. This makes optimal design theory delicate to use when the model is of the behavioral type, e.g. a neural network as in [33], [34]. A recent approach [52] based on bagging (Bootstrap Aggregating, see [23]) seems to open promising perspectives.

5.3.5 Dependence on θ in nonlinear situations

We already stressed the point that in nonlinear situations the Fisher information matrix depends on θ, so that an optimal design for estimation depends on the unknown value of the parameters to be estimated. So far, only local optimal design has been considered, where the experiment is designed for a nominal value θ. Several methods can be used to reduce the effect of the dependence on the assumed θ. A first simple approach is to use a finite set Θ = {θ^(1), . . . , θ^(m)} of nominal values and to design m locally optimal experiments ξ*_{θ^(i)} for the θ^(i)'s in Θ. This makes it possible to assess how strongly the optimal experiment depends on θ, and several ξ*_{θ^(i)}'s can then be combined to form a single experiment. More sophisticated approaches rely on average or minimax optimality.

In average-optimal design, the criterion Ψθ(ξ) = Φ[MF(ξ, θ)] is replaced by its expectation IE_π Ψθ(ξ) = ∫ Φ[MF(ξ, θ)] π(dθ) for some suitably chosen prior π, see, e.g., [43], [26], [27]. (Note that when the Fisher information matrix MF(ξ, θ) is used, the prior is not used for estimation and the method is not really Bayesian.) In minimax-optimal design, Ψθ(ξ) (to be minimized) is replaced by its worst possible value max_{θ∈Θ} Φ[MF(ξ, θ)] when θ belongs to a given feasible set Θ, see, e.g., [43]. Compared to local design, these approaches do not create any special difficulty (other than heavier computations) for discrete design, see Section 5.2: no special property of the design criterion is used, but the algorithms only yield local optima. Of course, for computational reasons the situation is simpler when π is a discrete measure and Θ is a finite set 12. Concerning approximate design theory (Section 5.3), the convexity (or concavity) of Φ is preserved, Equivalence Theorems can still be obtained (Section 5.3.2) and globally convergent algorithms can be constructed (Section 5.3.3), see, e.g., [44]. A noticeable difference with local design, however, concerns the number of support points of the optimal design, which is no longer bounded by p(p + 1)/2 + 1 (see, e.g., Appendix A in [155]). Also, algorithms for minimax-optimal design are more complicated than for local optimal design, in particular since the steepest-ascent direction does not necessarily correspond to a one-point delta measure.

A third possible approach to circumvent the dependence on θ consists in designing the experiment sequentially (see the examples in Sections 2.3 and 2.4), which is particularly well suited for nonparametric models, both in terms of prediction and estimation of the model, see Section 3.2.2. Sequential DOE for regression models is considered in more detail in Section 8.

6 Control in DOE: optimal inputs for parameter estimation in dynamical models

In this section, the choice of the input is (part of) the design, U_1^N or ξ depending on whether discrete or approximate design is used. One can refer in particular to the book [196] and Chapter 6 of [58] for detailed developments. The presentation is for single-input single-output systems, but the results can be extended to multi-input multi-output systems. The attention is on the construction of the Fisher information matrix, the inverse of which corresponds to the asymptotic covariance of the ML estimator, see Section 4. For control-oriented applications it is important to relate the experimental design criterion to the ultimate control objective, see, e.g., [50], [53]. This is considered in Section 6.2.

6.1 Input design in the time domain

Consider a Box and Jenkins model, with observations

y_k = F(\theta,z)\, u_k + G(\theta,z)\, \varepsilon_k

where the errors εk are i.i.d. N(0, σ²), and F(θ, z) and G(θ, z) are rational fractions in z^{-1}, with G stable and with a stable inverse. Suppose that σ² is unknown. An extended vector of parameters β = (θ^⊤ σ²)^⊤ must then be estimated, and one can assume that G(θ, ∞) = 1 without any loss of generality. For suitable input sequences (such that the experiment is informative enough, see [104], p. 361), N Var(β̂^N_ML) → M_F^{-1}(ξ, β̄), N → ∞, with β̄ the unknown true value of β and

M_F(\xi,\beta) = \mathrm{IE}_\beta\left\{ \frac{1}{N}\, \frac{\partial\log\pi(y|\beta)}{\partial\beta}\, \frac{\partial\log\pi(y|\beta)}{\partial\beta^\top} \right\}.

12 When Θ is a compact set of R^p, a relaxation algorithm is suggested in [143] for minimax-optimal design; stochastic approximation can be used for average-optimal design, see [142].

Using the independence and normality of the errors and the fact that σ² does not depend on θ, we obtain

M_F(\xi,\beta) = \begin{pmatrix} M_F(\xi,\theta) & 0 \\ 0^\top & \frac{1}{2\sigma^4} \end{pmatrix}

with

M_F(\xi,\theta) = \mathrm{IE}_\theta\left\{ \frac{1}{N\sigma^2} \sum_{k=1}^{N} \frac{\partial e_k(\theta)}{\partial\theta}\, \frac{\partial e_k(\theta)}{\partial\theta^\top} \right\}

and e_k(θ) the prediction error e_k(θ) = G^{-1}(θ, z)[y_k − F(θ, z) u_k]. The fact that σ² is unknown therefore has no influence on the (asymptotic) precision of the estimation of θ. Assuming that the identification is performed in open loop (that is, there is no feedback) 13 and that F and G have no common parameters (that is, θ can be partitioned into θ = (θ_F^⊤ θ_G^⊤)^⊤, with p_F components in θ_F and p_G in θ_G), we then obtain

M_F(\xi,\theta) = \begin{pmatrix} M_F^F(\xi,\theta) & O \\ O & M_F^G(\xi,\theta) \end{pmatrix}

with

M_F^F(\xi,\theta) = \frac{1}{N\sigma^2} \sum_{k=1}^{N} \left[ G^{-1}(\theta,z)\,\frac{\partial F(\theta,z)}{\partial\theta_F}\, u_k \right] \left[ G^{-1}(\theta,z)\,\frac{\partial F(\theta,z)}{\partial\theta_F^\top}\, u_k \right]    (20)

and M_F^G(ξ, θ) not depending on u_k, see, e.g., [58], p. 131. The asymptotic covariance matrix M_F^{-1}(ξ, θ) is thus partitioned into two blocks, and the input sequence (u_k) has no effect on the precision of the estimation of the parameters θ_G in G. A D-optimal input sequence maximizes det M_F^F(ξ, θ) = det[(1/(Nσ²)) Σ_{k=1}^N v_k v_k^⊤], where v_k is a vector of (linearly) filtered inputs,

v_k = G^{-1}(\theta,z)\,\frac{\partial F(\theta,z)}{\partial\theta_F}\, u_k\,,    (21)

usually with power or amplitude constraints on u_k. This corresponds to an optimal control problem in the time domain, and standard techniques from control theory can be used for its solution.

13 One may refer, e.g., to Chapter 6 of [58], [56], [57], [67], [49], [50], [76] for results concerning closed-loop experiments.
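As a small illustration of (20)-(21) (added here; the first-order model and all names are assumptions, not from the original text), the sketch below computes M_F^F(ξ, θ) for a given input sequence by linear filtering, for F(θ, z) = b z^{-1}/(1 − a z^{-1}) and G(θ, z) = 1, so that θ_F = (b, a)^⊤ and v_k stacks the filtered derivatives ∂F/∂b · u_k and ∂F/∂a · u_k.

```python
import numpy as np
from scipy.signal import lfilter

# Example output-error model F(theta, z) = b z^-1 / (1 - a z^-1), G(theta, z) = 1.
a_true, b_true, sigma2 = 0.8, 1.0, 1.0
N = 1000
rng = np.random.default_rng(0)
u = rng.choice([-1.0, 1.0], size=N)          # a binary (PRBS-like), amplitude-constrained input

# v_k = G^{-1} dF/dtheta_F u_k, equation (21); here G = 1.
dF_db = lfilter([0.0, 1.0], [1.0, -a_true], u)                        # z^-1/(1 - a z^-1) u
dF_da = lfilter([0.0, 0.0, b_true], [1.0, -2*a_true, a_true**2], u)   # b z^-2/(1 - a z^-1)^2 u
V = np.column_stack([dF_db, dF_da])          # N x p_F matrix of filtered inputs

M_FF = V.T @ V / (N * sigma2)                # equation (20): (1/(N sigma^2)) sum_k v_k v_k^T
print("M_F^F(xi, theta) =\n", M_FF)
print("det M_F^F =", np.linalg.det(M_FF))    # the quantity maximized by a D-optimal input
```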

6.2 Input design in the frequency domain

We consider the same framework as in the previous section, with the information matrix of interest M_F^F(ξ, θ) given by (20). Suppose that the system output is uniformly sampled at period T and denote by M̄_F^F(ξ, θ) = lim_{N→∞} M_F^F(ξ, θ)/T the average Fisher information matrix per time unit. It can be written as

\bar M_F^F(\xi,\theta) = \frac{1}{2\pi\sigma^2} \int_{-\pi}^{\pi} P_v(\omega)\, d\omega

with P_v(ω) the power spectral density of v_k given by (21), or

\bar M_F^F(\xi,\theta) = \frac{1}{\pi} \int_{0}^{\pi} \bar M_F^F(\omega,\theta)\, P_u(\omega)\, d\omega

with P_u(ω) the power spectral density of u and

\bar M_F^F(\omega,\theta) = \frac{1}{\sigma^2}\, \mathrm{Re}\left\{ \frac{\partial F(\theta,e^{j\omega})}{\partial\theta_F}\, G^{-1}(\theta,e^{j\omega})\, G^{-1}(\theta,e^{-j\omega})\, \frac{\partial F(\theta,e^{-j\omega})}{\partial\theta_F^\top} \right\}.

The framework is thus the same as for the approximate design theory of Section 5.3: the experimental domain U becomes the frequency domain R^+ and the design measure ξ corresponds to the power spectral density P_u. An optimal input with discrete spectrum always exists; it has a finite number of support points 14 (frequencies) and associated weights (input power). The optimal input can thus be sought in the class of signals consisting of finite combinations of sinusoidal components, and the algorithms for its construction are identical to those of Section 5.3.3. Notice, however, that no approximation is now involved in the implementation of the "approximate" design. Once an optimal spectrum has been specified, the construction of a signal with this spectrum can obey practical considerations, for instance on the amplitude of the signal, see [9]. Alternatively, the input spectrum can be decomposed on a suitable basis of rational transfer functions and the optimization of P_u performed with respect to the linear coefficients of the decomposition, see [74], [75]. Notice that the problem can also be taken the other way round: one may wish to minimize the input power subject to a constraint on the precision of the estimation, expressed through M_F^{-1}(ξ, θ), see e.g. [15], [16].
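A sketch of this frequency-domain analogue is given below (an added illustration; the first-order model is the same assumption as in the time-domain sketch above). The per-frequency matrices M̄_F^F(ω, θ) are computed on a grid, and the power weights of a discrete input spectrum are adjusted with the multiplicative rule of Section 5.3.3, in the form that accepts information matrices of rank larger than one (cf. the reference to [180] above).

```python
import numpy as np

a, b, sigma2 = 0.8, 1.0, 1.0                 # nominal parameter values (local design)

def M_bar(omega):
    # per-frequency information matrix (1/sigma^2) Re{ g(w) conj(g(w))^T }, with G = 1
    z = np.exp(1j * omega)
    g = np.array([z**-1 / (1 - a * z**-1),             # dF/db
                  b * z**-2 / (1 - a * z**-1)**2])     # dF/da
    return np.real(np.outer(g, np.conj(g))) / sigma2

omegas = np.linspace(1e-3, np.pi - 1e-3, 300)   # frequency grid (the "design space")
Ms = np.array([M_bar(w) for w in omegas])
lam = np.full(len(omegas), 1.0 / len(omegas))   # power weights of the input spectrum
p = 2                                           # p_F parameters in F

for _ in range(3000):
    M = np.tensordot(lam, Ms, axes=1)           # weighted average information matrix
    Minv = np.linalg.inv(M)
    d = np.array([np.trace(Minv @ Mi) for Mi in Ms])
    lam = lam * d / p                           # multiplicative weight update

top = np.argsort(lam)[-3:]
print("dominant frequencies and weights:",
      [(round(omegas[i], 3), round(lam[i], 3)) for i in sorted(top)])
```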

The design criteria presented in Section 5.1 are related to the definition of confidence regions, or uncertainty sets, for the model parameters. When the intended application of the identification is the control of a dynamical system, it seems advisable to relate the DOE to control-oriented uncertainty sets, see in particular [53] for an inspired exposition. First note that, according to the expression (18), the variance of the transfer function F(θ, e^{jω}) at the frequency ω is approximately

V_F(\omega) = \frac{1}{N}\, \frac{\partial F(\theta,e^{j\omega})}{\partial\theta_F^\top}\, [M_F^F(\xi,\theta)]^{-1}\, \frac{\partial F(\theta,e^{j\omega})}{\partial\theta_F}\,.

14 One can show that the upper bound on their number can be reduced from p_F(p_F + 1)/2 to p_F, the number of parameters in F, see [58], p. 138.


Several H∞-related design criteria can then be derived. For instance, a robust-control constraint of the form ‖W(e^{jω}) ∆F(θ, e^{jω})/F(θ, e^{jω})‖_∞ < 1, with ∆F(θ, e^{jω})/F(θ, e^{jω}) the relative error on F(θ, e^{jω}) due to the estimation of θ and W(e^{jω}) a weighting function, leads to

z^\top(\theta,e^{j\omega})\, [M_F^F(\xi,\theta)]^{-1}\, z(\theta,e^{j\omega}) < 1 \quad \forall\omega\,,

with z(θ, e^{jω}) = (1/√N) |W(e^{jω})| [∂F(θ, e^{jω})/∂θ_F]. This type of constraint can be expressed as a linear matrix inequality in M_F^F(ξ, θ), and, using the KYP lemma, the problem can be reformulated with a finite number of constraints, see [75]. Notice that minimizing max_ω z^⊤(θ, e^{jω}) [M_F^F(ξ, θ)]^{-1} z(θ, e^{jω}) can be compared to E-optimal design, see Section 5.1, which minimizes max_{z: z^⊤z=1} z^⊤ [M_F^F(ξ, θ)]^{-1} z. When |W(e^{jω})| = 1 (uniform weighting) and G(θ, e^{jω}) = 1 (white noise), it corresponds to G-optimal design, and thus to D-optimal design, see Section 5.3.2. It is also strongly related to the minimax-optimal design of Section 5.3.5 (where the worst case is now considered with respect to ω), see [44] and [143] for algorithms. Alternatively, the asymptotic confidence regions for θ can be transformed into uncertainty sets S_F(θ, ξ) for the transfer function F(θ, e^{jω}). The worst-case ν-gap over this set can then be computed, with the property that the smaller this number, the larger the set of controllers that stabilize all transfer functions in S_F(θ, ξ) [54], [55] (see also [153] for related results). Designing experiments that minimize the worst-case ν-gap is considered in [64], where the problem is shown to be amenable to convex optimization.

The dependence of the design criteria on the unknown parameters of the model is a major issue for optimal input design, as it is more generally for models with a nonlinear parametrization (it explains why input spectra with few sinusoidal components are often considered undesirable). The methods suggested in Section 5.3.5 to face this difficulty can be applied here too. In particular, input spectra having a small number of components can be avoided by designing optimal inputs for different nominal values of θ and combining the optimal spectra that are obtained, or by using average- or minimax-optimal design [155]. One can also design the experiment sequentially (see Section 8); in general, each design step involves many observations and only a few steps are required to achieve suitable performance, see, e.g. [8].

When on-line adaptation is possible, adjusting the controller while data are collected and the uncertainty on the model decreases can be expected to achieve better performance than non-adaptive robust control. Ideally, one would wish to have uncertainty sets shrinking towards a single point representing the true model (or the model closest to the true system for the model class considered), so that a robust controller adapted to smaller and smaller uncertainty sets becomes less and less conservative. While the determination of such robust-and-adaptive controllers is still an open issue, a first step in the construction is to investigate the properties of the parameter estimates in adaptive procedures.

7 DOE in adaptive control

The results of Sections 5 and 6 rely on the asymptotic properties of the estimator: the asymptotic variance of θ̂^N_ML was supposed to be given by M_F^{-1}/N, which is true when the design (input) sequence satisfies some "stationarity" condition (the assumption of random design was used in Section 5 and a condition of persistence of excitation in Section 6). However, this condition may fail to hold: a typical example is adaptive control, where the input has another objective than estimation. The issues that it raises are investigated hereafter. We first present a series of simple examples that illustrate the variety of the difficulties.

7.1 Examples of difficulties

It is rather well known that the usual asymptotic normality of the LS estimator may fail to hold for designs such that MF(U_1^N) is nonsingular for any N but converges to a singular matrix, that is, such that λ_min[MF(U_1^N)] → 0 as N → ∞, see [131]. We shall not develop this point but rather focus on the difficulties raised by the sequential construction of the design.

Consider the following well-known example (see, e.g., [96], [99]) of a linear regression model with observations y_k = θ_1 + θ_2 u_k + ε_k, where the errors ε_k are i.i.d. with zero mean and variance 1. The input (design points u_k) satisfies u_1 = 0 and u_{n+1} = (1/n) Σ_{i=1}^n u_i + (c/n) Σ_{i=1}^n ε_i. Then one can prove that θ̂^N_{LS,1} →a.s. θ_1 + Σ_{i=1}^∞ ε_i/i and θ̂^N_{LS,2} →a.s. θ_2 − 1/c, N → ∞. That is, θ̂^N_{LS,1} converges to a random variable and θ̂^N_{LS,2} to a non-random constant different from θ_2. The non-consistency of the LS estimator is due to the dependence of u_{n+1} on previous ε_i's, that is, to the presence of feedback control (in terms of DOE, the design is sequential). Although M_N = Σ_{i=1}^N (1 u_i)^⊤ (1 u_i) is such that λ_min(M_N) → ∞, it does not grow fast enough (in particular, the information matrix M(U_1^N) = M_N/N tends to become singular). Although this example might seem quite artificial, one must notice that adaptive control as used, e.g., in self-tuning strategies may raise similar difficulties.
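The behaviour described above is easy to reproduce numerically; the sketch below (an added illustration, not from the original text, with arbitrary numerical values) simulates the feedback rule for the inputs and shows that the LS estimate of θ_2 settles near θ_2 − 1/c rather than θ_2.

```python
import numpy as np

rng = np.random.default_rng(1)
theta1, theta2, c, N = 1.0, 2.0, 0.5, 200_000

u = np.empty(N)
eps = rng.standard_normal(N)
u[0] = 0.0
sum_u = sum_eps = 0.0
for n in range(1, N):
    sum_u += u[n - 1]
    sum_eps += eps[n - 1]
    u[n] = sum_u / n + c * sum_eps / n      # u_{n+1} = (1/n) sum u_i + (c/n) sum eps_i

y = theta1 + theta2 * u + eps
X = np.column_stack([np.ones(N), u])
theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print("LS estimate of theta2:", theta_hat[1], " (theta2 - 1/c =", theta2 - 1.0 / c, ")")
```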

7.1.1 ARX model and self-tuning regulator

Consider a model with observations satisfying y_k = a_1 y_{k−1} + · · · + a_{n_a} y_{k−n_a} + b_1 u_{k−1} + · · · + b_{n_b} u_{k−n_b} + ε_k, which we can write y_k = r_k^⊤ θ + ε_k, with θ = (b_1, b_2, . . . , a_1, a_2, . . .)^⊤ and r_k = (u_{k−1}, . . . , u_{k−n_b}, y_{k−1}, . . . , y_{k−n_a})^⊤. The objective of minimum-variance control is to minimize R_N = Σ_{k=1}^N (y_k − ε_k)². The input sequence (u_k) is then said to be globally convergent if R_N/N →a.s. 0 as N → ∞, see [100], [101], [60]. If θ is known (with b_1 ≠ 0), the optimal controller corresponds to u*_k = −(a_1 y_k + · · · + a_{n_a} y_{k+1−n_a} + b_2 u_{k−1} + · · · + b_{n_b} u_{k+1−n_b})/b_1. But then r_k^⊤ θ = 0 for all k, the matrix M_N = Σ_{k=1}^N r_k r_k^⊤ is singular (since θ^⊤ M_N θ = 0) and θ is not estimable. If certainty equivalence is forced by using at step k the optimal control calculated for θ̂^k_LS, then additional perturbations must be introduced to guarantee that λ_min(M_N) tends to infinity fast enough, see, e.g., [1]. Using a persistently exciting input u_k, possibly with optimal features via the approach of Section 6, makes it possible to avoid this difficulty but is in conflict with the global convergence property [100], in particular since ‖θ‖² λ_min(M_N) ≤ R_N, see [60].

7.1.2 Self-tuning optimizer

Consider a linear regression model with observations y_k = r^⊤(u_k)θ + ε_k. The objective is to maximize a function f(u, θ) with respect to u. If θ were known, the value u* = u*(θ) = arg max_u f(u, θ) could be used (for instance, u* = −θ_1/(2θ_2) when f(u, θ) = θ_0 + θ_1 u + θ_2 u²). Since θ is unknown, it must be estimated from the observations y_k, k = 1, 2, . . . Again, the matrix M_N = Σ_{k=1}^N r(u_k) r^⊤(u_k) is singular when the control is fixed, that is, when u_k = u*(θ) (constant) for all k, and θ is then not estimable. Suppose that forced certainty equivalence is used with LS estimation, that is, u_{k+1} = u*(θ̂^k_LS). Perturbations should then be introduced to ensure consistency (e.g. randomly, see [22] for the quadratic case f(u, θ) = θ_0 + θ_1 u + θ_2 u²). The persistency of excitation is here in conflict with the performance objective (1/n) Σ_{i=1}^n f(u_i, θ) →a.s. f(u*, θ), n → ∞. Self-tuning regulation of dynamical systems is considered in [89] and [87] for continuous-time systems and in [32] for discrete-time systems. With a periodic disturbance of magnitude α playing the role of a persistently exciting input signal, the output exponentially converges to a neighborhood O(α²) of the extremum.

7.2 Nonlinear feedback control is not the answer

Nonlinear-Feedback Control (NFC) offers a set of techniques for stabilizing systems with unknown parameters, see in particular the book [88]. The stability of the closed loop is proved using Lyapunov techniques and, although not explicitly expressed in the construction of the feedback control, an estimator of the model parameters is obtained, which differs from standard estimation methods. At first sight one might think that NFC brings a suitable answer to adaptive control issues. However, stability is not consistency, and the aim of this section is to show that a direct application of NFC is bound to fail in the presence of random disturbances. Combining NFC with more traditional estimation methods and suitably exciting perturbations then offers interesting perspectives, see Section 9.

The presentation is made through (a slight modification of) one of the simplest examples in [88]. Consider the dynamical system (14), with known initial state x_0 and unknown parameter θ ∈ R. The problem is to construct a control u = u(t) that drives x to zero. (Notice that if θ were known, u = −(a + θ)x − θ with a > 0 would solve the problem, since substitution in (14) gives the stable system ẋ = −ax.) The following method is suggested in [88]: (i) construct an auxiliary controller that obeys θ̂˙ = x(x + 1); (ii) consider θ̂ as an estimator of θ and use FCE control with θ̂, that is, u = −(a + θ̂)x − θ̂, a > 0. The stability of this NFC can be checked through the behavior of the Lyapunov function V(x, θ̂) = x²/2 + (θ̂ − θ)²/2. It satisfies V̇(x, θ̂) = −ax², which implies that x tends to zero, as required. Then θ̂ + u tends to zero (from the expression of u), and θ + u also tends to zero (from the LaSalle principle). Therefore, the estimation error θ̂ − θ tends to zero 15. In the simulations that follow we simply use a discretized Euler approximation of the differential equation (14) and of the associated continuous-time controller, although it should be emphasized that some care is needed in general when implementing a digital controller on a continuous-time model, see, e.g., [122]. The discretization of (14) gives the recurrence equation (11), where x_k = x(kT) and u_k = u(kT). We take θ = 1 and x_0 = 1; the sampling period T is taken equal to 0.01 s. (Notice that the open-loop system is unstable.) The NFC is discretized as

\hat\theta_{k+1} = \hat\theta_k + T\, x_k (x_k + 1)\,, \qquad u_k = -(a + \hat\theta_k)\, x_k - \hat\theta_k\,,    (22)

where θ̂_k = θ̂(kT). We take θ̂_0 = 2 and a = 1 s⁻¹. (Although the book [88] only concerns the stabilization of continuous-time systems, one can easily check that the fixed point x_k = 0, θ̂_k = θ of the controlled discretized model above is Lyapunov-asymptotically stable.) Simulation results are presented in Figure 2. The initial decrease of the state variable (solid line) is in agreement with the time constant a⁻¹ = 1 s and, for t > 8 s, the parameter estimate (dashed line) and the state become very close to their targets, respectively θ = 1 and zero.

Suppose now that the state is observed through y_0 = x_0 and y_k = x_k + ε_k for k ≥ 1, where (ε_k) denotes a sequence of i.i.d. normal errors N(0, σ²) (setting σ² = S²T, one may suppose for instance that ε_k is S times the increment of a realization of the standard Brownian motion over an interval of length T). We take σ = x_0/2 = 0.5, a rather extreme situation, to emphasize the influence of measurement errors. The evolutions of x_k (dash-dotted line) and θ̂_k (dotted line) when y_k is substituted for x_k in (22) are also presented in Figure 2: the sequence of parameter estimates does not converge, the state fluctuates and is clearly not driven to zero.

15 In the example on p. 3-4 of [88], ẋ = u + θx, θ̂˙ = x² and u = −(a + θ̂)x, a > 0, so that x tends to zero but not necessarily the estimation error θ̂ − θ.

Fig. 2. Evolution of x_k (solid line) and θ̂_k (dashed line) as functions of k for the system (11) with NFC (22) (θ = 1, θ̂_0 = 2, x_0 = 1, a = 1, sampling period T = 0.01 s). The curves in dash-dotted line and dotted line respectively show x_k and θ̂_k when y_k is substituted for x_k in (22).
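The simulation just described is easy to reproduce; the following sketch (an illustration added here, using the numerical values quoted in the text but with assumed function names) runs the discretized NFC (22) on the Euler discretization (11) of the system, first with the true state and then with the noisy observations y_k substituted for x_k.

```python
import numpy as np

def run_nfc(noisy, theta=1.0, theta_hat0=2.0, x0=1.0, a=1.0, T=0.01,
            n_steps=2000, sigma=0.5, seed=0):
    rng = np.random.default_rng(seed)
    x, th = x0, theta_hat0
    xs, ths = [x], [th]
    for k in range(n_steps):
        obs = x + (sigma * rng.standard_normal() if (noisy and k > 0) else 0.0)
        u = -(a + th) * obs - th                 # control of (22), using obs in place of x_k
        th = th + T * obs * (obs + 1.0)          # auxiliary estimator update of (22)
        x = x + T * (u + theta * (x + 1.0))      # Euler discretization (11) of dx/dt = u + theta(x+1)
        xs.append(x)
        ths.append(th)
    return np.array(xs), np.array(ths)

x_clean, th_clean = run_nfc(noisy=False)
x_noisy, th_noisy = run_nfc(noisy=True)
print("noise-free : x_final = %.3f, theta_hat_final = %.3f" % (x_clean[-1], th_clean[-1]))
print("noisy obs. : x_final = %.3f, theta_hat_final = %.3f" % (x_noisy[-1], th_noisy[-1]))
```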

7.3 Some consistency results

The difficulties encountered in Sections 7.1.1, 7.1.2 and 7.2 are general in regulation-type problems: in order to satisfy the control objective, the input should asymptotically vanish, which does not bring enough excitation to guarantee the consistent estimation of the model parameters. The control objective is thus in conflict with parameter estimation, and perturbations must be introduced. It is then important to know the minimal amount of perturbation required to ensure consistency of the estimator on which the control is based. Some results are presented below for the case of linear regression.

7.3.1 LS estimation

Consider a linear regression model with observations y_k = r_k^⊤ θ + ε_k, and denote by F_k the σ-algebra generated by the errors ε_1, . . . , ε_k. They are supposed to form a martingale difference sequence (ε_k is F_k-measurable and IE{ε_k|F_{k−1}} = 0) and to be such that sup_k IE{ε_k²|F_{k−1}} < ∞ (with i.i.d. errors with zero mean and finite variance as a special case). Let M_N = Σ_{k=1}^N r_k r_k^⊤; then M_N^{-1} → 0 for N → ∞ is

• sufficient for θ̂^N_LS →a.s. θ when the regressors r_k are non-random constants, see [97], [98];
• necessary and sufficient if, moreover, the errors ε_k are i.i.d.;
• but M_N^{-1} →a.s. 0 is not sufficient for θ̂^N_LS →a.s. θ if r_k is F_{k−1}-measurable (see the first example of Section 7.1).

In the latter situation, a sufficient condition for θ̂^N_LS →a.s. θ when N → ∞ is that λ_min(M_N) →a.s. ∞ and [log λ_max(M_N)]^{1+δ} = o[λ_min(M_N)] a.s. for some δ > 0, see [99]. In some sense, this is the best possible condition: it is only marginally violated in the first example of Section 7.1, where [log λ_max(M_N)]/λ_min(M_N) tends a.s. to a random constant. Note that this condition is much weaker than persistence of excitation, which requires that M_N grow at the same speed as N.

7.3.2 Bayesian imbedding

An even weaker condition is obtained for Bayesian estimation. Let π be a prior probability measure for θ and denote by P the probability measure induced by the errors ε_k, k = 1, . . . , ∞. Denote by F'_k the σ-algebra generated by the observations y_1, . . . , y_k and suppose that r_k is F'_{k−1}-measurable. Suppose that the parameters are estimated by the posterior mean θ̂^N_B = IE{θ|F'_N} and denote by C_N = Var(θ|F'_N) the posterior covariance matrix. Then, from martingale theory, θ̂^N_B and C_N both converge (π × P)-a.s. when N → ∞, see [174], and all that is required for the (π × P)-a.s. consistency of the estimator is C_N → 0 (π × P)-a.s. Now, for a linear regression model with i.i.d. normal errors ε_k and a normal prior for θ, Bayesian estimation coincides with LS estimation (when the prior for θ is suitably chosen), C_N is proportional to M_N^{-1} and therefore M_N^{-1} → 0 (π × P)-a.s. is sufficient for θ̂^N_LS = θ̂^N_B → θ (π × P)-a.s. The required condition is thus as weak as when the regressors r_k are non-random constants! Note, however, that the convergence is almost sure with respect to θ having the prior π; that is, singular values of θ may exist for which consistency does not hold 16.

This very powerful technique, which analyses the properties of LS estimation via a Bayesian approach, is called Bayesian imbedding, see [174], [93]. Although in its original formulation it requires the measurement errors to be normal, the normality assumption is relaxed in [68] to the condition that the density ϕ of the errors is log-concave (d² log ϕ(t)/dt² < 0) and strictly positive with respect to the Lebesgue measure µ, the prior measure π being absolutely continuous with respect to µ. More generally, the consistency of Bayesian estimators can be checked through the behavior of the posterior covariance matrices, see [69]. Bayesian imbedding allows for easier proofs of consistency of the estimator and makes it possible to relax the conditions on the perturbations required to obtain consistency. This is illustrated below by revisiting the examples of Sections 7.1.1 and 7.1.2.

16 In the first example of Section 7.1, r_k is not F'_{k−1}-measurable since u_k is not obtained from previous observations. Modify the control into u_{n+1} = α_1 + (α_2/n) Σ_{i=1}^n u_i + (c/n) Σ_{i=1}^n y_i, which is F'_n-measurable. Then θ̂^N_LS is not consistent when θ takes the particular value θ_1 = −α_1/c, θ_2 = (1 − α_2)/c, so that the control coincides with the previous one, u_{n+1} = (1/n) Σ_{i=1}^n u_i + (c/n) Σ_{i=1}^n ε_i.


Consider again the self-tuning regulator of Section 7.1.1. When LS estimation is used with forced certainty-equivalence control, the system must be perturbed to obtain a globally convergent input. It can be shown [94] that the control objective R_n grows at least as log n, and randomly perturbed input sequences achieving this performance are proposed in [101]. Using Bayesian imbedding, global convergence can be obtained without the introduction of perturbations, see [93].

For the self-tuning optimizer of Section 7.1.2, Astrom and Wittenmark [2] have suggested a control of the type u_{k+1} = arg max_u {f(u, θ̂^k) + α_k d(u, ξ_k)/k}, where d(u, ξ) = r^⊤(u) M^{-1}(ξ) r(u) is the function (17) used in D-optimal design, ξ_k is the empirical measure of the inputs u_1, . . . , u_k and (α_k) is a sequence of positive scalars. Note that d(u, ξ_k)/k = r^⊤(u) M_k^{-1} r(u) with M_k = Σ_{i=1}^k r(u_i) r^⊤(u_i). This strategy makes a compromise between optimization (maximization of f(u, θ̂^k), for α_k small) and estimation (D-optimal design, for α_k large). Using the results of Section 7.3.1, the following is proved in [135] for LS estimation. When the errors ε_k form a martingale difference sequence with sup_k IE{ε_k²|F_{k−1}} < ∞, if (α_k/k) log α_k is monotonically decreasing and α_k/(log k)^{1+δ} monotonically increases to infinity for some δ > 0, then θ̂^k_LS →a.s. θ, (1/k) Σ_{i=1}^k f(u_i, θ) →a.s. f(u*, θ) = max_u f(u, θ) and ξ_k →a.s. δ_{u*} (weak convergence of probability measures) as k → ∞. That is, the LS estimator is strongly consistent and, at the same time, the design points u_k tend to concentrate at the optimal location u*. Using Bayesian imbedding, the same results are obtained when the conditions above on α_k are relaxed to α_k → ∞, α_k/k → 0, provided the errors ε_k are i.i.d. N(0, σ²), see [141].

7.4 Finite horizon: dynamic programming and dual control

The presentation is for self-tuning optimization, but the problem is similar for other adaptive control situations. Suppose one wishes to maximize Σ_{i=1}^N w_i f(u_i, θ) for some sequence of positive weights w_i, with θ unknown and estimated through observations y_i = η(θ, u_i) + ε_i. Let π denote a prior probability measure for θ and define U_1^k = (u_1, . . . , u_k), Y_1^k = (y_1, . . . , y_k) for all k. The problem to be solved can then be written as

\max_{u_1} \mathrm{IE}\Big\{ w_1 f(u_1,\theta) + \max_{u_2} \mathrm{IE}\big\{ w_2 f(u_2,\theta) + \cdots + \max_{u_{N-1}} \mathrm{IE}\{ w_{N-1} f(u_{N-1},\theta) + \max_{u_N} \mathrm{IE}\{ w_N f(u_N,\theta) \,|\, U_1^{N-1}, Y_1^{N-1} \} \,|\, U_1^{N-2}, Y_1^{N-2} \} \cdots \,\big|\, u_1, y_1 \Big\}    (23)

and thus corresponds to a Stochastic Dynamic Programming (SDP) problem. It is, in general, extremely difficult to solve due to the presence of imbedded maximizations and expectations. The control u_k has a dual effect (see e.g. [7]): it affects both the value of f(u_k, θ) and the future uncertainty on θ through the posterior measures π(θ|U_1^i, Y_1^i), i ≥ k. One of the main obstacles being the propagation of these measures, classical approaches are based on their approximation. Consider stage k, where U_1^k and Y_1^k are known. Then:

• Forced Certainty Equivalence control (FCE) replaces π(θ|U_1^i, Y_1^i) for i ≥ k (a "future posterior" for i > k) by the delta measure δ_{θ̂^k}, where θ̂^k is the current estimated value of θ (see the examples of Sections 7.1.1 and 7.1.2);
• Open-Loop-Feedback-Optimal control (OLFO) replaces π(θ|U_1^i, Y_1^i), i ≥ k, by the current posterior measure π(θ|U_1^k, Y_1^k) (moreover, most often this posterior is approximated by a normal distribution N(θ̂^k, C_k)).

The FCE and OLFO control strategies can be considered as passive, since they ignore the influence of u_{k+1}, u_{k+2}, . . . on the future posteriors π(θ|U_1^i, Y_1^i), see, e.g., [179]. On the other hand, they yield a drastic simplification of the problem, since the approximation of π(θ|U_1^i, Y_1^i) for i > k does not depend on the future observations y_{k+1}, y_{k+2}, . . . This, and the fact that few active alternatives exist, explains their frequent usage.

The active-control strategy suggested in [178] is based on a linearization around a nominal trajectory u(i) and extended Kalman filtering. It does not seem to have been much employed, probably due to its rather high complexity. A modification of OLFO control is proposed in [137]. It takes a very simple form when the model response η(θ, u) is linear in θ, that is, η(θ, u) = r^⊤(u)θ, the errors are i.i.d. normal N(0, σ²) and the prior for θ is also normal. Then, at stage k, the posterior π(θ|U_1^i, Y_1^i) is the normal N(θ̂^k_B, C_k) for i = k and can be approximated by N(θ̂^k_B, C_i) for i > k, where θ̂^k_B and C_k are known (computed by classical recursive LS) and C_i follows a recursion similar to that of recursive LS,

C_{i+1} = C_i - \frac{C_i\, r(u_{i+1})\, r^\top(u_{i+1})\, C_i}{\sigma^2 + r^\top(u_{i+1})\, C_i\, r(u_{i+1})}\,, \quad i \ge k\,.

Note that C_i depends on u_{k+1}, u_{k+2}, . . . , u_i (which makes the strategy active), but not on y_{k+1}, y_{k+2}, . . . (which makes it implementable). This method has been successfully applied to the adaptive control of models with a FIR, ARX, or state-space structure, see, e.g., [91], [92]. It requires, however, that the objective function f(u, θ) in (23) be nonlinear in θ in order to express the dependence on the covariance matrices C_i. Indeed, suppose that in the self-tuning optimization problem the function to be maximized is the model response itself, that is, f(u, θ) = r^⊤(u)θ. Then IE{f(u, θ)|U_1^i, Y_1^i} = r^⊤(u) θ̂^i_B and, using the approximation N(θ̂^k_B, C_i) for the future posteriors π(θ|U_1^i, Y_1^i), i > k, one gets classical FCE control based on the Bayesian estimator θ̂^k_B. On the other hand, it is possible in that case to take advantage of the linearity of the function and obtain an approximation of IE{max_u r^⊤(u) θ̂^{N−1}_B | U_1^{N−2}, Y_1^{N−2}} for small σ², which can then be back-propagated; see [141], where a control strategy is given that is within O(σ⁴) of the optimal (unknown) strategy u*_k for the SDP problem (23).
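A minimal sketch of the covariance propagation used by this modified OLFO strategy is given below (an added illustration with hypothetical names and values): starting from the current posterior covariance C_k, candidate plans of future inputs are compared through the deterministic recursion above, without needing the corresponding observations.

```python
import numpy as np

def propagate_cov(C_k, future_inputs, r, sigma2=1.0):
    # C_{i+1} = C_i - C_i r r^T C_i / (sigma^2 + r^T C_i r): depends on the planned
    # inputs u_{k+1}, ..., u_i but not on the (unavailable) future observations.
    C = C_k.copy()
    for u in future_inputs:
        ru = r(u)
        Cr = C @ ru
        C = C - np.outer(Cr, Cr) / (sigma2 + ru @ Cr)
    return C

r = lambda u: np.array([1.0, u, u**2])    # regressors of a response linear in theta
C_k = np.eye(3)                           # current posterior covariance (illustrative)

# Compare two candidate plans for the next three inputs through, e.g., det C (a D-type criterion).
for plan in ([0.4, 0.5, 0.6], [-1.0, 0.0, 1.0]):
    C_end = propagate_cov(C_k, plan, r)
    print("plan", plan, "-> det C =", np.linalg.det(C_end))
```

The spread-out plan yields the smaller final covariance determinant, which is precisely the kind of dependence on future inputs that the passive FCE and OLFO approximations ignore.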

8 Sequential DOE

Consider a nonlinear regression model for which the optimal design problem consists in minimizing Ψθ(U_1^N) = Φ[MF(U_1^N, θ)] for some criterion Φ, with θ unknown. In order to design an experiment adapted to θ, a natural approach consists in working sequentially. In full-sequential design, one support point u_k is introduced after each observation: θ̂^{k−1} is estimated from the data (Y_1^{k−1}, U_1^{k−1}) and the next u_k minimizes Φ[MF(U_1^{k−1}, u_k, θ̂^{k−1})] (for D-optimal design, this is equivalent to choosing the u_k that maximizes d_{θ̂^{k−1}}(u_k, ξ_{k−1}), with dθ(u, ξ) the function (17) and ξ_{k−1} the empirical measure of the design points in U_1^{k−1}). Note that this may be considered as a FCE control strategy, where the input (design point) at step k is based on the current estimated value θ̂^{k−1}. For a finite horizon N (the number of observations), the problem is similar to that of Section 7.4 (self-tuning optimizer), with the design objective Φ[MF(U_1^N, θ)] substituted for Σ_{i=1}^N w_i f(u_i, θ). Although the objective does not take an additive form, the problem is still of the SDP type, and active-control strategies can thus be constructed to approximate the optimal solution. However, they seem to provide only marginal improvements over traditional passive strategies like FCE control, see e.g. [51] 17.

Although a sequentially designed experiment for the minimization of Φ[MF(U_1^N, θ)] aims at estimating θ with maximum possible precision, it is difficult to establish that θ̂^N →a.s. θ as N → ∞ (and thus ξ_N →a.s. ξ*(θ), with ξ*(θ) the optimal design for θ) when full-sequential design is used; see [191] for a simple example (with a positive answer) for LS estimation. When full-sequential design is based on Bayesian estimation (posterior mean), strong consistency can be proved if the optimal design ξ*(θ) satisfies an identifiability condition for any θ, see [70] (this is related to the Bayesian imbedding considered in Section 7.3.2). The asymptotic analysis of multi-stage sequential design is considered in [28] and the construction of asymptotically optimal sequential design strategies in [172], where it is shown that using two stages is enough. Practical experience tends to confirm the good performance of two-stage procedures, see, e.g., [8].

17 An active strategy aims at taking into account the influence of current decisions on the future precision of estimates; in that sense DOE is naturally active by definition, even if based on FCE control. Trying to make sequential DOE more active is thus doomed to small improvements.

9 Concluding remarks and perspectives in DOE

Correlated errors. Few results exist on DOE in the presence of correlated observations, and one can refer e.g. to [127], [117], [45] and the monograph [116] for recent developments. See also Section 3.2.2. The situation is different in the adaptive control community, where correlated errors are classical, see Section 7.3.1 (for instance, the paper [123] gives results on strong laws of large numbers for correlated sequences of random variables under rather common assumptions in signal or control applications), which calls for appropriate developments in DOE. Notice that when the correlation of the error process decays at a hyperbolic rate (long-range dependence), the asymptotic theory of parameter estimation in regression models (Section 4) must itself be revisited, see, e.g., [73].

Nonlinear models. The presentation in Sections 6 and 7 has concerned models with linear dynamics (e.g. with a Box and Jenkins structure), but models that correspond to nonlinear differential or recurrence equations raise no special difficulty for the construction of the Fisher information matrix (Section 6), which can always be obtained through simulations. The main issue concerns linearity with respect to the model parameters θ. In particular, few results exist concerning the extension of the results of Section 7.3 to models with a nonlinear parametrization (see [95] for LS estimation and [71] for results on Bayesian imbedding when θ has a discrete prior).

DOE without persistence of excitation. In the context of self-tuning regulation, we mentioned in Section 7.3.2 that random perturbations may be added to certainty-equivalence control based on LS estimation to guarantee the strong consistency of parameter estimates and the asymptotically optimal growth of the control objective, see [101]. This is an example of a situation where "non-stationary experiments" could be designed in order to replace random perturbations by inputs with a suitable spectrum and asymptotically vanishing amplitude. In the same vein, the modified OLFO control proposed in [137] and the small-noise approximation of [141] (designed for self-tuning optimization, but extendable to self-tuning regulation) make a good compromise between exploration and exploitation when the horizon is finite (see Section 7.4). An asymptotic analysis for the horizon tending to infinity could make it possible to design simpler non-stationary strategies.

Nonparametric models, active learning and control. Strategies are called active in opposition to passive ones that collect data "as they come". DOE is thus intrinsically active, and its use in learning leads to methods that try to select training samples instead of taking them randomly. Although its usefulness is now well perceived in the statistical learning community, it is still at an early stage of development due to the complexity of DOE for nonparametric models. More generally, active strategies are valuable each time actions or decisions have a dual effect and a compromise should be made between exploration and exploitation: exploration may be done at random, but better performance is achieved when it is carefully planned. For instance, active strategies connected with Markov decision theory could yield improvements in reinforcement learning, see e.g. the survey [80].

Linking nonparametric estimation with control forms a quite challenging area, where the issues raised by the estimation of the function that defines the dynamics of the system come in addition to those, more classical, of adaptive control with parametric models, see, e.g., [134], [133] for emerging developments.

Algorithms for optimal DOE. The importance of constructing criteria for DOE in relation with the intended objective has been underlined in Section 6.2, where criteria of the minimax type have been introduced from robust-control considerations. (Minimax-optimal design is also an efficient way to face the dependence of local optimal design on the unknown value of the model parameters, see Section 5.3.5.) Although the minimax problem can often be formulated as a convex one, sometimes with a finite number of constraints, the development of specific algorithms would greatly benefit the diffusion of the methodology, in the same way as the classical design algorithms of Sections 5.2 and 5.3.3 have contributed to the diffusion of optimal DOE outside the statistical community where it originated.

Another view on global optimization. Let f(u) be a function to be maximized with respect to u in some given set U; it is not assumed to be concave, nor is the set U assumed to be convex, so that local maxima may exist. The function can be evaluated at any given input u_i ∈ U, which gives an "observation" y(u_i) = f(u_i). In engineering applications where the evaluation of f corresponds to the execution of a large simulation code, expensive in terms of computing time, it is of paramount importance to use an optimization method that is parsimonious in terms of the number of function evaluations. This enters into the framework of computer experiments, where Kriging is now a rather well-established tool for modelling, see Section 3.1. Using a Bayesian point of view, the value f(u) after the collection of the data D_k = {[u_1, y(u_1)], . . . , [u_k, y(u_k)]} can be considered as distributed with the density ϕ(y|D_k, u) of the normal distribution N(ŷ_{D_k}(u), ρ²_{D_k}(u)), where ŷ_D(u) and ρ²_D(u) are respectively given by (3) and (4). An optimization algorithm that uses this information should then make a compromise between exploration (trying to reduce the MSE ρ²_D(u) by placing observations at values of u where ρ²_D(u) is large) and exploitation (trying to maximize the expected response ŷ_D(u)). A rather intuitive method is to choose the u_{k+1} that maximizes ŷ_{D_k}(u) + α ρ_{D_k}(u) for some positive constant α, see [35]. In theory, Stochastic Dynamic Programming could be used to find the optimal strategy (or algorithm) to maximize f(u): when the number N of evaluations is given in advance, the problem takes the same form as in (23) with w_i = 0 for i = 1, . . . , N − 1. However, in practice this SDP problem is much too difficult to solve, and approximations must be used to define suboptimal searching rules. For instance, one may use a one-step-ahead approach and choose the input u_{k+1} that maximizes the expected improvement

EI(u) = \int_{y_k^{\max}}^{\infty} [y - y_k^{\max}]\, \varphi(y|D_k,u)\, dy\,,

with y_k^max = max{y(u_1), . . . , y(u_k)} the maximum value of f observed so far, see [110], [109], [161]. The function f is then evaluated at u_{k+1}, the Kriging model is updated (although not necessarily at each iteration), and similar steps are repeated. Each iteration of the resulting algorithm requires one evaluation of f and the solution of another global optimization problem, for which any ad hoc global search algorithm can be used (the optimization concerns the function EI, which is easier to evaluate than f). For instance, it is suggested in [11] to update a Delaunay triangulation of the set U based on the vertices u_i, i = 1, . . . , k, and to perform the global search for the maximization of EI(u) by initializing local searches at the centers of the Delaunay cells. Note that the algorithm tends asymptotically to observe everywhere in U, since the expected improvement EI(u) is always strictly positive at any value u where no observation has been taken yet. However, a credible stopping rule is given by the criterion itself: it is reasonable to stop when the expected improvement becomes small enough. One can refer to [161], [79] for detailed implementations, including problems with constraints also defined by simulation codes. Derivative information on f can be included in the Kriging model, as indicated in Section 3.1, and thus used by the optimization algorithm, see [102]. It seems that suboptimal searching rules looking further than one step ahead have never been used, which forms a rather challenging objective for active control. Also, the one-step-ahead method above is completely passive with respect to the estimation of the parameters of the Kriging model, and active strategies (even one-step-ahead) are still to be designed, see Section 3.2.2. Note finally that the definition of the expected improvement EI is not adapted to the presence of errors in the evaluation of f, so that further developments are required for situations where one optimizes the observed response of a real physical process.
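Under the Gaussian predictive distribution above, EI(u) has a well-known closed form, which the sketch below uses (an added illustration; the Kriging predictor ŷ_Dk(u) and MSE ρ²_Dk(u) of equations (3)-(4) are taken as given and replaced here by hypothetical placeholder functions).

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(y_hat, rho, y_max):
    # EI(u) = (yhat - ymax) Phi(z) + rho phi(z), with z = (yhat - ymax)/rho,
    # for a Gaussian predictive density N(yhat, rho^2).
    if rho <= 0.0:                        # no predictive uncertainty left at this u
        return max(y_hat - y_max, 0.0)
    z = (y_hat - y_max) / rho
    return (y_hat - y_max) * norm.cdf(z) + rho * norm.pdf(z)

# Placeholders standing in for the Kriging predictor (3) and its MSE (4).
y_hat_D = lambda u: np.sin(3.0 * u)                  # stands for \hat y_{D_k}(u)
rho_D   = lambda u: 0.3 * (1.0 + np.abs(u - 0.5))    # stands for rho_{D_k}(u)

u_grid = np.linspace(0.0, 1.0, 501)
y_max_k = 0.8                                         # best value observed so far
ei = np.array([expected_improvement(y_hat_D(u), rho_D(u), y_max_k) for u in u_grid])
u_next = u_grid[int(np.argmax(ei))]
print("next evaluation point u_{k+1} =", round(u_next, 3), " EI =", round(ei.max(), 4))
```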

NFC, FCE, estimating functions and DOE. Consider again the example of NFC in Section 7.2, in the case where the state x_k is observed through y_k = x_k + ε_k and y_k is substituted for x_k in (22). As shown in Figure 2, the NFC estimator θ̂_k is then not consistent. On the other hand, in the same situation more classical estimation techniques behave satisfactorily (hence, in terms of DOE, NFC brings enough information to estimate the unknown θ). Consider for instance the LS estimator of θ in (11), obtained from y_1, . . . , y_k. Since x_k is nonlinear in θ, recursive LS cannot be used directly 18, but the estimation becomes almost recursive using a stochastic-Newton algorithm, see, e.g., [188], p. 208. Figure 3 (top) shows that the corresponding LS estimator converges quickly to the true value θ = 1. The evolution of the estimator (13) obtained from an estimating function is presented in the same figure (bottom). Its convergence is slower than that of the LS estimator, due in particular to the presence of the term y_k/(kT) in the numerator of (13), but its construction is much simpler. An important consequence is that the analysis of its asymptotic behavior in an adaptive-control framework is easier than for LS estimation: the estimator (13), denoted θ̃_k below, is consistent for θ when (1/k) Σ_{i=1}^{k−1} x_i is bounded away from −1 (which is the case since the control drives this quantity to zero). Simulations confirm that when applying the FCE controller u_k = −(a + θ̃_k) x_k(θ̃_k) − θ̃_k to (11), with x_k(θ̃_k) obtained by substituting θ̃_k for θ in (11), θ̃_k converges to θ and the state x_k is correctly driven to zero. Simulations show, however, that the dynamics of the state is slower than with NFC; it is thus tempting to combine both strategies. For instance, one could use FCE control based on θ̃_k when the standard deviation of θ̃_k is smaller than some prescribed value, and use NFC otherwise. The simulation results obtained with such a switching strategy are encouraging and indicate that combining different estimation methods may improve the performance of the controller. At the same time, this raises more issues than it brings answers. Some are listed below.

Fig. 3. Top: evolution of the LS estimator of θ; bottom: evolution of the estimator θ̃_k given by (13).

18 The situation would be much easier for the autoregressive model y_{i+1} = y_i + T[u_i + θ(y_i + 1)] + ε_{i+1}, where θ could be estimated by recursive LS. Note that this model can be considered as resulting from the discretization of ẏ = u + θ(y + 1) + S dB_t(ω), with B_t(ω) the standard Brownian motion (starting at zero with variance 1), which corresponds to the introduction of process noise into (14), and gives ε_{i+1} = S[B_{(i+1)T}(ω) − B_{iT}(ω)].

Combining NFC, which relies on Lyapunov stability, with simple predictors, e.g. based on FCE, while preserving stability, forms an interesting challenge, for which results on input-to-state stabilizing control could be used (see, e.g., Chapters 5 and 6 of [88] for continuous-time and [121] for discrete-time control). In classical FCE control the consistency of the estimator is a major issue. Using suitable estimating functions could then lead to fruitful developments, due to the flexibility of the method and the simplicity of the associated estimators. Suitably designed perturbations could be introduced to help the estimation, possibly following developments similar to those that led to the active-control strategies of Sections 7.3.2 and 7.4. At the same time, the perturbed control should not endanger the stability of the system. Designing input sequences (possibly vanishing with time) that bring maximum information for estimation subject to a stability constraint forms an unusual and challenging DOE problem. Finally, as a first step towards the design of the robust-and-adaptive controllers mentioned at the end of Section 6.2, one may replace a traditional FCE controller by one that gives the best performance for the worst model in the current uncertainty set (roughly speaking, for the self-tuning problem considered in Section 7.4 this amounts to replacing the expectations in (23) by minimizations with respect to θ over the current uncertainty set). The determination of active-control strategies for such minimax (dynamical games) problems seems to be a promising direction for developments in adaptive control.

Acknowledgements

This work was partially supported by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. This publication only reflects the author's view. The paper is partly based on a three-hour mini-course given at the 24th Benelux Meeting at Houffalize, Belgium, on March 22-24, 2005. It has benefited from several interesting and motivating discussions during that meeting, and the organizers of the event are gratefully acknowledged for their invitation. Many ideas and references result from many years of collaboration with Eric Walter (CNRS/SUPELEC/Universite Paris XI, France), Andrej Pazman (Comenius University, Bratislava, Slovakia), Henry P. Wynn (LSE, London, UK) and Anatoly A. Zhigljavsky (Cardiff University, UK). Sections 7.3.2 and 9 have benefited from several discussions with Eric Thierry and Tarek Hamel at I3S. The encouraging comments and suggestions of several referees were also very helpful.

References

[1] K.J. Astrom and B. Wittenmark. On self-tuning regulators. Automatica, 9:195-199, 1973.


[2] K.J. Astrom and B. Wittenmark. Adaptive Control. Addison Wesley, 1989.

[3] A.C. Atkinson and D.R. Cox. Planning experiments for discriminating between models (with discussion). Journal of Royal Statistical Society, B36:321–348, 1974.

[4] A.C. Atkinson and A.N. Donev. Optimum Experimental Design. Oxford University Press, 1992.

[5] A.C. Atkinson and V.V. Fedorov. The design of experiments for discriminating between two rival models. Biometrika, 62(1):57–70, 1975.

[6] A.C. Atkinson and V.V. Fedorov. Optimal design: experiments for discriminating between several models. Biometrika, 62(2):289–303, 1975.

[7] Y. Bar-Shalom and E. Tse. Dual effect, certainty equivalence, and separation in stochastic control. IEEE Transactions on Automatic Control, 19(5):494–500, 1974.

[8] M. Barenthin, H. Jansson, and H. Hjalmarsson. Applications of mixed H2 and H∞ input design in identification. In Proc. 16th IFAC World Congress on Automatic Control, Prague, 2005. CD-ROM – Paper 03882.

[9] H.A. Barker and K.R. Godfrey. System identification with multi-level periodic perturbation signals. Control Engineering Practice, 7:717–726, 1999.

[10] P.L. Bartlett. Prediction algorithms: complexity, concentration and convexity. In Prep. 13th IFAC Symposium on System Identification, Rotterdam, pages 1507–1517, August 2003.

[11] R. Bates and L. Pronzato. Emulator-based global optimisation using lattices and Delaunay tesselation. In P. Prado and R. Bolado, editors, Proc. 3rd Int. Symp. on Sensitivity Analysis of Model Output, pages 189–192, Madrid, June 2001.

[12] S. Biedermann and H. Dette. Minimax optimal designs for nonparametric regression — a further optimality property of the uniform distribution. In A.C. Atkinson, P. Hackl, and W.G. Muller, editors, mODa'6 – Advances in Model–Oriented Design and Analysis, Proceedings of the 6th Int. Workshop, Puchberg/Schneeberg (Austria), pages 13–20, Heidelberg, June 2001. Physica Verlag.

[13] D. Bohning. Likelihood inference for mixtures: geometrical and other constructions of monotone step-length algorithms. Biometrika, 76(2):375–383, 1989.

[14] X. Bombois, B.D.O. Anderson, and M. Gevers. Quantification of frequency domain error bounds with guaranteed confidence level in prediction error identification. Systems & Control Letters, 54:471–482, 2005.

[15] X. Bombois, G. Scorletti, M. Gevers, R. Hildebrand, and P. Van den Hof. Cheapest open-loop identification for control. In Proc. 43rd Conf. on Decision and Control, pages 382–387, The Bahamas, December 2004.

[16] X. Bombois, G. Scorletti, M. Gevers, P. Van den Hof, and R. Hildebrand. Least costly identification experiment for control. Automatica, 42(10):1651–1662, 2006.

[17] G.E.P. Box and W.J. Hill. Discrimination among mechanistic models. Technometrics, 9(1):57–71, 1967.

[18] G.E.P. Box and K.B. Wilson. On the experimental attainment of optimum conditions (with discussion). Journal of Royal Statistical Society, B13(1):1–45, 1951.

[19] M.J. Box. Bias in nonlinear estimation. Journal of Royal Statistical Society, B33:171–201, 1971.

[20] S. Boyd, L. El Ghaoui, E. Feron, and V. Balakrishnan. Linear Matrix Inequalities in System and Control Theory. SIAM, Philadelphia, 1994.

[21] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, 2004.

[22] A.S. Bozin and M.B. Zarrop. Self tuning optimizer — convergence and robustness properties. In Proc. 1st European Control Conf., pages 672–677, Grenoble, July 1991.

[23] L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.

[24] P.E. Caines. Linear Stochastic Systems. Wiley, New York, 1988.

[25] M.C. Campi and E. Weyer. Guaranteed non-asymptotic confidence regions in system identification. Automatica, 41(10):1751–1764, 2005.

[26] K. Chaloner and K. Larntz. Optimal Bayesian design applied to logistic regression experiments. Journal of Statistical Planning and Inference, 21:191–208, 1989.

[27] K. Chaloner and I. Verdinelli. Bayesian experimental design: a review. Statistical Science, 10(3):273–304, 1995.

[28] P. Chaudhuri and P.A. Mykland. Nonlinear experiments: optimal design and inference based likelihood. Journal of the American Statistical Association, 88(422):538–546, 1993.

[29] C.S. Chen. Optimality of some weighing and 2^n fractional factorial designs. Annals of Statistics, 8:436–446, 1980.

[30] M.-Y. Cheng, P. Hall, and D.M. Titterington. Optimal design for curve estimation by local linear smoothing. Bernoulli, 4(1):3–14, 1998.

[31] H. Chernoff. Locally optimal designs for estimating parameters. Annals of Math. Stat., 24:586–602, 1953.

[32] J.-Y. Choi, M. Krstic, K.B. Ariyur, and J.S. Lee. Extremum seeking control for discrete-time systems. IEEE Transactions on Automatic Control, 47(2):318–323, 2002.

[33] D.A. Cohn. Neural network exploration using optimal experiment design. In J. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems 6. Morgan Kaufmann, 1994.

[34] D.A. Cohn, Z. Ghahramani, and M.I. Jordan. Active learning with statistical models. Journal of Artificial Intelligence Research, 4:129–145, 1996.

[35] D.D. Cox and S. John. A statistical method for global optimization. In Proc. IEEE Int. Conf. on Systems Man and Cybernetics, volume 2, pages 1241–1246, Chicago, IL, October 1992.

[36] D.R. Cox and D.V. Hinkley. Theoretical Statistics. Chapman & Hall, London, 1974.

[37] F. Cucker and S. Smale. On the mathematical foundations of learning. Bulletin of the AMS, 39(1):1–49, 2001.

[38] C. Currin, T.J. Mitchell, M.D. Morris, and D. Ylvisaker. Bayesian prediction of deterministic functions, with applications to the design and analysis of computer experiments. Journal of the American Statistical Association, 86:953–963, 1991.

[39] D.Z. D’Argenio. Optimal sampling times for pharmacokinetic experiments. Journal of Pharmacokinetics and Biopharmaceutics, 9(6):739–756, 1981.

[40] D. den Hertog. Interior Point Approach to Linear, Quadratic and Convex Programming. Kluwer, Dordrecht, 1994.

[41] J.J. Faraway. Sequential design for the nonparametric regression of curves and surfaces. Computing Science and Statistics, 22:104–110, 1990.

[42] V.V. Fedorov. Theory of Optimal Experiments. Academic Press, New York, 1972.

[43] V.V. Fedorov. Convex design theory. Math. Operationsforsch. Statist., Ser. Statistics, 11(3):403–413, 1980.

[44] V.V. Fedorov and P. Hackl. Model-Oriented Design of Experiments. Springer, Berlin, 1997.

[45] V.V. Fedorov and W.G. Muller. Optimum design for correlated processes via eigenfunction expansions. Department of Statistics and Mathematics, Wirtschaftsuniversitat Wien, Report 6, June 2004.

[46] R.A. Fisher. Statistical Methods for Research Workers. Oliver & Boyd, Edinburgh, 1925.

[47] M. Fliess and H. Sira-Ramírez. An algebraic framework for linear identification. ESAIM: Control, Optimization and Calculus of Variations, 9:151–168, 2003.

[48] A. Forsgren, P.E. Gill, and M.H. Wright. Interior methods for nonlinear optimization. SIAM Review, 44(4):525–597, 2002.

[49] U. Forssell and L. Ljung. Closed-loop identification revisited. Automatica, 35:1215–1241, 1999.

[50] U. Forssell and L. Ljung. Some results on optimum experiment design. Automatica, 36:749–756, 2000.

[51] R. Gautier and L. Pronzato. Adaptive control for sequential design. Discussiones Mathematicae, Probability & Statistics, 20(1):97–114, 2000.

[52] S. Gazut, J.-M. Martinez, and S. Issanchou. Plans d'expériences itératifs pour la construction de modèles non linéaires. In CD – 38èmes Journées de Statistique de la SFdS, Clamart, France, 2006.

[53] M. Gevers. Identification for control. From the early achievements to the revival of experimental design. European Journal of Control, 11(4–5):335–352, 2005.

[54] M. Gevers, X. Bombois, B. Codrons, G. Scorletti, and B.D.O. Anderson. Model validation for control and controller validation in a prediction error identification framework — Part I: theory. Automatica, 39:403–415, 2003.

[55] M. Gevers, X. Bombois, B. Codrons, G. Scorletti, and B.D.O. Anderson. Model validation for control and controller validation in a prediction error identification framework — Part II: illustrations. Automatica, 39:417–427, 2003.

[56] M. Gevers and L. Ljung. Benefits of feedback in experiment design. In Prep. 7th IFAC/IFORS Symp. on Identification and System Parameter Estimation, pages 909–914, York, 1985.

[57] M. Gevers and L. Ljung. Optimal experiment design with respect to the intended model application. Automatica, 22:543–554, 1986.

[58] G.C. Goodwin and R.L. Payne. Dynamic System Identification: Experiment Design and Data Analysis. Academic Press, New York, 1977.

[59] G.C. Goodwin and M.E. Salgado. A stochastic imbedding approach for quantifying uncertainty in the estimation of restricted complexity models. International Journal of Adaptive Control and Signal Processing, 3:333–356, 1989.

[60] L. Guo. Further results on least-squares based adaptive minimum variance control. SIAM J. Control and Optimization, 32(1):187–212, 1994.

[61] R. Harman and L. Pronzato. Improvements on removing non-optimal support points in D-optimum design algorithms. Statistics & Probability Letters, 77:90–94, 2007.

[62] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Data Mining, Inference and Prediction. Springer, Heidelberg, 2001.

[63] C.C. Heyde. Quasi-likelihood and its Application. A General Approach to Optimal Parameter Estimation. Springer, New York, 1997.

[64] R. Hildebrand and M. Gevers. Identification for control: optimal input design with respect to a worst-case ν-gap cost function. SIAM Journal on Control and Optimization, 42(5):1586–1608, 2003.

[65] P.D.H. Hill. A review of experimental design procedures for regression model discrimination. Technometrics, 20:15–21, 1978.

[66] H. Hjalmarsson. From experiment design to closed-loop control. Automatica, 41:393–438, 2005.

[67] H. Hjalmarsson, M. Gevers, and F. De Bruyne. For model-based control design, closed-loop identification gives better performance. Automatica, 32(12):1659–1673, 1996.

[68] I. Hu. Strong consistency of Bayes estimates in stochastic regression models. Journal of Multivariate Analysis, 57:215–227, 1996.

[69] I. Hu. Strong consistency in stochastic regression models via posterior covariance matrices. Biometrika, 84(3):744–749, 1997.

[70] I. Hu. On sequential designs in nonlinear problems. Biometrika, 85(2):496–503, 1998.

[71] I. Hu. Strong consistency of Bayes estimates in nonlinear stochastic regression models. Journal of Statistical Planning and Inference, 67:155–163, 1998.

[72] P.J. Huber. Robust Statistics. John Wiley, New York, 1981.

[73] A.V. Ivanov and N.N. Leonenko. Asymptotic theory of nonlinear regression with long-range dependence. Mathematical Methods of Statistics, 13(2):153–178, 2004.

[74] H. Jansson and H. Hjalmarsson. Mixed H∞ and H2 input design for identification. In Proc. 43rd Conf. on Decision and Control, pages 388–393, The Bahamas, December 2004.

[75] H. Jansson and H. Hjalmarsson. Input design via LMIs admitting frequency-wise model specifications in confidence regions. IEEE Transactions on Automatic Control, 50(10):1534–1549, 2005.

[76] H. Jansson and H. Hjalmarsson. Optimal experiment design in closed loop. In Proc. 16th IFAC World Congress on Automatic Control, Prague, 2005. CD-ROM – Paper 04528.

[77] R.I. Jennrich. Asymptotic properties of nonlinear least squares estimation. Annals of Math. Stat., 40:633–643, 1969.

[78] M.E. Johnson, L.M. Moore, and D. Ylvisaker. Minimax and maximin distance designs. Journal of Statistical Planning and Inference, 26:131–148, 1990.

[79] D. Jones, M. Schonlau, and W.J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13:455–492, 1998.

[80] L.P. Kaelbling, M.L. Littman, and A.W. Moore. Reinforcement learning: a survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.

[81] T. Kanamori. Statistical asymptotic theory of active learning. Annals Inst. Statist. Math., 54(3):459–475, 2002.

[82] F. Kerestecioglu. Change Detection and Input Design in Dynamical Systems. Research Studies Press, Taunton, UK, 1993.

[83] J. Kiefer and J. Wolfowitz. Stochastic estimation of the maximum of a regression function. Annals of Math. Stat., 23:462–466, 1952.

[84] J. Kiefer and J. Wolfowitz. Optimum designs in regression problems. Annals of Math. Stat., 30:271–294, 1959.

[85] J. Kiefer and J. Wolfowitz. The equivalence of two extremum problems. Canadian Journal of Mathematics, 12:363–366, 1960.

[86] D.G. Krige. A statistical approach to some mine valuation and allied problems on the Witwatersrand. Master Thesis, University of Witwatersrand, 1951.

[87] M. Krstic. Performance improvement and limitations in extremum seeking control. Systems & Control Letters, 39:313–326, 2000.

[88] M. Krstic, I. Kanellakopoulos, and P. Kokotovic. Nonlinear and Adaptive Control Design. Wiley, New York, 1995.

[89] M. Krstic and H.-H. Wang. Stability of extremum seeking feedback for general nonlinear dynamic systems. Automatica, 36:595–601, 2000.

[90] C.S. Kubrusly and H. Malebranche. Sensors and controllers location in distributed systems—A survey. Automatica, 21(2):117–128, 1985.

[91] C. Kulcsar, L. Pronzato, and E. Walter. Dual control of linearly parameterised models via prediction of posterior densities. European J. of Control, 2(1):135–143, 1996.

[92] C. Kulcsar, L. Pronzato, and E. Walter. Experimental design for the control of linear state-space systems. In Proc. 13th IFAC World Congress, volume C, pages 175–180, San Francisco, June 1996.

[93] P.R. Kumar. Convergence of adaptive control schemes using least-squares parameter estimates. IEEE Transactions on Automatic Control, 35(4):416–424, 1990.

[94] T.L. Lai. Asymptotically efficient adaptive control in stochastic regression models. Advances in Applied Math., 7:23–45, 1986.

[95] T.L. Lai. Asymptotic properties of nonlinear least squares estimates in stochastic regression models. Annals of Statistics, 22(4):1917–1930, 1994.

[96] T.L. Lai and H. Robbins. Consistency and asymptotic efficiency of slope estimates in stochastic approximation schemes. Z. Wahrsch. verw. Gebiete, 56:329–360, 1981.

[97] T.L. Lai, H. Robbins, and C.Z. Wei. Strong consistency of least squares estimates in multiple regression. Proc. Nat. Acad. Sci. USA, 75(7):3034–3036, 1978.

[98] T.L. Lai, H. Robbins, and C.Z. Wei. Strong consistency of least squares estimates in multiple regression II. Journal of Multivariate Analysis, 9:343–361, 1979.

[99] T.L. Lai and C.Z. Wei. Least squares estimates in stochastic regression models with applications to identification and control of dynamic systems. Annals of Statistics, 10(1):154–166, 1982.

[100] T.L. Lai and C.Z. Wei. On the concept of excitation in least squares identification and adaptive control. Stochastics, 16:227–254, 1986.

[101] T.L. Lai and C.Z. Wei. Asymptotically efficient self-tuning regulators. SIAM J. Control and Optimization, 25(2):466–481, 1987.

[102] S. Leary, A. Bhaskar, and A.J. Keane. A derivative based surrogate model for approximating and optimizing the output of an expensive computer simulation. Journal of Global Optimization, 30:39–58, 2004.

[103] K.-Y. Liang and S.L. Zeger. Inference based on estimating functions in the presence of nuisance parameters. Statistical Science, 10(2):158–173, 1995.

[104] L. Ljung. System Identification, Theory for the User. Prentice-Hall, Englewood Cliffs, 1987.

[105] K.V. Mardia. Maximum likelihood estimation for spatial models. In D.A. Griffith, editor, Spatial Statistics: Past, Present and Future, pages 203–253. Michigan Document Services, Ann Arbor, Michigan, 1990.

[106] K.V. Mardia, J.T. Kent, C.R. Goodall, and J.A. Little. Kriging and splines with derivative information. Biometrika, 83(1):207–221, 1996. (correction in Biometrika (1998), 85(2), p. 205).

[107] G. Matheron. Principles of geostatistics. Economic Geology, 58:1246–1266, 1963.

[108] T.J. Mitchell. An algorithm for the construction of “D-optimal” experimental designs. Technometrics, 16:203–210, 1974.

[109] J. Mockus. Bayesian Approach to Global Optimization, Theory and Applications. Kluwer, Dordrecht, 1989.

[110] J. Mockus, V. Tiesis, and A. Zilinskas. The application of Bayesian methods for seeking the extremum. In L.C.W. Dixon and G.P. Szego, editors, Towards Global Optimisation 2, pages 117–129. North Holland, Amsterdam, 1978.

[111] I. Molchanov and S. Zuyev. Variational calculus in the space of measures and optimal design. In A. Atkinson, B. Bogacka, and A. Zhigljavsky, editors, Optimum Design 2000, chapter 8, pages 79–90. Kluwer, Dordrecht, 2001.

[112] I. Molchanov and S. Zuyev. Steepest descent algorithm in a space of measures. Statistics and Computing, 12:115–123, 2002.

[113] M.D. Morris and T.J. Mitchell. Exploratory designs for computational experiments. Journal of Statistical Planning and Inference, 43:381–402, 1995.

[114] M.D. Morris, T.J. Mitchell, and D. Ylvisaker. Bayesian design and analysis of computer experiments: use of derivatives in surface prediction. Technometrics, 35(3):243–255, 1993.

[115] H.-G. Muller. Optimal designs for nonparametric kernel regression. Statistics & Probability Letters, 2:285–290, 1984.

[116] W.G. Muller. Collecting Spatial Data. Optimum Design of Experiments for Random Fields (2nd revised edition). Physica-Verlag, Heidelberg, 2001.

[117] W.G. Muller and A. Pazman. Measures for designs in experiments with correlated errors. Biometrika, 90(2):423–434, 2003.

[118] E.A. Nadaraya. On estimating regression. Theory of Probability and its Applications, 9:141–142, 1964.

[119] W. Nather. The choice of estimators and experimental designs in a linear regression model according to a joint criterion of optimality. Math. Operationsforsch. Statist., 6:677–686, 1975.

[120] Y. Nesterov and A. Nemirovskii. Interior-Point Polynomial Algorithms in Convex Programming. SIAM, Philadelphia, 1994.

[121] D. Nesic and D.S. Laila. A note on input-to-state stabilization for nonlinear sampled-data systems. IEEE Transactions on Automatic Control, 47(7):1153–1158, 2002.

[122] D. Nesic and A.R. Teel. Sampled-data control of nonlinear systems: an overview of recent results. In Perspectives in robust control, pages 221–239. Lecture Notes in Control and Inform. Sci., 268, Springer, New York, 2001.

[123] B. Ninness. Strong laws of large numbers under weak assumptions with application. IEEE Transactions on Automatic Control, 45(11):2117–2122, 2000.

[124] B. Ninness and G. Goodwin. Estimation of model quality. Automatica, 31(12):1771–1797, 1995.

[125] A. Pazman. Foundations of Optimum Experimental Design. Reidel (Kluwer group), Dordrecht (co-pub. VEDA, Bratislava), 1986.

[126] A. Pazman. Nonlinear Statistical Models. Kluwer, Dordrecht, 1993.

[127] A. Pazman and W.G. Muller. Optimum design of experiments subject to correlated errors. Statistics & Probability Letters, 52:29–34, 2001.

[128] A. Pazman and L. Pronzato. Nonlinear experimental design based on the distribution of estimators. Journal of Statistical Planning and Inference, 33:385–402, 1992.

[129] A. Pazman and L. Pronzato. A Dirac function method for densities of nonlinear statistics and for marginal densities in nonlinear regression. Statistics & Probability Letters, 26:159–167, 1996.

[130] A. Pazman and L. Pronzato. Simultaneous choice of design and estimator in nonlinear regression with parameterized variance. In A. Di Bucchianico, H. Lauter, and H.P. Wynn, editors, mODa'7 – Advances in Model–Oriented Design and Analysis, Proceedings of the 7th Int. Workshop, Heeze (Netherlands), pages 117–124, Heidelberg, June 2004. Physica Verlag.

[131] A. Pazman and L. Pronzato. On the irregular behavior of LS estimators for asymptotically singular designs. Statistics & Probability Letters, 76:1089–1096, 2006.

[132] J. Pilz. Bayesian Estimation and Experimental Design in Linear Regression Models, volume 55. Teubner-Texte zur Mathematik, Leipzig, 1983. (also Wiley, New York, 1991).

[133] J.-M. Poggi and B. Portier. Nonlinear adaptive tracking using kernel estimators: estimation and test for linearity. SIAM J. Control Optim., 39(3):707–727, 2000.

[134] B. Portier and A. Oulidi. Nonparametric estimation and adaptive control of functional autoregressive models. SIAM J. Control Optim., 39(2):411–432, 2000.

[135] L. Pronzato. Adaptive optimisation and D-optimum experimental design. Annals of Statistics, 28(6):1743–1761, 2000.

[136] L. Pronzato. On the sequential construction of optimum bounded designs. Journal of Statistical Planning and Inference, 136:2783–2804, 2006.

[137] L. Pronzato, C. Kulcsar, and E. Walter. An actively adaptive control policy for linear models. IEEE Transactions on Automatic Control, 41(6):855–858, 1996.

[138] L. Pronzato and A. Pazman. Second-order approximation of the entropy in nonlinear least-squares estimation. Kybernetika, 30(2):187–198, 1994. Erratum 32(1):104, 1996.

[139] L. Pronzato and A. Pazman. Using densities of estimators to compare pharmacokinetic experiments. Computers in Biology and Medicine, 31(3):179–195, 2001.

[140] L. Pronzato and A. Pazman. Recursively re-weighted least-squares estimation in regression models with parameterized variance. In Proc. EUSIPCO'2004, Vienna, Austria, pages 621–624, September 2004.

[141] L. Pronzato and E. Thierry. Sequential experimental design and response optimisation. Statistical Methods and Applications, 11(3):277–292, 2003.

[142] L. Pronzato and E. Walter. Robust experiment design via stochastic approximation. Mathematical Biosciences, 75:103–120, 1985.

[143] L. Pronzato and E. Walter. Robust experiment design via maximin optimization. Mathematical Biosciences, 89:161–176, 1988.

[144] L. Pronzato and E. Walter. Experimental design for estimating the optimum point in a response surface. Acta Applicandae Mathematicae, 33:45–68, 1993.

[145] L. Pronzato and E. Walter. Minimum-volume ellipsoids containing compact sets: application to parameter bounding. Automatica, 30(11):1731–1739, 1994.

[146] L. Pronzato, H.P. Wynn, and A.A. Zhigljavsky. Dynamical Search. Chapman & Hall/CRC, Boca Raton, 2000.

[147] L. Pronzato, H.P. Wynn, and A.A. Zhigljavsky. Renormalised steepest descent in Hilbert space converges to a two-point attractor. Acta Applicandae Mathematicae, 67:1–18, 2001.

[148] L. Pronzato, H.P. Wynn, and A.A. Zhigljavsky. Asymptotic behaviour of a family of gradient algorithms in R^d and Hilbert spaces. Mathematical Programming, A107:409–438, 2006.

[149] F. Pukelsheim. Optimal Experimental Design. Wiley, New York, 1993.

[150] F. Pukelsheim and S. Rieder. Efficient rounding of approximate designs. Biometrika, 79(4):763–770, 1992.

[151] E. Rafajłowicz. Optimal experiment design for identification of linear distributed parameter systems: Frequency domain approach. IEEE Transactions on Automatic Control, 28(7):806–808, 1983.

[152] E. Rafajłowicz. Optimum choice of moving sensor trajectories for distributed parameter system identification. International Journal of Control, 43(5):1441–1451, 1986.

[153] H.-F. Raynaud, L. Pronzato, and E. Walter. Robust identification and control based on ellipsoidal parametric uncertainty descriptions. European J. of Control, 6(3):245–257, 2000.

[154] T.G. Robertazzi and S.C. Schwartz. An accelerated sequential algorithm for producing D-optimal designs. SIAM J. Sci. Stat. Comput., 10(2):341–358, 1989.

[155] C.R. Rojas, J.S. Welsh, G.C. Goodwin, and A. Feuer. Robust optimal experiment design for system identification. Automatica, 43:993–1008, 2007.

[156] J. Sacks and S. Schiller. Spatial designs. In S.S. Gupta and J.O. Berger, editors, Statistical Decision Theory and Related Topics IV, volume 2, pages 385–399. Springer, Heidelberg, 1988.

[157] J. Sacks, W.J. Welch, T.J. Mitchell, and H.P. Wynn. Design and analysis of computer experiments. Statistical Science, 4(4):409–435, 1989.

[158] T. Santner, B.J. Williams, and W.I. Notz. The Design and Analysis of Computer Experiments. Springer, Heidelberg, 2003.

[159] R. Schaback. Mathematical results concerning kernel techniques. In Prep. 13th IFAC Symposium on System Identification, Rotterdam, pages 1814–1819, August 2003.

[160] H. Scheffe. Simultaneous interval estimates of linear functions of parameters. Bull. Inst. Internat. Statist., 38:245–253, 1961.

[161] M. Schonlau, W.J. Welch, and D.R. Jones. Global versus local search in constrained optimization of computer models. In New Developments and Applications in Experimental Design, Lecture Notes — Monograph Series, vol. 34, pages 11–25. IMS, Hayward, 1998.

[162] R. Schwabe. On adaptive chemical balance weighting designs. Journal of Statistical Planning and Inference, 17:209–216, 1987.

[163] M.C. Shewry and H.P. Wynn. Maximum entropy sampling. Applied Statistics, 14:165–170, 1987.

[164] A.N. Shiryaev. Probability. Springer, Berlin, 1996.

[165] R. Sibson. Discussion on a paper by H.P. Wynn. Journal of Royal Statistical Society, B34:181–183, 1972.

[166] B.W. Silverman and D.M. Titterington. Minimum covering ellipses. SIAM Journal Sci. Stat. Comput., 1(4):401–409, 1980.

[167] S.D. Silvey. Optimal Design. Chapman & Hall, London, 1980.

[168] S.D. Silvey, D.M. Titterington, and B. Torsney. An algorithm for optimal designs on a finite design space. Commun. Statist.-Theor. Meth., A7(14):1379–1389, 1978.

[169] T. Soderstrom and P. Stoica. Comparison of some instrumental variable methods—consistency and accuracy aspects. Automatica, 17(1):101–115, 1981.

[170] T. Soderstrom and P. Stoica. Instrumental Variable Methods for System Identification. Springer, New York, 1983.

[171] T. Soderstrom and P. Stoica. System Identification. Prentice Hall, New York, 1989.

[172] V.G. Spokoinyi. On asymptotically optimal sequential experimental design. Advances in Soviet Mathematics, 12:135–150, 1992.

[173] M.L. Stein. Interpolation of Spatial Data. Some Theory for Kriging. Springer, Heidelberg, 1999.

[174] J. Sternby. On consistency for the method of least squares using martingale theory. IEEE Transactions on Automatic Control, 22(3):346–352, 1977.

[175] D.M. Titterington. Optimal design: some geometrical aspects of D-optimality. Biometrika, 62(2):313–320, 1975.

[176] D.M. Titterington. Algorithms for computing D-optimal designs on a finite design space. In Proc. of the 1976 Conference on Information Science and Systems, pages 213–216, Baltimore, 1976. Dept. of Electronic Engineering, Johns Hopkins University.

[177] B. Torsney. A moment inequality and monotonicity of an algorithm. In K.O. Kortanek and A.V. Fiacco, editors, Proc. Int. Symp. on Semi-infinite Programming and Applications, pages 249–260, Heidelberg, 1983. Springer.

[178] E. Tse and Y. Bar-Shalom. An actively adaptive control for linear systems with random parameters via the dual control approach. IEEE Transactions on Automatic Control, 18(2):109–117, 1973.

[179] E. Tse, Y. Bar-Shalom, and L. Meier III. Wide-sense adaptive dual control for nonlinear stochastic systems. IEEE Transactions on Automatic Control, 18(2):98–108, 1973.

[180] D. Ucinski. Optimal Measurement Methods for Distributed Parameter System Identification. CRC Press, Boca Raton, 2005.

[181] A.W. van der Vaart. Maximum likelihood estimation under a spatial sampling scheme. Annals of Statistics, 24(5):2049–2057, 1996.

[182] L. Vandenberghe, S. Boyd, and S.-P. Wu. Determinant maximisation with linear matrix inequality constraints. SIAM Journal on Matrix Analysis and Applications, 19(2):499–533, 1998.

[183] V.N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.

[184] V.N. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 2000. 2nd Edition.

[185] E. Vazquez. Modélisation comportementale de systèmes non linéaires multivariables par méthodes à noyaux et applications. Ph.D. Thesis, Université Paris XI, Orsay, France, May 2005.

[186] E. Vazquez and E. Walter. Multi-output support vector regression. In Prep. 13th IFAC Symposium on System Identification, Rotterdam, pages 1820–1825, August 2003.

[187] E. Vazquez, E. Walter, and G. Fleury. Intrinsic Kriging and prior information. Applied Stochastic Models in Business and Industry, 21:215–226, 2005.

[188] E. Walter and L. Pronzato. Identification of Parametric Models from Experimental Data. Springer, Heidelberg, 1997.

[189] G.S. Watson. Smooth regression analysis. Sankhya, Series A, 26:359–372, 1964.

[190] S. Wright, editor. Primal-Dual Interior-Point Methods. SIAM, Philadelphia, 1997.

[191] C.F.J. Wu. Asymptotic inference from sequential design in a nonlinear situation. Biometrika, 72(3):553–558, 1985.

[192] H.P. Wynn. The sequential generation of D-optimum experimental designs. Annals of Math. Stat., 41:1655–1664, 1970.

[193] H.P. Wynn. Maximum entropy sampling and general equivalence theory. In A. Di Bucchianico, H. Lauter, and H.P. Wynn, editors, mODa'7 – Advances in Model–Oriented Design and Analysis, Proceedings of the 7th Int. Workshop, Heeze (Netherlands), pages 211–218. Physica Verlag, Heidelberg, June 2004.

[194] Y. Ye. Interior-Point Algorithms: Theory and Analysis. Wiley, Chichester, 1997.

[195] Z. Ying. Maximum likelihood estimation of parameters under a spatial sampling scheme. Annals of Statistics, 21:1567–1590, 1993.

[196] M.B. Zarrop. Optimal Experiment Design for Dynamic System Identification. Springer, Heidelberg, 1979.

[197] Z. Zhu and H. Zhang. Spatial sampling design under the infill asymptotic framework. Environmetrics, 17(4):323–337, 2006.
