Special Session on Convex Optimization for System Identification

Kristiaan Pelckmans ∗   Johan A.K. Suykens ∗∗

∗ SysCon, Information Technology, Uppsala University, 75501, Sweden
∗∗ KU Leuven ESAT-SCD, B-3001 Leuven (Heverlee), Belgium

Abstract: This special session aims to survey, present new results and stimulate discussions on how to apply techniques of convex optimization in the context of system identification. The scope of this session includes but is not limited to linear and nonlinear modeling, model structure selection, block structured systems, regularization mechanisms (e.g. L1, sum-of-norms, nuclear norm), segmentation of time-series, trend filtering, optimal experiment design and others.

1. INTRODUCTION

Insights in convex optimization continue to be a driving force for new formulations and methods of estimation and identification. The paramount incentive for phrasing methods of estimation in the format of a standard convex optimization problem has been the availability of efficient solvers, both from a theoretical as well as a practical perspective. From a conceptual perspective, convexity of a problem formulation ensures that no local minima exist. The work in convex optimization has become a well-matured subject, to such an extent that researchers view the distinction between convex and non-convex problems as more profound than the distinction between linear and non-linear optimization problems. A survey of techniques of convex optimization, together with applications in estimation, is given by Boyd and Vandenberghe [2004].

Convex optimization has always maintained a close connection to systems theory and estimation problems. Main catalysts of this synergy include the following:

• (Combinatorial) Interest in convex approaches to efficiently solving complex optimization problems can be traced back to the maxflow-mincut theorem, essentially stating when a combinatorial problem can be solved as a convex linear programming one. This result has mainly impacted the field of Operations Research (OR), but the related so-called property of unimodularity of a matrix is currently seeing a revival in the context of machine learning and estimation. Another important result in this line of work has been the recent interest in Semi-Definite Programming (SDP) relaxations for NP-hard combinatorial problems. Standard references are Papadimitriou and Steiglitz [1998] and Schrijver [1998].

• (LMI) An immediate predecessor of the solid body of work in convex optimization is the literature on Linear Matrix Inequalities (LMIs). This research has found a particularly rich application area in systems theory, where LMIs occur naturally in problems of stability and of automatic control. Standard works are Boyd et al. [1994] and Ben-Tal and Nemirovskii [2001].

• (L1) Compressed Sensing or Compressive Sampling (CS) has led to vigorous research in applications of convex optimization in estimation and recovery problems. More specifically, the interest in sparse estimation problems has led to countless proposals of L1 norms in estimation problems, often stimulated by the promise that sparsity of (a transformation of) the unknowns has a natural interpretation in the specific application at hand. A main benefit of research in CS over earlier interest in the use of the L1 heuristic for recovery is the newly derived theoretical guarantees, a research area often attributed to Candes and Tao [2005]. Lately, much interest has been devoted to the study of the low-rank approximation problem Recht et al. [2010], where different heuristics were proposed to relax the combinatorial rank constraint. For extensions in the area of system identification see Recht et al. [2010], Liu and Vandenberghe [2009].

• (Structure) Recent years have made it apparent that techniques of convex optimization can play yet another important role in identification and estimation problems. If the structure of the system underlying the observations can be imposed as constraints in the estimation problem, the set of possible solutions can be sharply reduced, leading in turn to better estimates. This view has been especially important in modeling of nonlinear phenomena, where the role of a parametric model is entirely replaced by structural constraints. Examples of such thinking are Support Vector Machines and other non-parametric methods. The former are also tied to convex optimization via another link: it is found that Lagrange duality (as in the theory of convex optimality) can lead to a systematic approach for introducing nonlinearity through the use of Mercer kernels.

• (Design) The design of experiments as in the statistical sciences has always related closely to convex optimization. This is no different in the context of system identification, where the design of experiments now points to the design of an input sequence which properly excites all the modes of the system to be identified. One apparent reason why techniques of convex optimization are useful is that such experiments have to work in the allowed operation regions, that is, constraints enter naturally in most cases. For related references, see e.g. Boyd and Vandenberghe [2004].

• (Priors) A modern view is that methods based on the L1-norm, nuclear norm relaxation or imposed structure are examples of a larger picture, namely that such terms can be used to make the inverse problem well-posed. In other words, they fill in unknown pieces of information in the estimation problem by imposing or suggesting a prior structure. In general, an estimation problem can be greatly helped if one is able to suggest a good prior for completing the evidence given by the data only. Such a prior can come in the form of a dictionary into which the solution fits nicely, or as a penalization term in the cost function of the optimization problem which penalizes the occurrence of unusual phenomena in the solution. A statistical treatment of regularization is surveyed in Bickel et al. [2006]; regularization and priors in system identification are discussed in Ljung [1999], Ljung et al. [2011].

The last item suggests a definite way forward, namely how techniques of convex optimization can be used to model appropriate priors in the context of system identification. We conjecture that the near future will witness such a shift of focus from parametric Box-Jenkins and state-space models to structural constraints and application-specific priors.
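To make the role of such priors concrete, the snippet below is a small, purely illustrative sketch (not part of the session papers): it solves a least-squares problem with an L1 penalty as a sparsity prior and a matrix fit with a nuclear norm penalty as a low-rank prior, using the CVXPY modeling package. All dimensions, data and weights (lam, mu) are invented for the example.

```python
# Purely illustrative: an L1 penalty as a sparsity prior and a nuclear norm
# penalty as a low-rank prior, each added to a simple least-squares fit (CVXPY).
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 100))            # made-up regressor matrix
x_true = np.zeros(100); x_true[:5] = 1.0      # sparse "true" parameter
b = A @ x_true + 0.01 * rng.standard_normal(50)

# Sparsity prior: L1-penalized least squares (Lasso-type problem).
x = cp.Variable(100)
lam = 0.1                                     # assumed regularization weight
cp.Problem(cp.Minimize(cp.sum_squares(A @ x - b) + lam * cp.norm(x, 1))).solve()

# Low-rank prior: nuclear norm penalized matrix fit.
M = rng.standard_normal((30, 30))             # made-up matrix observation
X = cp.Variable((30, 30))
mu = 5.0                                      # assumed weight on the rank prior
cp.Problem(cp.Minimize(cp.sum_squares(X - M) + mu * cp.normNuc(X))).solve(solver=cp.SCS)
```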

REFERENCES

A. Ben-Tal and A.S. Nemirovskii. Lectures on modern convex optimization: analysis, algorithms, and engineering applications, volume 2. Society for Industrial Mathematics, 2001.

P.J. Bickel, B. Li, A.B. Tsybakov, S.A. van de Geer, B. Yu, T. Valdes, C. Rivero, J. Fan, and A. van der Vaart. Regularization in statistics. Test, 15(2):271–344, 2006.

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

S. Boyd, L. El Ghaoui, E. Feron, and V. Balakrishnan. Linear matrix inequalities in system and control theory, volume 15. Society for Industrial Mathematics, 1994.

E.J. Candes and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.

Z. Liu and L. Vandenberghe. Interior-point method for nuclear norm approximation with application to system identification. SIAM Journal on Matrix Analysis and Applications, 31(3):1235–1256, 2009.

L. Ljung. System identification. Wiley Online Library, 1999.

L. Ljung, H. Hjalmarsson, and H. Ohlsson. Four encounters with system identification. European Journal of Control, 17(5):449, 2011.

C.H. Papadimitriou and K. Steiglitz. Combinatorial optimization: algorithms and complexity. Dover, 1998.

B. Recht, M. Fazel, and P.A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501, 2010.

A. Schrijver. Theory of linear and integer programming. John Wiley & Sons Inc, 1998.


Convex optimization techniques in system identification

Lieven Vandenberghe ∗

∗ Electrical Engineering Department, UCLA, Los Angeles, CA 90095 (Tel: 310-206-1259; e-mail: [email protected])

Abstract: In recent years there has been growing interest in convex optimization techniques for system identification and time series modeling. This interest is motivated by the success of convex methods for sparse optimization and rank minimization in signal processing, statistics, and machine learning, and by the development of new classes of algorithms for large-scale nondifferentiable convex optimization.

1. INTRODUCTION

Low-dimensional model structure in identification problems is typically expressed in terms of matrix rank or sparsity of parameters. In optimization formulations this generally leads to non-convex constraints or objective functions. However, formulations based on convex penalties that indirectly minimize rank or maximize sparsity are often quite effective as heuristics, relaxations, or, in rare cases, exact reformulations. The best known example is 1-norm regularization in sparse optimization, i.e., the use of the 1-norm ‖x‖₁ in an optimization problem as a substitute for the cardinality (number of nonzero elements) of a vector x. This idea has a rich history in statistics, image and signal processing [Rudin et al., 1992, Tibshirani, 1996, Chen et al., 1999, Efron et al., 2004, Candes and Tao, 2007], and an extensive mathematical theory has been developed to explain when and why it works well [Donoho and Huo, 2001, Donoho and Tanner, 2005, Candes et al., 2006b, Candes and Tao, 2005, Candes et al., 2006a, Candes and Tao, 2006, Donoho, 2006, Tropp, 2006]. Several excellent surveys and tutorials on this topic are available; see for example [Romberg, 2008, Candes and Wakin, 2008, Elad, 2010].

The 1-norm used in sparse optimization has a natural counterpart in the nuclear norm for matrix rank minimization. Here one uses the penalty function ‖X‖_*, where ‖·‖_* denotes the nuclear norm (sum of singular values), as a substitute for rank(X). Applications of nuclear norm methods in system theory and control were first explored by [Fazel, 2002, Fazel et al., 2004], and have recently gained in popularity in the wake of the success of 1-norm techniques for sparse optimization [Recht et al., 2010]. Much of the recent work in this area has focused on the low-rank matrix completion problem [Candes and Recht, 2009, Candes and Plan, 2010, Candes and Tao, 2010, Mazumder et al., 2010], i.e., the problem of identifying a low-rank matrix from a subset of its entries. This problem has applications in collaborative prediction [Srebro et al., 2005] and multi-task learning [Pong et al., 2011]. Applications of nuclear norm methods in system identification are discussed in [Liu and Vandenberghe, 2009a, Grossmann et al., 2009, Mohan and Fazel, 2010, Gebraad et al., 2011, Fazel et al., 2011].

The 1-norm and nuclear norm techniques can be extended in several interesting ways. The two types of penalties can be combined to promote sparse-plus-low-rank structure in matrices [Candes et al., 2011, Chandrasekaran et al., 2011]. Structured sparsity, such as group sparsity or hierarchical sparsity, can be induced by extensions of the 1-norm penalty [Bach et al., 2012, Jenatton et al., 2011, Bach et al., 2011]. Finally, Chandrasekaran et al. [2010] and Bach [2010] describe systematic approaches for constructing convex penalties for different types of nonconvex structural constraints.

In this tutorial paper we discuss a few applications of convex methods for structured rank minimization and sparse optimization, in combination with classical ideas from system identification and signal processing. We focus on subspace algorithms for system identification and topology selection problems in graphical models. The second part of the paper (section 4) provides a short survey of available convex optimization algorithms.

2. SYSTEM IDENTIFICATION

Subspace methods in system identification and signal processing rely on singular value decompositions (SVDs) to make low-rank matrix approximations [Ljung, 1999]. The structure in the approximated matrices (for example, Hankel structure) is therefore lost during the low-rank approximation. A convex optimization formulation based on the nuclear norm penalty offers an interesting alternative, because it promotes low rank while preserving linear matrix structure. An additional benefit of an optimization formulation is the possibility of adding other convex regularization terms or constraints on the optimization variables.

As an illustration, consider the input-output equation used as the starting point in many subspace identification methods:

Y = OX + HU.

The matrices U and Y are block Hankel matrices constructed from a sequence of inputs u(t) and outputs y(t) of a state space model

x(t+1) = Ax(t) + Bu(t),    y(t) = Cx(t) + Du(t),


and the columns of X form a sequence of states x(t). The matrix H depends on the system matrices, and O is an extended observability matrix [Verhaegen and Verdult, 2007, p. 295]. A simple subspace method consists of forming the Hankel matrices U and Y and then projecting the rows of Y on the nullspace of U. If the data are exact and a persistence of excitation assumption holds, the rank of the projected output matrix is equal to the system order, and from it a system realization is easily computed. When the input-output data are not exact, one can use a singular value decomposition of the projected output Hankel matrix to estimate the order and compute a system realization. However, as mentioned, this step destroys the Hankel structure in Y and U. The nuclear norm penalty, on the other hand, can be used as a convex heuristic for indirectly reducing the rank while preserving linear structure. For example, if the inputs are exactly known and the measured outputs y_m(t) are subject to error, one can solve the convex problem

minimize  ‖Y Q‖_* + γ Σ_t ‖y(t) − y_m(t)‖₂²

where the columns of Q form a basis of the nullspace of U and γ is a positive weight. The optimization variables are the model outputs y(t), and the matrix Y is a Hankel matrix constructed from the model outputs y(t). This is a convex optimization problem that can be solved via semidefinite programming. We refer the reader to [Liu and Vandenberghe, 2009a,b] for more details and numerical results. As an important advantage, the optimization formulation can be extended to include convex constraints on the model outputs. Another promising application is identification with missing data [Ding et al., 2007, Grossmann et al., 2009].
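A rough numerical sketch of the two steps discussed above follows; it is not the authors' implementation. It builds Hankel matrices from simulated data, carries out the simple projection/SVD step, and then solves a nuclear-norm formulation of the type above with CVXPY. The test system, block sizes, noise level and the weight gamma are all assumptions made only so that the snippet runs.

```python
# Hypothetical sketch: block Hankel matrices, projection on the nullspace of U,
# and the nuclear-norm variant of the subspace step (CVXPY). Data are made up.
import numpy as np
import cvxpy as cp
from scipy.linalg import null_space

def block_hankel(w, rows):
    """Stack a scalar signal w(0..T-1) into a Hankel matrix with `rows` rows;
    column j contains the window w(j), ..., w(j+rows-1)."""
    cols = len(w) - rows + 1
    return np.array([[w[i + j] for j in range(cols)] for i in range(rows)])

rng = np.random.default_rng(0)
u = rng.standard_normal(120)
y = np.zeros(120)
for t in range(2, 120):                       # made-up stable second-order system
    y[t] = 1.5 * y[t-1] - 0.7 * y[t-2] + u[t-1] + 0.5 * u[t-2]
ym = y + 0.05 * rng.standard_normal(120)      # noisy measured outputs

r = 10                                        # number of Hankel rows (assumed)
U, Y = block_hankel(u, r), block_hankel(ym, r)
Q = null_space(U)                             # basis of the nullspace of U
cols = U.shape[1]

# Plain subspace step: a gap in the singular values suggests the model order.
sv = np.linalg.svd(Y @ Q, compute_uv=False)

# Nuclear norm alternative: keep the Hankel structure while penalizing rank.
yhat = cp.Variable(120)
Yhat = cp.vstack([cp.reshape(yhat[i:i + cols], (1, cols)) for i in range(r)])
gamma = 10.0                                  # assumed positive weight
cp.Problem(cp.Minimize(cp.normNuc(Yhat @ Q) +
                       gamma * cp.sum_squares(yhat - ym))).solve(solver=cp.SCS)
```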

3. GRAPHICAL MODELS

In a graphical model of a normal distribution x ∼ N(0, Σ) the edges in the graph represent the conditional dependence relations between the components of x. The vertices in the graph correspond to the components of x; the absence of an edge between vertices i and j indicates that x_i and x_j are independent, conditional on the other entries of x. Equivalently, vertices i and j are connected if there is a nonzero in the i, j position of the inverse covariance matrix Σ⁻¹.

A key problem in the estimation of the graphical model is the selection of the topology. Several authors have addressed this problem by adding a 1-norm penalty to the maximum likelihood estimation problem, and solving

minimize  tr(CX) − log det X + γ ‖X‖₁.    (1)

Here X denotes the inverse covariance Σ⁻¹, the matrix C is the sample covariance matrix, and ‖X‖₁ = Σ_{ij} |X_{ij}|. See [Meinshausen and Buhlmann, 2006, Banerjee et al., 2008, Ravikumar et al., 2008, Friedman et al., 2008, Lu, 2009, Scheinberg and Rish, 2009, Yuan and Lin, 2007, Duchi et al., 2008, Li and Toh, 2010, Scheinberg and Ma, 2012].
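A minimal sketch of problem (1) in CVXPY is shown below; the sample covariance C, the dimension and the weight gamma are placeholders chosen only to make the snippet self-contained.

```python
# Minimal sketch of the regularized maximum likelihood problem (1) in CVXPY.
# The data and the weight gamma are made up; only the formulation matters.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, N = 8, 200
samples = rng.standard_normal((N, n))
C = np.cov(samples, rowvar=False)             # sample covariance matrix

X = cp.Variable((n, n), symmetric=True)       # estimate of the inverse covariance
gamma = 0.1                                   # assumed penalty weight
objective = cp.trace(C @ X) - cp.log_det(X) + gamma * cp.sum(cp.abs(X))
cp.Problem(cp.Minimize(objective)).solve(solver=cp.SCS)
# Zero entries of X.value suggest missing edges in the graph.
```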

Graphical models of the conditional independence relations can be extended to Gaussian vector time series [Brillinger, 1996, Dahlhaus, 2000]. In this extension the topology of the graph is determined by the sparsity pattern of the inverse spectral density matrix

S(ω) = Σ_{k=−∞}^{∞} R_k e^{jkω},

with R_k = E x(t+k)x(t)^T. Using this characterization, one can formulate extensions of the regularized maximum likelihood problem (1) to vector time series. In [Songsiri et al., 2010, Songsiri and Vandenberghe, 2010] autoregressive models

x(t) = − Σ_{k=1}^{p} A_k x(t−k) + v(t),    v(t) ∼ N(0, Σ),

were considered, and convex formulations were presented for the problem of estimating the parameters A_k, Σ, subject to conditional independence constraints, and of estimating the topology via a 1-norm type regularization. The topology selection problem leads to the following extension of (1):

minimize  tr(CX) − log det X_{00} + γ h(X)
subject to  X ⪰ 0.    (2)

The variable X is a (p+1) × (p+1) block matrix with blocks of size n × n (the length of the vector x(t)), and X_{00} is the leading block of X. The penalty h is chosen to encourage a common, symmetric sparsity pattern for the diagonal sums

Σ_{i=0}^{p−k} X_{i,i+k},    k = 0, 1, ..., p,

of the blocks in X.
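The sketch below illustrates one possible CVXPY encoding in the spirit of problem (2). The particular group penalty h used here (a 2-norm over corresponding entries of the diagonal block sums D_0, ..., D_p) is our own illustrative choice, not necessarily the penalty used in the cited work, and the dimensions, data and weight are placeholders.

```python
# Hedged sketch in the spirit of problem (2): regularized ML estimation with a
# group penalty on the diagonal block sums of X. The penalty below is one
# illustrative choice, not necessarily the h(X) used in the cited references.
import numpy as np
import cvxpy as cp

n, p = 3, 2                                   # block size and AR order (assumed)
m = (p + 1) * n
rng = np.random.default_rng(0)
samples = rng.standard_normal((500, m))
C = np.cov(samples, rowvar=False)             # placeholder sample covariance

X = cp.Variable((m, m), symmetric=True)
X00 = X[:n, :n]                               # leading n-by-n block of X

# Diagonal block sums D_k = sum_i X_{i,i+k}, k = 0, ..., p.
D = [sum(X[i*n:(i+1)*n, (i+k)*n:(i+k+1)*n] for i in range(p + 1 - k))
     for k in range(p + 1)]

# Illustrative group penalty: one 2-norm per off-diagonal entry position,
# grouped over the lags k and symmetrized over (r, s) and (s, r).
h = 0
for r in range(n):
    for s in range(r + 1, n):
        group = [D[k][r, s] for k in range(p + 1)] + [D[k][s, r] for k in range(p + 1)]
        h = h + cp.norm(cp.hstack(group), 2)

gamma = 0.1                                   # assumed weight
prob = cp.Problem(cp.Minimize(cp.trace(C @ X) - cp.log_det(X00) + gamma * h),
                  [X >> 0])
prob.solve(solver=cp.SCS)
```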

An extension to ARMA processes is studied by Avventiyet al. [2010].

4. ALGORITHMS

For small and medium sized problems the applications discussed in the previous sections can be handled by general-purpose convex optimization solvers, such as the modeling packages CVX [Grant and Boyd, 2007] and YALMIP [Lofberg, 2004], and general-purpose conic optimization packages. In this section we discuss algorithmic approaches that are of interest for large problems that fall outside the scope of the general-purpose solvers.

4.1 Interior-point algorithms

Interior-point algorithms are known to attain a high accuracy in a small number of iterations, fairly independent of problem data and dimensions. The main drawback is the high linear algebra complexity per iteration associated with solving the Newton equations that determine search directions. However, sometimes problem structure can be exploited to devise dedicated interior-point implementations that are significantly more efficient than general-purpose solvers.

A simple example is the 1-norm approximation problem

minimize  ‖Ax − b‖₁

with A of size m × n. This can be formulated as a linear program (LP)

minimize  Σ_{i=1}^{m} y_i
subject to  [ A  −I ; −A  −I ] [ x ; y ] ≤ [ b ; −b ],

at the expense of introducing m auxiliary variables and 2m linear inequality constraints. By taking advantage of the structure in the inequalities, each iteration of an interior-point method for the LP can be reduced to solving linear systems A^T D A Δx = r, where D is a positive diagonal matrix. As a result, the complexity of solving the 1-norm approximation problem using a custom interior-point solver is roughly the equivalent of a small number of weighted least-squares problems.
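As a small sanity-check sketch (a generic LP solve, not the structure-exploiting interior-point method described above), the reformulation can be written out explicitly and passed to SciPy's linprog; the problem data are random placeholders.

```python
# Sketch of the LP reformulation of minimize ||Ax - b||_1, solved with a
# generic LP routine; the data A, b are placeholders.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, n = 40, 10
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# Variables z = (x, y): minimize sum(y) s.t. A x - b <= y and -(A x - b) <= y.
c = np.concatenate([np.zeros(n), np.ones(m)])
A_ub = np.block([[A, -np.eye(m)], [-A, -np.eye(m)]])
b_ub = np.concatenate([b, -b])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (n + m))
x_l1 = res.x[:n]                              # the 1-norm approximation solution
```

A structure-exploiting interior-point method would instead solve a short sequence of weighted least-squares systems A^T D A Δx = r, as noted above.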

A similar result holds for the nuclear norm approximation problem

minimize  ‖A(x) − B‖_*    (3)

where A(x) is a matrix valued function of size p × q and x is an n-vector of variables. This problem can be formulated as a semidefinite program (SDP)

minimize  tr U + tr V
subject to  [ U   (A(x) − B)^T ; A(x) − B   V ] ⪰ 0    (4)

with variables x, U, V. The very large number of variables (O(p²) if we assume p ≥ q) makes the nuclear norm optimization problem very expensive to solve by general-purpose SDP solvers. A specialized interior-point solver for the SDP is described in [Liu and Vandenberghe, 2009a], with a linear algebra cost per iteration of O(n²pq) if n ≥ max{p, q}. This is comparable to solving the matrix approximation problem in Frobenius norm, i.e., minimizing ‖A(x) − B‖_F, and the improvement makes it possible to solve nuclear norm problems with p and q on the order of several hundred by an interior-point method.
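For completeness, a direct, unspecialized encoding of the SDP (4) in CVXPY looks roughly as follows; the affine map A(x), its dimensions and the data are invented for the example. This generic route is exactly what the specialized solver above is designed to avoid for larger p and q.

```python
# Generic sketch of the SDP formulation (4) of the nuclear norm approximation
# problem (3); the affine map A(x) = sum_i x_i * Ai and all data are placeholders.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
p, q, n = 15, 10, 5
Ai = [rng.standard_normal((p, q)) for _ in range(n)]
B = rng.standard_normal((p, q))

x = cp.Variable(n)
U = cp.Variable((q, q), symmetric=True)
V = cp.Variable((p, p), symmetric=True)
Ax = sum(x[i] * Ai[i] for i in range(n))

constraints = [cp.bmat([[U, (Ax - B).T], [Ax - B, V]]) >> 0]
cp.Problem(cp.Minimize(cp.trace(U) + cp.trace(V)), constraints).solve(solver=cp.SCS)
# The same minimizer x is obtained directly from cp.Minimize(cp.normNuc(Ax - B)).
```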

We refer the reader to the book chapter [Andersen et al., 2012] for additional examples of special-purpose interior-point algorithms.

4.2 Nonlinear optimization methods

Burer and Monteiro [2003, 2005] have developed a large-scale method for semidefinite programming, based on substituting a low-rank factorization for the matrix variable and solving the resulting nonconvex problem by an augmented Lagrangian method. Adapted to the SDP (4), the method amounts to reformulating the problem as

minimize  ‖L‖²_F + ‖R‖²_F
subject to  A(x) − B = L R^T    (5)

with variables x, L ∈ R^{p×r}, R ∈ R^{q×r}, where r is an upper bound on the rank of A(x) − B at optimum. Recht et al. [2010] discuss in detail Burer and Monteiro's method in the context of nuclear norm optimization.

4.3 Proximal gradient algorithms

The proximal gradient algorithm is an extension of the gradient algorithm to problems with simple constraints or with simple nondifferentiable terms in the cost function. It is less general than the subgradient algorithm, but it is typically much faster and it handles many types of nondifferentiable problems that occur in practice.

The proximal gradient algorithm applies to a convex problem of the form

minimize  f(x) = g(x) + h(x),    (6)

in which the cost function f is split in two components g and h, with g differentiable and h a 'simple' nondifferentiable function. 'Simple' here means that the prox-operator of h, defined as the mapping

prox_{th}(x) = argmin_u { h(u) + (1/(2t)) ‖u − x‖₂² }

(with t > 0), is inexpensive to compute. It can be shown that if h is closed and convex, then prox_{th}(x) exists and is unique for every x.

A typical example is h(x) = ‖x‖₁. Its prox-operator is the element-wise 'soft-thresholding'

prox_{th}(x)_i =  x_i − t   if x_i ≥ t,
                  0         if −t ≤ x_i ≤ t,
                  x_i + t   if x_i ≤ −t.

Constrained optimization problems

minimize  g(x)
subject to  x ∈ C

can be brought in the form (6) by defining h(x) = I_C(x), the indicator function of C (i.e., I_C(x) = 0 if x ∈ C and I_C(x) = +∞ if x ∉ C). The prox-operator for I_C is the Euclidean projection on C. Prox-operators share many of the properties of Euclidean projections on closed convex sets. For example, they are nonexpansive, i.e.,

‖prox_{th}(x) − prox_{th}(y)‖₂ ≤ ‖x − y‖₂

for all x, y. (See Moreau [1965].)

The proximal gradient method for minimizing (6) uses the iteration

x⁺ = prox_{th}(x − t ∇g(x))

where t > 0 is a step size. The proximal gradient update consists of a standard gradient step for the differentiable term g, followed by an application of the prox-operator associated with the non-differentiable term h. It can be motivated by noting that x⁺ is the minimizer of the function

h(y) + g(x) + ∇g(x)^T (y − x) + (1/(2t)) ‖y − x‖₂²

over y, so x⁺ minimizes an approximation of f, obtained by adding to h a simple local quadratic model of g.
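A compact sketch of this iteration for the familiar case g(x) = ½‖Ax − b‖² and h(x) = λ‖x‖₁ (so the prox step is the soft-thresholding given earlier) is shown below; the data, the weight and the iteration count are illustrative assumptions.

```python
# Sketch of the proximal gradient (ISTA-type) iteration for
# g(x) = 0.5*||Ax - b||^2 and h(x) = lam*||x||_1; data are placeholders.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 100))
x_true = np.zeros(100); x_true[:5] = 1.0
b = A @ x_true + 0.01 * rng.standard_normal(50)
lam = 0.1

def soft_threshold(v, tau):
    """Element-wise prox of tau*||.||_1 (the soft-thresholding above)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

L = np.linalg.norm(A, 2) ** 2        # Lipschitz constant of grad g
t = 1.0 / L                          # step size
x = np.zeros(100)
for _ in range(500):
    grad = A.T @ (A @ x - b)                       # gradient step on g
    x = soft_threshold(x - t * grad, t * lam)      # prox step on h
```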

It can be shown that if ∇g is Lipschitz continuous with constant L, then the suboptimality f(x^(k)) − f* decreases to zero as O(1/k) [Nesterov, 2004, Beck and Teboulle, 2009]. Recently, faster variants of the proximal gradient method with an O(1/k²) convergence rate, under the same assumptions and with the same complexity per step, have been developed [Nesterov, 2004, 2005, Beck and Teboulle, 2009, Tseng, 2008, Becker et al., 2011].

The (accelerated) proximal gradient methods are well suited for problems of the form

minimize  g(x) + ‖x‖

where g is differentiable with a Lipschitz-continuous gradient. Most common norms have easily computed prox-operators, and the following property is useful when computing the prox-operator of a norm h(x) = ‖x‖:

prox_{th}(x) = x − t P_B(x/t),

where P_B is the Euclidean projection on the unit ball in the dual norm.

In other applications it is advantageous to apply the proximal gradient method to the dual problem. Consider for example an optimization problem

minimize  f(x) + ‖Ax − b‖

with f strongly convex. Reformulating this problem as

minimize  f(x) + ‖y‖
subject to  y = Ax − b    (7)

and taking the Lagrange dual gives

maximize  b^T z − f*(A^T z)
subject to  ‖z‖_d ≤ 1

where f*(u) = sup_x (u^T x − f(x)) is the conjugate of f and ‖·‖_d is the dual norm of ‖·‖. It can be shown that if f is strongly convex, then f* is differentiable with a Lipschitz continuous gradient. If projection on the unit ball of the dual norm is inexpensive, the dual problem is therefore readily solved by a fast gradient projection method.

An extensive library of fast proximal-type algorithms is available in the MATLAB software package TFOCS [Becker et al., 2010].

4.4 ADMM

The Alternating Direction Method of Multipliers (ADMM) was proposed in the 1970s as a simplified version of the augmented Lagrangian method. It is a simple and often very effective method for large-scale or distributed optimization, and has recently been applied successfully to the regularized covariance selection problem mentioned above [Scheinberg et al., 2010, Scheinberg and Ma, 2012]. The recent survey by Boyd et al. [2011] gives an overview of the theory and applications of ADMM. Here we limit ourselves to a description of the method when applied to a problem of the form (7). The ADMM iteration consists of two alternating minimization steps (over x and y) of the augmented Lagrangian

L(x, y, z) = f(x) + ‖y‖ + z^T (y − Ax + b) + (t/2) ‖y − Ax + b‖₂²,

followed by an update

z := z + t(y − Ax + b)

of the dual variable z. The complexity of minimizing over x depends on the properties of f. If f is quadratic, for example, it reduces to a least-squares problem. The minimization of the augmented Lagrangian over y reduces to the evaluation of the prox-operator of the norm ‖·‖.
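The sketch below applies these two minimization steps to an instance of (7) with a quadratic f(x) = ½‖x − c‖² and the 1-norm, so that the x-update is a linear solve and the y-update is soft-thresholding; all data and the parameter t are placeholders chosen for illustration.

```python
# Sketch of ADMM for an instance of problem (7): f(x) = 0.5*||x - c||^2 and
# the 1-norm for ||.||; data and the parameter t are placeholders.
import numpy as np

rng = np.random.default_rng(0)
m, n = 30, 20
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
c = rng.standard_normal(n)
t = 1.0                                     # augmented Lagrangian parameter

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

x, y, z = np.zeros(n), np.zeros(m), np.zeros(m)
M = np.eye(n) + t * A.T @ A                 # fixed x-update system matrix
for _ in range(200):
    # x-update: minimize f(x) + z'(y - Ax + b) + (t/2)||y - Ax + b||^2
    x = np.linalg.solve(M, c + A.T @ (z + t * (y + b)))
    # y-update: prox of the 1-norm with parameter 1/t
    y = soft_threshold(A @ x - b - z / t, 1.0 / t)
    # dual update, as in the text
    z = z + t * (y - A @ x + b)
```

Because f is quadratic here, the x-update is a single linear solve, matching the remark above.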

A numerical comparison of the ADMM and proximal gradient algorithms for nuclear norm minimization can be found in the recent paper by Fazel et al. [2011].

5. SUMMARY

Advances in algorithms for large-scale nondifferentiable convex optimization are leading to a greater role of convex optimization in system identification and time series modeling. These techniques are based on formulations that incorporate convex penalty functions that promote low-dimensional model structure (such as sparsity or low rank). Similar techniques have been used extensively in signal processing, image processing, and machine learning. While at this point theoretical results that characterize the success of these convex heuristics in system identification are limited, the extensive theory that supports 1-norm techniques in sparse optimization gives hope that progress can be made in our understanding of similar techniques for system identification as well.

ACKNOWLEDGMENT

This material is based upon work supported by the National Science Foundation under Grants No. ECCS-0824003 and ECCS-1128817. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the National Science Foundation.

REFERENCES

M. S. Andersen, J. Dahl, Z. Liu, and L. Vandenberghe. Interior-point methods for large-scale cone programming. In S. Sra, S. Nowozin, and S. J. Wright, editors, Optimization for Machine Learning, pages 55–83. MIT Press, 2012.

E. Avventiy, A. Lindquist, and B. Wahlberg. Graphical models of autoregressive moving-average processes. In The 19th International Symposium on Mathematical Theory of Networks and Systems (MTNS 2010), July 2010.

F. Bach. Structured sparsity-inducing norms through submodular functions. 2010. Available from arxiv.org/abs/1008.4220.

F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2011.

F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Convex optimization with sparsity-inducing norms. In S. Sra, S. Nowozin, and S. J. Wright, editors, Optimization for Machine Learning, pages 19–53. MIT Press, 2012.

O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9:485–516, 2008.

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

S. Becker, E. J. Candes, and M. Grant. Templates for convex cone problems with applications to sparse signal recovery. 2010. arxiv.org/abs/1009.2065.

S. Becker, J. Bobin, and E. Candes. NESTA: a fast and accurate first-order method for sparse recovery. SIAM Journal on Imaging Sciences, 4(1):1–39, 2011.

S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

D. R. Brillinger. Remarks concerning graphical models for time series and point processes. Revista de Econometria, 16:1–23, 1996.

S. Burer and R. D. C. Monteiro. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming (Series B), 95(2), 2003.

S. Burer and R. D. C. Monteiro. Local minima and convergence in low-rank semidefinite programming. Mathematical Programming (Series A), 103(3), 2005.

E. Candes and T. Tao. The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, 35(6):2313–2351, 2007.

E. J. Candes and Y. Plan. Matrix completion with noise. Proceedings of the IEEE, 98(6):925–936, 2010.

E. J. Candes and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.

E. J. Candes and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.

E. J. Candes and T. Tao. Near-optimal signal recovery from random projections and universal encoding strategies. IEEE Transactions on Information Theory, 52(12), 2006.

E. J. Candes and T. Tao. The power of convex relaxation: near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080, 2010.

E. J. Candes and M. B. Wakin. An introduction to compressive sampling. IEEE Signal Processing Magazine, 25(2):21–30, 2008.

E. J. Candes, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2):489–509, 2006a.

E. J. Candes, J. K. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1207–1223, 2006b.

E. J. Candes, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? Journal of the ACM, 58(3), 2011.

V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky. The convex geometry of linear inverse problems. 2010. arXiv:1012.0621v1.

V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. S. Willsky. Rank-sparsity incoherence for matrix decomposition. SIAM Journal on Optimization, 21(2):572–596, 2011.

S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20:33–61, 1999.

R. Dahlhaus. Graphical interaction models for multivariate time series. Metrika, 51(2):157–172, 2000.

T. Ding, M. Sznaier, and O. Camps. A rank minimization approach to fast dynamic event detection and track matching in video sequences. In Proceedings of the 46th IEEE Conference on Decision and Control, 2007.

D. L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.

D. L. Donoho and X. Huo. Uncertainty principles and ideal atomic decomposition. IEEE Transactions on Information Theory, 47(7):2845–2862, 2001.

D. L. Donoho and J. Tanner. Sparse nonnegative solutions of underdetermined systems by linear programming. Proceedings of the National Academy of Sciences of the United States of America, 102(27):9446–9451, 2005.

J. Duchi, S. Gould, and D. Koller. Projected subgradient methods for learning sparse Gaussians. In Proceedings of the Conference on Uncertainty in AI, 2008.

B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.

M. Elad. Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing. Springer, 2010.

M. Fazel. Matrix Rank Minimization with Applications. PhD thesis, Stanford University, 2002.

M. Fazel, H. Hindi, and S. Boyd. Rank minimization and applications in system theory. In Proceedings of the American Control Conference, pages 3273–3278, 2004.

M. Fazel, T. K. Pong, D. Sun, and P. Tseng. Hankel matrix rank minimization with applications to system identification and realization. 2011. Submitted.

J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432, 2008.

P. M. O. Gebraad, J. W. van Wingerden, G. J. van der Veen, and M. Verhaegen. LPV subspace identification using a novel nuclear norm regularization method. In Proceedings of the American Control Conference, pages 165–170, 2011.

M. Grant and S. Boyd. CVX: Matlab software for disciplined convex programming (web page and software). http://stanford.edu/~boyd/cvx, 2007.

C. Grossmann, C. N. Jones, and M. Morari. System identification via nuclear norm regularization for simulated bed processes from incomplete data sets. In Proceedings of the 48th IEEE Conference on Decision and Control, pages 4692–4697, 2009.

R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for hierarchical sparse coding. Journal of Machine Learning Research, 12:2297–2334, 2011.

L. Li and K.-C. Toh. An inexact interior point method for L1-regularized sparse covariance selection. Mathematical Programming Computation, 2:291–315, 2010.

Z. Liu and L. Vandenberghe. Interior-point method for nuclear norm approximation with application to system identification. SIAM Journal on Matrix Analysis and Applications, 31:1235–1256, 2009a.

Z. Liu and L. Vandenberghe. Semidefinite programming methods for system realization and identification. In Proceedings of the 48th IEEE Conference on Decision and Control, pages 4676–4681, 2009b.

L. Ljung. System Identification: Theory for the User. Prentice Hall, Upper Saddle River, New Jersey, second edition, 1999.

J. Lofberg. YALMIP: A toolbox for modeling and optimization in MATLAB. In Proceedings of the CACSD Conference, Taipei, Taiwan, 2004.

Z. Lu. Smooth optimization approach for sparse covariance selection. SIAM Journal on Optimization, 19(4):1807–1827, 2009.

R. Mazumder, T. Hastie, and R. Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 11:2287–2322, 2010.

N. Meinshausen and P. Buhlmann. High-dimensional graphs and variable selection with the Lasso. Annals of Statistics, 34(3):1436–1462, 2006.

K. Mohan and M. Fazel. Reweighted nuclear norm minimization with application to system identification. In Proceedings of the American Control Conference (ACC), pages 2953–2959, 2010.

J. J. Moreau. Proximité et dualité dans un espace hilbertien. Bull. Math. Soc. France, 93:273–299, 1965.

Yu. Nesterov. Introductory Lectures on Convex Optimization. Kluwer Academic Publishers, Dordrecht, The Netherlands, 2004.

Yu. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming Series A, 103:127–152, 2005.

T. K. Pong, P. Tseng, Shuiwang Ji, and J. Ye. Trace norm regularization: reformulations, algorithms, and multi-task learning. SIAM Journal on Optimization, 20(6):3465–3489, 2011.

R. Ravikumar, M. J. Wainwright, G. Raskutti, and B. Yu. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence, 2008. arxiv.org/abs/0811.3628.

B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501, 2010.

J. Romberg. Imaging via compressive sampling. IEEE Signal Processing Magazine, 25(2):14–20, 2008.

L. Rudin, S. J. Osher, and E. Fatemi. Nonlinear total variation based noise removal algorithms. Physica D, 60:259–268, 1992.

K. Scheinberg and S. Ma. Optimization methods for sparse inverse covariance selection. In S. Sra, S. Nowozin, and S. J. Wright, editors, Optimization for Machine Learning, pages 455–477. MIT Press, 2012.

K. Scheinberg and I. Rish. SINCO - a greedy coordinate ascent method for sparse inverse covariance selection problem. Technical report, 2009. IBM Research Report.

K. Scheinberg, S. Ma, and D. Goldfarb. Sparse inverse covariance selection via alternating linearization methods. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R.S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 2101–2109. 2010.

J. Songsiri and L. Vandenberghe. Topology selection in graphical models of autoregressive processes. Journal of Machine Learning Research, 11:2671–2705, 2010.

J. Songsiri, J. Dahl, and L. Vandenberghe. Graphical models of autoregressive processes. In Y. Eldar and D. Palomar, editors, Convex Optimization in Signal Processing and Communications, pages 89–116. Cambridge University Press, Cambridge, 2010.

N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In Lawrence K. Saul, Yair Weiss, and Leon Bottou, editors, Advances in Neural Information Processing Systems 17, pages 1329–1336. MIT Press, Cambridge, MA, 2005.

R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1):267–288, 1996.

J. A. Tropp. Just relax: Convex programming methods for identifying sparse signals in noise. IEEE Transactions on Information Theory, 52(3):1030–1051, 2006.

P. Tseng. On accelerated proximal gradient methods for convex-concave optimization. 2008.

M. Verhaegen and V. Verdult. Filtering and System Identification. Cambridge University Press, 2007.

M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19, 2007.


Distributed Change Detection ⋆

Henrik Ohlsson ∗,∗∗   Tianshi Chen ∗   Sina Khoshfetrat Pakazad ∗   Lennart Ljung ∗   S. Shankar Sastry ∗∗

∗ Division of Automatic Control, Department of Electrical Engineering, Linköping University, Sweden, e-mail: [email protected].
∗∗ Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, CA, USA.

Abstract: Change detection has traditionally been seen as a centralized problem. Many change detection problems are however distributed in nature, and the need for distributed change detection algorithms is therefore significant. In this paper a distributed change detection algorithm is proposed. The change detection problem is first formulated as a convex optimization problem and then solved distributively with the alternating direction method of multipliers (ADMM). To further reduce the computational burden on each sensor, a homotopy solution is also derived. The proposed method has interesting connections with Lasso and compressed sensing, and the theory developed for these methods is therefore directly applicable.

1. INTRODUCTION

The change detection problem is often thought of as a centralized problem. Many scenarios are however distributed and lack a central node, or require distributed processing. A practical example is a sensor network. Selecting one of the sensors as a central node may make the network vulnerable. Moreover, it may be preferable if a failing sensor can be detected in a distributed manner. Another practical example is the monitoring of a fleet of agents (airplanes/UAVs/robots) of the same type, see e.g., Chu et al. [2011]. The problem is how to detect if one or more agents start deviating from the rest. Theoretically, this can be done straightforwardly in a centralized manner. The centralized solution, however, poses many difficulties in practice. For instance, the communication between the central monitor and the agents, and the computation capacity and speed of the central monitor, are highly demanding due to the large number of agents in the fleet and/or the extremely large data sets to be processed, Chu et al. [2011]. Therefore, it is desirable to deal with the change detection problem in a distributed way.

In a distributed setting, there will be no central node. Each sensor or agent makes use of measurements from itself and the other sensors or agents to detect if it has failed or not. To tackle the problem, we first formulate the change detection problem as a convex optimization problem. We then solve the problem in a distributed manner using the so-called alternating direction method of multipliers (ADMM, see for instance Boyd et al. [2011]). The optimization problem turns out to have connections with the Lasso [Tibshirani, 1996] and compressive sensing [Candes et al., 2006, Donoho, 2006], and the theory developed for these methods is therefore applicable. To further reduce the computational burden on each sensor, a homotopy solution (see e.g., Garrigues and El Ghaoui [2008]) is also studied. Finally, we show the effectiveness of the proposed method by a numerical example.

⋆ Ohlsson, Chen and Ljung are partially supported by the Swedish Research Council in the Linnaeus center CADICS and by the European Research Council under the advanced grant LEARN, contract 267381. Ohlsson is also supported by a postdoctoral grant from the Sweden-America Foundation, donated by ASEA's Fellowship Fund, and by a postdoctoral grant from the Swedish Research Council.

2. PROBLEM FORMULATION

The basic idea of the proposed method is to use system identification, in a distributed manner, to obtain a nominal model for the sensors or agents, and then detect whether one or more sensors or agents start deviating from this nominal model.

To set up the notation, assume that we are given a sensor network consisting of N sensors. Denote the measurement from sensor i at time t by y_i(t) and assume that there is a linear relation of the form

y_i(t) = φ_i^T(t) θ + e_i(t),    (1)

describing the relation between the measurable quantity y_i(t) ∈ R^n and the known quantity φ_i^T(t) ∈ R^{n×m}. We will call θ ∈ R^m the state. The state is related to the sensor reading through φ_i(t). e_i(t) ∈ R^n is the measurement noise and assumed to be white Gaussian distributed with zero mean and variance σ_i². Moreover, e_i(t) is assumed independent of e_j(t), for all i = 1, ..., N and j = 1, ..., i−1, i+1, ..., N. At time t it is assumed that sensor i obtains y_i(t) and knows φ_i(t).

The problem is now, in a distributed manner, to detect a failing sensor. That is, detect if the relation (1) is no longer valid.

Remark 2.1. (Time varying state). A dynamical equation or a random walk type of description for the state θ can be incorporated. This is straightforward but for the sake of clarity and due to page limitations, this is not shown here. Actually, the only restriction is that θ does not vary over sensors (that is, not dependent on i).


Remark 2.2. (Partially observed state). Note that (1) does not imply that the sensors need to measure all elements of θ. Some sensors can observe some parts and other sensors other parts.

Remark 2.3. (Time-varying network topology). That sensors are added and taken away from the network is a very realistic scenario. We will assume that N is the maximum number of sensors in the network and set φ_i(t) = 0 if sensor i is not present at time t.

Remark 2.4. (Multidimensional y_i(t)). For notational simplicity, from now on, we have chosen to let y_i(t) ∈ R. However, the extension to multidimensional y_i(t) is straightforward.

Remark 2.5. (Distributed system identification). The proposed algorithm could also be seen as a robust distributed system identification scheme. The algorithm computes, in a distributed manner, a nominal model using observations from several systems and is robust to systems deviating from the majority.

A straightforward way to solve the distributed change detection problem as posed here would be to
(1) locally, at each sensor, estimate θ;
(2) broadcast the estimates and the error covariances;
(3) at each sensor, fuse the estimates;
(4) at each sensor, use a likelihood ratio test to detect a failing sensor (see e.g., Willsky and Jones [1976], Willsky [1976]).

This method will work fine as long as the number of measurements available at each sensor well exceeds m. Let us say that {(y_i(t), φ_i(t))}_{t=T−T_i+1}^{T} is available at sensor i, i = 1, ..., N. It is hence required that T_1, T_2, ..., T_N ≥ m. If m > T_i for some i = 1, ..., N, the method will however fail. That m > T_i for some i = 1, ..., N is a very realistic scenario. T_i may for example be very small if new sensors may be added to the network at any time. The case T_1, T_2, ..., T_N ≥ m was previously discussed in Chu et al. [2011].

Remark 2.6. (Sending all data). One may also consider broadcasting data and solving the full problem on each sensor. Sending all data available at time T may however be too much. Sensor i would then have to broadcast {(y_i(t), φ_i(t))}_{t=T−T_i+1}^{T}.

3. BACKGROUND

Change detection has a long history (see e.g., Gustafsson [2001], Patton et al. [1989], Basseville and Nikiforov [1993] and references therein) but has traditionally been seen as a centralized problem. The literature on distributed or decentralized change detection is therefore rather small, and only standard methods such as CUSUM and the generalized likelihood ratio (GLR) test have been discussed and extended to distributed scenarios (see e.g., Tartakovsky and Veeravalli [2002, 2003]). The method proposed here certainly has a strong connection to GLR (see for instance Ohlsson et al. [2012]) and an extensive comparison is seen as future work.

The change detection algorithm proposed here also has connections to compressive sensing and ℓ1-minimization. There are several comprehensive review papers that cover the literature on compressive sensing and related optimization techniques in linear programming. The reader is referred to the works of Candes and Wakin [2008], Bruckstein et al. [2009], Loris [2009], Yang et al. [2010].

4. PROPOSED METHOD – BATCH SOLUTION

Assume that the data set {(y_i(t), φ_i(t))}_{t=T−T_i+1}^{T} is available at sensor i, i = 1, ..., N. Since (1) is assumed to hold for a functioning sensor, we would like to detect a failing sensor by checking if its likelihood falls below some threshold. What complicates the problem is that:
• θ is unknown,
• m > T_i for some i = 1, ..., N, typically.
We first solve the problem in a centralized setting.

4.1 Centralized Solution

Introduce θ_i for the state of sensor i, i = 1, ..., N. Assume that we know that k sensors have failed. The maximum likelihood (ML) solution for θ_i, i = 1, ..., N (taking into account that N − k sensors have the same state) can then be computed by

min_{θ_1,...,θ_N, θ}  Σ_{i=1}^{N} Σ_{t=T−T_i+1}^{T} ‖y_i(t) − φ_i^T(t) θ_i‖²_{σ_i²}    (2a)
subj. to  ‖[ ‖θ_1 − θ‖_p  ‖θ_2 − θ‖_p  ...  ‖θ_N − θ‖_p ]‖_0 = k,    (2b)

with ‖·‖_p being the p-norm, p ≥ 1, and ‖·‖_{σ_i²} defined as ‖·/σ_i‖_2. The k failing sensors could now be identified as the sensors for which ‖θ_i − θ‖_p ≠ 0. It follows from basic optimization theory (see for instance Boyd and Vandenberghe [2004]) that there exists a λ > 0 such that

min_{θ_1,...,θ_N, θ}  Σ_{i=1}^{N} Σ_{t=T−T_i+1}^{T} ‖y_i(t) − φ_i^T(t) θ_i‖²_{σ_i²} + λ ‖[ ‖θ_1 − θ‖_p  ‖θ_2 − θ‖_p  ...  ‖θ_N − θ‖_p ]‖_0,    (3)

gives exactly the same estimate for θ_1, ..., θ_N, θ as (2). However, both (2) and (3) are non-convex and combinatorial, and in practice unsolvable.

What makes (3) non-convex is the second term. It has recently become popular to approximate the zero-norm by its convex envelope, that is, to replace the zero-norm by the one-norm. This is in line with the reasoning behind the Lasso [Tibshirani, 1996] and compressed sensing [Candes et al., 2006, Donoho, 2006]. Relaxing the zero-norm by replacing it with the one-norm leads to the convex criterion

min_{θ, θ_1,...,θ_N}  Σ_{i=1}^{N} Σ_{t=T−T_i+1}^{T} ‖y_i(t) − φ_i^T(t) θ_i‖²_{σ_i²} + λ Σ_{i=1}^{N} ‖θ − θ_i‖_p.    (4)

θ should be interpreted as the nominal model. Most sensors will have data that can be explained by the nominal model θ, and the criterion (4) will therefore give θ_i = θ for most i's. However, failing sensors will generate a data sequence that could not have been generated by the nominal model represented by θ, and for these sensors (4) will give θ_i ≠ θ.

In (4), λ regulates the trade-off between misfit to the observations and the deviation from the nominal model θ. In practice, a large λ will make us less sensitive to noise but may also imply that we fail to detect a deviating sensor. However, a too small λ may in a noisy environment generate false alarms. λ should be seen as an application dependent design parameter. The estimates of (2) and (3) are indifferent to the choice of p. The estimate of (4) is not, however. In general, p = 1 is a good choice if one is interested in detecting changes in individual elements of sensors' or agents' states. If one only cares about detecting if a sensor or agent is failing, p > 1 is a better choice.

What is remarkable is that under some conditions on φ_i(t) and the number of failures, the criterion (4) will work exactly as well as (2). That is, (4) and (2) will pick out exactly the same sensors as failing sensors. To examine when this happens, theory developed in compressive sensing can be used. This is not discussed here but is a good future research direction.
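The following sketch (not the authors' code) sets up the centralized criterion (4) for p = 2 in CVXPY on simulated data; the network size, data lengths, noise levels, the injected fault and the weight lam are all assumptions made for illustration.

```python
# Illustrative sketch of the centralized convex criterion (4) with p = 2.
# All data, dimensions and the weight lam are placeholders.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
N, m, Ti = 10, 4, 3                           # sensors, state dim, samples/sensor
sigma = [0.1] * N
theta_true = rng.standard_normal(m)

Phi, Y = [], []
for i in range(N):
    P = rng.standard_normal((Ti, m))
    th = theta_true.copy()
    if i == 0:                                # sensor 0 has failed (deviates)
        th[0] += 2.0
    Phi.append(P)
    Y.append(P @ th + sigma[i] * rng.standard_normal(Ti))

theta = cp.Variable(m)
thetas = [cp.Variable(m) for _ in range(N)]
lam = 2.0                                     # assumed design parameter
fit = sum(cp.sum_squares(Y[i] - Phi[i] @ thetas[i]) / sigma[i] ** 2 for i in range(N))
reg = sum(cp.norm(theta - thetas[i], 2) for i in range(N))
cp.Problem(cp.Minimize(fit + lam * reg)).solve()

deviation = [np.linalg.norm(theta.value - thetas[i].value) for i in range(N)]
# Sensors with a clearly nonzero deviation are flagged as failing.
```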

4.2 Distributed Solution

Let us now apply ADMM (see e.g., Boyd et al. [2011], [Bertsekas and Tsitsiklis, 1997, Sec. 3.4]) to solve the identification problem in a distributed manner. First let

Y_i = [ y_i(T−T_i+1)  y_i(T−T_i+2)  ...  y_i(T) ]^T,    Φ_i = [ φ_i(T−T_i+1)  φ_i(T−T_i+2)  ...  φ_i(T) ]^T.    (5)

The optimization problem (4) can then be written as

min_{θ, θ_1, ϑ_1, ..., θ_N, ϑ_N}  Σ_{i=1}^{N} ( ‖Y_i − Φ_i θ_i‖²_{σ_i²} + λ ‖ϑ_i − θ_i‖_p ),    (6a)
subj. to  ϑ_i − θ = 0,  i = 1, ..., N.    (6b)

Let x^T = [θ_1^T ... θ_N^T  ϑ_1^T ... ϑ_N^T], and let ν^T = [ν_1^T ν_2^T ... ν_N^T] be the Lagrange multiplier vector, with ν_i the Lagrange multiplier associated with the ith constraint ϑ_i − θ = 0, i = 1, ..., N. So the augmented Lagrangian takes the following form

L_ρ(θ, x, ν) = Σ_{i=1}^{N} ( ‖Y_i − Φ_i θ_i‖²_{σ_i²} + λ ‖ϑ_i − θ_i‖_p + ν_i^T(ϑ_i − θ) + (ρ/2) ‖ϑ_i − θ‖² ).    (7)

ADMM consists of the following update rules

x^{k+1} = argmin_x L_ρ(θ^k, x, ν^k),    (8a)
θ^{k+1} = (1/N) Σ_{i=1}^{N} (ϑ_i^{k+1} + (1/ρ) ν_i^k),    (8b)
ν_i^{k+1} = ν_i^k + ρ (ϑ_i^{k+1} − θ^{k+1}),  for i = 1, ..., N.    (8c)

Remark 4.1. It should be noted that, given $\theta^k,\nu^k$, the criterion $L_\rho(\theta^k,x,\nu^k)$ in (8a) is separable in the pairs $(\vartheta_i,\theta_i)$, $i=1,\dots,N$. Therefore, the optimization can be done separately, for each $i$, as
$$\vartheta_i^{k+1},\theta_i^{k+1}=\arg\min_{\vartheta_i,\theta_i}\;\frac{\|Y_i-\Phi_i\theta_i\|^2}{\sigma_i^2}+\lambda\|\vartheta_i-\theta_i\|_p+(\nu_i^k)^{\mathrm T}(\vartheta_i-\theta^k)+(\rho/2)\|\vartheta_i-\theta^k\|^2. \qquad (9)$$

Remark 4.2. (Boyd et al. [2011]). It is interesting to note that no matter what $\nu^1$ is,
$$\sum_{i=1}^{N}\nu_i^k=0,\quad k\geq 2. \qquad (10)$$
To show (10), first note that
$$\sum_{i=1}^{N}\nu_i^{k+1}=\sum_{i=1}^{N}\nu_i^k+\rho\sum_{i=1}^{N}\vartheta_i^{k+1}-N\rho\,\theta^{k+1},\quad k\geq 1. \qquad (11)$$
Inserting $\theta^{k+1}$ from (8b) into the above equation yields (10). So, without loss of generality, further assume
$$\sum_{i=1}^{N}\nu_i^1=0. \qquad (12)$$
Then the update on $\theta$ reduces to
$$\theta^{k+1}=(1/N)\sum_{i=1}^{N}\vartheta_i^{k+1},\quad k\geq 1. \qquad (13)$$

As a result, in order to implement ADMM in a distributed manner, each sensor or system $i$ should follow the steps below.
(1) Initialization: set $\theta^1$, $\nu_i^1$ and $\rho$.
(2) $\vartheta_i^{k+1},\theta_i^{k+1}=\arg\min_{\theta_i,\vartheta_i} L_\rho(\theta^k,x,\nu^k)$.
(3) Broadcast $\vartheta_i^{k+1}$ to the other systems (sensors) $j=1,\dots,i-1,i+1,\dots,N$.
(4) $\theta^{k+1}=(1/N)\sum_{i=1}^{N}\vartheta_i^{k+1}$.
(5) $\nu_i^{k+1}=\nu_i^k+\rho(\vartheta_i^{k+1}-\theta^{k+1})$.
(6) If not converged, set $k=k+1$ and return to step 2.
To show that ADMM gives
• $\theta^k-\vartheta_i^k\to 0$ as $k\to\infty$, $i=1,\dots,N$,
• $\sum_{i=1}^{N}\|Y_i-\Phi_i\theta_i^k\|^2/\sigma_i^2+\lambda\|\vartheta_i^k-\theta_i^k\|_p\to p^\ast$, where $p^\ast$ is the optimal objective of (4),
it is sufficient to show that the Lagrangian $L_0(\theta,x,\nu)$ (the augmented Lagrangian evaluated at $\rho=0$) has a saddle point, according to [Boyd et al., 2011, Sect. 3.2.1] (since the objective consists of closed, proper and convex functions). Let $\theta^\ast,x^\ast$ denote the solution of (4). It is easy to show that $\theta^\ast,x^\ast$ and $\nu=0$ form a saddle point. Since $L_0(\theta,x,0)$ is convex,
$$L_0(\theta^\ast,x^\ast,0)\leq L_0(\theta,x,0)\quad\forall\,\theta,x, \qquad (14)$$
and since $L_0(\theta^\ast,x^\ast,0)=L_0(\theta^\ast,x^\ast,\nu)$ for all $\nu$, the point $\theta^\ast,x^\ast,\nu=0$ must be a saddle point. ADMM hence converges to the solution of (4) in the sense listed above.
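To make the iteration concrete, the following is a minimal Python sketch of the update rules (8a)-(8c) for $p=2$, written as a single loop over sensors; the local subproblem (9) is handed to a generic convex solver rather than the warm-start or homotopy machinery discussed later, and all names (`Ys`, `Phis`, `sigma2s`) are illustrative placeholders, not the authors' implementation. In a genuinely distributed deployment, the loop over $i$ would run locally on each sensor, with only $\vartheta_i^{k+1}$ broadcast.

```python
# Sketch of the distributed ADMM iteration (8a)-(8c), assuming p = 2.
import numpy as np
import cvxpy as cp

def local_step(Y_i, Phi_i, sigma2_i, theta_k, nu_i, lam, rho):
    """Solve the separable subproblem (9) for one sensor with a generic solver."""
    m = Phi_i.shape[1]
    th = cp.Variable(m)    # theta_i
    vth = cp.Variable(m)   # vartheta_i (local copy of the nominal model)
    obj = (cp.sum_squares(Y_i - Phi_i @ th) / sigma2_i
           + lam * cp.norm(vth - th, 2)
           + cp.sum(cp.multiply(nu_i, vth - theta_k))
           + (rho / 2) * cp.sum_squares(vth - theta_k))
    cp.Problem(cp.Minimize(obj)).solve()
    return th.value, vth.value

def admm(Ys, Phis, sigma2s, lam=20.0, rho=20.0, iters=10):
    N, m = len(Ys), Phis[0].shape[1]
    theta = np.zeros(m)                      # nominal model theta^1
    nus = [np.zeros(m) for _ in range(N)]    # multipliers nu_i^1 (sum to zero)
    thetas = [np.zeros(m) for _ in range(N)]
    for _ in range(iters):
        varthetas = []
        for i in range(N):                   # step (8a): local, parallelizable
            thetas[i], vth = local_step(Ys[i], Phis[i], sigma2s[i],
                                        theta, nus[i], lam, rho)
            varthetas.append(vth)
        theta = np.mean(varthetas, axis=0)   # step (8b), simplified as in (13)
        for i in range(N):                   # step (8c): dual update
            nus[i] = nus[i] + rho * (varthetas[i] - theta)
    return theta, thetas
```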

5. PROPOSED METHOD – RECURSIVE SOLUTION

To apply the above batch method to a scenario where we continuously get new measurements, we propose to re-run the batch algorithm every $T$th sample time:
(1) Initialize by running the batch algorithm proposed in the previous section on the data available.
(2) Every $T$th time step, re-run the batch algorithm using the last $sT$, $s\in\mathbb{N}$, data. Initialize the ADMM iterations using the estimates of $\theta$ and $\nu$ from the previous run. Since faults occur rarely over time, the optimal solutions for different data batches are often similar; using the estimates of $\theta$ and $\nu$ from the previous run to initialize the ADMM iterations can therefore speed up convergence considerably.

To add $T$ new observation pairs, one could possibly use an extended version of the homotopy algorithm proposed by Garrigues and El Ghaoui [2008], which was developed for including new observations in the Lasso.


The ADMM algorithm presented in the previous section could also be altered to use single data samples upon arrival. In such a setup, instead of waiting for a collection or batch of measurements, the algorithm is updated as soon as new measurements arrive. This can be done by studying the augmented Lagrangian of the problem. The augmented Lagrangian in (7) can also be written in normalized form as
$$L_\rho(\theta,x,\nu)=\sum_{i=1}^{N}\frac{\|Y_i-\Phi_i\theta_i\|^2}{\sigma_i^2}+\lambda\|\vartheta_i-\theta_i\|_p+(\rho/2)\|\vartheta_i-(\theta-\nu_i)\|^2-(\rho/2)\|\nu_i\|^2, \qquad (15)$$
where $\nu_i$ now denotes the scaled multiplier $\nu_i/\rho$. Hence, for $p=2$, the update in (9) can be achieved by solving the following convex optimization problem, which can be written as a Second Order Cone Programming (SOCP) problem [Boyd and Vandenberghe, 2004]:
$$\min_{\theta_i,\vartheta_i,s}\;\theta_i^{\mathrm T}H_i\theta_i-2\theta_i^{\mathrm T}h_i+(\rho/2)\vartheta_i^{\mathrm T}\vartheta_i-\rho\,\vartheta_i^{\mathrm T}h_i^k+\lambda s \quad\text{subj. to}\quad \|\theta_i-\vartheta_i\|\leq s, \qquad (16)$$
where the data matrices describing this optimization problem are
$$H_i=\Phi_i^{\mathrm T}\Phi_i/\sigma_i^2,\qquad h_i=\Phi_i^{\mathrm T}Y_i/\sigma_i^2,\qquad h_i^k=\theta^k-\nu_i^k. \qquad (17)$$

As can be seen from (17), only $H_i$ and $h_i$ among these matrices are affected by new measurements. Let $y_i^{\mathrm{new}}$ and $\varphi_i^{\mathrm{new}}$ denote the new measurements. Then $H_i$ and $h_i$ can be updated as follows:
$$H_i \leftarrow H_i+\varphi_i^{\mathrm{new}}(\varphi_i^{\mathrm{new}})^{\mathrm T}/\sigma_i^2,\qquad h_i \leftarrow h_i+\varphi_i^{\mathrm{new}}y_i^{\mathrm{new}}/\sigma_i^2. \qquad (18)$$
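A small sketch of the rank-one information updates (18), with illustrative variable names; in practice each sensor would apply this to its local $H_i$, $h_i$ as soon as a new pair $(\varphi_i^{\mathrm{new}}, y_i^{\mathrm{new}})$ arrives.

```python
# Sketch of the measurement update (18); names are illustrative.
import numpy as np

def update_information(H_i, h_i, phi_new, y_new, sigma2_i):
    """Rank-one update of H_i = Phi_i^T Phi_i / sigma_i^2 and
    h_i = Phi_i^T Y_i / sigma_i^2 when a new pair (phi_new, y_new) arrives."""
    H_i = H_i + np.outer(phi_new, phi_new) / sigma2_i
    h_i = h_i + phi_new * y_new / sigma2_i
    return H_i, h_i
```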

To handle single data samples upon arrival, step 2 of the ADMM algorithm should therefore be altered to:

(2) If there exist any new measurements, update $H_i$ and $h_i$ according to (18). Find $\vartheta_i^{k+1},\theta_i^{k+1}$ by solving (16).

Remark 5.1. In order for this algorithm to be responsive to the arrival of new measurements, network-wide persistent communication is required. As a result, this approach demands much higher communication traffic than the batch solution.

6. IMPLEMENTATION ISSUES

Step 2 of ADMM requires solving the optimization problem
$$\min_x L_\rho(\theta^k,x,\nu^k). \qquad (19)$$
This problem is solved locally on each sensor once every ADMM iteration. What varies from one iteration to the next are the values of the arguments $\theta^k$ and $\nu^k$. However, it is unlikely that $\theta^k$ and $\nu^k$ differ significantly from $\theta^{k+1}$ and $\nu^{k+1}$. Exploiting this fact can considerably ease the computational load on each sensor. We present two methods for doing so: warm start and a homotopy method.

The following two subsections are rather technical, and we refer the reader not interested in implementing the proposed algorithm to Section 7.

6.1 Warm Start for Step 2 of the ADMM Algorithm

For the case $p=2$, at each iteration of the ADMM we have to solve the SOCP problem described in (16) and (17). However, in the batch solution, only $h_i^k$ among the matrices in (17) changes with the iteration number. Therefore, if we assume that the vectors $\theta^k$ and $\nu_i^k$ do not change drastically from iteration to iteration, the solution of (16) at the $k$th iteration can be used to warm start the problem at the $(k+1)$th iteration. This is done as follows.

The Lagrangian for the problem in (16) can be written as
$$L(\theta_i,\vartheta_i,s,z_i)=\theta_i^{\mathrm T}H_i\theta_i-2\theta_i^{\mathrm T}h_i+(\rho/2)\vartheta_i^{\mathrm T}\vartheta_i-\rho\,\vartheta_i^{\mathrm T}h_i^k+\lambda s-z_{i1}s-z_{i2}^{\mathrm T}(\theta_i-\vartheta_i), \qquad (20)$$
for all $\|z_{i2}\|\leq z_{i1}$. From this, the optimality conditions for the problem in (16) can be written as

$$2H_i\theta_i-2h_i-z_{i2}=0, \qquad (21a)$$
$$\rho\vartheta_i-\rho h_i^k+z_{i2}=0, \qquad (21b)$$
$$\lambda-z_{i1}=0, \qquad (21c)$$
$$\|z_{i2}\|\leq z_{i1}, \qquad (21d)$$
$$\|\theta_i-\vartheta_i\|\leq s, \qquad (21e)$$
$$z_{i1}s+z_{i2}^{\mathrm T}(\theta_i-\vartheta_i)=0, \qquad (21f)$$
where $z_i=\begin{bmatrix}z_{i1}\\ z_{i2}\end{bmatrix}$ [Boyd and Vandenberghe, 2004]. Let $\theta_i^\ast$, $\vartheta_i^\ast$, $s^\ast$ and $z_i^\ast$ be the primal and dual optima of the problem in (16) at the $k$th iteration, and let $h_i^{k+1}=h_i^k+\Delta h_i$. These can be used to generate a good warm-start point for the solver of (16). As a result, by (21) the following vectors can be used to warm start the solver:

$$\theta_i^w=\theta_i^\ast,\qquad \vartheta_i^w=\vartheta_i^\ast+\Delta h_i,\qquad z_i^w=z_i^\ast,\qquad s^w=s^\ast+\Delta s, \qquad (22)$$
where $\Delta s$ should be chosen such that
$$\|\theta_i^w-\vartheta_i^w\|\leq s^\ast+\Delta s,\qquad z_{i1}^w(s^\ast+\Delta s)+(z_{i2}^w)^{\mathrm T}(\theta_i^w-\vartheta_i^w)=\mu, \qquad (23)$$
for some $\mu\geq 0$.
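The warm start (22)-(23) amounts to shifting the previous optimum along the change in $h_i^k$. A hedged sketch is given below; the particular choice of $\Delta s$ (the smallest value restoring feasibility plus a small slack) is one possible way to satisfy (23) and is not prescribed by the paper.

```python
# Sketch of the warm start (22)-(23) for the SOCP (16); names are illustrative.
import numpy as np

def warm_start(theta_star, vartheta_star, z_star, s_star, delta_h, slack=1e-6):
    """Shift the previous primal/dual optimum when h_i^k changes by delta_h."""
    theta_w = theta_star
    vartheta_w = vartheta_star + delta_h          # keeps (21b) satisfied
    z_w = z_star                                   # dual variables unchanged
    # choose delta_s so that ||theta_w - vartheta_w|| <= s* + delta_s, cf. (23)
    delta_s = max(0.0, np.linalg.norm(theta_w - vartheta_w) - s_star) + slack
    s_w = s_star + delta_s
    return theta_w, vartheta_w, z_w, s_w
```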

6.2 A Homotopy Algorithm

Since it is unlikely that $\theta^k$ and $\nu^k$ differ significantly from $\theta^{k+1}$ and $\nu^{k+1}$, one can also use the previously computed solution $x^k$ and update it via a homotopy instead of re-solving (19) from scratch at every iteration. In the following we assume that $p=1$ and omit the details for $p>1$.

First, define
$$\delta_i=\vartheta_i-\theta_i. \qquad (24)$$

The optimization objective of (9) is then
$$g_i^k(\theta_i,\delta_i)\triangleq \frac{\|Y_i-\Phi_i\theta_i\|^2}{\sigma_i^2}+\lambda\|\delta_i\|_1+(\nu_i^k)^{\mathrm T}(\delta_i+\theta_i-\theta^k)+(\rho/2)\|\delta_i+\theta_i-\theta^k\|^2. \qquad (25)$$
It is then straightforward to show that the optimization problem (9) is equivalent to
$$\theta_i^{k+1},\delta_i^{k+1}=\arg\min_{\theta_i,\delta_i} g_i^k(\theta_i,\delta_i). \qquad (26)$$
Moreover,
$$\vartheta_i^{k+1}=\delta_i^{k+1}+\theta_i^{k+1}. \qquad (27)$$


Now, compute the subdifferential of $g_i^k(\theta_i,\delta_i)$ w.r.t. $\theta_i$ and $\delta_i$. Simple calculations show that
$$\partial_{\theta_i}g_i^k(\theta_i,\delta_i)=\nabla_{\theta_i}g_i^k(\theta_i,\delta_i)=-\tfrac{2}{\sigma_i^2}Y_i^{\mathrm T}\Phi_i+\tfrac{2}{\sigma_i^2}\theta_i^{\mathrm T}\Phi_i^{\mathrm T}\Phi_i+(\nu_i^k)^{\mathrm T}+\rho(\theta_i+\delta_i-\theta^k)^{\mathrm T}, \qquad (28a)$$
$$\partial_{\delta_i}g_i^k(\theta_i,\delta_i)=\lambda\,\partial\|\delta_i\|_1+(\nu_i^k)^{\mathrm T}+\rho(\delta_i+\theta_i-\theta^k)^{\mathrm T}. \qquad (28b)$$
A necessary condition for the global optimal solution $\theta_i^{k+1},\delta_i^{k+1}$ of the optimization problem (9) is
$$0\in\partial_{\theta_i}g_i^k\big(\theta_i^{k+1},\delta_i^{k+1}\big), \qquad (29a)$$
$$0\in\partial_{\delta_i}g_i^k\big(\theta_i^{k+1},\delta_i^{k+1}\big). \qquad (29b)$$
It follows from (28a) and (29a) that
$$\theta_i^{k+1}=R_i^{-1}\big(h_i-\nu_i^k/2-(\rho/2)(\delta_i-\theta^k)\big), \qquad (30)$$

where we have let
$$R_i=\Phi_i^{\mathrm T}\Phi_i/\sigma_i^2+(\rho/2)I,\qquad h_i=\Phi_i^{\mathrm T}Y_i/\sigma_i^2. \qquad (31)$$
With (30), the problem now reduces to solving (29b). Inserting (30) into (28b) and letting $Q_i\triangleq I-(\rho/2)R_i^{-1}$ yields
$$\partial_{\delta_i}g_i^k(\theta_i^{k+1},\delta_i)=\lambda\,\partial\|\delta_i\|_1+\rho h_i^{\mathrm T}R_i^{-1}+(\nu_i^k-\rho\theta^k+\rho\delta_i)^{\mathrm T}Q_i.$$

Now, replace $\theta^k$ with $\theta^k+t\Delta\theta^{k+1}$ and $\nu_i^k$ with $\nu_i^k+t\Delta\nu_i^{k+1}$, and let
$$G_i^k(t)=\lambda\,\partial\|\delta_i\|_1+\rho h_i^{\mathrm T}R_i^{-1}+(\nu_i^k-\rho\theta^k+\rho\delta_i)^{\mathrm T}Q_i+t(\Delta\nu_i^{k+1}-\rho\Delta\theta^{k+1})^{\mathrm T}Q_i. \qquad (32)$$
$\partial_{\delta_i}g_i^k(\theta_i^{k+1},\delta_i)$ hence equals $G_i^k(0)$. Let $\delta_i^k(t)=\arg\min_{\delta_i}G_i^k(t)$. It follows that
$$\delta_i^{k+1}=\delta_i^k(0),\qquad \delta_i^{k+2}=\delta_i^k(1). \qquad (33)$$

Assume now that $\delta_i^{k+1}$ has been computed and that the elements have been arranged such that the first $q$ elements are nonzero and the last $m-q$ are zero. Let us write
$$\delta_i^k(0)=\begin{bmatrix}\bar\delta_i^k\\ 0\end{bmatrix}. \qquad (34)$$
We then have (with both the sign and $|\cdot|$ taken elementwise)
$$\partial\|\delta_i^k(0)\|_1=\begin{bmatrix}\operatorname{sign}(\bar\delta_i^k)^{\mathrm T} & v^{\mathrm T}\end{bmatrix},\qquad v\in\mathbb{R}^{m-q},\ |v|\leq 1. \qquad (35)$$
Hence, $0\in G_i^k(0)$ is equivalent to
$$\lambda\operatorname{sign}(\bar\delta_i^k)^{\mathrm T}+\rho h_i^{\mathrm T}\bar R_i^{-1}+(\nu_i^k-\rho\theta^k+\rho\delta_i^k)^{\mathrm T}\bar Q_i=0, \qquad (36a)$$
$$\lambda v^{\mathrm T}+\rho h_i^{\mathrm T}\tilde R_i^{-1}+(\nu_i^k-\rho\theta^k+\rho\delta_i^k)^{\mathrm T}\tilde Q_i=0, \qquad (36b)$$
with the column partitions
$$R_i^{-1}=\begin{bmatrix}\bar R_i^{-1} & \tilde R_i^{-1}\end{bmatrix},\quad \bar R_i^{-1}\in\mathbb{R}^{m\times q},\ \tilde R_i^{-1}\in\mathbb{R}^{m\times(m-q)}, \qquad (37)$$
$$Q_i=\begin{bmatrix}\bar Q_i & \tilde Q_i\end{bmatrix},\quad \bar Q_i\in\mathbb{R}^{m\times q},\ \tilde Q_i\in\mathbb{R}^{m\times(m-q)}. \qquad (38)$$

It can be shown that the partition (34), i.e., the support of $\delta_i^k(t)$, stays unchanged for $t\in[0,t^\ast)$, $t^\ast>0$ (see Lemma 1 of Garrigues and El Ghaoui [2008]). It hence follows that for $t\in[0,t^\ast]$
$$\bar\delta_i^k(t)^{\mathrm T}=-\big(\lambda\operatorname{sign}(\bar\delta_i^k)^{\mathrm T}/\rho+h_i^{\mathrm T}\bar R_i^{-1}\big)\hat Q_i^{-1}+(\theta^k-\nu_i^k/\rho)^{\mathrm T}\bar Q_i\hat Q_i^{-1}+t(\Delta\theta^{k+1}-\Delta\nu_i^{k+1}/\rho)^{\mathrm T}\bar Q_i\hat Q_i^{-1}, \qquad (39)$$
where we introduced $\hat Q_i$ for the $q\times q$ matrix made up of the top $q$ rows of $\bar Q_i$. We can also compute

$$\begin{aligned}
v^{\mathrm T}(t)/\lambda &= -\rho h_i^{\mathrm T}\tilde R_i^{-1}+\big(\rho\theta^k-\nu_i^k-\rho\delta_i^k(t)\big)^{\mathrm T}\tilde Q_i+t\big(\rho\Delta\theta^{k+1}-\Delta\nu_i^{k+1}\big)^{\mathrm T}\tilde Q_i\\
&= -\rho h_i^{\mathrm T}\tilde R_i^{-1}+(\rho\theta^k-\nu_i^k)^{\mathrm T}\tilde Q_i-\rho\big(\delta_i^k(t)\big)^{\mathrm T}\tilde Q_i+t\big(\rho\Delta\theta^{k+1}-\Delta\nu_i^{k+1}\big)^{\mathrm T}\tilde Q_i\\
&= -\rho h_i^{\mathrm T}\tilde R_i^{-1}+(\rho\theta^k-\nu_i^k)^{\mathrm T}\tilde Q_i-\rho\big(\delta_i^k(0)\big)^{\mathrm T}\tilde Q_i+t\big(\rho\Delta\theta^{k+1}-\Delta\nu_i^{k+1}\big)^{\mathrm T}\big(\tilde Q_i-\bar Q_i\hat Q_i^{-1}\hat{\tilde Q}_i\big),
\end{aligned}$$
where $\hat{\tilde Q}_i$ denotes the top $q$ rows of $\tilde Q_i$. Now, to find $t^\ast$, we notice that both $\delta_i^k(t)$ and $v(t)$ are linear functions of $t$. We can hence compute the minimal $t$ that
• makes one or more elements of $v$ equal to $-1$ or $1$, or/and
• makes one or more elements of $\delta_i^k(t)$ equal to $0$.
This minimal $t$ is $t^\ast$. At $t^\ast$ the partition (34) changes:

• Elements corresponding to $v$-elements equal to $-1$ or $1$ should be included in $\bar\delta_i^k$.
• Elements of $\bar\delta_i^k(t)$ equal to $0$ should be fixed to zero.

Given the solution $\delta_i^k(t^\ast)$, we can now continue in a similar way as above to compute $\delta_i^k(t)$, $t\in[t^\ast,t^{\ast\ast}]$. The procedure continues until $\delta_i^k(1)$ has been computed. Due to space limitations, we have chosen not to give a summary of the algorithm. We instead refer the interested reader to the implementation available for download from http://www.rt.isy.liu.se/~ohlsson/code.html.

7. NUMERICAL ILLUSTRATION

For the numerical illustration, we consider a network of $N=10$ sensors with $\theta_1=\theta_2=\dots=\theta_{10}$ being a random sample from $\mathcal N(0,I)$ in $\mathbb{R}^{20}$. We assume that there are 10 batches of measurements, each consisting of 15 samples. The regressors were unit Gaussian noise and the measurement noise variance was set to one. In the scenario considered in this section, we simulate failures in sensors 2 and 5, upon the arrival of the 4th and 6th measurement batches, respectively. This is done by changing the 5th component of $\theta_2$ (multiplying it by 5) and the 8th component of $\theta_5$ (shifting it, i.e., adding 5). The faults are assumed to be persistent.

With $\lambda=\rho=20$, the centralized solution, ADMM and ADMM with homotopy all give identical results (up to the 2nd digit; 10 ADMM iterations were used). As can be seen from Figures 1-3, the method correctly detects that the 2nd and 5th sensors are faulty. In addition, as seen in Figures 2 and 3, ADMM and ADMM with homotopy also show for how many data batches the sensors remained faulty, and the results reveal which elements of $\theta_2$ and $\theta_5$ deviated from the nominal value. In this example (using the ADMM algorithm), each sensor had to broadcast $m\times(\text{ADMM iterations})\times(\text{number of batches})=20\times 10\times 10=2000$ scalar values. If instead all data had been shared, each sensor would have had to broadcast $(m+1)\times T\times(\text{number of batches})=(20+1)\times 15\times 10=3150$ scalar values. Using the proposed algorithm, the traffic over the network can hence be made considerably lower while keeping the performance of a centralized change detection algorithm. Using the homotopy algorithm (or warm start) to solve step 2 of the ADMM algorithm does not affect the traffic over the network, but can lower the computational burden on the sensors.


Fig. 1. Results from the centralized change detection ($\|\theta-\theta_i\|_2$ versus sensor number). As can be seen, sensors 2 and 5 are detected to be faulty.

Fig. 2. Results from the ADMM batch solution ($\|\theta_i-\vartheta_i\|_2$ versus sensor number). Sensors 2 and 5 have been detected faulty for 6 and 4 batches.

Fig. 3. Results from the homotopy solution ($\|\theta_i-\vartheta_i\|_2$ versus sensor number). Sensors 2 and 5 have been detected faulty for 6 and 4 batches.

It is also worth noting that the classical approach using a likelihood ratio test, as described in Section 3, would not work on this example since $20=m>T=15$.

8. CONCLUSION

This paper has presented a distributed change detection algorithm. Change detection is most often seen as a centralized problem; since many scenarios are naturally distributed, there is a need for distributed change detection algorithms. The basic idea of the proposed algorithm is to use system identification, in a distributed manner, to obtain a nominal model for the sensors or agents and then detect whether one or more sensors or agents start deviating from this nominal model. The proposed formulation takes the form of a convex optimization problem. We show how it can be solved distributively and present a homotopy algorithm to ease the computational load. The proposed formulation has connections with the Lasso and compressed sensing, and theory developed for those methods is therefore directly applicable.

REFERENCES

M. Basseville and I. V. Nikiforov. Detection of Abrupt Changes – Theory and Application. Prentice-Hall, Englewood Cliffs, NJ, 1993.
D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Athena Scientific, 1997.
S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
A. M. Bruckstein, D. L. Donoho, and M. Elad. From sparse solutions of systems of equations to sparse modeling of signals and images. SIAM Review, 51(1):34–81, 2009.
E. J. Candes and M. B. Wakin. An introduction to compressive sampling. IEEE Signal Processing Magazine, 25(2):21–30, March 2008.
E. J. Candes, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52:489–509, February 2006.
E. Chu, D. Gorinevsky, and S. Boyd. Scalable statistical monitoring of fleet data. In Proceedings of the 18th IFAC World Congress, pages 13227–13232, Milan, Italy, August 2011.
D. L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, April 2006.
P. Garrigues and L. El Ghaoui. An homotopy algorithm for the lasso with online observations. In Proceedings of the 22nd Annual Conference on Neural Information Processing Systems (NIPS), 2008.
F. Gustafsson. Adaptive Filtering and Change Detection. Wiley, New York, 2001.
I. Loris. On the performance of algorithms for the minimization of $\ell_1$-penalized functionals. Inverse Problems, 25:1–16, 2009.
H. Ohlsson, F. Gustafsson, L. Ljung, and S. Boyd. Smoothed state estimates under abrupt changes using sum-of-norms regularization. Automatica, 48(4):595–605, 2012.
R. Patton, P. Frank, and R. Clark. Fault Diagnosis in Dynamic Systems – Theory and Application. Prentice Hall, 1989.
A. Tartakovsky and V. Veeravalli. Quickest change detection in distributed sensor systems. In Proceedings of the 6th International Conference on Information Fusion, pages 756–763, Cairns, Australia, July 2003.
A. G. Tartakovsky and V. V. Veeravalli. An efficient sequential procedure for detecting changes in multichannel and distributed systems. In Proceedings of the Fifth International Conference on Information Fusion, pages 41–48, 2002.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):267–288, 1996.
A. Willsky. A survey of design methods for failure detection in dynamic systems. Automatica, 12:601–611, 1976.
A. Willsky and H. Jones. A generalized likelihood ratio approach to the detection and estimation of jumps in linear systems. IEEE Transactions on Automatic Control, 21(1):108–112, February 1976.
A. Yang, A. Ganesh, Y. Ma, and S. Sastry. Fast $\ell_1$-minimization algorithms and an application in robust face recognition: A review. In ICIP, 2010.


An ADMM Algorithm for a Class of Total Variation Regularized Estimation Problems ⋆

Bo Wahlberg ∗, Stephen Boyd ∗∗, Mariette Annergren ∗, and Yang Wang ∗∗

∗ Automatic Control Lab and ACCESS, School of Electrical Engineering, KTH Royal Institute of Technology, SE-100 44 Stockholm, Sweden
∗∗ Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA

Abstract: We present an alternating augmented Lagrangian method for convex optimization problems where the cost function is the sum of two terms, one that is separable in the variable blocks, and a second that is separable in the difference between consecutive variable blocks. Examples of such problems include Fused Lasso estimation, total variation denoising, and multi-period portfolio optimization with transaction costs. In each iteration of our method, the first step involves separately optimizing over each variable block, which can be carried out in parallel. The second step is not separable in the variables, but can be carried out very efficiently. We apply the algorithm to segmentation of data based on changes in mean ($\ell_1$ mean filtering) or changes in variance ($\ell_1$ variance filtering). In a numerical example, we show that our implementation is around 10000 times faster compared with the generic optimization solver SDPT3.

Keywords: Signal processing algorithms, stochastic parameters, parameter estimation, convex optimization and regularization

1. INTRODUCTION

In this paper we consider optimization problems where the objective is a sum of two terms: the first term is separable in the variable blocks, and the second term is separable in the difference between consecutive variable blocks. One example is the Fused Lasso method in statistical learning, Tibshirani et al. [2005], where the objective includes an $\ell_1$-norm penalty on the parameters, as well as an $\ell_1$-norm penalty on the difference between consecutive parameters. The first penalty encourages a sparse solution, i.e., one with few nonzero entries, while the second penalty enhances block partitions in the parameter space. The same ideas have been applied in many other areas, such as Total Variation (TV) denoising, Rudin et al. [1992], and segmentation of ARX models, Ohlsson et al. [2010] (where it is called sum-of-norms regularization). Another example is multi-period portfolio optimization, where the variable blocks give the portfolio in different time periods, the first term is the portfolio objective (such as risk-adjusted return), and the second term accounts for transaction costs.

In many applications, the optimization problem involves a large number of variables, and cannot be efficiently handled by generic optimization solvers. In this paper, our main contribution is to derive an efficient and scalable optimization algorithm, by exploiting the structure of the optimization problem. To do this, we use a distributed optimization method called the Alternating Direction Method of Multipliers (ADMM). ADMM was developed in the 1970s, and is closely related to many other optimization algorithms, including Bregman iterative algorithms for $\ell_1$ problems, Douglas-Rachford splitting, and proximal point methods; see Eckstein and Bertsekas [1992], Combettes and Pesquet [2007]. ADMM has been applied in many areas, including image and signal processing, Setzer [2011], as well as large-scale problems in statistics and machine learning, Boyd et al. [2011].

We will apply ADMM to $\ell_1$ mean filtering and $\ell_1$ variance filtering (Wahlberg et al. [2011]), which are important problems in signal processing with many applications, for example in financial or biological data analysis. In some applications, mean and variance filtering are used to pre-process data before fitting a parametric model. For non-stationary data it is also important for segmenting the data into stationary subsets. The approach we present is inspired by the $\ell_1$ trend filtering method described in Kim et al. [2009], which tracks changes in the mean value of the data. (An example in this paper also tracks changes in the variance of the underlying stochastic process.) These problems are closely related to the covariance selection problem, Dempster [1972], which is a convex optimization problem when the inverse covariance is used as the optimization variable, Banerjee et al. [2008]. The same ideas can also be found in Kim et al. [2009] and Friedman et al. [2008].

⋆ This work was partially supported by the Swedish Research Council, the Linnaeus Center ACCESS at KTH and the European Research Council under the advanced grant LEARN, contract 267381.


This paper is organized as follows. In Section 2 we review the ADMM method. In Section 3, we apply ADMM to our optimization problem to derive an efficient optimization algorithm. In Section 4.1 we apply our method to $\ell_1$ mean filtering, while in Section 4.2 we consider $\ell_1$ variance filtering. Section 5 contains some numerical examples, and Section 6 concludes the paper.

2. ALTERNATING DIRECTION METHOD OF MULTIPLIERS (ADMM)

In this section we give an overview of ADMM. We follow closely the development in Section 5 of Boyd et al. [2011].

Consider the following optimization problem

$$\text{minimize } f(x)\quad \text{subject to } x\in\mathcal C \qquad (1)$$
with variable $x\in\mathbb{R}^n$, and where $f$ and $\mathcal C$ are convex. We let $p^\star$ denote the optimal value of (1). We first re-write the problem as
$$\text{minimize } f(x)+I_{\mathcal C}(z)\quad \text{subject to } x=z, \qquad (2)$$
where $I_{\mathcal C}(z)$ is the indicator function of $\mathcal C$ (i.e., $I_{\mathcal C}(z)=0$ for $z\in\mathcal C$ and $I_{\mathcal C}(z)=\infty$ for $z\notin\mathcal C$). The augmented Lagrangian for this problem is
$$L_\rho(x,z,u)=f(x)+I_{\mathcal C}(z)+(\rho/2)\|x-z+u\|_2^2,$$
where $u$ is a scaled dual variable associated with the constraint $x=z$, i.e., $u=(1/\rho)y$, where $y$ is the dual variable for $x=z$. Here, $\rho>0$ is a penalty parameter.

In each iteration of ADMM, we perform alternating minimization of the augmented Lagrangian over $x$ and $z$. At iteration $k$ we carry out the following steps:
$$x^{k+1}:=\arg\min_x\big\{f(x)+(\rho/2)\|x-z^k+u^k\|_2^2\big\}, \qquad (3)$$
$$z^{k+1}:=\Pi_{\mathcal C}(x^{k+1}+u^k), \qquad (4)$$
$$u^{k+1}:=u^k+(x^{k+1}-z^{k+1}), \qquad (5)$$
where $\Pi_{\mathcal C}$ denotes Euclidean projection onto $\mathcal C$. In the first step of ADMM, we fix $z$ and $u$ and minimize the augmented Lagrangian over $x$; next, we fix $x$ and $u$ and minimize over $z$; finally, we update the dual variable $u$.

2.1 Convergence

Under mild assumptions on $f$ and $\mathcal C$, we can show that the iterates of ADMM converge to a solution; specifically, we have
$$f(x^k)\to p^\star,\qquad x^k-z^k\to 0,$$
as $k\to\infty$. The rate of convergence, and hence the number of iterations required to achieve a specified accuracy, can depend strongly on the choice of the parameter $\rho$. When $\rho$ is well chosen, this method can converge to a fairly accurate solution (good enough for many applications) within a few tens of iterations. However, if the choice of $\rho$ is poor, many iterations can be needed for convergence. These issues, including heuristics for choosing $\rho$, are discussed in more detail in Boyd et al. [2011].

2.2 Stopping criterion

The primal and dual residuals at iteration $k$ are given by
$$e_p^k=x^k-z^k,\qquad e_d^k=-\rho(z^k-z^{k-1}).$$
We terminate the algorithm when the primal and dual residuals satisfy a stopping criterion (which can vary depending on the requirements of the application). A typical criterion is to stop when
$$\|e_p^k\|_2\leq\epsilon^{\mathrm{pri}},\qquad \|e_d^k\|_2\leq\epsilon^{\mathrm{dual}}.$$
Here, the tolerances $\epsilon^{\mathrm{pri}}>0$ and $\epsilon^{\mathrm{dual}}>0$ can be set via an absolute plus relative criterion,
$$\epsilon^{\mathrm{pri}}=\sqrt{n}\,\epsilon^{\mathrm{abs}}+\epsilon^{\mathrm{rel}}\max\{\|x^k\|_2,\|z^k\|_2\},\qquad \epsilon^{\mathrm{dual}}=\sqrt{n}\,\epsilon^{\mathrm{abs}}+\epsilon^{\mathrm{rel}}\rho\|u^k\|_2,$$
where $\epsilon^{\mathrm{abs}}>0$ and $\epsilon^{\mathrm{rel}}>0$ are absolute and relative tolerances (see Boyd et al. [2011] for details).
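For reference, a small Python helper implementing the residuals and the absolute-plus-relative tolerances above; the default tolerance values are illustrative.

```python
# Sketch of the primal/dual residuals and tolerances of Section 2.2.
import numpy as np

def stopping_criterion(x, z, z_prev, u, rho, eps_abs=1e-4, eps_rel=1e-3):
    e_p = x - z                      # primal residual
    e_d = -rho * (z - z_prev)        # dual residual
    n = x.size
    eps_pri = np.sqrt(n) * eps_abs + eps_rel * max(np.linalg.norm(x),
                                                   np.linalg.norm(z))
    eps_dual = np.sqrt(n) * eps_abs + eps_rel * rho * np.linalg.norm(u)
    return np.linalg.norm(e_p) <= eps_pri and np.linalg.norm(e_d) <= eps_dual
```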

3. PROBLEM FORMULATION AND METHOD

In this section we formulate our problem and derive an efficient distributed optimization algorithm via ADMM.

3.1 Optimization problem

We consider the problem

$$\begin{aligned}\text{minimize}\quad &\sum_{i=1}^{N}\phi_i(x_i)+\sum_{i=1}^{N-1}\psi_i(r_i)\\ \text{subject to}\quad & r_i=x_{i+1}-x_i,\quad i=1,\dots,N-1,\end{aligned} \qquad (6)$$
with variables $x_1,\dots,x_N,r_1,\dots,r_{N-1}\in\mathbb{R}^n$, and where $\phi_i:\mathbb{R}^n\to\mathbb{R}\cup\{\infty\}$ and $\psi_i:\mathbb{R}^n\to\mathbb{R}\cup\{\infty\}$ are convex functions.

This problem has the form (1), with variables $x=(x_1,\dots,x_N)$, $r=(r_1,\dots,r_{N-1})$, objective function
$$f(x,r)=\sum_{i=1}^{N}\phi_i(x_i)+\sum_{i=1}^{N-1}\psi_i(r_i)$$
and constraint set
$$\mathcal C=\{(x,r)\mid r_i=x_{i+1}-x_i,\ i=1,\dots,N-1\}. \qquad (7)$$

The ADMM form for problem (6) is
$$\begin{aligned}\text{minimize}\quad &\sum_{i=1}^{N}\phi_i(x_i)+\sum_{i=1}^{N-1}\psi_i(r_i)+I_{\mathcal C}(z,s)\\ \text{subject to}\quad & r_i=s_i,\quad i=1,\dots,N-1,\\ & x_i=z_i,\quad i=1,\dots,N,\end{aligned} \qquad (8)$$
with variables $x=(x_1,\dots,x_N)$, $r=(r_1,\dots,r_{N-1})$, $z=(z_1,\dots,z_N)$, and $s=(s_1,\dots,s_{N-1})$. Furthermore, we let $u=(u_1,\dots,u_N)$ and $t=(t_1,\dots,t_{N-1})$ be vectors of scaled dual variables associated with the constraints $x_i=z_i$, $i=1,\dots,N$, and $r_i=s_i$, $i=1,\dots,N-1$ (i.e., $u_i=(1/\rho)y_i$, where $y_i$ is the dual variable associated with $x_i=z_i$).

3.2 Distributed optimization method

Applying ADMM to problem (8), we carry out the following steps in each iteration.

Step 1. Since the objective function $f$ is separable in $x_i$ and $r_i$, the first step (3) of the ADMM algorithm consists of $2N-1$ separate minimizations
$$x_i^{k+1}:=\arg\min_{x_i}\big\{\phi_i(x_i)+(\rho/2)\|x_i-z_i^k+u_i^k\|_2^2\big\}, \qquad (9)$$


$i=1,\dots,N$, and
$$r_i^{k+1}:=\arg\min_{r_i}\big\{\psi_i(r_i)+(\rho/2)\|r_i-s_i^k+t_i^k\|_2^2\big\}, \qquad (10)$$
$i=1,\dots,N-1$. These updates can all be carried out in parallel. For many applications, we will see that we can often solve (9) and (10) analytically.

Step 2. In the second step of ADMM, we project $(x^{k+1}+u^k,\,r^{k+1}+t^k)$ onto the constraint set $\mathcal C$, i.e.,
$$(z^{k+1},s^{k+1}):=\Pi_{\mathcal C}\big((x^{k+1},r^{k+1})+(u^k,t^k)\big).$$
For the particular constraint set (7), we will show in Section 3.3 that the projection can be performed extremely efficiently.

Step 3. Finally, we update the dual variables:
$$u_i^{k+1}:=u_i^k+(x_i^{k+1}-z_i^{k+1}),\quad i=1,\dots,N,$$
and
$$t_i^{k+1}:=t_i^k+(r_i^{k+1}-s_i^{k+1}),\quad i=1,\dots,N-1.$$
These updates can also be carried out independently in parallel, for each variable block.

3.3 Projection

In this section we work out an efficient formula for the projection onto the constraint set $\mathcal C$ in (7). To perform the projection
$$(z,s)=\Pi_{\mathcal C}((w,v)),$$
we solve the optimization problem
$$\text{minimize } \|z-w\|_2^2+\|s-v\|_2^2\quad \text{subject to } s=Dz,$$
with variables $z=(z_1,\dots,z_N)$ and $s=(s_1,\dots,s_{N-1})$, and where $D\in\mathbb{R}^{(N-1)n\times Nn}$ is the forward difference operator, i.e.,
$$D=\begin{bmatrix} -I & I & & &\\ & -I & I & &\\ & & \ddots & \ddots &\\ & & & -I & I\end{bmatrix}.$$
This problem is equivalent to
$$\text{minimize } \|z-w\|_2^2+\|Dz-v\|_2^2,$$
with variable $z=(z_1,\dots,z_N)$. Thus, to perform the projection we first solve the optimality condition
$$(I+D^{\mathrm T}D)z=w+D^{\mathrm T}v \qquad (11)$$
for $z$, and then we let $s=Dz$.

The matrix $I+D^{\mathrm T}D$ is block tridiagonal, with diagonal blocks equal to multiples of $I$, and sub/super-diagonal blocks equal to $-I$. Let $LL^{\mathrm T}$ be the Cholesky factorization of $I+D^{\mathrm T}D$. It is easy to show that $L$ is block banded with the form
$$L=\begin{bmatrix} l_{1,1} & & & &\\ l_{2,1} & l_{2,2} & & &\\ & l_{3,2} & l_{3,3} & &\\ & & \ddots & \ddots &\\ & & & l_{N,N-1} & l_{N,N}\end{bmatrix}\otimes I,$$
where $\otimes$ denotes the Kronecker product. The coefficients $l_{i,j}$ can be explicitly computed via the recursion
$$l_{1,1}=\sqrt{2},$$
$$l_{i+1,i}=-1/l_{i,i},\qquad l_{i+1,i+1}=\sqrt{3-l_{i+1,i}^2},\qquad i=1,\dots,N-2,$$
$$l_{N,N-1}=-1/l_{N-1,N-1},\qquad l_{N,N}=\sqrt{2-l_{N,N-1}^2}.$$
The coefficients only need to be computed once, before the projection operator is applied.

The projection therefore consists of the following steps

(1) Form $b:=w+D^{\mathrm T}v$:
$$b_1:=w_1-v_1,\qquad b_N:=w_N+v_{N-1},\qquad b_i:=w_i+(v_{i-1}-v_i),\quad i=2,\dots,N-1.$$
(2) Solve $Ly=b$:
$$y_1:=(1/l_{1,1})b_1,\qquad y_i:=(1/l_{i,i})(b_i-l_{i,i-1}y_{i-1}),\quad i=2,\dots,N.$$
(3) Solve $L^{\mathrm T}z=y$:
$$z_N:=(1/l_{N,N})y_N,\qquad z_i:=(1/l_{i,i})(y_i-l_{i+1,i}z_{i+1}),\quad i=N-1,\dots,1.$$
(4) Set $s=Dz$:
$$s_i:=z_{i+1}-z_i,\quad i=1,\dots,N-1.$$

Thus, we see that we can perform the projection very efficiently, in $O(Nn)$ flops (floating-point operations). In fact, if we pre-compute the inverses $1/l_{i,i}$, $i=1,\dots,N$, the only operations that are required are multiplication, addition, and subtraction. We do not need to perform division, which can be expensive on some hardware platforms.
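A direct transcription of the four projection steps into Python is given below as a sketch; `w` and `v` are lists of $n$-vectors of lengths $N$ and $N-1$, and in a production implementation the Cholesky coefficients would be computed once and cached rather than rebuilt at every call.

```python
# Sketch of the O(Nn) projection onto C of Section 3.3.
import numpy as np

def project_onto_C(w, v):
    N = len(w)
    # Cholesky coefficients of I + D^T D (computed once in practice)
    l = np.zeros((N, N))
    l[0, 0] = np.sqrt(2.0)
    for i in range(N - 2):
        l[i + 1, i] = -1.0 / l[i, i]
        l[i + 1, i + 1] = np.sqrt(3.0 - l[i + 1, i] ** 2)
    l[N - 1, N - 2] = -1.0 / l[N - 2, N - 2]
    l[N - 1, N - 1] = np.sqrt(2.0 - l[N - 1, N - 2] ** 2)

    # step (1): b = w + D^T v
    b = [None] * N
    b[0] = w[0] - v[0]
    b[N - 1] = w[N - 1] + v[N - 2]
    for i in range(1, N - 1):
        b[i] = w[i] + (v[i - 1] - v[i])

    # step (2): forward substitution L y = b
    y = [None] * N
    y[0] = b[0] / l[0, 0]
    for i in range(1, N):
        y[i] = (b[i] - l[i, i - 1] * y[i - 1]) / l[i, i]

    # step (3): backward substitution L^T z = y
    z = [None] * N
    z[N - 1] = y[N - 1] / l[N - 1, N - 1]
    for i in range(N - 2, -1, -1):
        z[i] = (y[i] - l[i + 1, i] * z[i + 1]) / l[i, i]

    # step (4): s = D z
    s = [z[i + 1] - z[i] for i in range(N - 1)]
    return z, s
```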

4. EXAMPLES

4.1 $\ell_1$ Mean filtering

Consider a sequence of vector random variables
$$Y_i\sim\mathcal N(y_i,\Sigma),\quad i=1,\dots,N,$$
where $y_i\in\mathbb{R}^n$ is the mean and $\Sigma\in\mathbf S^n_+$ is the covariance matrix. We assume that the covariance matrix is known, but the mean of the process is unknown. Given a sequence of observations $y_1,\dots,y_N$, our goal is to estimate the mean under the assumption that it is piecewise constant, i.e., $y_{i+1}=y_i$ for many values of $i$.

In the Fused Group Lasso method, we obtain our estimates by solving
$$\begin{aligned}\text{minimize}\quad &\sum_{i=1}^{N}\tfrac12(y_i-x_i)^{\mathrm T}\Sigma^{-1}(y_i-x_i)+\lambda\sum_{i=1}^{N-1}\|r_i\|_2\\ \text{subject to}\quad & r_i=x_{i+1}-x_i,\quad i=1,\dots,N-1,\end{aligned}$$
with variables $x_1,\dots,x_N,r_1,\dots,r_{N-1}$. Let $x_1^\star,\dots,x_N^\star,r_1^\star,\dots,r_{N-1}^\star$ denote an optimal point; our estimates of $y_1,\dots,y_N$ are $x_1^\star,\dots,x_N^\star$.

This problem is clearly in the form (6), with
$$\phi_i(x_i)=\tfrac12(y_i-x_i)^{\mathrm T}\Sigma^{-1}(y_i-x_i),\qquad \psi_i(r_i)=\lambda\|r_i\|_2.$$


ADMM steps. For this problem, steps (9) and (10) of ADMM can be further simplified. Step (9) involves minimizing an unconstrained quadratic function in the variable $x_i$, and can be written as
$$x_i^{k+1}=(\Sigma^{-1}+\rho I)^{-1}\big(\Sigma^{-1}y_i+\rho(z_i^k-u_i^k)\big).$$
Step (10) is
$$r_i^{k+1}:=\arg\min_{r_i}\big\{\lambda\|r_i\|_2+(\rho/2)\|r_i-s_i^k+t_i^k\|_2^2\big\},$$
which simplifies to
$$r_i^{k+1}=S_{\lambda/\rho}(s_i^k-t_i^k), \qquad (12)$$
where $S_\kappa$ is the vector soft thresholding operator, defined as
$$S_\kappa(a)=(1-\kappa/\|a\|_2)_+\,a,\qquad S_\kappa(0)=0.$$
Here the notation $(v)_+=\max\{0,v\}$ denotes the positive part of the vector $v$. (For details see Boyd et al. [2011].)
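The two simplified updates can be written in a few lines; the sketch below, with illustrative names, implements the quadratic $x$-update and the vector soft-thresholding step (12).

```python
# Sketch of the simplified ADMM updates for l1 mean filtering.
import numpy as np

def soft_threshold_vec(a, kappa):
    """S_kappa(a) = (1 - kappa/||a||_2)_+ a, with S_kappa(0) = 0."""
    norm_a = np.linalg.norm(a)
    if norm_a == 0.0:
        return np.zeros_like(a)
    return max(0.0, 1.0 - kappa / norm_a) * a

def x_update(y_i, Sigma_inv, rho, z_i, u_i):
    """x_i^{k+1} = (Sigma^{-1} + rho I)^{-1} (Sigma^{-1} y_i + rho (z_i - u_i))."""
    n = y_i.size
    return np.linalg.solve(Sigma_inv + rho * np.eye(n),
                           Sigma_inv @ y_i + rho * (z_i - u_i))

def r_update(s_i, t_i, lam, rho):
    """r_i^{k+1} = S_{lam/rho}(s_i - t_i), cf. (12)."""
    return soft_threshold_vec(s_i - t_i, lam / rho)
```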

Variations. In some problems, we might expect individual components of $x_i$ to be piecewise constant, in which case we can instead use the standard Fused Lasso method. In the standard Fused Lasso method we solve
$$\begin{aligned}\text{minimize}\quad &\sum_{i=1}^{N}\tfrac12(y_i-x_i)^{\mathrm T}\Sigma^{-1}(y_i-x_i)+\lambda\sum_{i=1}^{N-1}\|r_i\|_1\\ \text{subject to}\quad & r_i=x_{i+1}-x_i,\quad i=1,\dots,N-1,\end{aligned}$$
with variables $x_1,\dots,x_N,r_1,\dots,r_{N-1}$. The ADMM updates are the same, except that instead of doing vector soft thresholding for step (10), we perform scalar componentwise soft thresholding, i.e.,
$$(r_i^{k+1})_j=S_{\lambda/\rho}\big((s_i^k-t_i^k)_j\big),\quad j=1,\dots,n.$$

4.2 $\ell_1$ Variance filtering

Consider a sequence of vector random variables (of dimension $n$)
$$Y_i\sim\mathcal N(0,\Sigma_i),\quad i=1,\dots,N,$$
where $\Sigma_i\in\mathbf S^n_+$ is the covariance matrix of $Y_i$ (which we assume is fixed but unknown). Given observations $y_1,\dots,y_N$, our goal is to estimate the sequence of covariance matrices $\Sigma_1,\dots,\Sigma_N$, under the assumption that it is piecewise constant, i.e., it is often the case that $\Sigma_{i+1}=\Sigma_i$. In order to obtain a convex problem, we use the inverse covariances $X_i=\Sigma_i^{-1}$ as our variables.

The Fused Group Lasso method for this problem involves solving
$$\begin{aligned}\text{minimize}\quad &\sum_{i=1}^{N}\operatorname{Tr}(X_iy_iy_i^{\mathrm T})-\log\det X_i+\lambda\sum_{i=1}^{N-1}\|R_i\|_F\\ \text{subject to}\quad & R_i=X_{i+1}-X_i,\quad i=1,\dots,N-1,\end{aligned}$$
where our variables are $R_i\in\mathbf S^n$, $i=1,\dots,N-1$, and $X_i\in\mathbf S^n_+$, $i=1,\dots,N$. Here,
$$\|R_i\|_F=\sqrt{\operatorname{Tr}(R_i^{\mathrm T}R_i)}$$
is the Frobenius norm of $R_i$. Let $X_1^\star,\dots,X_N^\star,R_1^\star,\dots,R_{N-1}^\star$ denote an optimal point; our estimates of $\Sigma_1,\dots,\Sigma_N$ are $(X_1^\star)^{-1},\dots,(X_N^\star)^{-1}$.

ADMM steps. It is easy to see that steps (9) and (10) simplify for this problem. Step (9) requires solving
$$X_i^{k+1}:=\arg\min_{X_i\succeq 0}\big\{\phi_i(X_i)+(\rho/2)\|X_i-Z_i^k+U_i^k\|_2^2\big\},$$
where
$$\phi_i(X_i)=\operatorname{Tr}(X_iy_iy_i^{\mathrm T})-\log\det X_i.$$
This update can be solved analytically, as follows.

(1) Compute the eigenvalue decomposition
$$\rho(Z_i^k-U_i^k)-y_iy_i^{\mathrm T}=Q\Lambda Q^{\mathrm T},$$
where $\Lambda=\operatorname{diag}(\lambda_1,\dots,\lambda_n)$.
(2) Now let
$$\mu_j:=\frac{\lambda_j+\sqrt{\lambda_j^2+4\rho}}{2\rho},\quad j=1,\dots,n.$$
(3) Finally, we set
$$X_i^{k+1}=Q\operatorname{diag}(\mu_1,\dots,\mu_n)Q^{\mathrm T}.$$
For details of this derivation, see Section 6.5 in Boyd et al. [2011].
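A sketch of this analytic update in Python (illustrative names, real-valued data assumed):

```python
# Sketch of the analytic step-(9) update for l1 variance filtering.
import numpy as np

def x_update_variance(y_i, Z_i, U_i, rho):
    M = rho * (Z_i - U_i) - np.outer(y_i, y_i)
    lam, Q = np.linalg.eigh(M)                       # M = Q diag(lam) Q^T
    mu = (lam + np.sqrt(lam ** 2 + 4.0 * rho)) / (2.0 * rho)
    return Q @ np.diag(mu) @ Q.T                     # X_i^{k+1}, positive definite
```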

Step (10) is
$$R_i^{k+1}:=\arg\min_{R_i}\big\{\lambda\|R_i\|_F+(\rho/2)\|R_i-S_i^k+T_i^k\|_2^2\big\},$$
which simplifies to
$$R_i^{k+1}=S_{\lambda/\rho}(S_i^k-T_i^k),$$
where $S_\kappa$ is a matrix soft thresholding operator, defined as
$$S_\kappa(A)=(1-\kappa/\|A\|_F)_+\,A,\qquad S_\kappa(0)=0.$$

Variations. As with $\ell_1$ mean filtering, we can replace the Frobenius norm penalty with a componentwise $\ell_1$-norm penalty on $R_i$ to get the problem
$$\begin{aligned}\text{minimize}\quad &\sum_{i=1}^{N}\operatorname{Tr}(X_iy_iy_i^{\mathrm T})-\log\det X_i+\lambda\sum_{i=1}^{N-1}\|R_i\|_1\\ \text{subject to}\quad & R_i=X_{i+1}-X_i,\quad i=1,\dots,N-1,\end{aligned}$$
with variables $R_1,\dots,R_{N-1}\in\mathbf S^n$ and $X_1,\dots,X_N\in\mathbf S^n_+$, and where
$$\|R\|_1=\sum_{j,k}|R_{jk}|.$$
Again, the ADMM updates are the same; the only difference is that in step (10) we replace matrix soft thresholding with a componentwise soft threshold, i.e.,
$$(R_i^{k+1})_{l,m}=S_{\lambda/\rho}\big((S_i^k-T_i^k)_{l,m}\big),$$
for $l=1,\dots,n$, $m=1,\dots,n$.

4.3 $\ell_1$ Mean and variance filtering

Consider a sequence of vector random variables
$$Y_i\sim\mathcal N(y_i,\Sigma_i),\quad i=1,\dots,N,$$
where $y_i\in\mathbb{R}^n$ is the mean and $\Sigma_i\in\mathbf S^n_+$ is the covariance matrix of $Y_i$. We assume that both the mean and the covariance matrix of the process are unknown. Given observations $y_1,\dots,y_N$, our goal is to estimate the mean and the sequence of covariance matrices $\Sigma_1,\dots,\Sigma_N$, under the assumption that they are piecewise constant, i.e., it is


often the case that $y_{i+1}=y_i$ and $\Sigma_{i+1}=\Sigma_i$. To obtain a convex optimization problem, we use
$$X_i=-\tfrac12\Sigma_i^{-1},\qquad m_i=\Sigma_i^{-1}x_i,$$
as our variables. In the Fused Group Lasso method, we obtain our estimates by solving
$$\begin{aligned}\text{minimize}\quad &\sum_{i=1}^{N}\Big[-\tfrac12\log\det(-X_i)-\operatorname{Tr}(X_iy_iy_i^{\mathrm T})-m_i^{\mathrm T}y_i-\tfrac14\operatorname{Tr}(X_i^{-1}m_im_i^{\mathrm T})\Big]\\ &\qquad+\lambda_1\sum_{i=1}^{N-1}\|r_i\|_2+\lambda_2\sum_{i=1}^{N-1}\|R_i\|_F\\ \text{subject to}\quad & r_i=m_{i+1}-m_i,\quad i=1,\dots,N-1,\\ & R_i=X_{i+1}-X_i,\quad i=1,\dots,N-1,\end{aligned}$$
with variables $r_1,\dots,r_{N-1}\in\mathbb{R}^n$, $m_1,\dots,m_N\in\mathbb{R}^n$, $R_1,\dots,R_{N-1}\in\mathbf S^n$, and $X_1,\dots,X_N\in\mathbf S^n_+$.

ADMM steps. This problem is also in the form (6); however, as far as we are aware, there is no analytical formula for steps (9) and (10). To carry out these updates, we must solve semidefinite programs (SDPs), for which there are a number of efficient and reliable software packages (Toh et al. [1999], Sturm [1999]).

5. NUMERICAL EXAMPLE

In this section we solve an instance of $\ell_1$ mean filtering with $n=1$, $\Sigma=1$, and $N=400$, using the standard Fused Lasso method. To improve convergence of the ADMM algorithm, we use over-relaxation with $\alpha=1.8$, see Boyd et al. [2011]. The parameter $\lambda$ is chosen as approximately 10% of $\lambda_{\max}$, where $\lambda_{\max}$ is the largest value that results in a non-constant mean estimate. Here, $\lambda_{\max}\approx 108$ and so $\lambda=10$. We use an absolute plus relative error stopping criterion, with $\epsilon^{\mathrm{abs}}=10^{-4}$ and $\epsilon^{\mathrm{rel}}=10^{-3}$. Figure 1 shows the convergence of the primal and dual residuals. The resulting estimates of the means are shown in Figure 2.

Fig. 1. Residual convergence versus iteration: primal residual $e_p$ (solid line) and dual residual $e_d$ (dashed line).

Fig. 2. Estimated means (solid line), true means (dashed line) and measurements (crosses).

We solved the same $\ell_1$ mean filtering problem using CVX, a package for specifying and solving convex optimization problems (Grant and Boyd [2011]). CVX calls the generic SDP solvers SeDuMi (Sturm [1999]) or SDPT3 (Toh et al. [1999]) to solve the problem. While these solvers are reliable for wide classes of optimization problems, and exploit sparsity in the problem formulation, they are not customized for particular problem families such as ours. The computation time for CVX is approximately 20 seconds. Our ADMM algorithm (implemented in C) took 2.2 milliseconds to produce the same estimates. Thus, our algorithm is approximately 10000 times faster than generic optimization packages. Moreover, our implementation does not yet exploit the fact that steps 1 and 3 of ADMM can be carried out independently in parallel for each measurement; parallelizing these steps can lead to further speedups. For example, simple multi-threading on a quad-core CPU would result in a further 4x speed-up.

6. CONCLUSIONS

In this paper we derived an efficient and scalable method for an optimization problem (6) that has a variety of applications in control and estimation. Our custom method exploits the structure of the problem via a distributed optimization framework. In many applications, each step of the method is a simple update that typically involves solving a set of linear equations, matrix multiplication, or thresholding, for which there are exceedingly efficient libraries. In numerical examples we have shown that we can solve problems such as $\ell_1$ mean and variance filtering many orders of magnitude faster than generic optimization solvers such as SeDuMi or SDPT3.

The only tuning parameter for our method is the ADMM penalty parameter $\rho$. Finding an optimal $\rho$ is not a straightforward problem, but Boyd et al. [2011] contains many heuristics that work well in practice. For the $\ell_1$ mean filtering example, we find that setting $\rho\approx\lambda$ works well, but we do not have a formal justification.


REFERENCES

O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9:485–516, 2008.
S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
P. L. Combettes and J. C. Pesquet. A Douglas-Rachford splitting approach to nonsmooth convex variational signal recovery. IEEE Journal of Selected Topics in Signal Processing, 1(4):564–574, December 2007.
A. P. Dempster. Covariance selection. Biometrics, 28(1):157–175, 1972.
J. Eckstein and D. P. Bertsekas. On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming, 55:293–318, 1992.
J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.
M. Grant and S. Boyd. CVX: Matlab software for disciplined convex programming, version 1.21. http://cvxr.com/cvx, April 2011.
S. J. Kim, K. Koh, S. Boyd, and D. Gorinevsky. $\ell_1$ trend filtering. SIAM Review, 51(2):339–360, 2009.
H. Ohlsson, L. Ljung, and S. Boyd. Segmentation of ARX-models using sum-of-norms regularization. Automatica, 46:1107–1111, April 2010.
L. I. Rudin, S. Osher, and E. Fatemi. Nonlinear total variation based noise removal algorithms. Physica D, 60:259–268, November 1992.
S. Setzer. Operator splittings, Bregman methods and frame shrinkage in image processing. International Journal of Computer Vision, 92(3):265–280, 2011.
J. Sturm. Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optimization Methods and Software, 11:625–653, 1999. Software available at http://sedumi.ie.lehigh.edu/.
R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(1):91–108, 2005.
K. Toh, M. Todd, and R. Tutuncu. SDPT3 - a Matlab software package for semidefinite programming, version 1.3. Optimization Methods and Software, 11(1):545–581, 1999.
B. Wahlberg, C. R. Rojas, and M. Annergren. On $\ell_1$ mean and variance filtering. In Proceedings of the Forty-Fifth Asilomar Conference on Signals, Systems and Computers, 2011. arXiv:1111.5948.


Compressive Phase Retrieval From Squared Output Measurements Via Semidefinite Programming ⋆

Henrik Ohlsson ∗,∗∗ Allen Y. Yang ∗ Roy Dong ∗ S. Shankar Sastry ∗

∗ Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, CA, USA (email: {ohlsson,yang,roydong,sastry}@eecs.berkeley.edu).
∗∗ Division of Automatic Control, Department of Electrical Engineering, Linköping University, Sweden (e-mail: [email protected]).

Abstract: Given a linear system in a real or complex domain, linear regression aims to recover the model parameters from a set of observations. Recent studies in compressive sensing have successfully shown that under certain conditions, a linear program, namely $\ell_1$-minimization, guarantees recovery of sparse parameter signals even when the system is underdetermined. In this paper, we consider a more challenging problem: when the phase of the output measurements from a linear system is omitted. Using a lifting technique, we show that even though the phase information is missing, the sparse signal can be recovered exactly by solving a semidefinite program when the sampling rate is sufficiently high. This is an interesting finding since the exact solutions to both sparse signal recovery and phase retrieval are combinatorial. The results extend the types of applications that compressive sensing can be applied to, namely those where only output magnitudes can be observed. We demonstrate the accuracy of the algorithms through extensive simulation and a practical experiment.

Keywords: Phase Retrieval; Compressive Sensing; Semidefinite Programming.

1. INTRODUCTION

Linear models, e.g., $y=Ax$, are by far the most used and useful type of model. The main reasons for this are their simplicity of use and identification. For the identification, the least-squares (LS) estimate in a complex domain is computed by¹
$$x_{\mathrm{ls}}=\arg\min_x\|y-Ax\|_2^2\in\mathbb{C}^n, \qquad (1)$$
assuming the output $y\in\mathbb{C}^N$ and $A\in\mathbb{C}^{N\times n}$ are given. Further, the LS problem has a unique solution if the system is full rank and not underdetermined, i.e., $N\geq n$.

⋆ Ohlsson is partially supported by the Swedish foundation for strategic research in the center MOVIII, the Swedish Research Council in the Linnaeus center CADICS, the European Research Council under the advanced grant LEARN, contract 267381, and a postdoctoral grant from the Sweden-America Foundation, donated by ASEA's Fellowship Fund. Sastry and Yang are partially supported by an ARO MURI grant W911NF-06-1-0076. Dong is supported by the NSF Graduate Research Fellowship under grant DGE 1106400, and by the Team for Research in Ubiquitous Secure Technology (TRUST), which receives support from NSF (award number CCF-0424422).
¹ Our derivation in this paper is primarily focused on complex signals, but the results should be easily extended to real domain signals.

Consider the alternative scenario when the system is underdetermined, i.e., $n>N$. The least squares solution is no longer unique in this case, and additional knowledge has to be used to determine a unique model parameter. Ridge regression or Tikhonov regression [Hoerl and Kennard, 1970] is one of the traditional methods to apply in this case, which takes the form

$$x_r=\arg\min_x\ \tfrac12\|y-Ax\|_2^2+\lambda\|x\|_2^2, \qquad (2)$$
where $\lambda>0$ is a scalar parameter that decides the trade-off between fit (the first term) and the $\ell_2$-norm of $x$ (the second term).

Thanks to the $\ell_2$-norm regularization, ridge regression is known to pick solutions with small energy that satisfy the linear model. In a more recent approach stemming from the LASSO [Tibshirani, 1996] and compressive sensing (CS) [Candes et al., 2006, Donoho, 2006], another convex regularization criterion has been widely used to seek the sparsest parameter vector, which takes the form
$$x_{\ell_1}=\arg\min_x\ \tfrac12\|y-Ax\|_2^2+\lambda\|x\|_1. \qquad (3)$$
Depending on the choice of the weight parameter $\lambda$, the program (3) has been known as the LASSO by Tibshirani [1996], basis pursuit denoising (BPDN) by Chen et al. [1998], or $\ell_1$-minimization ($\ell_1$-min) by Candes et al. [2006]. In recent years, several pioneering works have contributed to efficiently solving such sparsity minimization problems, e.g., Tropp [2004], Beck and Teboulle [2009], Bruckstein et al. [2009], especially when the system parameters and observations are in high-dimensional spaces.


In this paper, we consider a more challenging problem. We still seek a linear model $y=Ax$, but rather than assuming that $y$ is given, we assume that only the squared magnitude of the output is given,
$$b_i=|y_i|^2=|\langle x,a_i\rangle|^2,\quad i=1,\dots,N, \qquad (4)$$
where $A^{\mathrm T}=[a_1,\dots,a_N]\in\mathbb{C}^{n\times N}$ and $y^{\mathrm T}=[y_1,\dots,y_N]\in\mathbb{C}^{1\times N}$. This is clearly a more challenging problem, since the phase of $y$ is lost when only the (squared) magnitude is available. A classical example is when $y$ represents the Fourier transform of $x$ and only the Fourier transform modulus is observable. This scenario arises naturally in several practical applications such as optics (Walther [1963], Millane [1990]), coherent diffraction imaging (Fienup [1987]) and astronomical imaging (Dainty and Fienup [1987]), and is known as the phase retrieval problem.

We note that in general the phase cannot be uniquely recovered, regardless of whether the linear model is overdetermined or not. A simple example shows this: if $x_0\in\mathbb{C}^n$ is a solution to $y=Ax$, then for any scalar $c\in\mathbb{C}$ on the unit circle, $cx_0$ leads to the same squared output $b$. As mentioned in Candes et al. [2011a], when the dictionary $A$ represents the unitary discrete Fourier transform (DFT), the ambiguities may represent time-reversed or time-shifted solutions of the ground truth signal $x_0$. These global ambiguities caused by losing the phase information are considered acceptable in phase retrieval applications. From now on, when we talk about the solution to the phase retrieval problem, we mean the solution up to a global phase. Accordingly, a unique solution is a solution that is unique up to a global phase.

Further note that since (4) is nonlinear in the unknown $x$, $N\geq n$ measurements are in general needed for a unique solution. When the number of measurements $N$ is fewer than necessary for a unique solution, additional assumptions are needed to select one of the solutions (just as in Tikhonov regression, the LASSO and CS).

Finally, we note that the exact solution to either CS or phase retrieval is combinatorially expensive (Chen et al. [1998], Candes et al. [2011b]). Therefore, the goal of this work is to answer the following question: can we effectively recover a sparse parameter vector $x$ of a linear system, up to a global ambiguity, from its squared magnitude output measurements via convex programming? This problem is referred to as compressive phase retrieval (CPR).

The main contribution of the paper is a convex formulation of the sparse phase retrieval problem. Using a lifting technique, the NP-hard problem is relaxed to a semidefinite program. Through extensive experiments, we compare the performance of our CPR algorithm with traditional CS and PhaseLift algorithms. The results extend the types of applications that compressive sensing can be applied to, namely applications where only magnitudes can be observed.

1.1 Background

Our work is motivated by the $\ell_1$-min problem in CS and a recent PhaseLift technique in phase retrieval by Candes et al. [2011b]. On one hand, the theory of CS and $\ell_1$-min has been one of the most visible research topics in recent years. There are several comprehensive review papers that cover the literature of CS and related optimization techniques in linear programming; the reader is referred to the works of Candes and Wakin [2008], Bruckstein et al. [2009], Loris [2009], Yang et al. [2010]. On the other hand, the fusion of phase retrieval and matrix completion is a novel topic that has recently been studied in a select few papers, such as Chai et al. [2010], Candes et al. [2011b,a]. The fusion of phase retrieval and CS was discussed in Moravec et al. [2007]. In the rest of the section, we briefly review the phase retrieval literature and its recent connections with CS and matrix completion.

Phase retrieval has been a longstanding problem in optics and X-ray crystallography since the 1970s [Kohler and Mandel, 1973, Gonsalves, 1976]. Early methods to recover the phase signal using the Fourier transform mostly relied on additional information about the signal, such as band limitation, nonzero support, real-valuedness, and nonnegativity. The Gerchberg-Saxton algorithm was one of the popular algorithms; it alternates between the Fourier and inverse Fourier transforms to obtain the phase estimate iteratively [Gerchberg and Saxton, 1972, Fienup, 1982]. One can also utilize steepest-descent methods to minimize the squared estimation error in the Fourier domain [Fienup, 1982, Marchesini, 2007]. Common drawbacks of these iterative methods are that they may not converge to the global solution, and that the rate of convergence is often slow. Alternatively, Balan et al. [2006] have studied a frame-theoretical approach to phase retrieval, which necessarily relied on some special types of measurements.

More recently, phase retrieval has been framed as a low-rank matrix completion problem in Chai et al. [2010], Candes et al. [2011a,b]. Given a system, a lifting technique was used to approximate the linear model constraint as a semidefinite program (SDP), which is similar to the CPR objective function (10), only without the sparsity constraint. The authors also derived the upper bound on the sampling rate that guarantees exact recovery in the noise-free case and stable recovery in the noisy case.

We are aware of the work by Moravec et al. [2007], which considered compressive phase retrieval on a random Fourier transform model. Leveraging the sparsity constraint, the authors proved that an upper bound of $O(k^2\log(4n/k^2))$ random Fourier modulus measurements suffices to uniquely specify $k$-sparse signals. Moravec et al. [2007] also proposed a compressive phase retrieval algorithm. Their solution largely follows the development of $\ell_1$-min in CS, and it alternates between the domain of solutions that give rise to the same squared output and the domain of an $\ell_1$-ball with a fixed $\ell_1$-norm. However, the main limitation of the algorithm is that it tries to solve a nonconvex optimization problem and assumes the $\ell_1$-norm of the true signal to be known.

2. CPR VIA SDP

In the noise-free case, the phase retrieval problem takes the form of the feasibility problem
$$\text{find } x\quad \text{subj. to}\quad b=|Ax|^2=\{a_i^Hxx^Ha_i\}_{1\leq i\leq N}, \qquad (5)$$
where $b^{\mathrm T}=[b_1,\dots,b_N]\in\mathbb{R}^{1\times N}$. This is a combinatorial problem to solve: even in the real domain with the sign


of the measurements $\{\alpha_i\}_{i=1}^N\subset\{-1,1\}$, one would have to try out combinations of sign sequences until one that satisfies
$$\alpha_i\sqrt{b_i}=a_i^{\mathrm T}x,\quad i=1,\dots,N, \qquad (6)$$
for some $x\in\mathbb{R}^n$ has been found. For any practical size of data sets, this combinatorial problem is intractable.

Since (5) is nonlinear in the unknown $x$, $N\geq n$ measurements are in general needed for a unique solution. When the number of measurements $N$ is fewer than necessary for a unique solution, additional assumptions are needed to select one of the solutions. Motivated by compressive sensing, we here choose to seek the sparsest solution of CPR satisfying (5) or, equivalently, the solution to
$$\min_x\|x\|_0,\quad \text{subj. to}\quad b=|Ax|^2=\{a_i^Hxx^Ha_i\}_{1\leq i\leq N}. \qquad (7)$$
As the counting norm $\|\cdot\|_0$ is not a convex function, following the $\ell_1$-norm relaxation in CS, (7) can be relaxed as
$$\min_x\|x\|_1,\quad \text{subj. to}\quad b=|Ax|^2=\{a_i^Hxx^Ha_i\}_{1\leq i\leq N}. \qquad (8)$$

Note that (8) is still not a linear program, as its equality constraint is not a linear equation. In the literature, a lifting technique has been extensively used to reframe problems such as (8) into a standard form in semidefinite programming, as in, e.g., Sparse PCA [d'Aspremont et al., 2007].

More specifically, given the ground truth signal $x_0\in\mathbb{C}^n$, let $X_0\doteq x_0x_0^H\in\mathbb{C}^{n\times n}$ be a rank-1 semidefinite matrix. Then the CPR problem can be cast as²
$$\begin{aligned}\min_X\ &\|X\|_1\\ \text{subj. to}\ & b_i=\operatorname{Tr}(a_i^HXa_i),\quad i=1,\dots,N,\\ &\operatorname{rank}(X)=1,\ X\succeq 0.\end{aligned} \qquad (9)$$
This is of course still a non-convex problem due to the rank constraint. The lifting approach addresses this issue by replacing $\operatorname{rank}(X)$ with $\operatorname{Tr}(X)$; for a semidefinite matrix, $\operatorname{Tr}(X)$ is equal to the sum of the eigenvalues of $X$. This leads to the SDP
$$\begin{aligned}\min_X\ &\operatorname{Tr}(X)+\lambda\|X\|_1\\ \text{subj. to}\ & b_i=\operatorname{Tr}(\Phi_iX),\quad i=1,\dots,N,\\ & X\succeq 0,\end{aligned} \qquad (10)$$
where we further denote $\Phi_i\doteq a_ia_i^H\in\mathbb{C}^{n\times n}$ and where $\lambda>0$ is a design parameter. Finally, the estimate of $x$ can be found by computing the rank-1 decomposition of $X$ via singular value decomposition. We refer to the formulation (10) as compressive phase retrieval via lifting (CPRL).
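As an illustration, the SDP (10) can be prototyped in a few lines with a generic modeling tool such as CVXPY; the sketch below assumes real-valued data for simplicity (the complex case requires a Hermitian variable) and is not the authors' implementation. `A`, `b` and `lam` are illustrative inputs.

```python
# Sketch of the CPRL semidefinite program (10), real-valued case.
import numpy as np
import cvxpy as cp

def cprl(A, b, lam=1.0):
    N, n = A.shape
    X = cp.Variable((n, n), PSD=True)              # lifted variable X = x x^T
    constraints = [cp.trace(np.outer(A[i], A[i]) @ X) == b[i] for i in range(N)]
    objective = cp.Minimize(cp.trace(X) + lam * cp.sum(cp.abs(X)))
    cp.Problem(objective, constraints).solve()
    return X.value                                  # rank-1 extraction done separately
```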

We compare (10) to the recent PhaseLift solution of Candes et al. [2011b], where a similar objective function was employed for phase retrieval:
$$\begin{aligned}\min_X\ &\operatorname{Tr}(X)\\ \text{subj. to}\ & b_i=\operatorname{Tr}(\Phi_iX),\quad i=1,\dots,N,\\ & X\succeq 0,\end{aligned} \qquad (11)$$
albeit the source signal was not assumed sparse. Using the lifting technique to construct the SDP relaxation of the NP-hard phase retrieval problem, with high probability the program (11) recovers the exact solution (sparse or dense) if the number of measurements $N$ is at least of the order of $O(n\log n)$. The region of success is visualized in Figure 1 as region I with a thick solid line.

If $x$ is sufficiently sparse and random Fourier dictionaries are used for sampling, Moravec et al. [2007] showed that in general the signal is uniquely defined if the number of squared magnitude output measurements $b$ exceeds the order of $O(k^2\log(4n/k^2))$. This lower bound for the region of success of CPR is illustrated by the dashed line in Figure 1.

Finally, the motivation for introducing the $\ell_1$-norm regularization in (10) is to be able to solve the sparse phase retrieval problem for $N$ smaller than what PhaseLift requires. However, one will not be able to solve the compressive phase retrieval problem in region III below the dashed curve. Therefore, our target problems lie in region II.

² $\|X\|_1$ for a matrix $X$ denotes the entry-wise $\ell_1$-norm in this paper.

I

II

III

Fig. 1. An illustration of the regions in which PhaseLiftand CPR are capable of recovering the ground truthsolution up to a global phase ambiguity. WhilePhaseLift primarily targets problems in region I,CPRL operates primarily in region II.

3. NUMERICAL SOLUTIONS FOR NOISY DATA

In this section, we consider the case that the measure-ments are contaminated by data noise. In a linear model,typically bounded random noise a↵ects the output of thesystem as y = Ax+ e, where e 2 CN is a noise term withbounded `2-norm: kek2 ✏. However, in phase retrieval,we follow closely a more special noise model used in Candeset al. [2011b]:

bi = |hx,aii|2 + ei. (12)

This nonstandard model avoids the need to calculate thesquared magnitude output |y|2 with the added noise term.More importantly, in practical phase retrieval applications,measurement noise is introduced when the squared mag-nitudes or intensities of the linear system are measured,not on y itself (Candes et al. [2011b]).

Page 24: Special Session on Convex Optimization for System ...user.it.uu.se/~kripe367/research3/Scribbles_files/SYSID2012_CVX.pdf · Insights in convex optimization continue to be a driving

Accordingly, we denote a linear function B of X

B : X 2 Cn⇥n 7! {Tr(�iX)}1iN 2 RN (13)

that measures the noise-free squared output. Then theapproximate CPR problem with bounded `2 error model(12) can be solved by the following SDP program:

min Tr(X) + �kXk1subj. to kB(X)� bk2 ✏,

X ⌫ 0.(14)

The estimate of x, just as in noise free case, can finallybe found by computing the rank-1 decomposition of X viasingular value decomposition. We refer to the method asapproximate CPRL. Due to the machine rounding error,in general a nonzero ✏ should be always assumed in theobjective (14) and its termination condition during theoptimization.

We should further discuss several numerical issues in theimplementation of the SDP program. The constrainedCPR problem (14) can be rewritten as an unconstrainedobjective function:

minX⌫0

Tr(X) + �kXk1 +µ

2kB(X)� bk22, (15)

where � > 0 and µ > 0 are two penalty parameters.

In (15), due to the lifting process, the rank-1 condition ofX is approximated by its trace function Tr(X). In Candeset al. [2011b], the authors considered phase retrieval ofgeneric (dense) signal x. They proved that if the numberof measurements obeys N � cn log n for a su�ciently largeconstant c, with high probability, minimizing (15) withoutthe sparsity constraint (i.e., � = 0) recovers a unique rank-1 solution obeying X⇤ = xx

H .

In Section 4, we will show that using either randomFourier dictionaries or more general random projections, inpractice, one needs much fewer measurements to exactlyrecover sparse signals if the measurements are noisefree.Nevertheless, in the presence of noise, the recovered liftedmatrix X may not be exactly rank-1. In this case, one cansimply use its rank-1 approximation corresponding to thelargest singular value of X.

We also note that in (15), there are two main parameters� and µ that can be defined by the user. Typically µis chosen depending on the level of noise that a↵ectsthe measurements b. For � associated with the sparsitypenalty kXk1, one can adopt a warm start strategy todetermine its value iteratively. The strategy has beenwidely used in other sparse optimization, such as in `1-min[Yang et al., 2010]. More specifically, the objective is solvediteratively with respect to a sequence of monotonicallydecreasing � ! 0, and each iteration is initialized usingthe optimization results from the previous iteration. When� is large, the sparsity constraint outweighs the traceconstraint and the estimation error constraint, and viceversa.

Example 3.1. (Compressive Phase Retrieval). In this ex-ample, we illustrate a simple CPR example, where a 2-sparse complex signal x0 2 C64 is first transformed bythe Fourier transform F 2 C64⇥64 followed by randomprojections R 2 C32⇥64:

b = |RFx0|2. (16)

Given b, F , and R, we first apply the PhaseLift algorithm[Candes et al., 2011b] with A = RF to the 32 squaredobservations b. The recovered dense signal is shown inFigure 2. PhaseLift fails to identify the 2-sparse signal.

Next, we apply CPRL (14), and the recovered sparse signalis also shown in Figure 2. CPRL correctly identifies the twononzero elements in x.

0 10 20 30 40 50 600

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

i

|xi|

PLCPRL

Fig. 2. The magnitude of the estimated signal provided byCPRL and PhaseLift (PL). CPRL correctly identifieselements 2 and 24 to be nonzero while PhaseLiftprovides a dense estimate. It is also verified that theestimate from CPRL, after a global phase shift, isapproximately equal the true x0.

4. EXPERIMENT

This section gives a number of examples. Code for thenumerical illustrations can be downloaded from http://

www.rt.isy.liu.se/

~

ohlsson/code.html.

4.1 Simulation

First, we repeat the simulation given in Example 3.1 fork = 1, . . . , 5. For each k, n = 64 is fixed, and we increasethe measurement dimension N until CPRL recovered thetrue sparse support in at least 95 out of 100 trials, i.e.,95% success rate. New data (x, b, and R) are generatedin each trial. The curve of 95% success rate is shown inFigure 3.

With the same simulation setup, we compare the accu-racy of CPRL with the PhaseLift approach and the CSapproach in Figure 3. First, note that CS is not applicableto phase retrieval problems in practice, since it assumesthe phase of the observation is also given. Nevertheless, thesimulation shows CPRL via the SDP solution only requiresa slightly higher sampling rate to achieve the same successrate as CS, even when the phase of the output is missing.Second, similar to the discussion in Example 3.1, withoutenforcing the sparsity constraint in (11), PhaseLift wouldfail to recover correct sparse signals in the low samplingrate regime.

It is also interesting to see the performance as n andN vary and k held fixed. We therefore use the samesetup as in Figure 3 but now fixed k = 2 and for n =10, . . . , 60, gradually increased N until CPRL recovered

Page 25: Special Session on Convex Optimization for System ...user.it.uu.se/~kripe367/research3/Scribbles_files/SYSID2012_CVX.pdf · Insights in convex optimization continue to be a driving

1 1.5 2 2.5 3 3.5 4 4.5 50

50

100

150

200

250

k

N

CS

CPRL

PhaseLift

Fig. 3. The curves of 95% success rate for CPRL,PhaseLift, and CS. Note that the CS simulation isgiven the complete output y instead of its squaredmagnitudes.

the true sparsity pattern with 95% success rate. The sameprocedure is repeated to evaluate PhaseLift and CS. Theresults are shown in Figure 4.

10 20 30 40 50 600

20

40

60

80

100

120

140

160

180

200

n

N

CPRL

PhaseLift

CS

Fig. 4. The curves of 95% success rate for CPRL,PhaseLift, and CS. Note that the CS simulation isgiven the complete output y instead of its squaredmagnitudes.

Compared to Figure 3, we can see that the degradationfrom CS to CPRL when the phase information is omittedis largely a↵ected by the sparsity of the signal. Morespecifically, when the sparsity k is fixed, even when thedimension n of the signal increases dramatically, the num-ber of squared observations to achieve accurate recoverydoes not increase significantly for both CS and CPRL.

4.2 CPRL Applied to Audio Signals

In this section, we further demonstrate the performanceof CPRL using signals from a real-world audio recording.The timbre of a particular note on an instrument isdetermined by the fundamental frequency, and severalovertones. In a Fourier basis, such a signal is sparse, beingthe summation of a few sine waves. Using the recordingof a single note on an instrument will give us a naturallysparse signal, as opposed to synthesized sparse signals inthe previous sections. Also, this experiment will let usanalyze how robust our algorithm is in practical situations,

where e↵ects like room ambience might color our otherwiseexactly sparse signal with noise.

Our recording z 2 Rs is a real signal, which is assumedto be sparse in a Fourier basis. That is, for some sparsex 2 Cn, we have z = Finvx, where Finv 2 Cs⇥n is a matrixrepresenting a transform from Fourier coe�cients into thetime domain. Then, we have a randomly generated mixingmatrix with normalized rows, R 2 RN⇥s, with which ourmeasurements are sampled in the time domain:

y = Rz = RFinvx. (17)

Finally, we are only given the magnitudes of our measure-ments, such that b = |y|2 = |Rz|2.For our experiment, we choose a signal with s = 32samples, N = 30 measurements, and it is representedwith n = 2s (overcomplete) Fourier coe�cients. Also, togenerate Finv, the Cn⇥n matrix representing the Fouriertransform is generated, and s rows from this matrix arerandomly chosen.

The experiment uses part of an audio file recording thesound of a tenor saxophone. The signal is cropped so thatthe signal only consists of a single sustained note, withoutsilence. Using CPRL to recover the original audio signalgiven b, R, and Finv, the algorithm gives us a sparseestimate x, which allows us to calculate zest = Finvx.We observe that all the elements of zest have phases thatare ⇡ apart, allowing for one global rotation to make zest

purely real. This matches our previous statements thatCPRL will allow us to retrieve the signal up to a globalphase.

We also find that the algorithm is able to achieve resultsthat capture the trend of the signal using less than smeasurements. In order to fully exploit the benefits ofCPRL that allow us to achieve more precise estimates withsmaller errors using fewer measurements relative to s, theproblem should be formulated in a much higher ambientdimension. However, using the CVX Matlab toolbox byGrant and Boyd [2010], we already ran into computationaland memory limitations with the current implementationof the CPRL algorithm. These results highlight the needfor a more e�cient numerical implementation of CPRL.

5 10 15 20 25 30!0.6

!0.4

!0.2

0

0.2

0.4

0.6

0.8

i

zi,

ze

st,

i

zest

z

Fig. 5. The retrieved signal zest using CPRL versus theoriginal audio signal z.

Page 26: Special Session on Convex Optimization for System ...user.it.uu.se/~kripe367/research3/Scribbles_files/SYSID2012_CVX.pdf · Insights in convex optimization continue to be a driving

10 20 30 40 50 600

1

2

3

4

5

6

i

xi

Fig. 6. The magnitude of x retrieved using CPRL. Theaudio signal zest is obtained by zest = Finvx.

5. CONCLUSION AND DISCUSSION

A novel method for the compressive phase retrieval prob-lem has been presented. The method takes the form ofan SDP problem and provides the means to use compres-sive sensing in applications where only squared magnitudemeasurements are available. The convex formulation givesit an edge over previous presented approaches and numer-ical illustrations show state of the art performance.

One of the future directions is improving the speed ofthe standard SDP solver, i.e., interior-point methods, cur-rently used for the CPRL algorithm. Some preliminary re-sults along with a more extensive study of the performancebounds of CPRL are available in Ohlsson et al. [2011].

REFERENCES

R. Balan, P. Casazza, and D. Edidin. On signal recon-struction without phase. Applied and ComputationalHarmonic Analysis, 20:345–356, 2006.

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems.SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

A. Bruckstein, D. Donoho, and M. Elad. From sparsesolutions of systems of equations to sparse modeling ofsignals and images. SIAM Review, 51(1):34–81, 2009.

E. J. Candes and M. Wakin. An introduction to compres-sive sampling. Signal Processing Magazine, IEEE, 25(2):21–30, March 2008.

E. J. Candes, J. Romberg, and T. Tao. Robust uncertaintyprinciples: Exact signal reconstruction from highly in-complete frequency information. IEEE Transactions onInformation Theory, 52:489–509, February 2006.

E. J. Candes, Y. Eldar, T. Strohmer, and V. Voroninski.Phase retrieval via matrix completion. Technical ReportarXiv:1109.0573, Stanford University, September 2011a.

E. J. Candes, T. Strohmer, and V. Voroninski. PhaseLift:Exact and stable signal recovery from magnitude mea-surements via convex programming. Technical ReportarXiv:1109.4499, Stanford University, September 2011b.

A. Chai, M. Moscoso, and G. Papanicolaou. Array imagingusing intensity-only measurements. Technical report,Stanford University, 2010.

S. Chen, D. Donoho, and M. Saunders. Atomic decom-position by basis pursuit. SIAM Journal on ScientificComputing, 20(1):33–61, 1998.

J. Dainty and J. Fienup. Phase retrieval and image re-construction for astronomy. In Image Recovery: Theoryand Applications. Academic Press, New York, 1987.

A. d’Aspremont, L. El Ghaoui, M. Jordan, and G. Lanck-riet. A direct formulation for Sparse PCA using semidef-inite programming. SIAM Review, 49(3):434–448, 2007.

D. Donoho. Compressed sensing. IEEE Transactions onInformation Theory, 52(4):1289–1306, April 2006.

J. Fienup. Phase retrieval algorithms: a comparison.Applied Optics, 21(15):2758–2769, 1982.

J. Fienup. Reconstruction of a complex-valued object fromthe modulus of its Fourier transform using a supportconstraint. Journal of Optical Society of America A, 4(1):118–123, 1987.

R. Gerchberg and W. Saxton. A practical algorithm forthe determination of phase from image and di↵ractionplane pictures. Optik, 35:237–246, 1972.

R. Gonsalves. Phase retrieval from modulus data. Journalof Optical Society of America, 66(9):961–964, 1976.

M. Grant and S. Boyd. CVX: Matlab software for dis-ciplined convex programming, version 1.21. http://

cvxr.com/cvx, August 2010.A. Hoerl and R. Kennard. Ridge regression: Biased

estimation for nonorthogonal problems. Technometrics,12(1):55–67, 1970.

D. Kohler and L. Mandel. Source reconstruction fromthe modulus of the correlation function: a practicalapproach to the phase problem of optical coherencetheory. Journal of the Optical Society of America, 63(2):126–134, 1973.

I. Loris. On the performance of algorithms for the mini-mization of `1-penalized functionals. Inverse Problems,25:1–16, 2009.

S. Marchesini. Phase retrieval and saddle-point optimiza-tion. Journal of the Optical Society of America A, 24(10):3289–3296, 2007.

R. Millane. Phase retrieval in crystallography and optics.Journal of the Optical Society of America A, 7:394–411,1990.

M. Moravec, J. Romberg, and R. Baraniuk. Compressivephase retrieval. In SPIE International Symposium onOptical Science and Technology, 2007.

H. Ohlsson, A. Y. Yang, R. Dong, and S. Sastry. Compres-sive Phase Retrieval From Squared Output Measure-ments Via Semidefinite Programming. Technical Re-port arXiv:1111.6323, University of California, Berkeley,November 2011.

R. Tibsharani. Regression shrinkage and selection via thelasso. Journal of Royal Statistical Society B (Method-ological), 58(1):267–288, 1996.

J. Tropp. Greed is good: Algorithmic results for sparseapproximation. IEEE Transactions on InformationTheory, 50(10):2231–2242, October 2004.

A. Walther. The question of phase retrieval in optics.Optica Acta, 10:41–49, 1963.

A. Yang, A. Ganesh, Y. Ma, and S. Sastry. Fast `1-minimization algorithms and an application in robustface recognition: A review. In ICIP, 2010.

Page 27: Special Session on Convex Optimization for System ...user.it.uu.se/~kripe367/research3/Scribbles_files/SYSID2012_CVX.pdf · Insights in convex optimization continue to be a driving

Convex Estimation of Cointegrated VAR

Models by a Nuclear Norm Penalty

M. Signoretto

⇤and J. A. K. Suykens

⇤ Katholieke Universiteit Leuven, ESAT-SCD/SISTAKasteelpark Arenberg 10, B-3001 Leuven (BELGIUM)

Email: [email protected] [email protected]

Abstract: Cointegrated Vector AutoRegressive (VAR) processes arise in the study of longrun equilibrium relations of stochastic dynamical systems. In this paper we introduce a novelconvex approach for the analysis of these type of processes. The idea relies on an error correctionrepresentation and amounts at solving a penalized empirical risk minimization problem. Thelatter finds a model from data by minimizing a trade-o↵ between a quadratic error functionand a nuclear norm penalty used as a proxy for the cointegrating rank. We elaborate onproperties of the proposed convex program; we then propose an easily implementable andprovably convergent algorithm based on FISTA. This algorithm can be conveniently used forcomputing the regularization path, i.e., the entire set of solutions associated to the trade-o↵parameter. We show how such path can be used to build an estimator for the cointegrating rankand illustrate the proposed ideas with experiments.

1. INTRODUCTION

Unit root nonstationary multivariate processes play an im-portant role in the study of dynamical stochastic systemsBox and Tiao [1977], Engle and Granger [1987], Stock andWatson [1988], Johansen [1988]. Contrary to their station-ary counterpart these processes are allowed to have trendsor shifts in the mean or in the covariances. This featuremakes them suitable to describe many phenomena of inter-est such as economic cycles and population dynamics. Inthis paper we focus on VAR processes. It is well known thatthese processes can generate stochastic and deterministictrends if the associated polynomial matrix has zeros onthe unit circle. If some of the variables within the sameVAR process move together in the long-run — in a sensethat we clarify later — they are called cointegrated. Thissituation is of considerable practical interest. Equilibriumrelationships arise between economic variables such as, forinstance, household income and expenditures. Cointegra-tion has also been advocated to describe long-term parallelgrowth of mutually dependent indicators such as regionalpopulation and employment growth or city populationsand total urban populations Payne and Ewing [1997],Sharma [2003], Møller and Sharp [2008].

? The authors are grateful to the anonymous reviewers for the help-ful comments. Research supported by the Research Council KUL:GOA /11/05 Ambiorics, GOA/10/09 MaNet, CoE EF/05/006 Opti-mization in Engineering (OPTEC) en PFV/10/002 (OPTEC). Flem-ish Government: FWO: PhD/postdoc grants, projects: G0226.06 (co-operative systems and optimization), G0321.06 (Tensors), G.0302.07(SVM/Kernel). Research communities (WOG: ICCoS, ANMMM,MLDM). Belgian Federal Science Policy O�ce: IUAP P6/04(DYSCO, Dynamical systems, control and optimization, 2007-2011);IBBT EU: ERNSI; FP7-HD-MPC (INFSO-ICT-223854), COST in-telliCIS, FP7-EMBOCON (ICT-248940), FP7-SADCO ( MC ITN-264735), ERC HIGHWIND (259 166). The scientific responsibility isassumed by its authors.

The analysis of cointegrated VAR models present chal-lenges that are not present in the stationary case. In par-ticular one of the main goal in the analysis of cointegratedsystem is the estimation of the cointegrating rank. In thiswork we propose a novel approach that relies on an errorcorrection representation of cointegrated VAR processes.The approach consists of solving a convex program anduses a nuclear norm as a proxy for the cointegrating rank.We show how the regularization path arising from di↵erentvalues of a trade-o↵ parameter can be used to estimatethe cointegrating rank. In order to compute solutions wepropose to use a simple yet e�cient algorithm based on anexisting procedure called FISTA (fast iterative shrinkage-thresholding algorithm).

In Section 2 we recall the concept of cointegrated VARmodels, error correction representations and cointegratingrank. In Section 3 we present our main problem formu-lation and discuss its properties. Section 4 deals with analgorithm to compute solutions. In Section 5 we introducean estimator for the cointegrating rank based on the reg-ularization path. In Section 6 we report experiments. Weconclude with final remarks in Section 7.

2. COINTEGRATED VAR(P ) MODELS

In the following we denote vectors as lower case letters(a, b, c, . . .) and matrices as capital letters (A,B,C, . . .).In particular, we use I to indicate the identity matrix,the dimension of which will be clear from the context.For a positive integer P we write NP to denote the set{1, . . . , P}. Finally we use hA,Bi to denote the innerproduct between A,B 2 RD

1

⇥D2 :

hA,Bi = trace(A>B) =X

d1

2ND1

X

d2

2ND2

ad1

d2

bd1

d2

(1)

Page 28: Special Session on Convex Optimization for System ...user.it.uu.se/~kripe367/research3/Scribbles_files/SYSID2012_CVX.pdf · Insights in convex optimization continue to be a driving

where ·> indicates matrix transposition 1 . The corre-sponding Hilbert-Frobenius norm is kAk =

phA,Ai.

In this work we are concerned with a multivariate timeseries xt 2 RD following a Vector Autoregressive (VAR)model of order P :

xt = 1

xt�1

+ 2

xt�2

+ · · ·+ Pxt�P + et, (2)

where for any p 2 NP , p 2 RD⇥D and the innovationet 2 RD is a zero-mean serially uncorrelated stationaryprocess.

Remark 1. We do not consider here the case where (2)includes a deterministic trend; note, however, that theconvex approach that we consider in the following can beeasily adapted to deal also with this case.

2.1 Nonstationarity and Cointegration

Recall that xt is called unit-root stationary if all the zerosof the univariate polynomial matrix of degree P

P(z) = I � 1

z � 2

z2 � · · ·� P zP

are outside the unit circle, see e.g. Tsay [2005]. Ifdet(P(1)) = 0, then xt is unit-root nonstationary. Fol-lowing Johansen [1992] we call xt integrated of order R(or simply I(R)) if the process �Rxt = xt � xt�R isstationary whereas for any r 2 NR�1

, the process �rxt

is not. A stationary process is referred to as I(0). Finallywe say that a I(R) process xt is cointegrated if thereexists at least a cointegrating vector � 2 RD such that thescalar process �>xt is I(R⇤) with R⇤ < R. Cointegratedprocesses were originally introduced in Granger [1981] andEngle and Granger [1987]. Since then they have becomepopular mostly in theoretical and applied econometrics.

Cointegration has been advocated to explain long-run orequilibrium relationships; for more discussions on cointe-gration and cointegration tests, see Box and Tiao [1977],Engle and Granger [1987], Stock and Watson [1988], Jo-hansen [1988]. In the following we focus on the situationwhere xt is I(1). This case is the most commonly studiedin the literature.

2.2 Error Correction Models

An error correction model (ECM) for the VAR(P ) process(2) is Tsay [2005], Lutkepohl [2005]:

�xt = ⇧xt�1

+ �1

�xt�1

+�

2

�xt�2

+ · · ·+ �P�1

�xt�P+1

+ et, (3)

where �p = �PP

j=p+1

j and ⇧ = �P(1). The VAR(P )model can be recovered from the ECM via:

8<

:

1

= I +⇧+ �1

, (4a) p = �p � �p�1

, p = 2, . . . , P � 1, (4b) P = ��P�1

. (4c)

Model-based forecasting is done as in the stationary case.Assume et is independent white noise with covariancematrix ⌃e. It can be shown that the optimal H�stepforecast at the origin is

yt(H) = 1

yt(H � 1) + · · ·+ P yt(H � P ) (5)

1 In this paper all vectors and matrices are real.

where yt(j) = yt+j for j 0. The forecast mean squareerror (MSE) matrix is ⌃y(H) = ⌃e +

Ph2NH�1

h⌃e >h ;

notably for unit-root nonstationary processes some entrieswill approach infinity as h ! 1 Lutkepohl [2005].

2.3 Cointegrating Rank

Note that, under the assumption that xt is at most I(1),�xt is a I(0) process. The cointegrating rank Engle andGranger [1987] is defined as:

D? = rank(⇧) . (6)

One can distinguish the following situations depending onits value:

i) D? = 0. In this case ⇧ = 0 and (3) reduces to aVAR(P � 1) model.

ii) 0 < D? < D. The matrix ⇧ can be written as

⇧ = AB> (7)

where A,B 2 RD⇥D?

are full (column) rank. The D?

linearly independent columns of B are cointegratingvectors; they form a basis for the cointegrating sub-space. The vector process

ft = B>xt (8)

is stationary and represent deviation from equilib-rium. These equations clarify that (3) expresses thechange in xt in terms of the deviations from theequilibrium at time t � 1 (the term ⇧xt�1

= Aft�1

,called error correction term) and previous changes(the terms �p�xt�p, 1 p P � 1).

iii) D? = D. In this case det(P(1)) 6= 0 (⇧ is full rank);xt is I(0) and one can study (2) directly.

An example of cointegrated process is given in Figure 1.

0 50 100 150 200 250 300

0.1

0.2

0.3

0.4

0.5

0.6

(a)

0 50 100 150 200 250 300!0.02

!0.015

!0.01

!0.005

0

0.005

0.01

0.015

(b)

Fig. 1. (a) A realization of the scalar components of a4-dimensional I(1) cointegrated process with cointe-grating rank 1. (b) The stationary process correspond-ing to the cointegrating vector [0.4 � 0.2 0.1 � 0.3].

Page 29: Special Session on Convex Optimization for System ...user.it.uu.se/~kripe367/research3/Scribbles_files/SYSID2012_CVX.pdf · Insights in convex optimization continue to be a driving

2.4 Existing Estimators

Di↵erent approaches have been proposed to estimateECMs based on training data {xt}1tT , see [Lutkepohl,2005, Chapter 7.2]. Define the matrices

X�p = [ xP�p+1

, xP�p+2

, · · · , xT�p ] , (9a)�X�p = X�p �X�p�1

, (9b)

�X =⇥�X>

�1

,�X>�2

, · · · ,�X>�P+1

⇤>, (9c)

� = [�1

,�2

, · · · ,�P�1

] , (9d)

C =⇥X>

�1

,�X>⇤> , (9e)

where X�p 2 RD⇥(T�P ), �X�p 2 RD⇥(T�P ), �X 2R(P�1)D⇥(T�P ), � 2 RD⇥(P�1)D and C 2 RPD⇥(T�P ).The Least Squares (LS) estimator is simply Lutkepohl[2005]:

h⇧LS , �LS

i= �X

0

C> �CC>��1

. (10)

This approach does not keep the decomposition (7) of ⇧into account 2 ; in principle, one could perform truncatedsingular value decomposition (SVD) of ⇧LS a posteriorito find B. However, this practice requires the knowledgeof D?, which is normally not available 3 . When this in-formation is available, an alternative approach consistsof Maximum Likelihood Estimation (MLE), which worksunder the assumption that the innovation is independentwhite Gaussian noise. Contrary to LS, this approach di-rectly estimates the factors in the representation (7). Thisleads to a nonconvex multistage algorithm. We refer thereader to [Tsay, 2005, Section 8.6.2] for details.

3. ESTIMATION BASED ON CONVEXPROGRAMMING

Recall that for an arbitrary matrix A 2 RD1

⇥D2 with rank

R, the SVD is

A = U⌃V >, ⌃ = diag({�r}1rR) (11)

where U 2 RD1

⇥R and V 2 RD2

⇥R are matrices withorthonormal columns, and the singular values �r satisfy�1

� �2

� · · · � �R > 0. The nuclear norm (a.k.a.trace norm or Schatten�1 norm) of A is defined Horn andJohnson [1994] as

kAk⇤ =X

r2NR

�r . (12)

The nuclear norm has been used to devise convex relax-ations to rank constrained matrix problems Recht et al.[2007], Candes and Recht [2009], Candes et al. [2011].This parallels the approach followed by estimators like theLASSO (Least Absolute Shrinkage and Selection Operator,Tibshirani [1996]) that estimate an high dimensional load-ing vector x 2 RD based on the l

1

-norm

kxk1

=X

d2ND

|xd| . (13)

2 In the literature (10) is sometimes called unrestricted LS toemphasize that the model’s parameters are not constrained to satisfyany specific structural form.3 As we later illustrate in experiments, the spectrum of ⇧LS

normally does not give a good indication of the actual value of thecointegrating rank.

3.1 Main Problem Formulation

Note that, based upon (9), the ECM (3) can be restatedin matrix notation as �X

0

= ⇧X�1

+ ��X + E. Ourapproach now consists of finding estimates (⇧(�), �(�))by solving:

min⇧,�

1

2ck�X

0

�⇧X�1

� ��Xk2 + �k⇧k⇤(14)

where c is a fixed normalization constant, such as c =D(T�P ), and � is a trade-o↵ parameter. When � = 0 (14)reduces to the LS estimates (10). For a strictly positive�, (14) fits a model to the data by minimizing a trade-o↵between an error function and a proxy for the cointegratingrank; a positive � reflects the prior knowledge that theprocess should be cointegrated with cointegrating rankD? < D.

It can be shown that (12) is the convex envelope on theunit ball of the (non-convex) rank function Fazel [2002].In the present context, this makes (14) the tightest convexrelaxation to the problem:

min⇧,�

1

2ck�X

0

�⇧X�1

� ��Xk2 + � rank(⇧) .(15)

The nonconvexity of the latter implies that practical al-gorithms can only be guaranteed to deliver local solutionsof (15) that are rank deficient for su�ciently large �. Incontrast, the problem (14) is convex since the objective isa sum of convex functions Boyd and Vandenberghe [2004].This implies that any local solution found by a provablyconvergent algorithm is globally optimal.

Once a solution of (14) is available, one can recover theparameters of the VAR(P ) model (2) based upon (4). Notethat here we focus on the case where the error functionis expressed in terms of the Hilbert-Frobenius norm; how-ever, alternative norms might also be meaningful. For laterreference, note that, when P = 1 and c = 1, (14) boilsdown to:

min⇧

1

2ck�X

0

�⇧X�1

k2 + �k⇧k⇤ .(16)

3.2 l2

Smoothing

The nature of the problem makes it di�cult to find asolution of (14) for � ! 0. Indeed, in practice X�1

and�X are often close to singular so that the problem isill-conditioned. In order to improve numerical stability apossible approach is to add a ridge penalty to the objectiveof (14). That is, for a small user-defined parameter µ > 0,to add:

µ

2

0

@k⇧k2 +X

p2NP�1

k�pk21

A ; (17)

we call the resulting optimization problem the l2

-smoothedformulation. The idea has a long tradition both in opti-mization and statistics. Recently it found application inthe Elastic Net Zou and Hastie [2005]. The Elastic Net

Page 30: Special Session on Convex Optimization for System ...user.it.uu.se/~kripe367/research3/Scribbles_files/SYSID2012_CVX.pdf · Insights in convex optimization continue to be a driving

finds an high dimensional loading vector x based on empir-ical data. The approach consists of replacing the LASSOpenalty based on the l

1

�norm (13) with the compositepenalty �

1

kxk2+�2

kxk1

. This strategy aims at improvingthe LASSO in the presence of dependent variables (which,in fact, leads to ill-conditioning).

In the present context it is easy to see that the solutionof the l

2

smoothed formulation can be found solving(14) where the data matrices (�X

0

, X�1

,�X) have beenreplaced by lifted matrices (�Xµ

0

, Xµ�1

,�Xµ). Considerfor simplicity problem (16). The lifted matrices of interestbecome:

�Xµ0

= [�X0

, O] and Xµ�1

= [X�1

,pµI] (18)

where O, I 2 RD⇥D and O is a matrix of zeros. Forthis reason in the following we will uniquely discuss analgorithm for the non-smoothed case; it is understoodthat a solution for the smoothed formulation can be foundsimply replacing the data matrices.

3.3 Dual Representation and Duality Gap

In this section, for simplicity of exposition, we restrictourselves to the primal problem (16) and derive its dualrepresentation. The Fenchel conjugate Rockafellar [1974]of kXk⇤ is:

f(A) = maxX

hA,Xi � kXk⇤ . (19)

Recall that the dual norm of the nuclear norm is thespectral norm, denoted as k · k

2

; for a generic matrix Awith SVD (11) such norm is defined by

kAk2

= �1

. (20)

It is a known fact that the conjugate of a norm is theindicator function of the dual norm unit ball Boyd andVandenberghe [2004]. In the present context this fact readsas follows.

Lemma 2.

f(A) =

⇢0, if kAk

2

11, otherwise . (21)

With reference to (16), it can be shown that strong dualityholds; the dual problem can be obtained as follows:

min⇧

{ 1

2

k�X0

�⇧X�1

k2 + �k⇧k⇤} =

min⇧

max⇤

{h⇤,�X0

�⇧X�1

i � 1

2

h⇤,⇤i+ �k⇧k⇤} =

max⇤

min⇧

{h⇤,�X0

�⇧X�1

i � 1

2

h⇤,⇤i+ �k⇧k⇤} =

max⇤

min⇧

{h⇤,�X0

i � 1

2

h⇤,⇤i � h⇤X>�1

,⇧i+ �k⇧k⇤} =

by Lemma 2

= max⇤

{h⇤,�X0

i � 1

2

h⇤,⇤i : k⇤X>�1

k2

�} .

(22)

Additionally, as it results clear from the second line of(22), the primal and dual solution are related as follows:

⇤ = �X0

� ⇧X�1

. (23)

This fact can be readily used to derive an optimalitycertificate based on the duality gap, i.e. the di↵erence ofvalues between the objective functions of the dual andprimal problems.

Remark 3. Note that the solution ⇤ of the dual problemin the last line of (22) corresponds to the projection of�X

0

onto the convex set S =�⇤ : k⇤X>

�1

k2

� .

4. ALGORITHM

In order to find a solution (⇧(�), �(�)) corresponding toa fixed value of � one could restate (14) as a semidefiniteprogramming problem (SDP) and rely on general purposesolvers such as SeDuMi Sturm [1999]. Alternatively, it ispossible to use a modelling language like CVX Grant andBoyd [2010] or YALMIP Lofberg [2004]. However theseapproaches are practically feasible only for relatively smallproblems. In contrast, the iterative scheme that we presentnext can be easily implemented and scales well with theproblem size. Additionally, the approach can be conve-niently used through warm-starting to compute solutionscorresponding to nearby values of �. The procedure, de-tailed in the algorithmic environment below, can be shownto be a special instance of the fast iterative shrinkage-thresholding algorithm (FISTA) proposed in Beck andTeboulle [2009]; therefore it inherits its favorable conver-gence properties, see Beck and Teboulle [2009]. We call itCointegrated VAR(P ) via FISTA (Co-VAR(P )-FISTA).

Algorithm: CoVAR(P )-FISTA

Input: X�p; �X�p, p = 0, 1, . . . , P � 1;

Initialize:

⇧0 = ⇧⇤1; �0 p = �⇤

1 p, p 2 NP�1;

t1 = 1; L = 1/c��CC>

��2

(see (9e))

Iteration k � 1:

Ak = ⇧⇤kX�1 +

X

p2NP�1

�⇤kp�X�p ��X0 (24a)

�k p = �⇤k p �

1

LcAk�X>

�p, p 2 NP�1 (24b)

Ck = ⇧⇤k �

1

LcAkX

>�1 (24c)

⇧k = D �L(Ck) (see (25)) (24d)

tk+1 =1 +p

1 + 4t2k2

, rk+1 =

✓tk � 1

tk+1

◆(24e)

⇧⇤k+1 = ⇧k + rk+1(⇧k �⇧k�1) (24f)

�⇤k+1 p = �k p + rk+1(�k p � �⇤

k�1 p), p 2 NP�1 (24g)

Return: ⇧k, �k 1, �k 2 , · · · ,�k P�1

The approach is essentially a forward-backward splittingtechnique (see Bauschke and Combettes [2011] and refer-ence therein) which is accelerated to reach the optimal rateof convergence in the sense of Nesterov [1983, 2003].

The procedure is based on two sets of working variables:

(⇧k, �k 1

, �k 2

, · · · ,�k P�1

)

and(⇧⇤

k, �⇤k 1

, �⇤k 2

, · · · ,�⇤k P�1

) .

Equations (24) from a to d correspond to a forward stepin the first set of variables conditioned on the variables inthe second set; the step size is determined by the Lipschitzconstant L; equation (24d) represent the backward step. Itamounts at evaluating at the current estimate the singularvalue shrinkage operator Cai et al. [2010] defined, for of a

Page 31: Special Session on Convex Optimization for System ...user.it.uu.se/~kripe367/research3/Scribbles_files/SYSID2012_CVX.pdf · Insights in convex optimization continue to be a driving

matrix A with SVD (11), as:

D⌧ (A) = U⌃+

V >, ⌃+

= diag({max(�r, ⌧)}1rR) .(25)

D⌧ (·) is the proximity operator Bauschke and Combettes[2011] of the nuclear norm function. Equation (24e) definesthe updating constant rk based upon the estimate sequencetk Nesterov [1983, 2003]. Finally, equations (24) from g to iupdate the second set of variables based upon the variablesin the fist set.

The approach requires to set an appropriate terminationcriterion. A sensible idea, which we follow in experiments,is to stop when the duality gap corresponding to thecurrent estimate is smaller that a predefined threshold.

5. CONTINUOUS SHRINKAGE ANDCOINTEGRATING RANK

5.1 Regularization Path

By continuously varying the regularization parameter �in (14) one obtains an entire set of solutions, denoted as{(⇧(�), �(�))}�, and called regularization path. In general,continuous shrinkage is known to feature certain favorableproperties. In particular, the paths of estimators likethe LASSO are known to be more stable than those ofinherently discrete procedures such as Subset SelectionBreiman [1996].

In the present context the path begins at � = 0 andterminates at the least value �

max

leading to ⇧(�max

) = 0.For problem (16), in particular, such value is

�max

= k�X0

X>�1

k2

(26)

as one can see in light of (23) and Remark 3.

5.2 Estimation of the Cointegrating Rank

Denote by {�r(�)}1rR the spectrum of ⇧(�). For a > 1consider the vector m 2 RD defined entry-wise by 4 :

md =

Zloga(�max

)

�1�2

d(at)atdt

!1/2

. (27)

We call m, the vector obtained from the inverse orderstatistics 5 of m, the path spectrum. Note that the in-tegral in (27) weights more the area of the path whichcorresponds to shrunk singular values. The rationale be-hind this is simple: those singular values that survive theshrinking are likely to be the most important ones. Thepath spectrum can be used to define an estimator for thecointegrating rank. In particular, one can take

D?⌧ (m) = arg min

d2ND

⇢f(d) =

✓Pi2Nd

m2

iPi2ND

m2

i

� ⌧

◆: f(d) > 0

(28)where 0 < ⌧ < 1 is a predefined threshold. Note thatthis estimator is independent on �. This is a desirablefeature: setting an appropriate value for the regularizationparameter is known to be a di�cult task.4 In experiments we always consider a = 10.5 I.e., m = [m(1), m(2), · · · , m(d)] is obtained from m sorting itsentries in decreasing order. Note that, normally, m already coincidewith its inverse order statistics m.

In practice the regularization path is evaluated at discretevalues of �. Therefore the integrals are replaced by theirMonte Carlo estimates. In computing the path we actuallybegin with �

max

(recall that ⇧(�max

) = 0) and proceedbackwards computing the solutions corresponding to log-arithmically spaced values of the parameter. At each stepwe initialize the algorithm of Section 4 with the previoussolution.

6. EXPERIMENTS

To test the performance of the proposed estimator weconsidered cointegrated systems with a triangular repre-sentation Phillips [1991]. More specifically we generatedrealizations of a D�dimensional cointegrated process xt

with M cointegrating vectors by the following model 6 :

xit =

8><

>:

(i+1)MX

j=iM+1

xjt + eit, if i = 1, 2, . . . ,M

xit�1

+ eit, if i = M + 1,M + 2, . . . , D .(29)

In all the cases et 2 RD is a Gaussian white noise withmean zero and covariance matrix (5e� 3)ID. The processis observed in noise: we actually used as training dataN successive time steps from a realization of xt = xt +wt where wt is zero-mean Gaussian white noise withcovariance matrix ⌘2ID. For all the cases we took P = 1and computed the path corresponding to an l

2

-smoothedformulation with smoothing parameter µ. We comparedD?

0.8(m) (see (28)) against the naive estimator D?0.8(�LS)

where �LS are the singular values of ⇧LS . Figure 2 refersto an experiment with D = 72, M = 8, N = 400, ⌘ =0.002 and µ = 0.01. In Table 1 we reported the averageestimated cointegrating rank (with standard deviation inparenthesis) over 20 random experiments performed fortwo di↵erent set of values of D, M, N, ⌘ and µ.

Table 1. Estimated cointegrating ranks in ran-domized experiments

D = 16, M = 2, N = 60, ⌘ = 0.001, µ = 0.1

D?0.8(m) D?

0.8(�LS)2.3(0.5) 4.35(0.5)

D = 72, M = 8, N = 400, ⌘ = 0.002, µ = 0.01

D?0.8(m) D?

0.8(�LS)7.4(1.9) 14.3(0.8)

7. CONCLUSIONS

We presented a novel approach, based on a convex pro-gram, for the analysis of cointegrated VAR processes fromobservational data. We proposed to compute solutions viaa scalable iterative scheme; used in combination with warmstarting this algorithm can be conveniently employed tocompute the entire regularization path. At each step onecan rely on the duality gap as an optimality certificate.The regularization path o↵ers indication for the actualvalue of the cointegrating rank and can be used to de-fine estimators for the latter. An important advantage is6 From (29) it is easy to recover the error correction model repre-sentation.

Page 32: Special Session on Convex Optimization for System ...user.it.uu.se/~kripe367/research3/Scribbles_files/SYSID2012_CVX.pdf · Insights in convex optimization continue to be a driving

0 10 20 30 40 50 60 700

0.5

1

1.5

2

2.5

3

i

sigm

a(i)

(a)

0 10 20 30 40 50 60 700

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

x 10!3

i

M(i)

(b)

10!6

10!5

10!4

0

0.2

0.4

0.6

0.8

lambda

sig

ma

(la

mb

da

)

(c)

Fig. 2. (a) The singular values of the true ⇧ and ⇧LS ; notethat the LS solution does not give an indication of thecointegrating rank. (b) The path spectrum; note thegap after d = 8. (c) the regularization path.

that the approach does not require to fix a value for theregularization parameter � in (14). This is known to be adi�cult task, especially when the goal of the analysis ismodel selection rather than low prediction errors.

REFERENCES

H.H. Bauschke and P.L. Combettes. Convex Analysis andMonotone Operator Theory in Hilbert Spaces. SpringerVerlag, 2011.

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems.SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

G.E.P. Box and G.C. Tiao. A canonical analysis ofmultiple time series. Biometrika, 64(2):355, 1977.

S.P. Boyd and L. Vandenberghe. Convex Optimization.Cambridge University Press, 2004.

L. Breiman. Heuristics of instability and stabilization inmodel selection. Annals of Statistics, 24(6):2350–2383,1996.

J.F. Cai, E.J. Candes, and Z. Shen. A singular valuethresholding algorithm for matrix completion. SIAMJournal on Optimization, 20(4):1956–1982, 2010.

E.J. Candes and B. Recht. Exact matrix completion viaconvex optimization. Foundations of ComputationalMathematics, 9(6):717–772, 2009.

E.J. Candes, X. Li, Y. Ma, and J. Wright. Robust principalcomponent analysis? Journal of ACM, 58(3):Article 11,37p., 2011.

R.F. Engle and C.W.J. Granger. Co-integration anderror correction: representation, estimation, and test-ing. Econometrica: Journal of the Econometric Society,pages 251–276, 1987.

M. Fazel. Matrix rank minimization with applications.PhD thesis, Elec. Eng. Dept., Stanford University, 2002.

C.W.J. Granger. Some properties of time series data andtheir use in econometric model specification. Journal ofeconometrics, 16(1):121–130, 1981.

M. Grant and S. Boyd. CVX: Matlab softwarefor disciplined convex programming, version 1.21.http://cvxr.com/cvx, May 2010.

R.A. Horn and C.R. Johnson. Topics in Matrix Analysis.Cambridge University Press, 1994.

S. Johansen. Statistical analysis of cointegration vectors.Journal of economic dynamics and control, 12(2-3):231–254, 1988.

S. Johansen. A representation of vector autoregressiveprocesses integrated of order 2. Econometric theory, 8(02):188–202, 1992.

J. Lofberg. Yalmip : A toolbox for modeling andoptimization in MATLAB. In Proceedings of theCACSD Conference, Taipei, Taiwan, 2004. URLhttp://users.isy.liu.se/johanl/yalmip.

H. Lutkepohl. New introduction to multiple time seriesanalysis. Springer, 2005.

N. F. Møller and P. Sharp. Malthus in cointegrationspace: A new look at living standards andpopulation in pre-industrial england. Discus-sion Papers 08-16, University of Copenhagen.Department of Economics, July 2008. URLhttp://ideas.repec.org/p/kud/kuiedp/0816.html.

Y. Nesterov. A method of solving a convex programmingproblem with convergence rate O( 1

k2

). In Soviet Mathe-matics Doklady, volume 27, pages 372–376, 1983, 1983.

Y. Nesterov. Introductory lectures on convex optimization:A basic course. Kluwer Academic Pub, 2003.

J.E. Payne and B.T. Ewing. Population and economicgrowth: a cointegration analysis of lesser developedcountries. Applied economics letters, 4(11):665, 1997.

P.C.B. Phillips. Optimal inference in cointegrated sys-tems. Econometrica: Journal of the Econometric Soci-ety, pages 283–306, 1991.

B. Recht, M. Fazel, and P.A. Parrilo. Guaranteedminimum-rank solutions of linear matrix equations vianuclear norm minimization. SIAM Rev., 52:471–501,2007.

R.T. Rockafellar. Conjugate duality and optimization,volume 16. Society for Industrial Mathematics, 1974.

S. Sharma. Persistence and stability in city growth.Journal of Urban Economics, 53(2):300–320, 2003.

J.H. Stock and M.W. Watson. Testing for common trends.Journal of the American statistical Association, pages1097–1107, 1988.

J.F. Sturm. Using SeDuMi 1.02, a MATLAB toolboxfor optimization over symmetric cones. OptimizationMethods and Software, 11(1):625–653, 1999.

R. Tibshirani. Regression shrinkage and selection via theLASSO. Journal of the Royal Statistical Society. SeriesB (Methodological), 58(1):267–288, 1996.

R.S. Tsay. Analysis of financial time series, volume 543.Wiley-Interscience, 2005.

H. Zou and T. Hastie. Regularization and variable selec-tion via the elastic net. J. Roy. Stat. Soc. Ser. B, 67(2):301–320, 2005.

Page 33: Special Session on Convex Optimization for System ...user.it.uu.se/~kripe367/research3/Scribbles_files/SYSID2012_CVX.pdf · Insights in convex optimization continue to be a driving

How effective is the nuclear norm heuristic

in solving data approximation problems?

Ivan Markovsky !

! School of Electronics and Computer Science, University of SouthamptonSO17 1BJ, United Kingdom, Email: [email protected]

Abstract: The question in the title is answered empirically by solving instances of three classicalproblems: fitting a straight line to data, fitting a real exponent to data, and system identification inthe errors-in-variables setting. The results show that the nuclear norm heuristic performs worse thanalternative problem dependant methods—ordinary and total least squares, Kung’s method, and subspaceidentification. In the line fitting and exponential fitting problems, the globally optimal solution is knownanalytically, so that the suboptimality of the heuristic methods is quantified.

Keywords: low-rank approximation, nuclear norm, subspace methods, system identification.

1. INTRODUCTION

With a few exceptions model reduction and system identifica-tion lead to non-convex optimization problems, for which thereare no efficient global solution methods. The methods for H2

model reduction and maximum likelihood system identificationcan be classified as local optimization methods and convexrelaxations. Local optimization methods require an initial ap-proximation and are in general computationally more expensivethan the relaxation methods, however, the local optimizationmethods explicitly optimize the desired criterion, which ensuresthat they produce at least as good result as a relaxation method,provided the solution of the relaxation method is used as aninitial approximation for the local optimization method.

A subclass of convex relaxation methods for system identifica-tion are the subspace methods, see Van Overschee and De Moor[1996]. Subspace identification emerged as a generalization ofrealization theory and proved to be a very effective approach. Italso leads to computationally robust and efficient algorithms.Currently there are many variations of the original subspacemethods (N4SID, MOESP, and CVA). Although the details ofthe subspace methods may differ, their common feature is thatthe approximation is done in two stages, the first of which isunstructured low-rank approximation of a matrix that is con-structed from the given input/output trajectory.

Related to the subspace methods are Kung’s method and thebalanced model reduction method, which are the most effectiveheuristics for model reduction of linear time-invariant systems.

A recently proposed convex relaxation method is the one usingthe nuclear norm as a surrogate for the rank. The nuclearnorm relaxation for solving rank minimization problems wasproposed in Fazel et al. [2001] and was shown to be the tightestrelaxation of the rank. It is a generalization of the !1-normheuristic from sparse vector approximation problems to rankminimization problems.

The nuclear norm heuristic leads to a semidefinite optimiza-tion problem, which can be solved by existing algorithms withprovable convergence properties and readily available softwarepackages. (We use CVX, see Grant and Boyd.) Apart from theo-

retical justification and easy implementation in practice, formu-lating the problem as a semidefinite program has the additionaladvantage of flexibility. For example, adding regularization andaffine inequality constraints in the data modeling problem stillleads to semidefinite optimization problems that can be solvedby the same algorithms and software as the original problem.

A disadvantage of using the nuclear norm heuristic is the factthat the number of optimization variables in the semidefiniteoptimization problem depends quadratically on the number ofdata points in the data modeling problem. This makes methodsbased on the nuclear norm heuristic impractical for problemswith more than a few hundreds of data points. Such problemsare considered “small size” data modeling problem.

Outline of the paper

The objective of this paper is to test the effectiveness of thenuclear norm heuristic as a tool for system identification andmodel reduction. Although, there are recent theoretical results,see, e.g., Candés and Recht [2009], on exact solution of matrixcompletion problems by the nuclear norm heuristic, to the bestof the author’s knowledge there are no similar results about theeffectiveness of the heuristic in system identification problems.

The nuclear norm heuristic is compared empirically with otherheuristic methods on benchmark problems. The selected prob-lems are simple: small complexity model and small num-ber of data points. The experiments in the paper are repro-ducible Buckheit and Donoho [1995]. Moreover the MATLAB

code that generates the results is included in the paper, so thatthe reader can repeat the examples by copying the code chunksfrom the paper and pasting them in the MATLAB commandprompt, or by downloading the code from

http://eprints.soton.ac.uk/336088/

The selected benchmark problems are:

(1) line fitting by geometric distance minimization (orthogo-nal regression),

(2) fitting a real exponential function to data, and(3) system identification in the errors-in-variables setting.

Page 34: Special Session on Convex Optimization for System ...user.it.uu.se/~kripe367/research3/Scribbles_files/SYSID2012_CVX.pdf · Insights in convex optimization continue to be a driving

Problem 1 is the static equivalent of problem 3 and can besolved exactly by unstructured rank-1 approximation of thematrix of the point coordinates. Problem 2 can be viewed asa first order autonomous system identification problem. Thisproblem also admits an exact analytic solution. Therefore inthe first two cases, we are able to quantify the sub-optimality ofthe nuclear norm heuristic (as well as any other method). This isnot possible in the third benchmark problem, where there are nomethods that can efficiently compute a globally optimal point.

2. TEST EXAMPLES

2.1 Line fitting

In this section, we consider the problem of fitting a line B,passing through the origin, to a set of points in the plane

D = {d1, . . . ,dN }.

The fitting criterion is the geometric distance from D to B

dist(D ,B) =

!N

!i=1

dist2(di,B), (dist)

wheredist(di,B) := min

"di"B

#di $ "di#2.

The line fitting problem in the geometric distance sense

minimize dist(D ,B)over all lines B passing through 0

(LF)

is equivalent to the problem of finding the nearest in the

Frobenius norm # · #F sense rank-1 matrix "D to the matrix ofthe point coordinates

D = [d1 · · · dN ] ,

i.e.,

minimize over "D "Rq%N #D$ "D#F

subject to rank("D)& r,(LRA)

where q = 2 and r = 1.

Note 1. (Generalization and links to other methods). For generalr < q < N, (LRA) corresponds to fitting an r-dimensional sub-space to N points in a q-dimensional space. This problem isclosely related to the principal component analysis and totalleast squares problem Markovsky and Van Huffel [2007].

The following theorem shows that all optimal solutions of (LRA)are available analytically in terms of the singular value decom-position of D.

Theorem 1. (Eckart–Young–Mirsky). Let

D =U"V'

be a singular value decomposition of D and partition U , " =:diag(#1, . . . ,#q), and V as follows:

U =:

r q$ r

[U1 U2] q , " =:

r q$ r#"1 00 "2

$r

q$ rand V =:

r q$ r

[V1 V2] N ,

Then the rank-r matrix, obtained from the truncated singularvalue decomposition

"D! =U1"1V'1 ,

is such that

#D$ "D!#F = minrank("D)&r

#D$ "D#F =%

#2r+1 + · · ·+#2

q .

The minimizer "D! is unique if and only if #r+1 (= #r.

)define lra 2a*+function dh = lra(d, r)

[u, s, v] = svd(d);

dh = u(:, 1:r) * s(1:r, 1:r) * v(:, 1:r)’;

Let "D! be an optimal solution of (LRA) and let "B! be theoptimal fitting model

"B! = image("D!).The rank constraint in the matrix approximation problem (LRA)corresponds to the constraint in the line fitting problem (LF)

that the model "B is a line passing through the origin (subspaceof dimension one)

dim( "B!) = rank("D!).We use the dimension of the model is a measure for its com-plexity and define the map

Dlrar,$$$- "D!,

implemented by the function lra.

Let #D#! denotes the nuclear norm of D, i.e., the sum of thesingular values of D. Applying the nuclear norm heuristic to(LRA), we obtain the following convex relaxation

minimize over "D " Rq%N #"D#!

subject to #D$ "D#F & e.(NNA)

)define nna 2b*+function dh = nna(d, e)

cvx_begin, cvx_quiet(true);

variables dh(size(d))

minimize norm_nuc(dh)

subject to

norm(d - dh, ’fro’) <= e

cvx_end

The parameter e in (NNA) is a user supplied upper bounds on

the approximation error #D$ "D#F.

Let "D be the solution of (NNA). Problem (NNA) defines themap

Dnnae,$$$- "D,

implemented by the function nnae.

The approximation nnae(D) may have rank more than r,

in which case (NNA) fails to identify a valid model "B.

However, note that nnae(D) = 0 for e . #D#F, so that for suf-ficiently large values of e, nnae(D) is rank deficient. Moreover,the numerical rank numrank

&nnae(D)

'of the nuclear norm

approximation reaches r for e / #D#F. We are interested tocharacterize the set:

e := {e | numrank&nnae(D)

'& r}. (e)

We hypothesise that e is an interval

e = [e!nna

,$). (H)

The smallest value of the approximation error #D$nnae(D)#F,for which rank(nnae(D))& r (i.e., for which a valid model ex-ists) characterizes the effectiveness of the nuclear norm heuris-tic. We define

nnar := nnae!nna , where e!nna

:= mine

{e | e " e}.

A bisection algorithm for computing the limit of perfor-mance e!

nnaof the nuclear norm heuristic is given in Ap-

pendix A.

Page 35: Special Session on Convex Optimization for System ...user.it.uu.se/~kripe367/research3/Scribbles_files/SYSID2012_CVX.pdf · Insights in convex optimization continue to be a driving

Another way to quantify the effectiveness of the nuclear normheuristic is to compute the distance of the approximationnnae(D) to the manifold of rank-r matrices

%(e) = distr&nnae(D)

'

:= min""D

#nnae(D)$ ""D#F subject to rank(""D)& r.

)define dist 3a*+dist = @(d, r) norm(d - lra(d, r), ’fro’);

The function e ,- % presents a complexity vs accuracy trade-off in using the nuclear norm heuristic. The optimal rank-r approximation corresponds in the (%,e) space to the point(0,e!

lra), where

e!lra

:= distr(D) = #D$lrar(D)#F.

The best model nnar(D) identifiable by the nuclear normheuristic corresponds to the point (0,e!

nna).

The loss of optimality incurred by the heuristic is quanti-fied by the difference &enna = e!

nna$ e!

lra.

The following code defines a simulation example and plots thee ,- % function over the interval [e!

lra,1.75e!

lra].

)Test line fitting 3b*+randn(’seed’, 0); q = 2; N = 10; r = 1;

d0 = [1; 1] * [1:N]; d = d0 + 0.1 * randn(q, N);

)define dist 3a*, e_lra = dist(d, r)

N = 20; Ea = linspace(e_lra, 1.75 * e_lra, N);

for i = 1:N

Er(i) = dist(nna(d, Ea(i)), r);

end

figure, plot(Ea, Er, ’o’, ’markersize’, 8)

The result is shown in Figure 1. In the example,

e!nna

= 0.4603 and e!lra

= 0.3209.

0.35 0.4 0.45 0.5 0.550

0.02

0.04

0.06

0.08

e

%(e)

e!lra

e!nna

Fig. 1. Distance of nnae(D) to a linear model of complexity 1as a function of the approximation error e.

Next, we compare the loss of optimality of the nuclear normheuristic with those of two other heuristics: line fitting by mini-mization of the sum of squared vertical and horizontal distancesfrom the data points to the fitting line, i.e., the classical methodof solving an overdetermined linear system of equations in theleast squares sense.

)Test line fitting 3b*++dh_ls1 = [1; d(2, :) / d(1, :)] * d(1, :);

e_ls1 = norm(d - dh_ls1, ’fro’)

dh_ls2 = [d(1, :) / d(2, :); 1] * d(2, :);

e_ls2 = norm(d - dh_ls2, ’fro’)

The results are

e!ls1 = 0.4546 and e!ls2 = 0.4531

which are both slightly better than the nuclear norm heuristic.

2.2 Exponential fitting

The problem considered in this section is fitting a time series

yd :=&yd(1), . . . ,yd(T )

'

by an exponential function

cexpz :=&cz1, . . . ,czT

'

in the 2-norm sense, i.e.,

minimize over c " R and z " R #yd $ cexpz #2. (EF)

The constraint that the sequence

"y =&"y(1), . . . ,"y(T )

'

is an exponential function is equivalent to the constraint that theHankel matrix

HL("y) :=

(

)))))))*

"y1 "y2 "y3 · · · "yT$L+1

"y2 "y3 . ..

"yT$L+2

"y3 . .. ...

..."yL "yL+1 · · · "yT

+

,,,,,,,-

,

where 1 < L < T $1 has rank less than or equal to 1. Therefore,the exponential fitting problem (EF) is equivalent to the Hankelstructured rank-1 approximation problem

minimize over "y " RT #yd $"y#2

subject to rank&HL("y)

'& 1.

(HLRA)

Problem (HLRA) has analytic solution, see [De Moor, 1994,Sec. IV.C].

Lemma 1. The optimal solution of (HLRA) is

"y! = c! expz!

where z! is a root of the polynomial equationT

!t=1

tyd(t)zt$1

T

!t=1

z2t $T

!t=1

yd(t)zt

T

!t=1

tz2t$1 = 0 (z!)

and

c! :=!T

t=1 yd(t)z!t

!Tt=1 z!2t

. (c!)

The proof of the lemma and an implementation of a procedurefit_exp for global solution of (HLRA), suggested by thelemma, are given in Appendix B.

Applying the nuclear norm heuristic to problem (HLRA), weobtain the following convex relaxation

minimize over "y " R #HL("y)#!subject to #y$"y#2 & e.

)define nna_exp 3d*+function yh = nna_exp(y, L, e)

cvx_begin, cvx_quiet(true);

variables yh(size(y))

minimize norm_nuc(hankel(yh(1:L), yh(L:end)))

subject to

norm(y - yh) <= e

cvx_end

Page 36: Special Session on Convex Optimization for System ...user.it.uu.se/~kripe367/research3/Scribbles_files/SYSID2012_CVX.pdf · Insights in convex optimization continue to be a driving

As in the line fitting problem, the selection of the parameter ecan be done by a bisection algorithm. As in Section 2.2, weshow the complexity vs accuracy trade-off curve and quantifythe loss of optimality by the difference &enna = e!hlra $ e!

nna

between the optimal approximation error e!hlra, computed usingthe result of Lemma 1,

)define dist_exp 4a*+dist_exp = @(y) norm(y - fit_exp(y));

and the minimal error e!nna

, for which the heuristic identifies avalid model.

The following code defines a simulation example and plots thetrade-off curve % over the interval [e!hlra,1.25e!hlra].)Test exponential fitting 4b*+

randn(’seed’, 0); z0 = 0.4; c0 = 1; T = 10;

t = (1:T)’; y = c0 * (z0 .^ t) + 0.1 * randn(T, 1);

)define dist_exp 4a*, e_hlra = dist_exp(y)

N = 20; Ea = linspace(e_hlra, 1.25 * e_hlra, N);

L = round(T / 2);

for i = 1:N

Er(i) = dist_exp(nna_exp(y, L, Ea(i)));

end

ind = find(Er < 1e-6); es_nna = min(Ea(ind))

figure, plot(Ea, Er, ’o’, ’markersize’, 8)

The result is shown in Figure 2. In the example,

e!nna

= 0.3130, and e!hlra = 0.2734.

0.28 0.3 0.32 0.340

0.01

0.02

0.03

e

%(e)

e!hlra e!nna

Fig. 2. Distance of "y = nnae!nna(y) to an exponential model as afunction of the approximation error e = #y$"y#.

The performance of the nuclear norm heuristic depends on theparameter L. In the simulation example, we have fixed the valueL = 0T/21. Empirical results (see the following chunk of codeand the corresponding plot in Figure 3) suggest that this is thebest choice.

)Test exponential fitting 4b*++Lrange = 2:(T - 1)

for L = Lrange

Er(L) = dist_exp(nna_exp(y, L, es_nna));

end

figure,

plot(Lrange, Er(Lrange), ’o’, ’markersize’, 8)

As in the line fitting problem, we compare the loss of optimalityof the nuclear norm heuristic with an alternative heuristic

2 4 6 8 10

0.01

0.02

0.03

0.04

0.05

L

%(L)

Fig. 3. Distance of "y = nnae(y) to an exponential model as afunction of the parameter L.

method—Kung’s method, see Kung [1978]. Kung’s method isbased on results from realization theory and balanced modelreduction. Its core computational step is the singular valuedecomposition of the Hankel matrix HL(yd), i.e., unstructuredlow-rank approximation. The heuristic comes from the fact thatthe Hankel structure is not taken into account. For detailedabout Kung’s algorithm, we refer the reader to [Markovsky,2012, Sect 3.1]. For completeness, an implementation kungof Kung’s method is given in Appendix C.)Test exponential fitting 4b*++

e_kung = norm(y - kung(y, 1, L)’)

The obtained results is e!kung = 0.2742, which is much better

than the result obtained by the nuclear norm heuristic.

2.3 Errors-in-variables system identification

The considered errors-in-variables identification problem is ageneralization of the line fitting problem (LF) to dynamicmodels. The fitting criterion is the geometric distance (dist) andthe model B is a single-input single-output linear time invariantsystem of order n. Let

wd :=&wd(1), . . . ,wd(T )

', where wd(t) " R

2,be the given trajectory of the system. The identification problemis defined as follows: given wd and n,

minimize dist(wd, "B)subject to "w is traj. of LTI system of order n. (SYSID)

The problem is equivalent to the following block-Hankel struc-tured low-rank approximation problem

minimize over "w #wd $ "w#2

subject to rank

.#HL("w1)HL("w2)

$/& L+ n, (BHLRA)

for n < L < 0T/21.)define blk_hank 4e*+

blk_hank = @(w, L) [hankel(w(1, 1:L), w(1, L:end))

hankel(w(2, 1:L), w(2, L:end))];

This is a nonconvex optimization problem, for which there areno efficient solution methods. Using the nuclear norm heuristic,we obtain the following convex relaxation

minimize over "w0000

#HL("w1)HL("w2)

$0000!

subject to #wd $ "w#2 < e.

Page 37: Special Session on Convex Optimization for System ...user.it.uu.se/~kripe367/research3/Scribbles_files/SYSID2012_CVX.pdf · Insights in convex optimization continue to be a driving

)define nna_sysid 5a*+function wh = nna_sysid(w, L, e)

)define blk_hank 4e*cvx_begin, cvx_quiet(true);

variables wh(size(w))

minimize norm_nuc(blk_hank(wh, L))

subject to

norm(w - wh, ’fro’) <= e

cvx_end

A lower bound to the distance from w to a trajectory of a lineartime-invariant system of order n, is given by the unstructuredlow rank approximation of the block Hankel matrix.

)define dist_sysid 5b*+)define blk_hank 4e*, )define dist 3a*dist_sysid = @(w, L, n) dist(blk_hank(w, L), L + n);

The following code defines a test example.

)Test system identification 5c*+randn(’seed’, 0); rand(’seed’, 0); T = 20; n = 1;

sys0 = ss(0.5, 1, 1, 1, -1); u0 = rand(T, 1);

y = lsim(sys0, u0) + 0.1 * randn(T, 1);

w = [(u0 + 0.1 * randn(T, 1))’; y’];

)define dist_sysid 5b*N = 20; Ea = linspace(0.3, 1, N); L = 4;

for i = 1:N

Er(i) = dist_sysid(nna_sysid(w, L, Ea(i)), L, n);

end

ind = find(Er < 1e-6); es_nna = min(Ea(ind))

figure, plot(Ea, Er, ’o’, ’markersize’, 8)

The obtained trade-off curve is shown in Figure 4. The optimal

0.4 0.6 0.8 10

0.05

0.1

0.15

0.2

e

%(e)

e!n4sid e!nna

Fig. 4. Distance of "w = nnae(w) to a model of order 1 as afunction of the approximation error e = #w$ "w#.

model computed by the nuclear norm heuristic has correspond-ing approximation error e!

nna= 0.7789. We have manually se-

lected the value L = 4 as giving the best results.

)Test system identification 5c*++Lrange = (n + 1):floor(T / 2);

for L = Lrange

Er(L) = dist_sysid(nna_sysid(w, L, es_nna), L, n);

end

figure,

plot(Lrange, Er(Lrange), ’o’, ’markersize’, 8)

Next, we apply the N4SID method, implemented in functionn4sid of the Identification Toolbox.

)Test system identification 5c*++

2 4 6 8 10

0.02

0.04

0.06

0.08

0.1

L

%(L)

Fig. 5. Distance of "w = nnae(w) to a model of order 1 as afunction of the parameter L.

sysh = ss(n4sid(iddata(w(2,:)’, w(1, :)’), ...

n, ’nk’, 0)); sysh = sysh(1, 1);

The distance from the data wd (w) to the obtained model "B(sysh) is computed by the function misfit, see Appendix D.

)Test system identification 5c*++[e_n4sid, wh_n4sid] = misfit(w, sysh); e_n4sid

The approximation error achieved by the n4sid alternativeheuristic method is en4sid = 0.3019. In this example, the sub-space method produces a significantly better model than thenuclear norm heuristic.

3. CONCLUSIONS

The examples considered in the paper—line fitting in thegeometric distance sense, optimal exponential fitting, andsystem identification—suggest that alternative heuristics—ordinary least squares, Kung’s, and N4SID methods—are moreeffective in solving the original nonconvex optimization prob-lems than the nuclear norm heuristic. Further study will focusin understanding the cause of the inferior performance of thenuclear norm heuristic and finding ways for improving it.

ACKNOWLEDGEMENTS

The research leading to these results has received funding fromthe European Research Council under the European Union’sSeventh Framework Programme (FP7/2007-2013) / ERC Grantagreement number 258581 “Structured low-rank approxima-tion: Theory, algorithms, and applications”.

REFERENCES

J. Buckheit and D. Donoho. Wavelets and Statistics, chap-ter "Wavelab and reproducible research". Springer-Verlag,Berlin, New York, 1995.

E. Candés and B. Recht. Exact matrix completion via convexoptimization. Found. Comput. Math., 9:717–772, 2009.

B. De Moor. Total least squares for affinely structured matricesand the noisy realization problem. IEEE Trans. Signal Proc.,42(11):3104–3113, 1994.

M. Fazel, H. Hindi, and S. Boyd. A rank minimization heuristicwith application to minimum order system approximation. InProc. American Control Conference, 2001.

M. Grant and S. Boyd. CVX: Matlab software for disciplinedconvex programming. stanford.edu/~boyd/cvx.

Page 38: Special Session on Convex Optimization for System ...user.it.uu.se/~kripe367/research3/Scribbles_files/SYSID2012_CVX.pdf · Insights in convex optimization continue to be a driving

S. Kung. A new identification method and model reductionalgorithm via singular value decomposition. In Proc. 12thAsilomar Conf. Circuits, Systems, Computers, pages 705–714, Pacific Grove, 1978.

I. Markovsky. Low Rank Approximation: Algorithms, Imple-mentation, Applications. Springer, 2012.

I. Markovsky and S. Van Huffel. Overview of total least squaresmethods. Signal Processing, 87:2283–2302, 2007.

P. Van Overschee and B. De Moor. Subspace Identificationfor Linear Systems: Theory, Implementation, Applications.Kluwer, Boston, 1996.

Appendix A. BISECTION ALGORITHM FORCOMPUTING THE LIMIT OF PERFORMANCE OF NNA

Assuming that e is an interval (H) and observing that

#D$lrar(D)#F & e!nna

& #D#F,

we propose a bisection algorithm on e (see Algorithm 1) forcomputing e!

nna.

Algorithm 1 Bisection algorithm for computing e!nna

Input: D, r, and convergence tolerance %&e

el := #D$lrar(D)#F and eu := #D#F

repeate := (el + eu)/2if rank(nnae(D))> r then,

el := e,else

eu := e.end if

until rank&nnae(D)

'(= r or eu $ el > %&e

return e

)bisection 6a*+function e = opt_e(d, r)

el = norm(d - lra(d, r), ’fro’);

eu = norm(d, ’fro’);

while 1,

e = mean([el eu]);

re = rank(nna(d, e), 1e-5); % numerical rank

if re > r, el = e; else, eu = e; end

if (re == r) && (eu - el < 1e-3), break, end

end

Appendix B. PROOF OF LEMMA 1 AND FUNCTION FORGLOBAL SOLUTION OF (HLRA)

The fact that "y! is an exponential function c! expz! followsfrom the equivalence of (HLRA) and (EF). Setting the partialderivatives of the cost function

f (c,z) :=T

!t=1

&y(t)$ czt

'2

of (EF) w.r.t. c and z to zero, we have the following first orderoptimality conditions

' f

'c= 0 =2

T

!t=1

&yd(t)$ czt

'zt = 0,

' f

' z= 0 =2

T

!t=1

&yd(t)$ czt

'tzt = 0.

Solving the first equation for c gives (c!). The right-hand-sideof the second equation is a polynomial of degree 2T , and theresulting polynomial equation is (z!).

)Analytic solution of the exponential fitting problem 6b*+function [yh, ch, zh] = fit_exp(y)

t = (1:length(y))’;

p1(t) = t .* y(t); p2(2 * t + 1) = 1;

p3(t + 1) = y(t); p4(2 * t) = t;

r = roots(conv(p1, p2) - conv(p3, p4));

r(r == 0) = []; z = 1 ./ r(imag(r) == 0);

for i = 1:length(z)

c(i) = (z(i) .^ t) \ y;

Yh(:, i) = c(i) * (z(i) .^ t);

f(i) = norm(y - Yh(:, i)) ^ 2;

end

[f_min, ind] = min(f);

zh = z(ind); ch = c(ind); yh = Yh(:, ind);

Appendix C. IMPLEMENTATION OF KUNG’S METHOD

)Kung method 6c*+function yh = kung(y, n, L)

[U, S, V] = svd(hankel(y(1:L), y(L:end)));

O = S(1:n, 1:n) * U(:, 1:n); C = V(:, 1:n)’;

c = O(1, :); b = C(:, 1);

a = O(1:end - 1, :) \ O(2:end, :);

for t = 1:length(y)

yh(t) = c * (a ^ (t - 1)) * b;

end

Appendix D. DISTANCE COMPUTATION IN THEDYNAMIC CASE

The problem of computing the distance, also called misfit,from a time series to a linear time-invariant model is a con-vex quadratic optimization problem. The solution is thereforeavailable in closed form. The linear time-invariant structure ofthe system, however, allows efficient O(T ) computation, e.g.,the Kalman smoother computes the misfit in O(T ) flops, usingon a state space representation of the system.

In [Markovsky, 2012, Section 3.2], a method based on, what iscalled an image representation of the system is presented. Thefollowing implementation does not exploit the structure in theproblem and the algorithm has computational cost O(T 3).)define misfit 6d*+

function [M, wh] = misfit(w, sys)

)transfer function to image representation 6e*T = size(w, 2); TP = blktoep(P, T);

wh = reshape(TP * (TP \ w(:)), 2, T);

M = norm(w - wh, ’fro’);

The function blktoep (not shown) constructs a block Toeplitzmatrix and the conversion from transfer function to imagerepresentation is done as follows:

)transfer function to image representation 6e*+[q, p] = tfdata(tf(sys), ’v’);

P = zeros(2, length(p));

P(1, :) = fliplr(p);

P(2, :) = fliplr([q zeros(length(p) - length(q))]);

Page 39: Special Session on Convex Optimization for System ...user.it.uu.se/~kripe367/research3/Scribbles_files/SYSID2012_CVX.pdf · Insights in convex optimization continue to be a driving

Identification of Box-Jenkins models using

structured ARX models and nuclear norm

relaxation !

Hakan Hjalmarsson ! James S. Welsh !! Cristian R. Rojas !

! Automatic Control Lab and ACCESS Linnaeus Centre, KTH RoyalInstitute of Technology (e-mail:

hakan.hjalmarsson,[email protected])!! School of Electrical Engineering and Computer Science, University

of Newcastle, Australia (e-mail: [email protected])

Abstract: In this contribution we present a method to estimate structured high order ARXmodels. By this we mean that the estimated model, despite its high order is close to a low ordermodel. This is achieved by adding two terms to the least-squares cost function. These two termscorrespond to nuclear norms of two Hankel matrices. These Hankel matrices are constructed fromthe impulse response coe!cients of the inverse noise model, and the numerator polynomial of themodel dynamics, respectively. In a simulation study the method is shown to be competitive ascompared to the prediction error method. In particular, in the study the performance degradesmore gracefully than for the Prediction Error Method when the signal to noise ratio decreases.

1. INTRODUCTION

The fundamental problem of estimating a model for alinear dynamical system, possibly subject to (quasi-) sta-tionary noise has received renewed interest in the lastfew years. In particular di"erent types of regularizationschemes have been in focus. An important contribution tothese developments is Pillonetto and De Nicolao [2010]where a powerful approach rooted in Machine Learn-ing is presented. In Chen et al. [2011] it is shown thatthis approach has close ties with !2-regularization witha cleverly chosen penalty term. Another approach is touse structured low rank approximations. This typicallyleads to non-convex optimization problems for which localnonlinear optimization methods are used, see Markovsky[2008] for a survey. Recently, the nuclear norm has beenused in this approach to obtain convex optimization prob-lems. A number of contributions based on this idea hasalready appeared, e.g. Fazel et al. [2003], Grossmann et al.[2009a,b], Mohan and Fazel [2010], Fazel et al. [2003], Liuand Vandenberghe [2009a,b]. In this contribution we addto this avalanche of new and exciting methods by introduc-ing structured estimation of high order ARX models. Ourcontribution can be seen as an extension of the methodpresented in Grossmann et al. [2009a,b]. The extensionsconcern i) the possibility to also estimate a noise model inthe same convex framework, ii) a way to choose the regu-larization parameter, and iii) a quite extensive simulationstudy. In the simulation study the new method comparesfavourably to the prediction error method for scenarioswhere the signal to noise ratio (SNR) is poor.

The outline of the paper is as follows. In Section 2the problem under consideration is discussed. Section 3

! This work was supported in part by the European Research

Council under the advanced grant LEARN, contract 267381, and in

part by the Swedish Research Council under contract 621-2009-4017.

presents our approach and in Section 4 some numericalcomparisons with state of the art algorithms are presented.The paper is concluded with some conclusions in Section5.

2. THE PROBLEM

Consider the discrete-time linear time-invariant systemwith input u(t) and output y(t) given by

y(t) = Go(q)u(t) +Ho(q)eo(t) (1)

where {eo(t)} is zero mean white noise with variance "e.The input to output relationship is given by the rationaltransfer function

Go(q) :=Bo(q)

Fo(q):=

bo1q"1 + . . .+ bono

q"no

1 + fo1 q

"1 + . . .+ fonoq"no

,

and the coupling between the noise and output is given by

Ho(q) :=Co(q)

Do(q):=

1 + co1q"1 + . . .+ como

q"mo

1 + do1q"1 + . . .+ domo

q"mo

.

Here q"1 is the time-shift operator q"1u(t) = u(t ! 1).Thus the system can be written as

y(t) =Bo(q)

Fo(q)u(t) +

Co(q)

Do(q)eo(t). (2)

A parametric model of (1) is given by the Box-Jenkinsstructure

y(t) =B(q)

F (q)u(t) +

C(q)

D(q)e(t) (3)

where B(q), F (q), C(q) and D(q) are polynomials

B(q) = b1q"1 + . . .+ bnq

"n (4)

F (q) = 1 + f1q"1 + . . .+ fnq

"n (5)

C(q) = 1 + c1q"1 + . . .+ cmq"m (6)

D(q) = 1 + d1q"1 + . . .+ dmq"m (7)

Page 40: Special Session on Convex Optimization for System ...user.it.uu.se/~kripe367/research3/Scribbles_files/SYSID2012_CVX.pdf · Insights in convex optimization continue to be a driving

whose coe!cients b1, . . . , dm are collected in the vector# " R2n+2m. This parameter vector can, e.g., be estimatedusing prediction error identification [Ljung, 1999]. A dis-advantage with prediction error identification using themodel structure (3) is that in general the criterion is non-convex and thus the numerical optmization may convergeto a local minimum.

One model structure for which there exists an explicit ex-pression for the parameter estimate is the ARX-structure

y(t) =B(q)

A(q)u(t) +

1

A(q)e(t) (8)

where

A(q) = 1 + a1q"1 + . . .+ anq

"n (9)

Even though the system (1) is not captured by this struc-ture, by letting n increase, Go can be well approximatedby B/A and Ho can be well approximated by 1/A [Ljungand Wahlberg, 1992]. This structure thus has some veryattractive features and is extensively used, e.g. in indus-trial practice [Zhu, 2001]. A drawback with this structureis that the variance error, i.e. the error induced by thenoise eo, increases linearly with the number of parameters[Ljung and Wahlberg, 1992]. Thus the accuracy can besignificantly worse when compared with using the Box-Jenkins structure (3) (with n = no, m = mo). In orderto make a distinction between the desired low order modelwith n parameters in B and F , andm parameters in C andD, we will use the notation nho to denote the number ofparameters in the B and A polynomials when a high-orderARX-model is used 1 .

3. STRUCTURED ARX ESTIMATION

A model equivalent to (3) is given by [Kailath, 1980]

y(t) =#!

k=1

bku(t! k) + e(t) +#!

k=1

hke(t! k)

Rank {Hankel(b)} = nRank {Hankel(h)} = m

(10)

where Hankel(x) is the infinite Hankel matrix with x1,x2, x3, . . . in the first column. Prediction error identifica-tion of a Box-Jenkins model (3) using output/input data{y(t), u(t)}Nt=1 can thus be stated as

minb1,b2,...,h1,h2,...

N!

t=1

"

#

$

1 +#!

k=1

hkq"k

%"1

(y(t)!#!

k=1

bku(t! k))

&

'

2

s.t. Rank {Hankel(b)} = nRank {Hankel(h)} = m

(11)

There are several problems associated with (11) from anoptimization point of view. Firstly, it is non-convex in{hk}. Furthermore, the rank constraints are non-convex.An interesting convex rank heuristic is obtained using thenuclear norm [Fazel et al., 2001] (see Fazel et al. [2003] for

1 It is also easy to generalize our method so that di!erent number

of parameters in B and A are used.

its connection to another heuristic, the log-det heuristic)which for a matrix X " Rn$m is given by

#X#! =

min(n,m)!

i=1

$i(X)

where {$i(X)} are the singular values of X. The nuclearnorm is also known as the Schatten 1-norm and the Ky-Fan n-norm and it can be obtained as the solution to thesemi-definite program (SDP)

#X#! = minY,Z

µ

s.t.

(

Y XXT Z

)

$ 0, Tr{Y + Z) % 2µ

The nuclear norm can be seen as the matrix extension ofthe !1-norm known to give sparse estimates when used inestimation [Tibshirani, 1996].

In the following we will show how to use ARXmodels (8) ofhigh order together with the nuclear norm to approximate(11). We proceed by first studying a simpler problem andthen in a second step we address the full problem.

3.1 Estimation of structured FIR models

Assume that it is known that Ho(q) = 1 so that no noisemodel has to be estimated. Then (11) simplifies to

minb1,b2,...

N!

t=1

(y(t)!#!

k=1

bku(t! k))2

s.t. Rank {Hankel(b)} = n.

(12)

A straightforward convex approximation of this problemis to truncate the impulse response {bk}, i.e. to set bk = 0for k > nho for some su!ciently large integer nho, and toreplace the rank constraint by a constraint on the nuclearnorm of the finite dimensional Hankel matrix with {bk}

nho

k=1as unique elements. We denote this Hankel matrix byHnho

(b). A variation of this approach is to add the nuclearnorm as a regularization term to the cost function. Thisleads to the following SDP:

minb1,...,bn,Yb,Zb

N!

t=1

(y(t)!nho!

k=1

bku(t! k))2 +"b

2Tr{Yb + Zb)

s.t.

(

Yb Hnho(b)

HTnho

(b) Zb

)

$ 0.

(13)

This approach has been used in Grossmann et al. [2009a,b],where in addition missing data were included as freeparameters. We will call this method NUC-FIR.

The regularization parameter "b can be determined bycross-validation. We refer to the next section for details.

3.2 Estimation of structured ARX models

We cannot directly extend the idea in the preceedingsubsection to also include a noise model since, as alreadynoted, (11) is non-convex in the impulse response coef-ficients of the noise dynamics {hk}. However, we havealso observed that prediction error identification of ARX-models is convex and that such models can approximateany rational system and noise dynamics arbitrarily welljust by picking the polynomial order n large enough. Such

Page 41: Special Session on Convex Optimization for System ...user.it.uu.se/~kripe367/research3/Scribbles_files/SYSID2012_CVX.pdf · Insights in convex optimization continue to be a driving
Page 42: Special Session on Convex Optimization for System ...user.it.uu.se/~kripe367/research3/Scribbles_files/SYSID2012_CVX.pdf · Insights in convex optimization continue to be a driving
Page 43: Special Session on Convex Optimization for System ...user.it.uu.se/~kripe367/research3/Scribbles_files/SYSID2012_CVX.pdf · Insights in convex optimization continue to be a driving
Page 44: Special Session on Convex Optimization for System ...user.it.uu.se/~kripe367/research3/Scribbles_files/SYSID2012_CVX.pdf · Insights in convex optimization continue to be a driving
Page 45: Special Session on Convex Optimization for System ...user.it.uu.se/~kripe367/research3/Scribbles_files/SYSID2012_CVX.pdf · Insights in convex optimization continue to be a driving

Stable Nonlinear System Identification:

Convexity, Model Class, and Consistency

Ian R. Manchester

⇤Mark M. Tobenkin

⇤⇤

Alexandre Megretski

⇤⇤

⇤Aerospace, Mechanical and Mechatronic Engineering,University of Sydney, Australia

⇤⇤ Electrical Engineering and Computer Science,Massachusetts Institute of Technology, USA

e-mail: {irm, mmt, ameg}@mit.edu.

Abstract: Recently a new approach to black-box nonlinear system identification has beenintroduced which searches over a convex set of stable nonlinear models for the one whichminimizes a convex upper bound of long-term simulation error. In this paper, we further studythe properties of the proposed model set and identification algorithm and provide two theoreticalresults: (a) we show that the proposed model set includes all quadratically stable nonlinearsystems, as well as some more complex systems; (b) we study the statistical consistency of theproposed identification method applied to a linear system with noisy measurements. It is showna modification related to instrumental variables gives consistent parameter estimates.

1. INTRODUCTION

Building approximate models of dynamical systems fromdata is a ubiquitous task in the sciences and engineering.Black-box modeling in particular plays an important rolewhen first-principles models are either weakly identifiable,too complicated for the eventual application, or simplyunavailable (see, e.g., Sjoberg et al. [1995], Suykens andVandewalle [1998], Giannakis and Serpedin [2001], Ljung[2010] and references therein).

Model instability, prevalence of local minima, and poorlong-term (many-step-ahead) prediction accuracy are com-mon di�culties when identifying black-box nonlinear mod-els (Ljung [2010], Schon et al. [2010]). Recently a newapproach labelled robust identification error (RIE) wasintroduced which searches over a convex set of stablenonlinear models for the one which minimizes a convexupper bound of long-term simulation error (Tobenkin et al.[2010]). In Manchester et al. [2011] the method was ex-tended to systems with orbitally stable periodic solutions.Earlier related approaches appeared in Sou et al. [2008],Bond et al. [2010], Megretski [2008].

Since both the model set and the cost function of theRIE involve convexifying relaxations, it is important tostudy the degree of conservatism so introduced. In thispaper, we show that the proposed model set includes allquadratically contracting systems, as well as some morecomplex models. We also study the statistical consistencyof the proposed identification method applied to a linearsystem with noisy measurements. It is shown that theestimator can give biased parameter estimates, but amodification related to instrumental variables recoversstatistical consistency.

? This work was supported by National Science Foundation GrantNo. 0835947

We focus on a particular implicit form of state-spacemodels:

e(x(t + 1)) = f(x(t), u(t)), (1)y(t) = g(x(t), u(t)), (2)

where e : Rn 7! Rn, f : Rn ⇥ Rm 7! Rn, and g :Rn ⇥ Rm 7! Rk are continuously di↵erentiable functionssuch that the equation e(z) = w has a unique solutionz 2 Rn for every w 2 Rn. This choice of implicit equationsprovides important flexibility for the convex relaxations.

Models are used for a wide variety of purposes, each withits own characteristic measure of performance, howeverfor a large class of problems an appropriate measure issimulation error, i.e.

E =TX

t=0

|y(t)� y(t)|2,

where y is the simulated output from the model and y isthe recorded real system output, with each subjected tothe same input u(t) and the same initial conditions.

When the true states x(t) can be measured or estimated,a standard approach is to consider equation error:

✏x

(t) = e(x(t + 1))� f(x(t), u(t)), (3)✏y

(t) = y(t)� g(x(t), u(t)). (4)

If e, f and g are linearly parametrized, then minimizing,e.g.,

Pt

(|✏x

|2+ |✏y

|2) amounts to basic least squares. How-ever, minimizing ✏

x

and ✏y

does not give any guaranteesabout the simulation error, since models of this form arerecursive by nature and modelling errors will accumulateand dissipate as the simulation progresses. In particular,there is no guarantee that a model which has an optimalfit in terms of equation error will even be stable.

Page 46: Special Session on Convex Optimization for System ...user.it.uu.se/~kripe367/research3/Scribbles_files/SYSID2012_CVX.pdf · Insights in convex optimization continue to be a driving

Ideally, we would search over all stable nonlinear mod-els (1), (2) of limited complexity 1 for one which mini-mizes simulation error. There are two major di�cultieswhich render this impossible in general: firstly, we have notractable parameterization of “all stable nonlinear modelsof limited complexity”; secondly, even supposing we aregiven such a parameterization, the simulation error is ahighly nonlinear function of f and g, making the associ-ated optimization problem very di�cult. Indeed, in Ljung[2010] stability of models and convexity of optimizationcriteria are listed among the major open problems innonlinear system identification.

1.1 Convex Upper Bounds for Simulation Error

We will start with the second problem: optimization sim-ulation error. The main di�culty with simulation error isthat it depends on the y(t), the result of solving a di↵er-ence or di↵erential equation over a long time interval. Evenif some finite-dimensional linearly-parameterized class offunctions is used to represent f and g, the simulated out-put will be a highly nonlinear function of those parameters.

When analyzing stability or performance of a dynamicalsystem it is common to make use of a storage function anda dissipation inequality, and we follow the same approachhere. The advantage is that a condition on the solution of adynamical system can be checked or enforced via pointwiseconditions on the system equations. In particular, let � =x� x be the di↵erence between the state of the model andthe true state of the system. Suppose we find some positivedefinite 2 function V (·) satisfying

V (�(t + 1), t + 1)� V (�(t), t) r(t)� |y(t)� y(t)|2 (5)with y(t) = g(x(t) + �(t), u(t)) and r(t) a slack variable.Suppose also that the model’s initial conditions are cor-rect, i.e. �(0) = 0, then we can simply sum the abovedissipation inequality to time T and obtain

E :=TX

t=0

|y(t)� y(t)|2 TX

t=0

r(t).

This would suggest that to minimize simulation error wecan search over functions e, f, g, and V , and try to mini-mize

Pr(t) while satisfying the dissipation inequality (5).

There are two problems. Firstly, the dissipation inequalitymade use of the particular �(t) which comes from themodel simulation. This can be ensured by simply requiringthat (5) hold for all �(t), which adds a certain robustnessat the expense of conservatism.

The second problem is more challenging: it is that thecondition (5) is not jointy convex in V , e, and f . Forexample, the term V (�(t+1), t+1) is a composition of Vand �(t+1) which is given by x(t+1)�x(t+1) = f(x(t)+�(t), u(t))� x(t + 1). Hence (5) contains a composition ofV and f , which is nonconvex. In Tobenkin et al. [2010]a convex upper bound was derived that can be optimizedvia semidefinite programming.

1 It is di�cult to give a general definition of “limited complexity”,but for a concrete example consider f(x, u) and g(x, u) to bemultivariate polynomials up to some fixed degree.2 Here positive definite means that V (0, t) = 0 and V (�, t) > 0otherwise, for all t.

1.2 Convex Classes of Stable Nonlinear Models

Another central contribution of Tobenkin et al. [2010] isthe model class itself: a particular convex parametrizationof stable nonlinear models was presented which can berepresented by sum-of-squares constraints Parrilo [2000].There are some subtleties and a few di↵erent formulationswhich will be detailed in later sections, but the thrust isthat the storage function V (�, t) used to bound simulationerror doubles as a contraction metric for the identifiedmodel (Lohmiller and Slotine [1998]).

Characterizing the “richness” of a nonlinear model classis di�cult both conceptually and computationally. It wasshown in Tobenkin et al. [2010] that all stable linearmodels are included in the model class. In this work weextend those results to show inclusion of all quadraticallystable nonlinear systems, and at least some systems morecomplex than this.

2. ROBUST IDENTIFICATION ERROR

In this section we recap a few relevant points from To-benkin et al. [2010].

The global RIE for a model (1),(2) is a function of thecoe�cients of (1),(2), a single data sample

z = (v, y, x, u) 2 Rn ⇥ Rk ⇥ Rn ⇥ Rm, (6)where v(t) = x(t+1) i.e. a time-shifted measurement of thestate, and an auxiliary parameter Q = Q0 > 0, a positivedefinite symmetric n-by-n matrix 3 :E

Q

(z) = sup�2Rn

�|f(x + �, u)� e(v)|2

Q

� |�e

|2Q

+ |�y

|2

.

(7)where |a|2

Q

is shorthand for a0Qa, and

�y

= g(x + �, u)� y, �e

= e(x + �)� e(x). (8)

Note that (7) corresponds to (5) with a particular choice ofV = |�

e

|2Q

. Then of course the following theorem suggestsminimizing the sum of E

Q

(z) with respect to e, f, g and Q:Theorem 1. The inequality

E NX

t=1

EQ

(z(t)), (9)

holds for every Q = Q0 > 0.

Unfortunately this is still nonconvex, however we have thefollowing convex upper bound: let

�v

= f(x + �, u)� e(v).

then EQ

(z) EQ

(z) where

EQ

(z) = sup�2Rn

�|�

v

|2Q

+ |�|2P

� 2�0�e

+ |�y

|2

, (10)

and P = Q�1. The function EQ

(z) serves as an upperbound for E

Q

(z) that is jointly convex with respect to e,f , g, and P = Q�1 > 0.

3 For convenience, we only indicate the dependence on z and Q, wewill occasionally write EQ[e, f, g] to make the dependence on (e, f, g)explicit.

Page 47: Special Session on Convex Optimization for System ...user.it.uu.se/~kripe367/research3/Scribbles_files/SYSID2012_CVX.pdf · Insights in convex optimization continue to be a driving

2.1 Convex Set of Stable Models

The class of models we search over can be referred to asincrementally output l2 stable. In particular, we considerthe model (1),(2) stable if for every input sequence u :Z+ 7! Rm and pair of initial conditions x00, x10 2 Rn

the solutions (x, y) = (x0, y0) and (x, y) = (x1, y1) definedby (1)(2) with u(t) ⌘ u(t), x0(1) = x00 and x1(1) = x10

satisfy:1X

t=1

|y0(t)� y1(t)|2 < 1.

By minor variation of the work of Lohmiller and Slotine[1998], this form of stability can be verified by the condi-tion

|F (x, u)�|2Q

� |E(x)�|2Q

+ |G(x, u)�|2 0, (11)where capital letters denote Jacobians of e, f and g withrespect to x. Here, E(x)0QE(x) is a contraction metric forthe model.

However, as with the RIE, this condition is nonconvexwith respect to E(x), but a similar relaxation results inthe following su�cient condition:|F (x, u)�|2

Q

� |�|2E(x)+E(x)0�Q

�1 + |G(x, u)�|2 0, (12)for all x, � 2 Rn and u 2 Rm. This condition is jointlyconvex in (e, f, g) and Q�1.

The well-posedness of state dynamics equation (1) isguaranteed when the function e : Rn 7! Rn is a bijection.Theorem 2. Let e : Rn 7! Rn be a continuously di↵eren-tiable function with a uniformly bounded Jacobian E(x),satisfying:

E(x) + E(x)0 � 2r0I, 8x 2 Rn (13)for some fixed r0 > 0. Then e is a bijection.

In the case when e, f and g are polynomials, the convexconstraints and optimization objectives in this section canbe handled via semidefinite programming and the sum-of-squares relaxation Parrilo [2000].

3. RESULTS ON ACCURACY OF RELAXATIONS

For both the RIE and our class of stable models we havedescribed convex upper bounds on non-convex functions.It is natural to ask how tight these upper bounds will be,i.e. how close will the optimizer of the relaxed problemsbe to attaining the optimal value of the original RIEproblems and what characterizes models which potentiallysatisfy the unrelaxed stability constraint (11) but have noparameterization satisfying the relaxed constraint (12). Inthis section we o↵er partial answers to these questions. Inparticular, we will demonstrate that when e(x) is linearour bounds and model class are “tight” in sense we willmake precise below.

As a corollary, we will show our model class contains everyquadratically stable nonlinear system and in particularcontains all stable linear systems of appropriate dimen-sions. We will also demonstrate, via a simple example,that the ability to choose e(x) to be nonlinear allows themethod to describe systems which are not quadraticallystable and o↵ers a substantially richer model class.

3.1 Tightness of the RIE Relaxation

The following lemma describes the “tightness” of therelaxation of the RIE when e(x) is linear.Lemma 3. Let (e, f, g) be functions as in (1),(2) and Q =Q0 2 Rn⇥n be positive definite. If e(x) is linear andinvertible (i.e. e(x) = Ex for an invertible E 2 Rn⇥n)then there exists a symmetric positive definite Q 2 Rn⇥n,an invertible linear function e : Rn 7! Rn and a functionf : Rn ⇥ Rm 7! Rn such that:

EQ

[e, f, g] = EQ

[e, f , g] = EQ

[e, f , g],

and (e�1 � f) = (e�1 � f).

Proof. Examine the choices:Q�1 = E0QE, e = Q�1x, f = Q�1E�1f(x, u).

Then we see the arguments of the supremum in thedefinition of E identically match those in the definitionof E .

An identical statement holds for the global RIE (EQ

) andits relaxation (E

Q

) proven by the same choices for e, f , Q.

3.2 Coverage of Quadratically Stable Models

Let f0 : Rn ⇥ Rm 7! Rn and g : Rn ⇥ Rm 7! Rk. We callthe system:

x(t + 1) = f0(x(t), u(t)) (14)y(t) = g(x(t), u(t)) (15)

quadratically incrementally `2 output stable (or, for ourpurposes, simply quadratically stable) if there exists anM 2 Rn⇥n with M = M 0 > 0 and:|�|2

M

� |f0(x+�, u)�f0(x, u)|2M

+ |g(x+�, u)�g(x, u)|2,(16)

holds for all x,� 2 Rn and u 2 Rm. The followinglemma describes how any quadratically stable system, inprinciple, belongs to some model class described by therelaxed stability constraint (12).Lemma 4. Let f0 and g be continuously di↵erentiablefunctions defining the system (14),(15). Then there existse : Rn 7! Rn, f : Rn ⇥Rm 7! Rn and Q 2 Rn⇥n such thate(x) is linear and invertible, f0 = e�1 � f , Q = Q0 > 0,and the relaxed stability constraint (12) is satisfied.

Proof. As (f0, g) is quadratically stable there exists an Msuch that (16) is satisfied for all x,� 2 Rn and u 2 Rm.The di↵erentiability of f and g then guarantees that:

|�|2M

� |F0(x, u)�|2M

+ |G(x, u)�|2, (17)also holds for all x,� 2 Rn and u 2 Rm. Then takinge(x) = M�1x and Q = M gives E(x) = M�1, F (x, u) =M�1F0(x, u). Now, substituting these in to (12) recovers(17), hence the model is in the relaxed model class.

As stable linear systems are automatically quadraticallystable in the above sense, we arrive at the followingcorollary:Corollary 5. For any k input, m output, stable lineardiscrete time state-space model with n states there existsa positive definite symmetric Q 2 Rn⇥n and equivalentstate-space model (1),(2) given by linear (e, f, g) such that(12) is satisfied.

Page 48: Special Session on Convex Optimization for System ...user.it.uu.se/~kripe367/research3/Scribbles_files/SYSID2012_CVX.pdf · Insights in convex optimization continue to be a driving

This corollary is closely connected to the results of Lacyand Bernstein [2003].

We now take a brief moment to explore the merits ofsearching over nonlinear e(x) through a simple example.It is easy to see that the unrelaxed stability constraintimplies quadratic stability when e(x) is linear. We examinesystems (14),(15) where g(x(t), u(t)) = x(t) 2 R. In thiscase we see that the quadratic stability of (f0, g) impliesthat for any fixed u the map f(·, u) is a contraction map.This restricts solutions to grow closer in Euclidean normat every step of their evolution. In this setting, the relaxedstability constraint (12) admits systems which are notcontraction maps. For example, the system (1) with:

e(x) = x +15x5, f(x, u) =

13x3,

satisfies (12) with Q = 1, but is not a contraction map ina neighborhood of x = 3

2 .

4. A CONSISTENT ESTIMATOR OF LINEARMODELS

The work of Tobenkin et al. [2010] did not consider thee↵ects of noise on the proposed fitting procedure. The ob-jectives thus far described are best suited to identificationproblems where the e↵ect of noise can be reduced via pre-processing, or model-order reduction from simulation. Inthis section we describe a consistent estimator for stablelinear models with an “output-error” like noise model,with quite weak statistical assumptions.

We first present the proof that, for linear models, i.e.models of the form:

Ex(t + 1) = Fx(t) + Ku(t), (18)y(t) = Gx(t) + Lu(t) (19)

with E,F 2 Rn⇥n, K 2 Rn⇥m, G 2 Rk⇥n, L 2 Rk⇥m andE invertible, RIE minimization depends only on the dataonly through its correlation matrix. This observation is ofcomputational interest as, in this case, RIE minimizationwill require only d = 2n + m + k nonlinear convexconstraints regardless of the number of observations. Itis also the basis of the consistent estimator to follow.Lemma 6. Let Z = {z(t

i

)}N

i=1 ⇢ Rd. Define:

W :=1N

NX

i=1

z(ti

)z(ti

)0. (20)

Let ⇥ = (E,F, G,K, L) be as in (18),(19) and Q = Q0 2Rn⇥n be a positive definite matrix. For any such matricessuch that

Rdt

:= F 0QF + G0G� E � E0 �Q�1 < 0there exists a matrix H = H 0 2 Rd⇥d such that:

FQ

(W, ⇥) := tr(WH) =1N

NX

i=1

E0Q

(z(ti

)), (21)

for all data sets Z. Further, for every positive semidefiniteW = W 0 in Rd⇥d there exists {q

i

}d

i=1 ⇢ Rd dependingonly on W such that:

FQ

(W, ⇥) =dX

i=1

E0Q

(qi

). (22)

Proof. When Rdt

< 0 the supremum in the definition ofE0

Q

can be calculated explicitly as:

E0Q

(z) = | ex

|2Q

+ | ey

|2 +����

F 0Q e

x

G0 ey

�����2

(�Rdt)�1

. (23)

The above expression is a homogenous quadratic formin z, i.e. there exists a symmetric matrix H such thatE0

Q

(z) = z0Hz. We can then conclude:

1N

NX

i=1

z0Hz = tr

1N

NX

i=1

zz0

!H

!= tr(WH).

The second claim follows by taking an eigenvector decom-position of W and reversing the above identities.

We now construct our consistent estimator. Our approachis to approximate a noiseless empirical covariance betweenthe inputs, outputs and states through multiple observa-tions of the system subjected to the same input — thisis similar in spirit to instrumental variables and othercorrelation approaches Ljung [1999]. We then choose amodel which minimizes the RIE treating the approximatecorrelation as its input as in Lemma 6.

More formally, we assume that, for each r 2 {1, 2} andt 2 Z+, our experiments are generated by a linear system:

x(r)(t + 1) = A0x(r)(t) + B0u(t) (24)

y(r)(t) = C0x(r)(t) + D0u(t). (25)

with unknown initial conditions x(r)(1) and identical inputu : Z+ 7! R. The data we obtain is corrupted by noisesequences w(r)(t) 2 Rn+k:

x(r)(t)y(r)(t)

�=x(r)(t)y(r)(t)

�+ w(r)(t). (26)

Note in this setting direct observation of the state is notparticularly restrictive, as we can use the recent input andoutput history for x(t) and assume the system (24),(25) isin a observable canonical form.

The RIE produces implicit models of the form (18),(19).We denote the parameters of such a model by ⇥ 2 Rn⇥n⇥Rn⇥n ⇥ Rn⇥m ⇥ Rk⇥n ⇥ Rk⇥m:

⇥ := (E,F, K,G, L).The implicit form means there are many ⇥ correspondingto the same linear system. We define S(·) to be the maptaking implicit models to their explicit form:

S(⇥) =E�1F E�1K

G L

�.

In this notation, we will present an estimator for S(⇥0)where ⇥0 = (I, A0, B0, C0, D0).

Given a data set {(z(t)(1), z(t)(2))}N

t=1 with:

z(r)(t) =⇥x(r)(t + 1)0 y(r)(t)0 x(r)(t)0 u(t)0

⇤0,

our estimator is defined as follows. Compute a sym-metrized cross-correlation W

N

:

WN

=1

2N

NX

t=1

z(1)(t)z(2)(t)0 + z(2)(t)z(1)(t)0. (27)

Define WN

by:W

N

= WN

+ max{0,��min

(WN

)}I.

Page 49: Special Session on Convex Optimization for System ...user.it.uu.se/~kripe367/research3/Scribbles_files/SYSID2012_CVX.pdf · Insights in convex optimization continue to be a driving

Clearly WN

is symmetric and positive semidefinite. Let(⇥

N

, QN

) be any minimizer of FQ

(WN

,⇥) subject to theconstraint that R

dt

< �I and Q = Q0 > 0. Our estimatoris then given by S(⇥

N

).

The following statement describes conditions under whichthis estimator converges:Theorem 7. Let (A0, B0) be stable and controllable. Givenan input u : Z+ 7! R and observations {(z(t)(1), z(t)(2))}1

t=1

and let WN

= 1N

PN

t=1 z(1)(t)z(1)(t)T (i.e. a noiselessempirical correlation). If lim sup

N

kWN

k2F

< 1 and thefollowing conditions hold:

lim infN

�min

1N

NX

t=1

x(i)(t)u(t)

� x(i)(t)u(t)

�0!� ", (28)

limN!1

1N

NX

t=1

w(1)(t)w(2)(t)0 = 0, (29)

limN!1

1N

NX

t=1

w(i)(t)z(j)(t)0 = 0, (30)

for (i, j) = (1, 2) and (i, j) = (2, 1) and some K, " > 0,then:

limN!1

S(⇥N

) = S(⇥0),

where, again, ⇥0 = (I,A0, B0, C0, D0).

A proof of this theorem is given in the appendix. Thefollowing theorem follows from this proof with minimalmodifications:Theorem 8. Let u(t), w(i)(t) be stochastic processes. Ifcondition (28) holds almost surely and the limits (29),(30)converge in probability then S(⇥

N

) converges in probabil-ity to S(⇥0).

The boundedness of WN

can be easily satisified and thecondition (28) is a standard persistence of excitation con-dition. The remaining conditions require a lack of corre-lation between noise sequences on separate experimentsand between the noise sequences and the noiseless data,and can be satisfied by assuming the input is boundedand independent of the noise, and that the noise satisfiescertain bounds on its moments and autocorrelations.

It is important to note that unlike other consistent estima-tors (such as equation error), models identified by the RIEbased on a finite number of samples are still guaranteedto be stable.

4.1 Illustrative Example

To illustrate the utility of the above estimation procedure,let us consider a simple second-order example, with DCgain of 1 and a resonant pole pair with natural frequency3 rad/s and damping factor 0.15. Given two highly noisydata records of this system responding to the same input,but with independent noise, there are three simple waysthey could be used: simply concatenated and treated asone long data record; the output could be averaged at eachtime to reduce noise; or the RIE could be applied with theabove mixed-correlation based modification.

Figure 4.1 shows results of fitting with these three methodsafter 200 and 2000 samples, respectively. It is clear that the

Fig. 1. Bode magnitude plots of several estimation strate-gies for a second-order system after 200 (upper) and2000 (lower) samples.

mixed-correlation approach is by far the best at recoveringthe resonant peak, whereas the other approaches seem togenerate models which are “too stable”.

5. CONCLUSION

The RIE is a new approach to nonlinear system iden-tification. It allows one to search over a convex set ofstable nonlinear models, for one which minimizes a convexupper bound of long-term simulation error. The resultingoptimization problem is a semidefinite program.

In order to convexify the model class and the optimizationobjective, a number of relaxations and approximationswere necessary. The objective of this paper was to shedlight on the degree of approximation introduced. In par-ticular, it is shown that the set of nonlinear models wesearch over in principle includes all quadratically stablesystems (subject to richness of parametrization). It wasalso shown that there exists at least some models whichare not quadratically contracting, but for which a non-quadratic contraction map can be found which satisfiesthe relaxed stability condition. Further results (positive ornegative) on the coverage of this particular model class

Page 50: Special Session on Convex Optimization for System ...user.it.uu.se/~kripe367/research3/Scribbles_files/SYSID2012_CVX.pdf · Insights in convex optimization continue to be a driving

– or alternative suggestions for convex parametrization ofnonlinear models – would be highly interesting.

Another challenging theoretical aspect is the e↵ect of mea-surement noise on the (nonlinear) estimation algorithm,and the resulting behaviour of the nonlinear system. In thispaper, we provided preliminary results in this direction, inthe form of a modified estimator which is provably con-sistent for linear systems, and is inspired by instrumentalvariables methods. Extending such analysis to nonlinearmodels is non-trivial, due to the interaction of the systemparametrization and the stability certificates, however thiswill be a focus of future work.

Appendix A. PROOF OF THEOREM 7

We first show that FQN (W

N

, QN

) converges to zero.By the corollary to Lemma 4 we know there existsan R = R0 > 0 such that Q

R

= R�1 and ⇥R

=(R,RA0, RB0, C0, D0) is a feasible point for each mini-mization which determines ⇥

N

, thus:F

QN (WN

, ⇥N

) FQR(W

N

,⇥R

).Conditions (29) and (30) guarantee that:

limN!1

WN

�WN

= 0.

Our boundedness assumption on WN

and this conver-gence ensures for all su�ciently large N we have W

N

and WN

belonging to some compact set. On this set�min(·) is uniformly continuous and as �min(W

N

) = 0,lim

N!1 WN

�WN

= 0. Similarly FQR(W, ⇥

R

) is a uni-formily continuous function of W on this set and thuslim

N!1 FQR(W

N

,⇥R

) � FQR(W

N

,⇥R

) = 0. We seeF

QR(WN

,⇥R

) = 0, as the equation errors (i.e. (ex

, ey

) in(23)) vanish. We thus conclude lim

N!1 FQN (W

N

, ⇥N

) =0.

The constraint that �E0QE < Rdt

< �I guaranteesE0QE > I. From this we see, for every z = (v, y, x, u) 2Rd:|z�E�1(Fx+Ku)|2+|y�(Gx+Lu)|2 | e

x

(z)|2Q

+| ey

(z)|2.Defining L : Rd⇥d ⇥ Rn+k⇥n+m 7! R by:

L(W,S) = tr(W [I �S]0 [I �S]).we see the above inequality ensures:

L(W,S(⇥)) FQ

(W, ⇥)for any W = W 0 � 0 and feasible (Q,⇥). This allows usto conclude

limN!1

L(WN

, S(⇥N

)) = limN!1

L(WN

, S(⇥R

)) = 0

from our previous inequalities.

Let WN22 be the lower n + m square sub-block of W

N

and WN22 be the equivalent sub-block of W

N

. Condi-tion (28) ensures for all su�ciently large N we haveW

N22 � "I. Then for su�ciently large N we also haveW

N22 � "

2I. This condition implies L(WN

, S) is stronglyconvex in S with parameter "

2 . As L(WN

, ·) � 0 and bothlim

N!1 L(WN

, S(⇥N

)) = 0 and limN!1 L(W

N

, S(⇥R

)) =0, strong convexity and the triangle inequality give uslim

N!1 S(⇥N

) = S(⇥R

) = S(⇥0).

This completes the proof of the theorem.

REFERENCES

B.N. Bond, Z. Mahmood, Yan Li, R. Sredojevic,A. Megretski, V. Stojanovic, Y. Avniel, and L. Daniel.Compact modeling of nonlinear analog circuits usingsystem identification via semidefinite programming andincremental stability certification. IEEE Transactionson Computer-Aided Design of Integrated Circuits andSystems, 29(8):1149 –1162, aug. 2010.

G.B. Giannakis and E. Serpedin. A bibliography onnonlinear system identification. Signal Processing, 81(3):533–580, 2001.

S. L. Lacy and D. S. Bernstein. Subspace identificationwith guaranteed stability using constrained optimiza-tion. IEEE Transactions on Automatic Control, 48(7):1259–1263, 2003.

L. Ljung. System Identification: Theory for the User.Prentice Hall, Englewood Cli↵s, New Jersey, USA, 3edition, 1999.

L. Ljung. Perspectives on system identification. AnnualReviews in Control, 34(1):1 – 12, 2010.

W. Lohmiller and J.J.E. Slotine. On contraction analysisfor non-linear systems. Automatica, 34(6):683–696, June1998.

I.R. Manchester, M.M. Tobenkin, and J. Wang. Identifi-cation of nonlinear systems with stable oscillations. In50th IEEE Conference on Decision and Control (CDC).IEEE, 2011.

A. Megretski. Convex optimization in robust identificationof nonlinear feedback. In Proceedings of the 47th IEEEConference on Decision and Control, pages 1370–1374,Cancun, Mexico, Dec 9-11 2008.

P. A. Parrilo. Structured Semidefinite Programs and Semi-algebraic Geometry Methods in Robustness and Opti-mization. PhD thesis, California Institute of Technology,May 18 2000.

T.B. Schon, A. Wills, and B. Ninness. System identifica-tion of nonlinear state-space models. Automatica, 47(1):39–49, 2010.

J. Sjoberg, Q. Zhang, L. Ljung, A. Benveniste, B. De-lyon, P.-Y. Glorennec, H. Hjalmarsson, and A. Juditsky.Nonlinear black-box modeling in system identification:a unified overview. Automatica, 31(12):1691–1724, 1995.

K. C. Sou, A. Megretski, and L. Daniel. Convex re-laxation approach to the identification of the wiener-hammerstein model. In Proc. 47th IEEE Conference onDecision and Control, pages 1375 –1382, dec. 2008.

J. A. K. Suykens and J. Vandewalle, editors. Nonlin-ear Modeling: advanced black-box techniques. SpringerNetherlands, 1998.

M.M. Tobenkin, I.R. Manchester, J. Wang, A. Megretski,and R. Tedrake. Convex optimization in identification ofstable non-linear state space models. In 49th IEEE Con-ference on Decision and Control (CDC), pages 7232–7237. IEEE, 2010.

Page 51: Special Session on Convex Optimization for System ...user.it.uu.se/~kripe367/research3/Scribbles_files/SYSID2012_CVX.pdf · Insights in convex optimization continue to be a driving

Primal-Dual Instrumental Variable

Estimators

K. Pelckmans ⇤

⇤ Syscon, Information Technology, Uppsala University, 75501, SE

Abstract: This paper gives a primal-dual derivation of the Least Squares Support VectorMachine (LS-SVM) using Instrumental Variables (IVs), denoted simply as the Primal-dualInstrumental Variable Estimator. Then we propose a convex optimization approach for learningthe optimal instruments. Besides the traditional argumentation for the use of IVs, the primal-dual derivation gives an interesting other advantage, namely that the complexity of the systemto be solved is expressed in the number of instruments, rather than in the number of samples astypically the case for SVM and LS-SVM formulations. This note explores some exciting issuesin the design and analysis of such estimator.

1. INTRODUCTION

The method of Least Square- Support Vector Machines(LS-SVMs) amounts to a class of nonparameteric, regu-larized estimators capable of dealing e�ciently with high-dimensional, nonlinear e↵ects as surveyed in Suykens et al.[2002]. The methodology builds upon the research on Sup-port Vector Machines (SVMs) for classification, integrat-ing ideas from functional analysis, convex optimizationand learning theory. The use of Mercer kernels is foundto generalize many well-known nonparametric approaches,including smoothing splines, RBF and regularization net-works, as well as parametric approaches, see Suykens et al.[2002] and citations. The primal-dual derivations are foundto be a versatile tool for deriving non-parametric estima-tors, and are found to adapt well to cases where the appli-cation at hand provides useful structural information, seee.g. Pelckmans [2005]. In the context of (nonlinear) systemidentification a particularly relevant structural constraintcomes in the form of a specific noise models, as explainedin Espinoza et al. [2004], Pelckmans [2005].

In literature two essentially di↵erent approaches to func-tion estimation in the context of colored noise are found:In (i) the first, one includes the model of the noise coloringin the estimation process. This approach is known asthe Prediction Error Framework (PEM) in the contextof system identification. The main drawback of such ap-proach is the computational demand. In (ii) an instrumen-tal variables approach, one tries to retain the complexityand convenience of ordinary least squares estimation. Thelatter copes with the coloring of the noise by introducingartificial instruments restricting the least squares estima-tor to model the coloring of the noise. Introductions toInstrumental Variable (IV) estimators can be found inLjung [1987], Soderstrom and Stoica [1989] in a contextof system identification of LTI systems, and Bowden andTurkington [1990] in a context of econometry.

In this paper we construct a kernel based learning machinebased on an instrumental variable estimator. Related ideas

?This work was supported in part by Swedish Research Council

under contract 621-2007-6364.

are articulated in Laurain et al. [2011], but here instead weuse the primal-dual framework to derive the estimators,and deviate resolutely from the bias-variance methodsessentially borrowed from the parametric context. Thedesign of suitable instruments relies on recent results onnonparametric instrumental variable methods, as founde.g. in Hall and Horowitz [2005]. However, it is not quiteclear what optimal IVs are in a context of non-parametricestimators. We give a resolutely di↵erent answer to thisquestion based on the observation that IVs try to razoraway stochastic e↵ects which di↵er in various realizationsof the data.

The contribution of this brief note is threefold. The nextsection derives the Instrumental Variable (IV) kernel ma-chine, as well as an extended version of such. Then thisis related to the kernel-based estimator designed for thecase of dealing with a known (stable) colored noise source.Section 3 then elaborates on how the IVs can be chosen ifone has access to di↵erent realizations of the data. Thenwe apply ideas to the case where one has only a singlerealization of the data. The problem of finding such opti-mal IVs can be relaxed as an Semi-Definite-Programmingproblem. The problem is essentially one of finding anoptimal projection matrix, a task quite relates quite nicelyto the objective of projecting away stochastic e↵ects whichcorrelate to the very flexible (kernel) basis. Section 4 givean outlook to the imminent question which are raised. Thispaper essentially solicits for crisp ideas in non-parametricestimation and nonlinear system identification in the con-text of colored noise.

2. THE DUAL OF THE INSTRUMENTAL VARIABLE (IV) ESTIMATOR

The following setup is adopted. Given N samples {(x_t, y_t)}_{t=1}^N ⊂ R^d × R. Let f : R^d → R be any function which can be written as f(x) = w^T φ(x) for all x ∈ R^d, where φ : R^d → R^{d_φ} denotes a mapping of the data to a higher dimensional (possibly infinite dimensional) feature space, and w ∈ R^{d_φ} are unknown linear parameters in this feature space. Consider the model

y_t = f(x_t) + e_t = w^T φ(x_t) + e_t,   ∀t = 1, . . . , N,   (1)


which explains the given samples up to the residuals {e_t}_t. Unlike the case in traditional regression analysis, we do not want to assume that the residuals are uncorrelated ('white'). In fact, we are interested in what happens if the residuals are allowed to have a nontrivial coloring.

Consider m suitable instruments (time-series) designed appropriately, denoted here for all k = 1, . . . , m as

z^k = (z^k_1, . . . , z^k_N)^T ∈ R^N.   (2)

Instrumental Variable (IV) techniques aim at estimating the parameters w (or the function f) such that the second moments of the residuals with the instruments are zeroed out, or

Σ_{t=1}^N z^k_t (f(x_t) − y_t) = z^{kT} (f_N − y_N) = 0,   ∀k = 1, . . . , m,   (3)

where f_N = (f(x_1), . . . , f(x_N))^T ∈ R^N and y_N = (y_1, . . . , y_N)^T are vectors. Therefore, the method is also referred to as the method of generalized moment matching in the statistical and econometric literature. The rationale goes that albeit the residuals might not be white (or minimal), they are to be uncorrelated with the covariates. That is, the estimate is to be independent of the realized stochastic behavior ('noise'). The choice of instruments depends on which statistical assumptions one might exploit in a certain case. A practical example is found when estimating parameters of a dynamic model (say a polynomial linear time-invariant model). Then, instruments might be chosen as filters of input signals, hereby exploiting the assumption of the residuals being uncorrelated with past inputs, see e.g. Ljung [1987], Soderstrom and Stoica [1989] and citations.

Fig. 1. Example function f(x) = sin(x) (full line), as well as observed samples y_t = f(x_t) + e_t (green dots) for x_t = 20t/10000 and t = 1, . . . , 1000, where {e_t}_t is a colored noise source with filter e_{t+1} = 1/(1 + 0.99 q^{-1}) e_t. This example illustrates that we need extra information (assumptions) to separate (i) the function one aims to reconstruct, and (ii) the noise coloring. IV estimators do help a method to separate stochastic elements (such as the colored noise) from deterministic components by use of appropriate instruments exploiting model assumptions.

2.1 Primal-Dual Derivation of the IV Estimator

We now implement this principle on a kernel based learning scheme, thereby building on the method of Least Squares Support Vector Machines (LS-SVMs). The primal objective might become

min_w  (1/2) w^T w   s.t.   Σ_{t=1}^N z^k_t (w^T φ(x_t) − y_t) = 0,   ∀k = 1, . . . , m.   (4)

Let Z = (z^1 z^2 . . . z^m) ∈ R^{N×m} be a matrix containing all m instruments. Associate with each instrument z^k and its moment condition (3) a single Lagrange multiplier α_k ∈ R. From the dual condition of optimality one has

w = Σ_{k=1}^m α_k Σ_{t=1}^N z^k_t φ(x_t).   (5)

Now, define K ∈ R^{N×N} ('the kernel matrix') as K_{t,s} = φ(x_s)^T φ(x_t). The dual problem is given as

min_{α ∈ R^m}  (1/2) α^T (Z^T K Z) α − α^T Z^T y_N.   (6)

If m < N and the matrix (Z^T K Z) is full rank, the optimal solution is unique and can be computed efficiently as the solution to the linear system

(Z^T K Z) α = Z^T y_N.   (7)

Let x_* ∈ R^d. The estimate can be evaluated in a new sample x_* ∈ R^d as

f(x_*) = Σ_{k=1}^m α_k Σ_{l=1}^N z^k_l K(x_*, x_l),   (8)

or in matrix notation as f(x_*) = K(x_*)^T Z α, where the vector K(x_*) ∈ R^N is defined as K(x_*) = (φ(x_*)^T φ(x_1), . . . , φ(x_*)^T φ(x_N))^T ∈ R^N. Note that at this point the traditional questions, including the choice of the kernel function K, come in, see e.g. Suykens et al. [2002], Pelckmans [2005].
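To make the computational pattern of (6)-(8) concrete, the following minimal sketch (Python/NumPy; the RBF kernel choice, the helper names and the user-supplied instruments Z are assumptions of the example, not part of the paper) solves the m × m linear system (7) and evaluates the estimate (8) on new inputs.

```python
import numpy as np

def rbf_kernel(X1, X2, bandwidth=1.0):
    """Gaussian RBF kernel matrix; the kernel choice is an assumption of this sketch."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def iv_kernel_fit(X, y, Z, bandwidth=1.0):
    """Dual IV kernel machine: solve (Z^T K Z) alpha = Z^T y as in (7).
    X: (N, d) inputs, y: (N,) outputs, Z: (N, m) matrix of instruments."""
    K = rbf_kernel(X, X, bandwidth)
    A = Z.T @ K @ Z                      # m x m system, cheap when m << N
    alpha = np.linalg.solve(A, Z.T @ y)
    return alpha

def iv_kernel_predict(Xstar, X, Z, alpha, bandwidth=1.0):
    """Evaluate f(x*) = K(x*, X) Z alpha as in (8)."""
    return rbf_kernel(Xstar, X, bandwidth) @ (Z @ alpha)
```

Only the m × m matrix Z^T K Z needs to be factorized, which is the computational advantage discussed in Section 2.3 below.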

2.2 Primal-Dual Derivation of the Extended IV Estimator

In a similar vein as extended IV methods - described e.g. in Soderstrom and Stoica [1989] - one may extend the basic dual IV estimator by introducing slack variables. Let γ > 0 be a trade-off parameter between the model complexity w^T w and how strictly the moment conditions z^{kT}(f_N − y_N) = 0 are enforced; then the primal objective can be written as

min_{w,e}  (1/2) w^T w + (γ/2) Σ_{k=1}^m e_k^2   s.t.   Σ_{t=1}^N z^k_t (w^T φ(x_t) − y_t) = e_k,   ∀k = 1, . . . , m.   (9)

In case γ → ∞, one recovers the estimator (7). In case γ = 0, a trivial solution (i.e. w = 0_d) is obtained. The choice of an optimal γ > 0 is a difficult problem of model selection, see e.g. Suykens et al. [2002], Pelckmans [2005]. The dual problem is given as

min_{α ∈ R^m}  (1/2) α^T (Z^T K Z + (1/γ) I_m) α − α^T Z^T y_N,   (10)


where I_m = diag(1, . . . , 1) ∈ R^{m×m}. The solution can be obtained by solving the linear system

(Z^T K Z + (1/γ) I_m) α = Z^T y_N.   (11)

In general, if Z = I_N, the method reduces to the standard LS-SVM without bias term. The resulting estimate can be evaluated in a new point x_* as

f_N(x_*) = K_N(x_*)^T Z α_N,   (12)

where α_N solves (11) and K_N(x_*) = (K(x_1, x_*), . . . , K(x_N, x_*))^T ∈ R^N.

2.3 Computational Complexity

The above derivation also implies a reduced computational complexity compared to the standard LS-SVMs. In particular, it is not needed to memorize and work with the full matrix K ∈ R^{N×N}; one may instead focus attention on the matrix (Z^T K Z) ∈ R^{m×m}, which is of considerably lower dimension when m ≪ N.

2.4 Kernel Machines for Colored Noise Sources

The above problem formulation relates explicitly to learning an LS-SVM from measurements which contain colored noise. Assume the noise is given as a filter h(q^{-1}) = Σ_{τ=0}^d h_τ q^{-τ}, where d is possibly infinite, and {h_τ} denote the filter coefficients. Furthermore, we assume that the filter is stable, and that its inverse exists and is unique (this assumption is used implicitly in the derivation of the dual, left here as an exercise). Then the primal objective can be written as

min_{w,e}  (1/2) w^T w + (γ/2) Σ_{t=1}^N e_t^2   s.t.   w^T φ(x_t) + Σ_{τ=0}^d h_τ e_{t−τ} = y_t,   ∀t = 1, . . . , N.   (13)

Note that the filter coefficients {h_τ} are assumed to be known here. The dual problem is given as

min_{α ∈ R^N}  (1/2) α^T (K + (1/γ) H H^T) α − α^T y_N,   (14)

where I_N = diag(1, . . . , 1) ∈ R^{N×N} and the Toeplitz matrix H ∈ R^{N×N} is made up of the filter coefficients, i.e. H_{ij} = h_{i−j} for 0 ≤ i − j ≤ d, and H_{ij} = 0 otherwise. Similarly H^{-1} is the Toeplitz matrix consisting of the coefficients of the inverse filter, assuming that the inverse exists (or that the filter is minimum phase). The solution can be obtained by solving the linear system

(H^{-T} K H^{-1} + (1/γ) I_N) ᾱ = H^{-T} y_N,   (15)

where we use the change of variables ᾱ = H α. The resulting estimate can be evaluated in a new point x_* as

f_N(x_*) = K_N(x_*)^T H^{-1} ᾱ_N,   (16)

where ᾱ_N solves (15) and K_N(x_*) is as before. From this derivation it becomes clear that the instrumental variable approach is essentially the same as the extended IV approach if the instruments are chosen appropriately. That is, the optimal instruments correspond with the inverse coloring of the noise realized in the sample. This duality between noise coloring and instrumental variables is well established in the context of linear identification, but comes as a surprise in the present context of regularized methods.
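As an illustration of (13)-(16), the following sketch (Python/NumPy; the helper names, the RBF kernel and the assumption of a finite-order, invertible filter with h_0 ≠ 0 are ours) builds the Toeplitz matrix H from known filter coefficients, solves the linear system (15), and evaluates the estimate (16).

```python
import numpy as np

def _rbf(X1, X2, bandwidth):
    """Gaussian RBF kernel matrix; the kernel choice is an assumption of this sketch."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def colored_noise_lssvm(X, y, h, gamma, bandwidth=1.0):
    """Kernel machine for a known colored noise source: build the Toeplitz filter
    matrix H with H_ij = h_{i-j} and solve
    (H^{-T} K H^{-1} + (1/gamma) I_N) alpha_bar = H^{-T} y, cf. (15).
    Requires a stable, invertible filter (in particular h[0] != 0)."""
    N = len(y)
    K = _rbf(X, X, bandwidth)
    H = sum(h_tau * np.eye(N, k=-tau) for tau, h_tau in enumerate(h))
    Hinv = np.linalg.inv(H)
    A = Hinv.T @ K @ Hinv + np.eye(N) / gamma
    alpha_bar = np.linalg.solve(A, Hinv.T @ y)
    return alpha_bar, Hinv

def colored_noise_predict(Xstar, X, alpha_bar, Hinv, bandwidth=1.0):
    """Evaluate f(x*) = K_N(x*)^T H^{-1} alpha_bar, cf. (16)."""
    return _rbf(Xstar, X, bandwidth) @ (Hinv @ alpha_bar)
```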

3. ON THE CHOICE OF THE INSTRUMENTS

The question now is how to choose the instruments Z in a realistic or optimal fashion. The classical approach used in linear IV approaches, based on minimizing the variance of the final estimate, is not directly applicable, as the nonlinear methods based on regularization are essentially biased. Minimizing the variance would result in an optimal estimate such that f_*(x) = 0 for all x. Indeed, the variance would certainly be minimal! Trying to minimize the variance while controlling the bias implies the need of a proper stochastic framework and the assumption of a reasonable prior, which contrasts with the nonparametric or distribution-free approach for which such methods are conceived. If one is willing to make such strong stochastic assumptions, one may as well resort to a parametric approach, e.g. relying on a finite set of basis functions. This argument indicates that there is a need for new principles for IV approaches in nonparametric settings. In this note we focus on the 'maximal invariance principle', stating that the IVs should be chosen such that the result of the estimation is similar in case of different realizations of the data. In order to avoid the trivial solution f_*(x) = 0, we try to find the invariant solution which is as informative as possible. It turns out that such instruments can be found by solving a convex optimization problem.

3.1 Different Realizations

Suppose that the time-series corresponding to {x_t}_t has different realizations of the output values {y_t}_t for i = 1, . . . , m. Let these realizations be stacked in the vectors Y^i = (y^i_1, . . . , y^i_N)^T ∈ R^N for all i = 1, . . . , m. The aforementioned derivation then gives the fitted values

Ŷ^i = K Z (Z^T K Z + (1/γ) I_m)^{-1} Z^T Y^i = K (Z Z^†) (K + (1/γ) I_N)^{-1} (Z Z^†)^T Y^i,   (17)

where Π = Z Z^† is a projection matrix with eigenvalues in {0, 1}. Here we used that Π acts as I_N up to the null space of Π. This also implies that this method is equivalent to

Ŷ^i = K (Z Z^†) (K + (1/γ) I_N)^{-1} (Z Z^†)^T Y^i.   (18)

Then the problem of learning an optimal matrix Π, such that the estimate is constant over the m different realizations, is phrased as follows. Let Π ∈ R^{N×N} be a symmetric positive semidefinite matrix with eigenvalues {λ_t(Π)}_{t=1}^N; then

Π = arg min_{λ_t(Π) ∈ {0,1}}  Σ_{i=1}^m ‖Y^i − R Π Y^i‖_2^2   s.t.   R Π Y^1 = · · · = R Π Y^m,   (19)


where R = K (K + (1/γ) I_N)^{-1} ∈ R^{N×N} can be computed in advance. The problem can also be phrased as follows, by introducing a vector Ȳ ∈ R^N:

(Π, Ȳ) = arg min_{λ_t(Π) ∈ {0,1}}  Σ_{i=1}^m ‖Y^i − Ȳ‖_2^2   s.t.   Ȳ = R Π Y^i,   ∀i = 1, . . . , m.   (20)

This combinatorial optimization problem can be relaxed to

(Π, Ȳ) = arg min_{λ_t(Π) ∈ [0,1]}  Σ_{i=1}^m ‖Y^i − Ȳ‖_2^2   s.t.   Ȳ = R Π Y^i,   ∀i = 1, . . . , m,   (21)

where the eigenvalues λ_t(Π) can now take any value in the interval [0, 1]. This problem can be solved efficiently as a Semi-Definite Program (SDP). This is so as both the constraint Π ⪰ 0 and λ_max(Π) ≤ 1 are convex, see e.g. Boyd and Vandenberghe [2004]. Note that in practice, in the optimum, many eigenvalues of Π will be set to zero ('structural sparseness'), implying that one could find Z ∈ R^{N×m} with m ≪ N.
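The relaxed problem (21) can be written down almost verbatim in a modeling language for convex optimization. The sketch below (Python with CVXPY; the function name is ours, and on real data the exact equality constraints may need to be softened into least-squares penalties) treats Π as a symmetric matrix variable with 0 ⪯ Π ⪯ I_N.

```python
import cvxpy as cp
import numpy as np

def learn_projection(Y_list, K, gamma):
    """SDP relaxation (21): learn a projection-like matrix Pi such that the
    smoothed estimates R Pi Y^i agree across the m realizations in Y_list."""
    N = K.shape[0]
    R = K @ np.linalg.inv(K + np.eye(N) / gamma)     # R = K (K + (1/gamma) I)^{-1}
    Pi = cp.Variable((N, N), symmetric=True)
    Ybar = cp.Variable(N)
    constraints = [Pi >> 0, np.eye(N) - Pi >> 0]     # eigenvalues of Pi in [0, 1]
    constraints += [Ybar == R @ (Pi @ Yi) for Yi in Y_list]
    objective = cp.Minimize(sum(cp.sum_squares(Yi - Ybar) for Yi in Y_list))
    cp.Problem(objective, constraints).solve(solver=cp.SCS)
    return Pi.value, Ybar.value
```

In the optimum many eigenvalues of Π are expected to be (numerically) zero, so a thin factor Z with Z Z^† ≈ Π can be read off from an eigendecomposition of the solution.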

3.2 A Single Realization

Consider the situation where we only have observations of a single realization Y_N ∈ R^N; can we still do something similar as before? It turns out that we can, by exploiting the stationarity of the noise filter. Let us divide the samples {y_t} into a set of subsequent batches of length n < N as Y^i = (y_i, . . . , y_{n+i−1})^T, and let S^i ∈ {0, 1}^{n×N} be a selection matrix such that S^i Y = Y^i for all i = 1, . . . , N − n + 1. Then we can phrase an optimal IV as a solution to

(Π_n, Ȳ) = arg min_{λ_t(Π_n) ∈ {0,1}}  Σ_{i=1}^m ‖Y^i − S^i Ȳ‖_2^2   s.t.   S^i Ȳ = R^i Π_n Y^i,   ∀i = 1, . . . , m,   (22)

where m = N − n + 1, and R^i = S^i K S^{iT} (S^i K S^{iT} + (1/γ) I_n)^{-1} for all i = 1, . . . , m. Note that in this case Π_n ∈ R^{n×n}. This problem basically aims to filter away the local colored noise, assuming that the basis of this coloring is translation invariant.
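For concreteness, a small helper (Python/NumPy; a sketch under the assumption that the full kernel matrix K over the N samples is available, with a function name of our choosing) that builds the batch selection matrices S^i and the corresponding smoother matrices R^i used in (22):

```python
import numpy as np

def batch_selectors_and_smoothers(K, n, gamma):
    """Build S^i in {0,1}^{n x N} with S^i Y = (y_i, ..., y_{n+i-1})^T and
    R^i = S^i K S^{iT} (S^i K S^{iT} + (1/gamma) I_n)^{-1}, i = 1, ..., N-n+1."""
    N = K.shape[0]
    selectors, smoothers = [], []
    for i in range(N - n + 1):
        S = np.zeros((n, N))
        S[np.arange(n), i + np.arange(n)] = 1.0   # pick out the i-th batch
        Ki = S @ K @ S.T                          # n x n kernel block
        R = Ki @ np.linalg.inv(Ki + np.eye(n) / gamma)
        selectors.append(S)
        smoothers.append(R)
    return selectors, smoothers
```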

4. DISCUSSION

This paper gives a primal-dual derivation of an instrumental variable approach for kernel machines. The main insight is that the computational complexity of the estimator is given in terms of the number of instruments, rather than in terms of the number of data samples. This observation opens up many unexplored opportunities for dealing with large sets of data. We related this approach to the problem of estimating in the presence of colored noise, where the coloring is assumed to be known. The resulting estimator suggests a close relation between the noise coloring scheme and the employed instruments, a relation well known in the context of parametric identification. This observation is however unexploited in a context of regularized and nonparametric estimators. Finally, we pronounced one approach for learning optimal instruments, based on principles following from dealing with data with multiple realizations. It is indicated how this can be solved as a convex problem, basically reducing to finding optimal projection matrices.

Results in this paper demand empirical validation. A reason this is not reported in this paper is that empirical validation in nonparametric settings also prompts fundamental questions on model selection. That is, if prediction of the output value of the next (in time) sample is the problem, it is beneficial to exploit the noise coloring. One intriguing open issue is to elaborate how Π reflects the causal structure of the instruments, or which structure to impose to enforce such a property. A related question is whether the optimal choice of Π gives us the noise coloring as well.

A more general open question is whether the relaxation of the {0, 1} eigenvalues to the interval [0, 1] is tight. In other words, is such an approach for learning a projection matrix efficient? Is it guaranteed to lead to a low-rank solution?

REFERENCES

R.J. Bowden and D.A. Turkington. Instrumental Variables, volume 8. Cambridge University Press, 1990.

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

M. Espinoza, J.A.K. Suykens, and B. De Moor. Partially linear models and least squares support vector machines. In Decision and Control, 2004. CDC. 43rd IEEE Conference on, volume 4, pages 3388-3393. IEEE, 2004.

P. Hall and J.L. Horowitz. Nonparametric methods for inference in the presence of instrumental variables. The Annals of Statistics, 33(6):2904-2929, 2005.

V. Laurain, W. Xing Zheng, and R. Toth. Introducing instrumental variables in the LS-SVM based identification framework. In Proceedings of the 50th IEEE Conference on Decision and Control (CDC 2011). IEEE, 2011.

L. Ljung. System Identification, Theory for the User. Prentice Hall, 1987.

K. Pelckmans. Primal-Dual Kernel Machines. PhD thesis, Faculty of Engineering, K.U.Leuven, Leuven, May 2005. 280 pp., TR. 05-95.

T. Soderstrom and P. Stoica. System Identification, 1989.

J.A.K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, and J. Vandewalle. Least Squares Support Vector Machines. World Scientific, Singapore, 2002.


Identification of Black-Box Wave Propagation Models Using Large-Scale Convex Optimization

Toon van Waterschoot ∗  Moritz Diehl ∗  Marc Moonen ∗  Geert Leus ∗∗

∗ Department of Electrical Engineering (ESAT-SCD), Katholieke Universiteit Leuven, 3001 Leuven, Belgium (e-mail: {tvanwate,mdiehl,moonen}@esat.kuleuven.be)
∗∗ Faculty of Electrical Engineering, Mathematics, and Computer Science, Delft University of Technology, 2628 CD Delft, The Netherlands (e-mail: [email protected])

Abstract: In this paper, we propose a novel approach to the identification of multiple-input multiple-output (MIMO) wave propagation models having a common-denominator pole-zero parametrization. We show how the traditional, purely data-based identification approach can be improved by incorporating a physical wave propagation model, in the form of a spatiotemporally discretized version of the wave equation. If the wave equation is discretized by means of the finite element method (FEM), a high-dimensional yet highly sparse linear set of equations is obtained that can be imposed at those frequencies where a high-resolution model estimate is desired. The proposed identification approach then consists in sequentially solving two large-scale convex optimization problems: a sparse approximation problem for estimating the point source positions required in the FEM, and an equality-constrained quadratic program (QP) for estimating the common-denominator pole-zero model parameters. A simulation example for the case of indoor acoustic wave propagation is provided to illustrate the benefits of the proposed approach.

Keywords: Multivariable System Identification; Hybrid and Distributed System Identification; Vibration and Modal Analysis

1. INTRODUCTION

We consider wave propagation in a three-dimensional (3-D) enclosure with partially reflective boundaries, as governed by the wave equation

∇² u(r, t) − (1/c²) ∂²u(r, t)/∂t² = s(r, t)   (1)

with appropriate boundary conditions in the spatiotemporal domain Ω × T. Here, r = [x, y, z] ∈ Ω and t ∈ T denote the spatial and temporal coordinates, respectively.

⋆ This research was carried out at the ESAT laboratory of KU Leuven, and was supported by the KU Leuven Research Council (CoE EF/05/006 "Optimization in Engineering" (OPTEC), PFV/10/002 (OPTEC), IOF-SCORES4CHEM, Concerted Research Action GOA-MaNet), the Belgian Federal Science Policy Office (IUAP P6/04 "Dynamical systems, control and optimization" (DYSCO) 2007-2011), the Research Foundation Flanders - FWO (Postdoctoral Fellowship T. van Waterschoot, Research Projects G0226.06, G0321.06, G.0302.07, G.0320.08, G.0558.08, G.0557.08, G.0588.09, G.0600.08, Research Communities G.0377.09 and WOG: ICCoS, ANMMM, MLDM), the IWT (Research Projects Eureka-Flite+, SBO LeCoPro, SBO Climaqs, SBO POM, O&O-Dsquare), the European Commission (Research Community ERNSI, Research Projects FP7-HD-MPC (INFSO-ICT-223854), COST-intelliCIS, FP7-EMBOCON (ICT-248940), FP7-SADCO (MC ITN-264735), ERC HIGHWIND (259 166)), the NWO-STW (VICI project 10382), AMINAL, ACCM, and IBBT. The scientific responsibility is assumed by its authors.

Further, s(r, t) represents the driving source function that initiates the wave propagation, u(r, t) represents the resulting wave field, and c is the wave propagation speed determined by the propagation mechanism and medium. If we consider the driving source function to be generated by M point sources at positions r_m, m = 1, . . . ,M, then the wave field can be expressed as a superposition of M contributions corresponding to the temporal convolution of the point source signals s_m(t) with the Green's function h(r, r_m, t),

u(r, t) = Σ_{m=1}^M ∫_{−∞}^{∞} s_m(τ) h(r, r_m, t − τ) dτ.   (2)

If we observe the wave field at a discrete number of positions r_j, j = 1, . . . , J, it follows from (2) that the wave propagation can be modeled as a multiple-input multiple-output (MIMO) linear time-invariant (LTI) system. If we assume the source signals s_m(t) to be bandlimited, then the Green's function can be sampled in time and the MIMO-LTI system can be represented by a discrete-time transfer function matrix

H(z) = Σ_{n=0}^{∞} H_n z^{−n}   (3)

     = Σ_{n=0}^{∞} [ h(r_1, r_1, n) · · · h(r_1, r_M, n)
                        ⋮         ⋱        ⋮
                     h(r_J, r_1, n) · · · h(r_J, r_M, n) ] z^{−n}   (4)


where n = t/T_s denotes the discrete time index, with T_s the sampling period. It is particularly relevant to represent the transfer function matrix by means of a pole-zero model with a common denominator, i.e.,

H(z) = ( Σ_{n=0}^{Q} B_n z^{−n} ) / ( Σ_{n=0}^{P} a_n z^{−n} )   (5)

with

B_n = [ b_n(r_1, r_1) · · · b_n(r_1, r_M)
           ⋮         ⋱        ⋮
        b_n(r_J, r_1) · · · b_n(r_J, r_M) ].   (6)

Indeed, it was shown in Gustafsson et al. (2000) that such a parametrization is related to the "assumed modes solution" of the wave equation, in which the source and wave field are expanded on the eigenfunction basis of the enclosure, see also Kuttruff (2009). Here, the common denominator is related to the resonant modes of the enclosure, which can be understood to be independent of the source and observer positions, see Gustafsson et al. (2000), Kuttruff (2009) and Haneda et al. (1994).

A parametric model of the wave propagation is particularly useful for the prediction, simulation, and deconvolution (i.e., source recovery) of the wave field, in case a fixed set of source and observer positions is considered. Even though these operations may also be performed through direct use of (a numerical approximation of) the wave equation in (1), the availability of a parametric model will typically result in significantly less computations. Because of the tight connection between the pole-zero model and the assumed modes solution of the wave equation, it may be appealing to use a grey-box model, by parametrizing the numerator and denominator in (5) explicitly as a function of the resonance frequencies and damping factors, see Gustafsson et al. (2000). However, this parametrization is highly nonlinear and so the parameter estimation requires the solution of a non-convex optimization problem. Therefore, we prefer to use a linear-in-the-parameters black-box pole-zero model.

Our aim is then to estimate the model parameters as accurately as possible, given a data set consisting of source and observed signal samples. A variety of estimation algorithms for identifying common-denominator pole-zero models has been proposed earlier in the literature, see, e.g., Gustafsson et al. (2000), Haneda et al. (1994), Rolain et al. (1998), Stoica and Jansson (2000), Verboven et al. (2004), and Hikichi and Miyoshi (2004). A common property of these algorithms is that they rely exclusively on the available data set. Instead, we propose an approach in which not only the data set, but also the structure of the underlying wave equation is exploited in the estimation of the pole-zero model parameters. By enforcing the model parameters to obey a linear relationship derived from a finite element approximation of the wave equation, we can include physical arguments in the black-box identification while avoiding the non-convexity issues encountered with a grey-box approach. This allows us to achieve a higher estimation accuracy as compared to the purely data-based algorithms in the literature, or to achieve a similar accuracy with a smaller data set. The latter property is particularly appealing if the estimation of the common-denominator coefficients is of primary interest. The inclusion of the wave equation structure in the black-box identification problem can then result in a reduction of the number of observation positions required to achieve a given accuracy, which may significantly reduce the cost of the identification experiment.

The paper is organized as follows. In Section 2 we formulate the problem statement and review the existing data-based approach to the identification of common-denominator pole-zero models. In Section 3, we show how the finite element method (FEM) can be used to derive a set of linear equations in the pole-zero model parameters, that are valid if the MIMO-LTI system is indeed governed by the wave equation in (1). This set of equations is then used in Section 4 to formulate a large-scale convex optimization problem that allows to identify the common-denominator pole-zero model by relying on both the data set and the wave equation structure. Finally, a simulation example is provided in Section 5.

2. PROBLEM STATEMENT & STATE OF THE ART

2.1 Problem Statement

The problem considered in this paper can be formulated as follows. We are given a data set consisting of N samples of the source signals and observed signals,

Z^N = {s_m(n), y_j(n)}_{n=1}^N,   m = 1, . . . ,M,   j = 1, . . . , J,   (7)

where the observed signals obey the measurement model

y_j(n) = u(r_j, n) + v_j(n),   j = 1, . . . , J.   (8)

Here, the noise-free observations u(r_j, n), j = 1, . . . , J, result from a spatiotemporal sampling of the wave field u(r, t), as generated by the wave equation (1), and v_j(n), j = 1, . . . , J, represents measurement noise. For the sake of simplicity, and without loss of generality, we will assume that the measurement noise signals v_j(n) are realizations of zero-mean and mutually uncorrelated white noise processes with equal variance σ_v². Our aim is then to obtain the best possible estimate of the parameter vector

θ = [ b_1^T  b_2^T  . . .  b_M^T  a^T ]^T   (9)

containing the coefficients of the common-denominator pole-zero model in (5), with

b_m = [b_0(r_1, r_m) . . . b_Q(r_1, r_m) . . . b_0(r_J, r_m) . . . b_Q(r_J, r_m)]^T   (10)

for m = 1, . . . ,M and

a = [a_0 . . . a_P]^T.   (11)

Note that the first coefficient in the denominator parameter vector is usually fixed to a_0 = 1. We include it here in the parameter vector for notational convenience.

2.2 State-of-the-Art Data-Based Identification Approach

Different algorithms for the estimation of the parameter vector θ using the data model (5)-(8) have been proposed, see Gustafsson et al. (2000), Haneda et al. (1994), Rolain et al. (1998), Stoica and Jansson (2000), Verboven et al. (2004), and Hikichi and Miyoshi (2004). In these algorithms, however, the knowledge that the noise-free observations u(r_j, n), j = 1, . . . , J, are samples of the wave field generated by (1) is not exploited, and hence the structure


Θ(ω) = [ (S*(ω) S^T(ω)) ⊗ (I_J ⊗ (z_Q(ω) z_Q^H(ω)))      −vec(Y(ω) S^H(ω)) ⊗ (z_Q(ω) z_P^H(ω))
         −vec(Y(ω) S^H(ω))^H ⊗ (z_P(ω) z_Q^H(ω))          Y^H(ω) Y(ω) z_P(ω) z_P^H(ω) ]   (12)

of the wave equation is not taken into account. In this paper, we will adopt the frequency domain identification algorithm proposed in Verboven et al. (2004) as the state-of-the-art algorithm. In the frequency domain, the data model corresponding to (5)-(8) can be written as

Y(ω) = (B(ω) / A(ω)) S(ω) + V(ω)   (13)

where ω = 2π f T_s denotes radial frequency,

S(ω) = [S_1(ω) . . . S_M(ω)]^T   (14)
V(ω) = [V_1(ω) . . . V_J(ω)]^T   (15)
Y(ω) = [Y_1(ω) . . . Y_J(ω)]^T   (16)

contain the N-point discrete Fourier transform (DFT) samples of the source signals, measurement noise, and observed signals, A(ω) represents the pole-zero model denominator frequency response, and

B(ω) = [ B_11(ω) · · · B_1M(ω)
            ⋮      ⋱      ⋮
          B_J1(ω) · · · B_JM(ω) ]   (17)

contains the pole-zero model numerator frequency responses for the different source-observer combinations.

By defining the equation error vector related to (13) as

E(ω, θ) = B(ω) S(ω) − A(ω) Y(ω)   (18)

a least squares (LS) criterion for the estimation of the parameter vector θ can be obtained as

min_θ  Σ_ω E^H(ω, θ) E(ω, θ)   (19)

where the summation is executed over the DFT frequencies ω = 0, 2π/N, . . . , 2π(N − 1)/N, and (·)^H denotes the Hermitian transposition operator. The LS criterion (19) can be rewritten as a quadratic program (QP),

min_θ  θ^T [ Σ_ω Θ(ω) ] θ   (20)
s. t.  a_0 = 1   (21)

with Θ(ω) defined in (12) above, where I_J represents the J × J identity matrix, (·)* denotes the complex conjugation operator, vec(·) is the matrix vectorization operator, ⊗ denotes the Kronecker product, and the complex sinusoidal vectors are defined as

z_Q(ω) = [1  e^{jω}  . . .  e^{jQω}]^T   (22)
z_P(ω) = [1  e^{jω}  . . .  e^{jPω}]^T.   (23)
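To make the data-based criterion (18)-(21) concrete, the following sketch (Python/NumPy; the coefficient stacking order and function name are our own, and it stacks the equation error (18) as a linear function of θ rather than assembling Θ(ω) via (12), which is equivalent for this criterion) estimates θ from DFT samples of the source and observed signals, handling the constraint a_0 = 1 by elimination.

```python
import numpy as np

def data_based_identification(S, Y, Q, P):
    """Data-based LS identification of the common-denominator pole-zero model:
    the equation error E_j(w) = sum_m B_jm(w) S_m(w) - A(w) Y_j(w) is stacked
    over all DFT frequencies as a linear function of theta = [b; a] and
    minimized subject to a_0 = 1.  S is M x N, Y is J x N (DFT samples)."""
    M, N = S.shape
    J, _ = Y.shape
    n_b = J * M * (Q + 1)                            # number of numerator coefficients
    rows = []
    for k in range(N):
        w = 2.0 * np.pi * k / N
        zQH = np.exp(-1j * w * np.arange(Q + 1))     # row giving B_jm(w) = zQH . b_jm
        zPH = np.exp(-1j * w * np.arange(P + 1))     # row giving A(w)    = zPH . a
        for j in range(J):
            row = np.zeros(n_b + P + 1, dtype=complex)
            for m in range(M):
                idx = (j * M + m) * (Q + 1)          # our stacking order: (j, m, n)
                row[idx:idx + Q + 1] = S[m, k] * zQH
            row[n_b:] = -Y[j, k] * zPH
            rows.append(row)
    A_full = np.asarray(rows)
    # enforce a_0 = 1 by moving the corresponding column to the right-hand side
    rhs = -A_full[:, n_b]
    A_red = np.delete(A_full, n_b, axis=1)
    A_ri = np.vstack([A_red.real, A_red.imag])       # real-valued LS problem
    rhs_ri = np.concatenate([rhs.real, rhs.imag])
    theta_red, *_ = np.linalg.lstsq(A_ri, rhs_ri, rcond=None)
    theta = np.insert(theta_red, n_b, 1.0)           # re-insert a_0 = 1
    return theta[:n_b], theta[n_b:]                  # numerator coeffs, denominator a
```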

3. FINITE ELEMENT METHOD FOR COMMON-DENOMINATOR POLE-ZERO MODELS

We will now derive a set of linear equations in the pole-zero model parameters, that are valid if the MIMO-LTI system is indeed governed by the wave equation in (1). We can eliminate the time variable and the partial time derivative from the wave equation by taking the discrete Fourier transform of (1) after temporal sampling, which results in the Helmholtz equation

∇² U(r, ω) + k² U(r, ω) = S(r, ω)   (24)

where k = ω/c represents the wave number. As mentioned earlier, we consider the source function to consist of M point source contributions, i.e.,

S(r, ω) = Σ_{m=1}^M S_m(ω) δ(r − r_m).   (25)

By substituting (25) in (24) and dividing both sides by S_m(ω), m = 1, . . . ,M we obtain a set of M equations

∇² H(r, r_1, ω) + k² H(r, r_1, ω) = Σ_{m=1}^M (S_m(ω) / S_1(ω)) δ(r − r_m)
   ⋮
∇² H(r, r_M, ω) + k² H(r, r_M, ω) = Σ_{m=1}^M (S_m(ω) / S_M(ω)) δ(r − r_m)   (26)

where the frequency-domain Green's function H(r, r_m, ω), m = 1, . . . ,M, corresponds to the frequency response of the discrete-time system defined in (4) for r = r_j, j = 1, . . . , J. We can hence substitute the common-denominator pole-zero model for H(r, r_m, ω) in (26), and bring the common denominator (which is independent of r) to the right-hand side, i.e.,

∇² B(r, r_m, ω) + k² B(r, r_m, ω) = A(ω) Σ_{l=1}^M (S_l(ω) / S_m(ω)) δ(r − r_l),   m = 1, . . . ,M,   (27)

where we have used a more compact notation to denote a set of M equations. Note that we have deliberately not restricted the observer position r in (27) to the discrete set of positions r_j defined earlier. Instead, we consider the numerator frequency response B(r, r_m, ω) to be a continuous function of r. This function can be approximated in a finite-dimensional subspace by discretizing the spatial domain Ω using a 3-D grid defined by the points r_k, k = 1, . . . ,K with K ≥ J (and typically K ≫ J), which includes the observer positions,

B(r, r_m, ω) ≈ Σ_{k=1}^K B(r_k, r_m, ω) φ_k(r).   (28)

Here, the subspace basis functions are chosen to be piecewise linear functions satisfying φ_i(r_k) = δ(i − k), i = 1, . . . ,K. In particular, the basis functions are defined on a 3-D triangulation of the spatial domain Ω, where the kth basis function is made up of linear (non-zero slope) segments along the line segments between the point r_k and all the points with which point r_k shares a tetrahedron edge, and zero-valued segments elsewhere. We can then rewrite the set of Helmholtz equations in (27) as a set of linear equations in B(r_k, r_m, ω) by making use of the FEM, see Brenner and Scott (2008). In a nutshell, the FEM consists in converting the partial differential equation (PDE) in (27) to its weak formulation, performing integration by parts to relax the differentiability requirements on the subspace basis functions, and enforcing the subspace approximation error induced by (28) to be orthogonal to this subspace. The set of M PDEs in (27) can then be


expanded to a set of MK linear equations, also known as the Galerkin equations,

(K − k² L) β_m(ω) = −A(ω) Σ_{l=1}^M (S_l(ω) / S_m(ω)) γ_l,   m = 1, . . . ,M.   (29)

Here, the K × K matrices K and L denote the FEM stiffness and mass matrices, defined as

[K]_{ij} = ∫_Ω ∇φ_j(r) · ∇φ_i(r) dr   (30)
[L]_{ij} = ∫_Ω φ_j(r) φ_i(r) dr   (31)

and the K × 1 vector β_m(ω) contains the spatial samples of the function B(r, r_m, ω) as defined by (28), i.e.,

β_m(ω) = [B(r_1, r_m, ω) . . . B(r_K, r_m, ω)]^T.   (32)

The K × 1 vectors γ_l, l = 1, . . . ,M on the right-hand side of the Galerkin system in (29) contain the barycentric coordinates of the point sources, obtained by projecting the spatial unit-impulse functions δ(r − r_l) onto the chosen subspace basis, i.e.,

[γ_l]_i = ∫_Ω δ(r − r_l) φ_i(r) dr.   (33)

Each vector γ_l has only 1, 2, 3, or 4 non-zero elements, depending on whether the lth point source is located in a vertex, on an edge, on a face, or in the interior of a tetrahedron of the FEM mesh. We can write (29) in a more compact notation by defining M × 1 vectors σ_m, m = 1, . . . ,M, containing the source spectrum ratios,

σ_m(ω) = [ S_1(ω)/S_m(ω)  . . .  S_M(ω)/S_m(ω) ]^T,   m = 1, . . . ,M   (34)

and the K × M matrix

Γ = [γ_1 . . . γ_M]   (35)

such that

(K − k² L) β_m(ω) = −A(ω) Γ σ_m(ω),   m = 1, . . . ,M.   (36)

Finally, we can write the Galerkin equations as a function of the model parameters of the common-denominator pole-zero model defined in (5) as follows. Define the K(Q+1) × 1 extended numerator parameter vector, for m = 1, . . . ,M,

b̄_m = [b_0(r_1, r_m) . . . b_Q(r_1, r_m) . . . b_0(r_K, r_m) . . . b_Q(r_K, r_m)]^T   (37)

and recall the (P + 1) × 1 denominator parameter vector definition in (11). Note that only the first J(Q + 1) coefficients of the extended numerator parameter vector (corresponding to the elements of the numerator parameter vector b_m defined in (10)) are of explicit interest, while the other coefficients have been introduced for constructing the FEM approximation of the continuous-space function B(r, r_m, ω). By using the above parameter vector definitions, and recalling the definitions of the complex sinusoidal vectors in (22)-(23), we can rewrite the Galerkin system in (36) as follows,

(K − k² L)(I_K ⊗ z_Q^H(ω)) b̄_m = −Γ σ_m(ω) z_P^H(ω) a,   m = 1, . . . ,M   (38)

or equivalently

Ψ(ω) θ̄ = 0,   (39)

where

Ψ(ω) = [ M(ω)   0    · · ·   0     Γ σ_1(ω) z_P^H(ω)
          0    M(ω)  · · ·   0     Γ σ_2(ω) z_P^H(ω)
          ⋮     ⋮     ⋱     ⋮            ⋮
          0     0    · · ·  M(ω)   Γ σ_M(ω) z_P^H(ω) ],
θ̄ = [ b̄_1^T  b̄_2^T  . . .  b̄_M^T  a^T ]^T.

Here, M(ω) = (K − k² L)(I_K ⊗ z_Q^H(ω)), and 0 represents a zero vector or matrix of appropriate dimensions.

A few remarks are in place here. First, the Galerkin system in (39) is always underdetermined. However, we can straightforwardly increase the number of equations by considering (39) for L different radial frequencies ω_l, l = 1, . . . , L, without increasing the dimension of the parameter vector. It suffices to choose L ≥ Q + 1 + (P + 1)/(MK) to obtain a square or overdetermined system of equations. Second, a well-known and attractive property of the FEM is that the stiffness and mass matrices K and L, as well as the point source positioning matrix Γ, are highly sparse and structured. Consequently, the system of equations in (39) can typically be solved with a linear complexity. Third, we should stress that the accuracy of the FEM approximation relies heavily on the quality of the mesh, which is why we cannot just set K = J and define the FEM mesh using only the observer positions r_j, j = 1, . . . , J. In particular, a sufficiently large number of mesh points is needed to achieve a good spatial resolution and near-uniformity of the tetrahedra defined in the triangulation.
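As a concrete illustration of the structure of (38)-(39), the sketch below (Python/NumPy; it assumes the FEM stiffness and mass matrices, the source positioning matrix Γ and the source spectrum ratios σ_m(ω) have already been computed, and the helper name is ours) assembles the block matrix Ψ(ω) for one radial frequency; an implementation exploiting the sparsity mentioned above would use sparse matrix types instead of dense arrays.

```python
import numpy as np

def assemble_galerkin_block(K_stiff, L_mass, Gamma, sigma, omega, c, Q, P):
    """Assemble Psi(omega) from (39) for one frequency.
    K_stiff, L_mass: K x K FEM matrices; Gamma: K x M source positioning matrix;
    sigma: M x M matrix whose m-th column holds sigma_m(omega)."""
    Kdim, M = Gamma.shape
    k = omega / c                                          # wave number
    zQH = np.exp(-1j * omega * np.arange(Q + 1))           # z_Q^H(omega)
    zPH = np.exp(-1j * omega * np.arange(P + 1))           # z_P^H(omega)
    Mw = (K_stiff - k**2 * L_mass) @ np.kron(np.eye(Kdim), zQH[None, :])
    block_rows = []
    for m in range(M):
        row = [np.zeros_like(Mw) for _ in range(M)]
        row[m] = Mw                                        # M(omega) on the diagonal
        row.append(np.outer(Gamma @ sigma[:, m], zPH))     # Gamma sigma_m z_P^H block
        block_rows.append(np.hstack(row))
    return np.vstack(block_rows)   # shape: M*K x (M*K*(Q+1) + P + 1)
```

The constraint Ψ(ω) θ̄ = 0 then reproduces (38) row-block by row-block: M(ω) b̄_m = −Γ σ_m(ω) z_P^H(ω) a for each source m.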

4. PROPOSED IDENTIFICATION APPROACH

The proposed identification approach is aimed at blending measured information in the data set with structural information obtained from the wave equation, and results from the integration of the Galerkin equations (39) in the QP (20)-(21). One way to achieve this integration is to apply the field estimation framework proposed in van Waterschoot and Leus (2011), where an optimization problem is defined in which a LS data-based objective function is minimized subject to the Galerkin equations. If we apply this framework to the problem considered here, we end up with a large-scale equality-constrained QP,

min_θ̄  θ̄^T C^T [ Σ_ω Θ(ω) ] C θ̄   (40)

s. t.  Ψ(ω_1) θ̄ = 0
         ⋮
       Ψ(ω_L) θ̄ = 0
       a_0 = 1   (41)

Here, the [J(Q+1)+P+1] × [K(Q+1)+P+1] selection matrix C is defined such that

C θ̄ = θ.   (42)

Compared to the state-of-the-art identification approach exemplified by the QP in (20)-(21), LMK additional equality constraints have been included in (41). These equality constraints allow one to impose structural information at a number of frequencies ω_1, . . . , ω_L, thus increasing the model accuracy at these particular frequencies. The number of frequencies L at which the Galerkin equations


are imposed in (41) should satisfy L ≤ Q + (P + 1)/(MK), as otherwise an infeasible QP may be obtained.
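For reference, a minimal sketch of the equality-constrained QP (40)-(41) in a convex modeling language (Python with CVXPY; it assumes the accumulated data matrix Σ_ω Θ(ω), the selection matrix C, and the Galerkin matrices Ψ(ω_l) have been precomputed, and it uses the fact that for a real parameter vector the quadratic form only involves the real part of the Hermitian matrix):

```python
import cvxpy as cp
import numpy as np

def hybrid_identification(Theta_sum, C, Psi_list, idx_a0):
    """Solve min theta_bar^T C^T Theta_sum C theta_bar
       s.t. Psi(w_l) theta_bar = 0 (l = 1..L) and a_0 = 1, as in (40)-(41)."""
    n = C.shape[1]
    theta_bar = cp.Variable(n)
    # for real theta, theta^T Theta theta = theta^T Re(Theta) theta; a small
    # ridge keeps the Cholesky factorization numerically well defined
    Tre = Theta_sum.real + 1e-9 * np.eye(Theta_sum.shape[0])
    Lfac = np.linalg.cholesky(Tre)
    objective = cp.Minimize(cp.sum_squares(Lfac.T @ (C @ theta_bar)))
    constraints = [theta_bar[idx_a0] == 1]                  # a_0 = 1
    for Psi in Psi_list:                                     # Galerkin equalities
        constraints += [Psi.real @ theta_bar == 0,
                        Psi.imag @ theta_bar == 0]
    cp.Problem(objective, constraints).solve()
    return theta_bar.value
```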

Up till now we have assumed that all quantities required in the computation of the matrices Ψ(ω) and C are available. More particularly, Ψ(ω) relies on the geometry of the FEM mesh (through K, L, and Γ), on the source spectrum ratios (through σ_m(ω)), and on the point source positions (through Γ), while C depends on the observer positions. In a typical identification experiment, the source spectrum ratios and the observer positions are indeed known, while the FEM mesh is known by construction. However, in many applications, the point source positions are unknown and so the point source positioning matrix Γ cannot be straightforwardly computed. Nevertheless, we will show that if a preliminary estimate â of the pole-zero model denominator parameter vector is available (e.g., by using the state-of-the-art data-based identification approach outlined in Section 2.2), the point source positioning matrix Γ can be estimated by exploiting its particular structure and sparsity. To this end, we first rewrite (36) as

(K − k² L) β_m(ω) = −( σ_m^T(ω) ⊗ A(ω) I_K ) vec(Γ),   m = 1, . . . ,M   (43)

or equivalently

Ψ̄(a, ω) η = 0,   (44)

where

Ψ̄(a, ω) = [ M(ω)   0    · · ·   0     σ_1^T(ω) ⊗ A(ω) I_K
             0    M(ω)  · · ·   0     σ_2^T(ω) ⊗ A(ω) I_K
             ⋮     ⋮     ⋱     ⋮            ⋮
             0     0    · · ·  M(ω)   σ_M^T(ω) ⊗ A(ω) I_K ],
η = [ b̄_1^T  b̄_2^T  . . .  b̄_M^T  vec(Γ)^T ]^T.

The data term in (40) can also be rewritten as a function of η, by partitioning the matrix ( Σ_ω Θ(ω) )^{1/2} C = [ Φ_L | Φ_R ] such that

( Σ_ω Θ(ω) )^{1/2} C θ̄ = Φ_L F η + Φ_R a,   with   F = [ I_{MK(Q+1)}  0 ].   (45)

Again, data information and structural information can be combined into a single convex optimization problem in which the point source positioning matrix Γ is estimated alongside the pole-zero model numerator coefficients, i.e.,

min_η  ‖Φ_L F η + Φ_R â‖_2^2 + ‖Ψ̄(â, ω) η‖_2^2 + λ ‖vec(Γ)‖_1   (46)

s. t.  (I_M ⊗ 1_{1×K}) vec(Γ) = 1_{M×1},   vec(Γ) ≥ 0   (47)

with 1 a vector of all ones. In this optimization problem, the sparsity of Γ is exploited by including an ℓ1-regularization term in (46), while the non-negativity and the property of the columns summing to one are enforced in the (in)equality constraints (47).
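A compact sketch of the sparse approximation problem (46)-(47) in a convex modeling language is given below (Python with CVXPY; the variable names, the real-valued stacking of the complex data matrices Φ_L, Φ_R, and the regularization weight λ are our own assumptions):

```python
import cvxpy as cp
import numpy as np

def estimate_source_positioning(PhiL, PhiR, a_hat, Psi_list, Kdim, M, Q, lam=1.0):
    """Sparse estimation of Gamma alongside the extended numerator coefficients,
    as in (46)-(47).  PhiL, PhiR are assumed real-valued (e.g. obtained by stacking
    real and imaginary parts); Psi_list holds the matrices Psi_bar(a_hat, w_l)."""
    n_b = M * Kdim * (Q + 1)
    b_bar = cp.Variable(n_b)                       # extended numerator coefficients
    g = cp.Variable(Kdim * M)                      # g = vec(Gamma), column-major
    eta = cp.hstack([b_bar, g])
    data_term = cp.sum_squares(PhiL @ b_bar + PhiR @ a_hat)
    galerkin_term = sum(cp.sum_squares(Psi.real @ eta) + cp.sum_squares(Psi.imag @ eta)
                        for Psi in Psi_list)
    objective = cp.Minimize(data_term + galerkin_term + lam * cp.norm1(g))
    G = cp.reshape(g, (Kdim, M), order="F")        # columns are barycentric coordinates
    constraints = [g >= 0, cp.sum(G, axis=0) == np.ones(M)]
    cp.Problem(objective, constraints).solve()
    return b_bar.value, np.reshape(g.value, (Kdim, M), order="F")
```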

5. SIMULATION RESULTS

We provide a simulation example, in which the proposed identification approach is compared to the state-of-the-art approach for the case of indoor acoustic wave propagation (c = 344 m/s). We consider a rectangular room of 8 × 6 × 4 m, with M = 3 sources and J = 5 sensors positioned as shown in Fig. 1. The Green's functions

Fig. 1. Simulation scenario: rectangular room (8 × 6 × 4 m) with M = 3 sources (blue +) and J = 5 sensors (red o); axes x, y, z in m.

Fig. 2. Frequency magnitude responses 20 log10 |H(r_j, r_m, ω)| (dB) versus ω (rad) of the Green's functions related to the different source-observer combinations.

related to the source and observer positions have been simulated using the assumed modes solution to the wave equation, see Gustafsson et al. (2000), truncated to a duration of 10 s, sampled at f_s = 100 Hz, and low-pass filtered to suppress the "cavity mode" at DC. The resulting frequency magnitude responses 20 log10 |H(r_j, r_m, ω)| for m = 1, . . . ,M, j = 1, . . . , J are plotted in Fig. 2. The common resonances can be clearly observed.

The data set was generated as follows: the M source signals were obtained by filtering M Gaussian white noise signals with M different all-pole filters (first-order low-pass, second-order band-pass, and first-order high-pass for m = 1, 2, 3, respectively). The observed signals were obtained by filtering the source signals with the simulated Green's functions and adding Gaussian white noise at a 0 dB signal-to-noise ratio (SNR). The FEM mesh was generated by performing a 3-D Delaunay triangulation on a set of 315 regularly spaced grid points separated by 1 m in each dimension. The resulting FEM mesh consists of 1152 elements, and is shown in Fig. 3.


Fig. 3. Visualization of the tetrahedral FEM mesh.

Fig. 4. Results with exact source positioning matrix: 20 log10 |A^{-1}(r_j, r_m, ω)| (dB) versus ω (rad) for DATA, HYBRID (L = 1), and HYBRID (L = 2).

We evaluate the capability of the data-based ("DATA") and proposed ("HYBRID") identification approaches to capture the resonant behavior of the wave propagation, by inspecting the pole-zero model inverse denominator frequency magnitude response 20 log10 |A^{-1}(r_j, r_m, ω)|. The pole-zero model orders are set to Q = P = 12. The proposed approach is evaluated with the Galerkin equality constraints imposed at L = 1 frequency and L = 2 frequencies. These frequencies are chosen to correspond to the 4th and 1st resonance frequency of the Green's functions, respectively, i.e., ω_1 = 2.7018 rad and ω_2 = 1.3509 rad.

Fig. 4 shows the results for the case when the point source positions are exactly known, and hence (40)-(41) can be directly solved. It is clearly observed that by imposing the Galerkin equality constraints at a certain frequency, the resonant behavior at that particular frequency is identified much more accurately compared to the case when only measurement information is used.

Finally, Fig. 5 shows the results for the case when the point source positions are unknown, and the point source positioning matrix Γ is estimated using the sparse approximation algorithm (46)-(47) prior to executing the hybrid identification algorithm (40)-(41). The resulting identification performance is seen to be comparable to the case when exact knowledge of the point source positions is assumed.

Fig. 5. Results with estimated source positioning matrix: 20 log10 |A^{-1}(r_j, r_m, ω)| (dB) versus ω (rad) for DATA, HYBRID (L = 1), and HYBRID (L = 2).

REFERENCES

Brenner, S.C. and Scott, L.R. (2008). The Mathematical Theory of Finite Element Methods. Springer, New York.

Gustafsson, T., Vance, J., Pota, H.R., Rao, B.D., and Trivedi, M.M. (2000). Estimation of acoustical room transfer functions. In Proc. 39th IEEE Conf. Decision Control (CDC '00), 5184-5189. Sydney, Australia.

Haneda, Y., Makino, S., and Kaneda, Y. (1994). Common acoustical pole and zero modeling of room transfer functions. IEEE Trans. Speech Audio Process., 2(2), 320-328.

Hikichi, T. and Miyoshi, M. (2004). Blind algorithm for calculating common poles based on linear prediction. In Proc. 2004 IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP '04), volume 4, 89-92. Montreal, Quebec, Canada.

Kuttruff, H. (2009). Room Acoustics. Spon Press, London.

Rolain, Y., Vandersteen, G., and Schoukens, J. (1998). Best conditioned common denominator transfer function matrix estimation in the frequency domain. In Proc. 37th IEEE Conf. Decision Control (CDC '98), 3938-3939. Tampa, Florida, USA.

Stoica, P. and Jansson, M. (2000). MIMO system identification: State-space and subspace approximations versus transfer function and instrumental variables. IEEE Trans. Signal Process., 48(11), 3087-3099.

van Waterschoot, T. and Leus, G. (2011). Static field estimation using a wireless sensor network based on the finite element method. In Proc. Int. Workshop Comput. Adv. Multi-Sensor Adaptive Process. (CAMSAP '11), 369-372. San Juan, PR, USA.

Verboven, P., Guillaume, P., Cauberghe, B., Vanlanduit, S., and Parloo, E. (2004). Modal parameter estimation from input-output Fourier data using frequency-domain maximum likelihood identification. J. Sound Vib., 276(3-5), 957-979.