
Learning Partially Observed Linear Dynamical Systems from Logarithmic Number of Samples

Salar Fattahi

University of Michigan

Abstract

In this work, we study the problem of learning partially observed linear dynamical systems from a single sample trajectory. A major practical challenge in the existing system identification methods is the undesirable dependency of their required sample size on the system dimension: roughly speaking, they presume and rely on sample sizes that scale linearly with the system dimension. Evidently, in the high-dimensional regime where the system dimension is large, it may be costly, if not impossible, to collect as many samples from the unknown system. In this paper, we remedy this undesirable dependency on the system dimension by introducing an ℓ1-regularized estimation method that can accurately estimate the Markov parameters of the system, provided that the number of samples scales logarithmically with the system dimension. Our result significantly improves the sample complexity of learning partially observed linear dynamical systems: it shows that the Markov parameters of the system can be learned in the high-dimensional setting, where the number of samples is significantly smaller than the system dimension. Traditionally, ℓ1-regularized estimators have been used to promote sparsity in the estimated parameters. By resorting to the notion of "weak sparsity", we show that, irrespective of the true sparsity of the system, a similar regularized estimator can be used to reduce the sample complexity of learning partially observed linear systems, provided that the true system is inherently stable.

1 Introduction

Most of today's real-world systems are characterized by being large-scale, complex, and safety-critical. For instance, the nation-wide power grid is comprised of millions of active devices that interact according to uncertain dynamics and complex laws of physics [1, 2, 3]. As another example, the contemporary transportation systems are moving towards a spatially distributed, autonomous, and intelligent infrastructure with thousands of heterogeneous and dynamic components [4, 5]. Other examples include aerospace systems [6], decentralized wireless networks [7], and multi-agent robot networks [8]. A common feature of these systems is that they are comprised of a massive network of interconnected subsystems with complex and uncertain dynamics.

The unknown structure of the dynamics on the one hand, and the emergence of machine learning and reinforcement learning (RL) as powerful tools for solving sequential decision making problems [9, 10, 11] on the other hand, strongly motivate the use of data-driven methods in the


operation of unknown safety-critical systems. However, the applications of machine learning techniques in safety-critical systems remain mostly limited due to several fundamental challenges. First, to alleviate the so-called "curse of dimensionality" in these systems, any practical learning and control method must be data-, time-, and memory-efficient. Second, rather than being treated as "black-box" models, these systems must be governed via models that are interpretable by practitioners, and are amenable to well-established robust/optimal control methods.

With the goal of addressing the aforementioned challenges, this paper studies the efficient learning of partially observed linear systems from a single trajectory of input-output measurements. Despite a mature body of literature on the statistical learning and control of linear dynamical systems, their practicality remains limited for large-scale and safety-critical systems. A key challenge lies in the required sample sizes of these methods and their dependency on the system dimensions: for a system with dimension n, the best existing system identification techniques require sample sizes on the order of O(n) to O(n^4) to provide certifiable guarantees on their performance [12, 13, 14, 15, 16]. Such dependency may inevitably lead to exceedingly long interactions with the safety-critical system, where it is extremely costly or even impossible to collect nearly as many samples without jeopardizing its safety; consider, for instance, sampling from a geographically distributed power grid with tens of millions of parameters.

Contributions: In this work, we show that the Markov parameters defining the input-output behavior of partially observed linear dynamical systems can be learned with logarithmic sample complexity, i.e., from a single sample trajectory whose length scales poly-logarithmically with the output dimension. Our result relies on the key assumption that the system is inherently stable, or alternatively, that it is equipped with an initial stabilizing controller. We show that the inherent stability of the system is analogous to the notion of weak sparsity in the corresponding Markov parameters. We then show that this "prior knowledge" on the weak sparsity of the Markov parameters can be systematically captured and exploited via an ℓ1-regularized estimation method. Our results imply that the Markov parameters of a partially observed linear system can be learned with certifiable bounds in high-dimensional settings, where the system dimension is significantly larger than the number of available samples, thereby paving the way towards the efficient learning of massive-scale safety-critical systems. Within the realm of statistics, ℓ1-regularized estimators have traditionally been used to promote (exact) sparsity in the unknown parameters. In this work, we show that a similar ℓ1-regularized method can be used to estimate the Markov parameters of the system, irrespective of the true sparsity of the unknown system.

Paper organization: In Section 2, we provide a literature review on different system identification techniques, and explain their connection to our work. The problem is formally defined in Section 3, and the main results are presented in Section 4. We provide an empirical study of our method on synthetically generated systems in Section 5, and end with conclusions and future directions in Section 6. To streamline the presentation, the proofs are deferred to the appendix.

Notation: Upper- and lower-case letters are used to denote matrices and vectors, respectively. For a matrix M ∈ R^{m×n}, the symbols M_{:j} and M_{j:} indicate the j-th column and row of M, respectively. Given a vector v and an index set S, the notation v_S refers to the subvector of v whose indices are restricted to the set S. For a vector v, ‖v‖_p corresponds to its ℓ_p-norm. For a matrix M, the notation ‖M‖_{p,q} is equivalent to \|[\|M_{1:}\|_p \ \|M_{2:}\|_p \ \cdots \ \|M_{m:}\|_p]\|_q. Moreover, ‖M‖_q refers


to the induced q-norm of the matrix M. The notation ‖M‖_F is used to denote the Frobenius norm, defined as ‖M‖_{2,2}. Furthermore, ρ(M) corresponds to the spectral radius of M. Given the sequences f(n) and g(n) indexed by n, the notation f(n) = O(g(n)) or f(n) ≲ g(n) implies that there exists a universal constant C < ∞, independent of n, that satisfies f(n) ≤ Cg(n). Moreover, f(n) = Õ(g(n)) is used to denote f(n) = O(g(n)), modulo logarithmic factors. Similarly, the notation f(n) ≍ g(n) implies that there exist constants C_1 > 0 and C_2 < ∞, independent of n, that satisfy C_1 g(n) ≤ f(n) ≤ C_2 g(n). Given two scalars a and b, the notation a ∨ b denotes their maximum. We use x ∼ N(µ, Σ) to show that x is a multivariate random variable drawn from a Gaussian distribution with mean µ and covariance Σ. For two random variables x and y, the notation x ∼ y implies that they have the same distribution. E[x] denotes the expected value of the random variable x. For an event X, the notation P(X) refers to its probability of occurrence. The scalar c denotes a universal constant throughout the paper.
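As a concrete illustration of the mixed norms defined above, the following snippet (a small illustrative sketch, not part of the original text) computes ‖M‖_{2,∞}, ‖M‖_{1,1}, and ‖M‖_F for an arbitrary toy matrix with NumPy.

```python
import numpy as np

M = np.array([[1.0, -2.0, 0.5],
              [0.0,  3.0, 1.0]])

# ||M||_{p,q}: apply the l_p norm to each row, then the l_q norm to the resulting vector.
row_l2 = np.linalg.norm(M, ord=2, axis=1)   # [||M_{1:}||_2, ||M_{2:}||_2]
row_l1 = np.linalg.norm(M, ord=1, axis=1)

norm_2_inf = np.max(row_l2)                 # ||M||_{2,inf}: largest row-wise l_2 norm
norm_1_1   = np.sum(row_l1)                 # ||M||_{1,1}: sum of absolute entries
norm_F     = np.linalg.norm(M, 'fro')       # ||M||_F = ||M||_{2,2}

print(norm_2_inf, norm_1_1, norm_F)
```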

2 Related Works

System identification: Estimating system models from input/output experiments has a well-developed theory dating back to the 1960s, particularly in the case of linear and time-invariant systems. Standard reference textbooks on the topic include [17, 18, 19, 20], all focusing on establishing asymptotic consistency of the proposed estimators. On the other hand, contemporary results in statistical learning as applied to system identification seek to characterize finite time and finite data rates. For fully observed systems, [21] shows that a simple least-squares estimator can correctly recover the system matrices from multiple trajectories whose length scales linearly with the system dimension. This result was later generalized to the single sample trajectory setting for stable [22], and unstable [23, 14, 24] systems, with sample complexities depending polynomially on the system dimension. These results were later extended to stable [12, 16, 25, 15], and unstable [26] partially observed systems, where it is shown that the system matrices (or their associated Markov parameters) can be learned with similar sample complexities.

Regularized estimation: To further reduce the sample complexity of system identification, a recent line of works has focused on learning dynamical systems with prior information. The works [27, 28, 29, 30] employ ℓ1- and ℓ1/ℓ∞-regularized estimators to learn fully observed sparse systems with sample complexities that scale polynomially in the number of nonzero entries in different rows and columns of the system matrices, but only logarithmically in the dimension of the system. However, these methods are not applicable to partially observed systems with hidden states. Another line of works [31, 32, 33] introduces a different regularization technique, where the nuclear norm of the Hankel matrix is minimized to improve the sample complexity of learning inherently low-order systems. In particular, [31] shows that for multiple-input-single-output (MISO) systems with order R ≪ n, the sample complexity of estimating both the Markov parameters and the Hankel matrix can be reduced to O(R^2).

Learning-based control: Complementary to the aforementioned results, a large body of works study adaptive [22, 34, 35, 36], robust [14, 37, 38], or distributed [39, 40] control of unknown linear systems. These works, culminated under the umbrella of model-based RL, indicate that if a learned model is to be integrated into a safety-critical control loop, then it is essential that the uncertainty associated with the learned model be explicitly quantified. This way, the learned model


and the uncertainty bounds can be integrated with a rich body of tools from robust and adaptive control to provide strong end-to-end guarantees on the system performance and stability.

3 Problem Statement

Consider the following linear time-invariant (LTI) dynamical system:

x_{t+1} = A x_t + B u_t + w_t \qquad (1)
y_t = C x_t + D u_t + v_t \qquad (2)

where x_t ∈ R^n, u_t ∈ R^p, and y_t ∈ R^m are the state, input, and output of the system at time t. Moreover, the vectors w_t ∈ R^n and v_t ∈ R^m are the process (or disturbance) and measurement noises, respectively. Throughout the paper, we assume that both w_t and v_t have element-wise independent sub-Gaussian distributions with parameters σ_w and σ_v, respectively. Moreover, without loss of generality, we assume that x_0 = 0.¹ The parameters A ∈ R^{n×n}, B ∈ R^{n×p}, C ∈ R^{m×n}, and D ∈ R^{m×p} are the unknown system matrices, to be estimated from a single input-output sample trajectory {(u_t, y_t)}_{t=0}^{N}. Much of the progress on system identification is devoted to learning different variants of fully observed systems, where C = I and v_t = 0. While being theoretically important, the practicality of these results is limited, since realistic dynamical systems are often not directly observable, or are corrupted with measurement noise.
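To make the data-generation process concrete, the following sketch simulates the system (1)-(2) and collects a single input-output trajectory with Gaussian inputs and Gaussian (hence sub-Gaussian) noise. It is an illustrative implementation, not code from the paper; the function name `simulate_lti` and the default noise levels are our assumptions.

```python
import numpy as np

def simulate_lti(A, B, C, D, N, T, sigma_u=1.0, sigma_w=0.1, sigma_v=0.1, rng=None):
    """Roll out x_{t+1} = A x_t + B u_t + w_t, y_t = C x_t + D u_t + v_t
    for t = 0, ..., T + N - 2 starting from x_0 = 0, and return the single
    input-output trajectory {(u_t, y_t)}."""
    rng = np.random.default_rng() if rng is None else rng
    n, p = B.shape
    m = C.shape[0]
    x = np.zeros(n)
    us, ys = [], []
    for _ in range(T + N - 1):
        u = sigma_u * rng.standard_normal(p)     # u_t ~ N(0, sigma_u^2 I)
        w = sigma_w * rng.standard_normal(n)     # process (disturbance) noise w_t
        v = sigma_v * rng.standard_normal(m)     # measurement noise v_t
        y = C @ x + D @ u + v
        x = A @ x + B @ u + w
        us.append(u)
        ys.append(y)
    return np.array(us), np.array(ys)
```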

On the other hand, the lack of "intermediate" states x_t in partially observed systems gives rise to a mapping from u_t to y_t that is highly nonlinear in terms of the system parameters:

y_t = D u_t + \underbrace{\sum_{\tau=1}^{T-1} C A^{\tau-1} B u_{t-\tau}}_{\text{effect of the last } T \text{ inputs}} + \underbrace{\sum_{\tau=1}^{T-1} C A^{\tau-1} w_{t-\tau} + v_t}_{\text{effect of noise}} + \underbrace{C A^{T-1} x_{t-T+1}}_{\text{effect of the state at time } t-T+1} \qquad (3)

where t ≥ T − 1. The first term in (3) captures the effect of the past T inputs on y_t, while the second term corresponds to the effect of the unknown disturbance and measurement noises on y_t. Finally, the third term captures the contribution of the unknown state x_{t−T+1} to y_t, whose effect diminishes exponentially fast with T, provided that A is stable. A closer look at the first term reveals that the relationship between y_t and {u_t, u_{t−1}, . . . , u_{t−T+1}} becomes linear in terms of the Markov matrix

G = \begin{bmatrix} D & G_0 & G_1 & \dots & G_{T-2} \end{bmatrix} = \begin{bmatrix} D & CB & CAB & \dots & CA^{T-2}B \end{bmatrix} \in \mathbb{R}^{m \times Tp}, \qquad (4)

whose components are commonly known as the Markov parameters of the system. One of the main goals of this paper is to obtain an accurate estimate of G given a single input-output trajectory. The Markov parameters can be used to directly estimate the outputs of the system from the past inputs. Moreover, as will be shown later, a good estimate of the Markov parameters can be translated into an accurate estimate of the Hankel matrix, which in turn can be used in model reduction and H∞ methods in control theory [42, 43]. Finally, given the estimated Markov matrix Ĝ, one can recover estimates of the system matrices. Note that it is only possible to extract the system

¹Our results can be readily extended to scenarios where x_0 is randomly drawn from a sub-Gaussian distribution.


parameters up to a nonsingular transformation: given any nonsingular matrix T, the system matrices (A, B, C, D) and (TAT^{-1}, TB, CT^{-1}, D) correspond to the same Markov matrix. Therefore, a common approach for recovering the system matrices is to first construct the associated Hankel matrix, and then extract a realization of the system parameters from the Hankel matrix, e.g., via the Ho-Kalman method [41, 18]. In fact, it has been recently shown in [12, 16] that the Ho-Kalman method can robustly obtain a balanced realization of the system matrices, provided that the estimated Markov matrix enjoys a small estimation error.
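The following sketch assembles the Markov matrix G in (4) from a given realization (A, B, C, D). It is illustrative only; the helper name `markov_matrix` is ours.

```python
import numpy as np

def markov_matrix(A, B, C, D, T):
    """Form G = [D, CB, CAB, ..., CA^{T-2}B] in R^{m x Tp}, as in (4)."""
    blocks = [D]
    CAk = C.copy()                    # holds C A^k
    for _ in range(T - 1):
        blocks.append(CAk @ B)
        CAk = CAk @ A
    return np.hstack(blocks)
```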

Proposition 1 (Oymak and Ozay [12], informal). Suppose that the true system is controllable and observable. Given an estimate Ĝ of G, the Ho-Kalman method outputs system matrices (Â, B̂, Ĉ, D̂) that satisfy

\|\hat{B} - UB\|_F \lesssim \sqrt{T}\,\|G - \hat{G}\|_F \qquad (5)
\|\hat{C} - CU^\top\|_F \lesssim \sqrt{T}\,\|G - \hat{G}\|_F \qquad (6)
\|\hat{A} - UAU^\top\|_F \lesssim T\,\|G\|_2\,\|G - \hat{G}\|_F \qquad (7)

for some unitary matrix U, provided that Ĝ is sufficiently close to G.
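For completeness, here is a simplified Ho-Kalman-style realization step, assuming the standard SVD-based construction (form the block-Hankel matrix, take a rank-`order` SVD, and read off balanced factors). The exact variant analyzed in [41, 18, 12] and referenced in Proposition 1 may differ in details such as how the Hankel matrix is split or truncated; all function and variable names below are ours.

```python
import numpy as np

def ho_kalman(G_hat, m, p, order):
    """Ho-Kalman-style realization from an estimated Markov matrix
    G_hat = [D, G_0, ..., G_{K-2}] (m x Kp). Returns (A, B, C, D) up to a
    similarity transformation, assuming the system order `order` is known."""
    K = G_hat.shape[1] // p
    D = G_hat[:, :p]
    # Markov parameters G_k = C A^k B, k = 0, ..., K-2
    Gk = [G_hat[:, (k + 1) * p:(k + 2) * p] for k in range(K - 1)]
    k1 = (K - 1) // 2          # number of block rows/columns of the Hankel matrices
    H  = np.block([[Gk[i + j]     for j in range(k1)] for i in range(k1)])
    Hp = np.block([[Gk[i + j + 1] for j in range(k1)] for i in range(k1)])  # shifted Hankel
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    Ur, sr, Vr = U[:, :order], s[:order], Vt[:order, :]
    O = Ur * np.sqrt(sr)                 # observability-like factor
    Q = np.sqrt(sr)[:, None] * Vr        # controllability-like factor
    C = O[:m, :]
    B = Q[:, :p]
    A = np.linalg.pinv(O) @ Hp @ np.linalg.pinv(Q)
    return A, B, C, D
```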

Therefore, without loss of generality, our focus will be devoted to obtaining accurate estimates of the Markov and Hankel matrices. To streamline the presentation, the concatenated input and process noise vectors are defined as:

\bar{u}_t = \begin{bmatrix} u_t^\top & u_{t-1}^\top & \dots & u_{t-T+1}^\top \end{bmatrix}^\top \in \mathbb{R}^{Tp}, \qquad (8)
\bar{w}_t = \begin{bmatrix} w_t^\top & w_{t-1}^\top & \dots & w_{t-T+1}^\top \end{bmatrix}^\top \in \mathbb{R}^{Tn}. \qquad (9)

Moreover, the following concatenated matrix will be used throughout the paper:

F = \begin{bmatrix} 0 & C & CA & \dots & CA^{T-2} \end{bmatrix} \in \mathbb{R}^{m \times Tn} \qquad (10)

Based on the above definitions, the input-output relation (3) can be written compactly as

y_t = G\bar{u}_t + F\bar{w}_t + e_t + v_t \qquad (11)

where e_t = CA^{T-1}x_{t-T+1}. To estimate the Markov matrix G, the work [12] proposes the following least-squares estimator:

\hat{G} = \arg\min_{X}\ \sum_{t=T-1}^{N+T-2} \|y_t - X\bar{u}_t\|_2^2 \qquad (12)
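A direct implementation of the least-squares estimator (12) is sketched below: the concatenated regressors \bar{u}_t of (8) are stacked into a matrix and Ĝ is obtained from an ordinary least-squares solve. The helper names `build_regression` and `least_squares_markov` are hypothetical; `simulate_lti` from the earlier sketch can supply the trajectory.

```python
import numpy as np

def build_regression(us, ys, T):
    """Stack \bar{u}_t = [u_t; u_{t-1}; ...; u_{t-T+1}] and y_t for
    t = T-1, ..., T+N-2 into U (N x Tp) and Y (N x m)."""
    total, p = us.shape
    N = total - T + 1
    U = np.zeros((N, T * p))
    for row, t in enumerate(range(T - 1, total)):
        # most recent input first, matching (8)
        U[row] = us[t::-1][:T].reshape(-1)
    Y = ys[T - 1:]
    return U, Y

def least_squares_markov(us, ys, T):
    """Least-squares estimate (12): G_hat minimizes sum_t ||y_t - X \bar{u}_t||_2^2."""
    U, Y = build_regression(us, ys, T)
    # Solve U G^T = Y in the least-squares sense; G_hat is m x Tp.
    Gt, *_ = np.linalg.lstsq(U, Y, rcond=None)
    return Gt.T
```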

Define q = p + n + m as the system dimension, and σ_e^2 as the effective variance of e_t, defined via

\sigma_e = \Phi(A)\,\|CA^{T-1}\|\sqrt{\frac{T\|\Gamma_\infty\|}{1-\rho(A)^{2T}}} \qquad (13)

where

\Phi(A) = \sup_{\tau\ge 0}\frac{\|A^\tau\|}{\rho(A)^\tau}, \qquad \Gamma_\infty = \sum_{i=0}^{\infty}\sigma_w^2 A^i(A^\top)^i + \sigma_u^2 A^i BB^\top(A^\top)^i \qquad (14)

The work [12] characterizes the non-asymptotic behavior of the least-squares estimate Ĝ.


Theorem 1 (Oymak and Ozay [12]). Suppose that u_t ∼ N(0, σ_u^2 I) for every t = 0, . . . , T + N − 2, and N ≳ Tq log²(Tq) log²(Nq). Then, with overwhelming probability, the following inequalities hold:

\|\hat{G} - G\|_2 \lesssim \frac{\sigma_v + \sigma_e + \sigma_w\|F\|_2}{\sigma_u}\sqrt{\frac{Tq\log^2(Tq)\log^2(Nq)}{N}}, \qquad (15)

\|\hat{G} - G\|_F \lesssim \frac{(\sigma_v + \sigma_e)\sqrt{m} + \sigma_w\|F\|_2}{\sigma_u}\sqrt{\frac{Tq\log^2(Tq)\log^2(Nq)}{N}} \qquad (16)

where σ_u^2, σ_v^2, σ_w^2 are the variances of the random input, the disturbance noise, and the measurement noise, respectively.

The above theorem shows that the spectral norm of the estimation error for the Markov parameters via the least-squares method is on the order of O(√(T(n+m+p)/N)), provided that N = O(T(n+m+p)). Moreover, [12] shows that the number of samples N can be reduced to O(Tp) (without improving the spectral norm error). Such dependency on the system dimension is unavoidable if one does not exploit any prior information on the structure of G: roughly speaking, the Markov matrix G has Tmp unknown parameters, and one needs to collect at least Tp outputs (each with size m) to obtain a well-defined least-squares estimator. Evidently, such dependency on the system dimension may be prohibitive for large-scale and safety-critical systems, where it is expensive to collect as many output samples. Motivated by this shortcoming of the existing methods, we aim to address the following open question:

Question: Can partially observed linear systems be learned with logarithmic sample complexity?

4 Main Results

In this section, we provide an affirmative answer to the aforementioned question. At a high level, we will use the fact that, due to the stability of A, the Markov parameters decay exponentially fast, which in turn implies that the rows of the extended matrix G exhibit a bounded ℓ1-norm (also known as weak sparsity [44]). This observation strongly motivates the use of the following regularized estimator:

\hat{G} = \arg\min_{X}\ \left(\frac{1}{2N}\sum_{t=T-1}^{N+T-2}\|y_t - X\bar{u}_t\|_2^2\right) + \lambda\|X\|_{1,1} \qquad (17)
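Since (17) decomposes across the rows of X, it can be solved as m independent LASSO problems. The sketch below uses scikit-learn's `Lasso`, whose objective (1/(2·n_samples))‖y − Xw‖²₂ + α‖w‖₁ matches each row of (17) with α = λ. The helper `build_regression` is the hypothetical function from the least-squares sketch, and the commented choice of λ only mimics the scaling suggested later in (19); both are our assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_markov(us, ys, T, lam):
    """l1-regularized estimate (17): each row of G_hat solves
    min_x (1/2N) ||Y_{:i} - U x||_2^2 + lam * ||x||_1."""
    U, Y = build_regression(us, ys, T)     # same regressors as the least-squares sketch
    m = Y.shape[1]
    G_hat = np.zeros((m, U.shape[1]))
    for i in range(m):
        model = Lasso(alpha=lam, fit_intercept=False, max_iter=10000)
        model.fit(U, Y[:, i])
        G_hat[i] = model.coef_
    return G_hat

# A regularization level in the spirit of (19), with noise levels assumed known or estimated:
# lam = sigma_u * (sigma_w_bar + sigma_v) * np.sqrt(np.log(T * p * n) / N) + eps
```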

Due to the stability of A, there exist scalars C_sys ≥ 1 and ρ < 1 such that ‖A^τ‖_1 ≤ C_sys ρ^τ. Without loss of generality and to simplify the notation, we assume that max{‖B‖_1, ‖C‖_1, ‖D‖_1} ≤ C_sys. Finally, define the effective variance of the disturbance noise as \bar{\sigma}_w = \left(\frac{C_{\mathrm{sys}}^2}{1-\rho}\right)\sigma_w. The main result of the paper is the following theorem:

Theorem 2. Suppose that u_t ∼ N(0, σ_u^2 I) for every t = 0, . . . , T + N − 2. Moreover, suppose that N and T satisfy the following inequalities:

N \gtrsim \log^2(Tp), \qquad T \gtrsim T_0 = \frac{\log\log(Nn + Tp) + \log\left(\frac{C_{\mathrm{sys}}}{1-\rho}\right) + \log(\bar{\sigma}_w + \sigma_v) + \log\left(\frac{1}{\varepsilon}\right)}{1-\rho} \qquad (18)

for an arbitrary ε > 0. Finally, assume that λ is chosen such that

\lambda \asymp \sigma_u(\bar{\sigma}_w + \sigma_v)\sqrt{\frac{\log(Tpn)}{N}} + \varepsilon \qquad (19)

Then, with overwhelming probability, the following inequalities hold:

\|G - \hat{G}\|_{2,\infty} \lesssim E_1 \vee E_2 \qquad (20)
\|G - \hat{G}\|_F \lesssim \sqrt{m}\,(E_1 \vee E_2) \qquad (21)

where

E_1 = \sqrt{\frac{C_{\mathrm{sys}}^3}{1-\rho}}\left(\sqrt{\frac{\bar{\sigma}_w + \sigma_v}{\sigma_u^3}}\left(\frac{\log(Tpn)}{N}\right)^{1/4} + \frac{\sqrt{\varepsilon}}{\sigma_u^2}\right), \qquad E_2 = \frac{C_{\mathrm{sys}}^3}{1-\rho}\left(\frac{\log(Tp)}{N}\right)^{1/4} \qquad (22)

The above theorem can be used to provide estimation error bounds on the higher-order Markov parameters and Hankel matrices (which can in turn be used to recover a realization of the system parameters {A, B, C, D}, as delineated in Proposition 1). Similar to [12], define the true and estimated K-th order (where K ≥ T) Markov parameters as

G^{(K)} = \begin{bmatrix} D & CB & CAB & \dots & CA^{K-2}B \end{bmatrix} \in \mathbb{R}^{m \times Kp}, \qquad (23)
\hat{G}^{(K)} = \begin{bmatrix} \hat{G} & 0_{m \times (K-T)p} \end{bmatrix} \in \mathbb{R}^{m \times Kp} \qquad (24)

Moreover, define the true and estimated K-th order Hankel matrices as

H^{(K)} = \begin{bmatrix} D & CB & \dots & CA^{K-2}B \\ CB & CAB & \dots & CA^{K-1}B \\ \vdots & & & \vdots \\ CA^{K-2}B & CA^{K-1}B & \dots & CA^{2K-3}B \end{bmatrix} \in \mathbb{R}^{Km \times Kp}, \qquad (25)

\hat{H}^{(K)} = \begin{bmatrix} \hat{D} & \hat{G}_0 & \dots & \hat{G}_{T-3} & \hat{G}_{T-2} & 0_{m\times p} & \dots & 0_{m\times p} \\ \hat{G}_0 & \hat{G}_1 & \dots & \hat{G}_{T-2} & 0_{m\times p} & 0_{m\times p} & \dots & 0_{m\times p} \\ \vdots & & & & & & & \vdots \\ \hat{G}_{T-2} & 0_{m\times p} & \dots & 0_{m\times p} & 0_{m\times p} & 0_{m\times p} & \dots & 0_{m\times p} \\ 0_{m\times p} & 0_{m\times p} & \dots & 0_{m\times p} & 0_{m\times p} & 0_{m\times p} & \dots & 0_{m\times p} \\ \vdots & & & & & & & \vdots \\ 0_{m\times p} & 0_{m\times p} & \dots & 0_{m\times p} & 0_{m\times p} & 0_{m\times p} & \dots & 0_{m\times p} \end{bmatrix} \in \mathbb{R}^{Km \times Kp} \qquad (26)
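A sketch of how the zero-padded estimates (24) and (26) can be assembled from an estimated Markov matrix Ĝ is given below; the helper names are ours, and the zero padding follows the m × p size of the Markov parameter blocks.

```python
import numpy as np

def extend_markov(G_hat, p, K, T):
    """G_hat^{(K)} = [G_hat, 0_{m x (K-T)p}], as in (24)."""
    m = G_hat.shape[0]
    return np.hstack([G_hat, np.zeros((m, (K - T) * p))])

def hankel_from_markov(G_hat, p, K, T):
    """H_hat^{(K)} in (26): K x K block-Hankel matrix whose (i, j) block is the
    (i + j)-th estimated Markov parameter, and zero beyond the estimated range."""
    m = G_hat.shape[0]
    # blocks[0] = D_hat, blocks[k] = G_hat_{k-1} for k = 1, ..., T-1
    blocks = [G_hat[:, k * p:(k + 1) * p] for k in range(T)]
    blocks += [np.zeros((m, p))] * (2 * K)     # zero-pad beyond the estimated range
    return np.block([[blocks[i + j] for j in range(K)] for i in range(K)])
```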

Our next corollary follows from Theorem 2.

Corollary 1. Suppose that u_t ∼ N(0, σ_u^2 I) for every t = 0, . . . , T + N − 2, and that N and λ satisfy (18) and (19), respectively. Moreover, assume that T ≳ T_0 ∨ (\log(\|C\|_\infty) + \log(1/\varepsilon))/(1-\rho) for an arbitrary ε > 0. Then, for any K ≥ T (including K = ∞), the following inequalities hold with overwhelming probability:

\|G^{(K)} - \hat{G}^{(K)}\|_{2,\infty} \lesssim E_1 \vee E_2 + \varepsilon, \qquad (27)
\|G^{(K)} - \hat{G}^{(K)}\|_F \lesssim \sqrt{m}\,(E_1 \vee E_2 + \varepsilon), \qquad (28)
\|H^{(K)} - \hat{H}^{(K)}\|_{2,\infty} \lesssim E_1 \vee E_2 + \varepsilon, \qquad (29)
\|H^{(K)} - \hat{H}^{(K)}\|_F \lesssim \sqrt{Tm}\,(E_1 \vee E_2 + \varepsilon) \qquad (30)

Next, we will explain the implications of Theorem 2 and Corollary 1.

Sample complexity: According to Theorem 2 and Corollary 1, the required number of samples N for estimating the Markov parameters and the Hankel matrix scales poly-logarithmically with the system dimension, making the proposed method particularly well-suited to massive-scale dynamical systems, where the system dimension surpasses the number of available input-output samples. In contrast, the existing methods for learning partially observed linear systems do not provide any guarantee on their estimation errors under such a "high-dimension/low-sampling" regime. Moreover, the imposed lower bound on T scales double-logarithmically with respect to the system dimension,² which can be treated as a constant for all practical purposes.³

Estimation error: The estimation error bounds in Theorem 2 and Corollary 1 are in terms of the row-wise ℓ2 and Frobenius norms. In contrast, most of the existing methods provide upper bounds on the spectral norm of the estimation error. An important benefit of the provided row-wise bound is that it gives a finer control over the element-wise estimation error, which in turn can be used in the recovery of special sparsity patterns in the Hankel matrices [45]. We note that although the provided bound on the Frobenius norm of the estimation error readily applies to the spectral norm, we believe that it can be strengthened. Moreover, the provided estimation error bound decays at the rate N^{-1/4}, which is slower than the rate N^{-1/2} of the simple least-squares estimator (see Theorem 1). However, a more careful scrutiny of (21) and (16) reveals that our proposed estimator outperforms least squares in the regime where

\frac{N}{T^2} \lesssim \frac{q^2\log^4(Tq)\log^4(Nq)}{\log(Tpn)} = \tilde{O}\left((n+m+p)^2\right) \qquad (31)

In fact, a stronger statement can be made on the ratio between the Frobenius norms of the estima-tion errors:

Corollary 2. Denote the right-hand sides of (16) and (21) as E_F^{LS} and E_F^{\ell_1}, respectively. Suppose that \sigma_u \vee \sigma_w \vee \sigma_v \vee \frac{C_{\mathrm{sys}}}{1-\rho} \vee \Phi(A) \vee \|B\|_2 \vee \|C\|_2 = O(1), and T \gtrsim T_0 + \log(n+m+p). Then, we have

\lim_{n,m,p\to\infty}\frac{E_F^{\ell_1}}{E_F^{LS}} = 0 \qquad (32)

provided that T and N satisfy

\lim_{n,m,p\to\infty}\frac{N\log(Tpn)}{T^2(n+m+p)^2} = 0 \qquad (33)

²The imposed lower bound on T is to simplify the derived bounds, and hence, can be relaxed at the expense of less intuitive estimation bounds.

³It is easy to verify that log log(s) ≤ 5 for any s ≤ 10^{50}.


| Method | Sample Complexity | Error Bound (‖·‖_F) | Additional Notes |
|---|---|---|---|
| Proposed method | O(log^2(Tp)) | O(√m (log(Tnp)/N)^{1/4}) | Single trajectory |
| Oymak and Ozay [12] | O(Tq) | O(√m (Tq/N)^{1/2}) | Single trajectory |
| Sarkar et al. [16] | O(n^2) | O(√m (pn^2/N)^{1/2}) | Single trajectory; suitable for systems with unknown order |
| Zheng and Li [26] | O(mT + q) | O(√m (T^3 q/N)^{1/2}) | Multiple trajectories; stable and unstable systems |
| Sun et al. [31] | O(pR) | O((Rnp/N)^{1/2}) | Multiple trajectories; MISO (m = 1) |
| Tu et al. [46] | O(r) | O((r/T)^{1/2}) | Multiple trajectories; SISO (p = m = 1) |

Table 1: Sample complexity and error bounds on the estimated Markov parameters for different methods. The parameters R ≤ n and r are respectively the order of the system and the length of the FIR impulse response; see [31] and [46] for more information. The error bounds are measured with respect to the Frobenius norm.

The above corollary implies that in the regime where N is not significantly larger than the system dimension, the derived upper bound on the estimation error of the regularized estimator becomes arbitrarily smaller than that of the simple least-squares method. Our numerical analysis in Section 5 also reveals the superior performance of the proposed estimator, even when N ≫ Tp. Finally, we point out that similar error bounds have been derived for linear regression problems with weakly sparse structures. In particular, the work [47] considers a "simpler" linear model where the samples/outputs are assumed to be independent, and shows that an ℓ1-regularized estimator achieves an error bound on the order of O((log(d)/N)^{1/4}), where d is the dimension of the unknown regression vector. Theorem 2 reveals that the same non-asymptotic rates can be achieved in the context of system identification with a single (and correlated) input-output trajectory. Table 1 compares the performance of the proposed estimator with other state-of-the-art methods.

Role of signal-to-noise ratio: Intuitively, the estimation error should improve with an increasing signal-to-noise ratio (SNR) (in our problem, the SNR is defined as σ_u^3/(σ_w + σ_v)); this behavior is also observed in the related works [12, 31, 46]. In contrast, our provided bound is the maximum of two terms, one of which is independent of the SNR. In other words, an increasing SNR can only shrink the estimation error down to a certain positive threshold. The reason behind this seemingly unintuitive behavior lies in the statistical behavior of the random input matrix U. For two different vectors ζ and ζ̂, the quantity U(ζ − ζ̂) measures how distinguishable these vectors are under the considered linear model. For the cases where N ≳ Tp, it is easy to see that these two vectors are easily distinguishable, since ‖U(ζ − ζ̂)‖_2^2 ≥ κσ_u^2‖ζ − ζ̂‖_2^2 holds with high probability for some strictly positive κ (see, e.g., [12, 16]). However, in the high-dimensional setting, where N ≪ Tp, the matrix U will inevitably have zero singular values, and hence, ‖U(ζ − ζ̂)‖_2^2 ≥ κσ_u^2‖ζ − ζ̂‖_2^2 does not hold for specific choices of ζ and ζ̂. Under such circumstances, we will show that the relaxed inequality ‖U(ζ − ζ̂)‖_2^2 ≥ κσ_u^2‖ζ − ζ̂‖_2^2 − σ_u^2 f(ζ − ζ̂) holds for any ζ and ζ̂, where f(·) is a function to be defined later. Upon replacing ζ − ζ̂ with G_{i:} − Ĝ_{i:} for an arbitrary row index i, it is easy to see that the derived lower bound becomes non-trivial only if ‖G_{i:} − Ĝ_{i:}‖_2^2 > f(G_{i:} − Ĝ_{i:})/κ, which is independent of the SNR. We will formalize this intuition later in the proof of Theorem 2. In particular, we will show that: (1) the threshold f(G_{i:} − Ĝ_{i:})/κ is small, i.e., it is upper bounded by E_2^2; and (2) whenever ‖G_{i:} − Ĝ_{i:}‖_2^2 is larger than E_2^2, it can be upper bounded by E_1^2.

[Figure 1 plots the estimation errors ‖G − Ĝ_LS‖ and ‖G − Ĝ_LASSO‖ (in the ‖·‖_2, ‖·‖_{2,∞}, and ‖·‖_F norms) against the sample size N.]

Figure 1: The estimation error of the Markov parameters for LASSO (denoted as Ĝ_LASSO) and LS (denoted as Ĝ_LS) with respect to the sample size, with σ_w^2 = σ_v^2 = 0.1 and varying T. When N < Tp, LASSO achieves a small estimation error, while LS is not well-defined. Moreover, LASSO significantly outperforms LS when N ≥ Tp. The y-axis in all figures is clipped to better illustrate the differences in the curves.

5 Simulations

In this section, we showcase the performance of the proposed regularized estimator. In particular, we provide an empirical comparison between our method and the least-squares approach of Oymak and Ozay [12].⁴ In all of our simulations, we set n = 200, m = p = 50, and D = 0. The system matrices are generated according to the following rules (a code sketch of this construction follows the list):

- A is chosen as a banded matrix, with bandwidth equal to 5. This implies that the rows and columns of A have at least 6 and at most 11 nonzero elements. Moreover, each nonzero entry of A is selected uniformly from [−0.5, 0.5]. To ensure the stability of the system, A is further normalized so that ρ(A) = 0.8. The special structure of A entails that ‖A‖_1 remains small.

- The (i, j)-th entry of B is set to 1 if i = 4j, and it is set to 0 otherwise, for every (i, j) ∈ {1, . . . , n} × {1, . . . , p}.

⁴It has been recently verified in [31] that the method proposed by Oymak and Ozay [12] outperforms that of Sarkar et al. [16]. Therefore, without loss of generality, we will focus on the former.


- C is chosen as a Gaussian matrix, with entries selected from N(0, 1/m).
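The following sketch generates system matrices according to the rules above. Where the description leaves room for interpretation (e.g., the one-based indexing of the rule i = 4j, or how the band of A is populated), the choices below are our assumptions.

```python
import numpy as np

def make_system(n=200, m=50, p=50, rho_target=0.8, bandwidth=5, rng=None):
    """Generate (A, B, C, D) following the simulation setup: banded A rescaled so
    that rho(A) = rho_target, B with a single 1 per column at row i = 4j, dense
    Gaussian C with N(0, 1/m) entries, and D = 0."""
    rng = np.random.default_rng() if rng is None else rng
    A = np.zeros((n, n))
    for i in range(n):
        lo, hi = max(0, i - bandwidth), min(n, i + bandwidth + 1)
        A[i, lo:hi] = rng.uniform(-0.5, 0.5, size=hi - lo)
    A *= rho_target / np.max(np.abs(np.linalg.eigvals(A)))   # enforce rho(A) = rho_target
    B = np.zeros((n, p))
    for j in range(p):
        if 4 * (j + 1) <= n:                                  # one-based rule i = 4j (assumed)
            B[4 * (j + 1) - 1, j] = 1.0
    C = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))        # entries ~ N(0, 1/m)
    D = np.zeros((m, p))
    return A, B, C, D
```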

Note that, despite the sparse nature of A and B, the Markov parameters of the system are fully dense, due to the dense nature of C. Throughout our simulations, σ_u is set to 1, and the values of σ_w and σ_v are varied to examine the effect of the SNR on the quality of our estimates. Moreover, in all of our simulations, we set the regularization parameter to

\lambda = 0.2\,(\sigma_w + \sigma_v)\sqrt{\frac{\log(Tpn)}{N}} + 0.02 \times 0.8^{T} \qquad (34)

Note that the above choice of the regularization parameter does not require any further fine-tuning, and it is in line with Theorem 2, after replacing ε with 0.02 × 0.8^T in (19). The exponential decay in ε correctly captures the diminishing effect of the unknown initial state x_{t−T+1} on the output y_t as T grows (see equation (3)). We point out that a better choice of λ may be possible via cross-validation [48]. Figure 1 shows the estimation error of the proposed method compared to the least-squares estimator (referred to as LASSO and LS, respectively) for σ_w^2 = σ_v^2 = 0.1 (averaged over 10 independent trials). It can be seen that LASSO significantly outperforms LS for all values of N and T. In the high-dimensional setting, where N < Tp, LS is not well-defined, while LASSO results in small estimation errors. Moreover, when N ≥ Tp, the incurred estimation error of LASSO is 1.2 to 1077 times smaller than that of LS. Although the main strength of LASSO is in the high-dimensional regime, it still outperforms LS when N ≫ Tp. Furthermore, Figure 2 shows the superior performance of LASSO compared to LS in the estimated Hankel matrices.
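For reference, a sketch of the end-to-end comparison loop is given below. It only mimics the experimental setup described above (it is not the authors' code), and it relies on the hypothetical helpers `make_system`, `simulate_lti`, `markov_matrix`, `least_squares_markov`, and `lasso_markov` from the earlier sketches.

```python
import numpy as np

n, m, p, T, N = 200, 50, 50, 10, 1500
sigma_u, sigma_w, sigma_v = 1.0, np.sqrt(0.1), np.sqrt(0.1)

A, B, C, D = make_system(n=n, m=m, p=p)
us, ys = simulate_lti(A, B, C, D, N, T, sigma_u, sigma_w, sigma_v)
G_true = markov_matrix(A, B, C, D, T)

lam = 0.2 * (sigma_w + sigma_v) * np.sqrt(np.log(T * p * n) / N) + 0.02 * 0.8 ** T   # as in (34)
G_lasso = lasso_markov(us, ys, T, lam)
err_lasso = np.linalg.norm(G_true - G_lasso, 'fro')

if N >= T * p:                                   # plain least squares needs N >= Tp
    G_ls = least_squares_markov(us, ys, T)
    err_ls = np.linalg.norm(G_true - G_ls, 'fro')
    print(f"LASSO: {err_lasso:.3f}   LS: {err_ls:.3f}")
else:
    print(f"LASSO: {err_lasso:.3f}   (LS not well-defined for N < Tp)")
```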

[Figure 2 plots the estimation errors ‖H^{(T/2)} − Ĥ^{(T/2)}‖_2 and ‖H^{(T/2)} − Ĥ^{(T/2)}‖_F against the sample size N.]

Figure 2: The estimation error of the Hankel matrices for LASSO and LS with respect to the sample size N, with σ_w^2 = σ_v^2 = 0.1 and varying T. Similar to Figure 1, LASSO outperforms LS for all values of N and T. The y-axis in all figures is clipped to better illustrate the differences in the curves.

Next, we fix N = 200, reduce the variance of the disturbance and measurement noises to σ_w^2 = σ_v^2 = 0.02, and report the estimation error for different values of T in Figure 3 (left). To explain the non-monotonic behavior of ‖G − Ĝ‖_F, first note that the incurred estimation error stems from two sources: (i) the measurement and disturbance noises; and (ii) the unknown initial state. For small values of T, the number of unknown parameters in G is small, and it can be well-estimated with a sufficiently large N. However, the effect of the unknown initial state is significant


due to the small "mixing time", thereby giving rise to a large estimation error. As T grows, the effect of the unknown initial state diminishes exponentially fast, while the size of G (and the number of unknown parameters) increases. Therefore, the estimation error has a non-monotonic dependency on T for any fixed N; such behavior is also reflected in Theorem 2, after setting ε = ρ^T. Finally, Figure 3 (right) depicts the estimation error of LASSO and LS for different noise variances. It can be seen that the estimation accuracy of LASSO is less sensitive to noise, i.e., it deteriorates at a slower rate with increasing noise levels.

�2w + �2

v<latexit sha1_base64="9MGiPsmKELiFuytlQ/F0KtJQvxI=">AAAB/nicbZDLSgMxFIYz9VbrbVRcuQkWQRDKzFjQZcGNywr2Au10yKSZNjTJDEmmUoaCr+LGhSJufQ53vo1pOwtt/SHw8Z9zOCd/mDCqtON8W4W19Y3NreJ2aWd3b//APjxqqjiVmDRwzGLZDpEijArS0FQz0k4kQTxkpBWObmf11phIRWPxoCcJ8TkaCBpRjLSxAvukq+iAo+Cx513mOO55gV12Ks5ccBXcHMogVz2wv7r9GKecCI0ZUqrjOon2MyQ1xYxMS91UkQThERqQjkGBOFF+Nj9/Cs+N04dRLM0TGs7d3xMZ4kpNeGg6OdJDtVybmf/VOqmObvyMiiTVRODFoihlUMdwlgXsU0mwZhMDCEtqboV4iCTC2iRWMiG4y19ehaZXca8q3n21XKvmcRTBKTgDF8AF16AG7kAdNAAGGXgGr+DNerJerHfrY9FasPKZY/BH1ucPsWCVQw==</latexit>

kG�b Gk 2

<latexit sha1_base64="5YAoXDh1KhTHKkTDRMIoQH7QrMM=">AAAB+3icbVDLSsNAFJ34rPUV69LNYBHcWJJa0GXBRV1WsA9oQphMbtuhkwczE7Wk/RU3LhRx64+482+ctllo64ELh3Pu5d57/IQzqSzr21hb39jc2i7sFHf39g8OzaNSW8apoNCiMY9F1ycSOIugpZji0E0EkNDn0PFHNzO/8wBCsji6V+ME3JAMItZnlCgteWbJmTQunEcWwJAo3HAmXtUzy1bFmgOvEjsnZZSj6ZlfThDTNIRIUU6k7NlWotyMCMUoh2nRSSUkhI7IAHqaRiQE6Wbz26f4TCsB7sdCV6TwXP09kZFQynHo686QqKFc9mbif14vVf1rN2NRkiqI6GJRP+VYxXgWBA6YAKr4WBNCBdO3YjokglCl4yrqEOzll1dJu1qxLyvVu1q5XsvjKKATdIrOkY2uUB3doiZqIYqe0DN6RW/G1Hgx3o2PReuakc8coz8wPn8A+FyTsw==</latexit>

T<latexit sha1_base64="5wVM2XAq81BdAJJ512XGi2DpnpQ=">AAAB6HicbVDLSgNBEOyNrxhfUY9eBoPgKezGgB4DXjwmkBckS5iddJIxs7PLzKwQlnyBFw+KePWTvPk3TpI9aGJBQ1HVTXdXEAuujet+O7mt7Z3dvfx+4eDw6PikeHrW1lGiGLZYJCLVDahGwSW2DDcCu7FCGgYCO8H0fuF3nlBpHsmmmcXoh3Qs+YgzaqzUaA6KJbfsLkE2iZeREmSoD4pf/WHEkhClYYJq3fPc2PgpVYYzgfNCP9EYUzalY+xZKmmI2k+Xh87JlVWGZBQpW9KQpfp7IqWh1rMwsJ0hNRO97i3E/7xeYkZ3fsplnBiUbLVolAhiIrL4mgy5QmbEzBLKFLe3EjahijJjsynYELz1lzdJu1L2bsqVRrVUq2Zx5OECLuEaPLiFGjxAHVrAAOEZXuHNeXRenHfnY9Wac7KZc/gD5/MHrMGMzg==</latexit>

kG�b Gk F

<latexit sha1_base64="kUBzG6BfRIqW4Z3vOwji2iip0Oc=">AAAB/XicbVDLSsNAFJ34rPUVHzs3g0VwY0lqQZcFwbqsYB/QhDCZTNqhk0mYmSg1Lf6KGxeKuPU/3Pk3TtsstPXAhcM593LvPX7CqFSW9W0sLa+srq0XNoqbW9s7u+befkvGqcCkiWMWi46PJGGUk6aiipFOIgiKfEba/uBq4rfviZA05ndqmBA3Qj1OQ4qR0pJnHjqj+pnzQAPSRwrWnZGXXY89s2SVrSngIrFzUgI5Gp755QQxTiPCFWZIyq5tJcrNkFAUMzIuOqkkCcID1CNdTTmKiHSz6fVjeKKVAIax0MUVnKq/JzIUSTmMfN0ZIdWX895E/M/rpiq8dDPKk1QRjmeLwpRBFcNJFDCggmDFhpogLKi+FeI+EggrHVhRh2DPv7xIWpWyfV6u3FZLtWoeRwEcgWNwCmxwAWrgBjRAE2DwCJ7BK3gznowX4934mLUuGfnMAfgD4/MH5oqU0w==</latexit>

Figure 3: (Left) The estimation error of the Markov parameters for LASSO with respect to T ,with N = 2000, and σ2

v = σ2u = 0.02. The estimation error has a non-monotonic behavior with

respect to T . (Right) The estimation error of the Markov parameters for different noise levels, withN = 2000, and T = 20. It can be seen that LASSO is less sensitive to the increasing noise levels.

6 Conclusions and Future Directions

In this paper, we propose a method for learning partially observed linear systems from a single sample trajectory in high-dimensional settings, i.e., when the number of samples is smaller than the system dimension. Most of the existing inference methods presume and rely on the availability of a prohibitively large number of samples collected from the unknown system. In this work, we address this issue by reducing the sample complexity of estimating the Markov parameters of partially observed systems via an ℓ1-regularized estimator. We show that, when the system is inherently stable, the required number of samples for a reliable estimation of the Markov parameters scales poly-logarithmically with the dimension of the system.

As a promising direction for future research, we will study the sparse recovery of the system matrices from the estimated Markov parameters. Indeed, most real-world systems consist of many subsystems with local interactions, thereby giving rise to sparse system matrices. However, it is easy to see that sparsity in the system parameters does not translate into sparsity in the Markov parameters. On the other hand, the classical system identification methods, such as the Ho-Kalman method, often extract a dense realization of the system parameters, and therefore, cannot incorporate prior information, such as sparsity. As a future direction, we aim to remedy this challenge in a principled manner. Given a sparse realization of the system matrices, our next goal is to design a robust distributed controller for the true system, taking into account the uncertainty in the estimated model.

Acknowledgments

We would like to thank Necmiye Ozay, Nikolai Matni, Yang Zheng, and Kamyar Azizzadenesheli for their insightful suggestions and constructive comments.

Appendix

A Proof of the Main Results

In this section, we present the proofs of Theorem 2 and Corollaries 1 and 2. For simplicity, define the error matrix ∆ = G − Ĝ, and the following concatenated matrices:

Y = \begin{bmatrix} y_{T-1} & y_T & \dots & y_{T+N-2} \end{bmatrix}^\top \in \mathbb{R}^{N \times m} \qquad (35)
U = \begin{bmatrix} \bar{u}_{T-1} & \bar{u}_T & \dots & \bar{u}_{T+N-2} \end{bmatrix}^\top \in \mathbb{R}^{N \times Tp} \qquad (36)
W = \begin{bmatrix} \bar{w}_{T-1} & \bar{w}_T & \dots & \bar{w}_{T+N-2} \end{bmatrix}^\top \in \mathbb{R}^{N \times Tn} \qquad (37)
E = \begin{bmatrix} e_{T-1} & e_T & \dots & e_{T+N-2} \end{bmatrix}^\top \in \mathbb{R}^{N \times m} \qquad (38)
V = \begin{bmatrix} v_{T-1} & v_T & \dots & v_{T+N-2} \end{bmatrix}^\top \in \mathbb{R}^{N \times m} \qquad (39)

With these definitions, one can re-write (3) as

Y = UG^\top + WF^\top + E + V \qquad (40)

Moreover, the ℓ1-regularized estimator (17) reduces to

\hat{G} = \arg\min_{X}\ \frac{1}{2N}\left\|Y - UX^\top\right\|_F^2 + \lambda\|X\|_{1,1} \qquad (41)

Note that (41) is decomposable over the rows of X. Therefore, one can write:

\hat{G}_{i:} = \arg\min_{X}\ \frac{1}{2N}\left\|Y_{:i} - U(X_{i:})^\top\right\|_2^2 + \lambda\|X_{i:}\|_1, \quad \text{for every } i = 1, 2, \dots, m \qquad (42)

At the core of our result is the following fundamental proposition, which deterministically bounds the row-wise error of Ĝ.

Proposition 2 (Deterministic Guarantee). Fix a row index i, and assume that the following conditions hold:

1. (ℓ1-boundedness) We have ‖G_{i:}‖_1 ≤ R for some R > 0.

2. (Restricted singular value) There exists a function f(·) such that

\frac{1}{N}\|U\Delta_{i:}\|_2^2 \ge \kappa\|\Delta_{i:}\|_2^2 - f(\Delta_{i:}) \qquad (43)

for some κ ≤ 1.

3. (Bound on λ) We have

\lambda \ge \frac{2}{N}\left(\|U^\top WF_{i:}^\top\|_\infty + \|U^\top E_{:i}\|_\infty + \|U^\top V_{:i}\|_\infty\right) \qquad (44)

Then, the following inequality holds:

\|\Delta_{i:}\|_2^2 \le \max\left\{\frac{2}{\kappa}f(\Delta_{i:}),\ \frac{88R}{\kappa^2}\lambda\right\} \qquad (45)

Proof. See Appendix B.1.

Before presenting the implications of this proposition, let us briefly explain the intuition behind the imposed assumptions. It can be easily seen that the first assumption holds for a choice of R that only depends on C_sys and ρ (see Proposition 3). On the other hand, the second assumption implies that the concatenated input matrix U has a nonzero singular value in the subspace spanned by the vector ∆_{i:}, which is offset by a "slack" term f(∆_{i:}). We consider two scenarios to explain the inclusion of the slack term. First, in the low-dimensional regime, where N ≳ Tp (modulo logarithmic factors), the standard concentration bounds on random circulant matrices [49, 16] entail that (43) holds for some uniform constant κ > 0 with the choice f(∆_{i:}) = 0. However, in the high-dimensional setting where N < Tp, one has to choose nonzero values for f(∆_{i:}), since the matrix U will have zero singular values. While the naive choice f(∆_{i:}) = κ‖∆_{i:}‖_2^2 is always feasible, one of the key contributions of this paper is to provide a sharper choice for f(∆_{i:}) that is particularly well-suited in the context of high-dimensional system identification. In particular, we will show that, for ∆_{i:} with bounded ℓ1-norm, the inequality (43) holds with high probability for the choices κ = σ_u^2/4 and f(∆_{i:}) ≲ σ_u^2 R^2 √(log(Tp)/N) (see Proposition 4). Finally, the third assumption provides a lower bound on the regularization coefficient, which will be shown to hold with overwhelming probability when λ is chosen as (σ_u σ_w + σ_u σ_v)√(log(Tpn)/N) (see Proposition 5).

Proposition 3 (ℓ1-boundedness). The following inequality holds for every i = 1, . . . , m:

\|G_{i:}\|_1 \le \frac{2C_{\mathrm{sys}}^3}{1-\rho} \qquad (46)


Proof. One can write

\|G_{i:}\|_1 = \|D_{i:}\|_1 + \sum_{\tau=0}^{T-2}\|(G_\tau)_{i:}\|_1 \le C_{\mathrm{sys}} + \sum_{\tau=0}^{T-2}\|C_{i:}A^\tau B\|_1
\le C_{\mathrm{sys}} + \sum_{\tau=0}^{\infty}\|C\|_1\|A^\tau\|_1\|B\|_1
\le C_{\mathrm{sys}} + \sum_{\tau=0}^{\infty} C_{\mathrm{sys}}^3\rho^\tau
\le \frac{2C_{\mathrm{sys}}^3}{1-\rho}

which completes the proof.

Our next goal is to construct a sharp expression for f(∆_{i:}). As mentioned before, the matrix U will have zero singular values when N < Tp. Therefore, the standard techniques for showing the concentration of the singular values of circulant matrices around a strictly positive number cannot be applied. To circumvent this challenge, we prove the following key lemma, which plays a pivotal role in our subsequent analysis.

Lemma 1. Suppose that u_t is a zero-mean Gaussian vector with covariance σ_u^2 I for every t = 1, . . . , T + N − 2. Moreover, assume that N ≥ 4η log^2(Tp) for an arbitrary η > 0. Then, we have

\frac{1}{N}\|U\theta\|_2^2 \ge \frac{\sigma_u^2}{2}\|\theta\|_2^2 - \sigma_u^2\sqrt{\frac{\eta\log(Tp)}{N}}\,\|\theta\|_1^2 \qquad (47)

for every θ ∈ R^{Tp}, with probability of at least 1 − (Tp)^{−cη}.

Proof. See Appendix B.2.
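As a quick numerical illustration of the lower bound (47) (not a substitute for the proof), the following sketch builds the concatenated input matrix U from a single Gaussian input trajectory and evaluates both sides of (47) for a randomly generated, approximately sparse direction θ. The helper `build_regression` is the hypothetical function from the least-squares sketch in Section 3, and the choice of η is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
p, T, N, sigma_u, eta = 5, 10, 400, 1.0, 1.0      # N >= 4 * eta * log^2(Tp) holds here
total = T + N - 1
us = sigma_u * rng.standard_normal((total, p))    # u_t ~ N(0, sigma_u^2 I)
ys = np.zeros((total, 1))                         # outputs are irrelevant for this check
U, _ = build_regression(us, ys, T)                # N x Tp matrix of stacked \bar{u}_t

theta = rng.standard_normal(T * p)
theta[rng.random(T * p) < 0.8] = 0.0              # a (weakly) sparse test direction

lhs = np.linalg.norm(U @ theta) ** 2 / N
rhs = (sigma_u ** 2 / 2) * np.linalg.norm(theta) ** 2 \
      - sigma_u ** 2 * np.sqrt(eta * np.log(T * p) / N) * np.linalg.norm(theta, 1) ** 2
print(lhs, ">=", rhs)
```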

Equipped with this lemma, we are now ready to present the appropriate choices of κ and f inProposition 2.

Proposition 4 (Restricted singular value). Assume that N ≥ 4η log^2(Tp) for an arbitrary η > 0. Then, for any fixed row index i, the following inequality holds:

\frac{1}{N}\|U\Delta_{i:}\|_2^2 \ge \frac{\sigma_u^2}{4}\|\Delta_{i:}\|_2^2 - 128\,\sigma_u^2\left(\frac{C_{\mathrm{sys}}^3}{1-\rho}\right)^2\sqrt{\frac{\eta\log(Tp)}{N}} \qquad (48)

with probability of at least 1 − (Tp)^{−cη}.

Proof. See Appendix B.3.

According to the above proposition, it is possible to choose f(∆_{i:}) such that it diminishes at the rate of O(√(log(Tp)/N)) while κ remains constant.

Finally, we will provide a lower bound on λ in terms of the system parameters, T , p, and N , toensure that the third assumption of Proposition 2 holds with high probability.


Proposition 5 (Bound on λ). Suppose that T satisfies:

T \ge \frac{\log\log(Np + Tp + Nn) + 4\log\left(\frac{C_{\mathrm{sys}}}{1-\rho}\right) + 4\log(\sigma_w + \sigma_u) + 2\log(2)}{1-\rho} + 2 \qquad (49)

Then, for an arbitrary η > 0, the following inequality holds:

\frac{2}{N}\left(\|U^\top WF_{i:}^\top\|_\infty + \|U^\top E_{:i}\|_\infty + \|U^\top V_{:i}\|_\infty\right) \le 4\sqrt{2}\,\sigma_u\sigma_w\left(\frac{C_{\mathrm{sys}}^2}{1-\rho}\right)\sqrt{(1+\eta)\frac{\log(Tpn)}{N}} + 4\sigma_u\sigma_v\sqrt{(1+\eta)\frac{\log(Tp)}{N}} + 2\rho^{T/2}(1+\eta) \qquad (50)

with probability of at least 1 − 2(Nn)^{−η} − 2(Np + Tp)^{−η} − 2(Tp)^{−η}.

Proof. See Appendix B.4.

Proof of Theorem 2. We provide the proof in four steps:

1. According to Proposition 3, R = 2C_{\mathrm{sys}}^3/(1-\rho) is a valid choice to satisfy the first assumption of Proposition 2.

2. Proposition 4 implies that the second assumption of Proposition 2 holds with high probability, with κ = σ_u^2/4 and f(\Delta_{i:}) \asymp \sigma_u^2\left(\frac{C_{\mathrm{sys}}^3}{1-\rho}\right)^2\sqrt{\frac{\log(Tp)}{N}}.

3. Proposition 5 shows that the third assumption of Proposition 2 holds with high probability with the choice of

\lambda \asymp (\sigma_u\bar{\sigma}_w + \sigma_u\sigma_v)\sqrt{\frac{\log(Tpn)}{N}} + \varepsilon \qquad (51)

for an arbitrary ε > 0, provided that

T \gtrsim \frac{\log\log(Np + Tp + Nn) + \log\left(\frac{C_{\mathrm{sys}}}{1-\rho}\right) + \log(\sigma_u + \sigma_w) + \log\left(\frac{1}{\varepsilon}\right)}{1-\rho} \qquad (52)

4. Finally, it is easy to verify that the following inequalities hold for every row index i:

\frac{2}{\kappa}f(\Delta_{i:}) \lesssim \left(\frac{C_{\mathrm{sys}}^3}{1-\rho}\right)^2\sqrt{\frac{\log(Tp)}{N}}, \qquad (53)

\frac{88R}{\kappa^2}\lambda \lesssim \left(\frac{C_{\mathrm{sys}}^3}{1-\rho}\right)\left(\left(\frac{\bar{\sigma}_w+\sigma_v}{\sigma_u^3}\right)\sqrt{\frac{\log(Tpn)}{N}} + \frac{\varepsilon}{\sigma_u^4}\right) \qquad (54)

These inequalities, combined with (45) and a simple union bound over the rows of Ĝ, prove the validity of (20). Moreover, (21) follows from \|G - \hat{G}\|_F \le \sqrt{m}\,\|G - \hat{G}\|_{2,\infty}. □

Next, we will present the proof of Corollary 1.


Proof of Corollary 1. It is easy to see that

\|\hat{G}^{(K)} - G^{(K)}\|_{2,\infty} \le \|\hat{G} - G\|_{2,\infty} + \sum_{\tau=T-1}^{K-2}\|CA^\tau B\|_{2,\infty} \qquad (55)

On the other hand, a simple application of Hölder's inequality leads to

\sum_{\tau=T-1}^{K-2}\|CA^\tau B\|_{2,\infty} \le \sqrt{\|C\|_\infty}\,C_{\mathrm{sys}}\sum_{\tau=T-1}^{K-2}\rho^{\tau/2} \le \sqrt{\|C\|_\infty}\left(\frac{C_{\mathrm{sys}}}{1-\rho}\right)\rho^{\frac{T-1}{2}} \qquad (56)

The above expression is upper bounded by ε, provided that

T \ge \frac{\log(\|C\|_\infty) + 2\log\left(\frac{C_{\mathrm{sys}}}{1-\rho}\right) + 2\log\left(\frac{1}{\varepsilon}\right)}{1-\rho} + 1 \qquad (57)

Combined with Theorem 2, this certifies the validity of (27). The inequality (28) follows from \|G^{(K)} - \hat{G}^{(K)}\|_F \le \sqrt{m}\,\|G^{(K)} - \hat{G}^{(K)}\|_{2,\infty}. Moreover, the correctness of (29) can be verified by noting that the rows of H^{(K)} - \hat{H}^{(K)} are subvectors of the rows of G^{(2K-1)} - \hat{G}^{(2K-1)}. Finally, to show the correctness of (30), note that

‖H(K) − H(K)‖2F =‖D − D‖2

F +T−2∑

k=0

(k + 2)‖Gk − Gk‖2F +

K−2∑

k=T−1

(k + 2)‖CAkB‖2F

+2K−3∑

k=K−1

(2K − 2− k)‖CAkB‖2F

≤T‖G− G‖2F +m‖C‖∞C2

sys

(K−2∑

k=T−1

(k + 2)ρk +2K−3∑

k=K−1

(2K − 2− k)ρk

)

≤T‖G− G‖2F +m‖C‖∞C2

sys

(TρT−1

1− ρ +ρT−1

(1− ρ)2

)

≤Tm(‖G− G‖2

2,∞ + 2‖C‖∞(Csys

1− ρ

)2

ρT−1

)(58)

Therefore

‖H(K) − H(K)‖F ≤√Tm

(‖G− G‖2,∞ +

√2‖C‖∞

(Csys

1− ρ

T−12

)

≤√Tm

(‖G− G‖2,∞ +

√2ε)

(59)

where the second inequality follows from (57). the above inequality combined with Theorem 2completes the proof. �
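The horizon condition (57) is simple to evaluate in practice. The following Python sketch computes the smallest integer $T$ satisfying (57) for given constants; the argument names (C_inf for $\|C\|_\infty$, and so on) are illustrative placeholders, and the routine is only a numerical reading of (57).

```python
import numpy as np

def min_truncation_horizon(C_inf, C_sys, rho, eps):
    """Smallest integer T satisfying the horizon condition (57)."""
    T = (np.log(C_inf) + 2 * np.log(C_sys / (1 - rho)) + 2 * np.log(1 / eps)) \
        / (1 - rho) + 1
    return int(np.ceil(T))

# The required horizon grows only logarithmically in 1/eps and in the system
# constants, but scales like 1/(1 - rho) as the system approaches instability.
print(min_truncation_horizon(C_inf=5.0, C_sys=2.0, rho=0.9, eps=1e-3))
```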

Finally, we will provide the proof for Corollary 2.


Proof of Corollary 2. One can write

$$
\frac{\mathcal{E}^{F}_{\ell_1}}{\mathcal{E}^{F}_{\mathrm{LS}}} \;\lesssim\; \underbrace{\left(\frac{N\log(Tpn)}{T^2(n+m+p)}\right)^{1/4}}_{(a)} + \underbrace{\left(\frac{N\epsilon^2}{T(n+m+p)}\right)^{1/2}}_{(b)} \tag{60}
$$

It is easy to see that (a) approaches zero as $(n,m,p)\to\infty$, due to the assumption (33). Moreover, due to Proposition 5, we have $\epsilon^2 \lesssim \rho^T$. Therefore, $T \gtrsim \log(T/N) + \log(n+m+p)$ ensures that (b) approaches zero as $(n,m,p)\to\infty$. The proof is completed by noting that $\log(T/N) \leq T/2$ for every $T, N \geq 1$. □

B Proof of the Auxiliary Results

B.1 Proof of Proposition 2

For the sake of simplicity, we suppress the row index $i$ and denote $Y_{:i}$, $X_{i:}^\top$, $G_{i:}^\top$, and $\widehat{G}_{i:}^\top$ as $y$, $g$, $g^*$, and $\widehat{g}$, respectively. Therefore, (42) can be written as

$$
\widehat{g} \;=\; \arg\min_{g\in\mathbb{R}^{Tp}}\; \frac{1}{2N}\|y - Ug\|_2^2 + \lambda\|g\|_1 \tag{61}
$$

Furthermore, we treat the combined term $w = [WF^\top]_{:i} + E_{:i} + V_{:i}$ as the additive noise. Finally, the estimation error is denoted as $\delta = \widehat{g} - g^*$. Note that $\widehat{g}$ is an optimal solution of (61). Therefore, one can write

$$
\begin{aligned}
&\frac{1}{2N}\|y - U\widehat{g}\|_2^2 + \lambda\|\widehat{g}\|_1 \;\leq\; \frac{1}{2N}\|y - Ug^*\|_2^2 + \lambda\|g^*\|_1 \\
\implies\;& \frac{1}{2N}\|w - U\delta\|_2^2 + \lambda\|\widehat{g}\|_1 \;\leq\; \frac{1}{2N}\|w\|_2^2 + \lambda\|g^*\|_1 \\
\implies\;& \frac{1}{2N}\|U\delta\|_2^2 \;\leq\; \frac{1}{N}w^\top U\delta + \lambda\left(\|g^*\|_1 - \|\widehat{g}\|_1\right)
\end{aligned} \tag{62}
$$

On the other hand, given an arbitrary index set $S \subseteq \{1,\dots,Tp\}$ with $|S| = s$, one can write

$$
\begin{aligned}
\|g^*\|_1 - \|\widehat{g}\|_1 \;&\leq\; \|g^*_S\|_1 + \|g^*_{S^c}\|_1 - \|g^*_S + \delta_S\|_1 - \|g^*_{S^c} + \delta_{S^c}\|_1 \\
&\leq\; \|\delta_S\|_1 - \|\delta_{S^c}\|_1 + 2\|g^*_{S^c}\|_1
\end{aligned} \tag{63}
$$

where the second line is implied by the triangle inequality. Substituting this inequality in (62) leads to

$$
\frac{1}{2N}\|U\delta\|_2^2 \;\leq\; \frac{1}{N}\|w^\top U\|_\infty\|\delta\|_1 + \lambda\left(\|\delta_S\|_1 - \|\delta_{S^c}\|_1 + 2\|g^*_{S^c}\|_1\right) \tag{64}
$$

On the other hand, since $\lambda \geq 2\|w^\top U\|_\infty/N$, one can write

$$
\begin{aligned}
\frac{1}{N}\|U\delta\|_2^2 \;&\leq\; \lambda\left(\|\delta\|_1 + 2\|\delta_S\|_1 - 2\|\delta_{S^c}\|_1 + 4\|g^*_{S^c}\|_1\right) \\
&\leq\; \lambda\left(3\|\delta_S\|_1 - \|\delta_{S^c}\|_1 + 4\|g^*_{S^c}\|_1\right) \\
&\leq\; \lambda\left(3\sqrt{s}\|\delta\|_2 + 4\|g^*_{S^c}\|_1\right)
\end{aligned} \tag{65}
$$


where the last inequality is due to $\|\delta_S\|_1 \leq \sqrt{s}\|\delta_S\|_2 \leq \sqrt{s}\|\delta\|_2$. Now, we consider two cases. If we have $\|\delta\|_2^2 \leq \frac{2}{\kappa}f(\delta)$, then the bound (45) trivially holds. Therefore, suppose $\|\delta\|_2^2 > \frac{2}{\kappa}f(\delta)$. Combining this inequality with Assumption 2 and (65) leads to

$$
\kappa\|\delta\|_2^2 - f(\delta) \;\leq\; \lambda\left(3\sqrt{s}\|\delta\|_2 + 4\|g^*_{S^c}\|_1\right) \;\implies\; \|\delta\|_2^2 \;\leq\; \frac{2\lambda}{\kappa}\left(3\sqrt{s}\|\delta\|_2 + 4\|g^*_{S^c}\|_1\right) \tag{66}
$$

Notice that this is a quadratic inequality in terms of $\|\delta\|_2$. Bounding the roots of this quadratic inequality leads to

$$
\|\delta\|_2^2 \;\leq\; \frac{72\lambda^2}{\kappa^2}s + \frac{16\lambda}{\kappa}\|g^*_{S^c}\|_1 \tag{67}
$$

Therefore, the following inequality holds for all values of $\|\delta\|_2^2$:

$$
\|\delta\|_2^2 \;\leq\; \max\left\{\frac{2}{\kappa}f(\delta),\;\; \frac{72\lambda^2}{\kappa^2}s + \frac{16\lambda}{\kappa}\|g^*_{S^c}\|_1\right\} \tag{68}
$$

Now, it remains to show that the set $S$ can be chosen such that $\frac{72\lambda^2}{\kappa^2}s + \frac{16\lambda}{\kappa}\|g^*_{S^c}\|_1 \leq \frac{88R}{\kappa^2}\lambda$. To show this, define $S = \{i : |g^*_i| \geq \lambda\}$. Note that

$$
R \;\geq\; \sum_{i=1}^{Tp}|g^*_i| \;\geq\; \sum_{i\in S}|g^*_i| \;\geq\; s\lambda \tag{69}
$$

Therefore, we have $s \leq \frac{R}{\lambda}$. Furthermore, one can write

$$
\|g^*_{S^c}\|_1 \;\leq\; R \tag{70}
$$

This implies that

$$
\frac{72\lambda^2}{\kappa^2}s + \frac{16\lambda}{\kappa}\|g^*_{S^c}\|_1 \;\leq\; \frac{72R}{\kappa^2}\lambda + \frac{16R}{\kappa}\lambda \;\leq\; \frac{88R}{\kappa^2}\lambda \tag{71}
$$

which completes the proof. □
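As a practical remark, the row-wise program (61) analyzed above is a standard Lasso and can be solved with any off-the-shelf method. The following proximal-gradient (ISTA) sketch in Python is a generic illustration of how one such row-wise estimate might be computed; it is not the particular solver used in this paper, and the toy data below (with a geometrically decaying, weakly sparse $g^*$) are purely synthetic.

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau*||.||_1 (entrywise soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def row_lasso(U, y, lam, n_iter=500):
    """Solve min_g (1/2N)||y - U g||_2^2 + lam*||g||_1 by ISTA."""
    N = U.shape[0]
    step = N / np.linalg.norm(U, 2) ** 2      # 1 / Lipschitz constant of the gradient
    g = np.zeros(U.shape[1])
    for _ in range(n_iter):
        grad = U.T @ (U @ g - y) / N
        g = soft_threshold(g - step * grad, step * lam)
    return g

# Toy usage with a weakly sparse (geometrically decaying) ground truth.
rng = np.random.default_rng(0)
N, Tp = 200, 400
U = rng.normal(size=(N, Tp))
g_star = 0.5 ** np.arange(Tp)
y = U @ g_star + 0.1 * rng.normal(size=N)
g_hat = row_lasso(U, y, lam=2 * np.sqrt(np.log(Tp) / N))
print("estimation error:", np.linalg.norm(g_hat - g_star))
```

The geometric decay of $g^*$ in the toy example mimics the decay of the Markov parameters of a stable system, which is precisely the weak-sparsity structure exploited by the analysis.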

B.2 Proof of Lemma 1

To prove Lemma 1, we need the following well-known result on quadratic forms of sub-Gaussian random vectors.

Theorem 3 (Hanson–Wright inequality). Let $w\in\mathbb{R}^d$ be a random vector with independent zero-mean sub-Gaussian elements with variance 1. Given a square and symmetric matrix $M$, we have $w^\top Mw \geq \mathbb{E}\{w^\top Mw\} - t$ with probability of at least

$$
1 - \exp\left(-c\min\left\{\frac{t^2}{\|M\|_F^2},\; \frac{t}{\|M\|_2}\right\}\right)
$$
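The concentration in Theorem 3 is easy to observe numerically: for a standard Gaussian $w$ and symmetric $M$, the quadratic form $w^\top Mw$ has mean $\mathrm{trace}(M)$ and standard deviation $\sqrt{2}\|M\|_F$. The following Monte Carlo sketch is purely illustrative (the absolute constant $c$ in the theorem is not estimated here):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 200
M = rng.normal(size=(d, d))
M = (M + M.T) / 2                      # an arbitrary symmetric test matrix

# Sample w^T M w for many independent standard Gaussian vectors w.
samples = np.array([w @ M @ w for w in rng.normal(size=(5000, d))])

print("empirical mean:", samples.mean(), "  trace(M):", np.trace(M))
print("empirical std :", samples.std(),
      "  sqrt(2)*||M||_F:", np.sqrt(2) * np.linalg.norm(M, "fro"))
```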

The main idea behind the proof of this lemma is as follows:


1. We reformulate $\|U\theta\|_2^2$ as a quadratic form $w^\top P(\theta)w$, where $P(\theta)\in\mathbb{R}^{N\times N}$ is only a function of $\theta$, and $w\in\mathbb{R}^N$ is a random vector with independent zero-mean sub-Gaussian elements.

2. We provide upper bounds on $\|P(\theta)\|_F^2$ and $\|P(\theta)\|_2^2$ in terms of $\|\theta\|_2^2$ and $\|\theta\|_1^2$.

3. Finally, we apply the Hanson–Wright inequality to obtain the desired concentration bound.

Lemma 2. The following statements hold:

- $\frac{1}{\sigma_u^2}\|U\theta\|_2^2 \sim w^\top P(\theta)w$, where $w\in\mathbb{R}^N$ is a random vector with independent zero-mean Gaussian elements, and $P(\theta)\in\mathbb{R}^{N\times N}$ is a symmetric matrix defined as

$$
P_{ij}(\theta) = R(|i-j|),\;\; \forall (i,j)\in\{1,\dots,N\}^2, \quad\text{where}\quad R(\tau) = \sum_{k=1}^{(T-\tau)p}\theta_k\theta_{k+\tau p} \tag{72}
$$

- $\frac{1}{\sigma_u^2}\mathbb{E}\{\|U\theta\|_2^2\} = N\|\theta\|_2^2$.

Proof. Upon defining $\zeta = U\theta$, one can verify that $\zeta$ has a zero-mean Gaussian distribution. Moreover, it is easy to see that

$$
\mathbb{E}\{\zeta_i\zeta_j\} = \mathbb{E}\left\{\left(u_{T+(i-1)}^\top\theta\right)\left(u_{T+(j-1)}^\top\theta\right)\right\} = \sigma_u^2\sum_{k=1}^{(T-|i-j|)p}\theta_k\theta_{k+p|i-j|} = \sigma_u^2 R(|i-j|) \tag{73}
$$

where in the second equality, we used the following facts:

- $\mathbb{E}\{u_t(s)^2\} = \sigma_u^2$ for every $t\in\{1,\dots,T+N-2\}$ and $s\in\{1,\dots,p\}$.

- $\mathbb{E}\{u_t(s_1)u_t(s_2)\} = 0$ for every $t\in\{1,\dots,T+N-2\}$ and $s_1,s_2\in\{1,\dots,p\}$ such that $s_1\neq s_2$.

- $\mathbb{E}\{u_{t_1}(s_1)u_{t_2}(s_2)\} = 0$ for every $t_1,t_2\in\{1,\dots,T+N-2\}$ such that $t_1\neq t_2$ and $s_1,s_2\in\{1,\dots,p\}$.

This implies that $\mathbb{E}\{\zeta_i\zeta_j\} = \sigma_u^2P_{ij}(\theta)$, and hence, $\frac{1}{\sigma_u}U\theta \sim \mathcal{N}(0, P(\theta))$. Therefore, $\frac{1}{\sigma_u^2}\|U\theta\|_2^2$ has the same distribution as $w^\top P(\theta)w$, thereby completing the proof of the first statement. The second statement directly follows from the definition of $P(\theta)$ and the fact that $w$ has independent elements.
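The matrix $P(\theta)$ in (72) is a plain Toeplitz matrix built from the lag-$\tau p$ autocorrelation of $\theta$, and both statements of Lemma 2 can be checked numerically. The Python sketch below builds $P(\theta)$ and verifies, by Monte Carlo, that $\mathbb{E}\{\|U\theta\|_2^2\} = N\|\theta\|_2^2$ when the rows of $U$ are overlapping windows of i.i.d. unit-variance inputs (i.e., $\sigma_u = 1$); the dimensions are arbitrary illustrative choices.

```python
import numpy as np

def build_P(theta, N, p):
    """Toeplitz matrix P(theta) from (72): P_ij = R(|i - j|)."""
    Tp = theta.size
    r = np.array([theta[:Tp - tau * p] @ theta[tau * p:] if tau * p < Tp else 0.0
                  for tau in range(N)])
    lags = np.abs(np.subtract.outer(np.arange(N), np.arange(N)))
    return r[lags]

rng = np.random.default_rng(2)
T, p, N = 6, 3, 40
theta = rng.normal(size=T * p)
P = build_P(theta, N, p)

# zeta = U @ theta has covariance P(theta), so E||U theta||^2 = trace(P) = N*||theta||^2.
norms = []
for _ in range(3000):
    u = rng.normal(size=(T + N - 1, p))                     # i.i.d. unit-variance inputs
    # row i of U stacks the T most recent input vectors ending at time T + i - 1
    U = np.stack([u[i + T - 1::-1][:T].reshape(-1) for i in range(N)])
    norms.append(np.linalg.norm(U @ theta) ** 2)
print(np.mean(norms), np.trace(P), N * theta @ theta)       # all roughly equal
```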

Our next lemma provides an upper bound on the values of $\|P(\theta)\|_F^2$ and $\|P(\theta)\|_2^2$.

Lemma 3. The following inequalities hold:

- $\|P(\theta)\|_2 \leq \|\theta\|_2^2 + \|\theta\|_1^2$

- $\|P(\theta)\|_F^2 \leq N\left(\|\theta\|_2^2 + \|\theta\|_1^2\right)^2$


Proof. Due to the Gershgorin circle theorem, one can write $\left|\|P(\theta)\|_2 - \|\theta\|_2^2\right| \leq \sum_{\tau=1}^{N-1}|R(\tau)|$. This implies that $\|P(\theta)\|_2 \leq \|\theta\|_2^2 + \sum_{\tau=1}^{N-1}|R(\tau)|$. Define $\bar\theta_i = \theta_{Tp+1-i}$ and $\bar R$ as the convolution of $\theta$ and $\bar\theta$, i.e., $\bar R(\tau) = \sum_{k=1}^{\tau-1}\theta_k\bar\theta_{\tau-k}$ for every $\tau$. One can write

$$
R(\tau) = \sum_{k=1}^{(T-\tau)p}\theta_k\theta_{k+\tau p} = \sum_{k=1}^{(T-\tau)p}\theta_k\bar\theta_{(T-\tau)p+1-k} = \bar R\big((T-\tau)p+1\big) \tag{74}
$$

Therefore, we have

$$
\sum_{\tau=1}^{N-1}|R(\tau)| = \sum_{\tau=1}^{N-1}\left|\bar R\big((T-\tau)p+1\big)\right| \;\leq\; \|\bar R\|_1 = \|\theta * \bar\theta\|_1 \;\leq\; \|\theta\|_1\|\bar\theta\|_1 = \|\theta\|_1^2 \tag{75}
$$

where the last inequality is due to Young's convolution inequality. This concludes the proof of the first statement. The second statement follows from (75) and $\|P(\theta)\|_F^2 \leq N\|P(\theta)\|_2^2$.
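Both bounds in Lemma 3 are deterministic and can be spot-checked directly. A small, self-contained Python sketch for the operator-norm bound (the dimensions are arbitrary illustrative choices, and $P(\theta)$ is built exactly as in (72)):

```python
import numpy as np

rng = np.random.default_rng(3)
T, p, N = 5, 2, 30
for _ in range(5):
    theta = rng.normal(size=T * p)
    r = np.array([theta[:theta.size - tau * p] @ theta[tau * p:]
                  if tau * p < theta.size else 0.0 for tau in range(N)])
    P = r[np.abs(np.subtract.outer(np.arange(N), np.arange(N)))]
    lhs = np.linalg.norm(P, 2)                               # ||P(theta)||_2
    rhs = theta @ theta + np.linalg.norm(theta, 1) ** 2      # ||theta||_2^2 + ||theta||_1^2
    print(lhs <= rhs, lhs, rhs)
```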

Proof of Lemma 1. Lemmas 2 and 3, together with Theorem 3, imply that

$$
\frac{1}{\sigma_u^2N}\|U\theta\|_2^2 \;\geq\; \|\theta\|_2^2 - t
$$

for any $t > 0$, with probability of at least

$$
1 - \exp\left(-c\min\left\{\frac{Nt^2}{\left(\|\theta\|_2^2 + \|\theta\|_1^2\right)^2},\; \frac{Nt}{\|\theta\|_2^2 + \|\theta\|_1^2}\right\}\right) \tag{76}
$$

Upon choosing $t = \sqrt{\frac{\eta\log(Tp)}{N}}\left(\|\theta\|_2^2 + \|\theta\|_1^2\right)$ and $N \geq 4\eta\log(Tp)$ for some $\eta > 0$, we have

$$
\frac{1}{N}\|U\theta\|_2^2 \;\geq\; \sigma_u^2\left(1 - \sqrt{\frac{\eta\log(Tp)}{N}}\right)\|\theta\|_2^2 - \sigma_u^2\sqrt{\frac{\eta\log(Tp)}{N}}\|\theta\|_1^2 \;\geq\; \frac{\sigma_u^2}{2}\|\theta\|_2^2 - \sigma_u^2\sqrt{\frac{\eta\log(Tp)}{N}}\|\theta\|_1^2 \tag{77}
$$

with probability of at least $1 - (Tp)^{-c\eta}$. This completes the proof. □

B.3 Proof of Proposition 4

For simplicity, we will borrow the notation $g^*$, $\widehat{g}$, and $\delta$ from the proof of Proposition 2. First note that, according to (65), $\delta = \widehat{g} - g^*$ satisfies the following property

$$
\|\delta_{S^c}\|_1 \;\leq\; 3\|\delta_S\|_1 + 4\|g^*_{S^c}\|_1 \tag{78}
$$

for any choice of $S \subseteq \{1,2,\dots,Tp\}$ with $|S| = s$. Define $S = \{i : |g^*_i| \geq \gamma\}$ for a value of $\gamma$ to be defined later. Assuming that $\|g^*\|_1 \leq R$, the inequality (78) implies that

$$
\|\delta\|_1 \;\leq\; 4\|\delta_S\|_1 + 4\|g^*_{S^c}\|_1 \;\leq\; 4\sqrt{s}\|\delta\|_2 + 4\|g^*_{S^c}\|_1 \;\leq\; 4\sqrt{\frac{R}{\gamma}}\|\delta\|_2 + 4R \tag{79}
$$


where the last inequality follows from $s \leq R/\gamma$ and $\|g^*_{S^c}\|_1 \leq R$. This leads to

$$
\|\delta\|_1^2 \;\leq\; \frac{32R}{\gamma}\|\delta\|_2^2 + 32R^2 \tag{80}
$$

Combining the above inequality with Lemma 1 implies that the following inequality holds with probability of at least $1 - (Tp)^{-c\eta}$:

$$
\frac{1}{N}\|U\delta\|_2^2 \;\geq\; \sigma_u^2\left(\frac{1}{2} - \frac{32R}{\gamma}\sqrt{\frac{\eta\log(Tp)}{N}}\right)\|\delta\|_2^2 - 32\sigma_u^2R^2\sqrt{\frac{\eta\log(Tp)}{N}} \tag{81}
$$

Now, upon choosing $\gamma = 128R\sqrt{\frac{\eta\log(Tp)}{N}}$, we get $\frac{32R}{\gamma}\sqrt{\frac{\eta\log(Tp)}{N}} = 1/4$, which results in

$$
\frac{1}{N}\|U\delta\|_2^2 \;\geq\; \frac{\sigma_u^2}{4}\|\delta\|_2^2 - 32\sigma_u^2R^2\sqrt{\frac{\eta\log(Tp)}{N}} \tag{82}
$$

Finally, Proposition 3 can be invoked to show that $R \leq \frac{2C_{\mathrm{sys}}^3}{1-\rho}$. This completes the proof. □

B.4 Proof of Proposition 5

To prove this proposition, we divide the lower bound into three terms:

$$
\lambda \;\geq\; \underbrace{\frac{2\|U^\top WF_{i:}^\top\|_\infty}{N}}_{\mathrm{(I)}} + \underbrace{\frac{2\|U^\top E_{:i}\|_\infty}{N}}_{\mathrm{(II)}} + \underbrace{\frac{2\|U^\top V_{:i}\|_\infty}{N}}_{\mathrm{(III)}} \tag{83}
$$

Next, we will provide concentration bounds on every term of the above inequality.

Lemma 4 (Bounding (I)). The following inequality holds:

$$
\frac{2\|U^\top WF_{i:}^\top\|_\infty}{N} \;\leq\; 4\sqrt{2}\,\sigma_u\sigma_w\left(\frac{C_{\mathrm{sys}}^2}{1-\rho}\right)\sqrt{(1+\eta)\frac{\log(Tpn)}{N}} \tag{84}
$$

with probability of at least $1 - 2(Tpn)^{-2\eta}$, for an arbitrary $\eta > 0$.

Proof. One can write $\|U^\top WF_{i:}^\top\|_\infty \leq \|U^\top W\|_{\infty,\infty}\|F_{i:}\|_1$. We will bound each term on the right-hand side separately. First, note that

$$
\|F_{i:}\|_1 = \sum_{j=1}^{Tp}|F_{ij}| = \sum_{\tau=0}^{T-2}\|C_{i:}\|_1\|A^\tau\|_1 \;\leq\; \sum_{\tau=0}^{\infty}C_{\mathrm{sys}}^2\rho^\tau \;\leq\; \frac{C_{\mathrm{sys}}^2}{1-\rho} \tag{85}
$$

Now, let us focus on $\|U^\top W\|_{\infty,\infty}$. We have $\|U^\top W\|_{\infty,\infty} = \max_{i,j}\{|U_{:i}^\top W_{:j}|\}$. Note that the vectors $U_{:i}$ and $W_{:j}$ are independent random vectors, each with sub-Gaussian elements. Therefore, $U_{ki}W_{kj}$ is sub-exponential with parameters $(\sqrt{2}\sigma_u\sigma_w, \sqrt{2}\sigma_u\sigma_w)$ [44] for every $k$. A standard concentration bound on sub-exponential random variables entails that

$$
\mathbb{P}\left(|U_{:i}^\top W_{:j}| \leq \sqrt{2}\sigma_u\sigma_w t\right) \;\geq\; 1 - 2\exp\left(-\frac{1}{2}\min\left\{t, \frac{t^2}{N}\right\}\right) \tag{86}
$$


A simple union bound implies that

$$
\mathbb{P}\left(\frac{1}{N}\|U^\top W\|_{\infty,\infty} \leq \sqrt{2}\sigma_u\sigma_w t\right) \;\geq\; 1 - 2T^2pn\exp\left(-\frac{N}{2}\min\{t, t^2\}\right) \tag{87}
$$

Now, define $t = c\sqrt{\frac{\log(Tpn)}{N}}$ for a constant $c$ to be defined later, and assume that $N \geq c^2\log(Tpn)$. This implies that $t^2 \leq t$, which leads to

$$
\mathbb{P}\left(\frac{1}{N}\|U^\top W\|_{\infty,\infty} \leq \sqrt{2}c\,\sigma_u\sigma_w\sqrt{\frac{\log(Tpn)}{N}}\right) \;\geq\; 1 - 2\exp\left(2\log(Tpn) - \frac{c^2}{2}\log(Tpn)\right) \tag{88}
$$

Now, upon defining $c = 2\sqrt{1+\eta}$ for an arbitrary $\eta > 0$, we have

$$
\frac{1}{N}\|U^\top W\|_{\infty,\infty} \;\leq\; 2\sqrt{2}\,\sigma_u\sigma_w\sqrt{(1+\eta)\frac{\log(Tpn)}{N}} \tag{89}
$$

with probability of at least $1 - 2(Tpn)^{-2\eta}$. Combining this inequality with (85) completes the proof.
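The $\sqrt{\log(Tpn)/N}$ scaling in (89) can also be observed empirically. The following Python sketch is only indicative: it draws the columns of $U$ and $W$ as i.i.d. Gaussian vectors, which ignores the temporal dependence present in the actual input matrix, so it conveys the order of magnitude rather than the exact constants.

```python
import numpy as np

rng = np.random.default_rng(4)
sigma_u = sigma_w = 1.0
Tp, n = 100, 50
for N in (200, 800, 3200):
    U = sigma_u * rng.normal(size=(N, Tp))
    W = sigma_w * rng.normal(size=(N, n))
    stat = np.max(np.abs(U.T @ W)) / N           # (1/N) * ||U^T W||_{inf,inf}
    print(N, stat, np.sqrt(np.log(Tp * n) / N))  # statistic vs. predicted rate
```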

Lemma 5 (Bounding (II)). Assume that

$$
T \;\geq\; \frac{\log\log(Np+Tp+Nn) + 4\log\left(\frac{C_{\mathrm{sys}}}{1-\rho}\right) + 4\log(\sigma_w+\sigma_u) + 2\log(2)}{1-\rho} + 2 \tag{90}
$$

Then, the following inequality holds

$$
\frac{2\|U^\top E_{:i}\|_\infty}{N} \;\leq\; 2\rho^{T/2}(1+\eta) \tag{91}
$$

with probability of at least $1 - 2(Nn)^{-\eta} - 2(Np+Tp)^{-\eta}$, for an arbitrary $\eta > 0$.

Proof. One can write $\|U^\top E_{:i}\|_\infty \leq \|U\|_{\infty,\infty}\|E_{:i}\|_1$. On the other hand, note that $E = \begin{bmatrix}e_{T-1} & e_T & \dots & e_{T+N-2}\end{bmatrix}^\top$, where $e_t = CA^{T-1}x_{t-T+1}$. This implies that

$$
\|E_{:i}\|_1 = \sum_{t=T-1}^{T+N-2}|(e_t)_i| = \sum_{t=1}^{N}|(CA^{T-1})_{i:}x_t| \;\leq\; \|(CA^{T-1})_{i:}\|_1\sum_{t=1}^{N}\|x_t\|_\infty \;\leq\; N\|(CA^{T-1})_{i:}\|_1\|X\|_{\infty,\infty} \tag{92}
$$

where $X = \begin{bmatrix}x_1 & x_2 & \dots & x_N\end{bmatrix}$. Similar to the previous case, we bound each term on the right-hand side separately. First, note that

$$
\|(CA^{T-1})_{i:}\|_1 \;\leq\; \|C_{i:}\|_1\|A^{T-1}\|_1 \;\leq\; C_{\mathrm{sys}}^2\rho^{T-1} \tag{93}
$$

Our next goal is to provide an upper bound on $\|X\|_{\infty,\infty}$. It is easy to see that, for every $t$, the vector $x_t$ is a zero-mean Gaussian random vector with covariance

$$
\Sigma_t = \sum_{i=0}^{t-1}\sigma_w^2A^i(A^\top)^i + \sigma_u^2A^iBB^\top(A^\top)^i \tag{94}
$$


This implies that each element $X_{ij}$ is Gaussian with variance $\sigma_{ij}^2$ satisfying

$$
\sigma_{ij}^2 \;\leq\; \sigma_w^2\sum_{i=0}^{\infty}C_{\mathrm{sys}}^2\rho^{2i} + \sigma_u^2\sum_{i=0}^{\infty}C_{\mathrm{sys}}^4\rho^{2i} \;\leq\; \left(\sigma_w^2 + \sigma_u^2\right)\frac{C_{\mathrm{sys}}^4}{1-\rho^2} \tag{95}
$$

which implies that $\sigma_{ij} \leq (\sigma_w + \sigma_u)\frac{C_{\mathrm{sys}}^2}{1-\rho}$. This, together with standard concentration bounds on the Gaussian distribution, implies that

$$
\mathbb{P}\left(\|X\|_{\infty,\infty} \leq (\sigma_w+\sigma_u)\frac{C_{\mathrm{sys}}^2}{1-\rho}\,t\right) \;\geq\; 1 - 2Nn\exp\left(-\frac{t^2}{2}\right) \tag{96}
$$

Upon choosing $t = \sqrt{2(1+\eta)\log(Nn)}$, we have

$$
\|X\|_{\infty,\infty} \;\leq\; \sqrt{2}(\sigma_w+\sigma_u)\left(\frac{C_{\mathrm{sys}}^2}{1-\rho}\right)\sqrt{(1+\eta)\log(Nn)} \tag{97}
$$

with probability of at least $1 - 2(Nn)^{-\eta}$. Combining this inequality with (92) and (93) leads to

$$
\|E_{:i}\|_1 \;\leq\; NC_{\mathrm{sys}}^2\rho^{T-1}(\sigma_w+\sigma_u)\left(\frac{C_{\mathrm{sys}}^2}{1-\rho}\right)\sqrt{2(1+\eta)\log(Nn)} \;=\; N(\sigma_w+\sigma_u)\left(\frac{C_{\mathrm{sys}}^4}{1-\rho}\right)\rho^{T-1}\sqrt{2(1+\eta)\log(Nn)} \tag{98}
$$

with probability of at least $1 - 2(Nn)^{-\eta}$. Similarly, one can write

$$
\mathbb{P}\left(\|U\|_{\infty,\infty} \leq \sigma_u t\right) \;\geq\; 1 - 2(N+T)p\exp\left(-\frac{t^2}{2}\right) \tag{99}
$$

Upon choosing $t = \sqrt{2(1+\eta)\log(Np+Tp)}$, we have

$$
\|U\|_{\infty,\infty} \;\leq\; \sigma_u\sqrt{2(1+\eta)\log(Np+Tp)} \tag{100}
$$

with probability of at least $1 - 2(Np+Tp)^{-\eta}$, for some $\eta > 0$. Combining all the derived bounds, one can write

$$
\frac{1}{N}\|U^\top E_{:i}\|_\infty \;\leq\; 2(\sigma_w+\sigma_u)\sigma_u\left(\frac{C_{\mathrm{sys}}^4}{1-\rho}\right)(1+\eta)\log(Np+Tp+Nn)\,\rho^{T-1} \tag{101}
$$

Now, if we choose

$$
T \;\geq\; \frac{4\log\left(\frac{C_{\mathrm{sys}}}{1-\rho}\right) + 4\log(\sigma_w+\sigma_u) + 2\log(2) + 2\log\log(Np+Tp+Nn)}{1-\rho} + 2 \tag{102}
$$

then we have

$$
\rho^{-(T/2-1)} \;\geq\; 2(\sigma_w+\sigma_u)\sigma_u\left(\frac{C_{\mathrm{sys}}^4}{1-\rho}\right)\log(Np+Tp+Nn) \tag{103}
$$


Combining the above inequality with (101) leads to

$$
\frac{1}{N}\|U^\top E_{:i}\|_\infty \;\leq\; \rho^{T/2}(1+\eta) \tag{104}
$$

which holds with probability of at least $1 - 2(Nn)^{-\eta} - 2(Np+Tp)^{-\eta}$. This completes the proof.

Lemma 6 (Bounding (III)). The following inequality holds:

$$
\frac{2\|U^\top V_{:i}\|_\infty}{N} \;\leq\; 4\sigma_u\sigma_v\sqrt{(1+\eta)\frac{\log(Tp)}{N}} \tag{105}
$$

with probability of at least $1 - 2(Tp)^{-\eta}$, for an arbitrary $\eta > 0$.

Proof. The proof is a simpler version of the proof of Lemma 4, and the details are omitted for brevity.

Proof of Proposition 5. The proof follows by combining the bounds obtained in Lemmas 4, 5, and 6. □

References

[1] F. Blaabjerg, R. Teodorescu, M. Liserre, A. V. Timbus, Overview of control and grid synchronization for distributed power generation systems, IEEE Transactions on Industrial Electronics 53 (5) (2006) 1398–1409.

[2] S. M. Amin, B. F. Wollenberg, Toward a smart grid: power delivery for the 21st century, IEEE Power and Energy Magazine 3 (5) (2005) 34–41.

[3] Y. Wang, D. J. Hill, G. Guo, Robust decentralized control for multimachine power systems, IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications 45 (3) (1998) 271–279.

[4] J. Barbaresso, G. Cordahi, D. Garcia, C. Hill, A. Jendzejec, K. Wright, B. A. Hamilton, USDOT's Intelligent Transportation Systems (ITS) ITS Strategic Plan, 2015–2019, Tech. rep., United States Department of Transportation, Intelligent Transportation (2014).

[5] D. Krechmer, E. Flanigan, A. Rivadeneyra, K. Blizzard, S. Van Hecke, R. Rausch, Effects on intelligent transportation systems planning and deployment in a connected vehicle environment, Tech. rep., United States Federal Highway Administration, Office of Operations (2018).

[6] V. Kapila, A. G. Sparks, J. M. Buffington, Q. Yan, Spacecraft formation flying: Dynamics and control, Journal of Guidance, Control, and Dynamics 23 (3) (2000) 561–564.

[7] M. Kubisch, H. Karl, A. Wolisz, L. C. Zhong, J. Rabaey, Distributed algorithms for transmission power control in wireless sensor networks, in: 2003 IEEE Wireless Communications and Networking (WCNC 2003), Vol. 1, IEEE, 2003, pp. 558–563.


[8] D. Nguyen-Tuong, J. Peters, Model learning for robot control: a survey, Cognitive Processing 12 (4) (2011) 319–340.

[9] R. S. Sutton, A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 2018.

[10] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.

[11] Y. Duan, X. Chen, R. Houthooft, J. Schulman, P. Abbeel, Benchmarking deep reinforcement learning for continuous control, in: International Conference on Machine Learning, 2016, pp. 1329–1338.

[12] S. Oymak, N. Ozay, Non-asymptotic identification of LTI systems from a single trajectory, in: 2019 American Control Conference (ACC), IEEE, 2019, pp. 5655–5661.

[13] K. Krauth, S. Tu, B. Recht, Finite-time analysis of approximate policy iteration for the linear quadratic regulator, in: Advances in Neural Information Processing Systems, 2019, pp. 8512–8522.

[14] S. Dean, H. Mania, N. Matni, B. Recht, S. Tu, On the sample complexity of the linear quadratic regulator, Foundations of Computational Mathematics (2019) 1–47.

[15] M. Simchowitz, R. Boczar, B. Recht, Learning linear dynamical systems with semi-parametric least squares, arXiv preprint arXiv:1902.00768 (2019).

[16] T. Sarkar, A. Rakhlin, M. A. Dahleh, Finite-time system identification for partially observed LTI systems of unknown order, arXiv preprint arXiv:1902.01848 (2019).

[17] K. J. Astrom, P. Eykhoff, System identification—a survey, Automatica 7 (2) (1971) 123–162.

[18] L. Ljung, System identification, Wiley Encyclopedia of Electrical and Electronics Engineering (1999) 1–19.

[19] H.-F. Chen, L. Guo, Identification and Stochastic Adaptive Control, Springer Science & Business Media, 2012.

[20] G. C. Goodwin, R. Payne, Dynamic System Identification: Experiment Design and Data Analysis, 1977.

[21] S. Dean, H. Mania, N. Matni, B. Recht, S. Tu, On the sample complexity of the linear quadratic regulator, arXiv preprint arXiv:1710.01688 (2017).

[22] S. Dean, H. Mania, N. Matni, B. Recht, S. Tu, Regret bounds for robust adaptive control of the linear quadratic regulator, in: Advances in Neural Information Processing Systems, 2018, pp. 4188–4197.

[23] M. Simchowitz, H. Mania, S. Tu, M. I. Jordan, B. Recht, Learning without mixing: Towards a sharp analysis of linear system identification, arXiv preprint arXiv:1802.08334 (2018).


[24] T. Sarkar, A. Rakhlin, How fast can linear dynamical systems be learned?, arXiv preprint arXiv:1812.01251 (2018).

[25] A. Tsiamis, G. J. Pappas, Finite sample analysis of stochastic system identification, arXiv preprint arXiv:1903.09122 (2019).

[26] Y. Zheng, N. Li, Non-asymptotic identification of linear dynamical systems using multiple trajectories, arXiv preprint arXiv:2009.00739 (2020).

[27] S. Fattahi, N. Matni, S. Sojoudi, Learning sparse dynamical systems from a single sample trajectory, arXiv (2019) arXiv–1904.

[28] S. Fattahi, S. Sojoudi, Data-driven sparse system identification, in: 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton), IEEE, 2018, pp. 462–469.

[29] S. Fattahi, S. Sojoudi, Non-asymptotic analysis of block-regularized regression problem, in: 2018 IEEE Conference on Decision and Control (CDC), IEEE, 2018, pp. 27–34.

[30] S. Fattahi, S. Sojoudi, Sample complexity of sparse system identification problem, arXiv preprint arXiv:1803.07753 (2018).

[31] Y. Sun, S. Oymak, M. Fazel, Finite sample system identification: Improved rates and the role of regularization (2020).

[32] B. Wahlberg, C. Rojas, Matrix rank optimization problems in system identification via nuclear norm minimization, in: (Tutorial) at the European Control Conference, 2013.

[33] J.-F. Cai, X. Qu, W. Xu, G.-B. Ye, Robust recovery of complex exponential signals from random Gaussian projections via low rank Hankel matrix reconstruction, Applied and Computational Harmonic Analysis 41 (2) (2016) 470–490.

[34] Y. Abbasi-Yadkori, C. Szepesvari, Regret bounds for the adaptive control of linear quadratic systems, in: Proceedings of the 24th Annual Conference on Learning Theory, 2011, pp. 1–26.

[35] Y. Abbasi-Yadkori, N. Lazic, C. Szepesvari, Model-free linear quadratic control via reduction to expert prediction, in: The 22nd International Conference on Artificial Intelligence and Statistics, 2019, pp. 3108–3117.

[36] S. Lale, K. Azizzadenesheli, B. Hassibi, A. Anandkumar, Regret bound of adaptive control in linear quadratic Gaussian (LQG) systems, arXiv preprint arXiv:2003.05999 (2020).

[37] S. Dean, S. Tu, N. Matni, B. Recht, Safely learning to control the constrained linear quadratic regulator, in: 2019 American Control Conference (ACC), IEEE, 2019, pp. 5582–5588.

[38] H. Mania, S. Tu, B. Recht, Certainty equivalent control of LQR is efficient, arXiv preprint arXiv:1902.07826 (2019).

[39] S. Fattahi, N. Matni, S. Sojoudi, Efficient learning of distributed linear-quadratic controllers (2019). arXiv:1909.09895.


[40] L. Furieri, Y. Zheng, M. Kamgarpour, Learning the globally optimal distributed LQ regulator, in: Learning for Dynamics and Control, 2020, pp. 287–297.

[41] B. Ho, R. E. Kalman, Effective construction of linear state-variable models from input/output functions, at-Automatisierungstechnik 14 (1–12) (1966) 545–548.

[42] A. C. Antoulas, Approximation of Large-Scale Dynamical Systems, SIAM, 2005.

[43] K. Zhou, J. C. Doyle, Essentials of Robust Control, Vol. 104, Prentice Hall, Upper Saddle River, NJ, 1998.

[44] M. J. Wainwright, High-Dimensional Statistics: A Non-Asymptotic Viewpoint, Vol. 48, Cambridge University Press, 2019.

[45] K. H. Jin, J. C. Ye, Sparse and low-rank decomposition of a Hankel structured matrix for impulse noise removal, IEEE Transactions on Image Processing 27 (3) (2017) 1448–1461.

[46] S. Tu, R. Boczar, A. Packard, B. Recht, Non-asymptotic analysis of robust control from coarse-grained identification, arXiv preprint arXiv:1707.04791 (2017).

[47] S. N. Negahban, P. Ravikumar, M. J. Wainwright, B. Yu, et al., A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers, Statistical Science 27 (4) (2012) 538–557.

[48] J. Shao, Linear model selection by cross-validation, Journal of the American Statistical Association 88 (422) (1993) 486–494.

[49] F. Krahmer, S. Mendelson, H. Rauhut, Suprema of chaos processes and the restricted isometry property, Communications on Pure and Applied Mathematics 67 (11) (2014) 1877–1904.
