# finite_time_lti_system_identification__62aaf507.pdf

Journal of Machine Learning Research 22 (2021) 1-61 Submitted 8/19; Revised 4/20; Published 1/21

Finite Time LTI System Identiﬁcation

Tuhin Sarkar tuhin91@gmail.com Department of Electrical Engineering and Computer Sciences Massachusetts Institute of Technology Cambridge, MA 02139, USA

Alexander Rakhlin rakhlin@mit.edu Department of Brain and Cognitive Sciences Massachusetts Institute of Technology Cambridge, MA 02139, USA

Munther A. Dahleh dahleh@mit.edu Department of Electrical Engineering and Computer Sciences Massachusetts Institute of Technology Cambridge, MA 02139, USA

Editor: Benjamin Recht

We address the problem of learning the parameters of a stable linear time invariant (LTI) system with unknown latent space dimension, or order, from a single time series of noisy input-output data. We focus on learning the best lower order approximation allowed by ﬁnite data. Motivated by subspace algorithms in systems theory, where the doubly inﬁnite system Hankel matrix captures both order and good lower order approximations, we construct a Hankel-like matrix from noisy ﬁnite data using ordinary least squares. This circumvents the non-convexities that arise in system identiﬁcation, and allows accurate estimation of the underlying LTI system. Our results rely on careful analysis of self-normalized martingale diﬀerence terms that helps bound identiﬁcation error up to logarithmic factors of the lower bound. We provide a data-dependent scheme for order selection and ﬁnd an accurate realization of system parameters, corresponding to that order, by an approach that is closely related to the Ho-Kalman subspace algorithm. We demonstrate that the proposed model order selection procedure is not overly conservative, i.e., for the given data length it is not possible to estimate higher order models or ﬁnd higher order approximations with reasonable accuracy.

Keywords: Linear Dynamical Systems, System Identiﬁcation, Non parametric statistics, control theory, Statistical Learning theory

1. Introduction

Finite-time system identiﬁcation the problem of estimating the system parameters given a ﬁnite single time series of its output is an important problem in the context of control theory, time series analysis, robotics, and economics, among many others. In this work, we focus on parameter estimation and model approximation of linear time invariant (LTI)

c 2021 T. Sarkar and A. Rakhlin and M. A. Dahleh.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v22/19-725.html.

T. Sarkar and A. Rakhlin and M. A. Dahleh

systems or linear dynamical system (LDS), which are described by

Xt+1 = AXt + BUt + ηt+1 Yt = CXt + wt. (1)

Here C Rp n, A Rn n, B Rn m; {ηt, wt} t=1 are process and output noise, Ut is an external control input, Xt is the latent state variable and Yt is the observed output. The goal here is parameter estimation, i.e., learning (C, A, B) from a single ﬁnite time series of {Yt, Ut}T t=1 when the order, n, is unknown. Since typically p, m < n, it becomes challenging to ﬁnd suitable parametrizations of LTI systems for provably eﬃcient learning. When {Xj} j=1 are observed (or, C is known to be the identity matrix), identiﬁcation of (C, A, B) in Eq. (1) is signiﬁcantly easier, and ordinary least squares (OLS) is a statistically optimal estimator. It is, in general, unclear how (or if) OLS can be employed in the case when Xt s are not observed. To motivate the study of a lower-order approximation of a high-order system, consider the following example:

Example 1 Consider M1 = (A1, B1, C1) with

0 1 0 0 . . . 0 0 0 1 0 . . . 0 ... ... ... ... ... ... 0 0 0 0 . . . 1 a 0 0 0 . . . 0

0 0 ... 0 1

C1 = B 1 (2)

where na 1 and n > 20. Here the order of M1 is n. However, it can be approximated well by M2 which is of a much lower order and given by

A2 = 0 0 1 0

C2 = B 2 . (3)

For the same input Ut, if Y (1) t , Y (2) t be the output generated by M1 and M2 respectively then a simple computation shows that

(Y (1) t Y (2) t )2

U2 t 4n2a2 1

This suggests that the actual value of n is not important; rather there exists an eﬀective order, r (which is 2 in this case). This lower order model captures most of the LTI system.

Since the true model order is not known in many cases, we emphasize a nonparametric approach to identiﬁcation: one which adaptively selects the best model order for the given data and approximates the underlying LTI system better as T (length of data) grows. The key to this approach will be designing an estimator ˆ M from which we obtain a realization ( ˆC, ˆA, ˆB) of the selected order.

Finite Time LTI System Identification

1.1 Related Work

Linear time invariant systems are an extensively studied class of models in control and systems theory. These models are used in feedback control systems (for example in planetary soft landing systems for rockets (A cıkme se et al., 2013)) and as linear approximations to many non linear systems that nevertheless work well in practice. In the absence of process and output noise, subspace-based system identiﬁcation methods are known to learn (C, A, B) (up to similarity transformation)(Ljung, 1987; Van Overschee and De Moor, 2012). These typically involve constructing a Hankel matrix from the input output pairs and then obtaining system parameters by a singular value decomposition. Such methods are inspired by the celebrated Ho-Kalman realization algorithm (Ho and Kalman, 1966). The correctness of these methods is predicated on the knowledge of n or presence of inﬁnite data. Other approaches include rank minimization-based methods for system identiﬁcation (Fazel et al., 2013; Grussler et al., 2018), further relaxing the rank constraint to a suitable convex formulation. However, there is a lack of statistical guarantees for these algorithms, and it is unclear how much data is required to obtain accurate estimates of system parameters from ﬁnite noisy data. Empirical methods such as the EM algorithm (Dempster et al., 1977) are also used in practice; however, these suﬀer from non-convexity in problem formulation and can get trapped in local minima. Learning simpler approximations to complex models in the presence of ﬁnite noisy data was studied in Venkatesh and Dahleh (2001) where identiﬁcation error is decomposed into error due to approximation and error due to noise; however the analysis assumes the knowledge of a good parametrization and does not provide statistical guarantees for learning the system parameters of such an approximation. More recently, there has been a resurgence in the study of statistical identiﬁcation of LTI systems from a single time series in the machine learning community. In cases when C = I, i.e., Xt is observed directly, sharp ﬁnite time error bounds for identiﬁcation of A, B from a single time series are provided in Faradonbeh et al. (2017); Simchowitz et al. (2018); Sarkar and Rakhlin (2018). The approach to ﬁnding A, B is based on a standard ordinary least squares (OLS) given by

( ˆA, ˆB) = arg min A,B

t=1 ||Xt+1 [A, B][X t , U t ] ||2 2.

Another closely related area is that of online prediction in time series Hazan et al. (2018); Agarwal et al. (2018). Finite time regret guarantees for prediction in linear time series are provided in Hazan et al. (2018). The approach there circumvents the need for system identiﬁcation and instead uses a ﬁltering technique that convolves the time series with eigenvectors of a speciﬁc Hankel matrix. Closest to our work is that of Oymak and Ozay (2018). Their algorithm, which takes inspiration from the Kalman Ho algorithm, assumes the knowledge of model order n. This limits the applicability of the algorithm in two ways: ﬁrst, it is unclear how the techniques can be extended to the case when n is unknown as is usually the case and, second, in many cases n is very large and a much lower order LTI system can be a very good approximation of the original system. In such cases, constructing the order n estimate might be unnecessarily conservative (See Example 1). Consequently, the error bounds do not reﬂect accurate dependence on the system parameters.

T. Sarkar and A. Rakhlin and M. A. Dahleh

When n is unknown, it is unclear when a singular value decomposition should be performed to obtain the parameter estimates via Ho-Kalman algorithm. This leads to the question of model order selection from data. For subspace based methods, such problems have been addressed in Shibata (1976) and Bauer (2001). These papers address the question of estimating order in the context of subspace methods. Speciﬁcally, order estimation is achieved by analyzing the information contained in the estimated singular values and/or estimated innovation variance. Furthermore, they provide guarantees for asymptotic consistency of the methods described. It is unclear, however, if these techniques and guarantees can be extended to the case when only ﬁnite data is available. Another line of literature studied in Ljung et al. (2015) for example, approaches the identiﬁcation of systems with unknown order by ﬁrst learning the largest possible model that ﬁts the data and then performing model reduction to obtain the ﬁnal system. Although one can show that asymptotically this method outputs the true model, we show that such a two step procedure may underperform in a ﬁnite time setting. A possible explanation for this could be that learning the largest possible model with ﬁnite data over-ﬁts on the exogenous noise and therefore gives poor model estimates. The key diﬀerence from prior work is that we provide a direct approach to model selection, instead of learning the largest possible model from data and subsequent model truncation, and provide ﬁnite time guarantees. Other related work on identifying ﬁnite impulse response approximations include Goldenshluger (1998); Tu et al. (2017); but they do not discuss parameter estimation or reduced order modeling. Several authors Campi and Weyer (2002); Shah et al. (2012); Hardt et al. (2016) and references therein have studied the problem of system identiﬁcation in diﬀerent contexts. However, they fail to capture the correct dependence of system parameters on error rates. More importantly, they suﬀer from the same limitation as Oymak and Ozay (2018) that they require the knowledge of n.

2. Mathematical Preliminaries

Throughout the paper, we will refer to an LTI system with dynamics as Eq. (1) by M = (C, A, B). For a matrix A, let σi(A) be the ith singular value of A with σi(A) σi+1(A). Further, σmax(A) = σ1(A) = σ(A). Similarly, we deﬁne ρi(A) = |λi(A)|, where λi(A) is an eigenvalue of A with ρi(A) ρi+1(A). Again, ρmax(A) = ρ1(A) = ρ(A).

Deﬁnition 1 A matrix A is Schur stable if ρmax(A) < 1.

We will only be interested in the class of LTI systems that are Schur stable. Fix γ > 0 (and possibly much greater than 1). The model class Mr of LTI systems parametrized by r Z+ is deﬁned as

Mr = {(C, A, B) | C Rp r, A Rr r, B Rr m, ρ(A) < 1, σ(A) γ}. (4)

Finite Time LTI System Identification

Deﬁnition 2 The (k, p, q) dimensional Hankel matrix for M = (C, A, B) as

Hk,p,q(M) =

CAk B CAk+1B . . . CAq+k 1B CAk+1B CAk+2B . . . CAq+k B ... ... ... ... CAp+k 1B . . . . . . CAp+q+k 2B

and its associated Toeplitz matrix as

0 0 . . . 0 0 CAk B 0 . . . 0 0 ... ... ... ... 0 CAd+k 3B . . . CAk B 0 0 CAd+k 2B CAd+k 3B . . . CAk B 0

We will slightly abuse notation by referring to Hk,p,q(M) = Hk,p,q. Similarly for the Toeplitz matrices Tk,d(M) = Tk,d. The matrix H0, , (M) is known as the system Hankel matrix corresponding to M, and its rank is known as the model order (or simply order) of M. The system Hankel matrix has two well-known properties that make it useful for system identiﬁcation. First, the rank of H0, , has an upper bound n. Second, it maps the past inputs to future outputs. These properties are discussed in detail in appendix as Section 9.2. For inﬁnite matrices H0, , , ||H0, , ||2 ||H0, , ||op, i.e., the operator norm.

Deﬁnition 3 The transfer function of M = (C, A, B) is given by G(z) = C(z I A) 1B where z C.

The transfer function plays a critical role in control theory as it relates the input to the output. Succinctly, the transfer function of an LTI system is the Z transform of the output in response to a unit impulse input. Since for any invertible S the LTI systems M1 = (CS 1, SAS 1, SB), M2 = (C, A, B) have identical transfer functions, identiﬁcation may not be unique, but equivalent up to a transformation S, i.e., (C, A, B) (CS, S 1AS, S 1B). Next, we deﬁne a system norm that will be important from the perspective of model identiﬁcation and approximation.

Deﬁnition 4 The H system norm of a Schur stable LTI system M is given by

||M|| = sup ω R σmax(G(ejω)).

Here, G( ) is the transfer function of M. The r truncation of the transfer function is deﬁned as Gr := [CB, CAB, . . . , CAr 1B]. (5)

For a stable LTI system M we have

Proposition 2.1 (Lemma 2.2 Glover (1987)) Let M be a LTI system then

||M||H= σ1 ||M|| 2(σ1 + . . . + σn)

where σi are the singular values of H0, , (M).

T. Sarkar and A. Rakhlin and M. A. Dahleh

For any matrix Z, deﬁne Zm:n,p:q as the submatrix including row m to n and column p to q. Further, Zm:n,: is the submatrix including row m to n and all columns and a similar notion exists for Z:,p:q. Finally, we deﬁne balanced truncated models which will play an important role in our algorithm.

Deﬁnition 5 (Kung and Lin (1981)) Let H0, , (M) = UΣV where Σ Rn n (n is the model order). Then for any r n, the r order balanced truncated model parameters are given by

Cr = [UΣ1/2]1:p,1:r, Ar = Σ 1/2 1:r,1:r U :,1:r[UΣ1/2]p+1:,1:r, Br = [Σ1/2V ]1:r,1:m.

For r > n, the r order balanced truncated model parameters are the n order truncated model parameters.

Deﬁnition 6 We say a random vector v Rd is subgaussian with variance proxy τ 2 if

sup ||θ||2=1 sup p 1

n p 1/2 (E[| v, θ |p])1/po = τ

and E[v] = 0. We denote this by v subg(τ 2).

A fundamental result in model reduction from systems theory is the following

Theorem 2.1 (Theorem 21.26 Zhou et al. (1996)) Let M = (C, A, B) be the true model of order n and Mr = (Cr, Ar, Br) be its balance truncated model of order r < n. Assume that σr = σr+1. Then

||M Mr|| 2(σr+1 + σr+2 + . . . + σn)

where σi are the Hankel singular values of M.

Critical to obtaining reﬁned error rates, will be a result from the theory of self normalized martingales, an application of the pseudo-maximization technique in (Pe na et al., 2008, Theorem 14.7):

Theorem 2.2 Let {Ft} t=0 be a ﬁltration. Let {ηt Rm, Xt Rd} t=1 be stochastic processes such that ηt, Xt are Ft measurable and ηt is Ft 1-conditionally subg(L2) for some L > 0. For any t 0, deﬁne Vt = Pt s=1 Xs X s, St = Pt s=1 ηs+1Xs. Then for any δ > 0, V 0 and all t 0 we have with probability at least 1 δ

S t (V + Vt) 1St 4L2 log 1

δ + log det(V + Vt)

det(V ) + m .

The proof of this result can be found as Theorem 8.5. We denote by c universal constants which can change from line to line. For numbers a, b, we deﬁne a b min (a, b) and a b max (a, b).

Finally, for two matrices M1 Rl1 l1, M2 Rl2 l2 with l1 < l2, M1 M2 M1 M2

where M1 = M1 0l1 l2 l1 0l2 l1 l1 0l2 l1 l2 l1

Finite Time LTI System Identification

Proposition 2.2 (System Reduction) Let ||S P|| ϵ and the singular values of S be arranged as follows:

σ1(S) > . . . > σr 1(S) > σr(S) σr+1(S) . . . σs(S) > σs+1(S) > . . . σn(S) > σn+1(S) = 0

Furthermore, let ϵ be such that

ϵ inf {1 i r 1} {s+1 i n}

σi(P) σi+1(P)

Deﬁne K0 = [1, 2, . . . , r 1] [s + 1, s + 2, . . . , n], then

||US K0(ΣS K0)1/2 UP K0(ΣP K0)1/2||2 2

(σi σi+1)2 (σi 1 σi)2

((σr 1 σs) (σr σs+1))2 + sup 1 i s | σi p

and σi = σi(S), ˆσi = σi(P).

The proof is provided in Proposition 12.4 in the appendix. This is an extension of Wedin s result that allows us to scale the recovery error of the rth singular vector by only condition number of that singular vector. This is useful to represent the error of identifying a r-order approximation as a function of the rth-singular value only. We brieﬂy summarize our contributions below.

3. Contributions

In this paper we provide a purely data-driven approach to system identiﬁcation from a single time series of ﬁnite noisy data. Drawing from tools in systems theory and the theory of self normalized martingales, we oﬀer a nearly optimal OLS-based algorithm to learn the system parameters. We summarize our contributions below:

The central theme of our approach is to estimate the inﬁnite system Hankel matrix (to be deﬁned below) with increasing accuracy as the length T of data grows. By utilizing a speciﬁc reformulation of the input output relation in Eq. (1) we reduce the problem of Hankel matrix identiﬁcation to that of regression between appropriately transformed versions of output and input. The OLS solution is a matrix ˆH of size ˆd. More precisely, we show that with probability at least 1 δ,

ˆH H0, ˆd, ˆd

p ˆd + log T

for T above a certain threshold, where H0, ˆd, ˆd is the p ˆd m ˆd principal submatrix of the system Hankel. Here β is the H system norm.

We show that by growing ˆd with T in a speciﬁc fashion, ˆH becomes the minimax optimal estimator of the system Hankel matrix. The choice of ˆd for a ﬁxed T is purely datadependent and does not depend on spectral radius of A or n.

T. Sarkar and A. Rakhlin and M. A. Dahleh

It is well known in systems theory that SVD of the doubly inﬁnite system Hankel matrix gives us A, B, C. However, the presence of ﬁnite noisy data prevents learning these parameters accurately. We show that it is always possible to learn the parameters of a lower-order approximation of the underlying system. This is achieved by selecting the top k singular vectors of ˆH. The estimation guarantee corresponds to model selection in Statistics. More precisely, for every k ˆd if (Ak, Bk, Ck) are the parameters of a k-order balanced approximation of the original LTI system and ( ˆAk, ˆBk, ˆCk) are the estimates of our algorithm then for T above a certain threshold we have

||Ck ˆCk||2+||Ak ˆAk||2+||Bk ˆBk||2

β2 ˆd ˆσ2 k T

p ˆd + log T

with probability at least 1 δ where ˆσi is the ith largest singular value of ˆH.

4. Problem Formulation and Discussion

4.1 Data Generation

Assume there exists an unknown M = (C, A, B) Mn for some unknown n. Let the transfer function of M be G(z). Suppose we observe the noisy output time series {Yt Rp 1}T t=1 in response to user chosen input series, {Ut Rm 1}T t=1. We refer to this data generated by M as ZT = {(Ut, Yt)}T t=1. We enforce the following assumptions on M.

Assumption 1 The noise process {ηt, wt} t=1 in the dynamics of M given by Eq. (1) are i.i.d. and ηt, wt are isotropic with sub Gaussian parameter 1. Furthermore, X0 = 0 almost surely. We will only select inputs, {Ut}T t=1, that are isotropic sub Gaussian with sub Gaussian parameter 1.

The input output map of Eq. (1) can be represented in multiple alternate ways. One commonly used reformulation of the input output map in systems and control theory is the following

Y1 Y2 ... YT

U1 U2 ... UT

η1 η2 ... ηT

w1 w2 ... w T

where T Ok,d is deﬁned as the Toeplitz matrix corresponding to process noise ηt (similar to Deﬁnition 2):

0 0 . . . 0 0 CAk 0 . . . 0 0 ... ... ... ... 0 CAd+k 3 . . . CAk 0 0 CAd+k 2 CAd+k 3 . . . CAk 0

||T0,T ||2, ||T O0,T ||2 denote observed ampliﬁcations of the control input and process noise respectively. Note that stability of A ensures ||T0, ||2, ||T O0, ||2< . Suppose both

Finite Time LTI System Identification

m: Input dimension, p: Output dimension γ: Known upper bound on ||A||2 δ: Error probability c, C: Known absolute constants R: Known noise to signal ratio, or, ||T O0, ||2

||T0, ||2 β: Known upper bound on H -norm of LTI system D(T) = {d|T cm2d log2 (d) log2 (m2/δ) + cd log3 (2d)} σA = Pd l=1||CAl B||2, σB = Pd l=1||CAl||2

σ Pd k=1 T d+k,T Td+k,T , σD = r

σ Pd k=1 T O d+k,T T Od+k,T

lp+log (T/δ)+m

Table 1: Summary of constants

ηt, wt = 0 in Eq. (1). Then it is a well-known fact that

||M|| = sup Ut

s P t=0 Y t Yt P t=0 U t Ut = ||M|| = ||T0, ||2 ||H0, , ||2. (7)

Assumption 2 There exist universal constants β, R 1 such that ||T0, ||2 β, ||T O0, ||2

||T0, ||2 R.

Remark 7 (H -norm estimation) Assumption 2 implies that an upper bound to the H norm of the system. It is possible to estimate ||M|| from data (See Tu et al. (2018a) and references therein). It is reasonable to expect that error rates for identiﬁcation of the parameters (C, A, B) depend on the noise-to-signal ratio ||T O0, ||2

||T0, ||2 , i.e., identiﬁcation is much harder when the ratio is large.

Remark 8 (R estimation) The noise to signal ratio hyperparameter can also be estimated from data, by allowing the system to run with Ut = 0 and taking the average ℓ2 norm of the output Yt, i.e., (1/T) PT t=1 Yt 2 2. For the purpose of the results of the paper we simply assume an upper bound on R. If Ut was subg(L) instead of subg(1), the noise-to-signal ratio is modiﬁed to R/L instead.

T. Sarkar and A. Rakhlin and M. A. Dahleh

5. Algorithmic Details

We will now represent the input output relationship in terms of the Hankel and Toeplitz matrices deﬁned before. Fix a d, then for any l we have

Yl Yl+1 ... Yl+d 1

Ul 1 Ul 2 ... Ul d

Ul Ul+1 ... Ul+d 1

ηl 1 ηl 2 ... ηl d+1

ηl ηl+1 ... ηl+d 1

+ Hd,d,l d 1

Ul d 1 Ul d 1 ... U1

+ Od,d,l d 1

ηl d 1 ηl d 1 ... η1

wl wl+1 ... wl+d 1

or, succinctly,

Y + l,d = H0,d,d U l 1,d + T0,d U+ l,d + Hd,d,l d 1 U l d 1,l d 1 + O0,d,d η l 1,d + T O0,d η+ l,d + Od,d,l d 1 η l d 1,l d 1 + w+ l,d (9)

CAk CAk+1 . . . CAq+k 1

CAk+1 CAk+2 . . . CAd+k ... ... ... ... CAp+k 1 . . . . . . CAp+q+k 2

Yl Yl 1 ... Yl d+1

, Y + l,d =

Yl Yl+1 ... Yl+d 1

Furthermore, U l,d, η l,d are deﬁned similar to Y l,d and U+ l,d, η+ l,d, w+ l,d are similar to Y + l,d. The + and signs indicate moving forward and backward in time respectively. This representation will be at the center of our analysis. There are three key steps in our algorithm which we describe in the following sections:

(a) Hankel submatrix estimation: Estimating H0,l,l for every 1 l T. We refer to the estimators as { ˆH0,l,l}T l=1.

(b) Model Selection: From the estimators { ˆH0,l,l}T l=1 select ˆH0, ˆd, ˆd in a data dependent way such that it best estimates H0, , .

(c) Parameter Recovery: For every k ˆd, we do a singular value decomposition of ˆH0, ˆd, ˆd to obtain parameter estimates for a good k-order approximation of the true model.

5.1 Hankel Submatrix Estimation

The goal of our systems identiﬁcation is to estimate either H0,n,n or H0, , . Since we only have ﬁnite data and no apriori knowledge of n it is not possible to directly estimate the unknown matrices. The ﬁrst step then is to estimate all possible Hankel submatrices that are allowed by data, i.e., H0,d,d for d T. For a ﬁxed d, Algorithm 1 estimates the d d principal submatrix H0,d,d.

Finite Time LTI System Identification

Algorithm 1 Learn System(T, d, m, p) Input T = Horizon for learning d = Hankel Size m = Input dimension p = Output dimension Output System Parameters: ˆH0,d,d

1: Generate 2T i.i.d. inputs {Uj N(0, Im m)}2T j=1.

2: Collect 2T input output pairs {Uj, Yj}2T j=1.

3: ˆH0,d,d = arg min H PT 1 l=0 || Y + l+d+1,d H U l+d,d||2 2

4: return ˆH0,d,d

It can be shown that

ˆH0,d,d = T 1 X

l=0 Y + l+d+1,d( U l+d,d) T 1 X

l=0 U l+d,d( U l+d,d) + (10)

and by running the algorithm T times, we obtain { ˆH0,d,,d}T d=1. A key step in showing that ˆH0,d,d is a good estimator for H0,d,d is to prove the ﬁnite time isometry of VT = PT 1 l=0 U l+d,d( U l+d,d) , i.e., the sample covariance matrix.

Lemma 5.1 Deﬁne

T0(δ, d) = cm2d log2 (d) log2 (m2/δ) + cd log3 (2d)

where c is some universal constant. Deﬁne the sample covariance matrix VT := PT 1 l=0 U l+d,d( U l+d,d) . We have with probability 1 δ and for T > T0(δ, d)

Lemma 5.1 allows us to write Eq. (10) as ˆH0,d,d = PT 1 l=0 Y + l+d+1,d( U l+d,d) PT 1 l=0 U l+d,d( U l+d,d) 1

with high probability and upper bound estimation error for d d principal submatrix.

Theorem 5.1 Fix d and let ˆH0,d,d be the output of Algorithm 1. Then for any 0 < δ < 1 and T T0(δ, d), we have with probability at least 1 δ

ˆH0,d,d H0,d,d 2 4σ

Here T0(δ, d) = cm2d log2 (d) log2 (m2/δ) + cd log3 (2d), c is a universal constant and σ = max (σA, σB, σC, σD) from Table 1.

Proof We outline the proof here. Recall Eq. (8), (9). Then for a ﬁxed d

ˆH0,d,d = T 1 X

l=0 Y + l+d+1,d( U l+d,d) V + T .

T. Sarkar and A. Rakhlin and M. A. Dahleh

Then the identiﬁcation error is

ˆH0,d,d H0,d,d 2 = V + T T 1 X

l=0 U l+d,d U+ l+d+1,d T 0,d + U l+d,d U l,l H d,d,l + U l+d,d w+ l+d+1,d

+ U l+d,d η l+d,d O 0,d,d + U l+d,d η+ l+d+1,d T O 0,d + U l+d,d η l,l O d,d,l 2 = ||V + T E||2 (12)

l=0 U l+d,d U+ l+d+1,d T 0,d + U l+d,d U l,l H d,d,l + U l+d,d w+ l+d+1,d

+ U l+d,d η l+d,d O 0,d,d + U l+d,d η+ l+d+1,d T O 0,d + U l+d,d η l,l O d,d,l.

By Lemma 5.1 we have, whenever T T0(δ, d), with probability at least 1 δ

This ensures that, with high probability, that V 1 T exists and decays as O(T 1). The next step involves showing that ||E||2 grows at most as

T with high probability. This is reminiscent of Theorem 2.2 and the theory of self normalized martingales. However, unlike that cases the conditional sub-Gaussianity requirements do not hold here. For example, let Fl = σ(η1, . . . , ηl) then E[v η l+1,l+1|Fl] = 0 for all v since { η l+1,l+1}T 1 l=0 is not an independent sequence. As a result it is not immediately obvious on how to apply Theorem 2.2 to our case. Under the event when Eq. (13) holds (which happens with high probability), a careful analysis of the normalized cross terms, i.e., V 1/2 T E shows that ||V 1/2 T E||2= O(1) with high probability. This is summarized in Propositions 11.1-11.3. The idea is to decompose E into a linear combination of independent subgaussians and reduce it to a form where we can apply Theorem 2.2. This comes at the cost of additional scaling in the form of system dependent constants such as the H norm. Then we can conclude with high probability that || ˆH H0,d,d||2 ||V 1/2 T ||2||V 1/2 T E||2 T 1/2O(1). The full proof has been deferred to Section 11.1 in Appendix 11.

Remark 9 Recall D(T) from Table 1. Since

d D(T) = T T0(δ, d)

we can restate Theorem 5.1 as follows: for a ﬁxed T, we have with probability at least 1 δ that ˆH0,d,d H0,d,d 2 4σ

when d D(T).

Finite Time LTI System Identification

We next present bounds on σ in Theorem 5.1. From the perspective of model selection in later sections, we require that σ be known. In the next proposition we present two bounds on σ, the ﬁrst one depends on unknown parameters and recovers the precise dependence on d. The second bound is an apriori known upper bound and incurs an additional factor of

Proposition 5.1 σ upper bound independent of d:

σ cn (1 ρ(A))2

where cn depends only on n.

σ upper bound dependent on d:

where R is the noise-to-signal ratio as in Table 1

Proof By Gelfand s formula, since ||Ad||2 c(n)ρmax(A)d where ρmax(A) < 1 and c(n) is a constant that only depends on n, it implies that

l=0 ||CAl B||2

l=0 ||CAl B||2

l=0 c(n)ρ(A)l = c(n) 1 ρ(A),

||Td+k,T ||2

l=0 ||CAd+k+l B||2 c(n)ρ(A)d+k

k=1 T d+k,T Td+k,T

(1 ρ(A))2 c(n) (1 ρ(A))2 .

Similarly, there exists a ﬁnite upper bound on σB, σD by replacing CAl B and Td+k,T with CAl and T Od+k,T respectively. For the d independent upper bound, we have

l=0 ||CAl B||2

l=0 ||CAl B||2 2

Since σ T d+k,T Td+k,T β, then

k=1 T d+k,T Td+k,T

T. Sarkar and A. Rakhlin and M. A. Dahleh

For the σB, σD we get an extra R because T O0, βR.

The key feature of the data dependent upper bound is that it only depends on β and R which are known apriori. Recall that Gd = [CB, CAB, . . . , CAd 1B], i.e., the d-order FIR truncation of G(z). Since the p rows of the H0,d,d matrix corresponds to Gd we can obtain estimators for any d-order FIR.

Corollary 5.1 Let b Gd = ˆH0,d,d[1 : p, :] denote the ﬁrst p-rows of ˆH0,d,d. Then for any 0 < δ < 1 and T T0(δ, d), we have with probability at least 1 δ,

|| b Gd Gd||2 4σ

Proof Proof follows because Gd = H0,d,d[1 : p, :] and Theorem 5.1.

Next, we show that the error in Theorem 5.1 is minimax optimal (up to logarithmic factors) and cannot be improved by any estimation method.

Proposition 5.2 Let

T c where c is an absolute constant. Then for any estimator ˆH of H0, , we have

sup ˆ H E[|| ˆH H0, , ||2] cn

where cn > 0 is a constant that is independent of T but can depend on system level parameters.

Proof Assume the contrary that

sup ˆ H E[|| ˆH H0, , ||2] = o

Then recall that [H0, , ]1:p,: = [CB, CAB, . . . , ] and G(z) = z 1CB + z 2CAB + . . .. Similarly we have ˆG(z). Deﬁne

k=0 ||CAk B ˆC ˆAk ˆB||2 2.

If sup ˆ H E[|| ˆH H0, , ||2] = o q

T , then since || ˆH H0, , ||2 ||G ˆG||2 we can conclude that

E[||G ˆG||2] = o

which contradicts Theorem 5 in (Goldenshluger, 1998). Thus, sup ˆ H E[|| ˆH H0, , ||2]

Finite Time LTI System Identification

5.2 Model Selection

At a high level, we want to choose ˆH0, ˆd, ˆd from { ˆH0,d,d}T d=1 such that ˆH0, ˆd, ˆd is a good estimator of H0, , . Our idea of model selection is motivated by (Goldenshluger, 1998). For any ˆH0,d,d, the error from H0, , can be broken as:

|| ˆH0,d,d H0, , ||2 || ˆH0,d,d H0,d,d||2 | {z } =Estimation Error

+ ||H0,d,d H0, , ||2 | {z } =Truncation Error

We would like to select a d = ˆd such that it balances the truncation and estimation error in the following way:

c2 Data dependent upper bound c1 Estimation Error Truncation Error

where ci are absolute constants. Such a balancing ensures that

|| ˆH0, ˆd, ˆd H0, , ||2 c2 (1/c1 + 1) Data dependent upper bound . (14)

Note that such a balancing is possible because the estimation error increases as d grows and truncation error decreases with d. Furthermore, a data dependent upper bound for estimation error can be obtained from Theorem 5.1. Unfortunately (C, A, B) are unknown and it is not immediately clear on how to obtain such a bound for truncation error. To achieve this, we ﬁrst deﬁne a truncation error proxy, i.e., how much do we truncate if a speciﬁc ˆH0,d,d is used. For a given d, we look at || ˆH0,d,d ˆH0,l,l||2 for l D(T) d. This measures the additional error incurred if we choose ˆH0,d,d as an estimator for H0, , instead of ˆH0,l,l for l > d. Then we pick ˆd as follows:

|| ˆH0,d,d ˆH0,l,l||2 16βR α(l) l D(T) d

Recall that α(l) = q

l log (l/δ)+pl2+ml

T , where q

log (l/δ)+pl+m

T denotes how much estimation

error is incurred in learning l l Hankel submatrix, the extra β

l is incurred because we need a data dependent, albeit coarse, upper bound on the estimation error. A key step will be to show that for any l d, whenever

|| ˆH0,d,d ˆH0,l,l||2 cβR α(l)

ensures that

|| ˆH0,d,d H0, , ||2 cβR α(l) and || ˆH0,l,l H0, , ||2 cβR α(l)

and there is no gain in choosing a larger Hankel submatrix estimate. By picking the smallest d for which such a property holds for all larger Hankel submatrices, we ensure that a regularized model is estimated that agrees with the data.

T. Sarkar and A. Rakhlin and M. A. Dahleh

Algorithm 2 Choice of d

Output ˆH0, ˆd, ˆd, ˆd

1: D(T) = n d d T cm2 log3 (Tm/δ)

m+hp+log( T

2: d0(T, δ) = inf n l || ˆH0,l,l ˆH0,h,h||2 16βR(α(h) + 2α(l)) h D(T), h l o .

3: ˆd = max d0(T, δ), log T

4: return ˆH0, ˆd, ˆd, ˆd

We now state the main estimation result for H0, , for d = ˆd as chosen in Algorithm 2. Deﬁne

T (δ) = inf n T d (T, δ) D(T), d (T, δ) 2d

256, δ o (16)

d (T, δ) = inf

16βRα(d) ||H0,d,d H0, , ||2

A close look at Eq. (17) reveals that picking d = d (T, δ) ensures the balancing of Eq. (14). However, d (T, δ) depends on unknown quantities and is unknown. In such a case, ˆd in Eq. (15) becomes a proxy for d (T, δ). From an algorithmic stand point, we no longer need any unknown information; the unknown parameter only appear in T (δ), which is only required to make the theoretical guarantee of Theorem 5.2 below.

Theorem 5.2 Whenever we have T T (δ) we have with probability at least 1 δ that

|| ˆH0, ˆd, ˆd H0, , ||2 12cβR

m ˆd + p ˆd2 + ˆd log T

The proof of Theorem 5.2 can be found as Proposition 13.8 in Appendix 13. We see that the error between ˆH0, ˆd, ˆd and H0, , can be upper bounded by a purely data dependent

quantity. The next proposition shows that ˆd does not grow more that logarithmically in T.

Proposition 5.3 Let T T (δ), d (T, δ) be as in Eq. (17). Then with probability at least 1 δ we have ˆd d (T, δ) log T

Furthermore,

d (T, δ) c log (c T + log 1

δ) log R + log β log 1 ρ(A) .

The eﬀect of unknown quantities, such as the spectral radius, are subsumed in the ﬁnite time condition T T (δ) and appear in an upper bound for ˆd; however this information is not needed from an algorithmic perspective as the selection of ˆd is agnostic to the knowledge of ρ(A). The proof of proposition can be found as Propositions 13.7 and 13.4.

Finite Time LTI System Identification

5.3 Parameter Recovery

Next we discuss ﬁnding the system parameters. To obtain system parameters we use a balanced truncation algorithm on ˆH0, ˆd, ˆd where ˆd is the output of Algorithm 2. The details

are summarized in Algorithm 3 where H = ˆH0, ˆd, ˆd.

Algorithm 3 Hankel2Sys(T, ˆd, k, m, p) Input T = Horizon for Learning ˆd = Hankel Size m = Input dimension p = Output dimension Output System Parameters: ( ˆC ˆd, ˆA ˆd, ˆB ˆd)

1: H = H0, ˆd, ˆd 2: Pad H with zeros to make of dimension 4p ˆd 4m ˆd

3: U, Σ, V SVD of H

4: U ˆd, V ˆd top ˆd singular vectors

5: ˆC ˆd ﬁrst p rows of U ˆdΣ1/2 ˆd 6: ˆB ˆd ﬁrst m columns of Σ1/2 ˆd V ˆd 7: Z0 = [U ˆdΣ1/2 ˆd ]1:4p ˆd p,:, Z1 = [U ˆdΣ1/2 ˆd ]p+1:,:

8: ˆA ˆd (Z 0 Z0) 1Z 0 Z1.

9: return ( ˆC ˆd, ˆA ˆd, ˆB ˆd)

To state the main result we deﬁne a quantity that measures the singular value weighted subspace gap of a matrix S:

Γ(S, ϵ) = q

σ1max/ζ2 1 + σ2max/ζ2 2 + . . . + σlmax/ζ2 l ,

where S = UΣV and Σ is arranged into blocks of singular values such that in each block i we have supj σi j σi j+1 ϵ, i.e.,

Λ1 0 . . . 0 0 Λ2 . . . 0 ... ... ... 0 0 0 . . . Λl

where Λi are diagonal matrices, σi j is the jth singular value in the block Λi and σi min, σi max are the minimum and maximum singular values of block i respectively. Furthermore,

ζi = min (σi 1 min σi max, σi min σi+1 max)

for 1 < i < l, ζ1 = σ1 min σ2 max and ζl = min (σl 1 min σl max, σl min). Informally, the ζi measure the singular value gaps between each blocks. It should be noted that l, the number of separated blocks, is a function of ϵ itself. For example: if ϵ = 0 then the number of blocks correspond to the number of distinct singular values. On the other hand, if ϵ is very large then l = 1.

T. Sarkar and A. Rakhlin and M. A. Dahleh

Theorem 5.3 Let M be the true unknown model and

m ˆd + p ˆd2 + ˆd log T

Then whenever T T (δ), we have with probability at least 1 δ:

||C ˆd ˆC ˆd||2

||B ˆd ˆB ˆd||2

||A ˆd ˆA ˆd||2

γϵΓ( ˆH0, ˆd, ˆd, 2ϵ) + γ sup 1 i ˆd

+ γ ϵ pˆσ ˆdϵ pˆσ ˆd

where sup1 i ˆd p

ˆσ ˆ d ϵ ˆd p

2 ˆdϵ and γ = max (4γ, 8).

Theorem 5.3 holds for all k ˆd and proof follows directly from Theorem 13.8 where we show

|| ˆH0, ˆd, ˆd H0, , ||2 ϵ

and Proposition 14.2. Theorem 5.3 provides an error bound between parameters (of model order ˆd) when true order is unknown. The subspace gap measure, Γ( ˆH0, ˆd, ˆd, 2ϵ), is bounded

even when ϵ = 0. To see this, note that when ϵ = 0, ˆH0, ˆd, ˆd corresponds exactly to H0, ˆd, ˆd. In that case, the number of blocks correspond to the number of distinct singular values of H0, ˆd, ˆd, and ζni then corresponds to singular value gap between the unequal singular values. As a

result Γ( ˆH0, ˆd, ˆd, 2ϵ) = < . Then the bound decays as ϵ = O q

ˆd2/T for singular values

ˆσ ˆd > ˆdϵ, but for much smaller singular values the bound decays as ϵ = O d2/T 1/4 . To shed more light on the behavior of our bounds, we consider the special case of known order. If n is the model order, then we can set ˆd = n. If σi = σi(H0, , ), then for large enough T one can ensure that

min σi =σi+1(σi σi+1)/2 > ϵ,

i.e., ϵ is less than the singular value gap and small enough that the spectrum of ˆH0,n,n is very close to that of H0, , . Consequently ˆσn σn/2 and we have that

||Cn ˆCn||2

||Bn ˆBn||2

||An ˆAn||2

γϵ + γϵ/ σn = cβ γR

pn2 + n log T

This upper bound is (nearly) identical to the bounds obtained in Oymak and Ozay (2018) for the known order case. We get an improvement in the bounds when σn 1

n, which is a consequence of the fact that we know where to threshold our Hankel matrix. The major advantage of our result is that we do not require any information/assumption on the LTI system besides β. Nonparametric approaches to estimating β have been studied in Tu et al. (2017).

Finite Time LTI System Identification

5.4 Order Estimation Lower Bound

In Theorem 5.3 it is shown that whenever T = Ω 1 σ2 ˆ d

we can ﬁnd an accurate ˆd order

approximation. Now we show that if T = O 1 σ2 ˆ d

then there is always some non zero

probability with which we can not recover the singular vector corresponding to the σ ˆd+1. We prove the following lower bound for model order estimation when inputs {Ut}T t=1 are active and bounded which we deﬁne below

Deﬁnition 10 An input sequence {Ut}T t=1 is said to be active if Ut is allowed to depend on past history {Ul, Yl}t 1 l=1. The input sequence is bounded if E[U t Ut] 1 for all t.

Active inputs allow for the case when input selection can be adaptive due to feedback.

Theorem 5.4 Fix δ > 0, ζ (0, 1/2). Let M1, M2 be two LTI systems and σ(1) i , σ(2) i be

the ith-Hankel singular values respectively. Let σ(1) 1 σ(1) 2 2

ζ and σ(2) 2 = 0. Then whenever

sup M {M1,M2} PZT M(order( ˆ M(ZT )) = order(M)) δ

Here ZT = {Ut, Yt}T t=1 M means M generates T data points {Yt}T t=1 in response to active and bounded inputs {Ut}T t=1 and ˆ M(ZT ) is any estimator.

Proof The proof can be found in appendix in Section 15 and involves using Fano s (or Birge s) inequality to compute the minimax risk between the probability density functions generated by two diﬀerent LTI systems:

0 1 0 0 0 0 ζ 0 0

, A1 = A0, B0 =

, C0 = 0 0 βR , C1 = C0.

A0, A1 are Schur stable whenever |ζ|< 1.

Theorem 5.4 shows that when the time required to recover higher order models depends inversely on the condition number, where the condition number is the ratio of largest and least singular values of the Hankel matrix. Speciﬁcally, to correctly distinguish between an order 1 and order 2 model T Ω(2/ζ2) where ζ is the condition number of the 2-order model. We compare this to our upper bound in Theorem 5.3 and Eq. (18), assume Γ( ˆH0, ˆd, ˆd, 2ϵ)

for all ϵ [0, 1] and ˆdϵ ˆσ ˆd, then since parameter error, E, is upper bounded as

m ˆd + p ˆd2 + ˆd log T

we need T log T

T. Sarkar and A. Rakhlin and M. A. Dahleh

to correctly identify ˆd-order model. The ratio (β/σ ˆd) is equal to the condition number of the Hankel matrix. In this sense, the model selection followed by singular value thresholding is not too conservative in terms of R (the signal-to-noise ratio) and conditioning of the Hankel matrix.

6. Experiments

The experiments in this paper are for the single trajectory case. A detailed analysis for system identiﬁcation from multiple trajectories can be found in Tu et al. (2017). Suppose that the LTI system generating data, M, has transfer function given by

G(z) = α0 +

l=1 αlρlz l, ρ < 1 (20)

where αi N(0, 1). M is a ﬁnite dimensional LTI system or order 150 with parameters as M = (C R1 150, A R150 150, B R150 1). For these illustrations, we assume a balanced system and choose R = 1, δ = 0.05. We estimate β0.6 = 15, β0.9 = 40, β0.99 = 140, pick Ut N(0, 1) and {wt, ηt} {N(0, 1), N(0, I)} respectively. We note that our algorithm requires the knowledge of universal constant c. Theoretically, it can be shown that c < 100 but in practice a value c 16 works well for simulations.

Figure 1: Variation of Hankel size = ˆd with T for diﬀerent values of ρ

Fig. 1 shows how d = ˆd change with the number of data points for diﬀerent values of ρ. When ρ = 0.6, i.e., small, ˆd does not grow too big with T even when the number of data points is increased. This shows that a small model order is suﬃcient to specify system dynamics. On the other hand, when ρ = 0.99, i.e., closer to instability the ˆd required is much larger, indicating the need for a higher order. Although ˆd implicitly captures the eﬀect of spectral radius, the knowledge of ρ is not required for ˆd selection. In principle, our algorithm increases the Hankel size to the appropriate size as the data increases. We compare this to a deterministic growth policy d = log (T) and the SSREGEST

Finite Time LTI System Identification

algorithm Ljung et al. (2015). The SSREGEST algorithm ﬁrst learns a large model from data and then performs model reduction to obtain a ﬁnal model. In contrast, we go to reduced model directly by picking a small ˆd. This reduces the sensitivity to noise. In Fig. 2 shows the model errors for a deterministic growth policy d = log (T) and our algorithm. Although the diﬀerence is negligible when ρ = 0.6 (small), we see that our algorithm does better ρ = 0.99 due to its adaptive nature, i.e., ˆd responds faster for our algorithm.

Figure 2: Variation of ||M c Mk||op for diﬀerent values of ρ. Here k = ˆd for our algorithm and k = log (T). Furthermore, || ||op is the Hankel norm.

Finally, for the case when ρ = 0.9, β = 40, we show the model errors for SSREGEST and our algorithm as T increases. Although asymptotically both algorithms perform the same, it is clear that for small T our algorithm is more robust to the presence of noise.

T SSREGEST Our Algorithm 500 6.21 1.35 13.37 3.7 850 30.20 7.55 11.25 2.89 1200 26.80 8.94 9.83 2.60 1500 23.27 10.65 9.17 2.30 2000 26.38 12.88 7.70 1.60

7. Discussion

We propose a new approach to system identiﬁcation when we observe only ﬁnite noisy data. Typically, the order of an LTI system is large and unknown and a priori parametrizations may fail to yield accurate estimates of the underlying system. However, our results suggest that there always exists a lower order approximation of the original LTI system that can be learned with high probability. The central theme of our approach is to recover a good lower order approximation that can be accurately learned. Speciﬁcally, we show that identiﬁcation

T. Sarkar and A. Rakhlin and M. A. Dahleh

of such approximations is closely related to the singular values of the system Hankel matrix. In fact, the time required to learn a ˆd order approximation scales as T = Ω( β2

σ2 ˆ d ) where σ ˆd is

the ˆd the singular value of system Hankel matrix. This means that system identiﬁcation does not explicitly depend on the model order n, rather depends on n through σn. As a result, in the presence of ﬁnite data it is preferable to learn only the signiﬁcant (and perhaps much smaller) part of the system when n is very large and σn 1. Algorithm 1 and 3 provide a guided mechanism for learning the parameters of such signiﬁcant approximations with optimal rules for hyperparameter selection given in Algorithm 2. Future directions for our work include extending the existing low rank optimization-based identiﬁcation techniques, such as (Fazel et al., 2013; Grussler et al., 2018), which typically lack statistical guarantees. Since Hankel based operators occur quite naturally in general (not necessarily linear) dynamical systems, exploring if our methods could be extended for identiﬁcation of such systems appears to be an exciting direction.

Yasin Abbasi-Yadkori, D avid P al, and Csaba Szepesv ari. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312 2320, 2011.

Beh cet A cıkme se, John M Carson, and Lars Blackmore. Lossless convexiﬁcation of nonconvex control bound and pointing constraints of the soft landing optimal control problem. IEEE Transactions on Control Systems Technology, 21(6):2104 2113, 2013.

Anish Agarwal, Muhammad Jehangir Amjad, Devavrat Shah, and Dennis Shen. Time series analysis via matrix estimation. ar Xiv preprint ar Xiv:1802.09064, 2018.

Zeyuan Allen-Zhu and Yuanzhi Li. Lazysvd: Even faster svd decomposition yet without agonizing pain. In Advances in Neural Information Processing Systems, pages 974 982, 2016.

Dietmar Bauer. Order estimation for subspace methods. Automatica, 37(10):1561 1573, 2001.

St ephane Boucheron, G abor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford university press, 2013.

Marco C Campi and Erik Weyer. Finite sample properties of system identiﬁcation methods. IEEE Transactions on Automatic Control, 47(8):1329 1334, 2002.

Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the royal statistical society. Series B (methodological), pages 1 38, 1977.

Mohamad Kazem Shirani Faradonbeh, Ambuj Tewari, and George Michailidis. Finite time identiﬁcation in unstable linear systems. ar Xiv preprint ar Xiv:1710.01852, 2017.

Finite Time LTI System Identification

Maryam Fazel, Ting Kei Pong, Defeng Sun, and Paul Tseng. Hankel matrix rank minimization with applications to system identiﬁcation and realization. SIAM Journal on Matrix Analysis and Applications, 34(3):946 977, 2013.

Keith Glover. All optimal hankel-norm approximations of linear multivariable systems and their l -error bounds. International journal of control, 39(6):1115 1193, 1984.

Keith Glover. Model reduction: a tutorial on hankel-norm methods and lower bounds on l2 errors. IFAC Proceedings Volumes, 20(5):293 298, 1987.

Alexander Goldenshluger. Nonparametric estimation of transfer functions: rates of convergence and adaptation. IEEE Transactions on Information Theory, 44(2):644 658, 1998.

Christian Grussler, Anders Rantzer, and Pontus Giselsson. Low-rank optimization with convex constraints. IEEE Transactions on Automatic Control, 2018.

Moritz Hardt, Tengyu Ma, and Benjamin Recht. Gradient descent learns linear dynamical systems. ar Xiv preprint ar Xiv:1609.05191, 2016.

Elad Hazan, Holden Lee, Karan Singh, Cyril Zhang, and Yi Zhang. Spectral ﬁltering for general linear dynamical systems. ar Xiv preprint ar Xiv:1802.03981, 2018.

BL Ho and Rudolph E Kalman. Eﬀective construction of linear state-variable models from input/output functions. at-Automatisierungstechnik, 14(1-12):545 548, 1966.

S Kung and D Lin. Optimal hankel-norm model reductions: Multivariable systems. IEEE Transactions on Automatic Control, 26(4):832 852, 1981.

Lennart Ljung. System identiﬁcation: theory for the user. Prentice-hall, 1987.

Lennart Ljung, Rajiv Singh, and Tianshi Chen. Regularization features in the system identiﬁcation toolbox. IFAC-Papers On Line, 48(28):745 750, 2015.

Mark Meckes et al. On the spectral norm of a random toeplitz matrix. Electronic Communications in Probability, 12:315 325, 2007.

Samet Oymak and Necmiye Ozay. Non-asymptotic identiﬁcation of lti systems from a single trajectory. ar Xiv preprint ar Xiv:1806.05722, 2018.

Victor H Pe na, Tze Leung Lai, and Qi-Man Shao. Self-normalized processes: Limit theory and Statistical Applications. Springer Science & Business Media, 2008.

Tuhin Sarkar and Alexander Rakhlin. How fast can linear dynamical systems be learned? ar Xiv preprint ar Xiv:1812.0125, 2018.

Parikshit Shah, Badri Narayan Bhaskar, Gongguo Tang, and Benjamin Recht. Linear system identiﬁcation via atomic norm regularization. In 2012 IEEE 51st IEEE Conference on Decision and Control (CDC), pages 6265 6270. IEEE, 2012.

T. Sarkar and A. Rakhlin and M. A. Dahleh

Ritei Shibata. Selection of the order of an autoregressive model by akaike s information criterion. Biometrika, 63(1):117 126, 1976.

Max Simchowitz, Horia Mania, Stephen Tu, Michael I Jordan, and Benjamin Recht. Learning without mixing: Towards a sharp analysis of linear system identiﬁcation. ar Xiv preprint ar Xiv:1802.08334, 2018.

Stephen Tu, Ross Boczar, Andrew Packard, and Benjamin Recht. Non-asymptotic analysis of robust control from coarse-grained identiﬁcation. ar Xiv preprint ar Xiv:1707.04791, 2017.

Stephen Tu, Ross Boczar, and Benjamin Recht. On the approximation of toeplitz operators for nonparametric H norm estimation. In 2018 Annual American Control Conference (ACC), pages 1867 1872. IEEE, 2018a.

Stephen Tu, Ross Boczar, and Benjamin Recht. Minimax lower bounds for H -norm estimation. ar Xiv preprint ar Xiv:1809.10855, 2018b.

Eugene E Tyrtyshnikov. A brief introduction to numerical analysis. Springer Science & Business Media, 2012.

Sara van de Geer and Johannes Lederer. The bernstein orlicz norm and deviation inequalities. Probability theory and related ﬁelds, 157(1-2):225 250, 2013.

Aad W Van Der Vaart and Jon A Wellner. Weak convergence. In Weak convergence and empirical processes, pages 16 28. Springer, 1996.

Peter Van Overschee and BL De Moor. Subspace identiﬁcation for linear systems: Theory Implementation Applications. Springer Science & Business Media, 2012.

Saligrama R Venkatesh and Munther A Dahleh. On system identiﬁcation of complex systems from ﬁnite data. IEEE Transactions on Automatic Control, 46(2):235 257, 2001.

Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. ar Xiv preprint ar Xiv:1011.3027, 2010.

Per Ake Wedin. Perturbation bounds in connection with singular value decomposition. BIT Numerical Mathematics, 12(1):99 111, 1972.

K Zhou, JC Doyle, and K Glover. Robust and optimal control, 1996.

Finite Time LTI System Identification

8. Preliminaries

Theorem 8.1 (Theorem 5.39 Vershynin (2010)) if E is a T md matrix with independent sub Gaussian isotropic rows with sub Gaussian parameter 1 then with probability at least 1 2e ct2 we have

md t σmin(E)

Proposition 8.1 (Vershynin (2010)) We have for any ϵ < 1 and any w Sd 1 that

P(||M||> z) (1 + 2/ϵ)d P ||Mw||> z (1 ϵ)

Theorem 8.2 (Theorem 1 Meckes et al. (2007)) ] Suppose {Xi Rm} i=1 are independent, E[Xj] = 0 for all j, and Xij are independent subg(1) random variables. Then P(||Td|| cm d log 2d + t) e t2/d where

X0 X1 . . . Xd 1 X1 X0 . . . Xd 2 ... ... ... ... Xd 1 . . . . . . X0

Theorem 8.3 (Hanson Wright Inequality) Given a subgaussian vector X = [X1, X2, . . . , Xn] Rn with supi Xi ψ2 K. Then for any B Rn n and t 0

P XBX E[XBX ] t 2 exp max ct K2 B , ct2

Proposition 8.2 (Lecture 2 Tyrtyshnikov (2012)) Suppose that L is the lower triangular part of a matrix A Rd d. Then

L 2 log2 (2d) A 2.

Let ψ be a nondecreasing, convex function with ψ(0) = 0 and X a random variable. Then the Orlicz norm ||X||ψ is deﬁned as

||X||ψ= inf n α > 0 : E[ψ(|X|/α)] 1 o .

Let (B, d) be an arbitrary semi metric space. Denote by N(ϵ, d) is the minimal number of balls of radius ϵ needed to cover B.

Theorem 8.4 (Corollary 2.2.5 in Van Der Vaart and Wellner (1996)) The constant K can be chosen such that

||sup s,t |Xs Xt|||ψ K Z diam(B)

0 ψ 1(N(ϵ/2, d))dϵ

where diam(B) is the diameter of B and d(s, t) = ||Xs Xt||ψ.

T. Sarkar and A. Rakhlin and M. A. Dahleh

Theorem 8.5 (Theorem 1 in Abbasi-Yadkori et al. (2011)) Let {Ft} t=0 be a ﬁltration. Let {ηt Rm, Xt Rd} t=1 be stochastic processes such that ηt, Xt are Ft measurable and ηt is Ft 1-conditionally subg(L2) for some L > 0. For any t 0, deﬁne Vt = Pt s=1 Xs X s, St = Pt s=1 Xsη s+1. Then for any δ > 0, V 0 and all t 0 we have with probability at least 1 δ

S t (V + Vt) 1St 2L2 log 1

δ + log det(V + Vt)

det(V ) + m .

Proof Deﬁne M = (V + Vt) 1/2St. Now we use Proposition 8.1 and setting ϵ = 1/2,

P(||M||2> z) 5m P(||Mw||2> 2z)

for w Sm 1. Then we can use Theorem 1 in Abbasi-Yadkori et al. (2011), and with probability at least 1 δ we have

||Mw||2 2 2L2 log 1

δ + log det(V + Vt)

By δ 5 mδ, we have with probability at least 1 5 mδ

s m log (5) + log 1

δ + log det(V + Vt)

Then with probability at least 1 δ,

s m + log 1

δ + log det(V + Vt)

Lemma 8.1 For any M = (C, A, B), we have that

||Bv T m T ||=

v u u tσ d X

k=1 T d+k,T Td+k,T

Here Bv T m T is deﬁned as follows: β = H d,d,T v = [β 1 , β 2 , . . . , β T ] .

β 1 0 0 . . . β 2 β 1 0 . . . ... ... ... ... β T β T 1 . . . β 1

and ||v||2= 1.

Finite Time LTI System Identification

Proof For the matrix Bv we have

β 1 u1 β 1 u2 + β 2 u1 β 1 u3 + β 2 u2 + β 3 u1 ... β 1 u T + β 2 u T 1 + . . . + β T u1

CAd+1Bu1 CAd+2Bu1 ... CA2d Bu1

CAd+2Bu1 + CAd+1Bu2 CAd+3Bu1 + CAd+2Bu2 ... CA2d+1Bu1 + CA2d Bu2

CAT+d Bu1 + . . . + CAd+1Bu T CAT+d+2Bu1 + . . . + CAd+2Bu T ... CAT+2d 1Bu1 + . . . + CA2d Bu T

CAd+1Bu1 CAd+2Bu1 ... CA2d Bu1

CAd+2Bu1 + CAd+1Bu2 CAd+3Bu1 + CAd+2Bu2 ... CA2d+1Bu1 + CA2d Bu2

CAT+d Bu1 + . . . + CAd+1Bu T CAT+d+2Bu1 + . . . + CAd+2Bu T ... CAT+2d 1Bu1 + . . . + CA2d Bu T

CAd+1B 0 0 . . . 0 CAd+2B 0 0 . . . 0 ... ... ... ... ... CA2d B 0 0 . . . 0 CAd+2B CAd+1B 0 . . . 0 CAd+3B CAd+2B 0 . . . 0 ... ... ... ... ... CA2d+1B CA2d B 0 . . . 0 ... ... ... ... ... CAT+d 1B CAT+d B CAT+d 1B . . . CAd+1B CAT+d+2B CAT+d+1B CAT+d B . . . CAd+2B ... ... ... ... ... CAT+2d 1B CAT+2d 1B CAT+2d 2B . . . CA2d B

u1 u2 ... u T

T. Sarkar and A. Rakhlin and M. A. Dahleh

It is clear that ||V||2, ||u||2= 1 and for any matrix S, ||S|| does not change if we interchange rows of S. Then we have

CAd+1B 0 0 . . . 0 CAd+2B CAd+1B 0 . . . 0 ... ... ... ... ... CAT+d+1B CAT+d B CAT+d 1B . . . CAd+1B CAd+2B 0 0 . . . 0 CAd+3B CAd+2B 0 . . . 0 ... ... ... ... ... CAT+d+2B CAT+d+1B CAT+d B . . . CAd+2B ... ... ... ... ... CA2d B 0 0 . . . 0 CA2d+1B CA2d B 0 . . . 0 ... ... ... ... ... CAT+2d 1B CAT+2d 1B CAT+2d 2B . . . CA2d B

Td+1,T Td+2,T ... T2d,T

v u u tσ d X

k=1 T d+k,T Td+k,T

Proposition 8.3 (Lemma 4.1 Simchowitz et al. (2018)) Let S be an invertible matrix and κ(S) be its condition number. Then for a 1 4κ net of Sd 1 and an arbitrary matrix A, we have

||SA||2 2 sup v N 1

||v A||2 ||v S 1||2

Proof For any vector v N 1

4κ and w be such that ||SA||2= ||w A||2 ||w S 1||2 we have

||SA||2 ||v A||2

||v S 1||2 ||w A||2

||w S 1||2 ||v A||2 ||v S 1||2

||w S 1||2 ||v A||2 ||w S 1||2 + ||v A||2 ||w S 1||2 ||v A||2 ||v S 1||2

1 4κ||S 1||2 ||w S 1||2 + ||SA||2 ||v S 1||2

||w S 1||2 1

Finite Time LTI System Identification

9. Control and Systems Theory Preliminaries

9.1 Sylvester Matrix Equation

Deﬁne the discrete time Sylvester operator SA,B : Rn n Rn n

LA,B(X) = X AXB (21)

Then we have the following properties for LA,B( ).

Proposition 9.1 Let λi, µi be the eigenvalues of A, B then LA,B is invertible if and only if for all i, j λiµj = 1

Deﬁne the discrete time Lyapunov operator for a matrix A as LA,A ( ) = S 1 A,A ( ). Clearly it follows from Proposition 9.1 that whenever λmax(A) < 1 we have that the SA,A ( ) is an invertible operator. Now let Q 0 then

SA,A (Q) = X

= X = AXA + Q

k=0 Ak QA k (22)

Eq. (22) follows directly by substitution and by Proposition 9.1 is unique if ρ(A) < 1. Further, let Q1 Q2 0 and X1, X2 be the corresponding solutions to the Lyapunov operator then from Eq. (22) that

9.2 Properties of System Hankel matrix

Rank of system Hankel matrix: For M = (C, A, B) Mn, the system Hankel matrix, H0, , (M), can be decomposed as follows:

H0, , (M) =

C CA ... CAd ...

B AB . . . Ad B . . .

It follows from deﬁnition that rank(O), rank(R) n and as a result rank(OR) n. The system Hankel matrix rank, or rank(OR), which is also the model order(or

T. Sarkar and A. Rakhlin and M. A. Dahleh

simply order), captures the complexity of M. If SVD(H0, , ) = UΣV , then O = UΣ1/2S, R = S 1Σ1/2V . By noting that

CAl S = CS(S 1AS)l, S 1Al B = (S 1AS)l S 1B

we have obtained a way of recovering the system parameters (up to similarity transformations). Furthermore, H0, , uniquely (up to similarity transformation) recovers (C, A, B).

Mapping Past to Future: H0, , can also be viewed as an operator that maps past inputs to future outputs. In Eq. (1) assume that {ηt, wt} = 0. Then consider the following class of inputs Ut such that Ut = 0 for all t T but Ut may not be zero for t < T. Here T is chosen arbitrarily. Then

YT YT+1 YT+2 ...

| {z } Future

UT 1 UT 2 UT 3 ...

| {z } Past

9.3 Model Reduction

Given an LTI system M = (C, A, B) of order n with its doubly inﬁnite system Hankel matrix as H0, , . We are interested in ﬁnding the best k order lower dimensional approximation of M, i.e., for every k < n we would like to ﬁnd Mk of model order k such that ||M Mk|| is minimized. Systems theory gives us a class of model approximations, known as balanced truncated approximations, that provide strong theoretical guarantees (See Glover (1984) and Section 21.6 in Zhou et al. (1996)). We summarize some of the basics of model reduction below. Assume that M has distinct Hankel singular values. Recall that a model M = (C, A, B) is equivalent to M = (CS, S 1AS, S 1B) with respect to its transfer function. Deﬁne

Q = A QA + C C

P = APA + BB

For two positive deﬁnite matrices P, Q it is a known fact that there exist a transformation S such that S QS = S 1PS 1 = Σ where Σ is diagonal and the diagonal elements are decreasing. Further, σi is the ith singular value of H0, , . Then let A = S 1AS, C = CS, B = S 1B. Clearly f M = ( A, B, C) is equivalent to M and we have

Σ = A Σ A + C C

Σ = AΣ A + B B (25)

Here C, A, B is a balanced realization of M.

Finite Time LTI System Identification

Proposition 9.2 Let H0, , = UΣV . Here Σ 0 Rn n. Then

C = [UΣ1/2]1:p,: A = Σ 1/2U [UΣ1/2]p+1:,: B = [Σ1/2V ]:,1:m

The triple ( C, A, B) is a balanced realization of M. For any matrix L, L:,m:n (or Lm:n,:) denotes the submatrix with only columns (or rows) m through n.

Proof Let the SVD of H0, , = UΣV . Then M can constructed as follows: UΣ1/2, Σ1/2V

are of the form

CS CAS CA2S ...

, Σ1/2V = S 1B S 1AB S 1A2B . . .

where S is the transformation which gives us Eq. (25). This follows because

Σ1/2U UΣ1/2 =

k=0 S Ak C CAk S

k=0 S Ak S 1 S C CSS 1Ak S

k=0 Ak C C Ak = A Σ A + C C = Σ

Then C = UΣ1/2 1:p,: and

UΣ1/2 A = [UΣ1/2]p+1:,: A = Σ 1/2U [UΣ1/2]p+1:,:

We do a similar computation for B.

It should be noted that a balanced realization C, A, B is unique except when there are some Hankel singular values that are equal. To see this, assume that we have

σ1 > . . . > σr 1 > σr = σr+1 = . . . = σs > σs+1 > . . . σn

where s r > 0. For any unitary matrix Q R(s r+1) (s r+1), deﬁne Q0

I(r 1) (r 1) 0 0 0 Q 0 0 0 I(n s) (n s)

T. Sarkar and A. Rakhlin and M. A. Dahleh

Then every triple ( CQ0, Q 0 AQ0, Q 0 B) satisﬁes Eq. (25) and is a balanced realization. Let Mk = ( Ck, Akk, Bk) where

A = Akk A0k Ak0 A00

, B = Bk B0

, C = Ck C0 (27)

Here Akk is the k k submatrix and corresponding partitions of B, C. The realization Mk = ( Ck, Akk, Bk) is the k order balanced truncated model. Clearly M Mn which gives us C = Cnn, A = Ann, B = Bnn, i.e., the balanced version of the true model. We will show that for the balanced truncation model we only need to care about the top k singular vectors and not the entire model.

Proposition 9.3 For the k order balanced truncated model Mk, we only need top k singular values and singular vectors of H0, , .

Proof From the preceding discussion in Proposition 9.2 and Eq. (27) it is clear that the ﬁrst p k block submatrix of UΣ1/2 (corresponding to the top k singular vectors) gives us Ck. Since A = Σ 1/2U [UΣ1/2]p+1:,: we observe that Akk depend only on the top k singular vectors Uk and corresponding singular values. This can be seen as follows: [UΣ1/2]p+1:,: denotes the submatrix of UΣ1/2 with top p rows removed. Now in UΣ1/2 each column of U is scaled by the corresponding singular value. Then the Akk submatrix depends only on top k rows of Σ 1/2U and the top k columns of [UΣ1/2]p+1:,: which correspond to the top k singular vectors.

10. Isometry of Input Matrix: Proof of Lemma 5.1

Theorem 11 Deﬁne

Ud Ud+1 . . . UT+d 1 Ud 1 Ud . . . UT+d 2 ... ... ... ... U1 U2 . . . UT

where each Ui subg(1) and isotropic. Then there exists an absolute constant c such that U satisﬁes: (1/2)T σmin(UU ) σmax(UU ) (3/2)T

whenever T cm2d(log2 (d) log2 (m2/δ) + log3 (2d)) with probability at least 1 δ.

Proof Deﬁne

0 0 0 . . . 0 I 0 0 . . . 0 ... ... ... ... ... 0 . . . I 0 0 0 . . . 0 I 0

, b Uk := Ud+k

Finite Time LTI System Identification

Ud Ud+1 . . . UT+d 1 Ud 1 Ud . . . UT+d 2 ... ... . . . ... U1 U2 . . . UT

we can reformulate it so that each column is the output of an LTI system in the following sense:

xk+1 = Axk + B b U(k + 1) (28)

where UU = PT 1 k=0 xkx k and x0 =

Ud Ud 1 ... U1

. From Theorem 8.1 we have that

k=0 b Uk b U k 5

with probability at least 1 δ whenever T c m + log 2

δ . Deﬁne Vt = Pt 1 l=0 xkx k then,

VT = AVT 1A + B

k=0 b Uk b U k

Axk b U k+1B + B b Uk+1x k A (29)

It can be easily checked that xk =

Ud+k Ud+k 1 ... Uk+1

and consequently

k=0 Axk b U k+1B =

0 0 . . . 0 0 Ud+k U d+k+1 0 . . . 0 0 Ud+k 1U d+k+1 0 . . . 0 0 ... ... ... ... ... ... ... ... ... ... Uk+2U d+k+1 0 . . . 0 0

Deﬁne Lj := PT 2 k=0 Ud+k j+1U d+k+1 and Lj is a m m block matrix. Then

l=0 Al T 2 X

k=0 Axk b U k+1B !

0 0 . . . 0 0 0 L1 0 . . . 0 0 0 L2 L1 . . . 0 0 0 ... ... ... ... ... ... ... ... ... ... ... ... Ld 1 0 . . . 0 L1 0

T. Sarkar and A. Rakhlin and M. A. Dahleh

Use Lemma 10.1 to show that

Td log (d) log (m2/δ) (30)

with probability at least 1 δ. Then

k=0 b Uk b U k

l=0 Alx T 1x T 1Al .

From Theorem 8.1 we have with probability atleast 1 δ that

k=0 b Uk b U k

B Al (5/4)TI (31)

whenever T c m + log 2

δ . Observe that

l=1 Alx T 1x T 1Al = σ2 1([Ax T 1, A2x T 1, . . . , Adx T 1])

The matrix [Ax T 1, A2x T 1, . . . , Adx T 1] is the lower triangular submatrix of a random Toeplitz matrix with i.i.d subg(1) entries as in Theorem 8.2. Then using Theorem 8.2 and Proposition 8.2 we get that with probability at least 1 δ we have [Ax T 1, A2x T 1, . . . , Adx T 1] cm( p

d log (2d) log (2d) + p

d log (1/δ)). (32)

Then Pd l=1 Alx T 1x T 1Al cm2d(log3 (2d) + log (1/δ) + log (2d) p

log (2d) log (1/δ)) with probability at least 1 δ. By ensuring that Eqs. (30), (31) and (32) hold simultaneously we can ensure that cm

Td log (d) log (m2/δ) T/8 and cm2d(log3 (2d) + log (1/δ) + log (2d) p

log (2d) log (1/δ)) T/8 for large enough T and absolute constant c.

Lemma 10.1 Let {Uj Rm 1}T+d j=1 be independent subg(1) random vectors. Deﬁne Lj := PT 2 k=0 Ud+k j+1U d+k+1 for all j 1 and

0 0 . . . 0 0 0 L1 0 . . . 0 0 0 L2 L1 . . . 0 0 0 ... ... ... ... ... ... ... ... ... ... ... ... Ld 1 0 . . . 0 L1 0

Then with probability at least 1 δ we have

Td log (d) log (m/δ).

Finite Time LTI System Identification

Proof Since Ljs are block matrices, the techniques in Meckes et al. (2007) cannot be directly applied. However, by noting that E can be broken into a sum of m matrices where the norm of each matrix can be bounded by a Toeplitz matrix we can use the result from Meckes et al. (2007). For instance if m = 2 and {ui} i=1 are independent subg(1) random variables then we have

. . . u1 u2 u3 u4

. . . u5 u6 u7 u8

u1 u2 u3 u4

... ... ...

. . . u1 0 u3 0

. . . u5 0 u7 0

... ... ...

. . . 0 u2 0 u4

. . . 0 u6 0 u8

... ... ...

then ||Td|| sup1 i 2||Mi||. Furthermore for each Mi we have

. . . u1 0 0 0

. . . u5 0 0 0

... ... ...

| {z } =M11

. . . 0 0 u3 0

. . . 0 0 u7 0

... ... ...

| {z } =M12

and ||M1|| ||M11||+||M12||. The key idea is to show that Mi1 are Toeplitz matrices (after removing the zeros in the blocks) and we can use the standard techniques described in proof of Theorem 1 in Meckes et al. (2007). Then we will show that each ||Mij|| C with high probability and ||Td|| m C. For brevity, we will assume for now that Ui are scalars and at the end we will scale by m. By standard techniques described in proof of Theorem 1 in Meckes et al. (2007), we have that the ﬁnite Toeplitz matrix Td + T d is d d submatrix of the inﬁnite Laurent matrix

M = [L|j k|1|j k|<d 1]j,k Z.

T. Sarkar and A. Rakhlin and M. A. Dahleh

Consider M as an operator on ℓ2(Z) in the canonical way, and let ψ : ℓ2(Z) L2[0, 1] denote the usual linear trigonometric isometry ψ(ej)(x) = e2πijx. Then ψMdψ 1 : L2 L2 is the operator correpsonding to

j= (d 1) L|j|e2πijx = L0 + 2

j=1 cos (2πjx)Lj

Therefore, Td + T d M = f = sup 0 x 1 |Yx|

where Yx = 2 Pd 1 j=1 cos (2πjx)Lj. Furthermore note that Yx has the following form

0 cx 1 cx 2 . . . cx d 1 0 . . . 0 0 0 cx 1 . . . cx d 1 0 . . . 0 ... ... . . . ... . . . ... ... ... ... ... ... ... ... ... ... ... 0 0 . . . 0 0 cx 1 . . . cx d 1 0 0 . . . 0 0 0 . . . 0 ... ... ... ... ... ... ... ...

U1 U2 ... UT+d

and cx j = 2 cos (2πjx). For any x and assuming Uj subg(1), we have

from Theorem 8.3 P Yx/

Td t 2 exp c(t t2) (34)

The tail behavior of Yx/

Td is not strictly subgaussian and we need to use Theorem 8.4. The function ψ can be found as Eq. 1 of van de Geer and Lederer (2013) (equivalent upto universal constants) with L = 2 and its inverse being

log (1 + t) + log (1 + t).

We have that sup t |Yt| ψ Y0 ψ + K

0 ψ 1(N(ϵ/2, d))dϵ,

where d(s, t) = (Ys Yt)/

Td ψ and N(ϵ, d) is the minimal number of balls of radius

ϵ needed to cover [0, 1] where d( , ) is the pseudometric. Since Ys has distribution as in Eq. (34), it follows that d(s, t) c|s t| for some absolute constant c. Then

0 ψ 1(N(ϵ/2, d))dϵ c

Finite Time LTI System Identification

for some universal constant c > 0. This ensures that supt|Yt| ψ c

Td. Since E[X] X ψ we have that E[sup0 x 1|Yx|]

Td. This implies E[ Td + T d ]

Td, and using Proposition 8.2 we have E[ Td ] c

Td log (d). Furthermore, we can make a stronger statement because supt|Yt| ψ c

Td which implies that

Td log (d) log (1/δ)

with probability at least 1 δ. Then recalling that in the general case that Ljs of Td were m m block matrices we scale by m and get with probability at least 1 δ

Td log (d) log (m2/δ)

where the union is over all m2 elements being less that c

Td log (d) log (m2/δ). Note that c hides the universal constant K from Theorem 8.4.

11. Error Analysis for Theorem 5.1

For this section we assume that Ut subg(L2).

11.1 Proof of Theorem 5.1

Recall Eq. (8) and (9), i.e.,

Y + l,d = H0,d,d U l 1,d + T0,d U+ l,d + Hd,d,l d 1 U l d 1,l d 1 + O0,d,d η l 1,d + T O0,d η+ l,d + Od,d,l d 1 η l d 1,l d 1 + w+ l,d (35)

Assume for now that we have T + 2d data points instead of T. It is clear that

ˆH0,d,d = arg min H

l=0 || Y + l+d+1,d H U l+d,d||2 2=

l=0 Y + l+d+1,d U l+d,d !

l=0 U l+d,d U l+d,d, (36)

Ud Ud+1 . . . UT+d 1 Ud 1 Ud . . . UT+d 2 ... ... ... ... U1 U2 . . . UT

T. Sarkar and A. Rakhlin and M. A. Dahleh

It is show in Theorem 11 that VT is invertible with probability at least 1 δ. So in our analysis we can write this as

l=0 U l+d,d U l+d,d

l=0 U l+d,d U l+d,d

From this one can conclude that

ˆH H0,d,d 2 = T 1 X

l=0 U l+d,d U l+d,d 1 T 1 X

l=0 U l+d,d U+ l+d+1,d T 0,d

+ U l+d,d U l,l H d,d,l + U l+d,d η l+d,d O 0,d,d

+ U l+d,d η+ l+d+1,d T O 0,d + U l+d,d η l,l O d,d,l + U l+d,d w+ l+d+1,d 2 (37)

Here as we can observe U l,l , η l,l grow with T in dimension. Based on this we divide our

error terms in two parts:

l=0 U l+d,d U l+d,d 1 U l+d,d U l,l H d,d,l + U l+d,d η l,l O d,d,l

l=0 U l+d,d U l+d,d 1 U l+d,d η+ l+d+1,d T O 0,d + U l+d,d U+ l+d+1,d T 0,d+ (39)

U l+d,d η+ l+d+1,d T O 0,d + U l+d,d w+ l+d+1,d

Then the proof of Theorem 5.1 will reduce to Propositions 11.1 11.3. We ﬁrst analyze

V 1/2 T T 1 X

l=0 U l+d,d U l,l H d,d,l 2

The analysis of ||V 1/2 T (PT 1 l=0 U l+d,d η l,l O d,d,l)|| will be almost identical and will only diﬀer in constants.

Proposition 11.1 For 0 < δ < 1, we have with probability at least 1 2δ

V 1/2 T T 1 X

l=0 U l+d,d U l,l H d,d,l 2 4σ

where σ = q

σ(Pd k=1 T d+k,T Td+k,T ).

Finite Time LTI System Identification

Proof We proved that TI

2 with high probability, then

P V 1/2 T T 1 X

l=0 U l+d,d U l,l H d,d,l 2 a, TI

l=0 U l+d,d U l,l H d,d,l 2 a, TI

P 2 sup v N 1

l=0 U l+d,d U l,l H d,d,lv 2 a + P TI

l=0 U l+d,d U l,l H d,d,lv 2 a δ. (40)

Deﬁne the following ηl,d = U l,l H d,d,lv, Xl,d = q

2 T U l+d,d. Observe that ηl,d, ηl+1,d have contributions from Ul 1, Ul 2 etc. and do not immediately satisfy the conditions of Theorem 2.2. Instead we will use the fact that Xi,d is independent of Uj for all j i.

V 1/2 T T 1 X

l=0 U l+d,d U l,l H d,d,l 2 2 sup v N 1

l=0 U l+d,d U l,l H d,d,lv||

2 sup v N 1

l=0 Xl,dηl,d||.

Deﬁne H d,d,lv = [β 1 , β 2 , . . . , β l ] . βi are m 1 vectors when LTI system is MIMO. Then

ηl,d = Pl 1 k=0 U l kβk+1. Let αl = Xl,d. Then consider the matrix

β 1 0 0 . . . β 2 β 1 0 . . . ... ... ... ... β T β T 1 . . . β 1

T. Sarkar and A. Rakhlin and M. A. Dahleh

Observe that the matrix ||BT m T ||2= q

σ(Pd k=1 T d+k,T Td+k,T )

d||Td, ||2< which follows from Lemma 8.1. Then

l=0 Xl,dηl,d = [α1, . . . , αT ]B

U1 U2 ... UT

k=1 αkβ k ,

k=2 αkβ k 1, . . . , αT β 1 ]

U1 U2 ... UT

k=j αkβ k Uj .

Here αi = Xi,d and recall that Xi,d is independent of Uj for all i j. Let γ = α B. Deﬁne GT+d k = σ({Uk+1, Uk+2, . . . , UT+d}) where σ(A) is the sigma algebra containing the set A with G0 = φ. Then Gk 1 Gk. Furthermore, since γj 1, Uj are GT+d+1 j measurable and Uj is conditionally (on GT+d j) sub Gaussian, we can use Theorem 2.2 on γ U = α BU (where γj = XT+d j, Uj = ηT+d j+1 in the notation of Theorem 2.2). Then with probability at least 1 δ we have

α BB α + V 1/2 γ U L

δ + log det(α BB α + V )

For any ﬁxed V > 0. With probability at least 1 δ, we know from Theorem 11 that α α 3I

2 = α BB α 3σ2 1(B)I

2 . By combining this event and the event in Eq. (41) and

setting V = 3σ2 1(B)I

2 , we get with probability at least 1 2δ that

||α BU||2= ||γ U||2

δ + pd log 3 + m . (42)

Replacing δ 5 pd δ

2, we get from Eq. (40)

V 1/2 T T 1 X

l=0 U l+d,d U l,l H d,d,l 2

6 log (5)Lσ1(B)

with probability at least 1 δ. Since L = 1 we get our desired result.

Then similar to Proposition 11.1, we analyze V 1/2 T PT 1 l=0 U l+d,d U+ l+d+1,d T 0,d 2

Proposition 11.2 For 0 < δ < 1 and large enough T, we have with probability at least 1 δ

V 1/2 T T 1 X

l=0 U l+d,d U+ l+d+1,d T 0,d 2 4σ

Finite Time LTI System Identification

σ sup ||v||2=1

v CAd B v CAd 1B v CAd 2B . . . v CB 0

0 ... ... ... ... 0

0 ... ... ... ... ... 0 v CAd B v CAd 1B . . . . . . v CB

j=0 ||CAj B||2 β

Proof Note V 1/2 T PT 1 l=0 U l+d,d U+ l+d+1,d T 0,d 2 q

2 T PT 1 l=0 U l+d,d U+ l+d+1,d T 0,d 2 with

probability at least 1 δ for large enough T. Here T 0,d is md pd matrix. Then deﬁne

2 T U l+d,d and the vector Ml Rpd as M l = U+ l+d+1,d T 0,d. Then

l=0 Xl M l ||2 t) |{z} 1 2 net

l=0 Xl M l v||2 t/2)

where M l v is a real value. Let β := T 0,dv, then M l v = U+ l+d+1,dβ. This allows us to write Xl M l v in a form that will enable us to apply Theorem 2.2.

l=0 Xl M l v = [X0, X1, . . . , XT 1] | {z } =X

β 1 β 2 . . . β d . . . 0

0 β 1 ... ... ... 0

0 ... ... ... ... ... 0 . . . 0 β 1 . . . β d

Ud+1 Ud+2 ... UT+2d

Here I is RT (m T+md). It is known from Theorem 11 that XX 3I

2 with high probability

and consequently XII X 3σ2 1(I)I

2 . Deﬁne Fl = σ({Ul}d+l j=1) as the sigma ﬁeld generated by ({Ul}d+l j=1. Furthermore Nl is Fl measurable, and [XI]l is Fl 1 measurable and we can apply Theorem 2.2. Now the proof is similar to Proposition 11.1. Following the same steps as before we get with probability at least 1 δ

l=0 Xl M l v||2= ||

l=0 [XI]l Nl||2

δ + pd log 3 + m

and substituting δ 5 pdδ we get

l=0 Xl M l ||2

6 log (5)σ1(I)L

l=0 Xl Ml||2 4σ1(I)L

δ + pd + m. (44)

T. Sarkar and A. Rakhlin and M. A. Dahleh

The proof for noise and covariate cross terms is almost identical to Proposition 11.2 but easier

because of independence. Finally note that σ1(I) q Pd i=1 βi 2 2

Proposition 11.3 For 0 < δ < 1, we have with probability at least 1 δ

V 1/2 T T X

k=0 U l+d,d η+ l+1+d,d T O 0,d 2 4σA

V 1/2 T T X

k=0 U l+d,d η l,l O d,d,l 2 4σB

V 1/2 T T X

k=0 U l+d,d η l+d,d O 0,d,d 2 4σC

V 1/2 T T X

k=0 U l+d,d w+ l+1+d,d 2 4σD

Here σ = max (σA, σB, σC, σD) where

σA σC sup ||v||2=1

v CAd v CAd 1 v CAd 2 . . . 0

0 ... ... ... 0

0 ... ... ... ... 0 . . . v CAd . . . v C

j=0 ||CAj||2 βR

σ(Pd k=1 T O d+k,T T Od+k,T ) βR

By taking the intersection of all the aforementioned events for a ﬁxed δ we then have with probability at least 1 δ

ˆH0,d,d H0,d,d 2 16σ

m + pd + log d

12. Subspace Perturbation Results

In this section we present variants of the famous Wedin s theorem (Section 3 of Wedin (1972)) that depends on the distribution of Hankel singular values. These will be sign free generalizations of the gap Free Wedin Theorem from Allen-Zhu and Li (2016). The major diﬀerence from the traditional Wedin s theorem is that the Frobenius error bound can include the dimension of the matrix; however in our case the Hankel matrix is allow to grow with T and such a bound may not be ideal. To address we introduce this mild variant of Wedin s theorem. First we deﬁne the Hermitian dilation of a matrix.

H(S) = 0 S S 0

Finite Time LTI System Identification

The Hermitian dilation has the property that ||S1 S2|| ϵ ||H(S1) H(S2)|| ϵ. Hermitian dilations will be useful in applying Wedin s theorem for general (not symmetric) matrices.

Proposition 12.1 Let S, ˆS be symmetric matrices and ||S ˆS|| ϵ. Further, let vj, ˆvj correspond to the jth eigenvector of S, ˆS respectively such that λ1 λ2 . . . λn and ˆλ1 ˆλ2 . . . ˆλn. Then we have

| vj, ˆvk | ϵ

|λj ˆλk| (45)

if either λj or ˆλk is not zero.

Proof Let S = λjvjv j +V Λ j V and ˆS = ˆλkˆvkˆv k + ˆV ˆΛ k ˆV , wlog assume |λj| |ˆλk|. Deﬁne R = S ˆS

v j Sˆvk = v j ˆSˆvk + v j Rˆvk

Since vj, ˆvk are eigenvectors of S and ˆS respectively.

λjv jˆvk = ˆλkv jˆvk + v j Rˆvk

|λj ˆλk||v jˆvk| ϵ

Proposition 12.1 gives an eigenvector subjective Wedin s theorem. Next, we show how to extend these results to arbitrary subsets of eigenvectors.

Proposition 12.2 For ϵ > 0, let S, P be two symmetric matrices such that ||S P||2 ϵ. Let S = UΣSU , P = V ΣP V

Let V+ correspond to the eigenvectors of singular values β, V correspond to the eigenvectors of singular values α and V are the remaining ones. Deﬁne a similar partition for S. Let α < β

||U V+|| ϵ β α

Proof The proof is similar to before. S, P have a spectral decomposition of the form

S = U+ΣS +U + + U ΣS U + UΣS0 U

P = V+ΣP +V + + V ΣP V + V ΣP 0 V

Let R = S P and since U+ is orthogonal to U , U and similarly for V

U S = ΣS U = U P + U R

ΣS U V+ = U V+ΣP + + U RV+

T. Sarkar and A. Rakhlin and M. A. Dahleh

Diving both sides by ΣP

ΣS U V+(ΣP +) 1 = U V+ + U RV+(ΣP +) 1

||ΣS U V+(ΣP +) 1|| ||U V+|| ||U RV+(ΣP +) 1|| α β ||U V+|| ||U V+|| ϵ

||U V+|| ϵ β α

Let Sk, Pk be the best rank k approximations of S, P respectively. We develop a sequence of results to see how ||Sk Pk|| varies when ||S P|| ϵ as a function of k.

Proposition 12.3 Let S, P be such that

Furthermore, let ϵ be such that

ϵ inf {1 i r 1} {s+1 i n}

σi(P) σi+1(P)

and US j , V S j be the left and right singular vectors of S corresponding to σj(S). There exists a unitary transformation Q such that

σmax([UP r , . . . , UP s ]Q [US r , . . . , US s ]) 2ϵ

min σr 1(P) σr(S), σs(S) σs+1(P)

σmax([V P r , . . . , V P s ]Q [V S r , . . . , V S s ]) 2ϵ

min σr 1(P) σr(S), σs(S) σs+1(P) .

Proof Let r k s. First divide the indices [1, n] into 3 parts K1 = [1, r 1], K2 = [r, s], K3 = [s + 1, n]. Although we focus on only three groups extension to general case will be a straight forward extension of this proof. Deﬁne the Hermitian dilation of S, P as H(S), H(P) respectively. Then we know that the eigenvalues of H(S) are

n i=1{σi(S), σi(S)}

Further the eigenvectors corresponding to these are

u S i v S i

u S i v S i

Similarly deﬁne the respective quantities for H(P). Now clearly, ||H(S) H(P)|| ϵ since ||S P|| ϵ. Then by Weyl s inequality we have that

|σi(S) σi(P)| ϵ

Finite Time LTI System Identification

Now we can use Proposition 12.1. To ease notation, deﬁne σi(S) = λi(H(S)) and λ i(H(S)) = σi(S) and let the corresponding eigenvectors be ai, a i for S and bi, b i for P respectively. Note that we can make the assumption that ai, bi 0 for every i. This does not change any of our results because ai, bi are just stacking of left and right singular vectors and uiv i is identical for ui, vi and ui, vi. Then using Proposition 12.1 we get for every (i, j) K2 K2 and i = j

| ai, bj | ϵ |σi(S) σj(P)| (47)

similarly | a i, bj | ϵ |σi(S) + σj(P)| (48)

u S i v S i

u S i v S i

u P i v P i

and σi(S), σi(P) 0 we have by adding Eq. (47),(48) that

max | u S i , u P j |, | v S i , v P j | ϵ |σi(S) σj(P)|

Deﬁne US Ki to be the matrix formed by the orthornormal vectors {aj}j Ki and US K i to be the matrix formed by the orthonormal vectors {aj}j Ki. Deﬁne similar quantities for P. Then

(US K2) UP K2(UP K2) US K2 = (US K2) (I X

j =2 UP Kj(UP Kj) )US K2

= (US K2) (I X

|j| =2 UP Kj(UP Kj) UP K 2(UP K 2) )US K2

= I (US K2) X

|j| =2 UP Kj(UP Kj) US K2 (US K2) UP K 2(UP K 2) US K2 (49)

Now K1, K 1 corresponds to eigenvectors where singular values σr 1(P), K3, K 3 corresponds to eigenvectors where singular values σs+1(P). We are in a position to use Proposition 12.2. Using that on Eq. (49) we get the following relation

(UP K2) US K2(US K2) UP K2 I

(σr 1(P) σs(S))2 ϵ2

(σs(S) σs+1(P))2

(US K2) UP K 2(UP K 2) US K2 (50)

In the Eq. (50) we need to upper bound (US K2) UP K 2(UP K 2) US K2. To this end we will exploit the fact that all singular values corresponding to US K2 are the same. Since ||H(S) H(P)|| ϵ, then

H(S) = US K2ΣS K2(US K2) + US K 2ΣS K 2(US K 2) + US K0ΣS K0(US K0)

H(P) = UP K2ΣP K2(UP K2) + UP K 2ΣP K 2(UP K 2) + UP K0ΣP K0(UP K0)

T. Sarkar and A. Rakhlin and M. A. Dahleh

Then by pre multiplying and post multiplying we get

(US K2) H(S)UP K 2 = ΣS K2(US K2) UP K 2 (US K2) H(P)UP K 2 = (US K2) UP K 2ΣP K 2

Let H(S) H(P) = R then

(US K2) (H(S) H(P))UP K 2 = (US K2) RUP K 2 ΣS K2(US K2) UP K 2 (US K2) UP K 2ΣP K 2 = (US K2) RUP K 2

Since ΣS K2 = σs(A)I then

||(US K2) UP K 2(σs(S)I ΣP K 2)|| = ||(US K2) RUP K 2||

||(US K2) UP K 2|| ϵ σs(S) + σs(P)

||(UP K2) US K 2|| ϵ σs(P) + σs(S)

Since σs(P) + σs(S) σs(S) σs+1(P) combining this with Eq. (50) we get

σmin((US K2) UP K2) 1 3ϵ2

min σr 1(P) σs(S), σs(S) σs+1(P) 2 (51)

σi(P) σi+1(P)

for Eq. (51), we use the inequality

1 x2 1 x2 whenever x < 1 which is true when Eq. (46) is true. This means that there exists unitary transformation Q such that

||US K2 UP K2Q|| 2ϵ

min σr 1(P) σs(S), σs(S) σs+1(P)

Remark 12 Note that S, P will be Hermitian dilations of H0, , , ˆH0, ˆd, ˆd respectively in our case. Since the singular vectors of S (and P) are simply stacked version of singular vectors of H0, , (and ˆH0, ˆd, ˆd), our results hold directly for the singular vectors of H0, , (and ˆH0, ˆd, ˆd)

Let r k s. First divide the indices [1, n] into 3 parts K1 = [1, r 1], K2 = [r, s], K3 = [s + 1, n].

Finite Time LTI System Identification

Proposition 12.4 (System Reduction) Let ||S P|| ϵ and the singular values of S be arranged as follows:

σ1(S) > . . . > σr 1(S) > σr(S) σr+1(S) . . . σs(S) > σs+1(S) > . . . σn(S) > σn+1(S) = 0

Furthermore, let ϵ be such that

ϵ inf {1 i r 1} {s+1 i n}

σi(P) σi+1(P)

Deﬁne K0 = K1 K2, then

||US K0(ΣS K0)1/2 UP K0(ΣP K0)1/2||2 2ϵ

i=1 σi/ζ2 i + σr/ζ2r + sup 1 i s | σi p

and σi = σi(S), ˆσi = σi(P). Here ζi = min (σi σi+1, σi σi+1) and ζr = min (σr 1 σr, σs σs+1).

Since US K0 = [US K1US K2] and likewise for B, we can separate the analysis for K1, K2 as follows

||US K0(ΣS K0)1/2 UP K0(ΣP K0)1/2|| ||(US K0 UP K0)(ΣS K0)1/2||+||UP K0((ΣS K0)1/2 (ΣP K0)1/2)||

||(US K1 UP K1)(ΣS K1)1/2||2 2+||(US K2 UP K2)(ΣS K2)1/2||2 2

+ ||(ΣS K0)1/2 (ΣP K0)1/2||

Now ||(ΣS K0)1/2 (ΣP K0)1/2||= supl| p

σl(P)|. Recall that σr(S) = . . . = σk(S) = . . . = σs 1(S) and by conditions on ϵ we are guaranteed that ϵ σi σj < 1/2 for all 1 i = j r. We will combine our previous results in Proposition 12.1 12.3 to prove this claim. Speciﬁcally from Proposition 12.3 we have

||(US K2 UP K2)(ΣS K2)1/2|| 2ϵ p

min σr 1(P) σr(S), σr(S) σs+1(P)

On the remaining term we will use Proposition 12.3 on each column

||(US K1 UP K1)(ΣS K1)1/2|| ||[ p

σ1(S)c1, . . . , q

σ|K1|(S)c|K1|]||

j=1 σ2 j ||cj||2

min σj 1(P) σj(S), σj(S) σj+1(P) 2

In the context of our system identiﬁcation, S = H0, , and P = ˆH0, ˆd, ˆd. P will be made

T. Sarkar and A. Rakhlin and M. A. Dahleh

compatible by padding it with zeros to make it doubly inﬁnite. Then US K0, UP K0 (after padding) has inﬁnite rows. Deﬁne Z0 = US K0(ΣS K0)1/2(1 :, :), Z1 = US K0(ΣS K0)1/2(p + 1 :, :) (both inﬁnite length) and similarly we will have ˆZ0, ˆZ1. Note that from a computational perspective we do not need to Z0, Z1; we only need to work with ˆZ0 = UP K0(ΣP K0)1/2(1 : , :), ˆZ1 = UP K0(ΣP K0)1/2(p + 1 :, :) and since most of it is just zero padding we can simply compute on ˆZ0(1 : pd, :), ˆZ1(1 : pd, :).

Proposition 12.5 Assume Z1 = Z0A. Furthermore, ||S P||2 ϵ and let ϵ be such that

ϵ inf {1 i r 1} {s+1 i n}

σi(P) σi+1(P)

||(Z 0Z0) 1Z 0Z1 ( ˆZ 0 ˆZ0) 1 ˆZ 0 ˆZ1|| Cϵ(γ + 1)

σ2s ((σs σs+1) (σr 1 σs))2

σiσs (σi σi+1)2 (σi 1 σi)2

where σ1(A) γ.

Proof Note that Z1 = Z0A, then

||(Z 0Z0) 1Z 0Z1 ( ˆZ 0 ˆZ0) 1 ˆZ 0 ˆZ1||2

=||A ( ˆZ 0 ˆZ0) 1 ˆZ 0 ˆZ1||2= ||( ˆZ 0 ˆZ0) 1 ˆZ 0 ˆZ0A ( ˆZ 0 ˆZ0) 1 ˆZ 0 ˆZ1||2

=||( ˆZ 0 ˆZ0) 1 ˆZ 0 ˆZ0A ( ˆZ 0 ˆZ0) 1 ˆZ 0Z0A + ( ˆZ 0 ˆZ0) 1 ˆZ 0Z0A ( ˆZ 0 ˆZ0) 1 ˆZ 0 ˆZ1||2

||( ˆZ 0 ˆZ0) 1 ˆZ 0 ˆZ0A ( ˆZ 0 ˆZ0) 1 ˆZ 0Z0A||2+||( ˆZ 0 ˆZ0) 1 ˆZ 0Z0A ( ˆZ 0 ˆZ0) 1 ˆZ 0 ˆZ1||2

||( ˆZ 0 ˆZ0) 1 ˆZ 0||2 ||Z0A ˆZ0A||2+|| Z0A |{z} Shifted version of Z0

Now, ||( ˆZ 0 ˆZ0) 1 ˆZ 0||2 ( σs ϵ) 1, ||Z0A ˆZ1||2 ||Z0 ˆZ0||2 since Z1 = Z0A is a submatrix of Z0 and ˆZ1 is a submatrix of ˆZ0 we have ||Z0A ˆZ1||2 ||Z0 ˆZ0||2 and ||Z0A ˆZ0A||2 ||A||2||Z0 ˆZ0||2

σ2s ((σs σs+1) (σr 1 σs))2 +

σiσs (σi σi+1)2 (σi 1 σi)2

Finite Time LTI System Identification

13. Hankel Matrix Estimation Results

In this section we provide the proof for Theorem 5.2. For any matrix P, we deﬁne its doubly inﬁnite extension P as

P 0 . . . 0 0 . . . ... ... ...

Proposition 13.1 Fix d > 0. Then we have

||Hd, , ||2 ||H0, , H0,d,d||2

2||Hd, , ||2

Proof Deﬁne Cd, Bd as follows

0md n C CA ...

Bd = 0n pd B AB . . .

Now pad H0,d,d with zeros to make it a doubly inﬁnite matrix and call it H0,d,d and we get that

|| H0,d,d H0, , || = 0 M12 M21 M22

Note here that M21 and M0 = M12 M22

are inﬁnite matrices. Further ||Hd, , ||2= ||M0||2

||M21||2. Then

|| H0,d,d H0, , || q

||M12||2 2+||M0||2 2

2||Hd, , ||2

Further || H0,d,d H0, , || ||M0||= ||Hd, , ||2.

Proposition 13.2 For any d1 d2, we have

||H0, , H0,d1,d1||2

2||H0, , H0,d2,d2||2

Proof Since ||Hd1, , ||2 ||H0, , H0,d1,d1||2

2||Hd1, , ||2 from Proposition 13.1. It is clear that ||Hd1, , ||2 ||Hd2, , ||2. Then

2||H0, , H0,d1,d1||2 ||Hd1, , ||2 ||Hd2, , ||2 ||H0, , H0,d2,d2||2

T. Sarkar and A. Rakhlin and M. A. Dahleh

Proposition 13.3 Fix d > 0. Then

||Td, (M)||2 Mρ(A)d

Proof Recall that

0 0 0 . . . 0 CAd B 0 0 . . . 0 CAd+1B CAd B 0 . . . 0 ... ... ... ... ...

Then ||Td, (M)||2 P j=d||CAj B||2. Now from Eq. 4.1 and Lemma 4.1 in Tu et al. (2017) we get that ||CAj B||2 Mρ(A)j. Then

j=d ||CAj B||2 Mρ(A)d

Remark 13 Proposition 13.3 is just needed to show exponential decay and is not precise. Please refer to Tu et al. (2017) for explicit rates.

Next we show that T (κ) (δ) and d (T, δ) deﬁned in Eq. (16) given by

d (T, δ) = inf

m + pd + log T

δ T ||H0,d,d H0, , ||2

T (κ) (δ) = inf n T T cm2 log3 (Tm/δ) d (T, δ), d (T, δ) κd ( T

The existence of d (T, δ) is predicated on the ﬁniteness of T (κ) (δ) which we discuss below.

13.1 Existence of T (κ) (δ) <

Construct two sets

T1(δ) = inf n T d (T, δ) D(T) o (56)

T2(δ) = inf n T d (t, δ) κd ( t

κ2 , δ) 8 , t T o (57)

Clearly, T (κ) (δ) < T1(δ) T2(δ). A key assumption in the statement of our results is that T (κ) (δ) < . We will show that it is indeed true. Let κ 16.

Proposition 13.4 For a ﬁxed δ > 0, T1(δ) < with d (T, δ) c log (c T+log 1

δ ) log R+log ( M/β) log 1

Here ρ = ρ(A).

Finite Time LTI System Identification

Proof Note the form for d (T, δ), it is the minimum d that satisﬁes

m + pd + log T

δ T ||H0,d,d H0, , ||2

Since from Proposition 13.1 and 13.3 we have ||H0,d,d H0, , ||2 3 Mρd 1 ρ(A), then d (T, δ) d that satisﬁes

m + pd + log T

which immediately implies d (T, δ) d = c log (c T log R+log 1

δ )+log ( M/β) log 1

ρ , i.e., d (T, δ) is at

most logarithmic in T. As a result, for a large enough T

cm2d log2 (d) log2 (m2/δ) + cd log3 (2d) c log (c T + log 1

δ) log R + log ( M/β) log 1

The intuition behind T2(δ) is the following: d (T, δ) grows at most logarithmically in T, as is clear from the previous proof. Then T2(δ) is the point where d (T, δ) is still growing as

T (i.e., mixing has not happened) but at a slightly reduced rate.

Proposition 13.5 For a ﬁxed δ > 0, T2(δ) < .

Proof Recall from the proof of Proposition 13.1 that ||Hd, , || ||H0, , H0,d,d||

2||Hd, , ||. Now Hd, , can be written as

Ad [B, AB, . . .] | {z } = B

Deﬁne Pd = Ad B B (Ad) . Let dκ be such that for every d dκ and κ 16

Clearly such a dκ < would exist because P0 = 0 but limd Pd = 0. Then observe that P2d 1 4κPd. Then for every d dκ we have that

||Hd, , || 4κ||H2d, , ||

T 4dκ (16)2 β2R2

σ2 0 (dκp + log (T/δ)) (59)

T. Sarkar and A. Rakhlin and M. A. Dahleh

where σ0 = ||Hdκ, , ||. Assume that σ0 > 0 (if not then are condition is trivially true). Then simple computation shows that

||H0,dκ,dκ H0, , || ||Hdκ, , || 16βR p

m + pdκ + log T

δ T | {z } < σ0

This implies that d = d (T, δ) dκ for T prescribed as above (ensured by Proposition 13.2). But from our discussion above we also have

||H0,d ,d H0, , || ||Hd , , || 4κ||H2d , , || 2κ||H0,2d ,2d H0, , ||

This means that if

||H0,d ,d H0, , || 16βR p

m + pd + log T

||H0,2d ,2d H0, , || 16

m + pd + log T

m + 2pd + log κ2T

which implies that d (κ2T, δ) 2d (T, δ). The inequality follows from the deﬁnition of d (κ2T, δ). Furthermore, if κ 16, 2d (T, δ) κ

8d (T, δ) whenever T is greater than a certain ﬁnite threshold of Eq. (59).

Eq. (58) happens when σ(Ad)2 1 4κ = dκ = O log κ log 1

where ρ = ρ(A) and T2(δ) c T1(δ).

It should be noted that the dependence of Ti(δ) on log 1

ρ is worst case, i.e., there exists some bad LTI system that gives this dependence and it is quite likely Ti(δ) is much smaller. The condition T T1(δ) T2(δ) simply requires that we capture some reasonable portion of the dynamics and not necessarily the entire dynamics.

13.2 Proof of Theorem 5.2

Proposition 13.6 Let T T (κ) (δ) and d = d (T, δ) then

||H0, , ˆH0,d ,d || 2cβR

m + pd + log T

Proof Consider the following error

||H0, , ˆH0,d ,d ||2 ||H0,d ,d ˆH0,d ,d ||2+||H0, , H0,d ,d ||2

From Proposition 13.1 and Eq. (55) we get that

||H0, , H0,d ,d ||2 16βR

m + pd + log T

Finite Time LTI System Identification

Since from Theorem 5.1

||H0,d ,d ˆH0,d ,d ||2 16βR

m + pd + log T

||H0, , ˆH0,d ,d ||2 32βR

m + pd + log T

Recall the adaptive rule to choose d in Algorithm 1. From Theorem 5.1 we know that for every d D(T) we have with probability at least 1 δ.

||H0,d,d ˆH0,d,d||2 16βR

lp T + log T

δ T . Then consider the following adaptive rule

d0(T, δ) = inf n l || ˆH0,l,l ˆH0,h,h||2 16βR(2α(l) + α(h)) h D(T), h l o (61)

ˆd = ˆd(T, δ) = d0(T, δ) log T

for the same universal constant c as Theorem 5.1. Let d (T, δ) be as Eq. (55). Recall that d = d (T, δ) is the point where estimation error dominates the ﬁnite truncation error. Unfortunately, we do not have apriori knowledge of d (T, δ) to use in the algorithm. Therefore, we will simply use Eq. (62) as our proxy. The goal will be to bound || ˆH0, ˆd, ˆd H0, , ||2

Proposition 13.7 Let T T (κ) (δ), d (T, δ) be as in Eq. (55) and ˆd be as in Eq. (62). Then with probability at least 1 δ we have

ˆd d (T, δ) log T

Proof Let d = d (T, δ). First for all h D(T) d , we note

|| ˆH0,d ,d ˆH0,h,h||2 || ˆH0,d ,d H0,d ,d ||+||H0,h,h ˆH0,h,h||2+||H0,h,h H0,d ,d ||2

|| ˆH0,d ,d H0,d ,d ||2+||H0,h,h ˆH0,h,h||2+||H0, , H0,d ,d ||2.

We use the property that ||H0, , H0,d ,d ||2 ||H0,h,h H0,d ,d ||2. Furthermore, because of the properties of d we have

||H0, , H0,d ,d ||2 16βRα(d )

|| ˆH0,d ,d H0,d ,d ||2 16βRα(d ), ||H0,h,h ˆH0,h,h||2 16βRα(h). (63)

T. Sarkar and A. Rakhlin and M. A. Dahleh

and || ˆH0,d ,d ˆH0,h,h||2 16βR(2α(d ) + α(h)).

This implies that d0(T, δ) d and the assertion follows.

We have the following key lemma about the behavior of ˆH0, ˆd, ˆd.

Lemma 13.1 For a ﬁxed κ 20, whenever T T (κ) (δ) we have with probability at least 1 δ

||H0, , ˆH0, ˆd, ˆd||2 3cβRα max d (T, δ), log T

Furthermore, ˆd = O(log T

Proof Let d > ˆd then

||H0, , ˆH0, ˆd, ˆd||2 ||H0, , H0,d ,d ||2+|| ˆH0, ˆd, ˆd H0, ˆd, ˆd||2+|| ˆH0,d ,d H0,d ,d ||2

If ˆd > d then

||H0, , ˆH0, ˆd, ˆd||2 ||H0, , H0, ˆd, ˆd||2+|| ˆH0, ˆd, ˆd H0, ˆd, ˆd||2= 2|| ˆH0, ˆd, ˆd H0, ˆd, ˆd||2

2cβRα( ˆd) = 2cβRα log T

where the equality follows from Proposition 13.7. The fact that ˆd = O(log T

δ ) follows from Proposition 13.1.

In the following we will use Hl = H0,l,l for shorthand.

Proposition 13.8 Fix κ 16, and T T (κ) (δ). Then

|| ˆH0, ˆd(T,δ), ˆd(T,δ) H0, , ||2 12cβR q

m + p ˆd(T, δ) + log T

with probability at least 1 δ.

Proof Assume that log T

δ d (T, δ). Recall the following functions

d (T, δ) = inf n d cβR

m + pd + log T

δ T ||Hd H ||2 o

d0(T, δ) = inf n l || ˆHl ˆHh||2 cβR(α(h) + 2α(l)) h l, h D(T) o

ˆd(T, δ) = d0(T, δ) log T

It is clear that d (κ2T, δ) (1 + 1

2p)κd (T, δ) for any κ 16. Assume the following

8d (κ 2T, δ) (This relation is true whenever T T (κ) (δ)),

Finite Time LTI System Identification

||H ˆd(T,δ) H ||2 6cβR q

m+p ˆd(T,δ)+log T

ˆd(T, δ) < d (κ 2T, δ) 1.

The key will be to show that with high probability that all three assumptions can not hold with high probability. For shorthand we deﬁne d(1) = d (T, δ), d(κ2) = d (κ 2T, δ), ˆd(1) = ˆd(T, δ), ˆd(κ2) = ˆd(κ 2T, δ) and Hl = H0,l,l, ˆHl = ˆH0,l,l. Let T = κ 2T. Then this implies that

m + pd(1) + log κ2 T

δ T || ˆH ˆd(1) ˆHd(1) ||2

|| ˆH ˆd(1) ˆHd(1) ||2 || ˆH ˆd(1) H ||2 || ˆHd(1) H ||2

|| ˆHd(1) H ||2+|| ˆH ˆd(1) ˆHd(1) ||2 || ˆH ˆd(1) H ||2

|| ˆHd(1) Hd(1) ||2+||Hd(1) H ||2+|| ˆH ˆd(1) ˆHd(1) ||2 || ˆH ˆd(1) H ||2

Since by deﬁnition of d ( , ) we have

|| ˆHd(1) Hd(1) ||2+||Hd(1) H ||2 2cβR

m + pd(1) + log κ2 T

and by assumptions d(1) κ

8d(κ2) , ˆd(1) d(κ2) then as a result ( q

8 + 1)d(κ2)

|| ˆH ˆd(1) H ||2 || ˆHd(1) Hd(1) ||2+||Hd(1) H ||2+ || ˆH ˆd(1) ˆHd(1) ||2 | {z }

m + pd(1) + log κ2 T

δ T | {z } Prop. 13.6

m + pd(1) + log κ2 T

δ T | {z } Deﬁnition of ˆd(1)

|| ˆH ˆd(1) H ||2 1

m + pd(κ2) + log T

where the last inequality follows from ( q

8 + 1)d(κ2) . Now by assumption

||H ˆd(1) H ||2 6cβR p

m + p ˆd(1) + log κ2 T

δ κ2 T it is clear that || ˆH ˆd(1) H ||2 5

6||H ˆd(1) H ||2

and we can conclude that, since 6

||H ˆd(1) H ||2< cβR

m + pd(κ2) + log T

T. Sarkar and A. Rakhlin and M. A. Dahleh

which implies that ˆd(1) d(κ2) 1. This is because by deﬁnition of d(κ2) we know that d(κ2) is the minimum such that

||Hd(κ2) H ||2 cβR

m + pd(κ2) + log T

and furthermore from Proposition 13.2 we have for any d1 d2

||H0, , H0,d1,d1|| 1

2||H0, , H0,d2,d2||.

This contradicts Assumption 3. So, this means that one of three assumptions do not hold. Clearly if assumption 3 is invalid then we have a suitable lower bound on the chosen ˆd( , ), i.e., since d (κ 2T, δ) d (T, δ) κ

8d (κ 2T, δ) we get

κ 8 ˆd(κ2 T, δ) κ

8d ( T, δ) κ

8 d (κ2 T, δ) κ

8 ˆd(κ2 T, δ) κ

8 d ( T, δ) κ

which implies from Lemma 13.1 that (since we pick κ = 16, for large enough T d ( T, δ) 4) and we have

|| ˆH ˆd(κ2 T,δ) H ||2 3cβR q

d (κ2 T, δ)

pd (κ2 T, δ) + log κ2 T

ˆd(κ2 T, δ)

p ˆd(κ2 T, δ) + log κ2 T

Similarly, if assumption 2 is invalid then we get that

||H ˆd(κ2 T,δ) H ||2< 6cβR q

ˆd(κ2 T, δ)

p ˆd(κ2 T, δ) + log κ2 T

and because ˆd(κ2 T, δ) d (κ2 T, δ) and || ˆH ˆd(κ2 T,δ) H ||2 ||H ˆd(κ2 T,δ) H ||2+|| ˆH ˆd(κ2 T,δ) H ||2 we get in a similar fashion to Proposition 13.6

|| ˆH ˆd(κ2 T,δ) H ||2 12cβR q

ˆd(κ2 T, δ)

p ˆd(κ2 T, δ) + log κ2 T

Replacing κ2 T = T it is clear that for any κ 16

|| ˆH ˆd(T,δ) H ||2 12cβR q

p ˆd(T, δ) + log T

If d (T, δ) log T

δ then we can simply apply Lemma 13.1 and our assertion holds.

Finite Time LTI System Identification

14. Model Selection Results

Proposition 14.1 Let H0, , = UΣV , ˆH0, ˆd, ˆd = ˆU ˆΣ ˆV and

||H0, , ˆH0, ˆd, ˆd|| ϵ.

Let ˆΣ be arranged into blocks of singular values such that in each block i we have

sup j ˆσi j ˆσi j+1 χϵ

for some χ 2, i.e.,

Λ1 0 . . . 0 0 Λ2 . . . 0 ... ... ... 0 0 0 . . . Λl

where Λi are diagonal matrices and ˆσi j is the jth singular value in the block Λi. Then there exists an orthogonal transformation, Q, such that

|| ˆU ˆΣ1/2Q UΣ1/2||2 2ϵ q

ˆσ1/ζ2n1 + ˆσn1+1/ζ2n2 + . . . + ˆσPl 1 i=1 ni+1/ζ2nl

+ 2 sup 1 i l

ˆσi min + ϵ pˆσ ˆd ϵ.

Here sup1 i l p

ˆσimax ϵ ˆd q

ζni = min (ˆσni 1 min ˆσni max, ˆσni min ˆσni+1 max )

for 1 < i < l, ζn1 = ˆσn1 min ˆσn2 max and ζnl = min (ˆσnl 1 min ˆσnl max, ˆσnl min).

Proof Let ˆU ˆΣ ˆV = SVD( ˆH0, ˆd, ˆd) and UΣV = SVD(H0, , ) where || ˆH0, ˆd, ˆd H0, , ||2 ϵ. ˆΣ is arranged into blocks of singular values such that in each block i we have ˆσi j ˆσi j+1 χϵ, i.e.,

Λ1 0 . . . 0 0 Λ2 . . . 0 ... ... ... 0 0 0 . . . Λl

where Λi are diagonal matrices and ˆσi j is the jth singular value in the block Λi. Furthermore,

ˆσi 1 min ˆσi max > χϵ. From ˆΣ deﬁne ˆΣ as follows:

ˆσ1In1 n1 0 . . . 0 0 ˆσ2In2 n2 . . . 0 ... ... ... 0 0 0 . . . ˆσl Inl nl

T. Sarkar and A. Rakhlin and M. A. Dahleh

where Λi is a ni ni matrix and ˆσi = 1

ni P j ˆσi j. The key idea of the proof is the following: (A, B, C) (QAQ , QB, CQ ) where Q is a orthogonal transformation and we will show that there exists a block diagonal unitary matrix Q of the form

Qn1 n1 0 . . . 0 0 Qn2 n2 . . . 0 ... ... ... 0 0 0 . . . Qnl nl

such that each block Qni ni corresponds to a orthogonal matrix of dimensions ni ni and that || ˆU ˆΣ1/2Q UΣ1/2||2 is small if || ˆH0, ˆd, ˆd H0, , ||2 is small. Each of the blocks correspond to the set of singular values where the inter-singular value distance is small . To start oﬀ, note that from Propositon 12.4 there must exist a Q that is block diagonal with orthogonal entries such that

|| ˆUQˆΣ1/2 UΣ1/2||2 cϵ q

ˆσ1/ζ2n1 + ˆσn1+1/ζ2n2 + . . . + ˆσPl 1 i=1 ni+1/ζ2nl + sup 1 i ˆd | σi p

Here ζni = min (ˆσni 1 min ˆσni max, ˆσni min ˆσni+1 max )

for 1 < i < l, ζn1 = ˆσn1 min ˆσn2 max and ζnl = min (ˆσnl 1 min ˆσnl max, ˆσnl min). Informally, the ζi measure the singular value gaps between each blocks. Furthermore, it can be shown that for any Q of the form in Eq. (67)

|| ˆUQˆΣ1/2 ˆU ˆΣ1/2Q||2 || ˆUQ ˆΣ 1/2 ˆUQˆΣ1/2||2+|| ˆU ˆΣ1/2Q ˆU ˆΣ 1/2Q||2 2||ˆΣ1/2 ˆΣ1/2||2

because ˆUQ ˆΣ 1/2 = ˆU ˆΣ 1/2Q. Note that ||ˆΣ1/2 ˆΣ1/2||2 sup1 i l p

ˆσi min. Now,

when ˆσi max χniϵ, then p

ˆσimax ; on the other hand when ˆσi max < χniϵ

ˆσimax ˆσi χniϵ and this implies that

ˆσi min χni p

ˆσimax ϵ χniϵ.

|| ˆU ˆΣ1/2Q UΣ1/2||2 || ˆUQˆΣ1/2 UΣ1/2||2+|| ˆUQˆΣ1/2 ˆU ˆΣ1/2Q||2

ˆσ1/ζ2n1 + ˆσn1+1/ζ2n2 + . . . + ˆσPl 1 i=1 ni+1/ζ2nl + sup 1 i ˆd | σi p

Our assertion follows since sup1 i ˆd| σi ˆσi| ϵ

Finite Time LTI System Identification

Proposition 14.2 Let H0, , = UΣV , ˆH0, ˆd, ˆd = ˆU ˆΣ ˆV and

||H0, , ˆH0, ˆd, ˆd|| ϵ.

Let ˆΣ be arranged into blocks of singular values such that in each block i we have

sup j ˆσi j ˆσi j+1 χϵ

for some χ 2, i.e.,

Λ1 0 . . . 0 0 Λ2 . . . 0 ... ... ... 0 0 0 . . . Λl

where Λi are diagonal matrices and ˆσi j is the jth singular value in the block Λi. Then there exists an orthogonal transformation, Q, such that

max || ˆC C||2, || ˆB B||2 2ϵ q

ˆσ1/ζ2n1 + ˆσn1+1/ζ2n2 + . . . + ˆσPl 1 i=1 ni+1/ζ2nl

+ 2 sup 1 i l

ˆσi min + ϵ pˆσ ˆd ϵ = ζ,

||A ˆA||2 4γ ζ/ q

Here sup1 i l p

ˆσimax ϵ ˆd q

ζni = min (ˆσni 1 min ˆσni max, ˆσni min ˆσni+1 max )

for 1 < i < l, ζn1 = ˆσn1 min ˆσn2 max and ζnl = min (ˆσnl 1 min ˆσnl max, ˆσnl min).

Proof The proof follows because all parameters are equivalent up to a orthogonal transform (See discussion preceding Proposition 9.2). Following that we use Propositions 12.4 and 12.5.

15. Order Estimation Lower Bound

Lemma 14 (Theorem 4.21 in Boucheron et al. (2013)) Let {Pi}N i=0 be probability laws over (Σ, A) and let {Ai A}N i=0 be disjoint events. If a = mini=0,...,N Pi(Ai) 1/(N + 1),

+ (1 a) log 1 a

i=1 KL(Pi||P0) (69)

Lemma 15 (Le Cam s Method) Let P0, P1 be two probability laws then

sup θ {0,1} Pθ[M = ˆ M] 1

1 2KL(P0||P1)

T. Sarkar and A. Rakhlin and M. A. Dahleh

Proposition 15.1 Let N0, N1 be two multivariate Gaussians with mean µ0 RT , µ1 RT

and covariance matrix Σ0 RT T , Σ1 RT T respectively. Then the KL(N0, N1) =

1 2 tr(Σ 1 1 Σ0) T + log det(Σ1)

det(Σ0) + Eµ1,µ0[(µ1 µ0) Σ 1 1 (µ1 µ0)] .

In this section we will prove a lower bound on the ﬁnite time error for model approximation. In systems theory subspace based methods are useful in estimating the true system parameters. Intuitively, it should be harder to correctly estimate the subspace that corresponds to lower Hankel singular values, or energy due to the presence of noise. However, due to strong structural constraints on Hankel matrix ﬁnding a minimax lower bound is a much harder proposition for LTI systems. Speciﬁcally, it is not clear if standard subspace identiﬁcation lower bounds can provide reasonable estimates for a structured and non i.i.d. setting such as our case. To alleviate some of the technical diﬃculties that arise in obtaining the lower bounds, we will focus on a small set of LTI systems which are simply parametrized by a number ζ. Consider the following canonical form order 1 and 2 LTI systems respectively with m = p = 1 and let R be the noise-to-signal ratio bound.

0 1 0 0 0 0 ζ 0 0

, A1 = A0, B0 =

, C0 = 0 0 βR , C1 = C0

A0, A1 are Schur stable whenever |ζ|< 1.

1 0 0 0 0 . . . 0 0 0 0 0 . . . 0 0 0 0 0 . . . 0 0 0 0 0 . . . 0 0 0 0 0 . . . ... ... ... ... ... ...

1 0 ζ 0 0 . . . 0 ζ 0 0 0 . . . ζ 0 0 0 0 . . . 0 0 0 0 0 . . . 0 0 0 0 0 . . . ... ... ... ... ... ...

Here Hζ,0, Hζ,1 are the Hankel matrices generated by (C0, A0, B0), (C1, A1, B1) respectively. It is easy to check that for Hζ,1 we have 1

ζ where σi are Hankel singular values.

Further the rank of Hζ,0 is 1 and that of Hζ,1 is at least 2. Also, ||T O0, ((Ci,Ai,Bi))||2

||T0, ((Ci,Ai,Bi))||2 R.

This construction will be key to show that identiﬁcation of a particular rank realization depends on the condition number of the Hankel matrix. An alternate representation of the

Finite Time LTI System Identification

input output behavior is

y T y T 1 ... y1

CB CAi B . . . CAT 1 i B 0 CB . . . CAT 2 i B ... ... ... ... 0 0 . . . CB

u T+1 u T ... u2

C CAi . . . CAT 1 i 0 C . . . CAT 2 i ... ... ... ... 0 0 . . . C

ηT+1 ηT ... η2

w T w T 1 ... w1

where Ai {A0, A1}. We will prove this result for a general class of inputs, i.e., active inputs. Then we will follow the same steps as in proof of Theorem 2 in Tu et al. (2018b).

KL(P0||P1) = EP0

γt(ut|{ul, yl}t 1 l=1)P0(yt|{ul}t 1 l=1) γt(ut|{ul, yl}t 1 l=1)P1(yt|{ul}t 1 l=1)

P0(yt|{ul}t 1 l=1) P1(yt|{ul}t 1 l=1)

Here γt( | ) is the active rule for choosing ut from past data. From Eq. (72) it is clear that conditional on {ul}T l=1, {yl}T l=1 is Gaussian with mean given by Πi U. Then we use Birge s inequality (Lemma 14). In our case Σ0 = O0O 0 + I, Σ1 = O1O 1 + I where Oi is given in Eq. (72). We will apply a combination of Lemma 14, Proposition 15.1 and assume ηi are i.i.d Gaussian to obtain our desired result. Note that O1 = O0 but Π1 = Π0. Therefore, from Proposition 15.1 KL(N0, N1) = Eµ1,µ0[(µ1 µ0) Σ 1 1 (µ1 µ0)] T ζ2

R2 where µi = Πi U. For any δ (0, 1/4), set a = 1 δ in Proposition 14, then we get whenever

δ log δ 1 δ

+ (1 δ) log 1 δ

we have supi =j PAi(Aj) δ. For δ [1/4, 1) we use Le Cam s method in Lemma 15 and

show that if 8δ2 Tζ2

R2 then supi =j PAi(Aj) δ. Since δ2 c log 1

δ when δ [1/4, 1) for an absolute constant, our assertion holds.