# finite_time_lti_system_identification__62aaf507.pdf Journal of Machine Learning Research 22 (2021) 1-61 Submitted 8/19; Revised 4/20; Published 1/21 Finite Time LTI System Identification Tuhin Sarkar tuhin91@gmail.com Department of Electrical Engineering and Computer Sciences Massachusetts Institute of Technology Cambridge, MA 02139, USA Alexander Rakhlin rakhlin@mit.edu Department of Brain and Cognitive Sciences Massachusetts Institute of Technology Cambridge, MA 02139, USA Munther A. Dahleh dahleh@mit.edu Department of Electrical Engineering and Computer Sciences Massachusetts Institute of Technology Cambridge, MA 02139, USA Editor: Benjamin Recht We address the problem of learning the parameters of a stable linear time invariant (LTI) system with unknown latent space dimension, or order, from a single time series of noisy input-output data. We focus on learning the best lower order approximation allowed by finite data. Motivated by subspace algorithms in systems theory, where the doubly infinite system Hankel matrix captures both order and good lower order approximations, we construct a Hankel-like matrix from noisy finite data using ordinary least squares. This circumvents the non-convexities that arise in system identification, and allows accurate estimation of the underlying LTI system. Our results rely on careful analysis of self-normalized martingale difference terms that helps bound identification error up to logarithmic factors of the lower bound. We provide a data-dependent scheme for order selection and find an accurate realization of system parameters, corresponding to that order, by an approach that is closely related to the Ho-Kalman subspace algorithm. We demonstrate that the proposed model order selection procedure is not overly conservative, i.e., for the given data length it is not possible to estimate higher order models or find higher order approximations with reasonable accuracy. Keywords: Linear Dynamical Systems, System Identification, Non parametric statistics, control theory, Statistical Learning theory 1. Introduction Finite-time system identification the problem of estimating the system parameters given a finite single time series of its output is an important problem in the context of control theory, time series analysis, robotics, and economics, among many others. In this work, we focus on parameter estimation and model approximation of linear time invariant (LTI) c 2021 T. Sarkar and A. Rakhlin and M. A. Dahleh. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v22/19-725.html. T. Sarkar and A. Rakhlin and M. A. Dahleh systems or linear dynamical system (LDS), which are described by Xt+1 = AXt + BUt + ηt+1 Yt = CXt + wt. (1) Here C Rp n, A Rn n, B Rn m; {ηt, wt} t=1 are process and output noise, Ut is an external control input, Xt is the latent state variable and Yt is the observed output. The goal here is parameter estimation, i.e., learning (C, A, B) from a single finite time series of {Yt, Ut}T t=1 when the order, n, is unknown. Since typically p, m < n, it becomes challenging to find suitable parametrizations of LTI systems for provably efficient learning. When {Xj} j=1 are observed (or, C is known to be the identity matrix), identification of (C, A, B) in Eq. (1) is significantly easier, and ordinary least squares (OLS) is a statistically optimal estimator. It is, in general, unclear how (or if) OLS can be employed in the case when Xt s are not observed. To motivate the study of a lower-order approximation of a high-order system, consider the following example: Example 1 Consider M1 = (A1, B1, C1) with 0 1 0 0 . . . 0 0 0 1 0 . . . 0 ... ... ... ... ... ... 0 0 0 0 . . . 1 a 0 0 0 . . . 0 0 0 ... 0 1 C1 = B 1 (2) where na 1 and n > 20. Here the order of M1 is n. However, it can be approximated well by M2 which is of a much lower order and given by A2 = 0 0 1 0 C2 = B 2 . (3) For the same input Ut, if Y (1) t , Y (2) t be the output generated by M1 and M2 respectively then a simple computation shows that (Y (1) t Y (2) t )2 U2 t 4n2a2 1 This suggests that the actual value of n is not important; rather there exists an effective order, r (which is 2 in this case). This lower order model captures most of the LTI system. Since the true model order is not known in many cases, we emphasize a nonparametric approach to identification: one which adaptively selects the best model order for the given data and approximates the underlying LTI system better as T (length of data) grows. The key to this approach will be designing an estimator ˆ M from which we obtain a realization ( ˆC, ˆA, ˆB) of the selected order. Finite Time LTI System Identification 1.1 Related Work Linear time invariant systems are an extensively studied class of models in control and systems theory. These models are used in feedback control systems (for example in planetary soft landing systems for rockets (A cıkme se et al., 2013)) and as linear approximations to many non linear systems that nevertheless work well in practice. In the absence of process and output noise, subspace-based system identification methods are known to learn (C, A, B) (up to similarity transformation)(Ljung, 1987; Van Overschee and De Moor, 2012). These typically involve constructing a Hankel matrix from the input output pairs and then obtaining system parameters by a singular value decomposition. Such methods are inspired by the celebrated Ho-Kalman realization algorithm (Ho and Kalman, 1966). The correctness of these methods is predicated on the knowledge of n or presence of infinite data. Other approaches include rank minimization-based methods for system identification (Fazel et al., 2013; Grussler et al., 2018), further relaxing the rank constraint to a suitable convex formulation. However, there is a lack of statistical guarantees for these algorithms, and it is unclear how much data is required to obtain accurate estimates of system parameters from finite noisy data. Empirical methods such as the EM algorithm (Dempster et al., 1977) are also used in practice; however, these suffer from non-convexity in problem formulation and can get trapped in local minima. Learning simpler approximations to complex models in the presence of finite noisy data was studied in Venkatesh and Dahleh (2001) where identification error is decomposed into error due to approximation and error due to noise; however the analysis assumes the knowledge of a good parametrization and does not provide statistical guarantees for learning the system parameters of such an approximation. More recently, there has been a resurgence in the study of statistical identification of LTI systems from a single time series in the machine learning community. In cases when C = I, i.e., Xt is observed directly, sharp finite time error bounds for identification of A, B from a single time series are provided in Faradonbeh et al. (2017); Simchowitz et al. (2018); Sarkar and Rakhlin (2018). The approach to finding A, B is based on a standard ordinary least squares (OLS) given by ( ˆA, ˆB) = arg min A,B t=1 ||Xt+1 [A, B][X t , U t ] ||2 2. Another closely related area is that of online prediction in time series Hazan et al. (2018); Agarwal et al. (2018). Finite time regret guarantees for prediction in linear time series are provided in Hazan et al. (2018). The approach there circumvents the need for system identification and instead uses a filtering technique that convolves the time series with eigenvectors of a specific Hankel matrix. Closest to our work is that of Oymak and Ozay (2018). Their algorithm, which takes inspiration from the Kalman Ho algorithm, assumes the knowledge of model order n. This limits the applicability of the algorithm in two ways: first, it is unclear how the techniques can be extended to the case when n is unknown as is usually the case and, second, in many cases n is very large and a much lower order LTI system can be a very good approximation of the original system. In such cases, constructing the order n estimate might be unnecessarily conservative (See Example 1). Consequently, the error bounds do not reflect accurate dependence on the system parameters. T. Sarkar and A. Rakhlin and M. A. Dahleh When n is unknown, it is unclear when a singular value decomposition should be performed to obtain the parameter estimates via Ho-Kalman algorithm. This leads to the question of model order selection from data. For subspace based methods, such problems have been addressed in Shibata (1976) and Bauer (2001). These papers address the question of estimating order in the context of subspace methods. Specifically, order estimation is achieved by analyzing the information contained in the estimated singular values and/or estimated innovation variance. Furthermore, they provide guarantees for asymptotic consistency of the methods described. It is unclear, however, if these techniques and guarantees can be extended to the case when only finite data is available. Another line of literature studied in Ljung et al. (2015) for example, approaches the identification of systems with unknown order by first learning the largest possible model that fits the data and then performing model reduction to obtain the final system. Although one can show that asymptotically this method outputs the true model, we show that such a two step procedure may underperform in a finite time setting. A possible explanation for this could be that learning the largest possible model with finite data over-fits on the exogenous noise and therefore gives poor model estimates. The key difference from prior work is that we provide a direct approach to model selection, instead of learning the largest possible model from data and subsequent model truncation, and provide finite time guarantees. Other related work on identifying finite impulse response approximations include Goldenshluger (1998); Tu et al. (2017); but they do not discuss parameter estimation or reduced order modeling. Several authors Campi and Weyer (2002); Shah et al. (2012); Hardt et al. (2016) and references therein have studied the problem of system identification in different contexts. However, they fail to capture the correct dependence of system parameters on error rates. More importantly, they suffer from the same limitation as Oymak and Ozay (2018) that they require the knowledge of n. 2. Mathematical Preliminaries Throughout the paper, we will refer to an LTI system with dynamics as Eq. (1) by M = (C, A, B). For a matrix A, let σi(A) be the ith singular value of A with σi(A) σi+1(A). Further, σmax(A) = σ1(A) = σ(A). Similarly, we define ρi(A) = |λi(A)|, where λi(A) is an eigenvalue of A with ρi(A) ρi+1(A). Again, ρmax(A) = ρ1(A) = ρ(A). Definition 1 A matrix A is Schur stable if ρmax(A) < 1. We will only be interested in the class of LTI systems that are Schur stable. Fix γ > 0 (and possibly much greater than 1). The model class Mr of LTI systems parametrized by r Z+ is defined as Mr = {(C, A, B) | C Rp r, A Rr r, B Rr m, ρ(A) < 1, σ(A) γ}. (4) Finite Time LTI System Identification Definition 2 The (k, p, q) dimensional Hankel matrix for M = (C, A, B) as Hk,p,q(M) = CAk B CAk+1B . . . CAq+k 1B CAk+1B CAk+2B . . . CAq+k B ... ... ... ... CAp+k 1B . . . . . . CAp+q+k 2B and its associated Toeplitz matrix as 0 0 . . . 0 0 CAk B 0 . . . 0 0 ... ... ... ... 0 CAd+k 3B . . . CAk B 0 0 CAd+k 2B CAd+k 3B . . . CAk B 0 We will slightly abuse notation by referring to Hk,p,q(M) = Hk,p,q. Similarly for the Toeplitz matrices Tk,d(M) = Tk,d. The matrix H0, , (M) is known as the system Hankel matrix corresponding to M, and its rank is known as the model order (or simply order) of M. The system Hankel matrix has two well-known properties that make it useful for system identification. First, the rank of H0, , has an upper bound n. Second, it maps the past inputs to future outputs. These properties are discussed in detail in appendix as Section 9.2. For infinite matrices H0, , , ||H0, , ||2 ||H0, , ||op, i.e., the operator norm. Definition 3 The transfer function of M = (C, A, B) is given by G(z) = C(z I A) 1B where z C. The transfer function plays a critical role in control theory as it relates the input to the output. Succinctly, the transfer function of an LTI system is the Z transform of the output in response to a unit impulse input. Since for any invertible S the LTI systems M1 = (CS 1, SAS 1, SB), M2 = (C, A, B) have identical transfer functions, identification may not be unique, but equivalent up to a transformation S, i.e., (C, A, B) (CS, S 1AS, S 1B). Next, we define a system norm that will be important from the perspective of model identification and approximation. Definition 4 The H system norm of a Schur stable LTI system M is given by ||M|| = sup ω R σmax(G(ejω)). Here, G( ) is the transfer function of M. The r truncation of the transfer function is defined as Gr := [CB, CAB, . . . , CAr 1B]. (5) For a stable LTI system M we have Proposition 2.1 (Lemma 2.2 Glover (1987)) Let M be a LTI system then ||M||H= σ1 ||M|| 2(σ1 + . . . + σn) where σi are the singular values of H0, , (M). T. Sarkar and A. Rakhlin and M. A. Dahleh For any matrix Z, define Zm:n,p:q as the submatrix including row m to n and column p to q. Further, Zm:n,: is the submatrix including row m to n and all columns and a similar notion exists for Z:,p:q. Finally, we define balanced truncated models which will play an important role in our algorithm. Definition 5 (Kung and Lin (1981)) Let H0, , (M) = UΣV where Σ Rn n (n is the model order). Then for any r n, the r order balanced truncated model parameters are given by Cr = [UΣ1/2]1:p,1:r, Ar = Σ 1/2 1:r,1:r U :,1:r[UΣ1/2]p+1:,1:r, Br = [Σ1/2V ]1:r,1:m. For r > n, the r order balanced truncated model parameters are the n order truncated model parameters. Definition 6 We say a random vector v Rd is subgaussian with variance proxy τ 2 if sup ||θ||2=1 sup p 1 n p 1/2 (E[| v, θ |p])1/po = τ and E[v] = 0. We denote this by v subg(τ 2). A fundamental result in model reduction from systems theory is the following Theorem 2.1 (Theorem 21.26 Zhou et al. (1996)) Let M = (C, A, B) be the true model of order n and Mr = (Cr, Ar, Br) be its balance truncated model of order r < n. Assume that σr = σr+1. Then ||M Mr|| 2(σr+1 + σr+2 + . . . + σn) where σi are the Hankel singular values of M. Critical to obtaining refined error rates, will be a result from the theory of self normalized martingales, an application of the pseudo-maximization technique in (Pe na et al., 2008, Theorem 14.7): Theorem 2.2 Let {Ft} t=0 be a filtration. Let {ηt Rm, Xt Rd} t=1 be stochastic processes such that ηt, Xt are Ft measurable and ηt is Ft 1-conditionally subg(L2) for some L > 0. For any t 0, define Vt = Pt s=1 Xs X s, St = Pt s=1 ηs+1Xs. Then for any δ > 0, V 0 and all t 0 we have with probability at least 1 δ S t (V + Vt) 1St 4L2 log 1 δ + log det(V + Vt) det(V ) + m . The proof of this result can be found as Theorem 8.5. We denote by c universal constants which can change from line to line. For numbers a, b, we define a b min (a, b) and a b max (a, b). Finally, for two matrices M1 Rl1 l1, M2 Rl2 l2 with l1 < l2, M1 M2 M1 M2 where M1 = M1 0l1 l2 l1 0l2 l1 l1 0l2 l1 l2 l1 Finite Time LTI System Identification Proposition 2.2 (System Reduction) Let ||S P|| ϵ and the singular values of S be arranged as follows: σ1(S) > . . . > σr 1(S) > σr(S) σr+1(S) . . . σs(S) > σs+1(S) > . . . σn(S) > σn+1(S) = 0 Furthermore, let ϵ be such that ϵ inf {1 i r 1} {s+1 i n} σi(P) σi+1(P) Define K0 = [1, 2, . . . , r 1] [s + 1, s + 2, . . . , n], then ||US K0(ΣS K0)1/2 UP K0(ΣP K0)1/2||2 2 (σi σi+1)2 (σi 1 σi)2 ((σr 1 σs) (σr σs+1))2 + sup 1 i s | σi p and σi = σi(S), ˆσi = σi(P). The proof is provided in Proposition 12.4 in the appendix. This is an extension of Wedin s result that allows us to scale the recovery error of the rth singular vector by only condition number of that singular vector. This is useful to represent the error of identifying a r-order approximation as a function of the rth-singular value only. We briefly summarize our contributions below. 3. Contributions In this paper we provide a purely data-driven approach to system identification from a single time series of finite noisy data. Drawing from tools in systems theory and the theory of self normalized martingales, we offer a nearly optimal OLS-based algorithm to learn the system parameters. We summarize our contributions below: The central theme of our approach is to estimate the infinite system Hankel matrix (to be defined below) with increasing accuracy as the length T of data grows. By utilizing a specific reformulation of the input output relation in Eq. (1) we reduce the problem of Hankel matrix identification to that of regression between appropriately transformed versions of output and input. The OLS solution is a matrix ˆH of size ˆd. More precisely, we show that with probability at least 1 δ, ˆH H0, ˆd, ˆd p ˆd + log T for T above a certain threshold, where H0, ˆd, ˆd is the p ˆd m ˆd principal submatrix of the system Hankel. Here β is the H system norm. We show that by growing ˆd with T in a specific fashion, ˆH becomes the minimax optimal estimator of the system Hankel matrix. The choice of ˆd for a fixed T is purely datadependent and does not depend on spectral radius of A or n. T. Sarkar and A. Rakhlin and M. A. Dahleh It is well known in systems theory that SVD of the doubly infinite system Hankel matrix gives us A, B, C. However, the presence of finite noisy data prevents learning these parameters accurately. We show that it is always possible to learn the parameters of a lower-order approximation of the underlying system. This is achieved by selecting the top k singular vectors of ˆH. The estimation guarantee corresponds to model selection in Statistics. More precisely, for every k ˆd if (Ak, Bk, Ck) are the parameters of a k-order balanced approximation of the original LTI system and ( ˆAk, ˆBk, ˆCk) are the estimates of our algorithm then for T above a certain threshold we have ||Ck ˆCk||2+||Ak ˆAk||2+||Bk ˆBk||2 β2 ˆd ˆσ2 k T p ˆd + log T with probability at least 1 δ where ˆσi is the ith largest singular value of ˆH. 4. Problem Formulation and Discussion 4.1 Data Generation Assume there exists an unknown M = (C, A, B) Mn for some unknown n. Let the transfer function of M be G(z). Suppose we observe the noisy output time series {Yt Rp 1}T t=1 in response to user chosen input series, {Ut Rm 1}T t=1. We refer to this data generated by M as ZT = {(Ut, Yt)}T t=1. We enforce the following assumptions on M. Assumption 1 The noise process {ηt, wt} t=1 in the dynamics of M given by Eq. (1) are i.i.d. and ηt, wt are isotropic with sub Gaussian parameter 1. Furthermore, X0 = 0 almost surely. We will only select inputs, {Ut}T t=1, that are isotropic sub Gaussian with sub Gaussian parameter 1. The input output map of Eq. (1) can be represented in multiple alternate ways. One commonly used reformulation of the input output map in systems and control theory is the following Y1 Y2 ... YT U1 U2 ... UT η1 η2 ... ηT w1 w2 ... w T where T Ok,d is defined as the Toeplitz matrix corresponding to process noise ηt (similar to Definition 2): 0 0 . . . 0 0 CAk 0 . . . 0 0 ... ... ... ... 0 CAd+k 3 . . . CAk 0 0 CAd+k 2 CAd+k 3 . . . CAk 0 ||T0,T ||2, ||T O0,T ||2 denote observed amplifications of the control input and process noise respectively. Note that stability of A ensures ||T0, ||2, ||T O0, ||2< . Suppose both Finite Time LTI System Identification m: Input dimension, p: Output dimension γ: Known upper bound on ||A||2 δ: Error probability c, C: Known absolute constants R: Known noise to signal ratio, or, ||T O0, ||2 ||T0, ||2 β: Known upper bound on H -norm of LTI system D(T) = {d|T cm2d log2 (d) log2 (m2/δ) + cd log3 (2d)} σA = Pd l=1||CAl B||2, σB = Pd l=1||CAl||2 σ Pd k=1 T d+k,T Td+k,T , σD = r σ Pd k=1 T O d+k,T T Od+k,T lp+log (T/δ)+m Table 1: Summary of constants ηt, wt = 0 in Eq. (1). Then it is a well-known fact that ||M|| = sup Ut s P t=0 Y t Yt P t=0 U t Ut = ||M|| = ||T0, ||2 ||H0, , ||2. (7) Assumption 2 There exist universal constants β, R 1 such that ||T0, ||2 β, ||T O0, ||2 ||T0, ||2 R. Remark 7 (H -norm estimation) Assumption 2 implies that an upper bound to the H norm of the system. It is possible to estimate ||M|| from data (See Tu et al. (2018a) and references therein). It is reasonable to expect that error rates for identification of the parameters (C, A, B) depend on the noise-to-signal ratio ||T O0, ||2 ||T0, ||2 , i.e., identification is much harder when the ratio is large. Remark 8 (R estimation) The noise to signal ratio hyperparameter can also be estimated from data, by allowing the system to run with Ut = 0 and taking the average ℓ2 norm of the output Yt, i.e., (1/T) PT t=1 Yt 2 2. For the purpose of the results of the paper we simply assume an upper bound on R. If Ut was subg(L) instead of subg(1), the noise-to-signal ratio is modified to R/L instead. T. Sarkar and A. Rakhlin and M. A. Dahleh 5. Algorithmic Details We will now represent the input output relationship in terms of the Hankel and Toeplitz matrices defined before. Fix a d, then for any l we have Yl Yl+1 ... Yl+d 1 Ul 1 Ul 2 ... Ul d Ul Ul+1 ... Ul+d 1 ηl 1 ηl 2 ... ηl d+1 ηl ηl+1 ... ηl+d 1 + Hd,d,l d 1 Ul d 1 Ul d 1 ... U1 + Od,d,l d 1 ηl d 1 ηl d 1 ... η1 wl wl+1 ... wl+d 1 or, succinctly, Y + l,d = H0,d,d U l 1,d + T0,d U+ l,d + Hd,d,l d 1 U l d 1,l d 1 + O0,d,d η l 1,d + T O0,d η+ l,d + Od,d,l d 1 η l d 1,l d 1 + w+ l,d (9) CAk CAk+1 . . . CAq+k 1 CAk+1 CAk+2 . . . CAd+k ... ... ... ... CAp+k 1 . . . . . . CAp+q+k 2 Yl Yl 1 ... Yl d+1 , Y + l,d = Yl Yl+1 ... Yl+d 1 Furthermore, U l,d, η l,d are defined similar to Y l,d and U+ l,d, η+ l,d, w+ l,d are similar to Y + l,d. The + and signs indicate moving forward and backward in time respectively. This representation will be at the center of our analysis. There are three key steps in our algorithm which we describe in the following sections: (a) Hankel submatrix estimation: Estimating H0,l,l for every 1 l T. We refer to the estimators as { ˆH0,l,l}T l=1. (b) Model Selection: From the estimators { ˆH0,l,l}T l=1 select ˆH0, ˆd, ˆd in a data dependent way such that it best estimates H0, , . (c) Parameter Recovery: For every k ˆd, we do a singular value decomposition of ˆH0, ˆd, ˆd to obtain parameter estimates for a good k-order approximation of the true model. 5.1 Hankel Submatrix Estimation The goal of our systems identification is to estimate either H0,n,n or H0, , . Since we only have finite data and no apriori knowledge of n it is not possible to directly estimate the unknown matrices. The first step then is to estimate all possible Hankel submatrices that are allowed by data, i.e., H0,d,d for d T. For a fixed d, Algorithm 1 estimates the d d principal submatrix H0,d,d. Finite Time LTI System Identification Algorithm 1 Learn System(T, d, m, p) Input T = Horizon for learning d = Hankel Size m = Input dimension p = Output dimension Output System Parameters: ˆH0,d,d 1: Generate 2T i.i.d. inputs {Uj N(0, Im m)}2T j=1. 2: Collect 2T input output pairs {Uj, Yj}2T j=1. 3: ˆH0,d,d = arg min H PT 1 l=0 || Y + l+d+1,d H U l+d,d||2 2 4: return ˆH0,d,d It can be shown that ˆH0,d,d = T 1 X l=0 Y + l+d+1,d( U l+d,d) T 1 X l=0 U l+d,d( U l+d,d) + (10) and by running the algorithm T times, we obtain { ˆH0,d,,d}T d=1. A key step in showing that ˆH0,d,d is a good estimator for H0,d,d is to prove the finite time isometry of VT = PT 1 l=0 U l+d,d( U l+d,d) , i.e., the sample covariance matrix. Lemma 5.1 Define T0(δ, d) = cm2d log2 (d) log2 (m2/δ) + cd log3 (2d) where c is some universal constant. Define the sample covariance matrix VT := PT 1 l=0 U l+d,d( U l+d,d) . We have with probability 1 δ and for T > T0(δ, d) Lemma 5.1 allows us to write Eq. (10) as ˆH0,d,d = PT 1 l=0 Y + l+d+1,d( U l+d,d) PT 1 l=0 U l+d,d( U l+d,d) 1 with high probability and upper bound estimation error for d d principal submatrix. Theorem 5.1 Fix d and let ˆH0,d,d be the output of Algorithm 1. Then for any 0 < δ < 1 and T T0(δ, d), we have with probability at least 1 δ ˆH0,d,d H0,d,d 2 4σ Here T0(δ, d) = cm2d log2 (d) log2 (m2/δ) + cd log3 (2d), c is a universal constant and σ = max (σA, σB, σC, σD) from Table 1. Proof We outline the proof here. Recall Eq. (8), (9). Then for a fixed d ˆH0,d,d = T 1 X l=0 Y + l+d+1,d( U l+d,d) V + T . T. Sarkar and A. Rakhlin and M. A. Dahleh Then the identification error is ˆH0,d,d H0,d,d 2 = V + T T 1 X l=0 U l+d,d U+ l+d+1,d T 0,d + U l+d,d U l,l H d,d,l + U l+d,d w+ l+d+1,d + U l+d,d η l+d,d O 0,d,d + U l+d,d η+ l+d+1,d T O 0,d + U l+d,d η l,l O d,d,l 2 = ||V + T E||2 (12) l=0 U l+d,d U+ l+d+1,d T 0,d + U l+d,d U l,l H d,d,l + U l+d,d w+ l+d+1,d + U l+d,d η l+d,d O 0,d,d + U l+d,d η+ l+d+1,d T O 0,d + U l+d,d η l,l O d,d,l. By Lemma 5.1 we have, whenever T T0(δ, d), with probability at least 1 δ This ensures that, with high probability, that V 1 T exists and decays as O(T 1). The next step involves showing that ||E||2 grows at most as T with high probability. This is reminiscent of Theorem 2.2 and the theory of self normalized martingales. However, unlike that cases the conditional sub-Gaussianity requirements do not hold here. For example, let Fl = σ(η1, . . . , ηl) then E[v η l+1,l+1|Fl] = 0 for all v since { η l+1,l+1}T 1 l=0 is not an independent sequence. As a result it is not immediately obvious on how to apply Theorem 2.2 to our case. Under the event when Eq. (13) holds (which happens with high probability), a careful analysis of the normalized cross terms, i.e., V 1/2 T E shows that ||V 1/2 T E||2= O(1) with high probability. This is summarized in Propositions 11.1-11.3. The idea is to decompose E into a linear combination of independent subgaussians and reduce it to a form where we can apply Theorem 2.2. This comes at the cost of additional scaling in the form of system dependent constants such as the H norm. Then we can conclude with high probability that || ˆH H0,d,d||2 ||V 1/2 T ||2||V 1/2 T E||2 T 1/2O(1). The full proof has been deferred to Section 11.1 in Appendix 11. Remark 9 Recall D(T) from Table 1. Since d D(T) = T T0(δ, d) we can restate Theorem 5.1 as follows: for a fixed T, we have with probability at least 1 δ that ˆH0,d,d H0,d,d 2 4σ when d D(T). Finite Time LTI System Identification We next present bounds on σ in Theorem 5.1. From the perspective of model selection in later sections, we require that σ be known. In the next proposition we present two bounds on σ, the first one depends on unknown parameters and recovers the precise dependence on d. The second bound is an apriori known upper bound and incurs an additional factor of Proposition 5.1 σ upper bound independent of d: σ cn (1 ρ(A))2 where cn depends only on n. σ upper bound dependent on d: where R is the noise-to-signal ratio as in Table 1 Proof By Gelfand s formula, since ||Ad||2 c(n)ρmax(A)d where ρmax(A) < 1 and c(n) is a constant that only depends on n, it implies that l=0 ||CAl B||2 l=0 ||CAl B||2 l=0 c(n)ρ(A)l = c(n) 1 ρ(A), ||Td+k,T ||2 l=0 ||CAd+k+l B||2 c(n)ρ(A)d+k k=1 T d+k,T Td+k,T (1 ρ(A))2 c(n) (1 ρ(A))2 . Similarly, there exists a finite upper bound on σB, σD by replacing CAl B and Td+k,T with CAl and T Od+k,T respectively. For the d independent upper bound, we have l=0 ||CAl B||2 l=0 ||CAl B||2 2 Since σ T d+k,T Td+k,T β, then k=1 T d+k,T Td+k,T T. Sarkar and A. Rakhlin and M. A. Dahleh For the σB, σD we get an extra R because T O0, βR. The key feature of the data dependent upper bound is that it only depends on β and R which are known apriori. Recall that Gd = [CB, CAB, . . . , CAd 1B], i.e., the d-order FIR truncation of G(z). Since the p rows of the H0,d,d matrix corresponds to Gd we can obtain estimators for any d-order FIR. Corollary 5.1 Let b Gd = ˆH0,d,d[1 : p, :] denote the first p-rows of ˆH0,d,d. Then for any 0 < δ < 1 and T T0(δ, d), we have with probability at least 1 δ, || b Gd Gd||2 4σ Proof Proof follows because Gd = H0,d,d[1 : p, :] and Theorem 5.1. Next, we show that the error in Theorem 5.1 is minimax optimal (up to logarithmic factors) and cannot be improved by any estimation method. Proposition 5.2 Let T c where c is an absolute constant. Then for any estimator ˆH of H0, , we have sup ˆ H E[|| ˆH H0, , ||2] cn where cn > 0 is a constant that is independent of T but can depend on system level parameters. Proof Assume the contrary that sup ˆ H E[|| ˆH H0, , ||2] = o Then recall that [H0, , ]1:p,: = [CB, CAB, . . . , ] and G(z) = z 1CB + z 2CAB + . . .. Similarly we have ˆG(z). Define k=0 ||CAk B ˆC ˆAk ˆB||2 2. If sup ˆ H E[|| ˆH H0, , ||2] = o q T , then since || ˆH H0, , ||2 ||G ˆG||2 we can conclude that E[||G ˆG||2] = o which contradicts Theorem 5 in (Goldenshluger, 1998). Thus, sup ˆ H E[|| ˆH H0, , ||2] Finite Time LTI System Identification 5.2 Model Selection At a high level, we want to choose ˆH0, ˆd, ˆd from { ˆH0,d,d}T d=1 such that ˆH0, ˆd, ˆd is a good estimator of H0, , . Our idea of model selection is motivated by (Goldenshluger, 1998). For any ˆH0,d,d, the error from H0, , can be broken as: || ˆH0,d,d H0, , ||2 || ˆH0,d,d H0,d,d||2 | {z } =Estimation Error + ||H0,d,d H0, , ||2 | {z } =Truncation Error We would like to select a d = ˆd such that it balances the truncation and estimation error in the following way: c2 Data dependent upper bound c1 Estimation Error Truncation Error where ci are absolute constants. Such a balancing ensures that || ˆH0, ˆd, ˆd H0, , ||2 c2 (1/c1 + 1) Data dependent upper bound . (14) Note that such a balancing is possible because the estimation error increases as d grows and truncation error decreases with d. Furthermore, a data dependent upper bound for estimation error can be obtained from Theorem 5.1. Unfortunately (C, A, B) are unknown and it is not immediately clear on how to obtain such a bound for truncation error. To achieve this, we first define a truncation error proxy, i.e., how much do we truncate if a specific ˆH0,d,d is used. For a given d, we look at || ˆH0,d,d ˆH0,l,l||2 for l D(T) d. This measures the additional error incurred if we choose ˆH0,d,d as an estimator for H0, , instead of ˆH0,l,l for l > d. Then we pick ˆd as follows: || ˆH0,d,d ˆH0,l,l||2 16βR α(l) l D(T) d Recall that α(l) = q l log (l/δ)+pl2+ml T , where q log (l/δ)+pl+m T denotes how much estimation error is incurred in learning l l Hankel submatrix, the extra β l is incurred because we need a data dependent, albeit coarse, upper bound on the estimation error. A key step will be to show that for any l d, whenever || ˆH0,d,d ˆH0,l,l||2 cβR α(l) ensures that || ˆH0,d,d H0, , ||2 cβR α(l) and || ˆH0,l,l H0, , ||2 cβR α(l) and there is no gain in choosing a larger Hankel submatrix estimate. By picking the smallest d for which such a property holds for all larger Hankel submatrices, we ensure that a regularized model is estimated that agrees with the data. T. Sarkar and A. Rakhlin and M. A. Dahleh Algorithm 2 Choice of d Output ˆH0, ˆd, ˆd, ˆd 1: D(T) = n d d T cm2 log3 (Tm/δ) m+hp+log( T 2: d0(T, δ) = inf n l || ˆH0,l,l ˆH0,h,h||2 16βR(α(h) + 2α(l)) h D(T), h l o . 3: ˆd = max d0(T, δ), log T 4: return ˆH0, ˆd, ˆd, ˆd We now state the main estimation result for H0, , for d = ˆd as chosen in Algorithm 2. Define T (δ) = inf n T d (T, δ) D(T), d (T, δ) 2d 256, δ o (16) d (T, δ) = inf 16βRα(d) ||H0,d,d H0, , ||2 A close look at Eq. (17) reveals that picking d = d (T, δ) ensures the balancing of Eq. (14). However, d (T, δ) depends on unknown quantities and is unknown. In such a case, ˆd in Eq. (15) becomes a proxy for d (T, δ). From an algorithmic stand point, we no longer need any unknown information; the unknown parameter only appear in T (δ), which is only required to make the theoretical guarantee of Theorem 5.2 below. Theorem 5.2 Whenever we have T T (δ) we have with probability at least 1 δ that || ˆH0, ˆd, ˆd H0, , ||2 12cβR m ˆd + p ˆd2 + ˆd log T The proof of Theorem 5.2 can be found as Proposition 13.8 in Appendix 13. We see that the error between ˆH0, ˆd, ˆd and H0, , can be upper bounded by a purely data dependent quantity. The next proposition shows that ˆd does not grow more that logarithmically in T. Proposition 5.3 Let T T (δ), d (T, δ) be as in Eq. (17). Then with probability at least 1 δ we have ˆd d (T, δ) log T Furthermore, d (T, δ) c log (c T + log 1 δ) log R + log β log 1 ρ(A) . The effect of unknown quantities, such as the spectral radius, are subsumed in the finite time condition T T (δ) and appear in an upper bound for ˆd; however this information is not needed from an algorithmic perspective as the selection of ˆd is agnostic to the knowledge of ρ(A). The proof of proposition can be found as Propositions 13.7 and 13.4. Finite Time LTI System Identification 5.3 Parameter Recovery Next we discuss finding the system parameters. To obtain system parameters we use a balanced truncation algorithm on ˆH0, ˆd, ˆd where ˆd is the output of Algorithm 2. The details are summarized in Algorithm 3 where H = ˆH0, ˆd, ˆd. Algorithm 3 Hankel2Sys(T, ˆd, k, m, p) Input T = Horizon for Learning ˆd = Hankel Size m = Input dimension p = Output dimension Output System Parameters: ( ˆC ˆd, ˆA ˆd, ˆB ˆd) 1: H = H0, ˆd, ˆd 2: Pad H with zeros to make of dimension 4p ˆd 4m ˆd 3: U, Σ, V SVD of H 4: U ˆd, V ˆd top ˆd singular vectors 5: ˆC ˆd first p rows of U ˆdΣ1/2 ˆd 6: ˆB ˆd first m columns of Σ1/2 ˆd V ˆd 7: Z0 = [U ˆdΣ1/2 ˆd ]1:4p ˆd p,:, Z1 = [U ˆdΣ1/2 ˆd ]p+1:,: 8: ˆA ˆd (Z 0 Z0) 1Z 0 Z1. 9: return ( ˆC ˆd, ˆA ˆd, ˆB ˆd) To state the main result we define a quantity that measures the singular value weighted subspace gap of a matrix S: Γ(S, ϵ) = q σ1max/ζ2 1 + σ2max/ζ2 2 + . . . + σlmax/ζ2 l , where S = UΣV and Σ is arranged into blocks of singular values such that in each block i we have supj σi j σi j+1 ϵ, i.e., Λ1 0 . . . 0 0 Λ2 . . . 0 ... ... ... 0 0 0 . . . Λl where Λi are diagonal matrices, σi j is the jth singular value in the block Λi and σi min, σi max are the minimum and maximum singular values of block i respectively. Furthermore, ζi = min (σi 1 min σi max, σi min σi+1 max) for 1 < i < l, ζ1 = σ1 min σ2 max and ζl = min (σl 1 min σl max, σl min). Informally, the ζi measure the singular value gaps between each blocks. It should be noted that l, the number of separated blocks, is a function of ϵ itself. For example: if ϵ = 0 then the number of blocks correspond to the number of distinct singular values. On the other hand, if ϵ is very large then l = 1. T. Sarkar and A. Rakhlin and M. A. Dahleh Theorem 5.3 Let M be the true unknown model and m ˆd + p ˆd2 + ˆd log T Then whenever T T (δ), we have with probability at least 1 δ: ||C ˆd ˆC ˆd||2 ||B ˆd ˆB ˆd||2 ||A ˆd ˆA ˆd||2 γϵΓ( ˆH0, ˆd, ˆd, 2ϵ) + γ sup 1 i ˆd + γ ϵ pˆσ ˆdϵ pˆσ ˆd where sup1 i ˆd p ˆσ ˆ d ϵ ˆd p 2 ˆdϵ and γ = max (4γ, 8). Theorem 5.3 holds for all k ˆd and proof follows directly from Theorem 13.8 where we show || ˆH0, ˆd, ˆd H0, , ||2 ϵ and Proposition 14.2. Theorem 5.3 provides an error bound between parameters (of model order ˆd) when true order is unknown. The subspace gap measure, Γ( ˆH0, ˆd, ˆd, 2ϵ), is bounded even when ϵ = 0. To see this, note that when ϵ = 0, ˆH0, ˆd, ˆd corresponds exactly to H0, ˆd, ˆd. In that case, the number of blocks correspond to the number of distinct singular values of H0, ˆd, ˆd, and ζni then corresponds to singular value gap between the unequal singular values. As a result Γ( ˆH0, ˆd, ˆd, 2ϵ) = < . Then the bound decays as ϵ = O q ˆd2/T for singular values ˆσ ˆd > ˆdϵ, but for much smaller singular values the bound decays as ϵ = O d2/T 1/4 . To shed more light on the behavior of our bounds, we consider the special case of known order. If n is the model order, then we can set ˆd = n. If σi = σi(H0, , ), then for large enough T one can ensure that min σi =σi+1(σi σi+1)/2 > ϵ, i.e., ϵ is less than the singular value gap and small enough that the spectrum of ˆH0,n,n is very close to that of H0, , . Consequently ˆσn σn/2 and we have that ||Cn ˆCn||2 ||Bn ˆBn||2 ||An ˆAn||2 γϵ + γϵ/ σn = cβ γR pn2 + n log T This upper bound is (nearly) identical to the bounds obtained in Oymak and Ozay (2018) for the known order case. We get an improvement in the bounds when σn 1 n, which is a consequence of the fact that we know where to threshold our Hankel matrix. The major advantage of our result is that we do not require any information/assumption on the LTI system besides β. Nonparametric approaches to estimating β have been studied in Tu et al. (2017). Finite Time LTI System Identification 5.4 Order Estimation Lower Bound In Theorem 5.3 it is shown that whenever T = Ω 1 σ2 ˆ d we can find an accurate ˆd order approximation. Now we show that if T = O 1 σ2 ˆ d then there is always some non zero probability with which we can not recover the singular vector corresponding to the σ ˆd+1. We prove the following lower bound for model order estimation when inputs {Ut}T t=1 are active and bounded which we define below Definition 10 An input sequence {Ut}T t=1 is said to be active if Ut is allowed to depend on past history {Ul, Yl}t 1 l=1. The input sequence is bounded if E[U t Ut] 1 for all t. Active inputs allow for the case when input selection can be adaptive due to feedback. Theorem 5.4 Fix δ > 0, ζ (0, 1/2). Let M1, M2 be two LTI systems and σ(1) i , σ(2) i be the ith-Hankel singular values respectively. Let σ(1) 1 σ(1) 2 2 ζ and σ(2) 2 = 0. Then whenever sup M {M1,M2} PZT M(order( ˆ M(ZT )) = order(M)) δ Here ZT = {Ut, Yt}T t=1 M means M generates T data points {Yt}T t=1 in response to active and bounded inputs {Ut}T t=1 and ˆ M(ZT ) is any estimator. Proof The proof can be found in appendix in Section 15 and involves using Fano s (or Birge s) inequality to compute the minimax risk between the probability density functions generated by two different LTI systems: 0 1 0 0 0 0 ζ 0 0 , A1 = A0, B0 = , C0 = 0 0 βR , C1 = C0. A0, A1 are Schur stable whenever |ζ|< 1. Theorem 5.4 shows that when the time required to recover higher order models depends inversely on the condition number, where the condition number is the ratio of largest and least singular values of the Hankel matrix. Specifically, to correctly distinguish between an order 1 and order 2 model T Ω(2/ζ2) where ζ is the condition number of the 2-order model. We compare this to our upper bound in Theorem 5.3 and Eq. (18), assume Γ( ˆH0, ˆd, ˆd, 2ϵ) for all ϵ [0, 1] and ˆdϵ ˆσ ˆd, then since parameter error, E, is upper bounded as m ˆd + p ˆd2 + ˆd log T we need T log T T. Sarkar and A. Rakhlin and M. A. Dahleh to correctly identify ˆd-order model. The ratio (β/σ ˆd) is equal to the condition number of the Hankel matrix. In this sense, the model selection followed by singular value thresholding is not too conservative in terms of R (the signal-to-noise ratio) and conditioning of the Hankel matrix. 6. Experiments The experiments in this paper are for the single trajectory case. A detailed analysis for system identification from multiple trajectories can be found in Tu et al. (2017). Suppose that the LTI system generating data, M, has transfer function given by G(z) = α0 + l=1 αlρlz l, ρ < 1 (20) where αi N(0, 1). M is a finite dimensional LTI system or order 150 with parameters as M = (C R1 150, A R150 150, B R150 1). For these illustrations, we assume a balanced system and choose R = 1, δ = 0.05. We estimate β0.6 = 15, β0.9 = 40, β0.99 = 140, pick Ut N(0, 1) and {wt, ηt} {N(0, 1), N(0, I)} respectively. We note that our algorithm requires the knowledge of universal constant c. Theoretically, it can be shown that c < 100 but in practice a value c 16 works well for simulations. Figure 1: Variation of Hankel size = ˆd with T for different values of ρ Fig. 1 shows how d = ˆd change with the number of data points for different values of ρ. When ρ = 0.6, i.e., small, ˆd does not grow too big with T even when the number of data points is increased. This shows that a small model order is sufficient to specify system dynamics. On the other hand, when ρ = 0.99, i.e., closer to instability the ˆd required is much larger, indicating the need for a higher order. Although ˆd implicitly captures the effect of spectral radius, the knowledge of ρ is not required for ˆd selection. In principle, our algorithm increases the Hankel size to the appropriate size as the data increases. We compare this to a deterministic growth policy d = log (T) and the SSREGEST Finite Time LTI System Identification algorithm Ljung et al. (2015). The SSREGEST algorithm first learns a large model from data and then performs model reduction to obtain a final model. In contrast, we go to reduced model directly by picking a small ˆd. This reduces the sensitivity to noise. In Fig. 2 shows the model errors for a deterministic growth policy d = log (T) and our algorithm. Although the difference is negligible when ρ = 0.6 (small), we see that our algorithm does better ρ = 0.99 due to its adaptive nature, i.e., ˆd responds faster for our algorithm. Figure 2: Variation of ||M c Mk||op for different values of ρ. Here k = ˆd for our algorithm and k = log (T). Furthermore, || ||op is the Hankel norm. Finally, for the case when ρ = 0.9, β = 40, we show the model errors for SSREGEST and our algorithm as T increases. Although asymptotically both algorithms perform the same, it is clear that for small T our algorithm is more robust to the presence of noise. T SSREGEST Our Algorithm 500 6.21 1.35 13.37 3.7 850 30.20 7.55 11.25 2.89 1200 26.80 8.94 9.83 2.60 1500 23.27 10.65 9.17 2.30 2000 26.38 12.88 7.70 1.60 7. Discussion We propose a new approach to system identification when we observe only finite noisy data. Typically, the order of an LTI system is large and unknown and a priori parametrizations may fail to yield accurate estimates of the underlying system. However, our results suggest that there always exists a lower order approximation of the original LTI system that can be learned with high probability. The central theme of our approach is to recover a good lower order approximation that can be accurately learned. Specifically, we show that identification T. Sarkar and A. Rakhlin and M. A. Dahleh of such approximations is closely related to the singular values of the system Hankel matrix. In fact, the time required to learn a ˆd order approximation scales as T = Ω( β2 σ2 ˆ d ) where σ ˆd is the ˆd the singular value of system Hankel matrix. This means that system identification does not explicitly depend on the model order n, rather depends on n through σn. As a result, in the presence of finite data it is preferable to learn only the significant (and perhaps much smaller) part of the system when n is very large and σn 1. Algorithm 1 and 3 provide a guided mechanism for learning the parameters of such significant approximations with optimal rules for hyperparameter selection given in Algorithm 2. Future directions for our work include extending the existing low rank optimization-based identification techniques, such as (Fazel et al., 2013; Grussler et al., 2018), which typically lack statistical guarantees. Since Hankel based operators occur quite naturally in general (not necessarily linear) dynamical systems, exploring if our methods could be extended for identification of such systems appears to be an exciting direction. Yasin Abbasi-Yadkori, D avid P al, and Csaba Szepesv ari. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312 2320, 2011. Beh cet A cıkme se, John M Carson, and Lars Blackmore. Lossless convexification of nonconvex control bound and pointing constraints of the soft landing optimal control problem. IEEE Transactions on Control Systems Technology, 21(6):2104 2113, 2013. Anish Agarwal, Muhammad Jehangir Amjad, Devavrat Shah, and Dennis Shen. Time series analysis via matrix estimation. ar Xiv preprint ar Xiv:1802.09064, 2018. Zeyuan Allen-Zhu and Yuanzhi Li. Lazysvd: Even faster svd decomposition yet without agonizing pain. In Advances in Neural Information Processing Systems, pages 974 982, 2016. Dietmar Bauer. Order estimation for subspace methods. Automatica, 37(10):1561 1573, 2001. St ephane Boucheron, G abor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymptotic theory of independence. Oxford university press, 2013. Marco C Campi and Erik Weyer. Finite sample properties of system identification methods. IEEE Transactions on Automatic Control, 47(8):1329 1334, 2002. Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the royal statistical society. Series B (methodological), pages 1 38, 1977. Mohamad Kazem Shirani Faradonbeh, Ambuj Tewari, and George Michailidis. Finite time identification in unstable linear systems. ar Xiv preprint ar Xiv:1710.01852, 2017. Finite Time LTI System Identification Maryam Fazel, Ting Kei Pong, Defeng Sun, and Paul Tseng. Hankel matrix rank minimization with applications to system identification and realization. SIAM Journal on Matrix Analysis and Applications, 34(3):946 977, 2013. Keith Glover. All optimal hankel-norm approximations of linear multivariable systems and their l -error bounds. International journal of control, 39(6):1115 1193, 1984. Keith Glover. Model reduction: a tutorial on hankel-norm methods and lower bounds on l2 errors. IFAC Proceedings Volumes, 20(5):293 298, 1987. Alexander Goldenshluger. Nonparametric estimation of transfer functions: rates of convergence and adaptation. IEEE Transactions on Information Theory, 44(2):644 658, 1998. Christian Grussler, Anders Rantzer, and Pontus Giselsson. Low-rank optimization with convex constraints. IEEE Transactions on Automatic Control, 2018. Moritz Hardt, Tengyu Ma, and Benjamin Recht. Gradient descent learns linear dynamical systems. ar Xiv preprint ar Xiv:1609.05191, 2016. Elad Hazan, Holden Lee, Karan Singh, Cyril Zhang, and Yi Zhang. Spectral filtering for general linear dynamical systems. ar Xiv preprint ar Xiv:1802.03981, 2018. BL Ho and Rudolph E Kalman. Effective construction of linear state-variable models from input/output functions. at-Automatisierungstechnik, 14(1-12):545 548, 1966. S Kung and D Lin. Optimal hankel-norm model reductions: Multivariable systems. IEEE Transactions on Automatic Control, 26(4):832 852, 1981. Lennart Ljung. System identification: theory for the user. Prentice-hall, 1987. Lennart Ljung, Rajiv Singh, and Tianshi Chen. Regularization features in the system identification toolbox. IFAC-Papers On Line, 48(28):745 750, 2015. Mark Meckes et al. On the spectral norm of a random toeplitz matrix. Electronic Communications in Probability, 12:315 325, 2007. Samet Oymak and Necmiye Ozay. Non-asymptotic identification of lti systems from a single trajectory. ar Xiv preprint ar Xiv:1806.05722, 2018. Victor H Pe na, Tze Leung Lai, and Qi-Man Shao. Self-normalized processes: Limit theory and Statistical Applications. Springer Science & Business Media, 2008. Tuhin Sarkar and Alexander Rakhlin. How fast can linear dynamical systems be learned? ar Xiv preprint ar Xiv:1812.0125, 2018. Parikshit Shah, Badri Narayan Bhaskar, Gongguo Tang, and Benjamin Recht. Linear system identification via atomic norm regularization. In 2012 IEEE 51st IEEE Conference on Decision and Control (CDC), pages 6265 6270. IEEE, 2012. T. Sarkar and A. Rakhlin and M. A. Dahleh Ritei Shibata. Selection of the order of an autoregressive model by akaike s information criterion. Biometrika, 63(1):117 126, 1976. Max Simchowitz, Horia Mania, Stephen Tu, Michael I Jordan, and Benjamin Recht. Learning without mixing: Towards a sharp analysis of linear system identification. ar Xiv preprint ar Xiv:1802.08334, 2018. Stephen Tu, Ross Boczar, Andrew Packard, and Benjamin Recht. Non-asymptotic analysis of robust control from coarse-grained identification. ar Xiv preprint ar Xiv:1707.04791, 2017. Stephen Tu, Ross Boczar, and Benjamin Recht. On the approximation of toeplitz operators for nonparametric H norm estimation. In 2018 Annual American Control Conference (ACC), pages 1867 1872. IEEE, 2018a. Stephen Tu, Ross Boczar, and Benjamin Recht. Minimax lower bounds for H -norm estimation. ar Xiv preprint ar Xiv:1809.10855, 2018b. Eugene E Tyrtyshnikov. A brief introduction to numerical analysis. Springer Science & Business Media, 2012. Sara van de Geer and Johannes Lederer. The bernstein orlicz norm and deviation inequalities. Probability theory and related fields, 157(1-2):225 250, 2013. Aad W Van Der Vaart and Jon A Wellner. Weak convergence. In Weak convergence and empirical processes, pages 16 28. Springer, 1996. Peter Van Overschee and BL De Moor. Subspace identification for linear systems: Theory Implementation Applications. Springer Science & Business Media, 2012. Saligrama R Venkatesh and Munther A Dahleh. On system identification of complex systems from finite data. IEEE Transactions on Automatic Control, 46(2):235 257, 2001. Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. ar Xiv preprint ar Xiv:1011.3027, 2010. Per Ake Wedin. Perturbation bounds in connection with singular value decomposition. BIT Numerical Mathematics, 12(1):99 111, 1972. K Zhou, JC Doyle, and K Glover. Robust and optimal control, 1996. Finite Time LTI System Identification 8. Preliminaries Theorem 8.1 (Theorem 5.39 Vershynin (2010)) if E is a T md matrix with independent sub Gaussian isotropic rows with sub Gaussian parameter 1 then with probability at least 1 2e ct2 we have md t σmin(E) Proposition 8.1 (Vershynin (2010)) We have for any ϵ < 1 and any w Sd 1 that P(||M||> z) (1 + 2/ϵ)d P ||Mw||> z (1 ϵ) Theorem 8.2 (Theorem 1 Meckes et al. (2007)) ] Suppose {Xi Rm} i=1 are independent, E[Xj] = 0 for all j, and Xij are independent subg(1) random variables. Then P(||Td|| cm d log 2d + t) e t2/d where X0 X1 . . . Xd 1 X1 X0 . . . Xd 2 ... ... ... ... Xd 1 . . . . . . X0 Theorem 8.3 (Hanson Wright Inequality) Given a subgaussian vector X = [X1, X2, . . . , Xn] Rn with supi Xi ψ2 K. Then for any B Rn n and t 0 P XBX E[XBX ] t 2 exp max ct K2 B , ct2 Proposition 8.2 (Lecture 2 Tyrtyshnikov (2012)) Suppose that L is the lower triangular part of a matrix A Rd d. Then L 2 log2 (2d) A 2. Let ψ be a nondecreasing, convex function with ψ(0) = 0 and X a random variable. Then the Orlicz norm ||X||ψ is defined as ||X||ψ= inf n α > 0 : E[ψ(|X|/α)] 1 o . Let (B, d) be an arbitrary semi metric space. Denote by N(ϵ, d) is the minimal number of balls of radius ϵ needed to cover B. Theorem 8.4 (Corollary 2.2.5 in Van Der Vaart and Wellner (1996)) The constant K can be chosen such that ||sup s,t |Xs Xt|||ψ K Z diam(B) 0 ψ 1(N(ϵ/2, d))dϵ where diam(B) is the diameter of B and d(s, t) = ||Xs Xt||ψ. T. Sarkar and A. Rakhlin and M. A. Dahleh Theorem 8.5 (Theorem 1 in Abbasi-Yadkori et al. (2011)) Let {Ft} t=0 be a filtration. Let {ηt Rm, Xt Rd} t=1 be stochastic processes such that ηt, Xt are Ft measurable and ηt is Ft 1-conditionally subg(L2) for some L > 0. For any t 0, define Vt = Pt s=1 Xs X s, St = Pt s=1 Xsη s+1. Then for any δ > 0, V 0 and all t 0 we have with probability at least 1 δ S t (V + Vt) 1St 2L2 log 1 δ + log det(V + Vt) det(V ) + m . Proof Define M = (V + Vt) 1/2St. Now we use Proposition 8.1 and setting ϵ = 1/2, P(||M||2> z) 5m P(||Mw||2> 2z) for w Sm 1. Then we can use Theorem 1 in Abbasi-Yadkori et al. (2011), and with probability at least 1 δ we have ||Mw||2 2 2L2 log 1 δ + log det(V + Vt) By δ 5 mδ, we have with probability at least 1 5 mδ s m log (5) + log 1 δ + log det(V + Vt) Then with probability at least 1 δ, s m + log 1 δ + log det(V + Vt) Lemma 8.1 For any M = (C, A, B), we have that ||Bv T m T ||= v u u tσ d X k=1 T d+k,T Td+k,T Here Bv T m T is defined as follows: β = H d,d,T v = [β 1 , β 2 , . . . , β T ] . β 1 0 0 . . . β 2 β 1 0 . . . ... ... ... ... β T β T 1 . . . β 1 and ||v||2= 1. Finite Time LTI System Identification Proof For the matrix Bv we have β 1 u1 β 1 u2 + β 2 u1 β 1 u3 + β 2 u2 + β 3 u1 ... β 1 u T + β 2 u T 1 + . . . + β T u1 CAd+1Bu1 CAd+2Bu1 ... CA2d Bu1 CAd+2Bu1 + CAd+1Bu2 CAd+3Bu1 + CAd+2Bu2 ... CA2d+1Bu1 + CA2d Bu2 CAT+d Bu1 + . . . + CAd+1Bu T CAT+d+2Bu1 + . . . + CAd+2Bu T ... CAT+2d 1Bu1 + . . . + CA2d Bu T CAd+1Bu1 CAd+2Bu1 ... CA2d Bu1 CAd+2Bu1 + CAd+1Bu2 CAd+3Bu1 + CAd+2Bu2 ... CA2d+1Bu1 + CA2d Bu2 CAT+d Bu1 + . . . + CAd+1Bu T CAT+d+2Bu1 + . . . + CAd+2Bu T ... CAT+2d 1Bu1 + . . . + CA2d Bu T CAd+1B 0 0 . . . 0 CAd+2B 0 0 . . . 0 ... ... ... ... ... CA2d B 0 0 . . . 0 CAd+2B CAd+1B 0 . . . 0 CAd+3B CAd+2B 0 . . . 0 ... ... ... ... ... CA2d+1B CA2d B 0 . . . 0 ... ... ... ... ... CAT+d 1B CAT+d B CAT+d 1B . . . CAd+1B CAT+d+2B CAT+d+1B CAT+d B . . . CAd+2B ... ... ... ... ... CAT+2d 1B CAT+2d 1B CAT+2d 2B . . . CA2d B u1 u2 ... u T T. Sarkar and A. Rakhlin and M. A. Dahleh It is clear that ||V||2, ||u||2= 1 and for any matrix S, ||S|| does not change if we interchange rows of S. Then we have CAd+1B 0 0 . . . 0 CAd+2B CAd+1B 0 . . . 0 ... ... ... ... ... CAT+d+1B CAT+d B CAT+d 1B . . . CAd+1B CAd+2B 0 0 . . . 0 CAd+3B CAd+2B 0 . . . 0 ... ... ... ... ... CAT+d+2B CAT+d+1B CAT+d B . . . CAd+2B ... ... ... ... ... CA2d B 0 0 . . . 0 CA2d+1B CA2d B 0 . . . 0 ... ... ... ... ... CAT+2d 1B CAT+2d 1B CAT+2d 2B . . . CA2d B Td+1,T Td+2,T ... T2d,T v u u tσ d X k=1 T d+k,T Td+k,T Proposition 8.3 (Lemma 4.1 Simchowitz et al. (2018)) Let S be an invertible matrix and κ(S) be its condition number. Then for a 1 4κ net of Sd 1 and an arbitrary matrix A, we have ||SA||2 2 sup v N 1 ||v A||2 ||v S 1||2 Proof For any vector v N 1 4κ and w be such that ||SA||2= ||w A||2 ||w S 1||2 we have ||SA||2 ||v A||2 ||v S 1||2 ||w A||2 ||w S 1||2 ||v A||2 ||v S 1||2 ||w S 1||2 ||v A||2 ||w S 1||2 + ||v A||2 ||w S 1||2 ||v A||2 ||v S 1||2 1 4κ||S 1||2 ||w S 1||2 + ||SA||2 ||v S 1||2 ||w S 1||2 1 Finite Time LTI System Identification 9. Control and Systems Theory Preliminaries 9.1 Sylvester Matrix Equation Define the discrete time Sylvester operator SA,B : Rn n Rn n LA,B(X) = X AXB (21) Then we have the following properties for LA,B( ). Proposition 9.1 Let λi, µi be the eigenvalues of A, B then LA,B is invertible if and only if for all i, j λiµj = 1 Define the discrete time Lyapunov operator for a matrix A as LA,A ( ) = S 1 A,A ( ). Clearly it follows from Proposition 9.1 that whenever λmax(A) < 1 we have that the SA,A ( ) is an invertible operator. Now let Q 0 then SA,A (Q) = X = X = AXA + Q k=0 Ak QA k (22) Eq. (22) follows directly by substitution and by Proposition 9.1 is unique if ρ(A) < 1. Further, let Q1 Q2 0 and X1, X2 be the corresponding solutions to the Lyapunov operator then from Eq. (22) that 9.2 Properties of System Hankel matrix Rank of system Hankel matrix: For M = (C, A, B) Mn, the system Hankel matrix, H0, , (M), can be decomposed as follows: H0, , (M) = C CA ... CAd ... B AB . . . Ad B . . . It follows from definition that rank(O), rank(R) n and as a result rank(OR) n. The system Hankel matrix rank, or rank(OR), which is also the model order(or T. Sarkar and A. Rakhlin and M. A. Dahleh simply order), captures the complexity of M. If SVD(H0, , ) = UΣV , then O = UΣ1/2S, R = S 1Σ1/2V . By noting that CAl S = CS(S 1AS)l, S 1Al B = (S 1AS)l S 1B we have obtained a way of recovering the system parameters (up to similarity transformations). Furthermore, H0, , uniquely (up to similarity transformation) recovers (C, A, B). Mapping Past to Future: H0, , can also be viewed as an operator that maps past inputs to future outputs. In Eq. (1) assume that {ηt, wt} = 0. Then consider the following class of inputs Ut such that Ut = 0 for all t T but Ut may not be zero for t < T. Here T is chosen arbitrarily. Then YT YT+1 YT+2 ... | {z } Future UT 1 UT 2 UT 3 ... | {z } Past 9.3 Model Reduction Given an LTI system M = (C, A, B) of order n with its doubly infinite system Hankel matrix as H0, , . We are interested in finding the best k order lower dimensional approximation of M, i.e., for every k < n we would like to find Mk of model order k such that ||M Mk|| is minimized. Systems theory gives us a class of model approximations, known as balanced truncated approximations, that provide strong theoretical guarantees (See Glover (1984) and Section 21.6 in Zhou et al. (1996)). We summarize some of the basics of model reduction below. Assume that M has distinct Hankel singular values. Recall that a model M = (C, A, B) is equivalent to M = (CS, S 1AS, S 1B) with respect to its transfer function. Define Q = A QA + C C P = APA + BB For two positive definite matrices P, Q it is a known fact that there exist a transformation S such that S QS = S 1PS 1 = Σ where Σ is diagonal and the diagonal elements are decreasing. Further, σi is the ith singular value of H0, , . Then let A = S 1AS, C = CS, B = S 1B. Clearly f M = ( A, B, C) is equivalent to M and we have Σ = A Σ A + C C Σ = AΣ A + B B (25) Here C, A, B is a balanced realization of M. Finite Time LTI System Identification Proposition 9.2 Let H0, , = UΣV . Here Σ 0 Rn n. Then C = [UΣ1/2]1:p,: A = Σ 1/2U [UΣ1/2]p+1:,: B = [Σ1/2V ]:,1:m The triple ( C, A, B) is a balanced realization of M. For any matrix L, L:,m:n (or Lm:n,:) denotes the submatrix with only columns (or rows) m through n. Proof Let the SVD of H0, , = UΣV . Then M can constructed as follows: UΣ1/2, Σ1/2V are of the form CS CAS CA2S ... , Σ1/2V = S 1B S 1AB S 1A2B . . . where S is the transformation which gives us Eq. (25). This follows because Σ1/2U UΣ1/2 = k=0 S Ak C CAk S k=0 S Ak S 1 S C CSS 1Ak S k=0 Ak C C Ak = A Σ A + C C = Σ Then C = UΣ1/2 1:p,: and UΣ1/2 A = [UΣ1/2]p+1:,: A = Σ 1/2U [UΣ1/2]p+1:,: We do a similar computation for B. It should be noted that a balanced realization C, A, B is unique except when there are some Hankel singular values that are equal. To see this, assume that we have σ1 > . . . > σr 1 > σr = σr+1 = . . . = σs > σs+1 > . . . σn where s r > 0. For any unitary matrix Q R(s r+1) (s r+1), define Q0 I(r 1) (r 1) 0 0 0 Q 0 0 0 I(n s) (n s) T. Sarkar and A. Rakhlin and M. A. Dahleh Then every triple ( CQ0, Q 0 AQ0, Q 0 B) satisfies Eq. (25) and is a balanced realization. Let Mk = ( Ck, Akk, Bk) where A = Akk A0k Ak0 A00 , B = Bk B0 , C = Ck C0 (27) Here Akk is the k k submatrix and corresponding partitions of B, C. The realization Mk = ( Ck, Akk, Bk) is the k order balanced truncated model. Clearly M Mn which gives us C = Cnn, A = Ann, B = Bnn, i.e., the balanced version of the true model. We will show that for the balanced truncation model we only need to care about the top k singular vectors and not the entire model. Proposition 9.3 For the k order balanced truncated model Mk, we only need top k singular values and singular vectors of H0, , . Proof From the preceding discussion in Proposition 9.2 and Eq. (27) it is clear that the first p k block submatrix of UΣ1/2 (corresponding to the top k singular vectors) gives us Ck. Since A = Σ 1/2U [UΣ1/2]p+1:,: we observe that Akk depend only on the top k singular vectors Uk and corresponding singular values. This can be seen as follows: [UΣ1/2]p+1:,: denotes the submatrix of UΣ1/2 with top p rows removed. Now in UΣ1/2 each column of U is scaled by the corresponding singular value. Then the Akk submatrix depends only on top k rows of Σ 1/2U and the top k columns of [UΣ1/2]p+1:,: which correspond to the top k singular vectors. 10. Isometry of Input Matrix: Proof of Lemma 5.1 Theorem 11 Define Ud Ud+1 . . . UT+d 1 Ud 1 Ud . . . UT+d 2 ... ... ... ... U1 U2 . . . UT where each Ui subg(1) and isotropic. Then there exists an absolute constant c such that U satisfies: (1/2)T σmin(UU ) σmax(UU ) (3/2)T whenever T cm2d(log2 (d) log2 (m2/δ) + log3 (2d)) with probability at least 1 δ. Proof Define 0 0 0 . . . 0 I 0 0 . . . 0 ... ... ... ... ... 0 . . . I 0 0 0 . . . 0 I 0 , b Uk := Ud+k Finite Time LTI System Identification Ud Ud+1 . . . UT+d 1 Ud 1 Ud . . . UT+d 2 ... ... . . . ... U1 U2 . . . UT we can reformulate it so that each column is the output of an LTI system in the following sense: xk+1 = Axk + B b U(k + 1) (28) where UU = PT 1 k=0 xkx k and x0 = Ud Ud 1 ... U1 . From Theorem 8.1 we have that k=0 b Uk b U k 5 with probability at least 1 δ whenever T c m + log 2 δ . Define Vt = Pt 1 l=0 xkx k then, VT = AVT 1A + B k=0 b Uk b U k Axk b U k+1B + B b Uk+1x k A (29) It can be easily checked that xk = Ud+k Ud+k 1 ... Uk+1 and consequently k=0 Axk b U k+1B = 0 0 . . . 0 0 Ud+k U d+k+1 0 . . . 0 0 Ud+k 1U d+k+1 0 . . . 0 0 ... ... ... ... ... ... ... ... ... ... Uk+2U d+k+1 0 . . . 0 0 Define Lj := PT 2 k=0 Ud+k j+1U d+k+1 and Lj is a m m block matrix. Then l=0 Al T 2 X k=0 Axk b U k+1B ! 0 0 . . . 0 0 0 L1 0 . . . 0 0 0 L2 L1 . . . 0 0 0 ... ... ... ... ... ... ... ... ... ... ... ... Ld 1 0 . . . 0 L1 0 T. Sarkar and A. Rakhlin and M. A. Dahleh Use Lemma 10.1 to show that Td log (d) log (m2/δ) (30) with probability at least 1 δ. Then k=0 b Uk b U k l=0 Alx T 1x T 1Al . From Theorem 8.1 we have with probability atleast 1 δ that k=0 b Uk b U k B Al (5/4)TI (31) whenever T c m + log 2 δ . Observe that l=1 Alx T 1x T 1Al = σ2 1([Ax T 1, A2x T 1, . . . , Adx T 1]) The matrix [Ax T 1, A2x T 1, . . . , Adx T 1] is the lower triangular submatrix of a random Toeplitz matrix with i.i.d subg(1) entries as in Theorem 8.2. Then using Theorem 8.2 and Proposition 8.2 we get that with probability at least 1 δ we have [Ax T 1, A2x T 1, . . . , Adx T 1] cm( p d log (2d) log (2d) + p d log (1/δ)). (32) Then Pd l=1 Alx T 1x T 1Al cm2d(log3 (2d) + log (1/δ) + log (2d) p log (2d) log (1/δ)) with probability at least 1 δ. By ensuring that Eqs. (30), (31) and (32) hold simultaneously we can ensure that cm Td log (d) log (m2/δ) T/8 and cm2d(log3 (2d) + log (1/δ) + log (2d) p log (2d) log (1/δ)) T/8 for large enough T and absolute constant c. Lemma 10.1 Let {Uj Rm 1}T+d j=1 be independent subg(1) random vectors. Define Lj := PT 2 k=0 Ud+k j+1U d+k+1 for all j 1 and 0 0 . . . 0 0 0 L1 0 . . . 0 0 0 L2 L1 . . . 0 0 0 ... ... ... ... ... ... ... ... ... ... ... ... Ld 1 0 . . . 0 L1 0 Then with probability at least 1 δ we have Td log (d) log (m/δ). Finite Time LTI System Identification Proof Since Ljs are block matrices, the techniques in Meckes et al. (2007) cannot be directly applied. However, by noting that E can be broken into a sum of m matrices where the norm of each matrix can be bounded by a Toeplitz matrix we can use the result from Meckes et al. (2007). For instance if m = 2 and {ui} i=1 are independent subg(1) random variables then we have . . . u1 u2 u3 u4 . . . u5 u6 u7 u8 u1 u2 u3 u4 ... ... ... . . . u1 0 u3 0 . . . u5 0 u7 0 ... ... ... . . . 0 u2 0 u4 . . . 0 u6 0 u8 ... ... ... then ||Td|| sup1 i 2||Mi||. Furthermore for each Mi we have . . . u1 0 0 0 . . . u5 0 0 0 ... ... ... | {z } =M11 . . . 0 0 u3 0 . . . 0 0 u7 0 ... ... ... | {z } =M12 and ||M1|| ||M11||+||M12||. The key idea is to show that Mi1 are Toeplitz matrices (after removing the zeros in the blocks) and we can use the standard techniques described in proof of Theorem 1 in Meckes et al. (2007). Then we will show that each ||Mij|| C with high probability and ||Td|| m C. For brevity, we will assume for now that Ui are scalars and at the end we will scale by m. By standard techniques described in proof of Theorem 1 in Meckes et al. (2007), we have that the finite Toeplitz matrix Td + T d is d d submatrix of the infinite Laurent matrix M = [L|j k|1|j k| 0. This ensures that supt|Yt| ψ c Td. Since E[X] X ψ we have that E[sup0 x 1|Yx|] Td. This implies E[ Td + T d ] Td, and using Proposition 8.2 we have E[ Td ] c Td log (d). Furthermore, we can make a stronger statement because supt|Yt| ψ c Td which implies that Td log (d) log (1/δ) with probability at least 1 δ. Then recalling that in the general case that Ljs of Td were m m block matrices we scale by m and get with probability at least 1 δ Td log (d) log (m2/δ) where the union is over all m2 elements being less that c Td log (d) log (m2/δ). Note that c hides the universal constant K from Theorem 8.4. 11. Error Analysis for Theorem 5.1 For this section we assume that Ut subg(L2). 11.1 Proof of Theorem 5.1 Recall Eq. (8) and (9), i.e., Y + l,d = H0,d,d U l 1,d + T0,d U+ l,d + Hd,d,l d 1 U l d 1,l d 1 + O0,d,d η l 1,d + T O0,d η+ l,d + Od,d,l d 1 η l d 1,l d 1 + w+ l,d (35) Assume for now that we have T + 2d data points instead of T. It is clear that ˆH0,d,d = arg min H l=0 || Y + l+d+1,d H U l+d,d||2 2= l=0 Y + l+d+1,d U l+d,d ! l=0 U l+d,d U l+d,d, (36) Ud Ud+1 . . . UT+d 1 Ud 1 Ud . . . UT+d 2 ... ... ... ... U1 U2 . . . UT T. Sarkar and A. Rakhlin and M. A. Dahleh It is show in Theorem 11 that VT is invertible with probability at least 1 δ. So in our analysis we can write this as l=0 U l+d,d U l+d,d l=0 U l+d,d U l+d,d From this one can conclude that ˆH H0,d,d 2 = T 1 X l=0 U l+d,d U l+d,d 1 T 1 X l=0 U l+d,d U+ l+d+1,d T 0,d + U l+d,d U l,l H d,d,l + U l+d,d η l+d,d O 0,d,d + U l+d,d η+ l+d+1,d T O 0,d + U l+d,d η l,l O d,d,l + U l+d,d w+ l+d+1,d 2 (37) Here as we can observe U l,l , η l,l grow with T in dimension. Based on this we divide our error terms in two parts: l=0 U l+d,d U l+d,d 1 U l+d,d U l,l H d,d,l + U l+d,d η l,l O d,d,l l=0 U l+d,d U l+d,d 1 U l+d,d η+ l+d+1,d T O 0,d + U l+d,d U+ l+d+1,d T 0,d+ (39) U l+d,d η+ l+d+1,d T O 0,d + U l+d,d w+ l+d+1,d Then the proof of Theorem 5.1 will reduce to Propositions 11.1 11.3. We first analyze V 1/2 T T 1 X l=0 U l+d,d U l,l H d,d,l 2 The analysis of ||V 1/2 T (PT 1 l=0 U l+d,d η l,l O d,d,l)|| will be almost identical and will only differ in constants. Proposition 11.1 For 0 < δ < 1, we have with probability at least 1 2δ V 1/2 T T 1 X l=0 U l+d,d U l,l H d,d,l 2 4σ where σ = q σ(Pd k=1 T d+k,T Td+k,T ). Finite Time LTI System Identification Proof We proved that TI 2 with high probability, then P V 1/2 T T 1 X l=0 U l+d,d U l,l H d,d,l 2 a, TI l=0 U l+d,d U l,l H d,d,l 2 a, TI P 2 sup v N 1 l=0 U l+d,d U l,l H d,d,lv 2 a + P TI l=0 U l+d,d U l,l H d,d,lv 2 a δ. (40) Define the following ηl,d = U l,l H d,d,lv, Xl,d = q 2 T U l+d,d. Observe that ηl,d, ηl+1,d have contributions from Ul 1, Ul 2 etc. and do not immediately satisfy the conditions of Theorem 2.2. Instead we will use the fact that Xi,d is independent of Uj for all j i. V 1/2 T T 1 X l=0 U l+d,d U l,l H d,d,l 2 2 sup v N 1 l=0 U l+d,d U l,l H d,d,lv|| 2 sup v N 1 l=0 Xl,dηl,d||. Define H d,d,lv = [β 1 , β 2 , . . . , β l ] . βi are m 1 vectors when LTI system is MIMO. Then ηl,d = Pl 1 k=0 U l kβk+1. Let αl = Xl,d. Then consider the matrix β 1 0 0 . . . β 2 β 1 0 . . . ... ... ... ... β T β T 1 . . . β 1 T. Sarkar and A. Rakhlin and M. A. Dahleh Observe that the matrix ||BT m T ||2= q σ(Pd k=1 T d+k,T Td+k,T ) d||Td, ||2< which follows from Lemma 8.1. Then l=0 Xl,dηl,d = [α1, . . . , αT ]B U1 U2 ... UT k=1 αkβ k , k=2 αkβ k 1, . . . , αT β 1 ] U1 U2 ... UT k=j αkβ k Uj . Here αi = Xi,d and recall that Xi,d is independent of Uj for all i j. Let γ = α B. Define GT+d k = σ({Uk+1, Uk+2, . . . , UT+d}) where σ(A) is the sigma algebra containing the set A with G0 = φ. Then Gk 1 Gk. Furthermore, since γj 1, Uj are GT+d+1 j measurable and Uj is conditionally (on GT+d j) sub Gaussian, we can use Theorem 2.2 on γ U = α BU (where γj = XT+d j, Uj = ηT+d j+1 in the notation of Theorem 2.2). Then with probability at least 1 δ we have α BB α + V 1/2 γ U L δ + log det(α BB α + V ) For any fixed V > 0. With probability at least 1 δ, we know from Theorem 11 that α α 3I 2 = α BB α 3σ2 1(B)I 2 . By combining this event and the event in Eq. (41) and setting V = 3σ2 1(B)I 2 , we get with probability at least 1 2δ that ||α BU||2= ||γ U||2 δ + pd log 3 + m . (42) Replacing δ 5 pd δ 2, we get from Eq. (40) V 1/2 T T 1 X l=0 U l+d,d U l,l H d,d,l 2 6 log (5)Lσ1(B) with probability at least 1 δ. Since L = 1 we get our desired result. Then similar to Proposition 11.1, we analyze V 1/2 T PT 1 l=0 U l+d,d U+ l+d+1,d T 0,d 2 Proposition 11.2 For 0 < δ < 1 and large enough T, we have with probability at least 1 δ V 1/2 T T 1 X l=0 U l+d,d U+ l+d+1,d T 0,d 2 4σ Finite Time LTI System Identification σ sup ||v||2=1 v CAd B v CAd 1B v CAd 2B . . . v CB 0 0 ... ... ... ... 0 0 ... ... ... ... ... 0 v CAd B v CAd 1B . . . . . . v CB j=0 ||CAj B||2 β Proof Note V 1/2 T PT 1 l=0 U l+d,d U+ l+d+1,d T 0,d 2 q 2 T PT 1 l=0 U l+d,d U+ l+d+1,d T 0,d 2 with probability at least 1 δ for large enough T. Here T 0,d is md pd matrix. Then define 2 T U l+d,d and the vector Ml Rpd as M l = U+ l+d+1,d T 0,d. Then l=0 Xl M l ||2 t) |{z} 1 2 net l=0 Xl M l v||2 t/2) where M l v is a real value. Let β := T 0,dv, then M l v = U+ l+d+1,dβ. This allows us to write Xl M l v in a form that will enable us to apply Theorem 2.2. l=0 Xl M l v = [X0, X1, . . . , XT 1] | {z } =X β 1 β 2 . . . β d . . . 0 0 β 1 ... ... ... 0 0 ... ... ... ... ... 0 . . . 0 β 1 . . . β d Ud+1 Ud+2 ... UT+2d Here I is RT (m T+md). It is known from Theorem 11 that XX 3I 2 with high probability and consequently XII X 3σ2 1(I)I 2 . Define Fl = σ({Ul}d+l j=1) as the sigma field generated by ({Ul}d+l j=1. Furthermore Nl is Fl measurable, and [XI]l is Fl 1 measurable and we can apply Theorem 2.2. Now the proof is similar to Proposition 11.1. Following the same steps as before we get with probability at least 1 δ l=0 Xl M l v||2= || l=0 [XI]l Nl||2 δ + pd log 3 + m and substituting δ 5 pdδ we get l=0 Xl M l ||2 6 log (5)σ1(I)L l=0 Xl Ml||2 4σ1(I)L δ + pd + m. (44) T. Sarkar and A. Rakhlin and M. A. Dahleh The proof for noise and covariate cross terms is almost identical to Proposition 11.2 but easier because of independence. Finally note that σ1(I) q Pd i=1 βi 2 2 Proposition 11.3 For 0 < δ < 1, we have with probability at least 1 δ V 1/2 T T X k=0 U l+d,d η+ l+1+d,d T O 0,d 2 4σA V 1/2 T T X k=0 U l+d,d η l,l O d,d,l 2 4σB V 1/2 T T X k=0 U l+d,d η l+d,d O 0,d,d 2 4σC V 1/2 T T X k=0 U l+d,d w+ l+1+d,d 2 4σD Here σ = max (σA, σB, σC, σD) where σA σC sup ||v||2=1 v CAd v CAd 1 v CAd 2 . . . 0 0 ... ... ... 0 0 ... ... ... ... 0 . . . v CAd . . . v C j=0 ||CAj||2 βR σ(Pd k=1 T O d+k,T T Od+k,T ) βR By taking the intersection of all the aforementioned events for a fixed δ we then have with probability at least 1 δ ˆH0,d,d H0,d,d 2 16σ m + pd + log d 12. Subspace Perturbation Results In this section we present variants of the famous Wedin s theorem (Section 3 of Wedin (1972)) that depends on the distribution of Hankel singular values. These will be sign free generalizations of the gap Free Wedin Theorem from Allen-Zhu and Li (2016). The major difference from the traditional Wedin s theorem is that the Frobenius error bound can include the dimension of the matrix; however in our case the Hankel matrix is allow to grow with T and such a bound may not be ideal. To address we introduce this mild variant of Wedin s theorem. First we define the Hermitian dilation of a matrix. H(S) = 0 S S 0 Finite Time LTI System Identification The Hermitian dilation has the property that ||S1 S2|| ϵ ||H(S1) H(S2)|| ϵ. Hermitian dilations will be useful in applying Wedin s theorem for general (not symmetric) matrices. Proposition 12.1 Let S, ˆS be symmetric matrices and ||S ˆS|| ϵ. Further, let vj, ˆvj correspond to the jth eigenvector of S, ˆS respectively such that λ1 λ2 . . . λn and ˆλ1 ˆλ2 . . . ˆλn. Then we have | vj, ˆvk | ϵ |λj ˆλk| (45) if either λj or ˆλk is not zero. Proof Let S = λjvjv j +V Λ j V and ˆS = ˆλkˆvkˆv k + ˆV ˆΛ k ˆV , wlog assume |λj| |ˆλk|. Define R = S ˆS v j Sˆvk = v j ˆSˆvk + v j Rˆvk Since vj, ˆvk are eigenvectors of S and ˆS respectively. λjv jˆvk = ˆλkv jˆvk + v j Rˆvk |λj ˆλk||v jˆvk| ϵ Proposition 12.1 gives an eigenvector subjective Wedin s theorem. Next, we show how to extend these results to arbitrary subsets of eigenvectors. Proposition 12.2 For ϵ > 0, let S, P be two symmetric matrices such that ||S P||2 ϵ. Let S = UΣSU , P = V ΣP V Let V+ correspond to the eigenvectors of singular values β, V correspond to the eigenvectors of singular values α and V are the remaining ones. Define a similar partition for S. Let α < β ||U V+|| ϵ β α Proof The proof is similar to before. S, P have a spectral decomposition of the form S = U+ΣS +U + + U ΣS U + UΣS0 U P = V+ΣP +V + + V ΣP V + V ΣP 0 V Let R = S P and since U+ is orthogonal to U , U and similarly for V U S = ΣS U = U P + U R ΣS U V+ = U V+ΣP + + U RV+ T. Sarkar and A. Rakhlin and M. A. Dahleh Diving both sides by ΣP ΣS U V+(ΣP +) 1 = U V+ + U RV+(ΣP +) 1 ||ΣS U V+(ΣP +) 1|| ||U V+|| ||U RV+(ΣP +) 1|| α β ||U V+|| ||U V+|| ϵ ||U V+|| ϵ β α Let Sk, Pk be the best rank k approximations of S, P respectively. We develop a sequence of results to see how ||Sk Pk|| varies when ||S P|| ϵ as a function of k. Proposition 12.3 Let S, P be such that Furthermore, let ϵ be such that ϵ inf {1 i r 1} {s+1 i n} σi(P) σi+1(P) and US j , V S j be the left and right singular vectors of S corresponding to σj(S). There exists a unitary transformation Q such that σmax([UP r , . . . , UP s ]Q [US r , . . . , US s ]) 2ϵ min σr 1(P) σr(S), σs(S) σs+1(P) σmax([V P r , . . . , V P s ]Q [V S r , . . . , V S s ]) 2ϵ min σr 1(P) σr(S), σs(S) σs+1(P) . Proof Let r k s. First divide the indices [1, n] into 3 parts K1 = [1, r 1], K2 = [r, s], K3 = [s + 1, n]. Although we focus on only three groups extension to general case will be a straight forward extension of this proof. Define the Hermitian dilation of S, P as H(S), H(P) respectively. Then we know that the eigenvalues of H(S) are n i=1{σi(S), σi(S)} Further the eigenvectors corresponding to these are u S i v S i u S i v S i Similarly define the respective quantities for H(P). Now clearly, ||H(S) H(P)|| ϵ since ||S P|| ϵ. Then by Weyl s inequality we have that |σi(S) σi(P)| ϵ Finite Time LTI System Identification Now we can use Proposition 12.1. To ease notation, define σi(S) = λi(H(S)) and λ i(H(S)) = σi(S) and let the corresponding eigenvectors be ai, a i for S and bi, b i for P respectively. Note that we can make the assumption that ai, bi 0 for every i. This does not change any of our results because ai, bi are just stacking of left and right singular vectors and uiv i is identical for ui, vi and ui, vi. Then using Proposition 12.1 we get for every (i, j) K2 K2 and i = j | ai, bj | ϵ |σi(S) σj(P)| (47) similarly | a i, bj | ϵ |σi(S) + σj(P)| (48) u S i v S i u S i v S i u P i v P i and σi(S), σi(P) 0 we have by adding Eq. (47),(48) that max | u S i , u P j |, | v S i , v P j | ϵ |σi(S) σj(P)| Define US Ki to be the matrix formed by the orthornormal vectors {aj}j Ki and US K i to be the matrix formed by the orthonormal vectors {aj}j Ki. Define similar quantities for P. Then (US K2) UP K2(UP K2) US K2 = (US K2) (I X j =2 UP Kj(UP Kj) )US K2 = (US K2) (I X |j| =2 UP Kj(UP Kj) UP K 2(UP K 2) )US K2 = I (US K2) X |j| =2 UP Kj(UP Kj) US K2 (US K2) UP K 2(UP K 2) US K2 (49) Now K1, K 1 corresponds to eigenvectors where singular values σr 1(P), K3, K 3 corresponds to eigenvectors where singular values σs+1(P). We are in a position to use Proposition 12.2. Using that on Eq. (49) we get the following relation (UP K2) US K2(US K2) UP K2 I (σr 1(P) σs(S))2 ϵ2 (σs(S) σs+1(P))2 (US K2) UP K 2(UP K 2) US K2 (50) In the Eq. (50) we need to upper bound (US K2) UP K 2(UP K 2) US K2. To this end we will exploit the fact that all singular values corresponding to US K2 are the same. Since ||H(S) H(P)|| ϵ, then H(S) = US K2ΣS K2(US K2) + US K 2ΣS K 2(US K 2) + US K0ΣS K0(US K0) H(P) = UP K2ΣP K2(UP K2) + UP K 2ΣP K 2(UP K 2) + UP K0ΣP K0(UP K0) T. Sarkar and A. Rakhlin and M. A. Dahleh Then by pre multiplying and post multiplying we get (US K2) H(S)UP K 2 = ΣS K2(US K2) UP K 2 (US K2) H(P)UP K 2 = (US K2) UP K 2ΣP K 2 Let H(S) H(P) = R then (US K2) (H(S) H(P))UP K 2 = (US K2) RUP K 2 ΣS K2(US K2) UP K 2 (US K2) UP K 2ΣP K 2 = (US K2) RUP K 2 Since ΣS K2 = σs(A)I then ||(US K2) UP K 2(σs(S)I ΣP K 2)|| = ||(US K2) RUP K 2|| ||(US K2) UP K 2|| ϵ σs(S) + σs(P) ||(UP K2) US K 2|| ϵ σs(P) + σs(S) Since σs(P) + σs(S) σs(S) σs+1(P) combining this with Eq. (50) we get σmin((US K2) UP K2) 1 3ϵ2 min σr 1(P) σs(S), σs(S) σs+1(P) 2 (51) σi(P) σi+1(P) for Eq. (51), we use the inequality 1 x2 1 x2 whenever x < 1 which is true when Eq. (46) is true. This means that there exists unitary transformation Q such that ||US K2 UP K2Q|| 2ϵ min σr 1(P) σs(S), σs(S) σs+1(P) Remark 12 Note that S, P will be Hermitian dilations of H0, , , ˆH0, ˆd, ˆd respectively in our case. Since the singular vectors of S (and P) are simply stacked version of singular vectors of H0, , (and ˆH0, ˆd, ˆd), our results hold directly for the singular vectors of H0, , (and ˆH0, ˆd, ˆd) Let r k s. First divide the indices [1, n] into 3 parts K1 = [1, r 1], K2 = [r, s], K3 = [s + 1, n]. Finite Time LTI System Identification Proposition 12.4 (System Reduction) Let ||S P|| ϵ and the singular values of S be arranged as follows: σ1(S) > . . . > σr 1(S) > σr(S) σr+1(S) . . . σs(S) > σs+1(S) > . . . σn(S) > σn+1(S) = 0 Furthermore, let ϵ be such that ϵ inf {1 i r 1} {s+1 i n} σi(P) σi+1(P) Define K0 = K1 K2, then ||US K0(ΣS K0)1/2 UP K0(ΣP K0)1/2||2 2ϵ i=1 σi/ζ2 i + σr/ζ2r + sup 1 i s | σi p and σi = σi(S), ˆσi = σi(P). Here ζi = min (σi σi+1, σi σi+1) and ζr = min (σr 1 σr, σs σs+1). Since US K0 = [US K1US K2] and likewise for B, we can separate the analysis for K1, K2 as follows ||US K0(ΣS K0)1/2 UP K0(ΣP K0)1/2|| ||(US K0 UP K0)(ΣS K0)1/2||+||UP K0((ΣS K0)1/2 (ΣP K0)1/2)|| ||(US K1 UP K1)(ΣS K1)1/2||2 2+||(US K2 UP K2)(ΣS K2)1/2||2 2 + ||(ΣS K0)1/2 (ΣP K0)1/2|| Now ||(ΣS K0)1/2 (ΣP K0)1/2||= supl| p σl(P)|. Recall that σr(S) = . . . = σk(S) = . . . = σs 1(S) and by conditions on ϵ we are guaranteed that ϵ σi σj < 1/2 for all 1 i = j r. We will combine our previous results in Proposition 12.1 12.3 to prove this claim. Specifically from Proposition 12.3 we have ||(US K2 UP K2)(ΣS K2)1/2|| 2ϵ p min σr 1(P) σr(S), σr(S) σs+1(P) On the remaining term we will use Proposition 12.3 on each column ||(US K1 UP K1)(ΣS K1)1/2|| ||[ p σ1(S)c1, . . . , q σ|K1|(S)c|K1|]|| j=1 σ2 j ||cj||2 min σj 1(P) σj(S), σj(S) σj+1(P) 2 In the context of our system identification, S = H0, , and P = ˆH0, ˆd, ˆd. P will be made T. Sarkar and A. Rakhlin and M. A. Dahleh compatible by padding it with zeros to make it doubly infinite. Then US K0, UP K0 (after padding) has infinite rows. Define Z0 = US K0(ΣS K0)1/2(1 :, :), Z1 = US K0(ΣS K0)1/2(p + 1 :, :) (both infinite length) and similarly we will have ˆZ0, ˆZ1. Note that from a computational perspective we do not need to Z0, Z1; we only need to work with ˆZ0 = UP K0(ΣP K0)1/2(1 : , :), ˆZ1 = UP K0(ΣP K0)1/2(p + 1 :, :) and since most of it is just zero padding we can simply compute on ˆZ0(1 : pd, :), ˆZ1(1 : pd, :). Proposition 12.5 Assume Z1 = Z0A. Furthermore, ||S P||2 ϵ and let ϵ be such that ϵ inf {1 i r 1} {s+1 i n} σi(P) σi+1(P) ||(Z 0Z0) 1Z 0Z1 ( ˆZ 0 ˆZ0) 1 ˆZ 0 ˆZ1|| Cϵ(γ + 1) σ2s ((σs σs+1) (σr 1 σs))2 σiσs (σi σi+1)2 (σi 1 σi)2 where σ1(A) γ. Proof Note that Z1 = Z0A, then ||(Z 0Z0) 1Z 0Z1 ( ˆZ 0 ˆZ0) 1 ˆZ 0 ˆZ1||2 =||A ( ˆZ 0 ˆZ0) 1 ˆZ 0 ˆZ1||2= ||( ˆZ 0 ˆZ0) 1 ˆZ 0 ˆZ0A ( ˆZ 0 ˆZ0) 1 ˆZ 0 ˆZ1||2 =||( ˆZ 0 ˆZ0) 1 ˆZ 0 ˆZ0A ( ˆZ 0 ˆZ0) 1 ˆZ 0Z0A + ( ˆZ 0 ˆZ0) 1 ˆZ 0Z0A ( ˆZ 0 ˆZ0) 1 ˆZ 0 ˆZ1||2 ||( ˆZ 0 ˆZ0) 1 ˆZ 0 ˆZ0A ( ˆZ 0 ˆZ0) 1 ˆZ 0Z0A||2+||( ˆZ 0 ˆZ0) 1 ˆZ 0Z0A ( ˆZ 0 ˆZ0) 1 ˆZ 0 ˆZ1||2 ||( ˆZ 0 ˆZ0) 1 ˆZ 0||2 ||Z0A ˆZ0A||2+|| Z0A |{z} Shifted version of Z0 Now, ||( ˆZ 0 ˆZ0) 1 ˆZ 0||2 ( σs ϵ) 1, ||Z0A ˆZ1||2 ||Z0 ˆZ0||2 since Z1 = Z0A is a submatrix of Z0 and ˆZ1 is a submatrix of ˆZ0 we have ||Z0A ˆZ1||2 ||Z0 ˆZ0||2 and ||Z0A ˆZ0A||2 ||A||2||Z0 ˆZ0||2 σ2s ((σs σs+1) (σr 1 σs))2 + σiσs (σi σi+1)2 (σi 1 σi)2 Finite Time LTI System Identification 13. Hankel Matrix Estimation Results In this section we provide the proof for Theorem 5.2. For any matrix P, we define its doubly infinite extension P as P 0 . . . 0 0 . . . ... ... ... Proposition 13.1 Fix d > 0. Then we have ||Hd, , ||2 ||H0, , H0,d,d||2 2||Hd, , ||2 Proof Define Cd, Bd as follows 0md n C CA ... Bd = 0n pd B AB . . . Now pad H0,d,d with zeros to make it a doubly infinite matrix and call it H0,d,d and we get that || H0,d,d H0, , || = 0 M12 M21 M22 Note here that M21 and M0 = M12 M22 are infinite matrices. Further ||Hd, , ||2= ||M0||2 ||M21||2. Then || H0,d,d H0, , || q ||M12||2 2+||M0||2 2 2||Hd, , ||2 Further || H0,d,d H0, , || ||M0||= ||Hd, , ||2. Proposition 13.2 For any d1 d2, we have ||H0, , H0,d1,d1||2 2||H0, , H0,d2,d2||2 Proof Since ||Hd1, , ||2 ||H0, , H0,d1,d1||2 2||Hd1, , ||2 from Proposition 13.1. It is clear that ||Hd1, , ||2 ||Hd2, , ||2. Then 2||H0, , H0,d1,d1||2 ||Hd1, , ||2 ||Hd2, , ||2 ||H0, , H0,d2,d2||2 T. Sarkar and A. Rakhlin and M. A. Dahleh Proposition 13.3 Fix d > 0. Then ||Td, (M)||2 Mρ(A)d Proof Recall that 0 0 0 . . . 0 CAd B 0 0 . . . 0 CAd+1B CAd B 0 . . . 0 ... ... ... ... ... Then ||Td, (M)||2 P j=d||CAj B||2. Now from Eq. 4.1 and Lemma 4.1 in Tu et al. (2017) we get that ||CAj B||2 Mρ(A)j. Then j=d ||CAj B||2 Mρ(A)d Remark 13 Proposition 13.3 is just needed to show exponential decay and is not precise. Please refer to Tu et al. (2017) for explicit rates. Next we show that T (κ) (δ) and d (T, δ) defined in Eq. (16) given by d (T, δ) = inf m + pd + log T δ T ||H0,d,d H0, , ||2 T (κ) (δ) = inf n T T cm2 log3 (Tm/δ) d (T, δ), d (T, δ) κd ( T The existence of d (T, δ) is predicated on the finiteness of T (κ) (δ) which we discuss below. 13.1 Existence of T (κ) (δ) < Construct two sets T1(δ) = inf n T d (T, δ) D(T) o (56) T2(δ) = inf n T d (t, δ) κd ( t κ2 , δ) 8 , t T o (57) Clearly, T (κ) (δ) < T1(δ) T2(δ). A key assumption in the statement of our results is that T (κ) (δ) < . We will show that it is indeed true. Let κ 16. Proposition 13.4 For a fixed δ > 0, T1(δ) < with d (T, δ) c log (c T+log 1 δ ) log R+log ( M/β) log 1 Here ρ = ρ(A). Finite Time LTI System Identification Proof Note the form for d (T, δ), it is the minimum d that satisfies m + pd + log T δ T ||H0,d,d H0, , ||2 Since from Proposition 13.1 and 13.3 we have ||H0,d,d H0, , ||2 3 Mρd 1 ρ(A), then d (T, δ) d that satisfies m + pd + log T which immediately implies d (T, δ) d = c log (c T log R+log 1 δ )+log ( M/β) log 1 ρ , i.e., d (T, δ) is at most logarithmic in T. As a result, for a large enough T cm2d log2 (d) log2 (m2/δ) + cd log3 (2d) c log (c T + log 1 δ) log R + log ( M/β) log 1 The intuition behind T2(δ) is the following: d (T, δ) grows at most logarithmically in T, as is clear from the previous proof. Then T2(δ) is the point where d (T, δ) is still growing as T (i.e., mixing has not happened) but at a slightly reduced rate. Proposition 13.5 For a fixed δ > 0, T2(δ) < . Proof Recall from the proof of Proposition 13.1 that ||Hd, , || ||H0, , H0,d,d|| 2||Hd, , ||. Now Hd, , can be written as Ad [B, AB, . . .] | {z } = B Define Pd = Ad B B (Ad) . Let dκ be such that for every d dκ and κ 16 Clearly such a dκ < would exist because P0 = 0 but limd Pd = 0. Then observe that P2d 1 4κPd. Then for every d dκ we have that ||Hd, , || 4κ||H2d, , || T 4dκ (16)2 β2R2 σ2 0 (dκp + log (T/δ)) (59) T. Sarkar and A. Rakhlin and M. A. Dahleh where σ0 = ||Hdκ, , ||. Assume that σ0 > 0 (if not then are condition is trivially true). Then simple computation shows that ||H0,dκ,dκ H0, , || ||Hdκ, , || 16βR p m + pdκ + log T δ T | {z } < σ0 This implies that d = d (T, δ) dκ for T prescribed as above (ensured by Proposition 13.2). But from our discussion above we also have ||H0,d ,d H0, , || ||Hd , , || 4κ||H2d , , || 2κ||H0,2d ,2d H0, , || This means that if ||H0,d ,d H0, , || 16βR p m + pd + log T ||H0,2d ,2d H0, , || 16 m + pd + log T m + 2pd + log κ2T which implies that d (κ2T, δ) 2d (T, δ). The inequality follows from the definition of d (κ2T, δ). Furthermore, if κ 16, 2d (T, δ) κ 8d (T, δ) whenever T is greater than a certain finite threshold of Eq. (59). Eq. (58) happens when σ(Ad)2 1 4κ = dκ = O log κ log 1 where ρ = ρ(A) and T2(δ) c T1(δ). It should be noted that the dependence of Ti(δ) on log 1 ρ is worst case, i.e., there exists some bad LTI system that gives this dependence and it is quite likely Ti(δ) is much smaller. The condition T T1(δ) T2(δ) simply requires that we capture some reasonable portion of the dynamics and not necessarily the entire dynamics. 13.2 Proof of Theorem 5.2 Proposition 13.6 Let T T (κ) (δ) and d = d (T, δ) then ||H0, , ˆH0,d ,d || 2cβR m + pd + log T Proof Consider the following error ||H0, , ˆH0,d ,d ||2 ||H0,d ,d ˆH0,d ,d ||2+||H0, , H0,d ,d ||2 From Proposition 13.1 and Eq. (55) we get that ||H0, , H0,d ,d ||2 16βR m + pd + log T Finite Time LTI System Identification Since from Theorem 5.1 ||H0,d ,d ˆH0,d ,d ||2 16βR m + pd + log T ||H0, , ˆH0,d ,d ||2 32βR m + pd + log T Recall the adaptive rule to choose d in Algorithm 1. From Theorem 5.1 we know that for every d D(T) we have with probability at least 1 δ. ||H0,d,d ˆH0,d,d||2 16βR lp T + log T δ T . Then consider the following adaptive rule d0(T, δ) = inf n l || ˆH0,l,l ˆH0,h,h||2 16βR(2α(l) + α(h)) h D(T), h l o (61) ˆd = ˆd(T, δ) = d0(T, δ) log T for the same universal constant c as Theorem 5.1. Let d (T, δ) be as Eq. (55). Recall that d = d (T, δ) is the point where estimation error dominates the finite truncation error. Unfortunately, we do not have apriori knowledge of d (T, δ) to use in the algorithm. Therefore, we will simply use Eq. (62) as our proxy. The goal will be to bound || ˆH0, ˆd, ˆd H0, , ||2 Proposition 13.7 Let T T (κ) (δ), d (T, δ) be as in Eq. (55) and ˆd be as in Eq. (62). Then with probability at least 1 δ we have ˆd d (T, δ) log T Proof Let d = d (T, δ). First for all h D(T) d , we note || ˆH0,d ,d ˆH0,h,h||2 || ˆH0,d ,d H0,d ,d ||+||H0,h,h ˆH0,h,h||2+||H0,h,h H0,d ,d ||2 || ˆH0,d ,d H0,d ,d ||2+||H0,h,h ˆH0,h,h||2+||H0, , H0,d ,d ||2. We use the property that ||H0, , H0,d ,d ||2 ||H0,h,h H0,d ,d ||2. Furthermore, because of the properties of d we have ||H0, , H0,d ,d ||2 16βRα(d ) || ˆH0,d ,d H0,d ,d ||2 16βRα(d ), ||H0,h,h ˆH0,h,h||2 16βRα(h). (63) T. Sarkar and A. Rakhlin and M. A. Dahleh and || ˆH0,d ,d ˆH0,h,h||2 16βR(2α(d ) + α(h)). This implies that d0(T, δ) d and the assertion follows. We have the following key lemma about the behavior of ˆH0, ˆd, ˆd. Lemma 13.1 For a fixed κ 20, whenever T T (κ) (δ) we have with probability at least 1 δ ||H0, , ˆH0, ˆd, ˆd||2 3cβRα max d (T, δ), log T Furthermore, ˆd = O(log T Proof Let d > ˆd then ||H0, , ˆH0, ˆd, ˆd||2 ||H0, , H0,d ,d ||2+|| ˆH0, ˆd, ˆd H0, ˆd, ˆd||2+|| ˆH0,d ,d H0,d ,d ||2 If ˆd > d then ||H0, , ˆH0, ˆd, ˆd||2 ||H0, , H0, ˆd, ˆd||2+|| ˆH0, ˆd, ˆd H0, ˆd, ˆd||2= 2|| ˆH0, ˆd, ˆd H0, ˆd, ˆd||2 2cβRα( ˆd) = 2cβRα log T where the equality follows from Proposition 13.7. The fact that ˆd = O(log T δ ) follows from Proposition 13.1. In the following we will use Hl = H0,l,l for shorthand. Proposition 13.8 Fix κ 16, and T T (κ) (δ). Then || ˆH0, ˆd(T,δ), ˆd(T,δ) H0, , ||2 12cβR q m + p ˆd(T, δ) + log T with probability at least 1 δ. Proof Assume that log T δ d (T, δ). Recall the following functions d (T, δ) = inf n d cβR m + pd + log T δ T ||Hd H ||2 o d0(T, δ) = inf n l || ˆHl ˆHh||2 cβR(α(h) + 2α(l)) h l, h D(T) o ˆd(T, δ) = d0(T, δ) log T It is clear that d (κ2T, δ) (1 + 1 2p)κd (T, δ) for any κ 16. Assume the following 8d (κ 2T, δ) (This relation is true whenever T T (κ) (δ)), Finite Time LTI System Identification ||H ˆd(T,δ) H ||2 6cβR q m+p ˆd(T,δ)+log T ˆd(T, δ) < d (κ 2T, δ) 1. The key will be to show that with high probability that all three assumptions can not hold with high probability. For shorthand we define d(1) = d (T, δ), d(κ2) = d (κ 2T, δ), ˆd(1) = ˆd(T, δ), ˆd(κ2) = ˆd(κ 2T, δ) and Hl = H0,l,l, ˆHl = ˆH0,l,l. Let T = κ 2T. Then this implies that m + pd(1) + log κ2 T δ T || ˆH ˆd(1) ˆHd(1) ||2 || ˆH ˆd(1) ˆHd(1) ||2 || ˆH ˆd(1) H ||2 || ˆHd(1) H ||2 || ˆHd(1) H ||2+|| ˆH ˆd(1) ˆHd(1) ||2 || ˆH ˆd(1) H ||2 || ˆHd(1) Hd(1) ||2+||Hd(1) H ||2+|| ˆH ˆd(1) ˆHd(1) ||2 || ˆH ˆd(1) H ||2 Since by definition of d ( , ) we have || ˆHd(1) Hd(1) ||2+||Hd(1) H ||2 2cβR m + pd(1) + log κ2 T and by assumptions d(1) κ 8d(κ2) , ˆd(1) d(κ2) then as a result ( q 8 + 1)d(κ2) || ˆH ˆd(1) H ||2 || ˆHd(1) Hd(1) ||2+||Hd(1) H ||2+ || ˆH ˆd(1) ˆHd(1) ||2 | {z } m + pd(1) + log κ2 T δ T | {z } Prop. 13.6 m + pd(1) + log κ2 T δ T | {z } Definition of ˆd(1) || ˆH ˆd(1) H ||2 1 m + pd(κ2) + log T where the last inequality follows from ( q 8 + 1)d(κ2) . Now by assumption ||H ˆd(1) H ||2 6cβR p m + p ˆd(1) + log κ2 T δ κ2 T it is clear that || ˆH ˆd(1) H ||2 5 6||H ˆd(1) H ||2 and we can conclude that, since 6 ||H ˆd(1) H ||2< cβR m + pd(κ2) + log T T. Sarkar and A. Rakhlin and M. A. Dahleh which implies that ˆd(1) d(κ2) 1. This is because by definition of d(κ2) we know that d(κ2) is the minimum such that ||Hd(κ2) H ||2 cβR m + pd(κ2) + log T and furthermore from Proposition 13.2 we have for any d1 d2 ||H0, , H0,d1,d1|| 1 2||H0, , H0,d2,d2||. This contradicts Assumption 3. So, this means that one of three assumptions do not hold. Clearly if assumption 3 is invalid then we have a suitable lower bound on the chosen ˆd( , ), i.e., since d (κ 2T, δ) d (T, δ) κ 8d (κ 2T, δ) we get κ 8 ˆd(κ2 T, δ) κ 8d ( T, δ) κ 8 d (κ2 T, δ) κ 8 ˆd(κ2 T, δ) κ 8 d ( T, δ) κ which implies from Lemma 13.1 that (since we pick κ = 16, for large enough T d ( T, δ) 4) and we have || ˆH ˆd(κ2 T,δ) H ||2 3cβR q d (κ2 T, δ) pd (κ2 T, δ) + log κ2 T ˆd(κ2 T, δ) p ˆd(κ2 T, δ) + log κ2 T Similarly, if assumption 2 is invalid then we get that ||H ˆd(κ2 T,δ) H ||2< 6cβR q ˆd(κ2 T, δ) p ˆd(κ2 T, δ) + log κ2 T and because ˆd(κ2 T, δ) d (κ2 T, δ) and || ˆH ˆd(κ2 T,δ) H ||2 ||H ˆd(κ2 T,δ) H ||2+|| ˆH ˆd(κ2 T,δ) H ||2 we get in a similar fashion to Proposition 13.6 || ˆH ˆd(κ2 T,δ) H ||2 12cβR q ˆd(κ2 T, δ) p ˆd(κ2 T, δ) + log κ2 T Replacing κ2 T = T it is clear that for any κ 16 || ˆH ˆd(T,δ) H ||2 12cβR q p ˆd(T, δ) + log T If d (T, δ) log T δ then we can simply apply Lemma 13.1 and our assertion holds. Finite Time LTI System Identification 14. Model Selection Results Proposition 14.1 Let H0, , = UΣV , ˆH0, ˆd, ˆd = ˆU ˆΣ ˆV and ||H0, , ˆH0, ˆd, ˆd|| ϵ. Let ˆΣ be arranged into blocks of singular values such that in each block i we have sup j ˆσi j ˆσi j+1 χϵ for some χ 2, i.e., Λ1 0 . . . 0 0 Λ2 . . . 0 ... ... ... 0 0 0 . . . Λl where Λi are diagonal matrices and ˆσi j is the jth singular value in the block Λi. Then there exists an orthogonal transformation, Q, such that || ˆU ˆΣ1/2Q UΣ1/2||2 2ϵ q ˆσ1/ζ2n1 + ˆσn1+1/ζ2n2 + . . . + ˆσPl 1 i=1 ni+1/ζ2nl + 2 sup 1 i l ˆσi min + ϵ pˆσ ˆd ϵ. Here sup1 i l p ˆσimax ϵ ˆd q ζni = min (ˆσni 1 min ˆσni max, ˆσni min ˆσni+1 max ) for 1 < i < l, ζn1 = ˆσn1 min ˆσn2 max and ζnl = min (ˆσnl 1 min ˆσnl max, ˆσnl min). Proof Let ˆU ˆΣ ˆV = SVD( ˆH0, ˆd, ˆd) and UΣV = SVD(H0, , ) where || ˆH0, ˆd, ˆd H0, , ||2 ϵ. ˆΣ is arranged into blocks of singular values such that in each block i we have ˆσi j ˆσi j+1 χϵ, i.e., Λ1 0 . . . 0 0 Λ2 . . . 0 ... ... ... 0 0 0 . . . Λl where Λi are diagonal matrices and ˆσi j is the jth singular value in the block Λi. Furthermore, ˆσi 1 min ˆσi max > χϵ. From ˆΣ define ˆΣ as follows: ˆσ1In1 n1 0 . . . 0 0 ˆσ2In2 n2 . . . 0 ... ... ... 0 0 0 . . . ˆσl Inl nl T. Sarkar and A. Rakhlin and M. A. Dahleh where Λi is a ni ni matrix and ˆσi = 1 ni P j ˆσi j. The key idea of the proof is the following: (A, B, C) (QAQ , QB, CQ ) where Q is a orthogonal transformation and we will show that there exists a block diagonal unitary matrix Q of the form Qn1 n1 0 . . . 0 0 Qn2 n2 . . . 0 ... ... ... 0 0 0 . . . Qnl nl such that each block Qni ni corresponds to a orthogonal matrix of dimensions ni ni and that || ˆU ˆΣ1/2Q UΣ1/2||2 is small if || ˆH0, ˆd, ˆd H0, , ||2 is small. Each of the blocks correspond to the set of singular values where the inter-singular value distance is small . To start off, note that from Propositon 12.4 there must exist a Q that is block diagonal with orthogonal entries such that || ˆUQˆΣ1/2 UΣ1/2||2 cϵ q ˆσ1/ζ2n1 + ˆσn1+1/ζ2n2 + . . . + ˆσPl 1 i=1 ni+1/ζ2nl + sup 1 i ˆd | σi p Here ζni = min (ˆσni 1 min ˆσni max, ˆσni min ˆσni+1 max ) for 1 < i < l, ζn1 = ˆσn1 min ˆσn2 max and ζnl = min (ˆσnl 1 min ˆσnl max, ˆσnl min). Informally, the ζi measure the singular value gaps between each blocks. Furthermore, it can be shown that for any Q of the form in Eq. (67) || ˆUQˆΣ1/2 ˆU ˆΣ1/2Q||2 || ˆUQ ˆΣ 1/2 ˆUQˆΣ1/2||2+|| ˆU ˆΣ1/2Q ˆU ˆΣ 1/2Q||2 2||ˆΣ1/2 ˆΣ1/2||2 because ˆUQ ˆΣ 1/2 = ˆU ˆΣ 1/2Q. Note that ||ˆΣ1/2 ˆΣ1/2||2 sup1 i l p ˆσi min. Now, when ˆσi max χniϵ, then p ˆσimax ; on the other hand when ˆσi max < χniϵ ˆσimax ˆσi χniϵ and this implies that ˆσi min χni p ˆσimax ϵ χniϵ. || ˆU ˆΣ1/2Q UΣ1/2||2 || ˆUQˆΣ1/2 UΣ1/2||2+|| ˆUQˆΣ1/2 ˆU ˆΣ1/2Q||2 ˆσ1/ζ2n1 + ˆσn1+1/ζ2n2 + . . . + ˆσPl 1 i=1 ni+1/ζ2nl + sup 1 i ˆd | σi p Our assertion follows since sup1 i ˆd| σi ˆσi| ϵ Finite Time LTI System Identification Proposition 14.2 Let H0, , = UΣV , ˆH0, ˆd, ˆd = ˆU ˆΣ ˆV and ||H0, , ˆH0, ˆd, ˆd|| ϵ. Let ˆΣ be arranged into blocks of singular values such that in each block i we have sup j ˆσi j ˆσi j+1 χϵ for some χ 2, i.e., Λ1 0 . . . 0 0 Λ2 . . . 0 ... ... ... 0 0 0 . . . Λl where Λi are diagonal matrices and ˆσi j is the jth singular value in the block Λi. Then there exists an orthogonal transformation, Q, such that max || ˆC C||2, || ˆB B||2 2ϵ q ˆσ1/ζ2n1 + ˆσn1+1/ζ2n2 + . . . + ˆσPl 1 i=1 ni+1/ζ2nl + 2 sup 1 i l ˆσi min + ϵ pˆσ ˆd ϵ = ζ, ||A ˆA||2 4γ ζ/ q Here sup1 i l p ˆσimax ϵ ˆd q ζni = min (ˆσni 1 min ˆσni max, ˆσni min ˆσni+1 max ) for 1 < i < l, ζn1 = ˆσn1 min ˆσn2 max and ζnl = min (ˆσnl 1 min ˆσnl max, ˆσnl min). Proof The proof follows because all parameters are equivalent up to a orthogonal transform (See discussion preceding Proposition 9.2). Following that we use Propositions 12.4 and 12.5. 15. Order Estimation Lower Bound Lemma 14 (Theorem 4.21 in Boucheron et al. (2013)) Let {Pi}N i=0 be probability laws over (Σ, A) and let {Ai A}N i=0 be disjoint events. If a = mini=0,...,N Pi(Ai) 1/(N + 1), + (1 a) log 1 a i=1 KL(Pi||P0) (69) Lemma 15 (Le Cam s Method) Let P0, P1 be two probability laws then sup θ {0,1} Pθ[M = ˆ M] 1 1 2KL(P0||P1) T. Sarkar and A. Rakhlin and M. A. Dahleh Proposition 15.1 Let N0, N1 be two multivariate Gaussians with mean µ0 RT , µ1 RT and covariance matrix Σ0 RT T , Σ1 RT T respectively. Then the KL(N0, N1) = 1 2 tr(Σ 1 1 Σ0) T + log det(Σ1) det(Σ0) + Eµ1,µ0[(µ1 µ0) Σ 1 1 (µ1 µ0)] . In this section we will prove a lower bound on the finite time error for model approximation. In systems theory subspace based methods are useful in estimating the true system parameters. Intuitively, it should be harder to correctly estimate the subspace that corresponds to lower Hankel singular values, or energy due to the presence of noise. However, due to strong structural constraints on Hankel matrix finding a minimax lower bound is a much harder proposition for LTI systems. Specifically, it is not clear if standard subspace identification lower bounds can provide reasonable estimates for a structured and non i.i.d. setting such as our case. To alleviate some of the technical difficulties that arise in obtaining the lower bounds, we will focus on a small set of LTI systems which are simply parametrized by a number ζ. Consider the following canonical form order 1 and 2 LTI systems respectively with m = p = 1 and let R be the noise-to-signal ratio bound. 0 1 0 0 0 0 ζ 0 0 , A1 = A0, B0 = , C0 = 0 0 βR , C1 = C0 A0, A1 are Schur stable whenever |ζ|< 1. 1 0 0 0 0 . . . 0 0 0 0 0 . . . 0 0 0 0 0 . . . 0 0 0 0 0 . . . 0 0 0 0 0 . . . ... ... ... ... ... ... 1 0 ζ 0 0 . . . 0 ζ 0 0 0 . . . ζ 0 0 0 0 . . . 0 0 0 0 0 . . . 0 0 0 0 0 . . . ... ... ... ... ... ... Here Hζ,0, Hζ,1 are the Hankel matrices generated by (C0, A0, B0), (C1, A1, B1) respectively. It is easy to check that for Hζ,1 we have 1 ζ where σi are Hankel singular values. Further the rank of Hζ,0 is 1 and that of Hζ,1 is at least 2. Also, ||T O0, ((Ci,Ai,Bi))||2 ||T0, ((Ci,Ai,Bi))||2 R. This construction will be key to show that identification of a particular rank realization depends on the condition number of the Hankel matrix. An alternate representation of the Finite Time LTI System Identification input output behavior is y T y T 1 ... y1 CB CAi B . . . CAT 1 i B 0 CB . . . CAT 2 i B ... ... ... ... 0 0 . . . CB u T+1 u T ... u2 C CAi . . . CAT 1 i 0 C . . . CAT 2 i ... ... ... ... 0 0 . . . C ηT+1 ηT ... η2 w T w T 1 ... w1 where Ai {A0, A1}. We will prove this result for a general class of inputs, i.e., active inputs. Then we will follow the same steps as in proof of Theorem 2 in Tu et al. (2018b). KL(P0||P1) = EP0 γt(ut|{ul, yl}t 1 l=1)P0(yt|{ul}t 1 l=1) γt(ut|{ul, yl}t 1 l=1)P1(yt|{ul}t 1 l=1) P0(yt|{ul}t 1 l=1) P1(yt|{ul}t 1 l=1) Here γt( | ) is the active rule for choosing ut from past data. From Eq. (72) it is clear that conditional on {ul}T l=1, {yl}T l=1 is Gaussian with mean given by Πi U. Then we use Birge s inequality (Lemma 14). In our case Σ0 = O0O 0 + I, Σ1 = O1O 1 + I where Oi is given in Eq. (72). We will apply a combination of Lemma 14, Proposition 15.1 and assume ηi are i.i.d Gaussian to obtain our desired result. Note that O1 = O0 but Π1 = Π0. Therefore, from Proposition 15.1 KL(N0, N1) = Eµ1,µ0[(µ1 µ0) Σ 1 1 (µ1 µ0)] T ζ2 R2 where µi = Πi U. For any δ (0, 1/4), set a = 1 δ in Proposition 14, then we get whenever δ log δ 1 δ + (1 δ) log 1 δ we have supi =j PAi(Aj) δ. For δ [1/4, 1) we use Le Cam s method in Lemma 15 and show that if 8δ2 Tζ2 R2 then supi =j PAi(Aj) δ. Since δ2 c log 1 δ when δ [1/4, 1) for an absolute constant, our assertion holds.