# Learning from many trajectories

Journal of Machine Learning Research 25 (2024) 1-109. Submitted 9/23; Revised 6/24; Published 7/24.

Stephen Tu (stephen.tu@usc.edu), University of Southern California, Ming Hsieh Department of Electrical and Computer Engineering, Los Angeles, CA 90089, USA
Roy Frostig (frostig@google.com), Google DeepMind, San Francisco, CA 94105, USA
Mahdi Soltanolkotabi (soltanol@usc.edu), University of Southern California, Ming Hsieh Department of Electrical and Computer Engineering, Los Angeles, CA 90089, USA

Editor: Daniel Hsu

**Abstract.** We initiate a study of supervised learning from many independent sequences (*trajectories*) of non-independent covariates, reflecting tasks in sequence modeling, control, and reinforcement learning. Conceptually, our multi-trajectory setup sits between two traditional settings in statistical learning theory: learning from independent examples and learning from a single auto-correlated sequence. Our conditions for efficient learning generalize the former setting: trajectories must be non-degenerate in ways that extend standard requirements for independent examples. Notably, we do not require that trajectories be ergodic, long, nor strictly stable. For linear least-squares regression, given n-dimensional examples produced by m trajectories, each of length T, we observe a notable change in statistical efficiency as the number of trajectories increases from a few (namely m ≲ n) to many (namely m ≳ n). Specifically, we establish that the worst-case error rate of this problem is Θ(n/(mT)) whenever m ≳ n. Meanwhile, when m ≲ n, we establish a (sharp) lower bound of Ω(n²/(m²T)) on the worst-case error rate, realized by a simple, marginally unstable linear dynamical system. A key upshot is that, in domains where trajectories regularly reset, the error rate eventually behaves as if all of the examples were independent, drawn from their marginals.
As a corollary of our analysis, we also improve guarantees for the linear system identification problem.

**Keywords:** learning with dependent data, linear dynamical systems, system identification

© 2024 Stephen Tu, Roy Frostig, and Mahdi Soltanolkotabi. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/23-1145.html. (Work performed while author was employed at Google DeepMind.)

## 1. Introduction

Statistical learning theory aims to characterize the worst-case efficiency of learning from example data. Its most common setup assumes that examples are independently and identically distributed (iid) draws from an underlying data distribution, but various branches of theory, not to mention deployed applications of machine learning, consume non-independent data as well. An especially fruitful setting, and the focus of this paper, is learning from sequential data, where examples are generated by some ordered stochastic process that renders them possibly correlated. Naturally, sequential processes describe application domains spanning engineering and the sciences, such as robotics (Nguyen-Tuong and Peters, 2011), data center cooling (e.g. Lazic et al., 2018), language (e.g. Sutskever et al., 2014; Belanger and Kakade, 2015), neuroscience (e.g. Linderman et al., 2017; Glaser et al., 2020), and economic forecasting (McDonald et al., 2017). Learning over sequential data can also capture some formulations of imitation learning (Osa et al., 2018) and reinforcement learning (Chen et al., 2021; Janner et al., 2021).

In supervised learning, one learns to predict output labels from input covariates, given example pairings of the two. Formal treatments of learning from sequential data typically concern a single inter-dependent chain of covariates. Where these treatments vary is in their assumptions about the underlying process that generates the covariate chain.
For instance, some assume that the process is auto-regressive (e.g. Lai and Wei, 1983; Goldenshluger and Zeevi, 2001; González and Rojas, 2020) or ergodic (e.g. Yu, 1994; Duchi et al., 2012). Others assume that it is a linear dynamical system (e.g. Simchowitz et al., 2018; Faradonbeh et al., 2018; Sarkar and Rakhlin, 2019). In this paper, we examine what happens when we learn from many independent chains rather than from one, as one does anyway in many applications (e.g. Pomerleau, 1989; Khansari-Zadeh and Billard, 2011; Brants et al., 2007; Józefowicz et al., 2016). Figure 1 depicts the data dependence structure of our setup in comparison with its two natural counterparts.

Learning from a dataset of many short (constant length) chains ought to be similar to independent learning, even if each chain is highly intra-dependent. On the other hand, for any non-trivial chain length, intuition suggests that the error can degrade relative to the total sample size in the worst case, since a greater proportion of the data may contain correlations. Lower bounds even show that, when one sees only a single chain, this degradation is outright necessary in the worst case (Bresler et al., 2020). Do we see any such effect with many chains?

We study this question by sharply characterizing worst-case error rates of a fundamental task, linear regression, imposed over a general sequential data model. Our findings reveal a remarkable phenomenon: after seeing sufficiently many chains (m) relative to the example dimension n, no matter the chain length T, the error rate matches that of learning from the same total number mT of independent examples, drawn from their respective marginal distributions.

In our data model, each chain, called a trajectory, comprises a sequence of covariates {x_t} generated from a stochastic process. Each covariate is accompanied by a noisy linear response y_t as its label.
A training set {(x_t^{(i)}, y_t^{(i)})}_{i=1,t=1}^{m,T} comprises m independent chains, each of length T. From such a training set, an estimator produces a hypothesis that predicts the label of any covariate. The resulting hypothesis is evaluated according to its mean-squared prediction error over a fresh chain of length T′, possibly unequal to T: a notion of risk defined naturally over a trajectory. All of our risk upper bounds are guarantees for the ordinary least-squares estimator in particular.

A concrete, recurring example in this paper takes the covariate-generating process to be a linear dynamical system (LDS). Specifically, fixing matrices A ∈ R^{n×n}, B ∈ R^{n×d}, and W_⋆ ∈ R^{p×n}, a single trajectory {(x_t, y_t)}_{t≥1} is generated as follows. Let x_0 = 0, and for t ≥ 1 consider the process:

    x_t = A x_{t−1} + B w_t,    (linear dynamics)
    y_t = W_⋆ x_t + ξ_t,        (linear regression)

where the {w_t}_{t≥1} are iid centered isotropic Gaussian draws and {ξ_t}_{t≥1} is a sub-Gaussian martingale difference sequence (with respect to past covariates {x_k}_{k=1}^{t} and noise variables {ξ_k}_{k=1}^{t−1}). Incidentally, combining linear dynamical systems with linear regression captures the basic problem of linear system identification (as in Simchowitz et al. (2018)) as a special case. In other instantiations of learning from trajectories, the covariates {x_t} may be generated by a different process; what remains common is the superimposed regression task set up by the ground truth W_⋆ and the noise {ξ_t}. The key condition that we will introduce, which renders a covariate process amenable to regression, is that it satisfies a trajectory small-ball criterion (Definition 4.1). Section 4.1 shows that LDS-generated data conforms to the trajectory small-ball condition in particular, as do many other distributions.

Figure 1: The covariate dependence structure induced by three data models on mT many training examples: (a) independent covariates, (b) one long trajectory, (c) independent trajectories. In (a): independent examples, typical of basic statistical learning. In (b): the data model often considered in the sequential learning literature, comprising a long auto-correlated chain of examples. Learning in this setting can be infeasible in general, so oftentimes ergodicity is assumed in order to rule out strong long-range dependencies, essentially inducing an independent resetting effect across time. The effective reset frequency then factors uniformly into error bounds, in a way suggesting that one learns only one independent example's worth within each effective reset window (cf. Section 2). In (c): our multi-trajectory data model. Our accompanying assumptions allow for non-ergodic chains, and for arbitrary chain lengths T, while introducing explicit independent resets. Decoupling the m resets from the sequential data model lets us vary the training set dimensions (m, T) freely, without affecting other data assumptions, as we study their effect on error rates. We find that with enough trajectories m, the worst-case error rate behaves the same as in the independent setting depicted in (a); i.e., one learns as though every example were independently drawn from its marginal distribution. Some recent work in system identification assumes a data model related to ours (specifically linear dynamical data) and likewise avoids ergodicity; our bounds improve these guarantees T-fold where applicable, and upgrade the regimes in which they apply (cf. Section 2).
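To make the multi-trajectory LDS data model concrete, the following is a minimal simulation sketch. It is our own illustration, not code from the paper: the function name, the Gaussian choice for the noise ξ_t (one valid sub-Gaussian MDS), and the identity-matrix example system are all assumptions made here for demonstration.

```python
import numpy as np

def simulate_lds_trajectories(A, B, W_star, m, T, noise_std=0.1, seed=0):
    """Draw m independent trajectories of length T from the LDS data model:
    x_t = A x_{t-1} + B w_t (with x_0 = 0), y_t = W_star x_t + xi_t."""
    rng = np.random.default_rng(seed)
    n, d = B.shape
    p = W_star.shape[0]
    X = np.zeros((m, T, n))
    Y = np.zeros((m, T, p))
    for i in range(m):
        x = np.zeros(n)
        for t in range(T):
            x = A @ x + B @ rng.standard_normal(d)   # linear dynamics
            xi = noise_std * rng.standard_normal(p)  # Gaussian noise (sub-Gaussian MDS)
            X[i, t] = x
            Y[i, t] = W_star @ x + xi
    return X, Y

# A marginally unstable example: A = I has spectral radius exactly 1 (non-ergodic).
n, d, p = 4, 4, 2
A = np.eye(n)
B = np.eye(n)
W_star = np.ones((p, n))
X, Y = simulate_lds_trajectories(A, B, W_star, m=50, T=20)
```

Because ρ(A) = 1 here, each trajectory is a random walk whose covariate magnitude grows with t; the independent resets across the m trajectories are what keep the dataset well behaved.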
Our main results (Sections 5 and 6) sharply characterize worst-case rates of learning from trajectory data as a function of the training trajectory count m, the training trajectory length T, the evaluation length T′, the covariate and response dimensions n and p, and scale parameters of noise in the data model (such as the variance of the noise {ξ_t}). Restricting only to terms of covariate dimension n, training set size m and T, and evaluation length T′, our bounds imply the following summary statement:

Theorem 1.1 (informal; error rate with many small-ball trajectories, T′ ≤ T). If m ≳ n, T′ ≤ T, and covariate trajectories are drawn from a trajectory small-ball distribution, then the worst-case excess prediction risk (over evaluation horizon T′) for linear regression from m many trajectories of n-dimensional covariates, each of length T, is Θ(n/(mT)).

In drawing comparisons to learning from independent examples, it makes sense to consider training and evaluation lengths T and T′ equal (cf. Section 3), rendering Theorem 1.1 applicable. The theorem thus echoes our main point above: the same rate of Θ(n/(mT)) describes regression on mT independent examples (details on this point are expanded in Section 3.3). Further structural assumptions are needed (cf. Section 3.3) in order to cover the remaining range of problem dimensions, namely few trajectories (m ≲ n) or extended evaluations (T′ > T), and to that end we return to linear dynamical systems as a focus.

Our remaining risk upper bounds, targeting learning under linear dynamics, require that the dynamics matrix A be marginally unstable (meaning that its spectral radius ρ(A) is at most one) and diagonalizable. When trajectories are longer at test time than during training (i.e., T′ > T), marginal instability is practically necessary; otherwise the risk can scale exponentially in T′ − T. The assumption otherwise still allows for unstable and therefore non-ergodic systems at ρ(A) = 1.
For simplicity, we also require that the control matrix B have full row rank. Our bounds then imply the following summary statement about regression when the number of trajectories is limited:

Theorem 1.2 (informal; error rate with few LDS trajectories). If m ≲ n, mT ≳ n, and covariate trajectories are drawn from a linear dynamical system whose dynamics A are marginally unstable and diagonalizable, then the worst-case excess prediction risk (over evaluation horizon T′) for linear regression from m many trajectories of n-dimensional covariates, each of length T, is Θ((n/(mT)) · max{nT′/(mT), 1}).¹

If the evaluation horizon T′ is a constant, the rate in Theorem 1.2 recovers that of Theorem 1.1, up to log factors and extra assumptions. To compare the theorems further, consider equal training and evaluation horizons (T′ = T). In this setting, the rate in Theorem 1.2 is weaker than that of Theorem 1.1, by up to a factor of the covariate dimension n, so it may seem that Theorem 1.2 establishes a separation between few- and many-trajectory learning. However, the varying premises of many vs. few trajectories constrain the risk definitions to differ: under a fixed data budget N := mT, fewer trajectories m imply a longer horizon T over which the risk is evaluated. Intuitively, a longer evaluation horizon makes for a different problem, and renders the rate comparison invalid.

A more sound comparison across regimes is possible by first normalizing the notion of performance within a problem instance. To this end, we can consider the worst-case risk of learning from trajectories relative to that of learning from independent examples in the same regime. Constructing the latter baseline is somewhat subtle (cf. Section 3.2). To decorrelate the problem of learning from trajectories while maintaining its temporal structure otherwise, we can imagine drawing from its marginal distributions independently at each time step. The resulting dataset is independent, but not identically distributed. Although the rates for the sequential and decorrelated regression problems are, as already highlighted, remarkably the same under many trajectories, the few-trajectory rate in Theorem 1.2 is indeed weaker than the Θ(n/(mT)) rate that we prove for its decorrelated baseline (cf. Theorem 5.7).

Since the more general Theorem 1.1 already describes what happens under many trajectories (m ≳ n) and a strict evaluation horizon (T′ ≤ T), what remains is a somewhat niche regime: many trajectories and an extended evaluation horizon T′ > T. For completeness, our bounds supply the following summary statement:

Theorem 1.3 (informal; error rate with many LDS trajectories). If m ≳ n and covariate trajectories are drawn from a linear dynamical system whose dynamics A are marginally unstable and diagonalizable, then the worst-case excess prediction risk (over evaluation horizon T′) for linear regression from m many trajectories of n-dimensional covariates, each of length T, is Θ((n/(mT)) · max{T′/T, 1}).

Using the tools of our analysis, we also develop upper bounds for parameter error instead of prediction risk, which inform recovery of the ground truth W_⋆ and (by reduction) of the dynamics matrix A in LDS. The latter captures the linear system identification problem. Our upper bounds improve on its worst-case guarantees T-fold where applicable, and extend the parameter ranges in which guarantees hold at all.

1. In proving the lower bound in Theorem 1.2, we construct a hard instance by decoupling the martingale difference noise {ξ_t} from the covariate-generating process (cf. Definition 7.1). Technically, this excludes the lower bound from applying directly to the linear system identification problem. Resolving this discrepancy is a question that remains open for future work.

## 2. Related work

Linear regression is a basic and well-studied problem.
The two treatments most closely related to our work are Hsu et al. (2014) and Mourtada (2022), who develop sharp finite-sample characterizations of the risk of random design linear regression (i.e., from iid examples). Discussion and references therein cover the broader problem over its long history.

A common approach to studying dependent covariates is to assume that the data-generating process is ergodic (see e.g. Yu, 1994; Meir, 2000; Mohri and Rostamizadeh, 2008; Steinwart and Christmann, 2009; Mohri and Rostamizadeh, 2010; Duchi et al., 2012; Kuznetsov and Mohri, 2017; McDonald et al., 2017; Shalizi, 2021, and references therein). The key phenomenon at play is that N correlated examples are statistically similar to N/τ_mix independent examples, where τ_mix is the process mixing-time. Relying on this idea, generalization bounds informing independent data can typically be ported to the ergodic setting, where the effective sample size is simply deflated by a factor of τ_mix. Since mixing-based bounds become vacuous as τ_mix → ∞, they do not present an effective strategy for studying dynamics that do not mix. A critical instance of this arises in linear dynamical systems: in LDS, the ergodicity condition amounts to stability of the dynamics matrix A (i.e., ρ(A) < 1), where τ_mix → ∞ as ρ(A) → 1 (e.g. Meyn and Tweedie, 1993, Thm. 17.6.2). Marginally unstable systems, in which ρ(A) = 1, are thus not captured.

A recent line of work uncovers ways to sharpen generalization bounds based on the specific structure of realizable least-squares regression problems over an ergodic trajectory. For realizable linear regression with stationary covariates, results from Bresler et al. (2020) imply that, after the trajectory length exceeds an initial burn-in time scaling as τ_mix·n, the minimax (excess) risk coincides with the classic iid rates.
Additionally, Ziemann and Tu (2022) show that the empirical risk minimizer exhibits similar behavior in realizable non-parametric regression problems, provided certain small-ball assumptions on the underlying process hold. While these results sharpen our understanding of how the mixing time τ_mix affects regression risk bounds, they ultimately rely on ergodicity. Since learning from a single trajectory is generally impossible without ergodicity, we are led to study other sequential learning configurations. The two, however, are not mutually exclusive: our results actually apply when the process mixes, and in fact show that the empirical risk minimizer is minimax optimal (after a burn-in time scaling with the mixing time). This eschews the need for algorithmic modifications to learning from mixing trajectory data (Bresler et al., 2020). We give details on this in Appendix B.7.

Non-temporal dependency structures. Covariates and responses can be inter-dependent in many ways, not only via temporal structure. A recent resurgence of work investigates learning under an Ising model structure over covariates (Bresler, 2015; Dagan et al., 2019; Ghosal and Mukherjee, 2020; Dagan et al., 2021b), as well as over responses (Daskalakis et al., 2019; Dagan et al., 2021a) (conditioned on the covariates). At a conceptual level, the extension from a single temporally dependent trajectory to multiple trajectories is analogous to the extension from single observations to Ising models with multiple independent observations. Incidentally, in this area, investigations began by studying learning under multiple independent observations, and progressed towards guarantees on learning from a single one. Relating these two data models, trajectories and Ising grids, under inter-compatible assumptions may reveal interesting connections between these results.

System identification.
A special case of our LDS-specific data model captures linear system identification with full state observation: the task of recovering the dynamical system parameters A from observations of trajectories. While classic results are asymptotic in nature (see e.g. Lai and Wei, 1982, 1983; Ljung, 1998), recent work gives finite-sample guarantees for recovery of linear systems with fully observed states (Simchowitz et al., 2018; Dean et al., 2020; Jedra and Proutiere, 2020; Faradonbeh et al., 2018; Sarkar and Rakhlin, 2019; Jedra and Proutiere, 2019; Tsiamis and Pappas, 2021), and also partially observed states (Oymak and Ozay, 2019; Simchowitz et al., 2019; Tsiamis and Pappas, 2019; Sarkar et al., 2021; Zheng and Li, 2021). The proof of our upper bounds builds on the small-ball arguments from Simchowitz et al. (2018) (that, in turn, extend Mendelson (2015); Koltchinskii and Mendelson (2015)), which do not require ergodicity. To the best of our knowledge, our results are the first to quantify the trade-offs between few long trajectories and many short trajectories.

Nearly all finite-sample guarantees for linear system identification consider a single trajectory, with a few notable exceptions. First, Dean et al. (2020) allow for m ≥ 1 trajectories with fully observed states and make no assumptions on the dynamics matrix A. However, their analysis discards all but the last state transition within a trajectory, reducing to iid learning over only m examples. Second, Zheng and Li (2021); Xin et al. (2022) study the recovery of Markov parameters from partially observed states over many trajectories. However, their error bounds do not decrease with longer training horizons T, since the number of Markov parameters one must recover scales with the trajectory length. Third, Xing et al. (2021) consider multiple trajectories where the noise enters multiplicatively instead of additively.
Their main finite-sample parameter recovery result (Theorem 2) states that the operator norm of the parameter error scales as √(T/m), with the additional restriction that T ≲ n². To achieve consistency, this result fixes the trajectory length T and takes the trajectory count m → ∞. By contrast, our analysis varies the two quantities T and m independently.

Furthermore, a line of work concurrent to ours investigates learning from multiple sources of linear dynamical systems (Chen and Poor, 2022; Modi et al., 2022). This is a latent variable model, where the underlying index of the LDS must be disambiguated from data. This model is more general than the one studied in this paper, and specializing the corresponding results to our setup yields sub-optimal bounds and unnecessary requirements. We discuss this in Section 5.2, after presenting upper bounds in detail.

Finally, to more clearly interpret the effects of multiple trajectories on learning, a core part of our work studies linear dynamical systems under an ideal setup with Gaussian controls and time-invariant dynamics. Recent work extends beyond this ideal setup in the single-trajectory setting, including learning in piecewise-affine systems (Block et al., 2023), bilinear systems (Sattar et al., 2022), switched linear systems (Massucci et al., 2022), and linear dynamical systems with non-linear control laws (Li et al., 2023). Extending our multi-trajectory analysis to these settings is an interesting open direction for future work.

Our LDS setup (Section 3.4) decouples the covariate dynamics model A from the observation model W_⋆, and our risk definition additionally allows for an arbitrary evaluation horizon T′. The risk over an arbitrary evaluation horizon is harder to control than parameter error, which corresponds to an evaluation length of one. This is because the larger signal-to-noise ratio accrued by a less stable system magnifies the prediction error over the entire evaluation horizon.
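The signal growth underlying this point can be seen numerically. The sketch below (our own illustration) uses the scalar random walk x_t = x_{t−1} + w_t, the simplest marginally unstable system, for which the state variance grows linearly in t; a prediction-risk weighting over a long evaluation horizon therefore puts increasing weight on later steps.

```python
import numpy as np

# For a marginally unstable system (rho(A) = 1; here the scalar case A = 1),
# the state covariance grows with time: Var(x_t) = t for the random walk.
# Larger states mean a larger signal-to-noise ratio late in the horizon.
rng = np.random.default_rng(0)
m, T = 2000, 50
w = rng.standard_normal((m, T))
x = np.cumsum(w, axis=1)      # x_t = x_{t-1} + w_t, with x_0 = 0
emp_var = x.var(axis=0)       # empirical Var(x_t); should grow roughly like t
```

With m = 2000 sampled trajectories, `emp_var[t-1]` concentrates near t, confirming the linear-in-time covariance growth that a risk over an extended evaluation horizon must contend with.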
Although the observation model that we consider is mentioned in Simchowitz et al. (2018), the general setup with matching upper and lower bounds is, to the best of our knowledge, a new contribution.

A complementary line of work studies the problem of online sequence prediction in a no-regret framework, where the baseline expert class comprises trajectories generated by a linear dynamical system (Hazan et al., 2017, 2018; Ghai et al., 2020). These results also allow for marginally unstable dynamics but are otherwise not directly comparable. Other efforts look beyond linear systems to identifying various non-linear classes, such as exponentially stable non-linear systems (Sattar and Oymak, 2020; Foster et al., 2020) and marginally unstable non-linear systems (Jain et al., 2021). These results again learn from a single trajectory. We believe that elements of our analysis can be ported over to offer many-trajectory bounds for these particular classes of non-linear systems.

## 3. Problem formulation

Notation. The real eigenvalues of a Hermitian matrix M ∈ C^{k×k} are λ_max(M) = λ_1(M) ≥ · · · ≥ λ_k(M) = λ_min(M). For a square matrix M ∈ C^{k×k}, M^* denotes its conjugate transpose, and ρ(M) denotes its spectral radius: ρ(M) = max{|λ| : λ is an eigenvalue of M}. The space of n × n real-valued symmetric positive semidefinite (resp. positive definite) matrices is denoted Sym^n_{⪰0} (resp. Sym^n_{≻0}). The non-negative (resp. positive) orthant in R^n is denoted R^n_{≥0} (resp. R^n_{>0}), and S^{n−1} denotes the unit sphere in R^n. Finally, the set of positive integers is denoted N_+.

### 3.1 Linear regression from sequences

Regression model. A covariate sequence is an indexed set {x_t}_{t≥1} ⊂ R^n. Any distribution P_x over covariate sequences is assumed to have bounded second moments, i.e., E[x_t x_t^T] exists and is finite for all t ≥ 1. Also for such a distribution P_x, let P_ξ[P_x] be a distribution over observation noise sequences {ξ_t}_{t≥1} ⊂ R^p.
Denoting by {F_t}_{t≥0} the filtration with F_t = σ({x_k}_{k=1}^{t+1}, {ξ_k}_{k=1}^{t}), we assume that {ξ_t}_{t≥1} is a σ_ξ-sub-Gaussian martingale difference sequence (MDS), i.e., for t ≥ 1:

    E[⟨v, ξ_t⟩ | F_{t−1}] = 0,    E[exp(λ⟨v, ξ_t⟩) | F_{t−1}] ≤ exp(λ² ‖v‖₂² σ_ξ² / 2) a.s.,    for all λ ∈ R, v ∈ R^p.

Given a ground truth model W_⋆ ∈ R^{p×n}, define the observations (a.k.a. responses or labels):

    y_t = W_⋆ x_t + ξ_t,    t ≥ 1.    (3.1)

Denote by P^{W_⋆}_{x,y}[P_x, P_ξ] the joint distribution over covariates and observations {(x_t, y_t)}_{t≥1}.

Regression task. Fix a ground truth model W_⋆ ∈ R^{p×n}, a covariate distribution P_x, an observation noise model P_ξ, a training horizon T, and a test horizon T′. Draw m independent sequences {(x_t^{(i)}, y_t^{(i)})}_{i∈[m], t≥1} from P^{W_⋆}_{x,y}[P_x, P_ξ], and call their length-T prefixes {(x_t^{(i)}, y_t^{(i)})}_{i=1,t=1}^{m,T} the training examples. From these examples, the regression task is to find a hypothesis f̂_{m,T} : R^n → R^p that matches ground truth predictions f_{W_⋆}(x) := W_⋆ x in expectation over unseen trajectories of length T′. Specifically, the excess risk of a hypothesis f̂ is:

    L(f̂; T′, P_x) := E_{P_x} [ (1/T′) Σ_{t=1}^{T′} ‖f̂(x_t) − f_{W_⋆}(x_t)‖₂² ].    (3.2)

We say that the evaluation horizon T′ is strict if T′ ≤ T and extended if T′ > T. When the hypothesis class is linear, meaning the hypotheses f̂ are of the form f̂(x) = Ŵx with Ŵ ∈ R^{p×n}, the risk expression (3.2) simplifies as follows. For a positive definite matrix Σ ∈ R^{n×n}, define the weighted square norm ‖M‖²_Σ := tr(M Σ M^T) for M ∈ R^{p×n}. Denoting, for t ≥ 1:

    Σ_t(P_x) := E_{P_x}[x_t x_t^T],    Γ_t(P_x) := (1/t) Σ_{k=1}^{t} Σ_k(P_x),    (3.3)

we overload notation and write:

    L(Ŵ; T′, P_x) = ‖Ŵ − W_⋆‖²_{Γ_{T′}(P_x)}.    (3.4)

The risk (3.2), being a notion of error averaged over time steps, relates to that of Ziemann et al. (2022) in the study of learning dynamics (the difference lies in whether the error norm is squared). By allowing unequal training and test horizons T ≠ T′, we cover two related scenarios at once: system identification in linear dynamical systems (when T′ = 1) and predicting past the end of a sequence (when T′ > T).
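As a sanity check on definitions (3.3) and (3.4), here is a short numerical sketch. It is our own illustration: the closed-form covariances below assume the random-walk system with A = B = I_n, for which Σ_t = t·I_n; the function and variable names are ours.

```python
import numpy as np

def trajectory_risk(W_hat, W_star, Sigmas):
    """Excess risk (3.4): ||W_hat - W_star||^2_{Gamma_{T'}}, where Gamma_{T'}
    is the average (3.3) of the per-step covariances Sigma_1, ..., Sigma_{T'}."""
    Gamma = np.mean(Sigmas, axis=0)   # (1/T') * sum_t Sigma_t
    D = W_hat - W_star
    return np.trace(D @ Gamma @ D.T)  # tr(D Gamma D^T) = ||D||^2_Gamma

# For the random walk x_t = x_{t-1} + w_t, Sigma_t = t * I_n.
n, p, T_eval = 3, 2, 10
Sigmas = [t * np.eye(n) for t in range(1, T_eval + 1)]
W_star = np.zeros((p, n))
W_hat = np.full((p, n), 0.1)
risk = trajectory_risk(W_hat, W_star, Sigmas)
# Here Gamma_{T'} = ((T' + 1) / 2) * I_n, so risk = 0.01 * p * n * (T' + 1) / 2.
```

Note how the growing covariances of a marginally unstable system inflate the weighting matrix Γ_{T′}, so the same parameter error Ŵ − W_⋆ incurs larger risk over longer evaluation horizons.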
For the latter, the risk definition (3.2) is closely related to a commonly studied notion of final-step generalization (see e.g. Kuznetsov and Mohri, 2017, Eq. 5; McDonald et al., 2017, Def. 10) that measures the performance of a hypothesis at T′ − T time steps beyond the training horizon:

    L_end(f̂; T′, P_x) := E_{P_x}[ ‖f̂(x_{T′}) − f_{W_⋆}(x_{T′})‖₂² ].

Linear hypotheses enjoy the identity L_end(Ŵ; T′, P_x) = ‖Ŵ − W_⋆‖²_{Σ_{T′}(P_x)}. In turn, when the per-step covariances Σ_t(P_x) are non-decreasing (in the positive semidefinite order):

    L_end(Ŵ; T′, P_x) ≥ L(Ŵ; T′, P_x) ≥ (1/2) · L_end(Ŵ; ⌈T′/2⌉, P_x).

In other words, provided the scale of the covariances Σ_t(P_x) does not grow substantially over time t, our risk definition L is comparable to the final-step risk L_end.

Minimax risk. To compare the hardness of learning across problem classes (i.e., families of covariate distributions P_x), we measure the minimax rate of the risk L, i.e., the behavior of the best estimator's worst-case risk over valid problem instances, as a function of the amount of training data m, T and other problem parameters such as n, p, σ_ξ, and T′. Recall that P^{W_⋆}_{x,y} denotes the distribution over labeled trajectories {(x_t, y_t)}_{t≥1}. For a collection of covariate sequence distributions 𝒫_x, the minimax risk over problem instances consistent with 𝒫_x is:

    R(m, T, T′; 𝒫_x) := inf_{Alg} sup_{P_x ∈ 𝒫_x} sup_{W_⋆, P_ξ} E_{⊗_{i=1}^{m} P^{W_⋆}_{x,y}[P_x, P_ξ]} [ L( Alg({(x_t^{(i)}, y_t^{(i)})}_{i=1,t=1}^{m,T}); T′, P_x ) ],    (3.5)

where the infimum ranges over estimators Alg : (R^n × R^p)^{mT} → (R^n → R^p) that map training samples to hypotheses, the supremum over W_⋆ is over all p × n ground truth models, and the supremum over P_ξ is over all σ_ξ-sub-Gaussian MDS processes determining the observation noise.

The ordinary least-squares estimator. Much like its classical role in iid learning, the ordinary least-squares (OLS) estimator will be key to bounding the minimax risk (3.5) from above. We define the OLS estimator to be the linear hypothesis Ŵ_{m,T} ∈ R^{p×n} that satisfies:

    Ŵ_{m,T} ∈ argmin_{W ∈ R^{p×n}} Σ_{i=1}^{m} Σ_{t=1}^{T} ‖W x_t^{(i)} − y_t^{(i)}‖₂².    (3.6)

For i = 1, …, m, let X^{(i)}_{m,T} ∈ R^{T×n} be the data matrix for the i-th trajectory (i.e., the t-th row of X^{(i)}_{m,T} is x_t^{(i)} for t = 1, …, T). Define Y^{(i)}_{m,T} ∈ R^{T×p} and Ξ^{(i)}_{m,T} ∈ R^{T×p} analogously. Put X_{m,T} ∈ R^{mT×n} as the vertical concatenation of X^{(1)}_{m,T}, …, X^{(m)}_{m,T}, and similarly for Y_{m,T} ∈ R^{mT×p} and Ξ_{m,T} ∈ R^{mT×p}. Whenever X_{m,T} has full column rank, we can write Ŵ_{m,T} as:

    Ŵ_{m,T} = Y^T_{m,T} X_{m,T} (X^T_{m,T} X_{m,T})^{−1}.    (3.7)

Figure 2: Formulations of regression from sequential data, illustrated as graphical models. (a) The Seq-LS problem (Problem 3.1): covariates {x_t} are drawn from a sequence distribution, and noisy observations {y_t} are drawn conditioned on these covariates. (b) The corresponding baseline Ind-Seq-LS problem (Problem 3.2): independent covariate-observation pairs {(x_t, y_t)} are drawn, each from the marginal distribution of the corresponding t-th step in the Seq-LS problem. Specifically, these graphs depict a simplified special case of our data model, in which the observations {y_t} across time are independent conditioned on the covariates {x_t}. In our general definitions (Problem 3.1 and Problem 3.2), the observations {y_t} can be conditionally interdependent, via a martingale difference sequence on the observation noise (Section 3.1).

OLS is not the only estimation method for least-squares problems. Often some regularization penalty, such as that of the ridge or LASSO estimator, is added to the least-squares loss (3.6), based on some structural understanding of the problem instance at hand. Studying the interplay between multiple trajectories and, say, norm-based risk bounds (Zhang, 2002) or sparse recovery (Candès et al., 2006), is an exciting direction for future work.

### 3.2 Problem classes

We formalize linear regression from sequential data generally as follows:

Problem 3.1 (Seq-LS).
Assume a covariate sequence distribution P_x in the linear regression model (3.1). Fix an evaluation horizon T′. On input m labeled trajectories of length T drawn from this model, in the form of examples {(x_t^{(i)}, y_t^{(i)})}_{i=1,t=1}^{m,T}, output a hypothesis f̂_{m,T} that minimizes the excess risk L(f̂_{m,T}; T′, P_x).

Our topmost goal is to study the effect of learning from sequentially dependent covariates in comparison with learning in the classical iid setup. Linear regression is well understood in the latter setting. Focusing on well-specified linear regression further simplifies our presentation, allowing us to isolate the effects of what interests us most: dependent covariates. Generalizing the supervision aspect of Seq-LS (say, to unrealizable and non-parametric regression, or to classification) is left to future work. We return to discuss this in Section 9. Separately, our assumption that the learner accesses m trajectories each with common length T is also intended for simplicity. Generalizing our results to varying trajectory lengths T_1, …, T_m, and even varying covariate sequence distributions P_x^{(1)}, …, P_x^{(m)}, is conceptually straightforward, but notationally more burdensome.

To study how dependent data affects learning, we need to establish an independent data baseline. The natural comparison point for Seq-LS is to remove all correlations across time. Namely, instead of drawing covariates sequentially from the distribution P_x, consider learning separately from the marginals of P_x at each time step. The resulting decorrelated distribution generates independent examples, but typically not iid ones. We formalize linear regression from independent data generally as follows:

Problem 3.2 (Ind-Seq-LS). Fix a sequence of distributions {P_{x,t}}_{t≥1}. Consider their product over time, ⊗_{t≥1} P_{x,t}, as the covariate sequence distribution in the linear regression model (3.1). Fix an evaluation horizon T′.
On input m labeled trajectories of length T drawn from this model, in the form of examples {(x(i) t , y(i) t )}m,T i=1,t=1, output a hypothesis ˆfm,T that minimizes L( ˆfm,T ; T , t 1Px,t). This Ind-Seq-LS problem generalizes the canonical iid learning setup slightly. Existing theory can still characterize its minimax risk, provided the covariances of the distributions {Px,t} are roughly equal in scale across time t. However, this equal-scale requirement rules out the marginals of interesting applications, such as dynamical systems that are not stable or ergodic. We therefore extend, in later sections, characterizations of the regression risk to handle covariances that can scale polynomially across time instead. 3.3 Problem separations To set up a baseline for a Seq-LS problem, we will specifically instantiate Ind-Seq-LS over its marginals. Namely, for a sequence distribution Px over {xt}t 1, let µt[Px] be the marginal distribution of xt at time t 1, and consider Ind-Seq-LS with covariates drawn from the sequence {µt[Px]}t 1. Figure 2 illustrates such a Seq-LS problem and the corresponding Ind-Seq-LS instance over its marginals. This decorrelated baseline is a hypothetical benchmark: in a practical context, collecting independent marginal data, when nature only supplies its dependent form, can be expensive or infeasible. However, we can expect that having such data on hand would make learning easier, with risk rates that resemble iid learning. In what follows, we outline scenarios where a sequential learning problem and its decorrelated baseline coincide in difficulty, and others in which they diverge. We then outline the possible assumptions that would allow us to always relate the two. The iid special case. When T = T = 1, the example trajectories {x(i) 1 }m i=1 are trivially a set of iid covariates. The problems Seq-LS and Ind-Seq-LS thus coincide, and reduce to the well-specified random design linear regression problem over m iid covariates. 
It is well-known that under iid data, and mild regularity conditions, the minimax risk scales as σ2 ξpn/m, and is achieved by the OLS estimator (Hsu et al., 2014; Mourtada, 2022; Wainwright, 2019). Extending the horizon. Considering nontrivial horizons T = T > 1, both Seq-LS and its corresponding Ind-Seq-LS baseline become more involved, but for different reasons. The Ind-Seq-LS problem, as we show in Section 6, is not generally learnable with polynomially many examples. Specifically, the minimax rate scales exponentially in the dimension n provided the trajectory count m is constant. To address this, we will require that the covariances of its constituent distributions {Px,t} grow at most polynomially with Tu, Frostig, and Soltanolkotabi time t. Under this constraint, the problem s minimax risk again scales as the iid-like rate σ2 ξpn/(m T) times, at most, a factor determined exponentially by the covariance growth. The Seq-LS problem inherits the same growth limitation. Even then, it is still not generally learnable without further assumptions on the dependence structure of covariates: the minimax risk is otherwise bounded away from zero as the horizon T tends to infinity, provided the trajectory count m is constant. To realize this, consider x1 N(0, In) and xt = xt 1 for t 2, a sequence of identical covariates whose marginals are all independent Gaussians. The resulting dataset presents an underdetermined regression problem if m < n. In essence, its covariates lack sufficient excitation across time. To rein Seq-LS back in to the realm of learnability, one must: (a) make further modeling assumptions about covariates, or (b) introduce excitation via independent resets. For (a), as detailed in Section 2, the most common modeling assumption considers sequences that mix rapidly to a stationary distribution. Another avenue recently active in the literature, and sometimes overlapping with the mixing approach considers sequences generated by linear dynamical systems. 
Among these two, mixing implies risk bounds that tend to zero with T, but these hold in the worst case only after a burn-in time that scales proportionally to the mixing time (Bresler et al., 2020). This prevents a characterization of minimax risk uniformly across the full range of problem instances P_x that mix, unless one caps the mixing time at a fixed constant. Narrowing instead to LDS models in the sequel, we manage to succinctly carve out a basic problem family, with unbounded mixing time, and to characterize its minimax risk uniformly. One still pays a price for sequential dependency, as this minimax risk turns out to be larger than its Ind-Seq-LS counterpart by a factor of the dimension n. Turning in addition to (b), by introducing (sufficiently many) resets, we can expand our data model substantially: we manage to lift most of our LDS assumptions and extend to other dynamical systems. Remarkably, we even show that for any controllable LDS, including ones that are unstable and hence grow exponentially in time, having sufficiently many resets guarantees that the risk exhibits, once again, the iid-like behavior of σ_ξ² pn/(mT), up to mere constants.

3.4 Linear dynamical trajectories

Fix a dynamics matrix A ∈ R^{n×n} and a control matrix B ∈ R^{n×d}. Consider the n-dimensional trajectory {x_t}_{t≥1} defined by the linear dynamical system:

x_t = A x_{t-1} + B w_t, where w_t ~ N(0, I_d), for t ≥ 1, (3.8)

taking x_0 = 0 by convention. We assume that the noise process {w_t}_{t≥1} is independent across time, i.e., that w_t ⊥ w_{t'} whenever t ≠ t'. Overloading notation, let the matrix Σ_t(A, B) := ∑_{k=0}^{t-1} A^k B Bᵀ (A^k)ᵀ denote the covariance of x_t, and let the matrix Γ_t(A, B) := (1/t) ∑_{k=1}^{t} Σ_k(A, B) denote the average covariance. Denote by P^{A,B}_x the distribution over the trajectory {x_t}_{t≥1}, and let {x^{(i)}_t}_{t≥1} for i ≥ 1 denote independent draws from P^{A,B}_x. When B = I_n, we use the respective shorthand notation Σ_t(A), Γ_t(A), and P^A_x.
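The covariances Σ_t(A, B) and Γ_t(A, B) defined above are easy to compute numerically via the equivalent recursion Σ_t = A Σ_{t-1} Aᵀ + B Bᵀ. A minimal sketch (ours; the helper names are hypothetical):

```python
import numpy as np

def state_covariances(A, B, T):
    """Sigma_t(A, B) = sum_{k=0}^{t-1} A^k B B^T (A^k)^T for t = 1, ..., T,
    computed via the equivalent recursion Sigma_t = A Sigma_{t-1} A^T + B B^T
    (with Sigma_0 = 0, matching the convention x_0 = 0)."""
    Sigma = np.zeros_like(A, dtype=float)
    out = []
    for _ in range(T):
        Sigma = A @ Sigma @ A.T + B @ B.T
        out.append(Sigma.copy())
    return out

def average_covariance(A, B, T):
    """Gamma_T(A, B) = (1/T) * sum_{t=1}^T Sigma_t(A, B)."""
    return sum(state_covariances(A, B, T)) / T

# Example: a marginally unstable 2-dimensional system (a Jordan block at 1).
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.eye(2)
Sigmas = state_covariances(A, B, 3)  # Sigma_1, Sigma_2, Sigma_3
```

For this marginally unstable example the entries of Σ_t grow polynomially in t, the regime that the polynomial-growth assumptions in later sections are designed to accommodate.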
Learning from many trajectories Modeling regression covariates as linear dynamical trajectories gives us the LDS-LS problem, a specialization of Seq-LS (Problem 3.1): Problem 3.3 (LDS-LS). Assume a dynamics matrix A Rn n, a control matrix B Rn d, and a corresponding linear dynamical covariate distribution PA,B x in the linear regression model (3.1). Fix an evaluation horizon T . On input m labeled trajectories of length T, drawn from this model, in the form of examples {(x(i) t , y(i) t )}m,T i=1,t=1, output a hypothesis ˆfm,T that minimizes L( ˆfm,T ; T , PA,B x ). Let PA,B x,t be the marginal distribution of xt under PA,B x at each t 1. The natural decorrelated baseline for LDS-LS is a corresponding specialization of Ind-Seq-LS (Problem 3.2) to LDS trajectories: Problem 3.4 (Ind-LDS-LS). Assume a dynamics matrix A Rn n, a control matrix B Rn d, and a corresponding trajectory distribution PA,B x . Consider covariates drawn independently from its marginals, i.e., assume the linear regression model (3.1) under the covariate sequence distribution t 1PA,B x,t . Fix an evaluation horizon T . On input m labeled trajectories of length T, drawn from this model, in the form of examples {(x(i) t , y(i) t )}m,T i=1,t=1, output a hypothesis ˆfm,T that minimizes L( ˆfm,T ; T , t 1PA,B x,t ). Learning dynamical systems. LDS-LS generalizes linear system identification, the problem of recovering the dynamics A from data. The reduction follows by setting W = A and ξ(i) t = Bw(i) t+1, so that y(i) t = x(i) t+1. Note that when B has full row rank, the squared parameter error in the weighted BBT norm BBT is simply the risk L( ˆA; T , PA,B x ) when T = 1. Recent related work typically assumes that B indeed has full row rank, but in later sections we touch on the more general case where this is not required, so long as the pair (A, B) is controllable. Bounds in operator norm are also easily obtainable from our proof techniques. 
However, our lower bounds will not inform the system identification problem specifically; our hardness results rely on decoupling W from A and ξ^{(i)}_t from w^{(i)}_{t+1}, whereas this reduction naturally ties them.

4. Trajectory small-ball definition and examples

We establish risk upper bounds by studying the behavior of the ordinary least-squares estimator. The key technical definition that drives the analysis is a small-ball condition on covariate sequences:

Definition 4.1 (Trajectory small-ball (Traj SB)). Fix a trajectory length T ∈ N₊, a parameter k ∈ {1, ..., T}, positive definite matrices {Ψ_j}_{j=1}^{⌊T/k⌋} ⊆ Sym^n_{>0}, and constants c_sb ≥ 1, α ∈ (0, 1]. The distribution P_x satisfies the (T, k, {Ψ_j}_{j=1}^{⌊T/k⌋}, c_sb, α)-trajectory-small-ball (Traj SB) condition if:

(a) (1/⌊T/k⌋) ∑_{j=1}^{⌊T/k⌋} Ψ_j ⪯ Γ_T(P_x),

(b) {x_t}_{t≥1} is adapted to a filtration {F_t}_{t≥1}, and

(c) for all v ∈ R^n \ {0}, j ∈ {1, ..., ⌊T/k⌋}, and ε > 0:

P{ (1/k) ∑_{t=(j-1)k+1}^{jk} ⟨v, x_t⟩² ≤ ε vᵀΨ_j v | F_{(j-1)k} } ≤ (c_sb ε)^α a.s. (4.1)

Above, F_0 is understood to be the minimal σ-algebra. Additionally, the distribution P_x satisfies the (T, k, Ψ, c_sb, α)-Traj SB condition if it satisfies (T, k, {Ψ_j}_{j=1}^{⌊T/k⌋}, c_sb, α)-Traj SB with Ψ_j = Ψ. Finally, we call the parameter k the excitation window.

We will soon show in Lemma 5.1 how the various quantities parameterizing the trajectory small-ball condition closely govern the final risk bound of the ordinary least-squares estimator. For the remainder of this section, however, we focus on developing intuition for Definition 4.1 by working through diverse examples. In the definition, we typically consider the matrices Ψ_j to be the sharpest almost-sure lower bound that we can specify (in the Loewner order) on the quantity E[(1/k) ∑_{t=(j-1)k+1}^{jk} x_t x_tᵀ | F_{(j-1)k}]. Section 4.1 lists examples of covariate sequence distributions P_x that satisfy the Traj SB condition. Definition 4.1 draws inspiration from the block martingale small-ball condition from Simchowitz et al. (2018, Definition 2.1).
There are, however, two main differences: (a) we consider the small-ball probability of the entire block 1 k Pjk t=(j 1)k+1 v, xt 2 at once, instead of the average of small-ball probabilities: t=(j 1)k+1 P n v, xt 2 ε v TΨjv | F(j 1)k o , (4.2) and (b) equation (4.1) is required to hold at all scales ε > 0, instead of at a single resolution. We need the first modification (a) to prove optimal rates under many trajectories without assuming stability or ergodicity; we expand on this point in Section 4.1. Furthermore, condition (4.1) is implied by a bound on the average of small-ball probabilities (4.2) (cf. Proposition 4.2), rendering it a more general condition. We need the second modification (b) in order to bound the expected value of the OLS risk. The need to modify (b) in order to bound the expected OLS risk is present even in the iid setting, as discussed in Mourtada (2022, Remark 4). It is this modification that prevents Definition 4.1 from fully subsuming Simchowitz et al. (2018, Definition 2.1). The following remark discusses a small change to Definition 4.1 that addresses this issue, by exchanging expected OLS risk bounds to high probability bounds: Remark 4.1. In Appendix B.7, we consider the following modification to Definition 4.1, where we instead suppose that (4.1) holds for some fixed ε (such that the inequality s right-hand side is strictly less than one), rather than for all ε; we refer to this modification as weak trajectory small-ball (Definition B.1). As described above, a consequence of the weak trajectory small-ball condition is that the main OLS risk bounds now hold with high probability (i.e., polylogarithmic in 1/δ) rather than in expectation (Lemma B.23).2 A key upshot (Proposition B.24), however, is that this change allows for an ergodic covariate 2. Such a high probability bound does not, in turn, imply a bound on the expected risk via integration over the tail. 
The reason is that the high probability bound (Lemma B.23) requires that the number of data points mT grow proportionally to log(1/δ), where δ is the failure probability. If one attempts to bound E[‖Ŵ_{m,T} − W‖²_Γ] = ∫₀^∞ P(‖Ŵ_{m,T} − W‖²_Γ ≥ t) dt using Lemma B.23 to control the tail probability on the right-hand side, then for any fixed m and T there exists a scale t₀ (corresponding to setting δ sufficiently small) such that, for any t ≥ t₀, the bound on P(‖Ŵ_{m,T} − W‖²_Γ ≥ t) from Lemma B.23 is no longer valid. The same issue occurs in, e.g., Simchowitz et al. (2018, Theorem 2.4), and for the same reason: the need for a small-ball assumption that holds at all resolutions, as underscored in Mourtada (2022, Remark 4).

sequence with φ-mixing time bounded by τ_mix to be considered (weak) trajectory small-ball (with excitation window k ≳ τ_mix), provided the stationary distribution μ satisfies a standard (weak) small-ball condition (Mendelson, 2015; Koltchinskii and Mendelson, 2015; Oliveira, 2016):

sup_{v ∈ S^{n-1}} P_μ{ ⟨v, x⟩² ≤ ε E_μ[⟨v, x⟩²] } < 1 for some ε > 0.

This in turn yields upper bounds for Seq-LS in the few trajectories (m ≲ n) regime of the following form: if mT ≥ Ω(τ_mix n), then L(Ŵ_{m,T}; T, P_x) ≤ O(σ_ξ² pn/(mT)) with high probability. This statement generalizes the risk bound for a single ergodic trajectory from Bresler et al. (2020) to the ordinary least-squares estimator (3.6). We can interpret the condition on mT as a burn-in time requirement. Meanwhile, at least in the single-trajectory (m = 1) setting, Bresler et al. (2020, Theorem 1) tells us that such a burn-in assumption (T ≳ τ_mix n) is necessary for a non-trivial risk guarantee.

4.1 Examples of trajectory small-ball distributions

We now turn to specific examples of distributions P_x which satisfy the trajectory small-ball condition. First is the example introduced in Section 3.3, where x_1 is drawn from a multivariate Gaussian and subsequently copied as x_t = x_{t-1} for all t ≥ 2:

Example 4.1 (Copies of a Gaussian draw). Let Σ ∈ Sym^n_{>0}, and let P_x denote the process x_1 ~ N(0, Σ) and x_t = x_{t-1} for t ≥ 2. Fix any T ∈ N₊. Then P_x satisfies the (T, T, Σ, e, 1/2)-Traj SB condition.

Note that this process only satisfies the trajectory small-ball condition with excitation window k = T. In other words, the conditional distribution x_{t+k} | x_t for k ≥ 1 (a Dirac distribution on x_t) contains no excitation as needed for learning. This example can actually be generalized to arbitrary Gaussian processes indexed by time:

Example 4.2 (Gaussian processes). Let P_x be a Gaussian process indexed by time, i.e., for every finite index set I ⊆ N₊, the collection of random variables (x_t)_{t∈I} is jointly Gaussian. Let T_nd := inf{t ∈ N₊ | det(E[x_t x_tᵀ]) ≠ 0}, and suppose T_nd is finite. Fix a T ∈ N₊ satisfying T ≥ T_nd. Then P_x satisfies the (T, T, Γ_T(P_x), 2e, 1/2)-Traj SB condition.

This example illustrates the insufficiency of the average small-ball probabilities condition (4.2). Suppose that x_1, ..., x_T is a jointly Gaussian process. Then Example 4.2 states (cf. Equation (4.1)) that for all v ≠ 0, ε > 0:

P{ (1/T) ∑_{t=1}^{T} ⟨v, x_t⟩² ≤ ε vᵀΓ_T(P_x)v } ≤ (2eε)^{1/2}.

If instead we use the average small-ball condition (4.2), the corresponding bound would be, for all v ≠ 0, ε > 0:

(1/T) ∑_{t=1}^{T} P{ ⟨v, x_t⟩² ≤ ε vᵀΓ_T(P_x)v } ≤ ( 2eε / min_{t∈{1,...,T}} λ_min(Γ_T^{-1}(P_x)Σ_t) )^{1/2}.

The extra 1/min_{t∈{1,...,T}} λ_min(Γ_T^{-1}(P_x)Σ_t) factor would then enter the resulting rates for least-squares regression, in turn requiring assumptions that limit the growth of Γ_T(P_x) relative to Σ_t. Fortunately, the trajectory small-ball condition (4.1) avoids these issues. Our next example involves independent, but not identically distributed, covariates:

Example 4.3 (Independent Gaussians). Let {Σ_t}_{t≥1} ⊆ Sym^n_{>0}, and let P_x = ⊗_{t≥1} N(0, Σ_t). Fix a T ∈ N₊. Then P_x satisfies the (T, 1, {Σ_t}_{t=1}^{T}, e, 1/2)-Traj SB condition.
Example 4.3 allows us to select k = 1, reflecting the independence of the covariates across time. We can also craft an example around a process that does not mix, but that still exhibits an excitation window of k = 2: Example 4.4 (Alternating halfspaces). Suppose that n 4 is even, and let u1, . . . , un be a fixed orthonormal basis of Rn. Put U0 = span(u1, . . . , un/2) and U1 = span(un/2+1, . . . , un). Let i1 Bern(1 2), it+1 = (it + 1) mod 2 for t N+, and let Px denote the process with conditional distribution xt | it uniform over the spherical measure on Uit Sn 1. For any T 2, the process Px satisfies the (T, 2, In/(2n), e, 1 2)-Traj SB condition. To see that the covariate distribution {xt} does not mix, observe that the marginal distribution for all t is uniform on Sn 1, whereas the conditional distribution xt+k | xt for any k N+ is either uniform on U0 Sn 1 or uniform on U1 Sn 1. Although it does not mix at all, the trajectory supplies ample excitation for learning in any mere two steps. Even for a process that does mix, it may exhibit an excitation window far smaller than its mixing time. The following sets up such an example, where again sufficient excitation is provided with k = 2 steps: Example 4.5 (Normal subspaces). Suppose that n 3. Let u1, . . . , un be a fixed orthonormal basis in Rn, and let U i := span({uj}j =i) for i {1, . . . , n}. Consider the Markov chain {it}t 1 defined by i1 Unif({1, . . . , n}), and it+1 | it Unif({1, . . . , n} \ {it}). Let Px denote the process with conditional distribution xt | it uniform over the spherical measure on U it Sn 1. For any T 2, the process Px satisfies the (T, 2, In/(4n 4), e, 1 2)-Traj SB condition. In this example, a straightforward computation (detailed in Proposition B.11) shows that the mixing time τmix(ε) of the Markov chain {it}t 1 scales as logn(1/ε).3 In most analyses which rely on mixing time arguments, one requires that the mixing time resolution ε tends 3. 
For concreteness, given a discrete-time Markov chain over a finite state-space S with transition matrix P and stationary distribution π, we define the mixing time as: τ_mix(ε) := inf{k ∈ N | sup_{μ ∈ P(S)} ‖μP^k − π‖_tv ≤ ε}. Here, P(S) denotes the set of all probability distributions over S, and ‖·‖_tv denotes the total variation norm over distributions.

to zero as either the amount of data and/or probability of success increases; as a concrete example, Duchi et al. (2012, Eq. 3.2) suggests setting ε = 1/√T, where T is the number of samples drawn from the underlying distribution. On the other hand, the trajectory small-ball condition in Example 4.5 holds with a short excitation window of length k = 2, independently of T.

Next we consider linear dynamical systems. As setup, we first define the notion of controllability for a pair of dynamics matrices (A, B):

Definition 4.2 (Controllability). Let (A, B) be a pair of matrices with A ∈ R^{n×n} and B ∈ R^{n×d}. For k ∈ {1, ..., n}, we say that (A, B) is k-step controllable if the matrix:

[B, AB, A²B, ..., A^{k-1}B] ∈ R^{n×kd}

has full row rank. The classical definition of controllability in linear systems (cf. Rugh, 1996, Chapter 25) is equivalent to n-step controllability. Definition 4.2 allows the system to be controllable in fewer than n steps. Also note that k is restricted to {1, ..., n}, since if a system is not n-step controllable, it will not be n′-step controllable for any n′ > n (by the Cayley-Hamilton theorem). A few special cases of interest are as follows. If B has rank n, then (A, B) is trivially one-step controllable for any A. On the other hand, if (A, B) is in canonical controllable form (i.e., A is the companion matrix associated with the polynomial p(z) = a_0 + a_1 z + ... + a_{n-1} z^{n-1} + z^n and B is the n-th standard basis vector), then (A, B) is n-step controllable. The latter corresponds directly to the state-space representation of autoregressive processes of order n, e.g. AR(n).
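Definition 4.2 can be checked numerically by forming the controllability matrix and testing its rank. A minimal sketch (ours; the function name is hypothetical):

```python
import numpy as np

def is_k_step_controllable(A, B, k):
    """Definition 4.2: (A, B) is k-step controllable iff the n x (k*d)
    controllability matrix [B, AB, ..., A^{k-1} B] has full row rank n."""
    n = A.shape[0]
    blocks, M = [], B.astype(float)
    for _ in range(k):
        blocks.append(M)
        M = A @ M
    return np.linalg.matrix_rank(np.hstack(blocks)) == n

# A companion-form-like pair (shift dynamics, with the input entering only
# the last coordinate) is n-step controllable but not (n-1)-step controllable.
A = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0]])
B = np.array([[0.0], [0.0], [1.0]])
```

Here the input reaches one additional coordinate per step, so full excitation takes exactly n = 3 steps; by contrast, any (A, I_n) pair is one-step controllable.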
Example 4.6 (Linear dynamical systems). Let (A, B) with A ∈ R^{n×n} and B ∈ R^{n×d} be k_c-step controllable (Definition 4.2). Let P^{A,B}_x be the linear dynamical system defined in (3.8). Fix any T, k ∈ N₊ satisfying T ≥ k ≥ k_c. Then, P^{A,B}_x satisfies the (T, k, Γ_k(A, B), e, 1/2)-Traj SB condition.

We next consider LDS controlled by linear feedback policies. Let K ∈ R^{d×n} parameterize a linear feedback policy u_t = K x_t, and consider the closed-loop linear dynamical system described by the recurrence:

x_t = A_c x_{t-1}, where A_c := A + BK. (4.3)

When x_1 ~ N(0, Σ), we have the following small-ball condition with excitation window k = T.

Example 4.7 (Closed-loop linear dynamical systems). Let P^{A_c}_x denote the closed-loop linear dynamical system defined in (4.3). Suppose that the initial condition x_1 ~ N(0, Σ), where Σ is positive definite. For any T ∈ N₊, P^{A_c}_x satisfies the (T, T, Σ_T(A_c, Σ^{1/2})/T, e, 1/2)-Traj SB condition.

Example 4.7 follows directly from the Gaussian process example (Example 4.2). Proving small-ball conditions under an excitation window k < T for closed-loop linear dynamical systems is not possible under our current framework, because there is only one source of randomness within a trajectory: at its beginning. Hence the conditional distributions in (4.1) become Dirac measures. This is to be expected, as few-trajectory learning in (4.3) can be impossible without noise injection. For example, if A is rank-deficient, then the iterates x_2, ..., x_T all lie in the same lower-dimensional subspace of R^n, precluding learning.

In all of the examples so far, the time-t marginal distribution of covariates x_t has either been a multivariate Gaussian or a spherical measure. To underscore the generality of the small-ball method, we can create additional examples where this is not the case. In what follows, we consider Volterra series (Mathews and Sicuranza, 2000), which generalize the classical Taylor series to causal sequences.
Analogous to how polynomials can approximate continuous functions arbitrarily well on a compact set, Volterra series can approximate signals that depend continuously (and solely) on their history over a bounded set of inputs (cf. Rugh, 1981, Section 1.5). Example 4.8 (Degree-D Volterra series). Fix a D N+. Let {c(d,ℓ) i1,...,id}i1,...,id N for d {1, . . . , D} and ℓ {1, . . . , n} denote arbitrary rank-d arrays. Let {w(ℓ) t }t 0 be iid N(0, 1) random variables for ℓ {1, . . . , n}. Consider the process Px where for t 1, the ℓ-th coordinate of xt, denoted (xt)ℓ, is: i1,...,id=0 c(d,ℓ) i1,...,id d =1 w(ℓ) t id 1. (4.4) Let Tnd := inf{t N+ | det(Γt(Px)) = 0}, and suppose Tnd is finite. There is a constant c D > 0, depending only on D, such that for any T Tnd, the process Px satisfies the (T, T, ΓT (Px), c D, 1/(2D))-Traj SB condition. The main idea behind Example 4.8 is that, while xt is certainly not Gaussian, the quadratic form PT t=1 v, xt 2 is a degree at most 2D polynomial in {w(ℓ) t }T 1 t=0 . It will hence exhibit anti-concentration, according to a landmark result from Carbery and Wright (2001). The same result actually provides an immediate extension of this example as well as the previous examples to noise distributions with log-concave densities, such as Laplace or uniform noise. We next present a special case of the Volterra series, where we can choose the excitation window k in the small-ball definition strictly between the endpoints 1 and T. To set up, a few more definitions are needed: Definition 4.3. Fix an integer d N+. A rank-d array of coefficients {ci1,...,id}i1,...,id N is called: (a) symmetric if ci1,...,id = cπ(i1,...,id) for any permutation π of indices i1, . . . , id N, (b) traceless if ci,...,i = 0 for all i N, and (c) non-degenerate if there exists an knd N+ such that the following set is non-empty: {(i1, . . . , id) | ci1,...,id = 0, i1, . . . , id {0, . . . , knd 1}}. The smallest knd such that {ci1,...,id} is the non-degeneracy index. 
Learning from many trajectories Example 4.9 (Degree-2 Volterra series). Consider the following process Px. Let {c(ℓ) i,j }i,j 0 for ℓ {1, . . . , n} be symmetric, traceless, non-degenerate arrays (Definition 4.3). Let {w(ℓ) t }t 0 be iid N(0, 1) random variables for ℓ {1, . . . , n}. For t 1, the ℓ-th coordinate of xt, denoted (xt)ℓ, is: j=i c(ℓ) i,j w(ℓ) t i 1w(ℓ) t j 1. (4.5) Let knd N+ denote the smallest non-degeneracy index for all n arrays. There is a universal positive constant c such that for any T and k satisfying T k knd, Px satisfies the (T, k, Γk(Px), c, 1 4)-Traj SB condition. The assumptions pulled in from Definition 4.3 help simplify the construction of an almost sure lower bound for conditional covariances E[ 1 k Pjk t=(j 1)k+1 xtx T t | F(j 1)k], to establish that Example 4.9 satisfies the trajectory small-ball condition. We believe that generalizations to higher degree Volterra series with k strictly between 1 and T are possible by more involved calculations. Of course, many other examples are possible. To help in recognizing them, the following statement shows that condition (4.1) in the trajectory small-ball definition can be verified by separately establishing small-ball probabilities for the conditional distributions: Proposition 4.2 (Average small-ball implies trajectory small-ball). Fix T N+, k {1, . . . , T}, {Ψj} T/k j=1 Symn >0, and α, β (0, 1). Let Px be a covariate distribution, with {xt}t 1 adapted to a filtration {Ft}t 1. Suppose for all v Rn \{0} and j {1, . . . , T/k }: t=(j 1)k+1 Pxt Px n v, xt 2 α v TΨjv F(j 1)k o β a.s., (4.6) where F0 is the minimal σ-algebra. Then, for all v Rn \ {0}, j {1, . . . , T/k }, and ε (0, α) t=(j 1)k+1 v, xt 2 ε v TΨjv β 1 ε/α a.s. (4.7) An immediate corollary of Proposition 4.2 is the following: suppose that for all v Rn\{0}, j {1, . . . , T/k }, and ε > 0, t=(j 1)k+1 Pxt Px n v, xt 2 ε v TΨjv F(j 1)k o (csbε)α a.s. (4.8) Then, the (T, k, {Ψj} T/k j=1 , 21+1/αcsb, α)-Traj SB condition holds. 
Equation (4.8) can be easier to verify than (4.1), since the former allows one to reason about each conditional distribution individually, whereas the latter requires reasoning about the entire excitation window altogether. The following two sections present upper and lower bounds for learning from trajectories, involving various instances of the trajectory small-ball assumption where applicable. All main results are summarized in Table 1. Tu, Frostig, and Soltanolkotabi 5. Risk upper bounds The trajectory small-ball definition allows us to carve out conditions for learnability. A key quantity for what follows is the minimum eigenvalue of the ratio of two positive definite matrices: λ(A, B) := λmin(B 1/2AB 1/2), A, B Symn >0. (5.1) Our various upper bounds statements build on the following general lemma: Lemma 5.1 (General OLS upper bound). There are universal positive constants c0 and c1 such that the following holds. Suppose that Px satisfies the (T, k, {Ψj} T/k j=1 , csb, α)-Traj SB condition (Definition 4.1). Put S := T/k and ΓT := ΓT (Px). Fix any Γ Symn >0 satisfying 1 S PS j=1 Ψj Γ ΓT , and let µ({Ψj}S j=1, Γ) denote the geometric mean of the minimum eigenvalues {λ(Ψj, Γ)}S j=1, i.e., µ({Ψj}S j=1, Γ) := j=1 λ(Ψj, Γ) Suppose that: max{e, csb} αλ(Γ, ΓT )µ({Ψj}S j=1, Γ) Then, for any Γ Symn >0: E[ ˆWm,T W 2 Γ ] c1csbσ2 ξ pn m Tαλ(Γ, Γ )µ({Ψj}S j=1, Γ) log max{e, csb} αλ(Γ, ΓT )µ({Ψj}S j=1, Γ) Lemma 5.1 is a general statement that highlights the interplay between the various trajectory small-ball parameters, and how they directly influence the resulting OLS risk. To develop some intuition for the lemma, consider Γ = ΓT . We see that the closer the quantities λ(Γ, ΓT ) (0, 1] and µ({Ψj}S j=1, Γ) (0, 1] are to one, the sharper the OLS risk (5.4). 
To satisfy this straightforwardly, we can set Ψj = Γ = ΓT , in which case λ(Γ, ΓT ) = µ({Ψj}S j=1, Γ) = 1, and the resulting OLS risk (5.4) simplifies (up to constant factors) to the iid rate σ2 ξ pn/(m T), treating α, csb as constants. However, in light of the trajectory small-ball condition (4.1), this is only generally possible at either end of the following spectrum: when Px encodes iid covariates (cf. Example 4.3), in which case we can take k = 1, and our data requirement (5.3) becomes m T n; or in the many trajectories setting (cf. Example 4.2) by setting k = T, in which our data requirement becomes m n. Learning from many trajectories When Px falls within these two ends, the excitation window k needs to be selected carefully to balance the data requirement (5.3) with the OLS risk (5.4). The optimal selection of k is heavily influenced by the growth rate of the covariance proxies {Ψj}. As a general rule, as k T, the covariance proxies {Ψj} tend to ΓT , but the rate at which this occurs is heavily dependent on Px. Furthermore, when the covariance proxies {Ψj} and the bound Γ are not equal to ΓT , the quantities λ(Γ, ΓT ), µ({Ψj}S j=1, Γ) are driven away from one, which adds extra factors in the OLS risk (5.4) over the iid rate. All the upper bounds we prove in this section are derived by carefully selecting the excitation window k based on the process Px, and quantifying the resulting overhead. The proof of Lemma 5.1 blends ideas from the analysis of random design linear regression (Hsu et al., 2014; Oliveira, 2016; Mourtada, 2022) with techniques from linear system identification with full state observation (Simchowitz et al., 2018; Sarkar and Rakhlin, 2019; Faradonbeh et al., 2018; Dean et al., 2020). Note that Lemma 5.1 makes no explicit assumptions on the ergodicity of the process Px. The role of Px is instead succinctly captured by the trajectory small-ball condition, together with the minimum eigenvalue quantities that appear in the bound. 
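The quantity λ(A, B) defined in (5.1) can be computed directly; the sketch below (ours, with a hypothetical helper name) uses a symmetric square root of B:

```python
import numpy as np

def lam(A, B):
    """lambda(A, B) = lambda_min(B^{-1/2} A B^{-1/2}) for A, B positive
    definite, as in Eq. (5.1). Equals 1 when A = B, and lies in (0, 1]
    whenever A <= B in the Loewner order."""
    evals, evecs = np.linalg.eigh(B)
    B_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    # eigvalsh returns eigenvalues in ascending order; take the smallest.
    return float(np.linalg.eigvalsh(B_inv_sqrt @ A @ B_inv_sqrt)[0])

# Usage: lam(Gamma, Gamma_T) = 1 when Gamma = Gamma_T, and shrinks as
# Gamma moves away from Gamma_T in the Loewner order.
Gamma_T = np.array([[2.0, 0.5], [0.5, 1.0]])
```

This makes the intuition after Lemma 5.1 easy to probe numerically: scaling Γ down by a factor c scales λ(Γ, Γ_T) down by the same factor, loosening the bound (5.4) proportionally.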
The proof of Lemma 5.1 also yields, with some straightforward modifications, bounds on the risk that hold with high probability; we present bounds in expectation only for simplicity. Finally, if the squared norm ‖X‖²_M is defined to be λ_max(X M Xᵀ) instead of tr(X M Xᵀ), then (5.4) holds with the expression p + n replacing pn in the numerator. As long as the process P_x satisfies the trajectory small-ball condition with excitation window k = T, Lemma 5.1 (with Ψ_1 = Γ = Γ_T(P_x)) immediately yields the following result for learning from many trajectories in the Seq-LS problem:

Theorem 5.2 (Upper bound for Seq-LS, many trajectories). There are universal positive constants c_0 and c_1 such that the following holds. Suppose that P_x satisfies the trajectory small-ball condition (Definition 4.1) with parameters (T, T, Γ_T(P_x), c_sb, α). If:

m ≥ (c_0 n / α) log( max{e, c_sb} / α ),

then, for any Γ′ ∈ Sym^n_{>0}:

E[‖Ŵ_{m,T} − W‖²_{Γ′}] ≤ c_1 c_sb σ_ξ² · pn / (m T α λ(Γ_T(P_x), Γ′)) · log( max{e, c_sb} / α ). (5.5)

This result provides the upper bound for the summary statement Theorem 1.1. To interpret the bound (5.5), suppose that c_sb and α are universal constants. Then, the requirement on m simplifies to m ≳ n. Under any strict evaluation horizon T′ ≤ T, taking Γ′ = Γ_{T′}(P_x), the risk E[L(Ŵ_{m,T}; T′, P_x)] scales as σ_ξ² pn/(mT′). The lower bound for Theorem 1.1 follows from the fact that iid linear regression is a special case of Seq-LS. Meanwhile, to obtain guarantees for parameter recovery, consider taking Γ′ = I_n. Then Theorem 5.2 implies that the parameter error E[‖Ŵ_{m,T} − W‖²_F] scales as σ_ξ² pn/[m T λ_min(Γ_T(P_x))]. Note that operator norm bounds on parameters also hold, with the expression p + n replacing pn in the bound.
Problem     | Many? | Upper                       | Lower                    | Assumptions
Seq-LS      | Y     | pn/(mT)                     | ✓                        | Traj SB (a)
Seq-LS      | N     | pn/(mT)                     | e^{-T/(τ_mix n)} + pn/T  | Ergodicity of covariates (b)
LDS-LS      | Y     | pn/(mT)                     | ✓                        | k_c-step controllability (c)
LDS-LS      | N     | γ·pn²/(m²T)                 | ✓                        | Marginal stability, etc. (d)
Ind-Seq-LS  | N     | pn/(mT)                     | ✓                        | Small-ball, poly variance growth (e)
Ind-LDS-LS  | N     | γ·pn/(mT)                   | pn/(mT)                  | Diagonalizable (f)
LDS-Sys ID  | Y     | n²/(mT·λ_min(Γ_T))          | -                        | Same as LDS-LS (Y) (g)
LDS-Sys ID  | N     | γ·n²/(mT·λ_min(Γ_{mT/n}))   | n²/T²                    | Same as LDS-LS (N) (h)

Table 1: Summary of main results presented in Section 5 and Section 6. All upper/lower bounds shown suppress constant and polylogarithmic factors. The "Many?" column indicates whether the bounds apply in the many trajectories regime (Y), where m ≳ n, or the few trajectories regime (N), where m ≲ n. A checkmark (✓) in the Lower column indicates that the lower bound matches the upper bound, up to polylogarithmic factors. The LDS-Sys ID problem is classic linear system identification: recover the unknown dynamics matrix A from linear dynamical trajectories, with error measured in squared Frobenius norm (see Equation (3.8) and the discussion at the end of Section 3.4). Elaborations on assumptions: (a) Upper bound follows from Theorem 5.2, treating as O(1) all constants related to trajectory small-ball (Definition 4.1). Lower bound follows from iid linear regression being a special case. (b) Upper bound follows from combining (i) Lemma B.23, a general OLS high probability upper bound that utilizes a simple modification (Definition B.1) to our trajectory small-ball definition, with (ii) Proposition B.24, which shows that a φ-mixing (Definition B.2) covariate process (where the marginal distributions also fulfill a weak small-ball condition (B.29)) satisfies our modified trajectory small-ball condition. The upper bound holds with high probability instead of in expectation, and requires a burn-in time that satisfies T ≳ τ_mix n log(1/δ), where δ denotes the failure probability. The lower bound is from Bresler et al.
(2020, Theorems 1 and 3), and holds for the single trajectory (m = 1). (c) Upper bound follows from Theorem 5.5; see Definition 4.2 for definition of kc-step controllability. Lower bound follows again from iid linear regression being a special case. (d) Upper bound follows from Theorem 5.6, under Assumption 5.1 (marginal stability), Assumption 5.2 (diagonalizability), and Assumption 5.3 (one-step controllability). The condition number of the diagonalizing factor is denoted by γ (Definition 5.1). Lower bound follows from Theorem 6.3, and is realized by a decoupled noise sequence (Definition 7.1). (e) Upper bound follows from Theorem 5.3, treating as O(1) constants relating to small-ball (Equation (5.7)) and variance growth (Equation (5.8)). The necessity of the variance growth condition is shown in Theorem 6.2. Note that in the many trajectories regime, the Seq-LS upper bound applies. Lower bound again follows from iid linear regression. (f) Upper bound follows from Theorem 5.7; γ is the condition number of the diagonalizing factor (Definition 5.1). Lower bound again follows from iid linear regression. (g) Upper bound follows from Theorem 5.4; ΓT is the T-step average covariance matrix (Equation (3.3)). Lower bound is marked with a dash indicating that Theorem 5.6 does not directly apply to LDS-Sys ID (cf. the discussion following Definition 7.1) (h) Upper bound follows from Theorem 5.8; γ is the condition number of the diagonalizing factor (Definition 5.1), and Γm T/n is the m T/n-step average covariance matrix (Equation (3.3)). Lower bound applies to the single trajectory (m = 1) setting and follows from Simchowitz et al. (2018, Theorem 2.3). (Technically, their bound applies to the operator, instead of Frobenius norm, but the proof can be adjusted to apply.) Specializing the upper bound to one trajectory (m = 1), drawn from the lower bound s hard instance, implies a gap of n3/T 2 (upper) versus n2/T 2 (lower) as noted in Simchowitz et al. (2018, Section 2.2). 
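To make the many-trajectories rate concrete, the following minimal simulation (a hypothetical sketch: the stable dynamics matrix, dimensions, and noise level are illustrative choices of ours, not from the paper) pools examples from m independent trajectories and fits the parameter by ordinary least squares, mirroring the σ²_ξ pn/(mT) behavior summarized in the first row of Table 1:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, m, T, sigma = 5, 2, 200, 20, 0.1
A = 0.9 * np.eye(n)                    # illustrative stable covariate dynamics
W = rng.standard_normal((p, n))        # ground-truth parameter

X_rows, Y_rows = [], []
for _ in range(m):                     # m independent trajectories
    x = np.zeros(n)
    for _ in range(T):                 # each of length T
        x = A @ x + rng.standard_normal(n)
        X_rows.append(x.copy())
        Y_rows.append(W @ x + sigma * rng.standard_normal(p))
X = np.array(X_rows)                   # (m*T, n) pooled design matrix
Y = np.array(Y_rows)                   # (m*T, p) pooled responses

# Pooled OLS: minimize ||Y - X W^T||_F^2 over all m*T examples at once
W_hat = np.linalg.lstsq(X, Y, rcond=None)[0].T
err = np.linalg.norm(W_hat - W) ** 2   # squared Frobenius parameter error
print(err)                             # roughly sigma^2 * p * n / (m * T) in scale
```

Doubling either m or T roughly halves the error in this sketch, reflecting that all m·T pooled examples contribute.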
Lemma 5.1 also yields a bound for Ind-Seq-LS, assuming polynomial growth of the time-t covariances Σt (3.3). To state the result, let φ : [1, ∞) × [0, ∞) → [1, ∞) be defined as:

    φ(a, x) := 1 if x ≤ 1, and φ(a, x) := ax otherwise.    (5.6)

Note that 1 ≤ φ(a, x) ≤ max{ax, 1}.

Theorem 5.3 (Upper bound for Ind-Seq-LS). There are universal positive constants c0 and c1 such that the following holds. Fix any sequence of distributions {P_{x,t}}_{t≥1}, and let Σt := E_{xt∼P_{x,t}}[xt xtᵀ] for t ∈ N+. Suppose there exist c_sb > 0 and α ∈ (0, 1] such that for all v ∈ Rⁿ \ {0}, ε > 0, and t ∈ N+:

    P_{xt∼P_{x,t}}{ ⟨v, xt⟩² ≤ ε vᵀΣt v } ≤ (c_sb ε)^α.    (5.7)

Furthermore, suppose there exist c_β ≥ 1 and β ≥ 0 such that for all s, t ∈ N+ satisfying s ≤ t:

    1/λ(Σs, Σt) ≤ c_β (t/s)^β.    (5.8)

If n ≥ 2 and m T ≥ c0 n (β + log max{e, c_sb c_β}), then, for Px = ⊗_{t≥1} P_{x,t}:

    E[L(Ŵ_{m,T}; T', Px)] ≤ c1 c_sb σ²_ξ c_β e^β · (pn/(m T α)) · φ(c_β(β + 1), (T'/T)^β) · (β + log max{e, c_sb c_β}).    (5.9)

Consider specializing Theorem 5.3 to the case when Σt = Σ for all t ∈ N+. Doing so yields random design linear regression from m T covariates drawn iid from P_{x,1}. The growth condition (5.8) is trivially satisfied with c_β = 1 and β = 0. The small-ball assumption (5.7) simplifies to P_{x1∼P_{x,1}}{ |⟨v, x1⟩| ≤ ε ‖v‖_Σ } ≤ (√c_sb · ε)^{2α} for all v ≠ 0 and ε > 0, which matches Mourtada (2022, Assumption 1) up to a minor redefinition of the constants c_sb, α. Treating c_sb and α as constants, the conclusion of Theorem 5.3 in this setting is that E[‖Ŵ_{m,T} − W‖²_Σ] ≲ σ²_ξ pn/(m T) as long as n ≥ 2 and m T ≳ n, which recovers Mourtada (2022, Proposition 2). On the other hand, Theorem 5.3 does not require that the covariates are drawn iid from the same distribution, allowing the time-t covariances Σt to grow polynomially. As an example, suppose that Σt = t^β I_n for some β > 0. In this case, 1/λ(Σs, Σt) = (t/s)^β, so we can take c_β = 1 in (5.8). Again treating c_sb and α as constants and taking T' ≤ T, we have E[L(Ŵ_{m,T}; T', Px)] ≲ σ²_ξ β e^β pn/(m T) as long as m T ≳ βn.
If β is also considered a constant, then we further have the risk bound E[L(Ŵ_{m,T}; T', Px)] ≲ σ²_ξ pn/(m T). This matches the minimax rate for iid linear regression.

It is natural to ask if the covariance growth condition (5.8) is needed under strict evaluation horizons T' ≤ T.⁴ In Section 6, we show that if the covariances are set to Σt = 2^t I_n and P_{x,t} = N(0, Σt) (satisfying (5.7)), then the minimax risk R(m, T, T; {⊗_{t≥1} P_{x,t}}) must scale at least as 2^{cn/m}/T whenever m ≲ n, for some positive constant c. Sub-exponential growth rates are therefore necessary for polynomial sample complexity. Determining the optimal dependence on β in (5.9) is left to future work.

Note that Theorem 5.3 is most interesting either when trajectories are few (m ≲ n) or evaluations are extended (T' > T). When m ≳ n and T' ≤ T, one can usually apply Theorem 5.2 with Γ' = Γ_{T'}(Px) instead, and avoid placing any requirements on the growth of covariances.

Considering any of the small-ball examples in Section 4.1, recall that when the excitation window k and the horizon T are equal, Theorem 5.2 provides an upper bound on the risk of OLS estimation for the corresponding Seq-LS problem. Specifically, for Example 4.1 and Example 4.6 with k = T, if T' = T and trajectories are abundant (m ≳ n), then the OLS estimator's rate σ²_ξ pn/(m T) matches its behavior in iid linear regression. Meanwhile, for the degree-D Volterra series (Example 4.8), we require that m ≳ c_D n, and the OLS risk bound scales as σ²_ξ c'_D pn/(m T), for constants c_D and c'_D that only depend on D.

In order to cover scenarios in which trajectories may be relatively scarce, namely m ≲ n, we need additional structure. More technically, when the small-ball condition is satisfied with k < T, one needs to further control the various eigenvalues that appear in Lemma 5.1 in order to bound the risk of OLS.
Specifically for Ind-Seq-LS, a covariate growth assumption suffices: Example 4.3 combined with Theorem 5.3 yields an OLS risk bound. Furthermore, both Example 4.4 and Example 4.5 can be immediately combined with Lemma 5.1, since the matrices Ψj in these examples are bounded above and below by Γ_T(Px) up to universal constant factors. But arbitrarily large risk can still be realized in the general Seq-LS problem, even when the trajectory small-ball condition is satisfied. To study the behavior of OLS across all regimes of trajectory count m, example dimensions p and n, and trajectory lengths T and T', we focus specifically on linear dynamical systems and the LDS-LS problem for our remaining upper bounds.

5.1 Upper bounds for linear dynamical systems

In this section, we focus exclusively on dynamics P^{A,B}_x described by a linear dynamical system (3.8). As discussed previously, in order to apply Lemma 5.1 in the few trajectories regime when m ≲ n (or when m ≳ n and T' > T), we must (a) show that the process P^{A,B}_x satisfies the trajectory small-ball condition, and (b) bound the various eigenvalues which appear in Lemma 5.1. Example 4.6 establishes that P^{A,B}_x satisfies the (T, k, Γ_k(A, B), e, 1/2)-Traj-SB condition as long as (A, B) is k_c-step controllable and k ≥ k_c, thus taking care of (a). To handle (b), we introduce additional assumptions on the dynamics matrices (A, B):

Assumption 5.1 (Marginal instability). The dynamics matrix A in LDS-LS is marginally unstable. That is, ρ(A) ≤ 1, where ρ(A) denotes the spectral radius of A.

4. Some regularity is needed under extended evaluations T' > T; otherwise the risk could be arbitrarily large.

Assumption 5.2 (Diagonalizability). The dynamics matrix A in LDS-LS is complex diagonalizable as A = S D S⁻¹, where S ∈ C^{n×n} is invertible and D ∈ C^{n×n} is a diagonal matrix comprising the eigenvalues of A.

Assumption 5.3 (One-step controllability).
The control matrix B in LDS-LS has full row rank, i.e., rank(B) = n. Equivalently, the pair (A, B) is one-step controllable (Definition 4.2).

Assumption 5.1 is fairly standard in the literature. Going beyond the regime ρ(A) = 1 + ε, where ε ≲ 1/T, requires additional technical assumptions on the dynamics matrix A that we choose to avoid in the interest of simplicity; the OLS estimator is in general not a consistent estimator when ρ(A) > 1 and m = 1 (cf. Phillips and Magdalinos (2013); Sarkar and Rakhlin (2019)). The condition ρ(A) ≤ 1 is often referred to as marginal stability in other work. We choose to call it marginally unstable instead, to emphasize the fact that such systems, namely at ρ(A) = 1, may not be ergodic and that the state can grow unbounded (e.g. have magnitude roughly t^n at time t).

Diagonalizability (Assumption 5.2) is less standard in the literature. We use it together with Assumption 5.1 and Assumption 5.3 to establish that λ(k, t; A, B) := λ(Γ_k(A, B), Γ_t(A, B)) ≥ c·k/t whenever k ≤ t, where c is a constant that depends only on A and B (and not on k and t). In previous work on linear system identification, the term λ(k, t; A, B) only appears under a logarithm, and so coarser analyses in the general case can still establish polynomial rates (cf. Simchowitz et al. (2018, Proposition A.1) and Sarkar and Rakhlin (2019, Proposition 7.6)).⁵ However, by allowing for evaluation lengths T' > 1, the dependence on λ(k, t; A, B) is no longer entirely confined under a logarithm (cf. Lemma 5.1). A sharp characterization is hence critical for deriving optimal rates. In Appendix A, we conjecture the correct scaling of λ(k, t; A, B) as a function of the ratio k/t and the largest Jordan block size of A, based on numerical simulation.

One-step controllability (Assumption 5.3) is also an assumption commonly made in linear system identification. It is clear that some form of controllability is needed, otherwise learning may be impossible (e.g.
consider the extreme case of B = 0). General multi-step controllability does not suffice either: Tsiamis and Pappas (2021, Theorem 2) show that under a single trajectory (m = 1), n-step controllability (where n is the state dimension) does not ensure finite risk, and even a more robust controllability definition (Tsiamis and Pappas, 2021, Definition 3) cannot ensure risk bounds better than exponential in the dimension n. Considering these barriers, we simply choose to rely on one-step controllability in the few-trajectory setting (m ≲ n). Finally, we introduce a condition number quantity that will feature commonly in our bounds:

Definition 5.1. For dynamics matrices (A, B) in LDS-LS satisfying Assumption 5.2 and Assumption 5.3, the condition number γ(A, B) is defined as:

    γ(A, B) := λmax(S⁻¹ B Bᵀ S⁻*) / λmin(S⁻¹ B Bᵀ S⁻*).

Here, the matrix S diagonalizes A, as defined in Assumption 5.2.

5. Note, however, that without diagonalizability, Simchowitz et al. (2018, Corollary A.2) can only guarantee a √(n²/T) rate for the operator norm of the parameter error in general, and this is likely not optimal.

5.1.1 Many trajectories

Our first result instantiates Theorem 5.2 in the special case of Γ' = I_n, which yields a sharp bound for parameter recovery without requiring stability of the dynamics matrix A:

Theorem 5.4 (Parameter recovery upper bound for LDS-LS, many trajectories). There are universal positive constants c0 and c1 such that the following holds for any instance of LDS-LS. Suppose that (A, B) is k_c-step controllable. If n ≥ 2, m ≥ c0 n, and T ≥ k_c, then:

    E[‖Ŵ_{m,T} − W‖²_F] ≤ c1 σ²_ξ · pn / (m T λmin(Γ_T(A, B))).

Theorem 5.4 improves on existing linear system identification results in the following way: it replaces stability assumptions on the dynamics matrix A with a simpler assumption of relatively many trajectories (m ≳ n), and it guarantees a rate that is inversely proportional to the total number of examples m T, instead of only one example per trajectory.
In other words, our analysis does not need to discard the data within a trajectory, as is the case in Dean et al. (2020, Proposition 1.1). Additionally, although OLS is generally not a consistent estimator from one trajectory (m = 1) if the dynamics A are unstable, the results of Dean et al. (2020) imply consistency as m → ∞, i.e., that Ŵ_{m,T} converges in probability to W as m → ∞. Theorem 5.4 adds that, provided m ≳ n, OLS is consistent under unstable systems as T → ∞ as well, even if the trajectory count m remains finite. We will return to parameter recovery from relatively few trajectories (m ≲ n) by this section's end.

We now look beyond an evaluation horizon of length one, and consider the setting with many trajectories (m ≳ n). As noted previously, in order to handle an arbitrary evaluation horizon T' (in particular those that extend past the training horizon T), some constraint on the admissible dynamics matrices is needed to ensure that the minimax risk remains finite. Without assumptions, the quantity λ(Γ_T(A, B), Γ_{T'}(A, B)), whose inverse inevitably bounds the risk (3.2) from below, can be arbitrarily small whenever T' > T, resulting in arbitrarily large risk. We will use our stated assumptions from the beginning of this section. The following specializes Theorem 5.2 to LDS-LS:

Theorem 5.5 (Risk upper bound for LDS-LS, many trajectories). There are universal positive constants c0 and c1 such that the following holds for any instance of LDS-LS. Suppose that (A, B) is k_c-step controllable. If n ≥ 2, m ≥ c0 n, T ≥ k_c, and the evaluation horizon is strict (T' ≤ T), then:

    E[L(Ŵ_{m,T}; T', P^{A,B}_x)] ≤ c1 σ²_ξ · pn/(m T).

On the other hand, suppose that (A, B) satisfies Assumption 5.1, Assumption 5.2, and Assumption 5.3, with γ := γ(A, B) (Definition 5.1). If n ≥ 2, m ≥ c0 n, and the evaluation horizon is extended (T' > T), then:

    E[L(Ŵ_{m,T}; T', P^{A,B}_x)] ≤ c1 γ σ²_ξ · (pn/(m T)) · (T'/T).

Setting T' = T, Theorem 5.5 states that the risk of LDS-LS in the many trajectories regime satisfies E[L(Ŵ_{m,T}; T, P^{A,B}_x)] ≲ σ²_ξ pn/(m T).
This rate matches the corresponding independent baseline Ind-LDS-LS in the many trajectories regime. To see this, first observe that the marginal distribution P^{A,B}_{x,t} at time t ∈ N+ is N(0, Σ_t(A, B)). Hence, the covariate distribution for Ind-LDS-LS corresponds to the product distribution ⊗_{t≥1} N(0, Σ_t(A, B)), which is an instance of a Gaussian process. Therefore, Example 4.2 combined with Theorem 5.2 yields that the Ind-LDS-LS problem also has a risk bound that scales as σ²_ξ pn/(m T) whenever m ≳ n. Put differently, the dependent structure of the covariate distribution P^{A,B}_x in LDS-LS does not add any statistical overhead to the learning problem (compared to the independent learning problem Ind-LDS-LS), as long as m ≳ n.

5.1.2 Few trajectories

We now cover the regime in which relatively few training trajectories are available (m ≲ n). Our first result bounds the OLS risk for the LDS-LS problem:

Theorem 5.6 (Risk upper bound for LDS-LS, few trajectories). There are universal positive constants c0, c1, and c2 such that the following holds for any instance of LDS-LS. Suppose that (A, B) satisfies Assumption 5.1, Assumption 5.2, and Assumption 5.3, with γ := γ(A, B) (Definition 5.1). If n ≥ 2, m ≤ c0 n, and m T ≥ c1 n log(max{γn/m, e}), then:

    E[L(Ŵ_{m,T}; T', P^{A,B}_x)] ≤ c2 σ²_ξ · (pn log(max{γn/m, e}) / (m T)) · φ(γ, c1 n log(max{γn/m, e})/m).

To interpret Theorem 5.6, consider γ a constant and suppose that T' = T. Then Theorem 5.6 states that E[L(Ŵ_{m,T}; T, P^{A,B}_x)] ≲ σ²_ξ · (pn/(m T)) · n log²(n/m)/m. We now see that this LDS-LS risk is an extra n log²(n/m)/m factor larger than the risk of the baseline problem Ind-LDS-LS:

Theorem 5.7 (Risk upper bound for Ind-LDS-LS). There are universal positive constants c0 and c1 such that the following holds for any instance of Ind-LDS-LS. Suppose that (A, B) satisfies Assumption 5.1, Assumption 5.2, and Assumption 5.3, with γ := γ(A, B) (Definition 5.1).
If n ≥ 2 and m T ≥ c0 n log(max{γ, e}), then:

    E[L(Ŵ_{m,T}; T', ⊗_{t≥1} P^{A,B}_{x,t})] ≤ c1 σ²_ξ · pnγ log(max{γ, e}) / (m T).

Treating γ as a constant and setting T' = T, Theorem 5.7 states that the Ind-LDS-LS risk E[L(Ŵ_{m,T}; T, ⊗_{t≥1} P^{A,B}_{x,t})] scales as σ²_ξ pn/(m T), matching the risk of iid linear regression up to constant factors. In Section 6, we will see that the result of Theorem 5.6 is sharp up to constants and γ, and therefore the LDS-LS problem is fundamentally more difficult than its corresponding baseline problem Ind-LDS-LS when trajectories are relatively scarce. As an aside: the dependence of both Theorem 5.6 and Theorem 5.7 on the condition number γ is likely not optimal, and we leave sharpening this dependence to future work.

We conclude with our final upper bound, using our assumptions to generalize Simchowitz et al. (2018, Theorem 2.1) to the few-trajectory setting:

Theorem 5.8 (Parameter recovery upper bound for LDS-LS, few trajectories). There are universal positive constants c0, c1, and c2 such that the following holds for any instance of LDS-LS. Suppose that (A, B) satisfies Assumption 5.1, Assumption 5.2, and Assumption 5.3, with γ := γ(A, B) (Definition 5.1). If n ≥ 2 and m T ≥ c0 n log(max{γn/m, e}), then:

    E[‖Ŵ_{m,T} − W‖²_F] ≤ c1 σ²_ξ · pn log(max{γn/m, e}) / (m T λmin(Γ_{k⋆}(A, B))),  where  k⋆ := c2 m T / (n log(max{γn/m, e})).

Theorem 5.8 complements Theorem 5.4; together they cover parameter recovery across all problem regimes. Again, operator norm bounds also hold with p + n in place of pn.

5.2 Comparison to learning from trajectories of multiple unknown systems

As mentioned in Section 2, Chen and Poor (2022) and Modi et al. (2022) both study the setup where a learner observes multiple independent trajectories from K different unknown linear dynamical systems. The task is to identify the parameters of the K underlying systems. This is more general than the setting we consider, which is recovered by fixing K = 1.
However, specializing these rates to our setting yields either unnecessary requirements, suboptimal bounds, or both. To see this, first, if we specialize Chen and Poor (2022, Theorem 1) to our setup, we inherit unnecessary assumptions. Specifically, their Theorem 1 requires strict stability, one-step controllability, and m T ≥ max{n³, 1/(1 − ρ)}, where ρ is the spectral radius of A. In comparison, Theorem 5.4 only requires k_c-step controllability, T ≥ k_c, and m ≳ n. However, note that their Theorem 1, like Theorem 5.4, does have the property that the parameter error (in operator norm) scales as √(n/(m T)), reflecting that all collected datapoints contribute to reducing error. Next, we specialize Modi et al. (2022, Theorem 2). Their Theorem 2 bounds the error of an estimation procedure which outputs m different estimates {Âᵢ}ᵢ₌₁^m, one for each observed trajectory (cf. Eq. (3)). Specifically, it gives an upper bound on the quantity (1/m) Σᵢ₌₁^m ‖Âᵢ − Aᵢ‖²_F, where Aᵢ is the dynamics matrix associated with the i-th trajectory. To specialize this to our setting, we average the estimates and apply Jensen's inequality followed by Theorem 2. This yields the bound ‖Â − A‖²_F ≲ 1/T + n²/(m T), where Â := (1/m) Σᵢ₌₁^m Âᵢ is the averaged estimate. We see that, for a fixed T, as m → ∞, the rate tends to 1/T instead of zero (compared with the n²/(m T) bound from Theorem 5.4). Additionally, Theorem 2 requires both that the dynamics are one-step controllable and that the spectral radius of A is bounded by 1 + O(1/T).

6. Risk lower bounds

Our lower bounds rely on the following statement, that the expected trace inverse covariance, a classic quantity in asymptotic statistics, bounds the minimax risk from below:

Lemma 6.1 (Expected trace of inverse covariance bounds risk from below). Fix m, T, T' ∈ N+ and a set of covariate distributions Px. Suppose that for every Px ∈ Px, the data matrix X_{m,T} ∈ R^{mT×n} drawn from ⊗ᵢ₌₁^m Px has full column rank almost surely.
The minimax risk R(m, T, T'; Px) satisfies:

    R(m, T, T'; Px) ≥ σ²_ξ p · sup_{Px ∈ Px} E_{⊗ᵢ₌₁^m Px}[ tr( Γ^{1/2}_{T'}(Px) (Xᵀ_{m,T} X_{m,T})⁻¹ Γ^{1/2}_{T'}(Px) ) ].    (6.1)

Lemma 6.1 is well known, possibly considered folklore; we state and prove it for completeness. Our proof is inspired by a recent argument from Mourtada (2022). It smooths over problem instances according to a Gaussian prior, and analytically characterizes the posterior distribution of the parameter W under a simple Gaussian observation model detailed in Section 7.2. Note that the presence of the expected value outside of the trace inverse, on the right-hand side of (6.1), is critical in proving our separation results. Using Jensen's inequality, one could move the expectation under the inverse, precisely recovering the classic Cramér-Rao lower bound (CRLB) for the variance of unbiased estimators. However, for least-squares regression problems, the CRLB yields (when T' = T) the familiar σ²_ξ pn/(m T) rate of iid linear regression. This is not enough for our purposes, and so we must analyze the more complex expected-trace-inverse quantity appearing in (6.1).

Our first lower bound underscores the need to make variance growth assumptions (5.8), in Theorem 5.3, for Ind-Seq-LS in the few trajectories (m ≲ n) regime:

Theorem 6.2 (Need for growth assumptions in Ind-Seq-LS when m ≲ n). There exist universal constants c0, c1, and c2 such that the following holds. Suppose that Px = ⊗_{t≥1} N(0, 2^t I_n), n ≥ 6, m T ≥ n, and m ≤ c0 n. Then:

    R(m, T, T; {Px}) ≥ c1 σ²_ξ p · 2^{c2 n/m} / T.

Theorem 6.2 states that if the variances Σt are allowed to grow exponentially in t, then the minimax risk of Ind-Seq-LS scales exponentially in n/m when m ≲ n. Thus, some sub-exponential growth assumption is necessary in order to have the risk scale polynomially in n/m. We now turn to a lower bound for LDS-LS. We consider two particular hard instances for the LDS-LS dynamics matrices (A, B), where we set B = I_n and vary A.
The first instance corresponds to iid covariates, i.e., A = 0_{n×n}. The second instance corresponds to an isotropic Gaussian random walk, i.e., A = I_n. These two hard instances satisfy Assumption 5.1, Assumption 5.2, and Assumption 5.3. Together they show that our upper bounds are sharp up to logarithmic factors, treating the condition number γ(A, B) from Definition 5.1 as a constant:

Theorem 6.3 (Risk lower bound). There are universal positive constants c0, c1, and c2 such that the following holds. Recall that P^{I_n}_x (resp. P^{0_{n×n}}_x) denotes the covariate distribution for a linear dynamical system with A = I_n and B = I_n (resp. A = 0_{n×n} and B = I_n). If T ≥ c0, n ≥ c1, and m T ≥ n, then:

    R(m, T, T'; {P^{0_{n×n}}_x, P^{I_n}_x}) ≥ c2 σ²_ξ · (pn/(m T)) · max{ nT'/(m T), T'/T, 1 }.

We can interpret this lower bound by a breakdown of ϕ := max{nT'/(m T), T'/T, 1} across various regimes. When trajectories are limited (m ≲ n), ϕ ≍ max{nT'/(m T), 1}, and therefore the minimax risk is bounded below by σ²_ξ · (pn/(m T)) · max{nT'/(m T), 1}. This is the same rate prescribed by the OLS upper bound of Theorem 5.6, up to the condition number γ(A, B) and logarithmic factors in n/m. We have thus justified the summary statement Theorem 1.2. On the other hand, under many trajectories (m ≳ n), ϕ ≍ max{T'/T, 1} and the minimax risk is bounded below by σ²_ξ · (pn/(m T)) · max{T'/T, 1}. By Theorem 5.5, the OLS risk is bounded above by the same quantity times γ(A, B), justifying the summary statement Theorem 1.3.

7. Key proof ideas

In this section, we highlight some of the key ideas behind our results. Proofs of the upper bounds are in Appendix B, and proofs of the lower bounds are in Appendix C.

Additional notation. For r ∈ N+ and M ∈ R^{n×n}, let J_r ∈ R^{r×r} denote the Jordan block of size r with ones along its diagonal, let BDiag(M, r) ∈ R^{nr×nr} denote the block diagonal matrix with diagonal blocks M, and let BToep(M, r) ∈ R^{nr×nr} denote the block Toeplitz matrix with first block column (I_n, Mᵀ, …, (M^{r−1})ᵀ)ᵀ.
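As a concrete companion to this notation, the following small sketch (the helper names are our own, not the paper's) constructs J_r, BDiag(M, r), and BToep(M, r):

```python
import numpy as np

def jordan_block(r):
    # J_r: size-r Jordan block with ones on the diagonal (and superdiagonal)
    return np.eye(r) + np.diag(np.ones(r - 1), k=1)

def bdiag(M, r):
    # BDiag(M, r): block diagonal matrix with r copies of M on the diagonal
    n = M.shape[0]
    out = np.zeros((n * r, n * r))
    for i in range(r):
        out[i*n:(i+1)*n, i*n:(i+1)*n] = M
    return out

def btoep(M, r):
    # BToep(M, r): lower block Toeplitz matrix whose first block column
    # stacks I, M, M^2, ..., M^{r-1}
    n = M.shape[0]
    out = np.zeros((n * r, n * r))
    P = np.eye(n)
    for k in range(r):              # k-th block subdiagonal holds M^k
        for i in range(k, r):
            out[i*n:(i+1)*n, (i-k)*n:(i-k+1)*n] = P
        P = M @ P
    return out
```

For instance, btoep(jordan_block(2), 3) is a 6 x 6 matrix whose (i, j) block, for i ≥ j, equals J_2 raised to the power i − j.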
7.1 Upper bounds

The proof of Lemma 5.1 decomposes the risk using a standard basic inequality, which we now describe. While Lemma 5.1 is stated quite generally, for simplicity of exposition we restrict ourselves in this section to the case when the matrix parameters {Ψj}ⱼ₌₁^S in Definition 4.1 are all set to Γ. Under this simplification, we have that µ({Ψj}ⱼ₌₁^S, Γ) = 1. Equation (3.1) yields the identity Y_{m,T} = X_{m,T} Wᵀ + Ξ_{m,T}. Plugging this relationship into the formula (3.6) for Ŵ_{m,T} gives Ŵ_{m,T} − W = Ξᵀ_{m,T} X_{m,T} (Xᵀ_{m,T} X_{m,T})⁻¹. Define the whitened version of X_{m,T} as X̃_{m,T} := X_{m,T} Γ^{−1/2}. From these definitions and after some basic manipulations, for any Γ' ∈ Sym^n_{>0}:

    ‖Ŵ_{m,T} − W‖²_{Γ'} ≤ min{n, p} · ‖(X̃ᵀ_{m,T} X̃_{m,T})^{−1/2} X̃ᵀ_{m,T} Ξ_{m,T}‖²_op / ( λmin(X̃ᵀ_{m,T} X̃_{m,T}) · λ(Γ, Γ') ).    (7.1)

This decomposes the analysis into two parts: (a) upper-bounding the self-normalized martingale ‖(X̃ᵀ_{m,T} X̃_{m,T})^{−1/2} X̃ᵀ_{m,T} Ξ_{m,T}‖²_op, and (b) lower-bounding the term λmin(X̃ᵀ_{m,T} X̃_{m,T}). The analysis for the martingale term is fairly standard (cf. Abbasi-Yadkori et al., 2011, Corollary 1), so for the remainder of this section we focus on the minimum eigenvalue bound, which contains much of what is novel in our analysis.

We first demonstrate how the trajectory small-ball definition (Definition 4.1) can be used to establish pointwise convergence of the quadratic form χ(v) := Σᵢ₌₁^m Σₜ₌₁^T ⟨v, x̃^{(i)}_t⟩² for v ∈ S^{n−1}, where x̃^{(i)}_t := Γ^{−1/2} x^{(i)}_t is a whitened state vector. Specifically, we show that for a fixed v ∈ S^{n−1}, the probability of the event {χ(v) ≤ ψε} is small, for ψ and ε to be specified. The key idea is that for any non-negative random variable X satisfying P{X ≤ ε} ≤ (cε)^α for all ε > 0, the moment generating function satisfies E[exp(−ηX)] ≤ (c/η)^α for all η > 0 (cf. Proposition B.4). Hence, by condition (4.1) from Definition 4.1, for any η > 0:

    E[ exp( −η Σ_{t=(j−1)k+1}^{jk} ⟨v, x̃^{(i)}_t⟩² ) | F^{(i)}_{(j−1)k} ] ≤ (2c_sb/(ηk))^α  a.s.,  i = 1, …, m,  j = 1, …, S.
By a Chernoff bound, the tower property of conditional expectation, and the independence of the trajectories {x̃^{(i)}_t}_{t≥1} and {x̃^{(i')}_t}_{t≥1} when i ≠ i', for any ζ > 0:

    P{ Σᵢ₌₁^m Σₜ₌₁^T ⟨v, x̃^{(i)}_t⟩² ≤ ζ } ≤ inf_{η≥0} e^{ηζ} E[ exp( −η Σᵢ₌₁^m Σₜ₌₁^T ⟨v, x̃^{(i)}_t⟩² ) ] ≤ inf_{η≥0} e^{ηζ} (2c_sb/(ηk))^{mSα} = exp( mSα − mSα log( mTα/(2c_sb ζ) ) ).

Now with the change of variables t := log( mTα/(2c_sb ζ) ) − 1, we obtain:

    P{ Σᵢ₌₁^m Σₜ₌₁^T ⟨v, x̃^{(i)}_t⟩² ≤ (mTα/(2c_sb)) e^{−(t+1)} } ≤ exp(−mSα t),  t > 0.    (7.2)

The key upshot of (7.2) is that it controls the tail probability at all scales. This control is needed in order to bound the expected value of (7.1) by integration. At this point, it remains to upgrade (7.2) from pointwise to uniform over S^{n−1}. A natural approach is to use standard covering and union bound arguments, as is done in Simchowitz et al. (2018). However, a straightforward covering argument yields unnecessary logarithmic factors in the covariate dimension n. In order to circumvent this issue, we utilize the PAC-Bayes argument from Mourtada (2022) (which is itself an extension of Oliveira (2016)) to establish uniform concentration. The details are given in Appendix B.4.

7.2 Lower bounds

7.2.1 Observation noise behind Lemma 6.1

Our definition of the minimax risk R(m, T, T'; Px) in (3.5) involves a supremum over the worst-case σ_ξ-sub-Gaussian MDS distribution that models the observation noise. The proof of Lemma 6.1 bounds this supremum from below by considering a noise model that decouples the observation noise {ξt}_{t≥1} from the randomness that drives the trajectory {xt}_{t≥1}:

Definition 7.1 (Gaussian observation noise). The Gaussian observation noise model holds when ξt ~ N(0, σ²_ξ I_p), ξt ⊥ ξ_{t'} if t ≠ t', and the process {ξt}_{t≥1} is independent from the process {xt}_{t≥1}.

Decoupling the noise processes orthogonalizes the two problems simultaneously present in Seq-LS: learning the dynamics of covariates and learning the responses from covariates. Definition 7.1 draws attention to the latter.
It will unfortunately preclude us from addressing linear system identification specifically with our lower bound, but it allows a sharp and simple characterization of the minimax risk in general. The proof of Lemma 6.1 is given in Appendix C.2.

7.2.2 An analysis of non-isotropic gramian matrices

A key technical challenge for our analysis lies in constructing a sharp lower bound on the expected trace inverse of a gramian matrix formed by random non-isotropic Gaussian vectors. Specifically, for integers q, n ∈ N+ with q ≥ n, and for a fixed positive definite matrix Σ ∈ Sym^q_{>0}, we are interested in a lower bound on the quantity E tr((WᵀΣW)⁻¹), where W ∈ R^{q×n} has iid N(0, 1) entries. The matrix WᵀΣW is equal in distribution to the gramian matrix Y ∈ R^{n×n} of the vectors g1, …, gn ∈ R^q, which are drawn iid from N(0, Σ), i.e., Y_{ij} = ⟨g_i, g_j⟩. The main tool we use to analyze E tr((WᵀΣW)⁻¹) is the convex Gaussian min-max theorem (CGMT) from Thrampoulidis et al. (2014), which allows us to bound the expected trace inverse from below by studying a two-dimensional min-max game that is more amenable to analysis. The key idea is to cast the expected trace inverse as a least-norm optimization problem, and apply the CGMT to the value of the optimization problem. We believe the following result to be of independent interest.

Lemma 7.1. Let q, n be positive integers with q ≥ n and n ≥ 2. Let W ∈ R^{q×n} have iid N(0, 1) entries, and let Σ ∈ R^{q×q} be positive definite. Let g ~ N(0, I_q) and h ~ N(0, I_{n−1}), with g and h independent. Also, let {e_i}ᵢ₌₁^q be the standard basis vectors in R^q. We have:

    E tr((WᵀΣW)⁻¹) ≥ n / ( Σᵢ₌₁^q E[ min_{β≥0} max_{τ≥0} ( −(β‖h‖₂)/τ + ‖βg − e_i‖²_{(Σ⁻¹ + β‖h‖₂ τ I_q)⁻¹} ) ] ).    (7.3)

The proof of Lemma 7.1 appears in Appendix C.4. We now discuss how to analyze the two-dimensional min-max game appearing in Lemma 7.1.
We first start by heuristically replacing it with a stylized problem, where the random quantities which appear in (7.3) are replaced by their expected scaling:

    SP(Σ, n) := Σᵢ₌₁^q min_{β≥0} max_{τ≥0} [ −(β√n)/τ + β² tr((Σ⁻¹ + β√n τ I_q)⁻¹) + ((Σ⁻¹ + β√n τ I_q)⁻¹)_{ii} ],    (7.4)

where we write ℓ_i(β, τ) for the i-th summand. While (7.4) is not a valid upper bound on the value of the min-max game appearing in (7.3), analyzing (7.4) is simpler and gives the correct intuition; we give a rigorous upper bound in Lemma C.9. We start by observing that if β = 0, then regardless of the choice of τ, ℓ_i(0, τ) = Σ_{ii}, and therefore Σᵢ₌₁^q ℓ_i(0, τ_i) = tr(Σ) for any {τ_i}ᵢ₌₁^q ⊂ R_{≥0}. On the other hand, if β > 0, then ℓ_i(β, τ) tends to −∞ as τ → 0+ and to 0 as τ → ∞. Therefore, if we can show that there exists a v ∈ (0, tr(Σ)) such that every set of points {(β_i, τ_i)}ᵢ₌₁^q ⊂ R²_{>0} satisfying:⁶

    ∂ℓ_i/∂β (β_i, τ_i) = ∂ℓ_i/∂τ (β_i, τ_i) = 0,  i = 1, …, q,    (7.5)

also satisfies v = Σᵢ₌₁^q ℓ_i(β_i, τ_i), then SP(Σ, n) = v. To uncover the critical points, we define the functions f and q_i, for i = 1, …, q, as:

    f(x) := −x√n + x² tr((Σ⁻¹ + x√n I_q)⁻¹),   q_i(x) := ((Σ⁻¹ + x√n I_q)⁻¹)_{ii}.

With these definitions, we can write ℓ_i(β, τ) = (1/τ²) f(βτ) + q_i(βτ). Calculating ∂ℓ_i/∂β (β, τ) = ∂ℓ_i/∂τ (β, τ) = 0 yields, for τ ≠ 0:

    ∂ℓ_i/∂τ (β, τ) = τ^{−2} f′(βτ) β − 2τ^{−3} f(βτ) + q_i′(βτ) β,    (7.6)

    ∂ℓ_i/∂β (β, τ) = τ^{−2} f′(βτ) τ + q_i′(βτ) τ.    (7.7)

The second condition (7.7) implies that q_i′(βτ) = −τ^{−2} f′(βτ). Plugging this condition into (7.6) implies that f(βτ) = 0, and hence ℓ_i(β, τ) = q_i(βτ) at the critical point (β, τ). We now study the positive roots of the equation f(x) = 0, or equivalently:

    x√n = x² tr((Σ⁻¹ + x√n I_q)⁻¹).

6. The conditions given in (7.5) are not in general necessary first-order optimality conditions for a nonconvex/nonconcave game (see e.g. Jin et al., 2020, Proposition 21). However, since for every β > 0 the function τ ↦ ℓ_i(β, τ) has only strictly concave stationary points (Proposition C.10), these conditions are necessary for this particular problem.
Using the variable substitution y := x√n, we have, when y > 0, the equivalent problem:

    ψ(y; Σ) := y tr((Σ⁻¹ + y I_q)⁻¹) = n.

Observe that ψ(0; Σ) = 0 and lim_{y→∞} ψ(y; Σ) = q. Furthermore, ψ(y; Σ) is continuous and monotonically increasing in y. Therefore, as long as q > n, there is exactly one ȳ ∈ (0, ∞) such that ψ(ȳ; Σ) = n, or equivalently there is exactly one x̄ ∈ (0, ∞) such that ψ(x̄√n; Σ) = n. Such a quantity x̄ supplies the curve of critical points Crit(x̄) := {(β, τ) ∈ R²_{>0} | βτ = x̄}. Note that Crit(x̄) is the set of critical points for every ℓ_i(β, τ), i = 1, …, q. Furthermore, for any (β⋆, τ⋆) ∈ Crit(x̄) and i ∈ {1, …, q}, we have that ℓ_i(β⋆, τ⋆) = q_i(β⋆τ⋆) = ((Σ⁻¹ + x̄√n I_q)⁻¹)_{ii}. Therefore:

    {(β_i, τ_i)}ᵢ₌₁^q ⊂ Crit(x̄)  ⟹  Σᵢ₌₁^q ℓ_i(β_i, τ_i) = tr((Σ⁻¹ + x̄√n I_q)⁻¹) ∈ (0, tr(Σ)),

and hence:

    SP(Σ, n) = √n/x̄,  with x̄ the solution to ψ(x̄√n; Σ) = n.    (7.8)

In light of (7.8), Lemma 7.1 then suggests that:

    E tr((WᵀΣW)⁻¹) ⪆ n / SP(Σ, n) = x̄√n,    (7.9)

where the notation ⪆ indicates the heuristic nature of replacing the expected min-max game appearing in the bound (7.3) with the approximation (7.4). If we briefly check (7.9) in the simple case when Σ = I_q, we see that:

    n = ψ(x̄√n; I_q) = x̄√n · q/(1 + x̄√n)  ⟹  x̄√n = (n/q)(1 + x̄√n)  ⟹  x̄√n = n/(q − n).

Hence, (7.9) yields that E tr((WᵀW)⁻¹) ⪆ n/(q − n) ≍ n/q, which is the correct scaling; the exact result is E tr((WᵀW)⁻¹) = n/(q − n − 1) for q ≥ n + 2.

7.2.3 Ideas behind Theorem 6.2

We let X_{m,T} denote the data matrix associated with m iid copies of {x_t}ₜ₌₁^T, with x_t ~ N(0, 2^t I_n) and x_t ⊥ x_{t'} for t ≠ t'. We also define Γ_T := (1/T) Σₜ₌₁^T 2^t I_n = (2/T)(2^T − 1) I_n, and observe that Γ_T ⪯ (2^{T+1}/T) I_n. By Lemma 6.1, it suffices to lower bound the quantity E tr(Γ^{1/2}_T (Xᵀ_{m,T} X_{m,T})⁻¹ Γ^{1/2}_T). Since each column of X_{m,T} is independent, the matrix 2^{−T/2} X_{m,T} has the same distribution as BDiag(Θ^{1/2}, m) W, where Θ ∈ R^{T×T} is diagonal with Θ_{ii} = 2^{i−T} for i ∈ {1, …, T}, and W ∈ R^{mT×n} has iid N(0, 1) entries.
In other words, we have:

$$\mathbb{E}\operatorname{tr}\!\left(\Gamma_T^{1/2} (X_{m,T}^{\mathsf{T}} X_{m,T})^{-1} \Gamma_T^{1/2}\right) \geq \frac{1}{T}\, \mathbb{E}\operatorname{tr}\!\left((W^{\mathsf{T}} \mathrm{BDiag}(\Theta, m) W)^{-1}\right).$$

By the arguments in Section 7.2.2, we have:

$$\mathbb{E}\operatorname{tr}\!\left((W^{\mathsf{T}} \mathrm{BDiag}(\Theta, m) W)^{-1}\right) \approx \frac{n}{\mathrm{SP}(\mathrm{BDiag}(\Theta, m), n)},$$

where the $\approx$ notation indicates the heuristic nature of the inequality, as explained previously. From (7.8), we want to find $\bar{x}$ such that:

$$n = \psi(\bar{x}/n; \mathrm{BDiag}(\Theta, m)) = \frac{\bar{x}}{n} \cdot m \sum_{j=0}^{T-1} \frac{1}{2^j + \bar{x}/n}.$$

While solving this equation exactly for $\bar{x}/n$ is not tractable, we can estimate a lower bound on $\bar{x}/n$ quite easily. For any integer $T_c \in \{0, \dots, T\}$, we have the following estimate:

$$\sum_{j=0}^{T-1} \frac{1}{2^j + \bar{x}/n} \leq \frac{T_c + 1}{\bar{x}/n} + 2^{-T_c}.$$

Let us first assume that $\bar{x}/n \in [1, 2^{T-1}]$, so that $\lfloor \log_2(\bar{x}/n) \rfloor \in \{0, \dots, T\}$. Setting $T_c = \lfloor \log_2(\bar{x}/n) \rfloor$ then yields the lower bound $\bar{x}/n \geq 2^{n/m - 3}$. On the other hand, if $\bar{x}/n > 2^{T-1}$, then since we assume $mT \geq c_1 n$, we also have $\bar{x}/n > 2^{c_1 n/m - 1}$. Finally, if $\bar{x}/n < 1$, we have:

$$n = \frac{\bar{x}}{n} \cdot m \sum_{j=0}^{T-1} \frac{1}{2^j + \bar{x}/n} < m \sum_{j=0}^{T-1} 2^{-j} < 2m.$$

This yields a contradiction, since by assumption $m \leq c_2 n$ with $c_2 < 1/2$, so we must have $\bar{x}/n \geq 2^{c' n/m - 3}$ with $c' = \min\{1, c_1\}$. Now by (7.8) and (7.9):

$$\frac{n}{\mathrm{SP}(\mathrm{BDiag}(\Theta, m), n)} = \frac{\bar{x}}{n} \geq 2^{c' n/m - 3} \implies \mathbb{E}\operatorname{tr}\!\left(\Gamma_T^{1/2} (X_{m,T}^{\mathsf{T}} X_{m,T})^{-1} \Gamma_T^{1/2}\right) \gtrsim \frac{2^{c' n/m}}{T}.$$

We make this argument rigorous in Appendix C.5.

7.2.4 Ideas behind Theorem 6.3

We focus here on the hard instance when $A = I_n$ and $m \lesssim n$, since the cases when $A = 0_{n \times n}$, or $A = I_n$ and $m \gtrsim n$, are straightforward applications of Jensen's inequality and some basic manipulations (see Lemma C.7). The proof used by Theorem 6.3 when $A = I_n$ and $m \lesssim n$ is actually a special case of a general proof indexed by the largest Jordan block size of the hard instance. For a maximum Jordan block size $r$, the hard instances are $A = \mathrm{BDiag}(J_r, n/r)$, where we assume for simplicity that $r$ divides $n$; this reduces to $A = I_n$ when $r = 1$. We associate two important matrices with these hard instances. To define them, let $I_r := \{1, 1 + r, \dots, 1 + (T'-1) r\}$, and let $E_{I_r} \in \mathbb{R}^{T' \times T' r}$ denote the linear operator that extracts the coordinates in $I_r$.
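The exponential growth of the root $\bar{x}/n$ in $n/m$ can also be observed directly. The sketch below (our own illustration, with arbitrarily chosen sizes) solves $n = (\bar{x}/n)\, m \sum_{j=0}^{T-1} 1/(2^j + \bar{x}/n)$ by bisection and checks that $\log_2(\bar{x}/n)$ grows roughly linearly in $n/m$:

```python
import math

def root_y(n, m, T, lo=1e-9, hi=2.0**100, iters=400):
    # bisection for y = xbar/n solving  n = y * m * sum_{j=0}^{T-1} 1/(2^j + y);
    # the gap function g below is increasing in y
    g = lambda y: y * m * sum(1.0 / (2.0**j + y) for j in range(T)) - n
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

T, m = 60, 1
roots = {n: root_y(n, m, T) for n in (4, 8, 12)}
growth = [math.log2(roots[8] / roots[4]), math.log2(roots[12] / roots[8])]
print(roots)
print(growth)  # roughly 4 bits per step of Delta(n/m) = 4: exponential growth in n/m
```

Each increase of $n/m$ by four multiplies the root by roughly $2^4$, matching the $2^{\Theta(n/m)}$ scaling derived above.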
The following matrices then play a key role in our analysis:

$$\Psi_{r,T,T'} := \mathrm{BDiag}\!\left(\Gamma_T^{-1/2}(J_r), T'\right) \mathrm{BToep}(J_r, T'), \qquad \Theta_{r,T,T'} := E_{I_r} \Psi_{r,T,T'} \Psi_{r,T,T'}^{\mathsf{T}} E_{I_r}^{\mathsf{T}}. \tag{7.10}$$

The next step is to use a simple decoupling argument (see Lemma C.11) to argue that, for $A = \mathrm{BDiag}(J_r, d)$:

$$\mathbb{E}\operatorname{tr}\!\left(\Gamma_{T'}^{1/2} (X_{m,T}^{\mathsf{T}} X_{m,T})^{-1} \Gamma_{T'}^{1/2}\right) \gtrsim \mathbb{E}\operatorname{tr}\!\left((W^{\mathsf{T}} \mathrm{BDiag}(\Theta_{r,T,T'}, m) W)^{-1}\right),$$

where $W \in \mathbb{R}^{mT' \times d}$ has iid $N(0,1)$ entries. This positions us to use the arguments in Section 7.2.2 again. We first focus on the $r = 1$ case. We reduce the problem to assuming $T' = T$ by observing that, since $\Gamma_t(I_n) = \frac{t+1}{2} I_n$ for any $t \in \mathbb{N}_+$, we have $\Theta_{1,T,T'} = \frac{T'+1}{T+1} \Theta_{1,T',T'}$. Therefore:

$$\mathbb{E}\operatorname{tr}\!\left((W^{\mathsf{T}} \mathrm{BDiag}(\Theta_{1,T,T'}, m) W)^{-1}\right) = \frac{T+1}{T'+1}\, \mathbb{E}\operatorname{tr}\!\left((W^{\mathsf{T}} \mathrm{BDiag}(\Theta_{1,T',T'}, m) W)^{-1}\right) \tag{7.11}$$
$$\approx \frac{T+1}{T'+1} \cdot \frac{n}{\mathrm{SP}(\mathrm{BDiag}(\Theta_{1,T',T'}, m), n)},$$

where again the $\approx$ notation highlights the heuristic nature of the bound, used to build intuition. To proceed, let $L_T \in \mathbb{R}^{T \times T}$ be the lower-triangular matrix of all ones, and define $S_T := (L_T L_T^{\mathsf{T}})^{-1}$. A computation yields that $\Theta_{1,T,T}^{-1} = \frac{T+1}{2} S_T$. Note that we can write $S_T$ as a rank-one perturbation of a tridiagonal matrix. Specifically, $S_T = \mathrm{Tri}(2, -1; T) - e_T e_T^{\mathsf{T}}$, where $\mathrm{Tri}(a, b; T)$ denotes the symmetric $T \times T$ tridiagonal matrix with $a$ on the diagonal and $b$ on the lower and upper off-diagonals. By the standard formula for the eigenvalues of a tridiagonal Toeplitz matrix, we have that $\lambda_{T-k+1}(\mathrm{Tri}(2, -1; T)) = 2\left(1 - \cos\frac{k\pi}{T+1}\right) \asymp k^2/T^2$. In Appendix C.7, we apply the work of Kulkarni et al. (1999) to show that the rank-one perturbation is negligible: $\lambda_{T-k+1}(S_T) \asymp k^2/T^2$ as well. Therefore $\lambda_{T-k+1}(\Theta_{1,T,T}^{-1}) \asymp k^2/T$. With this bound, we have:

$$n = \psi(\bar{x}/n; \mathrm{BDiag}(\Theta_{1,T,T}, m)) = \frac{\bar{x}}{n} \cdot m \sum_{i=1}^{T} \frac{1}{\lambda_i(\Theta_{1,T,T}^{-1}) + \bar{x}/n} \asymp \frac{\bar{x}}{n} \cdot m \sum_{i=1}^{T} \frac{1}{i^2/T + \bar{x}/n} \asymp \frac{\bar{x}}{n} \cdot m \int_{1}^{T} \frac{\mathrm{d}x}{x^2/T + \bar{x}/n} \asymp m \sqrt{T \cdot \frac{\bar{x}}{n}}.$$

(a) Plot of $T'd\,(\mathrm{SP})^{-1}$ versus $n$ ($m = 1$, $n = dr$, $T' = 2n$; fitted slopes: 2.05, 4.18, 6.42, 8.67, 10.81, 12.75, 14.38 for $r = 1, \dots, 7$).
(b) Plot of $T'd\,(\mathrm{SP})^{-1}$ versus $m$ ($n = 150$, $T' = 2n$; fitted slopes: 2.02, 3.88, 5.95, 7.89 for $r = 1, \dots, 4$).

Figure 3: Plot of $T'd\,(\mathrm{SP})^{-1}$ versus $n$ in (a) and versus $m$ in (b), both on a log-log scale. For (a), $m$ and $p$ are fixed to one, $d$ is fixed to $n/r$, and $T'$ is fixed to $2n$. For (b), $n$ is fixed to 150, $p$ is fixed to one, and $T'$ is fixed to $2n$. In the legends, the slope of each line (in log-log space), computed via linear regression, is shown. Based on these plots, we conjecture that $T'd\,(\mathrm{SP})^{-1} \asymp c_r (d/m)^{2r}$, where $c_r$ depends only on $r$.

This implies that $\bar{x}/n \asymp n^2/(m^2 T)$, and therefore by (7.8) and (7.9):

$$\mathrm{SP}(\mathrm{BDiag}(\Theta_{1,T,T}, m), n) = \frac{n}{\bar{x}/n} \asymp \frac{m^2 T}{n} \implies \mathbb{E}\operatorname{tr}\!\left(\Gamma_T^{1/2} (X_{m,T}^{\mathsf{T}} X_{m,T})^{-1} \Gamma_T^{1/2}\right) \gtrsim \frac{n^2}{m^2 T}.$$

We make this argument rigorous in Appendix C.8.

7.2.5 Beyond diagonalizability

When $r \geq 2$, the analytic complexity of characterizing the solution to the equation $n = \psi(\bar{x}/n; \mathrm{BDiag}(\Theta_{r,T,T}, m))$ increases significantly. Nevertheless, we can still solve for $\bar{x}/n$ by numerical root finding, in order to examine the scaling patterns for small values of $r$ and $T' = T$. This computation leads us to conjecture a general bound of $R(m, T, T'; \{P_x^{\mathrm{BDiag}(J_r, n/r)}\}) \gtrsim c_r n^{2r}/(m^{2r} T')$ when $m \lesssim n$, where $c_r$ is a constant depending only on $r$ (see Figure 3). A complete and precise statement is given in Appendix A.

8. Numerical simulation

We conduct a simple numerical simulation illustrating the benefits of multiple trajectories for learning. We construct a family of LDS-LS problem instances, parameterized by a scalar $\rho \in (0, \infty)$, as follows. The covariate distribution $P_x$ is the linear dynamical system $x_{t+1} = A x_t + w_t$ with:

$$A = U \operatorname{diag}(\underbrace{\rho, \dots, \rho}_{n/2 \text{ times}}, -\rho, \dots, -\rho)\, U^{\mathsf{T}}, \qquad U \sim \mathrm{Unif}(O(n)), \qquad w_t \sim N(0, I/4).$$

Here, $O(n)$ denotes the set of $n \times n$ orthogonal matrices. By construction, $\rho$ is the spectral radius of $A$. The labels $y_t$ are set as $y_t = x_{t+1}$, so that the ground truth $W_\star \in \mathbb{R}^{n \times n}$ is equal to $A$.
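The construction just described is straightforward to implement. The sketch below (our own minimal illustration; the paper's experiment uses jax, while we use numpy here) builds $A$ with a Haar-random orthogonal $U$, checks that $\rho$ is indeed its spectral radius, and fits OLS on stacked trajectory data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, T, rho = 5, 50, 0.9

# A = U diag(rho, ..., rho, -rho, ..., -rho) U^T with U a Haar-random orthogonal matrix
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = U @ np.diag([rho] * (n // 2) + [-rho] * (n - n // 2)) @ U.T

def ols_error(m):
    # stack m trajectories of x_{t+1} = A x_t + w_t with w_t ~ N(0, I/4),
    # then regress x_{t+1} on x_t (the OLS estimate of the ground truth A)
    X, Y = [], []
    for _ in range(m):
        x = rng.standard_normal(n) / 2
        for _ in range(T):
            x_next = A @ x + rng.standard_normal(n) / 2
            X.append(x)
            Y.append(x_next)
            x = x_next
    A_hat = np.linalg.lstsq(np.asarray(X), np.asarray(Y), rcond=None)[0].T
    return np.linalg.norm(A_hat - A)

spec_rad = max(abs(np.linalg.eigvals(A)))
e1, e50 = ols_error(1), ols_error(50)
print(spec_rad)   # equals rho by construction
print(e1, e50)    # estimation error shrinks as trajectories accumulate
```

Since $A = UDU^{\mathsf{T}}$ with $D$ diagonal, the eigenvalues of $A$ are exactly $\pm\rho$, and the OLS error decreases as more parallel trajectories are stacked.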
We compare the risk of the OLS estimator (3.6) on the LDS-LS problem instance with its risk on the corresponding Ind-LDS-LS baseline. Specifically, we plot the ratio between the OLS excess risks $\mathbb{E}[L(\hat{W}_{m,T}; T', P_x)]$ on the two problem instances. We fix the covariate dimension $n = 5$ and the trajectory horizon length $T = 10n$, and vary the number of trajectories $m \in \{1, \dots, 10\}$. Figure 4 shows the result of this experiment, where we also vary $\rho \in \{0.98, 0.99, 1.0, 1.01, 1.02\}$. The error bars are computed over 1000 trials. All computations are implemented using jax (Bradbury et al., 2018), and run with float64 precision on a single machine.

Figure 4: Plot of the ratio of excess risk for LDS-LS problem instances over the corresponding Ind-LDS-LS baseline instances, as a function of the number of trajectories $m$, holding both the covariate dimension $n = 5$ and the horizon length $T = 10n$ fixed. The vertical blue line marks the transition between few trajectories ($m \lesssim n$) and many trajectories ($m \gtrsim n$).

In Figure 4, we see that in the few-trajectories regime ($m \lesssim n$), appearing to the left of the vertical blue line, the instability of the covariate process plays an outsized role in determining the value of the ratio. On the other hand, in the many-trajectories regime ($m \gtrsim n$), appearing to the right of the blue line, the ratios quickly converge to a constant no greater than two (at $m = 10$). This behavior is consistent with Theorem 5.5. Finally, Theorem 6.3 suggests that the scaling behavior of the $\rho = 1$ curve with respect to $m$ is on the order of $1/m$.

9. Concluding remarks

Having sharply characterized the worst-case excess risk of Seq-LS and LDS-LS, we see more precisely the trade-offs (or, arguably, the lack thereof) presented by resetting a system, or by simply observing parallel runs from one, where possible.
After sufficient resets, one learns roughly as though the examples were independent altogether (as reflected in the Ind-Seq-LS and Ind-LDS-LS baselines). In addition to the theoretical upshot that it presents, this phenomenon seems encouraging insofar as the setup may describe reality: one does not learn to ride a bicycle by witnessing thousands of unrelated pedal strokes, nor by watching one cyclist endure the entire Tour de France, but rather by seeing and attempting many moderate rides and maneuvers.

We see a number of future directions for research, primarily in further charting out the reach of the iid-like phenomenon in learning from multiple sequences. Our work offers the trajectory small-ball criterion (Definition 4.1) as a vehicle for proving that this phenomenon occurs, or otherwise for bounding the minimax rate from above. What other notable sequential processes, outside of those covered in Section 4.1, can we capture as trajectory small-ball instances? One might look to covariate sequences generated by, say, input-to-state stable (ISS) non-linear systems, stochastic polynomial difference equations, or various Markov decision processes. On the flip side, when must we necessarily pay a price for dependent data? One answer from our work is that a necessary gap between independent and sequentially dependent learning appears when there are insufficiently many trajectories ($m \lesssim n$). As outlined in Section 7.2.5 and Appendix A, we conjecture that this gap can be made much wider, namely by considering non-diagonalizable linear dynamical systems. That said, other pertinent problems may exhibit gaps as well. Finding them would help inform where the limits of learning from sequential data lie. On the regression side, one might look to move beyond a well-specified linear regression model, extend to other loss functions, analyze regularized least-squares estimators in place of OLS, or consider a more adversarial analysis (e.g.
measuring regret rather than risk, in an online setting).

Acknowledgments

We thank Vikas Sindhwani for organizing a lecture series on learning and control, during which this work was prompted, and for encouraging us to pursue the ensuing questions. We also thank Sumeet Singh for pointing out that the claimed first-order optimality conditions in (7.6) and (7.7) require further conditions to hold for nonconvex/nonconcave games, e.g., strictly concave stationary points of the inner maximization problem. Finally, we thank the anonymous reviewers of this manuscript for their constructive feedback, which helped improve the clarity of our exposition. M. Soltanolkotabi is supported by the Packard Fellowship in Science and Engineering, a Sloan Research Fellowship in Mathematics, an NSF-CAREER award (#1846369), the DARPA Learning with Less Labels (LwLL) and FastNICs programs, and NSF-CIF awards #1813877 and #2008443.

References

Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Online least squares estimation with self-normalized processes: An application to bandit problems. In Conference on Learning Theory, 2011.

Peter L. Bartlett, Philip M. Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020.

David Belanger and Sham Kakade. A linear dynamical system model for text. In International Conference on Machine Learning, 2015.

Adam Block, Max Simchowitz, and Russ Tedrake. Smoothed online learning for prediction in piecewise affine systems. In Neural Information Processing Systems, 2023.

James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.

Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean.
Large language models in machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), 2007.

Guy Bresler. Efficiently learning Ising models on arbitrary graphs. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, 2015.

Guy Bresler, Prateek Jain, Dheeraj Nagaraj, Praneeth Netrapalli, and Xian Wu. Least squares regression with Markovian data: Fundamental limits and algorithms. In Neural Information Processing Systems, 2020.

Emmanuel J. Candès, Justin K. Romberg, and Terence Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1207–1223, 2006.

Anthony Carbery and James Wright. Distributional and $L^q$ norm inequalities for polynomials over convex bodies in $\mathbb{R}^n$. Mathematical Research Letters, 8:233–248, 2001.

Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. In Neural Information Processing Systems, 2021.

Yanxi Chen and H. Vincent Poor. Learning mixtures of linear dynamical systems. In International Conference on Machine Learning, 2022.

Yuval Dagan, Constantinos Daskalakis, Nishanth Dikkala, and Siddhartha Jayanti. Learning from weakly dependent data under Dobrushin's condition. In Conference on Learning Theory, 2019.

Yuval Dagan, Constantinos Daskalakis, Nishanth Dikkala, Surbhi Goel, and Anthimos Vardis Kandiros. Statistical estimation from dependent data. In International Conference on Machine Learning, 2021a.

Yuval Dagan, Constantinos Daskalakis, Nishanth Dikkala, and Anthimos Vardis Kandiros. Learning Ising models from one or multiple samples. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, 2021b.

Sanjoy Dasgupta and Anupam Gupta.
An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures & Algorithms, 22(1):60–65, 2003.

Constantinos Daskalakis, Nishanth Dikkala, and Ioannis Panageas. Regression from dependent observations. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, 2019.

Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. On the sample complexity of the linear quadratic regulator. Foundations of Computational Mathematics, 20:633–679, 2020.

John C. Duchi, Alekh Agarwal, Mikael Johansson, and Michael I. Jordan. Ergodic mirror descent. SIAM Journal on Optimization, 22(4):1549–1578, 2012.

Mohamad Kazem Shirani Faradonbeh, Ambuj Tewari, and George Michailidis. Finite time identification in unstable linear systems. Automatica, 96:342–353, 2018.

Dylan J. Foster, Alexander Rakhlin, and Tuhin Sarkar. Learning nonlinear dynamical systems from a single trajectory. In Learning for Dynamics & Control Conference, 2020.

Udaya Ghai, Holden Lee, Karan Singh, Cyril Zhang, and Yi Zhang. No-regret prediction in marginally stable systems. In Conference on Learning Theory, 2020.

Promit Ghosal and Sumit Mukherjee. Joint estimation of parameters in Ising model. The Annals of Statistics, 48(2):785–810, 2020.

Joshua Glaser, Matthew Whiteway, John P. Cunningham, Liam Paninski, and Scott Linderman. Recurrent switching dynamical systems models for multiple interacting neural populations. In Neural Information Processing Systems, 2020.

Alexander Goldenshluger and Assaf Zeevi. Nonasymptotic bounds for autoregressive time series modeling. The Annals of Statistics, 29(2):417–444, 2001.

Rodrigo A. González and Cristian R. Rojas. A finite-sample deviation bound for stable autoregressive processes. In Learning for Dynamics and Control, 2020.

Elad Hazan, Karan Singh, and Cyril Zhang. Learning linear dynamical systems via spectral filtering. In Neural Information Processing Systems, 2017.

Elad Hazan, Holden Lee, Karan Singh, Cyril Zhang, and Yi Zhang.
Spectral filtering for general linear dynamical systems. In Neural Information Processing Systems, 2018.

Daniel Hsu, Sham M. Kakade, and Tong Zhang. Random design analysis of ridge regression. Foundations of Computational Mathematics, 14:569–600, 2014.

Prateek Jain, Suhas S. Kowshik, Dheeraj Nagaraj, and Praneeth Netrapalli. Near-optimal offline and streaming algorithms for learning non-linear dynamical systems. In Neural Information Processing Systems, 2021.

Michael Janner, Qiyang Li, and Sergey Levine. Offline reinforcement learning as one big sequence modeling problem. In Neural Information Processing Systems, 2021.

Yassir Jedra and Alexandre Proutiere. Sample complexity lower bounds for linear system identification. In 2019 IEEE 58th Conference on Decision and Control (CDC), 2019.

Yassir Jedra and Alexandre Proutiere. Finite-time identification of stable linear systems: Optimality of the least-squares estimator. In 2020 59th IEEE Conference on Decision and Control (CDC), 2020.

Chi Jin, Praneeth Netrapalli, and Michael I. Jordan. What is local optimality in nonconvex-nonconcave minimax optimization? In International Conference on Machine Learning, 2020.

Rafal Józefowicz, Oriol Vinyals, Mike Schuster, Noam M. Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.

S. Mohammad Khansari-Zadeh and Aude Billard. Learning stable nonlinear dynamical systems with Gaussian mixture models. IEEE Transactions on Robotics, 27(5):943–957, 2011.

Vladimir Koltchinskii and Shahar Mendelson. Bounding the smallest singular value of a random matrix without concentration. International Mathematics Research Notices, 2015(23):12991–13008, 2015.

Devadatta Kulkarni, Darrell Schmidt, and Sze-Kai Tsui. Eigenvalues of tridiagonal pseudo-Toeplitz matrices. Linear Algebra and its Applications, 297(1–3):63–80, 1999.

Vitaly Kuznetsov and Mehryar Mohri. Generalization bounds for non-stationary mixing processes.
Machine Learning, 106:93–117, 2017.

Tze Leung Lai and Ching Zong Wei. Least squares estimates in stochastic regression models with applications to identification and control of dynamic systems. The Annals of Statistics, 10(1):154–166, 1982.

Tze Leung Lai and Ching Zong Wei. Asymptotic properties of general autoregressive models and strong consistency of least-squares estimates of their parameters. Journal of Multivariate Analysis, 13(1):1–23, 1983.

Béatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics, 28(5):1302–1338, 2000.

Nevena Lazic, Craig Boutilier, Tyler Lu, Eehern Wong, Binz Roy, MK Ryu, and Greg Imwalle. Data center cooling using model-predictive control. In Neural Information Processing Systems, 2018.

Yingying Li, Tianpeng Zhang, Subhro Das, Jeff Shamma, and Na Li. Non-asymptotic system identification for linear systems with nonlinear policies. IFAC-PapersOnLine, 56(2):1672–1679, 2023.

Scott Linderman, Matthew Johnson, Andrew Miller, Ryan Adams, David Blei, and Liam Paninski. Bayesian learning and inference in recurrent switching linear dynamical systems. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017.

Lennart Ljung. System Identification: Theory for the User. Pearson, 1998.

Jan R. Magnus. The moments of products of quadratic forms in normal variables. Statistica Neerlandica, 32(4):201–210, 1978.

Louis Massucci, Fabien Lauer, and Marion Gilson. A statistical learning perspective on switched linear system identification. Automatica, 145:110532, 2022.

V. John Mathews and Giovanni L. Sicuranza. Polynomial Signal Processing. Wiley, 2000.

Daniel J. McDonald, Cosma Rohilla Shalizi, and Mark Schervish. Nonparametric risk bounds for time-series forecasting. Journal of Machine Learning Research, 18:1–40, 2017.

Ron Meir. Nonparametric time series prediction through adaptive model selection.
Machine Learning, 39:5–34, 2000.

Shahar Mendelson. Learning without concentration. Journal of the ACM, 62(3):1–25, 2015.

Sean P. Meyn and Richard L. Tweedie. Markov Chains and Stochastic Stability. Springer-Verlag, 1993.

Aditya Modi, Mohamad Kazem Shirani Faradonbeh, Ambuj Tewari, and George Michailidis. Joint learning of linear time-invariant dynamical systems. arXiv preprint arXiv:2112.10955, 2022.

Mehryar Mohri and Afshin Rostamizadeh. Rademacher complexity bounds for non-i.i.d. processes. In Neural Information Processing Systems, 2008.

Mehryar Mohri and Afshin Rostamizadeh. Stability bounds for stationary φ-mixing and β-mixing processes. Journal of Machine Learning Research, 11(26):789–814, 2010.

Jaouad Mourtada. Exact minimax risk for linear least squares, and the lower tail of sample covariance matrices. The Annals of Statistics, 50(4):2157–2178, 2022.

Duy Nguyen-Tuong and Jan Peters. Model learning for robot control: a survey. Cognitive Processing, 12:319–340, 2011.

Roberto Imbuzeiro Oliveira. The lower tail of random quadratic forms with applications to ordinary least squares. Probability Theory and Related Fields, 166(3):1175–1194, 2016.

Takayuki Osa, Joni Pajarinen, Gerhard Neumann, J. Andrew Bagnell, Pieter Abbeel, and Jan Peters. An algorithmic perspective on imitation learning. Foundations and Trends in Robotics, 7(1–2):1–179, 2018.

Samet Oymak and Necmiye Ozay. Non-asymptotic identification of LTI systems from a single trajectory. In 2019 American Control Conference (ACC), 2019.

Peter C.B. Phillips and Tassos Magdalinos. Inconsistent VAR regression with common explosive roots. Econometric Theory, 29:808–837, 2013.

Dean A. Pomerleau. ALVINN: An autonomous land vehicle in a neural network. In Neural Information Processing Systems, 1989.

Wilson J. Rugh. Nonlinear System Theory: The Volterra/Wiener Approach. The Johns Hopkins University Press, 1981.

Wilson J. Rugh. Linear System Theory. Prentice Hall, 2nd
edition, 1996.

Tuhin Sarkar and Alexander Rakhlin. Near optimal finite time identification of arbitrary linear dynamical systems. In International Conference on Machine Learning, 2019.

Tuhin Sarkar, Alexander Rakhlin, and Munther A. Dahleh. Finite time LTI system identification. Journal of Machine Learning Research, 22:1–61, 2021.

Yahya Sattar and Samet Oymak. Non-asymptotic and accurate learning of nonlinear dynamical systems. arXiv preprint arXiv:2002.08538, 2020.

Yahya Sattar, Samet Oymak, and Necmiye Ozay. Finite sample identification of bilinear dynamical systems. In 2022 IEEE 61st Conference on Decision and Control (CDC), 2022.

Cosma Rohilla Shalizi. Advanced Data Analysis from an Elementary Point of View. 2021. https://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/ADAfaEPoV.pdf.

Max Simchowitz, Horia Mania, Stephen Tu, Michael I. Jordan, and Benjamin Recht. Learning without mixing: Towards a sharp analysis of linear system identification. In Conference on Learning Theory, 2018.

Max Simchowitz, Ross Boczar, and Benjamin Recht. Learning linear dynamical systems with semi-parametric least squares. In Conference on Learning Theory, 2019.

Ingo Steinwart and Andreas Christmann. Fast learning from non-i.i.d. observations. In Neural Information Processing Systems, 2009.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Neural Information Processing Systems, 2014.

Christos Thrampoulidis, Samet Oymak, and Babak Hassibi. A tight version of the Gaussian min-max theorem in the presence of convexity. Technical report, Caltech, 2014.

Anastasios Tsiamis and George J. Pappas. Finite sample analysis of stochastic system identification. In 2019 IEEE 58th Conference on Decision and Control (CDC), 2019.

Anastasios Tsiamis and George J. Pappas. Linear systems can be hard to learn. arXiv preprint arXiv:2104.01120, 2021.

Stephen Tu and Ross Boczar.
An elementary proof of anti-concentration for degree two non-negative Gaussian polynomials. arXiv preprint arXiv:2301.05992, 2023.

Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press, 2019.

Lei Xin, George Chiu, and Shreyas Sundaram. Learning the dynamics of autonomous linear systems from multiple trajectories. arXiv preprint arXiv:2203.12794, 2022.

Yu Xing, Benjamin Gravell, Xingkang He, Karl Henrik Johansson, and Tyler Summers. Identification of linear systems with multiplicative noise from multiple trajectory data. arXiv preprint arXiv:2106.16078, 2021.

Bin Yu. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, 22(1):94–116, 1994.

Tong Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2:527–550, 2002.

Yang Zheng and Na Li. Non-asymptotic identification of linear dynamical systems using multiple trajectories. IEEE Control Systems Letters, 5(5):1693–1698, 2021.

Ingvar Ziemann and Stephen Tu. Learning with little mixing. In Neural Information Processing Systems, 2022.

Ingvar Ziemann, Henrik Sandberg, and Nikolai Matni. Single trajectory nonparametric learning of nonlinear dynamics. In Conference on Learning Theory, 2022.

Appendix A. Beyond diagonalizability: a conjecture for the general case

Recall that various results in Section 5.1 required a diagonalizability assumption (Assumption 5.2) on the dynamics matrix $A$, specifically in the many trajectories regime when $T' > T$ (Theorem 5.5), and in the few trajectories regime (Section 5.1.2). In this section, we conjecture how removing the diagonalizability assumption would affect the results. For simplicity, we focus on the few trajectories regime, and further assume that $T' = T$.
Building on potential extensions of this paper's analysis, and on the numerical evidence detailed in Section 7, we conjecture the following extensions of Theorem 5.6 and Theorem 6.3:

Conjecture A.1 (Risk for LDS-LS with few trajectories, general case). There are universal positive constants $c_0, c_1, c_2, c_3$, and a universal mapping $\phi: \mathbb{N}_+ \to \mathbb{R}_{>0}$, such that the following holds for any instance of LDS-LS satisfying Assumption 5.1 (marginal stability) and Assumption 5.3 (one-step controllability). Let $A = SJS^{-1}$ denote the Jordan normal form of the dynamics matrix $A$. Define $\gamma := \lambda_{\max}(S^{-1} B B^{\mathsf{T}} S^{-\mathsf{T}}) / \lambda_{\min}(S^{-1} B B^{\mathsf{T}} S^{-\mathsf{T}})$, and let $r$ be the size of the largest Jordan block in $J$. If $n \geq c_0$, $m \leq c_1 n$, and $mT \geq c_2 n$, then:

$$\mathbb{E}[L(\hat{W}_{m,T}; T', P_x^{A,B})] \leq c_3 \sigma_\xi^2 \phi(r)\, \gamma\, \frac{p n^{2r}}{m^{2r} T'}. \tag{A.1}$$

Additionally, there exist universal positive constants $c_0', c_1', c_2', c_3'$, and $c_4'$ such that the following is true. Suppose $\mathcal{A} \subseteq \mathbb{R}^{n \times n}$ is any set containing all $n \times n$ matrices with Jordan blocks of size at most $r$. Let $T \geq c_0'$, $n \geq c_1'$, $mT \geq c_2' n$, and $m \leq c_3' n$. Then:

$$R(m, T, T'; \{P_x^A \mid A \in \mathcal{A}\}) \geq c_4' \sigma_\xi^2 \phi(r)\, \gamma\, \frac{p n^{2r}}{m^{2r} T'}. \tag{A.2}$$

Lemma 5.1 provides a viable path towards proving the upper bound (A.1) from Conjecture A.1, up to logarithmic factors, in the regime of constant Jordan block size $r$, by reducing the problem to understanding the scaling of $\lambda(k, t; A, B) = \lambda(\Gamma_k(A, B), \Gamma_t(A, B))$ when $k \leq t$. Our analysis uses diagonalizability (Assumption 5.2) of the dynamics matrices to show that $\lambda(k, t; A, B) \gtrsim \gamma^{-1}\, k/t$. Without such an assumption, analyzing $\lambda(k, t; A, B)$ is substantially more involved. A numerical simulation (Figure 5) suggests that $\lambda(k, t; A, B)$ instead scales as $(k/t)^{2r-1}$.

Figure 5: A plot of the ratio $\alpha$ versus $1/\lambda(k, t)$, with $k$ fixed to 5 and $t$ fixed to $k\alpha$, where $\lambda(k, t) := \lambda(k, t; J_r, I_r)$. The slope of each line (in log-log space), computed via linear regression, is reported: 1.00, 3.02, 5.04, 7.06, 9.09, 11.13, 13.18 for $r = 1, \dots, 7$.
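The quantity behind Figure 5 can be reproduced in a few lines. The sketch below is our own illustration; it assumes $\Gamma_t(A, B) = \frac{1}{t}\sum_{s=1}^{t}\sum_{i=0}^{s-1} A^i B B^{\mathsf{T}} (A^{\mathsf{T}})^i$ with $B = I$ (a normalization consistent with $\Gamma_t(I_n) = \frac{t+1}{2} I_n$ in Section 7.2.4) and estimates the log-log slope of $1/\lambda(k, t; J_r, I_r)$ in $\alpha = t/k$:

```python
import numpy as np

def gamma_t(A, t):
    # Gamma_t(A, I) = (1/t) * sum_{s=1}^{t} sum_{i=0}^{s-1} A^i (A^T)^i
    r = A.shape[0]
    P, C, G = np.eye(r), np.zeros((r, r)), np.zeros((r, r))
    for _ in range(t):
        C = C + P @ P.T      # running sum of A^i A^{i T}
        G = G + C
        P = A @ P
    return G / t

def lam(k, t, A):
    # lambda(k, t; A, I) = lambda_min(Gamma_t^{-1/2} Gamma_k Gamma_t^{-1/2})
    L = np.linalg.cholesky(gamma_t(A, t))
    M = np.linalg.solve(L, np.linalg.solve(L, gamma_t(A, k)).T)
    return np.linalg.eigvalsh(M).min()

k, alphas = 5, np.array([4.0, 8.0, 16.0, 32.0])
slopes = {}
for r in (1, 2):
    Jr = np.eye(r) + np.diag(np.ones(r - 1), 1)  # Jordan block with eigenvalue 1
    vals = [1.0 / lam(k, int(k * a), Jr) for a in alphas]
    slopes[r] = np.polyfit(np.log(alphas), np.log(vals), 1)[0]
print(slopes)  # close to 2r - 1: about 1 for r = 1, about 3 for r = 2
```

For $r = 1$ the ratio is exactly $(t+1)/(k+1)$ (slope 1); for $r = 2$ the fitted slope approaches 3, in line with the conjectured $(k/t)^{2r-1}$ rate.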
We conjecture that, in general, $\lambda(k, t; A, B) \asymp c_r \gamma^{-1} (k/t)^{2r-1}$ is the correct rate for dynamics matrices $A$ with Jordan blocks of size at most $r$, where $c_r$ is a constant depending only on $r$. Assuming this scaling to be correct, plugging the rate into Lemma 5.1 yields (A.1) up to logarithmic factors. Partial progress towards analyzing $\lambda(k, t; A, B)$ was made in Sarkar and Rakhlin (2019, Proposition 7.6), where it is shown that $\lambda(k, t; A, B) \gtrsim c_r \gamma^{-1} (k/t)^{r^2}$, with $1/c_r$ depending exponentially on $r$. We do not conjecture a form for the mapping $\phi(r)$; $\lambda(k, t; A, B)$ becomes numerically ill-conditioned when $r$ is large, hindering simulation with large blocks. On the other hand, the analytic arguments in Section 7.2.4, combined with the numerical evidence in Figure 3, suggest that the bound (A.2) holds (up to the condition number factor $\gamma$). The one caveat is that, even if we were to analytically characterize the eigenvalues of $\Theta_{r,T,T'}$ for all $r$, our proof strategy would most likely not be able to give a sharp characterization of the leading constant $\phi(r)$ in the lower bound. This is because our proof inherently exploits the independence between decoupled subsystems, and does not tackle the harder challenge of understanding the coupling effects within a Jordan block. We conclude this section by noting that Conjecture A.1 does not include any logarithmic factors in the upper bound rate (A.1), and includes the condition number factor $\gamma$ in the lower bound (A.2). In other words, Conjecture A.1 applied to the special case of $r = 1$ conjectures that Theorem 5.6 is loose by a factor of $\log^2(\gamma n/m)$, and that Theorem 6.3 is loose by a factor of $\gamma$.

Appendix B. Analysis for upper bounds

B.1 Preliminaries

We collect various technical results which we will use in the proofs of the upper bounds. The first result gives us a bound on the functional inverse of $T \mapsto T/\log T$.

Proposition B.1 (Simchowitz et al. (2018, Lemma A.4)).
For $\alpha \geq 1$, $T \geq 2\alpha \log(4\alpha)$ implies that $T \geq \alpha \log T$.

The next two results study various properties of functions involving $\lambda$.

Proposition B.2. For $A \in \mathrm{Sym}^n_{>0}$, the map $X \mapsto \lambda(X, A)$ is concave over symmetric matrices.

Proof. Observe that $\lambda(X, A) = \lambda_{\min}(A^{-1/2} X A^{-1/2}) = \inf\{\langle X, A^{-1/2} v v^{\mathsf{T}} A^{-1/2} \rangle \mid v \in S^{n-1}\}$ is the pointwise infimum over a set of linear functions in $X$, and is therefore concave. ∎

Proposition B.3. Fix $T \in \mathbb{N}_+$, $\{\Psi_t\}_{t=1}^T \subseteq \mathrm{Sym}^n_{>0}$, and $\Gamma \in \mathrm{Sym}^n_{>0}$. Suppose that $\frac{1}{T}\sum_{t=1}^T \Psi_t \preccurlyeq \Gamma$. Then $\left[\prod_{t=1}^T \lambda(\Psi_t, \Gamma)\right]^{1/T} \leq 1$.

Proof. We have that:

$$\left[\prod_{t=1}^T \lambda(\Psi_t, \Gamma)\right]^{1/T} \leq \frac{1}{T}\sum_{t=1}^T \lambda(\Psi_t, \Gamma) \leq \lambda\!\left(\frac{1}{T}\sum_{t=1}^T \Psi_t, \Gamma\right) \leq \lambda(\Gamma, \Gamma) = 1,$$

where the first inequality is the AM-GM inequality, the second uses Proposition B.2 and Jensen's inequality, and the third uses $\frac{1}{T}\sum_{t=1}^T \Psi_t \preccurlyeq \Gamma$. ∎

The next result relates the anti-concentration properties of a non-negative random variable to its moment generating function on $(-\infty, 0)$.

Proposition B.4 (Mourtada (2022, Lemma 7)). Let $X$ be a non-negative random variable. Suppose there exist an $\alpha \in (0, 1]$ and a positive constant $c$ such that $\mathbb{P}(X \leq t) \leq (ct)^\alpha$ for all $t > 0$. Then $\mathbb{E}[\exp(-\eta X)] \leq (c/\eta)^\alpha$ for all $\eta > 0$.

The next few results involve various properties of Gaussian and spherical distributions.

Proposition B.5 (Magnus (1978, Lemma 6.2)). For $w \sim N(0, I)$ and symmetric matrices $A, B$:

$$\mathbb{E}[w^{\mathsf{T}} A w \cdot w^{\mathsf{T}} B w] = 2\langle A, B \rangle + \operatorname{tr}(A)\operatorname{tr}(B).$$

Proposition B.6 (Dasgupta and Gupta (2003, Lemma 2.2)). Let $n \geq 2$ and $v \in \mathbb{R}^n \setminus \{0\}$ be fixed. Suppose that $\psi$ is drawn from the uniform measure over $S^{n-1}$. We have that for all $\varepsilon > 0$:

$$\mathbb{P}\left\{\langle v, \psi \rangle^2 \leq \frac{\varepsilon}{n} \|v\|_2^2\right\} \leq (e\varepsilon)^{1/2}.$$

Next, we state a classic result which gives us anti-concentration of arbitrary Gaussian (more generally, any log-concave distribution) polynomials of bounded degree.

Theorem B.7 (Carbery and Wright (2001, Theorem 8)). Fix an integer $d \in \mathbb{N}_+$. There exists a universal positive constant $c$ such that the following is true. Let $p: \mathbb{R}^n \to \mathbb{R}$ be a degree-$d$ polynomial, and let $\varepsilon > 0$. We have:

$$\mathbb{P}\{|p(x)| \leq \varepsilon\, \mathbb{E}|p(x)|\} \leq c\, d\, \varepsilon^{1/d}, \quad x \sim N(0, I_n).$$

For the case when $d = 2$ and $p$ is non-negative, the bound can be taken to be $(e\varepsilon)^{1/2}$. Theorem B.7 can be further specialized as follows.
Suppose $w \sim N(0, I)$, $x$ is fixed, and $Q = \begin{pmatrix} Q_{11} & Q_{12} \\ Q_{12}^{\mathsf{T}} & Q_{22} \end{pmatrix}$ is positive semidefinite. Then:

$$\mathbb{P}\left\{ \begin{pmatrix} w \\ x \end{pmatrix}^{\mathsf{T}} Q \begin{pmatrix} w \\ x \end{pmatrix} \leq \varepsilon\, \mathbb{E}\left[ \begin{pmatrix} w \\ x \end{pmatrix}^{\mathsf{T}} Q \begin{pmatrix} w \\ x \end{pmatrix} \right] \right\} \leq (e\varepsilon)^{1/2} \quad \forall \varepsilon > 0. \tag{B.1}$$

Both (B.1) and the explicit constant in Theorem B.7 for $d = 2$ and $p$ non-negative can be derived by bounding the MGF of various Gaussian quadratic forms; see e.g. Tu and Boczar (2023).

Next, we state a well-known result from Abbasi-Yadkori et al. (2011), which yields an anytime bound for the size of a self-normalized martingale difference sequence (MDS).

Lemma B.8 (Abbasi-Yadkori et al. (2011, Theorem 3)). Fix a $\delta \in (0, 1)$ and a positive definite matrix $V \in \mathbb{R}^{d \times d}$. Let $\{x_t\}_{t \geq 1} \subseteq \mathbb{R}^d$ be a stochastic process adapted to the filtration $\{\mathcal{F}_t\}_{t \geq 1}$. Let $\{\eta_t\}_{t \geq 1} \subseteq \mathbb{R}$ be a martingale difference sequence adapted to $\{\mathcal{F}_t\}_{t \geq 2}$. Suppose there exists $R > 0$ such that $\mathbb{E}[\exp(\lambda \eta_t) \mid \mathcal{F}_t] \leq \exp(\lambda^2 R^2 / 2)$ a.s. for all $\lambda \in \mathbb{R}$ and $t \geq 1$. Define $V_t := \sum_{k=1}^t x_k x_k^{\mathsf{T}}$ for $t \geq 1$. With probability at least $1 - \delta$, for all $t \geq 1$:

$$\left\| \sum_{k=1}^t \eta_k x_k \right\|_{(V_t + V)^{-1}}^2 \leq 2 R^2 \log\!\left( \frac{\det(V_t + V)^{1/2} \det(V)^{-1/2}}{\delta} \right).$$

Lemma B.8 is generalized to vector-valued self-normalized MDS via a covering argument:

Proposition B.9 (Sarkar and Rakhlin (2019, Proposition 8.2)). Fix a $\delta \in (0, 1)$ and a positive definite matrix $V \in \mathbb{R}^{d \times d}$. Let $\{x_t\}_{t \geq 1} \subseteq \mathbb{R}^d$ be a stochastic process adapted to the filtration $\{\mathcal{F}_t\}_{t \geq 1}$. Let $\{\eta_t\}_{t \geq 1} \subseteq \mathbb{R}^p$ be a stochastic process adapted to $\{\mathcal{F}_t\}_{t \geq 2}$. Suppose that for every fixed $v \in S^{p-1}$ and every $t \geq 1$: (a) $\mathbb{E}[\langle v, \eta_t \rangle \mid \mathcal{F}_t] = 0$ a.s., and (b) $\mathbb{E}[\exp(\lambda \langle v, \eta_t \rangle) \mid \mathcal{F}_t] \leq \exp(\lambda^2 R^2 / 2)$ a.s. for every $\lambda \in \mathbb{R}$. Define $V_t := \sum_{k=1}^t x_k x_k^{\mathsf{T}}$ for $t \geq 1$. With probability at least $1 - \delta$, for all $t \geq 1$:

$$\left\| \sum_{k=1}^t \eta_k x_k^{\mathsf{T}} (V_t + V)^{-1/2} \right\|_{\mathrm{op}} \leq 2 \sqrt{2 R^2 \log\!\left( \frac{5^p \det(V_t + V)^{1/2} \det(V)^{-1/2}}{\delta} \right)}.$$

The next result assumes $V_t \succcurlyeq V$ (so that in particular $V_t$ is invertible) in order to simplify Proposition B.9.

Proposition B.10. Under the same hypotheses as Proposition B.9, with probability at least $1 - \delta$, for all $t \geq 1$ such that $V_t \succcurlyeq V$:

$$\left\| \sum_{k=1}^t \eta_k x_k^{\mathsf{T}} V_t^{-1/2} \right\|_{\mathrm{op}} \leq 2\sqrt{2}\, \sqrt{2 R^2 \log\!\left( \frac{5^p \det(V_t + V)^{1/2} \det(V)^{-1/2}}{\delta} \right)}.$$

Proof. Observe that when $V_t \succcurlyeq V$, we have $2 V_t \succcurlyeq V_t + V$, and hence $V_t^{-1} \preccurlyeq 2 (V_t + V)^{-1}$.
For two positive definite matrices M1 and M2 satisfying M1 ⪯ M2, and any matrix N,

‖N M1^{1/2}‖_op = √(λmax(N M1 N^T)) ≤ √(λmax(N M2 N^T)) = ‖N M2^{1/2}‖_op.

Therefore:

‖ Σ_{k=1}^t η_k x_k^T V_t^{−1/2} ‖_op ≤ √2 ‖ Σ_{k=1}^t η_k x_k^T (V_t + V)^{−1/2} ‖_op ≤ √2 ( 8R² log( 5^p det(V_t + V)^{1/2} det(V)^{−1/2} / δ ) )^{1/2},

where the last inequality holds for every t with probability at least 1 − δ by Proposition B.9.

B.2 Examples of trajectory small-ball

In this section, we prove that the examples listed in Section 4.1 satisfy the trajectory small-ball condition (Definition 4.1).

Example 4.1 (Copies of a Gaussian draw). Let Σ ∈ Sym^n_{>0}, and let P_x denote the process x₁ ∼ N(0, Σ) and x_t = x_{t−1} for t ≥ 2. Fix any T ∈ N_+. Then P_x satisfies the (T, T, Σ, e, 1/2)-Traj-SB condition.

Proof When k = T and Γ = I_n, the condition (4.1) simplifies to:

sup_{v ∈ S^{n−1}} P{ (1/T) Σ_{t=1}^T ⟨v, x_t⟩² ≤ ε } ≤ (c_sb ε)^α for all ε > 0.

Since x₁ = x₂ = ⋯ = x_T, this further simplifies to:

sup_{v ∈ S^{n−1}} P{ ⟨v, x₁⟩² ≤ ε } ≤ (c_sb ε)^α for all ε > 0.

Since ⟨v, x₁⟩ ∼ N(0, 1), Equation (B.1) yields that P_{X∼N(0,1)}{X² ≤ ε} ≤ (eε)^{1/2}, so we can take c_sb = e and α = 1/2.

Example 4.2 (Gaussian processes). Let P_x be a Gaussian process indexed by time, i.e., for every finite index set I ⊂ N_+, the collection of random variables (x_t)_{t∈I} is jointly Gaussian. Let T_nd := inf{ t ∈ N_+ | det(E[x_t x_t^T]) ≠ 0 }, and suppose T_nd is finite. Fix a T ∈ N_+ satisfying T ≥ T_nd. Then P_x satisfies the (T, T, Γ_T(P_x), 2e, 1/2)-Traj-SB condition.

Proof Since the covariates (x₁, …, x_T) are jointly Gaussian, we can write x_t = μ_t + M_t w, where μ₁, …, μ_T ∈ R^n and M₁, …, M_T ∈ R^{n×nT} are fixed, and w ∼ N(0, I_{nT}). For any v ∈ R^n,

(1/T) Σ_{t=1}^T ⟨v, x_t⟩² = (1/T) Σ_{t=1}^T ⟨v, μ_t + M_t w⟩².

This is a degree-2 non-negative polynomial in w, and therefore by Theorem B.7, for all ε > 0:

P{ (1/T) Σ_{t=1}^T ⟨v, x_t⟩² ≤ ε E[ (1/T) Σ_{t=1}^T ⟨v, x_t⟩² ] } ≤ (eε)^{1/2}.

Example 4.4 (Alternating halfspaces). Suppose that n ≥ 4 is even, and let u₁, …, u_n be a fixed orthonormal basis of R^n. Put U₀ = span(u₁, …, u_{n/2}) and U₁ = span(u_{n/2+1}, …, u_n).
Let i1 Bern(1 2), it+1 = (it + 1) mod 2 for t N+, and let Px denote the process with conditional distribution xt | it uniform over the spherical measure on Uit Sn 1. For any T 2, the process Px satisfies the (T, 2, In/(2n), e, 1 2)-Traj SB condition. Tu, Frostig, and Soltanolkotabi Proof For i {0, 1}, let ψi be uniform on the uniform measure over Ui Sn 1, let PUi denote the orthogonal projector onto Ui, and let vi = PUiv. Fix any v Rn\{0}. We observe that for any t N+, v, xt+1 2+ v, xt+2 2 | it is equal in distribution to v, ψ0 2 + v, ψ1 2, which itself is equal in distribution to v0, ψ0 2 + v1, ψ1 2. Suppose first that v0 2 v1 2. Then, since v 2 2 = v0 2 2 + v1 2 2 2 v0 2 2: n v0, ψ0 2 + v1, ψ1 2 ε n v 2 2 o v0, ψ0 2 + v1, ψ1 2 2ε v0, ψ0 2 2ε Writing α0 = ( u1, v , . . . , un/2, v ) Rn/2, by a change of coordinates we have that α0 2 2 = v0 2 2, and that v0, ψ0 is equal in distribution to α0, ζ0 , where ζ0 is uniform on Sn/2 1. Since we assumed v0 2 v1 2, we must have that α0 = 0. Hence by Proposition B.6, P n v0, ψ0 2 + v1, ψ1 2 ε n v 2 2 o P α0, ζ0 2 2ε Note that if v1 2 > v0 2, an identical argument yields the same bound. Hence, letting Ft = σ(x1, . . . , xt), we have shown that for all t 0: ℓ=1 v, xt+ℓ 2 ε v T 1 from which the claim follows. Example 4.5 (Normal subspaces). Suppose that n 3. Let u1, . . . , un be a fixed orthonormal basis in Rn, and let U i := span({uj}j =i) for i {1, . . . , n}. Consider the Markov chain {it}t 1 defined by i1 Unif({1, . . . , n}), and it+1 | it Unif({1, . . . , n} \ {it}). Let Px denote the process with conditional distribution xt | it uniform over the spherical measure on U it Sn 1. For any T 2, the process Px satisfies the (T, 2, In/(4n 4), e, 1 2)-Traj SB condition. Proof Fix any v Rn \ {0}, and for i {1, . . . , n}, let vi = PU iv, where PU i is the orthogonal projector onto U i Let {ψi}n i=1 be independent random variables, where each ψi is uniform on the uniform measure over U i Sn 1. Let indices j, k {1, . . . 
, n} with j = k. We first observe that since j = k, we have that U j = span(uj) U k. Therefore: v 2 2 = vj 2 2 + P U jvj 2 2 vj 2 2 + vk 2 2. Hence, assuming that vj 2 vk 2, we have: vj, ψj 2 + vk, ψk 2 ε 2(n 1) v 2 2 vj, ψj 2 + vk, ψk 2 ε n 1 vj 2 2 Learning from many trajectories vj, ψj 2 ε n 1 vj 2 2 Writing αj = ( ui, v )i =j Rn 1, by a change of coordinates we have that αj 2 2 = vj 2 2, and that vj, ψj is equal in distribution to αj, ζj , where ζj is uniform on Sn 2. Since we assumed vj 2 vk 2, we must have that αj = 0. Hence by Proposition B.6, P vj, ψj 2 + vk, ψk 2 ε 2(n 1) v 2 2 P αj, ζj 2 ε n 1 αj 2 2 On the other hand if vk 2 > vj 2, an identical argument yields the same bound. Now, for any i {1, . . . , n} and t N+: P v, xt+1 2 + v, xt+2 2 ε 2(n 1) v 2 2 j =i,k =j P v, xt+1 2 + v, xt+2 2 ε 2(n 1) v 2 2 it = i, it+1 = j, it+2 = k P{it+1 = j, it+2 = k | it = i} j =i,k =j P vj, ψj 2 + vk, ψk 2 ε 2(n 1) v 2 2 P{it+1 = j, it+2 = k | it = i} j =i,k =j P{it+1 = j, it+2 = k | it = i} Note we also have that P n v, x1 2 + v, x2 2 ε 2(n 1) v 2 2 o (eε)1/2 by a nearly identical argument. Hence, letting Ft = σ(x1, . . . , xt), we have shown that for all t 0: ℓ=1 v, xt+ℓ 2 ε v T 1 4(n 1)In from which the claim follows. For the next claim, recall that the mixing time of a Markov chain over state-space S with transition matrix P and stationary distribution π is defined as: τmix(ε) = inf k N sup µ P(S) µP k π tv ε Here, P(S) denotes the set of distributions over S, and tv is the total-variation norm over distributions. Proposition B.11. Let n 2. Consider the Markov chain {it}t 1 where i1 Unif({1, . . . , n}) and it+1 | it Unif({1, . . . , n} \ {it}). We have that: τmix(ε) = inf k N (n 1) k 2ε 1 1/n Tu, Frostig, and Soltanolkotabi Proof Let 1 Rn denote the all ones vector. The transition matrix for this Markov chain is: P = 1 n 1(11T In), and its stationary distribution is uniform over {1, . . . , n}. Note that for j 1, (11T)j = nj 111T. 
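Before completing the computation, the mixing behavior of this chain can be observed numerically: iterating the transition matrix P = (11^T − I_n)/(n − 1) and measuring the worst-case total-variation distance to the uniform stationary distribution recovers a geometric rate of the form (n − 1)^{−k}(1 − 1/n). A sketch assuming NumPy; the choice n = 6 is arbitrary.

```python
import numpy as np

n = 6  # number of states; arbitrary choice
# From state i, the chain moves uniformly to one of the other n - 1 states.
P = (np.ones((n, n)) - np.eye(n)) / (n - 1)
pi = np.full(n, 1.0 / n)  # uniform stationary distribution

Pk = np.eye(n)
for k in range(1, 8):
    Pk = Pk @ P
    # By convexity of TV distance in the initial distribution, the
    # worst-case initial distribution is a point mass.
    tv = max(0.5 * np.abs(Pk[i] - pi).sum() for i in range(n))
    predicted = (1 - 1 / n) / (n - 1) ** k
    assert abs(tv - predicted) < 1e-12
```

The exact match reflects that P has only two eigenvalues, 1 on the all-ones direction and −1/(n − 1) on its orthogonal complement.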
Since 11T and In commute, by the binomial theorem we have that: P k = 1 (n 1)k (11T)k j( 1)j nk j 1( 1)j11T + ( 1)k In (n 1)k ( 1)k 11T + ( 1)k In n11T + ( 1)k Now, let µ Rn 0 satisfy µT1 = 1. We have: µTP k 1 n1T 1 = 1 (n 1)k It is straightforward to check that supµ Rn 0,µT1=1 µ 1 n, from which the claim follows, since the TV distance between two distributions µ, ν is µ ν tv = 1 Example 4.6 (Linear dynamical systems). Let (A, B) with A Rn n and B Rn d be kcstep-controllable (Definition 4.2). Let PA,B x be the linear dynamical system defined in (3.8). Fix any T, k N+ satisfying T k kc. Then, PA,B x satisfies the (T, k, Γk(A, B), e, 1 2)- Traj SB condition. Proof Let Γk = Γk(Px) and Σk = Σk(Px) be shorthand notation. Let w = (w1, . . . , wk) Rnk denote the vertical concatenation of the process noise variables. Let Mt := At Φt Rn n(k+1) denote the matrix such that xt = Mt . With this notation, for any v Rn: t=1 v, xt 2 = x w t=1 MT t vv TMt By Equation (B.1), for any ε > 0, t=1 MT t vv TMt t=1 ΦT t vv TΦt Learning from many trajectories On the other hand: t=1 ΦT t vv TΦt v = v T 1 k v = v TΓkv. Because we assumed that k kc, then Γk is invertible. Thus, we can take csb = e and α = 1/2. Proposition B.12. Consider the scalar stochastic process {xt}t 1 defined by: j=0 ci,jwt i 1wt j 1, where {ci,j}i,j 0 are the coefficients which describe the dynamics, and {wt}t 0 are iid N(0, 1) random variables. Let {Ft}t 1 denote the filtration defined as Ft := σ(w0, . . . , wt 1), so that xt is Ft-measurable. Suppose that {ci,j}i,j 0 is symmetric and traceless. For every t 1 and k 0, almost surely we have: E[x2 t+k | Fk] E[x2 t ] + (E[xt+k | Fk])2. Proof For t N+, define the symmetric matrices Mt Rt t with (Mt)ii = 0 and (Mt)ij = c(i 1),(j 1). With this notation and with wt N(0, It), we can write xt as: xt = w T t Mt wt. Therefore, by Proposition B.5 and the assumption that tr(Mt) = 0: E[x2 t ] = E( w T t Mt wt)2 = 2 Mt 2 F + tr(Mt)2 = 2 Mt 2 F . 
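The second-moment identity just used, E[(w^T M w)²] = 2‖M‖_F² + tr(M)² (Proposition B.5 with A = B = M), can be checked by Monte Carlo. A sketch assuming NumPy; the matrix size, sample count, and tolerance are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
t = 4  # matrix size; arbitrary

# Symmetric coefficient matrix with zero diagonal (hence traceless).
M = rng.standard_normal((t, t))
M = (M + M.T) / 2
np.fill_diagonal(M, 0.0)

w = rng.standard_normal((500_000, t))
x = np.einsum('bi,ij,bj->b', w, M, w)  # x = w^T M w, per sample
emp = np.mean(x ** 2)
exact = 2 * np.sum(M ** 2) + np.trace(M) ** 2  # Proposition B.5 with A = B = M
assert abs(emp - exact) / exact < 0.1
```

With tr(M) = 0 this reduces to E[x²] = 2‖M‖_F², as in the display above.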
Now, partition Mt+k as: Mt+k = Mt Dt,k DT t,k Et,k Let vk = (wk 1, . . . , w0). Given Fk, we can write xt+k as: xt+k = wt vk T Mt Dt,k DT t,k Et,k With this notation: E[xt+k | Fk] = v T k Et,k vk, E[x2 t+k | Fk] = E wt T Mt Dt,k DT t,k Et,k Expanding the square: wt vk T Mt Dt,k DT t,k Et,k !2 = ( w T t Mt wt + 2 w T t Dt,k vk + v T k Et,k vk)2 Tu, Frostig, and Soltanolkotabi = ( w T t Mt wt)2 + 4 w T t Mt wt w T t Dt,k vk + 2 w T t Mt wt v T k Et,k vk + 4( w T t Dt,k vk)2 + 4 w T t Dt,k vk v T k Et,k vk + ( v T k Et,k vk)2. Using Proposition B.5 again: E[x2 t+k | Fk] = E wt T Mt Dt,k DT t,k Et,k = E wt( w T t Mt wt)2 + 2 tr(Mt) v T k Et,k vk + 4 Dt,k vk 2 2 + ( v T k Et,k vk)2 = 2 Mt 2 F + 4 Dt,k vk 2 2 + ( v T k Et,k vk)2 2 Mt 2 F + ( v T k Et,k vk)2. To complete the proof, we recall that E[x2 t ] = 2 Mt 2 F and E[xt+k | Fk] = v T k Et,k vk. Example 4.9 (Degree-2 Volterra series). Consider the following process Px. Let {c(ℓ) i,j }i,j 0 for ℓ {1, . . . , n} be symmetric, traceless, non-degenerate arrays (Definition 4.3). Let {w(ℓ) t }t 0 be iid N(0, 1) random variables for ℓ {1, . . . , n}. For t 1, the ℓ-th coordinate of xt, denoted (xt)ℓ, is: j=i c(ℓ) i,j w(ℓ) t i 1w(ℓ) t j 1. (4.5) Let knd N+ denote the smallest non-degeneracy index for all n arrays. There is a universal positive constant c such that for any T and k satisfying T k knd, Px satisfies the (T, k, Γk(Px), c, 1 4)-Traj SB condition. Proof Fix a v Rn. The relation (4.5) shows that v, xt 2 is a degree four polynomial in {w(ℓ) i }t 1,n i=0,ℓ=1. Let Ft = σ({w(ℓ) i }t 1,n i=0,ℓ=1), so that xt is Ft-measurable. By Theorem B.7, there exists a univeral positive constant c > 0 such that for any s 0, t=1 v, xt+s 2 εE t=1 v, xt+s 2 Fs (cε)1/4 a.s. To conclude the proof, we need to lower bound E 1 k Pk t=1 v, xt+s 2 Fs . 
For any t 1, E v, xt+s 2 Fs ℓ=1 vℓ (xt+s)ℓ ℓ=1 v2 ℓ E[(xt+s)2 ℓ| Fs] + ℓ1 =ℓ2 vℓ1vℓ2 E[(xt+s)ℓ1 | Fs] E[(xt+s)ℓ2] | Fs] ℓ=1 v2 ℓ E[(xt)2 ℓ] + ℓ=1 v2 ℓ (E[(xt+s)ℓ| Fs])2 Learning from many trajectories ℓ1 =ℓ2 vℓ1vℓ2 E[(xt+s)ℓ1 | Fs] E[(xt+s)ℓ2] | Fs] ℓ=1 v2 ℓ E[(xt)2 ℓ] + ℓ=1 vℓ E[(xt+s)ℓ| Fs] ℓ=1 v2 ℓ E[(xt)2 ℓ] (c) = v TΣt(Px)v. Above, (a) follows since each coordinate of xt is independent by definition, (b) follows from Proposition B.12, and (c) follows since E[xt] = 0 and each coordinate is independent, so E[(xt)ℓ1(xt)ℓ2] = E[(xt)ℓ1]E[(xt)ℓ2] = 0 for ℓ1 = ℓ2. Hence, we have shown: t=1 v, xt+s 2 Fs v = v TΓk(Px)v. Note that because we assume that k knd, the covariances Σt(Px) are all invertible (and hence so is Γk(Px)). The claim now follows. Example 4.8 (Degree-D Volterra series). Fix a D N+. Let {c(d,ℓ) i1,...,id}i1,...,id N for d {1, . . . , D} and ℓ {1, . . . , n} denote arbitrary rank-d arrays. Let {w(ℓ) t }t 0 be iid N(0, 1) random variables for ℓ {1, . . . , n}. Consider the process Px where for t 1, the ℓ-th coordinate of xt, denoted (xt)ℓ, is: i1,...,id=0 c(d,ℓ) i1,...,id d =1 w(ℓ) t id 1. (4.4) Let Tnd := inf{t N+ | det(Γt(Px)) = 0}, and suppose Tnd is finite. There is a constant c D > 0, depending only on D, such that for any T Tnd, the process Px satisfies the (T, T, ΓT (Px), c D, 1/(2D))-Traj SB condition. Proof Fix a v Rn. The definition (4.4) expresses v, xt as a degree at most D polynomial in the noise variables {w(ℓ) t }. By Theorem B.7, there exists a positive constant c D, only depending on D, such that: t=1 v, xt 2 εE t=1 v, xt 2 #) (c Dε)1/(2D). Since T Tnd, the matrix ΓT (Px) is invertible. The claim now follows. B.3 Proof of Proposition 4.2 Proposition 4.2 (Average small-ball implies trajectory small-ball). Fix T N+, k {1, . . . , T}, {Ψj} T/k j=1 Symn >0, and α, β (0, 1). Let Px be a covariate distribution, with Tu, Frostig, and Soltanolkotabi {xt}t 1 adapted to a filtration {Ft}t 1. Suppose for all v Rn \{0} and j {1, . . . 
, T/k }: t=(j 1)k+1 Pxt Px n v, xt 2 α v TΨjv F(j 1)k o β a.s., (4.6) where F0 is the minimal σ-algebra. Then, for all v Rn \ {0}, j {1, . . . , T/k }, and ε (0, α) t=(j 1)k+1 v, xt 2 ε v TΨjv β 1 ε/α a.s. (4.7) Proof The following proof builds on the argument given in Simchowitz et al. (2018, Section E.1). We note that a similar style of proof is used in Bartlett et al. (2020, Lemma 15). Define the shorthand notation Pt{ } := P{ | Ft}, and similarly Et[ ] := E[ | Ft]. Now fix a v Rn \ {0}, j {1, . . . , T/k }. Markov s inequality yields that: t=(j 1)k+1 v, xt 2 αv TΨjv 1 t=(j 1)k+1 1{ v, xt 2 > αv TΨjv}, and therefore for all ε > 0: t=(j 1)k+1 v, xt 2 ε v TΨjv t=(j 1)k+1 1{ v, xt 2 > αv TΨjv} ε/α Define Zj := 1 k Pjk t=(j 1)k+1 1{ v, xt 2 > αv TΨjv}, and observe that Zj [0, 1]. By (4.6), we have: E(j 1)k[Zj] 1 β. On the other hand: E(j 1)k[Zj] = E(j 1)k[Zj1{Zj > ε/α}] + E(j 1)k[Zj1{Zj ε/α}] P(j 1)k{Zj > ε/α} + ε/α P(j 1)k{Zj ε/α} since Zj 1 = 1 (1 ε/α)P(j 1)k{Zj ε/α}. Combining both these inequalities, and further restricting ε (0, α), we obtain, P(j 1)k{Zj ε/α} β 1 ε/α, which implies (4.7). Learning from many trajectories B.4 General ordinary least-squares estimator upper bound In this section, we supply the proof of Lemma 5.1. We first start with a result which bounds the minimum eigenvalue of the empirical covariance matrix. Lemma B.13 (Minimum eigenvalue bound via trajectory small-ball). Suppose that Px satisfies the (T, k, {Ψj} T/k j=1 , csb, α)-trajectory-small-ball condition (Definition 4.1). Put S := T/k , and ΓT := ΓT (Px). Fix any Γ Symn >0 satisfying 1 S PS j=1 Ψj Γ ΓT . Define Xm,T := Xm,T Γ 1/2, and: µ({Ψj}S j=1, Γ) := j=1 λ(Ψj, Γ) Suppose that: 320ecsb αλ(Γ, ΓT )µ({Ψj}S j=1, Γ) For any t 0, with probability at least 1 2e t, the following statements simultaneously hold: tr( XT m,T Xm,T ) m Tnet λ(Γ, ΓT ), λmin( XT m,T Xm,T ) m Tαµ({Ψj}S j=1, Γ) 8ecsb exp 16kn Proof The proof uses the PAC-Bayes argument for uniform convergence from Mourtada (2022). 
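Before the formal argument, the phenomenon that Lemma B.13 quantifies, namely that λmin of the Gram matrix grows linearly in the total sample count mT, is easy to observe empirically in the simplest small-ball setting of iid isotropic Gaussian covariates. A sketch assuming NumPy; dimensions and sample counts are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5  # covariate dimension; arbitrary

def min_eig_gram(num_samples):
    # Gram matrix of iid isotropic Gaussian covariates.
    X = rng.standard_normal((num_samples, n))
    return np.linalg.eigvalsh(X.T @ X).min()

small = min_eig_gram(2_000)
large = min_eig_gram(8_000)
# Quadrupling the sample count should roughly quadruple the minimum
# eigenvalue; allow generous slack for random fluctuation.
assert 2.0 < large / small < 8.0
```

The point of the PAC-Bayes argument below is to establish such a linear lower bound uniformly over directions, for dependent covariates satisfying only the trajectory small-ball condition.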
The first step is to construct a family of random variables, indexed by both v Sn 1 and a scale parameter η > 0, such that its moment generating function is pointwise bounded by one. For notational brevity, let: λ := λ(Γ, ΓT ), µ := µ({Ψj}S j=1, Γ). Since Γ ΓT by assumption, we have λ (0, 1]. Similarly, since 1 S PS j=1 Ψj Γ, we also have µ (0, 1] by Proposition B.3. The trajectory small-ball condition (4.1) implies for any v Sn 1, j {1, . . . , S}, and ε > 0, by substituting v Γ 1/2v and lower bounding v TΓ 1/2ΨjΓ 1/2v λ(Ψj, Γ): t=(j 1)k+1 v, Γ 1/2xt 2 ελ(Ψj, Γ) Using a change of variables ε ε/λ(Ψj, Γ), t=(j 1)k+1 v, Γ 1/2xt 2 ε (csb/λ(Ψj, Γ) ε)α . Tu, Frostig, and Soltanolkotabi By Proposition B.4, for any η > 0, t=(j 1)k+1 v, Γ 1/2xt 2 + α log ηλ(Ψj, Γ) 1 a.s. (B.5) For i {1, . . . , m} and j {1, . . . , S}, define the random variables Z(i) j (v; η), Z(i)(v; η), and Z(v; η): Z(i) j (v; η) := η t=(j 1)k+1 v, Γ 1/2x(i) t 2 + α log ηλ(Ψj, Γ) Z(i)(v; η) := j=1 Z(i) j (v; η), i=1 Z(i)(v; η). We first claim that E[exp(Z(v; η))] 1 for every v Sn 1 and η > 0. Since Z(i)(v; η) is independent of Z(i )(v; η) whenever i = i , we have that: E[exp(Z(v; η))] = E i=1 Z(i)(v; η) i=1 E[exp(Z(i)(v; η))]. Furthermore, by repeated applications of the tower property and (B.5), for every i {1, . . . , m}, E[exp(Z(i)(v; η))] = E j=1 Z(i) j (v; η) j=1 Z(i) j (v; η) E[exp(Z(i) S (v; η)) | F(S 1)k] j=1 Z(i) j (v; η) Hence E[exp(Z(v; η))] 1 for every v Sn 1 and η > 0. Let us now import some notation from Mourtada (2022, Section 4). First, let π denote the spherical measure on Sn 1, and let ρv,γ denote the uniform measure over the spherical cap C(v, γ) := {w Sn 1 | v w 2 γ}. Learning from many trajectories Next, let Fv,γ(Σ) := R C(v,γ) w, Σw dρv,γ for any symmetric matrix Σ. Fix any positive t, η. For two measures µ and ν with µ absolutely continuous w.r.t. ν, let KL(µ, ν) := Eµ log dµ dν denote the KL-divergence between µ and ν. By the PAC-Bayes deviation bound (cf. 
Mourtada, 2022, Lemma 4), there exists an event Et,1 with probability at least 1 e t, such that on Et,1, we have for every v Sn 1 and γ > 0, t=1 x(i) t (x(i) t )TΓ 1/2 ! + m Sα log ηµ KL(ρv,γ, π) + t. (B.6) Next, by Mourtada (2022, Section 4.3), we can write Fv,γ in terms of a scalar function φ such that: Fv,γ(Σ) = (1 φ(γ)) v, Σv + φ(γ) 1 n tr(Σ), φ(γ) 0, n n 1γ2 . (B.7) Furthermore, for every v Sn 1 and γ > 0, the KL-divergence term can be upper bounded by (Mourtada, 2022, Section 4.4): KL(ρv,γ, π) n log 1 + 2 Therefore on Et,1, plugging (B.7) and (B.8) into (B.6), λmin XT m,T Xm,T k η(1 φ(γ)) m Sα log ηµ n log 1 + 2 φ(γ) 1 φ(γ) 1 n tr XT m,T Xm,T . Restricting γ [0, 1/2], we have from (B.7) that 0 φ(γ) n n 1γ2 2γ2 1/2. Hence, 1 φ(γ) [1/2, 1]. Furthermore, 1 + 2/γ 5/(4γ2). Therefore, λmin XT m,T Xm,T k m Sα log ηµ n tr XT m,T Xm,T . Define the non-negative random variables ψi := PT t=1 Γ 1/2x(i) t 2 2, for i = 1, . . . , m. It is straightforward to verify that tr( XT m,T Xm,T ) = Pm i=1 ψi. By Markov s inequality, for any β > 0: P tr( XT m,T Xm,T ) > β = P E [Pm i=1 ψi] β = m T tr(Γ 1ΓT ) β m Tnλmax(Γ1/2 T Γ 1Γ1/2 T ) β = m Tn Therefore, setting β = etm Tn λ , there exists an event Et,2 such that P(Ec t,2) e t and on Et,2, tr( XT m,T Xm,T ) etm Tn Tu, Frostig, and Soltanolkotabi Therefore on Et,1 Et,2, which we assume holds for the remainder of the proof, we have: λmin XT m,T Xm,T k m Sα log ηµ Next, we further restrict ηµ csb e so that log(ηµ/csb) 1. Now consider, for positive constants A, B, C, the function x 7 A log(B/x)+Cx on the domain (0, ). The derivative vanishes at x = A/C, and the function attains a minimum value of A(1 + log(BC/A)) with this choice of x. Let us set: 4, C 4m Tet Then by choosing γ2 = knλ 4ηm Tet , we have that: 1 + log 5m Tetη Note that this choice of γ satisfies γ [0, 1/2], since: knλ 4ηm Tet 1 4 = kn ηm T 1 since t 0 and λ 1 = knµ ecsbm T 1 since η ecsb/µ = n ecsb m T k since µ 1, and the last condition holds by (B.3). 
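The one-dimensional minimization fact used to choose γ above, that x ↦ A log(B/x) + Cx is minimized at x = A/C with value A(1 + log(BC/A)), can be verified directly. A stdlib-only sketch with arbitrary positive constants:

```python
import math

# f(x) = A log(B/x) + C x on (0, inf): minimizer x* = A/C, minimum
# value A (1 + log(B C / A)).
A, B, C = 2.0, 5.0, 3.0  # arbitrary positive constants

def f(x):
    return A * math.log(B / x) + C * x

x_star = A / C
f_star = A * (1 + math.log(B * C / A))

assert abs(f(x_star) - f_star) < 1e-12
# f is strictly convex (f'' = A / x**2 > 0), so a local grid check
# around x* confirms it is the global minimum.
for d in (k / 100 for k in range(-30, 31) if k != 0):
    assert f(x_star) <= f(x_star + d)
```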
With this choice of γ, we have: λmin XT m,T Xm,T m Sα log ηµ t n 1 + log 5m Tetη (m Sα n) log ηµ t n 1 + log 5csbm Tet t n 1 + log 5csbm Tet since m S 2n/α by (B.3) (1 + n)t n 1 + log 5csbm T 2nt n 1 + log 5csbm T Learning from many trajectories " log(ηµ/csb) 8nt m Sα ηµ/csb To justify inequality (a), we first note that m Sα 4n 1 + log 5csbm T holds from (B.3), since: 4n 1 + log 5csbm T 8 1 + log 5csbm T since S T/(2k) α log 5ecsb α log 5ecsb α log 5ecsb using Proposition B.1 α log 320ecsb The inequality m Sα 4n 1 + log 5csbm T then implies (a) by observing: + n 1 + log 5csbm T + n 1 + log 5csbm T since ηµ/csb e. It remains to optimize over η [ecsb/µ, ). For any G R, the function η 7 log η G η on (0, ) attains a maximum of exp( 1 G) at η = exp(1 + G). Hence, setting η = csb µ exp(1 + 8nt/(m Sα)), which satisfies η ecsb/µ, we have: λmin XT m,T Xm,T km Sαµ 4ecsb exp 8nt 8ecsb exp 16kn m Tαt since S T/(2k). The claim now follows by gathering the requirements on the quantity m T kn and simplifying as in (B.3). Corollary B.14. Assume the hypothesis of Lemma B.13 hold. Then, XT m,T Xm,T is invertible almost surely. Proof For any t 0, define the event Et as: Et := λmin XT m,T Xm,T < m Tαµ 8ecsb exp 16kn Tu, Frostig, and Soltanolkotabi The event {λmin( XT m,T Xm,T ) = 0} is the intersection T t=1 Et. By Lemma B.13, we have that P(Et) 2e t. Since the events Et Et whenever t t, by continuity of measure from above, P(λmin( XT m,T Xm,T ) = 0) = P = lim t P(Et) lim t 2e t = 0. We are now ready to restate and prove Lemma 5.1. Lemma 5.1 (General OLS upper bound). There are universal positive constants c0 and c1 such that the following holds. Suppose that Px satisfies the (T, k, {Ψj} T/k j=1 , csb, α)-Traj SB condition (Definition 4.1). Put S := T/k and ΓT := ΓT (Px). 
Fix any Γ Symn >0 satisfying 1 S PS j=1 Ψj Γ ΓT , and let µ({Ψj}S j=1, Γ) denote the geometric mean of the minimum eigenvalues {λ(Ψj, Γ)}S j=1, i.e., µ({Ψj}S j=1, Γ) := j=1 λ(Ψj, Γ) Suppose that: max{e, csb} αλ(Γ, ΓT )µ({Ψj}S j=1, Γ) Then, for any Γ Symn >0: E[ ˆWm,T W 2 Γ ] c1csbσ2 ξ pn m Tαλ(Γ, Γ )µ({Ψj}S j=1, Γ) log max{e, csb} αλ(Γ, ΓT )µ({Ψj}S j=1, Γ) Proof For notational brevity, let: λ := λ(Γ, ΓT ), µ := µ({Ψj}S j=1, Γ). We choose c0 64 such that (5.3) implies (B.3). By Corollary B.14, Xm,T has full column rank almost surely, hence: ˆWm,T W = ΞT m,T Xm,T (XT m,T Xm,T ) 1. Put Xm,T := Xm,T Γ 1/2. With this decomposition, we have: ˆWm,T W 2 Γ = ΞT m,T Xm,T (XT m,T Xm,T ) 1 2 Γ = ΞT m,T Xm,T (XT m,T Xm,T ) 1Γ1/2 2 Γ 1/2Γ Γ 1/2 λmax(Γ 1/2Γ Γ 1/2) ΞT m,T Xm,T (XT m,T Xm,T ) 1Γ1/2 2 F Learning from many trajectories = λmax(Γ 1/2Γ Γ 1/2) ( XT m,T Xm,T ) 1 XT m,T Ξm,T 2 F min{n, p}λmax(Γ 1/2Γ Γ 1/2) ( XT m,T Xm,T ) 1 XT m,T Ξm,T 2 op min{n, p}λmax(Γ 1/2Γ Γ 1/2) ( XT m,T Xm,T ) 1/2 XT m,T Ξm,T 2 op λmin( XT m,T Xm,T ) = min{n, p} ( XT m,T Xm,T ) 1/2 XT m,T Ξm,T 2 op λ(Γ, Γ ) λmin( XT m,T Xm,T ) . Fix any t > 0. By Lemma B.13, there exists an event Et,1 with probability at least 1 2e t, such that on Et,1 we have: tr( XT m,T Xm,T ) mn Tet λ , λmin( XT m,T Xm,T ) m Tαµ 8ecsb exp 16kn By our choice of c0, we have m T/k 64n/α. Hence, on Et,1, λmin( XT m,T Xm,T ) ζt := m Tαµ 8ecsb exp( t/4). We now apply Proposition B.10 with V Mt := ζt In and: x1, . . . , x T , x T+1, . . . , x2T , . . . , x(m 1)T+1, . . . , xm T Γ 1/2x(1) 1 , . . . , Γ 1/2x(1) T , Γ 1/2x(2) 1 , . . . , Γ 1/2x(2) T , . . . , Γ 1/2x(m) 1 , . . . , Γ 1/2x(m) T , to conclude that there exists an event Et,2 with probability at least 1 e t such that on Et,2: 1{ XT m,T Xm,T Mt} ( XT m,T Xm,T ) 1/2 XT m,T Ξm,T 2 op p log 5 + 1 2 log det In + ζ 1 t XT m,T Xm,T + t 32σ2 ξ h p + log det In + ζ 1 t XT m,T Xm,T + t i 32σ2 ξ h p + n log(1 + ζ 1 t tr( XT m,T Xm,T )/n) + t i . 
Above, the last inequality holds since log det(X) n log(tr(X)/n) for any X Symn 0 by the AM-GM inequality. By Proposition B.1, whenever t 8 log 16, we have t et/4. Furthermore, for any t 0 we have 1 et/4. Therefore, for t 8 log 16, on Et,1 Et,2: ( XT m,T Xm,T ) 1/2 XT m,T Ξm,T 2 op λmin( XT m,T Xm,T ) m Tα et/4σ2 ξ p + n log 1 + 8ecsb αλµ e(1+1/4)t + t m Tα et/4σ2 ξ p + n log 16ecsb + n(1 + 1/4)t + t Tu, Frostig, and Soltanolkotabi m Tα et/4σ2 ξ p + n log 16ecsb p + n log 16ecsb Define the random variable Z as: Z := ( XT m,T Xm,T ) 1/2 XT m,T Ξm,T 2 op λmin( XT m,T Xm,T ) p + n log 16ecsb We have shown that: P(Z > et/2) 3e t t 8 log 16 P(Z > s) 3s 2 s 164. 0 P(Z > s) ds 164 + 3 Z 164 s 2 ds = 164 + 3/164. That is, for some universal positive c1, E[ ˆWm,T W 2 Γ ] c1σ2 ξ min{n, p}csb p + n log max{e,csb} m Tαλ(Γ, Γ )µ Now, if p n, (B.10) is upper bounded by: p + n log max{e,csb} m Tαλ(Γ, Γ )µ 2c1σ2 ξpcsb n log max{e,csb} m Tαλ(Γ, Γ )µ On the other hand, if p > n, (B.10) is upper bounded by: p + n log max{e,csb} m Tαλ(Γ, Γ )µ < 2c1σ2 ξncsb p log max{e,csb} m Tαλ(Γ, Γ )µ B.5 Proof of Theorem 5.3 Theorem 5.3 (Upper bound for Ind-Seq-LS). There are universal positive constants c0 and c1 such that the following holds. Fix any sequence of distributions {Px,t}t 1, and let Σt := Ext Px,t[xtx T t ] for t N+. Suppose there exists csb > 0 and α (0, 1] such that for all v Rn \ {0}, ε > 0 and t N+: Pxt Px,t n v, xt 2 ε v TΣtv o (csbε)α. (5.7) Learning from many trajectories Furthermore, suppose there exists a cβ 1 and β 0 such that for all s, t N+ satisfying s t: 1 λ(Σs, Σt) cβ(t/s)β. (5.8) n 2, m T c0n β + log max{e, csb}cβ then, for Px = t 1Px,t: E[L( ˆWm,T ; T , Px)] c1csbσ2 ξcβeβ pn m Tα φ cβ(β + 1), (T /T)β β + log max{e, csb}cβ Proof Equation (5.7) shows that Px satisfies the (T, 1, {Σt}T t=1, csb, α)-Traj SB condition. Let Γt := 1 t Pt k=1 Σk for t N+. 
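For intuition, the comparison between Γ_s and Γ_t derived in this proof can be sanity-checked in the scalar case Σ_t = t^β, where the growth condition (5.8) holds with c_β = 1. A stdlib-only sketch; β and the horizon are arbitrary.

```python
# Scalar instance of the covariance-growth condition: Sigma_t = t**beta,
# so (5.8) holds with c_beta = 1. Check the averaged covariances satisfy
# Gamma_s / Gamma_t >= (s/t)**beta / (beta + 1) for all s <= t.
beta = 2.0  # arbitrary growth exponent

def gamma_avg(t):
    # Gamma_t = (1/t) * sum_{k=1}^{t} Sigma_k with Sigma_k = k**beta
    return sum(k ** beta for k in range(1, t + 1)) / t

for t in range(1, 60):
    for s in range(1, t + 1):
        assert gamma_avg(s) / gamma_avg(t) >= (s / t) ** beta / (beta + 1)
```

The inequality follows from Gamma_s >= s^β/(β+1) (integral comparison) and Gamma_t <= t^β.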
For any s, t N+ with s t, λ(Γs, Γt) λ(Γs, Σt) since Γt Σt k=1 λ(Σk, Σt) using Proposition B.2 and Jensen s inequality k=1 (k/t)β using (5.8) 1 cβ(β + 1)(s/t)β since x 7 xβ is increasing. Next, the growth condition (5.8) implies that: µ({Σt}T t=1, ΓT ) = t=1 λ(Σt, ΓT ) t=1 λ(Σt, ΣT ) since ΓT ΣT 1 cβ (t/T)β #1/T using (5.8) = 1 cβT β (T!)β/T 1 cβeβ since T! (T/e)T . We now apply Lemma 5.1 with Γ = ΓT . In doing so, the requirement (5.3) simplifies to: α log max{e, csb}cβeβ Tu, Frostig, and Soltanolkotabi We first assume that T T, in which case (5.4) yields: E[L( ˆWm,T ; T , Px)] c1csbσ2 ξ pn m Tα cβeβ log max{e, csb}cβeβ On the other hand, when T > T, we have λ(ΓT , ΓT ) 1 cb(β+1)(T/T )β, and (5.4) yields: E[L( ˆWm,T ; T , Px)] c1csbσ2 ξ pn m Tα cβeβ cβ(β + 1) T β log max{e, csb}cβeβ B.6 Proofs for linear dynamical systems B.6.1 Control of ratios of covariance matrices Proposition B.15. Let (A, B) be the dynamics matrices for an LDS-LS instance, and suppose (A, B) satisfy Assumption 5.1, Assumption 5.2, and Assumption 5.3. Put Σt := Σt(A, B) for t N+ and γ := γ(A, B). For any integers T1, T2 satisfying 1 T2 T1, λmin(Σ 1/2 T1 ΣT2Σ 1/2 T1 ) 1 Proof Observe that for any t 1, k=0 Ak BB (Ak) = k=0 SDk S 1BB S (Dk) S . By Assumption 5.3, we have that BB is invertible, and hence S 1BB S is also invertible. Therefore we have the following lower and upper bound on Σt: λmin(S 1BB S ) S k=0 Dk(Dk) ! S Σt λmax(S 1BB S ) S k=0 Dk(Dk) ! Now recall that for two square matrices X, Y , the eigenvalues of XY coincide with the eigenvalues of Y X. Letting Qt := Pt 1 k=0 Dk(Dk) , we have: λmin(Σ 1/2 T1 ΣT2Σ 1/2 T1 ) λmin(S 1BB S )λmin(Σ 1/2 T1 SQT2S Σ 1/2 T1 ) = λmin(S 1BB S )λmin((SQT2S )1/2Σ 1 T1 (SQT2S )1/2) λmin(S 1BB S ) λmax(S 1BB S )λmin((SQT2S )1/2(S Q 1 T1 S )(SQT2S )1/2) = λmin(S 1BB S ) λmax(S 1BB S )λmin(QT2Q 1 T1 ). Learning from many trajectories Let λ C be an eigenvalue of A. We have k=0 |λ|2k = 1 |λ|2 if |λ| < 1, t if |λ| = 1. 
Therefore, (QT2Q 1 T1 )ii is: (QT2Q 1 T1 )ii = 1 |λi|2T1 = 1 (|λi|2T1)T2/T1 1 |λi|2T1 if |λi| < 1, T2/T1 if |λi| = 1. Note that infx (0,1) 1 xc 1 x = c for c [0, 1]. Therefore, we can lower bound: λmin(QT2Q 1 T1 ) T2 The claim now follows. Proposition B.16. Let (A, B) be the dynamics matrices for an LDS-LS instance, and suppose (A, B) satisfy Assumption 5.1, Assumption 5.2, and Assumption 5.3. Put Γt := Γt(A, B) for t N+ and γ := γ(A, B). For any integers k, t N+ satisfying k t, we have: λ(Γk, Γt) 1 Proof Let Σt := Σt(A, B) for t N+. We first consider the case when k 2. Observe that Γt Σt. Furthermore, for any k 2, we have: k = k/2 Σ k/2 = k k/2 + 1 λ(Γk, Γt) = λmin(Γ 1/2 t ΓkΓ 1/2 t ) 1 2λmin(Σ 1/2 t Σ k/2 Σ 1/2 t ) (a) 1 Above, (a) follows from Proposition B.15. When k = 1, we have Γ1 = Σ1, and therefore by Proposition B.15: λ(Γ1, Γt) = λ(Σ1, Γt) λ(Σ1, Σt) = λmin(Σ 1/2 t Σ1Σ 1/2 t ) 1 The claim now follows. Fact B.17. Let (A, B) be the dynamics matrices for an LDS-LS instance. For any s, t N+ with s t: Γs(A, B) Γt(A, B). Tu, Frostig, and Soltanolkotabi Proposition B.18. Let (A, B) be the dynamics matrices for an LDS-LS instance, and suppose (A, B) satisfy Assumption 5.1, Assumption 5.2, and Assumption 5.3. Put Γt := Γt(A, B) for t N+, Σt := Σt(A, B) for t N+, and γ := γ(A, B). For any T, we have: t=1 λ(Σt, ΓT ) Proof By Proposition B.16, we have that λ(Γt, ΓT ) 1 8γ t T for all t {1, . . . , T}. Therefore, since λ(Σt, ΓT ) λ(Γt, ΓT ), and since n! (n/e)n for all n N+, t=1 λ(Σt, ΓT ) 8γT 1 8eγ . B.6.2 Many trajectory results Lemma B.19. There are universal positive constants c0 and c1 such that the following holds for any instance of LDS-LS. Suppose that (A, B) is kc-step controllable. If n 2 and m c0n, then for any Γ Symn >0: E[ ˆWm,T W 2 Γ ] c1σ2 ξ pn m T λ(ΓT (A, B), Γ ). (B.12) Proof Let ΓT := ΓT (A, B). By Example 4.6, LDS-LS satisfies the (T, T, ΓT , e, 1/2)-Traj SB condition. We therefore invoke Lemma 5.1 with k = T and Γ = ΓT . 
In this case, µ from (5.2) simplifies to µ = λ(ΓT , ΓT ) = 1, and the requirement (5.3) simplifies to n 2 and m c0n. Finally, the rate (5.4) simplifies to (B.12). Theorem 5.4 (Parameter recovery upper bound for LDS-LS, many trajectories). There are universal positive constants c0 and c1 such that the following holds for any instance of LDS-LS. Suppose that (A, B) is kc-step controllable, If n 2, m c0n, and T kc, then: E[ ˆWm,T W 2 F ] c1σ2 ξ pn m T λmin(ΓT (A, B)). Proof Follows by invoking Lemma B.19 with Γ = In. Theorem 5.5 (Risk upper bound for LDS-LS, many trajectories). There are universal positive constants c0 and c1 such that the following holds for any instance of LDS-LS. Suppose that (A, B) is kc-step controllable. If n 2, m c0n, T kc, and the evaluation horizon is strict (T T), then: E[L( ˆWm,T ; T , PA,B x )] c1σ2 ξ pn Learning from many trajectories On the other hand, suppose that (A, B) satisfies Assumption 5.1, Assumption 5.2, and Assumption 5.3, with γ := γ(A, B) (Definition 5.1). If n 2, m c0n, and the evaluation horizon is extended (T > T), then: E[L( ˆWm,T ; T , PA,B x )] c1σ2 ξ pn Proof Let Γt := Γt(A, B) for t N+. Invoking Lemma B.19 with Γ = ΓT yields the bound: E[L( ˆWm,T ; T , PA,B x )] c1σ2 ξ pn m T λ(ΓT , ΓT ). If T T, then λ(ΓT , ΓT ) 1 since ΓT ΓT by Fact B.17. On the other hand, if T > T, by Proposition B.16, λ(ΓT , ΓT ) 1 8γ T T . The claim now follows. B.6.3 Few trajectory results Theorem 5.6 (Risk upper bound for LDS-LS, few trajectories). There are universal positive constants c0, c1, and c2 such that the following holds for any instance of LDS-LS. Suppose that (A, B) satisfies Assumption 5.1, Assumption 5.2, and Assumption 5.3, with γ := γ(A, B) (Definition 5.1). If n 2, m c0n, and m T c1n log(max{γn/m, e}), then: E[L( ˆWm,T ; T , PA,B x )] c2σ2 ξ pn log(max{γn/m, e}) m T φ γ, c1n log(max{γn/m, e}) Proof Let Γt := Γt(A, B) for all t N+. By Example 4.6, for any k {1, . . . 
, T}, LDS-LS satisfies the (T, k, Γk, e, 1/2)-Traj SB condition. We will apply Lemma 5.1 with Γ = Γk. The quantity µ from (5.2) simplifies to µ = λ(Γk, Γk) = 1. By Proposition B.16, we have that λ(Γk, ΓT ) 1 8γ k T . Hence the requirement (5.3) simplifies to n 2 and kn c log γ T , γ := max{e, γ} (B.13) for some universal positive constant c. Thus, for (B.13) to hold, it suffices to require: m log γ , 2cn As long as 2cn/m 1, then by Proposition B.1, Hence, for (B.14) to hold, it suffices to require m log 8cγ n Tu, Frostig, and Soltanolkotabi Based on (B.15), we choose k as: k = T 4cn/m log(8cγ n/m) To ensure that k 1, we need to ensure that: m T 4cn log(8cγ n/m). (B.17) On the other hand, since 2cn/m 1, we have that: m log 8cγ n which ensures that k T. Thus, our choice of k from (B.16) ensures that (B.13) holds. We now ready to invoke Lemma 5.1 with Γ = ΓT , and conclude for a universal c : E[L( ˆWm,T ; T , PA,B x )] c σ2 ξ pn log(e/λ(Γk, ΓT )) m Tλ(Γk, ΓT ) . (B.18) First, we assume that T k. By Fact B.17 we have Γk ΓT , and therefore λ(Γk, ΓT ) 1. Equation (B.18) yields: E[L( ˆWm,T ; T , PA,B x )] c σ2 ξ pn log(e/λ(Γk, ΓT )) By Proposition B.16, λ(Γk, ΓT ) 1 T 4cn/m log(8cγ n/m) m 64cγn log(8cγ n/m). 
(B.20) Plugging (B.20) into (B.19), and using the inequalities log x x for x > 0 and φ(a, x) 1 for all a 1 yields, for another universal c : E[L( ˆWm,T ; T , PA,B x )] c σ2 ξ pn m T log(e 64cγn/m log(8cγ n/m)) c σ2 ξ pn log(512e (cγ n/m)2) c σ2 ξ pn log(max{γn/m, e}) c σ2 ξ pn log(max{γn/m, e}) m T φ γ, c1 n log(max{γn/m, e}) On the other hand, if T > k, then by Proposition B.16, λ(Γk, ΓT ) 1 T 4cn/m log(8cγ n/m) m 64cγn log(8cγ n/m) T Plugging (B.20) and (B.21) into (B.18) and using again the inequality log x x for x > 0 yields, for a universal c : E[L( ˆWm,T ; T , PA,B x )] Learning from many trajectories m T log(e 64cγn/m log(8cγ n/m)) 64cγn/m log(8cγ n/m) T c σ2 ξ pn log(max{γn/m, e}) m T γ n log(max{γn/m, e}) Furthermore, when T > k, by choosing c1 sufficiently large: 8cn log(8cγ n/m) = c1 n log(max{γn/m, e}) = γc1 n log(max{γn/m, e}) T = φ γ, c1 n log(max{γn/m, e}) The claim now follows. Theorem 5.7 (Risk upper bound for Ind-LDS-LS). There are universal positive constants c0 and c1 such that the following holds for any instance of Ind-LDS-LS. Suppose that (A, B) satisfies Assumption 5.1, Assumption 5.2, and Assumption 5.3, with γ := γ(A, B) (Definition 5.1). If n 2 and m T c0n log(max{γ, e}), then: E[L( ˆWm,T ; T , t 1PA,B x,t )] c1σ2 ξ pnγ log(max{γ, e}) Proof Let Γt := Γt(A, B) and Σt := Σt(A, B) for t N+. From Example 4.3, we have that Ind-LDS-LS satisfies the (T, 1, {Σt}T t=1, e, 1/2)-Traj SB condition. We will apply Lemma 5.1 with Γ = ΓT , k = 1, and Γ = ΓT . By Proposition B.18, we have that: µ({Σt}T t=1, ΓT ) 1 8eγ . The requirement (5.3) simplifies to n 2 and m T n c log(max{γ, e}) for a universal constant c. By Lemma 5.1, for a universal c : E[L( ˆWm,T ; T , PA,B x )] c σ2 ξ pn log(max{γ, e}) m T λ(ΓT , ΓT ) 8eγ. If T T, then λ(ΓT , ΓT ) 1 since ΓT ΓT by Fact B.17. On the other hand, if T > T, then by Proposition B.16, λ(ΓT , ΓT ) 1 8γ T T . The claim now follows. Theorem 5.8 (Parameter recovery upper bound for LDS-LS, few trajectories). 
There are universal positive constants c0, c1, and c2 such that the following holds for any instance of LDS-LS. Suppose that (A, B) satisfies Assumption 5.1, Assumption 5.2, and Assumption 5.3, with γ := γ(A, B) (Definition 5.1). If n 2, and m T c0n log(max{γn/m, e}), then: E[ ˆWm,T W 2 F ] c1σ2 ξ pn log(max{γn/m, e}) m T λmin(Γk (A, B)) , k := c2T n/m log(max{γn/m, e}) Proof The proof is identical to that of Theorem 5.6 until (B.18), after which we set T = 1 from which the result follows. B.7 High probability upper bounds B.7.1 Weak trajectory small ball We first present a modified definition of trajectory small-ball (cf. Definition 4.1) which we will use to establish high probability bounds. Definition B.1 (Weak trajectory small-ball (w Traj SB)). Fix a trajectory length T N+, a parameter k {1, . . . , T}, positive definite matrices {Ψj} T/k j=1 Symn >0, and constants α, β (0, 1). The distribution Px satisfies the (T, k, {Ψj} T/k j=1 , α, β)-weak-trajectory-small-ball (w Traj SB) condition if: (a) 1 T/k P T/k j=1 Ψj ΓT (Px), (b) {xt}t 1 is adapted to a filtration {Ft}t 1, and (c) for all v Rn \ {0}, j {1, . . . , T/k }: t=(j 1)k+1 v, xt 2 α v TΨjv β a.s. (B.22) The main difference between Definition B.1 and Definition 4.1 is the third condition (B.22), which only needs to hold for a fixed resolution α and failure probability β. By contrast, in Definition 4.1, the condition must hold for all resolutions (there denoted by ε), with failure probabilities that tend to zero as the resolution ε 0 (cf. (4.1)). B.7.2 Ordinary least squares bounds Lemma B.20 (Minimum eigenvalue bound via weak trajectory small-ball). Suppose that Px satisfies the (T, k, {Ψj} T/k j=1 , α, β)-w Traj SB condition (Definition B.1). Put S := T/k and Γt := Γt(Px) for t N+. Fix any Γ Symn >0 satisfying 1 S PS j=1 Ψj Γ ΓT , and define the constants: 1 S PS j=1 λ(Ψj, Γ)2 1 S PS j=1 λ(Ψj, Γ) 2 , µ := 1 j=1 λ(Ψj, Γ). (B.23) (Note that 1 CS S always).
Fix δ (0, 1), and suppose that: 1 β log 1280CS α(1 β)λ(Γ, ΓT ) µδ With probability at least 1 δ, the following events simultaneously hold: t=1 x(i) t (x(i) t )TΓ 1/2 ! α(1 β)m T µ t=1 x(i) t (x(i) t )TΓ 1/2 ! 2m Tn λ(Γ, ΓT ) δ. Proof The proof proceeds quite similarly to the proof of Lemma B.13. Thus, we focus mostly on the parts that differ. For notational brevity, let: β := 1 β, λ := λ(Γ, ΓT ), λj := λ(Ψj, Γ). Since Γ ΓT by assumption, we have λ (0, 1]. The first step, in preparation for applying the PAC-Bayes deviation inequality, is to construct a family of random variables with moment generating function upper bounded by one. To do this, we utilize the weak trajectory small-ball condition (B.22), which implies for any v Sn 1 and j {1, . . . , S}: t=(j 1)k+1 v, Γ 1/2xt 2 αλ(Ψj, Γ) Let xt := Γ 1/2xt be the whitened vector. Define the random indicator variables for i = 1, . . . , m and j = 1, . . . , S: B(i) j := 1 t=(j 1)k+1 v, x(i) t 2 αλ(Ψj, Γ) By Markov's inequality: t=1 v, x(i) t 2 kα j=1 λj1{B(i) j = 1}. Hence for any η > 0 and v Sn 1: t=1 v, x(i) t 2 ! j=1 λj1{B(i) j = 1} Now observe: E[exp( ηkαλj1{B(i) j = 1}) | F(j 1)k] = e ηkαλj P(B(i) j = 1 | F(j 1)k) + P(B(i) j = 0 | F(j 1)k) = (e ηkαλj 1)P(B(i) j = 1 | F(j 1)k) + 1 (e ηkαλj 1)β + 1 (a) 1 + ηkαλj + 1 2η2k2α2λ2 j (b) exp ηkαλj + 1 2η2k2α2λ2 j Above, we used the facts (a) for x > 0, we have e x 1 x + x2 2 , and (b) for x R, we have 1 + x ex. Hence by the tower property: t=1 v, x(i) t 2 ! 2η2k2α2λ2 j 2η2k2α2β S X PS j=1 λ2 j PS j=1 λj Now, let us set PS j=1 λj PS j=1 λ2 j = 1, from which we conclude: t=1 v, x(i) t 2 ! PS j=1 λj 2 PS j=1 λ2 j 1 S PS j=1 λj 2 1 S PS j=1 λ2 j By independence across the m trajectories: t=1 v, x(i) t 2 + m Sβ As desired, we have constructed a family of random variables indexed by v Sn 1, with MGF bounded by one.
Using the PAC-Bayes arguments from Lemma B.13 followed by Markov's inequality, with probability at least 1 2e t, for all v Sn 1 and γ [0, 1/2]: i,t x(i) t , v 2 1 2CS n log 5 Choosing γ2 = nλ 4ηm Tet , we have that: 1 + log 5m Tetη Note that this choice of γ satisfies γ [0, 1/2], since: nλ 4ηm Tet 1 4 = n ηm T 1 since t 0 and λ 1. The RHS above is ensured by: PS j=1 λ2 j PS j=1 λj = αCS µ = m T kn CS since α, µ 1. If we further enforce that: 4CS (n + 1) t + log 5m Tη then combining (B.25) with (B.26): i,t x(i) t , v 2 1 2CS t n n log 5m Tetη 2CS (n + 1)t n n log 5m Tη 2CS (n + 1)t (n + 1) log 5m Tη 4ηCS = αβ m T µ For (B.27), it suffices that the following conditions hold: PS j=1 λj PS j=1 λ2 j β log 5 λα µCS By Proposition B.1 it suffices that: β max log 5 λα µCS β log 128CS Note that x log(1/x) 1/e for all x > 0, and therefore: CS log 5 λα µCS + 1 2CS log 5 Hence it suffices that: β max log 5 , log 128CS The claim now follows by simplifying all the required inequalities for the quantity m T/kn. To contrast the effects of the w Traj SB assumption with those of the Traj SB assumption, let us compare Lemma B.20 to its counterpart Lemma B.13. The minimum eigenvalue bound (B.24) from Lemma B.20 differs from the corresponding Traj SB bound (B.4) in the role of the eigenvalues of the matrices {Ψj} T/k j=1 from the small-ball definition. However, due to the differing requirements on the amount of data m T, neither result is necessarily sharper than the other, as detailed in the following remark: Remark B.21. When the matrices {Ψj} T/k j=1 from the trajectory small-ball definition vary across j, Lemma B.13 and Lemma B.20 yield different dependencies on the eigenvalues {λ(Ψj, Γ)} T/k j=1 .
In particular, Lemma B.13 yields a minimum eigenvalue bound scaling as m Tµ, where µ is the geometric mean of the eigenvalues {λ(Ψj, Γ)}, whereas Lemma B.20 yields a bound scaling as m T µ, where µ is the arithmetic mean of the eigenvalues. By the AM-GM inequality, we have that µ µ, so the latter bound is stronger than the former. However, Lemma B.20 has a stronger requirement on the amount of data, requiring that m T kn CS, where CS [1, S] is defined in (B.23), whereas Lemma B.13 has the weaker requirement that m T kn. In the worst case when CS S, the m T kn CS requirement simplifies to the many trajectories assumption m n. Thus, the qualitative behavior of these two bounds is not necessarily comparable. Meanwhile, although neither bound is strictly sharper than the other, if we assume polynomial growth of the {Ψj} matrices, then the two bounds are roughly on par: Remark B.22. When the matrices {Ψj} exhibit low degree polynomial growth, both Lemma B.13 and Lemma B.20 yield similar qualitative behavior. Concretely, let us suppose that k = 1, Ψj = jp I for j [T], and Γ = 1 T PT j=1 Ψj. Then, µ = 1, whereas µ 1+p ep . Thus, if we consider p as constant, then µ µ. We now state our general OLS upper bound under the weak trajectory small-ball condition. Lemma B.23 (General OLS upper bound, high probability). There are universal positive constants c0 and c1 such that the following holds. Suppose that Px satisfies the (T, k, {Ψj} T/k j=1 , α, β)-w Traj SB condition (Definition B.1). Put S := T/k and Γt := Γt(Px) for t N+. Fix any Γ Symn >0 satisfying 1 S PS j=1 Ψj Γ ΓT , and the constants: 1 S PS j=1 λ(Ψj, Γ)2 1 S PS j=1 λ(Ψj, Γ) 2 , µ := 1 j=1 λ(Ψj, Γ). Fix δ (0, 1/e). Suppose that: 1 β log CS α(1 β)λ(Γ, ΓT ) µδ Then, for any Γ Symn >0, with probability at least 1 δ: ˆWm,T W 2 Γ c1σ2 ξ pn log 1 α(1 β)λ(Γ,ΓT ) µδ λ(Γ, Γ )α(1 β)m T µ Proof Put β := 1 β and Xm,T := Xm,T Γ 1/2.
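As a quick numerical aside on Remarks B.21 and B.22, the arithmetic-versus-geometric mean gap can be checked directly. The sketch below (pure Python; the helper name `am_gm` is ours, not from the paper) treats the scalar case k = 1, Ψj = j^p, and Γ = (1/T) Σ_t Ψt, so that λ(Ψj, Γ) = j^p / ((1/T) Σ_t t^p):

```python
import math

def am_gm(T, p):
    """AM and GM of the eigenvalue ratios lambda_j = j^p / ((1/T) sum_t t^p)."""
    mean = sum(t ** p for t in range(1, T + 1)) / T
    lams = [j ** p / mean for j in range(1, T + 1)]
    am = sum(lams) / T                                  # arithmetic mean: exactly 1 here
    gm = math.exp(sum(math.log(l) for l in lams) / T)   # geometric mean (via logs)
    return am, gm

am, gm = am_gm(T=1000, p=2)
```

Consistent with the remarks, `am` equals 1 while `gm` stays bounded below by roughly (1 + p)/e^p (about 0.41 for p = 2), so for constant p the two means, and hence the two bounds, are on par up to constants.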
By the arguments in the proof of Lemma 5.1, ˆWm,T W 2 Γ min{n, p} ( XT m,T Xm,T ) 1/2 XT m,T Ξm,T 2 op λ(Γ, Γ ) λmin( XT m,T Xm,T ) . Put M := (αβ m T µ/8) I := ζ I. By Proposition B.10, with probability at least 1 δ/2: 1{ XT m,T Xm,T M} ( XT m,T Xm,T ) 1/2 XT m,T Ξm,T 2 op p log 5 + 1 2 log det In + ζ 1 XT m,T Xm,T + log(2/δ) 32σ2 ξ h p + log det In + ζ 1 XT m,T Xm,T + log(2/δ) i 32σ2 ξ h p + n log(1 + ζ 1 tr( XT m,T Xm,T )/n) + log(2/δ) i . Now, by Lemma B.20, with probability at least 1 δ/2, we also have: λmin( XT m,T Xm,T ) ζ, tr( XT m,T Xm,T ) 4m Tn On both events: ( XT m,T Xm,T ) 1/2 XT m,T Ξm,T 2 op 32σ2 ξ p + n log 1 + 32 αβ λ µδ p + n log 33 αβ λ µδ Combining the inequalities: ˆWm,T W 2 Γ 512σ2 ξ min{n, p} p + n log 33 αβ λ µδ λ(Γ, Γ )αβ m T µ pn log 33 αβ λ µδ λ(Γ, Γ )αβ m T µ By a union bound, both events hold with probability at least 1 δ, which concludes the proof. B.7.3 Mixing implies weak trajectory small-ball One advantage of Definition B.1 is that it is implied by the standard notions of φ-mixing in the literature (see e.g. Mohri and Rostamizadeh (2008); Duchi et al. (2012); Kuznetsov and Mohri (2017)). In this section, we prove this reduction. First, we state the definition of φ-mixing. Definition B.2 (φ-mixing covariate sequence). Let {xt}t 1 be a covariate sequence which is adapted to a filtration {Ft}t 1. Define the function φ(k) as: φ(k) := sup t N+ sup B Ft Pxt+k( | B) Pxt+k tv. (B.28) The process {xt}t 1 is called φ-mixing if limk φ(k) = 0. We also let φ̄(k) denote the upper envelope of φ(k), i.e., φ̄(k) := sup{φ(k ) : k k}. The following result shows that a φ-mixing covariate sequence where each marginal distribution is weakly small-ball satisfies the weak trajectory small-ball condition. Proposition B.24. Fix α (0, 1) and β (0, 1/4). Suppose that the covariate sequence {xt}t 1 is φ-mixing, and that for every t N+ and v Rn \ {0} we have: Pxt{ v, xt 2 αv TΣtv} β, Σt := E[xtx T t ].
(B.29) Let kmix := inf{k N+ | φ(k) β} and assume that T 2kmix. Put S := T/(2kmix) and suppose that {Ψj}S j=1 satisfies: 4Σt j [S], t [kmix(2j 1) + 1, 2jkmix]. Then, Px satisfies the T, 2kmix, {Ψj}T/(2kmix) j=1 , α, 4 2 + β -w Traj SB condition (cf. Definition B.1). Proof Fix j [S]. Since Ψj 1 4Σt for all t [kmix(2j 1) + 1, 2jkmix], we have: t=kmix(2j 1)+1 Σt. j=1 Ψj 1 4Skmix t=kmix(2j 1)+1 Σt 1 4Skmix t=1 Σt ΓT . Above, the last inequality holds since T/(2kmix) T/(4kmix). By definition of φ-mixing (cf. Definition B.2) and the upper envelope φ, for any j [S] and t kmix(2j 1) + 1: Pxt n v, xt 2 4α v TΨjv F(j 1)2kmix o Pxt n v, xt 2 4α v TΨjv o + β Pxt n v, xt 2 α v TΣtv o + β t=(j 1)2kmix+1 Pxt n v, xt 2 4α v TΨjv F(j 1)2kmix o 1 2kmix [kmix + 2βkmix] = 1 The claim now follows from Proposition 4.2. We conclude by noting that φ-mixing is a stronger notion of mixing than β-mixing, where (B.28) is only required to hold in expectation. We leave to future work an analysis that only relies on the weaker β-mixing. Appendix C. Analysis for lower bounds C.1 Preliminaries Here, we collect the necessary auxiliary results we will use to prove the lower bound. The first result is an instance of the well-known fact that the conditional mean is the estimator which minimizes the mean squared error. Proposition C.1. Let T N+ and {Px,t}T t=1 be a sequence of distributions over Rn with finite second moments Σt := Ext Px,t[xtx T t ]. Let PW be an arbitrary distribution on Rp n. Put ΓT := 1 T PT t=1 Σt. We have: inf ˆ W EW PW t=1 Ext Px,t ˆW(xt) Wxt 2 2 = EW PW EW PW [W ] W 2 ΓT , where the infimum ranges over measurable functions ˆW : Rn Rp. Proof Let µT := 1 T PT t=1 Px,t denote the uniform mixture distribution, so that t=1 Ext Px,t ˆW(xt) Wxt 2 2 = E x µT ˆW( x) W x 2 2.
By repeated applications of Fubini's theorem, inf ˆ W EW PW E x µT ˆW( x) W x 2 2 = inf ˆ W E x µT EW PW ˆW( x) W x 2 2 inf ˆy Rp EW PW ˆy W x 2 2 = E x µT EW PW EW PW [W ] x W x 2 2 = EW PW E x µT EW PW [W ] x W x 2 2 = EW PW EW PW [W ] W 2 ΓT . The next result is a simple fact which states that if a function is strictly increasing and concave on an interval, then any root of the function is lower bounded by the root of the linear approximation at any point in the interval. Proposition C.2. Let f : I R be a C1(I) function that is strictly increasing and concave on an interval I R. Suppose that f has a (unique) root x0 I. For any x I, we have that: Proof Because f is concave on I, we have that: 0 = f(x0) f(x) + f (x)(x0 x). Next, because f is strictly increasing on I, we have that f (x) > 0. The claim now follows by re-arranging the previous inequality. The next result states that the trace inverse of any positive definite matrix is lower bounded by the trace inverse of any principal submatrix. The claim is immediate from Cauchy's eigenvalue interlacing theorem, but we give a more direct proof. Proposition C.3. Let M Rq n have full column rank. Let I {1, . . . , n} be any index set, and let EI : Rn R|I| denote any linear map which extracts the coordinates associated to I. We have: tr((MTM) 1) tr((EIMTMET I ) 1). Proof Fix a z Rn. Since M has full column rank, we have that (MT) = M(MTM) 1. Therefore, min c Rq:MTc=z c 2 2 = (MT) z 2 2 = z T(MTM) 1z. Taking expectation with z N(0, In), tr((MTM) 1) = Ez N(0,In) min c Rq:MTc=z c 2 2 On the other hand, we have that: min c Rq:MTc=z c 2 2 min c Rq:EIMTc=EIz c 2 2. This is clear because for any c Rq satisfying MTc = z, the equality EIMTc = EIz trivially holds. This means we have the following set inclusion: {c Rq | MTc = z} {c Rq | EIMTc = EIz}.
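As an aside, Proposition C.3 is straightforward to sanity-check numerically. A minimal sketch with NumPy on a random instance (the dimensions, seed, and index set are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
q, n = 12, 5
M = rng.standard_normal((q, n))      # full column rank almost surely
G = M.T @ M                          # the Gramian M^T M

def tr_inv(A):
    """Trace of the inverse of a positive definite matrix."""
    return float(np.trace(np.linalg.inv(A)))

I = [0, 2, 4]                        # an index set of coordinates
full = tr_inv(G)                     # tr((M^T M)^{-1})
sub = tr_inv(G[np.ix_(I, I)])        # tr((E_I M^T M E_I^T)^{-1})
```

The proposition asserts `full >= sub` for every index set, which the random instance confirms.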
Therefore, minimizing any function over the first set will be lower bounded by minimizing the same function over the second set. From this inclusion, we conclude for any index set I: tr((MTM) 1) = Ez N(0,In) min c Rq:MTc=z c 2 2 min c Rq:EIMTc=EIz c 2 2 = Ez N(0,I|I|) min c Rq:EIMTc=z c 2 2 = tr((EIMTMET I ) 1). Next, we state well-known upper and lower tail bounds for chi-squared random variables. Lemma C.4 (Laurent and Massart (2000, Lemma 1)). Let g1, . . . , gD be iid N(0, 1) random variables, and let a1, . . . , aD be non-negative scalars. For any t > 0, we have: i=1 ai(g2 i 1) 2 i=1 a2 i + 2t max i=1,...,D ai i=1 ai(g2 i 1) 2 Finally, we conclude with a convex extension of Gordon's min-max theorem. Theorem C.5 (Thrampoulidis et al. (2014, Theorem II.1)). Let A Rm n, g Rm, and h Rn have iid N(0, 1) entries and be independent of each other. Suppose that S1 Rn and S2 Rm are non-empty compact convex sets, and let ψ : S1 S2 R be a continuous, convex-concave function. For every t R, we have: P min x S1 max y S2 h y TAx + ψ(x, y) i t 2P min x S1 max y S2 h x 2g Ty + y 2h Tx + ψ(x, y) i t . C.2 Proof of Lemma 6.1 We first prove the following intermediate result, which holds under the Gaussian observation noise model (Definition 7.1). Lemma C.6. Let T N+, {Px,t}T t=1 be a sequence of distributions over Rn with finite second moments Σt := Ext Px,t[xtx T t ], and σξ > 0. Let PX be a distribution on Rq n with q n such that for X PX, XTX is invertible almost surely. For W Rp n, let PW be the distribution over Rq n Rq p with (X, Y ) PW satisfying X PX and Y | X = XW T + Ξ, where Ξ Rq p has iid N(0, σ2 ξ) entries (and is independent of everything else). Put ΓT := 1 T PT t=1 Σt. We have that: inf ˆ W sup W Rp n E(X,Y ) PW t=1 Ext Px,t ˆW(X, Y, xt) Wxt 2 2 σ2 ξp EX PX tr(Γ1/2 T (XTX) 1Γ1/2 T ), where the infimum ranges over all measurable functions ˆW : Rq n Rq p Rn Rp.
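As an aside, the upper tail bound of Lemma C.4, which reads P(Σ aᵢ(gᵢ² − 1) ≥ 2√(t Σ aᵢ²) + 2t maxᵢ aᵢ) ≤ e⁻ᵗ, can be illustrated by simulation. A NumPy sketch; the weights, deviation t, seed, and trial count are illustrative choices of ours:

```python
import numpy as np

rng = np.random.default_rng(1)
D, t, trials = 50, 1.0, 20000
a = np.ones(D)                                    # non-negative weights a_i
g = rng.standard_normal((trials, D))
stat = ((g ** 2 - 1) * a).sum(axis=1)             # sum_i a_i (g_i^2 - 1), per trial
thresh = 2 * np.sqrt(t * (a ** 2).sum()) + 2 * t * a.max()
emp = float((stat >= thresh).mean())              # empirical upper-tail probability
```

The empirical tail frequency comes out well below the stated bound e⁻ᵗ, as expected (the bound is not tight for uniform weights).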
Proof The proof extends the Bayesian argument from Mourtada (2022, Theorem 1). Let pλ be any prior distribution over Rp n. Let µT := 1 T PT t=1 Px,t denote the uniform mixture. Bounding the minimax risk from below by the Bayes risk: inf ˆ W sup W Rp n E(X,Y ) PW E x µT ˆW(X, Y, x) W x 2 2 inf ˆ W EWλ pλE(X,Y ) PWλE x µT ˆW(X, Y, x) Wλ x 2 2 = inf ˆ W E(X,Y )EWλ|(X,Y )E x µT ˆW(X, Y, x) Wλ x 2 2 using Fubini's theorem = E(X,Y ) inf ˆ WX,Y EWλ|(X,Y )E x µT ˆWX,Y ( x) Wλ x 2 2 where ˆWX,Y maps Rn Rp = E(X,Y )EWλ|(X,Y ) E[Wλ | X, Y ] Wλ 2 ΓT using Proposition C.1. Now let Wλ pλ have iid N(0, 1/λ) entries for λ > 0. Noting that vec(Y ) = (Ip X)vec(W T λ ) + vec(Ξ), we see that the vector vec(W T λ ) vec(Y ) is jointly Gaussian conditioned on X: vec(W T λ ) vec(Y ) λIpn 1 λ(Ip XT) 1 λ(Ip XXT) + σ2 ξIqp Therefore, the distribution of vec(W T λ ) | X, Y is: vec(W T λ ) | X, Y N(µλ, Σλ), λ(Ip XXT) + σ2 ξIqp λ2 (Ip XT) 1 λ(Ip XXT) + σ2 ξIqp A generalization of the identity XT( 1 λXXT + σ2 ξIq) 1 = ( 1 λXTX + σ2 ξIn) 1XT yields: λ(Ip XXT) + σ2 ξIqp λ(Ip XTX) + σ2 ξInp E[vec(W T λ ) | X, Y ] vec(W T λ ) = µλ vec(W T λ ) = h (Ip XTX) + σ2 ξλInp i 1 (Ip XT)vec(Y ) vec(W T λ ) = h (Ip XTX) + σ2 ξλInp i 1 (Ip XTX) Inp vec(W T λ ) + h (Ip XTX) + σ2 ξλInp i 1 (Ip XT)vec(Ξ). Observing that E[Wλ | X, Y ] Wλ 2 ΓT = E[vec(W T λ ) | X, Y ] vec(W T λ ) 2 Ip ΓT , and defining MX(λ) := (Ip XTX) + σ2 ξλInp, we have the following bias-variance decomposition: EX,Ξ,Wλ E[Wλ | X, Y ] Wλ 2 ΓT = EX,Ξ,Wλ h M 1 X (λ)(Ip XTX) Inp i vec(W T λ ) 2 + σ2 ξEX tr (Ip Γ1/2 T )M 1 X (λ)(Ip XTX)M 1 X (λ)(Ip Γ1/2 T ) σ2 ξEX tr (Ip Γ1/2 T )M 1 X (λ)(Ip XTX)M 1 X (λ)(Ip Γ1/2 T ) .
Since λ 7 tr (Ip Γ1/2 T )M 1 X (λ)(Ip XTX)M 1 X (λ)(Ip Γ1/2 T ) is non-negative and decreasing in λ for λ > 0, by the monotone convergence theorem: lim λ 0+ EX,Ξ,Wλ E[Wλ | X, Y ] Wλ 2 ΓT σ2 ξ lim λ 0+ EX tr (Ip Γ1/2 T )M 1 X (λ)(Ip XTX)M 1 X (λ)(Ip Γ1/2 T ) = σ2 ξEX tr (Ip Γ1/2 T )M 1 X (0)(Ip XTX)M 1 X (0)(Ip Γ1/2 T ) = σ2 ξEX tr((Ip Γ1/2 T (XTX) 1Γ1/2 T )) = σ2 ξp EX tr(Γ1/2 T (XTX) 1Γ1/2 T ). Since the first expression above lower bounds the minimax risk, this concludes the proof. We now restate and prove Lemma 6.1. Lemma 6.1 (Expected trace of inverse covariance bounds risk from below). Fix m, T N+ and a set of covariate distributions Px. Suppose that for every Px Px, the data matrix Xm,T Rm T n drawn from m i=1Px has full column rank almost surely. The minimax risk R(m, T, T ; Px) satisfies: R(m, T, T ; Px) σ2 ξp sup Px Px E m i=1Px h tr Γ1/2 T (Px)(XT m,T Xm,T ) 1Γ1/2 T (Px) i . (6.1) Proof Fix a Px Px, and let {Px,t}T t=1 denote its marginal distributions up to time T . Let Pg ξ denote the σξ-MDS corresponding to the Gaussian observation noise model (Definition 7.1). Note that for any hypothesis f : Rn Rp, we have from (3.2): L( ˆf; T , Px) = EPx t=1 ˆf(xt) W xt 2 2 t=1 Ext Px,t ˆf(xt) W xt 2 2. By the definition of R(m, T, T ; Px) from (3.5) and Lemma C.6: R(m, T, T ; Px) inf Alg sup W E m i=1PW x,y [Px,Pg ξ] h L Alg({(x(i) t , y(i) t )}m,T i=1,t=1); T , Px i σ2 ξp E m i=1Px h tr Γ1/2 T (Px)(XT m,T Xm,T ) 1Γ1/2 T (Px) i . Since the bound above holds for any Px Px, we can take the supremum over Px Px, from which the claim follows. C.3 A general risk lower bound We now state a lower bound which applies with an arbitrary number of trajectories. Lemma C.7. Suppose that Px is any set containing P0n n x and PIn x . Let m T n. Then: R(m, T, T ; Px) σ2 ξ 2 pn Proof Define ζ(A) := E m i=1PA x h tr Γ1/2 T (A)(XT m,T Xm,T ) 1Γ1/2 T (A) i . By Lemma 6.1: R(m, T, T ; Px) σ2 ξp max{ζ(0n n), ζ(In)}.
(C.1) Next, for any M Symn 0, the function X 7 tr(M1/2X 1M1/2) is convex on the domain Symn >0. To see this, we define f(X; v) := v TX 1v for X Symn >0. We can write f(X; v) as f(X; v) = sup z TXz + 2v Tz | z Rn ; therefore X 7 f(X; v) is convex on Symn >0, since it is the pointwise supremum of functions affine in X. Now we see that X 7 tr(M1/2X 1M1/2) is convex, since tr(M1/2X 1M1/2) = Pn i=1 f(X; M1/2ei), which is the sum of convex functions. Therefore by Jensen's inequality, whenever XT m,T Xm,T is invertible almost surely, ζ(A) = E m i=1PA x h tr Γ1/2 T (A)(XT m,T Xm,T ) 1Γ1/2 T (A) i tr Γ1/2 T (A)(E m i=1PA x [XT m,T Xm,T ]) 1Γ1/2 T (A) = tr Γ1/2 T (A)(m T ΓT (A)) 1Γ1/2 T (A) = tr(ΓT (A)Γ 1 T (A)) m T . We first consider the case when A = 0n n. Under these dynamics, it is a standard fact that when m T n, then XT m,T Xm,T is invertible almost surely. Furthermore, Γt(0n n) = In for all t. Hence, ζ(0n n) n m T . Next, we consider the case when A = In. We first argue that as long as m T n, the matrix XT m,T Xm,T is invertible almost surely. We write x(i) t = Pt k=1 w(i) k , where {w(i) t }m,T i=1,t=1 are all iid N(0, In) vectors. Let p : Rm Tn R be the polynomial p({w(i) t }) = det(XT m,T Xm,T ). The zero-set of p is either all of Rm Tn, or Lebesgue measure zero. We will select {w(i) t } so that p({w(i) t }) = 0, which shows that the zeros of this polynomial are not all of Rm Tn, and hence Lebesgue measure zero. Since the Gaussian measure on Rm Tn is absolutely continuous w.r.t. the Lebesgue measure on Rm Tn, this implies that det(XT m,T Xm,T ) = 0 almost surely. To select {w(i) t }, we introduce some notation. Let ei Rn denote the i-th standard basis vector. For any positive integer k, let U(k) Rk k be the upper triangular matrix with ones for all its non-zero entries. Let S(k) = U(k)U(k)T. By construction, S(k) is invertible since U(k) is invertible. We put w(i) t = e(i 1)T+t 1{(i 1)T + t n}.
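The deterministic choice w(i)t = e(i−1)T+t 1{(i−1)T+t ≤ n} just defined can be reproduced directly. A NumPy sketch with small illustrative n, m, T of our choosing (with T < n and mT ≥ n), forming x(i)t = Σ_{k≤t} w(i)k under the A = In dynamics and checking that the resulting Gramian is invertible:

```python
import numpy as np

n, m, T = 5, 3, 2                    # m*T >= n and T < n (illustrative choice)
rows = []
for i in range(1, m + 1):
    x = np.zeros(n)
    for t in range(1, T + 1):
        idx = (i - 1) * T + t        # w_t^{(i)} = e_{(i-1)T+t} if the index <= n
        w = np.zeros(n)
        if idx <= n:
            w[idx - 1] = 1.0
        x = x + w                    # x_t^{(i)} = sum_{k<=t} w_k^{(i)}
        rows.append(x.copy())

X = np.stack(rows)                   # the (m*T) x n data matrix X_{m,T}
det = float(np.linalg.det(X.T @ X))
```

Here XᵀX is block diagonal with blocks S(2), S(2), S(1) (in the paper's notation), so its determinant is nonzero, exhibiting one point where the polynomial p does not vanish.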
We now claim that with this choice of {w(i) t }, the matrix XT m,T Xm,T is invertible. Suppose first that T n. Then we have that XT m,T Xm,T = S(n), and therefore det(XT m,T Xm,T ) = 0. On the other hand, suppose that T < n. Because m T n, we have that: XT m,T Xm,T = BDiag(S(T), . . . , S(T) | {z } n/T times , S(n T n/T )), where BDiag(M1, . . . , Mk) denotes the block diagonal matrix with diagonal blocks M1, . . . , Mk. Since S(T) and S(n T n/T ) are both invertible, so is XT m,T Xm,T and therefore det(XT m,T Xm,T ) = 0. Thus, XT m,T Xm,T is invertible almost surely. Next, we note that Σt(In) = t In and Γt(In) = 1 t Pt k=1 k In = t+1 2 In. Hence we have ΓT (In)Γ 1 T (In) = T +1 2T In, and therefore ζ(In) n 2m T T T . Combining our bounds on ζ(0n n) and ζ(In), we have the desired claim: R(m, T, T ; Px) σ2 ξp max n m T , n 2m T T C.4 Non-isotropic random gramian matrices The goal of this subsection is to prove Lemma 7.1, which gives a bound on the expected trace inverse of a non-isotropic random gramian matrix. We first prove an auxiliary lemma, which will be used as a building block in the proof. Lemma C.8. Fix any x Rq. Let g Rq and h Rn be random vectors with iid N(0, 1) entries, and let W Rq n be a random matrix with iid N(0, 1) entries. Let Σ Rq q be positive definite. We have that: E min α Rn Σ1/2Wα x 2 2 E min β 0 max τ 0 τ + βg Σ 1/2x 2 (Σ 1+β h 2τIq) 1 Proof The proof invokes the convex Gaussian min-max lemma (Theorem C.5) via a limiting argument. In what follows, let {αk}k 1 and {vk}k 1 be any two positive, increasing sequences of scalars tending to + . It is clear that for every W, lim k min α 2 αk Σ1/2Wα x 2 2 = min α Rn Σ1/2Wα x 2 2. Since α = 0 is always a feasible solution to min α 2 αk Σ1/2Wα x 2 2, we have for every k 1: 0 min α 2 αk Σ1/2Wα x 2 2 x 2 2. Therefore, by the dominated convergence theorem, E min α Rn Σ1/2Wα x 2 2 = E lim k min α 2 αk Σ1/2Wα x 2 2 = lim k E min α 2 αk Σ1/2Wα x 2 2.
We next state two variational forms which we will use: 1 2 x 2 2 = max v Rq v Tx v 2 2 2 x 2 = min τ 0 x 2 2τ 2 + 1 Using the first variational form (C.3), we have for every W and k1 1, min α 2 αk1 1 2 Σ1/2Wα x 2 2 = min α 2 αk1 max v Rq v T(Σ1/2Wα x) v 2 2 2 = min α 2 αk1 max v Rq v TWα v TΣ 1/2x v TΣ 1v = min α 2 αk1 max v 2 ΣW opαk1+ Σ1/2x 2 v TWα v TΣ 1/2x v TΣ 1v = lim k2 min α 2 αk1 max v 2 vk2 v TWα v TΣ 1/2x v TΣ 1v Observe that for every k2 1, 0 min α 2 αk1 max v 2 vk2 v TWα v TΣ 1/2x v TΣ 1v max v 2 vk2 v TΣ 1/2x v TΣ 1v v TΣ 1/2x v TΣ 1v Therefore, by (C.2) and another application of the dominated convergence theorem: E min α Rn Σ1/2Wα x 2 2 = lim k1 E min α 2 αk1 Σ1/2Wα x 2 2 = lim k1 E lim k2 min α 2 αk1 max v 2 vk2 v TWα v TΣ 1/2x v TΣ 1v = lim k1 lim k2 E min α 2 αk1 max v 2 vk2 v TWα v TΣ 1/2x v TΣ 1v We now apply Theorem C.5 to the expectation on the RHS of (C.5): E min α 2 αk1 max v 2 vk2 v TWα v TΣ 1/2x v TΣ 1v min α 2 αk1 max v 2 vk2 v TWα v TΣ 1/2x v TΣ 1v min α 2 αk1 max v 2 vk2 α 2g Tv + v 2h Tα v TΣ 1/2x v TΣ 1v = 2E min α 2 αk1 max v 2 vk2 α 2g Tv + v 2h Tα v TΣ 1/2x v TΣ 1v 2E min α 2 αk1 max v Rq α 2g Tv + v 2h Tα v TΣ 1/2x v TΣ 1v Above, inequality (a) is an application of Theorem C.5. Now for every k1, g, and h, define ψk1(g, h) := min α 2 αk1 max v Rq α 2g Tv + v 2h Tα v TΣ 1/2x v TΣ 1v For every k1, g, and h, we have 0 ψk1(g, h) max v Rq v TΣ 1/2x v TΣ 1v = x 2 2 2 . Furthermore, since {αk} is an increasing sequence, the sequence {ψk(g, h)}k 1 is monotonically decreasing.
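As an aside, the two variational forms above, which we read as ½‖x‖² = max_v ⟨v, x⟩ − ½‖v‖² for (C.3) and ‖x‖ = min_{τ>0} ‖x‖²/(2τ) + τ/2 for (C.4), are easy to verify numerically. A NumPy sketch with a fixed vector x of our choosing (maximizer v = x, minimizer τ = ‖x‖):

```python
import numpy as np

x = np.array([3.0, 4.0, 0.0, 0.0, 0.0, 0.0])
nx = float(np.linalg.norm(x))            # ||x|| = 5

# (C.3): 0.5*||x||^2 = max_v <v, x> - 0.5*||v||^2, attained at v = x
rng = np.random.default_rng(0)
candidates = rng.standard_normal((200, x.size))
vals_c3 = candidates @ x - 0.5 * (candidates ** 2).sum(axis=1)
at_opt = float(x @ x - 0.5 * (x @ x))    # objective value at v = x

# (C.4): ||x|| = min_{tau > 0} ||x||^2/(2*tau) + tau/2, attained at tau = ||x||
taus = np.linspace(0.1, 10.0, 400)
vals_c4 = (x @ x) / (2 * taus) + taus / 2
```

No random candidate beats v = x in (C.3), and the grid minimum of (C.4) sits at the norm of x, as the closed forms predict.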
Therefore, by the monotone convergence theorem, lim k ψk(g, h) = inf{ψk(g, h) | k N+} = min α Rn max v Rq α 2g Tv + v 2h Tα v TΣ 1/2x v TΣ 1v Therefore by another application of the dominated convergence theorem, we have that: lim k1 E min α 2 αk1 max v Rq α 2g Tv + v 2h Tα v TΣ 1/2x v TΣ 1v = E lim k1 min α 2 αk1 max v Rq α 2g Tv + v 2h Tα v TΣ 1/2x v TΣ 1v = E min α Rn max v Rq α 2g Tv + v 2h Tα v TΣ 1/2x v TΣ 1v Chaining together inequalities (C.5), (C.6), (C.7), we have: E min α Rn Σ1/2Wα x 2 2 2E min α Rn max v Rq α 2g Tv + v 2h Tα v TΣ 1/2x v TΣ 1v We now proceed to study the RHS of (C.8), which we denote by (AO) (the auxiliary optimization problem): (AO) := min α Rn max v Rq α 2g Tv + v 2h Tα v TΣ 1/2x v TΣ 1v = min β 0 min θ [ 1,1] max v Rq βg Tv + β v 2 h 2θ v TΣ 1/2x v TΣ 1v (a) = min β 0 max v Rq βg Tv β v 2 h 2 v TΣ 1/2x v TΣ 1v (b) = min β 0 max v Rq max τ 0 βg Tv β h 2 v 2 2 τ 2 β h 2 2τ v TΣ 1/2x v TΣ 1v = min β 0 max τ 0 max v Rq βg Tv β h 2 v 2 2 τ 2 β h 2 2τ v TΣ 1/2x v TΣ 1v = min β 0 max τ 0 2 βg Σ 1/2x 2 (Σ 1+β h 2τIq) 1 Above, (b) holds by the variational form (C.4). The proof is now finished after justifying (a). First, let hβ(θ, v) denote the term in the bracket, so that (AO) = min β 0 min θ [ 1,1] max v Rq hβ(θ, v). Fix a β 0. By weak duality, min θ [ 1,1] max v Rq hβ(θ, v) max v Rq min θ [ 1,1] hβ(θ, v) = max v Rq hβ( 1, v). On the other hand, min θ [ 1,1] max v Rq hβ(θ, v) min θ [ 1,0] max v Rq hβ(θ, v) = max v Rq min θ [ 1,0] hβ(θ, v) = max v Rq hβ( 1, v). The first equality above is Sion's minimax theorem, since the function θ 7 hβ(θ, v) is affine for every v and the function v 7 hβ(θ, v) is concave for θ [ 1, 0]. Therefore, min θ [ 1,1] max v Rq hβ(θ, v) = max v Rq hβ( 1, v). With Lemma C.8 in hand, we can now restate and prove Lemma 7.1. Lemma 7.1. Let q, n be positive integers with q n and n 2. Let W Rq n have iid N(0, 1) entries, and let Σ Rq q be positive definite.
Let g N(0, Iq) and h N(0, In 1), with g and h independent. Also, let {ei}q i=1 be the standard basis vectors in Rq. We have: E tr((W TΣW) 1) n Pq i=1 E minβ 0 maxτ 0 h β h 2 τ + βg ei 2 (Σ 1+β h 2τIq) 1 i. (7.3) Proof We rewrite E tr((W TΣW) 1) in a way that is amenable to Lemma C.8. Let w1 Rq denote the first column of W, so that W = w1 W2 with W2 Rq (n 1). We write: W TΣW = w1 2 Σ w T 1 ΣW2 W T 2 Σw1 W T 2 ΣW2 Using the block matrix inversion formula to compute the (1, 1) entry of (W TΣW) 1: ((W TΣW) 1)11 = (w T 1 (Σ ΣW T 2 (W T 2 ΣW2) 1W T 2 Σ)w1) 1 = (w T 1 Σ1/2(I PΣ1/2W2)Σ1/2w1) 1 = (w T 1 Σ1/2P Σ1/2W2Σ1/2w1) 1. Since the columns of W are all independent and identically distributed, this calculation shows that the law of ((W TΣW) 1)ii is the same as the law of ((W TΣW) 1)11 for all i = 1, . . . , n. Therefore: E tr((W TΣW) 1) = i=1 E((W TΣW) 1)ii = n E(w T 1 Σ1/2P Σ1/2W2Σ1/2w1) 1 n E tr(Σ1/2P Σ1/2W2Σ1/2). The last inequality follows from Jensen's inequality combined with the independence of w1 and W2. By decomposing tr(Σ1/2P Σ1/2W2Σ1/2) = Pq i=1 P Σ1/2W2Σ1/2ei 2 2 and observing that P Σ1/2W2x 2 2 = min α Rn 1 Σ1/2W2α x 2 2 x Rq, we have the following identity: E tr(Σ1/2P Σ1/2W2Σ1/2) = i=1 E min αi Rn 1 Σ1/2W2αi Σ1/2ei 2 2. Invoking Lemma C.8 with x = Σ1/2ei for i = 1, . . . , q yields E tr(Σ1/2P Σ1/2W2Σ1/2) i=1 E min β 0 max τ 0 τ + βg ei 2 (Σ 1+β h 2τIq) 1 where g N(0, Iq) and h N(0, In 1). The claim now follows. We conclude this section with the following technical result which we will use in the sequel. Lemma C.9. Let q, n N+ with q n and n 6, and let Σ Symq >0. Let g N(0, Iq) and h N(0, In 1) with g and h independent. Define the random variables Zi for i {1, . . . , q} as: Zi := min β 0 max τ 0 2τ + β2 g 2 (Σ 1+β h 2τIq) 1 + (Σ 1 + β h 2τIq) 1 ii Let {λi}q i=1 denote the eigenvalues of Σ 1 listed in decreasing order.
Define n1 and the random function p(y) as: 64, p(y) := y λi + yg2 i n1 There exists an event E (over the probability of g and h) such that the following statements hold: (a) P(Ec) e n/128 + e q/16. (b) On E, there exists a unique root y (0, ) such that p(y ) = 0. (c) The following bounds hold for i {1, . . . , q}: Zi Σii, 1{E}Zi 1{E}(Σ 1 + y Iq) 1 ii . (C.11) Proof First, we observe that we can trivially upper bound the value of Zi by setting β = 0 and obtaining the bound Zi Σii. Furthermore, by the rotational invariance of g and the fact that g and h are independent, we have that Zi is equal in distribution to: Zi = min β 0 max τ 0 2τ + β2 q X g2 i λi + β h 2τ + (Σ 1 + β h 2τIq) 1 ii Define the following events: Eh := h 2 n/8 , Eg := i=1 g2 i q/2 and put E := Eh Eg. Since n 6, by a standard computation we have that E h 2 n/4. Therefore, by Gaussian concentration of Lipschitz functions (cf. Wainwright, 2019, Chapter 2), P(Ec h) e n/128. Furthermore, Lemma C.4 yields that P(Ec g) e q/16. By a union bound, P(Ec) e n/128 + e q/16. We now focus on upper bounding the quantity: 1{E}Zi 1{E} min β 0 max τ 0 16τ + β2 q X g2 i λi + β nτ/8 + (Σ 1 + β nτ/8Iq) 1 ii | {z } =:ℓi(β,τ) Let us bracket the value of the game minβ 0 maxτ 0 ℓi(β, τ). We previously noted that ℓi(0, τ) = Σii for all τ [0, ). Next, for any β > 0, limτ ℓi(β, τ) = 0. Hence, min β 0 max τ 0 ℓi(β, τ) [0, Σii]. Recalling from (C.10) that n1 = n/64 (so that n1 = n/8) and defining f, qi as: f(x) := x n1 g2 i λi + x n1 , qi(x) := (Σ 1 + x n1) 1 ii , we have that ℓi(β, τ) = 1 τ 2 f(βτ) + qi(βτ). In order to sharpen our estimate for the value of the game, we will study the positive critical points (β, τ) R2 >0 of the game minβ maxτ ℓi(β, τ), i.e., the points (β, τ) R2 >0 satisfying ℓi β (β, τ) = 0 and ℓi τ (β, τ) = 0. Note that in general for a nonconvex/nonconcave game, this is not a necessary first order optimality condition for the global min/max value (see e.g.
Jin et al., 2020, Proposition 21). However, for every fixed β > 0, the function τ 7 ℓi(β, τ) is strictly concave at each of its stationary points on R>0 by Proposition C.10. Hence, by the implicit function theorem (or alternatively Jin et al. (2020, Theorem 23)), the first order stationarity conditions ℓi β (β, τ) = 0 and ℓi τ (β, τ) = 0 are necessary for global min/max optimality. For τ = 0, this yields: 0 = τ 2f (βτ)β 2τ 3f(βτ) + q i(βτ)β, 0 = τ 2f (βτ)τ + q i(βτ)τ. Together, these conditions imply that f(βτ) = 0, and that the value of the game at such a critical point is qi(βτ). Thus, we are interested in the positive roots of f(x) = 0. To proceed, recall the definition of p(y) from (C.10): y λi + yg2 i n1 Note that y is a positive root of p iff y/ n1 is a positive root of f. Since q n by assumption, observe that on E: lim y p(y) = i=1 g2 i n1/2 q/2 n1/2 n/2 n/64 > 0. On the other hand, p(0) = n1/2 < 0. Since p(y) is continuous and strictly increasing, on E there exists a unique y (0, ) such that p(y ) = 0. Thus, 1{E}Zi 1{E}(Σ 1 + y Iq) 1 ii . Proposition C.10. Let M, A be n n positive definite matrices, and let α, β be positive numbers. Consider the function: τ + (A + βτI) 1, M . Suppose there exists a τ (0, ) satisfying f (τ) = 0. Then, f (τ) < 0. Proof A straightforward computation yields the following expressions for f (τ) and f (τ): f (τ) = ατ 2 β (A + βτI) 2, M , f (τ) = 2ατ 3 + 2β2 (A + βτI) 3, M . The assumed condition f (τ) = 0 implies that: ατ 2 = β (A + βτI) 2, M = 2ατ 3 = 2βτ 1 (A + βτI) 2, M . Next, let A = QΛQT be the eigendecomposition of A, with Λ = diag({λi}n i=1). For any integer k: (A + βτI) k, M = tr(MQ(Λ + βτI) k QT) = QMQT, (Λ + βτI) k = (QMQT)ii (λi + βτ)k . Now, since M is positive definite, (QMQT)ii > 0 for all i [n]. Furthermore, since A is positive definite, λi > 0 for all i [n].
Hence, plugging these expressions into the expression for f″(τ):

f″(τ) = −2β^2 Σ_{i=1}^n (Q^T M Q)_{ii}/((λ_i + βτ)^2 βτ) + 2β^2 Σ_{i=1}^n (Q^T M Q)_{ii}/((λ_i + βτ)^2 (λ_i + βτ))
 < −2β^2 Σ_{i=1}^n (Q^T M Q)_{ii}/((λ_i + βτ)^2 βτ) + 2β^2 Σ_{i=1}^n (Q^T M Q)_{ii}/((λ_i + βτ)^2 βτ) = 0,

since λ_i + βτ > βτ for every i ∈ [n].

C.5 Proof of Theorem 6.2

Theorem 6.2 (Need for growth assumptions in Ind-Seq-LS when m ≲ n). There exist universal constants c_0, c_1, and c_2 such that the following holds. Suppose that P_x = ⊗_{t≥1} N(0, 2^t I_n), n ≥ 6, mT ≥ n, and m ≤ c_0 n. Then:

R(m, T, T; {P_x}) ≥ c_1 σ_ξ^2 (p/T) 2^{c_2 n/m}.

Proof Let Γ_T := Γ_T(P_x). We have that Γ_T = (2/T)(2^T − 1) I_n, so that (2^T/T) I_n ⪯ Γ_T ⪯ (2^{T+1}/T) I_n. By Lemma 6.1:

R(m, T, T; {P_x}) ≥ σ_ξ^2 p E tr(Γ_T^{1/2} (X_{m,T}^T X_{m,T})^{−1} Γ_T^{1/2}) ≥ (σ_ξ^2 p/T) E tr((2^{−T/2} X_{m,T}^T X_{m,T} 2^{−T/2})^{−1}).

Since each column of X_{m,T} is independent, the matrix X_{m,T} 2^{−T/2} has the same distribution as BDiag(Θ^{1/2}, m) W, where Θ ∈ R^{T×T} is diagonal with Θ_{ii} = 2^{i−T} for i ∈ {1, . . . , T}, and W ∈ R^{mT×n} has iid N(0, 1) entries. Let λ_t = 2^{T−t} for t ∈ {1, . . . , T}. With this notation:

E tr((2^{−T/2} X_{m,T}^T X_{m,T} 2^{−T/2})^{−1}) = E tr((W^T BDiag(Θ, m) W)^{−1}).

Let {g_j}_{j=1}^m be independent isotropic Gaussian random vectors in R^T, and let h ∼ N(0, I_{n−1}) be independent from {g_j}. Define the random variables {Z_i}_{i=1}^T as:

Z_i := min_{β≥0} max_{τ≥0}  −‖h‖_2^2/(2τ^2) + (β‖h‖_2/τ) Σ_{j=1}^m Σ_{t=1}^T g_{j,t}^2/(λ_t + β‖h‖_2 τ) + 1/(λ_i + β‖h‖_2 τ).

By Lemma 7.1, E tr((W^T BDiag(Θ, m) W)^{−1}) ≥ n/(2m Σ_{i=1}^T E[Z_i]). Next, define:

n_1 := n/64,   p(y) := −1/2 + (1/n_1) Σ_{j=1}^m Σ_{t=1}^T y g_{j,t}^2/(λ_t + y).

Since n ≥ 6 and mT ≥ n, we can invoke Lemma C.9 to conclude there exists an event E_1 (over the randomness of {g_j} and h) such that: (a) on E_1, there exists a unique root y* ∈ (0, ∞) such that p(y*) = 0; (b) the following inequalities hold:

Z_i ≤ 1/λ_i,   1{E_1} Z_i ≤ 1{E_1} · 1/(λ_i + y*);   (C.13)

(c) the following estimate holds: P(E_1^c) ≤ e^{−n/128} + e^{−mT/16}.

Now, let c = 1/20, and assume that cn_1/m ≥ 4. We can check easily that cn_1/m ≤ T. Fix a δ ∈ (0, e^{−2}] to be chosen later. Define the integer T_c := ⌈cn_1/m⌉ ∈ {4, . . . , T}, and the events (over the randomness of {g_j} and h):

E_2^{g,T_c} := { Σ_{j=1}^m Σ_{t=1}^{T_c} g_{j,t}^2 ≤ 5m T_c },   E_2^{g,+} := { Σ_{j=1}^m g_{j,t}^2 ≤ 2m + 4 log(t^2 π^2/(6δ)) for all t = 1, . . . , T }.

By Lemma C.4, P((E_2^{g,T_c})^c) ≤ e^{−m T_c}.
Next, Gaussian concentration for Lipschitz functions (cf. Wainwright, 2019, Chapter 2) yields, for any η (0, 1): max t=1,...,T P j=1 g2 j,t m + p Hence by a union bound, and the fact that 6δ/π2 PT t=1 t 2 6δ/π2 P t=1 t 2 = δ, we have that P((Eg,+ 2 )c) δ. Putting E := E1 Eg,Tc 2 Eg,+ 2 , we have: P(Ec) e n/128 + e m T/16 + e m Tc + δ Tu, Frostig, and Soltanolkotabi e n/128 + e m T/16 + e cn1 + δ. (C.14) Next, noting that t/2 log2 log((t + 1)2π2/6) for all t 4: t=Tc 2 t log((t + 1)2π2/(6δ)) t=Tc 2 t+log2 log((t+1)2π2/6) + log(1/δ) t=Tc 2 t/2 + log(1/δ) t=Tc 2 t since Tc 4 2 1)(2 Tc/2 2 T/2) + 2 log(1/δ)(2 Tc 2 T ) (4 + 2 log(1/δ))2 Tc/2 4 log(1/δ)2 Tc/2 since δ (0, e 2). (C.15) λt + y g2 j,t since p(y ) = 0 t=1 g2 j,t + y m X t=Tc 2 tg2 j,t+1 t=1 g2 j,t + y T 1 X j=1 g2 j,t+1 5m Tc + y T 1 X t=Tc 2 t 2m + 4 log((t + 1)2π2/(6δ)) using E 5m Tc + 4my 2 Tc + 16y log(1/δ)2 Tc/2 using (C.15) 5m Tc + 18my log(1/δ)2 Tc/2 since δ (0, e 2). This inequality implies the following lower bound on y : y 2cn1/(2m) 18 log(1/δ) = 2cn1/(2m) 18 log(1/δ) 4m 5 i since c = 1/20 144 log(1/δ) n1 m since cn1/m 4 = n1/(8m) 5 We now bound, i=1 E[Zi] = i=1 [E[1{E}Zi] + E[1{Ec}Zi]] Learning from many trajectories E 1{E} 1 λi + y using (C.13) 1 λi + y + P(Ec) 1 λi since y y on E 1 2t + y + 2 e n/128 + e m T/16 + e cn1 + δ using (C.14) y + 2 2 Tc + 2 e n/128 + e m T/16 + e cn1 + δ 288c log(1/δ)2 cn1/(2m) + 2 2 cn1/m + 2 e n/128 + e m T/16 + e cn1 + δ . Since cn1/m 4, we can choose δ = e cn1/(2m) (0, e 2] and obtain: i=1 E[Zi] 144c2 n1 m 2 cn1/(2m) + 2 2 cn1/m + 2 e n/128 + e m T/16 + e cn1 + e cn1/(2m) . Since 1 cn1/(4m), m T n, and m 1, this inequality implies there exists universal positive constants c1, c2 such that: i=1 E[Zi] c1n R(m, T, T; {Px}) σ2 ξp T n 2m m c1n2c2n/m = σ2 ξp2c2n/m C.6 Block decoupling We now use a block decoupling argument to study lower bounds on the risk. The first step is the following result, which bounds the risk from below by a particular random gramian matrix. 
Lemma C.11. Let n = dr with both d, r positive integers. Define Ir := {1, 1 + r, . . . , 1 + (T 1)r}, and let EIr RT Tr denote the linear operator which extracts the coordinates in Ir, so that (EIrx)i = x1+(i 1)r for i = 1, . . . , T. Recall the following definitions from Equation (7.10): Ψr,T,T = BDiag(Γ 1/2 T (Jr), T)BToep(Jr, T) RTr Tr, Θr,T,T = EIrΨr,T,T ΨT r,T,T ET Ir RT T . Tu, Frostig, and Soltanolkotabi Then, for A = BDiag(Jr, d) we have: E m i=1PA x h tr Γ1/2 T (A)(XT m,T Xm,T ) 1Γ1/2 T (A) i E tr((W TBDiag(Θr,T,T , m)W) 1), where W Rm T d is a matrix with independent N(0, 1) entries. Proof We apply Proposition C.3 with: M = Xm,T Γ 1/2 T , I = {1, 1 + r, 1 + 2r, . . . , 1 + (d 1)r}, |I| = d. Note that the block diagonal structure of A yields the same block diagonal structure on ΓT and its inverse square root, specifically ΓT (A) = BDiag(ΓT (Jr), d) and Γ 1/2 T (A) = BDiag(Γ 1/2 T (Jr), d). Hence, it is not hard to see that the columns of MET I are not only independent, but also identically distributed. Furthermore, the distribution of each column obeys a multivariate Gaussian in Rm T . Hence, MET I is equal in distribution to Q1/2 m,T W, where W Rm T d is a matrix of iid Gaussians and Qm,T Symm T >0 is a positive definite covariance matrix to be determined. Furthermore, because MET I contains the vertical concatenation of m independent trajectories, Qm,T itself is block diagonal: Qm,T = BDiag(QT , m), QT Sym T >0. Let us now compute an expression for QT . Consider the dynamics: xr t+1 = Jrxr t + wr t , xr 0 = 0, wr t N(0, σ2 w Ir). It is not hard to see that, with wr 0:T 1 = (w0, . . . , w T 1) Rr T , Γ 1/2 T (Jr)xr 1 ... Γ 1/2 T (Jr)xr T = Ψr,T,T wr 0:T 1. From this, we see that every column of MET I is equal in distribution to EIrΨr,T,T wr 0:T 1, and therefore has distribution N(0, EIrΨr,T,T ΨT r,T,T ET Ir). Therefore: QT = EIrΨr,T,T ΨT r,T,T ET Ir = Θr,T,T . The claim now follows. 
C.7 Eigenvalue analysis of a tridiagonal matrix

For any T ∈ N_+, recall that L_T denotes the T × T lower triangular matrix with ones in the lower triangle, and Tri(a, b; T) denotes the symmetric T × T tridiagonal matrix with a on the diagonal and b on the lower and upper off-diagonals. In this section, we study the eigenvalues of (L_T L_T^T)^{−1}, which we denote by S_T:

S_T := (L_T L_T^T)^{−1} = Tri(2, −1; T) − e_T e_T^T.   (C.16)

Understanding the eigenvalues of this matrix will be necessary in the proof of Lemma C.15. The following result sharply characterizes the spectrum of S_T up to constant factors.

Lemma C.12. Suppose T ≥ 8. For all k = 1, . . . , T, we have that:

0.02 k^2/T^2 ≤ λ_{T−k+1}(S_T) ≤ π^2 k^2/T^2.

Proof We prove the upper bound in Proposition C.13, and the lower bound in Proposition C.14.

The next result gives the necessary upper bounds on the eigenvalues of S_T.

Proposition C.13. We have that:

λ_{T−k+1}(S_T) ≤ π^2 k^2/T^2,   k = 1, . . . , T.

Proof By (C.16), we immediately produce a semidefinite upper bound on S_T:

S_T = Tri(2, −1; T) − e_T e_T^T ⪯ Tri(2, −1; T).

Therefore by the Courant min-max theorem, followed by the closed-form expression for the eigenvalues of Tri(2, −1; T), we have:

λ_{T−k+1}(S_T) ≤ λ_{T−k+1}(Tri(2, −1; T)) = 2(1 − cos(kπ/(T+1))),   k = 1, . . . , T.

Next, we have the following elementary lower bounds for cos(x) on x ∈ [0, π]:

cos(x) ≥ 1 − x^2/2 if x ∈ [0, 2π/3],   cos(x) ≥ (x − π)^2/4 − 1 if x ∈ [2π/3, π].

Therefore, when k ∈ {1, . . . , ⌊2(T+1)/3⌋}, we immediately have that:

λ_{T−k+1}(S_T) ≤ k^2 π^2/(T+1)^2 ≤ π^2 k^2/T^2.

For the case when k ∈ {⌊2(T+1)/3⌋ + 1, . . . , T}, we use the second cosine lower bound:

λ_{T−k+1}(S_T) ≤ 4 − (π^2/2)(1 − k/(T+1))^2 ≤ 4 ≤ π^2 k^2/T^2,

where the final inequality holds since k ≥ 2(T+1)/3. The claim now follows by taking the maximum over the two cases.

We now move to the lower bound on λ_{T−k+1}(S_T). At this point, it would be tempting to use Weyl's inequalities, which imply that λ_i(S_T) ≥ λ_i(Tri(2, −1; T)) − 1. However, this bound is vacuous, since λ_T(Tri(2, −1; T)) ≲ 1/T^2.
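Both the identity (C.16) and the two-sided spectrum bounds of Lemma C.12 are concrete enough to verify numerically. The following check (an illustration, not part of the proof) constructs S_T directly and tests the bounds for several values of T ≥ 8:

```python
import numpy as np

# Check of (C.16) and the spectrum bounds of Lemma C.12:
# S_T = (L_T L_T^T)^{-1} equals Tri(2, -1; T) - e_T e_T^T, and its k-th
# smallest eigenvalue lies between 0.02*k^2/T^2 and pi^2*k^2/T^2 for T >= 8.
for T in (8, 16, 32, 64):
    L = np.tril(np.ones((T, T)))
    S = np.linalg.inv(L @ L.T)
    tri = 2 * np.eye(T) - np.eye(T, k=1) - np.eye(T, k=-1)
    tri[-1, -1] -= 1.0                      # subtract e_T e_T^T
    assert np.allclose(S, tri)
    eigs = np.sort(np.linalg.eigvalsh(S))   # increasing order
    for k in range(1, T + 1):
        lam = eigs[k - 1]                   # lambda_{T-k+1}(S_T), k-th smallest
        assert 0.02 * k**2 / T**2 <= lam <= np.pi**2 * k**2 / T**2
```

The identity follows since L_T^{-1} is bidiagonal (ones on the diagonal, −1 on the subdiagonal), so (L_T L_T^T)^{-1} = (L_T^{-1})^T L_T^{-1} has exactly the stated tridiagonal-minus-corner form.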
To get finer grained control, we need the eigenvalue interlacing result of Kulkarni et al. (1999). This is done in the following result:

Proposition C.14. Suppose that T ≥ 8. We have that:

λ_{T−k+1}(S_T) ≥ 0.02 k^2/T^2,   k = 1, . . . , T.

Proof The proof relies on the interlacing result from Kulkarni et al. (1999, Theorem 4.1). However, the interlacing result does not cover the minimum eigenvalue of S_T, so we first explicitly derive a lower bound for λ_min(S_T). To do this, we note that:

λ_min(S_T) = λ_min((L_T L_T^T)^{−1}) = 1/‖L_T‖_op^2.

Letting l_i ∈ R^T denote the i-th column of L_T, by the variational form of the operator norm followed by Cauchy-Schwarz:

‖L_T‖_op = max_{‖v‖_2=1} ‖L_T v‖_2 ≤ max_{‖v‖_2=1} Σ_{i=1}^T ‖l_i‖_2 |v_i| ≤ √(Σ_{i=1}^T ‖l_i‖_2^2) = √(T(T+1)/2).

Therefore:

λ_min(S_T) ≥ 2/(T(T+1)) ≥ 1/T^2.

Now we may proceed with the remaining eigenvalues. We can write S_T as the following block matrix, with e_{T−1} ∈ R^{T−1} denoting the (T−1)-th standard basis vector:

S_T = [ Tri(2, −1; T−1)   −e_{T−1} ;  −e_{T−1}^T   1 ].

This matrix is of the form studied in Kulkarni et al. (1999, Theorem 4.1); for what follows we will borrow their notation. Let U_T(x) denote the T-th degree Chebyshev polynomial of the 2nd kind. We know that the eigenvalues of S_T are given by λ = 2(1 − x), where x ranges over the roots of the polynomial p_T(x) defined as:

p_T(x) := (2x − 1) U_{T−1}(x) − U_{T−2}(x).   (C.17)

Therefore, letting ψ_1 ≤ · · · ≤ ψ_T denote the roots of (C.17) listed in increasing order, we have:

λ_i(S_T) = 2(1 − ψ_i),   i = 1, . . . , T.

Let η_1 < · · · < η_{T−2} denote the T−2 roots of U_{T−2}(x) listed in increasing order. Put η_0 := −∞ and η_{T−1} := +∞. Because the roots of U_{T−2}(x) are given by x = cos(kπ/(T−1)), k = 1, . . . , T−2, we have that:

η_i = cos((T−1−i)π/(T−1)),   i = 1, . . . , T−2.

Kulkarni et al. (1999, Theorem 4.1) states that there is exactly one root of p_T(x) in each of the intervals (η_j, η_{j+1}) for j ∈ {0, . . . , T−2} \ {i*}, with i* satisfying:

i* = 2(T−1)/3 if 2(T−1) mod 3 = 0,   i* = ⌊2(T−1)/3⌋ + 1 otherwise,

and furthermore (η_{i*}, η_{i*+1}) contains exactly two roots of p_T(x).
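The characteristic-polynomial description above can be tested numerically. The snippet below assumes the sign conventions reconstructed here, namely p_T(x) = (2x − 1)U_{T−1}(x) − U_{T−2}(x) = U_T(x) − U_{T−1}(x) (the extracted text is ambiguous on this point), and compares the values 2(1 − x) over the roots of p_T against the eigenvalues of S_T computed directly:

```python
import numpy as np
from numpy.polynomial import Polynomial as P

# Numerical check, ASSUMING the reconstructed polynomial
#   p_T(x) = (2x - 1) U_{T-1}(x) - U_{T-2}(x),
# with U_k the degree-k Chebyshev polynomial of the 2nd kind.  The
# eigenvalues of S_T = (L_T L_T^T)^{-1} should be exactly 2(1 - x)
# as x ranges over the roots of p_T.
T = 12
x = P([0.0, 1.0])
U = [P([1.0]), 2.0 * x]                 # U_0, U_1
for _ in range(2, T + 1):
    U.append(2.0 * x * U[-1] - U[-2])   # recurrence U_k = 2x U_{k-1} - U_{k-2}
pT = (2.0 * x - 1.0) * U[T - 1] - U[T - 2]
lam_from_roots = np.sort(2.0 * (1.0 - pT.roots().real))

L = np.tril(np.ones((T, T)))
lam_direct = np.sort(np.linalg.eigvalsh(np.linalg.inv(L @ L.T)))
assert np.allclose(lam_from_roots, lam_direct, atol=1e-8)
```

Under this reading, p_T = U_T − U_{T−1} has roots cos((2k−1)π/(2T+1)), which yields the closed form λ = 2 − 2cos((2k−1)π/(2T+1)) for the eigenvalues of S_T.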
Therefore, for i {i + 3, . . . , T 1}: ψi ηi 1 = λi(ST ) 2(1 ηi 1) = 2 1 cos (T i)π For i {i + 3, . . . , T 1}, we have: T i T 1 T i 3 T 1 = T (2(T 1) 3 1) 3 T 1 = 1 It is elementary to check that: 2(1 cos(x)) x2 2 x [0, π/3]. Therefore for i {i + 3, . . . , T 1}, Furthermore, for i {1, . . . , i + 2}, ψi ηi +1 implies that λi(ST ) 2(1 ηi +1) = 2 1 cos (T 1 i 1)π 2(1 cos(π/21)). The last inequality holds by: cos (T 1 i 1)π cos (T 1) (2(T 1)/3 + 2) T 1 π since i 2(T 1) cos(π/21) since T 8. Summarizing, we have shown that: λT k+1(ST ) 1 T 2 if k = 1, 2 k 1 T 1 2 if k {2, . . . , T i 2}, 2(1 cos(π/21)) if k {T i 1, . . . , T}. Tu, Frostig, and Soltanolkotabi T 1 k 2T when k 2, and since 2(1 cos(π/21)) 2(1 cos(π/21)) k2 T 2 trivially, we have shown the desired conclusion: λT k+1(ST ) min 1, π2 8 , 2(1 cos(π/21)) k2 T 2 0.02 k2 T 2 , k = 1, . . . , T. C.8 A risk lower bound in the few trajectories regime Lemma C.15. There exist universal positive constants c0, c1, c2, and c3 such that the following is true. Suppose A Rn n is any set containing In. Let T c0, n c1, m T n, and m c2n. We have that: R(m, T, T ; {PA x | A A}) c3σ2 ξp n2 Proof Let {gj}m j=1 be independent N(0, IT ) random vectors, and let h N(0, In 1) be independent from {gj}. Let {λt}T t=1 denote the eigenvalues of Θ 1 1,T,T listed in decreasing order. Define the random variables {Zi}T i=1 as: Zi := min β 0 max τ 0 2τ + β2 m X g2 j,t λt + β h 2τ + (Θ 1 1,T,T + β h 2τIT ) 1 ii We now lower bound the minimax risk as follows: R(m, T, T ; {PIn x }) σ2 ξp E m i=1PIn x h tr ΓT (In)1/2(XT m,T Xm,T ) 1ΓT (In)1/2 i by Lemma 6.1 σ2 ξp E tr((W TBDiag(Θ1,T,T , m)W) 1) by Lemma C.11 = σ2 ξp T + 1 T + 1 E tr((W TBDiag(Θ1,T,T , m)W) 1) using (7.11) 2T E tr((W TBDiag(Θ1,T,T , m)W) 1) by Lemma 7.1. 
(C.19) Next, define: 64, p(y) := y λt + yg2 j,t n1 Assuming that c1 6 so that n 6 and m T n, we can invoke Lemma C.9 to conclude there exists an event E1 (over the probability of {gj} and h) such that: Learning from many trajectories (a) on E1, there exists a unique root y (0, ) such that p(y ) = 0, (b) the following inequalities holds: Zi (Θ1,T,T )ii, 1{E1}Zi 1{E1} 1 λi + y , (C.20) (c) the following estimate holds: P(Ec 1) e n/128 + e m T/16. The remainder of the proof is to estimate a lower bound on y . Towards this goal, we define an auxiliary function: p(y) := E[p1(y)] = m y λt + y n1 Let y be the unique solution to p(y) = 0. A unique root exists because p(0) < 0, limy p(y) = m T n1/2 n n/64 > 0, and p is continuous and strictly increasing. We derive a lower bound on y through a lower bound on y . For any fixed α > 0, the function x 7 x α+x is monotonically increasing and concave on R>0. Therefore, the function p(y) is monotonically increasing and concave on R>0. By Proposition C.2, the root of the linear approximation to p(y) at y is a lower bound to y : 1{E1}y 1{E1} y p( y ) Equation (C.21) is a crucial step for the proof, because it turns analyzing y , which is the root of a random function, into analyzing the pointwise evaluation of a random function on a deterministic quantity. To lower bound the RHS, we need a upper bound on p( y ) and lower bounds on both y and p ( y ). Upper and lower bounds on y . We first derive a crude upper bound by Jensen s inequality. Observe that p( y ) = 0 implies that: λt λt + y . The function x 7 x/(x + y ) is concave on R>0. Let λ := 1 T PT t=1 λt. Jensen s inequality states that T λ λ+ y PT t=1 λt λt+ y . Therefore: 1 n1 2m T λ λ + y = y λ n1 2m T 1 1 n1/(2m T). Recalling the definition of ST from (C.16), we can immediately bound T tr(Θ 1 1,T,T ) = 1 tr(ST ) 2T. Tu, Frostig, and Soltanolkotabi Therefore, since m T n, m 1 1 n1/(2m T) 2n1 Now for the lower bound on y . 
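The tangent-line device behind (C.21) — for an increasing concave function, the tangent at any point majorizes the function, so the tangent's root lower bounds the function's root — can be illustrated on a toy instance (all numbers below are illustrative stand-ins, not the quantities of the proof):

```python
# Illustration of the tangent-line step (C.21): for an increasing concave p,
# the tangent line at any y0 lies above p, so its root y0 - p(y0)/p'(y0)
# is a lower bound on the true root of p.  The toy p below mimics the shape
# y -> (m/n1) * sum_t y/(lambda_t + y) - 1/2 used in the proof.
lam = [(t * t) / 16.0 for t in range(1, 17)]

def p(y):
    return sum(y / (l + y) for l in lam) / 8.0 - 0.5

def dp(y):
    return sum(l / (l + y) ** 2 for l in lam) / 8.0

lo, hi = 0.0, 1.0
while p(hi) < 0:
    hi *= 2.0
for _ in range(100):                    # bisect for the true root
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if p(mid) < 0 else (lo, mid)
root = 0.5 * (lo + hi)
for y0 in (0.5, 1.0, 2.0, 5.0, 10.0):   # tangent roots never exceed the true root
    tangent_root = y0 - p(y0) / dp(y0)
    assert tangent_root <= root + 1e-9
```

The value of this step in the proof is exactly as stated above: it converts control of the root of a random function into control of a pointwise evaluation at a deterministic point.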
Noting that λT k+1 = λT k+1(Θ 1 1,T,T ) = T+1 2 λT k+1(ST ), Corollary C.12 implies (assuming that c0 8 so T 8) that T λT k+1 π2 k2 T , k = 1, . . . , T. (C.22) Therefore, p( y ) = 0 implies that: 1 λt + y 2m 1 0.01t2/T + y 1 0.01x2/T + y dx = 20m T n1 y tan 1 Solving for y yields: y 1 100π2 n2 1 m2T . Next, we use this lower bound on y to bootstrap our upper bound y 2n1/m into something stronger. Using the upper bounds on λt from (C.22), 1 λt + y 2m 1 π2t2/T + y 2m 1 π2x2/T + y dx tan 1 (T + 1)π T y tan 1 π T y The function tan 1(x) is increasing. Using the y 2n1/m upper bound and the assumption that m T n, (T + 1)π T y π 32 = tan 1 (T + 1)π T y On the other handing, using the bound y 1 100π2 n2 1 m2T and the assumption that m π T y 10π m 32/2 = tan 1 π T y Combining these inequalities: 32) tan 1(π 32/2) i 2 0.05 T n1 y = y 791π2 n2 1 m2T . Learning from many trajectories Therefore we have the following upper and lower bounds on y : 1 100π2 n2 1 m2T y min 791π2 n2 1 m2T , 2n1 For the remainder of the proof, in order to avoid precisely tracking constants, we let c0, c1, c2, c3 be any positive universal constants such that: T λT k+1 c1 k2 T , k = 1, . . . , T, (C.24) c2 n2 1 m2T y c3 n2 1 m2T . (C.25) Equations (C.22) and (C.23) give one valid setting of these constants. Upper bound on p( y ). To upper bound p( y ), we note that: λt + y g2 j,t n1 λt + y (g2 i,j 1) + λt + y (g2 i,j 1) since p( y ) = 0. Therefore, by Lemma C.4, 2 + 2t max t=1,...,T y e t t > 0. (C.26) We upper bound: 2 using (C.24) = m( y )2T 2c0T y + 2( y )2 + T y 2 c0 tan 1 s 2c0 + π T y n2 1 m T + π m using (C.25) = c3 128c0 + π n1 since m T n and m 1 Tu, Frostig, and Soltanolkotabi =: c4n1. (C.27) Next, we immediately have: max t=1,...,T y λt + y 1. (C.28) Thus, combining (C.26), (C.27), and (C.28), we have: P p( y ) > 2 tu c4n1 + 2tu e tu tu > 0. (C.29) Lower bound on p ( y ). Differentiating p(y) yields: λt (λt + y)2 g2 j,t. 
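The sum-versus-integral comparisons and arctan evaluations used throughout this argument rest on two elementary facts, which the following snippet spot-checks (the constants a, y, T are illustrative; a plays the role of the c/T scaling of the λ_t):

```python
import math

# Two facts used repeatedly above, with lambda(x) = a*x^2:
#  (1) the antiderivative identity
#        int_0^A dx/(a*x^2 + y) = (1/sqrt(a*y)) * atan(A * sqrt(a/y)),
#  (2) sum_{t=1}^T 1/(a*t^2 + y) <= int_0^T dx/(a*x^2 + y),
#      which holds because the integrand is decreasing in x.
a, y, T = 0.03, 0.7, 50

closed = math.atan(T * math.sqrt(a / y)) / math.sqrt(a * y)
N = 200000                               # midpoint-rule numerical integral
h = T / N
numeric = sum(h / (a * ((i + 0.5) * h) ** 2 + y) for i in range(N))
assert abs(closed - numeric) < 1e-6

riemann = sum(1.0 / (a * t * t + y) for t in range(1, T + 1))
assert riemann <= closed
```

Fact (2) is the standard right-endpoint bound: since x ↦ 1/(ax² + y) is decreasing, each summand 1/(at² + y) is at most the integral of the function over [t−1, t].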
Applying Lemma C.4 yields, p ( y ) < m λt (λt + y )2 2 λ2 t (λt + y )4 e t t > 0. (C.30) Our first goal is to lower bound m PT t=1 λt (λt+ y )2 . The function x 7 x/(x + y )2 is increasing when x [0, y ] and decreasing when x ( y , ). Let t {0, . . . , T} be such that c1t2/T y for t {1, . . . , t } and c1t2/T > y for t {t + 1, . . . , T} (t = 0 if c1/T > y ). We write: λt (λt + y )2 c0 c1t2/T (c1t2/T + y )2 using (C.24) c1t2/T (c1t2/T + y )2 + c1t2/T (c1t2/T + y )2 c1x2/T (c1x2/T + y )2 dx + Z T+1 c1x2/T (c1x2/T + y )2 dx c1x2/T (c1x2/T + y )2 dx Z t +1 t c1x2/T (c1x2/T + y )2 dx The function z 7 z (z+ y )2 is upper bounded by 1 4 y . Therefore, t c1x2/T (c1x2/T + y )2 dx 1 4 y 1 4c2 c1x2/T (c1x2/T + y )2 dx = c1T 2c3/2 1 T y tan 1 (T + 1) c1 T y T + 1 2c2 1(T + 1)2 + 2c1T y Learning from many trajectories 2c3/2 1 c3n1 tan 1 (T + 1) c1 T y 2c3/2 1 c3n1 tan 1 64 rc1 The last inequality holds because: (T + 1) c1 T y (T + 1) rc1 Above, the first inequality holds using (C.25) and the last inequality holds since m T n. Therefore, assuming that m T 2 q c3 c1 1 tan 1(64 c1x2/T (c1x2/T + y )2 dx tan 1(64 p c1/c3) 4 c1c3 Combining these inequalities, assuming that m c2 2 c1c3 tan 1(64 p c1/c3)n1, we have: λt (λt + y )2 c0 " tan 1(64 p c1/c3) 4 c1c3 c0 tan 1(64 p 8c3/2 1 c3 =: c5 m2T Next, we turn to upper bounding m PT t=1 λ2 t (λt+ y )4 . Again the function x 7 x2/(x+ y )4 is increasing when x [0, y ] and decreasing when x ( y , ), and therefore x2/(x+ y )4 1 16( y )2 for all x 0. Let t {0, . . . , T} be such that c0t2/T y for t {1, . . . , t } and c0t2/T > y for t {t + 1, . . . , T}. In the case when c0/T > y , we set t = 0. 
We have: λ2 t (λt + y )4 c2 1 c2 0 m (c0t2/T + y )4 using (C.24) = c2 1 c2 0 m t 1 X (c0t2/T + y )4 + (c0t2/T + y )4 + (c0(t )2/T)2 (c0(t )2/T + y )4 + (c0(t + 1)2/T)2 (c0(t + 1)2/T + y )4 c2 1 c2 0 m Z t (c0x2/T + y )4 dx + Z T (c0x2/T + y )4 dx + (c0(t )2/T)2 (c0(t )2/T + y )4 + (c0(t + 1)2/T)2 (c0(t + 1)2/T + y )4 c2 1 c2 0 m Z T (c0x2/T + y )4 dx + (c0(t )2/T)2 (c0(t )2/T + y )4 + (c0(t + 1)2/T)2 (c0(t + 1)2/T + y )4 Tu, Frostig, and Soltanolkotabi c2 1 c2 0 m Z T (c0x2/T + y )4 dx + 1 8( y )2 since max x>0 x2 (x + y )4 1 16( y )2 c2 1 c2 0 m Z T (c0x2/T + y )4 dx + 1 We now bound: (c0x2/T + y )4 dx = c2 0T 2 (3c0T + y )(c0T 3 y ) 48c2 0T y (c0T + y )3 + tan 1 q 16c5/2 0 T 3/2( y )3/2 c2 0T 2 " 1 16c0 y (c0T + y )2 + π 32c5/2 0 T 3/2( y )3/2 c2 0T 2 " 1 16c3 0 y T 2 + π 32c5/2 0 T 3/2( y )3/2 c2 0T 2 " 1 16c3 0c2 32c5/2 0 c3/2 2 using (C.25) " 1 1024c0c2 + π 32c1/2 0 c3/2 2 n3 1 since m T n. Combining these inequalities, assuming that m n1: λ2 t (λt + y )4 c2 1 c2 0 "" 1 1024c0c2 + π 32c1/2 0 c3/2 2 " 1 1024c0c2 + π 32c1/2 0 c3/2 2 + 1 n3 1 since m n1 =: c6 m4T 2 n3 1 . (C.32) Combining (C.30), (C.31), and (C.32) yields p ( y ) < c5 m2T n1 2 tℓ c6 m2T e tℓ tℓ> 0. (C.33) Lower bounds on y . We now combine (C.29) with (C.33) to established a lower bound on y . Equations (C.21) and (C.25) imply that: p ( y ) c2n2 1 m2T p( y ) We first set tℓ= c2 5 16c6 n1, so that by (C.33), P p ( y ) < c5 e c2 5 16c6 n1. Learning from many trajectories We next set tu = βn1 for a β > 0 to be specified. By (C.29), P p( y ) > 2( p c4β + β)n1 e βn1. Let E2 denote the event: E2 := p ( y ) c5 n p( y ) 2( p c4β + β)n1 o . By a union bound, P(Ec 2) e c2 5 16c6 n1 + e βn1. Furthermore, 1{E2} c2n2 1 m2T p( y ) 1{E2} c2 4( c4β + β) Setting β = c7 := min n c2c5 16 , c2 2c2 5 162c4 o , we have that c2 4( c4β+β) c5 c2/2, and therefore from (C.21), 1{E1}y 1{E1 E2} c2n2 1 m2T p( y ) 2 n2 1 m2T . (C.34) Finishing the proof. Define E := E1 E2 and define y := c2 2 n2 1 m2T . 
By a union bound, P(E) e n/128 + e m T/16 + e c2 5 16c6 n1 + e c7n1 e n/128 + e n/16 + e c2 5 1024c6 n + e c7 64 n since m T n 4 exp min 1 16, c2 5 1024c6 , c7 n =: 4e c8n. (C.35) From (C.20), since y y on E by (C.34), 1{E}Zi 1{E} 1 λi + y 1 λi + y . (C.36) Next, by Proposition B.1, if n 2 max{1, c 1 8 } log(4 max{1, c 1 8 }), then we have n c 1 8 log n ne c8n 1. We now bound, i=1 E[Zi] = i=1 [E[1{E}Zi] + E[1{Ec}Zi]] 1 λt + y + P(Ec)(Θ1,T,T )tt using (C.36) and Zi (Θ1,T,T )ii 1 λt + y + P(Ec)T since tr(Θ1,T,T ) = T Tu, Frostig, and Soltanolkotabi 1 c0t2/T + y + 4Te c8n using (C.24) and (C.35) 1 c0x2/T + y dx + 4Te c8n T c0y + 4Te c8n = n1 + 4Te c8n 2π 2 c0c2 + 1 n1 =: c8 m T n1 since ne c8n 1 and m 1. Plugging this upper bound into (C.19): R(m, T, T ; {PIn x }) σ2 ξp T n1 m T = 1 256c8 σ2 ξ pn2 The claim now follows. C.9 Proof of Theorem 6.3 Theorem 6.3 (Risk lower bound). There are universal positive constants c0, c1, and c2 such that the following holds. Recall that PIn x (resp. P0n n x ) denotes the covariate distribution for a linear dynamical system with A = In and B = In (resp. A = 0n n and B = In). If T c0, n c1, and m T n, then: R(m, T, T ; {P0n n x , PIn x }) c2σ2 ξ pn m T max n T Proof Let Px := {P0n n x , PIn x }. We let c 0, c 1, c 2, and c 3 denote the universal positive constants in the statement of Lemma C.15. We first invoke Lemma C.7 to conclude that: R(m, T, T ; Px) σ2 ξ 2 pn T , 1 . (C.37) The proof now proceeds in three cases: Case n T /(m T) 1. In this case, we trivially have max n T T , 1 o = max n n T m T , T T , 1 o . Therefore, (C.37) yields: R(m, T, T ; Px) σ2 ξ 2 pn m T max n T Case n T /(m T) > 1 and m c 2n. In this case, we can invoke Lemma C.15 to conclude that: R(m, T, T ; Px) c 3σ2 ξ pn m T = c 3σ2 ξ pn m T max n T m T , 1 . (C.38) Learning from many trajectories Since n/m 1/c 2, we have that n T /(m T) T /(c 2T). 
Therefore:

max{nT′/(mT), 1} ≥ max{T′/(c′_2 T), nT′/(mT), 1} ≥ min{1, 1/c′_2} max{nT′/(mT), T′/T, 1}.

Hence, from (C.38),

R(m, T, T′; P_x) ≥ min{c′_3, c′_3/c′_2} σ_ξ^2 (pn/(mT)) max{nT′/(mT), T′/T, 1}.

Case nT′/(mT) > 1 and m > c′_2 n. In this case, we have T′/T > c′_2 · nT′/(mT). Therefore:

max{T′/T, 1} ≥ max{c′_2 · nT′/(mT), 1},

and hence max{T′/T, 1} ≥ min{1, c′_2} max{nT′/(mT), T′/T, 1}. Hence, from (C.37),

R(m, T, T′; P_x) ≥ min{1/2, c′_2/2} σ_ξ^2 (pn/(mT)) max{nT′/(mT), T′/T, 1}.

The claim now follows by taking c_0 = c′_0, c_1 = c′_1, and c_2 = min{1/2, c′_3, c′_3/c′_2, c′_2/2}.
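The three-way case analysis concluding this proof is pure arithmetic, so it can be checked exhaustively on a grid. In the sketch below, a and b stand for nT′/(mT) and T′/T, and c2p is an arbitrary stand-in value for the constant c′_2 from Lemma C.15:

```python
import random

# Brute-force check of the case analysis in the proof of Theorem 6.3, with
# a = n*Tp/(m*T) and b = Tp/T (Tp standing in for T').  In each case the
# available bound recovers max(a, b, 1) up to a constant depending on c2p.
random.seed(1)
c2p = 0.3   # stand-in for the constant c2' of Lemma C.15 (value arbitrary)
for _ in range(10000):
    n = random.randint(1, 50)
    m = random.randint(1, 50)
    T = random.randint(1, 50)
    Tp = random.randint(1, 200)
    a, b = n * Tp / (m * T), Tp / T
    full = max(a, b, 1.0)
    if a <= 1:
        # case 1: max(b, 1) already equals the full maximum
        assert max(b, 1.0) == full
    elif m <= c2p * n:
        # case 2: n/m >= 1/c2' implies a >= b/c2', so max(a, 1) dominates
        assert max(a, 1.0) >= min(1.0, 1.0 / c2p) * full
    else:
        # case 3: m > c2'*n implies b > c2'*a, so max(b, 1) dominates
        assert max(b, 1.0) >= min(1.0, c2p) * full
```

Each branch mirrors the corresponding case of the proof: the bound actually available in that case (from (C.37) or from Lemma C.15) dominates the full maximum after paying a constant factor depending only on c′_2.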